LangChain's Cost of Abstraction

Posted February 29, 2024

After hand-rolling the same solution to the same problem multiple times, I was drawn to LangChain’s¹ output parsers. Then I looked back at how many hoops I had jumped through to get everything working and realized it wasn’t worth it.

TL;DR

You can already guess how this will go, so let me say up front: I want to like LangChain, and I still use it, just not for anything mission-critical.

And that’s the problem. If I wouldn’t use LangChain for anything mission-critical, what am I using it for?

There’s a longer conceptual gripe here. Suffice it to say that in the shaky, volatile landscape of AI application development in 2024, investing in an abstraction like LangChain offers no promise of forward compatibility, causes minor headaches now, and is wholly unproven in terms of the long-term maintainability of your source code.

Okay good. Glad I got that off my chest. Let’s get boring again and talk about how parsers work.

Parsing LLM responses

For a short synopsis, LangChain’s Output Parsers ingest an LLM’s generated response as plain text and convert it into something more usable.

But to get the plain-text response to be parseable, an LLM has to be instructed on how to format its response. In LangChain, output parsers also generate text fragments capable of instructing the LLM to do just that.

What problem does this solve?

This pattern of instructing an LLM how to structure a response is a basic pattern in designing prompts, and I’ve relied on simple templating strategies to do this. The promise of maintaining less boilerplate code that specifies an output format seemed enticing. But like other features of LangChain, it comes with significant caveats.²
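
As a rough illustration of what I mean by a simple templating strategy, the sketch below keeps the format instruction in one constant and substitutes it into a prompt template. The names FORMAT_INSTRUCTIONS, PROMPT_TEMPLATE, and build_prompt are hypothetical and not part of LangChain or any other library.

```python
from string import Template

# Hypothetical instruction fragment, maintained by hand in one place.
FORMAT_INSTRUCTIONS = (
    'Respond only with a JSON object shaped like {"name": "...", "calories": 0}.'
)

# Hypothetical prompt template with a slot for the format instruction.
PROMPT_TEMPLATE = Template(
    "You are a nutrition assistant.\n"
    "$format_instructions\n"
    "Describe this food item: $item"
)


def build_prompt(item: str) -> str:
    """Fill the template with the shared format instruction and the user's item."""
    return PROMPT_TEMPLATE.substitute(
        format_instructions=FORMAT_INSTRUCTIONS,
        item=item,
    )


print(build_prompt("apple"))
```

This is the boilerplate an output parser's format instructions would replace: a string you maintain yourself and reuse wherever the same response format is needed.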

LangChain and Output Parsers

I’ll look at the behavior of a few output parsers below and show the instructions they generate.

JSON Output Parser

First, let’s start simply with JSON.

from langchain_core.output_parsers import JsonOutputParser
import tiktoken as tk

# only used to count prompt tokens (below)
enc = tk.get_encoding("cl100k_base")

# create a parser & output its format instructions
parser = JsonOutputParser()
instructions = parser.get_format_instructions()

print(instructions)
print(f"[tokens: {len(enc.encode(instructions))}]\n")

This generates the following output and would insert 5 tokens into the prompt.

Return a JSON object.
[tokens: 5]

Using the above, we would inject the phrase ‘Return a JSON object.’ into the body of the prompt, which instructs GPT to, well, return JSON. At the tail end of your chain, the output parser does the same thing any JSON-handling code would do, including raising an error if the output is invalid JSON.
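
For comparison, here is roughly what that JSON-handling code looks like with only the standard library. The helper name parse_json_response is hypothetical, and the fence-stripping step is my own assumption about models sometimes wrapping JSON in a Markdown code fence. It is not a description of LangChain's internals.

```python
import json

# Three backticks, built at runtime so this example stays easy to embed.
FENCE = "`" * 3


def parse_json_response(text: str) -> dict:
    """Parse an LLM response as a JSON object, raising ValueError if it is not one."""
    cleaned = text.strip()
    # Assumption: the model may wrap its JSON in a Markdown code fence.
    if cleaned.startswith(FENCE):
        lines = cleaned.splitlines()[1:]  # drop the opening fence line
        if lines and lines[-1].strip() == FENCE:
            lines = lines[:-1]  # drop the closing fence line
        cleaned = "\n".join(lines)
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("response is valid JSON but not an object")
    return data


print(parse_json_response('{"name": "apple", "calories": 95}'))
```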

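With JSON mode available in the OpenAI API, you don’t need a parser to generate this instruction: once it is enabled, GPT is constrained to return syntactically valid JSON. Note that OpenAI still requires the word “JSON” to appear somewhere in your messages when JSON mode is on, so a one-line instruction stays in the prompt either way.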

CSV Output Parser

Notice that the interface used below is the same as above: get_format_instructions on any parser returns the instructions you inject into a system prompt to produce the required response format. This is the benefit of the abstractions in LangChain.

from langchain_core.output_parsers import CommaSeparatedListOutputParser

# creating a CSV parser
parser = CommaSeparatedListOutputParser()
instructions = parser.get_format_instructions()

print(instructions)
print(f"[tokens: {len(enc.encode(instructions))}]\n")

Here’s the output of the CSV parser. It inserts an extra 20 tokens into the prompt. Not too bad.

Your response should be a list of comma separated values, eg: `foo, bar, baz`

[tokens: 20]

Again, it’s straightforward and nothing you couldn’t simply inject in your prompt yourself, although it saves you from having to repeat this instruction throughout an application if you always want the LLM to generate CSV data.
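
A hand-rolled equivalent is small. The sketch below reuses the instruction text printed above and splits the response on commas. CSV_INSTRUCTIONS and parse_comma_separated are hypothetical names, and the splitting logic is my own simplification (it does not handle quoted values that contain commas), not LangChain's implementation.

```python
# Instruction text copied from the parser output shown above.
CSV_INSTRUCTIONS = (
    "Your response should be a list of comma separated values, "
    "eg: `foo, bar, baz`"
)


def parse_comma_separated(text: str) -> list[str]:
    """Split a response on commas, trimming whitespace and dropping empty items."""
    return [item.strip() for item in text.split(",") if item.strip()]


print(parse_comma_separated("Chicago, New York,  Tokyo"))
```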

Parsing DataFrames

Let’s look at a slightly more involved example - parsing a response as a DataFrame:

import pandas as pd
from langchain.output_parsers.pandas_dataframe import PandasDataFrameOutputParser

# a dataframe in the format we care about (values are not important)
df = pd.DataFrame(
    {
        "cities": ["Chicago", "New York", "Amsterdam", "Tokyo", "Los Angeles"],
        "ranking": [1, 2, 3, 4, 5],
        "scale": [0.342, 0.123, 0.0, 0.18, 0.54],
    }
)

# creates a parser using the dataframe above
parser = PandasDataFrameOutputParser(dataframe=df)
instructions = parser.get_format_instructions()

print(instructions)
print(f"[tokens: {len(enc.encode(instructions))}]\n")

This is starting to feel a bit weird. Yes, I suppose it is neat to have the production of this boilerplate be someone else’s problem, but as you see below, the amount of text required to get the LLM to return a response in the given format gets out of hand.

The output should be formatted as a string as the operation, followed by a colon, followed by the column or row to be queried on, followed by optional array parameters.
1. The column names are limited to the possible columns below.
2. Arrays must either be a comma-separated list of numbers formatted as [1,3,5], or it must be in range of numbers formatted as [0..4].
3. Remember that arrays are optional and not necessarily required.
4. If the column is not in the possible columns or the operation is not a valid Pandas DataFrame operation, return why it is invalid as a sentence starting with either "Invalid column" or "Invalid operation".

As an example, for the formats:
1. String "column:num_legs" is a well-formatted instance which gets the column num_legs, where num_legs is a possible column.
2. String "row:1" is a well-formatted instance which gets row 1.
3. String "column:num_legs[1,2]" is a well-formatted instance which gets the column num_legs for rows 1 and 2, where num_legs is a possible column.
4. String "row:1[num_legs]" is a well-formatted instance which gets row 1, but for just column num_legs, where num_legs is a possible column.
5. String "mean:num_legs[1..3]" is a well-formatted instance which takes the mean of num_legs from rows 1 to 3, where num_legs is a possible column and mean is a valid Pandas DataFrame operation.
6. String "do_something:num_legs" is a badly-formatted instance, where do_something is not a valid Pandas DataFrame operation.
7. String "mean:invalid_col" is a badly-formatted instance, where invalid_col is not a possible column.

Here are the possible columns:
```
cities, ranking, scale
```

[tokens: 406]

Parsing Into a Data Model

Finally, here’s the Pydantic parser with a simple data model and the format instructions it generates using that data model.

Rather than inheriting from pydantic.BaseModel, your models must inherit from langchain_core.pydantic_v1.BaseModel. It’s an unfortunate but necessary step that lets LangChain inspect the model and generate instructions from the metadata you must define on each Field for everything to work.

Extending LangChain’s version of the Pydantic base classes may increase the number of places in your code that must import LangChain as a dependency, which can make reusing these data models tricky. I don’t like this kind of substitution by vendors. It’s brittle.

from enum import Enum
from langchain.output_parsers import PydanticOutputParser
from langchain_core.pydantic_v1 import BaseModel
from langchain_core.pydantic_v1 import Field

class FoodCategory(Enum):
    ingredient = "ingredient"
    food = "food"
    meal = "meal"
    beverage = "beverage"
    unknown = "unknown"


class FoodItemModel(BaseModel):
    name: str = Field(description="singular form of the item's name")
    description: str = Field(description="a short description of the food item")
    serving_size: float = Field(description="the serving size of this food item")
    category: FoodCategory = Field(description="the category of food. If the category can not be determined, return 'unknown'")
    per_serving_min: float = Field(description="the minimum number of grams per serving")
    per_serving_max: float = Field(description="the maximum number of grams per serving")
    calories: float = Field(description="the average calories in this item")
    carbohydrates: float = Field(description="the average grams of carbohydrates in this item")
    protein: float = Field(description="the average grams of protein in this item")


# creates a parser using the data model above
parser = PydanticOutputParser(pydantic_object=FoodItemModel)
instructions = parser.get_format_instructions()

print(instructions)
print(f"[tokens: {len(enc.encode(instructions))}]\n")

And here’s the output generated by the Pydantic parser for this simple model (I reformatted the ‘output schema’ to show its structure - it’s originally contained in one line):

The output should be formatted as a JSON instance that conforms to the JSON schema below.

As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array", "items": {"type": "string"}}}, "required": ["foo"]}
the object {"foo": ["bar", "baz"]} is a well-formatted instance of the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.

Here is the output schema:
```
{
    "properties": {
        "name": {
            "title": "Name",
            "description": "singular form of the item's name",
            "type": "string"
        },
        "description": {
            "title": "Description",
            "description": "a short description of the food item",
            "type": "string"
        },
        "serving_size": {
            "title": "Serving Size",
            "description": "the serving size of this food item",
            "type": "number"
        },
        "category": {
            "description": "the category of food. If the category can not be determined, return 'unknown'",
            "allOf": [
                {
                    "$ref": "#/definitions/FoodCategory"
                }
            ]
        },
        "per_serving_min": {
            "title": "Per Serving Min",
            "description": "the minimum number of grams per serving",
            "type": "number"
        },
        "per_serving_max": {
            "title": "Per Serving Max",
            "description": "the maximum number of grams per serving",
            "type": "number"
        },
        "calories": {
            "title": "Calories",
            "description": "the average calories in this item",
            "type": "number"
        },
        "carbohydrates": {
            "title": "Carbohydrates",
            "description": "the average grams of carbohydrates in this item",
            "type": "number"
        },
        "protein": {
            "title": "Protein",
            "description": "the average grams of protein in this item",
            "type": "number"
        }
    },
    "required": [
        "name",
        "description",
        "serving_size",
        "category",
        "per_serving_min",
        "per_serving_max",
        "calories",
        "carbohydrates",
        "protein"
    ],
    "definitions": {
        "FoodCategory": {
            "title": "FoodCategory",
            "description": "An enumeration.",
            "enum": [
                "ingredient",
                "food",
                "meal",
                "beverage",
                "unknown"
            ]
        }
    }
}
```
[tokens: 483]

That seems like way too much for the problem at hand. In my experience with the Pydantic parser, several responses failed to parse, only to succeed upon subsequent invocations of the same payload. So, I’m unsure how far this template generation gets you in the end.

What if it breaks?

I can only assume a significant amount of testing went into the ability of these format instructions to work. But they don’t always work.

Presumably this is why the OutputFixingParser and RetryOutputParser classes exist. I’m not sure; I didn’t stick with this approach long enough to find out. But a quick read of the retry parser suggests it does exactly what you’d think: send everything back and hope the LLM corrects itself.³

Retrying is not without cost. Depending on how often it is needed, this cost could become an issue in token count, API calls, and the latency of your code.
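
To make that cost concrete, here is a minimal sketch of the kind of retry loop described above. It is not LangChain's implementation: call_llm is a hypothetical callable standing in for a real API call, and parse_with_retry and max_attempts are names I made up. Every failed parse costs another full request, with a longer prompt than the last.

```python
import json
from typing import Callable


def parse_with_retry(
    call_llm: Callable[[str], str],  # hypothetical: takes a prompt, returns raw text
    prompt: str,
    max_attempts: int = 3,
) -> tuple[dict, int]:
    """Call the model until its response parses as JSON; return (data, attempts used)."""
    current_prompt = prompt
    for attempt in range(1, max_attempts + 1):
        response = call_llm(current_prompt)
        try:
            return json.loads(response), attempt
        except json.JSONDecodeError as exc:
            # Send everything back, plus the error, and hope the model corrects itself.
            current_prompt = (
                f"{prompt}\n\nYour previous response was:\n{response}\n"
                f"It failed to parse ({exc}). Return only valid JSON."
            )
    raise ValueError(f"no valid JSON after {max_attempts} attempts")


# Stand-in model that fails once, then succeeds.
responses = iter(["Sure! Here you go: {oops", '{"name": "apple"}'])
data, attempts = parse_with_retry(lambda _prompt: next(responses), "Describe an apple.")
print(data, attempts)
```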

Token Counts

Most people aren’t concerned with all the extra instructions and tokens. An extra thousand tokens here and there doesn’t mean anything, especially if it’s just you exploring data or prototyping.

But one of the implicit promises of frameworks such as LangChain is the productization of LLMs, which means managing scale. Small decisions in framework behavior can have significant impacts down the line. And when you talk about tens of thousands of requests being sent through GPT-4, an extra hundred tokens here or there, or re-processing the same request as a fallback strategy, adds real cost and complexity.
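
As a back-of-the-envelope illustration, the sketch below multiplies the 483-token Pydantic instructions from earlier across a batch of requests. The request count, the retry rate, and the function name are hypothetical, chosen only to show the arithmetic. Plug in your own numbers and your provider's current pricing.

```python
def extra_prompt_tokens(
    instruction_tokens: int,
    requests: int,
    retry_rate: float = 0.0,
) -> int:
    """Estimate the prompt tokens spent on format instructions alone.

    retry_rate is the assumed fraction of requests that are sent a second time.
    """
    total_calls = requests + round(requests * retry_rate)
    return instruction_tokens * total_calls


# 483 tokens is the Pydantic parser's instruction size measured above.
# 10,000 requests and a 5% retry rate are made-up figures for illustration.
print(extra_prompt_tokens(483, 10_000))        # no retries
print(extra_prompt_tokens(483, 10_000, 0.05))  # with retries
```

This counts only the instruction text. A real retry also resends the rest of the prompt and the failed response, so treat the result as a lower bound.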

Alternative Approaches

I’ve had better luck sticking to JSON and providing GPT with short sample responses in my system prompt rather than describing the response format. Compared with generated JSON schemas or lengthy descriptions, this yields better results, often with fewer tokens.
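
Here is a minimal sketch of that approach, assuming a trimmed-down version of the food item model from earlier. SYSTEM_PROMPT, FoodItem, and to_food_item are hypothetical names, and the sample values in the prompt are made up. The conversion is plain standard-library code with no LangChain involved.

```python
import json
from dataclasses import dataclass

# Hypothetical system prompt: a short sample response instead of a schema.
SYSTEM_PROMPT = (
    "You describe food items. Respond with JSON like this example:\n"
    '{"name": "apple", "category": "food", "calories": 95.0}'
)


@dataclass
class FoodItem:
    name: str
    category: str
    calories: float


def to_food_item(response_text: str) -> FoodItem:
    """Convert the model's JSON response into a FoodItem, failing loudly on bad data."""
    data = json.loads(response_text)
    return FoodItem(
        name=str(data["name"]),
        category=str(data["category"]),
        calories=float(data["calories"]),
    )


print(to_food_item('{"name": "banana", "category": "food", "calories": 105}'))
```

A missing key raises KeyError and malformed JSON raises json.JSONDecodeError, so a bad response fails at the point of conversion rather than somewhere downstream.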

I am hard-pressed to think of a real-world case where parsing isn’t better served by turning on JSON mode and converting the LLM response to your desired data structure within your code. But that raises the question of how valuable LangChain is in the first place. I was initially optimistic, but I can’t categorically recommend it.

¹ For the purposes of this post, I am assuming LangChain is being used with OpenAI.
² I came across this post, which had me wondering if there was a more, say, structured criticism I could make on the subject of LangChain and why some people seem to dislike it.
³ 🤦