
Upstash Ratelimit in LangChain

Cahid Arda Oz, Software Engineer @Upstash

In this post, we will showcase how Upstash Ratelimit can be used with LangChain in JavaScript and in Python.

Motivation

Large Language Models (LLMs) are powerful tools but can be costly. To manage their cost, it's essential to limit the number of requests processed, ensuring the use of LLMs remains affordable. This is where Upstash Ratelimit comes into play.

Using Upstash Ratelimit in LangChain allows you to:

  1. Allow a limited number of chain invocations over a time period.
  2. Allow a limited number of tokens (either prompt tokens only, or prompt and completion tokens combined) over a time period.

Usage

Start by installing @upstash/ratelimit and @langchain/community (you can find more information about installing LangChain community here):

npm install @upstash/ratelimit @langchain/community

If you are using Python, run:

pip install upstash-ratelimit langchain-community

Then, set the environment variables UPSTASH_REDIS_REST_URL and UPSTASH_REDIS_REST_TOKEN. You can get their values by going to the Upstash console and creating a Redis instance.
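For example, in a Unix shell (the values below are placeholders; copy the real ones from your database's details page in the console):

export UPSTASH_REDIS_REST_URL="https://<your-database>.upstash.io"
export UPSTASH_REDIS_REST_TOKEN="<your-token>"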

Now we can demonstrate how to add rate limiting to your chain.

First, create a rate limit instance as shown below:

import { Ratelimit } from "@upstash/ratelimit";
import { Redis } from "@upstash/redis";
 
const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  // 10 requests per window, where window size is 60 seconds:
  limiter: Ratelimit.fixedWindow(10, "60 s"),
});

This ratelimit object will use the Redis database to store how many requests were made by different users. Learn more about Ratelimit from its documentation.
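To get a feel for what the limiter does, you can also call it directly, outside of LangChain. A minimal sketch, assuming the ratelimit instance above and an environment that supports top-level await:

// Check whether this identifier still has quota in the current window.
const identifier = "user-123"; // any string identifying the caller
const { success, remaining } = await ratelimit.limit(identifier);
console.log(success ? `Allowed, ${remaining} requests left` : "Rate limited!");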

Next, create a mock chain in LangChain to showcase the callback:

import { RunnableLambda } from "@langchain/core/runnables";
 
const chain = new RunnableLambda({ func: (str: string): string => str });

Finally, create a callback and invoke the chain:

import {
  UpstashRatelimitHandler,
  UpstashRatelimitError,
} from "@langchain/community/callbacks/handlers/upstash_ratelimit";
 
const user_id = getUserId(); // your own method to get user ids
 
try {
  const response = await chain.invoke("Hello World!", {
    callbacks: [
      new UpstashRatelimitHandler(user_id, {
        requestRatelimit: ratelimit,
      }),
    ],
  });
  console.log(response);
} catch (err) {
  if (err instanceof UpstashRatelimitError) {
    console.log("Handling ratelimit!");
  }
}

Here is the same code written in Python:

from upstash_redis import Redis
from upstash_ratelimit import Ratelimit, FixedWindow
 
from langchain_core.runnables import RunnableLambda
from langchain_community.callbacks import UpstashRatelimitError, UpstashRatelimitHandler
 
# your own method to get user ids
user_id = get_user_id()
 
# mock chain
chain = RunnableLambda(str)
 
# init ratelimiter
ratelimit = Ratelimit(
    redis=Redis.from_env(),
    # 10 requests per window, where window size is 60 seconds:
    limiter=FixedWindow(max_requests=10, window=60),
)
 
try:
    response = chain.invoke("Hello World!", {
        "callbacks": [
            UpstashRatelimitHandler(
                identifier=user_id,
                request_ratelimit=ratelimit,
            )
        ]
    })
except UpstashRatelimitError:
    print("Handling ratelimit!")
 

Note that we initialize the callback at invocation time rather than reusing a single instance. This is because the handler keeps per-invocation state that must be reset each time the chain runs.
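One way to make this harder to get wrong is to wrap invocation in a small helper that builds a fresh handler on every call. This is just a sketch of one possible pattern, not part of the library:

// Hypothetical helper: constructs a new handler per call so its state starts fresh.
async function invokeWithRatelimit(input: string, userId: string): Promise<string> {
  const handler = new UpstashRatelimitHandler(userId, {
    requestRatelimit: ratelimit,
  });
  return chain.invoke(input, { callbacks: [handler] });
}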

In the configuration above, the handler will allow 10 requests per minute for each user. However, this is not the only way to configure the handler. To learn more about Ratelimit callbacks in LangChain, see the Upstash Ratelimit Callback documentation for TypeScript and Python.

You can also rate limit based on the number of tokens. You can configure the handler to rate limit based on requests, number of tokens, or both. To add token-based rate limiting, use the tokenRatelimit parameter when initializing the handler:

const handler = new UpstashRatelimitHandler(user_id, {
  tokenRatelimit: ratelimit,
  includeOutputTokens: true, // whether completion is included when counting tokens
});
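To rate limit on both dimensions at once, pass separate limiter instances for requests and tokens. Here is a sketch, assuming the imports and user_id from earlier; the windows and the prefix values are illustrative choices, not requirements:

// One limiter counts requests, the other counts tokens. Distinct prefixes
// keep the two counters from sharing the same Redis keys.
const requestRatelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  prefix: "ratelimit:requests",
  limiter: Ratelimit.fixedWindow(10, "60 s"), // 10 requests per minute
});
const tokenRatelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  prefix: "ratelimit:tokens",
  limiter: Ratelimit.fixedWindow(2000, "60 s"), // 2000 tokens per minute
});

const handler = new UpstashRatelimitHandler(user_id, {
  requestRatelimit,
  tokenRatelimit,
});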

For token-based rate limiting, the handler expects the LLM in your chain to return an LLMResult in this format:

{
  "tokenUsage": {
    "totalTokens": 123,
    "promptTokens": 456,
    "otherFields": "..."
  },
  "otherFields": "..."
}

Not all LLMs in LangChain use this format, however. If the keys are different, you can use the llmOutputTokenUsageField, llmOutputTotalTokenField, and llmOutputPromptTokenField parameters of UpstashRatelimitHandler.
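For example, if your LLM reports usage under different keys, you might remap them like this (the key names on the right are made up for illustration):

const handler = new UpstashRatelimitHandler(user_id, {
  tokenRatelimit: ratelimit,
  llmOutputTokenUsageField: "usage",          // instead of "tokenUsage"
  llmOutputTotalTokenField: "total_tokens",   // instead of "totalTokens"
  llmOutputPromptTokenField: "prompt_tokens", // instead of "promptTokens"
});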

Conclusion

The UpstashRatelimitHandler provides a simple way to add rate limiting to your LangChain applications. With just a few steps, you can control the number of requests and tokens used. For more detailed information and advanced configurations, explore the LangChain documentation.