docs: added a usage-based ratelimiting section in the docs #260
Merged
Changes from 12 of 15 commits.

Commits:
- e0656a9 updated order of prerequisite docs to ensure neccessary tools installed (melsal13)
- 0d00b85 Merge remote-tracking branch 'origin/main' into update-docs (melsal13)
- de039a2 added a usage-based ratelimiting section in the docs (melsal13)
- 2f3baa8 added a usage-based ratelimiting section in the docs (melsal13)
- 49fc8a1 made usage based rate limiting docs more ai-gateway specific (melsal13)
- 6d4a9f3 Merge remote-tracking branch 'origin/main' into update-docs (melsal13)
- 4aa9144 demonstrates combo of model and user rate limiting (melsal13)
- 930b381 first draft of usage based rate limiting (melsal13)
- d5b4dcc 2nd usage based rate limiting docs with model headers (melsal13)
- 9190bc3 Merge remote-tracking branch 'origin/main' into docs-usagebased (melsal13)
- 04075cc Merge branch 'docs-usagebased' of https://github.com/melsal13/ai-gate… (melsal13)
- 578bed4 fixed white space error (melsal13)
- a602571 changed note to clarify note on unifies openai schema (melsal13)
- 62bb705 Merge branch 'main' into docs-usagebased (melsal13)
- eab1f70 Merge branch 'main' into docs-usagebased (missBerg)
@@ -0,0 +1,10 @@

---
id: capabilities
title: Capabilities
sidebar_position: 3
---

# Envoy AI Gateway Capabilities

Welcome to the Envoy AI Gateway capabilities documentation! This section provides detailed information about the various features and capabilities that Envoy AI Gateway offers to help you manage and optimize your AI/LLM traffic.
@@ -0,0 +1,171 @@

---
id: usage-based-ratelimiting
title: Usage-based Rate Limiting
sidebar_position: 5
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This guide focuses on AI Gateway's specific capabilities for token-based rate limiting in LLM requests. For general rate limiting concepts and configurations, refer to [Envoy Gateway's Rate Limiting documentation](https://gateway.envoyproxy.io/docs/tasks/traffic/global-rate-limit/).

## Overview

AI Gateway leverages Envoy Gateway's Global Rate Limit API to provide token-based rate limiting for LLM requests. Key features include:
- Token usage tracking based on model and user identifiers
- Configuration for tracking input, output, and total token metadata from LLM responses
- Model-specific rate limiting using AI Gateway headers (`x-ai-eg-model`)
- Support for custom token cost calculations using CEL expressions

## Token Usage Behavior

AI Gateway has specific behavior for token tracking and rate limiting:

1. **Token Extraction**: AI Gateway automatically extracts token usage from LLM responses that follow the OpenAI schema format. The token counts are stored in the metadata specified in your `llmRequestCosts` configuration.

2. **Rate Limit Timing**: The check for whether the total count has reached the limit happens during each request. When a request is received:
   - AI Gateway checks if processing this request would exceed the configured token limit
   - If the limit would be exceeded, the request is rejected with a 429 status code
   - If within the limit, the request is processed and its token usage is counted towards the total

3. **Token Types**:
   - `InputToken`: Counts tokens in the request prompt
   - `OutputToken`: Counts tokens in the model's response
   - `TotalToken`: Combines both input and output tokens
   - `CEL`: Allows custom token calculations using CEL expressions

4. **Multiple Rate Limits**: You can configure multiple rate limit rules for the same user-model combination. For example:
   - Limit total tokens per hour
   - Separate limits for input and output tokens
   - Custom limits using CEL expressions

:::note
The token counts are extracted from the model's response. Make sure your model backend provides token usage information in a format compatible with the OpenAI schema.
:::

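For reference, the token extraction described in step 1 relies on the `usage` object that OpenAI-schema chat completion responses carry. A minimal sketch of such a response is shown below; the field values are illustrative only:

```json
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "gpt-4",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "Hello! How can I help?" },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 9,
    "completion_tokens": 12,
    "total_tokens": 21
  }
}
```

`prompt_tokens`, `completion_tokens`, and `total_tokens` correspond to the `InputToken`, `OutputToken`, and `TotalToken` cost types listed above.
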
## Configuration

### 1. Configure Token Tracking

AI Gateway automatically tracks token usage for each request. Configure which token counts you want to track in your `AIGatewayRoute`:

```yaml
spec:
  llmRequestCosts:
    - metadataKey: llm_input_token
      type: InputToken # Counts tokens in the request
    - metadataKey: llm_output_token
      type: OutputToken # Counts tokens in the response
    - metadataKey: llm_total_token
      type: TotalToken # Tracks combined usage
```

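The `metadataKey` values defined here are what the rate limit rules reference later, via the `io.envoy.ai_gateway` metadata namespace used in the `BackendTrafficPolicy` examples below.
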
For advanced token calculations specific to your use case:

```yaml
spec:
  llmRequestCosts:
    - metadataKey: custom_cost
      type: CEL
      celExpression: "input_tokens * 0.5 + output_tokens * 1.5" # Example: Weight output tokens more heavily
```

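The snippets above show only the `llmRequestCosts` stanza. For orientation, the sketch below shows where that stanza might sit inside a complete `AIGatewayRoute` resource; the `apiVersion`, metadata names, and the placeholder for the rest of the spec are assumptions made for illustration rather than something defined in this guide, so adapt them to your own route:

```yaml
# Illustrative sketch only: apiVersion and names are assumed, not taken from this guide.
apiVersion: aigateway.envoyproxy.io/v1alpha1   # assumed CRD group/version for Envoy AI Gateway
kind: AIGatewayRoute
metadata:
  name: my-llm-route        # hypothetical name
  namespace: default
spec:
  llmRequestCosts:
    - metadataKey: llm_total_token
      type: TotalToken
  # ... the rest of your AIGatewayRoute spec (rules, backends, and so on) is unchanged ...
```
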
### 2. Configure Rate Limits

AI Gateway uses Envoy Gateway's Global Rate Limit API to configure rate limits. Rate limits should be defined using a combination of user and model identifiers to properly control costs at the model level. Configure this using a `BackendTrafficPolicy`:

#### Example: Cost-Based Model Rate Limiting

The following example demonstrates a common use case where different models have different token limits based on their costs. This is useful when:
- You want to limit expensive models (like GPT-4) more strictly than cheaper ones
- You need to implement different quotas for different tiers of service
- You want to prevent cost overruns while still allowing flexibility with cheaper models

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: model-specific-token-limit-policy
  namespace: default
spec:
  targetRefs:
    - name: envoy-ai-gateway-token-ratelimit
      kind: Gateway
      group: gateway.networking.k8s.io
  rateLimit:
    type: Global
    global:
      rules:
        # Rate limit rule for GPT-4: 1000 total tokens per hour per user
        # Stricter limit due to higher cost per token
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
                - name: x-ai-eg-model
                  type: Exact
                  value: gpt-4
          limit:
            requests: 1000 # 1000 total tokens per hour
            unit: Hour
          cost:
            request:
              from: Number
              number: 0 # Set to 0 so only token usage counts
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token # Uses total tokens from the responses
        # Rate limit rule for GPT-3.5: 5000 total tokens per hour per user
        # Higher limit since the model is more cost-effective
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
                - name: x-ai-eg-model
                  type: Exact
                  value: gpt-3.5-turbo
          limit:
            requests: 5000 # 5000 total tokens per hour (higher limit for less expensive model)
            unit: Hour
          cost:
            request:
              from: Number
              number: 0 # Set to 0 so only token usage counts
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token # Uses total tokens from the response
```

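Assuming the policy above is saved to a file, it is applied like any other Kubernetes resource; the file name here is illustrative:

```shell
kubectl apply -f model-specific-token-limit-policy.yaml
```
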
:::warning
When configuring rate limits:
1. Always set the request cost number to 0 to ensure only token usage counts towards the limit
2. Set appropriate limits for different models based on their costs and capabilities
3. Ensure both user and model identifiers are used in rate limiting rules
:::

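If you defined a CEL-based cost (the `custom_cost` key from step 1), the same response-cost mechanism can charge that value instead of raw total tokens. Below is a minimal sketch of a single rule fragment, under the assumption that the custom key is published in the same `io.envoy.ai_gateway` metadata namespace; it is not a complete policy and reuses the selectors from the example above:

```yaml
# Sketch of one rule charging the CEL-derived cost; the surrounding
# BackendTrafficPolicy (targetRefs, rateLimit.global.rules) is as in the example above.
- clientSelectors:
    - headers:
        - name: x-user-id
          type: Distinct
        - name: x-ai-eg-model
          type: Exact
          value: gpt-4
  limit:
    requests: 2000          # budget of 2000 cost units per hour (illustrative number)
    unit: Hour
  cost:
    request:
      from: Number
      number: 0             # only the metadata-derived cost counts
    response:
      from: Metadata
      metadata:
        namespace: io.envoy.ai_gateway
        key: custom_cost    # the CEL-based metadataKey defined earlier
```
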
## Making Requests

For proper cost control and rate limiting, requests must include:
- `x-user-id`: Identifies the user making the request
- `x-ai-eg-model`: Identifies the model being used

Example request (both headers are required):
```shell
curl --fail \
  -H "Content-Type: application/json" \
  -H "x-user-id: user123" \
  -H "x-ai-eg-model: gpt-4" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }' \
  $GATEWAY_URL/v1/chat/completions
```
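Once a user's hourly token budget for a model is exhausted, subsequent requests are rejected with a 429 status, as described in the Token Usage Behavior section. One way to observe this is to repeat the request with `-i` so the status line is printed; the exact response body depends on your Envoy Gateway configuration:

```shell
# Same request as above, but printing response headers; once the 1000-token
# hourly budget for gpt-4 is used up, a 429 is expected until the window resets.
curl -i \
  -H "Content-Type: application/json" \
  -H "x-user-id: user123" \
  -H "x-ai-eg-model: gpt-4" \
  -d '{"model": "gpt-4", "messages": [{"role": "user", "content": "Hello!"}]}' \
  $GATEWAY_URL/v1/chat/completions
# HTTP/1.1 429 Too Many Requests   (expected once the limit has been reached)
```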
Review comment (on the note about the OpenAI schema): Since the AI GW transforms, for example, AWS Bedrock responses to the OpenAI schema, Envoy AI GW can also capture usage from model providers whose requests and responses are transformed into the OpenAI schema. Update this to make clear that, thanks to the request/response transformation into a unified API based on the OpenAI schema, we can capture usage in a unified way 😊 @melsal13