docs: added a usage-based ratelimiting section in the docs #260
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
--- | ||
id: capabilities | ||
title: Capabilities | ||
sidebar_position: 3 | ||
--- | ||
|
||
# Envoy AI Gateway Capabilities | ||
|
||
Welcome to the Envoy AI Gateway capabilities documentation! This section provides detailed information about the various features and capabilities that Envoy AI Gateway offers to help you manage and optimize your AI/LLM traffic. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,288 @@ | ||
--- | ||
id: usage-based-ratelimiting | ||
title: Usage-based Rate Limiting | ||
sidebar_position: 5 | ||
--- | ||
|
||
import Tabs from '@theme/Tabs'; | ||
import TabItem from '@theme/TabItem'; | ||
|
||
This guide will help you configure usage-based rate limiting for your AI Gateway to control token consumption across different LLM requests. | ||
|
||
## Overview | ||
|
||
Usage-based rate limiting allows you to control and monitor token consumption for your LLM requests. You can set separate limits for: | ||
- Input tokens | ||
- Output tokens | ||
- Total tokens | ||
|
||
This is particularly useful for: | ||
- Controlling costs per user | ||
- Implementing fair usage policies | ||
- Preventing abuse of your LLM endpoints | ||
|
||
## Configuration | ||
|
||
### 1. Configure Token Tracking | ||
|
||
First, you need to configure which metadata keys will store the token counts from LLM requests. Add the following configuration to your `AIGatewayRoute`: | ||
|
||
```yaml
spec:
  # ... other configuration ...
  llmRequestCosts:
    - metadataKey: llm_input_token
      type: InputToken
    - metadataKey: llm_output_token
      type: OutputToken
    - metadataKey: llm_total_token
      type: TotalToken
```
|
||
### 2. Configure Rate Limit Policy | ||
|
||
Create a `BackendTrafficPolicy` to define your rate limit rules: | ||
|
||
```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: ai-gateway-token-ratelimit-policy
  namespace: default
spec:
  targetRefs:
    - name: your-gateway-name
      kind: Gateway
      group: gateway.networking.k8s.io
  rateLimit:
    type: Global
    global:
      rules:
        # Input Token Rate Limit
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
          limit:
            requests: 10000 # Adjust based on your needs
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
              # Review thread on `number: 0`:
              # - I would like an explanation of what zero means here if I were new to this project.
              # - Is this `request` section required?
              # - Good q @arkodg
              # - Yeah, the cost of a request is 0 (by default it's 1, i.e. every request
              #   costs 1 count towards the total limit), and the cost of a response is Y.
              #   But the check of whether the total count has reached the limit or not
              #   happens during a request.
              # - I meant, the default of 1 must not be changed for legacy API use when the top level
              # - Maybe I should've done it before the v1.3 release…
              # - For now we can have this, and at least people won't have things broken
              #   when we get it better ;)
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_input_token

        # Output Token Rate Limit
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
          limit:
            requests: 20000 # Adjust based on your needs
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_output_token

        # Total Token Rate Limit
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
          limit:
            requests: 30000 # Adjust based on your needs
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
```
|
||
## Understanding the Configuration | ||
|
||
### Rate Limit Rules | ||
Review comments:

- This feels like documentation for Envoy Gateway, as per the comment above. Ack that this might be helpful, but I would also like to avoid duplicating effort with the Envoy Gateway project (not here). Deferring to @missBerg for the decision.
- I think keeping some of it in for now is good. Let's add a link to the EG docs for people to dive into more details, @melsal13.

|
||
Each rule in the configuration consists of: | ||
|
||
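1. **Client Selectors**: Define how to identify unique clients (e.g., by the `x-user-id` header)
2. **Limit**: Specify the token budget (the `requests` field) and the time unit
3. **Cost**: Configure how much each request counts against the limit
   - `request`: The cost charged when the request arrives. It defaults to 1 (every request counts 1 towards the limit); setting it to 0 means only response tokens are counted. The check of whether the limit has been reached still happens at request time.
   - `response`: The cost charged after the response, read from the metadata key that holds the token count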
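
To make the split between a request cost of 0 and a token-based response cost concrete, here is a minimal Python sketch of the accounting described above. It is a toy model under stated assumptions, not Envoy Gateway's implementation: the `TokenBudget` class and its method names are hypothetical, and it tracks a single client in a single time window.

```python
from dataclasses import dataclass


@dataclass
class TokenBudget:
    """Toy model of one rate limit rule for a single client and time window."""

    limit: int      # plays the role of `limit.requests`, used as a token budget
    used: int = 0   # tokens counted so far in the current window

    def on_request(self, request_cost: int = 0) -> bool:
        """Check the budget when a request arrives; True means it is allowed."""
        if self.used >= self.limit:
            return False  # the gateway would answer with 429 here
        self.used += request_cost  # 0 in this guide, so the request itself is free
        return True

    def on_response(self, token_count: int) -> None:
        """Charge the token count reported for the response."""
        self.used += token_count


budget = TokenBudget(limit=100)
results = []
for tokens in (60, 60, 10):
    allowed = budget.on_request()
    results.append(allowed)
    if allowed:
        budget.on_response(tokens)

print(results, budget.used)
```

In this model the second request is still admitted because the check runs before its tokens are known, so the counter can end a window above the limit; only the following request is rejected.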
|
||
### Time Units | ||
|
||
You can specify rate limits using different time units: | ||
- `Second` | ||
- `Minute` | ||
- `Hour` | ||
- `Day` | ||
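
As a rough sanity check when choosing a limit and unit, the budget can be converted to an average sustained rate. The sketch below is plain arithmetic in Python; the helper name `average_tokens_per_second` is hypothetical, and it assumes each unit is a fixed-length window of the obvious duration.

```python
# Window length in seconds for each unit listed above.
UNIT_SECONDS = {"Second": 1, "Minute": 60, "Hour": 3600, "Day": 86400}


def average_tokens_per_second(limit: int, unit: str) -> float:
    """Average rate a client can sustain without exhausting the budget."""
    return limit / UNIT_SECONDS[unit]


# The three example limits used in this guide, all per hour.
for name, limit in (("input", 10000), ("output", 20000), ("total", 30000)):
    print(f"{name}: {average_tokens_per_second(limit, 'Hour'):.2f} tokens/s")
```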
|
||
### Client Identification | ||
|
||
There are several ways to identify clients for rate limiting. Here are the most common approaches: | ||
|
||
#### 1. Simple Header-based Identification | ||
|
||
The simplest approach is using a custom header: | ||
|
||
```yaml
clientSelectors:
  - headers:
      - name: x-user-id
        type: Distinct
```
|
||
#### 2. JWT Token Claims | ||
|
||
You can extract client identifiers from JWT tokens. This is particularly useful when your application already uses JWT for authentication: | ||
|
||
```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: jwt-auth
  namespace: default
spec:
  targetRefs:
    - name: your-gateway-name
      group: gateway.networking.k8s.io
      kind: Gateway
  jwt:
    providers:
      - name: my-provider
        issuer: https://your-issuer.com
        audiences:
          - your-audience
        remoteJWKS:
          uri: https://your-issuer.com/.well-known/jwks.json
        claimToHeaders:
          - claim: sub
            header: x-jwt-sub
          - claim: client_id
            header: x-client-id

---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: rate-limit-with-jwt
  namespace: default
spec:
  targetRefs:
    - name: your-gateway-name
      kind: Gateway
      group: gateway.networking.k8s.io
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-jwt-sub # Using the extracted JWT subject claim
                  type: Distinct
                - name: x-client-id # Additionally using client_id for more granular control
                  type: Distinct
          limit:
            requests: 10000
            unit: Hour
          # ... rest of the rate limit configuration ...
```
|
||
#### 3. Combined Identification | ||
|
||
You can combine multiple identifiers for more granular control: | ||
|
||
```yaml
clientSelectors:
  - headers:
      - name: x-jwt-sub
        type: Distinct
      - name: x-client-id
        type: Distinct
      - name: x-organization-id
        type: Distinct
```
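
Conceptually, a rule with several `Distinct` headers keeps one counter per unique combination of header values. The Python sketch below only illustrates that bucketing idea; it is not how Envoy Gateway is implemented, and `bucket_key` is a hypothetical helper. It assumes the rule matches only when every selected header is present.

```python
from collections import defaultdict
from typing import Dict, Optional, Tuple

SELECTED_HEADERS = ("x-jwt-sub", "x-client-id", "x-organization-id")


def bucket_key(headers: Dict[str, str]) -> Optional[Tuple[str, ...]]:
    """Return the counter key for a request, or None if the rule does not match."""
    if any(name not in headers for name in SELECTED_HEADERS):
        return None
    return tuple(headers[name] for name in SELECTED_HEADERS)


usage = defaultdict(int)  # tokens used per combination of header values
sample = [
    ({"x-jwt-sub": "alice", "x-client-id": "web", "x-organization-id": "acme"}, 120),
    ({"x-jwt-sub": "alice", "x-client-id": "cli", "x-organization-id": "acme"}, 80),
    ({"x-jwt-sub": "alice", "x-client-id": "web", "x-organization-id": "acme"}, 30),
    ({"x-jwt-sub": "bob"}, 50),  # missing headers, so this rule does not count it
]
for headers, tokens in sample:
    key = bucket_key(headers)
    if key is not None:
        usage[key] += tokens
```

Here the same user gets separate budgets for the `web` and `cli` clients, which is the "more granular control" the combined selector is meant to provide.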
|
||
#### 4. Dynamic Header Transformation | ||
|
||
For complex scenarios, you can use Envoy's header transformation to create custom identifiers: | ||
|
||
```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: transform-headers
spec:
  parentRefs:
    - name: your-gateway-name
  rules:
    - filters:
        - type: RequestHeaderModifier
          requestHeaderModifier:
            set:
              - name: x-rate-limit-id
                value: "%REQ(x-organization-id)%_%REQ(x-client-id)%"
      # ... rest of the route configuration ...
```
|
||
Then use the transformed header in your rate limit configuration: | ||
|
||
```yaml
clientSelectors:
  - headers:
      - name: x-rate-limit-id
        type: Distinct
```
|
||
:::warning | ||
Avoid using sensitive claims directly in headers. Instead, use derived or hashed values when needed. | ||
::: | ||
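
One way to obtain a derived value is a keyed hash of the sensitive claim. The Python sketch below shows the idea with HMAC-SHA256 from the standard library. Where this runs is an assumption that this guide does not specify (for example, a service that sets the header before the request reaches the gateway), and `derive_rate_limit_id` and the example secret are hypothetical.

```python
import hashlib
import hmac


def derive_rate_limit_id(claim_value: str, secret: bytes) -> str:
    """Return a stable identifier that does not expose the claim value itself."""
    digest = hmac.new(secret, claim_value.encode("utf-8"), hashlib.sha256)
    # Truncated hex keeps the header short while remaining stable per client.
    return digest.hexdigest()[:32]


# Hypothetical secret for illustration; load it from a secret store in practice.
secret = b"example-secret-change-me"
rate_limit_id = derive_rate_limit_id("alice@example.com", secret)
print(rate_limit_id)
```

The same input always maps to the same identifier, so it works as a `Distinct` header value, while the original claim never appears in the header.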
|
||
## Making Requests | ||
|
||
When making requests to your rate-limited endpoint, include the appropriate client identifier: | ||
|
||
```shell
curl --fail \
  -H "Content-Type: application/json" \
  -H "x-user-id: user123" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }' \
  $GATEWAY_URL/v1/chat/completions
```
|
||
## Rate Limit Responses | ||
|
||
When a rate limit is exceeded, the API will return a 429 (Too Many Requests) status code. The response will include headers indicating: | ||
- The current rate limit status | ||
- When the rate limit will reset | ||
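
On the client side, a 429 is usually handled by waiting and retrying. The Python sketch below computes how long to wait. The header names `retry-after` and `x-ratelimit-reset`, their meaning as a number of seconds, and the lower-cased header keys are assumptions for illustration; check which headers your deployment actually returns. `retry_delay` is a hypothetical helper.

```python
import random
from typing import Dict, Optional


def retry_delay(status: int, headers: Dict[str, str], attempt: int,
                base: float = 1.0, cap: float = 60.0) -> Optional[float]:
    """Return the seconds to wait before retrying, or None if no retry is needed."""
    if status != 429:
        return None
    # Prefer a server-provided hint when it is present and numeric.
    for name in ("retry-after", "x-ratelimit-reset"):
        value = headers.get(name)
        if value is not None and value.strip().isdigit():
            return min(float(value), cap)
    # Otherwise fall back to exponential backoff with full jitter.
    return random.uniform(0, min(cap, base * 2 ** attempt))


print(retry_delay(429, {"x-ratelimit-reset": "30"}, attempt=0))
```

Capping the delay keeps a client from sleeping for a whole window when the reset is far away; whether to retry at all or surface the error is an application decision.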
|
||
## Best Practices | ||
|
||
1. **Set Appropriate Limits**: Consider your use case and adjust limits accordingly | ||
2. **Monitor Usage**: Keep track of rate limit hits to adjust limits if needed | ||
3. **Client Identification**: Choose a reliable way to identify clients | ||
4. **Error Handling**: Implement proper handling of rate limit responses in your applications |

Review comments:

- The cost needs to be controlled at the model level, commonly a combination of user and target model.
- Yeah, a model header demonstration would be helpful.
- Yeah, let's add that example for the combination of user and target model, @melsal13.
- Hello @yuzisun @mathetake @missBerg, I added an example of a user and target model. Please let me know what you all think :)