docs: added a usage-based ratelimiting section in the docs #260 (merged Feb 8, 2025)
`site/docs/capabilities/index.md` (10 additions)
---
id: capabilities
title: Capabilities
sidebar_position: 3
---

# Envoy AI Gateway Capabilities

Welcome to the Envoy AI Gateway capabilities documentation! This section provides detailed information about the various features and capabilities that Envoy AI Gateway offers to help you manage and optimize your AI/LLM traffic.

`site/docs/capabilities/usage-based-ratelimiting.md` (288 additions)
---
id: usage-based-ratelimiting
title: Usage-based Rate Limiting
sidebar_position: 5
---

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This guide shows how to configure usage-based rate limiting for your AI Gateway to control token consumption across LLM requests.

## Overview

Usage-based rate limiting allows you to control and monitor token consumption for your LLM requests. You can set separate limits for:
- Input tokens
- Output tokens
- Total tokens

This is particularly useful for:
- Controlling costs per user
- Implementing fair usage policies
- Preventing abuse of your LLM endpoints

## Configuration

### 1. Configure Token Tracking

First, you need to configure which metadata keys will store the token counts from LLM requests. Add the following configuration to your `AIGatewayRoute`:

```yaml
spec:
  # ... other configuration ...
  llmRequestCosts:
    - metadataKey: llm_input_token
      type: InputToken
    - metadataKey: llm_output_token
      type: OutputToken
    - metadataKey: llm_total_token
      type: TotalToken
```
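After a response is received, the gateway extracts the token counts and writes them into Envoy's dynamic metadata under the `io.envoy.ai_gateway` namespace, using the keys configured above. Conceptually, the metadata for a response that used 15 input and 42 output tokens would look like this (illustrative values, not literal Envoy output):

```yaml
io.envoy.ai_gateway:
  llm_input_token: 15
  llm_output_token: 42
  llm_total_token: 57
```

The rate limit rules in the next step read these keys to deduct token usage from each client's budget.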

### 2. Configure Rate Limit Policy

Create a `BackendTrafficPolicy` to define your rate limit rules:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: ai-gateway-token-ratelimit-policy
  namespace: default
spec:
  targetRefs:
    - name: your-gateway-name
      kind: Gateway
      group: gateway.networking.k8s.io
  rateLimit:
    type: Global
    global:
      rules:
        # Input Token Rate Limit
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
          limit:
            requests: 10000 # Adjust based on your needs
            unit: Hour
          cost:
            request:
              from: Number
              # The request itself costs 0: only the token counts reported in
              # the response are deducted from the budget. The limit check
              # still happens when the request arrives.
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_input_token

        # Output Token Rate Limit
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
          limit:
            requests: 20000 # Adjust based on your needs
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_output_token

        # Total Token Rate Limit
        - clientSelectors:
            - headers:
                - name: x-user-id
                  type: Distinct
          limit:
            requests: 30000 # Adjust based on your needs
            unit: Hour
          cost:
            request:
              from: Number
              number: 0
            response:
              from: Metadata
              metadata:
                namespace: io.envoy.ai_gateway
                key: llm_total_token
```

## Understanding the Configuration

### Rate Limit Rules
Each rule in the configuration consists of:

1. **Client Selectors**: Define how to identify unique clients (e.g., by `x-user-id` header)
2. **Limit**: Specify the token budget and time unit
3. **Cost**: Configure how to calculate the cost of each request
   - `request`: Usually set to `0` so that only response tokens are counted
   - `response`: Uses metadata from the LLM response to count tokens
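This cost split implies a particular accounting model: the limit check happens when a request arrives (at cost 0 here), while actual token usage is deducted only after the response is received. A minimal Python sketch of that model (an illustration of the accounting, not Envoy's implementation):

```python
class TokenBudget:
    """Illustrates the request/response cost split described above."""

    def __init__(self, limit: int):
        self.limit = limit
        self.used = 0

    def allow_request(self, request_cost: int = 0) -> bool:
        # The limit check happens on the request, before token usage is known.
        if self.used + request_cost > self.limit:
            return False  # would be a 429 in the gateway
        self.used += request_cost
        return True

    def record_response(self, token_count: int) -> None:
        # Response tokens are deducted after the fact, so one large response
        # can overshoot the budget; the *next* request is then rejected.
        self.used += token_count


budget = TokenBudget(limit=10000)
assert budget.allow_request()      # request cost is 0, so this is allowed
budget.record_response(9500)       # response consumed 9500 tokens
assert budget.allow_request()      # still within the 10000-token budget
budget.record_response(600)        # usage is now 10100, over the limit
assert not budget.allow_request()  # the next request is rejected
```

Note the consequence: a single request can exceed the budget, since its token count is only known after the response; enforcement kicks in on subsequent requests.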

### Time Units

You can specify rate limits using different time units:
- `Second`
- `Minute`
- `Hour`
- `Day`
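For example, a per-minute budget smooths out bursts, while a per-hour budget caps overall consumption (illustrative values):

```yaml
limit:
  requests: 500 # token budget per window
  unit: Minute
```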

### Client Identification

There are several ways to identify clients for rate limiting. Here are the most common approaches:

#### 1. Simple Header-based Identification

The simplest approach is using a custom header:

```yaml
clientSelectors:
  - headers:
      - name: x-user-id
        type: Distinct
```

#### 2. JWT Token Claims

You can extract client identifiers from JWT tokens. This is particularly useful when your application already uses JWT for authentication:

```yaml
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: jwt-auth
  namespace: default
spec:
  targetRefs:
    - name: your-gateway-name
      group: gateway.networking.k8s.io
      kind: Gateway
  jwt:
    providers:
      - name: my-provider
        issuer: https://your-issuer.com
        audiences:
          - your-audience
        remoteJWKS:
          uri: https://your-issuer.com/.well-known/jwks.json
        claimToHeaders:
          - claim: sub
            header: x-jwt-sub
          - claim: client_id
            header: x-client-id

---
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: BackendTrafficPolicy
metadata:
  name: rate-limit-with-jwt
  namespace: default
spec:
  targetRefs:
    - name: your-gateway-name
      kind: Gateway
      group: gateway.networking.k8s.io
  rateLimit:
    type: Global
    global:
      rules:
        - clientSelectors:
            - headers:
                - name: x-jwt-sub # The extracted JWT subject claim
                  type: Distinct
                - name: x-client-id # Additionally using client_id for more granular control
                  type: Distinct
          limit:
            requests: 10000
            unit: Hour
          # ... rest of the rate limit configuration ...
```

#### 3. Combined Identification

You can combine multiple identifiers for more granular control:

```yaml
clientSelectors:
  - headers:
      - name: x-jwt-sub
        type: Distinct
      - name: x-client-id
        type: Distinct
      - name: x-organization-id
        type: Distinct
```
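A common pattern is to budget per user *and* per target model, so that an expensive model gets its own, smaller budget. Assuming the requested model name is available as a request header (Envoy AI Gateway exposes it as `x-ai-eg-model` by default; verify the header name for your version), the selector might look like:

```yaml
clientSelectors:
  - headers:
      - name: x-user-id
        type: Distinct
      - name: x-ai-eg-model # header carrying the requested model name
        type: Distinct
```

With `Distinct` on both headers, each (user, model) pair gets an independent token budget.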

#### 4. Dynamic Header Transformation

For complex scenarios, you can use Envoy's header transformation to create custom identifiers:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: transform-headers
spec:
  parentRefs:
    - name: your-gateway-name
  rules:
    - filters:
        - type: RequestHeaderModifier
          requestHeaderModifier:
            set:
              - name: x-rate-limit-id
                value: "%REQ(x-organization-id)%_%REQ(x-client-id)%"
      # ... rest of the route configuration ...
```

Then use the transformed header in your rate limit configuration:

```yaml
clientSelectors:
  - headers:
      - name: x-rate-limit-id
        type: Distinct
```

:::warning
Avoid using sensitive claims directly in headers. Instead, use derived or hashed values when needed.
:::

## Making Requests

When making requests to your rate-limited endpoint, include the appropriate client identifier:

```shell
curl --fail \
  -H "Content-Type: application/json" \
  -H "x-user-id: user123" \
  -d '{
    "model": "gpt-4",
    "messages": [
      {
        "role": "user",
        "content": "Hello!"
      }
    ]
  }' \
  $GATEWAY_URL/v1/chat/completions
```

## Rate Limit Responses

When a rate limit is exceeded, the API will return a 429 (Too Many Requests) status code. The response will include headers indicating:
- The current rate limit status
- When the rate limit will reset
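Clients should treat a 429 as a signal to back off and retry. A minimal standard-library Python sketch (the `GATEWAY_URL`, model name, and exponential-backoff fallback are assumptions; whether a `Retry-After` header is present depends on your gateway configuration):

```python
import json
import time
import urllib.error
import urllib.request

GATEWAY_URL = "http://localhost:8080"  # assumption: replace with your gateway URL


def retry_delay(headers, attempt: int) -> float:
    """Honor Retry-After when the gateway sends it; otherwise back off exponentially."""
    try:
        return float(headers.get("Retry-After", ""))
    except (TypeError, ValueError):
        return float(2 ** attempt)


def chat(prompt: str, user_id: str, max_retries: int = 3) -> dict:
    """POST a chat completion, retrying politely when rate limited."""
    body = json.dumps({
        "model": "gpt-4",
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    for attempt in range(max_retries):
        req = urllib.request.Request(
            f"{GATEWAY_URL}/v1/chat/completions",
            data=body,
            headers={"Content-Type": "application/json", "x-user-id": user_id},
        )
        try:
            with urllib.request.urlopen(req) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise  # not a rate limit problem; surface the error
            time.sleep(retry_delay(err.headers, attempt))
    raise RuntimeError("rate limited: token budget was not replenished in time")
```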

## Best Practices

1. **Set Appropriate Limits**: Consider your use case and adjust limits accordingly
2. **Monitor Usage**: Keep track of rate limit hits to adjust limits if needed
3. **Client Identification**: Choose a reliable way to identify clients
4. **Error Handling**: Implement proper handling of rate limit responses in your applications