- Local rate limiting — enforced per agentgateway instance using a token-bucket algorithm. No external dependencies required.
- Global rate limiting — enforced across all instances by delegating to a remote rate limit service (compatible with Envoy’s ratelimit server).
Local rate limiting
Local rate limiting is configured on a route’s policies block using the localRateLimit field. It uses a token-bucket algorithm: the bucket starts with maxTokens tokens, refills tokensPerFill tokens every fillInterval, and each request consumes one token.
Token-bucket fields
| Field | Description |
|---|---|
| maxTokens | Maximum bucket capacity (the burst limit) |
| tokensPerFill | Tokens added each fill interval |
| fillInterval | How often the bucket is refilled (e.g. 60s, 1m) |
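The token-bucket behavior described above can be sketched in a few lines of Python. This is a minimal illustration, not agentgateway's implementation: the `TokenBucket` class and its `allow` method are hypothetical names, time is passed in explicitly so the run is deterministic, and the `tokens_per_fill=10` and `fill_interval=60.0` values are assumptions chosen for the demo.

```python
class TokenBucket:
    """Illustrative token bucket; not agentgateway's actual implementation."""

    def __init__(self, max_tokens: int, tokens_per_fill: int, fill_interval: float) -> None:
        self.max_tokens = max_tokens            # maxTokens: bucket capacity (burst limit)
        self.tokens_per_fill = tokens_per_fill  # tokensPerFill: tokens added per interval
        self.fill_interval = fill_interval      # fillInterval, in seconds
        self.tokens = max_tokens                # the bucket starts full
        self.last_fill = 0.0                    # time of the last refill, in seconds

    def allow(self, now: float) -> bool:
        # Add tokens for every whole interval that has elapsed since the last refill.
        intervals = int((now - self.last_fill) // self.fill_interval)
        if intervals > 0:
            refill = intervals * self.tokens_per_fill
            self.tokens = min(self.max_tokens, self.tokens + refill)
            self.last_fill += intervals * self.fill_interval
        if self.tokens >= 1:
            self.tokens -= 1  # each request consumes one token
            return True
        return False          # a gateway would answer 429 Too Many Requests here


# Assumed demo values: burst of 10, full refill every 60 seconds.
bucket = TokenBucket(max_tokens=10, tokens_per_fill=10, fill_interval=60.0)
results = [bucket.allow(now=float(t)) for t in range(12)]  # 12 requests in 12 seconds
after_refill = bucket.allow(now=61.0)                      # first request after a refill
print(results.count(True), results.count(False), after_refill)
```

In this run the first 10 requests pass, the next 2 are rejected, and the request at 61 seconds passes again because one fill interval has elapsed.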
Running the local rate limit example
With maxTokens: 10, after 10 requests within a minute, subsequent requests return 429 Too Many Requests until the bucket refills.
Global rate limiting
Global rate limiting delegates enforcement to an external rate limit service, so limits are shared across multiple agentgateway instances. Agentgateway implements the Envoy ratelimit gRPC protocol.
Infrastructure setup
The global rate limit example uses Envoy’s ratelimit server with a Redis backend:
Agentgateway configuration
Configure remoteRateLimit on the route policy:
Remote rate limit fields
| Field | Description |
|---|---|
| domain | The rate limit domain; must match the domain in the ratelimit server config |
| host | Address of the remote rate limit service (host:port) |
| failureMode | Behavior when the service is unreachable: failOpen (allow) or failClosed (deny with 500) |
| descriptors | List of descriptor sets that define what to rate limit |
Failure modes
- failOpen (default): Requests are allowed through when the rate limit service is unavailable. This prevents a rate limit outage from blocking all traffic and matches Envoy’s default behavior.
- failClosed: Requests are denied with a 500 response when the rate limit service is unavailable.
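The difference between the two failure modes can be sketched as a small decision function. This is an illustrative model only: `check_request`, `query_service`, and `unreachable` are hypothetical names, a `ConnectionError` stands in for the rate limit service being unreachable, and 200 stands in for "the request is forwarded to the backend".

```python
from typing import Callable


def check_request(query_service: Callable[[], bool], failure_mode: str = "failOpen") -> int:
    """Return the status a gateway would produce for one request.

    query_service returns True when the request is over its limit and
    raises ConnectionError when the rate limit service cannot be reached.
    """
    try:
        over_limit = query_service()
    except ConnectionError:
        # The service is unavailable: the failure mode decides the outcome.
        return 200 if failure_mode == "failOpen" else 500
    return 429 if over_limit else 200


def unreachable() -> bool:
    # Simulates an outage of the rate limit service.
    raise ConnectionError("rate limit service unavailable")


open_status = check_request(unreachable, failure_mode="failOpen")
closed_status = check_request(unreachable, failure_mode="failClosed")
limited_status = check_request(lambda: True)
print(open_status, closed_status, limited_status)
```

The trade-off is availability versus strictness: failOpen keeps traffic flowing during an outage at the cost of unenforced limits, while failClosed keeps limits strict at the cost of rejecting traffic.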
Rate limit server configuration
The Envoy ratelimit server uses its own YAML config to define limits. This file corresponds to examples/ratelimiting/global/ratelimit-config.yaml:
- Combined limit: 5 requests/minute for (user=test-user, tool=echo)
- Tool limit: 20 requests/minute for any request to tool=echo
Requests that exceed a limit receive OVER_LIMIT (429 Too Many Requests).
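How the two limits interact can be sketched with a simplified in-memory model. This is not the ratelimit server's implementation: it assumes the gateway sends both descriptor sets with each request, uses fixed one-minute windows, and increments every matching counter even when another limit is already exceeded. The names `LIMITS`, `descriptors_for`, and `should_rate_limit` are hypothetical.

```python
from collections import defaultdict

# Limits from the list above, keyed by the full descriptor (a tuple of key/value pairs).
LIMITS = {
    (("user", "test-user"), ("tool", "echo")): 5,  # combined limit, per minute
    (("tool", "echo"),): 20,                       # tool limit, per minute
}

counters = defaultdict(int)  # (descriptor, minute window) -> requests seen


def descriptors_for(user, tool):
    # Assumption: each request carries a (user, tool) set and a tool-only set.
    return [(("user", user), ("tool", tool)), (("tool", tool),)]


def should_rate_limit(descriptors, now_seconds):
    window = int(now_seconds // 60)  # fixed one-minute window
    over_limit = False
    for descriptor in descriptors:
        limit = LIMITS.get(descriptor)
        if limit is None:
            continue  # no rule for this descriptor, so it is not limited
        counters[(descriptor, window)] += 1
        if counters[(descriptor, window)] > limit:
            over_limit = True
    return "OVER_LIMIT" if over_limit else "OK"


# test-user hits the combined limit on the 6th request in the same minute.
test_user = [should_rate_limit(descriptors_for("test-user", "echo"), 0) for _ in range(6)]
# other-user is only subject to the shared tool limit of 20 per minute.
other_user = [should_rate_limit(descriptors_for("other-user", "echo"), 0) for _ in range(15)]
print(test_user.count("OK"), other_user.count("OK"))
```

Under these assumptions, test-user's six requests also count toward the shared tool limit, so other-user gets 14 successful calls before the 20 requests/minute tool limit is reached.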
To monitor enforcement:
Combining rate limiting with authentication
Rate limiting works alongside JWT authentication and MCP authorization. The jwtAuth policy authenticates requests before rate limits are checked:
CEL expressions for descriptors
Descriptor values in remoteRateLimit can reference request context using CEL-like string expressions. In the example above, the value fields use quoted strings ('"test-user"' and '"echo"') that are evaluated as literal values matched against the JWT subject and MCP tool name from the request context.
Refer to the telemetry guide to visualize rate limit metrics and traces for your MCP servers.