Agentgateway supports two rate limiting modes:
  • Local rate limiting — enforced per agentgateway instance using a token-bucket algorithm. No external dependencies required.
  • Global rate limiting — enforced across all instances by delegating to a remote rate limit service (compatible with Envoy’s ratelimit server).

Local rate limiting

Local rate limiting is configured on a route’s policies block using the localRateLimit field. It uses a token-bucket algorithm: the bucket starts with maxTokens tokens, refills tokensPerFill tokens every fillInterval, and each request consumes one token.
# yaml-language-server: $schema=https://agentgateway.dev/schema/config
binds:
- port: 3000
  listeners:
  - routes:
    - policies:
        localRateLimit:
          - maxTokens: 10
            tokensPerFill: 1
            fillInterval: 60s
      backends:
      - mcp:
          targets:
          - name: everything
            stdio:
              cmd: npx
              args: ["@modelcontextprotocol/server-everything"]

Token-bucket fields

| Field | Description |
| --- | --- |
| maxTokens | Maximum bucket capacity (the burst limit) |
| tokensPerFill | Tokens added each fill interval |
| fillInterval | How often the bucket is refilled (e.g. 60s, 1m) |

Increase maxTokens to allow larger bursts. To approximate a fixed window, set tokensPerFill equal to maxTokens so that the bucket refills completely once per fillInterval.
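The token-bucket behavior described above can be sketched in a few lines of Python. This is a toy model, not agentgateway's implementation: the class name, the explicit `now` argument, and the assumption that refills happen in whole steps (one per elapsed fillInterval) are all illustrative.

```python
class TokenBucket:
    """Toy model of one localRateLimit entry (illustrative only)."""

    def __init__(self, max_tokens, tokens_per_fill, fill_interval_s, start=0.0):
        self.max_tokens = max_tokens            # maxTokens
        self.tokens_per_fill = tokens_per_fill  # tokensPerFill
        self.fill_interval_s = fill_interval_s  # fillInterval, in seconds
        self.tokens = max_tokens                # the bucket starts full
        self.last_fill = start                  # time of the last refill step

    def allow(self, now):
        """Return True if a request arriving at time `now` (seconds) is admitted."""
        # Assumption: tokens are added in discrete steps, one per elapsed interval.
        fills = int((now - self.last_fill) // self.fill_interval_s)
        if fills > 0:
            self.tokens = min(self.max_tokens,
                              self.tokens + fills * self.tokens_per_fill)
            self.last_fill += fills * self.fill_interval_s
        if self.tokens > 0:
            self.tokens -= 1                    # each request consumes one token
            return True
        return False


def status_codes(bucket, arrival_times):
    """Map each arrival time to the status a client would see."""
    return [200 if bucket.allow(t) else 429 for t in arrival_times]


# Same numbers as the example config: maxTokens 10, tokensPerFill 1, fillInterval 60s.
bucket = TokenBucket(max_tokens=10, tokens_per_fill=1, fill_interval_s=60.0)
burst = status_codes(bucket, [float(t) for t in range(12)])  # 12 requests in 12 seconds
later = status_codes(bucket, [61.0, 62.0])                   # just after the first refill
```

In this model the first ten requests pass, the next two are rejected, and the refill at 60 seconds admits exactly one more request.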

Running the local rate limit example

cargo run -- -f examples/ratelimiting/local/config.yaml
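With maxTokens: 10, the first 10 requests are admitted. Further requests return 429 Too Many Requests until a refill adds a token, which with the settings shown above (tokensPerFill: 1, fillInterval: 60s) happens once every 60 seconds.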
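To reason about how much traffic a given setting admits, the arithmetic can be written as a small helper. This is a back-of-the-envelope upper bound, assuming the bucket starts full and refills in whole steps; `max_admitted` is a hypothetical name, not part of agentgateway.

```python
def max_admitted(max_tokens, tokens_per_fill, fill_interval_s, window_s):
    """Upper bound on requests admitted during the first `window_s` seconds.

    Assumes a full bucket at time 0 and one refill step per elapsed interval.
    """
    completed_fills = int(window_s // fill_interval_s)
    return max_tokens + completed_fills * tokens_per_fill


# maxTokens 10, tokensPerFill 1, fillInterval 60s
first_minute = max_admitted(10, 1, 60, window_s=59)    # burst only, no refill yet
first_hour = max_admitted(10, 1, 60, window_s=3600)    # burst plus 60 refills
```

Under these assumptions the example settings allow a burst of 10 followed by a sustained rate of one request per minute.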

Global rate limiting

Global rate limiting delegates enforcement to an external rate limit service. This ensures limits are shared across multiple agentgateway instances. Agentgateway implements the Envoy ratelimit gRPC protocol.
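The difference between per-instance and shared enforcement can be illustrated with a short calculation. This is a simplified sketch over a single window with no refill; the function names and the traffic numbers are hypothetical.

```python
def admitted_with_local_limits(requests_per_instance, limit):
    """Each instance counts on its own, so each one admits up to `limit`."""
    return sum(min(n, limit) for n in requests_per_instance)


def admitted_with_global_limit(requests_per_instance, limit):
    """All instances consult one shared counter, so the total is capped at `limit`."""
    return min(sum(requests_per_instance), limit)


# Hypothetical traffic: three agentgateway instances, 10 requests each in one window.
traffic = [10, 10, 10]
local_total = admitted_with_local_limits(traffic, limit=10)
global_total = admitted_with_global_limit(traffic, limit=10)
```

With local limits the effective ceiling grows with the number of instances, whereas a global limit stays fixed however many instances serve the traffic.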

Infrastructure setup

The global rate limit example uses Envoy’s ratelimit server with a Redis backend:
1. Start Redis

docker run -d --name redis --network host redis:7.4.3
2. Start the ratelimit server

docker run -d --name ratelimit \
  --network host \
  -e REDIS_URL=127.0.0.1:6379 \
  -e USE_STATSD=false \
  -e LOG_LEVEL=debug \
  -e REDIS_SOCKET_TYPE=tcp \
  -e RUNTIME_ROOT=/data \
  -e RUNTIME_SUBDIRECTORY=ratelimit \
  -v "$(pwd)/examples/ratelimiting/global/ratelimit-config.yaml:/data/ratelimit/config/config.yaml:ro" \
  envoyproxy/ratelimit:3e085e5b \
  /bin/ratelimit -config /data/ratelimit/config/config.yaml
3. Start agentgateway

cargo run -- -f examples/ratelimiting/global/config.yaml

Agentgateway configuration

Configure remoteRateLimit on the route policy:
# yaml-language-server: $schema=https://agentgateway.dev/schema/config
binds:
- port: 3000
  listeners:
  - routes:
    - policies:
        remoteRateLimit:
          domain: "agentgateway"
          host: "127.0.0.1:8081"
          failureMode: failOpen
          descriptors:
            - entries:
                - key: "user"
                  value: '"test-user"'
                - key: "tool"
                  value: '"echo"'
              type: "requests"
      backends:
      - mcp:
          targets:
          - name: everything
            stdio:
              cmd: npx
              args: ["@modelcontextprotocol/server-everything"]

Remote rate limit fields

| Field | Description |
| --- | --- |
| domain | The rate limit domain; must match the domain in the ratelimit server config |
| host | Address of the remote rate limit service (host:port) |
| failureMode | Behavior when the service is unreachable: failOpen (allow) or failClosed (deny with 500) |
| descriptors | List of descriptor sets that define what to rate limit |

Failure modes

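With failOpen, requests are allowed through when the rate limit service is unavailable. This prevents a rate limit outage from blocking all traffic, and it matches Envoy’s default behavior.
failureMode: failOpen
With failClosed, requests are denied with a 500 response while the service is unreachable.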
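The decision that failureMode controls can be summarized as a small function. This is a sketch of the behavior as this page describes it, not agentgateway's code; `Verdict`, `UNREACHABLE`, and `response_status` are hypothetical names.

```python
from enum import Enum


class Verdict(Enum):
    OK = "OK"                    # the service says the request is within limits
    OVER_LIMIT = "OVER_LIMIT"    # the service says a limit was exceeded
    UNREACHABLE = "UNREACHABLE"  # hypothetical marker: no answer from the service


def response_status(verdict, failure_mode):
    """Return the HTTP status to send, or None to forward the request to the backend."""
    if verdict is Verdict.OK:
        return None
    if verdict is Verdict.OVER_LIMIT:
        return 429
    # The service could not be reached, so failureMode decides.
    if failure_mode == "failOpen":
        return None   # allow the request
    if failure_mode == "failClosed":
        return 500    # deny the request
    raise ValueError(f"unknown failureMode: {failure_mode!r}")
```

failureMode only matters on the unreachable path; an explicit OVER_LIMIT verdict produces a 429 in either mode.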

Rate limit server configuration

The Envoy ratelimit server uses its own YAML config to define limits. This file corresponds to examples/ratelimiting/global/ratelimit-config.yaml:
domain: agentgateway
descriptors:
  - key: user
    value: "test-user"
    descriptors:
      - key: tool
        value: "echo"
        rate_limit:
          unit: minute
          requests_per_unit: 5
  - key: tool
    value: "echo"
    rate_limit:
      unit: minute
      requests_per_unit: 20
This configuration defines:
  • Combined limit: 5 requests/minute for the descriptor (user=test-user, tool=echo)
  • Tool limit: 20 requests/minute for a descriptor that carries only tool=echo
When a limit is exceeded, the ratelimit server responds with OVER_LIMIT and the client receives 429 Too Many Requests. To monitor enforcement:
docker logs -f ratelimit | grep -E '(OVER_LIMIT|OK)'
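How the nested descriptors in the ratelimit server configuration above map to a limit can be sketched as a tree walk. This is a simplified reading of the Envoy ratelimit matching rules, not its actual code: it assumes entries are matched in order, that an exact key and value match is preferred over a key-only match, and that a limit applies only when it sits at the end of the matched path. `find_limit` is a hypothetical helper.

```python
# The ratelimit-config.yaml shown above, expressed as a Python dict.
RATELIMIT_CONFIG = {
    "domain": "agentgateway",
    "descriptors": [
        {"key": "user", "value": "test-user", "descriptors": [
            {"key": "tool", "value": "echo",
             "rate_limit": {"unit": "minute", "requests_per_unit": 5}},
        ]},
        {"key": "tool", "value": "echo",
         "rate_limit": {"unit": "minute", "requests_per_unit": 20}},
    ],
}


def find_limit(config, entries):
    """Return the rate_limit matching a list of (key, value) entries, or None."""
    nodes = config["descriptors"]
    for i, (key, value) in enumerate(entries):
        # Prefer an exact key+value node, then fall back to a key-only node.
        node = next((n for n in nodes
                     if n["key"] == key and n.get("value") == value), None)
        if node is None:
            node = next((n for n in nodes
                         if n["key"] == key and "value" not in n), None)
        if node is None:
            return None                      # no configured path for this descriptor
        if i == len(entries) - 1:
            return node.get("rate_limit")    # limit must sit at the end of the path
        nodes = node.get("descriptors", [])  # descend one level
    return None


combined = find_limit(RATELIMIT_CONFIG, [("user", "test-user"), ("tool", "echo")])
tool_only = find_limit(RATELIMIT_CONFIG, [("tool", "echo")])
other_user = find_limit(RATELIMIT_CONFIG, [("user", "someone-else"), ("tool", "echo")])
```

Under these assumptions, a descriptor for a different user matches no path and is therefore not limited by this file.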

Combining rate limiting with authentication

Rate limiting works alongside JWT authentication and MCP authorization. The jwtAuth policy authenticates requests before rate limits are checked:
policies:
  localRateLimit:
    - maxTokens: 10
      tokensPerFill: 1
      fillInterval: 60s
  jwtAuth:
    issuer: agentgateway.dev
    audiences: [test.agentgateway.dev]
    jwks:
      file: ./manifests/jwt/pub-key
  mcpAuthorization:
    rules:
    - 'mcp.tool.name == "echo"'
    - 'jwt.sub == "test-user" && mcp.tool.name == "add"'

CEL expressions for descriptors

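Descriptor values in remoteRateLimit are CEL expressions evaluated against the request context. In the example above, the value fields are quoted strings ('"test-user"' and '"echo"'): the outer single quotes are YAML quoting, and the inner double quotes make each value a CEL string literal. Every request therefore produces the same descriptor, (user=test-user, tool=echo). To derive a descriptor from the request instead, use an expression that reads the request context, such as the JWT subject or the MCP tool name.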
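The difference between a literal value and a context lookup can be shown with a toy evaluator. This is not CEL: `evaluate` handles only double-quoted string literals and dotted lookups. Using `jwt.sub` and `mcp.tool.name` as descriptor expressions is an assumption borrowed from the mcpAuthorization rules shown earlier, and the output dict only mirrors the domain/descriptors/entries shape of an Envoy rate limit request.

```python
def evaluate(expression, context):
    """Toy stand-in for a CEL evaluator: string literals and dotted lookups only."""
    if len(expression) >= 2 and expression[0] == expression[-1] == '"':
        return expression[1:-1]            # "..." is a literal, returned as-is
    value = context
    for part in expression.split("."):     # e.g. jwt.sub -> context["jwt"]["sub"]
        value = value[part]
    return value


def build_request(domain, descriptors, context):
    """Evaluate every entry value and assemble the descriptors to send."""
    return {
        "domain": domain,
        "descriptors": [
            {"entries": [{"key": e["key"],
                          "value": str(evaluate(e["value"], context))}
                         for e in d["entries"]]}
            for d in descriptors
        ],
    }


# Hypothetical request context: caller "alice" invoking the "add" tool.
context = {"jwt": {"sub": "alice"}, "mcp": {"tool": {"name": "add"}}}

literal = [{"entries": [{"key": "user", "value": '"test-user"'},
                        {"key": "tool", "value": '"echo"'}]}]
dynamic = [{"entries": [{"key": "user", "value": "jwt.sub"},
                        {"key": "tool", "value": "mcp.tool.name"}]}]

literal_request = build_request("agentgateway", literal, context)
dynamic_request = build_request("agentgateway", dynamic, context)
```

With the literal values, the descriptor is always (user=test-user, tool=echo); with the lookups, it follows the caller and the tool being invoked.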
Refer to the telemetry guide to visualize rate limit metrics and traces for your MCP servers.