Skip to main content

LLM context

The llm context object is available in CEL expressions when Agentgateway is proxying requests to an AI backend (backend.type == "ai"). It exposes information about the LLM request and response, including the model used, token counts, and the raw prompt and completion.
The llm object is only present when using an ai backend. Token count fields such as llm.inputTokens and llm.outputTokens are only populated after the response is received from the LLM provider.

Core fields

llm.streaming
boolean
required
Whether the LLM response is being streamed.
# Only apply a policy for non-streaming requests
!llm.streaming

# Log differently based on streaming mode
llm.streaming ? "stream" : "batch"
llm.requestModel
string
required
The model name requested by the client. This may differ from the model that actually served the response (see llm.responseModel).
llm.requestModel == "gpt-4o"
llm.requestModel.startsWith("gpt-4")
llm.responseModel
string
The model that actually served the LLM response. May differ from llm.requestModel if the provider mapped the requested model to a different version.
llm.responseModel == "gpt-4o-2024-08-06"
llm.provider
string
required
The name of the LLM provider handling the request.
llm.provider == "openai"
llm.provider == "anthropic"

Token counts

Token count fields are populated from the LLM provider’s response and are available after the response is received (for example, in logging and post-response authorization policies).
llm.inputTokens
integer
The number of tokens in the input prompt as reported by the provider.
llm.inputTokens > 10000
llm.cachedInputTokens
integer
The number of input tokens served from the provider’s prompt cache. These represent cost savings.
llm.cachedInputTokens > 0
llm.cacheCreationInputTokens
integer
The number of tokens written to the provider’s prompt cache. These represent additional cost for cache creation.
llm.cacheCreationInputTokens is not present when using OpenAI. It is specific to providers that support explicit cache creation (such as Anthropic).
llm.cacheCreationInputTokens > 0
llm.outputTokens
integer
The number of tokens in the LLM response.
llm.outputTokens > 5000
llm.reasoningTokens
integer
The number of reasoning tokens in the output. Only populated for models that support extended reasoning (such as o1 or Claude with thinking enabled).
llm.reasoningTokens > 0
llm.totalTokens
integer
The total number of tokens for the request (input + output).
llm.totalTokens > 20000
llm.countTokens
integer
The number of tokens counted when using the token counting endpoint. These are not counted as input tokens since the token counting endpoint does not consume tokens.
llm.countTokens > 50000

Prompt and completion

llm.prompt
object[]
The prompt sent to the LLM as an array of chat messages. Each message has role and content fields.
Accessing llm.prompt has performance implications for large prompts, as the data must be retained in memory. Only reference this field when necessary.
# Check the role of the first message
llm.prompt[0].role == "system"

# Check if any message contains a keyword
llm.prompt.exists(m, m.content.contains("confidential"))

# Count user messages
llm.prompt.filter(m, m.role == "user").size() > 10
llm.completion
string[]
The completion returned by the LLM as an array of strings.
Accessing llm.completion has performance implications for large responses, as the data must be retained in memory. Only reference this field when necessary.
# Check if the completion contains sensitive content
llm.completion.exists(c, c.contains("SSN"))

Parameters

The llm.params object contains the inference parameters from the LLM request.
llm.params
object
required
The parameters for the LLM request.

llmRequest

llmRequest
object
The raw LLM request before any LLM policy processing. This is only available during LLM policy evaluation. Policies that run after the LLM policy — such as logging policies — will not have this field even for LLM requests.Use llmRequest when you need access to the unmodified request before transformations are applied.
# Access raw request fields during LLM policy processing
has(llmRequest)

Examples

Use token counts as the rate limiting key to enforce per-user token budgets. In a rate limiting policy, you can use CEL to select the dimension to limit on:
# Limit by user (from JWT) and total tokens
jwt.sub
Combined with threshold checks in authorization policies:
# Deny requests that would exceed a threshold (if pre-checked)
llm.totalTokens > 100000
In a logging policy, define named fields using CEL expressions:
# provider: 'llm.provider'
# requested_model: 'llm.requestModel'
# actual_model: 'llm.responseModel'
# input_tokens: 'llm.inputTokens'
# output_tokens: 'llm.outputTokens'
# total_tokens: 'llm.totalTokens'
# cached_tokens: 'llm.cachedInputTokens'
# streaming: 'llm.streaming'
Reject requests with a temperature above a threshold in an authorization policy:
!has(llm.params.temperature) || llm.params.temperature <= 1.5
Reject requests that request more tokens than your policy allows:
!has(llm.params.max_tokens) || llm.params.max_tokens <= 4096
Only allow requests for approved models:
llm.requestModel in ["gpt-4o-mini", "gpt-3.5-turbo"]
Check prompt messages for sensitive patterns. Use with prompt guard policies or custom authorization:
!llm.prompt.exists(m, m.content.matches("(?i)password|secret|api.key"))
Sample only requests that exceed a token count threshold for detailed tracing:
llm.totalTokens > 5000 || random() < 0.01
Apply different policies based on the provider:
# Anthropic-specific: check cache creation tokens
llm.provider == "anthropic" && has(llm.cacheCreationInputTokens)

# Require streaming for OpenAI to reduce cost
llm.provider == "openai" && llm.streaming