LLM context

The llm context object is available in CEL expressions when Agentgateway is proxying requests to an AI backend (backend.type == "ai"). It exposes information about the LLM request and response, including the model used, token counts, and the raw prompt and completion.

The llm object is only present when using an ai backend. Token count fields such as llm.inputTokens and llm.outputTokens are only populated after the response is received from the LLM provider.
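Because llm is absent for other backend types, an expression that may also run on routes with non-AI backends can test the backend type first. This is a minimal sketch that assumes such a mixed setup; gpt-4o is only an illustrative value. In CEL, && evaluates to false whenever one side is false, even if the other side produces an error, so a missing llm object does not cause a failure here:

# Match gpt-4o requests only when the backend is an AI backend
backend.type == "ai" && llm.requestModel == "gpt-4o"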

Core fields

`llm.streaming`

boolean

required

Whether the LLM response is being streamed.

# Only apply a policy for non-streaming requests
!llm.streaming

# Log differently based on streaming mode
llm.streaming ? "stream" : "batch"

`llm.requestModel`

string

required

The model name requested by the client. This may differ from the model that actually served the response (see llm.responseModel).

llm.requestModel == "gpt-4o"
llm.requestModel.startsWith("gpt-4")

`llm.responseModel`

string

The model that actually served the LLM response. May differ from llm.requestModel if the provider mapped the requested model to a different version.

llm.responseModel == "gpt-4o-2024-08-06"
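To spot requests where the provider served a different model than the client asked for, compare the two fields. This sketch is intended as a logging or tracing condition. The has() guard is an assumption: it is added because llm.responseModel is not marked as required and may be absent.

# True when the provider served a different model than the one requested
has(llm.responseModel) && llm.responseModel != llm.requestModel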

`llm.provider`

string

required

The name of the LLM provider handling the request.

llm.provider == "openai"
llm.provider == "anthropic"

Token counts

Token count fields are populated from the LLM provider’s response and are available after the response is received (for example, in logging and post-response authorization policies).

`llm.inputTokens`

integer

The number of tokens in the input prompt as reported by the provider.

llm.inputTokens > 10000

`llm.cachedInputTokens`

integer

The number of input tokens served from the provider’s prompt cache. Providers typically bill cached input tokens at a discount, so these represent cost savings.

llm.cachedInputTokens > 0
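To log how much of the prompt was served from cache, you can compute a percentage. This sketch assumes the provider counts cached tokens as part of llm.inputTokens. OpenAI reports them this way; other providers may report them separately, which would change the meaning of the ratio. The int() conversions are a precaution in case the fields are exposed as unsigned integers, since CEL does not mix int and uint in arithmetic. The conditional avoids dividing by zero.

# Percentage of input tokens served from the prompt cache (integer division)
has(llm.cachedInputTokens) && llm.inputTokens > 0 ? int(llm.cachedInputTokens) * 100 / int(llm.inputTokens) : 0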

`llm.cacheCreationInputTokens`

integer

The number of tokens written to the provider’s prompt cache. These represent additional cost for cache creation.

llm.cacheCreationInputTokens is not present when using OpenAI. It is specific to providers that support explicit cache creation (such as Anthropic).

llm.cacheCreationInputTokens > 0
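Because llm.cacheCreationInputTokens is not present for every provider, a logging field that reads it directly may produce an evaluation error for providers such as OpenAI. The following sketch falls back to 0 when the field is absent:

# Cache creation tokens, or 0 when the provider does not report them
has(llm.cacheCreationInputTokens) ? int(llm.cacheCreationInputTokens) : 0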

`llm.outputTokens`

integer

The number of tokens in the LLM response.

llm.outputTokens > 5000

`llm.reasoningTokens`

integer

The number of reasoning tokens in the output. Only populated for models that support extended reasoning (such as o1 or Claude with thinking enabled).

llm.reasoningTokens > 0

`llm.totalTokens`

integer

The total number of tokens for the request (input + output).

llm.totalTokens > 20000

`llm.countTokens`

integer

The number of tokens counted when using the token counting endpoint. These are not counted as input tokens since the token counting endpoint does not consume tokens.

llm.countTokens > 50000
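Token counting requests report their size in llm.countTokens rather than llm.inputTokens, so usage logs or budgets may want to exclude them. This sketch assumes llm.countTokens is only present on token counting requests:

# Skip token counting requests, which do not consume tokens
!has(llm.countTokens)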

Prompt and completion

`llm.prompt`

object[]

The prompt sent to the LLM as an array of chat messages. Each message has role and content fields.

Accessing llm.prompt has performance implications for large prompts, as the data must be retained in memory. Only reference this field when necessary.

Message fields

`role`

string

The role of the message sender. Typically system, user, or assistant.

`content`

string

The text content of the message.

# Check the role of the first message
llm.prompt[0].role == "system"

# Check if any message contains a keyword
llm.prompt.exists(m, m.content.contains("confidential"))

# Count user messages
llm.prompt.filter(m, m.role == "user").size() > 10
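Indexing into llm.prompt raises an evaluation error when the index is out of range, for example when the prompt is empty. Guarding positional checks with size() avoids this. The following sketch checks the most recent message rather than the first:

# Check that the last message in the conversation comes from the user
llm.prompt.size() > 0 && llm.prompt[llm.prompt.size() - 1].role == "user"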

`llm.completion`

string[]

The completion returned by the LLM as an array of strings.

Accessing llm.completion has performance implications for large responses, as the data must be retained in memory. Only reference this field when necessary.

# Check if the completion contains sensitive content
llm.completion.exists(c, c.contains("SSN"))
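To act on the length of the response rather than its content, apply size() to each completion string. In CEL, size() on a string counts Unicode code points. The 10,000 limit is an arbitrary illustrative value.

# Flag responses where any completion exceeds 10,000 characters
llm.completion.exists(c, c.size() > 10000)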

Parameters

The llm.params object contains the inference parameters from the LLM request.

`llm.params`

object

required

The parameters for the LLM request.

Parameter fields

`temperature`

number

The sampling temperature. Higher values produce more random outputs.

llm.params.temperature > 1.0

`top_p`

number

The nucleus sampling probability. Values closer to 1.0 include more of the probability distribution.

llm.params.top_p < 0.5

`frequency_penalty`

number

Penalizes tokens based on how frequently they appear in the output so far.

has(llm.params.frequency_penalty)

`presence_penalty`

number

Penalizes tokens based on whether they have appeared in the output at all.

has(llm.params.presence_penalty)

`seed`

integer

The random seed for deterministic output.

has(llm.params.seed)

`max_tokens`

integer

The maximum number of tokens to generate.

llm.params.max_tokens > 4096
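Comparing llm.params.max_tokens with llm.outputTokens after the response arrives shows whether generation stopped at the requested limit, which usually means the output was truncated. This sketch is intended for a logging or post-response condition and assumes the provider counts both values the same way. The int() conversions are a precaution against mixed integer types.

# True when the response used the full max_tokens budget
has(llm.params.max_tokens) && int(llm.outputTokens) >= int(llm.params.max_tokens)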

`encoding_format`

string

The encoding format for embeddings (for example, float or base64).

llm.params.encoding_format == "float"

`dimensions`

integer

The number of dimensions for embedding models.

llm.params.dimensions == 1536

`llmRequest`

object

The raw LLM request before any LLM policy processing. This field is only available during LLM policy evaluation. Policies that run after the LLM policy, such as logging policies, do not have this field, even for LLM requests. Use llmRequest when you need access to the unmodified request before transformations are applied.

# Access raw request fields during LLM policy processing
has(llmRequest)

Examples

Rate limiting by total token count

To enforce per-user token budgets, count token usage against a per-user key. In a rate limiting policy, you can use CEL to select the dimension to limit on, such as the subject of the user's JWT:

# Limit by user (from JWT) and total tokens
jwt.sub
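To give each user a separate budget per model, the key expression can combine several values. This sketch assumes that jwt.sub is a string and that the rate limiting policy accepts any string-valued CEL expression as its key:

# One budget per user and requested model
jwt.sub + ":" + llm.requestModel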

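Combine this with threshold checks in authorization policies. Token counts are only populated after the response is received, so this kind of check applies only to post-response authorization:

# Deny when total token usage exceeds a threshold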
llm.totalTokens > 100000

Log the model used and token breakdown

In a logging policy, define named fields using CEL expressions:

# provider: 'llm.provider'
# requested_model: 'llm.requestModel'
# actual_model: 'llm.responseModel'
# input_tokens: 'llm.inputTokens'
# output_tokens: 'llm.outputTokens'
# total_tokens: 'llm.totalTokens'
# cached_tokens: 'llm.cachedInputTokens'
# streaming: 'llm.streaming'

Block high-temperature requests

Reject requests with a temperature above a threshold in an authorization policy:

!has(llm.params.temperature) || llm.params.temperature <= 1.5

Enforce max token limits

Reject requests that request more tokens than your policy allows:

!has(llm.params.max_tokens) || llm.params.max_tokens <= 4096

Restrict access to specific models

Only allow requests for approved models:

llm.requestModel in ["gpt-4o-mini", "gpt-3.5-turbo"]
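If different providers sit behind the same route, the allowlist can be scoped per provider. In this sketch, the model names are illustrative placeholders, not a recommendation:

# Allow a different set of models for each provider
(llm.provider == "openai" && llm.requestModel in ["gpt-4o-mini", "gpt-3.5-turbo"]) || (llm.provider == "anthropic" && llm.requestModel.startsWith("claude-3-5-haiku"))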

Detect prompt content for policy enforcement

Check prompt messages for sensitive patterns. Use with prompt guard policies or custom authorization:

!llm.prompt.exists(m, m.content.matches("(?i)password|secret|api.key"))
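The check above scans every message, including system prompts that you control. The following variant only inspects user-supplied messages. It assumes your own system prompts may legitimately contain these words:

# Only inspect messages sent by the user
!llm.prompt.exists(m, m.role == "user" && m.content.matches("(?i)password|secret|api.key"))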

Tracing: sample expensive requests

Sample only requests that exceed a token count threshold for detailed tracing:

llm.totalTokens > 5000 || random() < 0.01

Provider-specific logic

Apply different policies based on the provider:

# Anthropic-specific: check cache creation tokens
llm.provider == "anthropic" && has(llm.cacheCreationInputTokens)

# Require streaming for OpenAI requests
llm.provider == "openai" && llm.streaming
