Reference

AI Gateway Glossary

Every term you need to understand AI gateways, LLM routing, and production AI infrastructure. 28 definitions, updated for 2026.

AI Gateway

Middleware that sits between your application and LLM providers. It routes requests to the optimal model, caches responses, handles failover, enforces rate limits, and provides observability. Requesty is an AI gateway that supports 400+ models through a single OpenAI-compatible API.

LLM Routing

The practice of automatically directing AI model requests to the optimal provider based on criteria like cost, latency, model capability, and availability. Smart routing can reduce LLM costs by 50 to 80% by sending simple tasks to budget models.

Prompt Caching

Storing LLM responses and returning cached results for identical or near-identical prompts. Eliminates redundant API calls. Cached responses return in under 10ms with zero token cost. Typical production workloads see 30 to 60% cache hit rates.

Model Failover

Automatically rerouting LLM requests to a backup provider when the primary provider fails or degrades. A fallback chain defines the order of providers to try. Good failover detects failures in under 100ms and switches before the user notices.

Fallback Chain

An ordered list of LLM models or providers that a gateway tries in sequence. If the first model fails or times out, the request moves to the next in the chain. Requesty supports up to 10 retry attempts per model before moving to the next fallback.

Smart Routing

A routing strategy that analyzes request complexity and matches it to the optimal model. Simple queries go to cheap, fast models. Complex reasoning goes to frontier models. This is more sophisticated than basic load balancing because the router understands the task.

Token

The basic unit of text that LLMs process. Roughly 1 token equals 0.75 words in English. LLM APIs charge per token for both input (your prompt) and output (the model response). A 1,000-word document is approximately 1,333 tokens.

Rate Limiting

Restrictions on how many API requests or tokens you can send in a given time window. LLM providers enforce rate limits per API key. An AI gateway can distribute requests across multiple keys and providers to effectively pool rate limits.

RBAC

Role-Based Access Control. A governance model that assigns permissions based on user roles. In an AI gateway context, RBAC controls who can access which models, set budgets, view logs, and manage API keys. Requesty uses a 5-layer policy hierarchy from organization to API key.

Observability

The ability to understand what is happening inside your AI system by examining its outputs. For LLM infrastructure, this means tracking token usage, latency, cost per request, error rates, and model performance per user, team, or API key.

Cost per Token

The price charged by an LLM provider for processing one token. Prices vary dramatically: GPT-4.1 charges $2 per million input tokens while DeepSeek V3 charges $0.14 per million. Output tokens are typically 2 to 5x more expensive than input tokens.

Context Window

The maximum number of tokens an LLM can process in a single request, including both the prompt and the response. Larger context windows allow longer documents and conversation histories. Gemini 2.5 Pro supports up to 1 million tokens.

OpenAI-compatible API

An API that accepts the same request format as OpenAI (chat completions, embeddings, etc). Many AI gateways and alternative providers implement this format so you can switch providers by changing only the base URL. Requesty uses this format for all 400+ models.

Exponential Backoff

A retry strategy where each retry waits longer than the last (e.g. 500ms, 1s, 2s, 4s). Prevents overwhelming a recovering provider. When combined with jitter (random time offset), it avoids the thundering herd problem where all clients retry simultaneously.

Jitter

A random time offset added to retry delays. Without jitter, all clients using exponential backoff would retry at exactly the same intervals, creating synchronized retry waves. Jitter randomizes the timing so retries are spread out evenly.

Thundering Herd

A failure pattern where many clients simultaneously retry requests after an outage, overwhelming the recovering system. Prevented by combining exponential backoff with jitter. Common in LLM architectures where thousands of applications share the same provider endpoint.

Zero Data Retention (ZDR)

A provider policy that guarantees prompts and responses are not stored after processing and are not used for model training. OpenAI, Anthropic, and Google offer ZDR for enterprise customers. An AI gateway can enforce ZDR by routing sensitive data only to ZDR-eligible models.

EU Data Residency

Processing and storing data within the European Union to comply with GDPR. Requesty offers EU routing through its Frankfurt infrastructure. This ensures that prompts and responses never leave EU borders, which is required for many European enterprise workloads.

BYOK

Bring Your Own Key. Using your own API keys from LLM providers (OpenAI, Anthropic, etc) through an AI gateway, instead of the gateway provider's keys. This gives you direct billing relationships with providers and is common in enterprise setups.

Frontier Model

The most capable AI models available at any given time. As of 2026, frontier models include GPT-4.1, Claude Sonnet 4, and Gemini 2.5 Pro. They excel at complex reasoning, coding, and creative tasks but cost 10 to 50x more than budget models.

Budget Model

Cost-effective AI models suitable for simple tasks like classification, extraction, and summarization. Examples: DeepSeek V3 ($0.14/M input tokens), Gemini Flash ($0.15/M), GPT-4o mini ($0.15/M). Smart routing sends 50 to 70% of production traffic to budget models.

TTFT

Time To First Token. The latency between sending a request and receiving the first token of the response. Critical for streaming applications where users see tokens appear in real time. Typical TTFT ranges from 200ms (budget models) to 2s (frontier models).

Streaming

Receiving LLM responses token by token as they are generated, instead of waiting for the complete response. Streaming improves perceived latency because users see output immediately. Most AI gateways support server-sent events (SSE) for streaming.

Guardrails

Safety filters applied to LLM inputs and outputs. Common guardrails include PII detection and masking, content policy enforcement, prompt injection detection, and output validation. Guardrails run at the gateway level so they protect all models behind it.

Load Balancing

Distributing LLM requests across multiple providers or endpoints to prevent overloading any single one. Unlike smart routing (which considers task complexity), load balancing focuses on even distribution. Most AI gateways combine both approaches.

Circuit Breaker

A pattern that stops sending requests to a failing provider after a threshold of errors is reached. The circuit "opens" and traffic routes elsewhere. After a cooldown period, a few test requests are sent to check if the provider has recovered.

Semantic Caching

Caching that matches prompts by meaning rather than exact text. Two prompts asking the same question in different words can return the same cached response. More effective than exact-match caching but requires embedding computation to determine similarity.

Model Context Protocol (MCP)

A standard protocol that lets LLMs connect to external tools and data sources. MCP servers expose capabilities (file access, database queries, API calls) that any MCP-compatible model can use. Requesty's MCP gateway connects 400+ models to any MCP server.

See these concepts in action

Requesty implements every concept in this glossary. Smart routing, caching, failover, RBAC, observability, and more. Start free with $10 credits.

Start free