Every production AI application faces the same problem: you need multiple models, multiple providers, and the ability to switch between them without rewriting code. An LLM routing platform solves this by giving you one API endpoint that routes to 100+ models, handles failover when providers go down, caches repeated calls, and tracks costs across your entire organization.
By June 2026, the market has consolidated around seven platforms. Each makes different tradeoffs between control, cost, latency, and operational complexity. This guide compares them with real numbers so you can pick the right one for your stack.
The seven platforms at a glance
| Platform | Type | Models | Overhead (P50) | Pricing | Self-Host | SOC 2 | Best For |
|---|---|---|---|---|---|---|---|
| Requesty | Managed | 400+ | 8ms (16ms agentic) | 5% markup | No | Yes | Production routing with caching and governance |
| Portkey | Managed + self-host | 1,600+ | 10-20ms | Per-log pricing | Gateway only | Yes | Guardrails and enterprise observability |
| LiteLLM | Open-source self-host | 100+ | 10-20ms | Free (self-host) | Yes | N/A | Zero markup at scale |
| OpenRouter | Managed marketplace | 300+ | 40-55ms | 5.5% on credits | No | No | Quick model access for solo devs |
| Kong AI Gateway | Enterprise API platform | Provider-dependent | Varies | Enterprise license | Yes | Yes | Teams already on Kong |
| Cloudflare AI Gateway | Edge platform | Major providers | Sub-10ms at edge | Pay-per-request | No | Yes | Cloudflare-native stacks |
| Helicone | Observability-first | Major providers | Proxy overhead | Free tier + paid | No | Yes | LLM monitoring and analytics |
Requesty: production routing with the lowest overhead
Requesty is a managed AI gateway built in Rust. It routes to 400+ models across OpenAI, Anthropic, Google, DeepSeek, Mistral, and dozens of other providers through a single OpenAI-compatible endpoint.
Latency
In published benchmarks, Requesty adds 8ms P50 overhead on standard requests and 16ms P50 on agentic workloads that include routing logic. TrueFoundry's independent comparison confirms the approximately 8ms overhead figure. For context, OpenRouter adds 40 to 55ms in the same tests.
The Rust core accounts for the difference. Requesty uses a PeakEWMA algorithm that adapts to real-time provider health rather than relying on static priority lists. Each request routes to whichever provider is responding fastest at that moment, measured on a rolling one-hour window.
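To make the idea of peak-sensitive latency scoring concrete, here is a minimal sketch of a peak EWMA estimator in the general style used by load balancers. It is not Requesty's implementation: the class name, the `decay_seconds` time constant, and the in-flight penalty are all assumptions for illustration, and the time-based decay stands in for the rolling window described above.

```python
import math


class PeakEwma:
    """Latency estimate that jumps to peaks at once and decays slowly."""

    def __init__(self, decay_seconds: float = 10.0) -> None:
        self.decay_seconds = decay_seconds  # hypothetical time constant
        self.cost_ms = 0.0
        self.last_update = 0.0
        self.in_flight = 0

    def observe(self, latency_ms: float, now: float) -> None:
        elapsed = max(now - self.last_update, 0.0)
        self.last_update = now
        if latency_ms > self.cost_ms:
            # A spike replaces the estimate right away (the "peak" part).
            self.cost_ms = latency_ms
        else:
            # Otherwise blend toward the sample; old history fades with time.
            weight = math.exp(-elapsed / self.decay_seconds)
            self.cost_ms = self.cost_ms * weight + latency_ms * (1.0 - weight)

    def score(self) -> float:
        # Penalize providers that already have requests outstanding.
        return self.cost_ms * (self.in_flight + 1)


def pick_provider(providers: dict[str, PeakEwma]) -> str:
    """Return the provider name with the lowest current score."""
    return min(providers, key=lambda name: providers[name].score())
```

The asymmetry is the point of the design: a slow response penalizes a provider immediately, while recovery has to be earned over several fast responses.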
Three routing modes
Smart Routing: Classifies the request by type (code generation, reasoning, summarization) and dispatches to the best model for that task. You toggle it on in the dashboard. No code changes needed. Code generation goes to Claude Opus 4.8. Simple classification goes to Gemini 3.5 Flash. Cost drops 50% or more while quality stays constant.
Fallback Policies: Ordered sequences of models. If the primary model times out or returns a 5xx, Requesty tries the next in the chain. Failover completes in under 50ms. Each step supports 0 to 10 retries with exponential backoff (500ms to 4s with jitter).
Latency Routing: Measures real-time model performance and routes to the fastest available. For streaming calls, the metric is time-to-first-token. For non-streaming, total response time. New or cold-start models get 5 to 10% of traffic for data collection, then join the ranking.
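The fallback behavior described above (an ordered chain, retries per step, exponential backoff bounded between 500ms and 4s with jitter) can be sketched client-side as follows. This is an illustration under assumptions, not Requesty's code: `RetryableError`, `call_with_fallback`, and the particular jitter scheme are hypothetical, and `send` stands in for whatever function actually calls a model.

```python
import random
import time
from typing import Callable, Optional, Sequence


class RetryableError(Exception):
    """Stand-in for a timeout or a 5xx response."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 4.0) -> float:
    """Exponential backoff with jitter, kept within [base, cap] seconds."""
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(base, ceiling)


def call_with_fallback(
    models: Sequence[str],
    send: Callable[[str], str],
    retries: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Try each model in order; retry a model before moving down the chain."""
    last_error: Optional[Exception] = None
    for model in models:
        for attempt in range(retries + 1):
            try:
                return send(model)
            except RetryableError as exc:
                last_error = exc
                if attempt < retries:
                    sleep(backoff_delay(attempt))
    raise RuntimeError("all models in the chain failed") from last_error
```

Passing `sleep` as a parameter keeps the function testable without real delays. A managed gateway does this server-side, so application code would normally not need it.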
Response caching
Requesty's semantic cache hits on identical and similar requests. In production, teams report 40 to 60% cache hit rates on repeated calls. The cache requires zero configuration. It runs automatically on every request. At scale, caching savings often exceed the 5% gateway markup, making Requesty net-negative on cost compared to calling providers directly.
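As a rough illustration of a cache that hits on both identical and similar requests, the sketch below checks an exact hash first and then falls back to a similarity scan. It is a toy under stated assumptions: a production semantic cache would typically compare embeddings, whereas this uses `difflib.SequenceMatcher` from the standard library as a stand-in, and the `ResponseCache` class and its `threshold` value are hypothetical.

```python
import hashlib
from difflib import SequenceMatcher
from typing import Optional


class ResponseCache:
    """Exact-match lookup with a similarity fallback (stand-in for embeddings)."""

    def __init__(self, threshold: float = 0.9) -> None:
        self.threshold = threshold  # hypothetical similarity cutoff
        self.exact: dict[str, str] = {}
        self.entries: list[tuple[str, str]] = []

    @staticmethod
    def _normalize(prompt: str) -> str:
        # Lowercase and collapse whitespace so trivial variations still match.
        return " ".join(prompt.lower().split())

    def put(self, prompt: str, response: str) -> None:
        text = self._normalize(prompt)
        self.exact[hashlib.sha256(text.encode()).hexdigest()] = response
        self.entries.append((text, response))

    def get(self, prompt: str) -> Optional[str]:
        text = self._normalize(prompt)
        key = hashlib.sha256(text.encode()).hexdigest()
        if key in self.exact:
            return self.exact[key]
        # Linear scan is fine for a sketch; real systems use a vector index.
        for cached_text, response in self.entries:
            if SequenceMatcher(None, text, cached_text).ratio() >= self.threshold:
                return response
        return None
```

The threshold is the main tuning knob: set it too low and unrelated prompts receive each other's answers, set it too high and only near-duplicates hit.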
Governance
Requesty applies a five-layer policy hierarchy, from org level down to the individual API key, that covers budget caps, rate limits, model allowlists, PII redaction, and usage policies. RBAC controls who can create keys, view costs, and modify routing policies. Compliance coverage includes SOC 2, GDPR, and HIPAA, with EU hosting in Frankfurt.
Integration
Change your base URL to router.requesty.ai/v1 and use your Requesty API key. Existing OpenAI SDK code works unchanged. Tested compatible with LangChain, LangGraph, CrewAI, Claude Agent SDK, Google ADK, Vercel AI SDK, and every major agent framework.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-key",
)

response = client.chat.completions.create(
    model="anthropic/claude-opus-4-8",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
```
When to use Requesty
You need production-grade routing with the lowest latency. You want caching that pays for itself. You need multi-provider failover that completes in under 50ms. You want governance (budgets, RBAC, PII redaction) without managing infrastructure. You are building agentic applications and need per-agent cost tracking.
Portkey: guardrails and enterprise observability
Portkey is a managed AI gateway with the largest model catalog: 1,600+ models across all major providers. The standout feature set is guardrails and observability.
Guardrails
Portkey ships 50+ built-in guardrails that run on every request: PII detection, toxicity filtering, hallucination checks, regex validation, and custom webhook-based rules. Guardrails execute before the request reaches the model (input guardrails) and after the response returns (output guardrails). This is the most comprehensive built-in safety layer of any gateway.
Observability
Every request is logged with token counts, latency, cost, model, and custom metadata. Dashboards break down spend by team, project, model, and API key. Portkey's analytics include prompt performance tracking (which prompts produce the best outputs) and A/B testing dashboards.
Self-hosting
The Portkey Gateway is open-source (MIT). You can self-host the routing layer and run it as a proxy without using Portkey's managed service. The managed platform (observability, guardrails dashboard, team management) is proprietary.
Pricing
Free tier: 10,000 logs per month. Paid tiers use per-log pricing. At high volume, the per-log cost can add up, especially for agentic workloads where a single task generates hundreds of LLM calls.
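A quick back-of-the-envelope check shows how fast agentic workloads consume a log allowance. The sketch assumes one log per LLM call, which is an assumption rather than a documented billing rule, and the function name is hypothetical.

```python
FREE_TIER_LOGS = 10_000  # monthly free-tier allowance quoted above


def tasks_within_free_tier(calls_per_task: int, free_logs: int = FREE_TIER_LOGS) -> int:
    """How many agent tasks fit in the allowance, assuming one log per call."""
    if calls_per_task <= 0:
        raise ValueError("calls_per_task must be positive")
    return free_logs // calls_per_task
```

Under that assumption, an agent that makes 200 calls per task would use up the free tier after 50 tasks.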
When to use Portkey
You need the largest model catalog (1,600+). Guardrails (PII, toxicity, hallucination) are a hard requirement. You want the deepest observability dashboards. You need SOC 2 and HIPAA compliance with a managed service.
LiteLLM: the open-source self-hosted option
LiteLLM is the most popular open-source LLM proxy, with 22,000+ GitHub stars (the project's own page says 47,800+, though different counting methods apply). It provides an OpenAI-compatible API over 100+ providers and is completely free to self-host.
Architecture
A Python proxy server backed by PostgreSQL for spend tracking and Redis for caching. You deploy it on your infrastructure, own every byte of data, and pay zero markup on provider costs.
Key features
Virtual API keys: Create project-scoped or user-scoped keys with independent spend limits, model access lists, and rate limits. This is the multi-tenancy layer for self-hosted deployments.
Routing strategies: Load balancing (weighted distribution), fallback chains, latency-based routing, and cost-based routing. All configurable via YAML.
A2A and MCP support: LiteLLM added native A2A protocol support and MCP tool integration for agent-to-agent communication and tool routing alongside model routing.
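For readers unfamiliar with weighted load balancing, the routing strategy listed above, here is a minimal standard-library sketch of the idea. It is not LiteLLM's code or configuration format; the function name and the deployment names in the usage note are hypothetical.

```python
import random


def pick_deployment(weights: dict[str, float], rng: random.Random) -> str:
    """Choose one deployment with probability proportional to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]
```

With weights such as `{"azure-east": 0.8, "azure-west": 0.2}`, roughly four out of five requests would go to the first deployment over a large number of calls.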
Tradeoffs
You need a DevOps team. PostgreSQL, Redis, the proxy server, TLS termination, monitoring, and upgrades are all your responsibility. No managed guardrails, no built-in PII detection, no hosted dashboard (though a basic admin UI exists). At $50K per month in API spend, the zero-markup savings ($2,750 per month versus OpenRouter's 5.5%) justify the operational cost. Below $5K per month, the engineering time to maintain LiteLLM likely exceeds the savings.
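The break-even point in that tradeoff is simple arithmetic: self-hosting starts to pay off once the markup you avoid exceeds what it costs to run the proxy. The sketch below assumes the operational cost is a flat monthly figure, which is a simplification, and the function name is hypothetical.

```python
def break_even_spend(monthly_ops_cost: float, markup_rate: float) -> float:
    """Monthly API spend at which avoided markup equals self-hosting cost."""
    if markup_rate <= 0:
        raise ValueError("markup_rate must be positive")
    return monthly_ops_cost / markup_rate
```

Against a 5.5% markup, a self-hosting cost of $500 per month breaks even at about $9,100 in monthly API spend, and $2,000 per month breaks even at about $36,400.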
When to use LiteLLM
You have a DevOps team and want zero vendor markup. You need to own all data on your infrastructure (air-gapped environments, strict data residency). Your API spend is $10K+ per month, making the markup savings significant. You want open-source with full code access.
OpenRouter: the model marketplace
OpenRouter is a hosted API that provides access to 300+ models through a single endpoint. It operates as a marketplace: multiple providers host the same model, and OpenRouter routes to the cheapest or fastest available instance.
Strengths
Broad model access with the simplest integration. Sign up, get a key, and you can call GPT-5.5, Claude Opus 4.8, Gemini 3.5 Flash, DeepSeek V4, Llama 4 Maverick, and hundreds of open-source models immediately. OpenRouter is often the first platform to host newly released models.
For solo developers and startups that want instant access to every model without infrastructure, OpenRouter is the fastest path. A free tier covers 25+ models with 50 to 1,000 requests per day.
Tradeoffs
Latency: 40 to 55ms of overhead in independent benchmarks. For user-facing applications, this is noticeable. For batch processing, it is acceptable.
No semantic caching: OpenRouter does not cache responses. Every call goes to the provider, every time. At high volume, this means significantly higher costs compared to platforms with caching.
No governance: No RBAC, no budget hierarchies, no PII redaction, no compliance certifications. Per-key budget caps exist, but there is no team-level or project-level spend management.
5.5% markup plus minimums: The 5.5% fee on credit purchases plus a $0.80 minimum charge on small transactions adds up. At $50K per month, you pay $2,750 in markup. At $1K per month, the effective rate is higher due to minimums.
Credit expiration: Credits expire after 365 days. If you buy in bulk and usage drops, you lose the balance.
When to use OpenRouter
You are a solo developer or early-stage startup. You want instant access to 300+ models with no setup. You do not need caching, governance, or compliance. Latency is not a primary concern. You want to evaluate many models quickly.
Kong AI Gateway: for existing Kong users
Kong AI Gateway adds AI-specific plugins to the Kong API platform. If your organization already runs Kong for API management, adding AI routing is a plugin installation, not a new service.
Key features
AI-specific rate limiting, prompt caching, token-aware load balancing, and request transformation plugins. All built on Kong's existing plugin architecture, so they compose with your existing auth, logging, and rate limiting plugins.
Tradeoffs
Kong is an API gateway that added AI features, not an AI-native platform. The routing intelligence (prompt-aware model selection, latency-based routing) is less sophisticated than purpose-built platforms like Requesty or Portkey. Pricing follows Kong's enterprise license model, which is expensive for teams that only need AI routing.
When to use Kong AI Gateway
You already run Kong and want to add AI routing without deploying a separate service. Your API team manages Kong and wants AI traffic to go through the same governance layer.
Cloudflare AI Gateway: edge-first routing
Cloudflare AI Gateway routes AI traffic through Cloudflare's edge network. Requests are processed at the nearest Cloudflare PoP, which adds less than 10ms of overhead at the edge.
Key features
Edge caching (responses cached at Cloudflare's 300+ locations worldwide), rate limiting, request logging, and analytics. Integration with Cloudflare Workers for custom pre/post-processing logic.
Tradeoffs
Works best within the Cloudflare ecosystem. If you are not already using Cloudflare Workers, adding their AI Gateway means adopting a new platform. Provider support is limited to major providers (OpenAI, Anthropic, Google, Azure OpenAI). No smart routing by request type. No multi-provider failover chains.
When to use Cloudflare AI Gateway
You already run on Cloudflare Workers. You want edge caching for AI responses. Your use case is straightforward (one or two providers, no complex routing logic).
Helicone: observability with routing
Helicone started as an LLM observability platform and has added proxy and routing capabilities. The core strength remains analytics: cost tracking, latency monitoring, prompt performance analysis, and user-level usage dashboards.
Key features
One-line integration (add a header to your existing OpenAI calls). Detailed cost and latency analytics per request, per user, per prompt. Prompt experiments for A/B testing different prompts and models. Session tracking that groups related requests into logical sessions.
Tradeoffs
Routing features are less mature than dedicated gateways. No smart routing by request type, no latency-based model selection, limited failover configuration. Best used alongside a routing platform or as a lightweight proxy for teams where observability is the primary need.
When to use Helicone
Observability and cost analytics are your primary need. You want the simplest possible integration (one header). You do not need advanced routing logic. You want prompt experiment tracking.
Cost comparison at scale
What does each platform cost at $10,000 per month in provider API spend?
| Platform | Markup | Caching Savings | Net Cost | Infrastructure |
|---|---|---|---|---|
| Requesty | $500 (5%) | $4,000-$6,000 (40-60%) | Net savings of $3,500-$5,500 | None (managed) |
| Portkey | Per-log (varies) | Provider-side only | $200-$1,000+ depending on volume | None (managed) |
| LiteLLM | $0 | Self-configured | $500-$2,000 (infra + engineering) | PostgreSQL, Redis, proxy |
| OpenRouter | $550 (5.5%) | None | $550 | None (managed) |
| Kong | Enterprise license | Plugin-based | $2,000-$10,000+ (license) | Kong cluster |
| Cloudflare | Pay-per-request | Edge cache | $100-$500 | Cloudflare account |
| Helicone | Free tier + paid | None | $0-$500 | None (managed) |
The math favors platforms with caching at scale. If Requesty's semantic cache serves 40 to 60% of requests, and cached requests are about as large as uncached ones, you pay for roughly 40 to 60% fewer provider tokens. At $10K monthly spend, that puts caching savings at $4,000 to $6,000, far exceeding the $500 (5%) markup.
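The table's net-cost figures follow from one line of arithmetic, shown here as a small sketch. It assumes, as the table does, that both the markup and the caching savings are computed against the same baseline provider spend; the function name is hypothetical.

```python
def net_gateway_cost(api_spend: float, markup_rate: float, cache_savings_rate: float) -> float:
    """Markup paid minus caching savings; a negative result is a net saving."""
    return api_spend * markup_rate - api_spend * cache_savings_rate
```

For $10,000 in spend with a 5% markup, a 40% savings rate gives a net saving of $3,500 and a 60% rate gives $5,500, while a platform with a 5.5% markup and no caching costs $550. The break-even savings rate equals the markup rate, so any cache that saves more than 5% of spend covers a 5% fee.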
Feature comparison matrix
| Feature | Requesty | Portkey | LiteLLM | OpenRouter | Kong | Cloudflare | Helicone |
|---|---|---|---|---|---|---|---|
| Smart routing (by task type) | Yes | Basic | No | Basic | No | No | No |
| Latency-based routing | Yes (PeakEWMA) | Yes | Yes | Yes | Plugin | No | No |
| Fallback chains | Yes (under 50ms) | Yes | Yes | Yes | Plugin | No | No |
| Semantic caching | Yes (40-60%) | No | Self-configured | No | Plugin | Edge cache | No |
| PII redaction | Yes | Yes (guardrails) | No | No | Plugin | No | No |
| RBAC | Yes (5-layer) | Yes | Virtual keys | No | Yes | IAM | No |
| Budget controls | Per-key, per-team | Per-key | Per-key | Per-key | Plugin | Rate limits | Alerts |
| SOC 2 | Yes | Yes | N/A | No | Yes | Yes | Yes |
| HIPAA | Yes | Yes | N/A | No | Yes | No | No |
| EU hosting | Frankfurt | EU option | Self-host | No | Self-host | Edge | No |
| OpenAI-compatible API | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Self-host option | No | Gateway only | Full | No | Yes | No | No |
Decision tree
Start here: Do you have a DevOps team that wants to own the proxy layer?
- Yes, and spend is over $10K/month: LiteLLM self-hosted. Zero markup justifies the operational cost.
- Yes, but you need guardrails and dashboards too: Portkey. Open-source gateway with managed observability.
- No, I want managed: Continue below.
Do you need caching to reduce costs?
- Yes: Requesty. Semantic caching at 40-60% hit rates. Net cost reduction at scale.
- No, cost is secondary to access: Continue below.
Do you need governance (RBAC, budgets, compliance)?
- Yes: Requesty or Portkey. Both offer SOC 2, HIPAA, RBAC, and budget controls.
- No, just routing: Continue below.
How latency-sensitive is your application?
- Very (user-facing, agentic): Requesty at 8ms P50. Or Cloudflare AI Gateway if you are already on Cloudflare.
- Not very (batch, async): OpenRouter for the broadest model access with zero setup.
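The decision tree above can also be written as a short function, which makes the order of the questions explicit. The field names are hypothetical, and one branch is an assumption: when a team has DevOps capacity and also needs guardrails and dashboards, the sketch prefers Portkey regardless of spend.

```python
from dataclasses import dataclass


@dataclass
class Needs:
    has_devops_team: bool = False
    monthly_spend_usd: float = 0.0
    needs_guardrails_and_dashboards: bool = False
    needs_caching: bool = False
    needs_governance: bool = False
    latency_sensitive: bool = False
    on_cloudflare: bool = False


def recommend(needs: Needs) -> str:
    """Walk the questions in the same order as the decision tree."""
    if needs.has_devops_team:
        # Assumption: the guardrails branch wins when both DevOps branches apply.
        if needs.needs_guardrails_and_dashboards:
            return "Portkey"
        if needs.monthly_spend_usd > 10_000:
            return "LiteLLM"
    # Everything below is the managed path.
    if needs.needs_caching:
        return "Requesty"
    if needs.needs_governance:
        return "Requesty or Portkey"
    if needs.latency_sensitive:
        return "Requesty or Cloudflare AI Gateway" if needs.on_cloudflare else "Requesty"
    return "OpenRouter"
```

A team with DevOps capacity but low spend and no guardrail requirement falls through to the managed questions, which matches how the tree reads.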
The bottom line
The LLM routing market in 2026 is mature. You do not need to build your own proxy. The choice comes down to what you value most: lowest latency and cost savings through caching (Requesty), guardrails and observability depth (Portkey), zero markup and full control (LiteLLM), or instant model access with zero setup (OpenRouter). For production applications with multiple agent SDKs, failover requirements, and cost governance, Requesty provides the most complete package at the lowest overhead.
Frequently asked questions
- What is an LLM routing platform?
- An LLM routing platform sits between your application and AI model providers (OpenAI, Anthropic, Google, and others). It provides a single API endpoint that routes requests to the best model based on cost, latency, or quality. Production platforms also add failover, caching, observability, rate limiting, and governance. You change one base URL and gain access to hundreds of models without managing individual provider integrations.
- Which LLM routing platform has the lowest latency?
- Requesty adds 8ms P50 overhead in production (16ms P50 on agentic workloads with routing logic). Portkey reports 8ms P95 in benchmarks but adds 10 to 20ms in independent testing. OpenRouter adds 40 to 55ms of overhead. LiteLLM self-hosted adds 10 to 20ms depending on infrastructure. For latency-sensitive agentic workloads, Requesty's Rust-based router is the fastest tested option.
- Which LLM gateway is cheapest?
- LiteLLM is free and open-source if you self-host and maintain your own infrastructure. For managed services, Requesty charges a flat 5% markup with no hidden fees, and caching savings (40 to 60% on repeated calls) often offset the cost entirely. OpenRouter charges 5.5% on credit purchases plus a $0.80 minimum on small transactions. Portkey offers a free tier with 10K logs per month, then per-log pricing on paid plans.
- Should I self-host my LLM gateway or use a managed service?
- Self-host (LiteLLM) if you have a DevOps team, need zero vendor markup at high spend ($50K+ per month), and want full control over every byte. Use a managed service (Requesty, Portkey, OpenRouter) if you want sub-day setup, zero infrastructure maintenance, and built-in compliance. Most teams start managed and only self-host after reaching significant scale.
- Can I use multiple LLM routing platforms together?
- Yes. A common pattern is LiteLLM as the self-hosted proxy layer with OpenRouter as one of its upstream providers for model access, and Requesty or Portkey wrapping the stack for observability and governance. However, each additional layer adds latency. For most teams, a single platform that covers routing, caching, failover, and observability (like Requesty or Portkey) is simpler and faster.

