Requesty
June 2026 | Best Practices / Integrations | 12 min read

Best LLM Routing Platforms Compared (2026): Requesty, Portkey, LiteLLM, OpenRouter, and More

Thibault Jaigu
CEO & Co-Founder

Every production AI application faces the same problem: you need multiple models, multiple providers, and the ability to switch between them without rewriting code. An LLM routing platform solves this by giving you one API endpoint that routes to 100+ models, handles failover when providers go down, caches repeated calls, and tracks costs across your entire organization.

By June 2026, the market has consolidated around seven platforms. Each makes different tradeoffs between control, cost, latency, and operational complexity. This guide compares them with real numbers so you can pick the right one for your stack.

The seven platforms at a glance

| Platform | Type | Models | Overhead (P50) | Pricing | Self-Host | SOC 2 | Best For |
|---|---|---|---|---|---|---|---|
| Requesty | Managed | 400+ | 8ms (16ms agentic) | 5% markup | No | Yes | Production routing with caching and governance |
| Portkey | Managed + self-host | 1,600+ | 10-20ms | Per-log pricing | Gateway only | Yes | Guardrails and enterprise observability |
| LiteLLM | Open-source self-host | 100+ | 10-20ms | Free (self-host) | Yes | N/A | Zero markup at scale |
| OpenRouter | Managed marketplace | 300+ | 40-55ms | 5.5% on credits | No | No | Quick model access for solo devs |
| Kong AI Gateway | Enterprise API platform | Provider-dependent | Varies | Enterprise license | Yes | Yes | Teams already on Kong |
| Cloudflare AI Gateway | Edge platform | Major providers | Sub-10ms at edge | Pay-per-request | No | Yes | Cloudflare-native stacks |
| Helicone | Observability-first | Major providers | Proxy overhead | Free tier + paid | No | Yes | LLM monitoring and analytics |

Requesty: production routing with the lowest overhead

Requesty is a managed AI gateway built in Rust. It routes to 400+ models across OpenAI, Anthropic, Google, DeepSeek, Mistral, and dozens of other providers through a single OpenAI-compatible endpoint.

Latency

In published benchmarks, Requesty adds 8ms P50 overhead on standard requests and 16ms P50 on agentic workloads that include routing logic. TrueFoundry's independent comparison confirms the approximately 8ms overhead figure. For context, OpenRouter adds 40 to 55ms in the same tests.

Requesty credits its Rust core for the difference. Routing uses a PeakEWMA algorithm that adapts to real-time provider health rather than relying on static priority lists: each request goes to whichever provider is responding fastest at that moment, with latency measured over a rolling one-hour window.
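To make the PeakEWMA idea concrete, here is a minimal sketch of a peak-sensitive moving average used to pick the fastest provider. It is not Requesty's implementation: the class name, the `decay_s` constant, and the provider names are hypothetical, and the sketch uses time-based exponential decay instead of the rolling one-hour window described above.

```python
import math
from dataclasses import dataclass


@dataclass
class PeakEwma:
    """Latency estimate that jumps up on slow samples and decays back gradually."""
    decay_s: float = 10.0       # hypothetical decay constant, not a Requesty setting
    estimate_ms: float = 0.0
    last_update_s: float = 0.0

    def observe(self, latency_ms: float, now_s: float) -> None:
        elapsed = max(0.0, now_s - self.last_update_s)
        self.last_update_s = now_s
        if latency_ms > self.estimate_ms:
            # Peak behaviour: a slow response raises the estimate immediately.
            self.estimate_ms = latency_ms
        else:
            # Faster samples pull the estimate down; the longer since the last
            # update, the less weight the old estimate keeps.
            weight = math.exp(-elapsed / self.decay_s)
            self.estimate_ms = self.estimate_ms * weight + latency_ms * (1.0 - weight)


def pick_provider(trackers: dict[str, PeakEwma]) -> str:
    """Return the provider with the lowest current latency estimate."""
    return min(trackers, key=lambda name: trackers[name].estimate_ms)


# Hypothetical providers and latencies, for illustration only.
trackers = {"provider_a": PeakEwma(), "provider_b": PeakEwma()}
trackers["provider_a"].observe(120.0, now_s=0.0)
trackers["provider_b"].observe(200.0, now_s=0.0)
trackers["provider_a"].observe(900.0, now_s=1.0)  # one slow response from provider_a
best = pick_provider(trackers)
print(best)
```

The asymmetry is the point of the design: one slow response is enough to steer traffic away from a provider, while recovery requires a run of fast responses over time.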

Three routing modes

Smart Routing: Classifies the request by type (code generation, reasoning, summarization) and dispatches to the best model for that task. You toggle it on in the dashboard. No code changes needed. Code generation goes to Claude Opus 4.8. Simple classification goes to Gemini 3.5 Flash. Cost drops 50% or more while quality stays constant.

Fallback Policies: Ordered sequences of models. If the primary model times out or returns a 5xx, Requesty tries the next in the chain. Failover completes in under 50ms. Each step supports 0 to 10 retries with exponential backoff (500ms to 4s with jitter).

Latency Routing: Measures real-time model performance and routes to the fastest available. For streaming calls, the metric is time-to-first-token. For non-streaming, total response time. New or cold-start models get 5 to 10% of traffic for data collection, then join the ranking.
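The Fallback Policies mode above can be sketched in a few lines. This is an illustrative client-side model of an ordered chain with per-step retries and exponential backoff between 500ms and 4s, not Requesty's gateway code. `ProviderError`, `call_with_fallback`, the model names, and the exact jitter formula are all assumptions.

```python
import random
import time
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical stand-in for a timeout or 5xx from an upstream model."""


def backoff_delay(attempt: int, base_s: float = 0.5, cap_s: float = 4.0) -> float:
    """Exponential backoff with jitter, kept between base_s and cap_s."""
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return random.uniform(base_s, ceiling)


def call_with_fallback(
    chain: Sequence[str],
    call_model: Callable[[str], str],
    retries_per_step: int = 2,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[str, str]:
    """Try each model in order; retry a failing model before moving on."""
    if not 0 <= retries_per_step <= 10:
        raise ValueError("retries_per_step must be between 0 and 10")
    last_error = None
    for model in chain:
        for attempt in range(retries_per_step + 1):
            try:
                return model, call_model(model)
            except ProviderError as err:
                last_error = err
                if attempt < retries_per_step:
                    sleep(backoff_delay(attempt))
    raise RuntimeError("all models in the fallback chain failed") from last_error


def flaky(model: str) -> str:
    """Fake model call: the primary always fails, the backup answers."""
    if model == "primary-model":
        raise ProviderError("503 from upstream")
    return f"response from {model}"


# sleep is stubbed out so the demo finishes instantly.
used, text = call_with_fallback(["primary-model", "backup-model"], flaky, sleep=lambda s: None)
print(used, text)
```

Passing `sleep` as a parameter keeps the retry logic testable without real delays.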

Response caching

Requesty's semantic cache hits on identical and similar requests. In production, teams report 40 to 60% cache hit rates on repeated calls. The cache requires zero configuration. It runs automatically on every request. At scale, caching savings often exceed the 5% gateway markup, making Requesty net-negative on cost compared to calling providers directly.
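As a rough illustration of how a cache can hit on similar as well as identical requests, the toy sketch below matches prompts by similarity instead of exact string equality. It is an assumption-heavy stand-in: real semantic caches use embedding models, whereas this uses word counts and cosine similarity, and the `SemanticCache` class and its `threshold` value are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold  # hypothetical similarity cutoff
        self.entries = []           # list of (vector, cached response) pairs

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((embed(prompt), response))

    def get(self, prompt: str):
        """Return the closest cached response, or None if nothing is similar enough."""
        query = embed(prompt)
        best = max(self.entries, key=lambda entry: cosine(query, entry[0]), default=None)
        if best is not None and cosine(query, best[0]) >= self.threshold:
            return best[1]
        return None


cache = SemanticCache()
cache.put("What is the capital of France?", "Paris")
exact = cache.get("what is the capital of france")       # same words, different casing
near = cache.get("What is the capital city of France?")  # one extra word
miss = cache.get("Explain quantum computing")            # unrelated prompt
print(exact, near, miss)
```

The threshold is the main tuning knob in any design like this: set it too low and unrelated prompts receive stale answers, set it too high and the cache only helps on exact repeats.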

Governance

Requesty applies policies through a five-layer hierarchy that runs from the org level down to the individual API key. Policies cover budget caps, rate limits, model allowlists, PII redaction, and usage policies. RBAC controls who can create keys, view costs, and modify routing policies. Compliance coverage includes SOC 2, GDPR, and HIPAA, with EU hosting in Frankfurt.

Integration

Change your base URL to router.requesty.ai/v1 and use your Requesty API key. Existing OpenAI SDK code works unchanged. Tested compatible with LangChain, LangGraph, CrewAI, Claude Agent SDK, Google ADK, Vercel AI SDK, and every major agent framework.

Python
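from openai import OpenAI

# Point the standard OpenAI client at Requesty's OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-key",  # your Requesty API key
)

# The model ID carries a provider prefix, here "anthropic/".
response = client.chat.completions.create(
    model="anthropic/claude-opus-4-8",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)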

When to use Requesty

You need production-grade routing with the lowest latency. You want caching that pays for itself. You need multi-provider failover that completes in under 50ms. You want governance (budgets, RBAC, PII redaction) without managing infrastructure. You are building agentic applications and need per-agent cost tracking.

Portkey: guardrails and enterprise observability

Portkey is a managed AI gateway with the largest model catalog: 1,600+ models across all major providers. The standout feature set is guardrails and observability.

Guardrails

Portkey ships 50+ built-in guardrails that run on every request: PII detection, toxicity filtering, hallucination checks, regex validation, and custom webhook-based rules. Guardrails execute before the request reaches the model (input guardrails) and after the response returns (output guardrails). This is the most comprehensive built-in safety layer of any gateway.
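The input and output stages described above can be pictured as two lists of checks wrapped around the model call. The sketch below is a generic illustration of that pattern, not Portkey's API: `guarded_call`, `redact_emails`, `block_if_empty`, and `echo_model` are hypothetical names, and the email regex is deliberately simple.

```python
import re
from typing import Callable, Sequence

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(text: str) -> str:
    """Input guardrail: mask email addresses before the prompt reaches the model."""
    return EMAIL.sub("[REDACTED_EMAIL]", text)


def block_if_empty(text: str) -> str:
    """Output guardrail: reject blank completions."""
    if not text.strip():
        raise ValueError("output guardrail failed: empty response")
    return text


def guarded_call(
    prompt: str,
    call_model: Callable[[str], str],
    input_guardrails: Sequence[Callable[[str], str]],
    output_guardrails: Sequence[Callable[[str], str]],
) -> str:
    """Run input guardrails, call the model, then run output guardrails."""
    for guard in input_guardrails:
        prompt = guard(prompt)
    response = call_model(prompt)
    for guard in output_guardrails:
        response = guard(response)
    return response


def echo_model(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return f"You said: {prompt}"


result = guarded_call(
    "Contact me at jane.doe@example.com",
    echo_model,
    input_guardrails=[redact_emails],
    output_guardrails=[block_if_empty],
)
print(result)
```

Each guardrail either returns (possibly modified) text or raises, so new checks can be added to either list without touching the call path.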

Observability

Every request is logged with token counts, latency, cost, model, and custom metadata. Dashboards break down spend by team, project, model, and API key. Portkey's analytics include prompt performance tracking (which prompts produce the best outputs) and A/B testing dashboards.

Self-hosting

The Portkey Gateway is open-source (MIT). You can self-host the routing layer and run it as a proxy without using Portkey's managed service. The managed platform (observability, guardrails dashboard, team management) is proprietary.

Pricing

Free tier: 10,000 logs per month. Paid tiers use per-log pricing. At high volume, the per-log cost can add up, especially for agentic workloads where a single task generates hundreds of LLM calls.

When to use Portkey

You need the largest model catalog (1,600+). Guardrails (PII, toxicity, hallucination) are a hard requirement. You want the deepest observability dashboards. You need SOC 2 and HIPAA compliance with a managed service.

LiteLLM: the open-source self-hosted option

LiteLLM is the most popular open-source LLM proxy. Reported GitHub star counts range from 22,000+ to 47,800+ (the figure on the project's own page), depending on how and when they are counted. It provides an OpenAI-compatible API over 100+ providers and is completely free to self-host.

Architecture

A Python proxy server backed by PostgreSQL for spend tracking and Redis for caching. You deploy it on your infrastructure, own every byte of data, and pay zero markup on provider costs.

Key features

Virtual API keys: Create project-scoped or user-scoped keys with independent spend limits, model access lists, and rate limits. This is the multi-tenancy layer for self-hosted deployments.

Routing strategies: Load balancing (weighted distribution), fallback chains, latency-based routing, and cost-based routing. All configurable via YAML.

A2A and MCP support: LiteLLM added native A2A protocol support and MCP tool integration for agent-to-agent communication and tool routing alongside model routing.
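Two of the routing strategies listed above, weighted load balancing and cost-based routing, reduce to very little logic. The following plain-Python sketch shows the idea only. It does not use LiteLLM or its YAML schema, and the deployment names, weights, and prices are made up for illustration.

```python
import random


def weighted_pick(weights: dict[str, float], rng: random.Random) -> str:
    """Weighted load balancing: choose a deployment in proportion to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=1)[0]


def cheapest(prices_per_million_tokens: dict[str, float]) -> str:
    """Cost-based routing: choose the lowest-priced model."""
    return min(prices_per_million_tokens, key=prices_per_million_tokens.get)


rng = random.Random(42)  # seeded so the demo is repeatable
weights = {"deployment-east": 3.0, "deployment-west": 1.0}  # hypothetical 3:1 split
picks = [weighted_pick(weights, rng) for _ in range(10_000)]
east_share = picks.count("deployment-east") / len(picks)

prices = {"small-model": 0.20, "large-model": 3.00}  # hypothetical prices in USD
cheap = cheapest(prices)

print(f"east share: {east_share:.2f}, cheapest: {cheap}")
```

With a 3:1 weighting, roughly three quarters of requests land on the first deployment. A production router would also need health checks so that weight is not sent to a failing deployment.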

Tradeoffs

You need a DevOps team. PostgreSQL, Redis, the proxy server, TLS termination, monitoring, and upgrades are all your responsibility. No managed guardrails, no built-in PII detection, no hosted dashboard (though a basic admin UI exists). At $50K per month in API spend, the zero-markup savings ($2,750 per month versus OpenRouter's 5.5%) justify the operational cost. Below $5K per month, the engineering time to maintain LiteLLM likely exceeds the savings.
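The break-even reasoning above is simple arithmetic. The sketch below computes the markup avoided at a given spend and the spend at which it covers self-hosting costs. The 5.5% rate comes from the article. The $2,000 monthly operating cost is an assumption, taken from the upper end of the article's $500 to $2,000 LiteLLM estimate, and your own figure may differ substantially.

```python
OPENROUTER_FEE = 0.055       # 5.5% fee on credit purchases, per the article
ASSUMED_OPS_COST = 2_000.0   # assumed monthly infra + engineering cost in USD


def markup_avoided(monthly_spend: float, fee_rate: float = OPENROUTER_FEE) -> float:
    """Dollars per month not paid in markup when self-hosting."""
    return monthly_spend * fee_rate


def break_even_spend(monthly_ops_cost: float, fee_rate: float = OPENROUTER_FEE) -> float:
    """Monthly API spend at which the avoided markup equals the ops cost."""
    return monthly_ops_cost / fee_rate


for spend in (5_000, 10_000, 50_000):
    saved = markup_avoided(spend)
    verdict = "covers" if saved > ASSUMED_OPS_COST else "does not cover"
    print(f"${spend:,}/month: avoids ${saved:,.0f} in markup, {verdict} ops cost")

print(f"break-even spend: ${break_even_spend(ASSUMED_OPS_COST):,.0f}/month")
```

Under these assumptions the avoided markup only exceeds the operating cost in the tens of thousands of dollars per month, which is consistent with the $50K example above.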

When to use LiteLLM

You have a DevOps team and want zero vendor markup. You need to own all data on your infrastructure (air-gapped environments, strict data residency). Your API spend is $10K+ per month, making the markup savings significant. You want open-source with full code access.

OpenRouter: the model marketplace

OpenRouter is a hosted API that provides access to 300+ models through a single endpoint. It operates as a marketplace: multiple providers host the same model, and OpenRouter routes to the cheapest or fastest available instance.

Strengths

The broadest model access with the simplest integration. Sign up, get a key, and you can call GPT-5.5, Claude Opus 4.8, Gemini 3.5 Flash, DeepSeek V4, Llama 4 Maverick, and hundreds of open-source models immediately. OpenRouter is often the first platform to host newly released models.

For solo developers and startups that want instant access to every model without infrastructure, OpenRouter is the fastest path. A free tier covers 25+ models with 50 to 1,000 requests per day.

Tradeoffs

Latency: 40 to 55ms of overhead in independent benchmarks. For user-facing applications, this is noticeable. For batch processing, it is acceptable.

No semantic caching: OpenRouter does not cache responses. Every call goes to the provider, every time. At high volume, this means significantly higher costs compared to platforms with caching.

No governance: No RBAC, no budget hierarchies, no PII redaction, no compliance certifications. Per-key budget caps exist, but there is no team-level or project-level spend management.

5.5% markup plus minimums: The 5.5% fee on credit purchases plus a $0.80 minimum charge on small transactions adds up. At $50K per month, you pay $2,750 in markup. At $1K per month, the effective rate is higher due to minimums.

Credit expiration: Credits expire after 365 days. If you buy in bulk and usage drops, you lose the balance.

When to use OpenRouter

You are a solo developer or early-stage startup. You want instant access to 300+ models with no setup. You do not need caching, governance, or compliance. Latency is not a primary concern. You want to evaluate many models quickly.

Kong AI Gateway: for existing Kong users

Kong AI Gateway adds AI-specific plugins to the Kong API platform. If your organization already runs Kong for API management, adding AI routing is a plugin installation, not a new service.

Key features

AI-specific rate limiting, prompt caching, token-aware load balancing, and request transformation plugins. All built on Kong's existing plugin architecture, so they compose with your existing auth, logging, and rate limiting plugins.

Tradeoffs

Kong is an API gateway that added AI features, not an AI-native platform. The routing intelligence (prompt-aware model selection, latency-based routing) is less sophisticated than purpose-built platforms like Requesty or Portkey. Pricing follows Kong's enterprise license model, which is expensive for teams that only need AI routing.

When to use Kong AI Gateway

You already run Kong and want to add AI routing without deploying a separate service. Your API team manages Kong and wants AI traffic to go through the same governance layer.

Cloudflare AI Gateway: edge-first routing

Cloudflare AI Gateway routes AI traffic through Cloudflare's edge network. Requests are processed at the nearest Cloudflare PoP, adding sub-10ms of overhead at edge.

Key features

Edge caching (responses cached at Cloudflare's 300+ locations worldwide), rate limiting, request logging, and analytics. Integration with Cloudflare Workers for custom pre/post-processing logic.

Tradeoffs

Works best within the Cloudflare ecosystem. If you are not already using Cloudflare Workers, adding their AI Gateway means adopting a new platform. Provider support is limited to major providers (OpenAI, Anthropic, Google, Azure OpenAI). No smart routing by request type. No multi-provider failover chains.

When to use Cloudflare AI Gateway

You already run on Cloudflare Workers. You want edge caching for AI responses. Your use case is straightforward (one or two providers, no complex routing logic).

Helicone: observability with routing

Helicone started as an LLM observability platform and has added proxy and routing capabilities. The core strength remains analytics: cost tracking, latency monitoring, prompt performance analysis, and user-level usage dashboards.

Key features

One-line integration (add a header to your existing OpenAI calls). Detailed cost and latency analytics per request, per user, per prompt. Prompt experiments for A/B testing different prompts and models. Session tracking that groups related requests into logical sessions.

Tradeoffs

Routing features are less mature than dedicated gateways. No smart routing by request type, no latency-based model selection, limited failover configuration. Best used alongside a routing platform or as a lightweight proxy for teams where observability is the primary need.

When to use Helicone

Observability and cost analytics are your primary need. You want the simplest possible integration (one header). You do not need advanced routing logic. You want prompt experiment tracking.

Cost comparison at scale

What does each platform cost at $10,000 per month in provider API spend?

| Platform | Markup | Caching Savings | Net Cost | Infrastructure |
|---|---|---|---|---|
| Requesty | $500 (5%) | $4,000-$6,000 (40-60%) | Net savings of $3,500-$5,500 | None (managed) |
| Portkey | Per-log (varies) | Provider-side only | $200-$1,000+ depending on volume | None (managed) |
| LiteLLM | $0 | Self-configured | $500-$2,000 (infra + engineering) | PostgreSQL, Redis, proxy |
| OpenRouter | $550 (5.5%) | None | $550 | None (managed) |
| Kong | Enterprise license | Plugin-based | $2,000-$10,000+ (license) | Kong cluster |
| Cloudflare | Pay-per-request | Edge cache | $100-$500 | Cloudflare account |
| Helicone | Free tier + paid | None | $0-$500 | None (managed) |

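The math favors platforms with caching at scale. If Requesty's semantic cache serves 40 to 60% of requests, and cached requests cost about as much as the average request, you pay for 40 to 60% fewer provider tokens. At $10K monthly spend, that is $4,000 to $6,000 in caching savings, far exceeding the $500 (5%) markup. Workloads with fewer repeated calls will see lower hit rates and smaller savings.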
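The Requesty row in the table can be reproduced with a short calculation. This sketch assumes, as the table does, that the cache hit rate translates directly into the same fraction of provider spend saved and that the 5% markup is charged on the pre-cache spend. The function name and the 10% scenario are illustrative additions.

```python
def net_savings(provider_spend: float, markup_rate: float, cache_hit_rate: float) -> float:
    """Caching savings minus gateway markup, in dollars per month."""
    caching_savings = provider_spend * cache_hit_rate
    markup = provider_spend * markup_rate
    return caching_savings - markup


# 40% and 60% are the article's hit rates; 10% is an added low-repetition scenario.
for hit_rate in (0.10, 0.40, 0.60):
    saved = net_savings(10_000, markup_rate=0.05, cache_hit_rate=hit_rate)
    print(f"hit rate {hit_rate:.0%}: net savings ${saved:,.0f}/month")
```

Under this model the gateway pays for itself whenever the hit rate exceeds the markup rate, so the result is sensitive to how repetitive your traffic actually is.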

Feature comparison matrix

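| Feature | Requesty | Portkey | LiteLLM | OpenRouter | Kong | Cloudflare | Helicone |
|---|---|---|---|---|---|---|---|
| Smart routing (by task type) | Yes | Basic | No | Basic | No | No | No |
| Latency-based routing | Yes (PeakEWMA) | Yes | Yes | Yes | Plugin | No | No |
| Fallback chains | Yes (under 50ms) | Yes | Yes | Yes | Plugin | No | No |
| Semantic caching | Yes (40-60%) | No | Self-configured | No | Plugin | Edge cache | No |
| PII redaction | Yes | Yes (guardrails) | No | No | Plugin | No | No |
| RBAC | Yes (5-layer) | Yes | Virtual keys | No | Yes | IAM | No |
| Budget controls | Per-key, per-team | Per-key | Per-key | Per-key | Plugin | Rate limits | Alerts |
| SOC 2 | Yes | Yes | N/A | No | Yes | Yes | Yes |
| HIPAA | Yes | Yes | N/A | No | Yes | No | No |
| EU hosting | Frankfurt | EU option | Self-host | No | Self-host | Edge | No |
| OpenAI-compatible API | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Self-host option | No | Gateway only | Full | No | Yes | No | No |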

Decision tree

Start here: Do you have a DevOps team that wants to own the proxy layer?

  • Yes, and spend is over $10K/month: LiteLLM self-hosted. Zero markup justifies the operational cost.
  • Yes, but you need guardrails and dashboards too: Portkey. Open-source gateway with managed observability.
  • No, I want managed: Continue below.

Do you need caching to reduce costs?

  • Yes: Requesty. Semantic caching at 40-60% hit rates. Net cost reduction at scale.
  • No, cost is secondary to access: Continue below.

Do you need governance (RBAC, budgets, compliance)?

  • Yes: Requesty or Portkey. Both offer SOC 2, HIPAA, RBAC, and budget controls.
  • No, just routing: Continue below.

How latency-sensitive is your application?

The bottom line

The LLM routing market in 2026 is mature. You do not need to build your own proxy. The choice comes down to what you value most: lowest latency and cost savings through caching (Requesty), guardrails and observability depth (Portkey), zero markup and full control (LiteLLM), or instant model access with zero setup (OpenRouter). For production applications with multiple agent SDKs, failover requirements, and cost governance, Requesty provides the most complete package at the lowest overhead.

Frequently asked questions

What is an LLM routing platform?
An LLM routing platform sits between your application and AI model providers (OpenAI, Anthropic, Google, and others). It provides a single API endpoint that routes requests to the best model based on cost, latency, or quality. Production platforms also add failover, caching, observability, rate limiting, and governance. You change one base URL and gain access to hundreds of models without managing individual provider integrations.

Which LLM routing platform has the lowest latency?
Requesty adds 8ms P50 overhead in published benchmarks (16ms P50 on agentic workloads with routing logic). Portkey reports 8ms P95 in its own benchmarks but adds 10 to 20ms in independent testing. OpenRouter adds 40 to 55ms of overhead. LiteLLM self-hosted adds 10 to 20ms depending on infrastructure. For latency-sensitive agentic workloads, Requesty's Rust-based router is the fastest tested option.

Which LLM gateway is cheapest?
LiteLLM is free and open-source if you self-host and maintain your own infrastructure. For managed services, Requesty charges a flat 5% markup with no hidden fees, and caching savings (40 to 60% on repeated calls) often offset the cost entirely. OpenRouter charges 5.5% on credit purchases plus a $0.80 minimum on small transactions. Portkey offers a free tier with 10K logs per month, then per-log pricing on paid plans.

Should I self-host my LLM gateway or use a managed service?
Self-host (LiteLLM) if you have a DevOps team, need zero vendor markup at high spend ($50K+ per month), and want full control over every byte. Use a managed service (Requesty, Portkey, OpenRouter) if you want sub-day setup, zero infrastructure maintenance, and built-in compliance. Most teams start managed and only self-host after reaching significant scale.

Can I use multiple LLM routing platforms together?
Yes. A common pattern is LiteLLM as the self-hosted proxy layer with OpenRouter as one of its upstream providers for model access, and Requesty or Portkey wrapping the stack for observability and governance. However, each additional layer adds latency. For most teams, a single platform that covers routing, caching, failover, and observability (like Requesty or Portkey) is simpler and faster.