Requesty
MAY '26 | ROUTING / BEST PRACTICES | 5 MIN READ

How to route LLM requests by cost and latency

Thibault Jaigu
CEO & Co-Founder

Most teams start with one model. You pick Claude or GPT, wire it up, and ship. That works until your bill hits four or five figures a month and your P95 latency starts hurting user experience.

At that point, you realize that not every request needs the most expensive model. A classification task does not need Claude Opus. A simple summarization does not need GPT-4.1. And a latency-sensitive autocomplete definitely does not need a reasoning model that thinks for 10 seconds.

This is where routing comes in. Not load balancing in the traditional sense, but intelligent request routing that considers what each request actually needs and matches it to the right model.

The two dimensions that matter

Cost and latency are the two axes you care about. They pull in different directions, and the trick is knowing when to optimize for which.

Cost routing means picking the cheapest model that can handle the task. If you are doing bulk classification, you probably do not need a $15-per-million-token model when a $0.25-per-million-token model reaches 95% accuracy on the same task.

Latency routing means picking the fastest model available right now. Provider latency fluctuates throughout the day. A model that responds in 200ms at 3am might take 2 seconds at peak hours. Latency routing checks real-time performance data and sends your request to whichever provider is fastest at that moment.

When to use cost routing

Cost routing makes the most sense when:

  1. You have predictable, repeatable tasks. Classification, extraction, tagging, simple Q&A. These tasks have clear quality thresholds and you can verify that a cheaper model meets them.

  2. You are processing in batch. If you are running 100,000 documents through a pipeline overnight, latency barely matters. Cost is everything.

  3. Your margin depends on it. If you are building a product where AI cost is a significant line item (and it usually is), routing 70% of requests to cheaper models changes the economics of your business.
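To make the economics concrete, here is a rough back-of-the-envelope sketch. It reuses the illustrative $15 and $0.25 per-million-token prices from earlier and assumes every request uses about the same number of tokens, so a 70% share of requests is also a 70% share of tokens. The function name is made up for this example.

```python
def blended_price(cheap_price: float, frontier_price: float, cheap_share: float) -> float:
    """Average price per million tokens when `cheap_share` of tokens go to the cheap model."""
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return cheap_share * cheap_price + (1.0 - cheap_share) * frontier_price


# Illustrative prices in dollars per million tokens (assumptions, not a price list).
FRONTIER = 15.00
CHEAP = 0.25

baseline = blended_price(CHEAP, FRONTIER, cheap_share=0.0)  # everything on the frontier model
routed = blended_price(CHEAP, FRONTIER, cheap_share=0.7)    # 70% moved to the cheap model
savings = 1.0 - routed / baseline

print(f"baseline: ${baseline:.2f} per million tokens")
print(f"routed:   ${routed:.3f} per million tokens")
print(f"savings:  {savings:.1%}")
```

Under these assumptions the blended price drops from $15 to roughly $4.68 per million tokens. Your real numbers will depend on actual prices and on how token counts differ between simple and complex requests.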

Here is a concrete example. Say you have a customer support bot. Most incoming messages are straightforward: password resets, order status, return policies. Maybe 20% are complex escalations that need strong reasoning. Cost routing sends the 80% to a fast, cheap model and reserves the frontier model for the hard stuff.

Python
from openai import OpenAI

# Point the OpenAI SDK at the Requesty router
client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-key",  # placeholder: load this from an environment variable in real code
)

# Simple task: use a cost-optimized policy
response = client.chat.completions.create(
    model="policy/support-triage",  # routes to cheapest capable model
    messages=[{"role": "user", "content": "What is your return policy?"}],
)
print(response.choices[0].message.content)

You configure the policy once in the Requesty dashboard. The routing happens automatically after that.

When to use latency routing

Latency routing is the right choice when:

  1. Users are waiting. Chat interfaces, autocomplete, real-time agents. Every millisecond of TTFT (time to first token) matters for perceived responsiveness.

  2. Provider performance varies. And it does, constantly. OpenAI might be fast right now and slow in an hour. Having a routing layer that tracks this in real time and shifts traffic accordingly is worth a lot.

  3. You have SLAs to hit. If you promised your customers sub-1-second responses, you need a routing layer that helps you keep that promise by picking the fastest available option at request time.

Requesty tracks TTFT, P50, P90, and P95 latency for every model continuously. When you enable latency routing, each request goes to whichever provider is performing best right now. Not based on historical averages. Based on what happened in the last few minutes.
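If you want a mental model for "the last few minutes, not historical averages", the sketch below keeps a sliding window of TTFT samples per provider and picks the one with the lowest recent percentile. It is an illustration only, not how Requesty implements it: the class, the provider labels, and the 5-minute window are all assumptions.

```python
import math
import time
from collections import defaultdict, deque


class LatencyTracker:
    """Keeps recent TTFT samples per provider and reports a nearest-rank percentile."""

    def __init__(self, window_seconds: float = 300.0):
        self.window_seconds = window_seconds
        self._samples = defaultdict(deque)  # provider -> deque of (timestamp, ttft_seconds)

    def record(self, provider, ttft_seconds, now=None):
        now = time.monotonic() if now is None else now
        self._samples[provider].append((now, ttft_seconds))

    def _recent(self, provider, now):
        # Drop samples that have aged out of the window, then return what is left.
        samples = self._samples[provider]
        while samples and now - samples[0][0] > self.window_seconds:
            samples.popleft()
        return [ttft for _, ttft in samples]

    def percentile(self, provider, pct, now=None):
        now = time.monotonic() if now is None else now
        values = sorted(self._recent(provider, now))
        if not values:
            return None
        rank = max(1, math.ceil(pct / 100 * len(values)))  # nearest-rank method
        return values[rank - 1]

    def fastest(self, pct=95, now=None):
        now = time.monotonic() if now is None else now
        scored = {}
        for provider in list(self._samples):
            value = self.percentile(provider, pct, now)
            if value is not None:
                scored[provider] = value
        return min(scored, key=scored.get) if scored else None


tracker = LatencyTracker(window_seconds=300)
tracker.record("provider-a", 0.20, now=0)      # fast, but long ago
tracker.record("provider-a", 2.00, now=1000)   # slow, and recent
tracker.record("provider-b", 0.60, now=1000)
tracker.record("provider-b", 0.70, now=1010)
print(tracker.fastest(pct=95, now=1020))
```

Here provider-a's old 200ms sample has aged out of the window, so only its recent 2-second sample counts and provider-b is chosen.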

The real answer: combine them

In practice, nobody uses pure cost routing or pure latency routing. You use both.

The most common pattern is what we call constrained routing: set a latency ceiling, then optimize for cost within that constraint. For example:

  • "Give me the cheapest model that can respond under 1 second TTFT."
  • "For this task, I need reasoning capability, but pick the fastest option under $5/million tokens."
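The two constraints above boil down to "filter on one axis, then minimize the other". Here is a minimal sketch of that logic. The model labels, prices, and TTFT figures are invented for illustration, and capability filtering (such as "needs reasoning") is left out to keep it short.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Candidate:
    name: str              # hypothetical model label
    price_per_mtok: float  # dollars per million tokens
    recent_ttft_s: float   # recently observed time to first token, in seconds


def cheapest_within_latency(candidates: List[Candidate], max_ttft_s: float) -> Optional[Candidate]:
    """Cheapest candidate whose recent TTFT is at or under the ceiling."""
    eligible = [c for c in candidates if c.recent_ttft_s <= max_ttft_s]
    return min(eligible, key=lambda c: c.price_per_mtok, default=None)


def fastest_within_budget(candidates: List[Candidate], max_price_per_mtok: float) -> Optional[Candidate]:
    """Fastest candidate whose price is at or under the budget."""
    eligible = [c for c in candidates if c.price_per_mtok <= max_price_per_mtok]
    return min(eligible, key=lambda c: c.recent_ttft_s, default=None)


# Made-up numbers, purely for illustration.
candidates = [
    Candidate("small-model", 0.25, 1.4),
    Candidate("mid-model", 1.00, 0.8),
    Candidate("reasoning-lite", 4.00, 0.6),
    Candidate("frontier-model", 15.00, 0.5),
]

print(cheapest_within_latency(candidates, max_ttft_s=1.0).name)
print(fastest_within_budget(candidates, max_price_per_mtok=5.0).name)
```

With these numbers the first query picks mid-model (small-model is cheaper but too slow) and the second picks reasoning-lite (frontier-model is faster but over budget). Both functions return None when nothing satisfies the constraint, which is the case a fallback needs to handle.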

You can set this up with fallback policies in Requesty. A policy is a priority-ordered list of models. The router tries the first model, and if it is too slow, unavailable, or rate limited, it falls back to the next one. You control the order, so you put your preferred cost/latency tradeoff first.

Python
# Using a policy that combines cost and latency constraints
# (reuses the client configured in the earlier example)
response = client.chat.completions.create(
    model="policy/fast-and-cheap",
    messages=[{"role": "user", "content": "Classify this ticket: ..."}]
)
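To picture what a priority-ordered fallback does, here is a small client-side sketch. It is not Requesty's implementation (the router does this for you on the server side). The exception class, the model labels, and the fake call function are all stand-ins so the example runs offline.

```python
class ModelUnavailable(Exception):
    """Raised by a (hypothetical) model call that is too slow, down, or rate limited."""


def call_with_fallback(models, call_model, prompt):
    """Try each model in priority order; return (model, result) from the first success."""
    errors = []
    for model in models:
        try:
            return model, call_model(model, prompt)
        except ModelUnavailable as exc:
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models in the policy failed: " + "; ".join(errors))


def fake_call(model, prompt):
    # Stand-in for a real API call: the first model is always rate limited.
    if model == "primary-model":
        raise ModelUnavailable("rate limited")
    return f"[{model}] handled: {prompt}"


policy = ["primary-model", "secondary-model", "tertiary-model"]
used, answer = call_with_fallback(policy, fake_call, "Classify this ticket")
print(used)
print(answer)
```

The order of the list is the whole policy: the first entry is the tradeoff you prefer, and each later entry is what you are willing to accept when the one before it fails.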

What this looks like in production

A typical production setup might have three policies:

  1. Fast and cheap for high-volume, low-complexity tasks. Tries GPT-4.1 mini first, falls back to Claude Haiku, then Gemini Flash. All fast, all cheap.

  2. Best available for complex reasoning. Tries Claude Sonnet first, falls back to GPT-4.1, then Gemini Pro. Prioritizes quality but still has failover.

  3. Latency optimized for real-time interactions. Uses latency routing to pick whichever frontier model is fastest right now, regardless of cost.

Your application code does not change between these. You just reference different policy names. The routing, failover, retries, and cost tracking all happen on the Requesty side.
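One way to keep that separation clean in your own code is a single lookup from workload type to policy name. This is a sketch under assumptions: only "policy/fast-and-cheap" appears in the examples above, and the other two policy names and the workload labels are hypothetical.

```python
# Hypothetical names: use whatever policy names you created in the dashboard.
POLICY_BY_WORKLOAD = {
    "bulk": "policy/fast-and-cheap",
    "reasoning": "policy/best-available",
    "realtime": "policy/latency-optimized",
}


def build_request(workload: str, user_message: str) -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    try:
        policy = POLICY_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}") from None
    return {
        "model": policy,
        "messages": [{"role": "user", "content": user_message}],
    }


request = build_request("realtime", "Suggest a completion for: def parse_")
print(request["model"])
# With a configured client: client.chat.completions.create(**request)
```

Changing which models serve a workload then means editing the policy in the dashboard, not touching this mapping or the calling code.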

Measuring what matters

Once you have routing in place, you need to measure it. The metrics that matter are:

  • Cost per request (not just per token, but per completed request including retries)
  • TTFT (time to first token, the metric users actually feel)
  • Success rate (what percentage of requests complete without error)
  • Model distribution (which models are actually handling your traffic)
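If you also log requests on your side, these four metrics are straightforward to compute. The sketch below assumes a made-up log record shape and defines cost per completed request as total spend (including retries and failed requests) divided by the number of requests that succeeded. That definition is an assumption, so adjust it to match how you account for failures.

```python
import statistics
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class RequestLog:
    model: str                      # model that served the final attempt
    attempt_costs_usd: List[float]  # one entry per attempt, retries included
    ttft_s: float                   # time to first token of the final attempt
    succeeded: bool


def summarize(logs: List[RequestLog]) -> dict:
    completed = [log for log in logs if log.succeeded]
    total_spend = sum(sum(log.attempt_costs_usd) for log in logs)
    return {
        "cost_per_completed_request": total_spend / len(completed) if completed else None,
        "median_ttft_s": statistics.median(log.ttft_s for log in completed) if completed else None,
        "success_rate": len(completed) / len(logs) if logs else None,
        "model_distribution": dict(Counter(log.model for log in completed)),
    }


# Invented sample data.
logs = [
    RequestLog("small-model", [0.002], 0.4, True),
    RequestLog("small-model", [0.002, 0.002], 0.6, True),  # one retry
    RequestLog("frontier-model", [0.030], 0.9, True),
    RequestLog("frontier-model", [0.030], 0.0, False),     # failed request still cost money
]

summary = summarize(logs)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Note how the retry and the failed request both push cost per completed request above what a per-token price alone would suggest.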

Requesty tracks all of this automatically in the analytics dashboard. You can see cost and latency broken down by model, by team, by API key. If a model starts underperforming, you see it immediately and can adjust your policies.

Getting started

If you are running all your LLM traffic through a single model today, here is how to start:

  1. Sign up at app.requesty.ai and get $10 in free credits.
  2. Change your base URL to https://router.requesty.ai/v1. Your existing code works as is.
  3. Create a routing policy in the dashboard. Start simple: pick two models, set one as primary and one as fallback.
  4. Watch the analytics. After a day of traffic, you will see exactly where you can save money or improve latency.

The point is not to over-engineer this on day one. Start with a simple policy, watch the data, and iterate. Most teams find their optimal setup within a week.

Routing is not a feature you build once. It is a practice you refine as your traffic patterns, model landscape, and cost constraints evolve. The important thing is to have the infrastructure that makes it easy to iterate. That is what Requesty is built for.

Frequently asked questions

What is LLM routing by cost and latency?

LLM routing by cost and latency means automatically selecting which AI model handles each request based on how much it costs per token and how fast it responds. Instead of hardcoding a single model, a routing layer evaluates available models in real time and picks the best fit for each request.

Can I combine cost routing and latency routing?

Yes. Most production setups use a blended approach. For example, you can set a latency ceiling (say 2 seconds TTFT) and then pick the cheapest model that meets it. Requesty supports this through routing policies that combine cost and latency constraints.

How much can cost routing save?

It depends on your workload. Teams that move simple tasks (classification, extraction, summarization) from frontier models to smaller ones typically save 60 to 80 percent on those requests. Combined with prompt caching, total savings of 50 percent or more are common.