A gateway sees what application code never does: the same prompt fanning out to a dozen different upstreams, the same model behind three different endpoints, the same workload pattern recurring across thousands of teams. Once a quarter we sit down with the production telemetry and ask: "What is the data actually showing about how teams ship in 2026?" This post is the April 2026 read.
We are sticking to per-provider operational metrics (latency, success rate, finish_reason mix, streaming, prompt caching, reasoning tokens) and avoiding any cuts that imply absolute volume or share-of-platform. Every chart below answers a question of the form "what does this provider look like to a caller?", not "how much of the platform does it represent".
1. Anthropic-direct is the agentic provider on Requesty
The single sharpest finding in the April data is that Anthropic-direct does not look like any other provider. Roughly 52% of all successful Anthropic-direct completions in April 2026 finished with finish_reason = tool_calls, meaning the model called a tool rather than returning text. That is about 2× the next-highest provider (OpenAI's new responses endpoint at 26%, Azure at 23%) and 17× higher than OpenAI direct (3%).
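If you want to reproduce this cut from your own gateway logs, it is just a normalized finish_reason count per provider. A minimal sketch, assuming a parquet export with illustrative column names (`provider`, `finish_reason`), not Requesty's actual schema:

```python
import pandas as pd

# Hypothetical export of one month of request logs; column and file names are illustrative.
df = pd.read_parquet("april_2026_requests.parquet")

# Keep NULL finish_reason visible as its own bucket; it correlates with failures.
df["finish_bucket"] = df["finish_reason"].fillna("blank/error")

# Normalize each provider's mix to 100% of its own traffic.
mix = (
    df.groupby("provider")["finish_bucket"]
      .value_counts(normalize=True)
      .unstack(fill_value=0.0)
      .mul(100)
      .round(1)
)
print(mix)
```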
Chart: finish_reason mix per provider, April 2026. Each bar is normalized to 100%.
A few things worth pulling out of that chart:
- Anthropic-direct ≠ Bedrock Claude. Same model family, completely different workload. Bedrock Claude finishes 88% `stop` and 7% `tool_calls`. The two routes are serving two different cohorts: agentic / IDE-integrated traffic on Anthropic-direct, and more conversational / batched traffic on Bedrock.
- OpenAI's `responses` endpoint is meaningfully more agentic than chat-completions: 26% tool_calls vs 3% on the legacy route. If you've migrated to `responses` and not seen this in your own logs yet, this is the canary.
- The "blank/error" segment is real signal. A NULL `finish_reason` correlates very tightly with `successful = false`. Moonshot at 94% blank and Google direct at 58% blank are both reliability outliers. It is not a labeling artefact; those routes are genuinely failing on a high fraction of calls. We discuss this further in the success-rate panel below.
- Vertex Claude lands between Anthropic-direct and Bedrock at 14% tool_calls. Customers seem to use Vertex Claude as a "production-hardened Claude with EU residency" route rather than as their agent harness.
If you are picking a route to point an autonomous agent harness at, this chart is the answer: Anthropic-direct, with Vertex Claude (for residency reasons) and openai-responses as the second-tier choices.
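In routing-policy terms, that translates to an ordered preference list with the agentic route first and the residency / responses routes behind it. A minimal sketch of that ordering; the provider tags and the `send_to_provider` helper are placeholders, not a specific gateway API:

```python
# Preference order for an agent harness, based on the April 2026 tool_calls data above.
AGENT_ROUTE_PREFERENCE = [
    "anthropic",         # highest tool_calls share on the platform
    "vertex-claude",     # same model family, EU residency, competitive TTFT
    "openai-responses",  # the most agentic OpenAI route (26% tool_calls)
]

def call_with_fallback(request: dict, providers=AGENT_ROUTE_PREFERENCE):
    """Try each route in order; re-raise the last error if all of them fail."""
    last_error = None
    for provider in providers:
        try:
            return send_to_provider(provider, request)  # hypothetical transport helper
        except Exception as exc:  # in practice, catch provider-specific error types
            last_error = exc
    raise last_error
```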
2. The latency leaderboard, April 2026
p50 total latency among the top providers spans almost 15× from fastest to slowest.
Chart: latency leaderboard, April 2026 (top 10 providers by volume), showing p50, p95 and p50 time-to-first-token per provider.
Two clusters stand out:
- The "fast tier": xAI (0.6 s p50), Novita (0.8 s), Azure (1.0 s), Mistral (1.4 s). Mostly lighter / non-reasoning workloads, mostly small or distilled models, mostly answers in under a second.
- The "frontier-Claude tier": Vertex (5.1 s), Anthropic (5.8 s), Bedrock (2.8 s), DeepSeek (9.1 s). Heavier models, longer outputs, much more variance: Anthropic's p95 is 53.9 s and DeepSeek's is 77.6 s. If you are running these as part of an interactive UX, you cannot afford to not stream.
The third panel, p50 time-to-first-token, streamed only, tells a different story:
- Azure is the streaming-UX winner at 0.6 s TTFT on top of a 1.0 s total p50: first-token-fast and total-fast.
- xAI is fast on completions but slow to first token (3.25 s TTFT, 0.6 s total). That is consistent with non-streaming behaviour or a buffered upstream: the model produces the whole answer quickly but doesn't start emitting tokens until late.
- Vertex's TTFT (1.37 s) is genuinely competitive with the fast tier, even though its total p50 is 5.1 s. If you are picking a Claude route for an interactive product, Vertex starts faster than Anthropic-direct (2.13 s TTFT). Useful if the UX is "first token visible" rather than "answer complete".
Note: TTFT (first_token_latency_ns) only started being recorded on the gateway in 2026. We do not have it for 2025 traffic.
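If you want to sanity-check TTFT against your own client, it is straightforward to measure on a streamed call. A minimal sketch, assuming an OpenAI-compatible endpoint behind the gateway; the base URL, key and model name are placeholders:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="https://<your-gateway>/v1", api_key="sk-...")

start = time.perf_counter()
first_token_at = None

stream = client.chat.completions.create(
    model="<model>",
    messages=[{"role": "user", "content": "One sentence on prompt caching."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content and first_token_at is None:
        first_token_at = time.perf_counter()  # client-side time-to-first-token

total = time.perf_counter() - start
print(f"TTFT {first_token_at - start:.2f}s, total {total:.2f}s")
```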
3. Open-source aggregator routes got dramatically faster YoY
Comparing Apr 2025 and Apr 2026 for the providers we have a year of data for, the cleanest pattern in the dataset is that the cheap-inference / OSS-aggregator tier is no longer slow:
Chart: p50 latency YoY, April 2025 to April 2026. Same provider tag, ≥50k requests in both months.
| provider | Apr 2025 p50 | Apr 2026 p50 | YoY |
|---|---|---|---|
| xAI | 9.1 s | 0.6 s | -93% |
| DeepInfra | 15.8 s | 1.4 s | -91% |
| Alibaba | 5.8 s | 0.5 s | -91% |
| Novita | 8.8 s | 0.8 s | -91% |
| Nebius | 22.1 s | 2.3 s | -89% |
| DeepSeek | 24.3 s | 9.1 s | -63% |
| Google direct | 5.2 s | 3.3 s | -37% |
| Vertex | 5.9 s | 5.1 s | -14% |
| OpenAI | 2.6 s | 2.4 s | -9% |
| Anthropic | 6.0 s | 5.8 s | -3% |
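The YoY column is just the relative change between the two monthly medians, e.g. for xAI:

```python
apr_2025_p50, apr_2026_p50 = 9.1, 0.6           # seconds, from the table above
yoy = (apr_2026_p50 - apr_2025_p50) / apr_2025_p50
print(f"{yoy:.0%}")                              # -93%
```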
The OSS aggregator routes (xAI, DeepInfra, Alibaba, Novita, Nebius) used to be the slowest tier and are now the fastest. Most of them compressed by 89-93% in a single year. The frontier-provider tier (OpenAI, Anthropic, Vertex) was already fast and barely moved, which is consistent with frontier latency being throughput-bound (model size, decode steps) rather than infrastructure-bound. The middle of the pack is where infrastructure investment shows up.
Practical implication: the latency case for routing easy work to a cheap OSS path is much stronger in 2026 than it was in 2025. A year ago you'd pay 5-25 seconds for that hop. Today you pay sub-second.
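What that looks like in practice is a coarse difficulty split in front of the router. A minimal sketch; the tier lists and the heuristic are illustrative, not a recommended policy:

```python
FAST_TIER = ["xai", "novita", "alibaba", "deepinfra"]   # sub-second p50 in April 2026
FRONTIER_TIER = ["anthropic", "vertex", "openai"]       # heavier models; stream these

def pick_tier(request: dict) -> list[str]:
    """Send tool-using or long-context work to the frontier tier, everything else to the fast tier."""
    needs_tools = bool(request.get("tools"))
    prompt_chars = sum(len(str(m.get("content", ""))) for m in request.get("messages", []))
    return FRONTIER_TIER if needs_tools or prompt_chars > 20_000 else FAST_TIER
```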
4. Operational metrics: success rate, streaming, caching, BYOK
Same providers, four different per-provider metrics. Each bar is "% of that provider's own traffic". None of these are share-of-platform.
Chart: operational metrics per provider, April 2026.
Things to call out:
- Success rate is bimodal. The big six (OpenAI, Anthropic, Vertex, Bedrock, DeepSeek, Novita) sit in the 95-99% range. Azure is the underperformer among the well-known providers at 78%, and Mistral is at 86%. Moonshot is at 6.2%, a genuine, customer-visible reliability problem on that route. If you are routing to Moonshot in production, route around it.
- Streaming adoption is sharply bimodal too. Azure (68%) and Anthropic (57%) are streaming-heavy; everyone else is below 10%. Streaming on Anthropic tracks with the agentic IDE workloads in section 1; the Azure number tracks with chat-style enterprise apps that haven't migrated to non-streaming `responses`-style endpoints yet.
- BYOK is asymmetric. 18% of OpenAI-direct traffic is bring-your-own-key, but only 3% of Anthropic-direct, and 0% of Vertex / Bedrock / Azure. Customers BYOK on the most commodity-priced API and pay through the gateway on the strategic ones, which is exactly the pattern you'd predict.
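All three panels here are the same computation: a boolean mean over each provider's own traffic. A minimal sketch against the same hypothetical log export as in section 1 (the `successful`, `streamed` and `byok` column names are illustrative):

```python
import pandas as pd

df = pd.read_parquet("april_2026_requests.parquet")

ops = df.groupby("provider").agg(
    success_rate=("successful", "mean"),
    streaming_share=("streamed", "mean"),
    byok_share=("byok", "mean"),
).mul(100).round(1)

print(ops.sort_values("success_rate", ascending=False))
```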
5. Prompt caching is the single biggest cost lever
Cache hit rate (cached_tokens / input_tokens) is the operational lever that separates "we ship to production" from "we have a credit problem". The April distribution by provider:
Chart: cache hit rate per provider, April 2026 (cached_tokens / input_tokens). Higher is cheaper and faster.
- Anthropic-direct at 77% cache hit is the best on the platform, by a wide margin. Combined with the 52% tool_calls share, the picture is clear: agentic workloads on Claude are highly repetitive (long shared system prompt, similar context per turn) and the prompt cache is doing exactly what it was designed to do.
- Bedrock Claude at 57%, DeepSeek at 48% and OpenAI at 36% are healthy: all three are in the range where prompt caching is meaningfully reducing token spend.
- Mistral at 4% is roughly the floor. Prompt caching is not currently a meaningful lever on that route.
- We are showing the Moonshot bar in section 4 only, not here, because its cache-hit reading (~88%) is a measurement artefact: the upstream records `cached_tokens` on partial streams that the gateway then marks as failed (success rate 6%), which inflates the ratio. Don't quote that number; the sketch below recomputes the ratio over successful requests only.
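A minimal sketch of the cache-hit computation that avoids that artefact, again with illustrative column names: token-weighted, and restricted to successful requests so partial streams don't inflate the ratio:

```python
import pandas as pd

df = pd.read_parquet("april_2026_requests.parquet")
ok = df[df["successful"]]  # drop failed/partial calls before computing the ratio

cache_hit = (
    ok.groupby("provider")
      .apply(lambda g: 100 * g["cached_tokens"].sum() / max(g["input_tokens"].sum(), 1))
      .round(1)
)
print(cache_hit.sort_values(ascending=False))
```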
6. Reasoning is real, but concentrated in a few providers
In April 2026, the share of each provider's own output that is reasoning varies enormously:
Chart: reasoning-token share of provider output, April 2026 (reasoning_tokens / output_tokens within each provider).
- Groq, Coding, Google direct, xAI and zai are 50-82% reasoning. These routes primarily serve a small set of reasoning-heavy models (Gemini 3 thinking variants, Grok thinking, GLM thinking, etc.), so almost everything they emit is in the reasoning / chain-of-thought stream.
- Vertex and OpenAI are ~36% reasoning. A meaningful and growing share, mostly Gemini 2.5 Flash / 3.x previews on Vertex and the GPT-5 family on OpenAI.
- Azure is at 18%. The lower end of the frontier group, consistent with Azure customers leaning on GPT-4.1-class models more than the latest reasoning checkpoints.
- Anthropic, Bedrock, Mistral and Moonshot are at 0%. Anthropic does not report reasoning tokens separately; extended thinking output is delivered inline. Mistral and Moonshot have no reasoning models routed through the gateway in this period.
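For a single call, the same ratio can be read straight off the usage block. A minimal sketch with an OpenAI-style payload and made-up numbers; field names vary by provider, and some routes (Anthropic and Bedrock here) don't report reasoning tokens at all:

```python
usage = {
    "completion_tokens": 1200,
    "completion_tokens_details": {"reasoning_tokens": 800},
}
reasoning_share = (
    usage["completion_tokens_details"]["reasoning_tokens"] / usage["completion_tokens"]
)
print(f"{reasoning_share:.0%}")  # 67% of this response's output was reasoning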
The headline narrative in the broader industry is "everything is reasoning now". That is not what the data says. Reasoning is concentrated in a specific subset of providers and models, and even on the providers that emit it, the absolute volume is dwarfed by regular completion output. The interesting workload dimension is not "is this a reasoning model"; it is "is this an agent" (section 1).