Inference is not just the biggest line item on your AI bill; it dominates it. In early 2026, Anthropic's engineering teams found that inference consumes over 85% of total enterprise AI budgets. The culprit is not cost per token, which has dropped steadily. It is the sheer volume of tokens that agentic workflows generate.
A single chatbot interaction might use 2,000 to 4,000 tokens. A single agent task with tool calls, planning steps, and verification loops? 50,000 to 500,000 tokens. Multiply that by hundreds of tasks per day, and you have a cost problem that scales linearly with adoption.
This guide covers the four techniques that production teams use to cut agent costs by 60 to 80 percent without sacrificing quality. If you are building the loops that drive this token volume in the first place, start with our companion guide on loop engineering.
Technique 1: Model routing (60 to 80% cost reduction)
Not every step in an agent workflow needs a frontier model. The insight is simple: route each step to the cheapest model that can handle it.
The 70/20/10 distribution
Production agent deployments in 2026 typically follow this pattern:
| Volume | Task type | Model tier | Cost per 1M tokens |
|---|---|---|---|
| 70% | Classification, extraction, filtering | Nano/Flash | $0.10 to $0.30 |
| 20% | Drafting, summarization, code generation | Mid-tier | $1 to $3 |
| 10% | Final review, architecture, complex reasoning | Frontier | $10 to $15 |
The math: if your unoptimized agent sends everything to Claude Opus 4.8 at $15/M input tokens, routing 70% to Flash ($0.30/M) and 20% to Sonnet ($3/M) while keeping 10% on Opus drops your average cost from $15 to roughly $2.31 per million tokens (0.70 × $0.30 + 0.20 × $3 + 0.10 × $15). That is an 85% reduction. For the full breakdown of how this selection works per request, see smart routing demystified and how to route LLM requests by cost and latency.
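As a quick check on that arithmetic, the blended price is a traffic-weighted average of the tier prices. The sketch below plugs in the upper-bound prices from the table; `TIERS` and `blended_cost_per_million` are illustrative names for this example, not part of any SDK.

```python
# Traffic share and input price per 1M tokens for each tier (figures from the table above).
TIERS = {
    "flash": (0.70, 0.30),
    "sonnet": (0.20, 3.00),
    "opus": (0.10, 15.00),
}


def blended_cost_per_million(tiers: dict[str, tuple[float, float]]) -> float:
    """Return the traffic-weighted average price per 1M tokens."""
    total_share = sum(share for share, _ in tiers.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1.0")
    return sum(share * price for share, price in tiers.values())


baseline = 15.00  # everything sent to the frontier model
blended = blended_cost_per_million(TIERS)
reduction = 1 - blended / baseline

print(f"blended: ${blended:.2f}/M, reduction: {reduction:.0%}")
```

With these inputs the blend comes to about $2.31 per million tokens. Swapping in your own traffic shares and prices shows how sensitive the result is to the frontier tier, which contributes most of the blended cost even at 10% of volume.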
Setting up routing with Requesty
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-key",
)

# Placeholder prompt for illustration; substitute your agent's real task
task_prompt = "Classify this support ticket: ..."

# Reference a named routing policy instead of a single model
response = client.chat.completions.create(
    model="policy/agent-cost-optimizer",  # the policy decides which model handles this call
    messages=[{"role": "user", "content": task_prompt}],
)

You define that policy once and reference it by name, so strategy changes never require a redeploy:
# requesty-policy.yaml
name: agent-cost-optimizer
routes:
  - name: fast-cheap
    match_intent: ["classify", "extract", "filter", "parse"]
    model: google/gemini-2.5-flash
  - name: balanced
    match_intent: ["summarize", "draft", "generate", "refactor"]
    model: anthropic/claude-sonnet-4.6
  - name: frontier
    match_intent: ["review", "architect", "debug-complex", "decide"]
    model: anthropic/claude-opus-4.8
fallback:
  models:
    - anthropic/claude-sonnet-4.6
    - openai/gpt-5.4

For more policy patterns built specifically for agent workloads, see routing policies for agents.
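To make the policy's behavior concrete, here is a minimal local sketch of intent matching with a fallback chain. It is not how Requesty implements routing; `ROUTES`, `candidate_models`, `run_with_fallback`, and the `send` callable are hypothetical names, and the sketch assumes that any exception from `send` means "try the next model" and that unmatched intents go straight to the fallback chain.

```python
from typing import Callable

# Intent keywords and models copied from the YAML policy above.
ROUTES = [
    ({"classify", "extract", "filter", "parse"}, "google/gemini-2.5-flash"),
    ({"summarize", "draft", "generate", "refactor"}, "anthropic/claude-sonnet-4.6"),
    ({"review", "architect", "debug-complex", "decide"}, "anthropic/claude-opus-4.8"),
]
FALLBACK_MODELS = ["anthropic/claude-sonnet-4.6", "openai/gpt-5.4"]


def candidate_models(intent: str) -> list[str]:
    """Primary model for the intent (if any), followed by the fallback chain."""
    primary = [model for intents, model in ROUTES if intent in intents]
    ordered = primary + FALLBACK_MODELS
    # Drop duplicates while preserving order (dicts keep insertion order).
    return list(dict.fromkeys(ordered))


def run_with_fallback(intent: str, send: Callable[[str], str]) -> tuple[str, str]:
    """Try each candidate in order; return (model, reply) from the first success."""
    last_error = None
    for model in candidate_models(intent):
        try:
            return model, send(model)
        except Exception as exc:  # assumption: any failure triggers the next model
            last_error = exc
    raise RuntimeError(f"all models failed for intent {intent!r}") from last_error


def flaky_send(model: str) -> str:
    """Stand-in for a real API call; simulates an outage on one provider."""
    if model.startswith("google/"):
        raise TimeoutError("simulated outage")
    return f"handled by {model}"


print(run_with_fallback("classify", flaky_send))
```

Keeping the table in a policy file rather than in code like this is what lets you change tiers without redeploying; the sketch is only meant to show the selection logic.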
Technique 2: Prompt caching (40 to 90% input cost reduction)
Agent workflows repeat the same prefixes constantly. System prompts, tool definitions, and RAG context can consume 40 to 60% of each request's token budget, and they are identical across calls.
How caching works
The provider stores the tokenized prefix. On subsequent requests with the same prefix, the cached portion costs a fraction of the full price:
| Provider | Cache discount | Minimum cacheable prefix |
|---|---|---|
| Anthropic | 90% off input | 1,024 tokens |
| OpenAI | 50% off input | Automatic |
| Google | 75% off input | 32,768 tokens |
Real-world impact
A support agent with a 4,000-token system prompt running 10,000 calls/day at Sonnet 4.6 pricing:
- Without caching: 4,000 tokens x 10,000 calls x $3/M = $120/day on system prompt alone
- With caching (every call hits the cache at a 90% discount): $12/day on system prompt
- Savings: $108/day, $3,240/month from one configuration
Requesty enables prompt caching automatically across all supported providers. No code changes required. We walk through the full caching math in how prompt caching cuts costs by up to 90%.
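The savings depend on both the provider's discount and how often requests actually hit the cache. The following sketch separates the two; `daily_prefix_cost` is a hypothetical helper, and it deliberately ignores cache-write surcharges and cache expiry, which vary by provider.

```python
def daily_prefix_cost(
    prefix_tokens: int,
    calls_per_day: int,
    price_per_million: float,
    cache_discount: float = 0.0,
    hit_rate: float = 0.0,
) -> float:
    """Daily input cost of a repeated prefix, in dollars.

    Cache hits are billed at (1 - cache_discount) of the normal price;
    misses are billed in full. Cache-write surcharges are ignored.
    """
    full = prefix_tokens * calls_per_day * price_per_million / 1_000_000
    hit_cost = full * hit_rate * (1 - cache_discount)
    miss_cost = full * (1 - hit_rate)
    return hit_cost + miss_cost


# The support-agent example: 4,000-token prompt, 10,000 calls/day, $3 per 1M tokens.
uncached = daily_prefix_cost(4_000, 10_000, 3.00)
all_hits = daily_prefix_cost(4_000, 10_000, 3.00, cache_discount=0.90, hit_rate=1.0)
mostly_hits = daily_prefix_cost(4_000, 10_000, 3.00, cache_discount=0.90, hit_rate=0.90)

print(f"uncached: ${uncached:.2f}, all hits: ${all_hits:.2f}, 90% hits: ${mostly_hits:.2f}")
```

The $12/day figure corresponds to every call hitting the cache. If only 90% of calls hit, the same prompt costs about $22.80/day, so keeping the prefix byte-for-byte stable matters as much as the discount itself.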
Technique 3: Context window management (30 to 60% savings)
The second biggest cost driver after repeated prompts is bloated context. Agents accumulate conversation history, tool results, and document chunks that grow unbounded.
Fixed retrieval budgets
Instead of retrieving a variable number of documents, enforce a strict token budget:
# Bad: unbounded retrieval
docs = retriever.get_relevant(query)  # might return 20K tokens

# Good: fixed budget
docs = retriever.get_relevant(query, max_tokens=4000)  # always 4K

Compaction
For long-running agent sessions, compress the conversation history periodically:
# Once the history exceeds 20 messages, compact it
if len(messages) > 20:
    summary = summarize(messages[:-4])  # keep last 4 messages verbatim
    messages = [system_prompt, summary] + messages[-4:]

Claude Code and Codex both support automatic compaction. The loop controller summarizes completed work and resets, preventing context from growing indefinitely.
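If you manage history yourself, a self-contained version of the compaction step might look like the sketch below. It assumes the first message is the system prompt and that you supply your own `summarize` callable (for example, a call to a cheap model); `compact_history` and `fake_summarize` are hypothetical names, and injecting the summary as a user message is a design assumption, not a requirement.

```python
from typing import Callable

Message = dict[str, str]  # {"role": ..., "content": ...}


def compact_history(
    messages: list[Message],
    summarize: Callable[[list[Message]], str],
    max_messages: int = 20,
    keep_recent: int = 4,
) -> list[Message]:
    """Replace older turns with one summary message once the history gets long.

    Assumes messages[0] is the system prompt, which is always kept verbatim.
    """
    if keep_recent < 1:
        raise ValueError("keep_recent must be at least 1")
    if len(messages) <= max_messages:
        return messages
    system, body = messages[0], messages[1:]
    older, recent = body[:-keep_recent], body[-keep_recent:]
    summary = {"role": "user", "content": "Summary of earlier work: " + summarize(older)}
    return [system, summary, *recent]


def fake_summarize(msgs: list[Message]) -> str:
    """Stand-in summarizer so the example runs without an API call."""
    return f"{len(msgs)} earlier messages condensed"


history = [{"role": "system", "content": "You are a coding agent."}]
history += [{"role": "user", "content": f"step {i}"} for i in range(30)]
compacted = compact_history(history, fake_summarize)
print(len(history), "->", len(compacted))
```

One caveat: if your history contains tool calls, make sure the cut point does not separate a tool call from its result, since most chat APIs reject a result with no matching call.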
Hierarchical memory
Use a tiered memory system:
- Hot memory (in context): current task, last 3 steps
- Warm memory (file-based): today's completed work, available via tool call
- Cold memory (database): historical patterns, retrieved only when relevant
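The three tiers above can be sketched with plain in-memory structures. Everything here is illustrative: `TieredMemory` is a hypothetical class, the warm tier is a list standing in for a file, and the cold tier is a dict standing in for a database.

```python
from collections import deque


class TieredMemory:
    """Hot steps live in the prompt; older steps spill to a warm store."""

    def __init__(self, hot_size: int = 3):
        if hot_size < 1:
            raise ValueError("hot_size must be at least 1")
        self.hot = deque(maxlen=hot_size)  # most recent steps, sent with every call
        self.warm = []                     # today's completed work, fetched on demand
        self.cold = {}                     # long-lived notes, keyed by topic

    def record_step(self, step: str) -> None:
        # When hot memory is full, copy the oldest step to warm before it is evicted.
        if len(self.hot) == self.hot.maxlen:
            self.warm.append(self.hot[0])
        self.hot.append(step)

    def context(self, task: str) -> str:
        """Text that goes into the prompt: the task plus hot memory only."""
        return "\n".join([f"Task: {task}", *self.hot])

    def recall_warm(self, keyword: str) -> list[str]:
        """Stand-in for a tool call that searches today's completed work."""
        return [s for s in self.warm if keyword.lower() in s.lower()]

    def recall_cold(self, topic: str):
        """Stand-in for a database lookup; returns None when nothing is stored."""
        return self.cold.get(topic)


memory = TieredMemory()
for i in range(1, 6):
    memory.record_step(f"step {i}: edited file_{i}.py")

print(memory.context("migrate tests"))
print(memory.recall_warm("file_1"))
```

The point of the split is that only `context()` contributes to per-call token cost; the warm and cold tiers cost nothing until the agent explicitly asks for them.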
Technique 4: Budget caps and spend alerts
Cost optimization without hard limits is incomplete. A single runaway loop can erase a month of savings in an hour.
Per-agent budget caps
# Requesty budget configuration
budgets:
  - agent: pr-reviewer
    daily_limit: $10
    per_task_limit: $2
    alert_thresholds: [50%, 80%, 100%]
    on_exceed: pause_and_alert
  - agent: code-migration
    daily_limit: $50
    per_task_limit: $5
    alert_thresholds: [50%, 80%, 100%]
    on_exceed: pause_and_alert

Team chargeback
Track costs by team, project, and agent. Requesty breaks down spend by:
- API key (maps to team or project)
- Model used (shows routing effectiveness)
- Cache hit rate (shows caching effectiveness)
- Per-request cost (identifies expensive outliers)
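If you export request logs, the same breakdowns are easy to reproduce offline. The sketch below assumes a hypothetical record shape (`api_key`, `model`, `cost`, `cache_hit`); the field names and sample values are made up for illustration and are not Requesty's export format.

```python
from collections import defaultdict

# Hypothetical request log; real rows would come from your gateway's usage export.
REQUEST_LOG = [
    {"api_key": "team-search", "model": "google/gemini-2.5-flash", "cost": 0.002, "cache_hit": True},
    {"api_key": "team-search", "model": "anthropic/claude-opus-4.8", "cost": 0.450, "cache_hit": False},
    {"api_key": "team-billing", "model": "anthropic/claude-sonnet-4.6", "cost": 0.030, "cache_hit": True},
    {"api_key": "team-billing", "model": "anthropic/claude-sonnet-4.6", "cost": 0.028, "cache_hit": True},
]


def spend_by(records, field):
    """Total cost grouped by one field, such as "api_key" or "model"."""
    totals = defaultdict(float)
    for record in records:
        totals[record[field]] += record["cost"]
    return dict(totals)


def cache_hit_rate(records):
    """Fraction of requests served with a cache hit."""
    return sum(r["cache_hit"] for r in records) / len(records) if records else 0.0


def outliers(records, threshold):
    """Requests whose individual cost exceeds the threshold."""
    return [r for r in records if r["cost"] > threshold]


print(spend_by(REQUEST_LOG, "api_key"))
print(f"cache hit rate: {cache_hit_rate(REQUEST_LOG):.0%}")
print(outliers(REQUEST_LOG, threshold=0.10))
```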
Spend alerts
Configure webhook notifications when budgets hit thresholds:
{
  "alert_type": "spend_threshold",
  "threshold_percent": 80,
  "webhook_url": "https://hooks.slack.com/your-channel",
  "message": "Agent '{agent_name}' has used 80% of daily budget (${spent}/${limit})"
}

For the full setup, see budget caps and spend alerts and alerts when your LLM spend spikes.
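For teams that also want a guard inside the agent process, the cap-and-alert behavior can be approximated locally. This is a minimal sketch under stated assumptions, not Requesty's enforcement logic: `BudgetGuard` and `BudgetExceeded` are hypothetical names, only the daily limit is modeled (no per-task limit), and delivering the alert to a webhook is left to the caller.

```python
class BudgetExceeded(RuntimeError):
    """Raised when an agent tries to spend after its daily budget is used up."""


class BudgetGuard:
    """Tracks spend against a daily limit and reports newly crossed thresholds."""

    def __init__(self, agent: str, daily_limit: float, thresholds=(0.5, 0.8, 1.0)):
        self.agent = agent
        self.daily_limit = daily_limit
        self.thresholds = sorted(thresholds)
        self.spent = 0.0
        self._fired = set()  # thresholds that have already produced an alert

    def record(self, cost: float) -> list[str]:
        """Add one request's cost; return alerts for thresholds it crossed."""
        if self.spent >= self.daily_limit:
            raise BudgetExceeded(f"{self.agent} is paused: daily budget exhausted")
        self.spent += cost
        alerts = []
        for t in self.thresholds:
            if t not in self._fired and self.spent >= t * self.daily_limit:
                self._fired.add(t)
                alerts.append(
                    f"Agent '{self.agent}' has used {t:.0%} of daily budget "
                    f"(${self.spent:.2f}/${self.daily_limit:.2f})"
                )
        return alerts


guard = BudgetGuard("pr-reviewer", daily_limit=10.0)
for cost in (3.0, 2.5, 3.0, 2.0):
    for alert in guard.record(cost):
        print(alert)
```

Note that this guard checks the budget before each request, so the last request can overshoot the limit slightly; a stricter version would estimate the request's cost up front and refuse it.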
Putting it all together: a real cost breakdown
Here is what a production coding agent workflow looks like before and after optimization:
| Metric | Before | After | Reduction |
|---|---|---|---|
| Model | Opus 4.8 for everything | Routed (70% Flash, 20% Sonnet, 10% Opus) | N/A |
| Avg cost per 1M tokens | $15.00 | $2.31 | 85% |
| System prompt cost (daily) | $120 | $12 | 90% |
| Context tokens per call | 45,000 | 18,000 | 60% |
| Daily spend (10K calls) | $675 | $95 | 86% |
| Monthly spend | $20,250 | $2,850 | 86% |
The total monthly savings: $17,400. And this is for a single workflow. Teams running multiple agent loops see proportionally larger savings. The same gateway economics apply at the platform level, as we cover in how LLM gateways slash AI spend by up to 80%.
Getting started with Requesty
The fastest path to cost optimization:
- Point your agents at Requesty. Change base_url to https://router.requesty.ai/v1. One line of code.
- Define a routing policy. Set up a named policy with your model tiers and fallbacks, then reference it from your code with model="policy/your-policy".
- Caching is automatic. Requesty enables provider-native caching on all supported models.
- Set budget alerts. Configure daily limits and Slack notifications in the dashboard.
- Monitor and tune. Use the cost analytics dashboard to identify expensive patterns and adjust routing.
The LLM gateway market is projected to reach $11 billion by 2030. The reason is simple: as agent adoption scales from 42% to near-universal in enterprise, cost optimization is not optional. It is the difference between a viable AI program and one that gets killed at the next budget review.
Start routing today. Your future self will thank you when the monthly invoice arrives.
Frequently asked questions
Why are AI agent costs so high compared to chatbots?
AI agents make 10x to 100x more LLM calls than a chatbot. Each agent loop iteration, tool call, and subagent spawn consumes tokens. A single agent task can involve 20 to 50 LLM calls with large system prompts repeated on every request. Without optimization, inference can consume over 85% of total enterprise AI budgets.

How does model routing reduce AI agent costs?
Model routing sends each agent step to the cheapest capable model. Classification and scanning tasks go to nano models ($0.10 to $0.30 per million tokens), drafting goes to mid-tier models ($1 to $3 per million tokens), and only final decisions use frontier models ($10 to $15 per million tokens). This 70/20/10 distribution reduces average cost per query by 60 to 80 percent.

What is prompt caching and how much does it save?
Prompt caching stores repeated token prefixes (system prompts, tool definitions, document context) so the LLM provider does not reprocess them on every request. Anthropic's implementation reduces cached input costs by up to 90%. For agents with long stable system prompts running thousands of calls per day, this saves thousands of dollars monthly.

How do I prevent runaway agent loops from burning my budget?
Set per-agent and per-task budget caps. Configure maximum iteration limits on all loops. Use hierarchical budget inheritance where parent tasks allocate fixed budgets to subtasks. Set up spend alerts that trigger at 50%, 80%, and 100% of budget thresholds. A gateway like Requesty provides all of these controls out of the box.

What is the total cost reduction from combining all optimization techniques?
Combining model routing (60 to 80% savings), prompt caching (40 to 90% savings on input tokens), context optimization (30 to 60% savings), and budget controls produces a net 60 to 80% total cost reduction in production deployments. A workflow costing $1.60 per interaction unoptimized typically drops below $0.40 with full optimization.

