AI Agent Cost Optimization: How to Cut LLM Spend by 80% with Routing

Inference is not just the biggest line item on your AI bill; it is most of the bill. In early 2026, Anthropic's engineering teams found that inference consumes over 85% of total enterprise AI budgets. The culprit is not cost per token, which has dropped steadily. It is the sheer volume of tokens that agentic workflows generate.

A single chatbot interaction might use 2,000 to 4,000 tokens. A single agent task with tool calls, planning steps, and verification loops? 50,000 to 500,000 tokens. Multiply that by hundreds of tasks per day, and you have a cost problem that scales linearly with adoption.

This guide covers the four techniques that production teams use to cut agent costs by 60 to 80 percent without sacrificing quality. If you are building the loops that drive this token volume in the first place, start with our companion guide on loop engineering.

Technique 1: Model routing (60 to 80% cost reduction)

Not every step in an agent workflow needs a frontier model. The insight is simple: route each step to the cheapest model that can handle it.

The 70/20/10 distribution

Production agent deployments in 2026 typically follow this pattern:

Volume	Task type	Model tier	Cost per 1M tokens
70%	Classification, extraction, filtering	Nano/Flash	$0.10 to $0.30
20%	Drafting, summarization, code generation	Mid-tier	$1 to $3
10%	Final review, architecture, complex reasoning	Frontier	$10 to $15

The math: if your unoptimized agent sends everything to Claude Opus 4.8 at $15/M input tokens, routing 70% to Flash ($0.30/M) and 20% to Sonnet ($3/M) while keeping 10% on Opus drops your average cost from $15 to roughly $2.31 per million tokens (0.70 x $0.30 + 0.20 x $3 + 0.10 x $15). That is an 85% reduction. For the full breakdown of how this selection works per request, see smart routing demystified and how to route LLM requests by cost and latency.
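The weighted average can be checked with a few lines of Python. This is a minimal sketch that assumes the 70/20/10 split is a share of tokens (not of requests) and uses the illustrative per-tier prices quoted above; the `TIERS` dictionary and `blended_price` helper are names made up for this example.

```python
# Blended price per 1M tokens for traffic routed across model tiers.
# Shares and prices are the illustrative figures from the table above.
TIERS = {
    "flash": {"share": 0.70, "price_per_m": 0.30},
    "sonnet": {"share": 0.20, "price_per_m": 3.00},
    "opus": {"share": 0.10, "price_per_m": 15.00},
}


def blended_price(tiers):
    """Return the traffic-weighted average price per 1M tokens."""
    total_share = sum(t["share"] for t in tiers.values())
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("tier shares must sum to 1.0")
    return sum(t["share"] * t["price_per_m"] for t in tiers.values())


baseline = 15.00  # everything sent to the frontier model
routed = blended_price(TIERS)
reduction = 1 - routed / baseline

print(f"routed: ${routed:.2f}/M, reduction: {reduction:.1%}")
# prints "routed: $2.31/M, reduction: 84.6%"
```

Note that the frontier tier still contributes $1.50 of the blended $2.31, so shrinking the share of traffic that reaches Opus matters more than shaving the Flash price.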

Setting up routing with Requesty

Python

from openai import OpenAI
 
client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your-requesty-key"
)
 
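# Example input; in a real agent this comes from the current loop step
task_prompt = "Classify this support ticket as billing, bug, or feature request."

# Reference a named routing policy instead of a single model
response = client.chat.completions.create(
    model="policy/agent-cost-optimizer",  # the policy decides which model handles this call
    messages=[{"role": "user", "content": task_prompt}]
)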

You define that policy once and reference it by name, so strategy changes never require a redeploy:

YAML

# requesty-policy.yaml
name: agent-cost-optimizer
routes:
  - name: fast-cheap
    match_intent: ["classify", "extract", "filter", "parse"]
    model: google/gemini-2.5-flash
 
  - name: balanced
    match_intent: ["summarize", "draft", "generate", "refactor"]
    model: anthropic/claude-sonnet-4.6
 
  - name: frontier
    match_intent: ["review", "architect", "debug-complex", "decide"]
    model: anthropic/claude-opus-4.8
 
# Top-level key: used when a routed model is unavailable
fallback:
  models:
    - anthropic/claude-sonnet-4.6
    - openai/gpt-5.4

For more policy patterns built specifically for agent workloads, see routing policies for agents.
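To make the matching logic concrete, here is a rough Python sketch of the intent-to-model lookup that the policy above describes. It is a local illustration only, not how Requesty evaluates policies: the `ROUTES` list, `DEFAULT_MODEL`, and `pick_model` are hypothetical names, and sending unmatched intents to the mid-tier model is an assumption of this example.

```python
# Hypothetical client-side mirror of the policy above, for illustration only.
ROUTES = [
    ({"classify", "extract", "filter", "parse"}, "google/gemini-2.5-flash"),
    ({"summarize", "draft", "generate", "refactor"}, "anthropic/claude-sonnet-4.6"),
    ({"review", "architect", "debug-complex", "decide"}, "anthropic/claude-opus-4.8"),
]
DEFAULT_MODEL = "anthropic/claude-sonnet-4.6"  # assumption: unmatched intents go mid-tier


def pick_model(intent: str) -> str:
    """Return the model for the first route whose intent set contains `intent`."""
    for intents, model in ROUTES:
        if intent in intents:
            return model
    return DEFAULT_MODEL


for intent in ("extract", "draft", "decide", "translate"):
    print(intent, "->", pick_model(intent))
```

Keeping the table in a gateway policy rather than in application code, as the article recommends, means the mapping can change without touching the agent.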

Technique 2: Prompt caching (40 to 90% input cost reduction)

Agent workflows repeat the same prefixes constantly. System prompts, tool definitions, and RAG context can consume 40 to 60% of each request's token budget, and they are identical across calls.

How caching works

The provider stores the tokenized prefix. On subsequent requests with the same prefix, the cached portion costs a fraction of the full price:

Provider	Cache discount	Minimum cacheable prefix
Anthropic	90% off input	1,024 tokens
OpenAI	50% off input	1,024 tokens (applied automatically)
Google	75% off input	32,768 tokens

Real-world impact

A support agent with a 4,000-token system prompt running 10,000 calls/day at Sonnet 4.6 pricing:

Without caching: 4,000 tokens x 10,000 calls x $3/M = $120/day on system prompt alone
With caching (90% discount, assuming every call hits the cache): $12/day on system prompt
Savings: $108/day, $3,240/month from one configuration
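The same arithmetic can be parameterized by cache hit rate, which is worth doing because real hit rates are rarely 100%. The sketch below is an approximation under stated assumptions: cache misses pay full input price, hits pay the discounted price, and any cache-write surcharge is ignored. The function name and parameters are invented for this example.

```python
def daily_prefix_cost(prefix_tokens, calls_per_day, price_per_m, hit_rate, cache_discount):
    """Daily input cost in dollars of a prefix repeated on every call.

    Cache misses pay full price; hits pay (1 - cache_discount) of it.
    Cache-write surcharges are ignored to keep the sketch simple.
    """
    full = prefix_tokens * calls_per_day * price_per_m / 1_000_000
    return full * ((1 - hit_rate) + hit_rate * (1 - cache_discount))


# 4,000-token system prompt, 10,000 calls/day, $3 per 1M input tokens
no_cache = daily_prefix_cost(4_000, 10_000, 3.00, hit_rate=0.0, cache_discount=0.90)
all_hits = daily_prefix_cost(4_000, 10_000, 3.00, hit_rate=1.0, cache_discount=0.90)
mostly_hits = daily_prefix_cost(4_000, 10_000, 3.00, hit_rate=0.9, cache_discount=0.90)

print(f"no cache: ${no_cache:.2f}, 100% hits: ${all_hits:.2f}, 90% hits: ${mostly_hits:.2f}")
# prints "no cache: $120.00, 100% hits: $12.00, 90% hits: $22.80"
```

At a 90% hit rate the prefix costs about $22.80/day rather than $12, so keeping the prefix byte-for-byte stable across calls is what protects the savings.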

Requesty enables prompt caching automatically across all supported providers. No code changes required. We walk through the full caching math in how prompt caching cuts costs by up to 90%.

Technique 3: Context window management (30 to 60% savings)

The second biggest cost driver after repeated prompts is bloated context. Agents accumulate conversation history, tool results, and document chunks that grow unbounded.

Fixed retrieval budgets

Instead of retrieving a variable number of documents, enforce a strict token budget:

Python

# Bad: unbounded retrieval
docs = retriever.get_relevant(query)  # might return 20K tokens
 
# Good: fixed budget (max_tokens is illustrative; use whatever cap your retriever exposes)
docs = retriever.get_relevant(query, max_tokens=4000)  # never more than 4K tokens
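If your retriever has no built-in token cap, the budget can be enforced after retrieval. The following is a minimal sketch: `estimate_tokens` uses the rough rule of thumb of about four characters per token (an assumption; swap in your model's tokenizer for accurate counts), and `fit_to_budget` is a hypothetical helper that expects documents already sorted by relevance.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: about 4 characters per token (rule of thumb)."""
    return max(1, len(text) // 4)


def fit_to_budget(docs: list[str], max_tokens: int) -> list[str]:
    """Keep documents in ranked order until the token budget is spent."""
    kept, used = [], 0
    for doc in docs:
        cost = estimate_tokens(doc)
        if used + cost > max_tokens:
            break  # stop at the first document that would exceed the budget
        kept.append(doc)
        used += cost
    return kept


# Three ranked documents of roughly 2K, 1.5K, and 1K tokens
ranked_docs = ["a" * 8_000, "b" * 6_000, "c" * 4_000]
selected = fit_to_budget(ranked_docs, max_tokens=4_000)
print(len(selected))  # prints 2: the third document would push the total to ~4.5K
```

Stopping at the first overflow keeps the highest-ranked documents intact; truncating the last document to fill the remaining budget is an alternative if partial context is acceptable.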

Compaction

For long-running agent sessions, compress the conversation history periodically:

Python

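# Compact once the history exceeds 20 messages (about 10 tool calls,
# if each call and its result are stored as separate messages)
if len(messages) > 20:
    # Assumes messages[0] is the system prompt, so it is left out of the summary;
    # summarize() stands in for your own LLM summarization helper
    summary = summarize(messages[1:-4])  # keep last 4 messages verbatim
    messages = [system_prompt, summary] + messages[-4:]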

Claude Code and Codex both support automatic compaction. The loop controller summarizes completed work and resets, preventing context from growing indefinitely.

Hierarchical memory

Use a tiered memory system:

Hot memory (in context): current task, last 3 steps
Warm memory (file-based): today's completed work, available via tool call
Cold memory (database): historical patterns, retrieved only when relevant
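The hot and warm tiers can be sketched in a few lines of Python. This is an illustrative design, not a reference implementation: the `TieredMemory` class and its method names are hypothetical, the warm tier is a JSON-lines file, and the cold (database) tier is left out to keep the example short.

```python
import json
import tempfile
from collections import deque
from pathlib import Path


class TieredMemory:
    """Hot tier lives in the prompt; the warm tier is a file read on demand."""

    def __init__(self, warm_path: Path, hot_size: int = 3):
        self.hot = deque(maxlen=hot_size)  # last N steps, always in context
        self.warm_path = warm_path         # completed work, one JSON object per line

    def record(self, step: str) -> None:
        # When the hot tier is full, the oldest step spills to the warm file
        # before the deque evicts it.
        if len(self.hot) == self.hot.maxlen:
            with self.warm_path.open("a") as f:
                f.write(json.dumps({"step": self.hot[0]}) + "\n")
        self.hot.append(step)

    def context(self) -> list[str]:
        """Steps to include in the next prompt."""
        return list(self.hot)

    def recall_warm(self) -> list[str]:
        """Would be exposed to the agent as a tool call."""
        if not self.warm_path.exists():
            return []
        lines = self.warm_path.read_text().splitlines()
        return [json.loads(line)["step"] for line in lines]


memory = TieredMemory(Path(tempfile.mkdtemp()) / "warm.jsonl")
for i in range(5):
    memory.record(f"step {i}")

print(memory.context())      # the 3 most recent steps
print(memory.recall_warm())  # the 2 steps that spilled to the file
```

Only `context()` contributes tokens to every call; the warm tier costs tokens only when the agent asks for it.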

Technique 4: Budget caps and spend alerts

Cost optimization without hard limits is incomplete. A single runaway loop can erase a month of savings in an hour.

Per-agent budget caps

YAML

# Requesty budget configuration
budgets:
  - agent: pr-reviewer
    daily_limit: $10
    per_task_limit: $2
    alert_thresholds: [50%, 80%, 100%]
    on_exceed: pause_and_alert
 
  - agent: code-migration
    daily_limit: $50
    per_task_limit: $5
    alert_thresholds: [50%, 80%, 100%]
    on_exceed: pause_and_alert
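The configuration above is enforced at the gateway, but the same idea can also be applied inside the agent loop as a second line of defense. Below is a minimal client-side sketch combining a per-task budget, alert thresholds, and a maximum iteration count. `TaskBudget` and `run_loop` are hypothetical names, and the per-step costs are supplied as plain numbers rather than read from real API responses.

```python
class TaskBudget:
    """Tracks spend for one task and reports newly crossed alert thresholds."""

    def __init__(self, limit_usd: float, thresholds=(0.5, 0.8, 1.0)):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0
        self._pending = sorted(thresholds)

    def charge(self, cost_usd: float) -> list[float]:
        """Add one call's cost and return any thresholds crossed by it."""
        self.spent_usd += cost_usd
        crossed = [t for t in self._pending if self.spent_usd >= t * self.limit_usd]
        self._pending = [t for t in self._pending if t not in crossed]
        return crossed

    @property
    def exhausted(self) -> bool:
        return self.spent_usd >= self.limit_usd


def run_loop(step_costs, budget: TaskBudget, max_iterations: int = 10):
    """Run steps until they finish, the iteration cap hits, or the budget runs out."""
    alerts = []
    for i, cost in enumerate(step_costs):
        if i >= max_iterations:
            return "stopped: iteration cap", alerts
        if budget.exhausted:
            return "paused: budget exhausted", alerts
        alerts += budget.charge(cost)  # in a real loop, the LLM call happens here
    return "completed", alerts


status, alerts = run_loop([0.50] * 8, TaskBudget(limit_usd=2.00))
print(status, alerts)  # prints "paused: budget exhausted [0.5, 0.8, 1.0]"
```

Checking the budget before each call means the loop can overshoot the cap by at most one call's cost; a stricter variant would estimate the next call's cost and refuse it up front.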

Team chargeback

Track costs by team, project, and agent. Requesty breaks down spend by:

API key (maps to team or project)
Model used (shows routing effectiveness)
Cache hit rate (shows caching effectiveness)
Per-request cost (identifies expensive outliers)

Spend alerts

Configure webhook notifications when budgets hit thresholds:

JSON

{
  "alert_type": "spend_threshold",
  "threshold_percent": 80,
  "webhook_url": "https://hooks.slack.com/your-channel",
  "message": "Agent '{agent_name}' has used 80% of daily budget (${spent}/${limit})"
}

For the full setup, see budget caps and spend alerts and alerts when your LLM spend spikes.

Putting it all together: a real cost breakdown

Here is what a production coding agent workflow looks like before and after optimization:

Metric	Before	After	Reduction
Model	Opus 4.8 for everything	Routed (70% Flash, 20% Sonnet, 10% Opus)	N/A
Avg cost per 1M tokens	$15.00	$2.31	85%
System prompt cost (daily)	$120	$12	90%
Context tokens per call	45,000	18,000	60%
Daily spend (10K calls)	$675	$95	86%
Monthly spend	$20,250	$2,850	86%

The total monthly savings: $17,400. And this is for a single workflow. Teams running multiple agent loops see proportionally larger savings. The same gateway economics apply at the platform level, as we cover in how LLM gateways slash AI spend by up to 80%.

Getting started with Requesty

The fastest path to cost optimization:

Point your agents at Requesty. Change base_url to https://router.requesty.ai/v1. One line of code.
Define a routing policy. Set up a named policy with your model tiers and fallbacks, then reference it from your code with model="policy/your-policy".
Caching is automatic. Requesty enables provider-native caching on all supported models.
Set budget alerts. Configure daily limits and Slack notifications in the dashboard.
Monitor and tune. Use the cost analytics dashboard to identify expensive patterns and adjust routing.

The LLM gateway market is projected to reach $11 billion by 2030. The reason is simple: as agent adoption scales from 42% to near-universal in enterprise, cost optimization is not optional. It is the difference between a viable AI program and one that gets killed at the next budget review.

Start routing today. Your future self will thank you when the monthly invoice arrives.

Frequently asked questions

Why are AI agent costs so high compared to chatbots?

AI agents make 10x to 100x more LLM calls than a chatbot. Each agent loop iteration, tool call, and subagent spawn consumes tokens. A single agent task can involve 20 to 50 LLM calls with large system prompts repeated on every request. Without optimization, inference cost becomes 85% of total enterprise AI budgets.

How does model routing reduce AI agent costs?

Model routing sends each agent step to the cheapest capable model. Classification and scanning tasks go to nano models ($0.10 per million tokens), drafting goes to mid-tier models ($1 to $3 per million tokens), and only final decisions use frontier models ($10 to $15 per million tokens). This 70/20/10 distribution reduces average cost per query by 60 to 80 percent.

What is prompt caching and how much does it save?

Prompt caching stores repeated token prefixes (system prompts, tool definitions, document context) so the LLM provider does not reprocess them on every request. Anthropic's implementation reduces cached input costs by up to 90%. For agents with long stable system prompts running thousands of calls per day, this saves thousands of dollars monthly.

How do I prevent runaway agent loops from burning my budget?

Set per-agent and per-task budget caps. Configure maximum iteration limits on all loops. Use hierarchical budget inheritance where parent tasks allocate fixed budgets to subtasks. Set up spend alerts that trigger at 50%, 80%, and 100% of budget thresholds. A gateway like Requesty provides all of these controls out of the box.

What is the total cost reduction from combining all optimization techniques?

Combining model routing (60 to 80% savings), prompt caching (40 to 90% savings on input tokens), context optimization (30 to 60% savings), and budget controls produces a net 60 to 80% total cost reduction in production deployments. A workflow costing $1.60 per interaction unoptimized typically drops below $0.40 with full optimization.
