
Multi Agent Orchestration Patterns That Actually Work in Production

Thibault Jaigu
CEO & Co-Founder
Published May 2026

Single agent systems hit a ceiling fast. The moment your application needs to research, draft, review, and format in a single workflow, you need multiple specialized agents working together. Gartner reported a 1,445% surge in multi agent system inquiries between Q1 2024 and Q2 2025. Organizations already use an average of 12 agents, and that number is projected to climb 67% within two years.

But 40% of multi agent pilots fail within six months of production deployment. The failure is almost never the individual agents. It is the orchestration: the wiring that coordinates state, message passing, and control flow between them. Anthropic's analysis of 200+ enterprise agent deployments found that 57% of project failures originated in orchestration design.

This guide covers the six orchestration patterns that hold up in production, how each one fails, and how to wire them for cost control, automatic failover, and observability using a unified AI gateway.

The six patterns

Pattern 1: Sequential Pipeline

Agents execute in a fixed order. Agent A's output becomes Agent B's input. This is the simplest pattern and should be your default unless you have a specific reason to use something more complex.

When to use it: Document processing (extract, classify, enrich, route). Content production (research, draft, edit, format). Any workflow where each step depends on the previous step's output.

Python
from openai import OpenAI
 
client = OpenAI(
    api_key="your_requesty_api_key",
    base_url="https://router.requesty.ai/v1",
)
 
def sequential_pipeline(topic: str):
    # Step 1: Research (use a model with web search)
    research = client.chat.completions.create(
        model="google/gemini-2.5-pro",
        messages=[{
            "role": "user",
            "content": f"Research this topic thoroughly: {topic}"
        }],
    )
 
    # Step 2: Draft (use a strong writing model)
    draft = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-5",
        messages=[{
            "role": "user",
            "content": f"Write a detailed article based on this research:\n{research.choices[0].message.content}"
        }],
    )
 
    # Step 3: Review (use a frontier reasoning model)
    review = client.chat.completions.create(
        model="openai/gpt-5",
        messages=[{
            "role": "user",
            "content": f"Review this article for accuracy and clarity:\n{draft.choices[0].message.content}"
        }],
    )
 
    return review.choices[0].message.content

Notice that each step uses a different model from a different provider. With Requesty, you pick the best model for each task without managing three separate API keys or provider integrations. One endpoint, one API key, 300+ models.

How it fails: The pipeline is only as good as its weakest link. If the research agent returns low quality output, every downstream agent produces garbage. Latency adds up linearly because each step waits for the previous one.

Mitigation: Add quality checks between steps. Use request metadata to tag each step so you can identify bottlenecks in your analytics dashboard:

Python
research = client.chat.completions.create(
    model="google/gemini-2.5-pro",
    messages=[{"role": "user", "content": f"Research: {topic}"}],
    extra_body={
        "requesty": {
            "tags": ["pipeline", "research"],
            "trace_id": f"pipeline_{topic[:20]}",
            "extra": {"step": "1_research", "pipeline": "content_production"},
        }
    },
)
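The tags above handle observability. For the quality checks themselves, a lightweight gate between steps is often enough; the sketch below assumes a cheap grader model, an illustrative passes_quality_gate helper, and a threshold of 7, none of which are Requesty features:

Python
def passes_quality_gate(text: str, min_words: int = 200) -> bool:
    # Cheap structural check before spending tokens on a grader call
    if len(text.split()) < min_words:
        return False

    # Ask a small model for a 1-10 completeness score (illustrative prompt)
    grade = client.chat.completions.create(
        model="openai/gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rate 1-10 how complete and well-sourced this research is. Reply with only the number:\n{text}",
        }],
    )
    try:
        return int(grade.choices[0].message.content.strip()) >= 7
    except ValueError:
        return False  # Unparseable score: fail closed and retry the step

Run it on the research output and retry or abort before the draft step runs, so a weak first step never propagates downstream.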

Pattern 2: Orchestrator Worker (Supervisor)

One orchestrator agent receives the task, decomposes it into subtasks, delegates each to a specialist worker, and assembles the results. The orchestrator uses a capable frontier model while workers use cheaper, task specific ones.

When to use it: Cross functional workflows with clear task decomposition. Customer service routing between billing, technical, and product specialists. Any workflow where you need a single accountability point.

How it fails: The orchestrator is a single point of failure. If it misclassifies a task, the wrong worker gets it. Context window overflow is the more subtle problem: the orchestrator accumulates context from every worker. At four or more workers, context frequently exceeds window limits.

Cost optimization with Requesty: Use a routing policy to route the orchestrator to a frontier model and workers to cheaper models. This cuts costs 40 to 60% compared to running everything on a frontier model:

Python
# Orchestrator uses a powerful model via a fallback policy
orchestrator_response = client.chat.completions.create(
    model="policy/frontier-with-fallback",  # GPT-5 → Claude Opus as fallback
    messages=[{"role": "user", "content": "Break this task into subtasks..."}],
)
 
# Workers use fast, cheap models
worker_response = client.chat.completions.create(
    model="policy/fast-and-cheap",  # Gemini Flash → Haiku → GPT-4o mini
    messages=[{"role": "user", "content": "Execute this specific subtask..."}],
)

Set up these policies once in the Routing Policies dashboard and reference them by name. No code changes when you want to swap models.
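To make the pattern concrete, here is a minimal sketch that wires the two policies together end to end. The orchestrate function, the JSON decomposition prompt, and the separate async client are illustrative assumptions; your decomposition format and aggregation step will differ:

Python
import asyncio
import json
from openai import AsyncOpenAI

# Async client so worker subtasks can run concurrently
async_client = AsyncOpenAI(
    api_key="your_requesty_api_key",
    base_url="https://router.requesty.ai/v1",
)

async def orchestrate(task: str) -> str:
    # 1. Orchestrator (frontier policy) decomposes the task into subtasks
    plan = await async_client.chat.completions.create(
        model="policy/frontier-with-fallback",
        messages=[{
            "role": "user",
            "content": f"Break this task into 2-4 subtasks. Return only a JSON array of strings.\nTask: {task}",
        }],
    )
    # In production, validate this parse; models sometimes wrap JSON in prose
    subtasks = json.loads(plan.choices[0].message.content)

    # 2. Workers (cheap policy) execute the subtasks concurrently
    results = await asyncio.gather(*[
        async_client.chat.completions.create(
            model="policy/fast-and-cheap",
            messages=[{"role": "user", "content": subtask}],
        )
        for subtask in subtasks
    ])

    # 3. Orchestrator assembles the worker outputs into a single answer
    summary = await async_client.chat.completions.create(
        model="policy/frontier-with-fallback",
        messages=[{
            "role": "user",
            "content": "Combine these results into one coherent answer:\n\n"
            + "\n\n".join(r.choices[0].message.content for r in results),
        }],
    )
    return summary.choices[0].message.content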

Pattern 3: Parallel Fan Out

Multiple agents execute simultaneously on different parts of the same task. Results are collected and merged. This is the pattern for tasks where subtasks are independent.

When to use it: Due diligence (legal, financial, technical reviews in parallel). Multi source research (query multiple databases simultaneously). Any workflow where subtasks do not depend on each other.

Python
import asyncio
from openai import AsyncOpenAI
 
client = AsyncOpenAI(
    api_key="your_requesty_api_key",
    base_url="https://router.requesty.ai/v1",
)
 
async def parallel_analysis(document: str):
    legal, financial, technical = await asyncio.gather(
        client.chat.completions.create(
            model="anthropic/claude-sonnet-4-5",
            messages=[{"role": "user", "content": f"Legal review:\n{document}"}],
            extra_body={"requesty": {"tags": ["parallel", "legal"]}},
        ),
        client.chat.completions.create(
            model="openai/gpt-5",
            messages=[{"role": "user", "content": f"Financial analysis:\n{document}"}],
            extra_body={"requesty": {"tags": ["parallel", "financial"]}},
        ),
        client.chat.completions.create(
            model="google/gemini-2.5-pro",
            messages=[{"role": "user", "content": f"Technical assessment:\n{document}"}],
            extra_body={"requesty": {"tags": ["parallel", "technical"]}},
        ),
    )
    return {
        "legal": legal.choices[0].message.content,
        "financial": financial.choices[0].message.content,
        "technical": technical.choices[0].message.content,
    }
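Once the branches return, a single synthesis call can merge them. A minimal sketch, assuming a merge_analyses helper and an arbitrary choice of synthesis model:

Python
async def merge_analyses(analyses: dict[str, str]) -> str:
    # Merge the independent reviews into one memo with a single synthesis call
    merged = await client.chat.completions.create(
        model="anthropic/claude-sonnet-4-5",
        messages=[{
            "role": "user",
            "content": "Combine these reviews into a single due diligence memo:\n\n"
            + "\n\n".join(f"## {name.title()}\n{text}" for name, text in analyses.items()),
        }],
        extra_body={"requesty": {"tags": ["parallel", "merge"]}},
    )
    return merged.choices[0].message.content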

How it fails: If one parallel branch takes 10x longer than the others, total latency is dominated by the slowest agent. Different models have different speed profiles, and you will not know which is slowest until production.

Mitigation: Use latency based routing so Requesty automatically picks the fastest available model for each branch:

Python
# Each branch uses a latency-optimized policy
response = client.chat.completions.create(
    model="policy/fastest-frontier",  # Automatically routes to fastest: GPT-5, Claude, or Gemini
    messages=[{"role": "user", "content": "Analyze this..."}],
)
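If you also want a hard ceiling regardless of which model a branch lands on, you can wrap each call in an asyncio timeout. The helper below is a sketch that assumes the AsyncOpenAI client from the fan out example and an illustrative 30 second budget:

Python
import asyncio

async def with_timeout(coro, seconds: float = 30.0):
    # Cap a single branch so one slow model cannot stall the whole fan out
    try:
        return await asyncio.wait_for(coro, timeout=seconds)
    except asyncio.TimeoutError:
        return None  # Treat the branch as missing and degrade gracefully

async def guarded_branch(prompt: str) -> str | None:
    # Wrap any branch from parallel_analysis with a hard latency ceiling
    response = await with_timeout(
        client.chat.completions.create(
            model="policy/fastest-frontier",
            messages=[{"role": "user", "content": prompt}],
        )
    )
    return response.choices[0].message.content if response else None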

Pattern 4: Router (Dynamic Dispatch)

A lightweight classifier agent examines the incoming request and routes it to the right specialist. Unlike the orchestrator, the router does not decompose tasks or aggregate results. It just dispatches.

When to use it: Customer support triage. Multi domain chatbots where different query types need different expertise. Any system where request types are heterogeneous but each individual request maps to one specialist.

Python
# The router uses a fast, cheap model to classify
# (user_query is the incoming request text from the user)
classification = client.chat.completions.create(
    model="openai/gpt-4o-mini",  # Fast and cheap for classification
    messages=[{
        "role": "system",
        "content": "Classify the user request into: billing, technical, product. Return only the category.",
    }, {
        "role": "user",
        "content": user_query,
    }],
    extra_body={"requesty": {"tags": ["router", "classify"]}},
)
 
category = classification.choices[0].message.content.strip().lower()
 
# Route to the right specialist with a policy that has built-in fallback
model_map = {
    "billing": "policy/billing-specialist",
    "technical": "policy/technical-specialist",
    "product": "policy/product-specialist",
}
 
specialist_response = client.chat.completions.create(
    model=model_map.get(category, "policy/general-assistant"),
    messages=[{"role": "user", "content": user_query}],
    extra_body={
        "requesty": {
            "tags": ["router", category],
            "extra": {"routed_category": category},
        }
    },
)

Cost impact: The router pattern cuts costs 30 to 60% because simple queries go to cheap models while only complex queries hit frontier models. Track this in the Cost Tracking dashboard by filtering on your custom tags.

Pattern 5: Hierarchical (Multi Level)

Combines the orchestrator and worker patterns at multiple levels. A top level orchestrator delegates to mid level coordinators, which in turn manage their own worker pools. This is the pattern for enterprise scale systems.

When to use it: Large organizations with multiple AI teams. Complex workflows spanning multiple domains with sub workflows in each domain. Systems processing thousands of concurrent agent requests.

How it fails: Debugging is hard. When something goes wrong three levels deep, finding the root cause requires tracing through multiple orchestration layers.

Observability with Requesty: Use session reconstruction and request metadata to trace requests through the hierarchy:

Python
response = client.chat.completions.create(
    model="policy/coordinator",
    messages=[{"role": "user", "content": "Process this order..."}],
    extra_body={
        "requesty": {
            "trace_id": "order_12345",
            "user_id": "customer_abc",
            "tags": ["hierarchical", "order-processing", "level-2"],
            "extra": {
                "parent_task": "fulfillment_pipeline",
                "hierarchy_level": "coordinator",
                "department": "logistics",
            },
        }
    },
)

Every request in the hierarchy shares the same trace_id, so you can reconstruct the entire execution flow in the analytics dashboard. Filter by hierarchy_level to see cost and latency at each tier.
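A practical way to keep that invariant is to build the metadata block once per execution and pass it down the hierarchy. The child_metadata helper below is an illustrative sketch, not part of the Requesty SDK:

Python
def child_metadata(trace_id: str, level: str, **extra) -> dict:
    # Build the metadata block once per execution so every level in the
    # hierarchy inherits the same trace_id and only overrides its own fields
    return {
        "requesty": {
            "trace_id": trace_id,
            "tags": ["hierarchical", level],
            "extra": {"hierarchy_level": level, **extra},
        }
    }

response = client.chat.completions.create(
    model="policy/coordinator",
    messages=[{"role": "user", "content": "Process this order..."}],
    extra_body=child_metadata("order_12345", "coordinator", department="logistics"),
)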

Pattern 6: Evaluator Optimizer Loop

An executor agent produces output. An evaluator agent scores it. If the score is below threshold, the executor tries again with the evaluator's feedback. This loop continues until the output meets quality standards or hits a retry limit.

When to use it: Content generation where quality must meet a bar. Code generation with automated test validation. Any workflow where output quality is measurable and you can afford multiple iterations.

Python
MAX_ITERATIONS = 3
 
draft = initial_prompt  # initial_prompt holds the task description for the first pass
for i in range(MAX_ITERATIONS):
    # Generate with a strong model
    generation = client.chat.completions.create(
        model="anthropic/claude-sonnet-4-5",
        messages=[{"role": "user", "content": draft}],
        extra_body={"requesty": {"tags": ["eval-loop", f"iteration-{i}"]}},
    )
 
    # Evaluate with a different model to avoid self-bias
    evaluation = client.chat.completions.create(
        model="openai/gpt-5",
        messages=[{
            "role": "user",
            "content": f"Score this output 1-10 and provide feedback:\n{generation.choices[0].message.content}"
        }],
        extra_body={"requesty": {"tags": ["eval-loop", "evaluator"]}},
    )
 
    score = parse_score(evaluation.choices[0].message.content)
    if score >= 8:
        break
 
    # Feed back into the next iteration, keeping the draft and the feedback together
    draft = (
        "Improve this draft based on the feedback below.\n\n"
        f"Draft:\n{generation.choices[0].message.content}\n\n"
        f"Feedback:\n{evaluation.choices[0].message.content}"
    )

How it fails: Without a cost cap, the loop can run indefinitely. Three iterations of a frontier model on a long document can cost 50x what a single call costs.

Cost controls with Requesty: Set spend limits per API key or project to prevent runaway loops. Use spending alerts to get notified when an eval loop exceeds expected costs.
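Requesty-side limits are the hard stop, but it also helps to track spend inside the loop and bail out early. The sketch below is an illustrative in-code guard built on the usage field that OpenAI-compatible responses return; the 100k token budget is an arbitrary assumption:

Python
MAX_TOKENS_PER_RUN = 100_000  # Illustrative budget; tune per workload

def exceeded_budget(responses) -> bool:
    # Sum usage across every call made so far and compare to the run budget
    total = sum(r.usage.total_tokens for r in responses)
    return total > MAX_TOKENS_PER_RUN

Collect each generation and evaluation response in a list and check exceeded_budget at the end of every iteration, breaking out of the loop alongside the score check.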

Choosing your pattern

| Pattern | Complexity | Latency | Cost Profile | Best For |
| --- | --- | --- | --- | --- |
| Sequential Pipeline | Low | High (additive) | Moderate | Linear workflows |
| Orchestrator Worker | Medium | Medium | Low (cheap workers) | Task decomposition |
| Parallel Fan Out | Medium | Low (concurrent) | High (multiple calls) | Independent subtasks |
| Router | Low | Low | Low (smart dispatch) | Request classification |
| Hierarchical | High | Variable | Variable | Enterprise scale |
| Evaluator Optimizer | Medium | High (iterative) | High (retries) | Quality critical output |

The golden rule: Start with the simplest pattern that solves your problem. Upgrade only when you have metrics showing the current pattern does not meet requirements.

Why observability is non negotiable

Multi agent systems are distributed systems. They have all the same failure modes: race conditions, cascading failures, cost explosions, and silent quality degradation. The difference is that each "service" is a probabilistic model that can fail in ways no unit test will catch.

Requesty's analytics give you the observability layer that multi agent systems need:

Cost by agent — Break down spend by agent role using request metadata tags. Identify which agents are the most expensive.

Latency by step — Track P50, P90, and P99 latency for each step in your pipeline with performance monitoring. Find bottlenecks before users notice.

Error rates — Monitor error rates per model and per agent. Automatic failover means errors do not become outages.

Session replay — Reconstruct full multi agent conversations for debugging with session reconstruction. No manual session ID management required.

Getting started

  1. Pick your pattern. Start with Sequential Pipeline unless you have a clear reason for something more complex.

  2. Set up Requesty. Sign up and create an API key. Point your OpenAI client at https://router.requesty.ai/v1.

  3. Create routing policies. Set up fallback policies for reliability and latency policies for speed. Use different policies for orchestrators vs workers.

  4. Tag everything. Use request metadata to tag each agent call with its role, step, and trace ID. This is what makes debugging possible.

  5. Set cost guardrails. Configure spend limits and alerts to prevent runaway costs from eval loops or recursive orchestrators.

Further reading

  • Fallback Policies — Automatic model failover for production reliability.
  • Load Balancing — Distribute traffic across models by cost, latency, or custom weights.
  • Request Metadata — Tag requests for granular analytics filtering.
  • Guardrails — Detect PII, secrets, and sensitive data in agent inputs and outputs.
  • Agent Harness — Why your LLM gateway is the backbone of production agents.
  • Routing Policies 101 — Fallback, load balancing, and latency in production.

Frequently asked questions

What are the six multi agent orchestration patterns?
The six patterns are Sequential Pipeline (fixed order execution), Orchestrator Worker (supervisor delegates to specialists), Parallel Fan Out (concurrent independent tasks), Router (dynamic dispatch to specialists), Hierarchical (multi-level coordination), and Evaluator Optimizer Loop (iterative quality refinement).
Which orchestration pattern should I start with?
Start with Sequential Pipeline unless you have a specific reason for something more complex. It is the simplest pattern and works for any workflow where each step depends on the previous step's output, like document processing or content production.
How do routing policies help with multi agent cost optimization?
Use routing policies to send orchestrator calls to frontier models and worker calls to cheaper models. This cuts costs 40 to 60% compared to running everything on a frontier model. With Requesty, you create policies once in the dashboard and reference them by name in your code.
Why do 40% of multi agent pilots fail?
The failure is almost never the individual agents. It is the orchestration: the wiring that coordinates state, message passing, and control flow between them. Anthropic's analysis found that 57% of project failures originated in orchestration design, not agent capability.
How do I prevent runaway costs in evaluator optimizer loops?
Set spend limits per API key or project using Requesty's API limits feature. Use spending alerts to get notified when an eval loop exceeds expected costs. Always set a MAX_ITERATIONS cap in your code.