Requesty
May '26 | Integrations / Best Practices | 15 min read

Structured Outputs Across LLM Providers: What Works, What Breaks, and How We Tested 244 Models

Thibault Jaigu
CEO & Co-Founder

You want your LLM to return a JSON object with three fields. Name, email, score. Not prose. Not markdown. Just the JSON.

Every major provider now supports this. They all call it structured output. They all use constrained decoding under the hood. And they all implement it with completely different API parameters, different schema restrictions, and different edge cases that show up when you switch models.

We decided to test this properly. We ran structured output requests against 244 models across 23 providers, using 3 API endpoints and 10 popular framework SDKs. Over 2,400 individual tests. Here is everything we found.

JSON mode vs structured output

Before the results, one important distinction. These are two different features that often get confused.

JSON mode (response_format: {type: "json_object"}): The model returns valid JSON, but there is no schema enforcement. It might return {"answer": 4} or {"result": {"value": 4, "unit": "count"}} or something else entirely. Your code has to handle whatever shape comes back. Almost every model supports this.

Structured output (response_format: {type: "json_schema", json_schema: {...}}): The model is constrained to match your exact schema. If you define {championships: int, summary: string}, that is exactly what you get. Every field, every type, every required property. This uses constrained decoding at the token generation level. Fewer providers support this because it requires deeper infrastructure changes.
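The toy sketch below shows what "constrained decoding" means in practice. Real providers compile your JSON Schema into a grammar over the model's full tokenizer vocabulary. Here the grammar, the seven-token vocabulary and the fixed scores are all made up for illustration. The grammar allows only `{"score": <1-3 digits>}`. At each step, the sketch masks out any token that would make the output an invalid prefix, then picks the highest-scoring token that survives.

```python
import json

# Made-up grammar: the only complete outputs allowed are {"score": <1-3 digits>}
HEAD = '{"score": '


def is_valid_prefix(text: str) -> bool:
    """True if `text` can still be extended into a string the grammar allows."""
    if len(text) <= len(HEAD):
        return HEAD.startswith(text)
    if not text.startswith(HEAD):
        return False
    rest = text[len(HEAD):]
    digits = rest[:-1] if rest.endswith("}") else rest
    # "".isdigit() is False, so a closing brace with no digits is rejected
    return digits.isdigit() and len(digits) <= 3


def is_complete(text: str) -> bool:
    return text.endswith("}") and is_valid_prefix(text)


# Toy "model": fixed token scores that favour chatty, non-JSON tokens
TOY_SCORES = {"Sure": 5.0, "```json": 4.0, " ": 3.0, "}": 2.5, "4": 2.0, "2": 1.5, HEAD: 1.0}


def constrained_greedy_decode(scores: dict[str, float], max_steps: int = 10) -> str:
    text = ""
    for _ in range(max_steps):
        # Mask: keep only tokens that leave the output a valid prefix of the grammar
        allowed = [tok for tok in scores if is_valid_prefix(text + tok)]
        if not allowed:
            break
        text += max(allowed, key=scores.__getitem__)
        if is_complete(text):
            break
    return text


output = constrained_greedy_decode(TOY_SCORES)
print(output)              # {"score": 4}
print(json.loads(output))  # {'score': 4}
```

The toy model's favourite tokens ("Sure" and a markdown fence) are never emitted, because no valid output can start with them. The schema becomes the grammar, so a non-conforming answer is never a possible outcome.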

Everything in this post is about structured output. The strict json_schema mode. That is what matters for production applications where you need guaranteed schema conformance.

How we tested

Every test sends the same structured output request: a JSON Schema requiring an integer field and a string field, with strict: true and additionalProperties: false. The model must return a response that matches the schema exactly. We parse and validate every response.
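This post does not reproduce the test harness. The block below is a hypothetical reconstruction of the "parse and validate" step, using only the standard library. It assumes the championships/summary fields from the earlier example stand in for the real test schema. Booleans are rejected as integers, because Python's `bool` is a subclass of `int`.

```python
import json

# Assumed stand-in for the test schema: one integer field, one string field, strict
TEST_SCHEMA = {
    "type": "object",
    "properties": {
        "championships": {"type": "integer"},
        "summary": {"type": "string"},
    },
    "required": ["championships", "summary"],
    "additionalProperties": False,
}

# Only the two types the test schema uses
_TYPE_CHECKS = {
    "integer": lambda v: isinstance(v, int) and not isinstance(v, bool),
    "string": lambda v: isinstance(v, str),
}


def check_response(raw: str, schema: dict = TEST_SCHEMA) -> tuple[bool, str]:
    """Return (passed, reason) for one model reply."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(data, dict):
        return False, "top-level value is not an object"
    missing = [key for key in schema["required"] if key not in data]
    if missing:
        return False, f"missing required fields: {missing}"
    extra = sorted(set(data) - set(schema["properties"]))
    if extra and schema.get("additionalProperties") is False:
        return False, f"unexpected fields: {extra}"
    for key, spec in schema["properties"].items():
        if key in data and not _TYPE_CHECKS[spec["type"]](data[key]):
            return False, f"{key!r} is not of type {spec['type']}"
    return True, "ok"
```

A reply like `{"answer": 4}`, which JSON mode would happily return, fails on missing fields. The same goes for a reply wrapped in prose or a markdown fence, which fails to parse at all.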

We tested across three axes:

  1. Three API endpoints: OpenAI Chat Completions (/v1/chat/completions), OpenAI Responses (/v1/responses), and Anthropic Messages (/v1/messages)
  2. Ten framework SDKs: Instructor, LangChain, LlamaIndex, LiteLLM, PydanticAI, DSPy, Haystack, Agno, Mirascope, and Vercel AI SDK
  3. 244 models across 23 providers: OpenAI, Anthropic, Google, Azure, Bedrock, Vertex, Mistral, xAI, DeepSeek, DeepInfra, Fireworks, Groq, Alibaba (Qwen), MiniMaxi, Moonshot, Nebius, Novita, Parasail, Perplexity, Together, Inceptron, zai, and more

All tests ran through Requesty's gateway at https://router.requesty.ai/v1. Every request used the same API key, the same schema, the same endpoint format. The only variable was the model.

The API fragmentation problem

Before looking at results, here is why testing this matters. Every provider implements structured output with a different parameter name and a different request shape.

OpenAI Chat Completions

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Extract the user info"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_info",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "email": {"type": "string"},
                    "score": {"type": "integer"}
                },
                "required": ["name", "email", "score"],
                "additionalProperties": False
            }
        }
    }
)
```

OpenAI Responses API

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.responses.create(
    model="gpt-4o",
    input="Extract the user info",
    text={
        "format": {
            "type": "json_schema",
            "name": "user_info",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "email": {"type": "string"},
                    "score": {"type": "integer"}
                },
                "required": ["name", "email", "score"],
                "additionalProperties": False
            }
        }
    }
)
```

Same provider, different API. The parameter moved from response_format to text.format, and the json_schema wrapper is gone: name, strict and schema now sit directly inside format.
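Only the wrapper moves, so converting a Chat Completions response_format into the Responses `text` parameter is mechanical. Here is a minimal sketch. The function name is our own and is not part of the OpenAI SDK.

```python
import copy


def to_responses_text(response_format: dict) -> dict:
    """Map a Chat Completions `response_format` onto the Responses API `text` parameter."""
    if response_format.get("type") != "json_schema":
        # json_object / text carry no schema, so only the type moves over
        return {"format": {"type": response_format["type"]}}
    # Lift name, strict, schema (and description, if present) out of the json_schema wrapper
    inner = copy.deepcopy(response_format["json_schema"])
    return {"format": {"type": "json_schema", **inner}}


chat_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "user_info",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {"score": {"type": "integer"}},
            "required": ["score"],
            "additionalProperties": False,
        },
    },
}
text_param = to_responses_text(chat_format)  # then: client.responses.create(..., text=text_param)
```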

Anthropic Messages

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Extract the user info"}],
    output_config={
        "format": {
            "type": "json_schema",
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "email": {"type": "string"},
                    "score": {"type": "integer"}
                },
                "required": ["name", "email", "score"],
                "additionalProperties": False
            }
        }
    }
)
```

No response_format. The parameter is output_config.format, and the wrapper has neither a name nor a strict field. Completely different shape.
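To map the Chat Completions request onto Anthropic's shape, drop name and strict and keep only the schema. The sketch below follows the compatibility matrix later in this post, which lists no schema-less JSON mode for Anthropic. The function name is hypothetical.

```python
import copy


def to_anthropic_output_config(response_format: dict) -> dict:
    """Map a Chat Completions `response_format` onto Anthropic's `output_config`."""
    if response_format.get("type") != "json_schema":
        # Anthropic has no schema-less JSON mode to translate json_object into
        raise ValueError("Anthropic structured output needs a json_schema response_format")
    inner = response_format["json_schema"]
    # Keep only the schema: `name` is not accepted and there is no `strict` toggle
    return {"format": {"type": "json_schema", "schema": copy.deepcopy(inner["schema"])}}


chat_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "user_info",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {"score": {"type": "integer"}},
            "required": ["score"],
            "additionalProperties": False,
        },
    },
}
output_config = to_anthropic_output_config(chat_format)  # then: client.messages.create(..., output_config=output_config)
```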

Google Gemini

```python
from google import genai

client = genai.Client()  # reads the Gemini API key from the environment

response = client.models.generate_content(
    model="gemini-2.5-pro",
    contents="Extract the user info",
    config={
        "response_mime_type": "application/json",
        "response_schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string"},
                "score": {"type": "integer"}
            },
            "required": ["name", "email", "score"]
        }
    }
)
```

Yet another parameter name: response_schema plus response_mime_type. Three providers, four APIs, four different shapes for the exact same feature.
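For Gemini, the schema moves into the config next to response_mime_type. This sketch also strips additionalProperties, mirroring the Gemini example above, which omits it. Check how your SDK version treats that keyword rather than assuming. The helper names are hypothetical.

```python
def _drop_additional_properties(node):
    """Return a copy of a schema with every additionalProperties keyword removed."""
    if isinstance(node, list):
        return [_drop_additional_properties(item) for item in node]
    if not isinstance(node, dict):
        return node
    out = {}
    for key, value in node.items():
        if key == "additionalProperties":
            continue
        if key == "properties":
            # Property names are user data: recurse into each sub-schema without filtering names
            out[key] = {name: _drop_additional_properties(sub) for name, sub in value.items()}
        else:
            out[key] = _drop_additional_properties(value)
    return out


def to_gemini_config(response_format: dict) -> dict:
    """Map a Chat Completions `response_format` onto a google-genai `config` dict."""
    if response_format.get("type") == "json_object":
        return {"response_mime_type": "application/json"}  # JSON without a schema
    schema = response_format["json_schema"]["schema"]
    return {
        "response_mime_type": "application/json",
        "response_schema": _drop_additional_properties(schema),
    }
```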

The compatibility matrix

| Feature | OpenAI (Chat) | OpenAI (Responses) | Anthropic | Google |
|---|---|---|---|---|
| Parameter name | response_format | text.format | output_config.format | response_schema |
| Schema type | json_schema | json_schema | json_schema | OpenAPI subset / JSON Schema |
| Strict mode | strict: true | strict: true | Always strict | Always strict |
| Requires name | Yes | Yes | No | No |
| additionalProperties | Must be false | Must be false | Must be false | Not required |
| Recursive schemas | Supported | Supported | Not supported | Supported |
| anyOf | Supported | Supported | Supported (limited) | Supported |
| JSON mode (no schema) | json_object | json_object | Not supported | application/json only |

Results: native API endpoints

Here is how every provider performed when tested directly with the OpenAI and Anthropic SDKs.

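| Provider | Models | Chat Completions | Responses | Messages |
|---|---|---|---|---|
| Alibaba (Qwen) | 7 | 7/7 | 7/7 | 7/7 |
| Anthropic | 9 | 9/9 | 9/9 | 9/9 |
| Azure | 8 | 6/6 | 8/8 | 8/8 |
| Bedrock | 7 | 7/7 | 7/7 | 7/7 |
| DeepInfra | 20 | 15/20 | 15/20 | 16/20 |
| DeepSeek | 4 | 0/4 | 0/4 | 4/4 |
| Fireworks | 7 | 7/7 | 7/7 | 7/7 |
| Google | 9 | 7/9 | 7/9 | 7/9 |
| Groq | 2 | 2/2 | 2/2 | 2/2 |
| Inceptron | 3 | 3/3 | 3/3 | 3/3 |
| MiniMaxi | 5 | 5/5 | 5/5 | 5/5 |
| Mistral | 13 | 13/13 | 13/13 | 13/13 |
| Moonshot | 7 | 7/7 | 7/7 | 7/7 |
| Nebius | 7 | 7/7 | 7/7 | 7/7 |
| Novita | 35 | 11/35 | 11/35 | 28/35 |
| OpenAI | 28 | 24/28 | 24/28 | 25/28 |
| OpenAI Responses | 23 | n/a | 23/23 | 23/23 |
| Parasail | 5 | 4/5 | 5/5 | 5/5 |
| Perplexity | 3 | 3/3 | 3/3 | 3/3 |
| Together | 4 | 3/4 | 3/4 | 3/4 |
| Vertex | 23 | 20/23 | 20/23 | 22/23 |
| xAI | 10 | 10/10 | 10/10 | 10/10 |
| zai | 5 | 4/5 | 5/5 | 5/5 |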

Summary by endpoint

| Endpoint | Pass | Tested | Rate |
|---|---|---|---|
| Anthropic Messages | 226 | 244 | 93% |
| OpenAI Responses | 201 | 244 | 82% |
| OpenAI Chat Completions | 174 | 219 | 79% |

The Anthropic Messages endpoint has the highest pass rate because Requesty translates json_schema into Anthropic's native constrained decoding format, and that translation is clean. For OpenAI-compatible endpoints, Requesty passes json_schema through to the upstream provider. Providers that support it natively pass. Providers that have not implemented it yet return an error.

199 models pass all native endpoints

Out of 244 models tested, 199 passed structured output on every endpoint where they were tested. That is 82% of all models with full native endpoint coverage. These models return schema-conforming JSON regardless of whether you call them via Chat Completions, Responses, or Messages.

Providers with 100% pass rate

14 out of 23 providers passed structured output on every model and every endpoint:

Alibaba (Qwen), Anthropic, Azure, Bedrock, Fireworks, Groq, Inceptron, MiniMaxi, Mistral, Moonshot, Nebius, OpenAI Responses, Perplexity, xAI.


Results: SDK compatibility

We tested 10 popular framework SDKs against 159 models (recent models released in the last 6 months). Every SDK sends structured output requests through Requesty's OpenAI-compatible API.

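| SDK | Pass | Tested | Rate |
|---|---|---|---|
| DSPy | 144 | 159 | 91% |
| Agno (Anthropic mode) | 144 | 159 | 91% |
| LiteLLM | 143 | 159 | 90% |
| LangChain | 142 | 159 | 89% |
| Haystack | 142 | 159 | 89% |
| LlamaIndex | 139 | 159 | 87% |
| PydanticAI | 137 | 159 | 86% |
| Instructor | 128 | 159 | 81% |
| Mirascope | 127 | 159 | 80% |
| Vercel AI SDK | 120 | 159 | 75% |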

All 10 SDKs work with structured outputs through Requesty. The pass rate differences come from how each SDK formats the request and parses the response.

DSPy, LiteLLM, LangChain, and Haystack have the highest pass rates (89-91%) because they send clean OpenAI-compatible requests and parse responses without extra transformation.

Instructor and Mirascope are slightly lower because they add retry/validation logic that can trip on certain provider error formats.

Vercel AI SDK has the lowest pass rate at 75% because some interactions with the Responses endpoint require specific response fields that not all providers return.

SDK compatibility by provider

Here is how each provider performed across all 10 framework SDKs. This shows which providers have the broadest SDK compatibility.

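| Provider | Instructor | LangChain | LlamaIndex | LiteLLM | PydanticAI | DSPy | Haystack | Mirascope | Agno | Vercel |
|---|---|---|---|---|---|---|---|---|---|---|
| Alibaba | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | - |
| Anthropic | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 88% |
| Azure | 100% | 100% | 100% | 100% | 100% | 50% | 100% | 100% | 100% | 100% |
| Bedrock | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 71% | 85% |
| DeepInfra | 85% | 85% | 85% | 100% | 100% | 100% | 85% | 85% | 100% | 57% |
| Fireworks | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 85% | 100% |
| Google | 50% | 75% | 75% | 75% | 75% | 87% | 75% | 62% | 75% | 75% |
| Groq | 100% | 100% | 100% | 100% | 50% | 100% | 100% | 50% | 100% | 100% |
| MiniMaxi | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | - |
| Mistral | 83% | 100% | 100% | 100% | 100% | 100% | 100% | 83% | 100% | 83% |
| Moonshot | 71% | 100% | 71% | 100% | 71% | 100% | 100% | 71% | 100% | 100% |
| Nebius | 71% | 100% | 100% | 100% | 71% | 100% | 100% | 71% | 100% | 71% |
| OpenAI | 89% | 85% | 85% | 85% | 89% | 64% | 85% | 89% | 89% | 85% |
| Parasail | 80% | 100% | 100% | 100% | 80% | 100% | 100% | 80% | 100% | 100% |
| Perplexity | - | 100% | 100% | 100% | - | 100% | 100% | - | 100% | 100% |
| Together | 50% | 75% | 75% | 75% | 75% | 75% | 75% | 50% | 50% | 50% |
| Vertex | 71% | 85% | 80% | 85% | 85% | 100% | 85% | 66% | 85% | 85% |
| xAI | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 100% | 88% | 100% |
| zai | 60% | 100% | 100% | 100% | 100% | 100% | 100% | 60% | 100% | - |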

Key takeaway: Alibaba, Anthropic, MiniMaxi, Fireworks, and xAI have near-perfect SDK compatibility across the board. If you are building with multiple SDKs and want maximum compatibility, these providers are the safest choices.


The 43 models that pass everything

These models returned correct structured output across all 3 endpoints and all 10 framework SDKs, with Agno run in both Anthropic and OpenAI Chat modes, for 14 tests per model. Zero failures. If you want maximum compatibility, start here.

| Provider | Models |
|---|---|
| OpenAI | gpt-5.5, gpt-5.4, gpt-5.4-mini, gpt-5.3, gpt-5.2, gpt-5.1, gpt-5, gpt-4.1, gpt-4.1-mini, gpt-4.1-nano, gpt-4o, gpt-4o-mini |
| Anthropic | claude-opus-4-6, claude-opus-4-5, claude-opus-4-1, claude-opus-4, claude-sonnet-4-6, claude-sonnet-4-5, claude-sonnet-4, claude-haiku-4-5 |
| Vertex | claude-opus-4-6, claude-opus-4-5, gemini-3.5-flash, gemini-2.5-flash-lite |
| xAI | grok-4, grok-4-1-fast-non-reasoning, grok-3-mini |
| Azure | gpt-4.1, gpt-4.1-nano |
| Bedrock | claude-sonnet-4-6, claude-sonnet-4-5 |
| Fireworks | minimax-m2.5, qwen3.6-plus |
| Parasail | gemma-4-26B, qwen25-vl-72b, qwen3-235b |
| Google | gemini-2.5-flash |
| Groq | gpt-oss-20b |
| Together | Llama-3.3-70B-Instruct-Turbo |

42 more models at 13/14

An additional 42 models pass 13 of their 14 tests. The only failure is Agno in OpenAI Chat mode, which is a known SDK-level issue (Agno sends a developer role that some providers reject). These models include:

Mistral: mistral-large-latest, mistral-medium-latest, mistral-small-latest, mistral-small-2603

Moonshot: kimi-k2.5, kimi-k2.6, kimi-k2-thinking, kimi-k2-thinking-turbo, kimi-k2-turbo-preview

Nebius: DeepSeek-V3.2, Llama-3.3-70B, nemotron-3-nano-omni, gpt-oss-120b, glm-5.1

DeepInfra: Qwen3-235B, Qwen3-Coder-480B, DeepSeek-V3, DeepSeek-V3.1

Vertex: claude-haiku-4-5, claude-sonnet-4-5, claude-sonnet-4-6, gemini-2.5-flash, gemini-2.5-pro

xAI: grok-3, grok-4-fast, grok-4.2-beta, grok-4.3

That brings the total to 85 models with near-perfect structured output compatibility across every SDK.


Where structured output does not work (and why)

Not every model supports json_schema. Here is a breakdown of what fails and why.

Provider does not support json_schema

Some providers have not implemented constrained decoding for json_schema at the API level. These providers typically support json_object (JSON mode) but not schema enforcement.

| Provider | Models affected | Status |
|---|---|---|
| DeepSeek | deepseek-chat, deepseek-v4-flash, deepseek-v4-pro, deepseek-reasoner | JSON mode works. json_schema returns "This response_format type is unavailable now" |
| Novita (most models) | 24 of 35 models | Provider returns Bad Request for json_schema. A few larger models (deepseek-v3.2, gemma-4, llama-3-70b) do work |

For DeepSeek specifically: if you need structured output, route through the Anthropic Messages endpoint via Requesty. Requesty translates the schema into a format DeepSeek can handle, and all 4 DeepSeek models pass on the Messages endpoint.
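You might call DeepSeek, or another JSON-mode-only provider, directly and be unable to use that route. One client-side fallback is to retry in JSON mode and check the shape yourself. This is a sketch, not Requesty's behaviour. `send` is a hypothetical function you supply. It performs the request with the given response_format and returns the message text. When the provider rejects json_schema, it raises the hypothetical `JsonSchemaUnsupported`.

```python
import json
from typing import Callable


class JsonSchemaUnsupported(Exception):
    """Hypothetical error your client wrapper raises when a provider rejects json_schema."""


def request_structured(send: Callable[[dict], str], response_format: dict) -> dict:
    """Try strict json_schema first; if rejected, retry in JSON mode and check fields locally."""
    try:
        return json.loads(send(response_format))
    except JsonSchemaUnsupported:
        pass
    # JSON mode only guarantees parseable JSON, so the schema check moves into our code
    data = json.loads(send({"type": "json_object"}))
    schema = response_format["json_schema"]["schema"]
    missing = [key for key in schema.get("required", []) if key not in data]
    if missing:
        raise ValueError(f"JSON-mode reply is missing required fields: {missing}")
    return data
```

The fallback only checks required keys. Types and extra fields are no longer enforced, which is exactly the gap between JSON mode and structured output.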

Model returned invalid JSON

Some models accept the json_schema parameter but return JSON that does not match the schema, or return empty/malformed responses.

ProviderModels affectedDetails
DeepInfraQwen2.5-Coder-32B, Qwen3-32B, DeepSeek-R1, DeepSeek-R1-Distill-Llama-70BSmaller/distilled models that struggle with strict schema adherence
Novitadeepseek-r1-distill-* variantsSame issue with distilled models

Image generation models

Image generation models (vertex/gemini-2.5-flash-image, vertex/gemini-3-pro-image-preview, google/gemini-3.1-flash-image-preview) do not support structured text output. This is expected. These models are designed for image generation, not text extraction.

Responses-only models

OpenAI's newest reasoning models (gpt-5.4-pro, gpt-5.5-pro) only work on the /v1/responses endpoint. They are not available on Chat Completions. This is an OpenAI platform constraint, not a structured output issue. Through Requesty, these models work perfectly on the Responses endpoint.


The subtle breaks to watch for

Beyond provider-level support, there are schema-level differences that cause silent failures when switching providers.

Recursive schemas

If you are extracting nested data (a file tree, a comment thread, a recursive category structure), OpenAI and Google handle recursive $ref in schemas. Anthropic does not. The schema is valid JSON Schema and compiles fine on OpenAI, so nothing flags it until you switch to Claude and the request fails with a schema compilation error at request time.
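One way to catch this before a request reaches any provider is to build the $ref graph and look for cycles. The sketch below handles only local references: "#/$defs/...", "#/definitions/..." and the root "#". Remote refs and JSON Pointer escapes are out of scope. The function name is our own.

```python
def recursive_defs(schema: dict) -> set[str]:
    """Names of definitions that can reach themselves through $ref ("#" means the root)."""
    defs = {**schema.get("definitions", {}), **schema.get("$defs", {})}

    def refs_in(node) -> set[str]:
        found = set()
        if isinstance(node, dict):
            ref = node.get("$ref")
            if ref == "#":
                found.add("#")
            elif isinstance(ref, str) and ref.startswith(("#/$defs/", "#/definitions/")):
                found.add(ref.rsplit("/", 1)[-1])
            for value in node.values():
                found |= refs_in(value)
        elif isinstance(node, list):
            for item in node:
                found |= refs_in(item)
        return found

    # Edges: each definition (and the root) points at the definitions it references
    root = {k: v for k, v in schema.items() if k not in ("$defs", "definitions")}
    graph = {"#": refs_in(root), **{name: refs_in(body) for name, body in defs.items()}}

    recursive = set()
    for start in graph:
        stack, seen = list(graph[start]), set()
        while stack:  # depth-first search for a path leading back to `start`
            name = stack.pop()
            if name == start:
                recursive.add(start)
                break
            if name not in seen and name in graph:
                seen.add(name)
                stack.extend(graph[name])
    return recursive
```

A non-empty result means the schema needs flattening before it can go to Anthropic.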

The name field

OpenAI requires a name field on the schema wrapper (both in Chat Completions and Responses, though in different positions). Anthropic does not accept one. Google does not need one. If you have a shared schema definition that includes name, it works on OpenAI and breaks on Anthropic.

Strict mode semantics

On OpenAI, you opt into strict enforcement with strict: true. Without it, the model will try to follow your schema but is not guaranteed to. On Anthropic, all structured output is strict by default. There is no non-strict mode. On Google, it is also always strict.

additionalProperties

OpenAI and Anthropic require additionalProperties: false for strict schemas, and the providers fail differently when you get it wrong. OpenAI returns a clear error. Anthropic returns a schema compilation error. Google does not require it, and silently allows extra fields in some SDK versions but blocks them in others.

The Anthropic beta migration

Anthropic shipped structured outputs in November 2025 as a beta with output_format and a required header. The GA release in 2026 moved it to output_config.format with no header required. Both work today. But if you followed a tutorial from six months ago, you are using the beta shape. When the transition period ends, your code breaks.
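If your call sites still pass the beta shape, the move is a small rewrite of the request arguments. The sketch below assumes your code builds a kwargs dict for messages.create. It deliberately leaves any beta header configuration alone, since you may rely on other betas. The function name is hypothetical.

```python
def migrate_beta_kwargs(kwargs: dict) -> dict:
    """Return a copy of messages.create kwargs with beta output_format moved to output_config.format."""
    migrated = dict(kwargs)
    fmt = migrated.pop("output_format", None)
    if fmt is not None:
        # Merge into any existing output_config instead of overwriting it
        config = dict(migrated.get("output_config") or {})
        config["format"] = fmt
        migrated["output_config"] = config
    return migrated
```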


How Requesty handles all of this

When you route through Requesty, you send a single OpenAI-compatible request. One parameter format. One schema shape. Requesty handles the translation.

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",
    api_key="your_requesty_key"
)

response = client.chat.completions.create(
    model="anthropic/claude-sonnet-4-5",  # or any of 244 models
    messages=[{"role": "user", "content": "Extract the user info"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_info",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    "email": {"type": "string"},
                    "score": {"type": "integer"}
                },
                "required": ["name", "email", "score"],
                "additionalProperties": False
            }
        }
    }
)
```

What Requesty does under the hood:

  • OpenAI models: response_format passed through as-is
  • Anthropic models: translated to output_config.format, name field stripped, schema validated against Anthropic's restrictions
  • Google/Vertex models: translated to response_schema with response_mime_type set
  • All other providers: translated to whatever format the provider expects

When you switch models from openai/gpt-5.1 to anthropic/claude-sonnet-4-5 to vertex/gemini-2.5-flash, your structured output configuration does not change. When your fallback policy routes a failed request from one provider to another, the schema translation happens automatically.
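Requesty performs failover server-side. The hypothetical client-side sketch below shows why a single response_format makes failover trivial. `call` is a function you supply that sends one request for a given model and format and raises on failure. A thin wrapper over client.chat.completions.create pointed at the gateway would do.

```python
from typing import Callable, Sequence

# One structured output configuration, reused unchanged for every model in the chain
USER_INFO_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "user_info",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "email": {"type": "string"},
                "score": {"type": "integer"},
            },
            "required": ["name", "email", "score"],
            "additionalProperties": False,
        },
    },
}


def complete_with_fallback(
    models: Sequence[str], call: Callable[[str, dict], str]
) -> tuple[str, str]:
    """Try each model in order with the same response_format; return (model, reply text)."""
    errors = []
    for model in models:
        try:
            return model, call(model, USER_INFO_FORMAT)
        except Exception as exc:  # a real client would catch narrower API error types
            errors.append(f"{model}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```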

Every SDK works through one base URL

All 10 SDKs we tested point at https://router.requesty.ai/v1 as the base URL. No SDK-specific configuration. No per-provider adapters.

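```python
import dspy
import instructor
import litellm
from haystack.components.generators.chat import OpenAIChatGenerator
from langchain_openai import ChatOpenAI
from openai import OpenAI
from pydantic import BaseModel
from pydantic_ai import Agent
from pydantic_ai.models.openai import OpenAIModel
from pydantic_ai.providers.openai import OpenAIProvider

class UserInfo(BaseModel):
    name: str
    email: str
    score: int

messages = [{"role": "user", "content": "Extract the user info"}]

# Instructor
client = instructor.from_openai(OpenAI(base_url="https://router.requesty.ai/v1"))
result = client.chat.completions.create(model="anthropic/claude-sonnet-4-5", response_model=UserInfo, messages=messages)

# LangChain
llm = ChatOpenAI(model="vertex/gemini-2.5-flash", base_url="https://router.requesty.ai/v1")
chain = llm.with_structured_output(UserInfo)

# PydanticAI
model = OpenAIModel("openai/gpt-5.1", provider=OpenAIProvider(base_url="https://router.requesty.ai/v1"))
agent = Agent(model, output_type=UserInfo)

# LiteLLM (synchronous; use litellm.acompletion inside async code)
response = litellm.completion(model="xai/grok-4", base_url="https://router.requesty.ai/v1", messages=messages, response_format=UserInfo)

# DSPy
lm = dspy.LM("openai/mistral/mistral-large-latest", api_base="https://router.requesty.ai/v1")
dspy.configure(lm=lm)

# Haystack
generator = OpenAIChatGenerator(model="fireworks/minimax-m2.5", api_base_url="https://router.requesty.ai/v1")
```

```typescript
// Vercel AI SDK
import { createOpenAI } from "@ai-sdk/openai"
import { generateObject } from "ai"
import { z } from "zod"

const provider = createOpenAI({ baseURL: "https://router.requesty.ai/v1" })
const UserInfo = z.object({ name: z.string(), email: z.string(), score: z.number().int() })
const result = await generateObject({ model: provider("anthropic/claude-sonnet-4-5"), schema: UserInfo, prompt: "Extract the user info" })
```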

Same pattern for LlamaIndex, Mirascope, and Agno. One integration, 244 models, 23 providers.


Practical recommendations

Always set additionalProperties: false. OpenAI and Anthropic require it for strict schemas, so make it a default in your schema definitions.
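You can automate that default. The sketch below walks a schema and adds additionalProperties: false to every object schema that does not already set the keyword. It recurses through properties, items, $defs and anyOf alike. The helper name is our own.

```python
def close_objects(schema):
    """Return a copy of `schema` with additionalProperties: False on every object schema."""
    if isinstance(schema, list):
        return [close_objects(item) for item in schema]
    if not isinstance(schema, dict):
        return schema
    out = {}
    for key, value in schema.items():
        if key in ("properties", "$defs", "definitions"):
            # Keys here are names, not keywords: recurse into each named sub-schema
            out[key] = {name: close_objects(sub) for name, sub in value.items()}
        else:
            out[key] = close_objects(value)
    schema_type = out.get("type")
    if schema_type == "object" or (isinstance(schema_type, list) and "object" in schema_type):
        out.setdefault("additionalProperties", False)  # keeps any explicit value
    return out
```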

Avoid recursive schemas if you need provider portability. Anthropic does not support them. If your data is naturally recursive, flatten it to a fixed depth in the schema and handle deeper nesting in your application code.
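For example, you can unroll a comment thread to a fixed number of reply levels instead of using $ref. Here is a sketch with hypothetical field names (author, text, replies).

```python
def comment_schema(depth: int) -> dict:
    """Comment-thread schema unrolled to `depth` levels of replies, with no $ref."""
    node = {
        "type": "object",
        "properties": {
            "author": {"type": "string"},
            "text": {"type": "string"},
        },
        "required": ["author", "text"],
        "additionalProperties": False,
    }
    if depth > 0:
        # Each level embeds a copy of the next level instead of a recursive $ref
        node["properties"]["replies"] = {"type": "array", "items": comment_schema(depth - 1)}
        node["required"].append("replies")
    return node


def nesting_depth(schema: dict) -> int:
    """How many reply levels a schema produced by comment_schema allows."""
    replies = schema["properties"].get("replies")
    return 0 if replies is None else 1 + nesting_depth(replies["items"])
```

The schema grows by one embedded level per unit of depth, so keep the depth small. Anything deeper than you chose gets handled in application code.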

Do not rely on name being present or absent. OpenAI requires it, Anthropic rejects it, Google ignores it. If you need to support multiple providers without a gateway, conditionally add or strip the name field based on the target.

Test schema compilation, not just schema correctness. A schema can be valid JSON Schema but fail compilation on a specific provider due to unsupported keywords, nesting depth, or complexity limits. Anthropic has a 180-second compilation timeout and a limit of 24 optional parameters across all schemas in a request.
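You can check Anthropic's optional-parameter limit locally. The sketch below assumes that an optional parameter means a property not listed in its object's required array. Confirm the exact counting rules in Anthropic's documentation.

```python
def count_optional_properties(schema) -> int:
    """Count properties not listed in their object's `required` array, at any depth."""
    if isinstance(schema, list):
        return sum(count_optional_properties(item) for item in schema)
    if not isinstance(schema, dict):
        return 0
    total = 0
    props = schema.get("properties")
    if isinstance(props, dict):
        required = set(schema.get("required", []))
        total += sum(1 for name in props if name not in required)
        total += sum(count_optional_properties(sub) for sub in props.values())
    for key, value in schema.items():
        if key != "properties":  # already handled above
            total += count_optional_properties(value)
    return total


ANTHROPIC_OPTIONAL_LIMIT = 24  # limit quoted above, across all schemas in one request


def fits_optional_limit(schemas: list) -> bool:
    return sum(count_optional_properties(s) for s in schemas) <= ANTHROPIC_OPTIONAL_LIMIT
```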

Pick models from the 100% compatibility list. If your application depends on structured output working reliably across SDKs and endpoints, start with the 43 models that pass every test. That list covers all the major model families: GPT-5.x, Claude 4.x, Gemini 2.5+, Grok 4, and several open-source models.

Use a gateway. If you are calling more than one provider, or might need to in the future, normalizing the structured output parameter at the gateway level saves you from maintaining translation code for every provider. Requesty gives you $10 free to test this with your existing schemas across all 244 models.

Methodology and updates

All tests were run in May 2026. We will keep running these tests as providers ship updates. The test code is open and runs as part of our CI/CD pipeline, so compatibility data stays current.

Your JSON schema should describe your data, not your provider.

Frequently asked questions

What is structured output in LLM APIs?
Structured output is a feature where the LLM provider uses constrained decoding to guarantee that the model's response matches a JSON Schema you provide. Unlike prompting the model to return JSON (which can fail), structured output enforces the schema at the token generation level. The model literally cannot produce tokens that would violate your schema. Every major provider now supports some form of this, but they all use different parameter names and API shapes.
What is the difference between JSON mode and structured output?
JSON mode (response_format type json_object) guarantees syntactically valid JSON but does not enforce a schema. The model might return any valid JSON object with any fields and types. Structured output (response_format type json_schema) goes further by enforcing a specific JSON Schema, guaranteeing correct field names, types, required properties, and nesting. If you are shipping to production, you almost always want structured output over JSON mode.
Which LLM providers support structured outputs in 2026?
All major providers support structured outputs: OpenAI, Anthropic, Google, Azure, Bedrock, Vertex, Mistral, xAI, Fireworks, Groq, and more. We tested 244 models across 23 providers. Some smaller providers like DeepSeek support JSON mode but not full json_schema enforcement. The full compatibility results are in this post.
Which SDKs work with structured outputs through Requesty?
We tested 10 popular SDKs: Instructor, LangChain, LlamaIndex, LiteLLM, PydanticAI, DSPy, Haystack, Agno, Mirascope, and Vercel AI SDK. All of them work with structured outputs through Requesty. Pass rates range from 75% to 91% across all models, with most failures isolated to a few providers that do not support json_schema at the API level.
How does Requesty handle structured output across providers?
Requesty normalizes structured output parameters at the gateway level. You send a single OpenAI-compatible request with response_format, and Requesty translates it to whatever parameter the downstream provider expects, whether that is output_config for Anthropic, response_schema for Google, or the native format for any other provider. When you switch models or fail over between providers, your structured output configuration stays the same.