Best AI models for tool use and agents
τ²-Bench measures multi-turn agentic tool use: calling functions, following policies, and completing realistic tasks over many turns. If you are building agents or tool-calling workflows, this predicts real-world reliability better than single-shot benchmarks.
- 🥇
GLM-5Z AI·$1.00 / $3.20 per 1M98.2%98.2% - 🥈
GLM-5.1Z AI·$1.40 / $4.40 per 1M97.7%97.7% - 🥉
qwen3.6-plusAlibaba Cloud·$0.50 / $3.00 per 1M97.7%97.7% - 4grok-4.3xAI Corp.·$1.25 / $2.50 per 1M97.7%97.7%
- 5
deepseek-v4-proDeepSeek·$0.43 / $0.87 per 1M96.2%96.2% - 6
kimi-k2.6Moonshot AI·$0.95 / $4.00 per 1M95.9%95.9% - 7
kimi-k2.5Moonshot AI·$0.60 / $3.00 per 1M95.9%95.9% - 8
GLM-4.7Z AI·$0.60 / $2.20 per 1M95.9%95.9% - 9
gemini-3.1-pro-previewGoogle LLC (Gemini API)·$2.00 / $12.00 per 1M95.6%95.6% - 10
qwen/qwen3.5-397b-a17bNovita AI·$0.60 / $3.60 per 1M95.6%95.6% - 11MiniMax-M2.5MiniMax·$0.30 / $1.20 per 1M95.3%95.3%
- 12
gemini-3.5-flashGoogle LLC (Vertex AI)·$1.50 / $9.00 per 1M95.3%95.3% - 13
deepseek-v4-flashDeepSeek·$0.14 / $0.28 per 1M95.0%95.0% - 14
xiaomimimo/mimo-v2-proNovita AI·$2.00 / $6.00 per 1M95.0%95.0% - 15
xiaomimimo/mimo-v2-flashNovita AI·$0.10 / $0.30 per 1M95.0%95.0% - 16
qwen3.7-maxAlibaba Cloud·$2.50 / $7.50 per 1M94.7%94.7% - 17claude-opus-4-8Anthropic PBC·$5.00 / $25.00 per 1M94.4%94.4%
- 18
mistral-medium-3-5Mistral AI SAS·$1.50 / $7.50 per 1M94.2%94.2% - 19
XiaomiMiMo/MiMo-V2.5-ProDeepInfra Inc.·$1.00 / $3.00 per 1M94.2%94.2% - 20
gpt-5.5OpenAI Inc.·$5.00 / $30.00 per 1M93.9%93.9% - 21
Qwen/Qwen3.5-27BDeepInfra Inc.·$0.26 / $2.60 per 1M93.9%93.9% - 22
kimi-k2Google LLC (Vertex AI)·$0.60 / $2.50 per 1M93.0%93.0% - 23
inclusionai/ring-2.6-1tNovita AI·$0.30 / $2.50 per 1M92.4%92.4% - 24claude-opus-4-6Anthropic PBC·$5.00 / $25.00 per 1M92.1%92.1%
- 25
gpt-5.2-codexOpenAI Responses·$1.75 / $14.00 per 1M92.1%92.1% - 26
deepseek-v3.2Google LLC (Vertex AI)·$0.56 / $1.68 per 1M90.6%90.6% - 27grok-3-minixAI Corp.·$0.30 / $0.50 per 1M90.4%90.4%
- 28
inclusionai/ling-2.6-1tNovita AI·$0.30 / $2.50 per 1M89.8%89.8% - 29claude-opus-4-5Anthropic PBC·$5.00 / $25.00 per 1M89.5%89.5%
- 30
Qwen/Qwen3.5-35B-A3BDeepInfra Inc.·$0.14 / $1.00 per 1M89.2%89.2%
Explore other rankings
How we rank
Scores for τ²-Bench come from Artificial Analysis, an independent AI benchmarking service. When a model is available through multiple providers (e.g. Anthropic direct, AWS Bedrock, Google Vertex), we show one canonical entry per model family so the ranking isn't polluted by duplicates. Benchmarks measure specific skills — always validate on your own workload before committing.
One API for every model on this list
Requesty is OpenAI-compatible and routes to 400+ models. Switch between any of the models above by changing one parameter in your code.
Get started free