Best AI models for tool use and agents

τ²-Bench measures multi-turn agentic tool use: calling functions, following policies, and completing realistic tasks over many turns. If you are building agents or tool-calling workflows, this predicts real-world reliability better than single-shot benchmarks.

Explore other rankings

Smartest overall

Ranked by Intelligence Index

Best for coding

Ranked by Coding Index

Best coding agent

Ranked by Terminal-Bench Hard

Best for reasoning

Ranked by GPQA Diamond

Lowest input + output price per 1M tokens

Longest context

Max tokens in a single prompt

How we rank

Scores for τ²-Bench come from Artificial Analysis, an independent AI benchmarking service. When a model is available through multiple providers (e.g. Anthropic direct, AWS Bedrock, Google Vertex), we show one canonical entry per model family so the ranking isn't polluted by duplicates. Benchmarks measure specific skills — always validate on your own workload before committing.

One API for every model on this list

Requesty is OpenAI-compatible and routes to 600+ models. Switch between any of the models above by changing one parameter in your code.