DeepInfra Inc.

Serverless inference for machine learning models.

📍 đŸ‡ș🇾 US ‱ 18 models available
Available Models: 18
Avg Input Price/M: $0.33
Cheapest Model: $0.02 (deepinfra/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo)
Most Expensive: $0.85 (deepinfra/deepseek-ai/DeepSeek-V3)
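All prices on this page are quoted per million tokens. As a quick illustration (not an official calculator), here is how a single request's cost follows from those rates, using the cheapest listed model's $0.02 input / $0.05 output pricing:

```python
# Rough per-request cost estimate from per-million-token prices.
# The rates below are the cheapest figures listed on this page ($0.02 in / $0.05 out).
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    return (input_tokens * input_price_per_m
            + output_tokens * output_price_per_m) / 1_000_000

# Example: a 2,000-token prompt with a 500-token completion.
cost = request_cost(2_000, 500, input_price_per_m=0.02, output_price_per_m=0.05)
print(f"${cost:.6f}")  # -> $0.000065
```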

Features Overview

Vision Support: 0
Advanced Reasoning: 0
Caching Support: 0
Computer Use: 0

Privacy & Data Policy

Data Retention: No data retention
Location: đŸ‡ș🇾 US

All DeepInfra Inc. Models

Context Window: 131K tokens ‱ Max Output: Unlimited ‱ Input: $0.23/M tokens ‱ Output: $0.40/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

Context Window: 164K tokens ‱ Max Output: Unlimited ‱ Input: $0.30/M tokens ‱ Output: $1.00/M tokens
DeepInfra Inc.

zai-org/GLM-4.5

Context Window: 131K tokens ‱ Max Output: 4K tokens ‱ Input: $0.60/M tokens ‱ Output: $2.20/M tokens

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Context Window: 131K tokens ‱ Max Output: 4K tokens ‱ Input: $0.35/M tokens ‱ Output: $0.40/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

Context Window: 16K tokens ‱ Max Output: Unlimited ‱ Input: $0.07/M tokens ‱ Output: $0.16/M tokens

Qwen3, the latest generation in the Qwen large language model series, features both dense and mixture-of-experts (MoE) architectures to excel in reasoning, multilingual support, and advanced agent tasks. Its unique ability to switch seamlessly between a thinking mode for complex reasoning and a non-thinking mode for efficient dialogue ensures versatile, high-quality performance. Significantly outperforming prior models like QwQ and Qwen2.5, Qwen3 delivers superior mathematics, coding, commonsense reasoning, creative writing, and interactive dialogue capabilities. The Qwen3-30B-A3B variant includes 30.5 billion parameters (3.3 billion activated), 48 layers, 128 experts (8 activated per task), and supports up to 131K token contexts with YaRN, setting a new standard among open-source models.
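The thinking/non-thinking switch described above is exposed in Qwen's own model cards through the chat template's enable_thinking flag; the sketch below follows that Hugging Face convention and is only illustrative, since hosted endpoints such as DeepInfra may surface the toggle differently or not at all:

```python
# Sketch of Qwen3's thinking-mode switch as documented in the Qwen3 model cards.
# How (or whether) a serverless provider forwards this flag is provider-specific.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-30B-A3B")
messages = [{"role": "user", "content": "Summarize the trade-offs of MoE models."}]

# Thinking mode (default): the model emits a reasoning trace before its answer.
prompt_thinking = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: faster, direct answers for routine dialogue.
prompt_direct = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```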

Context Window: 131K tokens ‱ Max Output: Unlimited ‱ Input: $0.02/M tokens ‱ Output: $0.05/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

DeepInfra Inc.

microsoft/phi-4

Context Window: 16K tokens ‱ Max Output: Unlimited ‱ Input: $0.07/M tokens ‱ Output: $0.14/M tokens

Phi-4-reasoning-plus is an enhanced 14B parameter model from Microsoft, fine-tuned from Phi-4 with additional reinforcement learning to boost accuracy on math, science, and code reasoning tasks. It uses the same dense decoder-only transformer architecture as Phi-4, but generates longer, more comprehensive outputs structured into a step-by-step reasoning trace and final answer. While it offers improved benchmark scores over Phi-4-reasoning across tasks like AIME, OmniMath, and HumanEvalPlus, its responses are typically ~50% longer, resulting in higher latency. Designed for English-only applications, it is well-suited for structured reasoning workflows where output quality takes priority over response speed.

Context Window: 131K tokens ‱ Max Output: Unlimited ‱ Input: $0.23/M tokens ‱ Output: $0.40/M tokens

Qwen3, the latest generation in the Qwen large language model series, features both dense and mixture-of-experts (MoE) architectures to excel in reasoning, multilingual support, and advanced agent tasks. Its unique ability to switch seamlessly between a thinking mode for complex reasoning and a non-thinking mode for efficient dialogue ensures versatile, high-quality performance. Significantly outperforming prior models like QwQ and Qwen2.5, Qwen3 delivers superior mathematics, coding, commonsense reasoning, creative writing, and interactive dialogue capabilities. The Qwen3-30B-A3B variant includes 30.5 billion parameters (3.3 billion activated), 48 layers, 128 experts (8 activated per task), and supports up to 131K token contexts with YaRN, setting a new standard among open-source models.

DeepInfra Inc.

Qwen/Qwen3-32B

Context Window: 40K tokens ‱ Max Output: Unlimited ‱ Input: $0.10/M tokens ‱ Output: $0.30/M tokens

Qwen3, the latest generation in the Qwen large language model series, features both dense and mixture-of-experts (MoE) architectures to excel in reasoning, multilingual support, and advanced agent tasks. Its unique ability to switch seamlessly between a thinking mode for complex reasoning and a non-thinking mode for efficient dialogue ensures versatile, high-quality performance. Significantly outperforming prior models like QwQ and Qwen2.5, Qwen3 delivers superior mathematics, coding, commonsense reasoning, creative writing, and interactive dialogue capabilities. The Qwen3-30B-A3B variant includes 30.5 billion parameters (3.3 billion activated), 48 layers, 128 experts (8 activated per task), and supports up to 131K token contexts with YaRN, setting a new standard among open-source models.

Context Window: 128K tokens ‱ Max Output: 8K tokens ‱ Input: $0.85/M tokens ‱ Output: $0.90/M tokens

DeepSeek-R1-Distill-Qwen-7B is a 7 billion parameter dense language model distilled from DeepSeek-R1, leveraging reinforcement learning-enhanced reasoning data generated by DeepSeek's larger models. The distillation process transfers advanced reasoning, math, and code capabilities into a smaller, more efficient model architecture based on Qwen2.5-Math-7B. This model demonstrates strong performance across mathematical benchmarks (92.8% pass@1 on MATH-500), coding tasks (Codeforces rating 1189), and general reasoning (49.1% pass@1 on GPQA Diamond), achieving competitive accuracy relative to larger models while maintaining smaller inference costs.

Context Window: 64K tokens ‱ Max Output: 8K tokens ‱ Input: $0.23/M tokens ‱ Output: $0.69/M tokens

DeepSeek-R1-Distill-Qwen-7B is a 7 billion parameter dense language model distilled from DeepSeek-R1, leveraging reinforcement learning-enhanced reasoning data generated by DeepSeek's larger models. The distillation process transfers advanced reasoning, math, and code capabilities into a smaller, more efficient model architecture based on Qwen2.5-Math-7B. This model demonstrates strong performance across mathematical benchmarks (92.8% pass@1 on MATH-500), coding tasks (Codeforces rating 1189), and general reasoning (49.1% pass@1 on GPQA Diamond), achieving competitive accuracy relative to larger models while maintaining smaller inference costs.

Context Window: 131K tokens ‱ Max Output: Unlimited ‱ Input: $0.23/M tokens ‱ Output: $0.40/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

Context Window: 131K tokens ‱ Max Output: Unlimited ‱ Input: $0.12/M tokens ‱ Output: $0.30/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

DeepInfra Inc.

Qwen/Qwen3-235B-A22B

Context Window: 40K tokens ‱ Max Output: 4K tokens ‱ Input: $0.20/M tokens ‱ Output: $0.60/M tokens

Qwen3, the latest generation in the Qwen large language model series, features both dense and mixture-of-experts (MoE) architectures to excel in reasoning, multilingual support, and advanced agent tasks. Its unique ability to switch seamlessly between a thinking mode for complex reasoning and a non-thinking mode for efficient dialogue ensures versatile, high-quality performance. Significantly outperforming prior models like QwQ and Qwen2.5, Qwen3 delivers superior mathematics, coding, commonsense reasoning, creative writing, and interactive dialogue capabilities. The Qwen3-30B-A3B variant includes 30.5 billion parameters (3.3 billion activated), 48 layers, 128 experts (8 activated per task), and supports up to 131K token contexts with YaRN, setting a new standard among open-source models.

DeepInfra Inc.

zai-org/GLM-4.5-Air

Context Window: 131K tokens ‱ Max Output: 4K tokens ‱ Input: $0.20/M tokens ‱ Output: $1.10/M tokens

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Context Window: 262K tokens ‱ Max Output: Unlimited ‱ Input: $0.40/M tokens ‱ Output: $1.60/M tokens

Context Window: 64K tokens ‱ Max Output: 8K tokens ‱ Input: $0.85/M tokens ‱ Output: $2.50/M tokens

DeepSeek-R1-Distill-Qwen-7B is a 7 billion parameter dense language model distilled from DeepSeek-R1, leveraging reinforcement learning-enhanced reasoning data generated by DeepSeek's larger models. The distillation process transfers advanced reasoning, math, and code capabilities into a smaller, more efficient model architecture based on Qwen2.5-Math-7B. This model demonstrates strong performance across mathematical benchmarks (92.8% pass@1 on MATH-500), coding tasks (Codeforces rating 1189), and general reasoning (49.1% pass@1 on GPQA Diamond), achieving competitive accuracy relative to larger models while maintaining smaller inference costs.

Context Window: 131K tokens ‱ Max Output: Unlimited ‱ Input: $0.80/M tokens ‱ Output: $0.80/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

Ready to use DeepInfra Inc. models?

Access all DeepInfra Inc. models through Requesty's unified API with intelligent routing, caching, and cost optimization.
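A minimal sketch of what a call through a unified, OpenAI-compatible endpoint could look like; the base URL shown is an assumption, and the model identifier is simply the cheapest model listed above, so verify both against Requesty's documentation:

```python
# Illustrative only: assumed router endpoint and placeholder API key.
from openai import OpenAI

client = OpenAI(
    base_url="https://router.requesty.ai/v1",  # assumed unified-API endpoint
    api_key="YOUR_REQUESTY_API_KEY",
)

response = client.chat.completions.create(
    # Cheapest model listed on this page; other entries follow the same id scheme.
    model="deepinfra/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)
```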
