DeepInfra Inc.

Serverless inference for machine learning models.

πŸ“ πŸ‡ΊπŸ‡Έ USβ€’19 models availableβ€’Visit Website β†’
Available Models: 19
Avg Input Price/M: $0.31
Cheapest Model: deepinfra/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo ($0.02/M input tokens)
Most Expensive: deepinfra/deepseek-ai/DeepSeek-V3 ($0.85/M input tokens)
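
All prices on this page are quoted per million tokens, so a per-request cost is simply (tokens / 1,000,000) * rate for each side of the call. A minimal sketch of that arithmetic, with the rates passed in as parameters and purely illustrative token counts:

```python
# Rough per-request cost from per-million-token rates.
# The rates and token counts below are illustrative; substitute the figures
# listed for the model you actually call.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimated USD cost of a single request."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# e.g. a 2,000-token prompt and a 500-token completion at $0.85 in / $0.90 out
print(f"${request_cost(2_000, 500, 0.85, 0.90):.6f}")  # -> $0.002150
```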

Features Overview

Vision Support: 0
Advanced Reasoning: 0
Caching Support: 0
Computer Use: 0

Privacy & Data Policy

Data Retention: No data retention

Location: 🇺🇸 US

All DeepInfra Inc. Models

View All Providers →
Context Window: 128K tokens
Max Output: 8K tokens
Input: $0.85/M tokens
Output: $0.90/M tokens

DeepSeek-R1-Distill-Qwen-7B is a 7 billion parameter dense language model distilled from DeepSeek-R1, leveraging reinforcement learning-enhanced reasoning data generated by DeepSeek's larger models. The distillation process transfers advanced reasoning, math, and code capabilities into a smaller, more efficient model architecture based on Qwen2.5-Math-7B. This model demonstrates strong performance across mathematical benchmarks (92.8% pass@1 on MATH-500), coding tasks (Codeforces rating 1189), and general reasoning (49.1% pass@1 on GPQA Diamond), achieving competitive accuracy relative to larger models while maintaining smaller inference costs.
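
The pass@1 numbers quoted above are typically computed by sampling n completions per problem, counting the c that are correct, and averaging; this is the k = 1 case of the standard unbiased pass@k estimator. A minimal sketch of that estimator, with illustrative sampling counts:

```python
# Unbiased pass@k estimator (the k=1 case reduces to the mean of c/n).
# n = completions sampled per problem, c = completions that were correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled completions is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 4 correct answers out of 16 samples, evaluated at k = 1
print(pass_at_k(16, 4, 1))  # -> 0.25
```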

Context Window: 64K tokens
Max Output: 8K tokens
Input: $0.23/M tokens
Output: $0.69/M tokens

DeepSeek-R1-Distill-Qwen-7B is a 7 billion parameter dense language model distilled from DeepSeek-R1, leveraging reinforcement learning-enhanced reasoning data generated by DeepSeek's larger models. The distillation process transfers advanced reasoning, math, and code capabilities into a smaller, more efficient model architecture based on Qwen2.5-Math-7B. This model demonstrates strong performance across mathematical benchmarks (92.8% pass@1 on MATH-500), coding tasks (Codeforces rating 1189), and general reasoning (49.1% pass@1 on GPQA Diamond), achieving competitive accuracy relative to larger models while maintaining smaller inference costs.

Context Window: 131K tokens
Max Output: Unlimited
Input: $0.23/M tokens
Output: $0.40/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

DeepInfra Inc.

Qwen/Qwen3-32B

Context Window: 40K tokens
Max Output: Unlimited
Input: $0.10/M tokens
Output: $0.30/M tokens

Qwen3, the latest generation in the Qwen large language model series, features both dense and mixture-of-experts (MoE) architectures to excel in reasoning, multilingual support, and advanced agent tasks. Its unique ability to switch seamlessly between a thinking mode for complex reasoning and a non-thinking mode for efficient dialogue ensures versatile, high-quality performance. Significantly outperforming prior models like QwQ and Qwen2.5, Qwen3 delivers superior mathematics, coding, commonsense reasoning, creative writing, and interactive dialogue capabilities. The Qwen3-30B-A3B variant includes 30.5 billion parameters (3.3 billion activated), 48 layers, 128 experts (8 activated per task), and supports up to 131K token contexts with YaRN, setting a new standard among open-source models.
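
Qwen's published usage notes expose the thinking/non-thinking switch through the chat template; a minimal sketch of toggling it with the Hugging Face tokenizer is shown below. The enable_thinking flag follows Qwen's Qwen3 model cards, and exact behaviour may vary between releases.

```python
# Sketch: build Qwen3 prompts with the reasoning trace enabled or disabled.
# Assumes the Qwen3 chat template accepts an `enable_thinking` flag, as
# documented in Qwen's model cards; verify against the version you deploy.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

# Thinking mode: leaves room for a step-by-step trace before the answer.
prompt_thinking = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: shorter, direct answers for ordinary dialogue.
prompt_direct = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```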

DeepInfra Inc.

zai-org/GLM-4.5-Air

Context Window: 131K tokens
Max Output: 4K tokens
Input: $0.20/M tokens
Output: $1.10/M tokens

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Context Window: 16K tokens
Max Output: Unlimited
Input: $0.07/M tokens
Output: $0.16/M tokens

Qwen3, the latest generation in the Qwen large language model series, features both dense and mixture-of-experts (MoE) architectures to excel in reasoning, multilingual support, and advanced agent tasks. Its unique ability to switch seamlessly between a thinking mode for complex reasoning and a non-thinking mode for efficient dialogue ensures versatile, high-quality performance. Significantly outperforming prior models like QwQ and Qwen2.5, Qwen3 delivers superior mathematics, coding, commonsense reasoning, creative writing, and interactive dialogue capabilities. The Qwen3-30B-A3B variant includes 30.5 billion parameters (3.3 billion activated), 48 layers, 128 experts (8 activated per task), and supports up to 131K token contexts with YaRN, setting a new standard among open-source models.

DeepInfra Inc.

microsoft/phi-4

Context Window: 16K tokens
Max Output: Unlimited
Input: $0.07/M tokens
Output: $0.14/M tokens

Phi-4-reasoning-plus is an enhanced 14B parameter model from Microsoft, fine-tuned from Phi-4 with additional reinforcement learning to boost accuracy on math, science, and code reasoning tasks. It uses the same dense decoder-only transformer architecture as Phi-4, but generates longer, more comprehensive outputs structured into a step-by-step reasoning trace and final answer. While it offers improved benchmark scores over Phi-4-reasoning across tasks like AIME, OmniMath, and HumanEvalPlus, its responses are typically ~50% longer, resulting in higher latency. Designed for English-only applications, it is well-suited for structured reasoning workflows where output quality takes priority over response speed.
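
Since responses arrive as a reasoning trace followed by a final answer, downstream code usually wants to separate the two. A minimal sketch, assuming the trace is wrapped in <think>...</think> tags (the convention used by several reasoning models; verify against the actual output format):

```python
# Split a reasoning-model response into (trace, answer), assuming the trace
# is delimited by <think>...</think> tags. The tag convention is an assumption;
# adjust to the format the deployed model actually emits.
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) from a raw model response."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

trace, answer = split_reasoning("<think>45 * 11 = 495, so ...</think>The result is 495.")
print(answer)  # -> The result is 495.
```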

Context Window: 131K tokens
Max Output: Unlimited
Input: $0.80/M tokens
Output: $0.80/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

Context Window: 131K tokens
Max Output: Unlimited
Input: $0.23/M tokens
Output: $0.40/M tokens

Qwen3, the latest generation in the Qwen large language model series, features both dense and mixture-of-experts (MoE) architectures to excel in reasoning, multilingual support, and advanced agent tasks. Its unique ability to switch seamlessly between a thinking mode for complex reasoning and a non-thinking mode for efficient dialogue ensures versatile, high-quality performance. Significantly outperforming prior models like QwQ and Qwen2.5, Qwen3 delivers superior mathematics, coding, commonsense reasoning, creative writing, and interactive dialogue capabilities. The Qwen3-30B-A3B variant includes 30.5 billion parameters (3.3 billion activated), 48 layers, 128 experts (8 activated per task), and supports up to 131K token contexts with YaRN, setting a new standard among open-source models.

Context Window: 131K tokens
Max Output: Unlimited
Input: $0.02/M tokens
Output: $0.05/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

DeepInfra Inc.

Qwen/Qwen3-235B-A22B

Context Window: 40K tokens
Max Output: 4K tokens
Input: $0.20/M tokens
Output: $0.60/M tokens

Qwen3, the latest generation in the Qwen large language model series, features both dense and mixture-of-experts (MoE) architectures to excel in reasoning, multilingual support, and advanced agent tasks. Its unique ability to switch seamlessly between a thinking mode for complex reasoning and a non-thinking mode for efficient dialogue ensures versatile, high-quality performance. Significantly outperforming prior models like QwQ and Qwen2.5, Qwen3 delivers superior mathematics, coding, commonsense reasoning, creative writing, and interactive dialogue capabilities. The Qwen3-30B-A3B variant includes 30.5 billion parameters (3.3 billion activated), 48 layers, 128 experts (8 activated per task), and supports up to 131K token contexts with YaRN, setting a new standard among open-source models.

DeepInfra Inc.

zai-org/GLM-4.5

Context Window: 131K tokens
Max Output: 4K tokens
Input: $0.60/M tokens
Output: $2.20/M tokens

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Context Window: 131K tokens
Max Output: 4K tokens
Input: $0.35/M tokens
Output: $0.40/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

Context Window: 131K tokens
Max Output: Unlimited
Input: $0.23/M tokens
Output: $0.40/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

Context Window: 64K tokens
Max Output: 8K tokens
Input: $0.85/M tokens
Output: $2.50/M tokens

DeepSeek-R1-Distill-Qwen-7B is a 7 billion parameter dense language model distilled from DeepSeek-R1, leveraging reinforcement learning-enhanced reasoning data generated by DeepSeek's larger models. The distillation process transfers advanced reasoning, math, and code capabilities into a smaller, more efficient model architecture based on Qwen2.5-Math-7B. This model demonstrates strong performance across mathematical benchmarks (92.8% pass@1 on MATH-500), coding tasks (Codeforces rating 1189), and general reasoning (49.1% pass@1 on GPQA Diamond), achieving competitive accuracy relative to larger models while maintaining smaller inference costs.

Context Window: 131K tokens
Max Output: Unlimited
Input: $0.12/M tokens
Output: $0.30/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

Context Window: 164K tokens
Max Output: Unlimited
Input: $0.30/M tokens
Output: $1.00/M tokens

DeepInfra Inc.

Qwen/QwQ-32B

Context Window: 128K tokens
Max Output: Unlimited
Input: $0.12/M tokens
Output: $0.18/M tokens

Qwen3, the latest generation in the Qwen large language model series, features both dense and mixture-of-experts (MoE) architectures to excel in reasoning, multilingual support, and advanced agent tasks. Its unique ability to switch seamlessly between a thinking mode for complex reasoning and a non-thinking mode for efficient dialogue ensures versatile, high-quality performance. Significantly outperforming prior models like QwQ and Qwen2.5, Qwen3 delivers superior mathematics, coding, commonsense reasoning, creative writing, and interactive dialogue capabilities. The Qwen3-30B-A3B variant includes 30.5 billion parameters (3.3 billion activated), 48 layers, 128 experts (8 activated per task), and supports up to 131K token contexts with YaRN, setting a new standard among open-source models.

Context Window: 262K tokens
Max Output: Unlimited
Input: $0.40/M tokens
Output: $1.60/M tokens

Ready to use DeepInfra Inc. models?

Access all DeepInfra Inc. models through Requesty's unified API with intelligent routing, caching, and cost optimization.
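
A minimal sketch of such a call, assuming an OpenAI-compatible chat completions endpoint; the base URL, environment variable, and request parameters are placeholders to adapt from Requesty's documentation, while the model id format matches the listings above:

```python
# Sketch: route a request to a DeepInfra-hosted model through a unified,
# OpenAI-compatible endpoint. The base URL and env var are assumptions for
# illustration; consult Requesty's docs for the exact values.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["REQUESTY_API_KEY"],      # placeholder variable name
    base_url="https://router.requesty.ai/v1",    # assumed router endpoint
)

response = client.chat.completions.create(
    model="deepinfra/deepseek-ai/DeepSeek-V3",   # id format used on this page
    messages=[{"role": "user", "content": "Summarize MoE routing in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```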

DeepInfra Inc. AI Models - Pricing & Features | Requesty