DeepInfra Inc.

Serverless inference for machine learning models.

πŸ“ πŸ‡ΊπŸ‡Έ USβ€’19 models availableβ€’Visit Website β†’
Available Models: 19
Avg Input Price/M: $0.31
Cheapest Model: deepinfra/meta-llama/Meta-Llama-3.1-8B-Instruct-Turbo ($0.02/M input tokens)
Most Expensive: deepinfra/deepseek-ai/DeepSeek-V3 ($0.85/M input tokens)
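
All prices on this page are quoted per million tokens, so a per-request cost is simply (tokens / 1,000,000) * rate for each side of the call. A minimal sketch of that arithmetic, with the rates passed in as parameters and purely illustrative token counts:

```python
# Rough per-request cost from per-million-token rates.
# The rates and token counts below are illustrative; substitute the figures
# listed for the model you actually call.

def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Estimated USD cost of a single request."""
    return (input_tokens / 1_000_000) * input_price_per_m \
         + (output_tokens / 1_000_000) * output_price_per_m

# e.g. a 2,000-token prompt and a 500-token completion at $0.85 in / $0.90 out
print(f"${request_cost(2_000, 500, 0.85, 0.90):.6f}")  # -> $0.002150
```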

Features Overview

Vision Support: 0
Advanced Reasoning: 0
Caching Support: 0
Computer Use: 0

Privacy & Data Policy

Data Retention: No data retention

Location: 🇺🇸 US

All DeepInfra Inc. Models

View All Providers →
Context Window: 128K tokens
Max Output: 8K tokens
Input: $0.85/M tokens
Output: $0.90/M tokens

DeepSeek-R1-Distill-Qwen-7B is a 7 billion parameter dense language model distilled from DeepSeek-R1, leveraging reinforcement learning-enhanced reasoning data generated by DeepSeek's larger models. The distillation process transfers advanced reasoning, math, and code capabilities into a smaller, more efficient model architecture based on Qwen2.5-Math-7B. This model demonstrates strong performance across mathematical benchmarks (92.8% pass@1 on MATH-500), coding tasks (Codeforces rating 1189), and general reasoning (49.1% pass@1 on GPQA Diamond), achieving competitive accuracy relative to larger models while maintaining smaller inference costs.
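
The pass@1 numbers quoted above are typically computed by sampling n completions per problem, counting the c that are correct, and averaging; this is the k = 1 case of the standard unbiased pass@k estimator. A minimal sketch of that estimator, with illustrative sampling counts:

```python
# Unbiased pass@k estimator (the k=1 case reduces to the mean of c/n).
# n = completions sampled per problem, c = completions that were correct.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled completions is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 4 correct answers out of 16 samples, evaluated at k = 1
print(pass_at_k(16, 4, 1))  # -> 0.25
```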

Context Window: 64K tokens
Max Output: 8K tokens
Input: $0.23/M tokens
Output: $0.69/M tokens

DeepSeek-R1-Distill-Qwen-7B is a 7 billion parameter dense language model distilled from DeepSeek-R1, leveraging reinforcement learning-enhanced reasoning data generated by DeepSeek's larger models. The distillation process transfers advanced reasoning, math, and code capabilities into a smaller, more efficient model architecture based on Qwen2.5-Math-7B. This model demonstrates strong performance across mathematical benchmarks (92.8% pass@1 on MATH-500), coding tasks (Codeforces rating 1189), and general reasoning (49.1% pass@1 on GPQA Diamond), achieving competitive accuracy relative to larger models while maintaining smaller inference costs.

Context Window: 131K tokens
Max Output: Unlimited
Input: $0.23/M tokens
Output: $0.40/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

DeepInfra Inc.

Qwen/Qwen3-32B

Context Window: 40K tokens
Max Output: Unlimited
Input: $0.10/M tokens
Output: $0.30/M tokens

Qwen3, the latest generation in the Qwen large language model series, features both dense and mixture-of-experts (MoE) architectures to excel in reasoning, multilingual support, and advanced agent tasks. Its unique ability to switch seamlessly between a thinking mode for complex reasoning and a non-thinking mode for efficient dialogue ensures versatile, high-quality performance. Significantly outperforming prior models like QwQ and Qwen2.5, Qwen3 delivers superior mathematics, coding, commonsense reasoning, creative writing, and interactive dialogue capabilities. The Qwen3-30B-A3B variant includes 30.5 billion parameters (3.3 billion activated), 48 layers, 128 experts (8 activated per task), and supports up to 131K token contexts with YaRN, setting a new standard among open-source models.
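
Qwen's published usage notes expose the thinking/non-thinking switch through the chat template; a minimal sketch of toggling it with the Hugging Face tokenizer is shown below. The enable_thinking flag follows Qwen's Qwen3 model cards, and exact behaviour may vary between releases.

```python
# Sketch: build Qwen3 prompts with the reasoning trace enabled or disabled.
# Assumes the Qwen3 chat template accepts an `enable_thinking` flag, as
# documented in Qwen's model cards; verify against the version you deploy.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-32B")
messages = [{"role": "user", "content": "Prove that sqrt(2) is irrational."}]

# Thinking mode: leaves room for a step-by-step trace before the answer.
prompt_thinking = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: shorter, direct answers for ordinary dialogue.
prompt_direct = tok.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```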

DeepInfra Inc.

zai-org/GLM-4.5-Air

Context Window: 131K tokens
Max Output: 4K tokens
Input: $0.20/M tokens
Output: $1.10/M tokens

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Context Window: 16K tokens
Max Output: Unlimited
Input: $0.07/M tokens
Output: $0.16/M tokens

Qwen3, the latest generation in the Qwen large language model series, features both dense and mixture-of-experts (MoE) architectures to excel in reasoning, multilingual support, and advanced agent tasks. Its unique ability to switch seamlessly between a thinking mode for complex reasoning and a non-thinking mode for efficient dialogue ensures versatile, high-quality performance. Significantly outperforming prior models like QwQ and Qwen2.5, Qwen3 delivers superior mathematics, coding, commonsense reasoning, creative writing, and interactive dialogue capabilities. The Qwen3-30B-A3B variant includes 30.5 billion parameters (3.3 billion activated), 48 layers, 128 experts (8 activated per task), and supports up to 131K token contexts with YaRN, setting a new standard among open-source models.

DeepInfra Inc.

microsoft/phi-4

Context Window: 16K tokens
Max Output: Unlimited
Input: $0.07/M tokens
Output: $0.14/M tokens

Phi-4-reasoning-plus is an enhanced 14B parameter model from Microsoft, fine-tuned from Phi-4 with additional reinforcement learning to boost accuracy on math, science, and code reasoning tasks. It uses the same dense decoder-only transformer architecture as Phi-4, but generates longer, more comprehensive outputs structured into a step-by-step reasoning trace and final answer. While it offers improved benchmark scores over Phi-4-reasoning across tasks like AIME, OmniMath, and HumanEvalPlus, its responses are typically ~50% longer, resulting in higher latency. Designed for English-only applications, it is well-suited for structured reasoning workflows where output quality takes priority over response speed.
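
Since responses arrive as a reasoning trace followed by a final answer, downstream code usually wants to separate the two. A minimal sketch, assuming the trace is wrapped in <think>...</think> tags (the convention used by several reasoning models; verify against the actual output format):

```python
# Split a reasoning-model response into (trace, answer), assuming the trace
# is delimited by <think>...</think> tags. The tag convention is an assumption;
# adjust to the format the deployed model actually emits.
import re

def split_reasoning(text: str) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer) from a raw model response."""
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()

trace, answer = split_reasoning("<think>45 * 11 = 495, so ...</think>The result is 495.")
print(answer)  # -> The result is 495.
```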

Context Window: 131K tokens
Max Output: Unlimited
Input: $0.80/M tokens
Output: $0.80/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

Context Window: 131K tokens
Max Output: Unlimited
Input: $0.23/M tokens
Output: $0.40/M tokens

Qwen3, the latest generation in the Qwen large language model series, features both dense and mixture-of-experts (MoE) architectures to excel in reasoning, multilingual support, and advanced agent tasks. Its unique ability to switch seamlessly between a thinking mode for complex reasoning and a non-thinking mode for efficient dialogue ensures versatile, high-quality performance. Significantly outperforming prior models like QwQ and Qwen2.5, Qwen3 delivers superior mathematics, coding, commonsense reasoning, creative writing, and interactive dialogue capabilities. The Qwen3-30B-A3B variant includes 30.5 billion parameters (3.3 billion activated), 48 layers, 128 experts (8 activated per task), and supports up to 131K token contexts with YaRN, setting a new standard among open-source models.

Context Window: 131K tokens
Max Output: Unlimited
Input: $0.02/M tokens
Output: $0.05/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

DeepInfra Inc.

Qwen/Qwen3-235B-A22B

Context Window: 40K tokens
Max Output: 4K tokens
Input: $0.20/M tokens
Output: $0.60/M tokens

Qwen3, the latest generation in the Qwen large language model series, features both dense and mixture-of-experts (MoE) architectures to excel in reasoning, multilingual support, and advanced agent tasks. Its unique ability to switch seamlessly between a thinking mode for complex reasoning and a non-thinking mode for efficient dialogue ensures versatile, high-quality performance. Significantly outperforming prior models like QwQ and Qwen2.5, Qwen3 delivers superior mathematics, coding, commonsense reasoning, creative writing, and interactive dialogue capabilities. The Qwen3-30B-A3B variant includes 30.5 billion parameters (3.3 billion activated), 48 layers, 128 experts (8 activated per task), and supports up to 131K token contexts with YaRN, setting a new standard among open-source models.

DeepInfra Inc.

zai-org/GLM-4.5

Context Window: 131K tokens
Max Output: 4K tokens
Input: $0.60/M tokens
Output: $2.20/M tokens

The GLM-4.5 series models are foundation models designed for intelligent agents. GLM-4.5 has 355 billion total parameters with 32 billion active parameters, while GLM-4.5-Air adopts a more compact design with 106 billion total parameters and 12 billion active parameters. GLM-4.5 models unify reasoning, coding, and intelligent agent capabilities to meet the complex demands of intelligent agent applications.

Context Window: 131K tokens
Max Output: 4K tokens
Input: $0.35/M tokens
Output: $0.40/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

Context Window: 131K tokens
Max Output: Unlimited
Input: $0.23/M tokens
Output: $0.40/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

Context Window: 64K tokens
Max Output: 8K tokens
Input: $0.85/M tokens
Output: $2.50/M tokens

DeepSeek-R1-Distill-Qwen-7B is a 7 billion parameter dense language model distilled from DeepSeek-R1, leveraging reinforcement learning-enhanced reasoning data generated by DeepSeek's larger models. The distillation process transfers advanced reasoning, math, and code capabilities into a smaller, more efficient model architecture based on Qwen2.5-Math-7B. This model demonstrates strong performance across mathematical benchmarks (92.8% pass@1 on MATH-500), coding tasks (Codeforces rating 1189), and general reasoning (49.1% pass@1 on GPQA Diamond), achieving competitive accuracy relative to larger models while maintaining smaller inference costs.

Context Window: 131K tokens
Max Output: Unlimited
Input: $0.12/M tokens
Output: $0.30/M tokens

A lightweight and ultra-fast variant of Llama 3.3 70B, for use when quick response times are needed most.

Context Window: 164K tokens
Max Output: Unlimited
Input: $0.30/M tokens
Output: $1.00/M tokens

DeepInfra Inc.

Qwen/QwQ-32B

Context Window: 128K tokens
Max Output: Unlimited
Input: $0.12/M tokens
Output: $0.18/M tokens

Qwen3, the latest generation in the Qwen large language model series, features both dense and mixture-of-experts (MoE) architectures to excel in reasoning, multilingual support, and advanced agent tasks. Its unique ability to switch seamlessly between a thinking mode for complex reasoning and a non-thinking mode for efficient dialogue ensures versatile, high-quality performance. Significantly outperforming prior models like QwQ and Qwen2.5, Qwen3 delivers superior mathematics, coding, commonsense reasoning, creative writing, and interactive dialogue capabilities. The Qwen3-30B-A3B variant includes 30.5 billion parameters (3.3 billion activated), 48 layers, 128 experts (8 activated per task), and supports up to 131K token contexts with YaRN, setting a new standard among open-source models.

Context Window: 262K tokens
Max Output: Unlimited
Input: $0.40/M tokens
Output: $1.60/M tokens

Ready to use DeepInfra Inc. models?

Access all DeepInfra Inc. models through Requesty's unified API with intelligent routing, caching, and cost optimization.
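
A minimal sketch of such a call, assuming an OpenAI-compatible chat completions endpoint; the base URL, environment variable, and request parameters are placeholders to adapt from Requesty's documentation, while the model id format matches the listings above:

```python
# Sketch: route a request to a DeepInfra-hosted model through a unified,
# OpenAI-compatible endpoint. The base URL and env var are assumptions for
# illustration; consult Requesty's docs for the exact values.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["REQUESTY_API_KEY"],      # placeholder variable name
    base_url="https://router.requesty.ai/v1",    # assumed router endpoint
)

response = client.chat.completions.create(
    model="deepinfra/deepseek-ai/DeepSeek-V3",   # id format used on this page
    messages=[{"role": "user", "content": "Summarize MoE routing in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```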

DeepInfra Inc. AI Models - Pricing & Features | Requesty