NVIDIA API Model Benchmark

Model Speed & Reliability Report

A benchmark of 140 NVIDIA-hosted model IDs using chat completion calls, ranked by response latency and grouped by success, errors, timeouts, and failures.
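The per-model probe behind these numbers can be sketched as a timed call plus a four-way status classification. This is a hypothetical reconstruction, not the actual benchmark script: `call_model` stands in for the real chat-completion request and is assumed to raise `TimeoutError` on timeout and any other exception on an incompatible or restricted endpoint.

```python
import time

def probe(model_id, call_model, timeout_s=10.0):
    """Time one chat-completion attempt and classify the outcome.

    `call_model` is a hypothetical callable (model_id, timeout_s) -> reply;
    it is injected so the timing/status logic can run without a live API key.
    """
    start = time.perf_counter()
    try:
        reply = call_model(model_id, timeout_s)
    except TimeoutError:
        return {"model": model_id, "status": "TIMEOUT", "latency": None}
    except Exception:
        return {"model": model_id, "status": "ERROR", "latency": None}
    latency = time.perf_counter() - start
    if not reply:
        # An empty or unusable response counts as FAIL rather than ERROR.
        return {"model": model_id, "status": "FAIL", "latency": latency}
    return {"model": model_id, "status": "SUCCESS", "latency": round(latency, 2)}
```

In a real run, `call_model` would POST to the provider's chat completions endpoint; here it is left injectable so the harness can be exercised with stubs.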

Total models: 140 · Working: 61 · Median success latency: 1.18s · Fastest: 0.30s

Success rate: 43.6% (61/140)
Successful responses: 61
Errors: 63
Timeouts: 14

Status Breakdown

SUCCESS: 61
ERROR: 63
TIMEOUT: 14
FAIL: 2

Errors usually mean the model is not compatible with the chat completions endpoint (for example, it requires an embedding, retrieval, or other specialized endpoint) or is access-restricted. Timeouts usually mean the model was too slow, overloaded, or unstable for this setup.

Latency Distribution

≤0.5s: 7
0.51–1s: 18
1.01–2s: 15
2.01–5s: 15
>5s: 6

Most usable models landed under 2 seconds, which is excellent for interactive development. Anything above 5 seconds should be reserved for quality-heavy tasks only.
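The histogram above can be reproduced from the list of successful latencies with a simple binning function; this is a sketch using the report's bin edges (values in seconds), not the original analysis code.

```python
def bucketize(latencies):
    """Count successful latencies into the report's histogram bins."""
    bins = {"<=0.5s": 0, "0.51-1s": 0, "1.01-2s": 0, "2.01-5s": 0, ">5s": 0}
    for t in latencies:
        if t <= 0.5:
            bins["<=0.5s"] += 1
        elif t <= 1.0:
            bins["0.51-1s"] += 1
        elif t <= 2.0:
            bins["1.01-2s"] += 1
        elif t <= 5.0:
            bins["2.01-5s"] += 1
        else:
            bins[">5s"] += 1
    return bins
```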

⚡ Fastest Models

#1 nvidia/gliner-pii – 0.30s
#2 nvidia/llama-3.1-nemoguard-8b-topic-control – 0.39s
#3 meta/llama-guard-4-12b – 0.42s
#4 nvidia/nemotron-3-content-safety – 0.42s
#5 google/gemma-2-2b-it – 0.44s
#6 nvidia/riva-translate-4b-instruct-v1.1 – 0.44s
#7 nvidia/llama-3.1-nemoguard-8b-content-safety – 0.50s
#8 nvidia/llama-3.1-nemotron-safety-guard-8b-v3 – 0.55s
#9 nvidia/nemotron-mini-4b-instruct – 0.61s
#10 nvidia/nemotron-content-safety-reasoning-4b – 0.62s

🏆 Best Practical Picks

#1 google/gemma-2-2b-it – 0.44s
#2 nvidia/nemotron-mini-4b-instruct – 0.61s
#3 mistralai/ministral-14b-instruct-2512 – 0.65s
#4 nvidia/nemotron-3-nano-omni-30b-a3b-reasoning – 0.67s
#5 openai/gpt-oss-20b – 0.72s
#6 nvidia/nemotron-nano-12b-v2-vl – 0.77s
#7 openai/gpt-oss-20b – 0.79s
#8 meta/llama-3.2-11b-vision-instruct – 0.81s
#9 meta/llama-3.2-3b-instruct – 0.82s
#10 meta/llama-4-maverick-17b-128e-instruct – 0.83s

🐢 Slowest Successful

#1 deepseek-ai/deepseek-v4-flash – 9.87s
#2 meta/llama-3.2-1b-instruct – 9.35s
#3 meta/llama-3.1-8b-instruct – 8.73s
#4 mistralai/mistral-large-3-675b-instruct-2512 – 7.35s
#5 google/gemma-3-27b-it – 6.45s
#6 mistralai/mistral-medium-3.5-128b – 5.03s
#7 google/gemma-3-4b-it – 3.64s
#8 qwen/qwen3-coder-480b-a35b-instruct – 3.47s

Executive Recommendation

Use these first: mistralai/ministral-14b-instruct-2512, qwen/qwen2.5-coder-32b-instruct, mistralai/mixtral-8x22b-instruct-v0.1, meta/llama-4-maverick-17b-128e-instruct, and nvidia/nemotron-mini-4b-instruct. They were fast enough for real-time use while being more general-purpose than the ultra-fast safety/PII/translation models.
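In practice, a shortlist like this is most useful as a fallback chain: try the first model, and move down the list on any error or timeout. A minimal sketch, where `ask` is a hypothetical chat-call function assumed to raise on failure:

```python
PREFERRED = [
    "mistralai/ministral-14b-instruct-2512",
    "qwen/qwen2.5-coder-32b-instruct",
    "mistralai/mixtral-8x22b-instruct-v0.1",
    "meta/llama-4-maverick-17b-128e-instruct",
    "nvidia/nemotron-mini-4b-instruct",
]

def ask_with_fallback(prompt, ask, models=PREFERRED):
    """Return (model_id, reply) from the first model that answers."""
    last_exc = None
    for model_id in models:
        try:
            return model_id, ask(model_id, prompt)
        except Exception as exc:
            last_exc = exc  # record and try the next model in the chain
    raise RuntimeError("all models failed") from last_exc
```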

Provider Summary

Provider | Total | Success | Error | Timeout | Fail | Avg latency | Median latency
nvidia | 46 | 20 | 26 | 0 | 0 | 0.95s | 0.82s
mistralai | 15 | 8 | 7 | 0 | 0 | 2.52s | 1.14s
google | 12 | 5 | 6 | 1 | 0 | 2.93s | 2.23s
meta | 12 | 9 | 2 | 1 | 0 | 3.34s | 2.24s
deepseek-ai | 7 | 1 | 1 | 5 | 0 | 9.87s | 9.87s
qwen | 6 | 5 | 0 | 1 | 0 | 1.78s | 1.29s
microsoft | 5 | 0 | 3 | 2 | 0 | – | –
ibm | 4 | 0 | 4 | 0 | 0 | – | –
moonshotai | 4 | 2 | 0 | 2 | 0 | 1.84s | 1.84s
openai | 4 | 4 | 0 | 0 | 0 | 1.80s | 1.69s
writer | 4 | 0 | 4 | 0 | 0 | – | –
z-ai | 3 | 0 | 0 | 2 | 1 | – | –
minimaxai | 2 | 1 | 0 | 0 | 1 | 1.85s | 1.85s
01-ai | 1 | 0 | 1 | 0 | 0 | – | –
abacusai | 1 | 1 | 0 | 0 | 0 | 2.83s | 2.83s
adept | 1 | 0 | 1 | 0 | 0 | – | –
ai21labs | 1 | 0 | 1 | 0 | 0 | – | –
aisingapore | 1 | 0 | 1 | 0 | 0 | – | –
baai | 1 | 0 | 1 | 0 | 0 | – | –
bigcode | 1 | 0 | 1 | 0 | 0 | – | –
bytedance | 1 | 1 | 0 | 0 | 0 | 2.72s | 2.72s
databricks | 1 | 0 | 1 | 0 | 0 | – | –
nv-mistralai | 1 | 0 | 1 | 0 | 0 | – | –
sarvamai | 1 | 1 | 0 | 0 | 0 | 3.40s | 3.40s
snowflake | 1 | 0 | 1 | 0 | 0 | – | –
stepfun-ai | 1 | 1 | 0 | 0 | 0 | 1.02s | 1.02s
stockmark | 1 | 1 | 0 | 0 | 0 | 1.35s | 1.35s
upstage | 1 | 1 | 0 | 0 | 0 | 1.78s | 1.78s
zyphra | 1 | 0 | 1 | 0 | 0 | – | –
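Per-provider aggregates like these can be derived directly from the raw per-model results. A sketch, assuming rows of the form `(model_id, status, latency_or_None)` with the provider taken from the model ID prefix:

```python
import statistics
from collections import defaultdict

def provider_summary(rows):
    """Group (model_id, status, latency) rows by provider prefix."""
    by_provider = defaultdict(list)
    for model_id, status, latency in rows:
        provider = model_id.split("/")[0]
        by_provider[provider].append((status, latency))
    summary = {}
    for provider, results in by_provider.items():
        # Only successful calls contribute to the latency statistics.
        latencies = [t for s, t in results if s == "SUCCESS" and t is not None]
        summary[provider] = {
            "total": len(results),
            "success": sum(1 for s, _ in results if s == "SUCCESS"),
            "avg": round(statistics.mean(latencies), 2) if latencies else None,
            "median": round(statistics.median(latencies), 2) if latencies else None,
        }
    return summary
```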

Full Model Table

# | Model | Provider | Status | Latency
0 | 01-ai/yi-large | 01-ai | ERROR | –
1 | abacusai/dracarys-llama-3.1-70b-instruct | abacusai | SUCCESS | 2.83s
2 | adept/fuyu-8b | adept | ERROR | –
3 | ai21labs/jamba-1.5-large-instruct | ai21labs | ERROR | –
4 | aisingapore/sea-lion-7b-instruct | aisingapore | ERROR | –
5 | baai/bge-m3 | baai | ERROR | –
6 | bigcode/starcoder2-15b | bigcode | ERROR | –
7 | bytedance/seed-oss-36b-instruct | bytedance | SUCCESS | 2.72s
8 | databricks/dbrx-instruct | databricks | ERROR | –
9 | deepseek-ai/deepseek-coder-6.7b-instruct | deepseek-ai | ERROR | –
10 | deepseek-ai/deepseek-v3.1-terminus | deepseek-ai | TIMEOUT | –
11 | deepseek-ai/deepseek-v3.2 | deepseek-ai | TIMEOUT | –
12 | deepseek-ai/deepseek-v4-flash | deepseek-ai | TIMEOUT | –
13 | deepseek-ai/deepseek-v4-flash | deepseek-ai | SUCCESS | 9.87s
14 | deepseek-ai/deepseek-v4-pro | deepseek-ai | TIMEOUT | –
15 | deepseek-ai/deepseek-v4-pro | deepseek-ai | TIMEOUT | –
16 | google/codegemma-1.1-7b | google | ERROR | –
17 | google/codegemma-7b | google | ERROR | –
18 | google/deplot | google | ERROR | –
19 | google/gemma-2-2b-it | google | SUCCESS | 0.44s
20 | google/gemma-2b | google | ERROR | –
21 | google/gemma-3-12b-it | google | TIMEOUT | –
22 | google/gemma-3-27b-it | google | SUCCESS | 6.45s
23 | google/gemma-3-4b-it | google | SUCCESS | 3.64s
24 | google/gemma-3n-e2b-it | google | SUCCESS | 1.88s
25 | google/gemma-3n-e4b-it | google | SUCCESS | 2.23s
26 | google/gemma-4-31b-it | google | ERROR | –
27 | google/recurrentgemma-2b | google | ERROR | –
28 | ibm/granite-3.0-3b-a800m-instruct | ibm | ERROR | –
29 | ibm/granite-3.0-8b-instruct | ibm | ERROR | –
30 | ibm/granite-34b-code-instruct | ibm | ERROR | –
31 | ibm/granite-8b-code-instruct | ibm | ERROR | –
32 | meta/codellama-70b | meta | ERROR | –
33 | meta/llama-3.1-405b-instruct | meta | TIMEOUT | –
34 | meta/llama-3.1-70b-instruct | meta | SUCCESS | 2.24s
35 | meta/llama-3.1-8b-instruct | meta | SUCCESS | 8.73s
36 | meta/llama-3.2-11b-vision-instruct | meta | SUCCESS | 0.81s
37 | meta/llama-3.2-1b-instruct | meta | SUCCESS | 9.35s
38 | meta/llama-3.2-3b-instruct | meta | SUCCESS | 0.82s
39 | meta/llama-3.2-90b-vision-instruct | meta | SUCCESS | 3.40s
40 | meta/llama-3.3-70b-instruct | meta | SUCCESS | 3.42s
41 | meta/llama-4-maverick-17b-128e-instruct | meta | SUCCESS | 0.83s
42 | meta/llama-guard-4-12b | meta | SUCCESS | 0.42s
43 | meta/llama2-70b | meta | ERROR | –
44 | microsoft/kosmos-2 | microsoft | ERROR | –
45 | microsoft/phi-3-vision-128k-instruct | microsoft | ERROR | –
46 | microsoft/phi-3.5-moe-instruct | microsoft | ERROR | –
47 | microsoft/phi-4-mini-instruct | microsoft | TIMEOUT | –
48 | microsoft/phi-4-multimodal-instruct | microsoft | TIMEOUT | –
49 | minimaxai/minimax-m2.5 | minimaxai | SUCCESS | 1.85s
50 | minimaxai/minimax-m2.7 | minimaxai | FAIL | –
51 | mistralai/codestral-22b-instruct-v0.1 | mistralai | ERROR | –
52 | mistralai/devstral-2-123b-instruct-2512 | mistralai | SUCCESS | 1.18s
53 | mistralai/magistral-small-2506 | mistralai | ERROR | –
54 | mistralai/ministral-14b-instruct-2512 | mistralai | SUCCESS | 0.65s
55 | mistralai/mistral-7b-instruct-v0.3 | mistralai | ERROR | –
56 | mistralai/mistral-large | mistralai | ERROR | –
57 | mistralai/mistral-large-2-instruct | mistralai | ERROR | –
58 | mistralai/mistral-large-3-675b-instruct-2512 | mistralai | SUCCESS | 7.35s
59 | mistralai/mistral-medium-3-instruct | mistralai | ERROR | –
60 | mistralai/mistral-medium-3.5-128b | mistralai | SUCCESS | 5.03s
61 | mistralai/mistral-nemotron | mistralai | SUCCESS | 1.06s
62 | mistralai/mistral-small-4-119b-2603 | mistralai | SUCCESS | 1.09s
63 | mistralai/mixtral-8x22b-instruct-v0.1 | mistralai | SUCCESS | 0.85s
64 | mistralai/mixtral-8x22b-v0.1 | mistralai | ERROR | –
65 | mistralai/mixtral-8x7b-instruct-v0.1 | mistralai | SUCCESS | 2.91s
66 | moonshotai/kimi-k2-instruct | moonshotai | SUCCESS | 2.47s
67 | moonshotai/kimi-k2-instruct-0905 | moonshotai | TIMEOUT | –
68 | moonshotai/kimi-k2-thinking | moonshotai | TIMEOUT | –
69 | moonshotai/kimi-k2.6 | moonshotai | SUCCESS | 1.21s
70 | nv-mistralai/mistral-nemo-12b-instruct | nv-mistralai | ERROR | –
71 | nvidia/ai-synthetic-video-detector | nvidia | ERROR | –
72 | nvidia/cosmos-reason2-8b | nvidia | ERROR | –
73 | nvidia/embed-qa-4 | nvidia | ERROR | –
74 | nvidia/gliner-pii | nvidia | SUCCESS | 0.30s
75 | nvidia/ising-calibration-1-35b-a3b | nvidia | SUCCESS | 0.96s
76 | nvidia/llama-3.1-nemoguard-8b-content-safety | nvidia | SUCCESS | 0.50s
77 | nvidia/llama-3.1-nemoguard-8b-topic-control | nvidia | SUCCESS | 0.39s
78 | nvidia/llama-3.1-nemotron-51b-instruct | nvidia | ERROR | –
79 | nvidia/llama-3.1-nemotron-70b-instruct | nvidia | ERROR | –
80 | nvidia/llama-3.1-nemotron-nano-8b-v1 | nvidia | SUCCESS | 1.34s
81 | nvidia/llama-3.1-nemotron-nano-vl-8b-v1 | nvidia | SUCCESS | 0.95s
82 | nvidia/llama-3.1-nemotron-safety-guard-8b-v3 | nvidia | SUCCESS | 0.55s
83 | nvidia/llama-3.1-nemotron-ultra-253b-v1 | nvidia | ERROR | –
84 | nvidia/llama-3.2-nemoretriever-1b-vlm-embed-v1 | nvidia | ERROR | –
85 | nvidia/llama-3.2-nemoretriever-300m-embed-v1 | nvidia | ERROR | –
86 | nvidia/llama-3.2-nv-embedqa-1b-v1 | nvidia | ERROR | –
87 | nvidia/llama-3.2-nv-embedqa-1b-v2 | nvidia | ERROR | –
88 | nvidia/llama-3.3-nemotron-super-49b-v1 | nvidia | SUCCESS | 1.10s
89 | nvidia/llama-3.3-nemotron-super-49b-v1.5 | nvidia | SUCCESS | 2.75s
90 | nvidia/llama-nemotron-embed-1b-v2 | nvidia | ERROR | –
91 | nvidia/llama-nemotron-embed-vl-1b-v2 | nvidia | ERROR | –
92 | nvidia/llama3-chatqa-1.5-70b | nvidia | ERROR | –
93 | nvidia/mistral-nemo-minitron-8b-8k-instruct | nvidia | ERROR | –
94 | nvidia/nemoretriever-parse | nvidia | ERROR | –
95 | nvidia/nemotron-3-content-safety | nvidia | SUCCESS | 0.42s
96 | nvidia/nemotron-3-nano-30b-a3b | nvidia | SUCCESS | 0.87s
97 | nvidia/nemotron-3-nano-omni-30b-a3b-reasoning | nvidia | SUCCESS | 0.91s
98 | nvidia/nemotron-3-nano-omni-30b-a3b-reasoning | nvidia | SUCCESS | 0.67s
99 | nvidia/nemotron-3-super-120b-a12b | nvidia | SUCCESS | 1.59s
100 | nvidia/nemotron-3-super-120b-a12b | nvidia | SUCCESS | 1.17s
101 | nvidia/nemotron-4-340b-instruct | nvidia | ERROR | –
102 | nvidia/nemotron-4-340b-reward | nvidia | ERROR | –
103 | nvidia/nemotron-content-safety-reasoning-4b | nvidia | SUCCESS | 0.62s
104 | nvidia/nemotron-mini-4b-instruct | nvidia | SUCCESS | 0.61s
105 | nvidia/nemotron-nano-12b-v2-vl | nvidia | SUCCESS | 0.77s
106 | nvidia/nemotron-nano-3-30b-a3b | nvidia | ERROR | –
107 | nvidia/nemotron-parse | nvidia | ERROR | –
108 | nvidia/neva-22b | nvidia | ERROR | –
109 | nvidia/nv-embed-v1 | nvidia | ERROR | –
110 | nvidia/nv-embedcode-7b-v1 | nvidia | ERROR | –
111 | nvidia/nv-embedqa-e5-v5 | nvidia | ERROR | –
112 | nvidia/nv-embedqa-mistral-7b-v2 | nvidia | ERROR | –
113 | nvidia/nvclip | nvidia | ERROR | –
114 | nvidia/nvidia-nemotron-nano-9b-v2 | nvidia | SUCCESS | 2.00s
115 | nvidia/riva-translate-4b-instruct | nvidia | ERROR | –
116 | nvidia/riva-translate-4b-instruct-v1.1 | nvidia | SUCCESS | 0.44s
117 | openai/gpt-oss-120b | openai | SUCCESS | 2.59s
118 | openai/gpt-oss-120b | openai | SUCCESS | 3.12s
119 | openai/gpt-oss-20b | openai | SUCCESS | 0.72s
120 | openai/gpt-oss-20b | openai | SUCCESS | 0.79s
121 | qwen/qwen2.5-coder-32b-instruct | qwen | SUCCESS | 0.85s
122 | qwen/qwen3-coder-480b-a35b-instruct | qwen | SUCCESS | 3.47s
123 | qwen/qwen3-next-80b-a3b-instruct | qwen | SUCCESS | 0.85s
124 | qwen/qwen3-next-80b-a3b-thinking | qwen | SUCCESS | 1.29s
125 | qwen/qwen3.5-122b-a10b | qwen | SUCCESS | 2.45s
126 | qwen/qwen3.5-397b-a17b | qwen | TIMEOUT | –
127 | sarvamai/sarvam-m | sarvamai | SUCCESS | 3.40s
128 | snowflake/arctic-embed-l | snowflake | ERROR | –
129 | stepfun-ai/step-3.5-flash | stepfun-ai | SUCCESS | 1.02s
130 | stockmark/stockmark-2-100b-instruct | stockmark | SUCCESS | 1.35s
131 | upstage/solar-10.7b-instruct | upstage | SUCCESS | 1.78s
132 | writer/palmyra-creative-122b | writer | ERROR | –
133 | writer/palmyra-fin-70b-32k | writer | ERROR | –
134 | writer/palmyra-med-70b | writer | ERROR | –
135 | writer/palmyra-med-70b-32k | writer | ERROR | –
136 | z-ai/glm-5.1 | z-ai | TIMEOUT | –
137 | z-ai/glm4.7 | z-ai | FAIL | –
138 | z-ai/glm5 | z-ai | TIMEOUT | –
139 | zyphra/zamba2-7b-instruct | zyphra | ERROR | –

Key Takeaways

1. The catalog includes many non-chat models, so a high error count is expected when every model is sent to /chat/completions.
2. DeepSeek and very large models were the biggest timeout risks in this run.
3. Smaller and mid-sized instruction models gave the best developer experience.
4. Re-run the test with the same prompt 3–5 times per model before making a final production decision.
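The re-run advice can be sketched in a few lines: probe each model several times and rank by the median, which is far more stable than a single-shot latency. `measure` is a hypothetical callable returning one latency sample in seconds.

```python
import statistics

def median_latency(model_id, measure, trials=5):
    """Take several latency samples for one model and return the median."""
    samples = [measure(model_id) for _ in range(trials)]
    return statistics.median(samples)
```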