Filter models by minimum speed and budget. See which models offer the best tokens-per-second for your price range, with an interactive scatter chart.
**Best for Speed:** Groq — Llama 3.1 8B (750 tok/s · $0.0800/1M out)

**Best Value:** Groq — Llama 3.1 8B (efficiency score 8333.3)
| Provider | Model | Speed (tok/s) | Output Cost / 1M | Efficiency Score |
|---|---|---|---|---|
| Groq | Llama 3.1 8B | 750 | $0.0800 | 8333.3 |
| Mistral | Nemo | 150 | $0.0200 | 5000 |
| AWS Bedrock | Nova Micro | 150 | $0.1400 | 1000 |
| Google | Gemini Flash-Lite | 200 | $0.3000 | 645.2 |
| AWS Bedrock | Nova Lite | 120 | $0.2400 | 480 |
| Google | Gemini Flash | 180 | $0.4000 | 439 |
| Groq | Llama 3.3 70B | 250 | $0.7900 | 312.5 |
| xAI | Grok 2 Mini | 100 | $0.4000 | 243.9 |
| OpenAI | GPT-4o mini | 120 | $0.6000 | 196.7 |
| Mistral | Small | 120 | $0.6000 | 196.7 |
| DeepSeek | V3 | 50 | $0.2800 | 172.4 |
| Cohere | Command R | 80 | $0.6000 | 131.1 |
| AWS Bedrock | Llama 3.3 70B | 80 | $0.7200 | 109.6 |
| AWS Bedrock | Claude Haiku 3 | 100 | $1.25 | 79.4 |
| Mistral | Medium | 100 | $2.00 | 49.8 |
| AWS Bedrock | Nova Pro | 80 | $3.20 | 24.9 |
| Anthropic | Claude Haiku 4.5 | 100 | $5.00 | 20 |
| Mistral | Large | 70 | $6.00 | 11.6 |
| OpenAI | GPT-5 | 80 | $10.00 | 8 |
| OpenAI | GPT-4o | 60 | $10.00 | 6 |
| xAI | Grok 2 | 60 | $10.00 | 6 |
| Cohere | Command R+ | 60 | $10.00 | 6 |
| Google | Gemini Pro | 70 | $12.00 | 5.8 |
| AWS Bedrock | Mistral Large | 60 | $12.00 | 5 |
| Anthropic | Claude Sonnet 4.6 | 70 | $15.00 | 4.7 |
| AWS Bedrock | Claude Sonnet 3.5 | 70 | $15.00 | 4.7 |
* Efficiency score = tokens per second divided by output cost per 1M tokens. Higher = better value. Speeds are provider-published benchmarks as of March 2026.
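Taken literally, the footnote's formula can be sketched in a few lines. The helper name `efficiency_score` is illustrative, and the published scores may fold in additional normalization, so treat this as an approximation rather than the exact calculation behind the table:

```python
def efficiency_score(tokens_per_second: float, output_cost_per_1m: float) -> float:
    """Efficiency score per the footnote: generation speed divided by
    output cost per 1M tokens. Higher = more speed per dollar."""
    return tokens_per_second / output_cost_per_1m

# Hypothetical model: 100 tok/s at $0.50 per 1M output tokens
print(round(efficiency_score(100, 0.50), 1))  # 200.0
```

A cheap, fast model scores high on both axes; a slow, expensive one scores low, which is why the ranking compresses so sharply toward the bottom of the table.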
Digiqt audits your AI stack and recommends the right model mix to cut costs without sacrificing speed.
Tokens per second (tok/s) measures how fast a model generates output. At 100 tok/s, a 200-token response takes ~2 seconds. Higher tok/s = faster responses, which matters for real-time chat and interactive applications.
Groq uses custom Language Processing Units (LPUs) specifically designed for fast sequential inference, achieving 500–800+ tok/s on 70B models. Standard GPU providers typically achieve 60–150 tok/s on comparable models.
Speed itself doesn't affect output quality — it's about hardware efficiency. The same model weights give identical output regardless of inference speed. Groq runs open-source models like Llama at much higher speeds than GPU providers.
For a good real-time chat experience, aim for 80+ tok/s. At this speed, a typical 200-token response arrives in ~2.5 seconds, which feels responsive. For streaming word-by-word, even 30–50 tok/s can feel acceptable.
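The rule-of-thumb arithmetic above is just output tokens divided by speed (the function name is illustrative):

```python
def response_seconds(output_tokens: int, tok_per_s: float) -> float:
    """Estimated wall-clock time to generate a response at a given speed."""
    return output_tokens / tok_per_s

print(response_seconds(200, 100))  # 2.0 s, the example above
print(response_seconds(200, 80))   # 2.5 s, the "responsive" threshold
```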
Time each API request from send to final token received. Divide output token count by elapsed seconds. Most providers publish benchmark speeds, but real-world performance varies with load, model size, and prompt complexity.
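That measurement can be sketched as a small timing wrapper. This assumes a client call that returns the output token count; `fake_generate` below is a stand-in for a real API client, not a real SDK function:

```python
import time

def measure_tok_per_s(generate, prompt: str) -> float:
    """Time one request from send to final token, then divide
    output tokens by elapsed seconds."""
    start = time.perf_counter()
    output_token_count = generate(prompt)  # your client call; returns token count
    elapsed = time.perf_counter() - start
    return output_token_count / elapsed

# Stand-in client: pretends to generate 200 tokens in ~0.1 s.
def fake_generate(prompt: str) -> int:
    time.sleep(0.1)
    return 200

speed = measure_tok_per_s(fake_generate, "Hello")
# Roughly 2000 tok/s for the stand-in; real providers vary with load.
```

Average several runs at different times of day before drawing conclusions, since provider load can swing real-world numbers well away from published benchmarks.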