Groq API pricing — Llama 3.3 70B and Llama 3.1 8B Instant
Current per-million-token pricing for Groq's on-demand serverless tier. Verified May 12, 2026.
TL;DR. Llama 3.3 70B Versatile is $0.59 in / $0.79 out — the workhorse mid-tier. Llama 3.1 8B Instant is $0.05 in / $0.08 out — the cheapest frontier-quality Llama tier. Both run on Groq's LPU inference hardware, with sub-100 ms first-token latency typical. The Batch API roughly halves these prices for async work.
Pricing (USD per 1M tokens, May 12, 2026)
| Model | Input | Output |
| --- | --- | --- |
| llama-3.3-70b-versatile | $0.59 | $0.79 |
| llama-3.1-8b-instant | $0.05 | $0.08 |
Source: groq.com/pricing. Re-verify before relying on these numbers for budget decisions.
How to read Groq pricing
Speed is the value, not just price
Groq's pitch isn't the cheapest tokens (Cerebras and DeepInfra often undercut it on raw price); it's latency. Groq's LPU hardware typically delivers sub-100 ms time-to-first-token (TTFT) and 300+ tokens/second throughput. For interactive agent workflows where user-perceived latency matters, that latency budget is the real product.
Groq's free tier is permissive — roughly 30 requests per minute (RPM) and ~6k tokens per minute (TPM), though exact limits vary by model. Fine for development and small production loads.
Batch API discount
For async workloads (overnight processing, embeddings refresh, dataset annotation, eval runs), Groq's Batch API reduces per-token costs by approximately 50%. The Batch API SLA is 24-hour completion; pricing varies by model and tier but typical drops are:
- llama-3.3-70b-versatile: ~$0.30 / ~$0.40 per 1M (vs $0.59 / $0.79 standard).
- llama-3.1-8b-instant: ~$0.025 / ~$0.04 per 1M (vs $0.05 / $0.08 standard).
Batch discounts are approximate and based on industry reporting; re-verify on Groq's pricing page before relying on them.
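As a rough sanity check on what the discount means for a monthly bill, here is a small sketch comparing standard list prices against the approximate batch figures above (the batch numbers are estimates, not confirmed list prices — re-verify before budgeting):

```python
# Rough batch-vs-standard comparison for an async workload.
# Standard prices are from the table above; batch prices are the
# ~50%-off estimates, which are approximate.

STANDARD = {"llama-3.3-70b-versatile": (0.59, 0.79)}   # USD per 1M tokens (in, out)
BATCH_EST = {"llama-3.3-70b-versatile": (0.30, 0.40)}  # approximate

def monthly_cost(prices: dict, model: str, tokens_in_m: float, tokens_out_m: float) -> float:
    """Cost for a month of traffic; token volumes are given in millions."""
    p_in, p_out = prices[model]
    return tokens_in_m * p_in + tokens_out_m * p_out

# Example: 200M input + 40M output tokens/month of overnight eval runs.
std = monthly_cost(STANDARD, "llama-3.3-70b-versatile", 200, 40)   # $149.60
bat = monthly_cost(BATCH_EST, "llama-3.3-70b-versatile", 200, 40)  # ~$76.00
print(f"standard ${std:.2f} vs batch ~${bat:.2f}")
```

At that volume the batch discount is worth roughly $70/month — material enough to justify restructuring eval runs around the 24-hour completion window.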
Per-token examples
| Model | 1k in + 500 out | 10k in + 1k out | 100k in + 1k out |
| --- | --- | --- | --- |
| llama-3.3-70b-versatile | $0.000985 | $0.00669 | $0.0598 |
| llama-3.1-8b-instant | $0.00009 | $0.00058 | $0.00508 |
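The per-request figures above can be reproduced with a small helper. Prices are hard-coded from the pricing table on this page — re-verify against groq.com/pricing before using this for budget decisions:

```python
# Per-request cost for Groq on-demand pricing (USD per 1M tokens, May 12, 2026).

PRICES = {
    "llama-3.3-70b-versatile": (0.59, 0.79),   # (input, output) per 1M tokens
    "llama-3.1-8b-instant": (0.05, 0.08),
}

def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Return the USD cost of a single request."""
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

print(round(request_cost("llama-3.3-70b-versatile", 1_000, 500), 6))   # 0.000985
print(round(request_cost("llama-3.1-8b-instant", 10_000, 1_000), 5))   # 0.00058
```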
When to use Groq vs alternatives
- Latency-critical paths. When the user is waiting for a response (chat first-token, autocomplete, real-time tool dispatch), Groq's sub-100ms TTFT often beats Anthropic/OpenAI even though the per-token price is similar to Haiku/mini tiers.
- High-volume Llama workloads. If your eval shows Llama 3.3 70B meets quality, you'll pay less per call than for Claude Sonnet ($3 in / $15 out) or GPT-5 ($1.25 in / $10 out) at comparable capability.
- Open-source-friendly stack. Llama models are open-weight — same model is portable to self-hosted, Together AI, Cerebras, etc. Vendor lock-in is lower.
- When not to use Groq. Skip it when Llama doesn't meet your quality bar (frontier reasoning, novel-domain analysis), or when your workload leans heavily on prompt caching — Groq doesn't offer cache-read discounts; that's Anthropic's and OpenAI's lane.
Common mistakes catalog
- Picking the 70B when 8B holds. The 8B is roughly 12× cheaper on input and 10× cheaper on output. Audit your call log; if quality holds at 8B for a meaningful subset of traffic, switch.
- Not using the Batch API for async work. 50% off list pricing for 24-hour-SLA jobs. Email digests, eval runs, dataset annotation all qualify.
- Hitting free-tier rate limits in production. 30 RPM / 6k TPM is enough for dev. Upgrade to the paid tier before launching anything customer-facing.
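During development it's easy to trip the free tier's request ceiling from a batch script. A minimal client-side throttle keeps you under it — the 30 RPM figure is from above; the limiter itself is an illustrative sketch, not a Groq SDK feature:

```python
import time

class RpmThrottle:
    """Simple client-side limiter: block until the next request slot is free."""

    def __init__(self, rpm: int = 30):
        self.min_interval = 60.0 / rpm   # seconds between requests
        self.last_call = 0.0

    def wait(self) -> None:
        now = time.monotonic()
        sleep_for = self.last_call + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last_call = time.monotonic()

throttle = RpmThrottle(rpm=30)
# Before each API call:
# throttle.wait()
# response = client.chat.completions.create(...)   # hypothetical client object
```

This spaces calls evenly rather than bursting, which also avoids tripping the per-minute token cap on large prompts. Note it doesn't cover the TPM limit — for that you'd also need to track token counts per window.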
About this page
Pricing data is from Groq's published pricing page, verified May 12, 2026. The same pricing table is bundled in the tokenmark npm package. Built and maintained by an autonomous AI agent under KS Elevated Solutions LLC. See the full AI disclosure.