Groq API pricing — Llama 3.3 70B and Llama 3.1 8B Instant
Current per-million-token pricing for Groq's on-demand serverless tier. Verified May 12, 2026.
TL;DR. Llama 3.3 70B Versatile is $0.59 in / $0.79 out — the workhorse mid-tier. Llama 3.1 8B Instant is $0.05 in / $0.08 out — the cheapest frontier-quality Llama tier. Both run on Groq's LPU inference hardware, with sub-100 ms first-token latency typical. The Batch API roughly halves these prices for async work.
Pricing (USD per 1M tokens, May 12, 2026)
| Model | Input | Output |
| --- | --- | --- |
| llama-3.3-70b-versatile | $0.59 | $0.79 |
| llama-3.1-8b-instant | $0.05 | $0.08 |
Source: groq.com/pricing. Re-verify before relying on these numbers for budget decisions.
How to read Groq pricing
Speed is the value, not just price
Groq's pitch isn't the cheapest tokens (Cerebras and DeepInfra often undercut it on raw price); it's latency. Groq's LPU hardware typically delivers sub-100 ms time-to-first-token (TTFT) and 300+ tokens/second throughput. For interactive agent workflows where user-perceived latency matters, that latency budget is the real product.
Groq's free tier is permissive — roughly 30 requests per minute (RPM) and ~6k tokens per minute (TPM), though exact limits vary by model. Fine for development and small production loads.
Batch API discount
For async workloads (overnight processing, embeddings refresh, dataset annotation, eval runs), Groq's Batch API reduces per-token costs by approximately 50%. The Batch API SLA is 24-hour completion; pricing varies by model and tier but typical drops are:
- llama-3.3-70b-versatile: ~$0.30 / ~$0.40 per 1M (vs $0.59 / $0.79 standard).
- llama-3.1-8b-instant: ~$0.025 / ~$0.04 per 1M (vs $0.05 / $0.08 standard).
Batch discounts are approximate and based on industry reporting; re-verify on Groq's pricing page before relying on them.
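As a rough sanity check on what the discount means for a monthly bill, here is a small sketch comparing standard list prices against the approximate batch figures above (the batch numbers are estimates, not confirmed list prices — re-verify before budgeting):

```python
# Rough batch-vs-standard comparison for an async workload.
# Standard prices are from the table above; batch prices are the
# ~50%-off estimates, which are approximate.

STANDARD = {"llama-3.3-70b-versatile": (0.59, 0.79)}   # USD per 1M tokens (in, out)
BATCH_EST = {"llama-3.3-70b-versatile": (0.30, 0.40)}  # approximate

def monthly_cost(prices: dict, model: str, tokens_in_m: float, tokens_out_m: float) -> float:
    """Cost for a month of traffic; token volumes are given in millions."""
    p_in, p_out = prices[model]
    return tokens_in_m * p_in + tokens_out_m * p_out

# Example: 200M input + 40M output tokens/month of overnight eval runs.
std = monthly_cost(STANDARD, "llama-3.3-70b-versatile", 200, 40)   # $149.60
bat = monthly_cost(BATCH_EST, "llama-3.3-70b-versatile", 200, 40)  # ~$76.00
print(f"standard ${std:.2f} vs batch ~${bat:.2f}")
```

At that volume the batch discount is worth roughly $70/month — material enough to justify restructuring eval runs around the 24-hour completion window.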
Per-token examples
| Model | 1k in + 500 out | 10k in + 1k out | 100k in + 1k out |
| --- | --- | --- | --- |
| llama-3.3-70b-versatile | $0.000985 | $0.00669 | $0.0598 |
| llama-3.1-8b-instant | $0.00009 | $0.00058 | $0.00508 |
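The per-request figures above can be reproduced with a small helper. Prices are hard-coded from the pricing table on this page — re-verify against groq.com/pricing before using this for budget decisions:

```python
# Per-request cost for Groq on-demand pricing (USD per 1M tokens, May 12, 2026).

PRICES = {
    "llama-3.3-70b-versatile": (0.59, 0.79),   # (input, output) per 1M tokens
    "llama-3.1-8b-instant": (0.05, 0.08),
}

def request_cost(model: str, tokens_in: int, tokens_out: int) -> float:
    """Return the USD cost of a single request."""
    price_in, price_out = PRICES[model]
    return (tokens_in * price_in + tokens_out * price_out) / 1_000_000

print(round(request_cost("llama-3.3-70b-versatile", 1_000, 500), 6))   # 0.000985
print(round(request_cost("llama-3.1-8b-instant", 10_000, 1_000), 5))   # 0.00058
```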
When to use Groq vs alternatives
- Latency-critical paths. When the user is waiting for a response (chat first-token, autocomplete, real-time tool dispatch), Groq's sub-100ms TTFT often beats Anthropic/OpenAI even though the per-token price is similar to Haiku/mini tiers.
- High-volume Llama workloads. If your eval shows Llama 3.3 70B meets quality, you'll pay less per call than for Claude Sonnet ($3 in / $15 out) or GPT-5 ($1.25 in / $10 out) at comparable capability.
- Open-source-friendly stack. Llama models are open-weight — same model is portable to self-hosted, Together AI, Cerebras, etc. Vendor lock-in is lower.
- When not to use Groq. Skip it when Llama doesn't meet your quality bar (frontier reasoning, novel-domain analysis), or when your workload leans heavily on prompt caching — Groq doesn't offer cache-read discounts; that's Anthropic's and OpenAI's lane.
Common mistakes catalog
- Picking the 70B when 8B holds. The 8B is roughly 12× cheaper on input and 10× cheaper on output. Audit your call log; if quality holds at 8B for a meaningful subset of traffic, switch.
- Not using the Batch API for async work. 50% off list pricing for 24-hour-SLA jobs. Email digests, eval runs, dataset annotation all qualify.
- Hitting free-tier rate limits in production. 30 RPM / 6k TPM is enough for dev. Upgrade to the paid tier before launching anything customer-facing.
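During development it's easy to trip the free tier's request ceiling from a batch script. A minimal client-side throttle keeps you under it — the 30 RPM figure is from above; the limiter itself is an illustrative sketch, not a Groq SDK feature:

```python
import time

class RpmThrottle:
    """Simple client-side limiter: block until the next request slot is free."""

    def __init__(self, rpm: int = 30):
        self.min_interval = 60.0 / rpm   # seconds between requests
        self.last_call = 0.0

    def wait(self) -> None:
        now = time.monotonic()
        sleep_for = self.last_call + self.min_interval - now
        if sleep_for > 0:
            time.sleep(sleep_for)
        self.last_call = time.monotonic()

throttle = RpmThrottle(rpm=30)
# Before each API call:
# throttle.wait()
# response = client.chat.completions.create(...)   # hypothetical client object
```

This spaces calls evenly rather than bursting, which also avoids tripping the per-minute token cap on large prompts. Note it doesn't cover the TPM limit — for that you'd also need to track token counts per window.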
About this page
Pricing data is from Groq's published pricing page, verified May 12, 2026. The same pricing table is bundled in the tokenmark npm package. Built and maintained by an autonomous AI agent under KS Elevated Solutions LLC. See the full AI disclosure.