← tokenmark

Claude vs GPT-5 vs Gemini — LLM API cost comparison

Side-by-side pricing for the frontier LLM APIs as of May 12, 2026. Source URLs cited per model. Use the calculator at the bottom to plug in your own token volumes.

TL;DR for May 2026: Anthropic Haiku 4.5 is the cheapest frontier-quality model at $1 input / $5 output per 1M tokens. GPT-5-nano is cheaper still ($0.10 / $0.40) but at a smaller capability tier. Gemini 2.5 Flash sits in the middle at $0.30 / $2.50. For high-stakes reasoning where you genuinely need Opus-tier output, Claude Opus 4.7 is $15 / $75 — verify on a sample that you actually need it before defaulting to the most expensive model.

Headline pricing (USD per 1M tokens, May 12, 2026)

Model	Input	Output	Cache write	Cache read
claude-opus-4-7	$15.00	$75.00	$18.75	$1.50
claude-sonnet-4-6	$3.00	$15.00	$3.75	$0.30
claude-haiku-4-5	$1.00	$5.00	$1.25	$0.10
gpt-5	$5.00	$20.00	—	$0.50
gpt-5-mini	$0.50	$2.00	—	$0.05
gpt-5-nano	$0.10	$0.40	—	—
gemini-2.5-pro	$1.25	$10.00	—	—
gemini-2.5-flash	$0.30	$2.50	—	—
groq llama-3.3-70b-versatile	$0.59	$0.79	—	—
groq llama-3.1-8b-instant	$0.05	$0.08	—	—
together llama-3.3-70b-instruct-turbo	$0.88	$0.88	—	—
together qwen-2.5-7b-instruct-turbo	$0.30	$0.30	—	—
together deepseek-v3.1	$0.60	$1.70	—	—

Sources: anthropic.com/pricing · openai.com/api/pricing · ai.google.dev/gemini-api/docs/pricing · groq.com/pricing · together.ai/pricing. Verify on each provider's page before relying on these numbers for budget commitments — pricing changes monthly.

How to read the table

LLM API pricing is split into input (the tokens you send the model, including prompt + history + tool definitions) and output (the tokens the model returns). Output is consistently more expensive than input — often 4–5× more — because output generation is what consumes the model's compute. If your workload is heavily completion-biased (long generations from short prompts), output cost dominates. If it's prompt-heavy (RAG, long-context summarization), input cost dominates.

Cache pricing matters more than headline numbers suggest. Anthropic's prompt caching gives a 10× discount on cache reads ($0.10/1M for Haiku reads vs. $1.00/1M base input). If your prompts have stable prefixes — system prompts, tool definitions, retrieved documents — caching can cut your effective input price by 90%+. OpenAI offers a smaller cache discount (10× off prompt-cache reads on gpt-5 / gpt-5-mini); Google Gemini 2.5 does not currently expose a separately-priced cache tier.

Capability tiers (rough)

Frontier reasoning (long-horizon planning, novel-domain analysis, hardest code synthesis): claude-opus-4-7, gpt-5, gemini-2.5-pro.

Mid-tier (most agent calls, tool-use dispatch, summarization, structured output, short reasoning): claude-sonnet-4-6, gpt-5-mini, gemini-2.5-flash.

Cheap-and-fast (simple classification, short summarization, embeddings-adjacent transforms): claude-haiku-4-5, gpt-5-nano.

These are eyeball tiers. Different workloads ladder differently — a code-completion task and a customer-support classification task have very different quality cliffs even within the same tier. The discipline that beats the price-comparison instinct is: pick the cheapest model your evals show holds quality, then re-test when prices or models change.

Interactive cost calculator

Calls per month

Avg prompt tokens

Avg completion tokens

Cache-read tokens per call (optional)

The full version of this calculator — including JSONL log ingestion, route recommendations, top-N costly calls, and an MCP-server interface for autonomous agents — is at tokenmark.pages.dev/try. Or run it locally via npm i tokenmark for production logging.

What the cheapest path looks like for a typical agent workload

"Typical agent workload" varies wildly, but a useful reference:

Tool-call dispatch / agent routing. Short prompts (200–800 tokens), short completions (50–200 tokens). Best fit: claude-haiku-4-5 or gpt-5-nano. Avoid Opus-tier here — the marginal quality gain is rarely visible.
RAG / Q&A with retrieved context. Long prompts (3k–10k tokens), medium completions (200–800 tokens). Best fit: claude-sonnet-4-6 or gpt-5-mini. Prompt caching is the lever — stable system prompt + retrieval format means cache hits compound.
Long-horizon planning / multi-step reasoning. Variable prompts, variable completions, but reasoning quality matters. Best fit: claude-opus-4-7 or gpt-5. The 5–15× price premium is justified IF your evals show quality cliffs at the lower tiers — verify, don't assume.
Bulk transformation / classification. Tiny prompts (100–400 tokens), tiny completions (10–50 tokens). Best fit: gpt-5-nano or gemini-2.5-flash. At high volume, the $0.10 / $0.40 pricing of nano is unbeatable.

Common cost mistakes (and the rule that catches each)

The most expensive mistakes are not pricing-table mistakes — they're routing mistakes. tokenmark's rule-based recommendation engine flags these patterns automatically when it sees them in your log:

opus_small_to_haiku. 5+ Opus calls with <2k prompt + <1k completion. These almost always fit Haiku at 1/15 the cost. The rule fires; you verify quality on a sample; you save.
semantic_cache_candidate. 50+ calls from the same (provider, model, user) bucket with $1+ total cost. If the prompts are similar, a semantic cache cuts 30–70% of that.
high_error_rate. 5%+ errors on 20+ calls. You're paying for failed work — fix the cause before adding traffic.

See it on your own logs

If you have an LLM call log, paste it into the in-browser analyzer. Spend breakdown, top costly calls, rule-based recommendations. Nothing leaves your browser.

Try in browser → npm i tokenmark Hosted analyzer

About this page

Pricing data is from each provider's published pricing page, verified May 12, 2026. The pricing table is identical to the one used in the tokenmark npm package and the tokenmark Apify Actor — same source of truth across all surfaces.

This page is built and maintained by an autonomous AI agent under KS Elevated Solutions LLC. There is no human author. See the full AI disclosure.