/t · guide · cost mechanics

Claude Code cost is set by context, not by which model you picked

Matthew Diakonov, Written with AI

Published May 18, 20266 min read

Every Claude Code cost guide opens with the same table: Opus is $5 per million input tokens, Sonnet is $3, Haiku is $1, switch to Haiku for simple tasks. The table is correct. The story it tells is incomplete. The biggest movers on your bill aren't in that table at all.

The cleanest way to see this is the cost formula itself. The bill per turn is tokens × rate. The bill per session is that times the turn count. Three variables. Only one of them is the model.

The function signature is the thesis

Here's the cost line from ccmd's own analyzer, and the function it lives inside. Note that the function takes a single string. There is no second argument for "and what model are you on."

src/lib/analyzer.ts

The reason the analyzer takes one argument is that the variable worth grading is the file. The model is a constant once the session starts. The turn count is a function of how the session goes, which is downstream of the file too (a worse file produces more retries). Tokens are what you actually edit.

The constant OPUS_IN_PER_M = 15 in the snippet above is conservative on purpose. The current Opus 4.7 base rate is $5 per million input tokens (cheaper than Opus 4.1's $15 rate that the analyzer hard-codes), but the ratio of token-savings to dollar-savings doesn't change whichever number you slot in. Cutting tokens by half cuts dollars by half on every line of the table.

Two ways to save the same 5x

Below: the same monthly bill on two different strategies. Left is "drop a model tier." Right is "trim the file and stay on Opus." Identical dollar outcome. The right column also keeps your model quality, which the left column trades away.

Two paths to $5.40/month

// "I'll switch to Haiku to save money"
//
// 6,000-token CLAUDE.md
// 30 turns
// Haiku 4.5: $1 / M input
//
// per session:  $0.18
// per month*:   $5.40
//
// * one session per day, 30 days
// * input only, output ignored

0% fewer lines

Most cost guides only describe the left column. That's the column that doesn't require you to look at your CLAUDE.md. It's easier to write about. It's also why people switch to Haiku, hit a wall on agentic coding quality, switch back, and conclude "Claude Code is expensive." The right column was available the whole time.

The lever that's bigger than the model picker

Anthropic's pricing page lists a column most people skim past: Cache Hits & Refreshes. On Opus 4.7 the base input is $5 per million; the cache-hit rate is $0.50 per million. That's 0.1x of base, a 10x discount on context that's already been processed once in the last 5 minutes (or 1 hour, depending on which cache TTL the client uses).

A cache hit happens when the prefix of your prompt is byte-for-byte identical to a prefix the model has already seen. CLAUDE.md sits near the top of the system prompt, so it's in the prefix territory that benefits most from cache. The problem: any line that changes between sessions invalidates the cache from that line onward. A few common offenders:

Today is 2026-05-18. (changes every day)
## Last updated: 2026-05-17 (changes every commit)
We are mid-sprint on subscriptions. (changes every two weeks)
Session-specific TODO lists near the top of the file

The ccmd analyzer flags these with the cache_bust finding kind (src/lib/analyzer.ts line 194):

src/lib/analyzer.ts

The check fires only on the first 20 lines because those are the ones that, if they change, invalidate the largest downstream chunk. A date stamp at line 3 of a 200-line file costs you the cache discount on lines 3 through 200 every session. The fix is one line move, not a model swap.

10x

“Cache read tokens are charged at a fraction of the standard input price (0.1x), which means caching pays off after just one cache read.”

Anthropic, Pricing — prompt caching, 2026-05-18

The levers, ranked by how much they actually move

Same six numbers, ranked by impact on your monthly Claude Code bill. Note that "switch model" isn't at the top, and "new tokenizer" (a thing you can't opt out of) shows up at all.

Feature	Why	Effect
Switch Opus 4.7 → Haiku 4.5	$5/M → $1/M	5x cheaper
Trim a 6,000-token CLAUDE.md to 1,200	same model, same prompt-cache state	5x cheaper
Remove one ISO date from line 3 of CLAUDE.md	$5/M → $0.50/M (cache hits replace cache writes)	10x cheaper
Layered stack (project + user + imports)	tokens add, model rate doesn't change	1x–20x worse
Opus 4.6 → Opus 4.7 new tokenizer	same file, more tokens, same per-M rate	up to 1.35x worse

The first two rows are 5x each, and they stack. The third row is a single edit. The fourth row is the one most teams discover accidentally: a global ~/.claude/CLAUDE.md that's grown to 20K tokens loads on every repo you open. The fifth row is what shows up if you upgraded models without re-measuring.

When the model picker actually is the right lever

Three cases where switching models really is the move and context isn't the bottleneck:

Batch summarization or classification. The task itself is one-shot, context is tiny, and Haiku gets it right the first time. The 5x rate cut translates directly to a 5x bill cut. No retries lost.
Subagents you spawn from a main session. Each subagent runs in its own context, doesn't inherit CLAUDE.md by default, and often only needs Sonnet or Haiku. Subagent model selection is a real, fine-grained cost lever.
Pure output-heavy tasks. If your session is mostly generation (long doc, big diff, stream-of-thought), output token rate ($25/M on Opus vs $5/M on Haiku) dominates and context trimming doesn't help. Pick the cheaper model.

For the typical agentic-coding case (one main Opus session, reading 30+ files over 30+ turns), none of these apply. The model choice is fixed at Opus on quality grounds. The bill is set by how much context fires on every one of those turns.

The ledger most cost pages skip

A working ledger for one engineer on Opus 4.7, one session a day, 30 turns per session, base rate $5/M, cache hit $0.50/M:

Baseline (6,000-token CLAUDE.md, no cache): 6,000 × 30 × $5/M = $0.90/session, $27/month
Same file, cache hits: 6,000 × 30 × $0.50/M = $0.09/session, $2.70/month
Trim to 1,200 tokens, cache hits: 1,200 × 30 × $0.50/M = $0.018/session, $0.54/month
Same baseline but on Haiku 4.5 with no cache: 6,000 × 30 × $1/M = $0.18/session, $5.40/month (worse than caching on Opus)

The ranking that falls out: caching beats model-switching, and trimming beats both, and the three stack. The model-pricing table isn't wrong. It's just the third-most-interesting variable.

Want to see which lines in your CLAUDE.md are doing the bill

Paste your file at ccmd.dev for a free score, or book a 15-min call and we'll walk through your config, cache-bust hot spots, and where the 10x lever lives in your specific setup.

FAQ

Frequently asked questions

If context drives cost, why does Anthropic publish model rates?

Because the rate is the multiplier on top of the tokens, and you need to know both to compute a bill. The official 2026 model rates are Opus 4.7 at $5/M input, Sonnet 4.6 at $3/M, Haiku 4.5 at $1/M. The ceiling between cheapest and most expensive is 5x. That's a real ceiling, but it's smaller than what your config file moves on its own, so the rate is the constant once you've chosen the model and the file is the variable that's still in your hands.

Isn't switching to Haiku still the easiest win?

It's the easiest, not the biggest. Haiku is genuinely the right call for tasks like summarization, classification, or batched generation where the quality drop is invisible. For agentic coding sessions where Claude is reasoning over a repo, dropping to Haiku trades 5x cost for a noticeably lower hit-rate on the first try, which often means more turns, which often eats the 5x back. The same 5x is available on Opus by halving context, and that doesn't trade quality at all. Most pages don't show this option because they're written model-first.

What is a cache-busting line and why is it the biggest lever?

Anthropic's prompt-cache reads cost 0.1x base input price (a 10x discount). The cache breaks when the prefix of your prompt changes. CLAUDE.md sits near the top of the system prompt, so a line like 'Today is 2026-05-18 and we're mid-sprint on subscriptions' or '## Last updated: 2026-05-18' rewrites the prefix every day and turns every session-start into a cache write instead of a hit. That's a 10x cost difference on the same model and the same file. The ccmd analyzer flags this with a `cache_bust` finding on any timestamp or session-specific text in the first 20 lines.

How can the formula prove anything without measuring real sessions?

The formula is the per-turn cost. Total session cost is per-turn cost × turn count. The variables are tokens (yours), turns (a function of how the session goes), and rate (the model). When you trace which of those moves on a typical engineer's workflow: rate is set once at session start and doesn't move; turns vary by problem but not by your config; tokens vary directly with the size of CLAUDE.md, the layered imports, and the discovery payload of skills/MCP/hooks. The biggest variance lives in the variable you control most cheaply, which is context.

What about output tokens? Aren't those a bigger line item?

Output tokens are usually a fraction of input tokens in a coding session because Claude reads more than it writes (it greps, reads, plans, writes a 40-line diff). Anthropic's own worked example in the pricing docs (50k in, 15k out on Opus 4.7) shows input at $0.25 and output at $0.375, so on that mix output edges input. But the share that's truly under your control is still input, because output volume tracks the task. The lever you have on output is the same one you have on input: fewer turns, smaller context per turn, and prompt caching, which doesn't help output but does free up the input budget the model uses to think.

How does the analyzer handle the layered CLAUDE.md problem?

It scores the file you paste. The full per-turn bill is the layered stack: managed policy, ~/.claude/CLAUDE.md, the project CLAUDE.md, and every @import the project file pulls. ccmd's analyzer is one string in, one score out, so you paste each layer separately or paste the concatenation and read the total. That's deliberate — the per-layer numbers are what tell you which file to cut. There's a separate guide on the layered stack at /t/layered-claude-md-token-cost.

Where does Opus 4.7's new tokenizer come in?

Anthropic's pricing page notes that Opus 4.7 'may use up to 35% more tokens for the same fixed text' than previous Opus models. Same file, more tokens, same per-M rate. So an upgrade from Opus 4.6 to 4.7 raises your effective context cost without you touching CLAUDE.md. That's another data point for the thesis: context is the variable, the model is a multiplier, and even within one model family the multiplier can shift while your file is doing all the work.

What's the one thing to do tonight?

Open CLAUDE.md, count the lines that are date-stamped or 'as of', and move them to the bottom of the file or remove them. That's the cache-bust line of attack: a 10x cost difference at the same model, before you trim a single rule. Then paste the file into ccmd.dev and let it grade the rest.