You are not training Claude Code with CLAUDE.md. You are renting prompt space from yourself, by the turn.

The phrase “training cost” assumes a one-time charge that amortizes over future inference. CLAUDE.md is the opposite: it never enters the model's weights, it gets re-sent in full on every turn, and the bill arrives on every turn for as long as that line stays in the file. Once you see the math, every rule has a price tag.

Matthew Diakonov

Direct answer (verified 2026-05-19)

There is no training. CLAUDE.md is plain Markdown that the Claude Code harness pastes into the system prompt on every API call. That means the per-turn token cost equals the file's full token count, and the lifetime cost is tokens x turns x input_rate / 1,000,000. At Opus 4.7's $5 per million input rate, a 6,400-token file is $0.032 per turn, about $0.96 per 30-turn session, and roughly $350 per year if you run a session a day, with no caching. The cost never amortizes because nothing is learned.

Rates from platform.claude.com/docs/en/about-claude/pricing. Firing model from src/lib/analyzer.ts line 264.
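
The arithmetic above can be checked in a few lines of TypeScript. This is a minimal sketch, not code from the analyzer: the function name `claudeMdCost` and the constant names are hypothetical, and the $5 per million rate is the Opus 4.7 figure quoted above.

```typescript
// Hypothetical helper: USD cost of re-sending a CLAUDE.md prefix, uncached.
// tokens: token count of the file
// turns: number of API calls that carry the file
// inputRatePerMillion: input price in USD per 1,000,000 tokens
function claudeMdCost(tokens: number, turns: number, inputRatePerMillion: number): number {
  return (tokens * turns * inputRatePerMillion) / 1_000_000;
}

const FILE_TOKENS = 6_400;     // example file size from the text
const OPUS_INPUT_RATE = 5;     // USD per million input tokens (rate quoted above)
const TURNS_PER_SESSION = 30;  // one 30-turn session a day, as in the text

const perTurn = claudeMdCost(FILE_TOKENS, 1, OPUS_INPUT_RATE);
const perSession = claudeMdCost(FILE_TOKENS, TURNS_PER_SESSION, OPUS_INPUT_RATE);
const perYear = claudeMdCost(FILE_TOKENS, TURNS_PER_SESSION * 365, OPUS_INPUT_RATE);

console.log(perTurn.toFixed(3));    // 0.032
console.log(perSession.toFixed(2)); // 0.96
console.log(perYear.toFixed(2));    // 350.40
```

Nothing in the function depends on how long the file has been in use, which is the whole point: the same multiplication applies on every turn.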

The misconception buried inside the word “training”

When people search this phrase, they usually have a mental model borrowed from fine-tuning: you pay once to update model behavior, then every future call benefits at no extra cost. Under that model, a 6,400-token CLAUDE.md feels like a sunk investment that gets cheaper the longer you use the agent. The opposite is true.

Two mental models, very different bills

One-time write. Pay tokens once at 'training time'. Model learns the rules. Future turns are cheaper because the model already knows. Adding rules is a capex line.

  • Pay once
  • Behavior baked in
  • Cost decays with use
  • Dead lines stop billing after the gradient update

The line in our analyzer that proves it

The Messages API is stateless. On every turn, the Claude Code harness rebuilds the entire prompt from scratch and prepends the system prompt. Your CLAUDE.md sits inside that prefix. There is no flag for “only on turn 1” and no decay schedule. Our analyzer encodes this directly:

src/lib/analyzer.ts

If you assumed CLAUDE.md was training, you would expect this line to read something like totalTokens / TURNS (amortized over the session) or totalTokens * 0.3 (model only re-reads a fraction). It does not. The full file is the per-turn cost, and that is a property of how the API actually behaves, not a quirk of our implementation.
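
A minimal sketch of that firing model, not the analyzer's actual source. It assumes a rough 4-characters-per-token estimate (consistent with the 6,400 tokens for about 25,600 characters used elsewhere on this page). `estimateTokens`, `analyze`, and the `Analysis` shape are hypothetical names; only the assignment `estimatedTokensFireEveryTurn = totalTokens` comes from the text.

```typescript
// Rough token estimate: about 4 characters per token.
// This is an assumption for illustration, not a real tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface Analysis {
  totalTokens: number;
  estimatedTokensFireEveryTurn: number;
}

// Hypothetical analyzer core: the whole file is charged on every turn.
function analyze(claudeMd: string): Analysis {
  const totalTokens = estimateTokens(claudeMd);
  return {
    totalTokens,
    // No division by turn count and no discount factor: full file, every turn.
    estimatedTokensFireEveryTurn: totalTokens,
  };
}

const sample = "x".repeat(25_600); // stand-in for a 25,600-character CLAUDE.md
const result = analyze(sample);
console.log(result.totalTokens);                  // 6400
console.log(result.estimatedTokensFireEveryTurn); // 6400
```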

What a single CLAUDE.md rule costs over its lifetime

The unit of waste is not the file, it is the line. Pick any aspirational rule sitting in your CLAUDE.md right now, run the math:

lifetime-cost.ts

Eighty cents per year, per dead rule, per developer. A team of five carrying twenty such rules is paying roughly $80 a year just to send Claude the same advice it ignores. The number alone is small; the asymmetry is what stings. Real training would have learned to skip those tokens during the gradient update. Prompt context cannot.
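
The per-rule math can be sketched as follows. The inputs are assumptions chosen to land near the eighty-cent figure: a rule of about 15 tokens (roughly 60 characters), 30 turns a day, 365 days a year, and the $5 per million Opus 4.7 input rate with no caching. The function and constant names are hypothetical.

```typescript
// Hypothetical per-rule lifetime cost in USD per year, uncached.
function ruleCostPerYear(
  ruleTokens: number,
  turnsPerDay: number,
  inputRatePerMillion: number,
): number {
  const turnsPerYear = turnsPerDay * 365;
  return (ruleTokens * turnsPerYear * inputRatePerMillion) / 1_000_000;
}

const RULE_TOKENS = 15;   // assumed size of one short aspirational rule
const TURNS_PER_DAY = 30; // assumed usage
const INPUT_RATE = 5;     // USD per million input tokens (Opus 4.7 rate from the text)
const DEAD_RULES = 20;    // rules per file, from the text
const DEVELOPERS = 5;     // team size, from the text

const perRulePerDev = ruleCostPerYear(RULE_TOKENS, TURNS_PER_DAY, INPUT_RATE);
const teamTotal = perRulePerDev * DEAD_RULES * DEVELOPERS;

console.log(perRulePerDev.toFixed(2)); // 0.82
console.log(teamTotal.toFixed(0));     // 82
```

A longer rule scales the result linearly, so a 60-token paragraph of advice costs four times as much under the same assumptions.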

Training vs. CLAUDE.md, side by side

If you came here looking for “training cost” numbers, this is the comparison you actually needed. The shapes of the two things are different almost everywhere.

| Feature | Actual model training (fine-tune) | CLAUDE.md (prompt context) |
| --- | --- | --- |
| Cost model | One-time. Amortized over future inference. | Recurring. Paid every turn. |
| Where it lives | Model weights, baked in at training time. | Prompt context, sent on every API call. |
| What it changes | The model's parameters. It actually learns. | Nothing about the model. It is just text the model reads. |
| Dead lines | Pruned at training time; never billed at inference. | Bill forever even when ignored. |
| Cost of a 6,400-token file | Not applicable. Fine-tuning is per-million-training-tokens, one charge. | $0.032/turn, $0.96/session, $350/year (Opus 4.7, no cache). |
| How to lower it | Re-train. Different infrastructure entirely. | Delete lines. Move them into on-demand skills. |

What Anthropic actually calls it

The official term in the Claude Code docs is memory, not training. The “How Claude remembers your project” page describes CLAUDE.md as a Markdown file that is “loaded and injected into the prompt context” at session start and re-read after every /compact. No part of that pipeline touches model weights, no part of it is billed as a training charge, and Anthropic's own wording avoids the word “training” entirely. The mental shortcut that turned “memory” into “training” is where the cost surprise comes from.

Translation table for the rest of this page: memory file = the bytes on disk; prompt context = where those bytes land on every turn; input tokens = what you are billed for. The word training belongs to a different pipeline you are not using.

Same file, three model tiers, two cache states

People ask “what does training cost” because they want a number. The honest answer is a 2D table: the rate depends on which model you run and whether the prefix hit the prompt cache. A 6,400-token CLAUDE.md (about 25,600 characters), one turn at a time, billed at the rates Anthropic published for 2026:

| Model | Per turn, uncached | Per turn, cache read | Cache write (turn 1) | 30-turn session, uncached |
| --- | --- | --- | --- | --- |
| Opus 4.7 ($5 / $25 per M) | $0.0320 | $0.0032 | $0.0400 | $0.96 |
| Sonnet 4.6 ($3 / $15 per M) | $0.0192 | $0.0019 | $0.0240 | $0.58 |
| Haiku 4.5 ($1 / $5 per M) | $0.0064 | $0.0006 | $0.0080 | $0.19 |

Cache reads bill at 10% of base, cache writes at 1.25x base on the 5-minute TTL. Two consequences. (1) The cheapest cell is Haiku, cache read at $0.0006 per turn; the most expensive is Opus, cache write at $0.04 per turn. (2) A real session is a mix: turn 1 pays the cache-write premium, turns 2+ pay the cache-read rate, and any prefix mutation (an ISO date in the first 20 lines, a rotating sprint string) drops you back to the uncached column.
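
To see how the mix plays out, here is a minimal sketch that prices a 30-turn session in the best case (one cache write, then 29 cache reads) and compares it with the uncached column. It assumes the multipliers stated above (1.25x for writes, 0.1x for reads) and that the cache never expires or busts mid-session. All function and constant names are hypothetical.

```typescript
const CACHE_WRITE_MULTIPLIER = 1.25; // 5-minute TTL write premium, from the text
const CACHE_READ_MULTIPLIER = 0.1;   // cache reads bill at 10% of base

// Input rates in USD per million tokens, as listed in the table above.
const MODELS: Array<[string, number]> = [
  ["Opus 4.7", 5],
  ["Sonnet 4.6", 3],
  ["Haiku 4.5", 1],
];

function perTurnBase(tokens: number, ratePerMillion: number): number {
  return (tokens * ratePerMillion) / 1_000_000;
}

function uncachedSessionCost(tokens: number, turns: number, ratePerMillion: number): number {
  return perTurnBase(tokens, ratePerMillion) * turns;
}

// Best case: turn 1 writes the cache, every later turn reads it.
function cachedSessionCost(tokens: number, turns: number, ratePerMillion: number): number {
  if (turns <= 0) return 0;
  const base = perTurnBase(tokens, ratePerMillion);
  return base * CACHE_WRITE_MULTIPLIER + base * CACHE_READ_MULTIPLIER * (turns - 1);
}

for (const [model, rate] of MODELS) {
  const cached = cachedSessionCost(6_400, 30, rate);
  const uncached = uncachedSessionCost(6_400, 30, rate);
  console.log(`${model}: cached $${cached.toFixed(4)}, uncached $${uncached.toFixed(2)}`);
}
// First line printed: Opus 4.7: cached $0.1328, uncached $0.96
```

Under these assumptions a fully cached Opus session costs about $0.13 against $0.96 uncached. A single prefix mutation mid-session moves the real bill back toward the uncached figure.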

Notice what is not in this table: a column called “over time”. There is no decay. The per-turn cost on day 365 is the per-turn cost on day 1. Compare that to fine-tuning, where the training charge is paid once and the per-token inference rate stays flat regardless of how long the model lives.

See your own file

Paste your CLAUDE.md into the textarea at ccmd.dev. The analyzer runs in your browser with no upload. It returns the per-turn token cost (always equal to the total) plus a per-line findings list with token-savings estimates. Output looks like this:

ccmd analyzer output

The two numbers that matter are tokensFireEveryTurn and the count of aspirational findings. The first is your recurring tax. The second is the share of that tax you are paying for nothing.

What to do once you accept it is not training

Three moves, in order of impact:

  1. Delete aspirational lines. Anything that says “always X”, “be careful with Y”, or “write clean code” without a concrete failure mode is paying rent for nothing. If deleting the line would not change Claude's behavior on the next ten tasks, the line was already dead.
  2. Move stack runbooks into skills. A skill loads on demand (when the user types a matching command or names the tool), so it costs zero tokens on every turn that does not need it. The analyzer points at installable skills from skillhu.bz for each block you cut, so the behavior survives the diet.
  3. Replace prose with hooks. Where a rule is really a guardrail (never run destructive migrations without confirmation), a hook in settings.json enforces it deterministically at zero token cost, and the model cannot ignore it. Prose rules can be ignored; hooks cannot.

After those three passes, run the analyzer once more. A typical team-shared CLAUDE.md cuts 40 to 60 percent of its tokens without losing a single behavior, because most of what was in there was aspirational and the rest was better off as a skill or a hook.

Want a walk-through on your own CLAUDE.md?

Bring your file. We will go through the per-turn breakdown live and find the lines you can delete without losing behavior.

Frequently asked questions


Is CLAUDE.md actually training Claude Code?

No. CLAUDE.md never enters model weights and is not part of any training pipeline. It is a Markdown file the Claude Code harness reads at session start and pastes into the system prompt of every API call to the Messages API. The model has no persistent memory of your file between turns; it only sees the copy that arrives inside the current prompt. That copy is re-sent on every turn, which is why our analyzer at src/lib/analyzer.ts line 264 sets estimatedTokensFireEveryTurn = totalTokens. The word 'training' is a useful shorthand for 'I am teaching Claude how to behave on my repo', but billing-wise it is the exact opposite of training: training is one charge then free, CLAUDE.md is no charge up front then billed forever.

What does my CLAUDE.md cost over a year, then?

Multiply your file's token count by the number of turns you will run, then by the input rate, divided by one million. Take a 6,400-token file (about 25,600 characters), a developer running roughly 30 turns per day, and Opus 4.7 input at $5 per million tokens. That is 6,400 x 30 x $5 / 1,000,000 = $0.96 per day in just CLAUDE.md prefix tokens, before any of your actual work. Run that 365 days and you have spent about $350 per year purely re-sending the same file. Prompt caching reduces the cached portion to 10% of the base rate, dropping the same scenario to roughly $35 per year, but caching only works if the prefix stays byte-identical between requests within the cache window.

Then why does everyone call it 'training' anyway?

Because the user experience feels like training. You write rules, the agent's behavior changes, you keep adding rules. The pattern looks like fine-tuning. But the mechanism is prompt engineering, not parameter updates. The practical consequence: rules in CLAUDE.md do not get cheaper over time the way fine-tuned weights do. If a rule is dead weight (Claude ignores it on most turns), training would have learned to ignore it during the gradient update; prompt context bills you for it on every single turn whether the model uses it or not. That is why analyzing the file matters more than analyzing your token usage after the fact: ccusage tells you what you spent, but only an analyzer tells you which specific lines spent it for no behavioral return.

How is this different from actual Claude fine-tuning?

Fine-tuning Claude (when offered) is a one-time training charge on the training dataset, then per-token inference at a different rate against the resulting custom model. The training cost is amortized over every future call. CLAUDE.md has no training charge; the cost is loaded entirely onto inference, charged on every turn forever. The two also fail differently. A bad fine-tune behaves badly on every output until you re-train. A bad CLAUDE.md silently bills you for tokens that produce no behavior change, because Claude is allowed to ignore prompt instructions when they conflict with newer ones or with the current task. You cannot 'retrain' your way out of bloat. You can only delete or move the lines.

Does prompt caching make this comparable to training cost-wise?

Closer, but no. Anthropic's prompt cache bills cache reads at 10% of the base input rate, so a 6,400-token CLAUDE.md drops from $0.032 per turn to $0.0032 per turn when the cache hits. That makes the prefix feel almost free, which is why a lot of teams stop worrying about CLAUDE.md size. The catch: the cached prefix has to be byte-identical between requests within the cache window (5 minutes). One ISO date in the first 20 lines (today's date, last-modified, current-sprint string) mutates the prefix every session and forces the full uncached rate. The analyzer flags that pattern as a cache_bust finding. Even with perfect caching, you still pay the cache-write turn at 1.25x the base rate, and you still pay every turn for lines the model ignores.
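
A minimal sketch of how such a check could work, assuming the rule is simply “an ISO-8601 date (YYYY-MM-DD) anywhere in the first 20 lines”. This is not the analyzer's actual implementation: the function name, the finding shape, and the regular expression are hypothetical, and only the `cache_bust` label and the 20-line window come from the text.

```typescript
interface CacheBustFinding {
  kind: "cache_bust";
  line: number;  // 1-based line number within the file
  match: string; // the date-like text that was found
}

// No global flag, so exec() keeps no state between calls.
const ISO_DATE = /\b\d{4}-\d{2}-\d{2}\b/;
const PREFIX_WINDOW = 20; // only the first 20 lines are inspected

function findCacheBusters(claudeMd: string): CacheBustFinding[] {
  const findings: CacheBustFinding[] = [];
  const lines = claudeMd.split("\n").slice(0, PREFIX_WINDOW);
  lines.forEach((text, index) => {
    const m = ISO_DATE.exec(text);
    if (m) {
      findings.push({ kind: "cache_bust", line: index + 1, match: m[0] });
    }
  });
  return findings;
}

const example = [
  "# Project rules",
  "Last updated: 2026-05-19",
  "Use pnpm, not npm.",
].join("\n");

console.log(findCacheBusters(example));
// One finding: kind "cache_bust", line 2, match "2026-05-19"
```

A date that never changes does not bust the cache by itself, so a check like this only surfaces candidates. The lines that hurt are the ones a script or a human rewrites between sessions.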

What lines are actually paying their way, and which are dead weight?

Heuristic from the analyzer: lines that name a specific file path, a specific command, or a specific failure mode tend to fire. Lines that say 'always X', 'be careful with Y', or 'write clean code' tend to be ignored 80%+ of the time and still bill on every turn. The Karpathy-12 rubric (rule 5: avoid silent assumptions, rule 6: no orthogonal damage, rule 7: tests as truth) is a strong filter; rules outside that rubric usually fail the 'would Claude behave differently if I deleted this' test. Paste your file into the textarea at ccmd.dev and the analyzer returns a per-line findings array with severity and token savings, so you can see exactly which lines are paying rent and which are not.
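
The heuristic can be approximated with a handful of pattern checks. This is a rough sketch under stated assumptions, not the analyzer's rule set: the patterns, the `Verdict` type, and `classifyLine` are hypothetical, and real files would need tuning (the path pattern, for instance, also matches phrases like "and/or").

```typescript
type Verdict = "concrete" | "aspirational" | "unknown";

// Hypothetical signals for vague advice, taken from the examples in the text.
const ASPIRATIONAL_PATTERNS: RegExp[] = [
  /\balways\b/i,
  /\bbe careful\b/i,
  /\bclean code\b/i,
  /\bbest practices\b/i,
];

// Hypothetical signals for a concrete anchor: a command or a file path.
const CONCRETE_PATTERNS: RegExp[] = [
  /`[^`]+`/,                   // inline code, e.g. a command or identifier
  /(?:^|\s)[\w.-]*\/[\w./-]+/, // something path-like that contains a slash
];

function classifyLine(line: string): Verdict {
  // Concrete wins: "Always run `pnpm lint`" still names a specific command.
  if (CONCRETE_PATTERNS.some((p) => p.test(line))) return "concrete";
  if (ASPIRATIONAL_PATTERNS.some((p) => p.test(line))) return "aspirational";
  return "unknown";
}

console.log(classifyLine("Always write clean code."));                 // aspirational
console.log(classifyLine("Run `pnpm test` before committing."));       // concrete
console.log(classifyLine("Migrations live in db/migrations/ only."));  // concrete
console.log(classifyLine("We use TypeScript."));                       // unknown
```

Lines that come back "aspirational" are candidates for the deletion test described above, not automatic deletions.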

If there's no training, why does Claude seem to 'learn' my project?

Because every turn it reads CLAUDE.md again, plus whatever files it has open, plus the conversation so far. The 'learning' is a reading exercise, not a parameter update. Two consequences worth knowing. First, a fresh session starts from zero understanding of your project until it reads CLAUDE.md (so the file's value is highest right after /clear). Second, the model is not getting smarter about your repo over time, it is just re-reading the same instructions; if you want compounding behavior, move durable rules into installable skills that load on demand rather than into CLAUDE.md that fires on every turn.

How do I cut my CLAUDE.md cost without losing behavior?

Three patterns work. (1) Delete aspirational lines ('write clean code', 'follow best practices') that fire every turn but produce no measurable behavior change. (2) Move stack-specific runbooks ('how we run migrations', 'how we deploy') into skills that load when the user types the matching command, so they cost zero tokens on every turn that does not need them. (3) Replace prose with hooks and MCP allowlists in settings.json where you can; a deterministic hook costs zero tokens and cannot be ignored. The ccmd analyzer recommends installable skills from skillhu.bz for each prose block you cut, so the behavior survives the diet.

Related reading on this site: “CLAUDE.md per-turn token cost lookup table”; “Why Claude Code cost is set by context, not the model”; “Which CLAUDE.md rules fire every turn”.
