The CLAUDE.md accuracy gain audit: rubricPassed → mistake rate.

Matthew Diakonov
The audit's thesis fits in one sentence: every other piece on CLAUDE.md hygiene measures tokens cut, and none of them measures accuracy bought. This audit measures accuracy bought. It converts the integer your analyzer already returns into a mistake-rate estimate you can show a CTO, and it tells you which one-line rule to add next.

It matters because the payoff is asymmetric. A bloat audit might save you 800 tokens. An accuracy gain audit might cut your mistake rate by 12 percentage points for 250 tokens of added rules. Both edits are cheap. Only one of them changes whether the agent ships the right code.

The input: rubricPassed

The audit's primary input is a single integer the analyzer already exposes. Twelve regex tests run against the raw file string; one boolean per rule; the sum is the score.

src/lib/analyzer.ts

The regex is the spec. A rule passes if any of its alternatives match anywhere in the file. The audit treats those alternatives as the shortest legal way to cover the rule. If you fail rule 7 (Tests as Truth) you can fix it with eight words: "tests must pass before you mark anything done". That line comes in under 20 tokens.
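A minimal sketch of how such a check could work, assuming only the two patterns this page quotes in its FAQ (rules 5 and 7). The `passed` field comes from the article; `id`, `name`, `SketchRule`, `SKETCH_RULES`, and `runRubric` are hypothetical names for illustration and may not match the real analyzer.

```typescript
// Sketch only: two of the twelve checks, using the patterns quoted in the FAQ.
// `SketchRule`, `SKETCH_RULES`, and `runRubric` are hypothetical names.
interface RubricCheck {
  id: number;
  name: string;
  passed: boolean;
}

interface SketchRule {
  id: number;
  name: string;
  test: (raw: string) => boolean;
}

const SKETCH_RULES: SketchRule[] = [
  {
    id: 5,
    name: "Avoid Silent Assumptions",
    test: (raw) => /ask (if|when|the)|do not guess|confirm before|surface assumption/i.test(raw),
  },
  {
    id: 7,
    name: "Tests as Truth",
    test: (raw) => /run tests|tests must pass|before commit|before done|verify with/i.test(raw),
  },
];

// One boolean per rule, computed against the raw file string.
function runRubric(raw: string): RubricCheck[] {
  return SKETCH_RULES.map((rule) => ({
    id: rule.id,
    name: rule.name,
    passed: rule.test(raw),
  }));
}

const sample = "# Project rules\n- Tests must pass before you mark anything done.\n";
const rubricPassed = runRubric(sample).filter((r) => r.passed).length;
console.log(rubricPassed); // 1: rule 7 matches, rule 5 does not
```

Because each pattern is a plain alternation with the `i` flag, any one alternative appearing anywhere in the file is enough to flip that rule to passed.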

The math: from score to mistake rate

The published Karpathy-12 extension reported a drop in mistake rate from 41% to 11% across a controlled task set. The audit treats that 30-point range as the band your file lives inside, with your rubricPassed deciding where.

accuracy-gain-audit.ts

The headline number is the second-to-last line of the snippet: a file at 4/12 sits at an expected 31% mistake rate, with 20 percentage points of accuracy still on the table. The audit's job for that file is to identify the next rule that buys the most of that 20 points per added token.
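The mapping can be sketched in a few lines, assuming the linear formula this page gives in step 2 (41% − 30% × rubricPassed / 12). This is an illustration, not the original snippet; the function and constant names are hypothetical.

```typescript
// Sketch only: linear interpolation between the two reported endpoints.
// All names here are hypothetical; the 41 and 11 come from the article.
const BASELINE_PCT = 41; // reported mistake rate at 0/12
const FLOOR_PCT = 11; // reported mistake rate at 12/12
const TOTAL_RULES = 12;

function expectedMistakeRatePct(rubricPassed: number): number {
  const isWholeNumber = Math.floor(rubricPassed) === rubricPassed;
  if (!isWholeNumber || rubricPassed < 0 || rubricPassed > TOTAL_RULES) {
    throw new RangeError("rubricPassed must be an integer from 0 to 12");
  }
  // Multiply before dividing so the common cases stay exact in floating point.
  return BASELINE_PCT - ((BASELINE_PCT - FLOOR_PCT) * rubricPassed) / TOTAL_RULES;
}

// Percentage points still available before the file reaches the 12/12 floor.
function headroomPct(rubricPassed: number): number {
  return expectedMistakeRatePct(rubricPassed) - FLOOR_PCT;
}

console.log(expectedMistakeRatePct(4)); // 31
console.log(headroomPct(4)); // 20
```

The result is a planning estimate on the published band, so it should be read as a position on that band rather than a measurement of your own codebase.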

add-when-wrong loop

Anytime we see Claude do something incorrectly we add it to the CLAUDE.md.

Boris Cherny, Claude Code lead engineer

How the audit differs from the bloat findings

The same analyzer ships two scores. They point in opposite directions and answer different questions. Run the bloat findings first, then the accuracy gain audit. Treat them as independent budgets:

Feature | Bloat findings (Finding[]) | Accuracy gain audit (rubricPassed)
What it measures | tokens you can cut (Finding[]) | accuracy you bought (rubricPassed / 12)
Direction of the score | lower is better (fewer findings) | higher is better (more rules passed)
What a 0 means | clean file, no deadweight | agent at 41% mistake rate, no accuracy bought
What 12 means | n/a (no upper bound) | agent at 11% mistake rate, full rubric coverage
Anchor variable in source | potentialSavingsTokens (line 271) | rubricPassed (line 261)
Unit of next action | delete a line | add a one-line rule that matches a specific regex
Caveat | savings are an estimate at 35% of bloat lines | 2.5pp per rule is the average; rule 5 and rule 7 carry more weight

The reason both exist is that a clean file can still be a low-coverage file. A 1,500-token CLAUDE.md with zero findings and rubricPassed of 2 is a small, cheap, low-accuracy config. The bloat audit gives it a gold star. The accuracy gain audit tells you it's costing you roughly 25 percentage points of avoidable mistake rate.

The 5-step audit loop

The audit is small enough to run between commits. The whole loop is paste, read one number, edit one line, paste again. Five steps:

1. Read your rubricPassed

Paste your file into ccmd.dev. The analyzer returns an integer from 0 to 12. That integer is the input to the audit.

The analyzer at src/lib/analyzer.ts line 261 computes rubric.filter((r) => r.passed).length. No model call, no upload, single-digit milliseconds.

2. Map it onto the 30-point range

Expected mistake rate = 41% − 30% × (rubricPassed / 12). At 4/12 you sit at 31%. At 8/12 you sit at 21%.

This is the headline number for the audit. It is the planning estimate, not a guarantee. Real-world drops vary by codebase but the direction and the order of operations are stable.

3. List your failed rules

rubric[] is an array of 12 RubricCheck objects with passed: boolean. The audit is interested in the ones where passed is false.

A failed check almost always means the regex did not match. Open the test function on the rule (analyzer.ts lines 49 to 122), read the regex, and you have a one-line spec for the rule you need to add.

4. Rank missing rules by gain-per-token

Rule 5 (Avoid Silent Assumptions) and rule 7 (Tests as Truth) consistently carry more than 2.5pp. Add those first if missing.

A passing line for rule 5 is around nine words, under 20 tokens. The expected gain is above the 2.5pp average. That is the highest-leverage edit you can make to your file today.

5. Re-run after each edit

Paste the new file. Watch rubricPassed move. If a rule you intended to cover did not pass, the regex did not match; tweak the wording.

The audit is a tight loop: edit, paste, read the new pass count, decide if the added tokens were worth it. Token cost moves in one direction (up) per edit; the audit tells you whether accuracy moved enough to justify it.
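Steps 3 and 5 of the loop can be sketched as follows, assuming you keep the rubric array from the run before and after an edit. The `passed` field is from the article; `id`, `failedRules`, `editReport`, and the caller-supplied `addedTokens` are hypothetical, and the gain uses the 2.5pp planning average rather than any per-rule weight.

```typescript
// Sketch only. `passed` comes from the article; `id`, `failedRules`,
// `editReport`, and the caller-supplied `addedTokens` are assumptions.
interface RubricCheck {
  id: number;
  passed: boolean;
}

const AVG_GAIN_PP_PER_RULE = 2.5; // (41 - 11) / 12, the planning average

// Step 3: the rules the audit cares about are the ones that failed.
function failedRules(rubric: RubricCheck[]): number[] {
  return rubric.filter((r) => !r.passed).map((r) => r.id);
}

// Step 5: compare two runs and price the edit.
// Rules that regressed from passed to failed are not counted here.
function editReport(before: RubricCheck[], after: RubricCheck[], addedTokens: number) {
  const failedBefore = failedRules(before);
  const newlyPassed = after
    .filter((r) => r.passed && failedBefore.indexOf(r.id) !== -1)
    .map((r) => r.id);
  const estimatedGainPp = AVG_GAIN_PP_PER_RULE * newlyPassed.length;
  const gainPerToken = addedTokens > 0 ? estimatedGainPp / addedTokens : 0;
  return { newlyPassed, estimatedGainPp, gainPerToken };
}

const beforeRun: RubricCheck[] = [{ id: 5, passed: false }, { id: 7, passed: false }];
const afterRun: RubricCheck[] = [{ id: 5, passed: false }, { id: 7, passed: true }];
console.log(failedRules(beforeRun)); // [ 5, 7 ]
console.log(editReport(beforeRun, afterRun, 20));
// { newlyPassed: [ 7 ], estimatedGainPp: 2.5, gainPerToken: 0.125 }
```

If `newlyPassed` comes back empty after an edit meant to cover a rule, that is the signal described in step 5: the wording did not match the regex.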

Why rule 5 and rule 7 carry extra weight

The 2.5pp average is a planning number. The distribution behind it isn't flat. Two rules consistently produce larger-than-average drops in the reproductions and are worth adding first if you fail them.

Rule 5 (Avoid Silent Assumptions) is the one Karpathy himself called the most expensive failure mode. An agent that doesn't know it should ask will guess an API shape, a column name, or a config flag, and ship a confident-looking lie. Every coverage layer downstream (tests, code review, observability) is rate-limited by how often this fires. Adding the rule cuts the upstream cause; everything below it gets cheaper.

Rule 7 (Tests as Truth) is the closing rule. Without it the agent claims done before running the suite or, worse, modifies the suite to pass. The audit treats it as the floor: if you fail rule 7 the score is unreliable because the agent can be wrong without the rubric ever noticing.

If your audit returns rubricPassed of 4 or less, the recommended order is: rule 7 first, rule 5 second, then the remaining missing rules in any order. If both were missing, that sequence usually moves a file from the 4/12 range (≈31% mistake rate) into the 6/12 range (≈26%) for under 50 added tokens.
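The ordering above can be expressed as a small helper. This is a sketch under the assumption that failed rules are identified by their rule number; `PRIORITY_RULE_IDS` and `recommendedOrder` are hypothetical names, and the remaining rules simply keep their input order since the article leaves that order open.

```typescript
// Sketch only: rule 7 first, rule 5 second, everything else as given.
// `PRIORITY_RULE_IDS` and `recommendedOrder` are hypothetical names.
const PRIORITY_RULE_IDS: number[] = [7, 5];

function recommendedOrder(failedIds: number[]): number[] {
  // Keep only the priority rules that are actually missing, in priority order.
  const first = PRIORITY_RULE_IDS.filter((id) => failedIds.indexOf(id) !== -1);
  // Everything else follows in the order it was passed in.
  const rest = failedIds.filter((id) => PRIORITY_RULE_IDS.indexOf(id) === -1);
  return first.concat(rest);
}

console.log(recommendedOrder([1, 2, 5, 7, 9, 10, 11, 12]));
// [ 7, 5, 1, 2, 9, 10, 11, 12 ]
console.log(recommendedOrder([3, 5, 8]));
// [ 5, 3, 8 ]
```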

What the audit doesn't measure

The audit is a coverage proxy. It doesn't prove your file is good. Three honest caveats:

  • A rule can pass the regex and still be badly worded. A line that says "run tests" passes rule 7. So does "tests must pass before you mark anything done". The second one is a real rule; the first is a vibe.
  • The 41% to 11% baseline is per-task and aggregate. Your codebase sits somewhere on that band but the exact landing point is yours, not the published number.
  • Some accuracy gains live outside the rubric: a hook that runs pre-commit, a Skill that owns a workflow, an MCP allowlist that prevents the wrong tool. The audit doesn't see those. If the rubric is high and you're still shipping the same bug, the next move is one of those three surfaces, not another CLAUDE.md rule.
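The first caveat is easy to demonstrate with the rule 7 pattern quoted in the FAQ on this page. In this sketch the variable names are hypothetical; the pattern and both example lines come from the article.

```typescript
// Sketch only: the rule 7 pattern as quoted in this page's FAQ.
const RULE_7_PATTERN = /run tests|tests must pass|before commit|before done|verify with/i;

const vagueLine = "run tests";
const concreteLine = "tests must pass before you mark anything done";

// Both lines satisfy the regex, so both count toward rubricPassed.
console.log(RULE_7_PATTERN.test(vagueLine)); // true
console.log(RULE_7_PATTERN.test(concreteLine)); // true
```

The pass count cannot tell these two lines apart, which is why the score is a coverage proxy and the wording still needs a human read.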

Frequently asked questions

What is a CLAUDE.md accuracy gain audit?

It is the conversion of ccmd's rubricPassed score (0 to 12) into an expected mistake-rate reduction. The Karpathy-12 community report measured a 41% mistake rate on files at 0/12 and 11% on files at 12/12 across the same task set. The audit takes your file's pass count, maps it onto that 30-point range, and tells you how much accuracy is still on the table. The free analyzer at ccmd.dev returns rubricPassed in the AnalysisResult; this page is the conversion table that turns that integer into a number you can show a CTO.

How much accuracy does each rule actually buy?

On average, 2.5 percentage points of expected mistake-rate reduction per rule passed. That is (41 - 11) / 12. The distribution is not flat, though. Rule 5 (Avoid Silent Assumptions) and rule 7 (Tests as Truth) consistently produce the largest single-rule drops in the published reproductions because they fix the two failure modes that survive every other coverage layer. Treat 2.5pp as the planning number, and treat rules 5 and 7 as the ones to add first if you are missing them.

Why is this not just the Karpathy 12 scorecard?

The scorecard reports a pass/fail per rule. The audit converts the score into an estimated outcome and ranks your missing rules by marginal gain per added token. The scorecard answers 'which of the 12 do I cover?' The audit answers 'how much accuracy is still on the table, and which rule will buy the most accuracy per token I add to my file?' Most people stop at the scorecard. The audit is the next step.

What does the audit cost in tokens?

Roughly 20 to 60 tokens per rule, depending on how tersely you write it. A passing line for rule 5 can be as short as 'ask if any input is unclear; do not guess'. That is nine words, under 20 tokens. Twelve lines at about 20 tokens each come to around 250 tokens. On a 6,000-token CLAUDE.md, that is a 4% size increase for an expected 30-point mistake-rate drop. The audit makes that trade explicit instead of asking you to guess.
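The arithmetic behind that trade, as a short sketch. The figures (250 added tokens, a 6,000-token file, a 41% to 11% band) are the article's; `sizeIncreasePct` and the variable names are hypothetical.

```typescript
// Sketch only: the cost side and the gain side of the full-rubric trade.
function sizeIncreasePct(fileTokens: number, addedTokens: number): number {
  return (addedTokens / fileTokens) * 100;
}

const addedTokens = 250; // twelve terse rules at about 20 tokens each
const fileTokens = 6000;
const expectedDropPp = 41 - 11; // full band, 0/12 to 12/12

console.log(sizeIncreasePct(fileTokens, addedTokens).toFixed(1)); // "4.2"
console.log((expectedDropPp / addedTokens).toFixed(2)); // "0.12"
```

On these planning numbers each added token is associated with roughly 0.12 percentage points of expected mistake-rate reduction, which is the ratio the audit asks you to weigh per edit.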

How does the audit decide whether a rule passed?

Each Karpathy-12 entry in src/lib/analyzer.ts has a regex test function. Rule 5 fires if the file matches /ask (if|when|the)|do not guess|confirm before|surface assumption/i. Rule 7 fires on /run tests|tests must pass|before commit|before done|verify with/i. The full list is at lines 49 to 122. The pass count is the audit's primary signal; if you fail a rule you intended to cover, the fix is usually a small wording change so the regex matches.

Is the audit polyglot?

Yes. The KARPATHY_12 test functions run on the raw file string, so they fire identically against CLAUDE.md, AGENTS.md, .cursorrules, and .grokrules. detectType() at analyzer.ts line 41 labels the file for display but does not gate the rubric. A team running Codex on AGENTS.md and Claude Code on CLAUDE.md can audit both files with the same scorecard and compare like for like.

How does the audit interact with the 7 bloat checks?

They are orthogonal. The findings array (bloat, vague, aspirational, missing_why, duplicate, conflict, cache_bust) is the tokens-cut side. The rubric is the accuracy-bought side. Some teams fail both at once: a 6,000-token file with 18 findings and rubricPassed of 3. The accuracy gain audit is what you run after the bloat audit. Cut the deadweight first, then see how many of the 9 missing rules you can add inside the budget you just freed.

Where does the 41% to 11% number come from?

It is the reported drop from the community Karpathy-12 extension run in April 2026, building on the original four principles Andrej Karpathy posted on 2026-01-26. The number is per-task and aggregate; individual codebases will move differently. Treat it as the planning baseline for which way the audit is pointing, not as a guarantee. The audit's job is to make the score legible, not to promise an exact outcome.
