
A Claude Code config audit, the 12-rule scorecard your file actually gets graded on.

Matthew Diakonov, Written with AI

Published May 18, 2026 · 6 min read

Every other guide on this topic ends with the same paragraph. Keep your file short. Be specific. Remove instructions that no longer apply. That is all true and none of it answers the actual question: is my file done? An audit answers that. A rubric makes it answerable.

This page walks through both layers of the audit ccmd.dev runs: the 12-rule scorecard (the rubric) and the seven finding kinds (the per-line layer). The whole audit completes in roughly 220 ms in the browser. Both layers are pure functions in src/lib/analyzer.ts, which is under 300 lines; you can read every threshold and regex yourself.

1. The file under audit

A realistic CLAUDE.md from a Next.js payments repo. 196 tokens, 20 lines. Looks fine on first read. Scores 4 of 12 on the rubric and ships seven flagged lines.

CLAUDE.md

2. The audit output

Same file, run through analyzeConfig(). The hosted UI on ccmd.dev renders the same data as a richer panel; the underlying shape is this.

ccmd · audit report

Two layers, stacked. The rubric block (top) is the verdict: how complete the file is as an agent contract. The flagged-lines block (bottom) is what you fix today. Rule 9 (stack awareness) and rule 10 (tool preference) pass because the file names Next.js, Postgres, and Stripe and includes a Never-use directive. Every failing rule needs a new line. The seven flagged lines are where existing lines hurt more than they help.

3. The scorecard, twelve rules

The scorecard is adapted from Karpathy's published rules, and each rule is a regex test against the full file body. A rule passes when at least one phrase in the file matches its pattern. The rules live in the KARPATHY_12 array in src/lib/analyzer.ts.

R1, R2, R3 — plan, simplify, stay surgical

Three rules about restraint before edits. R1 looks for an explicit plan step (PLAN, OUTLINE, DRAFT). R2 looks for a bias toward the smallest change (simplicity, YAGNI, KISS). R3 looks for surgical-change language (only touch what is needed, do not refactor unrelated). Most files we audit skip all three; the agent reaches for big edits by default.

R4, R5 — name the goal, no silent guesses

R4 asks for a completion condition: goal, definition of done, when this is done. R5 asks for an ask-instead-of-guess instruction: confirm before, surface assumption, do not guess. The two together are how the agent knows when to stop and when to interrupt.

R6, R7 — no orthogonal damage, tests as truth

R6 is a stricter version of R3 phrased as a prohibition: stay in scope, no drive-by, no orthogonal edits. R7 is the verification gate: run tests, tests must pass, before commit, before done. R7 is the failure most teams notice first; it correlates with the agent declaring success without running anything.

R8 — concise output

Looks for a bias to short answers: concise, short answer, no recap, brief, terse, minimal output. Without R8 the model defaults to long-form prose summaries after every action, which costs output tokens and breaks flow. One line in the file fixes it.

R9, R10 — stack awareness, tool preference

R9 looks for a named language and framework (typescript, python, rust, go, next.js, react, django, fastapi, bun, npm, pnpm, etc.). R10 looks for explicit tool preference (prefer X over Y, avoid using, never use). These are the rules most files pass; almost every file lists a stack.

R11 — failure-mode coverage (the rule almost no file passes)

Regex: past failure | previous mistake | do not again | we got burned | never again | incident. The rule asks: does the file remember the last time the agent was wrong? Most files read like a setup guide. The strongest CLAUDE.md files read like a postmortem record with each rule annotated by the incident behind it. Pass R11 by adding one Why: line under each prohibition.

R12 — self-improvement loop

Looks for instructions to log new learnings back into the file: add to this file, update CLAUDE.md, log it here, self-improve. R12 is what makes the file compound. The single rule that turns a static doc into something the agent maintains for you.
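A minimal sketch of how a rubric like this can be scored, assuming each rule is nothing more than an id, a name, and one case-insensitive regex. The RubricRule shape and the names SAMPLE_RULES, scoreRubric, and countPasses are illustrative, not the contents of KARPATHY_12; the two patterns are built from the phrases quoted above for R8 and R11, and the real patterns may differ.

```typescript
// Hypothetical shape for one rubric rule; the real KARPATHY_12 entries may differ.
interface RubricRule {
  id: string;
  name: string;
  pattern: RegExp; // tested against the full file body
}

interface RuleResult {
  id: string;
  name: string;
  passed: boolean;
}

// Two illustrative rules, built from the phrases this section quotes.
const SAMPLE_RULES: RubricRule[] = [
  {
    id: "R8",
    name: "concise output",
    pattern: /concise|short answer|no recap|brief|terse|minimal output/i,
  },
  {
    id: "R11",
    name: "failure-mode coverage",
    pattern: /past failure|previous mistake|do not again|we got burned|never again|incident/i,
  },
];

// A rule passes when at least one phrase in the file matches its pattern.
function scoreRubric(body: string, rules: RubricRule[]): RuleResult[] {
  return rules.map((rule) => ({
    id: rule.id,
    name: rule.name,
    passed: rule.pattern.test(body),
  }));
}

// The scorecard number is simply the count of passing rules.
function countPasses(results: RuleResult[]): number {
  return results.filter((r) => r.passed).length;
}

const sampleBody = ["# Project rules", "Keep answers concise.", "Never use any."].join("\n");
const sampleResults = scoreRubric(sampleBody, SAMPLE_RULES);
```

With this sample body, R8 passes on "concise" and R11 fails, because nothing in the file names a past incident.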

4. The per-line layer, seven finding kinds

The rubric tells you what is missing. The per-line layer tells you which existing lines are hurting. Both layers run in the same pass.

finding kinds (analyzer.ts)

cache_bust — high severity. ISO date or session phrase in the first 20 lines. Voids the prompt cache on every new session and re-bills the full file at full input cost. Largest dollar impact per fix. (analyzer.ts:194)
bloat — medium severity. One non-blank line over 28 words. The second half gets ignored by the model. Split into 2-3 directives; ~35% token saving. (analyzer.ts:150)
missing_why — medium severity. A DO NOT, NEVER, or don't line whose next 4 lines do not contain because, why, reason, past, got burned, incident, happened, or caused. The agent follows the rule until an edge case, then guesses. (analyzer.ts:227)
duplicate — medium severity. Same lowercased non-trivial line appears twice in the file. Catches the multi-author concatenation pattern. Saving is the full line. (analyzer.ts:207)
vague — low severity. One of 14 banned soft words, including appropriate, good, best, proper, carefully, thoughtfully, well, nicely, cleanly, as needed, where applicable, if relevant, when possible. The agent cannot tell when it succeeded. (analyzer.ts:124)
aspirational — low severity. Absolute word (always, never, must, should always, in all cases, every time) without an escape clause (unless, except, but if, when X then). Real codebases have exceptions. (analyzer.ts:130)
conflict — high severity. Contradictory absolutes inside the same file (e.g. never use comments + add comments). The agent has no resolution path. Rare but loud. (analyzer.ts:244)
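Three of the finding kinds above (missing_why, aspirational, duplicate) can be sketched as one pass over the lines. This is a rough approximation written from the descriptions in this list, not the code in analyzer.ts: the regexes use plain substring or word matching, the escape-clause pattern is a guess at "when X then", and MIN_DUPLICATE_LENGTH is a made-up cutoff for "non-trivial".

```typescript
type Severity = "high" | "medium" | "low";

interface Finding {
  kind: "missing_why" | "aspirational" | "duplicate";
  severity: Severity;
  line: number; // 1-based line number
  text: string;
}

// Patterns assembled from the word lists in this section (assumed matching rules).
const PROHIBITION = /\b(do not|never|don't)\b/i;
const REASON = /because|why|reason|past|got burned|incident|happened|caused/i;
const ABSOLUTE = /\b(always|never|must|in all cases|every time)\b/i;
const ESCAPE = /\b(unless|except|but if)\b|\bwhen\b.+\bthen\b/i;

// ASSUMPTION: "non-trivial" means at least this many characters after trimming.
const MIN_DUPLICATE_LENGTH = 12;

function findPerLine(body: string): Finding[] {
  const lines = body.split("\n");
  const findings: Finding[] = [];
  const firstSeen = new Map<string, number>();

  lines.forEach((raw, i) => {
    const text = raw.trim();
    if (text === "") return;

    // missing_why: a prohibition whose next 4 lines carry no reason word.
    if (PROHIBITION.test(text)) {
      const following = lines.slice(i + 1, i + 5).join("\n");
      if (!REASON.test(following)) {
        findings.push({ kind: "missing_why", severity: "medium", line: i + 1, text });
      }
    }

    // aspirational: an absolute word with no escape clause on the same line.
    if (ABSOLUTE.test(text) && !ESCAPE.test(text)) {
      findings.push({ kind: "aspirational", severity: "low", line: i + 1, text });
    }

    // duplicate: the same lowercased line has already appeared.
    const key = text.toLowerCase();
    if (key.length >= MIN_DUPLICATE_LENGTH) {
      if (firstSeen.has(key)) {
        findings.push({ kind: "duplicate", severity: "medium", line: i + 1, text });
      } else {
        firstSeen.set(key, i + 1);
      }
    }
  });

  return findings;
}
```

Note that the missing_why check here only looks at the following four lines, as described above, so a reason written on the same line as the prohibition would not count in this sketch.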

5. Which files the audit accepts

Detection is by content, not filename. The audit reads the first 300 characters and routes the input to the right header. The rubric and finding kinds are the same for all four formats; the only difference is the label in the report.

4 supported formats · 12 rubric rules · 7 finding kinds · ~220 ms audit time (6K tokens)

CLAUDE.md is the default. The detector flips to agents.md if the input contains # agents, to .cursorrules if the input opens with You are an expert, and to .grokrules on a Grok header. The audit applies the same checks across all of them because the prompt cost shape is the same regardless of which CLI ships the file.
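A sketch of content-based detection along the lines described above, assuming the checks run against the first 300 characters in the order shown. The function name detectFormat, the check order, and especially GROK_HEADER_GUESS are assumptions: this page only says "a Grok header", so that pattern is a placeholder rather than the real rule.

```typescript
type ConfigFormat = "CLAUDE.md" | "AGENTS.md" | ".cursorrules" | ".grokrules";

// ASSUMPTION: placeholder for the unspecified "Grok header" check.
const GROK_HEADER_GUESS = /^#+\s*grok/im;

function detectFormat(input: string): ConfigFormat {
  // Only the first 300 characters are inspected.
  const head = input.slice(0, 300);

  // "# agents" anywhere in the inspected head.
  if (/#\s*agents/i.test(head)) return "AGENTS.md";

  // Input opens with the classic Cursor persona line.
  if (/^\s*you are an expert/i.test(head)) return ".cursorrules";

  if (GROK_HEADER_GUESS.test(head)) return ".grokrules";

  // Nothing matched: fall back to the default label.
  return "CLAUDE.md";
}
```

Because the rubric and finding kinds are the same for every format, the return value in a design like this would only change the label on the report.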

6. Where audit advice ends and an audit starts

We read the guides that currently show up for the search this page targets. The advice converges; the deliverables diverge.

Feature	generic 'audit your CLAUDE.md' guides	ccmd audit
Has a fixed rubric	no — prose advice	12-rule scorecard, regex-tested
Flags specific lines	no — qualitative	7 finding kinds, per-line
Catches cache-busting timestamps	not mentioned anywhere we found	cache_bust, high severity, first 20 lines
Runs on the file you already wrote	advice, not a run	paste, browser, ~220 ms
Polyglot	single-format	CLAUDE.md, AGENTS.md, .cursorrules, .grokrules
Names the rule each fix unlocks	no rule layer	fix maps back to R1-R12
Privacy	varies, usually upload	pure client, no network

7. The order to fix in

Findings have different blast radius. Sort by severity, not by line number. For the sample file in section 1 the order is:

cache_bust on L20. Strip "Today is 2026-05-18" or move it to the bottom of the file. One delete recovers the prompt-cache discount for every session that follows.
missing_why on L11 and L12. Append a Why: line under each prohibition. "Never use any. Why: incident in checkout 2026-03, silent runtime error, four hours to trace." Two lines. Naming the incident also unlocks rule 11 (failure-mode coverage), whose regex matches on the word incident.
bloat on L15. Split the 29-word stack paragraph into three 8-word lines: framework, payments boundary, migration rule. The second half of the original was being ignored anyway.
vague + aspirational on L6-8. Replace "handle edge cases appropriately" with "cover the four cases in cases.test.ts". Replace "always think carefully" with a tests-must-pass line (also unlocks R7). Concrete rules retire the absolute-word findings on their own.
Add lines to unlock the missing rules. R1 (one-line plan step). R4 (definition of done). R8 (concise output preference). R12 (instruction to update this file when the agent gets something wrong). Four lines, four rule passes.

After the passes the file scores 11 of 12 on the rubric (R5 still wants an ask-do-not-guess line you can add separately), drops to roughly 140 tokens, and the prompt cache stays warm across sessions. The token saving is real but secondary. The bigger win is that every line in the file is doing work the agent will follow.
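The ordering rule for this section (severity first, line number second) can be sketched as a small comparator. The FlaggedLine shape and fixOrder are illustrative names. The line numbers come from the list above; which low-severity kind sits on which of lines 6 to 8 is a guess made only to fill out the example.

```typescript
type Severity = "high" | "medium" | "low";

interface FlaggedLine {
  kind: string;
  severity: Severity;
  line: number; // 1-based line number in the audited file
}

// Lower rank means fix sooner.
const SEVERITY_RANK: Record<Severity, number> = { high: 0, medium: 1, low: 2 };

// Sort by severity, then by line number; the input array is left untouched.
function fixOrder(findings: FlaggedLine[]): FlaggedLine[] {
  return findings
    .slice()
    .sort(
      (a, b) =>
        SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity] || a.line - b.line
    );
}

// The seven flagged lines of the sample file, listed in file order.
const sampleFindings: FlaggedLine[] = [
  { kind: "vague", severity: "low", line: 6 },
  { kind: "aspirational", severity: "low", line: 7 },
  { kind: "vague", severity: "low", line: 8 },
  { kind: "missing_why", severity: "medium", line: 11 },
  { kind: "missing_why", severity: "medium", line: 12 },
  { kind: "bloat", severity: "medium", line: 15 },
  { kind: "cache_bust", severity: "high", line: 20 },
];

const ordered = fixOrder(sampleFindings);
```

Sorted this way, the sample comes out in the same order as the fix list: line 20 first, then 11, 12, and 15, then 6 through 8.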

Want us to run the audit on your config live?

15 minutes, paste your CLAUDE.md (or AGENTS.md, .cursorrules, .grokrules), walk through the rubric and findings, leave with a fix list. Free.

Frequently asked questions

What is a Claude Code config audit?

A deterministic line-by-line scoring of your agent config file against a fixed rubric. ccmd's analyzer runs two layers: a 12-rule scorecard adapted from Karpathy's published rules (each rule is a regex test against the full file) and seven per-line finding kinds (bloat, vague, aspirational, conflict, duplicate, missing_why, cache_bust). The output is a pass/fail per rule plus a list of flagged lines with a suggested fix. Same rubric runs on CLAUDE.md, AGENTS.md, .cursorrules, and .grokrules; detection is by content, not filename.

Why a rubric and not just advice?

Generic 'audit your CLAUDE.md' posts give qualitative guidance: be specific, remove dead weight, write rules not preferences. None of it tells you whether your file is done. A rubric does: 12 rules, a score from zero to twelve, and you know which lines you still owe. Karpathy reported the 12-rule rubric dropped his agent mistake rate from 41% to 11%. The number is approximate but the shape of the result holds up across teams that have tried it.

How does ccmd's audit differ from ccusage or claude-meter?

ccusage and claude-meter are retrospective. They tell you what last week's session cost and which sessions were expensive. They cannot tell you which line in your file caused it. ccmd's audit is forward-looking and causal: it grades the file you wrote, names the lines, and gives the dollar delta of fixing each one. Pair them; ccusage shows the bill, ccmd shows the cause.

Why is cache_bust the highest-severity finding?

Anthropic and xAI prompt caches only hit when the cached prefix is byte-identical to a previous request. Put 'Today is 2026-05-18' or a session-specific string near the top of CLAUDE.md and every new session writes a different string at that line. The cached prefix never matches and every turn in the new session re-bills the full file at full input cost. On a 6,000-token file at Opus 4.7 input rates that is the difference between a $0.09 session and a $0.90 session. One line. ccmd's analyzer is the only audit we know of that flags this specifically.
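A back-of-the-envelope sketch of the check and the arithmetic, under stated assumptions. findCacheBustLines only looks for an ISO date in the first 20 lines and ignores session phrases, which this page does not spell out. sessionInputCost ignores the cache-write premium on the first turn and treats a cache read as a fixed fraction of the normal input price. The price and turn count in the example are placeholders chosen to land on the figures quoted above; they are not a quoted rate for any model.

```typescript
const ISO_DATE = /\b\d{4}-\d{2}-\d{2}\b/;
const CACHE_BUST_WINDOW = 20; // only the first 20 lines are checked

// Returns the 1-based numbers of lines near the top that contain an ISO date.
function findCacheBustLines(body: string): number[] {
  const hits: number[] = [];
  body
    .split("\n")
    .slice(0, CACHE_BUST_WINDOW)
    .forEach((line, i) => {
      if (ISO_DATE.test(line)) hits.push(i + 1);
    });
  return hits;
}

interface SessionCost {
  uncached: number; // every turn re-bills the file at full input price
  cached: number; // every turn reads the file from the prompt cache
}

// Cost in dollars of sending the config file on every turn of one session.
function sessionInputCost(
  fileTokens: number,
  turns: number,
  pricePerMillionTokens: number, // ASSUMPTION: caller supplies the real rate
  cacheReadMultiplier: number // fraction of the input price charged on a cache hit
): SessionCost {
  const uncached = (fileTokens * turns * pricePerMillionTokens) / 1e6;
  return { uncached, cached: uncached * cacheReadMultiplier };
}

// Placeholder inputs: 6,000 tokens, 10 turns, $15 per million tokens, 0.1x cache reads.
const exampleCost = sessionInputCost(6000, 10, 15, 0.1);
```

With those placeholder inputs the uncached session comes to about $0.90 and the cached one to about $0.09. The tenfold gap comes from the cache-read multiplier alone, so it holds at any input price.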

What surfaces does the audit cover beyond CLAUDE.md?

The same analyzer scores AGENTS.md (Codex), .cursorrules (Cursor), and .grokrules (xAI Grok Build) using content-based detection. Skills, hooks, and MCP allowlists are separate audit surfaces; ccmd's per-skill bloat scoring is covered in the skills-bloat guide. Settings.json hooks and MCP cross-checking are on the roadmap for the same paste-and-score flow.

What threshold does the bloat check use, and why that number?

28 words on a single non-blank trimmed line, defined in src/lib/analyzer.ts at line 150. The threshold is empirical. Lines over roughly 25 words consistently get treated as a single signal by the model and the second half tends to be ignored. Split a 34-word stack paragraph into three 8-word rules and the agent follows all three. The estimated token saving on a bloat fix is 35% of the line's tokens; you keep most of the words but produce rules the agent actually executes.
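A minimal sketch of a bloat check with this threshold, assuming words are split on whitespace. The 28-word limit and the 35% saving ratio come from this page; the names findBloat, countWords, and estimateTokens are illustrative, and the four-characters-per-token estimate is a common rough heuristic, not how ccmd counts tokens.

```typescript
const BLOAT_WORD_LIMIT = 28; // a line is flagged only when it has more words than this
const BLOAT_SAVING_RATIO = 0.35; // estimated share of the line's tokens saved by a split

interface BloatFinding {
  line: number; // 1-based line number
  words: number;
  estimatedTokensSaved: number;
}

// Counts whitespace-separated words on a trimmed line; blank lines count as zero.
function countWords(line: string): number {
  const trimmed = line.trim();
  return trimmed === "" ? 0 : trimmed.split(/\s+/).length;
}

// ASSUMPTION: roughly four characters per token; a real tokenizer will differ.
function estimateTokens(line: string): number {
  return Math.ceil(line.trim().length / 4);
}

function findBloat(body: string): BloatFinding[] {
  const findings: BloatFinding[] = [];
  body.split("\n").forEach((raw, i) => {
    const words = countWords(raw);
    if (words > BLOAT_WORD_LIMIT) {
      findings.push({
        line: i + 1,
        words,
        estimatedTokensSaved: Math.round(estimateTokens(raw) * BLOAT_SAVING_RATIO),
      });
    }
  });
  return findings;
}
```

A line with exactly 28 words passes in this sketch; the 29th word is what trips the check.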

How long does the audit take and does anything get uploaded?

The analyzer runs in roughly 220 ms on a 6,000-token file, entirely in your browser. It is pure client-side TypeScript in src/lib/analyzer.ts. Open DevTools, watch the network tab, paste a file. No POST. No signup. No backend. Same is true for any of the four supported formats.

Which rule do real files fail most often?

Rule 11, failure-mode coverage. The test regex looks for 'past failure | previous mistake | do not again | we got burned | never again | incident'. Most CLAUDE.md files we score read like a fresh setup guide rather than a postmortem record. The fix is to append one line under each prohibition that names the incident or constraint behind it. Rule 12, self-improvement loop, is the next most-failed: very few files contain instructions to update the file itself when the agent gets something wrong.