
The Karpathy rubric, scored: twelve regexes deciding pass or fail.

Matthew Diakonov
7 min read

Every published version of Karpathy's CLAUDE.md rubric does the same thing: lists the rules, gives a failure mode for each, names a source, and stops. The rubric is aspirational on the page where you read about it, and you have no way to find out whether your file passes without re-reading both end to end and guessing.

ccmd turns each of the twelve rules into one regex. Match means pass. Miss means fail. The score is a count. That is the entire scoring mechanism. The rest of this page is the twelve patterns, the count function, what a real run looks like, and the parts the score cannot tell you.

1. The rubric as twelve regex tests

The shape is an array of twelve objects. Each object has an id, a rule name, a description, and a test function that takes the file as a string and returns a boolean. Four of the twelve, byte for byte:

src/lib/analyzer.ts (lines 49-122, excerpt)

A regex suits this kind of pre-flight test mainly because it runs in the browser, in microseconds, with no network call and no LLM judge. The whole analyzer is a 322-line pure file. A rubric pass costs roughly nothing, so you can run it after every edit while you are still typing.
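As a rough illustration of that shape, here is a minimal sketch of four rule objects. The patterns are the ones this page lists for rules 1, 3, 7, and 12; the `RubricRule` type, its field names, the description strings, and the `FOUR_RULES` constant are assumptions made for the example, not the contents of src/lib/analyzer.ts.

```typescript
// Hypothetical shape: field names are inferred from the prose
// (id, rule name, description, test), not copied from analyzer.ts.
interface RubricRule {
  id: number;
  name: string;
  description: string;
  test: (file: string) => boolean;
}

// Patterns come from the rule listing in section 2; descriptions are paraphrased.
const FOUR_RULES: RubricRule[] = [
  {
    id: 1,
    name: "Think Before Coding",
    description: "Mentions planning, outlining, or drafting before editing.",
    test: (file) => /plan|outline|draft|design first|think first/i.test(file),
  },
  {
    id: 3,
    name: "Surgical Changes",
    description: "Asks for minimal, targeted edits.",
    test: (file) =>
      /surgical|minimal change|only touch|don.?t refactor unrelated/i.test(file),
  },
  {
    id: 7,
    name: "Tests as Truth",
    description: "Requires running or passing tests before finishing.",
    test: (file) =>
      /run tests|tests must pass|before commit|before done|verify with/i.test(file),
  },
  {
    id: 12,
    name: "Self-Improvement Loop",
    description: "Tells the agent to record lessons back into the file.",
    test: (file) =>
      /add to this file|update claude\.md|log it here|self.?improve/i.test(file),
  },
];

// A two-sentence sample file: it should satisfy rules 1 and 7 only.
const sample = "Plan the change first. Run tests before commit.";
const results = FOUR_RULES.map((rule) => ({
  id: rule.id,
  passed: rule.test(sample),
}));
```

Each `test` is a pure function of the file text, which is what lets the whole check run on every keystroke.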

2. The twelve regexes, each rule's pass criterion

Each entry below gives the rule name, the literal pattern from src/lib/analyzer.ts, and the human-readable version of what makes the rule pass. Read top to bottom; a rule goes green the first time one of its phrases appears in your file.

  1. Think Before Coding
     Regex: /plan|outline|draft|design first|think first/i
     Match: one of "plan", "outline", "draft", "design first", "think first" anywhere in the file.

  2. Simplicity First
     Regex: /simpl(e|er|icity)|smallest change|no premature|yagni|kiss/i
     Match: "simple", "simpler", "simplicity", "smallest change", "no premature", "yagni", or "kiss".

  3. Surgical Changes
     Regex: /surgical|minimal change|only touch|don.?t refactor unrelated/i
     Match: "surgical", "minimal change", "only touch", or "don't refactor unrelated".

  4. Goal-Driven Execution
     Regex: /goal|completion condition|when (this is )?done|definition of done/i
     Match: "goal", "completion condition", "when done", "when this is done", or "definition of done".

  5. Avoid Silent Assumptions
     Regex: /ask (if|when|the)|do not guess|confirm before|surface assumption/i
     Match: "ask if", "ask when", "ask the", "do not guess", "confirm before", or "surface assumption".

  6. No Orthogonal Damage
     Regex: /don.?t refactor unrelated|no drive.?by|stay in scope|orthogonal/i
     Match: "don't refactor unrelated", "no drive-by", "stay in scope", or "orthogonal".

  7. Tests as Truth
     Regex: /run tests|tests must pass|before commit|before done|verify with/i
     Match: "run tests", "tests must pass", "before commit", "before done", or "verify with".

  8. Concise Output
     Regex: /concise|short answer|no recap|brief|minimal output|terse/i
     Match: "concise", "short answer", "no recap", "brief", "minimal output", or "terse".

  9. Stack Awareness
     Regex: /\b(typescript|python|rust|go|...|uv|poetry)\b/i
     Match: names one of 20 languages, frameworks, runtimes, or package managers as a whole word.

  10. Tool Preference
     Regex: /prefer|use [a-z]+ over|avoid using|never use [a-z]+/i
     Match: "prefer", "use X over", "avoid using", or "never use X".

  11. Failure Mode Coverage
     Regex: /past failure|previous mistake|do not.*again|we got burned|never again|incident/i
     Match: "past failure", "previous mistake", "do not ... again", "we got burned", "never again", or "incident".

  12. Self-Improvement Loop
     Regex: /add to this file|update claude\.md|log it here|self.?improve/i
     Match: "add to this file", "update claude.md", "log it here", or "self improve".

Rule 9 (Stack Awareness) is the only one with a fixed vocabulary list. The others are short alternation patterns covering four to seven common phrasings for the rule. If your CLAUDE.md says "think first, then code" you pass rule 1. If it says "reason through it before editing", you do not, even though the intent is identical. The rubric is a coverage smoke test, not a semantic one.
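The phrasing sensitivity is easy to check directly. The sketch below runs the rule 1 pattern against the two sentences above; the variable names are illustrative. It also includes a third case that follows from the pattern having no word boundaries: any word containing "plan", such as "explanation", satisfies the rule.

```typescript
// Rule 1 pattern as listed on this page.
const thinkBeforeCoding = /plan|outline|draft|design first|think first/i;

// Contains the literal phrase "think first" (case-insensitive): pass.
const passes = thinkBeforeCoding.test("Think first, then code.");

// Same intent, none of the five trigger phrases: fail.
const misses = thinkBeforeCoding.test("Reason through it before editing.");

// No \b anchors in the pattern, so "explanation" matches on its "plan" substring.
const substringHit = thinkBeforeCoding.test("Keep every explanation short.");
```

Here `passes` is `true`, `misses` is `false`, and `substringHit` is `true`, which is another way of seeing that the check measures keyword coverage and nothing more.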

3. The scorer is eight lines

Map the array to a list of pass results, filter for the truthy ones, count the length. That is the entire function:

src/lib/analyzer.ts (lines 254-261)

There is no weighting (rule 11 counts the same as rule 1), no proximity check (a match in the middle of a 9,000-line file counts the same as a match in the first paragraph), and no recurrence requirement (one match counts the same as forty). The score lives in a single integer field on the result object, rubricPassed, and the homepage renders it as {rubricPassed}/12 karpathy in the header chip row.
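A minimal sketch of that map, filter, count shape, under stated assumptions: the `RubricRule` type, the three-rule `RULES` array, and the `countPassed` name are made up for the example and are not the code at src/lib/analyzer.ts lines 254-261. The patterns are the ones listed for rules 1, 7, and 8.

```typescript
// Hypothetical minimal rule shape; only the test function matters for scoring.
interface RubricRule {
  id: number;
  test: (file: string) => boolean;
}

// Three of the twelve patterns, enough to show the counting behaviour.
const RULES: RubricRule[] = [
  { id: 1, test: (f) => /plan|outline|draft|design first|think first/i.test(f) },
  { id: 7, test: (f) => /run tests|tests must pass|before commit|before done|verify with/i.test(f) },
  { id: 8, test: (f) => /concise|short answer|no recap|brief|minimal output|terse/i.test(f) },
];

// Map each rule to a boolean, keep the true ones, count them.
function countPassed(file: string, rules: RubricRule[]): number {
  return rules.map((rule) => rule.test(file)).filter(Boolean).length;
}

// One match versus forty matches of the same phrase.
const once = "Be concise.";
const fortyTimes = Array.from({ length: 40 }, () => "Be concise.").join("\n");

// The same phrase buried after 9,000 filler lines.
const buried = "x\n".repeat(9000) + "Be concise.";

const scoreOnce = countPassed(once, RULES);
const scoreForty = countPassed(fortyTimes, RULES);
const scoreBuried = countPassed(buried, RULES);
```

All three scores come out as 1: repetition and position do not change the count, because each rule contributes a single boolean.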

4. What a 4-of-12 run looks like

One paste of a real CLAUDE.md, the rubric panel expanded. The four passing rules and the eight failing ones, plus the cheapest sequence of edits to bring the score up:

ccmd analyzer · rubric panel

Four added lines, roughly thirty tokens. The rubric moves from 4 of 12 to 8 of 12 without changing anything else in the file. Whether the agent actually plans before coding is a separate question; the rubric is satisfied either way. That is the next section.

5. What the score cannot tell you

A passing rule means the trigger phrase is in your file. It does not mean the surrounding context makes the rule load-bearing for the agent. A CLAUDE.md that opens with "always plan before editing" and then runs for 180 lines of contradictory aspirational rules will still pass rule 1, and the agent will still skip planning. The score is a necessary condition, not a sufficient one.

The rubric also says nothing about token cost. A 12-of-12 file can be 9,000 tokens of duplicate rules and dated lines at the top. The rubric reports 12, while the per-line scan reports cache_bust, duplicate, and bloat findings that make the file three times more expensive per turn than it needs to be. Run both: the rubric for coverage, the per-line scan for waste. They look at the same file through different lenses.

And the rubric is a fixed twelve. The community extension stopped at twelve because Karpathy started with four and the community extended it once. There is no theoretical reason the right rubric for your codebase is exactly these twelve rules. The right one is probably some of these twelve plus a handful of failure-mode rules drawn from your specific incidents. We have a longer take on this at /t/claude-md-accuracy-gain-audit.

6. Using the score honestly

  1. Paste the file. Read which rules failed. For each failure, ask whether the rule actually matters for your project. Some genuinely do not (you may have no past failures to call out yet; rule 11 can stay red without harm).
  2. For each failure that does matter, write the rule the way you would write it for a new hire, not the way the regex wants it. The phrase the regex looks for will probably show up naturally; if it does not, that is fine, the rule still works for the agent even if it stays red.
  3. Re-paste. The rubric should move. The per-line scan should not have grown. If the per-line scan grew (a new bloat line, a duplicate of an earlier rule) you have just paid tokens to pass a rubric check. Strip and try again.
  4. Stop at the rubric score that matches the work you actually do. 7 or 8 of 12 is normal for a tight file; 12 of 12 is suspicious and usually means keyword-stuffing.

The full twelve rules with failure modes and source citations live at /karpathy-rules. The polyglot per-line scan is at /t/agent-config-token-bloat-audit. The per-turn cost math is at /t/claude-md-per-turn-token-cost.


Frequently asked questions

How does the Karpathy rubric actually score my CLAUDE.md?

ccmd's analyzer runs one case-insensitive regex per rule against the entire file. If the regex matches anywhere, the rule passes. If it does not match, the rule fails. Your score is the count of passing rules out of twelve. The full set of regexes lives at src/lib/analyzer.ts lines 49 to 122. There is no weighting, no severity, no proximity check, no count of how many times the pattern appears. One match counts the same as a hundred.

Where did the twelve rules come from?

Andrej Karpathy posted four CLAUDE.md principles on 2026-01-26 (think before coding, simplicity, surgical changes, goal-driven execution). Forrest Chang turned them into a 65-line template that became widely circulated. A community extension added eight more rules over the next quarter: avoid silent assumptions, no orthogonal damage, tests as truth, concise output, stack awareness, tool preference, failure mode coverage, self-improvement loop. The community version is the one reported to cut the mistake rate from 41 percent to 11 percent on a controlled task set. ccmd uses the twelve-rule extension as the default rubric.

Why a regex instead of an LLM judge?

Three reasons. First, the analyzer runs in the browser with no network call: the whole audit is 250 milliseconds and the file never leaves your machine. Second, the rubric is meant to be the cheapest possible smoke test before you change anything, not the final word on whether your CLAUDE.md is good. Third, regex matches are deterministic and auditable: you can read src/lib/analyzer.ts and see exactly what counts as a pass, which means you can game it if you want to, and you can also fix it honestly.

What is the difference between a passing rule and a good rule?

A passing rule means your file contains the phrase the regex looks for. That is a necessary condition for the agent to follow the rule, not a sufficient one. A CLAUDE.md that says 'always plan before coding' passes rule 1 with a regex match, but if the rest of the file pulls in the opposite direction (long aspirational rules, no consequences, no examples) the agent will still skip planning. The rubric tells you what is missing structurally; it does not tell you what is working in practice. We have a longer take on this at /t/claude-md-accuracy-gain-audit.

Can I make a CLAUDE.md score 12 of 12 without writing useful rules?

Yes, trivially, by pasting the regex keywords into a comment block. 'plan outline draft simpler surgical only touch goal definition of done do not guess stay in scope run tests concise next.js prefer past failure update claude.md self-improve.' Scores 12 of 12 in about thirty tokens. This is by design: the score is a coverage test, not a quality test. The token-bloat findings (cache_bust, duplicate, bloat, dead-rule) are what catch the keyword stuffing, because a list of trigger words also fails the per-line scan as a vague aspirational rule with no why.
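The keyword-stuffing claim can be checked against the patterns listed on this page. The sketch below runs the quoted string through the eleven patterns that are given in full; rule 9 is left out because its vocabulary list is elided here, so the count is out of eleven rather than twelve. The `PATTERNS` and `stuffed` names are illustrative.

```typescript
// The eleven fully listed patterns, keyed by rule number.
// Rule 9 is omitted: its word list is abbreviated on this page.
const PATTERNS: Array<[number, RegExp]> = [
  [1, /plan|outline|draft|design first|think first/i],
  [2, /simpl(e|er|icity)|smallest change|no premature|yagni|kiss/i],
  [3, /surgical|minimal change|only touch|don.?t refactor unrelated/i],
  [4, /goal|completion condition|when (this is )?done|definition of done/i],
  [5, /ask (if|when|the)|do not guess|confirm before|surface assumption/i],
  [6, /don.?t refactor unrelated|no drive.?by|stay in scope|orthogonal/i],
  [7, /run tests|tests must pass|before commit|before done|verify with/i],
  [8, /concise|short answer|no recap|brief|minimal output|terse/i],
  [10, /prefer|use [a-z]+ over|avoid using|never use [a-z]+/i],
  [11, /past failure|previous mistake|do not.*again|we got burned|never again|incident/i],
  [12, /add to this file|update claude\.md|log it here|self.?improve/i],
];

// The keyword string quoted in the answer above.
const stuffed =
  "plan outline draft simpler surgical only touch goal definition of done " +
  "do not guess stay in scope run tests concise next.js prefer past failure " +
  "update claude.md self-improve.";

// Collect the rule numbers that do not match.
const failedIds = PATTERNS.filter(([, re]) => !re.test(stuffed)).map(([id]) => id);
const passedCount = PATTERNS.length - failedIds.length;
```

With this input `failedIds` is empty and `passedCount` is 11, so every fully listed rule goes green on a string that gives the agent no usable instruction.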

What does a typical CLAUDE.md score?

From the files we have seen pasted into the analyzer in the open: the median is around 4 of 12. The most common passing rules are 8 (concise output) because people write 'be concise' near the top, 9 (stack awareness) because the file usually names the framework, 10 (tool preference) because 'prefer X over Y' is a common shape, and 4 (goal-driven execution) because of the recent /goal command primitives. The most commonly failed are 11 (failure mode coverage) and 12 (self-improvement loop), because those require turning past incidents into rules and almost nobody does this until they have been burned twice.

Does the rubric work on AGENTS.md or .cursorrules too?

Yes. The twelve regexes are content patterns, not file-format checks. The detector at src/lib/analyzer.ts line 41 tags the input as one of claude.md, agents.md, .cursorrules, or .grokrules, and the rubric scoring at line 254 runs the same twelve regexes against the file regardless of which tag came back. A 9 of 12 AGENTS.md and a 9 of 12 CLAUDE.md mean the same thing: nine of Karpathy's twelve patterns are present in the file. The polyglot walkthrough is at /t/agent-config-token-bloat-audit.

How is this different from the /karpathy-rules page on this site?

The /karpathy-rules page lists the twelve rules, their failure modes, and the source citation for each. It is the human-readable rubric. This page is about the scoring mechanism: the regex behind each rule, the count function that produces the score, and the limits of measuring it that way. Read /karpathy-rules first if you want to understand what each rule is for; read this page if you want to understand exactly what makes ccmd say your file passed or failed.

Does ccmd plan to add more rules?

The twelve-rule rubric is the default because it matches the published community version with the measured mistake-rate drop. We have not added rules ourselves. We have considered making the rubric pluggable so a team can add its own pass-tests on top of the twelve, which would slot into the same KARPATHY_12 array shape. If you want this, send us your draft rules at i@m13v.com.
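For anyone drafting rules to send, here is one way a pluggable rubric could look. This is a hypothetical sketch and not a ccmd feature: the `RubricRule` type, the two-rule `BUILT_IN` stand-in for `KARPATHY_12`, the team rule, and the `score` function are all assumptions.

```typescript
// Hypothetical rule shape shared by built-in and team rules.
interface RubricRule {
  id: number;
  name: string;
  test: (file: string) => boolean;
}

// Stand-in for the built-in array; only two of the twelve rules are shown.
const BUILT_IN: RubricRule[] = [
  { id: 1, name: "Think Before Coding", test: (f) => /plan|outline|draft|design first|think first/i.test(f) },
  { id: 8, name: "Concise Output", test: (f) => /concise|short answer|no recap|brief|minimal output|terse/i.test(f) },
];

// An invented team rule drawn from an imagined incident.
const TEAM_RULES: RubricRule[] = [
  { id: 13, name: "Migration Safety", test: (f) => /never edit applied migrations/i.test(f) },
];

// Same counting logic, with the total derived from whatever rules are supplied.
function score(file: string, rules: RubricRule[]): { passed: number; total: number } {
  const passed = rules.filter((rule) => rule.test(file)).length;
  return { passed, total: rules.length };
}

const file = "Plan first. Never edit applied migrations.";
const defaultScore = score(file, BUILT_IN);
const extendedScore = score(file, [...BUILT_IN, ...TEAM_RULES]);
```

For this sample, `defaultScore` is 1 of 2 and `extendedScore` is 2 of 3. Deriving the denominator from the array length keeps the score readable when a team adds rules of its own.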
