02 / karpathy 12-rule scorecard

Twelve rules. The failure mode each one is built to catch.

Andrej Karpathy described his Claude Code failure modes on X in January 2026. Forrest Chang distilled them into a CLAUDE.md (multica-ai/andrej-karpathy-skills, 200k+ combined stars across the repo and its org mirror). The original four covered the highest-frequency failures; the community extended them to twelve, layering in agent-orchestration safeguards. ccmd scores every pasted file against all twelve rules and reports the per-line token cost.

Below: each rule, what it tells the agent to do, and the failure mode it is meant to catch. These map one-to-one to the rubric the analyzer runs, so a rule you see here is a check ccmd actually fires on your file.
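As a rough illustration of what a one-check-per-rule rubric could look like, here is a minimal sketch. The patterns below are hypothetical and written only for this example; they are not ccmd's actual regexes, which this page does not show. The `RUBRIC` and `score` names are likewise assumptions.

```python
import re

# Hypothetical patterns for illustration only. ccmd's real regexes are not
# published on this page and will differ.
RUBRIC = {
    "Think Before Coding": r"\bplan\b|trade-?offs?",
    "Simplicity First": r"\bsimpl|\byagni\b|premature abstraction",
    "Surgical Changes": r"\bsurgical\b|touch only",
    "Goal-Driven Execution": r"definition of done|success criteria",
    "Avoid Silent Assumptions": r"\bambiguous\b|do not (guess|assume)",
    "No Orthogonal Damage": r"stay in(side)? (the )?scope|drive-by",
    "Tests as Truth": r"tests? (are truth|must pass)|run (the )?tests",
    "Concise Output": r"\bconcise\b|short answers",
    "Stack Awareness": r"package manager|\b(pnpm|npm|yarn)\b",
    "Tool Preference": r"\bprefer\b.*\bover\b",
    "Failure Mode Coverage": r"past incident|burned us",
    "Self-Improvement Loop": r"add the lesson|update claude\.md",
}


def score(text: str) -> tuple[int, list[str]]:
    """Return (rules covered, names of missing rules) for a pasted rules file."""
    missing = [
        name
        for name, pattern in RUBRIC.items()
        if not re.search(pattern, text, re.IGNORECASE)
    ]
    return len(RUBRIC) - len(missing), missing


if __name__ == "__main__":
    covered, missing = score("- Plan before editing.\n- Be concise.\n")
    print(f"{covered} / {len(RUBRIC)} covered; missing: {', '.join(missing)}")
```

A pattern match only shows that the file mentions a rule. It says nothing about whether the agent follows it.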

Karpathy's original four
  1. Think Before Coding

    Surface tradeoffs before changing code. Do not assume. Do not hide confusion behind a confident-looking edit.

    Failure mode: Agent dives into edits without forming a plan, then thrashes when it hits the first surprise.

  2. Simplicity First

    Minimum code that solves the problem. Nothing speculative. No premature abstraction.

    Failure mode: Agent over-engineers because it pattern-matches what 'professional code' looks like, not what the task needs.

  3. Surgical Changes

    Touch only what you must. Clean only your own mess. Do not refactor unrelated code.

    Failure mode: Agent rewrites a working file while fixing one line. Drive-by refactor breaks tests in unrelated modules.

  4. Goal-Driven Execution

    Define success criteria up front. Loop until verified against those criteria.

    Failure mode: Agent stops at the first compile; success is 'feature merges with passing tests', not 'no error message right now'.

The eight-rule extension ccmd also scores
  5. Avoid Silent Assumptions

    Ask when the spec is ambiguous. Do not guess and ship the guess as fact.

    Failure mode: Agent picks one reading of an unclear request and reports success. 'Migration completed' while 14% of rows were silently skipped.

  6. No Orthogonal Damage

    Stay inside the scope the request named. No edits to neighboring code just because you are in the file.

    Failure mode: Agent reformats or 'improves' a module next to the one it was asked to change. The diff touches files nobody asked about.

  7. Tests as Truth

    The suite must be green before you claim done. Verify against tests, not against the last error message.

    Failure mode: Agent reports 'tests pass' after quietly excluding the three that failed. Done means green, not silent.

  8. Concise Output

    Bias to short answers. No recap of work you just did unless asked.

    Failure mode: Agent pads every reply by restating what it just changed. Context fills with echoes instead of new signal.

  9. Stack Awareness

    Name the language, framework, package manager, and runtime so the agent stops inferring them.

    Failure mode: Agent assumes npm in a pnpm repo, or React Router in a Next.js app, then writes code against the wrong stack.

  10. Tool Preference

    Spell out prefer-X-over-Y. Name the libraries and tools the repo standardized on.

    Failure mode: Agent reaches for the first dependency it knows and reintroduces the one you spent a sprint removing.

  11. Failure Mode Coverage

    Record the specific mistakes that burned you before, so the file warns against them by name.

    Failure mode: Nothing in the file mentions last month's incident, so the agent walks into it again on a fresh session.

  12. Self-Improvement Loop

    When the agent gets corrected, instruct it to write the lesson back into this file.

    Failure mode: A correction lands, the session ends, and the knowledge evaporates. The next run starts from the same blank slate.

Paste-ready: all twelve rules in one short file

Here are the twelve as lines you can drop straight into a CLAUDE.md. Treat it as a starting point, not gospel: keep the rules that match how your repo actually breaks, cut the ones that do not. It is 24 lines on purpose. Anthropic's own guidance puts compliance dropoff around 150 to 200 instructions, so a rules block that runs long stops being read. Paste this into the analyzer and it scores 12 of 12; then paste your own file and see where it falls short.

CLAUDE.md · 24 lines · scores 12 / 12
# CLAUDE.md

## How to work in this repo

- Plan before editing: outline the approach and surface tradeoffs first. Do not hide confusion behind a confident edit.
- Simplicity first: smallest change that works, no premature abstraction (YAGNI).
- Surgical changes only: touch only what the task names; don't refactor unrelated code.
- State the goal and the definition of done up front; loop until verified against it.
- Ask when the spec is ambiguous; do not guess and ship the guess as fact.
- Stay in scope: no drive-by edits to neighboring code just because you're in the file.
- Tests are truth: run tests and they must pass before you claim done.
- Be concise: short answers, no recap of work you just did unless asked.

## Stack

- TypeScript + Next.js, pnpm, Node runtime. Don't infer the package manager.
- Prefer the repo's existing libraries over reaching for a new dependency.

## What burned us before

- Past incident: a migration reported "completed" while 14% of rows were silently skipped. Verify row counts.

## Keep this file honest

- When you get corrected, add the lesson to this file (update CLAUDE.md) so the next session starts ahead.

Want the scoring mechanics, not just the list? See how the 12-rule rubric is scored, or paste your own file and watch it run.

Frequently asked questions

What are Andrej Karpathy's rules for CLAUDE.md?

Karpathy's original four are: Think Before Coding (plan before you edit), Simplicity First (smallest change that works, no premature abstraction), Surgical Changes (touch only what the task names), and Goal-Driven Execution (define done up front, loop until verified). The community extended these to twelve by adding Avoid Silent Assumptions, No Orthogonal Damage, Tests as Truth, Concise Output, Stack Awareness, Tool Preference, Failure Mode Coverage, and Self-Improvement Loop. ccmd scores your file against all twelve.

Where did the twelve rules come from?

Karpathy posted his Claude Code failure modes on X in January 2026 (the original four). Forrest Chang distilled them into a reusable CLAUDE.md in multica-ai/andrej-karpathy-skills (200k+ combined stars across the repo and its org mirror). Rules 5 through 12 are the community extension that layered in agent-orchestration safeguards (scope discipline, test gating, self-correction). ccmd's analyzer encodes all twelve as the default rubric.

Do the Karpathy rules actually reduce mistakes?

The published version reports a drop in agent mistake rate from 41% to 11% once the file covers the rules. That number is self-reported by the community distillation, not an independent benchmark, so treat it as directional. What is concrete: each rule maps to a specific, recurring failure (a drive-by refactor, a silent migration that skipped 14% of rows, a 'tests pass' claim after quietly excluding the three that failed). Covering the rule does not guarantee the agent obeys it; it removes the excuse that the file never said so.

What is the difference between the original four and the eight-rule extension?

The original four are Karpathy's highest-frequency failure modes: they fire on almost every coding session regardless of stack. The eight-rule extension is about agent orchestration over longer runs: staying in scope across many files, treating the test suite as the source of truth, keeping output short so context does not fill with echoes, and writing corrections back into the file so the next session does not repeat them. A fresh CLAUDE.md usually covers two or three of the original four and almost none of the extension.

How do I apply these rules to my own CLAUDE.md?

Paste your file into the analyzer on the homepage. It runs all twelve regexes in the browser (no signup, no upload), reports your score out of twelve, and flags which rules are missing. It also shows the per-line token cost, so you can add the missing rules without bloating the file past the point where the agent starts ignoring it. The most commonly failed rules are 11 (Failure Mode Coverage) and 12 (Self-Improvement Loop), because both require turning past incidents into rules.
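To make the per-line token cost concrete, here is a minimal sketch of how such an estimate could be computed. It assumes a rough four-characters-per-token heuristic rather than a real tokenizer, so the numbers are approximate and will not match what ccmd reports. The function names are hypothetical.

```python
import math

# Assumption: roughly 4 characters per token for English text. This is a
# heuristic, not a real tokenizer, and not necessarily what ccmd uses.
CHARS_PER_TOKEN = 4


def estimate_tokens(line: str) -> int:
    """Estimate the token count of one line; blank lines cost nothing."""
    stripped = line.strip()
    return math.ceil(len(stripped) / CHARS_PER_TOKEN) if stripped else 0


def per_line_cost(text: str) -> list[tuple[int, int, str]]:
    """Return (line number, estimated tokens, line) for each non-blank line."""
    return [
        (number, estimate_tokens(line), line)
        for number, line in enumerate(text.splitlines(), start=1)
        if line.strip()
    ]


def total_cost(text: str) -> int:
    """Sum the estimated tokens over the whole file."""
    return sum(tokens for _, tokens, _ in per_line_cost(text))


if __name__ == "__main__":
    sample = "# CLAUDE.md\n\n- Be concise: short answers.\n"
    for number, tokens, line in per_line_cost(sample):
        print(f"line {number:>3}  ~{tokens:>3} tokens  {line}")
    print("total:", total_cost(sample))
```

Sorting the result by estimated tokens is a quick way to spot the lines worth tightening before adding new rules.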

Do the rules work for AGENTS.md, .cursorrules, and .grokrules too?

Yes. The twelve rules are content patterns, not file-format checks, so the same rubric runs against AGENTS.md (Codex), .cursorrules (Cursor), and .grokrules (xAI Grok Build). A 9 of 12 means the same thing in any of them: nine of the twelve patterns are present in the file. The detector tags which format you pasted, then scores it the same way.
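One way to picture the tagging step is a lookup that labels the format and leaves scoring untouched. The sketch below assumes a filename hint is available, which is an assumption: this page does not say how ccmd's detector recognizes pasted content. `FORMATS` and `detect_format` are hypothetical names.

```python
from pathlib import PurePath

# Labels taken from the formats named above. Keying on the filename is an
# assumption; ccmd's detector may inspect the pasted content instead.
FORMATS = {
    "claude.md": "CLAUDE.md (Claude Code)",
    "agents.md": "AGENTS.md (Codex)",
    ".cursorrules": ".cursorrules (Cursor)",
    ".grokrules": ".grokrules (xAI Grok Build)",
}


def detect_format(filename: str) -> str:
    """Tag a rules file by its name; anything unrecognized is 'unknown'."""
    return FORMATS.get(PurePath(filename).name.lower(), "unknown")


if __name__ == "__main__":
    for name in ("repo/AGENTS.md", ".cursorrules", "notes.txt"):
        print(f"{name} -> {detect_format(name)}")
```

The tag is only a label. The rule checks read the file's text, so the score does not depend on which format was detected.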