CLAUDE.md vs AGENTS.md vs .cursorrules drift: how to actually measure it

Matthew Diakonov, Written with AI

Published May 18, 2026 · 7 min read

If your repo has CLAUDE.md, AGENTS.md, and .cursorrules sitting next to each other, they have already drifted. Three files were added by three people at three different points and nobody owns keeping them in sync. The agents behave differently on the same task and nobody can say which file is wrong.

The reason this is hard to talk about is that there is no shared unit to measure drift with. Token counts are not it (a longer file is not a better file). Word-for-word diffing is not it either (the formats use different syntax for the same idea). What you actually want is a single rubric, applied identically to each file, that returns a comparable score. That is what ccmd does, and the score delta between the three columns is the answer to your question.

Direct answer (verified 2026-05-18)

Paste each file into the analyzer at ccmd.dev. The same Karpathy 12-rule rubric runs against all four formats (CLAUDE.md, AGENTS.md, .cursorrules, .grokrules), so the pass counts are directly comparable. If your CLAUDE.md scores 7/12 and your .cursorrules scores 3/12, at least four rules pass in one file and miss in the other, and those rules are where Claude and Cursor will silently behave differently on the same task. Source: the analyzer applies one KARPATHY_12 array uniformly, verified at src/lib/analyzer.ts lines 49-122 and 255-260.

Why one rubric works across four formats

The thing that makes drift measurable is that the rubric is not testing syntax, it is testing ideas. Rule R7 (Tests as Truth) passes if the file contains any phrase that means “tests must pass before claiming done.” The phrase can live under a ## Tests heading in CLAUDE.md, under # Verification in AGENTS.md, or in a bullet at the top of .cursorrules. The same regex catches all three. That is why the score is comparable.

The analyzer detects which format you pasted only to print the right label in the output. The rubric tests themselves do not branch on format. You can read it yourself in src/lib/analyzer.ts.
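A minimal sketch of what a format-agnostic rubric can look like, assuming each rule is an ID, a name, and a predicate over the raw text. The rule shape, the two regexes, and the `score` function are illustrative assumptions, not the actual contents of src/lib/analyzer.ts; only two of the twelve rules are shown.

```typescript
// Hypothetical shapes: the real analyzer's types are not shown in this article.
type InputType = "CLAUDE.md" | "AGENTS.md" | ".cursorrules" | ".grokrules";

interface Rule {
  id: string;
  name: string;
  test: (raw: string) => boolean; // sees only the raw text, never the format
}

// Two illustrative rules. The patterns are assumptions, not the shipped regexes.
const RULES: Rule[] = [
  {
    id: "R1",
    name: "Think Before Coding",
    test: (raw) => /\b(plan|think)\b[^\n]*\bbefore\b[^\n]*\b(coding|editing|edits)\b/i.test(raw),
  },
  {
    id: "R7",
    name: "Tests as Truth",
    test: (raw) => /\btests?\b[^\n]*\bbefore\b[^\n]*\b(done|finished|complete)\b/i.test(raw),
  },
];

interface Scorecard {
  inputType: InputType;
  passed: string[];
  total: number;
}

// The label is carried through for display only; it never changes which tests run.
function score(raw: string, inputType: InputType): Scorecard {
  const passed = RULES.filter((rule) => rule.test(raw)).map((rule) => rule.id);
  return { inputType, passed, total: RULES.length };
}

// Same idea, three different homes for it.
const claudeMd = "## Tests\nTests must pass before claiming done.\nPlan before coding.";
const agentsMd = "# Verification\n- All tests pass before the task is done.";
const cursorRules = "be concise";

console.log(score(claudeMd, "CLAUDE.md").passed);       // [ 'R1', 'R7' ]
console.log(score(agentsMd, "AGENTS.md").passed);       // [ 'R7' ]
console.log(score(cursorRules, ".cursorrules").passed); // []
```

Because every file goes through the same `RULES` array, the three `passed` lists can be compared directly, which is the property the comparison relies on.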

What three runs on the same repo look like

This is the trio of pastes from a real Next.js repo that had all three files. The interesting number in each output is the rubric line. Read them top to bottom and mentally line up the rule IDs.

run 1: CLAUDE.md (rubric: 7/12)

run 2: AGENTS.md (rubric: 5/12)

run 3: .cursorrules (rubric: 3/12)

CLAUDE.md is the most-loved file and scores 7/12. AGENTS.md was added next by a different person and inherited five rules but missed seven. The .cursorrules file is a single-paragraph template someone grabbed off GitHub and scores 3/12. The gap between the best and worst column is four passes, so at least four rules are enforced for one agent and not for the other. Those rules are the drift.

The drift, rule by rule

The same view, transposed. Each row is one of the Karpathy rules. The “this repo” column is what happens in this repo today.

Rule	Status	This repo
Think Before Coding (R1)	split	pass in CLAUDE.md and AGENTS.md, miss in .cursorrules. Cursor will dive straight into edits while Claude and Codex draft a plan first.
Surgical Changes (R3)	miss across all three	no one is enforcing scope. Drive-by refactors will land in every agent.
Goal-Driven Execution (R4)	split	CLAUDE.md sets a completion condition. AGENTS.md and .cursorrules do not. Same task ends in three different states.
Tests as Truth (R7)	split	CLAUDE.md and AGENTS.md require tests pass before done. .cursorrules does not. Cursor sessions ship without running the suite.
Tool Preference (R10)	split	.cursorrules names the package manager and ORM. CLAUDE.md and AGENTS.md do not. Same package gets installed two different ways.
Self-Improvement Loop (R12)	miss across all three	nothing tells the agent to write learnings back into the file. Mistakes repeat.

Two of these rules (Surgical Changes and Self-Improvement Loop) miss in all three files. That is not drift, that is a shared blind spot. The other four are real drift: one or two of the three files enforce the rule and the others do not. Run the same task through each agent and you will get behaviorally different results on those four axes.
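The distinction between drift and a shared blind spot can be sketched as a small classifier. The pass/miss values below are transcribed from the six rows of the table; the type names and the `classify` function are hypothetical and not part of ccmd.

```typescript
type FileName = "CLAUDE.md" | "AGENTS.md" | ".cursorrules";
type Verdict = "shared pass" | "shared blind spot" | "drift";

// Pass (true) or miss (false) per file, for the six rules in the table.
const table: Record<string, Record<FileName, boolean>> = {
  R1: { "CLAUDE.md": true, "AGENTS.md": true, ".cursorrules": false },
  R3: { "CLAUDE.md": false, "AGENTS.md": false, ".cursorrules": false },
  R4: { "CLAUDE.md": true, "AGENTS.md": false, ".cursorrules": false },
  R7: { "CLAUDE.md": true, "AGENTS.md": true, ".cursorrules": false },
  R10: { "CLAUDE.md": false, "AGENTS.md": false, ".cursorrules": true },
  R12: { "CLAUDE.md": false, "AGENTS.md": false, ".cursorrules": false },
};

// A rule drifts only when the files disagree about it.
function classify(row: Record<FileName, boolean>): Verdict {
  const results = Object.values(row);
  if (results.every((passed) => passed)) return "shared pass";
  if (results.every((passed) => !passed)) return "shared blind spot";
  return "drift";
}

const verdicts = Object.entries(table).map(([id, row]) => ({ id, verdict: classify(row) }));
const driftRules = verdicts.filter((v) => v.verdict === "drift").map((v) => v.id);
const blindSpots = verdicts.filter((v) => v.verdict === "shared blind spot").map((v) => v.id);

console.log(driftRules); // [ 'R1', 'R4', 'R7', 'R10' ]
console.log(blindSpots); // [ 'R3', 'R12' ]
```

Blind spots need a new rule written once and copied everywhere; drift rules only need an existing rule ported from a file that already has it.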

A four-step drift check

Takes about three minutes the first time, ninety seconds once you know the shape.

Open ccmd.dev

The analyzer is the textarea on the homepage. No signup. Everything runs in the browser. Three tabs is the most ergonomic way to do this.

Paste CLAUDE.md, then AGENTS.md, then .cursorrules

Each run prints inputType, totalTokens, the 12-rule scorecard, and the per-line findings. Write down the rubric pass count for each file. That trio of numbers is your drift signature.

Diff the rubric columns

A rule that passes in one file and misses in another is a drift line. The Karpathy rule set is small enough to eyeball: twelve rows, three columns. The agents diverge on every row where the columns disagree.

Move the gap-fillers to whichever file is canonical

Pick one of the three as source of truth (usually CLAUDE.md if Claude Code is the primary harness) and port the missing rules across. For team repos, the paid tier emails a weekly drift diff so the file does not silently slide back.
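The last step can be sketched as a port plan: given the rule IDs each file passes and a choice of canonical file, list what to copy where. The `portPlan` function and its types are hypothetical. The pass lists cover only the rules named in the drift table, so they are shorter than the full 7/12, 5/12, and 3/12 scorecards.

```typescript
interface PortPlan {
  // Rules the canonical file passes that each other file misses.
  toTargets: Record<string, string[]>;
  // Rules another file passes that the canonical file itself misses.
  backfillCanonical: string[];
}

function portPlan(canonical: string, passes: Record<string, string[]>): PortPlan {
  const canon = passes[canonical] ?? [];
  const toTargets: Record<string, string[]> = {};
  const backfillCanonical: string[] = [];

  for (const [file, rules] of Object.entries(passes)) {
    if (file === canonical) continue;
    toTargets[file] = canon.filter((id) => !rules.includes(id));
    for (const id of rules) {
      if (!canon.includes(id) && !backfillCanonical.includes(id)) {
        backfillCanonical.push(id);
      }
    }
  }
  return { toTargets, backfillCanonical };
}

// Partial pass lists, taken from the drift table only.
const passes: Record<string, string[]> = {
  "CLAUDE.md": ["R1", "R4", "R7"],
  "AGENTS.md": ["R1", "R7"],
  ".cursorrules": ["R10"],
};

const plan = portPlan("CLAUDE.md", passes);
console.log(plan.toTargets);         // { 'AGENTS.md': [ 'R4' ], '.cursorrules': [ 'R1', 'R4', 'R7' ] }
console.log(plan.backfillCanonical); // [ 'R10' ]
```

The `backfillCanonical` list matters because the canonical file is not always the most complete one: here .cursorrules is the only file that names the package manager, so R10 should move into CLAUDE.md before anything is ported outward.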

What competitor tools miss

The two existing camps in this space both whiff on drift. Token-cost meters (ccusage, 14.3K stars on GitHub) tell you what you spent across a Claude session, but they only see Claude. They have no signal on AGENTS.md or .cursorrules at all. Template generators (Genesis Claw, codewithclaude.net) emit a starter CLAUDE.md from a stack picker, but they never grade an existing file, never touch AGENTS.md, and never compare anything.

The closest thesis match is Alex Greensh's token-optimizer, but it ships as a Claude Code CLI plugin with no hosted analyzer and no cross-format support. Anthropic's official claude-code-setup is structurally read-only and only handles Claude. ccmd is the only tool that runs one rubric against all four agent config formats in one tab, which is the precondition for the drift number to exist at all.

What to do about it

Pick one of the three files as canonical. For most teams running Claude Code as the primary harness, that is CLAUDE.md. Read down the rubric column for the other two files and port the missing rules across. Phrase each one in the target file's native dialect (a Karpathy rule does not need the same wording in CLAUDE.md and .cursorrules, just the same meaning).

Run the analyzer a second time and confirm the rubric scores converged. The drift you remove this way pays for itself the next time two engineers on the same team get different answers from Claude and Cursor on the same prompt. The paid tier of ccmd watches the score over time and emails a weekly diff so the files do not quietly slide back to where they started. Nothing on the free side is gated; the rubric and the four-format detection are the same code path.

Frequently asked questions

What does drift between CLAUDE.md, AGENTS.md, and .cursorrules actually look like in a real repo?

Different rules in different files. The classic shape: someone added 'always run pnpm test before saying done' to CLAUDE.md last sprint because Claude shipped untested code, but the .cursorrules file still says 'be concise' and nothing else. Same project, two agents, two definitions of done. The bytes are not the problem; the silent behavioral fork is. ccmd's analyzer runs the same Karpathy 12-rule scorecard against all four formats so the fork is a literal column-by-column diff instead of a vibes check.

Why do these files drift in the first place?

Because each one was added by a different person at a different time for a different agent. The senior who introduced Claude Code wrote a long CLAUDE.md. A second engineer started using Cursor and grabbed a generic .cursorrules template off GitHub. Someone tried Codex with a fresh AGENTS.md. None of them touched the others. Six months later the rules contradict each other and the repo has three different definitions of 'clean code'. The fix is not picking one tool; it is making the three files derive from one source of truth.

Can ccmd actually compare files side by side, or just score one at a time?

The free analyzer scores one file per paste. You do the comparison by reading the rubric column for each file (twelve rows, takes about ninety seconds for three files). The paid tier ($9-19 solo, $49 team) connects to the GitHub repo, runs all four formats on every PR, posts a drift diff as a PR comment, and emails a weekly summary with the rubric delta over time. The CLI version handles the comparison locally if you do not want a hosted tool to see your config.

Which file should win when they disagree?

The file that belongs to whichever agent you use the most. For most teams that is CLAUDE.md because Claude Code is the primary harness. Adopt one canonical file as the source of truth and treat the others as projections of it. Some teams generate AGENTS.md and .cursorrules from CLAUDE.md on commit (a script that strips Claude-specific terms and rewrites the heading). Other teams hand-maintain all three but enforce that the Karpathy rubric score is identical, which forces the rules in the gap to migrate.
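A rough sketch of the generate-on-commit approach, under stated assumptions: Claude-only lines are tagged with a made-up `<!-- claude-only -->` marker, the first `# ` line is the heading to rewrite, and the product name "Claude Code" is swapped for a neutral agent name. None of this is a ccmd feature or an established convention.

```typescript
// Hypothetical marker for lines that should not reach other agents' files.
const CLAUDE_ONLY = "<!-- claude-only -->";

interface Target {
  heading: string;   // replacement for the first top-level heading
  agentName: string; // replaces the phrase "Claude Code"
}

function project(canonical: string, target: Target): string {
  // Drop lines tagged as Claude-specific.
  const kept = canonical.split("\n").filter((line) => !line.includes(CLAUDE_ONLY));

  // Rewrite the first top-level heading, if there is one.
  const headingIndex = kept.findIndex((line) => line.startsWith("# "));
  if (headingIndex >= 0) kept[headingIndex] = `# ${target.heading}`;

  return kept.join("\n").replace(/\bClaude Code\b/g, target.agentName);
}

const canonical = [
  "# CLAUDE.md",
  "",
  "- Tests must pass before claiming done.",
  "- Claude Code must draft a plan before editing.",
  "- Keep this line out of other agents' files. <!-- claude-only -->",
].join("\n");

const agentsMdText = project(canonical, { heading: "AGENTS.md", agentName: "The agent" });
console.log(agentsMdText);
// # AGENTS.md
//
// - Tests must pass before claiming done.
// - The agent must draft a plan before editing.
```

A function like this could be called from a pre-commit hook that writes the result to AGENTS.md and .cursorrules, so the projections never lag behind the canonical file.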

Are AGENTS.md and .cursorrules also sent on every turn the way CLAUDE.md is?

Yes. The host harness reads its config file at session start and concatenates it into the system prompt, same shape as CLAUDE.md. Codex injects AGENTS.md. Cursor injects .cursorrules. Grok Build injects .grokrules. None of them gate the bytes per-rule, so every line is on the wire on every API call for the session. Drift is not just a quality problem, it is a cost problem: a 4,000-token CLAUDE.md, a 2,800-token AGENTS.md, and a 1,000-token .cursorrules across three different sessions in the same day add up to 7,800 tokens of largely overlapping instructions that could have been one shared rubric.
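The arithmetic can be spelled out as a quick calculation. The per-file token counts are the figures quoted in this answer; the number of API calls per session is a made-up value for illustration, and the sketch ignores any caching a provider may apply.

```typescript
// Token counts quoted in the answer above.
const configTokens: Record<string, number> = {
  "CLAUDE.md": 4000,
  "AGENTS.md": 2800,
  ".cursorrules": 1000,
};

// Hypothetical: API calls made in one session of each agent.
const callsPerSession = 50;

// Config tokens sent once per call, summed across the three agents.
const perCallTotal = Object.values(configTokens).reduce((sum, n) => sum + n, 0);

// One session per agent per day, each resending its file on every call.
const perDayTotal = perCallTotal * callsPerSession;

console.log(perCallTotal); // 7800
console.log(perDayTotal);  // 390000
```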

How does ccmd score formats with completely different syntax with one rubric?

The rubric tests for ideas, not syntax. R7 (Tests as Truth) matches any phrase that means 'tests must pass before claiming done' regardless of which markdown header it lives under. R9 (Stack Awareness) matches the names of common languages and frameworks. You can read the test functions in src/lib/analyzer.ts lines 49-122. The detectType function at line 41 only decides the inputType label; the twelve rubric tests run against the raw input string and do not branch on format.

Does the drift problem go away if I just consolidate everything into CLAUDE.md?

It removes the cross-file fork but it does not remove the surface fork. Each host still reads its own file. If you delete .cursorrules and only keep CLAUDE.md, Cursor will run with no rules at all and you have made it worse. The two viable paths: (1) keep one canonical file and generate the others from it on commit, (2) keep all three by hand but pin them to the same rubric score. The thing to avoid is the third path everyone defaults to, which is editing one file when you notice a problem and forgetting the other two.