Judge accuracy by reading diffs, not commit messages; batch by size by anivar · Pull Request #4 · anivar/contributor-codebase-analyzer

anivar · 2026-06-13T00:55:46Z

Why

The skill's whole pitch is "read every diff, baselines not scores." But the accuracy number was computed by grepping commit messages for fix/crash/revert, which measures an engineer's commit-message vocabulary, not their code. It rewarded terse messages (wip, update) with a falsely high score and penalized honest ones (fix: null guard in payment capture). That's the exact vanity metric this tool exists to replace.

This PR moves all the judgment into the AI reading diffs and leaves scripts the cheap mechanical work — scripts gather, the AI judges. Faster, more accurate, and more agentic, with the core principle intact.

Accuracy — read from diffs, one verdict per commit

The rate is no longer derived from message keywords. While reading each diff, the agent assigns exactly one verdict: clean / exploratory / self-rework / production-breaking.
This kills the old double-counting. The previous formula summed overlapping buckets (self-reverts + same-day-fixes + crash-fixes + console-cleanup), so a single same-day crash-fix that also stripped a debug line was counted three times — and the ratio could exceed 100%.
Thresholds are reframed as investigation bands that must be backed by the diffs read, not standalone grades. A <80% band on someone who owns the payment core is a different story than one from careless debug commits, and only reading the diffs tells them apart.
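
To illustrate why one verdict per commit cannot double-count, here is a minimal sketch of a tally over the agent's verdicts. The input format (one "SHA verdict" pair per line), the function name verdict_tally, and the clean_share figure are assumptions made for illustration; the PR does not state the exact rate formula.

```shell
# Hypothetical helper: reads "SHA verdict" lines on stdin (format assumed).
# Each SHA may appear once, so the four buckets always sum to the commit count
# and no share can exceed 100%.
verdict_tally() {
  awk '
    BEGIN {
      n = split("clean exploratory self-rework production-breaking", names, " ")
      for (i = 1; i <= n; i++) valid[names[i]] = 1
    }
    NF == 0 { next }                      # skip blank lines
    !($2 in valid) {                      # reject anything outside the four verdicts
      printf "unknown verdict for %s: %s\n", $1, $2 > "/dev/stderr"
      bad = 1; exit
    }
    seen[$1]++ {                          # true on the second sighting of a SHA
      printf "duplicate verdict for %s\n", $1 > "/dev/stderr"
      bad = 1; exit
    }
    { count[$2]++; total++ }
    END {
      if (bad) exit 1
      for (i = 1; i <= n; i++) printf "%s=%d\n", names[i], count[names[i]]
      if (total > 0) printf "clean_share=%.1f%%\n", 100 * count["clean"] / total
    }
  '
}

# Usage: printf '%s\n' 'a1 clean' 'b2 self-rework' | verdict_tally
```

The share is only an investigation band: per the point above, it still has to be explained by the diffs that were read.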

Speed + dynamic batching

New checkpoint.sh manifest: one git pass that emits per-commit size (SHA date +add -del files subject). It replaces the repeated per-quarter/per-month count loops and the seven separate keyword passes.
Reading is now batched by diff churn, not commit count. The fixed "91+ commits WILL FAIL" table mis-sized both ways — it over-split 90 one-line commits and under-split 30 mega-commits (the cause of the documented 20.7% coverage gap). A greedy size packer prevents that.
Phases 3/4/5 fold into a single read: each commit gets its type, complexity, quality notes, and accuracy verdict in one pass instead of three walks over history.
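
The churn-based batching described above could look roughly like the sketch below. It assumes whitespace-separated manifest lines in the order given (SHA, date, +add, -del, files, subject); the function name pack_batches and the default budget of 2000 changed lines are hypothetical, and the actual packer in checkpoint.sh may use a different strategy.

```shell
# Hypothetical greedy packer: walks the manifest in order and starts a new
# batch whenever adding the next commit would push the batch past the budget.
# A single commit larger than the budget gets a batch to itself.
# Output: "batch_number SHA churn" per commit.
pack_batches() {
  awk -v budget="${1:-2000}" '
    NF < 4 { next }                       # skip lines without size fields
    {
      add = $3; del = $4
      sub(/^\+/, "", add)                 # "+120" -> "120"
      sub(/^-/, "", del)                  # "-30"  -> "30"
      churn = add + del
      if (batch == 0 || (used > 0 && used + churn > budget)) {
        batch++; used = 0
      }
      used += churn
      printf "%d %s %d\n", batch, $1, churn
    }
  '
}

# Usage (assuming the manifest is printed on stdout): ... | pack_batches 1500
```

With a budget of 100, ninety one-line commits would share a single batch, while a 600-line commit would be isolated, which is the behavior the fixed commit-count table could not provide.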

Fixes found along the way

checkpoint.sh used shasum -a 256. shasum is a Perl utility that many Linux installs (minimal and container images in particular) don't ship. Under set -euo pipefail the integrity check didn't degrade; it aborted. Now uses sha256sum with a shasum fallback (verified on Linux).
manifest validates its --author, --after/--before, and --range inputs against option-injection and shell metacharacters.
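
The sha256sum-first fallback can be sketched as follows. This is a minimal illustration, not the code in checkpoint.sh; the function name sha256_of is hypothetical.

```shell
# Hypothetical helper: prints the SHA-256 hex digest of stdin.
# Prefers sha256sum (GNU coreutils), falls back to shasum -a 256 (Perl),
# and returns non-zero instead of aborting when neither tool exists.
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum | awk '{print $1}'
  elif command -v shasum >/dev/null 2>&1; then
    shasum -a 256 | awk '{print $1}'
  else
    echo "no SHA-256 tool found" >&2
    return 1
  fi
}

# Usage: digest=$(printf 'abc' | sha256_of) || digest=""
```

The `|| digest=""` in the usage line is what lets a caller running under set -e degrade gracefully rather than exit when no hashing tool is available.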

Notes

Version bumped 3.0.0 → 3.1.0 (SKILL.md, AGENTS.md, README badge).
Docs across SKILL.md, AGENTS.md, README, and the references were updated to speak one language (verdict vocabulary, size-based batching). No behavioral scripts beyond the mechanical manifest helper — the reading and judgment stay with the agent.

Verification

bash -n clean; manifest exercised with author/date/range filters and the batch packer
Injection guards reject --author=-x, --after "2026; rm -rf /", and malformed ranges
sha256 helper digest matches the known SHA-256 of a test string on Linux
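
A guard that rejects the inputs listed above might look like the sketch below. The allowlists, the YYYY-MM-DD date shape, the A..B range shape, and all three function names are assumptions for illustration; the real checks in checkpoint.sh may be stricter or looser (for example, this author allowlist rejects non-ASCII names).

```shell
# Hypothetical validators. Each returns 0 when the value looks safe to pass
# to git and 1 otherwise. A leading "-" is refused so a value can never be
# parsed as an option.
valid_author() {
  case "$1" in
    ''|-*) return 1 ;;
    *[!A-Za-z0-9._@+\ -]*) return 1 ;;    # any character outside the allowlist
  esac
  return 0
}

valid_date() {
  case "$1" in
    [0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

valid_range() {
  case "$1" in
    ''|-*) return 1 ;;
    *[!A-Za-z0-9._/~^-]*) return 1 ;;     # no spaces, quotes, or ; | & $
  esac
  case "$1" in
    ?*..?*) return 0 ;;                   # require something on both sides of ".."
    *) return 1 ;;
  esac
}

# Usage: valid_date "$after" || { echo "bad --after value" >&2; exit 2; }
```

An allowlist is used instead of a blocklist of metacharacters because anything not explicitly permitted, including newlines, is rejected by default.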

Generated by Claude Code

Move all judgment into the AI reading diffs and leave scripts the cheap mechanical work: "scripts gather, the AI judges." This makes the skill faster, more accurate, and more agentic while keeping its core principle (read every diff, baselines not scores) intact.

Accuracy:
- Stop deriving the accuracy rate from grep over commit *messages* (fix/crash/revert). That measured vocabulary, not engineering: it rewarded terse messages and penalized honest ones, the exact vanity metric the skill exists to replace.
- Accuracy is now one verdict per commit, assigned while reading the diff: clean / exploratory / self-rework / production-breaking. Deduped, so the old overlap double-counting (a same-day crash-fix that also stripped a debug line counted 3x) is gone.
- Reframe thresholds as investigation bands that must be backed by the diffs read, not standalone grades.

Speed / agentic workflow:
- Add `checkpoint.sh manifest`: one git pass emitting per-commit size, replacing the repeated per-quarter/per-month count loops and the seven separate keyword passes.
- Batch by diff churn, not commit count. The fixed "91+ WILL FAIL" table mis-sized both ways; a greedy size packer prevents coverage gaps.
- Fold the former bug/quality passes into the single Phase 3 read: each commit gets type, complexity, quality notes, and its accuracy verdict in one pass instead of three history walks.

Fixes:
- checkpoint.sh now uses sha256sum with a shasum fallback; the previous shasum-only path aborted the integrity feature under set -e on Linux.
- Validate author/date/range inputs to manifest against injection.

Docs (SKILL.md, AGENTS.md, README, references) updated to match.

Version bump for the diff-judged accuracy and size-based batching change. Update the last two references that still said "fix-related" to the verdict vocabulary (self-rework vs clean), so the whole skill speaks one language.

claude added 2 commits June 12, 2026 17:25

Bump to 3.1.0 and align remaining rework wording

29fb2b3


anivar wants to merge 2 commits into main from claude/intelligent-johnson-p336pc
