Judge accuracy by reading diffs, not commit messages; batch by size #4

Open: anivar wants to merge 2 commits into main from claude/intelligent-johnson-p336pc

@anivar (Owner) commented Jun 13, 2026

Why

The skill's whole pitch is "read every diff, baselines not scores." But the accuracy number was computed by grepping commit messages for fix/crash/revert, which measures an engineer's commit-message vocabulary, not their code. It rewarded terse messages (wip, update) with a falsely high score and penalized honest ones (fix: null guard in payment capture). That's the exact vanity metric this tool exists to replace.

This PR moves all the judgment into the AI reading diffs and leaves scripts the cheap mechanical work — scripts gather, the AI judges. Faster, more accurate, and more agentic, with the core principle intact.

Accuracy — read from diffs, one verdict per commit

  • The rate is no longer derived from message keywords. While reading each diff, the agent assigns exactly one verdict: clean / exploratory / self-rework / production-breaking.
  • This kills the old double-counting. The previous formula summed overlapping buckets (self-reverts + same-day-fixes + crash-fixes + console-cleanup), so a single same-day crash-fix that also stripped a debug line was counted three times — and the ratio could exceed 100%.
  • Thresholds are reframed as investigation bands that must be backed by the diffs read, not standalone grades. A sub-80% band for someone who owns the payment core tells a different story from one caused by careless debug commits, and only reading the diffs tells them apart.
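
As a rough illustration of why one verdict per commit cannot double-count, here is a minimal sketch of a tally step. The function name tally_verdicts and the "SHA verdict" input format are assumptions made for illustration, not part of the skill; only the four verdict names come from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in checkpoint.sh): tally verdicts from
# "SHA verdict" lines on stdin, refusing any commit that appears twice.
tally_verdicts() {
  awk '
    BEGIN {
      split("clean exploratory self-rework production-breaking", names, " ")
      for (i in names) valid[names[i]] = 1
    }
    NF == 0 { next }                       # ignore blank lines
    !($2 in valid) {
      printf "unknown verdict for %s: %s\n", $1, $2 > "/dev/stderr"
      bad = 1; exit
    }
    $1 in seen {
      printf "commit %s has more than one verdict\n", $1 > "/dev/stderr"
      bad = 1; exit
    }
    { seen[$1] = 1; count[$2]++; total++ }
    END {
      if (bad) exit 1
      # Each commit is in exactly one bucket, so the counts sum to total.
      for (i = 1; i <= 4; i++) printf "%s %d\n", names[i], count[names[i]]
      printf "total %d\n", total
    }'
}
```

Because every commit lands in exactly one bucket, the bucket counts always add up to the total, so no share derived from them can exceed 100%.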

Speed + dynamic batching

  • New checkpoint.sh manifest: one git pass that emits per-commit size (SHA date +add -del files subject). It replaces the repeated per-quarter/per-month count loops and the seven separate keyword passes.
  • Reading is now batched by diff churn, not commit count. The fixed "91+ commits WILL FAIL" table mis-sized both ways — it over-split 90 one-line commits and under-split 30 mega-commits (the cause of the documented 20.7% coverage gap). A greedy size packer prevents that.
  • Phases 3/4/5 fold into a single read: each commit gets its type, complexity, quality notes, and accuracy verdict in one pass instead of three walks over history.
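
A single-pass manifest could look roughly like the sketch below. It is not the actual checkpoint.sh code: fold_numstat, manifest, and the "@" record marker are hypothetical names, and the row layout simply follows the (SHA date +add -del files subject) shape described above. It relies only on git log's --numstat, --date=short, and --format options.

```shell
#!/usr/bin/env bash
# Fold `git log --numstat` output into one row per commit:
#   SHA date +added -deleted files subject
# Header lines are expected as "@SHA<TAB>date<TAB>subject" (an assumed
# convention for this sketch); numstat lines are "added<TAB>deleted<TAB>path".
fold_numstat() {
  awk -F'\t' '
    function emit() {
      if (sha != "")
        printf "%s %s +%d -%d %d %s\n", sha, date, add, del, files, subject
    }
    /^@/ {                       # new commit: print the previous one first
      emit()
      sha = substr($1, 2); date = $2; subject = $3
      add = del = files = 0
      next
    }
    NF >= 3 {                    # numstat line; binary files report "-"
      if ($1 != "-") add += $1
      if ($2 != "-") del += $2
      files++
    }
    END { emit() }'
}

# One git pass; extra arguments are passed straight to git log, so the
# caller is responsible for validating them first.
manifest() {
  git log --numstat --date=short --format='@%H%x09%ad%x09%s' "$@" | fold_numstat
}
```

Binary files count toward the file total but contribute no line churn, and a subject containing a literal tab would be truncated at that tab in this sketch.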

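The size-based batching can be sketched as a greedy pass over manifest rows. This is an assumption about how such a packer might work (a simple in-order, next-fit strategy), not the skill's implementation; pack_batches and the default budget of 2000 changed lines are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical greedy packer: reads manifest rows
#   SHA date +added -deleted files subject
# and prints "batch SHA churn", starting a new batch whenever the next
# commit would push the running churn past the budget.
pack_batches() {
  local budget="${1:-2000}"      # assumed default, in changed lines
  awk -v budget="$budget" '
    BEGIN { batch = 1; used = 0 }
    NF < 4 { next }              # skip malformed rows
    {
      # Strip the leading "+" / "-" markers and add both sides.
      churn = substr($3, 2) + substr($4, 2)
      # A commit larger than the whole budget still gets a batch of its own.
      if (used > 0 && used + churn > budget) { batch++; used = 0 }
      used += churn
      printf "%d %s %d\n", batch, $1, churn
    }'
}
```

With a budget of 10, two one-line commits share a batch, a 600-line commit is isolated, and the small commits after it are grouped again, which is the behavior a fixed commit-count table cannot provide.
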
Fixes found along the way

  • checkpoint.sh used shasum -a 256, which isn't reliably present on Linux (it ships with Perl, and minimal images often omit it). Under set -euo pipefail the integrity check didn't degrade; it aborted. Now uses sha256sum with a shasum fallback (verified on Linux).
  • manifest validates its --author, --after/--before, and --range inputs against option-injection and shell metacharacters.
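
The fallback described above can be sketched as follows. The helper name sha256_of is an assumption for illustration; the actual function in checkpoint.sh may be named and structured differently.

```shell
#!/usr/bin/env bash
# Hash stdin with whichever SHA-256 tool is available, preferring
# sha256sum (GNU coreutils) and falling back to shasum (Perl).
sha256_of() {
  if command -v sha256sum >/dev/null 2>&1; then
    sha256sum | awk '{ print $1 }'
  elif command -v shasum >/dev/null 2>&1; then
    shasum -a 256 | awk '{ print $1 }'
  else
    # Report the problem and let the caller decide whether to skip the
    # integrity check instead of dying on "command not found".
    echo "sha256_of: neither sha256sum nor shasum found" >&2
    return 1
  fi
}
```

Under set -e, a caller that wants the check to degrade gracefully would still need to handle the non-zero return itself, for example with an if around the call.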

Notes

  • Version bumped 3.0.0 → 3.1.0 (SKILL.md, AGENTS.md, README badge).
  • Docs across SKILL.md, AGENTS.md, README, and the references were updated to speak one language (verdict vocabulary, size-based batching). No behavioral scripts beyond the mechanical manifest helper — the reading and judgment stay with the agent.

Verification

  • bash -n clean; manifest exercised with author/date/range filters and the batch packer
  • Injection guards reject --author=-x, --after "2026; rm -rf /", and malformed ranges
  • sha256 helper digest matches the known SHA-256 of a test string on Linux
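
The kind of guard exercised above might look like the following sketch. The function names and the exact patterns are assumptions: dates are restricted to YYYY-MM-DD here, which is stricter than what git accepts, and author names are limited to the current locale's alphanumerics plus a few punctuation characters. The real manifest checks may differ.

```shell
#!/usr/bin/env bash
# Allow-list validators: anything starting with "-" or containing shell
# metacharacters fails because it does not match the permitted pattern.

valid_author() {
  # First character may not be "-" or a space; later ones may be.
  local re='^[[:alnum:]@._+][[:alnum:]@._+ -]*$'
  [[ "$1" =~ $re ]]
}

valid_date() {
  local re='^[0-9]{4}-[0-9]{2}-[0-9]{2}$'
  [[ "$1" =~ $re ]]
}

valid_range() {
  # A ref starts with an alphanumeric or "_" and is joined by ".." or "...".
  local ref='[[:alnum:]_][[:alnum:]_./~^-]*'
  local re="^${ref}\.\.\.?${ref}\$"
  [[ "$1" =~ $re ]]
}
```

An allow-list is used here instead of a block-list of dangerous characters because it fails closed: an input form nobody anticipated is rejected by default.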

Generated by Claude Code

claude added 2 commits June 12, 2026 17:25
Move all judgment into the AI reading diffs and leave scripts the cheap
mechanical work — "scripts gather, the AI judges." This makes the skill
faster, more accurate, and more agentic while keeping its core principle
(read every diff, baselines not scores) intact.

Accuracy:
- Stop deriving the accuracy rate from grep over commit *messages*
  (fix/crash/revert). That measured vocabulary, not engineering — it
  rewarded terse messages and penalized honest ones, the exact vanity
  metric the skill exists to replace.
- Accuracy is now one verdict per commit, assigned while reading the diff:
  clean / exploratory / self-rework / production-breaking. Deduped, so the
  old overlap double-counting (a same-day crash-fix that also stripped a
  debug line counted 3x) is gone.
- Reframe thresholds as investigation bands that must be backed by the
  diffs read, not standalone grades.

Speed / agentic workflow:
- Add `checkpoint.sh manifest`: one git pass emitting per-commit size,
  replacing the repeated per-quarter/per-month count loops and the seven
  separate keyword passes.
- Batch by diff churn, not commit count. The fixed "91+ WILL FAIL" table
  mis-sized both ways; a greedy size packer prevents coverage gaps.
- Fold the former bug/quality passes into the single Phase 3 read — each
  commit gets type, complexity, quality notes, and its accuracy verdict in
  one pass instead of three history walks.

Fixes:
- checkpoint.sh now uses sha256sum with a shasum fallback; the previous
  shasum-only path aborted the integrity feature under set -e on Linux.
- Validate author/date/range inputs to manifest against injection.

Docs (SKILL.md, AGENTS.md, README, references) updated to match.
Version bump for the diff-judged accuracy and size-based batching change.
Update the last two references that still said "fix-related" to the
verdict vocabulary (self-rework vs clean), so the whole skill speaks one
language.