Judge accuracy by reading diffs, not commit messages; batch by size#4
Open
anivar wants to merge 2 commits into
Open
Judge accuracy by reading diffs, not commit messages; batch by size#4anivar wants to merge 2 commits into
anivar wants to merge 2 commits into
Conversation
Move all judgment into the AI reading diffs and leave scripts the cheap mechanical work — "scripts gather, the AI judges." This makes the skill faster, more accurate, and more agentic while keeping its core principle (read every diff, baselines not scores) intact. Accuracy: - Stop deriving the accuracy rate from grep over commit *messages* (fix/crash/revert). That measured vocabulary, not engineering — it rewarded terse messages and penalized honest ones, the exact vanity metric the skill exists to replace. - Accuracy is now one verdict per commit, assigned while reading the diff: clean / exploratory / self-rework / production-breaking. Deduped, so the old overlap double-counting (a same-day crash-fix that also stripped a debug line counted 3x) is gone. - Reframe thresholds as investigation bands that must be backed by the diffs read, not standalone grades. Speed / agentic workflow: - Add `checkpoint.sh manifest`: one git pass emitting per-commit size, replacing the repeated per-quarter/per-month count loops and the seven separate keyword passes. - Batch by diff churn, not commit count. The fixed "91+ WILL FAIL" table mis-sized both ways; a greedy size packer prevents coverage gaps. - Fold the former bug/quality passes into the single Phase 3 read — each commit gets type, complexity, quality notes, and its accuracy verdict in one pass instead of three history walks. Fixes: - checkpoint.sh now uses sha256sum with a shasum fallback; the previous shasum-only path aborted the integrity feature under set -e on Linux. - Validate author/date/range inputs to manifest against injection. Docs (SKILL.md, AGENTS.md, README, references) updated to match.
Version bump for the diff-judged accuracy and size-based batching change. Update the last two references that still said "fix-related" to the verdict vocabulary (self-rework vs clean), so the whole skill speaks one language.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
The skill's whole pitch is "read every diff, baselines not scores." But the accuracy number was computed by
grep-ingfix/crash/revertover commit messages — which measures an engineer's commit-message vocabulary, not their code. It rewarded terse messages (wip,update) with a falsely high score and penalized honest ones (fix: null guard in payment capture). That's the exact vanity metric this tool exists to replace.This PR moves all the judgment into the AI reading diffs and leaves scripts the cheap mechanical work — scripts gather, the AI judges. Faster, more accurate, and more agentic, with the core principle intact.
Accuracy — read from diffs, one verdict per commit
clean/exploratory/self-rework/production-breaking.self-reverts + same-day-fixes + crash-fixes + console-cleanup), so a single same-day crash-fix that also stripped a debug line was counted three times — and the ratio could exceed 100%.Speed + dynamic batching
checkpoint.sh manifest: one git pass that emits per-commit size (SHA date +add -del files subject). It replaces the repeated per-quarter/per-month count loops and the seven separate keyword passes.Fixes found along the way
checkpoint.shusedshasum -a 256, which isn't present on most Linux distros. Underset -euo pipefailthe integrity check didn't degrade — it aborted. Now usessha256sumwith ashasumfallback (verified on Linux).manifestvalidates its--author,--after/--before, and--rangeinputs against option-injection and shell metacharacters.Notes
manifesthelper — the reading and judgment stay with the agent.Verification
bash -nclean;manifestexercised with author/date/range filters and the batch packer--author=-x,--after "2026; rm -rf /", and malformed rangessha256helper digest matches the known SHA-256 of a test string on LinuxGenerated by Claude Code