feat(bench): EOPS steerer sweep + population + product-value claim #203
…unction
Generalizes the depth arm into a population sweep: every steerer (a {systemPrompt,
userTemplate} read from STEERERS_FILE) runs as a depth arm against ONE shared
breadth baseline per task, scored depth-BEST (checkpoint), ranked by paired-
bootstrap lift. This is the fitness function /evolve + meta-harness call.
- evaluateSteerers() exported (returns ranked lift + per-steerer LOSSES = the tasks
where depth-best lost to breadth, with the trajectory — the reflection fuel for a
prompt optimizer). loadTasks/EopsTask exported; import-guarded main() so the gym
client + eval are importable without running.
- bench/steerers/eops-itsm-population.json: 7 diverse trace-analyst steerers from a
designer panel (the GEPA generation-0 population). FIREWALLED (read trace, never
verifiers); {task}/{trace} placeholders substituted per shot.
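The per-shot placeholder substitution could look roughly like the sketch below. The `Steerer` field types and the `renderUserTemplate` helper are assumptions for illustration; the PR only states that a steerer is a `{systemPrompt, userTemplate}` pair with `{task}`/`{trace}` placeholders.

```typescript
// Hypothetical shape: the PR names the fields but not their exact types.
interface Steerer {
  id: string;
  systemPrompt: string;
  userTemplate: string;
}

// Substitute every {task} and {trace} occurrence. Only the task text and the
// trace are passed in, so verifier output cannot leak into the prompt.
function renderUserTemplate(steerer: Steerer, task: string, trace: string): string {
  const values: Record<string, string> = { task, trace };
  // A replacer function is used so that "$" sequences in the task or trace
  // are inserted literally instead of being read as replacement patterns.
  return steerer.userTemplate.replace(
    /\{(task|trace)\}/g,
    (match: string, key: string) => values[key] ?? match,
  );
}
```

Unknown placeholders such as `{other}` are left untouched in this sketch, which makes template typos visible in the rendered prompt.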
Screening (n=12, deepseek-v4-flash) shows a real fitness SLOPE — steerers spread
~15pp and rank sensibly:
checkpoint-restore-and-lock +2.8pp score / +16.7pp resolved (WINNER)
generic (control) -9.0pp
stop-by-default -12.5pp SIGNIF -
The winner reconstructs each field's value-history and RESTORES the overwritten-
correct value — directly attacking the measured degradation failure (+6-8pp). So
the population is mechanistically targeted, not generic; and the landscape has the
gradient a prompt optimizer (GEPA / /evolve) needs. n=24 confirmation in flight.
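For readers unfamiliar with the ranking metric, a paired-bootstrap lift can be sketched as below. This is a generic percentile bootstrap over per-task score differences, not the gate's actual implementation; the `Lift` shape, the 95% interval, and the resample count are assumptions.

```typescript
// Hypothetical result shape; `point` is the mean per-task lift.
interface Lift {
  point: number;
  lo: number;
  hi: number;
  significant: boolean;
}

// depth[i] and breadth[i] must be scores for the SAME task i (the pairing).
function pairedBootstrapLift(
  depth: number[],
  breadth: number[],
  resamples = 2000,
  rng: () => number = Math.random,
): Lift {
  if (depth.length !== breadth.length || depth.length === 0) {
    throw new Error('depth and breadth must be non-empty and the same length');
  }
  const diffs = depth.map((d, i) => d - breadth[i]);
  const n = diffs.length;
  const point = diffs.reduce((a, b) => a + b, 0) / n;

  // Resample tasks with replacement and record the mean difference each time.
  const means: number[] = [];
  for (let b = 0; b < resamples; b++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rng() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  // Percentile interval: 2.5th and 97.5th percentiles of the resampled means.
  const lo = means[Math.floor(0.025 * (resamples - 1))];
  const hi = means[Math.ceil(0.975 * (resamples - 1))];
  return { point, lo, hi, significant: lo > 0 || hi < 0 };
}
```

Because every steerer is compared against one shared breadth baseline per task, the pairing removes between-task variance, which is what lets a small screening run separate steerers at all.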
✅ No Blockers —
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 83 | 83 | 83 |
| Confidence | 70 | 70 | 70 |
| Correctness | 83 | 83 | 83 |
| Security | 83 | 83 | 83 |
| Testing | 83 | 83 | 83 |
| Architecture | 83 | 83 | 83 |
Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.
🟡 LOW Dead import: writeFileSync never used — bench/src/eops-gate.mts
Line 23 adds `writeFileSync` to the `node:fs` import, but it is never called anywhere in the file (confirmed via ripgrep across bench/src/). This violates no-unused-imports and is likely a leftover from a planned output-file feature that wasn't completed in this PR. Fix: remove `writeFileSync` from the import.
🟡 LOW No dedicated tests for exported interfaces and functions — bench/src/eops-gate.mts
The diff exports `EopsTask`, `loadTasks`, `Steerer`, `SteererRank`, `SteererLoss`, `EvalResult`, and `evaluateSteerers`. These are new public API surface with no corresponding test file. The existing `bench/src/benchmarks/enterpriseops-gym.test.mts` tests the adapter, not the gate's steerer sweep. A unit test with mocked `RouterConfig`/`pool` verifying ranked output and loss identification would catch regressions. Low severity because the gate is inherently an integration benchmark, but the exported types invite programmatic consumption (GEPA).
🟡 LOW No runtime validation on STEERERS_FILE JSON parse — bench/src/eops-gate.mts
Line 379: `JSON.parse(readFileSync(process.env.STEERERS_FILE, 'utf8')) as Steerer[]` casts without validation. If the file contains a non-array, or objects missing `id`, the error surfaces deep inside `evaluateSteerers` (likely at `r.perSteerer[st.id]` producing undefined). A Zod/yup guard or even an `Array.isArray` + `.every(o => typeof o.id === 'string')` check at parse time would give an actionable error. Low severity since this is a bench tool run by developers, not production code.
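A dependency-free version of the suggested guard might look like the following sketch. The `parseSteerers` name is hypothetical, and treating `systemPrompt` as required and `userTemplate` as optional is an assumption, not something the diff confirms.

```typescript
// Assumed shape; adjust to the real Steerer type exported by the gate.
interface Steerer {
  id: string;
  systemPrompt: string;
  userTemplate?: string;
}

// Parse and validate the STEERERS_FILE contents, failing early with an
// actionable message instead of producing perSteerer["undefined"] later.
function parseSteerers(raw: string): Steerer[] {
  const data: unknown = JSON.parse(raw);
  if (!Array.isArray(data)) {
    throw new Error('STEERERS_FILE must contain a JSON array of steerers');
  }
  const seen = new Set<string>();
  data.forEach((entry: unknown, i: number) => {
    if (typeof entry !== 'object' || entry === null) {
      throw new Error(`steerer[${i}] is not an object`);
    }
    const fields = entry as Record<string, unknown>;
    const id = fields['id'];
    const systemPrompt = fields['systemPrompt'];
    const userTemplate = fields['userTemplate'];
    if (typeof id !== 'string' || id.length === 0) {
      throw new Error(`steerer[${i}] needs a non-empty string "id"`);
    }
    if (typeof systemPrompt !== 'string') {
      throw new Error(`steerer "${id}" needs a string "systemPrompt"`);
    }
    if (userTemplate !== undefined && typeof userTemplate !== 'string') {
      throw new Error(`steerer "${id}" has a non-string "userTemplate"`);
    }
    if (seen.has(id)) throw new Error(`duplicate steerer id "${id}"`);
    seen.add(id);
  });
  return data as Steerer[];
}
```

The duplicate-id check goes slightly beyond the finding; it is included because results are keyed by `id`, so two steerers sharing an id would overwrite each other.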
🟡 LOW Per-task tool-call counts dropped from stderr observability — bench/src/eops-gate.mts
Old code (removed line ~222) printed `toolcalls=N` per task in the stderr progress line. New code (line 336) drops this field. Tool-call counts per depth arm are still computed inside `runDepthArm` (line 254) but are discarded by `evaluateSteerers` (line 332 doesn't capture them). This is an observability regression for cost/efficiency analysis during long runs. Fix: capture and include tool-call counts in the per-task…
🟡 LOW Unused import: writeFileSync — bench/src/eops-gate.mts
Line 23 imports `writeFileSync` from 'node:fs' but it is never called anywhere in the file. The pre-change version only used `readFileSync`; the diff adds `writeFileSync` to the import but never introduces a call site. Dead import. Remove it.
🟡 LOW losses array computed but not surfaced in CLI output — bench/src/eops-gate.mts
Line 387: `main()` destructures `{ ok, excluded, ranked }` from `evaluateSteerers()`, discarding the `losses` array. The computed losses (line 356-362: tasks where depth-best < breadth, tagged per-steerer with trajectory) are described as 'GEPA's reflection fuel' in the JSDoc (line 300-301) but are never printed to console or written to a file. The data IS available to programmatic consumers of the exported f…
🟡 LOW steerInstruction failure scope: one steerer's LLM error kills entire task row — bench/src/eops-gate.mts
Line 331-332: `runDepthArm` calls `steerInstruction` (line 253), which may invoke `routerChatWithUsage` (LLM call) for steerers with a `userTemplate`. If this LLM call fails (e.g., rate limit despite retries), the uncaught throw propagates through `runDepthArm`, through the per-steerer loop, to the per-task catch at line 338, which marks the ENTIRE task as SKIP (null). This discards the already-computed breadth bas…
🟡 LOW No trailing newline in JSON file — bench/steerers/eops-itsm-population.json
File ends with `}]` and no newline (diff shows `\ No newline at end of file`). This is a POSIX convention with zero functional impact for a JSON config consumed by a parser, but it could trigger `git diff` noise on future edits. Fix: add a trailing newline.
tangletools · 2026-06-09T00:43:42Z · trace
tangletools
left a comment
✅ Approved — 8 non-blocking findings — 7d99440b
Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-09T00:43:42Z · immutable trace
Premise check withheld merge —
…stion)
Under concurrency the gym's SQLite exhausts file handles ('unable to open database
file', HTTP 500) and a third of tasks were dropped from the n=24 confirm run. It's
transient — clears as sibling DBs are deleted. Bounded retry (5x, linear backoff)
so a momentary container limit doesn't bleed data out of the gate (or any optimizer
run built on it). Pair with CONCURRENCY<=3 on the gym.
tangletools
left a comment
✅ Refreshed approval after new commits — 20290a9d
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-09T00:49:45Z
A gpt-4.1 run bled 23/24 tasks to 'fetch failed' — a THROWN fetch (network reset / router throttle / wedged gym under concurrency), which the seed-status retry didn't catch. Wrap the one gym network primitive (postJson) in a bounded retry-with-backoff so transient blips on ANY gym call (seed/tools/verify) don't drop a task.
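The two retry commits above describe the same pattern: a bounded number of attempts with a growing delay around the single gym network primitive. A minimal sketch follows, assuming a `withRetry` helper and a `postJson` signature that are illustrative only; the real `postJson` in the gate may differ, and treating HTTP 5xx as retryable is an assumption.

```typescript
interface RetryOptions {
  attempts?: number; // total tries, including the first one
  baseDelayMs?: number; // linear backoff: base, 2*base, 3*base, ...
  sleep?: (ms: number) => Promise<void>; // injectable so tests run instantly
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Retry `fn` when it throws, up to `attempts` times, then rethrow the last error.
async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions = {}): Promise<T> {
  const { attempts = 5, baseDelayMs = 500, sleep = defaultSleep } = opts;
  let lastError: unknown;
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt < attempts) await sleep(baseDelayMs * attempt);
    }
  }
  throw lastError;
}

// Hypothetical wrapper: a thrown fetch (network reset) and an HTTP 5xx are
// both converted into a retryable failure.
async function postJson(url: string, body: unknown): Promise<unknown> {
  return withRetry(async () => {
    const res = await fetch(url, {
      method: 'POST',
      headers: { 'content-type': 'application/json' },
      body: JSON.stringify(body),
    });
    if (res.status >= 500) throw new Error(`gym returned HTTP ${res.status}`);
    return res.json();
  });
}
```

Retrying a POST is only safe when the call is idempotent or its side effects are cleaned up; for database seeding that caveat is the orphaned-database concern raised later in the review.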
tangletools
left a comment
✅ Refreshed approval after new commits — 05c8eb29
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-09T00:52:24Z
✅ No Blockers —
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 66 | 83 | 66 |
| Confidence | 70 | 70 | 70 |
| Correctness | 66 | 83 | 66 |
| Security | 66 | 83 | 66 |
| Testing | 66 | 83 | 66 |
| Architecture | 66 | 83 | 66 |
Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Unused writeFileSync import — bench/src/eops-gate.mts
Line 23: `import { readFileSync, writeFileSync } from 'node:fs'`. `writeFileSync` is never called anywhere in the file, so it is a dead import and will fail a strict TypeScript `noUnusedLocals` check. Remove `writeFileSync` from the destructured import.
🟠 MEDIUM Unvalidated STEERERS_FILE JSON cast bypasses runtime shape checks — bench/src/eops-gate.mts
Line 399: `JSON.parse(readFileSync(process.env.STEERERS_FILE, 'utf8')) as Steerer[]` performs no runtime validation of the parsed data. A steerer object missing `id` (e.g. `{systemPrompt:"..."}`) creates `perSteerer["undefined"]` keys, silently corrupting the ranked table, loss records, and stderr tags. The companion file `bench/steerers/eops-itsm-population.json` is hand-edited single-line JSON; shape drift is plausible. Validate with a proper schema check or at minimum a runtime assertion that every entry has a non-empty string `id`.
🟡 LOW Dead import: writeFileSync is imported but never used — bench/src/eops-gate.mts
Line 23: `import { readFileSync, writeFileSync } from 'node:fs'`. `writeFileSync` is never called anywhere in the file. The diff adds it to the import but no usage was added. Should be reverted to `import { readFileSync } from 'node:fs'`. Trivial fix.
🟡 LOW Fragile import.meta.url CLI guard vs established fileURLToPath pattern — bench/src/eops-gate.mts
Line 428: ``if (import.meta.url === `file://${process.argv[1]}`)`` uses raw string interpolation. The established pattern in `corpus-replay.mts:298` is `if (argv[1] && fileURLToPath(import.meta.url) === argv[1])`, which correctly handles URL encoding. The interpolation approach breaks on paths with spaces or special characters. Low severity since bench scripts typically run from clean paths, but inconsistent with the codebase standard.
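The difference between the two guards can be seen with a path that contains a space. The sketch below assumes a POSIX-style path; `isMainModule` is a hypothetical helper that mirrors the `fileURLToPath` pattern quoted above.

```typescript
import { fileURLToPath, pathToFileURL } from 'node:url';

// True when the module identified by `metaUrl` is the script Node was launched with.
function isMainModule(metaUrl: string, argv1: string | undefined): boolean {
  return argv1 !== undefined && fileURLToPath(metaUrl) === argv1;
}

// In an entry file this would be used as:
//   if (isMainModule(import.meta.url, process.argv[1])) main();

// Demonstration with a path containing a space (POSIX-style path assumed).
const scriptPath = '/tmp/my bench/eops-gate.mts';
// import.meta.url is a file URL, so the space is percent-encoded as %20.
const metaUrl = pathToFileURL(scriptPath).href;

// Raw interpolation compares the encoded URL against an unencoded string.
const naive = metaUrl === `file://${scriptPath}`;
// Decoding the URL back to a path compares like with like.
const robust = isMainModule(metaUrl, scriptPath);
```

With this input the naive comparison fails because of the `%20`, while the decoded comparison matches, which is the failure mode the finding describes.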
🟡 LOW No unit tests for exported evaluateSteerers / Steerer / EvalResult — bench/src/eops-gate.mts
Three commits refactor the gate into exported, reusable functions (`evaluateSteerers`, `Steerer`, `SteererRank`, `SteererLoss`, `EvalResult`, `loadTasks`, `EopsTask`). No test file exists (no `eops-gate.test.mts`). Other bench adapters (commit0, aec-bench, programbench) all have offline fixture tests that validate `loadTasks`, structure, and output contracts. The steerer sweep logic (loss collection, ranking by `lift.point`, degradation calculation) is testable offline with mocked depth arms. Not blocking, since the gate requires live services for end-to-end runs, but the structural/ranking logic deserves coverage.
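To illustrate what offline coverage of the ranking and loss-collection logic could exercise, here is a sketch of those two steps as pure functions. The field names in `TaskResult` and `SteererLoss` and both helper names are assumptions; the real exported types may be shaped differently.

```typescript
// Hypothetical per-task outcome for one steerer.
interface TaskResult {
  taskId: string;
  breadth: number; // shared breadth-baseline score for the task
  depthBest: number; // best checkpoint score of the depth arm
  trajectory: string[]; // depth-arm trace kept for reflection
}

interface SteererLoss extends TaskResult {
  steererId: string;
}

// A loss is any task where depth-best scored strictly below breadth.
function collectLosses(steererId: string, results: TaskResult[]): SteererLoss[] {
  return results
    .filter((r) => r.depthBest < r.breadth)
    .map((r) => ({ steererId, ...r }));
}

// Rank steerers by point lift, highest first, without mutating the input.
function rankByLift<T extends { lift: { point: number } }>(rows: T[]): T[] {
  return [...rows].sort((a, b) => b.lift.point - a.lift.point);
}
```

A fixture test can feed these functions hand-written scores and assert on order and loss membership, with no gym or router involved.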
🟡 LOW Per-task tool-call count no longer surfaced — bench/src/eops-gate.mts
Lines 341-342: `await runShot(cfg, task, server, dbId, tools, m)` discards the return value (`toolCalls`, `toolTrace`) for breadth shots. Lines 352-353: `runDepthArm` returns `toolCalls` but it is not stored in `perSteerer`. The old code tracked and logged total `acts` per task, which is useful for diagnosing 'many turns but low score' degradation patterns. This is an observability regression. Store `arm.toolCalls` in `perSteerer` (add a `toolCalls` field) and log it alongside breadth scores in stderr output.
🟡 LOW Unvalidated STEERERS_FILE JSON with bare type assertion — bench/src/eops-gate.mts
Line 399: `JSON.parse(readFileSync(process.env.STEERERS_FILE, 'utf8')) as Steerer[]`. The `as Steerer[]` is an unchecked assertion. If the file contains malformed data (e.g., a missing `id` field), runtime will proceed with undefined fields until the first `.id` access at line 352 (`st.id`), which silently produces `undefined` keys in `perSteerer`. Low severity: developer-controlled bench config, but a defensive validation or at minimum a try/catch with a descriptive error would be safer.
🟡 LOW seedDb retries create orphaned databases on the gym server — bench/src/eops-gate.mts
Lines 115-121: Each retry generates a fresh `dbId` via `Math.random()`. If the gym processes the seed-database POST (HTTP 200 with `success: false`, or the response is lost after server-side processing), the created database is orphaned, because the `finally` block in callers only deletes the returned `dbId`. Bounded to max 4 orphans per `seedDb` call, but accumulates under concurrency. The gym's SQLite file-handle exhaustion this retry was added to mitigate may itself be exacerbated by accumulated orphans. Consider seeding with the same deterministic `dbId` on retry so duplicates self-correct, or issuing a best-effort delete on failure paths.
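Both suggested mitigations can be combined: reuse one `dbId` across attempts and issue a best-effort delete after each failed attempt. The sketch below assumes a hypothetical `GymClient` interface with `seed` and `remove` methods; the real gate talks to the gym over HTTP and its function names differ. Backoff between attempts is omitted here for brevity.

```typescript
// Hypothetical client surface, for illustration only.
interface GymClient {
  seed(dbId: string): Promise<{ success: boolean }>;
  remove(dbId: string): Promise<void>;
}

// Seed with ONE id chosen by the caller, so every retry targets the same
// database and a failed attempt can be cleaned up before the next one.
async function seedWithStableId(gym: GymClient, dbId: string, attempts = 5): Promise<string> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const res = await gym.seed(dbId);
      if (res.success) return dbId;
    } catch {
      // The request may still have been processed server-side; fall through.
    }
    // Best-effort delete: ignore errors, since the database may not exist.
    await gym.remove(dbId).catch(() => undefined);
  }
  throw new Error(`seedDb failed after ${attempts} attempts for ${dbId}`);
}
```

Because the caller's `finally` block deletes the same `dbId` it passed in, at most one database per task can outlive a run under this scheme.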
🟡 LOW File missing trailing newline — bench/steerers/eops-itsm-population.json
The file ends without a newline character (29007 bytes, no \n terminator). POSIX/editor convention but no functional impact — JSON.parse handles it correctly. Optional: add newline for consistency with project conventions.
tangletools · 2026-06-09T01:00:23Z · trace
Premise check withheld merge —
What
Turns the EOPS gate into the fitness function for steerer optimization, the front half of the RSI flywheel, ready for `/evolve` or `meta-harness` to drive (instead of a hand-rolled GEPA loop).
- `evaluateSteerers()` (exported): every steerer (`{systemPrompt, userTemplate}` from `STEERERS_FILE`) runs as a depth arm against ONE shared breadth baseline per task, scored depth-BEST (checkpoint, the autopsy-corrected metric), ranked by paired-bootstrap lift. Returns ranked fitness + per-steerer LOSSES (tasks where depth-best lost to breadth, with the trajectory), the reflection fuel a prompt optimizer needs. `loadTasks`/`EopsTask` exported; `main()` import-guarded.
- `bench/steerers/eops-itsm-population.json`: 7 diverse trace-analyst steerers (a designer-panel population = GEPA generation-0). Firewalled (read trace, never verifiers).
- `.evolve/eops-steerer-product-claim.md`: the one-sentence product-value claim + falsifiers that gate optimizer spend (the skill's #1 failure mode is Goodharting a proxy).
Signal: the fitness landscape has a slope (n=12, deepseek-v4-flash)
Steerers spread ~15pp and rank sensibly. The winner reconstructs each field's value-history and restores the overwritten-correct value — directly attacking the measured degradation failure (+6–8pp). So the population is mechanistically targeted, and the landscape has the gradient an optimizer needs. n=24 confirmation in flight.
Why this (not a hand-rolled GEPA)
The steerer prompt is a parameter → `/evolve` territory; the harness architecture (depth/mix/checkpoint/analyst-runtime) is `meta-harness` territory. Both require a stable baseline + a product-value claim first (this PR provides the fitness fn + the claim). Don't burn optimizer compute on a noisy proxy.
Test
typecheck clean; screening n=12 + smoke n=1 ran 0-excluded against the live gym; all 8 arms fire.