
feat(bench): EOPS steerer sweep + population + product-value claim#203

Merged
drewstone merged 3 commits into main from feat/eops-steerer-sweep
Jun 9, 2026

@drewstone

Contributor

What

Turns the EOPS gate into the fitness function for steerer optimization — the front half of the RSI flywheel, ready for /evolve or meta-harness to drive (instead of a hand-rolled GEPA loop).

  • evaluateSteerers() (exported): every steerer ({systemPrompt, userTemplate} from STEERERS_FILE) runs as a depth arm against ONE shared breadth baseline per task, scored depth-BEST (checkpoint, the autopsy-corrected metric), ranked by paired-bootstrap lift. Returns ranked fitness + per-steerer LOSSES (tasks where depth-best lost to breadth, with the trajectory) — the reflection fuel a prompt optimizer needs. loadTasks/EopsTask exported; main() import-guarded.
  • bench/steerers/eops-itsm-population.json: 7 diverse trace-analyst steerers (a designer-panel population = GEPA generation-0). Firewalled (read trace, never verifiers).
  • .evolve/eops-steerer-product-claim.md: the one-sentence product-value claim + falsifiers that gate optimizer spend (the skill's #1 failure mode is Goodharting a proxy).
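
The ranking step described above can be sketched as follows. This is a minimal illustration, not the code in bench/src/eops-gate.mts, which this PR page does not show: the names PairedScore, Lift, pairedBootstrapLift and rankSteerers are hypothetical, and the 2000 resamples, the 95% percentile interval and the seeded PRNG are assumptions chosen for reproducibility.

```typescript
// Hypothetical shapes; the real types in eops-gate.mts are not shown in this PR.
interface PairedScore {
  taskId: string;
  depthBest: number; // best checkpoint score of the depth arm for this task
  breadth: number;   // shared breadth baseline score for the same task
}

interface Lift {
  point: number; // mean of (depthBest - breadth) over tasks
  lo: number;    // lower end of the 95% percentile bootstrap interval
  hi: number;    // upper end of the same interval
}

// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function pairedBootstrapLift(pairs: PairedScore[], resamples = 2000, seed = 1): Lift {
  if (pairs.length === 0) throw new Error("no paired scores");
  // Pairing: each task contributes one difference, so task difficulty cancels out.
  const diffs = pairs.map((p) => p.depthBest - p.breadth);
  const mean = (xs: number[]): number => xs.reduce((s, x) => s + x, 0) / xs.length;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let i = 0; i < resamples; i++) {
    // Resample the per-task differences with replacement.
    const sample = diffs.map(() => diffs[Math.floor(rand() * diffs.length)]);
    means.push(mean(sample));
  }
  means.sort((x, y) => x - y);
  return {
    point: mean(diffs),
    lo: means[Math.floor(0.025 * (resamples - 1))],
    hi: means[Math.floor(0.975 * (resamples - 1))],
  };
}

function rankSteerers(bySteerer: Record<string, PairedScore[]>) {
  return Object.entries(bySteerer)
    .map(([id, pairs]) => ({
      id,
      lift: pairedBootstrapLift(pairs),
      // Losses: tasks where depth-best scored below the breadth baseline.
      losses: pairs.filter((p) => p.depthBest < p.breadth).map((p) => p.taskId),
    }))
    .sort((x, y) => y.lift.point - x.lift.point); // highest lift first
}
```

Under these assumptions, a steerer whose interval lies entirely below zero corresponds to a significant-negative row such as stop-by-default in the table below.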

Signal — the fitness landscape has a slope (n=12, deepseek-v4-flash)

| steerer | depth-best − breadth | resolved Δ |
| --- | --- | --- |
| checkpoint-restore-and-lock | +2.8pp | +16.7pp |
| multi-lens-stop-or-fix | −2.1pp | 0 |
| generic (control) | −9.0pp | −8.3pp |
| stop-by-default | −12.5pp (SIGNIF −) | −8.3pp |

Steerers spread ~15pp and rank sensibly. The winner reconstructs each field's value-history and restores the overwritten-correct value — directly attacking the measured degradation failure (+6–8pp). So the population is mechanistically targeted, and the landscape has the gradient an optimizer needs. n=24 confirmation in flight.

Why this (not a hand-rolled GEPA)

The steerer prompt is a parameter, which is /evolve territory; the harness architecture (depth/mix/checkpoint/analyst-runtime) is meta-harness territory. Both require a stable baseline + a product-value claim first (this PR provides the fitness fn + the claim). Don't burn optimizer compute on a noisy proxy.

Test

typecheck clean; screening n=12 + smoke n=1 ran 0-excluded against the live gym; all 8 arms fire.

…unction

Generalizes the depth arm into a population sweep: every steerer (a {systemPrompt,
userTemplate} read from STEERERS_FILE) runs as a depth arm against ONE shared
breadth baseline per task, scored depth-BEST (checkpoint), ranked by paired-
bootstrap lift. This is the fitness function /evolve + meta-harness call.

- evaluateSteerers() exported (returns ranked lift + per-steerer LOSSES = the tasks
  where depth-best lost to breadth, with the trajectory — the reflection fuel for a
  prompt optimizer). loadTasks/EopsTask exported; import-guarded main() so the gym
  client + eval are importable without running.
- bench/steerers/eops-itsm-population.json: 7 diverse trace-analyst steerers from a
  designer panel (the GEPA generation-0 population). FIREWALLED (read trace, never
  verifiers); {task}/{trace} placeholders substituted per shot.
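
A minimal sketch of the per-shot {task}/{trace} substitution, assuming the template is a plain string and both values are already rendered to text. The function name renderUserTemplate is hypothetical; the actual substitution code is not shown in this PR.

```typescript
// Hypothetical helper: fills {task} and {trace} in a steerer's userTemplate.
function renderUserTemplate(
  template: string,
  vars: { task: string; trace: string },
): string {
  // One pass with a replacer function: text inside the trace that happens to
  // contain "{task}" or "$&" is inserted literally and never re-expanded.
  return template.replace(/\{(task|trace)\}/g, (_match, key: "task" | "trace") => vars[key]);
}
```

A single pass matters here because traces are model output and can contain brace or dollar sequences; chaining two String.replace calls with string arguments would reinterpret them.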

Screening (n=12, deepseek-v4-flash) shows a real fitness SLOPE — steerers spread
~15pp and rank sensibly:
  checkpoint-restore-and-lock  +2.8pp score / +16.7pp resolved  (WINNER)
  generic (control)            -9.0pp
  stop-by-default              -12.5pp  SIGNIF -
The winner reconstructs each field's value-history and RESTORES the overwritten-
correct value — directly attacking the measured degradation failure (+6-8pp). So
the population is mechanistically targeted, not generic; and the landscape has the
gradient a prompt optimizer (GEPA / /evolve) needs. n=24 confirmation in flight.
@tangletools

Contributor

✅ No Blockers — 7d99440b

Readiness 83/100 · Confidence 70/100 · 8 findings (8 low)

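| Metric | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 83 | 83 | 83 |
| Confidence | 70 | 70 | 70 |
| Correctness | 83 | 83 | 83 |
| Security | 83 | 83 | 83 |
| Testing | 83 | 83 | 83 |
| Architecture | 83 | 83 | 83 |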

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

🟡 LOW Dead import: writeFileSync never used — bench/src/eops-gate.mts

Line 23 adds writeFileSync to the node:fs import but it is never called anywhere in the file (confirmed via ripgrep across bench/src/). Violates no-unused-imports. Likely a leftover from a planned output-file feature that wasn't completed in this PR. Fix: remove writeFileSync from the import.

🟡 LOW No dedicated tests for exported interfaces and functions — bench/src/eops-gate.mts

The diff exports EopsTask, loadTasks, Steerer, SteererRank, SteererLoss, EvalResult, and evaluateSteerers. These are new public API surface with no corresponding test file. The existing bench/src/benchmarks/enterpriseops-gym.test.mts tests the adapter, not the gate's steerer sweep. A unit test with mocked RouterConfig/pool verifying ranked output and loss identification would catch regressions. Low severity because the gate is inherently an integration benchmark, but the exported types invite programmatic consumption (GEPA).

🟡 LOW No runtime validation on STEERERS_FILE JSON parse — bench/src/eops-gate.mts

Line 379: JSON.parse(readFileSync(process.env.STEERERS_FILE, 'utf8')) as Steerer[] casts without validation. If the file contains a non-array, or objects missing id, the error surfaces deep inside evaluateSteerers (likely at r.perSteerer[st.id] producing undefined). A Zod/yup guard or even an Array.isArray + .every(o => typeof o.id === 'string') check at parse time would give an actionable error. Low severity since this is a bench tool run by developers, not production code.
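
The guard suggested in this finding could look like the sketch below. It is an assumption-laden illustration rather than the fix that was merged: parseSteerers is a hypothetical name, userTemplate is assumed to be optional, and the duplicate-id check is an extra not requested by the finding.

```typescript
interface Steerer {
  id: string;
  systemPrompt: string;
  userTemplate?: string; // assumed optional
}

// Hypothetical helper: parse and validate the contents of STEERERS_FILE.
function parseSteerers(raw: string, source = "STEERERS_FILE"): Steerer[] {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    throw new Error(`${source}: invalid JSON (${String(err)})`);
  }
  if (!Array.isArray(data)) {
    throw new Error(`${source}: expected a JSON array of steerers`);
  }
  const seen = new Set<string>();
  return data.map((entry: unknown, i: number): Steerer => {
    if (typeof entry !== "object" || entry === null) {
      throw new Error(`${source}[${i}]: expected an object`);
    }
    const { id, systemPrompt, userTemplate } = entry as Record<string, unknown>;
    if (typeof id !== "string" || id.length === 0) {
      throw new Error(`${source}[${i}]: "id" must be a non-empty string`);
    }
    if (seen.has(id)) {
      throw new Error(`${source}[${i}]: duplicate id "${id}"`);
    }
    if (typeof systemPrompt !== "string") {
      throw new Error(`${source}[${i}] (${id}): "systemPrompt" must be a string`);
    }
    if (userTemplate !== undefined && typeof userTemplate !== "string") {
      throw new Error(`${source}[${i}] (${id}): "userTemplate" must be a string when present`);
    }
    seen.add(id);
    return userTemplate === undefined ? { id, systemPrompt } : { id, systemPrompt, userTemplate };
  });
}
```

Calling it as parseSteerers(readFileSync(process.env.STEERERS_FILE, 'utf8')) would replace the bare cast, so a malformed entry fails at load time with its array index instead of surfacing later as an undefined key.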

🟡 LOW Per-task tool-call counts dropped from stderr observability — bench/src/eops-gate.mts

Old code (removed line ~222) printed toolcalls=N per task in the stderr progress line. New code (line 336) drops this field. Tool-call counts per depth arm are still computed inside runDepthArm (line 254) but are discarded by evaluateSteerers (line 332 doesn't capture them). This is an observability regression for cost/efficiency analysis during long runs. Fix: capture and include tool-call counts in the per-task

🟡 LOW Unused import: writeFileSync — bench/src/eops-gate.mts

Line 23 imports writeFileSync from 'node:fs' but it is never called anywhere in the file. The pre-change version only used readFileSync; the diff adds writeFileSync to the import but never introduces a call site. Dead import. Remove it.

🟡 LOW losses array computed but not surfaced in CLI output — bench/src/eops-gate.mts

Line 387: main() destructures { ok, excluded, ranked } from evaluateSteerers(), discarding the losses array. The computed losses (line 356-362: tasks where depth-best < breadth, tagged per-steerer with trajectory) are described as 'GEPA's reflection fuel' in the JSDoc (line 300-301) but are never printed to console or written to a file. The data IS available to programmatic consumers of the exported f

🟡 LOW steerInstruction failure scope: one steerer's LLM error kills entire task row — bench/src/eops-gate.mts

Line 331-332: runDepthArm calls steerInstruction (line 253) which may invoke routerChatWithUsage (LLM call) for steerers with a userTemplate. If this LLM call fails (e.g., rate limit despite retries), the uncaught throw propagates through runDepthArm, through the per-steerer loop, to the per-task catch at line 338, which marks the ENTIRE task as SKIP (null). This discards the already-computed breadth bas

🟡 LOW No trailing newline in JSON file — bench/steerers/eops-itsm-population.json

File ends with }] and no newline (diff shows \ No newline at end of file). POSIX convention. Zero functional impact for a JSON config consumed by a parser, but could trigger git diff noise on future edits. Fix: add trailing newline.


tangletools · 2026-06-09T00:43:42Z · trace

tangletools previously approved these changes Jun 9, 2026

@tangletools tangletools left a comment

Contributor


✅ Approved — 8 non-blocking findings — 7d99440b

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-09T00:43:42Z · immutable trace

@tangletools

Contributor

Premise check withheld merge — 7d99440b

Classifier flagged this PR as a premise claim (numeric pp/% delta + eval terminology). Confidence: high.

Recommend re-running the underlying eval with pairedEvalueSequence before merging.

  • Cited claim: +2.8pp
  • PR body excerpt: feat(bench): EOPS steerer sweep + population + product-value claim

Run:

pnpm eval:evolve --reps 5 --skip-mutation

Classifier rationale: Body cites 9 numeric claim(s) (+2.8pp, +16.7pp, 2.1pp...) and eval-related terms appear in pr_body, review_findings. PR is asserting a measurable result that repair-pr cannot polish away — re-run the underlying evaluation before merging.


tangletools premise check · #203

…stion)

Under concurrency the gym's SQLite exhausts file handles ('unable to open database
file', HTTP 500) and a third of tasks were dropped from the n=24 confirm run. It's
transient — clears as sibling DBs are deleted. Bounded retry (5x, linear backoff)
so a momentary container limit doesn't bleed data out of the gate (or any optimizer
run built on it). Pair with CONCURRENCY<=3 on the gym.
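
A bounded retry with linear backoff, as described above, might be sketched like this. The helper name withRetry, the 500 ms base delay and the injectable sleep are assumptions for illustration; only the 5-attempt bound and the linear schedule come from the commit message.

```typescript
interface RetryOptions {
  attempts?: number;    // total tries, including the first (5 per the commit message)
  baseDelayMs?: number; // assumed value; the wait before retry k is baseDelayMs * k
  sleep?: (ms: number) => Promise<void>; // injectable so tests need not wait
}

// Hypothetical helper: retries an async operation that throws on transient failure.
async function withRetry<T>(
  op: (attempt: number) => Promise<T>,
  opts: RetryOptions = {},
): Promise<T> {
  const attempts = opts.attempts ?? 5;
  const baseDelayMs = opts.baseDelayMs ?? 500;
  const sleep =
    opts.sleep ?? ((ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms)));
  let lastError: unknown = new Error("withRetry: no attempts made");
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      return await op(attempt);
    } catch (err) {
      lastError = err;
      // Linear backoff: 1x, 2x, 3x ... the base delay; no wait after the final try.
      if (attempt < attempts) await sleep(baseDelayMs * attempt);
    }
  }
  throw lastError;
}
```

For this to cover the HTTP 500 case, the wrapped operation has to throw on a non-OK response rather than return it; a thrown network error is caught the same way.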
tangletools previously approved these changes Jun 9, 2026

@tangletools tangletools left a comment

Contributor


✅ Refreshed approval after new commits — 20290a9d

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-09T00:49:45Z

A gpt-4.1 run bled 23/24 tasks to 'fetch failed' — a THROWN fetch (network reset /
router throttle / wedged gym under concurrency), which the seed-status retry didn't
catch. Wrap the one gym network primitive (postJson) in a bounded retry-with-backoff
so transient blips on ANY gym call (seed/tools/verify) don't drop a task.

@tangletools tangletools left a comment

Contributor


✅ Refreshed approval after new commits — 05c8eb29

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-09T00:52:24Z

@tangletools

Contributor

✅ No Blockers — 05c8eb29

Readiness 66/100 · Confidence 70/100 · 9 findings (2 medium, 7 low)

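| Metric | deepseek | glm | aggregate |
| --- | --- | --- | --- |
| Readiness | 66 | 83 | 66 |
| Confidence | 70 | 70 | 70 |
| Correctness | 66 | 83 | 66 |
| Security | 66 | 83 | 66 |
| Testing | 66 | 83 | 66 |
| Architecture | 66 | 83 | 66 |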

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Unused writeFileSync import — bench/src/eops-gate.mts

Line 23: import { readFileSync, writeFileSync } from 'node:fs'. writeFileSync is never called anywhere in the file. Dead import. Will fail strict TypeScript noUnusedLocals lint. Remove writeFileSync from the destructured import.

🟠 MEDIUM Unvalidated STEERERS_FILE JSON cast bypasses runtime shape checks — bench/src/eops-gate.mts

Line 399: JSON.parse(readFileSync(process.env.STEERERS_FILE, 'utf8')) as Steerer[] — no runtime validation of the parsed data. A steerer object missing id (e.g. {systemPrompt:"..."}) creates perSteerer["undefined"] keys, corrupting the ranked table, loss records, and stderr tags silently. The companion file bench/steerers/eops-itsm-population.json is hand-edited single-line JSON; shape drift is plausible. Validate with a proper schema check or at minimum a runtime assertion that every entry has a non-empty string id.

🟡 LOW Dead import: writeFileSync is imported but never used — bench/src/eops-gate.mts

Line 23: import { readFileSync, writeFileSync } from 'node:fs' — writeFileSync is never called anywhere in the file. The diff adds it to the import but no usage was added. Should be reverted to import { readFileSync } from 'node:fs'. Trivial fix.

🟡 LOW Fragile import.meta.url CLI guard vs established fileURLToPath pattern — bench/src/eops-gate.mts

Line 428: if (import.meta.url === `file://${process.argv[1]}`) uses raw string interpolation. The established pattern in corpus-replay.mts:298 is if (argv[1] && fileURLToPath(import.meta.url) === argv[1]), which correctly handles URL encoding. The interpolation approach breaks on paths with spaces or special characters. Low-severity since bench scripts typically run from clean paths, but inconsistent with the codebase standard.
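
The difference between the two guards can be illustrated with a path that contains a space. This is a sketch assuming a POSIX-style path; isMainModule is a hypothetical wrapper around the fileURLToPath comparison quoted in the finding, and the script path is made up.

```typescript
import { fileURLToPath, pathToFileURL } from "node:url";

// Hypothetical wrapper around the fileURLToPath pattern cited in the finding.
function isMainModule(metaUrl: string, argv1: string | undefined): boolean {
  return argv1 !== undefined && fileURLToPath(metaUrl) === argv1;
}

// Illustrative path with a space (POSIX-style; an assumption for this demo).
const scriptPath = "/tmp/bench dir/eops-gate.mjs";
// import.meta.url for that file is percent-encoded: the space becomes %20.
const metaUrl = pathToFileURL(scriptPath).href;

// Raw interpolation compares an encoded URL against an unencoded one.
const naiveGuard = metaUrl === `file://${scriptPath}`;
// Decoding the URL back to a path makes the comparison line up.
const robustGuard = isMainModule(metaUrl, scriptPath);

console.log({ metaUrl, naiveGuard, robustGuard });
```

With this input the interpolated comparison is false while the decoded comparison is true, which is the failure mode the finding describes.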

🟡 LOW No unit tests for exported evaluateSteerers / Steerer / EvalResult — bench/src/eops-gate.mts

Three commits refactor the gate into exported, reusable functions (evaluateSteerers, Steerer, SteererRank, SteererLoss, EvalResult, loadTasks, EopsTask). No test file exists (no eops-gate.test.mts). Other bench adapters (commit0, aec-bench, programbench) all have offline fixture tests that validate loadTasks, structure, and output contracts. The steerer sweep logic (loss collection, ranking by lift.point, degradation calculation) is testable offline with mocked depth arms. Not blocking — the gate requires live services for end-to-end — but the structural/ranking logic deserves coverage.

🟡 LOW Per-task tool-call count no longer surfaced — bench/src/eops-gate.mts

Lines 341-342: await runShot(cfg, task, server, dbId, tools, m) discards the return value (toolCalls, toolTrace) for breadth shots. Lines 352-353: runDepthArm returns toolCalls but it is not stored in perSteerer. The old code tracked and logged total acts per task — useful for diagnosing 'many turns but low score' degradation patterns. This is an observability regression. Store arm.toolCalls in perSteerer (add a toolCalls field) and log it alongside breadth scores in stderr output.

🟡 LOW Unvalidated STEERERS_FILE JSON with bare type assertion — bench/src/eops-gate.mts

Line 399: JSON.parse(readFileSync(process.env.STEERERS_FILE, 'utf8')) as Steerer[] — the as Steerer[] is an unchecked assertion. If the file contains malformed data (e.g., missing id field), runtime will proceed with undefined fields until the first .id access at line 352 (st.id), which silently produces undefined keys in perSteerer. Low severity: developer-controlled bench config, but a defensive validation or at minimum a try/catch with a descriptive error would be safer.

🟡 LOW seedDb retries create orphaned databases on the gym server — bench/src/eops-gate.mts

Lines 115-121: Each retry generates a fresh dbId via Math.random(). If the gym processes the seed-database POST (HTTP 200 with success: false, or the response is lost after server-side processing), the created database is orphaned — the finally block in callers only deletes the returned dbId. Bounded to max 4 orphans per seedDb call, but accumulates under concurrency. The gym's SQLite file-handle exhaustion this retry was added to mitigate may itself be exacerbated by accumulated orphans. Consider seeding with the same deterministic dbId on retry so duplicates self-correct, or issuing a best-effort delete on failure paths.
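
One way to read the suggestion in this finding is to keep a single dbId across retries and issue a best-effort delete before each retry, sketched below. The GymClient interface, its seed and drop methods, and seedDbStable are all hypothetical stand-ins, since the gym client's real API is not shown in this PR; backoff between attempts is omitted for brevity.

```typescript
// Hypothetical client surface; the real gym client API is not shown in this PR.
interface GymClient {
  seed(dbId: string): Promise<{ success: boolean }>;
  drop(dbId: string): Promise<void>;
}

// Retries seeding with ONE stable dbId so a half-created database from a failed
// attempt is reused or cleaned up instead of being orphaned under a fresh id.
async function seedDbStable(gym: GymClient, dbId: string, attempts = 5): Promise<string> {
  let lastError: unknown = new Error("seed failed");
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      const res = await gym.seed(dbId);
      if (res.success) return dbId;
      lastError = new Error(`seed reported success=false for ${dbId}`);
    } catch (err) {
      lastError = err; // thrown fetch or lost response
    }
    // Best-effort cleanup: a failed delete must not mask the seed error.
    await gym.drop(dbId).catch(() => undefined);
  }
  throw lastError;
}
```

Because callers only ever see one id, their existing finally-block delete covers every database this function may have created.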

🟡 LOW File missing trailing newline — bench/steerers/eops-itsm-population.json

The file ends without a newline character (29007 bytes, no \n terminator). POSIX/editor convention but no functional impact — JSON.parse handles it correctly. Optional: add newline for consistency with project conventions.


tangletools · 2026-06-09T01:00:23Z · trace

@tangletools

Contributor

Premise check withheld merge — 05c8eb29

Classifier flagged this PR as a premise claim (numeric pp/% delta + eval terminology). Confidence: high.

Recommend re-running the underlying eval with pairedEvalueSequence before merging.

  • Cited claim: +2.8pp
  • PR body excerpt: feat(bench): EOPS steerer sweep + population + product-value claim

Run:

pnpm eval:evolve --reps 5 --skip-mutation

Classifier rationale: Body cites 9 numeric claim(s) (+2.8pp, +16.7pp, 2.1pp...) and eval-related terms appear in pr_body, review_findings. PR is asserting a measurable result that repair-pr cannot polish away — re-run the underlying evaluation before merging.


tangletools premise check · #203

@drewstone drewstone merged commit c831084 into main Jun 9, 2026
1 check passed
@drewstone drewstone deleted the feat/eops-steerer-sweep branch June 9, 2026 10:43

2 participants