feat(bench): EOPS steerer sweep + population + product-value claim by drewstone · Pull Request #203 · tangle-network/agent-runtime

drewstone · 2026-06-09T00:33:12Z

What

Turns the EOPS gate into the fitness function for steerer optimization — the front half of the RSI flywheel, ready for /evolve or meta-harness to drive (instead of a hand-rolled GEPA loop).

- evaluateSteerers() (exported): every steerer ({systemPrompt, userTemplate} from STEERERS_FILE) runs as a depth arm against ONE shared breadth baseline per task, scored depth-BEST (checkpoint, the autopsy-corrected metric), and ranked by paired-bootstrap lift. Returns ranked fitness plus per-steerer LOSSES (tasks where depth-best lost to breadth, with the trajectory), which is the reflection fuel a prompt optimizer needs. loadTasks/EopsTask are exported; main() is import-guarded.
- bench/steerers/eops-itsm-population.json: 7 diverse trace-analyst steerers (a designer-panel population = GEPA generation-0). Firewalled: they read the trace, never the verifiers.
- .evolve/eops-steerer-product-claim.md: the one-sentence product-value claim + falsifiers that gate optimizer spend (the skill's #1 failure mode is Goodharting a proxy).
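For readers unfamiliar with the ranking criterion, here is a minimal sketch of a paired-bootstrap lift. It is an illustration only, not the code in bench/src/eops-gate.mts: the name pairedBootstrapLift, the seeded mulberry32 PRNG, the 95% percentile interval, and the sample scores are all assumptions made for the example.

```typescript
// Hypothetical sketch of a paired-bootstrap lift. Names, defaults, and the
// interval choice are illustrative assumptions, not the gate's implementation.
interface Lift {
  point: number; // mean of (depth - breadth) over tasks
  lo: number;    // lower bound of the percentile interval
  hi: number;    // upper bound of the percentile interval
}

// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function pairedBootstrapLift(
  depth: number[],
  breadth: number[],
  resamples = 2000,
  seed = 1,
): Lift {
  if (depth.length !== breadth.length || depth.length === 0) {
    throw new Error("depth and breadth must be non-empty and the same length");
  }
  // Pairing: each task contributes one difference, so task difficulty cancels.
  const diffs = depth.map((d, i) => d - breadth[i]);
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    // Resample tasks with replacement, keeping each (depth, breadth) pair intact.
    for (let i = 0; i < diffs.length; i++) {
      sum += diffs[Math.floor(rand() * diffs.length)];
    }
    means.push(sum / diffs.length);
  }
  means.sort((a, b) => a - b);
  return {
    point: diffs.reduce((s, x) => s + x, 0) / diffs.length,
    lo: means[Math.floor(0.025 * (resamples - 1))],
    hi: means[Math.floor(0.975 * (resamples - 1))],
  };
}

// Made-up scores for four tasks, purely for illustration.
const lift = pairedBootstrapLift([0.8, 0.6, 0.9, 0.7], [0.7, 0.6, 0.8, 0.5]);
console.log(lift.point.toFixed(3), lift.lo.toFixed(3), lift.hi.toFixed(3));
```

Under this sketch, a steerer would count as significantly negative when the whole interval sits below zero, which is one plausible reading of the "SIGNIF −" tag in the table below.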

Signal — the fitness landscape has a slope (n=12, deepseek-v4-flash)

steerer	depth-best − breadth	resolved Δ
checkpoint-restore-and-lock	+2.8pp	+16.7pp
multi-lens-stop-or-fix	−2.1pp	0
generic (control)	−9.0pp	−8.3pp
stop-by-default	−12.5pp (SIGNIF −)	−8.3pp

Steerers spread ~15pp and rank sensibly. The winner reconstructs each field's value-history and restores the overwritten-correct value — directly attacking the measured degradation failure (+6–8pp). So the population is mechanistically targeted, and the landscape has the gradient an optimizer needs. n=24 confirmation in flight.

Why this (not a hand-rolled GEPA)

The steerer prompt is a parameter → /evolve territory; the harness architecture (depth/mix/checkpoint/analyst-runtime) is meta-harness territory. Both require a stable baseline + a product-value claim first (this PR provides the fitness fn + the claim). Don't burn optimizer compute on a noisy proxy.

Test

typecheck clean; screening n=12 + smoke n=1 ran 0-excluded against the live gym; all 8 arms fire.

…unction

Generalizes the depth arm into a population sweep: every steerer (a {systemPrompt, userTemplate} read from STEERERS_FILE) runs as a depth arm against ONE shared breadth baseline per task, scored depth-BEST (checkpoint), ranked by paired-bootstrap lift. This is the fitness function /evolve + meta-harness call.

- evaluateSteerers() exported (returns ranked lift + per-steerer LOSSES = the tasks where depth-best lost to breadth, with the trajectory; this is the reflection fuel for a prompt optimizer). loadTasks/EopsTask exported; import-guarded main() so the gym client + eval are importable without running.
- bench/steerers/eops-itsm-population.json: 7 diverse trace-analyst steerers from a designer panel (the GEPA generation-0 population). FIREWALLED (read trace, never verifiers); {task}/{trace} placeholders substituted per shot.

Screening (n=12, deepseek-v4-flash) shows a real fitness SLOPE. Steerers spread ~15pp and rank sensibly:

checkpoint-restore-and-lock	+2.8pp score / +16.7pp resolved (WINNER)
generic (control)	-9.0pp
stop-by-default	-12.5pp SIGNIF -

The winner reconstructs each field's value-history and RESTORES the overwritten-correct value, directly attacking the measured degradation failure (+6-8pp). So the population is mechanistically targeted, not generic; and the landscape has the gradient a prompt optimizer (GEPA / /evolve) needs. n=24 confirmation in flight.

tangletools · 2026-06-09T00:43:44Z

✅ No Blockers — `7d99440b`

Readiness 83/100 · Confidence 70/100 · 8 findings (8 low)

	deepseek	glm	aggregate
Readiness	83	83	83
Confidence	70	70	70
Correctness	83	83	83
Security	83	83	83
Testing	83	83	83
Architecture	83	83	83

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

🟡 LOW Dead import: writeFileSync never used — bench/src/eops-gate.mts

Line 23 adds writeFileSync to the node:fs import but it is never called anywhere in the file (confirmed via ripgrep across bench/src/). Violates no-unused-imports. Likely a leftover from a planned output-file feature that wasn't completed in this PR. Fix: remove writeFileSync from the import.

🟡 LOW No dedicated tests for exported interfaces and functions — bench/src/eops-gate.mts

The diff exports EopsTask, loadTasks, Steerer, SteererRank, SteererLoss, EvalResult, and evaluateSteerers. These are new public API surface with no corresponding test file. The existing bench/src/benchmarks/enterpriseops-gym.test.mts tests the adapter, not the gate's steerer sweep. A unit test with mocked RouterConfig/pool verifying ranked output and loss identification would catch regressions. Low severity because the gate is inherently an integration benchmark, but the exported types invite programmatic consumption (GEPA).

🟡 LOW No runtime validation on STEERERS_FILE JSON parse — bench/src/eops-gate.mts

Line 379: JSON.parse(readFileSync(process.env.STEERERS_FILE, 'utf8')) as Steerer[] casts without validation. If the file contains a non-array, or objects missing id, the error surfaces deep inside evaluateSteerers (likely at r.perSteerer[st.id] producing undefined). A Zod/yup guard or even an Array.isArray + .every(o => typeof o.id === 'string') check at parse time would give an actionable error. Low severity since this is a bench tool run by developers, not production code.
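A minimal sketch of the kind of parse-time guard this finding suggests, using only built-in checks (no Zod/yup). The Steerer shape below is inferred from the PR description ({systemPrompt, userTemplate} plus the id the findings mention) and may not match the real interface in eops-gate.mts; parseSteerers is a hypothetical helper name.

```typescript
// Hypothetical runtime guard for the steerers file. The Steerer shape is an
// assumption inferred from the PR text, not copied from eops-gate.mts.
interface Steerer {
  id: string;
  systemPrompt: string;
  userTemplate?: string;
}

function parseSteerers(raw: string): Steerer[] {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    throw new Error(`STEERERS_FILE is not valid JSON: ${(err as Error).message}`);
  }
  if (!Array.isArray(data)) {
    throw new Error("STEERERS_FILE must contain a JSON array of steerers");
  }
  const seen = new Set<string>();
  return data.map((entry: unknown, i: number): Steerer => {
    if (typeof entry !== "object" || entry === null) {
      throw new Error(`steerer[${i}] must be an object`);
    }
    const o = entry as Record<string, unknown>;
    const id = o["id"];
    const systemPrompt = o["systemPrompt"];
    const userTemplate = o["userTemplate"];
    if (typeof id !== "string" || id.length === 0) {
      throw new Error(`steerer[${i}] needs a non-empty string id`);
    }
    // Duplicate ids would collide as keys in a per-steerer result map.
    if (seen.has(id)) throw new Error(`steerer[${i}] has duplicate id "${id}"`);
    seen.add(id);
    if (typeof systemPrompt !== "string") {
      throw new Error(`steerer "${id}" needs a string systemPrompt`);
    }
    if (userTemplate !== undefined && typeof userTemplate !== "string") {
      throw new Error(`steerer "${id}" has a non-string userTemplate`);
    }
    return typeof userTemplate === "string"
      ? { id, systemPrompt, userTemplate }
      : { id, systemPrompt };
  });
}

// Usage with an inline document instead of a file read.
const steerers = parseSteerers('[{"id":"generic","systemPrompt":"Analyze the trace."}]');
console.log(steerers.map((s) => s.id));
```

The point of failing at parse time is that the error names the offending entry, rather than surfacing later as an "undefined" key in the ranked table.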

🟡 LOW Per-task tool-call counts dropped from stderr observability — bench/src/eops-gate.mts

Old code (removed line ~222) printed toolcalls=N per task in the stderr progress line. New code (line 336) drops this field. Tool-call counts per depth arm are still computed inside runDepthArm (line 254) but are discarded by evaluateSteerers (line 332 doesn't capture them). This is an observability regression for cost/efficiency analysis during long runs. Fix: capture and include tool-call counts in the per-task

🟡 LOW Unused import: writeFileSync — bench/src/eops-gate.mts

Line 23 imports writeFileSync from 'node:fs' but it is never called anywhere in the file. The pre-change version only used readFileSync; the diff adds writeFileSync to the import but never introduces a call site. Dead import. Remove it.

🟡 LOW losses array computed but not surfaced in CLI output — bench/src/eops-gate.mts

Line 387: main() destructures { ok, excluded, ranked } from evaluateSteerers(), discarding the losses array. The computed losses (line 356-362: tasks where depth-best < breadth, tagged per-steerer with trajectory) are described as 'GEPA's reflection fuel' in the JSDoc (line 300-301) but are never printed to console or written to a file. The data IS available to programmatic consumers of the exported f

🟡 LOW steerInstruction failure scope: one steerer's LLM error kills entire task row — bench/src/eops-gate.mts

Line 331-332: runDepthArm calls steerInstruction (line 253) which may invoke routerChatWithUsage (LLM call) for steerers with a userTemplate. If this LLM call fails (e.g., rate limit despite retries), the uncaught throw propagates through runDepthArm, through the per-steerer loop, to the per-task catch at line 338, which marks the ENTIRE task as SKIP (null). This discards the already-computed breadth bas

🟡 LOW No trailing newline in JSON file — bench/steerers/eops-itsm-population.json

File ends with }] and no newline (diff shows \ No newline at end of file). POSIX convention. Zero functional impact for a JSON config consumed by a parser, but could trigger git diff noise on future edits. Fix: add trailing newline.

_{tangletools · 2026-06-09T00:43:42Z · trace}

tangletools

✅ Approved — 8 non-blocking findings — `7d99440b`

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary

_{tangletools · 2026-06-09T00:43:42Z · immutable trace}

tangletools · 2026-06-09T00:43:49Z

Premise check withheld merge — `7d99440b`

Classifier flagged this PR as a premise claim (numeric pp/% delta + eval terminology). Confidence: high.

Recommend re-running the underlying eval with pairedEvalueSequence before merging.

Cited claim: +2.8pp
PR body excerpt: feat(bench): EOPS steerer sweep + population + product-value claim

Run:

pnpm eval:evolve --reps 5 --skip-mutation

Classifier rationale: Body cites 9 numeric claim(s) (+2.8pp, +16.7pp, 2.1pp...) and eval-related terms appear in pr_body, review_findings. PR is asserting a measurable result that repair-pr cannot polish away — re-run the underlying evaluation before merging.

_{tangletools premise check · #203}

…stion)

Under concurrency the gym's SQLite exhausts file handles ('unable to open database file', HTTP 500), and a third of the tasks were dropped from the n=24 confirm run. The failure is transient: it clears as sibling DBs are deleted. Add a bounded retry (5x, linear backoff) so a momentary container limit doesn't bleed data out of the gate (or any optimizer run built on it). Pair with CONCURRENCY<=3 on the gym.

tangletools

✅ Refreshed approval after new commits — `20290a9d`

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-09T00:49:45Z}

A gpt-4.1 run bled 23/24 tasks to 'fetch failed' — a THROWN fetch (network reset / router throttle / wedged gym under concurrency), which the seed-status retry didn't catch. Wrap the one gym network primitive (postJson) in a bounded retry-with-backoff so transient blips on ANY gym call (seed/tools/verify) don't drop a task.
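The two commit messages above both describe a bounded retry with backoff. A minimal sketch of that pattern follows, assuming a generic async operation; withRetry, RetryOptions, and the injectable sleep are hypothetical names for illustration, and the actual postJson wrapper in the PR may be structured differently.

```typescript
// Hypothetical bounded retry with linear backoff. Names and option shape are
// illustrative assumptions, not the PR's postJson wrapper.
interface RetryOptions {
  attempts: number;    // total tries, including the first
  baseDelayMs: number; // wait before retry k is baseDelayMs * k (linear)
  sleep?: (ms: number) => Promise<void>; // injectable so tests need no real waiting
}

const defaultSleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: (attempt: number) => Promise<T>,
  opts: RetryOptions,
): Promise<T> {
  if (opts.attempts < 1) throw new RangeError("attempts must be at least 1");
  const sleep = opts.sleep ?? defaultSleep;
  let lastError: unknown;
  for (let attempt = 1; attempt <= opts.attempts; attempt++) {
    try {
      // Both thrown errors (e.g. 'fetch failed') and rejections land in catch.
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      // Back off only if another attempt remains.
      if (attempt < opts.attempts) await sleep(opts.baseDelayMs * attempt);
    }
  }
  // Bounded: after the final attempt the last error is surfaced to the caller.
  throw lastError;
}

// Usage: a stand-in for a flaky network call that fails twice, then succeeds.
let calls = 0;
const flaky = async (): Promise<string> => {
  calls += 1;
  if (calls < 3) throw new Error("fetch failed");
  return "ok";
};
withRetry(flaky, { attempts: 5, baseDelayMs: 10 }).then((value) => {
  console.log(value, calls);
});
```

Wrapping the single network primitive, as the commit describes, means every gym call inherits the same bound without each call site needing its own loop.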

tangletools

✅ Refreshed approval after new commits — `05c8eb29`

A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

_{tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-09T00:52:24Z}

tangletools · 2026-06-09T01:00:25Z

✅ No Blockers — `05c8eb29`

Readiness 66/100 · Confidence 70/100 · 9 findings (2 medium, 7 low)

	deepseek	glm	aggregate
Readiness	66	83	66
Confidence	70	70	70
Correctness	66	83	66
Security	66	83	66
Testing	66	83	66
Architecture	66	83	66

Full multi-shot audit completed 2/2 planned shots over 2 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Unused writeFileSync import — bench/src/eops-gate.mts

Line 23: import { readFileSync, writeFileSync } from 'node:fs' — writeFileSync is never called anywhere in the file. Dead import. Will fail strict TypeScript noUnusedLocals lint. Remove writeFileSync from the destructured import.

🟠 MEDIUM Unvalidated STEERERS_FILE JSON cast bypasses runtime shape checks — bench/src/eops-gate.mts

Line 399: JSON.parse(readFileSync(process.env.STEERERS_FILE, 'utf8')) as Steerer[] — no runtime validation of the parsed data. A steerer object missing id (e.g. {systemPrompt:"..."}) creates perSteerer["undefined"] keys, corrupting the ranked table, loss records, and stderr tags silently. The companion file bench/steerers/eops-itsm-population.json is hand-edited single-line JSON; shape drift is plausible. Validate with a proper schema check or at minimum a runtime assertion that every entry has a non-empty string id.

🟡 LOW Dead import: writeFileSync is imported but never used — bench/src/eops-gate.mts

Line 23: import { readFileSync, writeFileSync } from 'node:fs' — writeFileSync is never called anywhere in the file. The diff adds it to the import but no usage was added. Should be reverted to import { readFileSync } from 'node:fs'. Trivial fix.

🟡 LOW Fragile import.meta.url CLI guard vs established fileURLToPath pattern — bench/src/eops-gate.mts

Line 428: if (import.meta.url === `file://${process.argv[1]}`) uses raw string interpolation. The established pattern in corpus-replay.mts:298 is if (argv[1] && fileURLToPath(import.meta.url) === argv[1]), which correctly handles URL encoding. The interpolation approach breaks on paths with spaces or special characters. Low-severity since bench scripts typically run from clean paths, but inconsistent with the codebase standard.
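To make the failure mode concrete, here is a small sketch comparing the two guards on a path containing a space. It assumes a POSIX-style path; isMainModule is a hypothetical helper name, and the sample path is made up. In a real module the call would be isMainModule(import.meta.url, process.argv[1]).

```typescript
import { fileURLToPath, pathToFileURL } from "node:url";

// Hypothetical helper: true when the module at moduleUrl is the script Node
// was asked to run. Decoding the URL first handles spaces and other escapes.
function isMainModule(moduleUrl: string, argv1: string | undefined): boolean {
  return argv1 !== undefined && fileURLToPath(moduleUrl) === argv1;
}

// A made-up script path with a space, and the file URL Node would report for it.
const scriptPath = "/tmp/my bench/eops-gate.mts";
const moduleUrl = pathToFileURL(scriptPath).href;

// Raw interpolation compares against an unencoded string, so it misses.
const rawGuardMatches = moduleUrl === `file://${scriptPath}`;
// The decoded comparison matches regardless of percent-encoding.
const decodedGuardMatches = isMainModule(moduleUrl, scriptPath);

console.log(moduleUrl, rawGuardMatches, decodedGuardMatches);
```

With the raw guard, main() would silently not run when the script lives under such a path, which is harder to diagnose than a thrown error.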

🟡 LOW No unit tests for exported evaluateSteerers / Steerer / EvalResult — bench/src/eops-gate.mts

Three commits refactor the gate into exported, reusable functions (evaluateSteerers, Steerer, SteererRank, SteererLoss, EvalResult, loadTasks, EopsTask). No test file exists (no eops-gate.test.mts). Other bench adapters (commit0, aec-bench, programbench) all have offline fixture tests that validate loadTasks, structure, and output contracts. The steerer sweep logic (loss collection, ranking by lift.point, degradation calculation) is testable offline with mocked depth arms. Not blocking — the gate requires live services for end-to-end — but the structural/ranking logic deserves coverage.

🟡 LOW Per-task tool-call count no longer surfaced — bench/src/eops-gate.mts

Lines 341-342: await runShot(cfg, task, server, dbId, tools, m) discards the return value (toolCalls, toolTrace) for breadth shots. Lines 352-353: runDepthArm returns toolCalls but it is not stored in perSteerer. The old code tracked and logged total acts per task — useful for diagnosing 'many turns but low score' degradation patterns. This is an observability regression. Store arm.toolCalls in perSteerer (add a toolCalls field) and log it alongside breadth scores in stderr output.

🟡 LOW Unvalidated STEERERS_FILE JSON with bare type assertion — bench/src/eops-gate.mts

Line 399: JSON.parse(readFileSync(process.env.STEERERS_FILE, 'utf8')) as Steerer[] — the as Steerer[] is an unchecked assertion. If the file contains malformed data (e.g., missing id field), runtime will proceed with undefined fields until the first .id access at line 352 (st.id), which silently produces undefined keys in perSteerer. Low severity: developer-controlled bench config, but a defensive validation or at minimum a try/catch with a descriptive error would be safer.

🟡 LOW seedDb retries create orphaned databases on the gym server — bench/src/eops-gate.mts

Lines 115-121: Each retry generates a fresh dbId via Math.random(). If the gym processes the seed-database POST (HTTP 200 with success: false, or the response is lost after server-side processing), the created database is orphaned — the finally block in callers only deletes the returned dbId. Bounded to max 4 orphans per seedDb call, but accumulates under concurrency. The gym's SQLite file-handle exhaustion this retry was added to mitigate may itself be exacerbated by accumulated orphans. Consider seeding with the same deterministic dbId on retry so duplicates self-correct, or issuing a best-effort delete on failure paths.
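One way to read this suggestion is sketched below: pick the dbId once, reuse it across retries, and issue a best-effort delete after each failed attempt. The Gym interface, seedWithStableId, and the in-memory fake are all hypothetical stand-ins for illustration; the real gym client and its endpoints are not shown in this thread.

```typescript
// Hypothetical gym client surface, assumed for this sketch only.
interface Gym {
  seed(dbId: string): Promise<boolean>; // resolves false when the gym reports success: false
  remove(dbId: string): Promise<void>;
}

async function seedWithStableId(gym: Gym, dbId: string, attempts: number): Promise<string> {
  for (let attempt = 1; attempt <= attempts; attempt++) {
    try {
      if (await gym.seed(dbId)) return dbId;
    } catch {
      // A thrown error is treated like a failed seed; the response may have
      // been lost after the server already created the database.
    }
    // Best-effort cleanup so a half-created database does not linger.
    await gym.remove(dbId).catch(() => undefined);
  }
  throw new Error(`seed failed after ${attempts} attempts for ${dbId}`);
}

// In-memory fake: the first failFirst seeds create the database but report failure.
function fakeGym(failFirst: number): { gym: Gym; live: Set<string> } {
  const live = new Set<string>();
  let calls = 0;
  const gym: Gym = {
    async seed(dbId: string): Promise<boolean> {
      calls += 1;
      live.add(dbId);
      return calls > failFirst;
    },
    async remove(dbId: string): Promise<void> {
      live.delete(dbId);
    },
  };
  return { gym, live };
}

const demo = fakeGym(2);
seedWithStableId(demo.gym, "db-demo", 5).then((id) => {
  console.log(id, demo.live.size);
});
```

Because the id never changes, the caller's existing cleanup of the returned dbId covers every database this call could have created.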

🟡 LOW File missing trailing newline — bench/steerers/eops-itsm-population.json

The file ends without a newline character (29007 bytes, no \n terminator). POSIX/editor convention but no functional impact — JSON.parse handles it correctly. Optional: add newline for consistency with project conventions.

_{tangletools · 2026-06-09T01:00:23Z · trace}

tangletools · 2026-06-09T01:00:29Z

Premise check withheld merge — `05c8eb29`

Classifier flagged this PR as a premise claim (numeric pp/% delta + eval terminology). Confidence: high.

Recommend re-running the underlying eval with pairedEvalueSequence before merging.

Cited claim: +2.8pp
PR body excerpt: feat(bench): EOPS steerer sweep + population + product-value claim

Run:

pnpm eval:evolve --reps 5 --skip-mutation

Classifier rationale: Body cites 9 numeric claim(s) (+2.8pp, +16.7pp, 2.1pp...) and eval-related terms appear in pr_body, review_findings. PR is asserting a measurable result that repair-pr cannot polish away — re-run the underlying evaluation before merging.

_{tangletools premise check · #203}

tangletools previously approved these changes Jun 9, 2026

View reviewed changes

drewstone dismissed tangletools’s stale review via 20290a9 June 9, 2026 00:49

tangletools previously approved these changes Jun 9, 2026

View reviewed changes

drewstone dismissed tangletools’s stale review via 05c8eb2 June 9, 2026 00:52

tangletools approved these changes Jun 9, 2026

View reviewed changes

drewstone merged commit c831084 into main Jun 9, 2026
1 check passed

drewstone deleted the feat/eops-steerer-sweep branch June 9, 2026 10:43

drewstone merged 3 commits into main from feat/eops-steerer-sweep
