feat(bench): GEPA over the analyst/steerer prompt (canonical stack, real agent-eval primitives) #205
The analyst IS the steerer (observe()'s findings → recommended_action → the depth steer), so optimizing the analyst prompt optimizes the loop. This evolves it with agent-eval's REAL GEPA primitives (buildReflectionPrompt + parseReflectionResponse + paretoFrontier), with no hand-rolled optimizer. There is no turnkey runPromptEvolution in agent-eval 0.83, only the primitives, so the population loop is thin orchestration over them.
- observe(): adds an analystInstruction? override (the analyst prompt is now the GEPA knob); defaultAnalystInstruction is exported. The firewall stays structural (the input has no score).
- agentic.ts: AgenticOptions.analystInstruction threads into the depth steerer.
- eops-gepa.mts: FITNESS = depth-vs-breadth lift on the canonical Supervisor+observe gate; breadth is computed ONCE per task (a shared baseline, which is correct and halves cost); the failing per-task lifts are the reflection gradient. Seeds = observe()'s PROVEN default (the +16.4pp instruction) FIRST, then the designer-panel population.
Smoke (N=2, 1 gen) validated the full loop: score → paretoFrontier select → reflect → mutate → re-score → pick. Bounded real run (N=6, 2 gens) in flight.
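The selection step over two objectives (maximize lift, minimize cost) can be pictured with a small stand-in. This is a sketch only: the `Candidate` shape, the `frontier` helper, and the numbers are made up for illustration and are not agent-eval's `paretoFrontier` signature.

```typescript
// Hypothetical candidate shape; agent-eval's real types may differ.
interface Candidate {
  id: string
  lift: number // depth-vs-breadth lift, higher is better
  cost: number // e.g. tokens or dollars, lower is better
}

// a dominates b when a is no worse on both objectives and strictly better on one.
function dominates(a: Candidate, b: Candidate): boolean {
  const noWorse = a.lift >= b.lift && a.cost <= b.cost
  const strictlyBetter = a.lift > b.lift || a.cost < b.cost
  return noWorse && strictlyBetter
}

// Stand-in for a Pareto selection: keep every candidate no other candidate dominates.
function frontier(pop: Candidate[]): Candidate[] {
  return pop.filter((c) => !pop.some((other) => other !== c && dominates(other, c)))
}

// Illustrative numbers only.
const population: Candidate[] = [
  { id: 'seed-default', lift: 0.2, cost: 10 },
  { id: 'cheap', lift: 0.05, cost: 4 },
  { id: 'dominated', lift: 0.04, cost: 12 },
]
const survivors = frontier(population).map((c) => c.id)
console.log(survivors)
```

Both `seed-default` and `cheap` survive because neither beats the other on both axes, which is why a frontier (rather than a single argmax) feeds the reflection step.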
… tasks)
The first real run died when the (long-lived) gym container wedged: breadth baselines returned 0%, then runAgentic threw 'every rollout went down', killing the whole GEPA run. runAgentic is fail-loud; the GEPA loop now catches per-task: a task whose rollouts fail is SKIPPED (not fatal), both in the breadth precompute and the depth fitness. It fails loud only if <2 tasks survive (genuine infra-down). Pair with a fresh gym container + WIDTH<=2.
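The skip-not-fatal rule described above might look roughly like the following. It is a minimal sketch under stated assumptions: `scoreSurviving` and `scoreTask` are hypothetical names, and the scoring call is synchronous here only to keep the example short (the real rollouts are async).

```typescript
interface TaskScore {
  taskId: string
  score: number
}

// Score each task, skipping those whose rollouts throw. Fail loud only when
// fewer than `minSurvivors` tasks remain, which suggests the infra is down.
function scoreSurviving(
  taskIds: string[],
  scoreTask: (taskId: string) => number, // hypothetical; stands in for a runAgentic call
  minSurvivors = 2,
): { scored: TaskScore[]; skipped: string[] } {
  const scored: TaskScore[] = []
  const skipped: string[] = []
  for (const taskId of taskIds) {
    try {
      scored.push({ taskId, score: scoreTask(taskId) })
    } catch {
      skipped.push(taskId) // transient per-task failure: exclude, do not abort
    }
  }
  if (scored.length < minSurvivors) {
    throw new Error(`only ${scored.length} task(s) survived; infra looks down`)
  }
  return { scored, skipped }
}

// Toy scorer: one task always fails the way a wedged container would.
const flaky = (id: string): number => {
  if (id === 't2') throw new Error('every rollout went down')
  return 1
}
const partial = scoreSurviving(['t1', 't2', 't3'], flaky)
console.log(partial.skipped)
```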
…type (−433 LOC)
It was a dead-end (nothing imports it): a hand-rolled flat loop that BYPASSED the canonical Supervisor, plus a second copy of the gym client (6 functions duplicating gym-agent.ts's 5). It is fully superseded by the canonical stack: agentic.ts (domain-blind depth/breadth/Supervisor/observe, 428 LOC, written ONCE) + the AgenticSurface seam (agentic-eops.ts, 73 LOC = the entire per-domain slot-in). The +16.4pp result and the GEPA harness run on the canonical path; this prototype only de-risked the plumbing (gym standup, router-tools worker, depth-best scoring) and is now dead weight.
❌ Needs Work —
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 61 | 80 | 61 |
| Confidence | 70 | 70 | 70 |
| Correctness | 61 | 80 | 61 |
| Security | 61 | 80 | 61 |
| Testing | 61 | 80 | 61 |
| Architecture | 61 | 80 | 61 |
Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.
Blocking
🔴 HIGH Missing population file makes eops-gepa.mts always crash on startup — bench/src/eops-gepa.mts
Line 116: readFileSync('steerers/eops-itsm-population.json', 'utf8') — the steerers/ directory does not exist in the repo, so this always throws ENOENT. The script never reaches the evaluation loop. The defaultAnalystInstruction seed on line 119 is the only required seed; the file-based seeds are supplementary. Either (a) check in the population file, or (b) wrap in try/catch and default to an empty array so the script can run with just the observe-default seed.
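Option (b) could be sketched as below. The assumptions are loud ones: the population file is taken to be a JSON array of strings (the source does not show its schema), `loadSupplementarySeeds` is a hypothetical helper, and the default-seed text is a placeholder.

```typescript
import { readFileSync } from 'node:fs'

// Read supplementary seeds. A missing or malformed file yields an empty list,
// so the run can proceed with only the default seed.
// Assumption: the file holds a JSON array of instruction strings.
function loadSupplementarySeeds(path: string): string[] {
  try {
    const parsed: unknown = JSON.parse(readFileSync(path, 'utf8'))
    return Array.isArray(parsed)
      ? parsed.filter((s): s is string => typeof s === 'string')
      : []
  } catch {
    return [] // ENOENT or bad JSON: fall back to no extra seeds
  }
}

const defaultSeed = 'placeholder for the default analyst instruction'
const seeds = [defaultSeed, ...loadSupplementarySeeds('steerers/does-not-exist.json')]
console.log(seeds.length)
```

Swallowing every error keeps the sketch short; a stricter version might rethrow anything other than a missing file so that a corrupt population is not silently ignored.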
Other
🟡 LOW No test coverage for analystInstruction propagation — bench/src/agentic.ts
Line 193: The conditional spread `...(opts.analystInstruction ? { analystInstruction: opts.analystInstruction } : {})` is the critical seam that lets GEPA control the steerer, but has zero test coverage. The `observe()` function's `analystInstruction` override is also untested. Since this is bench tooling, not a library hot path, the risk is low, but a single integration test confirming the knob reaches observe() would raise confidence.
🟡 LOW Whitespace-only analystInstruction bypasses truthiness guard in agentic.ts — bench/src/agentic.ts
Line 193: the spread guard `...(opts.analystInstruction ? { analystInstruction: opts.analystInstruction } : {})` passes truthy whitespace-only values (e.g. `' '`), which would be sent as an effectively empty system prompt to observe(). The eops-gepa.mts caller defends against this with `instruction.trim().length < 40` at line 165, but agentic.ts itself has no guard. Consider `opts.analystInstruction?.trim() ? ...` to match the intent.
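A guarded version of that spread might look like this. It is a sketch: `ObserveOptions` and `instructionOverride` are hypothetical names, and the surrounding option object is a toy.

```typescript
// Hypothetical, trimmed-down option shape.
interface ObserveOptions {
  analystInstruction?: string
}

// Forward the override only when it has non-whitespace content;
// otherwise return an empty object so the callee's default applies.
function instructionOverride(instruction?: string): ObserveOptions {
  if (instruction === undefined || instruction.trim() === '') return {}
  return { analystInstruction: instruction }
}

const blank = { model: 'toy-model', ...instructionOverride('   ') }
const real = { model: 'toy-model', ...instructionOverride('Only claim what the trace shows.') }
console.log('analystInstruction' in blank, 'analystInstruction' in real)
```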
🟡 LOW Hardcoded relative path for seed population file — bench/src/eops-gepa.mts
Line 116: `readFileSync('steerers/eops-itsm-population.json', 'utf8')` uses a relative path, so the script must be run from `bench/` or the read fails. The file exists (bench/steerers/eops-itsm-population.json), and the usage docstring (line 17) shows running as `tsx src/eops-gepa.mts` from `bench/`, so this works in practice. Low severity: could use `import.meta.dirname` for robustness, but not blocking.
🟡 LOW Reflection top/bottom trials overlap when perTask has ≤2 entries — bench/src/eops-gepa.mts
Lines 144-146: `sorted.slice(0, 2)` (top) and `sorted.slice(-2)` (bottom) produce identical arrays when perTask has exactly 2 items, feeding the reflection prompt no gradient signal. With N=4 tasks and infrastructure skips, this can happen. Impact: degraded optimization quality, not correctness. Fix: guard with `if (sorted.length <= 2) { bottom = sorted.slice(0, 1); top = sorted.slice(-1); }` or skip reflection for that parent.
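One way to keep the two slices disjoint for any list length is sketched below. It assumes the list is sorted best-first (the source does not show the sort direction), and `Trial` and `contrastTrials` are hypothetical names.

```typescript
interface Trial {
  taskId: string
  lift: number
}

// Pick disjoint best/worst trials for the reflection prompt.
// Returns null when there are too few trials to contrast at all.
function contrastTrials(
  perTask: Trial[],
  k = 2,
): { top: Trial[]; bottom: Trial[] } | null {
  if (perTask.length < 2) return null
  const sorted = [...perTask].sort((a, b) => b.lift - a.lift) // best first (assumption)
  const n = Math.min(k, Math.floor(sorted.length / 2)) // halves can never overlap
  return { top: sorted.slice(0, n), bottom: sorted.slice(-n) }
}

const two = contrastTrials([
  { taskId: 'a', lift: 0.1 },
  { taskId: 'b', lift: -0.2 },
])
console.log(two)
```

With exactly two trials this yields one top and one bottom entry, so the reflection prompt still sees a contrast; with a single trial the caller can skip reflection for that parent.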
🟡 LOW Custom analystInstruction can remove behavioral guardrails — src/runtime/observe.ts
The analystInstruction option replaces the full system prompt, including the constraints 'Only claim what the trace shows' and 'No findings if the run was clean' (lines 57-62). While the score firewall is structural (ObserveInput has no score field; derived_from_judge is hardcoded false at line 174), a custom instruction can instruct the analyst to hallucinate findings. This is the intended use case (GEPA optimization surface) and is honestly documented in the JSDoc ([lines 48-53](https://github.com/tangle-network/agent-runtime/blob/dfca5406
🟡 LOW No test coverage for observe() function — src/runtime/observe.ts
No test file exists for src/runtime/observe.ts (glob for observe.test.* / observe.spec.* returned no results). The observe() function — including its new analystInstruction path — has zero automated coverage. This is a pre-existing condition, not introduced by this PR, but the new code path (analystInstruction override, defaultAnalystInstruction export) inherits the gap. The bench/src/eops-gepa.mts integration provides smoke coverage but no unit-level assertion on the option plumbing.
🟡 LOW No unit test for observe() or the analystInstruction override — src/runtime/observe.ts
There is no test file for `observe()` anywhere in the repo (checked `tests/**/*observe*` and grep for imports). The new `analystInstruction` fallback path (`opts.analystInstruction ?? defaultAnalystInstruction`) is untested. The function is non-trivial (LLM call, JSON parse, corpus append). This is pre-existing tech debt, not introduced by this PR, but the new parameter adds a coverage gap. A minimal test mocking `ChatClient` would confirm the override flows through and the fallback fires when omitted.
🟡 LOW Verbose doc comment on analystInstruction could be trimmed — src/runtime/observe.ts
Lines 48-53: the JSDoc for `analystInstruction` is 6 lines explaining GEPA context and the firewall invariant. The same rationale is repeated in the module-level doc (lines 1-17) and in the `defaultAnalystInstruction` export comment. Not a bug, but the redundancy adds maintenance surface. Consider a one-liner: `/** Override the analyst system instruction. Omitted ⇒ default. */`; the firewall explanation belongs in module docs, not per-field.
tangletools · 2026-06-09T11:43:33Z · trace
tangletools
left a comment
❌ 1 Blocking Finding — dfca5406
Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-09T11:43:33Z · immutable trace
Premise check withheld merge —
…naming + onboarding fixes
The pieces existed (Supervisor + observe + the depth/breadth strategies) but weren't
wrapped as a usable suite, and the vocabulary was opaque. runBenchmark is the packaged
front door:
runBenchmark({ environment, tasks, worker, strategies: ['sample','refine'], budget })
→ runs each strategy, scores by the environment's own deployable check, returns the
per-strategy means + the paired-bootstrap lift of refine over sample. printBenchmarkReport
gives the verdict. Resilient to transient per-task infra (skip, don't crash).
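For readers unfamiliar with a paired-bootstrap lift, here is a generic sketch of the statistic. It is not the repo's implementation: `pairedBootstrapLift`, its parameters, and the toy scores are assumptions, and a small seeded PRNG is used only so the example is reproducible.

```typescript
// Small seeded PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

interface Lift {
  mean: number // mean per-task difference, refine minus sample
  lo: number // 2.5th percentile of the bootstrap means
  hi: number // 97.5th percentile of the bootstrap means
}

// Resample the per-task differences with replacement; pairing by task keeps
// task difficulty from inflating the variance of the comparison.
function pairedBootstrapLift(refine: number[], sample: number[], resamples = 2000, seed = 1): Lift {
  if (refine.length !== sample.length || refine.length === 0) {
    throw new Error('pairedBootstrapLift: need equal-length, non-empty score arrays')
  }
  const diffs = refine.map((r, i) => r - sample[i])
  const mean = (xs: number[]): number => xs.reduce((s, x) => s + x, 0) / xs.length
  const rand = mulberry32(seed)
  const means: number[] = []
  for (let b = 0; b < resamples; b++) {
    const draw = diffs.map(() => diffs[Math.floor(rand() * diffs.length)])
    means.push(mean(draw))
  }
  means.sort((x, y) => x - y)
  return {
    mean: mean(diffs),
    lo: means[Math.floor(0.025 * resamples)],
    hi: means[Math.floor(0.975 * resamples)],
  }
}

// Toy per-task scores, same task order in both arrays.
const lift = pairedBootstrapLift([1, 1, 0, 1], [0, 1, 0, 0])
console.log(lift.mean)
```

A lift whose interval excludes zero is the usual reading of "refine beats sample"; with very few tasks the interval is wide, so the verdict should be treated with care.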
Naming, made legible (public API; maps to internal depth/breadth — zero churn to the
running internals): a task domain is an `Environment` (the AgenticSurface seam under the
RL/gym-standard name); the strategies are `sample` (best-of-N / resample) and `refine`
(attempt → critic reads trace → steer → repeat), named by what they DO, not the search
tree's shape. Juniors call runBenchmark; seniors customize the hooks (worker.analystInstruction
= the critic, Environment.score = the check) or drop to runAgentic for new strategies.
Onboarding: deleted the orphaned empty examples/define-loop/ (defineLoop removed #194);
fixed the dead examples/model-resolution link in docs/concepts.md.
…ur own)
The question: when we collapse to "refine", can a dev create their OWN strategy?
Before: no — runAgentic took mode:'depth'|'breadth', a CLOSED enum. The capability
existed (a strategy is an Agent) but the door wasn't cut.
Now: `Strategy` is an exported interface — `{ name, driver(surface, task, opts, budget)
=> Agent }`. A strategy builds the driver Agent the Supervisor runs; author your own by
returning an Agent whose act() spawns shots/analysts via scope.spawn/next/send. `refine`
and `sample` ship as instances AND the reference driver implementations (depthDriver/
breadthDriver) are exported to copy. runAgentic accepts a `strategy` (mode kept for
back-compat); runBenchmark takes `Strategy[]` — pass the built-ins or your own.
What's under the words:
sample = K independent attempts, keep the best-verifying (best-of-N / resample)
refine = attempt → observe() reads the trace → steer the next → repeat (iterate)
A multi-agent "team" is just a Strategy whose driver spawns several different agents —
same recursive Agent atom, coordinated over the Scope.
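The behavioral difference between the two built-ins can be shown on a toy problem. This sketch deliberately drops the Supervisor, Scope, and Agent machinery: `Worker`, `Check`, `Critic`, and both functions are hypothetical simplifications, not the PR's `Strategy` interface.

```typescript
// Hypothetical, simplified shapes; the PR's Strategy returns a driver Agent instead.
interface Attempt {
  output: number
  score: number
}
type Worker = (steer: number | null, attempt: number) => number // produces an output
type Check = (output: number) => number // the environment's own score
type Critic = (output: number) => number // reads the output (never the score), returns a steer

// sample: K independent attempts, keep the best-verifying one.
function sampleToy(worker: Worker, check: Check, k: number): Attempt {
  let best: Attempt = { output: NaN, score: -Infinity }
  for (let i = 0; i < k; i++) {
    const output = worker(null, i)
    const score = check(output)
    if (score > best.score) best = { output, score }
  }
  return best
}

// refine: attempt, let the critic read the result, steer the next attempt, repeat.
function refineToy(worker: Worker, check: Check, critic: Critic, k: number): Attempt {
  let best: Attempt = { output: NaN, score: -Infinity }
  let steer: number | null = null
  for (let i = 0; i < k; i++) {
    const output = worker(steer, i)
    const score = check(output)
    if (score > best.score) best = { output, score }
    steer = critic(output) // the critic sees the output only, mirroring the score firewall
  }
  return best
}

// Toy "counter" task: get as close to 10 as possible.
const check: Check = (output) => -Math.abs(output - 10)
const worker: Worker = (steer, attempt) => (steer === null ? attempt * 2 : steer)
const critic: Critic = (output) => output + 4 // stands in for an analyst nudging upward

const sampled = sampleToy(worker, check, 3)
const refined = refineToy(worker, check, critic, 3)
console.log(sampled.output, refined.output)
```

With three attempts each, the unsteered attempts reach 4 while the steered chain reaches 8, which is the kind of gap the depth-vs-breadth lift measures.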
… lines (skillifiable)
The original goal: loops compact enough to skillify, so agents author them. A 70-line
Supervisor driver isn't that. This adds the composable LEGO:
defineStrategy(name, async ({ shot, critique, surface, budget }) => { ...compose... })
A strategy body gets two steps — shot() (one worker attempt over an artifact) and
critique() (the firewalled analyst reads the trace → a steer) — with ZERO Supervisor/
Scope/spawn/leaf/drainOne ceremony (all of it lives inside defineStrategy now). That is
the unit an agent or a skill can emit.
Proof: adaptiveRefine — a NEW strategy (refine, but ABANDON-and-restart when a steered
shot fails to improve = branch-when-stuck, the widen/MCTS idea the depth-stuck failure
motivated), authored entirely from the steps, scored keep-best. ~22 lines of pure
strategy logic, no plumbing.
Behavior-preserving: the proven refine/sample drivers (depthDriver/breadthDriver) are
UNTOUCHED — the +16.4pp result + GEPA stay valid. The steps replicate their exact
spawn/drain pattern, so a step-authored strategy behaves identically. Typecheck-verified;
adaptiveRefine live-smoke pending the gym (GEPA has it).
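To make the "compose from steps" idea concrete, here is a toy stand-in for a step-authored strategy with a branch-when-stuck rule. Everything in it is hypothetical: `defineToyStrategy`, the `Steps` shape, and the synchronous signatures are simplifications, and the real `defineStrategy` steps run asynchronously under the Supervisor.

```typescript
interface ShotResult {
  artifact: number
  score: number
}
// Hypothetical step signatures; critique sees the artifact only, never the score.
interface Steps {
  shot: (artifact: number | null, steer?: string) => ShotResult
  critique: (artifact: number) => string
  budget: number // maximum number of shots
}
interface ToyStrategy {
  name: string
  run: (steps: Steps) => ShotResult
}

function defineToyStrategy(name: string, body: (steps: Steps) => ShotResult): ToyStrategy {
  return { name, run: body }
}

// Refine, but abandon the current line and restart when a steered shot fails to improve it.
const adaptiveToy = defineToyStrategy('adaptiveRefineToy', ({ shot, critique, budget }) => {
  let used = 0
  const take = (artifact: number | null, steer?: string): ShotResult => {
    used++
    return shot(artifact, steer)
  }
  let current = take(null) // first attempt on a fresh artifact
  let best = current // keep-best across all lines
  while (used < budget) {
    const next = take(current.artifact, critique(current.artifact))
    if (next.score > best.score) best = next
    if (next.score > current.score) {
      current = next // the steer helped: continue this line
    } else if (used < budget) {
      current = take(null) // stuck: restart from a fresh artifact
      if (current.score > best.score) best = current
    }
  }
  return best
})

// Toy environment: each steered shot increments a counter; past 2 the score collapses.
let calls = 0
const toyShot = (artifact: number | null): ShotResult => {
  calls++
  const a = artifact === null ? 0 : artifact + 1
  return { artifact: a, score: a <= 2 ? a : 0 }
}
const outcome = adaptiveToy.run({ shot: toyShot, critique: () => 'increment the counter', budget: 6 })
console.log(outcome.score, calls)
```

The sketch compares each steered shot against the current line rather than the global best, which is one possible reading of "fails to improve"; the shipped strategy may define it differently.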
✅ No Blockers —
| | deepseek | glm | aggregate |
|---|---|---|---|
| Readiness | 73 | 67 | 67 |
| Confidence | 75 | 75 | 75 |
| Correctness | 73 | 67 | 67 |
| Security | 73 | 67 | 67 |
| Testing | 73 | 67 | 67 |
| Architecture | 73 | 67 | 67 |
Full multi-shot audit completed 3/3 planned shots over 6 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM adaptiveRefine double-opens surface on first-shot score=0 when best=-1 initial condition causes premature abandon — bench/src/agentic.ts
Line 510: `let best = -1`. If the very first shot scores 0 (passes=0), then line 519 `out.score <= best` is `0 <= -1` → false, so best becomes 0. But if the SECOND shot on a fresh restart also scores 0, `0 <= 0` is true → abandons and restarts again. This means the strategy can thrash: open→score 0→continue→open→score 0→abandon→restart→repeat. The initial `best = -1` sentinel means the first 0-score shot always passes the check, but subsequent 0-score shots on the same handle trigger restart. This is probably the intended 'branch-when-stuck' behavi
🟠 MEDIUM runBenchmark crashes when all tasks are excluded via transient infra — bench/src/run-benchmark.mts
Line 84: `pairedLift(ok.map(...), ok.map(...))` throws `pairedLift: no pairs` (stats.mts:69) when `ok` is empty (all tasks excluded). The function's doc claims 'Resilient: a task whose rollouts fail is excluded, not fatal' but if ALL fail, it crashes with an unhandled exception. Reproduction: gym container down, network outage, or all tasks hitting router auth failures. Fix: guard the pairedLift call with `if (ok.length > 0)` or catch the error and set `refineVsSample = undefined`.
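The suggested guard could be sketched as follows. The `pairedLift` here is a stand-in that only mimics the throw-on-empty behavior described in the finding; `TaskResult` and `refineVsSample` are hypothetical names.

```typescript
interface TaskResult {
  sample: number
  refine: number
}

// Stand-in for the repo's pairedLift: mean paired difference, throws when there are no pairs.
function pairedLift(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) throw new Error('pairedLift: no pairs')
  return a.reduce((sum, x, i) => sum + (x - b[i]), 0) / a.length
}

// Return undefined instead of crashing when every task was excluded.
function refineVsSample(ok: TaskResult[]): number | undefined {
  if (ok.length === 0) return undefined
  return pairedLift(
    ok.map((r) => r.refine),
    ok.map((r) => r.sample),
  )
}

console.log(refineVsSample([]))
console.log(refineVsSample([{ sample: 0, refine: 1 }, { sample: 1, refine: 1 }]))
```

Returning `undefined` pushes the "no data" case to the report printer, which can then say so explicitly rather than print a verdict.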
🟡 LOW AgenticRunResult.mode widened from union to string — no exhaustiveness check — bench/src/agentic.ts
Line 312: `mode: string` (was `'depth' | 'breadth'`). This is intentional to support custom strategy names, but any consumer doing `switch(result.mode)` with only 'depth'/'breadth' cases won't get a TS exhaustiveness error for the new 'adaptiveRefine' or custom names. This is the expected tradeoff for the Strategy extension point, but downstream code that pattern-matches on mode should be audited (outside this shot's scope).
🟡 LOW No tests for new Strategy/defineStrategy/adaptiveRefine/runBenchmark abstractions — bench/src/agentic.ts
None of the 11 test files in bench/src/ import or exercise the new public API surface: Strategy, defineStrategy, adaptiveRefine, sample, refine as Strategy objects, runBenchmark, printBenchmarkReport. The existing refine-loop.test.mts tests a different abstraction (runRefineLoop). The Strategy abstraction is now the primary extension point and should have at minimum a test verifying defineStrategy produces a working driver, and that adaptiveRefine correctly handles the restart-then-resume flow.
🟡 LOW adaptiveRefine restart never resets best-score threshold, causing potential restart spiral — bench/src/agentic.ts
Lines 519-525: When a steered shot scores <= `best` and the line is abandoned/restarted, `best` retains the previous line's best score. The next line's first shot (on a fresh artifact with minimal work done) is compared against this inherited `best`, and if it falls short (which is likely since the artifact starts from scratch) the line is immediately abandoned again. This conflates two separate concerns into one variable: global-best-across-lines and per-line-improvement-threshold. The restart loop continues without any line getting more than one shot, wasting budget. Fix: track `globalBest` separately from a per-line improvement threshold, or reset `best = out
🟡 LOW eops-gepa.mts assumes cwd=bench/ for seed file path — bench/src/eops-gepa.mts
Line 116: `readFileSync('steerers/eops-itsm-population.json')` is a relative path resolved against `process.cwd()`. The documented invocation (`tsx src/eops-gepa.mts`) implies running from the bench/ directory, but if run from the repo root or any other directory, the file won't be found and the process exits with a raw ENOENT error. Fix: use `import.meta.url`-based resolution or a CLI flag for the population file path.
🟡 LOW eops-gepa.mts fitness silently returns -1 lift when ALL tasks fail depth — bench/src/eops-gepa.mts
Line 111: `lift: scored ? liftSum / scored : -1`: if every task's depth run throws, `scored=0` and the candidate gets `lift=-1, cost=1e9`. This is a safe sentinel that won't win the pareto frontier, but the error message at line 91 only guards the breadth baseline (`< 2` tasks). If all depth runs fail for every candidate, GEPA runs to completion with all candidates at lift=-1 and the 'winner' is arbitrary. No data loss, but the user gets a misleading 'WINNER' log. A warning when `scored === 0` would help.
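A small sketch of how the sentinel could be made visible instead of silent. The `Fitness` shape, the `degenerate` flag, and the use of mean cost for scored candidates are assumptions; only the `-1` and `1e9` sentinels come from the finding.

```typescript
interface Fitness {
  lift: number
  cost: number
  degenerate: boolean // true when no task could be scored for this candidate
}

// Same sentinel values as the finding describes, plus an explicit flag.
function toFitness(liftSum: number, costSum: number, scored: number): Fitness {
  if (scored === 0) return { lift: -1, cost: 1e9, degenerate: true }
  return { lift: liftSum / scored, cost: costSum / scored, degenerate: false }
}

// True when there is no meaningful winner to report.
function allDegenerate(pop: Fitness[]): boolean {
  return pop.length > 0 && pop.every((f) => f.degenerate)
}

const generation = [toFitness(0, 0, 0), toFitness(0, 0, 0)]
if (allDegenerate(generation)) {
  console.warn('every candidate failed all depth runs; the reported winner would be arbitrary')
}
```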
🟡 LOW eops-gepa.mts reads seed population from hardcoded relative path — bench/src/eops-gepa.mts
Line 116: `readFileSync('steerers/eops-itsm-population.json', ...)`: this is a relative path resolved from CWD, not from `import.meta.dirname`. The file exists at `bench/steerers/`, so it only works if CWD is `bench/`. The rest of the file uses absolute env vars (`EOPS_GYM_DBS_DIR`). If someone runs `tsx bench/src/eops-gepa.mts` from the repo root, this throws ENOENT. Should use `path.resolve(import.meta.dirname, '../steerers/eops-itsm-population.json')` or make the path an env var.
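Module-relative resolution can also be written against `import.meta.url`, which avoids depending on the newer `import.meta.dirname`. The sketch below takes the module URL as a parameter so it can be exercised in isolation; `resolveFromModule` is a hypothetical helper and the `file:///repo/...` URL is a made-up POSIX example.

```typescript
import { fileURLToPath } from 'node:url'

// Resolve a path relative to a module's own location rather than process.cwd().
function resolveFromModule(moduleUrl: string, relative: string): string {
  return fileURLToPath(new URL(relative, moduleUrl))
}

// In the real script the first argument would be import.meta.url.
const populationPath = resolveFromModule(
  'file:///repo/bench/src/eops-gepa.mts',
  '../steerers/eops-itsm-population.json',
)
console.log(populationPath)
```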
🟡 LOW runBenchmark sequentially runs strategies per task — no isolation guarantee for custom strategies — bench/src/run-benchmark.mts
Lines 65-69: `for (const s of strategies) { const r = await runAgentic({...}) }`: strategies run sequentially per task. Built-in strategies (sample/refine) open fresh artifacts, so there's no shared-state contamination. But a custom Strategy could mutate global or surface-level state (e.g., writing files) that bleeds into the next strategy's run on the same task. The benchmark doesn't document this sequencing assumption. Worth a note in the BenchmarkConfig.strategies JSDoc that strategies must not share mutable state.
🟡 LOW Empty-string analystInstruction bypasses default silently — src/runtime/observe.ts
Line 148: `opts.analystInstruction ?? defaultAnalystInstruction` uses nullish coalescing, which passes through `""` as-is, resulting in an empty system prompt. This would produce degenerate LLM output. The only current caller (bench/src/agentic.ts:193) guards with `opts.analystInstruction ? { analystInstruction: ... } : {}`, so it's safe in practice, but a future caller could pass `""` accidentally. Fix: either validate non-empty in `observe()` or switch to `||` (which also catches `""`). This is a nit, not a blocker.
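The difference between the fallback operators is easy to check in isolation. In this sketch `DEFAULT` is a placeholder string, not the real default instruction, and the three helpers are illustrative only.

```typescript
const DEFAULT = 'placeholder default instruction' // not the real default text

// ?? falls back only on null/undefined, so '' passes through.
const viaNullish = (v?: string): string => v ?? DEFAULT
// || falls back on any falsy value, so '' is caught but '  ' is not.
const viaOr = (v?: string): string => v || DEFAULT
// An explicit trim check also catches whitespace-only input.
const viaTrim = (v?: string): string => (v !== undefined && v.trim() !== '' ? v : DEFAULT)

console.log(JSON.stringify([viaNullish(''), viaOr(''), viaOr('  '), viaTrim('  ')]))
```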
🟡 LOW No test coverage for observe() or the new analystInstruction option — src/runtime/observe.ts
No test files exist for `observe.ts` (globbed `*observe*.test.*` and `*observe*.spec.*`, zero results). The new `analystInstruction` option and the `defaultAnalystInstruction` export are untested. This is pre-existing (no tests were removed), but the public API surface grows without coverage. A unit test that stubs `ChatClient` and verifies the system message content with/without the override would be low-effort and high-value.
🟡 LOW No test coverage for exported defaultAnalystInstruction — src/runtime/observe.ts
The new `defaultAnalystInstruction` constant is exported and consumed by `bench/src/eops-gepa.mts` (line 27) as a seed for GEPA prompt optimization, but no test asserts its content or verifies the `??` fallback behavior in `observe()`. Pre-existing gap (no `observe.test.ts` existed before this PR), but the export raises the blast radius if the string is ever accidentally modified.
tangletools · 2026-06-09T12:13:24Z · trace
Superseded by re-review — no blocking findings on latest commit.
tangletools
left a comment
✅ Approved — 12 non-blocking findings — 1dfbfd67
Full multi-shot audit completed 3/3 planned shots over 6 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-09T12:13:24Z · immutable trace
Premise check withheld merge —
…rs (gym-free, runnable)
The missing onboarding piece: a runnable demo of the whole suite on a toy "counter"
Environment (needs only a router key — no dataset, no sandbox). Shows all three layers:
1. runBenchmark(env, …) — default strategies compared, free.
2. strategies: [sample, refine, adaptiveRefine] — pick, named by behavior.
3. defineStrategy('doubleCheck', body) — author your own in ~10 lines from shot()+critique(),
zero Supervisor ceremony. The skillifiable unit.
Verified: runs end-to-end through the canonical Supervisor; all 4 strategies execute and
score via the Environment's own check. README documents the model + the customization hooks.
tangletools
left a comment
✅ Refreshed approval after new commits — ab137984
A previous trusted approval on this PR was invalidated by new commits.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: stale_approval_refresh · 2026-06-09T12:16:09Z
| State | Detail |
|---|---|
| Interrupted | webhook restarted |
No review verdict was produced for this run. Trigger a fresh review on the current PR head if the PR is still open.
tangletools · #205 · model: kimi-for-coding · updated 2026-06-09T12:30:05Z
…the baseline
Adds a HOLDOUT=N option: after optimizing on the search tasks, score the winning analyst instruction AND the seeded baseline (observe default) on a DISJOINT slice (offset = search-set size). Holdout breadth is computed once; winner+baseline depth are scored against it. Reports whether GEPA GENERALIZED (winner > baseline on held-out tasks), the frozen confirmation the discipline requires (guards against overfitting the search set). loadItsmTasks gains an offset param.
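The disjoint-slice idea (offset = search-set size) can be sketched without any of the real loaders. `splitSearchAndHoldout` and `generalized` are hypothetical helpers, and the task ids are toy values.

```typescript
// Split an ordered task list into a search slice and a disjoint holdout slice.
function splitSearchAndHoldout<T>(
  tasks: T[],
  searchN: number,
  holdoutN: number,
): { search: T[]; holdout: T[] } {
  const search = tasks.slice(0, searchN)
  const holdout = tasks.slice(searchN, searchN + holdoutN) // offset = search-set size
  if (holdout.length < holdoutN) {
    throw new Error(`need ${searchN + holdoutN} tasks, only ${tasks.length} available`)
  }
  return { search, holdout }
}

// The generalization verdict: the winner must beat the baseline on held-out tasks.
function generalized(winnerLift: number, baselineLift: number): boolean {
  return winnerLift > baselineLift
}

const { search, holdout } = splitSearchAndHoldout(['t1', 't2', 't3', 't4', 't5'], 3, 2)
console.log(search, holdout)
```

Throwing when the holdout would be short avoids silently evaluating on fewer tasks than requested, or on tasks that overlap the search set.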
tangletools
left a comment
✅ Auto-approved PR — a7e18a15
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-09T12:40:24Z
# Conflicts:
#	bench/src/eops-gepa.mts
#	docs/concepts.md
tangletools
left a comment
✅ Auto-approved PR — db0e0821
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-09T12:42:00Z
What
The flywheel, on the canonical loop system. The analyst is the steerer (`observe()`'s findings → `recommended_action` → the depth steer), so this evolves the analyst's system instruction against the live EOPS gate.
- `observe()` is now tunable: `analystInstruction?` override + exported `defaultAnalystInstruction`. The analyst prompt is the GEPA knob. The firewall stays structural (the observe input carries no score), so a custom instruction can't break it.
- `agentic.ts`: `AgenticOptions.analystInstruction` threads into the depth steerer.
- `eops-gepa.mts`: assembles the GEPA loop from agent-eval's real primitives: `buildReflectionPrompt` + `parseReflectionResponse` (reflective mutation) + `paretoFrontier` (selection over [maximize lift, minimize cost]). No hand-rolled optimizer. (There is no turnkey `runPromptEvolution` in agent-eval 0.83, only the primitives, so the population loop is thin orchestration over them.)
Fitness = the depth-vs-breadth lift on the canonical Supervisor+`observe()` gate. Breadth is computed once per task (a shared baseline, since breadth has no analyst; this is the correct design and halves cost). The failing per-task lifts are the reflection gradient. Seeds = `observe()`'s proven default (the +16.4pp instruction) FIRST, then the designer-panel population, so GEPA improves from known-good, not from below baseline.
Validation
Smoke (N=2, 1 gen) ran the full loop end-to-end: score → `paretoFrontier` select → `buildReflectionPrompt` → LLM → `parseReflectionResponse` → child → re-score → pick. Bounded real run (N=6, 2 gens, maxShots=3, deepseek-v4-pro) in flight; it will report whether GEPA finds an analyst prompt beating the seeded baseline.
Test
typecheck clean (runtime + bench, 0 errors); the `observe()` change is additive (default preserved); smoke validated the loop.