feat(bench): EOPS on the canonical loop system (Supervisor + observe analyst): depth beats breadth +16.4pp SIGNIF #204
Merged
…() analyst Stops hand-rolling. Brings the Supervisor depth/breadth drivers forward (the general agentic primitive, d5aa85a) and makes them truly canonical: the depth STEERER is now agent-eval's observe() (makeFinding + ChatClient + the derived_from_judge firewall), not a hand-rolled chat call. The worker runs through the keystone (createSupervisor + Scope + Agent.act + scope.spawn shots), metered by the conserved budget pool (equal-k by construction), journaled. This is the loop system we built, end to end — depth (continue over one artifact, observe()-steered) vs breadth (parallel best-of), over the live EOPS gym. The +13.4pp branch hand-rolled its analyst; this is the first run with the REAL analyst. Fixed the executor-rename drift (LeafExecutor→Executor etc, #190). Smoke (n=2, deepseek-v4-pro) runs clean through the Supervisor; full n in flight. Replaces the throwaway flat-loop eops-gate prototype for the science.
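To make "equal-k by construction" concrete, here is a minimal toy sketch of the depth and breadth drivers described above, both drawing from a conserved pool of k shots. Every name in it (BudgetPool, runDepth, runBreadth, Shot, Analyst, toyShot) is hypothetical and chosen for illustration; it is not the createSupervisor / Scope / Agent.act API. The shots are synchronous for brevity, whereas a real worker would be async and breadth would run its attempts in parallel. Scoring depth on its final state is also an assumption.

```typescript
// Hypothetical sketch: these names are illustrative, not the project's API.
interface ShotResult { artifact: string; score: number }
type Shot = (artifact: string, hint: string | null) => ShotResult;
type Analyst = (result: ShotResult) => string;

// A conserved pool: every shot must take one unit, so a driver cannot
// spend more than k shots no matter how it schedules them.
class BudgetPool {
  private remaining: number;
  constructor(total: number) { this.remaining = total; }
  get left(): number { return this.remaining; }
  take(): void {
    if (this.remaining <= 0) throw new Error("budget exhausted");
    this.remaining -= 1;
  }
}

// Depth: spend all k shots continuing ONE artifact, with an analyst hint
// produced after each shot and fed into the next one.
function runDepth(shot: Shot, analyst: Analyst, k: number): number {
  const pool = new BudgetPool(k);
  let current: ShotResult = { artifact: "", score: 0 };
  let hint: string | null = null;
  while (pool.left > 0) {
    pool.take();
    current = shot(current.artifact, hint);
    hint = analyst(current);
  }
  return current.score; // assumption: depth is scored on its final state
}

// Breadth: spend the same k shots on k independent fresh attempts and
// keep the best score (best-of-k).
function runBreadth(shot: Shot, k: number): number {
  const pool = new BudgetPool(k);
  let best = 0;
  while (pool.left > 0) {
    pool.take();
    best = Math.max(best, shot("", null).score);
  }
  return best;
}

// Toy worker: each shot adds one unit of progress to whatever it was given.
let calls = 0;
const toyShot: Shot = (artifact) => {
  calls += 1;
  const next = artifact + "x";
  return { artifact: next, score: Math.min(1, next.length / 4) };
};
const toyAnalyst: Analyst = (r) => `progress so far: ${r.score}`;

console.log(runDepth(toyShot, toyAnalyst, 4), runBreadth(toyShot, 4)); // prints: 1 0.25
```

The toy worker is rigged so that continuing an artifact always helps, so the numbers it prints say nothing about the real experiment. The point is only the accounting: both drivers take exactly k units from the pool, so any difference in score comes from how the shots are arranged rather than from how many are spent.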
What
Stops hand-rolling. Runs the EOPS depth-vs-breadth steering experiment on our actual loop system: the keystone createSupervisor + Scope + Agent.act depth/breadth drivers (bench/src/agentic.ts, brought forward from d5aa85a), with the depth STEERER swapped to agent-eval's observe(), the real analyst (makeFinding + ChatClient + the derived_from_judge firewall), not a hand-rolled chat call. The worker runs through the Supervisor's conserved budget pool (equal-k by construction) and is journaled.
This is the first run of the fully canonical construction: even the prior +13.4pp branch hand-rolled its analyst. Fixed the executor-rename drift (LeafExecutor→Executor, #190).
Result (n=16, deepseek-v4-pro, equal compute, paired bootstrap)
DEPTH 72.9% vs BREADTH 56.6% = +16.4pp, 95% CI [+5.3, +29.8] (excludes 0); depth wins 6 / loses 0 / ties 10.
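The PR does not show how the interval was computed. As a rough sketch, one common way to get a paired-bootstrap CI is to resample the per-task depth-minus-breadth differences with replacement and take percentiles of the resampled means. The function names, the percentile method, the seed, and the per-task scores below are all assumptions for illustration. The toy scores are placeholders, not this run's data. The sign test at the end is a separate sanity check that uses only the 6-0 discordant split reported above.

```typescript
// Small seeded PRNG (mulberry32-style) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap over PAIRED per-task differences (depth - breadth).
function pairedBootstrapCI(
  depth: number[], breadth: number[], iters = 10000, seed = 1, alpha = 0.05,
): { lift: number; lo: number; hi: number } {
  if (depth.length !== breadth.length || depth.length === 0) {
    throw new Error("need equal-length, non-empty paired scores");
  }
  const diffs = depth.map((d, i) => d - breadth[i]);
  const n = diffs.length;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < iters; b++) {
    let sum = 0;
    // Resample tasks with replacement; pairing is kept because we draw diffs.
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor((alpha / 2) * iters)];
  const hi = means[Math.min(iters - 1, Math.ceil((1 - alpha / 2) * iters) - 1)];
  const lift = diffs.reduce((s, x) => s + x, 0) / n;
  return { lift, lo, hi };
}

// Exact two-sided sign test on discordant pairs (ties are dropped).
function signTestP(wins: number, losses: number): number {
  const n = wins + losses;
  if (n === 0) return 1;
  const k = Math.min(wins, losses);
  let tail = 0;
  let coef = 1; // running binomial coefficient C(n, i), starting at C(n, 0)
  for (let i = 0; i <= k; i++) {
    tail += coef * Math.pow(0.5, n);
    coef = (coef * (n - i)) / (i + 1);
  }
  return Math.min(1, 2 * tail);
}

// Placeholder per-task scores (NOT the PR's data), 8 paired tasks.
const depthToy = [1, 1, 0.5, 0.5, 0.5, 1, 0, 1];
const breadthToy = [0.5, 0.5, 0, 0.5, 0.5, 1, 0, 1];
const toy = pairedBootstrapCI(depthToy, breadthToy);
console.log(toy.lift, toy.lo, toy.hi);

console.log(signTestP(6, 0)); // 0.03125
```

With 6 wins and 0 losses, the exact two-sided sign test gives p = 2 × 0.5^6 = 0.03125, which agrees with the CI excluding 0.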
observe() visibly climbs. Progress curves: task820→100, task1450→100, task130→50, task40→67. The analyst unsticks the agent across resumed shots; depth reaches states breadth's best-of-K can't.
The meta-finding: architecture matters
The hand-rolled flat loop (eops-gate.mts), a for-loop that bypasses the Supervisor and uses a hand-rolled steerer instead of observe(), gave noisy ties, then a −9.9pp scoring artifact, then +6pp after the fix, and was never clean. The canonical loop (Supervisor + real observe()) gives a significant +16.4pp on the same gym/model/n. Using what we built > reinventing a worse version.
Caveats
n=16 (modest); single domain (itsm); single worker model; breadth baseline 56.6% has headroom. But the CI excludes 0 and the discordant pairs split 6-0, which together indicate a real effect. Next: power up to n=40 and GEPA / /evolve the observe() analyst prompt on this canonical stack (fitness = the depth-vs-breadth lift).
Test
typecheck clean (0 errors); smoke (n=2) + full (n=16) ran end-to-end through the Supervisor against the live gym.