feat(bench): EOPS on the canonical loop system (Supervisor + observe analyst) - depth beats breadth +16.4pp SIGNIF #204

Merged: drewstone merged 1 commit into main from feat/eops-canonical-supervisor on Jun 9, 2026

@drewstone

Contributor

What

Stops hand-rolling. Runs the EOPS depth-vs-breadth steering experiment on our actual loop system: the keystone createSupervisor + Scope + Agent.act depth/breadth drivers (bench/src/agentic.ts, brought forward from d5aa85a), with the depth STEERER swapped to agent-eval's observe(), the real analyst (makeFinding + ChatClient + the derived_from_judge firewall), not a hand-rolled chat call. The worker runs through the Supervisor's conserved budget pool (equal-k by construction) and is journaled.

This is the first run of the fully canonical construction; even the prior +13.4pp branch hand-rolled its analyst. Also fixes the executor-rename drift (LeafExecutor→Executor, #190).
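To make the depth-vs-breadth construction concrete, here is a minimal sketch of the control flow under an equal shot budget. Every name in it (`BudgetPool`, `runDepth`, `runBreadth`, `toyWorker`, `toySteer`) is hypothetical and chosen for illustration; it is not the createSupervisor / Scope / Agent.act API, and the toy worker is rigged only to exercise the loop, not to reproduce any result.

```typescript
// Hypothetical sketch: these names are illustrative and are NOT the
// createSupervisor / Scope / Agent.act API from bench/src/agentic.ts.

/** A pool of shots; every shot, depth or breadth, must draw from it. */
class BudgetPool {
  private remaining: number;
  public spent = 0;
  constructor(total: number) {
    this.remaining = total;
  }
  /** Consumes one shot and returns true, or returns false once the pool is empty. */
  take(): boolean {
    if (this.remaining <= 0) return false;
    this.remaining -= 1;
    this.spent += 1;
    return true;
  }
}

/** One worker shot: takes progress (0-100) plus an optional hint, returns new progress. */
type Worker = (progress: number, hint: string | null) => number;
/** The steerer inspects the current state and returns a hint for the next shot. */
type Steerer = (progress: number) => string;

/** Depth: resume one artifact shot after shot, steered in between; score the FINAL state. */
function runDepth(worker: Worker, steer: Steerer, pool: BudgetPool): number {
  let progress = 0;
  let hint: string | null = null;
  while (pool.take()) {
    progress = worker(progress, hint);
    hint = steer(progress);
  }
  return progress;
}

/** Breadth: independent attempts from a fresh state; score the best of K. */
function runBreadth(worker: Worker, pool: BudgetPool): number {
  let best = 0;
  while (pool.take()) {
    best = Math.max(best, worker(0, null));
  }
  return best;
}

// Toy worker, rigged only to exercise the control flow: a hinted shot advances
// further than an unhinted one. It says nothing about real agent behaviour.
const toyWorker: Worker = (progress, hint) =>
  Math.min(100, progress + (hint === null ? 30 : 40));
const toySteer: Steerer = (progress) => `stuck at ${progress}, try the next step`;

const K = 3; // same number of shots for both arms
const depthPool = new BudgetPool(K);
const breadthPool = new BudgetPool(K);
const depthScore = runDepth(toyWorker, toySteer, depthPool);
const breadthScore = runBreadth(toyWorker, breadthPool);
console.log({ depthScore, breadthScore, depthShots: depthPool.spent, breadthShots: breadthPool.spent });
```

Because both arms can only take shots from a pool of the same size, neither can spend more compute than the other, which is the property "equal-k by construction" refers to. The sketch also shows the scoring asymmetry: depth returns its final state, breadth returns its maximum.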

Result (n=16, deepseek-v4-pro, equal compute, paired bootstrap)

DEPTH 72.9% vs BREADTH 56.6% = +16.4pp, 95% CI [+5.3, +29.8] (excludes 0); depth wins 6 / loses 0 / ties 10.

  • observe() visibly climbs — progress curves: task8 20→100, task14 50→100, task13 0→50, task4 0→67. The analyst unsticks the agent across resumed shots; depth reaches states breadth's best-of-K can't.
  • Depth is handicapped (scored on FINAL state vs breadth's best-of-K — the autopsy's biased-against-depth comparison) and still wins significantly.
  • Cheap model (deepseek-v4-pro), not gpt-4.1.
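For readers unfamiliar with the statistic quoted above, the following is a minimal sketch of a paired percentile bootstrap over per-task score differences. It is an assumption about the general technique, not the bench's actual implementation; the function names, the seeded PRNG, the resample count, and the example scores are all made up for illustration and are not the n=16 data from this run.

```typescript
// Deterministic PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface PairedCI {
  meanDiff: number; // observed mean of (depth - breadth)
  lo: number;       // lower percentile bound
  hi: number;       // upper percentile bound
}

/**
 * Paired bootstrap: resample TASKS with replacement, keeping each task's
 * depth and breadth scores together by resampling their difference.
 */
function pairedBootstrapCI(
  depth: number[],
  breadth: number[],
  resamples = 10000,
  alpha = 0.05,
  seed = 1,
): PairedCI {
  if (depth.length !== breadth.length || depth.length === 0) {
    throw new Error("depth and breadth must be non-empty and the same length");
  }
  const diffs = depth.map((d, i) => d - breadth[i]);
  const n = diffs.length;
  const rand = mulberry32(seed);

  const means: number[] = [];
  for (let b = 0; b < resamples; b++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  const loIdx = Math.floor((alpha / 2) * resamples);
  const hiIdx = Math.min(resamples - 1, Math.floor((1 - alpha / 2) * resamples));
  const meanDiff = diffs.reduce((s, x) => s + x, 0) / n;
  return { meanDiff, lo: means[loIdx], hi: means[hiIdx] };
}

// Made-up per-task scores in [0, 1]: 6 tasks where depth wins, 10 ties.
const exampleDepth = [1, 1, 0.5, 0.75, 1, 1, 0.5, 0.5, 1, 0, 1, 0.5, 1, 1, 0.75, 0.25];
const exampleBreadth = [0.25, 0.5, 0, 0, 1, 1, 0.5, 0.5, 1, 0, 1, 0.5, 1, 0.5, 0.5, 0.25];
const exampleCI = pairedBootstrapCI(exampleDepth, exampleBreadth);
console.log(exampleCI);
```

Resampling the per-task differences rather than the two arms independently is what makes the interval paired: task difficulty cancels within each pair, so the interval reflects only the depth-vs-breadth gap.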

The meta-finding: architecture matters

The hand-rolled flat loop (eops-gate.mts), a for-loop that bypasses the Supervisor and uses a hand-rolled steerer instead of observe(), never gave a clean result: noisy ties, a −9.9pp scoring artifact, and +6pp after the fix. The canonical loop (Supervisor + real observe()) gives +16.4pp SIGNIF on the same gym/model/n. Using what we built > reinventing a worse version.

Caveats

n=16 (modest); single domain (itsm); single worker model; breadth baseline 56.6% has headroom. Still, the CI excludes 0 and the discordant pairs split 6-0, which points to a real effect. Next: power up to n=40 and GEPA/evolve the observe() analyst prompt on this canonical stack (fitness = the depth-vs-breadth lift).
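As an independent cross-check on the 6-0 discordant split, an exact two-sided sign test can be computed by hand. This is a swapped-in technique, not the paired bootstrap the result is reported with; the function names below are hypothetical. Ties are dropped, and under the null each discordant pair is a fair coin flip.

```typescript
/** Binomial coefficient C(n, k), computed multiplicatively (exact for small n). */
function choose(n: number, k: number): number {
  let c = 1;
  for (let i = 1; i <= k; i++) {
    c = (c * (n - k + i)) / i;
  }
  return c;
}

/**
 * Exact two-sided sign test on discordant pairs.
 * Ties carry no information about direction and are excluded by the caller.
 */
function signTestTwoSided(wins: number, losses: number): number {
  const n = wins + losses;
  if (n === 0) return 1; // no discordant pairs: no evidence either way
  const k = Math.min(wins, losses);
  let tail = 0;
  for (let i = 0; i <= k; i++) tail += choose(n, i);
  tail /= Math.pow(2, n); // P(X <= k) under Binomial(n, 0.5)
  return Math.min(1, 2 * tail);
}

console.log(signTestTwoSided(6, 0)); // prints 0.03125
```

With 6 wins and 0 losses the one-sided tail is (1/2)^6 = 1/64, so the two-sided p-value is 2/64 = 0.03125. That sits below 0.05, consistent with the bootstrap interval excluding 0, though with only 6 discordant pairs it would take a single loss to move it above that threshold.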

Test

typecheck clean (0 errors); smoke (n=2) + full (n=16) ran end-to-end through the Supervisor against the live gym.

…() analyst

Stops hand-rolling. Brings the Supervisor depth/breadth drivers forward (the
general agentic primitive, d5aa85a) and makes them truly canonical: the depth
STEERER is now agent-eval's observe() (makeFinding + ChatClient + the
derived_from_judge firewall), not a hand-rolled chat call. The worker runs through
the keystone (createSupervisor + Scope + Agent.act + scope.spawn shots), metered by
the conserved budget pool (equal-k by construction), journaled.

This is the loop system we built, end to end — depth (continue over one artifact,
observe()-steered) vs breadth (parallel best-of), over the live EOPS gym. The
+13.4pp branch hand-rolled its analyst; this is the first run with the REAL
analyst. Fixed the executor-rename drift (LeafExecutor→Executor etc, #190).

Smoke (n=2, deepseek-v4-pro) runs clean through the Supervisor; full n in flight.
Replaces the throwaway flat-loop eops-gate prototype for the science.
@drewstone merged commit a6e6534 into main on Jun 9, 2026 (1 check passed)
@drewstone deleted the feat/eops-canonical-supervisor branch on June 9, 2026 02:26