feat(bench): EOPS on the canonical loop system (Supervisor + observe analyst) — depth beats breadth +16.4pp SIGNIF by drewstone · Pull Request #204 · tangle-network/agent-runtime

drewstone · 2026-06-09T02:24:38Z

What

Stops hand-rolling. Runs the EOPS depth-vs-breadth steering experiment on our actual loop system: the keystone createSupervisor + Scope + Agent.act depth/breadth drivers (bench/src/agentic.ts, brought forward from d5aa85a), with the depth STEERER swapped to agent-eval's observe() — the real analyst (makeFinding + ChatClient + the derived_from_judge firewall), not a hand-rolled chat call. Worker runs through the Supervisor's conserved budget pool (equal-k by construction), journaled.

This is the first run of the fully canonical construction — even the prior +13.4pp branch hand-rolled its analyst. Fixed the executor-rename drift (LeafExecutor→Executor, #190).
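The "equal-k by construction" property can be illustrated with a minimal sketch. Every name here (`BudgetPool`, `runDepth`, `runBreadth`, `Shot`) is hypothetical and is not the agent-runtime or agent-eval API; the sketch only assumes that both arms draw shots from a fixed pool, so neither can spend more compute than the other.

```typescript
// Hypothetical sketch: names are illustrative, not the real Supervisor API.
class BudgetPool {
  private remaining: number;
  constructor(total: number) {
    this.remaining = total;
  }
  // Returns true and spends one shot, or false once the pool is empty.
  take(): boolean {
    if (this.remaining <= 0) return false;
    this.remaining -= 1;
    return true;
  }
}

// A shot takes the current progress and an optional steering hint,
// and returns a new progress score in [0, 1].
type Shot = (state: number, hint: string | null) => number;

// Depth: keep working on ONE artifact, steering between shots.
function runDepth(pool: BudgetPool, shot: Shot, steer: (state: number) => string) {
  let state = 0;
  let used = 0;
  let hint: string | null = null;
  while (pool.take()) {
    state = shot(state, hint);
    used += 1;
    hint = steer(state); // stand-in for an analyst step between shots
  }
  return { final: state, used };
}

// Breadth: independent fresh attempts, keep the best one.
function runBreadth(pool: BudgetPool, shot: Shot) {
  let best = 0;
  let used = 0;
  while (pool.take()) {
    best = Math.max(best, shot(0, null));
    used += 1;
  }
  return { best, used };
}

// Toy worker with a constant score; it says nothing about the real benchmark.
const toyShot: Shot = (_state, _hint) => 0.5;
const toySteer = (state: number) => `progress so far: ${state}`;

const K = 4;
const depthRun = runDepth(new BudgetPool(K), toyShot, toySteer);
const breadthRun = runBreadth(new BudgetPool(K), toyShot);
console.log(depthRun.used, breadthRun.used);
```

Because each arm stops only when its pool is empty, the shot counts match without either driver having to track a limit itself.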

Result (n=16, deepseek-v4-pro, equal compute, paired bootstrap)

DEPTH 72.9% vs BREADTH 56.6% = +16.4pp, 95% CI [+5.3, +29.8] (excludes 0); depth wins 6 / loses 0 / ties 10.
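For readers unfamiliar with the statistic, here is a minimal sketch of a paired bootstrap over per-task score differences, together with the win/loss/tie count. It assumes a percentile interval, 10,000 resamples, and a seeded PRNG; the PR does not state which variant the bench uses, and the function names and toy scores below are illustrative, not the benchmark data.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface PairedResult {
  meanDiff: number;
  ciLow: number;
  ciHigh: number;
  wins: number;
  losses: number;
  ties: number;
}

// depth[i] and breadth[i] are scores for the SAME task i (hence "paired").
function pairedBootstrap(
  depth: number[],
  breadth: number[],
  resamples = 10000,
  seed = 1,
): PairedResult {
  if (depth.length === 0 || depth.length !== breadth.length) {
    throw new Error("need two non-empty arrays of equal length");
  }
  const diffs = depth.map((d, i) => d - breadth[i]!);
  const n = diffs.length;
  const meanDiff = diffs.reduce((s, d) => s + d, 0) / n;

  // Resample tasks with replacement and record the mean difference each time.
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let j = 0; j < n; j++) sum += diffs[Math.floor(rand() * n)]!;
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);

  return {
    meanDiff,
    // Percentile 95% interval over the resampled means.
    ciLow: means[Math.floor(0.025 * (resamples - 1))]!,
    ciHigh: means[Math.ceil(0.975 * (resamples - 1))]!,
    wins: diffs.filter((d) => d > 0).length,
    losses: diffs.filter((d) => d < 0).length,
    ties: diffs.filter((d) => d === 0).length,
  };
}

// Toy per-task scores, made up for illustration only.
const toyDepth = [1, 1, 0.5, 1, 0, 1];
const toyBreadth = [0.5, 1, 0.5, 0, 0, 0.5];
const toy = pairedBootstrap(toyDepth, toyBreadth);
console.log(toy);
```

An interval whose lower bound stays above 0 is what "excludes 0" refers to; the wins/losses/ties triple counts tasks by the sign of the paired difference.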

observe() visibly climbs — progress curves: task8 20→100, task14 50→100, task13 0→50, task4 0→67. The analyst unsticks the agent across resumed shots; depth reaches states breadth's best-of-K can't.
Depth is handicapped (scored on FINAL state vs breadth's best-of-K — the autopsy's biased-against-depth comparison) and still wins significantly.
Cheap model (deepseek-v4-pro), not gpt-4.1.
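The scoring asymmetry noted above (final state for depth, best-of-K for breadth) can be made concrete with a small sketch. The function names and the trace are hypothetical; the only assumption is that each shot yields a score in [0, 1].

```typescript
// Depth continues one artifact, so it is scored on wherever it ends up.
function scoreDepth(progress: number[]): number {
  if (progress.length === 0) throw new Error("empty trace");
  return progress[progress.length - 1]!;
}

// Breadth runs K independent attempts and keeps the best one.
function scoreBreadth(samples: number[]): number {
  if (samples.length === 0) throw new Error("no samples");
  return Math.max(...samples);
}

// Same three numbers under both rules (illustrative values only).
const sameScores = [0.2, 0.6, 0.4];
const depthScore = scoreDepth(sameScores); // last element
const breadthScore = scoreBreadth(sameScores); // maximum
console.log({ depthScore, breadthScore });
```

For any given sequence of scores the maximum is at least the last element, so this rule can only favor breadth. A depth run that regresses after its peak is penalized, while breadth never is.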

The meta-finding: architecture matters

The hand-rolled flat loop (eops-gate.mts), a for-loop that bypasses the Supervisor and uses a hand-rolled steerer instead of observe(), gave noisy ties, a −9.9pp scoring artifact, and +6pp after the fix, never a clean result. The canonical loop (Supervisor + real observe()) gives +16.4pp SIGNIF on the same gym/model/n. Using what we built beats reinventing a worse version.

Caveats

n=16 (modest); single domain (itsm); single worker model; the breadth baseline of 56.6% has headroom. But the CI excludes 0 and the discordant pairs split 6-0, which points to a real effect. Next: power up to n=40 and GEPA/evolve the observe() analyst prompt on this canonical stack (fitness = the depth-vs-breadth lift).

Test

typecheck clean (0 errors); smoke (n=2) + full (n=16) ran end-to-end through the Supervisor against the live gym.

…() analyst Stops hand-rolling. Brings the Supervisor depth/breadth drivers forward (the general agentic primitive, d5aa85a) and makes them truly canonical: the depth STEERER is now agent-eval's observe() (makeFinding + ChatClient + the derived_from_judge firewall), not a hand-rolled chat call. The worker runs through the keystone (createSupervisor + Scope + Agent.act + scope.spawn shots), metered by the conserved budget pool (equal-k by construction), journaled. This is the loop system we built, end to end — depth (continue over one artifact, observe()-steered) vs breadth (parallel best-of), over the live EOPS gym. The +13.4pp branch hand-rolled its analyst; this is the first run with the REAL analyst. Fixed the executor-rename drift (LeafExecutor→Executor etc, #190). Smoke (n=2, deepseek-v4-pro) runs clean through the Supervisor; full n in flight. Replaces the throwaway flat-loop eops-gate prototype for the science.

drewstone merged commit a6e6534 into main Jun 9, 2026
1 check passed

drewstone deleted the feat/eops-canonical-supervisor branch June 9, 2026 02:26

drewstone merged 1 commit into main from feat/eops-canonical-supervisor
