
chore(examples): clearer names — drop confusing with- prefix #206

Merged
drewstone merged 8 commits into main from chore/examples-naming on Jun 9, 2026

@drewstone


Disciplined subset of the examples-naming audit. Renames the genuinely-confusing names; rejects the audit's 01-08 numbering + .deprecated/ quarantine (churn for throwaway examples; the README already orders them).

| old | new | why |
| --- | --- | --- |
| with-knowledge-readiness | knowledge-gating | `with-` reads as an optional toggle, not a primary pattern |
| with-intelligence-export | intelligence-export | same |
| agent-into-reviewer | pipe-into-reviewer | signals the 2-runtime piping it teaches |

Kept (audit wanted to rename, I pushed back): runtime-run (it teaches startRuntimeRun — the name matches the product API; renaming would disconnect it), agents-of-all-shapes (memorable + has a test).

git mv preserves history; examples/README.md + docs/concepts.md + all internal self-references updated; zero stragglers.

drewstone added 8 commits June 9, 2026 05:15
The analyst IS the steerer (observe()'s findings → recommended_action → the depth
steer), so optimizing the analyst prompt optimizes the loop. This evolves it with
agent-eval's REAL GEPA primitives (buildReflectionPrompt + parseReflectionResponse
+ paretoFrontier) — no hand-rolled optimizer; there is no turnkey runPromptEvolution
in agent-eval 0.83, only the primitives, so the population loop is thin orchestration
over them.

- observe(): + analystInstruction? override (the analyst prompt is now the GEPA knob);
  defaultAnalystInstruction exported. Firewall stays structural (input has no score).
- agentic.ts: AgenticOptions.analystInstruction threads into the depth steerer.
- eops-gepa.mts: FITNESS = depth-vs-breadth lift on the canonical Supervisor+observe
  gate; breadth computed ONCE per task (shared baseline, correct + halves cost);
  failing per-task lifts = the reflection gradient. Seeds = observe()'s PROVEN default
  (the +16.4pp instruction) FIRST, then the designer-panel population.

Smoke (N=2, 1 gen) validated the full loop: score → paretoFrontier select → reflect
→ mutate → re-score → pick. Bounded real run (N=6, 2 gens) in flight.
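
The select step of that loop can be illustrated with a minimal, self-contained sketch. It assumes each candidate analyst instruction carries one lift score per task. `Candidate`, `dominates`, and `paretoFront` are hypothetical names for illustration only; this is not agent-eval's `paretoFrontier` signature, which the commit does not show.

```typescript
// Hypothetical shape: one analyst instruction plus its per-task lift scores.
interface Candidate {
  instruction: string;
  lifts: number[]; // one entry per task, higher is better
}

// a dominates b when a is at least as good on every task and strictly better on one.
function dominates(a: Candidate, b: Candidate): boolean {
  const pairs = a.lifts.map(
    (v, i) => [v, b.lifts[i] ?? Number.NEGATIVE_INFINITY] as const,
  );
  return pairs.every(([x, y]) => x >= y) && pairs.some(([x, y]) => x > y);
}

// Keep every candidate that no other candidate dominates.
function paretoFront(pop: Candidate[]): Candidate[] {
  return pop.filter((c) => !pop.some((other) => other !== c && dominates(other, c)));
}

// Toy population (made-up numbers): "mutant-b" is dominated by "seed" and drops out.
const population: Candidate[] = [
  { instruction: "seed", lifts: [0.2, 0.1] },
  { instruction: "mutant-a", lifts: [0.3, 0.0] },
  { instruction: "mutant-b", lifts: [0.1, 0.05] },
];
const front = paretoFront(population).map((c) => c.instruction);
console.log(front);
```

Selecting on per-task dominance rather than on the mean keeps a candidate that wins on only some tasks, which is the property a reflection step can then exploit.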
… tasks)

The first real run died when the (long-lived) gym container wedged: breadth
baselines returned 0% then runAgentic threw 'every rollout went down', killing the
whole GEPA run. runAgentic is fail-loud; the GEPA loop now catches per-task: a task
whose rollouts fail is SKIPPED (not fatal), both in the breadth precompute and the
depth fitness. Fails loud only if <2 tasks survive (genuine infra-down). Pair with a
fresh gym container + WIDTH<=2.
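
A minimal sketch of the skip-not-fatal pattern described above, assuming a per-task scoring callback that throws when its rollouts fail. `scoreSurvivors` and its parameters are hypothetical and are not taken from eops-gepa.mts.

```typescript
// Hypothetical helper: score each task, skipping any task whose rollouts throw,
// and fail loudly only when too few tasks survive to compute a meaningful lift.
async function scoreSurvivors<T>(
  tasks: T[],
  scoreTask: (task: T) => Promise<number>,
  minSurvivors = 2,
): Promise<Map<T, number>> {
  const scores = new Map<T, number>();
  for (const task of tasks) {
    try {
      scores.set(task, await scoreTask(task));
    } catch (err) {
      // One wedged task must not kill the whole run: log it and move on.
      console.warn("skipping task after rollout failure:", err);
    }
  }
  if (scores.size < minSurvivors) {
    throw new Error(`only ${scores.size} task(s) survived; treating as infra-down`);
  }
  return scores;
}

// Toy scorer: the task named "wedged" simulates a container that went down.
const flaky = async (task: string): Promise<number> => {
  if (task === "wedged") throw new Error("every rollout went down");
  return task.length;
};
const survivors = scoreSurvivors(["alpha", "wedged", "beta"], flaky);
const tooFew = scoreSurvivors(["wedged", "alpha"], flaky).then(
  () => "resolved",
  () => "rejected",
);
```

The same helper would serve both the breadth precompute and the depth fitness, since both need the identical survive-or-skip rule.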
…type (−433 LOC)

It was a dead-end (nothing imports it): a hand-rolled flat loop that BYPASSED the
canonical Supervisor + a second copy of the gym client (6 functions duplicating
gym-agent.ts's 5). Fully superseded by the canonical stack — agentic.ts (domain-blind
depth/breadth/Supervisor/observe, 428 LOC, written ONCE) + the AgenticSurface seam
(agentic-eops.ts, 73 LOC = the entire per-domain slot-in). The +16.4pp result and the
GEPA harness run on the canonical path; this prototype only de-risked the plumbing
(gym standup, router-tools worker, depth-best scoring) and is now dead weight.
…naming + onboarding fixes

The pieces existed (Supervisor + observe + the depth/breadth strategies) but weren't
wrapped as a usable suite, and the vocabulary was opaque. runBenchmark is the packaged
front door:

  runBenchmark({ environment, tasks, worker, strategies: ['sample','refine'], budget })
    → runs each strategy, scores by the environment's own deployable check, returns the
      per-strategy means + the paired-bootstrap lift of refine over sample. printBenchmarkReport
      gives the verdict. Resilient to transient per-task infra (skip, don't crash).
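
How `runBenchmark` computes the paired-bootstrap lift is not shown here, so the following is only a generic sketch of the technique: resample the per-task score differences with replacement and read a percentile interval off the resampled means. `pairedBootstrapLift`, the seeded PRNG, and the toy scores are all assumptions for illustration.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Paired bootstrap: both strategies are scored on the SAME tasks, so we resample
// the per-task differences rather than the two score lists independently.
function pairedBootstrapLift(
  refineScores: number[],
  sampleScores: number[],
  resamples = 2000,
  seed = 1,
): { lift: number; ciLow: number; ciHigh: number } {
  if (refineScores.length !== sampleScores.length || refineScores.length === 0) {
    throw new Error("need equal-length, non-empty paired scores");
  }
  const diffs = refineScores.map((r, i) => r - (sampleScores[i] ?? 0));
  const lift = diffs.reduce((s, x) => s + x, 0) / diffs.length;
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < resamples; b++) {
    let sum = 0;
    for (let i = 0; i < diffs.length; i++) {
      sum += diffs[Math.floor(rand() * diffs.length)] ?? 0;
    }
    means.push(sum / diffs.length);
  }
  means.sort((x, y) => x - y);
  const at = (q: number) =>
    means[Math.min(means.length - 1, Math.floor(q * means.length))] ?? 0;
  return { lift, ciLow: at(0.025), ciHigh: at(0.975) };
}

// Toy pass/fail scores on six tasks (made-up data, not a real run).
const result = pairedBootstrapLift([1, 1, 0, 1, 1, 0], [0, 1, 0, 0, 1, 0]);
console.log(result);
```

Pairing matters because task difficulty varies far more than strategy quality; differencing per task removes that shared variance before the interval is estimated.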

Naming, made legible (public API; maps to internal depth/breadth — zero churn to the
running internals): a task domain is an `Environment` (the AgenticSurface seam under the
RL/gym-standard name); the strategies are `sample` (best-of-N / resample) and `refine`
(attempt → critic reads trace → steer → repeat), named by what they DO, not the search
tree's shape. Juniors call runBenchmark; seniors customize the hooks (worker.analystInstruction
= the critic, Environment.score = the check) or drop to runAgentic for new strategies.

Onboarding: deleted the orphaned empty examples/define-loop/ (defineLoop was removed in #194);
fixed the dead examples/model-resolution link in docs/concepts.md.
…ur own)

The question: when we collapse to "refine", can a dev create their OWN strategy?
Before: no — runAgentic took mode:'depth'|'breadth', a CLOSED enum. The capability
existed (a strategy is an Agent) but the door wasn't cut.

Now: `Strategy` is an exported interface — `{ name, driver(surface, task, opts, budget)
=> Agent }`. A strategy builds the driver Agent the Supervisor runs; author your own by
returning an Agent whose act() spawns shots/analysts via scope.spawn/next/send. `refine`
and `sample` ship as instances AND the reference driver implementations (depthDriver/
breadthDriver) are exported to copy. runAgentic accepts a `strategy` (mode kept for
back-compat); runBenchmark takes `Strategy[]` — pass the built-ins or your own.

What's under the words:
  sample = K independent attempts, keep the best-verifying (best-of-N / resample)
  refine = attempt → observe() reads the trace → steer the next → repeat (iterate)
A multi-agent "team" is just a Strategy whose driver spawns several different agents —
same recursive Agent atom, coordinated over the Scope.
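
The two behaviors can be sketched as plain functions over a toy worker. This is a synchronous illustration only, with no Supervisor, Scope, or Agent involved. `Attempt`, `Worker`, `Critic`, `sampleBest`, and `refineLoop` are hypothetical names and do not reflect the exported `Strategy` interface or the depthDriver/breadthDriver implementations.

```typescript
type Attempt = { output: string; score: number };
type Worker = (task: string, steer?: string) => Attempt;
type Critic = (attempt: Attempt) => string;

// sample: K independent attempts, keep the best-scoring one.
function sampleBest(task: string, worker: Worker, k: number): Attempt {
  let best = worker(task);
  for (let i = 1; i < k; i++) {
    const next = worker(task);
    if (next.score > best.score) best = next;
  }
  return best;
}

// refine: attempt, let the critic read it, steer the next attempt, repeat.
function refineLoop(task: string, worker: Worker, critic: Critic, k: number): Attempt {
  let current = worker(task);
  let best = current;
  let steer = "";
  for (let i = 1; i < k; i++) {
    const hint = critic(current);
    steer = steer ? `${steer};${hint}` : hint; // accumulate steers across rounds
    current = worker(task, steer);
    if (current.score > best.score) best = current;
  }
  return best;
}

// Toy worker: each accumulated hint is worth 0.25, so only steering helps.
const toyWorker: Worker = (_task, steer) => {
  const hints = steer ? steer.split(";").length : 0;
  return { output: `attempt with ${hints} hint(s)`, score: Math.min(1, 0.25 * (hints + 1)) };
};
const toyCritic: Critic = () => "be more specific";

const sampled = sampleBest("count to three", toyWorker, 3);
const refined = refineLoop("count to three", toyWorker, toyCritic, 3);
console.log(sampled.score, refined.score);
```

With a deterministic toy worker the gap is exaggerated: resampling never improves, while each critic round adds a hint. A real worker is stochastic, which is where `sample` earns its keep.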
… lines (skillifiable)

The original goal: loops compact enough to skillify, so agents author them. A 70-line
Supervisor driver isn't that. This adds the composable LEGO:

  defineStrategy(name, async ({ shot, critique, surface, budget }) => { ...compose... })

A strategy body gets two steps — shot() (one worker attempt over an artifact) and
critique() (the firewalled analyst reads the trace → a steer) — with ZERO Supervisor/
Scope/spawn/leaf/drainOne ceremony (all of it lives inside defineStrategy now). That is
the unit an agent or a skill can emit.

Proof: adaptiveRefine — a NEW strategy (refine, but ABANDON-and-restart when a steered
shot fails to improve = branch-when-stuck, the widen/MCTS idea the depth-stuck failure
motivated), authored entirely from the steps, scored keep-best. ~22 lines of pure
strategy logic, no plumbing.
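
A rough sketch of what an abandon-and-restart body might look like when written only against the two steps. The `defineStrategy` below is a local stand-in so the example runs on its own; the real one hides the Supervisor plumbing, and its step signatures, the `Shot` shape, and the budget accounting here are all assumptions rather than the shipped adaptiveRefine.

```typescript
type Shot = { artifact: string; score: number };
interface Steps {
  shot(steer?: string): Promise<Shot>; // one worker attempt, optionally steered
  critique(s: Shot): Promise<string>;  // analyst reads the attempt, returns a steer
  budget: number;                      // max number of shots (assumed meaning)
}
type Body = (steps: Steps) => Promise<Shot>;

// Local stand-in for illustration; NOT the library's defineStrategy.
function defineStrategy(name: string, body: Body) {
  return { name, run: (steps: Steps) => body(steps) };
}

const adaptiveRefineSketch = defineStrategy(
  "adaptiveRefineSketch",
  async ({ shot, critique, budget }) => {
    let current = await shot();
    let best = current;
    let used = 1;
    while (used < budget) {
      const steered = await shot(await critique(current));
      used++;
      if (steered.score > best.score) best = steered;
      if (steered.score > current.score) {
        current = steered; // the steer helped: keep refining this line
      } else if (used < budget) {
        current = await shot(); // stuck: abandon the line and restart fresh
        used++;
        if (current.score > best.score) best = current;
      }
    }
    return best; // keep-best over every shot taken
  },
);

// Toy steps: scripted scores, recording which shots were steered.
const scripted = [0.2, 0.5, 0.4, 0.3, 0.9];
const steersSeen: Array<string | undefined> = [];
const toySteps: Steps = {
  shot: async (steer) => {
    const score = scripted[steersSeen.length] ?? 0;
    steersSeen.push(steer);
    return { artifact: `v${steersSeen.length}`, score };
  },
  critique: async () => "tighten the answer",
  budget: 5,
};
const bestShot = adaptiveRefineSketch.run(toySteps);
```

In the scripted run the third shot fails to improve on the second, so the fourth shot is an unsteered restart and the fifth is steered from that fresh line.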

Behavior-preserving: the proven refine/sample drivers (depthDriver/breadthDriver) are
UNTOUCHED — the +16.4pp result + GEPA stay valid. The steps replicate their exact
spawn/drain pattern, so a step-authored strategy behaves identically. Typecheck-verified;
adaptiveRefine live-smoke pending the gym (GEPA has it).
…rs (gym-free, runnable)

The missing onboarding piece: a runnable demo of the whole suite on a toy "counter"
Environment (needs only a router key — no dataset, no sandbox). Shows all three layers:
  1. runBenchmark(env, …) — default strategies compared, free.
  2. strategies: [sample, refine, adaptiveRefine] — pick, named by behavior.
  3. defineStrategy('doubleCheck', body) — author your own in ~10 lines from shot()+critique(),
     zero Supervisor ceremony. The skillifiable unit.
Verified: runs end-to-end through the canonical Supervisor; all 4 strategies execute and
score via the Environment's own check. README documents the model + the customization hooks.
…larify intent

Disciplined subset of the examples-naming audit (NOT the proposed 01-08 numbering /
.deprecated quarantine — that's churn for throwaway examples and the README already
orders them):
  with-knowledge-readiness → knowledge-gating   (`with-` read as an optional toggle)
  with-intelligence-export → intelligence-export (same)
  agent-into-reviewer      → pipe-into-reviewer  (signals the 2-runtime piping)
KEPT runtime-run (it teaches startRuntimeRun — the name matches the product API) and
agents-of-all-shapes (memorable + has a test). git mv preserves history; README +
docs/concepts + all internal self-references updated; zero stragglers.
@drewstone drewstone merged commit 29aba34 into main Jun 9, 2026
1 check passed
@drewstone drewstone deleted the chore/examples-naming branch June 9, 2026 12:27