chore(examples): clearer names — drop confusing with- prefix#206
Merged
The analyst IS the steerer (observe()'s findings → recommended_action → the depth steer), so optimizing the analyst prompt optimizes the loop. This evolves it with agent-eval's REAL GEPA primitives (buildReflectionPrompt + parseReflectionResponse + paretoFrontier), not a hand-rolled optimizer. There is no turnkey runPromptEvolution in agent-eval 0.83, only the primitives, so the population loop is thin orchestration over them.
- observe(): + analystInstruction? override (the analyst prompt is now the GEPA knob); defaultAnalystInstruction exported. Firewall stays structural (input has no score).
- agentic.ts: AgenticOptions.analystInstruction threads into the depth steerer.
- eops-gepa.mts: FITNESS = depth-vs-breadth lift on the canonical Supervisor+observe gate; breadth computed ONCE per task (shared baseline, correct + halves cost); failing per-task lifts = the reflection gradient.
Seeds = observe()'s PROVEN default (the +16.4pp instruction) FIRST, then the designer-panel population.
Smoke (N=2, 1 gen) validated the full loop: score → paretoFrontier select → reflect → mutate → re-score → pick. Bounded real run (N=6, 2 gens) in flight.
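The "select" step of that loop keeps the candidate instructions that no other candidate beats on every task. A minimal sketch of that idea follows, assuming each candidate carries a vector of per-task lifts. `Candidate`, `dominates`, and `paretoFrontierSketch` are hypothetical names for illustration and are not agent-eval's `paretoFrontier` signature.

```typescript
// Hypothetical shape: one candidate analyst instruction plus its per-task lift.
interface Candidate {
  id: string;
  lifts: number[]; // lifts[i] = depth-vs-breadth lift on task i (same task order for all)
}

// a dominates b when a is at least as good on every task and strictly better on one.
function dominates(a: Candidate, b: Candidate): boolean {
  let strictlyBetter = false;
  for (let i = 0; i < a.lifts.length; i++) {
    if (a.lifts[i] < b.lifts[i]) return false;
    if (a.lifts[i] > b.lifts[i]) strictlyBetter = true;
  }
  return strictlyBetter;
}

// Keep every candidate that no other candidate dominates.
function paretoFrontierSketch(pop: Candidate[]): Candidate[] {
  return pop.filter((c) => !pop.some((other) => other !== c && dominates(other, c)));
}

// Made-up numbers, only to exercise the function.
const population: Candidate[] = [
  { id: "seed-default", lifts: [0.2, 0.0, 0.1] },
  { id: "panel-a", lifts: [0.1, 0.3, 0.0] },
  { id: "panel-b", lifts: [0.1, 0.0, 0.0] }, // dominated by seed-default
];
const frontier = paretoFrontierSketch(population).map((c) => c.id);
console.log(frontier);
```

Selecting on the frontier rather than on the mean keeps a candidate that wins on a single hard task, which is where the failing per-task lifts feed reflection.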
… tasks)
The first real run died when the (long-lived) gym container wedged: breadth baselines returned 0%, then runAgentic threw 'every rollout went down', killing the whole GEPA run. runAgentic is fail-loud; the GEPA loop now catches per task: a task whose rollouts fail is SKIPPED (not fatal), both in the breadth precompute and the depth fitness. It fails loud only if <2 tasks survive (genuine infra-down). Pair with a fresh gym container + WIDTH<=2.
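The skip-per-task policy can be sketched as below. This is a synchronous stand-in under stated assumptions: `scoreResilient`, `TaskResult`, and the `scoreTask` callback are hypothetical names, and the real rollouts would be asynchronous.

```typescript
interface TaskResult {
  taskId: string;
  score: number;
}

// Score each task; a task whose rollout throws is skipped rather than aborting the run.
function scoreResilient(
  taskIds: string[],
  scoreTask: (taskId: string) => number, // hypothetical stand-in for a per-task rollout
  minSurvivors = 2,
): { results: TaskResult[]; skipped: string[] } {
  const results: TaskResult[] = [];
  const skipped: string[] = [];
  for (const taskId of taskIds) {
    try {
      results.push({ taskId, score: scoreTask(taskId) });
    } catch {
      skipped.push(taskId); // transient failure: record it and move on
    }
  }
  // Too few survivors means the infrastructure is down, so fail loud.
  if (results.length < minSurvivors) {
    throw new Error(`only ${results.length} task(s) survived; infrastructure looks down`);
  }
  return { results, skipped };
}

// A scorer that fails on one task, mimicking a wedged container.
const flaky = (taskId: string): number => {
  if (taskId === "t2") throw new Error("every rollout went down");
  return taskId === "t1" ? 0.5 : 1.0;
};
const outcome = scoreResilient(["t1", "t2", "t3"], flaky);
console.log(outcome.skipped);
```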
…type (−433 LOC)
It was a dead-end (nothing imports it): a hand-rolled flat loop that BYPASSED the canonical Supervisor + a second copy of the gym client (6 functions duplicating gym-agent.ts's 5). Fully superseded by the canonical stack: agentic.ts (domain-blind depth/breadth/Supervisor/observe, 428 LOC, written ONCE) + the AgenticSurface seam (agentic-eops.ts, 73 LOC = the entire per-domain slot-in). The +16.4pp result and the GEPA harness run on the canonical path; this prototype only de-risked the plumbing (gym standup, router-tools worker, depth-best scoring) and is now dead weight.
…naming + onboarding fixes
The pieces existed (Supervisor + observe + the depth/breadth strategies) but weren't
wrapped as a usable suite, and the vocabulary was opaque. runBenchmark is the packaged
front door:
runBenchmark({ environment, tasks, worker, strategies: ['sample','refine'], budget })
→ runs each strategy, scores by the environment's own deployable check, returns the
per-strategy means + the paired-bootstrap lift of refine over sample. printBenchmarkReport
gives the verdict. Resilient to transient per-task infra (skip, don't crash).
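For readers unfamiliar with the term, a paired-bootstrap lift can be sketched as follows. This is the textbook percentile bootstrap over per-task score differences, not necessarily the exact procedure runBenchmark uses. `pairedBootstrapLift`, `LiftEstimate`, and the seeded `mulberry32` generator are assumptions introduced for illustration.

```typescript
// Small seeded PRNG (mulberry32) so the sketch is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface LiftEstimate {
  lift: number; // mean of (refine - sample) over tasks
  lo: number; // 2.5th percentile of resampled means
  hi: number; // 97.5th percentile of resampled means
}

// Both arrays must score the SAME tasks in the SAME order (that is the pairing).
function pairedBootstrapLift(
  refineScores: number[],
  sampleScores: number[],
  resamples = 2000,
  seed = 1,
): LiftEstimate {
  if (refineScores.length !== sampleScores.length || refineScores.length === 0) {
    throw new Error("need the same non-empty task list for both strategies");
  }
  const n = refineScores.length;
  const diffs = refineScores.map((r, i) => r - sampleScores[i]);
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < resamples; b++) {
    let sum = 0;
    // Resample task indices with replacement, keeping each pair together.
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  return {
    lift: diffs.reduce((s, d) => s + d, 0) / n,
    lo: means[Math.floor(0.025 * resamples)],
    hi: means[Math.min(resamples - 1, Math.floor(0.975 * resamples))],
  };
}

// Made-up pass/fail scores for six tasks, only to exercise the function.
const est = pairedBootstrapLift([1, 1, 0, 1, 1, 0], [0, 1, 0, 0, 1, 0]);
console.log(est);
```

Pairing matters because both strategies see the same tasks: resampling the per-task differences removes task difficulty from the variance of the estimate.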
Naming, made legible (public API; maps to internal depth/breadth — zero churn to the
running internals): a task domain is an `Environment` (the AgenticSurface seam under the
RL/gym-standard name); the strategies are `sample` (best-of-N / resample) and `refine`
(attempt → critic reads trace → steer → repeat), named by what they DO, not the search
tree's shape. Juniors call runBenchmark; seniors customize the hooks (worker.analystInstruction
= the critic, Environment.score = the check) or drop to runAgentic for new strategies.
Onboarding: deleted the orphaned empty examples/define-loop/ (defineLoop removed #194);
fixed the dead examples/model-resolution link in docs/concepts.md.
…ur own)
The question: when we collapse to "refine", can a dev create their OWN strategy?
Before: no — runAgentic took mode:'depth'|'breadth', a CLOSED enum. The capability
existed (a strategy is an Agent) but the door wasn't cut.
Now: `Strategy` is an exported interface — `{ name, driver(surface, task, opts, budget)
=> Agent }`. A strategy builds the driver Agent the Supervisor runs; author your own by
returning an Agent whose act() spawns shots/analysts via scope.spawn/next/send. `refine`
and `sample` ship as instances AND the reference driver implementations (depthDriver/
breadthDriver) are exported to copy. runAgentic accepts a `strategy` (mode kept for
back-compat); runBenchmark takes `Strategy[]` — pass the built-ins or your own.
What's under the words:
sample = K independent attempts, keep the best-verifying (best-of-N / resample)
refine = attempt → observe() reads the trace → steer the next → repeat (iterate)
A multi-agent "team" is just a Strategy whose driver spawns several different agents —
same recursive Agent atom, coordinated over the Scope.
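The difference between the two built-ins can be sketched on a toy number-guessing task. Everything here is a simplified, hypothetical stand-in: the real `Strategy` returns a driver Agent that the Supervisor runs, whereas `ToyStrategy` below just returns its best score, and `ToyEnvironment` is not the real `Environment` interface.

```typescript
// Hypothetical, simplified stand-ins for Environment and Strategy.
interface ToyEnvironment {
  attempt(task: number, steer: number): { artifact: number; trace: string };
  score(task: number, artifact: number): number; // 1 = exact match, falls toward 0
}
interface ToyStrategy {
  name: string;
  run(env: ToyEnvironment, task: number, budget: number): number;
}

// Toy task: hit a hidden target integer. The trace says only "too low"/"too high"/"ok".
const toyEnv: ToyEnvironment = {
  attempt(task, steer) {
    const trace = steer < task ? "too low" : steer > task ? "too high" : "ok";
    return { artifact: steer, trace };
  },
  score(task, artifact) {
    return 1 / (1 + Math.abs(task - artifact));
  },
};

// sample: K independent attempts, keep the best-verifying one.
const sample: ToyStrategy = {
  name: "sample",
  run(env, task, budget) {
    let best = 0;
    for (let k = 0; k < budget; k++) {
      const { artifact } = env.attempt(task, k * 4); // fixed guesses, no feedback used
      best = Math.max(best, env.score(task, artifact));
    }
    return best;
  },
};

// refine: attempt, let a critic read the trace, steer the next attempt, repeat.
const refine: ToyStrategy = {
  name: "refine",
  run(env, task, budget) {
    let lo = 0;
    let hi = 16;
    let best = 0;
    for (let k = 0; k < budget; k++) {
      const steer = Math.floor((lo + hi) / 2);
      const { artifact, trace } = env.attempt(task, steer);
      best = Math.max(best, env.score(task, artifact));
      // The critic reads only the trace (never the score) and narrows the steer.
      if (trace === "too low") lo = steer + 1;
      else if (trace === "too high") hi = steer - 1;
      else break;
    }
    return best;
  },
};

console.log(sample.run(toyEnv, 7, 5), refine.run(toyEnv, 7, 5));
```

The toy is rigged so that trace feedback helps; it illustrates the control flow only and says nothing about the measured lift.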
… lines (skillifiable)
The original goal: loops compact enough to skillify, so agents author them. A 70-line
Supervisor driver isn't that. This adds the composable LEGO:
defineStrategy(name, async ({ shot, critique, surface, budget }) => { ...compose... })
A strategy body gets two steps — shot() (one worker attempt over an artifact) and
critique() (the firewalled analyst reads the trace → a steer) — with ZERO Supervisor/
Scope/spawn/leaf/drainOne ceremony (all of it lives inside defineStrategy now). That is
the unit an agent or a skill can emit.
Proof: adaptiveRefine — a NEW strategy (refine, but ABANDON-and-restart when a steered
shot fails to improve = branch-when-stuck, the widen/MCTS idea the depth-stuck failure
motivated), authored entirely from the steps, scored keep-best. ~22 lines of pure
strategy logic, no plumbing.
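The branch-when-stuck idea can be sketched in a few lines over two injected steps. This is an illustration under assumptions, not the shipped adaptiveRefine: `Steps`, `Shot`, and `adaptiveRefineSketch` are hypothetical, the steps are synchronous here while the real ones are async, and it assumes the strategy body may compare shot scores while the critic sees only the trace.

```typescript
interface Shot {
  trace: string;
  score: number;
}
// Hypothetical, synchronous versions of the two steps a strategy body receives.
interface Steps {
  shot: (steer?: string) => Shot; // one worker attempt, optionally steered
  critique: (trace: string) => string; // analyst reads the trace, returns a steer
  budget: number; // maximum number of shots
}

// Refine while steering helps; abandon the line and restart fresh when it does not.
function adaptiveRefineSketch({ shot, critique, budget }: Steps): number {
  let used = 0;
  const take = (steer?: string): Shot => {
    used++;
    return shot(steer);
  };
  let current = take(); // always takes at least one fresh shot
  let best = current.score; // keep-best scoring
  while (used < budget) {
    const next = take(critique(current.trace));
    best = Math.max(best, next.score);
    if (next.score > current.score) {
      current = next; // the steer helped: keep refining this line
    } else if (used < budget) {
      current = take(); // stuck: abandon and restart from a fresh shot
      best = Math.max(best, current.score);
    }
  }
  return best;
}

// Scripted stub worker: scores come from a fixed list, whatever the steer says.
const scripted = [0.2, 0.2, 0.4, 0.9];
const kinds: string[] = [];
let calls = 0;
const bestScore = adaptiveRefineSketch({
  shot: (steer) => {
    kinds.push(steer === undefined ? "fresh" : "steered");
    const score = scripted[calls++];
    return { trace: `attempt ${calls}`, score };
  },
  critique: (trace) => `steer after ${trace}`,
  budget: 4,
});
console.log(bestScore, kinds);
```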
Behavior-preserving: the proven refine/sample drivers (depthDriver/breadthDriver) are
UNTOUCHED — the +16.4pp result + GEPA stay valid. The steps replicate their exact
spawn/drain pattern, so a step-authored strategy behaves identically. Typecheck-verified;
adaptiveRefine live-smoke pending the gym (GEPA has it).
…rs (gym-free, runnable)
The missing onboarding piece: a runnable demo of the whole suite on a toy "counter"
Environment (needs only a router key — no dataset, no sandbox). Shows all three layers:
1. runBenchmark(env, …) — default strategies compared, free.
2. strategies: [sample, refine, adaptiveRefine] — pick, named by behavior.
3. defineStrategy('doubleCheck', body) — author your own in ~10 lines from shot()+critique(),
zero Supervisor ceremony. The skillifiable unit.
Verified: runs end-to-end through the canonical Supervisor; all 4 strategies execute and
score via the Environment's own check. README documents the model + the customization hooks.
…larify intent
Disciplined subset of the examples-naming audit (NOT the proposed 01-08 numbering / .deprecated quarantine: that's churn for throwaway examples and the README already orders them):
with-knowledge-readiness → knowledge-gating (`with-` read as an optional toggle)
with-intelligence-export → intelligence-export (same)
agent-into-reviewer → pipe-into-reviewer (signals the 2-runtime piping)
KEPT runtime-run (it teaches startRuntimeRun; the name matches the product API) and agents-of-all-shapes (memorable + has a test).
git mv preserves history; README + docs/concepts + all internal self-references updated; zero stragglers.
Disciplined subset of the examples-naming audit. Renames the genuinely confusing names; rejects the audit's 01-08 numbering + .deprecated/ quarantine (churn for throwaway examples; the README already orders them).
with-knowledge-readiness → knowledge-gating (with- reads as an optional toggle, not a primary pattern)
with-intelligence-export → intelligence-export
agent-into-reviewer → pipe-into-reviewer
Kept (audit wanted to rename, I pushed back): runtime-run (it teaches startRuntimeRun; the name matches the product API, and renaming would disconnect it), agents-of-all-shapes (memorable + has a test).
git mv preserves history; examples/README.md + docs/concepts.md + all internal self-references updated; zero stragglers.