chore(examples): clearer names — drop confusing `with-` prefix by drewstone · Pull Request #206 · tangle-network/agent-runtime

drewstone · 2026-06-09T12:26:04Z

Disciplined subset of the examples-naming audit. Renames the genuinely-confusing names; rejects the audit's 01-08 numbering + .deprecated/ quarantine (churn for throwaway examples; the README already orders them).

old	new	why
`with-knowledge-readiness`	`knowledge-gating`	`with-` reads as an optional toggle, not a primary pattern
`with-intelligence-export`	`intelligence-export`	same
`agent-into-reviewer`	`pipe-into-reviewer`	signals the 2-runtime piping it teaches

Kept (audit wanted to rename, I pushed back): runtime-run (it teaches startRuntimeRun — the name matches the product API; renaming would disconnect it), agents-of-all-shapes (memorable + has a test).

git mv preserves history; examples/README.md + docs/concepts.md + all internal self-references updated; zero stragglers.

The analyst IS the steerer (observe()'s findings → recommended_action → the depth steer), so optimizing the analyst prompt optimizes the loop. This evolves it with agent-eval's REAL GEPA primitives (buildReflectionPrompt + parseReflectionResponse + paretoFrontier) — no hand-rolled optimizer; there is no turnkey runPromptEvolution in agent-eval 0.83, only the primitives, so the population loop is thin orchestration over them. - observe(): + analystInstruction? override (the analyst prompt is now the GEPA knob); defaultAnalystInstruction exported. Firewall stays structural (input has no score). - agentic.ts: AgenticOptions.analystInstruction threads into the depth steerer. - eops-gepa.mts: FITNESS = depth-vs-breadth lift on the canonical Supervisor+observe gate; breadth computed ONCE per task (shared baseline, correct + halves cost); failing per-task lifts = the reflection gradient. Seeds = observe()'s PROVEN default (the +16.4pp instruction) FIRST, then the designer-panel population. Smoke (N=2, 1 gen) validated the full loop: score → paretoFrontier select → reflect → mutate → re-score → pick. Bounded real run (N=6, 2 gens) in flight.

… tasks) The first real run died when the (long-lived) gym container wedged: breadth baselines returned 0% then runAgentic threw 'every rollout went down', killing the whole GEPA run. runAgentic is fail-loud; the GEPA loop now catches per-task: a task whose rollouts fail is SKIPPED (not fatal), both in the breadth precompute and the depth fitness. Fails loud only if <2 tasks survive (genuine infra-down). Pair with a fresh gym container + WIDTH<=2.

…type (−433 LOC) It was a dead-end (nothing imports it): a hand-rolled flat loop that BYPASSED the canonical Supervisor + a second copy of the gym client (6 functions duplicating gym-agent.ts's 5). Fully superseded by the canonical stack — agentic.ts (domain-blind depth/breadth/Supervisor/observe, 428 LOC, written ONCE) + the AgenticSurface seam (agentic-eops.ts, 73 LOC = the entire per-domain slot-in). The +16.4pp result and the GEPA harness run on the canonical path; this prototype only de-risked the plumbing (gym standup, router-tools worker, depth-best scoring) and is now dead weight.

…naming + onboarding fixes The pieces existed (Supervisor + observe + the depth/breadth strategies) but weren't wrapped as a usable suite, and the vocabulary was opaque. runBenchmark is the packaged front door: runBenchmark({ environment, tasks, worker, strategies: ['sample','refine'], budget }) → runs each strategy, scores by the environment's own deployable check, returns the per-strategy means + the paired-bootstrap lift of refine over sample. printBenchmarkReport gives the verdict. Resilient to transient per-task infra (skip, don't crash). Naming, made legible (public API; maps to internal depth/breadth — zero churn to the running internals): a task domain is an `Environment` (the AgenticSurface seam under the RL/gym-standard name); the strategies are `sample` (best-of-N / resample) and `refine` (attempt → critic reads trace → steer → repeat), named by what they DO, not the search tree's shape. Juniors call runBenchmark; seniors customize the hooks (worker.analystInstruction = the critic, Environment.score = the check) or drop to runAgentic for new strategies. Onboarding: deleted the orphaned empty examples/define-loop/ (defineLoop removed #194); fixed the dead examples/model-resolution link in docs/concepts.md.

…ur own) The question: when we collapse to "refine", can a dev create their OWN strategy? Before: no — runAgentic took mode:'depth'|'breadth', a CLOSED enum. The capability existed (a strategy is an Agent) but the door wasn't cut. Now: `Strategy` is an exported interface — `{ name, driver(surface, task, opts, budget) => Agent }`. A strategy builds the driver Agent the Supervisor runs; author your own by returning an Agent whose act() spawns shots/analysts via scope.spawn/next/send. `refine` and `sample` ship as instances AND the reference driver implementations (depthDriver/ breadthDriver) are exported to copy. runAgentic accepts a `strategy` (mode kept for back-compat); runBenchmark takes `Strategy[]` — pass the built-ins or your own. What's under the words: sample = K independent attempts, keep the best-verifying (best-of-N / resample) refine = attempt → observe() reads the trace → steer the next → repeat (iterate) A multi-agent "team" is just a Strategy whose driver spawns several different agents — same recursive Agent atom, coordinated over the Scope.

… lines (skillifiable) The original goal: loops compact enough to skillify, so agents author them. A 70-line Supervisor driver isn't that. This adds the composable LEGO: defineStrategy(name, async ({ shot, critique, surface, budget }) => { ...compose... }) A strategy body gets two steps — shot() (one worker attempt over an artifact) and critique() (the firewalled analyst reads the trace → a steer) — with ZERO Supervisor/ Scope/spawn/leaf/drainOne ceremony (all of it lives inside defineStrategy now). That is the unit an agent or a skill can emit. Proof: adaptiveRefine — a NEW strategy (refine, but ABANDON-and-restart when a steered shot fails to improve = branch-when-stuck, the widen/MCTS idea the depth-stuck failure motivated), authored entirely from the steps, scored keep-best. ~22 lines of pure strategy logic, no plumbing. Behavior-preserving: the proven refine/sample drivers (depthDriver/breadthDriver) are UNTOUCHED — the +16.4pp result + GEPA stay valid. The steps replicate their exact spawn/drain pattern, so a step-authored strategy behaves identically. Typecheck-verified; adaptiveRefine live-smoke pending the gym (GEPA has it).

…rs (gym-free, runnable) The missing onboarding piece: a runnable demo of the whole suite on a toy "counter" Environment (needs only a router key — no dataset, no sandbox). Shows all three layers: 1. runBenchmark(env, …) — default strategies compared, free. 2. strategies: [sample, refine, adaptiveRefine] — pick, named by behavior. 3. defineStrategy('doubleCheck', body) — author your own in ~10 lines from shot()+critique(), zero Supervisor ceremony. The skillifiable unit. Verified: runs end-to-end through the canonical Supervisor; all 4 strategies execute and score via the Environment's own check. README documents the model + the customization hooks.

…larify intent Disciplined subset of the examples-naming audit (NOT the proposed 01-08 numbering / .deprecated quarantine — that's churn for throwaway examples and the README already orders them): with-knowledge-readiness → knowledge-gating (`with-` read as an optional toggle) with-intelligence-export → intelligence-export (same) agent-into-reviewer → pipe-into-reviewer (signals the 2-runtime piping) KEPT runtime-run (it teaches startRuntimeRun — the name matches the product API) and agents-of-all-shapes (memorable + has a test). git mv preserves history; README + docs/concepts + all internal self-references updated; zero stragglers.

drewstone added 8 commits June 9, 2026 05:15

drewstone merged commit 29aba34 into main Jun 9, 2026
1 check passed

drewstone deleted the chore/examples-naming branch June 9, 2026 12:27

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore(examples): clearer names — drop confusing `with-` prefix#206

chore(examples): clearer names — drop confusing `with-` prefix#206
drewstone merged 8 commits into
mainfrom
chore/examples-naming

drewstone commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

drewstone commented Jun 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant