diff --git a/docs/research/README.md b/docs/research/README.md
index 0832aae..adf13b7 100644
--- a/docs/research/README.md
+++ b/docs/research/README.md
@@ -27,6 +27,23 @@ spine happen explicitly, with `file:line` anchors, once a design ships.
 | [codex-techniques-audit.md](./codex-techniques-audit.md) | Adoption report mining OpenAI Codex for succinct-code principles + orchestration techniques. **Advisory** — verify `file:line` before acting. |
 | [loop-facade-postmortem.md](./loop-facade-postmortem.md) | Failure record for the deleted `defineLoop` facade: why retyping `Scope`/MCP/journals/validators produced code without substrate proof, and the prevention rule for future loop APIs. |
 
+### The optimization-space suite (2026-06-09)
+
+The strategy map + per-layer stress tests, written after the steering/GEPA gate series.
+Start at the index; each layer doc carries its own evidence table, strongest objections,
+and concrete next experiments.
+
+| Doc | What it holds |
+|-----|---------------|
+| [optimization-space.md](./optimization-space.md) | **The index.** The 6-axis taxonomy (timescale · target · objective · validity scope · serving architecture · authorship), the evidence map (which cells are measured/null/empty), the canon-compatibility audit, and the ranked experiment portfolio. |
+| [layer-within-run.md](./layer-within-run.md) | Within-run optimization — the settled boundary law (steering negative on stateless, positive on stateful+keep-best), the two engineering laws (checkpointing; architecture-is-a-variable), and the one open lever (topology tournament). |
+| [layer-across-run.md](./layer-across-run.md) | **The unmeasured thesis (n=0).** The corpus flywheel: primed-vs-cold A/B design, the four falsifiers (context pollution, stale facts, judge leakage, worker disregard), and why this layer dominates the portfolio. |
+| [layer-economics.md](./layer-economics.md) | Multi-objective + cost: the largest practice-vs-canon inconsistency (all gates single-objective; canon mandates the vector), the lift-per-dollar frontier, and the tool-augmentation effect (+70pp) that dominates everything else measured. |
+| [layer-domain-generality.md](./layer-domain-generality.md) | The n=1-domain exposure of the headline result; the nearly-free cross-domain replication (csm/hr gym splits); why itsm may be idiosyncratic; the product-transfer falsifier. |
+| [layer-intelligence-serving.md](./layer-intelligence-serving.md) | Self-hosted vs platform-served intelligence: Tangle Intelligence is export-only today; the timescale split (in-loop critic local, across-run memory served); the four-gap list incl. the **server-side judge firewall** as the non-negotiable. |
+| [layer-agent-authored.md](./layer-agent-authored.md) | Skillification: agent-authored strategies via `defineStrategy`, the two structural safety properties (conserved budget, firewall), and the R0→R3 success ladder for the strategy-author skill. |
+| [product-integration-playbook.md](./product-integration-playbook.md) | **The operator playbook.** The 8-step product integration sequence (gtm first), the consolidated human-role table (what only operators do), the three packaging gaps (publish the suite, corpus inflow, product Environments), and fleet sequencing. |
+
 ## Source artifacts (multi-agent passes)
 
 | Run | Pass | Result lands in |
diff --git a/docs/research/layer-across-run.md b/docs/research/layer-across-run.md
new file mode 100644
index 0000000..c56465b
--- /dev/null
+++ b/docs/research/layer-across-run.md
@@ -0,0 +1,68 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** THE unmeasured thesis — n=0, highest priority
+
+# Layer: across-run learning (the flywheel)
+
+**The claim under test:** run N+1 is measurably better than run N because the system
+*learned* from run N — the corpus of trace-derived findings primes future runs. This is
+the canon's success criterion verbatim (architecture §0.5.4: "the across-run curve is
+RSI, and it is THE success criterion (Gate B)"; learning-flywheel §1).
+
+## Status: the embarrassing asymmetry
+
+Within-run mechanics have ~6 adequately-powered measurements (mostly null/negative).
+Across-run learning has **zero**. The machinery is wired (`observe()` → `Corpus` →
+`renderCorpusToInstructions` → next-run priming; demonstrated live in `fleet.mts`,
+"carrying 2 prior learnings"), but the *benefit* has never been measured. The ledger has
+called the primed-vs-cold A/B "the cheap test that makes it pay rent" since 2026-06-08.
+
+## The experiment (designed, runnable now)
+
+**Primed-vs-cold at equal budget.** Two arms over the same task stream (EOPS split, or
+ideally a *sequence* so learning can accumulate):
+- **cold**: every run starts fresh (the canonical loop as measured).
+- **primed**: before each run, `corpus.query(task tags)` → top-k high-confidence facts
+  injected into the worker/analyst context; after each run, `observe()` appends.
+
+Score both with the same deployable verifier; the metric is the **slope** (does primed's
+advantage *grow* over the stream — the flywheel signature) and the endpoint lift. Frozen
+holdout: a final disjoint slice where primed keeps its corpus but cold stays cold.
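The slope metric described above can be made concrete with a paired least-squares fit. The sketch below is a minimal illustration, not the repository's harness: `PairedScore`, `FlywheelSummary`, and `summarizeFlywheel` are hypothetical names, and it assumes both arms are scored on the same tasks, in stream order, with a verifier score in [0, 1].

```typescript
// Hypothetical shape: one paired observation per task in the stream.
interface PairedScore {
  index: number;  // position of the task in the stream (0-based)
  cold: number;   // verifier score for the cold arm, assumed in [0, 1]
  primed: number; // verifier score for the primed arm, assumed in [0, 1]
}

interface FlywheelSummary {
  meanLift: number; // mean of (primed - cold) over the stream
  slope: number;    // least-squares slope of (primed - cold) against index
}

// Fits a straight line to the per-task advantage of primed over cold.
function summarizeFlywheel(pairs: PairedScore[]): FlywheelSummary {
  if (pairs.length < 2) {
    throw new Error("need at least two paired tasks to fit a slope");
  }
  const n = pairs.length;
  const meanX = pairs.reduce((acc, p) => acc + p.index, 0) / n;
  const meanD = pairs.reduce((acc, p) => acc + (p.primed - p.cold), 0) / n;

  let sxy = 0; // covariance numerator between index and advantage
  let sxx = 0; // variance numerator of index
  for (const p of pairs) {
    const dx = p.index - meanX;
    sxy += dx * (p.primed - p.cold - meanD);
    sxx += dx * dx;
  }
  if (sxx === 0) {
    throw new Error("all stream indices are identical; slope is undefined");
  }
  return { meanLift: meanD, slope: sxy / sxx };
}
```

A positive `slope` together with a positive `meanLift` would match the flywheel signature described above; a positive `meanLift` with a flat slope would point to a one-time priming benefit rather than accumulation. A real analysis would also need an interval on both numbers (for example a paired bootstrap) before drawing a verdict.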
+
+Falsifiers to design against (the stress test):
+1. **Context pollution** — injected facts displace task-relevant context and *hurt*
+   (the FinSearch lesson: workers got advice and ignored it; fleet.mts observed the
+   same). Mitigate: cap k, relevance-rank, measure a k=0/2/5 dose curve.
+2. **Stale facts** — the gym DB resets per task; "learnings" about *instances* are
+   noise, only *procedural* learnings transfer ("verify before mutate", "SLA must be
+   relinked after priority change"). The corpus schema already separates `area`/`claim`;
+   the A/B should tag procedural-vs-instance and report both.
+3. **Judge leakage** — corpus facts must remain trace-derived (`derived_from_judge:
+   false` is enforced structurally in `observe()`); a primed win that came from leaked
+   verdicts would be Goodhart, not learning.
+4. **Worker disregard** — measured before (advice ignored). Track *uptake*: did the
+   worker's tool sequence change in the direction of the injected fact?
+
+## Why this layer dominates the portfolio
+
+- It is the **stated product** ("the moat is the cross-benchmark learning flywheel",
+  architecture §8) and the only layer whose success directly justifies the corpus, the
+  judge discipline, and the RSI framing.
+- The within-run results make it *more* urgent, not less: if adaptive compute inside a
+  run is mostly worthless, the entire bet collapses onto memory across runs.
+- It is the natural junction with **Tangle Intelligence** (see
+  `layer-intelligence-serving.md`): a positive primed-vs-cold result is simultaneously
+  the proof that a hosted corpus/findings service has product value — the same
+  experiment, two strategic answers.
+
+## Expansion beyond the first A/B
+
+- **Retrieval-steered analyst**: the analyst's context includes findings from *past
+  similar failures* (corpus query keyed on the current trace), not just the current
+  trace — the cross-run version of `observe()`.
+- **Cross-benchmark transfer** (the full Gate B): learn on EOPS-itsm, measure lift on
+  csm/hr — does *procedural* knowledge transfer across domains? This is the actual moat
+  claim and it has a concrete falsifier (instance-knowledge won't transfer; procedural
+  might).
+- **Corpus curation as the optimization target**: once priming shows any lift, *what to
+  keep* (confidence thresholds, decay, dedup) becomes the GEPA-optimizable surface —
+  optimizing memory instead of prompts. Note this is exactly where the prompt-GEPA
+  machinery transfers after its within-run null.
diff --git a/docs/research/layer-agent-authored.md b/docs/research/layer-agent-authored.md
new file mode 100644
index 0000000..07c0487
--- /dev/null
+++ b/docs/research/layer-agent-authored.md
@@ -0,0 +1,78 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** newly feasible — the skillification goal, unmeasured
+
+# Layer: agent-authored optimization (skillification)
+
+**The claim under test:** agents can author the optimization machinery themselves —
+read a run's failures, write a *new strategy* (code, not prompt), and have it gated like
+any human-built candidate. This is the stated product goal ("skillify the process so
+agents develop these complex things") and the literal RSI claim, one level up from
+prompt mutation.
+
+## Why this just became feasible
+
+Before `defineStrategy`, a strategy was a ~70-line Supervisor driver (spawn/scope/
+journal ceremony) — not a unit any agent emits reliably. Now a strategy is a **~20-line
+body composing two steps** (`shot()`, `critique()`) with the ceremony hidden, proven by
+`adaptiveRefine` (branch-when-stuck, authored from the steps, runs through the canonical
+gate). The skillifiable unit exists; what's missing is the skill and the measurement.
+
+## The two safety properties that make agent authorship sound
+
+These are structural, not policy — which is what makes this layer credible at all:
+
+1. **Equal-compute by construction.** Any authored strategy spends through the
+   Supervisor's conserved budget pool — it *cannot* win by spending more (the
+   anti-confound invariant the keystone was built for).
+2. **The firewall is structural.** A strategy body composes `shot`/`critique`; it never
+   receives the verifiers or expected values. An authored strategy can be wrong but
+   cannot Goodhart the check — the judge stays write-only regardless of who wrote the
+   code.
+
+Residual risks that are NOT structurally covered: infinite-loop bodies (cap: the budget
+pool exhausts → spawn refused → strategy ends), environment abuse via tool calls (same
+exposure as any worker — the Environment's own tool surface is the boundary), and
+plain bad code (gate + holdout catches uselessness; typecheck catches breakage).
+
+## The experiment (the strategy-author skill)
+
+A skill/agent given: the `defineStrategy` contract + the two steps' docs + a run's
+**losses** (per-task: breadth score, depth score, trajectory — already emitted by the
+GEPA fitness fn) — asked to author one new strategy attacking the observed failure
+mode. The authored strategy enters the same tournament as human-built ones
+(`runBenchmark`, n≥24, frozen holdout).
+
+Success ladder (each rung independently informative):
+- **R0** — the agent emits a strategy that typechecks and completes the gate. (Pure
+  feasibility; expect pass.)
+- **R1** — an authored strategy beats `sample` on the holdout. (Parity with human
+  baseline quality.)
+- **R2** — an authored strategy beats the best *human* strategy on the holdout. (The
+  actual RSI-one-level-up claim.)
+- **R3** — iterated: feed the authored strategy's own losses back; does generation 2
+  beat generation 1? (GEPA-over-code; this is meta-harness's territory and should run
+  through that skill's discipline — stable baseline + product-value claim — not a
+  hand-rolled loop.)
+
+## Stress test
+
+- *"Isn't this just GEPA with a bigger search space?"* Materially different: prompt
+  space was measured flat (holdout tie); *program* space contains things prompts cannot
+  express (branch-when-stuck, restart policies, multi-artifact coordination, team
+  topologies). The prior is genuinely open.
+- *"LLMs write plausible-broken control flow."* R0 exists precisely to measure the
+  emission reliability before claiming anything; the gate absorbs broken candidates as
+  scored losses, not crashes (the resilient harness skips, never dies).
+- *"Multi-agent teams?"* Same unit: a "team" is a strategy whose body spawns several
+  *different* agents and arbitrates — the recursive atom already expresses it; the skill
+  just needs one team-shaped example in its docs.
+- *"Why a skill rather than a workflow?"* The skill is the productization: it travels to
+  any repo with the substrate, and it is the artifact that makes "agents develop these
+  complex things themselves" true for users, not just for this bench.
+
+## Order of operations
+
+1. Write the strategy-author skill (input: losses + contract; output: a
+   `defineStrategy` file + rationale). Small.
+2. R0/R1 on the existing EOPS gate (cheap, reuses everything).
+3. R2 tournament: authored vs `refine` vs `adaptiveRefine` vs `sample`, n≥24 + holdout.
+4. R3 only through `meta-harness` discipline, gated on R2 signal.
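The equal-compute property and the infinite-loop cap from the safety section of this doc can be illustrated with a conserved pool that refuses spawns once it is exhausted. This is a minimal sketch under stated assumptions: `BudgetPool`, `trySpend`, and `runGreedyStrategy` are hypothetical names introduced for illustration, and the real Supervisor budget API is not shown in this doc.

```typescript
// Hypothetical conserved pool: every spawn must be paid for from a fixed total.
class BudgetPool {
  private remaining: number;

  constructor(total: number) {
    if (!Number.isFinite(total) || total < 0) {
      throw new Error("total must be a non-negative finite number");
    }
    this.remaining = total;
  }

  // Reserve `cost` units for one spawn. Returns false (a refusal) when the
  // cost is not positive or the pool cannot cover it; nothing is deducted then.
  trySpend(cost: number): boolean {
    if (!(cost > 0) || cost > this.remaining) {
      return false;
    }
    this.remaining -= cost;
    return true;
  }

  get left(): number {
    return this.remaining;
  }
}

// A deliberately unbounded strategy body: it has no stopping rule of its own,
// so only the pool's refusal ends it.
function runGreedyStrategy(pool: BudgetPool, costPerShot: number): number {
  let shots = 0;
  while (pool.trySpend(costPerShot)) {
    shots += 1; // stand-in for one shot()/critique() step
  }
  return shots;
}
```

Because spending is the only way to take a step, two strategies handed pools of the same size cannot differ in total compute, whatever their control flow looks like. Refusing non-positive costs is a design choice in this sketch: it keeps a zero-cost step from turning the loop into a genuinely infinite one.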
diff --git a/docs/research/layer-domain-generality.md b/docs/research/layer-domain-generality.md
new file mode 100644
index 0000000..28e849c
--- /dev/null
+++ b/docs/research/layer-domain-generality.md
@@ -0,0 +1,63 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** n=1 domain — the headline result's biggest validity risk
+
+# Layer: domain generality and product transfer
+
+**The claim under test:** the boundary law ("steering wins on stateful agentic work")
+and the +16.4pp depth result generalize beyond EOPS-itsm — across gym domains, across
+task families, and ultimately to live products.
+
+## The exposure
+
+Every positive steering result in this program sits on **one domain**: EOPS *itsm*
+(ServiceNow ticket ops, SQL-state verifiers). The negatives sit on two stateless domains
+(FinSearchComp, HumanEval). So the "boundary law" is interpolated from 3 points, and the
+product thesis ("depth wins on ops-like agentic work") rests on n=1 domain, n=1 gym,
+n=2 models. The canon's own discipline (eval-substrate: paired stats, honest scoping)
+demands this be named: **the law is a hypothesis with one supporting stateful domain.**
+
+## The cheap replication (nearly free)
+
+`gym_dbs.zip` ships **eight** domain splits: itsm, csm, hr, email, drive, calendar,
+teams, hybrid — same container, same MCP/verifier machinery, same `Environment`
+implementation (`agentic-eops.ts` is domain-blind; only the HF split name changes). A
+cross-domain run is a config change:
+
+- **Experiment:** canonical depth-vs-breadth (Supervisor + observe, keep-best) on csm +
+  hr at n≥16 each, same model.
+- **Outcomes:** (a) replicates → the law has 3 stateful domains and the product claim
+  firms up; (b) fails on one → the boundary is finer than "stateful" (e.g. itsm's
+  read-verify-write loops are unusually steerable) and we learn *which* property carries
+  the win — either result is decision-grade.
+
+## Stress test (why itsm might be idiosyncratic)
+
+- itsm tasks have **many independent sub-goals** (2–18 SQL verifiers/task) — partial
+  credit is dense, so a steer always has a "next unfinished item." Domains with one
+  atomic verifier may behave like stateless tasks.
+- itsm tools are **read/write symmetric** (every mutation is cheaply checkable by a
+  read) — the verify-before-mutate steer is unusually actionable. Email/calendar may
+  lack cheap verification reads.
+- The gym DB **resets per task** — no long-horizon persistence *across* tasks, so this
+  is still short-horizon steering. The long-horizon claim (hours-scale accumulation)
+  needs commit0/SWE-class coding domains — currently platform-gated (#984 sandbox
+  egress), the honest outer boundary of what's testable today.
+
+## Product transfer (the falsifier the product-value claim wrote down)
+
+The gym is a proxy. The five live products (gtm/tax/legal/creative/agent-builder) are
+the target, and `.evolve/eops-steerer-product-claim.md` already names the falsifier:
+*"the win doesn't transfer off the gym to a real connector-backed ops agent."* Transfer
+is not a bigger gym run — it is the integration question (see
+`product-integration-playbook.md`): implement an `Environment` over one product's real
+tool surface + a deployable check from its domain (e.g. gtm: a campaign-state check;
+tax: a return-validation check), and run the same gate. That is the experiment that
+converts this research program into product value, and nothing in the current evidence
+shortcuts it.
+
+## Order of operations
+
+1. csm + hr replication (config-change cheap, decision-grade either way).
+2. The (correct,$,ms) vector on those runs (free, per layer-economics).
+3. One product `Environment` (gtm first — richest tool surface, live traces flowing) —
+   the bridge experiment, scoped in the playbook.
+4. commit0/SWE long-horizon — parked on #984; revisit when the platform unblocks.
diff --git a/docs/research/layer-economics.md b/docs/research/layer-economics.md
new file mode 100644
index 0000000..2a6c3dd
--- /dev/null
+++ b/docs/research/layer-economics.md
@@ -0,0 +1,67 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** canon-mandated, practice-absent — the largest internal inconsistency
+
+# Layer: economics, multi-objective, and the portfolio question
+
+**The claim under test:** "best" is a vector — correct · fast · secure · cheap — and the
+optimization target is the Pareto frontier, not a pre-collapsed score.
+
+## The inconsistency this layer names
+
+The canon mandates this (architecture §0.5.2 "Success is multi-objective; we do not
+collapse it to one number until forced"; §0.5.3 each objective carries its own deployable
+checker). **Every gate this program has run is single-objective** (verifier score), with
+cost merely *reported*. The Pareto machinery exists (`paretoFrontier`,
+`paretoFrontierWithCrowding` in agent-eval; the GEPA harness already selects on
+[lift, cost]). This is practice lagging canon, not a design dispute — and it changes
+conclusions: a strategy that ties on score but halves cost **wins** under the canon's
+definition and is invisible under ours.
+
+## What's free to wire (harvest, not research)
+
+- **correct** — already the verifier. **cheap** — already measured (`Spend.usd`,
+  tokens; the conserved pool meters it). **fast** — already measured (`Spend.ms`).
+  Three of four objectives are *already in every RunRecord*; the work is reporting the
+  vector + Pareto verdicts instead of the scalar. ~Days, not weeks.
+- **secure** — the one objective needing a real checker (domain-dependent: policy
+  violations in EOPS, dangerous tool calls, secret leakage). Defer until a domain
+  supplies one; don't fake it with an LLM judge (eval-substrate: deterministic or
+  execution-grounded only).
+
+## The two big unmeasured effects in this layer
+
+1. **The cost-quality frontier across models.** The router serves 500+ models; the
+   gates have used 2–3. The product question is *lift-per-dollar*, and the data so far
+   hints the frontier is strange: deepseek-v4-flash resolves 6% of EOPS (too weak to
+   steer), v4-pro carries the +16.4pp at a fraction of gpt-4.1's price. A model-sweep on
+   the existing gate (same harness, 4–5 models, report (score, $/task)) maps it for the
+   cost of one rerun.
+2. **Tool/harness augmentation dominates.** The largest single effect this program has
+   ever measured is not steering, not selection, not prompts — it is **giving cheap
+   models a search tool**: you.com lifted *all five* models to ~90% on SimpleQA (+70pp
+   for cheap models, p≈.03), erasing the model-quality gap. The honest implication: for
+   many task classes, **harness augmentation ≥ model choice ≥ strategy ≫ prompt** in
+   effect size. The portfolio should weight accordingly — an "augmentation sweep" (which
+   tool grants close which domain's gap) is plausibly worth more than every remaining
+   steering experiment combined.
+
+## Stress test
+
+- *"Multi-objective is premature until score itself is solid."* Backwards under the
+  canon: collapsing to score is what made the deepseek-flash runs look uninformative
+  (6% resolve) when the right reading was "off the frontier, wrong model for the
+  domain." The vector is *cheaper* to be right with, not more expensive.
+- *"Pareto verdicts confuse operators."* The scalarization exists (`scalarScore`,
+  weighted) for when a single winner is forced; the discipline is collapse-last.
+- *"Routing is a product, not an experiment."* It's both — but the *measurement* (the
+  frontier map) is precisely the eval-substrate's sellable exhaust (eval-substrate: "which
+  (harness × model × provider × strategy) is actually best for task-class X").
+
+## Concrete next steps
+
+1. Wire the (correct, usd, ms) vector + `paretoFrontier` verdict into `runBenchmark`'s
+   report (additive; the data is already in the records).
+2. Model-frontier sweep on the canonical EOPS gate: {v4-flash, v4-pro, glm-5, gpt-4.1}
+   × {sample, refine} → the first published lift-per-dollar table.
+3. Augmentation sweep design: per domain, the tool grant that closes the cheap-model
+   gap (search for retrieval domains; what is the EOPS analog — schema docs? read-tool
+   hints?).
diff --git a/docs/research/layer-intelligence-serving.md b/docs/research/layer-intelligence-serving.md
new file mode 100644
index 0000000..462a09e
--- /dev/null
+++ b/docs/research/layer-intelligence-serving.md
@@ -0,0 +1,85 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** architecture decision — export-only today; the across-run layer's natural home
+
+# Layer: intelligence serving — self-hosted vs platform-served
+
+**The question (operator-posed):** today the loop *self-hosts* its intelligence
+gathering (`observe()` runs in-process, the `Corpus` is a local JSONL). Should **Tangle
+Intelligence** instead *serve* intelligence to agents and agent teams — and is what we
+built pointing toward that or away from it?
+
+## Ground truth: what Tangle Intelligence is today
+
+Verified against the code (otel-export.ts, examples/intelligence-export,
+agents-of-all-shapes, the sandbox SDK):
+
+| surface | direction | shape |
+|---|---|---|
+| `createOtelExporter` → `/v1/traces` | **export only** | OTel GenAI spans (loop topology, usage, cost) |
+| `exportEvalRuns` → `/v1/ingest/eval-runs` | **export only** | eval provenance (baselines, generations, gates, InsightReport) |
+| sandbox `createIntelligenceReport` / `createAgenticIntelligenceReport` | async pull | fleet/box-level report, `queued→completed`, dashboard-shaped |
+| `/v1/insights/outputs?kind=report` | human dashboard | no programmatic agent contract |
+
+**Verdict: export-only.** Nothing in `src/` reads Intelligence back into a loop. The
+in-loop intelligence is entirely `observe()` (per-run, synchronous, ~1 LLM call,
+firewalled) + `Corpus` (local durable facts, `corpus.query()` → next-run priming).
+
+## The two systems are layered, not duplicates
+
+| | `observe()` + `Corpus` (in-process) | Tangle Intelligence (hosted) |
+|---|---|---|
+| granularity | one run's trace → findings *now* | fleet-scale, multi-run clustering, lift CIs, Pareto |
+| latency | in-loop (<1s need) | async (seconds–minutes) |
+| store | local JSONL per product | server-side, tenant-wide |
+| consumer | the very next shot/run | humans (dashboards) |
+| firewall | **structural** (`derived_from_judge:false`; input carries no score) | **none** — InsightReport embeds judge-derived stats |
+
+So the answer to "are we self-hosting what Intelligence should serve?" is: **partially,
+and the split should be by timescale.** The *within-run* critic must stay in-process
+(latency, firewall, per-run context). The *across-run* memory — the corpus, the fleet
+patterns, the "what do we know about failures like this" query — is exactly what a
+hosted service does better: amortized analysis across every run of every product in the
+tenant, cached, one place to curate. **Tangle Intelligence is the natural home of the
+across-run layer** (`layer-across-run.md`), and today's local JSONL corpus is the
+self-hosted stopgap for a read-back API that doesn't exist yet.
+
+## What's missing to make Intelligence "serve the agents" (the gap list)
+
+1. **A read-back API** — `GET` findings by subject/window/tags, agent-consumable shape
+   (`AnalystFinding[]`-like: area, claim, recommended_action, confidence), not
+   dashboard-shaped reports. Sub-second from cache.
+2. **Pre-computed/cached findings** — computed on ingest or scheduled, not
+   generate-on-request; an agent priming a run cannot wait minutes.
+3. **The firewall, server-side** — this is the hard constraint, and it is
+   non-negotiable: InsightReport today mixes judge-derived statistics. If agents steer
+   on served intelligence that embeds judge verdicts, the keystone discipline
+   (selector ≠ judge, judge write-only — learning-flywheel: "the keystone of the entire
+   stack") breaks *at the platform level*, silently, for every consumer. The served
+   slice must be trace-derived-only, enforced where the report is built, with
+   `derived_from_judge` provenance on every served claim.
+4. **Uptake telemetry** — served findings should carry IDs so the loop can report back
+   "injected, followed, outcome" — closing Intelligence's own improvement loop.
+
+## Stress test
+
+- *"Why not keep it all local — it works?"* Local corpora silo learning per product and
+  per machine; the moat claim is *cross*-run, cross-product transfer, which only a
+  shared service realizes. Also: ops (curation, decay, dedup) done five times badly vs
+  once well.
+- *"Why not move observe() to the platform too?"* Latency + context: the in-loop critic
+  needs the live trace within the shot cadence, and shipping full traces mid-loop is
+  cost + privacy surface. Per-run critic local, cross-run memory hosted — clean split.
+- *"Does a hosted dependency break offline/dev?"* The `Corpus` port stays; the hosted
+  service is one implementation behind it (`IntelligenceCorpus` beside `FileCorpus`).
+  Degrade to local, never fail closed on a network read.
+- *"Is there a business here or just plumbing?"* The primed-vs-cold A/B answers both at
+  once: if priming lifts outcomes, "intelligence served to agents" has measurable value
+  per query — eval-substrate's sellable-exhaust thesis, applied to the corpus itself.
+
+## Decision + sequence
+
+1. Run the corpus A/B locally first (no platform work) — it gates everything: no lift,
+   no service.
+2. On a positive: define the served-findings contract (the `Corpus` port already exists
+   — implement it over Intelligence read-back), with the firewall enforced server-side.
+3. The product playbook's Phase 3 (see `product-integration-playbook.md`) then swaps
+   each product's local corpus for the served one — one port, no loop changes.
diff --git a/docs/research/layer-within-run.md b/docs/research/layer-within-run.md
new file mode 100644
index 0000000..6664046
--- /dev/null
+++ b/docs/research/layer-within-run.md
@@ -0,0 +1,58 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** mostly settled — boundary law established, one lever open
+
+# Layer: within-run optimization
+
+**The claim under test:** spending a run's compute *adaptively* (steer, refine, branch)
+beats spending it *blindly* (best-of-N resampling) at equal budget.
+
+## Evidence (all paired, equal-compute, deployable checkers)
+
+| domain | setup | steering vs compute | verdict |
+|---|---|---|---|
+| FinSearchComp (stateless retrieval) | n=40, BH | refineHand −10pp, refineGepa −15pp; compute +22.5pp (p=.008) | **negative** |
+| HumanEval (stateless codegen) | n=82, LLM-audit steer | −1.2pp CI[−8.5,+6.1] | null |
+| HumanEval (stateless codegen) | n=82, exec-grounded self-repair (`run_tests` tool) | **−17.1pp** CI[−26.8,−7.3] | **significantly negative** |
+| EOPS-itsm (stateful agentic), flat hand-rolled loop | n=24 | −9.9pp → autopsy: scoring asymmetry | artifact (see below) |
+| EOPS-itsm, **canonical loop** (Supervisor + observe()) | n=16 | **+16.4pp** CI[+5.3,+29.8], 6W/0L | **significantly positive** |
+| EOPS-itsm, disjoint holdout slice | n=6 | +8.3pp (both analyst prompts) | replicates |
+| analyst-prompt GEPA | search n=12, frozen holdout n=6 | holdout: winner +8.3 = baseline +8.3 | **null** (prompt not binding) |
+
+## The boundary law (the durable output of this layer)
+
+Steering pays **iff** the task is *stateful* (the artifact accumulates, so an observed
+correction is worth more than a fresh sample), has a *correctable middle band* (partial
+credit a steer can move), and resampling is *expensive or impossible* (you can't restart
+a 6-step ticket migration). On stateless generation, fresh samples explore for free and
+any anchored continuation loses — exactly the canon's prediction (architecture §10).
+
+Two engineering laws fell out, both load-bearing:
+1. **Keep-best checkpointing is mandatory.** Steering *reaches* better states then
+   *undoes* them (measured degradation +6–8pp). Score/keep the best-verifying
+   checkpoint, never the final state. The flat-loop "depth loses −9.9pp" result was
+   entirely this scoring asymmetry (autopsy `.evolve/autopsies/2026-06-08-…`).
+2. **Architecture is a variable, not plumbing.** The same model/domain/n flipped from
+   "depth loses" (flat loop, hand-rolled steerer) to "+16.4pp significant" (Supervisor +
+   real `observe()` analyst). Measure on the canonical stack only.
+
+## Stress test (strongest objections)
+
+- *"+16.4pp is one domain, one model, n=16."* True. The holdout replication (+8.3pp,
+  disjoint tasks) helps but cross-domain (layer-domain-generality) is the real answer.
+- *"The analyst adds nothing — GEPA tied."* The correct reading is narrower: the
+  analyst-prompt *text* is not binding at this budget. The analyst *mechanism* is in
+  every positive cell, and removing it (generic nudge, flat loop) degraded results. The
+  untested attribution experiment: canonical depth WITHOUT any analyst (pure
+  continuation) vs with — isolates the analyst's marginal value.
+- *"Maybe more shots, not steering, explains depth's win."* No — equal completions by
+  construction (conserved budget pool), and breadth had ≥ compute in the wins.
+
+## What's left in this layer (and what to stop)
+
+**Open lever — topology/strategy:** `adaptiveRefine` (branch-when-stuck), refine/sample
+mixes, widen gates. Now cheap to test (`defineStrategy` + `runBenchmark` + holdout).
+The one within-run experiment still worth funding: **strategy tournament at n≥24 +
+frozen holdout.**
+
+**Stop:** analyst-prompt GEPA at small n (flat landscape, holdout-tied); steering
+experiments on stateless domains (three independent negatives); rich-analyst plumbing
+(HALO OTLP emitter) until a topology win re-motivates it.
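The keep-best checkpointing law stated in the boundary-law section can be sketched in a few lines. This is an illustration under assumptions, not the canonical loop's code: `Checkpoint` and `keepBest` are hypothetical names, and the sketch assumes each checkpoint already carries a score from the deployable verifier.

```typescript
// Hypothetical record of one point on a run's trajectory.
interface Checkpoint<S> {
  step: number;  // position in the run (0-based)
  state: S;      // whatever snapshot the environment can restore or score
  score: number; // score from the deployable verifier at this step (assumed given)
}

// Returns the best-verifying checkpoint rather than the final one.
// Ties keep the earliest checkpoint, since only a strictly higher score replaces it.
function keepBest<S>(checkpoints: Checkpoint<S>[]): Checkpoint<S> {
  if (checkpoints.length === 0) {
    throw new Error("no checkpoints recorded");
  }
  return checkpoints.reduce((best, candidate) =>
    candidate.score > best.score ? candidate : best,
  );
}
```

Scoring the last element of the trajectory instead of `keepBest(...)` is the asymmetry described above: a run that reaches a better state and then undoes it gets credited with the worse final state.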
diff --git a/docs/research/optimization-space.md b/docs/research/optimization-space.md
new file mode 100644
index 0000000..d471157
--- /dev/null
+++ b/docs/research/optimization-space.md
@@ -0,0 +1,107 @@
+> **Track:** Architecture (research) · **Role:** strategy map · **Status:** open — taxonomy + stress-tests, 2026-06-09
+
+# The optimization space — axes, not a ladder
+
+A stress-test of the question "does GEPA / steerers / HALO contextualize everything we
+could be working on?" Answer: **no.** Those are all points in ONE region of a larger
+space, and the region we have been grinding (within-run mechanics) is the one where the
+evidence keeps coming back null-or-marginal, while the region the canon names as the
+actual success criterion (the across-run flywheel, Gate B) has **n=0 measurements**.
+
+This doc holds the taxonomy and the canon-compatibility audit. One stress-test doc per
+layer lives beside it (`layer-*.md`).
+
+## Why axes, not layers
+
+The original framing ("6 layers") conflated independent dimensions. The clean model: an
+optimization effort is a **point in a 6-axis space**, and any "ladder" (the canon's
+L0→L1→L2 rungs) is one *path* through it — not the space itself.
+
+| Axis | Values | Where this repo is today |
+|---|---|---|
+| **Timescale** | within-run · across-run · meta (optimizer-of-optimizer) | almost all effort within-run; across-run n=0 |
+| **Target** | prompt (content) · topology/strategy (structure) · knowledge/corpus (memory) · policy (routing, ask-vs-act, budget) · tasks (curriculum) | prompt = measured (tie); topology = open; the rest untouched |
+| **Objective** | single score · multi-objective vector (correct·fast·secure·cheap) | every gate so far single-objective — **in tension with the canon** (see audit) |
+| **Validity scope** | one domain · cross-domain · live product | n=1 domain (EOPS-itsm) for the headline result |
+| **Serving architecture** | in-process (observe()/Corpus) · platform-served (Tangle Intelligence) | all in-process; Intelligence is export-only today |
+| **Authorship** | human-built · agent-authored | human; `defineStrategy` makes agent-authored feasible |
+
+Reconciliation with the canon's ladder: the rungs (L0 worker → L1 controller → L2
+meta-optimizer) are the **timescale × target** diagonal. The axes add what the ladder
+hides: objective shape, validity scope, serving topology, authorship. Both frames are
+compatible; the ladder answers "is level n real?" (lift on level n−1), the axes answer
+"where is the unexplored headroom?".
+
+## The map with evidence status (2026-06-09)
+
+| Region | Evidence | Verdict |
+|---|---|---|
+| within-run steering, stateless retrieval (FinSearchComp) | n=40, BH-corrected | **NEGATIVE** (steering −10/−15pp; compute +22.5pp) |
+| within-run steering, stateless codegen (HumanEval) | n=82 ×2, paired-bootstrap | **NULL** (audit −1.2 n.s.) / **NEGATIVE** (exec-grounded repair −17.1 SIGNIF) |
+| within-run depth+keep-best, stateful agentic (EOPS) | n=16 + holdout replication | **POSITIVE** (+16.4pp CI[+5.3,+29.8]; +8.3pp on disjoint slice) |
+| analyst-prompt GEPA | search n=12 + frozen holdout n=6 | **NULL** (holdout exact tie vs default) |
+| within-run topology (adaptiveRefine, mix/widen) | unmeasured | open — the one within-run lever left |
+| across-run corpus flywheel (primed-vs-cold) | **n=0** | the canon's stated success criterion, never measured |
+| multi-objective vector | **n=0** (machinery exists: `paretoFrontier`) | canon-required, unwired |
+| cross-domain (csm/hr/email/… gym splits) | **n=0** | nearly free to run |
+| live-product transfer | **n=0** | the product-value claim's own falsifier |
+| tool/harness augmentation | SimpleQA: you.com lifts cheap models +70pp to parity | the **largest single effect measured anywhere in this program** |
+| agent-authored strategies | feasible since `defineStrategy`; unmeasured | the skillification goal |
+
+Reading of the map: the program has **over-sampled one cell** (within-run × prompt/strategy ×
+single-objective × itsm × in-process × human) and the cells the canon itself designates as
+the product (across-run, multi-objective, product-scope) are empty.
+
+## Canon-compatibility audit
+
+Checked against `architecture.md`, `learning-flywheel.md`, `eval-substrate.md`,
+`roadmap-rsi.md`, `architecture-interpretations.md`, `.evolve/current.json`.
+
+**Compatible / direct alignment:**
+- Across-run = success (architecture §0.5.4: "That across-run curve is RSI, and it is THE
+  success criterion (Gate B)"; roadmap: Gate B "not yet instrumented"). The axes frame
+  *restates* the canon's own acknowledged gap.
+- "Topology over prompt as the next within-run lever" — consistent with roadmap Phase 3
+  (grow the ISA) being gated on findings reaching the planner.
+- Platform-served intelligence is a **deployment-topology choice**, not an architecture
+  violation — the kernel owns Scope/MCP/profiles; analysis attaches via hooks
+  (architecture §1b). See `layer-intelligence-serving.md` for the one hard constraint
+  (the judge firewall).
+
+**Corrections the canon forces on the new framing:**
+- "Within-run steering is mostly null" is **too gentle**: the adequately-powered rung-0
+  result is *negative on every slice* (learning-flywheel §Honest status). The accurate
+  law: **negative on stateless retrieval, null-to-negative on stateless codegen, positive
+  on stateful agentic with keep-best checkpointing.** The boundary variable is state +
+  a correctable middle band + the inability to cheaply resample.
+- The canon already predicted the self-refine failure (architecture §10: "intrinsic
+  self-refine degrades… the driver must re-investigate, not self-critique"). Our
+  HumanEval repair −17.1pp is a *confirmation*, not news.
+
+**Tensions / staleness to resolve (documentation debt, not design conflict):**
+- `learning-flywheel.md` rung-0 verdict ("steering loses") is FinSearchComp-scoped and
+  now needs the domain boundary added (EOPS depth win, canonical loop, +16.4pp).
+- Every gate run to date is single-objective, while architecture §0.5.2–0.5.3 mandates a
+  multi-objective vector with per-objective deployable checkers. This is the **largest
+  internal inconsistency between practice and canon** — see `layer-economics.md`.
+- `.evolve/current.json` predates the canonical-loop result and the GEPA verdict; needs a
+  state refresh (tracked separately from this doc set).
+
+## The portfolio (what to multi-pursue)
+
+Ranked by (decision-relevance × cheapness × independence):
+
+1. **Across-run corpus A/B** (`layer-across-run.md`) — primed-vs-cold at equal budget.
+   The thesis test; doubles as the Tangle-Intelligence-value proof.
+2. **Cross-domain replication** (`layer-domain-generality.md`) — depth-vs-breadth on a
+   second gym split (csm or hr). Validates or bounds the headline result.
+3. **Multi-objective wiring** (`layer-economics.md`) — report the (correct, cost, wall)
+   vector per strategy; lift-per-dollar. Mostly harvest, machinery exists.
+4. **Topology evolution** (`layer-within-run.md`) — adaptiveRefine/mix vs refine vs
+   sample, n≥24 + holdout, the fitness fn already built.
+5. **Strategy-author skill** (`layer-agent-authored.md`) — an agent reads the losses and
+   emits a `defineStrategy`; gate scores it. Small build; IS the skillification goal.
+
+Explicitly **not** in the portfolio: more analyst-prompt GEPA (holdout-tied, flat
+landscape), HALO plumbing (rich-analyst bet weakened by the prompt null), in-box sandbox
+arms (platform-gated, #984).
diff --git a/docs/research/product-integration-playbook.md b/docs/research/product-integration-playbook.md
new file mode 100644
index 0000000..ba92082
--- /dev/null
+++ b/docs/research/product-integration-playbook.md
@@ -0,0 +1,91 @@
+> **Track:** Operations (research) · **Role:** integration + operator playbook · **Status:** actionable — primitives mostly shipped, three packaging gaps named
+
+# Product integration playbook — putting the optimization system into the products
+
+The step-by-step path for wiring the optimization system (canonical Supervisor loop ·
+`observe()` analyst · Environment/Strategy/`runBenchmark` · corpus) into the live
+agent-app products (gtm / tax / creative / legal / agent-builder), and **what the
+operator (Drew + team) does at each step** vs what runs autonomously.
+
+Honest framing up front: most of the production loop **already ships** in agent-eval /
+agent-runtime (the `agent-stack-adoption` 9-phase pipeline). What this playbook adds is
+(a) where the *new* optimization suite slots into that pipeline, (b) the operator role
+table, (c) the three packaging gaps that block "just import it" today.
+
+## The three packaging gaps (do these first)
+
+| gap | today | needed |
+|---|---|---|
+| **G1 — the suite isn't published.** `Environment`, `Strategy`, `defineStrategy`, `runBenchmark`, the canonical depth/breadth drivers live in `bench/src/` (R&D workspace), not in the published `@tangle-network/agent-runtime` exports. | products can't import them | lift `agentic.ts` + `run-benchmark.mts` into `src/` behind `/loops` (a `substrate-release` motion; the code is already domain-blind) |
+| **G2 — corpus has no production inflow.** `observe()`/`Corpus` runs in bench loops; production traces flow to the trace sink + (optionally) OTLP, but nothing turns production traces into corpus facts automatically. | analyst-loop proposes; PR-gated | a production `observe()` pass over the trace sink (batch, nightly) writing corpus facts; later the Intelligence-served corpus (layer-intelligence-serving) |
+| **G3 — no product `Environment` exists.** The gate has only gym Environments. | gym-only evidence | one product Environment (gtm first): tools = the product's real MCP surface; `score()` = a deployable domain check |
+
+## The integration sequence (one product: gtm-agent)
+
+Assumes the product is already at adoption Phase 3+ (composer + trace sink + nightly
+eval live — gtm is). Each step names the existing primitive; nothing here is invented.
+
+1. **Parity profile** — eval runs the *production* agent: `composeProductionAgentProfile`
+   → `createSandboxAct`. (Shipped; most products wired.) *Operator: none.*
+2. **Production traces flowing** — `createProductionTraceSink` on every chat turn; OTLP
+   export to Intelligence optional but recommended (`createOtelExporter`). *Operator:
+   set the OTLP endpoint secret once; glance at trace health weekly.*
+3. **The product Environment (G3)** — implement the 5 hooks over gtm's real surface:
+   `open` = a scoped workspace/session; `tools` = the product MCP tools; `call` =
+   invoke them; `score` = a deployable check (campaign-state assertions, not an LLM
+   judge); `close` = teardown. ~1–2 days; this is the gym→product bridge experiment
+   from `layer-domain-generality.md`. *Operator decision: which checks define "done"
+   for a gtm task — this is product judgment, not engineering.*
+4. **Run the gate on the product** — `runBenchmark({environment: gtmEnv, strategies:
+   [sample, refine], …})` over a frozen scenario set. First output: does depth/steering
+   pay on *your* domain, with the (correct, $, ms) vector per layer-economics.
+   *Operator: review the report; pick the strategy+model cell for production.*
+5. **Backend integrity + scorecard + ship-gate** — `assertRealBackend` before any
+   verdict; `recordRunsToScorecard`/`diffScorecard` per commit; `runProductionLoop`'s
+   held-out promotion gate for any prompt/addendum change. (All shipped.) *Operator:
+   approve/reject gate-passing PRs — this is the standing human checkpoint.*
+6. **Corpus priming (G2 + the across-run layer)** — nightly `observe()` over the day's
+   production traces → corpus; prime tomorrow's runs via `corpus.query`. Run
+   primed-vs-cold on the product scenario set — the product-grade flywheel test.
+   *Operator: review high-confidence facts weekly (a 10-minute curation pass); approve
+   the auto-apply threshold.*
+7. **Intelligence hookup** — keep exporting (step 2 covers it). When the served-findings
+   read-back exists (layer-intelligence-serving), swap `FileCorpus` for the
+   Intelligence-backed `Corpus` — one port, no loop changes. *Operator: tenant config.*
+8. **CI crons** — nightly eval + weekly production-loop (templates shipped in the
+   adoption skills). *Operator: provision the runner once; rotate secrets; review the
+   weekly auto-PR.*
+
+## The operator role, consolidated
+
+What **only humans** do — everything else runs autonomously:
+
+| cadence | action | authority |
+|---|---|---|
+| once per product | define the deployable checks (step 3) + holdout scenarios | product judgment — the single highest-leverage human input |
+| once | set gate thresholds (paired-delta, overfit gap), budgets, model allowlist | risk posture |
+| weekly | review scorecard diff + the production-loop auto-PR; approve/reject | the ship decision |
+| weekly | 10-min corpus curation (high-confidence facts in/out) | knowledge quality |
+| on failure | backend-integrity or infra alerts (stub verdict, runner down) | unblock |
+
+The deliberate design: the human owns **what "good" means** (checks, thresholds,
+scenarios) and **the ship decision**; the system owns everything between — running,
+scoring, mutating, gating, reporting. That is the operator contract to staff for: not
+babysitting runs, but curating definitions and reviewing one diff per product per week.
+
+## Sequencing across the fleet
+
+gtm first (richest tools, live traces, friendliest checks) → then tax (high-value
+deterministic checks: return validation) → creative/legal (checks are harder to make
+deterministic — may stay at steps 1–2+5 until eval-agent rubrics mature) →
+agent-builder (special case: its *product* is generating agents, so the strategy-author
+skill from `layer-agent-authored.md` is its feature, not its tooling).
+
+## What NOT to do
+
+- Don't fork `runProductionLoop` per product to get custom topologies — that's G1's
+  job (publish `Strategy`), then strategies are injected, not forked.
+- Don't auto-apply corpus facts above the measured-precision threshold; PR-gate until
+  the primed-vs-cold A/B shows lift.
+- Don't ship any steering default to a product before its own Environment gate (step 4)
+  shows it pays *on that domain* — the boundary law says it may not.
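The per-domain Environment gate in the last bullet rests on the five hooks named in step 3 of the integration sequence (`open`, `tools`, `call`, `score`, `close`). A minimal in-memory sketch of that shape follows. Only the hook names come from the playbook; the signatures, the `GtmSession` type, the tool names, the `q3-launch` id, and the `runEpisode` helper are all assumptions made for illustration, and the hooks are synchronous here for brevity where real ones would likely be async.

```typescript
// Hypothetical signatures: only the five hook names come from the playbook.
interface Environment<S> {
  open(): S;                                                        // scoped workspace/session
  tools(session: S): string[];                                      // tool surface exposed to the agent
  call(session: S, tool: string, args: Record<string, unknown>): unknown;
  score(session: S): number;                                        // deployable check, not an LLM judge
  close(session: S): void;                                          // teardown
}

// Illustrative in-memory stand-in for a product session (not the real gtm surface).
interface GtmSession {
  campaigns: Map<string, { status: string }>;
  closed: boolean;
}

const gtmEnvSketch: Environment<GtmSession> = {
  open: () => ({ campaigns: new Map(), closed: false }),
  tools: () => ["create_campaign", "launch_campaign"], // hypothetical tool names
  call(session, tool, args) {
    const id = String(args.id);
    if (tool === "create_campaign") {
      session.campaigns.set(id, { status: "draft" });
      return { ok: true };
    }
    if (tool === "launch_campaign") {
      const campaign = session.campaigns.get(id);
      if (!campaign) return { ok: false };
      campaign.status = "live";
      return { ok: true };
    }
    throw new Error(`unknown tool: ${tool}`);
  },
  // Campaign-state assertion: 1 if the target campaign ended up live, else 0.
  score: (session) => (session.campaigns.get("q3-launch")?.status === "live" ? 1 : 0),
  close(session) {
    session.closed = true;
  },
};

type Step = { tool: string; args: Record<string, unknown> };

// Runs a fixed plan through the hooks; teardown happens even if a call throws.
function runEpisode<S>(env: Environment<S>, plan: Step[]): number {
  const session = env.open();
  try {
    for (const step of plan) env.call(session, step.tool, step.args);
    return env.score(session);
  } finally {
    env.close(session);
  }
}

const fullPlan: Step[] = [
  { tool: "create_campaign", args: { id: "q3-launch" } },
  { tool: "launch_campaign", args: { id: "q3-launch" } },
];
console.log(runEpisode(gtmEnvSketch, fullPlan)); // 1
```

Keeping `score` a pure function of session state is what makes the check deployable: the same assertion can run in the gate and in production without a judge model in the loop.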