tangle-network · drewstone · Jun 9, 2026 · Jun 9, 2026
diff --git a/docs/research/README.md b/docs/research/README.md
@@ -27,6 +27,23 @@ spine happen explicitly, with `file:line` anchors, once a design ships.
 | [codex-techniques-audit.md](./codex-techniques-audit.md) | Adoption report mining OpenAI Codex for succinct-code principles + orchestration techniques. **Advisory** — verify `file:line` before acting. |
 | [loop-facade-postmortem.md](./loop-facade-postmortem.md) | Failure record for the deleted `defineLoop` facade: why retyping `Scope`/MCP/journals/validators produced code without substrate proof, and the prevention rule for future loop APIs. |
 
+### The optimization-space suite (2026-06-09)
+
+The strategy map + per-layer stress tests, written after the steering/GEPA gate series.
+Start at the index; each layer doc carries its own evidence table, strongest objections,
+and concrete next experiments.
+
+| Doc | What it holds |
+|-----|---------------|
+| [optimization-space.md](./optimization-space.md) | **The index.** The 6-axis taxonomy (timescale · target · objective · validity scope · serving architecture · authorship), the evidence map (which cells are measured/null/empty), the canon-compatibility audit, and the ranked experiment portfolio. |
+| [layer-within-run.md](./layer-within-run.md) | Within-run optimization — the settled boundary law (steering negative on stateless, positive on stateful+keep-best), the two engineering laws (checkpointing; architecture-is-a-variable), and the one open lever (topology tournament). |
+| [layer-across-run.md](./layer-across-run.md) | **The unmeasured thesis (n=0).** The corpus flywheel: primed-vs-cold A/B design, the four falsifiers (context pollution, stale facts, judge leakage, worker disregard), and why this layer dominates the portfolio. |
+| [layer-economics.md](./layer-economics.md) | Multi-objective + cost: the largest practice-vs-canon inconsistency (all gates single-objective; canon mandates the vector), the lift-per-dollar frontier, and the tool-augmentation effect (+70pp) that dominates everything else measured. |
+| [layer-domain-generality.md](./layer-domain-generality.md) | The n=1-domain exposure of the headline result; the nearly-free cross-domain replication (csm/hr gym splits); why itsm may be idiosyncratic; the product-transfer falsifier. |
+| [layer-intelligence-serving.md](./layer-intelligence-serving.md) | Self-hosted vs platform-served intelligence: Tangle Intelligence is export-only today; the timescale split (in-loop critic local, across-run memory served); the four-gap list, including the **server-side judge firewall** as the non-negotiable. |
+| [layer-agent-authored.md](./layer-agent-authored.md) | Skillification: agent-authored strategies via `defineStrategy`, the two structural safety properties (conserved budget, firewall), and the R0→R3 success ladder for the strategy-author skill. |
+| [product-integration-playbook.md](./product-integration-playbook.md) | **The operator playbook.** The 8-step product integration sequence (gtm first), the consolidated human-role table (what only operators do), the three packaging gaps (publish the suite, corpus inflow, product Environments), and fleet sequencing. |
+
 ## Source artifacts (multi-agent passes)
 
 | Run | Pass | Result lands in |

diff --git a/docs/research/layer-across-run.md b/docs/research/layer-across-run.md
@@ -0,0 +1,68 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** THE unmeasured thesis — n=0, highest priority
+
+# Layer: across-run learning (the flywheel)
+
+**The claim under test:** run N+1 is measurably better than run N because the system
+*learned* from run N — the corpus of trace-derived findings primes future runs. This is
+the canon's success criterion verbatim (architecture §0.5.4: "the across-run curve is
+RSI, and it is THE success criterion (Gate B)"; learning-flywheel §1).
+
+## Status: the embarrassing asymmetry
+
+Within-run mechanics have ~6 adequately powered measurements (mostly null/negative).
+Across-run learning has **zero**. The machinery is wired (`observe()` → `Corpus` →
+`renderCorpusToInstructions` → next-run priming; demonstrated live in `fleet.mts`,
+"carrying 2 prior learnings"), but the *benefit* has never been measured. The ledger has
+called the primed-vs-cold A/B "the cheap test that makes it pay rent" since 2026-06-08.
+
+## The experiment (designed, runnable now)
+
+**Primed-vs-cold at equal budget.** Two arms over the same task stream (EOPS split, or
+ideally a *sequence* so learning can accumulate):
+- **cold**: every run starts fresh (the canonical loop as measured).
+- **primed**: before each run, `corpus.query(task tags)` → top-k high-confidence facts
+  injected into the worker/analyst context; after each run, `observe()` appends.
+
+Score both with the same deployable verifier on two metrics: the **slope** (does primed's
+advantage *grow* over the stream? That is the flywheel signature) and the endpoint lift.
+Frozen holdout: a final disjoint slice where primed keeps its corpus but cold stays cold.
+
+Falsifiers to design against (the stress test):
+1. **Context pollution** — injected facts displace task-relevant context and *hurt*
+   (the FinSearch lesson: workers got advice and ignored it; fleet.mts observed the
+   same). Mitigate: cap k, relevance-rank, measure a k=0/2/5 dose curve.
+2. **Stale facts** — the gym DB resets per task; "learnings" about *instances* are
+   noise, only *procedural* learnings transfer ("verify before mutate", "SLA must be
+   relinked after priority change"). The corpus schema already separates `area`/`claim`;
+   the A/B should tag procedural-vs-instance and report both.
+3. **Judge leakage** — corpus facts must remain trace-derived
+   (`derived_from_judge: false` is enforced structurally in `observe()`); a primed win
+   that came from leaked verdicts would be Goodhart, not learning.
+4. **Worker disregard** — measured before (advice ignored). Track *uptake*: did the
+   worker's tool sequence change in the direction of the injected fact?
+
+## Why this layer dominates the portfolio
+
+- It is the **stated product** ("the moat is the cross-benchmark learning flywheel",
+  architecture §8) and the only layer whose success directly justifies the corpus, the
+  judge discipline, and the RSI framing.
+- The within-run results make it *more* urgent, not less: if adaptive compute inside a
+  run is mostly worthless, the entire bet collapses onto memory across runs.
+- It is the natural junction with **Tangle Intelligence** (see
+  `layer-intelligence-serving.md`): a positive primed-vs-cold result is simultaneously
+  the proof that a hosted corpus/findings service has product value — the same
+  experiment, two strategic answers.
+
+## Expansion beyond the first A/B
+
+- **Retrieval-steered analyst**: the analyst's context includes findings from *past
+  similar failures* (corpus query keyed on the current trace), not just the current
+  trace — the cross-run version of `observe()`.
+- **Cross-benchmark transfer** (the full Gate B): learn on EOPS-itsm, measure lift on
+  csm/hr — does *procedural* knowledge transfer across domains? This is the actual moat
+  claim and it has a concrete falsifier (instance-knowledge won't transfer; procedural
+  might).
+- **Corpus curation as the optimization target**: once priming shows any lift, *what to
+  keep* (confidence thresholds, decay, dedup) becomes the GEPA-optimizable surface —
+  optimizing memory instead of prompts. Note this is exactly where the prompt-GEPA
+  machinery transfers after its within-run null.
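Reviewer sketch for the primed-vs-cold experiment in `layer-across-run.md`: the two metrics it names (the slope of primed's advantage over the stream, and the endpoint lift on the frozen holdout) can be computed from paired per-task scores. This is a minimal sketch under stated assumptions, not the repo's scoring code: `PairedScores`, `advantage`, `olsSlope`, and `summarize` are hypothetical names, the slope is a plain least-squares fit against stream position, and the numbers are made up.

```typescript
// Hypothetical sketch: these names are illustrative and do not exist in the repo.
// Scores are verifier scores in [0, 1]; array index = position in the task stream.
interface PairedScores {
  cold: number[];   // cold arm, one score per task
  primed: number[]; // primed arm, same tasks in the same order
}

// Paired difference per task: positive means primed beat cold on that task.
function advantage(s: PairedScores): number[] {
  if (s.cold.length !== s.primed.length) {
    throw new Error("both arms must cover the same tasks");
  }
  return s.primed.map((p, i) => p - (s.cold[i] as number));
}

function mean(values: number[]): number {
  if (values.length === 0) throw new Error("mean of empty list");
  return values.reduce((a, b) => a + b, 0) / values.length;
}

// Ordinary least-squares slope of y against stream position 0..n-1.
function olsSlope(y: number[]): number {
  if (y.length < 2) throw new Error("need at least two points for a slope");
  const xMean = (y.length - 1) / 2;
  const yMean = mean(y);
  let num = 0;
  let den = 0;
  y.forEach((v, i) => {
    num += (i - xMean) * (v - yMean);
    den += (i - xMean) ** 2;
  });
  return num / den;
}

interface FlywheelSummary {
  slope: number;        // growth of primed's advantage per stream position
  endpointLift: number; // mean paired advantage on the frozen holdout
}

function summarize(stream: PairedScores, holdout: PairedScores): FlywheelSummary {
  return {
    slope: olsSlope(advantage(stream)),
    endpointLift: mean(advantage(holdout)),
  };
}

// Made-up numbers, for illustration only.
const example = summarize(
  { cold: [0.5, 0.5, 0.5, 0.5], primed: [0.5, 0.6, 0.7, 0.8] },
  { cold: [0.4, 0.6], primed: [0.6, 0.8] },
);
```

A positive `slope` with a positive `endpointLift` would be the flywheel signature the doc describes; a positive lift with a flat slope would suggest a one-off priming benefit rather than accumulation. Running the same summary per `k` would give the k=0/2/5 dose curve.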
diff --git a/docs/research/layer-agent-authored.md b/docs/research/layer-agent-authored.md
@@ -0,0 +1,78 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** newly feasible — the skillification goal, unmeasured
+
+# Layer: agent-authored optimization (skillification)
+
+**The claim under test:** agents can author the optimization machinery themselves —
+read a run's failures, write a *new strategy* (code, not prompt), and have it gated like
+any human-built candidate. This is the stated product goal ("skillify the process so
+agents develop these complex things") and the literal RSI claim, one level up from
+prompt mutation.
+
+## Why this just became feasible
+
+Before `defineStrategy`, a strategy was a ~70-line Supervisor driver (spawn/scope/journal
+ceremony), not a unit any agent emits reliably. Now a strategy is a **~20-line body
+composing two steps** (`shot()`, `critique()`) with the ceremony hidden, proven by
+`adaptiveRefine` (branch-when-stuck, authored from the steps, runs through the canonical
+gate). The skillifiable unit exists; what's missing is the skill and the measurement.
+
+## The two safety properties that make agent authorship sound
+
+These are structural, not policy — which is what makes this layer credible at all:
+
+1. **Equal-compute by construction.** Any authored strategy spends through the
+   Supervisor's conserved budget pool — it *cannot* win by spending more (the
+   anti-confound invariant the keystone was built for).
+2. **The firewall is structural.** A strategy body composes `shot`/`critique`; it never
+   receives the verifiers or expected values. An authored strategy can be wrong but
+   cannot Goodhart the check — the judge stays write-only regardless of who wrote the
+   code.
+
+Residual risks that are NOT structurally covered: infinite-loop bodies (cap: the budget
+pool exhausts → spawn refused → strategy ends), environment abuse via tool calls (same
+exposure as any worker — the Environment's own tool surface is the boundary), and
+plain bad code (gate + holdout catches uselessness; typecheck catches breakage).
+
+## The experiment (the strategy-author skill)
+
+A skill/agent given: the `defineStrategy` contract + the two steps' docs + a run's
+**losses** (per-task: breadth score, depth score, trajectory — already emitted by the
+GEPA fitness fn) — asked to author one new strategy attacking the observed failure
+mode. The authored strategy enters the same tournament as human-built ones
+(`runBenchmark`, n≥24, frozen holdout).
+
+Success ladder (each rung independently informative):
+- **R0** — the agent emits a strategy that typechecks and completes the gate. (Pure
+  feasibility; expect pass.)
+- **R1** — an authored strategy beats `sample` on the holdout. (Parity with human
+  baseline quality.)
+- **R2** — an authored strategy beats the best *human* strategy on the holdout. (The
+  actual RSI-one-level-up claim.)
+- **R3** — iterated: feed the authored strategy's own losses back; does generation 2
+  beat generation 1? (GEPA-over-code; this is meta-harness's territory and should run
+  through that skill's discipline — stable baseline + product-value claim — not a
+  hand-rolled loop.)
+
+## Stress test
+
+- *"Isn't this just GEPA with a bigger search space?"* Materially different: prompt
+  space was measured flat (holdout tie); *program* space contains things prompts cannot
+  express (branch-when-stuck, restart policies, multi-artifact coordination, team
+  topologies). The prior is genuinely open.
+- *"LLMs write plausible-broken control flow."* R0 exists precisely to measure the
+  emission reliability before claiming anything; the gate absorbs broken candidates as
+  scored losses, not crashes (the resilient harness skips, never dies).
+- *"Multi-agent teams?"* Same unit: a "team" is a strategy whose body spawns several
+  *different* agents and arbitrates — the recursive atom already expresses it; the skill
+  just needs one team-shaped example in its docs.
+- *"Why a skill rather than a workflow?"* The skill is the productization: it travels to
+  any repo with the substrate, and it is the artifact that makes "agents develop these
+  complex things themselves" true for users, not just for this bench.
+
+## Order of operations
+
+1. Write the strategy-author skill (input: losses + contract; output: a
+   `defineStrategy` file + rationale). Small.
+2. R0/R1 on the existing EOPS gate (cheap, reuses everything).
+3. R2 tournament: authored vs `refine` vs `adaptiveRefine` vs `sample`, n≥24 + holdout.
+4. R3 only through `meta-harness` discipline, gated on R2 signal.
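Reviewer sketch for the two safety properties in `layer-agent-authored.md`: a toy harness showing why a strategy body that only receives `shot`/`critique` cannot outspend the pool or read the verifier. This is an illustration under assumptions, not the real `defineStrategy` contract, whose signature this diff does not show. `BudgetPool`, `Steps`, `Harness`, and `runStrategy` are hypothetical stand-ins, and each step is assumed to cost one budget unit.

```typescript
// Toy stand-in for illustration. This is NOT the real `defineStrategy` API;
// every name below is hypothetical.
const EXHAUSTED = { reason: "budget pool exhausted" };

class BudgetPool {
  private remaining: number;
  constructor(total: number) {
    this.remaining = total;
  }
  // Throws the EXHAUSTED sentinel instead of spending past the pool.
  spend(cost: number): void {
    if (cost > this.remaining) throw EXHAUSTED;
    this.remaining -= cost;
  }
  get left(): number {
    return this.remaining;
  }
}

// The only surface a strategy body sees: no verifier, no expected values.
interface Steps {
  shot(hint?: string): string;
  critique(artifact: string): string;
}

type StrategyBody = (steps: Steps) => void;

interface Harness {
  pool: BudgetPool;
  worker: (hint?: string) => string;
  critic: (artifact: string) => string;
  verifier: (artifact: string) => number; // held by the harness only
}

function runStrategy(body: StrategyBody, h: Harness): { best: string | null; score: number } {
  const artifacts: string[] = [];
  const steps: Steps = {
    shot: (hint) => {
      h.pool.spend(1); // every step draws from the same conserved pool
      const artifact = h.worker(hint);
      artifacts.push(artifact);
      return artifact;
    },
    critique: (artifact) => {
      h.pool.spend(1);
      return h.critic(artifact);
    },
  };
  try {
    body(steps);
  } catch (e) {
    if (e !== EXHAUSTED) throw e; // exhaustion just ends the strategy
  }
  // Keep-best selection happens outside the body, behind the firewall.
  let best: string | null = null;
  let score = -Infinity;
  for (const a of artifacts) {
    const s = h.verifier(a);
    if (s > score) {
      best = a;
      score = s;
    }
  }
  return { best, score };
}

// A refine-style body that never stops on its own; the pool stops it.
const refineForever: StrategyBody = (steps) => {
  let hint: string | undefined;
  while (true) {
    hint = steps.critique(steps.shot(hint));
  }
};

const pool = new BudgetPool(5);
const result = runStrategy(refineForever, {
  pool,
  worker: (hint) => (hint === undefined ? "a" : hint + "a"),
  critic: (artifact) => artifact,
  verifier: (artifact) => artifact.length,
});
```

With a pool of 5, the looping body gets three shots and two critiques before the sixth step is refused, which mirrors the "pool exhausts → spawn refused → strategy ends" cap in the residual-risk paragraph. The verifier is only reachable from `runStrategy`, so an authored body can produce a bad artifact but has no handle to tune against the check.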
diff --git a/docs/research/layer-domain-generality.md b/docs/research/layer-domain-generality.md
@@ -0,0 +1,63 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** n=1 domain — the headline result's biggest validity risk
+
+# Layer: domain generality and product transfer
+
+**The claim under test:** the boundary law ("steering wins on stateful agentic work")
+and the +16.4pp depth result generalize beyond EOPS-itsm — across gym domains, across
+task families, and ultimately to live products.
+
+## The exposure
+
+Every positive steering result in this program sits on **one domain**: EOPS *itsm*
+(ServiceNow ticket ops, SQL-state verifiers). The negatives sit on two stateless domains
+(FinSearchComp, HumanEval). So the "boundary law" is interpolated from 3 points, and the
+product thesis ("depth wins on ops-like agentic work") rests on n=1 domain, n=1 gym,
+n=2 models. The canon's own discipline (eval-substrate: paired stats, honest scoping)
+demands this be named: **the law is a hypothesis with one supporting stateful domain.**
+
+## The cheap replication (nearly free)
+
+`gym_dbs.zip` ships **eight** domain splits: itsm, csm, hr, email, drive, calendar,
+teams, hybrid — same container, same MCP/verifier machinery, same `Environment`
+implementation (`agentic-eops.ts` is domain-blind; only the HF split name changes). A
+cross-domain run is a config change:
+
+- **Experiment:** canonical depth-vs-breadth (Supervisor + observe, keep-best) on csm +
+  hr at n≥16 each, same model.
+- **Outcomes:** (a) replicates → the law has 3 stateful domains and the product claim
+  firms up; (b) fails on one → the boundary is finer than "stateful" (e.g. itsm's
+  read-verify-write loops are unusually steerable) and we learn *which* property carries
+  the win — either result is decision-grade.
+
+## Stress test (why itsm might be idiosyncratic)
+
+- itsm tasks have **many independent sub-goals** (2–18 SQL verifiers/task) — partial
+  credit is dense, so a steer always has a "next unfinished item." Domains with one
+  atomic verifier may behave like stateless tasks.
+- itsm tools are **read/write symmetric** (every mutation is cheaply checkable by a
+  read) — the verify-before-mutate steer is unusually actionable. Email/calendar may
+  lack cheap verification reads.
+- The gym DB **resets per task** — no long-horizon persistence *across* tasks, so this
+  is still short-horizon steering. The long-horizon claim (hours-scale accumulation)
+  needs commit0/SWE-class coding domains — currently platform-gated (#984 sandbox
+  egress), the honest outer boundary of what's testable today.
+
+## Product transfer (the falsifier the product-value claim wrote down)
+
+The gym is a proxy. The five live products (gtm/tax/legal/creative/agent-builder) are
+the target, and `.evolve/eops-steerer-product-claim.md` already names the falsifier:
+*"the win doesn't transfer off the gym to a real connector-backed ops agent."* Transfer
+is not a bigger gym run — it is the integration question (see
+`product-integration-playbook.md`): implement an `Environment` over one product's real
+tool surface + a deployable check from its domain (e.g. gtm: a campaign-state check;
+tax: a return-validation check), and run the same gate. That is the experiment that
+converts this research program into product value, and nothing in the current evidence
+shortcuts it.
+
+## Order of operations
+
+1. csm + hr replication (config-change cheap, decision-grade either way).
+2. The (correct, usd, ms) vector on those runs (free, per `layer-economics.md`).
+3. One product `Environment` (gtm first — richest tool surface, live traces flowing) —
+   the bridge experiment, scoped in the playbook.
+4. commit0/SWE long-horizon — parked on #984; revisit when the platform unblocks.
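Reviewer sketch for the "config change" replication in `layer-domain-generality.md`: if only the split name varies, the csm + hr plan is a small cross product, and the two outcomes the doc names fall out of per-domain verdicts. The config shape below is an assumption for illustration; `RunConfig`, `replicationPlan`, `outcome`, and the placeholder model string are hypothetical, and the `replicated` flag is assumed to come from the gate's own paired statistics rather than from this code.

```typescript
// Hypothetical config shapes: field names are illustrative, not the repo's.
// The split names are the eight the doc lists for gym_dbs.zip.
const DOMAINS = ["itsm", "csm", "hr", "email", "drive", "calendar", "teams", "hybrid"] as const;
type Domain = (typeof DOMAINS)[number];
type Arm = "breadth" | "depth";

interface RunConfig {
  split: Domain; // the only field that changes between domains
  arm: Arm;
  model: string;
  n: number;     // tasks per arm
}

function replicationPlan(domains: Domain[], model: string, n: number): RunConfig[] {
  if (n < 16) throw new Error("the doc asks for n >= 16 per domain");
  const arms: Arm[] = ["breadth", "depth"];
  const plan: RunConfig[] = [];
  for (const split of domains) {
    for (const arm of arms) {
      plan.push({ split, arm, model, n });
    }
  }
  return plan;
}

interface DomainVerdict {
  domain: Domain;
  replicated: boolean; // decided by the gate's paired statistics, not here
}

// Maps gate verdicts onto the two outcomes named in the doc.
function outcome(verdicts: DomainVerdict[]): { statefulDomainsSupporting: number; boundaryIsFiner: boolean } {
  const replicated = verdicts.filter((v) => v.replicated).length;
  return {
    statefulDomainsSupporting: 1 + replicated, // itsm is the existing point
    boundaryIsFiner: replicated < verdicts.length,
  };
}

// Placeholder model name; the doc only requires it match the itsm runs.
const plan = replicationPlan(["csm", "hr"], "same-model-as-itsm-run", 16);
```

Two domains and two arms give four runs. If both domains replicate, `outcome` reports three supporting stateful domains; if either fails, `boundaryIsFiner` flags that "stateful" alone does not explain the win.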
diff --git a/docs/research/layer-economics.md b/docs/research/layer-economics.md
@@ -0,0 +1,67 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** canon-mandated, practice-absent — the largest internal inconsistency
+
+# Layer: economics, multi-objective, and the portfolio question
+
+**The claim under test:** "best" is a vector — correct · fast · secure · cheap — and the
+optimization target is the Pareto frontier, not a pre-collapsed score.
+
+## The inconsistency this layer names
+
+The canon mandates this (architecture §0.5.2 "Success is multi-objective; we do not
+collapse it to one number until forced"; §0.5.3 each objective carries its own deployable
+checker). **Every gate this program has run is single-objective** (verifier score), with
+cost merely *reported*. The Pareto machinery exists (`paretoFrontier`,
+`paretoFrontierWithCrowding` in agent-eval; the GEPA harness already selects on
+[lift, cost]). This is practice lagging canon, not a design dispute — and it changes
+conclusions: a strategy that ties on score but halves cost **wins** under the canon's
+definition and is invisible under ours.
+
+## What's free to wire (harvest, not research)
+
+- **correct** — already the verifier. **cheap** — already measured (`Spend.usd`,
+  tokens; the conserved pool meters it). **fast** — already measured (`Spend.ms`).
+  Three of four objectives are *already in every RunRecord*; the work is reporting the
+  vector + Pareto verdicts instead of the scalar. ~Days, not weeks.
+- **secure** — the one objective needing a real checker (domain-dependent: policy
+  violations in EOPS, dangerous tool calls, secret leakage). Defer until a domain
+  supplies one; don't fake it with an LLM judge (eval-substrate: deterministic or
+  execution-grounded only).
+
+## The two big unmeasured effects in this layer
+
+1. **The cost-quality frontier across models.** The router serves 500+ models; the
+   gates have used 2–3. The product question is *lift-per-dollar*, and the data so far
+   hints the frontier is strange: deepseek-v4-flash resolves 6% of EOPS (too weak to
+   steer), while v4-pro carries the +16.4pp at a fraction of gpt-4.1's price. A model sweep on
+   the existing gate (same harness, 4–5 models, report (score, $/task)) maps it for the
+   cost of one rerun.
+2. **Tool/harness augmentation dominates.** The largest single effect this program has
+   ever measured is not steering, not selection, not prompts — it is **giving cheap
+   models a search tool**: you.com lifted *all five* models to ~90% on SimpleQA (+70pp
+   for cheap models, p≈0.03), erasing the model-quality gap. The honest implication: for
+   many task classes, **harness augmentation ≥ model choice ≥ strategy ≫ prompt** in
+   effect size. The portfolio should weight accordingly — an "augmentation sweep" (which
+   tool grants close which domain's gap) is plausibly worth more than every remaining
+   steering experiment combined.
+
+## Stress test
+
+- *"Multi-objective is premature until score itself is solid."* Backwards under the
+  canon: collapsing to score is what made the deepseek-v4-flash runs look uninformative
+  (6% resolve) when the right reading was "off the frontier, wrong model for the
+  domain." With the vector, the right conclusion is *cheaper* to reach, not more expensive.
+- *"Pareto verdicts confuse operators."* The scalarization exists (`scalarScore`,
+  weighted) for when a single winner is forced; the discipline is collapse-last.
+- *"Routing is a product, not an experiment."* It's both — but the *measurement* (the
+  frontier map) is precisely the eval-substrate's sellable exhaust (eval-substrate: "which
+  (harness × model × provider × strategy) is actually best for task-class X").
+
+## Concrete next steps
+
+1. Wire the (correct, usd, ms) vector + `paretoFrontier` verdict into `runBenchmark`'s
+   report (additive; the data is already in the records).
+2. Model-frontier sweep on the canonical EOPS gate: {v4-flash, v4-pro, glm-5, gpt-4.1}
+   × {sample, refine} → the first published lift-per-dollar table.
+3. Augmentation sweep design: per domain, the tool grant that closes the cheap-model
+   gap (search for retrieval domains; what is the EOPS analog — schema docs? read-tool
+   hints?).
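Reviewer sketch for the vector verdict in `layer-economics.md`: a minimal Pareto filter over (correct, usd, ms), showing the case the doc calls out, where a strategy ties on score at half the cost. This is not the `paretoFrontier` from agent-eval, whose signature this diff does not show. `Candidate`, `dominates`, and `paretoFront` are hypothetical names, and the numbers are made up.

```typescript
// Hypothetical sketch. This is not the `paretoFrontier` from agent-eval;
// names and record shape are illustrative.
interface Candidate {
  name: string;
  correct: number; // verifier score, higher is better
  usd: number;     // cost per task, lower is better
  ms: number;      // latency per task, lower is better
}

// a dominates b if a is no worse on every objective and strictly better on one.
function dominates(a: Candidate, b: Candidate): boolean {
  const noWorse = a.correct >= b.correct && a.usd <= b.usd && a.ms <= b.ms;
  const better = a.correct > b.correct || a.usd < b.usd || a.ms < b.ms;
  return noWorse && better;
}

// Keep every candidate that no other candidate dominates.
function paretoFront(cands: Candidate[]): Candidate[] {
  return cands.filter((c) => !cands.some((other) => dominates(other, c)));
}

// Made-up numbers: `halfCost` ties `baseline` on score at half the spend.
const candidates: Candidate[] = [
  { name: "baseline", correct: 0.6, usd: 1.0, ms: 1000 },
  { name: "halfCost", correct: 0.6, usd: 0.5, ms: 1000 },
  { name: "bigSpender", correct: 0.7, usd: 2.0, ms: 1500 },
];

const front = paretoFront(candidates).map((c) => c.name);

// What a score-only gate reports: one winner, cost ignored.
const scalarWinner = candidates.reduce((a, b) => (b.correct > a.correct ? b : a)).name;
```

Here the frontier keeps `halfCost` and `bigSpender` and drops `baseline`, while the score-only reading names `bigSpender` alone and cannot separate `baseline` from `halfCost`. That gap between the two verdicts is the inconsistency the layer describes.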