diff --git a/docs/research/README.md b/docs/research/README.md
index 0832aae..adf13b7 100644
--- a/docs/research/README.md
+++ b/docs/research/README.md
@@ -27,6 +27,23 @@ spine happen explicitly, with `file:line` anchors, once a design ships.
 | [codex-techniques-audit.md](./codex-techniques-audit.md) | Adoption report mining OpenAI Codex for succinct-code principles + orchestration techniques. **Advisory** — verify `file:line` before acting. |
 | [loop-facade-postmortem.md](./loop-facade-postmortem.md) | Failure record for the deleted `defineLoop` facade: why retyping `Scope`/MCP/journals/validators produced code without substrate proof, and the prevention rule for future loop APIs. |
 
+### The optimization-space suite (2026-06-09)
+
+The strategy map + per-layer stress tests, written after the steering/GEPA gate series.
+Start at the index; each layer doc carries its own evidence table, strongest objections,
+and concrete next experiments.
+
+| Doc | What it holds |
+|-----|---------------|
+| [optimization-space.md](./optimization-space.md) | **The index.** The 6-axis taxonomy (timescale · target · objective · validity scope · serving architecture · authorship), the evidence map (which cells are measured/null/empty), the canon-compatibility audit, and the ranked experiment portfolio. |
+| [layer-within-run.md](./layer-within-run.md) | Within-run optimization — the settled boundary law (steering negative on stateless, positive on stateful+keep-best), the two engineering laws (checkpointing; architecture-is-a-variable), and the one open lever (topology tournament). |
+| [layer-across-run.md](./layer-across-run.md) | **The unmeasured thesis (n=0).** The corpus flywheel: primed-vs-cold A/B design, the four falsifiers (context pollution, stale facts, judge leakage, worker disregard), and why this layer dominates the portfolio. |
+| [layer-economics.md](./layer-economics.md) | Multi-objective + cost: the largest practice-vs-canon inconsistency (all gates single-objective; canon mandates the vector), the lift-per-dollar frontier, and the tool-augmentation effect (+70pp) that dominates everything else measured. |
+| [layer-domain-generality.md](./layer-domain-generality.md) | The n=1-domain exposure of the headline result; the nearly-free cross-domain replication (csm/hr gym splits); why itsm may be idiosyncratic; the product-transfer falsifier. |
+| [layer-intelligence-serving.md](./layer-intelligence-serving.md) | Self-hosted vs platform-served intelligence: Tangle Intelligence is export-only today; the timescale split (in-loop critic local, across-run memory served); the four-gap list incl. the **server-side judge firewall** as the non-negotiable. |
+| [layer-agent-authored.md](./layer-agent-authored.md) | Skillification: agent-authored strategies via `defineStrategy`, the two structural safety properties (conserved budget, firewall), and the R0→R3 success ladder for the strategy-author skill. |
+| [product-integration-playbook.md](./product-integration-playbook.md) | **The operator playbook.** The 8-step product integration sequence (gtm first), the consolidated human-role table (what only operators do), the three packaging gaps (publish the suite, corpus inflow, product Environments), and fleet sequencing. |
+
 ## Source artifacts (multi-agent passes)
 
 | Run | Pass | Result lands in |
diff --git a/docs/research/layer-across-run.md b/docs/research/layer-across-run.md
new file mode 100644
index 0000000..c56465b
--- /dev/null
+++ b/docs/research/layer-across-run.md
@@ -0,0 +1,68 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** THE unmeasured thesis — n=0, highest priority
+
+# Layer: across-run learning (the flywheel)
+
+**The claim under test:** run N+1 is measurably better than run N because the system
+*learned* from run N — the corpus of trace-derived findings primes future runs. This is
+the canon's success criterion verbatim (architecture §0.5.4: "the across-run curve is
+RSI, and it is THE success criterion (Gate B)"; learning-flywheel §1).
+
+## Status: the embarrassing asymmetry
+
+Within-run mechanics have ~6 adequately-powered measurements (mostly null/negative).
+Across-run learning has **zero**. The machinery is wired (`observe()` → `Corpus` →
+`renderCorpusToInstructions` → next-run priming; demonstrated live in `fleet.mts`,
+"carrying 2 prior learnings"), but the *benefit* has never been measured. The ledger has
+called the primed-vs-cold A/B "the cheap test that makes it pay rent" since 2026-06-08.
+
+## The experiment (designed, runnable now)
+
+**Primed-vs-cold at equal budget.** Two arms over the same task stream (EOPS split, or
+ideally a *sequence* so learning can accumulate):
+- **cold**: every run starts fresh (the canonical loop as measured).
+- **primed**: before each run, `corpus.query(task tags)` → top-k high-confidence facts
+  injected into the worker/analyst context; after each run, `observe()` appends.
+
+Score both with the same deployable verifier; the metrics are the **slope** (whether primed's
+advantage *grows* over the stream, which is the flywheel signature) and the endpoint lift. Frozen
+holdout: a final disjoint slice where primed keeps its corpus but cold stays cold.
+
+Falsifiers to design against (the stress test):
+1. **Context pollution** — injected facts displace task-relevant context and *hurt*
+   (the FinSearchComp lesson: workers got advice and ignored it; fleet.mts observed the
+   same). Mitigate: cap k, relevance-rank, measure a k=0/2/5 dose curve.
+2. **Stale facts** — the gym DB resets per task; "learnings" about *instances* are
+   noise, only *procedural* learnings transfer ("verify before mutate", "SLA must be
+   relinked after priority change"). The corpus schema already separates `area`/`claim`;
+   the A/B should tag procedural-vs-instance and report both.
+3. **Judge leakage** — corpus facts must remain trace-derived (`derived_from_judge:
+   false` is enforced structurally in `observe()`); a primed win that came from leaked
+   verdicts would be Goodhart, not learning.
+4. **Worker disregard** — measured before (advice ignored). Track *uptake*: did the
+   worker's tool sequence change in the direction of the injected fact?
+
+## Why this layer dominates the portfolio
+
+- It is the **stated product** ("the moat is the cross-benchmark learning flywheel",
+  architecture §8) and the only layer whose success directly justifies the corpus, the
+  judge discipline, and the RSI framing.
+- The within-run results make it *more* urgent, not less: if adaptive compute inside a
+  run is mostly worthless, the entire bet collapses onto memory across runs.
+- It is the natural junction with **Tangle Intelligence** (see
+  `layer-intelligence-serving.md`): a positive primed-vs-cold result is simultaneously
+  the proof that a hosted corpus/findings service has product value — the same
+  experiment, two strategic answers.
+
+## Expansion beyond the first A/B
+
+- **Retrieval-steered analyst**: the analyst's context includes findings from *past
+  similar failures* (corpus query keyed on the current trace), not just the current
+  trace — the cross-run version of `observe()`.
+- **Cross-benchmark transfer** (the full Gate B): learn on EOPS-itsm, measure lift on
+  csm/hr — does *procedural* knowledge transfer across domains? This is the actual moat
+  claim and it has a concrete falsifier (instance-knowledge won't transfer; procedural
+  might).
+- **Corpus curation as the optimization target**: once priming shows any lift, *what to
+  keep* (confidence thresholds, decay, dedup) becomes the GEPA-optimizable surface —
+  optimizing memory instead of prompts. Note this is exactly where the prompt-GEPA
+  machinery transfers after its within-run null.
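
The two metrics named in the primed-vs-cold design of `layer-across-run.md` above (the slope of primed's advantage over the stream, and the endpoint lift) could be computed along these lines. This is a minimal sketch, not code from the repo: `summarizeFlywheel`, `FlywheelSummary`, and the `tail` parameter are hypothetical names, and the scores in the example are made up for illustration.

```typescript
// Hypothetical helper, not part of the repo: summarizes a primed-vs-cold A/B.
// cold[i] and primed[i] are verifier scores (0..1) for task i of the same stream.
interface FlywheelSummary {
  slope: number;        // least-squares slope of (primed - cold) vs. stream position
  endpointLift: number; // mean advantage over the last `tail` tasks
}

function summarizeFlywheel(cold: number[], primed: number[], tail = 2): FlywheelSummary {
  if (cold.length !== primed.length || cold.length < 2) {
    throw new Error("need two paired score streams of equal length >= 2");
  }
  const n = cold.length;
  // Per-task advantage of the primed arm over the cold arm.
  const adv = primed.map((p, i) => p - cold[i]);
  const meanX = (n - 1) / 2;
  const meanY = adv.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - meanX) * (adv[i] - meanY);
    den += (i - meanX) ** 2;
  }
  const last = adv.slice(-tail);
  const endpointLift = last.reduce((a, b) => a + b, 0) / last.length;
  return { slope: num / den, endpointLift };
}

// Made-up scores: cold stays flat, primed improves as the corpus accumulates.
const summary = summarizeFlywheel([0.5, 0.5, 0.5, 0.5], [0.5, 0.6, 0.7, 0.8]);
console.log(summary.slope.toFixed(3), summary.endpointLift.toFixed(3));
```

A positive slope with a positive endpoint lift is the flywheel signature; a positive lift with a flat slope would suggest a one-time priming benefit rather than accumulation. A real analysis would still need the paired statistics the canon asks for, which this sketch omits.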
diff --git a/docs/research/layer-agent-authored.md b/docs/research/layer-agent-authored.md
new file mode 100644
index 0000000..07c0487
--- /dev/null
+++ b/docs/research/layer-agent-authored.md
@@ -0,0 +1,78 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** newly feasible — the skillification goal, unmeasured
+
+# Layer: agent-authored optimization (skillification)
+
+**The claim under test:** agents can author the optimization machinery themselves —
+read a run's failures, write a *new strategy* (code, not prompt), and have it gated like
+any human-built candidate. This is the stated product goal ("skillify the process so
+agents develop these complex things") and the literal RSI claim, one level up from
+prompt mutation.
+
+## Why this just became feasible
+
+Before `defineStrategy`, a strategy was a ~70-line Supervisor driver
+(spawn/scope/journal ceremony), not a unit any agent emits reliably. Now a strategy is a
+**~20-line body composing two steps** (`shot()`, `critique()`) with the ceremony hidden, proven by
+`adaptiveRefine` (branch-when-stuck, authored from the steps, runs through the canonical
+gate). The skillifiable unit exists; what's missing is the skill and the measurement.
+
+## The two safety properties that make agent authorship sound
+
+These are structural, not policy — which is what makes this layer credible at all:
+
+1. **Equal-compute by construction.** Any authored strategy spends through the
+   Supervisor's conserved budget pool — it *cannot* win by spending more (the
+   anti-confound invariant the keystone was built for).
+2. **The firewall is structural.** A strategy body composes `shot`/`critique`; it never
+   receives the verifiers or expected values. An authored strategy can be wrong but
+   cannot Goodhart the check — the judge stays write-only regardless of who wrote the
+   code.
+
+Residual risks that are NOT structurally covered: infinite-loop bodies (cap: the budget
+pool exhausts → spawn refused → strategy ends), environment abuse via tool calls (same
+exposure as any worker — the Environment's own tool surface is the boundary), and
+plain bad code (gate + holdout catches uselessness; typecheck catches breakage).
+
+## The experiment (the strategy-author skill)
+
+A skill/agent is given the `defineStrategy` contract, the two steps' docs, and a run's
+**losses** (per-task: breadth score, depth score, trajectory; already emitted by the
+GEPA fitness fn), and is asked to author one new strategy attacking the observed failure
+mode. The authored strategy enters the same tournament as human-built ones
+(`runBenchmark`, n≥24, frozen holdout).
+
+Success ladder (each rung independently informative):
+- **R0** — the agent emits a strategy that typechecks and completes the gate. (Pure
+  feasibility; expect pass.)
+- **R1** — an authored strategy beats `sample` on the holdout. (Parity with human
+  baseline quality.)
+- **R2** — an authored strategy beats the best *human* strategy on the holdout. (The
+  actual RSI-one-level-up claim.)
+- **R3** — iterated: feed the authored strategy's own losses back; does generation 2
+  beat generation 1? (GEPA-over-code; this is meta-harness's territory and should run
+  through that skill's discipline — stable baseline + product-value claim — not a
+  hand-rolled loop.)
+
+## Stress test
+
+- *"Isn't this just GEPA with a bigger search space?"* Materially different: prompt
+  space was measured flat (holdout tie); *program* space contains things prompts cannot
+  express (branch-when-stuck, restart policies, multi-artifact coordination, team
+  topologies). The prior is genuinely open.
+- *"LLMs write plausible-broken control flow."* R0 exists precisely to measure the
+  emission reliability before claiming anything; the gate absorbs broken candidates as
+  scored losses, not crashes (the resilient harness skips, never dies).
+- *"Multi-agent teams?"* Same unit: a "team" is a strategy whose body spawns several
+  *different* agents and arbitrates — the recursive atom already expresses it; the skill
+  just needs one team-shaped example in its docs.
+- *"Why a skill rather than a workflow?"* The skill is the productization: it travels to
+  any repo with the substrate, and it is the artifact that makes "agents develop these
+  complex things themselves" true for users, not just for this bench.
+
+## Order of operations
+
+1. Write the strategy-author skill (input: losses + contract; output: a
+   `defineStrategy` file + rationale). Small.
+2. R0/R1 on the existing EOPS gate (cheap, reuses everything).
+3. R2 tournament: authored vs `refine` vs `adaptiveRefine` vs `sample`, n≥24 + holdout.
+4. R3 only through `meta-harness` discipline, gated on R2 signal.
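
The equal-compute property and the infinite-loop cap described in `layer-agent-authored.md` above can be illustrated with a toy model. This is a sketch under stated assumptions: `BudgetPool`, `trySpawn`, and `runawayStrategy` are hypothetical names, and the real Supervisor and `defineStrategy` APIs are not shown in this diff, so nothing here should be read as their actual shape.

```typescript
// Hypothetical model of a conserved budget pool; every name here is illustrative
// and does not reflect the real Supervisor API.
class BudgetPool {
  private remaining: number;

  constructor(total: number) {
    this.remaining = total;
  }

  // Debits one unit and returns true if a spawn is allowed; returns false once empty.
  trySpawn(): boolean {
    if (this.remaining <= 0) return false;
    this.remaining -= 1;
    return true;
  }

  get left(): number {
    return this.remaining;
  }
}

// A deliberately bad "authored strategy": it would loop forever if spawns were free.
function runawayStrategy(pool: BudgetPool): number {
  let shots = 0;
  while (true) {
    if (!pool.trySpawn()) break; // spawn refused, so the strategy ends
    shots += 1;
  }
  return shots;
}

const pool = new BudgetPool(5);
const shotsTaken = runawayStrategy(pool);
console.log(shotsTaken, pool.left);
```

The point of the design is that termination and equal compute come from the pool, not from trusting the authored body: a strategy can waste its budget, but it cannot exceed it.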
diff --git a/docs/research/layer-domain-generality.md b/docs/research/layer-domain-generality.md
new file mode 100644
index 0000000..28e849c
--- /dev/null
+++ b/docs/research/layer-domain-generality.md
@@ -0,0 +1,63 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** n=1 domain — the headline result's biggest validity risk
+
+# Layer: domain generality and product transfer
+
+**The claim under test:** the boundary law ("steering wins on stateful agentic work")
+and the +16.4pp depth result generalize beyond EOPS-itsm — across gym domains, across
+task families, and ultimately to live products.
+
+## The exposure
+
+Every positive steering result in this program sits on **one domain**: EOPS *itsm*
+(ServiceNow ticket ops, SQL-state verifiers). The negatives sit on two stateless domains
+(FinSearchComp, HumanEval). So the "boundary law" is interpolated from 3 points, and the
+product thesis ("depth wins on ops-like agentic work") rests on n=1 domain, n=1 gym,
+n=2 models. The canon's own discipline (eval-substrate: paired stats, honest scoping)
+demands this be named: **the law is a hypothesis with one supporting stateful domain.**
+
+## The cheap replication (nearly free)
+
+`gym_dbs.zip` ships **eight** domain splits: itsm, csm, hr, email, drive, calendar,
+teams, hybrid — same container, same MCP/verifier machinery, same `Environment`
+implementation (`agentic-eops.ts` is domain-blind; only the HF split name changes). A
+cross-domain run is a config change:
+
+- **Experiment:** canonical depth-vs-breadth (Supervisor + observe, keep-best) on csm +
+  hr at n≥16 each, same model.
+- **Outcomes:** (a) replicates → the law has 3 stateful domains and the product claim
+  firms up; (b) fails on one → the boundary is finer than "stateful" (e.g. itsm's
+  read-verify-write loops are unusually steerable) and we learn *which* property carries
+  the win — either result is decision-grade.
+
+## Stress test (why itsm might be idiosyncratic)
+
+- itsm tasks have **many independent sub-goals** (2–18 SQL verifiers/task) — partial
+  credit is dense, so a steer always has a "next unfinished item." Domains with one
+  atomic verifier may behave like stateless tasks.
+- itsm tools are **read/write symmetric** (every mutation is cheaply checkable by a
+  read) — the verify-before-mutate steer is unusually actionable. Email/calendar may
+  lack cheap verification reads.
+- The gym DB **resets per task** — no long-horizon persistence *across* tasks, so this
+  is still short-horizon steering. The long-horizon claim (hours-scale accumulation)
+  needs commit0/SWE-class coding domains — currently platform-gated (#984 sandbox
+  egress), the honest outer boundary of what's testable today.
+
+## Product transfer (the falsifier the product-value claim wrote down)
+
+The gym is a proxy. The five live products (gtm/tax/legal/creative/agent-builder) are
+the target, and `.evolve/eops-steerer-product-claim.md` already names the falsifier:
+*"the win doesn't transfer off the gym to a real connector-backed ops agent."* Transfer
+is not a bigger gym run — it is the integration question (see
+`product-integration-playbook.md`): implement an `Environment` over one product's real
+tool surface + a deployable check from its domain (e.g. gtm: a campaign-state check;
+tax: a return-validation check), and run the same gate. That is the experiment that
+converts this research program into product value, and nothing in the current evidence
+shortcuts it.
+
+## Order of operations
+
+1. csm + hr replication (config-change cheap, decision-grade either way).
+2. The (correct, usd, ms) vector on those runs (free, per `layer-economics.md`).
+3. One product `Environment` (gtm first — richest tool surface, live traces flowing) —
+   the bridge experiment, scoped in the playbook.
+4. commit0/SWE long-horizon — parked on #984; revisit when the platform unblocks.
diff --git a/docs/research/layer-economics.md b/docs/research/layer-economics.md
new file mode 100644
index 0000000..2a6c3dd
--- /dev/null
+++ b/docs/research/layer-economics.md
@@ -0,0 +1,67 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** canon-mandated, practice-absent — the largest internal inconsistency
+
+# Layer: economics, multi-objective, and the portfolio question
+
+**The claim under test:** "best" is a vector — correct · fast · secure · cheap — and the
+optimization target is the Pareto frontier, not a pre-collapsed score.
+
+## The inconsistency this layer names
+
+The canon mandates this (architecture §0.5.2 "Success is multi-objective; we do not
+collapse it to one number until forced"; §0.5.3 each objective carries its own deployable
+checker). **Every gate this program has run is single-objective** (verifier score), with
+cost merely *reported*. The Pareto machinery exists (`paretoFrontier`,
+`paretoFrontierWithCrowding` in agent-eval; the GEPA harness already selects on
+[lift, cost]). This is practice lagging canon, not a design dispute — and it changes
+conclusions: a strategy that ties on score but halves cost **wins** under the canon's
+definition and is invisible under ours.
+
+## What's free to wire (harvest, not research)
+
+- **correct** — already the verifier. **cheap** — already measured (`Spend.usd`,
+  tokens; the conserved pool meters it). **fast** — already measured (`Spend.ms`).
+  Three of four objectives are *already in every RunRecord*; the work is reporting the
+  vector + Pareto verdicts instead of the scalar. ~Days, not weeks.
+- **secure** — the one objective needing a real checker (domain-dependent: policy
+  violations in EOPS, dangerous tool calls, secret leakage). Defer until a domain
+  supplies one; don't fake it with an LLM judge (eval-substrate: deterministic or
+  execution-grounded only).
+
+## The two big unmeasured effects in this layer
+
+1. **The cost-quality frontier across models.** The router serves 500+ models; the
+   gates have used 2–3. The product question is *lift-per-dollar*, and the data so far
+   hints the frontier is strange: deepseek-v4-flash resolves 6% of EOPS (too weak to
+   steer), v4-pro carries the +16.4pp at a fraction of gpt-4.1's price. A model-sweep on
+   the existing gate (same harness, 4–5 models, report (score, $/task)) maps it for the
+   cost of one rerun.
+2. **Tool/harness augmentation dominates.** The largest single effect this program has
+   ever measured is not steering, not selection, not prompts — it is **giving cheap
+   models a search tool**: you.com lifted *all five* models to ~90% on SimpleQA (+70pp
+   for cheap models, p≈.03), erasing the model-quality gap. The honest implication: for
+   many task classes, **harness augmentation ≥ model choice ≥ strategy ≫ prompt** in
+   effect size. The portfolio should weight accordingly — an "augmentation sweep" (which
+   tool grants close which domain's gap) is plausibly worth more than every remaining
+   steering experiment combined.
+
+## Stress test
+
+- *"Multi-objective is premature until score itself is solid."* Backwards under the
+  canon: collapsing to score is what made the deepseek-flash runs look uninformative
+  (6% resolve) when the right reading was "off the frontier, wrong model for the
+  domain." Being right is *cheaper* with the vector, not more expensive.
+- *"Pareto verdicts confuse operators."* The scalarization exists (`scalarScore`,
+  weighted) for when a single winner is forced; the discipline is collapse-last.
+- *"Routing is a product, not an experiment."* It's both — but the *measurement* (the
+  frontier map) is precisely the eval-substrate's sellable exhaust (eval-substrate: "which
+  (harness × model × provider × strategy) is actually best for task-class X").
+
+## Concrete next steps
+
+1. Wire the (correct, usd, ms) vector + `paretoFrontier` verdict into `runBenchmark`'s
+   report (additive; the data is already in the records).
+2. Model-frontier sweep on the canonical EOPS gate: {v4-flash, v4-pro, glm-5, gpt-4.1}
+   × {sample, refine} → the first published lift-per-dollar table.
+3. Augmentation sweep design: per domain, the tool grant that closes the cheap-model
+   gap (search for retrieval domains; what is the EOPS analog — schema docs? read-tool
+   hints?).
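
The claim in `layer-economics.md` above that a strategy which ties on score but halves cost wins under a Pareto verdict can be shown with a small standalone example. This is a sketch only: `StrategyResult`, `dominates`, and `frontier` are hypothetical names, the repo's `paretoFrontier` and `RunRecord` signatures are not shown in this diff, and the numbers are invented for illustration.

```typescript
// Hypothetical record shape; this is a standalone illustration, not the repo's types.
interface StrategyResult {
  name: string;
  correct: number; // verifier score, higher is better
  usd: number;     // cost per task, lower is better
  ms: number;      // latency per task, lower is better
}

// a dominates b if it is no worse on every objective and strictly better on at least one.
function dominates(a: StrategyResult, b: StrategyResult): boolean {
  const noWorse = a.correct >= b.correct && a.usd <= b.usd && a.ms <= b.ms;
  const better = a.correct > b.correct || a.usd < b.usd || a.ms < b.ms;
  return noWorse && better;
}

// The frontier is every result that no other result dominates.
function frontier(results: StrategyResult[]): StrategyResult[] {
  return results.filter((r) => !results.some((other) => dominates(other, r)));
}

// Made-up numbers for illustration only.
const results: StrategyResult[] = [
  { name: "A", correct: 0.6, usd: 0.1, ms: 9000 },
  { name: "B", correct: 0.6, usd: 0.05, ms: 9000 }, // ties A on score at half the cost
  { name: "C", correct: 0.7, usd: 0.2, ms: 12000 },
];
console.log(frontier(results).map((r) => r.name));
```

Under a score-only gate A and B tie and C wins outright; under the vector, A drops off the frontier while B and C remain as distinct trade-offs, which is the difference in conclusions the layer describes.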
diff --git a/docs/research/layer-intelligence-serving.md b/docs/research/layer-intelligence-serving.md
new file mode 100644
index 0000000..462a09e
--- /dev/null
+++ b/docs/research/layer-intelligence-serving.md
@@ -0,0 +1,85 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** architecture decision — export-only today; the across-run layer's natural home
+
+# Layer: intelligence serving — self-hosted vs platform-served
+
+**The question (operator-posed):** today the loop *self-hosts* its intelligence
+gathering (`observe()` runs in-process, the `Corpus` is a local JSONL). Should **Tangle
+Intelligence** instead *serve* intelligence to agents and agent teams — and is what we
+built pointing toward that or away from it?
+
+## Ground truth: what Tangle Intelligence is today
+
+Verified against the code (otel-export.ts, examples/intelligence-export,
+agents-of-all-shapes, the sandbox SDK):
+
+| surface | direction | shape |
+|---|---|---|
+| `createOtelExporter` → `/v1/traces` | **export only** | OTel GenAI spans (loop topology, usage, cost) |
+| `exportEvalRuns` → `/v1/ingest/eval-runs` | **export only** | eval provenance (baselines, generations, gates, InsightReport) |
+| sandbox `createIntelligenceReport` / `createAgenticIntelligenceReport` | async pull | fleet/box-level report, `queued→completed`, dashboard-shaped |
+| `/v1/insights/outputs?kind=report` | human dashboard | no programmatic agent contract |
+
+**Verdict: export-only.** Nothing in `src/` reads Intelligence back into a loop. The
+in-loop intelligence is entirely `observe()` (per-run, synchronous, ~1 LLM call,
+firewalled) + `Corpus` (local durable facts, `corpus.query()` → next-run priming).
+
+## The two systems are layered, not duplicates
+
+| | `observe()` + `Corpus` (in-process) | Tangle Intelligence (hosted) |
+|---|---|---|
+| granularity | one run's trace → findings *now* | fleet-scale, multi-run clustering, lift CIs, Pareto |
+| latency | in-loop (needs <1s) | async (seconds–minutes) |
+| store | local JSONL per product | server-side, tenant-wide |
+| consumer | the very next shot/run | humans (dashboards) |
+| firewall | **structural** (`derived_from_judge:false`; input carries no score) | **none** — InsightReport embeds judge-derived stats |
+
+So the answer to "are we self-hosting what Intelligence should serve?" is: **partially,
+and the split should be by timescale.** The *within-run* critic must stay in-process
+(latency, firewall, per-run context). The *across-run* memory — the corpus, the fleet
+patterns, the "what do we know about failures like this" query — is exactly what a
+hosted service does better: amortized analysis across every run of every product in the
+tenant, cached, one place to curate. **Tangle Intelligence is the natural home of the
+across-run layer** (`layer-across-run.md`), and today's local JSONL corpus is the
+self-hosted stopgap for a read-back API that doesn't exist yet.
+
+## What's missing to make Intelligence "serve the agents" (the gap list)
+
+1. **A read-back API** — `GET` findings by subject/window/tags, agent-consumable shape
+   (`AnalystFinding[]`-like: area, claim, recommended_action, confidence), not
+   dashboard-shaped reports. Sub-second from cache.
+2. **Pre-computed/cached findings** — computed on ingest or scheduled, not
+   generate-on-request; an agent priming a run cannot wait minutes.
+3. **The firewall, server-side** — this is the hard constraint, and it is
+   non-negotiable: InsightReport today mixes judge-derived statistics. If agents steer
+   on served intelligence that embeds judge verdicts, the keystone discipline
+   (selector ≠ judge, judge write-only — learning-flywheel: "the keystone of the entire
+   stack") breaks *at the platform level*, silently, for every consumer. The served
+   slice must be trace-derived-only, enforced where the report is built, with
+   `derived_from_judge` provenance on every served claim.
+4. **Uptake telemetry** — served findings should carry IDs so the loop can report back
+   "injected, followed, outcome" — closing Intelligence's own improvement loop.
+
+## Stress test
+
+- *"Why not keep it all local — it works?"* Local corpora silo learning per product and
+  per machine; the moat claim is *cross*-run, cross-product transfer, which only a
+  shared service realizes. Also: ops (curation, decay, dedup) done five times badly vs
+  once well.
+- *"Why not move observe() to the platform too?"* Latency + context: the in-loop critic
+  needs the live trace within the shot cadence, and shipping full traces mid-loop is
+  cost + privacy surface. Per-run critic local, cross-run memory hosted — clean split.
+- *"Does a hosted dependency break offline/dev?"* The `Corpus` port stays; the hosted
+  service is one implementation behind it (`IntelligenceCorpus` beside `FileCorpus`).
+  Degrade to local, never fail closed on a network read.
+- *"Is there a business here or just plumbing?"* The primed-vs-cold A/B answers both at
+  once: if priming lifts outcomes, "intelligence served to agents" has measurable value
+  per query — eval-substrate's sellable-exhaust thesis, applied to the corpus itself.
+
+## Decision + sequence
+
+1. Run the corpus A/B locally first (no platform work) — it gates everything: no lift,
+   no service.
+2. On a positive: define the served-findings contract (the `Corpus` port already exists
+   — implement it over Intelligence read-back), with the firewall enforced server-side.
+3. The product playbook's Phase 3 (see `product-integration-playbook.md`) then swaps
+   each product's local corpus for the served one — one port, no loop changes.
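
The server-side firewall requirement in `layer-intelligence-serving.md` above (only trace-derived claims, with `derived_from_judge` provenance on each) might look like the following filter on the consuming side. This is a sketch, assuming a finding shape built from the fields the gap list names: `ServedFinding` and `selectForPriming` are hypothetical, and the example findings are invented.

```typescript
// Hypothetical served-finding shape, modeled on the fields named in the gap list
// (area, claim, recommended_action, confidence) plus provenance and an id for uptake telemetry.
interface ServedFinding {
  id: string;
  area: string;
  claim: string;
  recommended_action: string;
  confidence: number;
  derived_from_judge?: boolean; // provenance; a missing value is treated as unsafe
}

// Keep only findings that are explicitly trace-derived and confident enough,
// then take the top k by confidence.
function selectForPriming(
  findings: ServedFinding[],
  k: number,
  minConfidence = 0.5,
): ServedFinding[] {
  return findings
    .filter((f) => f.derived_from_judge === false && f.confidence >= minConfidence)
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, k);
}

// Invented findings for illustration only.
const served: ServedFinding[] = [
  { id: "f1", area: "procedure", claim: "verify before mutate",
    recommended_action: "read state first", confidence: 0.9, derived_from_judge: false },
  { id: "f2", area: "procedure", claim: "judge-derived pass-rate hint",
    recommended_action: "n/a", confidence: 0.95, derived_from_judge: true },
  { id: "f3", area: "procedure", claim: "claim with no provenance",
    recommended_action: "n/a", confidence: 0.8 },
  { id: "f4", area: "procedure", claim: "SLA must be relinked after priority change",
    recommended_action: "relink SLA", confidence: 0.7, derived_from_judge: false },
];
console.log(selectForPriming(served, 2).map((f) => f.id));
```

Filtering on `=== false` rather than on "not true" makes missing provenance fail closed. A client-side filter like this is only a backstop; the layer's argument is that the guarantee has to be enforced where the report is built.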
diff --git a/docs/research/layer-within-run.md b/docs/research/layer-within-run.md
new file mode 100644
index 0000000..6664046
--- /dev/null
+++ b/docs/research/layer-within-run.md
@@ -0,0 +1,58 @@
+> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** mostly settled — boundary law established, one lever open
+
+# Layer: within-run optimization
+
+**The claim under test:** spending a run's compute *adaptively* (steer, refine, branch)
+beats spending it *blindly* (best-of-N resampling) at equal budget.
+
+## Evidence (all paired, equal-compute, deployable checkers)
+
+| domain | setup | steering vs compute | verdict |
+|---|---|---|---|
+| FinSearchComp (stateless retrieval) | n=40, BH | refineHand −10pp, refineGepa −15pp; compute +22.5pp (p=.008) | **negative** |
+| HumanEval (stateless codegen) | n=82, LLM-audit steer | −1.2pp CI[−8.5,+6.1] | null |
+| HumanEval (stateless codegen) | n=82, exec-grounded self-repair (`run_tests` tool) | **−17.1pp** CI[−26.8,−7.3] | **significantly negative** |
+| EOPS-itsm (stateful agentic), flat hand-rolled loop | n=24 | −9.9pp → autopsy: scoring asymmetry | artifact (see below) |
+| EOPS-itsm, **canonical loop** (Supervisor + observe()) | n=16 | **+16.4pp** CI[+5.3,+29.8], 6W/0L | **significantly positive** |
+| EOPS-itsm, disjoint holdout slice | n=6 | +8.3pp (both analyst prompts) | replicates |
+| analyst-prompt GEPA | search n=12, frozen holdout n=6 | holdout: winner +8.3 = baseline +8.3 | **null** (prompt not binding) |
+
+## The boundary law (the durable output of this layer)
+
+Steering pays **iff** the task is *stateful* (the artifact accumulates, so an observed
+correction is worth more than a fresh sample), has a *correctable middle band* (partial
+credit a steer can move), and resampling is *expensive or impossible* (you can't restart
+a 6-step ticket migration). On stateless generation, fresh samples explore for free and
+any anchored continuation loses — exactly the canon's prediction (architecture §10).
+
+Two engineering laws fell out, both load-bearing:
+1. **Keep-best checkpointing is mandatory.** Steering *reaches* better states then
+   *undoes* them (measured degradation +6–8pp). Score/keep the best-verifying
+   checkpoint, never the final state. The flat-loop "depth loses −9.9pp" result was
+   entirely this scoring asymmetry (autopsy `.evolve/autopsies/2026-06-08-…`).
+2. **Architecture is a variable, not plumbing.** The same model/domain/n flipped from
+   "depth loses" (flat loop, hand-rolled steerer) to "+16.4pp significant" (Supervisor +
+   real `observe()` analyst). Measure on the canonical stack only.
+
+## Stress test (strongest objections)
+
+- *"+16.4pp is one domain, one model, n=16."* True. The holdout replication (+8.3pp,
+  disjoint tasks) helps but cross-domain (layer-domain-generality) is the real answer.
+- *"The analyst adds nothing — GEPA tied."* The correct reading is narrower: the
+  analyst-prompt *text* is not binding at this budget. The analyst *mechanism* is in
+  every positive cell, and removing it (generic nudge, flat loop) degraded results. The
+  untested attribution experiment: canonical depth WITHOUT any analyst (pure
+  continuation) vs with — isolates the analyst's marginal value.
+- *"Maybe more shots, not steering, explains depth's win."* No — equal completions by
+  construction (conserved budget pool), and breadth had at least as much compute in the wins.
+
+## What's left in this layer (and what to stop)
+
+**Open lever — topology/strategy:** `adaptiveRefine` (branch-when-stuck), refine/sample
+mixes, widen gates. Now cheap to test (`defineStrategy` + `runBenchmark` + holdout).
+The one within-run experiment still worth funding: **strategy tournament at n≥24 +
+frozen holdout.**
+
+**Stop:** analyst-prompt GEPA at small n (flat landscape, holdout-tied); steering
+experiments on stateless domains (three independent results: two negative, one null); rich-analyst plumbing
+(HALO OTLP emitter) until a topology win re-motivates it.
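
The keep-best checkpointing law in `layer-within-run.md` above (score the best-verifying checkpoint, never the final state) reduces to a small selection step in the harness. This is a sketch, not the repo's implementation: `Checkpoint` and `keepBest` are hypothetical names, and the trajectory is made up to show steering reaching a better state and then partly undoing it.

```typescript
// Hypothetical checkpoint record: `score` is the deployable verifier's score for the
// state reached after a given steering step, as recorded by the harness.
interface Checkpoint {
  step: number;
  score: number;
}

// Returns the best-verifying checkpoint; the earliest one wins ties.
function keepBest(checkpoints: Checkpoint[]): Checkpoint {
  if (checkpoints.length === 0) throw new Error("no checkpoints");
  return checkpoints.reduce((best, c) => (c.score > best.score ? c : best));
}

// Made-up trajectory: steering reaches 0.75 at step 2, then undoes part of it.
const trajectory: Checkpoint[] = [
  { step: 0, score: 0.25 },
  { step: 1, score: 0.5 },
  { step: 2, score: 0.75 },
  { step: 3, score: 0.5 },
];
const finalState = trajectory[trajectory.length - 1];
const best = keepBest(trajectory);
console.log(`final=${finalState.score} keepBest=${best.score} at step ${best.step}`);
```

Scoring `finalState` here would report 0.5 for a run that actually reached 0.75, which is the scoring asymmetry the flat-loop autopsy describes. The selection runs in the harness, so the strategy body itself still never sees verifier output.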
diff --git a/docs/research/optimization-space.md b/docs/research/optimization-space.md
new file mode 100644
index 0000000..d471157
--- /dev/null
+++ b/docs/research/optimization-space.md
@@ -0,0 +1,107 @@
+> **Track:** Architecture (research) · **Role:** strategy map · **Status:** open — taxonomy + stress-tests, 2026-06-09
+
+# The optimization space — axes, not a ladder
+
+A stress-test of the question "does GEPA / steerers / HALO contextualize everything we
+could be working on?" Answer: **no.** Those are all points in ONE region of a larger
+space, and the region we have been grinding (within-run mechanics) is the one where the
+evidence keeps coming back null-or-marginal, while the region the canon names as the
+actual success criterion (the across-run flywheel, Gate B) has **n=0 measurements**.
+
+This doc holds the taxonomy and the canon-compatibility audit. One stress-test doc per
+layer lives beside it (`layer-*.md`).
+
+## Why axes, not layers
+
+The original framing ("6 layers") conflated independent dimensions. The clean model: an
+optimization effort is a **point in a 6-axis space**, and any "ladder" (the canon's
+L0→L1→L2 rungs) is one *path* through it — not the space itself.
+
+| Axis | Values | Where this repo is today |
+|---|---|---|
+| **Timescale** | within-run · across-run · meta (optimizer-of-optimizer) | almost all effort within-run; across-run n=0 |
+| **Target** | prompt (content) · topology/strategy (structure) · knowledge/corpus (memory) · policy (routing, ask-vs-act, budget) · tasks (curriculum) | prompt = measured (tie); topology = open; the rest untouched |
+| **Objective** | single score · multi-objective vector (correct·fast·secure·cheap) | every gate so far single-objective — **in tension with the canon** (see audit) |
+| **Validity scope** | one domain · cross-domain · live product | n=1 domain (EOPS-itsm) for the headline result |
+| **Serving architecture** | in-process (observe()/Corpus) · platform-served (Tangle Intelligence) | all in-process; Intelligence is export-only today |
+| **Authorship** | human-built · agent-authored | human; `defineStrategy` makes agent-authored feasible |
+
+Reconciliation with the canon's ladder: the rungs (L0 worker → L1 controller → L2
+meta-optimizer) are the **timescale × target** diagonal. The axes add what the ladder
+hides: objective shape, validity scope, serving topology, authorship. Both frames are
+compatible: the ladder answers "is level n real?" (lift on level n−1), while the axes
+answer "where is the unexplored headroom?".
+
+## The map with evidence status (2026-06-09)
+
+| Region | Evidence | Verdict |
+|---|---|---|
+| within-run steering, stateless retrieval (FinSearchComp) | n=40, BH-corrected | **NEGATIVE** (steering −10/−15pp; compute +22.5pp) |
+| within-run steering, stateless codegen (HumanEval) | n=82 ×2, paired-bootstrap | **NULL** (audit −1.2pp, n.s.) / **NEGATIVE** (exec-grounded repair −17.1pp, significant) |
+| within-run depth+keep-best, stateful agentic (EOPS) | n=16 + holdout replication | **POSITIVE** (+16.4pp CI[+5.3,+29.8]; +8.3pp on disjoint slice) |
+| analyst-prompt GEPA | search n=12 + frozen holdout n=6 | **NULL** (holdout exact tie vs default) |
+| within-run topology (adaptiveRefine, mix/widen) | unmeasured | open — the one within-run lever left |
+| across-run corpus flywheel (primed-vs-cold) | **n=0** | the canon's stated success criterion, never measured |
+| multi-objective vector | **n=0** (machinery exists: `paretoFrontier`) | canon-required, unwired |
+| cross-domain (csm/hr/email/… gym splits) | **n=0** | nearly free to run |
+| live-product transfer | **n=0** | the product-value claim's own falsifier |
+| tool/harness augmentation | SimpleQA: you.com lifts cheap models +70pp to parity | the **largest single effect measured anywhere in this program** |
+| agent-authored strategies | feasible since `defineStrategy`; unmeasured | the skillification goal |
+
+Reading of the map: the program has **over-sampled one cell** (within-run × prompt/strategy ×
+single-objective × itsm × in-process × human) and the cells the canon itself designates as
+the product (across-run, multi-objective, product-scope) are empty.
+
+## Canon-compatibility audit
+
+Checked against `architecture.md`, `learning-flywheel.md`, `eval-substrate.md`,
+`roadmap-rsi.md`, `architecture-interpretations.md`, `.evolve/current.json`.
+
+**Compatible / direct alignment:**
+- Across-run = success (architecture §0.5.4: "That across-run curve is RSI, and it is THE
+  success criterion (Gate B)"; roadmap: Gate B "not yet instrumented"). The axes frame
+  *restates* the canon's own acknowledged gap.
+- "Topology over prompt as the next within-run lever" — consistent with roadmap Phase 3
+  (grow the ISA) being gated on findings reaching the planner.
+- Platform-served intelligence is a **deployment-topology choice**, not an architecture
+  violation — the kernel owns Scope/MCP/profiles; analysis attaches via hooks
+  (architecture §1b). See `layer-intelligence-serving.md` for the one hard constraint
+  (the judge firewall).
+
+**Corrections the canon forces on the new framing:**
+- "Within-run steering is mostly null" is **too gentle**: the adequately-powered rung-0
+  result is *negative on every slice* (learning-flywheel §Honest status). The accurate
+  law: **negative on stateless retrieval, null-to-negative on stateless codegen, positive
+  on stateful agentic with keep-best checkpointing.** The boundary is set by state,
+  a correctable middle band, and the inability to resample cheaply.
+- The canon already predicted the self-refine failure (architecture §10: "intrinsic
+  self-refine degrades… the driver must re-investigate, not self-critique"). Our
+  HumanEval repair −17.1pp is a *confirmation*, not news.
+
+**Tensions / staleness to resolve (documentation debt, not design conflict):**
+- `learning-flywheel.md` rung-0 verdict ("steering loses") is FinSearchComp-scoped and
+  now needs the domain boundary added (EOPS depth win, canonical loop, +16.4pp).
+- Every gate run to date is single-objective, while architecture §0.5.2–0.5.3 mandates a
+  multi-objective vector with per-objective deployable checkers. This is the **largest
+  internal inconsistency between practice and canon** — see `layer-economics.md`.
+- `.evolve/current.json` predates the canonical-loop result and the GEPA verdict; needs a
+  state refresh (tracked separately from this doc set).
+
+## The portfolio (what to multi-pursue)
+
+Ranked by (decision-relevance × cheapness × independence):
+
+1. **Across-run corpus A/B** (`layer-across-run.md`) — primed-vs-cold at equal budget.
+   The thesis test; doubles as the Tangle-Intelligence-value proof.
+2. **Cross-domain replication** (`layer-domain-generality.md`) — depth-vs-breadth on a
+   second gym split (csm or hr). Validates or bounds the headline result.
+3. **Multi-objective wiring** (`layer-economics.md`) — report the (correct, cost, wall)
+   vector per strategy; lift-per-dollar. Mostly harvest; the machinery exists.
+4. **Topology evolution** (`layer-within-run.md`) — adaptiveRefine/mix vs refine vs
+   sample, n≥24 + holdout; the fitness function is already built.
+5. **Strategy-author skill** (`layer-agent-authored.md`) — an agent reads the losses and
+   emits a `defineStrategy` definition; the gate scores it. Small build; it IS the skillification goal.
+
+Explicitly **not** in the portfolio: more analyst-prompt GEPA (holdout-tied, flat
+landscape), HALO plumbing (rich-analyst bet weakened by the prompt null), in-box sandbox
+arms (platform-gated, #984).
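Portfolio item 3 (report the (correct, cost, wall) vector per strategy, plus lift-per-dollar) can be sketched as follows. This is a self-contained illustration, not the repo's `paretoFrontier`, whose signature is not shown in this doc set. `Cell`, `dominates`, `paretoFront`, and `liftPerDollar` are hypothetical names, and the numbers are invented purely to exercise the code.

```typescript
// One measured cell of the (correct, cost, wall) vector for a strategy.
interface Cell {
  strategy: string;
  correct: number; // fraction correct, higher is better
  costUsd: number; // lower is better
  wallMs: number; // lower is better
}

// a dominates b when a is no worse on every objective and strictly
// better on at least one.
function dominates(a: Cell, b: Cell): boolean {
  const noWorse =
    a.correct >= b.correct && a.costUsd <= b.costUsd && a.wallMs <= b.wallMs;
  const strictlyBetter =
    a.correct > b.correct || a.costUsd < b.costUsd || a.wallMs < b.wallMs;
  return noWorse && strictlyBetter;
}

// The Pareto front is every cell that no other cell dominates.
function paretoFront(cells: Cell[]): Cell[] {
  return cells.filter((c) => !cells.some((other) => dominates(other, c)));
}

// Accuracy gained per extra dollar relative to a baseline. Returns null
// when the candidate is not more expensive, where the ratio is undefined.
function liftPerDollar(candidate: Cell, baseline: Cell): number | null {
  const extraCost = candidate.costUsd - baseline.costUsd;
  if (extraCost <= 0) return null;
  return (candidate.correct - baseline.correct) / extraCost;
}

// Invented numbers, for illustration only.
const cells: Cell[] = [
  { strategy: "sample", correct: 0.5, costUsd: 1, wallMs: 1000 },
  { strategy: "refine", correct: 0.75, costUsd: 2, wallMs: 3000 },
  { strategy: "wasteful", correct: 0.6, costUsd: 3, wallMs: 4000 },
];
const front = paretoFront(cells).map((c) => c.strategy);
const lift = liftPerDollar(cells[1], cells[0]);
```

With these inputs `wasteful` drops off the front because `refine` beats it on all three objectives, while `sample` and `refine` both survive: neither is better on every axis, which is exactly the trade-off a single-objective gate hides.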
diff --git a/docs/research/product-integration-playbook.md b/docs/research/product-integration-playbook.md
new file mode 100644
index 0000000..ba92082
--- /dev/null
+++ b/docs/research/product-integration-playbook.md
@@ -0,0 +1,91 @@
+> **Track:** Operations (research) · **Role:** integration + operator playbook · **Status:** actionable — primitives mostly shipped, three packaging gaps named
+
+# Product integration playbook — putting the optimization system into the products
+
+The step-by-step path for wiring the optimization system (canonical Supervisor loop ·
+`observe()` analyst · Environment/Strategy/`runBenchmark` · corpus) into the live
+agent-app products (gtm / tax / creative / legal / agent-builder), and **what the
+operator (Drew + team) does at each step** vs what runs autonomously.
+
+Honest framing up front: most of the production loop **already ships** in agent-eval /
+agent-runtime (the `agent-stack-adoption` 9-phase pipeline). What this playbook adds is
+(a) where the *new* optimization suite slots into that pipeline, (b) the operator role
+table, (c) the three packaging gaps that block "just import it" today.
+
+## The three packaging gaps (do these first)
+
+| gap | today | needed |
+|---|---|---|
+| **G1 — the suite isn't published.** `Environment`, `Strategy`, `defineStrategy`, `runBenchmark`, the canonical depth/breadth drivers live in `bench/src/` (R&D workspace), not in the published `@tangle-network/agent-runtime` exports. | products can't import them | lift `agentic.ts` + `run-benchmark.mts` into `src/` behind `/loops` (a `substrate-release` motion; the code is already domain-blind) |
+| **G2 — corpus has no production inflow.** `observe()`/`Corpus` runs in bench loops; production traces flow to the trace sink + (optionally) OTLP, but nothing turns production traces into corpus facts automatically. | analyst-loop proposes; PR-gated | a production `observe()` pass over the trace sink (batch, nightly) writing corpus facts; later the Intelligence-served corpus (layer-intelligence-serving) |
+| **G3 — no product `Environment` exists.** The gate has only gym Environments. | gym-only evidence | one product Environment (gtm first): tools = the product's real MCP surface; `score()` = a deployable domain check |
+
+## The integration sequence (one product: gtm-agent)
+
+Assumes the product is already at adoption Phase 3+ (composer + trace sink + nightly
+eval live — gtm is). Each step names the existing primitive; nothing here is invented.
+
+1. **Parity profile** — eval runs the *production* agent: `composeProductionAgentProfile`
+   → `createSandboxAct`. (Shipped; most products wired.) *Operator: none.*
+2. **Production traces flowing** — `createProductionTraceSink` on every chat turn; OTLP
+   export to Intelligence optional but recommended (`createOtelExporter`). *Operator:
+   set the OTLP endpoint secret once; glance at trace health weekly.*
+3. **The product Environment (G3)** — implement the 5 hooks over gtm's real surface:
+   `open` = a scoped workspace/session; `tools` = the product MCP tools; `call` =
+   invoke them; `score` = a deployable check (campaign-state assertions, not an LLM
+   judge); `close` = teardown. ~1–2 days; this is the gym→product bridge experiment
+   from `layer-domain-generality.md`. *Operator decision: which checks define "done"
+   for a gtm task — this is product judgment, not engineering.*
+4. **Run the gate on the product** — `runBenchmark({environment: gtmEnv, strategies:
+   [sample, refine], …})` over a frozen scenario set. First output: does depth/steering
+   pay on *your* domain, with the (correct, $, ms) vector per layer-economics.
+   *Operator: review the report; pick the strategy+model cell for production.*
+5. **Backend integrity + scorecard + ship-gate** — `assertRealBackend` before any
+   verdict; `recordRunsToScorecard`/`diffScorecard` per commit; `runProductionLoop`'s
+   held-out promotion gate for any prompt/addendum change. (All shipped.) *Operator:
+   approve/reject gate-passing PRs — this is the standing human checkpoint.*
+6. **Corpus priming (G2 + the across-run layer)** — nightly `observe()` over the day's
+   production traces → corpus; prime tomorrow's runs via `corpus.query`. Run
+   primed-vs-cold on the product scenario set — the product-grade flywheel test.
+   *Operator: review high-confidence facts weekly (a 10-minute curation pass); approve
+   the auto-apply threshold.*
+7. **Intelligence hookup** — keep exporting (step 2 covers it). When the served-findings
+   read-back exists (layer-intelligence-serving), swap `FileCorpus` for the
+   Intelligence-backed `Corpus` — one port, no loop changes. *Operator: tenant config.*
+8. **CI crons** — nightly eval + weekly production-loop (templates shipped in the
+   adoption skills). *Operator: provision the runner once; rotate secrets; review the
+   weekly auto-PR.*
+
+## The operator role, consolidated
+
+What **only humans** do — everything else runs autonomously:
+
+| cadence | action | authority |
+|---|---|---|
+| once per product | define the deployable checks (step 3) + holdout scenarios | product judgment — the single highest-leverage human input |
+| once | set gate thresholds (paired-delta, overfit gap), budgets, model allowlist | risk posture |
+| weekly | review scorecard diff + the production-loop auto-PR; approve/reject | the ship decision |
+| weekly | 10-min corpus curation (high-confidence facts in/out) | knowledge quality |
+| on failure | respond to backend-integrity or infra alerts (stub verdict, runner down) | unblock |
+
+The deliberate design: the human owns **what "good" means** (checks, thresholds,
+scenarios) and **the ship decision**; the system owns everything between — running,
+scoring, mutating, gating, reporting. That is the operator contract to staff for: not
+babysitting runs, but curating definitions and reviewing one diff per product per week.
+
+## Sequencing across the fleet
+
+gtm first (richest tools, live traces, friendliest checks) → then tax (high-value
+deterministic checks: return validation) → creative/legal (checks are harder to make
+deterministic — may stay at steps 1–2+5 until eval-agent rubrics mature) →
+agent-builder (special case: its *product* is generating agents, so the strategy-author
+skill from `layer-agent-authored.md` is its feature, not its tooling).
+
+## What NOT to do
+
+- Don't fork `runProductionLoop` per product to get custom topologies — that's G1's
+  job (publish `Strategy`), then strategies are injected, not forked.
+- Don't auto-apply corpus facts above the measured-precision threshold; PR-gate until
+  the primed-vs-cold A/B shows lift.
+- Don't ship any steering default to a product before its own Environment gate (step 4)
+  shows it pays *on that domain* — the boundary law says it may not.
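Step 3 of the integration sequence names five hooks (`open`, `tools`, `call`, `score`, `close`) but this playbook does not show their signatures. The sketch below assumes one plausible shape to make the idea concrete: the `Environment<S>` interface, `ToolSpec`, `GtmSession`, and both tool names are hypothetical, and an in-memory map stands in for the product's real MCP surface. The point it illustrates is the `score` hook: a deterministic assertion on campaign state rather than an LLM judge.

```typescript
// ASSUMPTION: these signatures are illustrative, not the published API.
interface ToolSpec {
  name: string;
  description: string;
}

interface Environment<S> {
  open(scenarioId: string): Promise<S>;
  tools(session: S): ToolSpec[];
  call(session: S, tool: string, args: Record<string, unknown>): Promise<unknown>;
  score(session: S): Promise<number>;
  close(session: S): Promise<void>;
}

// Hypothetical session state; a real one would wrap a scoped workspace.
interface GtmSession {
  scenarioId: string;
  campaigns: Map<string, { status: "draft" | "live" }>;
  closed: boolean;
}

const gtmEnv: Environment<GtmSession> = {
  async open(scenarioId) {
    return { scenarioId, campaigns: new Map(), closed: false };
  },
  tools() {
    return [
      { name: "create_campaign", description: "Create a draft campaign" },
      { name: "launch_campaign", description: "Move a campaign to live" },
    ];
  },
  async call(session, tool, args) {
    const name = String(args.name);
    if (tool === "create_campaign") {
      session.campaigns.set(name, { status: "draft" });
      return { ok: true };
    }
    if (tool === "launch_campaign") {
      const campaign = session.campaigns.get(name);
      if (!campaign) return { ok: false, error: "unknown campaign" };
      campaign.status = "live";
      return { ok: true };
    }
    throw new Error(`unknown tool: ${tool}`);
  },
  // Deployable check: 1 if at least one campaign is live, else 0.
  async score(session) {
    const states = Array.from(session.campaigns.values());
    return states.some((c) => c.status === "live") ? 1 : 0;
  },
  async close(session) {
    session.campaigns.clear();
    session.closed = true;
  },
};
```

Which state assertions count as "done" is the operator decision called out in step 3; the single "one campaign is live" check here is a placeholder for that product judgment.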