17 changes: 17 additions & 0 deletions docs/research/README.md
@@ -27,6 +27,23 @@ spine happen explicitly, with `file:line` anchors, once a design ships.
| [codex-techniques-audit.md](./codex-techniques-audit.md) | Adoption report mining OpenAI Codex for succinct-code principles + orchestration techniques. **Advisory** — verify `file:line` before acting. |
| [loop-facade-postmortem.md](./loop-facade-postmortem.md) | Failure record for the deleted `defineLoop` facade: why retyping `Scope`/MCP/journals/validators produced code without substrate proof, and the prevention rule for future loop APIs. |

### The optimization-space suite (2026-06-09)

The strategy map + per-layer stress tests, written after the steering/GEPA gate series.
Start at the index; each layer doc carries its own evidence table, strongest objections,
and concrete next experiments.

| Doc | What it holds |
|-----|---------------|
| [optimization-space.md](./optimization-space.md) | **The index.** The 6-axis taxonomy (timescale · target · objective · validity scope · serving architecture · authorship), the evidence map (which cells are measured/null/empty), the canon-compatibility audit, and the ranked experiment portfolio. |
| [layer-within-run.md](./layer-within-run.md) | Within-run optimization — the settled boundary law (steering negative on stateless, positive on stateful+keep-best), the two engineering laws (checkpointing; architecture-is-a-variable), and the one open lever (topology tournament). |
| [layer-across-run.md](./layer-across-run.md) | **The unmeasured thesis (n=0).** The corpus flywheel: primed-vs-cold A/B design, the four falsifiers (context pollution, stale facts, judge leakage, worker disregard), and why this layer dominates the portfolio. |
| [layer-economics.md](./layer-economics.md) | Multi-objective + cost: the largest practice-vs-canon inconsistency (all gates single-objective; canon mandates the vector), the lift-per-dollar frontier, and the tool-augmentation effect (+70pp) that dominates everything else measured. |
| [layer-domain-generality.md](./layer-domain-generality.md) | The n=1-domain exposure of the headline result; the nearly-free cross-domain replication (csm/hr gym splits); why itsm may be idiosyncratic; the product-transfer falsifier. |
| [layer-intelligence-serving.md](./layer-intelligence-serving.md) | Self-hosted vs platform-served intelligence: Tangle Intelligence is export-only today; the timescale split (in-loop critic local, across-run memory served); the four-gap list incl. the **server-side judge firewall** as the non-negotiable. |
| [layer-agent-authored.md](./layer-agent-authored.md) | Skillification: agent-authored strategies via `defineStrategy`, the two structural safety properties (conserved budget, firewall), and the R0→R3 success ladder for the strategy-author skill. |
| [product-integration-playbook.md](./product-integration-playbook.md) | **The operator playbook.** The 8-step product integration sequence (gtm first), the consolidated human-role table (what only operators do), the three packaging gaps (publish the suite, corpus inflow, product Environments), and fleet sequencing. |

## Source artifacts (multi-agent passes)

| Run | Pass | Result lands in |
68 changes: 68 additions & 0 deletions docs/research/layer-across-run.md
@@ -0,0 +1,68 @@
> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** THE unmeasured thesis — n=0, highest priority

# Layer: across-run learning (the flywheel)

**The claim under test:** run N+1 is measurably better than run N because the system
*learned* from run N — the corpus of trace-derived findings primes future runs. This is
the canon's success criterion verbatim (architecture §0.5.4: "the across-run curve is
RSI, and it is THE success criterion (Gate B)"; learning-flywheel §1).

## Status: the embarrassing asymmetry

Within-run mechanics have ~6 adequately-powered measurements (mostly null/negative).
Across-run learning has **zero**. The machinery is wired (`observe()` → `Corpus` →
`renderCorpusToInstructions` → next-run priming; demonstrated live in `fleet.mts`,
"carrying 2 prior learnings"), but the *benefit* has never been measured. The ledger has
called the primed-vs-cold A/B "the cheap test that makes it pay rent" since 2026-06-08.

## The experiment (designed, runnable now)

**Primed-vs-cold at equal budget.** Two arms over the same task stream (EOPS split, or
ideally a *sequence* so learning can accumulate):
- **cold**: every run starts fresh (the canonical loop as measured).
- **primed**: before each run, `corpus.query(task tags)` → top-k high-confidence facts
injected into the worker/analyst context; after each run, `observe()` appends.

Score both with the same deployable verifier; the metric is the **slope** (does primed's
advantage *grow* over the stream — the flywheel signature) and the endpoint lift. Frozen
holdout: a final disjoint slice where primed keeps its corpus but cold stays cold.
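
One way to compute the two reported quantities is sketched below. It assumes each task in the stream yields a paired record with one score per arm; the `PairedScore` shape and both function names are hypothetical and are not part of the existing harness. The slope is an ordinary least-squares fit of the primed-minus-cold difference against stream position.

```typescript
// Hypothetical record shape: one paired observation per task in the stream.
interface PairedScore {
  position: number; // 0-based index of the task in the stream
  cold: number;     // verifier score for the cold arm
  primed: number;   // verifier score for the primed arm
}

// Ordinary least-squares slope of (primed - cold) against stream position.
// A positive slope means the primed advantage grows as the corpus accumulates.
function advantageSlope(pairs: ReadonlyArray<PairedScore>): number {
  const n = pairs.length;
  if (n < 2) throw new Error("need at least two paired observations");
  const meanX = pairs.reduce((s, p) => s + p.position, 0) / n;
  const meanY = pairs.reduce((s, p) => s + (p.primed - p.cold), 0) / n;
  let sxy = 0;
  let sxx = 0;
  for (const p of pairs) {
    const dx = p.position - meanX;
    sxy += dx * (p.primed - p.cold - meanY);
    sxx += dx * dx;
  }
  if (sxx === 0) throw new Error("all positions are identical");
  return sxy / sxx;
}

// Endpoint lift: mean primed-minus-cold score on the frozen holdout slice.
function endpointLift(holdout: ReadonlyArray<PairedScore>): number {
  if (holdout.length === 0) throw new Error("empty holdout");
  return holdout.reduce((s, p) => s + (p.primed - p.cold), 0) / holdout.length;
}
```

A point estimate of the slope is not a verdict on its own; a paired test or bootstrap interval over the same records would be needed before calling the flywheel signature present.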

Falsifiers to design against (the stress test):
1. **Context pollution** — injected facts displace task-relevant context and *hurt*
(the FinSearch lesson: workers got advice and ignored it; fleet.mts observed the
same). Mitigate: cap k, relevance-rank, measure a k=0/2/5 dose curve.
2. **Stale facts** — the gym DB resets per task; "learnings" about *instances* are
noise, only *procedural* learnings transfer ("verify before mutate", "SLA must be
relinked after priority change"). The corpus schema already separates `area`/`claim`;
the A/B should tag procedural-vs-instance and report both.
3. **Judge leakage** — corpus facts must remain trace-derived (`derived_from_judge:
false` is enforced structurally in `observe()`); a primed win that came from leaked
verdicts would be Goodhart, not learning.
4. **Worker disregard** — measured before (advice ignored). Track *uptake*: did the
worker's tool sequence change in the direction of the injected fact?
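
The mitigations for falsifiers 1 to 3 can be combined in the selection step that runs before priming. The sketch below is illustrative only: `area`, `claim`, and `derived_from_judge` are the field names the text mentions, while `confidence`, `kind`, `PrimingOptions`, and `selectPrimingFacts` are assumptions introduced for the example.

```typescript
type FactKind = "procedural" | "instance"; // assumed tag for falsifier 2

// Hypothetical fact shape; only `area`, `claim` and `derived_from_judge`
// are named in the text, the remaining fields are assumptions.
interface CorpusFact {
  area: string;
  claim: string;
  confidence: number;
  kind: FactKind;
  derived_from_judge: boolean;
}

interface PrimingOptions {
  k: number;                      // dose: how many facts to inject
  minConfidence?: number;         // assumed default threshold of 0.5
  kinds?: ReadonlyArray<FactKind>; // restrict to procedural and/or instance facts
}

const ALL_KINDS: ReadonlyArray<FactKind> = ["procedural", "instance"];

// Dose levels for the k=0/2/5 curve named in falsifier 1.
const DOSE_LEVELS: ReadonlyArray<number> = [0, 2, 5];

function selectPrimingFacts(
  facts: ReadonlyArray<CorpusFact>,
  opts: PrimingOptions,
): CorpusFact[] {
  if (!Number.isInteger(opts.k) || opts.k < 0) {
    throw new Error("k must be a non-negative integer");
  }
  const minConfidence = opts.minConfidence ?? 0.5;
  const kinds = opts.kinds ?? ALL_KINDS;
  return facts
    .filter((f) => !f.derived_from_judge)          // falsifier 3: no leaked verdicts
    .filter((f) => kinds.indexOf(f.kind) !== -1)   // falsifier 2: report by kind
    .filter((f) => f.confidence >= minConfidence)
    .sort((a, b) => b.confidence - a.confidence)   // highest confidence first
    .slice(0, opts.k);                             // falsifier 1: cap the dose
}
```

Running the primed arm once per entry in `DOSE_LEVELS`, and once per `kinds` setting, would produce the dose curve and the procedural-vs-instance split from the same corpus. Confidence stands in for relevance ranking here; a real run would rank against the current task.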

## Why this layer dominates the portfolio

- It is the **stated product** ("the moat is the cross-benchmark learning flywheel",
architecture §8) and the only layer whose success directly justifies the corpus, the
judge discipline, and the RSI framing.
- The within-run results make it *more* urgent, not less: if adaptive compute inside a
run is mostly worthless, the entire bet collapses onto memory across runs.
- It is the natural junction with **Tangle Intelligence** (see
`layer-intelligence-serving.md`): a positive primed-vs-cold result is simultaneously
the proof that a hosted corpus/findings service has product value — the same
experiment, two strategic answers.

## Expansion beyond the first A/B

- **Retrieval-steered analyst**: the analyst's context includes findings from *past
similar failures* (corpus query keyed on the current trace), not just the current
trace — the cross-run version of `observe()`.
- **Cross-benchmark transfer** (the full Gate B): learn on EOPS-itsm, measure lift on
csm/hr — does *procedural* knowledge transfer across domains? This is the actual moat
claim and it has a concrete falsifier (instance-knowledge won't transfer; procedural
might).
- **Corpus curation as the optimization target**: once priming shows any lift, *what to
keep* (confidence thresholds, decay, dedup) becomes the GEPA-optimizable surface —
optimizing memory instead of prompts. Note this is exactly where the prompt-GEPA
machinery transfers after its within-run null.
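
To make the curation surface concrete, here is a minimal sketch of a policy with the three knobs named above (confidence threshold, decay, dedup). Everything in it is hypothetical: the `StoredFact` and `CurationPolicy` shapes, the exponential half-life decay, and the case-insensitive dedup key are assumptions chosen for illustration, not the corpus's actual behavior.

```typescript
// Hypothetical stored-fact shape; `ageInRuns` is an assumed staleness measure.
interface StoredFact {
  area: string;
  claim: string;
  confidence: number;
  ageInRuns: number;
}

// The tunable surface an optimizer would search over.
interface CurationPolicy {
  minConfidence: number; // drop facts whose decayed confidence falls below this
  halfLifeRuns: number;  // confidence halves every `halfLifeRuns` runs
}

function curate(
  facts: ReadonlyArray<StoredFact>,
  policy: CurationPolicy,
): StoredFact[] {
  if (!(policy.halfLifeRuns > 0)) throw new Error("halfLifeRuns must be positive");
  const best = new Map<string, StoredFact>();
  for (const f of facts) {
    // Decay: exponential in age, an assumed functional form.
    const decayed = f.confidence * Math.pow(0.5, f.ageInRuns / policy.halfLifeRuns);
    if (decayed < policy.minConfidence) continue;
    // Dedup: same area and same claim text, ignoring case and outer whitespace.
    const key = f.area + "|" + f.claim.trim().toLowerCase();
    const prev = best.get(key);
    if (prev === undefined || decayed > prev.confidence) {
      best.set(key, { ...f, confidence: decayed });
    }
  }
  return Array.from(best.values());
}
```

Under this framing a candidate is a `CurationPolicy` value rather than a prompt, and its fitness would be the primed-vs-cold lift obtained with the curated corpus.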
78 changes: 78 additions & 0 deletions docs/research/layer-agent-authored.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,78 @@
> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** newly feasible — the skillification goal, unmeasured

# Layer: agent-authored optimization (skillification)

**The claim under test:** agents can author the optimization machinery themselves —
read a run's failures, write a *new strategy* (code, not prompt), and have it gated like
any human-built candidate. This is the stated product goal ("skillify the process so
agents develop these complex things") and the literal RSI claim, one level up from
prompt mutation.

## Why this just became feasible

Before `defineStrategy`, a strategy was a ~70-line Supervisor driver (spawn/scope/journal
ceremony), not a unit any agent emits reliably. Now a strategy is a **~20-line
body composing two steps** (`shot()`, `critique()`) with the ceremony hidden, proven by
`adaptiveRefine` (branch-when-stuck, authored from the steps, runs through the canonical
gate). The skillifiable unit exists; what's missing is the skill and the measurement.

## The two safety properties that make agent authorship sound

These are structural, not policy — which is what makes this layer credible at all:

1. **Equal-compute by construction.** Any authored strategy spends through the
Supervisor's conserved budget pool — it *cannot* win by spending more (the
anti-confound invariant the keystone was built for).
2. **The firewall is structural.** A strategy body composes `shot`/`critique`; it never
receives the verifiers or expected values. An authored strategy can be wrong but
cannot Goodhart the check — the judge stays write-only regardless of who wrote the
code.

Residual risks that are NOT structurally covered: infinite-loop bodies (cap: the budget
pool exhausts → spawn refused → strategy ends), environment abuse via tool calls (same
exposure as any worker — the Environment's own tool surface is the boundary), and
plain bad code (gate + holdout catches uselessness; typecheck catches breakage).
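
The equal-compute property and the infinite-loop cap both follow from one mechanism, which the toy model below illustrates. It is a sketch under stated assumptions: `BudgetPool`, `tryReserve`, and `runGreedyLoop` are hypothetical names, and the real Supervisor API is not shown in this document.

```typescript
// Hypothetical pool; it models only the conserved-budget invariant.
class BudgetPool {
  private remainingUsd: number;

  constructor(totalUsd: number) {
    if (!(totalUsd >= 0)) throw new Error("budget must be non-negative");
    this.remainingUsd = totalUsd;
  }

  // Reserve spend for one spawn. Returns false (spawn refused) when the
  // pool cannot cover the cost; the pool is never overdrawn.
  tryReserve(costUsd: number): boolean {
    if (!(costUsd > 0)) throw new Error("cost must be positive");
    if (costUsd > this.remainingUsd) return false;
    this.remainingUsd -= costUsd;
    return true;
  }

  get remaining(): number {
    return this.remainingUsd;
  }
}

// A deliberately unbounded strategy body: it keeps spawning until refused.
// Termination comes from the pool, not from the body's own logic.
function runGreedyLoop(pool: BudgetPool, costPerSpawnUsd: number): number {
  let spawns = 0;
  while (pool.tryReserve(costPerSpawnUsd)) {
    spawns += 1;
  }
  return spawns;
}
```

Because every candidate draws from a pool of the same size, two strategies in a tournament can differ in how they spend but not in how much, which is the anti-confound invariant described above.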

## The experiment (the strategy-author skill)

A skill/agent given: the `defineStrategy` contract + the two steps' docs + a run's
**losses** (per-task: breadth score, depth score, trajectory — already emitted by the
GEPA fitness fn) — asked to author one new strategy attacking the observed failure
mode. The authored strategy enters the same tournament as human-built ones
(`runBenchmark`, n≥24, frozen holdout).

Success ladder (each rung independently informative):
- **R0** — the agent emits a strategy that typechecks and completes the gate. (Pure
feasibility; expect pass.)
- **R1** — an authored strategy beats `sample` on the holdout. (Parity with human
baseline quality.)
- **R2** — an authored strategy beats the best *human* strategy on the holdout. (The
actual RSI-one-level-up claim.)
- **R3** — iterated: feed the authored strategy's own losses back; does generation 2
beat generation 1? (GEPA-over-code; this is meta-harness's territory and should run
through that skill's discipline — stable baseline + product-value claim — not a
hand-rolled loop.)
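
Since each rung is independently informative, a report can record a verdict per rung rather than a single highest level. The sketch below assumes holdout scores are already aggregated to one number per strategy; `LadderEvidence`, `LadderVerdict`, and `ladderVerdict` are hypothetical names, and the raw `>` comparisons stand in for whatever paired test the gate actually applies.

```typescript
// Hypothetical evidence bundle for one authored strategy.
interface LadderEvidence {
  typechecks: boolean;
  completesGate: boolean;
  authoredHoldout: number;      // generation 1 of the authored strategy
  sampleHoldout: number;        // the `sample` baseline
  bestHumanHoldout: number;     // best human-built strategy
  generation2Holdout?: number;  // present only if the R3 iteration was run
}

// One verdict per rung; R3 is undefined when generation 2 was not run.
interface LadderVerdict {
  R0: boolean;
  R1: boolean;
  R2: boolean;
  R3: boolean | undefined;
}

function ladderVerdict(e: LadderEvidence): LadderVerdict {
  const R0 = e.typechecks && e.completesGate;
  // Score comparisons only count if the candidate actually finished the gate.
  const R1 = R0 && e.authoredHoldout > e.sampleHoldout;
  const R2 = R0 && e.authoredHoldout > e.bestHumanHoldout;
  const R3 =
    e.generation2Holdout === undefined
      ? undefined
      : R0 && e.generation2Holdout > e.authoredHoldout;
  return { R0, R1, R2, R3 };
}
```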

## Stress test

- *"Isn't this just GEPA with a bigger search space?"* Materially different: prompt
space was measured flat (holdout tie); *program* space contains things prompts cannot
express (branch-when-stuck, restart policies, multi-artifact coordination, team
topologies). The prior is genuinely open.
- *"LLMs write plausible-broken control flow."* R0 exists precisely to measure the
emission reliability before claiming anything; the gate absorbs broken candidates as
scored losses, not crashes (the resilient harness skips, never dies).
- *"Multi-agent teams?"* Same unit: a "team" is a strategy whose body spawns several
*different* agents and arbitrates — the recursive atom already expresses it; the skill
just needs one team-shaped example in its docs.
- *"Why a skill rather than a workflow?"* The skill is the productization: it travels to
any repo with the substrate, and it is the artifact that makes "agents develop these
complex things themselves" true for users, not just for this bench.

## Order of operations

1. Write the strategy-author skill (input: losses + contract; output: a
`defineStrategy` file + rationale). Small.
2. R0/R1 on the existing EOPS gate (cheap, reuses everything).
3. R2 tournament: authored vs `refine` vs `adaptiveRefine` vs `sample`, n≥24 + holdout.
4. R3 only through `meta-harness` discipline, gated on R2 signal.
63 changes: 63 additions & 0 deletions docs/research/layer-domain-generality.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,63 @@
> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** n=1 domain — the headline result's biggest validity risk

# Layer: domain generality and product transfer

**The claim under test:** the boundary law ("steering wins on stateful agentic work")
and the +16.4pp depth result generalize beyond EOPS-itsm — across gym domains, across
task families, and ultimately to live products.

## The exposure

Every positive steering result in this program sits on **one domain**: EOPS *itsm*
(ServiceNow ticket ops, SQL-state verifiers). The negatives sit on two stateless domains
(FinSearchComp, HumanEval). So the "boundary law" is interpolated from 3 points (one stateful, two stateless), and the
product thesis ("depth wins on ops-like agentic work") rests on n=1 domain, n=1 gym,
n=2 models. The canon's own discipline (eval-substrate: paired stats, honest scoping)
demands this be named: **the law is a hypothesis with one supporting stateful domain.**

## The cheap replication (nearly free)

`gym_dbs.zip` ships **eight** domain splits: itsm, csm, hr, email, drive, calendar,
teams, hybrid — same container, same MCP/verifier machinery, same `Environment`
implementation (`agentic-eops.ts` is domain-blind; only the HF split name changes). A
cross-domain run is a config change:

- **Experiment:** canonical depth-vs-breadth (Supervisor + observe, keep-best) on csm +
hr at n≥16 each, same model.
- **Outcomes:** (a) replicates → the law has 3 stateful domains and the product claim
firms up; (b) fails on one → the boundary is finer than "stateful" (e.g. itsm's
read-verify-write loops are unusually steerable) and we learn *which* property carries
the win — either result is decision-grade.
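
As an illustration of how small the change is, the replication could be expressed as a list of gate configurations that differ only in the split name. The `GateConfig` shape, the arm labels, and `replicationConfigs` are assumptions made for this sketch; the actual `runBenchmark` options are not shown in this document.

```typescript
// The eight splits listed above.
const GYM_DOMAINS = [
  "itsm", "csm", "hr", "email", "drive", "calendar", "teams", "hybrid",
] as const;
type GymDomain = (typeof GYM_DOMAINS)[number];

// Hypothetical config shape for one gate run.
interface GateConfig {
  split: GymDomain;           // the only field that varies across domains
  n: number;                  // tasks per domain
  model: string;              // held fixed across domains
  arms: ["depth", "breadth"]; // the canonical comparison
}

// Build one config per domain, enforcing the n >= 16 floor from the design.
function replicationConfigs(
  domains: ReadonlyArray<GymDomain>,
  model: string,
  n: number = 16,
): GateConfig[] {
  if (!Number.isInteger(n) || n < 16) throw new Error("n must be an integer >= 16");
  return domains.map((split): GateConfig => ({
    split,
    n,
    model,
    arms: ["depth", "breadth"],
  }));
}
```

Calling `replicationConfigs(["csm", "hr"], model)` with the same model identifier used for the itsm runs yields the two configurations the experiment needs.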

## Stress test (why itsm might be idiosyncratic)

- itsm tasks have **many independent sub-goals** (2–18 SQL verifiers/task) — partial
credit is dense, so a steer always has a "next unfinished item." Domains with one
atomic verifier may behave like stateless tasks.
- itsm tools are **read/write symmetric** (every mutation is cheaply checkable by a
read) — the verify-before-mutate steer is unusually actionable. Email/calendar may
lack cheap verification reads.
- The gym DB **resets per task** — no long-horizon persistence *across* tasks, so this
is still short-horizon steering. The long-horizon claim (hours-scale accumulation)
needs commit0/SWE-class coding domains — currently platform-gated (#984 sandbox
egress), the honest outer boundary of what's testable today.

## Product transfer (the falsifier the product-value claim wrote down)

The gym is a proxy. The five live products (gtm/tax/legal/creative/agent-builder) are
the target, and `.evolve/eops-steerer-product-claim.md` already names the falsifier:
*"the win doesn't transfer off the gym to a real connector-backed ops agent."* Transfer
is not a bigger gym run — it is the integration question (see
`product-integration-playbook.md`): implement an `Environment` over one product's real
tool surface + a deployable check from its domain (e.g. gtm: a campaign-state check;
tax: a return-validation check), and run the same gate. That is the experiment that
converts this research program into product value, and nothing in the current evidence
shortcuts it.

## Order of operations

1. csm + hr replication (config-change cheap, decision-grade either way).
2. The (correct, usd, ms) vector on those runs (free, per `layer-economics.md`).
3. One product `Environment` (gtm first — richest tool surface, live traces flowing) —
the bridge experiment, scoped in the playbook.
4. commit0/SWE long-horizon — parked on #984; revisit when the platform unblocks.
67 changes: 67 additions & 0 deletions docs/research/layer-economics.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,67 @@
> **Track:** Architecture (research) · **Role:** layer stress-test · **Status:** canon-mandated, practice-absent — the largest internal inconsistency

# Layer: economics, multi-objective, and the portfolio question

**The claim under test:** "best" is a vector — correct · fast · secure · cheap — and the
optimization target is the Pareto frontier, not a pre-collapsed score.

## The inconsistency this layer names

The canon mandates this (architecture §0.5.2 "Success is multi-objective; we do not
collapse it to one number until forced"; §0.5.3 each objective carries its own deployable
checker). **Every gate this program has run is single-objective** (verifier score), with
cost merely *reported*. The Pareto machinery exists (`paretoFrontier`,
`paretoFrontierWithCrowding` in agent-eval; the GEPA harness already selects on
[lift, cost]). This is practice lagging canon, not a design dispute — and it changes
conclusions: a strategy that ties on score but halves cost **wins** under the canon's
definition and is invisible under ours.
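
The tie-on-score, half-the-cost case can be checked with a few lines of dominance logic. This is a generic Pareto sketch, not the `paretoFrontier` implementation in agent-eval; the `Outcome` shape and the names `dominates` and `frontierSketch` are assumptions for illustration.

```typescript
// Hypothetical per-strategy outcome: higher `correct` is better,
// lower `usd` and `ms` are better.
interface Outcome {
  name: string;
  correct: number;
  usd: number;
  ms: number;
}

// a dominates b when a is no worse on every objective and strictly
// better on at least one.
function dominates(a: Outcome, b: Outcome): boolean {
  const noWorse = a.correct >= b.correct && a.usd <= b.usd && a.ms <= b.ms;
  const strictlyBetter = a.correct > b.correct || a.usd < b.usd || a.ms < b.ms;
  return noWorse && strictlyBetter;
}

// The frontier is every outcome that no other outcome dominates.
function frontierSketch(outcomes: ReadonlyArray<Outcome>): Outcome[] {
  return outcomes.filter((c) => !outcomes.some((o) => dominates(o, c)));
}
```

With two strategies that share a score and latency but differ twofold in cost, only the cheaper one survives on the frontier, whereas a score-only gate would report a tie.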

## What's free to wire (harvest, not research)

- **correct** — already the verifier. **cheap** — already measured (`Spend.usd`,
tokens; the conserved pool meters it). **fast** — already measured (`Spend.ms`).
Three of four objectives are *already in every RunRecord*; the work is reporting the
vector + Pareto verdicts instead of the scalar. ~Days, not weeks.
- **secure** — the one objective needing a real checker (domain-dependent: policy
violations in EOPS, dangerous tool calls, secret leakage). Defer until a domain
supplies one; don't fake it with an LLM judge (eval-substrate: deterministic or
execution-grounded only).

## The two big unmeasured effects in this layer

1. **The cost-quality frontier across models.** The router serves 500+ models; the
gates have used 2–3. The product question is *lift-per-dollar*, and the data so far
hints the frontier is strange: deepseek-v4-flash resolves 6% of EOPS (too weak to
steer), v4-pro carries the +16.4pp at a fraction of gpt-4.1's price. A model-sweep on
the existing gate (same harness, 4–5 models, report (score, $/task)) maps it for the
cost of one rerun.
2. **Tool/harness augmentation dominates.** The largest single effect this program has
ever measured is not steering, not selection, not prompts — it is **giving cheap
models a search tool**: you.com lifted *all five* models to ~90% on SimpleQA (+70pp
for cheap models, p≈.03), erasing the model-quality gap. The honest implication: for
many task classes, **harness augmentation ≥ model choice ≥ strategy ≫ prompt** in
effect size. The portfolio should weight accordingly — an "augmentation sweep" (which
tool grants close which domain's gap) is plausibly worth more than every remaining
steering experiment combined.
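
A lift-per-dollar table for the model sweep could be derived as follows. The sketch assumes lift is measured in percentage points between a baseline arm and a steered arm, and that cost is the steered arm's dollars per task; `ModelArm`, `FrontierRow`, and `liftPerDollarTable` are hypothetical names, and no measured numbers are included.

```typescript
// Hypothetical per-model input: scores are fractions in [0, 1].
interface ModelArm {
  model: string;
  baselineScore: number;
  steeredScore: number;
  usdPerTask: number; // assumed: cost of the steered arm per task
}

interface FrontierRow {
  model: string;
  liftPp: number;       // steered minus baseline, in percentage points
  usdPerTask: number;
  liftPpPerUsd: number; // percentage points of lift per dollar per task
}

// Build the table and sort it so the best lift-per-dollar comes first.
function liftPerDollarTable(arms: ReadonlyArray<ModelArm>): FrontierRow[] {
  return arms
    .map((a): FrontierRow => {
      if (!(a.usdPerTask > 0)) throw new Error("usdPerTask must be positive");
      const liftPp = (a.steeredScore - a.baselineScore) * 100;
      return {
        model: a.model,
        liftPp,
        usdPerTask: a.usdPerTask,
        liftPpPerUsd: liftPp / a.usdPerTask,
      };
    })
    .sort((x, y) => y.liftPpPerUsd - x.liftPpPerUsd);
}
```

The same table shape would serve the augmentation sweep if the baseline arm is the model without the tool grant and the second arm is the model with it.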

## Stress test

- *"Multi-objective is premature until score itself is solid."* Backwards under the
canon: collapsing to score is what made the deepseek-flash runs look uninformative
(6% resolve) when the right reading was "off the frontier, wrong model for the
domain." The vector is *cheaper* to be right with, not more expensive.
- *"Pareto verdicts confuse operators."* The scalarization exists (`scalarScore`,
weighted) for when a single winner is forced; the discipline is collapse-last.
- *"Routing is a product, not an experiment."* It's both — but the *measurement* (the
frontier map) is precisely the eval-substrate's sellable exhaust (eval-substrate: "which
(harness × model × provider × strategy) is actually best for task-class X").

## Concrete next steps

1. Wire the (correct, usd, ms) vector + `paretoFrontier` verdict into `runBenchmark`'s
report (additive; the data is already in the records).
2. Model-frontier sweep on the canonical EOPS gate: {v4-flash, v4-pro, glm-5, gpt-4.1}
× {sample, refine} → the first published lift-per-dollar table.
3. Augmentation sweep design: per domain, the tool grant that closes the cheap-model
gap (search for retrieval domains; what is the EOPS analog — schema docs? read-tool
hints?).