Jcb/kernel sharding by johnchandlerburnham · Pull Request #436 · argumentcomputer/ix

johnchandlerburnham · 2026-06-08T09:56:51Z

Sharding: measure → partition → manifest for proving `.ixe` environments

What & why

A compiled Lean environment is infeasible to prove in zero-knowledge as one
monolith (mathlib is ~631k blocks, 12.4M delta-unfold edges). This PR adds the
measure → partition → manifest pipeline that splits an environment into N
balanced, low-overhead shards, each provable as an independent conditional proof.

Two new CLI verbs. The kernel runs once to profile; sharding is then pure,
cheap graph work, re-tunable for any N without re-running the kernel:

  ix profile <env.ixe>   →  env.ixesp   (per-block heartbeats + delta graph)
  ix shard   <env.ixesp> →  env.ixes    (N-shard manifest + what-if metrics)

Scope is measure + partition + manifest — it stops before proof generation.
No external graph-library dependency; the partitioner is self-contained.

How it works

Phase A — out-of-circuit profiler (src/ix/profile.rs, kernel instrumentation).
Runs the Ix kernel over the .ixe, recording per checked constant its heartbeats
(recursive fuel) and the set of constants whose bodies it delta-unfolds. The
delta-unfold graph — not the static reference graph — is the real cross-shard
cost: a shard must ingress the body of any foreign block its members unfold.
Capture is gated behind an optional KEnv::profile_sink (zero overhead when off);
per-constant cache isolation makes recording sound (no unfold skipped by a
cross-constant memo) and faithful to in-circuit, un-memoized cost. Output is the
.ixesp block-profile (heartbeats + serialized size + CSR delta graph), with
explicit little-endian (de)serialization (no serde dep).

Phase B — partitioner (src/ix/shard.rs, pure).
Weighted hypergraph partitioning under the connectivity-1 (km1) metric: vertices
= blocks weighted by heartbeats (balance); nets = blocks weighted by serialized
size (ingress cost); objective = total cross-shard ingress bytes. Solved by
recursive balanced bisection with cut-net splitting and rayon parallelism across
subtrees. Each bisection is a multilevel V-cycle: coarsen the sub-hypergraph
by heavy-edge matching down to ~256 supervertices, decide the global cut there
(greedy graph-growing + FM, several seeds), then uncoarsen, boundary-refining
with incremental-gain FM at each level. Leaf-count allocation guarantees every
shard is non-empty (any N, not just powers of two); a balance-weight cap keeps
splits even while accepting the unavoidable imbalance of atomic mutual blocks.

Phase C — manifest (src/ix/shard.rs).
The .ixes per-shard manifest: member blocks, heartbeat/own-size sums, foreign
(delta-dependency) block sets, delta-based assumption-tree roots
(merkle_root_canonical), and a what-if summary (balance, total cross-shard
ingress, atomic-block floor).

Benchmarks

Profiling (one-time kernel run, isolated caches):

  init      87,844 blocks,  1.0M edges    18s
  lean     152,596 blocks,  1.2M edges    22s
  mathlib  631,006 blocks, 12.4M edges   318s

Partitioning (ix shard, balance ±5%; 0 empty shards, deterministic):

  env       n     time     imbalance   cross-shard ingress
  init      64    2.7s      5.8x         6.52M
  init      128   2.7s     11.5x         9.83M
  init      256   3.0s     23.0x        15.03M
  lean      64    3.4s      5.0x         7.42M
  lean      128   3.5s     10.0x        10.48M
  lean      256   3.6s     19.9x        15.49M
  mathlib   64   30.8s      1.5x        93.60M
  mathlib   128  31.4s      1.6x       131.84M
  mathlib   256  33.8s      2.0x       184.32M

The bigger the library, the better it shards: mathlib reaches 1.5–2.0x
heartbeat imbalance, while init/lean are floored by a single 592k-heartbeat
atomic block they're too small to dilute (the .ixes report flags this).

Multilevel coarsening (the second commit) makes mathlib partition ~3.2x
faster (99s → 31s at n=64) with ~half the cross-shard ingress
(185.6M → 93.6M) vs the initial flat greedy+FM bisection, and improves init/lean
by ~2x on both axes — global cut decisions are made once on a tiny coarse graph,
and only local boundary refinement ever touches the big graphs.

Testing

Unit tests in shard.rs (synthetic fixtures): cluster separation through the
full V-cycle, one-level contraction, cluster-cap matching, determinism, and the
non-empty guarantee under a dominant block. Recorder smoke tests in the kernel
(profile_sink_records_delta_edge_and_fuel). Full suite green (1045 lib tests).
End-to-end on init / lean / mathlib × {64, 128, 256}: 0 empty shards,
byte-for-byte reproducible manifests.

Add the measure -> partition -> manifest pipeline for splitting a compiled `.ixe` environment into balanced, low-overhead shards, so large environments (mathlib is ~631k blocks) can be proven in zero-knowledge as independent conditional proofs rather than one infeasible monolith. Two new CLI verbs; the kernel runs once to profile, then sharding is cheap to re-tune for any shard count: ix profile <env.ixe> -> env.ixesp (per-block heartbeats + delta graph) ix shard <env.ixesp> -> env.ixes (N-shard manifest + what-if metrics) Phase A -- out-of-circuit kernel profiler (gated; zero overhead when off): - Record, per checked constant, its heartbeats (recursive fuel) and the set of constants whose definition bodies it delta-unfolds. The *delta-unfold* graph, not the static reference graph, is the real cross-shard cost: a shard must ingress the body of any foreign block its members unfold. - KEnv gains an optional ProfileSink (per-worker accumulator). Capture sites are the delta-unfold commit points in whnf.rs and per-constant fuel accounting in tc.rs; begin_const markers in check.rs/inductive.rs attribute work to the right constant. Per-constant cache isolation makes recording sound (no unfolds skipped by cross-constant memo) and faithful to in-circuit, un-memoized cost. - src/ix/profile.rs: the `.ixesp` block-profile model (heartbeats + serialized size + CSR delta graph), the ProfileSink accumulator, and explicit little-endian (de)serialization (no serde dependency). Phase B -- partitioner (src/ix/shard.rs, pure): - Weighted hypergraph partitioning under the connectivity-1 (km1) metric: vertices = blocks weighted by heartbeats (balance), nets = blocks weighted by serialized size (ingress cost); objective = total cross-shard ingress bytes. - Recursive bisection via greedy graph-growing + Fiduccia-Mattheyses refinement, with cut-net splitting on recursion and a hub-net cap. Leaf-count allocation distributes the N-shard budget by heartbeat mass clamped to [1, side size], so any N works (not only powers of two) and every shard is non-empty when #blocks >= N. A balance-weight cap (heartbeats capped at total/N) keeps splits even while accepting the unavoidable imbalance from atomic mutual blocks. - Parallelized with rayon across independent subtrees (deterministic: identical result to serial), with live progress output so a slow run is never mistaken for a stuck one. Phase C -- manifest (src/ix/shard.rs): - `.ixes` per-shard manifest: member blocks, heartbeat/own-size sums, foreign (delta-dependency) block sets, delta-based assumption-tree roots (merkle_root_canonical), and a what-if summary (balance, total cross-shard ingress, atomic-block floor). FFI + CLI: - src/ffi/kernel.rs: profile_anon_ixe (runs the anon kernel over a `.ixe` with recording, maps constants to home blocks, writes `.ixesp`) and shard_esp (partition + manifest); rs_kernel_profile_anon / rs_shard_esp FFI. - Ix/Cli/{ProfileCmd,ShardCmd}.lean, Ix/KernelCheck.lean externs, Main wiring. No external graph-library dependency; the partitioner is self-contained. Kernel changes are gated behind KEnv::profile_sink, so production checking is unaffected. Validated end-to-end on initstd/lean/mathlib (64/128/256 shards, 0 empty shards). Follow-up: multilevel coarsening to make mathlib-scale partitions run in seconds and recover cut quality.

Replace the flat greedy+FM body of `bisect()` with a multilevel V-cycle (coarsen → partition the tiny coarsest graph → uncoarsen + refine), so large environments partition in a fraction of the time *and* at markedly lower cross-shard ingress. Self-contained to `src/ix/shard.rs`; recursion, leaf-count allocation, cut-net splitting, rayon parallelism, and the balance-weight cap are unchanged. Each bisection now: - coarsens the sub-hypergraph by heavy-edge matching (merge blocks that co-occur in small, heavy delta-nets) under a cluster-weight cap, down to ~256 supervertices; - decides the global cut once on that tiny graph (greedy graph-growing + FM to convergence, from several diverse seeds, keeping the lowest balanced cut); - uncoarsens, projecting the cut down and boundary-refining with FM at each level. Deciding global structure on the small graph and only ever refining locally on the big ones is what improves both axes at once. Benchmarks (profiled `.ixe` → partition; balance ±5%; flat baseline → multilevel; 0 empty shards, deterministic in every case): env n partition time cross-shard ingress init 64 5.0s → 2.7s 8.77M → 6.52M (-26%) init 128 6.0s → 2.7s 12.44M → 9.83M (-21%) init 256 6.0s → 3.0s 17.71M → 15.03M (-15%) lean 64 8.0s → 3.4s 9.59M → 7.42M (-23%) lean 128 8.0s → 3.5s 12.59M → 10.48M (-17%) lean 256 7.0s → 3.6s 18.73M → 15.49M (-17%) mathlib 64 99s → 30.8s 185.63M → 93.60M (-50%) mathlib 128 101s → 31.4s 233.58M →131.84M (-44%) mathlib 256 104s → 33.8s 294.85M →184.32M (-37%) mathlib (631k blocks, 12.4M delta edges) partitions ~3.2x faster with roughly half the cross-shard ingress; its heartbeat imbalance also drops (1.66x → 1.53x at n=64). The non-empty-shard guarantee and determinism hold throughout. Implementation notes — what real-env validation required beyond the textbook scheme: - Fallback pairing in matching: a vertex with no heavy-edge partner (delta-sparse or only in hub nets) is merged with the next unmatched vertex. They share no tracked net, so the merge is cut-neutral, but it keeps coarsening shrinking ~2x/pass instead of stalling far above the target (which had left an expensive initial partition on a ~51k-vertex graph). - Incremental FM gains: a move updates only the changed net's contribution to each co-pin (O(Σ pins of the moved vertex's nets)) instead of recomputing neighbour gains from scratch (O(Σ neighbour degrees)). Dominant speedup — refining the finest level dropped from ~11s to ~0.9s on init. - Degenerate-split rejection: graph-growing seeded at a light vertex can sweep an entire sub onto one side (cut 0) when one atomic block holds more than half the balance weight; the initial-partition selector now rejects empty-sided candidates and prefers balanced-then-min-cut, so no shard is left empty. - One boundary-refinement pass per uncoarsen level (measured 1.6–1.8x faster than two for only ~3–6% more ingress — a clear win given the margin). Removes the FM-skip heuristic (`FM_SKIP_VERTICES` / `FM_FULL_VERTICES`): full refinement is now affordable at every size because it only ever runs on already-good partitions. Adds unit tests for one-level contraction, cluster-cap matching, large-cluster separation through the full V-cycle, determinism, and the non-empty guarantee under a dominant block.

johnchandlerburnham added 2 commits June 8, 2026 04:11

johnchandlerburnham requested a review from arthurpaulino June 8, 2026 09:58

clippy & fmt

4def34c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Jcb/kernel sharding#436

Jcb/kernel sharding#436
johnchandlerburnham wants to merge 3 commits into
mainfrom
jcb/kernel-sharding

johnchandlerburnham commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

johnchandlerburnham commented Jun 8, 2026

Sharding: measure → partition → manifest for proving .ixe environments

What & why

How it works

Benchmarks

Testing

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Sharding: measure → partition → manifest for proving `.ixe` environments