Skip to content

Jcb/kernel sharding#436

Open
johnchandlerburnham wants to merge 3 commits into
mainfrom
jcb/kernel-sharding
Open

Jcb/kernel sharding#436
johnchandlerburnham wants to merge 3 commits into
mainfrom
jcb/kernel-sharding

Conversation

@johnchandlerburnham

Copy link
Copy Markdown
Member

Sharding: measure → partition → manifest for proving .ixe environments

What & why

A compiled Lean environment is infeasible to prove in zero-knowledge as one
monolith (mathlib is ~631k blocks, 12.4M delta-unfold edges). This PR adds the
measure → partition → manifest pipeline that splits an environment into N
balanced, low-overhead shards, each provable as an independent conditional proof.

Two new CLI verbs. The kernel runs once to profile; sharding is then pure,
cheap graph work, re-tunable for any N without re-running the kernel:

  ix profile <env.ixe>   →  env.ixesp   (per-block heartbeats + delta graph)
  ix shard   <env.ixesp> →  env.ixes    (N-shard manifest + what-if metrics)

Scope is measure + partition + manifest — it stops before proof generation.
No external graph-library dependency; the partitioner is self-contained.

How it works

Phase A — out-of-circuit profiler (src/ix/profile.rs, kernel instrumentation).
Runs the Ix kernel over the .ixe, recording per checked constant its heartbeats
(recursive fuel) and the set of constants whose bodies it delta-unfolds. The
delta-unfold graph — not the static reference graph — is the real cross-shard
cost: a shard must ingress the body of any foreign block its members unfold.
Capture is gated behind an optional KEnv::profile_sink (zero overhead when off);
per-constant cache isolation makes recording sound (no unfold skipped by a
cross-constant memo) and faithful to in-circuit, un-memoized cost. Output is the
.ixesp block-profile (heartbeats + serialized size + CSR delta graph), with
explicit little-endian (de)serialization (no serde dep).

Phase B — partitioner (src/ix/shard.rs, pure).
Weighted hypergraph partitioning under the connectivity-1 (km1) metric: vertices
= blocks weighted by heartbeats (balance); nets = blocks weighted by serialized
size (ingress cost); objective = total cross-shard ingress bytes. Solved by
recursive balanced bisection with cut-net splitting and rayon parallelism across
subtrees. Each bisection is a multilevel V-cycle: coarsen the sub-hypergraph
by heavy-edge matching down to ~256 supervertices, decide the global cut there
(greedy graph-growing + FM, several seeds), then uncoarsen, boundary-refining
with incremental-gain FM at each level. Leaf-count allocation guarantees every
shard is non-empty (any N, not just powers of two); a balance-weight cap keeps
splits even while accepting the unavoidable imbalance of atomic mutual blocks.

Phase C — manifest (src/ix/shard.rs).
The .ixes per-shard manifest: member blocks, heartbeat/own-size sums, foreign
(delta-dependency) block sets, delta-based assumption-tree roots
(merkle_root_canonical), and a what-if summary (balance, total cross-shard
ingress, atomic-block floor).

Benchmarks

Profiling (one-time kernel run, isolated caches):

  init      87,844 blocks,  1.0M edges    18s
  lean     152,596 blocks,  1.2M edges    22s
  mathlib  631,006 blocks, 12.4M edges   318s

Partitioning (ix shard, balance ±5%; 0 empty shards, deterministic):

  env       n     time     imbalance   cross-shard ingress
  init      64    2.7s      5.8x         6.52M
  init      128   2.7s     11.5x         9.83M
  init      256   3.0s     23.0x        15.03M
  lean      64    3.4s      5.0x         7.42M
  lean      128   3.5s     10.0x        10.48M
  lean      256   3.6s     19.9x        15.49M
  mathlib   64   30.8s      1.5x        93.60M
  mathlib   128  31.4s      1.6x       131.84M
  mathlib   256  33.8s      2.0x       184.32M

The bigger the library, the better it shards: mathlib reaches 1.5–2.0x
heartbeat imbalance, while init/lean are floored by a single 592k-heartbeat
atomic block they're too small to dilute (the .ixes report flags this).

Multilevel coarsening (the second commit) makes mathlib partition ~3.2x
faster
(99s → 31s at n=64) with ~half the cross-shard ingress
(185.6M → 93.6M) vs the initial flat greedy+FM bisection, and improves init/lean
by ~2x on both axes — global cut decisions are made once on a tiny coarse graph,
and only local boundary refinement ever touches the big graphs.

Testing

  • Unit tests in shard.rs (synthetic fixtures): cluster separation through the
    full V-cycle, one-level contraction, cluster-cap matching, determinism, and the
    non-empty guarantee under a dominant block. Recorder smoke tests in the kernel
    (profile_sink_records_delta_edge_and_fuel). Full suite green (1045 lib tests).
  • End-to-end on init / lean / mathlib × {64, 128, 256}: 0 empty shards,
    byte-for-byte reproducible manifests.

Add the measure -> partition -> manifest pipeline for splitting a compiled
`.ixe` environment into balanced, low-overhead shards, so large environments
(mathlib is ~631k blocks) can be proven in zero-knowledge as independent
conditional proofs rather than one infeasible monolith.

Two new CLI verbs; the kernel runs once to profile, then sharding is cheap to
re-tune for any shard count:

  ix profile <env.ixe>   ->  env.ixesp   (per-block heartbeats + delta graph)
  ix shard   <env.ixesp> ->  env.ixes    (N-shard manifest + what-if metrics)

Phase A -- out-of-circuit kernel profiler (gated; zero overhead when off):
- Record, per checked constant, its heartbeats (recursive fuel) and the set of
  constants whose definition bodies it delta-unfolds. The *delta-unfold* graph,
  not the static reference graph, is the real cross-shard cost: a shard must
  ingress the body of any foreign block its members unfold.
- KEnv gains an optional ProfileSink (per-worker accumulator). Capture sites are
  the delta-unfold commit points in whnf.rs and per-constant fuel accounting in
  tc.rs; begin_const markers in check.rs/inductive.rs attribute work to the
  right constant. Per-constant cache isolation makes recording sound (no unfolds
  skipped by cross-constant memo) and faithful to in-circuit, un-memoized cost.
- src/ix/profile.rs: the `.ixesp` block-profile model (heartbeats + serialized
  size + CSR delta graph), the ProfileSink accumulator, and explicit
  little-endian (de)serialization (no serde dependency).

Phase B -- partitioner (src/ix/shard.rs, pure):
- Weighted hypergraph partitioning under the connectivity-1 (km1) metric:
  vertices = blocks weighted by heartbeats (balance), nets = blocks weighted by
  serialized size (ingress cost); objective = total cross-shard ingress bytes.
- Recursive bisection via greedy graph-growing + Fiduccia-Mattheyses refinement,
  with cut-net splitting on recursion and a hub-net cap. Leaf-count allocation
  distributes the N-shard budget by heartbeat mass clamped to [1, side size], so
  any N works (not only powers of two) and every shard is non-empty when
  #blocks >= N. A balance-weight cap (heartbeats capped at total/N) keeps splits
  even while accepting the unavoidable imbalance from atomic mutual blocks.
- Parallelized with rayon across independent subtrees (deterministic: identical
  result to serial), with live progress output so a slow run is never mistaken
  for a stuck one.

Phase C -- manifest (src/ix/shard.rs):
- `.ixes` per-shard manifest: member blocks, heartbeat/own-size sums, foreign
  (delta-dependency) block sets, delta-based assumption-tree roots
  (merkle_root_canonical), and a what-if summary (balance, total cross-shard
  ingress, atomic-block floor).

FFI + CLI:
- src/ffi/kernel.rs: profile_anon_ixe (runs the anon kernel over a `.ixe` with
  recording, maps constants to home blocks, writes `.ixesp`) and shard_esp
  (partition + manifest); rs_kernel_profile_anon / rs_shard_esp FFI.
- Ix/Cli/{ProfileCmd,ShardCmd}.lean, Ix/KernelCheck.lean externs, Main wiring.

No external graph-library dependency; the partitioner is self-contained. Kernel
changes are gated behind KEnv::profile_sink, so production checking is
unaffected. Validated end-to-end on initstd/lean/mathlib (64/128/256 shards,
0 empty shards). Follow-up: multilevel coarsening to make mathlib-scale
partitions run in seconds and recover cut quality.
Replace the flat greedy+FM body of `bisect()` with a multilevel V-cycle
(coarsen → partition the tiny coarsest graph → uncoarsen + refine), so large
environments partition in a fraction of the time *and* at markedly lower
cross-shard ingress. Self-contained to `src/ix/shard.rs`; recursion, leaf-count
allocation, cut-net splitting, rayon parallelism, and the balance-weight cap are
unchanged.

Each bisection now:
- coarsens the sub-hypergraph by heavy-edge matching (merge blocks that co-occur
  in small, heavy delta-nets) under a cluster-weight cap, down to ~256
  supervertices;
- decides the global cut once on that tiny graph (greedy graph-growing + FM to
  convergence, from several diverse seeds, keeping the lowest balanced cut);
- uncoarsens, projecting the cut down and boundary-refining with FM at each
  level.

Deciding global structure on the small graph and only ever refining locally on
the big ones is what improves both axes at once.

Benchmarks (profiled `.ixe` → partition; balance ±5%; flat baseline → multilevel;
0 empty shards, deterministic in every case):

  env       n     partition time     cross-shard ingress
  init      64     5.0s →  2.7s        8.77M →  6.52M   (-26%)
  init      128    6.0s →  2.7s       12.44M →  9.83M   (-21%)
  init      256    6.0s →  3.0s       17.71M → 15.03M   (-15%)
  lean      64     8.0s →  3.4s        9.59M →  7.42M   (-23%)
  lean      128    8.0s →  3.5s       12.59M → 10.48M   (-17%)
  lean      256    7.0s →  3.6s       18.73M → 15.49M   (-17%)
  mathlib   64      99s → 30.8s      185.63M → 93.60M   (-50%)
  mathlib   128     101s → 31.4s     233.58M →131.84M   (-44%)
  mathlib   256     104s → 33.8s     294.85M →184.32M   (-37%)

mathlib (631k blocks, 12.4M delta edges) partitions ~3.2x faster with roughly
half the cross-shard ingress; its heartbeat imbalance also drops (1.66x → 1.53x
at n=64). The non-empty-shard guarantee and determinism hold throughout.

Implementation notes — what real-env validation required beyond the textbook
scheme:

- Fallback pairing in matching: a vertex with no heavy-edge partner (delta-sparse
  or only in hub nets) is merged with the next unmatched vertex. They share no
  tracked net, so the merge is cut-neutral, but it keeps coarsening shrinking
  ~2x/pass instead of stalling far above the target (which had left an expensive
  initial partition on a ~51k-vertex graph).
- Incremental FM gains: a move updates only the changed net's contribution to
  each co-pin (O(Σ pins of the moved vertex's nets)) instead of recomputing
  neighbour gains from scratch (O(Σ neighbour degrees)). Dominant speedup —
  refining the finest level dropped from ~11s to ~0.9s on init.
- Degenerate-split rejection: graph-growing seeded at a light vertex can sweep an
  entire sub onto one side (cut 0) when one atomic block holds more than half the
  balance weight; the initial-partition selector now rejects empty-sided
  candidates and prefers balanced-then-min-cut, so no shard is left empty.
- One boundary-refinement pass per uncoarsen level (measured 1.6–1.8x faster than
  two for only ~3–6% more ingress — a clear win given the margin).

Removes the FM-skip heuristic (`FM_SKIP_VERTICES` / `FM_FULL_VERTICES`): full
refinement is now affordable at every size because it only ever runs on
already-good partitions. Adds unit tests for one-level contraction, cluster-cap
matching, large-cluster separation through the full V-cycle, determinism, and
the non-empty guarantee under a dominant block.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant