Jcb/kernel sharding#436
Open
johnchandlerburnham wants to merge 3 commits into
Open
Conversation
Add the measure -> partition -> manifest pipeline for splitting a compiled
`.ixe` environment into balanced, low-overhead shards, so large environments
(mathlib is ~631k blocks) can be proven in zero-knowledge as independent
conditional proofs rather than one infeasible monolith.
Two new CLI verbs; the kernel runs once to profile, then sharding is cheap to
re-tune for any shard count:
ix profile <env.ixe> -> env.ixesp (per-block heartbeats + delta graph)
ix shard <env.ixesp> -> env.ixes (N-shard manifest + what-if metrics)
Phase A -- out-of-circuit kernel profiler (gated; zero overhead when off):
- Record, per checked constant, its heartbeats (recursive fuel) and the set of
constants whose definition bodies it delta-unfolds. The *delta-unfold* graph,
not the static reference graph, is the real cross-shard cost: a shard must
ingress the body of any foreign block its members unfold.
- KEnv gains an optional ProfileSink (per-worker accumulator). Capture sites are
the delta-unfold commit points in whnf.rs and per-constant fuel accounting in
tc.rs; begin_const markers in check.rs/inductive.rs attribute work to the
right constant. Per-constant cache isolation makes recording sound (no unfolds
skipped by cross-constant memo) and faithful to in-circuit, un-memoized cost.
- src/ix/profile.rs: the `.ixesp` block-profile model (heartbeats + serialized
size + CSR delta graph), the ProfileSink accumulator, and explicit
little-endian (de)serialization (no serde dependency).
Phase B -- partitioner (src/ix/shard.rs, pure):
- Weighted hypergraph partitioning under the connectivity-1 (km1) metric:
vertices = blocks weighted by heartbeats (balance), nets = blocks weighted by
serialized size (ingress cost); objective = total cross-shard ingress bytes.
- Recursive bisection via greedy graph-growing + Fiduccia-Mattheyses refinement,
with cut-net splitting on recursion and a hub-net cap. Leaf-count allocation
distributes the N-shard budget by heartbeat mass clamped to [1, side size], so
any N works (not only powers of two) and every shard is non-empty when
#blocks >= N. A balance-weight cap (heartbeats capped at total/N) keeps splits
even while accepting the unavoidable imbalance from atomic mutual blocks.
- Parallelized with rayon across independent subtrees (deterministic: identical
result to serial), with live progress output so a slow run is never mistaken
for a stuck one.
Phase C -- manifest (src/ix/shard.rs):
- `.ixes` per-shard manifest: member blocks, heartbeat/own-size sums, foreign
(delta-dependency) block sets, delta-based assumption-tree roots
(merkle_root_canonical), and a what-if summary (balance, total cross-shard
ingress, atomic-block floor).
FFI + CLI:
- src/ffi/kernel.rs: profile_anon_ixe (runs the anon kernel over a `.ixe` with
recording, maps constants to home blocks, writes `.ixesp`) and shard_esp
(partition + manifest); rs_kernel_profile_anon / rs_shard_esp FFI.
- Ix/Cli/{ProfileCmd,ShardCmd}.lean, Ix/KernelCheck.lean externs, Main wiring.
No external graph-library dependency; the partitioner is self-contained. Kernel
changes are gated behind KEnv::profile_sink, so production checking is
unaffected. Validated end-to-end on initstd/lean/mathlib (64/128/256 shards,
0 empty shards). Follow-up: multilevel coarsening to make mathlib-scale
partitions run in seconds and recover cut quality.
Replace the flat greedy+FM body of `bisect()` with a multilevel V-cycle (coarsen → partition the tiny coarsest graph → uncoarsen + refine), so large environments partition in a fraction of the time *and* at markedly lower cross-shard ingress. Self-contained to `src/ix/shard.rs`; recursion, leaf-count allocation, cut-net splitting, rayon parallelism, and the balance-weight cap are unchanged. Each bisection now: - coarsens the sub-hypergraph by heavy-edge matching (merge blocks that co-occur in small, heavy delta-nets) under a cluster-weight cap, down to ~256 supervertices; - decides the global cut once on that tiny graph (greedy graph-growing + FM to convergence, from several diverse seeds, keeping the lowest balanced cut); - uncoarsens, projecting the cut down and boundary-refining with FM at each level. Deciding global structure on the small graph and only ever refining locally on the big ones is what improves both axes at once. Benchmarks (profiled `.ixe` → partition; balance ±5%; flat baseline → multilevel; 0 empty shards, deterministic in every case): env n partition time cross-shard ingress init 64 5.0s → 2.7s 8.77M → 6.52M (-26%) init 128 6.0s → 2.7s 12.44M → 9.83M (-21%) init 256 6.0s → 3.0s 17.71M → 15.03M (-15%) lean 64 8.0s → 3.4s 9.59M → 7.42M (-23%) lean 128 8.0s → 3.5s 12.59M → 10.48M (-17%) lean 256 7.0s → 3.6s 18.73M → 15.49M (-17%) mathlib 64 99s → 30.8s 185.63M → 93.60M (-50%) mathlib 128 101s → 31.4s 233.58M →131.84M (-44%) mathlib 256 104s → 33.8s 294.85M →184.32M (-37%) mathlib (631k blocks, 12.4M delta edges) partitions ~3.2x faster with roughly half the cross-shard ingress; its heartbeat imbalance also drops (1.66x → 1.53x at n=64). The non-empty-shard guarantee and determinism hold throughout. Implementation notes — what real-env validation required beyond the textbook scheme: - Fallback pairing in matching: a vertex with no heavy-edge partner (delta-sparse or only in hub nets) is merged with the next unmatched vertex. They share no tracked net, so the merge is cut-neutral, but it keeps coarsening shrinking ~2x/pass instead of stalling far above the target (which had left an expensive initial partition on a ~51k-vertex graph). - Incremental FM gains: a move updates only the changed net's contribution to each co-pin (O(Σ pins of the moved vertex's nets)) instead of recomputing neighbour gains from scratch (O(Σ neighbour degrees)). Dominant speedup — refining the finest level dropped from ~11s to ~0.9s on init. - Degenerate-split rejection: graph-growing seeded at a light vertex can sweep an entire sub onto one side (cut 0) when one atomic block holds more than half the balance weight; the initial-partition selector now rejects empty-sided candidates and prefers balanced-then-min-cut, so no shard is left empty. - One boundary-refinement pass per uncoarsen level (measured 1.6–1.8x faster than two for only ~3–6% more ingress — a clear win given the margin). Removes the FM-skip heuristic (`FM_SKIP_VERTICES` / `FM_FULL_VERTICES`): full refinement is now affordable at every size because it only ever runs on already-good partitions. Adds unit tests for one-level contraction, cluster-cap matching, large-cluster separation through the full V-cycle, determinism, and the non-empty guarantee under a dominant block.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Sharding: measure → partition → manifest for proving
.ixeenvironmentsWhat & why
A compiled Lean environment is infeasible to prove in zero-knowledge as one
monolith (mathlib is ~631k blocks, 12.4M delta-unfold edges). This PR adds the
measure → partition → manifest pipeline that splits an environment into
Nbalanced, low-overhead shards, each provable as an independent conditional proof.
Two new CLI verbs. The kernel runs once to profile; sharding is then pure,
cheap graph work, re-tunable for any
Nwithout re-running the kernel:Scope is measure + partition + manifest — it stops before proof generation.
No external graph-library dependency; the partitioner is self-contained.
How it works
Phase A — out-of-circuit profiler (
src/ix/profile.rs, kernel instrumentation).Runs the Ix kernel over the
.ixe, recording per checked constant its heartbeats(recursive fuel) and the set of constants whose bodies it delta-unfolds. The
delta-unfold graph — not the static reference graph — is the real cross-shard
cost: a shard must ingress the body of any foreign block its members unfold.
Capture is gated behind an optional
KEnv::profile_sink(zero overhead when off);per-constant cache isolation makes recording sound (no unfold skipped by a
cross-constant memo) and faithful to in-circuit, un-memoized cost. Output is the
.ixespblock-profile (heartbeats + serialized size + CSR delta graph), withexplicit little-endian (de)serialization (no serde dep).
Phase B — partitioner (
src/ix/shard.rs, pure).Weighted hypergraph partitioning under the connectivity-1 (km1) metric: vertices
= blocks weighted by heartbeats (balance); nets = blocks weighted by serialized
size (ingress cost); objective = total cross-shard ingress bytes. Solved by
recursive balanced bisection with cut-net splitting and rayon parallelism across
subtrees. Each bisection is a multilevel V-cycle: coarsen the sub-hypergraph
by heavy-edge matching down to ~256 supervertices, decide the global cut there
(greedy graph-growing + FM, several seeds), then uncoarsen, boundary-refining
with incremental-gain FM at each level. Leaf-count allocation guarantees every
shard is non-empty (any
N, not just powers of two); a balance-weight cap keepssplits even while accepting the unavoidable imbalance of atomic mutual blocks.
Phase C — manifest (
src/ix/shard.rs).The
.ixesper-shard manifest: member blocks, heartbeat/own-size sums, foreign(delta-dependency) block sets, delta-based assumption-tree roots
(
merkle_root_canonical), and a what-if summary (balance, total cross-shardingress, atomic-block floor).
Benchmarks
Profiling (one-time kernel run, isolated caches):
Partitioning (
ix shard, balance ±5%; 0 empty shards, deterministic):The bigger the library, the better it shards: mathlib reaches 1.5–2.0x
heartbeat imbalance, while init/lean are floored by a single 592k-heartbeat
atomic block they're too small to dilute (the
.ixesreport flags this).Multilevel coarsening (the second commit) makes mathlib partition ~3.2x
faster (99s → 31s at n=64) with ~half the cross-shard ingress
(185.6M → 93.6M) vs the initial flat greedy+FM bisection, and improves init/lean
by ~2x on both axes — global cut decisions are made once on a tiny coarse graph,
and only local boundary refinement ever touches the big graphs.
Testing
shard.rs(synthetic fixtures): cluster separation through thefull V-cycle, one-level contraction, cluster-cap matching, determinism, and the
non-empty guarantee under a dominant block. Recorder smoke tests in the kernel
(
profile_sink_records_delta_edge_and_fuel). Full suite green (1045 lib tests).byte-for-byte reproducible manifests.