Reclaim hotblocks disk via DeleteFilesInRange + range deletes #79

Draft
mo4islona wants to merge 1 commit into master from sync-dataset-cleanup-flag
@mo4islona mo4islona commented Jun 25, 2026

Supersedes the sync_dataset_cleanup flag this PR originally carried (that flag only changed when the tombstone purge ran; it freed no space).

Problem

During the mainnet incident the hotblocks volume hit 100% and compaction couldn't reclaim space. Deletion in hotblocks is write-based: retention marks tables in DELETED_TABLES and the cleanup loop point-deletes every key in a WriteBatch — millions of tombstones that grow the DB and free space only once compaction rewrites the SSTs. At a full disk those tombstone writes fail, exactly when reclaim is needed.

Two-phase cleanup

Phase 1 — logical, snapshot-safe (Database::cleanup): one range tombstone per dead table (delete_range_cf) instead of millions of point deletes, then mark it reclaim-pending. Range tombstones respect snapshots → in-flight queries unaffected, no grace needed.
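To make the "one range tombstone per dead table" idea concrete, here is a minimal sketch of how the bounds for such a range delete could be derived. It assumes, without confirmation from this PR, that every key in the TABLES CF starts with the 16-byte big-endian TableId. `prefix_upper_bound` and `table_key_range` are hypothetical helpers, not functions from the codebase; the resulting half-open range is the kind of argument a `delete_range_cf` call takes.

```rust
/// Hypothetical stand-in for the real TableId: a UUIDv7 as 16 big-endian bytes.
type TableId = [u8; 16];

/// Smallest byte string strictly greater than every key that starts with
/// `prefix`, or None when the prefix is all 0xFF (no finite bound exists).
fn prefix_upper_bound(prefix: &[u8]) -> Option<Vec<u8>> {
    let mut end = prefix.to_vec();
    // Drop trailing 0xFF bytes, then increment the first byte that can grow.
    while let Some(last) = end.pop() {
        if last != 0xFF {
            end.push(last + 1);
            return Some(end);
        }
    }
    None
}

/// Half-open range [start, end) covering every key of one table
/// (assumption: keys are `TableId ++ suffix`).
fn table_key_range(id: &TableId) -> Option<(Vec<u8>, Vec<u8>)> {
    prefix_upper_bound(id).map(|end| (id.to_vec(), end))
}
```

A single range delete over `[start, end)` then covers the whole table regardless of how many keys it holds, which is what keeps the number of tombstones constant per table.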

Phase 2 — physical (Database::reclaim_disk_space(grace)): unlink whole SST files below the live watermark (min live TableId over all chunks + dirty tables) with DeleteFilesInRange. No writes / no scratch space → works at 100% disk. TableId is a time-ordered UUIDv7 that is never reused, so dead tables form a contiguous low key range.
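The watermark rule can be sketched as follows. This is an illustration under stated assumptions rather than the PR's code: `TableId` is modeled as a `u128` (a UUIDv7 read as a big-endian integer sorts by creation time), and `live_watermark` and `reclaimable` are hypothetical names. Treating "nothing is live" as "every deleted table is reclaimable" is also an assumption.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in: a UUIDv7 as u128, so numeric order follows creation time.
type TableId = u128;

/// Watermark = smallest TableId still referenced by a committed chunk or a
/// dirty-table marker. None means nothing is live.
fn live_watermark(
    chunk_tables: &BTreeSet<TableId>,
    dirty_tables: &BTreeSet<TableId>,
) -> Option<TableId> {
    chunk_tables.iter().chain(dirty_tables.iter()).copied().min()
}

/// Deleted tables that lie strictly below the watermark. Only these form a
/// contiguous low key range that is safe to hand to a whole-file unlink.
fn reclaimable(deleted: &BTreeSet<TableId>, watermark: Option<TableId>) -> Vec<TableId> {
    deleted
        .iter()
        .copied()
        .filter(|id| watermark.map_or(true, |w| *id < w))
        .collect()
}
```

With deleted tables `{1, 2, 5}`, chunk tables `{4, 6}` and a dirty table `3`, the watermark is `3` and only tables `1` and `2` qualify. Table `5` is dead but sits above a live table, so it has to wait for compaction. Because DeleteFilesInRange only drops SST files that lie entirely inside the given range, files straddling the watermark are left in place.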

DeleteFilesInRange ignores snapshots, so a per-table deletion timestamp in CF_DELETED_TABLES gates the unlink behind a grace period that must exceed the max query/snapshot lifetime (--reclaim-grace-secs, default 15m).
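The grace gate itself is a small predicate. The sketch below assumes the per-table deletion timestamp is a wall-clock time. `past_grace` is a hypothetical name, and treating a timestamp that lies in the future (clock skew) as "not yet reclaimable" is a conservative choice made for this example.

```rust
use std::time::{Duration, SystemTime};

/// True once `grace` has fully elapsed since the table was marked deleted.
/// A deletion timestamp later than `now` is treated as "not yet".
fn past_grace(deleted_at: SystemTime, now: SystemTime, grace: Duration) -> bool {
    match now.duration_since(deleted_at) {
        Ok(elapsed) => elapsed >= grace,
        Err(_) => false,
    }
}
```

With the default of 15 minutes (`Duration::from_secs(900)`), a table deleted 10 minutes ago is kept and one deleted 15 minutes ago is eligible. A zero grace makes every recorded deletion eligible immediately, which is why it is only acceptable when no reader can hold a snapshot.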

Also

  • TABLES CF: compact-on-deletion collector + 24h periodic compaction so compaction finds tombstone-heavy / boundary files.
  • hotblocks cleanup loop runs both phases each tick and backs off on error instead of busy-looping on failing writes; at startup, before serving, it reclaims the files of unconfigured datasets using a zero grace period (no readers exist yet).
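The loop behavior from the last bullet can be sketched as a single tick function. This is a hedged illustration only: `tick` and its parameters are hypothetical, and the doubling backoff with a cap is an assumption, since the PR only says the loop "backs off on error". Running Phase 2 regardless of Phase 1's outcome follows the later commit message on this PR.

```rust
use std::time::Duration;

/// Runs one cleanup tick and returns the delay before the next one.
/// `phase1` stands in for the logical cleanup, `phase2` for the file reclaim.
fn tick<E>(
    phase1: impl FnOnce() -> Result<(), E>,
    phase2: impl FnOnce() -> Result<(), E>,
    base: Duration,
    max: Duration,
    prev_delay: Duration,
) -> Duration {
    let r1 = phase1();
    // Phase 2 is attempted even if Phase 1 failed: it performs no writes,
    // so it can still free space when the disk is full.
    let r2 = phase2();
    if r1.is_ok() && r2.is_ok() {
        base
    } else {
        // Assumed policy: double the previous delay, clamped to [base, max].
        (prev_delay * 2).min(max).max(base)
    }
}
```

A caller would sleep for the returned duration and pass it back in as `prev_delay` on the next iteration, so repeated failures stretch the interval instead of retrying in a tight loop.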

Tests

crates/storage/tests/cleanup_reclaim.rs + a codec/migration unit test:

  • logical delete invisible to a pre-existing snapshot (MVCC);
  • physical reclaim only after grace, unlinks bottom-level files;
  • a live old table pins the watermark (documents the heterogeneous-retention limitation; live data is never unlinked);
  • idempotency / crash-safety;
  • end-to-end delete_dataset + startup-style reclaim;
  • value-codec + legacy empty-value migration.

Known limitations / open questions

  • Watermark pinning: the watermark is the global min live TableId, so any long-retention (None/Api) dataset with an old live table pins it and blocks file-reclaim of newer dead tables elsewhere (those fall back to range-tombstone + compaction, which don't work at a full disk). Worth quantifying in production.
  • Grace tuning: 15m default is conservative; should be ~2× the real max query lifetime. Ideally pair with read-side snapshot-age enforcement so the bound is guaranteed, not assumed.
  • Migration: pre-existing empty DELETED_TABLES values decode as "deleted long ago, logical-delete owed" and drain safely on the next cycle.
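The migration rule in the last bullet can be illustrated with a small value codec. The byte layout here (one stage byte followed by an 8-byte big-endian Unix timestamp) and all names are assumptions made for this sketch; the PR does not show the actual encoding. What it demonstrates is the stated rule: an empty legacy value decodes as "deleted long ago, logical delete still owed".

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Stage {
    /// Phase 1 has not yet written the range tombstone.
    LogicalDeleteOwed,
    /// Range tombstone written; waiting out the grace period before unlink.
    ReclaimPending,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct DeletedTable {
    deleted_at_secs: u64,
    stage: Stage,
}

/// Hypothetical layout: [stage: u8][deleted_at_secs: u64 big-endian].
fn encode(v: &DeletedTable) -> [u8; 9] {
    let mut out = [0u8; 9];
    out[0] = match v.stage {
        Stage::LogicalDeleteOwed => 0,
        Stage::ReclaimPending => 1,
    };
    out[1..].copy_from_slice(&v.deleted_at_secs.to_be_bytes());
    out
}

fn decode(bytes: &[u8]) -> Option<DeletedTable> {
    if bytes.is_empty() {
        // Legacy entry written before this change: timestamp 0 is always past
        // any grace period, and the range tombstone has not been written yet.
        return Some(DeletedTable { deleted_at_secs: 0, stage: Stage::LogicalDeleteOwed });
    }
    if bytes.len() != 9 {
        return None; // malformed value: let the caller skip it
    }
    let stage = match bytes[0] {
        0 => Stage::LogicalDeleteOwed,
        1 => Stage::ReclaimPending,
        _ => return None,
    };
    let mut ts = [0u8; 8];
    ts.copy_from_slice(&bytes[1..]);
    Some(DeletedTable { deleted_at_secs: u64::from_be_bytes(ts), stage })
}
```

Decoding the legacy value to the earliest stage means such entries pass through Phase 1 first, so their data is tombstoned before any bookkeeping is cleared.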

🤖 Generated with Claude Code

mo4islona added a commit that referenced this pull request Jun 25, 2026
…9, NET-798)

Replace the per-key tombstone purge with a two-phase table cleanup that can
actually reclaim disk space, including at a full disk.

Phase 1 (logical, snapshot-safe): cleanup() drops each deleted table with a
single range tombstone (OptimisticTransactionDB::delete_range_cf) instead of
millions of point deletes, then marks it reclaim-pending. Range tombstones
respect snapshots, so in-flight queries are unaffected.

Phase 2 (physical): reclaim_disk_space(grace) unlinks whole SST files below the
live watermark (min live TableId across all chunks + dirty tables) via
DeleteFilesInRange. It performs no writes and needs no scratch space, so it
makes progress even at 100% disk. A per-table deletion timestamp in
CF_DELETED_TABLES gates the unlink behind a grace period -- the file unlink
ignores snapshots, so grace must exceed the max query/snapshot lifetime.

Also:
- TABLES CF: compact-on-deletion collector + 24h periodic compaction so
  compaction finds tombstone-heavy / boundary files (NET-819).
- hotblocks: the cleanup loop runs both phases each tick and backs off on
  error instead of busy-looping failing writes; startup reclaims unconfigured
  datasets' files before serving with a zero grace, since no readers exist yet
  (NET-798). New --reclaim-grace-secs flag (default 15m).

Tests: logical-delete snapshot safety, physical reclaim after grace, watermark
pinning by a live table, idempotency, end-to-end delete_dataset + reclaim, and
a value-codec/migration unit test.

Supersedes the sync_dataset_cleanup flag (draft PR #79).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mo4islona mo4islona force-pushed the sync-dataset-cleanup-flag branch from 0e7ab30 to be14d62 Compare June 25, 2026 15:08
@mo4islona mo4islona changed the title Add sync_dataset_cleanup flag to defer table purge Reclaim hotblocks disk via DeleteFilesInRange + range deletes (NET-819, NET-798) Jun 25, 2026
@mo4islona mo4islona force-pushed the sync-dataset-cleanup-flag branch 2 times, most recently from b19920a to 7659b7c Compare June 25, 2026 15:22
@mo4islona mo4islona changed the title Reclaim hotblocks disk via DeleteFilesInRange + range deletes (NET-819, NET-798) Reclaim hotblocks disk via DeleteFilesInRange + range deletes Jun 25, 2026
@mo4islona mo4islona force-pushed the sync-dataset-cleanup-flag branch 3 times, most recently from fc98611 to 4736ea5 Compare June 25, 2026 17:22
Replace the per-key tombstone purge with a two-phase table cleanup that can
actually reclaim disk space, including at a full disk.

Phase 1 (logical, snapshot-safe): cleanup() drops each deleted table with a
single range tombstone (OptimisticTransactionDB::delete_range_cf) instead of
millions of point deletes, then marks it reclaim-pending. Range tombstones
respect snapshots, so in-flight queries are unaffected.

Phase 2 (physical): reclaim_disk_space(grace) unlinks whole SST files below the
live watermark (min live TableId across all chunks + dirty tables) via
DeleteFilesInRange. It performs no writes and needs no scratch space, so it
makes progress even at 100% disk. A per-table deletion timestamp in
CF_DELETED_TABLES gates the unlink behind a grace period -- the file unlink
ignores snapshots, so grace must exceed the max query/snapshot lifetime.
reclaim_disk_space only clears bookkeeping for tables Phase 1 has already
tombstoned, so a still-pending table is never forgotten with its data left
un-tombstoned.

Crash recovery: a DIRTY_TABLES marker with no committed chunk is an orphan left
by a build that died before commit; its id would otherwise pin the watermark
forever. purge_orphan_dirty_tables() drops such markers at startup, before any
ingest, range-tombstoning the orphaned data so it can be reclaimed.
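
A minimal sketch of the orphan rule described above, assuming both the dirty markers and the table ids referenced by committed chunks can be collected into sets at startup. `orphan_dirty_tables` is a hypothetical helper and `TableId` is modeled as a `u128`; the real purge_orphan_dirty_tables() also range-tombstones the orphaned data, which this sketch omits.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in: a UUIDv7 as u128.
type TableId = u128;

/// Dirty markers that no committed chunk references. Before any ingest has
/// started, such a marker can only come from a build that died before commit.
fn orphan_dirty_tables(
    dirty: &BTreeSet<TableId>,
    committed: &BTreeSet<TableId>,
) -> Vec<TableId> {
    dirty.difference(committed).copied().collect()
}
```

With dirty markers `{2, 9}` and committed tables `{9, 10}`, marker `2` is the orphan. Left in place it would hold the watermark at `2`; once it is dropped, the minimum live id becomes `9`.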

Also:
- TABLES CF: compact-on-deletion collector + 24h periodic compaction so
  compaction finds tombstone-heavy / boundary files.
- hotblocks: the cleanup loop runs both phases each tick -- Phase 2 runs even
  when Phase 1's writes fail, so a full disk still gets freed -- and backs off
  on error instead of busy-looping; startup purges orphan dirty markers and
  reclaims unconfigured datasets' files before any controller spawns, the one
  point a zero grace is safe (no ingest/compaction/query snapshot exists yet).
  New --reclaim-grace-secs flag (default 15m).
- Robustness: cleanup scans skip a malformed CF key instead of panicking (a
  panic would re-fire every tick and wedge cleanup).
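
The robustness point can be sketched as a scan that parses each key fallibly and skips what it cannot parse. The 16-byte big-endian key layout and the names `parse_table_id` and `scan_table_ids` are assumptions for illustration; real code would presumably log each skipped key rather than only count it.

```rust
use std::convert::TryFrom;

/// Hypothetical stand-in: a UUIDv7 as u128.
type TableId = u128;

/// Parses a CF key as a 16-byte big-endian TableId; None if the length is wrong.
fn parse_table_id(key: &[u8]) -> Option<TableId> {
    let bytes = <[u8; 16]>::try_from(key).ok()?;
    Some(u128::from_be_bytes(bytes))
}

/// Collects the ids of all well-formed keys and counts the malformed ones
/// instead of panicking on them.
fn scan_table_ids<'a>(keys: impl IntoIterator<Item = &'a [u8]>) -> (Vec<TableId>, usize) {
    let mut ids = Vec::new();
    let mut skipped = 0;
    for key in keys {
        match parse_table_id(key) {
            Some(id) => ids.push(id),
            None => skipped += 1,
        }
    }
    (ids, skipped)
}
```

Returning the skip count lets the caller report corruption once per tick while the rest of the scan still makes progress.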

Tests: a MockDB harness drives the lifecycle (commit / delete / cleanup /
reclaim / snapshot). Covers logical-delete snapshot safety; the Phase-2
grace/snapshot invariant (a pre-deletion reader still reads within grace and
the files survive, unlinked only past grace) plus the negative case (a
past-grace unlink under a live snapshot breaks the read); orphan-marker
watermark pinning; un-tombstoned bookkeeping survival; physical reclaim;
watermark pinning by a live table; idempotency; end-to-end delete_dataset +
reclaim; and a value-codec unit test.

Supersedes the sync_dataset_cleanup flag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@mo4islona mo4islona force-pushed the sync-dataset-cleanup-flag branch from 4736ea5 to 98eac3f Compare June 25, 2026 17:25