Reclaim hotblocks disk via DeleteFilesInRange + range deletes #79
Draft
mo4islona wants to merge 1 commit into
mo4islona added a commit that referenced this pull request Jun 25, 2026

…9, NET-798)

Replace the per-key tombstone purge with a two-phase table cleanup that can actually reclaim disk space, including at a full disk.

Phase 1 (logical, snapshot-safe): cleanup() drops each deleted table with a single range tombstone (OptimisticTransactionDB::delete_range_cf) instead of millions of point deletes, then marks it reclaim-pending. Range tombstones respect snapshots, so in-flight queries are unaffected.

Phase 2 (physical): reclaim_disk_space(grace) unlinks whole SST files below the live watermark (min live TableId across all chunks + dirty tables) via DeleteFilesInRange. It performs no writes and needs no scratch space, so it makes progress even at 100% disk. A per-table deletion timestamp in CF_DELETED_TABLES gates the unlink behind a grace period -- the file unlink ignores snapshots, so grace must exceed the max query/snapshot lifetime.

Also:
- TABLES CF: compact-on-deletion collector + 24h periodic compaction so compaction finds tombstone-heavy / boundary files (NET-819).
- hotblocks: the cleanup loop runs both phases each tick and backs off on error instead of busy-looping failing writes; startup reclaims unconfigured datasets' files before serving with a zero grace, since no readers exist yet (NET-798). New --reclaim-grace-secs flag (default 15m).

Tests: logical-delete snapshot safety, physical reclaim after grace, watermark pinning by a live table, idempotency, end-to-end delete_dataset + reclaim, and a value-codec/migration unit test.

Supersedes the sync_dataset_cleanup flag (draft PR #79).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the per-key tombstone purge with a two-phase table cleanup that can actually reclaim disk space, including at a full disk.

Phase 1 (logical, snapshot-safe): cleanup() drops each deleted table with a single range tombstone (OptimisticTransactionDB::delete_range_cf) instead of millions of point deletes, then marks it reclaim-pending. Range tombstones respect snapshots, so in-flight queries are unaffected.

Phase 2 (physical): reclaim_disk_space(grace) unlinks whole SST files below the live watermark (min live TableId across all chunks + dirty tables) via DeleteFilesInRange. It performs no writes and needs no scratch space, so it makes progress even at 100% disk. A per-table deletion timestamp in CF_DELETED_TABLES gates the unlink behind a grace period -- the file unlink ignores snapshots, so grace must exceed the max query/snapshot lifetime. reclaim_disk_space only clears bookkeeping for tables Phase 1 has already tombstoned, so a still-pending table is never forgotten with its data left un-tombstoned.

Crash recovery: a DIRTY_TABLES marker with no committed chunk is an orphan left by a build that died before commit; its id would otherwise pin the watermark forever. purge_orphan_dirty_tables() drops such markers at startup, before any ingest, range-tombstoning the orphaned data so it can be reclaimed.

Also:
- TABLES CF: compact-on-deletion collector + 24h periodic compaction so compaction finds tombstone-heavy / boundary files.
- hotblocks: the cleanup loop runs both phases each tick -- Phase 2 runs even when Phase 1's writes fail, so a full disk still gets freed -- and backs off on error instead of busy-looping; startup purges orphan dirty markers and reclaims unconfigured datasets' files before any controller spawns, the one point a zero grace is safe (no ingest/compaction/query snapshot exists yet). New --reclaim-grace-secs flag (default 15m).
- Robustness: cleanup scans skip a malformed CF key instead of panicking (a panic would re-fire every tick and wedge cleanup).

Tests: a MockDB harness drives the lifecycle (commit / delete / cleanup / reclaim / snapshot). Covers logical-delete snapshot safety; the Phase-2 grace/snapshot invariant (a pre-deletion reader still reads within grace and the files survive, unlinked only past grace) plus the negative case (a past-grace unlink under a live snapshot breaks the read); orphan-marker watermark pinning; un-tombstoned bookkeeping survival; physical reclaim; watermark pinning by a live table; idempotency; end-to-end delete_dataset + reclaim; and a value-codec unit test.

Supersedes the sync_dataset_cleanup flag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
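The loop behaviour described in this commit message (both phases every tick, Phase 2 still attempted when Phase 1's writes fail, back off on error) could look roughly like the sketch below. It is not the PR's code: the `Storage` trait, the `tick` function, the `FullDisk` mock, and the exponential backoff with a 64x cap are all assumptions made for illustration. The commit message only says that the loop "backs off on error".

```rust
use std::time::Duration;

/// Hypothetical stand-in for the storage handle; the real `Database` API may differ.
trait Storage {
    /// Phase 1: writes range tombstones, so it can fail when the disk is full.
    fn cleanup(&mut self) -> Result<(), String>;
    /// Phase 2: unlinks whole SST files without writing; returns tables reclaimed.
    fn reclaim_disk_space(&mut self, grace: Duration) -> Result<usize, String>;
}

/// Runs one tick and returns the delay before the next one.
/// Phase 2 is attempted even when Phase 1 failed. Any error doubles the delay
/// (assumed policy, capped at 64x `base`); a clean tick resets it to `base`.
fn tick<S: Storage>(db: &mut S, grace: Duration, base: Duration, failures: &mut u32) -> Duration {
    let phase1 = db.cleanup();
    let phase2 = db.reclaim_disk_space(grace);
    if phase1.is_ok() && phase2.is_ok() {
        *failures = 0;
    } else {
        *failures = (*failures + 1).min(6);
    }
    base * 2u32.pow(*failures)
}

/// Mock of a full disk: tombstone writes fail, file unlinks still succeed.
struct FullDisk {
    reclaim_calls: usize,
}

impl Storage for FullDisk {
    fn cleanup(&mut self) -> Result<(), String> {
        Err("no space left on device".to_string())
    }
    fn reclaim_disk_space(&mut self, _grace: Duration) -> Result<usize, String> {
        self.reclaim_calls += 1;
        Ok(1)
    }
}
```

With the `FullDisk` mock, every tick still reaches `reclaim_disk_space`, while the returned delay grows so the failing Phase 1 writes are not retried in a busy loop.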
Supersedes the sync_dataset_cleanup flag this PR originally carried (that flag only changed when the tombstone purge ran; it freed no space).

Problem

During the mainnet incident the hotblocks volume hit 100% and compaction couldn't reclaim space. Deletion in hotblocks is write-based: retention marks tables in DELETED_TABLES and the cleanup loop point-deletes every key in a WriteBatch, producing millions of tombstones that grow the DB and free space only once compaction rewrites the SSTs. At a full disk those tombstone writes fail, exactly when reclaim is needed.

Two-phase cleanup
Phase 1, logical and snapshot-safe (Database::cleanup): one range tombstone per dead table (delete_range_cf) instead of millions of point deletes, then mark it reclaim-pending. Range tombstones respect snapshots, so in-flight queries are unaffected and no grace period is needed.

Phase 2, physical (Database::reclaim_disk_space(grace)): unlink whole SST files below the live watermark (min live TableId over all chunks + dirty tables) with DeleteFilesInRange. It performs no writes and needs no scratch space, so it works at 100% disk. TableId is a time-ordered UUIDv7 that is never reused, so dead tables form a contiguous low key range. DeleteFilesInRange ignores snapshots, so a per-table deletion timestamp in CF_DELETED_TABLES gates the unlink behind a grace period that must exceed the max query/snapshot lifetime (--reclaim-grace-secs, default 15m).

Also

Tests

crates/storage/tests/cleanup_reclaim.rs + a codec/migration unit test:
… delete_dataset + startup-style reclaim;

Known limitations / open questions

… TableId, so any long-retention (None/Api) dataset with an old live table pins it and blocks file-reclaim of newer dead tables elsewhere (those fall back to range-tombstone + compaction, which don't work at a full disk). Worth quantifying in production.

… DELETED_TABLES values decode as "deleted long ago, logical-delete owed" and drain safely on the next cycle.

🤖 Generated with Claude Code
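To make the watermark pinning above concrete, here is a minimal sketch of how a Phase 2 eligibility check might combine the live watermark with the grace period. It is an illustration under stated assumptions, not the PR's implementation: `TableId` is modelled as a `u128` (the real type is a UUIDv7), `DeletedTable`, `live_watermark` and `reclaimable` are hypothetical names, and the choices "age >= grace counts as past grace" and "no live table means no pinning" are assumptions.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical stand-in for the real TableId (a time-ordered UUIDv7);
/// a u128 preserves the ordering that the watermark relies on.
type TableId = u128;

/// Hypothetical bookkeeping entry: a deleted table and when it was deleted.
struct DeletedTable {
    id: TableId,
    deleted_at: SystemTime,
}

/// Min live TableId over all chunks and dirty tables; None when nothing is live.
fn live_watermark(chunk_tables: &[TableId], dirty_tables: &[TableId]) -> Option<TableId> {
    chunk_tables.iter().chain(dirty_tables).copied().min()
}

/// Deleted tables whose files could be unlinked: strictly below the watermark
/// and deleted at least `grace` ago. A deletion timestamp in the future is
/// treated as "not yet eligible".
fn reclaimable(
    deleted: &[DeletedTable],
    watermark: Option<TableId>,
    grace: Duration,
    now: SystemTime,
) -> Vec<TableId> {
    deleted
        .iter()
        .filter(|t| watermark.map_or(true, |w| t.id < w))
        .filter(|t| now.duration_since(t.deleted_at).map_or(false, |age| age >= grace))
        .map(|t| t.id)
        .collect()
}
```

In this model a single old live table (say id 1) lowers the watermark below every dead table, so `reclaimable` returns nothing no matter how long ago the tables were deleted. That is the pinning effect the limitation describes; those tables are then left to range tombstones plus compaction.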