Reclaim hotblocks disk via DeleteFilesInRange + range deletes by mo4islona · Pull Request #79 · subsquid/data

mo4islona · 2026-06-25T11:48:03Z

Supersedes the sync_dataset_cleanup flag this PR originally carried (that flag only changed when the tombstone purge ran; it freed no space).

Problem

During the mainnet incident the hotblocks volume hit 100% and compaction couldn't reclaim space. Deletion in hotblocks is write-based: retention marks tables in DELETED_TABLES and the cleanup loop point-deletes every key in a WriteBatch — millions of tombstones that grow the DB and free space only once compaction rewrites the SSTs. At a full disk those tombstone writes fail, exactly when reclaim is needed.

Two-phase cleanup

Phase 1 — logical, snapshot-safe (Database::cleanup): one range tombstone per dead table (delete_range_cf) instead of millions of point deletes, then mark it reclaim-pending. Range tombstones respect snapshots → in-flight queries unaffected, no grace needed.
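As an illustration of "one range tombstone per dead table", the sketch below derives the half-open key bounds that such a range delete could cover. It assumes, purely for illustration, that every key belonging to a table starts with the 16-byte big-endian TableId; the PR text does not state the key layout. `TableId`, `table_key_range`, and `in_range` are hypothetical names, and the actual delete_range_cf call is not shown.

```rust
/// Hypothetical stand-in for the real TableId: a UUIDv7 viewed as a big-endian u128.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct TableId(pub u128);

/// Half-open key range [start, end) covering every key prefixed by `id`.
/// ASSUMPTION: keys are laid out as `table_id (16 bytes, big-endian) || row key`.
/// `end` is None when `id` is u128::MAX, because no 16-byte exclusive upper
/// bound exists in that case.
pub fn table_key_range(id: TableId) -> ([u8; 16], Option<[u8; 16]>) {
    let start = id.0.to_be_bytes();
    let end = id.0.checked_add(1).map(u128::to_be_bytes);
    (start, end)
}

/// True when `key` falls inside [start, end) under lexicographic byte order,
/// which is the order RocksDB uses with its default comparator.
pub fn in_range(key: &[u8], start: &[u8; 16], end: &Option<[u8; 16]>) -> bool {
    key >= &start[..] && end.as_ref().map_or(true, |e| key < &e[..])
}
```

Under this assumed layout a single pair of bounds covers every row of the table, which is why one range tombstone can replace millions of point deletes.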

Phase 2 — physical (Database::reclaim_disk_space(grace)): unlink whole SST files below the live watermark (min live TableId over all chunks + dirty tables) with DeleteFilesInRange. No writes / no scratch space → works at 100% disk. TableId is a time-ordered UUIDv7 that is never reused, so dead tables form a contiguous low key range.
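The watermark rule can be sketched in a few lines. This is a simplified model, not the code in this PR: `TableId` is modelled as a plain `u128`, and `live_watermark` and `reclaimable_below` are hypothetical names. Treating "nothing is live" as "everything dead is reclaimable" is also an assumption of the sketch.

```rust
use std::collections::BTreeSet;

/// Stand-in for the real TableId (a time-ordered UUIDv7), modelled as u128.
pub type TableId = u128;

/// Smallest id still referenced by any chunk or dirty-table marker.
/// Returns None when nothing is live.
pub fn live_watermark(chunk_tables: &[TableId], dirty_tables: &[TableId]) -> Option<TableId> {
    chunk_tables.iter().chain(dirty_tables).copied().min()
}

/// Deleted tables whose ids sort strictly below the watermark.
/// ASSUMPTION: with no live table at all, every deleted table qualifies.
pub fn reclaimable_below(deleted: &BTreeSet<TableId>, watermark: Option<TableId>) -> Vec<TableId> {
    match watermark {
        Some(w) => deleted.range(..w).copied().collect(),
        None => deleted.iter().copied().collect(),
    }
}
```

With deleted ids {1, 2, 5, 9} and live ids {6, 7, 8}, the watermark is 6 and tables 1, 2, and 5 qualify. Table 9 is dead but sits above the watermark, so it has to wait for compaction. A single live table with id 0 would pin the watermark and leave nothing reclaimable.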

DeleteFilesInRange ignores snapshots, so a per-table deletion timestamp in CF_DELETED_TABLES gates the unlink behind a grace period that must exceed the max query/snapshot lifetime (--reclaim-grace-secs, default 15m).
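A minimal sketch of the grace gate follows. It assumes the deletion timestamp is stored as seconds since the Unix epoch and that a table becomes eligible once the elapsed time reaches the grace period; neither detail is stated in the PR text, and `past_grace` is a hypothetical name.

```rust
use std::time::Duration;

/// True when a table deleted at `deleted_at_secs` may have its files unlinked
/// at `now_secs`. Both values are seconds since the Unix epoch (ASSUMPTION).
/// A clock that appears to run backwards yields an elapsed time of zero, so
/// the table is held back rather than unlinked early.
pub fn past_grace(deleted_at_secs: u64, now_secs: u64, grace: Duration) -> bool {
    now_secs.saturating_sub(deleted_at_secs) >= grace.as_secs()
}
```

With the default 15m grace, a table deleted at t = 1000 s is held until t = 1900 s. A zero grace makes every deleted table eligible immediately, which matches the startup case where no readers exist yet.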

Also

- TABLES CF: compact-on-deletion collector + 24h periodic compaction so compaction finds tombstone-heavy / boundary files.
- hotblocks: the cleanup loop runs both phases each tick and backs off on error instead of busy-looping on failing writes. At startup, before serving, it reclaims unconfigured datasets' files with a zero grace period (no readers exist yet).
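The tick behaviour can be sketched as follows. The source only says the loop "backs off on error"; the exponential, capped schedule below is an assumption, and `run_tick` and `next_delay` are hypothetical names. Running Phase 2 even when Phase 1 fails follows the later commit message in this PR.

```rust
use std::time::Duration;

/// Runs both phases in order. Phase 2 always runs, even if Phase 1 failed,
/// so a full disk can still be freed. Returns the first error encountered.
pub fn run_tick<E>(
    phase1: impl FnOnce() -> Result<(), E>,
    phase2: impl FnOnce() -> Result<(), E>,
) -> Result<(), E> {
    let logical = phase1();
    let physical = phase2();
    logical.and(physical)
}

/// Delay before the next tick after `consecutive_failures` failed ticks.
/// ASSUMPTION: doubling backoff starting at `base`, capped at `max`.
pub fn next_delay(base: Duration, max: Duration, consecutive_failures: u32) -> Duration {
    let factor = 1u32.checked_shl(consecutive_failures).unwrap_or(u32::MAX);
    base.saturating_mul(factor).min(max)
}
```

A caller would reset the failure counter after a successful tick and sleep for `next_delay(...)` otherwise, so a persistently failing write is retried at a bounded rate rather than in a tight loop.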

Tests

crates/storage/tests/cleanup_reclaim.rs + a codec/migration unit test:

- logical delete invisible to a pre-existing snapshot (MVCC);
- physical reclaim only after grace, unlinks bottom-level files;
- a live old table pins the watermark (documents the heterogeneous-retention limitation; live data is never unlinked);
- idempotency / crash-safety;
- end-to-end delete_dataset + startup-style reclaim;
- value-codec + legacy empty-value migration.

Known limitations / open questions

- Watermark pinning: the watermark is the global min live TableId, so any long-retention (None/Api) dataset with an old live table pins it and blocks file reclaim of newer dead tables elsewhere. Those tables fall back to range tombstone + compaction, which does not work at a full disk. Worth quantifying in production.
- Grace tuning: the 15m default is conservative; it should be ~2× the real max query lifetime. Ideally pair it with read-side snapshot-age enforcement so the bound is guaranteed, not assumed.
- Migration: pre-existing empty DELETED_TABLES values decode as "deleted long ago, logical-delete owed" and drain safely on the next cycle.
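The migration rule can be illustrated with a small value codec. The byte layout here (8-byte big-endian timestamp plus a one-byte flag) is invented for the sketch and is not taken from the PR; only the treatment of an empty legacy value follows the description above. `DeletedTable`, `encode`, and `decode` are hypothetical names.

```rust
/// Hypothetical decoded form of a DELETED_TABLES value.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DeletedTable {
    /// Deletion time in seconds since the Unix epoch (ASSUMPTION).
    pub deleted_at_secs: u64,
    /// Whether Phase 1 has already written the range tombstone.
    pub tombstoned: bool,
}

/// ASSUMED layout: 8-byte big-endian timestamp followed by a 0/1 flag byte.
pub fn encode(entry: DeletedTable) -> [u8; 9] {
    let mut out = [0u8; 9];
    out[..8].copy_from_slice(&entry.deleted_at_secs.to_be_bytes());
    out[8] = entry.tombstoned as u8;
    out
}

/// Decodes a value. An empty legacy value reads as "deleted long ago,
/// logical delete still owed". Any other unexpected shape yields None.
pub fn decode(value: &[u8]) -> Option<DeletedTable> {
    match value {
        [] => Some(DeletedTable { deleted_at_secs: 0, tombstoned: false }),
        [ts @ .., flag] if ts.len() == 8 && *flag <= 1 => {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(ts);
            Some(DeletedTable { deleted_at_secs: u64::from_be_bytes(buf), tombstoned: *flag == 1 })
        }
        _ => None,
    }
}
```

Because a legacy entry decodes with a timestamp of zero, it is already past any grace period, and because it is marked as not yet tombstoned, the next cycle still performs the logical delete before any file unlink.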

🤖 Generated with Claude Code

…9, NET-798)

Replace the per-key tombstone purge with a two-phase table cleanup that can actually reclaim disk space, including at a full disk.

Phase 1 (logical, snapshot-safe): cleanup() drops each deleted table with a single range tombstone (OptimisticTransactionDB::delete_range_cf) instead of millions of point deletes, then marks it reclaim-pending. Range tombstones respect snapshots, so in-flight queries are unaffected.

Phase 2 (physical): reclaim_disk_space(grace) unlinks whole SST files below the live watermark (min live TableId across all chunks + dirty tables) via DeleteFilesInRange. It performs no writes and needs no scratch space, so it makes progress even at 100% disk. A per-table deletion timestamp in CF_DELETED_TABLES gates the unlink behind a grace period -- the file unlink ignores snapshots, so grace must exceed the max query/snapshot lifetime.

Also:
- TABLES CF: compact-on-deletion collector + 24h periodic compaction so compaction finds tombstone-heavy / boundary files (NET-819).
- hotblocks: the cleanup loop runs both phases each tick and backs off on error instead of busy-looping failing writes; startup reclaims unconfigured datasets' files before serving with a zero grace, since no readers exist yet (NET-798). New --reclaim-grace-secs flag (default 15m).

Tests: logical-delete snapshot safety, physical reclaim after grace, watermark pinning by a live table, idempotency, end-to-end delete_dataset + reclaim, and a value-codec/migration unit test.

Supersedes the sync_dataset_cleanup flag (draft PR #79).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Replace the per-key tombstone purge with a two-phase table cleanup that can actually reclaim disk space, including at a full disk.

Phase 1 (logical, snapshot-safe): cleanup() drops each deleted table with a single range tombstone (OptimisticTransactionDB::delete_range_cf) instead of millions of point deletes, then marks it reclaim-pending. Range tombstones respect snapshots, so in-flight queries are unaffected.

Phase 2 (physical): reclaim_disk_space(grace) unlinks whole SST files below the live watermark (min live TableId across all chunks + dirty tables) via DeleteFilesInRange. It performs no writes and needs no scratch space, so it makes progress even at 100% disk. A per-table deletion timestamp in CF_DELETED_TABLES gates the unlink behind a grace period -- the file unlink ignores snapshots, so grace must exceed the max query/snapshot lifetime. reclaim_disk_space only clears bookkeeping for tables Phase 1 has already tombstoned, so a still-pending table is never forgotten with its data left un-tombstoned.

Crash recovery: a DIRTY_TABLES marker with no committed chunk is an orphan left by a build that died before commit; its id would otherwise pin the watermark forever. purge_orphan_dirty_tables() drops such markers at startup, before any ingest, range-tombstoning the orphaned data so it can be reclaimed.

Also:
- TABLES CF: compact-on-deletion collector + 24h periodic compaction so compaction finds tombstone-heavy / boundary files.
- hotblocks: the cleanup loop runs both phases each tick -- Phase 2 runs even when Phase 1's writes fail, so a full disk still gets freed -- and backs off on error instead of busy-looping; startup purges orphan dirty markers and reclaims unconfigured datasets' files before any controller spawns, the one point a zero grace is safe (no ingest/compaction/query snapshot exists yet). New --reclaim-grace-secs flag (default 15m).
- Robustness: cleanup scans skip a malformed CF key instead of panicking (a panic would re-fire every tick and wedge cleanup).

Tests: a MockDB harness drives the lifecycle (commit / delete / cleanup / reclaim / snapshot). Covers logical-delete snapshot safety; the Phase-2 grace/snapshot invariant (a pre-deletion reader still reads within grace and the files survive, unlinked only past grace) plus the negative case (a past-grace unlink under a live snapshot breaks the read); orphan-marker watermark pinning; un-tombstoned bookkeeping survival; physical reclaim; watermark pinning by a live table; idempotency; end-to-end delete_dataset + reclaim; and a value-codec unit test.

Supersedes the sync_dataset_cleanup flag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
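The robustness point about malformed keys can be sketched as a scan that counts and skips bad keys instead of unwrapping. This assumes a well-formed key is exactly a 16-byte big-endian TableId, which the commit message does not state; `parse_table_id` and `scan_ids` are hypothetical names.

```rust
/// Parses a CF key as a table id.
/// ASSUMPTION: a well-formed key is exactly 16 bytes, big-endian.
pub fn parse_table_id(key: &[u8]) -> Option<u128> {
    if key.len() != 16 {
        return None;
    }
    let mut buf = [0u8; 16];
    buf.copy_from_slice(key);
    Some(u128::from_be_bytes(buf))
}

/// Scans keys, returning the ids that parsed and the number of keys skipped.
/// A malformed key is counted and skipped rather than causing a panic, so one
/// bad entry cannot wedge every later cleanup tick.
pub fn scan_ids<'a>(keys: impl IntoIterator<Item = &'a [u8]>) -> (Vec<u128>, usize) {
    let mut ids = Vec::new();
    let mut skipped = 0;
    for key in keys {
        match parse_table_id(key) {
            Some(id) => ids.push(id),
            None => skipped += 1,
        }
    }
    (ids, skipped)
}
```

Returning the skipped count lets the caller log or alert on corruption while the rest of the scan continues.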

mo4islona force-pushed the sync-dataset-cleanup-flag branch from 0e7ab30 to be14d62 · June 25, 2026 15:08

mo4islona changed the title ~~Add sync_dataset_cleanup flag to defer table purge~~ Reclaim hotblocks disk via DeleteFilesInRange + range deletes (NET-819, NET-798) · Jun 25, 2026

mo4islona force-pushed the sync-dataset-cleanup-flag branch 2 times, most recently from b19920a to 7659b7c · June 25, 2026 15:22

mo4islona changed the title ~~Reclaim hotblocks disk via DeleteFilesInRange + range deletes (NET-819, NET-798)~~ Reclaim hotblocks disk via DeleteFilesInRange + range deletes · Jun 25, 2026

mo4islona force-pushed the sync-dataset-cleanup-flag branch 3 times, most recently from fc98611 to 4736ea5 · June 25, 2026 17:22

mo4islona force-pushed the sync-dataset-cleanup-flag branch from 4736ea5 to 98eac3f · June 25, 2026 17:25

mo4islona wants to merge 1 commit into master from sync-dataset-cleanup-flag
