[CASSANDRA-21472][trunk] Fix memtable on-heap accounting drift in BTree.update and BTreeRow.merge #4900
Open
netudima wants to merge 1 commit into
2c4126e to f576597
maedhroz reviewed Jun 24, 2026
Contributor
Pasting the bits surfaced from a quick CC skill-assisted review here, and then I'll do a more careful manual review:
Contributor
For finding 1 above, this test does appear to fail, but I could still just be testing the wrong thing...
Contributor
Author
I've fixed the reported issues except Finding 6. The direct way to fix it appears to break the AtomicBTreePartitionMemtableAccountingTest test, so I need to think a bit more about it.
maedhroz reviewed Jun 25, 2026
maedhroz approved these changes Jun 25, 2026
maedhroz left a comment
Contributor
LGTM (with one more nit in BTree)
Several paths report on-heap allocation through UpdateFunction.onAllocatedOnHeap in a
way that diverges from the heap actually retained (BTree.sizeOnHeapOf), so the memtable's
owned-heap counter drifts. Five under/over-counting causes are fixed so that reported
allocation matches sizeOnHeapOf:

1) BTree.update over-counts on node split/overflow and never counts branch sizeMaps.
   The Updater's running 'allocated' was not a true net delta: leaf drain() cleared the
   source before subtracting it, the redistribute/overflow paths added new nodes without
   releasing the source they replace, and branch sizeMaps were never counted.
   Fix: account each node net - add every newly retained node's shallow heap (array plus
   sizeMap) and subtract it for every replaced source, releasing before the source is
   cleared; and record the root as the top builder's source so the old root is released too.
   Test: BTreeUpdateHeapAccountingTest (randomized small / contiguous-block / overlapping /
   height-4, coverage verified by JaCoCo).

2) BTreeRow.merge does not release a row's column tree when a row tombstone shadows its
   cells: the retain branch rebuilds it smaller via BTree.transformAndFilter (node accounting
   disabled) but never releases the freed structure.
   Fix: report sizeOnHeapOf(retained) - sizeOnHeapOf(existing) when the filter shrinks the
   tree, as ColumnData.Reconciler.merge does.

3) BTreeRow.merge does not account the row's LivenessInfo/Deletion change (e.g. a tombstone
   replacing a live row).
   Fix: account (reconciled liveness+deletion) - (existing liveness+deletion).

4) Allocation and release disagree on the branch sizeMap: allocation used sizeOfStructureOnHeap
   (excludes it), release used sizeOnHeapOf (includes it).
   Fix: remove sizeOfStructureOnHeap and use sizeOnHeapOf everywhere.

5) ColumnData.removeShadowed does not release a shadowed complex (collection) column's own
   structure: it releases the inner cells via recordDeletion.delete but not the column's cell
   tree (which can span multiple nodes) nor, when the column is dropped, its wrapper - both
   counted as owned when written.
   Fix: report (EMPTY_SIZE + sizeOnHeapOf(tree)) after - before (after is 0 when dropped);
   a no-op on the update side (recordDeletion == noOp), as required by CASSANDRA-21469.

Tests for 2-5: PartitionRowAccountingTest.rowTombstoneOverExistingRowDoesNotInflateOwnership and
.rowTombstoneOverExistingCollectionDoesNotInflateOwnership require two logically identical
partitions reached via different merge paths to own exactly the same on-heap (only with all
fixes does it match); SetCellAccountingTest guards that a grow/reset op mix on a set<text>
column never drives the owned heap negative.

patch by Dmitry Konstantinov; reviewed by Caleb Rackliffe for CASSANDRA-21472
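The net-delta rule described under fix 1) in the commit message above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the patch itself: `Node`, `shallowSize`, `replace`, and the size constants are hypothetical stand-ins and do not correspond to Cassandra's BTree classes or to real JVM object sizes. It shows a leaf being split into two leaves under a new branch, where the reported delta adds every newly retained node (array plus sizeMap) and subtracts the replaced source, measured before the source would be cleared.

```java
public class NetDeltaAccountingSketch
{
    // Hypothetical per-object costs; real values depend on the JVM object layout.
    static final long ARRAY_HEADER = 16;
    static final long REF_SIZE = 8;
    static final long INT_SIZE = 4;

    /** Toy node: a key array plus a sizeMap that only branches carry. */
    static final class Node
    {
        final Object[] keys;
        final int[] sizeMap; // null for leaves

        Node(int keyCount, boolean branch)
        {
            keys = new Object[keyCount];
            sizeMap = branch ? new int[keyCount + 1] : null;
        }

        /** Shallow heap of this node: the key array plus the sizeMap when present. */
        long shallowSize()
        {
            long size = ARRAY_HEADER + REF_SIZE * keys.length;
            if (sizeMap != null)
                size += ARRAY_HEADER + INT_SIZE * sizeMap.length;
            return size;
        }
    }

    /** Net delta for replacing one source node with the nodes that are retained instead. */
    static long replace(Node source, Node... retained)
    {
        // Size the source first: once it is cleared or recycled it can no longer be measured.
        long delta = -source.shallowSize();
        for (Node node : retained)
            delta += node.shallowSize();
        return delta;
    }

    public static void main(String[] args)
    {
        Node oldLeaf = new Node(8, false);
        long owned = oldLeaf.shallowSize(); // what the counter held before the update

        // The update splits the leaf into two leaves under a new branch.
        Node left = new Node(4, false);
        Node right = new Node(4, false);
        Node branch = new Node(1, true);
        owned += replace(oldLeaf, left, right, branch);

        long retained = left.shallowSize() + right.shallowSize() + branch.shallowSize();
        System.out.println("owned=" + owned + " retained=" + retained);
    }
}
```

In this model the running counter and the retained size agree after the split. Leaving out the subtraction of the old leaf would over-count, and leaving out the branch's sizeMap would under-count.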
83e78bf to 7dcfc6f