Skip to content

[CASSANDRA-21472][trunk] Fix memtable on-heap accounting drift in BTree.update and BTreeRow.merge#4900

Open
netudima wants to merge 1 commit into
apache:trunkfrom
netudima:CASSANDRA-21472-trunk
Open

[CASSANDRA-21472][trunk] Fix memtable on-heap accounting drift in BTree.update and BTreeRow.merge#4900
netudima wants to merge 1 commit into
apache:trunkfrom
netudima:CASSANDRA-21472-trunk

Conversation

@netudima

@netudima netudima commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Several paths report on-heap allocation through UpdateFunction.onAllocatedOnHeap in a
way that diverges from the heap actually retained (BTree.sizeOnHeapOf), so the memtable's
owned-heap counter drifts. Five under/over-counting causes are fixed so that reported
allocation matches sizeOnHeapOf:

  1. BTree.update over-counts on node split/overflow and never counts branch sizeMaps.
    The Updater's running 'allocated' was not a true net delta: leaf drain() cleared the
    source before subtracting it, the redistribute/overflow paths added new nodes without
    releasing the source they replace, and branch sizeMaps were never counted.
    Fix: account each node net - add every newly retained node's shallow heap (array plus
    sizeMap) and subtract it for every replaced source, releasing before the source is
    cleared; and record the root as the top builder's source so the old root is released too.
    Test: BTreeUpdateHeapAccountingTest (randomized small / contiguous-block / overlapping /
    height-4, coverage verified by JaCoCo).

  2. BTreeRow.merge does not release a row's column tree when a row tombstone shadows its
    cells: the retain branch rebuilds it smaller via BTree.transformAndFilter (node accounting
    disabled) but never releases the freed structure. Fix: report sizeOnHeapOf(retained) -
    sizeOnHeapOf(existing) when the filter shrinks the tree, as ColumnData.Reconciler.merge does.

  3. BTreeRow.merge does not account the row's LivenessInfo/Deletion change (e.g. a tombstone
    replacing a live row). Fix: account (reconciled liveness+deletion) - (existing liveness+deletion).

  4. Allocation and release disagree on the branch sizeMap: allocation used sizeOfStructureOnHeap
    (excludes it), release used sizeOnHeapOf (includes it). Fix: remove sizeOfStructureOnHeap and
    use sizeOnHeapOf everywhere.

  5. ColumnData.removeShadowed does not release a shadowed complex (collection) column's own
    structure: it releases the inner cells via recordDeletion.delete but not the column's cell
    tree (which can span multiple nodes) nor, when the column is dropped, its wrapper - both
    counted as owned when written. Fix: report (EMPTY_SIZE + sizeOnHeapOf(tree)) after - before
    (after is 0 when dropped); a no-op on the update side (recordDeletion == noOp), as required
    by CASSANDRA-21469.

Tests for 2-5: PartitionRowAccountingTest.rowTombstoneOverExistingRowDoesNotInflateOwnership and
.rowTombstoneOverExistingCollectionDoesNotInflateOwnership require two logically identical
partitions reached via different merge paths to own exactly the same on-heap (only with all
fixes does it match); SetCellAccountingTest guards that a grow/reset op mix on a set
column never drives the owned heap negative.

patch by Dmitry Konstantinov; reviewed by TBD for CASSANDRA-21472

@netudima netudima force-pushed the CASSANDRA-21472-trunk branch 2 times, most recently from 2c4126e to f576597 Compare June 23, 2026 22:20
Comment thread src/java/org/apache/cassandra/db/rows/ComplexColumnData.java Outdated
@maedhroz

Copy link
Copy Markdown
Contributor

Pasting the bits surfaced from a quick CC skill-assisted review here, and then I'll do a more careful manual review:

 Finding 1: if (allocated > 0) silently swallows net releases — contradicts patch's stated net-delta semantics

  - Location: src/java/org/apache/cassandra/utils/btree/BTree.java:3713-3714
  - Confidence: High (latent active bug; certain consistency hazard)
  - Flagged by: Logic, Boundary, Resources, Completeness, Symmetry (5 specialists)
  - What's wrong: After the patch, Updater.allocated is documented and used as a true net delta (addAllocated adds for newly retained nodes, subtracts for replaced sources, root release included). But the report site still gates on allocated > 0, dropping any zero or negative result. This collides on two axes: (a) the sentinel -1
  overloads "tracking disabled" with "net release" — once allocated goes below zero it stays negative and future addAllocated calls become no-ops because the helper guards on allocated >= 0; (b) the parallel report sites at BTree.java:382, 397 are unconditional/signed, so this is the only asymmetric guard.
  - Why current code doesn't trigger it: BTree.update's reconciler is 1-to-1 (no key removal), so per-step addAllocated arguments are non-negative across all paths the symmetry/logic agents traced (drain, redistributeAndDrain, redistributeOverflowAndDrain, propagateOverflow). The patch's BTreeUpdateHeapAccountingTest does not
  exercise a net-zero-or-negative outcome — the closest case (root reuse on identical-shape merge) is not in the scenario set.
  - Suggested fix: replace the report condition with if (allocated != 0) and decouple "tracking disabled" from the value of allocated (e.g. a boolean accountingEnabled set once at the start of update). Add a scenario to BTreeUpdateHeapAccountingTest where every key in insert already exists with identical content (root reuse →
  allocated == 0).

  Finding 2: UpdateFunction.onAllocatedOnHeap javadoc still describes the old semantics

  - Location: src/java/org/apache/cassandra/utils/btree/UpdateFunction.java:46-49
  - Confidence: High
  - Flagged by: Completeness
  - What's wrong: Javadoc reads @param heapSize extra heap space allocated (over previous tree). After this patch the parameter is a signed net delta — the new BTreeRow.merge and ColumnData.removeShadowed call sites can pass negative values today. Future implementers reading the contract may add a non-negative assert and silently
  break accounting.
  - Suggested fix: state explicitly that the value is a signed delta and may be negative when more heap is released than allocated.

  Finding 3: AbstractFastBuilder.reset() does not clear the leaf-level sourceNode; the new root setSourceNode(update) widens the leak window

  - Location: src/java/org/apache/cassandra/utils/btree/BTree.java:3511-3532 (reset) and :3706 (the new root assignment)
  - Confidence: Medium
  - Flagged by: Concurrency, Symmetry
  - What's wrong: reset() walks branch = leaf().parent clearing each parent's sourceNode, but never clears the leaf builder's own sourceNode. Pre-patch this was an idle latent issue because LeafBuilder.drain() always cleared it on the success path. Post-patch, Updater.update() now eagerly sets builder.setSourceNode(update) before
  updateRecursive runs — so any exception thrown between line 3706 and the next drain() returns the Updater to the FastThreadLocal pool with a dangling reference to the previous update tree's root array. The dangling ref is eventually overwritten on the next update() reaching the leaf level, but until then the pooled Updater
  pins memory.
  - Suggested fix: add clearSourceNode() (the leaf-level call on this) at the top or bottom of AbstractFastBuilder.reset().

  Finding 4: LeafBuilder.propagateOverflow not migrated to addAllocated(shallowHeapOf(savedBuffer))

  - Location: src/java/org/apache/cassandra/utils/btree/BTree.java:2896-2905
  - Confidence: Medium (consistency, not correctness today)
  - Flagged by: Boundary, Concurrency, Resources, Symmetry (4 specialists)
  - What's wrong: Every other allocation/release point in LeafBuilder/BranchBuilder was rewritten through addAllocated(...) plus shallowHeapOf(...)/sizeOnHeapOfLeaf(...); this one site still uses the inline-guarded allocated += ObjectSizes.sizeOfReferenceArray(MAX_KEYS). Numerically equivalent today (savedBuffer is new
  Object[MAX_KEYS]), but the inconsistency is exactly the kind of asymmetric edit that produced four of the five bugs the patch is fixing.
  - Suggested fix: replace with leaf.addAllocated(shallowHeapOf(savedBuffer));.

  Finding 5: recordDeletion.onAllocatedOnHeap(...) in removeShadowed is computed and discarded on the noOp path

  - Location: src/java/org/apache/cassandra/db/rows/ColumnData.java:230-232 (with noOp defined at :67-89)
  - Confidence: Low (efficiency, not correctness)
  - Flagged by: Completeness
  - What's wrong: When recordDeletion == ColumnData.noOp (the update side / Reconciler.merge path), the new line still computes EMPTY_SIZE + sizeOnHeapOf(existingTree) and EMPTY_SIZE + sizeOnHeapOf(cells) (each potentially walking a multi-node tree) only to pass the result to a no-op. Trivial fix: gate on recordDeletion != noOp.

  Finding 6: removeShadowed does not subtract complexDeletion.dataSize() when a complex column is dropped

  - Location: src/java/org/apache/cassandra/db/rows/ColumnData.java:200-233
  - Confidence: Low (out of patch scope; small dataSize drift, not heap)
  - Flagged by: Resources
  - What's wrong: BTreePartitionUpdater.insert(ColumnData) originally added insert.dataSize() (which for ComplexColumnData includes complexDeletion.dataSize() == 12). The patch's release path subtracts each cell's dataSize via recordDeletion.delete and the wrapper+tree heap via onAllocatedOnHeap, but never subtracts the
  complexDeletion.dataSize(). Heap accounting (the patch's stated target) is fine; dataSize accumulates 12 stranded bytes per dropped complex column with a non-LIVE complexDeletion. Adjacent to the patch, would be a one-line fix.

  Finding 7: Latent paths that don't release sourceNode if reached

  - Location: LeafBuilder.drain() count==0 early return at BTree.java:2951-2956; BranchBuilder.drain() count==0 / right-child collapse at :3261-3270
  - Confidence: Low (currently unreachable from the patched call paths)
  - Flagged by: Resources, Symmetry
  - What's wrong: Both early-return branches clearSourceNode() without first releasing the source's shallow heap through addAllocated. BTree.update's reconciler doesn't drop keys, so neither branch is reachable from the Updater today. Worth either an assertion or a defensive addAllocated(-shallowHeapOf(sourceNode)) so the new
  "every replaced source releases its shallow heap" invariant holds globally.

@maedhroz

Copy link
Copy Markdown
Contributor

For finding 1 above, this test does appear to fail, but I could still just be testing the wrong thing...

/**
 * Verifies that re-inserting every key that already exists reports a net heap delta of exactly zero,
 * and that {@code onAllocatedOnHeap} is actually invoked (not silently skipped by an {@code allocated > 0}
 * guard). The call must happen so that callers accumulating signed deltas see the confirmation, and so
 * that a negative allocated value cannot permanently disable accounting by crossing the -1 sentinel.
 */
@Test
public void netZeroDeltaOnAllocatedOnHeapIsInvoked()
{
    int numKeys = BTree.MAX_KEYS * BTree.MAX_KEYS;
    Integer[] keys = new Integer[numKeys];
    for (int i = 0; i < numKeys; i++)
        keys[i] = i;

    Object[] tree = BTree.build(BulkIterator.of(keys), numKeys, UpdateFunction.noOp());
    Object[] insert = BTree.build(BulkIterator.of(keys), numKeys, UpdateFunction.noOp());

    // Use a counting variant that records invocation count separately from the accumulated total
    final long[] invocations = { 0 };
    final long[] total = { 0 };

    UpdateFunction<Integer, Integer> fn = new UpdateFunction<Integer, Integer>()
    {
        @Override public Integer insert(Integer i) { return i; }
        @Override public Integer merge(Integer existing, Integer update) { return update; }
        @Override public void onAllocatedOnHeap(long delta)
        {
            invocations[0]++;
            total[0] += delta;
        }
    };

    Object[] result = BTree.update(tree, insert, CMP, fn);

    assertSame("full re-insert of identical keys must reuse the existing tree root", tree, result);

    assertEquals("reported net delta must be zero for a full re-insert of identical keys",
                 0L, total[0]);

    // This is the assertion that catches the bug:
    // with `if (allocated > 0)`, invocations[0] == 0 (call skipped entirely)
    // with `if (allocated != 0)` or unconditional, invocations[0] == 1
    assertTrue("onAllocatedOnHeap must be invoked even for a net-zero update; " +
               "skipping it conflates 'tracking disabled' (-1 sentinel) with 'net zero result' (0)",
               invocations[0] >= 1);
}

@netudima

netudima commented Jun 24, 2026

Copy link
Copy Markdown
Contributor Author

I've fixd the reported issues except Finding 6 - it looks like the direct way to fix it breaks AtomicBTreePartitionMemtableAccountingTest test, so I need to think a bit more about it

Comment thread src/java/org/apache/cassandra/utils/btree/BTree.java Outdated

@maedhroz maedhroz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (with one more nit in BTree)

Several paths report on-heap allocation through UpdateFunction.onAllocatedOnHeap in a
way that diverges from the heap actually retained (BTree.sizeOnHeapOf), so the memtable's
owned-heap counter drifts. Five under/over-counting causes are fixed so that reported
allocation matches sizeOnHeapOf:

1) BTree.update over-counts on node split/overflow and never counts branch sizeMaps.
   The Updater's running 'allocated' was not a true net delta: leaf drain() cleared the
   source before subtracting it, the redistribute/overflow paths added new nodes without
   releasing the source they replace, and branch sizeMaps were never counted.
   Fix: account each node net - add every newly retained node's shallow heap (array plus
   sizeMap) and subtract it for every replaced source, releasing before the source is
   cleared; and record the root as the top builder's source so the old root is released too.
   Test: BTreeUpdateHeapAccountingTest (randomized small / contiguous-block / overlapping /
   height-4, coverage verified by JaCoCo).

2) BTreeRow.merge does not release a row's column tree when a row tombstone shadows its
   cells: the retain branch rebuilds it smaller via BTree.transformAndFilter (node accounting
   disabled) but never releases the freed structure. Fix: report sizeOnHeapOf(retained) -
   sizeOnHeapOf(existing) when the filter shrinks the tree, as ColumnData.Reconciler.merge does.

3) BTreeRow.merge does not account the row's LivenessInfo/Deletion change (e.g. a tombstone
   replacing a live row). Fix: account (reconciled liveness+deletion) - (existing liveness+deletion).

4) Allocation and release disagree on the branch sizeMap: allocation used sizeOfStructureOnHeap
   (excludes it), release used sizeOnHeapOf (includes it). Fix: remove sizeOfStructureOnHeap and
   use sizeOnHeapOf everywhere.

5) ColumnData.removeShadowed does not release a shadowed complex (collection) column's own
   structure: it releases the inner cells via recordDeletion.delete but not the column's cell
   tree (which can span multiple nodes) nor, when the column is dropped, its wrapper - both
   counted as owned when written. Fix: report (EMPTY_SIZE + sizeOnHeapOf(tree)) after - before
   (after is 0 when dropped); a no-op on the update side (recordDeletion == noOp), as required
   by CASSANDRA-21469.

Tests for 2-5: PartitionRowAccountingTest.rowTombstoneOverExistingRowDoesNotInflateOwnership and
.rowTombstoneOverExistingCollectionDoesNotInflateOwnership require two logically identical
partitions reached via different merge paths to own exactly the same on-heap (only with all
fixes does it match); SetCellAccountingTest guards that a grow/reset op mix on a set<text>
column never drives the owned heap negative.

patch by Dmitry Konstantinov; reviewed by Caleb Rackliffe for CASSANDRA-21472
@netudima netudima force-pushed the CASSANDRA-21472-trunk branch from 83e78bf to 7dcfc6f Compare June 25, 2026 21:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants