Skip to content

fix(tx-submitter): prevent restart panic caused by sealed batch indices holes#991

Open
SegueII wants to merge 3 commits into
mainfrom
submitter
Open

fix(tx-submitter): prevent restart panic caused by sealed batch indices holes#991
SegueII wants to merge 3 commits into
mainfrom
submitter

Conversation

@SegueII

@SegueII SegueII commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Closes #975

Summary

tx-submitter crash-looped on restart with a nil pointer dereference (SIGSEGV addr=0x8) at common/batch/batch_storage.go:146. Root cause: finalize cleanup deleted only a single batch index (Delete(batchIndex-1)), so when the finalize target jumped past batches finalized by other submitters, the persisted sealed_batch_indices snapshot was left with middle holes. LoadAllSealedBatchesAndHeader assumed contiguous indices (batches[idx-1]) and dereferenced a nil parent batch. The hole stays invisible at runtime (queries are memory-first) and only explodes on the next restart.

Changes

  • Load path (stop the bleeding): loadBatchIndices now sorts ascending; LoadAllSealedBatchesAndHeader verifies index contiguity and nil-checks the parent batch, returning an error so the existing DeleteBatchStorageAndInitFromRollup self-heal path rebuilds from the rollup contract instead of panicking.
  • Finalize cleanup (root cause): new range-based BatchCache.DeleteUntil / BatchStorage.DeleteSealedBatchesUpTo replaces the single-index delete in handleConfirmedTx, so surviving indices always form a contiguous window. Also reclaims header keys that the old path leaked.
  • Storage hardening: SealedBatchKV gains an atomic WriteBatch; batch data + header + indices snapshot are now persisted in one atomic write (StoreSealedBatchAndHeader), and indices update errors are no longer swallowed.

Compatibility

Nodes that already have a holed indices snapshot will hit the contiguity check on first restart after upgrade and automatically self-heal by rebuilding sealed batches from the rollup contract; no manual intervention needed.

Test plan

  • New regression tests in common/batch/batch_storage_test.go:
    • holed indices return an error instead of panicking (reproduces the original crash scenario)
    • contiguous indices load unchanged (control)
    • DeleteSealedBatchesUpTo removes data + header keys and keeps the surviving window loadable
    • legacy unsorted snapshots are normalized on load
    • indices read failures propagate from store/delete operations
  • cd common && go test ./batch/ (unit tests) passing
  • cd tx-submitter && go build ./... && go test ./db/... ./utils/... ./services/... passing

Made with Cursor

Summary by CodeRabbit

  • New Features
    • Added range-based batch cache cleanup via BatchCache.DeleteUntil(maxIndex).
    • Added BatchStorage.DeleteSealedBatchesUpTo(maxIndex) to delete sealed batches while maintaining a contiguous indices snapshot.
  • Bug Fixes
    • Batch persistence is now fail-fast with stricter validation for contiguous sealed batch indices and parent relationships.
    • Sealed batch data and indices are persisted/deleted together using atomic writes to prevent gaps or stale references.
  • Tests
    • Expanded regression coverage for contiguous loading, hole detection, range deletion boundaries, and error propagation.
  • Refactor
    • Finalization cleanup now uses range-based deletion to avoid stale indices.

…es holes

Finalize cleanup deleted only a single batch index, so when the finalize
target jumped past batches finalized by other submitters, the persisted
sealed batch indices snapshot was left with holes. On restart,
LoadAllSealedBatchesAndHeader assumed contiguous indices and dereferenced
a nil parent batch (SIGSEGV at batch_storage.go:146), crash-looping the
service.

- load path: sort indices, verify contiguity and nil-check the parent
  batch, returning an error so the existing self-heal path rebuilds from
  the rollup contract instead of panicking
- finalize cleanup: replace single-index Delete with range-based
  DeleteUntil so surviving indices always form a contiguous window (also
  reclaims previously leaked header keys)
- storage: add atomic WriteBatch to SealedBatchKV and persist batch
  data + header + indices in one write; indices update errors are no
  longer swallowed

Co-authored-by: Cursor <cursoragent@cursor.com>
@SegueII SegueII requested a review from a team as a code owner June 11, 2026 15:18
@SegueII SegueII requested review from twcctop and removed request for a team June 11, 2026 15:18
@SegueII SegueII mentioned this pull request Jun 11, 2026
@coderabbitai

coderabbitai Bot commented Jun 11, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e052ae93-3cbd-4a0c-9b72-b1bab34a03dc

📥 Commits

Reviewing files that changed from the base of the PR and between 8d95e4b and 49fe461.

📒 Files selected for processing (3)
  • common/batch/batch_cache.go
  • common/batch/batch_storage.go
  • common/batch/batch_storage_test.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • common/batch/batch_storage.go

📝 Walkthrough

Walkthrough

This PR fixes a startup crash-loop in tx-submitter caused by persisted gaps in sealed batch indices. It hardens batch storage with atomic writes and snapshot consistency, validates contiguity and parent batch presence during load to error gracefully instead of panicking, adds range-based cleanup to maintain index window continuity, protects concurrent initialization paths with mutex serialization, and makes SealBatch atomic with rollback semantics.

Changes

Atomic Batch Storage Consistency and Startup Resilience

Layer / File(s) Summary
Batch Write Interface and Storage Contract
common/batch/interfaces.go, common/batch/helpers_test.go, tx-submitter/db/db.go
New KVPair struct and WriteBatch(puts []KVPair, deletes [][]byte) contract enable atomic multi-key operations; implemented in test LevelDB wrapper and production Db with proper mutex serialization.
Atomic Batch Persistence with Indices Snapshot Consistency
common/batch/batch_storage.go
StoreSealedBatch and StoreSealedBatchAndHeader now compute updated batch indices using snapshot helpers, write batch data + header + indices in one atomic WriteBatch, and propagate indices errors; loadBatchIndices sorts indices for deterministic ordering and consistent validation.
Load-Time Contiguity Validation and Error Reporting
common/batch/batch_storage.go
LoadAllSealedBatchesAndHeader validates that indices are contiguous after sorting, ensures each batch's parent is present and consistent, and returns explicit errors on violation so existing self-heal logic can rebuild from the rollup contract instead of panicking.
Range-Based and Atomic Batch Cleanup
common/batch/batch_storage.go
DeleteSealedBatch becomes atomic; new DeleteSealedBatchesUpTo(maxIndex) deletes all batches with indices <= maxIndex while rewriting remaining indices as a contiguous window; DeleteAllSealedBatches updated to atomic write.
Initialization Crash-Loop Protection with Mutex and Recovery
common/batch/batch_cache.go
Adds initMu mutex and isInitialized()/setInitialized() helpers to serialize concurrent initialization; wraps all init paths with lock protection and routes failure/recovery cases through locked cleanup helpers.
SealBatch Atomic Persistence with Rollback
common/batch/batch_cache.go
SealBatch uses single atomic StoreSealedBatchAndHeader call; on error rolls back in-memory sealed batch/header entries before returning.
Cache-Level Range Cleanup and Initialization Check
common/batch/batch_cache.go
New DeleteUntil(maxIndex) deletes persisted batches and in-memory entries with indices <= maxIndex under proper locking; AssembleCurrentBatchHeader checks initialization via helper.
Finalize Cleanup Using Range Deletion
tx-submitter/services/rollup.go
Finalize-batch handler switches from single-batch delete (Delete(batchIndex-1)) to range-based DeleteUntil(batchIndex-1) to handle finalized indices that skip when concurrent submitters advance L1 finalized index.
Comprehensive Batch Storage and Cache Tests
common/batch/batch_storage_test.go
Test helpers for batch chains, contiguity assertions on load, hole detection regression, deletion semantics covering window shifts and idempotency, legacy indices normalization, and fault-injection confirming error propagation and rollback behavior.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Suggested reviewers

  • SecurityLife
  • twcctop
  • FletcherMan

Poem

🐰 A batch of fixes hopping through the code,
Atomic writes ensure a safer road,
No more crashes at the startup gate,
Contiguous ranges seal the fate,
Range-based cleanup hops with care. 🎯

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 42.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately summarizes the primary fix: preventing startup panics caused by non-contiguous sealed batch indices.
Linked Issues check ✅ Passed All primary objectives from #975 are addressed: load path defensive checks with sorting/contiguity validation, range-based finalize cleanup, and atomic storage with error propagation.
Out of Scope Changes check ✅ Passed All changes are directly related to fixing the sealed batch indices hole issue; no unrelated refactoring or feature additions detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch submitter

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (4)
common/batch/batch_storage_test.go (4)

128-131: ⚡ Quick win

Use isKVNotFound for consistent error validation.

Lines 128-131 verify that deleted batches return errors, but don't check the error type. Line 127 correctly uses isKVNotFound to verify the indices key was deleted. Applying the same pattern here ensures the correct "not found" error is returned rather than a different error type.

✨ Consistent error type checking
 	_, err = s.LoadSealedBatch(100)
-	require.Error(t, err)
+	require.True(t, isKVNotFound(err), "batch should be deleted")
 	_, err = s.LoadSealedBatchHeader(100)
-	require.Error(t, err)
+	require.True(t, isKVNotFound(err), "header should be deleted")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@common/batch/batch_storage_test.go` around lines 128 - 131, Replace the
generic require.Error checks with assertions that the error is the expected
KV-not-found using isKVNotFound; specifically, after calling
s.LoadSealedBatch(100) and s.LoadSealedBatchHeader(100) assert isKVNotFound(err)
(e.g., require.Truef(t, isKVNotFound(err), "expected KV not found, got: %v",
err)) so the test verifies the correct not-found error type rather than any
error.

70-104: ⚡ Quick win

Consider adding edge case tests for boundary and over-deletion scenarios.

The current test verifies core DeleteSealedBatchesUpTo behavior but doesn't cover important boundary cases:

  • Deleting exactly at the highest index (e.g., DeleteSealedBatchesUpTo(105) when indices are [104, 105]) to verify all batches are removed
  • Deleting well beyond the highest index (e.g., DeleteSealedBatchesUpTo(200) when max is 105) to ensure it handles over-deletion gracefully

These cases would strengthen confidence that the deletion logic handles edge conditions correctly.

📋 Example edge case tests
// After line 103, add:

// Delete exactly at the boundary.
s2 := NewBatchStorage(openTestKV(t))
storeTestChain(t, s2, []uint64{100, 101})
require.NoError(t, s2.DeleteSealedBatchesUpTo(101))
indices, err := s2.loadBatchIndices()
require.True(t, isKVNotFound(err), "all indices should be deleted")

// Delete beyond the window.
s3 := NewBatchStorage(openTestKV(t))
storeTestChain(t, s3, []uint64{100, 101})
require.NoError(t, s3.DeleteSealedBatchesUpTo(200))
indices, err = s3.loadBatchIndices()
require.True(t, isKVNotFound(err), "all indices should be deleted")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@common/batch/batch_storage_test.go` around lines 70 - 104, Add two edge-case
checks to TestDeleteSealedBatchesUpTo: after the existing assertions,
instantiate new BatchStorage instances (e.g., s2 and s3), store a small chain
with indices like [100,101], then call DeleteSealedBatchesUpTo(101) and
DeleteSealedBatchesUpTo(200) respectively and assert the indices are gone by
calling loadBatchIndices and verifying it returns the KV-not-found condition
(use isKVNotFound(err) or equivalent). Ensure you reference the same methods
DeleteSealedBatchesUpTo and loadBatchIndices so the tests target the exact
deletion behavior at the boundary and when over-deleting.

152-168: ⚡ Quick win

Consider testing error propagation for range deletion methods.

The test validates that StoreSealedBatch, StoreSealedBatchAndHeader, and DeleteSealedBatch propagate indices read errors. Since the PR objectives mention "index update errors are propagated" and DeleteSealedBatchesUpTo and DeleteAllSealedBatches also update indices, consider testing those methods as well to ensure complete coverage of the error propagation guarantee.

🧪 Additional error propagation checks
// After line 167, add:

err = s.DeleteSealedBatchesUpTo(100)
require.ErrorIs(t, err, errIndicesUnavailable)

err = s.DeleteAllSealedBatches()
require.ErrorIs(t, err, errIndicesUnavailable)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@common/batch/batch_storage_test.go` around lines 152 - 168, The test
TestStoreSealedBatchPropagatesIndicesError currently checks StoreSealedBatch,
StoreSealedBatchAndHeader, and DeleteSealedBatch but misses range-deletion index
errors; add additional assertions using the same storage fixture (s :=
NewBatchStorage(&failingIndicesKV{SealedBatchKV: openTestKV(t)})) to call
DeleteSealedBatchesUpTo(100) and DeleteAllSealedBatches() and assert
require.ErrorIs(err, errIndicesUnavailable) for each, placing them after the
existing DeleteSealedBatch assertion so the test verifies index-update errors
are propagated for those range-deletion methods as well.

80-85: ⚡ Quick win

Use isKVNotFound for consistent error validation.

Lines 81-84 verify that deleted batches return errors, but don't check that the error type is "not found". Line 127 and the similar pattern in TestDeleteAllSealedBatches (line 127) use isKVNotFound to verify the specific error type. This ensures the right kind of error is returned and could catch bugs where a different error (e.g., a corruption error) is returned instead.

✨ Consistent error type checking
 	for idx := uint64(100); idx <= 103; idx++ {
 		_, err := s.LoadSealedBatch(idx)
-		require.Error(t, err, "batch %d should be deleted", idx)
+		require.True(t, isKVNotFound(err), "batch %d should be deleted", idx)
 		_, err = s.LoadSealedBatchHeader(idx)
-		require.Error(t, err, "header %d should be deleted", idx)
+		require.True(t, isKVNotFound(err), "header %d should be deleted", idx)
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@common/batch/batch_storage_test.go` around lines 80 - 85, Replace generic
require.Error checks with a specific "not found" assertion using isKVNotFound:
when calling s.LoadSealedBatch and s.LoadSealedBatchHeader in the loop
(functions LoadSealedBatch and LoadSealedBatchHeader), assert that
isKVNotFound(err) is true (e.g., via require.True or require.Truef) so the test
verifies the error type rather than any error; keep the same loop and messages
but use isKVNotFound(err) for validation.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@common/batch/batch_cache.go`:
- Around line 755-760: SealBatch currently swallows errors from
bc.batchStorage.StoreSealedBatchAndHeader and logs them instead of returning
failure; change SealBatch so that after calling
batchStorage.StoreSealedBatchAndHeader(batchIndex, sealedBatch,
&batchHeaderCopy) it returns the encountered error (or a wrapped error with
context) instead of only logging, ensuring the SealBatch function signature and
all callers propagate/handle the error; reference
bc.batchStorage.StoreSealedBatchAndHeader and the SealBatch function when making
the change so persistence failures are propagated to callers.

---

Nitpick comments:
In `@common/batch/batch_storage_test.go`:
- Around line 128-131: Replace the generic require.Error checks with assertions
that the error is the expected KV-not-found using isKVNotFound; specifically,
after calling s.LoadSealedBatch(100) and s.LoadSealedBatchHeader(100) assert
isKVNotFound(err) (e.g., require.Truef(t, isKVNotFound(err), "expected KV not
found, got: %v", err)) so the test verifies the correct not-found error type
rather than any error.
- Around line 70-104: Add two edge-case checks to TestDeleteSealedBatchesUpTo:
after the existing assertions, instantiate new BatchStorage instances (e.g., s2
and s3), store a small chain with indices like [100,101], then call
DeleteSealedBatchesUpTo(101) and DeleteSealedBatchesUpTo(200) respectively and
assert the indices are gone by calling loadBatchIndices and verifying it returns
the KV-not-found condition (use isKVNotFound(err) or equivalent). Ensure you
reference the same methods DeleteSealedBatchesUpTo and loadBatchIndices so the
tests target the exact deletion behavior at the boundary and when over-deleting.
- Around line 152-168: The test TestStoreSealedBatchPropagatesIndicesError
currently checks StoreSealedBatch, StoreSealedBatchAndHeader, and
DeleteSealedBatch but misses range-deletion index errors; add additional
assertions using the same storage fixture (s :=
NewBatchStorage(&failingIndicesKV{SealedBatchKV: openTestKV(t)})) to call
DeleteSealedBatchesUpTo(100) and DeleteAllSealedBatches() and assert
require.ErrorIs(err, errIndicesUnavailable) for each, placing them after the
existing DeleteSealedBatch assertion so the test verifies index-update errors
are propagated for those range-deletion methods as well.
- Around line 80-85: Replace generic require.Error checks with a specific "not
found" assertion using isKVNotFound: when calling s.LoadSealedBatch and
s.LoadSealedBatchHeader in the loop (functions LoadSealedBatch and
LoadSealedBatchHeader), assert that isKVNotFound(err) is true (e.g., via
require.True or require.Truef) so the test verifies the error type rather than
any error; keep the same loop and messages but use isKVNotFound(err) for
validation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 717da052-ebfd-4005-845e-6788be45eabe

📥 Commits

Reviewing files that changed from the base of the PR and between 99bc207 and 8d95e4b.

📒 Files selected for processing (7)
  • common/batch/batch_cache.go
  • common/batch/batch_storage.go
  • common/batch/batch_storage_test.go
  • common/batch/helpers_test.go
  • common/batch/interfaces.go
  • tx-submitter/db/db.go
  • tx-submitter/services/rollup.go

Comment thread common/batch/batch_cache.go
SegueII and others added 2 commits June 22, 2026 11:02
Propagate sealed batch storage failures and make cleanup DB-first so restart recovery cannot leave cache and storage diverged.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

tx-submitter bug fix

1 participant