autobahn mempool by pompon0 · Pull Request #3522 · sei-protocol/sei-chain

pompon0 · 2026-05-29T17:42:03Z

Mempool constructed specifically for autobahn tx lifecycle:

mempool rejects pending (non-ready) rpcs
mempool insert rpcs are blocking if mempool is full (except for BroadcastTxAsync)
mempool capacity is counted in local lane blocks - it can contain up to types.BlocksPerLane blocks since the last locally executed lane block
in particular mempool tracks EVM nonces until the local block is locally executed, at which point it can confirm whether the account nonces have been bumped as expected.
blocks are pushed from mempool to lane in avail.State as soon as there is capacity in avail.State (which is different than mempool capacity, because mempool counts blocks wrt to local execution, while avail.State counts blocks wrt to global execution)
full local blocks are constructed as soon as there is enough txs in the mempool
non-full local blocks are constructed after a BlockInterval timeout. Empty blocks are supported for now, because integration tests require that.

…i-chain into gprusak-mempool2

cursor

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

^{❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.}

^{Reviewed by Cursor Bugbot for commit 51d950e. Configure here.}

wen-coding · 2026-06-03T16:31:26Z

-		q := inner.blocks[lane]
 		if err := ctrl.WaitUntil(ctx, func() bool {
-			return q.next < inner.persistedBlockStart[lane]+BlocksPerLane
+			return toProduce < inner.persistedBlockStart[lane]+BlocksPerLane


I know this is pre-existing, we probably need to change inner.persistedBlockStart[lane] to appropriate value when validators can join in the middle

I'm just panicking we are always setting inner.persistedBlockStart[lane] to 0 when there is no WAL on the disk. Not related to this PR, I think we will fix that in the state sync work?

I mean as soon as anchor is fetched from the network, persistedBlockStart is also moved forward. avail state is supposed to be independent from statesync.

wen-coding · 2026-06-04T17:25:55Z

+				nonce = s.cfg.App.EvmNonce(addr)
+			}
+			if nonce != resp.EVMNonce {
+				return nil, fmt.Errorf("%w: got %v, want %v", errBadNonce, resp.EVMNonce, nonce)


So right now we only accept the next Nonce? Say the user sent tx with nonce 100 and 101, the packet containing 100 got dropped, 101 will rejected?

And also we don't allow higher-priority tx to replace old ones any more?

I guess it might be okay for the loadtest, but for public testnet this doesn't work. Maybe add a TODO?

For loadtest we also need to remember to retry in sequence if the loadtest client died.

the packet won't get dropped, because we support HTTP and WS which are both built on top of TCP. We don't allow replacement, because the lower priority already gets sequenced on the lane when the higher priority tx arrives, it does not make sense. Replacement doesn't make also sense, because there is no gossip so there is no need of reconciliation of txs conflicting on nonces. Mempool simply sequences a tx within a lane and then forgets about it - user is free to send a different tx with the same nonce and arbitrary priority as soon as the previous tx fails to get executed. Loadtest client is responsible for retrying txs. It has enough information to detect problem and resend what is needed.

TCP guarantees byte order, if there are two HTTP connections sending tx N and tx N+1, and they happen to be in wrong order, I don't think TCP can fix that? If there is some tool which fires concurrent txs without awaiting sequence, they could be rejected even if user didn't do anything wrong?

Depending on how our loadtest tool is written, if it always awaits tx N before sending tx N+1, then this won't be a problem. But I'd be a bit worried for public testnet.

Depending on how our loadtest tool is written, if it always awaits tx N before sending tx N+1, then this won't be a problem

I think it is a reasonable expectation.

wen-coding · 2026-06-04T17:34:15Z

+		lane := availState.PublicKey()
+		firstBlock := s.mempoolFirst()
+		scope.Spawn(func() error {
+			// Task pruning executed lane blocks from the mempool


Not saying right or wrong, just an observation.

In non-Autobahn we used to have success/fail results from TxResults, so there is a feedback about individual tx. Now it's per-account. I think originally we allowed second chance for failed tx, do we care about allowing that here? (I imagine if you are landing a transfer tx and sending enough gas at the same time, but your "send enough gas" tx got delayed, your transfer tx can maybe get a second chance?)

Maybe document the behavior?

it is compatible with the TxMempool of tendermint: failed transactions are removed from the mempool and are allowed to be reinserted. The "second chance" limitation is on the cache level. There is no cache of bad transaction for the autobahn mempool currently so in particular there is no restriction on 2 attempts per transaction.

I think your description is accurate for failed-tx cache. But for under-funded txs at admission, I think those are held in pending queue and re-assessed when execution results come in. It's okay if we force user to retry in this case, dunno how much it matters.

claude · 2026-06-05T10:32:02Z

 	if err != nil {
 		return none
 	}
-	if env.ConsensusState == nil {
-		return none
-	}
 	privValAddress := k.Address()

 	// Skip the in-memory consensus-state lookup under Autobahn: the CometBFT
 	// consensus State is never advanced, so GetValidators would nil-deref
 	// on an unpopulated validator set. The state-store lookup below is kept
 	// in sync under both engines.
-	if !env.gigaRouter().IsPresent() {
-		lastBlockHeight, vals := env.ConsensusState.GetValidators()
+	if consensusState, ok := env.ConsensusState.Get(); ok && !env.gigaRouter().IsPresent() {
+		lastBlockHeight, vals := consensusState.GetValidators()
 		if lastBlockHeight == h {
 			for _, val := range vals {
 				if bytes.Equal(val.Address, privValAddress) {


🟣 Under Autobahn, validatorAtHeight() in status.go always returns None for chains with InitialHeight > 1 (e.g. the test genesis uses 100), so /status reports VotingPower=0 for healthy validators. The root cause is pre-existing — this PR's only change here is the utils.Option refactor — but the new comment "state-store lookup below is kept in sync under both engines" is misleading and the natural fix sits in the function this PR already touched: under Autobahn the committee is fixed at genesis, so the call should fall through to GenDoc.Validators (same path the new Validators handler in consensus.go uses).

Extended reasoning...

What the bug is

Status calls env.validatorAtHeight(env.latestUncommittedHeight()) and uses the returned VotingPower to populate ValidatorInfo in the /status response. Under Autobahn:

env.BlockStore.Height() stays at 0 — the CometBFT block store is never written to (autobahn nodes only construct a minimal blocksync.Reactor for lag RPC; SaveBlock is only called from the blocksync syncer, which is None under autobahn).

env.ConsensusReactor is None under autobahn, so latestUncommittedHeight() skips the WaitSync branch and returns BlockStore.Height() + 1 = 1.

validatorAtHeight(1) then calls env.StateStore.LoadValidators(1).

But sm.Store.Save() at genesis persists the validator set at state.InitialHeight, not at height 1. From sei-tendermint/internal/state/store.go (saveValidatorsInfo at line ~171 and the nextHeight remap at ~167–173): for LastBlockHeight=0, nextHeight=1 is reassigned to InitialHeight, and the validators are saved at validatorsKey(InitialHeight) and validatorsKey(InitialHeight+1). For any chain with InitialHeight > 1 (the test genesis here uses InitialHeight: 100, and production chains commonly run with non-1 initial heights), LoadValidators(1) returns ErrNoValSetForHeight. validatorAtHeight returns None, and the Status response carries the zero default for VotingPower.

Step-by-step proof

Boot an autobahn node with the test genesis (setup_test.go: InitialHeight: 100).

Status runs validatorAtHeight(latestUncommittedHeight()).

BlockStore.Height() = 0, ConsensusReactor.IsPresent() = false → latestUncommittedHeight() = 1.

StateStore.LoadValidators(1) → loadValidatorsInfo(db, 1) returns "value retrieved from db is empty" because the only entries are at keys 100 and 101.

err != nil → return none.

Status reads validatorInfo.VotingPower from the unset utils.Option, which stays at zero.

External tooling scraping /status sees VotingPower=0 for a perfectly healthy validator.

For chains with InitialHeight == 1 the lookup happens to succeed and returns the genesis voting power; since Autobahn's committee is fixed at genesis (no validator-set updates), that genesis value also happens to be the current value — the bug is silent in that case.

Why this PR is the natural place to flag it

The diff at lines 164–177 was deliberately gated if !env.gigaRouter().IsPresent() for the in-memory consensus-state lookup, but the StateStore.LoadValidators(h) call on the line above the gate is not gated. The author was clearly aware of autobahn quirks in this function and added the gate one line too low. The accompanying comment ("state-store lookup below is kept in sync under both engines") is also inaccurate for autobahn — the CometBFT state store is never updated under autobahn.

The PR's only behavioral change to validatorAtHeight is the utils.Option refactor (removing the env.ConsensusState == nil check and combining with the gigaRouter check). Pre-PR, gigaRouter().IsPresent() already skipped the in-memory lookup and the function fell through to the same broken LoadValidators path, so the underlying bug pre-dates this PR.

Impact

Cosmetic only — observability/monitoring. No consensus or correctness impact. External tooling that scrapes /status to verify "this node is participating in consensus" by checking VotingPower > 0 will incorrectly conclude the node is down. There is already a TODO at the top of Status (lines 19–32) tracking autobahn /status limitations; this is one more case in that bucket.

Suggested fix

Under autobahn, fetch validators from env.GenDoc.Validators — the same path the new Validators handler in consensus.go uses for the autobahn branch:

if env.gigaRouter().IsPresent() { for _, gv := range env.GenDoc.Validators { if bytes.Equal(gv.Address, privValAddress) { return utils.Some(&types.Validator{ Address: gv.Address, PubKey: gv.PubKey, VotingPower: gv.Power, }) } } return none }

Either fix here or extend the existing TODO at the top of Status to explicitly call out ValidatorInfo.VotingPower so it gets picked up alongside the other deferred autobahn /status work.

claude · 2026-06-05T11:05:50Z

+		for m.first < min(n, m.next) {
+			b := m.blocks[m.first]
+			delete(m.blocks, m.first)
+			m.first += 1
+			for _, hash := range b.evmHashes {
+				delete(m.evmTxs, hash)
+			}
+			for addr, wantNonce := range b.evmNonces {
+				if wantNonce == m.evmNonces[addr] {
+					// Happy path: all account's txs got executed.
+					delete(m.evmNonces, addr)
+				} else if gotNonce := s.cfg.App.EvmNonce(addr); gotNonce < wantNonce {
+					// Some txs have not been executed - reset account tracking.
+					// NOTE: app execution is not synchronized with mempool, so nonce could have already
+					// proceeded past wantNonce and that is expected.
+					delete(m.evmNonces, addr)
+					delete(m.nextBlock.evmNonces, addr)
+					for _, x := range m.blocks {
+						delete(x.evmNonces, addr)
+					}
+				}


🟡 Resubmitting an identical EVM tx body after the pruneMempool nonce-reset path can leave an orphaned m.evmTxs[hash] mapping: m.evmTxs[hash] gets unconditionally deleted at line 101 when the original block is pruned, even though that same hash now points to the resubmitted tx still pending in a later block. The result is EvmTxByHash returning (nil, false) for a tx that is genuinely queued in the lane — a transient false-negative during the canonical retry pattern. Fix is to scope tx mappings per-block or guard the delete by identity (e.g. only delete when the stored tx matches the one being pruned).

Extended reasoning...

What the bug is

pruneMempool (lines 96-118) unconditionally calls delete(m.evmTxs, hash) for every hash in the pruned blocks evmHashes. insertTx (line 215) unconditionally writes m.evmTxs[resp.EVMHash] = tx. Combined with the reset path (lines 107-116), an identical resubmit can land the new tx-instance in m.evmTxs[H] while the old sealed block still carries H in its evmHashes — and the unconditional delete on that older block then wipes the mapping that now belongs to the still-pending resubmit.

Why the existing nonce check does not prevent it

Normally line 187 (nonce != resp.EVMNonce) blocks re-insertion of the same tx. But the reset path explicitly clears m.evmNonces[addr] (and the per-block evmNonces[addr] across all sealed blocks) when an accounts pending tx fails to bump the app nonce. After reset, insertTx reads the nonce from the app and accepts a fresh identical-bytes tx — the canonical "tx failed, retry it" pattern the author endorsed in inline-comment 3358604793.

Step-by-step proof

Setup: account A, app nonce N. Two txs already sequenced:

B1 contains Tx1 (nonce N, hash H1); B1.evmNonces[A]=N+1

B2 contains Tx2 (nonce N+1, hash H2); B2.evmNonces[A]=N+2

m.evmTxs = {H1: Tx1, H2: Tx2}; m.evmNonces[A] = N+2

B1 executes; Tx1 fails to bump the nonce (e.g. antehandler failure post-CheckTx). App nonce stays at N.

pruneMempool prunes B1: deletes m.evmTxs[H1] (fine). For A: wantNonce=N+1, m.evmNonces[A]=N+2 — mismatch. gotNonce=N < N+1 → reset path fires: deletes m.evmNonces[A], m.nextBlock.evmNonces[A], and B2.evmNonces[A]. m.evmTxs[H2] still points at the original Tx2; B2.evmHashes still contains H2.

User resubmits Tx1 (identical bytes, hash H1). m.evmNonces[A] is absent → falls back to app nonce N → matches → accepted. m.evmNonces[A]=N+1, m.evmTxs[H1]=Tx1 (new instance).

User resubmits Tx2 (identical bytes, hash H2). nonce N+1 matches m.evmNonces[A]=N+1 → accepted. m.evmNonces[A]=N+2. Line 215: m.evmTxs[H2] = Tx2 (overwrites the mapping that was still pointing at original Tx2 — same bytes, but now logically owned by the new block being built).

The new block B3 seals containing the resubmits; B3.evmHashes includes H2.

B2 (original) executes — Tx2 fails (app nonce still N, tx nonce N+1). pruneMempool prunes B2: iterates B2.evmHashes including H2, and runs delete(m.evmTxs, H2) unconditionally on line 101 — this removes the mapping that now points to the resubmitted Tx2 still pending in B3. B2.evmNonces was cleared in step 2 so the nonce-reset loop is a no-op.

EvmTxByHash(H2) now returns (nil, false) until B3 executes and is pruned, even though the tx is in the lane.

Impact and severity

Observability-only — eth_getTransactionByHash returns null for a brief window for a tx that IS being processed. No consensus or correctness impact. The trigger requires (a) a tx failing to bump the nonce during execution, (b) ≥2 pending txs from the same account across multiple blocks, (c) the user resubmitting the exact same signed bytes (the canonical retry pattern), and (d) the original later block being pruned before the resubmits block executes.

Addressing the refutation

The refuter is right that this is narrow and observability-only, and that the minimal-mempool design explicitly defers prioritization/replacement work to post-private-testnet. But the failing scenario IS the exact retry pattern the author endorsed ("user is free to send a different tx with the same nonce... user is responsible for retrying") — and the identical-bytes retry is the simplest form of that pattern. Filing as nit rather than normal: the impact is bounded, the trigger is rare, and the fix (per-block tx maps, or guarding the delete by identity) adds bookkeeping that may legitimately wait until after private testnet. Worth tracking, not worth blocking.

Suggested fix

Either:

Store tx by (blockNumber, hash) rather than just hash, and have prune only clear entries owned by the pruned block. Or

Compare the stored bytes before deleting: if bytes.Equal(m.evmTxs[hash], <ownedTx>) { delete(m.evmTxs, hash) }. Since identical resubmits have identical bytes this still leaves the mapping correct.

pompon0 added 30 commits May 15, 2026 13:46

new mempool draft

5c35437

WIP

e4a1280

Merge remote-tracking branch 'origin/main' into gprusak-mempool

2623868

WIP

5a91fa6

WIP

a6feb77

removed priority queue

df95cde

clist

4635fca

before implementing PopTxs

13254e5

reap marker

750c82a

prod code compiles

6bf90a7

fixing tests WIP

cd616ce

WIP

c860fe9

mempool tests compile

e295382

wip

c2d3f06

codex WIP

bdcd0a2

codex WIP

b7f02dc

codex WIP

9f0f768

moved cache

4bff32e

reaped requires a better handling

f51ec39

no reaped again

17702a9

codex WIP

2a7bc71

codex WIP

6ea4f7a

fmt

97717e7

Merge remote-tracking branch 'origin/main' into gprusak-mempool

41dcf16

some updates and documentation

0b9bf27

some fixes

d25b16d

test fix

f55487d

Merge remote-tracking branch 'origin/main' into gprusak-mempool

58778ce

termination fix

81c762c

fmt

f83f36f

pompon0 added 2 commits June 1, 2026 13:11

addressed codex comments

113739f

fix with debug logs

e6655b5

cursor Bot reviewed Jun 1, 2026

View reviewed changes

Comment thread sei-tendermint/internal/autobahn/producer/state.go Outdated

pompon0 added 8 commits June 1, 2026 14:05

fmt

2a8aff5

tests wip

7b97818

testing framework snapshot

af52e84

wip

29dcdef

happy path test implemented

778b709

mempool tests

e05c29d

removed logs

ad8e184

Merge branch 'main' into gprusak-mempool2

e443701

pompon0 requested review from sei-will and wen-coding June 2, 2026 12:09

sei-will approved these changes Jun 2, 2026

View reviewed changes

pompon0 added 2 commits June 3, 2026 15:33

full support for gas wanted/estimated limits

646456b

Merge branch 'gprusak-mempool2' of https://github.com/sei-protocol/se…

51d950e

…i-chain into gprusak-mempool2

cursor Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread sei-tendermint/internal/autobahn/producer/mempool.go

Comment thread sei-tendermint/node/setup.go

wen-coding reviewed Jun 4, 2026

View reviewed changes

wen-coding approved these changes Jun 4, 2026

View reviewed changes

pompon0 added 3 commits June 5, 2026 11:42

applied comments

c7d601d

Merge remote-tracking branch 'origin/main' into gprusak-mempool2

7fcbcfc

missing rpc

b414399

claude Bot reviewed Jun 5, 2026

View reviewed changes

string association calls for association tests

7f9f698

pompon0 enabled auto-merge June 5, 2026 10:42

pompon0 added this pull request to the merge queue Jun 5, 2026

claude Bot reviewed Jun 5, 2026

View reviewed changes

Merged via the queue into main with commit b8473e7 Jun 5, 2026
55 of 56 checks passed

pompon0 deleted the gprusak-mempool2 branch June 5, 2026 11:23

Conversation

pompon0 commented May 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

pompon0 Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

claude Bot Jun 5, 2026

Choose a reason for hiding this comment

What the bug is

Step-by-step proof

Why this PR is the natural place to flag it

Impact

Suggested fix

Uh oh!

claude Bot Jun 5, 2026

Choose a reason for hiding this comment

What the bug is

Why the existing nonce check does not prevent it

Step-by-step proof

Impact and severity

Addressing the refutation

Suggested fix

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

pompon0 commented May 29, 2026 •

edited

Loading

pompon0 Jun 5, 2026 •

edited

Loading