autobahn mempool#3522
Conversation
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 51d950e. Configure here.
| q := inner.blocks[lane] | ||
| if err := ctrl.WaitUntil(ctx, func() bool { | ||
| return q.next < inner.persistedBlockStart[lane]+BlocksPerLane | ||
| return toProduce < inner.persistedBlockStart[lane]+BlocksPerLane |
There was a problem hiding this comment.
I know this is pre-existing, we probably need to change inner.persistedBlockStart[lane] to appropriate value when validators can join in the middle
There was a problem hiding this comment.
I'm just panicking we are always setting inner.persistedBlockStart[lane] to 0 when there is no WAL on the disk. Not related to this PR, I think we will fix that in the state sync work?
There was a problem hiding this comment.
I mean as soon as anchor is fetched from the network, persistedBlockStart is also moved forward. avail state is supposed to be independent from statesync.
| nonce = s.cfg.App.EvmNonce(addr) | ||
| } | ||
| if nonce != resp.EVMNonce { | ||
| return nil, fmt.Errorf("%w: got %v, want %v", errBadNonce, resp.EVMNonce, nonce) |
There was a problem hiding this comment.
So right now we only accept the next Nonce? Say the user sent tx with nonce 100 and 101, the packet containing 100 got dropped, 101 will rejected?
And also we don't allow higher-priority tx to replace old ones any more?
I guess it might be okay for the loadtest, but for public testnet this doesn't work. Maybe add a TODO?
For loadtest we also need to remember to retry in sequence if the loadtest client died.
There was a problem hiding this comment.
the packet won't get dropped, because we support HTTP and WS which are both built on top of TCP. We don't allow replacement, because the lower priority already gets sequenced on the lane when the higher priority tx arrives, it does not make sense. Replacement doesn't make also sense, because there is no gossip so there is no need of reconciliation of txs conflicting on nonces. Mempool simply sequences a tx within a lane and then forgets about it - user is free to send a different tx with the same nonce and arbitrary priority as soon as the previous tx fails to get executed. Loadtest client is responsible for retrying txs. It has enough information to detect problem and resend what is needed.
There was a problem hiding this comment.
TCP guarantees byte order, if there are two HTTP connections sending tx N and tx N+1, and they happen to be in wrong order, I don't think TCP can fix that? If there is some tool which fires concurrent txs without awaiting sequence, they could be rejected even if user didn't do anything wrong?
Depending on how our loadtest tool is written, if it always awaits tx N before sending tx N+1, then this won't be a problem. But I'd be a bit worried for public testnet.
There was a problem hiding this comment.
Depending on how our loadtest tool is written, if it always awaits tx N before sending tx N+1, then this won't be a problem
I think it is a reasonable expectation.
| lane := availState.PublicKey() | ||
| firstBlock := s.mempoolFirst() | ||
| scope.Spawn(func() error { | ||
| // Task pruning executed lane blocks from the mempool |
There was a problem hiding this comment.
Not saying right or wrong, just an observation.
In non-Autobahn we used to have success/fail results from TxResults, so there is a feedback about individual tx. Now it's per-account. I think originally we allowed second chance for failed tx, do we care about allowing that here? (I imagine if you are landing a transfer tx and sending enough gas at the same time, but your "send enough gas" tx got delayed, your transfer tx can maybe get a second chance?)
Maybe document the behavior?
There was a problem hiding this comment.
it is compatible with the TxMempool of tendermint: failed transactions are removed from the mempool and are allowed to be reinserted. The "second chance" limitation is on the cache level. There is no cache of bad transaction for the autobahn mempool currently so in particular there is no restriction on 2 attempts per transaction.
There was a problem hiding this comment.
I think your description is accurate for failed-tx cache. But for under-funded txs at admission, I think those are held in pending queue and re-assessed when execution results come in. It's okay if we force user to retry in this case, dunno how much it matters.
| if err != nil { | ||
| return none | ||
| } | ||
| if env.ConsensusState == nil { | ||
| return none | ||
| } | ||
| privValAddress := k.Address() | ||
|
|
||
| // Skip the in-memory consensus-state lookup under Autobahn: the CometBFT | ||
| // consensus State is never advanced, so GetValidators would nil-deref | ||
| // on an unpopulated validator set. The state-store lookup below is kept | ||
| // in sync under both engines. | ||
| if !env.gigaRouter().IsPresent() { | ||
| lastBlockHeight, vals := env.ConsensusState.GetValidators() | ||
| if consensusState, ok := env.ConsensusState.Get(); ok && !env.gigaRouter().IsPresent() { | ||
| lastBlockHeight, vals := consensusState.GetValidators() | ||
| if lastBlockHeight == h { | ||
| for _, val := range vals { | ||
| if bytes.Equal(val.Address, privValAddress) { |
There was a problem hiding this comment.
🟣 Under Autobahn, validatorAtHeight() in status.go always returns None for chains with InitialHeight > 1 (e.g. the test genesis uses 100), so /status reports VotingPower=0 for healthy validators. The root cause is pre-existing — this PR's only change here is the utils.Option refactor — but the new comment "state-store lookup below is kept in sync under both engines" is misleading and the natural fix sits in the function this PR already touched: under Autobahn the committee is fixed at genesis, so the call should fall through to GenDoc.Validators (same path the new Validators handler in consensus.go uses).
Extended reasoning...
What the bug is
Status calls env.validatorAtHeight(env.latestUncommittedHeight()) and uses the returned VotingPower to populate ValidatorInfo in the /status response. Under Autobahn:
env.BlockStore.Height()stays at 0 — the CometBFT block store is never written to (autobahn nodes only construct a minimalblocksync.Reactorfor lag RPC;SaveBlockis only called from the blocksync syncer, which isNoneunder autobahn).env.ConsensusReactorisNoneunder autobahn, solatestUncommittedHeight()skips theWaitSyncbranch and returnsBlockStore.Height() + 1 = 1.validatorAtHeight(1)then callsenv.StateStore.LoadValidators(1).
But sm.Store.Save() at genesis persists the validator set at state.InitialHeight, not at height 1. From sei-tendermint/internal/state/store.go (saveValidatorsInfo at line ~171 and the nextHeight remap at ~167–173): for LastBlockHeight=0, nextHeight=1 is reassigned to InitialHeight, and the validators are saved at validatorsKey(InitialHeight) and validatorsKey(InitialHeight+1). For any chain with InitialHeight > 1 (the test genesis here uses InitialHeight: 100, and production chains commonly run with non-1 initial heights), LoadValidators(1) returns ErrNoValSetForHeight. validatorAtHeight returns None, and the Status response carries the zero default for VotingPower.
Step-by-step proof
- Boot an autobahn node with the test genesis (
setup_test.go:InitialHeight: 100). StatusrunsvalidatorAtHeight(latestUncommittedHeight()).BlockStore.Height() = 0,ConsensusReactor.IsPresent() = false→latestUncommittedHeight() = 1.StateStore.LoadValidators(1)→loadValidatorsInfo(db, 1)returns "value retrieved from db is empty" because the only entries are at keys 100 and 101.err != nil→ returnnone.StatusreadsvalidatorInfo.VotingPowerfrom the unsetutils.Option, which stays at zero.- External tooling scraping
/statusseesVotingPower=0for a perfectly healthy validator.
For chains with InitialHeight == 1 the lookup happens to succeed and returns the genesis voting power; since Autobahn's committee is fixed at genesis (no validator-set updates), that genesis value also happens to be the current value — the bug is silent in that case.
Why this PR is the natural place to flag it
The diff at lines 164–177 was deliberately gated if !env.gigaRouter().IsPresent() for the in-memory consensus-state lookup, but the StateStore.LoadValidators(h) call on the line above the gate is not gated. The author was clearly aware of autobahn quirks in this function and added the gate one line too low. The accompanying comment ("state-store lookup below is kept in sync under both engines") is also inaccurate for autobahn — the CometBFT state store is never updated under autobahn.
The PR's only behavioral change to validatorAtHeight is the utils.Option refactor (removing the env.ConsensusState == nil check and combining with the gigaRouter check). Pre-PR, gigaRouter().IsPresent() already skipped the in-memory lookup and the function fell through to the same broken LoadValidators path, so the underlying bug pre-dates this PR.
Impact
Cosmetic only — observability/monitoring. No consensus or correctness impact. External tooling that scrapes /status to verify "this node is participating in consensus" by checking VotingPower > 0 will incorrectly conclude the node is down. There is already a TODO at the top of Status (lines 19–32) tracking autobahn /status limitations; this is one more case in that bucket.
Suggested fix
Under autobahn, fetch validators from env.GenDoc.Validators — the same path the new Validators handler in consensus.go uses for the autobahn branch:
if env.gigaRouter().IsPresent() {
for _, gv := range env.GenDoc.Validators {
if bytes.Equal(gv.Address, privValAddress) {
return utils.Some(&types.Validator{
Address: gv.Address,
PubKey: gv.PubKey,
VotingPower: gv.Power,
})
}
}
return none
}Either fix here or extend the existing TODO at the top of Status to explicitly call out ValidatorInfo.VotingPower so it gets picked up alongside the other deferred autobahn /status work.
| for m.first < min(n, m.next) { | ||
| b := m.blocks[m.first] | ||
| delete(m.blocks, m.first) | ||
| m.first += 1 | ||
| for _, hash := range b.evmHashes { | ||
| delete(m.evmTxs, hash) | ||
| } | ||
| for addr, wantNonce := range b.evmNonces { | ||
| if wantNonce == m.evmNonces[addr] { | ||
| // Happy path: all account's txs got executed. | ||
| delete(m.evmNonces, addr) | ||
| } else if gotNonce := s.cfg.App.EvmNonce(addr); gotNonce < wantNonce { | ||
| // Some txs have not been executed - reset account tracking. | ||
| // NOTE: app execution is not synchronized with mempool, so nonce could have already | ||
| // proceeded past wantNonce and that is expected. | ||
| delete(m.evmNonces, addr) | ||
| delete(m.nextBlock.evmNonces, addr) | ||
| for _, x := range m.blocks { | ||
| delete(x.evmNonces, addr) | ||
| } | ||
| } |
There was a problem hiding this comment.
🟡 Resubmitting an identical EVM tx body after the pruneMempool nonce-reset path can leave an orphaned m.evmTxs[hash] mapping: m.evmTxs[hash] gets unconditionally deleted at line 101 when the original block is pruned, even though that same hash now points to the resubmitted tx still pending in a later block. The result is EvmTxByHash returning (nil, false) for a tx that is genuinely queued in the lane — a transient false-negative during the canonical retry pattern. Fix is to scope tx mappings per-block or guard the delete by identity (e.g. only delete when the stored tx matches the one being pruned).
Extended reasoning...
What the bug is
pruneMempool (lines 96-118) unconditionally calls delete(m.evmTxs, hash) for every hash in the pruned blocks evmHashes. insertTx (line 215) unconditionally writes m.evmTxs[resp.EVMHash] = tx. Combined with the reset path (lines 107-116), an identical resubmit can land the new tx-instance in m.evmTxs[H] while the old sealed block still carries H in its evmHashes — and the unconditional delete on that older block then wipes the mapping that now belongs to the still-pending resubmit.
Why the existing nonce check does not prevent it
Normally line 187 (nonce != resp.EVMNonce) blocks re-insertion of the same tx. But the reset path explicitly clears m.evmNonces[addr] (and the per-block evmNonces[addr] across all sealed blocks) when an accounts pending tx fails to bump the app nonce. After reset, insertTx reads the nonce from the app and accepts a fresh identical-bytes tx — the canonical "tx failed, retry it" pattern the author endorsed in inline-comment 3358604793.
Step-by-step proof
Setup: account A, app nonce N. Two txs already sequenced:
- B1 contains Tx1 (nonce N, hash H1); B1.evmNonces[A]=N+1
- B2 contains Tx2 (nonce N+1, hash H2); B2.evmNonces[A]=N+2
- m.evmTxs = {H1: Tx1, H2: Tx2}; m.evmNonces[A] = N+2
- B1 executes; Tx1 fails to bump the nonce (e.g. antehandler failure post-CheckTx). App nonce stays at N.
pruneMempoolprunes B1: deletes m.evmTxs[H1] (fine). For A: wantNonce=N+1, m.evmNonces[A]=N+2 — mismatch. gotNonce=N < N+1 → reset path fires: deletes m.evmNonces[A], m.nextBlock.evmNonces[A], and B2.evmNonces[A]. m.evmTxs[H2] still points at the original Tx2; B2.evmHashes still contains H2.- User resubmits Tx1 (identical bytes, hash H1). m.evmNonces[A] is absent → falls back to app nonce N → matches → accepted. m.evmNonces[A]=N+1, m.evmTxs[H1]=Tx1 (new instance).
- User resubmits Tx2 (identical bytes, hash H2). nonce N+1 matches m.evmNonces[A]=N+1 → accepted. m.evmNonces[A]=N+2. Line 215: m.evmTxs[H2] = Tx2 (overwrites the mapping that was still pointing at original Tx2 — same bytes, but now logically owned by the new block being built).
- The new block B3 seals containing the resubmits; B3.evmHashes includes H2.
- B2 (original) executes — Tx2 fails (app nonce still N, tx nonce N+1).
pruneMempoolprunes B2: iterates B2.evmHashes including H2, and runsdelete(m.evmTxs, H2)unconditionally on line 101 — this removes the mapping that now points to the resubmitted Tx2 still pending in B3. B2.evmNonces was cleared in step 2 so the nonce-reset loop is a no-op. EvmTxByHash(H2)now returns(nil, false)until B3 executes and is pruned, even though the tx is in the lane.
Impact and severity
Observability-only — eth_getTransactionByHash returns null for a brief window for a tx that IS being processed. No consensus or correctness impact. The trigger requires (a) a tx failing to bump the nonce during execution, (b) ≥2 pending txs from the same account across multiple blocks, (c) the user resubmitting the exact same signed bytes (the canonical retry pattern), and (d) the original later block being pruned before the resubmits block executes.
Addressing the refutation
The refuter is right that this is narrow and observability-only, and that the minimal-mempool design explicitly defers prioritization/replacement work to post-private-testnet. But the failing scenario IS the exact retry pattern the author endorsed ("user is free to send a different tx with the same nonce... user is responsible for retrying") — and the identical-bytes retry is the simplest form of that pattern. Filing as nit rather than normal: the impact is bounded, the trigger is rare, and the fix (per-block tx maps, or guarding the delete by identity) adds bookkeeping that may legitimately wait until after private testnet. Worth tracking, not worth blocking.
Suggested fix
Either:
- Store tx by
(blockNumber, hash)rather than justhash, and have prune only clear entries owned by the pruned block. Or - Compare the stored bytes before deleting:
if bytes.Equal(m.evmTxs[hash], <ownedTx>) { delete(m.evmTxs, hash) }. Since identical resubmits have identical bytes this still leaves the mapping correct.

Mempool constructed specifically for autobahn tx lifecycle: