Wait for soroban ledger ingestion at chain head instead of failing fatally on -32600#167
ahmdssi wants to merge 1 commit into
Problem
When an indexer is fully caught up and follows the chain head, `@subql/node-stellar` can enter a fatal crash-loop.

We observed this in production on Stellar mainnet behind a hosted RPC provider (QuickNode): the pod crashed and restarted every ~8 minutes for hours. Across 7 consecutive crashes, the fatal block was always exactly `range max + 1`.

Root cause
The target height comes from Horizon (`getFinalizedBlockHeight` → `ledgers().order('desc')`), while soroban events are fetched from a separate soroban endpoint (`sorobanClient.getEvents({ startLedger })`). stellar-rpc rejects `getEvents` with JSON-RPC `-32600` when `startLedger` is greater than the last ledger ingested by the serving backend (`get_events.go`), not the network head. With hosted, load-balanced RPCs, the backend serving `getEvents` can lag the Horizon target by a few ledgers for tens of seconds.

`StellarApi` does not recognize this error (only the legacy `'start is after newest ledger'` / `'start is before oldest ledger'` messages), so it propagates as an ordinary fetch error: node-core retries 5 times (~20s total) and then terminates the process. The ledger exists and becomes available seconds later: it is a transient condition, not an invalid request.

Note: the stellar-rpc maintainers acknowledge this case as distinct in the getEvents v2 proposal (`ledger_future`, stellar-rpc#593), but v1 conflates it into `-32600`.

Fix
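As a minimal sketch of the distinction this change depends on, the following TypeScript sorts a `getEvents` failure into "not yet ingested", "out of retention", or "other". The names `classifyGetEventsError`, `RpcErrorLike` and the result labels are hypothetical and not taken from the PR; only the `-32600` code and the two legacy message strings come from the description. Since the exact wording of the current range message is not quoted here, the sketch handles every `-32600` through the structural comparison against the endpoint's latest ingested ledger.

```typescript
// Hypothetical sketch: names and shapes are illustrative, not the PR's code.
type EventsErrorKind = "not-yet-ingested" | "out-of-retention" | "other";

interface RpcErrorLike {
  code?: number;
  message?: string;
}

// JSON-RPC 2.0 "Invalid Request" code, which v1 getEvents reuses for range errors.
const INVALID_REQUEST = -32600;

function classifyGetEventsError(
  err: RpcErrorLike,
  requestedSequence: number,
  latestIngestedSequence: number, // assumed to come from the endpoint's getLatestLedger()
): EventsErrorKind {
  const message = err.message ?? "";

  // Legacy wordings are matched by substring.
  if (message.includes("start is before oldest ledger")) return "out-of-retention";
  if (message.includes("start is after newest ledger")) return "not-yet-ingested";

  // Structural fallback: a -32600 for a ledger beyond what the serving
  // backend has ingested is treated as transient, whatever the wording.
  if (err.code === INVALID_REQUEST) {
    return requestedSequence > latestIngestedSequence ? "not-yet-ingested" : "other";
  }
  return "other";
}
```

Only the "not-yet-ingested" result would justify waiting; the other two should surface immediately, which matches the behaviour described for the out-of-retention case.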
`fetchAndWrapLedger` now calls `getEventsWhenIngested(sequence)`, which waits for the soroban endpoint to ingest the ledger instead of failing the fetch:

- Recognizes the `-32600` range message (and the legacy `'start is after newest ledger'` message), with a structural fallback for unrecognized `-32600` wordings: compare the requested sequence against the endpoint's own `getLatestLedger()`.
- Bounds the wait with the `sorobanIngestWaitSeconds` endpoint config (default 600s, documented to stay below `--timeout`), then rethrows, preserving the previous fail-fast behaviour as a backstop for genuine outages.
- `'start is before oldest ledger'` (out of retention) keeps its immediate explanatory error.

Mapping the error to `BlockUnavailableError` was deliberately avoided: the dispatcher would permanently skip the block (Near/Solana semantics) and silently drop that ledger's events.

Testing
- `api.stellar.spec.ts` (no network required, `delay` mocked): recovery after transient failures, immediate rethrow below the retention window, both legacy messages, the `getLatestLedger` fallback (both directions), deadline exhaustion, and end-to-end wiring through `fetchAndWrapLedger`.
- `@subql/node-stellar@6.2.0` dist, stopped the crash-loop scenario described above.
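
The bounded wait and the mocked-`delay` testing approach described above can be illustrated with a small, self-contained sketch. It is not the PR's implementation: `fetchWhenIngested`, `WaitOptions`, `pollIntervalMs` and the injected `now` clock are hypothetical, and `maxWaitSeconds` merely plays the role of `sorobanIngestWaitSeconds`. Injecting `delay` and `now` is one way a test could exercise recovery and deadline exhaustion without real waiting or network access.

```typescript
// Hypothetical sketch of a bounded "wait until ingested" loop.
type Delay = (ms: number) => Promise<void>;

interface WaitOptions {
  maxWaitSeconds: number; // stands in for sorobanIngestWaitSeconds
  pollIntervalMs: number; // assumed polling interval, not from the PR
  delay: Delay;           // injectable so tests can mock it
  now: () => number;      // injectable clock, in milliseconds
}

async function fetchWhenIngested<T>(
  fetchOnce: () => Promise<T>,
  isNotYetIngested: (err: unknown) => boolean | Promise<boolean>,
  opts: WaitOptions,
): Promise<T> {
  const deadline = opts.now() + opts.maxWaitSeconds * 1000;
  for (;;) {
    try {
      return await fetchOnce();
    } catch (err) {
      // Anything that is not the transient case fails fast, as before.
      const transient = await isNotYetIngested(err);
      // Past the deadline the original error is rethrown as a backstop.
      if (!transient || opts.now() >= deadline) throw err;
      await opts.delay(opts.pollIntervalMs);
    }
  }
}
```

With a fake clock that advances only inside `delay`, a test can assert that two transient failures followed by a success resolve on the third attempt, and that a permanently lagging endpoint rethrows once the deadline is reached.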