fix(replication): handle inbound overload backpressure#156
Closed
mickvandijke wants to merge 4 commits into
Closed
fix(replication): handle inbound overload backpressure#156mickvandijke wants to merge 4 commits into
mickvandijke wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR adds explicit backpressure handling for inbound replication traffic so the P2P receive loop doesn’t execute expensive replication handlers inline under load, while also making downstream replication flows treat overload as a neutral deferral (with bounded budgets where overload claims are unauthenticated).
Changes:
- Added global + per-peer inbound request semaphores and bounded overload-response task spawning, plus a new
Overloadedreplication message for request types without typed rejection responses. - Updated neighbor sync, verification, fetch, audit, and prune-audit paths to classify overload responses as neutral deferrals, with per-peer overload-claim budgets for audit/fetch.
- Added bounded retry tracking for inconclusive verification rounds and enhanced bootstrap drain accounting to include overload-deferred initial sync retries.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/poc_d1_bounded_queues.rs | Updates test helpers to include the new inconclusive_rounds field. |
| src/replication/types.rs | Extends replication state types: VerificationEntry.inconclusive_rounds and BootstrapState.overload_deferred_sync_peers. |
| src/replication/scheduling.rs | Adds inconclusive tracking helpers, fetch-source introspection, and returns evicted keys from stale eviction. |
| src/replication/quorum.rs | Treats Overloaded and unexpected messages as unresolved evidence in verification rounds. |
| src/replication/pruning.rs | Treats prune-audit overload rejections as neutral (no failure) when the overload reason matches admission-control reasons. |
| src/replication/protocol.rs | Introduces ReplicationMessageBody::Overloaded(OverloadedNotice) with a roundtrip test. |
| src/replication/neighbor_sync.rs | Adds NeighborSyncAttempt (Completed/Failed/Overloaded) and prioritizes overload-deferred peers. |
| src/replication/mod.rs | Implements inbound admission control + overload replies, overload-claim tracking, fetch deferral retrying, and bootstrap drain integration points. |
| src/replication/config.rs | Adds MAX_INCONCLUSIVE_VERIFY_ROUNDS to bound repeated inconclusive verification retries. |
| src/replication/bootstrap.rs | Blocks bootstrap drain on overload-deferred sync retries and adds APIs to track/clear them. |
| src/replication/audit.rs | Adds overload-claim budgeting to the audit path via an overload tracker (engine reuses one across ticks). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| .await; | ||
| } | ||
|
|
||
| let pending_keys = { |
Comment on lines
350
to
352
| config: &ReplicationConfig, | ||
| is_bootstrapping: bool, | ||
| ) -> Option<NeighborSyncResponse> { |
Comment on lines
+132
to
+136
| /// logging parity with the typed overload rejections. | ||
| #[derive(Debug, Clone, Serialize, Deserialize)] | ||
| pub struct OverloadedNotice { | ||
| /// Human-readable overload reason (for logging only). | ||
| pub reason: String, |
Comment on lines
+119
to
+121
| /// Sent in place of a typed response when the receiver's inbound handler | ||
| /// capacity is exhausted and a request type has no natural rejection | ||
| /// response to carry the overload reason (currently `NeighborSyncRequest`). |
Comment on lines
+353
to
+358
| match sync_with_peer_with_outcome(peer, p2p_node, storage, paid_list, config, is_bootstrapping) | ||
| .await | ||
| .map(|outcome| outcome.response) | ||
| { | ||
| NeighborSyncAttempt::Completed(outcome) => Some(outcome.response), | ||
| NeighborSyncAttempt::Failed | NeighborSyncAttempt::Overloaded => None, | ||
| } |
Comment on lines
+118
to
+121
| fn is_inbound_replication_overloaded_reason(reason: &str) -> bool { | ||
| reason == INBOUND_REPLICATION_OVERLOADED_REASON | ||
| || reason == INBOUND_REPLICATION_PEER_OVERLOADED_REASON | ||
| } |
Collaborator
Author
|
superseded by #158 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Overloadedreplication message for request types without typed rejection responses, and map overload into neutral deferrals for neighbor sync, verification, prune audit, fetch, and audit flows.Details
NeighborSyncAttemptdistinguishes completed, failed, and overloaded sync attempts. Overloaded bootstrap syncs block drain until a priority retry completes, fails, or the peer leaves.Overloadedresponses as unresolved evidence, caps consecutive inconclusive rounds, and returns stale evictions to bootstrap bookkeeping.Validation
git diff --checkcargo fmt --all -- --checkcargo clippy --all-targets --all-features -- -D warningscargo test --lib --all-features replication::cargo test --test e2e --all-features replication::test_late_joiner_replicates_responsible_chunks -- --exactNote
A broader
cargo test --all-targets --all-features replication::run completed the library suite and 29/30 replication e2e tests but had one late-joiner convergence timeout after 90s. The exact failed e2e test passed on rerun in 71.36s.