fix(sdk): ban rate-limited node for Envoy-advertised reset window#3951
fix(sdk): ban rate-limited node for Envoy-advertised reset window#3951Claudius-Maginificent wants to merge 8 commits into
Conversation
…limits ResourceExhausted is a congestion/backpressure signal, not endpoint ill-health. Banning a rate-limited (but healthy) node relocates its load onto survivors and cascades to NoAvailableAddressesToRetry. Now RE is classified as rate-limited: the node is not banned, the retry rotates to a different node via explicit exclusion, and the existing bounded retry count bounds an over-limit client. Genuine-failure ban path (60s x e^n) is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
📝 WalkthroughWalkthroughAdds a new ChangesRate-limit-aware retry rotation
Sequence Diagram(s)sequenceDiagram
participant App
participant DapiClient
participant AddressList
participant TransportClient
participant Node
App->>DapiClient: execute(request)
DapiClient->>AddressList: get_live_address_excluding(tried)
AddressList-->>DapiClient: address_A
DapiClient->>TransportClient: create(address_A)
TransportClient->>Node: send request
Node-->>TransportClient: ResourceExhausted
TransportClient-->>DapiClient: TransportError::Grpc(ResourceExhausted)
DapiClient->>DapiClient: is_rate_limited() == true → skip ban, push address_A to tried
DapiClient->>DapiClient: compute delay with jitter, sleep
DapiClient->>AddressList: get_live_address_excluding([address_A])
AddressList-->>DapiClient: address_B
DapiClient->>TransportClient: create(address_B)
TransportClient->>Node: send request
Node-->>TransportClient: Ok(response)
TransportClient-->>DapiClient: Ok(response)
DapiClient-->>App: ExecutionResponse { inner, retries: 1 }
Estimated code review effort🎯 4 (Complex) | ⏱️ ~50 minutes Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
✅ Review complete (commit 2c76a14) |
There was a problem hiding this comment.
Pull request overview
This PR adjusts rs-dapi-client retry/failover behavior so gRPC ResourceExhausted (rate-limiting / backpressure) no longer triggers the exponential “health ban” logic, preventing ban-cascades under sustained rate-limits while keeping genuine node-failure banning unchanged.
Changes:
- Introduces a new
CanRetry::is_rate_limited()classification (defaultfalse), implemented for gRPCStatus::ResourceExhaustedand delegated throughTransportError,DapiClientError, andExecutionError. - Updates
update_address_ban_statusto not ban rate-limited addresses (log + rotate instead), preserving the existing ban ladder for genuine failures. - Adds address-rotation across retries via
AddressList::get_live_address_excluding(&[Address]), plus new unit/integration tests to assert both invariants and rotation behavior.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| packages/rs-dapi-client/tests/rate_limit_rotation.rs | New integration tests driving the real executor retry loop with a fake transport to validate “rotate, don’t ban” for ResourceExhausted. |
| packages/rs-dapi-client/src/transport/grpc.rs | Implements CanRetry::is_rate_limited() for gRPC Status (ResourceExhausted only). |
| packages/rs-dapi-client/src/transport.rs | Delegates is_rate_limited() for TransportError and adds unit tests pinning the “ResourceExhausted only” classification. |
| packages/rs-dapi-client/src/lib.rs | Extends the public CanRetry trait with is_rate_limited() (default false) and documents intended semantics. |
| packages/rs-dapi-client/src/executor.rs | Delegates is_rate_limited() through ExecutionError<E>. |
| packages/rs-dapi-client/src/dapi_client.rs | Skips banning for rate-limited errors and rotates retries away from already-tried addresses. Adds invariants-focused tests for ban ladder vs rate-limit behavior. |
| packages/rs-dapi-client/src/address_list.rs | Adds get_live_address_excluding and tests ensuring exclusion/rotation semantics and graceful fallback. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| let info = address_list.ban_info(); | ||
| let entry = info | ||
| .iter() | ||
| .find(|i| i.uri.contains("3000")) | ||
| .expect("address present in ban info"); |
There was a problem hiding this comment.
Same brittle test-lookup as the parallel thread — acknowledged, left as-is in this pass (test-only, no production impact). Leaving open.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Narrow, well-tested fix that adds an is_rate_limited axis to CanRetry, suppresses banning for gRPC ResourceExhausted, and rotates the next retry off the throttled node via get_live_address_excluding plus a per-execution tried set. The default trait impl preserves prior behavior and the genuine-failure ban ladder is pinned by tests. Only minor docs/test-robustness nits — nothing blocking.
🟡 1 suggestion(s) | 💬 4 nitpick(s)
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-dapi-client/src/lib.rs`:
- [SUGGESTION] packages/rs-dapi-client/src/lib.rs:107-109: Pin `is_rate_limited ⇒ can_retry` with a test or debug_assert
The fix is load-bearing on the implicit contract that any rate-limited error is also retryable: `update_address_ban_status` only consults `is_rate_limited()` from inside the `if error.can_retry()` branch. If a future change ever marks `ResourceExhausted` non-retryable (e.g. a tweak to `Status::can_retry`), rate-limit handling silently stops firing and the error falls through to the no-op tail — neither ban nor rotation happens. Lock the invariant with either a unit test asserting `is_rate_limited() ⇒ can_retry()` across the gRPC code matrix, or a `debug_assert!(!self.is_rate_limited() || self.can_retry())` on the hot path, so the two predicates can't drift unnoticed.
| /// Instead the retry should rotate to a *different* node and leave the | ||
| /// throttled one in the pool. See `update_address_ban_status`. |
There was a problem hiding this comment.
💬 Nitpick: Doc overstates the rotation guarantee
The doc states the retry "should rotate to a different node and leave the throttled one in the pool." The executor's documented fallback in dapi_client.rs:761 explicitly re-selects the throttled node when get_live_address_excluding(&tried) returns None (single-node pool, or every live node already in tried). Rotation is best-effort, not guaranteed. A one-line caveat keeps the trait contract honest, e.g. "... and prefer rotating to a different node when one is available."
source: ['claude']
There was a problem hiding this comment.
Partially addressed: M2 makes the retry genuinely rotate to a different live node when one exists (396865f). The doc wording at this site still does not spell out the single-node-pool fallback (where the sole live address is reused with backoff) — that doc caveat was intentionally not added in this pass. Flagging as a known doc nit; leaving open.
There was a problem hiding this comment.
Resolved in this update — Doc overstates the rotation guarantee no longer present.
Auto-resolved by the review system based on the latest commit diff. If you believe this was closed in error, reopen the thread.
| let entry = info | ||
| .iter() | ||
| .find(|i| i.uri.contains("3000")) | ||
| .expect("address present in ban info"); |
There was a problem hiding this comment.
💬 Nitpick: Match the ban-info test entry by exact URI
The invariant test locates the address under test with i.uri.contains("3000"). The current fixture only inserts one address, so this is correct today, but it ties the assertion to a magic substring inside mock_address(). A future fixture adding another URI containing 3000, or a rename of mock_address, would silently match the wrong entry. Match the exact URI from addr so the test is tied to the value it actually inserted.
| let entry = info | |
| .iter() | |
| .find(|i| i.uri.contains("3000")) | |
| .expect("address present in ban info"); | |
| let info = address_list.ban_info(); | |
| let addr_uri = addr.to_string(); | |
| let entry = info | |
| .iter() | |
| .find(|i| i.uri == addr_uri) | |
| .expect("address present in ban info"); |
source: ['codex']
There was a problem hiding this comment.
Acknowledged — the i.uri.contains("3000") lookup is brittle. It is test-only and unaffected by the production rate-limit changes, so it was left as-is in this pass. Leaving open.
There was a problem hiding this comment.
Resolved in 396865f — Match the ban-info test entry by exact URI no longer present.
Auto-resolved by the review system based on the latest commit diff. If you believe this was closed in error, reopen the thread.
| let Some(address) = self | ||
| .address_list | ||
| .get_live_address_excluding(&tried) | ||
| .or_else(|| self.address_list.get_live_address()) | ||
| else { |
There was a problem hiding this comment.
💬 Nitpick: Fallback re-picks the throttled node when exclusion empties the live pool
When get_live_address_excluding(&tried) returns None, the or_else(|| self.address_list.get_live_address()) fallback can hand back an address already present in tried — i.e. the just-rate-limited node — and tried.push(address.clone()) then appends a duplicate. With a fixed 10ms retry_delay, the remaining retry budget collapses onto the throttled node within tens of ms, which adds no value against a per-minute Envoy rate-limit. The behavior is intentionally documented and bounded by max_retries, so it is non-blocking. Two cheap improvements worth considering: (a) when the exclusion empties the pool, surface the raw ResourceExhausted immediately rather than spinning on the same node; or (b) switch tried to a HashSet<Address> so the contract "already attempted" stays exact (also makes the intent self-documenting).
source: ['claude']
There was a problem hiding this comment.
Addressed in 396865f/b8a8c329: the rate-limit retry path now uses a 3-tier selection that rotates off the just-throttled node whenever any alternative live address exists, plus capped exponential backoff (10 ms → 500 ms) with full jitter. tried is intentionally kept as a Vec — it is bounded by max_retries and the adversarial review agreed a HashSet is not warranted here. Leaving this thread open for maintainer confirmation.
There was a problem hiding this comment.
Resolved in 396865f — Fallback re-picks the throttled node when exclusion empties the live pool no longer present.
Auto-resolved by the review system based on the latest commit diff. If you believe this was closed in error, reopen the thread.
| let before = chrono::Utc::now(); | ||
| let result: ExecutionResult<i32, DapiClientError> = Err(ExecutionError { | ||
| inner: DapiClientError::Transport(TransportError::Grpc(status)), | ||
| retries: 0, | ||
| address: Some(addr.clone()), | ||
| }); | ||
| update_address_ban_status(&address_list, &result, &make_applied_settings(true)); | ||
| let after = chrono::Utc::now(); | ||
|
|
||
| let info = address_list.ban_info(); | ||
| let entry = info | ||
| .iter() | ||
| .find(|i| i.uri.contains("3000")) | ||
| .expect("address present in ban info"); | ||
|
|
||
| assert!(entry.banned, "genuine failure must ban the node"); | ||
| assert_eq!( | ||
| entry.ban_count, expected_count, | ||
| "ban_count must increment on each genuine failure" | ||
| ); | ||
|
|
||
| // Window must equal base × e^(ban_count - 1). Since | ||
| // banned_until = t_call + period with before <= t_call <= after: | ||
| // (banned_until - before) >= period and | ||
| // (banned_until - after) <= period. | ||
| let period = base_secs * ((expected_count - 1) as f64).exp(); | ||
| let banned_until = entry.banned_until.expect("banned_until is set"); | ||
| let lower = (banned_until - before).num_milliseconds() as f64 / 1000.0; | ||
| let upper = (banned_until - after).num_milliseconds() as f64 / 1000.0; | ||
| assert!( | ||
| lower >= period - 0.05, | ||
| "ban #{expected_count} window lower bound {lower}s < expected {period}s", | ||
| ); | ||
| assert!( | ||
| upper <= period + 0.05, | ||
| "ban #{expected_count} window upper bound {upper}s > expected {period}s", | ||
| ); | ||
| } |
There was a problem hiding this comment.
💬 Nitpick: Wall-clock tolerance on ban-window assertion is thin for CI
The invariant test snapshots chrono::Utc::now() before/after update_address_ban_status and asserts the ban window is within ±50 ms of base × e^(n-1). On a loaded shared CI runner a 50 ms scheduling stall between the before snapshot and the ban write is plausible and would flip the lower-bound check. The assertion itself is exact, but the wall-clock dependence is what flakes. Consider widening the tolerance (e.g. to a few hundred ms) or recomputing the period from a directly observable banned_at field if one is exposed. Not blocking — current headroom is empirically fine — but worth a follow-up if these ever flake.
source: ['claude']
There was a problem hiding this comment.
Left unchanged in this pass — the ±50 ms wall-clock tolerance is a known timing tradeoff in this test. Not modified by the rate-limit fix; leaving open for the maintainer's call.
There was a problem hiding this comment.
Resolved in 396865f — Wall-clock tolerance on ban-window assertion is thin for CI no longer present.
Auto-resolved by the review system based on the latest commit diff. If you believe this was closed in error, reopen the thread.
…limit retry, lock can_retry invariant Three review fixes for the rotate-instead-of-ban rate-limit handling. M1 (invariant lock): the rotate-don't-ban behavior depends on `is_rate_limited() => can_retry()`. Add a debug_assert! at the rotation decision in `execute` and an exhaustive property test (`test_rate_limit_implies_retryable_invariant`) covering every gRPC code across tonic::Status, TransportError, DapiClientError and ExecutionError, so a future widening of `is_rate_limited` (or narrowing of `can_retry`) fails loudly instead of silently killing rotation. M2 (rotate off the throttled node + backoff): on a rate-limited retry the selection no longer re-picks the just-throttled node in small pools — the fallback now excludes the just-tried address and only reuses it when it is the single live node. The rate-limit retry path also replaces the flat 10ms delay with capped exponential backoff (base 10ms, x2 per attempt, capped at 500ms) plus full jitter, so a congested fleet de-correlates instead of retrying in lockstep. Genuine-failure delays are unchanged. Server-side RetryInfo is left as a documented TODO (needs metadata plumbing). M3 (rs-sdk propagation): override `is_rate_limited` on the SDK `Error` to delegate to the wrapped DapiClientError, so the rotate-don't-ban semantics survive at the SDK layer independently of how `can_retry` is defined. Pure future-proofing (today the can_retry guard short-circuits first); locked with a regression test. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…iant assert Two self-introduced nits from the verify pass. 1. `rate_limit_backoff_window`'s `checked_shl` guards only the shift AMOUNT, not value bit-loss: a large `attempt` (e.g. 63) shifts the set bits out of `10 << attempt`, collapsing the window to 0ms — no backoff at all. Pathological (needs retries >= 64; default is 5) but it's a footgun the backoff math created. Clamp the shift amount at 16 (already saturates the 500ms cap, so nothing changes in the valid range) and extend the unit test with the attempt=63/64 boundary asserting the window stays at the cap. 2. Add the symmetric `debug_assert!(!is_rate_limited() || can_retry())` to the transport-client-creation error block for parity with the transport-error block. Defense-in-depth; the property test already covers the invariant. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Review fixes (adversarial review pass)Addressed reviewer findings (CodeRabbit / thepastaclaw / internal review). All MEDIUM; two LOW left open by design. M1 —
M2 — rate-limit retry hardening
M3 — rs-sdk rate-limit classification propagated
Left open (LOW, by design): ban-evasion blind spot for a node that always returns Commits: 🤖 Co-authored by Claudius the Magnificent AI Agent |
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Cumulative review at b8a8c32. The latest delta hardens the rate-limit handling: capped exponential backoff with full jitter, three-tier address selection (excluding tried → just-tried → fallback), debug_assert! invariants on both retry arms, and a property test pinning is_rate_limited() ⇒ can_retry() across all CanRetry implementors. Prior findings #1 and #4 are resolved; #2, #3, #5 are still valid nitpicks. One new latest-delta finding: the new AddressStatus ban-ladder test inherits the same thin 50ms wall-clock tolerance. No blocking issues.
💬 3 nitpick(s)
1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.
| /// Rate-limit errors are retryable (see [`CanRetry::can_retry`]) but, | ||
| /// unlike genuine node ill-health, the node must NOT be banned: it is | ||
| /// healthy, just throttled. Banning it would shift its load onto the | ||
| /// remaining nodes and cascade into `NoAvailableAddressesToRetry`. | ||
| /// Instead the retry should rotate to a *different* node and leave the | ||
| /// throttled one in the pool. See `update_address_ban_status`. |
There was a problem hiding this comment.
💬 Nitpick: CanRetry::is_rate_limited doc asserts unconditional rotation
The doc says Instead the retry should rotate to a *different* node and leave the throttled one in the pool. The executor at dapi_client.rs:980-990 has a three-tier fallback whose final tier (get_live_address()) intentionally returns the just-tried address when no alternative is live (single-node pool, or every live node already in tried). Rotation is best-effort, not guaranteed. Soften the contract so future maintainers don't read it as a hard invariant.
| /// Rate-limit errors are retryable (see [`CanRetry::can_retry`]) but, | |
| /// unlike genuine node ill-health, the node must NOT be banned: it is | |
| /// healthy, just throttled. Banning it would shift its load onto the | |
| /// remaining nodes and cascade into `NoAvailableAddressesToRetry`. | |
| /// Instead the retry should rotate to a *different* node and leave the | |
| /// throttled one in the pool. See `update_address_ban_status`. | |
| /// Rate-limit errors are retryable (see [`CanRetry::can_retry`]) but, | |
| /// unlike genuine node ill-health, the node must NOT be banned: it is | |
| /// healthy, just throttled. Banning it would shift its load onto the | |
| /// remaining nodes and cascade into `NoAvailableAddressesToRetry`. | |
| /// Instead the retry path should prefer a different live node while leaving | |
| /// the throttled one in the pool; if no alternative exists, bounded retries | |
| /// may reuse it after backoff. See `update_address_ban_status`. |
source: ['claude', 'codex']
There was a problem hiding this comment.
Resolved in 8a41c92 — CanRetry::is_rate_limited doc asserts unconditional rotation no longer present.
Auto-resolved by the review system based on the latest commit diff. If you believe this was closed in error, reopen the thread.
| let entry = info | ||
| .iter() | ||
| .find(|i| i.uri.contains("3000")) | ||
| .expect("address present in ban info"); |
There was a problem hiding this comment.
💬 Nitpick: Ban-info invariant test matches by URI substring instead of the inserted address
info.iter().find(|i| i.uri.contains("3000")).expect("address present in ban info") ties the assertion to a magic substring inside mock_address(). The test owns the addr it inserted; matching by exact URI is both safer and self-documenting. A future fixture adding another URI that happens to contain 3000, or a rename of mock_address, would silently match the wrong entry.
| let entry = info | |
| .iter() | |
| .find(|i| i.uri.contains("3000")) | |
| .expect("address present in ban info"); | |
| let entry = info | |
| .iter() | |
| .find(|i| i.uri == addr.to_string()) | |
| .expect("address present in ban info"); |
source: ['claude', 'codex']
There was a problem hiding this comment.
Resolved in 8a41c92 — Ban-info invariant test matches by URI substring instead of the inserted address no longer present.
Auto-resolved by the review system based on the latest commit diff. If you believe this was closed in error, reopen the thread.
| match execution_result { | ||
| Ok(response) => return Ok(response), | ||
| Err(error) => { | ||
| // Invariant lock (load-bearing): a rate-limited error MUST | ||
| // be retryable, otherwise the rotation below never fires | ||
| // and the rotate-don't-ban behavior silently dies. See the | ||
| // `CanRetry` docs and the `test_rate_limit_implies_retryable_invariant` | ||
| // property test. | ||
| debug_assert!( | ||
| !error.is_rate_limited() || error.can_retry(), | ||
| "is_rate_limited() must imply can_retry(); rate-limit rotation depends on it" | ||
| ); |
There was a problem hiding this comment.
💬 Nitpick: debug_assert! placement is inconsistent between the two retry arms
The transport-create-error arm (line 1049) places the debug_assert!(!cloned_error.is_rate_limited() || cloned_error.can_retry(), ...) before update_address_ban_status, while the transport-call-error arm (line 1133) places the equivalent assert after update_address_ban_status. Functionally equivalent (the assert reads nothing the ban-status update mutates), but the two arms otherwise mirror each other. Aligning placement removes a small future-maintainer puzzle about whether the ordering was intentional.
source: ['claude']
There was a problem hiding this comment.
Resolved in 8a41c92 — debug_assert! placement is inconsistent between the two retry arms no longer present.
Auto-resolved by the review system based on the latest commit diff. If you believe this was closed in error, reopen the thread.
|
Opened #3956 against this branch to address the Copilot note by matching the ban-info entry on the exact URI instead of a substring search. |
…an-for-duration Replace the "rotate-don't-ban" approach with "ban for server-dictated window": - Add `rate_limit.rs`: decode `google.rpc.RetryInfo` from gRPC status details via minimal prost structs; `DAPI_RATE_LIMIT_BAN_MS` env-var fallback (default 60 s) when no RetryInfo is present; WASM-compatible guard on env-var read. - Add `AddressStatus::ban_for_duration()` + `AddressList::ban_for_duration()`: set `banned_until` without touching `ban_count` so the health-ban exponential ladder is never disturbed by rate-limit events. - Add `CanRetry::rate_limit_ban_duration()` (default `None`) and wire it through `tonic::Status`, `TransportError`, `ExecutionError`, `DapiClientError` and `dash_sdk::Error`. - Update `update_address_ban_status`: `ResourceExhausted` now calls `ban_for_duration(server_hint.unwrap_or(fallback))` instead of skipping. - Remove rotation machinery: `tried: Vec<Address>`, 3-tier `get_live_address_excluding` selection, jitter backoff (`retry_delay`, `rate_limit_backoff_window`, `RATE_LIMIT_MAX_DELAY_MS`), and rand import. The retry loop now always calls `get_live_address()` — rate-limited nodes are naturally absent from the live pool until their ban window expires. - Fix `AddressBanInfo.banned`: now consistent with `get_live_address` filtering (checks only `banned_until`, not `ban_count > 0`), so rate-limit bans are visible in diagnostics. - Delete `tests/rate_limit_rotation.rs`; update/add invariant tests. All 419 unit/integration/doc tests pass. Zero clippy warnings. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…r; drop hand-rolled rotation Replace prost/RetryInfo-based approach with direct `ratelimit-reset` HTTP header parsing. No hand-rolled rotation machinery — banning a node already achieves rotation because `get_live_address()` skips banned nodes. Changes: - Delete `rate_limit.rs` (prost/RetryInfo decoder) and remove prost dep - Replace `is_rate_limited() -> bool` on `CanRetry` trait with `rate_limit_ban_duration() -> Option<Duration>` (default `None`) - `tonic::Status` impl: if `ResourceExhausted` + `ratelimit-reset` header is a parseable u64 > 0, return `Some(clamped [1s, 600s])`; else `None` - Add `AddressStatus::ban_for(period, reason)` + `AddressList::ban_for` that sets `banned_until = now + period`, `ban_count = max(ban_count, 1)`; does NOT inflate the exponential health-ban ladder - `update_address_ban_status`: dispatch to `ban_for` on `Some(period)`, fall back to existing `ban_with_reason` on `None` - Revert execute loop to base: flat 10ms retry delay, plain `get_live_address()` - Revert `rs-sdk/src/error.rs` to base (banning stays inside rs-dapi-client) - Add `LIMIT_RESPONSE_HEADERS_ENABLED=true` to dashmate rate-limiter compose - Replace deleted `tests/rate_limit_rotation.rs` with `tests/rate_limit_ban.rs` (5 focused tests for the header-parse path) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… fixes QA-001: Correct `AddressStatus::ban_for` / `AddressList::ban_for` docstrings — the "no ladder inflation" claim was false for fresh nodes (ban_count 0 → 1 raises the next genuine health ban from ~60 s to ~163 s). Docs now describe the floor/side-effect accurately. New test pins the 0→1 ladder-floor behaviour. QA-002: Fix `MIN_RATE_LIMIT_BAN_SECS` comment — the `> 0` filter before the clamp already rejects 0, so MIN=1 is a documentation/intent marker, not an active lower clamp. QA-003: De-tautologize "bad/zero/missing header → ladder" tests. Previous assertions (ban_count == 1) passed for BOTH ban_for and ban_with_reason. Now also assert the banned_until window is ~60 s (first ladder rung), which would fail if a bad header were wrongly routed to ban_for. QA-004: Add clamp-edge coverage: 1 → 1 s, 600 → 600 s, 601 → 600 s (was only 45 and 9999). QA-005: Add test that a ban_for'd address re-enters rotation via get_live_address after its window expires (zero-duration re-ban simulates an expired window). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
@thepastaclaw review most recent commit, |
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
New latest-delta verification found one in-scope blocking state-management bug: the new exact-window rate-limit ban can overwrite a longer active ban. Two non-blocking latest-delta issues remain around WASM/browser header visibility and misleading logging, and the carried-forward wall-clock tolerance test flake is still present. No CodeRabbit inline findings or comment IDs were supplied.
🔴 1 blocking | 🟡 2 suggestion(s)
1 additional finding(s) omitted (not in diff).
1 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/rs-dapi-client/src/address_list.rs`:
- [BLOCKING] packages/rs-dapi-client/src/address_list.rs:116-119: Rate-limit ban can shorten an active longer ban
`ban_for` always replaces `banned_until` with `now + period`. Because request completions can arrive out of order for the same shared `AddressList`, a later `ResourceExhausted` response with a short `ratelimit-reset` value can shorten a longer health-ban or longer prior rate-limit ban that was already applied. That reintroduces the address before the existing ban window expires, so genuine failure bans are no longer preserved under overlapping requests.
In `packages/dashmate/templates/platform/gateway/envoy.yaml.dot`:
- [SUGGESTION] packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201: Expose rate-limit reset headers to browser clients
The PR enables the rate limiter to emit `RateLimit-Reset`, and the Rust client parses it from tonic metadata, but browser/WASM clients go through fetch/CORS. Since Envoy's CORS `expose_headers` list does not include `ratelimit-reset` or the other `ratelimit-*` headers, browser clients cannot read the advertised reset window and will fall back to the exponential health-ban path even when Envoy sent the exact duration.
In `packages/rs-dapi-client/src/dapi_client.rs`:
- [SUGGESTION] packages/rs-dapi-client/src/dapi_client.rs:213-221: Rate-limit ban log is emitted before the ban succeeds
The `Some(period)` branch logs that it is banning the address before calling `address_list.ban_for`. If the address was removed by an SML update, `ban_for` returns `false` and the code then logs that it was unable to ban the same address. Move the rate-limit-specific debug log behind the `banned` check so logs describe the state transition that actually happened.
| pub fn ban_for(&mut self, period: Duration, reason: Option<String>) { | ||
| self.banned_until = Some(chrono::Utc::now() + period); | ||
| self.ban_count = self.ban_count.max(1); | ||
| self.ban_reason = reason; |
There was a problem hiding this comment.
🔴 Blocking: Rate-limit ban can shorten an active longer ban
ban_for always replaces banned_until with now + period. Because request completions can arrive out of order for the same shared AddressList, a later ResourceExhausted response with a short ratelimit-reset value can shorten a longer health-ban or longer prior rate-limit ban that was already applied. That reintroduces the address before the existing ban window expires, so genuine failure bans are no longer preserved under overlapping requests.
| pub fn ban_for(&mut self, period: Duration, reason: Option<String>) { | |
| self.banned_until = Some(chrono::Utc::now() + period); | |
| self.ban_count = self.ban_count.max(1); | |
| self.ban_reason = reason; | |
| pub fn ban_for(&mut self, period: Duration, reason: Option<String>) { | |
| let advertised_until = chrono::Utc::now() + period; | |
| if self | |
| .banned_until | |
| .map(|current_until| current_until < advertised_until) | |
| .unwrap_or(true) | |
| { | |
| self.banned_until = Some(advertised_until); | |
| self.ban_reason = reason; | |
| } | |
| self.ban_count = self.ban_count.max(1); | |
| } |
source: ['codex']
There was a problem hiding this comment.
Resolved in this update — Rate-limit ban can shorten an active longer ban no longer present.
Auto-resolved by the review system based on the latest commit diff. If you believe this was closed in error, reopen the thread.
| Some(period) => { | ||
| tracing::debug!( | ||
| ?address, | ||
| ban_secs = period.as_secs(), | ||
| "rate-limited (ResourceExhausted): banning {address} \ | ||
| for {}s (from RateLimit-Reset header)", | ||
| period.as_secs() | ||
| ); | ||
| address_list.ban_for(address, period, reason) |
There was a problem hiding this comment.
🟡 Suggestion: Rate-limit ban log is emitted before the ban succeeds
The Some(period) branch logs that it is banning the address before calling address_list.ban_for. If the address was removed by an SML update, ban_for returns false and the code then logs that it was unable to ban the same address. Move the rate-limit-specific debug log behind the banned check so logs describe the state transition that actually happened.
source: ['claude']
There was a problem hiding this comment.
Resolved in this update — Rate-limit ban log is emitted before the ban succeeds no longer present.
Auto-resolved by the review system based on the latest commit diff. If you believe this was closed in error, reopen the thread.
| # Emit RateLimit-Limit / RateLimit-Remaining / RateLimit-Reset response | ||
| # headers so rs-dapi-client can read the exact reset window and ban the | ||
| # node for that duration instead of the exponential health-ban ladder. | ||
| - LIMIT_RESPONSE_HEADERS_ENABLED=true |
There was a problem hiding this comment.
I feel like I suggested the wrong direction. I mean, the ban system is correct, but exposing headers that are disabled by default and enabling them via env is risky. Could you please investigate what consequences and why it's disabled by default?
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review
Carried-forward prior findings reconcile cleanly against HEAD 2c76a14: ban_for still unconditionally overwrites banned_until (blocking regression introduced by this PR), Envoy's CORS expose_headers still omits the ratelimit-* family (suggestion — browser/WASM clients silently fall back to the ladder), and the rate-limit path still emits a success-shaped debug log before calling ban_for (nitpick). No new latest-delta findings beyond what the agents and prior review already flagged. Quality-agent suggestions about ban_count semantics, constant placement, and timing tolerances are out-of-scope architectural/test-stability commentary and were dropped.
🟡 1 suggestion(s)
1 additional finding(s) omitted (not in diff).
2 carried-forward finding(s) already raised on this PR; not re-posting as new inline comments.
🤖 Prompt for all review comments with AI agents
These findings are from an automated code review. Verify each finding against the current code and only fix it if needed.
In `packages/dashmate/templates/platform/gateway/envoy.yaml.dot`:
- [SUGGESTION] packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201: Expose `ratelimit-*` headers via CORS so browser/WASM clients can read them
This PR teaches `rs-dapi-client` to read Envoy's `ratelimit-reset` header and pin an exact ban window from it. The WASM transport (`tonic_web_wasm_client`, used by `wasm-sdk`) reaches the gateway through browser `fetch`, which only surfaces non-safelisted response headers that the server enumerates in `Access-Control-Expose-Headers`. The current `expose_headers` list (line 201) does not include `ratelimit-limit`, `ratelimit-remaining`, or `ratelimit-reset`, so browser clients receive `ResourceExhausted` but cannot see the reset window and fall back to the exponential health-ban ladder — defeating the PR's stated improvement for the WASM SDK path. Native Rust clients are unaffected; the fix is one line on the gateway template.
thepastaclaw
left a comment
There was a problem hiding this comment.
Code Review correction
The previous exact-SHA automated review at 2c76a14f was posted as COMMENT, but the verifier output still contains a blocking carried-forward finding. Requesting changes so the PR state matches the verified result.
🔴 1 blocking | 🟡 1 suggestion | ⚪ 1 nitpick
Blocking: packages/rs-dapi-client/src/address_list.rs:116-120 — ban_for still unconditionally writes banned_until = Some(now + period), so a later short ratelimit-reset response can shorten an existing longer health/rate-limit ban and return the node to rotation early. Preserve the longer existing/advertised deadline.
Suggestion: packages/dashmate/templates/platform/gateway/envoy.yaml.dot:201 — expose ratelimit-limit, ratelimit-remaining, and ratelimit-reset in CORS so browser/WASM clients can read the reset window instead of falling back to the exponential ladder.
Nitpick: packages/rs-dapi-client/src/dapi_client.rs:213-225 — the rate-limit path logs a success-shaped ban message before ban_for returns, producing contradictory logs when the address has been removed.
Run artifacts: /Users/claw/.openclaw/workspace/worktrees/review-platform-3951/.review/run-1782229049-2c76a14f.
Why this PR exists
rs-dapi-clienttreats a gRPCResourceExhausted(per-IP rate-limit / backpressure) the same as a node-down failure — it applies the60s × e^ban_counthealth ban to the address. Banning a healthy-but-throttled node doesn't shed load, it relocates it onto the remaining nodes.ResourceExhausted→ banned 60s → its traffic shifts to B/C → they cross the limit → banned → … →NoAvailableAddressesToRetry, with zero server faults.v3.1-dev.What was done?
A single rate-limit ban mechanism, driven by the duration Envoy already advertises:
CanRetry::rate_limit_ban_duration(&self) -> Option<Duration>(defaultNone), implemented fortonic::Status(packages/rs-dapi-client/src/transport/grpc.rs): returnsNoneunless the code isResourceExhausted; otherwise parses theratelimit-resetresponse-metadata header (whole seconds), filters out0/non-numeric, and clamps to[1s, 600s]. Delegated unchanged throughTransportError → DapiClientError / ExecutionError.update_address_ban_statusdispatch (dapi_client.rs):Some(period)→AddressList::ban_for(address, period);None→ the existingban_with_reasonexponential health-ban ladder (the fallback is the normal ladder, not a hardcoded default).AddressList::ban_for/AddressStatus::ban_for(address_list.rs): setsbanned_until = now + periodand raisesban_countto a floor of1(sois_banned()stays consistent withbanned_until). The rate-limit path is flat — it never inflates the exponential ladder. Reinstatement is the existingbanned_until-expiry path.packages/dashmate/docker-compose.rate_limiter.yml):LIMIT_RESPONSE_HEADERS_ENABLED=trueon the Lyft RLS container, so Envoy emitsRateLimit-Reset(surfaced to the client asratelimit-resetgRPC metadata). This is the operator-tunable knob requested in review.Net effect: a throttled node is banned for exactly the window the server says it needs — no more, no less — instead of a fixed exponential health ban; genuine node ill-health still bans exactly as before.
How Has This Been Tested?
cargo test -p rs-dapi-client: 119 pass (106 unit + 5rate_limit_banintegration + 3 failover + 5 doc-tests), 0 failures.cargo clippy -p rs-dapi-client -p dash-sdk -- -D warningsclean;cargo fmtclean.ban_for; clamp edges1→1s/600→600s/601→600s;0/ garbage / empty / missing header → ladder fallback (assertions boundbanned_untilto the≈60sfirst ladder rung, so they fail if a bad header is wrongly routed toban_for); full delegation chain; theban_forladder-floor side-effect (fresh nodeban_count 0→1); and a banned address re-entering rotation after its window expires.RateLimit-Reset; the dashmate change above enables it. Where the header is absent, behaviour falls back to the unchanged exponential ban ladder.Breaking Changes
None. Adds a
CanRetrymethod with a default implementation; the genuine-failure ban path is unchanged.Checklist:
For repository code-owners and collaborators only
Attribution
🤖 Co-authored by Claudius the Magnificent AI Agent