Skip to content

fix(statesync): resolve internal-RPC witnesses for label peers#384

Merged
bdchatham merged 6 commits into
mainfrom
fix/statesync-internal-rpc-witnesses
Jun 3, 2026
Merged

fix(statesync): resolve internal-RPC witnesses for label peers#384
bdchatham merged 6 commits into
mainfrom
fix/statesync-internal-rpc-witnesses

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

Problem

State-sync light-client witnesses were derived by the sidecar from persistent_peers. For networking.tcp peers, peerAddress returns Spec.ExternalAddress — the external P2P NLB hostname, which exposes 26656 only and has no RPC listener. The sidecar's extractRPCHosts stripped the port and appended :26657, producing a witness on a host that never answers /status. seid exited on no witnesses connected and crashlooped. This is the regression that took arctic-1 node-19 down during the BYO-key migration (manually patched in prod; this codifies the fix).

Fix (controller half of the A+B fix)

Resolve a parallel witness list from the same label-selected peers, using each peer's in-cluster headless RPC DNS and never the external address:

  • New peerRPCAddress(peer)<peer>-0.<peer>.<ns>.svc.cluster.local:26657 (seiconfig.PortRPC). Unlike peerAddress it never consults Spec.ExternalAddress — RPC is internal-only.
  • reconcilePeers/resolveLabelPeers collect witnesses alongside ResolvedPeers and store them in a new Status.ResolvedRPCWitnesses.
  • configureStateSyncTask passes them to ConfigureStateSyncTask.RpcServers (requires seictl v0.0.55, bumped here).

Witnesses are deterministic from peer identity (no node_id, no sidecar call), so every matched peer yields a witness regardless of sidecar reachability — strictly more robust than peer resolution. Empty (EC2/static-peer nodes) leaves the sidecar to derive witnesses from persistent_peers as before.

This pairs with seictl #197 (merged, v0.0.55): the sidecar uses explicit RpcServers verbatim when provided and /status-probes each, dropping unreachable ones.

Changes

  • internal/controller/node/peers.gopeerRPCAddress + witness collection
  • api/v1alpha1/seinode_types.goStatus.ResolvedRPCWitnesses (+ regenerated deepcopy + CRD)
  • internal/planner/planner.go — pass witnesses into the configure-state-sync task
  • go.mod — seictl v0.0.50 → v0.0.55

Tests

  • TestReconcilePeers_PrefersExternalAddress — extended: external-P2P peer still yields internal RPC DNS witness (the regression guard)
  • TestReconcilePeers_WitnessesExcludeSelfAndUseRPCPort
  • TestConfigureStateSyncTask_PassesResolvedWitnesses / _NoWitnessesLeavesEmpty

🤖 Generated with Claude Code

State-sync light-client witnesses were derived by the sidecar from
persistent_peers. For networking.tcp peers those carry the external P2P
NLB hostname (Spec.ExternalAddress), which serves P2P only — no RPC
listener — so seid exited on "no witnesses connected" and crashlooped.

Resolve a parallel witness list from the same label-selected peers using
each peer's in-cluster headless RPC DNS (<peer>-0.<peer>.<ns>.svc.
cluster.local:26657, never ExternalAddress), store it in
Status.ResolvedRPCWitnesses, and pass it to ConfigureStateSyncTask.
RpcServers (seictl v0.0.55). Witnesses are deterministic from peer
identity, so every matched peer yields one regardless of sidecar
node_id reachability. Empty (EC2/static peers) leaves the sidecar to
derive witnesses from peers as before.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@cursor
Copy link
Copy Markdown

cursor Bot commented Jun 3, 2026

PR Summary

Medium Risk
Touches node bootstrap/state-sync wiring and a CRD status field; misconfigured witnesses could still block sync, but the change is scoped and covered by regression tests.

Overview
Fixes state-sync crashloops when label peers use external P2P addresses: witnesses were effectively derived from persistent_peers, so the sidecar could point the light client at an NLB host that only serves 26656, not RPC.

The controller now maintains status.resolvedRPCWitnesses alongside resolvedPeers—in-cluster headless DNS on the RPC port via new peerRPCAddress, never spec.externalAddress. Label peer reconciliation still needs sidecar/node_id for P2P entries, but witnesses are emitted for every matched peer even when P2P resolution is skipped (e.g. nil sidecar factory).

configureStateSyncTask passes those endpoints to the sidecar as RpcServers (with seictl v0.0.55); when the list is empty, behavior falls back to deriving witnesses from persistent_peers. CRD/deepcopy and tests cover the external-P2P vs internal-RPC split.

Reviewed by Cursor Bugbot for commit bb65d7a. Bugbot is set up for automated code reviews on this repo. Configure here.

bdchatham and others added 3 commits June 3, 2026 18:12
A peer skipped from persistent_peers (node_id unresolvable) still yields
an RPC witness — the witness needs no node_id. Asserts the intentional
divergence so it isn't "symmetrized" later.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Drop the planner call-site comment (the field name + ResolvedRPCWitnesses
doc already convey it) and tighten the status field doc (grammar).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread internal/planner/planner.go Outdated
return discoverPeersTask(node)
case TaskConfigureStateSync:
return configureStateSyncTask(snap)
return configureStateSyncTask(node, snap)
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you pass in the node you could just source the snapshot source from it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call — done in a6612e6. configureStateSyncTask(node) now derives snap := node.Spec.SnapshotSource() internally instead of taking the redundant arg. Verified equivalent for every node type: full/validator/replay callers already passed exactly SnapshotSource(), and archive uses Spec.Archive (not in the switch → nil), matching the explicit nil archive.go passed.

bdchatham and others added 2 commits June 3, 2026 18:51
Per review: configureStateSyncTask already takes node, so derive the
snapshot source via node.Spec.SnapshotSource() rather than threading a
redundant snap arg. Equivalent for all node types (archive uses
Spec.Archive, not in the SnapshotSource switch → nil, matching the
explicit nil archive.go passed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
verify-generated drift: the ResolvedRPCWitnesses doc comment was trimmed
without re-running make manifests. Pure description text sync.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham merged commit 0ecf85e into main Jun 3, 2026
5 checks passed
@bdchatham bdchatham deleted the fix/statesync-internal-rpc-witnesses branch June 3, 2026 17:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant