fix(statesync): resolve internal-RPC witnesses for label peers #384
State-sync light-client witnesses were derived by the sidecar from persistent_peers. For networking.tcp peers those carry the external P2P NLB hostname (Spec.ExternalAddress), which serves P2P only (no RPC listener), so seid exited on "no witnesses connected" and crashlooped. Resolve a parallel witness list from the same label-selected peers using each peer's in-cluster headless RPC DNS (<peer>-0.<peer>.<ns>.svc.cluster.local:26657, never ExternalAddress), store it in Status.ResolvedRPCWitnesses, and pass it to ConfigureStateSyncTask.RpcServers (seictl v0.0.55). Witnesses are deterministic from peer identity, so every matched peer yields one regardless of sidecar node_id reachability. An empty list (EC2/static peers) leaves the sidecar to derive witnesses from peers as before. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
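The address scheme in this commit message can be sketched as follows. This is a minimal illustration, not the controller's code: the function name `peerRPCWitness`, its parameters, and the `rpcPort` constant are hypothetical stand-ins (the PR itself refers to `peerRPCAddress` and `seiconfig.PortRPC`).

```go
package main

import "fmt"

// rpcPort is an illustrative constant for the RPC port named in the
// description (26657); the real code reportedly uses seiconfig.PortRPC.
const rpcPort = 26657

// peerRPCWitness builds the in-cluster headless-service DNS name for
// replica 0 of a peer and appends the RPC port. It takes no external
// address at all, so an external P2P hostname cannot leak into it.
func peerRPCWitness(name, namespace string) string {
	return fmt.Sprintf("%s-0.%s.%s.svc.cluster.local:%d", name, name, namespace, rpcPort)
}

func main() {
	// prints "node-a-0.node-a.sei.svc.cluster.local:26657"
	fmt.Println(peerRPCWitness("node-a", "sei"))
}
```

Because the input is only the peer's name and namespace, the result does not depend on the sidecar or on any node_id lookup.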
A peer skipped from persistent_peers (node_id unresolvable) still yields an RPC witness — the witness needs no node_id. Asserts the intentional divergence so it isn't "symmetrized" later. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
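The divergence this commit asserts might look like the sketch below. The `peer` type, its fields, and the `resolve` function are assumptions made for illustration only; they are not the types used in `peers.go`.

```go
package main

import "fmt"

// peer is a hypothetical stand-in for a label-selected peer.
type peer struct {
	Name, Namespace string
	NodeID          string // empty when the node_id could not be resolved
	P2PAddress      string
}

// resolve returns persistent-peer entries and RPC witnesses for the
// matched peers. The two lists are intentionally not symmetric.
func resolve(peers []peer) (persistent, witnesses []string) {
	for _, p := range peers {
		// A witness needs only the peer's identity, so it is always emitted.
		witnesses = append(witnesses,
			fmt.Sprintf("%s-0.%s.%s.svc.cluster.local:26657", p.Name, p.Name, p.Namespace))
		// A persistent_peers entry needs a node_id; skip the peer without one.
		if p.NodeID == "" {
			continue
		}
		persistent = append(persistent, p.NodeID+"@"+p.P2PAddress)
	}
	return persistent, witnesses
}

func main() {
	persistent, witnesses := resolve([]peer{
		{Name: "a", Namespace: "sei", NodeID: "abc", P2PAddress: "nlb.example.com:26656"},
		{Name: "b", Namespace: "sei"}, // node_id unresolvable
	})
	fmt.Println(len(persistent), len(witnesses)) // prints "1 2"
}
```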
Drop the planner call-site comment (the field name + ResolvedRPCWitnesses doc already convey it) and tighten the status field doc (grammar). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
     	return discoverPeersTask(node)
     case TaskConfigureStateSync:
    -	return configureStateSyncTask(snap)
    +	return configureStateSyncTask(node, snap)
if you pass in the node you could just source the snapshot source from it.
Good call — done in a6612e6. configureStateSyncTask(node) now derives snap := node.Spec.SnapshotSource() internally instead of taking the redundant arg. Verified equivalent for every node type: full/validator/replay callers already passed exactly SnapshotSource(), and archive uses Spec.Archive (not in the switch → nil), matching the explicit nil archive.go passed.
Per review: configureStateSyncTask already takes node, so derive the snapshot source via node.Spec.SnapshotSource() rather than threading a redundant snap arg. Equivalent for all node types (archive uses Spec.Archive, not in the SnapshotSource switch → nil, matching the explicit nil archive.go passed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
verify-generated drift: the ResolvedRPCWitnesses doc comment was trimmed without re-running make manifests. Pure description text sync. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Problem

State-sync light-client witnesses were derived by the sidecar from `persistent_peers`. For `networking.tcp` peers, `peerAddress` returns `Spec.ExternalAddress`: the external P2P NLB hostname, which exposes 26656 only and has no RPC listener. The sidecar's `extractRPCHosts` stripped the port and appended `:26657`, producing a witness on a host that never answers `/status`. seid exited on `no witnesses connected` and crashlooped. This is the regression that took arctic-1 node-19 down during the BYO-key migration (manually patched in prod; this codifies the fix).

Fix (controller half of the A+B fix)

Resolve a parallel witness list from the same label-selected peers, using each peer's in-cluster headless RPC DNS and never the external address:

- `peerRPCAddress(peer)` → `<peer>-0.<peer>.<ns>.svc.cluster.local:26657` (`seiconfig.PortRPC`). Unlike `peerAddress`, it never consults `Spec.ExternalAddress`; RPC is internal-only.
- `reconcilePeers`/`resolveLabelPeers` collect witnesses alongside `ResolvedPeers` and store them in a new `Status.ResolvedRPCWitnesses`.
- `configureStateSyncTask` passes them to `ConfigureStateSyncTask.RpcServers` (requires seictl v0.0.55, bumped here).

Witnesses are deterministic from peer identity (no `node_id`, no sidecar call), so every matched peer yields a witness regardless of sidecar reachability, which is strictly more robust than peer resolution. An empty list (EC2/static-peer nodes) leaves the sidecar to derive witnesses from `persistent_peers` as before.

This pairs with seictl #197 (merged, v0.0.55): the sidecar uses explicit `RpcServers` verbatim when provided and `/status`-probes each, dropping unreachable ones.

Changes

- `internal/controller/node/peers.go`: `peerRPCAddress` + witness collection
- `api/v1alpha1/seinode_types.go`: `Status.ResolvedRPCWitnesses` (+ regenerated deepcopy + CRD)
- `internal/planner/planner.go`: pass witnesses into the configure-state-sync task
- `go.mod`: seictl v0.0.50 → v0.0.55

Tests

- `TestReconcilePeers_PrefersExternalAddress` (extended): external-P2P peer still yields internal RPC DNS witness (the regression guard)
- `TestReconcilePeers_WitnessesExcludeSelfAndUseRPCPort`
- `TestConfigureStateSyncTask_PassesResolvedWitnesses` / `_NoWitnessesLeavesEmpty`

🤖 Generated with Claude Code
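For readers unfamiliar with the sidecar half, the selection rule described under Problem and Fix can be approximated as below. This is a sketch under stated assumptions, not the seictl implementation: `witnessesFor` is a hypothetical name, the `<node_id>@host:port` entry format is assumed from the usual `persistent_peers` convention, and the `/status` probing step is omitted.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// witnessesFor returns the explicit RPC server list verbatim when one is
// provided. Otherwise it falls back to deriving witnesses from the
// persistent_peers string by swapping each entry's port for 26657.
func witnessesFor(rpcServers []string, persistentPeers string) []string {
	if len(rpcServers) > 0 {
		return rpcServers
	}
	var out []string
	for _, entry := range strings.Split(persistentPeers, ",") {
		entry = strings.TrimSpace(entry)
		if entry == "" {
			continue
		}
		// Drop the "<node_id>@" prefix if present.
		if i := strings.LastIndex(entry, "@"); i >= 0 {
			entry = entry[i+1:]
		}
		host, _, err := net.SplitHostPort(entry)
		if err != nil {
			continue // skip malformed entries
		}
		out = append(out, net.JoinHostPort(host, "26657"))
	}
	return out
}

func main() {
	peers := "abc@nlb.example.com:26656"
	// Fallback path: the external P2P host is reused with the RPC port,
	// which is the failure mode when that host has no RPC listener.
	fmt.Println(witnessesFor(nil, peers))
	// Explicit path: the controller-supplied witness is used as-is.
	fmt.Println(witnessesFor([]string{"a-0.a.sei.svc.cluster.local:26657"}, peers))
}
```

The fallback branch shows why an external-address peer produced an unreachable witness, and the first branch shows how a non-empty `RpcServers` list sidesteps it.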