docs(runbooks): validator BYO-secrets migration runbook #382
Cutover procedure for moving a live validator (arctic-1 node-19) off a legacy EC2 host onto the platform carrying its consensus identity via SOPS-encrypted Secrets. Centers the stop-before-start double-sign discipline and the layered equivocation defenses (procedure, replicas:1 CEL guard, double-sign alerts), the controller validation surface, the cutover/rollback sequence, and the four findings from the harbor dry-run (write-mode↔image coupling, networking.tcp DNS race, deploy-clean-not-recreate, deletionPolicy→PVC cascade, operatorKeyring guard #380). Written from the harbor dry-run and cross-reviewed by the platform and kubernetes specialists; verified against the controller source, the platform .sops.yaml layout, and the sei-infra#1034 node-19 removal mechanism.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
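The stop-before-start discipline can be pictured as a gate that has to pass before the platform validator is allowed to start. What follows is a minimal Python sketch, not the runbook's actual tooling: `LegacyHostState`, its fields, and `may_start_platform_validator` are hypothetical names, and the two conditions checked (legacy service stopped, and its unit disabled so a reboot cannot restart it) are an assumption about what the procedure verifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LegacyHostState:
    """Hypothetical snapshot of the validator service on the legacy host."""
    service_active: bool   # what `systemctl is-active` would report
    service_enabled: bool  # what `systemctl is-enabled` would report


def may_start_platform_validator(legacy: LegacyHostState) -> bool:
    """Allow the new validator only if the old one is stopped AND disabled.

    A service that is stopped but still enabled can come back after a
    reboot, which would put two signers on the same consensus key.
    """
    return not legacy.service_active and not legacy.service_enabled


running = LegacyHostState(service_active=True, service_enabled=True)
stopped_only = LegacyHostState(service_active=False, service_enabled=True)
stopped_and_disabled = LegacyHostState(service_active=False, service_enabled=False)

for label, state in [
    ("running", running),
    ("stopped only", stopped_only),
    ("stopped and disabled", stopped_and_disabled),
]:
    print(f"{label}: start allowed = {may_start_platform_validator(state)}")
```

Only the last state passes the gate; the "stopped only" case is the auto-restart seam that stopping without disabling leaves open.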
PR Summary
Low Risk
Overview: The new runbook documents what consensus material actually moves, per-cluster SOPS/KMS layout, ...
Reviewed by Cursor Bugbot for commit f1fa8f0.
- §5 step 3: config-apply runs before discover-peers (config-apply writes base config, discover-peers writes persistent-peers, config-validate checks last); this matches buildSidecarProgression (discover-peers inserted before config-validate, not before config-apply). kubernetes-specialist blocker.
- §2: drop the clusters/harbor row: the platform repo has no clusters/harbor/.sops.yaml (harbor in-repo secrets use explicit --kms). platform-engineer correction.
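The ordering fix in the first item can be illustrated with a small Python sketch. It is not the controller's buildSidecarProgression: `insert_before` is a hypothetical helper, and the two-task base list is a simplifying assumption (the real progression contains more tasks).

```python
def insert_before(progression: list[str], new_task: str, anchor: str) -> list[str]:
    """Return a copy of `progression` with `new_task` placed just before `anchor`."""
    idx = progression.index(anchor)  # raises ValueError if the anchor is missing
    return progression[:idx] + [new_task] + progression[idx:]


# Assumed minimal base progression: write config first, validate last.
base = ["config-apply", "config-validate"]

# discover-peers is anchored on config-validate, so it lands after
# config-apply and its persistent-peers are in place before validation.
progression = insert_before(base, "discover-peers", "config-validate")
print(progression)  # ['config-apply', 'discover-peers', 'config-validate']
```

Anchoring on config-validate rather than config-apply is what keeps the base config in place before discover-peers writes to it.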
Adds .agent/runbooks/migrating-validator-to-byo-secrets.md (+ index row): the cutover procedure for moving a live validator off a legacy EC2 host onto the platform, carrying its consensus identity via SOPS-encrypted Secrets. Written from the arctic-1 node-19 harbor dry-run.
Contents
- What moves: priv_validator_key.json (the identity) + node_key.json; chain state does not.
- Per-cluster .sops.yaml layout (KMS by cluster dir; clusters/prod → alias/prod).
- Equivocation defenses: the replicas:1 CEL guard, the systemctl disable auto-restart seam, and the ValidatorDoubleSignEvidenceObserved / ValidatorNewlyJailed alerts.
Review
Cross-reviewed by the platform-engineer and kubernetes-specialist. kubernetes-specialist signed off (wording fixes applied: no block-sync task, corrected to the real bootstrap progression + seid catch-up; mark-ready lives in the SeiNode plan, not the SND's; BYO nodeKey NodeID is Secret-pinned; PVC ownership topology). platform-engineer's two blockers handled:
- .sops.yaml: secrets live under the cluster dir as *.secret.yaml, encrypted via the cluster's .sops.yaml. Fixed.
- node-19 removal: instanceCount 10→9 with the nodeStartId 0/10/20 pins preventing the later region from reindexing. The original "decrement kills node-29 / needs state surgery" concern came from reading the raw count module without the inventory/terraform.py region layer; documented the actual (clean) mechanism.
Plus:
ValidatorNewlyJailed is for: 5m w/ 30m lookback; KMS-by-cluster (dev is us-east-2); downtime-jail vs tombstone caveat; image-gate is a human tag check.
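The instanceCount 10→9 mechanism from the review notes can be checked with a little arithmetic. The Python sketch below is an illustration only, not the sei-infra module or inventory/terraform.py: the region names are hypothetical, and it assumes three regions of ten nodes each with node-19 being the last node of the region pinned at start 10.

```python
def pinned_ids(regions: dict[str, tuple[int, int]]) -> dict[str, list[int]]:
    """Node IDs when each region has an explicit (start, count) pin."""
    return {name: [start + i for i in range(count)]
            for name, (start, count) in regions.items()}


def cumulative_ids(counts: dict[str, int]) -> dict[str, list[int]]:
    """Node IDs when each region starts where the previous one ended (no pins)."""
    ids, next_id = {}, 0
    for name, count in counts.items():
        ids[name] = list(range(next_id, next_id + count))
        next_id += count
    return ids


def all_ids(ids: dict[str, list[int]]) -> set[int]:
    return {n for nodes in ids.values() for n in nodes}


# Hypothetical regions; the starts mirror the nodeStartId 0/10/20 pins.
before = {"region-a": (0, 10), "region-b": (10, 10), "region-c": (20, 10)}
after = {"region-a": (0, 10), "region-b": (10, 9), "region-c": (20, 10)}

pinned_removed = all_ids(pinned_ids(before)) - all_ids(pinned_ids(after))

counts_before = {name: count for name, (_, count) in before.items()}
counts_after = {name: count for name, (_, count) in after.items()}
cumulative_removed = (all_ids(cumulative_ids(counts_before))
                      - all_ids(cumulative_ids(counts_after)))

print("with pins, removed:", pinned_removed)         # {19}
print("without pins, removed:", cumulative_removed)  # {29}
```

Under these assumptions the pinned layout drops only node-19, while the unpinned layout shifts the last region down by one and drops node-29 instead, which is the failure mode the original concern described.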