feat(task/observe-image): advance STS status.currentRevision after rollout by bdchatham · Pull Request #379 · sei-protocol/sei-k8s-controller

bdchatham · 2026-06-02T15:25:04Z

Problem

SeiNode StatefulSets are generated with updateStrategy: OnDelete (see internal/noderesource/noderesource.go:253) because pod replacement is driven by the replace-pod task, not the StatefulSet controller's rolling-update loop. The SND InPlace strategy is implemented at the controller layer via ReplacePod, not via the StatefulSet controller's own rolling-update step.

The upstream StatefulSet controller never advances status.currentRevision under OnDelete — that field is only written as part of its rolling-update step, which we deliberately bypass. After a successful in-place rollout, the StatefulSet sits in a permanent split state:

status.replicas == status.updatedReplicas == status.readyReplicas == 1
status.currentRevision != status.updateRevision

The kube-prometheus-mixin alert KubeStatefulSetUpdateNotRolledOut latches on exactly this mismatch and never clears, producing chronic false-positive pages on every SND we cycle.

Concrete symptoms (observed today):

atlantic-2/archive-0 — firing since 2026-05-29 despite the rollout completing cleanly.
Every other SND rolled today entered the same state.

Fix

Augment the existing ObserveImage task (Option A in the design discussion). Once UpdatedReplicas confirms rollout is complete and we stamp Status.CurrentImage / Status.CurrentSidecarImage, also patch the underlying StatefulSet's status.currentRevision = status.updateRevision (and currentReplicas = updatedReplicas).

Patch via the status subresource with client.MergeFrom.
No-op when revisions already match or when updateRevision is empty (controller hasn't observed the spec yet).
Lives exactly where we already know "new pod is up at the new revision" — minimal surface area, no new task added to the progression.
Adds RBAC for apps/statefulsets/status (get;patch;update).

Why ObserveImage and not a new task or a periodic sweeper

Augment vs. new task: the progression is already 7 tasks (ApplyStatefulSet -> ApplyService -> ConfigPatch -> ConfigValidate -> ReplacePod -> ObserveImage -> MarkReady). The information we need to advance currentRevision is the same information ObserveImage already computes — adding an 8th task would be one indirection short of pointless.
Augment vs. sweeper: a periodic sweeper means doing this work on every SeiNode reconcile forever; the targeted version only runs as part of an active update plan. Same correctness, less ongoing reconcile cost.

Test plan

Unit tests in internal/task/observe_image_test.go:
- TestObserveImage_RolloutComplete_AdvancesStatefulSetCurrentRevision — happy path: differing revisions, patch applies.
- TestObserveImage_RolloutComplete_RevisionsAlreadyMatch_NoOp — already-equal case: no patch attempted, task still completes.
- TestObserveImage_RolloutComplete_EmptyUpdateRevision_NoOp — race window where StatefulSet controller hasn't observed the spec: do not blank an existing CurrentRevision.
Existing observe-image tests continue to pass.
go test ./... (excluding e2e) passes cleanly across all packages.
make manifests regenerated manifests/role.yaml for the new RBAC marker; diff is exactly the expected statefulsets/status block.
Manual verification on next rollout in atlantic-2: after deploying a controller bump, trigger an in-place update on an SND, watch for KubeStatefulSetUpdateNotRolledOut to clear on the next Prometheus scrape after ObserveImage reports complete.

Scope discipline (per CLAUDE.md)

Single PR, no refactor beyond what's required.
No version bumps, no backwards-compat shims, no feature flags.
No CRD schema changes (none required).
Status patches use client.MergeFrom for the external StatefulSet — the MergeFromWithOptimisticLock{} requirement applies to SeiNode/SND/SeiNodeTask own-status writes, not to writes to upstream resources, so this is consistent with the repo convention.

…nning-mode update completes SeiNode StatefulSets are generated with updateStrategy: OnDelete (see noderesource.GenerateNodeStatefulSet) because pod replacement is driven by the replace-pod task, not the StatefulSet controller's rolling-update loop. As a side effect, the upstream StatefulSet controller never advances status.currentRevision after a successful rollout — it only does so as part of its own rolling-update step, which never runs under OnDelete. The kube-prometheus-mixin alert KubeStatefulSetUpdateNotRolledOut latches on this mismatch (currentRevision != updateRevision with updatedReplicas == replicas == readyReplicas) and never clears, producing chronic false-positive pages on every SND we cycle. Concrete instance: atlantic-2/archive-0 has been firing since 2026-05-29, and every other SND rolled today entered the same state. Fix: extend ObserveImage. Once UpdatedReplicas confirms rollout is complete and we stamp Status.CurrentImage, also patch the underlying StatefulSet's status.currentRevision = status.updateRevision (and currentReplicas = updatedReplicas). Patch via the status subresource with client.MergeFrom; no-op when revisions already match or when updateRevision is empty (controller hasn't observed the spec yet). This sits exactly where we already know "new pod is up at the new revision" — minimal surface area, no new task in the progression. Adds RBAC for apps/statefulsets/status. Test plan: - New unit tests in observe_image_test.go covering revision advance, no-op when revisions already match, no-op when updateRevision is empty. - Existing observe-image tests continue to pass (no behavioral change for rollout-not-yet-complete paths).

cursor · 2026-06-02T15:25:13Z

PR Summary

Medium Risk
Writes to upstream StatefulSet status during the update plan; scope is narrow and guarded, but incorrect patching could affect monitoring or status semantics cluster-wide for managed STS.

Overview
After ObserveImage confirms a StatefulSet rollout (UpdatedReplicas), it now patches the StatefulSet status so currentRevision matches updateRevision and currentReplicas matches updatedReplicas. That closes the gap left by OnDelete rollouts (pod swaps via replace-pod), where the built-in controller never advances currentRevision and KubeStatefulSetUpdateNotRolledOut can stay firing despite a healthy node.

The patch is a no-op when revisions already match or updateRevision is still empty. RBAC adds apps/statefulsets/status (get, patch, update) on the node controller and generated roles. Unit tests cover the happy path, no-op cases, and the empty-updateRevision race.

^{Reviewed by Cursor Bugbot for commit 9b6655e. Bugbot is set up for automated code reviews on this repo. Configure here.}

bdchatham · 2026-06-02T15:37:39Z

Closing — adversarial review with online verification surfaced that this approach is the wrong remediation.

Root cause we hit: SND-managed StatefulSets use updateStrategy: OnDelete (because ReplacePod drives pod lifecycle, not the STS controller's rolling-update loop). The upstream STS controller does not advance status.currentRevision for OnDelete strategy, so KubeStatefulSetUpdateNotRolledOut latches indefinitely once a spec change creates an updateRevision.

Why this PR is wrong:

Multi-writer status race. The upstream STS controller writes status via full UpdateStatus (not patch) on every reconcile (pkg/controller/statefulset/stateful_set_status_updater.go). When inconsistentStatus() fires (any of 8 status fields drifted), it UpdateStatus-es the whole status from its in-memory cache — overwriting our patch. We were racing the upstream controller without a field-manager strategy.
Upstream fix already exists: StatefulSet: Fix OnDelete strategy not updating CurrentRevision kubernetes/kubernetes#136833 — "StatefulSet: Fix OnDelete strategy not updating CurrentRevision" — merged 2026-05-05 into K8s master, targeted v1.37. Fixes the 6-year-old upstream bugs #73492 and #106055. The gate set.Spec.UpdateStrategy.Type == RollingUpdate was removed from completeUpdate so the existing promotion path also fires for OnDelete when UpdatedReplicas == Replicas == ReadyReplicas == *Spec.Replicas.
Cross-controller status writes are an anti-pattern. Status is conventionally owned by the controller that owns the resource. Our sei-k8s-controller doesn't own the StatefulSet's lifecycle — kube-controller-manager does. SIG-apps convention is to either drive the rollout via the strategy that produces the correct status (RollingUpdate with partition, à la Argo Rollouts / OpenKruise) or filter the alert.

Correct remediation (separate PR to platform repo):

A PrometheusRule override that excludes sei-managed StatefulSets from KubeStatefulSetUpdateNotRolledOut. Documented + reversible. Removable when our EKS clusters cross v1.37 (currently v1.34).

Will file a follow-up issue tracking the v1.37 upgrade as the removal trigger.

bdchatham closed this Jun 2, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(task/observe-image): advance STS status.currentRevision after rollout#379

feat(task/observe-image): advance STS status.currentRevision after rollout#379
bdchatham wants to merge 1 commit into
mainfrom
feat/controller/sts-status-advance

bdchatham commented Jun 2, 2026

Uh oh!

cursor Bot commented Jun 2, 2026 •

edited

Loading

Uh oh!

bdchatham commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bdchatham commented Jun 2, 2026

Problem

Fix

Why ObserveImage and not a new task or a periodic sweeper

Test plan

Scope discipline (per CLAUDE.md)

Uh oh!

cursor Bot commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

PR Summary

Uh oh!

bdchatham commented Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cursor Bot commented Jun 2, 2026 •

edited

Loading