Skip to content

feat(task/observe-image): advance STS status.currentRevision after rollout#379

Closed
bdchatham wants to merge 1 commit into
mainfrom
feat/controller/sts-status-advance
Closed

feat(task/observe-image): advance STS status.currentRevision after rollout#379
bdchatham wants to merge 1 commit into
mainfrom
feat/controller/sts-status-advance

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

Problem

SeiNode StatefulSets are generated with updateStrategy: OnDelete (see internal/noderesource/noderesource.go:253) because pod replacement is driven by the replace-pod task, not the StatefulSet controller's rolling-update loop. The SND InPlace strategy is implemented at the controller layer via ReplacePod, not via the StatefulSet controller's own rolling-update step.

The upstream StatefulSet controller never advances status.currentRevision under OnDelete — that field is only written as part of its rolling-update step, which we deliberately bypass. After a successful in-place rollout, the StatefulSet sits in a permanent split state:

status.replicas == status.updatedReplicas == status.readyReplicas == 1
status.currentRevision != status.updateRevision

The kube-prometheus-mixin alert KubeStatefulSetUpdateNotRolledOut latches on exactly this mismatch and never clears, producing chronic false-positive pages on every SND we cycle.

Concrete symptoms (observed today):

  • atlantic-2/archive-0 — firing since 2026-05-29 despite the rollout completing cleanly.
  • Every other SND rolled today entered the same state.

Fix

Augment the existing ObserveImage task (Option A in the design discussion). Once UpdatedReplicas confirms rollout is complete and we stamp Status.CurrentImage / Status.CurrentSidecarImage, also patch the underlying StatefulSet's status.currentRevision = status.updateRevision (and currentReplicas = updatedReplicas).

  • Patch via the status subresource with client.MergeFrom.
  • No-op when revisions already match or when updateRevision is empty (controller hasn't observed the spec yet).
  • Lives exactly where we already know "new pod is up at the new revision" — minimal surface area, no new task added to the progression.
  • Adds RBAC for apps/statefulsets/status (get;patch;update).

Why ObserveImage and not a new task or a periodic sweeper

  • Augment vs. new task: the progression is already 7 tasks (ApplyStatefulSet -> ApplyService -> ConfigPatch -> ConfigValidate -> ReplacePod -> ObserveImage -> MarkReady). The information we need to advance currentRevision is the same information ObserveImage already computes — adding an 8th task would be one indirection short of pointless.
  • Augment vs. sweeper: a periodic sweeper means doing this work on every SeiNode reconcile forever; the targeted version only runs as part of an active update plan. Same correctness, less ongoing reconcile cost.

Test plan

  • Unit tests in internal/task/observe_image_test.go:
    • TestObserveImage_RolloutComplete_AdvancesStatefulSetCurrentRevision — happy path: differing revisions, patch applies.
    • TestObserveImage_RolloutComplete_RevisionsAlreadyMatch_NoOp — already-equal case: no patch attempted, task still completes.
    • TestObserveImage_RolloutComplete_EmptyUpdateRevision_NoOp — race window where StatefulSet controller hasn't observed the spec: do not blank an existing CurrentRevision.
  • Existing observe-image tests continue to pass.
  • go test ./... (excluding e2e) passes cleanly across all packages.
  • make manifests regenerated manifests/role.yaml for the new RBAC marker; diff is exactly the expected statefulsets/status block.
  • Manual verification on next rollout in atlantic-2: after deploying a controller bump, trigger an in-place update on an SND, watch for KubeStatefulSetUpdateNotRolledOut to clear on the next Prometheus scrape after ObserveImage reports complete.

Scope discipline (per CLAUDE.md)

  • Single PR, no refactor beyond what's required.
  • No version bumps, no backwards-compat shims, no feature flags.
  • No CRD schema changes (none required).
  • Status patches use client.MergeFrom for the external StatefulSet — the MergeFromWithOptimisticLock{} requirement applies to SeiNode/SND/SeiNodeTask own-status writes, not to writes to upstream resources, so this is consistent with the repo convention.

…nning-mode update completes

SeiNode StatefulSets are generated with updateStrategy: OnDelete (see
noderesource.GenerateNodeStatefulSet) because pod replacement is driven
by the replace-pod task, not the StatefulSet controller's rolling-update
loop. As a side effect, the upstream StatefulSet controller never
advances status.currentRevision after a successful rollout — it only
does so as part of its own rolling-update step, which never runs under
OnDelete.

The kube-prometheus-mixin alert KubeStatefulSetUpdateNotRolledOut latches
on this mismatch (currentRevision != updateRevision with
updatedReplicas == replicas == readyReplicas) and never clears,
producing chronic false-positive pages on every SND we cycle. Concrete
instance: atlantic-2/archive-0 has been firing since 2026-05-29, and
every other SND rolled today entered the same state.

Fix: extend ObserveImage. Once UpdatedReplicas confirms rollout is
complete and we stamp Status.CurrentImage, also patch the underlying
StatefulSet's status.currentRevision = status.updateRevision (and
currentReplicas = updatedReplicas). Patch via the status subresource
with client.MergeFrom; no-op when revisions already match or when
updateRevision is empty (controller hasn't observed the spec yet).

This sits exactly where we already know "new pod is up at the new
revision" — minimal surface area, no new task in the progression. Adds
RBAC for apps/statefulsets/status.

Test plan:
- New unit tests in observe_image_test.go covering revision advance,
  no-op when revisions already match, no-op when updateRevision is
  empty.
- Existing observe-image tests continue to pass (no behavioral change
  for rollout-not-yet-complete paths).
@cursor
Copy link
Copy Markdown

cursor Bot commented Jun 2, 2026

PR Summary

Medium Risk
Writes to upstream StatefulSet status during the update plan; scope is narrow and guarded, but incorrect patching could affect monitoring or status semantics cluster-wide for managed STS.

Overview
After ObserveImage confirms a StatefulSet rollout (UpdatedReplicas), it now patches the StatefulSet status so currentRevision matches updateRevision and currentReplicas matches updatedReplicas. That closes the gap left by OnDelete rollouts (pod swaps via replace-pod), where the built-in controller never advances currentRevision and KubeStatefulSetUpdateNotRolledOut can stay firing despite a healthy node.

The patch is a no-op when revisions already match or updateRevision is still empty. RBAC adds apps/statefulsets/status (get, patch, update) on the node controller and generated roles. Unit tests cover the happy path, no-op cases, and the empty-updateRevision race.

Reviewed by Cursor Bugbot for commit 9b6655e. Bugbot is set up for automated code reviews on this repo. Configure here.

@bdchatham
Copy link
Copy Markdown
Collaborator Author

Closing — adversarial review with online verification surfaced that this approach is the wrong remediation.

Root cause we hit: SND-managed StatefulSets use updateStrategy: OnDelete (because ReplacePod drives pod lifecycle, not the STS controller's rolling-update loop). The upstream STS controller does not advance status.currentRevision for OnDelete strategy, so KubeStatefulSetUpdateNotRolledOut latches indefinitely once a spec change creates an updateRevision.

Why this PR is wrong:

  1. Multi-writer status race. The upstream STS controller writes status via full UpdateStatus (not patch) on every reconcile (pkg/controller/statefulset/stateful_set_status_updater.go). When inconsistentStatus() fires (any of 8 status fields drifted), it UpdateStatus-es the whole status from its in-memory cache — overwriting our patch. We were racing the upstream controller without a field-manager strategy.

  2. Upstream fix already exists: StatefulSet: Fix OnDelete strategy not updating CurrentRevision kubernetes/kubernetes#136833 — "StatefulSet: Fix OnDelete strategy not updating CurrentRevision" — merged 2026-05-05 into K8s master, targeted v1.37. Fixes the 6-year-old upstream bugs #73492 and #106055. The gate set.Spec.UpdateStrategy.Type == RollingUpdate was removed from completeUpdate so the existing promotion path also fires for OnDelete when UpdatedReplicas == Replicas == ReadyReplicas == *Spec.Replicas.

  3. Cross-controller status writes are an anti-pattern. Status is conventionally owned by the controller that owns the resource. Our sei-k8s-controller doesn't own the StatefulSet's lifecycle — kube-controller-manager does. SIG-apps convention is to either drive the rollout via the strategy that produces the correct status (RollingUpdate with partition, à la Argo Rollouts / OpenKruise) or filter the alert.

Correct remediation (separate PR to platform repo):

A PrometheusRule override that excludes sei-managed StatefulSets from KubeStatefulSetUpdateNotRolledOut. Documented + reversible. Removable when our EKS clusters cross v1.37 (currently v1.34).

Will file a follow-up issue tracking the v1.37 upgrade as the removal trigger.

@bdchatham bdchatham closed this Jun 2, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant