diff --git a/docs/en/solutions/Node_configuration_daemon_fails_to_apply_cgroup_v2_kernel_arguments_runtime_security_policy_kills_the_pod.md b/docs/en/solutions/Node_configuration_daemon_fails_to_apply_cgroup_v2_kernel_arguments_runtime_security_policy_kills_the_pod.md new file mode 100644 index 000000000..fdac6d858 --- /dev/null +++ b/docs/en/solutions/Node_configuration_daemon_fails_to_apply_cgroup_v2_kernel_arguments_runtime_security_policy_kills_the_pod.md @@ -0,0 +1,124 @@ +--- +kind: + - Troubleshooting +products: + - Alauda Container Platform +ProductsVersion: + - 4.1.0,4.2.x +--- + +# Node configuration daemon fails to apply cgroup v2 kernel arguments — runtime security policy kills the pod +## Issue + +When migrating cluster nodes from cgroup v1 to cgroup v2, the platform's node configuration controller (the immutable-infrastructure / machine-configuration operator) flips the relevant kernel arguments on each node: + +- Removes `systemd.unified_cgroup_hierarchy=0` and `systemd.legacy_systemd_cgroup_controller=1`. +- Appends `systemd.unified_cgroup_hierarchy=1`, `cgroup_no_v1="all"`, and `psi=0`. + +Some node pools end up `Degraded` mid-rollout with the controller reporting: + +```text +Node is reporting: "unexpected on-disk state validating against: + missing expected kernel arguments: + [systemd.unified_cgroup_hierarchy=0 systemd.legacy_systemd_cgroup_controller=1]" +``` + +The controller is reading the node's running kargs, comparing against the previous (v1) `MachineConfig`, and finding a mismatch — but the node has clearly already moved past v1, just without rebooting into the new state. + +## Root Cause + +The node configuration daemon ("MCD") on the node performs the karg flip via `rpm-ostree kargs` and then expects to drain → reboot the node so the new boot loads with v2. The daemon's pod manifest declares which low-level operations the rendered config touches; for a karg-only delta it is `kargs:true, passwd:false, files:false` — a password change is **not** part of the rendered transition. + +In the affected scenarios, the daemon is incorrectly issuing a `usermod` against the `core` user mid-transition (an upstream bug — the password is being touched even when the rendered config does not require it). Clusters that run a runtime-security operator (StackRox / a Container Security policy that flags user-add-execution) treat that `usermod` as a policy violation. The policy's enforcement is `KILL_POD`, and the security pod kills the daemon's container right after it issued the `rpm-ostree kargs` command — but **before** the daemon could drain and reboot the node. + +The end state is broken in a recoverable way: the karg change is staged in `rpm-ostree`, but `/proc/cmdline` still reflects the previous boot. The controller compares the live cmdline against its own expectation and reports `Degraded`. + +## Resolution + +Two complementary actions: change the runtime-security policy so it does not kill the configuration daemon mid-transition, and reboot any node where the karg change was staged but never applied. + +### 1. Set the runtime-security policy to "inform" before the cgroup migration + +The relevant policy in the platform's container-security stack monitors changes to the `core` user's password. While the upstream bug is open, switch this policy from `enforce` to `inform` for the duration of the migration window so that the daemon's spurious `usermod` does not get its pod killed: + +- Locate the policy named along the lines of "User Add Execution" in the runtime-security console (or its CR equivalent). +- Change **Policy behavior** from `enforce` to `inform`. +- Save and let the change propagate to the sensors before kicking off the cgroup migration. + +The policy still records the violation (so you have an audit trail), it just no longer takes the kill action. + +### 2. For nodes where rpm-ostree already staged the change, reboot manually + +If a node has already gone through the daemon's karg flip — `rpm-ostree status` shows a staged deployment with the new kargs — the only thing left to do is to reboot it: + +```bash +NODE= +kubectl drain "$NODE" --ignore-daemonsets --delete-emptydir-data --disable-eviction +kubectl debug node/"$NODE" -it --profile=sysadmin --image= \ + -- chroot /host systemctl reboot +``` + +After the reboot, the new cmdline takes effect. Verify: + +```bash +kubectl debug node/"$NODE" -it --profile=sysadmin --image= \ + -- chroot /host bash -c ' + echo "----- /proc/cmdline -----"; cat /proc/cmdline + echo "----- cgroup version -----"; stat -c %T -f /sys/fs/cgroup + echo "----- staged kargs -----"; rpm-ostree kargs + ' +``` + +Expected on a successfully migrated node: + +- `/proc/cmdline` carries `systemd.unified_cgroup_hierarchy=1`, `cgroup_no_v1="all"`, `psi=0`. +- `stat -c %T -f /sys/fs/cgroup` returns `cgroup2fs`. +- `rpm-ostree kargs` matches `/proc/cmdline`. + +Uncordon the node: + +```bash +kubectl uncordon "$NODE" +``` + +### 3. Other failure shapes + +If the cluster does not run a runtime-security tool, or if the daemon pod was not killed by such a tool, the failure has a different cause — typically a missing rendered config, an MCO controller stuck on a different node, or a `pivot` (rpm-ostree) error. Inspect the daemon's pod log on the affected node to identify the actual cause before applying the workaround above. + +## Diagnostic Steps + +1. Confirm the daemon ran `usermod` against `core` and was killed before it could drain / reboot: + + ```bash + NODE= + kubectl debug node/"$NODE" -it --profile=sysadmin --image= \ + -- chroot /host journalctl --no-pager --boot \ + | grep -E 'machine-config-daemon\[|change user|Killing container.*machine-config-daemon' + ``` + + Look for the sequence: `Update prepared` → `drain complete` → `usermod ... change user 'core' password` → `Killing container with a grace period`. + +2. Confirm the kill came from the runtime-security stack (StackRox / the platform's Container Security agent): + + ```bash + PODNAME= + kubectl logs -n stackrox $(kubectl get pod -n stackrox -l app=sensor -o name) \ + | grep -B2 -A9 "$PODNAME" + ``` + + The fingerprint is `enforcement: KILL_POD_ENFORCEMENT` next to `policy_name: ".*User Add Execution"` referencing the daemon pod. + +3. Inspect `rpm-ostree` state to know whether the karg change was *staged* (just needs a reboot) or *not yet applied* (needs the daemon to run again): + + ```bash + kubectl debug node/"$NODE" -it --profile=sysadmin --image= \ + -- chroot /host bash -c ' + echo "----- boot options -----"; cat /proc/cmdline + echo "----- cgroup version -----"; stat -c %T -f /sys/fs/cgroup + echo "----- rpm-ostree status -----"; rpm-ostree status -v + echo "----- rpm-ostree kargs -----"; rpm-ostree kargs + ' + ``` + + - `Staged: yes` paired with the new kargs in `rpm-ostree kargs` and the **old** kargs in `/proc/cmdline` is the "needs reboot" state — apply step 2 of the resolution. + - No staged deployment is the "daemon never finished" state — re-enable the daemon (after putting the security policy in `inform`) and let the controller retry.