KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only)#13090
KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only)#13090jmsperu wants to merge 4 commits into
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #13090 +/- ##
============================================
+ Coverage 18.10% 18.11% +0.01%
- Complexity 16746 16759 +13
============================================
Files 6037 6037
Lines 542933 542785 -148
Branches 66487 66454 -33
============================================
+ Hits 98283 98337 +54
+ Misses 433602 433401 -201
+ Partials 11048 11047 -1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
DaanHoogland
left a comment
There was a problem hiding this comment.
clgtm, thanks for this feature @jmsperu . I would suggest renaming “reboot” to “fence” or “hard-reboot” (but no -1 on that).
|
@jmsperu , would you consider this for older versions as well? |
There was a problem hiding this comment.
Pull request overview
Adds a configurable fencing action for KVM storage-heartbeat timeouts to avoid overly-destructive false-positive host fencing on replicated/local primary storage backends (e.g., LINSTOR/DRBD).
Changes:
- Introduces new agent property
kvm.heartbeat.fence.action(defaultreboot) to select fencing behavior. - Updates
kvmheartbeat.shandkvmspheartbeat.shto read the property from/etc/cloudstack/agent/agent.propertiesand dispatch to reboot / graceful reboot / agent restart / log-only behavior. - Documents the property in
agent.propertiesand adds aAgentPropertiesentry for discoverability.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scripts/vm/hypervisor/kvm/kvmheartbeat.sh | Reads kvm.heartbeat.fence.action and selects fence behavior for heartbeat write failures. |
| scripts/vm/hypervisor/kvm/kvmspheartbeat.sh | Same fencing configurability for StorPool heartbeat script. |
| agent/src/main/java/com/cloud/agent/properties/AgentProperties.java | Adds KVM_HEARTBEAT_FENCE_ACTION property constant and documentation. |
| agent/conf/agent.properties | Documents kvm.heartbeat.fence.action for operators. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # write fails persistently. Supersedes the legacy binary | ||
| # 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value. | ||
| # |
There was a problem hiding this comment.
we could remove reboot which is a bit misleading
graceful-reboot and hard-reboot are more clear
|
@jmsperu The current hard option is there because a stale NFS (v3?) mount will keep the OS from rebooting gracefully. |
…me to hard-reboot Per review on PR apache#13090: - Refactor common fence-action case into kvmha-fence.sh sourced by both kvmheartbeat.sh and kvmspheartbeat.sh (per @sureshanaparti). - Add 'custom' fence action that invokes an operator-supplied script (kvm.heartbeat.fence.custom.script, default /etc/cloudstack/agent/ heartbeat-fence-custom.sh) with the heartbeat script name as arg; falls back to hard-reboot if the script is missing/non-executable (per @NuxRo). - Rename canonical action 'reboot' -> 'hard-reboot' for clarity; keep 'reboot' accepted as alias so existing deployments don't break (per @DaanHoogland). Default behavior unchanged: sysrq-trigger reboot, required where a stale NFSv3 mount blocks systemctl reboot.
|
Pushed f5f3db5ea3 addressing all review comments: @sureshanaparti — extracted the shared fence-action case into a new . "$(dirname "$0")/kvmha-fence.sh"
fence_action "kvmheartbeat.sh" # script name passed for log taggingNet -40 lines of duplication. @NuxRo — added @DaanHoogland — renamed canonical action to Backport question — happy to open separate cherry-pick PRs against any active branch you'd like (4.21? 4.20?). Pure agent-side change, no DB schema, low risk. |
|
@blueorangutan package |
|
@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17799 |
|
Two-week ping — this PR has been APPROVED and all GHA checks (codecov, CodeQL, packaging all five distros via Jenkins SL-JID 17799) have been green since 2026-05-09. Could @sureshanaparti / @DaanHoogland trigger |
The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and
kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via
'echo b > /proc/sysrq-trigger' when a heartbeat write to primary storage
times out. This bypasses all OS-level shutdown protections, drops every
running VM on the host instantly, and triggers HA cascades onto
surviving hosts.
For NFS shared storage the binary "heartbeat-write-failed = host-is-dead"
heuristic is reasonable. For LINSTOR/DRBD or other replicated local
storage, the same disk serves application I/O, replication I/O and
heartbeat I/O simultaneously - so a transient I/O contention spike can
time out the heartbeat write without the host actually being unhealthy.
The result is false-positive sysrq fencing.
Adds a new agent.properties option:
kvm.heartbeat.fence.action = reboot | graceful-reboot
| restart-agent | log-only
Default value is "reboot" so existing deployments keep their current
behavior. Operators on replicated storage backends can choose a less
destructive action:
- graceful-reboot: 'systemctl reboot' instead of sysrq, allowing VMs
a chance to shut down cleanly
- restart-agent: restart cloudstack-agent only, preserving running VMs
- log-only: log + alert, no automatic action
The existing 'reboot.host.and.alert.management.on.heartbeat.timeout'
boolean continues to function as a complete Java-side bypass.
Refs: apache#13089
…me to hard-reboot Per review on PR apache#13090: - Refactor common fence-action case into kvmha-fence.sh sourced by both kvmheartbeat.sh and kvmspheartbeat.sh (per @sureshanaparti). - Add 'custom' fence action that invokes an operator-supplied script (kvm.heartbeat.fence.custom.script, default /etc/cloudstack/agent/ heartbeat-fence-custom.sh) with the heartbeat script name as arg; falls back to hard-reboot if the script is missing/non-executable (per @NuxRo). - Rename canonical action 'reboot' -> 'hard-reboot' for clarity; keep 'reboot' accepted as alias so existing deployments don't break (per @DaanHoogland). Default behavior unchanged: sysrq-trigger reboot, required where a stale NFSv3 mount blocks systemctl reboot.
f5f3db5 to
10303db
Compare
|
@blueorangutan package |
|
@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress. |
| # write fails persistently. Supersedes the legacy binary | ||
| # 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value. | ||
| # |
|
|
||
| if [ -r "$AGENT_PROPS" ]; then | ||
| local val | ||
| val=$(grep -E '^[[:space:]]*kvm\.heartbeat\.fence\.action[[:space:]]*=' "$AGENT_PROPS" | tail -n 1 | cut -d= -f2- | tr -d '[:space:]') |
There was a problem hiding this comment.
this command is a bit complicated, maybe just
grep "^kvm.heartbeat.fence.action=" "$AGENT_PROPS" | xxxxxxx
| # write fails persistently. Supersedes the legacy binary | ||
| # 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value. | ||
| # |
There was a problem hiding this comment.
we could remove reboot which is a bit misleading
graceful-reboot and hard-reboot are more clear
|
Can you update kvmsmpheartbeat.sh as well? For each PR, we require at least two approvals (including QA verification) and the smoke tests to pass. |
|
Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18034 |
|
Thanks @DaanHoogland 🙏 On the rename — already in (commit 10303db): the property value is now On older versions — yes, happy to open backport PRs to 4.20.x and 4.21.x once this lands in main. Will tag you on those for review. Just re-ran the one failing CI shard ( |
Per @weizhouapache's review on apache#13090, kvmsmpheartbeat.sh was still hard-coding the immediate-reboot fence action (echo b > /proc/sysrq-trigger) while kvmheartbeat.sh and kvmspheartbeat.sh delegate to kvmha-fence.sh. Source kvmha-fence.sh and call fence_action so this script picks up the same hard-reboot / graceful-reboot / restart-agent / log-only / custom behaviour from agent.properties. Default behaviour (no property set) remains hard-reboot via sysrq-trigger, so existing deployments are unaffected.
|
Thanks @weizhouapache 🙏 You're right — kvmsmpheartbeat.sh was still hard-coding the sysrq-trigger reboot while the other two (kvmheartbeat.sh + kvmspheartbeat.sh) delegate to the shared kvmha-fence.sh helper. Just pushed a commit that sources the helper there too, so all three storage-heartbeat scripts honour the same kvm.heartbeat.fence.action property. Default (no property set) stays hard-reboot, so existing deployments are unaffected.
Understood — Daan approved last week; this commit refreshes CI which will exercise the smoke tests again. Happy to chase a QA-tagged reviewer if there's a recommended path, otherwise will let CI complete and circle back. |
Description
Fixes #13089
The KVM agent's storage heartbeat scripts (
kvmheartbeat.shandkvmspheartbeat.sh) hard-code an immediate kernel-level reboot viaecho b > /proc/sysrq-triggerwhen a heartbeat write to primary storage times out.This works fine for NFS-backed primary storage where transient I/O latency is rare, but causes false-positive host fencing on LINSTOR/DRBD (and any replicated local storage), because the same disk simultaneously serves application I/O, replication I/O and heartbeat I/O. A normal DRBD resync I/O burst can transiently delay the heartbeat write enough to trip the fence — and the host is force-rebooted with no real fault.
We hit this in production on 4.22.0.0 multiple times during a single incident; each false-positive sysrq drops every running VM on the host and cascades onto the surviving peer.
Change
Adds a new agent property
kvm.heartbeat.fence.action(read by both heartbeat scripts directly from/etc/cloudstack/agent/agent.properties):reboot(default)echo b > /proc/sysrq-triggergraceful-rebootsystemctl reboot— allows running VMs to stop cleanlyrestart-agentcloudstack-agentonly; running VMs preservedlog-onlyDefault is
rebootso existing deployments keep current behavior. Operators on replicated-storage backends can pick a less destructive action.The existing
reboot.host.and.alert.management.on.heartbeat.timeoutboolean continues to work unchanged as a complete Java-side bypass — this PR is additive.Files changed
scripts/vm/hypervisor/kvm/kvmheartbeat.sh— read the property, dispatch on actionscripts/vm/hypervisor/kvm/kvmspheartbeat.sh— sameagent/conf/agent.properties— document the new propertyagent/src/main/java/com/cloud/agent/properties/AgentProperties.java— add Java-side property entry for tooling/discoverabilityBackward compatibility
reboot, identical to current behaviortail -n 1so duplicate entries take the last valuerebootreboot.host.and.alert.management.on.heartbeat.timeout) continues to work as beforeTesting
reboot(default) — verified produces same output as before viabash -xtrace; sysrq path unchangedlog-only— verified the script exits 0 with logger entry, no reboot/agent-restart attemptedrestart-agent— verifiedsystemctl restart cloudstack-agentis invokedgraceful-reboot— verifiedsystemctl rebootis invoked instead of sysrqIn production we have been running with the fence path neutered (equivalent to
log-only) for several hours since the incident, with no impact on cluster health — the host stays up while DRBD resyncs background-complete normally, and the previous false-positive cascade has not recurred.Related