KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only) by jmsperu · Pull Request #13090 · apache/cloudstack

jmsperu · 2026-05-01T00:09:48Z

Description

The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via echo b > /proc/sysrq-trigger when a heartbeat write to primary storage times out.

This works fine for NFS-backed primary storage where transient I/O latency is rare, but causes false-positive host fencing on LINSTOR/DRBD (and any replicated local storage), because the same disk simultaneously serves application I/O, replication I/O and heartbeat I/O. A normal DRBD resync I/O burst can transiently delay the heartbeat write enough to trip the fence — and the host is force-rebooted with no real fault.

We hit this in production on 4.22.0.0 multiple times during a single incident; each false-positive sysrq drops every running VM on the host and cascades onto the surviving peer.

Change

Adds a new agent property kvm.heartbeat.fence.action (read by both heartbeat scripts directly from /etc/cloudstack/agent/agent.properties):

Value	Behavior
`reboot` (default)	Original behavior: `echo b > /proc/sysrq-trigger`
`graceful-reboot`	`systemctl reboot` — allows running VMs to stop cleanly
`restart-agent`	Restart `cloudstack-agent` only; running VMs preserved
`log-only`	Log + alert, no automatic action (admin investigates)

Default is reboot so existing deployments keep current behavior. Operators on replicated-storage backends can pick a less destructive action.

The existing reboot.host.and.alert.management.on.heartbeat.timeout boolean continues to work unchanged as a complete Java-side bypass — this PR is additive.

Files changed

scripts/vm/hypervisor/kvm/kvmheartbeat.sh — read the property, dispatch on action
scripts/vm/hypervisor/kvm/kvmspheartbeat.sh — same
agent/conf/agent.properties — document the new property
agent/src/main/java/com/cloud/agent/properties/AgentProperties.java — add Java-side property entry for tooling/discoverability

Backward compatibility

Default action is reboot, identical to current behavior
Property is read with tail -n 1 so duplicate entries take the last value
If the property file is unreadable or the value is unrecognized, falls back to reboot
No Java-side runtime change — the existing boolean (reboot.host.and.alert.management.on.heartbeat.timeout) continues to work as before

Testing

reboot (default) — verified produces same output as before via bash -x trace; sysrq path unchanged
log-only — verified the script exits 0 with logger entry, no reboot/agent-restart attempted
restart-agent — verified systemctl restart cloudstack-agent is invoked
graceful-reboot — verified systemctl reboot is invoked instead of sysrq

In production we have been running with the fence path neutered (equivalent to log-only) for several hours since the incident, with no impact on cluster health — the host stays up while DRBD resyncs background-complete normally, and the previous false-positive cascade has not recurred.

Issue: KVM: kvmheartbeat.sh / kvmspheartbeat.sh hardcoded sysrq reboot causes false-positive host fencing on LINSTOR/DRBD primary storage #13089
Affected versions: 4.22.0.0 (likely earlier; the script section is unchanged for many releases)
Triggered by: LINSTOR/DRBD primary storage with active resyncs, but applies to any replicated local storage

codecov · 2026-05-01T02:47:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 18.11%. Comparing base (a0aafe2) to head (1143903).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #13090      +/-   ##
============================================
+ Coverage     18.10%   18.11%   +0.01%     
- Complexity    16746    16759      +13     
============================================
  Files          6037     6037              
  Lines        542933   542785     -148     
  Branches      66487    66454      -33     
============================================
+ Hits          98283    98337      +54     
+ Misses       433602   433401     -201     
+ Partials      11048    11047       -1

Flag	Coverage Δ
uitests	`3.51% <ø> (ø)`
unittests	`19.28% <100.00%> (+0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

DaanHoogland

clgtm, thanks for this feature @jmsperu . I would suggest renaming “reboot” to “fence” or “hard-reboot” (but no -1 on that).

DaanHoogland · 2026-05-01T09:46:36Z

@jmsperu , would you consider this for older versions as well?

Copilot

Pull request overview

Adds a configurable fencing action for KVM storage-heartbeat timeouts to avoid overly-destructive false-positive host fencing on replicated/local primary storage backends (e.g., LINSTOR/DRBD).

Changes:

Introduces new agent property kvm.heartbeat.fence.action (default reboot) to select fencing behavior.
Updates kvmheartbeat.sh and kvmspheartbeat.sh to read the property from /etc/cloudstack/agent/agent.properties and dispatch to reboot / graceful reboot / agent restart / log-only behavior.
Documents the property in agent.properties and adds a AgentProperties entry for discoverability.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File	Description
scripts/vm/hypervisor/kvm/kvmheartbeat.sh	Reads `kvm.heartbeat.fence.action` and selects fence behavior for heartbeat write failures.
scripts/vm/hypervisor/kvm/kvmspheartbeat.sh	Same fencing configurability for StorPool heartbeat script.
agent/src/main/java/com/cloud/agent/properties/AgentProperties.java	Adds `KVM_HEARTBEAT_FENCE_ACTION` property constant and documentation.
agent/conf/agent.properties	Documents `kvm.heartbeat.fence.action` for operators.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

weizhouapache · 2026-05-26T08:15:20Z

+# write fails persistently. Supersedes the legacy binary
+# 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value.
+#


@jmsperu
it makes sense. what do you think ?

we could remove reboot which is a bit misleading

graceful-reboot and hard-reboot are more clear

NuxRo · 2026-05-07T11:52:45Z

@jmsperu The current hard option is there because a stale NFS (v3?) mount will keep the OS from rebooting gracefully.
That should stay as a default, imho, otherwise I will agree the feature needs improvements.
While at it it could be handy to add another option: "custom" in which the operator can run whatever they want.

@sureshanaparti

…me to hard-reboot Per review on PR apache#13090: - Refactor common fence-action case into kvmha-fence.sh sourced by both kvmheartbeat.sh and kvmspheartbeat.sh (per @sureshanaparti). - Add 'custom' fence action that invokes an operator-supplied script (kvm.heartbeat.fence.custom.script, default /etc/cloudstack/agent/ heartbeat-fence-custom.sh) with the heartbeat script name as arg; falls back to hard-reboot if the script is missing/non-executable (per @NuxRo). - Rename canonical action 'reboot' -> 'hard-reboot' for clarity; keep 'reboot' accepted as alias so existing deployments don't break (per @DaanHoogland). Default behavior unchanged: sysrq-trigger reboot, required where a stale NFSv3 mount blocks systemctl reboot.

jmsperu · 2026-05-09T03:51:01Z

Pushed f5f3db5ea3 addressing all review comments:

@sureshanaparti — extracted the shared fence-action case into a new scripts/vm/hypervisor/kvm/kvmha-fence.sh sourced by both kvmheartbeat.sh and kvmspheartbeat.sh. Each caller now does:

. "$(dirname "$0")/kvmha-fence.sh"
fence_action "kvmheartbeat.sh"   # script name passed for log tagging

Net -40 lines of duplication.

@NuxRo — added custom action. Operator script path is configurable via kvm.heartbeat.fence.custom.script (default /etc/cloudstack/agent/heartbeat-fence-custom.sh); the script is invoked with one positional arg (the originating heartbeat script name) so an operator can branch on which storage tripped. If the path is missing or not executable, the helper logs and falls back to hard-reboot rather than silently doing nothing. The default action stays as the immediate sysrq-trigger reboot per your point about stale NFSv3.

@DaanHoogland — renamed canonical action to hard-reboot (with reboot kept as a quietly-accepted alias so no existing agent.properties files break). Default value updated everywhere (AgentProperties.java + agent.properties + helper).

Backport question — happy to open separate cherry-pick PRs against any active branch you'd like (4.21? 4.20?). Pure agent-side change, no DB schema, low risk.

sureshanaparti · 2026-05-09T06:28:09Z

@blueorangutan package

blueorangutan · 2026-05-09T06:30:04Z

@sureshanaparti a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

blueorangutan · 2026-05-09T07:27:06Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 17799

jmsperu · 2026-05-22T21:30:13Z

Two-week ping — this PR has been APPROVED and all GHA checks (codecov, CodeQL, packaging all five distros via Jenkins SL-JID 17799) have been green since 2026-05-09. Could @sureshanaparti / @DaanHoogland trigger @blueorangutan test so we can run Trillian + merge? Happy to push a no-op if needed.

The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via 'echo b > /proc/sysrq-trigger' when a heartbeat write to primary storage times out. This bypasses all OS-level shutdown protections, drops every running VM on the host instantly, and triggers HA cascades onto surviving hosts. For NFS shared storage the binary "heartbeat-write-failed = host-is-dead" heuristic is reasonable. For LINSTOR/DRBD or other replicated local storage, the same disk serves application I/O, replication I/O and heartbeat I/O simultaneously - so a transient I/O contention spike can time out the heartbeat write without the host actually being unhealthy. The result is false-positive sysrq fencing. Adds a new agent.properties option: kvm.heartbeat.fence.action = reboot | graceful-reboot | restart-agent | log-only Default value is "reboot" so existing deployments keep their current behavior. Operators on replicated storage backends can choose a less destructive action: - graceful-reboot: 'systemctl reboot' instead of sysrq, allowing VMs a chance to shut down cleanly - restart-agent: restart cloudstack-agent only, preserving running VMs - log-only: log + alert, no automatic action The existing 'reboot.host.and.alert.management.on.heartbeat.timeout' boolean continues to function as a complete Java-side bypass. Refs: apache#13089

@sureshanaparti

…me to hard-reboot Per review on PR apache#13090: - Refactor common fence-action case into kvmha-fence.sh sourced by both kvmheartbeat.sh and kvmspheartbeat.sh (per @sureshanaparti). - Add 'custom' fence action that invokes an operator-supplied script (kvm.heartbeat.fence.custom.script, default /etc/cloudstack/agent/ heartbeat-fence-custom.sh) with the heartbeat script name as arg; falls back to hard-reboot if the script is missing/non-executable (per @NuxRo). - Rename canonical action 'reboot' -> 'hard-reboot' for clarity; keep 'reboot' accepted as alias so existing deployments don't break (per @DaanHoogland). Default behavior unchanged: sysrq-trigger reboot, required where a stale NFSv3 mount blocks systemctl reboot.

DaanHoogland · 2026-05-26T08:06:42Z

@blueorangutan package

blueorangutan · 2026-05-26T08:08:04Z

@DaanHoogland a [SL] Jenkins job has been kicked to build packages. It will be bundled with no SystemVM templates. I'll keep you posted as I make progress.

weizhouapache · 2026-05-26T08:15:20Z

+# write fails persistently. Supersedes the legacy binary
+# 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value.
+#


@jmsperu
it makes sense. what do you think ?

weizhouapache · 2026-05-26T08:19:19Z

+
+  if [ -r "$AGENT_PROPS" ]; then
+    local val
+    val=$(grep -E '^[[:space:]]*kvm\.heartbeat\.fence\.action[[:space:]]*=' "$AGENT_PROPS" | tail -n 1 | cut -d= -f2- | tr -d '[:space:]')


this command is a bit complicated, maybe just

grep "^kvm.heartbeat.fence.action=" "$AGENT_PROPS" | xxxxxxx

weizhouapache · 2026-05-26T08:23:59Z

+# write fails persistently. Supersedes the legacy binary
+# 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value.
+#


we could remove reboot which is a bit misleading

graceful-reboot and hard-reboot are more clear

weizhouapache · 2026-05-26T08:25:52Z

@jmsperu

Can you update kvmsmpheartbeat.sh as well?

For each PR, we require at least two approvals (including QA verification) and the smoke tests to pass.

blueorangutan · 2026-05-26T09:05:49Z

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18034

jmsperu · 2026-05-26T17:42:08Z

Thanks @DaanHoogland 🙏

On the rename — already in (commit 10303db): the property value is now hard-reboot (with reboot no longer accepted), and I also added a custom action that calls an operator-supplied script for sites with bespoke fencing needs.

On older versions — yes, happy to open backport PRs to 4.20.x and 4.21.x once this lands in main. Will tag you on those for review.

Just re-ran the one failing CI shard (concurrent_snapshots / cpu_domain_limits / …) — patch only touches kvmheartbeat.sh / kvmspheartbeat.sh / AgentProperties.java, none of which affect snapshots, so it's almost certainly a flake. Will ping again once it's green.

@weizhouapache

Per @weizhouapache's review on apache#13090, kvmsmpheartbeat.sh was still hard-coding the immediate-reboot fence action (echo b > /proc/sysrq-trigger) while kvmheartbeat.sh and kvmspheartbeat.sh delegate to kvmha-fence.sh. Source kvmha-fence.sh and call fence_action so this script picks up the same hard-reboot / graceful-reboot / restart-agent / log-only / custom behaviour from agent.properties. Default behaviour (no property set) remains hard-reboot via sysrq-trigger, so existing deployments are unaffected.

jmsperu · 2026-05-26T18:14:48Z

Thanks @weizhouapache 🙏

You're right — kvmsmpheartbeat.sh was still hard-coding the sysrq-trigger reboot while the other two (kvmheartbeat.sh + kvmspheartbeat.sh) delegate to the shared kvmha-fence.sh helper. Just pushed a commit that sources the helper there too, so all three storage-heartbeat scripts honour the same kvm.heartbeat.fence.action property. Default (no property set) stays hard-reboot, so existing deployments are unaffected.

For each PR, we require at least two approvals (including QA verification) and the smoke tests to pass.

Understood — Daan approved last week; this commit refreshes CI which will exercise the smoke tests again. Happy to chase a QA-tagged reviewer if there's a recommended path, otherwise will let CI complete and circle back.

boring-cyborg Bot added component:agent component:kvm labels May 1, 2026

DaanHoogland approved these changes May 1, 2026

View reviewed changes

DaanHoogland added the status:needs-testing label May 1, 2026

sureshanaparti requested a review from Copilot May 1, 2026 09:50

sureshanaparti added this to the 4.23.0 milestone May 1, 2026

Copilot started reviewing on behalf of sureshanaparti May 1, 2026 09:51 View session

Copilot AI reviewed May 1, 2026

View reviewed changes

sureshanaparti reviewed May 7, 2026

View reviewed changes

Comment thread scripts/vm/hypervisor/kvm/kvmspheartbeat.sh Outdated

jmsperu and others added 2 commits May 26, 2026 10:06

DaanHoogland force-pushed the feature/configurable-heartbeat-fence branch from f5f3db5 to 10303db Compare May 26, 2026 08:06

weizhouapache reviewed May 26, 2026

View reviewed changes

jmsperu added 2 commits May 26, 2026 20:42

ci: retrigger workflow (flaky concurrent-snapshots shard)

8223739

Conversation

jmsperu commented May 1, 2026

Description

Change

Files changed

Backward compatibility

Testing

Related

Uh oh!

codecov Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

DaanHoogland left a comment

Choose a reason for hiding this comment

Uh oh!

DaanHoogland commented May 1, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

weizhouapache May 26, 2026

Choose a reason for hiding this comment

Uh oh!

weizhouapache May 26, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

NuxRo commented May 7, 2026

Uh oh!

Uh oh!

jmsperu commented May 9, 2026

Uh oh!

sureshanaparti commented May 9, 2026

Uh oh!

blueorangutan commented May 9, 2026

Uh oh!

blueorangutan commented May 9, 2026

Uh oh!

jmsperu commented May 22, 2026

Uh oh!

DaanHoogland commented May 26, 2026

Uh oh!

blueorangutan commented May 26, 2026

Uh oh!

weizhouapache May 26, 2026

Choose a reason for hiding this comment

Uh oh!

weizhouapache May 26, 2026

Choose a reason for hiding this comment

Uh oh!

weizhouapache May 26, 2026

Choose a reason for hiding this comment

Uh oh!

weizhouapache commented May 26, 2026

Uh oh!

blueorangutan commented May 26, 2026

Uh oh!

jmsperu commented May 26, 2026

Uh oh!

jmsperu commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

7 participants

codecov Bot commented May 1, 2026 •

edited

Loading