Add UFFD snapshot pager#262

Open
sjmiller609 wants to merge 10 commits into main from hypeship/uffd-pager-v2

@sjmiller609 sjmiller609 commented Jun 1, 2026

Summary

  • adds a config-gated Firecracker UFFD memory backend for snapshot restore
  • starts a per-restore UFFD pager session backed by the snapshot memory file and an optional shared page cache
  • applies resume-network mailbox updates through UFFD overlay pages so restore does not mutate the backing memory file
  • shards the pager cache and tracks pager timing counters for page-fault, lookup, backing-read, and copy latency

Tests

  • go test ./lib/hypervisor/firecracker ./lib/uffdpager ./lib/mailbox ./lib/guest ./lib/system/guest_agent ./lib/oapi -count=1

Note

High Risk
Changes Firecracker snapshot restore and guest memory fault handling (UFFD), introduces a new host-side pager dependency, and affects instance lifecycle correctness if sessions or cache keys are mishandled.

Overview
Adds an opt-in Firecracker snapshot memory backend (hypervisor.firecracker_snapshot_memory_backend=uffd, default remains file) so future snapshot restores can load guest RAM lazily via userfaultfd instead of mapping the full memory file up front.

Introduces a dedicated hypeman-uffd-pager Linux binary and lib/uffdpager: versioned pager process (subprocess or hypeman-uffd@<version>.service), HTTP control API, per-restore UFFD sessions, and a bounded sharded LRU page cache keyed by snapshot cache key. CI enforces bumping lib/uffdpager/VERSION when pager runtime code changes.
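
A bounded, sharded LRU keyed by snapshot cache key plus page offset could look roughly like the sketch below. This is not the code in lib/uffdpager/cache.go: every type and function name here (`PageCache`, `pageKey`, `NewPageCache`), the 4 KiB page size, and the FNV-based shard selection are assumptions chosen for illustration.

```go
package main

import (
	"container/list"
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"sync"
)

const pageSize = 4096 // assumed page size for this sketch

// pageKey identifies one cached page: snapshot cache key plus byte offset.
type pageKey struct {
	cacheKey string
	offset   uint64
}

type entry struct {
	key  pageKey
	data []byte
}

// shard is an independent LRU with its own lock, so faults on different
// shards do not contend with each other.
type shard struct {
	mu       sync.Mutex
	maxPages int
	order    *list.List // front = most recently used
	items    map[pageKey]*list.Element
}

// PageCache is a hypothetical stand-in for the pager's shared cache.
type PageCache struct{ shards []*shard }

// NewPageCache splits the byte budget evenly across shards.
func NewPageCache(maxBytes, shardCount int) *PageCache {
	if shardCount < 1 {
		shardCount = 1
	}
	perShard := maxBytes / pageSize / shardCount
	if perShard < 1 {
		perShard = 1
	}
	c := &PageCache{shards: make([]*shard, shardCount)}
	for i := range c.shards {
		c.shards[i] = &shard{maxPages: perShard, order: list.New(), items: map[pageKey]*list.Element{}}
	}
	return c
}

// shardFor hashes the full key so pages of one snapshot spread over shards.
func (c *PageCache) shardFor(k pageKey) *shard {
	h := fnv.New64a()
	h.Write([]byte(k.cacheKey))
	var off [8]byte
	binary.LittleEndian.PutUint64(off[:], k.offset)
	h.Write(off[:])
	return c.shards[h.Sum64()%uint64(len(c.shards))]
}

// Get returns the cached page and marks it most recently used.
func (c *PageCache) Get(cacheKey string, offset uint64) ([]byte, bool) {
	k := pageKey{cacheKey: cacheKey, offset: offset}
	s := c.shardFor(k)
	s.mu.Lock()
	defer s.mu.Unlock()
	el, ok := s.items[k]
	if !ok {
		return nil, false
	}
	s.order.MoveToFront(el)
	return el.Value.(*entry).data, true
}

// Put stores a page and evicts least recently used pages over the bound.
func (c *PageCache) Put(cacheKey string, offset uint64, data []byte) {
	k := pageKey{cacheKey: cacheKey, offset: offset}
	s := c.shardFor(k)
	s.mu.Lock()
	defer s.mu.Unlock()
	if el, ok := s.items[k]; ok {
		el.Value.(*entry).data = data
		s.order.MoveToFront(el)
		return
	}
	s.items[k] = s.order.PushFront(&entry{key: k, data: data})
	for s.order.Len() > s.maxPages {
		oldest := s.order.Back()
		s.order.Remove(oldest)
		delete(s.items, oldest.Value.(*entry).key)
	}
}

func main() {
	c := NewPageCache(64*pageSize, 4)
	c.Put("snapshot-a", 0, make([]byte, pageSize))
	_, hit := c.Get("snapshot-a", 0)
	_, miss := c.Get("snapshot-b", 0)
	fmt.Println("same key hit:", hit, "other key hit:", miss)
}
```

Because the cache key is part of the lookup key, two sessions share pages only when they present the same cache key string.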

Firecracker restore now uses mem_backend (File vs Uffd socket path) via extended RestoreOptions; the instance manager starts the pager supervisor on Linux when UFFD is enabled, tracks session/cache metadata, closes sessions on stop/delete/standby, and treats an unhealthy pager as Unknown state.
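
As a rough illustration of the File vs Uffd choice, the sketch below builds a reduced snapshot-load body. The `mem_backend` object with `backend_type` and `backend_path` follows Firecracker's snapshot load API; the Go type names, the `buildSnapshotLoad` helper, its parameters, and the example paths are hypothetical and do not reflect the PR's `RestoreOptions` or `snapshotLoadParams`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// memBackend mirrors the mem_backend object of a Firecracker snapshot-load request.
type memBackend struct {
	BackendType string `json:"backend_type"` // "File" or "Uffd"
	BackendPath string `json:"backend_path"` // memory file path, or UFFD unix socket path
}

// snapshotLoadBody is a reduced request body; real requests may carry more fields.
type snapshotLoadBody struct {
	SnapshotPath string     `json:"snapshot_path"`
	MemBackend   memBackend `json:"mem_backend"`
}

// buildSnapshotLoad selects the memory backend from a configured value.
// Treating an empty value as "file" is an assumption based on file being the default.
func buildSnapshotLoad(backend, snapshotPath, memFile, uffdSocket string) (snapshotLoadBody, error) {
	body := snapshotLoadBody{SnapshotPath: snapshotPath}
	switch backend {
	case "", "file":
		body.MemBackend = memBackend{BackendType: "File", BackendPath: memFile}
	case "uffd":
		if uffdSocket == "" {
			return snapshotLoadBody{}, fmt.Errorf("uffd backend requires a socket path")
		}
		body.MemBackend = memBackend{BackendType: "Uffd", BackendPath: uffdSocket}
	default:
		return snapshotLoadBody{}, fmt.Errorf("unknown snapshot memory backend %q", backend)
	}
	return body, nil
}

func main() {
	// All paths below are placeholders.
	body, err := buildSnapshotLoad("uffd", "/snap/vmstate", "/snap/memory", "/run/uffd/session.sock")
	if err != nil {
		panic(err)
	}
	out, _ := json.Marshal(body)
	fmt.Println(string(out))
}
```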

Packaging and ops: build/install/release include the pager binary, systemd template, and config validation for backend and cache size; RestoreVM signatures are updated across hypervisors (non-FC paths ignore options).

Reviewed by Cursor Bugbot for commit 6c8c898. Bugbot is set up for automated code reviews on this repo.

@sjmiller609 sjmiller609 force-pushed the hypeship/uffd-pager-v2 branch from 9c706a1 to fb5341c Compare June 1, 2026 19:53
@sjmiller609 sjmiller609 changed the base branch from hypeship/fc-resume-on-load-v2 to main June 1, 2026 19:53
Comment thread cmd/api/config/config.go
Comment thread cmd/api/config/config.go
Comment thread lib/instances/firecracker_uffd.go
Comment thread lib/instances/guest_resume_network.go Outdated
Comment thread lib/uffdpager/cache.go
Comment thread lib/uffdpager/server_linux.go
Comment thread lib/uffdpager/server_linux.go Outdated
Comment thread lib/uffdpager/supervisor_linux.go
Comment thread cmd/api/config/config.go
Comment thread scripts/install.sh
@sjmiller609 sjmiller609 force-pushed the hypeship/uffd-pager-v2 branch from 38b3cf8 to 6c8c898 Compare June 3, 2026 15:06
@sjmiller609 sjmiller609 marked this pull request as ready for review June 3, 2026 15:32

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Comment thread lib/instances/restore.go
log.ErrorContext(ctx, "failed to resume VM", "instance_id", id, "error", err)
// Cleanup on failure
hv.Shutdown(ctx)
m.closeFirecrackerUFFDSession(ctx, stored)

Reconfigure failure leaks UFFD session

Medium Severity

After a successful UFFD snapshot restore, a failed post-resume reconfigureGuestNetwork shuts down the VM and rolls back admission but never calls closeFirecrackerUFFDSession. The pager session created during restore stays open, leaking resources until manual pager drain or restart.

@firetiger-agent

Created a monitoring plan for this PR.

What this PR does: Adds opt-in lazy-memory paging for Firecracker snapshot restores using Linux UFFD. The default backend remains file — no behavior changes until an operator explicitly sets FIRECRACKER_SNAPSHOT_MEMORY_BACKEND=uffd on a hypeman node.

Intended effect:

  • Snapshot restore success rate: baseline 0 WARN-level restore failures/hr; confirmed if "failed to restore from snapshot" and "configure snapshot memory backend" logs remain at 0 after deploy
  • Instance spawn rate: baseline 10K–31K/hr; confirmed if rate stays within range (no regression from the RestoreVM interface or mem_backend payload change)
  • API 5xx error rate: baseline 0.006–0.026%; confirmed if no sustained increase post-deploy

Risks:

  • mem_backend Firecracker API incompatibility: snapshotLoadParams now sends a mem_backend struct instead of mem_file_path; alert if any "load firecracker snapshot" ERROR appears post-deploy (baseline: 0/hr)
  • Hypeman node crash at startup: NewManagerWithConfig now panics if the UFFD pager fails to start; alert if any hypeman process restart occurs within 1h of deploy (only applies when the uffd backend is set, but worth confirming)
  • Stale UFFD sessions on failure: on restore failure, UFFD session cleanup must succeed within 2s; alert if "failed to close firecracker uffd session" WARN appears (baseline: 0/hr; relevant when UFFD mode is first activated)
  • StateUnknown proliferation: the UFFD session health check during instance query can return StateUnknown if the pager dies; alert if any "firecracker uffd session is unhealthy" WARN log appears (baseline: 0/hr; relevant when UFFD mode is active)

Status updates will be posted automatically on this PR as monitoring progresses.

@sjmiller609 sjmiller609 requested a review from rgarcia June 3, 2026 15:43
Contributor

@hiroTamada hiroTamada left a comment

reviewed the uffd pager slice. architecture looks solid (separate pager process, opt-in uffd backend with file as the default, versioned systemd + drain). left a few nits on docs, restore cleanup, cache key wording, and pager vs session health. nice work.

Comment thread lib/uffdpager/README.md
@@ -0,0 +1,26 @@
# UFFD Snapshot Pager

nit: would be helpful to add a "control http api" section here — GET /health, GET /stats, POST /sessions, POST /sessions/{id}/close, POST /drain on {dataDir}/uffd/{version}/control.sock. firecracker uses a separate per-session unix socket (not http). routes live in server_linux.go but there's no single obvious place to discover them today.

Comment thread lib/uffdpager/README.md
cache because page faults are latency-sensitive and the kernel-facing UFFD
socket is local. The process keeps one shared in-memory page cache, bounded by
`hypervisor.firecracker_uffd_cache_max_bytes`. Cache entries are keyed by a
snapshot cache key plus page offset, so multiple restore sessions from the same

this reads like multiple vms restoring the same snapshot share cache, but the default cache_key hashes stored.Id (see firecrackerSnapshotCacheKey) so different instances won't share pages. might be worth tweaking the wording or the key formula depending on what you want.

Comment thread lib/instances/restore.go
log.ErrorContext(ctx, "failed to resume VM", "instance_id", id, "error", err)
// Cleanup on failure
hv.Shutdown(ctx)
m.closeFirecrackerUFFDSession(ctx, stored)

restore/resume failures close the uffd session here, but the reconfigureGuestNetwork failure path a few lines below (~313) doesn't — worth calling closeFirecrackerUFFDSession there too before return (+1 on bugbot).

}
healthCtx, cancel := context.WithTimeout(ctx, 100*time.Millisecond)
defer cancel()
if _, err := m.firecrackerUFFDPager.HealthVersion(healthCtx, version); err != nil {

this only checks pager /health for the version, not whether this instance's session still exists. after an unplanned pager restart, health can pass while the vm's uffd session is gone. fine for v1 but worth a readme/runbook note or a follow-up session-level check.
