Add UFFD snapshot pager#262
Conversation
9c706a1 to
fb5341c
Compare
38b3cf8 to
6c8c898
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 6c8c898. Configure here.
| log.ErrorContext(ctx, "failed to resume VM", "instance_id", id, "error", err) | ||
| // Cleanup on failure | ||
| hv.Shutdown(ctx) | ||
| m.closeFirecrackerUFFDSession(ctx, stored) |
There was a problem hiding this comment.
Reconfigure failure leaks UFFD session
Medium Severity
After a successful UFFD snapshot restore, a failed post-resume reconfigureGuestNetwork shuts down the VM and rolls back admission but never calls closeFirecrackerUFFDSession. The pager session created during restore stays open, leaking resources until manual pager drain or restart.
Reviewed by Cursor Bugbot for commit 6c8c898. Configure here.
|
Created a monitoring plan for this PR. What this PR does: Adds opt-in lazy-memory paging for Firecracker snapshot restores using Linux UFFD. The default backend remains Intended effect:
Risks:
Status updates will be posted automatically on this PR as monitoring progresses. |
hiroTamada
left a comment
There was a problem hiding this comment.
reviewed the uffd pager slice — architecture looks solid (separate pager process, opt-in file backend, versioned systemd + drain). left a few nits on docs, restore cleanup, cache key wording, and pager vs session health. nice work.
| @@ -0,0 +1,26 @@ | |||
| # UFFD Snapshot Pager | |||
There was a problem hiding this comment.
nit: would be helpful to add a "control http api" section here — GET /health, GET /stats, POST /sessions, POST /sessions/{id}/close, POST /drain on {dataDir}/uffd/{version}/control.sock. firecracker uses a separate per-session unix socket (not http). routes live in server_linux.go but there's no single obvious place to discover them today.
| cache because page faults are latency-sensitive and the kernel-facing UFFD | ||
| socket is local. The process keeps one shared in-memory page cache, bounded by | ||
| `hypervisor.firecracker_uffd_cache_max_bytes`. Cache entries are keyed by a | ||
| snapshot cache key plus page offset, so multiple restore sessions from the same |
There was a problem hiding this comment.
this reads like multiple vms restoring the same snapshot share cache, but the default cache_key hashes stored.Id (see firecrackerSnapshotCacheKey) so different instances won't share pages. might be worth tweaking the wording or the key formula depending on what you want.
| log.ErrorContext(ctx, "failed to resume VM", "instance_id", id, "error", err) | ||
| // Cleanup on failure | ||
| hv.Shutdown(ctx) | ||
| m.closeFirecrackerUFFDSession(ctx, stored) |
There was a problem hiding this comment.
restore/resume failures close the uffd session here, but the reconfigureGuestNetwork failure path a few lines below (~313) doesn't — worth calling closeFirecrackerUFFDSession there too before return (+1 on bugbot).
| } | ||
| healthCtx, cancel := context.WithTimeout(ctx, 100*time.Millisecond) | ||
| defer cancel() | ||
| if _, err := m.firecrackerUFFDPager.HealthVersion(healthCtx, version); err != nil { |
There was a problem hiding this comment.
this only checks pager /health for the version, not whether this instance's session still exists. after an unplanned pager restart, health can pass while the vm's uffd session is gone. fine for v1 but worth a readme/runbook note or a follow-up session-level check.


Summary
Tests
Note
High Risk
Changes Firecracker snapshot restore and guest memory fault handling (UFFD), introduces a new host-side pager dependency, and affects instance lifecycle correctness if sessions or cache keys are mishandled.
Overview
Adds an opt-in Firecracker snapshot memory backend (
hypervisor.firecracker_snapshot_memory_backend=uffd, default remainsfile) so future snapshot restores can load guest RAM lazily via userfaultfd instead of mapping the full memory file up front.Introduces a dedicated
hypeman-uffd-pagerLinux binary andlib/uffdpager: versioned pager process (subprocess orhypeman-uffd@<version>.service), HTTP control API, per-restore UFFD sessions, and a bounded sharded LRU page cache keyed by snapshot cache key. CI enforces bumpinglib/uffdpager/VERSIONwhen pager runtime code changes.Firecracker restore now uses
mem_backend(FilevsUffdsocket path) via extendedRestoreOptions; the instance manager starts the pager supervisor on Linux when UFFD is enabled, tracks session/cache metadata, closes sessions on stop/delete/standby, and treats an unhealthy pager as Unknown state.Packaging and ops: build/install/release include the pager binary, systemd template, and config validation for backend and cache size;
RestoreVMsignatures are updated across hypervisors (non-FC paths ignore options).Reviewed by Cursor Bugbot for commit 6c8c898. Bugbot is set up for automated code reviews on this repo. Configure here.