Add fork mailbox payload API#267

Open
sjmiller609 wants to merge 14 commits into main from hypeship/fork-mailbox-payloads

Conversation

@sjmiller609
Collaborator

@sjmiller609 sjmiller609 commented Jun 1, 2026

Summary

  • add optional named JSON mailbox payloads to instance and snapshot fork requests
  • patch matching mailbox markers in standby snapshot memory before the fork resumes
  • optionally wait for guest UDP acknowledgements after resume when a mailbox requests it
  • share mailbox marker/layout helpers through lib/mailbox so host and guest-side integrations use the same contract
  • preserve existing fork network readiness behavior: running forks wait for network readiness, with resume-network fallback paths unchanged

Tests

  • make oapi-generate
  • git diff --check origin/main --
  • go test ./lib/mailbox ./lib/guest ./lib/system/guest_agent ./lib/oapi -count=1
  • go test ./cmd/api/api ./lib/instances -run 'Test(ForkInstanceSuccess|ForkSnapshotSuccess|PatchForkMailbox|ForkMailboxPayloadWithAckPort|ForkReturnReadinessDoesNotUseMailboxEligibility)' -count=1 (with local placeholder embedded binaries)

Full cmd/api/api and lib/instances package test runs still require embedded hypervisor, guest-agent, and caddy binaries in this checkout.

Note

High Risk
Changes fork/restore orchestration and direct snapshot-memory patching with new failure modes (missing markers, ack timeouts); limited to Firecracker standby forks targeting Running but still security- and correctness-sensitive.

Overview
Adds optional named JSON mailbox payloads to instance and snapshot fork APIs so callers can inject per-template data into guest memory before a forked standby VM resumes.

Fork requests accept a mailboxes array (name, token, JSON payload, optional wait_for_ack / ack_timeout_ms). The API layer maps these into domain types; fork/snapshot restore validates them (standby snapshot only, target_state Running, Firecracker only, up to 16 unique mailboxes). Before resume, the host locates guest memory markers in the snapshot, writes payloads (preflight then apply), and can inject ack_port and block on UDP stage=applied for that mailbox name.
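
The validation rules in the paragraph above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `ForkMailbox`, `ForkContext`, `validateForkMailboxes`, and the `"firecracker"` string value are hypothetical names, and treating "unique" as "unique by name" is an assumption. Only the rules themselves (standby snapshot source, Running target, Firecracker only, at most 16 mailboxes) come from the description.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ForkMailbox mirrors the request fields named in the PR description.
// The Go type itself is hypothetical.
type ForkMailbox struct {
	Name         string
	Token        string
	Payload      json.RawMessage
	WaitForAck   bool
	AckTimeoutMs int
}

// ForkContext is a hypothetical summary of the fork being requested.
type ForkContext struct {
	FromStandbySnapshot bool
	TargetState         string // e.g. "Running"
	Hypervisor          string // e.g. "firecracker" (assumed spelling)
}

const maxForkMailboxes = 16 // limit stated in the PR description

// validateForkMailboxes rejects unsupported mailbox forks before any fork work starts.
func validateForkMailboxes(fc ForkContext, boxes []ForkMailbox) error {
	if len(boxes) == 0 {
		return nil // mailboxes are optional
	}
	if !fc.FromStandbySnapshot {
		return errors.New("mailboxes require a standby snapshot source")
	}
	if fc.TargetState != "Running" {
		return errors.New("mailboxes require target_state Running")
	}
	if fc.Hypervisor != "firecracker" {
		return errors.New("mailboxes are only supported on Firecracker")
	}
	if len(boxes) > maxForkMailboxes {
		return fmt.Errorf("too many mailboxes: %d > %d", len(boxes), maxForkMailboxes)
	}
	seen := make(map[string]struct{}, len(boxes))
	for _, b := range boxes {
		if b.Name == "" || b.Token == "" {
			return errors.New("mailbox name and token are required")
		}
		if !json.Valid(b.Payload) {
			return fmt.Errorf("mailbox %q: payload is not valid JSON", b.Name)
		}
		if _, dup := seen[b.Name]; dup {
			return fmt.Errorf("duplicate mailbox name %q", b.Name)
		}
		seen[b.Name] = struct{}{}
	}
	return nil
}

func main() {
	fc := ForkContext{FromStandbySnapshot: true, TargetState: "Running", Hypervisor: "firecracker"}
	boxes := []ForkMailbox{
		{Name: "config", Token: "tok-1", Payload: json.RawMessage(`{"region":"a"}`), WaitForAck: true},
		{Name: "config", Token: "tok-2", Payload: json.RawMessage(`{}`)},
	}
	fmt.Println(validateForkMailboxes(fc, boxes)) // rejects the duplicate name
}
```

Running every check before any snapshot work begins keeps unsupported requests from failing later inside restore.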

lib/mailbox gains a shared fork layout plus FindMarker, EnsurePayloadFits, and WritePayloadAt, reused for resume-network patching and stricter UDP ack parsing. Restore now runs multiple post-resume handoffs (resume network + fork mailboxes) instead of inlining guest network reconfigure only.
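
The "preflight then apply" idea can be illustrated on an in-memory byte slice. The real signatures of FindMarker, EnsurePayloadFits, and WritePayloadAt are not shown in this conversation, so the helpers below (`findMarker`, `patchAll`, `newFrame`) are hypothetical stand-ins, and the frame shape (marker, then a 4-byte little-endian length, then a fixed payload area) is an assumption made purely for the example.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// layout describes an assumed frame shape: marker | uint32 length | payload area.
type layout struct {
	capacity int // bytes reserved for the payload after the length field
}

type patch struct {
	marker  []byte
	payload []byte
}

// newFrame builds an empty frame for the demo: marker followed by zeroed space.
func newFrame(marker string, capacity int) []byte {
	return append([]byte(marker), make([]byte, 4+capacity)...)
}

// findMarker returns the offset just past the first occurrence of marker.
func findMarker(mem, marker []byte) (int, error) {
	i := bytes.Index(mem, marker)
	if i < 0 {
		return 0, fmt.Errorf("marker %q not found", marker)
	}
	return i + len(marker), nil
}

// patchAll resolves and size-checks every patch first, and only then writes,
// so a failure on any mailbox leaves mem untouched.
func patchAll(mem []byte, l layout, patches []patch) error {
	offsets := make([]int, len(patches))
	for i, p := range patches { // preflight pass: no writes
		off, err := findMarker(mem, p.marker)
		if err != nil {
			return err
		}
		if len(p.payload) > l.capacity || off+4+l.capacity > len(mem) {
			return fmt.Errorf("payload for %q does not fit", p.marker)
		}
		offsets[i] = off
	}
	for i, p := range patches { // apply pass
		off := offsets[i]
		binary.LittleEndian.PutUint32(mem[off:], uint32(len(p.payload)))
		copy(mem[off+4:], p.payload)
	}
	return nil
}

func main() {
	l := layout{capacity: 32}
	mem := append(newFrame("MBX:a", l.capacity), newFrame("MBX:b", l.capacity)...)
	err := patchAll(mem, l, []patch{
		{marker: []byte("MBX:a"), payload: []byte(`{"id":1}`)},
		{marker: []byte("MBX:missing"), payload: []byte(`{}`)},
	})
	fmt.Println(err) // fails in preflight, so MBX:a was not written either
}
```

Splitting the work into two passes makes multi-mailbox patching structurally all-or-nothing instead of relying on outer cleanup.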

Reviewed by Cursor Bugbot for commit 1778dc2. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions

github-actions Bot commented Jun 1, 2026

✱ Stainless preview builds for hypeman

This PR will update the hypeman SDKs with the following commit message.

feat: Add fork mailbox payload API

Edit this comment to update it. It will appear in the SDK's changelogs.

hypeman-openapi studio · code · diff

Your SDK build had at least one "note" diagnostic, but this did not represent a regression.
generate ✅

hypeman-go studio · code · diff

Your SDK build had at least one "note" diagnostic, but this did not represent a regression.
generate ✅ build ✅ lint ✅ test ✅

go get github.com/stainless-sdks/hypeman-go@7238d9f233931f9b2301a9a6b86ebb9347998895
hypeman-typescript studio · code · diff

Your SDK build had at least one "note" diagnostic, but this did not represent a regression.
generate ✅ build ✅ lint ❗ test ✅

npm install https://pkg.stainless.com/s/hypeman-typescript/4d4a77c2b38e6a4fd29dce8a7e7b907cf4908d1a/dist.tar.gz

This comment is auto-generated by GitHub Actions and is automatically kept up to date as you push.
If you push custom code to the preview branch, re-run this workflow to update the comment.
Last updated: 2026-06-02 15:51:23 UTC

@sjmiller609 sjmiller609 force-pushed the hypeship/network-handoff-v2 branch 2 times, most recently from dffc792 to 05bc363 Compare June 1, 2026 13:52
Base automatically changed from hypeship/network-handoff-v2 to main June 1, 2026 19:18
…-payloads

# Conflicts:
#	cmd/api/api/snapshots.go
#	cmd/api/api/snapshots_test.go
#	lib/forkvm/README.md
#	lib/guest/client.go
#	lib/instances/fork.go
#	lib/instances/guest_resume_network.go
#	lib/instances/manager.go
#	lib/instances/restore.go
#	lib/instances/restore_egress_test.go
#	lib/system/guest_agent/resume_network.go
@sjmiller609 sjmiller609 marked this pull request as ready for review June 1, 2026 19:40
@firetiger-agent

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

PRs in the kernel, infra, hypeman, and hypeship repos. kernel is a ~mono repo with many logical services underneath, ensure to focus on the implicated service for the PR

Reason: PR repository cannot be determined from provided information; please confirm this is in the kernel, infra, hypeman, or hypeship repo to enable automatic deploy monitoring.

To monitor this PR anyway, reply with @firetiger monitor this.

Comment thread lib/instances/fork_mailbox.go
@sjmiller609 sjmiller609 requested a review from rgarcia June 1, 2026 20:33
Contributor

@rgarcia rgarcia left a comment

blocking on a few maintainability/design issues:

  • lib/instances/fork.go:50 and lib/instances/snapshot.go:390 only gate mailbox usage on standby + running target, but lib/instances/fork_mailbox.go:225 assumes a Firecracker-style raw memory snapshot file and fixed mailbox offsets. Non-Firecracker standby forks can accept mailbox requests and fail later inside restore. Please make this an explicit capability/hypervisor check before fork work begins, so the API rejects unsupported mailbox forks at the boundary.

  • lib/instances/fork_mailbox.go:209 duplicates the open/stat/mmap/find/write/cache logic already in lib/instances/guest_resume_network.go:186. This is the same mailbox-frame operation with a different layout. Please push the common primitive into lib/mailbox with layout config, instead of maintaining two snapshot-memory patchers that can drift. Ideally the fork path should also preflight all requested markers before writing any payloads, so multi-mailbox patching is structurally all-or-nothing rather than relying on outer fork cleanup.

  • lib/instances/guest_resume_network.go:143 adds mailbox ACK matching as raw substring checks. mailbox=foo can match mailbox=foobar, and any free-form UDP text containing stage=applied is accepted. Since this ACK now gates successful fork restore, please parse the key/value payload and require exact fields for stage and mailbox.
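
The exact-field matching requested in the last point could look roughly like this. It is a sketch under assumptions: the ACK datagram is taken to be whitespace-separated key=value tokens (the conversation only shows `stage=applied` and `mailbox=<name>`), and `parseAckFields` and `isMailboxApplied` are hypothetical names rather than functions from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// parseAckFields splits "k=v k=v" text into a map. Keys are lowercased and
// values are kept verbatim. Any token without "=" makes the datagram invalid.
func parseAckFields(msg string) (map[string]string, bool) {
	fields := make(map[string]string)
	for _, tok := range strings.Fields(msg) {
		k, v, ok := strings.Cut(tok, "=")
		if !ok || k == "" {
			return nil, false
		}
		fields[strings.ToLower(k)] = v
	}
	return fields, len(fields) > 0
}

// isMailboxApplied requires exact values for both stage and mailbox,
// so a substring such as "foo" inside "foobar" no longer matches.
func isMailboxApplied(msg, mailbox string) bool {
	f, ok := parseAckFields(msg)
	return ok && f["stage"] == "applied" && f["mailbox"] == mailbox
}

func main() {
	fmt.Println(isMailboxApplied("stage=applied mailbox=foo", "foo"))    // true
	fmt.Println(isMailboxApplied("stage=applied mailbox=foobar", "foo")) // false
	fmt.Println(isMailboxApplied("note: stage=applied soon", "foo"))     // false
}
```

Rejecting the whole datagram on a malformed token is a deliberately strict choice for the example; a more lenient parser could skip unknown tokens instead.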

@sjmiller609 sjmiller609 requested a review from rgarcia June 2, 2026 12:44
Contributor

@rgarcia rgarcia left a comment

  1. Suggestion, maybe too early: Unify the two handoffs into one interface + slice (main ask).
    After this PR, restoreInstance prepares two handoffs back-to-back (restore.go:263 and :271) that now implement the identical trio — prepare… / Close() / AfterResume(ctx) error — and the two post-resume teardown blocks (restore.go:332-341 and :343-349) are line-for-line the same recovery (hv.Shutdown → rollbackAdmissionAllocationActive → releaseNetwork → wrapped return). That's the setup for:
type resumeHandoff interface {
    AfterResume(ctx context.Context) error
    Close()
}
handoffs := []resumeHandoff{resumeNetworkHandoff, forkMailboxHandoff}
for _, h := range handoffs { defer h.Close() }
// after resume, in order:
for _, h := range handoffs {
    if err := h.AfterResume(ctx); err != nil { /* one shared teardown */ }
}

Order is preserved (network up before the guest can UDP-ack), so the slice is safe. Collapses the duplicated teardown and makes a third handoff one line. No guest changes needed.

  2. Light suggestion: Collapse the duplicated UDP-wait machinery.
    WaitApplied and WaitMailboxApplied (guest_resume_network.go:119 and :140) are methods on the same waiter, reading the same channel, with identical select loops; they differ only in the match predicate. Suggest one wait(ctx, match func(fields map[string]string) bool) and route both through parseUDPAckFields. That also removes the case-handling mismatch between them (one lowercases the whole string, the other only the keys).

  3. WriteForkMailboxPayloadAt is a thin wrapper used only by its own test.
    mailbox.go:110 just calls WritePayloadAt(w, ForkLayout, …); production calls WritePayloadAt directly (fork_mailbox.go:272), and the only caller of the wrapper is TestWriteForkMailboxPayloadAt. Suggest deleting it and pointing the test at WritePayloadAt.

  4. AckTimeout validation is asymmetric + message is off.
    fork_mailbox.go:99-104: the < 0 check is gated on WaitForAck but the > 30s check isn't, and "must be positive" is misleading since 0 is valid (maps to the 2s default). Suggest aligning the gating and changing the message to "must not be negative."

  5. Question on wait_for_ack failure semantics.
    On a missing ack, a fork that resumed successfully gets torn down (restore.go:343). Two asymmetries vs the resume-network path worth confirming are intentional: that path falls back to host-initiated reconfigure (resume_network_handoff.go:81) rather than aborting, and its default timeout is 5s vs 2s here. Since the guest-side consumer of a fork mailbox is caller-implemented, wait_for_ack=true against a guest that doesn't ack will reliably destroy the fork after 2s; fine if that's the intended contract, but worth a note on the API field and a deliberate choice on the 2s vs 5s default.
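
One way to read the AckTimeout suggestion is sketched below. The 2s default and 30s ceiling come from the review text; the function name `normalizeAckTimeout`, the constant names, and the choice to gate both bounds on WaitForAck (rather than applying both unconditionally) are assumptions, since "aligning the gating" could go either way.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

const (
	defaultAckTimeout = 2 * time.Second  // default mentioned in the review
	maxAckTimeout     = 30 * time.Second // upper bound mentioned in the review
)

// normalizeAckTimeout applies both bounds only when an ack is requested.
// Zero is valid and maps to the default.
func normalizeAckTimeout(waitForAck bool, timeout time.Duration) (time.Duration, error) {
	if !waitForAck {
		return 0, nil // assumption: the timeout is ignored when no ack is requested
	}
	if timeout < 0 {
		return 0, errors.New("ack timeout must not be negative")
	}
	if timeout > maxAckTimeout {
		return 0, fmt.Errorf("ack timeout must not exceed %s", maxAckTimeout)
	}
	if timeout == 0 {
		return defaultAckTimeout, nil
	}
	return timeout, nil
}

func main() {
	fmt.Println(normalizeAckTimeout(true, 0))
	fmt.Println(normalizeAckTimeout(true, -time.Millisecond))
	fmt.Println(normalizeAckTimeout(true, time.Minute))
	fmt.Println(normalizeAckTimeout(false, time.Minute))
}
```

Whichever gating is chosen, keeping both bounds behind the same condition avoids rejecting a value in one direction that would be silently ignored in the other.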

Comment thread openapi.yaml
type: string
description: Per-template mailbox token used to identify the guest memory marker.
minLength: 1
maxLength: 128
Contributor

Discoverability gap — there's no way for a fork caller to enumerate which mailbox names/tokens exist in a given snapshot. A wrong/typo'd token just surfaces as a generic "marker not found"-style failure at fork time. Feels strange for the token to be an out-of-band contract that assumes knowledge of how the guest sets this token?

Comment thread openapi.yaml
Optional JSON mailbox payloads to patch into a standby snapshot before resuming the fork.
Each mailbox must correspond to a guest-side mailbox marker that was present when the
source snapshot was captured. Mailboxes are only supported for forks that restore from a
standby snapshot into Running state.
Contributor

A reader could reasonably expect a snapshot-side API given the phrasing here. Consider adding a sentence like "The marker is written by guest software before the standby snapshot is captured; Hypeman does not create or return these tokens" if we're keeping this as an out-of-band contract.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 1778dc2. Configure here.

Comment thread lib/instances/restore.go
forkMailboxHandoff, err := m.prepareForkMailboxHandoff(ctx, stored, snapshotDir, opts.Mailboxes)
if err != nil {
releaseNetwork()
return nil, fmt.Errorf("prepare fork mailbox handoff: %w", err)

Resume network handoff leaks UDP socket on error

Medium Severity

If prepareForkMailboxHandoff fails, the already-created resumeNetworkHandoff (which may hold an open UDP socket via its ackWaiter) is never closed. The defer closeRestoreHandoffs(handoffs) is registered only after both handoffs succeed, so the early return on line 281 bypasses cleanup of the first handoff.
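
A common way to avoid this kind of leak is to close whatever has already been prepared as soon as a later preparation step fails. The sketch below shows that pattern in isolation; `handoff`, `fakeHandoff`, `prepareAll`, `okStep`, and `failStep` are hypothetical names for the example and are not the PR's types or its closeRestoreHandoffs helper.

```go
package main

import (
	"errors"
	"fmt"
)

type handoff interface{ Close() }

// fakeHandoff records its name in a shared slice when closed.
type fakeHandoff struct {
	name   string
	closed *[]string
}

func (h fakeHandoff) Close() { *h.closed = append(*h.closed, h.name) }

// okStep returns a preparation step that succeeds.
func okStep(name string, closed *[]string) func() (handoff, error) {
	return func() (handoff, error) { return fakeHandoff{name, closed}, nil }
}

// failStep returns a preparation step that always fails.
func failStep(msg string) func() (handoff, error) {
	return func() (handoff, error) { return nil, errors.New(msg) }
}

// prepareAll builds handoffs in order. If any step fails, the handoffs that
// were already built are closed before the error is returned.
func prepareAll(steps []func() (handoff, error)) (handoffs []handoff, err error) {
	defer func() {
		if err != nil {
			for _, h := range handoffs {
				h.Close()
			}
			handoffs = nil
		}
	}()
	for _, step := range steps {
		h, stepErr := step()
		if stepErr != nil {
			return handoffs, fmt.Errorf("prepare handoff: %w", stepErr)
		}
		handoffs = append(handoffs, h)
	}
	return handoffs, nil
}

func main() {
	var closed []string
	_, err := prepareAll([]func() (handoff, error){
		okStep("resume-network", &closed),
		failStep("marker not found"),
	})
	fmt.Println(err, closed) // the first handoff is closed despite the early return
}
```

On success the caller still owns the returned handoffs and would defer their Close calls as usual.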

