Skip to content

feat(gateway): add resource defaults and grpc rate limiting#1566

Open
alangou wants to merge 2 commits into
NVIDIA:mainfrom
alangou:alangou/os-76-fsr-025-add-default-cgroup-resource-limits-and-rate-limiting
Open

feat(gateway): add resource defaults and grpc rate limiting#1566
alangou wants to merge 2 commits into
NVIDIA:mainfrom
alangou:alangou/os-76-fsr-025-add-default-cgroup-resource-limits-and-rate-limiting

Conversation

@alangou
Copy link
Copy Markdown
Contributor

@alangou alangou commented May 26, 2026

Summary

Adds gateway-level defaults for sandbox CPU and memory limits, plus an optional gateway-wide gRPC request rate limit. This gives operators safer default resource controls while keeping PID limits and per-principal rate limiting out of scope for this draft.

Related Issue

Closes OS-76

Changes

  • Added gateway config fields and CLI/TOML parsing for default sandbox CPU and memory limits.
  • Applies configured CPU and memory defaults when sandbox requests omit explicit limits, while preserving caller-provided limits.
  • Added optional gateway-wide gRPC rate limiting for gRPC traffic only; health, metrics, and local sandbox-service HTTP routes remain outside this control.
  • Updated architecture and gateway config docs for the new settings.

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Additional targeted checks run during development:

  • mise exec -- cargo test -p openshell-core grpc_rate_limit --lib
  • mise exec -- cargo test -p openshell-server grpc_rate_limit --lib
  • mise exec -- cargo clippy -p openshell-server --all-targets -- -D warnings

Open Questions

PID Limit Semantics

  • Should default_sandbox_pids_limit ship in a follow-up, or stay out until the driver contract has a typed PID field?
  • The current patch does not expose a gateway config knob or inject default template.resources.limits.pids.
  • User-supplied template.resources.limits.pids is currently preserved in the public sandbox spec but ignored when building driver requests. Should unsupported drivers reject this explicitly instead of silently ignoring it?
  • Kubernetes does not expose per-container PID caps through resources.limits.pids; confirm whether any Kubernetes-side PID control is possible through the sandbox controller, kubelet podPidsLimit, or a different mechanism.
  • Docker and Podman support per-container PID caps with --pids-limit, but the current compute driver proto only types CPU and memory. Decide whether to add a typed field, use driver-specific config, or leave Docker/Podman PID defaults out of scope.

Driver Consistency

  • Unsupported resource fields behave differently across drivers today: some reject raw platform_config, some ignore passthrough resources, and some apply backend-specific fallbacks.
  • Decide whether unsupported resource defaults should fail fast, be silently ignored, or be suppressed by the gateway before dispatch.
  • Consider standardizing resources_raw as a best-effort contract accepted by every driver, where supported fields are applied and unsupported fields become no-ops with warnings.
  • If best-effort resources_raw is adopted, decide where warnings should surface so users do not get a false security signal from accepted-but-unenforced limits.
  • Check whether Podman's existing resource fallback defaults are intentional, and whether gateway-injected CPU/memory should replace or coexist with those defaults.

gRPC Rate Limit Scope

  • This PR implements a gateway-wide gRPC limit only.
  • Per-user and per-sandbox buckets are intentionally out of scope for now.
  • Follow-up work can evaluate identity-aware limits if authenticated callers starving the global bucket becomes a priority.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

alangou added 2 commits May 26, 2026 17:16
Signed-off-by: Adrien Langou <alangou@nvidia.com>
Signed-off-by: Adrien Langou <alangou@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 26, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@alangou alangou marked this pull request as ready for review May 26, 2026 17:01
@johntmyers johntmyers added topic:security Security issues test:e2e Requires end-to-end coverage labels May 27, 2026
@github-actions
Copy link
Copy Markdown

Label test:e2e applied for 8f89ad2. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

test:e2e Requires end-to-end coverage topic:security Security issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants