feat(gateway): add resource defaults and grpc rate limiting by alangou · Pull Request #1566 · NVIDIA/OpenShell

alangou · 2026-05-26T16:09:24Z

Summary

Adds gateway-level defaults for sandbox CPU and memory limits, plus an optional gateway-wide gRPC request rate limit. This gives operators safer default resource controls while keeping PID limits and per-principal rate limiting out of scope for this draft.

Related Issue

Closes OS-76

Changes

Added gateway config fields and CLI/TOML parsing for default sandbox CPU and memory limits.
Applies configured CPU and memory defaults when sandbox requests omit explicit limits, while preserving caller-provided limits.
Added optional gateway-wide gRPC rate limiting for gRPC traffic only; health, metrics, and local sandbox-service HTTP routes remain outside this control.
Updated architecture and gateway config docs for the new settings.

Testing

mise run pre-commit passes
Unit tests added/updated
E2E tests added/updated (if applicable)

Additional targeted checks run during development:

mise exec -- cargo test -p openshell-core grpc_rate_limit --lib
mise exec -- cargo test -p openshell-server grpc_rate_limit --lib
mise exec -- cargo clippy -p openshell-server --all-targets -- -D warnings

Open Questions

PID Limit Semantics

Should default_sandbox_pids_limit ship in a follow-up, or stay out until the driver contract has a typed PID field?
The current patch does not expose a gateway config knob or inject default template.resources.limits.pids.
User-supplied template.resources.limits.pids is currently preserved in the public sandbox spec but ignored when building driver requests. Should unsupported drivers reject this explicitly instead of silently ignoring it?
Kubernetes does not expose per-container PID caps through resources.limits.pids; confirm whether any Kubernetes-side PID control is possible through the sandbox controller, kubelet podPidsLimit, or a different mechanism.
Docker and Podman support per-container PID caps with --pids-limit, but the current compute driver proto only types CPU and memory. Decide whether to add a typed field, use driver-specific config, or leave Docker/Podman PID defaults out of scope.

Driver Consistency

Unsupported resource fields behave differently across drivers today: some reject raw platform_config, some ignore passthrough resources, and some apply backend-specific fallbacks.
Decide whether unsupported resource defaults should fail fast, be silently ignored, or be suppressed by the gateway before dispatch.
Consider standardizing resources_raw as a best-effort contract accepted by every driver, where supported fields are applied and unsupported fields become no-ops with warnings.
If best-effort resources_raw is adopted, decide where warnings should surface so users do not get a false security signal from accepted-but-unenforced limits.
Check whether Podman's existing resource fallback defaults are intentional, and whether gateway-injected CPU/memory should replace or coexist with those defaults.

gRPC Rate Limit Scope

This PR implements a gateway-wide gRPC limit only.
Per-user and per-sandbox buckets are intentionally out of scope for now.
Follow-up work can evaluate identity-aware limits if authenticated callers starving the global bucket becomes a priority.

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (if applicable)

Signed-off-by: Adrien Langou <alangou@nvidia.com>

copy-pr-bot · 2026-05-26T16:09:28Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

github-actions · 2026-05-27T15:10:43Z

Label test:e2e applied for 8f89ad2. Open the existing run and click Re-run all jobs to execute with the label set. The run will execute the standard E2E suite after building the required gateway and supervisor images once. The matching required CI gate status on this PR will flip green automatically once the run finishes.

alangou added 2 commits May 26, 2026 17:16

feat(gateway): apply default cgroup limits to sandboxes

58059f7

Signed-off-by: Adrien Langou <alangou@nvidia.com>

feat(server): add grpc rate limiting gateway-wide

8f89ad2

Signed-off-by: Adrien Langou <alangou@nvidia.com>

alangou marked this pull request as ready for review May 26, 2026 17:01

alangou requested review from a team, derekwaynecarr, maxamillion and mrunalp as code owners May 26, 2026 17:01

johntmyers added topic:security Security issues test:e2e Requires end-to-end coverage labels May 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(gateway): add resource defaults and grpc rate limiting#1566

feat(gateway): add resource defaults and grpc rate limiting#1566
alangou wants to merge 2 commits into
NVIDIA:mainfrom
alangou:alangou/os-76-fsr-025-add-default-cgroup-resource-limits-and-rate-limiting

alangou commented May 26, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented May 26, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

alangou commented May 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Related Issue

Changes

Testing

Open Questions

PID Limit Semantics

Driver Consistency

gRPC Rate Limit Scope

Checklist

Uh oh!

copy-pr-bot Bot commented May 26, 2026

Uh oh!

github-actions Bot commented May 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

alangou commented May 26, 2026 •

edited

Loading