feat(gpu-operator): add container variant for hosts with a pre-installed NVIDIA driver

## Summary

`cozystack.gpu-operator` ships two variants: `default` (passthrough: GPUs bound to vfio-pci for KubeVirt VMs) and `vgpu` (NVIDIA vGPU, requires a license server). Neither variant works on a host that already has the NVIDIA driver installed and needs its GPUs exposed to **containerized** workloads via the standard NVIDIA device plugin.

This is the most common shape of a Linux GPU node in the wild (apt-installed driver + nvidia-container-toolkit), and it currently has no out-of-the-box path through Cozystack.

## Proposal

Add a `container` variant under `packages/core/platform/sources/gpu-operator.yaml`:

```yaml
- name: container
  dependsOn:
  - cozystack.networking
  components:
  - name: gpu-operator
    path: system/gpu-operator
    valuesFiles:
    - values.yaml
    - values-container.yaml
    install:
      privileged: true
      namespace: cozy-gpu-operator
      releaseName: gpu-operator
```

`values-container.yaml` in `packages/system/gpu-operator/`:

```yaml
gpu-operator:
  driver:
    enabled: false        # use the host driver
  devicePlugin:
    enabled: true         # publish nvidia.com/gpu to the kubelet
  vfioManager:
    enabled: false        # do not unbind the host driver
  sandboxWorkloads:
    enabled: false
  # toolkit.enabled defaults to true upstream and that is correct here:
  # nvidia-container-toolkit configures containerd to inject /dev/nvidia*
  # into containers requesting nvidia.com/gpu.
```

With this variant, a user can set `bundles.enabledPackages: [cozystack.gpu-operator]` with `variant: container` and immediately run containerized GPU workloads (CUDA pods, ML training) against a pre-existing host driver, without touching VFIO or the host driver stack.
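As a rough illustration of what "containerized GPU workloads" means here, a smoke-test Pod along the following lines could be used to check that the device plugin is advertising `nvidia.com/gpu` and that the toolkit injects the devices. This is a minimal sketch, not part of the proposed change: the Pod name and the CUDA image tag are assumptions chosen for illustration, and some clusters may additionally need a `runtimeClassName` depending on how the container runtime is configured.

```yaml
# Hypothetical smoke-test Pod; the name and image tag are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never            # run once and keep the logs for inspection
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed tag; any CUDA base image should do
    command: ["nvidia-smi"]       # lists the GPUs visible inside the container
    resources:
      limits:
        nvidia.com/gpu: 1         # extended resource published by the device plugin
```

If the Pod stays `Pending`, the node is likely not advertising `nvidia.com/gpu`; if it starts but `nvidia-smi` fails, the runtime injection step is the more likely suspect.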

## Why this is a separate variant and not a values override

The existing `default` and `vgpu` variants both make architectural assumptions that fight the host driver:

- `default` runs `vfioManager`, which unbinds the host driver in order to bind `vfio-pci`. With a host driver present, `k8s-driver-manager` exits early and the bind never happens (see referenced issue).
- `vgpu` requires the NVIDIA vGPU host driver and a license server, which is a separate operational commitment.

`container` is the third architectural mode that the upstream NVIDIA gpu-operator already supports (`sandboxWorkloads.enabled: false` + `devicePlugin.enabled: true`); it is simply not surfaced as a Cozystack variant today.

## Cross-references

- Silent skip on pre-installed host driver (referenced issue) — would no longer be a footgun once `container` is available, because the user has a documented alternative.
- HAMi (fractional sharing) is orthogonal and stacks on top of `container` for tenants who want per-pod fractional GPU.

