Summary
cozystack.gpu-operator ships two variants: default (passthrough — vfio-pci for KubeVirt VMs) and vgpu (NVIDIA vGPU, requires a license server). There is no variant that works with a host that already has the NVIDIA driver installed and wants the GPUs available to containerized workloads via the standard NVIDIA device plugin.
This is the most common shape of a Linux GPU node in the wild (apt-installed driver + nvidia-container-toolkit), and it currently has no out-of-the-box path through Cozystack.
Proposal
Add a container variant under packages/core/platform/sources/gpu-operator.yaml:
- name: container
  dependsOn:
    - cozystack.networking
  components:
    - name: gpu-operator
      path: system/gpu-operator
      valuesFiles:
        - values.yaml
        - values-container.yaml
      install:
        privileged: true
        namespace: cozy-gpu-operator
        releaseName: gpu-operator
values-container.yaml in packages/system/gpu-operator/:
gpu-operator:
  driver:
    enabled: false # use the host driver
  devicePlugin:
    enabled: true # publish nvidia.com/gpu to the kubelet
  vfioManager:
    enabled: false # do not unbind the host driver
  sandboxWorkloads:
    enabled: false
  # toolkit.enabled defaults to true upstream and that is correct here:
  # nvidia-container-toolkit configures containerd to inject /dev/nvidia*
  # into containers requesting nvidia.com/gpu.
With this variant, a user can set bundles.enabledPackages: [cozystack.gpu-operator] with variant: container and immediately run containerized GPU workloads (CUDA pods, ML training) against a pre-existing host driver, without touching VFIO or the host driver stack.
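As a rough illustration of what "containerized GPU workloads" means in practice, a smoke-test pod along the following lines could be used to confirm that the device plugin is publishing nvidia.com/gpu. This is a sketch, not part of the proposal: the pod name and image tag are assumptions chosen for illustration, and depending on how the container runtime ends up configured on the node, a runtimeClassName may also be needed.

```yaml
# Hypothetical smoke test: the pod name and image tag are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  # runtimeClassName: nvidia  # may be required if nvidia is not the default runtime
  containers:
    - name: cuda
      image: nvcr.io/nvidia/cuda:12.4.1-base-ubuntu22.04  # assumed tag; any CUDA base image should do
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1  # resource published by the NVIDIA device plugin
```

If the pod completes and its logs show the usual nvidia-smi table, the host driver, container toolkit, and device plugin are working together as intended.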
Why this is a separate variant and not a values override
The existing default and vgpu variants both make architectural assumptions that fight the host driver:
- default runs vfioManager, which unbinds the host driver in order to bind vfio-pci. With a host driver present, k8s-driver-manager exits early and the bind never happens (see referenced issue).
- vgpu requires the NVIDIA vGPU host driver and a license server, which is a separate operational commitment.
container is the third architectural mode that the upstream NVIDIA gpu-operator already supports (sandboxWorkloads.enabled: false + devicePlugin.enabled: true); it is simply not surfaced as a Cozystack variant today.
Cross-references
- Silent skip on pre-installed host driver (referenced issue): would no longer be a footgun once container is available, because the user has a documented alternative.
- HAMi (fractional sharing) is orthogonal and stacks on top of container for tenants who want per-pod fractional GPU.