feat(gpu-operator): add container variant for hosts with a pre-installed NVIDIA driver #2764

@lexfrei

Summary

cozystack.gpu-operator ships two variants: default (passthrough — vfio-pci for KubeVirt VMs) and vgpu (NVIDIA vGPU, requires a license server). There is no variant that works with a host that already has the NVIDIA driver installed and wants the GPUs available to containerized workloads via the standard NVIDIA device plugin.

This is the most common shape of a Linux GPU node in the wild (apt-installed driver + nvidia-container-toolkit), and it currently has no out-of-the-box path through Cozystack.

Proposal

Add a container variant under packages/core/platform/sources/gpu-operator.yaml:

- name: container
  dependsOn:
  - cozystack.networking
  components:
  - name: gpu-operator
    path: system/gpu-operator
    valuesFiles:
    - values.yaml
    - values-container.yaml
    install:
      privileged: true
      namespace: cozy-gpu-operator
      releaseName: gpu-operator

values-container.yaml in packages/system/gpu-operator/:

gpu-operator:
  driver:
    enabled: false        # use the host driver
  devicePlugin:
    enabled: true         # publish nvidia.com/gpu to the kubelet
  vfioManager:
    enabled: false        # do not unbind the host driver
  sandboxWorkloads:
    enabled: false
  # toolkit.enabled defaults to true upstream and that is correct here:
  # nvidia-container-toolkit configures containerd to inject /dev/nvidia*
  # into containers requesting nvidia.com/gpu.

With this variant, a user can set bundles.enabledPackages: [cozystack.gpu-operator] with variant: container and immediately run containerized GPU workloads (CUDA pods, ML training) against a pre-existing host driver, without touching VFIO or the host driver stack.
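
For illustration, a workload consuming the resource published by the device plugin could look like the minimal Pod sketch below. The Pod name and the image tag are assumptions chosen for the example and are not part of this proposal; the only piece taken from the text above is the nvidia.com/gpu resource name.

```yaml
# Hypothetical smoke-test Pod; the name and image tag are assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:12.4.1-base-ubuntu22.04   # assumed tag, any CUDA base image should do
    command: ["nvidia-smi"]                      # lists the GPUs visible inside the container
    resources:
      limits:
        nvidia.com/gpu: 1                        # resource advertised by the NVIDIA device plugin
```

If the Pod completes and its logs list the host GPU, the device plugin and the container toolkit are likely wired up correctly. Depending on how the toolkit configures containerd, the Pod may also need runtimeClassName: nvidia; that detail is not confirmed here.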

Why this is a separate variant and not a values override

The existing default and vgpu variants both make architectural assumptions that fight the host driver:

  • default runs vfioManager, which unbinds the host driver in order to bind vfio-pci. With a host driver present, k8s-driver-manager exits early and the bind never happens (see referenced issue).
  • vgpu requires the NVIDIA vGPU host driver and a license server, which is a separate operational commitment.

container is the third architectural mode that the upstream NVIDIA gpu-operator already supports (sandboxWorkloads.enabled: false + devicePlugin.enabled: true); it just is not surfaced as a Cozystack variant today.

Cross-references

  • Silent skip on pre-installed host driver (referenced issue): this would no longer be a footgun once container is available, because the user has a documented alternative.
  • HAMi (fractional sharing) is orthogonal and stacks on top of container for tenants who want per-pod fractional GPU.
