From b52c69701ae0511c380c8f812d4d6d85deca53e7 Mon Sep 17 00:00:00 2001 From: Andrei Kvapil Date: Mon, 4 May 2026 20:10:56 +0200 Subject: [PATCH 1/3] design-proposal: split kubernetes package and add Talos backends Propose extracting node pools from the kubernetes application into a sibling kubernetes-nodes application, modelled on the vm-instance/vm-disk split. Add a backend abstraction that supports the existing KubeVirt+kubeadm flow alongside new Talos backends: KubeVirt+Talos via clastix/talos-csr-signer, and cloud-talos for Hetzner and Azure without Cluster API. Co-Authored-By: Claude Signed-off-by: Andrei Kvapil --- .../kubernetes-nodes-split/README.md | 269 ++++++++++++++++++ 1 file changed, 269 insertions(+) create mode 100644 design-proposals/kubernetes-nodes-split/README.md diff --git a/design-proposals/kubernetes-nodes-split/README.md b/design-proposals/kubernetes-nodes-split/README.md new file mode 100644 index 0000000..3d37e98 --- /dev/null +++ b/design-proposals/kubernetes-nodes-split/README.md @@ -0,0 +1,269 @@ +# Split the kubernetes package: extract node pools and add Talos backends + +- **Title:** `Split the kubernetes package: extract node pools and add Talos backends` +- **Author(s):** `@kvaps` +- **Date:** `2026-05-04` +- **Status:** Draft + +## Overview + +The `kubernetes` application currently bundles a Kamaji-hosted control-plane and all worker node pools into a single Helm release. This couples two lifecycles that are becoming increasingly independent: the control-plane is a single object owned by the platform, while node pools are growing in number and variety — different locations, different infrastructure providers, different operating systems. + +This proposal extracts node pools into a separate sibling application, `kubernetes-nodes`, modelled on the precedent of the `vm-instance` / `vm-disk` split. 
A user-facing tenant cluster is described as one `kubernetes` HelmRelease (control-plane only) plus N `kubernetes-nodes` HelmReleases (one per pool). At the same time, `kubernetes-nodes` introduces a backend abstraction so a single tenant cluster can mix KubeVirt VMs (existing kubeadm-bootstrapped flow) with Talos-on-cloud VMs (new, no Cluster API) and KubeVirt VMs running Talos (new) — joined by a Kamaji control-plane and stitched together at the network layer through the existing Kilo mesh. + +## Scope and related proposals + +- Companion to [`cross-cluster-tenant-mesh`](../cross-cluster-tenant-mesh/) — that proposal exposes host-cluster services (Ceph) to tenant clusters, which becomes more relevant once tenants span multiple locations. +- Migration of existing tenants is described here as a phased plan; an explicit migration utility (modelled on `migrations/29` from the `virtual-machine` → `vm-instance + vm-disk` split) will accompany the implementation but its detailed script is out of scope of this proposal. +- This proposal does **not** propose removing Cluster API immediately. It contains the architectural seam needed to remove CAPI later for non-KubeVirt backends if that becomes desirable. + +## Context + +### The problem + +> *I want my Kubernetes control plane lifecycle to be managed independently from my worker node pools. The pools may live in different locations and be backed by different infrastructure providers (in-cluster KubeVirt VMs, cloud VMs in Hetzner or Azure), and I want first-class Talos Linux support.* + +The current `kubernetes` package (under `packages/apps/kubernetes/`) carries a top-level `nodeGroups` map in `values.yaml`; each entry is rendered into a `MachineDeployment` + `KubeadmConfigTemplate` + `KubevirtMachineTemplate` + `MachineHealthCheck`. The control-plane (`KamajiControlPlane`) and the cluster-wide infrastructure objects (`Cluster`, `KubevirtCluster`, `kubevirt-ccm`, `cluster-autoscaler`) all live in the same Helm release. 
The chart has no notion of "location", "region" or "zone"; bootstrap is exclusively kubeadm. + +This bundling has three concrete consequences that are starting to bite: + +- **Coupled lifecycles.** A user wanting to add a small node pool in a second location must edit the same HelmRelease that owns the control-plane. A bad `values.yaml` change risks the entire cluster, not just one pool. +- **Single backend.** The values shape assumes everything is a KubeVirt VM joined via kubeadm. There is no clean place to plug in Talos-on-cloud-VM workers, which Cozystack already supports for the management cluster (see `/operations/multi-location/autoscaling/`). +- **No multi-location semantics.** Users currently model "different location" by hand using node labels and Kilo annotations. There is no first-class `location` knob in the kubernetes package, no per-pool placement, no provider-aware autoscaler. + +### Existing primitives + +- **Kamaji** for hosted control-planes (`KamajiControlPlane` CRD, `clastix.io/v1alpha1`). +- **Cluster API** with the **KubeVirt provider (CAPK)** — `KubevirtCluster`, `KubevirtMachineTemplate`. Currently the only worker backend. +- **`local-ccm`** (`github.com/cozystack/local-ccm`) deployed on the management cluster: a DaemonSet that detects ExternalIP via netlink, plus a `node-lifecycle-controller` (NLC) that deletes zombie `Node` objects after `cluster-autoscaler` scales their VMs down. Today NLC runs only against the management cluster's API; the same logic would solve zombie-node issues for tenant clusters where workers are autoscaled cloud VMs. +- **`cluster-autoscaler`** as a single Deployment per tenant kubernetes HelmRelease, in the management cluster, discovering MachineDeployments via the clusterapi provider. +- **clastix/talos-csr-signer** (a small gRPC sidecar that re-implements Talos `trustd` so that workers can fetch their Talos machine certificate from a non-Talos control-plane like Kamaji). 
Co-developed with CLASTIX; deployed as a sidecar inside the `KamajiControlPlane` Pod via `additionalContainers`. +- The **Sidero Metal / CAPS** path is **not** an option: the upstream project is officially deprecated and bare-metal-only, and assumes a Talos control-plane (incompatible with Kamaji). + +## Goals + +- A tenant cluster's worker pools live in their own HelmReleases, separate from the control-plane's HelmRelease, and can be added, removed, scaled and re-templated independently. +- A single tenant cluster can mix multiple `kubernetes-nodes` HelmReleases of different backends — for example: one KubeVirt pool on the host cluster running kubeadm (legacy compatibility), one Talos pool on Hetzner, one Talos pool on Azure — all joined by the same Kamaji control-plane. +- First-class Talos Linux support for new worker pools: machineconfig-driven bootstrap, with a system-managed template and a user-facing overlay. +- No regression for existing tenants: the current `kubevirt-kubeadm` flow remains the default, fully supported, and migrations are scripted. +- Cluster-autoscaler runs once per node pool, in the management cluster (not inside tenant nodes), driven by per-pool configuration. +- Tenant `Node` objects are cleaned up automatically when their VM is removed, regardless of backend. + +### Non-goals + +- Removing Cluster API entirely. CAPI remains the path for KubeVirt-backed pools; this proposal only avoids introducing it for new Talos-on-cloud backends. +- Migrating existing KubeVirt-kubeadm pools to KubeVirt-Talos. The Talos-on-KubeVirt backend is offered for new pools; existing kubeadm pools stay as-is. +- Cross-cluster service discovery, DNS, or service mirroring. Out of scope; partially addressed by the companion `cross-cluster-tenant-mesh` proposal. +- Replacing `kubevirt-ccm` inside tenant clusters. It remains in the `kubernetes` (control-plane) package and is critical for KubeVirt-backed tenants. +- Changing the on-disk shape of existing CAPI objects. 
Migration adopts them in place. + +## Design + +### Package split + +Two packages replace today's monolithic `kubernetes`: + +- **`kubernetes`** — control-plane only. Renders `Cluster`, `KamajiControlPlane`, `KubevirtCluster` (when needed for KubeVirt-backed pools), `kubevirt-ccm` for the tenant, addons (cert-manager, FluxCD, ingress-nginx, etc.), and the Talos signer sidecar configuration when Talos backends are in use. Does **not** render any node-pool objects. +- **`kubernetes-nodes`** — exactly one node pool per HelmRelease. Renders all backend-specific resources for that pool: for `kubevirt-*` backends, the CAPI `MachineDeployment` + bootstrap config template + infrastructure machine template + `MachineHealthCheck`; for `cloud-talos-*` backends, a `cluster-autoscaler` Deployment configured with the cloud's native provider plus the Talos machineconfig Secret consumed via cloud-init. + +A tenant cluster is therefore described as `1 × kubernetes` HelmRelease + `N × kubernetes-nodes` HelmReleases. The control-plane does not hold any reference to its node pools — workers self-register against the apiserver via their bootstrap mechanism, just as they do today. + +```mermaid +flowchart LR + subgraph Mgmt[Management cluster] + direction TB + KCP[kubernetes/
HelmRelease
tenant-foo] + NG1[kubernetes-nodes/
HelmRelease
tenant-foo-md0
backend: kubevirt-kubeadm] + NG2[kubernetes-nodes/
HelmRelease
tenant-foo-hel-fsn1
backend: cloud-talos-hetzner] + NG3[kubernetes-nodes/
HelmRelease
tenant-foo-az-weu
backend: cloud-talos-azure] + + KCP --> KamajiCP[KamajiControlPlane] + KamajiCP -.->|exposes :6443
:50001 trustd| LB[Tenant API LB] + + NG1 --> CAPI1[MachineDeployment
KubevirtMachineTemplate
KubeadmConfigTemplate] + NG2 --> AS2[cluster-autoscaler
--cloud-provider=hetzner] + NG3 --> AS3[cluster-autoscaler
--cloud-provider=azure] + end + + CAPI1 -.creates.-> VMs[KubeVirt VMs] + AS2 -.scales.-> Hel[Hetzner cloud servers] + AS3 -.scales.-> Az[Azure VMSS instances] + + VMs --> LB + Hel --> LB + Az --> LB +``` + +### Linkage by name + +Following the `vm-instance` / `vm-disk` precedent, `kubernetes-nodes` references its parent `kubernetes` HelmRelease by **name**, not by an explicit CRD reference. The chart's `clusterName` value (e.g., `tenant-foo`) is used at template render time to: + +- `lookup` the tenant's `KamajiControlPlane` and read its API endpoint, CA, and bootstrap-token data. +- Set the Helm release naming convention `kubernetes-nodes-` so that orphan detection during migration is deterministic. +- Apply labels (`apps.cozystack.io/cluster: `) to all rendered objects, enabling label-selector-based reconciliation in `cluster-autoscaler` and any future controllers. + +If the parent control-plane is missing, the chart `fail`s the render with a clear error pointing at the expected HelmRelease name. This is the same mode of fragility present in `vm-instance/vm-disk`, accepted as a tradeoff for simplicity, with the understanding that a future iteration could move to an explicit CRD reference. + +### Backend abstraction + +`kubernetes-nodes` exposes a `backend.type` field that selects which set of templates is rendered. Three backends are introduced; only `kubevirt-kubeadm` is fully equivalent to today's behaviour. + +#### `kubevirt-kubeadm` (existing, default) + +Identical to today's flow. Renders `MachineDeployment` + `KubevirtMachineTemplate` + `KubeadmConfigTemplate` + `MachineHealthCheck`. Bootstrap via kubeadm join. Workers run the Cozystack-provided Ubuntu image, with no Talos components involved. CAPI + CAPK do all the lifecycle work; cluster-autoscaler drives scale. + +This is the migration target for every existing pool. 
+ +#### `kubevirt-talos` + +Renders the same CAPI/CAPK objects as above, but with a `TalosConfigTemplate` (from `cluster-api-bootstrap-provider-talos`) replacing `KubeadmConfigTemplate`. Worker VMs boot from a Talos image. Bootstrap fetches the Talos machineconfig from CAPI and joins the cluster via standard Talos PKI. + +The tenant's `KamajiControlPlane` carries an `additionalContainers` entry running `clastix/talos-csr-signer` listening on TCP/50001 (gRPC), exposed alongside `:6443` on the tenant API LoadBalancer. This is what allows `talosctl` to operate against worker nodes whose control-plane is Kamaji rather than Talos. + +This backend keeps CAPI in the loop because for KubeVirt VMs the `cluster-api-provider-kubevirt` machinery is the path of least resistance: it already handles VM lifecycle, networking, and storage attachment. + +#### `cloud-talos-hetzner` and `cloud-talos-azure` + +No Cluster API involvement. The pattern mirrors what Cozystack already uses for the management cluster (see `/docs/v1.3/operations/multi-location/autoscaling/`): + +- A `cluster-autoscaler` Deployment is rendered into the management cluster's namespace for this tenant, configured with `--cloud-provider=hetzner` (or `azure`), `--cloud-config` referencing a Secret with cloud credentials provided in the HelmRelease values, and `autoscalingGroups` describing min/max replicas, instance type, and region. +- A `Secret` holds the Talos machineconfig (see "Talos machineconfig" below). The autoscaler injects it via the cloud's `cloud-init` / `customData` mechanism when launching new instances. +- Newly booted instances complete their Talos bootstrap against the tenant's public API endpoint (Kamaji) using the Talos token in the machineconfig, register via kubelet, and obtain their kubelet client certificate through standard CSR approval. + +These backends do not use CAPI at all. There is no `Machine`, no `MachineDeployment`. 
The autoscaler is the source of truth for desired pool size; the cloud's API is the source of truth for actual instance state. This is the same model proven on the management cluster. + +### Talos machineconfig: template + user overlay + +A single Talos machineconfig per pool is generated by the `kubernetes-nodes` chart and stored as a Secret. It is constructed in two layers: + +**System layer (chart-managed, not exposed to user):** + +- Cluster CA, machine CA, apiserver endpoint — read at template time via `lookup` from the tenant's `KamajiControlPlane`. +- Talos token — generated once per pool, stored alongside the machineconfig. +- Kilo annotations (`kilo.squat.ai/location`, `kilo.squat.ai/persistent-keepalive`, `topology.kubernetes.io/zone`) when the pool participates in the Cozystack mesh. +- Standard Cozystack defaults: registry mirrors, kubelet flags, time servers, install disk hints. + +**User layer (`backend.userMachineConfig` in values.yaml):** + +- Extra kubelet args. +- Extra node labels and taints (free-form). +- Per-pool registry mirrors override. +- Extra `extensions` (talos image factory schematic). +- Anything else the user explicitly wants to pass through. + +The two layers are merged at render time and the result is the machineconfig that gets injected into cloud-init or the KubevirtMachineTemplate. The user never writes raw Talos YAML for cluster-critical fields; the chart guarantees the result will join the right control-plane. + +### Cluster-autoscaler — one per pool, in the management cluster + +The current model (one autoscaler per tenant in the management cluster, discovering all pools via clusterapi labels) does not generalise: a `cloud-talos-hetzner` pool needs `--cloud-provider=hetzner`, a `cloud-talos-azure` pool needs `--cloud-provider=azure`, and the upstream cluster-autoscaler accepts only one cloud-provider flag per Deployment. 
+ +The new model: each `kubernetes-nodes` HelmRelease renders its own `cluster-autoscaler` Deployment, in the management cluster, scoped to its pool. The autoscaler: + +- For `kubevirt-*` backends, uses `--cloud-provider=clusterapi` and watches just this pool's MachineDeployment. +- For `cloud-talos-*` backends, uses the corresponding native provider and the values-supplied `autoscalingGroups`. + +Coordination across pools is left to the standard scheduler — pending pods select among pools via standard mechanisms (taints, node selectors, topology constraints). + +### Tenant-side node lifecycle (NLC) + +When `cluster-autoscaler` scales a `cloud-talos-*` pool down, it deletes the cloud VM. The tenant's apiserver still has a `Node` object that will linger until something deletes it. CAPI was previously the agent doing this; without CAPI, we need an equivalent. + +The `node-lifecycle-controller` from `cozystack/local-ccm` is a good fit for this role. The `kubernetes-nodes` chart for `cloud-talos-*` backends renders an NLC Deployment that runs in the management cluster but uses a kubeconfig pointing to the **tenant** apiserver. It watches Node objects with the `ToBeDeletedByClusterAutoscaler:NoSchedule` taint and removes them after a configurable grace period and unreachability check. + +For `kubevirt-*` backends NLC is not needed: CAPI's machine controller already removes the Node object when it deletes the Machine. + +### `kubevirt-ccm` stays in the control-plane package + +The Kubernetes Cloud Controller Manager for KubeVirt-backed nodes (`kubevirt-ccm`, currently in `templates/kccm/manager.yaml`) remains in the `kubernetes` package, not in `kubernetes-nodes`. It is critical for KubeVirt-backed tenants — without it, KubeVirt VMs do not get their LoadBalancer Services properly wired — and it is logically a property of the cluster, not of a particular node pool. 
+ +Tenants that have **no** KubeVirt pools at all (only `cloud-talos-*` pools) will have `kubevirt-ccm` running idle. This is acceptable cost; a future `enabled` flag in the `kubernetes` chart can disable it on demand. + +## User-facing changes + +- New CRD-style application `kubernetes-nodes` with `values.yaml` containing: `clusterName`, `backend.type`, `backend..*` settings, common `replicas`/`minReplicas`/`maxReplicas`, `roles`, `resources`, `userMachineConfig` (Talos backends only). +- Existing `kubernetes` `values.yaml` no longer accepts `nodeGroups` (after migration completes; during the migration window both shapes are accepted). +- New tenant-cluster pages in the dashboard list node pools as separate entities, with their backend type and current capacity. +- `cozystack` CLI gains commands to list, create, scale, and delete node pools per cluster. + +## Upgrade and rollback compatibility + +The migration follows the precedent of the `virtual-machine` → `vm-instance` + `vm-disk` split: a long parallel period during which both shapes work, followed by a scripted migration and eventual removal of the legacy code path. + +**Phase 1 — both shapes accepted.** +- Ship `kubernetes-nodes` as a new package. Document its use for new node pools. +- The `kubernetes` chart continues to support `nodeGroups` in `values.yaml` exactly as today. +- Users can adopt `kubernetes-nodes` on a per-pool basis for new pools without touching existing ones. + +**Phase 2 — migration tool.** +- A migration script (modelled on `migrations/29` for the `virtual-machine` split) walks every `kubernetes` HelmRelease with non-empty `nodeGroups`. For each `nodeGroup` it: + 1. Patches the existing `MachineDeployment`, `KubevirtMachineTemplate`, `KubeadmConfigTemplate`, and `MachineHealthCheck` with `meta.helm.sh/release-name` and `meta.helm.sh/release-namespace` annotations pointing at the new `kubernetes-nodes-` HelmRelease. + 2. 
Creates the `kubernetes-nodes` HelmRelease with values copied from the source `nodeGroup`. + 3. Strips the `nodeGroup` entry from the `kubernetes` HelmRelease values. +- Critical pre-step: each affected resource is annotated `helm.sh/resource-policy: keep` first, so that the `kubernetes` chart's reconciliation does not delete it during the brief window before the new HelmRelease's first reconcile. +- The script is idempotent and safe to re-run. + +**Phase 3 — legacy removal.** +- The `kubernetes` chart drops `nodeGroups` from its schema entirely. Charts that still receive it produce a clear validation error pointing at the migration tool. +- Documentation deprecates the old shape. + +**Rollback.** +- During Phase 1 and Phase 2, rollback is a matter of reverting the migration: delete the `kubernetes-nodes` HelmRelease, restore the `nodeGroup` entry in `kubernetes` values, run reconcile. The migration script supports this direction explicitly. +- After Phase 3, rollback requires reinstating the legacy code path in the `kubernetes` chart. This is a hard cut and should not be done lightly; Phase 3 only ships once Phase 1 and 2 have been in production long enough to gather operational confidence. + +## Security + +- `kubernetes-nodes` HelmReleases run with the same RBAC as `kubernetes` HelmReleases today — they're both managed by Cozystack platform components, not by tenants. The split does not introduce new tenant-controlled inputs. +- Talos backends introduce a new credential: the per-pool `TALOS_TOKEN`, used by `clastix/talos-csr-signer` to validate worker bootstrap. Stored in a Secret in the tenant's namespace, rotated on pool re-creation. +- `cloud-talos-*` backends introduce cloud-provider credentials (Hetzner API token, Azure service principal). These are user-supplied at HelmRelease creation, stored as Secrets in the tenant's namespace, and never read by the tenant's apiserver — only by the management-cluster `cluster-autoscaler`. 
+- The Talos machineconfig contains the cluster CA and the bootstrap token. It is stored as a Secret accessible only to the autoscaler and the chart's render path. It is **not** exposed to the tenant's apiserver or workloads. +- `clastix/talos-csr-signer` uses a single shared `TALOS_TOKEN` per pool with no per-node identity proof. This matches upstream Talos's `trustd` model. Co-developed with CLASTIX; experimental upstream status acknowledged but accepted, given Cozystack's involvement in its development. + +## Failure and edge cases + +- **`kubernetes-nodes` HelmRelease created before its parent `kubernetes` HelmRelease** → chart `fail`s the render with a clear error message identifying the missing parent. No partial CAPI/autoscaler resources created. +- **Parent `kubernetes` HelmRelease deleted while children exist** → all `kubernetes-nodes` HelmReleases for that cluster fail subsequent reconciles. An admission webhook on `kubernetes` HelmRelease delete blocks the operation if any `kubernetes-nodes` references it. +- **Migration runs while autoscaler is mid-scale-out** → resource-policy `keep` annotation prevents deletion. The new HelmRelease's first reconcile picks up the in-flight `Machine` objects via the standard CAPI reconcile. +- **`cluster-autoscaler` for a `cloud-talos-*` backend fails to delete a cloud VM** (rate limit, transient API error) → instances stay up; NLC will not see the Node as `ToBeDeletedByClusterAutoscaler:NoSchedule` and will not delete the Node. Operator alerted via metrics; manual cleanup required. Documented runbook. +- **Talos-CSR-signer pod restart during worker bootstrap** → worker retries `trustd` calls with exponential backoff (Talos default). No data lost. +- **Mixed-backend tenant where one pool fails reconcile** → other pools and the control-plane are unaffected (independent HelmReleases). The cluster degrades gracefully. 
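
The tenant-side NLC behaviour that several of the failure cases above rely on can be sketched as a pure decision function. This is only an illustration under stated assumptions: `NodeView`, `should_delete`, and the `grace_seconds` parameter are hypothetical names, not the actual `local-ccm` types or flags. The taint key is the one named in this proposal.

```python
from dataclasses import dataclass, field

# Taint key named in the proposal (its effect there is NoSchedule).
AUTOSCALER_TAINT = "ToBeDeletedByClusterAutoscaler"


@dataclass
class NodeView:
    """Hypothetical, minimal view of a tenant Node (not the real local-ccm types)."""
    name: str
    taints: set = field(default_factory=set)
    seconds_since_taint: float = 0.0
    reachable: bool = True


def should_delete(node: NodeView, grace_seconds: float) -> bool:
    """Delete only tainted nodes that outlived the grace period and are unreachable."""
    if AUTOSCALER_TAINT not in node.taints:
        return False  # never touch nodes the autoscaler has not marked
    if node.seconds_since_taint < grace_seconds:
        return False  # still inside the configurable grace period
    return not node.reachable  # a VM that still answers is left alone


nodes = [
    NodeView("gone", {AUTOSCALER_TAINT}, 600, reachable=False),
    NodeView("still-up", {AUTOSCALER_TAINT}, 600, reachable=True),
    NodeView("healthy"),
    NodeView("fresh", {AUTOSCALER_TAINT}, 10, reachable=False),
]
to_delete = [n.name for n in nodes if should_delete(n, grace_seconds=300)]
```

Under these assumptions only the node whose VM is really gone is removed. A tainted node whose VM is still reachable stays in place, which corresponds to the "manual cleanup required" case in the list above.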
+ +## Testing + +- Unit tests for chart rendering: synthetic inputs covering each backend, expected Kubernetes objects, expected absence of forbidden combinations (e.g., `userMachineConfig` for `kubevirt-kubeadm`). +- Schema validation tests for the new `kubernetes-nodes` `values.yaml` shape. +- Migration script tests: synthetic existing `kubernetes` releases with various `nodeGroups` configurations; verify idempotence, rollback, and identity preservation (Machine names, BootstrapData ownership). +- Integration tests with `kind` and a stub KubeVirt: full lifecycle of `kubevirt-kubeadm` and `kubevirt-talos` pools. +- E2E in CI for `cloud-talos-*` backends using a small Hetzner project and an Azure subscription: scale-up, scale-down, NLC behaviour on rapid scale-down. +- Failure-injection tests: kill the talos-csr-signer pod during worker join; kill the cluster-autoscaler pod mid-scale; delete a pool's cloud-credentials Secret and verify graceful degradation. + +## Rollout + +- **Phase 1.** Implement `kubernetes-nodes` package with `kubevirt-kubeadm` backend only. Ship as opt-in alongside the existing `kubernetes` chart with no migration required for existing pools. +- **Phase 2.** Add `kubevirt-talos` backend, including talos-csr-signer integration in the `kubernetes` (control-plane) chart. +- **Phase 3.** Add `cloud-talos-hetzner` and `cloud-talos-azure` backends, including per-pool cluster-autoscaler and tenant-side NLC. +- **Phase 4.** Ship migration script. Document the migration; encourage but don't force adoption. +- **Phase 5.** Once telemetry shows broad migration of existing tenants, remove `nodeGroups` from `kubernetes` chart's schema; ship final migration. + +Each phase is independently shippable and rollback-safe. + +## Open questions + +1. 
**CAPI removal long-term.** Should we set a roadmap target for removing CAPI from `kubevirt-kubeadm` and `kubevirt-talos` backends entirely (replacing CAPK with a thin Cozystack-internal controller that creates `VirtualMachine` objects directly)? This would unify all backends under "no CAPI" and reduce a substantial dependency, but requires re-implementing what CAPK gives us today (machine lifecycle, healthchecks, status). Out of scope for this proposal but worth scoping next. +2. **Backend extension shape.** The proposed `backend.type` enum has a fixed set of values. Adding AWS, GCP, or on-prem KVM later is straightforward (new `cloud-talos-aws` etc.), but should we accept arbitrary backend identifiers and dispatch through a plugin mechanism? Probably not — the explicit enum keeps the chart auditable. +3. **Per-pool talos-csr-signer vs cluster-wide.** Currently proposed as a single sidecar in the tenant's `KamajiControlPlane` Pod (cluster-wide). Should each pool have its own token for blast-radius isolation? Operationally heavier; security gain limited because tokens already give only the right to obtain a Talos machine cert, not Kubernetes API access. Open for discussion. +4. **NLC reuse vs fork.** Should we deploy the existing `local-ccm` NLC in tenant-targeting mode, or fork it into a `tenant-nlc` package? Reuse keeps the codebase smaller; fork makes the host vs tenant deployment paths explicitly different. Likely reuse is correct. +5. **Should `kubernetes-nodes` be allowed to advertise capacity to multiple `kubernetes` clusters?** Almost certainly no, but stating it explicitly. Each pool belongs to one cluster. + +## Alternatives considered + +**Keep the monolithic `kubernetes` package and add `nodeGroupsBackend` discriminators.** Rejected because it would force every tenant cluster's HelmRelease to know about every backend, and growing the values shape further entangles control-plane and node-pool lifecycles. 
The whole reason for the split is to *separate* lifecycles, not to make the same release manage more variety. + +**Sidero Metal / CAPS for Talos-on-bare-metal.** Rejected. Sidero Labs has officially deprecated Sidero Metal. Successor (Omni) is a closed-core SaaS, not a drop-in OSS CAPI provider. Sidero is also bare-metal-only and assumes a Talos control-plane, incompatible with Kamaji. + +**Explicit `clusterRef` on `kubernetes-nodes` instead of name-based linkage.** Considered. Trade-off favours simplicity: name-based linkage matches the `vm-instance/vm-disk` precedent that Cozystack maintainers and users are already familiar with, and the security gain of a CRD reference is marginal because both packages are platform-controlled (not tenant-controlled). The fragility of the name-based approach is real but understood and accepted. + +**Single global cluster-autoscaler per tenant with multi-cloud-provider support.** Not feasible. Upstream cluster-autoscaler accepts one `--cloud-provider` flag; supporting multiple simultaneously would require either a fork or a per-pool autoscaler. Per-pool autoscaler in the management cluster is the natural fit. + +**Inline-disk-style "embedded" node pools** (each `kubernetes` HelmRelease has its node pools as a sub-section, but rendered as separate releases under the hood). Rejected because it does not actually decouple the lifecycle — a `helm upgrade` on the parent still touches all children. The split has to be at the user-visible HelmRelease level for the goals to be achieved. + +**Exposing Talos machineconfig directly to users without a system layer.** Rejected because it forces every user to understand Talos machineconfig deeply, and gives them enough rope to break the join with Kamaji (wrong CA, wrong endpoint, wrong token). The template + user-overlay approach matches the ergonomics Cozystack offers everywhere else (system handles the boilerplate, user describes intent). 
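
The template plus user-overlay merge discussed above can be illustrated with a small deep-merge sketch. It assumes the system layer wins whenever both layers set the same leaf, which is one plausible reading of "the chart guarantees the result will join the right control-plane"; the proposal does not spell out the precedence rule. `merge_layers` and the sample values are hypothetical, and the two layers are heavily simplified compared to a real Talos machineconfig.

```python
from copy import deepcopy


def merge_layers(user: dict, system: dict) -> dict:
    """Deep-merge two machineconfig layers; `system` wins on conflicting leaves."""
    merged = deepcopy(user)
    for key, value in system.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_layers(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical, simplified layers (assumed values, not taken from the chart).
system_layer = {
    "cluster": {"controlPlane": {"endpoint": "https://tenant-foo.example:6443"}},
    "machine": {"nodeLabels": {"topology.kubernetes.io/zone": "fsn1"}},
}
user_layer = {
    "cluster": {"controlPlane": {"endpoint": "https://wrong.example:6443"}},
    "machine": {
        "nodeLabels": {"team": "data"},
        "kubelet": {"extraArgs": {"max-pods": "200"}},
    },
}

machineconfig = merge_layers(user_layer, system_layer)
```

In this sketch the user's extra label and kubelet argument survive, while the wrong endpoint is overwritten by the system layer. That is the "enough rope" failure this alternative was rejected for.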
From d3e56407a75ea5266aefb9e3300fa366772e2f22 Mon Sep 17 00:00:00 2001 From: Andrei Kvapil Date: Mon, 11 May 2026 20:55:04 +0200 Subject: [PATCH 2/3] design-proposal: rescope kubernetes-nodes-split to Phase 1 + Phase 2 Drop the multi-backend design (cloud-talos-hetzner, cloud-talos-azure, LocationProfile, NLC, etc.) and rewrite around two phases of internal restructuring: - Phase 1: replace Ubuntu+kubeadm worker bootstrap with Talos via CABPT, inside the existing monolithic chart, with no user-facing API change. Patch needed in cluster-api-control-plane-provider-kamaji to render the talos-csr-signer sidecar in TenantControlPlane. - Phase 2: once workers are uniformly Talos, split the chart into kubernetes (control-plane) + kubernetes-nodes (per-pool). Single backend, no backend.type field. Hybrid / external-cloud clusters are deferred to Phase 3, tracked separately as a follow-up draft proposal. Co-Authored-By: Claude Signed-off-by: Andrei Kvapil --- .../kubernetes-nodes-split/README.md | 295 +++++++----------- 1 file changed, 114 insertions(+), 181 deletions(-) diff --git a/design-proposals/kubernetes-nodes-split/README.md b/design-proposals/kubernetes-nodes-split/README.md index 3d37e98..b2618af 100644 --- a/design-proposals/kubernetes-nodes-split/README.md +++ b/design-proposals/kubernetes-nodes-split/README.md @@ -1,269 +1,202 @@ -# Split the kubernetes package: extract node pools and add Talos backends +# Migrate kubernetes workers to Talos and split control-plane from node pools -- **Title:** `Split the kubernetes package: extract node pools and add Talos backends` +- **Title:** `Migrate kubernetes workers to Talos and split control-plane from node pools` - **Author(s):** `@kvaps` - **Date:** `2026-05-04` - **Status:** Draft ## Overview -The `kubernetes` application currently bundles a Kamaji-hosted control-plane and all worker node pools into a single Helm release. 
This couples two lifecycles that are becoming increasingly independent: the control-plane is a single object owned by the platform, while node pools are growing in number and variety — different locations, different infrastructure providers, different operating systems. +A two-phase reshape of the `kubernetes` application: -This proposal extracts node pools into a separate sibling application, `kubernetes-nodes`, modelled on the precedent of the `vm-instance` / `vm-disk` split. A user-facing tenant cluster is described as one `kubernetes` HelmRelease (control-plane only) plus N `kubernetes-nodes` HelmReleases (one per pool). At the same time, `kubernetes-nodes` introduces a backend abstraction so a single tenant cluster can mix KubeVirt VMs (existing kubeadm-bootstrapped flow) with Talos-on-cloud VMs (new, no Cluster API) and KubeVirt VMs running Talos (new) — joined by a Kamaji control-plane and stitched together at the network layer through the existing Kilo mesh. +1. **Phase 1 — Talos migration.** Replace the worker OS bootstrap path of the existing `kubernetes` chart from Ubuntu + `kubeadm` to Talos + `cluster-api-bootstrap-provider-talos`. The chart's user-facing API (`values.yaml`) does not change; the migration is seamless via a standard CAPI MachineDeployment rolling update. +2. **Phase 2 — Package split.** Once workers are uniformly on Talos, split the chart into `kubernetes` (control-plane only) and `kubernetes-nodes` (one HelmRelease per pool), modelled on the `vm-instance` / `vm-disk` precedent. + +Hybrid clusters — workers that live outside the Cozystack management cluster (cloud autoscaler against Hetzner / Azure / AWS, BYO clusters with admin-/user-managed location ownership) — are deliberately deferred as **Phase 3** and tracked in a separate draft proposal. This document does not commit to any specific shape for that work. 
## Scope and related proposals -- Companion to [`cross-cluster-tenant-mesh`](../cross-cluster-tenant-mesh/) — that proposal exposes host-cluster services (Ceph) to tenant clusters, which becomes more relevant once tenants span multiple locations. -- Migration of existing tenants is described here as a phased plan; an explicit migration utility (modelled on `migrations/29` from the `virtual-machine` → `vm-instance + vm-disk` split) will accompany the implementation but its detailed script is out of scope of this proposal. -- This proposal does **not** propose removing Cluster API immediately. It contains the architectural seam needed to remove CAPI later for non-KubeVirt backends if that becomes desirable. +- **Phase 3 (hybrid clusters)** lives in a separate draft proposal — link to be added once that PR is open. None of the design here forecloses Phase 3; the package split is exactly what makes Phase 3 expressible cleanly. +- **Companion: [`cross-cluster-tenant-mesh`](../cross-cluster-tenant-mesh/)** (PR #7). Independent of this proposal; relevant once tenants need to reach services across cluster boundaries. ## Context -### The problem - -> *I want my Kubernetes control plane lifecycle to be managed independently from my worker node pools. The pools may live in different locations and be backed by different infrastructure providers (in-cluster KubeVirt VMs, cloud VMs in Hetzner or Azure), and I want first-class Talos Linux support.* - -The current `kubernetes` package (under `packages/apps/kubernetes/`) carries a top-level `nodeGroups` map in `values.yaml`; each entry is rendered into a `MachineDeployment` + `KubeadmConfigTemplate` + `KubevirtMachineTemplate` + `MachineHealthCheck`. The control-plane (`KamajiControlPlane`) and the cluster-wide infrastructure objects (`Cluster`, `KubevirtCluster`, `kubevirt-ccm`, `cluster-autoscaler`) all live in the same Helm release. The chart has no notion of "location", "region" or "zone"; bootstrap is exclusively kubeadm. 
- -This bundling has three concrete consequences that are starting to bite: +### The current shape -- **Coupled lifecycles.** A user wanting to add a small node pool in a second location must edit the same HelmRelease that owns the control-plane. A bad `values.yaml` change risks the entire cluster, not just one pool. -- **Single backend.** The values shape assumes everything is a KubeVirt VM joined via kubeadm. There is no clean place to plug in Talos-on-cloud-VM workers, which Cozystack already supports for the management cluster (see `/operations/multi-location/autoscaling/`). -- **No multi-location semantics.** Users currently model "different location" by hand using node labels and Kilo annotations. There is no first-class `location` knob in the kubernetes package, no per-pool placement, no provider-aware autoscaler. +The `kubernetes` package (`packages/apps/kubernetes/`) carries a top-level `nodeGroups` map in `values.yaml`. Each entry is rendered into a `MachineDeployment` + `KubeadmConfigTemplate` + `KubevirtMachineTemplate` + `MachineHealthCheck`. The control-plane (`KamajiControlPlane`) and cluster-wide infrastructure (`Cluster`, `KubevirtCluster`, `kubevirt-ccm`, `cluster-autoscaler`) live in the same Helm release. Workers are bootstrapped via kubeadm join, running an Ubuntu image. ### Existing primitives -- **Kamaji** for hosted control-planes (`KamajiControlPlane` CRD, `clastix.io/v1alpha1`). -- **Cluster API** with the **KubeVirt provider (CAPK)** — `KubevirtCluster`, `KubevirtMachineTemplate`. Currently the only worker backend. -- **`local-ccm`** (`github.com/cozystack/local-ccm`) deployed on the management cluster: a DaemonSet that detects ExternalIP via netlink, plus a `node-lifecycle-controller` (NLC) that deletes zombie `Node` objects after `cluster-autoscaler` scales their VMs down. Today NLC runs only against the management cluster's API; the same logic would solve zombie-node issues for tenant clusters where workers are autoscaled cloud VMs. 
-- **`cluster-autoscaler`** as a single Deployment per tenant kubernetes HelmRelease, in the management cluster, discovering MachineDeployments via the clusterapi provider. -- **clastix/talos-csr-signer** (a small gRPC sidecar that re-implements Talos `trustd` so that workers can fetch their Talos machine certificate from a non-Talos control-plane like Kamaji). Co-developed with CLASTIX; deployed as a sidecar inside the `KamajiControlPlane` Pod via `additionalContainers`. -- The **Sidero Metal / CAPS** path is **not** an option: the upstream project is officially deprecated and bare-metal-only, and assumes a Talos control-plane (incompatible with Kamaji). +- **Kamaji** for hosted control-planes (`KamajiControlPlane` CRD). +- **Cluster API** with the **KubeVirt provider (CAPK)** — `KubevirtCluster`, `KubevirtMachineTemplate`. Workers run as KubeVirt VMs on host nodes. +- **`cluster-api-bootstrap-provider-talos`** (CABPT) — from siderolabs. Drop-in replacement for CABPK that produces a Talos machineconfig as the bootstrap-data Secret. Works with any CAPI infra provider (CAPK included) because the infra provider does not inspect bootstrap-data content — it just injects it as cloud-init. +- **clastix/talos-csr-signer** — small gRPC sidecar reimplementing Talos's `trustd` protocol. Lets Talos workers fetch their Talos machine certificate from a non-Talos control-plane (Kamaji). Co-developed with CLASTIX. +- **`cluster-api-control-plane-provider-kamaji`** — the CAPI control-plane provider for Kamaji. Today it does **not** wire a CSR-signer sidecar into the `TenantControlPlane`. A patch is required (see Phase 1 below). ## Goals -- A tenant cluster's worker pools live in their own HelmReleases, separate from the control-plane's HelmRelease, and can be added, removed, scaled and re-templated independently. 
-- A single tenant cluster can mix multiple `kubernetes-nodes` HelmReleases of different backends — for example: one KubeVirt pool on the host cluster running kubeadm (legacy compatibility), one Talos pool on Hetzner, one Talos pool on Azure — all joined by the same Kamaji control-plane. -- First-class Talos Linux support for new worker pools: machineconfig-driven bootstrap, with a system-managed template and a user-facing overlay. -- No regression for existing tenants: the current `kubevirt-kubeadm` flow remains the default, fully supported, and migrations are scripted. -- Cluster-autoscaler runs once per node pool, in the management cluster (not inside tenant nodes), driven by per-pool configuration. -- Tenant `Node` objects are cleaned up automatically when their VM is removed, regardless of backend. +- Workers are uniformly Talos-bootstrapped across all tenants. No Ubuntu + kubeadm path remains in the chart after Phase 1 lands. +- The migration is seamless: existing tenants pick up Talos via a CAPI MachineDeployment rolling update, with no operator intervention beyond a chart upgrade. +- A tenant cluster's control-plane lifecycle is separable from its node-pool lifecycles. Adding, scaling or replacing a pool does not touch the HelmRelease that owns the control-plane. +- The package shape is set up to accept future hybrid backends (Phase 3) without further restructuring. ### Non-goals -- Removing Cluster API entirely. CAPI remains the path for KubeVirt-backed pools; this proposal only avoids introducing it for new Talos-on-cloud backends. -- Migrating existing KubeVirt-kubeadm pools to KubeVirt-Talos. The Talos-on-KubeVirt backend is offered for new pools; existing kubeadm pools stay as-is. -- Cross-cluster service discovery, DNS, or service mirroring. Out of scope; partially addressed by the companion `cross-cluster-tenant-mesh` proposal. -- Replacing `kubevirt-ccm` inside tenant clusters. 
It remains in the `kubernetes` (control-plane) package and is critical for KubeVirt-backed tenants. -- Changing the on-disk shape of existing CAPI objects. Migration adopts them in place. +- **Hybrid / external-cloud workers.** Deferred to Phase 3 (separate proposal). +- **Multi-location semantics inside this proposal.** No `location` knob in `values.yaml` here; Phase 3 will introduce one if needed. +- **Removing Cluster API.** CAPI + CAPK remains the path for KubeVirt-VM workers. +- **Replacing `kubevirt-ccm` in tenant clusters.** It stays in the `kubernetes` (control-plane) package and is critical for KubeVirt-backed tenants. +- **talosctl-driven workflows beyond what the signer enables.** Phase 1 confirms `talosctl` works against migrated workers; advanced flows (system extensions, upgrades through talosctl) are follow-up. -## Design +## Phase 1 — Migrate worker bootstrap from kubeadm to Talos -### Package split +### What changes -Two packages replace today's monolithic `kubernetes`: +Inside the existing monolithic `kubernetes` chart: -- **`kubernetes`** — control-plane only. Renders `Cluster`, `KamajiControlPlane`, `KubevirtCluster` (when needed for KubeVirt-backed pools), `kubevirt-ccm` for the tenant, addons (cert-manager, FluxCD, ingress-nginx, etc.), and the Talos signer sidecar configuration when Talos backends are in use. Does **not** render any node-pool objects. -- **`kubernetes-nodes`** — exactly one node pool per HelmRelease. Renders all backend-specific resources for that pool: for `kubevirt-*` backends, the CAPI `MachineDeployment` + bootstrap config template + infrastructure machine template + `MachineHealthCheck`; for `cloud-talos-*` backends, a `cluster-autoscaler` Deployment configured with the cloud's native provider plus the Talos machineconfig Secret consumed via cloud-init. +- `KubeadmConfigTemplate` → `TalosConfigTemplate` (CABPT). 
The bootstrap-data Secret content changes from kubeadm cloud-init to Talos machineconfig; CAPK still injects it as cloud-init userdata into the VM. +- Base disk image referenced by `KubevirtMachineTemplate` switches from a Cozystack-built Ubuntu image to a Cozystack-built Talos image. Image build pipeline updated accordingly. +- The tenant's `KamajiControlPlane` gains a sidecar entry running `clastix/talos-csr-signer`, exposed alongside the API server on the tenant API endpoint (TCP/50001 for Talos `trustd`). +- `KamajiControlPlane` exposed-ports configuration extended to include `:50001` alongside `:6443` so the tenant API LoadBalancer Service surfaces both. -A tenant cluster is therefore described as `1 × kubernetes` HelmRelease + `N × kubernetes-nodes` HelmReleases. The control-plane does not hold any reference to its node pools — workers self-register against the apiserver via their bootstrap mechanism, just as they do today. +### Patch required: `cluster-api-control-plane-provider-kamaji` -```mermaid -flowchart LR - subgraph Mgmt[Management cluster] - direction TB - KCP[kubernetes/<br/>HelmRelease<br/>tenant-foo] - NG1[kubernetes-nodes/<br/>HelmRelease<br/>tenant-foo-md0<br/>backend: kubevirt-kubeadm] - NG2[kubernetes-nodes/<br/>HelmRelease<br/>tenant-foo-hel-fsn1<br/>backend: cloud-talos-hetzner] - NG3[kubernetes-nodes/<br/>HelmRelease<br/>tenant-foo-az-weu<br/>backend: cloud-talos-azure] - - KCP --> KamajiCP[KamajiControlPlane] - KamajiCP -.->|exposes :6443<br/>:50001 talosd| LB[Tenant API LB] - - NG1 --> CAPI1[MachineDeployment<br/>KubevirtMachineTemplate<br/>KubeadmConfigTemplate] - NG2 --> AS2[cluster-autoscaler<br/>--cloud-provider=hetzner] - NG3 --> AS3[cluster-autoscaler<br/>--cloud-provider=azure] - end - - CAPI1 -.creates.-> VMs[KubeVirt VMs] - AS2 -.scales.-> Hel[Hetzner cloud servers] - AS3 -.scales.-> Az[Azure VMSS instances] - - VMs --> LB - Hel --> LB - Az --> LB -``` +The CAPI control-plane provider for Kamaji does not currently render the signer sidecar from `KamajiControlPlane` spec. The work for Phase 1 includes: -### Linkage by name - -Following the `vm-instance` / `vm-disk` precedent, `kubernetes-nodes` references its parent `kubernetes` HelmRelease by **name**, not by an explicit CRD reference. The chart's `clusterName` value (e.g., `tenant-foo`) is used at template render time to: +- A patch to `clastix/cluster-api-control-plane-provider-kamaji` adding a generic `additionalContainers` (or signer-specific) field that is rendered into the `TenantControlPlane` and propagated through to the resulting Deployment. +- A corresponding upstream PR. The Cozystack release of this provider runs the patched version until the PR merges; the fork window is expected to be small. -- `lookup` the tenant's `KamajiControlPlane` and read its API endpoint, CA, and bootstrap-token data. -- Set Helm release naming convention `kubernetes-nodes-` so that orphan detection during migration is deterministic. -- Apply labels (`apps.cozystack.io/cluster: `) to all rendered objects, enabling label-selector-based reconciliation in `cluster-autoscaler` and any future controllers. +### Migration: seamless rolling update -If the parent control-plane is missing, the chart `fail`s the render with a clear error pointing at the expected HelmRelease name. This is the same mode of fragility present in `vm-instance/vm-disk` — accepted as a tradeoff for simplicity, with the understanding that a future iteration could move to an explicit CRD reference. +Standard CAPI MachineDeployment rolling update — no script, no migration tool: -### Backend abstraction +1. Operator pushes the chart upgrade. +2. 
The chart now renders the new `TalosConfigTemplate` and the new Talos-imaged `KubevirtMachineTemplate`. The `MachineDeployment`'s `template.spec.bootstrap.configRef` and `template.spec.infrastructureRef` switch to point at the new templates. +3. CAPI detects the template change, creates new Talos machines, cordons + drains old kubeadm machines, deletes them. Standard `maxSurge` / `maxUnavailable` knobs apply. -`kubernetes-nodes` exposes a `backend.type` field that selects which set of templates is rendered. Three backends are introduced; only `kubevirt-kubeadm` is fully equivalent to today's behaviour. +During the rollout, the tenant cluster passes through a brief mixed state: some Ubuntu + kubeadm nodes, some Talos nodes. The state is valid — both kinds of nodes register against the same Kamaji apiserver via standard kubelet CSR approval, share the same CNI (Cilium), and are indistinguishable from the apiserver's point of view. After rollout completes, every node is Talos. -#### `kubevirt-kubeadm` (existing, default) +### What stays the same in Phase 1 -Identical to today's flow. Renders `MachineDeployment` + `KubevirtMachineTemplate` + `KubeadmConfigTemplate` + `MachineHealthCheck`. Bootstrap via kubeadm join. Workers run a Cozystack-blessed Ubuntu/Talos-untouched image. CAPI + CAPK do all the lifecycle work; cluster-autoscaler drives scale. +- The chart's user-facing API. `values.yaml` (`nodeGroups`, control-plane fields, etc.) is identical. Tenants do not edit anything. +- The package's structural layout. `kubernetes` is still one HelmRelease bundling control-plane + pools. Splitting happens in Phase 2. +- CAPI, CAPK, kubevirt-ccm, cluster-autoscaler, Cilium — all unchanged in role. -This is the migration target for every existing pool. 
+## Phase 2 — Split the kubernetes package -#### `kubevirt-talos` +### What changes -Renders the same CAPI/CAPK objects as above, but with a `TalosConfigTemplate` (from `cluster-api-bootstrap-provider-talos`) replacing `KubeadmConfigTemplate`. Worker VMs boot from a Talos image. Bootstrap fetches the Talos machineconfig from CAPI and joins the cluster via standard Talos PKI. +Once Phase 1 has rolled out, the chart's internals are split into two sibling packages: -The tenant's `KamajiControlPlane` carries an `additionalContainers` entry running `clastix/talos-csr-signer` listening on UDP/50001, exposed alongside `:6443` on the tenant API LoadBalancer. This is what allows `talosctl` to operate against worker nodes whose control-plane is Kamaji rather than Talos. +- **`kubernetes`** — control-plane only. Renders `Cluster`, `KamajiControlPlane` (with the talos-csr-signer sidecar), `KubevirtCluster`, `kubevirt-ccm`, and addons (cert-manager, FluxCD, ingress-nginx, etc.). Renders no node-pool objects. +- **`kubernetes-nodes`** — exactly one node pool per HelmRelease. Renders `MachineDeployment` + `TalosConfigTemplate` + `KubevirtMachineTemplate` + `MachineHealthCheck` for that pool. Also renders a per-pool `cluster-autoscaler` Deployment in the management cluster, scoped to this pool's `MachineDeployment`. -This backend keeps CAPI in the loop because for KubeVirt VMs the `cluster-api-provider-kubevirt` machinery is the path of least resistance — it already handles VM lifecycle, networking, and storage attachment. +A tenant cluster is therefore described as `1 × kubernetes` HelmRelease + `N × kubernetes-nodes` HelmReleases. The control-plane holds no reference to its node pools — workers self-register against the apiserver via Talos bootstrap, as they already do after Phase 1. 
-#### `cloud-talos-hetzner` and `cloud-talos-azure` +There is no `backend.type` field in `kubernetes-nodes` values: the only backend is "kubevirt-talos", since Phase 3 (which would introduce alternative backends) is out of scope. The field will be added when Phase 3 lands. -No Cluster API involvement. The pattern mirrors what Cozystack already uses for the management cluster (see `/docs/v1.3/operations/multi-location/autoscaling/`): +```mermaid +flowchart LR + Old[Before Phase 2<br/>kubernetes HelmRelease<br/>monolith with nodeGroups] + Old --> Split[Phase 2 split] + Split --> CP[kubernetes HelmRelease<br/>control-plane only] + Split --> NG1[kubernetes-nodes HelmRelease<br/>pool-0] + Split --> NG2[kubernetes-nodes HelmRelease<br/>pool-1] + + CP --> KamajiCP[KamajiControlPlane<br/>+ talos-csr-signer] + CP --> CCM[kubevirt-ccm] + NG1 --> MD1[MachineDeployment<br/>+ TalosConfigTemplate<br/>+ KubevirtMachineTemplate<br/>+ cluster-autoscaler] + NG2 --> MD2[MachineDeployment<br/>+ TalosConfigTemplate<br/>+ KubevirtMachineTemplate<br/>+ cluster-autoscaler] +``` -- A `cluster-autoscaler` Deployment is rendered into the management cluster's namespace for this tenant, configured with `--cloud-provider=hetzner` (or `azure`), `--cloud-config` referencing a Secret with cloud credentials provided in the HelmRelease values, and `autoscalingGroups` describing min/max replicas, instance type, and region. -- A `Secret` holds the Talos machineconfig (see "Talos machineconfig" below). The autoscaler injects it via the cloud's `cloud-init` / `customData` mechanism when launching new instances. -- Newly booted instances complete their Talos bootstrap against the tenant's public API endpoint (Kamaji) using the Talos token in the machineconfig, register via kubelet, and obtain their kubelet client certificate through standard CSR approval. +### Linkage by name -These backends do not use CAPI at all. There is no `Machine`, no `MachineDeployment`. The autoscaler is the source of truth for desired pool size; the cloud's API is the source of truth for actual instance state. This is the same model proven on the management cluster. +`kubernetes-nodes` references the parent `kubernetes` HelmRelease **by name**, following the `vm-instance` / `vm-disk` precedent. The chart's `clusterName` value is used at template-render time to `lookup` the tenant's `KamajiControlPlane` for the API endpoint, CA, and bootstrap-token data. If the parent is missing, the chart `fail`s with a clear error pointing at the expected HelmRelease name. Same fragility tradeoff as in `vm-instance` / `vm-disk`, accepted for simplicity. ### Talos machineconfig: template + user overlay -A single Talos machineconfig per pool is generated by the `kubernetes-nodes` chart and stored as a Secret. It is constructed in two layers: +Each `kubernetes-nodes` HelmRelease generates a single Talos machineconfig for its pool, stored as a Secret.
Built in two layers: -**System layer (chart-managed, not exposed to user):** +**System layer** (chart-managed, not exposed to user): cluster CA, machine CA, apiserver endpoint, Talos token, kilo annotations, Cozystack defaults (registry mirrors, kubelet flags, etc.). Looked up at render time from the parent `KamajiControlPlane`. -- Cluster CA, machine CA, apiserver endpoint — read at template time via `lookup` from the tenant's `KamajiControlPlane`. -- Talos token — generated once per pool, stored alongside the machineconfig. -- Kilo annotations (`kilo.squat.ai/location`, `kilo.squat.ai/persistent-keepalive`, `topology.kubernetes.io/zone`) when the pool participates in the Cozystack mesh. -- Standard Cozystack defaults: registry mirrors, kubelet flags, time servers, install disk hints. +**User layer** (`userMachineConfig` in `values.yaml`): extra kubelet args, extra labels/taints, registry mirror overrides, Talos image-factory schematic, anything else explicitly exposed by the chart. -**User layer (`backend.userMachineConfig` in values.yaml):** +The two layers are merged at render time; the user never writes raw Talos YAML for cluster-critical fields. The chart guarantees the machineconfig results in a worker that joins the right Kamaji control-plane. -- Extra kubelet args. -- Extra node labels and taints (free-form). -- Per-pool registry mirrors override. -- Extra `extensions` (talos image factory schematic). -- Anything else the user explicitly wants to pass through. +### `kubevirt-ccm` stays in `kubernetes` package -The two layers are merged at render time and the result is the machineconfig that gets injected into cloud-init or the KubevirtMachineTemplate. The user never writes raw Talos YAML for cluster-critical fields; the chart guarantees the result will join the right control-plane. +`kubevirt-ccm` is logically a cluster-level property, not a per-pool one. 
It remains in the `kubernetes` package and is rendered once per tenant cluster, independent of how many `kubernetes-nodes` HelmReleases attach to it. -### Cluster-autoscaler — one per pool, in the management cluster +### Migration: monolith → split -The current model (one autoscaler per tenant in the management cluster, discovering all pools via clusterapi labels) does not generalise: a `cloud-talos-hetzner` pool needs `--cloud-provider=hetzner`, a `cloud-talos-azure` pool needs `--cloud-provider=azure`, and the upstream cluster-autoscaler accepts only one cloud-provider flag per Deployment. +A migration script (modelled on `migrations/29` from the `virtual-machine` split) walks every `kubernetes` HelmRelease with non-empty `nodeGroups`. For each `nodeGroup`: -The new model: each `kubernetes-nodes` HelmRelease renders its own `cluster-autoscaler` Deployment, in the management cluster, scoped to its pool. The autoscaler: +1. Annotates the existing `MachineDeployment`, `TalosConfigTemplate`, `KubevirtMachineTemplate`, `MachineHealthCheck` (and the matching `cluster-autoscaler` Deployment if it's per-pool) with `helm.sh/resource-policy: keep`, so the upcoming `kubernetes` chart reconcile does not delete them. +2. Creates a new `kubernetes-nodes-` HelmRelease with values translated from the source `nodeGroup`. +3. Patches the existing objects' `meta.helm.sh/release-name` and `meta.helm.sh/release-namespace` annotations to claim ownership for the new HelmRelease. +4. Strips the `nodeGroup` entry from the source `kubernetes` HelmRelease values. -- For `kubevirt-*` backends, uses `--cloud-provider=clusterapi` and watches just this pool's MachineDeployment. -- For `cloud-talos-*` backends, uses the corresponding native provider and the values-supplied `autoscalingGroups`. +The script is idempotent and safe to re-run. After migration, the source `kubernetes` HelmRelease's `nodeGroups` is empty and the corresponding `kubernetes-nodes` HelmReleases own the CAPI machinery. 
-Coordination across pools is left to the standard scheduler — pending pods select among pools via standard mechanisms (taints, node selectors, topology constraints). - -### Tenant-side node lifecycle (NLC) - -When `cluster-autoscaler` scales a `cloud-talos-*` pool down, it deletes the cloud VM. The tenant's apiserver still has a `Node` object that will linger until something deletes it. CAPI was previously the agent doing this; without CAPI, we need an equivalent. - -The `node-lifecycle-controller` from `cozystack/local-ccm` is a good fit for this role. The `kubernetes-nodes` chart for `cloud-talos-*` backends renders an NLC Deployment that runs in the management cluster but uses a kubeconfig pointing to the **tenant** apiserver. It watches Node objects with the `ToBeDeletedByClusterAutoscaler:NoSchedule` taint and removes them after a configurable grace period and unreachability check. - -For `kubevirt-*` backends NLC is not needed: CAPI's machine controller already removes the Node object when it deletes the Machine. - -### `kubevirt-ccm` stays in the control-plane package - -The Kubernetes Cloud Controller Manager for KubeVirt-backed nodes (`kubevirt-ccm`, currently in `templates/kccm/manager.yaml`) remains in the `kubernetes` package, not in `kubernetes-nodes`. It is critical for KubeVirt-backed tenants — without it, KubeVirt VMs do not get their LoadBalancer Services properly wired — and it is logically a property of the cluster, not of a particular node pool. - -Tenants that have **no** KubeVirt pools at all (only `cloud-talos-*` pools) will have `kubevirt-ccm` running idle. This is acceptable cost; a future `enabled` flag in the `kubernetes` chart can disable it on demand. +A subsequent chart release removes `nodeGroups` from the `kubernetes` chart schema entirely; users still on the old shape get a clear validation error pointing at the migration tool. 
## User-facing changes -- New CRD-style application `kubernetes-nodes` with `values.yaml` containing: `clusterName`, `backend.type`, `backend..*` settings, common `replicas`/`minReplicas`/`maxReplicas`, `roles`, `resources`, `userMachineConfig` (Talos backends only). -- Existing `kubernetes` `values.yaml` no longer accepts `nodeGroups` (after migration completes; during the migration window both shapes are accepted). -- New tenant-cluster pages in the dashboard list node pools as separate entities, with their backend type and current capacity. -- `cozystack` CLI gains commands to list, create, scale, and delete node pools per cluster. +- **End of Phase 1**: none visible — the chart's API is unchanged. Operators may notice workers running Talos in `kubectl describe node`; that's all. +- **End of Phase 2**: a new application kind `kubernetes-nodes` appears in the dashboard. Tenant cluster pages list node pools as separate entities. `cozystack` CLI gains commands for listing, creating, scaling and deleting node pools per cluster. ## Upgrade and rollback compatibility -The migration follows the precedent of the `virtual-machine` → `vm-instance` + `vm-disk` split: a long parallel period during which both shapes work, followed by a scripted migration and eventual removal of the legacy code path. - -**Phase 1 — both shapes accepted.** -- Ship `kubernetes-nodes` as a new package. Document its use for new node pools. -- The `kubernetes` chart continues to support `nodeGroups` in `values.yaml` exactly as today. -- Users can adopt `kubernetes-nodes` on a per-pool basis for new pools without touching existing ones. - -**Phase 2 — migration tool.** -- A migration script (modelled on `migrations/29` for the `virtual-machine` split) walks every `kubernetes` HelmRelease with non-empty `nodeGroups`. For each `nodeGroup` it: - 1. 
Patches the existing `MachineDeployment`, `KubevirtMachineTemplate`, `KubeadmConfigTemplate`, and `MachineHealthCheck` with `meta.helm.sh/release-name` and `meta.helm.sh/release-namespace` annotations pointing at the new `kubernetes-nodes-` HelmRelease. - 2. Creates the `kubernetes-nodes` HelmRelease with values copied from the source `nodeGroup`. - 3. Strips the `nodeGroup` entry from the `kubernetes` HelmRelease values. -- Critical pre-step: each affected resource is annotated `helm.sh/resource-policy: keep` first, so that the `kubernetes` chart's reconciliation does not delete it during the brief window before the new HelmRelease's first reconcile. -- The script is idempotent and safe to re-run. +**Phase 1 rollback** — if Talos rollout has issues, revert the chart upgrade. CAPI rolls workers back to the kubeadm template. Brief mixed-state window in both directions. -**Phase 3 — legacy removal.** -- The `kubernetes` chart drops `nodeGroups` from its schema entirely. Charts that still receive it produce a clear validation error pointing at the migration tool. -- Documentation deprecates the old shape. - -**Rollback.** -- During Phase 1 and Phase 2, rollback is a matter of reverting the migration: delete the `kubernetes-nodes` HelmRelease, restore the `nodeGroup` entry in `kubernetes` values, run reconcile. The migration script supports this direction explicitly. -- After Phase 3, rollback requires reinstating the legacy code path in the `kubernetes` chart. This is a hard cut and should not be done lightly; Phase 3 only ships once Phase 1 and 2 have been in production long enough to gather operational confidence. +**Phase 2 rollback** — during the migration window, both shapes coexist (Phase 1-only deployments still have monolithic `kubernetes`; new deployments use the split). The migration script supports a reverse direction. After legacy removal (post-Phase 2), rollback is a hard cut. 
## Security -- `kubernetes-nodes` HelmReleases run with the same RBAC as `kubernetes` HelmReleases today — they're both managed by Cozystack platform components, not by tenants. The split does not introduce new tenant-controlled inputs. -- Talos backends introduce a new credential: the per-pool `TALOS_TOKEN`, used by `clastix/talos-csr-signer` to validate worker bootstrap. Stored in a Secret in the tenant's namespace, rotated on pool re-creation. -- `cloud-talos-*` backends introduce cloud-provider credentials (Hetzner API token, Azure service principal). These are user-supplied at HelmRelease creation, stored as Secrets in the tenant's namespace, and never read by the tenant's apiserver — only by the management-cluster `cluster-autoscaler`. -- The Talos machineconfig contains the cluster CA and the bootstrap token. It is stored as a Secret accessible only to the autoscaler and the chart's render path. It is **not** exposed to the tenant's apiserver or workloads. -- `clastix/talos-csr-signer` uses a single shared `TALOS_TOKEN` per pool with no per-node identity proof. This matches upstream Talos's `trustd` model. Co-developed with CLASTIX; experimental upstream status acknowledged but accepted, given Cozystack's involvement in its development. +- Talos workers introduce a per-pool `TALOS_TOKEN` used by `talos-csr-signer` to validate Talos PKI handshakes. Stored as a Secret in the tenant's namespace, rotated on pool re-creation. Shared-token model matches upstream Talos `trustd`; co-developed with CLASTIX. +- The Talos machineconfig contains the cluster CA and bootstrap token. Stored as a Secret accessible only to the chart's render path and CAPK at VM-creation time. Not exposed to tenant workloads. +- During Phase 1 rollout, kubeadm- and Talos-bootstrapped nodes share the same Kamaji control-plane and CNI. No privilege escalation between bootstrap modes. 

 ## Failure and edge cases

-- **`kubernetes-nodes` HelmRelease created before its parent `kubernetes` HelmRelease** → chart `fail`s the render with a clear error message identifying the missing parent. No partial CAPI/autoscaler resources created.
-- **Parent `kubernetes` HelmRelease deleted while children exist** → all `kubernetes-nodes` HelmReleases for that cluster fail subsequent reconciles. An admission webhook on `kubernetes` HelmRelease delete blocks the operation if any `kubernetes-nodes` references it.
-- **Migration runs while autoscaler is mid-scale-out** → resource-policy `keep` annotation prevents deletion. The new HelmRelease's first reconcile picks up the in-flight `Machine` objects via the standard CAPI reconcile.
-- **`cluster-autoscaler` for a `cloud-talos-*` backend fails to delete a cloud VM** (rate limit, transient API error) → instances stay up; NLC will not see the Node as `ToBeDeletedByClusterAutoscaler:NoSchedule` and will not delete the Node. Operator alerted via metrics; manual cleanup required. Documented runbook.
-- **Talos-CSR-signer pod restart during worker bootstrap** → worker retries `trustd` calls with exponential backoff (Talos default). No data lost.
-- **Mixed-backend tenant where one pool fails reconcile** → other pools and the control-plane are unaffected (independent HelmReleases). The cluster degrades gracefully.
+- **Kamaji provider patch not in production at Phase 1 ship time** → blocks Phase 1. Either the patch lands upstream, or Cozystack runs a fork until it does.
+- **Talos image pull error / image missing** → KubeVirt VM doesn't boot, CAPI shows pending machines. Documented runbook for rolling back to the previous image reference.
+- **talos-csr-signer pod restart during worker bootstrap** → workers retry `trustd` calls with exponential backoff. No data lost.
+- **Mixed-state rollout interrupted (Phase 1)** → both kubeadm and Talos nodes coexist for longer than expected. The cluster remains functional; complete the rollout once the underlying issue is resolved.
+- **Phase 2 migration runs while pool is mid-scale-out** → `helm.sh/resource-policy: keep` annotation prevents deletion. The new HelmRelease's first reconcile picks up in-flight Machines via standard CAPI reconcile.
+- **`kubernetes-nodes` HelmRelease created before its parent `kubernetes`** → chart `fail`s the render with a clear error identifying the missing parent. No partial CAPI objects created.
+- **Parent `kubernetes` HelmRelease deleted while children exist** → an admission webhook on `kubernetes` HelmRelease delete blocks the operation if any `kubernetes-nodes` references it.

 ## Testing

-- Unit tests for chart rendering: synthetic inputs covering each backend, expected Kubernetes objects, expected absence of forbidden combinations (e.g., `userMachineConfig` for `kubevirt-kubeadm`).
-- Schema validation tests for the new `kubernetes-nodes` `values.yaml` shape.
-- Migration script tests: synthetic existing `kubernetes` releases with various `nodeGroups` configurations; verify idempotence, rollback, and identity preservation (Machine names, BootstrapData ownership).
-- Integration tests with `kind` and a stub KubeVirt: full lifecycle of `kubevirt-kubeadm` and `kubevirt-talos` pools.
-- E2E in CI for `cloud-talos-*` backends using a small Hetzner project and an Azure subscription: scale-up, scale-down, NLC behaviour on rapid scale-down.
-- Failure-injection tests: kill the talos-csr-signer pod during worker join; kill the cluster-autoscaler pod mid-scale; delete a pool's cloud-credentials Secret and verify graceful degradation.
+- **Phase 1 unit**: synthetic inputs covering Talos machineconfig generation, MachineDeployment/template shape, signer sidecar wiring.
+- **Phase 1 integration**: kind + KubeVirt + Kamaji + CABPT + signer; spin up a worker, verify it joins via Talos bootstrap, verify `talosctl` works against it.
+- **Phase 1 migration**: synthetic existing `kubernetes` HelmRelease running Ubuntu+kubeadm; upgrade chart; verify CAPI rolls Talos nodes in and drains kubeadm ones without losing apiserver availability.
+- **Phase 2 unit**: chart rendering for `kubernetes-nodes` across various inputs; expected object shape and labels.
+- **Phase 2 migration**: synthetic existing `kubernetes` HelmRelease with `nodeGroups`; run migration script; verify idempotence, ownership transfer, and that no CAPI objects are deleted in flight.

 ## Rollout

-- **Phase 1.** Implement `kubernetes-nodes` package with `kubevirt-kubeadm` backend only. Ship as opt-in alongside the existing `kubernetes` chart with no migration required for existing pools.
-- **Phase 2.** Add `kubevirt-talos` backend, including talos-csr-signer integration in the `kubernetes` (control-plane) chart.
-- **Phase 3.** Add `cloud-talos-hetzner` and `cloud-talos-azure` backends, including per-pool cluster-autoscaler and tenant-side NLC.
-- **Phase 4.** Ship migration script. Document the migration; encourage but don't force adoption.
-- **Phase 5.** Once telemetry shows broad migration of existing tenants, remove `nodeGroups` from `kubernetes` chart's schema; ship final migration.
-
-Each phase is independently shippable and rollback-safe.
+1. Land the Kamaji control-plane provider patch (upstream PR; vendored fork if not merged in time).
+2. Ship Phase 1: build and publish the Talos image, update the chart to use CABPT + Talos templates + signer sidecar. Existing tenants pick up Talos via chart upgrade and CAPI rolling update.
+3. Ship Phase 2: introduce the `kubernetes-nodes` package alongside the (now Talos-only) `kubernetes` chart. Ship a migration script for existing `nodeGroups`. Deprecate the embedded `nodeGroups` shape.
+4. Later release: remove `nodeGroups` from the `kubernetes` chart schema.

 ## Open questions

-1. **CAPI removal long-term.** Should we set a roadmap target for removing CAPI from `kubevirt-kubeadm` and `kubevirt-talos` backends entirely (replacing CAPK with a thin Cozystack-internal controller that creates `VirtualMachine` objects directly)? This would unify all backends under "no CAPI" and reduce a substantial dependency, but requires re-implementing what CAPK gives us today (machine lifecycle, healthchecks, status). Out of scope for this proposal but worth scoping next.
-2. **Backend extension shape.** The proposed `backend.type` enum has a fixed set of values. Adding AWS, GCP, or on-prem KVM later is straightforward (new `cloud-talos-aws` etc.), but should we accept arbitrary backend identifiers and dispatch through a plugin mechanism? Probably not — the explicit enum keeps the chart auditable.
-3. **Per-pool talos-csr-signer vs cluster-wide.** Currently proposed as a single sidecar in the tenant's `KamajiControlPlane` Pod (cluster-wide). Should each pool have its own token for blast-radius isolation? Operationally heavier; security gain limited because tokens already give only the right to obtain a Talos machine cert, not Kubernetes API access. Open for discussion.
-4. **NLC reuse vs fork.** Should we deploy the existing `local-ccm` NLC in tenant-targeting mode, or fork it into a `tenant-nlc` package? Reuse keeps the codebase smaller; fork makes the host vs tenant deployment paths explicitly different. Likely reuse is correct.
-5. **Should `kubernetes-nodes` be allowed to advertise capacity to multiple `kubernetes` clusters?** Almost certainly no, but stating it explicitly. Each pool belongs to one cluster.
+1. **Kamaji provider patch upstream timeline.** Aim to merge into `clastix/cluster-api-control-plane-provider-kamaji`. Cozystack carries a fork in the meantime. Track issue/PR link here once filed.
+2. **Per-pool talos-csr-signer vs cluster-wide.** Currently proposed as a single sidecar in the tenant's `KamajiControlPlane` Pod (cluster-wide token). Should each pool have its own token for blast-radius isolation? Operationally heavier; security gain limited because tokens only grant the right to obtain a Talos machine cert, not Kubernetes API access.
+3. **kubeadm template removal timing.** Should `KubeadmConfigTemplate` stay in the chart for one release after Phase 1 and be removed in Phase 2, since the split re-architects the chart regardless?
+4. **Talos image build pipeline.** Where does the Cozystack Talos image live, who builds it, and on what cadence?

 ## Alternatives considered

-**Keep the monolithic `kubernetes` package and add `nodeGroupsBackend` discriminators.** Rejected because it would force every tenant cluster's HelmRelease to know about every backend, and growing the values shape further entangles control-plane and node-pool lifecycles. The whole reason for the split is to *separate* lifecycles, not to make the same release manage more variety.
-
-**Sidero Metal / CAPS for Talos-on-bare-metal.** Rejected. Sidero Labs has officially deprecated Sidero Metal. Successor (Omni) is a closed-core SaaS, not a drop-in OSS CAPI provider. Sidero is also bare-metal-only and assumes a Talos control-plane, incompatible with Kamaji.
+**Sidero Metal / CAPS for Talos.** Rejected. Sidero Labs has officially deprecated Sidero Metal. Its successor (Omni) is a closed-core SaaS, not a drop-in OSS provider. Sidero is also bare-metal-only and assumes a Talos control-plane, incompatible with Kamaji.

-**Explicit `clusterRef` on `kubernetes-nodes` instead of name-based linkage.** Considered. Trade-off favours simplicity: name-based linkage matches the `vm-instance/vm-disk` precedent that Cozystack maintainers and users are already familiar with, and the security gain of a CRD reference is marginal because both packages are platform-controlled (not tenant-controlled). The fragility of the name-based approach is real but understood and accepted.
+**Skip the split, keep the monolith after Talos migration.** Rejected as a long-term shape. Phase 1 alone is valuable as an OS migration, but stopping there leaves the chart monolithic and the road to hybrid clusters (Phase 3) blocked. The split is what enables future backends to be expressed cleanly.

-**Single global cluster-autoscaler per tenant with multi-cloud-provider support.** Not feasible. Upstream cluster-autoscaler accepts one `--cloud-provider` flag; supporting multiple simultaneously would require either a fork or a per-pool autoscaler. Per-pool autoscaler in the management cluster is the natural fit.
+**Switch to Talos and split in a single release.** Rejected as too risky. Bundling two independent migrations into one chart upgrade compounds rollback risk. Phasing them gives operators two well-understood checkpoints.

-**Inline-disk-style "embedded" node pools** (each `kubernetes` HelmRelease has its node pools as a sub-section, but rendered as separate releases under the hood). Rejected because it does not actually decouple the lifecycle — a `helm upgrade` on the parent still touches all children. The split has to be at the user-visible HelmRelease level for the goals to be achieved.
+**Explicit `clusterRef` on `kubernetes-nodes` instead of name-based linkage.** Considered. The trade-off favours simplicity: name-based linkage matches the `vm-instance` / `vm-disk` precedent and the security gain is marginal because both packages are platform-controlled.

-**Exposing Talos machineconfig directly to users without a system layer.** Rejected because it forces every user to understand Talos machineconfig deeply, and gives them enough rope to break the join with Kamaji (wrong CA, wrong endpoint, wrong token). The template + user-overlay approach matches the ergonomics Cozystack offers everywhere else (system handles the boilerplate, user describes intent).
+**Exposing Talos machineconfig directly to users without a system layer.** Rejected because it forces every user to understand Talos machineconfig deeply, and gives them enough rope to break the join with Kamaji (wrong CA, wrong endpoint, wrong token). The template + user-overlay approach matches the ergonomics Cozystack offers everywhere else.

From 017cf283ad8a65bd1e0bb9115e2a2cdd95df1b50 Mon Sep 17 00:00:00 2001
From: Andrei Kvapil
Date: Mon, 11 May 2026 20:56:29 +0200
Subject: [PATCH 3/3] design-proposal: link to Phase 3 draft (PR #9)

Co-Authored-By: Claude
Signed-off-by: Andrei Kvapil
---
 design-proposals/kubernetes-nodes-split/README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/design-proposals/kubernetes-nodes-split/README.md b/design-proposals/kubernetes-nodes-split/README.md
index b2618af..0215357 100644
--- a/design-proposals/kubernetes-nodes-split/README.md
+++ b/design-proposals/kubernetes-nodes-split/README.md
@@ -16,7 +16,7 @@ Hybrid clusters — workers that live outside the Cozystack management cluster (

 ## Scope and related proposals

-- **Phase 3 (hybrid clusters)** lives in a separate draft proposal — link to be added once that PR is open. None of the design here forecloses Phase 3; the package split is exactly what makes Phase 3 expressible cleanly.
+- **Phase 3 (hybrid clusters)** lives in a separate draft proposal: [`kubernetes-nodes-hybrid-clusters`](../kubernetes-nodes-hybrid-clusters/) ([PR #9](https://github.com/cozystack/community/pull/9)). None of the design here forecloses Phase 3; the package split is exactly what makes Phase 3 expressible cleanly.
 - **Companion: [`cross-cluster-tenant-mesh`](../cross-cluster-tenant-mesh/)** (PR #7). Independent of this proposal; relevant once tenants need to reach services across cluster boundaries.

 ## Context