Atelet runs only on nodes hosting ateoms (#9) #134
Draft
eliranw wants to merge 11 commits into
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
This was referenced May 31, 2026
The new AteletNodeReconciler will refcount node claims per WorkerPool UID. Embedding the UID directly on the pod template means the reconciler can read it from pod labels without a separate WorkerPool lookup — which matters because the WorkerPool may be mid-deletion when its pods are being reconciled. Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
…nt-substrate#9) Add a busybox:1.36 init container to every ateom pod that probes $HOST_IP:8085 until atelet's gRPC port is reachable. $HOST_IP comes from the downward API (status.hostIP). This makes ateom robust to atelet upgrades, restarts, and node-cold-start races: the pod waits in Init:0/1 instead of crashlooping when atelet isn't yet serving. Independent of any node-labeling change, and useful on its own. Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
…t-substrate#9) Atelet DaemonSet now requires the ate.dev/atelet=true label to schedule a pod. Substrate's new AteletNodeReconciler (next commit) will populate this label on nodes that host ateom workloads, and remove it when they don't. The init container added in the prior commit absorbs the atelet startup gap, so this change does not introduce a race for ateom pods. NOTE: at first apply on an existing cluster, the DS rolling update will terminate atelet pods on nodes that don't have the label yet. Deploy the new atecontroller image first; it will label nodes that currently host ateoms before the DS update has fully rolled out. See release notes. Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
New Pod-keyed reconciler that will maintain the ate.dev/atelet=true node label and ate.dev/claim.<workerpool-uid> annotations. This commit lands the file skeleton plus the predicate and pod-to-node mapping with unit tests. The reconcile logic itself is stubbed (returns no-op); subsequent commits add it. Field indexer on spec.nodeName is registered in SetupWithManager so future List(client.MatchingFields) calls work against the cached client. Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
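A minimal sketch of what the predicate and the pod-to-node mapping might look like, using plain structs instead of the Kubernetes types. It follows the later cleanup commit in gating on the WorkerPool UID label. The label key value, the pod struct, and both function names are hypothetical, and skipping unscheduled pods is an assumption about the mapping.

```go
package main

import "fmt"

// Hypothetical value: the PR names a WorkerPoolUIDLabelKey constant, but its
// value is not shown on this page.
const workerPoolUIDLabelKey = "ate.dev/workerpool-uid"

// pod is a stand-in for the few corev1.Pod fields this sketch needs.
type pod struct {
	Name     string
	NodeName string // spec.nodeName; empty until the pod is scheduled
	Labels   map[string]string
}

// isAteomPod plays the role of the watch predicate: only pods that carry the
// WorkerPool UID label are passed on to the reconciler.
func isAteomPod(p pod) bool {
	return p.Labels[workerPoolUIDLabelKey] != ""
}

// nodeFor maps a pod event to the node that should be reconciled. ok is false
// when the pod is filtered out or is not yet bound to a node (assumed behavior).
func nodeFor(p pod) (node string, ok bool) {
	if !isAteomPod(p) || p.NodeName == "" {
		return "", false
	}
	return p.NodeName, true
}

func main() {
	pods := []pod{
		{Name: "ateom-1", NodeName: "node-a", Labels: map[string]string{workerPoolUIDLabelKey: "uid-1"}},
		{Name: "ateom-2", Labels: map[string]string{workerPoolUIDLabelKey: "uid-1"}}, // unscheduled
		{Name: "other", NodeName: "node-a"},                                          // not an ateom
	}
	for _, p := range pods {
		node, ok := nodeFor(p)
		fmt.Printf("%s: node=%q reconcile=%v\n", p.Name, node, ok)
	}
}
```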
…trate#9) reconcileNode lists ateom pods on the node (via cached spec.nodeName field selector), computes the set of distinct WorkerPool UIDs present, and SSA-applies the desired label + per-pool claim annotations. SSA's granular-map semantics let us add and remove individual claim keys without disturbing other field owners. Also registers AteletNodeReconciler in the envtest TestMain so integration tests against the reconciler can run. Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
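The desired-state computation in this commit could look roughly like the sketch below: collect the distinct WorkerPool UIDs from the pods on a node, emit one claim annotation per UID, and set the label only when at least one claim exists. The label key ate.dev/atelet and the ate.dev/claim. prefix come from the PR; the pod label key, the annotation value "true", and the type and function names are assumptions. The SSA apply itself is not shown.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	ateletLabelKey = "ate.dev/atelet"         // node label named in the PR
	claimPrefix    = "ate.dev/claim."         // claim annotation prefix named in the PR
	uidLabelKey    = "ate.dev/workerpool-uid" // hypothetical pod label key
)

// nodeState is the label and annotation set the reconciler would apply.
type nodeState struct {
	Labels      map[string]string
	Annotations map[string]string
}

// desiredNodeState takes the label maps of the pods found on one node and
// returns one claim per distinct WorkerPool UID, plus the atelet label when
// at least one claim exists.
func desiredNodeState(podLabels []map[string]string) nodeState {
	st := nodeState{Labels: map[string]string{}, Annotations: map[string]string{}}
	for _, labels := range podLabels {
		if uid := labels[uidLabelKey]; uid != "" {
			st.Annotations[claimPrefix+uid] = "true" // annotation value is assumed
		}
	}
	if len(st.Annotations) > 0 {
		st.Labels[ateletLabelKey] = "true"
	}
	return st
}

func main() {
	// Two pods from one pool and one pod from another share the node.
	st := desiredNodeState([]map[string]string{
		{uidLabelKey: "uid-a"},
		{uidLabelKey: "uid-a"},
		{uidLabelKey: "uid-b"},
	})
	keys := make([]string, 0, len(st.Annotations))
	for k := range st.Annotations {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fmt.Println(keys, st.Labels)
}
```

Because the function returns the complete desired set each time, a claim key that is absent from the result is what lets a declarative apply drop it; with no pods left, both maps are empty and the label goes away with the last claim.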
…gent-substrate#9) Three new envtest cases:
- Two pools on the same node → two claim annotations, one label
- Deleting one pool's pod with another pool still present → only that pool's annotation removed, label sticks
- Deleting the last ateom pod → both the claim and the label are removed via SSA's granular-map removal
Pod deletions use client.GracePeriodSeconds(0) so that envtest removes the pod object immediately rather than waiting for a kubelet that doesn't exist in the test environment. Validates the per-pool refcounting design.
Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
…bstrate#9) Wire the new reconciler into the atecontroller binary so it runs alongside WorkerPool and ActorTemplate reconcilers in production. Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
…ubstrate#9) The finalizer holds WorkerPool deletion until every Node has released the per-pool claim annotation (ate.dev/claim.<wp.UID>). Claim release happens naturally as the Deployment cascade deletes the ateom pods and AteletNodeReconciler observes the deletions. handleDeletion lists Nodes and requeues with backoff until no claim remains, then removes the finalizer. Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
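The check that gates finalizer removal can be sketched as a scan over node annotations. The helper name claimAnnotationKey and the ate.dev/claim.<uid> format appear in this PR, but its signature here is a guess; the node struct and nodesHoldingClaim are hypothetical stand-ins for the real Node list call and handleDeletion logic.

```go
package main

import "fmt"

// claimAnnotationKey builds the per-pool claim key in the format the PR
// describes (ate.dev/claim.<workerpool-uid>). The signature is assumed.
func claimAnnotationKey(workerPoolUID string) string {
	return "ate.dev/claim." + workerPoolUID
}

// node is a stand-in for the corev1.Node fields this sketch needs.
type node struct {
	Name        string
	Annotations map[string]string
}

// nodesHoldingClaim returns the names of nodes that still carry the claim
// annotation for the given WorkerPool UID. An empty result means the
// finalizer can be removed; otherwise the caller would requeue.
func nodesHoldingClaim(nodes []node, workerPoolUID string) []string {
	key := claimAnnotationKey(workerPoolUID)
	var holders []string
	for _, n := range nodes {
		if _, ok := n.Annotations[key]; ok {
			holders = append(holders, n.Name)
		}
	}
	return holders
}

func main() {
	nodes := []node{
		{Name: "node-1", Annotations: map[string]string{claimAnnotationKey("uid-a"): "true"}},
		{Name: "node-2"},
	}
	for _, uid := range []string{"uid-a", "uid-b"} {
		if holders := nodesHoldingClaim(nodes, uid); len(holders) > 0 {
			fmt.Printf("%s: requeue, still claimed by %v\n", uid, holders)
		} else {
			fmt.Printf("%s: no claims left, finalizer can be removed\n", uid)
		}
	}
}
```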
…-substrate#9) End-to-end envtest path: WorkerPool created -> finalizer present -> pod bound to node -> claim annotation appears -> pod deleted -> WorkerPool deleted -> finalizer holds until claim is gone -> both WorkerPool and claim annotation are absent. Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
Adds nodes/get;list;watch;patch and pods/get;list;watch to the ate-controller ClusterRole, picked up from the +kubebuilder:rbac markers on the new reconciler. Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
…rate#9) Cleanup pass over the feature (no behavior change):
- AteletNodeReconciler.reconcileNode: the node fetched for the existence check was bound to an unused variable. Discard the value and document that the Get exists only to avoid resurrecting a deleted Node via the SSA upsert.
- ateomPodPredicate now gates on WorkerPoolUIDLabelKey (the key reconcileNode actually consumes) instead of WorkerPoolLabelKey, so the watch filter matches the work the reconciler does and a half-labeled pod can't trigger a no-op reconcile.
- Centralize the claim-annotation format in a claimAnnotationKey helper, shared by the reconciler (writer) and the WorkerPool finalizer (reader), replacing the string concatenation duplicated across both files.
- Tests: replace hand-rolled finalizer-scan loops with controllerutil.ContainsFinalizer, and reuse the makeNode helper instead of inlining Node construction.
Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
a39a478 to b046670
Summary
Closes #9. Atelet currently runs on every node as a DaemonSet. This PR makes it run only on nodes that currently host ateom pods.
A new AteletNodeReconciler (Pod-keyed) watches ateom pods and maintains a substrate-owned ate.dev/atelet=true label on each Node currently hosting an ateom. The atelet DaemonSet now carries nodeSelector: ate.dev/atelet=true, so its footprint follows ateom placement.
Mechanism
- The reconciler lists the ateom pods on the node (via the cached spec.nodeName field index) and SSA-applies the desired node state.
- Each node carries one ate.dev/claim.<workerpool-uid> annotation per WorkerPool occupying it. The label is present iff at least one claim exists; SSA's field ownership prunes claims and the label as pods leave. Multiple WorkerPools can safely share a node.
- Each ateom pod gets a wait-for-atelet init container that TCP-probes $HOST_IP:8085 until the local atelet is serving. This absorbs the scheduling→atelet-ready gap and makes ateoms robust to atelet restarts/upgrades mid-life.
- An ate.dev/release-node-claims finalizer on WorkerPool holds deletion until its claims are released, so claims can't leak.
- Adds nodes: get;list;watch;patch and pods: get;list;watch to the controller ClusterRole (regenerated from kubebuilder markers).
Why reactive (not a placement policy)
Substrate doesn't decide which nodes ateoms run on — the kube-scheduler does, freely. The reconciler just records that choice as a label. This keeps the change small and avoids inventing a node-selection policy; richer schemes (worker classes, capacity reservations) can layer on later.
Test plan
- wait-for-atelet init container
- make verify (tests, gofmt, lint, codegen, licenses, go-mod-tidy, boilerplate)
Migration / rollout
This replaces the atelet DS's "run on all nodes" behavior with "run on labeled nodes only." Recommended two-phase rollout for existing clusters:
On a fresh cluster the atelet DaemonSet runs zero pods until the first WorkerPool's ateoms are scheduled. This is expected, not a broken install (kubectl rollout status on a 0-desired DaemonSet returns success immediately). The init container means even a single-shot kubectl apply -k is safe: ateoms wait rather than crash.