Atelet runs only on nodes hosting ateoms (#9) #134

Draft
eliranw wants to merge 11 commits into agent-substrate:main from eliranw:eliranw/atelet-node-placement

@eliranw eliranw commented May 31, 2026

Summary

Closes #9. Atelet currently runs on every node as a DaemonSet. This PR restricts it to nodes that currently host ateom pods.

A new AteletNodeReconciler (Pod-keyed) watches ateom pods and maintains a substrate-owned ate.dev/atelet=true label on each Node currently hosting an ateom. The atelet DaemonSet now carries nodeSelector: ate.dev/atelet=true, so its footprint follows ateom placement.

Mechanism

  • Reactive labeling. On each ateom-pod event, the reconciler lists pods on the affected node (via a cached spec.nodeName field index) and applies the desired node state with server-side apply (SSA).
  • Per-pool refcounting via annotations. Each node carries one ate.dev/claim.<workerpool-uid> annotation per WorkerPool occupying it. The label is present iff at least one claim exists; SSA's field ownership prunes claims and the label as pods leave. Multiple WorkerPools can safely share a node.
  • Init container. Every ateom pod gets a wait-for-atelet init container that TCP-probes $HOST_IP:8085 until the local atelet is serving. This absorbs the gap between ateom scheduling and atelet readiness, and makes ateoms robust to atelet restarts and upgrades mid-life.
  • Finalizer. An ate.dev/release-node-claims finalizer on WorkerPool holds deletion until its claims are released, so claims can't leak.
  • RBAC. Adds nodes: get;list;watch;patch and pods: get;list;watch to the controller ClusterRole (regenerated from kubebuilder markers).
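
The labeling and refcounting rules above can be sketched as a pure function. This is a minimal illustration, not the PR's code: the `pod` and `nodeState` types, the `desiredNodeState` name, and the annotation value `"true"` are assumptions (the description only specifies the label and the annotation key format). It uses plain structs instead of the Kubernetes API types so it runs with the standard library alone.

```go
package main

import (
	"fmt"
	"sort"
)

const (
	ateletLabelKey = "ate.dev/atelet" // label key named in this PR
	claimPrefix    = "ate.dev/claim." // annotation prefix named in this PR
)

// pod is a hypothetical stand-in for corev1.Pod with only the fields needed here.
type pod struct {
	NodeName string // spec.nodeName
	PoolUID  string // WorkerPool UID read from a pod label; empty for non-ateom pods
}

// nodeState is the label/annotation set the controller wants to own on a Node.
type nodeState struct {
	Labels      map[string]string
	Annotations map[string]string
}

// desiredNodeState derives one claim per distinct pool on the node and sets
// the atelet label only if at least one claim exists.
func desiredNodeState(nodeName string, pods []pod) nodeState {
	st := nodeState{Labels: map[string]string{}, Annotations: map[string]string{}}
	for _, p := range pods {
		if p.NodeName != nodeName || p.PoolUID == "" {
			continue
		}
		// Assumption: the annotation value is "true"; the PR only specifies the key.
		st.Annotations[claimPrefix+p.PoolUID] = "true"
	}
	if len(st.Annotations) > 0 {
		st.Labels[ateletLabelKey] = "true"
	}
	return st
}

func main() {
	pods := []pod{
		{NodeName: "node-a", PoolUID: "pool-1"},
		{NodeName: "node-a", PoolUID: "pool-1"}, // second pod of the same pool: still one claim
		{NodeName: "node-a", PoolUID: "pool-2"},
		{NodeName: "node-b", PoolUID: "pool-3"}, // different node: ignored
	}
	st := desiredNodeState("node-a", pods)
	keys := make([]string, 0, len(st.Annotations))
	for k := range st.Annotations {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	fmt.Println(st.Labels, keys)
}
```

Because the function returns the complete desired state each time, an empty pod list yields no label and no claims, which is what lets an apply of that state remove them.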

Why reactive (not a placement policy)

Substrate doesn't decide which nodes ateoms run on — the kube-scheduler does, freely. The reconciler just records that choice as a label. This keeps the change small and avoids inventing a node-selection policy; richer schemes (worker classes, capacity reservations) can layer on later.

Test plan

  • envtest: single pod → node gets label + claim
  • envtest: two pools on one node → two claims, one label
  • envtest: delete one pool's pod with another present → only its claim removed, label stays
  • envtest: delete last pod → claim and label both removed (SSA prune)
  • envtest: WorkerPool deletion held by finalizer until claim released, then completes
  • envtest: ateom Deployment carries the wait-for-atelet init container
  • make verify (tests, gofmt, lint, codegen, licenses, go-mod-tidy, boilerplate)

Migration / rollout

This replaces the atelet DS's "run on all nodes" behavior with "run on labeled nodes only." Recommended two-phase rollout for existing clusters:

  1. Deploy the new controller image + RBAC first. It observes existing ateom pods and labels their nodes.
  2. Apply the atelet DS manifest change. The rolling update drops atelet from unlabeled nodes (no ateoms there) and keeps it on labeled ones (no disruption).

On a fresh cluster the atelet DaemonSet runs zero pods until the first WorkerPool's ateoms are scheduled — expected, not a broken install (kubectl rollout status on a 0-desired DaemonSet returns success immediately). The init container means even a single-shot kubectl apply -k is safe: ateoms wait rather than crash.


google-cla Bot commented May 31, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

eliranw added 11 commits May 31, 2026 17:16
The new AteletNodeReconciler will refcount node claims per WorkerPool
UID. Embedding the UID directly on the pod template means the
reconciler can read it from pod labels without a separate
WorkerPool lookup — which matters because the WorkerPool may be
mid-deletion when its pods are being reconciled.

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
…nt-substrate#9)

Add a busybox:1.36 init container to every ateom pod that probes
$HOST_IP:8085 until atelet's gRPC port is reachable. $HOST_IP comes
from the downward API (status.hostIP). This makes ateom robust to
atelet upgrades, restarts, and node-cold-start races: the pod waits
in Init:0/1 instead of crashlooping when atelet isn't yet serving.

Independent of any node-labeling change — useful on its own.
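
For illustration only, the probe loop could look roughly like this in Go. The actual init container is a busybox image, so none of the names below (`waitForTCP`, `probeLocalListener`) come from the PR. To stay self-contained, the demo probes a throwaway local listener rather than $HOST_IP:8085.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// waitForTCP dials addr repeatedly until a connection succeeds or ctx ends.
// Each attempt is bounded by interval, and interval is also the pause between attempts.
func waitForTCP(ctx context.Context, addr string, interval time.Duration) error {
	var d net.Dialer
	for {
		dialCtx, cancel := context.WithTimeout(ctx, interval)
		conn, err := d.DialContext(dialCtx, "tcp", addr)
		cancel()
		if err == nil {
			conn.Close() // reachability is all we need; no data is exchanged
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("gave up waiting for %s: %w", addr, ctx.Err())
		case <-time.After(interval):
		}
	}
}

// probeLocalListener opens a listener on an ephemeral port and probes it,
// standing in for "atelet is serving on the node".
func probeLocalListener() error {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return err
	}
	defer ln.Close()
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	return waitForTCP(ctx, ln.Addr().String(), 50*time.Millisecond)
}

func main() {
	if err := probeLocalListener(); err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("port reachable")
}
```

In a pod, the address would be built from the HOST_IP environment variable and port 8085 (for example with net.JoinHostPort), and the context would carry no deadline so the pod keeps waiting in Init.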

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
…t-substrate#9)

Atelet DaemonSet now requires the ate.dev/atelet=true label to
schedule a pod. Substrate's new AteletNodeReconciler (next commit)
will populate this label on nodes that host ateom workloads, and
remove it when they don't. The init container added in the prior
commit absorbs the atelet startup gap, so this change does not
introduce a race for ateom pods.

NOTE: at first apply on an existing cluster, the DS rolling update
will terminate atelet pods on nodes that don't have the label yet.
Deploy the new atecontroller image first; it will label nodes that
currently host ateoms before the DS update has fully rolled out.
See release notes.

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
New Pod-keyed reconciler that will maintain the ate.dev/atelet=true
node label and ate.dev/claim.<workerpool-uid> annotations. This
commit lands the file skeleton plus the predicate and pod-to-node
mapping with unit tests. The reconcile logic itself is stubbed
(returns no-op); subsequent commits add it.

Field indexer on spec.nodeName is registered in SetupWithManager
so future List(client.MatchingFields) calls work against the
cached client.

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
…trate#9)

reconcileNode lists ateom pods on the node (via cached
spec.nodeName field selector), computes the set of distinct
WorkerPool UIDs present, and SSA-applies the desired label +
per-pool claim annotations. SSA's granular-map semantics let us
add and remove individual claim keys without disturbing other
field owners.
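
As a toy model of the pruning behavior this commit relies on, the sketch below simulates per-key ownership of an annotation map. It is a deliberate simplification of server-side apply (no conflict detection, no real API types), and the `object` type, the `apply` method, and the manager names are assumptions made for the example.

```go
package main

import "fmt"

// object models a Node's annotations plus which managers own each key.
type object struct {
	Annotations map[string]string
	owners      map[string]map[string]bool // key -> manager -> owns
}

func newObject() *object {
	return &object{
		Annotations: map[string]string{},
		owners:      map[string]map[string]bool{},
	}
}

// apply models one manager submitting its full desired annotation set.
// Keys the manager lists are set and owned; keys it owned before but no
// longer lists are released, and deleted once no manager owns them.
// Real SSA would also report conflicts on differing values; this model does not.
func (o *object) apply(manager string, desired map[string]string) {
	for k, v := range desired {
		o.Annotations[k] = v
		if o.owners[k] == nil {
			o.owners[k] = map[string]bool{}
		}
		o.owners[k][manager] = true
	}
	for k, owners := range o.owners {
		if _, keep := desired[k]; keep || !owners[manager] {
			continue
		}
		delete(owners, manager)
		if len(owners) == 0 {
			delete(o.owners, k)
			delete(o.Annotations, k)
		}
	}
}

func main() {
	node := newObject()
	// Another field owner writes an unrelated annotation.
	node.apply("some-other-manager", map[string]string{"example.com/note": "keep-me"})
	// The controller claims the node for two pools, then one pool leaves.
	node.apply("atecontroller", map[string]string{
		"ate.dev/claim.pool-1": "true",
		"ate.dev/claim.pool-2": "true",
	})
	node.apply("atecontroller", map[string]string{"ate.dev/claim.pool-2": "true"})
	fmt.Println(node.Annotations)
}
```

The point of the model is that the controller never issues an explicit delete: omitting a claim key from its next apply is enough to remove it, while keys owned by other managers stay in place.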

Also registers AteletNodeReconciler in the envtest TestMain
so integration tests against the reconciler can run.

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
…gent-substrate#9)

Three new envtest cases:
- Two pools on the same node → two claim annotations, one label
- Deleting one pool's pod with another pool still present →
  only that pool's annotation removed, label sticks
- Deleting the last ateom pod → both the claim and the label are
  removed via SSA's granular-map removal

Pod deletions use client.GracePeriodSeconds(0) so that envtest removes
the pod object immediately rather than waiting for a kubelet that
doesn't exist in the test environment.

Validates the per-pool refcounting design.

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
…bstrate#9)

Wire the new reconciler into the atecontroller binary so it runs
alongside WorkerPool and ActorTemplate reconcilers in production.

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
…ubstrate#9)

The finalizer holds WorkerPool deletion until every Node has
released the per-pool claim annotation (ate.dev/claim.<wp.UID>).
Claim release happens naturally as the Deployment cascade deletes
the ateom pods and AteletNodeReconciler observes the deletions.

handleDeletion lists Nodes and requeues with backoff until no
claim remains, then removes the finalizer.
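
A minimal sketch of the check described above, using plain structs instead of the API types. The `handleDeletion` and `claimAnnotationKey` names appear in this PR's commit messages, but the signatures, the `node` and `workerPool` types, and the `nodesHoldingClaim` helper are assumptions for the example.

```go
package main

import "fmt"

const releaseFinalizer = "ate.dev/release-node-claims" // finalizer named in this PR

// claimAnnotationKey builds the per-pool claim key, ate.dev/claim.<workerpool-uid>.
func claimAnnotationKey(poolUID string) string {
	return "ate.dev/claim." + poolUID
}

// node and workerPool are hypothetical stand-ins for the real API objects.
type node struct {
	Name        string
	Annotations map[string]string
}

type workerPool struct {
	UID        string
	Finalizers []string
}

// nodesHoldingClaim returns the names of nodes that still carry the pool's claim.
func nodesHoldingClaim(nodes []node, poolUID string) []string {
	key := claimAnnotationKey(poolUID)
	var holders []string
	for _, n := range nodes {
		if _, ok := n.Annotations[key]; ok {
			holders = append(holders, n.Name)
		}
	}
	return holders
}

// handleDeletion reports requeue=true while any claim remains; once none do,
// it removes the finalizer so deletion can complete.
func handleDeletion(wp *workerPool, nodes []node) (requeue bool) {
	if len(nodesHoldingClaim(nodes, wp.UID)) > 0 {
		return true
	}
	kept := wp.Finalizers[:0]
	for _, f := range wp.Finalizers {
		if f != releaseFinalizer {
			kept = append(kept, f)
		}
	}
	wp.Finalizers = kept
	return false
}

func main() {
	wp := &workerPool{UID: "pool-1", Finalizers: []string{releaseFinalizer}}
	nodes := []node{{
		Name:        "node-a",
		Annotations: map[string]string{claimAnnotationKey("pool-1"): "true"},
	}}
	fmt.Println("requeue while claimed:", handleDeletion(wp, nodes))

	delete(nodes[0].Annotations, claimAnnotationKey("pool-1")) // claim released
	fmt.Println("requeue after release:", handleDeletion(wp, nodes))
	fmt.Println("finalizers left:", len(wp.Finalizers))
}
```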

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
…-substrate#9)

End-to-end envtest path: WorkerPool created -> finalizer present
-> pod bound to node -> claim annotation appears -> pod deleted ->
WorkerPool deleted -> finalizer holds until claim is gone -> both
WorkerPool and claim annotation are absent.

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
Adds nodes/get;list;watch;patch and pods/get;list;watch to the
ate-controller ClusterRole, picked up from the +kubebuilder:rbac
markers on the new reconciler.

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
…rate#9)

Cleanup pass over the feature (no behavior change):

- AteletNodeReconciler.reconcileNode: the node fetched for the
  existence check was bound to an unused variable. Discard the value
  and document that the Get exists only to avoid resurrecting a
  deleted Node via the SSA upsert.
- ateomPodPredicate now gates on WorkerPoolUIDLabelKey (the key
  reconcileNode actually consumes) instead of WorkerPoolLabelKey, so
  the watch filter matches the work the reconciler does and a
  half-labeled pod can't trigger a no-op reconcile.
- Centralize the claim-annotation format in a claimAnnotationKey
  helper, shared by the reconciler (writer) and the WorkerPool
  finalizer (reader), replacing the string concatenation duplicated
  across both files.
- Tests: replace hand-rolled finalizer-scan loops with
  controllerutil.ContainsFinalizer, and reuse the makeNode helper
  instead of inlining Node construction.

Signed-off-by: Eliran Wolff <eliranw@nvidia.com>
@eliranw eliranw force-pushed the eliranw/atelet-node-placement branch from a39a478 to b046670 Compare May 31, 2026 14:17

Development

Successfully merging this pull request may close these issues.

Atelet should only run on nodes where ateoms are running.
