Feat/config optimization sources #222

Merged
aa1ex merged 5 commits into kaasops:main from aa1ex:feat/config-optimization-sources on Jun 12, 2026
Conversation

@aa1ex

@aa1ex aa1ex commented Jun 12, 2026

Contributor

Every kubernetes_logs source in Vector runs its own apiserver clients: 3 watch streams (pods, namespaces, nodes) and a separate pod metadata cache. The operator generates one source per pipeline, so an agent on a cluster with N pipelines holds 3×N watch connections and N copies of the node's pod metadata, and reconnects all of them every ~290s. On our cluster with ~2000 pipelines, Vector agents generate 90-95% of all kube-apiserver requests (55-60M per hour), and at ~500 sources an agent doesn't even start with the default resource limits.

This adds an opt-in operator flag --enable-config-optimization (a Vector CR can be opted out with the vector-operator.kaasops.io/config-optimization: disabled annotation, e.g. for a staged rollout). The CRD is not changed; the flag is expected to become the default and be removed eventually.

When enabled, sources that differ only in the watched namespace are collapsed into a single source with a kubernetes.io/metadata.name in (...) selector (split at 1000 namespaces to keep the selector reasonable), and route transforms split the stream back per namespace: a flat route up to 16 namespaces, md5-bucketed two-level routing above that. Inputs of pipeline transforms and sinks are rewired automatically.

Sources with different settings are left alone. An event matching several pipelines still reaches all of them. Source names are derived from the group settings hash and don't depend on the namespace list, so file checkpoints survive adding/removing pipelines. With the flag off the generated config doesn't change.
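The grouping described above can be sketched roughly as follows. This is a minimal illustration, not the operator's actual code: the function names, the `optimized_` name prefix, the SHA-256 settings hash, and the bucket count of 16 are all assumptions made for the example. Only the 1000-namespace split, the 16-namespace flat-route limit, the selector shape, and the use of md5 for bucketing come from the description.

```python
import hashlib
import json
from collections import defaultdict

MAX_NAMESPACES_PER_SOURCE = 1000  # split threshold stated in the PR description
FLAT_ROUTE_LIMIT = 16             # flat route is used up to this many namespaces


def settings_hash(settings: dict) -> str:
    """Stable digest of the source settings (namespace excluded).

    Hypothetical helper: the real operator may hash differently.
    """
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]


def collapse_sources(pipelines: list[dict]) -> dict[str, dict]:
    """Group pipelines whose sources differ only in the watched namespace.

    Each pipeline is {"namespace": str, "settings": dict}. Returns a mapping
    from generated source name to its selector and namespace list.
    """
    groups = defaultdict(set)
    for p in pipelines:
        groups[settings_hash(p["settings"])].add(p["namespace"])

    sources = {}
    for digest, namespaces in groups.items():
        ordered = sorted(namespaces)
        for i in range(0, len(ordered), MAX_NAMESPACES_PER_SOURCE):
            chunk = ordered[i:i + MAX_NAMESPACES_PER_SOURCE]
            # The name depends on the settings digest and chunk index only,
            # never on which namespaces are in the chunk.
            name = f"optimized_{digest}_{i // MAX_NAMESPACES_PER_SOURCE}"
            selector = f"kubernetes.io/metadata.name in ({','.join(chunk)})"
            sources[name] = {"selector": selector, "namespaces": chunk}
    return sources


def needs_two_level_routing(namespaces: list[str]) -> bool:
    """True when the namespace count exceeds the flat-route limit."""
    return len(namespaces) > FLAT_ROUTE_LIMIT


def route_bucket(namespace: str, buckets: int = 16) -> int:
    """First routing level above the flat limit: an md5-derived bucket index."""
    return int(hashlib.md5(namespace.encode()).hexdigest(), 16) % buckets
```

Because the generated name is a function of the settings digest alone, adding or removing a pipeline with the same settings changes the selector but not the source name, which is the property that lets file checkpoints survive.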

Numbers from a test bench (1000 pipelines, single-node kind, vector 0.48, same workload):

| | before | after |
| --- | --- | --- |
| agent watch requests to apiserver, per 10 min | 6014 | 6 |
| agent memory | 2802 MiB | 119 MiB |
| agent CPU / delivery throughput at nominal load | | no change |
| delivery integrity, 600k numbered events | | 0 lost, 0 duplicated |
| kube-apiserver memory (separate run on the same bench, ES sink) | 2218 MiB | 1403 MiB |
| apiserver 429 responses | 0.24/s steady | 0 |
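The watch-request row is roughly what the figures in the description predict. The back-of-the-envelope check below assumes that each source reconnects its 3 watches once per ~290s cycle and that the 1000 bench pipelines collapse into a single source; neither assumption is stated for the bench itself, and `watch_requests` is a hypothetical helper.

```python
WATCHES_PER_SOURCE = 3      # pods, namespaces, nodes (from the PR description)
RECONNECT_INTERVAL_S = 290  # approximate reconnect period (from the PR description)


def watch_requests(sources: int, window_s: int = 600) -> int:
    """Approximate watch requests one agent issues within window_s seconds."""
    reconnects = window_s // RECONNECT_INTERVAL_S  # whole reconnect cycles in the window
    return WATCHES_PER_SOURCE * sources * reconnects


print(watch_requests(1000))  # 6000, close to the measured 6014
print(watch_requests(1))     # 6, matching the measured 6
```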

Rollout note: enabling or disabling the optimization renames the sources, so vector re-reads the log files retained on the nodes once (no losses, one-time duplicates). Checkpoint migration via an init-container is planned as a follow-up PR. Docs: docs/config-optimization.md.

Closes the "Vector config optimization" roadmap item from the README.

@aa1ex aa1ex merged commit 5f863d5 into kaasops:main Jun 12, 2026
5 checks passed
@aa1ex aa1ex deleted the feat/config-optimization-sources branch June 12, 2026 14:20
@sakateka

Contributor

Hi! Could you please explain what will happen if I enable optimization and one of the sinks gets stuck and can’t send logs, while its retry setting is left at the default (retry indefinitely), and I have buffers configured with when_full: block at every level?

Vector explicitly describes this behavior: “A source only sends events as fast as the slowest sink that is configured to provide backpressure (buffer.when_full = block)” (see the concepts documentation).

Will all pipelines built by this optimization under a single source suffer because of one slow sink?
Is there any best practice for avoiding this?

@aa1ex

aa1ex commented Jun 15, 2026

Contributor Author

Hi, @sakateka! Good question, and yes, that's the expected behavior.

With optimization enabled, pipelines that share identical source settings are collapsed onto a single kubernetes_logs source. Since Vector emits only as fast as the slowest sink applying backpressure (when_full: block), a stuck sink with indefinite retries will stall every pipeline sharing that source, not just its own. Pipelines in other groups and sources that aren't collapsed are unaffected.
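To make the blast radius concrete, here is a toy model of which pipelines stall when one blocking sink gets stuck. It only encodes the rule stated above (every pipeline on the same shared source stalls); the function and the source and pipeline names are hypothetical.

```python
def stalled_pipelines(groups: dict[str, list[str]], stuck: set[str]) -> set[str]:
    """Return the pipelines stalled by stuck sinks that apply backpressure.

    groups maps a source name to the pipelines it feeds; stuck holds the
    pipelines whose sink is blocked with when_full: block and unbounded retries.
    """
    stalled = set()
    for members in groups.values():
        # One stuck member stalls the shared source, and with it every member.
        if any(p in stuck for p in members):
            stalled.update(members)
    return stalled


groups = {
    "shared_a": ["p1", "p2", "p3"],  # collapsed group
    "shared_b": ["p4"],              # a different settings group
    "dedicated": ["p5"],             # source that was not collapsed
}
print(sorted(stalled_pipelines(groups, {"p2"})))  # ['p1', 'p2', 'p3']
```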

This is an inherent trade-off of a shared source: the savings come from collapsing the per-pipeline watchers and readers, while in Vector backpressure isolation only comes from a non-blocking buffer or a bounded retry. So the cleanest fix is on the affected sink itself. Setting buffer.when_full: drop_newest (or a disk buffer with overflow) and/or a bounded request.retry_max_duration_secs stops backpressure from propagating back to the shared source, keeps the optimization's savings, and only impacts the misbehaving sink.
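As a sketch of that sink-side change, the helper below rewrites a sink definition (expressed as a dict, as it would appear in a JSON Vector config) to use the two settings named above. The `isolate_sink` helper, the example sink, its endpoint, its input name, and the value 30 are illustrative assumptions; which retry options actually bound retries for a given sink should be checked against that sink's reference.

```python
import copy
import json


def isolate_sink(sink: dict, retry_max_duration_secs: int = 30) -> dict:
    """Return a copy of a sink config that no longer blocks its upstream.

    Hypothetical helper: sets buffer.when_full to drop_newest and sets
    request.retry_max_duration_secs, leaving all other options untouched.
    """
    out = copy.deepcopy(sink)
    out.setdefault("buffer", {})["when_full"] = "drop_newest"
    out.setdefault("request", {})["retry_max_duration_secs"] = retry_max_duration_secs
    return out


# Illustrative sink; the names and endpoint are made up for the example.
unreliable = {
    "type": "elasticsearch",
    "inputs": ["team_a_route"],
    "endpoints": ["http://es.example:9200"],
    "buffer": {"type": "memory", "max_events": 500, "when_full": "block"},
}
print(json.dumps(isolate_sink(unreliable), indent=2))
```

The trade-off is explicit: events for this one sink are dropped when its buffer fills, in exchange for the other pipelines on the shared source continuing to flow.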

If you instead need hard isolation for a specific pipeline (a known-unreliable destination where dropping isn't acceptable), the optimizer hasn't been released yet, so before release we'll add a way to exclude that pipeline from optimization, keeping its own dedicated source. We'll document this behavior and the recommendations as well.

Thanks for the detailed report!

@aa1ex aa1ex mentioned this pull request Jun 15, 2026