Feat/checkpoint migration #224

Merged
aa1ex merged 10 commits into kaasops:main from aa1ex:feat/checkpoint-migration
Jun 15, 2026
@aa1ex aa1ex commented Jun 15, 2026

Contributor

Turning the config optimization (#222) on or off renames the agent's kubernetes_logs sources (per-pipeline names vs optimizedSource-). Vector stores file read positions in <data_dir>//checkpoints.json, so after a rename it finds no checkpoint under the new name and re-reads the log files still on the node. On a 1000-pipeline test stand this caused a one-time burst in which ~70% of the retained logs were delivered a second time, which makes enabling (or rolling back) the optimization disruptive.

--enable-checkpoint-migration (off by default) carries the positions across the rename. Two parts:

1. The agent config Secret name is bound to the optimization mode (-agent / -agent-opt), and both are kept current. A mode switch then changes the pod template and rolls the DaemonSet instead of doing a live --watch-config reload, so each node migrates exactly when its pod restarts and not-yet-rolled pods keep their previous config.
2. A checkpoint-merger init container runs before vector and consolidates the node's checkpoints (union keyed by file-content fingerprint, latest position) into the new source directories, so vector resumes each retained file at its last offset. It is idempotent and fail-open: on any error it logs and lets vector start, falling back to the pre-migration re-read.
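The consolidation step can be pictured with a minimal Go sketch. It is not the merger's actual code: the `Checkpoint` type, its field names, and `mergeCheckpoints` are hypothetical, the record layout is simplified rather than Vector's on-disk schema, and "latest position" is assumed here to mean the larger byte offset.

```go
package main

import "fmt"

// Checkpoint is a simplified, hypothetical record. It is not Vector's
// on-disk checkpoints.json schema.
type Checkpoint struct {
	Fingerprint string // identifies a file by content, not by path or source name
	Position    int64  // byte offset that has already been read
}

// mergeCheckpoints returns the union of all input sets keyed by fingerprint.
// When a fingerprint appears more than once, the entry with the larger
// position wins (the assumed meaning of "latest position").
func mergeCheckpoints(sets ...[]Checkpoint) map[string]Checkpoint {
	merged := make(map[string]Checkpoint)
	for _, set := range sets {
		for _, cp := range set {
			cur, seen := merged[cp.Fingerprint]
			if !seen || cp.Position > cur.Position {
				merged[cp.Fingerprint] = cp
			}
		}
	}
	return merged
}

// toSlice flattens a merged map so it can be fed back into mergeCheckpoints.
func toSlice(m map[string]Checkpoint) []Checkpoint {
	out := make([]Checkpoint, 0, len(m))
	for _, cp := range m {
		out = append(out, cp)
	}
	return out
}

func main() {
	// Two per-pipeline sources that tracked overlapping files.
	sourceA := []Checkpoint{{"fp-1", 100}, {"fp-2", 50}}
	sourceB := []Checkpoint{{"fp-2", 80}, {"fp-3", 10}}

	merged := mergeCheckpoints(sourceA, sourceB)

	// Feeding the result back in changes nothing, which is what makes a
	// repeated run of such a merge safe.
	again := mergeCheckpoints(sourceA, sourceB, toSlice(merged))

	fmt.Println(len(merged), merged["fp-2"].Position, len(again)) // prints: 3 80 3
}
```

Keying by fingerprint rather than by source name is what lets a position survive the rename: the same file is recognized under whichever source directory it ends up in.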
The flag is gated the same way as the optimization: off by default, then on by default, then the gate is removed. A Vector that has opted out of the optimization is unaffected.
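The first part, binding the Secret name to the mode, can be illustrated with a small Go sketch. The function name and the `base` prefix are hypothetical; only the two suffixes come from the description above, and the operator's real naming code may differ.

```go
package main

import "fmt"

// agentSecretName returns the agent config Secret name for a given mode.
// base is a hypothetical prefix; the "-agent" and "-agent-opt" suffixes
// are the ones named in the PR description.
func agentSecretName(base string, optimized bool) string {
	if optimized {
		return base + "-agent-opt"
	}
	return base + "-agent"
}

func main() {
	legacy := agentSecretName("example", false)
	optimized := agentSecretName("example", true)

	// The two modes reference different Secrets, so a mode switch changes
	// the pod template instead of the contents of a Secret already mounted.
	fmt.Println(legacy, optimized, legacy != optimized)
}
```

Because the pod template references the Secret by name, a different name is a template change, and a template change is what triggers a DaemonSet rollout rather than an in-place reload.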

Measured on a 1000-pipeline stand (vector 0.48, Elasticsearch sink): going from legacy to optimized with migration re-delivers 0% of the retained logs (vs ~70% without), and the rollback from optimized back to legacy is also 0%. Operator upgrade with the flag off does not restart agents. Watch connections drop from 3258 to 262 on enable and return on rollback.

Release CI builds and pushes kaasops/checkpoint-merger (override: --checkpoint-merger-image). Behavior, limitations and an ops/observability section are in docs/config-optimization.md. Worth noting from there:

- Restricted-image clusters must mirror the merger image before enabling. The merger is an init container, so an unpullable image stalls the pod; the impact is bounded by maxUnavailable=1.
- Rollback seeds each legacy dir with the full checkpoint union. The foreign entries are inert, bounded, and expire.
- The standby config is validated when it becomes active, not before.

@aa1ex aa1ex merged commit 11f5f36 into kaasops:main Jun 15, 2026
5 checks passed