Feat/checkpoint migration by aa1ex · Pull Request #224 · kaasops/vector-operator

aa1ex · 2026-06-15T13:09:30Z

Turning the config optimization (#222) on or off renames the agent's kubernetes_logs sources (per-pipeline names vs optimizedSource-). Vector stores file read positions in a per-source checkpoints.json under <data_dir>, so after a rename it finds no checkpoint under the new name and re-reads the log files still on the node. On a 1000-pipeline stand that was a one-time ~70% duplicate burst, which makes enabling (or rolling back) the optimization disruptive.

The --enable-checkpoint-migration flag (off by default) carries the positions across the rename. It has two parts:

1. The agent config Secret name is bound to the optimization mode (-agent / -agent-opt), and both are kept current. A mode switch then changes the pod template and rolls the DaemonSet instead of doing a live --watch-config reload, so each node migrates exactly when its pod restarts and not-yet-rolled pods keep their previous config.
2. A checkpoint-merger init container runs before vector and consolidates the node's checkpoints (union keyed by file-content fingerprint, latest position) into the new source directories, so vector resumes each retained file at its last offset. It is idempotent and fail-open: on any error it logs and lets vector start, falling back to the pre-migration re-read.
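The merge rule in the second part can be sketched as follows. This is not the merger's actual code: the Entry type is a simplified, hypothetical record rather than Vector's on-disk schema, and "latest position" is read here as the largest byte offset, which is an assumption.

```go
package main

import "fmt"

// Entry is a simplified checkpoint record (hypothetical shape).
type Entry struct {
	Fingerprint string // identifies a file by content, not by path
	Position    int64  // byte offset already read
}

// mergeUnion keeps every fingerprint seen in any input set and, for each,
// the furthest position recorded for it.
func mergeUnion(sets ...[]Entry) map[string]int64 {
	merged := make(map[string]int64)
	for _, set := range sets {
		for _, e := range set {
			if cur, ok := merged[e.Fingerprint]; !ok || e.Position > cur {
				merged[e.Fingerprint] = e.Position
			}
		}
	}
	return merged
}

// toEntries converts a merged map back into a checkpoint set.
func toEntries(m map[string]int64) []Entry {
	out := make([]Entry, 0, len(m))
	for fp, pos := range m {
		out = append(out, Entry{Fingerprint: fp, Position: pos})
	}
	return out
}

func main() {
	// Two hypothetical source directories found on one node.
	dirA := []Entry{{"f1", 100}, {"f2", 50}}
	dirB := []Entry{{"f1", 80}, {"f3", 10}}

	merged := mergeUnion(dirA, dirB)
	fmt.Println(merged)

	// Merging the result with the inputs again changes nothing, which is
	// the idempotency property a repeated init container run relies on.
	again := mergeUnion(toEntries(merged), dirA, dirB)
	fmt.Println(len(again) == len(merged))
}
```

Taking the union rather than only advancing existing entries means a fingerprint that is missing from a target directory is seeded there, not skipped.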
The flag is gated the same way as the optimization: off by default, then on by default, then the gate is removed. A Vector that has opted out of the optimization is unaffected.
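To make the mode-bound Secret naming from the first part concrete, here is a small sketch. Only the -agent and -agent-opt suffixes come from the description above. The base name and the helper itself are hypothetical, and the operator's real naming code may differ.

```go
package main

import "fmt"

// agentSecretName binds the agent config Secret name to the optimization
// mode. The suffixes are taken from the PR description; the base name and
// this helper are illustrative assumptions.
func agentSecretName(base string, optimized bool) string {
	if optimized {
		return base + "-agent-opt"
	}
	return base + "-agent"
}

func main() {
	legacy := agentSecretName("example", false)
	optimized := agentSecretName("example", true)
	fmt.Println(legacy, optimized)

	// The pod template references the Secret by name, so a different name
	// is a template change and therefore a DaemonSet rollout, not a live
	// config reload.
	fmt.Println("template changes on mode switch:", legacy != optimized)
}
```

A Vector that never switches modes always resolves to the same name, so its pod template stays unchanged.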

Measured on a 1000-pipeline stand (vector 0.48, Elasticsearch sink): going from legacy to optimized with migration re-delivers 0% of the retained logs (vs ~70% without), and the rollback from optimized back to legacy is also 0%. Operator upgrade with the flag off does not restart agents. Watch connections drop from 3258 to 262 on enable and return on rollback.

Release CI builds and pushes kaasops/checkpoint-merger (override: --checkpoint-merger-image). Behavior, limitations and an ops/observability section are in docs/config-optimization.md. Worth noting from there: restricted-image clusters must mirror the merger image before enabling (init container, so an unpullable image stalls the pod, bounded by maxUnavailable=1); rollback seeds each legacy dir with the full checkpoint union (inert foreign entries, bounded, expire); the standby config is validated when it becomes active, not before.

aa1ex added 10 commits June 13, 2026 01:27

feat: migrate vector file checkpoints on config optimization mode switch (--enable-checkpoint-migration)

25dab23

fix(checkpoint): seed missing fingerprints into pre-existing source dirs (union, not advance-only)

a95ae23

fix(checkpoint): clean up standby secret on disable, harden merger tests and docs

32f8ebd

chore: gitignore e2e artifacts dir

fbae4a7

ci: build and push checkpoint-merger image on release

ba2f06a

docs: dedupe checkpoint-merger image note, add rollout tip

14a21f0

docs: document merger image precondition for restricted/air-gapped clusters

71279da

docs: note image-pull latency on first migrated rollout

e325f06

feat(checkpoint): log migration mode engagement, document ops observability

de3b068

docs: note per-source data_dir is not migrated (falls back to re-read)

6959ae1

aa1ex merged commit 11f5f36 into kaasops:main Jun 15, 2026
5 checks passed

aa1ex merged 10 commits into kaasops:main from aa1ex:feat/checkpoint-migration
