Feat/checkpoint migration #224
Merged
…tch (--enable-checkpoint-migration)
…irs (union, not advance-only)
Turning the config optimization (#222) on or off renames the agent's kubernetes_logs sources (per-pipeline names vs optimizedSource-). Vector stores file read positions in <data_dir>//checkpoints.json, so after a rename it finds no checkpoint under the new name and re-reads the log files still on the node. On a 1000-pipeline stand this caused a one-time burst of ~70% duplicates, which makes enabling (or rolling back) the optimization disruptive.
--enable-checkpoint-migration (off by default) carries the positions across the rename. Two parts:
1. The agent config Secret name is bound to the optimization mode (-agent / -agent-opt), and both are kept current. A mode switch then changes the pod template and rolls the DaemonSet instead of doing a live --watch-config reload, so each node migrates exactly when its pod restarts and not-yet-rolled pods keep their previous config.
2. A checkpoint-merger init container runs before vector and consolidates the node's checkpoints (union keyed by file-content fingerprint, latest position) into the new source directories, so vector resumes each retained file at its last offset. It is idempotent and fail-open: on any error it logs and lets vector start, falling back to the pre-migration re-read.
The flag is gated the same way as the optimization: off by default, then on by default, then the gate is removed. A Vector opted out of the optimization is unaffected.
Measured on a 1000-pipeline stand (vector 0.48, Elasticsearch sink): going from legacy to optimized with migration re-delivers 0% of the retained logs (vs ~70% without), and the rollback from optimized back to legacy is also 0%. Operator upgrade with the flag off does not restart agents. Watch connections drop from 3258 to 262 on enable and return on rollback.
Release CI builds and pushes kaasops/checkpoint-merger (override: --checkpoint-merger-image). Behavior, limitations and an ops/observability section are in docs/config-optimization.md. Worth noting from there:
- restricted-image clusters must mirror the merger image before enabling: it runs as an init container, so an unpullable image stalls the pod (bounded by maxUnavailable=1);
- rollback seeds each legacy dir with the full checkpoint union; the foreign entries are inert, bounded, and expire;
- the standby config is validated when it becomes active, not before.