Reduce log volume by PSeitz-dd · Pull Request #6568 · quickwit-oss/quickwit

PSeitz-dd · 2026-06-30T17:27:14Z

rate_limited_tracing macro (rate_limited_info etc.): the suppressed count is now folded into the emitted line as a num_suppressed=N field, instead of a separate preceding line.
In combination with #6549, this lets us aggregate on the field and recover the true pre-suppression log count.
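A minimal sketch of the aggregation this enables, assuming each emitted line stands for itself plus the `num_suppressed` lines dropped since the previous emission at that call site. The `EmittedLine` struct and `true_log_count` function are hypothetical names for illustration; they are not part of Quickwit or of #6549.

```rust
/// One emitted log line as seen by the log backend (hypothetical shape).
struct EmittedLine {
    /// Value of the `num_suppressed` field; `None` if the line carried no such field.
    num_suppressed: Option<u64>,
}

/// Recovers the pre-suppression count: every emitted line counts once,
/// plus the lines that were suppressed since the previous emission.
fn true_log_count(lines: &[EmittedLine]) -> u64 {
    lines
        .iter()
        .map(|line| 1 + line.num_suppressed.unwrap_or(0))
        .sum()
}
```

Lines suppressed after the last emission at a call site are only accounted for once that call site emits again, so the recovered total can lag slightly behind the real one.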

3 Types of Log Changes:

INFO → DEBUG (routine per-operation chatter): node pool-add ×4 (the ~907M dominant source), send-to-index-serializer, spawning pipeline, merge schedule/download, leaf split-finished/offsets, offload-to-lambda, resetting pipeline, adding-shards-assignment.
INFO → rate-limited INFO (1/min) (worth a heartbeat): new-split, actor-exit (success), assigning shards, env-var defaults, truncated-shard.
ERROR → rate-limited ERROR (1/min): the recurring lambda invocation failure (kept visible, not flooding).

publish-new-splits now carries num_splits, num_docs, and total on-disk split_size.

The rate-limited tracing macros emitted a separate "suppressed N similar log messages" line before the next allowed line. Attach the count to that line instead, as a `suppressed_in_last_min` field. This makes the suppression rate visible inline on the message it belongs to and removes a distinct log pattern, slightly lowering volume.

…EM-759)

A fleet-wide extract showed ~1.07B log lines, dominated by a single per-gossip-tick INFO (~907M, ~85%). Reclassify high-frequency operational logs to cut default-level volume by ~20x while preserving actionable signal:

- INFO -> DEBUG for routine per-operation chatter with no liveness value: node pool-add (x4), send-to-index-serializer, spawning pipeline, merge schedule/download, leaf split-finished/offsets, offload-to-lambda, resetting pipeline, adding shards assignment.
- INFO -> rate-limited INFO (1/min) for logs worth a heartbeat: new-split, actor-exit (success), assigning shards, env-var defaults, truncated-shard.
- ERROR -> rate-limited ERROR (1/min) for the recurring lambda invocation failure, keeping it visible without flooding.

Error/failure branches are untouched (e.g. actor-exit failure stays at ERROR). Stage/publish, merge-completion, and cluster lifecycle stay at INFO.

The publish-new-splits log carried no fields. Add num_splits, num_docs, and total on-disk split size to give operators visibility into publish throughput and split sizing without adding a new log line. num_splits is >1 only for partitioned sources (a single commit produces one split per partition); merges publish a single output split, with the merged inputs recorded in replaced_split_ids.
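As a rough illustration of how the three fields could be derived from the splits being published, here is a sketch. `SplitInfo`, `PublishSummary`, and `summarize` are hypothetical stand-ins, not Quickwit's actual split metadata types.

```rust
/// Minimal stand-in for per-split metadata (hypothetical fields).
struct SplitInfo {
    num_docs: u64,
    size_bytes: u64,
}

/// Aggregated values attached to the publish log line.
#[derive(Debug, PartialEq)]
struct PublishSummary {
    num_splits: usize,
    num_docs: u64,
    split_size: u64,
}

/// Sums document counts and on-disk sizes over all splits in one publish.
fn summarize(splits: &[SplitInfo]) -> PublishSummary {
    PublishSummary {
        num_splits: splits.len(),
        num_docs: splits.iter().map(|split| split.num_docs).sum(),
        split_size: splits.iter().map(|split| split.size_bytes).sum(),
    }
}
```

Under this sketch, a partitioned commit that produces two splits yields `num_splits = 2` with both totals summed, while a merge yields `num_splits = 1`.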

The offload-to-lambda log is a low-volume (~80K), operationally meaningful event: the searcher spilling search work to Lambda. It carries capacity/cost signal and pairs with the lambda invocation error we deliberately keep visible. Demoting it to DEBUG bought no real volume reduction, so revert it to INFO.

Keep visibility into ingester, searcher, and generic-service pool membership at INFO but cap each to 1/min. These are far lower volume than the indexer pool-add (~907M), which stays at DEBUG.

…uppressed

Rate-limit the indexer pool-add log (~907M, the dominant pattern) at INFO 1/min instead of demoting to DEBUG. At 1/min this collapses to ~1.4K/day -- the same volume win as DEBUG -- while keeping pool membership visible at INFO, consistent with the other three pool-add logs.

Rename the rate-limit suppressed-count field from `suppressed_in_last_min` to `num_suppressed`: the count is messages suppressed since the call site last emitted, and since the window only resets on the next call it can span more than a minute, so the old name was misleading.
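The window semantics described above can be sketched as follows. This is an illustrative per-call-site limiter under stated assumptions, not the code behind Quickwit's `rate_limited_tracing` macros: the `RateLimiter` type and `try_emit` method are hypothetical, and the current time is passed in explicitly so the behavior is easy to follow.

```rust
use std::time::{Duration, Instant};

/// Per-call-site rate limiter state (hypothetical; not Quickwit's actual type).
struct RateLimiter {
    limit_per_window: u32,
    window: Duration,
    window_start: Instant,
    emitted_in_window: u32,
    num_suppressed: u64,
}

impl RateLimiter {
    fn new(limit_per_window: u32, window: Duration, now: Instant) -> Self {
        RateLimiter {
            limit_per_window,
            window,
            window_start: now,
            emitted_in_window: 0,
            num_suppressed: 0,
        }
    }

    /// Returns `Some(n)` if the caller may log now, where `n` is the number of
    /// calls dropped since the last emitted line. Returns `None` if this call
    /// must be suppressed.
    fn try_emit(&mut self, now: Instant) -> Option<u64> {
        // The window rolls over lazily: only when a call arrives after it expired.
        if now.duration_since(self.window_start) >= self.window {
            self.window_start = now;
            self.emitted_in_window = 0;
        }
        if self.emitted_in_window < self.limit_per_window {
            self.emitted_in_window += 1;
            // Hand the count to the emitted line and start counting afresh.
            Some(std::mem::take(&mut self.num_suppressed))
        } else {
            self.num_suppressed += 1;
            None
        }
    }
}
```

With a limit of 1 per 60 s, calls at 0 s, 10 s, 20 s, and 300 s behave as follows: the first emits with a count of 0, the next two are suppressed, and the fourth emits with a count of 2 even though five minutes have passed. That is why a name tied to "the last minute" does not fit. At this limit a single call site in one process emits at most 24 × 60 = 1440 lines per day, which matches the ~1.4K/day figure.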

fulmicoton · 2026-07-01T11:22:37Z

        flag_value
    } else {
-        info!(default_value=%default_value, "using environment variable `{key}` default value");
+        crate::rate_limited_info!(limit_per_min = 1, default_value=%default_value, "using environment variable `{key}` default value");


We had ~2M of these log lines in the last week on the hooray cluster. I'll try to find the call sites and cache them instead of rate-limiting.

Added caching to the callsites (with some macro to avoid misuse in the future)

fulmicoton · 2026-07-01T11:22:51Z

        | ActorExitStatus::DownstreamClosed
        | ActorExitStatus::Killed => {
-            info!(actor_id, phase = ?exit_phase, exit_status = ?after_process_exit_status, "actor-exit");
+            quickwit_common::rate_limited_info!(limit_per_min = 1, actor_id, phase = ?exit_phase, exit_status = ?after_process_exit_status, "actor-exit");


I don't think we want to rate limit this one

I can remove it, but we had ~4M log lines of that in the last week

Removed the rate limiting

fulmicoton · 2026-07-01T11:25:09Z

I am all for reducing logging but this is a terrible PR!
Why do you want to rate limit the thing that displays the parsed cli options?

The "using environment variable ... default value" INFO lines flooded because callers like PostgresqlMetastore::new read the same variables on every (re)construction, and the metastore is rebuilt frequently. Rate-limiting the shared log statement (previous approach) funneled every key through one call-site counter, so within a burst only the first key was ever logged and the rest were deterministically dropped.

Instead, add per-call-site caching macros in quickwit-common:

- get_from_env_cached!(ty, key, default, sensitive)
- get_bool_from_env_cached!(key, default)

Each expands to a block-local `static LazyLock<T>`, so the read + log fire exactly once per call site per process, with no cross-key contention (distinct call sites get distinct statics).

Revert lib.rs env logging back to plain info!. Convert the non-lazy hot readers (metastore QW_POSTGRES_*, node_config OTLP/JAEGER, CORS debug) and fold the existing hand-rolled LazyLock env caches (ingest v2 enable/disable, batch bytes, per-index metrics, doc validation, load estimation, field list limit, default load per shard) onto the same macro for consistency.

Also drop the rate-limit on the actor-exit success log (back to info!).
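The block-local `static LazyLock<T>` pattern can be sketched as below. This is a simplified stand-in, not the macros added in quickwit-common: `env_cached_sketch!`, `read_env_or_default`, the `SKETCH_*` variable names, and the `DEFAULT_LOG_LINES` counter are all hypothetical, the `sensitive` flag is omitted, and `eprintln!` stands in for `info!`. It assumes Rust 1.80 or later for `std::sync::LazyLock` and a `Copy` value type.

```rust
use std::str::FromStr;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::LazyLock;

/// Counts how many "default value" lines were logged (for illustration only).
static DEFAULT_LOG_LINES: AtomicUsize = AtomicUsize::new(0);

/// Reads and parses `key`, falling back to `default` and logging when it does.
fn read_env_or_default<T: FromStr>(key: &str, default: T) -> T {
    match std::env::var(key).ok().and_then(|raw| raw.parse::<T>().ok()) {
        Some(value) => value,
        None => {
            // Stand-in for the `info!` line from the diff above.
            eprintln!("using environment variable `{key}` default value");
            DEFAULT_LOG_LINES.fetch_add(1, Ordering::Relaxed);
            default
        }
    }
}

/// Each expansion declares its own block-local static, so every call site
/// reads (and logs) at most once per process, independently of other keys.
macro_rules! env_cached_sketch {
    ($ty:ty, $key:expr, $default:expr) => {{
        static CACHED: LazyLock<$ty> =
            LazyLock::new(|| read_env_or_default::<$ty>($key, $default));
        *CACHED
    }};
}

/// Two distinct call sites, hence two distinct statics.
fn max_connections() -> usize {
    env_cached_sketch!(usize, "SKETCH_MAX_CONNECTIONS", 10)
}

fn request_timeout_secs() -> u64 {
    env_cached_sketch!(u64, "SKETCH_REQUEST_TIMEOUT_SECS", 30)
}
```

Calling `max_connections()` and `request_timeout_secs()` repeatedly with both variables unset logs one default-value line per call site, two in total, no matter how often the surrounding object is rebuilt. The trade-off is that a change to the environment after the first read is not picked up.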

PSeitz-dd · 2026-07-01T13:18:39Z

I am all for reducing logging but this is a terrible PR! Why do you want to rate limit the thing that displays the parsed cli options?

It's not parsed CLI options, it's reading environment variables, which we logged ~2M times in the last week on the hooray cluster.
I changed it to cache the env vars instead of rate limiting (since this covered only default values)

PSeitz added 3 commits June 30, 2026 18:55

PSeitz-dd requested a review from a team as a code owner June 30, 2026 17:27

PSeitz added 3 commits June 30, 2026 20:24

Rate-limit ingester/searcher/generic pool-add logs at INFO

546bfb7

PSeitz changed the title ~~Reduce default log volume by reclassifying verbose INFO logs (CLOUDPREM-759)~~ Reduce log volume Jun 30, 2026

fulmicoton reviewed Jul 1, 2026

PSeitz-dd requested a review from fulmicoton July 1, 2026 13:27

PSeitz-dd wants to merge 7 commits into quickwit-oss:main from PSeitz:reduce_logs
