
mvcc: skip Value bytes from BoltDB for KeysOnly range queries (-77% latency, -97% heap at 10 KB/key)#21758

Open
notandruu wants to merge 1 commit into etcd-io:main from notandruu:mvcc-range-keys-only-skip-value


@notandruu

Fixes #20386

Root Cause

When RangeRequest.KeysOnly=true, etcd still loaded the full serialized
KeyValue proto from BoltDB — including potentially multi-kilobyte Value
bytes — via proto.Unmarshal, then immediately discarded the value in
asembleRangeResponse:

// server/etcdserver/txn/range.go (before)
if r.KeysOnly {
    rr.KVs[i].Value = nil   // ← bytes were already allocated and copied from BoltDB
}

Kubernetes 1.34 uses Range WithKeysOnly for watch cache warming and list
operations. For a cluster with 10 000 secrets (10 KB each), a single
LIST with KeysOnly=true caused ~100 MB of transient heap allocations
that were discarded immediately.

Fix

Three-layer change:

  1. server/storage/mvcc/kv.go — add KeysOnly bool to RangeOptions.

  2. server/etcdserver/txn/range.go — set ro.KeysOnly = r.KeysOnly && r.SortTarget != VALUE
    in executeRange. When the request sorts by VALUE, values are still
    required for the sort step; all other KeysOnly requests can skip them.

  3. server/storage/mvcc/kvstore_txn.go — add unmarshalKVSkipValue, a hand-rolled
    protowire decoder that reads all KeyValue fields (Key, CreateRevision,
    ModRevision, Version, Lease) and skips proto field 5 (Value) entirely,
    never touching the backing buffer for those bytes. When ro.KeysOnly,
    rangeKeys calls this function instead of proto.Unmarshal.
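
The decoder in step 3 can be illustrated with a minimal, standard-library-only sketch. This is not the PR's code: the PR uses protowire and mvccpb.KeyValue, whereas the version below substitutes the varint helpers from encoding/binary and a hypothetical keyValue struct. The names decodeSkipValue and encodeRecord are illustrative, and the field numbers other than Value (5) are assumed to follow etcd's kv.proto.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// keyValue is a hypothetical stand-in for mvccpb.KeyValue.
type keyValue struct {
	Key            []byte
	CreateRevision int64
	ModRevision    int64
	Version        int64
	Value          []byte
	Lease          int64
}

var errMalformed = errors.New("malformed KeyValue record")

// decodeSkipValue fills every field of kv except Value. The payload of
// field 5 is stepped over by advancing the slice, so it is never copied.
func decodeSkipValue(b []byte, kv *keyValue) error {
	for len(b) > 0 {
		tag, n := binary.Uvarint(b)
		if n <= 0 {
			return errMalformed
		}
		b = b[n:]
		field, wireType := tag>>3, tag&7

		switch wireType {
		case 0: // varint: the int64 metadata fields
			v, n := binary.Uvarint(b)
			if n <= 0 {
				return errMalformed
			}
			b = b[n:]
			switch field {
			case 2:
				kv.CreateRevision = int64(v)
			case 3:
				kv.ModRevision = int64(v)
			case 4:
				kv.Version = int64(v)
			case 6:
				kv.Lease = int64(v)
			}
		case 2: // length-delimited: Key (1), Value (5), or an unknown field
			l, n := binary.Uvarint(b)
			if n <= 0 || l > uint64(len(b)-n) {
				return errMalformed
			}
			end := n + int(l)
			if field == 1 {
				kv.Key = append([]byte(nil), b[n:end]...)
			}
			b = b[end:] // Value and unknown fields are skipped here
		case 1, 5: // fixed64 / fixed32: not used by KeyValue, skipped
			size := 8
			if wireType == 5 {
				size = 4
			}
			if len(b) < size {
				return errMalformed
			}
			b = b[size:]
		default:
			return fmt.Errorf("unsupported wire type %d", wireType)
		}
	}
	return nil
}

// encodeRecord builds a wire-format record so the sketch can run on its own.
func encodeRecord(kv keyValue) []byte {
	var b []byte
	appendBytes := func(field uint64, p []byte) {
		b = binary.AppendUvarint(b, field<<3|2)
		b = binary.AppendUvarint(b, uint64(len(p)))
		b = append(b, p...)
	}
	appendInt := func(field uint64, v int64) {
		b = binary.AppendUvarint(b, field<<3)
		b = binary.AppendUvarint(b, uint64(v))
	}
	appendBytes(1, kv.Key)
	appendInt(2, kv.CreateRevision)
	appendInt(3, kv.ModRevision)
	appendInt(4, kv.Version)
	appendBytes(5, kv.Value)
	appendInt(6, kv.Lease)
	return b
}

func main() {
	rec := encodeRecord(keyValue{
		Key: []byte("/registry/secrets/a"), CreateRevision: 7,
		ModRevision: 9, Version: 2, Value: make([]byte, 10*1024),
	})
	var kv keyValue
	if err := decodeSkipValue(rec, &kv); err != nil {
		panic(err)
	}
	fmt.Printf("key=%s create=%d mod=%d version=%d valueIsNil=%t\n",
		kv.Key, kv.CreateRevision, kv.ModRevision, kv.Version, kv.Value == nil)
}
```

The key point is the `b = b[end:]` step: for field 5 the decoder only advances past the length-prefixed payload, so the cost of a KeysOnly decode no longer depends on the size of the stored value.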

The existing if r.KeysOnly { rr.KVs[i].Value = nil } in
asembleRangeResponse is kept as a safety net for the SortTarget == VALUE
case (values loaded → sorted → stripped).
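
The interaction between the skip rule and the safety net can be sketched as follows. This is a simplified model, not etcd's code: rangeRequest, entry, skipValues, and rangeSketch are hypothetical names, and the in-memory slice stands in for the backend. It only illustrates the ordering described above: values are loaded when the sort needs them, then stripped before the response is returned.

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

type sortTarget int

const (
	sortByKey sortTarget = iota
	sortByValue
)

// rangeRequest and entry are hypothetical stand-ins for the request and KV types.
type rangeRequest struct {
	KeysOnly   bool
	SortTarget sortTarget
}

type entry struct {
	Key, Value []byte
}

// skipValues mirrors the rule from the PR description:
// values may be skipped only when the sort does not need them.
func skipValues(r rangeRequest) bool {
	return r.KeysOnly && r.SortTarget != sortByValue
}

// rangeSketch loads entries, sorts them if requested, and strips values last.
func rangeSketch(store []entry, r rangeRequest) []entry {
	skip := skipValues(r)
	out := make([]entry, 0, len(store))
	for _, e := range store {
		item := entry{Key: e.Key}
		if !skip {
			// Only this path pays for copying the value.
			item.Value = append([]byte(nil), e.Value...)
		}
		out = append(out, item)
	}
	if r.SortTarget == sortByValue {
		sort.SliceStable(out, func(i, j int) bool {
			return bytes.Compare(out[i].Value, out[j].Value) < 0
		})
	}
	if r.KeysOnly {
		// Safety net: values loaded for sorting are removed from the response.
		for i := range out {
			out[i].Value = nil
		}
	}
	return out
}

func main() {
	store := []entry{
		{Key: []byte("a"), Value: []byte("z")},
		{Key: []byte("b"), Value: []byte("y")},
	}
	for _, e := range rangeSketch(store, rangeRequest{KeysOnly: true, SortTarget: sortByValue}) {
		fmt.Printf("key=%s valueIsNil=%t\n", e.Key, e.Value == nil)
	}
}
```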

Benchmark Results

Measured on Apple M3 Max, go1.26.3, darwin/arm64. 100 keys per range, each
key stored with a value of the given size.

goos: darwin
goarch: arm64
cpu: Apple M3 Max

BenchmarkRangeKeysOnly1KB_WithValue-16    111879    58458 ns/op    136057 B/op    719 allocs/op
BenchmarkRangeKeysOnly1KB_KeysOnly-16     168064    34670 ns/op     33656 B/op    619 allocs/op
                                                     -41% latency    -75% heap

BenchmarkRangeKeysOnly10KB_WithValue-16    40623   142281 ns/op   1057660 B/op    719 allocs/op
BenchmarkRangeKeysOnly10KB_KeysOnly-16    180662    32728 ns/op     33656 B/op    619 allocs/op
                                                     -77% latency    -97% heap

Heap savings scale linearly with value size because the Value allocation is
eliminated entirely: the KeysOnly path stays at 33656 B/op for both 1 KB and
10 KB values. The drop of 100 allocations (719 → 619 per 100-key range) is
one allocation per key, consistent with the per-KV Value byte slice no longer
being allocated.
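
As a quick cross-check, the percentages quoted above can be recomputed from the raw benchmark lines. The snippet below is only arithmetic over the numbers already listed; the result type and reduction helper are illustrative names.

```go
package main

import "fmt"

// result holds one benchmark line: ns/op, B/op, allocs/op.
type result struct {
	nsPerOp, bytesPerOp, allocsPerOp float64
}

// reduction returns how much smaller after is than before, in percent.
func reduction(before, after float64) float64 {
	return (before - after) / before * 100
}

func main() {
	cases := []struct {
		name              string
		withValue, keysOn result
	}{
		{"1KB", result{58458, 136057, 719}, result{34670, 33656, 619}},
		{"10KB", result{142281, 1057660, 719}, result{32728, 33656, 619}},
	}
	for _, c := range cases {
		fmt.Printf("%s: -%.0f%% latency, -%.0f%% heap, %.0f fewer allocs/op\n",
			c.name,
			reduction(c.withValue.nsPerOp, c.keysOn.nsPerOp),
			reduction(c.withValue.bytesPerOp, c.keysOn.bytesPerOp),
			c.withValue.allocsPerOp-c.keysOn.allocsPerOp)
	}
}
```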

Tests Added

  • TestUnmarshalKVSkipValue — unit test for the protowire decoder: verifies
    all non-Value fields are correctly decoded and Value is always nil.
  • TestUnmarshalKVSkipValueEmptyValue — edge case: tombstone records with no
    value field decode cleanly.
  • TestRangeKeysOnlyDoesNotLoadValues — end-to-end: stores keys with values,
    ranges with KeysOnly=true and asserts Value == nil while all metadata
    fields (Key, CreateRevision, ModRevision, Version) are correctly populated;
    also verifies the normal (non-KeysOnly) path still returns values.
  • BenchmarkRangeKeysOnly* — four benchmarks (1 KB / 10 KB × with/without
    KeysOnly) to anchor future regressions.

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: notandruu
Once this PR has been reviewed and has the lgtm label, please assign ahrtr for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

Hi @notandruu. Thanks for your PR.

I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@serathius
Member

cc @nwnt

When a Range request has KeysOnly=true and the sort target is not VALUE,
etcd previously loaded the full serialized KeyValue proto from BoltDB —
including potentially large Value bytes — only to throw them away in
asembleRangeResponse. Kubernetes uses KeysOnly list requests for watch
cache warming and reflective discovery; each such list can touch thousands
of objects whose values may be kilobytes to megabytes in size.

This change avoids that wasted allocation by:

1. Adding a KeysOnly bool to RangeOptions in server/storage/mvcc/kv.go.

2. Setting ro.KeysOnly = r.KeysOnly && r.SortTarget != VALUE in
   executeRange (server/etcdserver/txn/range.go). Sorting by VALUE
   still requires loading values; all other cases can skip them.

3. Implementing unmarshalKVSkipValue in server/storage/mvcc/kvstore_txn.go,
   a hand-rolled protowire decoder that reads all KeyValue fields except
   Value (proto field 5), which it skips without allocating.

Benchmark (100 keys, Apple M3 Max, go1.26.3 darwin/arm64):

  1 KB values per key (100 keys):
    Before: 58458 ns/op  136057 B/op  719 allocs/op
    After:  34670 ns/op   33656 B/op  619 allocs/op
    Savings: -41% latency, -75% heap

  10 KB values per key (100 keys):
    Before: 142281 ns/op  1057660 B/op  719 allocs/op
    After:   32728 ns/op    33656 B/op  619 allocs/op
    Savings: -77% latency, -97% heap

The savings scale with value size because Value allocation is eliminated
entirely. At scale (Kubernetes list of 10 000 secrets each 10 KB), this
removes ~100 MB of transient heap per KeysOnly request.

Fixes etcd-io#20386

Signed-off-by: Andrew Liu <andrewjliu22@gmail.com>
@notandruu force-pushed the mvcc-range-keys-only-skip-value branch from 0e574ad to 87cdc39 (May 16, 2026 10:23)
@notandruu marked this pull request as ready for review (May 16, 2026 10:23)
@notandruu
Author

Could a maintainer run /ok-to-test to unblock CI? Happy to address any feedback.

@serathius
Member

This is something that @nwnt was already working on, and we discussed a design different from the one presented here.

// Fields are decoded by hand using protowire so that the Value bytes (proto
// field 5) are never copied from the backend buffer into Go heap memory. All
// other fields are decoded as normal. Unknown future fields are skipped safely.
func unmarshalKVSkipValue(b []byte, kv *mvccpb.KeyValue) error {
Member


Don't want to maintain custom protobuf decoder.


Development

Successfully merging this pull request may close these issues.

Optimize Range WithKeysOnly to avoid loading values in memory

3 participants