mvcc: skip Value bytes from BoltDB for KeysOnly range queries (-77% latency, -97% heap at 10 KB/key)#21758
cc @nwnt |
When a Range request has KeysOnly=true and the sort target is not VALUE,
etcd previously loaded the full serialized KeyValue proto from BoltDB —
including potentially large Value bytes — only to throw them away in
asembleRangeResponse. Kubernetes uses KeysOnly list requests for watch
cache warming and reflective discovery; each such list can touch thousands
of objects whose values may be kilobytes to megabytes in size.
This change avoids that wasted allocation by:
1. Adding a KeysOnly bool to RangeOptions in server/storage/mvcc/kv.go.
2. Setting ro.KeysOnly = r.KeysOnly && r.SortTarget != VALUE in
executeRange (server/etcdserver/txn/range.go). Sorting by VALUE
still requires loading values; all other cases can skip them.
3. Implementing unmarshalKVSkipValue in server/storage/mvcc/kvstore_txn.go,
a hand-rolled protowire decoder that reads all KeyValue fields except
Value (proto field 5), which it skips without allocating.
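A minimal sketch of the decision in step 2, assuming hypothetical stand-in types (`rangeRequest`, `rangeOptions`, `sortTarget`) rather than the real etcd `pb.RangeRequest` and `mvcc.RangeOptions`. It only illustrates the boolean rule: values may be skipped when the caller asked for keys only and the sort does not need them.

```go
package main

import "fmt"

// sortTarget mirrors the ordering of the RangeRequest sort targets.
// Hypothetical local type for illustration only.
type sortTarget int

const (
	sortKey sortTarget = iota
	sortVersion
	sortCreate
	sortMod
	sortValue
)

// rangeRequest and rangeOptions are hypothetical stand-ins for the
// request and the storage-level options described in the steps above.
type rangeRequest struct {
	KeysOnly   bool
	SortTarget sortTarget
}

type rangeOptions struct {
	KeysOnly bool
}

// optionsFor pushes KeysOnly down to storage unless the request sorts by
// value, in which case the values must still be loaded for the sort step.
func optionsFor(r rangeRequest) rangeOptions {
	return rangeOptions{KeysOnly: r.KeysOnly && r.SortTarget != sortValue}
}

func main() {
	reqs := []rangeRequest{
		{KeysOnly: true, SortTarget: sortKey},
		{KeysOnly: true, SortTarget: sortValue},
		{KeysOnly: false, SortTarget: sortMod},
	}
	for _, r := range reqs {
		fmt.Printf("%+v -> skip values: %v\n", r, optionsFor(r).KeysOnly)
	}
}
```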
Benchmark (100 keys, Apple M3 Max, go1.26.3 darwin/arm64):
1 KB values per key (100 keys):
Before: 58458 ns/op 136057 B/op 719 allocs/op
After: 34670 ns/op 33656 B/op 619 allocs/op
Savings: -41% latency, -75% heap
10 KB values per key (100 keys):
Before: 142281 ns/op 1057660 B/op 719 allocs/op
After: 32728 ns/op 33656 B/op 619 allocs/op
Savings: -77% latency, -97% heap
The savings scale with value size because Value allocation is eliminated
entirely. At scale (Kubernetes list of 10 000 secrets each 10 KB), this
removes ~100 MB of transient heap per KeysOnly request.
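The percentages and the ~100 MB estimate above can be re-derived from the raw figures. The following sketch does only that arithmetic; the helper names `savingsPercent` and `transientMB` are illustrative, and 10 KB is assumed to mean 10 × 1024 bytes.

```go
package main

import (
	"fmt"
	"math"
)

// savingsPercent returns the relative reduction from before to after,
// rounded to the nearest whole percent.
func savingsPercent(before, after float64) float64 {
	return math.Round((before - after) / before * 100)
}

// transientMB estimates the heap (in MB, 10^6 bytes) spent on values that
// a KeysOnly range would load and then discard.
func transientMB(keys, valueBytes int) float64 {
	return float64(keys) * float64(valueBytes) / 1e6
}

func main() {
	fmt.Printf("1 KB:  -%.0f%% latency, -%.0f%% heap\n",
		savingsPercent(58458, 34670), savingsPercent(136057, 33656))
	fmt.Printf("10 KB: -%.0f%% latency, -%.0f%% heap\n",
		savingsPercent(142281, 32728), savingsPercent(1057660, 33656))
	fmt.Printf("10 000 keys x 10 KB: about %.0f MB of discarded values\n",
		transientMB(10000, 10*1024))
}
```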
Fixes etcd-io#20386
Signed-off-by: Andrew Liu <andrewjliu22@gmail.com>
Force-pushed from 0e574ad to 87cdc39.
|
Could a maintainer run |
|
This is something that @nwnt was already working on, and we discussed a design that was different from the one presented here.
// Fields are decoded by hand using protowire so that the Value bytes (proto
// field 5) are never copied from the backend buffer into Go heap memory. All
// other fields are decoded as normal. Unknown future fields are skipped safely.
func unmarshalKVSkipValue(b []byte, kv *mvccpb.KeyValue) error {
Don't want to maintain custom protobuf decoder.
Fixes #20386
Root Cause
When RangeRequest.KeysOnly=true, etcd still loaded the full serialized KeyValue proto from BoltDB, including potentially multi-kilobyte Value bytes, via proto.Unmarshal, then immediately discarded the value in asembleRangeResponse.
Kubernetes 1.34 uses Range WithKeysOnly for watch cache warming and list operations. For a cluster with 10 000 secrets (10 KB each), a single LIST with KeysOnly=true caused ~100 MB of transient heap allocations that were discarded immediately.
Fix
Three-layer change:
1. server/storage/mvcc/kv.go: add KeysOnly bool to RangeOptions.
2. server/etcdserver/txn/range.go: set ro.KeysOnly = r.KeysOnly && r.SortTarget != VALUE in executeRange. When the request sorts by VALUE, values are still required for the sort step; all other KeysOnly requests can skip them.
3. server/storage/mvcc/kvstore_txn.go: add unmarshalKVSkipValue, a hand-rolled protowire decoder that reads all KeyValue fields (Key, CreateRevision, ModRevision, Version, Lease) and skips proto field 5 (Value) entirely, never touching the backing buffer for those bytes. When ro.KeysOnly is set, rangeKeys calls this function instead of proto.Unmarshal.
The existing if r.KeysOnly { rr.KVs[i].Value = nil } in asembleRangeResponse is kept as a safety net for the SortTarget == VALUE case (values loaded → sorted → stripped).
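To make the decoding idea concrete, here is a rough sketch of a skip-value decoder. It is not the PR's unmarshalKVSkipValue: it uses only the standard library's `encoding/binary` varint helpers instead of `protowire`, and `keyMeta`, `decodeSkipValue`, and the two `append*Field` helpers are hypothetical names introduced for illustration. The field numbers assume Key=1, CreateRevision=2, ModRevision=3, Version=4, Value=5, Lease=6.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// keyMeta is a hypothetical stand-in for the non-Value fields of KeyValue.
type keyMeta struct {
	Key                                         []byte
	CreateRevision, ModRevision, Version, Lease int64
}

var errTruncated = errors.New("truncated record")

// decodeSkipValue walks the protobuf wire format, copies only the key and
// the integer metadata, and steps over field 5 (Value) without copying it.
func decodeSkipValue(b []byte) (keyMeta, error) {
	var kv keyMeta
	for len(b) > 0 {
		tag, n := binary.Uvarint(b)
		if n <= 0 {
			return kv, errTruncated
		}
		b = b[n:]
		field, wireType := tag>>3, tag&7
		switch wireType {
		case 0: // varint
			v, n := binary.Uvarint(b)
			if n <= 0 {
				return kv, errTruncated
			}
			b = b[n:]
			switch field {
			case 2:
				kv.CreateRevision = int64(v)
			case 3:
				kv.ModRevision = int64(v)
			case 4:
				kv.Version = int64(v)
			case 6:
				kv.Lease = int64(v)
			}
		case 2: // length-delimited
			l, n := binary.Uvarint(b)
			if n <= 0 || uint64(len(b)-n) < l {
				return kv, errTruncated
			}
			if field == 1 {
				kv.Key = append([]byte(nil), b[n:n+int(l)]...)
			}
			// Value (field 5) and unknown bytes fields are skipped here.
			b = b[n+int(l):]
		case 1: // fixed64, only reachable for unknown fields
			if len(b) < 8 {
				return kv, errTruncated
			}
			b = b[8:]
		case 5: // fixed32, only reachable for unknown fields
			if len(b) < 4 {
				return kv, errTruncated
			}
			b = b[4:]
		default:
			return kv, fmt.Errorf("unsupported wire type %d", wireType)
		}
	}
	return kv, nil
}

// appendBytesField and appendVarintField build test records by hand.
func appendBytesField(b []byte, field uint64, p []byte) []byte {
	b = binary.AppendUvarint(b, field<<3|2)
	b = binary.AppendUvarint(b, uint64(len(p)))
	return append(b, p...)
}

func appendVarintField(b []byte, field, v uint64) []byte {
	return binary.AppendUvarint(binary.AppendUvarint(b, field<<3), v)
}

func main() {
	rec := appendBytesField(nil, 1, []byte("foo"))
	rec = appendVarintField(rec, 3, 9)
	rec = appendBytesField(rec, 5, make([]byte, 10<<10)) // 10 KB value
	rec = appendVarintField(rec, 6, 42)
	kv, err := decodeSkipValue(rec)
	fmt.Println(string(kv.Key), kv.ModRevision, kv.Lease, err)
}
```

The only heap copy in this sketch is the key; the value's length prefix is read and the slice is advanced past it. The trade-off raised in review still applies: any hand-written decoder has to be kept in step with the message definition.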
Benchmark Results
Measured on Apple M3 Max, go1.26.3, darwin/arm64. 100 keys per range, each
key stored with a value of the given size.
Heap savings scale linearly with value size because the Value allocation is
eliminated entirely. The difference of 100 allocations (719 → 619 allocs/op
per 100-key range) corresponds to one avoided Value allocation per key.
Tests Added
- TestUnmarshalKVSkipValue: unit test for the protowire decoder; verifies all non-Value fields are correctly decoded and Value is always nil.
- TestUnmarshalKVSkipValueEmptyValue: edge case; tombstone records with no value field decode cleanly.
- TestRangeKeysOnlyDoesNotLoadValues: end-to-end; stores keys with values, ranges with KeysOnly=true and asserts Value == nil while all metadata fields (Key, CreateRevision, ModRevision, Version) are correctly populated; also verifies the normal (non-KeysOnly) path still returns values.
- BenchmarkRangeKeysOnly*: four benchmarks (1 KB / 10 KB × with/without KeysOnly) to anchor future regressions.