[RFC] clientv3: add context-keyed channel pool by fuweid · Pull Request #21765 · etcd-io/etcd

fuweid · 2026-05-19T17:36:34Z

Motivation:

kube-apiserver currently creates one etcd client, and therefore one gRPC channel, per resource. When the watch cache is not ready, kube-apiserver delegates List requests to etcd. Delegated List requests and reflector requests then share the same gRPC channel.

Large delegated responses can increase memory pressure on the etcd server. Even when using the streaming Range API, delegated requests can slow down reflector requests that are populating the watch cache.

By default, etcd's gRPC server allows a very large number of concurrent streams. If a reflector request is slow, or if compaction triggers List-All-Without-Pagination, many streams can accumulate on one connection. Setting MAX_CONCURRENT_STREAMS is hard to tune because hitting the limit can also block reflector requests (grpc/grpc#21386).
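The tuning problem can be illustrated with a toy model. The sketch below is not gRPC code: `streamLimiter` is a hypothetical counting semaphore standing in for a per-connection concurrent-stream cap, and it rejects instead of queueing so the effect is visible without goroutines. It only shows that a cap shared by both traffic classes lets delegated Lists starve a reflector, while a cap per connection does not.

```go
package main

import "fmt"

// streamLimiter is a toy stand-in for a per-connection MAX_CONCURRENT_STREAMS
// cap. Real gRPC queues a new stream when the cap is reached; this model
// reports failure instead so the contention is easy to observe.
type streamLimiter struct {
	slots chan struct{}
}

func newStreamLimiter(max int) *streamLimiter {
	return &streamLimiter{slots: make(chan struct{}, max)}
}

// tryOpen reports whether a new stream fits under the cap right now.
func (l *streamLimiter) tryOpen() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// release frees the slot held by one finished stream.
func (l *streamLimiter) release() { <-l.slots }

func main() {
	// One shared connection: delegated Lists occupy every slot.
	shared := newStreamLimiter(4)
	for i := 0; i < 4; i++ {
		shared.tryOpen()
	}
	fmt.Println("reflector admitted on shared connection:", shared.tryOpen())

	// Two connections: delegated Lists fill only their own cap.
	delegated := newStreamLimiter(4)
	reflector := newStreamLimiter(4)
	for i := 0; i < 4; i++ {
		delegated.tryOpen()
	}
	fmt.Println("reflector admitted on dedicated connection:", reflector.tryOpen())
}
```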

Based on https://grpc.io/docs/guides/performance/, we probably should allow kube-apiserver to isolate traffic by different channels.

> Quote:
>
> 1. Create a separate channel for each area of high load in the application.
> 2. Use a pool of gRPC channels to distribute RPCs over multiple connections (channels must have different channel args to prevent re-use so define a use-specific channel arg such as channel number).

Add an experimental context-keyed channel pool so kube-apiserver and other applications can select an initialized gRPC channel through request context for high-volume data transfers. Combined with MAX_CONCURRENT_STREAMS, this allows delegated requests to be isolated from reflector traffic.
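To make the idea concrete, here is a minimal standard-library sketch of selecting a connection through the request context. It is not the API added by this PR: `channelKey`, `withChannel`, `conn`, `pool`, and `pick` are hypothetical names, `conn` merely stands in for an initialized gRPC channel, and falling back to the default connection for an unknown name is an assumption of the sketch.

```go
package main

import (
	"context"
	"fmt"
)

// channelKey is a hypothetical context key type; the PR's real API may differ.
type channelKey struct{}

// withChannel returns a context that asks for the named channel.
func withChannel(ctx context.Context, name string) context.Context {
	return context.WithValue(ctx, channelKey{}, name)
}

// conn stands in for an initialized gRPC channel.
type conn struct{ name string }

// pool maps channel names to pre-initialized connections.
type pool struct {
	def   *conn
	named map[string]*conn
}

// pick returns the connection selected by ctx. It falls back to the default
// connection when ctx carries no key or names an unknown channel (assumption).
func (p *pool) pick(ctx context.Context) *conn {
	if name, ok := ctx.Value(channelKey{}).(string); ok {
		if c, found := p.named[name]; found {
			return c
		}
	}
	return p.def
}

// demo routes three requests and returns the name of the connection each used.
func demo() []string {
	p := &pool{
		def:   &conn{name: "default"},
		named: map[string]*conn{"delegated-list": &conn{name: "delegated-list"}},
	}
	ctxs := []context.Context{
		context.Background(),                                // reflector traffic
		withChannel(context.Background(), "delegated-list"), // large delegated List
		withChannel(context.Background(), "unknown"),        // unknown key
	}
	var used []string
	for _, ctx := range ctxs {
		used = append(used, p.pick(ctx).name)
	}
	return used
}

func main() {
	fmt.Println(demo())
}
```

The caller keeps one client object and only decorates the context, which is what makes the routing decision local to the request.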

Test:

* Environment
  - etcd member: Standard_D32s_v5 node
  - etcd client: Standard_D32s_v5 node
  - Cross-node access
* Client workload

The client uses a single etcd connection, similar to how kube-apiserver keeps one client per resource.

For each test run:

1. Sequentially list about 1 GiB of data (480,000 objects) 10 times and measure the average runtime.
2. At the same time, start 100 goroutines that continuously read about 5 MiB of data each.

These background reads are used to consume bandwidth and slow down the 1 GiB list operation.

This simulates kube-apiserver behavior when the watch cache for a resource is not ready. In that case, user List requests cannot be served from cache and are forwarded directly to etcd while other requests may still consume etcd/network bandwidth.
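The shape of this workload can be sketched as follows. This is not the benchmark used for the numbers in this PR: `averageRuntime`, `startBackgroundReaders`, and `runWorkload` are hypothetical helpers, and the `list` and `read` callbacks in `main` are sleep-based placeholders where a real run would issue etcd Range requests.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// averageRuntime runs op sequentially `runs` times and returns the mean duration.
func averageRuntime(runs int, op func() error) (time.Duration, error) {
	if runs <= 0 {
		return 0, nil
	}
	var total time.Duration
	for i := 0; i < runs; i++ {
		start := time.Now()
		if err := op(); err != nil {
			return 0, err
		}
		total += time.Since(start)
	}
	return total / time.Duration(runs), nil
}

// startBackgroundReaders launches n goroutines that call read in a loop until
// ctx is cancelled. The returned function blocks until all of them have exited.
func startBackgroundReaders(ctx context.Context, n int, read func()) (wait func()) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for ctx.Err() == nil {
				read()
			}
		}()
	}
	return wg.Wait
}

// runWorkload measures the average list runtime while background readers
// compete for bandwidth, and reports how many background reads completed.
func runWorkload(runs, readers int, list func() error, read func()) (time.Duration, int64, error) {
	var reads atomic.Int64
	ctx, cancel := context.WithCancel(context.Background())
	wait := startBackgroundReaders(ctx, readers, func() {
		read()
		reads.Add(1)
	})
	avg, err := averageRuntime(runs, list)
	cancel()
	wait()
	return avg, reads.Load(), err
}

func main() {
	// Placeholders: a real run would list ~1 GiB and read ~5 MiB from etcd.
	list := func() error { time.Sleep(20 * time.Millisecond); return nil }
	read := func() { time.Sleep(time.Millisecond) }

	avg, reads, err := runWorkload(10, 100, list, read)
	if err != nil {
		fmt.Println("list failed:", err)
		return
	}
	fmt.Printf("average list runtime: %v (background reads: %d)\n", avg, reads)
}
```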

Result:

| Conn | Mode       | Page-size | Avg (grpc / cmux)     |
| ---- | ---------- | --------- | --------------------- |
| 1    | stream     | no        | 2m45.653s / 3m17.042s |
| 1    | stream     | 10k       | 2m49.178s / 3m56.316s |
| 1    | non-stream | no        | 1m52.867s / 3m11.739s |
| 1    | non-stream | 10k       | 2m14.412s / 3m13.986s |
| 2    | stream     | no        | 4.522s / 4.687s       |
| 2    | stream     | 10k       | 8.468s / 8.868s       |
| 2    | non-stream | no        | 6.333s / 6.536s       |
| 2    | non-stream | 10k       | 9.448s / 10.204s      |
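For readers who want the ratios rather than raw durations, a small helper like the one below can compute them from the table. It is only a convenience sketch: `speedup` is a hypothetical function, and the rows copy the grpc column of the table above (Conn 1 versus Conn 2).

```go
package main

import (
	"fmt"
	"time"
)

// speedup returns how many times faster the dedicated-connection run is than
// the shared-connection run, given two Go duration strings.
func speedup(shared, dedicated string) (float64, error) {
	s, err := time.ParseDuration(shared)
	if err != nil {
		return 0, err
	}
	d, err := time.ParseDuration(dedicated)
	if err != nil {
		return 0, err
	}
	return s.Seconds() / d.Seconds(), nil
}

func main() {
	// grpc column only: Conn=1 (shared) versus Conn=2 (dedicated).
	rows := []struct{ mode, shared, dedicated string }{
		{"stream, no paging", "2m45.653s", "4.522s"},
		{"stream, 10k pages", "2m49.178s", "8.468s"},
		{"non-stream, no paging", "1m52.867s", "6.333s"},
		{"non-stream, 10k pages", "2m14.412s", "9.448s"},
	}
	for _, r := range rows {
		x, err := speedup(r.shared, r.dedicated)
		if err != nil {
			fmt.Println("bad duration:", err)
			continue
		}
		fmt.Printf("%-22s %.1fx\n", r.mode, x)
	}
}
```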

The native grpc-go server can be faster than the Go net/http HTTP/2 server path because they schedule writes differently. The net/http HTTP/2 server path prioritizes control frames, such as PING ACK, ahead of queued stream DATA in its round-robin write scheduler [1]. The client can therefore receive the ACK earlier, with fewer DATA bytes included in the BDP sample. The native grpc-go server uses its own loopyWriter and activeStreams scheduling, so DATA may be written or batched before the ACK is sent. This can produce a larger BDP sample and faster window growth, improving throughput for large responses.
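The effect of sample size on window growth can be shown with a deliberately simplified model. This is not grpc-go's estimator: the constants are assumptions loosely modeled on it, the bandwidth check a real estimator performs is omitted, and `fill` is a hypothetical knob for how much of the current window arrives as DATA before the PING ACK.

```go
package main

import "fmt"

// Assumed constants, loosely modeled on a ping-based BDP estimator.
const (
	initialWindow = 64 << 10 // starting window estimate, in bytes
	windowLimit   = 16 << 20 // upper bound for the window, in bytes
	growThreshold = 0.66     // sample must fill this share of the window to grow it
	growFactor    = 2        // new window = growFactor * sample
)

// bdpEstimator tracks a flow-control window estimate.
type bdpEstimator struct {
	window uint32
}

// onAck consumes one sample: the DATA bytes received between sending a BDP
// ping and receiving its ACK. The window grows only when the sample is large
// relative to the current estimate.
func (e *bdpEstimator) onAck(sample uint32) {
	if float64(sample) < growThreshold*float64(e.window) || e.window >= windowLimit {
		return
	}
	next := uint64(growFactor) * uint64(sample)
	if next > windowLimit {
		next = windowLimit
	}
	e.window = uint32(next)
}

// roundsToLimit returns how many ping rounds are needed to reach windowLimit
// when every sample equals fill * current window, or -1 if the limit is not
// reached within maxRounds.
func roundsToLimit(fill float64, maxRounds int) int {
	e := &bdpEstimator{window: initialWindow}
	for r := 1; r <= maxRounds; r++ {
		e.onAck(uint32(fill * float64(e.window)))
		if e.window >= windowLimit {
			return r
		}
	}
	return -1
}

func main() {
	// ACK sent after DATA: samples cover the whole window.
	fmt.Println("full samples, rounds to limit:", roundsToLimit(1.0, 20))
	// ACK prioritized ahead of DATA: samples cover only half the window.
	fmt.Println("half samples, rounds to limit:", roundsToLimit(0.5, 20))
}
```

In this model, samples that stay below the threshold never grow the window, which matches the direction of the effect described above without claiming anything about its magnitude.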

Notes:

* grpc: grpc native http2 server.
* cmux: x/net http2 server by cmux.
* Conn: 2 means the 1 GiB List uses a dedicated channel/connection.
Based on 0235df9

k8s-ci-robot · 2026-05-19T17:36:42Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: fuweid

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [fuweid]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Reference:

[1]: https://github.com/golang/net/blob/v0.38.0/http2/writesched_roundrobin.go#L80-L83

Signed-off-by: Wei Fu <fuweid89@gmail.com>

serathius · 2026-05-19T18:14:00Z

For watch cache fallback we are less interested in throughput and more in max memory :P

Throughput is still pretty important for watch cache initialization.

serathius · 2026-05-19T18:14:32Z

cc @Jefftree

fuweid · 2026-05-19T18:20:21Z

> For watch cache fallback we are less interested in throughput and more in max memory

With streaming, the memory issue could go away. But right now, MAX_CONCURRENT_STREAMS is still max (int32) by default and there is a single channel, so a large number of fallback requests will slow each other down.

codecov · 2026-05-19T18:22:36Z

Codecov Report

❌ Patch coverage is 70.53571% with 33 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.24%. Comparing base (0235df9) to head (ca815f2).
⚠️ Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| client/v3/channel_pool.go | 59.25% | 14 Missing and 8 partials ⚠️ |
| client/v3/client.go | 72.72% | 5 Missing and 4 partials ⚠️ |
| client/v3/channel_key.go | 91.66% | 1 Missing and 1 partial ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ |
| --- | --- |
| client/v3/config.go | `85.71% <ø> (ø)` |
| client/v3/retry.go | `88.23% <100.00%> (ø)` |
| client/v3/channel_key.go | `91.66% <91.66%> (ø)` |
| client/v3/client.go | `84.26% <72.72%> (-0.97%)` ⬇️ |
| client/v3/channel_pool.go | `59.25% <59.25%> (ø)` |

... and 29 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #21765      +/-   ##
==========================================
+ Coverage   70.19%   70.24%   +0.05%     
==========================================
  Files         426      428       +2     
  Lines       35228    35342     +114     
==========================================
+ Hits        24727    24825      +98     
- Misses       9105     9110       +5     
- Partials     1396     1407      +11

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0235df9...ca815f2. Read the comment docs.


ahrtr · 2026-05-19T19:11:49Z

There are three solutions:

1. Multiple gRPC connections, as this PR provides.
2. [existing feature] Multiple streams (an etcd concept) that share the same gRPC channel/connection. This is only available for watch.
3. Users can create multiple etcd clients.

Can we have a comparison of each solution?

fuweid · 2026-05-19T21:26:06Z

> Can we have a comparison of each solution?

Sure.

> [existing feature] Multiple streams (an etcd concept) that share the same gRPC channel/connection. This is only available for watch.

The existing multiple-stream mechanism is not connection-level isolation. Those streams are still created from the same grpc.ClientConn (channel).

If that channel has only one underlying HTTP/2 connection (for example, a single pass-through endpoint), all those streams still share the same connection. Even when the channel has multiple endpoints/SubConns, stream placement is still controlled by the channel balancer; it is not a user-visible, context-keyed dedicated connection.

So, this is not equivalent to what this PR is trying to provide. The PR is about allowing callers to explicitly select a separate gRPC channel/connection for a traffic class, so large Range traffic can be isolated at the transport level. The existing multiple-stream mechanism only helps Watch RPC multiplexing and does not apply to Range.

> Users can create multiple etcd clients.

Yes, users can create multiple etcd clients today. I used that approach in kube-apiserver for initial testing (kubernetes/kubernetes#138494 (comment)), and it is doable, but I have since switched to this change.

The downside is that the application has to duplicate etcd client construction and lifecycle handling, including config, auth, endpoint updates, metrics, and close behavior. The application also has to manually route each request to the intended client instance.

So I am trying to provide the same connection-level isolation in clientv3 itself, while keeping a single etcd client as the user-facing object. Callers can select the intended channel through context.

NOTE: Some of our deployments use a single etcd service IP as the endpoint. In those cases the etcd client has only one configured endpoint, so after clientv3.New all traffic effectively uses one HTTP/2 (TCP) connection. That makes the throughput issue much easier to reproduce. This change makes it easy to provide connection-level isolation for large Range traffic.

k8s-ci-robot added area/clientv3 area/testing labels May 19, 2026

k8s-ci-robot added approved size/XL labels May 19, 2026

fuweid added the type/feature label May 19, 2026

fuweid mentioned this pull request May 19, 2026

Plan to release v3.7.0 #21605

Open

13 tasks

fuweid force-pushed the support-subconn branch from 6ec5926 to ca815f2 Compare May 19, 2026 17:48

k8s-ci-robot added size/L and removed size/XL labels May 19, 2026

fuweid mentioned this pull request May 19, 2026

apiserver: consistent read fallback to etcd can overload apiserver and etcd when watch cache is behind kubernetes/kubernetes#138494

Open

fuweid wants to merge 1 commit into etcd-io:main from fuweid:support-subconn
