[RFC] clientv3: add context-keyed channel pool #21765
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: fuweid |
Motivation:

kube-apiserver currently creates one etcd client, and therefore one gRPC channel, per resource. When the watch cache is not ready, kube-apiserver delegates List requests to etcd. Delegated List requests and reflector requests then share the same gRPC channel.

Large delegated responses can increase memory pressure on the etcd server. Even when using the streaming Range API, delegated requests can slow down reflector requests that are populating the watch cache.

By default, etcd's gRPC server allows a very large number of concurrent streams. If a reflector request is slow, or if compaction triggers List-All-Without-Pagination, many streams can accumulate on one connection. Setting MAX_CONCURRENT_STREAMS is hard to tune because hitting the limit can also block reflector requests (grpc/grpc#21386).

Based on https://grpc.io/docs/guides/performance/, we probably should allow kube-apiserver to isolate traffic by different channels.

> Quote:
>
> 1. Create a separate channel for each area of high load in the application.
> 2. Use a pool of gRPC channels to distribute RPCs over multiple connections
>    (channels must have different channel args to prevent re-use so define a
>    use-specific channel arg such as channel number).

Add an experimental context-keyed channel pool so kube-apiserver and other applications can select an initialized gRPC channel through request context for high-volume data transfers. Combined with MAX_CONCURRENT_STREAMS, this allows delegated requests to be isolated from reflector traffic.

Test:

* Environment
  - etcd member: Standard_D32s_v5 node
  - etcd client: Standard_D32s_v5 node
  - Cross-node access

* Client workload

  The client uses a single etcd connection, similar to how kube-apiserver keeps one client per resource.

  For each test run:

  1. Sequentially list about 1 GiB of data (480,000 objects) 10 times and measure the average runtime.
  2. At the same time, start 100 goroutines that continuously read about 5 MiB of data each.

  These background reads are used to consume bandwidth and slow down the 1 GiB list operation.

  This simulates kube-apiserver behavior when the watch cache for a resource is not ready. In that case, user List requests cannot be served from cache and are forwarded directly to etcd while other requests may still consume etcd/network bandwidth.

Result:

| Conn | Mode       | Page-size | Avg (grpc / cmux)     |
| ---- | ---------- | --------- | --------------------- |
| 1    | stream     | no        | 2m45.653s / 3m17.042s |
| 1    | stream     | 10k       | 2m49.178s / 3m56.316s |
| 1    | non-stream | no        | 1m52.867s / 3m11.739s |
| 1    | non-stream | 10k       | 2m14.412s / 3m13.986s |
| 2    | stream     | no        | 4.522s / 4.687s       |
| 2    | stream     | 10k       | 8.468s / 8.868s       |
| 2    | non-stream | no        | 6.333s / 6.536s       |
| 2    | non-stream | 10k       | 9.448s / 10.204s      |

The native grpc-go server can be faster than the Go net/http HTTP/2 server path because they schedule writes differently. The net/http HTTP/2 server path prioritizes control frames, such as PING ACK, ahead of queued stream DATA in its round-robin write scheduler [1]. The client can therefore receive the ACK earlier, with fewer DATA bytes included in the BDP sample. The native grpc-go server uses its own loopyWriter and activeStreams scheduling, so DATA may be written or batched before the ACK is sent. This can produce a larger BDP sample and faster window growth, improving throughput for large responses.

Notes:

* grpc: grpc native http2 server.
* cmux: x/net http2 server by cmux.
* Conn: 2 means the 1 GiB List uses a dedicated channel/connection.

Reference:

[1]: https://github.com/golang/net/blob/v0.38.0/http2/writesched_roundrobin.go#L80-L83

Signed-off-by: Wei Fu <fuweid89@gmail.com>
|
For watch cache fallback we are less interested in throughput and more in max memory :P Throughput is still pretty important for watch cache initialization. |
|
cc @Jefftree |
With streaming, the memory issue could be gone. But right now, MAX_CONCURRENT_STREAMS is still max (int32) by default and there is a single channel. A lot of fallback requests should slow each other down. |
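The interaction between a stream limit and a single shared channel can be illustrated with a small stdlib-only model. This is a sketch under assumptions, not gRPC code: `streamLimiter` is a hypothetical type that models a per-connection MAX_CONCURRENT_STREAMS cap with a buffered channel, and the limit of 2 is chosen only to keep the example short. A real gRPC client would queue the new stream rather than reject it; `tryOpen` only reports whether the stream could start without waiting.

```go
package main

import "fmt"

// streamLimiter models a per-connection cap on concurrent streams.
type streamLimiter struct{ slots chan struct{} }

func newStreamLimiter(max int) *streamLimiter {
	return &streamLimiter{slots: make(chan struct{}, max)}
}

// tryOpen reports whether a new stream can start without waiting for a slot.
func (l *streamLimiter) tryOpen() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// done releases the slot held by a finished stream.
func (l *streamLimiter) done() { <-l.slots }

func main() {
	const limit = 2

	// One shared connection: delegated Lists occupy every slot.
	shared := newStreamLimiter(limit)
	shared.tryOpen() // delegated List 1
	shared.tryOpen() // delegated List 2
	fmt.Println("shared conn, reflector starts immediately:", shared.tryOpen())

	// Two connections: delegated Lists only consume slots on their own one.
	bulk, def := newStreamLimiter(limit), newStreamLimiter(limit)
	bulk.tryOpen() // delegated List 1
	bulk.tryOpen() // delegated List 2
	fmt.Println("dedicated conn, reflector starts immediately:", def.tryOpen())
}
```

In the shared case the reflector has to wait behind the delegated Lists, while in the dedicated case the limit only throttles the delegated traffic.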
Codecov Report
... and 29 files with indirect coverage changes
@@ Coverage Diff @@
## main #21765 +/- ##
==========================================
+ Coverage 70.19% 70.24% +0.05%
==========================================
Files 426 428 +2
Lines 35228 35342 +114
==========================================
+ Hits 24727 24825 +98
- Misses 9105 9110 +5
- Partials 1396 1407 +11
|
|
There are three solutions:
can we have a comparison of each solution? |
Sure.
The existing multiple-stream mechanism is not connection-level isolation. Those streams are still created from the same channel. If that channel has only one underlying HTTP/2 connection (only one endpoint pass-through), all those streams still share the same connection. Even when the channel has multiple endpoints/SubConns, stream placement is still controlled by the channel balancer and is not a user-visible, context-keyed dedicated connection. So this is not equivalent to what this PR is trying to provide. The PR is about allowing callers to explicitly select a separate gRPC channel/connection for a traffic class, so large Range traffic can be isolated at the transport level. The existing multiple-stream mechanism only helps Watch RPC multiplexing and does not apply to Range.
Yes, users can create multiple etcd clients today. I used that approach in kube-apiserver at the beginning for testing (kubernetes/kubernetes#138494 (comment)), and it is doable, but I then switched to this change. The downside of multiple clients is that the application has to duplicate etcd client construction and lifecycle handling, including config, auth, endpoint updates, metrics, and close behavior. The application also has to manually route each request to the intended client instance. So I am trying to provide the same connection-level isolation in clientv3 itself, while keeping a single etcd client as the user-facing object. Callers can select the intended channel through context.
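For comparison, the multiple-clients approach roughly puts the following on the application side. This is a sketch only: `client` is a placeholder for `*clientv3.Client` so that the example needs no external modules, and `trafficClass`, `multiClient`, and `pick` are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// client is a stand-in for *clientv3.Client.
type client struct {
	name   string
	closed bool
}

// Close marks the client closed and fails on a second call.
func (c *client) Close() error {
	if c.closed {
		return errors.New(c.name + ": already closed")
	}
	c.closed = true
	return nil
}

// trafficClass is a hypothetical label the application assigns per request.
type trafficClass int

const (
	reflectorTraffic trafficClass = iota
	delegatedListTraffic
)

// multiClient is what the application has to own when it isolates traffic
// with separate clients: one instance per class, plus routing and shutdown.
type multiClient struct {
	reflector *client
	delegated *client
}

// newMultiClient repeats construction once per class; in a real program each
// construction would also repeat config, auth, and endpoint setup.
func newMultiClient() *multiClient {
	return &multiClient{
		reflector: &client{name: "reflector"},
		delegated: &client{name: "delegated-list"},
	}
}

// pick routes a request to the client for its traffic class.
func (m *multiClient) pick(tc trafficClass) *client {
	if tc == delegatedListTraffic {
		return m.delegated
	}
	return m.reflector
}

// Close shuts down every client and reports all failures together.
func (m *multiClient) Close() error {
	return errors.Join(m.reflector.Close(), m.delegated.Close())
}

func main() {
	m := newMultiClient()
	fmt.Println(m.pick(reflectorTraffic).name)
	fmt.Println(m.pick(delegatedListTraffic).name)
	fmt.Println(m.Close())
}
```

Every call site needs to know its traffic class and go through `pick`, and every lifecycle operation has to be fanned out to each instance by hand.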
|