Skip to content

Support RDMA handshake V3#3326

Open
chenBright wants to merge 1 commit into
apache:masterfrom
chenBright:rdma_handshake_v3
Open

Support RDMA handshake V3#3326
chenBright wants to merge 1 commit into
apache:masterfrom
chenBright:rdma_handshake_v3

Conversation

@chenBright
Copy link
Copy Markdown
Contributor

What problem does this PR solve?

Issue Number: resolve

Problem Summary:

The legacy v2 handshake ("RDMA" magic + 36B fixed binary HelloMessage)
had correctness bugs that made the wire format effectively
unevolvable.

  1. Client never drained the "unknown tail" bytes when a peer sent
    msg_len > HELLO_MSG_LEN_MIN(40). Leftover bytes stayed in the
    socket recv buffer and silently corrupted the next ReadFromFd
    (the ACK).
  2. Server had the symmetric version of A.
  3. Server computed the body read length from its LOCAL
    g_rdma_hello_msg_len, implicitly assuming the peer's hello is the
    same length as its own. A longer peer left bytes behind; a shorter
    peer made the read block. Client used the correct compile-time
    constant; the two sides were not symmetric.

Combined, 1/2/3 meant v2 could not safely append a single byte to the
hello -- even an "optional hint" appended by a newer sender would
mis-align an older receiver's next read.

What is changed and the side effects?

Changed:

v3 handshake is designed which is wire format, magic-namespace-isolated from v2:

  [ "RDM3" 4B ][ pb_size 4B big-endian ][ RdmaHello protobuf bytes ]

with pb_size in (0, 4096]. RdmaHello carries the same 6 base fields
as v2 plus room to append future capabilities.

Why protobuf (and not "v2 plus length prefix")

  • Variable-length fields are coming. Future capability fields will
    include strings (rdma_device_name, netdev_name, ...) and other
    variable-length data. Supporting them on a fixed-binary protocol
    forces us to invent and maintain a TLV layer (per-field type +
    length + value framing, plus version-aware deserialization). That
    is reimplementing protobuf badly. Using protobuf from day one
    costs nothing and is the canonical answer.
  • Fixes v2 bug 1/2/3 generically: pb_size makes the wire
    self-describing, so the receiver never needs to guess the length
    or know the peer's schema version to read the body cleanly.
  • Append-only field evolution out of the box: new optional fields
    cost old receivers nothing -- they're skipped as unknown protobuf
    fields. v2 with a hand-rolled length prefix would still need
    per-field opt-in code on every side.
  • Built-in validation: ParseFromArray fails fast on malformed input;
    required-field presence is enforced at the parse layer, not by
    ad-hoc has_xxx() checks scattered through wire code.
  • bRPC already depends on protobuf -- no new build dependency.

Why a NEW MAGIC rather than a version field inside protobuf

  • Forces "breaking change" to be a deployment decision visible at
    the wire level. You cannot accidentally ship a backwards-
    incompatible patch via a field-semantics tweak.
  • Server-side dispatch routes by magic to fully independent state
    machines that can't entangle (no if (version == X) branches
    anywhere -- this is the abstraction's red line).
  • Any future breaking change bumps the magic ("RDM4", "RDM5", ...).
    v3 fields, once shipped, never change semantics.

Architecture

Handshake state machine is split out of RdmaEndpoint:

RdmaHandshake  (abstract base; only 3 virtuals + _ep pointer)
  |-- RdmaHandshakeClientV2 / ServerV2
  |-- RdmaHandshakeClientV3 / ServerV3
  |-- v2_wire::* (file-local helpers shared by v2 leaves)
  `-- v3_wire::* (file-local helpers shared by v3 leaves)

Endpoint-side ProcessHandshakeAt{Client,Server} is protocol-agnostic;
subclasses override only SendLocalHello() and ReceiveAndParseRemoteHello().
A ParsedHello struct decouples endpoint-side bring-up (window calculation,
BringUpQp) from any wire format.

Side effects:

  • Performance effects:

  • Breaking backward compatibility:

Server-side ALWAYS accepts both v2 and v3 (no gflag, no kill-switch);
magic routes to fully independent code paths. A single rolling upgrade
enables v3 fleet-wide.

Client-side picks the wire protocol via gflag with a safe default:

FLAGS_rdma_client_handshake_version  (default 2)
  2 = "RDMA" legacy (zero-regression default)
  3 = "RDM3" protobuf (opt-in once target servers support v3)

Sub-second rollback is one flag flip away. v3 client to v2-only legacy
server is NOT guaranteed to transparently fall back on the same
connection -- the supported migration is "upgrade servers first, then
opt-in clients".


Check List:

Background
==========
The legacy v2 handshake ("RDMA" magic + 36B fixed binary HelloMessage)
had correctness bugs that made the wire format effectively
unevolvable.

v2 bugs
=======
A. Client never drained the "unknown tail" bytes when a peer sent
   msg_len > HELLO_MSG_LEN_MIN(40). Leftover bytes stayed in the
   socket recv buffer and silently corrupted the next ReadFromFd
   (the ACK).
B. Server had the symmetric version of A.
C. Server computed the body read length from its LOCAL
   g_rdma_hello_msg_len, implicitly assuming the peer's hello is the
   same length as its own. A longer peer left bytes behind; a shorter
   peer made the read block. Client used the correct compile-time
   constant; the two sides were not symmetric.

Combined, A/B/C meant v2 could not safely append a single byte to the
hello -- even an "optional hint" appended by a newer sender would
mis-align an older receiver's next read.

v3 design
=========
Wire format, magic-namespace-isolated from v2:

  [ "RDM3" 4B ][ pb_size 4B big-endian ][ RdmaHello protobuf bytes ]

with pb_size in (0, 4096]. RdmaHello carries the same 6 base fields
as v2 plus room to append future capabilities.

Why protobuf (and not "v2 plus length prefix")
----------------------------------------------
- Variable-length fields are coming. Future capability fields will
  include strings (rdma_device_name, netdev_name, ...) and other
  variable-length data. Supporting them on a fixed-binary protocol
  forces us to invent and maintain a TLV layer (per-field type +
  length + value framing, plus version-aware deserialization). That
  is reimplementing protobuf badly. Using protobuf from day one
  costs nothing and is the canonical answer.
- Fixes v2 bug A/B/C generically: pb_size makes the wire
  self-describing, so the receiver never needs to guess the length
  or know the peer's schema version to read the body cleanly.
- Append-only field evolution out of the box: new optional fields
  cost old receivers nothing -- they're skipped as unknown protobuf
  fields. v2 with a hand-rolled length prefix would still need
  per-field opt-in code on every side.
- Built-in validation: ParseFromArray fails fast on malformed input;
  required-field presence is enforced at the parse layer, not by
  ad-hoc has_xxx() checks scattered through wire code.
- bRPC already depends on protobuf -- no new build dependency.

Why a NEW MAGIC rather than a version field inside protobuf
-----------------------------------------------------------
- Forces "breaking change" to be a deployment decision visible at
  the wire level. You cannot accidentally ship a backwards-
  incompatible patch via a field-semantics tweak.
- Server-side dispatch routes by magic to fully independent state
  machines that can't entangle (no `if (version == X)` branches
  anywhere -- this is the abstraction's red line).
- Any future breaking change bumps the magic ("RDM4", "RDM5", ...).
  v3 fields, once shipped, never change semantics.

Rollout
=======
Server-side ALWAYS accepts both v2 and v3 (no gflag, no kill-switch);
magic routes to fully independent code paths. A single rolling upgrade
enables v3 fleet-wide.

Client-side picks the wire protocol via gflag with a safe default:

    FLAGS_rdma_client_handshake_version  (default 2)
      2 = "RDMA" legacy (zero-regression default)
      3 = "RDM3" protobuf (opt-in once target servers support v3)

Sub-second rollback is one flag flip away. v3 client to v2-only legacy
server is NOT guaranteed to transparently fall back on the same
connection -- the supported migration is "upgrade servers first, then
opt-in clients".
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Introduces a new RDMA handshake protocol ("v3") using a "RDM3" magic + length-prefixed protobuf RdmaHello, refactors the handshake state machine out of RdmaEndpoint into pluggable RdmaHandshake subclasses, and fixes long-standing v2 hello bugs (oversized/undersized msg_len, untrained trailing bytes). Server accepts both v2 and v3 by magic dispatch; clients pick the version via FLAGS_rdma_client_handshake_version (defaults to 2 for backward compatibility).

Changes:

  • Add RdmaHello protobuf + RdmaHandshake{Client,Server}V{2,3} classes, split v2_wire/v3_wire helpers, and wire-format-agnostic ParsedHello translation; add RdmaEndpoint::ApplyRemoteHello, new ReadFromFd(IOPortal*) / WriteToFd(IOBuf*) overloads, and template-based EAGAIN read/write loops.
  • Tighten v2 server validation (reject msg_len > 4096, drain valid tails), centralize ACK constants, and improve fallback/log messages including a handshake_version debug field.
  • Add v2/v3 byte-level baseline tests, RdmaRpcTest parameterized fixture run under both versions, fix Server::StartInternal so a failed validation no longer leaks user-owned pointers into _options, and update CI/build configs to compile the new .proto.

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/brpc/rdma/rdma_handshake.{h,cpp,proto} New abstract handshake hierarchy + v2/v3 wire helpers and protobuf definition.
src/brpc/rdma/rdma_endpoint.{h,cpp} Endpoint state machine now delegates wire handling, adds IOBuf/IOPortal IO helpers and ApplyRemoteHello.
src/brpc/socket.{h,cpp} Friend declarations for new handshake classes; minor formatting.
src/brpc/server.cpp Validate ServerOptions before copying to avoid taking ownership on failure paths.
src/brpc/rdma_transport.h Adjust friend declarations for new handshake classes.
test/brpc_rdma_unittest.cpp Migrate tests to new v2_wire::* names, add v3 baseline/error tests and parameterized RdmaRpcTest.
test/bvar_percentile_unittest.cpp Gate add test under !WITH_BABYLON_COUNTER.
src/butil/thread_key.h Add missing <algorithm> include.
CMakeLists.txt / BUILD.bazel / .bazelrc Compile the new proto, add rdma bazel config.
.github/workflows/ci-linux.yml, .github/actions/install-all-dependencies/action.yml Install libibverbs-dev, line-wrap commands, enable rdma config in babylon CI.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread test/brpc_rdma_unittest.cpp
Comment thread .github/workflows/ci-linux.yml
Comment thread src/brpc/rdma/rdma_handshake.cpp
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants