multikernel/tcp_splice

TCP Splice

An out-of-tree Linux kernel module that accelerates co-located TCP (loopback and same-host container sidecars) by splicing two locally-connected TCP sockets through a small in-kernel ring, bypassing the skb/softirq/TCP-receive path for bulk data while leaving the real TCP connection intact.

It is the loadable-module form of the in-tree bpf_sock_splice_pair() prototype. A sock_ops BPF program calls the module's kfunc from each endpoint's ESTABLISHED callback; the module pairs the two ends of the connection through a private global registry (no sockmap).

Once both ends are paired, the ring is the sole data path for sendmsg/recvmsg - there is no per-write fallback to TCP, so a full ring backpressures the sender (ordinary stream flow control, like a full send buffer) and a ring byte can never overtake TCP data, i.e. the splice never reorders the stream. Only the brief startup window before the second end pairs carries data over TCP; the sender records the one sequence number where it switches to the ring (ring_seq), and the receiver delivers all startup TCP up to that point before draining the ring, so the two never reorder.

Sequence numbers stay frozen at post-handshake values, so FIN/RST/keepalive keep working over normal TCP and the pair tears down on a regular close.
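The startup-window ordering rule can be illustrated with a toy user-space model. This is only a sketch under stated assumptions, not the module's code: the class and method names are hypothetical, only ring_seq comes from the description above, and sequence-number wraparound is ignored.

```python
from collections import deque


class SplicedReceiver:
    """Toy model: startup TCP bytes are delivered before any ring bytes."""

    def __init__(self, initial_seq):
        self.next_seq = initial_seq   # next TCP sequence number expected
        self.ring_seq = None          # switch point, unknown until the sender records it
        self.tcp_queue = deque()      # startup bytes that arrived over TCP
        self.ring = deque()           # bytes that arrived through the ring

    def on_tcp(self, data):
        self.tcp_queue.append(data)

    def on_ring(self, data):
        self.ring.append(data)

    def recv(self):
        """Return the next chunk in stream order, or None if nothing is deliverable."""
        if self.tcp_queue:
            data = self.tcp_queue.popleft()
            self.next_seq += len(data)
            return data
        # Ring data is deliverable only once every TCP byte below ring_seq was read.
        if self.ring_seq is not None and self.next_seq >= self.ring_seq and self.ring:
            return self.ring.popleft()
        return None


rx = SplicedReceiver(initial_seq=1000)
rx.ring_seq = 1005                  # sender switched to the ring after 5 startup bytes
rx.on_ring(b"ring")                 # ring data may show up before the startup TCP is read
first = rx.recv()                   # nothing deliverable yet: startup TCP is outstanding
rx.on_tcp(b"hello")
ordered = [rx.recv(), rx.recv()]    # startup TCP first, then the ring
```

In this model the ring chunk is held back until the five startup bytes have been read, which is the property that keeps the stream in order.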

Performance

netperf, sender and receiver pinned to adjacent CPUs, baseline TCP vs the splice path. Loopback is 127.0.0.1; "container" is two network namespaces over a veth pair plus a Linux bridge.

TCP_RR (request/response), transactions/sec, the headline case:

| msg size | loopback baseline | splice | splice + busy-poll |
|----------|-------------------|--------|--------------------|
| 1 B | 110.2k | 267.0k (2.4x) | 1113.8k (10.0x) |
| 64 B | 111.6k | 265.7k (2.4x) | 1073.3k (9.6x) |
| 1 KB | 105.8k | 235.1k (2.2x) | 713.0k (6.7x) |
| 16 KB | 40.5k | 89.6k (2.2x) | 123.7k (3.1x) |
| 64 KB | 17.8k | 20.9k (1.2x) | 40.5k (2.3x) |

Container TCP_RR tracks loopback closely (e.g. 1 KB: 100.4k → 233.9k → 704.9k). The busy-poll budget is off by default; see Busy polling.

TCP_STREAM (bulk), Mbit/s: roughly neutral on bare-metal loopback (kernel TSO already amortizes per-packet cost), but a large win container-to-container, where the per-skb veth+bridge cost is exactly what the ring sidesteps:

| msg size | container baseline | splice |
|----------|--------------------|--------|
| 1 KB | 3710 | 21326 (5.8x) |
| 4 KB | 8084 | 48834 (6.0x) |
| 16 KB | 26083 | 27788 (1.1x) |

These match the in-tree prototype and are reproduced by this module on the same hardware (1 KB loopback RR: 95.7k → 223.7k → 715.5k).

Requirements

  • A kernel built with CONFIG_DEBUG_INFO_BTF (for /sys/kernel/btf/vmlinux) and CONFIG_BPF_SYSCALL.
  • For module/: matching linux-headers, pahole (dwarves), binutils.
  • For bpf/ + ctl/: clang, bpftool, and libbpf headers (libbpf-dev).

Build

Each component has its own Makefile; the top level drives all three:

make            # module + bpf + ctl
make module     # just the kernel module
make bpf        # just the sock_ops object (+ skeleton)
make ctl        # just the control tool (consumes the bpf skeleton)

make module builds the .ko and then runs gen_btf.sh. That extra BTF step is required because distro kernel headers packages ship no vmlinux, so kbuild skips module BTF and the kfunc's BTF id stays unresolved (the kfunc would fail to register). gen_btf.sh resolves it against the running kernel's /sys/kernel/btf/vmlinux via pahole + resolve_btfids + objcopy.

Load

sudo insmod module/tcp_splice.ko      # or: make -C module load
# dmesg: "tcp_splice: loaded, bpf_tcp_splice_pair registered"
sudo rmmod tcp_splice                 # or: make -C module unload

A loaded BPF program that uses the kfunc pins the module, so rmmod is blocked while any paired socket is live.

Usage

The module is only the data plane; a sock_ops program decides which connections to splice. tcp-splice-ctl loads that program, sets the policy, and attaches it to a cgroup. Both endpoints of a connection must fall under the attached cgroup (see Pairing requirements below), so attach at a level that covers them - the cgroup-v2 root covers everything:

# enable for everything under the cgroup-v2 root, loopback flows only, port 6379
sudo ctl/tcp-splice-ctl enable --cgroup /sys/fs/cgroup --loopback-only --ports 6379

sudo ctl/tcp-splice-ctl status
# tcp_splice: enabled (loopback_only=1, ports=6379)

sudo ctl/tcp-splice-ctl disable

Options for enable:

| option | meaning |
|--------|---------|
| --cgroup PATH | cgroup-v2 directory to attach to (required) |
| --loopback-only | only splice loopback flows (127.0.0.0/8, ::1) |
| --ports a,b,c | only splice flows where either endpoint uses one of these ports (default: any) |
| --busy-poll-us N | set the ring busy-poll budget (writes net.core.busy_read; see Busy polling) |
| --ring-kbytes N | per-direction ring size in KiB (writes the ring_kbytes module parameter; see Ring size) |
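The --loopback-only and --ports filters amount to a per-connection predicate. The sketch below models that decision in user space; the function name and signature are hypothetical, and it assumes that "loopback flow" means both addresses are loopback, which the text above does not spell out. The real decision is made by the sock_ops program, not by Python.

```python
import ipaddress


def should_splice(local_ip, local_port, remote_ip, remote_port,
                  loopback_only=False, ports=()):
    """Hypothetical policy check mirroring --loopback-only and --ports."""
    if loopback_only:
        # Assumption: both ends must be loopback (127.0.0.0/8 or ::1).
        both_loopback = all(
            ipaddress.ip_address(ip).is_loopback for ip in (local_ip, remote_ip)
        )
        if not both_loopback:
            return False
    # An empty port list means "any port"; otherwise either endpoint may match.
    if ports and local_port not in ports and remote_port not in ports:
        return False
    return True
```

With loopback_only=True and ports=(6379,), a 127.0.0.1:40000 to 127.0.0.1:6379 connection qualifies, while a connection between two veth addresses or one on unrelated ports does not.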

The attachment is a pinned BPF link under /sys/fs/bpf/tcp_splice/, so it persists after tcp-splice-ctl exits; disable removes it. The module must be loaded first - it provides the kfunc the program calls. tcp-splice-ctl does not generate any BPF at run time: the sock_ops object built from bpf/ is embedded in the tool as a skeleton.

Busy polling

For latency-bound request/response traffic, the splice receiver can spin on the ring before parking, which collapses the per-cycle wakeup cost. The budget is the mainline net.core.busy_read sysctl (microseconds), which seeds sk_ll_usec on every socket:

sudo sysctl -w net.core.busy_read=50
# or, equivalently, via the control tool:
sudo ctl/tcp-splice-ctl enable --cgroup /sys/fs/cgroup --busy-poll-us 50

A value of 0 (the default) disables busy polling. An application can also opt in per socket with setsockopt(SO_BUSY_POLL).
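A per-socket opt-in might look like the following sketch. The helper name is hypothetical. Python's socket module may not export SO_BUSY_POLL, so the code falls back to the value 46 used by asm-generic/socket.h on most Linux architectures; that fallback is an assumption to verify on the target platform. The call can fail, for example on a kernel built without busy-poll support or when raising the value without CAP_NET_ADMIN, so the helper reports failure instead of raising.

```python
import socket

# Fallback is the Linux asm-generic value; some architectures use a different number.
SO_BUSY_POLL = getattr(socket, "SO_BUSY_POLL", 46)


def enable_busy_poll(sock, usec):
    """Try to set a per-socket busy-poll budget in microseconds; True on success."""
    try:
        sock.setsockopt(socket.SOL_SOCKET, SO_BUSY_POLL, usec)
        return True
    except OSError:
        # Unsupported option, or insufficient privilege to raise the budget.
        return False
```

An application would call enable_busy_poll(sock, 50) on its connected socket and carry on with the default behaviour if it returns False.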

The spin is on the in-kernel ring directly, never on a NAPI instance. This matters because loopback and veth deliver via the per-CPU backlog and expose no pollable napi_id, so the kernel's generic sk_busy_loop() is a no-op for them. The module borrows only the budget value (sk_can_busy_loop()/sk_busy_loop_timeout()), not the NAPI machinery.
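The "spin on the ring before parking" idea can be sketched as a bounded poll loop. This is a user-space illustration only: the function name is hypothetical, a deque stands in for the ring, and a return value of None stands in for the point where the kernel would put the receiver to sleep.

```python
import time
from collections import deque


def recv_with_busy_poll(ring, budget_us):
    """Spin on the ring for up to budget_us microseconds; None means 'park'."""
    if ring:
        return ring.popleft()
    deadline = time.monotonic_ns() + budget_us * 1000
    # A budget of 0 skips the loop entirely, matching "0 disables it".
    while budget_us > 0 and time.monotonic_ns() < deadline:
        if ring:  # in real use a concurrent producer fills the ring
            return ring.popleft()
    return None
```

The trade-off is the usual one for busy polling: the receiver burns CPU for at most the budget in exchange for skipping a sleep/wakeup cycle when data arrives within that window.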

Ring size

The per-direction ring is sized by the ring_kbytes module parameter (KiB, rounded up to a power of two; default 64):

# at load time
sudo insmod module/tcp_splice.ko ring_kbytes=256
# or at runtime (applies to connections spliced afterwards)
echo 256 | sudo tee /sys/module/tcp_splice/parameters/ring_kbytes
# or via the control tool
sudo ctl/tcp-splice-ctl enable --cgroup /sys/fs/cgroup --ring-kbytes 256

Because the ring is the sole data path (no TCP fallback), the size does not affect the bypass ratio - effectively all payload moves through the ring at any size. It is purely a flow-control/memory knob: a larger ring lets a fast sender run further ahead before it blocks on a full ring, at the cost of memory (two rings per spliced connection). tcp-splice-ctl status reports the current value.
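The rounding rule and the memory cost can be estimated with a few lines of arithmetic. This is a back-of-the-envelope sketch with hypothetical function names; it counts only ring payload capacity and ignores any per-ring metadata the module may allocate.

```python
def effective_ring_kib(ring_kbytes):
    """Round a requested ring size in KiB up to the next power of two."""
    if ring_kbytes < 1:
        raise ValueError("ring_kbytes must be positive")
    return 1 << (ring_kbytes - 1).bit_length()


def ring_memory_kib(ring_kbytes, connections):
    """Two rings (one per direction) for each spliced connection."""
    return 2 * effective_ring_kib(ring_kbytes) * connections
```

For example, a request of 100 KiB becomes 128 KiB per direction, and 1000 spliced connections at ring_kbytes=256 account for 512000 KiB (500 MiB) of ring capacity.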

Pairing requirements

Pairing keys two endpoints by a canonical, direction-independent 4-tuple, so the sock_ops program must call the kfunc on both ends (each installs its own proto under its own lock). For loopback the netns is folded into the key, so two unrelated 127.0.0.1 connections in different namespaces never mis-pair; for routable addresses (container veth IPs) the key is netns-agnostic, so the two ends in different namespaces still match. If only one end ever pairs, that socket simply keeps using plain TCP (the ring is engaged only once both ends are installed).
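A canonical, direction-independent key of this kind can be sketched as follows. The function name, the tuple layout, and the use of an integer netns_id are all assumptions for illustration; the module's actual key format is not described here.

```python
import ipaddress


def pair_key(addr_a, port_a, addr_b, port_b, netns_id):
    """Hypothetical canonical key: identical whichever end computes it."""
    # Sorting the two endpoints removes the local/remote direction.
    lo, hi = sorted([(addr_a, port_a), (addr_b, port_b)])
    loopback = all(ipaddress.ip_address(a).is_loopback for a in (addr_a, addr_b))
    # Loopback addresses repeat in every namespace, so the netns disambiguates them.
    # Routable addresses are keyed without it so ends in different namespaces match.
    return (lo, hi, netns_id if loopback else None)
```

Both ends of a 127.0.0.1 connection in the same namespace produce the same key, the same 4-tuple in another namespace produces a different one, and two veth endpoints in different namespaces still agree.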
