Skip to content

Fix client hang during UDP bootstrap that prevents joining servers#1141

Open
JacobWoodson wants to merge 1 commit into
ryanhcode:mainfrom
JacobWoodson:fix/udp-join-hang-diagnostics
Open

Fix client hang during UDP bootstrap that prevents joining servers#1141
JacobWoodson wants to merge 1 commit into
ryanhcode:mainfrom
JacobWoodson:fix/udp-join-hang-diagnostics

Conversation

@JacobWoodson

Copy link
Copy Markdown

Problem

ConnectionMixin.sable$connect (and its local-server twin) opens a UDP channel alongside the Minecraft TCP one when the player connects. The bootstrap currently ends with:

final ChannelFuture channelFuture = new Bootstrap()...
        .connect(inetSocketAddress.getAddress(), inetSocketAddress.getPort());

channelFuture.syncUninterruptibly();

That syncUninterruptibly() runs on the Server Pinger (multiplayer list refresh) and Server Connector (actual join) threads. Those are the same threads that drive the TCP handshake / config phase of the Minecraft login. As long as the netty UDP connect() future hasn't resolved, no TCP login can make progress.

On most hosts the UDP connect() returns within milliseconds (UDP "connect" is just an OS-level associate-with-remote-address — no packets fly). But the netty call goes through NioDatagramChannel.doConnect() → javaChannel().connect(remoteAddress), which on Windows can stall when a third party is interposed on the socket layer: Windows Filtering Platform callouts from antivirus, "secure DNS" / enterprise proxies, some VPN clients, or layered service providers doing UDP inspection. On the affected client in this repro, that stall lasted 20-30+ seconds; the server's read-timeout fired first and closed the TCP socket. From the user's perspective, the login screen sat on "Connecting…" and then bounced with a generic "Disconnected".

This is the same shape of failure surfaced (with different proximate causes) in #852, #850 (workaround: attempt_udp_networking=false), #1035 (closed as dup of #482), and #1080 (NPE that this PR also fixes).

Repro

  • Sable NeoForge 1.21.1-1.2.2, Synthetic Horizons modpack, MC 1.21.1 + NeoForge 21.1.233.
  • One Windows 11 laptop on the LAN couldn't join a dedicated server (locally to 192.168.1.90:25565 or remotely to 47.38.244.85:25565). Other machines on the same LAN with the same pack joined fine.
  • Pinger thread eventually succeeded in bringing up its UDP channel (server appeared in the list), but the Server Connector's UDP bootstrap was the long one — long enough to push the server past its config-phase timeout.
  • With the patch in place on the same machine, UDP bootstrap completes in 1 ms on the same network path, and the player joins. The "drowning in the spawn ocean" verification was, uh, vivid.

Alternatives considered

I went through a few options before landing on this one. Recording them so reviewers can push back if they prefer a different shape.

  1. Tell users to set disable_udp_pipeline=true and call it done. This is the documented workaround today and was suggested on Client-side NPE in ClientboundSableUDPActivationPacket when UDP channel is null (tunnel/proxy environments) #1080. It works, but (a) it silently downgrades every user to TCP-only sub-level data forever rather than only on machines where UDP genuinely can't be set up, (b) the flag only short-circuits the server-side ServerConnectionListenerMixin — ConnectionMixin.sable$connect ignored it, so users following the doc were still hitting the bootstrap on the client, and (c) it doesn't address the root cause (a blocking call on the login thread).

  2. Move the UDP bootstrap off the login thread entirely (run the Bootstrap.connect(...) on the netty event loop and setUDPChannel from a listener). Cleaner long-term shape -- UDP bring-up becomes purely async, no timing coupling with TCP login at all. I didn't take this because it changes the lifecycle assumptions made by ClientboundSableUDPActivationPacket.handle (which today reads sable$getUDPChannel() synchronously, expecting connect() to be done by the time the server's activation packet arrives) and by anyone reading the field elsewhere. That's a bigger change with a wider blast radius for a backport-friendly fix, so I deferred it. Happy to do it as a follow-up if the maintainers prefer.

  3. Bounded await with a timeout, fall back to TCP-only on failure. What this PR does. The synchronous shape is preserved (so existing assumptions about sable$getUDPChannel() after Connection.connect returns still hold for the success path), but a slow or stuck connect() no longer holds the login thread hostage. The timeout (5 s) is generous enough that healthy networks always succeed (success path observed at <1 ms on every machine I tested) and short enough that the server-side login window (which is also ~20-30 s before the read-timeout fires) is preserved.

  4. Make the timeout configurable. Considered; rejected for now. A config knob is overkill when the practical range is "<1 s succeeds, otherwise something is broken and we want to fail fast". If a future deployment surfaces a legitimate need for a longer wait, promoting SABLE$UDP_CONNECT_TIMEOUT_MS to SableConfig is one line.

What this PR does

  • ConnectionMixin.sable$connect: replace syncUninterruptibly() with awaitUninterruptibly(SABLE$UDP_CONNECT_TIMEOUT_MS, MILLISECONDS) (5 s). On timeout: cancel the future, install a listener that logs the late outcome (and closes the channel if it eventually succeeds after we already gave up), return — letting TCP login proceed. On !isSuccess(): log the cause, return. Wrap the bootstrap construction itself in try/catch so an unchecked exception during bootstrap can't break the connect call either.
  • ConnectionMixin.sable$connectToLocalServer: same treatment for LocalChannel. Less critical (local channels rarely stall) but the inconsistency was hard to justify keeping.
  • Honor SableConfig.DISABLE_UDP_PIPELINE on the client: short-circuit both connect injects when the flag is set. Previously only ServerConnectionListenerMixin checked it, which made the documented workaround a half-measure.
  • ClientboundSableUDPActivationPacket.handle: null-check connectionExtension.sable$getUDPChannel() before dereferencing — fixes the NPE in Client-side NPE in ClientboundSableUDPActivationPacket when UDP channel is null (tunnel/proxy environments) #1080 (Cannot invoke 'io.netty.channel.Channel.eventLoop()' because 'channel' is null). When the channel is null, log a WARN and return; the client stays on TCP-only and the server's auth state for that connection just stays in AWAITING_AUTH until the connection closes. Also pattern-match the connection.getRemoteAddress() cast instead of an unchecked one.
  • Logging cleanup: the existing Starting remote client UDP channel future line now carries the remote address and transport class so logs from affected users are immediately actionable. All failure paths log at WARN with the cause. Routine breadcrumbs (config-disabled short-circuit, late-success-after-cancel, success path with timing) are demoted to DEBUG so unaffected users don't see any new chatter.

Implications

Success path

Byte-for-byte identical end-state -- the channel is set on the ConnectionExtension, SableUDPChannelHandlerClient.channelActive fires, ClientboundSableUDPActivationPacket.handle sees the channel and dispatches the auth response, server transitions SableUDPAuthenticationState to AUTHENTICATED. No timing changes when UDP comes up healthily (success observed at <1 ms across machines tested).

Failure / timeout path (where this PR is doing the actual work)

The interesting question is "if UDP setup fails, what's broken for the player?" I traced the gameplay-affecting UDP call sites to answer that:

  • sendUDPPacket has two production call sites in the codebase: SubLevelTrackingSystem.java:367-397 (sub-level snapshot replication) and SableCommand.java:58 (a debug echo command). The tracking system already has an explicit, designed-in TCP fallback branch keyed on isConnectedTo(player) — sub-level snapshots are sent as a ClientboundBundlePacket(ClientboundSableSnapshotInfoDualPacket, ClientboundSableSnapshotDualPacket) over TCP when UDP isn't authenticated. The packets are Dual (SableUDPPacket, SableTCPPacket) and the client decodes them identically off either transport; the only branch in ClientSableInterpolationState.receiveSnapshot on the receive mode is a debug-overlay string ("Networking through UDP/TCP"). The interpolation math, the snapshot buffer, and the rendered sub-level position are all identical.
  • isConnectedTo(player) returns false cleanly in our failure path (udpAuthStates entry stays in AWAITING_AUTH because the client never replies with SableUDPAuthenticationPacket).
  • SableUDPServer.sendPings() skips players that aren't AUTHENTICATED, so a fallback-to-TCP player generates no UDP keepalive traffic and is never bounced by the missed-pings logic.

So the practical effect of our timeout firing is: the player joins, sub-levels work normally, the only observable difference is the F3 debug overlay reports "Networking through TCP" (or "UNKNOWN" until the first snapshot arrives). The one real degradation is that under a lossy network, sub-level snapshots over TCP can show jitter from head-of-line blocking that UDP wouldn't have — but the failure mode that triggers this PR is at the OS socket layer, not on-wire packet loss, so affected users typically have clean networks and see no visible difference.

Other behavioral notes

  • disable_udp_pipeline now means what it says on the client — no UDP bootstrap, no UDP channel, no log lines. Previously the flag only short-circuited the server bootstrap, so a user setting it client-side (as Client-side NPE in ClientboundSableUDPActivationPacket when UDP channel is null (tunnel/proxy environments) #1080 recommends as a workaround) was still hitting the buggy bootstrap. This is a small behavior change for anyone relying on the side-effect that the client kept opening a UDP channel; I don't think such users exist, but worth calling out.
  • Stale auth-state entries: in the timeout path, the server's udpAuthStates map keeps an AWAITING_AUTH entry for the connection whose client never replied. The map is a WeakHashMap keyed on Connection and entries are also pruned in SableUDPServer.sendPings() when !connection.isConnected(), so it cleans up on disconnect. No leak; called out for completeness.
  • No protocol change, no packet format change, no jarjar change. Drop-in compatible across mixed 1.2.2 / patched-1.2.2 deployments — server-patched + client-unpatched (or vice versa) negotiates UDP exactly the same way it does today.
  • No new dependencies. One @Unique field added: SABLE$UDP_CONNECT_TIMEOUT_MS (mixin-safe naming).

AI assistance

This change was developed with the assistance of an AI coding assistant (Claude). I read every line, built the jar locally against mc1.21.1-1.2.2-neoforge, dropped it into the affected modpack instance, reproduced the original join failure on the affected machine, watched it succeed with the fix, and read the resulting logs to confirm the WARN/DEBUG split is what I want a future bug report to contain. I'm submitting this as my own contribution under the CONTRIBUTING.md terms.

Related issues

Connection.connect is mixin-injected to bootstrap a UDP channel
alongside the TCP one. The existing implementation calls
Bootstrap.connect(...).syncUninterruptibly() on the Server Pinger /
Server Connector thread - the same thread that drives TCP login.
If the netty connect future is slow to resolve (e.g. antivirus or
firewall delaying the OS-level UDP connect() syscall), the entire
login is blocked until the server gives up and closes the TCP
connection. The client sees a generic "Disconnected" and the server
sees "Timed out".

Replace the indefinite blocking sync with a bounded
awaitUninterruptibly(5s); on timeout / failure, cancel the future,
log the cause, and return so the TCP login can proceed. The success
path is unchanged.

Also:
  * honor SableConfig.DISABLE_UDP_PIPELINE on the client side too -
    previously it only short-circuited the server bootstrap, so users
    running the documented workaround were still bootstrapping a UDP
    channel on the client.
  * null-guard the UDP channel in ClientboundSableUDPActivationPacket
    (fixes ryanhcode#1080 NPE on channel.eventLoop() when UDP setup did not
    produce a channel).
  * pattern-match the remote address instead of an unchecked cast.
  * include remote address / transport class on the existing
    "Starting remote client UDP channel future" line and surface
    failure causes on WARN so future UDP setup issues are actionable
    from latest.log. Routine breadcrumbs at DEBUG only.
@CLAassistant

CLAassistant commented Jun 8, 2026

Copy link
Copy Markdown

CLA assistant check
All committers have signed the CLA.

@ryanhcode ryanhcode self-assigned this Jun 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants