Fix client hang during UDP bootstrap that prevents joining servers#1141
Open
JacobWoodson wants to merge 1 commit into
Open
Fix client hang during UDP bootstrap that prevents joining servers#1141JacobWoodson wants to merge 1 commit into
JacobWoodson wants to merge 1 commit into
Conversation
Connection.connect is mixin-injected to bootstrap a UDP channel
alongside the TCP one. The existing implementation calls
Bootstrap.connect(...).syncUninterruptibly() on the Server Pinger /
Server Connector thread - the same thread that drives TCP login.
If the netty connect future is slow to resolve (e.g. antivirus or
firewall delaying the OS-level UDP connect() syscall), the entire
login is blocked until the server gives up and closes the TCP
connection. The client sees a generic "Disconnected" and the server
sees "Timed out".
Replace the indefinite blocking sync with a bounded
awaitUninterruptibly(5s); on timeout / failure, cancel the future,
log the cause, and return so the TCP login can proceed. The success
path is unchanged.
Also:
* honor SableConfig.DISABLE_UDP_PIPELINE on the client side too -
previously it only short-circuited the server bootstrap, so users
running the documented workaround were still bootstrapping a UDP
channel on the client.
* null-guard the UDP channel in ClientboundSableUDPActivationPacket
(fixes ryanhcode#1080 NPE on channel.eventLoop() when UDP setup did not
produce a channel).
* pattern-match the remote address instead of an unchecked cast.
* include remote address / transport class on the existing
"Starting remote client UDP channel future" line and surface
failure causes on WARN so future UDP setup issues are actionable
from latest.log. Routine breadcrumbs at DEBUG only.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
ConnectionMixin.sable$connect(and its local-server twin) opens a UDP channel alongside the Minecraft TCP one when the player connects. The bootstrap currently ends with:That
syncUninterruptibly()runs on theServer Pinger(multiplayer list refresh) andServer Connector(actual join) threads. Those are the same threads that drive the TCP handshake / config phase of the Minecraft login. As long as the netty UDPconnect()future hasn't resolved, no TCP login can make progress.On most hosts the UDP
connect()returns within milliseconds (UDP "connect" is just an OS-level associate-with-remote-address — no packets fly). But the netty call goes throughNioDatagramChannel.doConnect()→javaChannel().connect(remoteAddress), which on Windows can stall when a third party is interposed on the socket layer: Windows Filtering Platform callouts from antivirus, "secure DNS" / enterprise proxies, some VPN clients, or layered service providers doing UDP inspection. On the affected client in this repro, that stall lasted 20-30+ seconds; the server's read-timeout fired first and closed the TCP socket. From the user's perspective, the login screen sat on "Connecting…" and then bounced with a generic "Disconnected".This is the same shape of failure surfaced (with different proximate causes) in #852, #850 (workaround:
attempt_udp_networking=false), #1035 (closed as dup of #482), and #1080 (NPE that this PR also fixes).Repro
192.168.1.90:25565or remotely to47.38.244.85:25565). Other machines on the same LAN with the same pack joined fine.Alternatives considered
I went through a few options before landing on this one. Recording them so reviewers can push back if they prefer a different shape.
Tell users to set
disable_udp_pipeline=trueand call it done. This is the documented workaround today and was suggested on Client-side NPE in ClientboundSableUDPActivationPacket when UDP channel is null (tunnel/proxy environments) #1080. It works, but (a) it silently downgrades every user to TCP-only sub-level data forever rather than only on machines where UDP genuinely can't be set up, (b) the flag only short-circuits the server-sideServerConnectionListenerMixin—ConnectionMixin.sable$connectignored it, so users following the doc were still hitting the bootstrap on the client, and (c) it doesn't address the root cause (a blocking call on the login thread).Move the UDP bootstrap off the login thread entirely (run the
Bootstrap.connect(...)on the netty event loop andsetUDPChannelfrom a listener). Cleaner long-term shape -- UDP bring-up becomes purely async, no timing coupling with TCP login at all. I didn't take this because it changes the lifecycle assumptions made byClientboundSableUDPActivationPacket.handle(which today readssable$getUDPChannel()synchronously, expecting connect() to be done by the time the server's activation packet arrives) and by anyone reading the field elsewhere. That's a bigger change with a wider blast radius for a backport-friendly fix, so I deferred it. Happy to do it as a follow-up if the maintainers prefer.Bounded
awaitwith a timeout, fall back to TCP-only on failure. What this PR does. The synchronous shape is preserved (so existing assumptions aboutsable$getUDPChannel()afterConnection.connectreturns still hold for the success path), but a slow or stuckconnect()no longer holds the login thread hostage. The timeout (5 s) is generous enough that healthy networks always succeed (success path observed at <1 ms on every machine I tested) and short enough that the server-side login window (which is also ~20-30 s before the read-timeout fires) is preserved.Make the timeout configurable. Considered; rejected for now. A config knob is overkill when the practical range is "<1 s succeeds, otherwise something is broken and we want to fail fast". If a future deployment surfaces a legitimate need for a longer wait, promoting
SABLE$UDP_CONNECT_TIMEOUT_MStoSableConfigis one line.What this PR does
ConnectionMixin.sable$connect: replacesyncUninterruptibly()withawaitUninterruptibly(SABLE$UDP_CONNECT_TIMEOUT_MS, MILLISECONDS)(5 s). On timeout: cancel the future, install a listener that logs the late outcome (and closes the channel if it eventually succeeds after we already gave up), return — letting TCP login proceed. On!isSuccess(): log the cause, return. Wrap the bootstrap construction itself in try/catch so an unchecked exception during bootstrap can't break the connect call either.ConnectionMixin.sable$connectToLocalServer: same treatment forLocalChannel. Less critical (local channels rarely stall) but the inconsistency was hard to justify keeping.SableConfig.DISABLE_UDP_PIPELINEon the client: short-circuit both connect injects when the flag is set. Previously onlyServerConnectionListenerMixinchecked it, which made the documented workaround a half-measure.ClientboundSableUDPActivationPacket.handle: null-checkconnectionExtension.sable$getUDPChannel()before dereferencing — fixes the NPE in Client-side NPE in ClientboundSableUDPActivationPacket when UDP channel is null (tunnel/proxy environments) #1080 (Cannot invoke 'io.netty.channel.Channel.eventLoop()' because 'channel' is null). When the channel is null, log a WARN and return; the client stays on TCP-only and the server's auth state for that connection just stays inAWAITING_AUTHuntil the connection closes. Also pattern-match theconnection.getRemoteAddress()cast instead of an unchecked one.Starting remote client UDP channel futureline now carries the remote address and transport class so logs from affected users are immediately actionable. All failure paths log at WARN with the cause. Routine breadcrumbs (config-disabled short-circuit, late-success-after-cancel, success path with timing) are demoted to DEBUG so unaffected users don't see any new chatter.Implications
Success path
Byte-for-byte identical end-state -- the channel is set on the
ConnectionExtension,SableUDPChannelHandlerClient.channelActivefires,ClientboundSableUDPActivationPacket.handlesees the channel and dispatches the auth response, server transitionsSableUDPAuthenticationStatetoAUTHENTICATED. No timing changes when UDP comes up healthily (success observed at <1 ms across machines tested).Failure / timeout path (where this PR is doing the actual work)
The interesting question is "if UDP setup fails, what's broken for the player?" I traced the gameplay-affecting UDP call sites to answer that:
sendUDPPackethas two production call sites in the codebase:SubLevelTrackingSystem.java:367-397(sub-level snapshot replication) andSableCommand.java:58(a debug echo command). The tracking system already has an explicit, designed-in TCP fallback branch keyed onisConnectedTo(player)— sub-level snapshots are sent as aClientboundBundlePacket(ClientboundSableSnapshotInfoDualPacket, ClientboundSableSnapshotDualPacket)over TCP when UDP isn't authenticated. The packets are Dual (SableUDPPacket, SableTCPPacket) and the client decodes them identically off either transport; the only branch inClientSableInterpolationState.receiveSnapshoton the receive mode is a debug-overlay string ("Networking through UDP/TCP"). The interpolation math, the snapshot buffer, and the rendered sub-level position are all identical.isConnectedTo(player)returnsfalsecleanly in our failure path (udpAuthStatesentry stays inAWAITING_AUTHbecause the client never replies withSableUDPAuthenticationPacket).SableUDPServer.sendPings()skips players that aren'tAUTHENTICATED, so a fallback-to-TCP player generates no UDP keepalive traffic and is never bounced by the missed-pings logic.So the practical effect of our timeout firing is: the player joins, sub-levels work normally, the only observable difference is the F3 debug overlay reports "Networking through TCP" (or "UNKNOWN" until the first snapshot arrives). The one real degradation is that under a lossy network, sub-level snapshots over TCP can show jitter from head-of-line blocking that UDP wouldn't have — but the failure mode that triggers this PR is at the OS socket layer, not on-wire packet loss, so affected users typically have clean networks and see no visible difference.
Other behavioral notes
disable_udp_pipelinenow means what it says on the client — no UDP bootstrap, no UDP channel, no log lines. Previously the flag only short-circuited the server bootstrap, so a user setting it client-side (as Client-side NPE in ClientboundSableUDPActivationPacket when UDP channel is null (tunnel/proxy environments) #1080 recommends as a workaround) was still hitting the buggy bootstrap. This is a small behavior change for anyone relying on the side-effect that the client kept opening a UDP channel; I don't think such users exist, but worth calling out.udpAuthStatesmap keeps anAWAITING_AUTHentry for the connection whose client never replied. The map is aWeakHashMapkeyed onConnectionand entries are also pruned inSableUDPServer.sendPings()when!connection.isConnected(), so it cleans up on disconnect. No leak; called out for completeness.@Uniquefield added:SABLE$UDP_CONNECT_TIMEOUT_MS(mixin-safe naming).AI assistance
This change was developed with the assistance of an AI coding assistant (Claude). I read every line, built the jar locally against
mc1.21.1-1.2.2-neoforge, dropped it into the affected modpack instance, reproduced the original join failure on the affected machine, watched it succeed with the fix, and read the resulting logs to confirm the WARN/DEBUG split is what I want a future bug report to contain. I'm submitting this as my own contribution under the CONTRIBUTING.md terms.Related issues
ClientboundSableUDPActivationPacket(fixed directly by the channel null guard)connect())