fix(lockservice): fence stale binds by allocator epoch#24906
fix(lockservice): fence stale binds by allocator epoch#24906iamlinjunhong wants to merge 6 commits into
Conversation
Qodo reviews are paused for this user.Troubleshooting steps vary by plan Learn more → On a Teams plan? Using GitHub Enterprise Server, GitLab Self-Managed, or Bitbucket Data Center? |
aptend
left a comment
There was a problem hiding this comment.
Review Result
Approved. I did not find any blocking issues in this PR.
What I Checked
GetBindandKeepLockTableBindpropagate the allocator epoch to CN.- CN stale-bind purge follows the existing
LockTable.Versionsemantics: cached binds withVersion < observed allocator epochare stale allocator state. - main-branch bind-change handling was reviewed against the epoch purge path, including waiter cleanup and commit-side bind validation.
- The added tests cover get-bind refresh, keepalive failure/epoch refresh, concurrent observation, remote/proxy bind purge, and waiter cleanup.
Local Verification
git diff --check origin/main...HEAD
go test ./pkg/lockservice -run 'Test(CommitDetectsStaleLocalBindAfterAllocatorRestart|AllocatorVersionZeroKeepsLocalBinds|AllocatorObserverDoesNotPurgeSameEpochBindVersions|GetBindAllowsRegressedAllocatorVersionWithoutPurging|KeepaliveEpochPurgeKeepsGroupMovePop|AllocatorObserverPurgesRemoteAndProxyLockTables|AllocatorObserverConcurrentKeepaliveAndGetBind|AllocatorObserverCloseWaitersOnStaleLocalBind)$'
timeout 180s go test ./pkg/lockserviceNote: the full go test ./pkg/lockservice hit the 180s local timeout without output. The PR-specific test set passed locally, and CI is green.
XuPeng-SH
left a comment
There was a problem hiding this comment.
I re-checked the latest head. I do not see a new high-confidence correctness bug to block on, and the unit coverage is already much stronger, but with a lockservice fix this concurrency-heavy I still think a few unhappy-path tests are missing before I am comfortable approving.\n\nThe main gaps I would still ask to close are:\n\n1. Repeated allocator restarts / superseded-ID accumulation\n The new fencing now relies on tracking superseded allocator identities, but I did not see a focused test that drives 3+ sequential allocator replacements and proves both correctness and bounded behavior over repeated generations.\n\n2. Client-side cancel before the remote lock request is actually sent\n The PR now keeps more remote-lock bookkeeping alive for later cleanup, and there is timeout coverage, but I did not find the specific unhappy path where the caller is canceled before the remote side ever received the lock request. I would like a test that proves the later unlock/close path handles that harmlessly.\n\n3. The narrow race between getLocalLockTable() and taking bindChangeMu.RLock()\n The latest code intentionally adds a second stale-bind recheck around that window. That is exactly the kind of defensive race fix that should be pinned by a test, and I did not find one that explicitly exercises a bind change in that gap.\n\nIf those concurrency-edge cases are covered, I am happy to recheck the latest head.
What type of PR is this?
Which issue(s) this PR fixes:
issue #24896
What this PR does / why we need it:
fix(lockservice): fence stale binds by allocator epoch