USHIFT-6401: Patch unbounded KAS context to break pre-hook deadlock#6635
copejon wants to merge 6 commits into
|
@copejon: This pull request references USHIFT-6401 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead. |
|
Skipping CI for Draft Pull Request. |
|
Walkthrough
RBAC bootstrap context threading propagates a per-attempt 15s timeout context through the post-start hook into ensureRBACPolicy and the priming helpers; all ClusterRole and ClusterRoleBinding List/Get/Create calls now use the provided context instead of context.TODO().
Changes
RBAC bootstrap context threading
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
🚥 Pre-merge checks | ✅ 14 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (14 passed)
🧹 Nitpick comments (1)
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go (1)
166-173: ⚡ Quick win
wait.Poll loop is not context-aware: cancellation won't short-circuit it.
The hook context is now correctly threaded into the inner function, so individual API calls will fail fast when the context is cancelled. However, wait.Poll itself has no awareness of the context; if the context is cancelled mid-poll-interval, the loop continues blocking for up to 30 more seconds before the next iteration observes the error. Replacing it with wait.PollWithContext (or wait.PollUntilContextTimeout) fully honors the shutdown signal.
♻️ Proposed refactor
-	err := wait.Poll(1*time.Second, 30*time.Second, func() (done bool, err error) {
+	err := wait.PollUntilContextTimeout(hookContext.Context, 1*time.Second, 30*time.Second, true, func(ctx context.Context) (done bool, err error) {
 		client, err := clientset.NewForConfig(hookContext.LoopbackClientConfig)
 		if err != nil {
 			utilruntime.HandleError(fmt.Errorf("unable to initialize client set: %v", err))
 			return false, nil
 		}
-		return ensureRBACPolicy(hookContext, p, client)
+		return ensureRBACPolicy(ctx, p, client)
 	})
Note: adjust hookContext.Context to hookContext if PostStartHookContext embeds context.Context.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go` around lines 166 - 173, The wait.Poll call in the RBAC setup loop is not context-aware and can block after cancellation; replace the wait.Poll invocation in storage_rbac.go with a context-aware variant (e.g., wait.PollWithContext or wait.PollUntilContextTimeout) so the loop short-circuits on hookContext cancellation; pass the hookContext (or hookContext.Context if PostStartHookContext embeds context.Context) as the context argument and keep the same polling interval and timeout while preserving the existing ensureRBACPolicy(hookContext, p, client) call and error handling.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 9f88f2a4-91ab-409f-bd6c-6d75d87351ef
⛔ Files ignored due to path filters (1)
- vendor/k8s.io/kubernetes/pkg/registry/rbac/rest/storage_rbac.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (2)
- deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go
- deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go
d255cf8 to b220d71
…ootstrap
Thread context.Context through ensureRBACPolicy, primeAggregatedClusterRoles, and primeSplitClusterRoleBindings so that RBAC bootstrap API calls respect the post-start hook's cancellation signal instead of hanging indefinitely on context.TODO().
Includes carry patch (0040) so the fix survives future rebases.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
b220d71 to 1d08a19
hookContext only cancels on server shutdown, not on deadline, so passing it bare does not break the deadlock. Wrap it with context.WithTimeout(15s) so each poll iteration's API calls are individually bounded.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go`:
- Line 38: The benchmark currently ignores the error return from
ensureRBACPolicy; change the call to capture the error and fail the benchmark on
error (e.g., call b.Fatalf or b.Fatal with an explanatory message and the error)
so that ensureRBACPolicy(policy, coreClientSet, context.Background()) failures
are reported instead of being swallowed; reference ensureRBACPolicy, policy and
coreClientSet to locate the call and replace the ignored error with an if err !=
nil check that fails the test/benchmark.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 3303d4e4-2601-44ac-987c-eafc2eb27940
⛔ Files ignored due to path filters (1)
- vendor/k8s.io/kubernetes/pkg/registry/rbac/rest/storage_rbac.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (3)
- deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go
- deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go
- scripts/auto-rebase/rebase_patches/0040-rbac-bootstrap-hook-context-threading.patch
🚧 Files skipped from review as they are similar to previous changes (1)
- deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go
|
/test test-rebase |
Stop silently discarding the error from ensureRBACPolicy so failures surface instead of being swallowed.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
♻️ Duplicate comments (1)
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go (1)
38-40: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Also check the done return value.
The benchmark currently discards the done return value from ensureRBACPolicy. If done == false (but err == nil), the RBAC policy setup did not complete, which would invalidate the benchmark results.
Proposed fix
-	if _, err := ensureRBACPolicy(context.Background(), policy, coreClientSet); err != nil {
+	done, err := ensureRBACPolicy(context.Background(), policy, coreClientSet)
+	if err != nil {
 		b.Fatalf("ensureRBACPolicy failed: %v", err)
 	}
+	if !done {
+		b.Fatalf("ensureRBACPolicy did not complete")
+	}
As per coding guidelines, Go security (prodsec-skills): Never ignore error returns.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go` around lines 38 - 40, The benchmark is ignoring the boolean "done" returned by ensureRBACPolicy; update the call in the test to capture both returns (done, err := ensureRBACPolicy(...)) and if err != nil or done == false, call b.Fatalf with an explanatory message (e.g., "ensureRBACPolicy failed or did not complete: done=%v err=%v") so a nil error but incomplete setup is treated as a failure; locate the call to ensureRBACPolicy in storage_rbac_test.go and modify the b.Fatalf checks accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: 32dc0bb5-f181-4e30-b94d-68d80d05caff
⛔ Files ignored due to path filters (1)
- vendor/k8s.io/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (1)
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go
go mod vendor never includes _test.go files, so the test belongs in deps/ only.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The patch had wrong line offsets in 8 hunk headers; git apply --check reported "corrupt patch at line 18". Regenerated from git diff.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
@copejon: all tests passed! |
|
/lgtm |
|
@copejon: you cannot LGTM your own PR. |
|
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: copejon |
Reduce per-attempt timeout from 15s to 12s so two full attempts fit comfortably within the 30s outer poll. Add a klog.Warningf when the context deadline fires during etcd readiness checks to distinguish timeouts from other errors.
Co-authored-by: Cursor <cursoragent@cursor.com>
Replace context.TODO() with the hook's cancelable context in the RBAC bootstrap post-start hook helpers (primeAggregatedClusterRoles, primeSplitClusterRoleBindings)
Summary by CodeRabbit
Chores
Tests