Skip to content

USHIFT-6401: Patch unbounded KAS context to break pre-hook deadlock#6635

Open
copejon wants to merge 6 commits into
openshift:mainfrom
copejon:fix-USHIFT-6401-alt-fix
Open

USHIFT-6401: Patch unbounded KAS context to break pre-hook deadlock#6635
copejon wants to merge 6 commits into
openshift:mainfrom
copejon:fix-USHIFT-6401-alt-fix

Conversation

@copejon
Copy link
Copy Markdown
Contributor

@copejon copejon commented May 7, 2026

Replace context.TODO() with the hook's cancelable context in the RBAC bootstrap post-start hook helpers (primeAggregatedClusterRoles, primeSplitClusterRoleBindings)

Summary by CodeRabbit

  • Chores

    • RBAC bootstrap/reconciliation now uses per-attempt, timeout-bound operation contexts for readiness checks and role/role-binding priming, improving boundedness of initialization attempts.
    • No user-facing behavior changes; reconciliation semantics remain unchanged.
  • Tests

    • RBAC bootstrap benchmark updated to propagate a context and to fail on unexpected errors.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label May 7, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented May 7, 2026

@copejon: This pull request references USHIFT-6401 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the bug to target either version "5.0." or "openshift-5.0.", but it targets "openshift-4.22" instead.

Details

In response to this:

Replace context.TODO() with the hook's cancelable context in the RBAC bootstrap post-start hook helpers (primeAggregatedClusterRoles, primeSplitClusterRoleBindings)

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 7, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 7, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 7, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

RBAC bootstrap context threading propagates a per-attempt 15s timeout context through the post-start hook into ensureRBACPolicy and priming helpers; all ClusterRole and ClusterRoleBinding List/Get/Create calls now use the provided context instead of context.TODO().

Changes

RBAC bootstrap context threading

Layer / File(s) Summary
Entry point: per-attempt timeout context
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go, scripts/auto-rebase/rebase_patches/0040-rbac-bootstrap-hook-context-threading.patch
The post-start hook derives a 15s timeout context per wait.Poll attempt and passes it into ensureRBACPolicy.
Core: ensureRBACPolicy context usage
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go, scripts/auto-rebase/rebase_patches/0040-rbac-bootstrap-hook-context-threading.patch
ensureRBACPolicy(ctx context.Context) now uses ctx for initial ClusterRole and ClusterRoleBinding List calls and forwards ctx to priming helpers.
Priming: aggregated ClusterRoles use ctx
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go, scripts/auto-rebase/rebase_patches/0040-rbac-bootstrap-hook-context-threading.patch
primeAggregatedClusterRoles(ctx) accepts ctx and uses it for ClusterRoles().Get and Create calls (target and source roles).
Priming: split ClusterRoleBindings use ctx
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go, scripts/auto-rebase/rebase_patches/0040-rbac-bootstrap-hook-context-threading.patch
primeSplitClusterRoleBindings(ctx) accepts ctx and uses it for ClusterRoleBindings().Get existence checks and Create calls.
Benchmarks/tests updated to pass context
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go
Benchmark now imports context and calls ensureRBACPolicy(context.Background(), ...) and fails the benchmark on error.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

ready-for-human-review

🚥 Pre-merge checks | ✅ 14 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (14 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: adding timeout-bound context propagation to RBAC bootstrap to prevent pre-hook deadlocks on resource-constrained devices.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed PR contains no Ginkgo tests. The only test file modified (storage_rbac_test.go) has a static-named benchmark: BenchmarkEnsureRBACPolicy. No dynamic values detected.
Test Structure And Quality ✅ Passed The PR modifies only a benchmark test (BenchmarkEnsureRBACPolicy), not Ginkgo test code. The custom check requires reviewing Ginkgo test patterns, which are not present in this PR's changes.
Microshift Test Compatibility ✅ Passed PR does not add new Ginkgo e2e tests. Only a standard Go benchmark test is modified, making the MicroShift test compatibility check inapplicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests are added in this PR. Changes only include RBAC bootstrap code and a Go benchmark test (not Ginkgo), so SNO compatibility check is not applicable.
Topology-Aware Scheduling Compatibility ✅ Passed PR modifies Kubernetes API server RBAC bootstrap code only—no deployment manifests, workload scheduling, affinity, topology constraints, or pod configuration introduced.
Ote Binary Stdout Contract ✅ Passed PR introduces only context threading through helper functions with no new stdout writes. No fmt.Print*, klog.Print*, or process-level stdout violations found.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed No new Ginkgo e2e tests added in this PR. Changes are to RBAC bootstrap code and a unit benchmark test, not e2e tests requiring IPv6/disconnected network compatibility checks.
No-Weak-Crypto ✅ Passed PR contains no cryptographic code: changes are limited to context timeout threading in RBAC bootstrap initialization, replacing context.TODO() with context.WithTimeout().
Container-Privileges ✅ Passed PR contains only Go code changes for RBAC context threading; no privileged container settings, security contexts, or manifests were modified.
No-Sensitive-Data-In-Logs ✅ Passed PR contains only context threading refactoring with no new logging. Existing logs expose only role/binding names and RBAC permission metadata—no sensitive data.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 7, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go (1)

166-173: ⚡ Quick win

wait.Poll loop is not context-aware — cancellation won't short-circuit it.

The hook context is now correctly threaded into the inner function, so individual API calls will fail fast when the context is cancelled. However, wait.Poll itself has no awareness of the context; if the context is cancelled mid-poll-interval, the loop continues blocking for up to 30 more seconds before the next iteration observes the error. Replacing it with wait.PollWithContext (or wait.PollUntilContextTimeout) fully honors the shutdown signal.

♻️ Proposed refactor
-		err := wait.Poll(1*time.Second, 30*time.Second, func() (done bool, err error) {
+		err := wait.PollUntilContextTimeout(hookContext.Context, 1*time.Second, 30*time.Second, true, func(ctx context.Context) (done bool, err error) {
 			client, err := clientset.NewForConfig(hookContext.LoopbackClientConfig)
 			if err != nil {
 				utilruntime.HandleError(fmt.Errorf("unable to initialize client set: %v", err))
 				return false, nil
 			}
-			return ensureRBACPolicy(hookContext, p, client)
+			return ensureRBACPolicy(ctx, p, client)
 		})

Note: adjust hookContext.Context to hookContext if PostStartHookContext embeds context.Context.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go`
around lines 166 - 173, The wait.Poll call in the RBAC setup loop is not
context-aware and can block after cancellation; replace the wait.Poll invocation
in storage_rbac.go with a context-aware variant (e.g., wait.PollWithContext or
wait.PollUntilContextTimeout) so the loop short-circuits on hookContext
cancellation; pass the hookContext (or hookContext.Context if
PostStartHookContext embeds context.Context) as the context argument and keep
the same polling interval and timeout while preserving the existing
ensureRBACPolicy(hookContext, p, client) call and error handling.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go`:
- Around line 166-173: The wait.Poll call in the RBAC setup loop is not
context-aware and can block after cancellation; replace the wait.Poll invocation
in storage_rbac.go with a context-aware variant (e.g., wait.PollWithContext or
wait.PollUntilContextTimeout) so the loop short-circuits on hookContext
cancellation; pass the hookContext (or hookContext.Context if
PostStartHookContext embeds context.Context) as the context argument and keep
the same polling interval and timeout while preserving the existing
ensureRBACPolicy(hookContext, p, client) call and error handling.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 9f88f2a4-91ab-409f-bd6c-6d75d87351ef

📥 Commits

Reviewing files that changed from the base of the PR and between e98bbde and d255cf8.

⛔ Files ignored due to path filters (1)
  • vendor/k8s.io/kubernetes/pkg/registry/rbac/rest/storage_rbac.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (2)
  • deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go
  • deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go

@copejon copejon force-pushed the fix-USHIFT-6401-alt-fix branch from d255cf8 to b220d71 Compare May 19, 2026 22:04
@copejon copejon marked this pull request as ready for review May 19, 2026 22:06
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 19, 2026
@openshift-ci openshift-ci Bot requested review from pacevedom and vanhalenar May 19, 2026 22:06
…ootstrap

Thread context.Context through ensureRBACPolicy, primeAggregatedClusterRoles,
and primeSplitClusterRoleBindings so that RBAC bootstrap API calls respect the
post-start hook's cancellation signal instead of hanging indefinitely on
context.TODO().

Includes carry patch (0040) so the fix survives future rebases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@copejon copejon force-pushed the fix-USHIFT-6401-alt-fix branch from b220d71 to 1d08a19 Compare May 19, 2026 22:47
Comment thread deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go Outdated
hookContext only cancels on server shutdown, not on deadline — so
passing it bare does not break the deadlock. Wrap it with
context.WithTimeout(15s) so each poll iteration's API calls are
individually bounded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai Bot added the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label May 27, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go`:
- Line 38: The benchmark currently ignores the error return from
ensureRBACPolicy; change the call to capture the error and fail the benchmark on
error (e.g., call b.Fatalf or b.Fatal with an explanatory message and the error)
so that ensureRBACPolicy(policy, coreClientSet, context.Background()) failures
are reported instead of being swallowed; reference ensureRBACPolicy, policy and
coreClientSet to locate the call and replace the ignored error with an if err !=
nil check that fails the test/benchmark.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 3303d4e4-2601-44ac-987c-eafc2eb27940

📥 Commits

Reviewing files that changed from the base of the PR and between 1d08a19 and 9efba0f.

⛔ Files ignored due to path filters (1)
  • vendor/k8s.io/kubernetes/pkg/registry/rbac/rest/storage_rbac.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (3)
  • deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go
  • deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go
  • scripts/auto-rebase/rebase_patches/0040-rbac-bootstrap-hook-context-threading.patch
🚧 Files skipped from review as they are similar to previous changes (1)
  • deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac.go

Comment thread deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go Outdated
@copejon
Copy link
Copy Markdown
Contributor Author

copejon commented May 27, 2026

/test test-rebase

Stop silently discarding the error from ensureRBACPolicy so failures
surface instead of being swallowed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai Bot removed the ready-for-human-review Indicates a PR has been reviewed by automated tools and is ready for human review label May 27, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (1)
deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go (1)

38-40: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Also check the done return value.

The benchmark currently discards the done return value from ensureRBACPolicy. If done == false (but err == nil), the RBAC policy setup did not complete, which would invalidate the benchmark results.

Proposed fix
-		if _, err := ensureRBACPolicy(context.Background(), policy, coreClientSet); err != nil {
+		done, err := ensureRBACPolicy(context.Background(), policy, coreClientSet)
+		if err != nil {
 			b.Fatalf("ensureRBACPolicy failed: %v", err)
 		}
+		if !done {
+			b.Fatalf("ensureRBACPolicy did not complete")
+		}

As per coding guidelines, Go security (prodsec-skills): Never ignore error returns.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In
`@deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go`
around lines 38 - 40, The benchmark is ignoring the boolean "done" returned by
ensureRBACPolicy; update the call in the test to capture both returns (done, err
:= ensureRBACPolicy(...)) and if err != nil or done == false, call b.Fatalf with
an explanatory message (e.g., "ensureRBACPolicy failed or did not complete:
done=%v err=%v") so a nil error but incomplete setup is treated as a failure;
locate the call to ensureRBACPolicy in storage_rbac_test.go and modify the
b.Fatalf checks accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In
`@deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go`:
- Around line 38-40: The benchmark is ignoring the boolean "done" returned by
ensureRBACPolicy; update the call in the test to capture both returns (done, err
:= ensureRBACPolicy(...)) and if err != nil or done == false, call b.Fatalf with
an explanatory message (e.g., "ensureRBACPolicy failed or did not complete:
done=%v err=%v") so a nil error but incomplete setup is treated as a failure;
locate the call to ensureRBACPolicy in storage_rbac_test.go and modify the
b.Fatalf checks accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 32dc0bb5-f181-4e30-b94d-68d80d05caff

📥 Commits

Reviewing files that changed from the base of the PR and between 9efba0f and c25ec4d.

⛔ Files ignored due to path filters (1)
  • vendor/k8s.io/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go is excluded by !**/vendor/**, !vendor/**
📒 Files selected for processing (1)
  • deps/github.com/openshift/kubernetes/pkg/registry/rbac/rest/storage_rbac_test.go

copejon and others added 2 commits May 27, 2026 17:58
go mod vendor never includes _test.go files — the test belongs in
deps/ only.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The patch had wrong line offsets in 8 hunk headers — git apply --check
reported "corrupt patch at line 18". Regenerated from git diff.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 28, 2026

@copejon: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@copejon
Copy link
Copy Markdown
Contributor Author

copejon commented May 29, 2026

/lgtm

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 29, 2026

@copejon: you cannot LGTM your own PR.

Details

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 29, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: copejon

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Reduce per-attempt timeout from 15s to 12s so two full attempts fit
comfortably within the 30s outer poll. Add a klog.Warningf when the
context deadline fires during etcd readiness checks to distinguish
timeouts from other errors.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants