[release-4.22] OCPBUGS-94186: Add retry for transient API errors to prevent Degraded blips #1181
Wrap all API write operations (Apply, Create, Update, Delete) with retry logic to absorb transient errors (conflicts, timeouts, connection refused) during upgrades before they reach status reporting. Without retry, a single transient failure sets Degraded=True for ~1 minute until the next sync resolves it, causing CI test failures.

Retry parameters: 3 attempts, 500ms backoff with 2x factor (~3.5s max). Permanent errors (Forbidden, Invalid, AlreadyExists) are not retried.

Assisted by Claude Code (Opus 4.6)
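The retry behaviour described in this commit message can be sketched roughly as follows. This is a minimal, standard-library-only illustration, not the operator's code: `errConflict`, `errForbidden`, `isRetryable`, `backoff`, and `retryOnTransient` are hypothetical names, and the real change presumably classifies Kubernetes API errors rather than sentinel values.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors standing in for the error classes
// named in the commit message.
var (
	errConflict  = errors.New("conflict")  // transient: retried
	errForbidden = errors.New("forbidden") // permanent: returned immediately
)

// isRetryable is an illustrative classifier (assumption, not the PR's helper).
func isRetryable(err error) bool {
	return errors.Is(err, errConflict)
}

// backoff carries the parameters quoted above: attempts, first delay, factor.
type backoff struct {
	Attempts int
	Delay    time.Duration
	Factor   float64
}

// retryOnTransient calls fn up to b.Attempts times. It sleeps between
// attempts and multiplies the delay by b.Factor after each sleep.
// Success and non-retryable errors are returned right away.
func retryOnTransient(b backoff, fn func() error) error {
	delay := b.Delay
	var err error
	for attempt := 1; attempt <= b.Attempts; attempt++ {
		err = fn()
		if err == nil || !isRetryable(err) {
			return err
		}
		if attempt < b.Attempts {
			time.Sleep(delay)
			delay = time.Duration(float64(delay) * b.Factor)
		}
	}
	return err // last transient error after all attempts are used
}

func main() {
	calls := 0
	b := backoff{Attempts: 3, Delay: 500 * time.Millisecond, Factor: 2}
	err := retryOnTransient(b, func() error {
		calls++
		if calls < 3 {
			return errConflict // first two attempts fail transiently
		}
		return nil
	})
	fmt.Println(calls, err)
}
```

With these parameters the sketch sleeps 500ms and then 1s between its three attempts; a permanent error such as `errForbidden` ends the loop on the first call.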
…de the retry closure before calling Update/UpdateStatus. Each retry attempt now uses the current resourceVersion instead of the original stale copy.

- SyncConsoleConfig: re-fetches consoleConfig before UpdateStatus
- SyncServiceCAConfigMap: re-fetches configmap and re-applies metadata before Update
- SyncTrustedCAConfigMap: same pattern
- ApplyCLIDownloads: re-fetches CLI downloads resource, re-applies spec and metadata before Update

Assisted by: Claude (Opus 4.6)
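Why the re-fetch has to happen inside the retry closure can be shown with a small in-memory model. This is a sketch under stated assumptions: `object`, `store`, `errConflict`, and `updateWithRefetch` are hypothetical stand-ins for a resource, the API server's optimistic-concurrency check, and the retry wrapper; none of them come from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict models the API server rejecting a stale resourceVersion.
var errConflict = errors.New("conflict: stale resourceVersion")

// object and store are hypothetical stand-ins for a resource and the API server.
type object struct {
	ResourceVersion int
	Labels          map[string]string
}

type store struct{ current object }

// Get returns a copy, so callers never mutate the stored object directly.
func (s *store) Get() object {
	labels := make(map[string]string, len(s.current.Labels))
	for k, v := range s.current.Labels {
		labels[k] = v
	}
	return object{ResourceVersion: s.current.ResourceVersion, Labels: labels}
}

// Update succeeds only if the caller's copy carries the current version.
func (s *store) Update(o object) error {
	if o.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	s.current = o
	return nil
}

// updateWithRefetch re-reads the object at the start of every attempt and
// re-applies the desired change, so a retry never resubmits a stale copy.
func updateWithRefetch(s *store, attempts int, mutate func(*object)) error {
	var err error
	for i := 0; i < attempts; i++ {
		obj := s.Get() // fresh copy with the current resourceVersion
		mutate(&obj)   // re-apply desired metadata/spec on the fresh copy
		if err = s.Update(obj); !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

func main() {
	s := &store{current: object{ResourceVersion: 1, Labels: map[string]string{}}}
	raced := false
	err := updateWithRefetch(s, 3, func(o *object) {
		if !raced {
			raced = true
			_ = s.Update(s.Get()) // simulated concurrent writer bumps the version
		}
		o.Labels["managed"] = "true"
	})
	fmt.Println(err, s.current.ResourceVersion, s.current.Labels["managed"])
}
```

If `Get` were called once outside the loop, every attempt would resend the same outdated version and fail with the same conflict, which is the situation this commit describes fixing.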
- Guard deleteStorageVersionMigration against NotFound errors
- Remove one-line retryOnTransientError wrapper, call util directly
- Remove misleading nil-error test case from IsRetryableError

Assisted-by: Claude (Opus 4.6)
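The NotFound guard mentioned in the first item follows a common pattern: a delete that finds nothing to delete is treated as success. A minimal sketch, assuming hypothetical names (`errNotFound`, `errForbidden`, `deleteIgnoringNotFound`); code built on Kubernetes client libraries would typically test the error with `apierrors.IsNotFound` from `k8s.io/apimachinery/pkg/api/errors` instead of a sentinel value.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors standing in for API error classes.
var (
	errNotFound  = errors.New("not found")
	errForbidden = errors.New("forbidden")
)

// deleteIgnoringNotFound calls del and treats "already gone" as success,
// so a repeated or retried delete does not surface as a failure.
func deleteIgnoringNotFound(del func(name string) error, name string) error {
	if err := del(name); err != nil && !errors.Is(err, errNotFound) {
		return fmt.Errorf("deleting %q: %w", name, err)
	}
	return nil
}

func main() {
	// In-memory stand-in for the cluster; the resource name is made up.
	existing := map[string]bool{"example-migration": true}
	del := func(name string) error {
		if !existing[name] {
			return errNotFound
		}
		delete(existing, name)
		return nil
	}

	fmt.Println(deleteIgnoringNotFound(del, "example-migration")) // object removed
	fmt.Println(deleteIgnoringNotFound(del, "example-migration")) // already gone, still nil
}
```

The guard matters in combination with retry: if a first delete attempt succeeds on the server but the response is lost, the retried call sees NotFound, which should not be reported as an error.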
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-94186, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. DetailsIn response to this:
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-38676, which is invalid:
Comment DetailsIn response to this:
|
/test e2e-aws-console |
|
/jira refresh |
|
/jira refresh |
|
@sg00dwin: This pull request references Jira Issue OCPBUGS-94186, which is valid. The bug has been moved to the POST state. 7 validation(s) were run on this bug
DetailsIn response to this:
|
/test e2e-aws-console |
|
/test e2e-aws-console |
|
/test e2e-aws-console |
|
/label backport-risk-assessed |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: jhadvig, sg00dwin The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@jhadvig: This PR has been marked as verified by DetailsIn response to this:
|
/test e2e-aws-console |
|
/test e2e-aws-console |
|
@sg00dwin: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
Summary
Manual cherry-pick of #1164 to release-4.22.
Automated cherry-pick failed due to a merge conflict in pkg/console/operator/sync_v400.go (the ApplyConfigMap call site). Resolved manually; took the incoming retry wrapper.
Original PR
#1164
Test plan
sync_v400.go where ApplyConfigMap needed retry wrapper