ci: dedupe quality gates and add fast-fail pre-check (AE-3161)#345
ci: dedupe quality gates and add fast-fail pre-check (AE-3161)#345deanq wants to merge 1 commit into
Conversation
Eliminates redundant CI/CD work that was costing ~24 min of compute per merge and ~28 min per release-please bot PR push. Workflow changes: - ci.yml: add concurrency group (cancel in-flight PR runs), paths-ignore for release-please bot PRs, new pre-check job (uvx ruff format+check, ~10s no-install fast-fail), bump setup-uv v2 -> v5, build job only on push:main since release-please.yml builds + publishes the same SHA. - release-please.yml: remove duplicate quality-gates job. Branch protection on main already enforces ci.yml green before code lands, so the second matrix validated nothing. Saves a full 4-leg matrix per merge. - e2e.yml: drop unit-tests job (duplicated ci.yml quality-gates), fold the summary step into the e2e job to remove the third workflow definition of the same install setup. Makefile changes: - dev: drop redundant 'uv pip install -e .' (uv sync handles editable install in workspace mode). - quality-check: alias ci-quality-github so there is one canonical CI quality gate, no drift between local and CI. - ci-quality-github / test-coverage: fix the latent junit overwrite bug where both pytest invocations wrote pytest-results.xml -- now writes pytest-results-parallel.xml and pytest-results-serial.xml. The double pytest pass is retained because it provides process isolation between parallel and serial tests; collapsing to a single loadgroup run surfaced state pollution in test_load_balancer_sls_stub. Note: removing quality-gates from release-please.yml + paths-ignore for the bot's files means the release-please PR will have no required CI check. Branch protection on main will need to allow missing checks for the release-please--* branch pattern, or add a no-op sentinel job named to match the protection rule. https://linear.app/runpod/issue/AE-3161
runpod-Henrik
left a comment
There was a problem hiding this comment.
Henrik's AI-Powered Bug Finder — PR #345 Review
Verdict: PASS WITH NITS (with one repo-admin action required before merge)
CI-only refactor. Local SDK behavior is unchanged. The optimization claims hold up (75% compute reduction for the typical PR-fixup loop; full matrix dedupe on merge-to-main). CI on this PR proves the new pipeline works: pre-check 10s, all four quality-gates legs 4-4:30 min, build skipped per the new if: guard.
The PR description already flags the branch-protection compatibility issue, which is the real release-blocker.
1. Issue: branch-protection compatibility on release-please bot PRs
The PR description calls this out explicitly. Recapping the QA risk in user terms:
- After this PR, a release-please bot PR that touches only
CHANGELOG.md+.release-please-manifest.jsoncausesci.ymlto be skipped viapaths-ignore— workflow doesn't run at all. GitHub branch protection treatspaths-ignoreskips as "check satisfied," so the bot PR can merge cleanly. - But if branch protection on
mainrequires a check namedCI / Quality Gates(4 statuses, one per matrix leg), and the workflow never runs, those status checks are simply absent. Whether branch protection blocks "missing required check" or accepts "not run because path-ignored" depends on a setting on the rule ("Require status checks to pass before merging" with vs. without "Strict — Require branches to be up to date"). Repo admin needs to verify this onmainbefore merging.
If the rule is strict and missing checks block, the bot PR will be permanently un-mergeable.
A safer path is the PR description's option 2: add a no-op success job named to match the required check (e.g., name: Quality Gates) that fires only on release-please--* branches.
This is not a code defect — it's a deployment-coordination requirement. Flagging it as a hard gate so it doesn't get missed.
2. Issue: PR-time wheel build is dropped
build job now runs only on push: main (`if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}`). That means make build + validate-wheel.sh no longer run on PR.
User scenario: a contributor changes pyproject.toml (e.g., adjusts [tool.setuptools.package-data], adds a non-Python file that needs include_package_data). The PR's CI passes (quality-gates is pure pytest), the wheel-validation regression lands on main, and the next PR's build job — or worse, release-please's PyPI publish — is the first signal.
The PR's rationale: "a PR-time wheel build would add ~1 min for no extra signal beyond quality-gates." That's mostly true — but validate-wheel.sh is wheel-packaging-specific signal that quality-gates doesn't catch. Examples that would slip past:
- Stale
package-dataglob after a directory rename. runpod_flash/rules/AGENTS.md(just added in 1.17.0) accidentally excluded from the wheel.- A
MANIFEST.inregression.
Suggest gating the build job on the actual files that affect packaging:
```yaml
if: ${{ (github.event_name == 'push' && github.ref == 'refs/heads/main') ||
(github.event_name == 'pull_request' && contains(github.event.pull_request.changed_files, 'pyproject.toml')) }}
```
Or even simpler: keep build on PRs, but only when pyproject.toml, Makefile, or scripts/validate-wheel.sh are touched. Cheap signal preserved.
3. Question: redundant skip-guards for release-please bot PRs
ci.yml skips release-please bot PRs in two places:
- Workflow-level
paths-ignore: [CHANGELOG.md, .release-please-manifest.json]— workflow doesn't run if all changed files match. - Job-level
if: ${{ !startsWith(github.head_ref, 'release-please--') }}onpre-check.
These overlap but aren't equivalent:
- If the bot PR ever touches a non-ignored file (e.g., the bot decides to update a workflow file),
paths-ignoreno longer applies → workflow runs. - Then
pre-checkskips via theif:, andquality-gates needs: [pre-check]will also skip per defaultneedssemantics → no quality signal on a release-please PR that's broader than CHANGELOG.
Is the intent "always skip on release-please branches even if it touches code," or "skip only when the change is metadata-only"? If the former, this is consistent. If the latter, drop the if: on pre-check and rely on paths-ignore alone.
Confirming the intent now is cheap; debugging which guard skipped which check after the fact is not.
4. Question: does make dev (without uv pip install -e .) make the flash CLI entry point importable?
User scenario: a contributor clones the repo, runs make dev, then flash --version.
Before: uv sync --all-groups + uv pip install -e . — the editable install registered the flash = runpod_flash.cli.main:app entry point in the venv.
After: uv sync --all-groups alone.
Whether uv sync registers the entry point for the project depends on whether the project has [tool.uv.sources] / workspace config that includes itself. CI passes because the matrix runs make ci-quality-github which uses uv run pytest … — that activates the project context regardless.
But a contributor running flash outside uv run (i.e., just flash --version from an activated .venv) might find it missing. This is a developer-onboarding hazard, not a production user issue. Worth a sanity check: from a clean clone, make dev followed by .venv/bin/flash --version — does it print 1.17.0, or "command not found"?
If "command not found," restore the uv pip install -e . and call it explicitly redundant-but-required for entry-point registration.
5. Nit: local make quality-check UX now noisier
quality-check aliases ci-quality-github, which means local devs running it see GitHub Actions annotation markers (::group::, ::endgroup::) sprinkled in their output. Functionally fine — the markers are inert outside Actions — but it's a minor UX regression for the most common local quality-check invocation.
If keeping single-source-of-truth is the priority (it should be), at least add a brief note in the help text that markers in the output are intentional. Or have quality-check set an env var that conditionally suppresses the @echo "::group::..." lines.
Nits
Makefile:75test-coveragecomment update mentions "serial pass for state isolation" — good clarification, matches the PR description's rationale for keeping the two-pass.e2e.ymlloses theunit-testsjob entirely. That's fine because the workflow isworkflow_dispatchonly (manually triggered E2E runs), and the manual user always runsci.ymlfirst via PR. But worth a one-line comment in the workflow file saying so, otherwise a future maintainer might add it back assuming it was dropped by accident.pre-checkrunsuvx ruff …which downloads ruff into a temporary venv every time. Cached viasetup-uv'scache-dependency-glob: pyproject.toml, so warm runs are fast — but on a first-of-day cold cache, ~30s vs ~10s. Acceptable tradeoff.- Trailing newline restored in
release-please.ymlfinal line. Good.
🤖 Reviewed by Henrik's AI-Powered Bug Finder
There was a problem hiding this comment.
Pull request overview
Removes redundant CI work that was costing ~24 min of compute per merge and ~28 min per release-please bot PR. Centralizes the quality gate as a single canonical Makefile target, fixes a silent junit overwrite, and adds a fast-fail format/lint pre-check plus PR-cancel concurrency to ci.yml. Drops the duplicated 4-version matrix from release-please.yml and the duplicated unit-tests job (plus its artifact round-trip summary) from e2e.yml.
Changes:
Makefile: drop redundant editable install indev; aliasquality-checktoci-quality-github; split pytest passes into distinctpytest-results-parallel.xml/pytest-results-serial.xml.ci.yml: add concurrency group,paths-ignorefor release-please bot files, fast-failpre-checkjob, gatebuildtopush: main, bumpsetup-uv@v2 → v5, uploadpytest-results-*.xml.release-please.yml/e2e.yml: remove duplicatedquality-gatesmatrix andunit-tests/summaryjobs; bumpsetup-uv@v2 → v5; fold summary into thee2ejob.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| Makefile | Simplifies dev, makes ci-quality-github the canonical gate, fixes junit filename collision. |
| .github/workflows/ci.yml | Adds concurrency, paths-ignore, pre-check job; conditions build on push:main; v5 setup-uv. |
| .github/workflows/release-please.yml | Removes duplicate quality-gates matrix; release-please is the entry point; v5 setup-uv. |
| .github/workflows/e2e.yml | Drops duplicate unit-tests and summary jobs; folds summary into e2e job. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # Quality gates. ci-quality-github is the canonical CI gate; the local aliases | ||
| # below run the same checks (with plain output instead of GitHub annotations). | ||
| quality-check: ci-quality-github # Essential quality gate (parallel by default) | ||
| quality-check-strict: format-check lint typecheck test-coverage # Strict quality gate with type checking | ||
| quality-check-serial: ci-quality-github-serial # Serial quality gate for debugging |
| pre-check: | ||
| name: Pre-check (format + lint) | ||
| runs-on: ubuntu-latest | ||
| timeout-minutes: 3 | ||
| if: ${{ !startsWith(github.head_ref, 'release-please--') }} | ||
| steps: | ||
| - uses: actions/checkout@v4 | ||
| - uses: astral-sh/setup-uv@v5 | ||
| with: | ||
| enable-cache: true | ||
| cache-dependency-glob: pyproject.toml | ||
| - name: Ruff format | ||
| run: uvx ruff format --check . | ||
| - name: Ruff lint | ||
| run: uvx ruff check . --output-format=github | ||
|
|
||
| quality-gates: | ||
| name: Quality Gates | ||
| runs-on: ubuntu-latest | ||
| needs: [pre-check] |
Summary
Eliminates redundant CI/CD work that was costing ~24 min of compute per merge and ~28 min per release-please bot PR push. Linear: AE-3161.
What's redundant today
ci.ymlandrelease-please.ymlrun the full 4-version Quality Gates matrix onpush: main— 8 identical jobs per merge. The release workflow runs after CI on the same SHA, so the second matrix validates nothing new.CHANGELOG.mdand.release-please-manifest.json(several pushes per day during a release cycle).e2e.ymlhas its ownunit-testsjob that re-implementsci.yml's quality-gates with a driftingsetup-uvversion.make ci-quality-githubwrites both pytest passes to the same--junitxml=pytest-results.xml— the second pass silently overwrites the first report.Makefile:devrunsuv sync --all-groupsthenuv pip install -e .— the editable install is redundant in workspace mode.quality-checkandci-quality-githubare two Makefile targets running the same checks — drift risk, no single source of truth.concurrencygroup — pushing a fixup commit to a PR leaves the previous run racing.make devinstall across 4 matrix legs before bailing.Changes
Makefiledev: drop redundantuv pip install -e .quality-check: aliasci-quality-github(single source of truth)ci-quality-github: write distinctpytest-results-parallel.xmlandpytest-results-serial.xml(fixes silent junit overwrite).github/workflows/ci.ymlconcurrencygroup; cancel in-flight PR runspaths-ignore: [CHANGELOG.md, .release-please-manifest.json]— skip on release-please bot PRspre-checkjob:uvx ruff format --check && uvx ruff check, no project install, ~10s.quality-gatesneeds: [pre-check]buildjob only onpush: main(PR build added no signal beyond quality-gates)setup-uv@v2 → @v5pytest-results-*.xml.github/workflows/release-please.ymlquality-gatesjob entirely. Branch protection onmainalready enforces CI green before any code lands. Re-running the matrix here was duplicating ~24 min of compute per merge for zero added signal.release-pleasebecomes the entry point;pypi-publishruns if a release was created.setup-uv@v2 → @v5.github/workflows/e2e.ymlunit-testsjob (duplicatedci.ymlquality-gates)e2ejob; remove thesummaryjob and its artifact round-tripWhy the pytest double-pass stayed
Initial draft collapsed the two pytest invocations into a single
-n auto --dist=loadgrouprun, relying on theconftest.pyxdist_group("serial")hook for isolation. Local quality-check exposed test pollution:test_call_routes_to_execute_for_live_endpointfailed under the single-pass arrangement but passes in isolation and in the original two-pass layout. The double pass provides real process isolation between parallel and serial tests; the silent junit overwrite is fixed instead.Expected impact
⚠ Pre-merge gotcha
Removing
quality-gatesfromrelease-please.yml+paths-ignorefor the bot's files means the release-please bot's PR will have no required CI check. If branch protection onmainrequires a status check namedCI / Quality Gates, the bot's PR will be blocked.Two options:
release-please--*branches, named to match the required checkNeed a repo admin to apply one of these before merging.
Test plan
make quality-checkpasses locally with the new Makefile layoutmake devproduces a working install withoutuv pip install -e .make ci-quality-githubwrites bothpytest-results-parallel.xmlandpytest-results-serial.xmlpre-check+quality-gates(4 legs);buildjob is skipped (PR event, not push:main)release-please--*branches before mergerelease-please.ymlruns without aquality-gatesjobci.yml(paths-ignore filter)