ci: dedupe quality gates and add fast-fail pre-check (AE-3161) #345

Open
deanq wants to merge 1 commit into main from deanq/ae-3161-optimize-ci-cd

@deanq deanq commented May 28, 2026

Summary

Eliminates redundant CI/CD work that was costing ~24 min of compute per merge and ~28 min per release-please bot PR push. Linear: AE-3161.

What's redundant today

  1. Both ci.yml and release-please.yml run the full 4-version Quality Gates matrix on push: main — 8 identical jobs per merge. The release workflow runs after CI on the same SHA, so the second matrix validates nothing new.
  2. release-please bot PRs trigger the full CI matrix even though they only touch CHANGELOG.md and .release-please-manifest.json (several pushes per day during a release cycle).
  3. e2e.yml has its own unit-tests job that re-implements ci.yml's quality-gates with a drifting setup-uv version.
  4. make ci-quality-github writes both pytest passes to the same --junitxml=pytest-results.xml — the second pass silently overwrites the first report.
  5. Makefile:dev runs uv sync --all-groups then uv pip install -e . — the editable install is redundant in workspace mode.
  6. quality-check and ci-quality-github are two Makefile targets running the same checks — drift risk, no single source of truth.
  7. No concurrency group — pushing a fixup commit to a PR leaves the previous run racing.
  8. No fast-fail pre-check — a ruff format failure wastes the full make dev install across 4 matrix legs before bailing.

Changes

Makefile

  • dev: drop redundant uv pip install -e .
  • quality-check: alias ci-quality-github (single source of truth)
  • ci-quality-github: write distinct pytest-results-parallel.xml and pytest-results-serial.xml (fixes silent junit overwrite)

.github/workflows/ci.yml

  • Add concurrency group; cancel in-flight PR runs
  • paths-ignore: [CHANGELOG.md, .release-please-manifest.json] — skip on release-please bot PRs
  • New pre-check job: uvx ruff format --check && uvx ruff check, no project install, ~10s. quality-gates needs: [pre-check]
  • build job only on push: main (PR build added no signal beyond quality-gates)
  • Bump setup-uv@v2 → @v5
  • Upload artifact glob now pytest-results-*.xml
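
As a rough illustration of the concurrency and paths-ignore bullets above, the top of ci.yml could look something like the sketch below. This is an assumption about the shape, not the actual diff: the group key, the choice to cancel only pull_request runs, and applying paths-ignore to the pull_request trigger alone are all illustrative.

```yaml
# Sketch only: trigger and concurrency settings for ci.yml (values are assumptions).
name: CI

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]
    # The workflow is skipped only when EVERY changed file matches this list.
    paths-ignore:
      - CHANGELOG.md
      - .release-please-manifest.json

concurrency:
  # One group per PR (or per ref for pushes), so a fixup push supersedes the older run.
  group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
  # Cancel superseded PR runs, but let runs on main finish.
  cancel-in-progress: ${{ github.event_name == 'pull_request' }}
```

Keying the group on the PR number rather than the SHA is what lets a new push cancel the in-flight run for the same PR.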

.github/workflows/release-please.yml

  • Remove quality-gates job entirely. Branch protection on main already enforces CI green before any code lands. Re-running the matrix here was duplicating ~24 min of compute per merge for zero added signal.
  • release-please becomes the entry point; pypi-publish runs if a release was created.
  • Bump setup-uv@v2 → @v5

.github/workflows/e2e.yml

  • Drop unit-tests job (duplicated ci.yml quality-gates)
  • Fold the summary step into the e2e job; remove the summary job and its artifact round-trip

Why the pytest double-pass stayed

The initial draft collapsed the two pytest invocations into a single -n auto --dist=loadgroup run, relying on the conftest.py xdist_group("serial") hook for isolation. A local quality-check run exposed test pollution: test_call_routes_to_execute_for_live_endpoint failed under the single-pass arrangement but passes in isolation and in the original two-pass layout. The double pass provides real process isolation between parallel and serial tests, so it stays, and the silent junit overwrite is fixed instead.

Expected impact

| Scenario | Before | After |
| --- | --- | --- |
| Per PR push (typical) | ~7 min × 4 matrix + 1 min build = ~29 min compute | ~7 min × 1 leg parallel + 30s pre-check = ~7.5 min wall, ~28 min compute (75% less if pre-check fails fast) |
| Per merge to main | ~7 min × 4 + ~7 min × 4 + 1 min build = ~57 min compute | ~7 min × 4 + 1 min build = ~29 min compute |
| Per release-please bot PR push | ~7 min × 4 = ~28 min compute | 0 (skipped via paths-ignore) |

⚠ Pre-merge gotcha

Removing quality-gates from release-please.yml + paths-ignore for the bot's files means the release-please bot's PR will have no required CI check. If branch protection on main requires a status check named CI / Quality Gates, the bot's PR will be blocked.

Two options:

  1. Loosen branch protection so missing checks are allowed (the bot only edits CHANGELOG/manifest — there's nothing to validate)
  2. Add a no-op status job that always passes on release-please--* branches, named to match the required check

Need a repo admin to apply one of these before merging.
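
For option 2, a minimal sketch of such a sentinel is shown below. It assumes a separate, hypothetical workflow file (so that it is not itself subject to ci.yml's paths-ignore) and assumes the required check is matched by the job name Quality Gates. If branch protection lists one check per matrix leg, each of those names would need its own passing job.

```yaml
# Hypothetical file: .github/workflows/release-please-sentinel.yml
# Sketch only: reports a passing check on metadata-only release-please PRs.
name: Release-please sentinel

on:
  pull_request:
    branches: [main]
    # Runs when at least one changed file matches; ci.yml ignores exactly these files.
    paths:
      - CHANGELOG.md
      - .release-please-manifest.json

jobs:
  quality-gates:
    # Assumption: branch protection matches on this job name. Adjust to the exact
    # required check name(s) configured on main.
    name: Quality Gates
    if: ${{ startsWith(github.head_ref, 'release-please--') }}
    runs-on: ubuntu-latest
    steps:
      - name: No-op
        run: echo "Metadata-only release-please PR, nothing to validate."
```

The branch guard keeps the sentinel from vouching for ordinary PRs that happen to touch CHANGELOG.md.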

Test plan

  • make quality-check passes locally with the new Makefile layout
  • make dev produces a working install without uv pip install -e .
  • make ci-quality-github writes both pytest-results-parallel.xml and pytest-results-serial.xml
  • CI on this PR runs pre-check + quality-gates (4 legs); build job is skipped (PR event, not push:main)
  • Branch protection adjusted for release-please--* branches before merge
  • On merge, confirm release-please.yml runs without a quality-gates job
  • Next release-please bot PR is skipped by ci.yml (paths-ignore filter)

Eliminates redundant CI/CD work that was costing ~24 min of compute per
merge and ~28 min per release-please bot PR push.

Workflow changes:
- ci.yml: add concurrency group (cancel in-flight PR runs), paths-ignore
  for release-please bot PRs, new pre-check job (uvx ruff format+check,
  ~10s no-install fast-fail), bump setup-uv v2 -> v5, build job only on
  push:main since release-please.yml builds + publishes the same SHA.
- release-please.yml: remove duplicate quality-gates job. Branch protection
  on main already enforces ci.yml green before code lands, so the second
  matrix validated nothing. Saves a full 4-leg matrix per merge.
- e2e.yml: drop unit-tests job (duplicated ci.yml quality-gates), fold the
  summary step into the e2e job to remove the third workflow definition of
  the same install setup.

Makefile changes:
- dev: drop redundant 'uv pip install -e .' (uv sync handles editable
  install in workspace mode).
- quality-check: alias ci-quality-github so there is one canonical CI
  quality gate, no drift between local and CI.
- ci-quality-github / test-coverage: fix the latent junit overwrite bug
  where both pytest invocations wrote pytest-results.xml -- now writes
  pytest-results-parallel.xml and pytest-results-serial.xml. The double
  pytest pass is retained because it provides process isolation between
  parallel and serial tests; collapsing to a single loadgroup run
  surfaced state pollution in test_load_balancer_sls_stub.

Note: removing quality-gates from release-please.yml + paths-ignore for
the bot's files means the release-please PR will have no required CI
check. Branch protection on main will need to allow missing checks for
the release-please--* branch pattern, or add a no-op sentinel job named
to match the protection rule.

https://linear.app/runpod/issue/AE-3161

@runpod-Henrik runpod-Henrik left a comment

Henrik's AI-Powered Bug Finder — PR #345 Review

Verdict: PASS WITH NITS (with one repo-admin action required before merge)

CI-only refactor. Local SDK behavior is unchanged. The optimization claims hold up (75% compute reduction for the typical PR-fixup loop; full matrix dedupe on merge-to-main). CI on this PR shows the new pipeline working: pre-check took 10s, all four quality-gates legs took 4:00 to 4:30 min, and build was skipped per the new if: guard.

The PR description already flags the branch-protection compatibility issue, which is the real release-blocker.


1. Issue: branch-protection compatibility on release-please bot PRs

The PR description calls this out explicitly. Recapping the QA risk in user terms:

  • After this PR, a release-please bot PR that touches only CHANGELOG.md + .release-please-manifest.json causes ci.yml to be skipped via paths-ignore, so the workflow doesn't run at all and reports no statuses.
  • If branch protection on main requires a check named CI / Quality Gates (4 statuses, one per matrix leg) and the workflow never runs, those status checks are never reported. GitHub leaves a required check whose workflow was skipped by path filtering in a "Pending" state, which blocks the merge. This differs from a job skipped by an if: conditional, which reports as passing. The "Require branches to be up to date before merging" (strict) option does not change this behavior. Repo admin needs to verify which checks are required on main before merging.

If the checks are required, the bot PR stays blocked until an admin bypasses the rule or a check with the matching name is reported.

A safer path is the PR description's option 2: add a no-op success job named to match the required check (e.g., name: Quality Gates) that fires only on release-please--* branches.

This is not a code defect — it's a deployment-coordination requirement. Flagging it as a hard gate so it doesn't get missed.


2. Issue: PR-time wheel build is dropped

build job now runs only on push: main (`if: ${{ github.event_name == 'push' && github.ref == 'refs/heads/main' }}`). That means make build + validate-wheel.sh no longer run on PR.

User scenario: a contributor changes pyproject.toml (e.g., adjusts [tool.setuptools.package-data], adds a non-Python file that needs include_package_data). The PR's CI passes (quality-gates is pure pytest), the wheel-validation regression lands on main, and the next PR's build job — or worse, release-please's PyPI publish — is the first signal.

The PR's rationale: "a PR-time wheel build would add ~1 min for no extra signal beyond quality-gates." That's mostly true — but validate-wheel.sh is wheel-packaging-specific signal that quality-gates doesn't catch. Examples that would slip past:

  • Stale package-data glob after a directory rename.
  • runpod_flash/rules/AGENTS.md (just added in 1.17.0) accidentally excluded from the wheel.
  • A MANIFEST.in regression.

Suggest gating the build job on the actual files that affect packaging:

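```yaml
# Caveat: github.event.pull_request.changed_files is an integer count of changed
# files, not a list of paths, so the contains() clause below never matches a
# filename. Treat this as a statement of intent; real path-based gating needs an
# on.pull_request.paths filter or a step that lists the changed files.
if: ${{ (github.event_name == 'push' && github.ref == 'refs/heads/main') ||
  (github.event_name == 'pull_request' && contains(github.event.pull_request.changed_files, 'pyproject.toml')) }}
```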

Or even simpler: keep build on PRs, but only when pyproject.toml, Makefile, or scripts/validate-wheel.sh are touched. Cheap signal preserved.
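
One way to express the "only when packaging files are touched" variant is a small separate workflow with a paths filter, since an if: expression has no access to the list of changed paths. The sketch below is illustrative: the file name and job name are hypothetical, and it assumes make dev is needed before make build and that scripts/validate-wheel.sh can be run directly from the repo root.

```yaml
# Hypothetical file: .github/workflows/wheel-check.yml
# Sketch only: PR-time wheel build, limited to changes that affect packaging.
name: Wheel check

on:
  pull_request:
    branches: [main]
    # Runs only when at least one of these files changes.
    paths:
      - pyproject.toml
      - Makefile
      - scripts/validate-wheel.sh

jobs:
  build:
    name: Build and validate wheel
    runs-on: ubuntu-latest
    timeout-minutes: 10
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
        with:
          enable-cache: true
      - name: Install dev dependencies
        run: make dev
      - name: Build wheel
        run: make build
      - name: Validate wheel contents
        # Assumption: make build does not already call this script.
        run: ./scripts/validate-wheel.sh
```

Keeping this out of ci.yml means the push: main guard on the existing build job stays untouched, and this check should not be marked as required because it is skipped on most PRs.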


3. Question: redundant skip-guards for release-please bot PRs

ci.yml skips release-please bot PRs in two places:

  1. Workflow-level paths-ignore: [CHANGELOG.md, .release-please-manifest.json] — workflow doesn't run if all changed files match.
  2. Job-level if: ${{ !startsWith(github.head_ref, 'release-please--') }} on pre-check.

These overlap but aren't equivalent:

  • If the bot PR ever touches a non-ignored file (e.g., the bot decides to update a workflow file), paths-ignore no longer applies → workflow runs.
  • Then pre-check skips via the if:, and quality-gates needs: [pre-check] will also skip per default needs semantics → no quality signal on a release-please PR that's broader than CHANGELOG.

Is the intent "always skip on release-please branches even if it touches code," or "skip only when the change is metadata-only"? If the former, this is consistent. If the latter, drop the if: on pre-check and rely on paths-ignore alone.

Confirming the intent now is cheap; debugging which guard skipped which check after the fact is not.


4. Question: does make dev (without uv pip install -e .) make the flash CLI entry point importable?

User scenario: a contributor clones the repo, runs make dev, then flash --version.

Before: uv sync --all-groups + uv pip install -e . — the editable install registered the flash = runpod_flash.cli.main:app entry point in the venv.

After: uv sync --all-groups alone.

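Whether uv sync registers the entry point depends on whether uv treats the project as a package: a project with a [build-system] table in pyproject.toml (or tool.uv.package = true) is installed into the venv in editable mode by uv sync, while a project without one gets only its dependencies installed. CI passes either way because the matrix runs make ci-quality-github, which uses uv run pytest …, and that activates the project context regardless.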

But a contributor running flash outside uv run (i.e., just flash --version from an activated .venv) might find it missing. This is a developer-onboarding hazard, not a production user issue. Worth a sanity check: from a clean clone, make dev followed by .venv/bin/flash --version — does it print 1.17.0, or "command not found"?

If "command not found," restore the uv pip install -e . and call it explicitly redundant-but-required for entry-point registration.
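
If the manual check passes, a small CI job could guard against regressions. The sketch below is an assumption rather than part of this PR: the job name is hypothetical, and it assumes make dev creates the virtual environment at .venv in the repo root.

```yaml
# Sketch only: smoke check that make dev leaves a usable flash entry point.
jobs:
  dev-install-smoke:
    name: Dev install smoke check  # hypothetical job name
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - name: Install with make dev
        run: make dev
      - name: Run the entry point without uv run
        # Calls the script directly so a missing entry point fails the job.
        run: .venv/bin/flash --version
```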


5. Nit: local make quality-check UX now noisier

quality-check aliases ci-quality-github, which means local devs running it see GitHub Actions annotation markers (::group::, ::endgroup::) sprinkled in their output. Functionally fine — the markers are inert outside Actions — but it's a minor UX regression for the most common local quality-check invocation.

If keeping single-source-of-truth is the priority (it should be), at least add a brief note in the help text that markers in the output are intentional. Or have quality-check set an env var that conditionally suppresses the @echo "::group::..." lines.


Nits

  • Makefile:75 test-coverage comment update mentions "serial pass for state isolation" — good clarification, matches the PR description's rationale for keeping the two-pass.
  • e2e.yml loses the unit-tests job entirely. That's fine because the workflow is workflow_dispatch only (manually triggered E2E runs), and the manual user always runs ci.yml first via PR. But worth a one-line comment in the workflow file saying so, otherwise a future maintainer might add it back assuming it was dropped by accident.
  • pre-check runs uvx ruff … which downloads ruff into a temporary venv every time. Cached via setup-uv's cache-dependency-glob: pyproject.toml, so warm runs are fast — but on a first-of-day cold cache, ~30s vs ~10s. Acceptable tradeoff.
  • Trailing newline restored in release-please.yml final line. Good.

🤖 Reviewed by Henrik's AI-Powered Bug Finder

Copilot AI left a comment

Pull request overview

Removes redundant CI work that was costing ~24 min of compute per merge and ~28 min per release-please bot PR. Centralizes the quality gate as a single canonical Makefile target, fixes a silent junit overwrite, and adds a fast-fail format/lint pre-check plus PR-cancel concurrency to ci.yml. Drops the duplicated 4-version matrix from release-please.yml and the duplicated unit-tests job (plus its artifact round-trip summary) from e2e.yml.

Changes:

  • Makefile: drop redundant editable install in dev; alias quality-check to ci-quality-github; split pytest passes into distinct pytest-results-parallel.xml / pytest-results-serial.xml.
  • ci.yml: add concurrency group, paths-ignore for release-please bot files, fast-fail pre-check job, gate build to push: main, bump setup-uv@v2 → v5, upload pytest-results-*.xml.
  • release-please.yml / e2e.yml: remove duplicated quality-gates matrix and unit-tests/summary jobs; bump setup-uv@v2 → v5; fold summary into the e2e job.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| Makefile | Simplifies dev, makes ci-quality-github the canonical gate, fixes junit filename collision. |
| .github/workflows/ci.yml | Adds concurrency, paths-ignore, pre-check job; conditions build on push:main; v5 setup-uv. |
| .github/workflows/release-please.yml | Removes duplicate quality-gates matrix; release-please is the entry point; v5 setup-uv. |
| .github/workflows/e2e.yml | Drops duplicate unit-tests and summary jobs; folds summary into e2e job. |

Comment thread Makefile
Comment on lines +104 to +108
# Quality gates. ci-quality-github is the canonical CI gate; the local aliases
# below run the same checks (with plain output instead of GitHub annotations).
quality-check: ci-quality-github # Essential quality gate (parallel by default)
quality-check-strict: format-check lint typecheck test-coverage # Strict quality gate with type checking
quality-check-serial: ci-quality-github-serial # Serial quality gate for debugging
Comment thread .github/workflows/ci.yml
Comment on lines +31 to +50
  pre-check:
    name: Pre-check (format + lint)
    runs-on: ubuntu-latest
    timeout-minutes: 3
    if: ${{ !startsWith(github.head_ref, 'release-please--') }}
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
        with:
          enable-cache: true
          cache-dependency-glob: pyproject.toml
      - name: Ruff format
        run: uvx ruff format --check .
      - name: Ruff lint
        run: uvx ruff check . --output-format=github

  quality-gates:
    name: Quality Gates
    runs-on: ubuntu-latest
    needs: [pre-check]