[Klaud Cold] Update dsv4-fp4-mi355x-vllm vLLM ROCm image to v0.22.0#1624
Conversation
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bf86321 to
4e52736
Compare
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
1 similar comment
|
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack. |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26696267330 |
There was a problem hiding this comment.
LGTM — straightforward image bump from a nightly digest pin to the stable v0.22.0 tag, with a matching perf-changelog entry.
Extended reasoning...
Overview
This PR touches two files: .github/configs/amd-master.yaml (a single-line image tag change for the dsv4-fp4-mi355x-vllm recipe, from nightly-4f940896... to v0.22.0) and perf-changelog.yaml (a corresponding changelog entry). No code paths, no flags, no sweep-space changes — just a container image version bump.
Security risks
None. This is a config-only change that swaps one published vllm/vllm-openai-rocm tag for another from the same upstream registry. No auth, crypto, permissions, or data-handling surface is involved.
Level of scrutiny
Low. Image-tag bumps in this YAML are routine, follow an established pattern (multiple sibling entries in the file already pin vllm/vllm-openai-rocm to stable vX.Y.Z tags), and the move from a floating-ish nightly digest to a stable release is the safer direction. Functional validation is gated on the full-sweep-enabled label, which will exercise the new image.
Other factors
The bug hunter flagged one nit: the IMPORTANT comment block above the entry still describes the previous nightly-pin rationale (digest-suffixed nightly vs floating :nightly, enroot squashfs caching) and is stale now that the pin is a stable release. It's a real staleness introduced by this PR but has zero functional impact and the inline comment already calls it out for the author to address if desired — it doesn't warrant blocking approval.
| # probe to validate the ROCm DP+EP path. | ||
| dsv4-fp4-mi355x-vllm: | ||
| image: vllm/vllm-openai-rocm:nightly-4f940896a32c9e2a0eba7f50d521bf5f6b4de458 | ||
| image: vllm/vllm-openai-rocm:v0.22.0 |
There was a problem hiding this comment.
🟡 Nit: the comment block above this entry (lines 2118-2127) describes the image as a 'nightly image' and contains an 'IMPORTANT' block instructing maintainers to pin to a digest-suffixed nightly rather than the floating :nightly (because launch_mi355x-amds.sh caches enroot squashfs files by image string). After the bump to v0.22.0 (a stable versioned tag), both the 'nightly image' wording and the digest-vs-floating rationale are stale and will mislead future maintainers. Consider updating the comment block to reflect the stable-tag pin, or dropping the nightly-pinning paragraph entirely.
Extended reasoning...
What's stale
The diff changes the image at line 2138 from vllm/vllm-openai-rocm:nightly-4f940896a32c9e2a0eba7f50d521bf5f6b4de458 to vllm/vllm-openai-rocm:v0.22.0. The 10-line comment block immediately above the entry (lines 2118-2127), which was written to explain why the previous nightly pin was needed, now contradicts the actual image:
- Line 2118-2119: "DSv4 on MI355X via vLLM, using the official vllm/vllm-openai-rocm nightly image." —
v0.22.0is a stable release tag, not a nightly. - Lines 2123-2127: "IMPORTANT: pin to a digest-suffixed nightly tag rather than the floating
:nightly. launch_mi355x-amds.sh caches enroot squashfs files keyed on the image string and short-circuits re-import if the file already exists, so the floating tag silently keeps a stale build even after Docker Hub updates:nightly." — the digest-vs-floating concern is specific to mutable tags.v0.22.0is an immutable stable release, so this entire rationale is moot for the currently pinned image.
Why this matters (and why it's still just a nit)
The staleness is introduced directly by this PR — the comment was accurate when the image was nightly-4f940896... and becomes inaccurate the moment the image becomes v0.22.0. A future maintainer looking at this entry to understand why the tag is structured this way will read an IMPORTANT block that doesn't describe what they're seeing, which is the canonical recipe for stale-comment confusion.
That said, there is no functional impact: the sweep config itself is correct, and the enroot squashfs caching behavior described in the comment is still true in general (it's just no longer the reason this particular tag is pinned the way it is). For that reason this is filed as a nit, not a blocking issue.
Addressing the refutation
One verifier refuted this as a non-egregious nit not worth posting. I think it's worth flagging because (a) the comment block is unusually detailed — it isn't a one-line label, it's an IMPORTANT: paragraph with a specific operational rationale; (b) the staleness was created by this PR rather than being pre-existing drift, so it's directly in scope for an image-bump cleanup; and (c) the wording is actively misdirecting ("nightly image" / "digest-suffixed nightly tag") rather than merely incomplete. The IMPORTANT-block cache-keying guidance is still generally accurate as a reference, but its placement above a stable-tag entry implies it explains this pin, which it no longer does.
Suggested fix
Either:
- Rewrite the IMPORTANT block to be tag-scheme-agnostic, e.g. note that
launch_mi355x-amds.shcaches enroot squashfs files by image string and so any tag (nightly or stable) must be replaced rather than re-pushed; or - Drop the nightly-specific paragraph and the "nightly image" phrasing, since the stable
v0.22.0tag is immutable on Docker Hub and the caching concern doesn't apply to it in practice.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26696268345 |
|
/reuse-sweep-run |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=26701256420 |
Summary
Update vLLM ROCm image from nightly-4f940896a32c9e2a0eba7f50d521bf5f6b4de458 to v0.22.0
Recipes touched:
dsv4-fp4-mi355x-vllmTest plan
🤖 Generated with Claude Code
Note
Low Risk
Config-only container tag and changelog update; no sweep matrix or serving-flag changes, with validation expected via the existing full-sweep workflow.
Overview
Pins
dsv4-fp4-mi355x-vllmto the stablevllm/vllm-openai-rocm:v0.22.0image instead of the digest-stylenightly-4f940896a32c9e2a0eba7f50d521bf5f6b4de458build. Model, runner, precision, and TP8 fixed-seq-len sweeps are unchanged.Adds a matching
perf-changelog.yamlentry documenting the ROCm image bump and linking PR #1624.Reviewed by Cursor Bugbot for commit 61fbd32. Bugbot is set up for automated code reviews on this repo. Configure here.