Skip to content

[Klaud Cold] Update dsv4-fp4-mi355x-vllm vLLM ROCm image to v0.22.0#1624

Merged
functionstackx merged 2 commits into
mainfrom
klaud-cold/dsv4-fp4-mi355x-vllm-v0.22.0
May 31, 2026
Merged

[Klaud Cold] Update dsv4-fp4-mi355x-vllm vLLM ROCm image to v0.22.0#1624
functionstackx merged 2 commits into
mainfrom
klaud-cold/dsv4-fp4-mi355x-vllm-v0.22.0

Conversation

@functionstackx
Copy link
Copy Markdown
Collaborator

@functionstackx functionstackx commented May 30, 2026

Summary

Update vLLM ROCm image from nightly-4f940896a32c9e2a0eba7f50d521bf5f6b4de458 to v0.22.0

Recipes touched: dsv4-fp4-mi355x-vllm

Test plan

  • full-sweep-enabled sweep passes.

🤖 Generated with Claude Code


Note

Low Risk
Config-only container tag and changelog update; no sweep matrix or serving-flag changes, with validation expected via the existing full-sweep workflow.

Overview
Pins dsv4-fp4-mi355x-vllm to the stable vllm/vllm-openai-rocm:v0.22.0 image instead of the digest-style nightly-4f940896a32c9e2a0eba7f50d521bf5f6b4de458 build. Model, runner, precision, and TP8 fixed-seq-len sweeps are unchanged.

Adds a matching perf-changelog.yaml entry documenting the ROCm image bump and linking PR #1624.

Reviewed by Cursor Bugbot for commit 61fbd32. Bugbot is set up for automated code reviews on this repo. Configure here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the klaud-cold/dsv4-fp4-mi355x-vllm-v0.22.0 branch from bf86321 to 4e52736 Compare May 30, 2026 22:12
@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

1 similar comment
@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@github-actions
Copy link
Copy Markdown
Contributor

Copy link
Copy Markdown
Contributor

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — straightforward image bump from a nightly digest pin to the stable v0.22.0 tag, with a matching perf-changelog entry.

Extended reasoning...

Overview

This PR touches two files: .github/configs/amd-master.yaml (a single-line image tag change for the dsv4-fp4-mi355x-vllm recipe, from nightly-4f940896... to v0.22.0) and perf-changelog.yaml (a corresponding changelog entry). No code paths, no flags, no sweep-space changes — just a container image version bump.

Security risks

None. This is a config-only change that swaps one published vllm/vllm-openai-rocm tag for another from the same upstream registry. No auth, crypto, permissions, or data-handling surface is involved.

Level of scrutiny

Low. Image-tag bumps in this YAML are routine, follow an established pattern (multiple sibling entries in the file already pin vllm/vllm-openai-rocm to stable vX.Y.Z tags), and the move from a floating-ish nightly digest to a stable release is the safer direction. Functional validation is gated on the full-sweep-enabled label, which will exercise the new image.

Other factors

The bug hunter flagged one nit: the IMPORTANT comment block above the entry still describes the previous nightly-pin rationale (digest-suffixed nightly vs floating :nightly, enroot squashfs caching) and is stale now that the pin is a stable release. It's a real staleness introduced by this PR but has zero functional impact and the inline comment already calls it out for the author to address if desired — it doesn't warrant blocking approval.

# probe to validate the ROCm DP+EP path.
dsv4-fp4-mi355x-vllm:
image: vllm/vllm-openai-rocm:nightly-4f940896a32c9e2a0eba7f50d521bf5f6b4de458
image: vllm/vllm-openai-rocm:v0.22.0
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: the comment block above this entry (lines 2118-2127) describes the image as a 'nightly image' and contains an 'IMPORTANT' block instructing maintainers to pin to a digest-suffixed nightly rather than the floating :nightly (because launch_mi355x-amds.sh caches enroot squashfs files by image string). After the bump to v0.22.0 (a stable versioned tag), both the 'nightly image' wording and the digest-vs-floating rationale are stale and will mislead future maintainers. Consider updating the comment block to reflect the stable-tag pin, or dropping the nightly-pinning paragraph entirely.

Extended reasoning...

What's stale

The diff changes the image at line 2138 from vllm/vllm-openai-rocm:nightly-4f940896a32c9e2a0eba7f50d521bf5f6b4de458 to vllm/vllm-openai-rocm:v0.22.0. The 10-line comment block immediately above the entry (lines 2118-2127), which was written to explain why the previous nightly pin was needed, now contradicts the actual image:

  • Line 2118-2119: "DSv4 on MI355X via vLLM, using the official vllm/vllm-openai-rocm nightly image."v0.22.0 is a stable release tag, not a nightly.
  • Lines 2123-2127: "IMPORTANT: pin to a digest-suffixed nightly tag rather than the floating :nightly. launch_mi355x-amds.sh caches enroot squashfs files keyed on the image string and short-circuits re-import if the file already exists, so the floating tag silently keeps a stale build even after Docker Hub updates :nightly." — the digest-vs-floating concern is specific to mutable tags. v0.22.0 is an immutable stable release, so this entire rationale is moot for the currently pinned image.

Why this matters (and why it's still just a nit)

The staleness is introduced directly by this PR — the comment was accurate when the image was nightly-4f940896... and becomes inaccurate the moment the image becomes v0.22.0. A future maintainer looking at this entry to understand why the tag is structured this way will read an IMPORTANT block that doesn't describe what they're seeing, which is the canonical recipe for stale-comment confusion.

That said, there is no functional impact: the sweep config itself is correct, and the enroot squashfs caching behavior described in the comment is still true in general (it's just no longer the reason this particular tag is pinned the way it is). For that reason this is filed as a nit, not a blocking issue.

Addressing the refutation

One verifier refuted this as a non-egregious nit not worth posting. I think it's worth flagging because (a) the comment block is unusually detailed — it isn't a one-line label, it's an IMPORTANT: paragraph with a specific operational rationale; (b) the staleness was created by this PR rather than being pre-existing drift, so it's directly in scope for an image-bump cleanup; and (c) the wording is actively misdirecting ("nightly image" / "digest-suffixed nightly tag") rather than merely incomplete. The IMPORTANT-block cache-keying guidance is still generally accurate as a reference, but its placement above a stable-tag entry implies it explains this pin, which it no longer does.

Suggested fix

Either:

  1. Rewrite the IMPORTANT block to be tag-scheme-agnostic, e.g. note that launch_mi355x-amds.sh caches enroot squashfs files by image string and so any tag (nightly or stable) must be replaced rather than re-pushed; or
  2. Drop the nightly-specific paragraph and the "nightly image" phrasing, since the stable v0.22.0 tag is immutable on Docker Hub and the caching concern doesn't apply to it in practice.

@github-actions
Copy link
Copy Markdown
Contributor

@functionstackx
Copy link
Copy Markdown
Collaborator Author

/reuse-sweep-run

@functionstackx functionstackx merged commit c353742 into main May 31, 2026
15 of 22 checks passed
@functionstackx functionstackx deleted the klaud-cold/dsv4-fp4-mi355x-vllm-v0.22.0 branch May 31, 2026 02:38
@github-actions
Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Development

Successfully merging this pull request may close these issues.

1 participant