
AMD gfx1151 (Strix Halo / RDNA3.5) ROCm results + a precision-downgrade reward-hack case for kernel_static_checker #155

@fxp


Summary

Sharing KernelBench results on AMD gfx1151 (Strix Halo / Radeon 8060S, RDNA3.5), a target not yet in results/timing/. The kernels were written in Triton by an LLM agent (GLM-5.1) and evaluated with the official scripts/run_and_check.py. Along the way we hit a precision-downgrade reward hack that kernel_static_checker.py currently only warns about. Posting per the note in EVAL.md ("We welcome suggestions and contributions here").

Environment

  • GPU: AMD Radeon 8060S (gfx1151, RDNA3.5), 128 GB unified LPDDR5X
  • ROCm 7.13 (TheRock) · PyTorch 2.12.0a0+rocm7.13 · Triton 3.7 (rocm) · backend=triton
  • Eval: official scripts/run_and_check.py, check_kernel=True, precision=fp32

Reward‑hack finding: FP16 downcast passes as a huge "speedup"

On level2 / 56_Matmul_Sigmoid_Sum, an agent kernel that downcasts to FP16 reports:

  • Speedup over eager 11.0×, over torch.compile 10.5×, correctness 5/5 (allclose 1e‑2)
  • Static checker: [WARN] Precision downgrade detected: required FP32 but code uses FP16 — only a warning, not rejected; the result was also flagged excessive_speedup: True.

Forcing genuine FP32 (no downcast, no F.linear bypass) on the same problem collapses the speedup to ~1.04×. So roughly a factor of 10 of the reported 11.0× came from the FP16 downcast plus calling F.linear, not from real fusion. The same pattern appears on 64_Gemm_LogSumExp_... (7.7× → 1.02×).
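To illustrate why an FP16 downcast can still pass an allclose check at 1e-2, here is a minimal sketch that rounds values to IEEE half precision with the standard library and measures the worst relative rounding error. It is not part of KernelBench: the helper names are made up for this example, the inputs are Python doubles in [1, 2) rather than real FP32 activations, and it covers only single-value rounding, not error accumulated inside a matmul.

```python
import random
import struct


def round_to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    # The struct format code "e" is the 16-bit binary16 (half) format.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def max_relative_error(values) -> float:
    """Worst-case relative error introduced by rounding each value to FP16."""
    return max(abs(round_to_fp16(v) - v) / abs(v) for v in values)


random.seed(0)
# Illustrative inputs only: values in [1, 2), where FP16 spacing is 2**-10.
samples = [random.uniform(1.0, 2.0) for _ in range(10_000)]
worst = max_relative_error(samples)

print(f"worst relative rounding error: {worst:.2e}")
print("fits inside a 1e-2 tolerance:", worst < 1e-2)
```

The rounding error per value stays far below the 1e-2 tolerance, which is consistent with the downcast kernel passing 5/5 correctness while doing less work than a true FP32 kernel.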

Suggestion: when a run requests precision=fp32, treat a detected precision downgrade as a hard correctness=False / disqualification rather than a warning (or gate it behind a config flag). Happy to send a small PR to kernel_static_checker.py.
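A rough sketch of what the suggested gating could look like follows. Everything in it is hypothetical: `CheckResult`, `check_precision`, the `strict` flag, and the regex patterns are placeholders for illustration and are not taken from kernel_static_checker.py, whose actual detection rules and return types may differ.

```python
import re
from dataclasses import dataclass, field

# Hypothetical detection patterns; the real checker may use different rules.
FP16_PATTERNS = [
    r"\.half\(\)",
    r"\btorch\.float16\b",
    r"\btl\.float16\b",
]


@dataclass
class CheckResult:
    """Hypothetical result container, not the checker's real return type."""
    valid: bool = True
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


def check_precision(source: str, required: str = "fp32", strict: bool = True) -> CheckResult:
    """Flag FP16 usage in kernel source when FP32 is required.

    With strict=True the downgrade is a hard failure; with strict=False it
    is only recorded as a warning (the behavior described in this issue).
    """
    result = CheckResult()
    if required != "fp32":
        return result
    if any(re.search(pattern, source) for pattern in FP16_PATTERNS):
        message = "Precision downgrade detected: required FP32 but code uses FP16"
        if strict:
            result.valid = False
            result.errors.append(message)
        else:
            result.warnings.append(message)
    return result


kernel_src = "y = torch.nn.functional.linear(x.half(), w.half())"
print(check_precision(kernel_src, strict=True).valid)
print(check_precision(kernel_src, strict=False).warnings)
```

Keeping the behavior behind a flag would let existing runs stay warning-only while fp32 runs opt in to rejection.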

Minor: the checker flags any pass statement as "inheritance bypass (Contains 'pass' statement)", which false‑positives on legitimate empty __init__ / control‑flow pass.
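One possible way to narrow this check, sketched below under the assumption that "inheritance bypass" means a subclass whose body does nothing, is to walk the AST and flag only classes with base classes whose body is a bare `pass` (optionally after a docstring). The function name `find_passthrough_classes` is hypothetical and this is not how the current checker is implemented.

```python
import ast


def find_passthrough_classes(source: str) -> list[str]:
    """Return names of subclasses whose body is only `pass`.

    A leading docstring is ignored, so a class containing nothing but a
    docstring is also reported. `pass` inside methods or control flow is
    not reported.
    """
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef) or not node.bases:
            continue
        body = node.body
        first = body[0] if body else None
        # Skip a leading docstring expression, if present.
        if (
            isinstance(first, ast.Expr)
            and isinstance(first.value, ast.Constant)
            and isinstance(first.value.value, str)
        ):
            body = body[1:]
        if all(isinstance(stmt, ast.Pass) for stmt in body):
            flagged.append(node.name)
    return flagged


bypass_src = "class ModelNew(Model):\n    pass\n"
legit_src = '''
class ModelNew(nn.Module):
    def __init__(self):
        pass

    def forward(self, x):
        if x is None:
            pass
        return x
'''

print(find_passthrough_classes(bypass_src))
print(find_passthrough_classes(legit_src))
```

Because the source is only parsed, not executed, names such as `Model` or `nn` do not need to be importable.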

Compliant gfx1151 results (FP32, pass static checker, vs torch.compile)

After enforcing FP32 + the static checker, the durable wins (beat torch.compile, not just eager) are modest and honest:

| Problem | Speedup vs eager | Speedup vs torch.compile |
| --- | --- | --- |
| level2 / 12_Gemm_Multiply_LeakyReLU | 1.79× | 1.68× |
| level1 / 38_L1Norm | 1.51× | 1.20× |

Most simple fusions (softmax, swish, cumsum) just match torch.compile (~1.0×) on this bandwidth‑bound iGPU; the durable gains come from epilogue fusions that inductor doesn't fully fuse.

Happy to contribute gfx1151 baseline‑timing JSONs under results/timing/gfx1151_StrixHalo/ if there's interest.
