Summary
Sharing KernelBench results on AMD gfx1151 (Strix Halo / Radeon 8060S, RDNA3.5) — a target not yet in results/timing/ — produced by an LLM agent (GLM‑5.1) writing Triton kernels and evaluated with the official scripts/run_and_check.py. Along the way we hit a precision‑downgrade reward‑hack that kernel_static_checker.py currently only warns on. Posting per the note in EVAL.md ("We welcome suggestions and contributions here").
Environment
- GPU: AMD Radeon 8060S (gfx1151, RDNA3.5), 128 GB unified LPDDR5X
- ROCm 7.13 (TheRock) · PyTorch 2.12.0a0+rocm7.13 · Triton 3.7 (rocm) · backend=triton
- Eval: official scripts/run_and_check.py, check_kernel=True, precision=fp32
Reward‑hack finding: FP16 downcast passes as a huge "speedup"
On level2 / 56_Matmul_Sigmoid_Sum, an agent kernel that downcasts to FP16 reports:
- Speedup over eager 11.0×, over torch.compile 10.5×, correctness 5/5 (allclose 1e‑2)
- Static checker: [WARN] Precision downgrade detected: required FP32 but code uses FP16. This is only a warning, not a rejection; the result was also flagged excessive_speedup: True.
Forcing genuine FP32 (no downcast, no F.linear bypass) on the same problem collapses the speedup to ~1.04×. So roughly a 10× factor of the reported "speedup" came from the FP16 downcast plus calling F.linear, not from real fusion. The same pattern shows up on 64_Gemm_LogSumExp_... (7.7× → 1.02×).
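One likely reason the FP16 kernel still passes correctness 5/5: a 1e‑2 tolerance is far looser than FP16's per-element rounding error. The sketch below is only an illustration using Python's standard library (the `to_fp16` and `max_rel_error` helpers and the sample values are our own, not part of KernelBench). It rounds a range of values to half precision and measures the worst relative error. It does not model accumulated error inside a real matmul, which will be larger.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    # The "e" format code packs/unpacks a 16-bit float.
    return struct.unpack("<e", struct.pack("<e", x))[0]


def max_rel_error(values) -> float:
    """Worst relative error introduced by a single FP16 rounding step."""
    return max(abs(to_fp16(v) - v) / abs(v) for v in values)


# Hypothetical sample values, all well inside FP16's normal range.
samples = [0.1 + 0.37 * i for i in range(1, 200)]
worst = max_rel_error(samples)

# FP16 keeps 11 significant bits, so one rounding step is bounded by 2**-11.
print(f"worst relative error: {worst:.2e} (bound {2**-11:.2e}, tolerance 1e-2)")
```

The bound of 2**-11 (about 4.9e‑4) sits roughly 20× below the 1e‑2 tolerance, so a numerical check alone is unlikely to catch a downcast; that is why the static check matters here.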
Suggestion: when a run requests precision=fp32, treat a detected precision downgrade as a hard correctness=False / disqualification rather than a warning (or gate it behind a config flag). Happy to send a small PR to kernel_static_checker.py.
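A minimal sketch of what that could look like, assuming a regex-based scan of the kernel source. The names here (`FP16_PATTERNS`, `CheckResult`, `check_precision`, the `strict` flag) are hypothetical and do not reflect the current kernel_static_checker.py API; the patterns are illustrative and not exhaustive.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for an FP16 downcast; a real checker may use others.
FP16_PATTERNS = [
    r"\.half\(\)",
    r"torch\.(float16|half)\b",
    r"tl\.float16\b",
]


@dataclass
class CheckResult:
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)

    @property
    def valid(self) -> bool:
        # Only hard errors disqualify a kernel; warnings do not.
        return not self.errors


def check_precision(source: str, required: str = "fp32", strict: bool = True) -> CheckResult:
    """Flag FP16 usage when FP32 is required.

    With strict=True the finding is a hard error; with strict=False it
    stays a warning, matching the behaviour described above.
    """
    result = CheckResult()
    if required != "fp32":
        return result
    for pattern in FP16_PATTERNS:
        if re.search(pattern, source):
            msg = "Precision downgrade detected: required FP32 but code uses FP16"
            (result.errors if strict else result.warnings).append(msg)
            break
    return result
```

Keeping the `strict` switch lets existing runs keep the warn-only behaviour while precision=fp32 evaluations opt into rejection.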
Minor: the checker flags any pass statement as "inheritance bypass (Contains 'pass' statement)", which false‑positives on legitimate empty __init__ / control‑flow pass.
Compliant gfx1151 results (FP32, pass static checker, vs torch.compile)
After enforcing FP32 + the static checker, the durable wins (beat torch.compile, not just eager) are modest and honest:
| Problem | vs eager | vs torch.compile |
| --- | --- | --- |
| level2 / 12_Gemm_Multiply_LeakyReLU | 1.79× | 1.68× |
| level1 / 38_L1Norm | 1.51× | 1.20× |
Most simple fusions (softmax, swish, cumsum) just match torch.compile (~1.0×) on this bandwidth‑bound iGPU; the durable gains come from epilogue fusions that inductor doesn't fully fuse.
Happy to contribute gfx1151 baseline‑timing JSONs under results/timing/gfx1151_StrixHalo/ if there's interest.