AMD gfx1151 (Strix Halo / RDNA3.5) ROCm results + a precision-downgrade reward-hack case for kernel_static_checker #155

## Summary

Sharing KernelBench results on **AMD gfx1151 (Strix Halo / Radeon 8060S, RDNA3.5)** — a target not yet in `results/timing/` — produced by an LLM agent (GLM‑5.1) writing Triton kernels and evaluated with the official `scripts/run_and_check.py`. Along the way we hit a **precision‑downgrade reward‑hack** that `kernel_static_checker.py` currently only *warns* on. Posting per the note in EVAL.md ("We welcome suggestions and contributions here").

## Environment
- GPU: AMD Radeon 8060S (**gfx1151**, RDNA3.5), 128 GB unified LPDDR5X
- ROCm 7.13 (TheRock) · PyTorch `2.12.0a0+rocm7.13` · Triton 3.7 (rocm) · `backend=triton`
- Eval: official `scripts/run_and_check.py`, `check_kernel=True`, `precision=fp32`

## Reward‑hack finding: FP16 downcast passes as a huge "speedup"

On `level2 / 56_Matmul_Sigmoid_Sum`, an agent kernel that **downcasts to FP16** (and falls back to `F.linear`) reports:
- Speedup over eager **11.0×**, over torch.compile **10.5×**, correctness 5/5 (allclose 1e‑2)
- Static checker: `[WARN] Precision downgrade detected: required FP32 but code uses FP16` — only a **warning**, not rejected; the result was also flagged `excessive_speedup: True`.
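To illustrate why a 1e‑2 tolerance does not catch the downcast by itself, here is a minimal stdlib-only sketch. It assumes the check is `|a - b| <= atol + rtol * |b|` with `atol = rtol = 1e-2` (the exact tolerances used by the harness are an assumption here), and it only models per-element FP16 rounding of values in a sigmoid-like range, not accumulation error inside a matmul.

```python
import random
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def allclose(a, b, rtol=1e-2, atol=1e-2):
    """Elementwise |a - b| <= atol + rtol * |b| (same formula as torch.allclose)."""
    return all(abs(x - y) <= atol + rtol * abs(y) for x, y in zip(a, b))


random.seed(0)
# Stand-in for FP32 reference outputs in a sigmoid-like range [0, 1).
reference = [random.uniform(0.0, 1.0) for _ in range(10_000)]
# The same values after a round trip through FP16.
downcast = [to_fp16(v) for v in reference]

max_abs_err = max(abs(x - y) for x, y in zip(downcast, reference))
print(f"max abs error after FP16 rounding: {max_abs_err:.2e}")
print("passes the 1e-2 check:", allclose(downcast, reference))
```

In this range the FP16 rounding error is bounded by half the largest spacing (2^-12), which sits far below the tolerance, so a numerical check alone cannot distinguish a downcast kernel from a genuine FP32 one.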

Forcing genuine FP32 (no downcast, no `F.linear` bypass) on the same problem collapses it to **~1.04×**. So roughly a 10× factor of the reported speedup came from the FP16 downcast plus calling `F.linear`, not from real fusion. The same pattern shows up on `64_Gemm_LogSumExp_...` (7.7× → 1.02×).
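The arithmetic behind that estimate, as a small sketch: dividing each reported speedup by its FP32-compliant counterpart gives the factor attributable to the downcast and bypass. The function name is hypothetical, and the figures are the ones quoted above.

```python
def non_fusion_factor(reported: float, fp32_compliant: float) -> float:
    """Share of a reported speedup that disappears once FP32 is enforced."""
    return reported / fp32_compliant


# (reported speedup, speedup after forcing genuine FP32), as quoted in this issue.
cases = {
    "56_Matmul_Sigmoid_Sum": (11.0, 1.04),
    "64_Gemm_LogSumExp_...": (7.7, 1.02),
}

for name, (reported, compliant) in cases.items():
    factor = non_fusion_factor(reported, compliant)
    print(f"{name}: {reported}x -> {compliant}x, factor lost = {factor:.1f}x")
```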

**Suggestion:** when a run requests `precision=fp32`, treat a detected precision downgrade as a hard `correctness=False` / disqualification rather than a warning (or gate it behind a config flag). Happy to send a small PR to `kernel_static_checker.py`.
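A rough sketch of what the suggested behavior could look like. This is not the actual `kernel_static_checker.py` code: `CheckResult`, `check_precision`, the `strict` flag, and the regex patterns are all hypothetical names chosen for illustration, and the real detection logic may differ.

```python
import re
from dataclasses import dataclass, field

# Hypothetical FP16 markers; the real checker may detect downgrades differently.
FP16_PATTERNS = [
    r"\.half\(\)",
    r"\btorch\.(?:float16|half)\b",
    r"\btl\.float16\b",
]


@dataclass
class CheckResult:
    valid: bool = True
    errors: list = field(default_factory=list)
    warnings: list = field(default_factory=list)


def check_precision(source: str, precision: str = "fp32", strict: bool = True) -> CheckResult:
    """Flag FP16 usage in kernel source when the run requested FP32.

    With strict=True the downgrade invalidates the result; with strict=False
    it is only recorded as a warning (the current behavior described above).
    """
    result = CheckResult()
    if precision != "fp32":
        return result
    if not any(re.search(p, source) for p in FP16_PATTERNS):
        return result
    message = "Precision downgrade detected: required FP32 but code uses FP16"
    if strict:
        result.valid = False
        result.errors.append(message)
    else:
        result.warnings.append(message)
    return result


kernel_src = "def forward(x, w):\n    return (x.half() @ w.half()).float()\n"
print(check_precision(kernel_src))                # rejected under strict mode
print(check_precision(kernel_src, strict=False))  # warning only
```

A plain regex scan also matches FP16 tokens inside comments or strings, so a real implementation would probably want to work on the AST or on tokens instead; the point here is only the warning-versus-rejection switch.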

Minor: the checker flags *any* `pass` statement as "inheritance bypass (Contains 'pass' statement)", which false‑positives on legitimate empty `__init__` / control‑flow `pass`.
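One possible way to narrow that check, sketched with the standard `ast` module: only flag a subclass whose body is nothing but `pass` (optionally with a docstring), and ignore `pass` inside methods or control flow. The function name is hypothetical and this is not how the checker is currently written.

```python
import ast


def _is_docstring(stmt: ast.stmt) -> bool:
    """True for a bare string-literal expression statement (e.g. a docstring)."""
    return (
        isinstance(stmt, ast.Expr)
        and isinstance(stmt.value, ast.Constant)
        and isinstance(stmt.value.value, str)
    )


def classes_with_pass_only_body(source: str) -> list:
    """Return names of subclasses whose body adds nothing beyond `pass`."""
    flagged = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ClassDef) and node.bases:
            body = [s for s in node.body if not _is_docstring(s)]
            # An empty remainder (docstring-only class) is also treated as a bypass.
            if all(isinstance(s, ast.Pass) for s in body):
                flagged.append(node.name)
    return flagged


example = '''
class ModelNew(Model):
    pass

class Helper(Base):
    def __init__(self):
        pass  # legitimate empty constructor

def run(flag):
    if flag:
        pass  # legitimate control-flow placeholder
'''
print(classes_with_pass_only_body(example))
```

Here only the first class is reported; the empty `__init__` and the control-flow `pass` are left alone.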

## Compliant gfx1151 results (FP32, pass static checker, vs torch.compile)

After enforcing FP32 + the static checker, the durable wins (beat **torch.compile**, not just eager) are modest and honest:

| Problem | vs eager | vs torch.compile |
|---|---|---|
| level2 / 12_Gemm_Multiply_LeakyReLU | 1.79× | **1.68×** |
| level1 / 38_L1Norm | 1.51× | **1.20×** |

Most simple fusions (softmax, swish, cumsum) just match torch.compile (~1.0×) on this bandwidth‑bound iGPU; the durable gains come from epilogue fusions that Inductor doesn't fully fuse.

Happy to contribute gfx1151 baseline‑timing JSONs under `results/timing/gfx1151_StrixHalo/` if there's interest.

