
Performance tuning for YOLO-like workloads#9

Draft
Human9000-bit wants to merge 3 commits into
enthropy7:mainfrom
Human9000-bit:yolo-tuning

@Human9000-bit

@Human9000-bit Human9000-bit commented Jun 27, 2026

Contributor

Thanks to enthropy7's benchmark refactoring, we can now see which kernel path each convolution takes.
Running BENCH_COOLDOWN=0 cargo run --release --example bench_yolo --features=blas gives the following profile for YOLO models (on x86_64, YMMV):

11.43ms  [1, 80, 80, 64] → [1, 80, 80, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.0/cv2.0.1/conv/Conv
10.66ms  [1, 80, 80, 64] → [1, 80, 80, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.0/cv2.0.0/conv/Conv
7.63ms  [1, 160, 160, 16] → [1, 160, 160, 8]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.2/m.0/cv1/conv/Conv
5.48ms  [1, 40, 40, 128] → [1, 40, 40, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.1/cv2.1.0/conv/Conv
5.07ms  [1, 80, 80, 32] → [1, 80, 80, 16]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.4/m.0/cv1/conv/Conv
4.86ms  [1, 160, 160, 8] → [1, 160, 160, 16]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.2/m.0/cv2/conv/
3.90ms  [1, 80, 80, 16] → [1, 80, 80, 32]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.16/m.0/cv2/conv/
3.78ms  [1, 20, 20, 256] → [1, 20, 20, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.2/cv2.2.0/conv/
3.68ms  [1, 40, 40, 64] → [1, 40, 40, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.1/cv2.1.1/conv/
3.60ms  [1, 80, 80, 32] → [1, 80, 80, 16]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.16/m.0/cv1/conv/Conv

So the winograd 3x3 path takes most of the compute time.
At the initial state (5faba42), the winograd path was scalar, with only the GEMM function inside it vectorized.

@Human9000-bit

Contributor Author

Vectorized the hottest function on the winograd path, winograd_input_tile.

@enthropy7

Owner

I measured that optimization and the speed stayed the same. After a careful review, the reason for the missing speedup in the benches is that winograd_input_tile is only the input transform, the first step of a winograd conv. The batched GEMM dominates, and this PR doesn't touch it, so vectorizing the input tile can't move the needle.

Two issues that matter more than the change itself:

The new benchmark doesn't actually exercise winograd. bench_winograd_conv_modes calls conv2d_nhwc (no padding). I confirmed with the dispatch recorder that this routes to im2col-gemm, not winograd. Winograd only fires via conv2d_nhwc_padded (SAME padding). So the winograd bench actually measures im2col and would show no change from this PR. To measure winograd, call conv2d_nhwc_padded (or go through the runner with a padded 3×3 conv).

On aarch64, winograd isn't used for YOLO at all. The runner intercepts every 3×3 group=1 conv with the indirect kernel (conv2d_nhwc_indirect_padded) before winograd is reached. So the NEON winograd path has no effect on YOLO on ARM - I didn't bench the Orange Pi because it would be a guaranteed null result.


WHAT SHOULD WE DO TO ACTUALLY BOOST WINOGRAD?

(This is really important, and I just haven't had the time to implement it during all my work on my fw, so I would be very grateful for it.)

x86

the winograd GEMM at gemm_conv.rs:488:

for a in 0..16 {
    ////
    super::super::matmul::blas_sgemm(v_slice, u_slice, m_slice, n_tiles, c_in, c_out);
}

Three things to do, biggest first:

  • It calls raw blas_sgemm, not matmul_2d_slices_fused_maybe_packed, so it bypasses our fast packed kernels (mr12/mr6/avx512) and the fused bias/activation epilogue. Route it through the fused entry. This is the main win.
  • The 16 GEMMs run sequentially. The positions a ∈ 0..16 are independent, so parallelize them (par_iter / rayon scope) or fold them into one batched GEMM.
  • The weights U are re-packed on every inference. Conv weights are static, so pack them once via pack_b_for_session and pass packed_b in.
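The second bullet can be sketched roughly as follows. This is only an illustration under assumptions: it uses std::thread::scope instead of rayon so it has no dependencies, naive_gemm is a hypothetical stand-in for the real kernel, and the [16][n_tiles][c_in] / [16][c_in][c_out] / [16][n_tiles][c_out] layouts for V, U and M are assumed rather than taken from gemm_conv.rs.

```rust
use std::thread;

/// Naive row-major GEMM: out[m x n] = a[m x k] * b[k x n].
/// Hypothetical stand-in for the real packed kernel.
fn naive_gemm(a: &[f32], b: &[f32], out: &mut [f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            out[i * n + j] = acc;
        }
    }
}

/// Runs the 16 independent per-position GEMMs, one scoped thread per position.
/// Assumed layouts: v = [16][n_tiles][c_in], u = [16][c_in][c_out],
/// m_out = [16][n_tiles][c_out]. All dimensions must be non-zero.
fn winograd_gemms_parallel(
    v: &[f32],
    u: &[f32],
    m_out: &mut [f32],
    n_tiles: usize,
    c_in: usize,
    c_out: usize,
) {
    let (v_len, u_len, m_len) = (n_tiles * c_in, c_in * c_out, n_tiles * c_out);
    thread::scope(|s| {
        // chunks_mut hands each thread a disjoint slice of M, so no locking is needed.
        for (alpha, m_slice) in m_out.chunks_mut(m_len).enumerate() {
            let v_slice = &v[alpha * v_len..(alpha + 1) * v_len];
            let u_slice = &u[alpha * u_len..(alpha + 1) * u_len];
            s.spawn(move || naive_gemm(v_slice, u_slice, m_slice, n_tiles, c_in, c_out));
        }
    });
}
```

If the inner GEMM is itself multi-threaded, this outer loop would nest parallelism, so one of the two levels should stay sequential.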

ARM

conv2d_nhwc_indirect_padded (gemm_conv.rs:18): this is the path YOLO's 3×3 convs actually take on aarch64. Its body is plain for b / oy / ox loops with no parallelism (no par_chunks_mut_dispatch). On a 4-core A53 that idles three cores - parallelizing over output rows is an easy ~4× before any tiling work.
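A minimal sketch of parallelizing over output rows, assuming batch size 1, a direct 3×3 stride-1 pad-1 NHWC convolution, and a [ky][kx][c_in][c_out] weight layout. The function names are hypothetical and std::thread::scope stands in for par_chunks_mut_dispatch; this is not the body of conv2d_nhwc_indirect_padded.

```rust
use std::thread;

/// Computes output row `oy` of a 3x3, stride-1, pad-1 NHWC convolution (batch 1).
/// Assumed weight layout: [ky][kx][c_in][c_out].
fn conv3x3_row(
    input: &[f32], weights: &[f32], out_row: &mut [f32],
    oy: usize, h: usize, w: usize, c_in: usize, c_out: usize,
) {
    for ox in 0..w {
        for co in 0..c_out {
            let mut acc = 0.0f32;
            for ky in 0..3 {
                for kx in 0..3 {
                    // Coordinates in the padded image; taps in the zero border are skipped.
                    let (iy, ix) = (oy + ky, ox + kx);
                    if iy < 1 || iy > h || ix < 1 || ix > w {
                        continue;
                    }
                    let in_base = ((iy - 1) * w + (ix - 1)) * c_in;
                    let w_base = (ky * 3 + kx) * c_in * c_out;
                    for ci in 0..c_in {
                        acc += input[in_base + ci] * weights[w_base + ci * c_out + co];
                    }
                }
            }
            out_row[ox * c_out + co] = acc;
        }
    }
}

/// Splits the output into bands of rows and computes each band on its own thread.
/// h, w and c_out must be non-zero.
fn conv3x3_parallel_rows(
    input: &[f32], weights: &[f32], out: &mut [f32],
    h: usize, w: usize, c_in: usize, c_out: usize, n_threads: usize,
) {
    let row_len = w * c_out;
    let n_threads = n_threads.max(1);
    let rows_per_thread = (h + n_threads - 1) / n_threads;
    thread::scope(|s| {
        for (t, band) in out.chunks_mut(rows_per_thread * row_len).enumerate() {
            s.spawn(move || {
                for (r, out_row) in band.chunks_mut(row_len).enumerate() {
                    let oy = t * rows_per_thread + r;
                    conv3x3_row(input, weights, out_row, oy, h, w, c_in, c_out);
                }
            });
        }
    });
}
```

Each thread writes a disjoint band of output rows, so the result does not depend on the thread count.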

And of course fix the bench to call conv2d_nhwc_padded so it measures winograd at all.

Thanks for your work. I'm looking forward to helping you!

@Human9000-bit

Contributor Author

fixed benchmark

@enthropy7

Owner

Thanks for following up. The structure is exactly right (route the per-α GEMMs through matmul_2d_slices_fused_maybe_packed, parallelize over the 16 positions, pre-pack U), but I tested it carefully and there are a few issues to deal with: a correctness bug and, on x86, a perf regression.

Let's figure out what's wrong

Bias and activation are fused into the epilogue of each of the 16 per-α GEMMs, i.e. they're applied in the Winograd transform domain, on the intermediate M[α]. That's mathematically wrong:

  • Bias then gets summed ~16× by the output transform A^T·M·A (the transform combines the 16 M[α] with ±1 coefficients).
  • Activation (ReLU/SiLU) is nonlinear, so it can't be applied to the transformed intermediates at all, only to the final spatial output.

I verified numerically against conv2d_nhwc_indirect_padded (same conv, 16×16×8→16, 3×3 s1 pad 1):

| config | max abs diff vs reference |
|---|---|
| no bias, no act | 3e-7 ✅ |
| bias, no act | 6.40 ❌ |
| no bias, ReLU | 1.60 ❌ |
| bias + ReLU | 6.60 ❌ |

The parallelization itself is bit-exact (the no-bias/no-act row), so the GEMM rework is sound. The output is only wrong once bias or activation is present, which is every conv in a real model (YOLO convs all have bias + SiLU). Note that the YOLO bench wouldn't catch this, since it only checks output shape, not values (a good reason to extend the bench).

Fix: pass GemmEpilogue::IDENTITY to the 16 GEMMs and keep bias + activation in the output transform (step 4), where the original scalar code applied them once on the final spatial output.
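To see why the placement matters, here is a toy one-dimensional analogue using the standard Winograd F(2,3) transforms (two outputs of a 3-tap filter from a 4-sample tile). It is not the 2D code from this PR; the function name and the bias_in_transform_domain flag are made up for illustration.

```rust
/// Winograd F(2,3): y[i] = d[i]*g[0] + d[i+1]*g[1] + d[i+2]*g[2] for i in 0..2.
/// With `bias_in_transform_domain = true` the bias is added to every M[a]
/// before the output transform, which mimics the bug described above.
fn winograd_f23(d: [f32; 4], g: [f32; 3], bias: f32, bias_in_transform_domain: bool) -> [f32; 2] {
    // Input transform: V = B^T d
    let v = [d[0] - d[2], d[1] + d[2], d[2] - d[1], d[1] - d[3]];
    // Filter transform: U = G g
    let u = [
        g[0],
        0.5 * (g[0] + g[1] + g[2]),
        0.5 * (g[0] - g[1] + g[2]),
        g[2],
    ];
    // Elementwise product in the transform domain (the "GEMM" step in 2D)
    let mut m = [0.0f32; 4];
    for a in 0..4 {
        m[a] = u[a] * v[a];
        if bias_in_transform_domain {
            m[a] += bias; // wrong place: the output transform will mix this bias
        }
    }
    // Output transform: Y = A^T M
    let mut y = [m[0] + m[1] + m[2], m[1] - m[2] - m[3]];
    if !bias_in_transform_domain {
        // right place: once, on the final spatial output
        y[0] += bias;
        y[1] += bias;
    }
    y
}
```

With d = [1, 2, 3, 4], g = [1, 0, -1] and bias = 1, the correct placement gives [-1, -1], while the transform-domain placement gives [1, -3]: the output transform turns the bias into 3·bias on the first output and -bias on the second.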

Perf regression on x86

Measured on the correct (no-bias) path, yolo_p3 shape 80×80×128→256, min of 200 iters (12-core box):

| | min |
|---|---|
| baseline (scalar, sequential) | 20.4 ms |
| this PR (parallel + packed) | 22.4 ms |

~10% slower, consistent across runs. The cause is nested parallelism: the outer par_chunks_mut_dispatch runs over the 16 positions while each matmul_2d_slices_fused_maybe_packed also parallelizes internally via rayon, which oversubscribes the cores.

Fix: don't nest. Either parallelize the 16 positions with the inner GEMM forced sequential (disabled ParallelMatmulConfig), or keep the GEMM internally parallel and loop over the 16 positions sequentially. We need to benchmark both and keep the winner.
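For comparing the two variants, a min-of-N timing helper along these lines may be enough. It is a generic sketch built on std::time only; min_time is a hypothetical name, not part of the existing bench code.

```rust
use std::time::{Duration, Instant};

/// Runs `f` `iters` times and returns the fastest single run.
/// Taking the minimum filters out scheduler noise better than the mean does.
/// Returns Duration::MAX when `iters` is 0.
fn min_time<F: FnMut()>(iters: usize, mut f: F) -> Duration {
    let mut best = Duration::MAX;
    for _ in 0..iters {
        let start = Instant::now();
        f();
        best = best.min(start.elapsed());
    }
    best
}
```

Each variant would be wrapped in a closure and timed with the same iteration count (200 in the measurement above), on the same shape and with the same thread pool size.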

Also, a few minor things to complete:

Your winograd_input_tile_vectorized_matches_scalar test passes despite this bug because it only checks the input-tile transform in isolation, not the full conv. Please add a full-conv correctness test (winograd vs conv2d_nhwc_indirect_padded) with bias + activation so this is caught going forward.
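The helpers such a full-conv test needs could look like this. It is a sketch under assumptions: both function names are hypothetical, ReLU stands in for the model's activation, and the tolerance to assert against (something like 1e-4) would have to be chosen for the real kernels.

```rust
/// Largest elementwise absolute difference between two equally sized buffers.
fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "buffers must have the same length");
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).abs())
        .fold(0.0f32, f32::max)
}

/// Reference epilogue: per-channel bias followed by ReLU, applied once on the
/// final NHWC spatial output. `bias.len()` is the channel count and must be non-zero.
fn bias_relu_nhwc(out: &mut [f32], bias: &[f32]) {
    for pixel in out.chunks_mut(bias.len()) {
        for (v, b) in pixel.iter_mut().zip(bias) {
            *v = (*v + *b).max(0.0);
        }
    }
}
```

The test would run the winograd path and conv2d_nhwc_indirect_padded on the same input with bias and activation enabled, then assert that max_abs_diff of the two outputs stays below the chosen tolerance.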

The correctness bug would fire on NEON too: it is arch-independent (it's epilogue placement, not SIMD). One thing to know though: on aarch64 the runner routes 3×3 group=1 convs through conv2d_nhwc_indirect_padded before winograd is reached, so winograd isn't in the YOLO path on ARM at all. The fix matters for x86 YOLO; for ARM YOLO perf the lever is conv2d_nhwc_indirect_padded (which is currently single-threaded).

Good catch fixing bench_winograd_conv_modes to use conv2d_nhwc_padded. apps/bench/src/main.rs still uses conv2d_nhwc (no padding) for the yolo conv cases though → that routes to im2col-GEMM, not winograd. Switch those to conv2d_nhwc_padded too if you want them to measure this path.

I think after that we will have a pretty nice piece of code, which I hope will be the foundation for a blazing-fast winograd.

@Human9000-bit

Human9000-bit commented Jun 29, 2026

Contributor Author

Reverted bias and activation, and removed par_chunk_dispatch in favor of a sequential loop that relies on the internal parallelism of matmul_2d_slices_fused_maybe_packed. Packing is now optional and only used for big enough inputs (how big? should_parallelize_len isn't really intended for this), as it introduces heavy regressions for smaller matrices.

it gives us:
* internal parallelization
* advantage of `packed_b` usage (when applicable)