Performance tuning for YOLO-like workloads by Human9000-bit · Pull Request #9 · enthropy7/YSCV

Human9000-bit · 2026-06-27T07:04:39Z

Thanks to enthropy7's benchmark refactoring, we can now see which kernel path each convolution takes.
Running BENCH_COOLDOWN=0 cargo run --release --example bench_yolo --features=blas gives the following profile of YOLO models (on x86_64, YMMV):

11.43ms  [1, 80, 80, 64] → [1, 80, 80, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.0/cv2.0.1/conv/Conv
10.66ms  [1, 80, 80, 64] → [1, 80, 80, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.0/cv2.0.0/conv/Conv
7.63ms  [1, 160, 160, 16] → [1, 160, 160, 8]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.2/m.0/cv1/conv/Conv
5.48ms  [1, 40, 40, 128] → [1, 40, 40, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.1/cv2.1.0/conv/Conv
5.07ms  [1, 80, 80, 32] → [1, 80, 80, 16]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.4/m.0/cv1/conv/Conv
4.86ms  [1, 160, 160, 8] → [1, 160, 160, 16]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.2/m.0/cv2/conv/
3.90ms  [1, 80, 80, 16] → [1, 80, 80, 32]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.16/m.0/cv2/conv/
3.78ms  [1, 20, 20, 256] → [1, 20, 20, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.2/cv2.2.0/conv/
3.68ms  [1, 40, 40, 64] → [1, 40, 40, 64]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.23/cv2.1/cv2.1.1/conv/
3.60ms  [1, 80, 80, 32] → [1, 80, 80, 16]  k=[3, 3] s=[1, 1] via nhwc-padded/winograd-3x3 at /model.16/m.0/cv1/conv/Conv

So the winograd 3x3 path takes most of the compute time.
In the initial state (5faba42), the winograd path was scalar code with only a vectorized GEMM function inside.

Human9000-bit · 2026-06-27T07:14:11Z

Vectorized the hottest function on the winograd path, winograd_input_tile.

enthropy7 · 2026-06-27T18:04:52Z

I measured this optimization and the speed stayed the same. After careful review, the reason for the lack of speedup on the benches is that winograd_input_tile is only the input transform, the beginning of the work in a winograd conv. The batched GEMM dominates, and this PR doesn't touch it, so vectorizing the input tile can't move the needle.

Two issues that matter more than the change itself:

The new benchmark doesn't actually exercise winograd. bench_winograd_conv_modes calls conv2d_nhwc (no padding). I confirmed with the dispatch recorder that that routes to im2col-gemm, not winograd. Winograd only fires via conv2d_nhwc_padded (SAME padding). So the winograd bench measures im2col and would show no change from this PR. To actually measure winograd, call conv2d_nhwc_padded (or go through the runner with a padded 3×3 conv).

On aarch64, winograd isn't used for YOLO at all. The runner intercepts every 3×3 group=1 conv with the indirect kernel (conv2d_nhwc_indirect_padded) before winograd is reached. So the NEON winograd path has no effect on YOLO on ARM - I didn't bench the Orange Pi because it would be a guaranteed null result.

WHAT WE SHOULD DO TO ACTUALLY BOOST WINOGRAD?

(This is really important, and I just haven't had time to implement it alongside all the other work on my fw, so I would be very grateful for it.)

x86

the winograd GEMM at gemm_conv.rs:488:

for a in 0..16 {
    ////
    super::super::matmul::blas_sgemm(v_slice, u_slice, m_slice, n_tiles, c_in, c_out);
}

Three things to do, biggest first:

1. It calls raw blas_sgemm, not matmul_2d_slices_fused_maybe_packed, so it bypasses our fast packed kernels (mr12/mr6/avx512) and the fused bias/activation epilogue. Route it through the fused entry. This is the main win.
2. 16 GEMMs run sequentially. The positions a ∈ 0..16 are independent, so parallelize them (par_iter / rayon scope) or fold them into one batched GEMM.
3. The weights U are re-packed every inference; conv weights are static, so pack once via pack_b_for_session and pass packed_b in.
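The independence of the 16 positions (item 2) can be sketched in isolation. This is a minimal illustration, not the YSCV code: `sgemm_naive` is a hypothetical stand-in for the real kernel, the buffer layouts (V as [16][n_tiles][c_in], U as [16][c_in][c_out], M as [16][n_tiles][c_out]) are assumptions, and `std::thread::scope` is used only to keep the example dependency-free where the repo would presumably use rayon or its own dispatch helper.

```rust
use std::thread;

/// Naive row-major GEMM: c[m x n] = a[m x k] * b[k x n].
/// Hypothetical stand-in for the real kernel, not the YSCV API.
fn sgemm_naive(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize) {
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = acc;
        }
    }
}

/// Sequential baseline: one GEMM per Winograd position, one after another.
fn winograd_gemms_sequential(
    v: &[f32], u: &[f32], n_tiles: usize, c_in: usize, c_out: usize,
) -> Vec<f32> {
    let mut m = vec![0.0f32; 16 * n_tiles * c_out];
    for (a, m_slice) in m.chunks_mut(n_tiles * c_out).enumerate() {
        let v_slice = &v[a * n_tiles * c_in..(a + 1) * n_tiles * c_in];
        let u_slice = &u[a * c_in * c_out..(a + 1) * c_in * c_out];
        sgemm_naive(v_slice, u_slice, m_slice, n_tiles, c_in, c_out);
    }
    m
}

/// Same work, but each position runs on its own scoped thread.
/// `chunks_mut` hands out disjoint output slices, so no locking is needed.
fn winograd_gemms_parallel(
    v: &[f32], u: &[f32], n_tiles: usize, c_in: usize, c_out: usize,
) -> Vec<f32> {
    let mut m = vec![0.0f32; 16 * n_tiles * c_out];
    thread::scope(|scope| {
        for (a, m_slice) in m.chunks_mut(n_tiles * c_out).enumerate() {
            let v_slice = &v[a * n_tiles * c_in..(a + 1) * n_tiles * c_in];
            let u_slice = &u[a * c_in * c_out..(a + 1) * c_in * c_out];
            scope.spawn(move || sgemm_naive(v_slice, u_slice, m_slice, n_tiles, c_in, c_out));
        }
    });
    m
}
```

Because every position reads its own V and U slices and writes its own M slice, both variants perform the same floating-point operations in the same order per position, so their outputs should match exactly. Whether the parallel form is actually faster depends on what the inner GEMM does with threads.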

ARM

conv2d_nhwc_indirect_padded (gemm_conv.rs:18): this is the path YOLO's 3×3 convs actually take on aarch64. Its body is plain for b / oy / ox loops with no parallelism (no par_chunks_mut_dispatch). On a 4-core A53 that idles three cores - parallelizing over output rows is an easy ~4× before any tiling work.
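Parallelizing over output rows could look roughly like the sketch below. It is a naive direct 3×3 convolution, not the indirect kernel from gemm_conv.rs: the `Shape` struct, the function names, the weight layout [ky][kx][c_in][c_out], batch size 1, and the use of `std::thread::scope` instead of par_chunks_mut_dispatch are all assumptions made for illustration.

```rust
use std::thread;

#[derive(Clone, Copy)]
struct Shape { h: usize, w: usize, c_in: usize, c_out: usize }

/// Computes output row `oy` of a 3x3, stride-1, zero-padded (SAME) NHWC conv.
/// Assumed weight layout: [ky][kx][c_in][c_out]; batch size 1.
fn conv3x3_row(input: &[f32], weights: &[f32], out_row: &mut [f32], oy: usize, s: Shape) {
    for ox in 0..s.w {
        for co in 0..s.c_out {
            let mut acc = 0.0f32;
            for ky in 0..3 {
                for kx in 0..3 {
                    // Input coordinates shifted by the 1-pixel padding.
                    let iy = (oy + ky) as isize - 1;
                    let ix = (ox + kx) as isize - 1;
                    if iy < 0 || ix < 0 || iy >= s.h as isize || ix >= s.w as isize {
                        continue; // tap falls into the zero padding
                    }
                    let in_base = (iy as usize * s.w + ix as usize) * s.c_in;
                    let w_base = (ky * 3 + kx) * s.c_in * s.c_out;
                    for ci in 0..s.c_in {
                        acc += input[in_base + ci] * weights[w_base + ci * s.c_out + co];
                    }
                }
            }
            out_row[ox * s.c_out + co] = acc;
        }
    }
}

/// Single-threaded reference: plain loop over output rows.
fn conv3x3_sequential(input: &[f32], weights: &[f32], s: Shape) -> Vec<f32> {
    let mut out = vec![0.0f32; s.h * s.w * s.c_out];
    for (oy, row) in out.chunks_mut(s.w * s.c_out).enumerate() {
        conv3x3_row(input, weights, row, oy, s);
    }
    out
}

/// Splits the output into contiguous bands of rows, one band per thread.
fn conv3x3_row_parallel(input: &[f32], weights: &[f32], s: Shape, n_threads: usize) -> Vec<f32> {
    let mut out = vec![0.0f32; s.h * s.w * s.c_out];
    let n_threads = n_threads.max(1);
    let rows_per_thread = (s.h + n_threads - 1) / n_threads; // ceiling division
    thread::scope(|scope| {
        for (t, band) in out.chunks_mut(rows_per_thread * s.w * s.c_out).enumerate() {
            scope.spawn(move || {
                for (r, row) in band.chunks_mut(s.w * s.c_out).enumerate() {
                    conv3x3_row(input, weights, row, t * rows_per_thread + r, s);
                }
            });
        }
    });
    out
}
```

Each output row depends only on read-only input and weights, so bands of rows can be handed to separate cores without synchronization. The actual speedup on a given board would still need to be measured.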

And of course fix the bench to call conv2d_nhwc_padded so it measures winograd at all.

Thanks for your work. I'm looking forward to helping you further!

Human9000-bit · 2026-06-28T06:46:41Z

fixed benchmark

enthropy7 · 2026-06-28T23:30:49Z

Thanks for following up. The structure is exactly right (route the per-α GEMMs through matmul_2d_slices_fused_maybe_packed, parallelize over the 16 positions, pre-pack U), but I tested it carefully and we have to deal with a few issues: there's a correctness bug and, on x86, a perf regression.

Let's figure out what's wrong

Bias and activation are fused into the epilogue of each of the 16 per-α GEMMs, i.e. they're applied in the Winograd transform domain, on the intermediate M[α]. That's mathematically wrong:

* bias then gets counted several times over by the output transform A^T·M·A (the transform combines the 16 M[α] with ±1 coefficients) instead of being added once per output,
* activation (ReLU/SiLU) is nonlinear, so it can't be applied to the transformed intermediates at all, only to the final spatial output.
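Both points can be checked on a single 4×4 tile of M. The sketch below assumes the standard F(2×2, 3×3) output-transform matrix A^T from the Winograd convolution literature; the matrices actually used in gemm_conv.rs are not shown in this thread, so treat the exact numbers as illustrative. The function names are hypothetical.

```rust
/// Standard A^T for Winograd F(2x2, 3x3). Assumed, not copied from the repo.
const AT: [[f32; 4]; 2] = [[1.0, 1.0, 1.0, 0.0], [0.0, 1.0, -1.0, -1.0]];

/// Output transform Y = A^T * M * A for one 4x4 tile of M.
fn output_transform(m: &[[f32; 4]; 4]) -> [[f32; 2]; 2] {
    // tmp = A^T * M (2x4)
    let mut tmp = [[0.0f32; 4]; 2];
    for i in 0..2 {
        for j in 0..4 {
            for k in 0..4 {
                tmp[i][j] += AT[i][k] * m[k][j];
            }
        }
    }
    // y = tmp * A (2x2), where A[k][j] == AT[j][k]
    let mut y = [[0.0f32; 2]; 2];
    for i in 0..2 {
        for j in 0..2 {
            for k in 0..4 {
                y[i][j] += tmp[i][k] * AT[j][k];
            }
        }
    }
    y
}

/// Wrong placement: bias/ReLU applied to every M[alpha] before the transform.
fn epilogue_in_transform_domain(m: &[[f32; 4]; 4], bias: f32, relu: bool) -> [[f32; 2]; 2] {
    let mut m2 = *m;
    for row in m2.iter_mut() {
        for v in row.iter_mut() {
            *v += bias;
            if relu {
                *v = v.max(0.0);
            }
        }
    }
    output_transform(&m2)
}

/// Correct placement: transform first, then bias/ReLU once per spatial output.
fn epilogue_in_spatial_domain(m: &[[f32; 4]; 4], bias: f32, relu: bool) -> [[f32; 2]; 2] {
    let mut y = output_transform(m);
    for row in y.iter_mut() {
        for v in row.iter_mut() {
            *v += bias;
            if relu {
                *v = v.max(0.0);
            }
        }
    }
    y
}
```

With these assumed matrices, an all-zero M and a bias of 1.0 give [[9, -3], [-3, 1]] when the bias is added in the transform domain, versus the expected [[1, 1], [1, 1]] when it is added after the transform. ReLU diverges in a similar way as soon as M contains negative entries.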

I verified numerically against conv2d_nhwc_indirect_padded (same conv, 16×16×8→16, 3×3 s1 pad 1):

| config | max abs diff vs reference |
|---|---|
| no bias, no act | 3e-7 ✅ |
| bias, no act | 6.40 ❌ |
| no bias, ReLU | 1.60 ❌ |
| bias + ReLU | 6.60 ❌ |

The parallelization itself matches the reference to within float rounding (the no-bias/no-act row), so the GEMM rework is sound. The output is only wrong once bias or activation is present, which is every conv in a real model (YOLO convs all have bias + SiLU). Note the YOLO bench wouldn't catch this since it only checks output shape, not values, which is a good reason to extend the bench to check values too.

Fix: pass GemmEpilogue::IDENTITY to the 16 GEMMs and keep bias + activation in the output transform (step 4), where the original scalar code applied them once on the final spatial output.

Perf regression on x86

Measured on the correct (no-bias) path, yolo_p3 shape 80×80×128→256, min of 200 iters (12-core box):

| variant | min |
|---|---|
| baseline (scalar, sequential) | 20.4 ms |
| this PR (parallel + packed) | 22.4 ms |

~10% slower, consistent across runs. The cause is nested parallelism: the outer par_chunks_mut_dispatch runs over the 16 positions while each matmul_2d_slices_fused_maybe_packed also parallelizes internally via rayon, which leads to oversubscription.

Fix: don't nest. Either parallelize the 16 positions with the inner GEMM forced sequential (disabled ParallelMatmulConfig), or keep the GEMM internally parallel and loop over the 16 positions sequentially. We need to benchmark both and keep the winner.

A few minor things to complete as well

Your winograd_input_tile_vectorized_matches_scalar test passes despite this bug because it only checks the input-tile transform in isolation, not the full conv. Please add a full-conv correctness test (winograd vs conv2d_nhwc_indirect_padded) with bias + activation so this is caught going forward.
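The shape of such a test can be sketched on the smallest possible case: one channel, one 4×4 input tile, one 3×3 filter. The transform matrices below are the standard F(2×2, 3×3) ones and are an assumption about what the repo uses; all function names are hypothetical, and ReLU stands in for the activation. In a real multi-channel conv, the elementwise product becomes the per-position GEMM over c_in.

```rust
// Standard Winograd F(2x2, 3x3) matrices (assumed, not copied from the repo).
const BT: [[f32; 4]; 4] = [
    [1.0, 0.0, -1.0, 0.0], [0.0, 1.0, 1.0, 0.0],
    [0.0, -1.0, 1.0, 0.0], [0.0, 1.0, 0.0, -1.0],
];
const G: [[f32; 3]; 4] = [
    [1.0, 0.0, 0.0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0.0, 0.0, 1.0],
];
const AT: [[f32; 4]; 2] = [[1.0, 1.0, 1.0, 0.0], [0.0, 1.0, -1.0, -1.0]];

fn matmul<const R: usize, const K: usize, const C: usize>(
    a: &[[f32; K]; R], b: &[[f32; C]; K],
) -> [[f32; C]; R] {
    let mut out = [[0.0f32; C]; R];
    for i in 0..R {
        for j in 0..C {
            for k in 0..K {
                out[i][j] += a[i][k] * b[k][j];
            }
        }
    }
    out
}

fn transpose<const R: usize, const C: usize>(a: &[[f32; C]; R]) -> [[f32; R]; C] {
    let mut out = [[0.0f32; R]; C];
    for i in 0..R {
        for j in 0..C {
            out[j][i] = a[i][j];
        }
    }
    out
}

fn relu(x: f32) -> f32 { x.max(0.0) }

/// Winograd path: identity epilogue on M, bias + ReLU after the output transform.
fn winograd_tile(d: &[[f32; 4]; 4], g: &[[f32; 3]; 3], bias: f32) -> [[f32; 2]; 2] {
    let u = matmul(&matmul(&G, g), &transpose(&G)); // filter transform, 4x4
    let v = matmul(&matmul(&BT, d), &transpose(&BT)); // input transform, 4x4
    let mut m = [[0.0f32; 4]; 4];
    for i in 0..4 {
        for j in 0..4 {
            m[i][j] = u[i][j] * v[i][j]; // per-position product, no bias/activation
        }
    }
    let mut y = matmul(&matmul(&AT, &m), &transpose(&AT)); // output transform, 2x2
    for val in y.iter_mut().flatten() {
        *val = relu(*val + bias); // epilogue once, in the spatial domain
    }
    y
}

/// Reference: direct 3x3 correlation over the same tile, then bias + ReLU.
fn direct_tile(d: &[[f32; 4]; 4], g: &[[f32; 3]; 3], bias: f32) -> [[f32; 2]; 2] {
    let mut y = [[0.0f32; 2]; 2];
    for oy in 0..2 {
        for ox in 0..2 {
            let mut acc = bias;
            for ky in 0..3 {
                for kx in 0..3 {
                    acc += d[oy + ky][ox + kx] * g[ky][kx];
                }
            }
            y[oy][ox] = relu(acc);
        }
    }
    y
}

fn max_abs_diff(a: &[[f32; 2]; 2], b: &[[f32; 2]; 2]) -> f32 {
    a.iter().flatten().zip(b.iter().flatten()).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}
```

A test along these lines would assert that `max_abs_diff` between the two paths stays below a small tolerance with a nonzero bias and the activation enabled, which is the configuration the isolated input-tile test cannot see.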

The correctness bug would fire on NEON too: it is arch-independent (it's epilogue placement, not SIMD). One thing to know though: on aarch64 the runner routes 3×3 group=1 convs through conv2d_nhwc_indirect_padded before winograd is reached, so winograd isn't in the YOLO path on ARM at all. The fix matters for x86 YOLO; for ARM YOLO perf the lever is conv2d_nhwc_indirect_padded (which is currently single-threaded).

Good catch fixing bench_winograd_conv_modes to use conv2d_nhwc_padded. apps/bench/src/main.rs still uses conv2d_nhwc (no padding) for the yolo conv cases though → that routes to im2col-GEMM, not winograd. Switch those to conv2d_nhwc_padded too if you want them to measure this path.

I think after that we will have a pretty nice piece of code, which I hope will be the foundation for a blazing-fast winograd.

Human9000-bit · 2026-06-29T12:20:23Z

Reverted bias and activation, and removed par_chunk_dispatch in favor of a sequential loop that relies on the internal parallelism of matmul_2d_slices_fused_maybe_packed. Packing is now optional and only used for big enough inputs (how big? should_parallelize_len isn't really intended for this), as it introduces heavy regressions for smaller matrices.

It gives us:
* internal parallelization
* the advantage of `packed_b` usage (when applicable)

Human9000-bit force-pushed the yolo-tuning branch from 239c3b7 to 40a7288 Compare June 27, 2026 07:26

Human9000-bit added 2 commits June 28, 2026 11:46

add yolo-like conv benchmarks

2deff15

vectorize winograd_input_tile

600c262

Human9000-bit force-pushed the yolo-tuning branch from 40a7288 to 600c262 Compare June 28, 2026 06:46

Human9000-bit closed this Jun 28, 2026

Human9000-bit reopened this Jun 28, 2026

Human9000-bit force-pushed the yolo-tuning branch from 80a5562 to 99e7f94 Compare June 28, 2026 18:07

Human9000-bit force-pushed the yolo-tuning branch from 99e7f94 to 3856d6e Compare June 29, 2026 12:15

Use matmul_2d_slices_fused_maybe_packed

eb2aa02


Human9000-bit force-pushed the yolo-tuning branch from 3856d6e to eb2aa02 Compare June 29, 2026 12:25

Human9000-bit wants to merge 3 commits into enthropy7:main from Human9000-bit:yolo-tuning
