Performance tuning for YOLO-like workloads #9
|
Vectorized the hottest function on the winograd path, the |
239c3b7 to 40a7288
|
I measured the optimization and the speed stayed the same. After a careful review, here is why the benches show no speedup. Two issues matter more than the change itself:

The new benchmark doesn't actually exercise winograd. `bench_winograd_conv_modes` calls

On aarch64, winograd isn't used for YOLO at all. The runner intercepts every 3×3 group=1 conv with the indirect kernel (`conv2d_nhwc_indirect_padded`) before winograd is reached, so the NEON winograd path has no effect on YOLO on ARM. I didn't bench the Orange Pi because it would be a guaranteed null result.

WHAT SHOULD WE DO TO ACTUALLY BOOST WINOGRAD?

x86: the winograd GEMM at

Three things to do, biggest first:

ARM: `conv2d_nhwc_indirect_padded` (gemm_conv.rs:18) is the path YOLO's 3×3 convs actually take on aarch64. Its body is plain `for b / oy / ox` loops with no parallelism (no `par_chunks_mut_dispatch`). On a 4-core A53 that leaves three cores idle; parallelizing over output rows is an easy win of up to ~4× before any tiling work.

And of course, fix the bench to call `conv2d_nhwc_padded` so it measures winograd at all.

Thanks for your work. I'm looking forward to helping further! |
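A minimal sketch of the "parallelize over output rows" idea for the ARM path. Everything here is an assumption made for illustration: `conv3x3_rows_parallel` and `conv3x3_row` are hypothetical names, the kernel is a naive direct 3×3 conv rather than the indirect kernel, and `std::thread::scope` stands in for the project's rayon-based `par_chunks_mut_dispatch` so the example has no dependencies.

```rust
use std::thread;

/// Naive 3x3, stride-1, pad-1 NHWC convolution for one image, with the
/// output rows split across worker threads. Hypothetical stand-in, not the
/// project's conv2d_nhwc_indirect_padded.
/// Layouts: input [h][w][c_in], weights [3][3][c_in][c_out], output [h][w][c_out].
fn conv3x3_rows_parallel(
    input: &[f32], weights: &[f32], output: &mut [f32],
    h: usize, w: usize, c_in: usize, c_out: usize, threads: usize,
) {
    let row_len = w * c_out;
    assert!(row_len > 0 && output.len() == h * row_len);
    let threads = threads.max(1);
    // Rows handled by each worker; the last worker may get fewer.
    let rows_per_chunk = ((h + threads - 1) / threads).max(1);
    thread::scope(|s| {
        // Each worker owns a disjoint block of output rows, so no locking is needed.
        for (chunk_idx, chunk) in output.chunks_mut(rows_per_chunk * row_len).enumerate() {
            s.spawn(move || {
                let oy0 = chunk_idx * rows_per_chunk;
                for (r, row) in chunk.chunks_mut(row_len).enumerate() {
                    conv3x3_row(input, weights, row, oy0 + r, h, w, c_in, c_out);
                }
            });
        }
    });
}

/// Computes one output row `oy`; this is the unchanged sequential inner work.
fn conv3x3_row(
    input: &[f32], weights: &[f32], row: &mut [f32],
    oy: usize, h: usize, w: usize, c_in: usize, c_out: usize,
) {
    for ox in 0..w {
        for co in 0..c_out {
            let mut acc = 0.0f32;
            for ky in 0..3 {
                for kx in 0..3 {
                    // Input coordinates with pad = 1; taps that fall in the padding are skipped.
                    let iy = (oy + ky) as isize - 1;
                    let ix = (ox + kx) as isize - 1;
                    if iy < 0 || ix < 0 || iy >= h as isize || ix >= w as isize {
                        continue;
                    }
                    let (iy, ix) = (iy as usize, ix as usize);
                    for ci in 0..c_in {
                        acc += input[(iy * w + ix) * c_in + ci]
                            * weights[((ky * 3 + kx) * c_in + ci) * c_out + co];
                    }
                }
            }
            row[ox * c_out + co] = acc;
        }
    }
}
```

Because every worker writes a disjoint slice of the output, the result is identical for any thread count; only the wall time changes. Whether the real kernel gets close to 4× on an A53 would still need to be measured.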
40a7288 to 600c262
|
fixed benchmark |
80a5562 to 99e7f94
|
Thanks for following up. The structure is exactly right (route the per-α GEMMs through `matmul_2d_slices_fused_maybe_packed`, parallelize over the 16 positions, pre-pack U), but I tested it carefully and there are a few issues to deal with: a correctness bug and, on x86, a perf regression.

Let's figure out what's wrong

Bias and activation are fused into the epilogue of each of the 16 per-α GEMMs, i.e. they're applied in the Winograd transform domain, on the intermediate M[α]. That's mathematically wrong: the bias is added at all 16 positions, and the output transform then sums those copies into each output instead of adding the bias once. I verified numerically against `conv2d_nhwc_indirect_padded` (same conv, 16×16×8→16, 3×3 s1 pad 1):
The parallelization itself is bit-exact (the no-bias/no-act row), so the GEMM rework is sound. The output is only wrong once bias or activation is present, which is every conv in a real model (YOLO convs all have bias + SiLU). Note that the YOLO bench wouldn't catch this since it only checks output shape, not values (a good reason to modify the bench).

Fix: pass `GemmEpilogue::IDENTITY` to the 16 GEMMs and keep bias + activation in the output transform (step 4), where the original scalar code applied them once on the final spatial output.

Perf regression on x86

Measured on the correct (no-bias) path, yolo_p3 shape 80×80×128→256, min of 200 iters (12-core box):
~10% slower, consistent across runs. The cause is nested parallelism: the outer `par_chunks_mut_dispatch` runs over the 16 positions while each `matmul_2d_slices_fused_maybe_packed` also parallelizes internally via rayon, which leads to oversubscription.

Fix: don't nest. Either parallelize the 16 positions with the inner GEMM forced sequential (disabled `ParallelMatmulConfig`), or keep the GEMM internally parallel and loop over the 16 positions sequentially. We need to benchmark both and keep the winner.

Also a few minor things to complete

Your `winograd_input_tile_vectorized_matches_scalar` test passes despite this bug because it only checks the input-tile transform in isolation, not the full conv. Please add a full-conv correctness test (winograd vs `conv2d_nhwc_indirect_padded`) with bias + activation so this is caught going forward.

The correctness bug would fire on NEON too: it is arch-independent (it's epilogue placement, not SIMD). One thing to know though: on aarch64 the runner routes 3×3 group=1 convs through `conv2d_nhwc_indirect_padded` before winograd is reached, so winograd isn't in the YOLO path on ARM at all. The fix matters for x86 YOLO; for ARM YOLO perf the lever is `conv2d_nhwc_indirect_padded` (which is currently single-threaded).

Good catch fixing `bench_winograd_conv_modes` to use `conv2d_nhwc_padded`. I think after that we will have a pretty nice piece of code, which I hope will be the foundation for a blazing-fast winograd |
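To make the epilogue-placement bug concrete, here is a minimal sketch that pushes a constant bias through an F(2×2, 3×3) output transform Y = AᵀMA. It assumes the standard Aᵀ matrix from the Winograd convolution literature; whether this kernel uses exactly that matrix is an assumption, and `output_transform` and `bias_leak` are hypothetical names that do not exist in the codebase.

```rust
/// A^T for Winograd F(2x2, 3x3). Assumed to be the textbook matrix;
/// the project's own transform may differ.
const AT: [[f32; 4]; 2] = [[1.0, 1.0, 1.0, 0.0], [0.0, 1.0, -1.0, -1.0]];

/// Output transform Y = A^T * M * A, mapping a 4x4 transform-domain tile
/// to a 2x2 spatial output tile.
fn output_transform(m: &[[f32; 4]; 4]) -> [[f32; 2]; 2] {
    // tmp = A^T * M (2x4)
    let mut tmp = [[0.0f32; 4]; 2];
    for i in 0..2 {
        for j in 0..4 {
            for k in 0..4 {
                tmp[i][j] += AT[i][k] * m[k][j];
            }
        }
    }
    // y = tmp * A (2x2), using A[k][j] = AT[j][k]
    let mut y = [[0.0f32; 2]; 2];
    for i in 0..2 {
        for j in 0..2 {
            for k in 0..4 {
                y[i][j] += tmp[i][k] * AT[j][k];
            }
        }
    }
    y
}

/// What the spatial output receives if `bias` is added to every one of the
/// 16 transform-domain positions (the buggy epilogue placement).
fn bias_leak(bias: f32) -> [[f32; 2]; 2] {
    output_transform(&[[bias; 4]; 4])
}
```

With these matrices, `bias_leak(1.0)` evaluates to `[[9.0, -3.0], [-3.0, 1.0]]` rather than a uniform `+1.0` on each output, so the error is both scaled and position-dependent. The exact factors depend on the transform matrices in use, which is why a full-conv comparison against a reference kernel with bias + activation is the reliable check.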
99e7f94 to 3856d6e
|
reverted bias and activation, removed |
it gives us:
* internal parallelization
* advantage of `packed_b` usage (when applicable)
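One way to picture a single level of parallelism is sketched below: the 16 positions are visited one after another, and each GEMM splits its own output rows across workers. This is only an illustration under assumptions: `gemm_parallel` and `gemms_over_positions` are hypothetical names, `std::thread::scope` stands in for rayon, and no packing of B is modeled.

```rust
use std::thread;

/// C = A * B (row-major, A is m x k, B is k x n), with the rows of C split
/// across `threads` workers. Hypothetical stand-in for an internally
/// parallel GEMM with an identity epilogue (no bias, no activation).
fn gemm_parallel(a: &[f32], b: &[f32], c: &mut [f32], m: usize, k: usize, n: usize, threads: usize) {
    assert!(n > 0 && a.len() == m * k && b.len() == k * n && c.len() == m * n);
    let threads = threads.max(1);
    let rows_per_chunk = ((m + threads - 1) / threads).max(1);
    thread::scope(|s| {
        for (idx, c_chunk) in c.chunks_mut(rows_per_chunk * n).enumerate() {
            s.spawn(move || {
                for (r, c_row) in c_chunk.chunks_mut(n).enumerate() {
                    let a_row = &a[(idx * rows_per_chunk + r) * k..][..k];
                    for (j, out) in c_row.iter_mut().enumerate() {
                        *out = (0..k).map(|p| a_row[p] * b[p * n + j]).sum();
                    }
                }
            });
        }
    });
}

/// Sequential loop over the transform-domain positions; the only
/// parallelism lives inside each GEMM, so the two levels never nest.
fn gemms_over_positions(
    v: &[Vec<f32>], u: &[Vec<f32>], m_out: &mut [Vec<f32>],
    m: usize, k: usize, n: usize, threads: usize,
) {
    for ((v_a, u_a), m_a) in v.iter().zip(u).zip(m_out.iter_mut()) {
        gemm_parallel(v_a, u_a, m_a, m, k, n, threads);
    }
}
```

The alternative from the review (parallel over positions, sequential GEMM) would swap which loop spawns the workers; which of the two is faster for a given shape is an empirical question.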
3856d6e to eb2aa02
Thanks to enthropy7's benchmark refactoring, we can now see which compute-kernel path each convolution takes. Via

BENCH_COOLDOWN=0 cargo run --release --example bench_yolo --features=blas

we get this profile of YOLO models (on x86_64, YMMV):

so the winograd 3x3 path takes most of the compute time.

In the initial state (5faba42), the winograd path was plain scalar code with a vectorized GEMM function inside.