transpose: add num_batches to batch independent transposes into one dispatch #124
atassis wants to merge 8 commits into
andrej left a comment
Thank you for the contribution! This will be a useful addition. Just a couple nitpicks, then please rebase on devel and we can go ahead and merge this.
Note the CI failures we're seeing should disappear after a rebase. :)
…ispatch
GEMV and StridedCopy already take num_batches to batch B independent same-shape operations into a single dispatch; Transpose did not, forcing callers to unroll B per-head/per-batch transposes into B separate dispatches for identical kernel work (a common multi-head-attention pattern). num_batches>1 lays B contiguous (M,N) matrices back-to-back and streams them through the same ObjectFifos (one task group per batch); the core still only sees s*s sub-tiles, so the kernel is unchanged. num_batches=1 (default) is byte-identical to the previous single-transpose schedule.
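The point that the core only ever handles s*s sub-tiles, regardless of batch count, can be illustrated on the host with NumPy. This is a rough sketch, not the operator's ObjectFifo code: `tiled_transpose` and `batched_tiled_transpose` are hypothetical names, and the sketch assumes M and N are both multiples of s.

```python
import numpy as np

def tiled_transpose(x, s):
    """Transpose an (M, N) matrix using only s*s sub-tile transposes."""
    M, N = x.shape
    assert M % s == 0 and N % s == 0, "sketch assumes s divides M and N"
    out = np.empty((N, M), dtype=x.dtype)
    for i in range(0, M, s):
        for j in range(0, N, s):
            # The tile at (i, j) lands at (j, i), transposed in place.
            out[j:j + s, i:i + s] = x[i:i + s, j:j + s].T
    return out

def batched_tiled_transpose(flat, M, N, s, num_batches=1):
    """Apply the same tile-level work to B back-to-back (M, N) matrices."""
    x = np.asarray(flat).reshape(num_batches, M, N)
    out = np.stack([tiled_transpose(x[b], s) for b in range(num_batches)])
    return out.reshape(-1)
```

The per-tile work is identical for every batch; only the outer loop over batches changes.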
Adds num_batches=2 (default suite) and num_batches=4 (extensive) cases to the
transpose test, with a batched golden reference. The operator's batched path was
previously untested. Verified on device (NPU2): num_batches in {1,2,4} pass.
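A batched golden reference of the kind described here could look roughly like the following NumPy sketch. The name `batched_transpose_reference` and its signature are assumptions for illustration, not the helper used in the actual test; the only thing taken from the commit message is the layout of B contiguous (M,N) matrices placed back-to-back.

```python
import numpy as np

def batched_transpose_reference(flat_input, M, N, num_batches=1):
    """Transpose each of num_batches contiguous (M, N) matrices independently."""
    x = np.asarray(flat_input).reshape(num_batches, M, N)
    # Swap the last two axes per batch, then flatten in row-major order
    # so the result is B contiguous (N, M) matrices back-to-back.
    return np.ascontiguousarray(x.transpose(0, 2, 1)).reshape(-1)
```

With num_batches=1 this reduces to an ordinary single-matrix transpose, which matches the stated default behaviour.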
Co-authored-by: André Rösti <androsti@amd.com>
722c6f9 to 4aa4c91
Drop the diff-relative phrasing ('original'/'unchanged') flagged in review;
the comment now describes the access-pattern layout as-is. Rationale moved to
the PR description.
Hi there! Thanks for the feedback, have applied it in both PRs.
I think both of these contributions are valuable, so thank you for them! Your [Xilinx/mlir-aie#3178] PR has had human eyes on it, but, speaking for myself, I would prefer to see the Copilot review issues there explained or resolved before commenting further.
@thomthehound good point, resolved them
CI Test Results
b6ae95b (2026_06_23_16_04_10) IRON - CI Summary
Examples
iron/applications/llama_3.2_1b
Small
iron/operators/axpy
iron/operators/dequant
iron/operators/elementwise_add
iron/operators/elementwise_mul
iron/operators/gelu
iron/operators/gemm
iron/operators/gemv
iron/operators/layer_norm
iron/operators/mem_copy
iron/operators/mha
iron/operators/relu
iron/operators/rms_norm
iron/operators/rope
iron/operators/sigmoid
iron/operators/silu
iron/operators/softmax
iron/operators/swiglu_decode
iron/operators/swiglu_prefill
iron/operators/tanh
iron/operators/transpose
Krackan - Small
IRON
Tested on:
iron/operators/axpy
iron/operators/dequant
iron/operators/elementwise_add
iron/operators/elementwise_mul
iron/operators/gelu
iron/operators/gemm
iron/operators/gemv
iron/operators/layer_norm
iron/operators/mem_copy
iron/operators/mha
iron/operators/relu
iron/operators/rms_norm
iron/operators/rope
iron/operators/sigmoid
iron/operators/silu
iron/operators/softmax
iron/operators/swiglu_decode
iron/operators/swiglu_prefill
iron/operators/tanh
iron/operators/transpose
Trends: IRON Trends
iron/operators/axpy
test_axpy[input_length_2048-num_aie_columns_1-tile_size_2048-scalar_factor_3.0]
test_axpy[input_length_2048-num_aie_columns_2-tile_size_1024-scalar_factor_3.0]
test_axpy[input_length_2048-num_aie_columns_4-tile_size_512-scalar_factor_3.0]
test_axpy[input_length_2048-num_aie_columns_8-tile_size_256-scalar_factor_3.0]
iron/operators/dequant
test_dequant[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-group_size_32]
test_dequant[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-group_size_32]
test_dequant[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-group_size_32]
test_dequant[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-group_size_32]
test_dequant[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-group_size_32]
test_dequant[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-group_size_32]
test_dequant[input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256-group_size_32]
test_dequant[input_length_2048-num_aie_columns_8-num_channels_2-tile_size_128-group_size_32]
iron/operators/elementwise_add
test_elementwise_add[input_length_2048-num_aie_columns_1-tile_size_2048]
test_elementwise_add[input_length_2048-num_aie_columns_2-tile_size_1024]
test_elementwise_add[input_length_2048-num_aie_columns_4-tile_size_512]
test_elementwise_add[input_length_2048-num_aie_columns_8-tile_size_256]
iron/operators/elementwise_mul
test_elementwise_mul[input_length_2048-num_aie_columns_1-tile_size_2048]
test_elementwise_mul[input_length_2048-num_aie_columns_2-tile_size_1024]
test_elementwise_mul[input_length_2048-num_aie_columns_4-tile_size_512]
test_elementwise_mul[input_length_2048-num_aie_columns_8-tile_size_256]
iron/operators/gelu
test_gelu[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048]
test_gelu[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024]
test_gelu[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024]
test_gelu[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512]
test_gelu[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512]
test_gelu[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256]
test_gelu[input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256]
test_gelu[input_length_2048-num_aie_columns_8-num_channels_2-tile_size_128]
iron/operators/gemm
test_gemm[M_1792-K_896-N_1152-num_aie_columns_8-b_col_maj_False-c_col_maj_True-m_64-k_32-n_48-trace_size_0-partition_N_1]
test_gemm[M_192-K_384-N_64-num_aie_columns_4-b_col_maj_False-c_col_maj_False-m_48-k_96-n_16-trace_size_0-partition_N_1]
test_gemm[M_192-K_384-N_64-num_aie_columns_4-b_col_maj_True-c_col_maj_True-m_48-k_96-n_16-trace_size_0-partition_N_1]
test_gemm[M_2048-K_2048-N_2048-num_aie_columns_1-b_col_maj_False-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1]
test_gemm[M_2048-K_2048-N_2048-num_aie_columns_2-b_col_maj_True-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1]
test_gemm[M_2048-K_2048-N_2048-num_aie_columns_8-b_col_maj_True-c_col_maj_True-m_64-k_64-n_64-trace_size_0-partition_N_1]
test_gemm[M_384-K_1536-N_1792-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_32-k_48-n_64-trace_size_0-partition_N_1]
test_gemm[M_64-K_512-N_256-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_16-k_64-n_64-trace_size_0-partition_N_4]
test_gemm[M_896-K_1792-N_640-num_aie_columns_8-b_col_maj_False-c_col_maj_True-m_32-k_64-n_80-trace_size_0-partition_N_1]
iron/operators/gemv
test_gemv[M_128-K_128-num_aie_columns_1-tile_size_input_32-tile_size_output_128]
test_gemv[M_2048-K_8192-num_aie_columns_1-tile_size_input_1-tile_size_output_2048]
test_gemv[M_2048-K_8192-num_aie_columns_2-tile_size_input_1-tile_size_output_1024]
test_gemv[M_2048-K_8192-num_aie_columns_4-tile_size_input_1-tile_size_output_512]
test_gemv[M_2048-K_8192-num_aie_columns_8-tile_size_input_1-tile_size_output_256]
test_gemv[M_8192-K_2048-num_aie_columns_1-tile_size_input_4-tile_size_output_1024]
test_gemv[M_8192-K_2048-num_aie_columns_2-tile_size_input_4-tile_size_output_1024]
test_gemv[M_8192-K_2048-num_aie_columns_4-tile_size_input_4-tile_size_output_1024]
test_gemv[M_8192-K_2048-num_aie_columns_8-tile_size_input_4-tile_size_output_1024]
iron/operators/layer_norm
test_layer_norm[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048]
test_layer_norm[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024]
test_layer_norm[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024]
test_layer_norm[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512]
test_layer_norm[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512]
test_layer_norm[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256]
test_layer_norm[input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256]
test_layer_norm[input_length_2048-num_aie_columns_8-num_channels_2-tile_size_128]
iron/operators/mem_copy
test_mem_copy[input_length_2048-num_cores_1-num_channels_1-bypass_False-tile_size_2048]
test_mem_copy[input_length_2048-num_cores_16-num_channels_2-bypass_False-tile_size_128]
test_mem_copy[input_length_2048-num_cores_2-num_channels_1-bypass_False-tile_size_1024]
test_mem_copy[input_length_2048-num_cores_2-num_channels_2-bypass_False-tile_size_1024]
test_mem_copy[input_length_2048-num_cores_4-num_channels_1-bypass_False-tile_size_512]
test_mem_copy[input_length_2048-num_cores_4-num_channels_2-bypass_False-tile_size_512]
test_mem_copy[input_length_2048-num_cores_8-num_channels_1-bypass_False-tile_size_256]
test_mem_copy[input_length_2048-num_cores_8-num_channels_2-bypass_False-tile_size_256]
iron/operators/mha
test_mha[seq_len_16384-dim_64-num_heads_1-num_pipelines_8-num_kv_heads_0]
iron/operators/rms_norm
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_8-num_channels_1-tile_size_256-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_8-num_channels_2-tile_size_128-weighted_False]
iron/operators/rope
test_rope[rows_32-cols_512-angle_rows_32-aie_columns_1-method_type_0]
test_rope[rows_32-cols_512-angle_rows_32-aie_columns_2-method_type_0]
test_rope[rows_32-cols_512-angle_rows_32-aie_columns_4-method_type_0]
test_rope[rows_32-cols_512-angle_rows_32-aie_columns_8-method_type_0]
test_rope[rows_32-cols_512-angle_rows_8-aie_columns_1-method_type_0]
test_rope[rows_32-cols_512-angle_rows_8-aie_columns_2-method_type_0]
test_rope[rows_32-cols_512-angle_rows_8-aie_columns_4-method_type_0]
test_rope[rows_32-cols_512-angle_rows_8-aie_columns_8-method_type_0]
iron/operators/softmax
test_softmax[input_length_32768-num_aie_columns_2-num_channels_2-tile_size_1024]
test_softmax[input_length_32768-num_aie_columns_2-num_channels_2-tile_size_2048]
test_softmax[input_length_32768-num_aie_columns_2-num_channels_2-tile_size_512]
iron/operators/swiglu_decode
test_swiglu_decode[embedding_dim_1024-hidden_dim_3584]
test_swiglu_decode[embedding_dim_2048-hidden_dim_2048]
iron/operators/swiglu_prefill
test_swiglu_prefill[seq_len_256-embedding_dim_2048-hidden_dim_2048-prio_accuracy_False]
iron/operators/transpose
test_transpose[M_2048-N_64-aie_columns_1-channels_1-m_64-n_64-s_8-num_batches_1]
test_transpose[M_2048-N_64-aie_columns_1-channels_1-m_64-n_64-s_8-num_batches_2]
test_transpose[M_2048-N_64-aie_columns_1-channels_1-m_64-n_64-s_8]
test_transpose[M_2048-N_64-aie_columns_1-channels_2-m_64-n_64-s_8-num_batches_1]
test_transpose[M_2048-N_64-aie_columns_1-channels_2-m_64-n_64-s_8]
Krackan - Examples
IRON
Tested on:
iron/applications/llama_3.2_1b
Trends: IRON Trends
iron/applications/llama_3.2_1b
test_llama_3_2_1b[llama_3.2_1b_prompt_1024_tokens_1]
test_llama_3_2_1b[llama_3.2_1b_prompt_1024_tokens_40]
test_llama_3_2_1b[llama_3.2_1b_prompt_13_tokens_1]
test_llama_3_2_1b[llama_3.2_1b_prompt_13_tokens_40]
Phoenix - Small
IRON
Tested on:
iron/operators/axpy
iron/operators/dequant
iron/operators/elementwise_add
iron/operators/elementwise_mul
iron/operators/gelu
iron/operators/gemm
iron/operators/gemv
iron/operators/layer_norm
iron/operators/mem_copy
iron/operators/relu
iron/operators/rms_norm
iron/operators/rope
iron/operators/sigmoid
iron/operators/silu
iron/operators/softmax
iron/operators/swiglu_decode
iron/operators/swiglu_prefill
iron/operators/tanh
iron/operators/transpose
Trends: IRON Trends
iron/operators/axpy
test_axpy[input_length_2048-num_aie_columns_1-tile_size_2048-scalar_factor_3.0]
test_axpy[input_length_2048-num_aie_columns_2-tile_size_1024-scalar_factor_3.0]
test_axpy[input_length_2048-num_aie_columns_4-tile_size_512-scalar_factor_3.0]
iron/operators/dequant
test_dequant[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-group_size_32]
test_dequant[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-group_size_32]
test_dequant[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-group_size_32]
test_dequant[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-group_size_32]
test_dequant[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-group_size_32]
test_dequant[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-group_size_32]
iron/operators/elementwise_add
test_elementwise_add[input_length_2048-num_aie_columns_1-tile_size_2048]
test_elementwise_add[input_length_2048-num_aie_columns_2-tile_size_1024]
test_elementwise_add[input_length_2048-num_aie_columns_4-tile_size_512]
iron/operators/elementwise_mul
test_elementwise_mul[input_length_2048-num_aie_columns_1-tile_size_2048]
test_elementwise_mul[input_length_2048-num_aie_columns_2-tile_size_1024]
test_elementwise_mul[input_length_2048-num_aie_columns_4-tile_size_512]
iron/operators/gelu
test_gelu[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048]
test_gelu[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024]
test_gelu[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024]
test_gelu[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512]
test_gelu[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512]
test_gelu[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256]
iron/operators/gemm
test_gemm[M_192-K_384-N_64-num_aie_columns_4-b_col_maj_False-c_col_maj_False-m_48-k_96-n_16-trace_size_0-partition_N_1]
test_gemm[M_192-K_384-N_64-num_aie_columns_4-b_col_maj_True-c_col_maj_True-m_48-k_96-n_16-trace_size_0-partition_N_1]
test_gemm[M_2048-K_2048-N_2048-num_aie_columns_1-b_col_maj_False-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1]
test_gemm[M_2048-K_2048-N_2048-num_aie_columns_2-b_col_maj_True-c_col_maj_False-m_64-k_64-n_64-trace_size_0-partition_N_1]
test_gemm[M_384-K_1536-N_1792-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_32-k_48-n_64-trace_size_0-partition_N_1]
test_gemm[M_64-K_512-N_256-num_aie_columns_4-b_col_maj_True-c_col_maj_False-m_16-k_64-n_64-trace_size_0-partition_N_4]
iron/operators/gemv
test_gemv[M_128-K_128-num_aie_columns_1-tile_size_input_32-tile_size_output_128]
test_gemv[M_2048-K_8192-num_aie_columns_1-tile_size_input_1-tile_size_output_2048]
test_gemv[M_2048-K_8192-num_aie_columns_2-tile_size_input_1-tile_size_output_1024]
test_gemv[M_2048-K_8192-num_aie_columns_4-tile_size_input_1-tile_size_output_512]
test_gemv[M_8192-K_2048-num_aie_columns_1-tile_size_input_4-tile_size_output_1024]
test_gemv[M_8192-K_2048-num_aie_columns_2-tile_size_input_4-tile_size_output_1024]
test_gemv[M_8192-K_2048-num_aie_columns_4-tile_size_input_4-tile_size_output_1024]
iron/operators/layer_norm
test_layer_norm[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048]
test_layer_norm[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024]
test_layer_norm[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024]
test_layer_norm[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512]
test_layer_norm[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512]
test_layer_norm[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256]
iron/operators/mem_copy
test_mem_copy[input_length_2048-num_cores_1-num_channels_1-bypass_False-tile_size_2048]
test_mem_copy[input_length_2048-num_cores_2-num_channels_1-bypass_False-tile_size_1024]
test_mem_copy[input_length_2048-num_cores_2-num_channels_2-bypass_False-tile_size_1024]
test_mem_copy[input_length_2048-num_cores_4-num_channels_1-bypass_False-tile_size_512]
test_mem_copy[input_length_2048-num_cores_4-num_channels_2-bypass_False-tile_size_512]
test_mem_copy[input_length_2048-num_cores_8-num_channels_2-bypass_False-tile_size_256]
iron/operators/rms_norm
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_1-tile_size_2048-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_1-num_channels_2-tile_size_1024-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_1-tile_size_1024-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_2-num_channels_2-tile_size_512-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-weighted_False]
test_rms_norm[input_length_2048-num_aie_columns_4-num_channels_1-tile_size_512-weighted_True]
test_rms_norm[input_length_2048-num_aie_columns_4-num_channels_2-tile_size_256-weighted_False]
iron/operators/rope
test_rope[rows_32-cols_512-angle_rows_32-aie_columns_1-method_type_0]
test_rope[rows_32-cols_512-angle_rows_32-aie_columns_2-method_type_0]
test_rope[rows_32-cols_512-angle_rows_32-aie_columns_4-method_type_0]
test_rope[rows_32-cols_512-angle_rows_8-aie_columns_1-method_type_0]
test_rope[rows_32-cols_512-angle_rows_8-aie_columns_2-method_type_0]
test_rope[rows_32-cols_512-angle_rows_8-aie_columns_4-method_type_0]
iron/operators/softmax
test_softmax[input_length_32768-num_aie_columns_2-num_channels_2-tile_size_1024]
test_softmax[input_length_32768-num_aie_columns_2-num_channels_2-tile_size_2048]
test_softmax[input_length_32768-num_aie_columns_2-num_channels_2-tile_size_512]
iron/operators/swiglu_decode
test_swiglu_decode[embedding_dim_1024-hidden_dim_3584]
test_swiglu_decode[embedding_dim_2048-hidden_dim_2048]
iron/operators/swiglu_prefill
test_swiglu_prefill[seq_len_256-embedding_dim_2048-hidden_dim_2048-prio_accuracy_False]
iron/operators/transpose
test_transpose[M_2048-N_64-aie_columns_1-channels_1-m_64-n_64-s_8-num_batches_1]
test_transpose[M_2048-N_64-aie_columns_1-channels_1-m_64-n_64-s_8-num_batches_2]
test_transpose[M_2048-N_64-aie_columns_1-channels_1-m_64-n_64-s_8]
test_transpose[M_2048-N_64-aie_columns_1-channels_2-m_64-n_64-s_8-num_batches_1]
test_transpose[M_2048-N_64-aie_columns_1-channels_2-m_64-n_64-s_8]
Phoenix - Examples
IRON
Tested on:
Trends: IRON Trends
GEMV and StridedCopy already take num_batches to batch B independent same-shape operations into a single dispatch; Transpose did not, forcing callers to unroll B per-head/per-batch transposes into B separate dispatches for identical kernel work (a common multi-head-attention pattern).

Added
num_batches on Transpose (default 1). num_batches>1 lays B contiguous (M,N) matrices back-to-back and streams them through the same ObjectFifos (one task group per batch); the core still only sees s×s sub-tiles, so the kernel is unchanged.
num_batches>1 test coverage (the batched path was previously untested), with a batched golden reference.

Changed
get_arg_spec prepends a batch dim only when num_batches>1; num_batches=1 is byte-identical to the previous single-transpose schedule.

Removed

Verified on device (NPU2): num_batches in {1, 2, 4} pass. Mirrors GEMV's existing num_batches.
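
The shape rule described under Changed, and the back-to-back layout described under Added, can be sketched as follows. Both helpers are hypothetical and only illustrate the description: the real get_arg_spec return type is not shown in this PR text, and the (N, M) output shape is assumed from the definition of a transpose.

```python
def transpose_arg_shapes(M, N, num_batches=1):
    """Return (input_shape, output_shape); batch dim is prepended only if B > 1."""
    in_shape, out_shape = (M, N), (N, M)
    if num_batches > 1:
        in_shape = (num_batches,) + in_shape
        out_shape = (num_batches,) + out_shape
    return in_shape, out_shape

def batch_offsets(M, N, num_batches=1):
    """Element offset at which each contiguous (M, N) matrix starts."""
    return [b * M * N for b in range(num_batches)]
```

Keeping the num_batches=1 shapes free of a leading batch dimension is what lets existing single-transpose callers stay unaffected.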