Add 64-bit integer vectors and operations on them #253
The documentation for load/store_interleaved_128 was misleading. Both formulations are valid for 32-bit elements, but the 8- and 16-bit elements already behaved differently, following the NEON vld4/vst4 semantics rather than our documented semantics. This misled me into generalizing the op to 64-bit numbers incorrectly. In subsequent commits I've changed the implementation back to vld4/vst4 semantics and updated the documentation.
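To make the vld4/vst4 behaviour concrete, here is a minimal sketch in plain Rust, assuming the flat-array view of the operation: element `4 * i + k` of the input ends up in lane `i` of output vector `k`. The function names `deinterleave4` and `interleave4` are hypothetical and are not the crate's API; the crate's real ops work on SIMD vector types rather than arrays.

```rust
/// Deinterleave `N` elements into four streams, NEON `vld4`-style.
/// Element `4 * i + k` of `src` lands in lane `i` of stream `k`;
/// the four streams are stored back to back in the result.
/// (Hypothetical helper for illustration only.)
fn deinterleave4<T: Copy + Default, const N: usize>(src: [T; N]) -> [T; N] {
    assert!(N % 4 == 0, "element count must be a multiple of 4");
    let lanes = N / 4;
    let mut out = [T::default(); N];
    for k in 0..4 {
        for i in 0..lanes {
            out[k * lanes + i] = src[4 * i + k];
        }
    }
    out
}

/// Inverse of `deinterleave4`, NEON `vst4`-style.
/// (Hypothetical helper for illustration only.)
fn interleave4<T: Copy + Default, const N: usize>(src: [T; N]) -> [T; N] {
    assert!(N % 4 == 0, "element count must be a multiple of 4");
    let lanes = N / 4;
    let mut out = [T::default(); N];
    for k in 0..4 {
        for i in 0..lanes {
            out[4 * i + k] = src[k * lanes + i];
        }
    }
    out
}
```

With eight `u64` values `0..8` (four 128-bit vectors of two lanes each), `deinterleave4` yields `[0, 4, 1, 5, 2, 6, 3, 7]`. With sixteen elements the mapping is a 4x4 transpose and therefore its own inverse, which may be part of why the 32-bit case hides the difference between formulations; that reading is a guess, not something the PR states.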
pub(crate) fn unrolled_array(len: usize, item: impl FnMut(usize) -> TokenStream) -> TokenStream {
    let items = (0..len).map(item).collect::<Vec<_>>();
    quote! { [#(#items),*] }
}

pub(crate) fn scalar_binary(f: TokenStream, vec_ty: &VecType, simd: impl ToTokens) -> TokenStream {
    let scalar = vec_ty.scalar.rust(vec_ty.scalar_bits);
    let len = vec_ty.len;
    let items = unrolled_array(len, |idx| quote! { #f(a[#idx], b[#idx]) });

    quote! {
        let a: [#scalar; #len] = a.into();
        let b: [#scalar; #len] = b.into();
        let result: [#scalar; #len] = #items;
        result.simd_into(#simd)
    }
}

pub(crate) fn scalar_binary_method(
    method: &str,
    vec_ty: &VecType,
    simd: impl ToTokens,
) -> TokenStream {
    let method = Ident::new(method, Span::call_site());
    let scalar = vec_ty.scalar.rust(vec_ty.scalar_bits);
    let len = vec_ty.len;
    let items = unrolled_array(len, |idx| quote! { a[#idx].#method(b[#idx]) });

    quote! {
        let a: [#scalar; #len] = a.into();
        let b: [#scalar; #len] = b.into();
        let result: [#scalar; #len] = #items;
        result.simd_into(#simd)
    }
}

pub(crate) fn scalar_shift(f: TokenStream, vec_ty: &VecType, simd: impl ToTokens) -> TokenStream {
    let scalar = vec_ty.scalar.rust(vec_ty.scalar_bits);
    let len = vec_ty.len;
    let items = unrolled_array(len, |idx| quote! { #f(a[#idx], shift) });

    quote! {
        let a: [#scalar; #len] = a.into();
        let result: [#scalar; #len] = #items;
        result.simd_into(#simd)
    }
}

pub(crate) fn scalar_compare(method: &str, vec_ty: &VecType, simd: impl ToTokens) -> TokenStream {
    let scalar = vec_ty.scalar.rust(vec_ty.scalar_bits);
    let mask_scalar = ScalarType::Mask.rust(vec_ty.scalar_bits);
    let len = vec_ty.len;
    let op = match method {
        "simd_eq" => quote! { == },
        "simd_lt" => quote! { < },
        "simd_le" => quote! { <= },
        "simd_ge" => quote! { >= },
        "simd_gt" => quote! { > },
        _ => unreachable!("unsupported scalar comparison: {method}"),
    };
    let items = unrolled_array(len, |idx| {
        quote! { if a[#idx] #op b[#idx] { true_lane } else { false_lane } }
    });

    quote! {
        let a: [#scalar; #len] = a.into();
        let b: [#scalar; #len] = b.into();
        let true_lane: #mask_scalar = !0;
        let false_lane: #mask_scalar = 0;
        let result: [#mask_scalar; #len] = #items;
        result.simd_into(#simd)
This basically duplicates the scalar fallback code, but I didn't want to do a big refactoring here that would change the scalar fallback level.
That refactoring might be worth it, though, considering that #256 also needs it.
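For readers unfamiliar with the pattern, the following is a hand-written sketch of roughly what these generators would expand to for a two-lane `u64` vector. It is an illustration under assumptions, not the generated output: plain arrays stand in for the crate's vector types (so the `.into()` and `simd_into` conversions are omitted), the function names are hypothetical, and `i64` as the 64-bit mask lane type is assumed rather than confirmed by the diff.

```rust
/// Roughly what a `scalar_binary_method("wrapping_add", ...)` expansion
/// looks like once unrolled for two lanes. Hypothetical name; arrays
/// stand in for the real vector type.
fn wrapping_add_u64x2(a: [u64; 2], b: [u64; 2]) -> [u64; 2] {
    // One expression per lane, exactly as `unrolled_array` lays them out.
    [a[0].wrapping_add(b[0]), a[1].wrapping_add(b[1])]
}

/// Roughly what a `scalar_compare("simd_lt", ...)` expansion looks like.
/// The `i64` mask lane type is an assumption made for this sketch.
fn simd_lt_u64x2(a: [u64; 2], b: [u64; 2]) -> [i64; 2] {
    let true_lane: i64 = !0; // all bits set
    let false_lane: i64 = 0; // no bits set
    [
        if a[0] < b[0] { true_lane } else { false_lane },
        if a[1] < b[1] { true_lane } else { false_lane },
    ]
}
```

Each lane is computed with ordinary scalar code and the results are gathered back into an array, which is the same shape as a scalar fallback and explains the duplication noted above.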
Add i64/u64 vector types and operations across the generated SIMD backends, with focused int64 coverage and optimized interleaved load/store paths where available.
I'm curious: do you think this will have much impact on the crate's overall compile time, even if none of the 64-bit stuff is used? Have you done any measurements?
It really shouldn't. This is all generic code, so it is not actually instantiated and doesn't turn into MIR or LLVM IR until something actually calls it. The downside of generics is that if we call the same function 5 times, we get 5 different instantiations of it, so 5x the IR for LLVM to chew through. In our case, though, we want all the intrinsics inlined anyway, so this is unavoidable, generics or not.
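The instantiation point can be illustrated with a small sketch. Everything here is hypothetical: the `Backend` trait and the two marker types are stand-ins for the crate's SIMD tokens, chosen only to show that a generic function is compiled to machine code once per concrete type that some caller uses, and not at all if nothing calls it.

```rust
/// Hypothetical stand-in for a SIMD backend token; not the crate's real trait.
trait Backend: Copy {
    fn add_u64x2(self, a: [u64; 2], b: [u64; 2]) -> [u64; 2];
}

#[derive(Clone, Copy)]
struct BackendA;

#[derive(Clone, Copy)]
struct BackendB;

impl Backend for BackendA {
    #[inline(always)]
    fn add_u64x2(self, a: [u64; 2], b: [u64; 2]) -> [u64; 2] {
        [a[0].wrapping_add(b[0]), a[1].wrapping_add(b[1])]
    }
}

impl Backend for BackendB {
    #[inline(always)]
    fn add_u64x2(self, a: [u64; 2], b: [u64; 2]) -> [u64; 2] {
        let mut out = a;
        for (o, x) in out.iter_mut().zip(b) {
            *o = o.wrapping_add(x);
        }
        out
    }
}

/// Generic over the backend. Calling it with `BackendA` and with `BackendB`
/// produces two separate instantiations, each with its op inlined.
fn sum_of_add<S: Backend>(simd: S, a: [u64; 2], b: [u64; 2]) -> u64 {
    let r = simd.add_u64x2(a, b);
    r[0].wrapping_add(r[1])
}

/// Generic but never called in this sketch, so no instantiation of it
/// is ever code-generated.
#[allow(dead_code)]
fn never_called<S: Backend>(simd: S, a: [u64; 2]) -> [u64; 2] {
    simd.add_u64x2(a, a)
}
```

Calling `sum_of_add(BackendA, ..)` and `sum_of_add(BackendB, ..)` gives the same numeric result through two distinct instantiations, while `never_called` costs nothing at code generation time because no concrete `S` is ever supplied for it.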
Stacked on top of #231, because many 64-bit ops (e.g. min/max) were only added in AVX-512.
Supersedes #97