feat(simd): real WASM SIMD128 backend (fill simd_wasm.rs scalar scaffolding)#225

Merged
AdaWorldAPI merged 1 commit into master from claude/ndarray-wasm-scalar-zr9n46
Jun 28, 2026

Conversation

@AdaWorldAPI AdaWorldAPI commented Jun 28, 2026

Owner

Summary

src/simd_wasm.rs was commented-out scaffolding, so wasm32 silently fell back to the pure-scalar SIMD types. This fills it with a native core::arch::wasm32 v128 backend, mirroring simd_neon::aarch64_simd's proven split (native v128 for the float/byte hot path, scalar fallback for the long tail).

It also unblocks the wasm build itself — the crate did not compile for wasm32 at all before this change (pre-existing x86-only AMX leaks), so the new backend would have been unreachable.

What's new

simd_wasm::wasm32_simd (gated target_arch="wasm32" + target_feature="simd128"):

  • F32x16 / F64x8 as [v128; 4] + F32Mask16 / F64Mask8 — full API parity with the scalar macro (arith, reduce, min/max/clamp, all six compare-masks, to_bits/from_bits, cast_i32, round/floor/sqrt/abs/mul_add, operator impls, Mask::select).
  • I8x16 (one v128) = union of the scalar + NEON method sets (add/sub/min/max/cmp_gt + from_i4_packed_u64/lane_i8/saturating_abs), so consumers stay portable across every backend.
  • v128 hot kernels mirroring the NEON ones: dot_f32x4, popcount_u8x16, hamming_u8x16, hamming_u8x64 (Fingerprint<256> distance via i8x16_popcnt), base17_l1 (sign-extends i16→i32 before the subtract — no overflow), codebook_gather_f32x4, bf16_to_f32_batch.
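The Hamming kernels listed above reduce to "XOR the bytes, then count set bits". The following is a minimal sketch of that idea, not the PR's code: `hamming_u8x64_ref` is a hypothetical scalar reference, and `hamming_u8x16_v128` is an illustrative v128 variant that only compiles for wasm32 with `simd128` enabled. Both function names and the choice of pairwise-add reduction are assumptions.

```rust
/// Hypothetical scalar reference: Hamming distance over 64 bytes
/// (XOR each byte pair, then sum the per-byte popcounts).
fn hamming_u8x64_ref(a: &[u8; 64], b: &[u8; 64]) -> u32 {
    a.iter().zip(b.iter()).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Illustrative v128 variant for one 16-byte block. Compiled only on
/// wasm32 + simd128; a sketch of the technique, not the PR's kernel.
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
#[allow(dead_code)]
fn hamming_u8x16_v128(a: &[u8; 16], b: &[u8; 16]) -> u32 {
    use core::arch::wasm32::*;
    // SAFETY: both pointers reference 16 readable bytes, and v128_load
    // permits unaligned loads.
    unsafe {
        let x = v128_xor(
            v128_load(a.as_ptr() as *const v128),
            v128_load(b.as_ptr() as *const v128),
        );
        let per_byte = u8x16_popcnt(x); // popcount of every byte lane
        let s16 = u16x8_extadd_pairwise_u8x16(per_byte); // 16 x u8 -> 8 x u16
        let s32 = u32x4_extadd_pairwise_u16x8(s16); // 8 x u16 -> 4 x u32
        u32x4_extract_lane::<0>(s32)
            + u32x4_extract_lane::<1>(s32)
            + u32x4_extract_lane::<2>(s32)
            + u32x4_extract_lane::<3>(s32)
    }
}

fn main() {
    let zeros = [0u8; 64];
    let ones = [0xFFu8; 64];
    // Every bit differs: 64 bytes * 8 bits = 512.
    assert_eq!(hamming_u8x64_ref(&zeros, &ones), 512);
    assert_eq!(hamming_u8x64_ref(&ones, &ones), 0);
    println!("max distance = {}", hamming_u8x64_ref(&zeros, &ones));
}
```

A 64-byte kernel could apply the 16-byte step four times and add the results; the scalar reference is what a test would compare against.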

Documented cross-backend divergences (consistent with how NEON/AVX already differ):

  • No FMA in base simd128 → mul_add is mul+add unless +relaxed-simd (then f32x4_relaxed_madd).
  • round = f32x4_nearest (ties-to-even, same as NEON vrndnq_*).
  • IEEE NaN-propagating min/max (the existing simd_exp_f32 NaN save/restore already absorbs this).
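The three divergences above can be reproduced with plain scalar Rust, which may help when writing cross-backend tests. This is an illustrative sketch only: `min_nan_propagating` is a hypothetical helper modeling the NaN-propagating behavior described for the wasm backend, and the specific input values are chosen for the demonstration.

```rust
/// Hypothetical model of a NaN-propagating minimum: any NaN input yields NaN.
fn min_nan_propagating(a: f32, b: f32) -> f32 {
    if a.is_nan() || b.is_nan() {
        f32::NAN
    } else {
        a.min(b)
    }
}

fn main() {
    // round: ties-to-even differs from `f32::round` (ties away from zero).
    assert_eq!(2.5f32.round(), 3.0);
    assert_eq!(2.5f32.round_ties_even(), 2.0);
    assert_eq!(3.5f32.round_ties_even(), 4.0);

    // min: Rust's `f32::min` returns the non-NaN operand, while a
    // NaN-propagating min returns NaN.
    assert_eq!(f32::NAN.min(1.0), 1.0);
    assert!(min_nan_propagating(f32::NAN, 1.0).is_nan());

    // mul_add: fused (single rounding) vs. mul followed by add (two roundings).
    // a*a = 1 + 2^-11 + 2^-24 exactly; the 2^-24 term is lost when the
    // product is rounded to f32 before the add.
    let a = 1.0f32 + 1.0 / 4096.0; // 1 + 2^-12
    let c = -(1.0f32 + 1.0 / 2048.0); // -(1 + 2^-11)
    let unfused = a * a + c;
    let fused = a.mul_add(a, c);
    assert_eq!(unfused, 0.0);
    assert_eq!(fused, 1.0 / 16_777_216.0); // 2^-24
    println!("unfused = {unfused:e}, fused = {fused:e}");
}
```

Tests that compare backends bit-for-bit would need to account for the last case, since a backend without FMA produces the unfused result.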

Wiring:

  • simd.rs gets a wasm32 + simd128 dispatch arm (re-exports the 8 native names from wasm32_simd, the remainder from scalar); the full-scalar fallback arm now excludes that case.
  • PREFERRED_*_LANES get wasm 128-bit widths (F32=4 / F64=2 / U64=2 / I16=8).
  • New .cargo/config-wasm.toml enables +simd128.
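The wiring above amounts to compile-time selection by `cfg`. A minimal sketch of how a consumer might use such a lane constant follows; only the wasm value of 4 for F32 comes from the PR. The fallback value of 1 and the `chunked_sum` helper are assumptions made for illustration and do not reflect the crate's actual definitions.

```rust
// Hypothetical stand-in for the crate's lane constant. The wasm value (4)
// is from the PR; the fallback value (1) is an assumption for this sketch.
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
pub const PREFERRED_F32_LANES: usize = 4;
#[cfg(not(all(target_arch = "wasm32", target_feature = "simd128")))]
pub const PREFERRED_F32_LANES: usize = 1;

/// Hypothetical consumer: sum a slice in lane-sized chunks, keeping one
/// accumulator per lane, then fold in the remainder.
fn chunked_sum(xs: &[f32]) -> f32 {
    let mut chunks = xs.chunks_exact(PREFERRED_F32_LANES);
    let mut acc = [0.0f32; PREFERRED_F32_LANES];
    for chunk in &mut chunks {
        for (a, x) in acc.iter_mut().zip(chunk) {
            *a += *x;
        }
    }
    acc.iter().sum::<f32>() + chunks.remainder().iter().sum::<f32>()
}

fn main() {
    let xs = [1.0f32, 2.0, 3.0, 4.0, 5.0];
    assert_eq!(chunked_sum(&xs), 15.0);
    println!("lanes = {}, sum = {}", PREFERRED_F32_LANES, chunked_sum(&xs));
}
```

Because the constant is resolved at compile time, the same source builds with a different chunk width per target without any runtime dispatch.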

Unblocking the wasm build (pre-existing x86-only AMX leaks, not the SIMD scaffolding): the x86-only amx_matmul / simd_amx re-exports in simd.rs were unconditional, and backend::gemm_bf16 called amx_matmul::matmul_bf16_to_f32 directly. Gated both to target_arch="x86_64", and routed gemm_bf16 through the portable hpc::quantized::bf16_gemm_f32 on non-x86 (bit-equivalent to the AMX dispatcher's own scalar fallback). The x86 path is byte-identical by construction (the original block now lives under cfg(target_arch="x86_64")).
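The routing described above can be sketched as a `cfg` split around a portable scalar GEMM. Everything below is hypothetical: the function names, signatures, and row-major layout are assumptions for illustration, and the x86_64 arm calls the portable path only as a placeholder for the AMX dispatcher that the real crate uses.

```rust
/// Widen a bf16 bit pattern to f32 by placing it in the upper 16 bits.
fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Truncating f32 -> bf16, used here only to build test inputs.
fn f32_to_bf16_trunc(x: f32) -> u16 {
    (x.to_bits() >> 16) as u16
}

/// Hypothetical portable GEMM: C = A * B with row-major A (m x k) and
/// B (k x n). Equivalent to alpha = 1, beta = 0, so C is overwritten.
fn gemm_bf16_portable(a: &[u16], b: &[u16], c: &mut [f32], m: usize, k: usize, n: usize) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += bf16_to_f32(a[i * k + p]) * bf16_to_f32(b[p * n + j]);
            }
            c[i * n + j] = acc; // beta = 0: previous contents are discarded
        }
    }
}

/// Hypothetical dispatcher showing the shape of the cfg gate.
fn gemm_bf16(a: &[u16], b: &[u16], c: &mut [f32], m: usize, k: usize, n: usize) {
    #[cfg(target_arch = "x86_64")]
    {
        // Placeholder: the real crate routes to its AMX dispatcher here.
        gemm_bf16_portable(a, b, c, m, k, n);
    }
    #[cfg(not(target_arch = "x86_64"))]
    {
        gemm_bf16_portable(a, b, c, m, k, n);
    }
}

fn main() {
    let a: Vec<u16> = [1.0f32, 2.0, 3.0, 4.0].iter().map(|&x| f32_to_bf16_trunc(x)).collect();
    let id: Vec<u16> = [1.0f32, 0.0, 0.0, 1.0].iter().map(|&x| f32_to_bf16_trunc(x)).collect();
    let mut c = [99.0f32; 4]; // stale values that must be overwritten
    gemm_bf16(&a, &id, &mut c, 2, 2, 2);
    assert_eq!(c, [1.0, 2.0, 3.0, 4.0]);
    println!("{c:?}");
}
```

Gating the x86-only call site, rather than the whole function, keeps the public signature identical on every target.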

Verification

  • cargo build -p ndarray --lib green for wasm32 +simd128 (native), scalar, --no-default-features (no_std), and x86_64 default.
  • A faithful standalone copy of wasm32_simd compiled to wasm32+simd128 and run under node: 51 numeric checks pass — exact comparison-mask bit-patterns, saturating_abs(i8::MIN)=127, Hamming=512, Base17 vs scalar (incl. a pathological |a-b|=60000 overflow case), bf16 shift, codebook gather.
  • x86 regression: 217 SIMD tests + 85 backend/bf16 tests pass.
  • cargo clippy -p ndarray --lib -- -D warnings clean; cargo fmt --check clean.
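Two of the numeric checks named above, exact comparison-mask bit patterns and `saturating_abs(i8::MIN)=127`, can be stated as scalar assertions. The sketch below models a 4-lane mask with plain arrays; `cmp_lt_mask` and `select` are hypothetical helpers, not the crate's API, and the all-ones/all-zeros lane convention is the usual SIMD one assumed here.

```rust
/// Hypothetical 4-lane "less than" mask: all ones where a < b, else zero.
fn cmp_lt_mask(a: [f32; 4], b: [f32; 4]) -> [u32; 4] {
    core::array::from_fn(|i| if a[i] < b[i] { u32::MAX } else { 0 })
}

/// Hypothetical bitwise select: take `t` where the mask is set, else `f`.
fn select(mask: [u32; 4], t: [f32; 4], f: [f32; 4]) -> [f32; 4] {
    core::array::from_fn(|i| {
        f32::from_bits((t[i].to_bits() & mask[i]) | (f[i].to_bits() & !mask[i]))
    })
}

fn main() {
    // Comparisons involving NaN are false, so that lane's mask is zero.
    let mask = cmp_lt_mask([1.0, 5.0, 3.0, f32::NAN], [2.0, 4.0, 3.0, 0.0]);
    assert_eq!(mask, [u32::MAX, 0, 0, 0]);
    assert_eq!(select(mask, [10.0; 4], [20.0; 4]), [10.0, 20.0, 20.0, 20.0]);

    // |i8::MIN| is not representable in i8: saturating clamps to 127,
    // while wrapping returns i8::MIN unchanged.
    assert_eq!(i8::MIN.saturating_abs(), 127);
    assert_eq!(i8::MIN.wrapping_abs(), i8::MIN);
    println!("mask = {mask:08x?}");
}
```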

Adversarial review (3-angle Opus)

  • x86-regression: PASS — x86 path proven byte-identical; non-x86 BF16 fallback bit-equivalent (alpha=1, beta=0 overwrites C).
  • base17 i16 overflow (P1): real, fixed. base17_l1_wasm now widens i16→i32 via i32x4_extend_{low,high}_i16x8 before the subtract, matching the scalar reference for the full i16 range (regression test added).
  • cfg-gating no_std (P0): false positive. pub mod simd is itself #[cfg(feature="std")], so the native wasm arm is transitively std-gated; the --no-default-features wasm build is empirically clean.
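The base17 overflow finding above comes down to subtracting in i16 versus subtracting after widening to i32. The scalar sketch below reproduces the |a-b|=60000 case; both function names are hypothetical, and the 17-element length is an assumption based on the kernel's name.

```rust
/// Buggy form: the subtract happens in i16 and wraps for large differences.
fn l1_i16_wrapping(a: &[i16], b: &[i16]) -> i32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| x.wrapping_sub(y).wrapping_abs() as i32)
        .sum()
}

/// Fixed form: sign-extend each lane to i32 before the subtract.
fn l1_i16_widened(a: &[i16], b: &[i16]) -> i32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| (x as i32 - y as i32).abs())
        .sum()
}

fn main() {
    // 30000 - (-30000) = 60000, which does not fit in i16.
    let a = [30000i16; 17];
    let b = [-30000i16; 17];
    assert_eq!(l1_i16_widened(&a, &b), 17 * 60000);
    // In i16 the difference wraps to 60000 - 65536 = -5536.
    assert_eq!(l1_i16_wrapping(&a, &b), 17 * 5536);
    println!("widened = {}, wrapping = {}", l1_i16_widened(&a, &b), l1_i16_wrapping(&a, &b));
}
```

The two forms agree whenever every per-lane difference fits in i16, which is why a test restricted to small values would not catch the bug.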

Notes

  • Native U8x64 / I32x16 / U64x8 stay scalar on wasm — exactly as NEON keeps them; the free Hamming/Base17 kernels cover those hot paths.
  • Full-crate (workspace) wasm builds are still blocked by an unrelated getrandom 0.3 issue via ndarray-rand/numeric-tests; -p ndarray --lib is the correct wasm surface and is green.

🤖 Generated with Claude Code



Summary by CodeRabbit

  • New Features

    • Added WebAssembly SIMD128 support for wasm builds, unlocking faster vector operations and native SIMD types on supported browsers/runtime targets.
    • Added broader SIMD lane coverage and runtime-friendly type exports for wasm targets.
    • Added support for several new wasm-optimized math and data-processing kernels.
  • Bug Fixes

    • Fixed BF16 matrix multiplication routing so non-x86 builds use a portable path instead of x86-only behavior.
    • Improved wasm build reliability by keeping x86-specific SIMD features out of non-x86 builds.

…olding)

`src/simd_wasm.rs` was commented-out scaffolding; wasm32 fell back to the
pure-scalar SIMD types. Fill it with a native `core::arch::wasm32` v128
backend, mirroring `simd_neon::aarch64_simd`'s proven split (native v128 for
the float/byte hot path, scalar fallback for the long tail).

simd_wasm::wasm32_simd (gated wasm32 + target_feature="simd128"):
- F32x16 / F64x8 as [v128;4] + F32Mask16 / F64Mask8 — full API parity with
  the scalar macro (arith/reduce/min-max/clamp/compare-mask/to-from-bits/
  cast/round/floor/sqrt/abs/mul_add + operator impls + Mask::select).
- I8x16 (one v128) = union of the scalar + NEON method sets (add/sub/min/max/
  cmp_gt + from_i4_packed_u64/lane_i8/saturating_abs).
- v128 hot kernels mirroring the NEON ones: dot_f32x4, popcount_u8x16,
  hamming_u8x16, hamming_u8x64 (Fingerprint<256> distance via i8x16_popcnt),
  base17_l1 (i16->i32 sign-extend before subtract, no overflow),
  codebook_gather_f32x4, bf16_to_f32_batch.
- mul_add: f32x4_relaxed_madd under +relaxed-simd, else mul+add (base
  simd128 has no FMA). round = f32x4_nearest (ties-even, =NEON). NaN in
  min/max follows IEEE (=NEON). All cross-backend divergences documented.

Dispatch (simd.rs): wasm32+simd128 arm re-exports the 8 native names from
wasm32_simd and the remainder from scalar; the full-scalar fallback arm now
excludes that case. Added wasm32 PREFERRED_*_LANES (128-bit widths) and
.cargo/config-wasm.toml (-Ctarget-feature=+simd128).

Also unblock the wasm build (pre-existing x86-only AMX leaks, not the SIMD
scaffolding): gate the x86-only amx_matmul / simd_amx re-exports in simd.rs
to target_arch="x86_64", and route backend::gemm_bf16 through the portable
hpc::quantized::bf16_gemm_f32 on non-x86 (bit-equivalent to the AMX
dispatcher's own scalar fallback). x86 paths are untouched by construction.

Verified: cargo build -p ndarray --lib for wasm32 (+simd128 native /
scalar / no_std) and x86_64 all green; a faithful standalone copy of
wasm32_simd run under node passes 51 numeric checks (mask bit-patterns,
saturating_abs(i8::MIN)=127, Hamming, Base17 incl. |a-b|=60000 overflow,
bf16 shift); 217 SIMD + 85 backend/bf16 x86 tests pass; clippy -D warnings
and fmt clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NSk1qJ28eh6A2hd4JUWFRr

coderabbitai Bot commented Jun 28, 2026

Contributor

Review Change Stack

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

Implements a complete wasm32+simd128 SIMD backend in src/simd_wasm.rs with F32x16, F64x8, and I8x16 vector types plus hot kernels. Wires the backend into lane-width constants and type re-exports in src/simd.rs, restricts AMX paths to x86_64, adds a portable bf16_gemm_f32 fallback for non-x86 std builds in src/backend/mod.rs, and adds .cargo/config-wasm.toml to enable +simd128.

Changes

WASM SIMD128 Backend

| Layer / File(s) | Summary |
| --- | --- |
| Build config and dispatch wiring<br>.cargo/config-wasm.toml, src/simd.rs | Adds Cargo config enabling +simd128 rustflags for wasm targets; adds wasm32 PREFERRED_*_LANES constants; adds wasm32+simd128 re-export arm and restricts AMX re-exports to x86_64. |
| gemm_bf16 portable non-x86 fallback<br>src/backend/mod.rs | Restricts existing AMX-based gemm_bf16 std path to x86_64; adds std+non-x86_64 branch routing through portable scalar bf16_gemm_f32. |
| F32x16 and F64x8 vector types<br>src/simd_wasm.rs (lines 1–407, 409–724, 860–866) | Introduces wasm32_simd module; implements F32x16 and F64x8 (4×v128 each) with constructors, reductions, elementwise ops, comparisons, arithmetic traits, and mask types F32Mask16/F64Mask8 with select. |
| I8x16 type and hot kernels<br>src/simd_wasm.rs (lines 726–1096) | Implements I8x16 with arithmetic, i4 unpack, saturating_abs; adds dot_f32x4_wasm, hamming_u8x{16,64}_wasm, base17_l1_wasm (i32-domain to avoid i16 wrap), codebook_gather_f32x4_wasm, bf16_to_f32_batch_wasm, and wasm runtime tests. |
| Developer notes<br>.claude/blackboard.md | Adds session notes documenting the WASM SIMD128 implementation, API parity, dispatch changes, build status, and adversarial review outcomes. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • AdaWorldAPI/ndarray#203: Also modifies src/simd.rs to extend architecture-specific SIMD re-export surface for types including i8x16 across x86/NEON/scalar backends.
  • AdaWorldAPI/ndarray#217: Also modifies gemm_bf16 dispatch in src/backend/mod.rs, specifically the std/x86 AMX matmul_bf16_to_f32 routing that this PR further restricts to x86_64.

Poem

🐇 Hop hop! Through wasm lanes I leap,
Four v128 chunks of f32 to keep.
With simd128 flag I finally compile,
Base17 in i32 to dodge the wrap's guile.
The rabbit runs fast on every wasm core!

🚥 Pre-merge checks | ✅ 5 passed

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: replacing wasm SIMD scaffolding with a real SIMD128 backend. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@AdaWorldAPI AdaWorldAPI merged commit b8eaa1e into master Jun 28, 2026
16 of 17 checks passed

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 500d57e6a1


Comment thread src/simd.rs
pub use crate::simd_wasm::wasm32_simd::{f32x16, f64x8, i8x16, F32Mask16, F32x16, F64Mask8, F64x8, I8x16};
#[cfg(all(target_arch = "wasm32", target_feature = "simd128", not(feature = "nightly-simd")))]
pub use scalar::{
batch_packed_i4_16, f32x8, f64x4, i16x16, i16x32, i32x16, i32x8, i64x4, i64x8, i8x32, i8x64, palette_lookup_u8x8,


P2: Use the wasm I8x16 in batch_packed_i4_16

When building wasm32 with +simd128, crate::simd::I8x16 is the native v128 type re-exported just above, but this line re-exports scalar::batch_packed_i4_16, whose closure receives simd::scalar::I8x16 (src/simd_scalar.rs:1630). That makes portable callers fail only on wasm SIMD if they pass a helper taking crate::simd::I8x16 or write into out: &mut [I8x16], and it also bypasses the new native byte-lane backend for this W1a primitive. Re-export a wasm implementation of batch_packed_i4_16 built on wasm32_simd::I8x16, or keep I8x16 scalar in this dispatch arm.

