feat(simd): real WASM SIMD128 backend (fill simd_wasm.rs scalar scaffolding) by AdaWorldAPI · Pull Request #225 · AdaWorldAPI/ndarray

AdaWorldAPI · 2026-06-28T16:03:35Z

Summary

src/simd_wasm.rs was commented-out scaffolding, so wasm32 silently fell back to the pure-scalar SIMD types. This fills it with a native core::arch::wasm32 v128 backend, mirroring simd_neon::aarch64_simd's proven split (native v128 for the float/byte hot path, scalar fallback for the long tail).

It also unblocks the wasm build itself: before this change the crate did not compile for wasm32 at all, because x86-only AMX items leaked into non-x86 builds (a pre-existing problem), so the new backend would have been unreachable.

What's new

simd_wasm::wasm32_simd (gated target_arch="wasm32" + target_feature="simd128"):

F32x16 / F64x8 as [v128; 4] + F32Mask16 / F64Mask8 — full API parity with the scalar macro (arith, reduce, min/max/clamp, all six compare-masks, to_bits/from_bits, cast_i32, round/floor/sqrt/abs/mul_add, operator impls, Mask::select).
I8x16 (one v128) = union of the scalar + NEON method sets (add/sub/min/max/cmp_gt + from_i4_packed_u64/lane_i8/saturating_abs), so consumers stay portable across every backend.
v128 hot kernels mirroring the NEON ones: dot_f32x4, popcount_u8x16, hamming_u8x16, hamming_u8x64 (Fingerprint<256> distance via i8x16_popcnt), base17_l1 (sign-extends i16→i32 before the subtract — no overflow), codebook_gather_f32x4, bf16_to_f32_batch.
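For readers who want the semantics of two of these kernels without reading the intrinsics, here is a minimal scalar sketch. It is not the code from `simd_wasm.rs`; the function names `hamming_bytes` and `saturating_abs_i8x16` are hypothetical, and the assumption is only that the v128 kernels compute the same result 16 bytes at a time.

```rust
/// Scalar reference for a byte-wise Hamming distance: XOR the inputs,
/// then count the set bits of every byte. (Hypothetical helper name.)
pub fn hamming_bytes(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len(), "inputs must have equal length");
    a.iter().zip(b).map(|(x, y)| (x ^ y).count_ones()).sum()
}

/// Lane-wise saturating absolute value over 16 i8 lanes.
/// i8::MIN has no positive counterpart, so it saturates to 127
/// instead of wrapping back to -128.
pub fn saturating_abs_i8x16(v: [i8; 16]) -> [i8; 16] {
    v.map(i8::saturating_abs)
}
```

With 64 bytes that differ in every bit, `hamming_bytes` returns 512, which matches the `Hamming=512` check listed under Verification.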

Documented cross-backend divergences (consistent with how NEON/AVX already differ):

No FMA in base simd128 → mul_add is mul+add unless +relaxed-simd (then f32x4_relaxed_madd).
round = f32x4_nearest (ties-to-even, same as NEON vrndnq_*).
IEEE NaN-propagating min/max (the existing simd_exp_f32 NaN save/restore already absorbs this).
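The three divergences above can be reproduced with plain scalar Rust. This is a sketch of the semantics only, under the assumption that the wasm lanes behave like the standard-library operations named in the comments; the helper names are hypothetical and nothing here is taken from the crate.

```rust
/// Ties-to-even rounding, the behaviour the PR describes for `f32x4_nearest`.
/// Contrast with `f32::round`, which rounds halfway cases away from zero.
pub fn round_nearest_even(x: f32) -> f32 {
    x.round_ties_even()
}

/// NaN-propagating minimum: a NaN in either operand yields NaN.
/// Contrast with `f32::min`, which returns the non-NaN operand.
pub fn min_nan_propagating(a: f32, b: f32) -> f32 {
    if a.is_nan() || b.is_nan() {
        f32::NAN
    } else {
        a.min(b)
    }
}

/// Unfused multiply-add: the product is rounded to f32 before the add.
/// `f32::mul_add` is the fused form and rounds only once.
pub fn mul_add_unfused(a: f32, b: f32, c: f32) -> f32 {
    a * b + c
}
```

For example, `round_nearest_even(2.5)` is `2.0` while `2.5f32.round()` is `3.0`. With `a = 1 + 2^-12` and `c = -(1 + 2^-11)`, `mul_add_unfused(a, a, c)` is `0.0`, whereas `a.mul_add(a, c)` keeps the `2^-24` term that the intermediate rounding discards.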

Wiring:

simd.rs gets a wasm32 + simd128 dispatch arm (re-exports the 8 native names from wasm32_simd, the remainder from scalar); the full-scalar fallback arm now excludes that case.
PREFERRED_*_LANES get wasm 128-bit widths (F32=4 / F64=2 / U64=2 / I16=8).
New .cargo/config-wasm.toml enables +simd128.
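The dispatch shape described above can be sketched with two mutually exclusive `cfg` arms. The module and constant names below (`demo_backend`, `NAME`, `F32_LANES`, `F64_LANES`) are hypothetical stand-ins, not the crate's `PREFERRED_*_LANES` items, and the fallback lane counts are illustrative values only.

```rust
// Arm taken only when compiling for wasm32 with simd128 enabled,
// e.g. via `-C target-feature=+simd128` in the rustflags.
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
mod demo_backend {
    pub const NAME: &str = "wasm32+simd128";
    pub const F32_LANES: usize = 4; // one v128 holds four f32 lanes
    pub const F64_LANES: usize = 2; // one v128 holds two f64 lanes
}

// Fallback arm: the negation of the condition above, so exactly one
// `demo_backend` module exists for any target.
#[cfg(not(all(target_arch = "wasm32", target_feature = "simd128")))]
mod demo_backend {
    pub const NAME: &str = "fallback";
    pub const F32_LANES: usize = 1; // illustrative value, not the crate's
    pub const F64_LANES: usize = 1; // illustrative value, not the crate's
}

// Callers see one set of names regardless of which arm was compiled.
pub use demo_backend::{F32_LANES, F64_LANES, NAME};

/// Number of F32_LANES-wide chunks needed to cover `len` elements.
pub fn chunks_needed(len: usize) -> usize {
    (len + F32_LANES - 1) / F32_LANES
}
```

Writing the fallback as the exact negation of the native condition is what keeps the two arms from both matching, which is the point of the "fallback arm now excludes that case" change.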

Unblocking the wasm build (pre-existing x86-only AMX leaks, not the SIMD scaffolding): the x86-only amx_matmul / simd_amx re-exports in simd.rs were unconditional, and backend::gemm_bf16 called amx_matmul::matmul_bf16_to_f32 directly. Gated both to target_arch="x86_64", and routed gemm_bf16 through the portable hpc::quantized::bf16_gemm_f32 on non-x86 (bit-equivalent to the AMX dispatcher's own scalar fallback). The x86 path is byte-identical by construction (the original block now lives under cfg(target_arch="x86_64")).
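As a rough illustration of what a portable BF16 GEMM fallback computes, here is a naive sketch. It is not `hpc::quantized::bf16_gemm_f32`; the signature, the row-major layout, and the names `bf16_bits_to_f32` and `bf16_gemm_f32_sketch` are assumptions made for the example.

```rust
/// Widen one bf16 value (raw u16 bits) to f32 by shifting it into the
/// upper 16 bits of an f32 bit pattern; the low mantissa bits become zero.
pub fn bf16_bits_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Naive row-major C = A * B with bf16 inputs and f32 accumulation.
/// C is overwritten, i.e. alpha = 1 and beta = 0.
/// A is m x k, B is k x n, C is m x n. (Hypothetical signature.)
pub fn bf16_gemm_f32_sketch(
    a: &[u16],
    b: &[u16],
    c: &mut [f32],
    m: usize,
    n: usize,
    k: usize,
) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += bf16_bits_to_f32(a[i * k + p]) * bf16_bits_to_f32(b[p * n + j]);
            }
            c[i * n + j] = acc; // overwrite, do not accumulate into old C
        }
    }
}
```

In the crate, the choice between this kind of portable path and the AMX path is made with `cfg(target_arch = "x86_64")`, so non-x86 targets never reference the AMX symbols at all.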

Verification

cargo build -p ndarray --lib green for wasm32 +simd128 (native), scalar, --no-default-features (no_std), and x86_64 default.
A faithful standalone copy of wasm32_simd compiled to wasm32+simd128 and run under node: 51 numeric checks pass — exact comparison-mask bit-patterns, saturating_abs(i8::MIN)=127, Hamming=512, Base17 vs scalar (incl. a pathological |a-b|=60000 overflow case), bf16 shift, codebook gather.
x86 regression: 217 SIMD tests + 85 backend/bf16 tests pass.
cargo clippy -p ndarray --lib -- -D warnings clean; cargo fmt --check clean.
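The "comparison-mask bit-patterns" checks refer to the usual SIMD convention that a lane-wise compare yields all-ones for true and all-zeros for false, and that select is a bitwise blend. A minimal scalar model of that convention over four f32 lanes follows; the function names are hypothetical and the real types operate on `v128` values rather than arrays.

```rust
/// Lane-wise `a < b`: each result lane is 0xFFFF_FFFF when true, 0 when false.
/// Comparisons involving NaN are false, so those lanes are 0.
pub fn cmp_lt_mask(a: [f32; 4], b: [f32; 4]) -> [u32; 4] {
    std::array::from_fn(|i| if a[i] < b[i] { u32::MAX } else { 0 })
}

/// Bitwise select: take the bits of `t` where the mask is set,
/// and the bits of `f` where it is clear.
pub fn select(mask: [u32; 4], t: [f32; 4], f: [f32; 4]) -> [f32; 4] {
    std::array::from_fn(|i| {
        f32::from_bits((mask[i] & t[i].to_bits()) | (!mask[i] & f[i].to_bits()))
    })
}
```

Because the mask lanes are either all-ones or all-zeros, the blend returns one input or the other exactly, never a mix of bits.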

Adversarial review (3-angle Opus)

x86-regression: PASS — x86 path proven byte-identical; non-x86 BF16 fallback bit-equivalent (alpha=1, beta=0 overwrites C).
base17 i16 overflow (P1): real, fixed — base17_l1_wasm now widens i16→i32 via i32x4_extend_{low,high}_i16x8 before the subtract, matching the scalar reference for the full i16 range (regression test added).
cfg-gating no_std (P0): false positive — pub mod simd is itself #[cfg(feature="std")], so the native wasm arm is transitively std-gated; the --no-default-features wasm build is empirically clean.
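The base17 overflow finding is easy to reproduce in scalar code. The sketch below contrasts subtracting in i16 (which wraps) with widening to i32 first; it models the arithmetic only, and the function names are hypothetical rather than taken from `base17_l1_wasm`.

```rust
/// L1 distance over i16 lanes with the subtraction done in i32,
/// so |a - b| is exact for the full i16 range.
pub fn l1_i16_widened(a: &[i16], b: &[i16]) -> u32 {
    assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b)
        .map(|(&x, &y)| (x as i32 - y as i32).unsigned_abs())
        .sum()
}

/// The overflow-prone shape: subtract in i16, where the difference wraps
/// modulo 2^16 before the absolute value is taken.
pub fn l1_i16_wrapping(a: &[i16], b: &[i16]) -> u32 {
    assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b)
        .map(|(&x, &y)| x.wrapping_sub(y).unsigned_abs() as u32)
        .sum()
}
```

For `a = [30000]` and `b = [-30000]` the widened version returns 60000, while the wrapping version returns 5536 because 60000 does not fit in an i16 and wraps to -5536.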

Notes

Native U8x64 / I32x16 / U64x8 stay scalar on wasm — exactly as NEON keeps them; the free Hamming/Base17 kernels cover those hot paths.
Full-crate (workspace) wasm builds are still blocked by an unrelated getrandom 0.3 issue via ndarray-rand/numeric-tests; -p ndarray --lib is the correct wasm surface and is green.

🤖 Generated with Claude Code


Summary by CodeRabbit

New Features
- Added WebAssembly SIMD128 support for wasm builds, unlocking faster vector operations and native SIMD types on supported browsers/runtime targets.
- Added broader SIMD lane coverage and runtime-friendly type exports for wasm targets.
- Added support for several new wasm-optimized math and data-processing kernels.
Bug Fixes
- Fixed BF16 matrix multiplication routing so non-x86 builds use a portable path instead of x86-only behavior.
- Improved wasm build reliability by keeping x86-specific SIMD features out of non-x86 builds.

…olding)

`src/simd_wasm.rs` was commented-out scaffolding; wasm32 fell back to the pure-scalar SIMD types. Fill it with a native `core::arch::wasm32` v128 backend, mirroring `simd_neon::aarch64_simd`'s proven split (native v128 for the float/byte hot path, scalar fallback for the long tail).

simd_wasm::wasm32_simd (gated wasm32 + target_feature="simd128"):
- F32x16 / F64x8 as [v128;4] + F32Mask16 / F64Mask8 — full API parity with the scalar macro (arith/reduce/min-max/clamp/compare-mask/to-from-bits/cast/round/floor/sqrt/abs/mul_add + operator impls + Mask::select).
- I8x16 (one v128) = union of the scalar + NEON method sets (add/sub/min/max/cmp_gt + from_i4_packed_u64/lane_i8/saturating_abs).
- v128 hot kernels mirroring the NEON ones: dot_f32x4, popcount_u8x16, hamming_u8x16, hamming_u8x64 (Fingerprint<256> distance via i8x16_popcnt), base17_l1 (i16->i32 sign-extend before subtract, no overflow), codebook_gather_f32x4, bf16_to_f32_batch.
- mul_add: f32x4_relaxed_madd under +relaxed-simd, else mul+add (base simd128 has no FMA). round = f32x4_nearest (ties-even, =NEON). NaN in min/max follows IEEE (=NEON). All cross-backend divergences documented.

Dispatch (simd.rs): wasm32+simd128 arm re-exports the 8 native names from wasm32_simd and the remainder from scalar; the full-scalar fallback arm now excludes that case. Added wasm32 PREFERRED_*_LANES (128-bit widths) and .cargo/config-wasm.toml (-Ctarget-feature=+simd128).

Also unblock the wasm build (pre-existing x86-only AMX leaks, not the SIMD scaffolding): gate the x86-only amx_matmul / simd_amx re-exports in simd.rs to target_arch="x86_64", and route backend::gemm_bf16 through the portable hpc::quantized::bf16_gemm_f32 on non-x86 (bit-equivalent to the AMX dispatcher's own scalar fallback). x86 paths are untouched by construction.

Verified: cargo build -p ndarray --lib for wasm32 (+simd128 native / scalar / no_std) and x86_64 all green; a faithful standalone copy of wasm32_simd run under node passes 51 numeric checks (mask bit-patterns, saturating_abs(i8::MIN)=127, Hamming, Base17 incl. |a-b|=60000 overflow, bf16 shift); 217 SIMD + 85 backend/bf16 x86 tests pass; clippy -D warnings and fmt clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01NSk1qJ28eh6A2hd4JUWFRr

coderabbitai · 2026-06-28T16:03:57Z

Caution

Review failed

Pull request was closed or merged during review

Walkthrough

Implements a complete wasm32+simd128 SIMD backend in src/simd_wasm.rs with F32x16, F64x8, and I8x16 vector types plus hot kernels. Wires the backend into lane-width constants and type re-exports in src/simd.rs, restricts AMX paths to x86_64, adds a portable bf16_gemm_f32 fallback for non-x86 std builds in src/backend/mod.rs, and adds .cargo/config-wasm.toml to enable +simd128.

Changes

WASM SIMD128 Backend

Layer / File(s)	Summary
Build config and dispatch wiring `.cargo/config-wasm.toml`, `src/simd.rs`	Adds Cargo config enabling `+simd128` rustflags for wasm targets; adds `wasm32` `PREFERRED_*_LANES` constants; adds `wasm32+simd128` re-export arm and restricts AMX re-exports to `x86_64`.
`gemm_bf16` portable non-x86 fallback `src/backend/mod.rs`	Restricts existing AMX-based `gemm_bf16` std path to `x86_64`; adds `std+non-x86_64` branch routing through portable scalar `bf16_gemm_f32`.
`F32x16` and `F64x8` vector types `src/simd_wasm.rs` (lines 1–407, 409–724, 860–866)	Introduces `wasm32_simd` module; implements `F32x16` and `F64x8` (4×`v128` each) with constructors, reductions, elementwise ops, comparisons, arithmetic traits, and mask types `F32Mask16`/`F64Mask8` with `select`.
`I8x16` type and hot kernels `src/simd_wasm.rs` (lines 726–1096)	Implements `I8x16` with arithmetic, i4 unpack, `saturating_abs`; adds `dot_f32x4_wasm`, `hamming_u8x{16,64}_wasm`, `base17_l1_wasm` (i32-domain to avoid i16 wrap), `codebook_gather_f32x4_wasm`, `bf16_to_f32_batch_wasm`, and wasm runtime tests.
Developer notes `.claude/blackboard.md`	Adds session notes documenting the WASM SIMD128 implementation, API parity, dispatch changes, build status, and adversarial review outcomes.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

AdaWorldAPI/ndarray#203: Also modifies src/simd.rs to extend architecture-specific SIMD re-export surface for types including i8x16 across x86/NEON/scalar backends.
AdaWorldAPI/ndarray#217: Also modifies gemm_bf16 dispatch in src/backend/mod.rs, specifically the std/x86 AMX matmul_bf16_to_f32 routing that this PR further restricts to x86_64.

Poem

🐇 Hop hop! Through wasm lanes I leap,
Four v128 chunks of f32 to keep.
With simd128 flag I finally compile,
Base17 in i32 to dodge the wrap's guile.
The rabbit runs fast on every wasm core!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly summarizes the main change: replacing wasm SIMD scaffolding with a real SIMD128 backend.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 500d57e6a1


chatgpt-codex-connector · 2026-06-28T16:09:38Z

+pub use crate::simd_wasm::wasm32_simd::{f32x16, f64x8, i8x16, F32Mask16, F32x16, F64Mask8, F64x8, I8x16};
+#[cfg(all(target_arch = "wasm32", target_feature = "simd128", not(feature = "nightly-simd")))]
+pub use scalar::{
+    batch_packed_i4_16, f32x8, f64x4, i16x16, i16x32, i32x16, i32x8, i64x4, i64x8, i8x32, i8x64, palette_lookup_u8x8,


Use the wasm I8x16 in batch_packed_i4_16

When building wasm32 with +simd128, crate::simd::I8x16 is the native v128 type re-exported just above, but this line re-exports scalar::batch_packed_i4_16, whose closure receives simd::scalar::I8x16 (src/simd_scalar.rs:1630). That makes portable callers fail only on wasm SIMD if they pass a helper taking crate::simd::I8x16 or write into out: &mut [I8x16], and it also bypasses the new native byte-lane backend for this W1a primitive. Re-export a wasm implementation of batch_packed_i4_16 built on wasm32_simd::I8x16, or keep I8x16 scalar in this dispatch arm.
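The mismatch described in this comment can be modelled in a few lines. Everything below is a hypothetical stand-in: the real `batch_packed_i4_16` signature is not shown on this page, so `batch_scalar` and `batch_native` only assume "a helper that hands vectors of a specific `I8x16` type to a closure".

```rust
/// Stand-in for the scalar backend's vector type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ScalarI8x16(pub [i8; 16]);

/// Stand-in for the native v128-backed vector type.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct NativeI8x16(pub [i8; 16]);

/// What a facade would export as `I8x16` on wasm32 + simd128.
pub type I8x16 = NativeI8x16;

/// Scalar helper: its closure parameter is the scalar type, so a closure
/// written against the facade's `I8x16` does not type-check here.
pub fn batch_scalar(n: usize, mut f: impl FnMut(ScalarI8x16)) {
    for i in 0..n {
        f(ScalarI8x16([i as i8; 16]));
    }
}

/// Helper built on the same type the facade exports; portable callers work.
pub fn batch_native(n: usize, mut f: impl FnMut(I8x16)) {
    for i in 0..n {
        f(NativeI8x16([i as i8; 16]));
    }
}

/// A portable caller that names the facade type explicitly.
pub fn sum_first_lanes(n: usize) -> i32 {
    let mut total = 0i32;
    batch_native(n, |v: I8x16| total += v.0[0] as i32);
    // batch_scalar(n, |v: I8x16| ...) would be rejected with a type mismatch,
    // because I8x16 is NativeI8x16 while the helper expects ScalarI8x16.
    total
}
```

Either remedy suggested in the comment removes the split: export a helper defined over the native type, or keep `I8x16` scalar in that dispatch arm so both names come from the same backend.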


AdaWorldAPI merged commit b8eaa1e into master Jun 28, 2026
16 of 17 checks passed


AdaWorldAPI mentioned this pull request Jun 28, 2026

splat3d: helix_orient — deterministic 1-3 byte surfel/gaussian orientation codec #226

Merged

AdaWorldAPI merged 1 commit into master from claude/ndarray-wasm-scalar-zr9n46
