AdaWorldAPI · Jun 28, 2026 · chatgpt-codex-connector
diff --git a/.cargo/config-wasm.toml b/.cargo/config-wasm.toml
@@ -0,0 +1,23 @@
+# WebAssembly with SIMD128 — enables the native v128 SIMD backend
+# (`src/simd_wasm.rs::wasm32_simd`) instead of the pure-scalar fallback.
+#
+# Use with:
+#   cargo --config .cargo/config-wasm.toml build -p ndarray --lib --target wasm32-unknown-unknown
+#
+# Equivalent env form:
+#   RUSTFLAGS='-Ctarget-feature=+simd128' cargo build -p ndarray --lib --target wasm32-unknown-unknown
+#
+# The `simd128` target feature is what gates the `wasm32_simd` module and the
+# `simd.rs` dispatch arm that re-exports its `F32x16` / `F64x8` / `I8x16`
+# types. Without it, `wasm32` falls back to the portable scalar SIMD types
+# (still correct, just not vectorized). Add `+relaxed-simd` as well to light
+# up the fused `f32x4_relaxed_madd` path in `mul_add`:
+#
+#   rustflags = ["-Ctarget-feature=+simd128,+relaxed-simd"]
+#
+# Applies to both wasm32-unknown-unknown and wasm32-wasip1.
+[target.wasm32-unknown-unknown]
+rustflags = ["-Ctarget-feature=+simd128"]
+
+[target.wasm32-wasip1]
+rustflags = ["-Ctarget-feature=+simd128"]
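
The config comment above says `+relaxed-simd` switches `mul_add` to the fused `f32x4_relaxed_madd` path, while base `simd128` uses a separate multiply and add. A minimal sketch of that compile-time selection, assuming a four-lane helper named `mul_add4` (hypothetical, not a function in this PR) and a scalar fallback for every non-wasm target:

```rust
// Hypothetical helper for illustration only; the PR's real `mul_add` lives
// on `F32x16` in `src/simd_wasm.rs` and is not reproduced here.

/// Lane-wise `a * b + c` over four f32 lanes using wasm v128 intrinsics.
#[cfg(all(target_arch = "wasm32", target_feature = "simd128"))]
pub fn mul_add4(a: [f32; 4], b: [f32; 4], c: [f32; 4]) -> [f32; 4] {
    use core::arch::wasm32::*;

    let va = f32x4(a[0], a[1], a[2], a[3]);
    let vb = f32x4(b[0], b[1], b[2], b[3]);
    let vc = f32x4(c[0], c[1], c[2], c[3]);

    // With `+relaxed-simd` the engine is allowed to fuse multiply and add.
    #[cfg(target_feature = "relaxed-simd")]
    let r = f32x4_relaxed_madd(va, vb, vc);
    // Base simd128 has no FMA: multiply, then add (two roundings).
    #[cfg(not(target_feature = "relaxed-simd"))]
    let r = f32x4_add(f32x4_mul(va, vb), vc);

    [
        f32x4_extract_lane::<0>(r),
        f32x4_extract_lane::<1>(r),
        f32x4_extract_lane::<2>(r),
        f32x4_extract_lane::<3>(r),
    ]
}

/// Scalar fallback for every other target, including wasm32 without simd128.
#[cfg(not(all(target_arch = "wasm32", target_feature = "simd128")))]
pub fn mul_add4(a: [f32; 4], b: [f32; 4], c: [f32; 4]) -> [f32; 4] {
    core::array::from_fn(|i| a[i] * b[i] + c[i])
}
```

Because the fused and unfused paths can differ in the last bit, callers that compare results across backends would need a tolerance rather than exact equality, except on inputs whose products are exactly representable.
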
diff --git a/.claude/blackboard.md b/.claude/blackboard.md
@@ -3,6 +3,85 @@
 > **Read this first.** The "Polyglot Notebook" architecture below is a
 > separate/older program, not the current epoch.
 
+## 2026-06-28 — WASM SIMD128 backend filled in (`src/simd_wasm.rs`)
+
+Replaced the commented-out scaffolding in `src/simd_wasm.rs` with a real
+`core::arch::wasm32` SIMD128 backend, mirroring `simd_neon::aarch64_simd`'s
+proven split (native v128 for the float/byte hot path, scalar fallback for
+the long tail). Branch `claude/ndarray-wasm-scalar-zr9n46`.
+
+**`src/simd_wasm.rs::wasm32_simd`** (gated `#[cfg(all(target_arch="wasm32",
+target_feature="simd128"))]`):
+- `F32x16` / `F64x8` as `[v128;4]` + `F32Mask16` / `F64Mask8` — full API
+  parity with the scalar macro (splat/from_slice/from_array/to_array/
+  copy_to_slice/reduce_{sum,min,max}/abs/sqrt/round/floor/mul_add/
+  simd_{min,max,clamp,lt,le,gt,ge,eq,ne}/to_bits/from_bits/cast_i32 +
+  Add/Sub/Mul/Div/*Assign/Neg/Debug/PartialEq/Default + Mask::select).
+- `I8x16` (one `v128`) = UNION of the scalar + NEON method sets
+  (add/sub/min/max/cmp_gt + from_i4_packed_u64/lane_i8/saturating_abs)
+  so consumers are portable across every backend.
+- Free hot-kernels (v128 counterparts to the NEON kernels):
+  `dot_f32x4_wasm`, `popcount_u8x16_wasm`, `hamming_u8x16_wasm`,
+  `hamming_u8x64_wasm` (Fingerprint<256> distance via `i8x16_popcnt`),
+  `base17_l1_wasm`, `codebook_gather_f32x4_wasm`, `bf16_to_f32_batch_wasm`.
+- `mul_add`: `f32x4_relaxed_madd` under `+relaxed-simd`, else mul+add
+  (base simd128 has no FMA). `round()` = `f32x4_nearest` (ties-even, =NEON).
+  NaN in simd_min/max follows IEEE (NaN-propagating, =NEON); the existing
+  `simd_exp_f32` NaN save/restore already absorbs this. All documented.
+
+**Dispatch (`src/simd.rs`):** new `target_arch="wasm32" + target_feature=
+"simd128"` arm re-exports the 8 native names from `wasm32_simd` and the
+remainder from `scalar`; the "Other non-x86" arm now excludes that case
+(wasm-without-simd128 + riscv etc. stay full-scalar). Added wasm32
+`PREFERRED_*_LANES` arms (F32=4/F64=2/U64=2/I16=8, 128-bit widths) and a
+`.cargo/config-wasm.toml` (`-Ctarget-feature=+simd128`).
+
+**Unblocked the wasm build (pre-existing x86 leaks, not SIMD-scaffolding):**
+the crate did NOT compile for wasm at all — `src/simd.rs` re-exported the
+x86-only `amx_matmul` / `simd_amx` modules unconditionally, and
+`backend::gemm_bf16` called `amx_matmul::matmul_bf16_to_f32` directly.
+Gated both re-exports to `#[cfg(target_arch="x86_64")]`; split `gemm_bf16`
+into the IDENTICAL x86 AMX path + a non-x86 branch routing through the
+portable `hpc::quantized::bf16_gemm_f32(.., 1.0, 0.0)` (the same scalar
+reference the AMX dispatcher itself falls back to → bit-equivalent). x86
+behavior is untouched by construction (the original block now lives under
+`cfg(target_arch="x86_64")`).
+
+[VERIFICATION] (1) `cargo build -p ndarray --lib` for wasm32 **+simd128**
+(native) AND **without** simd128 (scalar) AND **--no-default-features**
+(no_std) AND x86_64 default — all green. (2) A standalone faithful copy of
+`wasm32_simd` built to wasm32+simd128 and run under **node**: 51 numeric
+checks (incl. exact mask bit-patterns, saturating_abs(i8::MIN)=127,
+Hamming=512, Base17 vs scalar incl. a pathological |a-b|=60000 overflow
+case, bf16 shift) all PASS. (3) x86 regression: 217 SIMD tests + 85
+backend/bf16 tests pass; `clippy -p ndarray --lib -- -D warnings` clean;
+`fmt --check` clean. Harness: `/tmp/.../scratchpad/wasmverify`.
+
+[ADVERSARIAL REVIEW] Ran a 3-angle Opus review (cfg-gating / intrinsic-
+semantics / x86-regression). x86-regression = PASS (x86 path byte-identical;
+non-x86 bf16 fallback bit-equivalent). Two findings resolved: (P0 cfg-gating
+"no_std arm break") = **false positive** — `pub mod simd` is itself
+`#[cfg(feature="std")]` (lib.rs:239), so the native wasm arm is transitively
+std-gated; `--no-default-features` wasm build is clean (empirically
+confirmed). (P1 base17 i16 wrap) = **real, fixed** — `base17_l1_wasm` now
+sign-extends i16→i32 via `i32x4_extend_{low,high}_i16x8` BEFORE the subtract,
+so `|a-b|` is computed in i32 and matches the scalar reference for the full
+i16 range (the prior i16-domain abs-diff, like NEON's `vabdq_s16`, wrapped at
+|a-b|>i16::MAX). Doc nits (mul_add ULP wording, reduce_sum order, Tier-enum
+comment) also tightened.
+
+[NOTE] The stale "Build currently fails (exit 101)" note at the top of
+CLAUDE.md no longer reproduces: the x86 lib builds clean this turn.
+
+[LOOSE END] Full-crate (workspace) wasm build still blocked by `getrandom
+0.3` (via `ndarray-rand`/`numeric-tests`, members that depend ON ndarray)
+needing the `wasm_js` backend — orthogonal to this work; `-p ndarray --lib`
+is the correct wasm surface and it is green. `bf16_to_f32_batch_wasm` is
+provided + tested but NOT wired into the `bf16_to_f32_batch` dispatch (left
+scalar to keep the BF16 path untouched); wire it if a wasm BF16 hot path
+appears. U8x64/I32x16/U64x8 have no native v128 version and stay scalar on
+wasm (as on NEON); the free Hamming/Base17 kernels cover those hot paths.
+
 ## 2026-06-17 — DECISION: HHTL fork ladder coded in `hpc::entropy_ladder` (CONJECTURE)
 
 Reified the operator's standing idea — *if the orthogonal (helix/CAM-PQ)
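
The P1 finding in the 2026-06-28 entry hinges on doing the subtraction in i32 rather than in i16. A minimal scalar model of the difference, assuming a `[i16; 17]` input; both function names are hypothetical, and the i16-domain variant is only one plausible model of a subtract-then-abs in i16 lanes, not the PR's earlier code:

```rust
/// L1 distance over 17 i16 dimensions, widening each lane to i32 BEFORE the
/// subtract, so |a - b| is exact over the full i16 range.
pub fn base17_l1_widened(a: &[i16; 17], b: &[i16; 17]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| (x as i32 - y as i32).unsigned_abs())
        .sum()
}

/// Same distance with the subtract done in the i16 domain. The difference
/// wraps modulo 2^16 once |a - b| exceeds i16::MAX, so the result is wrong
/// for widely separated inputs.
pub fn base17_l1_i16_domain(a: &[i16; 17], b: &[i16; 17]) -> u32 {
    a.iter()
        .zip(b)
        .map(|(&x, &y)| x.wrapping_sub(y).unsigned_abs() as u32)
        .sum()
}
```

With a single lane pair of 30000 and -30000 the widened version yields 60000, while the i16-domain version wraps to 5536 (65536 - 60000). For inputs whose per-lane difference stays within i16 range, the two agree.
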

diff --git a/src/backend/mod.rs b/src/backend/mod.rs
@@ -210,7 +210,12 @@ pub fn gemm_i8(a: &[u8], b: &[i8], c: &mut [i32], m: usize, n: usize, k: usize)
 #[allow(clippy::needless_return)]
 pub fn gemm_bf16(a: &[u16], b: &[u16], c: &mut [f32], m: usize, n: usize, k: usize) {
     // Reinterpret u16 slices as BF16 slices (repr(transparent))
-    #[cfg(feature = "std")]
+    //
+    // x86_64: route through the ArrayView2-based AMX dispatcher
+    // (`amx_matmul::matmul_bf16_to_f32` = AMX TDPBF16PS → AVX-512 VDPBF16PS →
+    // scalar tiled `bf16_gemm_f32`). That module is `#[cfg(target_arch =
+    // "x86_64")]`, so off x86 we call the same scalar reference directly.
+    #[cfg(all(feature = "std", target_arch = "x86_64"))]
     {
         use crate::{ArrayView2, ArrayViewMut2};
 
@@ -235,6 +240,22 @@ pub fn gemm_bf16(a: &[u16], b: &[u16], c: &mut [f32], m: usize, n: usize, k: usi
         crate::hpc::amx_matmul::matmul_bf16_to_f32(lhs, rhs, out).expect("gemm_bf16: matmul shape contract");
         return;
     }
+    // Non-x86 std hosts (aarch64 / wasm32 / riscv …): the AMX tile path is
+    // x86-only; route through the portable scalar reference
+    // `crate::hpc::quantized::bf16_gemm_f32` (alpha = 1, beta = 0 → C
+    // overwritten), bit-equivalent to the scalar fallback the x86 dispatcher
+    // takes on non-AMX silicon.
+    #[cfg(all(feature = "std", not(target_arch = "x86_64")))]
+    {
+        let a_bf16: &[crate::hpc::quantized::BF16] =
+            // SAFETY: BF16 is #[repr(transparent)] over u16; bit pattern preserved.
+            unsafe { core::slice::from_raw_parts(a.as_ptr() as *const crate::hpc::quantized::BF16, a.len()) };
+        let b_bf16: &[crate::hpc::quantized::BF16] =
+            // SAFETY: same repr(transparent) invariant as `a_bf16` above.
+            unsafe { core::slice::from_raw_parts(b.as_ptr() as *const crate::hpc::quantized::BF16, b.len()) };
+        crate::hpc::quantized::bf16_gemm_f32(&a_bf16[..m * k], &b_bf16[..k * n], &mut c[..m * n], m, n, k, 1.0, 0.0);
+        return;
+    }
     #[cfg(not(feature = "std"))]
     {
         let _ = (a, b, c, m, n, k);
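
The non-x86 branch above reinterprets the `u16` slices as BF16 and calls a scalar GEMM with alpha = 1 and beta = 0. A minimal sketch of both steps, assuming row-major storage and that beta = 0 means C is overwritten without being read; `Bf16`, `as_bf16`, and `bf16_gemm_ref` are hypothetical stand-ins, not the crate's `hpc::quantized` items:

```rust
/// Hypothetical stand-in for a BF16 newtype: the top 16 bits of an f32.
#[repr(transparent)]
#[derive(Clone, Copy)]
pub struct Bf16(pub u16);

impl Bf16 {
    /// Widen to f32 by placing the 16 stored bits in the high half.
    pub fn to_f32(self) -> f32 {
        f32::from_bits((self.0 as u32) << 16)
    }
}

/// Reinterpret raw u16 bit patterns as Bf16 without copying.
pub fn as_bf16(bits: &[u16]) -> &[Bf16] {
    // SAFETY: Bf16 is #[repr(transparent)] over u16, so size, alignment,
    // and valid bit patterns are identical.
    unsafe { core::slice::from_raw_parts(bits.as_ptr() as *const Bf16, bits.len()) }
}

/// Scalar reference: C = alpha * (A x B) + beta * C, row-major,
/// A is m x k, B is k x n, C is m x n.
pub fn bf16_gemm_ref(
    a: &[Bf16], b: &[Bf16], c: &mut [f32],
    m: usize, n: usize, k: usize, alpha: f32, beta: f32,
) {
    assert!(a.len() >= m * k && b.len() >= k * n && c.len() >= m * n);
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p].to_f32() * b[p * n + j].to_f32();
            }
            let idx = i * n + j;
            // Assumption: beta == 0 overwrites C, so stale NaNs are ignored.
            c[idx] = if beta == 0.0 { alpha * acc } else { alpha * acc + beta * c[idx] };
        }
    }
}
```

Slicing the inputs to exactly `m * k`, `k * n`, and `m * n` elements before the call, as the diff does, turns an undersized buffer into an immediate panic at the call site.
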

diff --git a/src/simd.rs b/src/simd.rs
@@ -12,6 +12,11 @@ use std::sync::LazyLock;
 // `detect_tier()`'s feature-detection blocks are `target_arch = "x86_64"`
 // or `"aarch64"` gated, both false on i686. Without `dead_code` allowance
 // the `-D warnings` build fails with `variants ... are never constructed`.
+// Note: this `Tier` enum is *runtime* dispatch only. On `wasm32 +
+// target_feature = "simd128"` the SIMD *types* are NOT scalar — they come
+// from the compile-time `simd_wasm::wasm32_simd` v128 backend (re-exported
+// below); `detect_tier()` simply has no wasm arm, so the runtime tier stays
+// `Scalar`.
 #[allow(dead_code)]
 #[derive(Clone, Copy, PartialEq, Debug)]
 #[repr(u8)]
@@ -156,7 +161,9 @@ pub const PREFERRED_F64_LANES: usize = 8;
 pub const PREFERRED_F64_LANES: usize = 4;
 #[cfg(target_arch = "aarch64")]
 pub const PREFERRED_F64_LANES: usize = 2; // NEON: float64x2_t = 2 × f64
-#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
+#[cfg(target_arch = "wasm32")]
+pub const PREFERRED_F64_LANES: usize = 2; // WASM SIMD128: f64x2 = 2 × f64
+#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64", target_arch = "wasm32")))]
 pub const PREFERRED_F64_LANES: usize = 4; // scalar fallback: same as AVX2 shape
 
 /// Preferred f32 SIMD width.
@@ -167,7 +174,9 @@ pub const PREFERRED_F32_LANES: usize = 16;
 pub const PREFERRED_F32_LANES: usize = 8;
 #[cfg(target_arch = "aarch64")]
 pub const PREFERRED_F32_LANES: usize = 4; // NEON: float32x4_t = 4 × f32
-#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
+#[cfg(target_arch = "wasm32")]
+pub const PREFERRED_F32_LANES: usize = 4; // WASM SIMD128: f32x4 = 4 × f32
+#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64", target_arch = "wasm32")))]
 pub const PREFERRED_F32_LANES: usize = 8;
 
 /// Preferred u64 SIMD width.
@@ -178,7 +187,9 @@ pub const PREFERRED_U64_LANES: usize = 8;
 pub const PREFERRED_U64_LANES: usize = 4;
 #[cfg(target_arch = "aarch64")]
 pub const PREFERRED_U64_LANES: usize = 2; // NEON: uint64x2_t
-#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
+#[cfg(target_arch = "wasm32")]
+pub const PREFERRED_U64_LANES: usize = 2; // WASM SIMD128: v128 as u64x2 = 2 × u64
+#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64", target_arch = "wasm32")))]
 pub const PREFERRED_U64_LANES: usize = 4;
 
 /// Preferred i16 SIMD width (for Base17 L1 on i16[17]).
@@ -191,7 +202,9 @@ pub const PREFERRED_I16_LANES: usize = 32;
 pub const PREFERRED_I16_LANES: usize = 16;
 #[cfg(target_arch = "aarch64")]
 pub const PREFERRED_I16_LANES: usize = 8; // NEON: int16x8_t
-#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
+#[cfg(target_arch = "wasm32")]
+pub const PREFERRED_I16_LANES: usize = 8; // WASM SIMD128: i16x8 = 8 × i16
+#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64", target_arch = "wasm32")))]
 pub const PREFERRED_I16_LANES: usize = 16;
 
 // ============================================================================
@@ -376,10 +389,28 @@ pub use scalar::{
     I64x4, I64x8, U16x16, U16x32, U32x16, U32x8, U64x4, U64x8, U8x64,
 };
 
-// Other non-x86 targets (wasm, riscv, etc.): full scalar fallback.
+// wasm32 + simd128: the native v128 float hot path (F32x16 / F64x8 + masks)
+// and native I8x16 come from `simd_wasm::wasm32_simd`; the long-tail integer
+// and 256-bit-shaped types come from the scalar fallback. Same split
+// `simd_neon` uses on aarch64 (native float kernels, scalar for the rest).
+// The `wasm32_simd` module only exists under `target_feature = "simd128"`,
+// so this arm is gated identically.
+#[cfg(all(target_arch = "wasm32", target_feature = "simd128", not(feature = "nightly-simd")))]
+pub use crate::simd_wasm::wasm32_simd::{f32x16, f64x8, i8x16, F32Mask16, F32x16, F64Mask8, F64x8, I8x16};
+#[cfg(all(target_arch = "wasm32", target_feature = "simd128", not(feature = "nightly-simd")))]
+pub use scalar::{
+    batch_packed_i4_16, f32x8, f64x4, i16x16, i16x32, i32x16, i32x8, i64x4, i64x8, i8x32, i8x64, palette_lookup_u8x8,
+    prefetch_read_t0, prefetch_read_t1, prefetch_read_t2, u16x16, u16x8, u32x16, u32x8, u64x4, u64x8, u8x64, u8x8,
+    F32x8, F64x4, I16x16, I16x32, I32x16, I32x8, I64x4, I64x8, I8x32, I8x64, U16x16, U16x32, U16x8, U32x16, U32x8,
+    U64x4, U64x8, U8x64, U8x8,
+};
+
+// Other non-x86 targets — wasm32 without simd128, riscv, etc.: full scalar
+// fallback. Excludes the wasm32+simd128 case handled by the native arm above.
 #[cfg(all(
     not(target_arch = "x86_64"),
     not(target_arch = "aarch64"),
+    not(all(target_arch = "wasm32", target_feature = "simd128")),
     not(feature = "nightly-simd")
 ))]
 pub use scalar::{
@@ -577,11 +608,16 @@ pub use crate::hpc::heel_f64x8::cosine_f32_to_f64_simd;
 // whole AMX ladder through the canonical `ndarray::simd::*` import (W1a)
 // without dipping into `crate::hpc::amx_matmul` directly. `amx_available()`
 // exposes the runtime tier check for reporting.
-#[cfg(feature = "std")]
+// AMX is x86_64-only (the `amx_matmul` / `simd_amx` modules are
+// `#[cfg(target_arch = "x86_64")]`), so these re-exports are arch-gated.
+// Off x86 the cross-platform entry points are `backend::gemm_i8` /
+// `backend::gemm_bf16` (portable scalar / NEON / wasm-SIMD paths).
+#[cfg(all(feature = "std", target_arch = "x86_64"))]
 pub use crate::hpc::amx_matmul::{amx_available, matmul_i8_to_i32};
 // CPU-generation detection (cached): SPR / EMR / GNR / Sierra Forest. Lets a
 // consumer report which silicon a run landed on and distinguish "no AMX
 // silicon" from "AMX present but not OS-enabled" — both surface via `amx_report`.
+#[cfg(target_arch = "x86_64")]
 pub use crate::simd_amx::{amx_report, cpu_model, CpuModel};
 
 // Elementwise slice ops — polyfill-dispatched (F32x16/F64x8 chunks + scalar tail).
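
The `PREFERRED_*_LANES` hunks work only if exactly one `cfg` arm is active per target, which is why the fallback arm now also excludes `wasm32`. A minimal sketch of that pattern feeding a "full chunks plus scalar tail" loop, assuming a simplified two-arm constant `LANES_F32` and a helper `add_scalar_in_place` (both hypothetical, with fewer arms than `src/simd.rs`):

```rust
// Hypothetical lane constant. The two arms are mutually exclusive and
// together cover every target, so the constant is defined exactly once.
#[cfg(any(target_arch = "aarch64", target_arch = "wasm32"))]
pub const LANES_F32: usize = 4; // 128-bit registers: 4 x f32
#[cfg(not(any(target_arch = "aarch64", target_arch = "wasm32")))]
pub const LANES_F32: usize = 8; // every other target in this sketch

/// Adds `k` to every element: full LANES_F32-wide chunks first, then the
/// leftover tail one element at a time.
pub fn add_scalar_in_place(xs: &mut [f32], k: f32) {
    let mut chunks = xs.chunks_exact_mut(LANES_F32);
    for chunk in &mut chunks {
        // A real backend would load the chunk into a SIMD register here.
        for x in chunk.iter_mut() {
            *x += k;
        }
    }
    // Elements that did not fill a whole chunk.
    for x in chunks.into_remainder() {
        *x += k;
    }
}
```

If a new arm is added without also excluding it from the fallback arm, the constant is defined twice on that target and the build fails, so the mistake surfaces at compile time rather than as a wrong width at run time.
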