Merged
23 changes: 23 additions & 0 deletions .cargo/config-wasm.toml
@@ -0,0 +1,23 @@
# WebAssembly with SIMD128 — enables the native v128 SIMD backend
# (`src/simd_wasm.rs::wasm32_simd`) instead of the pure-scalar fallback.
#
# Use with:
# cargo --config .cargo/config-wasm.toml build -p ndarray --lib --target wasm32-unknown-unknown
#
# Equivalent env form:
# RUSTFLAGS='-Ctarget-feature=+simd128' cargo build -p ndarray --lib --target wasm32-unknown-unknown
#
# The `simd128` target feature is what gates the `wasm32_simd` module and the
# `simd.rs` dispatch arm that re-exports its `F32x16` / `F64x8` / `I8x16`
# types. Without it, `wasm32` falls back to the portable scalar SIMD types
# (still correct, just not vectorized). Add `+relaxed-simd` as well to light
# up the fused `f32x4_relaxed_madd` path in `mul_add`:
#
# rustflags = ["-Ctarget-feature=+simd128,+relaxed-simd"]
#
# Applies to both wasm32-unknown-unknown and wasm32-wasip1.
[target.wasm32-unknown-unknown]
rustflags = ["-Ctarget-feature=+simd128"]

[target.wasm32-wasip1]
rustflags = ["-Ctarget-feature=+simd128"]
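A minimal sketch of how the `simd128` flag above becomes visible to Rust code at compile time. `backend_name` is a hypothetical helper for illustration, not something in the crate; only the `target_arch` / `target_feature` cfg keys are taken from the config comments.

```rust
/// Hypothetical helper (not part of the crate): reports which arm a build
/// would land in, using the same cfg keys the config comments mention.
pub const fn backend_name() -> &'static str {
    if cfg!(all(target_arch = "wasm32", target_feature = "simd128")) {
        // Built with `-Ctarget-feature=+simd128`: the v128 backend is compiled in.
        "wasm32-simd128"
    } else if cfg!(target_arch = "wasm32") {
        // wasm32 without the flag: portable scalar types.
        "wasm32-scalar"
    } else {
        // Any other target (x86_64, aarch64, ...).
        "non-wasm"
    }
}
```

Because `cfg!` is resolved at compile time, the choice is fixed per build and there is no runtime probe on wasm.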
79 changes: 79 additions & 0 deletions .claude/blackboard.md
@@ -3,6 +3,85 @@
> **Read this first.** The "Polyglot Notebook" architecture below is a
> separate/older program, not the current epoch.

## 2026-06-28 — WASM SIMD128 backend filled in (`src/simd_wasm.rs`)

Replaced the commented-out scaffolding in `src/simd_wasm.rs` with a real
`core::arch::wasm32` SIMD128 backend, mirroring `simd_neon::aarch64_simd`'s
proven split (native v128 for the float/byte hot path, scalar fallback for
the long tail). Branch `claude/ndarray-wasm-scalar-zr9n46`.

**`src/simd_wasm.rs::wasm32_simd`** (gated `#[cfg(all(target_arch="wasm32",
target_feature="simd128"))]`):
- `F32x16` / `F64x8` as `[v128;4]` + `F32Mask16` / `F64Mask8` — full API
parity with the scalar macro (splat/from_slice/from_array/to_array/
copy_to_slice/reduce_{sum,min,max}/abs/sqrt/round/floor/mul_add/
simd_{min,max,clamp,lt,le,gt,ge,eq,ne}/to_bits/from_bits/cast_i32 +
Add/Sub/Mul/Div/*Assign/Neg/Debug/PartialEq/Default + Mask::select).
- `I8x16` (one `v128`) = UNION of the scalar + NEON method sets
(add/sub/min/max/cmp_gt + from_i4_packed_u64/lane_i8/saturating_abs)
so consumers are portable across every backend.
- Free hot-kernels (v128 counterparts to the NEON kernels):
`dot_f32x4_wasm`, `popcount_u8x16_wasm`, `hamming_u8x16_wasm`,
`hamming_u8x64_wasm` (Fingerprint<256> distance via `i8x16_popcnt`),
`base17_l1_wasm`, `codebook_gather_f32x4_wasm`, `bf16_to_f32_batch_wasm`.
- `mul_add`: `f32x4_relaxed_madd` under `+relaxed-simd`, else mul+add
(base simd128 has no FMA). `round()` = `f32x4_nearest` (ties-even, =NEON).
NaN in simd_min/max follows IEEE (NaN-propagating, =NEON); the existing
`simd_exp_f32` NaN save/restore already absorbs this. All documented.
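The `mul_add` note can be illustrated with a scalar sketch. This is not the backend code: `madd_unfused` and `madd_fused` are hypothetical names, and the std `f32::mul_add` stands in for a fused lane operation. It only shows why a mul+add path and a fused path may disagree in the last bits.

```rust
/// Unfused: the product is rounded to f32 first, then the sum is rounded again.
pub fn madd_unfused(a: f32, b: f32, c: f32) -> f32 {
    a * b + c
}

/// Fused: `a * b + c` is computed exactly and rounded once.
pub fn madd_fused(a: f32, b: f32, c: f32) -> f32 {
    a.mul_add(b, c)
}
```

With `a = 1 + 2^-13`, `b = 1 - 2^-13`, `c = -1`, the exact product is `1 - 2^-26`, which rounds to `1.0` in f32. The unfused result is therefore `0.0`, while the fused result is `-2^-26`. Callers comparing against a scalar reference should allow for this difference.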

**Dispatch (`src/simd.rs`):** new `target_arch="wasm32" + target_feature=
"simd128"` arm re-exports the 8 native names from `wasm32_simd` and the
remainder from `scalar`; the "Other non-x86" arm now excludes that case
(wasm-without-simd128 + riscv etc. stay full-scalar). Added wasm32
`PREFERRED_*_LANES` arms (F32=4/F64=2/U64=2/I16=8, 128-bit widths) and a
`.cargo/config-wasm.toml` (`-Ctarget-feature=+simd128`).
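A sketch of how a consumer might use a per-target lane constant like the `PREFERRED_*_LANES` arms. `LANES_F32` and `sum_chunked` are hypothetical; the non-wasm value of 8 is only a placeholder for this example, and the inner loop is plain scalar code standing in for a vector add.

```rust
// Hypothetical lane constant, gated the same way as the wasm32 arm.
#[cfg(target_arch = "wasm32")]
pub const LANES_F32: usize = 4; // one v128 holds 4 x f32
#[cfg(not(target_arch = "wasm32"))]
pub const LANES_F32: usize = 8; // placeholder width for this sketch

/// Sums a slice in lane-sized chunks, then folds the leftover tail.
pub fn sum_chunked(xs: &[f32]) -> f32 {
    let mut acc = [0.0f32; LANES_F32];
    let mut chunks = xs.chunks_exact(LANES_F32);
    for chunk in &mut chunks {
        // One "vector" accumulate per chunk.
        for (a, x) in acc.iter_mut().zip(chunk) {
            *a += *x;
        }
    }
    let tail: f32 = chunks.remainder().iter().sum();
    acc.iter().sum::<f32>() + tail
}
```

The accumulation order depends on the lane count, so results for general floating-point input can differ in rounding between targets; small integer-valued inputs sum exactly either way.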

**Unblocked the wasm build (pre-existing x86 leaks, not SIMD-scaffolding):**
the crate did NOT compile for wasm at all — `src/simd.rs` re-exported the
x86-only `amx_matmul` / `simd_amx` modules unconditionally, and
`backend::gemm_bf16` called `amx_matmul::matmul_bf16_to_f32` directly.
Gated both re-exports to `#[cfg(target_arch="x86_64")]`; split `gemm_bf16`
into the IDENTICAL x86 AMX path + a non-x86 branch routing through the
portable `hpc::quantized::bf16_gemm_f32(.., 1.0, 0.0)` (the same scalar
reference the AMX dispatcher itself falls back to → bit-equivalent). x86
behavior is untouched by construction (the original block now lives under
`cfg(target_arch="x86_64")`).
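For orientation, a scalar sketch of what a BF16 GEMM with alpha = 1 and beta = 0 computes. This is not `hpc::quantized::bf16_gemm_f32`: `bf16_to_f32` and `gemm_bf16_ref` are hypothetical names, row-major layout is an assumption, and the accumulation order of the real reference is not shown in this diff.

```rust
/// BF16 is the upper 16 bits of an f32, so widening is a 16-bit left shift.
pub fn bf16_to_f32(bits: u16) -> f32 {
    f32::from_bits((bits as u32) << 16)
}

/// Hypothetical reference: C = A * B with A (m x k), B (k x n), C (m x n),
/// all row-major (assumed). C is overwritten, i.e. alpha = 1, beta = 0.
pub fn gemm_bf16_ref(a: &[u16], b: &[u16], c: &mut [f32], m: usize, n: usize, k: usize) {
    assert!(a.len() >= m * k && b.len() >= k * n && c.len() >= m * n);
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += bf16_to_f32(a[i * k + p]) * bf16_to_f32(b[p * n + j]);
            }
            c[i * n + j] = acc; // beta = 0: previous contents are discarded
        }
    }
}
```

Multiplying by a BF16 identity matrix returns the widened input unchanged, which makes a convenient smoke test.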

[VERIFICATION] (1) `cargo build -p ndarray --lib` for wasm32 **+simd128**
(native) AND **without** simd128 (scalar) AND **--no-default-features**
(no_std) AND x86_64 default — all green. (2) A standalone faithful copy of
`wasm32_simd` built to wasm32+simd128 and run under **node**: 51 numeric
checks (incl. exact mask bit-patterns, saturating_abs(i8::MIN)=127,
Hamming=512, Base17 vs scalar incl. a pathological |a-b|=60000 overflow
case, bf16 shift) all PASS. (3) x86 regression: 217 SIMD tests + 85
backend/bf16 tests pass; `clippy -p ndarray --lib -- -D warnings` clean;
`fmt --check` clean. Harness: `/tmp/.../scratchpad/wasmverify`.
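Two of the checks listed above can be reproduced with scalar oracles. The sketch below is not the harness itself; `hamming_ref` and `saturating_abs_ref` are hypothetical names built only on std integer methods.

```rust
/// Scalar Hamming distance over byte slices: popcount of the XOR.
pub fn hamming_ref(a: &[u8], b: &[u8]) -> u32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| (*x ^ *y).count_ones()).sum()
}

/// Scalar oracle for a saturating lane-wise abs on i8.
pub fn saturating_abs_ref(v: i8) -> i8 {
    v.saturating_abs()
}
```

Two 64-byte inputs that differ in every bit give 512, and `i8::MIN` saturates to 127, matching the values quoted in the verification note.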

[ADVERSARIAL REVIEW] Ran a 3-angle Opus review (cfg-gating / intrinsic-
semantics / x86-regression). x86-regression = PASS (x86 path byte-identical;
non-x86 bf16 fallback bit-equivalent). Two findings resolved: (P0 cfg-gating
"no_std arm break") = **false positive** — `pub mod simd` is itself
`#[cfg(feature="std")]` (lib.rs:239), so the native wasm arm is transitively
std-gated; `--no-default-features` wasm build is clean (empirically
confirmed). (P1 base17 i16 wrap) = **real, fixed** — `base17_l1_wasm` now
sign-extends i16→i32 via `i32x4_extend_{low,high}_i16x8` BEFORE the subtract,
so `|a-b|` is computed in i32 and matches the scalar reference for the full
i16 range (the prior i16-domain abs-diff, like NEON's `vabdq_s16`, wrapped at
|a-b|>i16::MAX). Doc nits (mul_add ULP wording, reduce_sum order, Tier-enum
comment) also tightened.
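The i16 wrap behind the P1 finding can be shown in scalar form. The functions below are hypothetical models, not the kernel: one takes the difference in the i16 domain, the other widens to i32 before subtracting, as the fix is described.

```rust
/// Model of the pre-fix behaviour: subtract and abs in the i16 domain.
pub fn abs_diff_i16_domain(a: i16, b: i16) -> i32 {
    a.wrapping_sub(b).wrapping_abs() as i32
}

/// Model of the fix: sign-extend to i32 first, then subtract and abs.
pub fn abs_diff_widened(a: i16, b: i16) -> i32 {
    (a as i32 - b as i32).abs()
}

/// L1 distance over i16 slices using the widened difference.
pub fn l1_widened(a: &[i16], b: &[i16]) -> i32 {
    assert_eq!(a.len(), b.len());
    a.iter().zip(b).map(|(x, y)| abs_diff_widened(*x, *y)).sum()
}
```

For `a = 30000`, `b = -30000` the true difference is 60000. The i16-domain model wraps to 5536, while the widened model returns 60000.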

[NOTE] The stale top-of-CLAUDE.md "Build currently fails (exit 101)" no
longer reproduces — x86 lib builds clean this turn.

[LOOSE END] Full-crate (workspace) wasm build still blocked by `getrandom
0.3` (via `ndarray-rand`/`numeric-tests`, members that depend ON ndarray)
needing the `wasm_js` backend — orthogonal to this work; `-p ndarray --lib`
is the correct wasm surface and it is green. `bf16_to_f32_batch_wasm` is
provided + tested but NOT wired into the `bf16_to_f32_batch` dispatch (left
scalar to keep the BF16 path untouched); wire it if a wasm BF16 hot path
appears. Native U8x64/I32x16/U64x8 stay scalar on wasm (same as NEON keeps
them scalar) — the free Hamming/Base17 kernels cover those hot paths.

## 2026-06-17 — DECISION: HHTL fork ladder coded in `hpc::entropy_ladder` (CONJECTURE)

Reified the operator's standing idea — *if the orthogonal (helix/CAM-PQ)
23 changes: 22 additions & 1 deletion src/backend/mod.rs
@@ -210,7 +210,12 @@ pub fn gemm_i8(a: &[u8], b: &[i8], c: &mut [i32], m: usize, n: usize, k: usize)
#[allow(clippy::needless_return)]
pub fn gemm_bf16(a: &[u16], b: &[u16], c: &mut [f32], m: usize, n: usize, k: usize) {
// Reinterpret u16 slices as BF16 slices (repr(transparent))
#[cfg(feature = "std")]
//
// x86_64: route through the ArrayView2-based AMX dispatcher
// (`amx_matmul::matmul_bf16_to_f32` = AMX TDPBF16PS → AVX-512 VDPBF16PS →
// scalar tiled `bf16_gemm_f32`). That module is `#[cfg(target_arch =
// "x86_64")]`, so off x86 we call the same scalar reference directly.
#[cfg(all(feature = "std", target_arch = "x86_64"))]
{
use crate::{ArrayView2, ArrayViewMut2};

@@ -235,6 +240,22 @@ pub fn gemm_bf16(a: &[u16], b: &[u16], c: &mut [f32], m: usize, n: usize, k: usi
crate::hpc::amx_matmul::matmul_bf16_to_f32(lhs, rhs, out).expect("gemm_bf16: matmul shape contract");
return;
}
// Non-x86 std hosts (aarch64 / wasm32 / riscv …): the AMX tile path is
// x86-only; route through the portable scalar reference
// `crate::hpc::quantized::bf16_gemm_f32` (alpha = 1, beta = 0 → C
// overwritten), bit-equivalent to the scalar fallback the x86 dispatcher
// takes on non-AMX silicon.
#[cfg(all(feature = "std", not(target_arch = "x86_64")))]
{
let a_bf16: &[crate::hpc::quantized::BF16] =
// SAFETY: BF16 is #[repr(transparent)] over u16; bit pattern preserved.
unsafe { core::slice::from_raw_parts(a.as_ptr() as *const crate::hpc::quantized::BF16, a.len()) };
let b_bf16: &[crate::hpc::quantized::BF16] =
// SAFETY: same repr(transparent) invariant as `a_bf16` above.
unsafe { core::slice::from_raw_parts(b.as_ptr() as *const crate::hpc::quantized::BF16, b.len()) };
crate::hpc::quantized::bf16_gemm_f32(&a_bf16[..m * k], &b_bf16[..k * n], &mut c[..m * n], m, n, k, 1.0, 0.0);
return;
}
#[cfg(not(feature = "std"))]
{
let _ = (a, b, c, m, n, k);
48 changes: 42 additions & 6 deletions src/simd.rs
@@ -12,6 +12,11 @@ use std::sync::LazyLock;
// `detect_tier()`'s feature-detection blocks are `target_arch = "x86_64"`
// or `"aarch64"` gated, both false on i686. Without `dead_code` allowance
// the `-D warnings` build fails with `variants ... are never constructed`.
// Note: this `Tier` enum is *runtime* dispatch only. On `wasm32 +
// target_feature = "simd128"` the SIMD *types* are NOT scalar — they come
// from the compile-time `simd_wasm::wasm32_simd` v128 backend (re-exported
// below); `detect_tier()` simply has no wasm arm, so the runtime tier stays
// `Scalar`.
#[allow(dead_code)]
#[derive(Clone, Copy, PartialEq, Debug)]
#[repr(u8)]
@@ -156,7 +161,9 @@ pub const PREFERRED_F64_LANES: usize = 8;
pub const PREFERRED_F64_LANES: usize = 4;
#[cfg(target_arch = "aarch64")]
pub const PREFERRED_F64_LANES: usize = 2; // NEON: float64x2_t = 2 × f64
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
#[cfg(target_arch = "wasm32")]
pub const PREFERRED_F64_LANES: usize = 2; // WASM SIMD128: f64x2 = 2 × f64
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64", target_arch = "wasm32")))]
pub const PREFERRED_F64_LANES: usize = 4; // scalar fallback: same as AVX2 shape

/// Preferred f32 SIMD width.
@@ -167,7 +174,9 @@ pub const PREFERRED_F32_LANES: usize = 16;
pub const PREFERRED_F32_LANES: usize = 8;
#[cfg(target_arch = "aarch64")]
pub const PREFERRED_F32_LANES: usize = 4; // NEON: float32x4_t = 4 × f32
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
#[cfg(target_arch = "wasm32")]
pub const PREFERRED_F32_LANES: usize = 4; // WASM SIMD128: f32x4 = 4 × f32
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64", target_arch = "wasm32")))]
pub const PREFERRED_F32_LANES: usize = 8;

/// Preferred u64 SIMD width.
@@ -178,7 +187,9 @@ pub const PREFERRED_U64_LANES: usize = 8;
pub const PREFERRED_U64_LANES: usize = 4;
#[cfg(target_arch = "aarch64")]
pub const PREFERRED_U64_LANES: usize = 2; // NEON: uint64x2_t
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
#[cfg(target_arch = "wasm32")]
pub const PREFERRED_U64_LANES: usize = 2; // WASM SIMD128: i64x2 = 2 × u64
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64", target_arch = "wasm32")))]
pub const PREFERRED_U64_LANES: usize = 4;

/// Preferred i16 SIMD width (for Base17 L1 on i16[17]).
@@ -191,7 +202,9 @@ pub const PREFERRED_I16_LANES: usize = 32;
pub const PREFERRED_I16_LANES: usize = 16;
#[cfg(target_arch = "aarch64")]
pub const PREFERRED_I16_LANES: usize = 8; // NEON: int16x8_t
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
#[cfg(target_arch = "wasm32")]
pub const PREFERRED_I16_LANES: usize = 8; // WASM SIMD128: i16x8 = 8 × i16
#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64", target_arch = "wasm32")))]
pub const PREFERRED_I16_LANES: usize = 16;

// ============================================================================
@@ -376,10 +389,28 @@ pub use scalar::{
I64x4, I64x8, U16x16, U16x32, U32x16, U32x8, U64x4, U64x8, U8x64,
};

// Other non-x86 targets (wasm, riscv, etc.): full scalar fallback.
// wasm32 + simd128: the native v128 float hot path (F32x16 / F64x8 + masks)
// and native I8x16 come from `simd_wasm::wasm32_simd`; the long-tail integer
// and 256-bit-shaped types come from the scalar fallback. Same split
// `simd_neon` uses on aarch64 (native float kernels, scalar for the rest).
// The `wasm32_simd` module only exists under `target_feature = "simd128"`,
// so this arm is gated identically.
#[cfg(all(target_arch = "wasm32", target_feature = "simd128", not(feature = "nightly-simd")))]
pub use crate::simd_wasm::wasm32_simd::{f32x16, f64x8, i8x16, F32Mask16, F32x16, F64Mask8, F64x8, I8x16};
#[cfg(all(target_arch = "wasm32", target_feature = "simd128", not(feature = "nightly-simd")))]
pub use scalar::{
batch_packed_i4_16, f32x8, f64x4, i16x16, i16x32, i32x16, i32x8, i64x4, i64x8, i8x32, i8x64, palette_lookup_u8x8,

**P2:** Use the wasm `I8x16` in `batch_packed_i4_16`

When building wasm32 with `+simd128`, `crate::simd::I8x16` is the native v128 type re-exported just above, but this line re-exports `scalar::batch_packed_i4_16`, whose closure receives `simd::scalar::I8x16` (`src/simd_scalar.rs:1630`). Portable callers therefore fail to compile only on wasm SIMD if they pass a helper taking `crate::simd::I8x16` or write into `out: &mut [I8x16]`, and the re-export also bypasses the new native byte-lane backend for this W1a primitive. Either re-export a wasm implementation of `batch_packed_i4_16` built on `wasm32_simd::I8x16`, or keep `I8x16` scalar in this dispatch arm.
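A reduced sketch of the mismatch described in this comment. Everything here is a hypothetical stand-in: the two `I8x16` structs, `map_batch`, and `saturating_abs_lanes` are illustrative only, and the real signature of `batch_packed_i4_16` is not shown in this diff.

```rust
// Hypothetical stand-ins for the two backends.
mod scalar {
    #[derive(Clone, Copy, Debug, PartialEq)]
    pub struct I8x16(pub [i8; 16]);

    /// The callback is typed against this module's I8x16.
    pub fn map_batch<F: Fn(I8x16) -> I8x16>(out: &mut [I8x16], f: F) {
        for v in out.iter_mut() {
            *v = f(*v);
        }
    }
}

mod native {
    #[derive(Clone, Copy, Debug, PartialEq)]
    pub struct I8x16(pub [i8; 16]);

    /// Same shape, but typed against the native I8x16.
    pub fn map_batch<F: Fn(I8x16) -> I8x16>(out: &mut [I8x16], f: F) {
        for v in out.iter_mut() {
            *v = f(*v);
        }
    }
}

/// What a wasm32+simd128 facade would expose under the name `I8x16`.
pub type I8x16 = native::I8x16;

/// A portable helper written against the facade type.
pub fn saturating_abs_lanes(v: I8x16) -> I8x16 {
    let mut lanes = v.0;
    for lane in lanes.iter_mut() {
        *lane = (*lane).saturating_abs();
    }
    native::I8x16(lanes)
}

pub fn run(buf: &mut [I8x16]) {
    // scalar::map_batch(buf, saturating_abs_lanes); // does not type-check:
    // `buf` and the helper use native::I8x16, the function wants scalar::I8x16.
    native::map_batch(buf, saturating_abs_lanes);
}
```

The two structs have the same name and layout but are distinct types, so the facade has to re-export the type and the batch function from the same module for portable callers to compile.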


prefetch_read_t0, prefetch_read_t1, prefetch_read_t2, u16x16, u16x8, u32x16, u32x8, u64x4, u64x8, u8x64, u8x8,
F32x8, F64x4, I16x16, I16x32, I32x16, I32x8, I64x4, I64x8, I8x32, I8x64, U16x16, U16x32, U16x8, U32x16, U32x8,
U64x4, U64x8, U8x64, U8x8,
};

// Other non-x86 targets — wasm32 without simd128, riscv, etc.: full scalar
// fallback. Excludes the wasm32+simd128 case handled by the native arm above.
#[cfg(all(
not(target_arch = "x86_64"),
not(target_arch = "aarch64"),
not(all(target_arch = "wasm32", target_feature = "simd128")),
not(feature = "nightly-simd")
))]
pub use scalar::{
@@ -577,11 +608,16 @@ pub use crate::hpc::heel_f64x8::cosine_f32_to_f64_simd;
// whole AMX ladder through the canonical `ndarray::simd::*` import (W1a)
// without dipping into `crate::hpc::amx_matmul` directly. `amx_available()`
// exposes the runtime tier check for reporting.
#[cfg(feature = "std")]
// AMX is x86_64-only (the `amx_matmul` / `simd_amx` modules are
// `#[cfg(target_arch = "x86_64")]`), so these re-exports are arch-gated.
// Off x86 the cross-platform entry points are `backend::gemm_i8` /
// `backend::gemm_bf16` (portable scalar / NEON / wasm-SIMD paths).
#[cfg(all(feature = "std", target_arch = "x86_64"))]
pub use crate::hpc::amx_matmul::{amx_available, matmul_i8_to_i32};
// CPU-generation detection (cached): SPR / EMR / GNR / Sierra Forest. Lets a
// consumer report which silicon a run landed on and distinguish "no AMX
// silicon" from "AMX present but not OS-enabled" — both surface via `amx_report`.
#[cfg(target_arch = "x86_64")]
pub use crate::simd_amx::{amx_report, cpu_model, CpuModel};

// Elementwise slice ops — polyfill-dispatched (F32x16/F64x8 chunks + scalar tail).