[FIX]: Spsc queue false sharing by raphael-s-steiner · Pull Request #960 · hw-native-sys/simpler

raphael-s-steiner · 2026-06-01T09:10:37Z

The mask and data pointer of PTO2SpscQueue are read by both producer and consumer, but they lie on a consumer cache line that the consumer also writes to -> false sharing!

FIX: Duplicated the mask and data pointer so that producer and consumer each have a local copy.
(Duplicated rather than moved to their own cache line, as there is an imposed size constraint of 4 cache lines.)
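A minimal layout sketch of the idea, not the actual `PTO2SpscQueue` definition. Only `buffer_p_`, `mask_p_`, `buffer_c_` and `mask_c_` are names from this PR; the struct name, the index and cache field names, the 64-byte line size, and the choice of `head_` as the producer-written index are assumptions made for illustration. It also assumes a 64-bit target.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// Hypothetical layout. Each side keeps its own copy of the read-only
// buffer pointer and mask next to the fields that side writes.
struct SpscLayoutSketch {
    // Producer line: after initialisation, only the producer writes here.
    alignas(kCacheLine) std::atomic<std::uint64_t> head_{0};
    std::uint64_t tail_cache_{0};       // producer's snapshot of tail_
    std::uint64_t* buffer_p_{nullptr};  // producer's copy of the data pointer
    std::uint64_t mask_p_{0};           // producer's copy of capacity - 1

    // Consumer line: after initialisation, only the consumer writes here.
    alignas(kCacheLine) std::atomic<std::uint64_t> tail_{0};
    std::uint64_t head_cache_{0};       // consumer's snapshot of head_
    std::uint64_t* buffer_c_{nullptr};  // consumer's copy of the data pointer
    std::uint64_t mask_c_{0};           // consumer's copy of capacity - 1
};

// Index of the cache line that holds a given byte offset.
constexpr std::size_t line_of(std::size_t offset) { return offset / kCacheLine; }

constexpr std::size_t kProducerLine = line_of(offsetof(SpscLayoutSketch, buffer_p_));
constexpr std::size_t kConsumerLine = line_of(offsetof(SpscLayoutSketch, buffer_c_));

static_assert(alignof(SpscLayoutSketch) == kCacheLine, "struct starts on a line boundary");
static_assert(line_of(offsetof(SpscLayoutSketch, mask_p_)) == kProducerLine,
              "producer copies share one line");
static_assert(line_of(offsetof(SpscLayoutSketch, mask_c_)) == kConsumerLine,
              "consumer copies share one line");
static_assert(kProducerLine != kConsumerLine, "the two sides never share a line");
```

The duplicated fields cost 16 bytes per side but fit into lines that already exist, which is why duplication can be cheaper than a dedicated read-only line when the total size is capped.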

Additionally, a minor optimisation to pop_batch:
The cached head value now gets refreshed when fewer elements than requested are available, rather than only when no elements are available.
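The refresh rule can be sketched with a miniature ring buffer. This is an illustrative stand-in, not the PR's code: the class name `MiniSpsc`, the `std::vector` storage, the `uint64_t` element type, and the memory orders are assumptions, as is the reading that `head_` is the producer-written index. Only the `*_p_` / `*_c_` field names and the "refresh when short" rule come from the PR text.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical miniature SPSC ring; capacity must be a power of two.
class MiniSpsc {
public:
    explicit MiniSpsc(std::size_t capacity_pow2)
        : storage_(capacity_pow2),
          buffer_p_(storage_.data()), mask_p_(capacity_pow2 - 1),
          buffer_c_(storage_.data()), mask_c_(capacity_pow2 - 1) {}

    // Producer side: touches only head_, tail_cache_, buffer_p_, mask_p_
    // (plus an acquire load of tail_ when the queue looks full).
    bool push(std::uint64_t value) {
        const std::uint64_t head = head_.load(std::memory_order_relaxed);
        if (head - tail_cache_ > mask_p_) {  // looks full: refresh snapshot
            tail_cache_ = tail_.load(std::memory_order_acquire);
            if (head - tail_cache_ > mask_p_) return false;  // really full
        }
        buffer_p_[head & mask_p_] = value;
        head_.store(head + 1, std::memory_order_release);
        return true;
    }

    // Consumer side: copies up to max_n elements into out, returns the count.
    std::size_t pop_batch(std::uint64_t* out, std::size_t max_n) {
        const std::uint64_t tail = tail_.load(std::memory_order_relaxed);
        std::uint64_t avail = head_cache_ - tail;
        if (avail < max_n) {  // refresh when short, not only when empty
            head_cache_ = head_.load(std::memory_order_acquire);
            avail = head_cache_ - tail;
        }
        const std::size_t n = static_cast<std::size_t>(avail < max_n ? avail : max_n);
        for (std::size_t i = 0; i < n; ++i) {
            out[i] = buffer_c_[(tail + i) & mask_c_];
        }
        tail_.store(tail + n, std::memory_order_release);
        return n;
    }

private:
    std::vector<std::uint64_t> storage_;  // declared first so data() is valid below

    alignas(64) std::atomic<std::uint64_t> head_{0};  // written by producer
    std::uint64_t tail_cache_{0};
    std::uint64_t* buffer_p_;
    std::uint64_t mask_p_;

    alignas(64) std::atomic<std::uint64_t> tail_{0};  // written by consumer
    std::uint64_t head_cache_{0};
    std::uint64_t* buffer_c_;
    std::uint64_t mask_c_;
};
```

With a refresh only on an empty cached view, a consumer whose snapshot shows one element would return a batch of one even if the producer has published more since; refreshing whenever the snapshot is short lets the same call pick up those newer elements.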

coderabbitai · 2026-06-01T09:10:52Z

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.



📝 Walkthrough

PTO2SpscQueue across both a2a3 and a5 runtime implementations is refactored to maintain separate cached buffer pointers and masks for producer and consumer execution paths, replacing the previous shared buffer_/mask_ fields to improve cache line isolation.

Changes

SPSC Queue Per-Side Cache Isolation

| Layer / File(s) | Summary |
| --- | --- |
| Data structure: per-side cached fields `src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h`, `src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h` | PTO2SpscQueue struct replaces single `buffer_` and `mask_` members with producer-side (`buffer_p_`, `mask_p_`) and consumer-side (`buffer_c_`, `mask_c_`) cached fields in both implementations. |
| Initialization: per-side mask setup `src/a2a3/runtime/.../pto_scheduler.h`, `src/a5/runtime/.../pto_scheduler.h` | `init_data_from_layout()` writes the capacity-derived mask into both `mask_p_` and `mask_c_` instead of a single `mask_` field in both code versions. |
| Arena wiring and lifecycle cleanup `src/a2a3/runtime/.../pto_scheduler.h`, `src/a5/runtime/.../pto_scheduler.h` | `wire_arena_pointers()` wires the arena buffer pointer into both `buffer_p_` and `buffer_c_`, and `destroy()` clears both per-side pointers instead of a single buffer in both implementations. |
| Producer path: push with isolated state `src/a2a3/runtime/.../pto_scheduler.h`, `src/a5/runtime/.../pto_scheduler.h` | `push()` method now uses producer-cached `mask_p_` for full-capacity checks and `buffer_p_` for element writes in both a2a3 and a5 implementations. |
| Consumer path: pop_batch with isolated state `src/a2a3/runtime/.../pto_scheduler.h`, `src/a5/runtime/.../pto_scheduler.h` | `pop_batch()` method reads queued entries from consumer-cached `buffer_c_` indexed with `mask_c_` (after head-cache refill logic) in both a2a3 and a5 implementations. |

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit's queue grows divided—
Producer and consumer, once tight-knit, now widen their stride,
Each path claims its own cache line with pride,
No false sharing to slow the ride! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 69.23% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title '[FIX]: Spsc queue false sharing' directly and clearly describes the main change: fixing false sharing in the SPSC queue by separating producer and consumer cached pointers. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the false sharing problem, the fix (duplicating mask and data pointers for producer and consumer), and a minor optimization to pop_batch. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


gemini-code-assist

Code Review

This pull request splits the shared buffer and mask members of PTO2SpscQueue into separate producer (buffer_p_, mask_p_) and consumer (buffer_c_, mask_c_) local copies in both scheduler headers, and updates the pop_batch logic. However, these changes reduce the struct size from 256 to 192 bytes, which will cause static assertions regarding struct size and alignment to fail. To resolve this, 96 bytes of padding should be added to the end of the struct to maintain the 256-byte size.

raphael-s-steiner · 2026-06-05T08:42:48Z

Benchmark Comparison: `411157ef` → `SPSCQueue-false-sharing` (`0edc907`)

Runtime: tensormap_and_ringbuffer | Rounds: 100 | Platform: a2a3
Device: baseline=4, current=5 (parallel)
PTO-ISA: ddafa8da
⚠️ task-submit not available — run was unlocked.

tensormap_and_ringbuffer

Device (us) = AICPU mailbox orch_start→orch_end — the primary metric. Host timing is unreliable here because both processes ran simultaneously sharing the host CPU.

| Example | Base (us) | HEAD (us) | Delta (us) | Change (%) | Assessment |
| --- | ---: | ---: | ---: | ---: | --- |
| alternating_matmul_add (Case1) | 2328.4 | 2299.5 | -28.9 | -1.24% | Within noise |
| benchmark_bgemm (Case0) | 2122.1 | 2108.7 | -13.4 | -0.63% | Within noise |
| paged_attention_unroll (Case1) | 2652.5 | 2711.3 | +58.8 | +2.22% | Marginal (device variance) |
| paged_attention_unroll (Case2) | 2066.6 | 2047.3 | -19.3 | -0.93% | Within noise |
| paged_attention_unroll_manual_scope (Case1) | 2709.5 | 2697.8 | -11.7 | -0.43% | Within noise |
| paged_attention_unroll_manual_scope (Case2) | 2053.0 | 2039.8 | -13.2 | -0.64% | Within noise |
| batch_paged_attention (Case1) | 4603.5 | 4475.8 | -127.7 | -2.77% | Notable improvement |
| spmd_paged_attention (Case1) | 2813.5 | 2729.6 | -83.9 | -2.98% | Notable improvement |
| spmd_paged_attention (Case2) | 2158.4 | 2092.5 | -65.9 | -3.05% | Notable improvement |

Overall: 3 of 9 improved (>2%), 0 regressions >5%.

Interpretation

batch_paged_attention and spmd_paged_attention show consistent device-time improvements of ~3%, plausibly attributable to the false-sharing fix on the SPSCQueue.
paged_attention_unroll (Case1) shows +2.22% — barely above the ±2% noise margin for different-device comparisons. Not a confirmed regression.
Host timing is highly variable (±50–100%) because baseline and current ran in parallel sharing the host CPU. Ignore it for this comparison.

Caveat: Baseline and current ran on different NPU devices (4 vs 5). Results within ±2% may reflect device-to-device variance rather than real code changes. For a definitive result, re-run on the same device: /benchmark -d <single_device>.
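For reference, the Change (%) column and the ±2% noise check can be reproduced with a few lines. This is a sketch under the assumption that the change is computed as (HEAD - Base) / Base; the function names are hypothetical and are not part of the benchmark tooling.

```cpp
#include <cassert>
#include <cmath>

// Relative change of HEAD against Base, in percent; negative means faster.
double change_percent(double base_us, double head_us) {
    return (head_us - base_us) / base_us * 100.0;
}

// True if the change falls inside the given noise margin (default: the
// +/-2% quoted for different-device comparisons).
bool within_noise(double base_us, double head_us, double margin_percent = 2.0) {
    return std::fabs(change_percent(base_us, head_us)) <= margin_percent;
}
```

For batch_paged_attention (Case1), `change_percent(4603.5, 4475.8)` comes out at roughly -2.77, which is outside the margin, while alternating_matmul_add (Case1) stays inside it.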

raphael-s-steiner added 2 commits June 1, 2026 09:47

SPSCQueue improved caching

8beecd3

removed false sharing

209c16f

gemini-code-assist Bot reviewed Jun 1, 2026

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h Outdated

Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h Outdated

raphael-s-steiner added 3 commits June 1, 2026 11:30

added padding

4fbcff2

fix unit tests

0fc7167

Merge branch 'main' into SPSCQueue-false-sharing

4709a4b

poursoul reviewed Jun 2, 2026

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h

5 cache lines

0edc907

raphael-s-steiner wants to merge 6 commits into hw-native-sys:main from huawei-csl:SPSCQueue-false-sharing

3 participants
