Skip to content

[FIX]: Spsc queue false sharing#960

Open
raphael-s-steiner wants to merge 6 commits into
hw-native-sys:mainfrom
huawei-csl:SPSCQueue-false-sharing
Open

[FIX]: Spsc queue false sharing#960
raphael-s-steiner wants to merge 6 commits into
hw-native-sys:mainfrom
huawei-csl:SPSCQueue-false-sharing

Conversation

@raphael-s-steiner
Copy link
Copy Markdown
Contributor

@raphael-s-steiner raphael-s-steiner commented Jun 1, 2026

The mask and data ptr of PTO2SpscQueue are used (read) by both producer and consumer, but lies on a consumer cache line where the consumer also writes to - > False Sharing!

FIX: Duplicated mask and data ptr such that producer and consumer both have a local copy.
(Duplicated rather than own cache line as there is an imposed size constraint of 4 cache lines.)

Additionally, a minor optimisation to pop_batch:
Cached head value gets updated when less elements than requested rather than if no elements.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Jun 1, 2026

Review Change Stack

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2af194ec-3997-476d-8d69-dd4e3faeb2d1

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

PTO2SpscQueue across both a2a3 and a5 runtime implementations is refactored to maintain separate cached buffer pointers and masks for producer and consumer execution paths, replacing the previous shared buffer_/mask_ fields to improve cache line isolation.

Changes

SPSC Queue Per-Side Cache Isolation

Layer / File(s) Summary
Data structure: per-side cached fields
src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h, src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h
PTO2SpscQueue struct replaces single buffer_ and mask_ members with producer-side (buffer_p_, mask_p_) and consumer-side (buffer_c_, mask_c_) cached fields in both implementations.
Initialization: per-side mask setup
src/a2a3/runtime/.../pto_scheduler.h, src/a5/runtime/.../pto_scheduler.h
init_data_from_layout() writes the capacity-derived mask into both mask_p_ and mask_c_ instead of a single mask_ field in both code versions.
Arena wiring and lifecycle cleanup
src/a2a3/runtime/.../pto_scheduler.h, src/a5/runtime/.../pto_scheduler.h
wire_arena_pointers() wires the arena buffer pointer into both buffer_p_ and buffer_c_, and destroy() clears both per-side pointers instead of a single buffer in both implementations.
Producer path: push with isolated state
src/a2a3/runtime/.../pto_scheduler.h, src/a5/runtime/.../pto_scheduler.h
push() method now uses producer-cached mask_p_ for full-capacity checks and buffer_p_ for element writes in both a2a3 and a5 implementations.
Consumer path: pop_batch with isolated state
src/a2a3/runtime/.../pto_scheduler.h, src/a5/runtime/.../pto_scheduler.h
pop_batch() method reads queued entries from consumer-cached buffer_c_ indexed with mask_c_ (after head-cache refill logic) in both a2a3 and a5 implementations.

Estimated Code Review Effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

A rabbit's queue grows divided—
Producer and consumer, once tight-knit, now widen their stride,
Each path claims its own cache line with pride,
No false sharing to slow the ride! 🐰✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 69.23% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title '[FIX]: Spsc queue false sharing' directly and clearly describes the main change: fixing false sharing in the SPSC queue by separating producer and consumer cached pointers.
Description check ✅ Passed The description is directly related to the changeset, explaining the false sharing problem, the fix (duplicating mask and data pointers for producer and consumer), and a minor optimization to pop_batch.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request splits the shared buffer and mask members of PTO2SpscQueue into separate producer (buffer_p_, mask_p_) and consumer (buffer_c_, mask_c_) local copies in both scheduler headers, and updates the pop_batch logic. However, these changes reduce the struct size from 256 to 192 bytes, which will cause static assertions regarding struct size and alignment to fail. To resolve this, 96 bytes of padding should be added to the end of the struct to maintain the 256-byte size.

Comment thread src/a2a3/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h Outdated
Comment thread src/a5/runtime/tensormap_and_ringbuffer/runtime/scheduler/pto_scheduler.h Outdated
@raphael-s-steiner
Copy link
Copy Markdown
Contributor Author

Benchmark Comparison: 411157ef  SPSCQueue-false-sharing (0edc907)

Runtime: tensormap_and_ringbuffer | Rounds: 100 | Platform: a2a3
Device: baseline=4, current=5 (parallel)
PTO-ISA: ddafa8da
⚠️ task-submit not available — run was unlocked.

tensormap_and_ringbuffer

Device (us) = AICPU mailbox orch_start→orch_end — the primary metric. Host timing is unreliable here because both processes ran simultaneously sharing the host CPU.

Example Base (us) HEAD (us) Delta (us) Change (%) Assessment
alternating_matmul_add (Case1) 2328.4 2299.5 -28.9 -1.24% Within noise
benchmark_bgemm (Case0) 2122.1 2108.7 -13.4 -0.63% Within noise
paged_attention_unroll (Case1) 2652.5 2711.3 +58.8 +2.22% Marginal (device variance)
paged_attention_unroll (Case2) 2066.6 2047.3 -19.3 -0.93% Within noise
paged_attention_unroll_manual_scope (Case1) 2709.5 2697.8 -11.7 -0.43% Within noise
paged_attention_unroll_manual_scope (Case2) 2053.0 2039.8 -13.2 -0.64% Within noise
batch_paged_attention (Case1) 4603.5 4475.8 -127.7 -2.77% Notable improvement
spmd_paged_attention (Case1) 2813.5 2729.6 -83.9 -2.98% Notable improvement
spmd_paged_attention (Case2) 2158.4 2092.5 -65.9 -3.05% Notable improvement

Overall: 3 of 9 improved (>2%), 0 regressions >5%.


Interpretation

  • batch_paged_attention and spmd_paged_attention show consistent device-time improvements of ~3%, plausibly attributable to the false-sharing fix on the SPSCQueue.
  • paged_attention_unroll (Case1) shows +2.22% — barely above the ±2% noise margin for different-device comparisons. Not a confirmed regression.
  • Host timing is highly variable (±50–100%) because baseline and current ran in parallel sharing the host CPU. Ignore it for this comparison.

Caveat: Baseline and current ran on different NPU devices (4 vs 5). Results within ±2% may reflect device-to-device variance rather than real code changes. For a definitive result, re-run on the same device: /benchmark -d <single_device>.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants