[FIX]: Spsc queue false sharing#960
Conversation
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughPTO2SpscQueue across both a2a3 and a5 runtime implementations is refactored to maintain separate cached buffer pointers and masks for producer and consumer execution paths, replacing the previous shared buffer_/mask_ fields to improve cache line isolation. ChangesSPSC Queue Per-Side Cache Isolation
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request splits the shared buffer and mask members of PTO2SpscQueue into separate producer (buffer_p_, mask_p_) and consumer (buffer_c_, mask_c_) local copies in both scheduler headers, and updates the pop_batch logic. However, these changes reduce the struct size from 256 to 192 bytes, which will cause static assertions regarding struct size and alignment to fail. To resolve this, 96 bytes of padding should be added to the end of the struct to maintain the 256-byte size.
Benchmark Comparison: |
| Example | Base (us) | HEAD (us) | Delta (us) | Change (%) | Assessment |
|---|---|---|---|---|---|
| alternating_matmul_add (Case1) | 2328.4 | 2299.5 | -28.9 | -1.24% | Within noise |
| benchmark_bgemm (Case0) | 2122.1 | 2108.7 | -13.4 | -0.63% | Within noise |
| paged_attention_unroll (Case1) | 2652.5 | 2711.3 | +58.8 | +2.22% | Marginal (device variance) |
| paged_attention_unroll (Case2) | 2066.6 | 2047.3 | -19.3 | -0.93% | Within noise |
| paged_attention_unroll_manual_scope (Case1) | 2709.5 | 2697.8 | -11.7 | -0.43% | Within noise |
| paged_attention_unroll_manual_scope (Case2) | 2053.0 | 2039.8 | -13.2 | -0.64% | Within noise |
| batch_paged_attention (Case1) | 4603.5 | 4475.8 | -127.7 | -2.77% | Notable improvement |
| spmd_paged_attention (Case1) | 2813.5 | 2729.6 | -83.9 | -2.98% | Notable improvement |
| spmd_paged_attention (Case2) | 2158.4 | 2092.5 | -65.9 | -3.05% | Notable improvement |
Overall: 3 of 9 improved (>2%), 0 regressions >5%.
Interpretation
- batch_paged_attention and spmd_paged_attention show consistent device-time improvements of ~3%, plausibly attributable to the false-sharing fix on the SPSCQueue.
- paged_attention_unroll (Case1) shows +2.22% — barely above the ±2% noise margin for different-device comparisons. Not a confirmed regression.
- Host timing is highly variable (±50–100%) because baseline and current ran in parallel sharing the host CPU. Ignore it for this comparison.
Caveat: Baseline and current ran on different NPU devices (4 vs 5). Results within ±2% may reflect device-to-device variance rather than real code changes. For a definitive result, re-run on the same device:
/benchmark -d <single_device>.
The mask and data ptr of PTO2SpscQueue are used (read) by both producer and consumer, but lies on a consumer cache line where the consumer also writes to - > False Sharing!
FIX: Duplicated mask and data ptr such that producer and consumer both have a local copy.
(Duplicated rather than own cache line as there is an imposed size constraint of 4 cache lines.)
Additionally, a minor optimisation to pop_batch:
Cached head value gets updated when less elements than requested rather than if no elements.