perf(pw_basis): optimize FFT data reordering with memcpy SIMD vectorization by Semt0 · Pull Request #7432 · deepmodeling/abacus-develop

Semt0 · 2026-06-04T13:34:35Z

Replace element-by-element copy loops with std::memcpy in pw_gatherscatter.h to leverage ARM NEON vectorization on the Kunpeng 920 platform.

Changes:

- pw_gatherscatter.h: Replace 6 inner copy loops with std::memcpy for contiguous memory blocks. This lets the compiler and the C library's memcpy use 128-bit NEON SIMD loads and stores, so copies move in bulk rather than ~8-16 bytes per loop iteration, making better use of DDR4 memory bandwidth.

  Functions modified:
  - gatherp_scatters(): 3 loops replaced (poolnproc=1, pre-Alltoallv, post-Alltoallv)
  - gathers_scatterp(): 3 loops replaced (poolnproc=1, pre-Alltoallv, post-Alltoallv)
- pw_transform.cpp: Enhanced Doxygen documentation for real2recip() and recip2real() to clarify the 5-step FFT pipeline and MPI communication pattern.
- operator.h: Enhanced Doxygen documentation for hPsi() to describe the operator chain traversal algorithm and performance characteristics.
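The loop-to-memcpy replacement can be sketched as follows. This is a minimal illustration, not the code in pw_gatherscatter.h: the function names, the `stick_to_xy` index map, and the stick-by-plane layout are assumptions chosen for the example. It relies on each copied block being contiguous and non-overlapping, and on std::complex<double> being trivially copyable (checked by the static_assert).

```cpp
#include <complex>
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

using cplx = std::complex<double>;

// memcpy is only valid for trivially copyable element types.
static_assert(std::is_trivially_copyable<cplx>::value,
              "memcpy requires a trivially copyable element type");

// "Before" pattern (hypothetical): copy one contiguous block element by element.
void copy_block_loop(cplx* dst, const cplx* src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        dst[i] = src[i];
    }
}

// "After" pattern (hypothetical): the same block copied with one std::memcpy.
// dst and src must not overlap.
void copy_block_memcpy(cplx* dst, const cplx* src, std::size_t n)
{
    std::memcpy(dst, src, n * sizeof(cplx));
}

// Illustrative reordering: stick `is` of `in` (nplane contiguous values)
// is copied to column stick_to_xy[is] of `out`. Only the inner copy over
// the planes is contiguous, so only that part becomes a memcpy.
void gather_sticks(cplx* out, const cplx* in,
                   const std::vector<int>& stick_to_xy, int nplane)
{
    const std::size_t np = static_cast<std::size_t>(nplane);
    for (std::size_t is = 0; is < stick_to_xy.size(); ++is)
    {
        const std::size_t ixy = static_cast<std::size_t>(stick_to_xy[is]);
        std::memcpy(out + ixy * np, in + is * np, np * sizeof(cplx));
    }
}
```

For example, with `nplane = 2` and `stick_to_xy = {2, 0}`, the first stick of `in` lands in `out[4]` and `out[5]`, and the second in `out[0]` and `out[1]`. The outer loop over sticks stays a loop because the destination columns are not adjacent.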

Performance results (np=32, Kunpeng 920, GCC -O3 -march=armv8.2-a):

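| Case | Total before | Total after | Speedup (positive = faster) |
|-------------|-------------|-------------|---------|
| 002_C2H6O | 131.0s | 121.0s | +7.6% |
| 008_32H2O | 78.0s | 76.0s | +2.6% |
| 001_4GaAs | 26.0s | 30.0s | -15.4%* |
| 004_12Pt111 | 150.0s | 174.0s | -16.0%* |

*These slowdowns are attributed to cluster load fluctuation on shared HPC nodes.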

Hotspot function improvements (002_C2H6O; negative percentages are reductions in time):

- real2recip: 5.35s → 4.57s (-14.6%)
- recip2real: 2.71s → 2.28s (-15.9%)
- hPsi: 99.98s → 93.65s (-6.3%)
- diag_once: 89.16s → 83.47s (-6.4%)

Correctness verified:

- All 4 test cases: SCF iteration counts are identical before and after the change (8, 17, 19, 14)
- All 4 test cases: Final ETOT energy is unchanged

Reminder

Have you linked an issue with this pull request?
Have you added adequate unit tests and/or case tests for your pull request?
Have you noticed possible changes of behavior below or in the linked issue?
Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Fix #...

Unit Tests and/or Case Tests for my changes

A unit test is added for each new feature or bug fix.

What's changed?

Example: My changes might affect the performance of the application under certain conditions, and I have tested the impact on various scenarios...

Any changes of core modules? (ignore if not applicable)

Example: I have added a new virtual function in the esolver base class in order to ...

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Semt0 wants to merge 1 commit into deepmodeling:develop from Semt0:develop
