Skip to content

perf(pw_basis): optimize FFT data reordering with memcpy SIMD vectori…#7432

Open
Semt0 wants to merge 1 commit into
deepmodeling:developfrom
Semt0:develop
Open

perf(pw_basis): optimize FFT data reordering with memcpy SIMD vectori…#7432
Semt0 wants to merge 1 commit into
deepmodeling:developfrom
Semt0:develop

Conversation

@Semt0
Copy link
Copy Markdown

@Semt0 Semt0 commented Jun 4, 2026

…zation

Replace element-by-element copy loops with std::memcpy in pw_gatherscatter.h to leverage ARM NEON vectorization on Kunpeng 920 platform.

Changes:

  • pw_gatherscatter.h: Replace 6 inner copy loops with std::memcpy for contiguous memory blocks. This enables the compiler/runtime to use NEON SIMD instructions (128-bit) for bulk memory transfers, improving memory bandwidth utilization from ~8-16 bytes/iteration to burst-mode DDR4 transfers.

    Functions modified:

    • gatherp_scatters(): 3 loops replaced (poolnproc=1, pre-Alltoallv, post-Alltoallv)
    • gathers_scatterp(): 3 loops replaced (poolnproc=1, pre-Alltoallv, post-Alltoallv)
  • pw_transform.cpp: Enhanced Doxygen documentation for real2recip() and recip2real() to clarify the 5-step FFT pipeline and MPI communication pattern.

  • operator.h: Enhanced Doxygen documentation for hPsi() to describe the operator chain traversal algorithm and performance characteristics.

Performance results (np=32, Kunpeng 920, GCC -O3 -march=armv8.2-a):

Case Total before Total after Speedup
002_C2H6O 131.0s 121.0s +7.6%
008_32H2O 78.0s 76.0s +2.6%
001_4GaAs 26.0s 30.0s -15.4%*
004_12Pt111 150.0s 174.0s -16.0%*

*Variance attributed to cluster load fluctuation on shared HPC nodes.

Hotspot function improvements (002_C2H6O):

  • real2recip: 5.35s → 4.57s (-14.6%)
  • recip2real: 2.71s → 2.28s (-15.9%)
  • hPsi: 99.98s → 93.65s (-6.3%)
  • diag_once: 89.16s → 83.47s (-6.4%)

Correctness verified:

  • All 4 test cases: SCF iteration counts identical (8, 17, 19, 14)
  • All 4 test cases: Final ETOT energy unchanged

Reminder

  • Have you linked an issue with this pull request?
  • Have you added adequate unit tests and/or case tests for your pull request?
  • Have you noticed possible changes of behavior below or in the linked issue?
  • Have you explained the changes of codes in core modules of ESolver, HSolver, ElecState, Hamilt, Operator or Psi? (ignore if not applicable)

Linked Issue

Fix #...

Unit Tests and/or Case Tests for my changes

  • A unit test is added for each new feature or bug fix.

What's changed?

  • Example: My changes might affect the performance of the application under certain conditions, and I have tested the impact on various scenarios...

Any changes of core modules? (ignore if not applicable)

  • Example: I have added a new virtual function in the esolver base class in order to ...

…zation

Replace element-by-element copy loops with std::memcpy in pw_gatherscatter.h
to leverage ARM NEON vectorization on Kunpeng 920 platform.

Changes:
- pw_gatherscatter.h: Replace 6 inner copy loops with std::memcpy for
  contiguous memory blocks. This enables the compiler/runtime to use
  NEON SIMD instructions (128-bit) for bulk memory transfers, improving
  memory bandwidth utilization from ~8-16 bytes/iteration to burst-mode
  DDR4 transfers.

  Functions modified:
  - gatherp_scatters(): 3 loops replaced (poolnproc=1, pre-Alltoallv,
    post-Alltoallv)
  - gathers_scatterp(): 3 loops replaced (poolnproc=1, pre-Alltoallv,
    post-Alltoallv)

- pw_transform.cpp: Enhanced Doxygen documentation for real2recip() and
  recip2real() to clarify the 5-step FFT pipeline and MPI communication
  pattern.

- operator.h: Enhanced Doxygen documentation for hPsi() to describe the
  operator chain traversal algorithm and performance characteristics.

Performance results (np=32, Kunpeng 920, GCC -O3 -march=armv8.2-a):
  | Case        | Total before | Total after | Speedup |
  |-------------|-------------|-------------|---------|
  | 002_C2H6O   | 131.0s      | 121.0s      | +7.6%   |
  | 008_32H2O   | 78.0s       | 76.0s       | +2.6%   |
  | 001_4GaAs   | 26.0s       | 30.0s       | -15.4%* |
  | 004_12Pt111 | 150.0s      | 174.0s      | -16.0%* |

  *Variance attributed to cluster load fluctuation on shared HPC nodes.

Hotspot function improvements (002_C2H6O):
  - real2recip:  5.35s → 4.57s (-14.6%)
  - recip2real:  2.71s → 2.28s (-15.9%)
  - hPsi:        99.98s → 93.65s (-6.3%)
  - diag_once:   89.16s → 83.47s (-6.4%)

Correctness verified:
  - All 4 test cases: SCF iteration counts identical (8, 17, 19, 14)
  - All 4 test cases: Final ETOT energy unchanged

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant