modified: src/force/neighbor.cu #1577
Merged
modified: src/force/neighbor.cuh modified: src/force/nep.cu modified: src/force/nep_small_box.cuh
modified: src/force/nep_small_box.cuh
modified: src/force/neighbor.cuh modified: src/force/nep.cu modified: src/force/nep_multigpu.cu modified: src/force/nep_small_box.cuh
Summary
This PR adds local `size_t` casts to large NEP array size calculations and flattened neighbor-list/descriptor offsets that can exceed 32-bit integer range in large-system GPU runs.

Modification

- Cast large `GPU_Vector::resize()` products to `size_t` in `NEP::NEP()`:
  - `nep_data.NL_radial`
  - `nep_data.Fp`
  - `nep_data.sum_fxyz`
- Cast per-GPU `GPU_Vector::resize()` products to `size_t` in `NEP_MULTIGPU::allocate_memory()`:
  - `nep_data[gpu].NL_radial`
  - `nep_data[gpu].Fp`
  - `nep_data[gpu].sum_fxyz`
- Cast flattened NEP large-box offsets to `size_t` for:
  - `NL_radial` writes and reads
  - `Fp` writes and reads
  - `sum_fxyz` writes and reads
- Cast the global neighbor-list allocation size in `Neighbor::initialize()`.
- Cast flattened global/local neighbor-list offsets in non-ILP global neighbor-list kernels.
- Cast flattened read/write offsets in the ordinary neighbor-list sort kernel in `neighbor.cuh`.

Validation


The x-axis is the number of atoms, shown on a logarithmic scale. The speed panel reports throughput in units of 10^7 atom step s^-1. The memory panel reports peak GPU memory used in GiB. Blue circles represent the old version; orange squares represent the new version. Dashed vertical lines in the memory panel mark the first failed point for each version. The A100 speed panel includes an inset that zooms in on the high-atom-count region of the new version.

Compared with the old version, the new version reaches about 3.62x more atoms on A100 and about 1.47x more atoms on 2V100. The corresponding peak GPU memory usage increases from 22.50 to 77.69 GiB on A100, and from 39.54 to 57.22 GiB on 2V100.
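
For reviewers unfamiliar with the overflow class described in the Summary, here is a minimal host-side C++ sketch of the pattern. It is not code from this PR: the helper names `flat_offset` and `fits_in_int32`, the slot-major layout, and the atom and neighbor counts are all assumptions chosen for illustration, and a 64-bit `size_t` is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper (not from the PR): flattened offset of neighbor slot
// `slot` for atom `atom`, assuming a slot-major layout with `num_atoms`
// entries per slot row.
inline std::size_t flat_offset(int atom, int slot, int num_atoms)
{
  // Casting one operand first makes the multiplication happen in size_t.
  // Writing `slot * num_atoms + atom` would multiply in int and could
  // overflow before any widening takes place.
  return static_cast<std::size_t>(slot) * num_atoms + atom;
}

// Hypothetical check: does num_atoms * max_neighbors still fit in a
// 32-bit signed integer?
inline bool fits_in_int32(int num_atoms, int max_neighbors)
{
  const std::int64_t product =
    static_cast<std::int64_t>(num_atoms) * max_neighbors;
  return product <= std::numeric_limits<std::int32_t>::max();
}
```

With the illustrative values of 5,000,000 atoms and 500 neighbor slots, the product is 2,500,000,000, which exceeds the 32-bit signed maximum of 2,147,483,647. The same reasoning applies to allocation sizes passed to a resize call: the cast has to be applied to an operand, not to the already-computed `int` product.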