coll/accelerator add support for more functions #13006
Conversation
Force-pushed from ae0525b to 94a56ab
Force-pushed from 8658265 to 5e0bcd7
/* Using sbufsize here on purpose to ensure symmetric decision for handling of GPU vs
   CPU buffers. The two buffer sizes are expected to be the same for pre-defined datatypes,
   but could vary due to layout issues/gaps for derived datatypes */
if ((rc > 0) && ((sbufsize * comm_size) <= (size_t)mca_coll_accelerator_alltoall_thresh)) {
Similar problem here: at higher process counts we would tend to use the device memory for many small send/recvs. I'd expect the overhead of small transfers from device memory to become more pronounced at higher process counts. Why do we need to factor in the process count at all?
I have honestly been going back and forth on whether to use the parameter as a per-process parameter or as a parameter that limits the amount of temporary memory allocated on the CPU side. I am happy to use this as a per-process parameter (i.e. without communicator size playing a role). My original (maybe incorrect) line of thinking had to do with the v-versions of these operations (which this commit doesn't tackle yet).
I have changed the parameter to be a per-process threshold (vs. the overall buffer size). A minimal sketch of the revised guard follows.
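For concreteness, a minimal sketch of the revised decision, reusing the variable names from the quoted diff; use_cpu_staging is an illustrative name, not code from the PR:

#include <stddef.h>

/* Stand-in declaration for the component's MCA parameter. */
extern int mca_coll_accelerator_alltoall_thresh;

/* Per-process decision: comm_size no longer appears in the comparison, so
 * the CPU-staging path is chosen from the per-rank send buffer size alone. */
static inline int use_cpu_staging(int rc, size_t sbufsize)
{
    return (rc > 0) && (sbufsize <= (size_t)mca_coll_accelerator_alltoall_thresh);
}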
mca_coll_accelerator_allgather_thresh = 65536;
(void) mca_base_component_var_register(&mca_coll_accelerator_component.super.collm_version,
                                       "allgather_thresh",
                                       "max. overall msg length for which to copy accelerator buffer to CPU for allgather operation",
I'm not sure that "overall msg length" communicates well that the threshold is compared to send_size * comm_size. See my comment above for my concerns about using the process count.
I removed the word 'overall' along with the adjustment to the meaning of the parameter. Thank you for the review!
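(For reference: with Open MPI's standard MCA naming, this parameter would be tuned at launch time with something like mpirun --mca coll_accelerator_allgather_thresh 32768 ./app; the full parameter name is an assumption based on the component name and the registration above.)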
Force-pushed from 5e0bcd7 to 167b8b2
Force-pushed from 167b8b2 to d9af881
This is a needed update. However, I have 2 issues with this PR:
- conceptually, copying the full span of the datatype back and forth between the CPU and GPU means we overwrite all the gaps. This leads to unexpected behavior if a GPU kernel changes the values in the gaps.
- we need to be careful to move the correct amount of data, which must include a factor of the communicator size for some operations.
I don't have a better idea of how to address these: either we document the behavior clearly, or we use the pack functions and only support homogeneous environments.
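To illustrate the first point, here is a hypothetical sketch (not code from the PR) assuming the staging path sizes its buffer with ompi_datatype_span(), the helper used elsewhere in the coll framework; alloc_span_staging and its arguments are illustrative:

#include <stdlib.h>
#include "ompi/datatype/ompi_datatype.h"

/* A staging buffer sized by the span of the datatype necessarily covers the
 * gaps as well, so d2h/h2d copies of 'span' bytes move the gap contents too,
 * overwriting whatever a GPU kernel may have written into those gaps. */
static char *alloc_span_staging(ompi_datatype_t *dtype, size_t count, char **base_out)
{
    ptrdiff_t gap;
    ptrdiff_t span = ompi_datatype_span(dtype, count, &gap);
    char *raw = (char *)malloc(span);
    if (NULL != raw) {
        *base_out = raw - gap; /* usual coll/base idiom for the data's true start */
    }
    return raw;
}

The pack-function alternative mentioned above would move only the datatype's live bytes, avoiding the gap problem at the cost of the restrictions noted.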
}
if ((MPI_IN_PLACE != sbuf) && (rc > 0) &&
    (sbufsize <= (size_t)mca_coll_accelerator_allgather_thresh)) {
    sbuf1 = (char*)malloc(sbufsize * comm_size);
We are in an allgather; the same buffer goes to all ranks. Why allocate sbufsize * comm_size?
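To make the sizing concern concrete, a hedged sketch of the buffer math for allgather, with illustrative names (stage_allgather, *_host) that are not the PR's own:

#include <stdlib.h>

/* In an allgather each rank sends one contribution and receives comm_size
 * contributions: the send-side staging buffer needs only sbufsize bytes,
 * while the receive-side staging buffer must cover rbufsize * comm_size. */
static int stage_allgather(size_t sbufsize, size_t rbufsize, int comm_size,
                           char **sbuf_host, char **rbuf_host)
{
    *sbuf_host = (char *)malloc(sbufsize);                     /* one contribution */
    *rbuf_host = (char *)malloc(rbufsize * (size_t)comm_size); /* all contributions */
    if (NULL == *sbuf_host || NULL == *rbuf_host) {
        free(*sbuf_host);
        free(*rbuf_host);
        *sbuf_host = *rbuf_host = NULL;
        return -1;
    }
    return 0;
}

By the same argument, any copy of the gathered result back to the device must move rbufsize * comm_size bytes, which is the point of the two memcpy comments below.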
}

if (NULL != rbuf1) {
    mca_coll_accelerator_memcpy(rbuf2, rbuf_dev, rbuf1, MCA_ACCELERATOR_NO_DEVICE_ID, rbufsize,
Times comm_size for the h2d transfer.
    goto exit;
}
if (NULL != rbuf1) {
    mca_coll_accelerator_memcpy(rbuf2, rbuf_dev, rbuf1, MCA_ACCELERATOR_NO_DEVICE_ID, rbufsize,
Times comm_size for the h2d transfer?
add support for MPI_Reduce_scatter Signed-off-by: Edgar Gabriel <[email protected]>
add support for bcast, allgather and alltoall for device buffers using a temporary buffer on the CPU. The maximum msg length for each operation for which to use this approach can be controlled through an mca parameter. Signed-off-by: Edgar Gabriel <[email protected]>
owner file update to indicate that the component is active and ownership is shared between NVidia and AMD. Signed-off-by: Edgar Gabriel <[email protected]>
Force-pushed from d9af881 to 8731f21
Just to make sure I don't miss other items, such as the update of the accelerator_memcpy length after the change in the meaning of the threshold value, I re-ran our validation testsuite and the osu-benchmarks with data validation turned on, and it looks good for the new functions as far as I can tell.
Add support for more functions to the coll/accelerator component. Specifically: MPI_Bcast, MPI_Allgather, MPI_Alltoall, and MPI_Reduce_scatter.
To motivate the work on non-reduction operations: the main benefit is for short messages, since communication from GPU buffers typically has higher latency than communication from CPU buffers. A sketch of the staging pattern follows.
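In outline, the approach is a copy-to-host staging pattern like the following sketch, where run_host_bcast() and device_memcpy() are hypothetical stand-ins for the underlying host collective and the accelerator copy API, not real Open MPI calls:

#include <stdlib.h>

extern int  run_host_bcast(void *buf, size_t size, int root);       /* stand-in: host collective */
extern void device_memcpy(void *dst, const void *src, size_t size); /* stand-in: accelerator copy */

/* Below the threshold, stage the device buffer through a temporary CPU
 * buffer; above it, operate on the device buffer directly. */
static int bcast_via_host(void *dev_buf, size_t size, size_t thresh,
                          int root, int my_rank)
{
    if (size > thresh) {
        return run_host_bcast(dev_buf, size, root);
    }
    char *host_buf = (char *)malloc(size);
    if (NULL == host_buf) return -1;
    if (my_rank == root) {
        device_memcpy(host_buf, dev_buf, size); /* d2h: only root holds the data */
    }
    int rc = run_host_bcast(host_buf, size, root);
    if (my_rank != root) {
        device_memcpy(dev_buf, host_buf, size); /* h2d: deliver the result */
    }
    free(host_buf);
    return rc;
}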
Here is some data to support the PR, gathered on an MI250X node with 16 GPUs:
[Benchmark tables for MPI_Bcast, MPI_Allgather, and MPI_Alltoall are not reproduced here.]
This data demonstrates that there are benefits to the copy-to-host approach, especially for short messages. Results in other settings might vary, but the user has the option to control the desired behavior using the MCA parameter (and later, using #12991, by explicitly specifying for which collective operations this component should be registered).
This data demonstrates that there are benefits of using the copy-to-host approach especially for short messages. Results on different settings might vary, but the user has the option to control the desired behavior using the MCA parameter (and later using #12991 by explicitly specifying which collective operation this component should be registered for.