coll/accelerator add support for more functions #13006

Merged

Conversation

edgargabriel (Member)

Add support for more functions to the coll/accelerator component. Specifically:

  • MPI_Reduce_scatter was missing as the only reduction operation
  • introduce support for Bcast, Allgather, and Alltoall operations. The maximum message size for which to use the copy-through-CPU mechanism can be controlled by an MCA parameter for each function separately.

To motivate the work on the non-reduction operations: the main benefit is for short messages, since communication from GPU buffers typically has higher latency than communication from CPU buffers.
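For readers less familiar with the component, the conceptual pattern behind the copy-through-CPU path looks roughly like the standalone sketch below (illustration only, using plain MPI plus HIP for the device copies; the actual component goes through the OPAL accelerator framework and the underlying coll module instead, and skips copies that a given rank does not need):

#include <stdlib.h>
#include <mpi.h>
#include <hip/hip_runtime.h>

/* Conceptual sketch of the small-message bcast path: stage the device buffer
 * in host memory, run the collective on the host buffer, copy the result back.
 * Not the component code -- just the idea it implements. */
static int bcast_through_host(void *dev_buf, int count, MPI_Datatype dtype,
                              int root, MPI_Comm comm)
{
    int type_size;
    MPI_Type_size(dtype, &type_size);
    size_t nbytes = (size_t)count * (size_t)type_size;

    void *host_buf = malloc(nbytes);
    if (NULL == host_buf) {
        return MPI_ERR_NO_MEM;
    }

    /* device -> host (only the root strictly needs this for bcast) */
    if (hipSuccess != hipMemcpy(host_buf, dev_buf, nbytes, hipMemcpyDeviceToHost)) {
        free(host_buf);
        return MPI_ERR_OTHER;
    }

    /* the underlying collective now operates on CPU memory */
    int rc = MPI_Bcast(host_buf, count, dtype, root, comm);

    /* host -> device (only non-root ranks strictly need this for bcast) */
    if (MPI_SUCCESS == rc &&
        hipSuccess != hipMemcpy(dev_buf, host_buf, nbytes, hipMemcpyHostToDevice)) {
        rc = MPI_ERR_OTHER;
    }

    free(host_buf);
    return rc;
}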

Here is some data to support the PR, gathered on an MI250X node with 16 GPUs:

MPI_Bcast:

# OSU MPI-ROCM Broadcast Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
# procs                   16         16         32       32
# copy to CPU?            no        yes         no      yes  
1                      12.24       2.84      14.86     3.44
2                      17.19       4.18	     21.37     4.64
4                      17.22       4.13	     21.13     4.65
8                      17.19       4.09	     21.08     4.63
16                     17.23       4.11	     21.11     4.63
32                     17.47       4.18	     28.43     4.71
64                     20.02       4.15	     28.03     4.74
128                    27.54       8.37	     45.03     8.05
256                    28.30      12.67	     44.73    13.38
512                    27.65      32.51	     44.65    28.39
1024                   27.86      57.93	     44.92    64.72
2048                   28.06      27.66	    207.97    34.89
4096                   54.40      32.14	    429.97    36.70
8192                   54.64      45.65	    430.32    57.30
16384                  55.76      56.71	    433.16    67.10
32768                  56.96      84.60	    459.17    93.21
65536                  59.84     141.01	    440.84   162.06
131072                 65.34     256.07	    466.58   279.53
262144                 76.69     474.92	    553.10   524.82
524288                 97.60     803.38	    674.06   955.30
1048576               271.92    2030.49	   1069.67  1844.79

MPI_Allgather:

# OSU MPI-ROCM Allgather Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
# procs                   16        16        32         32
# Copy to CPU?            no       yes        no        yes
1                      44.70      5.86     54.62       6.85
2                      45.41      8.68	   54.67       9.67
4                      45.88      8.76	   54.09       9.71
8                      44.98      8.84	   66.06      10.12
16                     59.17      9.18	   83.69      10.70
32                     72.29      9.69	   99.86      11.57
64                     83.31     10.35	  120.70      12.65
128                    95.16     16.67	  148.10      19.50
256                    95.92     28.52	  148.03      33.09
512                    95.41     72.15	  147.25      86.03
1024                  111.54    144.25	  393.00     146.43
2048                  112.29     84.30	  761.04     105.84
4096                  159.91    111.35	  794.44     132.48
8192                  163.18    164.20	  801.66     188.61
16384                 169.69    277.55	  806.56     326.65
32768                 178.70    544.21	  810.47     622.84
65536                 195.38    891.83	 1661.78    1925.92
131072                226.69   1527.13	 1643.53    9896.87
262144                298.12   6354.89	 1667.39   19147.58
524288                450.23  12872.46	 1910.64   40723.25
1048576               764.79  26134.81	 3752.21   81590.10

MPI_Alltoall:

# OSU MPI-ROCM All-to-All Personalized Exchange Latency Test v7.5
# Datatype: MPI_CHAR.
# Size       Avg Latency(us)
# procs                   16        16        32        32
# copy to CPU?            no       yes        no       yes
1                     102.40     62.82    109.26     39.49
2                     119.86     65.76	  131.27     41.89
4                      70.86     27.01	  135.16     41.77
8                      75.00     26.99	  147.80     41.83
16                     88.56     27.45	  175.73     42.33
32                    112.92     27.56	  243.95     42.88
64                    184.78     27.92	  409.48     44.33
128                   363.13     34.33	  611.12     53.13
256                   484.21     46.77	  948.13     70.85
512                   205.33    135.85	  489.44    235.73
1024                  184.16    196.80	  482.84    295.78
2048                  182.26    154.94	  487.60    296.06
4096                  187.28    163.12	  538.29    311.60
8192                  189.87    184.98	  539.06    346.44
16384                 190.54    232.64	  544.14    431.32
32768                 191.31    336.01	  549.42    533.02
65536                 197.93    492.32	  557.00   1050.14
131072                199.14    889.92	  583.84  11395.43
262144                244.68   7316.06	  843.04  21582.68
524288                389.49  13314.65	 1481.85  40945.10
1048576               697.91  24498.35	 2753.87  80627.56

This data demonstrates that there are benefits to using the copy-to-host approach, especially for short messages. Results in other settings might vary, but the user has the option to control the desired behavior using the MCA parameters (and, once #12991 is available, by explicitly specifying which collective operations this component should be registered for).
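For example, assuming the standard MCA naming convention (framework_component_parameter) and the parameter names visible in the diff hunks below (alltoall_thresh and allgather_thresh, presumably with a matching bcast_thresh), the cut-over point can be adjusted at run time, e.g. mpirun --mca coll_accelerator_alltoall_thresh 4096 ... to restrict the copy-to-host path for alltoall to messages of at most 4 KiB per process.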

@edgargabriel force-pushed the topic/coll-accelerator-new-funcs branch from ae0525b to 94a56ab on December 30, 2024 17:06
@edgargabriel force-pushed the topic/coll-accelerator-new-funcs branch from 8658265 to 5e0bcd7 on December 30, 2024 17:22
/* Using sbufsize here on purpose to ensure symmetric decision for handling of GPU vs
   CPU buffers. The two buffer sizes are expected to be the same for pre-defined datatypes,
   but could vary due to layout issues/gaps for derived datatypes */
if ((rc > 0) && ((sbufsize * comm_size) <= (size_t)mca_coll_accelerator_alltoall_thresh)) {
Contributor

Similar problem here: at higher process counts we would tend to use the device memory for many small sends/recvs. I'd expect the overhead of small transfers to become more pronounced at higher process counts. Why do we need to factor in the process count at all?

Member Author

I have honestly been going back and forth on whether to use the parameter as a per-process parameter or as a parameter that limits the amount of temporary memory allocated on the CPU side. I am happy to use this as a per-process parameter (i.e. without the communicator size playing a role). My original (maybe incorrect) line of thinking had to do with the v-versions of these operations (which this commit doesn't tackle yet).

Member Author

I have changed the parameter to be a per-process threshold (vs. the overall buffer size).
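In terms of the alltoall check quoted above, the change amounts to dropping the communicator size from the comparison; a before/after sketch (variable names as in the quoted snippets, the exact final code may differ):

/* before: threshold compared against the aggregate size across the communicator */
if ((rc > 0) && ((sbufsize * comm_size) <= (size_t)mca_coll_accelerator_alltoall_thresh)) {
    /* ... copy device buffers to the CPU and run the host collective ... */
}

/* after: threshold compared against the per-process send size */
if ((rc > 0) && (sbufsize <= (size_t)mca_coll_accelerator_alltoall_thresh)) {
    /* ... copy device buffers to the CPU and run the host collective ... */
}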

mca_coll_accelerator_allgather_thresh = 65536;
(void) mca_base_component_var_register(&mca_coll_accelerator_component.super.collm_version,
"allgather_thresh",
"max. overall msg length for which to copy accelerator buffer to CPU for allgather operation",
Contributor

I'm not sure that 'overall msg length' communicates well that the threshold is compared to send_size*comm_size. See my comment above for my concerns about using the process count.

Member Author

I removed the word 'overall' along with the adjustment of the meaning of the parameter. Thank you for the review!
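For context, the registration quoted above is cut off by the diff view; a complete registration of such an integer threshold typically looks roughly like the following (the type, flags, info level, and scope shown here are assumptions based on common Open MPI practice, not taken from this PR):

mca_coll_accelerator_allgather_thresh = 65536;
(void) mca_base_component_var_register(&mca_coll_accelerator_component.super.collm_version,
                                       "allgather_thresh",
                                       "max. msg length for which to copy accelerator buffer to CPU for allgather operation",
                                       MCA_BASE_VAR_TYPE_INT, NULL, 0, 0,
                                       OPAL_INFO_LVL_9, MCA_BASE_VAR_SCOPE_READONLY,
                                       &mca_coll_accelerator_allgather_thresh);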

@edgargabriel force-pushed the topic/coll-accelerator-new-funcs branch from 5e0bcd7 to 167b8b2 on January 2, 2025 15:03
@edgargabriel force-pushed the topic/coll-accelerator-new-funcs branch from 167b8b2 to d9af881 on January 27, 2025 19:12
@bosilca (Member) left a comment

This is a needed update. However, I have two issues with this PR:

  • conceptually, copying over the span of the datatype to move data back and forth between the CPU and GPU means we overwrite all the gaps. This leads to unexpected behavior if a GPU kernel changes the values in the gaps.
  • we need to be careful to move the correct amount of data, which for some operations must factor in the size of the communicator.

I don't have a better idea on how to address these. Either we document it clearly, or we use the pack functions and only support homogeneous environments.
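To make the first point concrete: for a derived datatype with gaps, the packed size and the memory footprint differ, so staging the raw span of the buffer and copying it back also rewrites the gap bytes. A standalone toy illustration (not component code):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* a vector type selecting every other int: the skipped ints are gaps */
    MPI_Datatype every_other;
    MPI_Type_vector(4 /*count*/, 1 /*blocklen*/, 2 /*stride*/, MPI_INT, &every_other);
    MPI_Type_commit(&every_other);

    int size;
    MPI_Aint lb, extent;
    MPI_Type_size(every_other, &size);
    MPI_Type_get_extent(every_other, &lb, &extent);

    /* payload is 16 bytes, but the memory span is 28 bytes: a flat copy over
     * the span touches 12 bytes of gaps, and copying it back after the host
     * collective silently undoes anything a GPU kernel wrote there meanwhile */
    printf("payload %d bytes, span %ld bytes, gaps %ld bytes\n",
           size, (long)extent, (long)(extent - size));

    MPI_Type_free(&every_other);
    MPI_Finalize();
    return 0;
}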

}
if ((MPI_IN_PLACE != sbuf) && (rc > 0) &&
    (sbufsize <= (size_t)mca_coll_accelerator_allgather_thresh)) {
    sbuf1 = (char*)malloc(sbufsize * comm_size);
Member

We are in an allgather; the same send buffer goes to all ranks. Why allocate sbufsize * comm_size?

}

if (NULL != rbuf1) {
    mca_coll_accelerator_memcpy(rbuf2, rbuf_dev, rbuf1, MCA_ACCELERATOR_NO_DEVICE_ID, rbufsize,
Member

Times comm_size for the h2d transfer.

    goto exit;
}
if (NULL != rbuf1) {
    mca_coll_accelerator_memcpy(rbuf2, rbuf_dev, rbuf1, MCA_ACCELERATOR_NO_DEVICE_ID, rbufsize,
Member

Times comm_size for the h2d transfer?
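What the two comments above appear to ask for, assuming rbufsize is the per-process receive size (the allgather receive count is per process), is roughly the following (remaining arguments elided, as in the quoted snippet):

/* the gathered result occupies comm_size * rbufsize bytes on the host, so the
 * host-to-device copy back to the user buffer has to cover all of it */
mca_coll_accelerator_memcpy(rbuf2, rbuf_dev, rbuf1, MCA_ACCELERATOR_NO_DEVICE_ID,
                            rbufsize * comm_size, ...);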

edgargabriel and others added 3 commits January 28, 2025 14:14
add support for MPI_Reduce_scatter

Signed-off-by: Edgar Gabriel <[email protected]>
add support for bcast, allgather and alltoall for device buffers using
a temporary buffer on the CPU. The maximum msg length for each operation
for which to use this approach can be controlled through an mca
parameter.

Signed-off-by: Edgar Gabriel <[email protected]>
owner file update to indicate that the component is active and ownership
is shared between NVidia and AMD.

Signed-off-by: Edgar Gabriel <[email protected]>
@edgargabriel force-pushed the topic/coll-accelerator-new-funcs branch from d9af881 to 8731f21 on January 28, 2025 14:28
@edgargabriel (Member Author)

Just to make sure I don't miss other items, such as the update of the accelerator_memcpy length after the change in the meaning of the threshold value, I re-ran our validation test suite and the osu-benchmarks with data validation turned on, and it looks good for the new functions as far as I can tell.

@edgargabriel merged commit c5e02ab into open-mpi:main on Jan 29, 2025
15 checks passed