Question
Unexpectedly low performance with osu_alltoall and ROCm GPU buffers on a Frontier-like system.
I'm seeing the performance below with the following environment variables set:
export FI_CXI_RDZV_THRESHOLD=16384
export FI_CXI_RDZV_EAGER_SIZE=2048
export FI_CXI_OFLOW_BUF_SIZE=12582912
export FI_CXI_OFLOW_BUF_COUNT=3
export FI_CXI_DEFAULT_CQ_SIZE=131072
export FI_CXI_REQ_BUF_MAX_CACHED=0
export FI_CXI_REQ_BUF_MIN_POSTED=6
export FI_CXI_REQ_BUF_SIZE=12582912
export FI_CXI_RX_MATCH_MODE=software
export FI_MR_CACHE_MAX_SIZE=-1
export FI_MR_CACHE_MAX_COUNT=524288
mpirun -x FI_OFI_RXM_ENABLE_SHM -x FI_LOG_LEVEL -x FI_CXI_RDZV_THRESHOLD -x FI_CXI_RDZV_EAGER_SIZE -x FI_CXI_OFLOW_BUF_SIZE -x FI_CXI_OFLOW_BUF_COUNT -x FI_CXI_DEFAULT_CQ_SIZE -x FI_CXI_REQ_BUF_MAX_CACHED -x FI_CXI_REQ_BUF_MIN_POSTED -x FI_CXI_REQ_BUF_SIZE -x FI_CXI_RX_MATCH_MODE -x FI_MR_CACHE_MAX_SIZE -x FI_MR_CACHE_MAX_COUNT -x FI_LNX_SRQ_SUPPORT -x FI_SHM_USE_XPMEM -x LD_LIBRARY_PATH --mca mtl_ofi_av table --display mapping,bindings --mca btl '^tcp,ofi,vader,openib' --mca pml '^ucx' --mca mtl ofi --mca opal_common_ofi_provider_include cxi --bind-to core --map-by ppr:1:l3 -np 16 /sw/crusher/ums/ompix/DEVELOP/cce/13.0.0/install/osu-micro-benchmarks-7.5-1//build-ompi/_install/libexec/osu-micro-benchmarks/mpi/collective/osu_alltoall -d rocm D D
# Size  Avg Latency(us)
1 69.69
2 69.73
4 686.55
8 684.56
16 685.98
32 687.88
64 690.14
128 696.51
256 700.48
512 60.52
1024 61.01
2048 62.00
4096 62.81
8192 5429.99
16384 280.84
32768 100.29
65536 152.72
131072 258.75
262144 509.10
524288 996.54
1048576 2052.14
I'm using the main branches of Open MPI and libfabric. Is there an explanation for the lower-than-expected performance numbers?
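For reference, below is a minimal sketch of what the device-buffer (D D) run above boils down to: an MPI_Alltoall timed over hipMalloc'd buffers. This is only an assumption of the essential pattern, not the OSU source; the real benchmark adds warmup phases, per-size iteration counts, and validation. Running something like this can help confirm whether the behavior is independent of the benchmark harness.

```c
/* Minimal sketch of the "-d rocm D D" case: MPI_Alltoall over device
 * buffers allocated with hipMalloc. Illustrative only, not the OSU code.
 * Build with, e.g., hipcc plus the MPI wrappers, or mpicc pointed at the
 * ROCm headers and HIP runtime library (details depend on the system). */
#include <mpi.h>
#include <hip/hip_runtime_api.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, nranks;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    const int iters = 100;
    for (size_t msg = 1; msg <= (1 << 20); msg *= 2) {
        char *sbuf, *rbuf;
        /* Device (ROCm) buffers; this is the "D D" case. */
        hipMalloc((void **)&sbuf, msg * nranks);
        hipMalloc((void **)&rbuf, msg * nranks);

        /* One untimed warmup call. */
        MPI_Alltoall(sbuf, (int)msg, MPI_CHAR, rbuf, (int)msg, MPI_CHAR,
                     MPI_COMM_WORLD);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < iters; i++)
            MPI_Alltoall(sbuf, (int)msg, MPI_CHAR, rbuf, (int)msg, MPI_CHAR,
                         MPI_COMM_WORLD);
        double t1 = MPI_Wtime();

        /* Average the per-rank latency and report it on rank 0. */
        double local_us = (t1 - t0) * 1e6 / iters, sum_us;
        MPI_Reduce(&local_us, &sum_us, 1, MPI_DOUBLE, MPI_SUM, 0,
                   MPI_COMM_WORLD);
        if (rank == 0)
            printf("%-10zu %.2f\n", msg, sum_us / nranks);

        hipFree(sbuf);
        hipFree(rbuf);
    }

    MPI_Finalize();
    return 0;
}
```

For the host-buffer comparison below, the equivalent change would be allocating sbuf/rbuf with plain malloc instead of hipMalloc (assuming that is what "system buffers" refers to here).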
For comparison, below is the performance when using system (host) buffers; they are noticeably faster at small and medium message sizes.
# Size  Avg Latency(us)
1 40.33
2 40.22
4 38.57
8 38.85
16 39.46
32 43.24
64 43.58
128 47.72
256 45.92
512 41.34
1024 41.57
2048 43.83
4096 46.32
8192 48.40
16384 67.62
32768 93.77
65536 150.25
131072 267.46
262144 583.49
524288 1145.67
1048576 2522.32
This run also uses the main branches of Open MPI and libfabric. @iziemba, is there an explanation for the lower-than-expected device-buffer numbers?