OS version (e.g Linux distro) + CPU architecture (x86_64/aarch64/ppc64le/...)
cat /etc/issue or cat /etc/redhat-release + uname -a
$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.2 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.2"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.2 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
HOME_URL="https://rockylinux.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2032-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
ROCKY_SUPPORT_PRODUCT_VERSION="9.2"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.2"
$ uname -a
Linux xeonmax 5.14.0-284.11.1.el9_2.x86_64 #1 SMP PREEMPT_DYNAMIC Tue May 9 17:09:15 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
For RDMA/IB/RoCE related issues:
Driver version:
rpm -q rdma-core or rpm -q libibverbs
or: MLNX_OFED version ofed_info -s
$ ofed_info -s
MLNX_OFED_LINUX-23.04-1.1.3.0:
HW information from ibstat or ibv_devinfo -vv command
$ ibstat
CA 'mlx5_0'
CA type: MT4129
Number of ports: 1
Firmware version: 28.36.1010
Hardware version: 0
Node GUID: 0xb83fd20300fa62bc
System image GUID: 0xb83fd20300fa62bc
Port 1:
State: Active
Physical state: LinkUp
Rate: 400
Base lid: 80
LMC: 0
SM lid: 69
Capability mask: 0xa651e848
Port GUID: 0xb83fd20300fa62bc
Link layer: InfiniBand
Additional information (depending on the issue)
OpenMPI version
4.1.5
Output of ucx_info -d to show transports and devices recognized by UCX
[1692032992.421228] [xm066:789921:1] ucp_ep.c:1504 UCX DIAG ep 0x7ffff4be46c0: error 'Connection reset by remote peer' on rc_mlx5/mlx5_0:1 will not be handled since no error callback is installed
[1692032992.422317] [xm016:819369:1] ucp_ep.c:1504 UCX DIAG ep 0x7ffff4be4340: error 'Request canceled' on rc_mlx5/mlx5_0:1 will not be handled since no error callback is installed
[xm016:819369] pml_ucx.c:865 Error: ucx send failed: Request canceled
MPI_ERR_OTHER: known error not in list
MADNESS: fatal error: caught an MPI exception
[xm066:789921:1:790043] ib_mlx5_log.c:171 Remote access on mlx5_0:1/IB (synd 0x13 vend 0x88 hw_synd 0/0)
[xm066:789921:1:790043] ib_mlx5_log.c:171 RC QP 0xf367 wqe[20413]: RDMA_READ s-- [rva 0x7fff90a2f340 rkey 0xdbff7f00] [va 0x7fffb5a15040 len 4032 lkey 0x2ac3ba] [rqpn 0x1b38c dlid=79 sl=0 port=1 src_path_bits=0]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 30 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
==== backtrace (tid: 790043) ====
0 0x0000000000027373 uct_ib_mlx5_completion_with_err() ???:0
1 0x000000000003f0b4 uct_rc_mlx5_iface_check_rx_completion() ???:0
2 0x000000000002771d uct_ib_mlx5_check_completion() ???:0
3 0x000000000003cd37 uct_rc_mlx5_iface_check_rx_completion() ???:0
4 0x000000000004665a ucp_worker_progress() ???:0
5 0x000000000003a8fc opal_progress() ???:0
6 0x0000000000050c45 ompi_request_default_test_some() ???:0
7 0x000000000008f4f3 PMPI_Testsome() ???:0
8 0x0000000000fa3bdc madness::RMI::RmiTask::process_some() worldrmi.cc:0
9 0x0000000000faa050 madness::RMI::RmiTask::run() ???:0
10 0x000000000046157d madness::ThreadBase::main() ???:0
11 0x000000000009f802 start_thread() ???:0
12 0x000000000003f450 __clone3() :0
Describe the bug
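While running under Open MPI, UCX reports errors on the rc_mlx5 transport that "will not be handled since no error callback is installed", and the job then aborts. NB: I am reporting this issue but am not the creator/author.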
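To see at a glance which hosts reported which endpoint errors, the UCX DIAG lines from the log in this report can be summarized with a short script. This is only a minimal sketch: the regular expression is inferred from the two DIAG lines quoted here rather than from any documented UCX log format, and the names `DIAG_RE` and `parse_ucx_diag` are hypothetical helpers, not part of UCX or Open MPI.

```python
import re

# Pattern inferred from the DIAG lines quoted in this report.
# Assumption: this is not a documented UCX log format and may not
# match other UCX versions or log levels.
DIAG_RE = re.compile(
    r"\[(?P<ts>\d+\.\d+)\] "
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):\d+\]\s+"
    r"(?P<src>\S+:\d+)\s+UCX\s+DIAG\s+"
    r"ep (?P<ep>0x[0-9a-f]+): "
    r"error '(?P<error>[^']+)' on (?P<tl>[^/\s]+)/(?P<dev>\S+)"
)


def parse_ucx_diag(lines):
    """Return one dict per line that looks like a UCX DIAG endpoint error."""
    records = []
    for line in lines:
        match = DIAG_RE.search(line)
        if match:
            records.append(match.groupdict())
    return records


# Log lines copied from this report; the third one is not a DIAG line
# and is expected to be skipped.
LOG = [
    "[1692032992.421228] [xm066:789921:1] ucp_ep.c:1504 UCX DIAG ep 0x7ffff4be46c0: "
    "error 'Connection reset by remote peer' on rc_mlx5/mlx5_0:1 "
    "will not be handled since no error callback is installed",
    "[1692032992.422317] [xm016:819369:1] ucp_ep.c:1504 UCX DIAG ep 0x7ffff4be4340: "
    "error 'Request canceled' on rc_mlx5/mlx5_0:1 "
    "will not be handled since no error callback is installed",
    "[xm016:819369] pml_ucx.c:865 Error: ucx send failed: Request canceled",
]

for rec in parse_ucx_diag(LOG):
    print(rec["host"], rec["pid"], rec["tl"], rec["dev"], rec["error"])
```

Run against a full job log, the same function would list every host, transport, and device that hit an unhandled endpoint error, which may help narrow down which peer failed first.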