Problem with intra-node communication  #6518

@mredenti

Version of Singularity:

$ singularity --version
SingularityPRO version 3.11-5.el8

Actual behavior

When running the following Slurm batch script:

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --ntasks-per-node=4
#SBATCH --nodes=2

module load openmpi/4.1.6--nvhpc--24.3
mpirun -np 8 singularity exec fall3d_opeacc.sif Fall3d.x 

I get the following error:

[lrdn2911:1723502:0:1723502]      cma_ep.c:88   process_vm_readv(pid=1723503 {0x14745e5ac800,61928}-->{0x150dba573e00,61928}) returned -1: Bad address
[lrdn2912:2435814:0:2435814]      cma_ep.c:88   process_vm_readv(pid=2435813 {0x14d1065ac800,61928}-->{0x15453a573e00,61928}) returned -1: Bad address
[lrdn2911:1723498:0:1723498]      cma_ep.c:88   process_vm_readv(pid=1723500 {0x1505545ac800,61928}-->{0x1490a2573e00,61928}) returned -1: Bad address
[lrdn2912:2435816:0:2435816]      cma_ep.c:88   process_vm_readv(pid=2435815 {0x149e885ac800,61928}-->{0x154f4a573e00,61928}) returned -1: Bad address
==== backtrace (tid:1723498) ====
 0 0x0000000000003803 uct_cma_ep_tx_error()  /build-result/src/hpcx-v2.20-gcc-inbox-redhat8-cuda12-x86_64/ucx-39c8f9b/src/uct/sm/scopy/cma/cma_ep.c:85
...

CMA (Cross-Memory Attach) is enabled inside UCX/Open MPI, but the process_vm_readv()/process_vm_writev() system calls it relies on for zero-copy shared-memory transfers between processes on the same node fail with "Bad address".
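
One way to check whether CMA is really the failing path is to exclude that transport and rerun. The snippet below is an untested sketch, not a confirmed fix. It assumes that the UCX build in use honours the `UCX_TLS` exclusion syntax (`^name`) and that Singularity passes the host environment into the container. The `mpirun` line is left commented out and simply repeats the command from the batch script with `-x UCX_TLS` added.

```shell
#!/bin/bash
# Sketch only: assumes UCX is the transport layer that Open MPI selects here.

# Ask UCX to use every available transport except CMA, so intra-node
# transfers fall back to another shared-memory mechanism.
export UCX_TLS='^cma'

# Print the setting so it shows up in the job log.
echo "UCX_TLS=${UCX_TLS}"

# Rerun the original command, exporting UCX_TLS to all ranks:
# mpirun -x UCX_TLS -np 8 singularity exec fall3d_opeacc.sif Fall3d.x
```

If the run completes with CMA excluded, that would point to the process_vm_readv() path rather than to the application itself.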
