Skip to content

fix/comm-destroy-communicator-leak#15

Open
GordonYang1 wants to merge 1 commit into
InfiniTensor:masterfrom
GordonYang1:fix/delete-communicator-on-comm-destroy
Open

fix/comm-destroy-communicator-leak#15
GordonYang1 wants to merge 1 commit into
InfiniTensor:masterfrom
GordonYang1:fix/delete-communicator-on-comm-destroy

Conversation

@GordonYang1
Copy link
Copy Markdown
Collaborator

@GordonYang1 GordonYang1 commented May 19, 2026

Summary

Fixes the Communicator lifetime leak noted in the "Known Issues & Future Work" section of #10.

CommInitAll::Execute allocates the outer Communicator via new, but the previous CommDestroy only tore down the backend instance and never deleted the outer object. Every infiniCommInitAll / infiniCommDestroy pair leaked one Communicator.

Changes

  • src/base/comm_destroy.h:
    • Narrow Execute to a concrete void *comm_handle signature, mirroring CommInitAll::Execute's void **comm_handle.
    • Add a null-handle check, consistent with CommInitAll.
    • Delete the Communicator only after the backend Apply returns kSuccess.

Backend implementations are intentionally untouched: ownership of the outer Communicator now lives entirely in the base layer, symmetric with the allocation in CommInitAll. Future backends such as NCCL, MCCL, HCCL, and others get correct lifetime handling without each backend having to remember to delete the outer object.

Out of scope

Tracked separately to keep this PR minimal:

  • RAII-fying OmpiInstance::Destroy() into ~OmpiInstance().
  • FinalizeImpl not calling MPI_Finalize, as noted in DEVELOPER_NOTES_OMPI.md.

Test environment

Validated on a heterogeneous 2-node cluster with container-to-container direct connection over RDMA:

Node Hardware Container IP
A NVIDIA A100 iccl-nvidia 192.168.163.40
B MetaX C550 iccl-metax-clean 192.168.162.49
  • OS / container: --network host --ipc host --privileged, /dev/infiniband mounted on both sides.
  • OpenMPI: 4.1.6 (/opt/openmpi-4.1.6), built with --with-ucx=/opt/ucx-1.17.0.
  • UCX: 1.17.0 (/opt/ucx-1.17.0), built with --with-verbs --with-rdmacm.
  • CUDA Toolkit on Node A; MACA SDK at /opt/maca on Node B.
  • Inter-container SSH direct connection on port 22222, with no docker exec wrapper required for mpirun or icclrun --build.
  • UCX transport configuration:
    • UCX_NET_DEVICES=mlx5_0:1
    • UCX_TLS=rc,rc_verbs,self,sm
    • UCX_RNDV_SCHEME=put_zcopy
  • World size: 16 ranks, 8 NVIDIA + 8 MetaX.

Build configuration via icclrun --build:

  • Node A: -DWITH_NVIDIA=ON -DWITH_OMPI=ON -DWITH_NCCL=OFF
  • Node B: -DWITH_METAX=ON -DWITH_OMPI=ON -DWITH_NCCL=OFF

Logs & Screenshots

all_reduce test (MetaX-NVIDIA heterogeneous)
all_reduce.log

all_gather test (MetaX-NVIDIA heterogeneous)
all_gather.log

reduce_scatter test (MetaX-NVIDIA heterogeneous)
reduce_scatter.log

broadcast test (MetaX-NVIDIA heterogeneous)
broadcast.log

all_to_all test (MetaX-NVIDIA heterogeneous)
all_to_all.log

Copy link
Copy Markdown
Collaborator

@Ziminli Ziminli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

同时麻烦 rebase 到最新,补充新增的示例程序运行日志文件。

Comment thread src/base/comm_destroy.h Outdated
Comment on lines +27 to +28
// Pair with the `new` in `CommInitAll`.
delete static_cast<Communicator *>(comm_handle);
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里需要先做 status 的检查,如果是 kSuccess 的话再做 delete 操作,否则直接然后 status 即可。

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已更改

Comment thread src/base/comm_destroy.h Outdated
} // namespace infini::ccl

#endif // INFINI_CCL_BASE_COMM_DESTROY_H_
#endif // INFINI_CCL_BASE_COMM_DESTROY_H_ No newline at end of file
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

文件末尾需要空一行。

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

已更改

@GordonYang1 GordonYang1 force-pushed the fix/delete-communicator-on-comm-destroy branch 2 times, most recently from f9e299e to a768919 Compare May 20, 2026 02:42
@GordonYang1 GordonYang1 force-pushed the fix/delete-communicator-on-comm-destroy branch from a768919 to 86ef0a1 Compare May 20, 2026 03:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants