fix/comm-destroy-communicator-leak #15
Open
GordonYang1 wants to merge 1 commit into
Ziminli (Collaborator) requested changes on May 20, 2026:
Also, please rebase onto the latest version and add the run-log files for the newly added example programs.
Comment on lines +27 to +28:
    // Pair with the `new` in `CommInitAll`.
    delete static_cast<Communicator *>(comm_handle);
The status needs to be checked first here: perform the delete only if it is kSuccess; otherwise just return the status.
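A minimal sketch of the ordering requested in the comment above, using stand-in types. `Status`, `Communicator`, `BackendDestroy`, `CommInitAllExecute`, `CommDestroyExecute`, and the `g_live_communicators` counter are hypothetical names for illustration, not the project's actual declarations. The sketch runs the backend teardown first and deletes the outer object only when that returns `kSuccess`; otherwise it returns the status unchanged.

```cpp
#include <cstddef>

// Hypothetical stand-ins; the real types live in the infini::ccl sources.
enum class Status { kSuccess, kError };

// Counts outstanding Communicator objects so a leak is observable.
std::size_t g_live_communicators = 0;

struct Communicator {
  bool backend_ok = true;  // lets the sketch simulate a failing backend
  Communicator() { ++g_live_communicators; }
  ~Communicator() { --g_live_communicators; }
};

// Stand-in for the backend teardown that the base layer dispatches to.
Status BackendDestroy(Communicator *comm) {
  return comm->backend_ok ? Status::kSuccess : Status::kError;
}

// Allocation side: the `new` here pairs with the `delete` below.
Status CommInitAllExecute(void **comm_handle) {
  *comm_handle = new Communicator();
  return Status::kSuccess;
}

// Destruction side: check the status first, delete only on success.
Status CommDestroyExecute(void *comm_handle) {
  auto *comm = static_cast<Communicator *>(comm_handle);
  Status status = BackendDestroy(comm);
  if (status != Status::kSuccess) {
    return status;  // outer object is left alive; ownership stays with the caller
  }
  delete comm;  // pairs with the `new` in CommInitAllExecute
  return status;
}
```

With this ordering, a failed teardown leaves the handle valid, so the caller still owns the `Communicator` and a later successful destroy deletes it exactly once.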
    } // namespace infini::ccl

    #endif // INFINI_CCL_BASE_COMM_DESTROY_H_
    #endif // INFINI_CCL_BASE_COMM_DESTROY_H_
    \ No newline at end of file
f9e299e to a768919
a768919 to 86ef0a1
Summary
Fixes the `Communicator` lifetime leak noted in the "Known Issues & Future Work" section of #10. `CommInitAll::Execute` allocates the outer `Communicator` via `new`, but the previous `CommDestroy` only tore down the backend instance and never deleted the outer object. Every `infiniCommInitAll`/`infiniCommDestroy` pair leaked one `Communicator`.
Changes
`src/base/comm_destroy.h`:
- … `Execute` to a concrete `void *comm_handle` signature, mirroring `CommInitAll::Execute`'s `void **comm_handle`.
- … `CommInitAll`.
- … `Communicator` only after the backend `Apply` returns `kSuccess`.
Backend implementations are intentionally untouched: ownership of the outer `Communicator` now lives entirely in the base layer, symmetric with the allocation in `CommInitAll`. Future backends such as NCCL, MCCL, and HCCL get correct lifetime handling without each backend having to remember to `delete` the outer object.
Out of scope
Tracked separately to keep this PR minimal:
- … `OmpiInstance::Destroy()` into `~OmpiInstance()`.
- … `FinalizeImpl` not calling `MPI_Finalize`, as noted in `DEVELOPER_NOTES_OMPI.md`.
Test environment
Validated on a heterogeneous 2-node cluster with container-to-container direct connection over RDMA:
- `iccl-nvidia`, `192.168.163.40`
- `iccl-metax-clean`, `192.168.162.49`
- … `--network host --ipc host --privileged`, `/dev/infiniband` mounted on both sides.
- … (`/opt/openmpi-4.1.6`), built with `--with-ucx=/opt/ucx-1.17.0`.
- … (`/opt/ucx-1.17.0`), built with `--with-verbs --with-rdmacm`.
- … `/opt/maca` on Node B.
- … `22222`, with no `docker exec` wrapper required for `mpirun` or `icclrun --build`.
- … `UCX_NET_DEVICES=mlx5_0:1`, `UCX_TLS=rc,rc_verbs,self,sm`, `UCX_RNDV_SCHEME=put_zcopy`
Build configuration via `icclrun --build`:
- `-DWITH_NVIDIA=ON -DWITH_OMPI=ON -DWITH_NCCL=OFF`
- `-DWITH_METAX=ON -DWITH_OMPI=ON -DWITH_NCCL=OFF`
Logs & Screenshots
all_reduce test (MetaX-NVIDIA heterogeneous)
all_reduce.log
all_gather test (MetaX-NVIDIA heterogeneous)
all_gather.log
reduce_scatter test (MetaX-NVIDIA heterogeneous)
reduce_scatter.log
broadcast test (MetaX-NVIDIA heterogeneous)
broadcast.log
all_to_all test (MetaX-NVIDIA heterogeneous)
all_to_all.log