fix some memleak #13306

Open
wants to merge 2 commits into
base: main

@wjjahah wjjahah commented Jun 13, 2025

The avx_component module is retained twice: once by avx_component_op_query and again by ompi_op_base_op_select;
mca_base_var_register calls strdup but does not free the old string;
opal_patcher_base_framework is never closed;
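
For readers unfamiliar with the second item, here is a minimal sketch of the general strdup leak pattern and its fix. It is not Open MPI code: `demo_var_t`, `demo_var_set`, and `demo_var_clear` are hypothetical names invented for illustration, and the sketch assumes only that a setter stores a private copy of a string each time it is called.

```c
#define _POSIX_C_SOURCE 200809L  /* for strdup() under strict C modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a registered string variable; not an Open MPI type. */
typedef struct {
    char *value;   /* owned copy, allocated with strdup() */
} demo_var_t;

/* Store a private copy of `src`, releasing any copy stored earlier.
 * Returns 0 on success, -1 if the allocation fails. */
static int demo_var_set(demo_var_t *var, const char *src)
{
    char *copy = NULL;

    if (NULL != src) {
        copy = strdup(src);
        if (NULL == copy) {
            return -1;
        }
    }
    free(var->value);   /* without this call the previous copy would leak */
    var->value = copy;
    return 0;
}

/* Release the stored copy at teardown. */
static void demo_var_clear(demo_var_t *var)
{
    free(var->value);
    var->value = NULL;
}
```

Calling `demo_var_set` twice on the same variable leaves exactly one live allocation, which `demo_var_clear` then frees. Whether the actual fix in this PR takes the same shape is not shown in this conversation.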

Signed-off-by: wjjahah <[email protected]>
opal_common_ucx opens opal_memory_base_framework but never closes it;

Signed-off-by: wjjahah <[email protected]>
@wjjahah
Author

wjjahah commented Jun 17, 2025

Hi @bosilca , this PR is waiting for workflow approval. Could you please approve it when you have a moment?

@jsquyres
Member

@bosilca Looks like you queued up the NVIDIA job last night, but it sat in queue for 10+ hours, so I tried to cancel it. But now it seems stuck. Can you check what's happening on the NVIDIA side?

@xbw22109
Contributor

Sorry, not directly related to the topic of this PR: Why are there queues at Nvidia? I'm having the same problem (13261), and I think 13211 is having the same problem.

@jsquyres
Member

jsquyres commented Jun 21, 2025

@xbw22109 Sorry about this. We actually run CI at a variety of different locations, including NVIDIA, and not just on GitHub cloud resources. That being said, sometimes something goes wrong on these CI resources, and sometimes they need some care and feeding. That happened with NVIDIA's CI resources this past week; they fixed the issue, but we probably missed some of the pending PRs that falsely failed. I've re-queued the jobs on #13261 and #13211. Sorry for the hassle! 😦

FYI: know that you can also re-trigger CI by pushing a new commit to a PR (even if you git commit --amend and make a meaningless change to the PR).

@xbw22109
Contributor

@jsquyres Thanks for the explanation!

@wjjahah wjjahah requested a review from bosilca June 23, 2025 06:43
Member

@bosilca bosilca left a comment

I need to test the UCX change to convince myself it is safe, but I can't right now.

@bwbarrett, @hppritcha can you chime in regarding the patcher change? It won't impact UCX, as UCX does its own memory tracking, but it might impact others that I'm not aware of.

    }
    if( NULL != module->opm_3buff_fns[i] ) {
        OBJ_RETAIN(module);
    }
}
Member

This change is correct: the module is already retained once for each function used (during *_enable).
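
To illustrate why a redundant retain leaks, here is a minimal reference-counting sketch. It is a simplified model and not Open MPI's object system: `demo_module_t`, `demo_retain`, `demo_select`, and `demo_teardown` are hypothetical names, and the lifecycle (one creation reference plus one reference per function slot in use) is an assumption made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NUM_FNS 4

typedef void (*demo_fn_t)(void);

/* Hypothetical module with a reference count and a table of function slots. */
typedef struct {
    int refcount;
    demo_fn_t fns[DEMO_NUM_FNS];
} demo_module_t;

static void demo_noop(void) {}

static void demo_retain(demo_module_t *m)  { m->refcount++; }
static void demo_release(demo_module_t *m) { m->refcount--; }

/* Create the module with one creation reference and `nfns` populated slots. */
static void demo_module_init(demo_module_t *m, int nfns)
{
    m->refcount = 1;
    for (int i = 0; i < DEMO_NUM_FNS; i++) {
        m->fns[i] = (i < nfns) ? demo_noop : NULL;
    }
}

/* Selection step: take one reference per function slot actually used.
 * Returns the number of references taken. */
static int demo_select(demo_module_t *m)
{
    int taken = 0;
    for (int i = 0; i < DEMO_NUM_FNS; i++) {
        if (NULL != m->fns[i]) {
            demo_retain(m);
            taken++;
        }
    }
    return taken;
}

/* Teardown: drop the per-function references and the creation reference.
 * Returns the remaining count; anything above zero is a leaked reference. */
static int demo_teardown(demo_module_t *m, int taken)
{
    while (taken-- > 0) {
        demo_release(m);
    }
    demo_release(m);
    return m->refcount;
}
```

In this model, a module with two populated slots ends at a count of zero after teardown. If some other step calls `demo_retain` once more without a matching release, teardown ends at one and the module is never freed.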

@@ -90,7 +90,7 @@ static int opal_patcher_base_close(void)
        return opal_patcher->patch_fini();
    }

    return OPAL_SUCCESS;
Member

I am really very skeptical about this change. While logically it sounds reasonable, the patcher is a very special module: it changes the way memory allocations and deallocations are tracked, and I'm definitely not sure we can unload it. I personally would be against this change without proper testing, but I defer to others for additional insights.

Author

It just releases the component's resources, and patch_list has already been released on line 86.

Member

That's exactly my concern. The patcher works by interposition, i.e. it interposes function calls to sbrk or similar APIs. If the shared library where these functions are defined goes out of scope and is unloaded, bad things will happen (as it will call a function in a memory area that has been released and possibly reused).
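
The ordering concern can be sketched with a plain function pointer standing in for a patched entry point. This is a simulation only: it does not patch real symbols or unload a real library, and `demo_entry`, `demo_hook`, `demo_patch_init`, and `demo_patch_fini` are hypothetical names, not the patcher framework's API. The point it illustrates is that the original entry must be restored before the code containing the hook goes away.

```c
#include <assert.h>
#include <stddef.h>

typedef long (*demo_call_t)(long);

static int demo_hook_calls = 0;

/* The "real" function that callers ultimately want to reach. */
static long demo_original(long n) { return n; }

/* The interposed hook: records the call, then forwards to the original.
 * In a real patcher this code would live in a loadable component. */
static long demo_hook(long n)
{
    demo_hook_calls++;
    return demo_original(n);
}

/* What callers invoke; patching redirects this pointer. */
static demo_call_t demo_entry = demo_original;
static demo_call_t demo_saved = NULL;

/* Install the hook and remember the previous target. */
static void demo_patch_init(void)
{
    demo_saved = demo_entry;
    demo_entry = demo_hook;
}

/* Restore the previous target. This must run before the hook's code is
 * unloaded, otherwise demo_entry would point at released memory. */
static void demo_patch_fini(void)
{
    if (NULL != demo_saved) {
        demo_entry = demo_saved;
        demo_saved = NULL;
    }
}

/* Report whether calls still go through the hook. */
static int demo_entry_is_patched(void)
{
    return demo_entry == demo_hook;
}
```

After `demo_patch_fini` runs, `demo_entry_is_patched` returns 0 and calls reach `demo_original` directly, which is the state a component would need to reach before it could be safely unloaded. Whether the real patcher components restore every patched symbol this way is exactly what the reviewers ask to have tested.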

Member

@hjelmn is the one that should really chime in on this. Looking at the code, all the right bits are there such that unloading the component should work. But clearly we're not testing it.

@wjjahah wjjahah requested a review from bosilca June 30, 2025 01:48

5 participants