Fix the AllReduce hang issue in torch plugin #26

Merged
2 commits merged into FlagOpen:main on Feb 5, 2025

Conversation

MC952-arch
Collaborator

No description provided.

@MC952-arch force-pushed the main branch 2 times, most recently from 652b7bb to 39035b1, on January 24, 2025 08:36

@heavyrain-lzy left a comment

LGTM

char line[1024];
const char* glooIface = flagcxGetEnv("FLAGCX_GLOO_SOCKET_IFNAME");
if (glooIface == NULL) {
  FLAGCXCHECK(getHostName(line, 1024, '.'));

Does the machine need to support IPoIB to use IB hardware communication?
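
For readers following the snippet under review, here is a minimal standalone sketch of the fallback pattern it appears to use: read `FLAGCX_GLOO_SOCKET_IFNAME`, and fall back to the host name when it is unset. This is not the plugin's code. `pick_gloo_endpoint` is a hypothetical name, standard `getenv` and `gethostname` stand in for the project's `flagcxGetEnv` and `getHostName`, and two behaviors are assumptions: that a set variable is used verbatim as the interface name, and that the `'.'` argument means "truncate at the first dot".

```c
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not part of FlagCX.
 * Writes either the interface name from FLAGCX_GLOO_SOCKET_IFNAME or,
 * if that variable is unset or empty, the short host name into `out`.
 * Returns 0 on success and -1 on failure. */
static int pick_gloo_endpoint(char *out, size_t len) {
  if (out == NULL || len == 0) return -1;

  const char *iface = getenv("FLAGCX_GLOO_SOCKET_IFNAME");
  if (iface != NULL && iface[0] != '\0') {
    /* Assumption: an explicitly configured interface is used as-is. */
    if (strlen(iface) >= len) return -1; /* would not fit with the NUL */
    strcpy(out, iface);
    return 0;
  }

  /* Fallback: use the host name. */
  if (gethostname(out, len) != 0) return -1;
  out[len - 1] = '\0'; /* gethostname may not terminate on truncation */

  /* Assumption: keep only the part before the first '.' delimiter. */
  char *dot = strchr(out, '.');
  if (dot != NULL) *dot = '\0';
  return 0;
}
```

A caller would pass a buffer such as `char line[1024];` and check the return value before using the result. Whether the chosen interface has to be an IPoIB device, as asked above, is not something this sketch answers.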

Collaborator

@aoyulong left a comment

LGTM. We should design a better mechanism for timing and tracing.

@aoyulong merged commit e13b888 into FlagOpen:main on Feb 5, 2025
3 participants