How to solve the communication problem in distributed training? Thank you! #14

@XieZixiUSTC

Description

Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/8
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/8
Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/8
Initializing distributed: GLOBAL_RANK: 6, MEMBER: 7/8
Initializing distributed: GLOBAL_RANK: 7, MEMBER: 8/8
[2025-03-11 05:20:54,412][torch.distributed.distributed_c10d][INFO] - Added key: store_based_barrier_key:1 to store for rank: 0
[2025-03-11 05:21:04,421][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=8, worker_count=1, timeout=0:30:00)
[2025-03-11 05:21:14,425][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=8, worker_count=1, timeout=0:30:00)
[2025-03-11 05:21:24,426][torch.distributed.distributed_c10d][INFO] - Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=8, worker_count=1, timeout=0:30:00)
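The repeated "Waiting in store based barrier" lines show that rank 0's process-group store has registered only 1 of the expected 8 workers (worker_count=1, world_size=8), so the other seven processes never reached init_process_group. Common causes are a mismatched MASTER_ADDR/MASTER_PORT or WORLD_SIZE across processes, some processes not being launched at all, or a firewall blocking the rendezvous port. As a minimal sketch (diagnose_barrier_wait is a hypothetical helper, not part of torch), the gap can be read directly off the log line:

```python
import re

def diagnose_barrier_wait(log_line):
    """Extract world_size and worker_count from a torch.distributed
    store-based-barrier log line and return how many processes are
    still missing from the rendezvous (None if the line doesn't match)."""
    m = re.search(r"world_size=(\d+), worker_count=(\d+)", log_line)
    if m is None:
        return None
    world_size, worker_count = map(int, m.groups())
    return world_size - worker_count

line = ("Waiting in store based barrier to initialize process group for rank: 0, "
        "key: store_based_barrier_key:1 (world_size=8, worker_count=1, timeout=0:30:00)")
print(diagnose_barrier_wait(line))  # 7 of the 8 processes never joined the barrier
```

When worker_count stays below world_size until the timeout, verify that every process is actually launched (e.g. torchrun with --nproc_per_node matching the number of GPUs per node), that all nodes agree on MASTER_ADDR, MASTER_PORT, and WORLD_SIZE, and that the master port is reachable from every node.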
