
[Bug] deepseek v3 cannot run in multi-node #2658

Closed
3 of 5 tasks
JohnnyBoyzzz opened this issue Dec 30, 2024 · 11 comments

Comments

@JohnnyBoyzzz

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I am deploying the DeepSeek-V3 model across multiple nodes. Each node has 8×A800-80G GPUs. I run the command on the first node and then on the second node, but the process gets stuck at "Init torch distributed begin." and cannot proceed any further.
[screenshot: startup log stuck at "Init torch distributed begin."]

Reproduction

node 1

GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path /models/deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init first_ip:8001 --nnodes 2 --node-rank 0 --trust-remote-code

node 2

GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path /models/deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init second_ip:8001 --nnodes 2 --node-rank 1 --trust-remote-code

Environment

first ip

python3 -m sglang.check_env
/root/anaconda3/envs/sglang/lib/python3.10/site-packages/pydantic/_internal/_config.py:341: UserWarning: Valid config keys have changed in V2:

  • 'underscore_attrs_are_private' has been removed
    warnings.warn(message, UserWarning)
    Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
    CUDA available: True
    GPU 0,1,2,3,4,5,6,7: NVIDIA A800-SXM4-80GB
    GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.4, V12.4.99
    CUDA Driver Version: 550.90.07
    PyTorch: 2.4.0+cu121
    sglang: 0.4.1.post3
    flashinfer: 0.1.6+cu121torch2.4
    triton: 3.0.0
    transformers: 4.46.3
    torchao: 0.7.0
    numpy: 1.26.4
    aiohttp: 3.10.5
    fastapi: 0.115.5
    hf_transfer: 0.1.8
    huggingface_hub: 0.24.7
    interegular: 0.3.3
    modelscope: Module Not Found
    orjson: 3.10.10
    packaging: 24.1
    psutil: 6.0.0
    pydantic: 2.9.1
    multipart: 0.0.9
    zmq: 26.2.0
    uvicorn: 0.30.6
    uvloop: 0.20.0
    vllm: 0.6.3.post1
    xgrammar: Module Not Found
    openai: 1.44.1
    anthropic: 0.34.2
    decord: 0.6.0
    NVIDIA Topology:
    GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
    GPU0 X NV8 NV8 NV8 NV8 NV8 NV8 NV8 NODE PXB PXB NODE NODE SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU1 NV8 X NV8 NV8 NV8 NV8 NV8 NV8 NODE PXB PXB NODE NODE SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU2 NV8 NV8 X NV8 NV8 NV8 NV8 NV8 NODE NODE NODE PXB PXB SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU3 NV8 NV8 NV8 X NV8 NV8 NV8 NV8 NODE NODE NODE PXB PXB SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU4 NV8 NV8 NV8 NV8 X NV8 NV8 NV8 SYS SYS SYS SYS SYS PXB PXB NODE NODE 32-63,96-127 1 N/A
    GPU5 NV8 NV8 NV8 NV8 NV8 X NV8 NV8 SYS SYS SYS SYS SYS PXB PXB NODE NODE 32-63,96-127 1 N/A
    GPU6 NV8 NV8 NV8 NV8 NV8 NV8 X NV8 SYS SYS SYS SYS SYS NODE NODE PXB PXB 32-63,96-127 1 N/A
    GPU7 NV8 NV8 NV8 NV8 NV8 NV8 NV8 X SYS SYS SYS SYS SYS NODE NODE PXB PXB 32-63,96-127 1 N/A
    NIC0 NODE NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE NODE SYS SYS SYS SYS
    NIC1 PXB PXB NODE NODE SYS SYS SYS SYS NODE X PIX NODE NODE SYS SYS SYS SYS
    NIC2 PXB PXB NODE NODE SYS SYS SYS SYS NODE PIX X NODE NODE SYS SYS SYS SYS
    NIC3 NODE NODE PXB PXB SYS SYS SYS SYS NODE NODE NODE X PIX SYS SYS SYS SYS
    NIC4 NODE NODE PXB PXB SYS SYS SYS SYS NODE NODE NODE PIX X SYS SYS SYS SYS
    NIC5 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS SYS SYS SYS X PIX NODE NODE
    NIC6 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS SYS SYS SYS PIX X NODE NODE
    NIC7 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS SYS SYS SYS NODE NODE X PIX
    NIC8 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS SYS SYS SYS NODE NODE PIX X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8

ulimit soft: 65535

second ip

python3 -m sglang.check_env
/root/anaconda3/envs/sglang/lib/python3.10/site-packages/pydantic/_internal/_config.py:341: UserWarning: Valid config keys have changed in V2:

  • 'underscore_attrs_are_private' has been removed
    warnings.warn(message, UserWarning)
    Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
    CUDA available: True
    GPU 0,1,2,3,4,5,6,7: NVIDIA A800-SXM4-80GB
    GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.4, V12.4.99
    CUDA Driver Version: 550.90.07
    PyTorch: 2.4.0+cu121
    sglang: 0.4.1.post3
    flashinfer: 0.1.6+cu121torch2.4
    triton: 3.0.0
    transformers: 4.46.3
    torchao: 0.7.0
    numpy: 1.26.4
    aiohttp: 3.10.5
    fastapi: 0.115.5
    hf_transfer: 0.1.8
    huggingface_hub: 0.24.7
    interegular: 0.3.3
    modelscope: Module Not Found
    orjson: 3.10.11
    packaging: 24.1
    psutil: 6.0.0
    pydantic: 2.9.1
    multipart: 0.0.9
    zmq: 26.2.0
    uvicorn: 0.30.6
    uvloop: 0.20.0
    vllm: 0.6.3.post1
    xgrammar: Module Not Found
    openai: 1.55.0
    anthropic: 0.34.2
    decord: 0.6.0
    NVIDIA Topology:
    GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
    GPU0 X NV8 NV8 NV8 NV8 NV8 NV8 NV8 NODE PXB PXB NODE NODE SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU1 NV8 X NV8 NV8 NV8 NV8 NV8 NV8 NODE PXB PXB NODE NODE SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU2 NV8 NV8 X NV8 NV8 NV8 NV8 NV8 NODE NODE NODE PXB PXB SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU3 NV8 NV8 NV8 X NV8 NV8 NV8 NV8 NODE NODE NODE PXB PXB SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU4 NV8 NV8 NV8 NV8 X NV8 NV8 NV8 SYS SYS SYS SYS SYS PXB PXB NODE NODE 32-63,96-127 1 N/A
    GPU5 NV8 NV8 NV8 NV8 NV8 X NV8 NV8 SYS SYS SYS SYS SYS PXB PXB NODE NODE 32-63,96-127 1 N/A
    GPU6 NV8 NV8 NV8 NV8 NV8 NV8 X NV8 SYS SYS SYS SYS SYS NODE NODE PXB PXB 32-63,96-127 1 N/A
    GPU7 NV8 NV8 NV8 NV8 NV8 NV8 NV8 X SYS SYS SYS SYS SYS NODE NODE PXB PXB 32-63,96-127 1 N/A
    NIC0 NODE NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE NODE SYS SYS SYS SYS
    NIC1 PXB PXB NODE NODE SYS SYS SYS SYS NODE X PIX NODE NODE SYS SYS SYS SYS
    NIC2 PXB PXB NODE NODE SYS SYS SYS SYS NODE PIX X NODE NODE SYS SYS SYS SYS
    NIC3 NODE NODE PXB PXB SYS SYS SYS SYS NODE NODE NODE X PIX SYS SYS SYS SYS
    NIC4 NODE NODE PXB PXB SYS SYS SYS SYS NODE NODE NODE PIX X SYS SYS SYS SYS
    NIC5 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS SYS SYS SYS X PIX NODE NODE
    NIC6 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS SYS SYS SYS PIX X NODE NODE
    NIC7 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS SYS SYS SYS NODE NODE X PIX
    NIC8 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS SYS SYS SYS NODE NODE PIX X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8

ulimit soft: 65535

@zhyncs
Member

zhyncs commented Dec 30, 2024

A800 doesn't support FP8. Could you try H20 or H800 multi-node setups instead?
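
The constraint above comes from hardware compute capability: native FP8 (E4M3/E5M2) tensor-core support arrived with Ada/Hopper (SM 8.9+), while the A800 is Ampere, reported as compute capability 8.0 in the environment dump. A minimal illustrative check (the function name is my own, not an sglang API):

```python
def supports_native_fp8(major: int, minor: int) -> bool:
    """Native FP8 tensor cores require compute capability >= 8.9 (Ada/Hopper)."""
    return (major, minor) >= (8, 9)

# A800 (Ampere, SM 8.0) lacks native FP8; H800/H20 (Hopper, SM 9.0) have it.
print(supports_native_fp8(8, 0))   # A800
print(supports_native_fp8(9, 0))   # H800 / H20
```

This is why the FP8 checkpoint cannot be served as-is on A800 nodes, independent of the multi-node hang.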


@JohnnyBoyzzz
Author

A800 doesn't support FP8. Could you try H20 or H800 multi-node setups instead?

Alright, thank you for your suggestion. I found the BF16 version of the DeepSeek-V3 model on Hugging Face. I'll download it and try running it multi-node.

@mycpuorg

I am here for H100 instructions. The recommended page says this:

If you have two H100 nodes, the usage is similar to the aforementioned H20.

But I am experiencing the same issue described in this thread. Can somebody please help?

@mycpuorg

@JohnnyBoyzzz unless the bf16 weights just worked out of the box for you, can you please re-open this issue?

@fsygd
Contributor

fsygd commented Dec 31, 2024

@JohnnyBoyzzz Here is the problem: when launching the server on node 2, you should use first_ip for --nccl-init, NOT second_ip.

node 2 (corrected)
GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path /models/deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init first_ip:8001 --nnodes 2 --node-rank 1 --trust-remote-code

fyi @mycpuorg
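
The root cause generalizes: every node in the tensor-parallel group must rendezvous at the same address, the one belonging to node rank 0. A hedged sketch of that invariant (the helper and config keys are illustrative, not part of sglang):

```python
def check_rendezvous(configs):
    """All nodes must point --nccl-init at the same rank-0 address,
    and node ranks must cover 0..nnodes-1 exactly once."""
    addrs = {c["nccl_init"] for c in configs}
    if len(addrs) != 1:
        raise ValueError(f"nodes disagree on rendezvous address: {addrs}")
    ranks = sorted(c["node_rank"] for c in configs)
    if ranks != list(range(len(configs))):
        raise ValueError(f"node ranks must be 0..{len(configs) - 1}, got {ranks}")

# The original report: node 0 used first_ip, node 1 used second_ip -> rejected.
bad = [
    {"node_rank": 0, "nccl_init": "first_ip:8001"},
    {"node_rank": 1, "nccl_init": "second_ip:8001"},
]
try:
    check_rendezvous(bad)
except ValueError as e:
    print("invalid:", e)

# The fix: both nodes rendezvous at first_ip:8001.
good = [
    {"node_rank": 0, "nccl_init": "first_ip:8001"},
    {"node_rank": 1, "nccl_init": "first_ip:8001"},
]
check_rendezvous(good)  # passes
```

With mismatched addresses each node waits for peers at a different endpoint, which is exactly the "Init torch distributed begin." hang.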

@mycpuorg

Thanks @fsygd. In my case I am using the master IP, yet the effect is the same, which tells us the nodes are somehow unable to talk to each other and complete the initial handshake.

@JohnnyBoyzzz
Author

Using two H20 nodes works well; see https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208

@zhyncs when I run the BF16 model (https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main) on four nodes (8*A800 80G each), I hit another bug. How can I fix it?
[screenshot: error output]

Command

node 1

GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path models/deepseek-ai/DeepSeek-V3-BF16 --tp 32 --nccl-init host_ip:80 --nnodes 4 --node-rank 0 --trust-remote-code --disable-cuda-graph

node 2

GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path models/deepseek-ai/DeepSeek-V3-BF16 --tp 32 --nccl-init host_ip:80 --nnodes 4 --node-rank 1 --trust-remote-code --disable-cuda-graph

node3

GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path models/deepseek-ai/DeepSeek-V3-BF16 --tp 32 --nccl-init host_ip:80 --nnodes 4 --node-rank 2 --trust-remote-code --disable-cuda-graph

node4

GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path models/deepseek-ai/DeepSeek-V3-BF16 --tp 32 --nccl-init host_ip:80 --nnodes 4 --node-rank 3 --trust-remote-code --disable-cuda-graph
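
As a rough sanity check on why BF16 needs four A800 nodes here: DeepSeek-V3 has roughly 671B parameters, so the BF16 weights alone are about 1.34 TB. That exceeds two nodes' 1.28 TB (16 × 80 GB) but fits in four nodes' 2.56 TB under --tp 32. Back-of-envelope only; real usage adds KV cache, activations, and framework overhead:

```python
# Back-of-envelope weight-memory estimate.
# Assumptions: ~671e9 parameters, BF16 = 2 bytes per parameter.
params = 671e9
bytes_per_param = 2
total_gb = params * bytes_per_param / 1e9   # ~1342 GB of weights
per_gpu_gb = total_gb / 32                  # --tp 32 = 4 nodes x 8 GPUs
print(f"total weights: ~{total_gb:.0f} GB, per GPU at tp=32: ~{per_gpu_gb:.0f} GB of 80 GB")
```

At roughly 42 GB of weights per 80 GB GPU the model loads, but headroom for KV cache is tight, which is consistent with needing flags like --disable-cuda-graph.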

@fsygd
Contributor

fsygd commented Dec 31, 2024

Thanks @fsygd. In my case I am using the master IP, yet the effect is the same, which tells us the nodes are somehow unable to talk to each other and complete the initial handshake.

Yes, I think the reason is the network between the two nodes. I have successfully tested 2 H800 nodes in #2647 (comment)
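
When the launch hangs at "Init torch distributed begin.", a quick way to confirm the network suspicion above is to test raw TCP reachability of the rendezvous port from every non-master node, with plain sockets and no sglang involved (first_ip:8001 below is taken from the original report; substitute your master address):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from node 1 (and every other non-master node) against the master,
# while the rank-0 server is starting up:
# print(can_reach("first_ip", 8001))
```

If this returns False, the problem is firewalls, routing, or the wrong GLOO_SOCKET_IFNAME/NCCL_SOCKET_IFNAME interface, not sglang itself.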

@JohnnyBoyzzz
Author

@JohnnyBoyzzz unless the bf16 weights just worked out of the box for you, can you please re-open this issue?

It cannot run in BF16 either; I found another bug.

@bingshuailiu
Copy link

@JohnnyBoyzzz unless the bf16 weights just worked out of the box for you, can you please re-open this issue?

cannot run in bf16...found another bug

Have you fixed the bug with the BF16 model on 4 machines (4 × 8 A800)? Same issue for me.
