
[Bug] deepseek v3 cannot run in multi-node #2658

Closed
3 of 5 tasks
JohnnyBoyzzz opened this issue Dec 30, 2024 · 11 comments

Comments

@JohnnyBoyzzz

Checklist

  • 1. I have searched related issues but cannot get the expected help.
  • 2. The bug has not been fixed in the latest version.
  • 3. Please note that if the bug-related issue you submitted lacks corresponding environment info and a minimal reproducible demo, it will be challenging for us to reproduce and resolve the issue, reducing the likelihood of receiving feedback.
  • 4. If the issue you raised is not a bug but a question, please raise a discussion at https://github.com/sgl-project/sglang/discussions/new/choose Otherwise, it will be closed.
  • 5. Please use English, otherwise it will be closed.

Describe the bug

I am deploying the DeepSeek-V3 model across multiple nodes. Each node has 8×A800-80G GPUs. I run the command on the first node and then on the second node, but the process gets stuck at "Init torch distributed begin." and cannot proceed any further.
[screenshot: startup log stuck at "Init torch distributed begin."]

Reproduction

node 1

GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path /models/deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init first_ip:8001 --nnodes 2 --node-rank 0 --trust-remote-code

node 2

GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path /models/deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init second_ip:8001 --nnodes 2 --node-rank 1 --trust-remote-code

Environment

first ip

python3 -m sglang.check_env
/root/anaconda3/envs/sglang/lib/python3.10/site-packages/pydantic/_internal/_config.py:341: UserWarning: Valid config keys have changed in V2:

  • 'underscore_attrs_are_private' has been removed
    warnings.warn(message, UserWarning)
    Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
    CUDA available: True
    GPU 0,1,2,3,4,5,6,7: NVIDIA A800-SXM4-80GB
    GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.4, V12.4.99
    CUDA Driver Version: 550.90.07
    PyTorch: 2.4.0+cu121
    sglang: 0.4.1.post3
    flashinfer: 0.1.6+cu121torch2.4
    triton: 3.0.0
    transformers: 4.46.3
    torchao: 0.7.0
    numpy: 1.26.4
    aiohttp: 3.10.5
    fastapi: 0.115.5
    hf_transfer: 0.1.8
    huggingface_hub: 0.24.7
    interegular: 0.3.3
    modelscope: Module Not Found
    orjson: 3.10.10
    packaging: 24.1
    psutil: 6.0.0
    pydantic: 2.9.1
    multipart: 0.0.9
    zmq: 26.2.0
    uvicorn: 0.30.6
    uvloop: 0.20.0
    vllm: 0.6.3.post1
    xgrammar: Module Not Found
    openai: 1.44.1
    anthropic: 0.34.2
    decord: 0.6.0
    NVIDIA Topology:
    GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
    GPU0 X NV8 NV8 NV8 NV8 NV8 NV8 NV8 NODE PXB PXB NODE NODE SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU1 NV8 X NV8 NV8 NV8 NV8 NV8 NV8 NODE PXB PXB NODE NODE SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU2 NV8 NV8 X NV8 NV8 NV8 NV8 NV8 NODE NODE NODE PXB PXB SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU3 NV8 NV8 NV8 X NV8 NV8 NV8 NV8 NODE NODE NODE PXB PXB SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU4 NV8 NV8 NV8 NV8 X NV8 NV8 NV8 SYS SYS SYS SYS SYS PXB PXB NODE NODE 32-63,96-127 1 N/A
    GPU5 NV8 NV8 NV8 NV8 NV8 X NV8 NV8 SYS SYS SYS SYS SYS PXB PXB NODE NODE 32-63,96-127 1 N/A
    GPU6 NV8 NV8 NV8 NV8 NV8 NV8 X NV8 SYS SYS SYS SYS SYS NODE NODE PXB PXB 32-63,96-127 1 N/A
    GPU7 NV8 NV8 NV8 NV8 NV8 NV8 NV8 X SYS SYS SYS SYS SYS NODE NODE PXB PXB 32-63,96-127 1 N/A
    NIC0 NODE NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE NODE SYS SYS SYS SYS
    NIC1 PXB PXB NODE NODE SYS SYS SYS SYS NODE X PIX NODE NODE SYS SYS SYS SYS
    NIC2 PXB PXB NODE NODE SYS SYS SYS SYS NODE PIX X NODE NODE SYS SYS SYS SYS
    NIC3 NODE NODE PXB PXB SYS SYS SYS SYS NODE NODE NODE X PIX SYS SYS SYS SYS
    NIC4 NODE NODE PXB PXB SYS SYS SYS SYS NODE NODE NODE PIX X SYS SYS SYS SYS
    NIC5 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS SYS SYS SYS X PIX NODE NODE
    NIC6 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS SYS SYS SYS PIX X NODE NODE
    NIC7 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS SYS SYS SYS NODE NODE X PIX
    NIC8 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS SYS SYS SYS NODE NODE PIX X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8

ulimit soft: 65535

second ip

python3 -m sglang.check_env
/root/anaconda3/envs/sglang/lib/python3.10/site-packages/pydantic/_internal/_config.py:341: UserWarning: Valid config keys have changed in V2:

  • 'underscore_attrs_are_private' has been removed
    warnings.warn(message, UserWarning)
    Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
    CUDA available: True
    GPU 0,1,2,3,4,5,6,7: NVIDIA A800-SXM4-80GB
    GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
    CUDA_HOME: /usr/local/cuda
    NVCC: Cuda compilation tools, release 12.4, V12.4.99
    CUDA Driver Version: 550.90.07
    PyTorch: 2.4.0+cu121
    sglang: 0.4.1.post3
    flashinfer: 0.1.6+cu121torch2.4
    triton: 3.0.0
    transformers: 4.46.3
    torchao: 0.7.0
    numpy: 1.26.4
    aiohttp: 3.10.5
    fastapi: 0.115.5
    hf_transfer: 0.1.8
    huggingface_hub: 0.24.7
    interegular: 0.3.3
    modelscope: Module Not Found
    orjson: 3.10.11
    packaging: 24.1
    psutil: 6.0.0
    pydantic: 2.9.1
    multipart: 0.0.9
    zmq: 26.2.0
    uvicorn: 0.30.6
    uvloop: 0.20.0
    vllm: 0.6.3.post1
    xgrammar: Module Not Found
    openai: 1.55.0
    anthropic: 0.34.2
    decord: 0.6.0
    NVIDIA Topology:
    GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 CPU Affinity NUMA Affinity GPU NUMA ID
    GPU0 X NV8 NV8 NV8 NV8 NV8 NV8 NV8 NODE PXB PXB NODE NODE SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU1 NV8 X NV8 NV8 NV8 NV8 NV8 NV8 NODE PXB PXB NODE NODE SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU2 NV8 NV8 X NV8 NV8 NV8 NV8 NV8 NODE NODE NODE PXB PXB SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU3 NV8 NV8 NV8 X NV8 NV8 NV8 NV8 NODE NODE NODE PXB PXB SYS SYS SYS SYS 0-31,64-95 0 N/A
    GPU4 NV8 NV8 NV8 NV8 X NV8 NV8 NV8 SYS SYS SYS SYS SYS PXB PXB NODE NODE 32-63,96-127 1 N/A
    GPU5 NV8 NV8 NV8 NV8 NV8 X NV8 NV8 SYS SYS SYS SYS SYS PXB PXB NODE NODE 32-63,96-127 1 N/A
    GPU6 NV8 NV8 NV8 NV8 NV8 NV8 X NV8 SYS SYS SYS SYS SYS NODE NODE PXB PXB 32-63,96-127 1 N/A
    GPU7 NV8 NV8 NV8 NV8 NV8 NV8 NV8 X SYS SYS SYS SYS SYS NODE NODE PXB PXB 32-63,96-127 1 N/A
    NIC0 NODE NODE NODE NODE SYS SYS SYS SYS X NODE NODE NODE NODE SYS SYS SYS SYS
    NIC1 PXB PXB NODE NODE SYS SYS SYS SYS NODE X PIX NODE NODE SYS SYS SYS SYS
    NIC2 PXB PXB NODE NODE SYS SYS SYS SYS NODE PIX X NODE NODE SYS SYS SYS SYS
    NIC3 NODE NODE PXB PXB SYS SYS SYS SYS NODE NODE NODE X PIX SYS SYS SYS SYS
    NIC4 NODE NODE PXB PXB SYS SYS SYS SYS NODE NODE NODE PIX X SYS SYS SYS SYS
    NIC5 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS SYS SYS SYS X PIX NODE NODE
    NIC6 SYS SYS SYS SYS PXB PXB NODE NODE SYS SYS SYS SYS SYS PIX X NODE NODE
    NIC7 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS SYS SYS SYS NODE NODE X PIX
    NIC8 SYS SYS SYS SYS NODE NODE PXB PXB SYS SYS SYS SYS SYS NODE NODE PIX X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8

ulimit soft: 65535

@zhyncs
Member

zhyncs commented Dec 30, 2024

A800 doesn't support FP8. Could you try H20 or H800 multi-node setups instead?
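
The constraint above comes from hardware compute capability: native FP8 (E4M3/E5M2) tensor-core support arrived with Ada/Hopper (SM 8.9+), while the A800 is Ampere, reported as compute capability 8.0 in the environment dump. A minimal illustrative check (the function name is my own, not an sglang API):

```python
def supports_native_fp8(major: int, minor: int) -> bool:
    """Native FP8 tensor cores require compute capability >= 8.9 (Ada/Hopper)."""
    return (major, minor) >= (8, 9)

# A800 (Ampere, SM 8.0) lacks native FP8; H800/H20 (Hopper, SM 9.0) have it.
print(supports_native_fp8(8, 0))   # A800
print(supports_native_fp8(9, 0))   # H800 / H20
```

This is why the FP8 checkpoint cannot be served as-is on A800 nodes, independent of the multi-node hang.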


@JohnnyBoyzzz
Author

A800 doesn't support FP8. Could you try H20 or H800 multi-node setups instead?

Alright, thank you for your suggestion. I found the BF16 version of the DeepSeek-V3 model on Hugging Face. I'll download it and try running it multi-node.

@mycpuorg

I am here for H100 instructions. The recommended page says this:

If you have two H100 nodes, the usage is similar to the aforementioned H20.

But I am experiencing the same issue described in this thread. Can somebody please help?

@mycpuorg

@JohnnyBoyzzz unless the bf16 weights just worked out of the box for you, can you please re-open this issue?

@fsygd
Contributor

fsygd commented Dec 31, 2024

@JohnnyBoyzzz Here is the problem: when launching the server on node 2, you should use first_ip for --nccl-init, NOT second_ip.

node 2 (corrected)
GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path /models/deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init first_ip:8001 --nnodes 2 --node-rank 1 --trust-remote-code

fyi @mycpuorg
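
The root cause generalizes: every node in the tensor-parallel group must rendezvous at the same address, the one belonging to node rank 0. A hedged sketch of that invariant (the helper and config keys are illustrative, not part of sglang):

```python
def check_rendezvous(configs):
    """All nodes must point --nccl-init at the same rank-0 address,
    and node ranks must cover 0..nnodes-1 exactly once."""
    addrs = {c["nccl_init"] for c in configs}
    if len(addrs) != 1:
        raise ValueError(f"nodes disagree on rendezvous address: {addrs}")
    ranks = sorted(c["node_rank"] for c in configs)
    if ranks != list(range(len(configs))):
        raise ValueError(f"node ranks must be 0..{len(configs) - 1}, got {ranks}")

# The original report: node 0 used first_ip, node 1 used second_ip -> rejected.
bad = [
    {"node_rank": 0, "nccl_init": "first_ip:8001"},
    {"node_rank": 1, "nccl_init": "second_ip:8001"},
]
try:
    check_rendezvous(bad)
except ValueError as e:
    print("invalid:", e)

# The fix: both nodes rendezvous at first_ip:8001.
good = [
    {"node_rank": 0, "nccl_init": "first_ip:8001"},
    {"node_rank": 1, "nccl_init": "first_ip:8001"},
]
check_rendezvous(good)  # passes
```

With mismatched addresses each node waits for peers at a different endpoint, which is exactly the "Init torch distributed begin." hang.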

@mycpuorg

Thanks @fsygd. In my case I am using the master IP, yet the effect is the same, which tells us the nodes are somehow unable to talk to each other and complete the initial handshake.

@JohnnyBoyzzz
Author

Using two H20 nodes works well; see https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208

@zhyncs when I run the BF16 model (https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main) on four nodes (8*A800 80G each), I hit another bug. How can I fix it?
[screenshot: error output]

Command

node 1

GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path models/deepseek-ai/DeepSeek-V3-BF16 --tp 32 --nccl-init host_ip:80 --nnodes 4 --node-rank 0 --trust-remote-code --disable-cuda-graph

node 2

GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path models/deepseek-ai/DeepSeek-V3-BF16 --tp 32 --nccl-init host_ip:80 --nnodes 4 --node-rank 1 --trust-remote-code --disable-cuda-graph

node3

GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path models/deepseek-ai/DeepSeek-V3-BF16 --tp 32 --nccl-init host_ip:80 --nnodes 4 --node-rank 2 --trust-remote-code --disable-cuda-graph

node4

GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path models/deepseek-ai/DeepSeek-V3-BF16 --tp 32 --nccl-init host_ip:80 --nnodes 4 --node-rank 3 --trust-remote-code --disable-cuda-graph
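
As a rough sanity check on why BF16 needs four A800 nodes here: DeepSeek-V3 has roughly 671B parameters, so the BF16 weights alone are about 1.34 TB. That exceeds two nodes' 1.28 TB (16 × 80 GB) but fits in four nodes' 2.56 TB under --tp 32. Back-of-envelope only; real usage adds KV cache, activations, and framework overhead:

```python
# Back-of-envelope weight-memory estimate.
# Assumptions: ~671e9 parameters, BF16 = 2 bytes per parameter.
params = 671e9
bytes_per_param = 2
total_gb = params * bytes_per_param / 1e9   # ~1342 GB of weights
per_gpu_gb = total_gb / 32                  # --tp 32 = 4 nodes x 8 GPUs
print(f"total weights: ~{total_gb:.0f} GB, per GPU at tp=32: ~{per_gpu_gb:.0f} GB of 80 GB")
```

At roughly 42 GB of weights per 80 GB GPU the model loads, but headroom for KV cache is tight, which is consistent with needing flags like --disable-cuda-graph.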

@fsygd
Contributor

fsygd commented Dec 31, 2024

Thanks @fsygd. In my case I am using the master IP, yet the effect is the same, which tells us the nodes are somehow unable to talk to each other and complete the initial handshake.

Yes, I think the reason is the network between the two nodes. I have successfully tested 2 H800 nodes in #2647 (comment)
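
When the launch hangs at "Init torch distributed begin.", a quick way to confirm the network suspicion above is to test raw TCP reachability of the rendezvous port from every non-master node, with plain sockets and no sglang involved (first_ip:8001 below is taken from the original report; substitute your master address):

```python
import socket

def can_reach(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run from node 1 (and every other non-master node) against the master,
# while the rank-0 server is starting up:
# print(can_reach("first_ip", 8001))
```

If this returns False, the problem is firewalls, routing, or the wrong GLOO_SOCKET_IFNAME/NCCL_SOCKET_IFNAME interface, not sglang itself.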

@JohnnyBoyzzz
Author

@JohnnyBoyzzz unless the bf16 weights just worked out of the box for you, can you please re-open this issue?

It cannot run in BF16 either; I found another bug.

@bingshuailiu
Copy link

@JohnnyBoyzzz unless the bf16 weights just worked out of the box for you, can you please re-open this issue?

cannot run in bf16...found another bug

Have you fixed the bug with the BF16 model on 4 machines (4 × 8 A800)? Same issue for me.
