[Bug] deepseek v3 cannot run in multi-node #2658
Comments
The A800 doesn't support FP8. Could you try multi-node H20 or H800 instead?
Using two H20 nodes works well; ref https://github.com/sgl-project/sglang/tree/main/benchmark/deepseek_v3#example-serving-with-2-h208
Alright, thank you for your suggestion. I found the BF16 version of the DeepSeek V3 model on Hugging Face. I'll download it and see whether I can run it multi-node.
I am here for H100 instructions. The recommended page says this:
But I am experiencing the same issue described in this issue. Can somebody please help?
@JohnnyBoyzzz unless the bf16 weights just worked out of the box for you, can you please re-open this issue? |
@JohnnyBoyzzz Here is the problem: when launching the server on node 2, you should use the master node's address (node 1's IP) in `--nccl-init`, the same value as on node 1, not node 2's own IP.
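Concretely, a sketch of the corrected node-2 launch, reusing the paths and the `first_ip` placeholder from this report (every node points `--nccl-init` at node 1):

```shell
# Node 2 (rank 1): --nccl-init must reference the master node (node 1),
# not this node's own IP. "first_ip" is node 1's address, as in the report.
GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server \
  --model-path /models/deepseek-ai/DeepSeek-V3 \
  --tp 16 \
  --nccl-init first_ip:8001 \
  --nnodes 2 \
  --node-rank 1 \
  --trust-remote-code
```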
fyi @mycpuorg
Thanks @fsygd, but in my case I am already using the master IP. Yet the effect is the same, which tells us that the nodes are somehow unable to talk to each other and complete the initial handshake.
@zhyncs When I run the BF16 model (https://huggingface.co/opensourcerelease/DeepSeek-V3-bf16/tree/main) on four nodes (8*A800 80G each), I hit another bug. How can I fix it?
Command:
node 1
GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path models/deepseek-ai/DeepSeek-V3-BF16 --tp 32 --nccl-init host_ip:80 --nnodes 4 --node-rank 0 --trust-remote-code --disable-cuda-graph
node 2
GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path models/deepseek-ai/DeepSeek-V3-BF16 --tp 32 --nccl-init host_ip:80 --nnodes 4 --node-rank 1 --trust-remote-code --disable-cuda-graph
node 3
GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path models/deepseek-ai/DeepSeek-V3-BF16 --tp 32 --nccl-init host_ip:80 --nnodes 4 --node-rank 2 --trust-remote-code --disable-cuda-graph
node 4
GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path models/deepseek-ai/DeepSeek-V3-BF16 --tp 32 --nccl-init host_ip:80 --nnodes 4 --node-rank 3 --trust-remote-code --disable-cuda-graph
Yes, I think the cause is the network between the two nodes. I have successfully tested 2 H800 nodes in #2647 (comment).
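Before launching, it can help to confirm that the rendezvous port on the master node is actually reachable from every other node. A minimal sketch; the host and port in the example comment are placeholders, not values from this report:

```python
import socket

def port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Example: run this on node 2 against node 1's --nccl-init address, e.g.
# port_reachable("first_ip", 8001)
```

If this returns False from any worker node, the hang at init is expected: the gloo/NCCL rendezvous can never complete.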
I cannot run the BF16 version either; I found another bug.
Have you fixed the bug with the BF16 model on 4 machines (4 × 8 A800)? Same issue for me.
Checklist
Describe the bug
I am deploying the DeepSeek V3 model across multiple nodes, each with 8*A800-80G GPUs. I run the command on the first node and then on the second node, but the process gets stuck at "Init torch distributed begin." and cannot proceed any further.
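Since the launch commands below pin Gloo to `eth0` via `GLOO_SOCKET_IFNAME`, one common culprit in multi-node init problems is an interface name that does not exist (or is not the inter-node interface) on every machine. A small sanity-check sketch (Linux; `eth0` is the name assumed in this report's commands):

```python
import socket

# Interface names visible on this host; GLOO_SOCKET_IFNAME must match one
# of these on every node, e.g. "eth0" in the launch commands below.
interfaces = [name for _, name in socket.if_nameindex()]
print(interfaces)
print("eth0" in interfaces)
```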
Reproduction
node 1
GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path /models/deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init first_ip:8001 --nnodes 2 --node-rank 0 --trust-remote-code
node 2
GLOO_SOCKET_IFNAME=eth0 /root/anaconda3/envs/sglang/bin/python -m sglang.launch_server --model-path /models/deepseek-ai/DeepSeek-V3 --tp 16 --nccl-init second_ip:8001 --nnodes 2 --node-rank 1 --trust-remote-code
Environment
first ip
python3 -m sglang.check_env
/root/anaconda3/envs/sglang/lib/python3.10/site-packages/pydantic/_internal/_config.py:341: UserWarning: Valid config keys have changed in V2:
warnings.warn(message, UserWarning)
Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A800-SXM4-80GB
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.99
CUDA Driver Version: 550.90.07
PyTorch: 2.4.0+cu121
sglang: 0.4.1.post3
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.46.3
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.115.5
hf_transfer: 0.1.8
huggingface_hub: 0.24.7
interegular: 0.3.3
modelscope: Module Not Found
orjson: 3.10.10
packaging: 24.1
psutil: 6.0.0
pydantic: 2.9.1
multipart: 0.0.9
zmq: 26.2.0
uvicorn: 0.30.6
uvloop: 0.20.0
vllm: 0.6.3.post1
xgrammar: Module Not Found
openai: 1.44.1
anthropic: 0.34.2
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV8   NV8   NV8   NV8   NV8   NV8   NV8   NODE  PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU1  NV8   X     NV8   NV8   NV8   NV8   NV8   NV8   NODE  PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU2  NV8   NV8   X     NV8   NV8   NV8   NV8   NV8   NODE  NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU3  NV8   NV8   NV8   X     NV8   NV8   NV8   NV8   NODE  NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU4  NV8   NV8   NV8   NV8   X     NV8   NV8   NV8   SYS   SYS   SYS   SYS   SYS   PXB   PXB   NODE  NODE  32-63,96-127  1              N/A
GPU5  NV8   NV8   NV8   NV8   NV8   X     NV8   NV8   SYS   SYS   SYS   SYS   SYS   PXB   PXB   NODE  NODE  32-63,96-127  1              N/A
GPU6  NV8   NV8   NV8   NV8   NV8   NV8   X     NV8   SYS   SYS   SYS   SYS   SYS   NODE  NODE  PXB   PXB   32-63,96-127  1              N/A
GPU7  NV8   NV8   NV8   NV8   NV8   NV8   NV8   X     SYS   SYS   SYS   SYS   SYS   NODE  NODE  PXB   PXB   32-63,96-127  1              N/A
NIC0  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC1  PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   NODE  X     PIX   NODE  NODE  SYS   SYS   SYS   SYS
NIC2  PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   NODE  PIX   X     NODE  NODE  SYS   SYS   SYS   SYS
NIC3  NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     PIX   SYS   SYS   SYS   SYS
NIC4  NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   NODE  NODE  NODE  PIX   X     SYS   SYS   SYS   SYS
NIC5  SYS   SYS   SYS   SYS   PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   SYS   X     PIX   NODE  NODE
NIC6  SYS   SYS   SYS   SYS   PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   SYS   PIX   X     NODE  NODE
NIC7  SYS   SYS   SYS   SYS   NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   SYS   NODE  NODE  X     PIX
NIC8  SYS   SYS   SYS   SYS   NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   SYS   NODE  NODE  PIX   X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
ulimit soft: 65535
second ip
python3 -m sglang.check_env
/root/anaconda3/envs/sglang/lib/python3.10/site-packages/pydantic/_internal/_config.py:341: UserWarning: Valid config keys have changed in V2:
warnings.warn(message, UserWarning)
Python: 3.10.14 (main, May 6 2024, 19:42:50) [GCC 11.2.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA A800-SXM4-80GB
GPU 0,1,2,3,4,5,6,7 Compute Capability: 8.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.99
CUDA Driver Version: 550.90.07
PyTorch: 2.4.0+cu121
sglang: 0.4.1.post3
flashinfer: 0.1.6+cu121torch2.4
triton: 3.0.0
transformers: 4.46.3
torchao: 0.7.0
numpy: 1.26.4
aiohttp: 3.10.5
fastapi: 0.115.5
hf_transfer: 0.1.8
huggingface_hub: 0.24.7
interegular: 0.3.3
modelscope: Module Not Found
orjson: 3.10.11
packaging: 24.1
psutil: 6.0.0
pydantic: 2.9.1
multipart: 0.0.9
zmq: 26.2.0
uvicorn: 0.30.6
uvloop: 0.20.0
vllm: 0.6.3.post1
xgrammar: Module Not Found
openai: 1.55.0
anthropic: 0.34.2
decord: 0.6.0
NVIDIA Topology:
      GPU0  GPU1  GPU2  GPU3  GPU4  GPU5  GPU6  GPU7  NIC0  NIC1  NIC2  NIC3  NIC4  NIC5  NIC6  NIC7  NIC8  CPU Affinity  NUMA Affinity  GPU NUMA ID
GPU0  X     NV8   NV8   NV8   NV8   NV8   NV8   NV8   NODE  PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU1  NV8   X     NV8   NV8   NV8   NV8   NV8   NV8   NODE  PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU2  NV8   NV8   X     NV8   NV8   NV8   NV8   NV8   NODE  NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU3  NV8   NV8   NV8   X     NV8   NV8   NV8   NV8   NODE  NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   0-31,64-95    0              N/A
GPU4  NV8   NV8   NV8   NV8   X     NV8   NV8   NV8   SYS   SYS   SYS   SYS   SYS   PXB   PXB   NODE  NODE  32-63,96-127  1              N/A
GPU5  NV8   NV8   NV8   NV8   NV8   X     NV8   NV8   SYS   SYS   SYS   SYS   SYS   PXB   PXB   NODE  NODE  32-63,96-127  1              N/A
GPU6  NV8   NV8   NV8   NV8   NV8   NV8   X     NV8   SYS   SYS   SYS   SYS   SYS   NODE  NODE  PXB   PXB   32-63,96-127  1              N/A
GPU7  NV8   NV8   NV8   NV8   NV8   NV8   NV8   X     SYS   SYS   SYS   SYS   SYS   NODE  NODE  PXB   PXB   32-63,96-127  1              N/A
NIC0  NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS   X     NODE  NODE  NODE  NODE  SYS   SYS   SYS   SYS
NIC1  PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   NODE  X     PIX   NODE  NODE  SYS   SYS   SYS   SYS
NIC2  PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   NODE  PIX   X     NODE  NODE  SYS   SYS   SYS   SYS
NIC3  NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   NODE  NODE  NODE  X     PIX   SYS   SYS   SYS   SYS
NIC4  NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   NODE  NODE  NODE  PIX   X     SYS   SYS   SYS   SYS
NIC5  SYS   SYS   SYS   SYS   PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   SYS   X     PIX   NODE  NODE
NIC6  SYS   SYS   SYS   SYS   PXB   PXB   NODE  NODE  SYS   SYS   SYS   SYS   SYS   PIX   X     NODE  NODE
NIC7  SYS   SYS   SYS   SYS   NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   SYS   NODE  NODE  X     PIX
NIC8  SYS   SYS   SYS   SYS   NODE  NODE  PXB   PXB   SYS   SYS   SYS   SYS   SYS   NODE  NODE  PIX   X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
NIC Legend:
NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
ulimit soft: 65535