Support multi-node & autoscaling & routing together for models like Deepseek-R1 #758

Open
Jeffwan opened this issue Feb 27, 2025 · 7 comments
Assignees
Labels
area/autoscaling area/distributed priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

@Jeffwan
Collaborator

Jeffwan commented Feb 27, 2025

🚀 Feature Description and Motivation

Orchestration

  1. DeepSeek-R1 with full weights needs to be deployed using multi-node orchestration. If we adopt cross-node TP (tensor parallelism), let's make sure RDMA communication is unblocked in that case.
  2. Let's make sure the rolling upgrade experience works as expected.
  3. We also need graceful shutdown so that in-flight requests are handled correctly.
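
To make item 3 concrete, here is a minimal sketch of request draining. It is not the project's implementation; `InflightTracker` and its method names are hypothetical, and it only shows the idea of refusing new requests while waiting for admitted ones to finish.

```python
import threading


class InflightTracker:
    """Counts in-flight requests so a server can drain before exiting.

    Hypothetical helper for illustration only.
    """

    def __init__(self) -> None:
        self._count = 0
        self._draining = False
        self._cond = threading.Condition()

    def begin(self) -> bool:
        """Admit a request; returns False once draining has started."""
        with self._cond:
            if self._draining:
                return False
            self._count += 1
            return True

    def end(self) -> None:
        """Mark one admitted request as finished."""
        with self._cond:
            self._count -= 1
            self._cond.notify_all()

    def drain(self, timeout: float) -> bool:
        """Stop admitting requests and wait up to `timeout` seconds.

        Returns True if every in-flight request finished in time.
        """
        with self._cond:
            self._draining = True
            return self._cond.wait_for(lambda: self._count == 0, timeout=timeout)
```

In a Kubernetes pod, `drain` would typically be called from a SIGTERM handler, with a timeout shorter than the pod's `terminationGracePeriodSeconds`.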

Autoscaling

In such cases, traditional autoscaling may not work well.

  • Resource metrics like SM_ACTIVE are still aggregated at the pod level, so this makes no big difference.
  • For application metrics, only the head pod, which has the API server deployed, emits the metrics. This has to be kept consistent with the number of units.
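
As a rough illustration of the second point, the sketch below applies the usual HPA-style ratio (desired = ceil(current × metric / target)) to metrics collected from head pods only, and scales in whole units. It assumes a "unit" means one head pod plus its workers; `desired_units` and its parameters are hypothetical names, not an existing autoscaler API.

```python
import math
from typing import Iterable


def desired_units(current_units: int,
                  head_metrics: Iterable[float],
                  target_per_unit: float) -> int:
    """HPA-style target tracking over units instead of pods.

    Assumption: one unit = one head pod plus its worker pods, and only
    the head pod of each unit reports the application metric.
    """
    values = list(head_metrics)
    if not values or current_units <= 0:
        # No signal: keep the current size, but never go below one unit.
        return max(current_units, 1)
    average = sum(values) / len(values)
    return max(1, math.ceil(current_units * average / target_per_unit))
```

For example, two units whose head pods report 30 and 50 against a target of 20 per unit would scale to four units, which means four head pods plus four sets of workers.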

Routing

  1. The router should skip worker pods and only consider the head pod for request routing.
  2. Make sure it removes a pod once the pod enters the terminating stage.
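
The two routing rules can be sketched as a simple filter. This assumes pods are plain dicts shaped like Kubernetes Pod objects, that head pods carry KubeRay's `ray.io/node-type: head` label, and that a terminating pod has `metadata.deletionTimestamp` set. The function name `routable_pods` is hypothetical, and the real router may identify head pods differently.

```python
from typing import Any, Dict, List

# Label KubeRay sets on the pods it creates; the value is "head" or "worker".
NODE_TYPE_LABEL = "ray.io/node-type"


def routable_pods(pods: List[Dict[str, Any]]) -> List[str]:
    """Return the names of head pods that are not terminating."""
    names = []
    for pod in pods:
        meta = pod.get("metadata", {})
        is_head = meta.get("labels", {}).get(NODE_TYPE_LABEL) == "head"
        # Kubernetes sets deletionTimestamp when a pod starts terminating.
        terminating = meta.get("deletionTimestamp") is not None
        if is_head and not terminating:
            names.append(meta["name"])
    return names
```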

Use Case

As a user, I want to host the full-weights version of DeepSeek-R1 and autoscale the workload based on traffic.

Proposed Solution

No response

@Jeffwan Jeffwan changed the title Support multi-node & autoscaling together for models like Deepseek-R1 Support multi-node & autoscaling & routing together for models like Deepseek-R1 Feb 27, 2025
@Jeffwan Jeffwan self-assigned this Feb 27, 2025
@Jeffwan Jeffwan added the priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. label Feb 27, 2025
@Jeffwan
Collaborator Author

Jeffwan commented Mar 1, 2025

Routing

Image

Image
Requests always hit the head pod.

Update: after running more tests, I noticed this is not true. I did see requests reach other pods, but due to some issues they didn't run through.

Image

python3 benchmark_serving.py --backend vllm  --model deepseek-ai/deepseek-r1 --trust-remote-code --served-model-name deepseek-r1-671b --base-url http://localhost:8888 --endpoint /v1/completions --num-prompts 100 --request-rate 2 --metric_percentiles '50,90,95,99' --goodput ttft:1000 tpot:100 --max-concurrency 200 --random-input-len 2048 --random-output-len 200 --dataset-name random --ignore-eos 
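
For reading the `--goodput ttft:1000 tpot:100` part of the command, here is a small sketch of the idea, not the benchmark script's own code. It assumes both thresholds are in milliseconds and that a request counts as "good" only if it meets every threshold; `RequestTiming` and `good_fraction` are hypothetical names, and the result is a fraction rather than whatever unit the script itself reports.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class RequestTiming:
    ttft_ms: float  # time to first token, in milliseconds
    tpot_ms: float  # time per output token, in milliseconds


def good_fraction(timings: Iterable[RequestTiming],
                  ttft_slo_ms: float = 1000.0,
                  tpot_slo_ms: float = 100.0) -> float:
    """Fraction of requests that meet both the TTFT and TPOT thresholds."""
    items = list(timings)
    if not items:
        return 0.0
    good = sum(
        1 for t in items
        if t.ttft_ms <= ttft_slo_ms and t.tpot_ms <= tpot_slo_ms
    )
    return good / len(items)
```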

@Jeffwan
Collaborator Author

Jeffwan commented Mar 2, 2025

RayCluster Orchestration related

  1. ray.io/overwrite-container-cmd -> RayCluster level
  2. Head and worker annotations have to be set separately; there's no propagation to different roles yet. RayClusterFleet spec.templates.metadata controls RayCluster metadata.
  3. Probes can be overridden by users, or injection can be disabled.
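
Until there is propagation, item 2 can be worked around when generating manifests. The sketch below assumes the RayCluster manifest is loaded as a Python dict with the usual `spec.headGroupSpec.template` and `spec.workerGroupSpecs[].template` layout; `with_annotations` is a hypothetical helper and the annotation key in the usage note is made up.

```python
import copy
from typing import Any, Dict


def with_annotations(ray_cluster: Dict[str, Any],
                     annotations: Dict[str, str]) -> Dict[str, Any]:
    """Return a copy of a RayCluster manifest with `annotations` set on the
    head pod template and on every worker group pod template."""
    out = copy.deepcopy(ray_cluster)
    spec = out.setdefault("spec", {})
    templates = [spec.setdefault("headGroupSpec", {}).setdefault("template", {})]
    for group in spec.get("workerGroupSpecs", []):
        templates.append(group.setdefault("template", {}))
    for template in templates:
        meta = template.setdefault("metadata", {})
        # Existing annotations are kept; new keys are merged in.
        meta.setdefault("annotations", {}).update(annotations)
    return out
```

Calling `with_annotations(manifest, {"example.com/network": "rdma"})` (a hypothetical key) returns a new manifest and leaves the input untouched.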

@Jeffwan
Collaborator Author

Jeffwan commented Mar 2, 2025

vLLM 0.7.3 problem

Image
It hangs for a long time. I checked vllm-project/vllm#13136 and decided to rebuild the image:

FROM vllm/vllm-openai:v0.7.3
RUN pip3 install -U "ray[default,adag]==2.40.0" --progress-bar off # important for future healthcheck
RUN pip3 install -U nvidia-nccl-cu12 --progress-bar off
ENTRYPOINT [""]

Note: in 0.7.3, ray[adag] replaced ray[default]. This causes issues for KubeRay-based deployments because our injected probe uses the agent to check health status. I considered using v0.7.2, but noticed that 0.7.3 brings FlashAttention 3 for MLA optimization, so I am sticking with v0.7.3.

@Jeffwan
Collaborator Author

Jeffwan commented Mar 2, 2025

RDMA setup

From the NCCL logs, we can see that cross-node communication is happening over RDMA (NET/IB with GDRDMA), while intra-node transfers go through P2P/IPC (NVLink in this case; note 'NCCL INFO NVLS multicast support is available').
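
The same check can be done mechanically on the logs. The sketch below is a hypothetical helper, not part of NCCL or vLLM: it scans NCCL `Channel ... via ...` lines and tallies which transport is used within a node and across nodes. It assumes ranks are assigned contiguously with 8 ranks per node, which matches `nRanks 16 nNodes 2 localRanks 8` in the logs.

```python
import re
from collections import Counter
from typing import Iterable

# Matches NCCL channel lines such as:
#   "... NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA"
#   "... NCCL INFO Channel 00/0 : 9[1] -> 10[2] via P2P/IPC"
CHANNEL_RE = re.compile(
    r"NCCL INFO Channel \d+/\d+ : (\d+)\[\d+\] -> (\d+)\[\d+\](?: \[\w+\])? via (\S+)"
)


def transports(lines: Iterable[str], ranks_per_node: int = 8) -> Counter:
    """Count (scope, transport) pairs, where scope is 'intra' or 'cross'."""
    counts: Counter = Counter()
    for line in lines:
        match = CHANNEL_RE.search(line)
        if not match:
            continue  # ring/tree topology lines and truncated lines are skipped
        src, dst, via = int(match.group(1)), int(match.group(2)), match.group(3)
        same_node = src // ranks_per_node == dst // ranks_per_node
        scope = "intra" if same_node else "cross"
        # Collapse "NET/IB/<dev>/GDRDMA" to "NET/IB" so devices group together.
        kind = "NET/IB" if via.startswith("NET/IB") else via
        counts[(scope, kind)] += 1
    return counts
```

If the setup is as described, every `cross` entry should be `NET/IB` and every `intra` entry should be `P2P/IPC`.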

RDMA(RoCE) logs
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Bootstrap: Using eth0:192.168.0.90<0>
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO cudaDriverVersion 12020
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NCCL version 2.25.1+cuda12.2
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NCCL_IB_HCA set to mlx5_
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [4]mlx5_5:1/RoCE [5]mlx5_6:1/RoCE [6]mlx5_7:1/RoCE [7]mlx5_8:1/RoCE [RO]; OOB eth0:192.168.0.90<0>
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Using network IB
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO ncclCommInitRank comm 0xc764960 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId e000 commId 0xd0f99dd1affac83 - Init START
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO RAS client listening socket at ::1<28028>
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Bootstrap timings total 0.090936 (create 0.000030, send 0.000074, recv 0.000036, ring 0.030250, delay 0.000001)
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Setting affinity for GPU 0 to ffff,ffffffff,00000000,0000ffff,ffffffff
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NVLS multicast support is available on dev 0
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO comm 0xc764960 rank 0 nRanks 16 nNodes 2 localRanks 8 localRank 0 MNNVL 0
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NVLS Head  0:  0  8
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NVLS Head  1:  1  9
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NVLS Head  2:  2 10
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NVLS Head  3:  3 11
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NVLS Head  4:  4 12
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NVLS Head  5:  5 13
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NVLS Head  6:  6 14
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NVLS Head  7:  7 15
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 00/16 :  0  7  6  5  4  3  2  1  9 10 11 12 13 14 15  8
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 01/16 :  0  8 15 14 13 12 11 10  9  1  2  3  4  5  6  7
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 02/16 :  0  7  6  5  4  3 11 12 13 14 15  8  9 10  2  1
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 03/16 :  0  1  2 10  9  8 15 14 13 12 11  3  4  5  6  7
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 04/16 :  0  7  6  5 13 14 15  8  9 10 11 12  4  3  2  1
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 05/16 :  0  1  2  3  4 12 11 10  9  8 15 14 13  5  6  7
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 06/16 :  0  7 15  8  9 10 11 12 13 14  6  5  4  3  2  1
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 07/16 :  0  1  2  3  4  5  6 14 13 12 11 10  9  8 15  7
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 08/16 :  0  7  6  5  4  3  2  1  9 10 11 12 13 14 15  8
(RayWorkerWrapper pid=996) INFO 03-02 10:21:47 utils.py:916] Found nccl from library libnccl.so.2
(RayWorkerWrapper pid=996) INFO 03-02 10:21:47 pynccl.py:69] vLLM is using nccl==2.25.1
(RayWorkerWrapper pid=342, ip=192.168.0.83) INFO 03-02 10:21:42 __init__.py:207] Automatically detected platform cuda. [repeated 7x across cluster]
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO cudaDriverVersion 12020
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO Bootstrap: Using eth0:192.168.0.83<0>
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO NCCL version 2.25.1+cuda12.2
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO NET/Plugin: Could not find: libnccl-net.so. Using internal network plugin.
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO NCCL_IB_DISABLE set by environment to 0.
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO NCCL_SOCKET_IFNAME set by environment to eth0
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO NCCL_IB_HCA set to mlx5_1:1,mlx5_2:1,mlx5_3:1,mlx5_4:1,mlx5_5:1,mlx5_6:1,mlx5_7:1,mlx5_8:1
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO NET/IB : Using [0]mlx5_1:1/RoCE [1]mlx5_2:1/RoCE [2]mlx5_3:1/RoCE [3]mlx5_4:1/RoCE [4]mlx5_5:1/RoCE [5]mlx5_6:1/RoCE [6]mlx5_7:1/RoCE [7]mlx5_8:1/RoCE [RO]; OOB eth0:192.168.0.83<0>
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO Using network IB
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO ncclCommInitRank comm 0xde1dae0 rank 9 nranks 16 cudaDev 1 nvmlDev 1 busId 44000 commId 0xd0f99dd1affac83 - Init START
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO RAS client listening socket at ::1<28028>
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO Bootstrap timings total 0.006130 (create 0.000024, send 0.000165, recv 0.000208, ring 0.001345, delay 0.000000)
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO NCCL_CUMEM_ENABLE set by environment to 0.
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO Setting affinity for GPU 1 to ffff,ffffffff,00000000,0000ffff,ffffffff
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO NVLS multicast support is available on dev 1
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO comm 0xde1dae0 rank 9 nRanks 16 nNodes 2 localRanks 8 localRank 1 MNNVL 0
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO Trees [0] 10/-1/-1->9->8 [1] 10/-1/-1->9->1 [2] -1/-1/-1->9->8 [3] 10/-1/-1->9->8 [4] 10/-1/-1->9->8 [5] 10/-1/-1->9->8 [6] 10/-1/-1->9->8 [7] 10/-1/-1->9->8 [8] 10/-1/-1->9->8 [9] 10/1/-1->9->-1 [10] -1/-1/-1->9->8 [11] 10/-1/-1->9->8 [12] 10/-1/-1->9->8 [13] 10/-1/-1->9->8 [14] 10/-1/-1->9->8 [15] 10/-1/-1->9->8
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO P2P Chunksize set to 131072
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:1377 [1] NCCL INFO [Proxy Service] Device 1 CPU core 40
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:1381 [1] NCCL INFO [Proxy Service UDS] Device 1 CPU core 41
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO Channel 00/0 : 9[1] -> 10[2] via P2P/IPC
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO Channel 02/0 : 9[1] -> 10[2] via P2P/IPC
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO Channel 04/0 : 9[1] -> 10[2] via P2P/IPC
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO Channel 06/0 : 9[1] -> 10[2] via P2P/IPC
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO Channel 08/0 : 9[1] -> 10[2] via P2P/IPC
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO Channel 10/0 : 9[1] -> 10[2] via P2P/IPC
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worke
(RayWorkerWrapper pid=335, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-w
(RayWorkerWrapper pid=338, ip=192.168.0.83) d
(RayWorkerWrapper pid=341, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:341:341 [6] NCCL INFO Channel 12/0 : 14[6] -> 15
(RayWorkerWrapper pid=996) deepseek-r1-671b-88957849-q6slh-head-fwl2w:996:996 [3] NCCL INFO NVLS Head  0:  0  8
(RayWorkerWrapper pid=996) deepseek-r1-671b-88957849-q6slh-head-fwl2w:996:996 [3] NCCL INFO NVLS Head  1:  1  9
(RayWorkerWrapper pid=996) deepseek-r1-671b-88957849-q6slh-head-fwl2w:996:996 [3] NCCL INFO NVLS Head  2:  2 10
(RayWorkerWrapper pid=996) deepseek-r1-671b-88957849-q6slh-head-fwl2w:996:996 [3] NCCL INFO NVLS Head  3:  3 11
(RayWorkerWrapper pid=996) deepseek-r1-671b-88957849-q6slh-head-fwl2w:996:996 [3] NCCL INFO NVLS Head  4:  4 12
(RayWorkerWrapper pid=996) deepseek-r1-671b-88957849-q6slh-head-fwl2w:996:996 [3] NCCL INFO NVLS Head  5:  5 13
(RayWorkerWrapper pid=996) deepseek-r1-671b-88957849-q6slh-head-fwl2w:996:996 [3] NCCL INFO NVLS Head  6:  6 14
(RayWorkerWrapper pid=996) deepseek-r1-671b-88957849-q6slh-head-fwl2w:996:996 [3] NCCL INFO NVLS Head  7:  7 15
(RayWorkerWrapper pid=996) deep
(RayWorkerWrapper pid=1015) deepseek-r1-671b-88957849-q6slh-head-fwl2w:1015:21258 [7] NCCL INFO [Proxy Progress] Device 7 CPU core 93
(RayWorkerWrapper pid=1015) deepseek-r1-671b-88957849-q6slh-head-fwl2w:1015:1015 [7] NCCL INFO Channel 07/0 : 15[7] -> 7[7] [receive] via NET/IB/15/GDRDMA
(RayWorkerWrapper pid=1015) deepseek-r1-671b-88957849-q6slh-head-fwl2w:1015:1015 [7] NCCL INFO Channel 15/0 : 15[7] -> 7[7] [receive] via NET/IB/1
(RayWorkerWrapper pid=983) deeps
(RayWorkerWrapper pid=1005) deepseek-r1-671b-88957849-q6slh
(RayWorkerWrapper pid=987) deepseek-r1-671b-88957849-q6slh-head-fwl2w:987:987 [5] NCCL INFO Channel 07/0 : 5[5] -> 6[6] v
(RayWorkerWrapper pid=337, ip=192.168.0.83) deepseek-r1-67
(RayWorkerWrapper pid=340, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:340:340 [7] NCCL INFO Channel 07/0 : 15[7] -> 7[7] [send] via NET/IB/15/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 09/16 :  0  8 15 14 13 12 11 10  9  1  2  3  4  5  6  7
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 10/16 :  0  7  6  5  4  3 11 12 13 14 15  8  9 10  2  1
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 11/16 :  0  1  2 10  9  8 15 14 13 12 11  3  4  5  6  7
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 12/16 :  0  7  6  5 13 14 15  8  9 10 11 12  4  3  2  1
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 13/16 :  0  1  2  3  4 12 11 10  9  8 15 14 13  5  6  7
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 14/16 :  0  7 15  8  9 10 11 12 13 14  6  5  4  3  2  1
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 15/16 :  0  1  2  3  4  5  6 14 13 12 11 10  9  8 15  7
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Trees [0] 1/8/-1->0->-1 [1] -1/-1/-1->0->7 [2] 1/-1/-1->0->7 [3] 1/-1/-1->0->7 [4] 1/-1/-1->0->7 [5] 1/-1/-1->0->7 [6] 1/-1/-1->0->7 [7] 1/-1/-1->0->7 [8] 1/-1/-1->0->8 [9] -1/-1/-1->0->7 [10] 1/-1/-1->0->7 [11] 1/-1/-1->0->7 [12] 1/-1/-1->0->7 [13] 1/-1/-1->0->7 [14] 1/-1/-1->0->7 [15] 1/-1/-1->0->7
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO P2P Chunksize set to 131072
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Check P2P Type intraNodeP2pSupport 1 directMode 0
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:21242 [0] NCCL INFO [Proxy Service] Device 0 CPU core 31
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:21249 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 32
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 00/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 02/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 04/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 06/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 08/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 10/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 12/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 14/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:21256 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 129
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 00/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 08/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 01/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 09/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 01/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 03/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 05/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 07/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 09/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 11/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 13/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 15/0 : 0[0] -> 7[7] via P2P/IPC
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 00/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 08/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Connected all trees
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO NVLS comm 0xc764960 headRank 0 nHeads 8 buffSize 1048576 nvlsPerRankSize 33554432 nvlsTotalSize 268435456
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 01/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 02/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 03/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 04/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 05/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 06/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 07/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 09/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 10/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 11/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 12/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 13/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 14/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 15/0 : 8[0] -> 0[0] [receive] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 02/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 03/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 04/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 05/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 06/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 07/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 10/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 11/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 12/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 13/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 14/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Channel 15/0 : 0[0] -> 8[0] [send] via NET/IB/8/GDRDMA
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Connected NVLS tree
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO CC Off, workFifoBytes 1048576
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO ncclCommInitRank comm 0xc764960 rank 0 nranks 16 cudaDev 0 nvmlDev 0 busId e000 commId 0xd0f99dd1affac83 - Init COMPLETE
deepseek-r1-671b-88957849-q6slh-head-fwl2w:734:734 [0] NCCL INFO Init timings - ncclCommInitRank: rank 0 nranks 16 total 3.08 (kernels 0.36, alloc 0.89, bootstrap 0.09, allgathers 0.01, topo 0.53, graphs 0.01, connections 1.18, rest 0.00)

(RayWorkerWrapper pid=340, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:340:340 [7] NCCL INFO Channel 15/0 : 15[7] -> 7[7] [send] via NET/IB/15/GDRDMA
(RayWorkerWrapper pid=340, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhsk
(RayWorkerWrapper pid=338, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:338:338 [3] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
(RayWorkerWrapper pid=338, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:338:338 [3] NCCL INFO Connected all t
(RayWorkerWrapper pid=342, ip=192.168.0.83) 6] via P2P/IPC
(RayWorkerWrapper pid=342, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:342:342 [5] NCCL IN
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO Connected all trees
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO
(RayWorkerWrapper pid=340, ip=192.168.0.83) deepse
(RayWorkerWrapper pid=996) deepseek-r1-671b-88957849-q6slh-head-fwl2w:996:996 [3] NCCL INFO NVLS comm 0xbb0e900 headRank 3 nHeads 8 buffSize 1048576 nvlsPerRankSize 33554432 nvlsTotalSize 268435456
(RayWorkerWrapper pid=981) deepseek-r1-671b-88957
(RayWorkerWrapper pid=993) deepseek-r1-671b-88957849-q6slh-hea
(RayWorkerWrapper pid=1015) 5/GDRDMA
(RayWorkerWrapper pid=1015) deepseek-r1-671b-88957849-q6slh-head-fwl2w:1015:1015 [7] NCCL INFO Channel 01/0 : 15[7] -> 7[7] [re
(RayWorkerWrapper pid=983) deepseek-r1-671b-
(RayWorkerWrapper pid=1005) deepseek-r1-6
(RayWorkerWrapper pid=987) ia P2P/IPC
(RayWorkerWrapper pid=987) deepseek-r1-671b-88957849-q6slh-head-fwl2w:987:987 [5] NCCL INFO Channel 02/0 : 13[5] -> 5[5] [receive] via N
(RayWorkerWrapper pid=335, ip=192.168.0.83) de
(RayWorkerWrapper pid=341, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-wo
(RayWorkerWrapper pid=339, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-wor
(RayWorkerWrapper pid=336, ip=192.168.0.83) NVLS comm 0xde1dae0 headRank 1 nHeads 8 buffSize 1048576 nvlsPerRankSize 33554432 nvlsTotalSize 268435456
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO Connected NVLS tree
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO threadThresholds 8/8/64 | 128/8/64 | 512 | 512
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 2 p2p channels per peer
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so libnccl-net.so. Using internal tuner plugin.
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO ncclCommInitRank comm 0xde1dae0 rank 9 nranks 16 cudaDev 1 nvmlDev 1 busId 44000 commId 0xd0f99dd1affac83 - Init COMPLETE
(RayWorkerWrapper pid=336, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:336:336 [1] NCCL INFO Init timings - ncclCommInitRank: rank 9 nranks 16 total 2.94 (kernels 0.29, alloc 1.03, bootstrap 0.01, allgathers 0.01, topo 0.54, graphs 0.01, connections 1.06, rest 0.00)
(RayWorkerWrapper pid=337, ip=192.168.0.83) Channel 00/0 : 2[2] -> 10[2] [receive] via NET/IB/10/GDRDMA
(RayWorkerWrapper pid=335, ip=192.168.0.83) deepseek-r1-671b-88957849-q6slh-worker-group-worker-hhskt:335:335 [0] NCCL INFO ncclCommInitRank comm 0xcf3b380 r
(RayWorkerWrapper pid=338, ip=192.168.0.83) rees
(RayWorkerWrapper pid=342, ip=192.168.0.83) FO Connected all trees
WARNING 03-02 10:21:50 custom_all_reduce.py:84] Custom allreduce is disabled because this process group spans across nodes.
INFO 03-02 10:21:50 shm_broadcast.py:258] vLLM message queue communication handle: Handle(connect_ip='192.168.0.90', local_reader_ranks=[1, 2, 3, 4, 5, 6, 7], buffer_handle=(7, 4194304, 6, 'psm_1ee0df8a'), local_subscribe_port=60107, remote_subscribe_port=49929)
(RayWorkerWrapper pid=996) WARNING 03-02 10:21:50 custom_all_reduce.py:84] Custom allreduce is disabled because this process group spans across nodes.
(RayWorkerWrapper pid=1015) ceive] via NET/IB/15/GDRDMA
(RayWorkerWrapper pid=987) ET/IB/13/GDRDMA
(RayWorkerWrapper pid=342, ip=192.168.0.83) INFO 03-02 10:21:44 cuda.py:160] Using Triton MLA backend. [repeated 14x across cluster]
(RayWorkerWrapper pid=335, ip=192.168.0.83) ank 8 nranks 16 cudaDev 0 nvmlDev 0 busId e000 commId 0xd0f99dd1affac83 - Init COMPLETE

Some warning messages:

deepseek-r1-671b-6dc6684dd9-6m8kj-head-vgzzr:734:734 [0] transport/nvls.cc:586 NCCL WARN Cuda failure 1 'invalid argument'

deepseek-r1-671b-6dc6684dd9-6m8kj-head-vgzzr:734:734 [0] transport/nvls.cc:709 NCCL WARN rank 0 failed to NVLS register sendbuff 0x7f252bc00000 sendbuffSize 2097152 recvbuff 0x7f282cc00000 recvbuffSize 2097152

deepseek-r1-671b-6dc6684dd9-6m8kj-head-vgzzr:734:734 [0] transport/nvls.cc:586 NCCL WARN Cuda failure 1 'invalid argument'

deepseek-r1-671b-6dc6684dd9-6m8kj-head-vgzzr:734:734 [0] transport/nvls.cc:709 NCCL WARN rank 0 failed to NVLS register sendbuff 0x7f282cc00000 sendbuffSize 2097152 recvbuff 0x7f282cc00000 recvbuffSize 2097152

deepseek-r1-671b-6dc6684dd9-6m8kj-head-vgzzr:734:734 [0] transport/nvls.cc:586 NCCL WARN Cuda failure 1 'invalid argument'

deepseek-r1-671b-6dc6684dd9-6m8kj-head-vgzzr:734:734 [0] transport/nvls.cc:709 NCCL WARN rank 0 failed to NVLS register sendbuff 0x7f282cc00000 sendbuffSize 2097152 recvbuff 0x7f252bc00000 recvbuffSize 2097152

deepseek-r1-671b-6dc6684dd9-6m8kj-head-vgzzr:734:734 [0] transport/nvls.cc:586 NCCL WARN Cuda failure 1 'invalid argument'

deepseek-r1-671b-6dc6684dd9-6m8kj-head-vgzzr:734:734 [0] transport/nvls.cc:709 NCCL WARN rank 0 failed to NVLS register sendbuff 0x7f282cc00000 sendbuffSize 2097152 recvbuff 0x7f24dbc00000 recvbuffSize 2097152

deepseek-r1-671b-6dc6684dd9-6m8kj-head-vgzzr:734:734 [0] transport/nvls.cc:586 NCCL WARN Cuda failure 1 'invalid argument'

deepseek-r1-671b-6dc6684dd9-6m8kj-head-vgzzr:734:734 [0] transport/nvls.cc:709 NCCL WARN rank 0 failed to NVLS register sendbuff 0x7f252bc00000 sendbuffSize 2097152 recvbuff 0x7f282cc00000 recvbuffSize 2097152

deepseek-r1-671b-6dc6684dd9-6m8kj-head-vgzzr:734:734 [0] transport/nvls.cc:586 NCCL WARN Cuda failure 1 'invalid argument'

deepseek-r1-671b-6dc6684dd9-6m8kj-head-vgzzr:734:734 [0] transport/nvls.cc:709 NCCL WARN rank 0 failed to NVLS register sendbuff 0x7f282cc00000 sendbuffSize 2097152 recvbuff 0x7f282cc00000 recvbuffSize 2097152

@xieus
Collaborator

xieus commented Mar 2, 2025

  • For applications metrics, only head pod which has the apiserver deployed emit the metrics. it has to be consistent with the number of the units.

Thanks @Jeffwan. This is a great feature. One quick question: does the head pod concept refer to the Ray head node (the underlying implementation) or to a broader context?

@Jeffwan
Collaborator Author

Jeffwan commented Mar 3, 2025

@xieus It's specific to the Ray head.

@Jeffwan
Collaborator Author

Jeffwan commented Mar 3, 2025

Autoscaling

Image

NAME                                                          READY   STATUS              RESTARTS   AGE     IP             NODE           NOMINATED NODE   READINESS GATES
deepseek-r1-671b-56f9654bbb-mgdwd-head-lf5xg                  1/1     Running             0          27m     192.168.0.74   192.168.0.51   <none>           <none>
deepseek-r1-671b-56f9654bbb-mgdwd-worker-group-worker-pb4hh   1/1     Running             0          27m     192.168.0.81   192.168.0.52   <none>           <none>

The autoscaler needs minor changes to filter out the worker nodes when it fetches metrics:

E0303 01:10:33.242360       1 kpa.go:256] Failed to get stable and panic metrics for default/deepseek-r1-671b: no data available
E0303 01:10:33.249115       1 controller.go:329] "msg"="Reconciler error" "error"="failed to compute desired number of replicas based on listed metrics for RayClusterFleet/default/deepseek-r1-671b: can not calculate metrics for scale deepseek-r1-671b" "PodAutoscaler"={"name":"deepseek-r1-671b-autoscaling","namespace":"default"} "controller"="podautoscaler" "controllerGroup"="autoscaling.aibrix.ai" "controllerKind"="PodAutoscaler" "name"="deepseek-r1-671b-autoscaling" "namespace"="default" "reconcileID"="432ed9d8-f944-47f8-9975-047731c77ebf"
E0303 01:13:33.242425       1 controller.go:329] "msg"="Reconciler error" "error"="failed to update metrics for scale target reference: failed to fetch metrics from source http://192.168.0.84:8000/metrics: Get \"http://192.168.0.84:8000/metrics\": dial tcp 192.168.0.84:8000: connect: connection refused" "PodAutoscaler"={"name":"deepseek-r1-671b-autoscaling","namespace":"default"} "controller"="podautoscaler" "controllerGroup"="autoscaling.aibrix.ai" "controllerKind"="PodAutoscaler" "name"="deepseek-r1-671b-autoscaling" "namespace"="default" "reconcileID"="d308d1c6-432f-491b-b192-33619c952e3a"
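
A minimal sketch of the filtering idea, not the actual controller code: keep only running head pods before building the list of metrics endpoints. The `ray.io/node-type` label, the dict-shaped pod records, the labels on the sample pods, and the helper name `select_metric_targets` are all assumptions for illustration. Port 8000 is taken from the error above.

```python
from typing import Dict, List

# Assumption: KubeRay-style label; verify against the labels on your own pods.
NODE_TYPE_LABEL = "ray.io/node-type"


def select_metric_targets(pods: List[Dict], port: int = 8000) -> List[str]:
    """Return metrics URLs for running head pods only."""
    urls = []
    for pod in pods:
        if pod.get("labels", {}).get(NODE_TYPE_LABEL) != "head":
            continue  # skip worker pods: only the head runs the API server
        if pod.get("phase") != "Running" or not pod.get("ip"):
            continue  # skip pods that are not ready to be scraped
        urls.append(f"http://{pod['ip']}:{port}/metrics")
    return urls


# Pod names and IPs from the listing above; the labels are assumed.
pods = [
    {"name": "deepseek-r1-671b-56f9654bbb-mgdwd-head-lf5xg",
     "ip": "192.168.0.74", "phase": "Running",
     "labels": {NODE_TYPE_LABEL: "head"}},
    {"name": "deepseek-r1-671b-56f9654bbb-mgdwd-worker-group-worker-pb4hh",
     "ip": "192.168.0.81", "phase": "Running",
     "labels": {NODE_TYPE_LABEL: "worker"}},
]
print(select_metric_targets(pods))  # ['http://192.168.0.74:8000/metrics']
```

With this filter, each multi-node replica contributes exactly one metrics source, so the number of sources matches the number of scaling units.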
