@TianyuZhang1214 (Contributor) commented Oct 20, 2025

Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices

Introduction

We published an article on the LMSYS blog titled "Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G", sharing our best practices for deploying the DeepSeek-R1 model on H20-96G hardware.
To make our experimental results easy to reproduce and to provide access to our code, we have opened this pull request in the DeepSeek-R1 repository.

Reproduction Steps

Pulling the Docker Image

To obtain the Docker image, use the following command:

docker pull ghcr.io/antgroup/sglang:h20-blog-release

The image is hosted at: https://github.com/orgs/antgroup/packages/container/package/sglang

Checking Environment Variables

All environment variables are stored in the /root/env.sh file, configured for our H20 environment. Before launching SGLang, verify that these variables are suitable for your environment.
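For example, you can review the preconfigured settings before starting anything. This is only a sketch: the container name sglang-node is a placeholder for whatever you name your container.

# Print the preconfigured environment variables from inside a running container.
# "sglang-node" is a hypothetical container name; substitute your own.
docker exec sglang-node cat /root/env.sh

# Inside the container, source the file so the variables take effect in the
# shell session from which you launch SGLang.
source /root/env.sh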

Launching SGLang

We recommend running four containers: two for Prefill nodes and two for Decode nodes.
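A minimal sketch for starting one such container is shown below. The volume mounts, shared-memory size, and container name are assumptions and should be adapted to your cluster; host networking and host IPC are typically needed for the RDMA transfer backend and multi-node communication. The launch commands in the following sections are then run inside each container (e.g., via docker exec).

# Hypothetical example: start one SGLang container on a node.
# The mounts, shm size, and container name are assumptions; adjust as needed.
# "sleep infinity" keeps the container alive, assuming the image accepts a
# command override.
docker run -d --name sglang-prefill-0 \
  --gpus all \
  --network host \
  --ipc host \
  --privileged \
  --shm-size 32g \
  -v /path/to/DeepSeek-R1:/path/to/DeepSeek-R1 \
  -v /home/admin/logs:/home/admin/logs \
  ghcr.io/antgroup/sglang:h20-blog-release \
  sleep infinity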

1. Launching Prefill Nodes (Identical Configuration for Both Nodes)

Note:

  • Both Prefill nodes use the same launch parameters.
  • Adjust the port number if there is a conflict.
PYTHONUNBUFFERED=1 \
SGL_CHUNKED_PREFIX_CACHE_THRESHOLD=0 \
nohup python3 -m sglang.launch_server \
--trust-remote-code \
--model-path /path/to/DeepSeek-R1 \
--disaggregation-mode prefill \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
--host 0.0.0.0 \
--port 61001 \
--tp-size 8 \
--page-size 64 \
--attention-backend fa3 \
--mem-fraction-static 0.9 \
--chunked-prefill-size 16384 \
--max-running-requests 512 \
--context-length 65535 \
--enable-cache-report \
--log-level info \
--load-balance-method round_robin \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
> /home/admin/logs/stdout.log 2>&1 &
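After the command returns, you can confirm that the prefill server is up before wiring in the router. This sketch assumes SGLang's standard /health endpoint and the port configured above; run it on the prefill node itself or substitute the node's IP.

# Basic liveness check against a prefill node (port 61001 as configured above).
curl -s http://localhost:61001/health
# Startup progress (weight loading, graph capture) can be followed in the log:
tail -f /home/admin/logs/stdout.log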

2. Launching Decode Nodes

Note:

  • Set {node_rank} to 0 or 1 for the respective node.
  • Replace {decode_master_ip} with the IP address of Node 0.
  • Adjust the port number if there is a conflict.
PYTHONUNBUFFERED=1 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=64 \
ENABLE_SWAPAB=1 \
nohup python3 -m sglang.launch_server \
--model-path /path/to/DeepSeek-R1 \
--disaggregation-mode decode \
--disaggregation-transfer-backend mooncake \
--disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
--disaggregation-bootstrap-port 9000 \
--attention-backend flashmla \
--host 0.0.0.0 \
--port 61001 \
--trust-remote-code \
--dist-init-addr {decode_master_ip}:62001 \
--nnodes 2 \
--node-rank {node_rank} \
--tp-size 16 \
--dp-size 16 \
--enable-dp-attention \
--mem-fraction-static 0.88 \
--max-running-requests 512 \
--context-length 65535 \
--log-level info \
--decode-log-interval 50 \
--page-size 64 \
--schedule-conservativeness 0.3 \
--enable-cache-report \
--moe-dense-tp-size 1 \
--enable-deepep-moe \
--enable-dp-lm-head \
--cuda-graph-max-bs 32 \
--speculative-algorithm NEXTN \
--speculative-num-steps 1 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 2 \
--init-expert-location /root/expert_workload.json \
--prefill-round-robin-balance \
--quantization fp8 \
--kv-cache-dtype fp8_e4m3 \
--deepep-mode low_latency_overlap \
--enable-single-batch-overlap \
> /home/admin/logs/stdout.log 2>&1 &
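Once both decode nodes have started and joined the group, the same kind of liveness check can be run against the master node (again assuming the standard /health endpoint).

# Check the decode master (node 0) once both ranks have come up.
curl -s http://{decode_master_ip}:61001/health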

3. Launching SGLang Router

Note:

  • Replace {decode_master_ip}, {prefill_node_0_ip}, and {prefill_node_1_ip} with the respective IP addresses.
  • Adjust the port number if there is a conflict.
nohup python3 -m sglang_router.launch_router \
--pd-disaggregation \
--mini-lb \
--host 0.0.0.0 \
--decode http://{decode_master_ip}:61001 \
--port 8000 \
--prefill http://{prefill_node_0_ip}:61001 \
--prefill http://{prefill_node_1_ip}:61001 \
> /home/admin/logs/router.log 2>&1 &
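Before running the full benchmark, a single request through the router is a quick end-to-end sanity check of the prefill-to-decode path. The sketch below uses SGLang's native /generate endpoint via the router on port 8000; the prompt and sampling parameters are arbitrary placeholders.

# Send one request through the router to exercise prefill + decode end to end.
curl -s http://localhost:8000/generate \
  -H "Content-Type: application/json" \
  -d '{
        "text": "Explain KV-cache disaggregation in one sentence.",
        "sampling_params": {"max_new_tokens": 64, "temperature": 0.6}
      }'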

Testing

1. Running the Benchmark

Note:

  • This script is intended for observing peak performance in the logs. Because --request-rate is set to inf, all requests are sent at once, so the reported TTFT and TPOT figures are not very meaningful.
  • Replace {path-to-shareGPT} with the path to the ShareGPT dataset.
nohup python3 -m sglang.bench_serving \
--host 0.0.0.0 \
--port 8000 \
--dataset-path {path-to-shareGPT} \
--num-prompt 4096 \
--random-input 4096 \
--random-output 1536 \
--request-rate "inf" \
--max-concurrency 2048 \
--warmup-requests 0 \
--backend sglang \
--dataset-name random \
--random-range-ratio 1 \
> /home/local/workspace/bench.log 2>&1 &

2. Observing Logs

To monitor peak performance, filter the decode logs for entries with #running-req: 32 (with --max-running-requests 512 spread across the 16 DP ranks, 32 requests per rank corresponds to a fully loaded decode batch):

grep -E 'Decode batch.*running-req: 32' /home/admin/logs/sglang.log

Example Output (for batch size = 32):

2025-10-20 03:02:22 INFO 31223 [DP3 TP3 EP3 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 32, #token: 157952, token usage: 0.21, accept len: 1.93, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 693.45, #queue-req: 0
2025-10-20 03:02:22 INFO 31225 [DP5 TP5 EP5 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 32, #token: 164224, token usage: 0.22, accept len: 1.92, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 674.19, #queue-req: 1
2025-10-20 03:02:22 INFO 31222 [DP2 TP2 EP2 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 32, #token: 162112, token usage: 0.22, accept len: 1.90, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 655.17, #queue-req: 1
2025-10-20 03:02:22 INFO 31224 [DP4 TP4 EP4 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 32, #token: 168768, token usage: 0.22, accept len: 1.93, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 679.00, #queue-req: 2
2025-10-20 03:02:22 INFO 31227 [DP7 TP7 EP7 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 32, #token: 157696, token usage: 0.21, accept len: 1.92, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 673.31, #queue-req: 0
2025-10-20 03:02:26 INFO 31222 [DP2 TP2 EP2 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 32, #token: 159488, token usage: 0.21, accept len: 1.92, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 679.66, #queue-req: 0
2025-10-20 03:02:27 INFO 31224 [DP4 TP4 EP4 scheduler_metrics_mixin.py:222] Decode batch. #running-req: 32, #token: 160320, token usage: 0.21, accept len: 1.94, pre-allocated usage: 0.00, #retracted-req: 0, cuda graph: True, gen throughput (token/s): 673.26, #queue-req: 0
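Each line above reports the generation throughput of a single DP rank, so aggregate decode throughput is roughly the per-rank value multiplied by the 16 DP ranks configured above. A small shell sketch for summarizing such a log slice, assuming the exact log format shown here:

# Extract per-rank throughput values at peak load and print the sample count,
# the mean per-rank throughput, and a rough aggregate estimate (mean x 16 DP ranks).
grep -E 'Decode batch.*running-req: 32' /home/admin/logs/sglang.log \
  | grep -oE 'gen throughput \(token/s\): [0-9.]+' \
  | awk '{sum += $NF; n++} END {if (n) printf "samples: %d  mean per-rank: %.2f tok/s  est. aggregate (x16 DP ranks): %.2f tok/s\n", n, sum/n, 16*sum/n}'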

Related PRs

