[Don't merge] Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices #4
Conversation
…when seq_lens is small
Reviewer's Guide
This PR implements a new one-shot multi-head attention mode for DeepSeek-V2, enriches the fused MoE Triton kernels with descriptor/TMA/filtering support, introduces Triton-based KV buffer operations in the memory pool and utils, updates config generation for down-MoE scenarios, and adds a comprehensive benchmark/tuning script for the fused MoE kernels.
Sequence diagram for one-shot MHA attention path in DeepSeek-V2
```mermaid
sequenceDiagram
participant FB as ForwardBatch
participant Attn as DeepseekV2AttentionMLA
participant KVPool as MLATokenToKVPool
FB->>Attn: forward_prepare(...)
Attn->>FB: _support_mha_one_shot(...)
alt MHA_ONE_SHOT supported
Attn->>Attn: forward_normal_one_shot_prepare(...)
Attn->>FB: fetch_mha_one_shot_kv_indices()
Attn->>KVPool: get_mla_kv_buffer(...)
KVPool-->>Attn: (kv_a, k_pe)
Attn->>Attn: forward_normal_one_shot_core(...)
else fallback
Attn->>Attn: forward_normal_chunked_kv_prepare(...)
end
```
Sequence diagram for fused MoE Triton kernel invocation with TMA/descriptor support
```mermaid
sequenceDiagram
participant Worker as BenchmarkWorker
participant FusedMoE as FusedMoE
participant Kernel as TritonKernel
Worker->>FusedMoE: benchmark(...)
FusedMoE->>Kernel: invoke_fused_moe_kernel(..., a_desc, b_desc, filter_expert)
Kernel-->>FusedMoE: (results)
FusedMoE-->>Worker: (latency results)
```
Class diagram for new and updated DeepSeek-V2 attention and MoE classes
```mermaid
classDiagram
class AttnForwardMethod {
+MHA_CHUNKED_KV
+MHA_ONE_SHOT
+MLA_FUSED_ROPE
}
class DeepseekV2AttentionMLA {
+kv_cache_dtype
+forward_normal_one_shot_prepare()
+forward_normal_one_shot_core()
+_set_mla_kv_buffer()
+_get_mla_kv_buffer()
+_concat_and_cast_mha_k()
}
class ForwardBatch {
+mha_one_shot_kv_indices
+mha_one_shot
+fetch_mha_one_shot_kv_indices()
}
class MLATokenToKVPool {
+get_mla_kv_buffer()
}
AttnForwardMethod <|-- DeepseekV2AttentionMLA
DeepseekV2AttentionMLA <.. ForwardBatch
ForwardBatch <.. MLATokenToKVPool
```
Class diagram for Fused MoE Triton kernel and config changes
```mermaid
classDiagram
class BenchmarkWorker {
+benchmark()
+tune()
}
class BestConfigTrace {
+update()
+total_time
+config_dict()
}
class MoeRunnerConfig {
+inplace
+num_experts
+num_local_experts
}
class FusedMoE {
+fused_experts_impl(..., filter_expert)
}
class FusedMoEConfig {
+get_config_file_name(..., down_moe)
+get_moe_configs(..., down_moe)
+try_get_optimal_moe_config(..., return_down_config)
}
BenchmarkWorker <.. BestConfigTrace
FusedMoEConfig <.. FusedMoE
```
Sorry to bother you, is this optimization only for the DeepSeek-R1 model? Is the DeepSeek-V3 model also available? |
Yes, V3 is also available. |
Thanks for your reply. When I tried to reproduce your work, it reported a problem with the DeepGEMM library. Could you tell me the link to the DeepGEMM library you use? In this repository, I tried the |
Hello, I have successfully run it according to your configuration, and the performance is very good. However, there is an issue: during the operation, the sglang logs are not printed. Even when I set |
The log is redirected to |
Could you please tell me if you used a Docker image or compiled from source code to get it to work successfully? I tried compiling from source code but failed. |
Please use the Docker image. The features from FlashMLA-FP8, DeepEP, and DeepGEMM are still under review and require compilation from source. For DeepGEMM, you must first merge the code from PR#183 and PR#192 yourself, which can be complex until they are integrated into the main branch. The Docker image simplifies this process. |
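If you do want to build DeepGEMM from source anyway, one common way to pull those two PRs into a local checkout is sketched below; the upstream URL and local branch names are assumptions, and conflicts may still need manual resolution:

```bash
# Sketch only: fetch the two DeepGEMM PRs via GitHub's pull refs and merge them locally.
git clone https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
git fetch origin pull/183/head:pr-183 pull/192/head:pr-192
git merge pr-183
git merge pr-192   # resolve any conflicts by hand before building
```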
I’d like to reproduce the final performance reported in the blog (each node achieves 16.5k input tokens per second and 5.7k output tokens per second on 4096-token input sequences). How should I do that? I only found scripts for peak performance testing—what command did you use to benchmark the metrics shown in the blog? Could you share it? Thanks. |
Thanks so much. I found the redirected log file at |
I’m deploying via Docker images and it’s running fine. If you manage to compile from source and run it successfully, I’d appreciate it if you could share your steps. |
To reproduce the performance metrics from the blog (16.5k input tokens/s and 5.7k output tokens/s for 4096-token sequences), follow the setup in the |
Do you mean using
Also, when you say
I only recently started trying to reproduce DeepSeek's performance under PD separation, so if I've misunderstood any of the fundamentals, please feel free to correct me. |
I reread your reply and would like to restate my updated understanding. |
This test requires 4 nodes with 8 H20 GPUs each (4×8 H20). |
I reproduced the following results on three H20 96G nodes (P1D2) using the command you provided. Does this outcome meet expectations? Does this result mean that, across 2 nodes, the output throughput is
Updated: |
I can’t find |
@zheng1
Yes, the decode output is as expected, congratulations!
The decode output may vary across DP ranks; consider 11,646.24 tokens as an estimated value.
The
Yes, you're using only 1 Prefill node, so the pressure on Decode is insufficient. Consider using 2 Prefill nodes. |
Thanks for your reply. I merged based on the suggested PRs, but I'm still encountering many problems, especially with the SBO and SwapAB GEMM features. My installation process is as follows:
FP8 MLA:
DeepEP (NVSHMEM has been installed):
DeepGEMM:
In fact, even after the above steps, there's still a long way to go before successfully starting the service. Are there any other pull requests that need to be merged? |
Would you mind testing with the Docker image directly? |
Understood, I will find a way to try your docker image. |
stdout.log |
Could you please send me the operating steps? |
Hi @TianyuZhang1214, the process is killed immediately, and I'm not sure how to troubleshoot it. I checked dmesg, but there's nothing useful there either. Have you ever encountered a problem like this before? |
@yangzhipeng1108 By the way, all environment variables are defined in |
All Hopper GPUs are supported. H20 is recommended for our optimizations.
No, we haven't. We’ve only tested on Hopper GPUs—sorry about that. |
My environment is:
Does this patch require this specific driver version, or do I need to upgrade the driver? |
Would you mind disclosing the P and D node deployment parameters and test scripts for the base (BF16+MTP) from the article Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G? This is very important to me. |
sudo docker run |
All relevant details are included in this PR. Please refer to the Launching SGLang and Testing sections in the PR description. |
We're sorry, but we haven't encountered this error before. You may need to troubleshoot and resolve it on your own. |
Deploying DeepSeek-R1 on H20-96G with SGLang: Best Practices
Introduction
We published an article on LMSYS titled "Together with SGLang: Best Practices for Serving DeepSeek-R1 on H20-96G", sharing our best practices for deploying the DeepSeek-R1 model on H20-96G hardware.
To facilitate reproduction of our experimental results and provide access to our code, we have released this pull request in the DeepSeek-R1 repository.
Reproduction Steps
Pulling the Docker Image
To obtain the Docker image, use the following command:
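A minimal sketch, assuming the image name from the package page below and the default tag (check the package page for the exact tag):

```bash
docker pull ghcr.io/antgroup/sglang:latest   # the tag is an assumption
```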
The image is hosted at: https://github.com/orgs/antgroup/packages/container/package/sglang
Checking Environment Variables
All environment variables are stored in the /root/env.sh file, configured for our H20 environment. Before launching SGLang, verify that these variables are suitable for your environment.
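For example, you can dump the preset values straight from the image before starting any containers; the image tag and the use of an entrypoint override are assumptions:

```bash
# Print the preset environment variables shipped in the image.
docker run --rm --entrypoint cat ghcr.io/antgroup/sglang:latest /root/env.sh
```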
Launching SGLang
We recommend running four containers: two for Prefill nodes and two for Decode nodes.
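A minimal sketch for starting one such container, assuming host networking, shared IPC, privileged access for the RDMA devices, and a local model directory mounted into the container; the container name, image tag, and paths are placeholders:

```bash
# Sketch only: container name, image tag, and mount paths are placeholders.
sudo docker run -d --name sglang-node0 \
  --gpus all \
  --network=host \
  --ipc=host \
  --privileged \
  --shm-size 32g \
  -v /data/DeepSeek-R1:/path/to/DeepSeek-R1 \
  ghcr.io/antgroup/sglang:latest \
  sleep infinity   # keep the container alive; adjust if the image defines its own entrypoint
```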
1. Launching Prefill Nodes (Identical Configuration for Both Nodes)
Note:
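Purely as an unverified sketch (not the authors' exact command), a prefill node could be launched along these lines, where --tp-size 8 and --chunked-prefill-size 16384 are inferred from the profiling section and the remaining flags mirror the decode command below:

```bash
# Unverified sketch of a prefill-node launch; flag values are assumptions.
PYTHONUNBUFFERED=1 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
nohup python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-R1 \
  --disaggregation-mode prefill \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
  --disaggregation-bootstrap-port 9000 \
  --attention-backend flashmla \
  --host 0.0.0.0 \
  --port 61001 \
  --trust-remote-code \
  --tp-size 8 \
  --chunked-prefill-size 16384 \
  --page-size 64 \
  --quantization fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --moe-a2a-backend deepep \
  > /home/admin/logs/stdout.log 2>&1 &
```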
2. Launching Decode Nodes
Note:
Set {node_rank} to 0 or 1 for the respective node. Replace {decode_master_ip} with the IP address of Node 0.
Node-0
```bash
PYTHONUNBUFFERED=1 \
SGL_ENABLE_JIT_DEEPGEMM=1 \
SGLANG_DEEPEP_NUM_MAX_DISPATCH_TOKENS_PER_RANK=96 \
ENABLE_SWAPAB=1 \
nohup python3 -m sglang.launch_server \
  --model-path /path/to/DeepSeek-R1 \
  --disaggregation-mode decode \
  --disaggregation-transfer-backend mooncake \
  --disaggregation-ib-device mlx5_bond_0,mlx5_bond_1,mlx5_bond_2,mlx5_bond_3 \
  --disaggregation-bootstrap-port 9000 \
  --attention-backend flashmla \
  --host 0.0.0.0 \
  --port 61001 \
  --trust-remote-code \
  --dist-init-addr {decode_master_ip}:62001 \
  --nnodes 2 \
  --node-rank {node_rank} \
  --tp-size 16 \
  --dp-size 16 \
  --enable-dp-attention \
  --mem-fraction-static 0.88 \
  --max-running-requests 768 \
  --context-length 65535 \
  --log-level info \
  --decode-log-interval 50 \
  --page-size 64 \
  --schedule-conservativeness 0.3 \
  --enable-cache-report \
  --moe-dense-tp-size 1 \
  --enable-deepep-moe \
  --enable-dp-lm-head \
  --cuda-graph-max-bs 48 \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 1 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 2 \
  --init-expert-location /root/expert_workload.json \
  --prefill-round-robin-balance \
  --quantization fp8 \
  --kv-cache-dtype fp8_e4m3 \
  --moe-a2a-backend deepep \
  --deepep-mode low_latency_overlap \
  --enable-single-batch-overlap \
  > /home/admin/logs/stdout.log 2>&1 &
```
3. Launching SGLang Router
Note:
Replace {decode_master_ip}, {prefill_node_0_ip}, and {prefill_node_1_ip} with the respective IP addresses.
```bash
nohup python3 -m sglang_router.launch_router \
  --pd-disaggregation \
  --mini-lb \
  --host 0.0.0.0 \
  --decode http://{decode_master_ip}:61001 \
  --port 8000 \
  --prefill http://{prefill_node_0_ip}:61001 \
  --prefill http://{prefill_node_1_ip}:61001 \
  > /home/admin/logs/router.log 2>&1 &
```
Testing
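Before running the full benchmark, you can sanity-check the deployment with a single request. The sketch below assumes the router forwards SGLang's native /generate endpoint and listens on {router_ip}:8000 (placeholder):

```bash
curl -s http://{router_ip}:8000/generate \
  -H 'Content-Type: application/json' \
  -d '{"text": "Hello, who are you?", "sampling_params": {"max_new_tokens": 16, "temperature": 0}}'
```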
1. Running the Benchmark
Note:
Since --request-rate is set to inf, all requests are sent at once, making TTFT and TPOT data less meaningful. Replace {path-to-shareGPT} with the path to the ShareGPT dataset.
```bash
nohup python3 -m sglang.bench_serving \
  --host 0.0.0.0 \
  --port 8000 \
  --dataset-path {path-to-shareGPT} \
  --num-prompt 4096 \
  --random-input 4096 \
  --random-output 1536 \
  --request-rate "inf" \
  --max-concurrency 2048 \
  --warmup-requests 0 \
  --backend sglang \
  --dataset-name random \
  --random-range-ratio 1 \
  > /home/local/workspace/bench.log 2>&1 &
```
2. Observing Logs
To monitor peak performance, filter logs for entries with running-req: 48:
```bash
grep -E 'Decode batch.*running-req: 48' /home/admin/logs/sglang.log
```
Example Output (for batch size = 48):
Related PRs
Profiling
Open the following links and view the profiling files in Perfetto:
Prefill
Input=4K, chunked-prefill-size=16384: h20_blog_prefill_tp8_input_4k.json.gz
Decode
running-req: 48: h20_blog_decode_ep16_bs48.json.gz
running-req: 32: h20_blog_decode_ep16_bs32.json.gz