diff --git a/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md b/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md
new file mode 100644
index 0000000..fa68935
--- /dev/null
+++ b/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md
@@ -0,0 +1,88 @@
+# Day 6: One More Thing, DeepSeek-V3/R1 Inference System Overview
+## System Design Principles
+The optimization objectives of serving DeepSeek-V3/R1 inference are: **higher throughput and lower latency.**
+
+To optimize these two objectives, our solution employs cross-node Expert Parallelism (EP).
+- First, EP significantly scales the batch size, enhancing GPU matrix computation efficiency and boosting throughput.
+- Second, EP distributes experts across GPUs, with each GPU processing only a small subset of experts (reducing memory access demands), thereby lowering latency.
+
+However, EP increases system complexity, primarily in two aspects:
+ 1. EP introduces cross-node communication. To optimize throughput, appropriate computational workflows must be designed to overlap communication with computation.
+ 2. EP involves multiple nodes, thereby inherently requiring Data Parallelism (DP) and necessitating load balancing between different DP instances.
+
+This article focuses on how we address these challenges by:
+- leveraging EP to scale the batch size,
+- hiding communication latency behind computation, and
+- performing load balancing.
+
+### Large-scale Cross-node Expert Parallelism (EP)
+Due to the large number of experts in DeepSeek-V3/R1—where only 8 out of 256 experts per layer are activated—the model’s high sparsity necessitates an extremely large overall batch size. This ensures a sufficient batch size per expert, enabling higher throughput and lower latency. Large-scale cross-node EP is essential.
+
+As we have adopted a prefill-decode disaggregation architecture, we employ different degrees of parallelism during the prefill and decode phases:
+- **Prefilling Phase [Routed Expert EP32, MLA/Shared Expert DP32]**: Each deployment unit spans 4 nodes (32 GPUs) with 32 redundant routed experts, so each GPU handles 9 routed experts (256 + 32 = 288 routed experts spread over 32 GPUs) and 1 shared expert.
+- **Decoding Phase [Routed Expert EP144, MLA/Shared Expert DP144]**: Each deployment unit spans 18 nodes (144 GPUs) with 32 redundant routed experts, so each GPU manages 2 routed experts (288 routed experts spread over 144 GPUs) and 1 shared expert.
+
+### Computation-Communication Overlapping
+Large-scale cross-node EP introduces significant communication overhead. To mitigate this, we employ a dual-batch overlap strategy, splitting a batch of requests into two microbatches to hide communication costs and improve overall throughput.
+During the prefilling phase, the two microbatches are executed alternately, and the communication cost of one microbatch is hidden behind the computation of the other.
+
+![Communication-Computation Overlapping during Prefilling Phase.png](figures/Communication-Computation%20Overlapping%20during%20Prefilling%20Phase.png)
+*Communication-Computation Overlapping during Prefilling Phase*
+
+During the decoding phase, the execution durations of the different stages are unbalanced. Hence, we subdivide the attention layer into two steps and use a 5-stage pipeline to achieve seamless communication-computation overlapping.
+
+![Communication-Computation Overlapping during Decoding Phase.png](figures/Communication-Computation%20Overlapping%20during%20Decoding%20Phase.png)
+*Communication-Computation Overlapping during Decoding Phase*
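+
+To make the scheduling idea concrete, below is a minimal, purely illustrative sketch of a dual-batch schedule. It is a toy model of the overlap pattern only — the function and slot names are invented for illustration and are not our actual kernels or stream management:
+
+```python
+# Toy illustration of dual-batch overlap: a batch is split into two
+# microbatches so that while one microbatch occupies the GPU with
+# computation, the other's cross-node all-to-all (expert dispatch /
+# combine) is in flight. Schematic only; all names are invented.
+
+NUM_LAYERS = 3  # toy depth for illustration
+
+def dual_batch_schedule(num_layers):
+    """Return (compute_slot, communication_slot) pairs that run concurrently."""
+    schedule = []
+    for layer in range(num_layers):
+        # mb0 computes this layer while mb1's dispatch/combine is on
+        # the wire, then the two microbatches swap roles.
+        schedule.append((f"mb0: compute layer {layer}",
+                         f"mb1: dispatch/combine layer {layer}"))
+        schedule.append((f"mb1: compute layer {layer}",
+                         f"mb0: dispatch/combine layer {layer}"))
+    return schedule
+
+for compute_slot, comm_slot in dual_batch_schedule(NUM_LAYERS):
+    print(f"{compute_slot:24} || {comm_slot}")
+```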
+
+More details about our communication-computation overlapping mechanism can be found at https://github.com/deepseek-ai/profile-data.
+
+### Achieving Optimal Load Balancing
+Large-scale parallelism (including DP and EP) introduces a critical challenge: if a single GPU is overloaded with computation or communication, it becomes a performance bottleneck, slowing the entire system while leaving other GPUs idle. To maximize resource utilization, we strive to balance computational and communication loads across all GPUs.
+
+#### 1. Prefill Load Balancer
+ - Key Issue: Varying request counts and sequence lengths across DP instances lead to imbalanced core-attention computation and dispatch send load.
+ - Optimization Objectives:
+    - Balance core-attention computation across GPUs (core-attention computational load balancing).
+    - Equalize input token counts per GPU (dispatch send load balancing), preventing prolonged processing on specific GPUs.
+#### 2. Decode Load Balancer
+ - Key Issue: Uneven request counts and sequence lengths across DP instances cause disparities in core-attention computation (linked to KVCache usage) and dispatch send load.
+ - Optimization Objectives:
+    - Balance KVCache usage across GPUs (core-attention computational load balancing).
+    - Equalize request counts per GPU (dispatch send load balancing).
+#### 3. Expert-Parallel Load Balancer
+ - Key Issue: For a given MoE model, there exist inherently high-load experts, resulting in an imbalance in expert computational workloads across different GPUs.
+ - Optimization Objective:
+    - Balance expert computation on each GPU (i.e., minimize the maximum dispatch receive load across all GPUs).
+
+### Diagram of DeepSeek's Online Inference System
+![Diagram of DeepSeek's Online Inference System.jpg](figures/Diagram%20of%20DeepSeek%27s%20Online%20Inference%20System.jpg)
+*Diagram of DeepSeek's Online Inference System*
+
+### Statistics of DeepSeek's Online Service
+All DeepSeek-V3/R1 inference services are served on H800 GPUs with precision consistent with training.
+Specifically, matrix multiplications and dispatch transmissions adopt the FP8 format aligned with training,
+while core MLA computations and combine transmissions use the BF16 format, ensuring optimal service performance.
+
+Additionally, because service load is high during the day and low at night, we implemented a mechanism to deploy inference services across all nodes during peak daytime hours.
+During low-load nighttime periods, we reduce the number of inference nodes and allocate the freed resources to research and training.
+Over the past 24 hours (UTC+8 02/27/2025 12:00 PM to 02/28/2025 12:00 PM), the combined peak node occupancy for V3 and R1 inference services reached 278, with an average occupancy of 226.75 nodes (each node contains 8 H800 GPUs).
+Assuming the leasing cost of one H800 GPU is $2 per hour, the total daily cost amounts to $87,072 (226.75 nodes × 8 GPUs × $2/GPU-hour × 24 hours).
+
+![H800 Node Count For Inference Service.jpg](figures/H800%20Node%20Count%20For%20Inference%20Service.jpg)
+*H800 Node Count For Inference Service*
+
+Within the 24-hour statistical period (UTC+8 02/27/2025 12:00 PM to 02/28/2025 12:00 PM), V3 and R1 served:
+- Total input tokens: 608B, of which 342B tokens (56.3%) hit the on-disk KV cache.
+- Total output tokens: 168B. The average output speed was 20–22 tokens per second, and the average KVCache length per output token was 4,989 tokens.
+- Each H800 node delivered an average throughput of ~73.7k tokens/s input (including cache hits) during prefilling or ~14.8k tokens/s output during decoding.
+
+The above statistics include all user requests from web, APP, and API. If all tokens were billed at DeepSeek-R1’s pricing (*), the total daily revenue would be $562,027, with a cost profit margin of 545%.
+
+_(*) R1 Pricing: \$0.14/M input tokens (cache hit), \$0.55/M input tokens (cache miss), \$2.19/M output tokens._
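+
+As a sanity check, these figures can be recomputed from the numbers published above. A minimal sketch (assuming the $2/hour H800 leasing cost stated earlier; the small gap versus the quoted $562,027 comes from rounding of the published token totals):
+
+```python
+# Recompute the daily cost, theoretical revenue, and margin from the
+# figures quoted in this article. Token counts are rounded (608B / 342B
+# / 168B), so revenue differs slightly from the quoted $562,027.
+
+GPUS_PER_NODE = 8
+GPU_HOURLY_COST = 2.00        # assumed H800 leasing cost, $/GPU-hour
+AVG_NODES = 226.75            # average node occupancy over 24 hours
+
+daily_cost = AVG_NODES * GPUS_PER_NODE * GPU_HOURLY_COST * 24
+print(f"daily cost:          ${daily_cost:,.0f}")   # $87,072
+
+# R1 pricing, $ per million tokens
+PRICE_HIT, PRICE_MISS, PRICE_OUTPUT = 0.14, 0.55, 2.19
+
+input_tokens, hit_tokens, output_tokens = 608e9, 342e9, 168e9
+revenue = (hit_tokens * PRICE_HIT
+           + (input_tokens - hit_tokens) * PRICE_MISS
+           + output_tokens * PRICE_OUTPUT) / 1e6
+print(f"theoretical revenue: ${revenue:,.0f}")      # ~$562,100
+
+margin = (revenue - daily_cost) / daily_cost
+print(f"cost profit margin:  {margin:.0%}")         # ~546% here; 545% with $562,027
+```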
+
+However, our actual revenue is substantially lower for the following reasons:
+- DeepSeek-V3’s pricing is significantly lower than R1’s,
+- Only a subset of services are monetized (web and APP access remain free),
+- Nighttime discounts are automatically applied during off-peak hours.
+
+![Cost And Theoretical Income.jpg](figures/Cost%20And%20Theoretical%20Income.jpg)
+*Cost And Theoretical Income*
diff --git a/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/Communication-Computation Overlapping during Decoding Phase.png b/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/Communication-Computation Overlapping during Decoding Phase.png
new file mode 100644
index 0000000..c76c914
Binary files /dev/null and b/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/Communication-Computation Overlapping during Decoding Phase.png differ
diff --git a/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/Communication-Computation Overlapping during Prefilling Phase.png b/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/Communication-Computation Overlapping during Prefilling Phase.png
new file mode 100644
index 0000000..21450da
Binary files /dev/null and b/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/Communication-Computation Overlapping during Prefilling Phase.png differ
diff --git a/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/Cost And Theoretical Income.jpg b/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/Cost And Theoretical Income.jpg
new file mode 100644
index 0000000..e57dfce
Binary files /dev/null and b/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/Cost And Theoretical Income.jpg differ
diff --git a/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/Diagram of DeepSeek's Online Inference System.jpg b/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/Diagram of DeepSeek's Online Inference System.jpg
new file mode 100644
index 0000000..acff400
Binary files /dev/null and b/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/Diagram of DeepSeek's Online Inference System.jpg differ
diff --git a/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/H800 Node Count For Inference Service.jpg b/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/H800 Node Count For Inference Service.jpg
new file mode 100644
index 0000000..d1f3315
Binary files /dev/null and b/LLMs/DeepSeek/open-infra-index/202502OpenSourceWeek/figures/H800 Node Count For Inference Service.jpg differ
diff --git a/LLMs/DeepSeek/open-infra-index/LICENSE b/LLMs/DeepSeek/open-infra-index/LICENSE
new file mode 100644
index 0000000..0e259d4
--- /dev/null
+++ b/LLMs/DeepSeek/open-infra-index/LICENSE
@@ -0,0 +1,121 @@
+Creative Commons Legal Code
+
+CC0 1.0 Universal
+
+    CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
+    LEGAL SERVICES.
+    DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
+    ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
+    INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
+    REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
+    PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
+    THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
+    HEREUNDER.
+
+Statement of Purpose
+
+The laws of most jurisdictions throughout the world automatically confer
+exclusive Copyright and Related Rights (defined below) upon the creator
+and subsequent owner(s) (each and all, an "owner") of an original work of
+authorship and/or a database (each, a "Work").
+
+Certain owners wish to permanently relinquish those rights to a Work for
+the purpose of contributing to a commons of creative, cultural and
+scientific works ("Commons") that the public can reliably and without fear
+of later claims of infringement build upon, modify, incorporate in other
+works, reuse and redistribute as freely as possible in any form whatsoever
+and for any purposes, including without limitation commercial purposes.
+These owners may contribute to the Commons to promote the ideal of a free
+culture and the further production of creative, cultural and scientific
+works, or to gain reputation or greater distribution for their Work in
+part through the use and efforts of others.
+
+For these and/or other purposes and motivations, and without any
+expectation of additional consideration or compensation, the person
+associating CC0 with a Work (the "Affirmer"), to the extent that he or she
+is an owner of Copyright and Related Rights in the Work, voluntarily
+elects to apply CC0 to the Work and publicly distribute the Work under its
+terms, with knowledge of his or her Copyright and Related Rights in the
+Work and the meaning and intended legal effect of CC0 on those rights.
+
+1. Copyright and Related Rights. A Work made available under CC0 may be
+protected by copyright and related or neighboring rights ("Copyright and
+Related Rights"). Copyright and Related Rights include, but are not
+limited to, the following:
+
+  i. the right to reproduce, adapt, distribute, perform, display,
+     communicate, and translate a Work;
+ ii. moral rights retained by the original author(s) and/or performer(s);
+iii. publicity and privacy rights pertaining to a person's image or
+     likeness depicted in a Work;
+ iv. rights protecting against unfair competition in regards to a Work,
+     subject to the limitations in paragraph 4(a), below;
+  v. rights protecting the extraction, dissemination, use and reuse of data
+     in a Work;
+ vi. database rights (such as those arising under Directive 96/9/EC of the
+     European Parliament and of the Council of 11 March 1996 on the legal
+     protection of databases, and under any national implementation
+     thereof, including any amended or successor version of such
+     directive); and
+vii. other similar, equivalent or corresponding rights throughout the
+     world based on applicable law or treaty, and any national
+     implementations thereof.
+
+2. Waiver. To the greatest extent permitted by, but not in contravention
+of, applicable law, Affirmer hereby overtly, fully, permanently,
+irrevocably and unconditionally waives, abandons, and surrenders all of
+Affirmer's Copyright and Related Rights and associated claims and causes
+of action, whether now known or unknown (including existing as well as
+future claims and causes of action), in the Work (i) in all territories
+worldwide, (ii) for the maximum duration provided by applicable law or
+treaty (including future time extensions), (iii) in any current or future
+medium and for any number of copies, and (iv) for any purpose whatsoever,
+including without limitation commercial, advertising or promotional
+purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
+member of the public at large and to the detriment of Affirmer's heirs and
+successors, fully intending that such Waiver shall not be subject to
+revocation, rescission, cancellation, termination, or any other legal or
+equitable action to disrupt the quiet enjoyment of the Work by the public
+as contemplated by Affirmer's express Statement of Purpose.
+
+3. Public License Fallback. Should any part of the Waiver for any reason
+be judged legally invalid or ineffective under applicable law, then the
+Waiver shall be preserved to the maximum extent permitted taking into
+account Affirmer's express Statement of Purpose. In addition, to the
+extent the Waiver is so judged Affirmer hereby grants to each affected
+person a royalty-free, non transferable, non sublicensable, non exclusive,
+irrevocable and unconditional license to exercise Affirmer's Copyright and
+Related Rights in the Work (i) in all territories worldwide, (ii) for the
+maximum duration provided by applicable law or treaty (including future
+time extensions), (iii) in any current or future medium and for any number
+of copies, and (iv) for any purpose whatsoever, including without
+limitation commercial, advertising or promotional purposes (the
+"License"). The License shall be deemed effective as of the date CC0 was
+applied by Affirmer to the Work. Should any part of the License for any
+reason be judged legally invalid or ineffective under applicable law, such
+partial invalidity or ineffectiveness shall not invalidate the remainder
+of the License, and in such case Affirmer hereby affirms that he or she
+will not (i) exercise any of his or her remaining Copyright and Related
+Rights in the Work or (ii) assert any associated claims and causes of
+action with respect to the Work, in either case contrary to Affirmer's
+express Statement of Purpose.
+
+4. Limitations and Disclaimers.
+
+ a. No trademark or patent rights held by Affirmer are waived, abandoned,
+    surrendered, licensed or otherwise affected by this document.
+ b. Affirmer offers the Work as-is and makes no representations or
+    warranties of any kind concerning the Work, express, implied,
+    statutory or otherwise, including without limitation warranties of
+    title, merchantability, fitness for a particular purpose, non
+    infringement, or the absence of latent or other defects, accuracy, or
+    the present or absence of errors, whether or not discoverable, all to
+    the greatest extent permissible under applicable law.
+ c. Affirmer disclaims responsibility for clearing rights of other persons
+    that may apply to the Work or any use thereof, including without
+    limitation any person's Copyright and Related Rights in the Work.
+    Further, Affirmer disclaims responsibility for obtaining any necessary
+    consents, permissions or other rights required for any use of the
+    Work.
+ d. Affirmer understands and acknowledges that Creative Commons is not a
+    party to this document and has no duty or obligation with respect to
+    this CC0 or use of the Work.
diff --git a/LLMs/DeepSeek/open-infra-index/README.md b/LLMs/DeepSeek/open-infra-index/README.md
new file mode 100644
index 0000000..645190e
--- /dev/null
+++ b/LLMs/DeepSeek/open-infra-index/README.md
@@ -0,0 +1,107 @@
+All of the resources in this directory are copied from [DeepSeek AI's Open Infra Index](https://github.com/deepseek-ai/open-infra-index). Credits to @DeepSeekAI
+
+# Hello, DeepSeek Open Infra!
+
+## 202502 Open-Source Week
+We're a tiny team @deepseek-ai pushing our limits in AGI exploration.
+
+Starting **this week**, Feb 24, 2025, we'll open-source 5 repos – one daily drop – not because we've made grand claims,
+but simply as developers sharing our small-but-sincere progress with full transparency.
+
+These are humble building blocks of our online service: documented, deployed and battle-tested in production.
+No vaporware, just sincere code that moved our tiny yet ambitious dream forward.
+
+Why? Because every line shared becomes collective momentum that accelerates the journey.
+Daily unlocks begin soon. No ivory towers - just pure garage-energy and community-driven innovation 🔧
+
+Stay tuned – let's geek out in the open together.
+
+### Day 1 - [FlashMLA](https://github.com/deepseek-ai/FlashMLA)
+
+**Efficient MLA Decoding Kernel for Hopper GPUs**
+Optimized for variable-length sequences, battle-tested in production
+
+🔗 [**FlashMLA GitHub Repo**](https://github.com/deepseek-ai/FlashMLA)
+✅ BF16 support
+✅ Paged KV cache (block size 64)
+⚡ Performance: 3000 GB/s memory-bound | BF16 580 TFLOPS compute-bound on H800
+
+### Day 2 - [DeepEP](https://github.com/deepseek-ai/DeepEP)
+
+Excited to introduce **DeepEP** - the first open-source EP communication library for MoE model training and inference.
+
+🔗 [**DeepEP GitHub Repo**](https://github.com/deepseek-ai/DeepEP)
+✅ Efficient and optimized all-to-all communication
+✅ Both intranode and internode support with NVLink and RDMA
+✅ High-throughput kernels for training and inference prefilling
+✅ Low-latency kernels for inference decoding
+✅ Native FP8 dispatch support
+✅ Flexible GPU resource control for computation-communication overlapping
+
+### Day 3 - [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM)
+
+Introducing **DeepGEMM** - an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference.
+
+🔗 [**DeepGEMM GitHub Repo**](https://github.com/deepseek-ai/DeepGEMM)
+⚡ Up to 1350+ FP8 TFLOPS on Hopper GPUs
+✅ No heavy dependency, as clean as a tutorial
+✅ Fully Just-In-Time compiled
+✅ Core logic at ~300 lines - yet outperforms expert-tuned kernels across most matrix sizes
+✅ Supports dense layout and two MoE layouts
+
+### Day 4 - Optimized Parallelism Strategies
+
+✅ **DualPipe** - a bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
+🔗 [**GitHub Repo**](https://github.com/deepseek-ai/DualPipe)
+
+✅ **EPLB** - an expert-parallel load balancer for V3/R1.
+🔗 [**GitHub Repo**](https://github.com/deepseek-ai/eplb)
+
+📊 Analyze computation-communication overlap in V3/R1.
+🔗 [**GitHub Repo**](https://github.com/deepseek-ai/profile-data)
+
+### Day 5 - 3FS, Thruster for All DeepSeek Data Access
+
+Fire-Flyer File System (3FS) - a parallel file system that utilizes the full bandwidth of modern SSDs and RDMA networks.
+
+⚡ 6.6 TiB/s aggregate read throughput in a 180-node cluster
+⚡ 3.66 TiB/min throughput on the GraySort benchmark in a 25-node cluster
+⚡ 40+ GiB/s peak throughput per client node for KVCache lookup
+🧬 Disaggregated architecture with strong consistency semantics
+✅ Training data preprocessing, dataset loading, checkpoint saving/reloading, embedding vector search & KVCache lookups for inference in V3/R1
+
+📥 3FS → https://github.com/deepseek-ai/3FS
+⛲ Smallpond - data processing framework on 3FS → https://github.com/deepseek-ai/smallpond
+
+### Day 6 - One More Thing: DeepSeek-V3/R1 Inference System Overview
+Optimized throughput and latency via:
+🔧 Cross-node EP-powered batch scaling
+🔄 Computation-communication overlap
+⚖️ Load balancing
+
+Production data of V3/R1 online services:
+⚡ 73.7k/14.8k input/output tokens per second per H800 node
+🚀 Cost profit margin 545%
+
+![Cost And Theoretical Income.jpg](202502OpenSourceWeek/figures/Cost%20And%20Theoretical%20Income.jpg)
+
+💡 We hope this week's insights offer value to the community and contribute to our shared AGI goals.
+
+📖 Deep Dive: 🔗[Day 6 - One More Thing: DeepSeek-V3/R1 Inference System Overview](202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md)
+📖 Chinese version: 🔗[DeepSeek-V3 / R1 Inference System Overview (in Chinese)](https://zhuanlan.zhihu.com/p/27181462601)
+
+## 2024 AI Infrastructure Paper (SC24)
+### Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning
+
+[**📄 Paper Link**](https://dl.acm.org/doi/10.1109/SC41406.2024.00089)
+[**📄 Arxiv Paper Link**](https://arxiv.org/abs/2408.14158)