# Day 6: One More Thing, DeepSeek-V3/R1 Inference System Overview
## System Design Principles
The optimization objectives of serving DeepSeek-V3/R1 inference are: **higher throughput and lower latency.**

To optimize these two objectives, our solution employs cross-node Expert Parallelism (EP).
- First, EP significantly scales the batch size, enhancing GPU matrix computation efficiency and boosting throughput.
- Second, EP distributes experts across GPUs, with each GPU processing only a small subset of experts (reducing memory access demands), thereby lowering latency.

However, EP increases system complexity, primarily in two aspects:
1. EP introduces cross-node communication. To optimize throughput, appropriate computational workflows must be designed to overlap communication with computation.
2. EP involves multiple nodes, thereby inherently requiring Data Parallelism (DP) and necessitating load balancing between different DP instances.

This article focuses on how we address these challenges by:
- leveraging EP to scale batch size,
- hiding communication latency behind computation, and
- performing load balancing.

### Large-scale Cross-node Expert Parallelism (EP)
Due to the large number of experts in DeepSeek-V3/R1 (only 8 out of 256 experts per layer are activated), the model's high sparsity necessitates an extremely large overall batch size. This ensures sufficient batch size per expert, enabling higher throughput and lower latency. Large-scale cross-node EP is essential.

As we have adopted a prefill-decode disaggregation architecture, we employ different degrees of parallelism during the prefill and decode phases (a quick arithmetic check follows the list below):
- **Prefilling Phase [Routed Expert EP32, MLA/Shared Expert DP32]**: Each deployment unit spans 4 nodes with 32 redundant routed experts, where each GPU handles 9 routed experts and 1 shared expert.
- **Decoding Phase [Routed Expert EP144, MLA/Shared Expert DP144]**: Each deployment unit spans 18 nodes with 32 redundant routed experts, where each GPU manages 2 routed experts and 1 shared expert.
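As a quick check on this placement arithmetic, here is a minimal sketch: the 8-GPUs-per-node figure and the 256 routed + 32 redundant expert counts are taken from the text, while the function itself is ours.

```python
# Sanity-check the routed-expert placement described above: 256 routed
# experts per MoE layer plus 32 redundant copies, spread evenly over all
# GPUs in a deployment unit (8 H800 GPUs per node).

def routed_experts_per_gpu(nodes, gpus_per_node=8, routed=256, redundant=32):
    gpus = nodes * gpus_per_node
    total = routed + redundant
    assert total % gpus == 0, "experts must divide evenly across GPUs"
    return total // gpus

print(routed_experts_per_gpu(nodes=4))   # prefill unit:  32 GPUs -> 9 experts/GPU
print(routed_experts_per_gpu(nodes=18))  # decode unit:  144 GPUs -> 2 experts/GPU
```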

### Computation-Communication Overlapping
Large-scale cross-node EP introduces significant communication overhead. To mitigate this, we employ a dual-batch overlap strategy to hide communication costs and improve overall throughput by splitting a batch of requests into two microbatches.
During the prefilling phase, these two microbatches execute alternately, and the communication cost of one microbatch is hidden behind the computation of the other.

![Communication-Computation Overlapping during Prefilling Phase.png](figures/Communication-Computation%20Overlapping%20during%20Prefilling%20Phase.png)
*Communication-Computation Overlapping during Prefilling Phase*
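To see why two microbatches are enough to hide most of the communication, consider a toy timeline model (all timings invented for illustration; this is not DeepSeek's scheduler). It treats compute and communication as two independent resources, so while one microbatch computes a layer, the other's all-to-all is in flight:

```python
# Toy timeline model of dual-batch overlap. Compute and communication are
# two independent resources; a microbatch must finish computing a layer
# before communicating it, and each resource runs one op at a time.
def wall_time(layers, n_micro, compute_ms, comm_ms):
    compute_free = comm_free = 0.0     # when each resource next becomes idle
    done = [0.0] * n_micro             # per-microbatch progress time
    for _ in range(layers):
        for m in range(n_micro):
            start = max(compute_free, done[m])           # compute the layer
            compute_free = done[m] = start + compute_ms
            start = max(comm_free, done[m])              # then dispatch it
            comm_free = done[m] = start + comm_ms
    return max(done)

# Illustrative numbers: 4 layers, 3 ms compute and 2 ms communication each.
print(2 * wall_time(4, 1, 3.0, 2.0))   # 40.0 ms: two batches run serially
print(wall_time(4, 2, 3.0, 2.0))       # 26.0 ms: the same work, dual-batched
```

In steady state each pair of microbatches advances one layer per 2 × max(compute, comm), so communication drops off the critical path whenever it takes no longer than computation.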

During the decoding phase, the execution durations of different stages are unbalanced. Hence, we subdivide the attention layer into two steps and use a 5-stage pipeline to achieve seamless communication-computation overlapping.
![Communication-Computation Overlapping during Decoding Phase.png](figures/Communication-Computation%20Overlapping%20during%20Decoding%20Phase.png)
*Communication-Computation Overlapping during Decoding Phase*

More details about our communication-computation overlapping mechanism can be found at https://github.com/deepseek-ai/profile-data.

### Achieving Optimal Load Balancing
The large-scale parallelism (including DP and EP) introduces a critical challenge: if a single GPU is overloaded with computation or communication, it becomes a performance bottleneck, slowing the entire system while leaving other GPUs idle. To maximize resource utilization, we strive to balance computational and communication loads across all GPUs.

#### 1. Prefill Load Balancer
- Key Issue: Varying request counts and sequence lengths across DP instances lead to imbalanced core-attention computation and dispatch send load.
- Optimization Objectives:
  - Balance core-attention computation across GPUs (core-attention computational load balancing).
  - Equalize input token counts per GPU (dispatch send load balancing), preventing prolonged processing on specific GPUs.
#### 2. Decode Load Balancer
- Key Issue: Uneven request counts and sequence lengths across DP instances cause disparities in core-attention computation (linked to KVCache usage) and dispatch send load.
- Optimization Objectives:
  - Balance KVCache usage across GPUs (core-attention computational load balancing).
  - Equalize request counts per GPU (dispatch send load balancing).
#### 3. Expert-Parallel Load Balancer
- Key Issue: For a given MoE model, there exist inherently high-load experts, resulting in an imbalance in expert computational workloads across different GPUs.
- Optimization Objective:
  - Balance expert computation on each GPU (i.e., minimize the maximum dispatch receive load across all GPUs); a minimal greedy sketch follows this list.
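As a minimal illustration of this objective, the classic greedy heuristic below places the heaviest experts first, each onto the currently least-loaded GPU. This is a generic sketch over an assumed load profile, not the production algorithm; DeepSeek's open-source balancer, [EPLB](https://github.com/deepseek-ai/eplb), additionally replicates hot experts rather than only moving them:

```python
import heapq

def place_experts(expert_loads, num_gpus):
    """Greedy LPT placement: heaviest expert first, onto the least-loaded GPU."""
    heap = [(0.0, gpu) for gpu in range(num_gpus)]    # (accumulated load, gpu id)
    placement = {}
    for expert in sorted(expert_loads, key=expert_loads.get, reverse=True):
        load, gpu = heapq.heappop(heap)
        placement[expert] = gpu
        heapq.heappush(heap, (load + expert_loads[expert], gpu))
    return placement, max(load for load, _ in heap)   # minimize this maximum

# Hypothetical skewed profile: 4 hot experts dominate 28 cold ones.
loads = {e: (10.0 if e < 4 else 1.0) for e in range(32)}
placement, max_load = place_experts(loads, num_gpus=8)
print(max_load)   # 10.0 here, versus 68/8 = 8.5 for a perfectly even split
```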

### Diagram of DeepSeek's Online Inference System
![Diagram of DeepSeek's Online Inference System.jpg](figures/Diagram%20of%20DeepSeek%27s%20Online%20Inference%20System.jpg)
*Diagram of DeepSeek's Online Inference System*

### Statistics of DeepSeek's Online Service
All DeepSeek-V3/R1 inference services are served on H800 GPUs with precision consistent with training.
Specifically, matrix multiplications and dispatch transmissions adopt the FP8 format aligned with training,
while core MLA computations and combine transmissions use the BF16 format, ensuring optimal service performance.

Additionally, due to high service load during the day and low load at night, we implemented a mechanism to deploy inference services across all nodes during peak daytime hours.
During low-load nighttime periods, we reduce inference nodes and allocate resources to research and training.
Over the past 24 hours (UTC+8 02/27/2025 12:00 PM to 02/28/2025 12:00 PM), the combined peak node occupancy for V3 and R1 inference services reached 278, with an average occupancy of 226.75 nodes (each node contains 8 H800 GPUs).
Assuming the leasing cost of one H800 GPU is $2 per hour, the total daily cost amounts to $87,072 (226.75 nodes × 8 GPUs per node × $2 per GPU-hour × 24 hours).

![H800 Node Count For Inference Service.jpg](figures/H800%20Node%20Count%20For%20Inference%20Service.jpg)
*H800 Node Count For Inference Service*

Within the 24-hour statistical period (UTC+8 02/27/2025 12:00 PM to 02/28/2025 12:00 PM), V3 and R1:
- Total input tokens: 608B, of which 342B tokens (56.3%) hit the on-disk KV cache.
- Total output tokens: 168B. The average output speed was 20–22 tokens per second, and the average KVCache length per output token was 4,989 tokens.
- Each H800 node delivers an average throughput of ~73.7k tokens/s input (including cache hits) during prefilling or ~14.8k tokens/s output during decoding.

The above statistics include all user requests from web, APP, and API. If all tokens were billed at DeepSeek-R1's pricing (*), the total daily revenue would be $562,027, with a cost profit margin of 545%.

_(*) R1 Pricing: \$0.14/M input tokens (cache hit), \$0.55/M input tokens (cache miss), \$2.19/M output tokens._
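These headline numbers can be reproduced directly from the figures in this section (token counts are rounded in the text, so the computed revenue lands near, rather than exactly on, the quoted $562,027):

```python
M = 1e6  # tokens per million

# Token counts from the statistics above (rounded to the nearest billion).
input_hit  = 342e9             # input tokens served from the on-disk KV cache
input_miss = 608e9 - 342e9     # remaining input tokens
output     = 168e9             # output tokens

revenue = (input_hit / M) * 0.14 + (input_miss / M) * 0.55 + (output / M) * 2.19
cost    = 226.75 * 8 * 2 * 24  # avg nodes x 8 GPUs/node x $2/GPU-hour x 24 h

print(f"revenue ~= ${revenue:,.0f}")                # ~$562,100
print(f"cost     = ${cost:,.0f}")                   # $87,072
print(f"margin  ~= {(revenue - cost) / cost:.0%}")  # ~546% (545% with exact counts)
```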

However, our actual revenue is substantially lower for the following reasons:
- DeepSeek-V3's pricing is significantly lower than R1's,
- Only a subset of services are monetized (web and APP access remain free),
- Nighttime discounts are automatically applied during off-peak hours.

![Cost And Theoretical Income.jpg](figures/Cost%20And%20Theoretical%20Income.jpg)
*Cost And Theoretical Income*
---

**LLMs/DeepSeek/open-infra-index/LICENSE**
Creative Commons Legal Code

CC0 1.0 Universal

CREATIVE COMMONS CORPORATION IS NOT A LAW FIRM AND DOES NOT PROVIDE
LEGAL SERVICES. DISTRIBUTION OF THIS DOCUMENT DOES NOT CREATE AN
ATTORNEY-CLIENT RELATIONSHIP. CREATIVE COMMONS PROVIDES THIS
INFORMATION ON AN "AS-IS" BASIS. CREATIVE COMMONS MAKES NO WARRANTIES
REGARDING THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS
PROVIDED HEREUNDER, AND DISCLAIMS LIABILITY FOR DAMAGES RESULTING FROM
THE USE OF THIS DOCUMENT OR THE INFORMATION OR WORKS PROVIDED
HEREUNDER.

Statement of Purpose

The laws of most jurisdictions throughout the world automatically confer
exclusive Copyright and Related Rights (defined below) upon the creator
and subsequent owner(s) (each and all, an "owner") of an original work of
authorship and/or a database (each, a "Work").

Certain owners wish to permanently relinquish those rights to a Work for
the purpose of contributing to a commons of creative, cultural and
scientific works ("Commons") that the public can reliably and without fear
of later claims of infringement build upon, modify, incorporate in other
works, reuse and redistribute as freely as possible in any form whatsoever
and for any purposes, including without limitation commercial purposes.
These owners may contribute to the Commons to promote the ideal of a free
culture and the further production of creative, cultural and scientific
works, or to gain reputation or greater distribution for their Work in
part through the use and efforts of others.

For these and/or other purposes and motivations, and without any
expectation of additional consideration or compensation, the person
associating CC0 with a Work (the "Affirmer"), to the extent that he or she
is an owner of Copyright and Related Rights in the Work, voluntarily
elects to apply CC0 to the Work and publicly distribute the Work under its
terms, with knowledge of his or her Copyright and Related Rights in the
Work and the meaning and intended legal effect of CC0 on those rights.

1. Copyright and Related Rights. A Work made available under CC0 may be
protected by copyright and related or neighboring rights ("Copyright and
Related Rights"). Copyright and Related Rights include, but are not
limited to, the following:

i. the right to reproduce, adapt, distribute, perform, display,
communicate, and translate a Work;
ii. moral rights retained by the original author(s) and/or performer(s);
iii. publicity and privacy rights pertaining to a person's image or
likeness depicted in a Work;
iv. rights protecting against unfair competition in regards to a Work,
subject to the limitations in paragraph 4(a), below;
v. rights protecting the extraction, dissemination, use and reuse of data
in a Work;
vi. database rights (such as those arising under Directive 96/9/EC of the
European Parliament and of the Council of 11 March 1996 on the legal
protection of databases, and under any national implementation
thereof, including any amended or successor version of such
directive); and
vii. other similar, equivalent or corresponding rights throughout the
world based on applicable law or treaty, and any national
implementations thereof.

2. Waiver. To the greatest extent permitted by, but not in contravention
of, applicable law, Affirmer hereby overtly, fully, permanently,
irrevocably and unconditionally waives, abandons, and surrenders all of
Affirmer's Copyright and Related Rights and associated claims and causes
of action, whether now known or unknown (including existing as well as
future claims and causes of action), in the Work (i) in all territories
worldwide, (ii) for the maximum duration provided by applicable law or
treaty (including future time extensions), (iii) in any current or future
medium and for any number of copies, and (iv) for any purpose whatsoever,
including without limitation commercial, advertising or promotional
purposes (the "Waiver"). Affirmer makes the Waiver for the benefit of each
member of the public at large and to the detriment of Affirmer's heirs and
successors, fully intending that such Waiver shall not be subject to
revocation, rescission, cancellation, termination, or any other legal or
equitable action to disrupt the quiet enjoyment of the Work by the public
as contemplated by Affirmer's express Statement of Purpose.

3. Public License Fallback. Should any part of the Waiver for any reason
be judged legally invalid or ineffective under applicable law, then the
Waiver shall be preserved to the maximum extent permitted taking into
account Affirmer's express Statement of Purpose. In addition, to the
extent the Waiver is so judged Affirmer hereby grants to each affected
person a royalty-free, non transferable, non sublicensable, non exclusive,
irrevocable and unconditional license to exercise Affirmer's Copyright and
Related Rights in the Work (i) in all territories worldwide, (ii) for the
maximum duration provided by applicable law or treaty (including future
time extensions), (iii) in any current or future medium and for any number
of copies, and (iv) for any purpose whatsoever, including without
limitation commercial, advertising or promotional purposes (the
"License"). The License shall be deemed effective as of the date CC0 was
applied by Affirmer to the Work. Should any part of the License for any
reason be judged legally invalid or ineffective under applicable law, such
partial invalidity or ineffectiveness shall not invalidate the remainder
of the License, and in such case Affirmer hereby affirms that he or she
will not (i) exercise any of his or her remaining Copyright and Related
Rights in the Work or (ii) assert any associated claims and causes of
action with respect to the Work, in either case contrary to Affirmer's
express Statement of Purpose.

4. Limitations and Disclaimers.

a. No trademark or patent rights held by Affirmer are waived, abandoned,
surrendered, licensed or otherwise affected by this document.
b. Affirmer offers the Work as-is and makes no representations or
warranties of any kind concerning the Work, express, implied,
statutory or otherwise, including without limitation warranties of
title, merchantability, fitness for a particular purpose, non
infringement, or the absence of latent or other defects, accuracy, or
the present or absence of errors, whether or not discoverable, all to
the greatest extent permissible under applicable law.
c. Affirmer disclaims responsibility for clearing rights of other persons
that may apply to the Work or any use thereof, including without
limitation any person's Copyright and Related Rights in the Work.
Further, Affirmer disclaims responsibility for obtaining any necessary
consents, permissions or other rights required for any use of the
Work.
d. Affirmer understands and acknowledges that Creative Commons is not a
party to this document and has no duty or obligation with respect to
this CC0 or use of the Work.
---

**LLMs/DeepSeek/open-infra-index/README.md**
All of the resources in this directory are copied from [DeepSeek AI's Open Infra Index](https://github.com/deepseek-ai/open-infra-index). Credits to @DeepSeekAI

<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->

<div align="center">
<img src="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/logo.svg?raw=true" width="60%" alt="DeepSeek-Open-Infra" />
</div>
<hr>

# Hello, DeepSeek Open Infra!

## 202502 Open-Source Week
We're a tiny team @deepseek-ai pushing our limits in AGI exploration.

Starting **this week**, Feb 24, 2025, we'll open-source 5 repos – one daily drop – not because we've made grand claims,
but simply as developers sharing our small-but-sincere progress with full transparency.

These are humble building blocks of our online service: documented, deployed and battle-tested in production.
No vaporware, just sincere code that moved our tiny yet ambitious dream forward.

Why? Because every line shared becomes collective momentum that accelerates the journey.
Daily unlocks begin soon. No ivory towers - just pure garage-energy and community-driven innovation πŸ”§

Stay tuned – let's geek out in the open together.

### Day 1 - [FlashMLA](https://github.com/deepseek-ai/FlashMLA)

**Efficient MLA Decoding Kernel for Hopper GPUs**
Optimized for variable-length sequences, battle-tested in production

πŸ”— [**FlashMLA GitHub Repo**](https://github.com/deepseek-ai/FlashMLA)
βœ… BF16 support
βœ… Paged KV cache (block size 64)
⚑ Performance: 3000 GB/s memory-bound | BF16 580 TFLOPS compute-bound on H800

### Day 2 - [DeepEP](https://github.com/deepseek-ai/DeepEP)

Excited to introduce **DeepEP** - the first open-source EP communication library for MoE model training and inference. A toy sketch of the token-dispatch idea follows the feature list below.

πŸ”— [**DeepEP GitHub Repo**](https://github.com/deepseek-ai/DeepEP)
βœ… Efficient and optimized all-to-all communication
βœ… Both intranode and internode support with NVLink and RDMA
βœ… High-throughput kernels for training and inference prefilling
βœ… Low-latency kernels for inference decoding
βœ… Native FP8 dispatch support
βœ… Flexible GPU resource control for computation-communication overlapping
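For readers new to EP, here is a toy single-process sketch of what a dispatch all-to-all moves: every rank buckets its tokens by the rank that owns each routed expert, then the buckets are exchanged. Everything below is illustrative (the ownership rule, shapes, and names are ours, not DeepEP's API); DeepEP performs the real exchange with optimized NVLink/RDMA kernels.

```python
# Toy MoE dispatch bucketing, simulated in one process for illustration.
NUM_EXPERTS, NUM_RANKS = 8, 4
owner = lambda expert: expert * NUM_RANKS // NUM_EXPERTS   # 2 experts per rank

# (token id, routed expert id) pairs held by rank 0, chosen arbitrarily.
tokens = [(0, 3), (1, 6), (2, 0), (3, 3), (4, 7)]

send_buckets = {rank: [] for rank in range(NUM_RANKS)}
for token_id, expert in tokens:
    send_buckets[owner(expert)].append((token_id, expert))

for rank, bucket in send_buckets.items():
    print(f"rank 0 -> rank {rank}: {bucket}")   # what the all-to-all would send
```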

### Day 3 - [DeepGEMM](https://github.com/deepseek-ai/DeepGEMM)

Introducing **DeepGEMM** - an FP8 GEMM library that supports both dense and MoE GEMMs, powering V3/R1 training and inference.

πŸ”— [**DeepGEMM GitHub Repo**](https://github.com/deepseek-ai/DeepGEMM)
⚑ Up to 1350+ FP8 TFLOPS on Hopper GPUs
βœ… No heavy dependency, as clean as a tutorial
βœ… Fully Just-In-Time compiled
βœ… Core logic at ~300 lines - yet outperforms expert-tuned kernels across most matrix sizes
βœ… Supports dense layout and two MoE layouts

### Day 4 - Optimized Parallelism Strategies

βœ… **DualPipe** - a bidirectional pipeline parallelism algorithm for computation-communication overlap in V3/R1 training.
πŸ”— [**GitHub Repo**](https://github.com/deepseek-ai/DualPipe)

βœ… **EPLB** - an expert-parallel load balancer for V3/R1.
πŸ”— [**GitHub Repo**](https://github.com/deepseek-ai/eplb)

πŸ“Š Analyze computation-communication overlap in V3/R1.
πŸ”— [**GitHub Repo**](https://github.com/deepseek-ai/profile-data)

### Day 5 - 3FS, Thruster for All DeepSeek Data Access

Fire-Flyer File System (3FS) - a parallel file system that utilizes the full bandwidth of modern SSDs and RDMA networks.

⚑ 6.6 TiB/s aggregate read throughput in a 180-node cluster
⚑ 3.66 TiB/min throughput on GraySort benchmark in a 25-node cluster
⚑ 40+ GiB/s peak throughput per client node for KVCache lookup
🧬 Disaggregated architecture with strong consistency semantics
βœ… Training data preprocessing, dataset loading, checkpoint saving/reloading, embedding vector search & KVCache lookups for inference in V3/R1

πŸ“₯ 3FS β†’ https://github.com/deepseek-ai/3FS
β›² Smallpond - data processing framework on 3FS β†’ https://github.com/deepseek-ai/smallpond


### Day 6 - One More Thing: DeepSeek-V3/R1 Inference System Overview
Optimized throughput and latency via:
πŸ”§ Cross-node EP-powered batch scaling
πŸ”„ Computation-communication overlap
βš–οΈ Load balancing

Production data of V3/R1 online services:
⚑ 73.7k/14.8k input/output tokens per second per H800 node
πŸš€ Cost profit margin 545%

![Cost And Theoretical Income.jpg](202502OpenSourceWeek/figures/Cost%20And%20Theoretical%20Income.jpg)

πŸ’‘ We hope this week's insights offer value to the community and contribute to our shared AGI goals.

πŸ“– Deep Dive: πŸ”—[Day 6 - One More Thing: DeepSeek-V3/R1 Inference System Overview](202502OpenSourceWeek/day_6_one_more_thing_deepseekV3R1_inference_system_overview.md)
πŸ“– Chinese version: πŸ”—[DeepSeek-V3/R1 Inference System Overview (in Chinese)](https://zhuanlan.zhihu.com/p/27181462601)

## 2024 AI Infrastructure Paper (SC24)
### Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning

[**πŸ“„ Paper Link**](https://dl.acm.org/doi/10.1109/SC41406.2024.00089)
[**πŸ“„ Arxiv Paper Link**](https://arxiv.org/abs/2408.14158)
