
Commit 871ace6

shihaobai authored
refactor: weight refactor, including norm, mm, quantization and embedding (#1193)
Co-authored-by: sufubao <[email protected]> Co-authored-by: wangzaijun <[email protected]>
1 parent 91644a4 commit 871ace6


161 files changed: +3356 −4769 lines changed

Some content is hidden: large commits have some content hidden by default, so only a subset of the 161 changed files is shown below.


docs/CN/source/models/add_new_model.md

Lines changed: 0 additions & 37 deletions
@@ -162,19 +162,6 @@ class BloomPreAndPostLayerWeight(PreAndPostLayerWeight):
                                             self.tp_rank_: split_vob_size * (self.tp_rank_ + 1), :])
             self.lm_head_weight_ = self.wte_weight_
         return
-
-    def verify_load(self):
-        errors = "weights load not ok"
-        weights = [self.pre_norm_weight_,
-                   self.pre_norm_bias_,
-                   self.final_norm_weight_,
-                   self.final_norm_bias_,
-                   self.wte_weight_,
-                   self.lm_head_weight_]
-        for i in range(len(weights)):
-            assert weights[i] is not None, "index:" + str(i) + " " + errors
-        return
-
 ~~~
 
 ***transformer_layer_weight.py***
@@ -204,30 +191,6 @@ class BloomTransformerLayerWeight(TransformerLayerWeight):
         self._load_qkvo_weights(weights)
         self._load_ffn_weights(weights)
         return
-
-    def verify_load(self):
-        errors = "weights load not ok"
-        weights = [self.att_norm_weight_,
-                   self.att_norm_bias_,
-                   self.q_weight_,
-                   self.k_weight_,
-                   self.v_weight_,
-                   self.q_bias_,
-                   self.k_bias_,
-                   self.v_bias_,
-                   self.o_weight_,
-                   self.o_bias_,
-
-                   self.ffn_norm_weight_,
-                   self.ffn_norm_bias_,
-                   self.ffn_1_weight_,
-                   self.ffn_1_bias_,
-                   self.ffn_2_weight_,
-                   self.ffn_2_bias_,
-                   ]
-        for i in range(len(weights)):
-            assert weights[i] is not None, "index:" + str(i) + " " + errors
-        return
 
     def _load_qkvo_weights(self, weights):
         if f"h.{self.layer_num_}.input_layernorm.weight" in weights:

docs/CN/source/tutorial/api_server_args.rst

Lines changed: 5 additions & 14 deletions
@@ -367,17 +367,14 @@ PD 分离模式参数
 .. option:: --quant_type
 
    量化方法,可选值:
-
-   * ``ppl-w4a16-128``
-   * ``flashllm-w6a16``
-   * ``ao-int4wo-[32,64,128,256]``
-   * ``ao-int8wo``
-   * ``ao-fp8w8a16``
-   * ``ao-fp6w6a16``
+
    * ``vllm-w8a8``
    * ``vllm-fp8w8a8``
    * ``vllm-fp8w8a8-b128``
+   * ``deepgemm-fp8w8a8-b128``
    * ``triton-fp8w8a8-block128``
+   * ``awq``
+   * ``awq_marlin``
    * ``none`` (默认)
 
 .. option:: --quant_cfg
@@ -389,13 +386,7 @@ PD 分离模式参数
 .. option:: --vit_quant_type
 
    ViT 量化方法,可选值:
-
-   * ``ppl-w4a16-128``
-   * ``flashllm-w6a16``
-   * ``ao-int4wo-[32,64,128,256]``
-   * ``ao-int8wo``
-   * ``ao-fp8w8a16``
-   * ``ao-fp6w6a16``
+
    * ``vllm-w8a8``
    * ``vllm-fp8w8a8``
    * ``none`` (默认)
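
The lists above are the values now accepted by --quant_type and --vit_quant_type. As a quick illustration, a launch using one of the retained kernels might look like the following; the model path is a placeholder and the surrounding flags are only a hedged sketch borrowed from the deployment examples elsewhere in these docs, not something added by this commit:

   # Hypothetical launch using the vllm FP8 W8A8 quantization kernel
   LOADWORKER=18 python -m lightllm.server.api_server \
      --port 8088 \
      --model_dir /path/your-model \
      --tp 8 \
      --quant_type vllm-fp8w8a8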

docs/CN/source/tutorial/deepseek_deployment.rst

Lines changed: 20 additions & 15 deletions
@@ -49,13 +49,14 @@ LightLLM 支持以下几种部署模式:
 .. code-block:: bash
 
    # H200 单机 DeepSeek-R1 DP + EP 模式
-   MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
+   LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
    --model_dir /path/DeepSeek-R1 \
    --tp 8 \
-   --dp 8
+   --dp 8 \
+   --enable_ep_moe
 
 **参数说明:**
-- `MOE_MODE=EP`: 设置专家并行模式
+- `--enable_ep_moe`: 设置专家并行模式
 - `--tp 8`: 张量并行度
 - `--dp 8`: 数据并行度,通常设置为与 tp 相同的值
 
@@ -119,14 +120,14 @@ LightLLM 支持以下几种部署模式:
    # H200 多机 DeepSeek-R1 EP 模式 Node 0
    # 使用方法: sh multi_node_ep_node0.sh <nccl_host>
    export nccl_host=$1
-   MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
+   LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
    --model_dir /path/DeepSeek-R1 \
    --tp 16 \
    --dp 16 \
    --nnodes 2 \
    --node_rank 0 \
    --nccl_host $nccl_host \
-   --nccl_port 2732
+   --nccl_port 2732 --enable_ep_moe
 
 **Node 1 启动命令:**
 
@@ -135,14 +136,14 @@ LightLLM 支持以下几种部署模式:
    # H200 多机 DeepSeek-R1 EP 模式 Node 1
    # 使用方法: sh multi_node_ep_node1.sh <nccl_host>
    export nccl_host=$1
-   MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
+   LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
    --model_dir /path/DeepSeek-R1 \
    --tp 16 \
    --dp 16 \
    --nnodes 2 \
    --node_rank 1 \
    --nccl_host $nccl_host \
-   --nccl_port 2732
+   --nccl_port 2732 --enable_ep_moe
 
 **可选优化参数:**
 - `--enable_prefill_microbatch_overlap`: 启用预填充微批次重叠
@@ -179,7 +180,7 @@ PD (Prefill-Decode) 分离模式将预填充和解码阶段分离部署,可以
    export host=$1
    export pd_master_ip=$2
    nvidia-cuda-mps-control -d
-   MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server \
+   LOADWORKER=18 python -m lightllm.server.api_server \
    --model_dir /path/DeepSeek-R1 \
    --run_mode "prefill" \
    --tp 8 \
@@ -189,7 +190,8 @@ PD (Prefill-Decode) 分离模式将预填充和解码阶段分离部署,可以
    --nccl_port 2732 \
    --disable_cudagraph \
    --pd_master_ip $pd_master_ip \
-   --pd_master_port 60011
+   --pd_master_port 60011 \
+   --enable_ep_moe
    # 如果需要启用微批次重叠,可以取消注释以下行
    #--enable_prefill_microbatch_overlap
 
@@ -202,7 +204,7 @@ PD (Prefill-Decode) 分离模式将预填充和解码阶段分离部署,可以
    export host=$1
    export pd_master_ip=$2
    nvidia-cuda-mps-control -d
-   MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server \
+   LOADWORKER=18 python -m lightllm.server.api_server \
    --model_dir /path/DeepSeek-R1 \
    --run_mode "decode" \
    --tp 8 \
@@ -212,7 +214,8 @@ PD (Prefill-Decode) 分离模式将预填充和解码阶段分离部署,可以
    --nccl_port 12322 \
    --disable_cudagraph \
    --pd_master_ip $pd_master_ip \
-   --pd_master_port 60011
+   --pd_master_port 60011 \
+   --enable_ep_moe
    # 如果需要启用微批次重叠,可以取消注释以下行
    #--enable_decode_microbatch_overlap
 
@@ -269,7 +272,7 @@ PD (Prefill-Decode) 分离模式将预填充和解码阶段分离部署,可以
    export host=$1
    export config_server_host=$2
    nvidia-cuda-mps-control -d
-   MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server \
+   LOADWORKER=18 python -m lightllm.server.api_server \
    --model_dir /path/DeepSeek-R1 \
    --run_mode "prefill" \
    --host $host \
@@ -279,15 +282,16 @@ PD (Prefill-Decode) 分离模式将预填充和解码阶段分离部署,可以
    --nccl_port 2732 \
    --disable_cudagraph \
    --config_server_host $config_server_host \
-   --config_server_port 60088
+   --config_server_port 60088 \
+   --enable_ep_moe
    # 如果需要启用微批次重叠,可以取消注释以下行
    #--enable_prefill_microbatch_overlap
 
    # Decode 服务
    export host=$1
    export config_server_host=$2
    nvidia-cuda-mps-control -d
-   MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server \
+   LOADWORKER=18 python -m lightllm.server.api_server \
    --model_dir /path/DeepSeek-R1 \
    --run_mode "decode" \
    --host $host \
@@ -296,7 +300,8 @@ PD (Prefill-Decode) 分离模式将预填充和解码阶段分离部署,可以
    --tp 8 \
    --dp 8 \
    --config_server_host $config_server_host \
-   --config_server_port 60088
+   --config_server_port 60088 \
+   --enable_ep_moe
    # 如果需要启用微批次重叠,可以取消注释以下行
    #--enable_decode_microbatch_overlap
 
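Taken together, these hunks are one migration: each launch script drops the MOE_MODE=EP environment variable and passes the --enable_ep_moe server argument instead. A minimal before/after sketch for the single-node case, reusing the paths and ports from the snippets above:

   # Before: expert parallelism selected via environment variable
   MOE_MODE=EP LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
      --model_dir /path/DeepSeek-R1 --tp 8 --dp 8

   # After (this commit): expert parallelism selected via the CLI flag
   LOADWORKER=18 python -m lightllm.server.api_server --port 8088 \
      --model_dir /path/DeepSeek-R1 --tp 8 --dp 8 --enable_ep_moe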

docs/EN/source/models/add_new_model.md

Lines changed: 0 additions & 36 deletions
@@ -162,18 +162,6 @@ class BloomPreAndPostLayerWeight(PreAndPostLayerWeight):
                                             self.tp_rank_: split_vob_size * (self.tp_rank_ + 1), :])
             self.lm_head_weight_ = self.wte_weight_
         return
-
-    def verify_load(self):
-        errors = "weights load not ok"
-        weights = [self.pre_norm_weight_,
-                   self.pre_norm_bias_,
-                   self.final_norm_weight_,
-                   self.final_norm_bias_,
-                   self.wte_weight_,
-                   self.lm_head_weight_]
-        for i in range(len(weights)):
-            assert weights[i] is not None, "index:" + str(i) + " " + errors
-        return
 
 ~~~
 
@@ -204,30 +192,6 @@ class BloomTransformerLayerWeight(TransformerLayerWeight):
         self._load_qkvo_weights(weights)
         self._load_ffn_weights(weights)
         return
-
-    def verify_load(self):
-        errors = "weights load not ok"
-        weights = [self.att_norm_weight_,
-                   self.att_norm_bias_,
-                   self.q_weight_,
-                   self.k_weight_,
-                   self.v_weight_,
-                   self.q_bias_,
-                   self.k_bias_,
-                   self.v_bias_,
-                   self.o_weight_,
-                   self.o_bias_,
-
-                   self.ffn_norm_weight_,
-                   self.ffn_norm_bias_,
-                   self.ffn_1_weight_,
-                   self.ffn_1_bias_,
-                   self.ffn_2_weight_,
-                   self.ffn_2_bias_,
-                   ]
-        for i in range(len(weights)):
-            assert weights[i] is not None, "index:" + str(i) + " " + errors
-        return
 
     def _load_qkvo_weights(self, weights):
         if f"h.{self.layer_num_}.input_layernorm.weight" in weights:

docs/EN/source/tutorial/api_server_args.rst

Lines changed: 5 additions & 14 deletions
@@ -359,17 +359,14 @@ Quantization Parameters
 .. option:: --quant_type
 
    Quantization method, optional values:
-
-   * ``ppl-w4a16-128``
-   * ``flashllm-w6a16``
-   * ``ao-int4wo-[32,64,128,256]``
-   * ``ao-int8wo``
-   * ``ao-fp8w8a16``
-   * ``ao-fp6w6a16``
+
    * ``vllm-w8a8``
    * ``vllm-fp8w8a8``
    * ``vllm-fp8w8a8-b128``
+   * ``deepgemm-fp8w8a8-b128``
    * ``triton-fp8w8a8-block128``
+   * ``awq``
+   * ``awq_marlin``
    * ``none`` (default)
 
 .. option:: --quant_cfg
@@ -381,13 +378,7 @@ Quantization Parameters
 .. option:: --vit_quant_type
 
    ViT quantization method, optional values:
-
-   * ``ppl-w4a16-128``
-   * ``flashllm-w6a16``
-   * ``ao-int4wo-[32,64,128,256]``
-   * ``ao-int8wo``
-   * ``ao-fp8w8a16``
-   * ``ao-fp6w6a16``
+
    * ``vllm-w8a8``
    * ``vllm-fp8w8a8``
    * ``none`` (default)
