
build Qwen 72B TP4 int8 weight only Out of Memory using four 4090 #772


Description

@snippetzero

Hi, I tried to build the 72B Qwen model with int8 weight-only quantization on a machine with four 4090 cards, and hit an out-of-memory (OOM) error during the engine build stage.

For a 72B model with int8 weight-only quantization and tp_size 4, each rank should in theory need only about 18 GB for the weights, and a 4090 has 24 GB. Why does the build still run out of memory? Are there any ways to reduce memory consumption during the build?
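For reference, here is the back-of-the-envelope calculation behind the 18 GB figure (my own arithmetic, not TRT-LLM's accounting):

```python
# Rough per-rank weight footprint: 72B parameters at 1 byte each
# (int8 weight-only), split across 4 tensor-parallel ranks.
params = 72e9          # approximate parameter count of Qwen-72B
bytes_per_param = 1    # int8 weight-only quantization
tp_size = 4

per_rank = params * bytes_per_param / tp_size
print(f"{per_rank / 1e9:.0f} GB (~{per_rank / 2**30:.1f} GiB) of weights per rank")
# -> 18 GB (~16.8 GiB) per rank
```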

[12/29/2023-05:24:11] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1611 steps to complete.
[12/29/2023-05:24:11] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 206.787ms to assign 11 blocks to 1611 nodes requiring 335907840 bytes.
[12/29/2023-05:24:11] [TRT] [I] Total Activation Memory: 335907840
[12/29/2023-05:24:11] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[12/29/2023-05:24:11] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[12/29/2023-05:24:11] [TRT] [W] Requested amount of GPU memory (20572012544 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[12/29/2023-05:24:12] [TRT] [E] 2:
[12/29/2023-05:24:12] [TRT] [E] 2: [globWriter.cpp::makeResizableGpuMemory::423] Error Code 2: OutOfMemory (no further information)
[12/29/2023-05:24:12] [TRT-LLM] [E] Engine building failed, please check the error log.
[12/29/2023-05:24:12] [TRT-LLM] [I] Config saved to /workspace/model/qwen_72b_fp16_trt_int8_tp4_pp1_mbs8_mil5120/config.json.
Traceback (most recent call last):
File "/app/tensorrt_llm/examples/qwen/build.py", line 642, in <module>
build(0, args)
File "/app/tensorrt_llm/examples/qwen/build.py", line 614, in build
assert engine is not None, f'Failed to build engine for rank {cur_rank}'
AssertionError: Failed to build engine for rank 0
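For scale, converting the failed allocation from the warning above into GiB (a quick sanity check on my part, not an explanation from the TRT-LLM docs):

```python
# The allocation TensorRT could not satisfy, taken from the log above.
requested = 20572012544            # bytes
print(f"{requested / 2**30:.2f} GiB")  # -> 19.16 GiB
```

So the builder asks for roughly 19.2 GiB in one allocation on a 24 GiB card; with the CUDA context and anything else already resident, that can plausibly fail even though the weights alone fit.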

Build command

python examples/qwen/build.py \
    --hf_model_dir /model/Qwen-72B \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --world_size 4 \
    --tp_size 4 \
    --pp_size 1 \
    --use_inflight_batching \
    --output_dir /workspace/model/qwen_72b \
    --use_weight_only \
    --weight_only_precision int8 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --max_batch_size 1 \
    --log_level verbose \
    --rotary_base 1000000 \
    --paged_kv_cache
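In case it helps with triage, a minimal sketch for checking per-GPU headroom right before launching the build; it assumes only PyTorch (already a TRT-LLM dependency), nothing TRT-LLM specific:

```python
import torch

# Report free vs. total memory on every visible GPU. Memory already held by
# other processes (or the display) reduces what the engine build can claim.
for dev in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(dev)
    print(f"GPU {dev}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```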

Labels: Memory (memory utilization in TRT-LLM: leak/OOM handling, footprint optimization, memory profiling), triaged