
build Qwen 72B TP4 int8 weight only Out of Memory using four 4090 #772


Description

@snippetzero

Hi, I tried to build the 72B Qwen model with int8 weight-only quantization on a machine with four 4090 cards, and hit an out-of-memory (OOM) error during the engine build stage.

For a 72B model with int8 weight-only quantization and tp_size 4, each rank should in theory need only about 18 GB for the weights, and a 4090 has 24 GB. Why does the build still run out of memory? Are there any ways to reduce memory consumption during the build?
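For reference, here is the back-of-the-envelope calculation behind the 18 GB figure (my own arithmetic, not TRT-LLM's accounting):

```python
# Rough per-rank weight footprint: 72B parameters at 1 byte each
# (int8 weight-only), split across 4 tensor-parallel ranks.
params = 72e9          # approximate parameter count of Qwen-72B
bytes_per_param = 1    # int8 weight-only quantization
tp_size = 4

per_rank = params * bytes_per_param / tp_size
print(f"{per_rank / 1e9:.0f} GB (~{per_rank / 2**30:.1f} GiB) of weights per rank")
# -> 18 GB (~16.8 GiB) per rank
```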

[12/29/2023-05:24:11] [TRT] [I] [BlockAssignment] Started assigning block shifts. This will take 1611 steps to complete.
[12/29/2023-05:24:11] [TRT] [I] [BlockAssignment] Algorithm ShiftNTopDown took 206.787ms to assign 11 blocks to 1611 nodes requiring 335907840 bytes.
[12/29/2023-05:24:11] [TRT] [I] Total Activation Memory: 335907840
[12/29/2023-05:24:11] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[12/29/2023-05:24:11] [TRT] [E] 2: [virtualMemoryBuffer.cpp::resizePhysical::140] Error Code 2: OutOfMemory (no further information)
[12/29/2023-05:24:11] [TRT] [W] Requested amount of GPU memory (20572012544 bytes) could not be allocated. There may not be enough free memory for allocation to succeed.
[12/29/2023-05:24:12] [TRT] [E] 2:
[12/29/2023-05:24:12] [TRT] [E] 2: [globWriter.cpp::makeResizableGpuMemory::423] Error Code 2: OutOfMemory (no further information)
[12/29/2023-05:24:12] [TRT-LLM] [E] Engine building failed, please check the error log.
[12/29/2023-05:24:12] [TRT-LLM] [I] Config saved to /workspace/model/qwen_72b_fp16_trt_int8_tp4_pp1_mbs8_mil5120/config.json.
Traceback (most recent call last):
File "/app/tensorrt_llm/examples/qwen/build.py", line 642, in <module>
build(0, args)
File "/app/tensorrt_llm/examples/qwen/build.py", line 614, in build
assert engine is not None, f'Failed to build engine for rank {cur_rank}'
AssertionError: Failed to build engine for rank 0
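For scale, converting the failed allocation from the warning above into GiB (a quick sanity check on my part, not an explanation from the TRT-LLM docs):

```python
# The allocation TensorRT could not satisfy, taken from the log above.
requested = 20572012544            # bytes
print(f"{requested / 2**30:.2f} GiB")  # -> 19.16 GiB
```

So the builder asks for roughly 19.2 GiB in one allocation on a 24 GiB card; with the CUDA context and anything else already resident, that can plausibly fail even though the weights alone fit.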

Build command

python examples/qwen/build.py \
    --hf_model_dir /model/Qwen-72B \
    --dtype float16 \
    --remove_input_padding \
    --use_gpt_attention_plugin float16 \
    --enable_context_fmha \
    --use_gemm_plugin float16 \
    --world_size 4 \
    --tp_size 4 \
    --pp_size 1 \
    --use_inflight_batching \
    --output_dir /workspace/model/qwen_72b \
    --use_weight_only \
    --weight_only_precision int8 \
    --max_input_len 2048 \
    --max_output_len 512 \
    --max_batch_size 1 \
    --log_level verbose \
    --rotary_base 1000000 \
    --paged_kv_cache
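In case it helps with triage, a minimal sketch for checking per-GPU headroom right before launching the build; it assumes only PyTorch (already a TRT-LLM dependency), nothing TRT-LLM specific:

```python
import torch

# Report free vs. total memory on every visible GPU. Memory already held by
# other processes (or the display) reduces what the engine build can claim.
for dev in range(torch.cuda.device_count()):
    free, total = torch.cuda.mem_get_info(dev)
    print(f"GPU {dev}: {free / 2**30:.1f} GiB free of {total / 2**30:.1f} GiB")
```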

Labels: Memory (memory utilization in TRT-LLM: leak/OOM handling, footprint optimization, memory profiling), triaged