Description
System Info
8xH100
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
# Docker Image
docker pull nvcr.io/nvidia/tensorrt-llm/release:0.21.0rc2
# w4a8_awq quantization with modelopt v0.31.0
python3 /app/tensorrt_llm/examples/quantization/quantize.py --model_dir Mixtral-8x7B-Instruct-v0.1 --dtype float16 --tp_size 2 --output_dir Mixtral-8x7B-Instruct-v0.1-tp2-awq-w4a8 --qformat w4a8_awq --calib_size 512 --batch_size 16 --kv_cache_dtype fp8
# build engine
trtllm-build --checkpoint_dir Mixtral-8x7B-Instruct-v0.1-tp2-awq-w4a8 --max_batch_size 512 --max_seq_len 5120 --max_num_tokens 8192 --remove_input_padding enable --kv_cache_type=paged --multiple_profiles enable --workers 4 --output_dir mixtral-8x7b-w4a8 --reduce_fusion enable --gemm_plugin auto
# run summarize.py
mpirun -n 2 --allow-run-as-root python3 /app/tensorrt_llm/examples/summarize.py --test_trt_llm --data_type fp16 --hf_model_dir Mixtral-8x7B-Instruct-v0.1 --engine_dir mixtral-8x7b-w4a8
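For convenience, the quantize, build, and run steps above can be collected into one script. This is only a minimal sketch and not part of the original report: the variable names and the `step` helper are hypothetical, every flag is copied verbatim from the commands above, and it assumes it is run inside the container pulled in the first step. By default it only prints each command; setting `RUN=1` makes it execute them.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative variable names (not from the original report);
# the values are copied from the reproduction commands.
MODEL_DIR="Mixtral-8x7B-Instruct-v0.1"
CKPT_DIR="${MODEL_DIR}-tp2-awq-w4a8"
ENGINE_DIR="mixtral-8x7b-w4a8"
TP_SIZE=2
EXAMPLES="/app/tensorrt_llm/examples"

# Hypothetical helper: print the command, and execute it only when RUN=1.
step() {
  printf '+ %s\n' "$*"
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  fi
}

# 1. w4a8_awq quantization with an fp8 KV cache
step python3 "${EXAMPLES}/quantization/quantize.py" \
  --model_dir "${MODEL_DIR}" --dtype float16 --tp_size "${TP_SIZE}" \
  --output_dir "${CKPT_DIR}" --qformat w4a8_awq \
  --calib_size 512 --batch_size 16 --kv_cache_dtype fp8

# 2. Build the engine from the quantized checkpoint
step trtllm-build --checkpoint_dir "${CKPT_DIR}" \
  --max_batch_size 512 --max_seq_len 5120 --max_num_tokens 8192 \
  --remove_input_padding enable --kv_cache_type=paged \
  --multiple_profiles enable --workers 4 --output_dir "${ENGINE_DIR}" \
  --reduce_fusion enable --gemm_plugin auto

# 3. Run summarize.py with one MPI rank per tensor-parallel shard
step mpirun -n "${TP_SIZE}" --allow-run-as-root \
  python3 "${EXAMPLES}/summarize.py" --test_trt_llm --data_type fp16 \
  --hf_model_dir "${MODEL_DIR}" --engine_dir "${ENGINE_DIR}"
```

Saved under a name such as `repro.sh` (a placeholder), it can be run once as-is to review the printed commands and then as `RUN=1 bash repro.sh` to reproduce the issue. Deriving the checkpoint and engine paths from variables keeps the three steps from drifting apart when the model or TP size is changed.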
Expected behavior
Output is normal
Actual behavior
Additional notes
Nothing