
Mixtral-8x7B-Instruct awq-w4a8 output shows duplicated Chinese text #5379

Open
@wanzhenchn

Description

System Info

8xH100

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

# Docker image
docker pull nvcr.io/nvidia/tensorrt-llm/release:0.21.0rc2

# w4a8_awq quantization with modelopt v0.31.0
python3 /app/tensorrt_llm/examples/quantization/quantize.py \
  --model_dir Mixtral-8x7B-Instruct-v0.1 \
  --dtype float16 \
  --tp_size 2 \
  --output_dir Mixtral-8x7B-Instruct-v0.1-tp2-awq-w4a8 \
  --qformat w4a8_awq \
  --calib_size 512 \
  --batch_size 16 \
  --kv_cache_dtype fp8

# Build engine
trtllm-build \
  --checkpoint_dir Mixtral-8x7B-Instruct-v0.1-tp2-awq-w4a8 \
  --max_batch_size 512 \
  --max_seq_len 5120 \
  --max_num_tokens 8192 \
  --remove_input_padding enable \
  --kv_cache_type=paged \
  --multiple_profiles enable \
  --workers 4 \
  --output_dir mixtral-8x7b-w4a8 \
  --reduce_fusion enable \
  --gemm_plugin auto

# Run summarize.py
mpirun -n 2 --allow-run-as-root python3 /app/tensorrt_llm/examples/summarize.py \
  --test_trt_llm \
  --data_type fp16 \
  --hf_model_dir Mixtral-8x7B-Instruct-v0.1 \
  --engine_dir mixtral-8x7b-w4a8
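
For a side-by-side reference, a minimal sketch (not part of the original report) that runs the same checkpoint unquantized through transformers with greedy decoding, to confirm the duplication appears only on the quantized TRT-LLM side. The prompt is a placeholder, and device_map="auto" assumes the FP16 model fits across the available GPUs:

# Hedged sketch: baseline generation from the unquantized HF checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.float16, device_map="auto"
)

# Placeholder article text; summarize.py defaults to cnn_dailymail samples.
prompt = "[INST] Summarize the following article:\n... [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))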

Expected behavior

The quantized engine should produce coherent English summaries on the summarize.py task (the script reports ROUGE scores), with no repeated or Chinese text.

Actual behavior

[Screenshot: the generated output contains duplicated Chinese text instead of a coherent English summary.]
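
To quantify the symptom beyond a screenshot, a small illustrative helper (hypothetical, not from the report) that flags outputs containing CJK characters or back-to-back repeated word n-grams:

# Hypothetical check (illustrative only): flag summaries that contain
# Chinese characters or immediately repeated n-grams.
import re

def looks_corrupted(text: str, ngram: int = 4) -> bool:
    # CJK Unified Ideographs where English output is expected.
    if re.search(r"[\u4e00-\u9fff]", text):
        return True
    # Back-to-back repeated word n-grams, e.g. "foo bar foo bar".
    tokens = text.split()
    for i in range(len(tokens) - 2 * ngram + 1):
        if tokens[i:i + ngram] == tokens[i + ngram:i + 2 * ngram]:
            return True
    return False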

Additional notes

Nothing


Labels

  • Investigating
  • Low Precision: lower-precision formats (INT8/INT4/FP8) for TRTLLM quantization (AWQ, GPTQ)
  • bug: Something isn't working
  • triaged: Issue has been triaged by maintainers
