
llama examples fail to run #2303

@stas00

Description

Following the instructions at https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/README.md, I tried several different models, and they all fail in run.py with:

Traceback (most recent call last):
  File "/data/stas/faster-ttft/core/dawn/exp/infer/faster-ttft/trt-llm/TensorRT-LLM/examples/llama/../run.py", line 874, in <module>
    main(args)
  File "/data/stas/faster-ttft/core/dawn/exp/infer/faster-ttft/trt-llm/TensorRT-LLM/examples/llama/../run.py", line 681, in main
    runner = runner_cls.from_dir(**runner_kwargs)
  File "/env/lib/conda/ctx-shared-trt-stable/lib/python3.10/site-packages/tensorrt_llm/runtime/model_runner_cpp.py", line 311, in from_dir
    assert max_batch_size <= model_config.max_batch_size
AssertionError
(the second mpirun rank prints the same traceback)
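For context, the assertion that fires is a simple sanity check: the max_batch_size requested by the runner must not exceed the max_batch_size the engine was built with. A minimal sketch of that guard (a hypothetical simplification, not the actual model_runner_cpp.py code):

```python
def check_runner_limits(requested_batch_size: int, engine_max_batch_size: int) -> None:
    """Mirror of the failing guard: the runtime request must fit the build-time limit."""
    assert requested_batch_size <= engine_max_batch_size, (
        f"requested max_batch_size={requested_batch_size} exceeds the "
        f"engine's build-time max_batch_size={engine_max_batch_size}"
    )

# Within the build-time limit: passes silently.
check_runner_limits(1, 8)
```

In other words, the runner ends up asking for a larger batch size than the one recorded in the engine's config at build time.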

Here is a full repro:

git clone https://github.com/NVIDIA/TensorRT-LLM/
cd TensorRT-LLM/examples/llama
pip install -r requirements.txt
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
python convert_checkpoint.py --model_dir Llama-2-7b-chat-hf \
                            --output_dir Llama-2-7b-chat-hf_2gpu_tp2 \
                            --dtype float16 \
                            --tp_size 2
trtllm-build --checkpoint_dir Llama-2-7b-chat-hf_2gpu_tp2 \
            --output_dir trt_engines/Llama-2-7b-chat-hf_2gpu_tp2 \
            --gemm_plugin auto

wget https://www.gutenberg.org/cache/epub/64317/pg64317.txt
awk '{printf "%s\\n", $0} END {printf "\\nSummarize this story:"}' pg64317.txt > pg64317_sanitized.txt

mpirun -n 2 --allow-run-as-root --oversubscribe \
    python ../run.py \
    --max_output_len 128 \
    --max_input_length 32768 \
    --input_file pg64317_sanitized.txt \
    --engine_dir trt_engines/Llama-2-7b-chat-hf_2gpu_tp2 \
    --tokenizer_dir Llama-2-7b-chat-hf
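One way to debug this is to inspect the config.json that trtllm-build writes into the engine directory and compare its build-time limits against the flags passed to run.py. The snippet below is a sketch under the assumption that the limits live under a build_config section with max_batch_size / max_input_len keys (verify against your actual config.json, the schema may differ); it writes an illustrative config to a temp dir and reads it back:

```python
import json
import os
import tempfile

# Illustrative engine config; the build_config/max_batch_size/max_input_len
# layout is an assumption about what trtllm-build writes -- verify against
# the real trt_engines/.../config.json.
sample_config = {
    "build_config": {
        "max_batch_size": 8,    # illustrative build-time batch limit
        "max_input_len": 1024,  # illustrative; run.py above asks for 32768
    }
}

engine_dir = tempfile.mkdtemp()
with open(os.path.join(engine_dir, "config.json"), "w") as f:
    json.dump(sample_config, f)

# Read back the limits the engine was (in this sketch) built with.
with open(os.path.join(engine_dir, "config.json")) as f:
    build = json.load(f)["build_config"]

print("max_batch_size:", build["max_batch_size"])
print("max_input_len:", build["max_input_len"])
```

If the real engine's max_input_len is far below the 32768 passed to --max_input_length, the engine simply was not built for inputs that long.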


    Labels

    bug (Something isn't working), stale, triaged (Issue has been triaged by maintainers)
