Description
System Info
- Architecture: x86_64
- OS: Ubuntu 22.04
- GPU: NVIDIA GeForce RTX 4090
- GPU memory: 2x 24 GB
- CPU max MHz: 5000.0000
- Driver Version: 535.183.01
- CUDA Version: 12.2
- Container: nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3
- TensorRT-LLM version: 0.10.0
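(For reference, here's how the versions above can be confirmed from inside the container; a quick check of my own, not from any official doc:)

```bash
# Confirm driver/GPU info and the tensorrt_llm wheel actually installed in the container
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
pip show tensorrt_llm | grep -E '^(Name|Version)'
```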
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
nvidia-docker run -d -it --name trtllm \
  -v /home/remotessh/text-generation-webui/models/Llama-2-13b-chat-hf:/root/.cache/huggingface/llama-2-13b-chat-hf \
  -v /home/remotessh/TensorRT_engines:/engines \
  --shm-size=16G --network=host \
  nvcr.io/nvidia/tritonserver:24.06-trtllm-python-py3 /bin/bash
11f808569af7484e003da1e5eb26729a7decd74f470b0699d983056df2ca1aef
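(The container is started detached, so the steps below are run from a shell attached to it; assuming the usual docker workflow, that step is:)

```bash
# Attach a shell to the container started above (named via --name trtllm)
docker exec -it trtllm /bin/bash
```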
git clone https://github.com/triton-inference-server/tensorrtllm_backend.git
cd tensorrtllm_backend/
git clone https://github.com/NVIDIA/TensorRT-LLM.git
pip install git+https://github.com/NVIDIA/TensorRT-LLM.git
mkdir /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
export PYTHONPATH=/root/.cache/huggingface/llama-2-13b-chat-hf
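Note: `pip install git+…` builds whatever is currently on TensorRT-LLM main, which can be newer than the 0.10.0 build shipped in the 24.06 container and linked against a different torch. To see which combination actually ended up installed (a diagnostic of my own, not part of the official steps):

```bash
# Compare the installed tensorrt_llm wheel against the torch it must link with
pip show tensorrt_llm torch | grep -E '^(Name|Version|Location)'
python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
```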
This leads to the following error:

ImportError: The `bindings` module does not exist. Please check the package integrity. If you are attempting to use the pip development mode (editable installation), please execute `build_wheels.py` first, and then run `pip install -e .`
So I ran the suggested commands:

python scripts/build_wheel.py
pip install -e .
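(For clarity, `build_wheel.py` has to be run from the root of the TensorRT-LLM clone, which here is under /opt/tritonserver/tensorrtllm_backend as the prompt below shows. A sketch of the full sequence, assuming default build options:)

```bash
# Rebuild the C++ bindings from the cloned source, then install in editable mode
cd /opt/tritonserver/tensorrtllm_backend/TensorRT-LLM
python3 scripts/build_wheel.py   # default options; extra flags may be needed
pip install -e .
```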
cd TensorRT-LLM/examples/llama
python convert_checkpoint.py --model_dir /root/.cache/huggingface/llama-2-13b-chat-hf \
--output_dir /workspace/tensorrt_llm/llama-2-13b-chat-hf \
--dtype float16 \
--tp_size 2
Running the conversion then fails with this error:
/opt/tritonserver/tensorrtllm_backend/TensorRT-LLM/examples/llama# python convert_checkpoint.py --model_dir /root/.cache/huggingface/llama-2-13b-chat-hf/llama-2-13b-chat-hf --output_dir /workspace/tensorrt_llm/llama-2-13b-chat-hf --dtype float16 --tp_size 2
Traceback (most recent call last):
  File "/opt/tritonserver/tensorrtllm_backend/TensorRT-LLM/examples/llama/convert_checkpoint.py", line 8, in <module>
    import tensorrt_llm
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/__init__.py", line 32, in <module>
    import tensorrt_llm.functional as functional
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/functional.py", line 28, in <module>
    from . import graph_rewriting as gw
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/graph_rewriting.py", line 12, in <module>
    from .network import Network
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/network.py", line 27, in <module>
    from tensorrt_llm.module import Module
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/module.py", line 17, in <module>
    from ._common import default_net
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_common.py", line 31, in <module>
    from ._utils import str_dtype_to_trt
  File "/usr/local/lib/python3.10/dist-packages/tensorrt_llm/_utils.py", line 30, in <module>
    from tensorrt_llm.bindings.BuildInfo import ENABLE_MULTI_DEVICE
ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKSs
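The missing symbol demangles to `c10::detail::torchCheckFail(...)` from libtorch's c10 library, so the prebuilt `bindings` extension appears to be linked against a different torch/libstdc++ ABI than the torch installed in the container (the `Ss` in the mangled name is the pre-C++11 `std::string`, hinting at a `_GLIBCXX_USE_CXX11_ABI` mismatch). A quick check of my own; the path assumes the default torch install location:

```bash
# Demangle the missing symbol:
# c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&)
echo '_ZN3c106detail14torchCheckFailEPKcS2_jRKSs' | c++filt
# Does the installed libc10 export a matching symbol?
nm -D /usr/local/lib/python3.10/dist-packages/torch/lib/libc10.so | grep torchCheckFail | c++filt
```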
Expected behavior
Checkpoint conversion completes successfully.
Actual behavior
ImportError: /usr/local/lib/python3.10/dist-packages/tensorrt_llm/bindings.cpython-310-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail14torchCheckFailEPKcS2_jRKSs
Additional notes
I'm following the official documentation, plus some fixes suggested by other developers.
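A possible fix (untested here; assumes NVIDIA's extra index hosts the matching wheel) would be to drop the git-main install and pin tensorrt_llm to the 0.10.0 build the 24.06 container is meant to pair with:

```bash
# Replace the git-main build with the wheel matching the 24.06 container
pip uninstall -y tensorrt_llm
pip install tensorrt_llm==0.10.0 --extra-index-url https://pypi.nvidia.com
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```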