Update TensorRT-LLM backend (triton-inference-server#494)
kaiyux authored Jun 11, 2024
1 parent 39ba55a commit 566b4ff
Showing 4 changed files with 35 additions and 8 deletions.
2 changes: 0 additions & 2 deletions README.md
@@ -45,8 +45,6 @@ repo. If you don't find your answer there you can ask questions on the

There are several ways to access the TensorRT-LLM Backend.

-**Before Triton 23.10 release, please use [Option 3 to build TensorRT-LLM backend via Docker](#option-3-build-via-docker).**

### Run the Pre-built Docker Container

Starting with Triton 23.10 release, Triton includes a container with the TensorRT-LLM
…
37 changes: 33 additions & 4 deletions docs/llama.md
Original file line number Diff line number Diff line change
@@ -1,13 +1,42 @@
-## End to end workflow to run llama 7b
-* Build engine
+## End to end workflow to run llama
+0. Make sure that you have initialized the TRT-LLM submodule:

```bash
git lfs install
git submodule update --init --recursive
```
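
You can optionally confirm that the submodule is populated before moving on; a minimal check using standard git:

```bash
# A leading '-' in the output means the submodule has not been initialized yet;
# a plain commit SHA means the checkout succeeded.
git submodule status tensorrt_llm
```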

1. (Optional) Download the LLaMa model from HuggingFace:

```bash
huggingface-cli login

huggingface-cli download meta-llama/Llama-2-7b-hf
```

> **NOTE**
>
> Make sure that you have access to https://huggingface.co/meta-llama/Llama-2-7b-hf.
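
On a headless machine where the interactive login prompt is inconvenient, the token can be passed directly; the sketch below assumes the token is exported as `HF_TOKEN` (an illustrative variable name, not something the walkthrough defines):

```bash
# Non-interactive login, then the same download as above.
huggingface-cli login --token "$HF_TOKEN"
huggingface-cli download meta-llama/Llama-2-7b-hf
```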
2. Start the Triton Server Docker container:

```bash
# Replace <yy.mm> with the version of Triton you want to use.
# The command below assumes that the current directory is the
# TRT-LLM backend root git repository.

# The container runs as root, so the host's HuggingFace cache mounts at /root/.cache/huggingface.
docker run --rm -ti -v `pwd`:/mnt -w /mnt -v ~/.cache/huggingface:/root/.cache/huggingface --gpus all nvcr.io/nvidia/tritonserver:<yy.mm>-trtllm-python-py3 bash
```
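
Once inside the container, it is worth verifying that the GPUs are actually visible before building anything (an optional sanity check):

```bash
# Should list every GPU passed through by --gpus all; if this fails,
# the container was started without GPU access.
nvidia-smi
```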

3. Build the engine:
```bash
-export HF_LLAMA_MODEL=llama-7b-hf/
+# Replace 'HF_LLAMA_MODEL' with another path if you didn't download the model from step 1
+# or you're not using HuggingFace.
+export HF_LLAMA_MODEL=`python3 -c "from pathlib import Path; from huggingface_hub import hf_hub_download; print(Path(hf_hub_download('meta-llama/Llama-2-7b-hf', filename='config.json')).parent)"`
export UNIFIED_CKPT_PATH=/tmp/ckpt/llama/7b/
export ENGINE_PATH=/tmp/engines/llama/7b/
-python convert_checkpoint.py --model_dir ${HF_LLAMA_MODEL} \
+python tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir ${HF_LLAMA_MODEL} \
--output_dir ${UNIFIED_CKPT_PATH} \
--dtype float16
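# convert_checkpoint.py writes a unified TensorRT-LLM checkpoint (a config.json
# plus per-rank weight files) into ${UNIFIED_CKPT_PATH}. An illustrative sanity
# check, not part of the original walkthrough:
ls ${UNIFIED_CKPT_PATH}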

# ...
```
2 changes: 1 addition & 1 deletion tensorrt_llm
Submodule tensorrt_llm updated 301 files
2 changes: 1 addition & 1 deletion tools/version.txt
@@ -1 +1 @@
-225fd4fc55948de398989c334464d4478064b4f7
+1353d8632b255979eac4667d631a90538c07d269
