Update TensorRT-LLM backend (triton-inference-server#494)
kaiyux authored Jun 11, 2024
1 parent 39ba55a commit 566b4ff
Showing 4 changed files with 35 additions and 8 deletions.
2 changes: 0 additions & 2 deletions README.md
@@ -45,8 +45,6 @@ repo. If you don't find your answer there you can ask questions on the

There are several ways to access the TensorRT-LLM Backend.

-**Before Triton 23.10 release, please use [Option 3 to build TensorRT-LLM backend via Docker](#option-3-build-via-docker).**

### Run the Pre-built Docker Container

Starting with Triton 23.10 release, Triton includes a container with the TensorRT-LLM
…
37 changes: 33 additions & 4 deletions docs/llama.md
Original file line number Diff line number Diff line change
@@ -1,13 +1,42 @@
-## End to end workflow to run llama 7b
-* Build engine
+## End to end workflow to run llama
+0. Make sure that you have initialized the TRT-LLM submodule:

```bash
git lfs install
git submodule update --init --recursive
```
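
You can optionally confirm that the submodule is populated before moving on; a minimal check using standard git:

```bash
# A leading '-' in the output means the submodule has not been initialized yet;
# a plain commit SHA means the checkout succeeded.
git submodule status tensorrt_llm
```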

1. (Optional) Download the LLaMa model from HuggingFace:

```bash
huggingface-cli login

huggingface-cli download meta-llama/Llama-2-7b-hf
```

> **NOTE**
>
> Make sure that you have access to https://huggingface.co/meta-llama/Llama-2-7b-hf.
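
On a headless machine where the interactive login prompt is inconvenient, the token can be passed directly; the sketch below assumes the token is exported as `HF_TOKEN` (an illustrative variable name, not something the walkthrough defines):

```bash
# Non-interactive login, then the same download as above.
huggingface-cli login --token "$HF_TOKEN"
huggingface-cli download meta-llama/Llama-2-7b-hf
```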
2. Start the Triton Server Docker container:

```bash
# Replace <yy.mm> with the version of Triton you want to use.
# The command below assumes that the current directory is the
# TRT-LLM backend root git repository.

# The container runs as root, so the host's HuggingFace cache mounts at /root/.cache/huggingface.
docker run --rm -ti -v `pwd`:/mnt -w /mnt -v ~/.cache/huggingface:/root/.cache/huggingface --gpus all nvcr.io/nvidia/tritonserver:<yy.mm>-trtllm-python-py3 bash
```
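
Once inside the container, it is worth verifying that the GPUs are actually visible before building anything (an optional sanity check):

```bash
# Should list every GPU passed through by --gpus all; if this fails,
# the container was started without GPU access.
nvidia-smi
```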

3. Build the engine:
```bash
-export HF_LLAMA_MODEL=llama-7b-hf/
+# Replace 'HF_LLAMA_MODEL' with another path if you didn't download the model from step 1
+# or you're not using HuggingFace.
+export HF_LLAMA_MODEL=`python3 -c "from pathlib import Path; from huggingface_hub import hf_hub_download; print(Path(hf_hub_download('meta-llama/Llama-2-7b-hf', filename='config.json')).parent)"`
export UNIFIED_CKPT_PATH=/tmp/ckpt/llama/7b/
export ENGINE_PATH=/tmp/engines/llama/7b/
-python convert_checkpoint.py --model_dir ${HF_LLAMA_MODEL} \
+python tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir ${HF_LLAMA_MODEL} \
--output_dir ${UNIFIED_CKPT_PATH} \
--dtype float16
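# convert_checkpoint.py writes a unified TensorRT-LLM checkpoint (a config.json
# plus per-rank weight files) into ${UNIFIED_CKPT_PATH}. An illustrative sanity
# check, not part of the original walkthrough:
ls ${UNIFIED_CKPT_PATH}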

# ...
```
2 changes: 1 addition & 1 deletion tensorrt_llm
Submodule tensorrt_llm updated 301 files
2 changes: 1 addition & 1 deletion tools/version.txt
@@ -1 +1 @@
-225fd4fc55948de398989c334464d4478064b4f7
+1353d8632b255979eac4667d631a90538c07d269
