
Commit 7a92137

Update TensorRT-LLM backend (triton-inference-server#101)
* Update TensorRT-LLM backend
1 parent 329937a · commit 7a92137

24 files changed: +1074, -375 lines

README.md (33 additions, 22 deletions)

````diff
@@ -26,8 +26,6 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 -->
 
-[![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause)
-
 # TensorRT-LLM Backend
 The Triton backend for [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).
 You can learn more about Triton backends in the [backend repo](https://github.com/triton-inference-server/backend).
@@ -51,18 +49,14 @@ There are several ways to access the TensorRT-LLM Backend.
 
 ### Option 1. Run the Docker Container
 
-**The NGC container will be available with Triton 23.10 release soon**
-
-Starting with release 23.10, Triton includes a container with the TensorRT-LLM
+Starting with Triton 23.10 release, Triton includes a container with the TensorRT-LLM
 Backend and Python Backend. This container should have everything to run a
 TensorRT-LLM model. You can find this container on the
 [Triton NGC page](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).
 
 ### Option 2. Build via the build.py Script in Server Repo
 
-**Building via the build.py script will be available with Triton 23.10 release soon**
-
-You can follow steps described in the
+Starting with Triton 23.10 release, you can follow steps described in the
 [Building With Docker](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker)
 guide and use the
 [build.py](https://github.com/triton-inference-server/server/blob/main/build.py)
````
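For Option 1, the container described above can be pulled directly from NGC. A minimal sketch, not part of the commit; the image tag matches the `docker run` command added later in this diff:

```bash
# Pull the Triton 23.10 image that bundles the TensorRT-LLM and Python backends.
docker pull nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
```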
````diff
@@ -73,7 +67,7 @@ shown below, which will build the same TRT-LLM container as the one on the NGC.
 
 ```bash
 BASE_CONTAINER_IMAGE_NAME=nvcr.io/nvidia/tritonserver:23.10-py3-min
-TENSORRTLLM_BACKEND_REPO_TAG=r23.10
+TENSORRTLLM_BACKEND_REPO_TAG=release/0.5.0
 PYTHON_BACKEND_REPO_TAG=r23.10
 
 # Run the build script. The flags for some features or endpoints can be removed if not needed.
````
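The variables above feed the server repo's `build.py`. A hedged sketch of such an invocation; the flag set shown here is an assumption (a reduced subset), and the full command lives in the README below this hunk:

```bash
# Illustrative only: build a Triton container that includes the TensorRT-LLM
# and Python backends pinned to the tags defined above.
cd server   # a checkout of triton-inference-server/server
./build.py --enable-gpu \
           --image=base,${BASE_CONTAINER_IMAGE_NAME} \
           --backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG} \
           --backend=python:${PYTHON_BACKEND_REPO_TAG}
```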
````diff
@@ -98,6 +92,9 @@ don't need by removing the corresponding flags.
 
 ### Option 3. Build via Docker
 
+The version of Triton Server used in this build option can be found in the
+[Dockerfile](./dockerfile/Dockerfile.trt_llm_backend).
+
 ```bash
 # Update the submodules
 cd tensorrtllm_backend
````
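The Docker build itself targets the Dockerfile referenced above. A hedged sketch of the build step; the `triton_trt_llm` image name matches the `docker run` command later in this README diff:

```bash
# Illustrative only: build the backend container from the referenced Dockerfile.
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm \
    -f dockerfile/Dockerfile.trt_llm_backend .
```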
````diff
@@ -126,13 +123,17 @@ TensorRT-LLM repository for more details on how to to prepare the engines for de
 ```bash
 # Update the submodule TensorRT-LLM repository
 git submodule update --init --recursive
+git lfs install
+git lfs pull
 
 # TensorRT-LLM is required for generating engines. You can skip this step if
 # you already have the package installed. If you are generating engines within
 # the Triton container, you have to install the TRT-LLM package.
-pip install git+https://github.com/NVIDIA/TensorRT-LLM.git
-mkdir /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
-cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
+(cd tensorrt_llm &&
+    bash docker/common/install_cmake.sh &&
+    export PATH=/usr/local/cmake/bin:$PATH &&
+    python3 ./scripts/build_wheel.py --trt_root="/usr/local/tensorrt" &&
+    pip3 install ./build/tensorrt_llm*.whl)
 
 # Go to the tensorrt_llm/examples/gpt directory
 cd tensorrt_llm/examples/gpt
````
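A quick way to confirm that the wheel built above actually installed; a hedged sanity check that is not part of the commit:

```bash
# Illustrative only: the tensorrt_llm Python package should now import cleanly.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```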
````diff
@@ -209,19 +210,31 @@ The following table shows the fields that need to be modified before deployment:
 | `tokenizer_dir` | The path to the tokenizer for the model. In this example, the path should be set to `/tensorrtllm_backend/tensorrt_llm/examples/gpt/gpt2` as the tensorrtllm_backend directory will be mounted to `/tensorrtllm_backend` within the container |
 | `tokenizer_type` | The type of the tokenizer for the model, `t5`, `auto` and `llama` are supported. In this example, the type should be set to `auto` |
 
-### Launch Triton server *within NGC container*
+### Launch Triton server
 
-**The NGC container will be available with Triton 23.10 release soon**
+Please follow the option corresponding to the way you build the TensorRT-LLM backend.
+
+#### Option 1. Launch Triton server *within Triton NGC container*
+
+```bash
+docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash
+```
 
-Before the Triton 23.10 release, you can launch the Triton 23.09 container
-`nvcr.io/nvidia/tritonserver:23.09-py3` and add the directory
-`/opt/tritonserver/backends/tensorrtllm` within the container following the
-instructions in [Option 3 Build via Docker](#option-3-build-via-docker).
+#### Option 2. Launch Triton server *within the Triton container built via build.py script*
+
+```bash
+docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend tritonserver bash
+```
+
+#### Option 3. Launch Triton server *within the Triton container built via Docker*
 
 ```bash
-# Launch the Triton container
 docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm bash
+```
+
+Once inside the container, you can launch the Triton server with the following command:
 
+```bash
 cd /tensorrtllm_backend
 # --world_size is the number of GPUs you want to use for serving
 python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/tensorrtllm_backend/triton_model_repo
````
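Once `launch_triton_server.py` reports that the endpoints are up (see the log line in the next hunk's context), readiness can be verified from the host with Triton's standard health endpoint; a hedged example, not part of the commit:

```bash
# Illustrative only: returns HTTP 200 once the server and its models are ready.
curl -v localhost:8000/v2/health/ready
```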
````diff
@@ -236,9 +249,7 @@ I0919 14:52:10.517138 293 http_server.cc:187] Started Metrics Service at 0.0.0.0
 
 ### Query the server with the Triton generate endpoint
 
-**This feature will be available with Triton 23.10 release soon**
-
-You can query the server using Triton's
+Starting with Triton 23.10 release, you can query the server using Triton's
 [generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md)
 with a curl command based on the following general format within your client
 environment/container:
````
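The curl format the README goes on to show is outside this hunk. A hedged example of what such a request can look like; the `ensemble` model name and the `text_input`/`max_tokens`/`bad_words`/`stop_words` fields are assumptions based on the standard inflight_batcher_llm model repository:

```bash
# Illustrative only: query the ensemble model through the generate endpoint.
curl -X POST localhost:8000/v2/models/ensemble/generate \
     -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'
```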

all_models/inflight_batcher_llm/ensemble/config.pbtxt (46 additions, 0 deletions)

````diff
@@ -60,6 +60,12 @@ input [
     dims: [ 1 ]
     optional: true
   },
+  {
+    name: "embedding_bias"
+    data_type: TYPE_FP16
+    dims: [ -1 ]
+    optional: true
+  },
   {
     name: "top_k"
     data_type: TYPE_UINT32
@@ -119,6 +125,18 @@ input [
     data_type: TYPE_BOOL
     dims: [ 1 ]
     optional: true
+  },
+  {
+    name: "prompt_embedding_table"
+    data_type: TYPE_FP16
+    dims: [ -1, -1 ]
+    optional: true
+  },
+  {
+    name: "prompt_vocab_size"
+    data_type: TYPE_UINT32
+    dims: [ 1 ]
+    optional: true
   }
 ]
 output [
@@ -161,6 +179,14 @@ ensemble_scheduling {
         key: "REQUEST_OUTPUT_LEN"
         value: "_REQUEST_OUTPUT_LEN"
       }
+      output_map {
+        key: "STOP_WORDS_IDS"
+        value: "_STOP_WORDS_IDS"
+      }
+      output_map {
+        key: "BAD_WORDS_IDS"
+        value: "_BAD_WORDS_IDS"
+      }
     },
     {
       model_name: "tensorrt_llm"
@@ -185,6 +211,10 @@ ensemble_scheduling {
         key: "pad_id"
         value: "pad_id"
       }
+      input_map {
+        key: "embedding_bias"
+        value: "embedding_bias"
+      }
       input_map {
         key: "runtime_top_k"
         value: "top_k"
@@ -225,6 +255,22 @@ ensemble_scheduling {
         key: "streaming"
         value: "stream"
       }
+      input_map {
+        key: "prompt_embedding_table"
+        value: "prompt_embedding_table"
+      }
+      input_map {
+        key: "prompt_vocab_size"
+        value: "prompt_vocab_size"
+      }
+      input_map {
+        key: "stop_words_list"
+        value: "_STOP_WORDS_IDS"
+      }
+      input_map {
+        key: "bad_words_list"
+        value: "_BAD_WORDS_IDS"
+      }
       output_map {
         key: "output_ids"
         value: "_TOKENS_BATCH"
````

all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt (37 additions, 0 deletions)

````diff
@@ -63,6 +63,24 @@ input [
     reshape: { shape: [ ] }
     optional: true
   },
+  {
+    name: "stop_words_list"
+    data_type: TYPE_INT32
+    dims: [ 2, -1 ]
+    optional: true
+  },
+  {
+    name: "bad_words_list"
+    data_type: TYPE_INT32
+    dims: [ 2, -1 ]
+    optional: true
+  },
+  {
+    name: "embedding_bias"
+    data_type: TYPE_FP16
+    dims: [ -1 ]
+    optional: true
+  },
   {
     name: "beam_width"
     data_type: TYPE_UINT32
@@ -137,6 +155,19 @@ input [
     data_type: TYPE_BOOL
     dims: [ 1 ]
     optional: true
+  },
+  {
+    name: "prompt_embedding_table"
+    data_type: TYPE_FP16
+    dims: [ -1, -1 ]
+    optional: true
+  },
+  {
+    name: "prompt_vocab_size"
+    data_type: TYPE_UINT32
+    dims: [ 1 ]
+    reshape: { shape: [ ] }
+    optional: true
   }
 ]
 output [
@@ -211,3 +242,9 @@ parameters: {
     string_value: "${enable_trt_overlap}"
   }
 }
+parameters: {
+  key: "exclude_input_in_output"
+  value: {
+    string_value: "${exclude_input_in_output}"
+  }
+}
````
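Like the other `${...}` placeholders in this config, the new `exclude_input_in_output` parameter must be substituted before the model repository is served. A minimal sketch using `sed`; the destination path and the chosen value are illustrative:

```bash
# Illustrative only: copy the template repo and fill in the new placeholder.
cp -r all_models/inflight_batcher_llm triton_model_repo
sed -i 's|\${exclude_input_in_output}|true|g' triton_model_repo/tensorrt_llm/config.pbtxt
```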

ci/L0_backend_trtllm/generate_engines.sh (35 additions, 33 deletions)

````diff
@@ -29,18 +29,23 @@ BASE_DIR=/opt/tritonserver/tensorrtllm_backend/ci/L0_backend_trtllm
 GPT_DIR=/opt/tritonserver/tensorrtllm_backend/tensorrt_llm/examples/gpt
 
 function build_base_model {
+    local NUM_GPUS=$1
     cd ${GPT_DIR}
     rm -rf gpt2 && git clone https://huggingface.co/gpt2-medium gpt2
     pushd gpt2 && rm pytorch_model.bin model.safetensors && wget -q https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin && popd
-    python3 hf_gpt_convert.py -i gpt2 -o ./c-model/gpt2 --tensor-parallelism 1 --storage-type float16
+    python3 hf_gpt_convert.py -p 8 -i gpt2 -o ./c-model/gpt2 --tensor-parallelism ${NUM_GPUS} --storage-type float16
     cd ${BASE_DIR}
 }
 
 function build_tensorrt_engine_inflight_batcher {
+    local NUM_GPUS=$1
     cd ${GPT_DIR}
+    local GPT_MODEL_DIR=./c-model/gpt2/${NUM_GPUS}-gpu/
+    local OUTPUT_DIR=inflight_${NUM_GPUS}_gpu/
     # ./c-model/gpt2/ must already exist (it will if build_base_model
     # has already been run)
-    python3 build.py --model_dir=./c-model/gpt2/1-gpu/ \
+    python3 build.py --model_dir="${GPT_MODEL_DIR}" \
+                     --world_size="${NUM_GPUS}" \
                      --dtype float16 \
                      --use_inflight_batching \
                      --use_gpt_attention_plugin float16 \
@@ -49,47 +54,44 @@ function build_tensorrt_engine_inflight_batcher {
                      --remove_input_padding \
                      --use_layernorm_plugin float16 \
                      --hidden_act gelu \
-                     --output_dir=inflight_single_gpu/
+                     --parallel_build \
+                     --output_dir="${OUTPUT_DIR}"
     cd ${BASE_DIR}
-
 }
 
-function build_tensorrt_engine_inflight_batcher_multi_gpu {
-    cd ${GPT_DIR}
-    python3 hf_gpt_convert.py -p 8 -i gpt2 -o ./c-model/gpt2 --tensor-parallelism 4 --storage-type float16
-    python3 build.py --model_dir=./c-model/gpt2/4-gpu/ \
-                     --world_size=4 \
-                     --dtype float16 \
-                     --use_inflight_batching \
-                     --use_gpt_attention_plugin float16 \
-                     --paged_kv_cache \
-                     --use_gemm_plugin float16 \
-                     --remove_input_padding \
-                     --use_layernorm_plugin float16 \
-                     --hidden_act gelu \
-                     --parallel_build \
-                     --output_dir=inflight_multi_gpu/
-    cd ${BASE_DIR}
+function install_trt_llm {
+    # Install CMake
+    bash /opt/tritonserver/tensorrtllm_backend/tensorrt_llm/docker/common/install_cmake.sh
+    export PATH="/usr/local/cmake/bin:${PATH}"
+
+    # PyTorch needs to be built from source for aarch64
+    ARCH="$(uname -i)"
+    if [ "${ARCH}" = "aarch64" ]; then TORCH_INSTALL_TYPE="src_non_cxx11_abi"; \
+    else TORCH_INSTALL_TYPE="pypi"; fi && \
+    (cd /opt/tritonserver/tensorrtllm_backend/tensorrt_llm &&
+        bash docker/common/install_pytorch.sh $TORCH_INSTALL_TYPE &&
+        python3 ./scripts/build_wheel.py --trt_root="${TRT_ROOT}" &&
+        pip3 install ./build/tensorrt_llm*.whl)
 }
 
 # Install TRT LLM
-# FIXME: Update the url
-pip install git+https://github.com/NVIDIA/TensorRT-LLM.git@${TENSORRTLLM_BACKEND_REPO_TAG}
-mkdir /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
-cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
-
-export LD_LIBRARY_PATH=/usr/local/tensorrt/lib/:$LD_LIBRARY_PATH
-export TRT_ROOT=/usr/local/tensorrt
+install_trt_llm
 
 # Generate the TRT_LLM model engines
-build_base_model
-build_tensorrt_engine_inflight_batcher
-build_tensorrt_engine_inflight_batcher_multi_gpu
+NUM_GPUS_TO_TEST=("1" "2" "4")
+for NUM_GPU in "${NUM_GPUS_TO_TEST[@]}"; do
+    AVAILABLE_GPUS=$(nvidia-smi -L | wc -l)
+    if [ "$AVAILABLE_GPUS" -lt "$NUM_GPU" ]; then
+        continue
+    fi
+
+    build_base_model "${NUM_GPU}"
+    build_tensorrt_engine_inflight_batcher "${NUM_GPU}"
+done
 
 # Move the TRT_LLM model engines to the CI directory
 mkdir engines
-mv ${GPT_DIR}/inflight_single_gpu engines/
-mv ${GPT_DIR}/inflight_multi_gpu engines/
+mv ${GPT_DIR}/inflight_*_gpu/ engines/
 
 # Move the tokenizer into the CI directory
 mkdir tokenizer
@@ -98,4 +100,4 @@ mv ${GPT_DIR}/gpt2/* tokenizer/
 # Now that the engines are generated, we should remove the
 # tensorrt_llm module to ensure the C++ backend tests are
 # not using it
-rm -rf /usr/local/lib/python3.10/dist-packages/tensorrt_llm
+pip3 uninstall -y torch tensorrt_llm
````
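A hedged usage sketch for the reworked script: it now builds one engine set per entry in `NUM_GPUS_TO_TEST`, skipping the GPU counts the machine cannot satisfy. The paths assume the backend container layout used by the script itself:

```bash
# Illustrative only: run the CI engine generation and see which engine
# directories were produced for the available GPU count.
cd /opt/tritonserver/tensorrtllm_backend/ci/L0_backend_trtllm
bash generate_engines.sh
ls engines/    # expect inflight_1_gpu/, plus inflight_2_gpu/ and inflight_4_gpu/ when enough GPUs are present
```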
