
Commit 7a92137

Update TensorRT-LLM backend (triton-inference-server#101)
* Update TensorRT-LLM backend
1 parent 329937a · commit 7a92137

24 files changed: +1074, -375 lines

README.md (33 additions, 22 deletions)

````diff
@@ -26,8 +26,6 @@
 # OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
 -->
 
-[![License](https://img.shields.io/badge/License-BSD3-lightgrey.svg)](https://opensource.org/licenses/BSD-3-Clause)
-
 # TensorRT-LLM Backend
 The Triton backend for [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).
 You can learn more about Triton backends in the [backend repo](https://github.com/triton-inference-server/backend).
@@ -51,18 +49,14 @@ There are several ways to access the TensorRT-LLM Backend.
 
 ### Option 1. Run the Docker Container
 
-**The NGC container will be available with Triton 23.10 release soon**
-
-Starting with release 23.10, Triton includes a container with the TensorRT-LLM
+Starting with Triton 23.10 release, Triton includes a container with the TensorRT-LLM
 Backend and Python Backend. This container should have everything to run a
 TensorRT-LLM model. You can find this container on the
 [Triton NGC page](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/tritonserver).
 
 ### Option 2. Build via the build.py Script in Server Repo
 
-**Building via the build.py script will be available with Triton 23.10 release soon**
-
-You can follow steps described in the
+Starting with Triton 23.10 release, you can follow steps described in the
 [Building With Docker](https://github.com/triton-inference-server/server/blob/main/docs/customization_guide/build.md#building-with-docker)
 guide and use the
 [build.py](https://github.com/triton-inference-server/server/blob/main/build.py)
````
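For Option 1, the container described above can be pulled directly from NGC. A minimal sketch, not part of the commit; the image tag matches the `docker run` command added later in this diff:

```bash
# Pull the Triton 23.10 image that bundles the TensorRT-LLM and Python backends.
docker pull nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3
```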
````diff
@@ -73,7 +67,7 @@ shown below, which will build the same TRT-LLM container as the one on the NGC.
 
 ```bash
 BASE_CONTAINER_IMAGE_NAME=nvcr.io/nvidia/tritonserver:23.10-py3-min
-TENSORRTLLM_BACKEND_REPO_TAG=r23.10
+TENSORRTLLM_BACKEND_REPO_TAG=release/0.5.0
 PYTHON_BACKEND_REPO_TAG=r23.10
 
 # Run the build script. The flags for some features or endpoints can be removed if not needed.
````
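The variables above feed the server repo's `build.py`. A hedged sketch of such an invocation; the flag set shown here is an assumption (a reduced subset), and the full command lives in the README below this hunk:

```bash
# Illustrative only: build a Triton container that includes the TensorRT-LLM
# and Python backends pinned to the tags defined above.
cd server   # a checkout of triton-inference-server/server
./build.py --enable-gpu \
           --image=base,${BASE_CONTAINER_IMAGE_NAME} \
           --backend=tensorrtllm:${TENSORRTLLM_BACKEND_REPO_TAG} \
           --backend=python:${PYTHON_BACKEND_REPO_TAG}
```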
````diff
@@ -98,6 +92,9 @@ don't need by removing the corresponding flags.
 
 ### Option 3. Build via Docker
 
+The version of Triton Server used in this build option can be found in the
+[Dockerfile](./dockerfile/Dockerfile.trt_llm_backend).
+
 ```bash
 # Update the submodules
 cd tensorrtllm_backend
````
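The Docker build itself targets the Dockerfile referenced above. A hedged sketch of the build step; the `triton_trt_llm` image name matches the `docker run` command later in this README diff:

```bash
# Illustrative only: build the backend container from the referenced Dockerfile.
DOCKER_BUILDKIT=1 docker build -t triton_trt_llm \
    -f dockerfile/Dockerfile.trt_llm_backend .
```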
````diff
@@ -126,13 +123,17 @@ TensorRT-LLM repository for more details on how to to prepare the engines for de
 ```bash
 # Update the submodule TensorRT-LLM repository
 git submodule update --init --recursive
+git lfs install
+git lfs pull
 
 # TensorRT-LLM is required for generating engines. You can skip this step if
 # you already have the package installed. If you are generating engines within
 # the Triton container, you have to install the TRT-LLM package.
-pip install git+https://github.com/NVIDIA/TensorRT-LLM.git
-mkdir /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
-cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
+(cd tensorrt_llm &&
+    bash docker/common/install_cmake.sh &&
+    export PATH=/usr/local/cmake/bin:$PATH &&
+    python3 ./scripts/build_wheel.py --trt_root="/usr/local/tensorrt" &&
+    pip3 install ./build/tensorrt_llm*.whl)
 
 # Go to the tensorrt_llm/examples/gpt directory
 cd tensorrt_llm/examples/gpt
````
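A quick way to confirm that the wheel built above actually installed; a hedged sanity check that is not part of the commit:

```bash
# Illustrative only: the tensorrt_llm Python package should now import cleanly.
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
```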
````diff
@@ -209,19 +210,31 @@ The following table shows the fields that need to be modified before deployment:
 | `tokenizer_dir` | The path to the tokenizer for the model. In this example, the path should be set to `/tensorrtllm_backend/tensorrt_llm/examples/gpt/gpt2` as the tensorrtllm_backend directory will be mounted to `/tensorrtllm_backend` within the container |
 | `tokenizer_type` | The type of the tokenizer for the model, `t5`, `auto` and `llama` are supported. In this example, the type should be set to `auto` |
 
-### Launch Triton server *within NGC container*
+### Launch Triton server
 
-**The NGC container will be available with Triton 23.10 release soon**
+Please follow the option corresponding to the way you build the TensorRT-LLM backend.
+
+#### Option 1. Launch Triton server *within Triton NGC container*
+
+```bash
+docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend nvcr.io/nvidia/tritonserver:23.10-trtllm-python-py3 bash
+```
 
-Before the Triton 23.10 release, you can launch the Triton 23.09 container
-`nvcr.io/nvidia/tritonserver:23.09-py3` and add the directory
-`/opt/tritonserver/backends/tensorrtllm` within the container following the
-instructions in [Option 3 Build via Docker](#option-3-build-via-docker).
+#### Option 2. Launch Triton server *within the Triton container built via build.py script*
+
+```bash
+docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend tritonserver bash
+```
+
+#### Option 3. Launch Triton server *within the Triton container built via Docker*
 
 ```bash
-# Launch the Triton container
 docker run --rm -it --net host --shm-size=2g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -v /path/to/tensorrtllm_backend:/tensorrtllm_backend triton_trt_llm bash
+```
+
+Once inside the container, you can launch the Triton server with the following command:
 
+```bash
 cd /tensorrtllm_backend
 # --world_size is the number of GPUs you want to use for serving
 python3 scripts/launch_triton_server.py --world_size=4 --model_repo=/tensorrtllm_backend/triton_model_repo
````
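Once `launch_triton_server.py` reports that the endpoints are up (see the log line in the next hunk's context), readiness can be verified from the host with Triton's standard health endpoint; a hedged example, not part of the commit:

```bash
# Illustrative only: returns HTTP 200 once the server and its models are ready.
curl -v localhost:8000/v2/health/ready
```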
````diff
@@ -236,9 +249,7 @@ I0919 14:52:10.517138 293 http_server.cc:187] Started Metrics Service at 0.0.0.0
 
 ### Query the server with the Triton generate endpoint
 
-**This feature will be available with Triton 23.10 release soon**
-
-You can query the server using Triton's
+Starting with Triton 23.10 release, you can query the server using Triton's
 [generate endpoint](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_generate.md)
 with a curl command based on the following general format within your client
 environment/container:
````
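The curl format the README goes on to show is outside this hunk. A hedged example of what such a request can look like; the `ensemble` model name and the `text_input`/`max_tokens`/`bad_words`/`stop_words` fields are assumptions based on the standard inflight_batcher_llm model repository:

```bash
# Illustrative only: query the ensemble model through the generate endpoint.
curl -X POST localhost:8000/v2/models/ensemble/generate \
     -d '{"text_input": "What is machine learning?", "max_tokens": 64, "bad_words": "", "stop_words": ""}'
```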

all_models/inflight_batcher_llm/ensemble/config.pbtxt (46 additions, 0 deletions)

````diff
@@ -60,6 +60,12 @@ input [
     dims: [ 1 ]
     optional: true
   },
+  {
+    name: "embedding_bias"
+    data_type: TYPE_FP16
+    dims: [ -1 ]
+    optional: true
+  },
   {
     name: "top_k"
     data_type: TYPE_UINT32
@@ -119,6 +125,18 @@ input [
     data_type: TYPE_BOOL
     dims: [ 1 ]
     optional: true
+  },
+  {
+    name: "prompt_embedding_table"
+    data_type: TYPE_FP16
+    dims: [ -1, -1 ]
+    optional: true
+  },
+  {
+    name: "prompt_vocab_size"
+    data_type: TYPE_UINT32
+    dims: [ 1 ]
+    optional: true
   }
 ]
 output [
@@ -161,6 +179,14 @@ ensemble_scheduling {
         key: "REQUEST_OUTPUT_LEN"
         value: "_REQUEST_OUTPUT_LEN"
       }
+      output_map {
+        key: "STOP_WORDS_IDS"
+        value: "_STOP_WORDS_IDS"
+      }
+      output_map {
+        key: "BAD_WORDS_IDS"
+        value: "_BAD_WORDS_IDS"
+      }
     },
     {
       model_name: "tensorrt_llm"
@@ -185,6 +211,10 @@ ensemble_scheduling {
         key: "pad_id"
         value: "pad_id"
       }
+      input_map {
+        key: "embedding_bias"
+        value: "embedding_bias"
+      }
       input_map {
         key: "runtime_top_k"
         value: "top_k"
@@ -225,6 +255,22 @@ ensemble_scheduling {
         key: "streaming"
         value: "stream"
       }
+      input_map {
+        key: "prompt_embedding_table"
+        value: "prompt_embedding_table"
+      }
+      input_map {
+        key: "prompt_vocab_size"
+        value: "prompt_vocab_size"
+      }
+      input_map {
+        key: "stop_words_list"
+        value: "_STOP_WORDS_IDS"
+      }
+      input_map {
+        key: "bad_words_list"
+        value: "_BAD_WORDS_IDS"
+      }
       output_map {
         key: "output_ids"
         value: "_TOKENS_BATCH"
````

all_models/inflight_batcher_llm/tensorrt_llm/config.pbtxt (37 additions, 0 deletions)

````diff
@@ -63,6 +63,24 @@ input [
     reshape: { shape: [ ] }
     optional: true
   },
+  {
+    name: "stop_words_list"
+    data_type: TYPE_INT32
+    dims: [ 2, -1 ]
+    optional: true
+  },
+  {
+    name: "bad_words_list"
+    data_type: TYPE_INT32
+    dims: [ 2, -1 ]
+    optional: true
+  },
+  {
+    name: "embedding_bias"
+    data_type: TYPE_FP16
+    dims: [ -1 ]
+    optional: true
+  },
   {
     name: "beam_width"
     data_type: TYPE_UINT32
@@ -137,6 +155,19 @@ input [
     data_type: TYPE_BOOL
     dims: [ 1 ]
     optional: true
+  },
+  {
+    name: "prompt_embedding_table"
+    data_type: TYPE_FP16
+    dims: [ -1, -1 ]
+    optional: true
+  },
+  {
+    name: "prompt_vocab_size"
+    data_type: TYPE_UINT32
+    dims: [ 1 ]
+    reshape: { shape: [ ] }
+    optional: true
   }
 ]
 output [
@@ -211,3 +242,9 @@ parameters: {
     string_value: "${enable_trt_overlap}"
   }
 }
+parameters: {
+  key: "exclude_input_in_output"
+  value: {
+    string_value: "${exclude_input_in_output}"
+  }
+}
````
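Like the other `${...}` placeholders in this config, the new `exclude_input_in_output` parameter must be substituted before the model repository is served. A minimal sketch using `sed`; the destination path and the chosen value are illustrative:

```bash
# Illustrative only: copy the template repo and fill in the new placeholder.
cp -r all_models/inflight_batcher_llm triton_model_repo
sed -i 's|\${exclude_input_in_output}|true|g' triton_model_repo/tensorrt_llm/config.pbtxt
```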

ci/L0_backend_trtllm/generate_engines.sh (35 additions, 33 deletions)

````diff
@@ -29,18 +29,23 @@ BASE_DIR=/opt/tritonserver/tensorrtllm_backend/ci/L0_backend_trtllm
 GPT_DIR=/opt/tritonserver/tensorrtllm_backend/tensorrt_llm/examples/gpt
 
 function build_base_model {
+    local NUM_GPUS=$1
     cd ${GPT_DIR}
     rm -rf gpt2 && git clone https://huggingface.co/gpt2-medium gpt2
     pushd gpt2 && rm pytorch_model.bin model.safetensors && wget -q https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin && popd
-    python3 hf_gpt_convert.py -i gpt2 -o ./c-model/gpt2 --tensor-parallelism 1 --storage-type float16
+    python3 hf_gpt_convert.py -p 8 -i gpt2 -o ./c-model/gpt2 --tensor-parallelism ${NUM_GPUS} --storage-type float16
     cd ${BASE_DIR}
 }
 
 function build_tensorrt_engine_inflight_batcher {
+    local NUM_GPUS=$1
     cd ${GPT_DIR}
+    local GPT_MODEL_DIR=./c-model/gpt2/${NUM_GPUS}-gpu/
+    local OUTPUT_DIR=inflight_${NUM_GPUS}_gpu/
     # ./c-model/gpt2/ must already exist (it will if build_base_model
     # has already been run)
-    python3 build.py --model_dir=./c-model/gpt2/1-gpu/ \
+    python3 build.py --model_dir="${GPT_MODEL_DIR}" \
+                     --world_size="${NUM_GPUS}" \
                      --dtype float16 \
                      --use_inflight_batching \
                      --use_gpt_attention_plugin float16 \
@@ -49,47 +54,44 @@ function build_tensorrt_engine_inflight_batcher {
                      --remove_input_padding \
                      --use_layernorm_plugin float16 \
                      --hidden_act gelu \
-                     --output_dir=inflight_single_gpu/
+                     --parallel_build \
+                     --output_dir="${OUTPUT_DIR}"
     cd ${BASE_DIR}
-
 }
 
-function build_tensorrt_engine_inflight_batcher_multi_gpu {
-    cd ${GPT_DIR}
-    python3 hf_gpt_convert.py -p 8 -i gpt2 -o ./c-model/gpt2 --tensor-parallelism 4 --storage-type float16
-    python3 build.py --model_dir=./c-model/gpt2/4-gpu/ \
-                     --world_size=4 \
-                     --dtype float16 \
-                     --use_inflight_batching \
-                     --use_gpt_attention_plugin float16 \
-                     --paged_kv_cache \
-                     --use_gemm_plugin float16 \
-                     --remove_input_padding \
-                     --use_layernorm_plugin float16 \
-                     --hidden_act gelu \
-                     --parallel_build \
-                     --output_dir=inflight_multi_gpu/
-    cd ${BASE_DIR}
+function install_trt_llm {
+    # Install CMake
+    bash /opt/tritonserver/tensorrtllm_backend/tensorrt_llm/docker/common/install_cmake.sh
+    export PATH="/usr/local/cmake/bin:${PATH}"
+
+    # PyTorch needs to be built from source for aarch64
+    ARCH="$(uname -i)"
+    if [ "${ARCH}" = "aarch64" ]; then TORCH_INSTALL_TYPE="src_non_cxx11_abi"; \
+    else TORCH_INSTALL_TYPE="pypi"; fi && \
+    (cd /opt/tritonserver/tensorrtllm_backend/tensorrt_llm &&
+        bash docker/common/install_pytorch.sh $TORCH_INSTALL_TYPE &&
+        python3 ./scripts/build_wheel.py --trt_root="${TRT_ROOT}" &&
+        pip3 install ./build/tensorrt_llm*.whl)
 }
 
 # Install TRT LLM
-# FIXME: Update the url
-pip install git+https://github.com/NVIDIA/TensorRT-LLM.git@${TENSORRTLLM_BACKEND_REPO_TAG}
-mkdir /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
-cp /opt/tritonserver/backends/tensorrtllm/* /usr/local/lib/python3.10/dist-packages/tensorrt_llm/libs/
-
-export LD_LIBRARY_PATH=/usr/local/tensorrt/lib/:$LD_LIBRARY_PATH
-export TRT_ROOT=/usr/local/tensorrt
+install_trt_llm
 
 # Generate the TRT_LLM model engines
-build_base_model
-build_tensorrt_engine_inflight_batcher
-build_tensorrt_engine_inflight_batcher_multi_gpu
+NUM_GPUS_TO_TEST=("1" "2" "4")
+for NUM_GPU in "${NUM_GPUS_TO_TEST[@]}"; do
+    AVAILABLE_GPUS=$(nvidia-smi -L | wc -l)
+    if [ "$AVAILABLE_GPUS" -lt "$NUM_GPU" ]; then
+        continue
+    fi
+
+    build_base_model "${NUM_GPU}"
+    build_tensorrt_engine_inflight_batcher "${NUM_GPU}"
+done
 
 # Move the TRT_LLM model engines to the CI directory
 mkdir engines
-mv ${GPT_DIR}/inflight_single_gpu engines/
-mv ${GPT_DIR}/inflight_multi_gpu engines/
+mv ${GPT_DIR}/inflight_*_gpu/ engines/
 
 # Move the tokenizer into the CI directory
 mkdir tokenizer
@@ -98,4 +100,4 @@ mv ${GPT_DIR}/gpt2/* tokenizer/
 # Now that the engines are generated, we should remove the
 # tensorrt_llm module to ensure the C++ backend tests are
 # not using it
-rm -rf /usr/local/lib/python3.10/dist-packages/tensorrt_llm
+pip3 uninstall -y torch tensorrt_llm
````
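A hedged usage sketch for the reworked script: it now builds one engine set per entry in `NUM_GPUS_TO_TEST`, skipping the GPU counts the machine cannot satisfy. The paths assume the backend container layout used by the script itself:

```bash
# Illustrative only: run the CI engine generation and see which engine
# directories were produced for the available GPU count.
cd /opt/tritonserver/tensorrtllm_backend/ci/L0_backend_trtllm
bash generate_engines.sh
ls engines/    # expect inflight_1_gpu/, plus inflight_2_gpu/ and inflight_4_gpu/ when enough GPUs are present
```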
