This document describes how to run multiple instances of a LLaMa model on a single GPU or on multiple GPUs of the same machine. The guide focuses on the following scenarios:
- Running multiple instances of a LLaMa model on a single GPU.
- Running multiple instances of a LLaMa model on multiple GPUs:
  a. Using Orchestrator mode.
  b. Using Leader mode.
- Set up the model repository as described in the LLaMa Guide.
- Increase the number of instances in the instance_group parameter of the tensorrt_llm model.
- Start the Triton server:
# Replace the <gpu> with the gpu you want to use for this model.
CUDA_VISIBLE_DEVICES=<gpu> tritonserver --model-repository `pwd`/llama_ifb &
This creates multiple instances of the tensorrt_llm model, all running on the same GPU.
Note
Running multiple instances of a single model is generally not recommended. If you choose to do this, you need to ensure the GPU has enough resources for multiple copies of a single model. The performance implications of running multiple models on the same GPU are unpredictable.
Note
For production deployments, make sure to adjust the max_tokens_in_paged_kv_cache parameter; otherwise you may run out of GPU memory, since by default TensorRT-LLM may use up to 90% of the free GPU memory for the KV cache of each model instance. Additionally, if you rely on kv_cache_free_gpu_mem_fraction, the memory allocated to each instance will depend on the order in which the instances are loaded.
- Run the test client to measure performance:
python3 tools/inflight_batcher_llm/end_to_end_test.py --dataset ci/L0_backend_trtllm/simple_data.json --max-input-len 500
If you plan to use the BLS version instead of the ensemble model, you might also need to adjust the number of model instances for the tensorrt_llm_bls model. The default value only allows a single request at a time for the whole pipeline, which might increase latency and reduce throughput.
- Kill the server:
pgrep tritonserver | xargs kill
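The order dependence mentioned in the note on kv_cache_free_gpu_mem_fraction can be illustrated with a little arithmetic. The sketch below is a simplified model, not TensorRT-LLM's actual allocator: it assumes that each instance, when it loads, reserves a fixed fraction of whatever GPU memory is still free, and it ignores the memory taken by weights and activations. The function name and the 80 GiB figure are hypothetical.

```python
def kv_cache_allocations(free_gib, fraction, num_instances):
    """Return the KV-cache size (GiB) each instance reserves, in load order.

    Simplified model (an assumption, not TensorRT-LLM's allocator): each
    instance takes `fraction` of the memory that is still free when it loads.
    """
    allocations = []
    for _ in range(num_instances):
        reserved = free_gib * fraction
        allocations.append(reserved)
        free_gib -= reserved  # later instances see less free memory
    return allocations


if __name__ == "__main__":
    # Two instances with fraction 0.5 on a hypothetical GPU with 80 GiB free.
    print(kv_cache_allocations(80.0, 0.5, 2))  # [40.0, 20.0]
```

Under this model the first instance ends up with twice the cache of the second, which is one reason an explicit max_tokens_in_paged_kv_cache tends to give more predictable results.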
Unlike other Triton backend models, the TensorRT-LLM backend does not support using the instance_group setting to determine the placement of model instances on different GPUs. In this section, we demonstrate how you can use Leader Mode and Orchestrator Mode to run multiple instances of a LLaMa model on different GPUs.
For this section, let's assume that we have four GPUs and the CUDA device ids are 0, 1, 2, and 3. We will be launching two instances of the LLaMa2-7b model with tensor parallelism equal to 2. The first instance will run on GPUs 0 and 1 and the second instance will run on GPUs 2 and 3.
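The GPU layout described above (four GPUs, tensor parallelism of 2, two instances) can be derived mechanically. The following is a minimal sketch; partition_gpus is a hypothetical helper and not part of the TensorRT-LLM backend. It assumes instances are placed on consecutive device ids.

```python
def partition_gpus(gpu_ids, tp_size):
    """Split a list of CUDA device ids into consecutive groups of `tp_size`."""
    if tp_size <= 0:
        raise ValueError("tp_size must be positive")
    if len(gpu_ids) % tp_size != 0:
        raise ValueError("number of GPUs must be a multiple of tp_size")
    return [gpu_ids[i:i + tp_size] for i in range(0, len(gpu_ids), tp_size)]


if __name__ == "__main__":
    for index, group in enumerate(partition_gpus([0, 1, 2, 3], tp_size=2)):
        # Comma-separated form, as used for CUDA_VISIBLE_DEVICES in leader mode.
        print(index, ",".join(str(g) for g in group))
```

Each group corresponds to one model instance: GPUs 0 and 1 for the first, GPUs 2 and 3 for the second.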
- Create the engines:
# Update if the model is not available in huggingface cache
export HF_LLAMA_MODEL=`python3 -c "from pathlib import Path; from huggingface_hub import hf_hub_download; print(Path(hf_hub_download('meta-llama/Llama-2-7b-hf', filename='config.json')).parent)"`
export UNIFIED_CKPT_PATH=/tmp/ckpt/llama/7b-2tp-2gpu/
export ENGINE_PATH=/tmp/engines/llama/7b-2tp-2gpu/
# Create the checkpoint
python tensorrt_llm/examples/llama/convert_checkpoint.py --model_dir ${HF_LLAMA_MODEL} \
--output_dir ${UNIFIED_CKPT_PATH} \
--dtype float16 \
--tp_size 2
# Build the engines
trtllm-build --checkpoint_dir ${UNIFIED_CKPT_PATH} \
--remove_input_padding enable \
--gpt_attention_plugin float16 \
--context_fmha enable \
--gemm_plugin float16 \
--output_dir ${ENGINE_PATH} \
--kv_cache_type paged \
--max_batch_size 64
- Set up the model repository:
# Setup the model repository for the first instance.
cp all_models/inflight_batcher_llm/ llama_ifb -r
python3 tools/fill_template.py -i llama_ifb/preprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,preprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/postprocessing/config.pbtxt tokenizer_dir:${HF_LLAMA_MODEL},triton_max_batch_size:64,postprocessing_instance_count:1
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm_bls/config.pbtxt triton_max_batch_size:64,decoupled_mode:False,bls_instance_count:1,accumulate_tokens:False,logits_datatype:TYPE_FP32
python3 tools/fill_template.py -i llama_ifb/ensemble/config.pbtxt triton_max_batch_size:64,logits_datatype:TYPE_FP32
python3 tools/fill_template.py -i llama_ifb/tensorrt_llm/config.pbtxt triton_backend:tensorrtllm,triton_max_batch_size:64,decoupled_mode:False,max_beam_width:1,engine_dir:${ENGINE_PATH},max_tokens_in_paged_kv_cache:2560,max_attention_window_size:2560,kv_cache_free_gpu_mem_fraction:0.5,exclude_input_in_output:True,enable_kv_cache_reuse:False,batching_strategy:inflight_fused_batching,max_queue_delay_microseconds:0,encoder_input_features_data_type:TYPE_FP16,logits_datatype:TYPE_FP32
For leader mode, we will use two separate mpirun commands, one per model instance. Each command starts one Triton server process per GPU (4 Triton server processes in total). We also need a reverse proxy in front of the two instances to load balance the requests between them.
3a. Launch the servers:
CUDA_VISIBLE_DEVICES=0,1 python3 scripts/launch_triton_server.py --world_size 2 --model_repo=llama_ifb/ --http_port 8000 --grpc_port 8001 --metrics_port 8004
CUDA_VISIBLE_DEVICES=2,3 python3 scripts/launch_triton_server.py --world_size 2 --model_repo=llama_ifb/ --http_port 8002 --grpc_port 8003 --metrics_port 8005
4a. Install NGINX:
apt update
apt install nginx -y
5a. Set up the NGINX configuration and store it in /etc/nginx/sites-available/tritonserver:
upstream tritonserver {
    server localhost:8000;
    server localhost:8002;
}

server {
    listen 8080;

    location / {
        proxy_pass http://tritonserver;
    }
}
6a. Create a symlink and restart NGINX to enable the configuration:
ln -s /etc/nginx/sites-available/tritonserver /etc/nginx/sites-enabled/tritonserver
service nginx restart
7a. Run the test client to measure performance:
pip3 install tritonclient[all]
# Test the load on all the servers
python3 tools/inflight_batcher_llm/end_to_end_test.py --dataset ci/L0_backend_trtllm/simple_data.json --max-input-len 500 -u localhost:8080
# Test the load on one of the servers
python3 tools/inflight_batcher_llm/end_to_end_test.py --dataset ci/L0_backend_trtllm/simple_data.json --max-input-len 500 -u localhost:8000
8a. Kill the server:
pgrep mpirun | xargs kill
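The upstream block in step 5a lists two servers without naming a balancing method, in which case NGINX distributes requests round-robin. The sketch below only illustrates the resulting selection order for requests sent to port 8080. It is not how NGINX is implemented, and it ignores weights, failures, and connection reuse. The round_robin helper is hypothetical.

```python
from itertools import cycle, islice

# HTTP endpoints of the two leader servers, as listed in the upstream block.
BACKENDS = ["localhost:8000", "localhost:8002"]


def round_robin(backends, num_requests):
    """Return the backend chosen for each of `num_requests` requests."""
    return list(islice(cycle(backends), num_requests))


if __name__ == "__main__":
    for request_id, backend in enumerate(round_robin(BACKENDS, 4)):
        print(f"request {request_id} -> {backend}")
```

With two backends, consecutive requests alternate between the instance on GPUs 0 and 1 and the instance on GPUs 2 and 3.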
With orchestrator mode, there are two options for running multiple instances of a single model:
- Creating separate Triton models.
- Starting from the 24.08 release, you can use the Triton instance_group field to specify the number of TRT-LLM model instances. With that option, the load balancing decision is made in Triton core.
3b. Create a copy of the tensorrt_llm model:
cp llama_ifb/tensorrt_llm llama_ifb/tensorrt_llm_2 -r
4b. Modify the gpu_device_ids field in the config file to specify which GPUs should be used by each model:
sed -i 's/\${gpu_device_ids}/0,1/g' llama_ifb/tensorrt_llm/config.pbtxt
sed -i 's/\${gpu_device_ids}/2,3/g' llama_ifb/tensorrt_llm_2/config.pbtxt
sed -i 's/name: "tensorrt_llm"/name: "tensorrt_llm_2"/g' llama_ifb/tensorrt_llm_2/config.pbtxt
Note
If you want to use the ensemble or BLS models, you have to create a copy of the ensemble and BLS models as well and modify the "tensorrt_llm" model name to "tensorrt_llm_2" in the config file.
5b. Launch the server:
python3 scripts/launch_triton_server.py --multi-model --model_repo=llama_ifb/
Alternatively, you can start all MPI ranks at once and avoid dynamic process spawning by using the --disable-spawn-processes flag. The config file must specify which ranks each model should use:
sed -i 's/\${participant_ids}/1,2/g' llama_ifb/tensorrt_llm/config.pbtxt
sed -i 's/\${participant_ids}/3,4/g' llama_ifb/tensorrt_llm_2/config.pbtxt
Note that rank 0 is reserved for the orchestrator, so the world size is 5: one orchestrator rank plus two ranks for each of the two models.
python3 scripts/launch_triton_server.py --multi-model --model_repo=llama_ifb/ --disable-spawn-processes --world_size=5
6b. Run the test client to measure performance:
pip3 install tritonclient[all]
# We will only benchmark the core tensorrt_llm models.
python3 tools/inflight_batcher_llm/benchmark_core_model.py --max-input-len 500 \
    dataset --dataset ci/L0_backend_trtllm/simple_data.json \
    --tokenizer-dir $HF_LLAMA_MODEL \
    --tensorrt-llm-model-name tensorrt_llm \
    --tensorrt-llm-model-name tensorrt_llm_2
7b. Kill the server:
pgrep mpirun | xargs kill
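The rank layout used with --disable-spawn-processes in step 5b (participant ids 1,2 and 3,4, world size 5) follows a simple pattern. The sketch below reproduces it under the assumption that ranks are handed out consecutively, starting at 1 because rank 0 belongs to the orchestrator. assign_participant_ids is a hypothetical helper, not part of the backend.

```python
def assign_participant_ids(num_models, tp_size):
    """Assign MPI ranks to models, keeping rank 0 for the orchestrator.

    Returns (participant_ids, world_size), where participant_ids holds one
    comma-separated string per model.
    """
    participant_ids = []
    next_rank = 1  # rank 0 is reserved for the orchestrator
    for _ in range(num_models):
        ranks = range(next_rank, next_rank + tp_size)
        participant_ids.append(",".join(str(r) for r in ranks))
        next_rank += tp_size
    # After the loop, next_rank equals 1 + num_models * tp_size.
    return participant_ids, next_rank


if __name__ == "__main__":
    ids, world_size = assign_participant_ids(num_models=2, tp_size=2)
    print(ids, world_size)  # ['1,2', '3,4'] 5
```

The returned strings match the values substituted for ${participant_ids}, and the second value matches the --world_size argument.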
To use Triton core's load balancing for multiple instances, you can increase the number of instances in the instance_group field and use the gpu_device_ids parameter to specify which GPUs will be used by each model instance.
For example, if you're running a TP=2 model on a 4-GPU system and you want to run one instance on GPUs 0 and 1 and the other instance on GPUs 2 and 3, you can use the following model configuration:
instance_group [
{kind: KIND_CPU, count: 2}
]
parameters: {
key: "gpu_device_ids"
value: {
string_value: "0,1;2,3"
}
}
Please note that the number of sets of GPU device ids must equal the number of instances.
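The constraint above can be checked before the server is started. The sketch below is a hypothetical pre-check and not the backend's own validation: it assumes the format shown in the example, with sets separated by ";" and device ids within a set separated by ",".

```python
def parse_gpu_device_ids(value, instance_count):
    """Parse a gpu_device_ids string such as "0,1;2,3" into lists of ints.

    Raises ValueError when the number of ';'-separated sets differs from the
    instance count given in instance_group.
    """
    groups = [[int(dev) for dev in group.split(",")] for group in value.split(";")]
    if len(groups) != instance_count:
        raise ValueError(
            f"expected {instance_count} sets of GPU ids, found {len(groups)}"
        )
    return groups


if __name__ == "__main__":
    # Matches the configuration above: count: 2 and "0,1;2,3".
    print(parse_gpu_device_ids("0,1;2,3", instance_count=2))
```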
The table below summarizes the differences between the orchestrator mode and leader mode:
| | Orchestrator Mode (Separate Models) | Orchestrator Mode (Triton Load Balancing) | Leader Mode |
|---|---|---|---|
| Requires Reverse Proxy | ❌ | ❌ | ✅ |
| Requires Client Changes | ✅ | ❌ | ❌ |
Orchestrator mode by default uses MPI_Comm_Spawn to create the child processes. If MPI_Comm_Spawn is used, it is not possible to distribute the model across multiple nodes.
It is also possible to use orchestrator mode with MPI processes that have been pre-spawned. To do that, set --disable-spawn-processes when using the launch_triton_server.py script, or export TRTLLM_ORCHESTRATOR_SPAWN_PROCESSES=0. In this mode, it is possible to run the server across different nodes.
To use orchestrator mode itself, set the --multi-model flag when using the launch_triton_server.py script, or export TRTLLM_ORCHESTRATOR=1.
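For readers who start the server from their own tooling rather than launch_triton_server.py, the two environment variables named above can be prepared as follows. This is a minimal sketch based only on the equivalences stated in this section. The orchestrator_env helper is hypothetical, and how launch_triton_server.py handles these flags internally is not assumed here.

```python
import os


def orchestrator_env(multi_model, disable_spawn_processes, base_env=None):
    """Return a copy of the environment with the orchestrator variables set."""
    env = dict(os.environ if base_env is None else base_env)
    if multi_model:
        # Counterpart of the --multi-model flag.
        env["TRTLLM_ORCHESTRATOR"] = "1"
    if disable_spawn_processes:
        # Counterpart of the --disable-spawn-processes flag.
        env["TRTLLM_ORCHESTRATOR_SPAWN_PROCESSES"] = "0"
    return env


if __name__ == "__main__":
    env = orchestrator_env(multi_model=True, disable_spawn_processes=True)
    print(env["TRTLLM_ORCHESTRATOR"], env["TRTLLM_ORCHESTRATOR_SPAWN_PROCESSES"])
```

The returned dictionary can be passed as the env argument of subprocess.run when launching the MPI job.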