
How to integrate Multi-LoRA Setup at Inference with NVIDIA Triton / TensorRT-LLM? I built the engine... #2371

Open
@JoJoLev

Description

I built the engine with two separate LoRA adapters on top of the base Llama 3.1 model. The build output is rank0.engine, config.json, and a lora folder with the following structure:
lora
├── 0
│   ├── adapter_config.json
│   └── adapter_model.safetensors
└── 1
    ├── adapter_config.json
    └── adapter_model.safetensors

Is this expected? I had figured there would be rank engines. I passed the LoRA directories to the engine build like this:
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_tp1 \
    --output_dir /opt/tensorrt_llm_engine \
    --gemm_plugin auto \
    --lora_plugin auto \
    --max_batch_size 8 \
    --max_input_len 512 \
    --max_seq_len 562 \
    --lora_dir "/opt/lora_1" "/opt/lora_2" \
    --max_lora_rank 8 \
    --lora_target_modules attn_q attn_k attn_v
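For context on the inference side, here is a rough sketch of how one of the built adapters might be selected per request through Triton's gRPC client. This is only an illustration, not a confirmed recipe: the model name (tensorrt_llm), the tensor names (input_ids, input_lengths, request_output_len, lora_task_id, output_ids), their dtypes, the endpoint, and the task-id-to-adapter mapping are all assumptions based on my reading of the tensorrtllm_backend LoRA docs and should be checked against the config.pbtxt of the deployed model. The backend docs also describe sending lora_weights and lora_config on the first request for a given task id; that step is omitted here.

```python
# Hypothetical sketch only: selecting a LoRA adapter per request via Triton's
# gRPC API. Tensor names, dtypes, model name, and the task-id mapping are
# assumptions -- verify against the deployed tensorrt_llm model's config.pbtxt.
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

client = grpcclient.InferenceServerClient("localhost:8001")  # assumed gRPC endpoint

def make_input(name, arr):
    # Wrap a numpy array as a Triton InferInput with matching shape/dtype.
    t = grpcclient.InferInput(name, list(arr.shape), np_to_triton_dtype(arr.dtype))
    t.set_data_from_numpy(arr)
    return t

input_ids = np.array([[1, 15043, 3186]], dtype=np.int32)  # pre-tokenized prompt (example ids)
inputs = [
    make_input("input_ids", input_ids),
    make_input("input_lengths", np.array([[input_ids.shape[1]]], dtype=np.int32)),
    make_input("request_output_len", np.array([[64]], dtype=np.int32)),
    # Assumed mapping: task id 0 -> /opt/lora_1, task id 1 -> /opt/lora_2.
    # Per the backend docs, the first request for a given task id may also need
    # lora_weights / lora_config tensors so the adapter gets cached server-side.
    make_input("lora_task_id", np.array([[0]], dtype=np.uint64)),
]

result = client.infer("tensorrt_llm", inputs)
print(result.as_numpy("output_ids"))
```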

Any advice is appreciated.

Metadata

Labels

LLM API/Workflow: High-level LLM Python API & tools (e.g., trtllm-llmapi-launch) for TRTLLM inference/workflows.
question: Further information is requested
triaged: Issue has been triaged by maintainers