
How to integrate Multi-LoRA Setup at Inference with NVIDIA Triton / TensorRT-LLM? I built the engine... #2371

Open
@JoJoLev

Description

I built the engine with two separate LoRA adapters on top of the base Llama 3.1 model. The build output is rank0.engine, config.json, and a lora folder with the following structure:
lora
├── 0
│   ├── adapter_config.json
│   └── adapter_model.safetensors
└── 1
    ├── adapter_config.json
    └── adapter_model.safetensors

Is this expected? I had figured there would be rank engines. I passed the LoRA directories to the engine build like this:
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_tp1 \
    --output_dir /opt/tensorrt_llm_engine \
    --gemm_plugin auto \
    --lora_plugin auto \
    --max_batch_size 8 \
    --max_input_len 512 \
    --max_seq_len 562 \
    --lora_dir "/opt/lora_1" "/opt/lora_2" \
    --max_lora_rank 8 \
    --lora_target_modules attn_q attn_k attn_v
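For context on the inference side, here is a rough sketch of how one of the built adapters might be selected per request through Triton's gRPC client. This is only an illustration, not a confirmed recipe: the model name (tensorrt_llm), the tensor names (input_ids, input_lengths, request_output_len, lora_task_id, output_ids), their dtypes, the endpoint, and the task-id-to-adapter mapping are all assumptions based on my reading of the tensorrtllm_backend LoRA docs and should be checked against the config.pbtxt of the deployed model. The backend docs also describe sending lora_weights and lora_config on the first request for a given task id; that step is omitted here.

```python
# Hypothetical sketch only: selecting a LoRA adapter per request via Triton's
# gRPC API. Tensor names, dtypes, model name, and the task-id mapping are
# assumptions -- verify against the deployed tensorrt_llm model's config.pbtxt.
import numpy as np
import tritonclient.grpc as grpcclient
from tritonclient.utils import np_to_triton_dtype

client = grpcclient.InferenceServerClient("localhost:8001")  # assumed gRPC endpoint

def make_input(name, arr):
    # Wrap a numpy array as a Triton InferInput with matching shape/dtype.
    t = grpcclient.InferInput(name, list(arr.shape), np_to_triton_dtype(arr.dtype))
    t.set_data_from_numpy(arr)
    return t

input_ids = np.array([[1, 15043, 3186]], dtype=np.int32)  # pre-tokenized prompt (example ids)
inputs = [
    make_input("input_ids", input_ids),
    make_input("input_lengths", np.array([[input_ids.shape[1]]], dtype=np.int32)),
    make_input("request_output_len", np.array([[64]], dtype=np.int32)),
    # Assumed mapping: task id 0 -> /opt/lora_1, task id 1 -> /opt/lora_2.
    # Per the backend docs, the first request for a given task id may also need
    # lora_weights / lora_config tensors so the adapter gets cached server-side.
    make_input("lora_task_id", np.array([[0]], dtype=np.uint64)),
]

result = client.infer("tensorrt_llm", inputs)
print(result.as_numpy("output_ids"))
```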

Any advice is appreciated.

Metadata

Labels

LLM API/Workflow: High-level LLM Python API & tools (e.g., trtllm-llmapi-launch) for TRTLLM inference/workflows.
question: Further information is requested
triaged: Issue has been triaged by maintainers