diff --git a/docs/lora_warmup.md b/docs/lora_warmup.md
new file mode 100644
index 00000000..6d331e66
--- /dev/null
+++ b/docs/lora_warmup.md
@@ -0,0 +1,231 @@
+# LoRA Warmup Example with BFloat16
+
+This document provides an example of loading LoRA weights and configs into the backend model via warmup requests, so that inference can use the LoRA adapters by passing only a `lora_task_id`. This approach avoids sending the LoRA weights or config with each request made to the backend, and it allows `bfloat16` weights to be used without having to express them in a `python` backend model (such as `preprocessing`), where numpy conversion does not support `bfloat16`.
+
+This example assumes that the user has pre-trained a model and has the LoRA weights and config available from the training process as `.safetensors` files and a `config.json` file.
+
+## Compile Base Model
+
+The base model should be compiled according to the guidance provided in [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM).
+
+## Prepare LoRA Weights as Warmup Files
+
+1. Convert to `.bin` format
+
+   The provided conversion script, [hf_lora_convert](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/hf_lora_convert.py), expects the LoRA weights to exist as `adapter_config.json` and `adapter_model.bin`. If the weights were stored from training as `adapter_model.safetensors`, the following script can be used to convert them to the expected format.
+
+   ```python
+   import os
+
+   import torch
+   from safetensors.torch import load_file
+
+   ADAPTER_DIR = "/path/to/adapter"  # placeholder: directory containing adapter_model.safetensors
+
+   torch.save(
+       load_file(os.path.join(ADAPTER_DIR, "adapter_model.safetensors")),
+       os.path.join(ADAPTER_DIR, "adapter_model.bin"),
+   )
+   ```
+
+2. Prepare `config` and `weights` for TensorRT-LLM
+
+   The [hf_lora_convert](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/hf_lora_convert.py) script can be used to convert the weights and config to the format expected by TensorRT-LLM.
+
+   As of v0.10.0, the conversion script saves outputs in the `.npy` format only. This can be changed by setting `write_npy=False` in the [hf_lora_convert.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/hf_lora_convert.py#L142) file.
+
+   After allowing the output to be saved as `.bin`, run the conversion:
+
+   ```python
+   from hf_lora_convert import convert_hf_model
+
+   ADAPTER_DIR = "/path/to/adapter"  # placeholder: directory containing adapter_model.bin
+   DTYPE = "bfloat16"  # specify the adapter dtype
+
+   convert_hf_model(ADAPTER_DIR, dtype=DTYPE, out_dir=ADAPTER_DIR)
+   ```
+
+   This will save two output files, `adapter_model.bin` and `adapter_config.bin`.
+
+   These files can be used as warmup inputs to the backend model.
+
+## Configure Warmup for the `tensorrt_llm` Model
+
+After obtaining the warmup LoRA weights and config from the previous steps, a warmup folder should be added to the `tensorrt_llm` model directory.
+
+1. Create the warmup folder
+
+   Files for the warmup will be added within the model repository, which will be served by Triton Inference Server:
+
+   ```bash
+   model-repository/
+     ensemble/
+     preprocessing/
+     postprocessing/
+     tensorrt_llm/
+       - config.pbtxt
+       - 1/
+       - warmup/
+         - Files will be added here
+   ```
+
+1. Create the warmup files
+
+   ```python
+   import os
+   import struct
+
+   WARMUP_DIR = "model-repository/tensorrt_llm/warmup"
+
+   # Define warmup input ids (example values)
+   input_ids = [123, 456, 1, 33]
+
+   # Write the ids to a binary file as little-endian int32 values, matching
+   # the TYPE_INT32 data type of the input_ids tensor.
+   with open(os.path.join(WARMUP_DIR, "input_ids"), "wb") as f:
+       for i in input_ids:
+           f.write(struct.pack('<i', i))
+   ```
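+
+   In addition to `input_ids`, a warmup request must supply the model's other
+   required tensors. The snippet below sketches the same raw-binary approach
+   for the scalar inputs; the tensor names and dtypes (`input_lengths` and
+   `request_output_len` as int32, `lora_task_id` as uint64) are assumptions
+   based on the common inflight-batcher model configuration and should be
+   verified against your model's `config.pbtxt`.
+
+   ```python
+   import os
+   import struct
+
+   WARMUP_DIR = "model-repository/tensorrt_llm/warmup"
+
+   # Scalar warmup tensors written as raw little-endian binary files.
+   # Names and dtypes are assumptions; check them against config.pbtxt.
+   scalars = {
+       "input_lengths": ("<i", 4),        # int32: length of input_ids above
+       "request_output_len": ("<i", 64),  # int32: tokens to generate in warmup
+       "lora_task_id": ("<Q", 0),         # uint64: id that selects this adapter
+   }
+
+   for name, (fmt, value) in scalars.items():
+       with open(os.path.join(WARMUP_DIR, name), "wb") as f:
+           f.write(struct.pack(fmt, value))
+   ```
+
+   The `adapter_model.bin` and `adapter_config.bin` files produced by the
+   conversion step can be copied into the same `warmup/` folder to serve as
+   the `lora_weights` and `lora_config` inputs.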
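+
+1. Add the warmup to `config.pbtxt`
+
+   Triton replays warmup requests at model load time when a `model_warmup`
+   stanza is present in the model configuration. The following is an
+   illustrative sketch, not a definitive configuration: the `dims` values are
+   placeholders that must match the actual shapes of your warmup files, and
+   `TYPE_BF16` is assumed for the weights to match the `bfloat16` conversion
+   above.
+
+   ```
+   model_warmup [
+     {
+       name: "lora_warmup"
+       batch_size: 1
+       inputs: {
+         key: "input_ids"
+         value: {
+           data_type: TYPE_INT32
+           dims: [ 4 ]
+           input_data_file: "input_ids"
+         }
+       }
+       inputs: {
+         key: "input_lengths"
+         value: {
+           data_type: TYPE_INT32
+           dims: [ 1 ]
+           input_data_file: "input_lengths"
+         }
+       }
+       inputs: {
+         key: "request_output_len"
+         value: {
+           data_type: TYPE_INT32
+           dims: [ 1 ]
+           input_data_file: "request_output_len"
+         }
+       }
+       inputs: {
+         key: "lora_task_id"
+         value: {
+           data_type: TYPE_UINT64
+           dims: [ 1 ]
+           input_data_file: "lora_task_id"
+         }
+       }
+       inputs: {
+         key: "lora_weights"
+         value: {
+           data_type: TYPE_BF16
+           dims: [ 32, 1024 ]
+           input_data_file: "adapter_model.bin"
+         }
+       }
+       inputs: {
+         key: "lora_config"
+         value: {
+           data_type: TYPE_INT32
+           dims: [ 32, 3 ]
+           input_data_file: "adapter_config.bin"
+         }
+       }
+     }
+   ]
+   ```
+
+   Triton resolves each `input_data_file` relative to the model's `warmup/`
+   subdirectory, so the file names above refer to the files created in the
+   previous steps.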
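+
+## Send Requests Using Only `lora_task_id`
+
+Once the warmup has run, the backend has the adapter cached and a request can
+select it by id alone, with no `lora_weights` or `lora_config` tensors
+attached. The following client is a minimal sketch, assuming a server
+listening on `localhost:8000` and the same tensor names as above.
+
+```python
+import numpy as np
+import tritonclient.http as httpclient
+
+
+def make_input(name, dtype, data):
+    # Build a Triton input tensor from a numpy array.
+    tensor = httpclient.InferInput(name, list(data.shape), dtype)
+    tensor.set_data_from_numpy(data)
+    return tensor
+
+
+client = httpclient.InferenceServerClient(url="localhost:8000")
+
+inputs = [
+    make_input("input_ids", "INT32", np.array([[123, 456, 1, 33]], dtype=np.int32)),
+    make_input("input_lengths", "INT32", np.array([[4]], dtype=np.int32)),
+    make_input("request_output_len", "INT32", np.array([[64]], dtype=np.int32)),
+    # The adapter weights were loaded during warmup; only the id is needed.
+    make_input("lora_task_id", "UINT64", np.array([[0]], dtype=np.uint64)),
+]
+
+result = client.infer(model_name="tensorrt_llm", inputs=inputs)
+print(result.as_numpy("output_ids"))
+```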