Commit ad13938

Added documentation of using warmups to initialize lora weights
1 parent ada5799 commit ad13938

File tree

1 file changed: +231 −0

docs/lora_warmup.md

# LoRA Warmup Example with BFloat16

This document provides an example of initializing LoRA weights and configs as warmups for the backend model so that inference can use LoRA adapters with only a `lora_task_id`. This approach avoids having to include LoRA weights or configs in the requests made to the backend, and it allows bfloat16 weights to be used without having to express them in a `python` backend model (such as `preprocessing`), where numpy conversion does not support `bfloat16`.

This example assumes that the user has pre-trained a model and has the LoRA weights and configs available from the training process as `.safetensors` files and a `config.json` file.

## Compile Base Model

The base model should be compiled according to the guidance provided in [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM). For LoRA adapters to be usable at runtime, the engine must be built with the LoRA plugin enabled.

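A rough illustration of such a build command is shown below. The exact flags depend on your TensorRT-LLM version and model; the checkpoint paths, dtype, and `--max_lora_rank` value here are placeholders, so consult the TensorRT-LLM LoRA documentation for your version.

```bash
# Illustrative only: enable the LoRA plugin when building the engine.
# Paths and values are placeholders; check the TensorRT-LLM docs for the exact flags.
trtllm-build \
    --checkpoint_dir /path/to/converted_base_checkpoint \
    --output_dir /path/to/engine_dir \
    --gemm_plugin bfloat16 \
    --lora_plugin bfloat16 \
    --max_lora_rank 64
```
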
## Prepare LoRA Weights as Warmup Files

1. Convert to `.bin` format

The provided conversion script, [hf_lora_convert](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/hf_lora_convert.py), expects the LoRA weights to exist as `adapter_config.json` and `adapter_model.bin`. If the weights were stored from training as `adapter_model.safetensors`, the following script can be used to convert them to the expected format.

```python
import os

import torch
from safetensors.torch import load_file

ADAPTER_DIR = "<directory for adapter checkpoint / weights>"

# Load the safetensors checkpoint and re-save it in torch's .bin format
torch.save(
    load_file(os.path.join(ADAPTER_DIR, "adapter_model.safetensors")),
    os.path.join(ADAPTER_DIR, "adapter_model.bin"),
)
```

2. Prepare `config` and `weights` for TensorRT-LLM

The [hf_lora_convert](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/hf_lora_convert.py) script can be used to convert the weights and config to the format expected by TensorRT-LLM.

As of v0.10.0 the conversion script saves outputs in the `.npy` format only. This can be changed by setting `write_npy=False` in the [hf_lora_convert.py](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/hf_lora_convert.py#L142) file.

After allowing the output to be saved as `.bin`, run the conversion:

```python
from hf_lora_convert import convert_hf_model

ADAPTER_DIR = "<directory for adapter checkpoint / weights>"
DTYPE = "bfloat16"  # Specify the adapter dtype

# Convert the adapter weights and config into TensorRT-LLM's expected format
convert_hf_model(ADAPTER_DIR, dtype=DTYPE, out_dir=ADAPTER_DIR)
```

This will produce two output files: `adapter_model.bin` and `adapter_config.bin`.

These files can be used as warmup inputs to the backend model.

## Configure Warmup for the `tensorrt_llm` Model

After obtaining the warmup LoRA weights and configs from the previous steps, add a warmup folder to the `tensorrt_llm` model directory.

1. Create the warmup folder

Files for the warmup will be added within the model repository that will be served with Triton Inference Server:

```bash
model-repository/
  ensemble/
  preprocessing/
  postprocessing/
  tensorrt_llm/
    - config.pbtxt
    - 1/
    - warmup/
      - Files will be added here
```

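If needed, the directory can simply be created inside the `tensorrt_llm` model directory (the path below is illustrative):

```bash
# Create the warmup directory next to config.pbtxt (illustrative path)
mkdir -p model-repository/tensorrt_llm/warmup
```
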
2. Create the warmup files

```python
import os
import struct

WARMUP_DIR = "<path to warmup dir>"

# Define warmup input ids (example values)
input_ids = [123, 456, 1, 33]

# Write the input ids to a binary file
with open(os.path.join(WARMUP_DIR, "input_ids"), "wb") as f:
    for i in input_ids:
        f.write(struct.pack('<i', i))  # '<i' means little-endian int32

# Write the input length to a binary file
input_lengths = len(input_ids)
with open(os.path.join(WARMUP_DIR, "input_lengths"), "wb") as f:
    f.write(struct.pack('<i', input_lengths))

# Save end_id
end_id = 128001  # Will vary based on the tokenizer used
with open(os.path.join(WARMUP_DIR, "end_id"), "wb") as f:
    f.write(struct.pack('<i', end_id))

# Specify the output length (using a small value to speed up warmup)
request_output_len = 3
with open(os.path.join(WARMUP_DIR, "output_lengths"), "wb") as f:
    f.write(struct.pack('<i', request_output_len))

# Specify the beam width
beam_width = 3
with open(os.path.join(WARMUP_DIR, "beam_width"), "wb") as f:
    f.write(struct.pack('<i', beam_width))

# Specify the lora_task_id(s), one file per adapter
n_adapters = 3
for lora_task_id in range(n_adapters):
    with open(os.path.join(WARMUP_DIR, f"lora_id_{lora_task_id}"), "wb") as f:
        f.write(struct.pack('<q', lora_task_id))  # '<q' means little-endian int64
```

The above script creates the necessary files for warmup. The `input_ids` should be updated to reflect the input ids that will be used for warmup, and `end_id` should match the end id used by the tokenizer. The `request_output_len` and `beam_width` should be set to the desired warmup values and must be compatible with the compilation parameters used for the base model. The `n_adapters` should be set to the number of adapters that will be used for warmup.

The converted `adapter_model.bin` and `adapter_config.bin` files should be copied to the warmup directory and renamed for each adapter being used (see the copy sketch after the listing below). For this example we assume there are 3 adapters, so after renaming the `warmup` directory contains:

```bash
warmup/
  - input_ids
  - input_lengths
  - end_id
  - output_lengths
  - beam_width
  - lora_id_0
  - lora_id_1
  - lora_id_2
  - adapter_model_0.bin
  - adapter_config_0.bin
  - adapter_model_1.bin
  - adapter_config_1.bin
  - adapter_model_2.bin
  - adapter_config_2.bin
```

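A minimal sketch of this copy step, assuming the converted files for each adapter live in hypothetical `adapter_0/`, `adapter_1/`, and `adapter_2/` directories:

```bash
# Copy and rename the converted files for each adapter (source paths are illustrative)
for i in 0 1 2; do
    cp "adapter_${i}/adapter_model.bin"  "model-repository/tensorrt_llm/warmup/adapter_model_${i}.bin"
    cp "adapter_${i}/adapter_config.bin" "model-repository/tensorrt_llm/warmup/adapter_config_${i}.bin"
done
```
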
3. Update the model `config.pbtxt`

The `config.pbtxt` file for the `tensorrt_llm` model can then be updated to include the warmup configuration.

The dimensions of each adapter must be known in order to provide shapes in the configuration. They can be determined by inspecting the converted `adapter_model.bin` and `adapter_config.bin` files.

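One way to inspect them is sketched below, assuming the converted files are already in the warmup directory, that `adapter_config.bin` holds int32 triplets (one row per LoRA module, matching the `[N, 3]` `lora_config` shape used below), and that `adapter_model.bin` holds bfloat16 weights (2 bytes per element) with one row per config entry; the directory path is a placeholder.

```python
import os

import numpy as np

WARMUP_DIR = "<path to warmup dir>"

# lora_config is stored as int32 values, 3 per row (assumed layout)
config = np.fromfile(os.path.join(WARMUP_DIR, "adapter_config_0.bin"), dtype=np.int32)
config = config.reshape(-1, 3)
print("lora_config dims:", list(config.shape))  # e.g. [224, 3]

# lora_weights has one row per config entry; bfloat16 elements are 2 bytes each
n_rows = config.shape[0]
n_bytes = os.path.getsize(os.path.join(WARMUP_DIR, "adapter_model_0.bin"))
print("lora_weights dims:", [n_rows, n_bytes // (2 * n_rows)])  # e.g. [224, 589824]
```
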
The following is an example of the `config.pbtxt` file with the warmup configuration added:

```pbtxt
model_warmup [
  {
    name: "lora_0_warmup"
    batch_size: 1
    inputs: {
      key: "lora_task_id"
      value: {
        data_type: TYPE_UINT64
        dims: [ 1 ]
        input_data_file: "lora_id_0"
      }
    }
    inputs: {
      key: "lora_weights"
      value: {
        data_type: TYPE_BF16 # This should match the datatype of the adapter
        dims: [ 224, 589824 ] # This should match the dimensions of the adapter
        input_data_file: "adapter_model_0.bin"
      }
    }
    inputs: {
      key: "end_id"
      value: {
        data_type: TYPE_UINT32
        dims: [ 1 ]
        input_data_file: "end_id"
      }
    }
    inputs: {
      key: "lora_config"
      value: {
        data_type: TYPE_INT32
        dims: [ 224, 3 ] # This should match the dimensions of the adapter
        input_data_file: "adapter_config_0.bin"
      }
    }
    inputs: {
      key: "input_ids"
      value: {
        data_type: TYPE_INT32
        dims: [ 4 ]
        input_data_file: "input_ids"
      }
    }
    inputs: {
      key: "input_lengths"
      value: {
        data_type: TYPE_INT32
        dims: [ 1 ]
        input_data_file: "input_lengths"
      }
    }
    inputs: {
      key: "request_output_len"
      value: {
        data_type: TYPE_UINT32
        dims: [ 1 ]
        input_data_file: "output_lengths"
      }
    }
    inputs: {
      key: "beam_width"
      value: {
        data_type: TYPE_UINT32
        dims: [ 1 ]
        input_data_file: "beam_width"
      }
    }
  },
  ... # repeat for other two adapters
]
```

4. Start up and call the model

After the model has been warmed up using this process, requests can be made in the normal Triton Inference Server environment while passing only the `lora_task_id` to the model. The model will use the LoRA weights associated with that `lora_task_id`, as loaded during warmup, to perform inference.

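As an illustrative request (a sketch assuming the standard `ensemble` model from the TensorRT-LLM backend is deployed and Triton's HTTP generate endpoint is listening on port 8000; field names may differ in your deployment):

```bash
# Inference using only the lora_task_id; the weights loaded during warmup are reused
curl -X POST localhost:8000/v2/models/ensemble/generate \
    -d '{"text_input": "What is machine learning?", "max_tokens": 20, "lora_task_id": 0}'
```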
