# Kandinsky 5.0 Image Quickstart

In this example, we'll be training a Kandinsky 5.0 Image LoRA.

## Hardware requirements

Kandinsky 5.0 employs a **huge 7B parameter Qwen2.5-VL text encoder** in addition to a standard CLIP encoder and the Flux VAE. This places significant demand on both VRAM and system RAM.

Simply loading the Qwen encoder requires roughly **14GB** of memory on its own. When training a rank-16 LoRA with full gradient checkpointing:

- **24GB VRAM** is the comfortable minimum (RTX 3090/4090).
- **16GB VRAM** is possible but requires aggressive offloading and likely `int8` quantization of the base model.

You'll need:

- **System RAM**: At least 32GB, ideally 64GB, to handle the initial model load without crashing.
- **GPU**: NVIDIA RTX 3090 / 4090 or professional cards (A6000, A100, etc.).

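If you're not sure what you're working with, a quick way to check GPU and system memory on a Linux host with NVIDIA drivers installed is:

```bash
# GPU model and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv
# Total and available system RAM
free -h
```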
### Memory offloading (recommended)

Given the size of the text encoder, you should almost certainly use grouped offloading if you are on consumer hardware. This offloads the transformer blocks to CPU memory when they are not actively being computed.

Add the following to your `config.json`:

```json
{
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1,
  "group_offload_use_stream": true
}
```

- `--group_offload_use_stream`: Only works on CUDA devices.
- **Do not** combine this with `--enable_model_cpu_offload`.

Additionally, set `"offload_during_startup": true` in your `config.json` to reduce VRAM usage during the initialization and caching phase. This ensures the text encoder and VAE are not loaded simultaneously.

## Prerequisites

Make sure you have Python installed; SimpleTuner works well with Python 3.10 through 3.12.

You can check this by running:

```bash
python --version
```

If you don't have Python 3.12 installed on Ubuntu, you can try the following:

```bash
apt -y install python3.12 python3.12-venv
```

## Installation

Install SimpleTuner via pip:

```bash
pip install simpletuner[cuda]
```

For manual installation or development setup, see the [installation documentation](/documentation/INSTALL.md).

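After installing, it's worth a quick sanity check that PyTorch (pulled in as a dependency) can actually see your GPU before going any further:

```bash
# Should print the torch version followed by "True" if CUDA is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```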
## Setting up the environment

### Web interface method

The SimpleTuner WebUI makes setup fairly straightforward. To run the server:

```bash
simpletuner server
```

Access it at http://localhost:8001.

### Manual / command-line method

To run SimpleTuner via command-line tools, you will need to set up a configuration file, the dataset and model directories, and a dataloader configuration file.

#### Configuration file

An experimental script, `configure.py`, may help you skip this section:

```bash
simpletuner configure
```

If you prefer to configure manually, copy `config/config.json.example` to `config/config.json`:

```bash
cp config/config.json.example config/config.json
```

You will need to modify the following variables (a combined example follows the list):

- `model_type`: `lora`
- `model_family`: `kandinsky5-image`
- `model_flavour`:
  - `t2i-lite-sft`: (Default) The standard SFT checkpoint. Best for fine-tuning styles/characters.
  - `t2i-lite-pretrain`: The pretrain checkpoint. Better for teaching entirely new concepts from scratch.
  - `i2i-lite-sft` / `i2i-lite-pretrain`: For image-to-image training. Requires conditioning images in your dataset.
- `output_dir`: Where to save your checkpoints.
- `train_batch_size`: Start with `1`.
- `gradient_accumulation_steps`: Use `1` or higher to simulate larger batches.
- `validation_resolution`: `1024x1024` is standard for this model.
- `validation_guidance`: `5.0` is the recommended default for Kandinsky 5.
- `flow_schedule_shift`: `1.0` is the default. Adjusting this changes how the model prioritizes details vs composition (see below).

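Putting those variables together, a minimal starting point could look like the following (the `output_dir` path and chosen flavour are illustrative; adjust to your setup):

```json
{
  "model_type": "lora",
  "model_family": "kandinsky5-image",
  "model_flavour": "t2i-lite-sft",
  "output_dir": "output/kandinsky5-image-lora",
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "validation_resolution": "1024x1024",
  "validation_guidance": 5.0,
  "flow_schedule_shift": 1.0
}
```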
#### Validation prompts

Inside `config/config.json` is the "primary validation prompt". You can also create a library of prompts in `config/user_prompt_library.json`:

```json
{
  "portrait": "A high quality portrait of a woman, cinematic lighting, 8k",
  "landscape": "A beautiful mountain landscape at sunset, oil painting style"
}
```

Enable it by adding this to your `config.json`:

```json
{
  "user_prompt_library": "config/user_prompt_library.json"
}
```

#### Flow schedule shifting

Kandinsky 5 is a flow-matching model. The `shift` parameter controls the noise distribution during training and inference.

- **Shift 1.0 (Default)**: Balanced training.
- **Lower shift (< 1.0)**: Weights training toward high-frequency detail (texture, fine noise).
- **Higher shift (> 1.0)**: Weights training toward low-frequency structure (composition, color, overall layout).

If your model learns styles well but fails on composition, try increasing the shift. If it learns composition but lacks texture, try decreasing it.

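For example, if composition is the weak point, you might try a higher shift in `config.json` (the exact value below is just something to experiment with, not an official recommendation):

```json
{
  "flow_schedule_shift": 2.0
}
```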
#### Quantised model training

You can reduce VRAM usage significantly by quantizing the transformer to 8-bit.

In `config.json`:

```json
{
  "base_model_precision": "int8-quanto",
  "text_encoder_1_precision": "no_change",
  "text_encoder_2_precision": "no_change",
  "lora_rank": 16,
  "base_model_default_dtype": "bf16"
}
```

> **Note**: We do not recommend quantizing the text encoders (leave them at `no_change`), as Qwen2.5-VL is sensitive to quantization effects and is already the heaviest part of the pipeline.

#### Dataset considerations

You will need a dataset configuration file, e.g., `config/multidatabackend.json`.

```json
[
  {
    "id": "my-image-dataset",
    "type": "local",
    "dataset_type": "image",
    "instance_data_dir": "datasets/my_images",
    "caption_strategy": "textfile",
    "resolution": 1024,
    "crop": true,
    "crop_aspect": "square",
    "repeats": 10
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/kandinsky5",
    "disabled": false
  }
]
```

Then create your dataset directory:

```bash
mkdir -p datasets/my_images
# Copy your images and .txt caption files here
```

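With `caption_strategy` set to `textfile`, each image is captioned by a `.txt` file sharing its basename. The resulting layout looks roughly like this (filenames are purely illustrative):

```
datasets/my_images/
├── photo_001.png
├── photo_001.txt
├── photo_002.jpg
└── photo_002.txt
```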
#### Login to WandB and Huggingface Hub

```bash
wandb login
huggingface-cli login
```

### Executing the training run

**Option 1 (Recommended):**

```bash
simpletuner train
```

**Option 2 (Legacy):**

```bash
./train.sh
```

## Notes & troubleshooting tips

### Lowest VRAM config

To run on 16GB or constrained 24GB setups (a combined example follows the list):

1. **Enable Group Offload**: `--enable_group_offload`.
2. **Quantize Base Model**: Set `"base_model_precision": "int8-quanto"`.
3. **Batch Size**: Keep it at `1`.

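As a sketch, those three points translate into `config.json` roughly as follows, reusing the offloading and quantization keys shown earlier:

```json
{
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1,
  "group_offload_use_stream": true,
  "offload_during_startup": true,
  "base_model_precision": "int8-quanto",
  "train_batch_size": 1
}
```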
### Artifacts and "Burnt" images

If validation images look over-saturated or noisy ("burnt"):

- **Check Guidance**: Ensure `validation_guidance` is around `5.0`. Higher values (like 7.0+) often fry the image on this model.
- **Check Flow Shift**: Extreme `flow_schedule_shift` values can cause instability. Stick to `1.0` to start.
- **Learning Rate**: 1e-4 is standard for LoRA, but if you see artifacts, try lowering to 5e-5.

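If you want a conservative baseline to reset to, the values below are a reasonable starting point (all discussed above; the `learning_rate` key name assumes SimpleTuner's standard option of that name):

```json
{
  "validation_guidance": 5.0,
  "flow_schedule_shift": 1.0,
  "learning_rate": 5e-5
}
```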
### TREAD training

Kandinsky 5 supports [TREAD](/documentation/TREAD.md) for faster training by dropping tokens.

Add to `config.json`:

```json
{
  "tread_config": {
    "routes": [
      {
        "selection_ratio": 0.5,
        "start_layer_idx": 2,
        "end_layer_idx": -2
      }
    ]
  }
}
```

This drops 50% of tokens in the middle layers, speeding up the transformer pass.