Commit 290df51

Merge pull request #2021 from bghira/main
merge
2 parents 5f52609 + bead6ff commit 290df51

72 files changed (+5116 / -190 lines)


AGENTS.md

Lines changed: 10 additions & 0 deletions
```diff
@@ -5,6 +5,8 @@
 - Venv location: `.venv`
 - Python version: `3.12`
 - Test framework: `unittest` (NOT `pytest`)
+- Test command: `.venv/bin/python -m unittest -v -f`
+- Test average runtime: ~300 seconds
 
 ## Code style
 
@@ -13,6 +15,7 @@
 - Use type: ignore only when absolutely necessary
 - NEVER add a code fallback path unless it is explicit to the requirements
 - Do not make assumptions if confusion arises. Instead, stop working, and request clarification.
+- Let's not add wandering, rambling comments in notes. Be concise and to the point or leave no comment at all since the code should be self-explanatory.
 
 ## Plan inspection guidelines
 
@@ -27,3 +30,10 @@
 ## File preservation
 
 - Do not remove untracked files from the repository unless explicitly instructed to do so
+
+## Problem solving
+
+- It's always tempting to jump right into declaring an answer, but the best solutions come from carefully-developed understanding of the root cause
+- Problems should always be provable through tests, logging, or other means
+- Generally speaking, it's fine to run the full application end-to-end to verify a fix, "it's heavy" is not a valid excuse to avoid verification - we're on a ML development workstation that's designed to allow running these workloads
+- For the most part, things should not be marked as CUDA-only unless it relies on third-party compiled CUDA kernels or similar. Don't be afraid to use the available accelerator eg. mps, cuda, if available on the system opportunistically.
```

README.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -130,6 +130,8 @@ Detailed quickstart guides are available for all supported models:
 - **[OmniGen Guide](/documentation/quickstart/OMNIGEN.md)** - Unified image generation model
 - **[Qwen Image Guide](/documentation/quickstart/QWEN_IMAGE.md)** - 20B parameter large-scale training
 - **[Stable Cascade Stage C Guide](/quickstart/STABLE_CASCADE_C.md)** - Prior LoRAs with combined prior+decoder validation
+- **[Kandinsky 5.0 Image Guide](/documentation/quickstart/KANDINSKY5_IMAGE.md)** - Image generation with Qwen2.5-VL + Flux VAE
+- **[Kandinsky 5.0 Video Guide](/documentation/quickstart/KANDINSKY5_VIDEO.md)** - Video generation with HunyuanVideo VAE
 
 ---
 
```

documentation/OPTIONS.md

Lines changed: 0 additions & 4 deletions
```diff
@@ -782,7 +782,6 @@ usage: train.py [-h] --model_family
 [--override_dataset_config [OVERRIDE_DATASET_CONFIG]]
 [--cache_dir CACHE_DIR] [--cache_dir_text CACHE_DIR_TEXT]
 [--cache_dir_vae CACHE_DIR_VAE]
-[--cache_clear_validation_prompts [CACHE_CLEAR_VALIDATION_PROMPTS]]
 [--compress_disk_cache [COMPRESS_DISK_CACHE]]
 [--aspect_bucket_disable_rebuild [ASPECT_BUCKET_DISABLE_REBUILD]]
 [--keep_vae_loaded [KEEP_VAE_LOADED]]
@@ -1301,9 +1300,6 @@ options:
 --cache_dir_vae CACHE_DIR_VAE
   This is the path to a local directory that will
   contain your VAE outputs
---cache_clear_validation_prompts [CACHE_CLEAR_VALIDATION_PROMPTS]
-  When provided, any validation prompt entries in the
-  text embed cache will be recreated
 --compress_disk_cache [COMPRESS_DISK_CACHE]
   If set, will gzip-compress the disk cache for Pytorch
   files. This will save substantial disk space, but may
```

documentation/QUICKSTART.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -29,6 +29,8 @@ For the complete and most accurate feature matrix, refer to the [main README](..
 | Qwen Image | 20B |||* | **required** (int8/nf4) | bf16 |||| [QWEN_IMAGE.md](/documentation/quickstart/QWEN_IMAGE.md) |
 | Qwen Image Edit | 20B |||* | **required** (int8/nf4) | bf16 |||| [QWEN_EDIT.md](/documentation/quickstart/QWEN_EDIT.md) |
 | Stable Cascade (C) | 1B, 3.6B prior |||* | not supported | fp32 (required) |||| [STABLE_CASCADE_C.md](/documentation/quickstart/STABLE_CASCADE_C.md) |
+| Kandinsky 5.0 Image | 6B (lite) |||* | int8 optional | bf16 |||| [KANDINSKY5_IMAGE.md](/documentation/quickstart/KANDINSKY5_IMAGE.md) |
+| Kandinsky 5.0 Video | 2B (lite), 19B (pro) |||* | int8 optional | bf16 |||| [KANDINSKY5_VIDEO.md](/documentation/quickstart/KANDINSKY5_VIDEO.md) |
 
 *✓ = supported, ✓* = requires DeepSpeed/FSDP2 for full-rank, ✗ = not supported, `✓+` indicates checkpointing is recommended due to VRAM pressure.*
 
```

documentation/data_presets/preset_audio_dataset_with_lyrics.md

Lines changed: 2 additions & 2 deletions
````diff
@@ -36,12 +36,12 @@ Use this configuration in your `multidatabackend.json` to load the dataset direc
 
 ```bibtex
 @misc{jiang2025advancingfoundationmodelmusic,
-      title={Advancing the Foundation Model for Music Understanding}, 
+      title={Advancing the Foundation Model for Music Understanding},
       author={Yi Jiang and Wei Wang and Xianwen Guo and Huiyun Liu and Hanrui Wang and Youri Xu and Haoqi Gu and Zhongqian Xie and Chuanjiang Luo},
       year={2025},
       eprint={2508.01178},
       archivePrefix={arXiv},
       primaryClass={cs.SD},
-      url={https://arxiv.org/abs/2508.01178}, 
+      url={https://arxiv.org/abs/2508.01178},
 }
 ```
````

documentation/quickstart/KANDINSKY5_IMAGE.md

Lines changed: 251 additions & 0 deletions (new file)

# Kandinsky 5.0 Image Quickstart

In this example, we'll be training a Kandinsky 5.0 Image LoRA.

## Hardware requirements

Kandinsky 5.0 employs a **huge 7B parameter Qwen2.5-VL text encoder** in addition to a standard CLIP encoder and the Flux VAE. This places significant demand on both VRAM and System RAM.

Simply loading the Qwen encoder requires roughly **14GB** of memory on its own. When training a rank-16 LoRA with full gradient checkpointing:

- **24GB VRAM** is the comfortable minimum (RTX 3090/4090).
- **16GB VRAM** is possible but requires aggressive offloading and likely `int8` quantization of the base model.

You'll need:

- **System RAM**: At least 32GB, ideally 64GB, to handle the initial model load without crashing.
- **GPU**: NVIDIA RTX 3090 / 4090 or professional cards (A6000, A100, etc.).

### Memory offloading (recommended)

Given the size of the text encoder, you should almost certainly use grouped offloading if you are on consumer hardware. This offloads the transformer blocks to CPU memory when they are not actively being computed.

Add the following to your `config.json`:

```json
{
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1,
  "group_offload_use_stream": true
}
```

- `--group_offload_use_stream`: Only works on CUDA devices.
- **Do not** combine this with `--enable_model_cpu_offload`.

Additionally, set `"offload_during_startup": true` in your `config.json` to reduce VRAM usage during the initialization and caching phase. This ensures the text encoder and VAE are not loaded simultaneously.

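Putting the offloading recommendations from this section together, a `config.json` fragment for consumer hardware might look roughly like the following sketch (the values are simply the ones suggested above):

```json
{
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1,
  "group_offload_use_stream": true,
  "offload_during_startup": true
}
```
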
## Prerequisites

Make sure that you have Python installed; SimpleTuner does well with 3.10 through 3.12.

You can check this by running:

```bash
python --version
```

If you don't have Python 3.12 installed on Ubuntu, you can try the following:

```bash
apt -y install python3.12 python3.12-venv
```

## Installation

Install SimpleTuner via pip:

```bash
pip install simpletuner[cuda]
```

For manual installation or development setup, see the [installation documentation](/documentation/INSTALL.md).

## Setting up the environment

### Web interface method

The SimpleTuner WebUI makes setup fairly straightforward. To run the server:

```bash
simpletuner server
```

Access it at http://localhost:8001.

### Manual / command-line method

To run SimpleTuner via command-line tools, you will need to set up a configuration file, the dataset and model directories, and a dataloader configuration file.

#### Configuration file

An experimental script, `configure.py`, may help you skip this section:

```bash
simpletuner configure
```

If you prefer to configure manually, copy `config/config.json.example` to `config/config.json`:

```bash
cp config/config.json.example config/config.json
```

You will need to modify the following variables (a combined example follows this list):

- `model_type`: `lora`
- `model_family`: `kandinsky5-image`
- `model_flavour`:
  - `t2i-lite-sft`: (Default) The standard SFT checkpoint. Best for fine-tuning styles/characters.
  - `t2i-lite-pretrain`: The pretrain checkpoint. Better for teaching entirely new concepts from scratch.
  - `i2i-lite-sft` / `i2i-lite-pretrain`: For image-to-image training. Requires conditioning images in your dataset.
- `output_dir`: Where to save your checkpoints.
- `train_batch_size`: Start with `1`.
- `gradient_accumulation_steps`: Use `1` or higher to simulate larger batches.
- `validation_resolution`: `1024x1024` is standard for this model.
- `validation_guidance`: `5.0` is the recommended default for Kandinsky 5.
- `flow_schedule_shift`: `1.0` is the default. Adjusting this changes how the model prioritizes details vs composition (see below).

#### Validation prompts

Inside `config/config.json` is the "primary validation prompt". You can also create a library of prompts in `config/user_prompt_library.json`:

```json
{
  "portrait": "A high quality portrait of a woman, cinematic lighting, 8k",
  "landscape": "A beautiful mountain landscape at sunset, oil painting style"
}
```

Enable it by adding this to your `config.json`:

```json
{
  "user_prompt_library": "config/user_prompt_library.json"
}
```

#### Flow schedule shifting

Kandinsky 5 is a flow-matching model. The `shift` parameter controls the noise distribution during training and inference.

- **Shift 1.0 (Default)**: Balanced training.
- **Lower Shift (< 1.0)**: Focuses training more on high-frequency details (texture, noise).
- **Higher Shift (> 1.0)**: Focuses training more on low-frequency details (composition, color, structure).

If your model learns styles well but fails on composition, try increasing the shift. If it learns composition but lacks texture, try decreasing it.

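For instance, if composition is the weak point, you might raise the shift slightly in `config.json`; the value below is purely illustrative, not a recommendation from this guide:

```json
{
  "flow_schedule_shift": 2.0
}
```
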
#### Quantised model training

You can reduce VRAM usage significantly by quantizing the transformer to 8-bit.

In `config.json`:

```json
{
  "base_model_precision": "int8-quanto",
  "text_encoder_1_precision": "no_change",
  "text_encoder_2_precision": "no_change",
  "lora_rank": 16,
  "base_model_default_dtype": "bf16"
}
```

> **Note**: We do not recommend quantizing the text encoders (`no_change`) as Qwen2.5-VL is sensitive to quantization effects and is already the heaviest part of the pipeline.

#### Dataset considerations

You will need a dataset configuration file, e.g., `config/multidatabackend.json`.

```json
[
  {
    "id": "my-image-dataset",
    "type": "local",
    "dataset_type": "image",
    "instance_data_dir": "datasets/my_images",
    "caption_strategy": "textfile",
    "resolution": 1024,
    "crop": true,
    "crop_aspect": "square",
    "repeats": 10
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/kandinsky5",
    "disabled": false
  }
]
```

Then create your dataset directory:

```bash
mkdir -p datasets/my_images
# Copy your images and .txt caption files here
```

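Since `caption_strategy` is set to `textfile`, each image should be paired with a `.txt` file of the same basename containing its caption. The filename and caption text below are placeholders, not part of this guide:

```bash
# Hypothetical example: one training image plus its matching caption file
cp /path/to/photo-001.png datasets/my_images/
echo "a high quality photo of your subject, natural lighting" > datasets/my_images/photo-001.txt
```
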
#### Login to WandB and Huggingface Hub

```bash
wandb login
huggingface-cli login
```

### Executing the training run

**Option 1 (Recommended):**

```bash
simpletuner train
```

**Option 2 (Legacy):**

```bash
./train.sh
```

## Notes & troubleshooting tips

### Lowest VRAM config

To run on 16GB or constrained 24GB setups (a combined sketch follows this list):

1. **Enable Group Offload**: `--enable_group_offload`.
2. **Quantize Base Model**: Set `"base_model_precision": "int8-quanto"`.
3. **Batch Size**: Keep it at `1`.

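Pulled together in `config.json`, those settings look roughly like this; it is a sketch of the options named above, not an exhaustive low-VRAM recipe:

```json
{
  "enable_group_offload": true,
  "base_model_precision": "int8-quanto",
  "train_batch_size": 1
}
```
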
### Artifacts and "Burnt" images

If validation images look over-saturated or noisy ("burnt"), check the following (a sketch of conservative starting values follows this list):

- **Check Guidance**: Ensure `validation_guidance` is around `5.0`. Higher values (like 7.0+) often fry the image on this model.
- **Check Flow Shift**: Extreme `flow_schedule_shift` values can cause instability. Stick to `1.0` to start.
- **Learning Rate**: 1e-4 is standard for LoRA, but if you see artifacts, try lowering to 5e-5.

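If in doubt, the conservative defaults mentioned above translate roughly to this `config.json` sketch:

```json
{
  "validation_guidance": 5.0,
  "flow_schedule_shift": 1.0,
  "learning_rate": 1e-4
}
```
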
### TREAD training

Kandinsky 5 supports [TREAD](/documentation/TREAD.md) for faster training by dropping tokens.

Add to `config.json`:

```json
{
  "tread_config": {
    "routes": [
      {
        "selection_ratio": 0.5,
        "start_layer_idx": 2,
        "end_layer_idx": -2
      }
    ]
  }
}
```

This drops 50% of tokens in the middle layers, speeding up the transformer pass.
