# Kandinsky 5.0 Image Quickstart

In this example, we'll be training a Kandinsky 5.0 Image LoRA.

## Hardware requirements

Kandinsky 5.0 employs a **huge 7B parameter Qwen2.5-VL text encoder** in addition to a standard CLIP encoder and the Flux VAE. This places significant demand on both VRAM and system RAM.

Simply loading the Qwen encoder requires roughly **14GB** of memory on its own. When training a rank-16 LoRA with full gradient checkpointing:

- **24GB VRAM** is the comfortable minimum (RTX 3090/4090).
- **16GB VRAM** is possible but requires aggressive offloading and likely `int8` quantization of the base model.

You'll need:

- **System RAM**: At least 32GB, ideally 64GB, to handle the initial model load without crashing.
- **GPU**: NVIDIA RTX 3090 / 4090 or professional cards (A6000, A100, etc.).

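If you're not sure what you're working with, a quick way to check GPU and system memory on a Linux host with NVIDIA drivers installed is:

```bash
# GPU model and total VRAM
nvidia-smi --query-gpu=name,memory.total --format=csv
# Total and available system RAM
free -h
```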
### Memory offloading (recommended)

Given the size of the text encoder, you should almost certainly use grouped offloading if you are on consumer hardware. This offloads the transformer blocks to CPU memory when they are not actively being computed.

Add the following to your `config.json`:

```json
{
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1,
  "group_offload_use_stream": true
}
```

- `--group_offload_use_stream`: Only works on CUDA devices.
- **Do not** combine this with `--enable_model_cpu_offload`.

Additionally, set `"offload_during_startup": true` in your `config.json` to reduce VRAM usage during the initialization and caching phase. This ensures the text encoder and VAE are not loaded simultaneously.

## Prerequisites

Make sure you have Python installed; SimpleTuner works well with Python 3.10 through 3.12.

You can check this by running:

```bash
python --version
```

If you don't have Python 3.12 installed on Ubuntu, you can try the following:

```bash
apt -y install python3.12 python3.12-venv
```

## Installation

Install SimpleTuner via pip:

```bash
pip install simpletuner[cuda]
```

For manual installation or development setup, see the [installation documentation](/documentation/INSTALL.md).

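After installing, it's worth a quick sanity check that PyTorch (pulled in as a dependency) can actually see your GPU before going any further:

```bash
# Should print the torch version followed by "True" if CUDA is available
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```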
## Setting up the environment

### Web interface method

The SimpleTuner WebUI makes setup fairly straightforward. To run the server:

```bash
simpletuner server
```

Access it at http://localhost:8001.

### Manual / command-line method

To run SimpleTuner via command-line tools, you will need to set up a configuration file, the dataset and model directories, and a dataloader configuration file.

#### Configuration file

An experimental script, `configure.py`, may help you skip this section:

```bash
simpletuner configure
```

If you prefer to configure manually, copy `config/config.json.example` to `config/config.json`:

```bash
cp config/config.json.example config/config.json
```

You will need to modify the following variables (a combined example follows the list):

- `model_type`: `lora`
- `model_family`: `kandinsky5-image`
- `model_flavour`:
  - `t2i-lite-sft`: (Default) The standard SFT checkpoint. Best for fine-tuning styles/characters.
  - `t2i-lite-pretrain`: The pretrain checkpoint. Better for teaching entirely new concepts from scratch.
  - `i2i-lite-sft` / `i2i-lite-pretrain`: For image-to-image training. Requires conditioning images in your dataset.
- `output_dir`: Where to save your checkpoints.
- `train_batch_size`: Start with `1`.
- `gradient_accumulation_steps`: Use `1` or higher to simulate larger batches.
- `validation_resolution`: `1024x1024` is standard for this model.
- `validation_guidance`: `5.0` is the recommended default for Kandinsky 5.
- `flow_schedule_shift`: `1.0` is the default. Adjusting this changes how the model prioritizes details vs composition (see below).

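Putting those variables together, a minimal starting point could look like the following (the `output_dir` path and chosen flavour are illustrative; adjust to your setup):

```json
{
  "model_type": "lora",
  "model_family": "kandinsky5-image",
  "model_flavour": "t2i-lite-sft",
  "output_dir": "output/kandinsky5-image-lora",
  "train_batch_size": 1,
  "gradient_accumulation_steps": 1,
  "validation_resolution": "1024x1024",
  "validation_guidance": 5.0,
  "flow_schedule_shift": 1.0
}
```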
#### Validation prompts

Inside `config/config.json` is the "primary validation prompt". You can also create a library of prompts in `config/user_prompt_library.json`:

```json
{
  "portrait": "A high quality portrait of a woman, cinematic lighting, 8k",
  "landscape": "A beautiful mountain landscape at sunset, oil painting style"
}
```

Enable it by adding this to your `config.json`:

```json
{
  "user_prompt_library": "config/user_prompt_library.json"
}
```

#### Flow schedule shifting

Kandinsky 5 is a flow-matching model. The `shift` parameter controls the noise distribution during training and inference.

- **Shift 1.0 (Default)**: Balanced training.
- **Lower shift (< 1.0)**: Weights training toward high-frequency detail (texture, fine noise).
- **Higher shift (> 1.0)**: Weights training toward low-frequency structure (composition, color, overall layout).

If your model learns styles well but fails on composition, try increasing the shift. If it learns composition but lacks texture, try decreasing it.

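For example, if composition is the weak point, you might try a higher shift in `config.json` (the exact value below is just something to experiment with, not an official recommendation):

```json
{
  "flow_schedule_shift": 2.0
}
```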
#### Quantised model training

You can reduce VRAM usage significantly by quantizing the transformer to 8-bit.

In `config.json`:

```json
{
  "base_model_precision": "int8-quanto",
  "text_encoder_1_precision": "no_change",
  "text_encoder_2_precision": "no_change",
  "lora_rank": 16,
  "base_model_default_dtype": "bf16"
}
```

> **Note**: We do not recommend quantizing the text encoders (leave them at `no_change`), as Qwen2.5-VL is sensitive to quantization effects and is already the heaviest part of the pipeline.

#### Dataset considerations

You will need a dataset configuration file, e.g., `config/multidatabackend.json`.

```json
[
  {
    "id": "my-image-dataset",
    "type": "local",
    "dataset_type": "image",
    "instance_data_dir": "datasets/my_images",
    "caption_strategy": "textfile",
    "resolution": 1024,
    "crop": true,
    "crop_aspect": "square",
    "repeats": 10
  },
  {
    "id": "text-embeds",
    "type": "local",
    "dataset_type": "text_embeds",
    "default": true,
    "cache_dir": "cache/text/kandinsky5",
    "disabled": false
  }
]
```

Then create your dataset directory:

```bash
mkdir -p datasets/my_images
# Copy your images and .txt caption files here
```

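With `caption_strategy` set to `textfile`, each image is captioned by a `.txt` file sharing its basename. The resulting layout looks roughly like this (filenames are purely illustrative):

```
datasets/my_images/
├── photo_001.png
├── photo_001.txt
├── photo_002.jpg
└── photo_002.txt
```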
#### Login to WandB and Huggingface Hub

```bash
wandb login
huggingface-cli login
```

### Executing the training run

**Option 1 (Recommended):**

```bash
simpletuner train
```

**Option 2 (Legacy):**

```bash
./train.sh
```

## Notes & troubleshooting tips

### Lowest VRAM config

To run on 16GB or constrained 24GB setups (a combined example follows the list):

1. **Enable Group Offload**: `--enable_group_offload`.
2. **Quantize Base Model**: Set `"base_model_precision": "int8-quanto"`.
3. **Batch Size**: Keep it at `1`.

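As a sketch, those three points translate into `config.json` roughly as follows, reusing the offloading and quantization keys shown earlier:

```json
{
  "enable_group_offload": true,
  "group_offload_type": "block_level",
  "group_offload_blocks_per_group": 1,
  "group_offload_use_stream": true,
  "offload_during_startup": true,
  "base_model_precision": "int8-quanto",
  "train_batch_size": 1
}
```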
### Artifacts and "Burnt" images

If validation images look over-saturated or noisy ("burnt"):

- **Check Guidance**: Ensure `validation_guidance` is around `5.0`. Higher values (like 7.0+) often fry the image on this model.
- **Check Flow Shift**: Extreme `flow_schedule_shift` values can cause instability. Stick to `1.0` to start.
- **Learning Rate**: 1e-4 is standard for LoRA, but if you see artifacts, try lowering to 5e-5.

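If you want a conservative baseline to reset to, the values below are a reasonable starting point (all discussed above; the `learning_rate` key name assumes SimpleTuner's standard option of that name):

```json
{
  "validation_guidance": 5.0,
  "flow_schedule_shift": 1.0,
  "learning_rate": 5e-5
}
```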
### TREAD training

Kandinsky 5 supports [TREAD](/documentation/TREAD.md) for faster training by dropping tokens.

Add to `config.json`:

```json
{
  "tread_config": {
    "routes": [
      {
        "selection_ratio": 0.5,
        "start_layer_idx": 2,
        "end_layer_idx": -2
      }
    ]
  }
}
```

This drops 50% of tokens in the middle layers, speeding up the transformer pass.