
Commit 7fb0126

Merge pull request #2251 from bghira/main
merge
2 parents 8974b43 + 6128ccc commit 7fb0126

86 files changed, +7108 -884 lines changed


documentation/DATALOADER.md

Lines changed: 28 additions & 1 deletion
@@ -348,13 +348,40 @@ A video dataset should be a folder of (eg. mp4) video files and the usual method
 ```
 
 - In the `video` subsection, we have the following keys we can set:
-  - `num_frames` (optional, int) is how many seconds of data we'll train on.
+  - `num_frames` (optional, int) is how many frames of data we'll train on.
     - At 25 fps, 125 frames is 5 seconds of video, standard output. This should be your target.
   - `min_frames` (optional, int) determines the minimum length of a video that will be considered for training.
     - This should be at least equal to `num_frames`. Not setting it ensures it'll be equal.
   - `max_frames` (optional, int) determines the maximum length of a video that will be considered for training.
   - `is_i2v` (optional, bool) determines whether i2v training will be done on a dataset.
     - This is set to True by default for LTX. You can disable it, however.
+  - `bucket_strategy` (optional, string) determines how videos are grouped into buckets:
+    - `aspect_ratio` (default): Bucket by spatial aspect ratio only (e.g., `1.78`, `0.75`). Same behavior as image datasets.
+    - `resolution_frames`: Bucket by resolution and frame count in `WxH@F` format (e.g., `1920x1080@125`). Useful for training on datasets with varying resolutions and durations.
+  - `frame_interval` (optional, int): when using `bucket_strategy: "resolution_frames"`, frame counts are rounded down to the nearest multiple of this value. Set this to your model's required frame count factor (some models require `num_frames - 1` to be divisible by a certain value).
+
+**Note:** When using `bucket_strategy: "resolution_frames"` with `num_frames` set, you'll get a single frame bucket, and videos shorter than `num_frames` will be discarded. Unset `num_frames` if you want multiple frame buckets with fewer discards.
+
+Example using `resolution_frames` bucketing for mixed-resolution video datasets:
+
+```json
+{
+  "id": "mixed-resolution-videos",
+  "type": "local",
+  "dataset_type": "video",
+  "resolution": 720,
+  "resolution_type": "pixel_area",
+  "instance_data_dir": "datasets/videos",
+  "video": {
+    "bucket_strategy": "resolution_frames",
+    "frame_interval": 25,
+    "min_frames": 25,
+    "max_frames": 250
+  }
+}
+```
+
+This configuration will create buckets like `1280x720@100`, `1920x1080@125`, `640x480@75`, etc. Videos are grouped by their training resolution and frame count (rounded down to the nearest 25 frames).
 
 
 ##### Configuration

documentation/OPTIONS.md

Lines changed: 64 additions & 0 deletions
@@ -233,6 +233,59 @@ TorchAO includes generally-available 4bit and 8bit optimisers: `ao-adamw8bit`, 
 
 It also provides two optimisers that are directed toward Hopper (H100 or better) users: `ao-adamfp8`, and `ao-adamwfp8`
 
+#### SDNQ (SD.Next Quantization Engine)
+
+[SDNQ](https://github.com/disty0/sdnq) is a quantization library optimized for training that works across all platforms: AMD (ROCm), Apple (MPS), and NVIDIA (CUDA). It provides quantized training with stochastic rounding and quantized optimizer states for memory efficiency.
+
+##### Recommended Precision Levels
+
+**For full finetuning** (model weights are updated):
+- `uint8-sdnq` - Best balance of memory savings and training quality
+- `uint16-sdnq` - Higher precision for maximum quality (e.g., Stable Cascade)
+- `int16-sdnq` - Signed 16-bit alternative
+- `fp16-sdnq` - Quantized FP16, maximum precision with SDNQ benefits
+
+**For LoRA training** (frozen base model weights):
+- `int8-sdnq` - Signed 8-bit, a good general-purpose choice
+- `int6-sdnq`, `int5-sdnq` - Lower precision, smaller memory footprint
+- `uint5-sdnq`, `uint4-sdnq`, `uint3-sdnq`, `uint2-sdnq` - Aggressive compression
+
+**Note:** `int7-sdnq` is available but not recommended (slow, and not much smaller than int8).
+
+**Important:** Below 5-bit precision, SDNQ automatically enables SVD (Singular Value Decomposition) with 8 steps to maintain quality. SVD takes longer to quantize and is non-deterministic, which is why Disty0 provides pre-quantized SVD models on HuggingFace. SVD also adds compute overhead during training, so avoid it for full finetuning, where weights are actively updated.
+
+**Key features:**
+- Cross-platform: Works identically on AMD, Apple, and NVIDIA hardware
+- Training-optimized: Uses stochastic rounding to reduce quantization error accumulation
+- Memory efficient: Supports quantized optimizer state buffers
+- Decoupled matmul: Weight precision and matmul precision are independent (INT8/FP8/FP16 matmul available)
+
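As a concrete illustration, a full finetune quantizing the base model to SDNQ uint8 might carry the following in `config/config.json`. This is a sketch, not a verified recipe: it assumes the SDNQ precision levels above are selected via `base_model_precision`, the same way as the other quantization backends, and the exact key style can differ between SimpleTuner versions:

```json
{
  "model_type": "full",
  "base_model_precision": "uint8-sdnq"
}
```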
+##### SDNQ Optimisers
+
+SDNQ includes optimizers with optional quantized state buffers for additional memory savings:
+
+- `sdnq-adamw` - AdamW with quantized state buffers (uint8, group_size=32)
+- `sdnq-adamw+no_quant` - AdamW without quantized states (for comparison)
+- `sdnq-adafactor` - Adafactor with quantized state buffers
+- `sdnq-came` - CAME optimizer with quantized state buffers
+- `sdnq-lion` - Lion optimizer with quantized state buffers
+- `sdnq-muon` - Muon optimizer with quantized state buffers
+- `sdnq-muon+quantized_matmul` - Muon with INT8 matmul in zeropower computation
+
+All SDNQ optimizers use stochastic rounding by default and can be configured with `--optimizer_config` for custom settings like `use_quantized_buffers=false` to disable state quantization.
+
+**Muon-specific options:**
+- `use_quantized_matmul` - Enable INT8/FP8/FP16 matmul in `zeropower_via_newtonschulz5`
+- `quantized_matmul_dtype` - Matmul precision: `int8` (consumer GPUs), `fp8` (datacenter), `fp16`
+- `zeropower_dtype` - Precision for zeropower computation (ignored when `use_quantized_matmul=True`)
+- Prefix args with `muon_` or `adamw_` to set different values for Muon vs AdamW fallback
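For example, enabling INT8 matmul for Muon's zeropower step on a consumer GPU could look like the following sketch, assuming `--optimizer_config` takes the comma-separated `key=value` form implied by the `use_quantized_buffers=false` example above:

```json
{
  "optimizer": "sdnq-muon",
  "optimizer_config": "use_quantized_matmul=true,quantized_matmul_dtype=int8"
}
```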
+
+**Pre-quantized models:** Disty0 provides pre-quantized uint4 SVD models at [huggingface.co/collections/Disty0/sdnq](https://huggingface.co/collections/Disty0/sdnq). Load these normally, then convert with `convert_sdnq_model_to_training()` after importing SDNQ (SDNQ must be imported before loading to register with Diffusers).
+
+**Note on checkpointing:** SDNQ training models are saved in both native PyTorch format (`.pt`) for training resumption and safetensors format for inference. The native format is required for proper training resumption, as SDNQ's `SDNQTensor` class uses custom serialization.
+
+**Disk space tip:** To save disk space, you can keep only the quantized weights and use SDNQ's [dequantize_sdnq_training.py](https://github.com/Disty0/sdnq/blob/main/scripts/dequantize_sdnq_training.py) script to dequantize when needed for inference.
+
 ### `--quantization_config`
 
 - **What**: JSON object or file path describing Diffusers `quantization_config` overrides when using `--quantize_via=pipeline`.
@@ -312,6 +365,17 @@ Using `--sageattention_usage` to enable training with SageAttention should be en
 - **What**: Uploads to Hugging Face Hub from a background worker so checkpoint pushes do not pause the training loop.
 - **Why**: Keeps training and validation running while Hub uploads proceed asynchronously. Final uploads are still awaited before the run exits so failures surface.
 
+### `--webhook_config`
+
+- **What**: Configuration for webhook targets (e.g., Discord, custom endpoints) to receive real-time training events.
+- **Why**: Allows you to monitor training runs with external tools and dashboards, receiving notifications at key training stages.
+- **Notes**: The `job_id` field in webhook payloads can be populated by setting the `SIMPLETUNER_JOB_ID` environment variable before training:
+```bash
+export SIMPLETUNER_JOB_ID="my-training-run-name"
+python train.py
+```
+This is useful for monitoring tools that receive webhooks from multiple training runs, so they can identify which config sent each event. If `SIMPLETUNER_JOB_ID` is not set, `job_id` will be null in webhook payloads.
+
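For reference, the value of `--webhook_config` is a path to a JSON document describing the target. A minimal Discord-style sketch follows — the field names here are illustrative assumptions, so consult the webhooks documentation for the exact schema:

```json
{
  "webhook_type": "discord",
  "webhook_url": "https://discord.com/api/webhooks/...",
  "log_level": "info"
}
```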
 ### `--publishing_config`
 
 - **What**: Optional JSON/dict/file path describing non-Hugging Face publishing targets (S3-compatible storage, Backblaze B2, Azure Blob Storage, Dropbox).

documentation/quickstart/HUNYUANVIDEO.md

Lines changed: 11 additions & 1 deletion
@@ -186,7 +186,8 @@ Create a `--data_backend_config` (`config/multidatabackend.json`) document conta
       "video": {
         "num_frames": 61,
         "min_frames": 61,
-        "frame_rate": 24
+        "frame_rate": 24,
+        "bucket_strategy": "aspect_ratio"
       },
       "repeats": 10
     },
@@ -201,6 +202,15 @@ Create a `--data_backend_config` (`config/multidatabackend.json`) document conta
   ]
 ```
 
+In the `video` subsection:
+- `num_frames`: Target frame count for training. Must satisfy `(frames - 1) % 4 == 0`.
+- `min_frames`: Minimum video length (shorter videos are discarded).
+- `max_frames`: Maximum video length filter.
+- `bucket_strategy`: How videos are grouped into buckets:
+  - `aspect_ratio` (default): Group by spatial aspect ratio only.
+  - `resolution_frames`: Group by `WxH@F` format (e.g., `854x480@61`) for mixed-resolution/duration datasets.
+- `frame_interval`: When using `resolution_frames`, round frame counts down to this interval.
+
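If your dataset mixes resolutions or durations, the `video` subsection can opt into `resolution_frames` instead. A minimal sketch with illustrative values — `frame_interval: 4` follows the `(frames - 1) % 4 == 0` constraint above, and `num_frames` is left unset so multiple frame buckets can form (see the note in DATALOADER.md):

```json
"video": {
  "bucket_strategy": "resolution_frames",
  "frame_interval": 4,
  "min_frames": 61,
  "frame_rate": 24
}
```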
 > See caption_strategy options and requirements in [DATALOADER.md](../DATALOADER.md#caption_strategy).
 
 - **Text Embed Caching**: Highly recommended. Hunyuan uses a large LLM text encoder. Caching saves significant VRAM during training.

documentation/quickstart/KANDINSKY5_VIDEO.md

Lines changed: 11 additions & 1 deletion
@@ -136,7 +136,8 @@ Video datasets require careful setup. Create `config/multidatabackend.json`:
       "video": {
         "num_frames": 61,
         "min_frames": 61,
-        "frame_rate": 24
+        "frame_rate": 24,
+        "bucket_strategy": "aspect_ratio"
       },
       "repeats": 10
     },
@@ -151,6 +152,15 @@ Video datasets require careful setup. Create `config/multidatabackend.json`:
   ]
 ```
 
+In the `video` subsection:
+- `num_frames`: Target frame count for training.
+- `min_frames`: Minimum video length (shorter videos are discarded).
+- `max_frames`: Maximum video length filter.
+- `bucket_strategy`: How videos are grouped into buckets:
+  - `aspect_ratio` (default): Group by spatial aspect ratio only.
+  - `resolution_frames`: Group by `WxH@F` format (e.g., `1920x1080@61`) for mixed-resolution/duration datasets.
+- `frame_interval`: When using `resolution_frames`, round frame counts down to this interval.
+
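For mixed-resolution/duration datasets, the `video` subsection can switch to `resolution_frames`. A minimal sketch with illustrative values — the `frame_interval` of 4 is an assumption consistent with the default 61 frames (`(61 - 1) % 4 == 0`), and `num_frames` is left unset so multiple frame buckets can form:

```json
"video": {
  "bucket_strategy": "resolution_frames",
  "frame_interval": 4,
  "min_frames": 61,
  "frame_rate": 24
}
```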
 > See caption_strategy options and requirements in [DATALOADER.md](../DATALOADER.md#caption_strategy).
 
 #### Directory setup

documentation/quickstart/LONGCAT_VIDEO.md

Lines changed: 6 additions & 0 deletions
@@ -83,6 +83,12 @@ Or launch the Web UI and submit a job with the same config.
 - For image‑to‑video runs, include a conditioning image per sample; it is placed in the first latent frame and kept fixed during sampling.
 - LongCat‑Video is 30 fps by design. The default 93 frames is ~3.1 s; if you change frame counts, keep `(frames - 1) % 4 == 0` and remember duration scales with fps.
 
+### Video bucket strategy
+
+In your dataset's `video` section, you can configure how videos are grouped:
+- `bucket_strategy`: `aspect_ratio` (default) groups by spatial aspect ratio. `resolution_frames` groups by `WxH@F` format (e.g., `480x832@93`) for mixed-resolution/duration datasets.
+- `frame_interval`: When using `resolution_frames`, round frame counts down to this interval (e.g., set it to 4 to match the VAE temporal stride).
+
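A minimal sketch of such a `video` section — `frame_interval: 4` matches the VAE temporal stride mentioned above, and `min_frames: 93` mirrors the default frame count; the rest of the dataset entry stays as it is:

```json
"video": {
  "bucket_strategy": "resolution_frames",
  "frame_interval": 4,
  "min_frames": 93
}
```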
 ---
 
 ## 5) Validation & inference

documentation/quickstart/LTXVIDEO.md

Lines changed: 7 additions & 2 deletions
@@ -373,7 +373,8 @@ Create a `--data_backend_config` (`config/multidatabackend.json`) document conta
       "repeats": 0,
       "video": {
         "num_frames": 125,
-        "min_frames": 125
+        "min_frames": 125,
+        "bucket_strategy": "aspect_ratio"
       }
     },
     {
@@ -392,13 +393,17 @@ Create a `--data_backend_config` (`config/multidatabackend.json`) document conta
 > See caption_strategy options and requirements in [DATALOADER.md](../DATALOADER.md#caption_strategy).
 
 - In the `video` subsection, we have the following keys we can set:
-  - `num_frames` (optional, int) is how many seconds of data we'll train on.
+  - `num_frames` (optional, int) is how many frames of data we'll train on.
     - At 25 fps, 125 frames is 5 seconds of video, standard output. This should be your target.
   - `min_frames` (optional, int) determines the minimum length of a video that will be considered for training.
     - This should be at least equal to `num_frames`. Not setting it ensures it'll be equal.
   - `max_frames` (optional, int) determines the maximum length of a video that will be considered for training.
   - `is_i2v` (optional, bool) determines whether i2v training will be done on a dataset.
     - This is set to True by default for LTX. You can disable it, however.
+  - `bucket_strategy` (optional, string) determines how videos are grouped into buckets:
+    - `aspect_ratio` (default): Group by spatial aspect ratio only (e.g., `1.78`, `0.75`).
+    - `resolution_frames`: Group by resolution and frame count in `WxH@F` format (e.g., `768x512@125`). Useful for mixed-resolution/duration datasets.
+  - `frame_interval` (optional, int): when using `resolution_frames`, frame counts are rounded down to this interval. Set this to your model's required frame count factor.
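For example, a mixed-duration LTX dataset could bucket in one-second steps at 25 fps, mirroring the example in DATALOADER.md. A sketch with illustrative values; `num_frames` is left unset so multiple frame buckets can form:

```json
"video": {
  "bucket_strategy": "resolution_frames",
  "frame_interval": 25,
  "min_frames": 25,
  "max_frames": 250
}
```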
402407

403408
Then, create a `datasets` directory:
404409

documentation/quickstart/SANAVIDEO.md

Lines changed: 6 additions & 1 deletion
@@ -308,7 +308,8 @@ Create a `--data_backend_config` (`config/multidatabackend.json`) document conta
       "repeats": 0,
       "video": {
         "num_frames": 81,
-        "min_frames": 81
+        "min_frames": 81,
+        "bucket_strategy": "aspect_ratio"
       }
     },
     {
@@ -331,6 +332,10 @@ Create a `--data_backend_config` (`config/multidatabackend.json`) document conta
   - `min_frames` (optional, int) determines the minimum length of a video that will be considered for training.
   - `max_frames` (optional, int) determines the maximum length of a video that will be considered for training.
   - `is_i2v` (optional, bool) determines whether i2v training will be done on a dataset.
+  - `bucket_strategy` (optional, string) determines how videos are grouped into buckets:
+    - `aspect_ratio` (default): Group by spatial aspect ratio only (e.g., `1.78`, `0.75`).
+    - `resolution_frames`: Group by resolution and frame count in `WxH@F` format (e.g., `832x480@81`). Useful for mixed-resolution/duration datasets.
+  - `frame_interval` (optional, int): when using `resolution_frames`, frame counts are rounded down to this interval.
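A sketch of the `resolution_frames` variant for SANA Video. The `frame_interval` of 4 here is a placeholder assumption — set it to the model's required frame count factor — and `num_frames` is left unset so multiple frame buckets can form:

```json
"video": {
  "bucket_strategy": "resolution_frames",
  "frame_interval": 4,
  "min_frames": 81
}
```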
 
 Then, create a `datasets` directory:
 
documentation/quickstart/WAN.md

Lines changed: 7 additions & 2 deletions
@@ -542,7 +542,8 @@ Create a `--data_backend_config` (`config/multidatabackend.json`) document conta
       "repeats": 0,
       "video": {
         "num_frames": 75,
-        "min_frames": 75
+        "min_frames": 75,
+        "bucket_strategy": "aspect_ratio"
       }
     },
     {
@@ -593,11 +594,15 @@ Create a `--data_backend_config` (`config/multidatabackend.json`) document conta
 </details>
 
 - In the `video` subsection, we have the following keys we can set:
-  - `num_frames` (optional, int) is how many seconds of data we'll train on.
+  - `num_frames` (optional, int) is how many frames of data we'll train on.
     - At 15 fps, 75 frames is 5 seconds of video, standard output. This should be your target.
   - `min_frames` (optional, int) determines the minimum length of a video that will be considered for training.
     - This should be at least equal to `num_frames`. Not setting it ensures it'll be equal.
   - `max_frames` (optional, int) determines the maximum length of a video that will be considered for training.
+  - `bucket_strategy` (optional, string) determines how videos are grouped into buckets:
+    - `aspect_ratio` (default): Group by spatial aspect ratio only (e.g., `1.78`, `0.75`).
+    - `resolution_frames`: Group by resolution and frame count in `WxH@F` format (e.g., `832x480@75`). Useful for mixed-resolution/duration datasets.
+  - `frame_interval` (optional, int): when using `resolution_frames`, frame counts are rounded down to this interval.
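For example, a mixed-duration Wan dataset could bucket in one-second steps at 15 fps. A sketch with illustrative values; `num_frames` is left unset so multiple frame buckets can form:

```json
"video": {
  "bucket_strategy": "resolution_frames",
  "frame_interval": 15,
  "min_frames": 15,
  "max_frames": 150
}
```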
 <!-- - `is_i2v` (optional, bool) determines whether i2v training will be done on a dataset.
   - This is set to True by default for Wan 2.1. You can disable it, however.
 -->

setup.py

Lines changed: 1 addition & 0 deletions
@@ -265,6 +265,7 @@ def _collect_package_files(*directories: str):
     "peft-singlora>=0.2.0",
     "cryptography>=41.0.0",
     "torchcodec>=0.8.1",
+    "sdnq>=0.1.2",
 ]
 
 platform_deps_for_install = get_platform_dependencies()

simpletuner/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -79,4 +79,4 @@ def _suppress_swigvarlink(message, *args, **kwargs):
 warnings.warn = _suppress_swigvarlink
 
 
-__version__ = "3.3.2"
+__version__ = "3.3.3"
