README.md: 3 additions & 0 deletions

@@ -54,6 +54,7 @@ SimpleTuner provides comprehensive training support across multiple diffusion mo

- **Multi-GPU training** - Distributed training across multiple GPUs with automatic optimization
- **Advanced caching** - Image, video, audio, and caption embeddings cached to disk for faster training
- **Aspect bucketing** - Support for varied image/video sizes and aspect ratios
- **Concept sliders** - Slider-friendly targeting for LoRA/LyCORIS/full (via LyCORIS `full`) with positive/negative/neutral sampling and per-prompt strength; see the [Slider LoRA guide](/documentation/SLIDER_LORA.md)
- **Memory optimization** - Most models trainable on 24G GPU, many on 16G with optimizations
- **DeepSpeed & FSDP2 integration** - Train large models on smaller GPUs with optim/grad/parameter sharding, context parallel attention, gradient checkpointing, and optimizer state offload

documentation/DREAMBOOTH.md: 24 additions & 0 deletions

@@ -222,6 +222,30 @@ Alternatively, one might use the real name of their subject, or a 'similar enoug

After a number of training experiments, it seems as though a 'similar enough' celebrity is the best choice, especially if prompting the model for the person's real name ends up looking dissimilar.

# Scheduled Sampling (Rollout)

When training on small datasets like in Dreambooth, models can quickly overfit to the "perfect" noise added during training. This leads to **exposure bias**: the model learns to denoise perfect inputs but fails when faced with its own slightly imperfect outputs during inference.

**Scheduled Sampling (Rollout)** addresses this by occasionally letting the model generate its own noisy latents for a few steps during the training loop. Instead of training on pure Gaussian noise + signal, it trains on "rollout" samples that contain the model's own previous errors. This teaches the model to correct itself, leading to more robust and stable subject generation.

> 🟢 This feature is experimental but highly recommended for small datasets where overfitting or "frying" is common.

> ⚠️ Enabling rollout increases compute requirements, as the model must perform extra inference steps during the training loop.

To enable it, add these keys to your `config.json`:

```json
{
  "scheduled_sampling_max_step_offset": 10,
  "scheduled_sampling_probability": 1.0,
  "scheduled_sampling_ramp_steps": 1000,
  "scheduled_sampling_sampler": "unipc"
}
```

* `scheduled_sampling_max_step_offset`: How many steps to generate. A small value (e.g., 5-10) is often enough.
* `scheduled_sampling_probability`: How often to apply this technique (0.0 to 1.0).
* `scheduled_sampling_ramp_steps`: Ramp up the probability over the first N steps to avoid destabilizing early training.

# Exponential moving average (EMA)
A second model can be trained in parallel to your checkpoint, nearly for free - only the resulting system memory (by default) is consumed, rather than more VRAM.

documentation/OPTIONS.md: 80 additions & 4 deletions

@@ -619,6 +619,74 @@ See the [DATALOADER.md](DATALOADER.md#automatic-dataset-oversubscription) guide

- **What**: Train a model using a more gradual weighting on the loss landscape.
- **Why**: When training pixel diffusion models, they will simply degrade without using a specific loss weighting schedule. This is the case with DeepFloyd, where soft-min-snr-gamma was found to essentially be mandatory for good results. You may find success with latent diffusion model training, but in small experiments, it was found to potentially produce blurry results.

### `--diff2flow_enabled`

- **What**: Enable the Diffusion-to-Flow bridge for epsilon or v-prediction models.
- **Why**: Allows models trained with standard diffusion objectives to use flow-matching targets (noise - latents) without changing the model architecture.
- **Note**: Experimental feature.

### `--diff2flow_loss`

- **What**: Train with Flow Matching loss instead of the native prediction loss.
- **Why**: When enabled alongside `--diff2flow_enabled`, this calculates the loss against the flow target (noise - latents) instead of the model's native target (epsilon or velocity).
- **Note**: Requires `--diff2flow_enabled`.
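
If both options are set, training targets the flow objective directly. A minimal `config.json` sketch (assuming the config keys mirror the CLI flag names, as the scheduled sampling keys do; the `true` values are illustrative, not defaults):

```json
{
  "diff2flow_enabled": true,
  "diff2flow_loss": true
}
```
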
### `--scheduled_sampling_max_step_offset`

- **What**: Maximum number of steps to "roll out" during training.
- **Why**: Enables Scheduled Sampling (Rollout), where the model generates its own inputs for a few steps during training. This helps the model learn to correct its own errors and reduces exposure bias.
- **Default**: 0 (disabled). Set to a positive integer (e.g., 5 or 10) to enable.

### `--scheduled_sampling_strategy`

- **What**: Strategy for choosing the rollout offset.
- **Why**: Controls the distribution of rollout lengths. `uniform` samples evenly; `biased_early` favors shorter rollouts; `biased_late` favors longer rollouts.

### `--scheduled_sampling_probability`

- **What**: Probability of applying a non-zero rollout offset for a given sample.
- **Default**: 0.0.
- **Why**: Controls how often scheduled sampling is applied. A value of 0.0 disables it even if `max_step_offset` is > 0. A value of 1.0 applies it to every sample.
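
Combining the three options above, a minimal illustrative `config.json` sketch that rolls out up to 5 steps on every sample, with uniformly sampled offsets (key names assumed to mirror the flags; the values are examples only):

```json
{
  "scheduled_sampling_max_step_offset": 5,
  "scheduled_sampling_strategy": "uniform",
  "scheduled_sampling_probability": 1.0
}
```
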
### `--scheduled_sampling_prob_start`

- **What**: Initial probability for scheduled sampling at the start of the ramp.
- **Default**: 0.0.

### `--scheduled_sampling_prob_end`

- **What**: Final probability for scheduled sampling at the end of the ramp.
- **Default**: 0.5.

### `--scheduled_sampling_ramp_steps`

- **What**: Number of steps to ramp the probability from `prob_start` to `prob_end`.
- **Default**: 0 (no ramp).

### `--scheduled_sampling_start_step`

- **What**: Global step to start the scheduled sampling ramp.
- **Default**: 0.0.

### `--scheduled_sampling_ramp_shape`

- **What**: Shape of the probability ramp.
- **Choices**: `linear`, `cosine`.
- **Default**: `linear`.
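
The five ramp options above work as a unit. A hedged `config.json` sketch that ramps the rollout probability from 0.0 to 0.5 over the first 1000 steps with a cosine shape (key names assumed to mirror the flags; values are examples only):

```json
{
  "scheduled_sampling_max_step_offset": 10,
  "scheduled_sampling_prob_start": 0.0,
  "scheduled_sampling_prob_end": 0.5,
  "scheduled_sampling_ramp_steps": 1000,
  "scheduled_sampling_ramp_shape": "cosine",
  "scheduled_sampling_start_step": 0
}
```

Ramping the probability keeps early training on standard noise and only gradually introduces rollout samples, which helps avoid destabilizing the run at the start, as noted in the Dreambooth guide above.
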
### `--scheduled_sampling_sampler`

- **What**: The solver used for the rollout generation steps.
- **Choices**: `unipc`, `euler`, `dpm`, `rk4`.
- **Default**: `unipc`.
### `--scheduled_sampling_order`

- **What**: The order of the solver used for rollout.