Merge pull request #2560 from bghira/feature/pre-shuffled-captions

bghira · web-flow · commit fd2b29ffd89a · 2026-02-01T15:32:04.000-06:00
configurable caption_shuffle for cached text embeddings
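
The `simpletuner/helpers/prompts.py` hunk is not shown in this view, so the block below is only a minimal sketch of the behaviour the documentation in this commit describes, not the merged implementation. The function name `shuffled_variants`, the per-caption seeding scheme, and the strings used to re-join tags are assumptions made for illustration.

```python
import random
import warnings

# Assumed mapping from the documented `split_on` options to delimiter characters.
_DELIMITERS = {"comma": ",", "space": " ", "period": "."}


def shuffled_variants(caption, count=1, seed=0, split_on="comma",
                      position_start=0, include_original=True):
    """Return deterministic shuffled variants of a tag-based caption (hypothetical helper)."""
    sep = _DELIMITERS[split_on]
    tags = [t.strip() for t in caption.split(sep) if t.strip()]

    # Documented rule: fewer than position_start + 2 tags means nothing to shuffle.
    if len(tags) < position_start + 2:
        if not include_original:
            warnings.warn("caption too short to shuffle; keeping the original")
        return [caption]

    joiner = sep if sep == " " else sep + " "
    head, tail = tags[:position_start], tags[position_start:]

    # Assumption: mix the seed with the caption text so each caption gets its own
    # reproducible stream. String seeds are hashed deterministically by `random`.
    rng = random.Random(f"{seed}:{caption}")

    variants = [caption] if include_original else []
    for _ in range(count):
        shuffled = tail[:]
        rng.shuffle(shuffled)
        variants.append(joiner.join(head + shuffled))
    return variants
```

Calling `shuffled_variants("dog, running, park, sunny day", count=2, seed=42, position_start=1)` returns three strings: the original followed by two reorderings that all begin with `dog`. The sketch does not deduplicate, so a shuffle may repeat the original order or another variant.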
diff --git a/documentation/DATALOADER.es.md b/documentation/DATALOADER.es.md
@@ -186,6 +186,50 @@ Tanto `textfile` como `parquet` soportan multi-captions:
 - Útil cuando tus captions contienen saltos de línea intencionales que deben preservarse como un único caption.
 - Por defecto: `false` (los captions se dividen por saltos de línea)
 
+### `caption_shuffle`
+
+Genera variantes mezcladas determinísticas de captions basados en tags para aumento de datos. Esto ayuda al modelo a aprender que el orden de las tags no importa y reduce el sobreajuste a secuencias específicas de tags.
+
+**Configuración:**
+
+```json
+{
+  "caption_shuffle": {
+    "enable": true,
+    "count": 3,
+    "seed": 42,
+    "split_on": "comma",
+    "position_start": 1,
+    "include_original": true
+  }
+}
+```
+
+**Parámetros:**
+
+- `enable` (bool): Si se habilita el mezclado de captions. Por defecto: `false`
+- `count` (int): Número de variantes mezcladas a generar por caption. Por defecto: `1`
+- `seed` (int): Semilla para mezclado determinístico. Si no se especifica, usa el valor global `--seed`.
+- `split_on` (string): Delimitador para dividir captions en tags. Opciones: `comma`, `space`, `period`. Por defecto: `comma`
+- `position_start` (int): Mantener las primeras N tags en su posición original (útil para mantener tags de sujeto/estilo al principio). Por defecto: `0`
+- `include_original` (bool): Si se incluye el caption original sin mezclar junto con las variantes mezcladas. Por defecto: `true`
+
+**Ejemplo:**
+
+Con `split_on: "comma"`, `position_start: 1`, `count: 2`:
+
+- Original: `"dog, running, park, sunny day"`
+- Resultado: `["dog, running, park, sunny day", "dog, park, sunny day, running", "dog, sunny day, running, park"]`
+
+La primera tag "dog" permanece fija mientras las tags restantes se mezclan.
+
+**Notas:**
+
+- El mezclado se aplica durante el pre-cacheo de embeddings de texto, así que todas las variantes se calculan de una vez.
+- Durante el entrenamiento, se selecciona una variante aleatoriamente por muestra.
+- Si un caption tiene menos tags que `position_start + 2`, el mezclado se omite (nada significativo que mezclar).
+- Cuando `include_original: false` pero el mezclado no es posible, se incluye el original de todos modos con una advertencia.
+
 ### `metadata_backend`
 
 - **Valores:** `discovery` | `parquet` | `huggingface`
diff --git a/documentation/DATALOADER.hi.md b/documentation/DATALOADER.hi.md
@@ -186,6 +186,50 @@ Metadata discovery के दौरान loader प्रत्येक file 
 - उपयोगी जब आपके captions में intentional line breaks हों जिन्हें एक single caption के रूप में संरक्षित रखना हो।
 - Default: `false` (captions newlines द्वारा split होते हैं)
 
+### `caption_shuffle`
+
+Data augmentation के लिए tag-based captions के deterministic shuffled variants generate करता है। यह model को सिखाता है कि tag order महत्वपूर्ण नहीं है और specific tag sequences पर overfitting कम करता है।
+
+**Configuration:**
+
+```json
+{
+  "caption_shuffle": {
+    "enable": true,
+    "count": 3,
+    "seed": 42,
+    "split_on": "comma",
+    "position_start": 1,
+    "include_original": true
+  }
+}
+```
+
+**Parameters:**
+
+- `enable` (bool): Caption shuffling enable करना है या नहीं। Default: `false`
+- `count` (int): प्रति caption generate करने के लिए shuffled variants की संख्या। Default: `1`
+- `seed` (int): Deterministic shuffling के लिए seed। यदि specify नहीं किया गया, तो global `--seed` value उपयोग होता है।
+- `split_on` (string): Captions को tags में split करने के लिए delimiter। Options: `comma`, `space`, `period`। Default: `comma`
+- `position_start` (int): पहले N tags को उनकी original position में रखें (subject/style tags को पहले रखने के लिए उपयोगी)। Default: `0`
+- `include_original` (bool): Shuffled variants के साथ unshuffled original caption include करना है या नहीं। Default: `true`
+
+**Example:**
+
+`split_on: "comma"`, `position_start: 1`, `count: 2` के साथ:
+
+- Original: `"dog, running, park, sunny day"`
+- Result: `["dog, running, park, sunny day", "dog, park, sunny day, running", "dog, sunny day, running, park"]`
+
+पहला tag "dog" fixed रहता है जबकि बाकी tags shuffle होते हैं।
+
+**Notes:**
+
+- Shuffling text embed pre-caching के दौरान apply होता है, इसलिए सभी variants एक बार में calculate होते हैं।
+- Training के दौरान, प्रति sample एक variant randomly select होता है।
+- यदि caption में `position_start + 2` से कम tags हैं, तो shuffling skip होता है (shuffle करने के लिए कुछ meaningful नहीं)।
+- जब `include_original: false` लेकिन shuffling possible नहीं है, तो warning के साथ original include होता है।
+
 ### `metadata_backend`
 
 - **Values:** `discovery` | `parquet` | `huggingface`
diff --git a/documentation/DATALOADER.ja.md b/documentation/DATALOADER.ja.md
@@ -186,6 +186,50 @@ Hugging Face の音声データセットでは、キャプション（プロン
 - 意図的な改行を含むキャプションを単一のキャプションとして保持したい場合に便利です。
 - デフォルト: `false`（改行でキャプションを分割）
 
+### `caption_shuffle`
+
+タグベースのキャプションの決定論的シャッフルバリアントを生成し、データ拡張に使用します。これにより、モデルはタグの順序が重要でないことを学習し、特定のタグシーケンスへの過学習を軽減します。
+
+**設定：**
+
+```json
+{
+  "caption_shuffle": {
+    "enable": true,
+    "count": 3,
+    "seed": 42,
+    "split_on": "comma",
+    "position_start": 1,
+    "include_original": true
+  }
+}
+```
+
+**パラメータ：**
+
+- `enable` (bool): キャプションシャッフルを有効にするかどうか。デフォルト: `false`
+- `count` (int): キャプションごとに生成するシャッフルバリアントの数。デフォルト: `1`
+- `seed` (int): 決定論的シャッフルのシード。指定されていない場合、グローバル `--seed` 値を使用します。
+- `split_on` (string): キャプションをタグに分割する区切り文字。オプション: `comma`、`space`、`period`。デフォルト: `comma`
+- `position_start` (int): 最初の N 個のタグを元の位置に保持（主題/スタイルタグを先頭に保持するのに便利）。デフォルト: `0`
+- `include_original` (bool): シャッフルされていない元のキャプションをバリアントに含めるかどうか。デフォルト: `true`
+
+**例：**
+
+`split_on: "comma"`、`position_start: 1`、`count: 2` の場合：
+
+- 元: `"dog, running, park, sunny day"`
+- 結果: `["dog, running, park, sunny day", "dog, park, sunny day, running", "dog, sunny day, running, park"]`
+
+最初のタグ「dog」は固定されたまま、残りのタグがシャッフルされます。
+
+**注意：**
+
+- シャッフルはテキスト埋め込みのプリキャッシュ時に適用されるため、すべてのバリアントは一度に計算されます。
+- トレーニング中、サンプルごとに 1 つのバリアントがランダムに選択されます。
+- キャプションのタグ数が `position_start + 2` より少ない場合、シャッフルはスキップされます（意味のあるシャッフルができないため）。
+- `include_original: false` だがシャッフルできない場合、警告とともに元のキャプションが含まれます。
+
 ### `metadata_backend`
 
 - **値:** `discovery` | `parquet` | `huggingface`
diff --git a/documentation/DATALOADER.md b/documentation/DATALOADER.md
@@ -220,6 +220,50 @@ Both `textfile` and `parquet` support multi-captions:
 - Useful when your captions contain intentional line breaks that should be preserved as a single caption.
 - Default: `false` (captions are split by newlines)
 
+### `caption_shuffle`
+
+Generates deterministic shuffled variants of tag-based captions for data augmentation. This helps the model learn that tag order doesn't matter and reduces overfitting to specific tag sequences.
+
+**Configuration:**
+
+```json
+{
+  "caption_shuffle": {
+    "enable": true,
+    "count": 3,
+    "seed": 42,
+    "split_on": "comma",
+    "position_start": 1,
+    "include_original": true
+  }
+}
+```
+
+**Parameters:**
+
+- `enable` (bool): Whether to enable caption shuffling. Default: `false`
+- `count` (int): Number of shuffled variants to generate per caption. Default: `1`
+- `seed` (int): Seed for deterministic shuffling. If not specified, uses the global `--seed` value.
+- `split_on` (string): Delimiter for splitting captions into tags. Options: `comma`, `space`, `period`. Default: `comma`
+- `position_start` (int): Keep the first N tags in their original position (useful for keeping subject/style tags first). Default: `0`
+- `include_original` (bool): Whether to include the unshuffled original caption alongside shuffled variants. Default: `true`
+
+**Example:**
+
+With `split_on: "comma"`, `position_start: 1`, `count: 2` (and the default `include_original: true`, so the original is returned alongside the two variants):
+
+- Original: `"dog, running, park, sunny day"`
+- Result: `["dog, running, park, sunny day", "dog, park, sunny day, running", "dog, sunny day, running, park"]`
+
+The first tag "dog" stays fixed while the remaining tags are shuffled.
+
+**Notes:**
+
+- Shuffling is applied during text embedding pre-caching, so all variants are computed once upfront.
+- During training, one variant is randomly selected per sample.
+- If a caption has fewer tags than `position_start + 2`, shuffling is skipped (nothing meaningful to shuffle).
+- When `include_original: false` but shuffling isn't possible, the original is included anyway with a warning.
+
 ### `metadata_backend`
 
 - **Values:** `discovery` | `parquet` | `huggingface`
diff --git a/documentation/DATALOADER.pt-BR.md b/documentation/DATALOADER.pt-BR.md
@@ -186,6 +186,50 @@ Tanto `textfile` quanto `parquet` suportam multi-captions:
 - Útil quando suas captions contêm quebras de linha intencionais que devem ser preservadas como uma única caption.
 - Padrão: `false` (captions são divididas por novas linhas)
 
+### `caption_shuffle`
+
+Gera variantes embaralhadas determinísticas de captions baseadas em tags para aumento de dados. Isso ajuda o modelo a aprender que a ordem das tags não importa e reduz o overfitting em sequências específicas de tags.
+
+**Configuração:**
+
+```json
+{
+  "caption_shuffle": {
+    "enable": true,
+    "count": 3,
+    "seed": 42,
+    "split_on": "comma",
+    "position_start": 1,
+    "include_original": true
+  }
+}
+```
+
+**Parâmetros:**
+
+- `enable` (bool): Se deve habilitar o embaralhamento de captions. Padrão: `false`
+- `count` (int): Número de variantes embaralhadas a gerar por caption. Padrão: `1`
+- `seed` (int): Seed para embaralhamento determinístico. Se não especificado, usa o valor global `--seed`.
+- `split_on` (string): Delimitador para dividir captions em tags. Opções: `comma`, `space`, `period`. Padrão: `comma`
+- `position_start` (int): Manter as primeiras N tags em sua posição original (útil para manter tags de assunto/estilo primeiro). Padrão: `0`
+- `include_original` (bool): Se deve incluir a caption original não embaralhada junto com as variantes embaralhadas. Padrão: `true`
+
+**Exemplo:**
+
+Com `split_on: "comma"`, `position_start: 1`, `count: 2`:
+
+- Original: `"dog, running, park, sunny day"`
+- Resultado: `["dog, running, park, sunny day", "dog, park, sunny day, running", "dog, sunny day, running, park"]`
+
+A primeira tag "dog" permanece fixa enquanto as tags restantes são embaralhadas.
+
+**Notas:**
+
+- O embaralhamento é aplicado durante o pré-cache de embeddings de texto, então todas as variantes são calculadas de uma vez.
+- Durante o treinamento, uma variante é selecionada aleatoriamente por amostra.
+- Se uma caption tiver menos tags que `position_start + 2`, o embaralhamento é pulado (nada significativo para embaralhar).
+- Quando `include_original: false` mas o embaralhamento não é possível, a original é incluída mesmo assim com um aviso.
+
 ### `metadata_backend`
 
 - **Valores:** `discovery` | `parquet` | `huggingface`
diff --git a/documentation/DATALOADER.zh.md b/documentation/DATALOADER.zh.md
@@ -186,6 +186,50 @@
 - 适用于包含有意换行的字幕，希望保持为单一字幕的情况。
 - 默认值: `false`（按换行符拆分字幕）
 
+### `caption_shuffle`
+
+生成基于标签的字幕的确定性打乱变体，用于数据增强。这有助于模型学习标签顺序不重要，并减少对特定标签序列的过拟合。
+
+**配置：**
+
+```json
+{
+  "caption_shuffle": {
+    "enable": true,
+    "count": 3,
+    "seed": 42,
+    "split_on": "comma",
+    "position_start": 1,
+    "include_original": true
+  }
+}
+```
+
+**参数：**
+
+- `enable` (bool): 是否启用字幕打乱。默认: `false`
+- `count` (int): 每个字幕生成的打乱变体数量。默认: `1`
+- `seed` (int): 确定性打乱的种子。如果未指定，使用全局 `--seed` 值。
+- `split_on` (string): 将字幕拆分为标签的分隔符。选项: `comma`、`space`、`period`。默认: `comma`
+- `position_start` (int): 保持前 N 个标签在原始位置（适用于保持主题/风格标签在前）。默认: `0`
+- `include_original` (bool): 是否在打乱变体中包含未打乱的原始字幕。默认: `true`
+
+**示例：**
+
+使用 `split_on: "comma"`、`position_start: 1`、`count: 2`：
+
+- 原始: `"dog, running, park, sunny day"`
+- 结果: `["dog, running, park, sunny day", "dog, park, sunny day, running", "dog, sunny day, running, park"]`
+
+第一个标签"dog"保持固定，其余标签被打乱。
+
+**注意：**
+
+- 打乱在文本嵌入预缓存期间应用，因此所有变体一次性计算完成。
+- 训练期间，每个样本随机选择一个变体。
+- 如果字幕的标签数少于 `position_start + 2`，则跳过打乱（没有可以有意义地打乱的内容）。
+- 当 `include_original: false` 但无法打乱时，原始字幕仍会被包含并显示警告。
+
 ### `metadata_backend`
 
 - **取值:** `discovery` | `parquet` | `huggingface`
diff --git a/simpletuner/helpers/data_backend/factory.py b/simpletuner/helpers/data_backend/factory.py
@@ -572,6 +572,37 @@ def _maybe_convert_pixel_edge(field_name: str) -> None:
                 f"(id={backend['id']}) When using a huggingface data backend, caption_strategy must be set to 'huggingface'."
             )
 
+    # Validate and store caption_shuffle config
+    caption_shuffle = backend.get("caption_shuffle", {})
+    if caption_shuffle:
+        if not isinstance(caption_shuffle, dict):
+            raise ValueError(
+                f"(id={backend['id']}) caption_shuffle must be a dictionary, got {type(caption_shuffle).__name__}"
+            )
+        # Validate split_on
+        valid_split_on = {"comma", "space", "period"}
+        split_on = caption_shuffle.get("split_on", "comma")
+        if split_on not in valid_split_on:
+            raise ValueError(
+                f"(id={backend['id']}) caption_shuffle.split_on must be one of {valid_split_on}, got '{split_on}'"
+            )
+        # Validate count
+        count = caption_shuffle.get("count", 1)
+        if isinstance(count, bool) or not isinstance(count, int) or count < 1:
+            raise ValueError(f"(id={backend['id']}) caption_shuffle.count must be a positive integer, got {count}")
+        # Validate position_start
+        position_start = caption_shuffle.get("position_start", 0)
+        if isinstance(position_start, bool) or not isinstance(position_start, int) or position_start < 0:
+            raise ValueError(
+                f"(id={backend['id']}) caption_shuffle.position_start must be a non-negative integer, got {position_start}"
+            )
+        # Set seed default from args if not specified
+        if "seed" not in caption_shuffle:
+            global_seed = _get_arg_value(args, "seed")
+            if global_seed is not None:
+                caption_shuffle["seed"] = global_seed
+        output["config"]["caption_shuffle"] = caption_shuffle
+
     if not is_audio_dataset:
         maximum_image_size = backend.get("maximum_image_size", _get_arg_value(args, "maximum_image_size"))
         target_downsample_size = backend.get("target_downsample_size", _get_arg_value(args, "target_downsample_size"))
diff --git a/simpletuner/helpers/prompts.py b/simpletuner/helpers/prompts.py
diff --git a/simpletuner/templates/components/dataloader/sections/captions.html b/simpletuner/templates/components/dataloader/sections/captions.html