documentation/DATALOADER.es.md
44 additions & 0 deletions
@@ -186,6 +186,50 @@ Tanto `textfile` como `parquet` soportan multi-captions:
- Útil cuando tus captions contienen saltos de línea intencionales que deben preservarse como un único caption.
- Por defecto: `false` (los captions se dividen por saltos de línea)

### `caption_shuffle`

Genera variantes mezcladas determinísticas de captions basados en tags para aumento de datos. Esto ayuda al modelo a aprender que el orden de las tags no importa y reduce el sobreajuste a secuencias específicas de tags.

**Configuración:**

```json
{
  "caption_shuffle": {
    "enable": true,
    "count": 3,
    "seed": 42,
    "split_on": "comma",
    "position_start": 1,
    "include_original": true
  }
}
```

**Parámetros:**

- `enable` (bool): Si se habilita el mezclado de captions. Por defecto: `false`
- `count` (int): Número de variantes mezcladas a generar por caption. Por defecto: `1`
- `seed` (int): Semilla para mezclado determinístico. Si no se especifica, usa el valor global `--seed`.
- `split_on` (string): Delimitador para dividir captions en tags. Opciones: `comma`, `space`, `period`. Por defecto: `comma`
- `position_start` (int): Mantener las primeras N tags en su posición original (útil para mantener tags de sujeto/estilo al principio). Por defecto: `0`
- `include_original` (bool): Si se incluye el caption original sin mezclar junto con las variantes mezcladas. Por defecto: `true`

**Ejemplo:**

Con `split_on: "comma"`, `position_start: 1`, `count: 2`:

- Original: `"dog, running, park, sunny day"`
- Resultado: `["dog, running, park, sunny day", "dog, park, sunny day, running", "dog, sunny day, running, park"]`

La primera tag "dog" permanece fija mientras las tags restantes se mezclan.

**Notas:**

- El mezclado se aplica durante el pre-cacheo de embeddings de texto, así que todas las variantes se calculan de una vez.
- Durante el entrenamiento, se selecciona una variante aleatoriamente por muestra.
- Si un caption tiene menos tags que `position_start + 2`, el mezclado se omite (nada significativo que mezclar).
- Cuando `include_original: false` pero el mezclado no es posible, se incluye el original de todos modos con una advertencia.

documentation/DATALOADER.hi.md
44 additions & 0 deletions
@@ -186,6 +186,50 @@ Metadata discovery के दौरान loader प्रत्येक file
- उपयोगी जब आपके captions में intentional line breaks हों जिन्हें एक single caption के रूप में संरक्षित रखना हो।
- Default: `false` (captions newlines द्वारा split होते हैं)

### `caption_shuffle`

Data augmentation के लिए tag-based captions के deterministic shuffled variants generate करता है। यह model को सिखाता है कि tag order महत्वपूर्ण नहीं है और specific tag sequences पर overfitting कम करता है।

**Configuration:**

```json
{
  "caption_shuffle": {
    "enable": true,
    "count": 3,
    "seed": 42,
    "split_on": "comma",
    "position_start": 1,
    "include_original": true
  }
}
```

**Parameters:**

- `enable` (bool): Caption shuffling enable करना है या नहीं। Default: `false`
- `count` (int): प्रति caption generate करने के लिए shuffled variants की संख्या। Default: `1`
- `seed` (int): Deterministic shuffling के लिए seed। यदि specify नहीं किया गया, तो global `--seed` value उपयोग होता है।
- `split_on` (string): Captions को tags में split करने के लिए delimiter। Options: `comma`, `space`, `period`। Default: `comma`
- `position_start` (int): पहले N tags को उनकी original position में रखें (subject/style tags को पहले रखने के लिए उपयोगी)। Default: `0`
- `include_original` (bool): Shuffled variants के साथ unshuffled original caption include करना है या नहीं। Default: `true`

**Example:**

`split_on: "comma"`, `position_start: 1`, `count: 2` के साथ:

- Original: `"dog, running, park, sunny day"`
- Result: `["dog, running, park, sunny day", "dog, park, sunny day, running", "dog, sunny day, running, park"]`

पहला tag "dog" fixed रहता है जबकि बाकी tags shuffle होते हैं।

**Notes:**

- Shuffling text embed pre-caching के दौरान apply होता है, इसलिए सभी variants एक बार में calculate होते हैं।
- Training के दौरान, प्रति sample एक variant randomly select होता है।
- यदि caption में `position_start + 2` से कम tags हैं, तो shuffling skip होता है (shuffle करने के लिए कुछ meaningful नहीं)।
- जब `include_original: false` लेकिन shuffling possible नहीं है, तो warning के साथ original include होता है।

documentation/DATALOADER.md
44 additions & 0 deletions
@@ -220,6 +220,50 @@ Both `textfile` and `parquet` support multi-captions:
- Useful when your captions contain intentional line breaks that should be preserved as a single caption.
- Default: `false` (captions are split by newlines)

### `caption_shuffle`

Generates deterministic shuffled variants of tag-based captions for data augmentation. This helps the model learn that tag order doesn't matter and reduces overfitting to specific tag sequences.

**Configuration:**

```json
{
  "caption_shuffle": {
    "enable": true,
    "count": 3,
    "seed": 42,
    "split_on": "comma",
    "position_start": 1,
    "include_original": true
  }
}
```

**Parameters:**

- `enable` (bool): Whether to enable caption shuffling. Default: `false`
- `count` (int): Number of shuffled variants to generate per caption. Default: `1`
- `seed` (int): Seed for deterministic shuffling. If not specified, uses the global `--seed` value.
- `split_on` (string): Delimiter for splitting captions into tags. Options: `comma`, `space`, `period`. Default: `comma`
- `position_start` (int): Keep the first N tags in their original position (useful for keeping subject/style tags first). Default: `0`
- `include_original` (bool): Whether to include the unshuffled original caption alongside shuffled variants. Default: `true`

**Example:**

With `split_on: "comma"`, `position_start: 1`, `count: 2`:

- Original: `"dog, running, park, sunny day"`
- Result: `["dog, running, park, sunny day", "dog, park, sunny day, running", "dog, sunny day, running, park"]`

The first tag "dog" stays fixed while the remaining tags are shuffled.
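
The example above can be approximated with a short Python sketch. It is not SimpleTuner's actual implementation: the function name `shuffle_variants`, the `DELIMITERS` mapping, and the per-variant seeding scheme are assumptions made for illustration, and the sketch does not try to reproduce the exact orderings shown in the result list.

```python
import random

# Hypothetical mapping from the documented `split_on` names to
# (split delimiter, join string). The real mapping may differ.
DELIMITERS = {"comma": (",", ", "), "space": (" ", " "), "period": (".", ". ")}


def shuffle_variants(caption, count=1, seed=0, split_on="comma", position_start=0):
    """Return `count` shuffled variants of `caption` (illustrative only)."""
    delimiter, joiner = DELIMITERS[split_on]
    tags = [t.strip() for t in caption.split(delimiter) if t.strip()]
    # The first `position_start` tags keep their position; the rest are shuffled.
    fixed, rest = tags[:position_start], tags[position_start:]
    variants = []
    for i in range(count):
        # Assumed scheme: one RNG per variant, seeded from the inputs only,
        # so the same caption and seed always produce the same variants.
        rng = random.Random(f"{seed}:{i}:{caption}")
        shuffled = rest[:]
        rng.shuffle(shuffled)
        variants.append(joiner.join(fixed + shuffled))
    return variants
```

Calling `shuffle_variants("dog, running, park, sunny day", count=2, seed=42, position_start=1)` returns two captions that both start with `dog`, and repeating the call with the same arguments returns the same list. A shuffle can occasionally reproduce the original order; this sketch does not deduplicate.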

**Notes:**

- Shuffling is applied during text embed pre-caching, so all variants are computed once upfront.
- During training, one variant is randomly selected per sample.
- If a caption has fewer tags than `position_start + 2`, shuffling is skipped (nothing meaningful to shuffle).
- When `include_original: false` but shuffling isn't possible, the original is included anyway with a warning.
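
The rules in the notes above could be expressed roughly as follows. This is a hedged sketch rather than the project's code: `build_caption_pool` and `pick_caption` are hypothetical helpers, the caller is assumed to pass in the already computed shuffled variants and the tag count, and the warning text is invented.

```python
import random
import warnings


def build_caption_pool(caption, shuffled, tag_count, position_start=0, include_original=True):
    """Combine the original caption with precomputed shuffled variants."""
    if tag_count < position_start + 2:
        # Fewer than two movable tags: nothing meaningful to shuffle,
        # so the original caption is kept regardless of `include_original`.
        if not include_original:
            warnings.warn("caption_shuffle: shuffling not possible; keeping the original caption")
        return [caption]
    return ([caption] if include_original else []) + list(shuffled)


def pick_caption(pool, rng):
    """Select one variant at random for a training sample."""
    return rng.choice(pool)
```

Because the pool is built once during pre-caching, the per-sample cost at training time in this sketch is a single `rng.choice` call.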

documentation/DATALOADER.pt-BR.md
44 additions & 0 deletions
@@ -186,6 +186,50 @@ Tanto `textfile` quanto `parquet` suportam multi-captions:
- Útil quando suas captions contêm quebras de linha intencionais que devem ser preservadas como uma única caption.
- Padrão: `false` (captions são divididas por novas linhas)

### `caption_shuffle`

Gera variantes embaralhadas determinísticas de captions baseadas em tags para aumento de dados. Isso ajuda o modelo a aprender que a ordem das tags não importa e reduz o overfitting em sequências específicas de tags.

**Configuração:**

```json
{
  "caption_shuffle": {
    "enable": true,
    "count": 3,
    "seed": 42,
    "split_on": "comma",
    "position_start": 1,
    "include_original": true
  }
}
```

**Parâmetros:**

- `enable` (bool): Se deve habilitar o embaralhamento de captions. Padrão: `false`
- `count` (int): Número de variantes embaralhadas a gerar por caption. Padrão: `1`
- `seed` (int): Seed para embaralhamento determinístico. Se não especificado, usa o valor global `--seed`.
- `split_on` (string): Delimitador para dividir captions em tags. Opções: `comma`, `space`, `period`. Padrão: `comma`
- `position_start` (int): Manter as primeiras N tags em sua posição original (útil para manter tags de assunto/estilo primeiro). Padrão: `0`
- `include_original` (bool): Se deve incluir a caption original não embaralhada junto com as variantes embaralhadas. Padrão: `true`

**Exemplo:**

Com `split_on: "comma"`, `position_start: 1`, `count: 2`:

- Original: `"dog, running, park, sunny day"`
- Resultado: `["dog, running, park, sunny day", "dog, park, sunny day, running", "dog, sunny day, running, park"]`

A primeira tag "dog" permanece fixa enquanto as tags restantes são embaralhadas.

**Notas:**

- O embaralhamento é aplicado durante o pré-cache de embeddings de texto, então todas as variantes são calculadas de uma vez.
- Durante o treinamento, uma variante é selecionada aleatoriamente por amostra.
- Se uma caption tiver menos tags que `position_start + 2`, o embaralhamento é pulado (nada significativo para embaralhar).
- Quando `include_original: false` mas o embaralhamento não é possível, a original é incluída mesmo assim com um aviso.