bghira
diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
Lines changed: 2 additions & 2 deletions
diff --git a/documentation/DATALOADER.es.md b/documentation/DATALOADER.es.md
Lines changed: 76 additions & 0 deletions
diff --git a/documentation/DATALOADER.hi.md b/documentation/DATALOADER.hi.md
Lines changed: 76 additions & 0 deletions
diff --git a/documentation/DATALOADER.ja.md b/documentation/DATALOADER.ja.md
Lines changed: 76 additions & 0 deletions
@@ -38,10 +38,10 @@ jobs:
 
       - name: Install dependencies
         run: |
-          pip install mkdocs mkdocs-material pymdown-extensions mkdocs-static-i18n
+          pip install zensical
 
       - name: Build documentation
-        run: mkdocs build --strict
+        run: zensical build
 
       - name: Upload artifact
         if: github.event_name == 'push' && github.ref == 'refs/heads/main'
 
@@ -551,6 +551,24 @@ Esto es especialmente útil cuando:
 - Sin sobresuscripción: se lanza un error
 - Con `--allow_dataset_oversubscription`: repeats ajustado automáticamente a 1 (25 × 2 = 50 muestras)
 
+### `max_num_samples`
+
+- **Descripción:** Limita el dataset a un número máximo de muestras. Cuando se establece, se selecciona un subconjunto aleatorio determinista del tamaño especificado del dataset completo.
+- **Caso de uso:** Útil para grandes datasets de regularización donde deseas usar solo una parte de los datos para evitar dominar conjuntos de entrenamiento más pequeños.
+- **Selección determinista:** La selección aleatoria usa el `id` del dataset como semilla, asegurando que el mismo subconjunto sea seleccionado entre sesiones de entrenamiento para reproducibilidad.
+- **Por defecto:** `null` (sin límite, se usan todas las muestras)
+
+#### Ejemplo
+```json
+{
+  "id": "regularization-data",
+  "max_num_samples": 1000,
+  ...
+}
+```
+
+Esto seleccionará de forma determinista 1000 muestras del dataset, con la misma selección utilizada cada vez que se ejecute el entrenamiento.
+
 ### `start_epoch` / `start_step`
 
 - Programa cuándo un dataset empieza a muestrear.
@@ -559,6 +577,38 @@ Esto es especialmente útil cuando:
 - Los datasets que nunca cumplen su condición de inicio (por ejemplo, `start_epoch` más allá de `--num_train_epochs`) se omitirán y se anotarán en la model card.
 - Las estimaciones de pasos en la barra de progreso son aproximadas cuando los datasets programados se activan a mitad de ejecución porque la longitud de la época puede aumentar cuando nuevos datos entran en línea.
 
+### `end_epoch` / `end_step`
+
+- Programa cuándo un dataset **deja** de muestrear (complementando `start_epoch`/`start_step`).
+- `end_epoch` (default: `null` = sin límite) detiene el muestreo después de esta época; `end_step` (default: `null` = sin límite) detiene el muestreo después de este paso de optimizador.
+- El dataset se detiene en cuanto se alcanza cualquiera de las dos condiciones; funcionan de forma independiente.
+- Útil para flujos de trabajo de **aprendizaje curricular** donde deseas:
+  - Entrenar con datos de baja resolución primero, luego cambiar a datos de mayor resolución.
+  - Eliminar gradualmente los datos de regularización después de cierto punto.
+  - Crear entrenamiento multi-etapa dentro de un solo archivo de configuración.
+
+**Ejemplo: Aprendizaje Curricular**
+```json
+[
+  {
+    "id": "lowres-512",
+    "type": "local",
+    "dataset_type": "image",
+    "instance_data_dir": "/data/512",
+    "end_step": 300
+  },
+  {
+    "id": "highres-1024",
+    "type": "local",
+    "dataset_type": "image",
+    "instance_data_dir": "/data/1024",
+    "start_step": 300
+  }
+]
+```
+
+En este ejemplo, el dataset de 512px se usa para los pasos 1-300, luego el dataset de 1024px toma el control desde el paso 300 en adelante.
+
 ### `is_regularisation_data`
 
 - También puede escribirse `is_regularization_data`
@@ -579,6 +629,32 @@ Esto es especialmente útil cuando:
 - **Advertencia:** Esto es destructivo y no se puede deshacer. Úsalo con cuidado.
 - **Default:** Usa el argumento `--delete_problematic_images` del trainer (default: `false`).
 
+### Ver Estadísticas de Filtrado
+
+Cuando SimpleTuner procesa tu dataset, rastrea cuántos archivos fueron filtrados y por qué. Estas estadísticas se almacenan en el archivo de caché del dataset (`aspect_ratio_bucket_indices_*.json`) y pueden verse en la WebUI.
+
+**Estadísticas rastreadas:**
+- **total_processed**: Número de archivos procesados
+- **too_small**: Archivos filtrados por estar debajo de `minimum_image_size`
+- **too_long**: Archivos filtrados por exceder límites de duración (audio/video)
+- **metadata_missing**: Archivos omitidos por falta de metadatos
+- **not_found**: Archivos que no se pudieron localizar
+- **already_exists**: Archivos ya en caché (no reprocesados)
+- **other**: Archivos filtrados por otras razones
+
+**Ver en la WebUI:**
+
+Al navegar por datasets en el explorador de archivos de la WebUI, seleccionar un directorio con un dataset existente mostrará estadísticas de filtrado si están disponibles. Esto ayuda a diagnosticar por qué tu dataset puede tener menos muestras utilizables de lo esperado.
+
+**Solución de problemas de archivos filtrados:**
+
+Si muchos archivos están siendo filtrados como `too_small`:
+1. Verifica tu configuración de `minimum_image_size` — debe coincidir con `resolution` y `resolution_type`
+2. Para `resolution_type=pixel`, `minimum_image_size` es la longitud mínima del borde más corto
+3. Para `resolution_type=area` o `pixel_area`, `minimum_image_size` es el área total mínima
+
+Consulta la sección [Solución de Problemas](#solución-de-problemas-de-datasets-filtrados) a continuación para más detalles.
+
 ### `slider_strength`
 
 - **Valores:** Cualquier valor float (positivo, negativo o cero)
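
Review note on `max_num_samples`: the added text says the subset is random but deterministic, seeded by the dataset `id`. The sketch below shows one way such a selection could work. It is not taken from SimpleTuner; the function name `select_subset`, the SHA-256 seed derivation, and the sorting step are all assumptions made for illustration.

```python
import hashlib
import random


def select_subset(samples, dataset_id, max_num_samples=None):
    """Return a reproducible subset of `samples` keyed on `dataset_id`.

    Hypothetical helper: the seed derivation and sorting are assumptions,
    not SimpleTuner's actual implementation.
    """
    # No limit configured, or the limit is not smaller than the dataset.
    if max_num_samples is None or max_num_samples >= len(samples):
        return list(samples)
    # Python's built-in hash() of a str is salted per process, so derive
    # a stable integer seed from a digest of the id instead.
    digest = hashlib.sha256(dataset_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    # Sort first so the file discovery order does not change the result.
    return rng.sample(sorted(samples), max_num_samples)
```

With this approach, two runs that use the same `id` and the same file list pick the same samples, while changing the `id` changes the subset.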
 
@@ -551,6 +551,24 @@ effective_batch_size = train_batch_size × num_gpus × gradient_accumulation_ste
 - Oversubscription के बिना: Error आएगा
 - `--allow_dataset_oversubscription` के साथ: repeats स्वतः 1 पर सेट होंगे (25 × 2 = 50 samples)
 
+### `max_num_samples`
+
+- **विवरण:** Dataset को अधिकतम samples की संख्या तक सीमित करता है। सेट करने पर, पूर्ण dataset से निर्दिष्ट आकार का एक deterministic random subset चुना जाता है।
+- **उपयोग का मामला:** बड़े regularization datasets के लिए उपयोगी जहाँ आप छोटे training sets को overwhelm न करने के लिए डेटा का केवल एक हिस्सा उपयोग करना चाहते हैं।
+- **Deterministic selection:** Random selection dataset `id` को seed के रूप में उपयोग करता है, जिससे reproducibility के लिए training sessions में समान subset चुना जाना सुनिश्चित होता है।
+- **डिफ़ॉल्ट:** `null` (कोई सीमा नहीं, सभी samples उपयोग होते हैं)
+
+#### उदाहरण
+```json
+{
+  "id": "regularization-data",
+  "max_num_samples": 1000,
+  ...
+}
+```
+
+यह dataset से 1000 samples को deterministically select करेगा, जिसमें हर बार training चलाने पर समान selection उपयोग होगी।
+
 ### `start_epoch` / `start_step`
 
 - यह schedule करता है कि dataset sampling कब शुरू होगी।
@@ -559,6 +577,38 @@ effective_batch_size = train_batch_size × num_gpus × gradient_accumulation_ste
 - जिन datasets की start condition कभी पूरी नहीं होती (उदा., `start_epoch` `--num_train_epochs` से आगे), उन्हें skip किया जाएगा और model card में नोट किया जाएगा।
 - जब scheduled datasets mid‑run में active होते हैं, तो progress‑bar step estimates approximate हो जाते हैं क्योंकि epoch length बढ़ सकती है।
 
+### `end_epoch` / `end_step`
+
+- यह schedule करता है कि dataset sampling कब **बंद** होगी (`start_epoch`/`start_step` का पूरक)।
+- `end_epoch` (डिफ़ॉल्ट: `null` = कोई सीमा नहीं) इस epoch के बाद sampling बंद कर देता है; `end_step` (डिफ़ॉल्ट: `null` = कोई सीमा नहीं) इस optimizer step के बाद sampling बंद कर देता है।
+- कोई भी condition पूरी होते ही dataset बंद हो जाएगा; दोनों स्वतंत्र रूप से काम करती हैं।
+- **Curriculum learning** workflows के लिए उपयोगी जहाँ आप चाहते हैं:
+  - पहले low-resolution data पर train करें, फिर high-resolution data पर switch करें।
+  - एक निश्चित बिंदु के बाद regularisation data को धीरे-धीरे हटाएं।
+  - एक single config file में multi-stage training बनाएं।
+
+**उदाहरण: Curriculum Learning**
+```json
+[
+  {
+    "id": "lowres-512",
+    "type": "local",
+    "dataset_type": "image",
+    "instance_data_dir": "/data/512",
+    "end_step": 300
+  },
+  {
+    "id": "highres-1024",
+    "type": "local",
+    "dataset_type": "image",
+    "instance_data_dir": "/data/1024",
+    "start_step": 300
+  }
+]
+```
+
+इस उदाहरण में, 512px dataset steps 1-300 के लिए उपयोग होता है, फिर 1024px dataset step 300 से आगे संभाल लेता है।
+
 ### `is_regularisation_data`
 
 - इसे `is_regularization_data` भी लिखा जा सकता है
@@ -579,6 +629,32 @@ effective_batch_size = train_batch_size × num_gpus × gradient_accumulation_ste
 - **Warning:** यह destructive है और undo नहीं किया जा सकता। सावधानी से उपयोग करें।
 - **Default:** trainer के `--delete_problematic_images` argument पर fallback करता है (डिफ़ॉल्ट: `false`)।
 
+### Filtering Statistics देखना
+
+जब SimpleTuner आपके dataset को process करता है, यह track करता है कि कितनी files filter हुईं और क्यों। ये statistics dataset के cache file (`aspect_ratio_bucket_indices_*.json`) में store होती हैं और WebUI में देखी जा सकती हैं।
+
+**Track की जाने वाली Statistics:**
+- **total_processed**: Process की गई files की संख्या
+- **too_small**: `minimum_image_size` से नीचे होने के कारण filter की गई files
+- **too_long**: duration limits से अधिक होने के कारण filter की गई files (audio/video)
+- **metadata_missing**: missing metadata के कारण skip की गई files
+- **not_found**: जो files locate नहीं हो सकीं
+- **already_exists**: cache में पहले से मौजूद files (reprocess नहीं हुईं)
+- **other**: अन्य कारणों से filter की गई files
+
+**WebUI में देखना:**
+
+WebUI file browser में datasets browse करते समय, किसी existing dataset वाली directory select करने पर filtering statistics दिखाई देंगी (यदि उपलब्ध हों)। यह diagnose करने में मदद करता है कि आपके dataset में expected से कम usable samples क्यों हैं।
+
+**Filtered files का Troubleshooting:**
+
+यदि बहुत सी files `too_small` के रूप में filter हो रही हैं:
+1. अपनी `minimum_image_size` setting check करें — यह `resolution` और `resolution_type` से match होनी चाहिए
+2. `resolution_type=pixel` के लिए, `minimum_image_size` minimum shorter edge length है
+3. `resolution_type=area` या `pixel_area` के लिए, `minimum_image_size` minimum total area है
+
+अधिक details के लिए नीचे [Troubleshooting](#filtered-datasets-का-troubleshooting) section देखें।
+
 ### `slider_strength`
 
 - **Values:** कोई भी float मान (positive, negative, या zero)
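
Review note on `end_epoch` / `end_step`: the added text says a dataset stops sampling after its end epoch or end step, and that either condition stops it. A minimal sketch of that scheduling check follows. The function `dataset_is_active` is hypothetical, and two things are assumptions rather than facts from the diff: all bounds are treated as inclusive, and every configured start condition must be met before sampling begins.

```python
def dataset_is_active(cfg, epoch, step):
    """Decide whether a dataset config should be sampled right now.

    Hypothetical helper: inclusive bounds and the handling of the start
    conditions are assumptions, not confirmed trainer behaviour.
    """
    start_epoch = cfg.get("start_epoch")
    start_step = cfg.get("start_step")
    end_epoch = cfg.get("end_epoch")
    end_step = cfg.get("end_step")

    # Not started yet (None means the condition is not configured).
    if start_epoch is not None and epoch < start_epoch:
        return False
    if start_step is not None and step < start_step:
        return False
    # Either end condition being passed stops the dataset.
    if end_epoch is not None and epoch > end_epoch:
        return False
    if end_step is not None and step > end_step:
        return False
    return True


lowres = {"id": "lowres-512", "end_step": 300}
highres = {"id": "highres-1024", "start_step": 300}
print(dataset_is_active(lowres, epoch=1, step=301))   # False
print(dataset_is_active(highres, epoch=1, step=300))  # True
```

Under this inclusive reading both datasets in the curriculum example are active at step 300. The diff does not say how the trainer treats that boundary, so it may be worth stating explicitly in the docs.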
 
@@ -552,6 +552,24 @@ effective_batch_size = train_batch_size × num_gpus × gradient_accumulation_ste
 - オーバーサブスクリプションなし: エラー
 - `--allow_dataset_oversubscription` あり: repeats が自動的に 1 へ設定（25 × 2 = 50 サンプル）
 
+### `max_num_samples`
+
+- **説明：** データセットの最大サンプル数を制限します。設定すると、完全なデータセットから指定されたサイズの決定論的なランダムサブセットが選択されます。
+- **使用例：** 大規模な正則化データセットで、小さなトレーニングセットを圧倒しないようにデータの一部のみを使用したい場合に便利です。
+- **決定論的選択：** ランダム選択はデータセット `id` をシードとして使用し、再現性のためにトレーニングセッション間で同じサブセットが選択されることを保証します。
+- **デフォルト：** `null`（制限なし、すべてのサンプルを使用）
+
+#### 例
+```json
+{
+  "id": "regularization-data",
+  "max_num_samples": 1000,
+  ...
+}
+```
+
+これにより、データセットから 1000 サンプルが決定論的に選択され、トレーニングを実行するたびに同じ選択が使用されます。
+
 ### `start_epoch` / `start_step`
 
 - データセットのサンプリング開始タイミングをスケジュールします。
@@ -560,6 +578,38 @@ effective_batch_size = train_batch_size × num_gpus × gradient_accumulation_ste
 - 開始条件を満たさないデータセット（例: `start_epoch` が `--num_train_epochs` を超える）はスキップされ、モデルカードに記載されます。
 - 進行中にスケジュールされたデータセットが有効になるとエポック長が増えるため、進捗バーのステップ見積もりは概算になります。
 
+### `end_epoch` / `end_step`
+
+- データセットのサンプリング**終了**タイミングをスケジュールします（`start_epoch`/`start_step` を補完）。
+- `end_epoch`（既定: `null` = 制限なし）はこのエポック後にサンプリングを停止；`end_step`（既定: `null` = 制限なし）はこの最適化ステップ後にサンプリングを停止。
+- どちらかの条件に達するとデータセットは停止します。両者は独立して動作します。
+- **カリキュラム学習**ワークフローに有用です：
+  - 早期に低解像度データで訓練し、その後高解像度データに切り替える。
+  - ある時点以降、正則化データを段階的に廃止する。
+  - 単一の設定ファイル内で多段階訓練を作成する。
+
+**例：カリキュラム学習**
+```json
+[
+  {
+    "id": "lowres-512",
+    "type": "local",
+    "dataset_type": "image",
+    "instance_data_dir": "/data/512",
+    "end_step": 300
+  },
+  {
+    "id": "highres-1024",
+    "type": "local",
+    "dataset_type": "image",
+    "instance_data_dir": "/data/1024",
+    "start_step": 300
+  }
+]
+```
+
+この例では、512px データセットはステップ 1-300 で使用され、その後ステップ 300 から 1024px データセットに引き継がれます。
+
 ### `is_regularisation_data`
 
 - `is_regularization_data` と綴ることもできます。
@@ -580,6 +630,32 @@ effective_batch_size = train_batch_size × num_gpus × gradient_accumulation_ste
 - **警告:** 破壊的で元に戻せません。注意して使用してください。
 - **既定値:** トレーナーの `--delete_problematic_images` 引数（既定: `false`）にフォールバックします。
 
+### フィルタリング統計の確認
+
+SimpleTuner がデータセットを処理する際、フィルタで除外されたファイルの数と理由を追跡します。これらの統計はデータセットのキャッシュファイル（`aspect_ratio_bucket_indices_*.json`）に保存され、WebUI で確認できます。
+
+**追跡される統計:**
+- **total_processed**: 処理されたファイル数
+- **too_small**: `minimum_image_size` 未満でフィルタされたファイル
+- **too_long**: 時間制限を超えたファイル（オーディオ/ビデオ）
+- **metadata_missing**: メタデータ不足でスキップされたファイル
+- **not_found**: 見つからなかったファイル
+- **already_exists**: キャッシュに既存のファイル（再処理なし）
+- **other**: その他の理由でフィルタされたファイル
+
+**WebUI での確認:**
+
+WebUI のファイルブラウザでデータセットを閲覧する際、既存のデータセットがあるディレクトリを選択すると、フィルタリング統計が表示されます（利用可能な場合）。これは、データセットの使用可能なサンプル数が予想より少ない理由を診断するのに役立ちます。
+
+**フィルタされたファイルのトラブルシューティング:**
+
+多くのファイルが `too_small` としてフィルタされている場合:
+1. `minimum_image_size` の設定を確認 — `resolution` と `resolution_type` に合わせる必要があります
+2. `resolution_type=pixel` の場合、`minimum_image_size` は最小短辺の長さです
+3. `resolution_type=area` または `pixel_area` の場合、`minimum_image_size` は最小総面積です
+
+詳細は下記の[トラブルシューティング](#フィルタされたデータセットのトラブルシューティング)セクションを参照してください。
+
 ### `slider_strength`
 
 - **値:** 任意の浮動小数値（正、負、または 0）
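
Review note on the `too_small` troubleshooting steps: the added text says `minimum_image_size` is a minimum shorter-edge length for `resolution_type=pixel` and a minimum total area for `area` or `pixel_area`. The sketch below illustrates that distinction only. The function `is_too_small` is hypothetical, and the area threshold is treated as a raw pixel count purely for illustration, since the diff does not state the units the trainer uses.

```python
def is_too_small(width, height, minimum_image_size, resolution_type):
    """Return True if an image would be filtered as too small.

    Hypothetical helper: the area units (raw pixels) are an assumption.
    """
    if resolution_type == "pixel":
        # Compare the shorter edge against the threshold.
        return min(width, height) < minimum_image_size
    if resolution_type in ("area", "pixel_area"):
        # Compare the total area against the threshold.
        return width * height < minimum_image_size
    raise ValueError(f"unknown resolution_type: {resolution_type!r}")
```

The same number can mean very different things in the two modes: `768` as a shorter-edge limit rejects an 800×600 image, while `768` read as an area would accept almost anything. A mismatch of this kind is one plausible cause of an unexpectedly high `too_small` count.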