This repository contains the code, scripts and configurations to download and prepare data for multimodal training.
This folder contains:
- A robust download script for Hugging Face datasets (plus a SLURM script)
- A utility to check the HF Hub cache for corruption
- Important: configure the cache directories (HF datasets cache and HF Hub cache) correctly before downloading (see the subfolder README)
See the README in each subfolder for more details.
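
As a rough illustration of the cache configuration mentioned above, here is a minimal sketch that points both caches at one base directory. It relies on the `HF_HOME`, `HF_HUB_CACHE` and `HF_DATASETS_CACHE` environment variables, which `huggingface_hub` and `datasets` read. The helper name `configure_hf_caches` and the `hub`/`datasets` subdirectory layout are assumptions made for this example and are not taken from the scripts in this repository; the subfolder README remains the reference for the actual setup.

```python
import os
from pathlib import Path


def configure_hf_caches(base_dir: str) -> dict:
    """Point the Hugging Face caches at subdirectories of base_dir.

    Hypothetical helper for illustration only; the directory layout
    (base/hub, base/datasets) is an assumption, not this repo's convention.
    """
    base = Path(base_dir).expanduser().resolve()
    caches = {
        "HF_HOME": base,                        # root for Hugging Face files
        "HF_HUB_CACHE": base / "hub",           # files downloaded from the Hub
        "HF_DATASETS_CACHE": base / "datasets", # prepared datasets
    }
    for name, path in caches.items():
        # Create the directory so that later downloads do not fail on a missing path.
        path.mkdir(parents=True, exist_ok=True)
        os.environ[name] = str(path)
    return {name: str(path) for name, path in caches.items()}
```

Call the helper before importing `datasets` or `huggingface_hub`, since those libraries read the variables at import time. On a cluster, the same variables can instead be exported in the SLURM script.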