This repository provides sample implementations for the pre-training and finetuning pipelines of BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings.
If you want to use the finetuned version of BabyHuBERT for the Voice Type Classification task, please refer to VTC2.0.
BabyHuBERT extends HuBERT’s self-supervised learning framework to child-centered multilingual long-form recordings. It follows the same two-stage pre-training procedure as HuBERT, starting from WavLM-base-plus features, and is implemented using the torchaudio HuBERT example.
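For intuition, here is a minimal sketch of extracting WavLM-base-plus layer features with torchaudio, i.e. the kind of representation that is clustered to produce the first-iteration targets. The audio path is hypothetical and the layer-6 choice mirrors the `--layer-index 6` setting used below; the actual sharded extraction is handled by the preprocessing scripts described later.

```python
import torch
import torchaudio

# WavLM-base-plus bundle shipped with torchaudio.
bundle = torchaudio.pipelines.WAVLM_BASE_PLUS
model = bundle.get_model().eval()

# Hypothetical long-form excerpt; resample to the model's 16 kHz rate.
waveform, sr = torchaudio.load("longform_chunk.wav")
waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    # Keep the outputs of the first 6 transformer layers; the last entry
    # corresponds to the layer-6 features fed to the first k-means step.
    features, _ = model.extract_features(waveform, num_layers=6)

layer6 = features[-1]  # shape: (batch, frames, 768)
```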
Before running the pre-training or finetuning pipelines, install the dependencies below:
pip install uv
# Create and activate the pretraining environment
uv venv .venv-pretrain
source .venv-pretrain/bin/activate
# Install the pretraining dependencies
uv sync

For the finetuning environment:
git clone https://github.com/arxaqapi/segma.git
cd segma
# Create and activate the finetuning environment
uv venv .venv-finetuning
source .venv-finetuning/bin/activate
# Install the finetuning dependencies
uv sync

HuBERT pre-training proceeds in two iterations, and BabyHuBERT follows the same two-stage process.
- `preprocess_samples.py`: Adjusts the distribution of sample durations by merging segments that overlap or are separated by less than 2 seconds (see the sketch after this list).
- `archive_samples.py`: Generates training set archives, sharded into 32 archives for distributed training.
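For intuition, the merging rule can be sketched as follows. The function name and the `(start, end)` segment representation are illustrative, not the actual interface of `preprocess_samples.py`:

```python
def merge_segments(segments, max_gap=2.0):
    """Merge (start, end) segments, in seconds, that overlap or are
    separated by less than `max_gap` seconds."""
    merged = []
    for start, end in sorted(segments):
        if merged and start - merged[-1][1] < max_gap:
            # Overlaps or the gap is under 2 s: extend the previous segment.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# The first two segments are 0.7 s apart and get merged; the third stays separate.
print(merge_segments([(0.0, 3.5), (4.2, 6.0), (10.0, 12.0)]))
# [(0.0, 6.0), (10.0, 12.0)]
```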
All SLURM scripts follow the naming format `launch_*.sh`. Preprocessing is split into three stages:
- Generate Features (`-gf`) → 32 separate jobs, each using 1×A100 GPU.
- K-means Clustering (`-lk`) → single job requiring 1 TB+ RAM (see the sketch after this list).
- Generate Labels (`-gl`) → 32 separate CPU jobs.
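Conceptually, the clustering step fits k-means with 500 centroids (`--num-cluster 500`) on the frame-level features produced by `-gf`, and the resulting cluster assignments become the pseudo-labels written by `-gl`. Below is a minimal sketch using scikit-learn's MiniBatchKMeans, assuming the features have been flattened into a `(num_frames, feature_dim)` array; the real pipeline runs on the sharded feature archives, so file names here are hypothetical.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical: frame-level features from one -gf shard, shape (num_frames, 768).
features = np.load("features_shard_00.npy")

kmeans = MiniBatchKMeans(
    n_clusters=500,     # --num-cluster 500
    batch_size=10_000,
    max_iter=100,
    n_init=20,
    random_state=0,
)
kmeans.fit(features)

# Frame-level pseudo-labels used as masked-prediction targets.
labels = kmeans.predict(features)
```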
Training was conducted on 32×H100 GPUs, distributed across 8 nodes (4 GPUs per node).
Use the correct environment:

source .venv-pretrain/bin/activate

srun uv run preprocess.py -gf -lk -gl \
--num-shards-kmeans 6 \
--feat-type wavlm-base-plus \
--layer-index 6 \
--num-rank 32 \
--num-cluster 500

srun uv run train.py \
--dataset longforms \
--dataset-path ./exp_iter/data/wavlm-base-plus_1_7 \
--exp-dir ./exp_iter2_B175 \
--feature-type hubert \
--num-class 500 \
--max-updates 400000 \
--seconds-per-batch 175 \
--learning-rate 0.0005 \
--gpus 4 \
--num-nodes 8

srun uv run preprocess.py -gf -lk -gl \
--num-shards-kmeans 6 \
--feat-type baby-hubert-175s \
--layer-index 7 \
--num-rank 32 \
--num-cluster 500

srun uv run train.py \
--dataset longforms \
--dataset-path ./exp_iter2_B175/data/baby-hubert-175s_1_7 \
--exp-dir ./exp_iter3_B175 \
--feature-type hubert \
--num-class 500 \
--max-updates 400000 \
--seconds-per-batch 175 \
--learning-rate 0.0005 \
--gpus 4 \
--num-nodes 8

Finetuning is performed using the segma library.
Use the correct environment:

source .venv-finetuning/bin/activate

Modify the config file:
segma/src/segma/config/train_surgical_hubert_hydra.yml
Choose the HuBERT model checkpoint to finetune:
# HuBERT-base
wav_encoder: hubert_base
# BabyHuBERT-1
wav_encoder: "path/to/BabyHuBERT-1-checkpoint"
# BabyHuBERT-2
wav_encoder: "path/to/BabyHuBERT-2-checkpoint"# Set environment variables
run_id="BabyHuBERT2VTC"
config_model="train_surgical_hubert_hydra.yml"
user_path="/path/to/checkpoint"
segma_path="/path/to/segma"
# Launch finetuning
srun uv run $segma_path/scripts/auto_train.py \
--auto-resume \
--all-weights \
--run-id $run_id \
--output $user_path/checkpoints/ \
--config $user_path/checkpoints/$run_id/config.yml

To cite this work, please use the following BibTeX entry.
@misc{charlot2025babyhubertmultilingualselfsupervisedlearning,
title={BabyHuBERT: Multilingual Self-Supervised Learning for Segmenting Speakers in Child-Centered Long-Form Recordings},
author={Théo Charlot and Tarek Kunze and Maxime Poli and Alejandrina Cristia and Emmanuel Dupoux and Marvin Lavechin},
year={2025},
eprint={2509.15001},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2509.15001},
}