This repository provides a full Speaker Verification pipeline with:
- PTM embeddings (WavLM / HuBERT / Wav2Vec2)
- Handcrafted features (MFBE / MFCC / Pitch)
- Training for mode 1/2/3 in train/
- A Streamlit demo app in app/
SpeechVeri_MultiFeatures/
├── data_preparation/ # Notebooks for audio conversion/splitting
├── embedding/ # PTM embedding extraction
├── extract_feature_model/ # Handcrafted feature extraction notebook
├── train/ # Training and evaluation pipeline
├── app/ # Streamlit demo (register/compare/identify)
├── requirements.txt
└── README.md
From repository root:
pip install -r requirements.txt
Recommended Python version: 3.10–3.12.
Use notebooks in data_preparation/ to convert/cut/split audio.
Expected downstream input is normalized .wav files.
Use embedding/main.ipynb (or APIs in embedding/embedding.py).
Output .pt files include:
- embeddings with shape (N, 13, 768)
- speaker_ids
- filenames
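A quick sanity check on a loaded embedding file can catch layout problems before training. The sketch below is illustrative only: `check_embedding_payload` is a hypothetical helper, not part of this repository. It assumes the `.pt` file loads to a dict-like object and that `speaker_ids` and `filenames` each hold one entry per sample.

```python
# Hypothetical helper (not part of this repo): validate the documented
# layout of an embedding payload after loading it, e.g. with torch.load.
EXPECTED_KEYS = ("embeddings", "speaker_ids", "filenames")


def check_embedding_payload(payload):
    """Validate a loaded embedding payload and return the sample count N."""
    missing = [key for key in EXPECTED_KEYS if key not in payload]
    if missing:
        raise KeyError(f"missing keys: {missing}")

    # Works for any array-like exposing .shape (torch.Tensor, numpy.ndarray).
    shape = tuple(payload["embeddings"].shape)
    if len(shape) != 3 or shape[1:] != (13, 768):
        raise ValueError(f"expected shape (N, 13, 768), got {shape}")

    n = shape[0]
    # Assumption: one speaker ID and one filename per embedded sample.
    if len(payload["speaker_ids"]) != n or len(payload["filenames"]) != n:
        raise ValueError("speaker_ids and filenames must each have N entries")
    return n
```

Typical use would be to call it on the object returned by `torch.load("path/to/file.pt")` right after extraction.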
Use notebooks in extract_feature_model/ to generate handcrafted features (mfbe_pitch, mfcc_pitch, etc.).
Feature sample order must match PTM embedding sample order.
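One way to verify the ordering requirement is to compare the filename lists stored on both sides. This is a minimal sketch, assuming the handcrafted feature files also keep a per-sample filename list (not confirmed here) and that matching samples share the same file stem; `first_order_mismatch` is a hypothetical name.

```python
from pathlib import Path


def first_order_mismatch(embedding_files, feature_files):
    """Return the index of the first position where the two sample
    orders disagree, or None if they match exactly."""
    for i, (emb_name, feat_name) in enumerate(zip(embedding_files, feature_files)):
        # Compare stems so differing extensions or directories are ignored.
        if Path(emb_name).stem != Path(feat_name).stem:
            return i
    # Same prefix but different lengths: the shorter list ends first.
    if len(embedding_files) != len(feature_files):
        return min(len(embedding_files), len(feature_files))
    return None
```

A return value of `None` means the two lists line up sample by sample; any integer points at the first sample to inspect.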
train/train.py is currently used via notebook/API (no argparse CLI entrypoint).
Recommended:
cd train
jupyter notebook main.ipynb
Or run with Python API:
from types import SimpleNamespace
from train.train import train
args = SimpleNamespace(
    embedding_path="path/to/embedding_shards_or_pt",
    feature_path="path/to/feature_shards_or_pt",
    mode=3,
    fusion_method="concat",  # concat | gating | film
    feature_mode="mfbe_pitch",
    use_gating=True,
    use_augment=False,
    batch_size=64,
    learning_rate=1e-3,
    weight_decay=1e-4,
    num_epochs=100,
    optimizer="adam",
    lr_scheduler="plateau",
    early_stop_patience=10,
    mixed_precision=True,
    embedding_dim=512,
    output_dir="train/outputs",
    exp_name="Mode3_concat_train_raw_wavlm_mfbe_pitch",
    seed=42,
    duration="train_raw",
    pretrained_model="wavlm",
)
model, history, exp_dir = train(args)
print(exp_dir)
From repository root:
streamlit run app/streamlit_app.py
Default app checkpoint:
train/outputs/experiments/Mode3_concat_train_raw_wavlm_mfbe_pitch/best_model.pth
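The default path appears to follow an `<output_dir>/experiments/<exp_name>/best_model.pth` pattern. The sketch below rebuilds it from the two training arguments. This layout is inferred from the default path only and is not a confirmed contract of `train()`; `checkpoint_path` is a hypothetical helper.

```python
from pathlib import Path


def checkpoint_path(output_dir, exp_name):
    """Build the checkpoint location for an experiment.

    Assumption: checkpoints live under <output_dir>/experiments/<exp_name>/,
    as suggested by the default app checkpoint path.
    """
    return Path(output_dir) / "experiments" / exp_name / "best_model.pth"


ckpt = checkpoint_path("train/outputs", "Mode3_concat_train_raw_wavlm_mfbe_pitch")
if not ckpt.is_file():
    print(f"checkpoint not found yet: {ckpt}")
```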
If you hit a Streamlit watcher issue with torch.classes:
streamlit run app/streamlit_app.py --server.fileWatcherType none
For embedding extraction scripts, the speaker ID is parsed from the filename prefix before _.
Example:
speaker001_sample01.wav -> speaker ID speaker001
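The naming rule can be sketched as follows. This is an illustration rather than the repository's own parser: `speaker_id_from_filename` is a hypothetical name, and it assumes the split happens at the first underscore of the file stem.

```python
from pathlib import Path


def speaker_id_from_filename(filename):
    """Return the part of the file stem that precedes the first underscore."""
    stem = Path(filename).stem  # drops directories and the extension
    speaker_id, sep, _ = stem.partition("_")
    if not sep or not speaker_id:
        # No underscore, or nothing before it: the rule cannot be applied.
        raise ValueError(f"cannot parse speaker ID from {filename!r}")
    return speaker_id


print(speaker_id_from_filename("speaker001_sample01.wav"))  # speaker001
```

Files whose names contain no underscore raise `ValueError` here, so they can be renamed before extraction instead of being silently mislabeled.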
- embedding/README.md
- extract_feature_model/README.md
- train/README.md
- app/README.md
- Mode 3 consumes both embedding and feature inputs.
- Valid mode 3 fusion methods in current code: concat, gating, film.
- cross_attention has been removed for mode 3 and raises ValueError.
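These rules can be checked up front, before building an args object. The sketch below mirrors the notes above but is not the repository's actual validation code; `check_mode3_fusion` and `VALID_MODE3_FUSION` are hypothetical names.

```python
# Fusion methods listed as valid for mode 3 in the notes above.
VALID_MODE3_FUSION = ("concat", "gating", "film")


def check_mode3_fusion(fusion_method):
    """Return fusion_method if it is usable with mode 3, else raise ValueError."""
    if fusion_method == "cross_attention":
        raise ValueError("cross_attention has been removed for mode 3")
    if fusion_method not in VALID_MODE3_FUSION:
        raise ValueError(
            f"unknown fusion method {fusion_method!r}; "
            f"expected one of {VALID_MODE3_FUSION}"
        )
    return fusion_method
```

Calling it on `args.fusion_method` before `train(args)` fails fast with a readable message instead of partway through setup.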