Real-time speech denoising using an IRM U-Net — Vietnamese speech, trained on VIVOS + DEMAND + MUSAN.
- Overview
- Architecture
- Project Structure
- Quick Start
- Datasets
- Configuration
- Documentation
- Contributing
denoise-audio is a deep-learning pipeline that removes background noise from speech in real time.
The model processes one 256 ms audio window at a time and emits a single denoised 8 ms hop per
inference step, making it suitable for live microphone streaming.
| Property | Value |
|---|---|
| Model | IRM U-Net with asymmetric pooling |
| Input | log₁p-magnitude spectrogram (256 × 32 × 1) |
| Sample rate | 16 000 Hz |
| Algorithmic latency | 256 ms |
| Output hop latency | 8 ms per step |
| Language focus | Vietnamese (VIVOS corpus) + any noise from DEMAND/MUSAN |
| Framework | TensorFlow / Keras |
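The latency figures in the table follow from the frame geometry. The arithmetic below is a minimal sketch that assumes a hop of 128 samples (the value used in the streaming example of this README) and counts the window as 32 hops, ignoring any extra samples the STFT frame length might add.

```python
SAMPLE_RATE = 16_000   # Hz, from the table above
HOP_SAMPLES = 128      # samples per hop (assumed; matches the streaming example)
N_FRAMES = 32          # time frames per model input

hop_ms = 1000 * HOP_SAMPLES / SAMPLE_RATE
window_samples = N_FRAMES * HOP_SAMPLES
window_ms = 1000 * window_samples / SAMPLE_RATE

print(f"hop: {hop_ms:.0f} ms, window: {window_samples} samples = {window_ms:.0f} ms")
# prints "hop: 8 ms, window: 4096 samples = 256 ms"
```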
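The model predicts an Ideal Ratio Mask (IRM) and applies it to the noisy input inside the network, so the output is a clean-spectrogram estimate rather than the raw mask.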
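For readers new to ratio masks, here is a minimal NumPy sketch of one common IRM definition, the square root of the speech-to-total power ratio. This README does not state which variant the model is trained against, so treat the formula, the function name `ideal_ratio_mask`, and the synthetic magnitudes as illustrative assumptions.

```python
import numpy as np

def ideal_ratio_mask(clean_mag, noise_mag, eps=1e-8):
    """Power-ratio IRM: sqrt(|S|^2 / (|S|^2 + |N|^2)); values lie in [0, 1]."""
    clean_pow = clean_mag ** 2
    noise_pow = noise_mag ** 2
    return np.sqrt(clean_pow / (clean_pow + noise_pow + eps))

rng = np.random.default_rng(0)
clean_mag = rng.uniform(0.0, 1.0, size=(256, 32))   # synthetic |S|, (freq, time)
noise_mag = rng.uniform(0.0, 0.5, size=(256, 32))   # synthetic |N|
noisy_mag = clean_mag + noise_mag                   # crude stand-in for |S + N|

mask = ideal_ratio_mask(clean_mag, noise_mag)
clean_est = mask * noisy_mag                        # masked estimate of |S|
print(mask.shape, float(mask.min()) >= 0.0, float(mask.max()) <= 1.0)
```

In the network itself the mask is learned rather than computed from known clean and noise signals; the oracle version above only shows what the mask represents.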
Key design decisions:
| Feature | Reason |
|---|---|
| Asymmetric pooling (2,1) | Preserves the time axis T=32 at every encoder depth |
| Squeeze-and-Excitation block | Channel-wise attention at bottleneck |
| SpatialDropout2D(0.3) | Regularisation on small Vietnamese dataset |
| Mask applied inside model | Loss computed on clean estimate, not raw mask |
See docs/architecture.md for the full architecture diagram and parameter count.
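To see why (2,1) pooling keeps T=32 fixed, the sketch below pools only the frequency axis using plain NumPy. The helper name `max_pool_freq`, the use of max pooling, and the encoder depth of 4 are assumptions for illustration; in Keras the same shape behaviour comes from `tf.keras.layers.MaxPooling2D(pool_size=(2, 1))`.

```python
import numpy as np

def max_pool_freq(x, factor=2):
    """Max-pool a (freq, time, channels) array along the frequency axis only."""
    f, t, c = x.shape
    assert f % factor == 0, "frequency axis must be divisible by the pool factor"
    # Group adjacent frequency bins in pairs and take the max of each pair.
    return x.reshape(f // factor, factor, t, c).max(axis=1)

x = np.random.default_rng(0).normal(size=(256, 32, 1))
shapes = [x.shape]
for _ in range(4):  # depth of 4 is an assumption, not taken from the model
    x = max_pool_freq(x)
    shapes.append(x.shape)

print(shapes)
# prints [(256, 32, 1), (128, 32, 1), (64, 32, 1), (32, 32, 1), (16, 32, 1)]
```

The frequency axis halves at each level while the time axis stays at 32, so every output frame still maps to one input hop.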
denoise-audio/
├── configs/
│   ├── data_config.yaml        # Dataset paths, SNR range, split ratios
│   └── train_config.yaml       # STFT params, model hyper-params, callbacks
│
├── data/
│   ├── raw/
│   │   ├── clean/vivos/        # VIVOS Vietnamese speech WAVs
│   │   └── noise/
│   │       ├── demand/         # DEMAND noise WAVs
│   │       └── musan/          # MUSAN music + noise WAVs
│   └── processed/
│       ├── train/              # .npz files: keys "noisy", "clean" (B, 256, 32, 1)
│       ├── val/
│       └── test/
│
├── docs/                       # Extended technical documentation
│   ├── architecture.md
│   ├── data_pipeline.md
│   ├── training.md
│   ├── realtime.md
│   └── contributing.md
│
├── models/
│   └── checkpoints/            # Saved .keras model weights
│
├── notebooks/
│   └── architecture_experiment.ipynb
│
├── scripts/
│   └── download_dataset.py     # Kaggle API dataset downloader
│
└── src/
    ├── audio/                  # Low-level audio I/O utilities
    ├── data/
    │   ├── preprocessing/      # WAV → (noisy, clean) spectrogram segments
    │   │   └── audio.py        ← AudioPreprocessor, build_dataset
    │   ├── postprocessing/     # Model output → PCM audio (real-time)
    │   │   └── audio.py        ← AudioPostprocessor
    │   └── dataset.py          # tf.data.Dataset loader (TODO)
    ├── model/
    │   └── unet.py             # UNetDenoiser, build_unet_denoise()
    ├── realtime/               # Streaming inference utilities (TODO)
    └── training/
        └── metrics.py          # CombinedSpectralLoss, SI-SNR, PESQ, STOI
pip install -r requirements.txt
# Optional — perceptual evaluation metrics
pip install pesq pystoi

Requires Python ≥ 3.10 and TensorFlow ≥ 2.15.
Requires a Kaggle API token saved at ~/.kaggle/kaggle.json.
# Download all datasets defined in configs/data_config.yaml
python scripts/download_dataset.py
# Download individual dataset
python scripts/download_dataset.py --dataset vivos
python scripts/download_dataset.py --dataset demand
python scripts/download_dataset.py --dataset musan

| Dataset | Purpose | Size |
|---|---|---|
| VIVOS | Clean Vietnamese speech | ~700 MB |
| DEMAND | Environmental noise (15 categories) | ~8 GB |
| MUSAN | Music + noise (optional) | ~11 GB |
Converts raw WAV pairs into .npz segment files used by the training loop:
# All splits (train / val / test)
python -m src.data.preprocessing.audio --split all --pairs-per-clean 3
# Single split
python -m src.data.preprocessing.audio --split train

Each .npz file contains:
- noisy: shape (N, 256, 32, 1), model input
- clean: shape (N, 256, 32, 1), training target
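A quick way to sanity-check a segment file is to load it with NumPy and inspect the two keys. The sketch below writes a dummy file first so it runs anywhere; the file name `example_segments.npz` and the zero-filled arrays are placeholders, not real pipeline output.

```python
import os
import tempfile

import numpy as np

n_segments = 4  # arbitrary N for the demo
noisy = np.zeros((n_segments, 256, 32, 1), dtype=np.float32)
clean = np.zeros((n_segments, 256, 32, 1), dtype=np.float32)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example_segments.npz")  # hypothetical file name
    np.savez(path, noisy=noisy, clean=clean)

    # np.load returns a lazy archive; arrays are read when indexed by key.
    with np.load(path) as data:
        keys = sorted(data.files)
        loaded_noisy = data["noisy"]
        loaded_clean = data["clean"]

print(keys, loaded_noisy.shape, loaded_clean.shape)
```

For a real file, replace `path` with one of the files under data/processed/train/ and skip the `np.savez` call.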
# (training script coming soon — see docs/training.md)
python scripts/train.py --config configs/train_config.yaml

Default hyper-parameters (configs/train_config.yaml):
| Parameter | Value |
|---|---|
| Batch size | 16 |
| Epochs | 100 |
| Learning rate | 3 × 10⁻⁴ (Adam) |
| LR scheduler | ReduceLROnPlateau (patience=5) |
| Early stopping | patience=15 |
| Loss | 0.7 × SpectralL1 + 0.3 × SpectralConvergence |
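The weighted loss in the last row can be sketched in NumPy as follows. The definitions used here (mean absolute error for SpectralL1, Frobenius-norm ratio for SpectralConvergence) are the usual textbook forms and are an assumption; the actual CombinedSpectralLoss in src/training/metrics.py may differ, and the function names below are illustrative.

```python
import numpy as np

def spectral_l1(target, estimate):
    """Mean absolute error between two magnitude spectrograms."""
    return float(np.mean(np.abs(target - estimate)))

def spectral_convergence(target, estimate, eps=1e-8):
    """Norm of the error divided by the norm of the target."""
    return float(np.linalg.norm(target - estimate) / (np.linalg.norm(target) + eps))

def combined_spectral_loss(target, estimate, w_l1=0.7, w_sc=0.3):
    """0.7 * SpectralL1 + 0.3 * SpectralConvergence, as in the table above."""
    return w_l1 * spectral_l1(target, estimate) + w_sc * spectral_convergence(target, estimate)

rng = np.random.default_rng(0)
target = rng.uniform(0.0, 1.0, size=(256, 32))

loss_perfect = combined_spectral_loss(target, target)                # identical inputs
loss_silent = combined_spectral_loss(target, np.zeros_like(target))  # all-zero estimate
print(loss_perfect, loss_silent)
```

The convergence term is scale-invariant, which is one reason it is often paired with an absolute L1 term.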
import collections

import numpy as np
import tensorflow as tf

from src.data.preprocessing import AudioPreprocessor
from src.data.postprocessing import AudioPostprocessor

model = tf.keras.models.load_model("models/checkpoints/best.keras")
pre = AudioPreprocessor.from_configs()
post = AudioPostprocessor.from_configs()
buffer = collections.deque(maxlen=32)  # 32-frame ring buffer (32 × 8 ms = 256 ms)

# audio_stream and speaker are placeholders for the application's
# microphone source and audio sink; they are not defined in this repository.
for hop_samples in audio_stream:  # stream 128 samples (8 ms) at a time
    buffer.append(hop_samples)
    if len(buffer) < 32:
        continue  # wait until a full 256 ms window is buffered
    waveform = np.concatenate(buffer)
    log_mag, noisy_stft = pre.compute_magnitude(waveform)
    clean_est = model.predict(
        log_mag[np.newaxis, ..., np.newaxis], verbose=0
    )[0, ..., 0]  # (256, 32)
    pcm_frame = post.reconstruct_frame(clean_est, noisy_stft)  # (128,) float32
    speaker.write(pcm_frame)

See docs/realtime.md for detailed streaming integration.
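The buffering logic can be tried without a model or audio hardware. The sketch below feeds a synthetic signal through the same 32-hop ring buffer and collects the windows that would be passed to the model; the helper `sliding_windows` and the ramp signal are illustrative assumptions.

```python
import collections

import numpy as np

HOP = 128       # samples per hop (8 ms at 16 kHz)
N_FRAMES = 32   # hops held in the ring buffer

def sliding_windows(hops, n_frames=N_FRAMES):
    """Yield one concatenated window per hop once the ring buffer is full."""
    buffer = collections.deque(maxlen=n_frames)
    for hop in hops:
        buffer.append(hop)  # the oldest hop is dropped automatically when full
        if len(buffer) == n_frames:
            yield np.concatenate(list(buffer))

signal = np.arange(40 * HOP, dtype=np.float32)  # 40 hops of a ramp signal
hops = signal.reshape(-1, HOP)
windows = list(sliding_windows(hops))

print(len(windows), windows[0].shape, windows[1][0])
```

With 40 hops the first 31 only fill the buffer, so 9 windows come out, each 4096 samples long and shifted by one hop from the previous one.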
| Dataset | ID | Categories | SR | License |
|---|---|---|---|---|
| VIVOS | vivos | Vietnamese read speech | 16 kHz | CC BY-SA 4.0 |
| DEMAND | demand | 15 noise environments (transport, office, outdoor, …) | 16 / 48 kHz | CC BY-SA 3.0 |
| MUSAN | musan | Music, speech, noise | 16 kHz | CC BY 4.0 |
All datasets are downloaded via the Kaggle API and placed under data/raw/ according to
paths defined in configs/data_config.yaml.
All hyper-parameters are centralised in two YAML files:
| File | Controls |
|---|---|
| configs/data_config.yaml | Dataset paths, SNR range [-5, 20] dB, train/val/test split ratios |
| configs/train_config.yaml | STFT params, model shape, optimizer, LR schedule, callbacks |
Important: stft.* values in train_config.yaml must stay in sync between preprocessing and postprocessing. Changing n_fft or hop_length requires rebuilding the dataset.
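The SNR range listed for configs/data_config.yaml governs how clean speech and noise are combined. A minimal sketch of mixing at a target SNR is shown below; the function name `mix_at_snr`, the uniform sampling of the SNR, and the synthetic signals are assumptions, and the repository's own mixing code is documented in docs/data_pipeline.md.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db, eps=1e-12):
    """Scale noise so that clean power / scaled-noise power equals snr_db, then add."""
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + eps
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(0)
t = np.arange(16_000) / 16_000
clean = np.sin(2 * np.pi * 220 * t)   # 1 s synthetic tone standing in for speech
noise = rng.normal(size=16_000)       # white noise standing in for DEMAND/MUSAN
snr_db = rng.uniform(-5, 20)          # drawn from the configured range

noisy = mix_at_snr(clean, noise, snr_db)
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
print(round(float(snr_db), 3), round(float(achieved), 3))
```

The two printed values should agree, which confirms the scaling reaches the requested SNR.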
| Document | Description |
|---|---|
| docs/architecture.md | U-Net architecture, SE block, asymmetric pooling, parameter counts |
| docs/data_pipeline.md | Full math for STFT, SNR mixing, log1p, segmentation, iSTFT |
| docs/training.md | Training loop, loss functions, metrics (SI-SNR, PESQ, STOI) |
| docs/realtime.md | Sliding-buffer streaming design, latency analysis |
| docs/contributing.md | How to implement # TODO blocks, code style, PR workflow |
| src/data/TECHNICAL.md | Detailed math reference for preprocessing / postprocessing |
| src/model/README.md | Model architecture technical reference |
This repository uses skeleton # TODO stubs — see docs/contributing.md
for the implementation guide and development workflow.
src/data/preprocessing/audio.py — 11 TODOs (AudioPreprocessor + build_dataset)
src/data/postprocessing/audio.py — 6 TODOs (AudioPostprocessor)
src/data/dataset.py — tf.data.Dataset loader
src/realtime/ — streaming inference utilities
scripts/train.py — training entry-point