GPU-accelerated audio processing pipeline: vocal separation (Demucs), speaker diarization (Pyannote), transcription (WhisperX), and text sentiment analysis. Its primary use case is building AI-ready audio databases — transforming raw recordings into structured, speaker-attributed JSON with word-level timestamps that feed directly into RAG pipelines, vector stores, and fine-tuning datasets. The pipeline uses a Ghost Track strategy: AI models run against a clean, music-free vocal stem to maximize accuracy, then the resulting metadata is applied back to the original audio, preserving its acoustic character. Designed to run on 24 GB consumer GPUs with all models resident in VRAM simultaneously, it processes large corpora in batch with no model reload overhead between files.
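To make the output shape concrete, here is a minimal sketch of how speaker-attributed, word-level records could be grouped into speaker turns before loading them into a vector store. The field names (`word`, `start`, `end`, `speaker`) and both helper functions are assumptions for illustration, not the tool's documented schema; docs/architecture.md describes the real data model. The sketch also assumes the vocal stem and the original recording share one timeline, so a timestamp found on the stem can index the original audio directly.

```python
from dataclasses import dataclass


@dataclass
class Word:
    # Hypothetical record shape; the real schema lives in docs/architecture.md.
    word: str
    start: float  # seconds from the start of the recording
    end: float    # seconds from the start of the recording
    speaker: str


def group_into_turns(words: list[Word]) -> list[dict]:
    """Merge consecutive words from the same speaker into one turn."""
    turns: list[dict] = []
    for w in words:
        if turns and turns[-1]["speaker"] == w.speaker:
            # Same speaker as the previous word: extend the current turn.
            turns[-1]["text"] += " " + w.word
            turns[-1]["end"] = w.end
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append({"speaker": w.speaker, "text": w.word,
                          "start": w.start, "end": w.end})
    return turns


def to_sample_range(start: float, end: float, sample_rate: int) -> tuple[int, int]:
    """Map a timestamp pair onto sample indices of the original audio."""
    return round(start * sample_rate), round(end * sample_rate)


words = [
    Word("hello", 0.50, 0.82, "SPEAKER_00"),
    Word("there", 0.85, 1.10, "SPEAKER_00"),
    Word("hi", 1.40, 1.55, "SPEAKER_01"),
]
turns = group_into_turns(words)
```

Each turn carries its own start and end time, so a retrieval hit can be traced back to the exact span of the original recording.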
Audio Refinery runs in two modes that share the same core pipeline. Pick the one that fits your workflow:
Install locally and process a file or a whole directory from the command line:
make dev-setup # install (Python 3.11 + uv)
audio-refinery pipeline --base-dir /data/audio/batch
Best for interactive use, ad-hoc processing, and batch runs over a local directory.
→ Full command reference: docs/cli.md
Run the containerized service and submit jobs over HTTP (URI-in / URI-out, async, multi-job batches):
docker run --gpus all -p 8000:8000 \
-e REFINERY_API_KEYS=your-secret-key \
-e HF_TOKEN=hf_your_token \
lunarcommand/audio-refinery:latest
Best for production deployments, integration with workflow orchestrators, and processing remote audio behind presigned URLs.
→ Operational guide: docs/service.md
The CLI and local development both use a Python 3.11 virtualenv. (The service path needs only Docker — see docs/service.md.)
# Create and activate a Python 3.11 virtualenv
uv venv --python 3.11.14
source .venv/bin/activate
# Install all deps (uv sync, whisperx, CUDA torch wheels, pre-commit hooks)
make dev-setup
# Copy the env template and add your Hugging Face token
cp .env.example .env
# Edit .env and set HF_TOKEN=hf_your_token_here
# Verify the install
make test
audio-refinery --help
CUDA note: uv sync resolves torch from PyPI and installs the CPU build. make dev-setup automatically reinstalls torch==2.1.2+cu121 and torchaudio==2.1.2+cu121 (CUDA 12.1) as its final step. If your system uses a different CUDA version, edit the wheel URLs in the Makefile, then run make install-torch-cuda.
NumPy constraint: numpy<2.0.0 is pinned in pyproject.toml. Do not upgrade it: WhisperX and some audio libraries break with NumPy 2.x.
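The two notes above can be verified after installation with a small check. This is a sketch, not part of the project: `parse_release` and `check_pins` are hypothetical helpers, and the only assumption about torch is that the CUDA wheels named above carry a local version tag such as `+cu121`.

```python
import re


def parse_release(version: str) -> tuple[int, ...]:
    """Return the numeric release part, e.g. '2.1.2+cu121' -> (2, 1, 2)."""
    public = version.split("+", 1)[0]  # drop a local tag such as '+cu121'
    match = re.match(r"\d+(?:\.\d+)*", public)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group(0).split("."))


def check_pins(numpy_version: str, torch_version: str) -> list[str]:
    """Return human-readable problems; an empty list means both pins look right."""
    problems: list[str] = []
    if parse_release(numpy_version) >= (2,):
        problems.append(f"numpy {numpy_version} violates the numpy<2.0.0 pin")
    # The CUDA wheels named in the note carry a local tag such as '+cu121'.
    local_tag = torch_version.partition("+")[2]
    if not local_tag.startswith("cu"):
        problems.append(f"torch {torch_version} does not look like a CUDA build")
    return problems
```

In an activated virtualenv, call `check_pins(numpy.__version__, torch.__version__)` and confirm the result with `torch.cuda.is_available()`, which is the authoritative test that the GPU is actually usable.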
Pyannote speaker diarization models are gated on Hugging Face. This applies to both CLI and service mode. Complete these steps once:
- Create a Hugging Face account at huggingface.co if you don't have one.
- Accept the license for each gated model (must be logged in): pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0.
- Create a read-only access token: Profile → Settings → Access Tokens → New token.
- Provide it to the tool:
  - CLI: add HF_TOKEN=hf_your_token_here to .env (copy from .env.example), or export HF_TOKEN=... in your shell.
  - Service: pass -e HF_TOKEN=hf_your_token_here to docker run.
The .env file is gitignored. The token is never embedded in code.
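A pre-flight check can catch a missing or placeholder token before a long batch run. The sketch below is illustrative only: `read_env_file` and `find_hf_token` are hypothetical helpers, and the lookup order (shell environment first, then .env) is an assumption that the tool itself may not share.

```python
import os
from pathlib import Path
from typing import Optional

# The template value from this README; treated here as "not configured".
PLACEHOLDER = "hf_your_token_here"


def read_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and '#' comments."""
    values: dict[str, str] = {}
    if not path.is_file():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def find_hf_token(env_path: Path = Path(".env")) -> Optional[str]:
    """Return HF_TOKEN from the shell environment, else from the .env file."""
    # Assumption: the shell value takes priority over .env.
    token = os.environ.get("HF_TOKEN") or read_env_file(env_path).get("HF_TOKEN")
    if not token or token == PLACEHOLDER:
        return None
    return token
```

A `None` result means the token still needs to be set; it does not confirm that the model licenses were accepted, which only a real download attempt can show.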
CLI users should also review the scratch directory and Demucs model weights notes before the first run.
Both entry points are thin callers around one shared pipeline core:
flowchart TD
CLI["CLI — audio-refinery<br/>one-shot / batch over a directory"]
SVC["Service — HTTP API<br/>async jobs, URI-in / URI-out"]
CLI --> CORE
SVC --> CORE
subgraph CORE [core pipeline]
direction LR
SEP[separate] --> DIA[diarize] --> TRX[transcribe] --> SEN["sentiment (optional)"]
end
The CLI loads models per invocation; the service loads them once at container startup and keeps them resident across jobs. See docs/architecture.md for the full design, model selection rationale, and data model.
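The "thin callers around one core" shape can be sketched in a few lines. Everything here is a toy stand-in: the stage functions return placeholder values instead of running models, and names such as `run_core`, `cli_main`, and `Service` are hypothetical, not the project's actual modules.

```python
from typing import Callable

# Hypothetical stand-ins for the real stages; each takes and returns a job dict.
Stage = Callable[[dict], dict]


def separate(job: dict) -> dict:
    return {**job, "vocals": f"{job['source']}.vocals"}


def diarize(job: dict) -> dict:
    return {**job, "speakers": ["SPEAKER_00"]}


def transcribe(job: dict) -> dict:
    return {**job, "words": []}


def sentiment(job: dict) -> dict:
    return {**job, "sentiment": "neutral"}


def build_pipeline(with_sentiment: bool = False) -> list[Stage]:
    """Assemble the stage order shown in the flowchart."""
    stages: list[Stage] = [separate, diarize, transcribe]
    if with_sentiment:
        stages.append(sentiment)  # the optional final stage
    return stages


def run_core(source: str, stages: list[Stage]) -> dict:
    """Shared core: both entry points call this with their own inputs."""
    job: dict = {"source": source}
    for stage in stages:
        job = stage(job)
    return job


def cli_main(paths: list[str]) -> list[dict]:
    # CLI style: stages are built once per invocation, then reused for the batch.
    stages = build_pipeline()
    return [run_core(path, stages) for path in paths]


class Service:
    # Service style: stages are built once at startup and reused across jobs.
    def __init__(self) -> None:
        self.stages = build_pipeline(with_sentiment=True)

    def handle_job(self, uri: str) -> dict:
        return run_core(uri, self.stages)
```

Keeping the stage list as the only shared state means the two entry points differ only in when that list is built, which mirrors the per-invocation versus at-startup loading described above.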
| Document | Description |
|---|---|
| Index | Navigation hub for all documentation |
| CLI Reference | Every command, flag, and example for workstation use |
| Service Guide | HTTP API, container deployment, env vars, ops, troubleshooting |
| Architecture | Ghost Track pipeline design, model selection rationale, data model |
| Use Cases | Who uses this and for what |
| Performance | Throughput benchmarks, scaling options, optimization guide |
| Deployment | Production patterns, async workers, Docker, monitoring |
| Development | Dev setup, testing, contributing, release process |
uv venv --python 3.11.14
source .venv/bin/activate
# Install all deps including whisperx, CUDA torch, dev tools, and pre-commit hooks
make dev-setup
# Run unit tests (no GPU required)
make test
# Run integration tests (requires GPU, HF_TOKEN, and test audio)
make test-integration
# Lint and format
make lint
make format
See docs/development.md for the full developer guide, including how to run the service locally.
audio-refinery is released under the MIT License.
Dependency note: The Pyannote model weights (pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0) are gated on Hugging Face under separate terms. If you run this tool in a commercial data product, verify that your Hugging Face account's accepted terms cover your use case. The MIT license on this software does not extend to the model weights — those are governed by their respective Hugging Face model cards.
