Audio Refinery

GPU-accelerated audio processing pipeline: vocal separation (Demucs), speaker diarization (Pyannote), transcription (WhisperX), and text sentiment analysis. Its primary use case is building AI-ready audio databases — transforming raw recordings into structured, speaker-attributed JSON with word-level timestamps that feed directly into RAG pipelines, vector stores, and fine-tuning datasets. The pipeline uses a Ghost Track strategy: AI models run against a clean, music-free vocal stem to maximize accuracy, then the resulting metadata is applied back to the original audio, preserving its acoustic character. Designed to run on 24 GB consumer GPUs with all models resident in VRAM simultaneously, it processes large corpora in batch with no model reload overhead between files.
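The Ghost Track idea can be illustrated with a small sketch. This is not the project's implementation: it assumes the vocal stem and the original recording share one timeline and sample rate, and the `Segment` fields and the `apply_to_original` helper are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Segment:
    """Hypothetical metadata record derived from the clean vocal stem."""
    speaker: str
    start: float  # seconds
    end: float    # seconds


def apply_to_original(original: Sequence[float], sample_rate: int,
                      segments: List[Segment]) -> List[dict]:
    """Slice the ORIGINAL audio using timestamps computed on the stem.

    Assumes both signals share the same timeline and sample rate.
    """
    clips = []
    for seg in segments:
        # Convert seconds to sample indices and clamp to the signal bounds.
        lo = max(0, int(round(seg.start * sample_rate)))
        hi = min(len(original), int(round(seg.end * sample_rate)))
        clips.append({
            "speaker": seg.speaker,
            "start": seg.start,
            "end": seg.end,
            "samples": list(original[lo:hi]),
        })
    return clips
```

The models never see the music or background in the original mix, yet each returned clip is cut from the original samples, so its acoustic character is preserved.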

Choose your path

Audio Refinery runs in two modes that share the same core pipeline. Pick the one that fits your workflow:

Run one-off transcriptions on a workstation — CLI

Install locally and process a file or a whole directory from the command line:

make dev-setup                                      # install (Python 3.11 + uv)
audio-refinery pipeline --base-dir /data/audio/batch

Best for interactive use, ad-hoc processing, and batch runs over a local directory.

Full command reference: docs/cli.md

Deploy at scale behind an HTTP API — service

Run the containerized service and submit jobs over HTTP (URI-in / URI-out, async, multi-job batches):

docker run --gpus all -p 8000:8000 \
  -e REFINERY_API_KEYS=your-secret-key \
  -e HF_TOKEN=hf_your_token \
  lunarcommand/audio-refinery:latest

Best for production deployments, integration with workflow orchestrators, and processing remote audio behind presigned URLs.

Operational guide: docs/service.md


Installation

The CLI and local development both use a Python 3.11 virtualenv. (The service path needs only Docker — see docs/service.md.)

# Create and activate a Python 3.11 virtualenv
uv venv --python 3.11.14
source .venv/bin/activate

# Install all deps (uv sync, whisperx, CUDA torch wheels, pre-commit hooks)
make dev-setup

# Copy the env template and add your Hugging Face token
cp .env.example .env
# Edit .env and set HF_TOKEN=hf_your_token_here

# Verify the install
make test
audio-refinery --help

CUDA note: uv sync resolves torch from PyPI and installs the CPU build. make dev-setup automatically reinstalls torch==2.1.2+cu121 and torchaudio==2.1.2+cu121 (CUDA 12.1) as its final step. If your system uses a different CUDA version, run make install-torch-cuda after editing the wheel URLs in the Makefile.
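To confirm which torch build ended up in the virtualenv, a quick check along these lines may help. It is a sketch only: `is_cuda_build` is a hypothetical helper, and it assumes the CUDA wheel reports its build as a `+cu121` suffix on the version string, as in the pins above.

```python
def is_cuda_build(torch_version: str, expected_tag: str = "cu121") -> bool:
    """Return True if the version string carries the expected local build tag.

    Example: "2.1.2+cu121" has the tag "cu121"; a plain "2.1.2" has none.
    """
    _, _, local = torch_version.partition("+")
    return local == expected_tag


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed in this environment")
    else:
        print("torch version:", torch.__version__)
        print("expected CUDA wheel:", is_cuda_build(torch.__version__))
        print("GPU visible:", torch.cuda.is_available())
```

If you edited the wheel URLs for a different CUDA version, pass the matching tag as `expected_tag`.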

NumPy constraint: numpy<2.0.0 is pinned in pyproject.toml. Do not upgrade it — WhisperX and some audio libraries break with NumPy 2.x.
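If another package pulls in a newer NumPy, a check like the following can catch it early. This is an illustrative sketch, not part of the project: `satisfies_numpy_pin` is a hypothetical helper that only inspects the major version component.

```python
from importlib import metadata


def satisfies_numpy_pin(version: str) -> bool:
    """Return True if the major version is below 2, matching numpy<2.0.0."""
    major = int(version.split(".")[0])
    return major < 2


if __name__ == "__main__":
    try:
        installed = metadata.version("numpy")
    except metadata.PackageNotFoundError:
        print("numpy is not installed in this environment")
    else:
        status = "ok" if satisfies_numpy_pin(installed) else "violates numpy<2.0.0"
        print(f"numpy {installed}: {status}")
```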


Prerequisites

Hugging Face access token (required for diarization)

Pyannote speaker diarization models are gated on Hugging Face. This applies to both CLI and service mode. Complete these steps once:

  1. Create a Hugging Face account at huggingface.co if you don't have one.
  2. Accept the license for each gated model while logged in: pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0.
  3. Create a read-only access token: Profile → Settings → Access Tokens → New token.
  4. Provide it to the tool:
    • CLI: add HF_TOKEN=hf_your_token_here to .env (copy from .env.example), or export HF_TOKEN=... in your shell.
    • Service: pass -e HF_TOKEN=hf_your_token_here to docker run.

The .env file is gitignored. The token is never embedded in code.
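A pre-flight check for the token might look like the sketch below. It is an assumption-laden illustration: `hf_token_problem` is a hypothetical helper, it only inspects a mapping such as `os.environ` (it does not parse `.env` itself), and the placeholder strings are the example values used in this README.

```python
import os
from typing import Mapping, Optional

# Example values from this README that should never be used as real tokens.
PLACEHOLDERS = {"hf_your_token_here", "hf_your_token"}


def hf_token_problem(env: Mapping[str, str]) -> Optional[str]:
    """Return a description of the problem, or None if HF_TOKEN looks usable."""
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        return "HF_TOKEN is not set"
    if token in PLACEHOLDERS:
        return "HF_TOKEN still holds the placeholder value"
    return None


if __name__ == "__main__":
    problem = hf_token_problem(os.environ)
    print(problem or "HF_TOKEN is present")
```

This check cannot tell whether the token is valid or whether the model licenses were accepted; only a real download from Hugging Face confirms that.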

CLI users should also review the scratch directory and Demucs model weights notes before the first run.


Architecture at a glance

Both entry points are thin wrappers around one shared pipeline core:

flowchart TD
    CLI["CLI — audio-refinery<br/>one-shot / batch over a directory"]
    SVC["Service — HTTP API<br/>async jobs, URI-in / URI-out"]
    CLI --> CORE
    SVC --> CORE
    subgraph CORE [core pipeline]
        direction LR
        SEP[separate] --> DIA[diarize] --> TRX[transcribe] --> SEN["sentiment (optional)"]
    end

The CLI loads models per invocation; the service loads them once at container startup and keeps them resident across jobs. See docs/architecture.md for the full design, model selection rationale, and data model.
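The shared-core design can be sketched as an ordered list of stages that either entry point runs. The code below is a toy model under stated assumptions, not the project's code: `make_stage`, `build_pipeline`, and `run` are hypothetical names, and each stage only records its name where a real one would run a model.

```python
from typing import Callable, Dict, List

Stage = Callable[[Dict], Dict]


def make_stage(name: str) -> Stage:
    """Build a placeholder stage that records its name on the job record."""
    def stage(record: Dict) -> Dict:
        return {**record, "stages": record.get("stages", []) + [name]}
    return stage


def build_pipeline(with_sentiment: bool = False) -> List[Stage]:
    """Assemble the stages in the order shown in the diagram above."""
    stages = [make_stage("separate"), make_stage("diarize"), make_stage("transcribe")]
    if with_sentiment:  # sentiment analysis is the optional final stage
        stages.append(make_stage("sentiment"))
    return stages


def run(stages: List[Stage], record: Dict) -> Dict:
    """Pass one job record through every stage in order."""
    for stage in stages:
        record = stage(record)
    return record


if __name__ == "__main__":
    pipeline = build_pipeline(with_sentiment=True)  # built once, reused per job
    for path in ["a.wav", "b.wav"]:
        print(run(pipeline, {"path": path}))
```

In this picture, a CLI-style caller would call `build_pipeline` on every invocation, while a service-style caller would call it once at startup and reuse the result for each job.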


Documentation

| Document | Description |
| --- | --- |
| Index | Navigation hub for all documentation |
| CLI Reference | Every command, flag, and example for workstation use |
| Service Guide | HTTP API, container deployment, env vars, ops, troubleshooting |
| Architecture | Ghost Track pipeline design, model selection rationale, data model |
| Use Cases | Who uses this and for what |
| Performance | Throughput benchmarks, scaling options, optimization guide |
| Deployment | Production patterns, async workers, Docker, monitoring |
| Development | Dev setup, testing, contributing, release process |

Development

uv venv --python 3.11.14
source .venv/bin/activate

# Install all deps including whisperx, CUDA torch, dev tools, and pre-commit hooks
make dev-setup

# Run unit tests (no GPU required)
make test

# Run integration tests (requires GPU, HF_TOKEN, and test audio)
make test-integration

# Lint and format
make lint
make format

See docs/development.md for the full developer guide, including how to run the service locally.


License & Dependencies

audio-refinery is released under the MIT License.

Dependency note: The Pyannote model weights (pyannote/speaker-diarization-3.1 and pyannote/segmentation-3.0) are gated on Hugging Face under separate terms. If you run this tool in a commercial data product, verify that your Hugging Face account's accepted terms cover your use case. The MIT license on this software does not extend to the model weights — those are governed by their respective Hugging Face model cards.
