This project uses deep learning to analyze audio files and detect AI-generated content: given an `.mp3` or `.wav` file, it determines whether the audio is AI-generated or real.
- Install uv if you haven't already:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
# or
pip install uv
```

- Create and activate a virtual environment with uv:

```bash
uv venv
source .venv/bin/activate  # On Unix/macOS
# or
.venv\Scripts\activate  # On Windows
```

- Install the package in development mode:

```bash
uv pip install -e .
```

Alternatively, with pip:

- Create and activate a virtual environment:

```bash
python -m venv .venv
source .venv/bin/activate  # On Unix/macOS
# or
.venv\Scripts\activate  # On Windows
```

- Install the package in development mode:

```bash
pip install -e .
```

This will install the `audio-processing-ai` package and all its dependencies.
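As a quick sanity check after installation (a sketch; the module name `audio_processing_ai` is inferred from the `src/` layout):

```bash
# Confirm the package imports from the active virtual environment
python -c "import audio_processing_ai; print('install ok')"
```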
For development setup, see CONTRIBUTING.md.
```bash
# Clone the repository
git clone https://github.com/yourusername/audio-processing-ai.git
cd audio-processing-ai

# Switch to main branch (if not already there)
git checkout main-copy

# Create virtual environment and install dev dependencies
uv venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest tests/ -v
```

For the training step, I fine-tuned against the 16 kHz inference model linked in the [audioset_tagging_cnn README](https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/master/README.md).
For the example here, I set up a data folder at the top level with `data/train/ai` and `data/train/real` and placed the `.mp3` and `.wav` files that I want to fine-tune against in them. I got the real data from [FMA](https://github.com/mdeff/fma) for testing, and the AI-generated data from Facebook's MusicGen. The word "ai" must appear in the path to the AI folders and "real" in the path to the real songs.
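For example, a layout like the following satisfies that naming requirement (the source paths below are placeholders):

```bash
# Create the expected folders and copy audio in; "ai" and "real" appear in the paths
mkdir -p data/train/ai data/train/real
cp path/to/musicgen_clips/*.wav data/train/ai/    # AI-generated audio (illustrative path)
cp path/to/fma_clips/*.mp3 data/train/real/       # real audio from FMA (illustrative path)
```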
NOTE: In `model/pretrained/cnn14.py`, I'm hardcoding the path to `model/pretrained/pretrained_models/Cnn14_16k_mAP=0.438.pth.gz`; this will have to change in the future. Cnn14 only takes in gzipped files, so gzip your checkpoint beforehand.
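A minimal sketch of that gzip step, assuming you have already downloaded `Cnn14_16k_mAP=0.438.pth` from the link above into `model/pretrained/pretrained_models/`:

```bash
# Gzip the checkpoint; -k keeps the original .pth alongside the new .pth.gz
gzip -k model/pretrained/pretrained_models/Cnn14_16k_mAP=0.438.pth
```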
Steps:
- First, place your files in `data/train/` (if you are going to fine-tune data against your model). All AI files should go in `data/train/ai` and all real files in `data/train/real`; this is because we need labels for supervised training of the classifier that decides which files are AI music and which are real.
- Figure out the model you are going to fine-tune against.
- Update the line `PRETRAINED_MODEL_PATH = 'model/pretrained/pretrained_models/Cnn14_16k_mAP=0.438.pth.gz'` in `cnn14.py` to the `.pth.gz` file location of your choice.
To train the model:
```bash
python train.py \
    --num-epochs 5 \
    --dataFolder data/train/ \
    --savedPath model/saved_models/your_model.pth \
    [--resume-from path/to/checkpoint.pth]  # Optional: resume from a checkpoint
```

Required arguments:
- `--savedPath`: Path where the model will be saved (must end in `.pth`)
- `--dataFolder`: Directory containing training data (default: `data/train/`)
- `--num-epochs`: Number of training epochs (default: 5)

Optional arguments:
- `--resume-from`: Path to a checkpoint to resume training from
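For example, to resume from a previously saved checkpoint (both paths are illustrative):

```bash
python train.py \
    --num-epochs 5 \
    --dataFolder data/train/ \
    --savedPath model/saved_models/your_model_v2.pth \
    --resume-from model/saved_models/your_model.pth
```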
Place your audio files in the `data/predict/` folder, then run predictions:

```bash
python predict.py \
    --folder data/predict \
    --model model/saved_models/your_model.pth \
    --output results
```

Or specify any folder containing audio files:

```bash
python predict.py \
    --folder path/to/your/audio/files \
    --model model/saved_models/your_model.pth \
    --output results
```

Required arguments:
- `--folder`: Directory containing `.mp3`/`.wav`/`.flac`/`.m4a` files to analyze
- `--model`: Path to your trained model (`.pth` file)
- `--output`: Output directory for CSV and Excel results
The script will:
- Process each audio file in the specified folder
- Generate predictions for AI-generated content and audio scene tags
- Save results to a CSV file named `predictions_YYYYMMDD_HHMM.csv` and an Excel file with summary statistics
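To spot-check the output after a run (the directory and filename pattern are taken from the example above):

```bash
# List the generated result files and preview the CSV header plus a few rows
ls results/
head -n 5 results/predictions_*.csv
```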
To evaluate your trained model on a test set, use the evaluation_pipeline.py script. This will generate comprehensive metrics including ROC curves, confusion matrices, and threshold analysis.
Prerequisites:
- Your data should be split into train/val/test folders
- The test folder should contain `real/` and `ai/` subfolders with audio files
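A minimal sketch of that layout, assuming the `data/split/` path used in the example below (only `test/` is required to have the `ai/` and `real/` subfolders per the prerequisites):

```bash
# Create the split skeleton; populate test/ai and test/real with audio files
mkdir -p data/split/train data/split/val data/split/test/ai data/split/test/real
```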
Example evaluation:
```bash
python evaluation_pipeline.py \
    --model model/saved_models/your_model.pth \
    --data-split data/split/ \
    --output-dir evaluation_results
```

The evaluation will generate:
- `evaluation_plots.png`: ROC curve, logit distributions, threshold analysis, and confusion matrix
- `evaluation_thresholds.csv`: Performance metrics across different thresholds
- `evaluation_report.txt`: Detailed text report with all metrics
Example Output:
See `sample_runs/eval_sample_run/` for an example of evaluation results. This directory contains:
- `evaluation_plots.png`: Visual performance metrics
- `evaluation_report.txt`: Detailed evaluation report
- `evaluation_thresholds.csv`: Threshold analysis data
Required arguments:
- `--model`: Path to trained model (`.pth` file)
- `--data-split`: Path to split data folder (containing train/val/test subdirs)
Optional arguments:
- `--output-dir`: Output directory for results (default: auto-generated with timestamp)
- `--seed`: Random seed for reproducibility (default: 42)
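For instance, to pin both options explicitly (the output directory name is illustrative; `42` is just the default seed spelled out):

```bash
python evaluation_pipeline.py \
    --model model/saved_models/your_model.pth \
    --data-split data/split/ \
    --output-dir evaluation_results_seed42 \
    --seed 42
```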
The project includes a Gradio web interface for interactive AI audio detection. You can run it locally or deploy it to Modal.
To run the Gradio app locally with a PyTorch model:
```bash
python gradio_app.py \
    --model model/saved_models/your_model.pth \
    [--threshold 0.35] \
    [--onnx]
```

To run with an ONNX model:

```bash
python gradio_app.py \
    --model model/saved_models/your_model.onnx \
    --onnx \
    [--threshold 0.35]
```

Required arguments:
- `--model`: Path to your trained model (`.pth` for PyTorch or `.onnx` for ONNX)

Optional arguments:
- `--threshold`: Threshold for AI detection (default: 0.35)
- `--onnx`: Use ONNX model instead of PyTorch (required if the model is `.onnx`)
The app will launch at http://127.0.0.1:7860 by default.
Features:
- Single File Upload: Upload one audio file (.mp3, .wav, .flac, .m4a) for instant prediction
- Batch Processing: Upload a ZIP file containing multiple audio files
- Results include AI detection confidence and music analysis (genre, mood, tempo, energy)
- Batch processing shows interactive tables, charts, and summary statistics
- Download full results as CSV/Excel for further analysis
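For the batch mode, a ZIP can be assembled from the command line (the file locations below are illustrative):

```bash
# Bundle audio files into a single ZIP for the batch-processing upload
zip my_batch.zip data/predict/*.mp3 data/predict/*.wav
```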
The Gradio app can be deployed to Modal for cloud hosting. The deployed app uses ONNX models for optimized inference.
Prerequisites:
- Install Modal CLI:
```bash
pip install modal
```

- Authenticate with Modal:

```bash
modal token new
```

- Convert your PyTorch model to ONNX (if not already done):

```bash
python scripts/setup_modal.py \
    --pth-path model/saved_models/your_model.pth \
    --onnx-path model/saved_models/your_model.onnx
```

- Upload the ONNX model to the Modal volume:

```bash
modal volume put ai-audio-models model/saved_models/your_model.onnx model.onnx
```

Deploy to Modal:

```bash
modal deploy gradio_app.py
```

Access the deployed app: once deployed, your app will be available at the URL shown in the Modal deploy output.
Note: The Modal deployment uses ONNX models for better performance and compatibility. Make sure you've uploaded your ONNX model to the ai-audio-models volume before deploying.
```
audio-processing-ai/
├── .github/
│ └── workflows/ # GitHub Actions CI/CD workflows
├── data/
│ ├── train/ # Training data
│ │ ├── ai/ # AI-generated audio files
│ │ └── real/ # Real audio files
│ └── predict/ # User audio files for prediction
├── model/
│ ├── pretrained/ # Pretrained model weights
│ │ └── pretrained_models/
│ └── saved_models/ # Trained model checkpoints
├── sample_runs/ # Example evaluation outputs
│ └── eval_sample_run/ # Sample evaluation results
├── src/
│ └── audio_processing_ai/ # Main package
│ ├── dataset/ # Dataset loading and processing utilities
│ ├── model/ # Model architecture and pretrained weights
│ ├── inference/ # Inference scripts and label files
│ └── scripts/ # Utility scripts (including threshold_sweep.py)
├── tests/ # Test files
├── train.py # Training script
├── predict.py # Prediction script
├── evaluation_pipeline.py # Model evaluation script
├── gradio_app.py # Gradio web interface (local and Modal deployment)
├── pyproject.toml # Package configuration
├── uv.lock # uv lock file (if using uv)
├── .pre-commit-config.yaml # Pre-commit hooks configuration
├── .gitignore # Git ignore rules
├── CONTRIBUTING.md # Contributing guidelines
├── CHANGELOG.md # Changelog
└── README.md # This file
```
- The project uses PyTorch for deep learning
- Audio processing is done using torchaudio and librosa
- Model architecture is based on CNN14 with dual-head classification
- Training data should be organized in the `data/train/` directory
- Prediction files can be placed in `data/predict/` for convenience
- Model checkpoints are saved in `model/saved_models/`
- Evaluation results are saved with timestamps for easy tracking
- The project is structured as a proper Python package following modern packaging standards
- All modules are organized under `src/audio_processing_ai/` for better code organization
- Uses `uv` for fast dependency management (recommended) or `pip` as an alternative
- Python 3.9+ is required for compatibility with all dependencies
- Includes comprehensive CI/CD with GitHub Actions for testing, linting, and deployment
- Pre-commit hooks ensure code quality and consistency
- Automated dependency updates and PyPI publishing workflows