jonasgunklach/PodcastAdDetectionModel

Podcast Ad Detection ML Pipeline

A complete machine learning pipeline for detecting advertisements in multilingual podcast episodes, optimized for iPhone deployment.

Overview

This pipeline:

  1. Downloads the newest 3-5 episodes from Apple Podcast URLs
  2. Compresses audio files to <100MB while maintaining quality
  3. Transcribes audio using Groq Whisper Large V3 Turbo ($0.04/hour)
  4. Identifies ad segments using GPT-5 Nano
  5. Trains a DistilBERT model for local ad detection
  6. Exports model to Core ML for iPhone deployment

Setup

Prerequisites

  • Python 3.8+
  • FFmpeg installed on system (brew install ffmpeg on macOS)
  • API keys for Groq (transcription) and OpenAI (ad detection)

Installation

  1. Install dependencies:
pip install -r requirements.txt
  2. Set up API keys (choose one method):

Option A: Using settings.json (Recommended for local development)

cp settings.json.example settings.json
# Then edit settings.json and add your API keys

Option B: Using environment variables

export GROQ_API_KEY='your-groq-api-key'
export OPENAI_API_KEY='your-openai-api-key'

Option C: Using .env file

Create a .env file in the project root:

GROQ_API_KEY=your-groq-api-key
OPENAI_API_KEY=your-openai-api-key

The config will check in this order: settings.json → .env → environment variables.
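
The lookup order above can be sketched roughly as follows. This is not the actual code in config.py: the helper names resolve_api_key and parse_env_file are hypothetical, and the sketch assumes settings.json is a flat JSON object whose keys match the environment variable names.

```python
import json
import os
from pathlib import Path


def parse_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    if not path.is_file():
        return values
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_api_key(name, project_root="."):
    """Return the first value found: settings.json, then .env, then the environment."""
    root = Path(project_root)

    # 1. settings.json (assumed here to be a flat {"KEY": "value"} object)
    settings_path = root / "settings.json"
    if settings_path.is_file():
        settings = json.loads(settings_path.read_text())
        if settings.get(name):
            return settings[name]

    # 2. .env file in the project root
    env_file_value = parse_env_file(root / ".env").get(name)
    if env_file_value:
        return env_file_value

    # 3. Process environment; None if the key is not set anywhere
    return os.environ.get(name)
```

With this shape, a key present in settings.json shadows the same key in .env or in the shell environment.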

Usage

Step 1: Download Episodes

Download and compress podcast episodes:

python download_episodes.py

This will:

  • Extract RSS feeds from Apple Podcast URLs
  • Download the newest 3-5 episodes per podcast
  • Compress audio to <100MB
  • Save to data/audio/
  • Track progress in data/metadata.json
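
One way to hit the size limit is to derive the audio bitrate from the episode length. The sketch below illustrates that arithmetic. It is an assumption about how compression could work, not a description of download_episodes.py. The function names, the 128 kbit/s cap, the 5% headroom, and the mono downmix are all illustrative choices.

```python
def target_bitrate_kbps(duration_seconds, max_size_mb=100, max_kbps=128):
    """Return an audio bitrate (kbit/s) that keeps the file under max_size_mb."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    # Size budget in bits, with ~5% headroom for container overhead.
    budget_bits = max_size_mb * 1024 * 1024 * 8 * 0.95
    kbps = int(budget_bits / duration_seconds / 1000)
    # Never go above max_kbps: short episodes do not need a higher bitrate.
    return min(max_kbps, kbps)


def ffmpeg_compress_args(src, dst, bitrate_kbps):
    """Build an FFmpeg argument list: drop video, downmix to mono, set audio bitrate."""
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", src,
        "-vn",                   # ignore cover art / video streams
        "-ac", "1",              # mono is usually enough for speech
        "-b:a", f"{bitrate_kbps}k",
        dst,
    ]
```

The argument list can be passed to subprocess.run(args, check=True). A one-hour episode gets the 128 kbit/s cap, while a three-hour episode is reduced to roughly 73 kbit/s to stay inside the budget.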

Step 2: Transcribe Audio

Transcribe downloaded episodes:

python transcribe_audio.py

This will:

  • Process episodes marked as transcribed: false in metadata
  • Use Groq Whisper Large V3 Turbo for multilingual transcription
  • Save transcripts with timestamps to data/transcripts/
  • Update metadata with transcript paths
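
The bookkeeping described above might look something like this. The metadata layout is an assumption: a top-level "episodes" list whose entries carry a "transcribed" flag and a "transcript_path" field. The real schema of data/metadata.json may differ, and the function names are hypothetical.

```python
import json
from pathlib import Path


def load_metadata(path="data/metadata.json"):
    """Read the episode-tracking file."""
    return json.loads(Path(path).read_text())


def save_metadata(metadata, path="data/metadata.json"):
    """Write the episode-tracking file back to disk."""
    Path(path).write_text(json.dumps(metadata, indent=2))


def pending_transcriptions(metadata):
    """Return episodes not yet transcribed (a missing flag counts as pending)."""
    return [
        episode
        for episode in metadata.get("episodes", [])
        if not episode.get("transcribed", False)
    ]


def mark_transcribed(episode, transcript_path):
    """Record the transcript location and flip the flag, in place."""
    episode["transcribed"] = True
    episode["transcript_path"] = str(transcript_path)
```

Saving the metadata after each episode, rather than once at the end, lets an interrupted run resume without paying to transcribe the same audio twice.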

Step 3: Detect Ads

Identify advertisement segments in transcripts:

python detect_ads.py

This will:

  • Use GPT-5 Nano to analyze transcripts
  • Identify ad segments with confidence scores
  • Save ad segments to data/ads/
  • Update metadata with ad detection status
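
Confidence scores make it possible to post-process the detected segments before using them as labels. The sketch below assumes each segment is a dict with "start", "end" (in seconds), and "confidence" keys. That schema, the threshold, and the merge gap are illustrative assumptions, not values taken from detect_ads.py.

```python
def clean_ad_segments(segments, min_confidence=0.5, max_gap=1.0):
    """Drop low-confidence segments, then merge segments that touch or nearly touch."""
    kept = sorted(
        (seg for seg in segments if seg["confidence"] >= min_confidence),
        key=lambda seg: seg["start"],
    )
    merged = []
    for seg in kept:
        if merged and seg["start"] - merged[-1]["end"] <= max_gap:
            # Close enough to the previous segment: extend it instead of adding a new one.
            last = merged[-1]
            last["end"] = max(last["end"], seg["end"])
            last["confidence"] = max(last["confidence"], seg["confidence"])
        else:
            merged.append(dict(seg))  # copy so the input list is left untouched
    return merged
```

For example, two segments covering 0-30s and 30.5-60s collapse into a single 0-60s ad break, while a segment with confidence 0.2 is discarded.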

Step 4: Train Model

Train DistilBERT model for ad detection:

python train_model.py

This will:

  • Prepare training data from transcripts and ad labels
  • Train DistilBERT multilingual model
  • Evaluate on test set
  • Save model to models/ad_detector/
  • Export to Core ML format for iPhone
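
Preparing training data from transcripts and ad labels amounts to assigning each timestamped transcript segment a binary label. A minimal sketch of one way to do that is shown below. It assumes transcript segments are dicts with "start", "end", and "text", that ad segments are non-overlapping dicts with "start" and "end", and that a segment counts as an ad when at least half of it falls inside ad time. None of these details are confirmed by train_model.py.

```python
def label_segments(transcript_segments, ad_segments, min_overlap=0.5):
    """Turn timestamped transcript segments into {"text", "label"} examples (1 = ad)."""
    examples = []
    for seg in transcript_segments:
        duration = seg["end"] - seg["start"]
        # Total time this segment shares with any ad segment
        # (assumes ad segments do not overlap each other).
        overlap = sum(
            max(0.0, min(seg["end"], ad["end"]) - max(seg["start"], ad["start"]))
            for ad in ad_segments
        )
        is_ad = duration > 0 and overlap / duration >= min_overlap
        examples.append({"text": seg["text"], "label": int(is_ad)})
    return examples
```

The resulting list of text/label pairs is the kind of input a sequence classifier such as DistilBERT is typically fine-tuned on.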

Cost Estimates

Per 1-hour episode:

  • Transcription (Groq): $0.04
  • Ad Detection (GPT-5 Nano): ~$0.00075
  • Total: ~$0.041 per episode

For 100 one-hour episodes: ~$4.08
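
The arithmetic behind these figures can be reproduced with a few lines of Python. The two rates come from the list above. The function name is hypothetical, and treating the ad detection cost as scaling linearly with episode length is a simplifying assumption.

```python
TRANSCRIPTION_USD_PER_HOUR = 0.04      # Groq Whisper Large V3 Turbo
AD_DETECTION_USD_PER_HOUR = 0.00075    # GPT-5 Nano, per 1-hour episode


def estimate_cost_usd(episode_count, hours_per_episode=1.0):
    """Estimate total API cost, assuming both costs scale with audio length."""
    per_hour = TRANSCRIPTION_USD_PER_HOUR + AD_DETECTION_USD_PER_HOUR
    return episode_count * hours_per_episode * per_hour


if __name__ == "__main__":
    print(f"1 episode:    ${estimate_cost_usd(1):.5f}")
    print(f"100 episodes: ${estimate_cost_usd(100):.3f}")
```

For 100 one-hour episodes this works out to about $4.08 before the per-episode total is rounded.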

Project Structure

CleanCastMLModel/
├── podcastURLs.py              # Apple Podcast URLs by category
├── download_episodes.py        # Download and compress episodes
├── transcribe_audio.py         # Transcribe using Groq Whisper
├── detect_ads.py               # Detect ads using GPT-5 Nano
├── train_model.py              # Train DistilBERT model
├── config.py                   # Configuration and API keys
├── requirements.txt            # Python dependencies
├── README.md                   # This file
├── optimization_recommendations.md  # Optimization strategies
├── data/
│   ├── metadata.json           # Episode tracking
│   ├── audio/                  # Downloaded audio files
│   ├── transcripts/            # Transcription JSON files
│   └── ads/                    # Ad segment JSON files
└── models/
    └── ad_detector/            # Trained model files

Configuration

Edit config.py to adjust:

  • EPISODES_PER_PODCAST: Number of episodes to download (default: 5)
  • MAX_AUDIO_SIZE_MB: Maximum audio file size (default: 100MB)
  • GROQ_MODEL: Whisper model name (default: whisper-large-v3-turbo)
  • OPENAI_MODEL: GPT model for ad detection (default: gpt-5-nano)
  • Training hyperparameters

Optimization

See optimization_recommendations.md for strategies to:

  • Reduce transcription costs (80% savings possible)
  • Optimize model for iPhone deployment
  • Use audio features to skip full transcription
  • Implement hybrid approaches

iPhone Deployment

The trained model is exported to Core ML format (.mlpackage) for iPhone deployment. The model:

  • Runs locally on-device
  • Processes audio/text in real-time
  • Incurs no API costs after training
  • Works offline

Troubleshooting

FFmpeg not found

Install FFmpeg:

brew install ffmpeg  # macOS
# or
apt-get install ffmpeg  # Linux

API Key Errors

Make sure API keys are set:

echo $GROQ_API_KEY
echo $OPENAI_API_KEY

Out of Memory during Training

Reduce BATCH_SIZE in config.py or use a machine with more RAM/GPU.

License

MIT License

Contributing

Contributions welcome! Please open issues for bugs or feature requests.

About

AI Ad Detection Pipeline & Model for Podcasts / Audio
