A complete machine learning pipeline for detecting advertisements in multilingual podcast episodes, optimized for iPhone deployment.
This pipeline:
- Downloads the newest 3-5 episodes from Apple Podcast URLs
- Compresses audio files to <100MB while maintaining quality
- Transcribes audio using Groq Whisper Large V3 Turbo ($0.04/hour)
- Identifies ad segments using GPT-5 Nano
- Trains a DistilBERT model for local ad detection
- Exports model to Core ML for iPhone deployment
- Python 3.8+
- FFmpeg installed on system (`brew install ffmpeg` on macOS)
- API Keys:
  - Groq API key (get from https://console.groq.com)
  - OpenAI API key (for GPT-5 Nano)
- Install dependencies:

      pip install -r requirements.txt

- Set up API keys (choose one method):

Option A: Using settings.json (Recommended for local development)

    cp settings.json.example settings.json
    # Then edit settings.json and add your API keys

Option B: Using environment variables

    export GROQ_API_KEY='your-groq-api-key'
    export OPENAI_API_KEY='your-openai-api-key'

Option C: Using .env file

Create a .env file in the project root:

    GROQ_API_KEY=your-groq-api-key
    OPENAI_API_KEY=your-openai-api-key

The config will check in this order: settings.json → .env → environment variables.
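The lookup order above can be sketched roughly as follows. This is a minimal illustration, not the actual contents of config.py: the helper names `parse_env_file` and `resolve_key` are hypothetical, and it assumes settings.json is a flat JSON object keyed by the same names as the environment variables.

```python
import json
import os
from pathlib import Path
from typing import Dict, Optional


def parse_env_file(path: Path) -> Dict[str, str]:
    """Parse simple KEY=value lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    if not path.exists():
        return values
    for line in path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop optional surrounding quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def resolve_key(name: str, root: Path = Path(".")) -> Optional[str]:
    """Return the first non-empty value: settings.json, then .env, then os.environ."""
    settings_path = root / "settings.json"
    if settings_path.exists():
        # Assumption: settings.json is a flat object such as {"GROQ_API_KEY": "..."}.
        settings = json.loads(settings_path.read_text())
        if isinstance(settings, dict) and settings.get(name):
            return settings[name]
    env_values = parse_env_file(root / ".env")
    if env_values.get(name):
        return env_values[name]
    return os.environ.get(name) or None
```

With this ordering, a key present in settings.json wins over the same key in .env or in the shell environment, which is convenient for local development but worth remembering when a stale settings.json is lying around.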
Download and compress podcast episodes:

    python download_episodes.py

This will:
- Extract RSS feeds from Apple Podcast URLs
- Download the newest 3-5 episodes per podcast
- Compress audio to <100MB
- Save to `data/audio/`
- Track progress in `data/metadata.json`
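One way to keep a file under the size limit is to derive the audio bitrate from the episode duration. The sketch below is an illustration under assumptions, not the logic in download_episodes.py: the function names, the 5% safety margin, the 128 kbps ceiling, and the choice of mono output are all hypothetical. The FFmpeg flags themselves (`-y`, `-i`, `-vn`, `-ac`, `-b:a`) are standard.

```python
from typing import List

MAX_AUDIO_SIZE_MB = 100  # default named in this README


def target_bitrate_kbps(duration_seconds: float,
                        max_size_mb: float = MAX_AUDIO_SIZE_MB,
                        safety_margin: float = 0.95,
                        ceiling_kbps: int = 128) -> int:
    """Pick a bitrate (kbps) so that duration * bitrate stays under the size budget."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    # Size budget in bits, leaving a margin for container overhead (assumed 5%).
    budget_bits = max_size_mb * 1024 * 1024 * 8 * safety_margin
    kbps = int(budget_bits / duration_seconds / 1000)
    # Never exceed the ceiling; short episodes do not need more than that.
    return max(1, min(kbps, ceiling_kbps))


def ffmpeg_compress_command(src: str, dst: str, bitrate_kbps: int) -> List[str]:
    """Build an FFmpeg argument list that re-encodes audio at the given bitrate."""
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", src,               # input file
        "-vn",                   # drop any video/cover-art stream
        "-ac", "1",              # mono (assumption: speech content)
        "-b:a", f"{bitrate_kbps}k",
        dst,
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)`. A one-hour episode fits comfortably at the ceiling, so only multi-hour episodes end up with a reduced bitrate.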
Transcribe downloaded episodes:

    python transcribe_audio.py

This will:
- Process episodes marked as `transcribed: false` in metadata
- Use Groq Whisper Large V3 Turbo for multilingual transcription
- Save transcripts with timestamps to `data/transcripts/`
- Update metadata with transcript paths
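The selection and bookkeeping steps above might look something like this. The schema is an assumption made for illustration: only the `transcribed` flag is taken from this README, while the top-level `episodes` list, the `transcript_path` field, and both function names are hypothetical.

```python
from typing import Dict, List


def pending_episodes(metadata: Dict) -> List[Dict]:
    """Return episodes that have not been transcribed yet.

    Assumes metadata looks like {"episodes": [{"transcribed": false, ...}, ...]}.
    A missing flag is treated as "not transcribed".
    """
    return [ep for ep in metadata.get("episodes", [])
            if not ep.get("transcribed", False)]


def mark_transcribed(episode: Dict, transcript_path: str) -> None:
    """Record that an episode was transcribed and where its transcript lives."""
    episode["transcribed"] = True
    episode["transcript_path"] = transcript_path  # hypothetical field name
```

Because each episode is marked as soon as its transcript is saved, re-running the script after an interruption would only pick up the episodes still flagged as untranscribed.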
Identify advertisement segments in transcripts:

    python detect_ads.py

This will:
- Use GPT-5 Nano to analyze transcripts
- Identify ad segments with confidence scores
- Save ad segments to `data/ads/`
- Update metadata with ad detection status
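Model output is worth validating before it is saved as training labels. The sketch below shows one possible check, assuming the model is asked to reply with a JSON object of the form `{"ads": [{"start": ..., "end": ..., "confidence": ...}]}` with times in seconds. That response shape, the function name, and the 0.5 default threshold are hypothetical and are not taken from detect_ads.py.

```python
import json
from typing import Dict, List


def parse_ad_segments(raw_response: str, min_confidence: float = 0.5) -> List[Dict]:
    """Parse a JSON reply and keep only well-formed, sufficiently confident segments."""
    data = json.loads(raw_response)
    if not isinstance(data, dict):
        return []
    segments: List[Dict] = []
    for item in data.get("ads", []):
        try:
            start = float(item["start"])
            end = float(item["end"])
            confidence = float(item["confidence"])
        except (KeyError, TypeError, ValueError):
            continue  # skip entries with missing or non-numeric fields
        # Keep segments with positive duration and a confidence in [min_confidence, 1].
        if start < end and min_confidence <= confidence <= 1.0:
            segments.append({"start": start, "end": end, "confidence": confidence})
    return sorted(segments, key=lambda s: s["start"])
```

Dropping low-confidence segments trades recall for cleaner labels; a lower threshold may be preferable if the downstream classifier is expected to tolerate some label noise.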
Train DistilBERT model for ad detection:

    python train_model.py

This will:
- Prepare training data from transcripts and ad labels
- Train DistilBERT multilingual model
- Evaluate on test set
- Save model to `models/ad_detector/`
- Export to Core ML format for iPhone
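Preparing training data from transcripts and ad labels could be done by labeling each timestamped transcript segment according to how much of it falls inside an ad. The following is a minimal sketch of that idea, not the code in train_model.py: the segment fields (`start`, `end`, `text`), the function names, and the 50% overlap rule are assumptions.

```python
from typing import Dict, List, Tuple


def overlap_seconds(a_start: float, a_end: float, b_start: float, b_end: float) -> float:
    """Length of the intersection of two time intervals, or 0 if they are disjoint."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_segments(transcript_segments: List[Dict],
                   ad_segments: List[Dict],
                   min_overlap_ratio: float = 0.5) -> List[Tuple[str, int]]:
    """Return (text, label) pairs where label 1 means ad and 0 means regular content.

    Assumes ad segments do not overlap each other; overlapping ads would be
    counted twice in the coverage sum.
    """
    examples: List[Tuple[str, int]] = []
    for seg in transcript_segments:
        duration = seg["end"] - seg["start"]
        if duration <= 0:
            continue  # skip malformed or zero-length segments
        covered = sum(
            overlap_seconds(seg["start"], seg["end"], ad["start"], ad["end"])
            for ad in ad_segments
        )
        label = 1 if covered / duration >= min_overlap_ratio else 0
        examples.append((seg["text"], label))
    return examples
```

The resulting pairs can be tokenized and fed to a sequence classifier. Since ads usually make up a small share of an episode, the classes are likely to be imbalanced, which is worth keeping in mind when reading test-set accuracy.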
Per 1-hour episode:
- Transcription (Groq): $0.04
- Ad Detection (GPT-5 Nano): ~$0.00075
- Total: ~$0.041 per episode
For 100 episodes: ~$4.10
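The arithmetic behind these estimates can be reproduced with a small helper. The rates are the ones quoted above; the function name is hypothetical, and scaling the ad detection cost linearly with episode length is an assumption, since this README only gives a per-hour figure.

```python
TRANSCRIPTION_USD_PER_HOUR = 0.04      # Groq Whisper Large V3 Turbo, from this README
AD_DETECTION_USD_PER_HOUR = 0.00075    # GPT-5 Nano, approximate, from this README


def pipeline_cost_usd(episodes: int, hours_per_episode: float = 1.0) -> float:
    """Estimate API cost for transcribing and labeling a batch of episodes."""
    per_episode = (TRANSCRIPTION_USD_PER_HOUR + AD_DETECTION_USD_PER_HOUR) * hours_per_episode
    return episodes * per_episode
```

For a single one-hour episode this gives $0.04075, which rounds to the ~$0.041 quoted above. For 100 such episodes it gives $4.075; the ~$4.10 figure comes from multiplying the already rounded per-episode value.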
CleanCastMLModel/
├── podcastURLs.py # Apple Podcast URLs by category
├── download_episodes.py # Download and compress episodes
├── transcribe_audio.py # Transcribe using Groq Whisper
├── detect_ads.py # Detect ads using GPT-5 Nano
├── train_model.py # Train DistilBERT model
├── config.py # Configuration and API keys
├── requirements.txt # Python dependencies
├── README.md # This file
├── optimization_recommendations.md # Optimization strategies
├── data/
│ ├── metadata.json # Episode tracking
│ ├── audio/ # Downloaded audio files
│ ├── transcripts/ # Transcription JSON files
│ └── ads/ # Ad segment JSON files
└── models/
    └── ad_detector/ # Trained model files
Edit config.py to adjust:
- `EPISODES_PER_PODCAST`: Number of episodes to download (default: 5)
- `MAX_AUDIO_SIZE_MB`: Maximum audio file size (default: 100MB)
- `GROQ_MODEL`: Whisper model name (default: whisper-large-v3-turbo)
- `OPENAI_MODEL`: GPT model for ad detection (default: gpt-5-nano)
- Training hyperparameters
See optimization_recommendations.md for strategies to:
- Reduce transcription costs (80% savings possible)
- Optimize model for iPhone deployment
- Use audio features to skip full transcription
- Implement hybrid approaches
The trained model is exported to Core ML format (.mlpackage) for iPhone deployment. The model:
- Runs locally on-device
- Processes audio/text in real time
- Incurs no API costs after training
- Works offline
Install FFmpeg:

    brew install ffmpeg      # macOS
    # or
    apt-get install ffmpeg   # Linux

Make sure API keys are set:

    echo $GROQ_API_KEY
    echo $OPENAI_API_KEY

If training runs out of memory, reduce `BATCH_SIZE` in config.py or use a machine with more RAM/GPU.
MIT License
Contributions welcome! Please open issues for bugs or feature requests.