A comprehensive machine learning pipeline for classifying and analyzing news articles from Heise Online, Germany's leading technology news platform.
├── data/
│ ├── raw/ # Raw scraped articles (SQLite database)
│ ├── processed/ # Preprocessed text data
│ └── embeddings/ # Generated embeddings for ML models
├── ingestion/
│ └── ingest_heise.py # Async web scraper for Heise news archive
├── preprocessing/
│ └── preprocess.py # Text cleaning, writes JSONL output
├── feature_store/
│ └── load_article_features.py # Load derived features into SQLite
├── embedding/
│ ├── embed.py # Text embedding generation
│ └── build_faiss.py # Build FAISS index from embeddings
├── labeling/
│ ├── weak_rules.py # Rule-based labeling system
│ └── llm_labeler.py # LLM-powered labeling for training data
├── model/
│ ├── train.py # Model training pipeline
│ └── predict.py # Inference and prediction
├── active_learning/
│ └── select_samples.py # Active learning sample selection
├── setup_pipeline_db.py # Database setup and schema migrations
├── api/
│ └── app.py # REST API for model serving
├── ui/
│ └── streamlit_app.py # Web interface for classification
└── README.md
Key features:
- Asynchronous Web Scraping: Concurrent scraping of Heise news archives using asyncio and aiohttp
- Robust Data Extraction: Extracts headlines, timestamps, and URLs from complex HTML structures
- SQLite Storage: Efficient local database storage with deduplication
- Modular Pipeline: Clean separation of concerns for data ingestion, processing, and modeling
- Active Learning: Intelligent sample selection for efficient model training
- Hybrid Labeling: Combines rule-based and LLM-based approaches for data labeling
- Web API: RESTful API for model inference
- Interactive UI: Streamlit-based web interface for easy interaction
The scraper (ingestion/ingest_heise.py) fetches articles from Heise's monthly archives (a minimal code sketch follows the feature list below):
cd /home/jan/heise-classification
python3 ingestion/ingest_heise.py

Features:
- Scrapes all 12 months of both 2024 and 2025 concurrently (24 months total)
- Extracts structured data: headline, timestamp (HH:MM DD-MM-YYYY), URL, unique ID
- Handles missing pages gracefully (future months)
- Stores data in SQLite with automatic deduplication
- Comprehensive logging and progress tracking
- Concurrent processing for optimal performance
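A minimal sketch of the concurrent fetch loop, assuming aiohttp, asyncio, and beautifulsoup4 as listed under Dependencies below (the archive URL pattern and CSS selector are illustrative assumptions, not the script's actual values):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

# Assumed archive URL pattern; verify against the live site.
ARCHIVE_URL = "https://www.heise.de/newsticker/archiv/{year}/{month:02d}"

async def fetch_month(session: aiohttp.ClientSession, year: int, month: int) -> list[str]:
    """Fetch one monthly archive page and extract its headlines."""
    async with session.get(ARCHIVE_URL.format(year=year, month=month)) as resp:
        if resp.status != 200:  # missing pages, e.g. future months
            return []
        html = await resp.text()
    soup = BeautifulSoup(html, "html.parser")
    # The selector is a placeholder; inspect the archive HTML for the real one.
    return [a.get_text(strip=True) for a in soup.select("article a")]

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_month(session, y, m) for y in (2024, 2025) for m in range(1, 13)]
        per_month = await asyncio.gather(*tasks)  # all 24 months in flight at once
    print(sum(len(h) for h in per_month), "headlines fetched")

asyncio.run(main())

asyncio.gather over a shared ClientSession is what makes the run concurrent rather than sequential.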
Sample Output:
2026-01-08 00:40:40,413 - INFO - Total articles in database: 19167
2026-01-08 00:40:40,413 - INFO - Articles by year:
2026-01-08 00:40:40,413 - INFO - Year 2024: 9093 articles
2026-01-08 00:40:40,413 - INFO - Year 2025: 10074 articles
Articles are stored in data/raw/heise_articles.db with the following structure:
CREATE TABLE articles (
id TEXT PRIMARY KEY, -- UUID v4 unique identifier
headline TEXT NOT NULL, -- Article headline/title
timestamp TEXT NOT NULL, -- Full publication timestamp (HH:MM DD-MM-YYYY format)
url TEXT NOT NULL, -- Full article URL
date_scraped TEXT NOT NULL -- ISO format scrape timestamp
);
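The schema itself does not enforce URL uniqueness, so deduplication presumably happens at insert time. A minimal sketch of one plausible check-before-insert approach (the scraper's actual logic may differ):

import sqlite3
import uuid
from datetime import datetime, timezone

conn = sqlite3.connect("data/raw/heise_articles.db")

def insert_article(headline: str, timestamp: str, url: str) -> None:
    """Insert an article unless its URL is already stored."""
    if conn.execute("SELECT 1 FROM articles WHERE url = ?", (url,)).fetchone() is None:
        conn.execute(
            "INSERT INTO articles (id, headline, timestamp, url, date_scraped) "
            "VALUES (?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), headline, timestamp, url,
             datetime.now(timezone.utc).isoformat()),
        )
        conn.commit()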
- Clone and setup:
  git clone <repository-url>
  cd heise-classification
  pip install aiohttp beautifulsoup4 sentence-transformers faiss-cpu scikit-learn numpy joblib
- Run data ingestion:
  python3 ingestion/ingest_heise.py
- Verify data:
  import sqlite3
  conn = sqlite3.connect('data/raw/heise_articles.db')
  cursor = conn.cursor()
  cursor.execute('SELECT COUNT(*) FROM articles')
  print(f"Total articles: {cursor.fetchone()[0]}")
  conn.close()
- Set up pipeline tables:
  python3 setup_pipeline_db.py --db-path data/raw/heise_articles.db
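The exact tables this script creates are project-specific; the usual pattern for such a setup script is idempotent CREATE TABLE IF NOT EXISTS migrations. A sketch with assumed, illustrative table names:

import sqlite3

conn = sqlite3.connect("data/raw/heise_articles.db")
# Hypothetical downstream tables; names and columns are assumptions.
conn.executescript("""
CREATE TABLE IF NOT EXISTS weak_labels (
    article_id TEXT PRIMARY KEY REFERENCES articles(id),
    label      TEXT NOT NULL,
    source     TEXT NOT NULL  -- e.g. 'rule' or 'llm'
);
CREATE TABLE IF NOT EXISTS predictions (
    article_id TEXT PRIMARY KEY REFERENCES articles(id),
    label      TEXT NOT NULL,
    confidence REAL NOT NULL
);
""")
conn.commit()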
- Run preprocessing and write JSONL:
  python3 preprocessing/preprocess.py --db-path data/raw/heise_articles.db \
      --output-path data/processed/processed_articles.jsonl
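The JSONL record layout is not documented here; conceptually, the step reads headlines from SQLite, normalizes the text, and writes one JSON object per line. A sketch with assumed field names:

import json
import sqlite3

conn = sqlite3.connect("data/raw/heise_articles.db")
rows = conn.execute("SELECT id, headline FROM articles").fetchall()

with open("data/processed/processed_articles.jsonl", "w", encoding="utf-8") as f:
    for article_id, headline in rows:
        text = " ".join(headline.split())   # collapse whitespace
        record = {
            "id": article_id,
            "text": text.lower(),           # assumed normalization
            "n_tokens": len(text.split()),  # assumed derived feature
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")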
- Load derived features into SQLite:
  python3 feature_store/load_article_features.py --db-path data/raw/heise_articles.db \
      --jsonl-path data/processed/processed_articles.jsonl
- Generate embeddings and build FAISS index:
  python3 embedding/embed.py --db-path data/raw/heise_articles.db
  python3 embedding/build_faiss.py --db-path data/raw/heise_articles.db
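These two scripts pair sentence-transformers (encoding) with faiss-cpu (indexing and search), both listed under Dependencies below. A minimal sketch; the model name, index path, and query are assumptions:

import sqlite3
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

conn = sqlite3.connect("data/raw/heise_articles.db")
headlines = [row[0] for row in conn.execute("SELECT headline FROM articles")]

# Any German-capable sentence-transformers model would do; this one is an assumption.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
vectors = model.encode(headlines, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine on normalized vectors
index.add(np.asarray(vectors, dtype="float32"))
faiss.write_index(index, "data/embeddings/headlines.faiss")

# Nearest-neighbour search for a query headline:
query = model.encode(["Sicherheitslücke in Windows entdeckt"], normalize_embeddings=True)
scores, neighbors = index.search(np.asarray(query, dtype="float32"), 5)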
- Create weak labels:
  python3 labeling/weak_rules.py --db-path data/raw/heise_articles.db
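A weak rule maps keyword hits in a headline to a label and abstains otherwise. The categories and keywords below are hypothetical examples, not the actual rule set in labeling/weak_rules.py:

# Hypothetical keyword rules for illustration only.
RULES = {
    "security": ("sicherheitslücke", "malware", "exploit", "hacker"),
    "ai":       ("künstliche intelligenz", "chatgpt", "llm", " ki "),
    "hardware": ("prozessor", "grafikkarte", "chip", "notebook"),
}

def weak_label(headline: str) -> str | None:
    """Return the first label whose keywords match, or None (abstain)."""
    text = f" {headline.lower()} "  # pad so ' ki ' matches whole words
    for label, keywords in RULES.items():
        if any(kw in text for kw in keywords):
            return label
    return None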
- Train baseline model and store predictions:
  python3 model/train.py --db-path data/raw/heise_articles.db
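Given scikit-learn and joblib in the dependency list, a plausible baseline is a TF-IDF plus logistic-regression pipeline. A sketch; the label names, model path, and the two inline training examples are stand-ins for the weak-labeled data:

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in training data; in the real pipeline this comes from the weak labels.
texts = ["Sicherheitslücke in Router entdeckt", "Neue GPU von Nvidia vorgestellt"]
labels = ["security", "hardware"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
joblib.dump(clf, "model/baseline.joblib")  # output path is an assumption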
- Select low-confidence samples for review:
  python3 active_learning/select_samples.py --db-path data/raw/heise_articles.db --threshold 0.6

Dependencies:
- aiohttp - Asynchronous HTTP client
- beautifulsoup4 - HTML parsing
- sqlite3 - Database storage (built-in Python)
- asyncio - Concurrent processing (built-in Python)
- logging - Structured logging (built-in Python)
- sentence-transformers - Embedding model
- faiss-cpu - Vector index
- scikit-learn - Baseline classifier
- numpy - Numerical processing
- joblib - Model serialization
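The --threshold 0.6 flag in the selection step above points to least-confidence sampling: articles whose top class probability falls below the threshold are queued for human review. A sketch, reusing the assumed model path from the training sketch:

import sqlite3
import joblib

clf = joblib.load("model/baseline.joblib")  # assumed path, see training sketch
conn = sqlite3.connect("data/raw/heise_articles.db")
rows = conn.execute("SELECT id, headline FROM articles").fetchall()

proba = clf.predict_proba([headline for _, headline in rows])
confidence = proba.max(axis=1)  # top-class probability per article

# Least-confidence sampling: everything the model is unsure about goes to review.
to_review = [article_id for (article_id, _), c in zip(rows, confidence) if c < 0.6]
print(f"{len(to_review)} articles below the 0.6 threshold")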
The project follows clean architecture principles with:
- Modular component design
- Comprehensive error handling
- Extensive logging
- Type hints for better code maintainability
- Async/await patterns for performance
By default, the scraper collects data from both 2024 and 2025. To customize this:
# Scrape only specific years
scraper = HeiseScraper(years=[2023, 2024])
# Scrape a single year
scraper = HeiseScraper(years=[2024])
# Default behavior (2024 and 2025)
scraper = HeiseScraper()

Roadmap:
- Add Streamlit labeling UI
- Add LLM suggestions for label assist
- Add API endpoints for embeddings and search
- Expand active learning strategies
This project is for educational and research purposes. Please respect Heise's terms of service when scraping their content.