Heise news classification pipeline with embeddings, weak labels and active learning.
Heise News Classification Pipeline

A comprehensive machine learning pipeline for classifying and analyzing news articles from Heise Online, Germany's leading technology news platform.

Project Structure

├── data/
│   ├── raw/           # Raw scraped articles (SQLite database)
│   ├── processed/     # Preprocessed text data
│   └── embeddings/    # Generated embeddings for ML models
├── ingestion/
│   └── ingest_heise.py    # Async web scraper for Heise news archive
├── preprocessing/
│   └── preprocess.py      # Text preprocessing; writes processed JSONL
├── feature_store/
│   └── load_article_features.py  # Load derived features into SQLite
├── embedding/
│   ├── embed.py       # Text embedding generation
│   └── build_faiss.py # Build FAISS index from embeddings
├── labeling/
│   ├── weak_rules.py      # Rule-based labeling system
│   └── llm_labeler.py     # LLM-powered labeling for training data
├── model/
│   ├── train.py       # Model training pipeline
│   └── predict.py     # Inference and prediction
├── active_learning/
│   └── select_samples.py  # Active learning sample selection
├── setup_pipeline_db.py   # Database setup and schema migrations
├── api/
│   └── app.py         # REST API for model serving
├── ui/
│   └── streamlit_app.py   # Web interface for classification
└── README.md

Features

  • Asynchronous Web Scraping: Concurrent scraping of Heise news archives using asyncio and aiohttp
  • Robust Data Extraction: Extracts headlines, timestamps, and URLs from complex HTML structures
  • SQLite Storage: Efficient local database storage with deduplication
  • Modular Pipeline: Clean separation of concerns for data ingestion, processing, and modeling
  • Active Learning: Intelligent sample selection for efficient model training
  • Multi-modal Labeling: Combines rule-based and LLM-based approaches for data labeling
  • Web API: RESTful API for model inference
  • Interactive UI: Streamlit-based web interface for easy interaction
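The multi-modal labeling idea can be illustrated with a minimal rule-based labeler. This is a sketch only: the label names and keyword patterns below are hypothetical, not the actual rules in labeling/weak_rules.py.

```python
import re
from typing import Optional

# Hypothetical keyword rules mapping labels to regex patterns.
# The real rules live in labeling/weak_rules.py; these are illustrative only.
RULES = {
    "security": re.compile(r"\b(sicherheitslücke|exploit|malware|patch)\b", re.IGNORECASE),
    "ai": re.compile(r"\b(ki|chatgpt|llm|machine learning)\b", re.IGNORECASE),
    "hardware": re.compile(r"\b(cpu|gpu|chip|prozessor)\b", re.IGNORECASE),
}

def weak_label(headline: str) -> Optional[str]:
    """Return the first matching label, or None to abstain."""
    for label, pattern in RULES.items():
        if pattern.search(headline):
            return label
    return None
```

Abstaining (returning None) on headlines no rule covers is what makes these labels "weak": the LLM labeler and human review can fill the gaps.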

Data Ingestion

The scraper (ingestion/ingest_heise.py) fetches articles from Heise's monthly archives:

cd heise-classification
python3 ingestion/ingest_heise.py

Features:

  • Scrapes all 12 months of both 2024 and 2025 concurrently (24 months total)
  • Extracts structured data: headline, timestamp (HH:MM DD-MM-YYYY), URL, unique ID
  • Handles missing pages gracefully (future months)
  • Stores data in SQLite with automatic deduplication
  • Comprehensive logging and progress tracking
  • Concurrent processing for optimal performance
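The concurrent month-by-month approach can be sketched with stdlib asyncio alone. The archive URL pattern below is an assumption, and the placeholder fetch stands in for the aiohttp request and HTML parsing done by the real scraper in ingestion/ingest_heise.py.

```python
import asyncio

# Assumed archive URL pattern; verify against the real scraper before relying on it.
ARCHIVE_URL = "https://www.heise.de/newsticker/archiv/{year}/{month:02d}"

async def fetch_month(year: int, month: int) -> str:
    # Placeholder for an aiohttp GET plus BeautifulSoup parsing.
    await asyncio.sleep(0)  # simulate I/O
    return ARCHIVE_URL.format(year=year, month=month)

async def scrape_all(years=(2024, 2025)) -> list:
    # One task per (year, month); asyncio.gather runs them concurrently
    # and returns results in task order.
    tasks = [fetch_month(y, m) for y in years for m in range(1, 13)]
    return await asyncio.gather(*tasks)

urls = asyncio.run(scrape_all())  # 24 archive URLs for 2024 and 2025
```

Because gather preserves task order, results line up with the (year, month) pairs even though the requests overlap in time.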

Sample Output:

2026-01-08 00:40:40,413 - INFO - Total articles in database: 19167
2026-01-08 00:40:40,413 - INFO - Articles by year:
2026-01-08 00:40:40,413 - INFO -   Year 2024: 9093 articles
2026-01-08 00:40:40,413 - INFO -   Year 2025: 10074 articles

Database Schema

Articles are stored in data/raw/heise_articles.db with the following structure:

CREATE TABLE articles (
    id TEXT PRIMARY KEY,           -- UUID v4 unique identifier
    headline TEXT NOT NULL,        -- Article headline/title
    timestamp TEXT NOT NULL,       -- Publication timestamp (HH:MM DD-MM-YYYY)
    url TEXT NOT NULL,             -- Full article URL
    date_scraped TEXT NOT NULL     -- ISO-format scrape timestamp
);
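One way the automatic deduplication can be handled is at the database level. The sketch below assumes the article URL is the deduplication key and uses a unique index with INSERT OR IGNORE; the actual mechanism lives in ingestion/ingest_heise.py and may differ.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")  # the pipeline uses data/raw/heise_articles.db
conn.execute("""
CREATE TABLE IF NOT EXISTS articles (
    id TEXT PRIMARY KEY,
    headline TEXT NOT NULL,
    timestamp TEXT NOT NULL,
    url TEXT NOT NULL,
    date_scraped TEXT NOT NULL
)""")
# A unique index on url lets INSERT OR IGNORE silently skip re-scraped articles.
conn.execute("CREATE UNIQUE INDEX IF NOT EXISTS idx_articles_url ON articles(url)")

def insert_article(headline: str, timestamp: str, url: str) -> None:
    conn.execute(
        "INSERT OR IGNORE INTO articles VALUES (?, ?, ?, ?, ?)",
        (str(uuid.uuid4()), headline, timestamp,
         url, datetime.now(timezone.utc).isoformat()),
    )

insert_article("Example headline", "09:30 08-01-2026", "https://www.heise.de/news/example.html")
insert_article("Example headline", "09:30 08-01-2026", "https://www.heise.de/news/example.html")
count = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]  # duplicate ignored
```

Pushing deduplication into SQLite keeps the scraper idempotent: re-running a month's ingestion cannot create duplicate rows.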

Getting Started

  1. Clone and setup:

    git clone <repository-url>
    cd heise-classification
    pip install aiohttp beautifulsoup4 sentence-transformers faiss-cpu scikit-learn numpy joblib
  2. Run data ingestion:

    python3 ingestion/ingest_heise.py
  3. Verify data:

    import sqlite3
    conn = sqlite3.connect('data/raw/heise_articles.db')
    cursor = conn.cursor()
    cursor.execute('SELECT COUNT(*) FROM articles')
    print(f"Total articles: {cursor.fetchone()[0]}")
  4. Set up pipeline tables:

    python3 setup_pipeline_db.py --db-path data/raw/heise_articles.db
  5. Run preprocessing and write JSONL:

    python3 preprocessing/preprocess.py --db-path data/raw/heise_articles.db \
      --output-path data/processed/processed_articles.jsonl
  6. Load derived features into SQLite:

    python3 feature_store/load_article_features.py --db-path data/raw/heise_articles.db \
      --jsonl-path data/processed/processed_articles.jsonl
  7. Generate embeddings and build FAISS index:

    python3 embedding/embed.py --db-path data/raw/heise_articles.db
    python3 embedding/build_faiss.py --db-path data/raw/heise_articles.db
  8. Create weak labels:

    python3 labeling/weak_rules.py --db-path data/raw/heise_articles.db
  9. Train baseline model and store predictions:

    python3 model/train.py --db-path data/raw/heise_articles.db
  10. Select low-confidence samples for review:

    python3 active_learning/select_samples.py --db-path data/raw/heise_articles.db \
      --threshold 0.6
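The low-confidence selection in step 10 can be sketched in a few lines. This is a simplification under the assumption that "confidence" means the top predicted-class probability; the actual criteria are in active_learning/select_samples.py.

```python
def select_uncertain(predictions, threshold=0.6):
    """Return ids of articles whose top class probability falls below threshold.

    predictions: iterable of (article_id, [class probabilities]).
    A low maximum probability means the model is unsure, so the article
    is worth routing to a human annotator.
    """
    return [aid for aid, probs in predictions if max(probs) < threshold]

preds = [
    ("a1", [0.90, 0.05, 0.05]),  # confident -> skip
    ("a2", [0.40, 0.35, 0.25]),  # uncertain -> select
    ("a3", [0.55, 0.30, 0.15]),  # uncertain -> select
]
uncertain = select_uncertain(preds, threshold=0.6)  # ["a2", "a3"]
```

This "least confident" strategy is the simplest option; margin- and entropy-based sampling are common drop-in alternatives when expanding the active learning strategies.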

Dependencies

  • aiohttp - Asynchronous HTTP client
  • beautifulsoup4 - HTML parsing
  • sqlite3 - Database storage (built-in Python)
  • asyncio - Concurrent processing (built-in Python)
  • logging - Structured logging (built-in Python)
  • sentence-transformers - Embedding model
  • faiss-cpu - Vector index
  • scikit-learn - Baseline classifier
  • numpy - Numerical processing
  • joblib - Model serialization

Development

The project follows clean architecture principles with:

  • Modular component design
  • Comprehensive error handling
  • Extensive logging
  • Type hints for better code maintainability
  • Async/await patterns for performance

Configuration

Customizing Years to Scrape

By default, the scraper collects data from both 2024 and 2025. To customize this:

# Scrape only specific years
scraper = HeiseScraper(years=[2023, 2024])

# Scrape a single year
scraper = HeiseScraper(years=[2024])

# Default behavior (2024 and 2025)
scraper = HeiseScraper()

Next Steps

  1. Add Streamlit labeling UI
  2. Add LLM suggestions for label assist
  3. Add API endpoints for embeddings and search
  4. Expand active learning strategies

License

This project is for educational and research purposes. Please respect Heise's terms of service when scraping their content.
