A comprehensive machine learning pipeline for classifying and analyzing news articles from Heise Online, Germany's leading technology news platform.
├── data/
│ ├── raw/ # Raw scraped articles (SQLite database)
│ ├── processed/ # Preprocessed text data
│ └── embeddings/ # Generated embeddings for ML models
├── ingestion/
│ └── ingest_heise.py # Async web scraper for Heise news archive
├── preprocessing/
│ └── preprocess.py # Text cleaning, writes JSONL output
├── feature_store/
│ └── load_article_features.py # Load derived features into SQLite
├── embedding/
│ ├── embed.py # Text embedding generation
│ └── build_faiss.py # Build FAISS index from embeddings
├── labeling/
│ ├── weak_rules.py # Rule-based labeling system
│ └── llm_labeler.py # LLM-powered labeling for training data
├── model/
│ ├── train.py # Model training pipeline
│ └── predict.py # Inference and prediction
├── active_learning/
│ └── select_samples.py # Active learning sample selection
├── setup_pipeline_db.py # Database setup and schema migrations
├── api/
│ └── app.py # REST API for model serving
├── ui/
│ └── streamlit_app.py # Web interface for classification
└── README.md
Key features:
- Asynchronous Web Scraping: Concurrent scraping of Heise news archives using asyncio and aiohttp
- Robust Data Extraction: Extracts headlines, timestamps, and URLs from complex HTML structures
- SQLite Storage: Efficient local database storage with deduplication
- Modular Pipeline: Clean separation of concerns for data ingestion, processing, and modeling
- Active Learning: Intelligent sample selection for efficient model training
- Hybrid Labeling: Combines rule-based and LLM-based approaches for data labeling
- Web API: RESTful API for model inference
- Interactive UI: Streamlit-based web interface for easy interaction
The scraper (ingestion/ingest_heise.py) fetches articles from Heise's monthly archives (a minimal code sketch follows the feature list below):
cd /home/jan/heise-classification
python3 ingestion/ingest_heise.py

Features:
- Scrapes all 12 months of both 2024 and 2025 concurrently (24 months total)
- Extracts structured data: headline, timestamp (HH:MM DD-MM-YYYY), URL, unique ID
- Handles missing pages gracefully (future months)
- Stores data in SQLite with automatic deduplication
- Comprehensive logging and progress tracking
- Concurrent processing for optimal performance
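A minimal sketch of the concurrent fetch loop, assuming aiohttp, asyncio, and beautifulsoup4 as listed under Dependencies below (the archive URL pattern and CSS selector are illustrative assumptions, not the script's actual values):

import asyncio
import aiohttp
from bs4 import BeautifulSoup

# Assumed archive URL pattern; verify against the live site.
ARCHIVE_URL = "https://www.heise.de/newsticker/archiv/{year}/{month:02d}"

async def fetch_month(session: aiohttp.ClientSession, year: int, month: int) -> list[str]:
    """Fetch one monthly archive page and extract its headlines."""
    async with session.get(ARCHIVE_URL.format(year=year, month=month)) as resp:
        if resp.status != 200:  # missing pages, e.g. future months
            return []
        html = await resp.text()
    soup = BeautifulSoup(html, "html.parser")
    # The selector is a placeholder; inspect the archive HTML for the real one.
    return [a.get_text(strip=True) for a in soup.select("article a")]

async def main() -> None:
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_month(session, y, m) for y in (2024, 2025) for m in range(1, 13)]
        per_month = await asyncio.gather(*tasks)  # all 24 months in flight at once
    print(sum(len(h) for h in per_month), "headlines fetched")

asyncio.run(main())

asyncio.gather over a shared ClientSession is what makes the run concurrent rather than sequential.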
Sample Output:
2026-01-08 00:40:40,413 - INFO - Total articles in database: 19167
2026-01-08 00:40:40,413 - INFO - Articles by year:
2026-01-08 00:40:40,413 - INFO - Year 2024: 9093 articles
2026-01-08 00:40:40,413 - INFO - Year 2025: 10074 articles
Articles are stored in data/raw/heise_articles.db with the following structure:
CREATE TABLE articles (
id TEXT PRIMARY KEY, -- UUID v4 unique identifier
headline TEXT NOT NULL, -- Article headline/title
timestamp TEXT NOT NULL, -- Full publication timestamp (HH:MM DD-MM-YYYY format)
url TEXT NOT NULL, -- Full article URL
date_scraped TEXT NOT NULL -- ISO format scrape timestamp
);
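The schema itself does not enforce URL uniqueness, so deduplication presumably happens at insert time. A minimal sketch of one plausible check-before-insert approach (the scraper's actual logic may differ):

import sqlite3
import uuid
from datetime import datetime, timezone

conn = sqlite3.connect("data/raw/heise_articles.db")

def insert_article(headline: str, timestamp: str, url: str) -> None:
    """Insert an article unless its URL is already stored."""
    if conn.execute("SELECT 1 FROM articles WHERE url = ?", (url,)).fetchone() is None:
        conn.execute(
            "INSERT INTO articles (id, headline, timestamp, url, date_scraped) "
            "VALUES (?, ?, ?, ?, ?)",
            (str(uuid.uuid4()), headline, timestamp, url,
             datetime.now(timezone.utc).isoformat()),
        )
        conn.commit()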
- Clone and setup:
  git clone <repository-url>
  cd heise-classification
  pip install aiohttp beautifulsoup4 sentence-transformers faiss-cpu scikit-learn numpy joblib
- Run data ingestion:
  python3 ingestion/ingest_heise.py
- Verify data:
  import sqlite3
  conn = sqlite3.connect('data/raw/heise_articles.db')
  cursor = conn.cursor()
  cursor.execute('SELECT COUNT(*) FROM articles')
  print(f"Total articles: {cursor.fetchone()[0]}")
  conn.close()
- Set up pipeline tables:
  python3 setup_pipeline_db.py --db-path data/raw/heise_articles.db
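The exact tables this script creates are project-specific; the usual pattern for such a setup script is idempotent CREATE TABLE IF NOT EXISTS migrations. A sketch with assumed, illustrative table names:

import sqlite3

conn = sqlite3.connect("data/raw/heise_articles.db")
# Hypothetical downstream tables; names and columns are assumptions.
conn.executescript("""
CREATE TABLE IF NOT EXISTS weak_labels (
    article_id TEXT PRIMARY KEY REFERENCES articles(id),
    label      TEXT NOT NULL,
    source     TEXT NOT NULL  -- e.g. 'rule' or 'llm'
);
CREATE TABLE IF NOT EXISTS predictions (
    article_id TEXT PRIMARY KEY REFERENCES articles(id),
    label      TEXT NOT NULL,
    confidence REAL NOT NULL
);
""")
conn.commit()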
- Run preprocessing and write JSONL:
  python3 preprocessing/preprocess.py --db-path data/raw/heise_articles.db \
      --output-path data/processed/processed_articles.jsonl
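The JSONL record layout is not documented here; conceptually, the step reads headlines from SQLite, normalizes the text, and writes one JSON object per line. A sketch with assumed field names:

import json
import sqlite3

conn = sqlite3.connect("data/raw/heise_articles.db")
rows = conn.execute("SELECT id, headline FROM articles").fetchall()

with open("data/processed/processed_articles.jsonl", "w", encoding="utf-8") as f:
    for article_id, headline in rows:
        text = " ".join(headline.split())   # collapse whitespace
        record = {
            "id": article_id,
            "text": text.lower(),           # assumed normalization
            "n_tokens": len(text.split()),  # assumed derived feature
        }
        f.write(json.dumps(record, ensure_ascii=False) + "\n")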
- Load derived features into SQLite:
  python3 feature_store/load_article_features.py --db-path data/raw/heise_articles.db \
      --jsonl-path data/processed/processed_articles.jsonl
- Generate embeddings and build FAISS index:
  python3 embedding/embed.py --db-path data/raw/heise_articles.db
  python3 embedding/build_faiss.py --db-path data/raw/heise_articles.db
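These two scripts pair sentence-transformers (encoding) with faiss-cpu (indexing and search), both listed under Dependencies below. A minimal sketch; the model name, index path, and query are assumptions:

import sqlite3
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

conn = sqlite3.connect("data/raw/heise_articles.db")
headlines = [row[0] for row in conn.execute("SELECT headline FROM articles")]

# Any German-capable sentence-transformers model would do; this one is an assumption.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
vectors = model.encode(headlines, normalize_embeddings=True)

index = faiss.IndexFlatIP(vectors.shape[1])  # inner product = cosine on normalized vectors
index.add(np.asarray(vectors, dtype="float32"))
faiss.write_index(index, "data/embeddings/headlines.faiss")

# Nearest-neighbour search for a query headline:
query = model.encode(["Sicherheitslücke in Windows entdeckt"], normalize_embeddings=True)
scores, neighbors = index.search(np.asarray(query, dtype="float32"), 5)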
- Create weak labels:
  python3 labeling/weak_rules.py --db-path data/raw/heise_articles.db
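A weak rule maps keyword hits in a headline to a label and abstains otherwise. The categories and keywords below are hypothetical examples, not the actual rule set in labeling/weak_rules.py:

# Hypothetical keyword rules for illustration only.
RULES = {
    "security": ("sicherheitslücke", "malware", "exploit", "hacker"),
    "ai":       ("künstliche intelligenz", "chatgpt", "llm", " ki "),
    "hardware": ("prozessor", "grafikkarte", "chip", "notebook"),
}

def weak_label(headline: str) -> str | None:
    """Return the first label whose keywords match, or None (abstain)."""
    text = f" {headline.lower()} "  # pad so ' ki ' matches whole words
    for label, keywords in RULES.items():
        if any(kw in text for kw in keywords):
            return label
    return None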
- Train baseline model and store predictions:
  python3 model/train.py --db-path data/raw/heise_articles.db
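Given scikit-learn and joblib in the dependency list, a plausible baseline is a TF-IDF plus logistic-regression pipeline. A sketch; the label names, model path, and the two inline training examples are stand-ins for the weak-labeled data:

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Stand-in training data; in the real pipeline this comes from the weak labels.
texts = ["Sicherheitslücke in Router entdeckt", "Neue GPU von Nvidia vorgestellt"]
labels = ["security", "hardware"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)
joblib.dump(clf, "model/baseline.joblib")  # output path is an assumption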
- Select low-confidence samples for review:
  python3 active_learning/select_samples.py --db-path data/raw/heise_articles.db --threshold 0.6

Dependencies:
- aiohttp - Asynchronous HTTP client
- beautifulsoup4 - HTML parsing
- sqlite3 - Database storage (built-in Python)
- asyncio - Concurrent processing (built-in Python)
- logging - Structured logging (built-in Python)
- sentence-transformers - Embedding model
- faiss-cpu - Vector index
- scikit-learn - Baseline classifier
- numpy - Numerical processing
- joblib - Model serialization
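The --threshold 0.6 flag in the selection step above points to least-confidence sampling: articles whose top class probability falls below the threshold are queued for human review. A sketch, reusing the assumed model path from the training sketch:

import sqlite3
import joblib

clf = joblib.load("model/baseline.joblib")  # assumed path, see training sketch
conn = sqlite3.connect("data/raw/heise_articles.db")
rows = conn.execute("SELECT id, headline FROM articles").fetchall()

proba = clf.predict_proba([headline for _, headline in rows])
confidence = proba.max(axis=1)  # top-class probability per article

# Least-confidence sampling: everything the model is unsure about goes to review.
to_review = [article_id for (article_id, _), c in zip(rows, confidence) if c < 0.6]
print(f"{len(to_review)} articles below the 0.6 threshold")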
The project follows clean architecture principles with:
- Modular component design
- Comprehensive error handling
- Extensive logging
- Type hints for better code maintainability
- Async/await patterns for performance
By default, the scraper collects data from both 2024 and 2025. To customize this:
# Scrape only specific years
scraper = HeiseScraper(years=[2023, 2024])
# Scrape a single year
scraper = HeiseScraper(years=[2024])
# Default behavior (2024 and 2025)
scraper = HeiseScraper()

Roadmap:
- Add Streamlit labeling UI
- Add LLM suggestions for label assist
- Add API endpoints for embeddings and search
- Expand active learning strategies
This project is for educational and research purposes. Please respect Heise's terms of service when scraping their content.