Title & Overview
Template: Summarization: An Intermediate, End-to-End Analysis Tutorial
Overview (≤2 sentences): Learners will build and compare extractive (sentence selection) and abstractive (transformer-generated) summarization systems. It is intermediate because it requires evaluation beyond ROUGE (length control, factual consistency) and error analysis on summary quality.
Purpose
The value-add is teaching learners how to design defensible summarization pipelines by comparing extractive vs abstractive approaches, managing sequence length, and performing structured error analysis. This tutorial stresses reproducibility, metrics, and report generation for summarization tasks.
Prerequisites
- Skills: Python, Git, pandas, ML basics.
- NLP: tokenization, embeddings, evaluation metrics (ROUGE, BLEU, BERTScore).
- Tooling: pandas, scikit-learn, Hugging Face Transformers, MLflow, FastAPI.
Setup Instructions
- Environment: Conda/Poetry (Python 3.11), deterministic seeds (see the seeding sketch after this list).
- Install: pandas, scikit-learn, Hugging Face Transformers + Datasets, rouge-score, MLflow, FastAPI.
- Datasets:
  - Small: CNN/DailyMail (summarization, extractive + abstractive).
  - Medium: XSum (single-sentence abstractive summaries).
- Repo layout:

```
tutorials/t7-summarization/
├─ notebooks/
├─ src/
│  ├─ extractive.py
│  ├─ abstractive.py
│  ├─ eval.py
│  └─ config.yaml
├─ data/README.md
├─ reports/
└─ tests/
```
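A minimal seeding helper for the deterministic-seeds requirement. This is a sketch; the `set_seed` name and where it lives in `src/` are assumptions, not part of the planned layout.

```python
import os
import random

import numpy as np
import torch


def set_seed(seed: int = 42) -> None:
    """Seed Python, NumPy, and PyTorch RNGs for reproducible runs."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # cuDNN determinism trades speed for exact reproducibility.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
```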
Core Concepts
- Extractive vs abstractive: selecting vs generating summaries.
- Length control: truncation, sentence rank thresholds, max token length.
- Byte-level BPE: eliminates out-of-vocabulary tokens by falling back to raw bytes, so rare words remain encodable in abstractive models.
- Evaluation: ROUGE, BLEU, BERTScore; limitations of overlap-based metrics (see the ROUGE sketch after this list).
- Error slicing: summary quality across length buckets, factual consistency checks.
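To make the overlap-metric caveat concrete, here is a minimal ROUGE computation with the `rouge-score` package from the setup list (the example strings are illustrative):

```python
from rouge_score import rouge_scorer

reference = "The council approved the new housing plan on Tuesday."
candidate = "The housing plan was approved by the council."

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
for name, score in scorer.score(reference, candidate).items():
    # Each entry carries precision, recall, and F1 (fmeasure).
    print(f"{name}: P={score.precision:.2f} R={score.recall:.2f} F1={score.fmeasure:.2f}")
```

The candidate scores respectably despite dropping "on Tuesday": lexical overlap alone cannot flag omissions or hallucinations, which is why BERTScore and factual-consistency checks are paired with it.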
Step-by-Step Walkthrough
- Data intake & splits: load CNN/DailyMail and XSum; create reproducible train/val/test splits.
- Extractive baseline (see the extractive sketch after this list):
  - TextRank or TF-IDF sentence scoring.
  - Select the top-k sentences by importance.
- Abstractive baseline (see the abstractive sketch after this list):
  - DistilBART (byte-level BPE tokenizer) or T5-small (SentencePiece tokenizer).
  - Fine-tune with teacher forcing; decode with beam search.
- Evaluation: ROUGE-L, BLEU, BERTScore; length-compliance metrics.
- Error analysis: over-/under-generation, hallucinations, truncation issues.
- Reporting: metrics tables, example summaries, and error slices in reports/t7-summarization.md.
- (Optional) Serve: FastAPI endpoint with extractive/abstractive options; schema validation and a max-length guardrail (see the serving sketch after this list).
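A minimal TF-IDF extractive baseline to anchor the walkthrough. This is a sketch, not the tutorial's final API: the `extractive_summary` name and signature are assumptions, and sentence splitting is left to the caller (e.g., NLTK's `sent_tokenize`).

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def extractive_summary(sentences: list[str], k: int = 3) -> list[str]:
    """Score each sentence against the whole document and return the top-k."""
    vectorizer = TfidfVectorizer(stop_words="english")
    sentence_vectors = vectorizer.fit_transform(sentences)
    # Represent the document as the mean of its sentence vectors.
    doc_vector = np.asarray(sentence_vectors.mean(axis=0))
    scores = cosine_similarity(sentence_vectors, doc_vector).ravel()
    # Take the k highest-scoring sentences, then restore document order
    # to sidestep the sentence-ordering bias noted under Common Pitfalls.
    top_idx = sorted(np.argsort(scores)[-k:])
    return [sentences[i] for i in top_idx]
```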
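For the abstractive baseline, a minimal beam-search inference pass, assuming the public DistilBART checkpoint `sshleifer/distilbart-cnn-12-6`; any Hugging Face seq2seq summarization checkpoint works the same way.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "sshleifer/distilbart-cnn-12-6"  # DistilBART fine-tuned on CNN/DailyMail
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

article = "..."  # placeholder: one CNN/DailyMail article
inputs = tokenizer(article, max_length=1024, truncation=True, return_tensors="pt")
summary_ids = model.generate(
    **inputs,
    num_beams=4,         # beam search, as in the walkthrough
    max_length=128,      # length control
    min_length=30,
    length_penalty=2.0,  # tuning knob for the stretch exercise below
    early_stopping=True,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```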
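And for the optional serving step, a minimal FastAPI sketch with schema validation and the max-length guardrail. The endpoint path, field names, and guardrail value are all assumptions; `run_pipeline` is a stub standing in for the real extractive/abstractive dispatch.

```python
from typing import Literal

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel, Field

app = FastAPI()

MAX_INPUT_CHARS = 20_000  # assumed guardrail value; tune to the model's context window


class SummarizeRequest(BaseModel):
    text: str = Field(..., min_length=1)
    method: Literal["extractive", "abstractive"] = "extractive"
    max_summary_tokens: int = Field(128, ge=16, le=256)


def run_pipeline(method: str, text: str, max_tokens: int) -> str:
    # Placeholder stub: wire this to src/extractive.py and src/abstractive.py.
    return text.split(".")[0] + "."


@app.post("/summarize")
def summarize(req: SummarizeRequest) -> dict:
    # Reject oversized inputs before they reach the model.
    if len(req.text) > MAX_INPUT_CHARS:
        raise HTTPException(status_code=413, detail="Input exceeds the max-length guardrail.")
    return {"method": req.method, "summary": run_pipeline(req.method, req.text, req.max_summary_tokens)}
```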
Hands-On Exercises
- Ablations: extractive vs abstractive; greedy vs beam search; different summary lengths.
- Robustness: add noisy sentences; test whether models ignore irrelevant information (see the noise-injection sketch after this list).
- Slice analysis: compare summary quality for long vs short articles.
- Stretch: constrained decoding (coverage penalty, length penalty tuning).
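A small helper for the robustness exercise above; a sketch only, with the distractor pool and insertion strategy left as arbitrary choices.

```python
import random


def add_noise(sentences: list[str], distractors: list[str], n: int = 2, seed: int = 42) -> list[str]:
    """Insert n irrelevant sentences at random positions in an article."""
    rng = random.Random(seed)  # local RNG keeps the perturbation reproducible
    noisy = list(sentences)
    for distractor in rng.sample(distractors, n):
        noisy.insert(rng.randrange(len(noisy) + 1), distractor)
    return noisy
```

Compare ROUGE/BERTScore on clean vs. noisy inputs; a robust summarizer's scores should barely move.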
Common Pitfalls & Troubleshooting
- Hallucinations: abstractive models may invent facts; catch them with human review and targeted error analysis.
- Metrics misuse: ROUGE alone ≠ summary quality; complement with BERTScore.
- Length issues: truncation may cut key sentences; manage max token length.
- Extractive bias: news articles front-load key facts, so lead-position bias can inflate extractive baselines.
- OOM: abstractive training with long sequences can exhaust GPU memory; mitigate with gradient accumulation (see the sketch after this list).
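For the OOM pitfall, gradient accumulation via Hugging Face's `Seq2SeqTrainingArguments` preserves the effective batch size while shrinking per-step memory; the values below are illustrative.

```python
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="outputs/t7-abstractive",
    per_device_train_batch_size=2,  # small micro-batch so long sequences fit in memory
    gradient_accumulation_steps=8,  # effective batch size = 2 x 8 = 16
    fp16=True,                      # mixed precision cuts activation memory (GPU only)
    predict_with_generate=True,     # decode with generate() during evaluation
)
```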
Best Practices
- Always compare extractive vs abstractive baselines.
- Track config, seeds, tokenizer artifacts, dataset fingerprints in MLflow.
- Use error slices: factual vs hallucinated summaries, long vs short docs.
- Unit tests: ensure the extractive method selects exactly k sentences, reproducibly (see the test sketch after this list).
- Guardrails: enforce max length and schema validation in serving.
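A unit-test sketch for the reproducibility point, assuming the hypothetical `extractive_summary(sentences, k)` shown earlier lives in `src/extractive.py`:

```python
# tests/test_extractive.py
from src.extractive import extractive_summary

SENTENCES = [
    "The city council met on Tuesday.",
    "Attendance was higher than usual.",
    "Members approved a new housing plan.",
    "The plan adds two hundred affordable units.",
    "A follow-up vote is scheduled for June.",
]


def test_selects_exactly_k_sentences():
    assert len(extractive_summary(SENTENCES, k=3)) == 3


def test_is_deterministic_across_runs():
    # TF-IDF scoring has no randomness, so repeated calls must agree exactly.
    assert extractive_summary(SENTENCES, k=3) == extractive_summary(SENTENCES, k=3)
```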
Reflection & Discussion Prompts
- Why do abstractive models hallucinate more than extractive ones?
- When is extractive summarization preferable in practice?
- How do we balance brevity vs coverage in summary design?
Next Steps / Advanced Extensions
- Experiment with PEGASUS or Longformer for long-document summarization.
- Evaluate factual consistency with QA-based metrics.
- Domain adaptation: summarization for civic/public documents.
- Lightweight deployment: summarization API with latency monitoring.
Glossary / Key Terms
Extractive summarization, abstractive summarization, ROUGE, BLEU, BERTScore, hallucination, beam search, length penalty.
Additional Resources
- [Hugging Face Transformers](https://huggingface.co/docs/transformers)
- [Hugging Face Datasets](https://huggingface.co/datasets)
- [ROUGE](https://github.com/google-research/google-research/tree/master/rouge)
- [MLflow](https://mlflow.org/)
- [FastAPI](https://fastapi.tiangolo.com/)
Contributors
Author(s): TBD
Reviewer(s): TBD
Maintainer(s): TBD
Date updated: 2025-09-20
Dataset licenses: CNN/DailyMail (non-commercial), XSum (BBC dataset).
Issues Referenced
Epic: HfLA Text Analysis Tutorials (T0–T14).
This sub-issue: T7: Summarization.