Text Analysis Tutorial: Summarization (extractive vs abstractive; length control) #251

Description

Title & Overview

Template: Summarization: An Intermediate, End-to-End Analysis Tutorial
Overview (≤2 sentences): Learners will build and compare extractive (sentence selection) and abstractive (transformer-generated) summarization systems. It is intermediate because it requires evaluation beyond ROUGE (length control, factual consistency) and error analysis on summary quality.

Purpose

The value-add is teaching learners how to design defensible summarization pipelines by comparing extractive vs abstractive approaches, managing sequence length, and performing structured error analysis. This tutorial stresses reproducibility, metrics, and report generation for summarization tasks.

Prerequisites

  • Skills: Python, Git, pandas, ML basics.
  • NLP: tokenization, embeddings, evaluation metrics (ROUGE, BLEU, BERTScore).
  • Tooling: pandas, scikit-learn, Hugging Face Transformers, MLflow, FastAPI.

Setup Instructions

  • Environment: Conda/Poetry (Python 3.11), deterministic seeds (seeding sketch after the repo layout).

  • Install: pandas, scikit-learn, Hugging Face Transformers + Datasets, rouge-score, MLflow, FastAPI.

  • Datasets:

    • Small: CNN/DailyMail (summarization, extractive + abstractive).
    • Medium: XSum (single-sentence abstractive summaries).
  • Repo layout:

    tutorials/t7-summarization/
      ├─ notebooks/
      ├─ src/
      │   ├─ extractive.py
      │   ├─ abstractive.py
      │   ├─ eval.py
      │   └─ config.yaml
      ├─ data/README.md
      ├─ reports/
      └─ tests/
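
A minimal sketch of what the "deterministic seeds" setup might look like, assuming PyTorch and Transformers are in play (the seed_everything helper name is illustrative, not an existing utility):

    import random

    import numpy as np
    import torch
    from transformers import set_seed


    def seed_everything(seed: int = 42) -> None:
        """Illustrative helper: pin every RNG the pipeline touches."""
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        set_seed(seed)  # also covers transformers' own seeding utilities


    seed_everything(42)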
    

Core Concepts

  • Extractive vs abstractive: selecting vs generating summaries.
  • Length control: truncation, sentence rank thresholds, max token length.
  • Byte-level BPE: subword tokenization that keeps rare words representable (no out-of-vocabulary tokens) in abstractive models; see the tokenizer sketch after this list.
  • Evaluation: ROUGE, BLEU, BERTScore; limitations of overlap-based metrics.
  • Error slicing: summary quality across length buckets, factual consistency checks.
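
To make the byte-level BPE and length-control points concrete, a small sketch assuming the facebook/bart-base checkpoint (any byte-level BPE seq2seq tokenizer behaves similarly; the article text is made up):

    from transformers import AutoTokenizer

    # BART ships a GPT-2-style byte-level BPE vocabulary, so even unseen words
    # decompose into known subword/byte pieces instead of an <unk> token.
    tok = AutoTokenizer.from_pretrained("facebook/bart-base")
    print(tok.tokenize("Eyjafjallajokull"))  # rare word -> several BPE pieces

    # Input-side length control: hard-truncate the article to the encoder budget.
    article = "The volcano erupted overnight, grounding flights across Europe. " * 50
    enc = tok(article, truncation=True, max_length=512)
    print(len(enc["input_ids"]))  # 512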

Step-by-Step Walkthrough

  1. Data intake & splits: load CNN/DailyMail and XSum; reproducible train/val/test (loading sketch after this list).

  2. Extractive baseline:

    • TextRank or TF-IDF sentence scoring.
    • Select top-k sentences by importance (TF-IDF sketch after this list).
  3. Abstractive baseline:

    • DistilBART or T5-small with byte-level BPE tokenizer.
    • Fine-tuning with teacher forcing; inference with beam search (generation sketch after this list).
  4. Evaluation: ROUGE-L, BLEU, BERTScore; length compliance metrics (ROUGE sketch after this list).

  5. Error analysis: over/under-generation, hallucinations, truncation issues.

  6. Reporting: metrics tables, example summaries, error slices in reports/t7-summarization.md.

  7. (Optional) Serve: FastAPI endpoint with extractive/abstractive options; schema validation, max-length guardrail (serving sketch after this list).

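A sketch of step 1, using the Hugging Face Datasets copies of the corpora (dataset ids and the 3.0.0 config name are as they appear on the Hub at the time of writing; the subsample size and seed are illustrative):

    from datasets import load_dataset

    # CNN/DailyMail ships with canonical train/validation/test splits; keep them
    # as-is so results stay comparable across runs and published baselines.
    cnn = load_dataset("cnn_dailymail", "3.0.0")

    # For quick iteration, carve out a seeded subsample instead of re-splitting;
    # the fixed seed keeps the subsample reproducible.
    small_train = cnn["train"].shuffle(seed=42).select(range(10_000))

    xsum = load_dataset("EdinburghNLP/xsum")  # single-sentence reference summaries
    print(small_train[0]["highlights"])
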
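For the TF-IDF flavour of step 2, a sketch of what src/extractive.py might contain; the regex sentence splitter is deliberately naive (a real implementation would likely use nltk or spaCy):

    import re

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer


    def tfidf_top_k(article: str, k: int = 3) -> str:
        """Score each sentence by the sum of its TF-IDF weights and keep the
        top k, re-emitted in the original document order."""
        sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", article) if s.strip()]
        if len(sentences) <= k:
            return " ".join(sentences)
        tfidf = TfidfVectorizer(stop_words="english").fit_transform(sentences)
        scores = np.asarray(tfidf.sum(axis=1)).ravel()
        top = sorted(np.argsort(scores)[-k:])  # restore document order
        return " ".join(sentences[i] for i in top)
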
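For the inference side of step 3, a generation sketch with DistilBART (sshleifer/distilbart-cnn-12-6 is the commonly used CNN/DailyMail distillation; fine-tuning itself would go through Seq2SeqTrainer and is omitted here):

    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    ckpt = "sshleifer/distilbart-cnn-12-6"
    tok = AutoTokenizer.from_pretrained(ckpt)
    model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)

    article = "The city council voted on Tuesday to expand the bus network..."
    inputs = tok(article, truncation=True, max_length=1024, return_tensors="pt")

    summary_ids = model.generate(
        **inputs,
        num_beams=4,             # beam search at inference time
        max_new_tokens=128,      # output-side length control
        min_new_tokens=30,
        length_penalty=1.0,      # beam-score length normalisation; tune in the ablations
        no_repeat_ngram_size=3,
    )
    print(tok.decode(summary_ids[0], skip_special_tokens=True))
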
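Step 4 with the rouge-score package from the setup list (BLEU and BERTScore would come from sacrebleu and bert-score respectively and are omitted; the length-compliance helper is illustrative):

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)
    scores = scorer.score(
        target="The storm forced the airport to close for two days.",   # reference
        prediction="A storm closed the airport for two days.",          # system output
    )
    print({name: round(s.fmeasure, 3) for name, s in scores.items()})


    def length_compliance(summaries, tokenizer, max_tokens=128):
        """Fraction of generated summaries that stay within the token budget."""
        lengths = [len(tokenizer.tokenize(s)) for s in summaries]
        return sum(n <= max_tokens for n in lengths) / len(lengths)
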
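And for the optional serving step, a sketch of the schema-validation and max-length guardrail idea; the endpoint name, limits, and the import from src/extractive.py (the helper sketched above) are illustrative:

    from enum import Enum

    from fastapi import FastAPI, HTTPException
    from pydantic import BaseModel, Field

    from src.extractive import tfidf_top_k  # helper from the extractive sketch

    app = FastAPI(title="t7-summarization")

    MAX_INPUT_CHARS = 20_000  # guardrail; tune to the model's actual budget


    class Method(str, Enum):
        extractive = "extractive"
        abstractive = "abstractive"


    class SummarizeRequest(BaseModel):
        text: str = Field(min_length=1, max_length=MAX_INPUT_CHARS)
        method: Method = Method.extractive
        max_sentences: int = Field(default=3, ge=1, le=10)


    @app.post("/summarize")
    def summarize(req: SummarizeRequest) -> dict:
        if req.method is Method.extractive:
            return {"summary": tfidf_top_k(req.text, k=req.max_sentences)}
        # the abstractive path would call model.generate() with the length controls above
        raise HTTPException(status_code=501, detail="abstractive path not wired up yet")
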
Hands-On Exercises

  • Ablations: extractive vs abstractive; greedy vs beam search; different summary lengths.
  • Robustness: add noisy sentences; test whether models ignore irrelevant info.
  • Slice analysis: compare summary quality for long vs short articles.
  • Stretch: constrained decoding (coverage penalty, length penalty tuning).

Common Pitfalls & Troubleshooting

  • Hallucinations: abstractive models may invent facts; catch these with human review and targeted error analysis.
  • Metrics misuse: ROUGE alone ≠ summary quality; complement with BERTScore.
  • Length issues: truncation may cut key sentences; manage max token length.
  • Position bias: news articles front-load key facts, so sentence position can skew extractive baselines.
  • OOM: abstractive training with long sequences; mitigate with gradient accumulation.
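
A sketch of the gradient-accumulation mitigation via Seq2SeqTrainingArguments (the batch sizes and output directory are illustrative):

    from transformers import Seq2SeqTrainingArguments

    # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps,
    # so long-sequence fine-tuning keeps the per-step memory footprint small.
    args = Seq2SeqTrainingArguments(
        output_dir="outputs/t7-abstractive",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=16,   # effective batch size of 32
        fp16=True,                        # further memory saving on CUDA GPUs
        predict_with_generate=True,       # evaluate with real beam-search outputs
    )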

Best Practices

  • Always compare extractive vs abstractive baselines.
  • Track config, seeds, tokenizer artifacts, dataset fingerprints in MLflow.
  • Use error slices: factual vs hallucinated summaries, long vs short docs.
  • Unit tests: ensure the extractive method selects k sentences reproducibly (pytest sketch after this list).
  • Guardrails: enforce max length and schema validation in serving.
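
The extractive-reproducibility unit test can be as small as this pytest sketch (it assumes the tfidf_top_k helper from the extractive sketch lives in src/extractive.py):

    # tests/test_extractive.py
    from src.extractive import tfidf_top_k

    ARTICLE = (
        "The council approved the budget. Local parks gain new funding. "
        "A committee will audit spending. Residents praised the decision. "
        "Construction starts next spring."
    )


    def test_selects_exactly_k_sentences():
        assert tfidf_top_k(ARTICLE, k=3).count(".") == 3


    def test_is_deterministic():
        assert tfidf_top_k(ARTICLE, k=3) == tfidf_top_k(ARTICLE, k=3)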

Reflection & Discussion Prompts

  • Why do abstractive models hallucinate more than extractive ones?
  • When is extractive summarization preferable in practice?
  • How do we balance brevity vs coverage in summary design?

Next Steps / Advanced Extensions

  • Experiment with PEGASUS or Longformer for long-document summarization.
  • Evaluate factual consistency with QA-based metrics.
  • Domain adaptation: summarization for civic/public documents.
  • Lightweight deployment: summarization API with latency monitoring.

Glossary / Key Terms

Extractive summarization, abstractive summarization, ROUGE, BLEU, BERTScore, hallucination, beam search, length penalty.

Additional Resources

Contributors

Author(s): TBD
Reviewer(s): TBD
Maintainer(s): TBD
Date updated: 2025-09-20
Dataset licenses: CNN/DailyMail (non-commercial), XSum (BBC dataset).

Issues Referenced

Epic: HfLA Text Analysis Tutorials (T0–T14).
This sub-issue: T7: Summarization.

