TREC Washington Post Corpus (v4)

Synopsis

The TREC Washington Post Corpus contains 728,626 news articles and blog posts from January 2012 through December 2020. Each article is stored in JSON format and includes:

  • title
  • byline
  • date of publication
  • kicker (a section header)
  • article text broken into paragraphs
  • links to embedded images and multimedia (for 2012-2017 documents)

Files and Folders

.
└── WashingtonPost.v4/
    ├── data
    │   └── TREC_Washington_Post_collection.v4.jl
    ├── scripts
    │   └── ...
    ├── README-v3.md
    └── README.md
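
As a minimal sketch of how the collection file might be read, the snippet below assumes that the `.jl` extension means JSON Lines (one JSON object per line); this README does not state the layout explicitly. The helper name `iter_articles` is hypothetical, and no field names are assumed: the example only prints the top-level keys of the first record so you can check them against the list in the Synopsis.

```python
import json
from pathlib import Path
from typing import Iterator


def iter_articles(path: Path) -> Iterator[dict]:
    """Yield one parsed record per non-empty line.

    Assumption: the file is JSON Lines (one JSON object per line).
    """
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


# Path taken from the folder layout above, relative to the repository root.
corpus = Path("WashingtonPost.v4/data/TREC_Washington_Post_collection.v4.jl")

if corpus.exists():
    # Inspect only the first record instead of loading the whole file.
    first = next(iter_articles(corpus), None)
    if first is not None:
        print(sorted(first))
```

Reading line by line keeps memory use flat, which matters for a collection of this size; avoid calling `json.load` on the whole file.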

Research and Use Cases

Our own research and experiments with this data set.

License Information

Our organization signed an Organizational agreement. Individual researchers must sign an Individual agreement.

Data Source

The data can be downloaded from this site.

Publications