The TREC Washington Post Corpus contains 608,180 news articles and blog posts from January 2012 through August 2017. It was originally used for the Common Core Track at TREC 2018 (http://trec-core.github.io/2018/). The initial document collection contained duplicate docids. These duplicates have been removed from the dataset stored here, and the resulting collection contains 595,037 documents. The documents are stored in a single JSON Lines file (http://jsonlines.org/).
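Since the collection is a single JSON Lines file, it can be read one document per line without loading everything into memory. The following is a minimal sketch, not part of this repository: the function names `iter_documents` and `count_documents` are hypothetical, and the file path is whatever location the JSON Lines file has on your machine.

```python
import json
from typing import Iterator


def iter_documents(path: str) -> Iterator[dict]:
    """Yield one parsed document per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                yield json.loads(line)


def count_documents(path: str) -> int:
    """Count the documents in a JSON Lines file by streaming through it."""
    return sum(1 for _ in iter_documents(path))
```

Run against the deduplicated file, `count_documents` should report the document count stated above if the file is complete.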
- archives: this directory contains the original files
- data: contains the JSON Lines file
- scripts: the Python scripts for duplicate removal can be found here
- license-agreement: contains the license agreement
- topics-and-qrels: contains txt files with 50 topics and corresponding qrels
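Duplicate removal by docid can be sketched as follows. This is an illustration only and not necessarily what the scripts in this repository do: the function name `remove_duplicates` is hypothetical, the docid is assumed to be stored under the key `id`, and keeping the first occurrence of each docid is an assumed policy.

```python
import json
from typing import Tuple


def remove_duplicates(src_path: str, dst_path: str, id_key: str = "id") -> Tuple[int, int]:
    """Copy a JSON Lines file, keeping only the first document per docid.

    Returns (kept, dropped). The key name `id_key` is an assumption.
    """
    seen = set()
    kept = dropped = 0
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if not line.strip():
                continue  # ignore blank lines
            doc_id = json.loads(line)[id_key]
            if doc_id in seen:
                dropped += 1  # a later copy of a docid already written
                continue
            seen.add(doc_id)
            # Write the original line unchanged, ensuring a trailing newline.
            dst.write(line if line.endswith("\n") else line + "\n")
            kept += 1
    return kept, dropped
```

The function streams the input, so only the set of docids seen so far is held in memory, not the documents themselves.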
This dataset will also be used in CENTRE@CLEF2019 (http://www.centre-eval.org/clef2019/). This track focuses on the replicability, reproducibility and generalizability of retrieval systems. We are planning to participate in the CENTRE track.
@pschaer signed a license agreement, which can be found under license-agreement.
The original data can be retrieved from NIST:
https://ir.nist.gov/wapo/
Topic- and relevance-files can be retrieved from:
https://trec.nist.gov/act_part/tracks2018.html
- Alexander Bondarenko, Michael Völske, Alexander Panchenko, Chris Biemann, Benno Stein, and Matthias Hagen. Webis at TREC 2018: Common Core Track. In Ellen M. Voorhees and Angela Ellis, editors, 27th International Text Retrieval Conference (TREC 2018), NIST Special Publication, November 2018. National Institute of Standards and Technology (NIST).