The New York Times Annotated Corpus

Synopsis

The New York Times Annotated Corpus is drawn from the historical archive of The New York Times and includes metadata provided by The New York Times Newsroom, The New York Times Indexing Service and the online production staff at NYTimes.com. The corpus contains over 1.8 million articles published by The New York Times between January 1, 1987 and June 19, 2007. Articles from wire services that appeared in The New York Times during this period are not included.

The corpus includes:

Over 1.8 million articles (excluding wire service articles that appeared during the covered period).
Over 650,000 article summaries written by library scientists.
Over 1,500,000 articles manually tagged by library scientists with tags drawn from a normalized indexing vocabulary of people, organizations, locations and topic descriptors.
Over 275,000 algorithmically tagged articles that have been hand-verified by the online production staff at NYTimes.com.
Java tools for parsing corpus documents from XML into memory-resident objects.
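
The bundled tools are written in Java; for readers working in Python, the same idea (turning one XML document into an in-memory object) can be sketched as follows. This is a minimal sketch, not the corpus tooling: the `Article` class and `parse_article` function are hypothetical names, and the element paths (`head/title`, `block class="full_text"`) assume an NITF-style layout that should be checked against the files in the dtd folder.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Article:
    """Hypothetical in-memory view of one corpus document."""
    title: str = ""
    paragraphs: list = field(default_factory=list)


def parse_article(xml_text: str) -> Article:
    # ASSUMPTION: element names follow an NITF-style layout
    # (head/title for the headline, block class="full_text" for the body).
    # Verify against the DTD shipped with the corpus before relying on them.
    root = ET.fromstring(xml_text)
    title = root.findtext("head/title", default="").strip()
    paragraphs = [
        (p.text or "").strip()
        for p in root.findall(".//block[@class='full_text']/p")
    ]
    return Article(title=title, paragraphs=paragraphs)


if __name__ == "__main__":
    # Tiny made-up document, not taken from the corpus.
    sample = (
        "<nitf><head><title>Example Headline</title></head>"
        "<body><body.content><block class='full_text'>"
        "<p>First paragraph.</p><p>Second paragraph.</p>"
        "</block></body.content></body></nitf>"
    )
    article = parse_article(sample)
    print(article.title, len(article.paragraphs))
```

Metadata such as index tags and summaries would need additional lookups; their element names are not given here, so they are left out of the sketch.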

The New York Times has established a community website for researchers working on the data set at http://groups.google.com/group/nytnlp and encourages feedback and discussion about the corpus.

Files and Folders

data: folders with the original XML files (years 1987-2007)
docs: PDF documents that provide an overview of the corpus
dtd: DTD files
tools: Java tools for parsing
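
As a quick sanity check after downloading, the data folder can be walked to count documents per year. The sketch below assumes that each top-level folder under data is named after a year and that the XML files sit somewhere beneath it; `count_articles_by_year` is a hypothetical helper, not part of the corpus tools.

```python
from collections import Counter
from pathlib import Path


def count_articles_by_year(data_dir: str) -> Counter:
    """Count *.xml files under each year-named folder of data_dir."""
    counts = Counter()
    for year_dir in sorted(Path(data_dir).iterdir()):
        # ASSUMPTION: top-level folders are named after years (e.g. "1987");
        # anything else is skipped.
        if year_dir.is_dir() and year_dir.name.isdigit():
            # rglob searches all nested subfolders, so the exact depth
            # of the XML files does not matter.
            counts[year_dir.name] = sum(1 for _ in year_dir.rglob("*.xml"))
    return counts


if __name__ == "__main__":
    for year, n in sorted(count_articles_by_year("data").items()):
        print(year, n)
```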

Research and Usecases

License Information

Data Source

The dataset was authored by Evan Sandhaus and is catalogued at https://catalog.ldc.upenn.edu/LDC2008T19
