This representation of the full Semantic Scholar corpus contains records for papers that were crawled from the web and passed through a number of filters. It covers over 45 million published research papers in Computer Science, Neuroscience, and Biomedical fields, provided as JSON objects, one per line. Papers are grouped into batches and shared as a collection of gzipped files; each file is about 990 MB, and the whole collection is about 46 GB. A sample file of about 100 records (98 KB) is also provided, along with a copy of the license agreement. The manifest lists the available files.
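Since each file is gzipped and holds one JSON object per line, a batch can be read as a stream without decompressing it to disk first. The following is a minimal sketch in Python using only the standard library; the helper names iter_records and count_records are hypothetical, and the sketch assumes nothing about which fields a record contains.

```python
import gzip
import json
from typing import Iterator


def iter_records(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzipped file."""
    # "rt" opens the gzip stream in text mode so lines arrive as str.
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines defensively
                yield json.loads(line)


def count_records(path: str) -> int:
    """Count records one at a time, without loading the file into memory."""
    return sum(1 for _ in iter_records(path))
```

Calling count_records("sample-S2-records.gz") on the sample file is a quick way to check a download before working with the roughly 990 MB batches; going by the description above, it should report about 100 records.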
The folder ./elasticsearch contains the indexed data files for a local single-node Elasticsearch cluster (Docker image).
corpus-2019-01-31/s2-corpus-00.gz
corpus-2019-01-31/s2-corpus-01.gz
corpus-2019-01-31/s2-corpus-02.gz
corpus-2019-01-31/s2-corpus-03.gz
corpus-2019-01-31/s2-corpus-04.gz
corpus-2019-01-31/s2-corpus-05.gz
corpus-2019-01-31/s2-corpus-06.gz
corpus-2019-01-31/s2-corpus-07.gz
corpus-2019-01-31/s2-corpus-08.gz
corpus-2019-01-31/s2-corpus-09.gz
corpus-2019-01-31/s2-corpus-10.gz
corpus-2019-01-31/s2-corpus-11.gz
corpus-2019-01-31/s2-corpus-12.gz
corpus-2019-01-31/s2-corpus-13.gz
corpus-2019-01-31/s2-corpus-14.gz
corpus-2019-01-31/s2-corpus-15.gz
corpus-2019-01-31/s2-corpus-16.gz
corpus-2019-01-31/s2-corpus-17.gz
corpus-2019-01-31/s2-corpus-18.gz
corpus-2019-01-31/s2-corpus-19.gz
corpus-2019-01-31/s2-corpus-20.gz
corpus-2019-01-31/s2-corpus-21.gz
corpus-2019-01-31/s2-corpus-22.gz
corpus-2019-01-31/s2-corpus-23.gz
corpus-2019-01-31/s2-corpus-24.gz
corpus-2019-01-31/s2-corpus-25.gz
corpus-2019-01-31/s2-corpus-26.gz
corpus-2019-01-31/s2-corpus-27.gz
corpus-2019-01-31/s2-corpus-28.gz
corpus-2019-01-31/s2-corpus-29.gz
corpus-2019-01-31/s2-corpus-30.gz
corpus-2019-01-31/s2-corpus-31.gz
corpus-2019-01-31/s2-corpus-32.gz
corpus-2019-01-31/s2-corpus-33.gz
corpus-2019-01-31/s2-corpus-34.gz
corpus-2019-01-31/s2-corpus-35.gz
corpus-2019-01-31/s2-corpus-36.gz
corpus-2019-01-31/s2-corpus-37.gz
corpus-2019-01-31/s2-corpus-38.gz
corpus-2019-01-31/s2-corpus-39.gz
corpus-2019-01-31/s2-corpus-40.gz
corpus-2019-01-31/s2-corpus-41.gz
corpus-2019-01-31/s2-corpus-42.gz
corpus-2019-01-31/s2-corpus-43.gz
corpus-2019-01-31/s2-corpus-44.gz
corpus-2019-01-31/s2-corpus-45.gz
corpus-2019-01-31/s2-corpus-46.gz
sample-S2-records.gz
license.txt
manifest.txt
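The batch names above follow a regular pattern (s2-corpus-00.gz through s2-corpus-46.gz, 47 files in total), so a local copy can be checked for completeness with a short script. This is only a sketch: the helper names expected_batches and missing_files are hypothetical, and it assumes the files sit under a local directory using the same relative paths as in the listing.

```python
from pathlib import Path
from typing import List


def expected_batches(release: str = "corpus-2019-01-31", count: int = 47) -> List[str]:
    """Build the batch names s2-corpus-00.gz .. s2-corpus-46.gz from the listing."""
    return [f"{release}/s2-corpus-{i:02d}.gz" for i in range(count)]


def missing_files(names: List[str], root: str = ".") -> List[str]:
    """Return the names that do not exist as regular files under root."""
    base = Path(root)
    return [name for name in names if not (base / name).is_file()]
```

An empty result from missing_files(expected_batches()) means every batch is present. It says nothing about whether the files are intact, so a size or checksum comparison would still be needed if one is available.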
https://api.semanticscholar.org/corpus/
Waleed Ammar et al. 2018. Construction of the Literature Graph in Semantic Scholar. NAACL.