Datasets

Various datasets for the workshop.

Usage

We provide a script to parse these data sources into a suitable format. Follow the instructions here.

Interviews

Interviews of world leaders from various journalistic sources.

Vladimir Putin

File: vladimir_putin_interviews.json

Sources:

Barack Obama

File: barack_obama_interviews.json

Sources:

Movie Dialogs Corpus

File: cornell_movie_dialogs_corpus.json.zip

220,579 conversational exchanges between 10,292 pairs of movie characters

Source: http://www.cs.cornell.edu/~cristian/Cornell_Movie-Dialogs_Corpus.html
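Since the corpus ships as a zipped JSON file, a minimal sketch for reading it without unpacking might look like the following. The helper name `load_json_from_zip` is hypothetical, and the name and structure of the JSON member inside the archive are not documented here, so the code simply picks the first `.json` entry it finds.

```python
import json
import zipfile


def load_json_from_zip(zip_path, member=None):
    """Load one JSON member from a zip archive.

    If `member` is None, the first entry ending in ".json" is used
    (an assumption: the archive layout is not documented in this README).
    """
    with zipfile.ZipFile(zip_path) as archive:
        if member is None:
            candidates = [n for n in archive.namelist() if n.endswith(".json")]
            if not candidates:
                raise ValueError(f"no .json member found in {zip_path}")
            member = candidates[0]
        with archive.open(member) as handle:
            return json.load(handle)
```

For example, `data = load_json_from_zip("cornell_movie_dialogs_corpus.json.zip")` would return the parsed content, whatever its top-level type turns out to be.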

HTML Files

File: html-dataset.txt

HTML files from various GitHub projects.

Scraped from these repositories: https://gist.github.com/VladislavZavadskyy/e31ab07b03a5c22b11982c49669a400b

Source: https://www.kaggle.com/zavadskyy/lots-of-code

TypeScript and JSON Files

Files:

  • typescript.zip
  • json.zip

TypeScript (.ts) and JSON (.json) files collected from a freshly generated Angular app with routing (ng new <app-name>). For Angular installation instructions, see https://angular.io/guide/setup-local

JavaScript Files

File: javascript.zip

A sample of JavaScript (.js) files taken from a larger dataset of JavaScript files.

Source: https://www.sri.inf.ethz.ch/js150

Chess games

Chess games from 2019 in PGN format.

File: ficsgamesdb_2019_standard2000_nomovetimes_110541.pgn

Source: https://www.ficsgames.org/download.html
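PGN stores each game as a block of tag pairs such as `[White "name"]` followed by the move text. As a rough illustration, the standard-library sketch below collects the tag pairs of each game. The function name `iter_pgn_headers` is hypothetical, and the code assumes every game has a non-empty move text after its tags; it is not a full PGN parser.

```python
import re

# A PGN tag pair looks like: [Name "Value"]
TAG_PAIR = re.compile(r'^\[(\w+)\s+"(.*)"\]\s*$')


def iter_pgn_headers(lines):
    """Yield one dict of tag pairs per game from an iterable of PGN lines."""
    headers = {}
    in_movetext = False
    for line in lines:
        line = line.strip()
        match = TAG_PAIR.match(line)
        if match:
            # A tag line after move text marks the start of the next game.
            if in_movetext and headers:
                yield headers
                headers = {}
            in_movetext = False
            headers[match.group(1)] = match.group(2)
        elif line:
            # Any other non-empty line is treated as move text.
            in_movetext = True
    if headers:
        yield headers
```

Passing an open file object works directly, since iterating a text file yields its lines.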

Music

Music in ABC-Notation.

File: abc_notation_songs.txt

Source: https://www.kaggle.com/raj5287/abc-notation-of-tunes/version/3

Donald Trump tweets

The dataset is split into two files.

Files:

  • realdonaldtrump-1.ndjson
  • realdonaldtrump-2.ndjson

Source: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi%3A10.7910%2FDVN%2FKJEBIL
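The `.ndjson` extension indicates newline-delimited JSON, i.e. one JSON value per line. A minimal sketch for streaming such a file follows; the helper name `read_ndjson` is hypothetical, UTF-8 encoding is an assumption, and the fields of each record are not documented in this README.

```python
import json


def read_ndjson(path):
    """Yield one parsed JSON value per non-empty line of `path`."""
    # UTF-8 is assumed here; adjust if the files use another encoding.
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)
```

Because the function is a generator, both files can be processed one after the other without loading either fully into memory, for example by looping over `("realdonaldtrump-1.ndjson", "realdonaldtrump-2.ndjson")` and calling `read_ndjson` on each.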

Shakespeare plays

File: shakespeare_data.csv

Source: https://www.kaggle.com/kingburrito666/shakespeare-plays