This repository contains the methods to download, filter, and normalize the text data, mainly from KB's digitized newspapers.
To run the various functions that download, filter, and normalize the data, you will need to install and build some packages.
- kenlm https://github.com/kpu/kenlm
- lsh https://github.com/mattilyra/LSH
  - you might have to change the setup.py file: USE_CYTHON = True and extensions = cythonize(extensions, force=True)
- unidecode
- tokenizers
- sentence_splitter
- fasttext
  - download also the lid.176.bin model
- kblab https://github.com/Kungbib/kblab
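Before running the scripts, it can help to confirm that the packages above are visible to the Python interpreter. The following is a minimal sketch, not part of this repository; the import names in REQUIRED_MODULES are assumptions based on the package names and may differ from what each project actually installs.

```python
from importlib.util import find_spec

# Assumption: each package installs a top-level module with this name.
REQUIRED_MODULES = [
    "kenlm",
    "lsh",
    "unidecode",
    "tokenizers",
    "sentence_splitter",
    "fasttext",
    "kblab",
]


def missing_modules(names):
    """Return the module names that cannot be located by the import system."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules were found.")
```

The check only locates the modules and does not import them, so it will not detect a package that is installed but fails to build or load.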
To download the data of one specific type/tag:
python get_data.py --tag <tag> --location <download_folder> --login_file <login_config.yaml>
Some of the available tags are:
magazine
SOU
protokoll
proposition
issue
The data is downloaded into subfolders named after their tag with all larger
document collections (e.g. one scanned newspaper issue) saved as one json-object
per line, sorted into one file for each publication year.
The extracted text boxes, which we use as documents, are stored in each json-object under the key content.
data
├── issue
│ ├── 1995.jsonl
│ ├── 1996.jsonl
│ ├── 1997.jsonl
│ ├── 1998.jsonl
├── SOU
│ ├── 1995.jsonl
│ ├── 1996.jsonl
│ ├── 1997.jsonl
│ ├── 1998.jsonl
...
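The layout above can be walked with a few lines of Python. This is a sketch under stated assumptions rather than code from the repository: the helper names iter_collections and count_documents are hypothetical, and the value under the content key is assumed to be either a list of text boxes or a single string.

```python
import json
from pathlib import Path


def iter_collections(root):
    """Yield (tag, year, collection) for each JSON line in root/<tag>/<year>.jsonl."""
    for path in sorted(Path(root).glob("*/*.jsonl")):
        tag, year = path.parent.name, path.stem
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    yield tag, year, json.loads(line)


def count_documents(root):
    """Count text boxes per (tag, year), reading them from the content key."""
    counts = {}
    for tag, year, collection in iter_collections(root):
        # Assumption: content is a list of strings, or a single string.
        content = collection.get("content", [])
        n_docs = 1 if isinstance(content, str) else len(content)
        counts[(tag, year)] = counts.get((tag, year), 0) + n_docs
    return counts
```

Calling count_documents("data") would then give a quick overview of how many documents each tag and year contributes.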
The clean_data.py script is used to filter and normalize the downloaded json-files, keeping the json-format intact.
The filter_normalize_filter.sh script first filters the data to reduce the load for the normalizer, then normalizes what remains, and finally filters again after normalizing.
The various filters and normalizers that can be used are defined in data_filters.py and data_normalizers.py.
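The filter, normalize, filter order can be illustrated with a small sketch. None of the functions below are taken from data_filters.py or data_normalizers.py; min_length_filter, letter_ratio_filter, normalize_unicode and collapse_whitespace are hypothetical stand-ins, and their thresholds are arbitrary.

```python
import re
import unicodedata


def min_length_filter(text, min_chars=20):
    """Keep documents with at least min_chars characters (hypothetical filter)."""
    return len(text) >= min_chars


def letter_ratio_filter(text, min_ratio=0.5):
    """Keep documents where at least min_ratio of the characters are letters."""
    letters = sum(ch.isalpha() for ch in text)
    return bool(text) and letters / len(text) >= min_ratio


def normalize_unicode(text):
    """Apply NFKC normalization (hypothetical normalizer)."""
    return unicodedata.normalize("NFKC", text)


def collapse_whitespace(text):
    """Replace runs of whitespace with a single space and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def filter_normalize_filter(docs, filters, normalizers):
    """Filter cheaply first, normalize the survivors, then filter again."""
    for text in docs:
        if not all(keep(text) for keep in filters):
            continue  # dropped before the more expensive normalization
        for normalize in normalizers:
            text = normalize(text)
        if all(keep(text) for keep in filters):
            yield text
```

The second filtering pass matters because normalization can change a document enough to fail a filter it passed before, for example when collapsing whitespace makes it too short.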
The deduplication runs in three steps:
- find_duplicates.py: finding which documents are duplicates of each other
- kenlm_score.py: scoring every one of these documents with a KenLM model
- choose_best_duplicate.py: finally choosing the best duplicate with respect to the KenLM score
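The last step can be sketched as follows, assuming the first step yields pairs of duplicate document ids and the second step yields one score per id. This is an illustration, not the logic of choose_best_duplicate.py: the function names are hypothetical, and treating the highest score as the best is an assumption (KenLM reports log probabilities, so higher usually means more fluent, but length effects are ignored here).

```python
from collections import defaultdict


def group_duplicates(pairs):
    """Merge duplicate pairs into groups using a small union-find structure."""
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        while parent[item] != item:
            parent[item] = parent[parent[item]]  # path halving
            item = parent[item]
        return item

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    groups = defaultdict(set)
    for item in list(parent):
        groups[find(item)].add(item)
    return list(groups.values())


def choose_best(groups, scores):
    """Return the id with the highest score from each duplicate group."""
    # Assumption: a higher score marks the better duplicate.
    return {max(group, key=lambda doc_id: scores[doc_id]) for group in groups}
```

Grouping the pairs first means that if a duplicates b and b duplicates c, only one of the three documents is kept.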
To get rid of the meta-information stored in the json-object, run clean_data.py again with the json2txt flag, creating a text file with one document per line.
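The conversion can be pictured with the sketch below. It does not reproduce what the json2txt flag does internally; jsonl_to_text_lines is a hypothetical helper, and it assumes that content holds a list of strings (or a single string) and that line breaks inside a document may be replaced by spaces.

```python
import json


def jsonl_to_text_lines(jsonl_lines):
    """Yield one flattened document per item, dropping all other metadata."""
    for line in jsonl_lines:
        if not line.strip():
            continue  # skip blank lines
        content = json.loads(line).get("content", [])
        # Assumption: content is a list of strings, or a single string.
        if isinstance(content, str):
            content = [content]
        for text in content:
            flat = " ".join(text.split())  # remove newlines inside a document
            if flat:
                yield flat
```

Writing the result with one yielded string per line, for example via "\n".join(...), gives a text file in which each line is exactly one document.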