How the models have been obtained is described in our paper.
The recommended Python version is 3.11. Consider using pyenv if that Python version is not available on your system.
Activate a virtual environment (virtualenv):
source venv/bin/activate
or (pyenv):
pyenv activate my-python-3.11-virtualenv
Update pip:
pip install -U pip
Install sbb_ner:
pip install git+https://github.com/qurator-spk/sbb_ner.git
Download the required models: https://qurator-data.de/sbb_ner/models.tar.gz
Extract the model archive:
tar -xzf models.tar.gz
Copy the config file into the working directory and set the USE_CUDA environment variable to True or False, depending on whether a GPU is available.
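The authoritative format of config.json is the file shipped with the repository; copy and adapt that file rather than writing one from scratch. Purely as a hypothetical sketch, assuming the configuration describes the models to be served with the same fields that the /models endpoint (see below) returns, a single model entry might look roughly like this:
{
  "default": true,
  "id": 1,
  "model_dir": "data/konvens2019/build-wd_0.03/bert-all-german-de-finetuned",
  "name": "DC-SBB + CONLL + GERMEVAL"
}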
Run the webapp directly (development mode):
env CONFIG=config.json env FLASK_APP=qurator/sbb_ner/webapp/app.py env FLASK_ENV=development env USE_CUDA=True/False flask run --host=0.0.0.0
For production purposes, use gunicorn instead:
env CONFIG=config.json env USE_CUDA=True/False gunicorn --bind 0.0.0.0:5000 qurator.sbb_ner.webapp.wsgi:app
Build and run the CPU Docker image:
docker build --build-arg http_proxy=$http_proxy -t qurator/webapp-ner-cpu -f Dockerfile.cpu .
docker run -ti --rm=true --mount type=bind,source=$(pwd)/data/konvens2019,target=/usr/src/qurator-sbb-ner/data/konvens2019 -p 5000:5000 qurator/webapp-ner-cpu
For the GPU image, make sure that your GPU is correctly set up and that nvidia-docker is installed, then build and run:
docker build --build-arg http_proxy=$http_proxy -t qurator/webapp-ner-gpu -f Dockerfile .
docker run -ti --rm=true --mount type=bind,source=$(pwd)/data/konvens2019,target=/usr/src/qurator-sbb-ner/data/konvens2019 -p 5000:5000 qurator/webapp-ner-gpu
The NER web interface is available at http://localhost:5000 .
Get available models:
curl http://localhost:5000/models
Output:
[
  {
    "default": true,
    "id": 1,
    "model_dir": "data/konvens2019/build-wd_0.03/bert-all-german-de-finetuned",
    "name": "DC-SBB + CONLL + GERMEVAL"
  },
  {
    "default": false,
    "id": 2,
    "model_dir": "data/konvens2019/build-on-all-german-de-finetuned/bert-sbb-de-finetuned",
    "name": "DC-SBB + CONLL + GERMEVAL + SBB"
  },
  {
    "default": false,
    "id": 3,
    "model_dir": "data/konvens2019/build-wd_0.03/bert-sbb-de-finetuned",
    "name": "DC-SBB + SBB"
  },
  {
    "default": false,
    "id": 4,
    "model_dir": "data/konvens2019/build-wd_0.03/bert-all-german-baseline",
    "name": "CONLL + GERMEVAL"
  }
]
Perform NER using model 1:
curl -d '{ "text": "Paris Hilton wohnt im Hilton Paris in Paris." }' -H "Content-Type: application/json" http://localhost:5000/ner/1
Output:
[
  [
    {
      "prediction": "B-PER",
      "word": "Paris"
    },
    {
      "prediction": "I-PER",
      "word": "Hilton"
    },
    {
      "prediction": "O",
      "word": "wohnt"
    },
    {
      "prediction": "O",
      "word": "im"
    },
    {
      "prediction": "B-ORG",
      "word": "Hilton"
    },
    {
      "prediction": "I-ORG",
      "word": "Paris"
    },
    {
      "prediction": "O",
      "word": "in"
    },
    {
      "prediction": "B-LOC",
      "word": "Paris"
    },
    {
      "prediction": "O",
      "word": "."
    }
  ]
]
The JSON above is the expected input format of the SBB named entity linking and disambiguation system.
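The endpoints can also be called programmatically. The following is a minimal sketch using the third-party requests package (an assumption, not a dependency of this repository); it queries the model list and then tags a sentence with model 1, assuming the service is running locally on port 5000 as above:

import requests

BASE_URL = "http://localhost:5000"  # adjust if the service runs elsewhere

# List the available models (same output as `curl http://localhost:5000/models`).
for model in requests.get(f"{BASE_URL}/models").json():
    print(model["id"], model["name"])

# Tag a sentence with model 1; the service expects a JSON body with a "text" field.
response = requests.post(
    f"{BASE_URL}/ner/1",
    json={"text": "Paris Hilton wohnt im Hilton Paris in Paris."},
)
response.raise_for_status()

# The result is a list of sentences, each a list of {"word": ..., "prediction": ...} dicts.
for sentence in response.json():
    for token in sentence:
        print(token["word"], token["prediction"])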
Read CoNLL 2003 NER ground-truth files from a directory and write the parsed data to a pandas DataFrame that is stored as a pickle.
compile_conll --help
Read GermEval .tsv files from a directory and write the parsed data to a pandas DataFrame that is stored as a pickle.
compile_germ_eval --help
Read Europeana historic NER ground-truth .bio files from a directory and write the parsed data to a pandas DataFrame that is stored as a pickle.
compile_europeana_historic --help
Read WikiNER files from a directory and write the parsed data to a pandas DataFrame that is stored as a pickle.
compile_wikiner --help
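Each compile_* tool writes its result as a pickled pandas DataFrame at the output path given on the command line (see the respective --help). A minimal sketch for inspecting such a file, with a hypothetical file name gt.pkl:

import pandas as pd

# "gt.pkl" is a placeholder; use the output path you passed to the compile_* tool.
df = pd.read_pickle("gt.pkl")

# Inspect the parsed ground truth.
print(df.columns)
print(df.head())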
Perform supervised training of BERT for NER as well as testing/cross-validation.
bert-ner --help
Collect a text corpus from fulltext data for use in BERT pre-training:
collectcorpus --help
Usage: collectcorpus [OPTIONS] FULLTEXT_FILE SELECTION_FILE CORPUS_FILE

  Reads the fulltext from a CSV or SQLITE3 file (see also altotool) and
  writes it to one big text file.

  FULLTEXT_FILE: The CSV or SQLITE3 file to read from.

  SELECTION_FILE: Consider only a subset of all pages that is defined by the
  DataFrame that is stored in <selection_file>.

  CORPUS_FILE: The output file that can be used by bert-pregenerate-trainingdata.

Options:
  --chunksize INTEGER     Process the corpus in chunks of <chunksize>.
                          default: 10**4
  --processes INTEGER     Number of parallel processes. default: 6
  --min-line-len INTEGER  Lower bound of line length in the output file.
                          default: 80
  --help                  Show this message and exit.
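A hypothetical invocation (all file names are placeholders):
collectcorpus --chunksize 10000 --processes 6 --min-line-len 80 fulltext.sqlite3 selection.pkl corpus.txt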
Generate data for BERT pre-training from a corpus text file where the documents are separated by an empty line (output of collectcorpus).
bert-pregenerate-trainingdata --help
Perform BERT pre-training on pre-generated data.
bert-finetune --help