Coursework for the seminar Grundlagen der Computerlinguistik III at the University of Erlangen
Winter semester 2022-2023
Supervised by Prof. Stephanie Evert, Lehrstuhl für Korpus- und Computerlinguistik
The project is a shared task from SemEval-2023 dedicated to the recognition of entities in legal documents (Legal Named Entity Recognizer).
Link to the shared task

1. The Tokenizer.ipynb

(also see
Uses a TreebankWordTokenizer to convert the annotated judgement texts from JSON into pandas DataFrames.
The DataFrames are stored in transitional_data/tokenized_train.csv and transitional_data/tokenized_dev.csv.
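The conversion step can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the record layout (`text`, `entities` with character offsets), the column names and the token-level label scheme are assumptions made for the example. `span_tokenize` is used so that each token keeps its character offsets and can be matched against the annotated spans.

```python
import pandas as pd
from nltk.tokenize import TreebankWordTokenizer

# Hypothetical record layout: raw text plus character-offset annotations.
record = {
    "text": "The Supreme Court of India dismissed the appeal.",
    "entities": [{"start": 4, "end": 26, "label": "COURT"}],
}


def tokenize_record(record):
    """Return one DataFrame row per token, labelled "o" outside any entity."""
    text = record["text"]
    rows = []
    for start, end in TreebankWordTokenizer().span_tokenize(text):
        label = "o"
        for ent in record["entities"]:
            # A token belongs to an entity if it lies fully inside its span.
            if start >= ent["start"] and end <= ent["end"]:
                label = ent["label"]
                break
        rows.append({"Token": text[start:end], "Start": start, "End": end, "Label": label})
    return pd.DataFrame(rows)


df = tokenize_record(record)
print(df)
```

A DataFrame built this way could then be written out with `df.to_csv(...)`.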

2. The POS_Tagger.ipynb

(also see
Uses two POS taggers to add the tags and the lemma of each token (one row per token) to the DataFrame.
The first is the standard tagger provided by nltk (a pretrained PerceptronTagger).
The second is the TreeTagger, configured with the Penn Treebank.
The TreeTagger also adds the lemma of each token to the DataFrame.
The extended DataFrames are stored in "transitional_data/tagged_train_filled.csv" and "transitional_data/tagged_dev_filled.csv". Empty cells (np.NaN) in the DataFrames are replaced with "0".
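The final filling step might look like the sketch below. The tagger outputs are hard-coded stand-ins and the column names (`POS_nltk`, `POS_tt`, `Lemma`) are hypothetical; only the replacement of empty cells with "0" is taken from the description above.

```python
import pandas as pd

# Stand-in tagger results; None marks a token for which no result was returned.
tokens = ["The", "court", "dismissed", "it"]
perceptron_tags = ["DT", "NN", "VBD", "PRP"]
treetagger_out = [("DT", "the"), ("NN", "court"), ("VBD", "dismiss"), (None, None)]

df = pd.DataFrame({"Token": tokens, "POS_nltk": perceptron_tags})
df["POS_tt"] = [tag for tag, _ in treetagger_out]
df["Lemma"] = [lemma for _, lemma in treetagger_out]

# Replace every missing cell with the string "0" so later steps see no NaN values.
filled = df.fillna("0")
print(filled)
```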

About the TreeTagger

Link to the installation instructions for the TreeTagger
Tutorial on using the TreeTagger in Python
I have installed the TreeTagger directly in the project folder under the name "TreeTagger". The installation includes the TreeTagger wrapper and is configured with the Penn Treebank.

3. Feature_Matrix.ipynb

The purpose of this notebook is to enlarge the DataFrame and give the machine learning model more context information,
e.g. the tokens to the left and right and their POS tags and lemmas.
Because the TreeTagger in the previous notebook already provides a POS tag for every token, the prefix, suffix and other token-level features turned out to be redundant and are not included in the final feature matrix.
The "trigram" processing (adding the labels of the previous two tokens to the feature matrix: the gold labels for train, and the dynamically predicted labels for dev)
only made the model much worse.
According to the latest results, the feature matrix works best with the following columns:
Token, POSTag and Lemma of the token itself, plus the same three features for its L1, L2, R1 and R2 neighbours,
WITHOUT any affixes, other features or preceding labels.
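One way to build such neighbour columns is to shift each feature column by one and two rows. The sketch below assumes a `Sentence` column so that context does not leak across sentence boundaries, and uses "0" as padding. Both choices, as well as the generated column names, are assumptions for illustration and may differ from the notebook.

```python
import pandas as pd

df = pd.DataFrame({
    "Sentence": [0, 0, 0, 0, 1, 1],  # hypothetical sentence id per token
    "Token": ["The", "court", "dismissed", "it", "Appeal", "allowed"],
    "POSTag": ["DT", "NN", "VBD", "PRP", "NN", "VBN"],
    "Lemma": ["the", "court", "dismiss", "it", "appeal", "allow"],
})


def add_context(df, columns=("Token", "POSTag", "Lemma"), window=2, pad="0"):
    """Add L1..Lk and R1..Rk neighbour columns for every feature column."""
    out = df.copy()
    grouped = df.groupby("Sentence")
    for col in columns:
        for k in range(1, window + 1):
            # shift(k) looks k rows back (left), shift(-k) looks k rows ahead (right).
            out[f"{col}_L{k}"] = grouped[col].shift(k).fillna(pad)
            out[f"{col}_R{k}"] = grouped[col].shift(-k).fillna(pad)
    return out


features = add_context(df)
print(features.loc[1, ["Token_L1", "Token", "Token_R1", "Token_R2"]].tolist())
```

With three features and four neighbour positions, this yields 12 context columns on top of the token's own three.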

Best result: weighted average F1 score over all labels of 77% (token level)

(excluding "o", the outside label, which makes up 86% of all tokens)
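To make the metric concrete, here is a dependency-free sketch of a weighted average F1 score that leaves out the "o" label. It is meant as an illustration of the computation, not as the notebook's code; the function name `weighted_f1` and the toy labels are made up for the example.

```python
from collections import Counter


def weighted_f1(y_true, y_pred, ignore=("o",)):
    """Per-label F1, averaged with weights proportional to each label's gold count."""
    labels = sorted(set(y_true) - set(ignore))
    support = Counter(y_true)
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold, pred in zip(y_true, y_pred):
        if gold == pred:
            tp[gold] += 1
        else:
            fp[pred] += 1  # predicted label was wrong here
            fn[gold] += 1  # gold label was missed here
    total = sum(support[label] for label in labels)
    if total == 0:
        return 0.0
    score = 0.0
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denom if denom else 0.0
        score += f1 * support[label] / total
    return score


y_true = ["o", "o", "COURT", "COURT", "JUDGE", "o", "JUDGE"]
y_pred = ["o", "COURT", "COURT", "COURT", "JUDGE", "o", "o"]
score = weighted_f1(y_true, y_pred)
print(round(score, 4))
```

Because "o" dominates the data, including it would push the average towards the score of the majority label and hide how well the actual entity labels are recognized.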

4. provides functions which can be used to compare the results of different models.

  1. get_all_labels
    extracts all labels together with their unique sequence numbers from a y list, as preparation for later use.

  2. It provides two evaluation methods.
    2.1) The first is shown by get_classify_report.
    The classification report reflects the accuracy of the classification alone, without considering the detection rate.
    2.2) The second method shows the recognition accuracy in the strict sense (get_recognition_report).
    The recognition report shows how many entities are both detected with exactly the right length and classified correctly.

  3. get_confusion_matrix
    returns a confusion matrix in three categories: juridical_person, formats, natural_person
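The idea behind the strict recognition report can be illustrated as follows. This sketch is not the implementation of get_recognition_report: the helper names are hypothetical, and it assumes plain token labels without B-/I- prefixes, so two directly adjacent entities of the same type would be merged into one span.

```python
def extract_entities(labels, outside="o"):
    """Return a set of (start, end, label) spans; end is exclusive."""
    labels = list(labels)
    spans = set()
    start = None
    # A trailing outside label closes an entity that runs to the end of the list.
    for i, label in enumerate(labels + [outside]):
        if start is not None and label != labels[start]:
            spans.add((start, i, labels[start]))
            start = None
        if start is None and label != outside:
            start = i
    return spans


def strict_scores(y_true, y_pred):
    """An entity counts only if its boundaries and its label both match."""
    gold, pred = extract_entities(y_true), extract_entities(y_pred)
    hits = len(gold & pred)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1


y_true = ["o", "COURT", "COURT", "COURT", "o", "JUDGE", "JUDGE"]
y_pred = ["o", "COURT", "COURT", "o", "o", "JUDGE", "JUDGE"]
print(strict_scores(y_true, y_pred))
```

In this toy example the predicted COURT entity is one token too short, so it earns no credit at entity level even though two of its three tokens are labelled correctly.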

5. Model_Selection.ipynb

This notebook compares the results of a support vector machine (SVM) and the sklearn_crfsuite model, both after hyperparameter tuning.
The sklearn_crfsuite model achieved a clearly better result than the SVM.
SVM: weighted average F1 score for strict recognition 67% (entity level)
crfsuite: weighted average F1 score for strict recognition 75% (entity level)
The visualization uses spaCy's visualizer tools to mark the entities in the text in color.
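One practical difference between the two models is the input format: an SVM takes one flat feature row per token, whereas sklearn_crfsuite expects a list of sentences, each a list of per-token feature dicts, with labels in the same nested shape. The helper below is a hypothetical sketch of that regrouping; the `Sentence` and `Label` keys are assumed names, and the notebook may prepare its data differently.

```python
from itertools import groupby


def rows_to_sequences(rows, feature_keys, sentence_key="Sentence", label_key="Label"):
    """Group flat token rows into per-sentence feature dicts and label lists."""
    X, y = [], []
    # groupby only merges consecutive rows, so rows must be in document order.
    for _, group in groupby(rows, key=lambda row: row[sentence_key]):
        group = list(group)
        X.append([{key: row[key] for key in feature_keys} for row in group])
        y.append([row[label_key] for row in group])
    return X, y


rows = [
    {"Sentence": 0, "Token": "The", "POSTag": "DT", "Label": "o"},
    {"Sentence": 0, "Token": "court", "POSTag": "NN", "Label": "COURT"},
    {"Sentence": 1, "Token": "Appeal", "POSTag": "NN", "Label": "o"},
]
X, y = rows_to_sequences(rows, ["Token", "POSTag"])
print(y)

# With sklearn_crfsuite installed, X and y in this shape could be passed to
# sklearn_crfsuite.CRF().fit(X, y); hyperparameters are omitted in this sketch.
```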

6. Project Report (in German)

The report, together with the complete LaTeX document, can be found in the file "Project_Report".