3. Named Entity Recognition

Saumya Shah edited this page Aug 14, 2018 · 1 revision

Overview

Named Entity Recognition normally requires a set of defined entities to train on. Most well-known NER systems, including the Stanford NER model, have been trained on a large dataset of raw text and learn from the annotated entities in that text. Every NER model trains on a defined set of entities. In the Free UK Genealogy use case, the entities that needed to be recognised were surname, forenames, location, county, death date, death location, death county and relations. For this project, curating a substantial set of entity values was not possible due to the scarcity of data. For experimentation, and as an initial attempt at the problem, I made use of two tools - the Stanford NLP NER and the SpaCy NER model. After comparing the two, it was apparent that the SpaCy NER model outperformed the Stanford model.

The default NER model provided by SpaCy was trained on a corpus annotated with general-purpose entity labels such as PERSON, ORG, GPE and DATE. Since these are not the entities the Probate Parsing project is looking for, the model had to be specifically trained on the probate books data.
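The mismatch between the two label schemes can be made concrete with a quick check. In this sketch, the SpaCy defaults are the standard OntoNotes labels shipped with its English models, and the probate labels are made-up identifiers based on the entity list given above:

```python
# Standard entity labels used by SpaCy's pre-trained English NER models.
SPACY_DEFAULT_LABELS = {
    "PERSON", "NORP", "FAC", "ORG", "GPE", "LOC", "PRODUCT", "EVENT",
    "WORK_OF_ART", "LAW", "LANGUAGE", "DATE", "TIME", "PERCENT",
    "MONEY", "QUANTITY", "ORDINAL", "CARDINAL",
}

# Hypothetical label names for the probate entities listed above.
PROBATE_LABELS = {
    "SURNAME", "FORENAMES", "LOCATION", "COUNTY", "DEATH_DATE",
    "DEATH_LOCATION", "DEATH_COUNTY", "RELATION",
}

# None of the probate labels exist in the default scheme, which is
# why the model has to be retrained on annotated probate text.
missing = PROBATE_LABELS - SPACY_DEFAULT_LABELS
print(sorted(missing))
```

Every probate label is missing from the default scheme, so retraining (rather than reusing the stock model) is unavoidable.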

Some important questions that need to be answered at this stage are:

What is NER?

The task of identifying proper names of people, organizations, locations, or other entities is a subtask of information extraction from natural-language documents. In school, we were taught that a proper noun is “a specific person, place, or thing,” extending the definition of a concrete noun. Unfortunately, this seemingly simple mnemonic masks an extremely complex computational-linguistic task: the extraction of named entities, e.g. persons, organizations, or locations, from corpora.

(ORG S.E.C.) chief (PER Mary Shapiro) to leave (LOC Washington) in December.

This sentence contains three named entities that demonstrate many of the complications associated with named entity recognition. First, S.E.C. is an acronym for the Securities and Exchange Commission, which is an organization. The two words “Mary Shapiro” indicate a single person, and Washington, in this case, is a location and not a person's name. Note also that the token "chief" is not included in the person tag, although it very well could be.
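One common way to represent such annotations in training data is as character-offset spans over the raw sentence, which is also the shape SpaCy's training examples take. A small sketch, with the offsets computed rather than hand-written:

```python
# The example sentence, annotated as (start, end, label) character spans.
text = "S.E.C. chief Mary Shapiro to leave Washington in December."

entities = []
for surface, label in [("S.E.C.", "ORG"), ("Mary Shapiro", "PER"), ("Washington", "LOC")]:
    start = text.index(surface)  # locate each entity's surface form
    entities.append((start, start + len(surface), label))

print(entities)  # [(0, 6, 'ORG'), (13, 25, 'PER'), (35, 45, 'LOC')]
```

Computing offsets this way avoids off-by-one mistakes that silently corrupt training data, a frequent problem when spans are annotated by hand.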

What is an NER model?

Stanford NER is also referred to as a CRF (Conditional Random Field) classifier, because linear-chain Conditional Random Field sequence models are what the software implements. We can train our own custom models with our own labeled dataset for various applications. Conditional Random Fields are discriminative models used for predicting sequences. They use contextual information from previous labels, increasing the amount of information the model has for making a good prediction.
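The "contextual information" a linear-chain CRF uses comes from features extracted around each token. The sketch below shows the general idea; the feature names are illustrative, not Stanford NER's actual feature templates:

```python
# A minimal sketch of contextual feature extraction for a CRF-style
# sequence tagger; feature names here are invented for illustration.
def token_features(tokens, i):
    word = tokens[i]
    feats = {
        "word.lower": word.lower(),
        "word.istitle": word.istitle(),
        "word.isupper": word.isupper(),
    }
    # Neighbouring tokens are what let the model exploit sequence
    # context rather than classifying each word in isolation.
    feats["prev.word"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next.word"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats

tokens = "S.E.C. chief Mary Shapiro to leave Washington".split()
print(token_features(tokens, 2))  # features for the token "Mary"
```

A real CRF trainer (e.g. the one inside Stanford NER) would learn weights over thousands of such features jointly with label-to-label transition scores.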

What does training an NER model mean?

A final machine learning model is a model that you use to make predictions on new data. That is, given new examples of input data, you want to use the model to predict the expected output. This may be a classification (assign a label) or a regression (a real value). For example, whether the photo is a picture of a dog or a cat or the estimated number of sales for tomorrow.

The goal of your machine learning project is to arrive at a final model that performs the best, where “best” is defined by:

  1. Data: the historical data that you have available.
  2. Time: the time you have to spend on the project.
  3. Procedure: the data preparation steps, algorithm or algorithms, and the chosen algorithm configurations.

In your project, you gather the data, spend the time you have, and discover the data preparation procedures, the algorithm to use, and how to configure it.

Why do we use train and test sets?

Creating a train and test split of your dataset is one method to quickly evaluate the performance of an algorithm on your problem. The training dataset is used to prepare, i.e. train, a model. We pretend the test dataset is new data where the output values are withheld from the algorithm. We gather predictions from the trained model on the inputs from the test dataset and compare them to the withheld output values of the test set.

Comparing the predictions and withheld outputs on the test dataset allows us to compute a performance measure for the model on the test dataset. This is an estimate of the skill of the algorithm trained on the problem when making predictions on unseen data.
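The procedure above can be sketched end to end on toy data. Everything in this example is invented: the tokens, the labels, the split, and the "model" (a stand-in title-case rule rather than a fitted classifier):

```python
# Toy labelled data: (token, gold label) pairs, invented for illustration.
data = [
    ("Exeter", "LOC"), ("died", "O"), ("Devon", "LOC"), ("probate", "O"),
    ("Bristol", "LOC"), ("the", "O"), ("of", "O"), ("london", "LOC"),
]
split = int(len(data) * 0.75)  # 6 training examples, 2 held out
train, test = data[:split], data[split:]

# Stand-in "model": a title-case rule; a real model would be fit on `train`.
def predict(token):
    return "LOC" if token.istitle() else "O"

# Score predictions against the withheld labels of the test set.
correct = sum(predict(tok) == label for tok, label in test)
print(f"test accuracy: {correct}/{len(test)}")  # the lowercase "london" is missed
```

The held-out score (1/2 here, because the rule misses the lowercase "london") is exactly the "estimate of skill on unseen data" the paragraph above describes.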

Basic NER and its drawbacks

Running SpaCy's inbuilt pre-trained NER model produces the following outcome. The output is not at all probate-specific, hence we train the model from scratch.

[Image: Basic NER output]
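Training from scratch means supplying examples annotated with the probate entities. The sketch below shows the `(text, {"entities": [...]})` training-example shape SpaCy uses; the sentence, label names, and spans are invented for illustration (the real annotations come from the probate books):

```python
# A made-up probate-style sentence annotated with character-offset spans.
text = "SMITH John of Exeter Devon died 4 March 1890"

TRAIN_EXAMPLE = (
    text,
    {"entities": [
        (0, 5, "SURNAME"),
        (6, 10, "FORENAMES"),
        (14, 20, "LOCATION"),
        (21, 26, "COUNTY"),
        (32, 44, "DEATH_DATE"),
    ]},
)

# Sanity-check that each span's offsets cut out the intended surface text.
for start, end, label in TRAIN_EXAMPLE[1]["entities"]:
    print(label, repr(text[start:end]))
```

Checking that each span slices out the intended surface string is worth doing before training, since misaligned offsets are silently skipped or mislearned.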

Implementation

To take a look at the implementation code, usage and output sample, click here.