entity-linkings is a unified library for entity linking.
```shell
# from PyPI
pip install entity-linkings

# from source
git clone [email protected]:naist-nlp/entity-linkings.git
cd entity-linkings
pip install .

# for uv users
git clone [email protected]:naist-nlp/entity-linkings.git
cd entity-linkings
uv sync
```
entity-linkings provides two interfaces: a command-line interface (CLI) and a Python API.
The CLI can train, evaluate, and run entity linking systems from the command line.
To create an EL system, you must first build a candidate retriever with entitylinkings-train-retrieval.
In this example, the e5bm25 retriever is trained on a custom dataset.
```shell
entitylinkings-train-retrieval \
  --retriever_id e5bm25 \
  --train_file train.jsonl \
  --validation_file validation.jsonl \
  --dictionary_id_or_path dictionary.jsonl \
  --output_dir save_model/ \
  --num_hard_negatives 4 \
  --num_train_epochs 10 \
  --train_batch_size 8 \
  --validation_batch_size 16 \
  --config config.yaml \
  --wandb
```

Next, Entity Disambiguation (ED) and End-to-End Entity Linking (EL) systems can be trained with entitylinkings-train.
This example trains the FEVRY model with a custom candidate retriever.
```shell
entitylinkings-train \
  --model_type ed \
  --model_id fevry \
  --model_name_or_path google-bert/bert-base-uncased \
  --retriever_id e5bm25 \
  --retriever_model_name_or_path save_model/ \
  --dictionary_id_or_path dictionary.jsonl \
  --train_file train.jsonl \
  --validation_file validation.jsonl \
  --num_candidates 30 \
  --num_train_epochs 2 \
  --train_batch_size 8 \
  --validation_batch_size 16 \
  --output_dir save_fevry/ \
  --config config.yaml \
  --wandb
```

Finally, you can evaluate retrievers or EL systems with entitylinkings-eval-retrieval or entitylinkings-eval, respectively.
```shell
entitylinkings-eval-retrieval \
  --retriever_id <retriever_id> \
  --model_name_or_path save_model/ \
  --dictionary_id_or_path dictionary.jsonl \
  --test_file test.jsonl \
  --config config.yaml \
  --output_dir result/ \
  --test_batch_size 256 \
  --wandb
```

```shell
entitylinkings-eval \
  --model_type ed \
  --model_id fevry \
  --model_name_or_path save_fevry/ \
  --retriever_id e5bm25 \
  --retriever_model_name_or_path save_model/ \
  --dictionary_id_or_path dictionary.jsonl \
  --test_file test.jsonl \
  --config config.yaml \
  --output_dir result/ \
  --test_batch_size 256 \
  --wandb
```

You can change the arguments (e.g., context length) using a configuration file.
A config.yaml with default values can be generated via entitylinkings-gen-config.

```shell
entitylinkings-gen-config
```

This is an example of running ChatEL with the ZELDA candidate list via the Python API.
Valid IDs for get_retrievers() and get_models() can be found with get_retriever_ids() and get_model_ids(), respectively.
```python
from entity_linkings import get_retrievers, get_models, load_dictionary

# Load dictionary from a dictionary_id or local path
dictionary = load_dictionary('zelda')

# Load candidate retriever
retriever_cls = get_retrievers('zeldacl')
retriever = retriever_cls(
    dictionary,
    config=retriever_cls.Config()
)

# Set up the ED or EL model
model_cls = get_models('chatel')
model = model_cls(
    task='ed',
    retriever=retriever,
    config=model_cls.Config("gpt-4o")
)

# Prediction
sentence = "NAIST is in Ikoma."
spans = [(0, 5)]
predictions = model.predict(sentence, spans, top_k=1)
print("ID: ", predictions[0][0]["id"])
print("Title: ", predictions[0][0]["prediction"])
print("Score: ", predictions[0][0]["score"])
```

Please refer to the links below for instructions on how to run each model.
- BM25
- ZELDA Candidate List (Milich and Akbik, 2023)
- Dual Encoder Model
- Text Embedding Model
- E5+BM25 (Nakatani et al., 2025)
- FEVRY (Févry et al., 2020)
- BLINK (Wu et al., 2020)
- ExtEnD (Barba et al., 2022)
- FusionED (Wang et al., 2024)
- ChatEL (Ding et al., 2024)
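The `spans` argument in the API example above is a list of character offsets into the sentence. The following is a minimal standard-library sketch for deriving such offsets from known mention strings; the `mention_spans` helper is purely illustrative and not part of entity-linkings.

```python
def mention_spans(text, mentions):
    """Return (start, end) character spans for mentions, in order of appearance."""
    spans = []
    cursor = 0
    for mention in mentions:
        start = text.find(mention, cursor)
        if start == -1:
            raise ValueError(f"mention not found: {mention!r}")
        spans.append((start, start + len(mention)))
        cursor = start + len(mention)  # continue after this mention
    return spans

sentence = "NAIST is in Ikoma."
print(mention_spans(sentence, ["NAIST", "Ikoma"]))  # [(0, 5), (12, 17)]
```

The resulting tuples can be passed directly as `spans` to `model.predict`.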
| dictionary_id | Dataset | Language | Domain |
|---|---|---|---|
| kilt | KILT (Petroni et al., 2021) | English | Wikipedia |
| zelda | ZELDA (Milich and Akbik, 2023) | English | Wikipedia |
| zeshel | ZeshEL (Logeswaran et al., 2021) | English | Wikia |
If you want to use our packages with your custom ontologies, you need to convert them to the following format:

```json
{
  "id": "000011",
  "name": "NAIST",
  "description": "NAIST is located in Ikoma."
}
```
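For example, a custom ontology can be dumped to this JSONL format (one JSON object per line) with nothing but the standard library; the entry values and output file name below are illustrative:

```python
import json

# Illustrative custom ontology entries in the dictionary format above.
entries = [
    {"id": "000011", "name": "NAIST", "description": "NAIST is located in Ikoma."},
    {"id": "000012", "name": "Ikoma", "description": "Ikoma is a city in Nara Prefecture."},
]

# Write one JSON object per line, keeping non-ASCII characters readable.
with open("dictionary.jsonl", "w", encoding="utf-8") as f:
    for entry in entries:
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```

The resulting file can then be passed as `--dictionary_id_or_path dictionary.jsonl`.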
| dataset_id | Dataset | Domain | Language | Ontology | Train | Licence |
|---|---|---|---|---|---|---|
| kilt | KILT (Petroni et al., 2021) | Wikipedia | English | Wikipedia | ✅ | Unknown* |
| zelda | ZELDA (Milich and Akbik, 2023) | Wikimedia | English | Wikipedia | ✅ | Unknown* |
| msnbc | MSNBC (Cucerzan, 2007) | News | English | Wikipedia | | Unknown* |
| aquaint | AQUAINT (Milne and Witten, 2008) | News | English | Wikipedia | | Unknown* |
| ace2004 | ACE2004 (Ratinov et al., 2011) | News | English | Wikipedia | | Unknown* |
| kore50 | KORE50 (Hoffart et al., 2012) | News | English | Wikipedia | | CC BY-SA 3.0 |
| n3-r128 | N3-Reuters-128 (Röder et al., 2014) | News | English | Wikipedia | | GNU AGPL-3.0 |
| n3-r500 | N3-RSS-500 (Röder et al., 2014) | RSS | English | Wikipedia | | GNU AGPL-3.0 |
| derczynski | Derczynski (Derczynski et al., 2015) | | English | Wikipedia | | CC-BY 4.0 |
| oke-2015 | OKE-2015 (Nuzzolese et al., 2015) | News | English | Wikipedia | | Unknown* |
| oke-2016 | OKE-2016 (Nuzzolese et al., 2015) | News | English | Wikipedia | | Unknown* |
| wned-wiki | WNED-WIKI (Guo and Barbosa, 2018) | Wikipedia | English | Wikipedia | | Unknown |
| wned-cweb | WNED-CWEB (Guo and Barbosa, 2018) | Web | English | Wikipedia | | Apache License 2.0 |
| unseen | WikilinksNED Unseen-Mentions (Onoe and Durrett, 2020) | News | English | Wikipedia | ✅ | CC-BY 3.0* |
| tweeki | Tweeki EL (Harandizadeh and Singh, 2020) | | English | Wikipedia | | Apache License 2.0 |
| reddit-comments | Reddit EL (Botzer et al., 2021) | | English | Wikipedia | | CC-BY 4.0 |
| reddit-posts | Reddit EL (Botzer et al., 2021) | | English | Wikipedia | | CC-BY 4.0 |
| shadowlink-shadow | ShadowLink (Provatorova et al., 2021) | Wikipedia | English | Wikipedia | | Unknown* |
| shadowlink-top | ShadowLink (Provatorova et al., 2021) | Wikipedia | English | Wikipedia | | Unknown* |
| shadowlink-tail | ShadowLink (Provatorova et al., 2021) | Wikipedia | English | Wikipedia | | Unknown* |
| zeshel | ZeshEL (Logeswaran et al., 2021) | Wikia | English | Wikia | ✅ | CC-BY-SA |
| docred | Linked-DocRED (Genest et al., 2023) | News | English | Wikipedia | ✅ | CC-BY 4.0 |
- The original MSNBC (Cucerzan, 2007) is no longer available because the official link has expired. You can download the dataset from the official GERBIL code.
- The licensing status of ShadowLink and OKE-{2015,2016} for public use is uncertain, but they are provided in their official repositories.
- WikilinksNED Unseen-Mentions was created by splitting WikilinksNED, which is derived from the Wikilinks corpus, made available under CC-BY 3.0.
- The following datasets are not publicly available, or their availability is uncertain. If you want to evaluate these resources, please register with the LDC and convert the datasets to our format.
  - AIDA CoNLL-YAGO (Hoffart et al., 2011): you must sign the agreement to use the Reuters Corpus.
  - TACKBP-2010 (Ji et al., 2011): you must sign the Text Analysis Conference (TAC) Knowledge Base Population Evaluation License Agreement.
- Results for the ZeshEL/ZELDA benchmarks (aida-b, tweeki, reddit-*, shadowlink-*, and wned-*) across all models can be found in the Spreadsheet.
If you want to use our packages with your private dataset, you must convert it to the following format:

```json
{
  "id": "doc-001-P1",
  "text": "She graduated from NAIST.",
  "entities": [{"start": 19, "end": 24, "label": ["000011"]}]
}
```
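As a sketch, a record in this format can be built and sanity-checked with the standard library alone. The offsets appear to be 0-based and end-exclusive, consistent with the example above, where `text[19:24]` yields "NAIST"; this interpretation is an assumption from the sample, not a documented guarantee.

```python
# A document record in the dataset format shown above (values from the example).
record = {
    "id": "doc-001-P1",
    "text": "She graduated from NAIST.",
    "entities": [{"start": 19, "end": 24, "label": ["000011"]}],
}

# Sanity-check: each span should slice out the intended mention,
# assuming 0-based, end-exclusive character offsets.
for entity in record["entities"]:
    mention = record["text"][entity["start"]:entity["end"]]
    print(mention)  # NAIST
```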