Named Entity Recognition and Linking
Extracting and linking entities to Wikipedia using spaCy and transformer-based BLINK models.
This assignment was developed for the Data Semantics course in the Master’s Degree in Data Science.
The goal was to apply a spaCy Named Entity Recognition pipeline to a text assigned individually to each student. We were then asked to evaluate the extracted spans by comparing them with a gold standard, using different matching criteria.
Similar metrics were computed to assess the correctness of the predicted entity types (such as person, location, organization, date, etc.). We also performed a qualitative error analysis of the span and type predictions.
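The span and type evaluation described above can be sketched as follows. The spans, labels, and the two matching criteria (strict boundary match vs. any-overlap) are illustrative stand-ins, not the exact data or criteria from the assignment:

```python
# Hypothetical spans as (start_char, end_char, label) triples, standing in
# for spaCy predictions and a gold standard.
gold = [(0, 11, "PERSON"), (25, 31, "GPE"), (40, 55, "ORG")]
pred = [(0, 11, "PERSON"), (25, 33, "GPE"), (60, 64, "DATE")]

def exact_match(p, g):
    """Strict criterion: boundaries must coincide exactly."""
    return p[0] == g[0] and p[1] == g[1]

def partial_match(p, g):
    """Relaxed criterion: any character overlap counts as a match."""
    return p[0] < g[1] and g[0] < p[1]

def span_prf(pred, gold, match):
    """Precision, recall, and F1 of predicted spans under a matching criterion."""
    tp_pred = sum(any(match(p, g) for g in gold) for p in pred)
    precision = tp_pred / len(pred) if pred else 0.0
    recall = sum(any(match(p, g) for p in pred) for g in gold) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def type_accuracy(pred, gold, match):
    """Among matched spans, how often the predicted entity type agrees."""
    matched = [(p, g) for p in pred for g in gold if match(p, g)]
    if not matched:
        return 0.0
    return sum(p[2] == g[2] for p, g in matched) / len(matched)

print(span_prf(pred, gold, exact_match))    # strict boundaries
print(span_prf(pred, gold, partial_match))  # relaxed, overlap-based
print(type_accuracy(pred, gold, partial_match))
```

Separating the matching criterion from the metric computation makes it easy to report both strict and relaxed scores on the same predictions.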
In the second part of the assignment, we linked the identified entities to their corresponding Wikipedia pages using three approaches:
- A naive string-matching method
- A bi-encoder version of the BLINK model
- The full BLINK model with both bi-encoder and cross-encoder
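To make the contrast concrete, the naive baseline can be sketched as a context-free lookup of the mention string against Wikipedia page titles, here scored by token overlap. The title inventory below is a small invented sample, not the real candidate set used in the assignment:

```python
# Toy inventory of Wikipedia page titles; the real assignment matched
# against Wikipedia itself, so these entries are illustrative only.
TITLES = [
    "Ministry of Foreign Affairs",
    "Ministry of Foreign Affairs (Taiwan)",
    "Ministry of Foreign Affairs of the People's Republic of China",
    "Qiandao Lake incident",
]

def naive_link(mention):
    """Context-free linking: pick the title with the highest Jaccard
    overlap between the mention's tokens and the title's tokens."""
    tokens = set(mention.lower().split())

    def score(title):
        title_tokens = set(title.lower().replace("(", " ").replace(")", " ").split())
        return len(tokens & title_tokens) / len(tokens | title_tokens)

    best = max(TITLES, key=score)
    return best if score(best) > 0 else None

print(naive_link("Foreign Ministry"))  # → "Ministry of Foreign Affairs"
```

Because the score sees only the mention string, the generic title wins and the surrounding text (e.g. a nearby "Taiwan" or "China") cannot influence the choice — exactly the weakness the BLINK approaches address.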
The results showed a clear performance advantage of the two BLINK-based approaches over the naive method. Some cases were particularly insightful. For example, when linking the phrase "Foreign Ministry":
- The naive method linked it to the general "Ministry of Foreign Affairs" page.
- The bi-encoder approach tried to disambiguate it based on context but incorrectly linked it to "Ministry of Foreign Affairs (Taiwan)", likely influenced by the nearby mention of "Taiwan".
- The full BLINK model correctly linked it to "Ministry of Foreign Affairs of the People's Republic of China", showing its superior contextual understanding.
Another revealing case was the mention of "Tang Shubei":
- Both BLINK approaches incorrectly linked him to "Tang Fei", a different person.
- Surprisingly, the naive method linked him to the "Qiandao Lake incident" page. Although this is technically incorrect (it links a person to an event), Tang Shubei is actually mentioned in that article as the ARATS vice president. So, in a way, the naive approach surfaced the only Wikipedia page that references him.
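The contextual disambiguation seen in these examples works because both BLINK stages consume each mention together with its surrounding text. A sketch of packaging one mention in the dictionary format used by the `facebookresearch/BLINK` repository's `main_dense` interface follows; the field names come from that repository's README, while the example sentence and helper function are invented for illustration:

```python
def to_blink_input(text, start, end, mention_id=0):
    """Package a character-offset mention span as a BLINK test example.
    Field names follow the facebookresearch/BLINK README; "label" and
    "label_id" are placeholders, since the entity is unknown at inference."""
    return {
        "id": mention_id,
        "label": "unknown",
        "label_id": -1,
        "context_left": text[:start].lower(),
        "mention": text[start:end].lower(),
        "context_right": text[end:].lower(),
    }

# Invented sentence echoing the "Foreign Ministry" example above.
sentence = "The Foreign Ministry said Taiwan should not overreact."
example = to_blink_input(sentence, 4, 20)
print(example["mention"])       # → "foreign ministry"
print(example["context_left"])  # → "the "
```

Splitting the text into `context_left` / `mention` / `context_right` is what lets the encoders condition on cues like "Taiwan" when ranking candidate entities.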