Named Entity Recognition and Linking

Extracting and linking entities to Wikipedia using spaCy and transformer-based BLINK models.

This assignment was developed for the Data Semantics course in the Master’s Degree in Data Science.

The goal was to apply a spaCy Named Entity Recognition pipeline to a text assigned individually to each student. We were then asked to evaluate the extracted spans by comparing them with a gold standard, using different matching criteria.
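As a rough illustration of this extraction step, the sketch below runs a pretrained spaCy pipeline and collects the predicted spans; the model name and the input sentence are placeholders rather than the actual assignment text.

```python
# Illustrative sketch of the NER extraction step; en_core_web_sm and the
# sample sentence are placeholders, not the assignment's actual text.
import spacy

nlp = spacy.load("en_core_web_sm")  # requires: python -m spacy download en_core_web_sm
text = "Tang Shubei met a spokesman of the Foreign Ministry to discuss Taiwan."

doc = nlp(text)
predicted = [(ent.text, ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]

for mention, start, end, label in predicted:
    print(f"{mention!r:30} [{start}:{end}] -> {label}")
```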

Similar metrics were computed to assess the correctness of the predicted entity types (person, location, organization, date, and so on). We also performed a qualitative error analysis of both the span and the type predictions.
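To make the matching criteria concrete, here is a minimal sketch of strict versus partial span matching and the resulting precision, recall, and F1; the helper functions and the example spans are illustrative assumptions, not the assignment's exact evaluation code.

```python
# Illustrative span evaluation under two matching criteria; the spans below
# are made up, and the assignment's exact rules may differ.
def strict_match(pred, gold):
    # Identical character boundaries.
    return pred == gold

def partial_match(pred, gold):
    # Any character overlap between the two spans.
    return pred[0] < gold[1] and gold[0] < pred[1]

def precision_recall_f1(predictions, gold_spans, match):
    matched_pred = sum(any(match(p, g) for g in gold_spans) for p in predictions)
    matched_gold = sum(any(match(p, g) for p in predictions) for g in gold_spans)
    precision = matched_pred / len(predictions) if predictions else 0.0
    recall = matched_gold / len(gold_spans) if gold_spans else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [(0, 11), (35, 51), (63, 69)]   # (start_char, end_char) pairs
pred = [(0, 11), (39, 51)]
print(precision_recall_f1(pred, gold, strict_match))   # strict boundaries
print(precision_recall_f1(pred, gold, partial_match))  # relaxed, overlap-based
```

Type correctness can be scored the same way by additionally requiring the predicted label to equal the gold label inside the match function.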

In the second part of the assignment, we linked the identified entities to their corresponding Wikipedia pages using three approaches:
  1. A naive baseline that looks up each mention string on Wikipedia without considering its context (see the sketch after this list).
  2. The BLINK bi-encoder, which retrieves candidate entities by encoding the mention together with its surrounding context.
  3. The full BLINK pipeline, in which a cross-encoder re-ranks the bi-encoder's candidates.
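A context-free baseline along these lines could be sketched with the `wikipedia` package; the exact heuristic used in the assignment may differ, and the lookup below is only one plausible implementation.

```python
# Hypothetical sketch of a naive, context-free linker using the `wikipedia`
# package; the assignment's actual baseline may differ in detail.
import wikipedia

def naive_link(mention):
    # Take the top search hit for the surface form alone, ignoring context.
    results = wikipedia.search(mention)
    if not results:
        return None
    try:
        return wikipedia.page(results[0], auto_suggest=False).url
    except wikipedia.exceptions.DisambiguationError as err:
        # If we land on a disambiguation page, fall back to its first option.
        return wikipedia.page(err.options[0], auto_suggest=False).url

print(naive_link("Foreign Ministry"))
```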

After evaluating the linking accuracy with appropriate metrics, we also performed a qualitative analysis of the results.
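Linking accuracy can be read here as the share of gold-linked mentions whose predicted page matches the gold page; below is a minimal sketch, assuming both sides are normalized Wikipedia titles keyed by mention id (the example values are made up).

```python
# Minimal accuracy check for entity linking; the dictionaries below are
# illustrative, assuming mention ids map to normalized Wikipedia titles.
def linking_accuracy(predicted, gold):
    correct = sum(1 for mid, title in gold.items() if predicted.get(mid) == title)
    return correct / len(gold) if gold else 0.0

gold = {0: "Ministry of Foreign Affairs of the People's Republic of China", 1: "Taiwan"}
pred = {0: "Ministry of Foreign Affairs (Taiwan)", 1: "Taiwan"}
print(linking_accuracy(pred, gold))  # 0.5
```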

The results showed a clear performance advantage of the two BLINK-based approaches over the naive method. Some cases were particularly insightful. For example, when linking the phrase "Foreign Ministry":
  1. The naive method linked it to the general "Ministry of Foreign Affairs" page.
  2. The bi-encoder approach tried to disambiguate it based on context but incorrectly linked it to "Ministry of Foreign Affairs (Taiwan)", likely influenced by the nearby mention of "Taiwan".
  3. The full BLINK model correctly linked it to "Ministry of Foreign Affairs of the People's Republic of China", showing its superior contextual understanding.
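For reference, the two BLINK configurations differ mainly in whether the cross-encoder re-ranking step is enabled. The sketch below follows the usage documented in the facebookresearch/BLINK repository; the model paths assume the pretrained weights have been downloaded into `models/`, and the mention context is a placeholder rather than the assignment text.

```python
# Sketch of BLINK linking following the repository's documented usage; paths
# and the example mention are assumptions, not the assignment's actual setup.
import argparse
import blink.main_dense as main_dense

models_path = "models/"
config = {
    "test_entities": None,
    "test_mentions": None,
    "interactive": False,
    "top_k": 10,
    "biencoder_model": models_path + "biencoder_wiki_large.bin",
    "biencoder_config": models_path + "biencoder_wiki_large.json",
    "entity_catalogue": models_path + "entity.jsonl",
    "entity_encoding": models_path + "all_entities_large.t7",
    "crossencoder_model": models_path + "crossencoder_wiki_large.bin",
    "crossencoder_config": models_path + "crossencoder_wiki_large.json",
    "fast": False,       # True = bi-encoder only; False = cross-encoder re-ranking
    "output_path": "logs/",
}
args = argparse.Namespace(**config)
models = main_dense.load_models(args, logger=None)

# Each mention is passed with its left and right context, which is what allows
# the model to disambiguate a context-dependent phrase like "Foreign Ministry".
data_to_link = [{
    "id": 0,
    "label": "unknown",
    "label_id": -1,
    "context_left": "a spokesman for the ".lower(),
    "mention": "Foreign Ministry".lower(),
    "context_right": " commented on cross-strait relations with Taiwan.".lower(),
}]

_, _, _, _, _, predictions, scores = main_dense.run(args, None, *models, test_data=data_to_link)
print(predictions[0][:3])  # top candidate Wikipedia titles for the mention
```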
An even more interesting case occurred with the entity "Tang Shubei"; for more details on that case, the code, and the methodology, you can check the downloadable Jupyter Notebook or visit my GitHub page.

Tags

Data Semantics, Named Entity Linking, Named Entity Recognition, BLINK, spaCy, Python