Named Entity Recognition and Linking
Extracting and linking entities to Wikipedia using spaCy and transformer-based BLINK models.
This assignment was developed for the Data Semantics course in the Master’s Degree in Data Science.
The goal was to apply a spaCy Named Entity Recognition pipeline to a text assigned individually to each student. We were then asked to evaluate the extracted spans by comparing them with a gold standard, using different matching criteria.
Similar metrics were computed to assess the correctness of the predicted entity types (such as person, location, organization, date, etc.). We also performed a qualitative error analysis of the span and type predictions.
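The span and type evaluation described above can be sketched as follows. The spans, labels, and the two matching criteria (strict boundary match vs. any-overlap) are illustrative stand-ins, not the exact data or criteria from the assignment:

```python
# Hypothetical spans as (start_char, end_char, label) triples, standing in
# for spaCy predictions and a gold standard.
gold = [(0, 11, "PERSON"), (25, 31, "GPE"), (40, 55, "ORG")]
pred = [(0, 11, "PERSON"), (25, 33, "GPE"), (60, 64, "DATE")]

def exact_match(p, g):
    """Strict criterion: boundaries must coincide exactly."""
    return p[0] == g[0] and p[1] == g[1]

def partial_match(p, g):
    """Relaxed criterion: any character overlap counts as a match."""
    return p[0] < g[1] and g[0] < p[1]

def span_prf(pred, gold, match):
    """Precision, recall, and F1 of predicted spans under a matching criterion."""
    tp_pred = sum(any(match(p, g) for g in gold) for p in pred)
    precision = tp_pred / len(pred) if pred else 0.0
    recall = sum(any(match(p, g) for p in pred) for g in gold) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def type_accuracy(pred, gold, match):
    """Among matched spans, how often the predicted entity type agrees."""
    matched = [(p, g) for p in pred for g in gold if match(p, g)]
    if not matched:
        return 0.0
    return sum(p[2] == g[2] for p, g in matched) / len(matched)

print(span_prf(pred, gold, exact_match))    # strict boundaries
print(span_prf(pred, gold, partial_match))  # relaxed, overlap-based
print(type_accuracy(pred, gold, partial_match))
```

Separating the matching criterion from the metric computation makes it easy to report both strict and relaxed scores on the same predictions.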
In the second part of the assignment, we linked the identified entities to their corresponding Wikipedia pages using three approaches:
- A naive string-matching method
- A bi-encoder version of the BLINK model
- The full BLINK model with both bi-encoder and cross-encoder
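To make the contrast concrete, the naive baseline can be sketched as a context-free lookup of the mention string against Wikipedia page titles, here scored by token overlap. The title inventory below is a small invented sample, not the real candidate set used in the assignment:

```python
# Toy inventory of Wikipedia page titles; the real assignment matched
# against Wikipedia itself, so these entries are illustrative only.
TITLES = [
    "Ministry of Foreign Affairs",
    "Ministry of Foreign Affairs (Taiwan)",
    "Ministry of Foreign Affairs of the People's Republic of China",
    "Qiandao Lake incident",
]

def naive_link(mention):
    """Context-free linking: pick the title with the highest Jaccard
    overlap between the mention's tokens and the title's tokens."""
    tokens = set(mention.lower().split())

    def score(title):
        title_tokens = set(title.lower().replace("(", " ").replace(")", " ").split())
        return len(tokens & title_tokens) / len(tokens | title_tokens)

    best = max(TITLES, key=score)
    return best if score(best) > 0 else None

print(naive_link("Foreign Ministry"))  # → "Ministry of Foreign Affairs"
```

Because the score sees only the mention string, the generic title wins and the surrounding text (e.g. a nearby "Taiwan" or "China") cannot influence the choice — exactly the weakness the BLINK approaches address.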
The results showed a clear performance advantage of the two BLINK-based approaches over the naive method. Some cases were particularly insightful. For example, when linking the phrase "Foreign Ministry":
- The naive method linked it to the general "Ministry of Foreign Affairs" page.
- The bi-encoder approach tried to disambiguate it based on context but incorrectly linked it to "Ministry of Foreign Affairs (Taiwan)", likely influenced by the nearby mention of "Taiwan".
- The full BLINK model correctly linked it to "Ministry of Foreign Affairs of the People's Republic of China", showing its superior contextual understanding.
Another revealing case was the mention of "Tang Shubei":
- Both BLINK approaches incorrectly linked him to "Tang Fei", a different person.
- Surprisingly, the naive method linked him to the "Qiandao Lake incident" page. Although this is technically incorrect (it links a person to an event), Tang Shubei is actually mentioned in that article as the ARATS vice president. So, in a way, the naive approach surfaced the only Wikipedia page that references him.
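The contextual disambiguation seen in these examples works because both BLINK stages consume each mention together with its surrounding text. A sketch of packaging one mention in the dictionary format used by the `facebookresearch/BLINK` repository's `main_dense` interface follows; the field names come from that repository's README, while the example sentence and helper function are invented for illustration:

```python
def to_blink_input(text, start, end, mention_id=0):
    """Package a character-offset mention span as a BLINK test example.
    Field names follow the facebookresearch/BLINK README; "label" and
    "label_id" are placeholders, since the entity is unknown at inference."""
    return {
        "id": mention_id,
        "label": "unknown",
        "label_id": -1,
        "context_left": text[:start].lower(),
        "mention": text[start:end].lower(),
        "context_right": text[end:].lower(),
    }

# Invented sentence echoing the "Foreign Ministry" example above.
sentence = "The Foreign Ministry said Taiwan should not overreact."
example = to_blink_input(sentence, 4, 20)
print(example["mention"])       # → "foreign ministry"
print(example["context_left"])  # → "the "
```

Splitting the text into `context_left` / `mention` / `context_right` is what lets the encoders condition on cues like "Taiwan" when ranking candidate entities.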