Named Entity Recognition and Linking
Recognizing entities in a text and linking them to their Wikipedia pages.
This assignment was developed for the Data Semantics course in the Master's Degree in Data Science.
The assignment asked us to apply a spaCy Named Entity Recognition pipeline to an individually assigned text.
We were then asked to evaluate the spans it found by comparing them with the "gold standard" spans, using different matching criteria.
Similar metrics were also computed taking into account whether the recognized entity type (person, location, organization, date, etc.) was correct.
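The span evaluation described above can be sketched as follows. This is a minimal illustration, not the code used in the assignment: the two matching criteria shown (exact boundaries and any character overlap), the `(start, end, label)` span representation, the function names, and the toy spans are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical span format: (start_char, end_char, label), end exclusive.
Span = Tuple[int, int, str]


def spans_match(pred: Span, gold: Span, criterion: str = "exact",
                check_type: bool = False) -> bool:
    """Return True if a predicted span matches a gold span."""
    if check_type and pred[2] != gold[2]:
        return False
    if criterion == "exact":
        # Both boundaries must coincide.
        return pred[0] == gold[0] and pred[1] == gold[1]
    if criterion == "overlap":
        # The spans share at least one character.
        return pred[0] < gold[1] and gold[0] < pred[1]
    raise ValueError(f"unknown criterion: {criterion}")


def evaluate(predicted: List[Span], gold: List[Span], criterion: str = "exact",
             check_type: bool = False) -> Dict[str, float]:
    """Compute precision, recall and F1 with greedy one-to-one matching."""
    unmatched = list(gold)
    true_positives = 0
    for pred in predicted:
        for candidate in unmatched:
            if spans_match(pred, candidate, criterion, check_type):
                true_positives += 1
                unmatched.remove(candidate)  # each gold span is used once
                break
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for illustration only.
toy_gold = [(0, 16, "ORG"), (30, 36, "GPE"), (50, 60, "DATE")]
toy_pred = [(0, 16, "ORG"), (28, 36, "LOC"), (70, 75, "PERSON")]

for criterion in ("exact", "overlap"):
    for check_type in (False, True):
        print(criterion, check_type,
              evaluate(toy_pred, toy_gold, criterion, check_type))
```

With the toy spans, the second prediction counts as correct only under the overlap criterion without type checking, which shows how the choice of criterion changes the scores.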
Lastly, a qualitative analysis of the errors made in recognising spans and types was written.
In the second part of the assignment we were asked to link the found entities to their Wikipedia pages using three strategies: a naive approach, a bi-encoder BLINK instance, and a complete BLINK instance.
After computing metrics to evaluate the linking correctness, a qualitative analysis of the found links was written.
The results show that the two more complex approaches generally produce higher-quality links than the naive one.
Some results were particularly interesting. Linking the words "Foreign Ministry", for example, highlights how differently the three approaches use context to decide which entity a mention refers to:
- The first naive method links this entity only to the page related to the "Ministry of foreign affairs" in general.
- The second method tries to identify which country the Ministry belongs to, but fails: probably because the mention appears close to the word "Taiwan", it misclassifies the entity as "Ministry of Foreign Affairs (Taiwan)".
- The third approach is by far the best one, classifying the entity as "Ministry of Foreign Affairs of the People's Republic of China", which is the correct classification.
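The comparison above can be expressed as a small linking check. This is only a sketch: the assignment's actual linking metrics are not specified here, so plain accuracy (the share of mentions linked to the gold page) is an assumption, as are the function name and the `method_1` to `method_3` keys, which simply follow the order of the list. The page titles are the ones quoted in the list.

```python
from typing import Dict


def linking_accuracy(predicted: Dict[str, str], gold: Dict[str, str]) -> float:
    """Fraction of gold mentions whose predicted page title equals the gold title."""
    if not gold:
        return 0.0
    correct = sum(1 for mention, title in gold.items()
                  if predicted.get(mention) == title)
    return correct / len(gold)


# Gold link for the single mention discussed above.
gold_links = {
    "Foreign Ministry":
        "Ministry of Foreign Affairs of the People's Republic of China",
}

# Predictions as reported in the list; the keys are hypothetical labels.
predictions = {
    "method_1": {"Foreign Ministry": "Ministry of foreign affairs"},
    "method_2": {"Foreign Ministry": "Ministry of Foreign Affairs (Taiwan)"},
    "method_3": {
        "Foreign Ministry":
            "Ministry of Foreign Affairs of the People's Republic of China",
    },
}

scores = {name: linking_accuracy(pred, gold_links)
          for name, pred in predictions.items()}
print(scores)
```

On a full document, `gold_links` and each prediction dictionary would hold one entry per mention, and the same function would return the overall share of correct links.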
A more in-depth look at the code used in this assignment is available in the downloadable notebook or on my GitHub page.