Master Thesis Project

Unveiling Hidden Information in Unstructured Documents: Organization and Hybrid Retrieval with Knowledge Graphs.

Goal

We categorize the informational content of a document along three distinct dimensions:

  1. Content: The textual content of the document.
  2. Structure: The hierarchical organization of the document, including headings, paragraphs, and sections.
  3. Entities: The entities mentioned within the document and the semantic relations between them.
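
As an illustration of these three dimensions, a single passage could be represented as below. This is a minimal sketch, not the thesis's actual data model: the class names `Chunk` and `Relation`, their fields, and the example values are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """A semantic relation between two entities (hypothetical triple layout)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class Chunk:
    """One passage of a document, described along the three dimensions."""
    chunk_id: str
    text: str                      # 1. Content: the raw passage text
    section_path: tuple[str, ...]  # 2. Structure: headings from the root down to this passage
    entities: set[str] = field(default_factory=set)          # 3. Entities mentioned in the passage
    relations: list[Relation] = field(default_factory=list)  # 3. Relations between those entities


# Illustrative values only; they are not taken from the thesis data.
example = Chunk(
    chunk_id="doc1#3",
    text="Neo4j is a graph database that stores nodes and relationships.",
    section_path=("Background", "Graph Databases"),
    entities={"Neo4j", "graph database"},
    relations=[Relation("Neo4j", "is_a", "graph database")],
)
```

In a graph store, each of these fields would plausibly map to nodes and edges (sections containing chunks, chunks mentioning entities, entities related to each other), but that mapping is an assumption here.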

Given this categorization, our objective is to design a complete retrieval pipeline, from document structuring to the retrieval model itself, that identifies the most relevant chunks by combining all three informational dimensions. Specifically, the main objectives of this thesis are the following:

  1. Document Structuring: Developing a Knowledge Graph (KG) representation that effectively models a collection of unstructured documents, focusing on retaining and making explicit all of their informational dimensions: both the textual content and the more implicit metadata, such as the organization of passages and the relationships between entities.
  2. Hybrid Retrieval: Designing a retrieval system that leverages this enriched structure and uses all three informational dimensions to retrieve the most relevant chunks.
    Investigating which retrieval methods contribute most to identifying the relevant passages, and consequently designing the system to weight those methods more heavily.
  3. Comparison with traditional RAG: Understanding whether structuring and enriching documents offers a meaningful benefit over traditional RAG systems.
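
The idea of balancing retrieval methods by weight can be sketched as a weighted sum of normalized scores. The fusion rule below (min-max normalization followed by a weighted sum) is an assumption chosen for illustration, not necessarily the method used in the thesis; the function names, method names, scores, and weights are all hypothetical.

```python
from __future__ import annotations


def min_max_normalize(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one method's scores to [0, 1] so methods become comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All chunks tie under this method; treat them as equally relevant.
        return {chunk_id: 1.0 for chunk_id in scores}
    return {chunk_id: (s - lo) / (hi - lo) for chunk_id, s in scores.items()}


def hybrid_rank(
    per_method_scores: dict[str, dict[str, float]],
    weights: dict[str, float],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Combine per-method chunk scores into one ranking (assumed fusion rule)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    combined: dict[str, float] = {}
    for method, scores in per_method_scores.items():
        w = weights.get(method, 0.0) / total  # methods without a weight are ignored
        for chunk_id, s in min_max_normalize(scores).items():
            # A chunk that a method did not return simply gets no contribution from it.
            combined[chunk_id] = combined.get(chunk_id, 0.0) + w * s
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical scores from three retrievers, one per informational dimension.
scores = {
    "content": {"c1": 0.9, "c2": 0.5, "c3": 0.1},
    "structure": {"c1": 0.0, "c2": 1.0},
    "entities": {"c2": 2.0, "c3": 4.0},
}
weights = {"content": 0.5, "structure": 0.2, "entities": 0.3}
ranking = hybrid_rank(scores, weights)
```

Raising the weight of one method in `weights` shifts the ranking toward the chunks that method prefers, which is the kind of balance the second objective sets out to investigate.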

Methodology

The work proceeds in two stages:

Results

On the held-out split of Google NQ and on a 100-question synthetic Multi-Hop set, the hybrid approach:

Tags

Publication Neo4j NER & NEL Knowledge Organization Speech-Recognition