Doctoral research project number: 8298

Description

Submission date: 4 April 2022
Title: Structured Neural Representations for Information Retrieval
Thesis supervisor: Benjamin PIWOWARSKI (ISIR (EDITE))
Thesis supervisor: Laure SOULIER (ISIR (EDITE))
Scientific domain: Information and communication sciences and technologies
CNRS research theme: Artificial intelligence

Abstract: This thesis proposal aims to unify the efficiency (speed) and effectiveness (quality) of neural IR models by developing models built upon structured representations, where the structure both helps to quickly focus on a subset of relevant documents and supports scoring those documents. First, the PhD will investigate how to introduce structure into these representations, e.g. by leveraging tensor spaces. This strategy has been used, for instance, to represent words, and is thought to be important for representing relationships between entities in text. Structuring the retrieval representation space in this way will make it possible to design search engines that progressively refine the ranking of candidates, eliminating the need for two-stage rankers. Second, the PhD will contribute strongly to the problem of text representation. There are two competing approaches to structured representations of language. On the one hand, structured logical representations [8, 2] are potentially very powerful, but are limited by the fact that texts are noisy and hard to process, and because a logical representation is not suited to all types of text (especially ambiguous or unclear ones). On the other hand, structured distributed representations [7, 14, 6] are more flexible and can represent a wide variety of texts, but they are hard to train since sentences cannot be labeled unambiguously. Because of their flexibility and their link to psycholinguistics, which describes how humans process text, this second type of approach is nevertheless promising. The PhD will develop models inspired by those that use a bag of vectors to represent documents, such as ColBERT and its derivatives, which have been shown to perform much better than other models. Instead of a flat bag of vectors, we intend to develop models relying on a structured bag of vectors (i.e. a graph) able to represent relationships between the ideas expressed in the text.
This type of representation has the potential to unlock new progress in Natural Language Processing and Information Access tasks.
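To make the bag-of-vectors idea concrete, ColBERT scores a document with a "MaxSim" late-interaction operator: each query token embedding is matched against its most similar document token embedding, and the per-token maxima are summed. Below is a minimal NumPy sketch of that scoring step only; the encoder that produces the token embeddings (and any graph structure over them, as proposed above) is assumed and not shown.

```python
import numpy as np

def maxsim_score(query_vecs: np.ndarray, doc_vecs: np.ndarray) -> float:
    """ColBERT-style late interaction.

    query_vecs: (n_query_tokens, dim) token embeddings for the query.
    doc_vecs:   (n_doc_tokens, dim) token embeddings for the document.
    Returns the sum, over query tokens, of the maximum cosine similarity
    with any document token.
    """
    # Normalize embeddings so that dot products are cosine similarities.
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    sim = q @ d.T                        # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())  # MaxSim per query token, then sum
```

Because each query token only needs its best-matching document vector, the per-token maxima can be approximated with a nearest-neighbour index over all document token vectors, which is what makes this family of models candidates for the progressive, single-stage ranking envisioned in the proposal.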