Description
Filing date: 23 November 2023
Title: Analogy in multilingual natural language processing
Thesis supervisor:
Benoit SAGOT (Inria-Paris (ED-130))
Co-supervisor:
Rachel BAWDEN (Inria-Paris (ED-130))
Scientific field: Information and communication sciences and technologies
CNRS theme: Natural language and speech processing
Abstract: In recent years, natural language processing (NLP) has undergone major improvements. However, most of these advances are confined to high-resource languages (English, Chinese, French, etc.). State-of-the-art NLP systems require a large amount of text for training and are computationally intensive. The data requirement is exacerbated when building multilingual systems that involve low-resource languages, in particular for machine translation (MT), because a significant amount of parallel data is even more difficult to obtain. In addition to the problem of data scarcity, little is known about the behaviour of recent models, which makes it difficult to interpret how they use their training data. Recent solutions involve upsampling low-resource languages during training, using noisy back-translation data, or mining bitext. This PhD project proposes to approach the question from the angle of analogy, which consists of relationships between text fragments. The techniques we intend to explore include the incorporation of linguistic knowledge, data augmentation, and approaches for inferring interpretable rules from models, with the aim of doing more with less data.
PhD candidate: Zebaze Dongmo Armel Randy