Doctoral research project number: 8218

Description

Submission date: 12 October 2021
Title: Robust Neural Machine Translation
Thesis supervisor: Benoit SAGOT (Inria-Paris (ED-130))
Co-supervisor: Rachel BAWDEN (Inria-Paris (ED-130))
Scientific field: Information and communication sciences and technologies
CNRS theme: Natural language and speech processing

Abstract: Over the past few years, Natural Language Processing (NLP) applications, and machine translation (MT) in particular, have improved significantly, notably thanks to deep learning approaches. However, state-of-the-art MT models, which typically require large quantities of training data, struggle to translate texts that differ from the type of data used to train them. One such challenging scenario is MT of so-called “noisy” texts, such as those produced by social media users and online gamers. Beyond the relative scarcity of parallel resources for training models adapted to this kind of data, these texts pose new challenges because of the nature of the noise itself (non-standard spelling, grammar and vocabulary, typographical errors, use of emojis, etc.), which is variable and productive, and therefore hard to predict. Moreover, the correct interpretation of these texts can be highly contextual, requiring information about the situation in which they were produced (e.g. rules of the game, shared knowledge, news) as well as the linguistic context (i.e. previous sentences). The characteristics of these texts are often very specific to a community of users, requiring domain adaptation to the particularities of the sociolect. Improving MT of noisy texts is an active area of research, and different approaches have been developed to handle the problem, for example synthetic data creation, adversarial learning and character-based models. Specific test sets have also been developed to evaluate and compare these methods. The proposed PhD topic is the exploration of new approaches to robust neural MT, including data augmentation methods, new representation strategies for neural models, and new architectures to handle the phenomena found in non-standard texts.

Doctoral candidate: Nishimwe Lydia