Doctoral research project number: 8197

Description

Submission date: 6 September 2021
Title: NLP for low-resource, non-canonical language varieties with a focus on North-African dialectal Arabic
Thesis supervisor: Laurent ROMARY (Inria-Paris (ED-130))
Thesis supervisor: Djamé SEDDAH (Inria-Paris (ED-130))
Scientific field: Information and communication sciences and technologies
CNRS theme: Natural language and speech processing

Abstract: This PhD thesis explores Natural Language Processing (NLP) for low-resource languages, focusing on dialectal varieties found in noisy, user-generated content. Thanks to recent advances brought by large monolingual and multilingual neural language models (Peters et al., 2018; Devlin et al., 2018), research in NLP has achieved considerable success, but most of it has been obtained on a couple dozen high-resource languages, with a particular focus on English. Given that most spoken languages are not the current focus of those high-performing models, this thesis proposal aims to explore advanced natural language processing and neural transfer learning techniques for low-resource languages as they appear in noisy user-generated content. Such languages exhibit high variability at all linguistic levels, which makes their automatic processing a still unresolved challenge. The question of developing methods best suited to cope with those languages is therefore crucial, especially in a context where most of these language forms appear on social media or online platforms. The goal of this thesis will be to explore the limits of current methods and to develop new ones suited to those low-resource, non-canonical language varieties, starting with a particularly challenging dialect: North-African Arabic written in Latin script, as found in user-generated content.



Doctoral candidate: Riabi Arij