Description
Submission date: 13 December 2021
Title: Cheap and expressive neural contextual representations for textual data
Thesis supervisor:
Benoit SAGOT (Inria-Paris (ED-130))
Advisor:
Eric DE LA CLERGERIE (Inria-Paris (ED-130))
Scientific field: Information and communication sciences and technologies
CNRS theme: Natural language and speech processing
Abstract: Language models are pre-trained using specific language modelling tasks, such as Masked Language Modelling and Next Sentence Prediction in the case of BERT. Recently, models with similar architectures have achieved state-of-the-art performance by choosing different pre-training objectives, showing that more useful representations can be obtained without increasing model complexity and with lower pre-training compute requirements.
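As an illustration of the masked language modelling objective mentioned above, the following Python sketch computes an MLM loss on a toy batch; the vocabulary size, masking probability and single-layer encoder are illustrative assumptions, not the actual BERT configuration.

    import torch
    import torch.nn as nn

    # Toy masked language modelling (MLM) loss: mask some tokens, predict them back.
    VOCAB_SIZE, MASK_ID, MASK_PROB, DIM = 1000, 0, 0.15, 64

    embed = nn.Embedding(VOCAB_SIZE, DIM)
    encoder = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
    lm_head = nn.Linear(DIM, VOCAB_SIZE)

    tokens = torch.randint(1, VOCAB_SIZE, (8, 32))       # dummy batch of token ids
    mask = torch.rand(tokens.shape) < MASK_PROB          # positions to predict
    inputs = tokens.masked_fill(mask, MASK_ID)           # replace them with a [MASK] id

    logits = lm_head(encoder(embed(inputs)))             # (batch, seq, vocab)
    labels = tokens.masked_fill(~mask, -100)             # only masked positions contribute
    loss = nn.functional.cross_entropy(
        logits.view(-1, VOCAB_SIZE), labels.view(-1), ignore_index=-100)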
Empirical studies have shown that BERT-like models are not very robust: the smoothness of the induced representations is in question, and adversarial attacks easily fool them. Such behaviour can be explained by the nature of the learned representations, but also by the choice of a limited token vocabulary as possible inputs. Novel approaches attempt to tackle the latter limitation to improve robustness or performance across languages, e.g. character-level models. Transformer-based models also require large quantities of data to be pre-trained properly, which is not feasible for every language and its varieties.
We propose to investigate methods that would yield robust and expressive representations at a smaller computational cost, in order to democratize modern NLP models that can be deployed in industrial settings. The tracks we would like to explore are:
- Contrastive Learning: this type of approach has proved quite successful for representation learning in Computer Vision, and recent works use it as a learning objective for NLP models (a minimal sketch is given after this list)
- Optimal Transport: it may help tackle the smoothness issue of contextualized representations, but also lead to better approximations of attention mechanisms (see the Sinkhorn sketch below)
- Cheaper Attention: recent approaches propose ways to reduce the memory and time complexity of self-attention (see the linear-attention sketch below)
- Multilingual Training: recent results show that multilingual approaches can help produce better representations for mid- and low-resource languages
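As a minimal sketch of the Contrastive Learning track (first item above), the following InfoNCE-style loss treats two embeddings of the same sentence as a positive pair and the other sentences in the batch as negatives; the temperature value and the random embeddings standing in for an encoder are assumptions made for illustration.

    import torch
    import torch.nn.functional as F

    def info_nce(z1, z2, temperature=0.05):
        # z1[i] and z2[i] are two views of the same sentence; the other rows of z2
        # serve as in-batch negatives (an InfoNCE-style contrastive objective).
        z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
        sim = z1 @ z2.t() / temperature        # (batch, batch) cosine similarities
        labels = torch.arange(z1.size(0))      # positives lie on the diagonal
        return F.cross_entropy(sim, labels)

    # Dummy sentence embeddings, e.g. two dropout-perturbed encodings of one batch.
    loss = info_nce(torch.randn(16, 128), torch.randn(16, 128))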
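For the Optimal Transport track, a common building block is the entropy-regularised transport plan obtained with Sinkhorn iterations, sketched below between two sets of token representations; the cosine cost and the regularisation strength are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def sinkhorn(cost, epsilon=0.1, n_iters=50):
        # Entropy-regularised optimal transport between two uniform distributions;
        # returns a transport plan whose rows sum to 1/n and columns to 1/m.
        n, m = cost.shape
        a, b = torch.full((n,), 1.0 / n), torch.full((m,), 1.0 / m)
        K = torch.exp(-cost / epsilon)         # Gibbs kernel
        u = torch.ones(n)
        for _ in range(n_iters):
            v = b / (K.t() @ u)
            u = a / (K @ v)
        return u[:, None] * K * v[None, :]

    # Toy example: align two sets of token embeddings under a cosine cost.
    x, y = torch.randn(5, 64), torch.randn(7, 64)
    cost = 1 - F.normalize(x, dim=-1) @ F.normalize(y, dim=-1).t()
    plan = sinkhorn(cost)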
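For the Cheaper Attention track, one well-known family of approaches replaces the softmax with a kernel feature map so that self-attention costs linear rather than quadratic time and memory in the sequence length; the elu(x)+1 feature map below follows the linear-transformer idea and is one illustrative choice among many.

    import torch
    import torch.nn.functional as F

    def linear_attention(q, k, v):
        # Kernelised self-attention: softmax(QK^T)V is replaced by
        # phi(Q) (phi(K)^T V) with phi(x) = elu(x) + 1, avoiding the (n, n) matrix.
        q, k = F.elu(q) + 1, F.elu(k) + 1
        kv = torch.einsum("nd,ne->de", k, v)   # (dim, dim) summary of keys and values
        z = 1.0 / (q @ k.sum(dim=0))           # per-query normalisation
        return (q @ kv) * z[:, None]

    q = k = v = torch.randn(512, 64)           # one attention head over 512 positions
    out = linear_attention(q, k, v)            # same shape as v, linear-time compute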
Doctoral candidate: Godey Nathan