Description
Submission date: 29 August 2022
Title: Efficient Multimodal Learning
Thesis supervisor:
Matthieu CORD (ISIR (EDITE))
Scientific field: Information and communication sciences and technologies
CNRS theme: Images and vision
Abstract: Subject on Vision and Language models.
Detailed abstract below.
Abstract in another language: The prevalent paradigm in Deep Learning (DL) for excelling on vision, language, and multimodal benchmarks is to pretrain a model on a large dataset and then finetune it on several downstream tasks. This has shifted the focus from dataset- or task-customised model design to more general, foundational models (CLIP, FLAVA, CoCa, ...). In particular, Vision-Language Models (VLMs) have proven to be a promising approach by exploiting the synergy between the two modalities.
These models are trained on large datasets of image-text pairs and then finetuned on several unimodal or multimodal image and text datasets. They present several advantages. First, the datasets can be easily scraped from the internet without expensive annotation. Second, the rich description in the paired text/caption provides a means to learn more general and useful visual representations than labels that describe only one particular aspect of the image. Third, besides excelling on existing benchmarks, VLMs have paved the way to efficiently solving more vision-language tasks, where the model needs to understand both modalities as well as how they interact.
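To make this image-text pretraining objective concrete, below is a minimal sketch of a CLIP-style symmetric contrastive loss in PyTorch; the function name, embedding shapes, and temperature value are illustrative assumptions, not the exact objective of any of the cited models.

    import torch
    import torch.nn.functional as F

    def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
        """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

        image_emb, text_emb: (batch, dim) tensors from the two encoders
        (encoder choice and temperature are illustrative assumptions).
        """
        # L2-normalize so the dot product is a cosine similarity
        image_emb = F.normalize(image_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)

        # (batch, batch) similarity matrix; the diagonal holds the true pairs
        logits = image_emb @ text_emb.t() / temperature

        # Each image should match its own caption, and vice versa
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2

Each image is pulled toward its own caption and pushed away from the other captions in the batch, which is how the free-form paired text acts as a richer supervision signal than a single label.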
However, despite this success, these models are not efficient to train: they require massive datasets (e.g. 70M, 400M, 1.8B ... examples), a huge number of parameters (e.g. CoCa with 2B parameters, Flamingo with 70B parameters), and expensive training infrastructure (e.g. one week on 8 GPUs), preventing academic laboratories from participating in this research direction.
In this PhD, we first propose to develop new models and training paradigms that exploit these datasets effectively at a reasonable training cost. Second, we aim for our models to leverage new objective functions and tasks that help them acquire more reasoning capabilities. Third, to speed up their deployment in real-world applications, we plan to explore and develop new efficient transfer learning techniques, beyond simply finetuning all the model parameters for each task (illustrated below). Finally, we aim to leverage these models to tackle vision tasks that are traditionally addressed by classical vision models and techniques.
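As a rough sketch of what such parameter-efficient transfer can look like (adapters are only one option among prompt tuning, low-rank updates, etc.), the PyTorch snippet below freezes a pretrained backbone and wraps each of its blocks with a small trainable bottleneck adapter; the `backbone.blocks` layout and the bottleneck size are hypothetical.

    import torch
    import torch.nn as nn

    class Adapter(nn.Module):
        """Small bottleneck module inserted into a frozen backbone (sizes are illustrative)."""
        def __init__(self, dim, bottleneck=64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.up = nn.Linear(bottleneck, dim)

        def forward(self, x):
            # Residual connection: the frozen features pass through unchanged
            return x + self.up(torch.relu(self.down(x)))

    class AdaptedBlock(nn.Module):
        """Wraps a frozen transformer block and adapts its output."""
        def __init__(self, block, dim, bottleneck=64):
            super().__init__()
            self.block = block
            self.adapter = Adapter(dim, bottleneck)

        def forward(self, x):
            return self.adapter(self.block(x))

    def adapt_backbone(backbone, dim):
        """Freeze all pretrained weights; only the adapters stay trainable.

        Assumes the backbone exposes its layers as `backbone.blocks`
        (a hypothetical layout; real architectures differ).
        """
        for p in backbone.parameters():
            p.requires_grad = False
        backbone.blocks = nn.ModuleList(AdaptedBlock(b, dim) for b in backbone.blocks)
        return backbone

Only the adapter weights receive gradients, so each new task adds a small set of parameters instead of duplicating the full model.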
PhD candidate: Shukor Mustafa