Doctoral research project number: 8700

Description

Submission date: 4 April 2024
Title: Label-Efficient Self-Supervised Learning of Large AI Models in Computer Vision
Thesis supervisor: Hichem SAHBI (LIP6)
Scientific field: Information and communication sciences and technologies
CNRS theme: Images and vision

Abstract: Deep learning (DL) currently attracts major interest in computer vision and related fields. Large deep models (so-called foundation models) represent a paradigm shift in deep learning, offering a powerful and versatile approach to building AI applications. However, most of these models rely on large datasets whose hand-labeling is time- and labor-intensive. Moreover, in fields where labeling requires expertise, such as medical imaging, models trained with insufficient or inaccurate labels struggle to learn effectively and may produce unreliable outputs. A current trend is to make DL models frugal and less label-dependent through techniques such as zero- and few-shot learning, transfer learning, data augmentation, and active and self-supervised learning. Among these, self-supervised learning (SSL) is particularly attractive, as it pushes frugality further by making training entirely label-free and by leveraging only abundant unlabeled data. Unlike traditional supervised models, which require vast amounts of labeled data, SSL models learn meaningful representations from unlabeled data by exploiting its inherent structure or internal relationships to generate supervisory signals for training. The principle of SSL is to design pretext tasks (image reconstruction and colorization, relative image comparison, etc.), i.e., artificial tasks created to force the model to learn meaningful representations from unlabeled data. The model's predictions on these pretext tasks serve as a form of supervision, and the model is trained to minimize the difference between its predictions and the pretext-task ground truth. Besides being label-efficient, models trained with SSL can sometimes generalize better to unseen data than models trained solely on labeled data.
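To make the pretext-task principle concrete, the following is a minimal sketch of a contrastive SSL objective (an InfoNCE-style loss, one common instance of such supervisory signals). The function name and the toy embeddings are illustrative only, not part of the project; in practice the embeddings would come from a deep encoder applied to two augmented views of each image.

```python
import numpy as np

def info_nce_loss(z1, z2, temperature=0.1):
    """InfoNCE-style loss between two batches of embeddings, where
    row i of z1 and row i of z2 encode two views of the same image."""
    # L2-normalize so the dot product is cosine similarity
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature  # (N, N) pairwise similarities
    # Positive pairs sit on the diagonal; other entries act as negatives
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
anchor = rng.normal(size=(8, 32))
positive = anchor + 0.01 * rng.normal(size=(8, 32))  # slightly perturbed "view"
unrelated = rng.normal(size=(8, 32))                 # mismatched embeddings
# Matching views should yield a lower loss than mismatched ones
print(info_nce_loss(anchor, positive), info_nce_loss(anchor, unrelated))
```

Minimizing this loss pulls the two views of each image together while pushing apart embeddings of different images, without using any labels.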
The goal of this thesis is to study and propose novel solutions to several SSL challenges, including:
• Designing relevant pretext tasks: the success of SSL depends on creating pretext tasks that effectively guide models towards learning meaningful representations. Leveraging multiple cues (multi-modality, spatio-temporal redundancy, differentiable rendering, and other a priori knowledge) as supervisory signals attenuates the dependency of the learned models on labels and eases their fine-tuning (with few labeled data) to the peculiarities of the downstream tasks, especially on video data.
• Guaranteeing task-specific performance: although SSL models may learn meaningful representations, they typically still require fine-tuning for specific downstream tasks. A major challenge is therefore how to combine SSL with other label-efficient methods (such as zero- and few-shot learning) to achieve effective and efficient fine-tuning.
• Designing suitable evaluation metrics: evaluating the quality of the learned representations is more challenging in SSL than in supervised settings. In particular, linking the accuracy of the downstream tasks to the pretext (proxy) tasks is one of the major bottlenecks.
• Leveraging foundation models (FMs): FMs are designed to be general-purpose and capable of handling a wide range of tasks across domains. Unlike traditional deep models trained for specific tasks, FMs can be fine-tuned for various downstream applications with significantly less data and training time; by combining pre-trained foundation models with SSL, AI design can avoid training a large model from scratch for each new task.
• Making the proposed AI models lightweight: adapting models to real-time tasks and cheap devices is a major challenge and requires developing suitable neural network compression and acceleration techniques, such as pruning and quantization.
Reducing the training cost of SSL on large AI models will also make them more accessible.
Applications of this thesis will be centred on machine vision tasks over large video datasets, including anomaly detection (spotting unusual events), action recognition (identifying specific actions in videos), and motion and depth analysis (scene flow and 3D pose estimation).
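As a minimal illustration of one of the compression tools mentioned above, the sketch below applies symmetric uniform quantization to a weight tensor, mapping 32-bit floats to 8-bit integers plus a single per-tensor scale. This is a toy example under simplified assumptions (per-tensor scaling, no calibration), not the project's method.

```python
import numpy as np

def quantize_uniform(w, num_bits=8):
    """Symmetric uniform quantization of a weight tensor to signed ints."""
    qmax = 2 ** (num_bits - 1) - 1            # e.g. 127 for 8 bits
    scale = np.max(np.abs(w)) / qmax          # one scale for the whole tensor
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Map quantized integers back to approximate float weights."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.05, size=(64, 64)).astype(np.float32)
q, scale = quantize_uniform(weights)
recon = dequantize(q, scale)
# Rounding error is bounded by half a quantization step
print(np.max(np.abs(weights - recon)) <= scale / 2 + 1e-8)
```

Storing `q` instead of `weights` cuts memory by roughly 4x (8-bit vs 32-bit), at the cost of a bounded rounding error per weight.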