Doctoral research project number: 8883

Description

Submission date: 31 March 2025
Titre: Deep Learning and Large Generative AI Models for Machine Vision
Thesis supervisor: Hichem SAHBI (LIP6)
Scientific field: Information and communication sciences and technologies
CNRS theme: Artificial intelligence

Abstract: Deep neural networks are currently among the most successful models in image processing and computer vision. Their principle consists in learning convolutional filters, together with attention and fully connected layers, that maximize classification and generation performance. Large generative models (LGMs) are a particular category of deep learning models designed to generate new data, often resembling the data they were trained on. These models are at the forefront of artificial intelligence research, pushing the boundaries of what computers can create. Unlike standard deep learning models trained for classification or prediction, LGMs focus on creating entirely new data samples: images, text, video, audio, etc. LGMs leverage various deep learning architectures, the most common being: (i) Generative Adversarial Networks (GANs), which involve two neural networks competing against each other: a generator tries to create realistic data, while a discriminator tries to distinguish real data from generated data; this adversarial process drives the generator to fool the discriminator and produce increasingly realistic outputs; (ii) Variational Autoencoders (VAEs), which encode input data into a latent space, a bottleneck that captures its essential features; the model then learns to decode samples from this latent space, effectively generating new data that resemble the training data; and (iii) diffusion models, which start from a noisy version of the target data and gradually de-noise it step by step, ultimately producing a clean and realistic sample (normalizing flows, a related family, instead learn an invertible mapping between a simple base distribution and the data). The challenges of LGMs stem from their training complexity, their biases, particularly when trained in a lifelong learning regime, and their ability to generate realistic and potentially manipulative content, which raises ethical concerns.
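The diffusion mechanism described above can be made concrete with a minimal NumPy sketch (the linear variance schedule, the toy one-dimensional signal, and the step count are illustrative assumptions, not the proposal's method): data is progressively corrupted in closed form, and a diffusion model would be trained to reverse this corruption step by step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear variance schedule over T steps (a common, simple choice).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)  # cumulative signal-retention factor

def q_sample(x0, t, rng):
    """Closed-form forward noising: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps, eps

# A toy unit-variance "signal" standing in for an image.
x0 = rng.standard_normal(64)

x_early, _ = q_sample(x0, 10, rng)      # early step: mostly signal
x_late, _ = q_sample(x0, T - 1, rng)    # final step: almost pure noise

corr_early = np.corrcoef(x0, x_early)[0, 1]
corr_late = np.corrcoef(x0, x_late)[0, 1]
print(corr_early, corr_late)
```

A trained model would predict the injected noise `eps` at each step and subtract it out, running the chain in reverse to turn pure noise into a realistic sample.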
The goal of this thesis is to study and design novel solutions that address different LGM challenges, including (among others):
• Enhanced control and interpretability: developing interactive techniques, based on prompting or semantic subspace design, that better control the outputs of LGMs and their quality/diversity, and that help understand how they generate specific data.
• Extending LGMs to the lifelong learning paradigm: developing effective solutions that learn from streams of data while mitigating catastrophic forgetting (i.e., without forgetting previously learned information); the proposed solutions will mainly rely on regularization and dynamic LGM architecture design (to maintain LGM capacity), as well as domain adaptation (to address the non-stationarity of training data streams).
• Extending LGMs to unstructured data (such as 3D point clouds): designing LGMs on graphs while handling all possible symmetries (ambiguities) in the unstructured data, particularly permutations, both for LGM encoding and decoding.
• Improving LGM efficiency: developing more efficient training algorithms to make LGMs more accessible, particularly on edge devices.
• Mitigating bias and ensuring responsible use: designing safeguards to address potential biases and promote responsible development and deployment of LGMs.
Applications of this thesis will be centred around different computer vision and image processing tasks, including (i) creative image and video content generation, (ii) data augmentation (generating synthetic data to improve the performance of other machine learning models), and (iii) image/video editing (filling in missing parts of visual content, enhancing resolution, or photorealistic editing).
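The permutation-symmetry requirement for point clouds can be illustrated with a small, self-contained sketch (NumPy only; the tiny two-layer encoder and its weights `W_phi`, `W_rho` are hypothetical, not part of the proposal): applying a feature map to each point independently and then pooling with a symmetric operation (here, a sum) yields a code that does not depend on the ordering of the points, which is the invariance an LGM encoder for unstructured 3D data must satisfy.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical weights of a tiny DeepSets-style encoder:
# phi is applied per point, a symmetric sum pooling removes the
# dependence on point ordering, and rho maps the pooled code.
W_phi = rng.standard_normal((3, 16))   # per-point feature map
W_rho = rng.standard_normal((16, 8))   # post-pooling map

def encode(points):
    """Permutation-invariant encoding of an (N, 3) point cloud."""
    h = np.maximum(points @ W_phi, 0.0)   # per-point ReLU features
    pooled = h.sum(axis=0)                # order-independent pooling
    return pooled @ W_rho

cloud = rng.standard_normal((128, 3))    # a toy 3D point cloud
perm = rng.permutation(128)

z1 = encode(cloud)
z2 = encode(cloud[perm])                 # same points, shuffled order
print(np.allclose(z1, z2))
```

The decoder side is harder: generating a set rather than a sequence requires either a permutation-equivariant decoder or a matching-based loss, which is precisely one of the open questions the thesis targets.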