Doctoral research project number: 8687

Description

Submission date: 3 April 2024
Title: Thematic visual grounding
Thesis supervisor: Nicolas LOMENIE (LIPADE)
Advisor: Sylvain LOBRY (LIPADE)
Scientific domain: Information and communication sciences and technologies
CNRS theme: Images and vision

Abstract: Visual grounding is a task that aims at locating objects in an image based on a natural language query. This task, along with image captioning, visual question answering and content-based image retrieval, links image data with the text modality. Numerous works on visual grounding have been produced in the computer vision community over the last decade. These works most often consider the two modalities separately, through dedicated encoders (e.g. a convolutional neural network for images, a recurrent neural network for the text). The two encoded representations are then merged, potentially using attention mechanisms, to obtain a common latent representation. Recently, text-image foundation models such as CLIP (Contrastive Language-Image Pre-training) have changed the paradigm for visual grounding models. Indeed, leveraging the shared semantics between language and images is a key element for the task.

While a great amount of work has been produced in the computer vision community on visual grounding for natural images, there is a lack of research on this task for thematic domains such as medical imaging and remote sensing. In both of these domains, there is a need to precisely locate particular objects, following precise definitions, in images. In addition, the image of a particular scene (e.g. an organ in medical imaging, a geographical area in remote sensing) can be made through several acquisitions (e.g. an MRI stack or a time series). As such, we are interested in the question: how can visual grounding be made domain specific?
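To illustrate the two-stream design mentioned above (dedicated image and text encoders whose representations are merged through attention), the following is a minimal, self-contained sketch. It is not the project's model: the toy CNN, GRU text encoder, cross-attention fusion, and box-regression head, as well as all dimensions, are illustrative assumptions.

```python
# Illustrative sketch (assumed architecture, not the project's model):
# a two-stream visual grounding network with a CNN image encoder, an RNN
# text encoder, and cross-attention fusion, predicting one bounding box
# for the object described by the query.
import torch
import torch.nn as nn

class ToyVisualGrounding(nn.Module):
    def __init__(self, vocab_size=1000, dim=128):
        super().__init__()
        # Image encoder: a small CNN producing a grid of visual features.
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Text encoder: word embeddings + GRU over the natural-language query.
        self.embed = nn.Embedding(vocab_size, dim)
        self.gru = nn.GRU(dim, dim, batch_first=True)
        # Fusion: the query summary attends over image regions (cross-attention).
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Head: regress a normalized box (x, y, w, h) from the fused feature.
        self.box_head = nn.Linear(dim, 4)

    def forward(self, image, query_tokens):
        feat = self.cnn(image)                        # (B, dim, H', W')
        regions = feat.flatten(2).transpose(1, 2)     # (B, H'*W', dim)
        _, hidden = self.gru(self.embed(query_tokens))
        q = hidden[-1].unsqueeze(1)                   # (B, 1, dim) query summary
        fused, _ = self.attn(q, regions, regions)     # attend query over regions
        return self.box_head(fused.squeeze(1)).sigmoid()  # normalized box

# Toy usage with random data.
model = ToyVisualGrounding()
image = torch.randn(2, 3, 128, 128)
tokens = torch.randint(0, 1000, (2, 8))
print(model(image, tokens).shape)  # torch.Size([2, 4])
```

In contrast, CLIP-style foundation models replace the separately trained encoders with image and text encoders pre-trained jointly on a contrastive objective, so that the shared embedding space itself carries the cross-modal semantics that the fusion step above has to learn from scratch.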