Doctoral research project number: 8869

Description

Submission date: 24 March 2025
Titre: 3D point cloud understanding via Multi-modal Large Language Models
Thesis supervisor: Laurent WENDLING (LIPADE)
Scientific domain: Information and communication sciences and technologies
CNRS theme: Images and vision

Abstract: Recent works have demonstrated the potential of jointly handling the 3D and text modalities via multi-modal Large Language Models (LLMs). Current 3D vision-language tasks include 3D captioning (3D → Text), 3D grounding (3D + Text → 3D Position), 3D conversation (3D + Text → Text), 3D embodied agents (3D + Text → Action) and text-to-3D generation (Text → 3D). Recently, a new vision-language task has emerged that produces a point-wise segmentation mask given a 3D point cloud (3DPC) and a text query. The challenge is to develop new methods that enable intelligent interaction between the text query and the 3DPC while harnessing the LLM's reasoning capabilities.

The different steps of this doctoral research project are:
- Review the different 2D/3D vision-language methods that exploit multi-modal LLMs. Some baseline methods will be implemented.
- Exploit 2D vision foundation models and 2D LLMs. The 3D-2D alignment will be conducted through knowledge distillation.
- Construct a 3DPC foundation model. The innovation will be the definition of pre-text tasks related to the 3DPC.
- Develop new 3D LLMs for point-level reasoning and segmentation. The previous steps will be incorporated into a unified framework for 3D point cloud segmentation via the reasoning ability of LLMs.

Supervision will also be provided by Yaoub Karine (Associate Professor, LIPADE).
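As a minimal sketch of the 3D-2D alignment step mentioned above: in a typical knowledge-distillation setup, a frozen 2D vision foundation model supplies per-pixel features, and the 3D point-cloud encoder is trained so that each point's feature matches the feature of the pixel it projects onto. The function and shapes below are illustrative assumptions, not the project's actual design.

```python
import numpy as np

def cosine_distillation_loss(student_3d_feats, teacher_2d_feats, eps=1e-8):
    """Mean (1 - cosine similarity) between paired point and pixel features.

    student_3d_feats: (N, D) array of features from the 3D point-cloud encoder
                      (the student), one row per 3D point.
    teacher_2d_feats: (N, D) array of features from the frozen 2D foundation
                      model (the teacher), gathered at the pixels that the
                      N points project onto. Pairing is assumed precomputed.
    """
    # L2-normalize both feature sets so the dot product is a cosine similarity.
    s = student_3d_feats / (np.linalg.norm(student_3d_feats, axis=1, keepdims=True) + eps)
    t = teacher_2d_feats / (np.linalg.norm(teacher_2d_feats, axis=1, keepdims=True) + eps)
    cos_sim = np.sum(s * t, axis=1)       # per-pair cosine similarity in [-1, 1]
    return float(np.mean(1.0 - cos_sim))  # 0 when student and teacher align

# Toy usage with random stand-in features (no real encoders involved).
rng = np.random.default_rng(0)
teacher = rng.standard_normal((1024, 256))
print(cosine_distillation_loss(teacher, teacher))  # ≈ 0: identical features
```

Minimizing this loss with respect to the 3D encoder's parameters pulls the point features toward the 2D teacher's embedding space, which is one common way to transfer 2D foundation-model knowledge to 3D.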