Submission date: April 8, 2021
Title: Deep Graph Neural Networks for Visual Scene Recognition
Thesis advisor: Hichem SAHBI (LIP6)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined

Abstract: Deep learning is currently attracting major interest in computer vision and related fields. Its principle consists in training multi-layered neural networks by designing suitable architectures and optimizing their parameters. In particular, convolutional networks are well studied and aim at extracting features that gradually capture low-to-high-level semantics of visual patterns. Early convolutional networks were dedicated to regular (grid-like) scenes, including images, where convolutions are achieved by shifting equivariant filters and measuring their responses across different image locations. However, scenes sitting on top of irregular domains (such as skeletons in action recognition or regions in object detection) require extending convolutional networks to unstructured data, namely graphs; indeed, while shifting filters across regular grids is a straightforward and well-defined operation, its extension to irregular domains (i.e., graphs with heterogeneous topological properties) is generally ill-posed.

Motivated by the success of deep learning in computer vision, graph convolutional networks (GCNs) are currently emerging for different use-cases and applications. The common ground of these networks consists in aggregating node representations prior to applying filters on the resulting node aggregates. Two categories of GCNs are known in the literature: the first one, dubbed spatial, achieves convolution by locally averaging representations through nodes and their neighbors before applying filters using inner products.
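The spatial scheme described above can be sketched as follows. This is a minimal illustration, not the thesis's method: the function and variable names (`spatial_gcn_layer`, `A`, `X`, `W`) are hypothetical, and mean aggregation over neighbors followed by a linear map and ReLU is one common instantiation among several.

```python
import numpy as np

def spatial_gcn_layer(A, X, W):
    """One spatial graph convolution layer (illustrative sketch).

    A: (n, n) adjacency matrix, self-loops assumed already added.
    X: (n, d) node representations; W: (d, k) filter weights.
    Each node is averaged with its neighbors, then the filters are
    applied via an inner product, followed by a ReLU nonlinearity.
    """
    deg = A.sum(axis=1, keepdims=True)       # node degrees (with self-loops)
    H = (A @ X) / np.maximum(deg, 1.0)       # local neighborhood averaging
    return np.maximum(H @ W, 0.0)            # inner product + ReLU
```

Because the averaging only mixes each node with its direct neighbors, stacking such layers grows the receptive field hop by hop, which is what makes spatial filters localized.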
The second category, known as spectral, proceeds differently: it first maps filter and input graph signals into the spectral domain using the eigen-decomposition of their Laplacians, achieves filtering in that domain, and then back-projects the filtered signal onto the input graph domain. While spectral GCNs make convolutions well defined compared to spatial GCNs, their downside resides in the non-localized aspect of the learned filters and in the high complexity of the Laplacian eigen-decomposition.

Considering the aforementioned issues, the goal of this thesis is to devise highly effective and also efficient GCNs for the task of visual scene recognition. In the targeted solutions, graphs will be used to model scene parts together with their spatial, temporal and semantic interactions. Concepts (and their combinations) will also be described with graphs whose nodes correspond to individual classes and whose links correspond to their interactions. In contrast to most existing solutions, where nodes/edges in graphs are handcrafted, this thesis will consider an "end-to-end" training process that infers both nodes and their interactions prior to learning the underlying GCNs. Other aspects will also be addressed, including attention mechanisms as well as transformer networks that help design convolutions with topologically variant filter supports. We will also consider spectral GCNs, which make convolutions through the graph Fourier transform principled and well defined; nevertheless, the relevance of these convolutions relies on the adequacy of the underlying Laplacian operators, which are usually handcrafted. The latter are unable to capture all the relationships between nodes, as their setting is agnostic to the targeted tasks.
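The spectral pipeline (graph Fourier transform, filtering, back-projection) can be sketched in a few lines. This is an illustrative toy, with hypothetical names (`spectral_gcn_filter`, `g`); it uses the combinatorial Laplacian L = D - A, and the O(n^3) cost of the dense eigen-decomposition below is precisely the complexity issue mentioned above.

```python
import numpy as np

def spectral_gcn_filter(A, x, g):
    """Spectral filtering of a graph signal (illustrative sketch).

    A: (n, n) symmetric adjacency matrix; x: (n,) graph signal;
    g: function mapping Laplacian eigenvalues to filter responses.
    The signal is mapped to the spectral domain via the Laplacian's
    eigen-decomposition, filtered there, then back-projected.
    """
    D = np.diag(A.sum(axis=1))
    L = D - A                           # combinatorial Laplacian
    lam, U = np.linalg.eigh(L)          # eigen-decomposition, O(n^3)
    x_hat = U.T @ x                     # graph Fourier transform
    return U @ (g(lam) * x_hat)         # filter, then back-project
```

With the identity filter g(lam) = 1 the signal is recovered exactly, since the eigenvector matrix U is orthonormal; a filter parameterized freely over all eigenvalues has no guaranteed spatial localization, which is the other drawback noted above.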
For instance, in skeleton-based action recognition, pre-existing node-to-node relationships capture the intrinsic anthropometric aspects of individuals, which are necessary for their identification, while other relationships about their dynamics, yet to be inferred, are necessary in order to recognize their actions. Put differently, depending on the task at hand, connectivity in Laplacian operators should be appropriately learned by including not only the available (intrinsic) node-to-node connections in graphs but also their inferred (extrinsic) relationships. Moreover, the consistency of the learned Laplacian operators is also critical and requires adapting the domains of these operators to the input graphs. Finally, all these aspects will be investigated in the context of visual scene recognition, including image/video classification and segmentation.
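One way to picture a task-adapted Laplacian that combines intrinsic and inferred connectivity is sketched below. This is only a plausible construction under stated assumptions, not the thesis's actual formulation: `A_intrinsic` stands for fixed connections (e.g., skeleton bones), `E` for learnable parameters from which extrinsic links are inferred, and the softmax-plus-symmetrization step is a hypothetical choice.

```python
import numpy as np

def task_adapted_laplacian(A_intrinsic, E):
    """Build a Laplacian from intrinsic and inferred connectivity (sketch).

    A_intrinsic: (n, n) fixed, symmetric intrinsic connections.
    E: (n, n) learnable logits encoding extrinsic relationships.
    The extrinsic links are normalized row-wise, symmetrized, and
    combined with the intrinsic ones before forming L = D - A.
    """
    soft = np.exp(E) / np.exp(E).sum(axis=1, keepdims=True)  # soft links
    A_extrinsic = 0.5 * (soft + soft.T)                      # symmetrize
    A = A_intrinsic + A_extrinsic                            # combine both
    return np.diag(A.sum(axis=1)) - A                        # L = D - A
```

By construction the result keeps the two defining Laplacian properties (symmetry and zero row sums), so it remains a valid operator for spectral filtering while its connectivity can be adapted to the task through E.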