Deposit date: April 12, 2022
Title: Lightweight Graph Neural Networks for Image and Action Recognition
Thesis advisor: Hichem SAHBI (LIP6)
Scientific domain: Information and communication science and technology
CNRS theme: Images and vision

Abstract: Deep convolutional networks are currently among the most successful models in image processing and computer vision. Their principle consists in learning convolutional filters, together with attention and fully connected layers, that maximize classification performance. These models are mainly suited to data sitting on regular domains (such as images), but their adaptation to irregular data (namely graphs) requires extending convolutions to arbitrary domains; these extensions are known as graph convolutional networks (GCNs). Two categories of GCNs exist in the literature: spectral and spatial. Spectral methods project both the input graph signals and the convolutional filters using the graph Fourier transform, achieve convolution in the Fourier domain, and then back-project the resulting convolved signal into the input domain. These projections rely on the eigendecomposition of graph Laplacians, whose complexity scales polynomially with the size of the input graphs, which makes spectral GCNs intractable on large graphs. Spatial methods instead rely on message passing, via attention matrices, before applying convolution. While spatial GCNs have proved relatively more effective than spectral ones, their success depends heavily on the accuracy of the attention matrices that capture context and node-to-node relationships. With multi-head attention, GCNs become more accurate but also computationally more demanding, so lightweight variants of these models should instead be considered. Several methods have been proposed in the literature to design lightweight yet effective deep convolutional networks.
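The spectral pipeline outlined above (Fourier projection, filtering, back-projection) can be sketched with dense linear algebra; the cubic-cost eigendecomposition is precisely what makes this route intractable on large graphs. A minimal sketch, with illustrative names not taken from the proposal:

```python
import numpy as np

def spectral_conv(A, x, g):
    """Filter a node signal x on the graph with adjacency A, using a
    spectral filter given by its frequency response g(lambda)."""
    D = np.diag(A.sum(axis=1))          # degree matrix
    L = D - A                           # combinatorial graph Laplacian
    lam, U = np.linalg.eigh(L)          # O(n^3) eigendecomposition
    x_hat = U.T @ x                     # graph Fourier transform
    y_hat = g(lam) * x_hat              # filtering in the spectral domain
    return U @ y_hat                    # back-projection to the input domain

# Toy example: a 3-node path graph with a scalar signal per node.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
x = np.array([1.0, 2.0, 3.0])
y = spectral_conv(A, x, lambda lam: np.ones_like(lam))  # identity filter
```

With the all-ones response the filter is the identity, since U is orthogonal; replacing it with g(lambda) = lambda recovers multiplication by the Laplacian itself.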
Some of them build efficient networks from scratch, while others pretrain heavy networks before reducing their time and memory footprint using distillation and pruning. Pruning methods, either unstructured or structured, remove the connections whose impact on classification performance is least perceptible. Unstructured pruning cuts connections individually using different criteria, including weight magnitude, before fine-tuning. In contrast, structured pruning removes groups of connections, channels or entire sub-networks. Whereas structured pruning may reach high speed-ups on dedicated hardware, its downside resides in the rigidity of the class of learnable lightweight networks. Unstructured pruning, on the other hand, is more flexible, but may result in topologically inconsistent sub-networks (i.e., partially or completely disconnected), which may limit generalization, especially at very high pruning rates. The goal of this thesis is to devise novel approaches for very lightweight GCN design that combine the advantages of structured and unstructured pruning while discarding their drawbacks; i.e., the proposed methods should impose constraints on the structure of the learned sub-networks (namely their topological consistency) while also preserving their flexibility to some extent. The proposed solutions may select network connections using different criteria (highest magnitudes, connectivity, predefined topologies, etc.) while guaranteeing their accessibility (i.e., their reachability from the neural network inputs) and their co-accessibility (i.e., their actual contribution to the evaluation of the neural network outputs). Hence, only topologically consistent sub-networks should be considered when selecting network connections. Applications include image classification and human action recognition in large video collections.
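The accessibility and co-accessibility constraints above can be sketched on a toy multilayer network: after magnitude pruning, a forward pass marks units reachable from the inputs and a backward pass marks units that still contribute to the outputs, and only connections joining such units are kept. This is a hedged illustration of the general idea, not the method the thesis will develop; all names are illustrative:

```python
import numpy as np

def magnitude_prune(weights, rate):
    """Zero out the globally smallest-magnitude fraction `rate` of the
    weights, given a list of matrices W[l] of shape (out_l, in_l)."""
    all_w = np.concatenate([np.abs(W).ravel() for W in weights])
    thresh = np.quantile(all_w, rate)
    return [W * (np.abs(W) > thresh) for W in weights]

def consistent(weights):
    """Keep only connections that are both accessible (reachable from
    the inputs) and co-accessible (contributing to the outputs)."""
    # Forward pass: a unit is accessible if a surviving connection
    # links it to an accessible unit of the previous layer.
    acc = [np.ones(weights[0].shape[1], dtype=bool)]
    for W in weights:
        acc.append((np.abs(W) @ acc[-1]) > 0)
    # Backward pass: co-accessibility, starting from the output layer.
    coacc = [np.ones(weights[-1].shape[0], dtype=bool)]
    for W in reversed(weights):
        coacc.insert(0, (np.abs(W).T @ coacc[0]) > 0)
    # A connection survives iff its source unit is accessible and its
    # target unit is co-accessible.
    return [W * np.outer(coacc[l + 1], acc[l])
            for l, W in enumerate(weights)]
```

In this sketch, a hidden unit whose incoming weights are all pruned away loses its outgoing connections as well, so the surviving sub-network is topologically consistent by construction.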
We consider both raw videos and pre-extracted skeleton data described as graphs, where nodes correspond to human joints and edges to their spatial and temporal relationships.
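Such a skeleton graph can be sketched as follows: each node is a (frame, joint) pair, spatial edges follow the bones within a frame, and temporal edges link the same joint across consecutive frames. The 5-joint skeleton below is a toy layout chosen for illustration, not a standard one such as the 25-joint NTU RGB+D skeleton:

```python
import numpy as np

BONES = [(0, 1), (1, 2), (1, 3), (1, 4)]  # toy bone list (illustrative)
N_JOINTS = 5

def skeleton_graph(n_frames):
    """Adjacency matrix of the spatio-temporal skeleton graph: node
    (t, j) is stored at index t * N_JOINTS + j."""
    n = n_frames * N_JOINTS
    A = np.zeros((n, n), dtype=int)
    node = lambda t, j: t * N_JOINTS + j
    for t in range(n_frames):
        for i, j in BONES:                       # spatial (bone) edges
            A[node(t, i), node(t, j)] = A[node(t, j), node(t, i)] = 1
        if t + 1 < n_frames:                     # temporal edges
            for j in range(N_JOINTS):
                A[node(t, j), node(t + 1, j)] = 1
                A[node(t + 1, j), node(t, j)] = 1
    return A
```

The resulting symmetric adjacency matrix is exactly the kind of irregular-domain input that the GCNs discussed above operate on.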