Doctoral research project number: 4614

Description

Submission date: 1 January 1900
Title: Weakly and semi-supervised learning for image classification
Thesis supervisor: Matthieu CORD (ISIR (EDITE))
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined

Abstract:

{{{General context}}}

In recent years, we have witnessed an explosion of successful applications of deep learning, including image recognition, speech recognition, automatic translation, self-driving cars, computers that can beat professional Go players, and recommender systems. Because of the abundance of training data, machine learning techniques can now deal with increasingly large and complex inputs (video, sound, speech, text, etc.). As an example, the ImageNet dataset, one of the major object recognition benchmarks in computer vision, consists of more than 14 million images covering more than 20 thousand classes. Computing power is advancing rapidly with massively parallel GPU architectures, changing the nature of machine learning research. In recent years, deep learning has emerged as one of the most effective approaches to exploit the available data and computing power. In deep learning, data is used to train a model consisting of multiple non-linear processing layers, from the raw input data all the way to the target predictions, each of which has trainable parameters. Deep networks designed for these tasks have millions, if not billions, of parameters and require enormous resources to train in terms of data, memory and computing power.

{{{Subject}}}

The purpose of this thesis is to study the use of unlabeled data, which requires no human annotation, for joint supervised and unsupervised deep learning. The idea of combining unsupervised methods with supervision is not new [Suddarth90, Larochelle08]. The basic criterion for unsupervised learning is the reconstruction (or generation) of the input. For instance, with an auto-encoder for natural images, the decoder tries to reconstruct the original input (with all its details) from the internal representation. This approach, however, might not be optimal for tasks such as image classification, where pixel-level details are not relevant, since such models should actually be invariant to many low-level imaging conditions. The recently proposed ladder network [Valpola15] modifies the basic auto-encoder reconstruction scheme by not requiring the reconstruction to be performed from the deepest internal representation alone. Instead, the decoder also receives input from the corresponding layers of the encoder through lateral connections, so that fine details do not need to be propagated up to the deepest representational layers (a first code sketch of this idea is given below).

Based on the same idea, we will explore other schemes to learn hierarchical representations suitable for abstract high-level tasks such as image classification, while also maintaining the capability to reconstruct fine details of the input, to ensure compatibility with reconstruction-based loss functions. We propose to combine deep architectures based on scattering operators [Bruna13] with recent approaches dedicated to generating data (images) from deep models [Mahendran15, Gregor15]. The idea behind scattering deep models [Bruna13] is to linearize deformations, so that variations due to irrelevant aspects are discarded in the high-level representations. This, however, makes reconstructing the lower levels impossible from the higher representations alone. We propose here to explicitly model the invariant and variant features for each training example, so that reconstruction with a deep generative model can be performed from this extended information (see the second sketch below). We will consider as a baseline the deep top-down RBM networks [Goh13], recently proposed as a joint unsupervised/supervised scheme.
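As an illustration, here is a minimal sketch of the lateral-connection idea, assuming PyTorch; the layer sizes, the corruption noise, and the gated combinator are illustrative assumptions for this sketch, not the exact architecture of [Valpola15].

{{{
import torch
import torch.nn as nn

class LadderAE(nn.Module):
    """Auto-encoder whose decoder mixes in lateral inputs from the encoder."""
    def __init__(self, sizes=(784, 500, 250, 10)):
        super().__init__()
        self.enc = nn.ModuleList(nn.Linear(a, b) for a, b in zip(sizes, sizes[1:]))
        self.dec = nn.ModuleList(nn.Linear(b, a) for a, b in zip(sizes, sizes[1:]))
        # One learned gate per encoder layer decides how much lateral detail
        # the decoder copies back at that depth.
        self.gates = nn.ParameterList(nn.Parameter(torch.zeros(a)) for a in sizes[:-1])

    def forward(self, x, noise_std=0.3):
        h = x + noise_std * torch.randn_like(x)   # corrupted encoder path
        laterals = []
        for layer in self.enc:
            laterals.append(h)
            h = torch.relu(layer(h))
        top = h                                    # abstract top-level code
        for layer, lat, gate in zip(reversed(self.dec), reversed(laterals),
                                    reversed(self.gates)):
            h = layer(h)
            g = torch.sigmoid(gate)
            h = g * lat + (1.0 - g) * h            # lateral connection
        return top, h                              # top code and reconstruction

model = LadderAE()
x = torch.rand(32, 784)
top, recon = model(x)
recon_loss = nn.functional.mse_loss(recon, x)      # unsupervised term; a
                                                   # supervised loss on `top`
                                                   # would be added to it
}}}

The point of the gates is that each decoder layer can copy low-level detail directly from the encoder, so the top code is free to keep only the task-relevant abstractions.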
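The explicit invariant/variant decomposition can be sketched in the same spirit. Assuming PyTorch again, with illustrative module sizes and an arbitrary loss weighting, the classifier reads only the invariant code, while the decoder needs both codes to reconstruct the input.

{{{
import torch
import torch.nn as nn
import torch.nn.functional as F

class SplitAE(nn.Module):
    """Encoder splits the code into an invariant part and a variant part."""
    def __init__(self, d_in=784, d_inv=64, d_var=64, n_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                     nn.Linear(256, d_inv + d_var))
        self.decoder = nn.Sequential(nn.Linear(d_inv + d_var, 256), nn.ReLU(),
                                     nn.Linear(256, d_in))
        self.classifier = nn.Linear(d_inv, n_classes)
        self.d_inv = d_inv

    def forward(self, x):
        z = self.encoder(x)
        z_inv, z_var = z[:, :self.d_inv], z[:, self.d_inv:]
        logits = self.classifier(z_inv)                         # invariant only
        recon = self.decoder(torch.cat([z_inv, z_var], dim=1))  # both parts
        return logits, recon

model = SplitAE()
x, y = torch.rand(32, 784), torch.randint(0, 10, (32,))
logits, recon = model(x)
# Joint objective: supervised term on the invariant code, unsupervised term
# on the full reconstruction (the weighting 1.0 is an arbitrary choice here).
loss = F.cross_entropy(logits, y) + 1.0 * F.mse_loss(recon, x)
}}}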
Our proposal aims at extending this baseline, in particular by defining a training scheme in which the supervised and unsupervised criteria are learned jointly, and by explicitly modeling the decomposition into variant and invariant representations. One assumption we want to validate is that this explicit decomposition drives the learning towards more effective (robust) representations. The second approach we propose is to explore representations of a different nature, in particular mixed representations composed of both continuous and discrete values, as recently proposed in the supervised case [Tang13, Zhu14]. Mixing these two types of representations can be seen as learning an (image) representation composed of both a concept (discrete values) and a context (real values). The concept could be, for example, a binary vector corresponding to a category (e.g. a cat), while the context would be mapped to a continuous vector corresponding to the visual context of the category (e.g. on a red sofa). While this information will be extracted through unsupervised techniques such as auto-encoders, a classifier can make use of the binary information only. Such a model will thus be learned by simultaneously using a reconstruction loss and a classification loss, the latter being defined only on the binary part of the extracted representation. Since the binary part is discrete information sampled from the observation, learning will require gradient estimators able to propagate through this non-differentiable sampling step; a sketch of such a model is given below.
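Here is a minimal sketch of such a concept/context model, again assuming PyTorch. The Bernoulli sampling, the straight-through gradient estimator, and all layer sizes are assumptions made for this sketch, one common way to learn through discrete samples, not choices mandated by [Tang13, Zhu14].

{{{
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConceptContextAE(nn.Module):
    """Binary 'concept' code plus continuous 'context' code."""
    def __init__(self, d_in=784, d_bin=32, d_ctx=64, n_classes=10):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(d_in, 256), nn.ReLU(),
                                 nn.Linear(256, d_bin + d_ctx))
        self.dec = nn.Sequential(nn.Linear(d_bin + d_ctx, 256), nn.ReLU(),
                                 nn.Linear(256, d_in))
        self.cls = nn.Linear(d_bin, n_classes)
        self.d_bin = d_bin

    def forward(self, x):
        h = self.enc(x)
        p = torch.sigmoid(h[:, :self.d_bin])    # Bernoulli parameters
        ctx = h[:, self.d_bin:]                 # continuous context
        b = torch.bernoulli(p)                  # discrete concept sample
        b = b + p - p.detach()                  # straight-through gradient
        logits = self.cls(b)                    # classify from the concept only
        recon = self.dec(torch.cat([b, ctx], dim=1))  # reconstruct from both
        return logits, recon

model = ConceptContextAE()
x, y = torch.rand(32, 784), torch.randint(0, 10, (32,))
logits, recon = model(x)
loss = F.cross_entropy(logits, y) + F.mse_loss(recon, x)
loss.backward()  # the straight-through trick lets gradients reach the encoder
}}}

With the straight-through trick, the binary code behaves as a hard sample in the forward pass, while gradients from both the classification and the reconstruction losses still flow back to the encoder.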

Doctoral candidate: Robert Thomas