Doctoral research project number: 6107

Description

Submission date: 26 June 2019
Title: Multimodal analysis and knowledge inference for musical creativity
Thesis supervisor: Carlos AGON (STMS)
Thesis supervisor: Patrick GALLINARI (ISIR (EDITE))
Thesis supervisor: Philippe Joseph Rene ESLING (STMS)
Scientific domain: Information and communication sciences and technologies
CNRS theme: Not defined

Abstract: The recent advent of deep learning [Bengio09] has sparked flourishing interest within the machine learning community. Its goal is to train connectionist architectures analogous to well-known neural networks (NN). Earlier architectures, however, were limited to low depth (few layers) because the gradient diffusion problem made learning inefficient in deeper networks. The breakthrough of greedy layer-wise pre-training [Hinton07] allows each layer of the network to be trained independently and in an unsupervised manner. Each layer thus exploits the statistical regularities of the previous level of representation, and deep architectures can be seen as a stack of increasingly higher-level abstractions that decompose a complex problem into a hierarchy of simpler ones.

This project aims to develop new algorithms for learning joint multimodal embedding spaces, linking symbolic, acoustic, and perceptual sources of information to disentangle the correlations that underlie given orchestral effects. To that end, we will introduce zero-shot learning [Palatucci09] for musical content, developing architectures specifically tailored to the nature of audio signals and music writing. The embedding approach can automatically learn a space linking heterogeneous data and extract new descriptive dimensions from it, yielding sets of candidate descriptors relevant to musical orchestration. Specific transforms will therefore be developed to incorporate the multivariate time-series structures that are central to the perception of timbre. By further extending the embedding spaces through variational learning [Kingma14] with a dedicated information-content measure, we can discriminate the different dimensions of these spaces and enable processes that generate music directly from them.
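To make the idea of a joint multimodal embedding space concrete, the toy sketch below aligns two synthetic "modalities" (stand-ins for symbolic and acoustic features; all names, dimensions, and data are illustrative assumptions, not the project's actual model) by learning two linear projections into a shared space so that paired items land close together:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_sym, d_ac, d_emb = 50, 8, 12, 2

# Shared latent "concepts" observed through two different views,
# e.g. a symbolic score view and an acoustic view of the same sounds.
latent = rng.normal(size=(n, d_emb))
X_sym = latent @ rng.normal(size=(d_emb, d_sym))  # symbolic features
X_ac = latent @ rng.normal(size=(d_emb, d_ac))    # acoustic features

# Linear projections from each modality into the shared space.
W_sym = rng.normal(size=(d_sym, d_emb)) * 0.1
W_ac = rng.normal(size=(d_ac, d_emb)) * 0.1

loss0 = float(np.mean((X_sym @ W_sym - X_ac @ W_ac) ** 2))

lr = 0.05
for step in range(1000):
    diff = X_sym @ W_sym - X_ac @ W_ac  # pairwise alignment error
    # Gradient descent on the mean squared alignment loss.
    W_sym -= lr * X_sym.T @ diff / n
    W_ac += lr * X_ac.T @ diff / n

loss = float(np.mean((X_sym @ W_sym - X_ac @ W_ac) ** 2))
print(f"alignment loss: {loss0:.4f} -> {loss:.4f}")
```

A purely reconstructive alignment loss like this one can collapse to the trivial zero embedding; practical multimodal systems add contrastive or variational terms to keep the space informative, which is precisely where the variational learning mentioned above comes in.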
It has been shown that embedding spaces found through zero-shot learning exhibit semantic regularities [Mikolov13] and metric relationships across modalities [Socher13] that can be exploited in our context to pursue multiple musical and pedagogical goals. We will explore these regularities in musical data to study the underlying metric relationships and perform semantic inference on the data. All of these methods can learn in semi-supervised settings, leveraging massive sets of unlabeled data that are then refined with small sets of labeled (supervised) knowledge.
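The semantic regularities of [Mikolov13] can be illustrated with vector arithmetic in a toy musical embedding. The instrument and playing-technique names, and the additive construction of the vectors, are illustrative assumptions rather than learned embeddings from the project:

```python
import numpy as np

rng = np.random.default_rng(1)
instruments = {name: rng.normal(size=16) for name in ["violin", "cello", "flute"]}
techniques = {name: rng.normal(size=16) for name in ["arco", "pizzicato", "tremolo"]}

# Each labelled sound is embedded as instrument vector + technique vector,
# so differences along one factor are shared across the other.
emb = {f"{i}_{t}": instruments[i] + techniques[t]
       for i in instruments for t in techniques}

def nearest(query, exclude):
    """Vocabulary item closest to `query` by cosine similarity."""
    best, best_sim = None, -np.inf
    for name, v in emb.items():
        if name in exclude:
            continue
        sim = v @ query / (np.linalg.norm(v) * np.linalg.norm(query))
        if sim > best_sim:
            best, best_sim = name, sim
    return best

# violin_pizzicato - violin_arco + cello_arco ~ cello_pizzicato
query = emb["violin_pizzicato"] - emb["violin_arco"] + emb["cello_arco"]
result = nearest(query, exclude={"violin_pizzicato", "violin_arco", "cello_arco"})
print(result)  # -> cello_pizzicato
```

Because the construction is exactly additive, the analogy resolves perfectly here; in a learned space these relationships hold only approximately, which is what makes studying their metric structure in real musical data interesting.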

Doctoral student: Prang Mathieu