Description
Deposit date: 1 January 1900
Titre: Structure discovery in multivariate musical audio signals through semi-supervised variational learning
Thesis supervisor:
Gérard ASSAYAG (STMS)
Advisor:
Jérôme NIKA (STMS)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined
Abstract:
This PhD work will be co-supervised by Philippe Esling and Jérôme Nika.
The aim of this research project is to model the multivariate information structures inherent to multiple sound signals through methods of {variational learning}. The main goal is to learn the intricate interactions among multiple musical sound sources in order to discriminate the most salient features of each source and to understand how they influence the overall structure of the global mixture. Here, we consider structure as any underlying sequence that constitutes a higher-level abstraction of an original input sequence. In musical audio signals, this includes both the high-level properties (e.g., chord progressions, key changes, thematic organisation) and the resulting audio signal (e.g., emergent timbral properties well known in orchestration) of sound mixtures. We aim to assess those interactions directly by combining ladder networks and variational auto-encoders. These approaches provide an adequate mathematical representation of lower-dimensional properties hidden in high-dimensional spaces. By relying on {semi-supervised learning} (mixing unsupervised learning with supervised examples) on different tasks of {structure discovery}, this project aims to discover embedding spaces able both to explain and to directly generate novel data following the underlying distribution of musical inputs while fulfilling their structural properties. These representations of audio signals can yield direct applications such as music information retrieval, structure discovery, sound morphing, source separation and transcription, and lead to novel approaches for digital audio synthesis and musical human-computer interaction. In particular, this project will take place within an ongoing research theme, involving the STMS lab along with a network of national and international collaborations (ANR, Inria, McGill U., EHESS, UCSD, U. Berkeley, U. Columbia), on the Creative Dynamics of Improvised Interactions (DYCI2). This doctoral project will benefit from this important momentum and contribute to it by bringing in novel {generative learning} methodologies for extracting the salient features of musical signals at different structural scales, and by feeding polyphonic musical expertise into live creative agents interacting with humans.
As a first step, {variational learning} techniques previously investigated in image processing will be adapted to single-source audio data as a preliminary attempt to model audio signal distributions. To that end, we will focus on unsupervised representation learning algorithms, such as variational auto-encoders, which aim to model the distribution of input data through latent spaces. By combining discriminative and generative learning, these methods can construct explanatory latent spaces in an unsupervised fashion. The main idea behind this technique is to use the learned (lower-dimensional) code as a probability distribution from which we can sample, generating examples that follow the data distribution. Hence, the obtained lower-dimensional spaces provide a straightforward mechanism for generating new data that follows the original distribution. We will then compare the spaces obtained for different audio sound sources and instruments in multi-track musical recordings. By leveraging an international SSHRC-funded research project with McGill University, we are currently gathering the largest collection of multi-track orchestral recordings. As another important asset for this project, DYCI2 has access, through specific agreements, to a vast collection of concert multi-track audio and video recordings that is unique in the world: the Montreux Jazz Festival archive, inscribed in the UNESCO {Memory of the World} Register. Live recordings are important for the understanding and generation of improvisation.
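As a minimal sketch of this mechanism, assuming PyTorch and input frames normalized to [0, 1] (the 1024-bin input size, layer widths and 16-dimensional latent space are illustrative assumptions, not project specifications):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VAE(nn.Module):
        """Variational auto-encoder: maps input frames to a latent Gaussian
        and decodes samples from that distribution back to the input space."""
        def __init__(self, input_dim=1024, hidden_dim=512, latent_dim=16):
            super().__init__()
            self.enc = nn.Linear(input_dim, hidden_dim)
            self.mu = nn.Linear(hidden_dim, latent_dim)       # posterior mean
            self.logvar = nn.Linear(hidden_dim, latent_dim)   # posterior log-variance
            self.dec1 = nn.Linear(latent_dim, hidden_dim)
            self.dec2 = nn.Linear(hidden_dim, input_dim)

        def encode(self, x):
            h = F.relu(self.enc(x))
            return self.mu(h), self.logvar(h)

        def reparameterize(self, mu, logvar):
            # Sample z ~ N(mu, sigma^2) via the reparameterization trick
            std = torch.exp(0.5 * logvar)
            return mu + std * torch.randn_like(std)

        def decode(self, z):
            return torch.sigmoid(self.dec2(F.relu(self.dec1(z))))

        def forward(self, x):
            mu, logvar = self.encode(x)
            return self.decode(self.reparameterize(mu, logvar)), mu, logvar

    def vae_loss(x_hat, x, mu, logvar):
        # Reconstruction term plus KL divergence to the unit-Gaussian prior
        rec = F.mse_loss(x_hat, x, reduction='sum')
        kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return rec + kld

Once such a model is trained, drawing z from N(0, I) and calling decode(z) produces new frames that follow the learned distribution, which is the generation mechanism described above.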
Second, we will target structure discovery in these complex audio recordings by extending the previously developed approaches with semi-supervised methods such as {ladder network} architectures. By combining a discrimination (supervised) task with a reconstruction (unsupervised) task, these approaches achieve very high accuracy while relying on only a few labeled examples. In our case, this can alleviate the relative scarcity of labeled multi-track recordings, as compared to the very large amount of freely available audio data. We intend to study various extensions of this technique by injecting multiple types of learning signals (prediction, regression) into increasingly complex learning architectures.
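The ladder network itself adds per-layer denoising targets; the sketch below keeps only the shared principle, a joint classification-plus-reconstruction loss over a shared encoder. The chord-label head, the loss weight alpha and all layer sizes are assumptions made for illustration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SemiSupervisedNet(nn.Module):
        """Shared encoder with a classification head (supervised task)
        and a reconstruction head (unsupervised task)."""
        def __init__(self, input_dim=1024, hidden_dim=256, n_classes=24):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(input_dim, hidden_dim), nn.ReLU())
            self.classifier = nn.Linear(hidden_dim, n_classes)  # e.g. chord labels
            self.decoder = nn.Linear(hidden_dim, input_dim)

        def forward(self, x):
            h = self.encoder(x)
            return self.classifier(h), self.decoder(h)

    def semi_supervised_loss(model, x_labeled, y, x_unlabeled, alpha=0.1):
        # Supervised term: classification on the few labeled examples
        logits, rec_l = model(x_labeled)
        sup = F.cross_entropy(logits, y)
        # Unsupervised term: reconstruction on plentiful unlabeled audio
        _, rec_u = model(x_unlabeled)
        unsup = F.mse_loss(rec_l, x_labeled) + F.mse_loss(rec_u, x_unlabeled)
        return sup + alpha * unsup

Because the reconstruction term needs no labels, each training batch can mix a handful of annotated frames with a much larger unlabeled set, which is how these methods exploit the abundance of unlabeled audio mentioned above.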
Finally, we will analyze the temporal aspects of multiple musical sound sources at variable time scales. These questions are particularly critical in musical structures, where any time instant is defined by, and dependent on, multiple temporal contexts spanning from a few notes to the global structure of a piece. This requires embedding notions of different memory spans into current models.
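One possible way to embed several memory spans at once, sketched here as an assumption rather than a method prescribed by the project, is to read a latent sequence with recurrent layers operating at different temporal strides, so that each layer summarizes a different time scale:

    import torch
    import torch.nn as nn

    class MultiScaleEncoder(nn.Module):
        """Encode a latent sequence at several time scales by subsampling
        the sequence before each recurrent layer (strides are illustrative)."""
        def __init__(self, latent_dim=16, hidden_dim=64, strides=(1, 8, 64)):
            super().__init__()
            self.strides = strides
            self.rnns = nn.ModuleList(
                nn.GRU(latent_dim, hidden_dim, batch_first=True) for _ in strides
            )

        def forward(self, z_seq):
            # z_seq: (batch, time, latent_dim) sequence of per-frame latent codes
            contexts = []
            for stride, rnn in zip(self.strides, self.rnns):
                _, h = rnn(z_seq[:, ::stride, :])  # coarser stride = longer memory span
                contexts.append(h.squeeze(0))
            # Concatenate short-, mid- and long-term summaries of the sequence
            return torch.cat(contexts, dim=-1)

    # Usage: summarize 1000 latent frames at three assumed memory spans
    z = torch.randn(2, 1000, 16)
    ctx = MultiScaleEncoder()(z)  # shape: (2, 192)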
Doctoral candidate: Carsault Tristan