Doctoral research project number: 3436

Description

Submission date: 1 January 1900
Titre: High Quality Voice Conversion by modelling and transformation of extended voice characteristics
Thesis supervisor: Xavier RODET (STMS)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined

Abstract:

1. Voice Conversion (VC) aims to transform the characteristics of a source speaker's voice so that it is perceived as being uttered by a target speaker. The principle of VC is to define mapping functions that convert one source speaker's voice into one target speaker's voice. The transformation functions to be applied adapt instantaneously to the contextual characteristics of the source voice.

The proposed project is centered on Voice Identity Conversion. Its goals are promising for the domains in which voice and musical audio play an important role, such as video games, animation, video and film, post-production, dubbing, music creation and production, multimedia in general (e.g. avatars), communication systems for transmission, etc. Voice conversion has received increasing attention within the speech research community in recent years because of improvements in sound and conversion quality and because of its many potential applications. The reproduction of, or transformation into, specific voices may find use in human voice avatar generation, improvement of voice transformation systems, voice dubbing for movies, re-creation of the voices of deceased persons from old recordings, voice style editing for dialects and discourse genres, biometric testing, voice pathology, and even telephone transmission and mobile communication.

2. High Quality Voice Conversion

Achieving VC of the highest quality requires improving on state-of-the-art Voice Conversion techniques. The two main problems in conventional Voice Conversion are insufficient similarity between the transformed source voice and the target voice, and artifacts in the transformed signal. The following sections describe the principal ideas that will be investigated in the research project.
In general, performance will be monitored continuously during the research stages by means of objective evaluation measures. At the end of each research unit, a subjective listening test will be conducted to validate the decisions made.

2.1 Glottal source and vocal tract separation

Taking the glottal source parameters into account is considered an important and challenging factor for VC systems, and several recent VC systems use glottal pulse parameters. Estimating and representing the excitation characteristics of the source and target speakers will allow us to transform the source signal in a manner that is coherent with the transformation of the vocal tract filter.

The mapping function of state-of-the-art VC systems is conditioned on spectral envelope features. The idea is that the mapping between source and target features should change with the phoneme, and that decoding into phonemes can be performed by clustering spectral envelope features. Because the glottal pulse, and especially the glottal formant, is part of the spectral envelope, the phonetic decoding in current VC systems is suboptimal. An improvement can be expected if the effect of the glottal pulse is removed from the spectral envelope before the statistical model is trained. Glottal pulse parameters can then be added as separate parameters and transformed explicitly.

In the present work package we propose to evaluate the benefits of separating the glottal source from the spectral envelope before training the statistical model on the remaining vocal tract filter representation. Before its integration into the VC system, the excitation source characterization, which is based on a recent PhD thesis at IRCAM on estimating the glottal pulse parameter Rd, must be evaluated and possibly improved.
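
The argument of section 2.1 can be illustrated with a small Python sketch. It is only a toy under stated assumptions, not the method of the project: the glottal contribution is modelled here as a flat response up to a glottal formant followed by a constant tilt per octave, which stands in for a real glottal pulse model such as one parameterized by Rd. The function names (`glottal_tilt_db`, `remove_glottal_contribution`), the parameter values, and the synthetic vocal tract resonance are all hypothetical. The sketch shows that two frames sharing the same vocal tract filter but differing in voice quality lie far apart as raw spectral envelopes, and coincide once the glottal part is subtracted in the log-magnitude domain.

```python
import numpy as np


def glottal_tilt_db(freqs_hz, glottal_formant_hz, slope_db_per_octave):
    """Toy glottal magnitude response in dB (an assumption, not an Rd-based model).

    Flat up to the glottal formant, then a constant tilt per octave above it.
    """
    freqs_hz = np.asarray(freqs_hz, dtype=float)
    octaves_above = np.log2(
        np.maximum(freqs_hz, glottal_formant_hz) / glottal_formant_hz
    )
    return slope_db_per_octave * octaves_above


def remove_glottal_contribution(envelope_db, freqs_hz,
                                glottal_formant_hz, slope_db_per_octave):
    """Subtract the glottal response from a spectral envelope, both in dB.

    In the log-magnitude domain the source-filter product becomes a sum,
    so the subtraction leaves an estimate of the vocal tract filter.
    """
    glottal_db = glottal_tilt_db(freqs_hz, glottal_formant_hz, slope_db_per_octave)
    return np.asarray(envelope_db, dtype=float) - glottal_db


# One synthetic vocal tract (a single resonance near 700 Hz), rendered with
# two illustrative voice qualities: a weak tilt and a strong tilt.
freqs = np.linspace(0.0, 8000.0, 257)
vocal_tract_db = 20.0 * np.exp(-0.5 * ((freqs - 700.0) / 150.0) ** 2)
env_weak_tilt = vocal_tract_db + glottal_tilt_db(freqs, 300.0, -6.0)
env_strong_tilt = vocal_tract_db + glottal_tilt_db(freqs, 150.0, -12.0)

# Distance between the two frames before and after removing the glottal part.
dist_before = np.linalg.norm(env_weak_tilt - env_strong_tilt)
vt_weak = remove_glottal_contribution(env_weak_tilt, freqs, 300.0, -6.0)
vt_strong = remove_glottal_contribution(env_strong_tilt, freqs, 150.0, -12.0)
dist_after = np.linalg.norm(vt_weak - vt_strong)
print(f"distance before: {dist_before:.2f}, after: {dist_after:.2e}")
```

In this toy setting the glottal parameters of each frame are known exactly, so the subtraction is perfect. In practice they would have to be estimated from the signal, which is why the text requires the excitation source characterization to be evaluated before integration.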
2.2 Voice Quality Transformation

The extended set of features handled in the Voice Conversion process requires some extensions of the existing voice transformation algorithms. The fine control of voice quality features needed for transformation into a well-defined target speaker requires extended means for coherently transforming the glottal pulse shape parameter and the voiced/unvoiced frequency boundary (VUF). The VUF is the frequency limit between voiced and unvoiced signal components. It is especially important for avoiding incoherent spectral envelope modifications, notably the disturbing artifacts that current VC systems produce when formants are excited by unvoiced source signals. Including glottal source parameters in the statistical model requires an algorithm capable of applying glottal source parameter modifications to the source signal. An initial glottal pulse parameter transformation algorithm has been developed at IRCAM. This algorithm needs to be improved so that it can take modification of the VUF into account.

2.3 Optimized probabilistic conversion

One of the sources of artifacts in VC systems is the fact

Doctoral student: Huber Stefan