Description
Filing date: 1 January 1900
Title: High Quality Voice Conversion by modelling and transformation of extended voice characteristics
Thesis supervisor:
Xavier RODET (STMS)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined
Abstract:
1.
Voice Conversion (VC) aims to transform the characteristics of a source speaker's voice so that the speech is perceived as having been uttered by a target speaker. The principle of VC is to define mapping functions that convert the voice of one source speaker into the voice of one target speaker. The transformation functions applied adapt instantaneously to the contextual characteristics of the source voice.
The proposed project is centered on Voice Identity Conversion. Its goals are relevant to the many domains in which voice and musical audio play an important role, such as video games, animation, video and film, post-production, dubbing, music creation and production, multimedia in general (e.g. avatars), and communication systems for transmission. Voice conversion has received increasing attention in the speech research community in recent years because of improvements in sound and conversion quality and because of its many potential applications. The reproduction of, or transformation into, specific voices may find use in human voice avatar generation, the improvement of voice transformation systems, voice dubbing for movies, the re-creation of the voices of deceased persons from old recordings, voice style editing for dialects and discourse genres, biometric testing, voice pathology, and even telephone transmission and mobile communication.
2. High Quality Voice Conversion
To achieve VC of the highest quality, the state-of-the-art techniques in Voice Conversion must be improved. The two main problems in conventional Voice Conversion are the insufficient similarity between the transformed source voice and the target voice, and the artifacts present in the transformed signal.
The following sections describe the principal ideas that will be investigated in the research project. Throughout the research stages, performance will be monitored continuously by means of objective evaluation measures. At the end of each research unit, a subjective listening test will be conducted to validate the decisions made.
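The text does not say which objective measures are meant. As an illustration only, the sketch below computes mel-cepstral distortion (MCD), a measure commonly reported in the VC literature. The function name `mel_cepstral_distortion` and the assumption that both sequences are already time-aligned are ours, not the project's.

```python
import numpy as np


def mel_cepstral_distortion(converted, target):
    """Mean mel-cepstral distortion in dB between two aligned sequences.

    Both inputs have shape (frames, coefficients). The 0th (energy)
    coefficient is assumed to have been removed beforehand, and the two
    sequences are assumed to be time-aligned frame by frame.
    """
    converted = np.asarray(converted, dtype=float)
    target = np.asarray(target, dtype=float)
    if converted.shape != target.shape:
        raise ValueError("sequences must be time-aligned and equally sized")
    diff = converted - target
    # Per-frame distortion: (10 / ln 10) * sqrt(2 * sum of squared differences)
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```

A lower value indicates that the converted cepstra lie closer to the target cepstra. Such a measure complements the subjective listening tests and does not replace them.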
2.1 Glottal source and vocal tract separation
Taking the glottal source parameters into account is considered an important and challenging factor for VC systems, and several recent VC systems use glottal pulse parameters. Estimating and representing the excitation characteristics of the source and target speakers will allow us to transform the source signal in a manner that is coherent with the transformation of the vocal tract filter.
The mapping function of state-of-the-art VC systems is conditioned on spectral envelope features. The idea is that the mapping between source and target features should change with the phoneme, and that the decoding into phonemes can be performed by clustering spectral envelope features. Because the glottal pulse, and especially the glottal formant, is part of the spectral envelope, the phonetic decoding in current VC systems is suboptimal. An improvement can be expected if the effect of the glottal pulse is removed from the spectral envelope before the statistical model is trained. Glottal pulse parameters can then be added as separate parameters and transformed explicitly.
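The idea of a mapping conditioned on clustered envelope features can be sketched as follows. This is a deliberately simplified stand-in, not the project's model: hard k-means clustering replaces the statistical model (the literature typically uses soft, mixture-based models), each cluster receives its own affine map fitted by least squares, and the names `kmeans`, `train_mapping` and `convert` are hypothetical. The source and target feature frames are assumed to be time-aligned already.

```python
import numpy as np


def nearest_center(x, centers):
    """Index of the closest cluster center for every feature frame."""
    dist = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)


def kmeans(x, k, iters=50, seed=0):
    """Plain k-means; stands in for the clustering of envelope features."""
    rng = np.random.default_rng(seed)
    centers = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        labels = nearest_center(x, centers)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # keep the old center if a cluster is empty
                centers[j] = members.mean(axis=0)
    return centers


def _augment(x):
    """Append a constant column so each map includes a bias term."""
    return np.hstack([x, np.ones((len(x), 1))])


def train_mapping(src, tgt, k):
    """Fit one affine source-to-target map per cluster of source frames."""
    src = np.asarray(src, dtype=float)
    tgt = np.asarray(tgt, dtype=float)
    centers = kmeans(src, k)
    labels = nearest_center(src, centers)
    x_aug = _augment(src)
    maps = []
    for j in range(k):
        idx = labels == j
        if not idx.any():  # empty cluster: fall back to a global fit
            idx = np.ones(len(src), dtype=bool)
        w, *_ = np.linalg.lstsq(x_aug[idx], tgt[idx], rcond=None)
        maps.append(w)
    return centers, maps


def convert(src, centers, maps):
    """Convert each frame with the map of the cluster it falls into."""
    src = np.asarray(src, dtype=float)
    labels = nearest_center(src, centers)
    x_aug = _augment(src)
    out = np.empty((len(src), maps[0].shape[1]))
    for j, w in enumerate(maps):
        idx = labels == j
        out[idx] = x_aug[idx] @ w
    return out
```

In this sketch, removing the glottal contribution before training would amount to passing vocal-tract-only features as `src` and `tgt`, so that the clusters follow phonetic content and not the properties of the excitation.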
In the present work package we propose to evaluate the benefits of separating the glottal source from the spectral envelope before training the statistical model on the remaining vocal tract filter representation. The excitation source characterization, which is based on a recent PhD thesis at IRCAM on estimating the glottal pulse parameter Rd, must be evaluated and possibly improved before it is integrated into the VC system.
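For orientation, the sketch below assumes that Rd refers to the shape parameter of the Liljencrants-Fant (LF) glottal model, and uses the regression usually attributed to Fant to predict the LF parameters Ra, Rk and Rg from Rd. The formulas are quoted here as an assumption and are not taken from the project description; the function names are hypothetical, and the regression is usually considered valid for Rd roughly between 0.3 and 2.7.

```python
def lf_shape_from_rd(rd):
    """Predict the LF shape parameters (Ra, Rk, Rg) from Rd.

    Assumed regression (attributed to Fant):
        Ra = (-1 + 4.8 * Rd) / 100
        Rk = (22.4 + 11.8 * Rd) / 100
    Rg then follows by solving the definition of Rd for Rg.
    """
    ra = (-1.0 + 4.8 * rd) / 100.0
    rk = (22.4 + 11.8 * rd) / 100.0
    rg = rk * (0.5 + 1.2 * rk) / (0.44 * rd - 4.0 * ra * (0.5 + 1.2 * rk))
    return ra, rk, rg


def rd_from_lf_shape(ra, rk, rg):
    """Inverse relation: Rd = (1 / 0.11) * (0.5 + 1.2 Rk) * (Rk / (4 Rg) + Ra)."""
    return (1.0 / 0.11) * (0.5 + 1.2 * rk) * (rk / (4.0 * rg) + ra)
```

A single scalar such as Rd is convenient in a VC setting because it can be added to the feature vector and transformed explicitly, without having to map several correlated pulse parameters independently.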
2.2 Voice Quality Transformation
The extended set of features addressed in the Voice Conversion process requires some extensions of the existing voice transformation algorithms.
The fine control of voice quality features needed for transformation into a well-defined target speaker requires extended means for the coherent transformation of the glottal pulse shape parameter and the voiced/unvoiced frequency boundary (VUF). The VUF is the frequency limit between the voiced and unvoiced signal components. It is especially important for avoiding incoherent spectral envelope modifications, notably the disturbing artifacts that current VC systems produce when formants are excited by unvoiced source signals.
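One way to picture the role of the VUF is to restrict an envelope modification to the band below it, so that the unvoiced band above the boundary is not reshaped by the same formant gains. The sketch below is illustrative only: the hard split at the boundary, the linear frequency grid and the names `split_at_vuf` and `modify_voiced_band` are assumptions, not the algorithm developed at IRCAM.

```python
import numpy as np


def split_at_vuf(magnitude, sample_rate, vuf_hz):
    """Split a one-sided magnitude spectrum into voiced and unvoiced bands.

    Bins are assumed to be spaced linearly from 0 Hz to sample_rate / 2.
    """
    magnitude = np.asarray(magnitude, dtype=float)
    freqs = np.linspace(0.0, sample_rate / 2.0, num=len(magnitude))
    voiced_mask = freqs <= vuf_hz
    voiced = np.where(voiced_mask, magnitude, 0.0)
    unvoiced = np.where(voiced_mask, 0.0, magnitude)
    return voiced, unvoiced


def modify_voiced_band(magnitude, gain, sample_rate, vuf_hz):
    """Apply a per-bin envelope gain below the VUF only."""
    voiced, unvoiced = split_at_vuf(magnitude, sample_rate, vuf_hz)
    return voiced * np.asarray(gain, dtype=float) + unvoiced
```

A practical system would more likely use a smooth transition around the boundary, and would have to move the VUF itself when the target voice quality requires it.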
Including glottal source parameters in the statistical model requires an algorithm capable of applying glottal source parameter modifications to the source signal. An initial glottal pulse parameter transformation algorithm has been developed at IRCAM. This algorithm needs to be improved so that it can take modifications of the VUF into account.
2.3 Optimized probabilistic conversion
One of the sources of artifacts in VC systems is the fact
Doctoral candidate: Huber Stefan