Description
Submission date: 31 October 2022
Titre: Learning speech and speaker representations for robust speaker and language recognition
Thesis supervisor:
Thierry GERAUD (LRE)
Advisor:
Reda DEHAK (LRE)
Scientific field: Information and communication sciences and technologies
CNRS theme: Natural language and speech processing
Abstract: Learning speech or speaker representation models for speech-processing tasks is very challenging.
Generally, we want the learned speech representations to be disentangled, invariant, and hierarchical.
Since spoken utterances contain, in addition to the phonetic contents, information about speaker
identity, style, emotion, surrounding noise, and communication channel noise, it is essential to learn
representations that disentangle these factors of variation.
Deep learning methods trained with supervised learning algorithms (Latif, 2021) on large amounts of
labeled speech data have shown remarkable success in numerous speech applications, including
speech, speaker, and language recognition (Snyder, 2018) (Desplanques, 2020). However, training
these systems requires large amounts of labeled data, which poses a barrier to deploying deep neural
networks in speech domains where labeled data are intrinsically rare, costly, or time-consuming to
collect. Recently, there has been an interest in using self-supervised representation learning methods
(Mohamed, 2022), which promise a single universal model that would benefit various tasks and
domains. They have shown success in natural language and computer vision domains. These methods
face many challenges when applied to speech data due to the continuous nature of speech. One
challenging aspect is that the strategy used to define positive and negative samples implicitly
imposes invariances on the learned representations. For example, in speaker recognition representation
learning, sampling positive samples from the same utterance adds bias to the learning process; if the
two samples are extracted from the same utterance, they share not only the same speaker identity
but also the same language, the same emotion, the same surrounding noise, and the same channel
noise. Another standing challenge is that the speech signal has no explicit segmentation into acoustic
units.
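To make the sampling bias concrete, the following is a minimal sketch (in NumPy, with hypothetical function names, not tied to any particular framework) of a contrastive InfoNCE-style objective in which the positive pair is cut from the same utterance:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Contrastive InfoNCE loss: row i of `positives` is the positive
    for row i of `anchors`; every other row serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                 # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # cross-entropy on the diagonal

def positive_pair_same_utterance(utterance, seg_len, rng):
    """Common sampling strategy: both views are segments of the SAME
    utterance. This is where the bias arises: the two views share not
    only speaker identity but also language, emotion, and channel."""
    starts = rng.integers(0, len(utterance) - seg_len, size=2)
    return (utterance[starts[0]:starts[0] + seg_len],
            utterance[starts[1]:starts[1] + seg_len])
```

Minimizing such a loss makes every factor shared by the two views indistinguishable to the encoder, so speaker identity is entangled with language, emotion, and channel conditions, which is precisely the issue discussed above.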
In this thesis, we will explore the use of different self-supervised learning methods (Chen, 2022) to improve the
robustness of speaker and language recognition systems. Specifically, we will address the problem of
vulnerabilities and how to deal with the different types of attacks, such as recent deep fakes. The rise
in voice manipulation for criminal purposes has become a real challenge for speaker recognition
systems (ENLETS, 2021), especially since it is now easy to generate fake voices
using voice conversion or speech synthesis technologies.
References:
(Latif, 2021) S. Latif, R. Rana, S. Khalifa, R. Jurdak, J. Qadir, and B. W. Schuller, “Deep representation
learning in speech processing: Challenges, recent advances, and future trends,” 2021.
(Mohamed, 2022) A. Mohamed et al., "Self-Supervised Speech Representation Learning: A Review,"
in IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1179-1210, Oct. 2022.
(Chen, 2022) Z. Chen, S. Chen, Y. Wu, Y. Qian, C. Wang, S. Liu, Y. Qian, and M. Zeng, "Large-Scale
Self-Supervised Speech Representation Learning for Automatic Speaker Verification," in Proc. IEEE
ICASSP 2022, pp. 6147–6151.
(Desplanques, 2020) B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized
Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification," in Proc.
Interspeech 2020, pp. 3830–3834.
(Snyder, 2018) D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust
DNN embeddings for speaker recognition," in Proc. IEEE ICASSP 2018, pp. 5329–5333.
(ENLETS, 2021) ENLETS, "Synthetic Reality & Deep Fakes: Impact on Police Work," June 2021.
Doctoral student: Théo Lepage