Description
Submission date: 31 October 2022
Titre: Learning speech and speaker representations for robust speaker and language recognition
Thesis supervisor:
Thierry GERAUD (LRE)
Advisor:
Reda DEHAK (LRE)
Scientific field: Information and communication sciences and technologies
CNRS theme: Natural language and speech processing
Abstract: Learning speech or speaker representation models for speech-processing tasks is very challenging.
Generally, we want the learned speech representations to be disentangled, invariant, and hierarchical.
Since spoken utterances contain, in addition to the phonetic contents, information about speaker
identity, style, emotion, surrounding noise, and communication channel noise, it is essential to learn
representations that disentangle these factors of variation.
Deep learning methods trained with supervised learning algorithms (Latif, 2021) on large amounts of
labeled speech data have shown remarkable success in numerous speech applications, including
speech, speaker, and language recognition (Snyder, 2018) (Desplanques, 2020). However, training
these systems requires large amounts of labeled data, which poses a barrier to deploying deep neural
networks in speech domains where labeled data are intrinsically rare, costly, or time-consuming to
collect. Recently, there has been an interest in using self-supervised representation learning methods
(Mohamed, 2022), which promise a single universal model that would benefit various tasks and
domains. They have shown success in natural language and computer vision domains. These methods
face many challenges when applied to speech data due to the continuous nature of speech. One
challenging aspect is that the strategy used to define positive and negative samples implicitly
imposes invariances on the learned representations. For example, in speaker recognition representation
learning, sampling positive samples from the same utterance adds bias to the learning process; if the
two samples are extracted from the same utterance, they share not only the same speaker identity
but also the same language, the same emotion, the same surrounding noise, and the same channel
noise. Another standing challenge is that the speech signal has no explicit segmentation into acoustic
units.
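To make the sampling bias concrete, the following is a minimal sketch (in NumPy, with hypothetical function names, not tied to any particular framework) of a contrastive InfoNCE-style objective in which the positive pair is cut from the same utterance:

```python
import numpy as np

def info_nce_loss(anchors, positives, temperature=0.1):
    """Contrastive InfoNCE loss: row i of `positives` is the positive
    for row i of `anchors`; every other row serves as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (a @ p.T) / temperature                 # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))               # cross-entropy on the diagonal

def positive_pair_same_utterance(utterance, seg_len, rng):
    """Common sampling strategy: both views are segments of the SAME
    utterance. This is where the bias arises: the two views share not
    only speaker identity but also language, emotion, and channel."""
    starts = rng.integers(0, len(utterance) - seg_len, size=2)
    return (utterance[starts[0]:starts[0] + seg_len],
            utterance[starts[1]:starts[1] + seg_len])
```

Minimizing such a loss makes every factor shared by the two views indistinguishable to the encoder, so speaker identity is entangled with language, emotion, and channel conditions, which is precisely the issue discussed above.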
In this thesis, we will explore the use of different self-supervised learning methods (Chen, 2022) to improve the
robustness of speaker and language recognition systems. Specifically, we will address the problem of
vulnerabilities and how to deal with the different types of attacks, such as recent deep fakes. The rise
in voice manipulation for criminal purposes has become a real challenge for speaker recognition
systems (ENLETS, 2021), especially since it is now easy to generate fake voices
using voice conversion or speech synthesis technologies.
References:
(Latif, 2021) S. Latif, R. Rana, S. Khalifa, R. Jurdak, J. Qadir, and B. W. Schuller, “Deep representation
learning in speech processing: Challenges, recent advances, and future trends,” 2021.
(Mohamed, 2022) A. Mohamed et al., "Self-Supervised Speech Representation Learning: A Review,"
in IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1179-1210, Oct. 2022.
(Chen, 2022) Z. Chen, S. Chen, Y. Wu, Y. Qian, C. Wang, S. Liu, Y. Qian, and M. Zeng, "Large-Scale
Self-Supervised Speech Representation Learning for Automatic Speaker Verification," in Proc. IEEE
ICASSP 2022, pp. 6147–6151.
(Desplanques, 2020) B. Desplanques, J. Thienpondt, and K. Demuynck, "ECAPA-TDNN: Emphasized
Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification," in Proc.
Interspeech 2020, pp. 3830–3834.
(Snyder, 2018) D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur, "X-vectors: Robust
DNN embeddings for speaker recognition," in Proc. IEEE ICASSP 2018, pp. 5329–5333.
(ENLETS, 2021) ENLETS, "Synthetic Reality & Deep Fakes: Impact on Police Work," June 2021.
Doctoral student: Théo Lepage