Doctoral research project number: 4599

Description

Submission date: 1 January 1900
Title: Online Diarization Enhanced by recent Speaker identification and Structured prediction Approaches
Thesis supervisor: Nicholas EVANS (Eurecom)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined

Abstract: Speaker diarization is an unsupervised process which aims to identify each speaker within an audio stream and to determine when each speaker is active. It assumes that the number of speakers, their identities and their speech turns are all unknown. Speaker diarization has become a key technology in many domains such as content-based information retrieval, voice biometrics, forensics and social-behavioural analysis. Example applications include speech and speaker indexing, speaker recognition (in the presence of multiple speakers), speaker role detection, speech-to-text transcription, speech-to-speech translation and audiovisual content structuring. Although speaker diarization has been studied for almost two decades, current state-of-the-art systems suffer from many limitations. Such systems are extremely domain-dependent. For instance, a speaker diarization system trained on radio/TV broadcast news suffers drastically degraded performance when tested on a different type of recording such as radio/TV debates, meetings, lectures, conversational telephone speech or conversational voice-over-IP speech. Overlapping speech, spontaneous speaking style, background noise, music and other non-speech sources (laughter, applause, etc.) are all nuisance factors which badly affect the reliability of speaker diarization. Furthermore, most existing work addresses the problem of offline speaker diarization: the system has access to the entire audio recording beforehand and no real-time processing is required. Multi-pass processing over the same data is therefore feasible and a wide range of elegant machine learning tools can be used. However, these assumptions are not admissible in real-time applications, particularly those related to public security and the fight against terrorism and cybercrime.
Moreover, after an initial step of segmentation into speech turns, most approaches treat speaker diarization as a bag-of-speech-turns clustering problem and do not take into account the inherent temporal structure of interactions between speakers. Better performance may be achieved by integrating this information through structured prediction techniques, improving over standard hierarchical clustering methods. Speaker diarization is inherently related to speaker recognition. In recent years, the performance of state-of-the-art speaker recognition systems has improved enormously thanks to new recognition paradigms such as i-vectors and deep learning, new session compensation techniques such as probabilistic linear discriminant analysis, and new score normalization techniques such as adaptive symmetric score normalization. However, existing speaker diarization systems do not take full advantage of these new techniques.

Doctoral student: Patino Villar Jose Maria