Doctoral research project number: 8498


Submission date: 11 April 2023
Title: Multimodal behavior generation and style transfer for virtual agent animation
Thesis supervisor: Catherine PELACHAUD (ISIR (EDITE))
Thesis supervisor: Nicolas OBIN (STMS)
Scientific field: Information and communication sciences and technologies
CNRS theme: Artificial intelligence

Abstract: The aim of this PhD is to generate human-like gestural behavior so that virtual agents can communicate verbally and nonverbally in different styles. We view behavioral style as pervasive while speaking: it colors the communicative behaviors, whereas content is carried by multimodal signals but is mainly expressed through text semantics. The objective is to generate ultra-realistic verbal and nonverbal behaviors (text style, prosody, facial expression, body gestures and poses) corresponding to a given content (mostly driven by text and speech), and to adapt them to a specific style.

This raises methodological and fundamental challenges in the fields of machine learning and human-computer interaction:
1) How should content and style be defined: which modalities are involved, and in what proportion, in the gestural expression of content and style?
2) How can efficient neural architectures be implemented to disentangle content and style information from multimodal human behavior (text, speech, gestures)?

The proposed directions will leverage cutting-edge research in neural networks, such as multimodal modeling and generation, information disentanglement, and text prompt generation as popularized by DALL-E or ChatGPT.

The research questions can be summarized as follows:
- What is a multimodal style? What are the style cues in each modality (verbal, prosody, and nonverbal behavior)? How can the cues of each modality be fused to build a multimodal style?
- How can the generation of verbal and nonverbal cues be controlled using a multimodal style? How can a multimodal style be transferred into generative models? How can style-oriented prompts/instructions be integrated into multimodal generative models while preserving the underlying intentions to be conveyed by the agent?
- How should the generation be evaluated? How can content preservation and style transfer be measured? How should evaluation protocols with real users be designed?

The PhD candidate will develop contributions in the field of neural multimodal behavior generation for virtual agents, with a particular focus on multimodal style generation and control:
- Learning disentangled content and style encodings from multimodal human behavior using adversarial learning, bottleneck learning, and cross-entropy / mutual information formalisms.
- Generating expressive multimodal behavior using prompt-tuning, VAE-GAN, and stable diffusion algorithms.

To accomplish these objectives, we propose the following steps:
- Analyzing corpora to identify style and content cues in different modalities.
- Proposing generative models for multimodal style transfer according to different control levels (human mimicking or prompts/instructions).
- Evaluating the proposed models on dedicated corpora (e.g. PATS) and with real users.
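
To make the adversarial route to content/style disentanglement mentioned above more concrete, one possible formulation is sketched below. It is an illustrative assumption, not the project's stated method. All symbols are hypothetical names introduced here: a content encoder $E_c$, a style encoder $E_s$, a generator $G$, an adversarial style classifier $D$, a behavior sequence $x$ with style label $s$, and a trade-off weight $\lambda$.

```latex
% Hypothetical notation, not taken from the proposal:
%   x       multimodal behavior sequence (e.g. gestures with aligned text/speech)
%   s       style label of x (e.g. speaker identity)
%   E_c     content encoder,  E_s  style encoder,  G  generator
%   D       adversary that tries to predict s from the content code E_c(x)
%   \lambda weight of the adversarial term (assumed hyperparameter)
\[
\min_{E_c,\,E_s,\,G}\;\max_{D}\;
\mathbb{E}_{(x,s)}\Big[
  \big\lVert x - G\big(E_c(x),\,E_s(x)\big)\big\rVert_2^2
  \;+\; \lambda\,\log D\big(s \mid E_c(x)\big)
\Big]
\]
% Style transfer at inference time: content of sample a, style of sample b.
\[
\hat{x}_{a\to b} \;=\; G\big(E_c(x_a),\,E_s(x_b)\big)
\]
```

In this sketch the adversary $D$ maximizes the log-likelihood of the correct style given only the content code, while the encoders and generator minimize the same term together with the reconstruction error, which pushes style information out of $E_c(x)$. Bottleneck or mutual-information approaches would replace the adversarial term with a capacity constraint or a mutual-information estimate between the two codes.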