Doctoral research project number: 7688

Description

Submission date: 1 October 2020
Title: Data-driven generative modeling of protein sequence landscapes
Thesis supervisor: Martin WEIGT (LCQB)
Thesis supervisor: Francesco ZAMPONI (LPENS)
Scientific field: Information and communication sciences and technologies
CNRS theme: Information sciences and life sciences

Abstract: Proteins are among the most fascinating complex systems in nature. Because they play a crucial role in almost all biological processes, they attract considerable attention at the interface of biology, physics, and computer science. Thanks to the sequencing revolution in biology, protein sequence databases have grown exponentially in recent years. Data-driven modeling approaches, which increasingly include methods from artificial intelligence, are therefore becoming more and more popular for exploring this emerging wealth of data. In this doctoral project we propose to construct highly accurate, generative yet interpretable models of protein sequence landscapes by leveraging rapidly expanding sequence databases, inverse statistical physics, and deep learning. These landscapes describe the sequence variability within protein families, i.e. ensembles of proteins that share a common evolutionary ancestry and have very similar three-dimensional structures and biological functions, but highly variable amino-acid sequences. To build these models, we will systematically explore generative modeling approaches, ranging from parsimonious and easily interpretable models (e.g. Boltzmann machines, restricted Boltzmann machines) to more powerful but less easily interpretable deep generative models (e.g. autoregressive models, variational autoencoders, and generative adversarial networks). We will also explore integrative modeling strategies that combine publicly available sequence data with more quantitative data (deep mutational scanning) generated by our close collaborator Dr. Olivier Tenaillon (DR INSERM at Bichat). Uncovering the patterns of natural sequence variability with generative models will allow us to address biological questions of prime importance, including the assessment of mutational effects in proteins (important, e.g., for predicting pathological mutations or the evolution of drug resistance) and the data-driven design of new protein sequences.

Doctoral student: Trinquier Jeanne