Doctoral research project number: 4288

Description

Submission date: 1 January 1900
Title: From residue co-evolution to protein structure prediction
Thesis supervisor: Martin WEIGT (LCQB)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined

Abstract: The grand challenge of biology in the 21st century, to become a quantitative science like physics and chemistry, can be met only by integrating recent progress in computer science (high-performance computing as well as novel methods in statistical inference and model learning) and experimental biology (high-throughput sequencing). Within this project, our vision is to develop a powerful algorithmic framework in which the mining of vast amounts of raw data leads to an understanding of complex biological processes. More specifically, we will exploit the sequence variability of related proteins across thousands of sequenced genomes to detect evolutionary constraints, and use these constraints to predict protein structures (contact map and 3D fold prediction). Protein-structure prediction is recognized as one of the most important problems in bioinformatics, medicine and biotechnology. In the course of evolution, protein structure is remarkably conserved, whereas amino-acid sequences vary strongly between homologous, i.e. evolutionarily related, proteins (so-called protein families). This structural conservation constrains sequence variability and forces residues to co-evolve: residues that are close in the protein structure (but possibly distant along the sequence) will typically evolve in a correlated way, cf. Valencia et al. 2013 for a review. In our team, we have recently proposed an innovative statistical inference method, called Direct-Coupling Analysis (DCA), which achieved a substantial breakthrough in detecting residue-residue contacts from sequence information alone (Weigt et al. 2009, Morcos et al. 2011, Ekeberg et al. 2013). This inference approach is based on the statistical modeling of protein sequences by Markov Random Fields (MRF).
Exact inference of such a model is a priori infeasible for protein data because of its exponential time complexity; it has therefore been approached with methods inspired by statistical physics (mean-field approximation) and machine learning (pseudo-likelihood maximization).
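
To make the mean-field approximation mentioned above more concrete, here is a minimal Python sketch of how pairwise couplings might be estimated from a multiple sequence alignment and turned into contact scores. It is an illustration under stated assumptions, not the team's implementation: the function names (`mean_field_couplings`, `contact_scores`), the pseudocount value and the toy alignment are all hypothetical. Sequence reweighting and score corrections used in practice are omitted, and the scores are computed directly in the gauge where the last state is dropped.

```python
import numpy as np

def mean_field_couplings(msa, q, pseudocount=0.5):
    """Estimate couplings as the negative inverse of the covariance matrix.

    msa: integer array of shape (M sequences, L positions), entries in 0..q-1.
    Returns an array J of shape (L, q-1, L, q-1).
    """
    M, L = msa.shape
    # One-hot encode each position, dropping the last state to fix the gauge.
    onehot = np.zeros((M, L, q - 1))
    for a in range(q - 1):
        onehot[:, :, a] = (msa == a)
    X = onehot.reshape(M, L * (q - 1))

    # Empirical one- and two-site frequencies, mixed with a uniform pseudocount
    # so that the covariance matrix is invertible.
    fi = (1 - pseudocount) * X.mean(axis=0) + pseudocount / q
    fij = (1 - pseudocount) * (X.T @ X) / M + pseudocount / q**2
    # Same-site blocks must satisfy f_ii(a, b) = delta_ab * f_i(a).
    for i in range(L):
        s = slice(i * (q - 1), (i + 1) * (q - 1))
        fij[s, s] = np.diag(fi[s])

    C = fij - np.outer(fi, fi)   # connected correlations
    J = -np.linalg.inv(C)        # mean-field estimate of the couplings
    return J.reshape(L, q - 1, L, q - 1)

def contact_scores(J):
    """Frobenius norm of each (q-1) x (q-1) coupling block, zero on the diagonal."""
    S = np.sqrt((J ** 2).sum(axis=(1, 3)))
    np.fill_diagonal(S, 0.0)
    return S

# Toy alignment (made-up data): positions 0 and 2 co-vary perfectly,
# positions 1 and 3 are independent noise.
rng = np.random.default_rng(0)
msa = rng.integers(0, 3, size=(500, 4))
msa[:, 2] = msa[:, 0]

scores = contact_scores(mean_field_couplings(msa, q=3))
i, j = np.unravel_index(np.argmax(scores), scores.shape)
print("highest-scoring pair:", sorted((int(i), int(j))))
```

The matrix inversion replaces the exponentially expensive exact fit of the MRF by a single linear-algebra step, which is what makes the approach tractable for real protein families.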

Doctoral student: Barrat-Charlaix Pierre