Doctoral research project number: 8502

Description

Submission date: 12 April 2023
Title: Deep Learning for Genomics: functional classification of protein-coding genes
Thesis supervisor: Alessandra CARBONE (LCQB)
Co-supervisor: Chris BOWLER (IBENS)
Scientific field: Information and communication sciences and technologies
CNRS theme: Information sciences and life sciences

Abstract: Now that the problem of predicting protein structures has been largely “resolved” by the remarkable advances in Deep Learning that led to AlphaFold, determining what proteins do when they interact is the next frontier. This thesis takes on this new challenge: to decipher the complexity of the interactions between proteins and other molecules from the perspective of function. Proteins are key molecules in living cells. They are responsible for nearly every task of cellular life and are essential for maintaining the structure, function, and regulation of unicellular organisms in any ecosystem, from the tissues and organs of the human body to the ocean. Cells can produce thousands of different types of proteins (the so-called proteome), which perform a plethora of diverse functions, all crucial for cell viability in their environment. Assigning functions to the vast array of proteins present in cells remains a challenging task in cell biology. This question applies to the multitude of organisms that interact in the ocean and constitute the ocean microbiome, a highly dilute microbial system that covers the majority of Earth's surface and extends an average of 3600 m down to the seafloor. It also applies to the human genome and its 15,000 understudied proteins. More broadly, a myriad of protein-coding sequences from different ecosystems accumulating in our databases have no identified function, and their functional classification is the critical bottleneck in understanding these ecosystems and in our ability to manage their health. In this project, we want to design and train a novel Deep Learning (DL) architecture able to classify sets of sequences by function and to discover possibly new functions and functional subclasses.
We shall take advantage of the huge number of sequences present in our databases, of protein Language Models and multi-view DL approaches, and of ProfileView, a recent in-house approach devoted to the functional classification of domains. The method should make it possible (1) to infer the function of sequences sharing similar sequence patterns by transferring functional labels from the few sequences whose function is already characterized, (2) to discover new functions by exploiting new sequence patterns, and (3) to identify functional determinants, that is, the set of residues that allows a protein to carry out its function. The functional classifications produced by this thesis will be of direct interest to the international community.
