Description
Submission date: 1 January 1900
Title: Combining machine learning and evolution for the annotation of metagenomics data
Thesis supervisor:
Alessandra CARBONE (LCQB)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined
Abstract:
Context, description of the project and computational challenges
After the recent developments in DNA sequencing, we can now hope to decode and assemble the genomes of multiple species in microbial communities. The new field of metagenomics asks scientists to think on a broad scale, shifting their focus from 'how does an organism work?' to 'who is here and what are they doing?'. As many new microorganisms are identified through metagenomics projects, we are also detecting many organisms that are not well understood and many microbes that are not yet listed in the databases. Understanding these communities is becoming a crucial challenge for understanding our environment.
But this shift is not the only challenge posed by metagenomics. The increased complexity of the data raises computational challenges in assembling, annotating, and classifying genomic fragments from multiple organisms, and these are compounded by the shortness of the sequence fragments typically obtained with next-generation sequencing methods. Novel computational methods are therefore needed to address these issues and to handle the massive amounts of sequence data that have become available.
In this thesis we shall develop a new approach to annotation. We hope the method will be used to determine whether a gene in a microbial community is already known, is completely new, or has simply diverged so far from known genes that its common origin is hard to recover. Learning about the common origin of genes is important for identifying new protein families and for extending known families with evolutionary pathways that may not have been known before. These pathways will provide important insights into the functional activity of the communities.
We shall build on a new and original approach to domain annotation recently developed in the laboratory (Bernardes, Zaverucha, Vaquero, Carbone, 2012, submitted).
Traditional protein annotation methods describe known domains with probabilistic models representing the consensus among homologous domain sequences. When the relevant signals become too weak to be identified by consensus, annotation attempts fail. We tackle the problem of identifying protein domains based on the observation that many structural and functional protein constraints are not globally conserved across all species but may be locally conserved in separate clades. We applied our strategy to genomes known to present strong annotation difficulties and verified that the evolutionary pathways taken by domains within these genomes can also be found in phylogenetically distant species. We construct profiles, called phylogenetic models, starting from a large and diverse panel of homologous sequences in protein domain families, and use these profiles to search for homology. To produce reliable domain recognition, we combine predictions from consensus models and phylogenetic models using a meta-classifier that highlights properties of individual model results and provides an indication of the performance of all models. A novel algorithm based on multiple optimization criteria finds the most likely architecture for each protein. When this method is applied to the Plasmodium falciparum genome, it predicts domains for 70% of P. falciparum proteins, against the 58% achieved by the Pfam, CODD and dPUC methods. This is an exceptionally good performance for a problem that was thought to have reached the limits of what is attainable. In particular, the method finds additional domains in already annotated proteins, predicts domains for proteins of unknown function, highlights new highly frequent domains and domain architectures, and helps to unravel the domain structure of long proteins. Since the approach is general, it can be applied to any genome and in particular to reasonably short assembled metagenomic sequences.
The drawback of the approach, when applied to metagenomic data, is the huge computational time required to construct the evolutionary models associated with the thousands of Pfam domains we use to annotate metagenomic sequences. These models are constructed from sequences that are representative of the protein domain family and that belong to a few hundred species across the whole phylogenetic tree. In numerical terms this means
(11,912 × 100) × 30 min = 35,736,000 min = 595,600 h ≈ 24,817 d ≈ 68 y
of computational time (on a single CPU) for constructing phylogenetic models, where 11,912 is the number of protein domains to construct models for, 100 is (roughly) the number of species or, equivalently, the number of models constructed per domain, and 30 min is the average time to construct one model for a domain.
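The estimate above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the three constants come from the text, while the function name `build_time` and the `n_cpus` parameter are hypothetical and added for illustration. The `n_cpus` parameter assumes ideal linear scaling across CPUs, which the text does not claim.

```python
# Back-of-the-envelope estimate of the time needed to build all
# phylogenetic models. Constants are taken from the text.
N_DOMAINS = 11_912        # protein domains to build models for
MODELS_PER_DOMAIN = 100   # roughly one model per species
MINUTES_PER_MODEL = 30    # average build time of one model


def build_time(n_domains=N_DOMAINS,
               models_per_domain=MODELS_PER_DOMAIN,
               minutes_per_model=MINUTES_PER_MODEL,
               n_cpus=1):
    """Return the model count and total build time in several units.

    n_cpus is a hypothetical parameter: it assumes the work splits
    perfectly across CPUs, which is an idealisation.
    """
    n_models = n_domains * models_per_domain
    minutes = n_models * minutes_per_model / n_cpus
    hours = minutes / 60
    days = hours / 24
    years = days / 365
    return {
        "models": n_models,
        "minutes": minutes,
        "hours": hours,
        "days": days,
        "years": years,
    }


if __name__ == "__main__":
    for unit, value in build_time().items():
        print(f"{unit}: {value:,.2f}")
```

Calling `build_time(n_cpus=1000)` shows how the figure would shrink on a cluster under the ideal-scaling assumption; real speed-ups would be lower.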
Then the 1,191,200 models should be used to annotate the database of metagenomic sequences. This means that each model is tested on each m
Doctoral student: Ugarte Ari