Description
Filing date: 1 January 1900
Title: Information-theoretic methods to infer relevant mutations in cancer from high-throughput sequencing and copy number variant data
Thesis supervisor:
Hervé ISAMBERT (PC_Curie)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined
Abstract:
Despite unprecedented amounts of sequence data and ongoing efforts to sequence the mutated genomes of tumors from individual patients, we still lack a quantitative understanding of cancer progression. This is because the methodologies for distinguishing 'driver' from 'passenger' mutations and inferring their functional consequences are not on par with the industrialization of sequencing. {{The objective of this interdisciplinary PhD project is to infer relevant mutations in cancer}} from high-throughput sequencing and copy number variant (CNV) data.
{{The Isambert team has recently developed a novel inference method to reconstruct causal networks from large-scale datasets}} (Affeldt {et al.} 2014 & 2015). This information-theoretic approach combines constraint-based and Bayesian inference methods to reliably infer large causal graphs, despite the sampling noise inherent in finite datasets. In a nutshell, it ascertains structural independencies in causal graphs ($I(x;y|[u_i])=0$) based on a Bayesian ranking of their most contributing nodes, $[u_i]$. By contrast, classical constraint-based approaches, such as the PC algorithm, assess structural independencies in an arbitrary order of the intervening variables, $[u_i]$, which renders them prone to spurious conditional independencies. Our hybrid approach instead progressively uncovers the best supported conditional independencies, by iteratively “taking off” the most significant indirect contributions of conditional 3-point information from every 2-point (mutual) information of the causal graph, as:
$$I(x;y|[u_i]_n) = I(x;y) - I(x;y;u_1) - I(x;y;u_2|u_1) - \dots - I(x;y;u_n|[u_i]_{n-1}).$$
Conditional independencies are thus derived by progressively collecting the most significant indirect contributions to all pairwise mutual information. The resulting network skeleton is then partially directed by orienting and propagating edge directions, based on the sign and magnitude of the conditional 3-point information of unshielded triples. The approach is shown to outperform both constraint-based and Bayesian inference methods on a range of benchmark networks and on the reconstruction of hematopoietic differentiation pathways based on recent single-cell expression data (Moignard {et al.} 2013 & 2015).
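As an illustration of the decomposition above, the following is a minimal sketch, not the team's implementation. It works on exact joint probability tables rather than sampled data, so it leaves out the Bayesian ranking and the finite-sample corrections of the actual method. The function names (`entropy`, `cond_mi`, `three_point`, `peel`) and the three toy distributions are hypothetical and chosen only for this example.

```python
import math
from collections import defaultdict

def entropy(joint, idx):
    """Shannon entropy (in bits) of the marginal over the variable positions in idx."""
    marg = defaultdict(float)
    for state, p in joint.items():
        marg[tuple(state[i] for i in idx)] += p
    return -sum(p * math.log2(p) for p in marg.values() if p > 0)

def cond_mi(joint, x, y, cond=()):
    """Conditional mutual information I(x;y|cond) from joint entropies."""
    c = tuple(cond)
    return (entropy(joint, (x,) + c) + entropy(joint, (y,) + c)
            - entropy(joint, (x, y) + c) - entropy(joint, c))

def three_point(joint, x, y, u, cond=()):
    """Conditional 3-point information I(x;y;u|cond) = I(x;y|cond) - I(x;y|cond,u)."""
    c = tuple(cond)
    return cond_mi(joint, x, y, c) - cond_mi(joint, x, y, c + (u,))

def peel(joint, x, y, us):
    """Subtract successive 3-point terms from I(x;y), in the given order of us.

    Returns the list of terms and the residual, which equals I(x;y|us).
    """
    residual = cond_mi(joint, x, y)
    terms = []
    for k, u in enumerate(us):
        t = three_point(joint, x, y, u, us[:k])
        terms.append(t)
        residual -= t
    return terms, residual

bits = (0, 1)

# Hypothetical toy joint over (x, y, u1, u2): x and y are both functions of (u1, u2).
toy = {(u1 + u2, u1 ^ u2, u1, u2): 0.25 for u1 in bits for u2 in bits}
terms, residual = peel(toy, 0, 1, (2, 3))

# Collider x -> z <- y (z = x XOR y) and chain x -> z -> y (exact copies),
# both stored as joint tables over (x, y, z).
collider = {(a, b, a ^ b): 0.25 for a in bits for b in bits}
chain = {(a, a, a): 0.5 for a in bits}
collider_i3 = three_point(collider, 0, 1, 2)
chain_i3 = three_point(chain, 0, 1, 2)

print("3-point terms:", terms, "residual I(x;y|u1,u2):", residual)
print("collider I(x;y;z):", collider_i3, "chain I(x;y;z):", chain_i3)
```

In the toy table, the residual left after removing both contributions matches the directly computed I(x;y|u1,u2), which is zero because x and y are fully determined by (u1, u2). The collider yields a negative 3-point information and the chain a positive one, which is the kind of sign difference that can be used to orient unshielded triples.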
{{In this PhD project, we will first adapt these network reconstruction methods to uncover causal cascades of driver mutations and CNVs in large-scale cancer genome datasets}}, such as The Cancer Genome Atlas and COSMIC, which already includes about 1,000,000 tumor samples. The PhD candidate will also apply {{mediation analysis (Pearl 2009), recently adapted to genomics data by the team (Singh {et al.} Cell Rep 2012), to analyze the direct and indirect paths in the reconstructed tumor progression pathways.}}
The first issue in reconstructing tumor progression pathways is to properly take into account the {{relevance of mutations and CNVs in the context of the large mutational heterogeneity along the genomes}}. This is a fundamental problem in cancer genome studies, as extensive spurious associations typically overshadow true driver events. The identification of genes that are significantly associated with cancer progression will be addressed following recent approaches such as 'MutSigCV' (Lawrence {et al.} Nature 2013) and 'MuSiC' (Dees {et al.} Genome Res 2012). From a broader computational perspective, it is also related to a ubiquitous problem in the emerging field of 'big data' analysis, where there are usually many more observed variables than independent data points (e.g. 20,000 sequenced genes in whole-exome studies but only a few hundred tumor samples from individual patients). We will address this problem by building on a number of earlier studies on network analysis, such as weighted correlation network analysis (WGCNA: Langfelder {et al.} BMC Bioinfo 2008), as well as methods for dimensionality reduction such as Principal Component Analysis (PCA) or more advanced approaches based on spectral multidimensional scaling analysis (Aflalo {et al.} PNAS 2013). This will lead to the {{development and implementation of an iterative computational method reconstructing large networks from the expansion and combination of multiple local networks}}.
{{Finally, the susceptibility to mutations and copy number variations of genes implicated in cancer will also be analyzed in terms of evolutionary models developed by our team to understand the role of duplication-divergence processes in the long-term evolution of biomolecular networks}} (Singh {et al.} PLoS Comp Biol 2015, PLoS Comp Biol 2014 & Cell Rep 2012; Malaguti {et al.} Theor Pop Biol 2014; Stein {et al.} PRE 2011; Evlampiev {et al.} PNAS 2008 & BMC Syst Biol 2007; Cosentino Lagomarsino {et al.} PNAS 2007).
Doctoral candidate: Verny Louis