Doctoral research project number: 4644

Description

Submission date: 1 January 1900
Title: Information-theoretic methods for cancer genomics
Thesis supervisor: Hervé ISAMBERT (PC_Curie)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined

Abstract: {{Scientific background}} Despite unprecedented amounts of sequence data and ongoing efforts to sequence mutated genomes from the tumors of individual patients, we still lack a quantitative understanding of cancer progression. This is because the methodologies to distinguish 'driver' from 'passenger' mutations and to infer their functional consequences have not kept pace with the industrialization of sequencing. In particular, many low-frequency driver mutations remain difficult to uncover statistically with available inference methods, despite high hopes of better identifying cancer variants and ultimately developing specifically adapted treatments (Marx Nat Methods 2014). Recent modeling accomplishments by the group provide novel information-theoretic avenues to infer causal cascades of driver mutations and copy number variations (CNV) in tumors from large-scale cancer genome datasets.
{{Objectives}} The aim of this PhD project is to infer relevant mutation pathways in cancer from high-throughput sequencing and CNV data. Next generation sequencing (NGS) technologies now provide rapidly growing datasets on cancer mutations, which are available in cancer genome repositories such as COSMIC, which already includes about 1,000,000 tumor samples. In this interdisciplinary project, we aim to analyze information-rich experimental datasets with novel information-theoretic approaches. The results should provide a more functional, systems-level description of cancer progression in terms of mutation cascades and associated phenotypes.
{{Methods}} The Isambert team has recently developed novel inference methods to reconstruct causal networks from large-scale datasets (Affeldt et al, 2014 & 2015; Singh et al. Cell Rep 2012 & PloS Comp Biol 2014). This information-theoretic approach combines constraint-based and Bayesian inference methods to reliably infer large causal graphs, despite inherent sampling noise in finite datasets.
In a nutshell, it ascertains structural independencies in causal graphs (i.e. I(x;y|{u_i})=0, implying no x-y link in the underlying network) based on a Bayesian ranking of their most contributing nodes, {u_i}. By contrast, classical constraint-based approaches, such as the “PC” algorithm (Spirtes et al 1991), assess structural independencies in arbitrary order of the intervening variables, {u_i}, which makes them prone to spurious conditional independencies. Instead, our novel hybrid approach, “3off2”, progressively uncovers the best supported conditional independencies by iteratively “taking off” the most significant indirect contributions, I(x;y;u_k|{u_i}_{k-1}), of conditional 3-point information from every 2-point (mutual) information, I(x;y), of the causal graph, as
I(x;y|{u_i}_n) = I(x;y) - I(x;y;u_1) - I(x;y;u_2|u_1) - ... - I(x;y;u_n|{u_i}_{n-1})
Conditional independencies are thus derived by progressively collecting the most significant indirect contributions to all pairwise mutual information. The resulting network skeleton is then partially directed by orienting and propagating edge directions, based on the sign and magnitude of the conditional 3-point information of unshielded triples. The approach is shown to outperform both constraint-based and Bayesian inference methods on a range of benchmark networks (Fig 1A) and on the reconstruction of hematopoiesis differentiation pathways based on recent single-cell expression data (Moignard et al. 2013 & 2015; Fig 1B). In this PhD project, we will first adapt these network reconstruction methods to uncover causal cascades of driver mutations and CNV in large-scale cancer genome datasets, such as The Cancer Genome Atlas and COSMIC. Examples of driver mutation cascades have been uncovered experimentally for a few cancers, such as colorectal tumors (Figure 2).
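The decomposition of I(x;y|{u_i}_n) into successive 3-point contributions can be checked numerically on a toy example. The sketch below is only an illustration of that identity, not the 3off2 implementation: it uses exact probabilities rather than finite samples, so it skips the Bayesian ranking step entirely. The toy network (x influences y only through two intermediate variables u_1 and u_2), its probabilities, and all function names are assumptions made for this example.

```python
import itertools
import math


def entropy(joint, keep):
    """Shannon entropy (bits) of the marginal over the variable indices in `keep`."""
    marginal = {}
    for outcome, p in joint.items():
        key = tuple(outcome[i] for i in keep)
        marginal[key] = marginal.get(key, 0.0) + p
    return -sum(p * math.log2(p) for p in marginal.values() if p > 0)


def cond_mutual_info(joint, x, y, cond=()):
    """I(x;y|cond) = H(x,cond) + H(y,cond) - H(x,y,cond) - H(cond)."""
    c = tuple(cond)
    return (entropy(joint, (x,) + c) + entropy(joint, (y,) + c)
            - entropy(joint, (x, y) + c) - entropy(joint, c))


def three_point_info(joint, x, y, u, cond=()):
    """I(x;y;u|cond) = I(x;y|cond) - I(x;y|cond,u)."""
    c = tuple(cond)
    return cond_mutual_info(joint, x, y, c) - cond_mutual_info(joint, x, y, c + (u,))


def toy_joint():
    """Hypothetical binary network: x -> u1 -> y and x -> u2 -> y, no direct x-y link."""
    joint = {}
    for x, u1, u2, y in itertools.product((0, 1), repeat=4):
        p_u1 = 0.9 if u1 == x else 0.1           # u1 is a noisy copy of x
        p_u2 = 0.8 if u2 == x else 0.2           # u2 is a noisier copy of x
        p_y1 = 0.1 + 0.4 * u1 + 0.4 * u2         # y depends on u1 and u2 only
        p_y = p_y1 if y == 1 else 1.0 - p_y1
        joint[(x, y, u1, u2)] = 0.5 * p_u1 * p_u2 * p_y
    return joint


X, Y, U1, U2 = 0, 1, 2, 3  # positions of the variables in each outcome tuple
joint = toy_joint()

i_xy = cond_mutual_info(joint, X, Y)                    # I(x;y)
step1 = three_point_info(joint, X, Y, U1)               # I(x;y;u_1)
step2 = three_point_info(joint, X, Y, U2, cond=(U1,))   # I(x;y;u_2|u_1)
residual = i_xy - step1 - step2                         # right-hand side of the formula
direct = cond_mutual_info(joint, X, Y, (U1, U2))        # I(x;y|u_1,u_2) computed directly

print(f"I(x;y)            = {i_xy:.4f} bits")
print(f"I(x;y;u_1)        = {step1:.4f} bits")
print(f"I(x;y;u_2|u_1)    = {step2:.4f} bits")
print(f"residual          = {residual:.2e} bits")
print(f"I(x;y|u_1,u_2)    = {direct:.2e} bits")
```

In this toy network x and y are clearly dependent, yet the residual left after taking off both indirect contributions vanishes up to floating-point error, which is the signature of a missing x-y link. With finite data the estimated quantities would be noisy, which is where the ranking of contributors and a significance criterion become necessary.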
However, many low-frequency driver mutations remain difficult to uncover statistically for most cancers, despite high hopes of better identifying cancer variants and ultimately developing specifically adapted treatments (Marx Nat Methods 2014). To bridge this gap in the statistical power of inference methods, important improvements are needed to integrate and analyze the heterogeneous information from both low-scale datasets (i.e. few sequenced genes) from many cancer patients and the recent large-scale datasets (e.g. whole exome) obtained for tens to a few hundred patients in typical studies. Adapting our 3off2 inference methods to such heterogeneous datasets will be the first milestone of this PhD project. Preliminary results show that the 3off2 inference approach readily recovers known driver mutation cascades in different primary tumors, such as those outlined in Figure 2, and also uncovers novel mutation cascades. A second issue in reconstructing tumor progression pathways is to properly take into account the relevance of mutations and CNV in the context of the large mutational variations along the genomes. This is a fundamental problem i

PhD student: Sella Nadir