Doctoral research project number: 4548

Description

Submission date: 1 January 1900
Title: Machine learning from multimodal genetic and neuroimaging data for personalized medicine
Thesis supervisor: Olivier COLLIOT (ICM)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined

Abstract: {{Context}} Personalized medicine aims to tailor medical decisions, prevention and therapies to individual patients, based on their predicted risk of disease, disease evolution and response to treatment. In this approach, patients are characterized using rich multimodal measurements (genomics, medical imaging, biomarkers…). A central challenge is then to develop predictive models from these measurements. To that end, new machine learning approaches are needed that can fully exploit the different types of data. Neurodegenerative diseases, such as Alzheimer’s disease and Parkinson’s disease, are complex multifactorial diseases that represent major public health issues. In the context of these brain disorders, two types of data play a major role: genetics and neuroimaging. Genetics makes it possible to identify factors that modulate the risk of a given disease, its evolution and the response to treatment. It involves measurements of increasing complexity, from series of single nucleotide polymorphisms (SNPs) provided by microarrays to high-throughput sequencing approaches such as whole-exome or even whole-genome sequencing. Neuroimaging makes it possible to measure, in the living patient, different types of anatomical and functional alterations, using a variety of imaging modalities: anatomical, functional and diffusion magnetic resonance imaging (MRI) and positron emission tomography (PET). Both technologies have developed considerably during the past 15 years. Meanwhile, important advances have been made in the processing and statistical analysis of these complex data. In particular, our laboratory has developed advanced machine learning approaches for disease prediction from neuroimaging data (Cuingnet et al., 2013; Gerardin et al., 2009). However, machine learning approaches that can adequately integrate neuroimaging and genetic data are currently lacking. 
The development of such approaches is particularly timely because massive datasets of patients with both imaging and genetic data are now available. Examples include the Alzheimer’s Disease Neuroimaging Initiative (ADNI, http://www.adni-info.org/), the UK Biobank (http://www.ukbiobank.ac.uk/), the MEMENTO national cohort (http://www.ukbiobank.ac.uk/) and the Parkinson’s Progression Markers Initiative (PPMI, http://www.ppmi-info.org/). Methodological developments are challenging because of: i) the high dimensionality of both types of data (around 10^5 to 10^6 variables); ii) the complex multivariate interactions between variables, i.e. variables usually have only a mild effect when considered in isolation, and only their combination can yield higher predictive power; iii) the structure of these data (spatial and anatomical structure for brain images, genomic structure for genetic data), which needs to be adequately modeled. {{Research program}} The objective of this PhD thesis is to develop and validate new statistical learning approaches that can integrate genetic and neuroimaging data, in the context of personalized medicine for neurodegenerative diseases. The main strategy we propose to pursue is to adequately model the structure of both genetic and neuroimaging data. This strategy aims to constrain the learning procedures so that they better handle the high dimensionality, and to provide interpretable results. Neuroimaging data are structured by the geometry of anatomical structures, their relations and their connectivity. Genetic data are also highly structured: variants are grouped within genes, their dependency is structured by genomic architecture, and genes interact within pathways. Our team has recently proposed new approaches for integrating the structure of neuroimaging data into statistical learning approaches (Cuingnet et al., 2013). In the context of the present thesis, we will focus on genetic data and their integration with neuroimaging. 
First, we propose to exploit the grouping of variants into genes, and of genes into common pathways, in order to select only the relevant groups of genes or pathways. To do so, we propose to use combinations of l2 and l1 norms, in the spirit of the group lasso approach (Yuan and Lin, 2006). We then propose to take into account the interactions of genes within a given pathway. Such interactions can be modeled using a graph, which defines new regularization operators that can be introduced into the learning process through the definition of a new kernel (Kondor and Lafferty, 2002). We then propose to integrate imaging and genetic data through the definition of new kernels. We will define new kernels and similarity measures for genetic data and combine them with kernels for imaging data, using for instance multiple kernel learning (Gönen and Alpaydın, 2011). We also propose to introduce specific constraints on the topographical expression of genes. Indeed, genes are differentially expressed across brain regions. Such information appears particularly useful for integration with neuroimaging data. We will th

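As a purely illustrative sketch, and not the project's actual implementation, two of the building blocks named in the research program can be written in a few lines of NumPy: the group lasso penalty with its block soft-thresholding (proximal) operator, and a diffusion kernel on a gene interaction graph. The function names, the square-root-of-group-size weighting, the toy weights and the three-gene chain graph are all assumptions made for the example.

```python
import numpy as np


def group_lasso_penalty(weights, groups):
    """Sum over groups of sqrt(group size) times the l2 norm of the group."""
    return sum(np.sqrt(len(g)) * np.linalg.norm(weights[g]) for g in groups)


def group_soft_threshold(weights, groups, step):
    """Proximal operator of step * group_lasso_penalty.

    Each group is shrunk toward zero as a block; groups whose l2 norm
    does not exceed the threshold are set exactly to zero (group selection).
    """
    shrunk = np.zeros_like(weights)
    for g in groups:
        norm = np.linalg.norm(weights[g])
        threshold = step * np.sqrt(len(g))
        if norm > threshold:
            shrunk[g] = (1.0 - threshold / norm) * weights[g]
    return shrunk


def diffusion_kernel(adjacency, beta):
    """Diffusion kernel exp(-beta * L), with L = D - A the graph Laplacian."""
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # L is symmetric, so the matrix exponential follows from its eigendecomposition.
    eigvals, eigvecs = np.linalg.eigh(laplacian)
    return (eigvecs * np.exp(-beta * eigvals)) @ eigvecs.T


# Hypothetical toy data: 4 variants grouped into 2 genes.
weights = np.array([3.0, 4.0, 0.1, 0.1])
genes = [[0, 1], [2, 3]]
shrunk = group_soft_threshold(weights, genes, step=1.0)

# Hypothetical pathway: 3 genes interacting along a chain 0 - 1 - 2.
pathway = np.array([[0.0, 1.0, 0.0],
                    [1.0, 0.0, 1.0],
                    [0.0, 1.0, 0.0]])
gene_kernel = diffusion_kernel(pathway, beta=0.5)
```

In this toy setting the second gene, whose variants carry only small weights, is removed as a whole, while the first gene is shrunk but kept. The diffusion kernel is symmetric and positive definite, so it could in principle serve as one of the genetic kernels to be combined with imaging kernels.
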
Doctoral student: Lu Pascal