Description
              
              
              
              Submission date:  26 November 2020  
              Title:  Scalable Machine Learning on Massive High-Dimensional Vectors  
              
  
    
        
        
              Thesis supervisor:  Themis PALPANAS (LIPADE)  
              Co-supervisor:  Karima ECHIHABI (Université Mohammed VI Polytechnique)  
    
              Scientific field:  Information and communication sciences and technologies  
              CNRS theme:  Not defined  
              Abstract:  Similarity search aims at finding objects in a collection that are close to a given query according to some definition of sameness. It is a fundamental operation that lies at the core of many critical data science applications [12, 17]. In data integration, it has been used to automate entity resolution [6] and support data discovery [11]. It has powered the recommender systems of billion-dollar online enterprises [16] and enabled clustering [3], classification [15] and outlier detection [4] in domains as varied as bioinformatics, computer vision, security, finance and medicine. Similarity search has also been exploited in software engineering [1] to automate API mappings and predict program dependencies and I/O usage, and in cybersecurity to profile network usage and detect intrusions and malware [5].

              This problem has been studied heavily over the past 25 years and will continue to attract attention as massive collections of high-dimensional objects become omnipresent in various domains [14, 1]. Objects can be data series, text, images, audio and video recordings, graphs, database tables or deep network embeddings. Similarity search over high-dimensional objects is often reduced to a k-Nearest Neighbor (k-NN) problem, in which the objects are represented as high-dimensional vectors and the (dis)similarity between them is measured using a distance function. The importance and relevance of NN search in high dimensions is further evidenced by a large and growing body of research [7, 8, 9, 10, 13].

              The objective of this thesis is to develop a novel framework for scalable machine learning on massive high-dimensional vectors. The framework will exploit modern hardware to support efficient exact and approximate search on large collections of high-dimensional vectors, making the right trade-offs between accuracy, efficiency and footprint.  
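
              To make the k-NN formulation in the abstract concrete, the following is a minimal sketch of exact k-NN search by linear scan, assuming Euclidean distance as the distance function. The function name `knn_exact` and the sample vectors are hypothetical and chosen only for illustration; this is not the framework proposed in the thesis, and a scan of this kind does not scale to the massive collections the thesis targets.

```python
import heapq
import math


def knn_exact(collection, query, k):
    """Return the k nearest vectors to `query` as (distance, index) pairs.

    Exact search by linear scan: every vector in `collection` is compared
    to the query using Euclidean distance (an assumption for this sketch).
    """
    scored = ((math.dist(vec, query), idx) for idx, vec in enumerate(collection))
    # nsmallest keeps only the k best candidates and returns them nearest first.
    return heapq.nsmallest(k, scored)


# Hypothetical toy collection of 3-dimensional vectors.
vectors = [
    (0.0, 0.0, 0.0),
    (1.0, 0.0, 0.0),
    (0.0, 2.0, 0.0),
    (3.0, 3.0, 3.0),
]

neighbors = knn_exact(vectors, (0.9, 0.1, 0.0), k=2)
for distance, index in neighbors:
    print(f"vector {index} at distance {distance:.4f}")
```

              The scan costs time proportional to the collection size multiplied by the dimensionality for every query, which is the cost that index-based exact methods and approximate methods try to reduce, typically by trading some accuracy or memory footprint for speed.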
              
              
              
              
              
              Doctoral candidate: Azizi Ilias