Doctoral research project number: 8079

Description

Submission date: 26 November 2020
Title: Scalable Machine Learning on Massive High-Dimensional Vectors
Thesis supervisor: Themis PALPANAS (LIPADE)
Co-supervisor: Karima ECHIHABI (Université Mohammed VI Polytechnique)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined

Abstract: Similarity search aims at finding objects in a collection that are close to a given query according to some definition of sameness. It is a fundamental operation that lies at the core of many critical data science applications [12, 17]. In data integration, it has been used to automate entity resolution [6] and support data discovery [11]. It has powered recommender systems of online billion-dollar enterprises [16] and enabled clustering [3], classification [15] and outlier detection [4] in domains as varied as bioinformatics, computer vision, security, finance and medicine. Similarity search has also been exploited in software engineering [1] to automate API mappings and predict program dependencies and I/O usage, and in cybersecurity to profile network usage and detect intrusions and malware [5]. This problem has been studied heavily over the past 25 years and will continue to attract attention as massive collections of high-dimensional objects become omnipresent in various domains [14, 1]. Objects can be data series, text, images, audio and video recordings, graphs, database tables or deep network embeddings. Similarity search over high-dimensional objects is often reduced to a k-Nearest Neighbor (k-NN) problem in which the objects are represented as high-dimensional vectors and the (dis)similarity between them is measured using a distance. The importance and relevance of NN search in high dimensions is further evidenced by a large and growing body of research [9, 10, 7, 8, 13]. The objective of this thesis is to develop a novel framework for scalable machine learning on massive high-dimensional vectors. The framework will exploit modern hardware to support efficient exact and approximate search on large collections of high-dimensional vectors, making the right trade-offs between accuracy, efficiency and footprint.
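
To make the k-NN formulation in the abstract concrete, here is a minimal sketch of exact search by brute-force scan. It is not the framework the thesis proposes. The names `euclidean` and `knn_exact` are hypothetical, and the Euclidean distance is an assumption, since the abstract only speaks of "a distance".

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors (assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_exact(collection, query, k, distance=euclidean):
    """Return the k (distance, index) pairs closest to the query, nearest first.

    Exact brute-force scan: the query is compared against every vector,
    so the cost grows linearly with both collection size and dimensionality.
    """
    scored = ((distance(vec, query), idx) for idx, vec in enumerate(collection))
    return heapq.nsmallest(k, scored)


if __name__ == "__main__":
    # Toy 4-dimensional collection; real collections are far larger.
    vectors = [
        [0.0, 0.0, 0.0, 0.0],
        [1.0, 1.0, 1.0, 1.0],
        [0.1, 0.0, 0.1, 0.0],
        [5.0, 5.0, 5.0, 5.0],
    ]
    print(knn_exact(vectors, [0.0, 0.0, 0.0, 0.1], k=2))
```

A scan like this always returns the exact answer but touches every vector. Approximate methods give up some accuracy to avoid that cost, which is the accuracy/efficiency/footprint trade-off the abstract mentions.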

Doctoral student: Azizi Ilias