Doctoral research project number: 4402

Description

Submission date: 1 January 1900
Title: Theory and practice of scalable machine learning algorithms
Thesis supervisor: Pietro MICHIARDI (Eurecom)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined

Abstract: The amount of data created each second in our world is exploding. E-commerce, Internet security, financial applications, billing, and customer services, to name only a few examples, will continue to fuel the exponential growth of large pools of data that can be captured, communicated, aggregated, stored, and analyzed. As companies and organizations go about their business and interact with individuals, they generate a tremendous amount of digital footprints: raw, unstructured data, such as log files, created as a by-product of other activities. The exploitation of these huge quantities of data is today considered a key basis of competition and growth: companies that fail to develop their analysis capabilities will fail to understand and leverage the big picture hidden in the data, and hence fall behind.

The current state of the art already offers a set of approaches to tackle such large-scale data processing problems, such as commercial databases (Oracle Big Data), public cloud services (Amazon Elastic MapReduce), and open-source projects (Hadoop). Nevertheless, designing scalable machine learning algorithms that can discover compelling knowledge in these huge amounts of data remains a hard problem. The high complexity of the above-mentioned execution frameworks makes the design of efficient algorithms complicated. Moreover, optimizing these algorithms requires understanding their cost, which is itself quite challenging. Finally, these systems make the implementation of even simple algorithms intricate. For example, implementations of even simple clustering algorithms, which are widely used in many fields, are inefficient and do not make appropriate use of the underlying system resources (see, for example, the Mahout project).

Therefore, the goal of this Thesis will be to develop highly scalable, optimized, and reusable machine learning algorithms to process and interact with large amounts of data. The Thesis will focus not only on algorithm design, but also on understanding and modeling the cost and bottlenecks of these algorithms. More generally, the Thesis should study the global purpose of these algorithms: does processing more data yield more valuable knowledge, and do complicated distributed algorithms provide benefits compared to simpler ones?
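To illustrate the kind of cost structure at stake, below is a minimal, single-machine Python sketch of one k-means iteration expressed in MapReduce style. Each iteration is a map phase (assign every point to its nearest centroid) followed by a reduce phase (recompute each centroid as the mean of its group); on a framework such as Hadoop, every such iteration becomes a complete job that rereads the input and materializes intermediate data, which is one known source of the inefficiency mentioned above. All names and parameters here are illustrative assumptions, not code from Mahout or Hadoop.

# Minimal sketch of one MapReduce-style k-means iteration.
# The map and reduce functions below run locally; a framework like
# Hadoop would distribute them across a cluster, one job per iteration.
import math
import random
from collections import defaultdict

def nearest_centroid(point, centroids):
    # Map phase: emit the index of the closest centroid for this point.
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

def kmeans_step(points, centroids):
    # One iteration: group points by nearest centroid (map + shuffle),
    # then average each group to obtain the new centroids (reduce).
    groups = defaultdict(list)
    for p in points:
        groups[nearest_centroid(p, centroids)].append(p)
    new_centroids = []
    for i in range(len(centroids)):
        group = groups.get(i) or [centroids[i]]  # keep empty clusters fixed
        dim = len(group[0])
        new_centroids.append(tuple(
            sum(p[d] for p in group) / len(group) for d in range(dim)))
    return new_centroids

if __name__ == "__main__":
    random.seed(0)
    # Two synthetic Gaussian clusters around (0, 0) and (5, 5).
    data = [(random.gauss(cx, 1.0), random.gauss(cy, 1.0))
            for cx, cy in ((0, 0), (5, 5)) for _ in range(100)]
    centroids = random.sample(data, 2)
    for _ in range(10):  # each loop iteration would be one MapReduce job
        centroids = kmeans_step(data, centroids)
    print(centroids)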

Doctoral candidate: Debatty Thibault