Description
Deposit date: 1 January 1900
Title: Theory and practice of scalable machine learning algorithms
Thesis supervisor: Pietro MICHIARDI (Eurecom)
Scientific field: Information and communication sciences and technologies
CNRS theme: Not defined
Abstract:
The amount of data created each second in our world is exploding. E-commerce, Internet security and
financial applications, billing and customer services – to name only a few examples – will continue to
fuel exponential growth of large pools of data that can be captured, communicated, aggregated, stored,
and analyzed. As companies and organizations go about their business and interact with individuals,
they generate a tremendous volume of digital footprints, i.e., raw, unstructured data – for example,
log files – created as a by-product of other activities.
The use of these huge quantities of data is considered today a key basis of competition and growth:
companies that fail to develop their analysis capabilities will fail to understand and leverage the big
picture hidden in the data, and hence fall behind.
The current state of the art already offers a set of approaches to tackle such large-scale data processing
problems, such as commercial databases (Oracle Big Data), public cloud services (Amazon Elastic
MapReduce), and open-source projects (Hadoop).
Nevertheless, designing scalable machine learning algorithms that can discover compelling
knowledge in these huge amounts of data remains a hard problem. The high complexity of the
above-mentioned execution frameworks makes the design of efficient algorithms complicated.
Moreover, optimizing these algorithms requires understanding their cost, which is also quite
challenging. Finally, these systems make the implementation of even simple algorithms intricate.
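To make the notion of cost concrete, the sketch below gives a rough first-order cost model for a
MapReduce job, decomposing running time into read, compute, shuffle, and write phases. It is purely
illustrative and not part of the proposal: the function name, every parameter, and the linear phase
decomposition are simplifying assumptions.

# Illustrative first-order cost model for a MapReduce job.
# All parameters and the additive phase decomposition are
# simplifying assumptions for illustration only.

def mapreduce_cost(input_bytes, shuffle_bytes, output_bytes,
                   disk_bw, net_bw, cpu_rate, num_workers):
    """Estimate job time as the sum of read, compute, shuffle and
    write phases, each parallelized over num_workers."""
    read = input_bytes / (disk_bw * num_workers)      # map-side input scan
    compute = input_bytes / (cpu_rate * num_workers)  # per-byte processing
    shuffle = shuffle_bytes / (net_bw * num_workers)  # map -> reduce transfer
    write = output_bytes / (disk_bw * num_workers)    # reduce-side output
    return read + compute + shuffle + write

# Example: 1 TB input, 100 GB shuffled, 1 GB output, 20 workers.
print(mapreduce_cost(1e12, 1e11, 1e9,
                     disk_bw=1e8, net_bw=1e8, cpu_rate=5e7,
                     num_workers=20))

Even such a crude model makes visible which phase dominates for a given algorithm and cluster, which
is exactly the kind of question a more faithful cost analysis would have to answer.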
For example, existing implementations of even simple clustering algorithms that are widely used across
many fields are inefficient and do not make appropriate use of the underlying system resources (see,
for example, the Mahout project).
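To illustrate the kind of algorithm at stake, here is a minimal, single-machine sketch of one k-means
iteration written in the map/reduce style. It is a didactic sketch, not Mahout's implementation: in a
real Hadoop job the map and reduce phases would run over distributed partitions of the data, and all
names below are hypothetical.

# Minimal sketch of one k-means iteration in map/reduce style.
# Didactic illustration only; not Mahout's actual code.

from collections import defaultdict

def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean)."""
    return min(range(len(centroids)),
               key=lambda i: sum((p - c) ** 2
                                 for p, c in zip(point, centroids[i])))

def kmeans_iteration(points, centroids):
    """One map/reduce round: assign points, then recompute centroids."""
    dim = len(centroids[0])
    sums = defaultdict(lambda: [0.0] * dim)  # reduce-side partial sums
    counts = defaultdict(int)                # reduce-side point counts
    # "Map" phase: each point is assigned to its nearest centroid.
    for point in points:
        cid = nearest(point, centroids)
        for d in range(dim):
            sums[cid][d] += point[d]
        counts[cid] += 1
    # "Reduce" phase: average the accumulated sums per centroid.
    return [tuple(s / counts[cid] for s in sums[cid]) if counts[cid]
            else centroids[cid]              # keep empty clusters unchanged
            for cid in range(len(centroids))]

# Tiny usage example with two obvious clusters.
points = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0), (9.0, 9.5)]
print(kmeans_iteration(points, centroids=[(1.0, 1.0), (9.0, 9.0)]))

Even in this toy form, the partial sums exchanged between the two phases hint at where communication
and synchronization costs arise once the data is distributed.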
Therefore, the goal of this Thesis will be to develop highly scalable, optimized and reusable machine
learning algorithms to process and interact with large amounts of data. The Thesis will focus not only
on algorithm design, but also on understanding and modeling the cost and bottlenecks of these
algorithms. More generally, the Thesis should study the broader purpose of these algorithms: does
processing more data yield more valuable knowledge, and do complicated distributed algorithms
provide benefits over simpler ones?
Doctoral student: Debatty Thibault