Description
Submission date: 4 April 2024
Title: Automated Data Preparation for Machine Learning: A Black-box Optimization Approach
Thesis supervisor:
Carola DOERR (LIP6)
Scientific field: Information and communication sciences and technologies
CNRS theme: Artificial intelligence
Abstract: Machine learning (ML) processes can be subdivided into two major steps: (1) data preparation and (2) modeling. Data preparation is the application of a sequence of transformations to raw data in order to improve its usability and the quality of the results obtained when it is fed into an ML model. The modeling step encompasses choosing an ML model to apply to the data and tuning its parameters to achieve the best possible results.
Both steps are essential for good results and need to be addressed with care. Handling either of them manually is very time-consuming and typically requires an iterative trial-and-error approach by an ML expert. Reducing the human effort and bias involved in these laborious tasks, while at the same time optimizing the performance of the ML systems, is the key objective of automated machine learning (AutoML).
Unsurprisingly, AutoML is currently growing rapidly, as ML approaches become more and more relevant across a broad range of applications. However, a closer look at recent AutoML research reveals a strong bias towards the modeling part of the ML pipeline, leaving the data preparation part somewhat neglected. Indeed, today's leading openly available AutoML solutions either require already preprocessed input data or provide only minimal data preparation, focusing instead on the modeling aspect of the pipeline. This severely limits their usability on real-world data, which is often imperfect (messy, noisy, incomplete, ...).
The ambition of this PhD project is to extend the scope of AutoML to the data preparation step: our objective is to devise optimization methods that efficiently and effectively automate the complete ML data preparation process. We will look at this problem from a black-box optimization perspective, with the goal of designing query-efficient iterative search approaches for automated data preparation. These approaches should take into account relevant features such as the size of the data, the data types, and the budget, and will be evaluated with respect to time, memory, and model metrics.
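To make the black-box view concrete, the following minimal Python sketch treats a data preparation pipeline as a point in a small search space and scores it only through a model metric, under a fixed evaluation budget. Everything in it is an illustrative assumption rather than the project's method: the operators (`impute_mean`, `impute_zero`, `standardize`, `min_max`), the toy dataset, the leave-one-out 1-nearest-neighbor objective, and the use of plain random search as a stand-in for a more query-efficient optimizer.

```python
import random

# Hypothetical toy operators: each maps one column to a new column.
def impute_mean(col):
    known = [v for v in col if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in col]

def impute_zero(col):
    return [0.0 if v is None else v for v in col]

def identity(col):
    return list(col)

def standardize(col):
    mean = sum(col) / len(col)
    std = (sum((v - mean) ** 2 for v in col) / len(col)) ** 0.5
    return [(v - mean) / std if std > 0 else 0.0 for v in col]

def min_max(col):
    lo, hi = min(col), max(col)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in col]

# Assumed search space: one imputation step followed by one scaling step.
SEARCH_SPACE = {
    "impute": [impute_mean, impute_zero],
    "scale": [identity, standardize, min_max],
}

def prepare(rows, pipeline):
    """Apply each step of the pipeline, in order, to every column."""
    columns = [list(col) for col in zip(*rows)]
    for step in pipeline:
        columns = [step(col) for col in columns]
    return [list(row) for row in zip(*columns)]

def loo_1nn_accuracy(X, y):
    """Black-box objective: leave-one-out accuracy of a 1-nearest-neighbor rule."""
    hits = 0
    for i, xi in enumerate(X):
        nearest = min(
            (j for j in range(len(X)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(xi, X[j])),
        )
        hits += y[nearest] == y[i]
    return hits / len(X)

def random_search(rows, labels, budget, seed=0):
    """Evaluate `budget` randomly sampled pipelines and keep the best one."""
    rng = random.Random(seed)
    best_pipeline, best_score = None, -1.0
    for _ in range(budget):
        pipeline = (rng.choice(SEARCH_SPACE["impute"]),
                    rng.choice(SEARCH_SPACE["scale"]))
        score = loo_1nn_accuracy(prepare(rows, pipeline), labels)
        if score > best_score:
            best_pipeline, best_score = pipeline, score
    return best_pipeline, best_score

# Toy raw data with missing values and features on very different scales.
ROWS = [[1.0, 200.0], [1.2, None], [0.9, 180.0],
        [3.0, 220.0], [3.1, 190.0], [None, 210.0]]
LABELS = [0, 0, 0, 1, 1, 1]

if __name__ == "__main__":
    pipeline, score = random_search(ROWS, LABELS, budget=10)
    print([step.__name__ for step in pipeline], score)
```

The optimizer never inspects the operators themselves; it only queries the objective, which is what makes the formulation black-box. A more query-efficient method would replace the sampling loop in `random_search` while keeping the same objective and budget interface.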