Description
Filing date: 25 April 2019
Title: Automated methods for data cleaning
Thesis supervisor:
Paolo PAPOTTI (Eurecom)
Scientific field: Information and communication sciences and technologies
CNRS theme: Data and knowledge
Abstract:
This thesis addresses a pressing need in data science applications: besides reliable models for decision making, we need to process data from its original, raw state into a curated form. In this “data cleaning” process, data engineers encode specifications, such as business rules on salaries, physical constraints for molecules, or representative training data, in cleaning programs that are executed over the raw data to identify and fix errors. This human-centric process is expensive and, given the overwhelming amount of today’s data, is conducted on a best-effort basis, which provides no formal guarantees on the quality of the data. The goal of this PhD program is to rethink the data cleaning field, starting from its assumptions, with an inclusive formal framework that radically reduces the human effort in cleaning data. This will be achieved by designing and implementing new automated techniques that use external information to identify and repair data errors and to guarantee that quality requirements are met.
The program tackles these limitations by fostering research in the holistic analysis of noisy data to discover specifications and precisely measure data quality. Specifically, since mining noisy data cannot yield exact specifications for cleaning programs, the idea is to use external evidence to automatically identify the correct cleaning specifications. Past user updates on the given dataset, programs available in code repositories, and correlations with other datasets should all be mined and combined to create a holistic snapshot of the current quality state. This combination of different and uncertain signals can guide the specification process and provide some form of assurance about its conclusions, based on theoretical foundations. The project will enable accelerated information discovery, as well as the economic benefits of early, trustworthy decisions. To provide the right context for evaluating these new techniques and to highlight the impact of the project in different fields, the program plans to address its objectives using real case studies from different domains, including health and financial data.
Doctoral candidate: Cappuzzo Riccardo