Description
Submission date: 3 April 2025
Title: AI-driven Data Quality Management
Thesis supervisor:
Themis PALPANAS (LIPADE)
Advisor:
Soror SAHRI (LIPADE)
Scientific field: Information and communication sciences and technologies
CNRS theme: Data and knowledge
Abstract: Ensuring data quality in machine learning pipelines has become a critical challenge, especially
with the growing scale and complexity of modern datasets. Traditional approaches primarily
focus on data cleaning and repair, addressing missing values, inconsistencies, and noise through
rule-based or statistical methods. However, these methods require extensive human
intervention and struggle to adapt to the evolving nature of data in ML pipelines.
A primary focus of this project is adaptive quality monitoring within ML pipelines, where data
drift, label noise, and inconsistencies can degrade model performance over time. Current
validation approaches often rely on static rules or predefined thresholds, making them
ineffective in dynamic environments where data distributions evolve. This project will integrate
machine learning models with dynamic profiling techniques to enable real-time detection and
adaptation to emerging quality issues. Large language models (LLMs) will be used to
improve contextual data quality assessment, so that semantic inconsistencies, misaligned
labels, and incomplete information can be identified and addressed within the ML pipeline.
Another aspect of this project is understanding how data quality interventions affect fairness in
ML pipelines. While data validation, cleaning, and augmentation techniques improve model
robustness, they may also unintentionally harm fairness in ML decision-making. By integrating
adaptive monitoring, context-aware assessment using LLMs, and fairness-aware quality
validation, this project will contribute to the development of scalable and intelligent data quality
management systems that improve the trustworthiness and robustness of ML pipelines.