Doctoral research project number: 8915

Description

Submission date: 3 April 2025
Title: AI-driven Data Quality Management
Thesis supervisor: Themis PALPANAS (LIPADE)
Co-supervisor: Soror SAHRI (LIPADE)
Scientific field: Information and communication sciences and technologies
CNRS theme: Data and knowledge

Abstract: Ensuring data quality in machine learning (ML) pipelines has become a critical challenge, especially with the growing scale and complexity of modern datasets. Traditional approaches focus primarily on data cleaning and repair, addressing missing values, inconsistencies, and noise through rule-based or statistical methods. However, these methods require extensive human intervention and struggle to adapt to the evolving nature of data in ML pipelines. A primary focus of this project is adaptive quality monitoring within ML pipelines, where data drift, label noise, and inconsistencies can degrade model performance over time. Current validation approaches often rely on static rules or predefined thresholds, which makes them ineffective in dynamic environments where data distributions evolve. This project will integrate machine learning models with dynamic profiling techniques to enable real-time detection of, and adaptation to, emerging quality issues. Large language models (LLMs) will be leveraged to improve contextual data quality assessment, so that semantic inconsistencies, misaligned labels, and incomplete information are identified and addressed within the ML pipeline. Another aspect of this project is understanding how data quality interventions affect fairness in ML pipelines. While data validation, cleaning, and augmentation techniques improve model robustness, they may also unintentionally harm fairness in ML decision-making. By integrating adaptive monitoring, context-aware assessment using LLMs, and fairness-aware quality validation, this project will contribute to the development of scalable and intelligent data quality management systems that improve the trustworthiness and robustness of ML pipelines.
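
Illustrative sketch (not part of the project description): the contrast drawn above between static thresholds and adaptive monitoring can be made concrete with a small drift check. The example below is only one possible reading, under stated assumptions: it uses a two-sample Kolmogorov-Smirnov statistic on a single numeric feature, and it adapts by replacing the reference window once drift is flagged. The names `ks_statistic` and `AdaptiveDriftMonitor`, the significance level, and the update policy are hypothetical choices made for illustration, not methods stated by the project.

```python
import bisect
import math


def ks_statistic(reference, current):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    ref, cur = sorted(reference), sorted(current)
    gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in ref + cur:
        cdf_ref = bisect.bisect_right(ref, x) / len(ref)
        cdf_cur = bisect.bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


class AdaptiveDriftMonitor:
    """Hypothetical monitor: flags drift against a reference window that is updated over time."""

    def __init__(self, reference, alpha=0.05):
        self.reference = list(reference)
        self.alpha = alpha  # assumed significance level

    def threshold(self, batch_size):
        # Standard large-sample critical value for the two-sample KS test;
        # it depends on both sample sizes rather than being a fixed constant.
        n, m = len(self.reference), batch_size
        c = math.sqrt(-0.5 * math.log(self.alpha / 2))
        return c * math.sqrt((n + m) / (n * m))

    def check(self, batch):
        batch = list(batch)
        drifted = ks_statistic(self.reference, batch) > self.threshold(len(batch))
        if drifted:
            # Assumed adaptation policy: the drifted batch becomes the new reference.
            self.reference = batch
        return drifted


reference = [i / 100 for i in range(100)]
monitor = AdaptiveDriftMonitor(reference)
print(monitor.check([x + 0.005 for x in reference]))  # batch close to the reference
print(monitor.check([x + 0.5 for x in reference]))    # batch shifted by half the range
```

Replacing the reference window after a detected shift is the simplest adaptation policy; a real pipeline would more likely require confirmation over several batches or human review before accepting a new baseline.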