Doctoral research project number: 8531

Description

Submission date: 24 April 2023
Title: Detecting Dataset Manipulation and Weaponisation of NLP Models
Thesis supervisor: Benoit SAGOT (Inria-Paris (ED-130))
Advisor: Djamé SEDDAH (Inria-Paris (ED-130))
Scientific domain: Information and communication sciences and technologies
CNRS theme: Natural language and speech processing

Abstract: Training large language models (LLMs) has become more accessible than ever, driven by growing interest in scaling these models to ever larger sizes, which has been shown not only to improve performance but also to unlock new emergent capabilities. However, the high compute cost of training LLMs remains exclusive to well-funded private institutions and a handful of states, raising concerns about bad actors with malicious intent. Furthermore, the Center on Terrorism, Extremism, and Counterterrorism (CTEC) has highlighted the emerging threat of industrialized terrorist and extremist propaganda produced with models such as GPT-3. Hence, it is imperative to research methods to 1) detect and defend against LM weaponization and malicious dataset tampering, 2) eliminate or mitigate the threats present in language models, and 3) improve the robustness of OSINT and threat-analysis defense systems against adversarial attacks. We argue that, because the current paradigm of training and deploying language models lacks transparency and accountability, and because no single entity has the right to determine and enforce its own morality, it is crucial to study methods to detect and defend against these kinds of threats. Consequently, we raise the following research questions:
- Is it possible to detect whether the institutions training and releasing LMs are priming their models to later be weaponized?
- Is it possible to spot whether a model's training data has been maliciously tampered with or skewed (e.g. "sneaking in" data from radicalized sources, or excluding content from under-represented minorities), with, and perhaps more crucially without, access to the data?
- Is it possible to de-bias a language model, and how effective are current techniques in mitigating bias?
- Is it possible to improve a detector model's robustness to adversarial attacks that attempt to fool the model and evade detection?

Doctoral candidate: Antoun Wissam