Description
Submission date: 9 December 2022
Title: Information Extraction on French Electronic Health Reports
Thesis supervisor:
Laurent ROMARY (Inria-Paris (ED-130))
Co-supervisor:
Eric DE LA CLERGERIE (Inria-Paris (ED-130))
Scientific field: Information and communication sciences and technologies
CNRS topic: Natural language and speech processing
Abstract: With the introduction of clinical data warehouses in hospitals, electronic health reports (EHRs) are becoming increasingly available for research purposes, although they remain difficult to access for confidentiality reasons. Clinical studies may benefit from exploiting this unstructured but rich data. To that end, we need to extract structured information from free text. Natural language processing based on transfer learning with pre-trained language models such as BERT achieves state-of-the-art results on this task. However, clinical reports have a specific style and use specialized terms, and the lack of publicly available clinical data in French makes it hard for these techniques to perform well. This type of sensitive data must also be pseudonymized before being passed to neural networks, to avoid leaks. Moreover, manual annotation by experts is expensive. We aim to train information extraction models specialized in French clinical reports in an unsupervised setting, relying as much as possible on knowledge acquisition from new pseudonymized corpora to generate annotations.
This work is part of the Oncolab project, a national program that aims to make oncological data more accessible for research purposes and that will allow us to work on data from partner health institutions. The project involves two startups: Arkhn, which leads the project and will build and maintain clinical data warehouses, and Owkin, which will focus on building a platform for conducting clinical studies from the extracted data. Six healthcare establishments will give access to their data, and Inria will conduct research on information extraction.
We aim to build efficient information extraction models for French biomedical text that could help researchers conduct clinical studies. These models should work in low-resource scenarios: in the future they will run on hospital infrastructure, and they should require only small amounts of data to be trained.
To this end, we will first focus on knowledge acquisition, using medical corpora made available to us by our partners as well as external knowledge bases such as MeSH or UMLS. We will also explore knowledge transfer between healthcare establishments. Using extraction patterns, we aim to extract terminology and relations, which will allow us to pre-annotate some corpora, with light human validation guided by active learning. Finally, using these labels, we will fine-tune language models for named entity recognition (NER) and open information extraction. We will also explore how to inject information from knowledge bases, for example with Graph Attention Networks.
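As a rough illustration of the pre-annotation step described above, the sketch below projects a small terminology onto raw text and emits BIO tags that could seed NER fine-tuning after human validation. It is only a minimal sketch under stated assumptions: the lexicon entries, the label names, and the `tokenize` and `pre_annotate` helpers are hypothetical and are not taken from the project; a real lexicon would be derived from resources such as MeSH or UMLS and from the extracted terminology.

```python
import re

# Hypothetical toy lexicon (term -> label). These entries and labels are
# illustrative assumptions, not actual project data.
LEXICON = {
    "cancer du sein": "DISEASE",
    "tamoxifène": "DRUG",
}


def tokenize(text):
    """Split text into word tokens (Unicode-aware, punctuation dropped)."""
    return re.findall(r"\w+", text)


def pre_annotate(text, lexicon=LEXICON):
    """Return (token, BIO tag) pairs obtained by exact lexicon matching."""
    tokens = tokenize(text)
    lowered = [tok.lower() for tok in tokens]
    tags = ["O"] * len(tokens)

    # Try longer terms first so that multi-word terms win over shorter ones.
    entries = sorted(
        ((term.lower().split(), label) for term, label in lexicon.items()),
        key=lambda entry: -len(entry[0]),
    )
    for term_words, label in entries:
        n = len(term_words)
        for i in range(len(lowered) - n + 1):
            span_is_free = all(tag == "O" for tag in tags[i:i + n])
            if lowered[i:i + n] == term_words and span_is_free:
                tags[i] = "B-" + label
                for j in range(i + 1, i + n):
                    tags[j] = "I-" + label
    return list(zip(tokens, tags))


if __name__ == "__main__":
    sentence = "Patiente traitée par tamoxifène pour un cancer du sein."
    for token, tag in pre_annotate(sentence):
        print(f"{token}\t{tag}")
```

Exact matching of this kind is noisy (it ignores inflection, abbreviations, and negation), which is why the plan above pairs pre-annotation with light human validation rather than using the generated labels directly.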
Doctoral student: Touchent Rian