Our pre-final report is available


  • IRISA/LinkMedia and CNRS/STL have developed NLP techniques to extract numerical information from clinical records or clinical trial protocols (which concept is measured, in which unit, and with what value). The approach achieves good precision and recall and handles variations in the way units are expressed. The different aspects of this work have been published in several NLP conferences.
  • IRISA/LinkMedia and CNRS/STL have developed an NLP tool based on deep learning to detect negation in biomedical texts (the negation marker and the negated concept). This tool achieves state-of-the-art results on English and the best known results for French and Portuguese. It is available as a web-service on Allgo. This work has been published in several NLP conferences, and the tool is being adapted to other languages.
  • IRISA/LinkMedia and CNRS/STL have developed a Part-of-Speech tagger and lemmatizer based on Conditional Random Fields and example-based machine learning. It is freely available as a web-service on Allgo.
  • IRISA/Cidre has developed a new distributed algorithm to detect the k most frequent elements in a set. It is a one-pass, time-decaying, resource-efficient and distributed approach. It is especially suited to massive and highly dynamic data streams such as medical events mined in the EHRs of several hospitals. This work has been published in two top-tier conferences: DSN 2018 and NCA 2018.
  • IRISA/LinkMedia and CNRS/STL, with colleagues from LIMSI, France, have proposed a text-mining competition, DeFT2019, dedicated to information extraction in clinical texts. A workshop will be held during PFIA in Toulouse, France. See the challenge website.
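
As an illustration of the numerical information extraction described in the first item above (concept, value, unit), here is a minimal Python sketch. It is not the method used in the project: it relies on a simple regular expression, and the concept lexicon `CONCEPTS`, the unit table `UNIT_VARIANTS` and the function `extract_measurements` are hypothetical names chosen for this example.

```python
import re

# Hypothetical lexicon of measured concepts (illustrative only).
CONCEPTS = ["fasting glucose", "systolic blood pressure", "age"]

# Hypothetical table mapping surface forms of units to a canonical form.
UNIT_VARIANTS = {
    "mg/dl": "mg/dL",
    "mg per dl": "mg/dL",
    "mmhg": "mmHg",
    "mm hg": "mmHg",
    "years": "year",
    "year": "year",
    "yrs": "year",
}


def _alternation(terms):
    # Longest terms first, so that longer variants are tried before shorter ones.
    return "|".join(re.escape(t) for t in sorted(terms, key=len, reverse=True))


PATTERN = re.compile(
    r"\b(?P<concept>" + _alternation(CONCEPTS) + r")"
    r"[^\d]{0,20}?"                      # short filler such as " of ", ": ", " >= "
    r"(?P<value>\d+(?:[.,]\d+)?)\s*"     # integer or decimal, "." or "," separator
    r"(?P<unit>" + _alternation(UNIT_VARIANTS) + r")\b",
    re.IGNORECASE,
)


def extract_measurements(text):
    """Return a list of (concept, value, canonical unit) triples found in text."""
    results = []
    for match in PATTERN.finditer(text):
        concept = match["concept"].lower()
        value = float(match["value"].replace(",", "."))
        unit = UNIT_VARIANTS[match["unit"].lower()]
        results.append((concept, value, unit))
    return results


if __name__ == "__main__":
    sample = "Fasting glucose of 126 mg/dl, age >= 18 yrs"
    for triple in extract_measurements(sample):
        print(triple)
```

A lexicon and a regular expression only cover the unit spellings listed in the table; handling unseen variations is precisely where learned models are useful.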

Ongoing work

  • we are still evaluating distributed algorithms to detect the k most frequent elements in a set (IRISA/Cidre)
  • we are developing text classification tools based on deep learning to automatically assign ICD-10 codes (PMSI) to clinical records (IRISA/LinkMedia)
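
To give an idea of what detecting the k most frequent elements in a stream involves, here is a minimal single-machine Python sketch. It is not the algorithm developed by IRISA/Cidre: it is the classic Space-Saving summary, extended with a naive exponential decay so that old events weigh less. The class `DecayedSpaceSaving` and its parameters `capacity` and `decay` are assumptions made for this example, and the sketch is neither distributed nor based on sliding windows.

```python
class DecayedSpaceSaving:
    """One-pass, bounded-memory summary of frequent items (illustrative sketch)."""

    def __init__(self, capacity, decay=1.0):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.capacity = capacity  # maximum number of monitored items
        self.decay = decay        # 1.0 means no decay
        self.counts = {}          # item -> (possibly overestimated) weight

    def update(self, item):
        # Naive decay: age every counter before accounting for the new event.
        # This costs O(capacity) per update, which is acceptable for a sketch.
        if self.decay < 1.0:
            for key in self.counts:
                self.counts[key] *= self.decay

        if item in self.counts:
            self.counts[item] += 1.0
        elif len(self.counts) < self.capacity:
            self.counts[item] = 1.0
        else:
            # Space-Saving rule: replace the item with the smallest counter
            # and let the newcomer inherit that counter plus one.
            victim = min(self.counts, key=self.counts.get)
            floor = self.counts.pop(victim)
            self.counts[item] = floor + 1.0

    def top(self, k):
        """Return the k monitored items with the largest weights."""
        ranked = sorted(self.counts.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:k]


if __name__ == "__main__":
    summary = DecayedSpaceSaving(capacity=3)
    for event in ["flu"] * 5 + ["asthma"] * 3 + ["fracture"]:
        summary.update(event)
    print(summary.top(2))
```

Because a newcomer inherits the evicted counter, the stored weights can overestimate true frequencies; the memory used never exceeds `capacity` entries, whatever the length of the stream.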

Latest news

July 2019: BigClin at TALN/PFIA

We presented the CAS corpus (clinical cases and their fine-grained annotations) at the TALN conference. The corpus is being made freely available, on request, for research purposes.

July 2019: workshop of the Text-mining Challenge DeFT2019

STL and LinkMedia, with colleagues from LIMSI, proposed a text-mining competition on information extraction in clinical texts. Nine teams participated (from academia and companies). The results and the participants’ approaches were presented in a workshop held during PFIA in Toulouse, in July.

May 2019: Visit from Claudia Moro (PUCPR) in Rennes

Claudia Moro visited LinkMedia, IRISA for a week. During her stay, we finished adapting the information extraction tools developed for French to Brazilian Portuguese. We began to write reports for the Brazilian funding agency. Plans for a co-advised PhD were also discussed.

January 2019: TagEx, a new NLP API, is online

The Part-of-Speech system developed in the framework of the project is now available online as an API. Test it on Allgo.

November 2018: BigClin at NCA 2018

Vasile Cazacu is presenting his work “On the fly detection of top-k elements in the sliding-window model” at the 17th IEEE International Symposium on Network Computing and Applications (NCA 2018), Boston, MA (CORE ranking: A).

November 2018: Visit to PUCPR, Curitiba, Brazil

Natalia Grabar (STL) and Vincent Claveau (IRISA) went to PUCPR to follow up on the joint work with our Brazilian partners on numerical information extraction, negation detection and Part-of-Speech tagging. Several publications were also initiated during the stay.

November 2018: BigClin meeting

One full day dedicated to the project. All participants presented their latest results, with a special emphasis on the work of our two PhD students, Vasile Cazacu and Clément Dalloux. Two invited researchers presented their work in biomedical NLP:

  • Cyril Grouin (LIMSI), on pharmacovigilance and social networks
  • Anne-Lyse Minard, on her participation in the SMM4H challenge on drug identification and adverse reactions

October 2018: BigClin at Louhi

Clément Dalloux will present his work on the development of annotated datasets of French clinical texts at the Louhi workshop held in Brussels, Belgium.

June 2018: BigClin at DSN 2018

Vasile Cazacu is presenting his work “Finding Top-k Most Frequent Items in Distributed Streams in the Time-Sliding Window Model” at the 48th International Conference on Dependable Systems and Networks (DSN 2018), Luxembourg City, Luxembourg (CORE ranking: A).

March 2018: BigClin at n2c2 challenge

This year, our team participated in the n2c2 challenge on clinical trial recruitment. Although the challenge addresses a real-life scenario (patient recruitment for clinical trials), the small amount of training data ruled out our preferred approach based on machine learning (deep learning and others). Instead, rule-based techniques were developed and gave good results. This work is ongoing.

February 2018: Visit to PUCPR, Curitiba, Brazil

Marc Cuggia and Pascal VanHill (LTSI) are on their way to spend 10 days in Curitiba with our Brazilian partners. They will present their clinical record warehousing system.


December 2017: Visit to PUCPR, Curitiba, Brazil

Following our collaboration with the members of the medical informatics department of the Pontifícia Universidade Católica do Paraná, C. Dalloux spent one month in Curitiba to work on text mining in clinical texts in French, Portuguese and English.


November 2017: BigClin at SIIM 2017

C. Dalloux (IRISA/LinkMedia) will present his work on negation detection in biomedical texts at the Symposium sur l’Ingénierie de l’Information Médicale.


November 2017: NegDetect, a new NLP API is online

The negation detection system developed in the framework of the PhD thesis of C. Dalloux is now available online as an API. Test it on Allgo.


June 2017: BigClin at TALN2017

We will present our work on extracting numerical eligibility criteria in clinical trial protocols at the conference Traitement Automatique des Langues Naturelles in Orléans.


June 2017: BigClin at AIME2017

We will present our work on extracting numerical eligibility criteria in clinical trial protocols at the 16th Conference on Artificial Intelligence in Medicine in Vienna.


April 2017: BigClin at CICLing 2017

Vincent Claveau and Ewa Kijak (IRISA/LinkMedia) will present their work on active learning for text mining at the 18th International Conference on Computational Linguistics and Intelligent Text Processing in Budapest.

February 2017: a new PhD student joins the team!

Vasile Cazacu is hired as a PhD student. He will work with E. Anceaume and Yann Busnel on distributed computing.


December 2016: Visit to PUCPR, Curitiba, Brazil

In December 2016, BigClin members visited the medical informatics department of the Pontifícia Universidade Católica do Paraná – PUCPR, in the research group of Prof. Claudia Moro.

A seminar was organized, gathering researchers from computer science, medical informatics and medicine. We presented research activities within the BigClin project:

  • Vincent Claveau, Biomedical NLP at IRISA
  • Natalia Grabar, Managing the complexity of medical texts
  • Guillaume Bouzillé, Big Data in Health: the example of CHU Rennes

Many common research avenues were identified and a long-term collaboration is foreseen with the team of Prof. Moro.



December 2016: a new PhD student joins the team!

Clément Dalloux is hired as a PhD student in the LinkMedia team. He will study biomedical NLP (text mining, information retrieval).


October 2016: Kickoff

A seminar was organized at LTSI Rennes for the kickoff of BigClin. Many invited speakers presented their work and about 30 participants attended.



9h00 – 9h30 : Welcome

9h30 – 9h45 : Opening and presentation of the project (15’) Marc Cuggia & Vincent Claveau

9h45 – 11h00 : Guest presentations (each 20’+5’)

  • Frantz Thiessard / Diallo Gayo (ISPED/CHU de Bordeaux) – Title: Contribution of NLP techniques in three research projects
  • Bastien Rance (HEGP) – Title: Secondary use of data at HEGP: The new challenges of reusing clinical narratives.
  • Douglas Teodoro (SIB Swiss Institute of Bioinformatics) – Title: Making sense of text data: from enterprise to web content

11h15 – 11h30 : Break

11h30 – 12h20 : Guest presentations (each 20’+5’)

  • Claudia Moro (University of Parana) : Exploiting NLP methods to identify diagnosis and continuity of care information from Brazilian Portuguese Clinical Narratives – System Evaluation and EHR integration
  • Pierre Antoine Gourraud (CHU and University of Nantes) : “Precision Medicine for Multiple Sclerosis: From research cohort datasets through EMR data, The MS Bioscreen application”

12h20 – 13h50 Lunch and discussion (70’)

13h50 – 14h30 (40’): Demos and related scientific works

  • eHOP (Guillaume Bouzille)

14h30 – 15h15  Teams presentations (45’) :

10’+5′    Linkmedia (IRISA)            Vincent Claveau

10’+5′    STL (Lille)                   Natalia Grabar

10’+5′    Cidre (IRISA)               Yann Busnel / Emmanuelle

15h15 – 16h15 (75’) Discussions of topics

  • Use cases and applications for research
  • Prerequisites for data processing (de-identification, data protection and regulation, project organisation, IT)
  • Extraction, indexing, Machine learning methods and tools
  • Benchmarks/Evaluation/Challenges
  • PhD or MSD student exchanges

16h15 – 16h30  Break

16h30 – 17h45  (75’) Action plan, networking, collaborations, next scientific meetings

17h45 – 17h55  (10’) Wrap-up and closing
