Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative VIII

BioCreative VIII challenge and workshop (Events) [2023-01-22]

Note for BioCreative participants: To register for a track, please use the Google form.
Do not use the "Team page" tab, as it is non-functional. For more information, see the Team Registration section below.

The BioCreative VIII workshop is scheduled to run with AMIA 2023 on November 12, 2023.


Critical Assessment of Information Extraction in Biology is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. BioCreative has been an invaluable source for advancing state-of-the-art text mining methods since 2004, by providing reference datasets and a collegial environment to develop and evaluate these methods in both shared and interactive models. Following on the success of the previous seven BioCreative workshops, the BioCreative VIII workshop aims to provide a forum for clinical informatics community members, traditional biomedical natural language processing researchers, bioinformatics researchers, and data curators to present and discuss advances in text mining for health applications.

The eighth BioCreative workshop seeks to attract researchers interested in automatic methods of extracting medically relevant information from clinical data, and aims to bring together the medical NLP community and the health professionals community. Proposed tracks include SYMPTEMIST (Symptom TExt Mining Shared Task) and Phenotype extraction (genetic conditions in pediatric patients), which address symptom and phenotype extraction from clinical records (in English and Spanish). In addition, the new BioRED (Biomedical Relation Extraction Dataset) Track will continue to address information extraction from biomedical literature. Finally, BioCreative VIII proposes a new Annotation Tool track focused on developing annotation tools that facilitate the job of domain experts, offering seamless integration with relevant ontologies and other features to improve user experience and efficiency. Please see below for more on the tracks:

Track 1: BioRED (Biomedical Relation Extraction Dataset) Track (Rezarta Islamaj and Zhiyong Lu)

This track aims to foster the development of systems that automatically extract biomedical relations from journal articles. The final resource, freely available to the community, will consist of 1000 MEDLINE articles fully annotated with biological and medically relevant entities, the biomedical relations between them, and the novelty of each relation (whether the relation is a key point of the article or background knowledge that can be found elsewhere). Participants will use the training data (600 articles) to design and develop NLP systems that extract asserted relationships from free text, and are encouraged to classify which relations are novel findings. In the BioCreative setting we will enrich the BioRED training dataset with 400 recently published, fully annotated MEDLINE articles, bringing this valuable resource to 1000 articles. This track serves as a continuation of previous BioCreative workshops that addressed the individual extraction of bio entities and/or specific relations, such as disease-gene, protein-protein, or chemical-chemical, in biomedical articles. In contrast to previous challenges, this track calls for the extraction of all semantic relations expressed in the article, together with their novelty factor.

Track 2: SYMPTEMIST (Symptom TExt Mining Shared Task) (Martin Krallinger)

Considerable effort has been made to automatically extract relevant variables and concepts from clinical texts using advanced entity recognition approaches. Despite the importance of clinical signs and symptoms for diagnosis, prognosis, and healthcare data analytics strategies, this kind of clinical entity has received far less attention than other entity classes such as medications or diseases. Understanding and characterizing the relationships between different symptoms, their onset, and the associations of symptoms with diseases is a central question for medical research. Due to the complexity of the annotation process and of normalizing or mapping symptom mentions to controlled vocabularies, very few datasets or corpora have been generated to train and evaluate advanced clinical named entity recognition systems. To foster the development, research, and evaluation of semantic annotation strategies that can be useful for systematically extracting and harmonizing symptoms from clinical documents, we propose the SYMPTEMIST track. We will invite researchers, health-tech professionals, and NLP and ontology experts to develop tools capable of automatically detecting mentions of clinical symptoms in clinical texts in Spanish and normalizing or mapping them to a widely used multilingual clinical vocabulary, namely SNOMED CT. For this task we will release a large collection of manually annotated symptom mentions, together with detailed annotation guidelines, consistency analysis, and additional resources. We also plan to release a multilingual version of the corpus (English, Italian, Romanian, Catalan, Portuguese, French, Dutch, Swedish and Czech). This is a new challenge.

Track 3: Phenotype normalization (genetic conditions in pediatric patients) (Graciela Gonzalez, Ian Campbell, Davy Weissenbacher)

The dysmorphology physical examination is a critical component of the diagnostic evaluation in clinical genetics. This process catalogs often minor morphological differences of the patient's facial structure or body, but it may also identify more general medical signs such as neurologic dysfunction. The findings enable the correlation of the patient with known rare genetic diseases. Although the medical findings are key information, they are nearly always captured within the electronic health record (EHR) as unstructured free text, making them unavailable for downstream computational analysis. Advanced Natural Language Processing methods are therefore required to retrieve the information from the records. This is a new challenge.

Track 4: Annotation Tool track (Rezarta Islamaj, Cecilia Arighi, Lynette Hirschman, Martin Krallinger, Graciela Gonzalez)

Recognizing the need for freely available, time-saving tools that help build quality gold-standard resources, the goal of the BioCreative 2023 Annotation Tool Track is to foster development of such biocuration annotation systems. This track calls for text mining developers to submit systems that: 1) are publicly available and offer local setup options to accommodate data with privacy concerns, such as clinical records; 2) support team annotation and collaboration between annotators to ensure data annotation quality; 3) can annotate documents for triage, entities, and/or relations; and 4) integrate the selected ontology and provide search and browsing capabilities, as well as suggestions to the curator for the selected ontology. A select number of systems will be showcased at the workshop.



The BioCreative VIII Proceedings will include all submissions from participating teams and will be freely available by the time of the workshop.

In addition, we are working with a journal to host the BioCreative VIII special issue for work that passes its peer-review process. Invitations to submit will be sent after the workshop.


Team Registration

Teams can participate in one or more of these tracks. Team registration will continue until final commitment is requested by the individual tracks.

To register a team, go to the Registration form. If you have restrictions accessing Google Forms, please send e-mail to

BioCreative Organizing Committee

  • Dr. Rezarta Islamaj, National Library of Medicine
  • Dr. Cecilia Arighi, University of Delaware
  • Dr. Ian M. Campbell, Children's Hospital of Philadelphia
  • Dr. Graciela Gonzalez-Hernandez, Cedars-Sinai Medical Center
  • Dr. Lynette Hirschman, MITRE
  • Dr. Martin Krallinger, Barcelona Supercomputing Center
  • Dr. Davy Weissenbacher, Cedars-Sinai Medical Center
  • Dr. Zhiyong Lu, National Library of Medicine