Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative VIII

BioCreative VIII Workshop Information (Events) [2023-10-27]

Welcome to the BioCreative VIII challenge and workshop: Curation and evaluation in the era of ChatGPT

When: Sunday, November 12, 2023. 8am-12:30pm CST.

Where: AMIA Annual Symposium, Hilton New Orleans Riverside, New Orleans, LA

Scientific Program

Time (CST) | Session
8:00 - 8:05 am | Opening remarks
8:05 - 8:30 am | Keynote: Clinical information extraction in the era of large language models (LLMs).
Hua Xu, PhD, FACMI.
Robert T. McCluskey Professor and Vice Chair for Research and Development, Section of Biomedical Informatics and Data Science; Assistant Dean for Biomedical Informatics, Yale School of Medicine, Yale University
8:35 - 9:30 am | Track 1: BioRED (Biomedical Relation Extraction Dataset) Track
9:30 - 10:00 am | Break & BCVIII Poster session
10:00 - 10:55 am | Track 2: SYMPTEMIST (Symptom TExt Mining Shared Task) track
11:00 - 11:55 am | Track 3: Phenotype normalization (genetic conditions in pediatric patients)
12:00 - 12:25 pm | PMC-Patients - a large-scale publicly available patient dataset: A call for flexible tools for complex annotation
12:25 - 12:30 pm | Closing remarks


Clinical information extraction in the era of large language models (LLMs)

Electronic Health Records (EHRs) contain abundant free text data that are valuable for research and operation in the medical domain. Natural language processing (NLP) technologies have shown great promise in unlocking information in clinical texts; however, many challenges still exist when developing and implementing NLP technologies for biomedical applications. The talk will focus on method development in clinical and biomedical NLP, including recent advancements in large language models (LLMs), along with lessons learned from building NLP-based applications.

Speaker: Hua Xu, PhD, FACMI
Robert T. McCluskey Professor and Vice Chair for Research and Development, Section of Biomedical Informatics and Data Science
Assistant Dean for Biomedical Informatics, Yale School of Medicine, Yale University

Bio: Dr. Hua Xu is the Robert T. McCluskey Professor and Vice Chair for Research and Development, Section of Biomedical Informatics and Data Science at Yale School of Medicine (YSM), as well as Assistant Dean for Biomedical Informatics at YSM. He received his Ph.D. in Biomedical Informatics from Columbia University. His primary research interests include biomedical natural language processing (NLP) and data mining, as well as their applications in the secondary use of electronic health records data for clinical and translational research. His research is funded by multiple agencies (i.e., NLM, NCI, NIGMS, NIA, AHA, and CPRIT), and methods and tools developed in his lab have been top ranked in a number of biomedical NLP shared tasks and are widely used to support diverse biomedical applications. He served as the Chair of the American Medical Informatics Association (AMIA) NLP Working Group and now leads the Observational Health Data Sciences and Informatics (OHDSI) NLP Working Group. Dr. Xu is a fellow of the American College of Medical Informatics (ACMI) and the International Academy of Health Sciences Informatics (IAHSI).


Poster session

To be announced.



To be announced



To register, go to the AMIA Symposium registration page.

BioCreative VIII challenge and workshop (Events) [2023-01-22]

Note for BioCreative participants: To register for a track, please use the Google form.
Do not use the "Team page" tab, as it is non-functional. For more information, go to the Team Registration section below.

The BioCreative VIII workshop is scheduled to run with AMIA 2023 on November 12, 2023.


Critical Assessment of Information Extraction in Biology (BioCreative) is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. Since 2004, BioCreative has been an invaluable source for advancing state-of-the-art text mining methods by providing reference datasets and a collegial environment in which to develop and evaluate these methods in both shared and interactive models. Following on the success of the previous seven BioCreative workshops, the BioCreative VIII workshop aims to provide a forum for clinical informatics community members, traditional biomedical natural language processing researchers, bioinformatics researchers, and data curators to present and discuss advances in text mining for health applications.

The BioCreative VIII workshop seeks to attract researchers interested in automatic methods of extracting medically relevant information from clinical data, and aims to bring together the medical NLP community and the health professionals community. Proposed tracks include SYMPTEMIST (Symptom TExt Mining Shared Task) and Phenotype extraction (genetic conditions in pediatric patients), which address symptom and phenotype extraction from clinical records (in English and Spanish). In addition, the new BioRED (Biomedical Relation Extraction Dataset) Track will continue to address information extraction from biomedical literature. Finally, BioCreative VIII proposes a new Annotation Tool track focused on developing annotation tools that facilitate the work of domain experts, offering seamless integration with relevant ontologies and other features to improve user experience and efficiency. Please see below for more on the tracks:

Track 1: BioRED (Biomedical Relation Extraction Dataset) Track (Rezarta Islamaj and Zhiyong Lu)

This track aims to foster the development of systems that automatically extract biomedical relations from journal articles. The final resource, which will be freely available to the community, will consist of 1000 MEDLINE articles fully annotated with biologically and medically relevant entities, the biomedical relations between them, and the novelty of each relation (whether the relation is a key point of the article or background knowledge that can be found elsewhere). Participants will use the training data (600 articles) to design and develop NLP systems that extract asserted relationships from free text, and are encouraged to classify relations that are novel findings. In the BioCreative setting we will enrich the BioRED training dataset with 400 recently published, fully annotated MEDLINE articles, bringing this valuable resource to 1000 articles. This track serves as a continuation of previous BioCreative workshops that addressed the individual extraction of bio-entities and/or specific relations, such as disease-gene, protein-protein, or chemical-chemical, in biomedical articles. In contrast to previous challenges, this track calls for the extraction of all semantic relations expressed in the article, together with their novelty factor.

Track 2: SYMPTEMIST (Symptom TExt Mining Shared Task) (Martin Krallinger)

Considerable effort has been made to automatically extract relevant variables and concepts from clinical texts using advanced entity recognition approaches. Despite the importance of clinical signs and symptoms for diagnosis, prognosis and healthcare data analytics strategies, this kind of clinical entity has received far less attention than other entity classes such as medications or diseases. Understanding and characterizing the relationships between different symptoms, their onset, and the associations of symptoms with diseases is a central question for medical research. Due to the complexity of the annotation process and of normalizing or mapping symptom mentions to controlled vocabularies, very few datasets or corpora have been generated to train and evaluate advanced clinical named entity recognition systems. To foster the development, research and evaluation of semantic annotation strategies that can be useful for systematically extracting and harmonizing symptoms from clinical documents, we propose the SYMPTEMIST track. We will invite researchers, health-tech professionals, and NLP and ontology experts to develop tools capable of automatically detecting mentions of clinical symptoms in clinical texts in Spanish and normalizing or mapping them to a widely used multilingual clinical vocabulary, namely SNOMED CT. For this task we will release a large collection of manually annotated symptom mentions, together with detailed annotation guidelines, consistency analysis and additional resources. We also plan to release a multilingual version of the corpus (English, Italian, Romanian, Catalan, Portuguese, French, Dutch, Swedish and Czech). This is a new challenge.

Track 3: Phenotype normalization (genetic conditions in pediatric patients) (Graciela Gonzalez, Ian Campbell, Davy Weissenbacher)

The dysmorphology physical examination is a critical component of the diagnostic evaluation in clinical genetics. This process catalogs often minor morphological differences of the patient's facial structure or body, but it may also identify more general medical signs such as neurologic dysfunction. The findings enable the correlation of the patient with known rare genetic diseases. Although the medical findings are key information, they are nearly always captured within the electronic health record (EHR) as unstructured free text, making them unavailable for downstream computational analysis. Advanced Natural Language Processing methods are therefore required to retrieve the information from the records. This is a new challenge.

Track 4: Annotation Tool track (Rezarta Islamaj, Cecilia Arighi, Lynette Hirschman, Martin Krallinger, Graciela Gonzalez)

Recognizing the need for freely available, time-saving tools that help build quality gold-standard resources, the goal of the BioCreative 2023 Annotation Tool Track is to foster the development of such biocuration annotation systems. This track calls for text mining developers to submit systems that are: 1) publicly available and offer local setup options to allow for data with privacy concerns, such as clinical records, 2) able to support team annotation and collaboration between annotators to ensure data annotation quality, 3) able to annotate documents for triage, entities, and/or relations, and 4) able to integrate the selected ontology and provide search and browsing capabilities, as well as suggestions to the curator for the selected ontology. A select number of systems will be showcased at the workshop.



The BioCreative VIII Proceedings will host all the submissions from participating teams and will be freely available by the time of the workshop.

In addition, we are working with a journal to host the BioCreative VIII special issue for work that passes its peer-review process. Invitations to submit will be sent after the workshop.


Team Registration

Teams can participate in one or more of these tracks. Team registration will continue until final commitment is requested by the individual tracks.

To register a team go to the Registration form. If you have restrictions accessing Google forms please send e-mail to

BioCreative Organizing Committee

  • Dr. Rezarta Islamaj, National Library of Medicine
  • Dr. Cecilia Arighi, University of Delaware
  • Dr. Ian M. Campbell, Children's Hospital of Philadelphia
  • Dr. Graciela Gonzalez-Hernandez, Cedars-Sinai Medical Center
  • Dr. Lynette Hirschman, MITRE
  • Dr. Martin Krallinger, Barcelona Supercomputing Center
  • Dr. Davy Weissenbacher, Cedars-Sinai Medical Center
  • Dr. Zhiyong Lu, National Library of Medicine

BioCreative VII

BioCreative VII Workshop Information (Events) [2021-08-23]

Welcome to the BioCreative VII challenge evaluation workshop, November 8-10, 2021. This event is virtual and free.

This event is now closed.

Scientific Program

The scientific program includes talks related to the individual tracks, a panel about mining adverse drug reactions, a keynote talk, and flash talks for selected posters. The detailed agenda is shown below.

Monday, November 8, 2021

UTC (Universal) | EST | Session
2:30-2:40 pm | 9:30-9:40 am | Opening remarks
2:40-3:55 pm | 9:40-10:55 am | NLM-Chem Track: Full text Chemical Identification and Indexing in PubMed articles (Track 2)
Chair: Zhiyong Lu
  • Overview of the NLM-CHEM track - Full-text Chemical Identification and Indexing in PubMed articles (Rezarta Islamaj, Robert Leaman)
  • Chemical detection and indexing in PubMed full text articles using deep learning and rule-based methods (João Figueira Silva)
  • Improving Tagging Consistency and Entity Coverage for Chemical Identification in Full-text Articles (Hyunjae Kim)
  • A BERT-Based Hybrid System for Chemical Identification and Indexing in Full-Text Articles (Arslan Erdengasileng)
  • Chemical Identification and Indexing in PubMed Articles via BERT and Text-to-Text Based Approaches (Virginia Adams)
3:55-4:15 pm | 10:55-11:15 am | Break
4:15-5:00 pm | 11:15 am-12:00 pm | Keynote: All of Us Research Program: Improving Health Through Technology, Huge Cohorts and Precision Medicine. Joshua Denny M.D., M.S., Chief Executive Officer of the All of Us Research Program, NIH.
5:00-6:15 pm | 12:00-1:15 pm | Automatic extraction of medication names in tweets (Track 3)
Chair: Davy Weissenbacher
  • BioCreative VII – Track 3: Automatic Extraction of Medication Names in Tweets (Davy Weissenbacher)
  • NCU-IISR/AS-GIS: Detecting Medication Names in Imbalanced Twitter Data with Pretrained Extractive QA Model and Data-Centric Approach (Yu Zhang)
  • BCH-NLP at BioCreative VII Track 3 - medications detection in tweets using transformer networks and multi-task learning (Dongfang Xu)

Tuesday, November 9, 2021

UTC (Universal) | EST | Session
2:00-2:10 pm | 9:00-9:10 am | Opening remarks
2:10-3:55 pm | 9:10-10:55 am | DrugProt: Text mining drug/chemical-protein interactions (Track 1)
Chair: Antonio Miranda-Escalada
  • Overview of DrugProt BioCreative VII track: quality evaluation and large scale text mining of drug-gene/protein relations (Martin Krallinger, Antonio Miranda-Escalada)
  • Using Knowledge Base to Refine Data Augmentation for Biomedical Relation Extraction (WonJin Yoon)
  • Extracting Drug-Protein Interaction using an Ensemble of Biomedical Pre-trained Language Models through Sequence Labeling and Text Classification Techniques (Ling Luo)
  • Text Mining Drug-Protein Interactions using an Ensemble of BERT, Sentence BERT and T5 models (Xin Sui)
  • Humboldt @ DrugProt: Chemical-Protein Relation Extraction with Pretrained Transformers and Entity Descriptions (Leon Weber)
  • Does constituency analysis enhance domain-specific pre-trained BERT models for relation extraction? (Anfu Tang)
  • Text Mining Drug/Chemical-Protein Interactions using an Ensemble of BERT and T5 Based Models (Virginia Adams)
  • CU-UD: text-mining drug and chemical-protein interactions with ensembles of BERT-based models (Mehmet Efruz Karabulut)
  • TTI-COIN at BioCreative VII Track 1 (Naoki Iinuma/Masaki Asada)
  • A Multi-Task Transfer Learning-based method for Extracting Drug-Protein Interactions (Ed-drissiya El-allaly)
  • UTHealth@BioCreativeVII: Domain-specific Transformer Models for Drug-Protein Relation Extraction (Liang-Chin (Leon) Huang)
  • lasigeBioTM at BioCreative VII Track 1: Text mining drug and chemical-protein interactions using biomedical ontologies (Diana Sousa)
  • Identifying Drug/chemical-protein Interactions in Biomedical Literature using the BERT-based Ensemble Learning Approach for the BioCreative 2021 DrugProt Track (Tzu-Yi Li)
  • Catalytic DS at BioCreative VII: DrugProt Track (Dennis Mehay)
3:55-4:15 pm | 10:55-11:15 am | Break
4:15-5:00 pm | 11:15 am-12:00 pm | Selected poster flash talks
Chair: Rezarta Islamaj
  • Claim Detection in Biomedical Twitter Posts as a Prerequisite for Fact-Checking (Amelie Wührl)
  • Visual Exploration of Randomized Clinical Trials for COVID-19 (Abel Correa Dias)
  • COVID-SEE: The Scientific Evidence Explorer for COVID-19 Related Research (Karin Verspoor)
  • Long Covid: A Comprehensive Collection of Articles on Post-COVID Conditions (Robert Leaman)
  • Automated topic prediction of LitCovid using BioBERT (Vangala G Saipradeep)
  • A Survey of Relation Extraction Techniques Using Hybrid Classical and State of the Art Methods (Onur Kara)
  • Automatic Extraction of Medication Names in Tweets as Named Entity Recognition (Carole Anderson)
  • PubMedBERT-based Classifier with Data Augmentation Strategy for Detecting Medication Mentions in Tweets (Qing Hang)
  • Extraction of Medication Names from Twitter Using Augmentation and an Ensemble of Language Models (Igor Kulev)
  • Recognizing Chemical Entity in Biomedical Literature using a BERT-based Ensemble Learning Methods for the BioCreative 2021 NLM-Chem Track (Yu Wen Chiu)
  • Fine-tuning transformers for automatic chemical entity identification in PubMed articles (Robert Bevan)
  • PolyU CBS-NLP at BioCreative-VII LitCovid Task: Ensemble Learning for COVID-19 Multilabel Classification (Jinghang Gu)
  • Multi-label topic classification for COVID-19 literature annotation using an ensemble model based on PubMedBERT (Shubo Tian)
  • RobertNLP at the BioCreative VII - LitCovid track: Neural Document Classification Using SciBERT (Friedrich Annemarie)
  • TTI-COIN at BioCreative VII Track 2 (Tomoki Tsujimura)
  • Chemical–protein relation extraction in PubMed abstracts using BERT and neural networks (Rui Antunes)
  • R-BERT-CNN: Drug-target interactions extraction from biomedical literature (Jehad Aldahdooh)
5:00-6:15 pm | 12:00-1:15 pm | Panel: Challenges in mining adverse drug reactions

The BioCreative organizers have convened this panel to explore the possibility of a future BioCreative evaluation on mining adverse drug reactions (ADRs). The panel will explore challenges of mining ADRs, focusing on applications (e.g., post-market surveillance, early warning from tracking social media, predictive models of toxic endpoints for chemicals and drugs, pre-clinical and clinical research) and data sources (including their limitations and accessibility).

Chairs: Martin Krallinger, Lynette Hirschman

  • Dr. Martin Krallinger (Chair)
  • CDR Monica Muñoz, FDA CDER
  • Prof. Özlem Uzuner, George Mason University
  • Dr. Raul Rodriguez-Esteban, Roche Pharmaceutics
  • Prof. Graciela Gonzalez-Hernandez, U Pennsylvania Medical School

Wednesday, November 10, 2021

UTC (Universal) | EST | Session
2:30-2:40 pm | 9:30-9:40 am | Opening remarks
2:40-3:55 pm | 9:40-10:55 am | LitCovid track: Multi-label topic classification for COVID-19 literature annotation (Track 5)
Chair: Rezarta Islamaj
  • BioCreative VII LitCovid Track: multi-label topic classification for COVID-19 literature annotation (Qingyu Chen)
  • Multi-label topic classification for COVID-19 literature with Bioformer (Fang Li)
  • Multi-label topic classification for COVID-19 literature annotation: A BioBERT-based feature enhancement approach (Wentai Tang)
  • BERT-based bagging-stacking for multi-topic classification (Loïc Rakotoson)
  • Multi-label Topic Classification for COVID-19 Literature Annotation using the BERT-based Ensemble Learning Approach for the BioCreative 2021 LitCovid Track (Sheng-Jie Lin)
3:55-4:15 pm | 10:55-11:15 am | Break
4:15-5:30 pm | 11:15 am-12:30 pm | COVID-19 text mining tool interactive demo (Track 4)
Chair: Lynette Hirschman
  • Introduction to the COVID-19 Text Mining Tool Interactive Demo Track (Andrew Chatr-Aryamontri)
  • Semantic Search Engine preVIEW COVID-19 - Evaluation in the BioCreative VII IAT Track (Johannes Darms)
  • TopEx: Topic Exploration of COVID-19 Corpora - Results from the Biocreative VII Challenge Track 4 (Amy Olex)
  • Interpretable Visualization of Scientific Hypotheses in Literature-based Discovery (Ilya Tyagin)
  • A self-updating causal model of COVID-19 mechanisms built from the scientific literature (Benjamin Gyori)
  • BioKDE: a Deep Learning Powered Search Engine and Biomedical Knowledge Discovery Platform (Jinfeng Zhang)
  • The COVID-19 Therapeutic Information Browser (Tonia Korves)
  • Overview of the COVID-19 Text Mining Tool Interactive Demo Track (Andrew Chatr-Aryamontri)
5:30-6:10 pm | 12:30-1:10 pm | General discussion
6:10-6:15 pm | 1:10-1:15 pm | Closing remarks


Poster session

There are a number of exciting poster presentations. Note that several of these will be played during the flash talk session on November 9.



ISBN: 978-0-578-32368-8

Proceedings are available here



Registration is now closed