Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative VIII

Track 1: BioRED (Biomedical Relation Extraction Dataset) Track [2023-01-22]

Motivation

Biomedical relation extraction is the task of automatically identifying and characterizing relations between biomedical concepts from free text. As a central task in biomedical natural language processing (NLP) research, it plays a critical role in many downstream applications, such as drug discovery and personalized medicine.

While there is a significant body of research on automatic relation extraction, most existing benchmarking datasets for biomedical relation extraction only focus on relations of a single type (e.g., protein–protein interactions) at the sentence level. In response, a new biomedical relation extraction dataset – BioRED [1] – with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene–disease; chemical–chemical) at the document level was recently made freely available. Despite multiple attempts, the best performance on the BioRED dataset remains modest, with much room for further improvements.

Track goals

This track will provide participants with access to BioRED, an open repository of published scientific research articles whose titles and abstracts are annotated for biomedical concepts and the relationships between those concepts. Participants will use this training data to design and develop NLP systems that extract asserted relationships from free text. Other open biomedical data sources may be used to complement the BioRED training data, if desired. Beyond recognizing asserted relationships, participating systems are also encouraged to identify which relations are novel findings (the key points of a manuscript), as opposed to background or other existing knowledge that can be found elsewhere.

Task Definition

The track is organized into two sub-tasks, each aiming to spur innovation in the field. In Sub-task 1, participants will be provided with human-annotated concepts and will focus on building methods for relation extraction and classification. In Sub-task 2, teams are asked to develop an end-to-end system that identifies and classifies asserted relationships in free text.


Sub-task 1: Given the abstract and human-annotated entities, the goal is to identify all the relationships between those entities and classify them into specific types.
Sub-task 2: Given the abstract, the goal is to develop an end-to-end system to identify all the asserted relationships and classify them into specific types.
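
To make the difference between the two sub-tasks concrete, the following is a minimal Python sketch of their inputs and outputs. All names here (Entity, Relation, run_subtask1, run_subtask2, and the toy classifier) are hypothetical and are not part of the official data format or evaluation code. The pipeline shown for Sub-task 2 (entity recognition followed by pairwise classification) is only one possible design, assumed here for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Entity:
    concept_id: str    # hypothetical normalized identifier
    concept_type: str  # e.g. "gene", "disease", "chemical"


@dataclass(frozen=True)
class Relation:
    id_a: str
    id_b: str
    relation_type: str


# A pair classifier returns a relation type, or None when no relation is asserted.
PairClassifier = Callable[[str, Entity, Entity], Optional[str]]


def run_subtask1(abstract: str, entities: List[Entity],
                 classify: PairClassifier) -> List[Relation]:
    """Sub-task 1: entities are given; only the relations must be predicted."""
    relations = []
    for a, b in combinations(entities, 2):  # every unordered entity pair
        label = classify(abstract, a, b)
        if label is not None:
            relations.append(Relation(a.concept_id, b.concept_id, label))
    return relations


def run_subtask2(abstract: str,
                 find_entities: Callable[[str], List[Entity]],
                 classify: PairClassifier) -> List[Relation]:
    """Sub-task 2: only the text is given; entities must be found first."""
    return run_subtask1(abstract, find_entities(abstract), classify)


# Toy stand-ins for real models, for illustration only.
def toy_classifier(abstract: str, a: Entity, b: Entity) -> Optional[str]:
    if {a.concept_type, b.concept_type} == {"gene", "disease"}:
        return "association"
    return None


def toy_entity_finder(abstract: str) -> List[Entity]:
    return [Entity("D1", "disease"), Entity("G1", "gene"), Entity("C1", "chemical")]


gold_entities = toy_entity_finder("placeholder abstract text")
subtask1_output = run_subtask1("placeholder abstract text", gold_entities, toy_classifier)
subtask2_output = run_subtask2("placeholder abstract text", toy_entity_finder, toy_classifier)
```

In this sketch the only difference between the two sub-tasks is where the entity list comes from: it is supplied with the data in Sub-task 1 and must be produced by the system itself in Sub-task 2.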

Training data

The BioRED corpus is a manually annotated corpus consisting of 600 PubMed articles, in which domain experts have exhaustively labeled mentions of biomedical concepts and all binary relationships between them that correspond to a specific set of biologically and conceptually relevant relation types (BioRED relation classes). The dataset is divided into a training set (500 articles) and a validation set (100 articles).

Biomedical concepts:
  • chemical
  • gene/protein
  • disease
  • variant
  • species
  • cell line
Biomedical pairs:
  • gene/gene
  • disease/gene
  • disease/chemical
  • gene/chemical
  • disease/variant
  • chemical/chemical
  • chemical/variant
Biomedical relation types:
  • positive correlation
  • negative correlation
  • association
  • binding
  • co-treatment
  • drug interaction
  • comparison
  • conversion
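
The three lists above can be read as a small schema that constrains which relations a system may output. The following Python sketch encodes them exactly as listed on this page and checks a candidate relation against them. The constant and function names are hypothetical, "gene" is used as shorthand for gene/protein, and the label strings in the distributed BioRED files may differ from the ones used here.

```python
# Concept types, as listed on this page ("gene" stands for gene/protein).
CONCEPT_TYPES = frozenset(
    {"chemical", "gene", "disease", "variant", "species", "cell line"}
)

# Concept pairs are stored as unordered sets, so gene/disease and
# disease/gene are treated as the same pair. A one-element set
# represents a same-type pair such as gene/gene.
ALLOWED_PAIRS = frozenset({
    frozenset({"gene"}),                 # gene/gene
    frozenset({"disease", "gene"}),      # disease/gene
    frozenset({"disease", "chemical"}),  # disease/chemical
    frozenset({"gene", "chemical"}),     # gene/chemical
    frozenset({"disease", "variant"}),   # disease/variant
    frozenset({"chemical"}),             # chemical/chemical
    frozenset({"chemical", "variant"}),  # chemical/variant
})

RELATION_TYPES = frozenset({
    "positive correlation", "negative correlation", "association",
    "binding", "co-treatment", "drug interaction", "comparison", "conversion",
})


def is_valid_relation(type_a: str, type_b: str, relation_type: str) -> bool:
    """Return True if the concept pair and relation type both appear in the lists."""
    return (
        type_a in CONCEPT_TYPES
        and type_b in CONCEPT_TYPES
        and frozenset({type_a, type_b}) in ALLOWED_PAIRS
        and relation_type in RELATION_TYPES
    )
```

A check like this could be used to filter system predictions before submission. Note that it does not check whether a given relation type is meaningful for a given pair (for example, whether "conversion" applies to gene/disease), because this page does not specify that mapping.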

Test data

An additional set of 400 PubMed articles will be annotated and used as an independent test set.

Evaluation Metrics

Please check the CodaLab pages for Sub-task 1 and Sub-task 2.

Timeline:

  • Training data available: March 1, 2023
  • CodaLab evaluation website available: May 18, 2023
  • Zoom meeting for participants: June 29, 2023, 9:30 AM ET
  • Submission instructions available: (please contact us for instructions)
    • Each team will be allowed 5 runs
  • Test data (only title/abstract) available: September 15, 2023
  • User Survey available: September 15, 2023
  • Results Submission (sub-task 2, end-to-end system): September 22, 2023
  • Test data (title/abstract and entity annotation) available: September 25, 2023
  • Results Submission (sub-task 1, relation extraction): September 29, 2023
  • Participant Survey Due: September 30, 2023
  • Results return to participants: October 5, 2023
  • Short technical systems description paper due: October 12, 2023
  • Invites for oral presentation: October 20, 2023
  • Workshop: November 12, 2023

Task Organizers:

  • Rezarta Islamaj, Rezarta.Islamaj AT nih DOT gov
  • Po-Ting Lai, Po-Ting.Lai AT nih DOT gov
  • Chih-Hsuan Wei, Chih-Hsuan.Wei AT nih DOT gov
  • Ling Luo, lingluo AT dlut DOT edu DOT cn
  • Zhiyong Lu, Zhiyong.Lu AT nih DOT gov

References:

[1] BioRED: a rich biomedical relation extraction dataset, Ling Luo, Po-Ting Lai, Chih-Hsuan Wei, Cecilia N Arighi, Zhiyong Lu, Briefings in Bioinformatics, Volume 23, Issue 5, September 2022, bbac282, https://doi.org/10.1093/bib/bbac282