BioCreative - Track 1: BioRED (Biomedical Relation Extraction Dataset) Track

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative VIII

Track 1: BioRED (Biomedical Relation Extraction Dataset) Track [2023-01-22]

Motivation

Biomedical relation extraction is the task of automatically identifying and characterizing relations between biomedical concepts from free text. As a central task in biomedical natural language processing (NLP) research, it plays a critical role in many downstream applications, such as drug discovery and personalized medicine.

While there is a significant body of research on automatic relation extraction, most existing benchmark datasets for biomedical relation extraction focus only on relations of a single type (e.g., protein–protein interactions) at the sentence level. In response, a new biomedical relation extraction dataset, BioRED [1], with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene–disease; chemical–chemical) at the document level was recently made freely available. Despite multiple attempts, the best performance on the BioRED dataset remains modest, with much room for further improvement.

Track goals

This track will provide participants with access to BioRED, an open repository of published scientific research articles whose titles and abstracts are annotated for biomedical concepts and the relationships between those concepts. Participants will use this training data to design and develop NLP systems that extract asserted relationships from free text. Other open biomedical data sources may be used to complement this training data, if desired. Beyond recognizing asserted relationships, participating systems are also encouraged to identify which relations are novel findings (the key points of a manuscript), as opposed to background or other existing knowledge that can be found elsewhere.

Task Definition

The track will be organized into two sub-tasks, each aiming to spur innovation in the field. In Sub-task 1, participants will be provided with human-annotated concepts and will focus on building methods for relation extraction and classification. In Sub-task 2, teams are asked to develop an end-to-end system that identifies and classifies asserted relationships in free text.

Sub-task 1: Given the abstract and human-annotated entities, the goal is to identify all relationships between those entities and classify them into specific types.
Sub-task 2: Given the abstract, the goal is to develop an end-to-end system to identify all the asserted relationships and classify them into specific types.
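
To make the difference between the two sub-tasks concrete, the following is a minimal sketch of what a system receives in each case. All class, field, and function names here are hypothetical and chosen for illustration only; they do not describe the official BioRED file format or any released tooling. The example text and identifiers are made up.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class Entity:
    identifier: str    # hypothetical concept identifier
    concept_type: str  # e.g. "gene", "disease"
    start: int         # character offset of the mention (inclusive)
    end: int           # character offset of the mention (exclusive)


@dataclass(frozen=True)
class Relation:
    head: str                     # identifier of the first concept
    tail: str                     # identifier of the second concept
    relation_type: str            # e.g. "association"
    novel: Optional[bool] = None  # optional novelty label, if predicted


@dataclass
class Document:
    doc_id: str
    text: str  # title and abstract
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)


def subtask_input(doc: Document, subtask: int) -> Document:
    """Return the view of a document that a system sees for a sub-task."""
    if subtask == 1:
        # Sub-task 1: text plus human-annotated entities, no relations.
        return Document(doc.doc_id, doc.text, entities=list(doc.entities))
    if subtask == 2:
        # Sub-task 2: text only; entities must be found by the system.
        return Document(doc.doc_id, doc.text)
    raise ValueError("subtask must be 1 or 2")


# Made-up example document (not taken from the corpus).
example = Document(
    doc_id="doc-0",
    text="GeneA is associated with DiseaseB.",
    entities=[Entity("G1", "gene", 0, 5), Entity("D1", "disease", 25, 33)],
    relations=[Relation("G1", "D1", "association", novel=True)],
)
```

In both sub-tasks the expected output is the list of relations; only the amount of annotation given as input differs.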

Training data

The BioRED corpus is a manually annotated corpus of 600 PubMed articles, in which domain experts have exhaustively labeled mentions of biomedical concepts and all binary relationships between them that correspond to a specific set of biologically and conceptually relevant relation types (the BioRED relation classes). The dataset is divided into a training set (500 articles) and a validation set (100 articles).

Biomedical concepts:

chemical
gene/protein
disease
variant
species
cell line

Biomedical pairs:

gene/gene
disease/gene
disease/chemical
gene/chemical
disease/variant
chemical/chemical
chemical/variant

Biomedical relation types:

positive correlation
negative correlation
association
binding
co-treatment
drug interaction
comparison
conversion
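
The three lists above can be read as a small schema: a relation is only annotated between two concepts whose types form one of the listed pairs. The sketch below encodes the lists as Python sets and enumerates candidate pairs for a document. The identifier spellings (for example "cell_line"), the function names, and the assumption that pairs are unordered are our own illustration choices, not part of the corpus definition.

```python
from itertools import combinations

# Concept types from the list above (spellings are illustrative).
CONCEPT_TYPES = {"chemical", "gene", "disease", "variant", "species", "cell_line"}

# Concept-type pairs from the list above, stored as unordered sets.
# A one-element frozenset represents a same-type pair such as gene/gene.
PAIR_TYPES = {
    frozenset({"gene"}),
    frozenset({"disease", "gene"}),
    frozenset({"disease", "chemical"}),
    frozenset({"gene", "chemical"}),
    frozenset({"disease", "variant"}),
    frozenset({"chemical"}),
    frozenset({"chemical", "variant"}),
}

# Relation types from the list above.
RELATION_TYPES = {
    "positive correlation",
    "negative correlation",
    "association",
    "binding",
    "co-treatment",
    "drug interaction",
    "comparison",
    "conversion",
}


def is_candidate_pair(type_a: str, type_b: str) -> bool:
    """True if the two concept types form one of the listed pairs."""
    return frozenset({type_a, type_b}) in PAIR_TYPES


def candidate_pairs(entities: dict) -> list:
    """Enumerate identifier pairs whose types may hold a relation.

    `entities` maps a (hypothetical) concept identifier to its concept type.
    """
    return [
        (a, b)
        for a, b in combinations(sorted(entities), 2)
        if is_candidate_pair(entities[a], entities[b])
    ]
```

For instance, `candidate_pairs({"E1": "gene", "E2": "disease", "E3": "species", "E4": "gene"})` keeps the gene/disease and gene/gene pairs and drops every pair involving the species concept, since no species pair is listed above.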

Test data

An additional set of 400 PubMed articles will be annotated and used as an independent test set.

Evaluation Metrics

Please see the CodaLab pages for Sub-task 1 and Sub-task 2.
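
The official metrics are those defined on the CodaLab pages. Purely as an illustration for local development, the sketch below assumes (this is not stated on this page) that each relation is scored as an exact-match tuple and that precision, recall, and F1 are computed over the sets of gold and predicted tuples. The function name and the tuple layout are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Set-based precision, recall, and F1 over relation tuples.

    Each relation is assumed to be a hashable tuple such as
    (doc_id, concept_a, concept_b, relation_type).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

A local score of this kind is only a rough guide; matching rules on the evaluation server (for example, how pair order or novelty labels are treated) may differ.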

Timeline:

Training data available: March 1, 2023
- Training set available at this link: BioRED corpus
CodaLab evaluation website available: May 18, 2023
- Sub-task 1
- Sub-task 2
Zoom meeting for participants: June 29, 2023, 9:30 AM ET
- Webinar info available at BioRED track materials
Submission instructions available: (please contact us for instructions)
- Each team will be allowed 5 runs
Test data (only title/abstract) available: September 15, 2023
User Survey available: September 15, 2023
Results Submission (sub-task 2, end-to-end system): September 22, 2023
Test data (title/abstract and entity annotation) available: September 25, 2023
Results Submission (sub-task 1, relation extraction): September 29, 2023
Participant Survey Due: September 30, 2023
Results returned to participants: October 5, 2023
Short technical systems description paper due: October 12, 2023
Invites for oral presentation: October 20, 2023
Workshop: November 12, 2023

Task Organizers:

Rezarta Islamaj, Rezarta.Islamaj AT nih DOT gov
Po-Ting Lai, Po-Ting.Lai AT nih DOT gov
Chih-Hsuan Wei, Chih-Hsuan.Wei AT nih DOT gov
Ling Luo, lingluo AT dlut DOT edu DOT cn
Zhiyong Lu, Zhiyong.Lu AT nih DOT gov

References:

[1] Ling Luo, Po-Ting Lai, Chih-Hsuan Wei, Cecilia N Arighi, Zhiyong Lu. BioRED: a rich biomedical relation extraction dataset. Briefings in Bioinformatics, Volume 23, Issue 5, September 2022, bbac282. https://doi.org/10.1093/bib/bbac282