Motivation
Biomedical relation extraction is the task of automatically identifying and characterizing relations between biomedical concepts from free text. As a central task in biomedical natural language processing (NLP) research, it plays a critical role in many downstream applications, such as drug discovery and personalized medicine.
While there is a significant body of research on automatic relation extraction, most existing benchmarking datasets for biomedical relation extraction only focus on relations of a single type (e.g., protein–protein interactions) at the sentence level. In response, a new biomedical relation extraction dataset – BioRED [1] – with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene–disease; chemical–chemical) at the document level was recently made freely available. Despite multiple attempts, the best performance on the BioRED dataset remains modest, leaving substantial room for improvement.
Track goals
This track will provide the participants with access to BioRED, an open repository of published scientific research articles, annotated for biomedical concepts and the corresponding relationships between those concepts in the titles and abstracts. The participants will use this training data to design and develop their NLP systems to extract asserted relationships from free text. Other open biomedical data sources may be used to complement this training data, if desired. Beyond recognizing asserted relationships, participating systems are also encouraged to identify relations that represent novel findings (the key points of a manuscript), as opposed to background or other existing knowledge that can be found elsewhere.
Task Definition
The track will be organized in two sub-tasks, each aiming to spur innovation in the field. In Sub-task 1, participants will be provided with human-annotated concepts and will focus on building methods for relation extraction and classification. In Sub-task 2, teams are asked to develop an end-to-end system that identifies and classifies asserted relationships in free text.
Sub-task 1: Given the abstract and human-annotated entities, the goal is to identify all relationships between the entities and classify each relationship into one of the specified types.
Sub-task 2: Given only the abstract, the goal is to develop an end-to-end system that identifies all asserted relationships and classifies them into the specified types.
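Relation extraction tracks of this kind are commonly scored by exact match of (entity, entity, relation-type) triples. The sketch below illustrates micro-averaged precision, recall, and F1 under that assumption, treating each entity pair as unordered; it is only an illustration — the official evaluation metrics are defined on the CodaLab pages.

```python
def score_relations(gold, predicted):
    """Micro-averaged precision/recall/F1 over relation triples.

    Each triple is (entity_id_1, entity_id_2, relation_type); the entity
    pair is treated as unordered, so (a, b, t) matches (b, a, t).
    """
    def normalize(triples):
        # frozenset makes the entity pair order-insensitive
        return {(frozenset((e1, e2)), rel) for e1, e2, rel in triples}

    gold_set, pred_set = normalize(gold), normalize(predicted)
    tp = len(gold_set & pred_set)  # true positives: exact triple matches
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

For example, a prediction that recovers one of two gold triples while adding one spurious triple yields precision, recall, and F1 of 0.5 each.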
Training data
The BioRED corpus is a manually annotated corpus of 600 PubMed articles, in which domain experts have exhaustively labeled mentions of biomedical concepts and all binary relationships between them corresponding to a specific set of biologically and conceptually relevant relation types (BioRED relation classes). The dataset is divided into training (500 articles) and validation (100 articles) sets.
Biomedical concepts:
- chemical
- gene/protein
- disease
- variant
- species
- cell line
Biomedical pairs:
- gene/gene
- disease/gene
- disease/chemical
- gene/chemical
- disease/variant
- chemical/chemical
- chemical/variant
Biomedical relation types:
- positive correlation
- negative correlation
- association
- binding
- co-treatment
- drug interaction
- comparison
- conversion
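As a rough illustration of how the annotations above fit together, the sketch below models a document's entities and relations and checks that a relation links one of the listed entity-type pairs. The field names, simplified type labels, and identifier strings are assumptions for illustration only, not the official BioRED release format.

```python
from dataclasses import dataclass

# Entity-type pairs for which relations are annotated (order-insensitive).
VALID_PAIRS = {
    frozenset(p) for p in [
        ("gene", "gene"), ("disease", "gene"), ("disease", "chemical"),
        ("gene", "chemical"), ("disease", "variant"),
        ("chemical", "chemical"), ("chemical", "variant"),
    ]
}

@dataclass
class Entity:
    id: str    # e.g. a database identifier (illustrative, not the release format)
    type: str  # chemical, gene, disease, variant, species, cell_line

@dataclass
class Relation:
    entity1: Entity
    entity2: Entity
    rel_type: str  # e.g. "Association", "Positive_Correlation"
    novel: bool    # True if asserted as a novel finding of the article

    def has_valid_pair(self) -> bool:
        # Relation pairs are unordered, so disease/gene == gene/disease.
        return frozenset((self.entity1.type, self.entity2.type)) in VALID_PAIRS
```

Note that species and cell-line entities are annotated as concepts but do not participate in any of the listed relation pairs, which the validity check above reflects.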
Test data
An additional set of 400 PubMed articles will be annotated and used as an independent test set.
Evaluation Metrics
Please check the CodaLab pages for Sub-task 1 and Sub-task 2.
Timeline:
- Training data available: March 1, 2023
- Training set available at this link: BioRED corpus
- CodaLab evaluation website available: May 18, 2023
- Zoom meeting for participants: June 29, 2023, 9:30 AM ET
- Webinar info available at BioRED track materials
- Submission Instructions available: (please contact us for instructions)
- Each team will be allowed 5 runs
- Test data (only title/abstract) available: September 15, 2023
- User Survey available: September 15, 2023
- Results Submission (sub-task 2, end-to-end system): September 22, 2023
- Test data (title/abstract and entity annotation) available: September 25, 2023
- Results Submission (sub-task 1, relation extraction): September 29, 2023
- Participant Survey Due: September 30, 2023
- Results return to participants: October 5, 2023
- Short technical systems description paper due: October 12, 2023
- Invites for oral presentation: October 20, 2023
- Workshop: November 12, 2023
Task Organizers:
- Rezarta Islamaj, Rezarta.Islamaj AT nih DOT gov
- Po-Ting Lai, Po-Ting.Lai AT nih DOT gov
- Chih-Hsuan Wei, Chih-Hsuan.Wei AT nih DOT gov
- Ling Luo, lingluo AT dlut DOT edu DOT cn
- Zhiyong Lu, Zhiyong.Lu AT nih DOT gov