Motivation
Biomedical relation extraction is the task of automatically identifying and characterizing relations between biomedical concepts from free text. As a central task in biomedical natural language processing (NLP) research, it plays a critical role in many downstream applications, such as drug discovery and personalized medicine.
While there is a significant body of research on automatic relation extraction, most existing benchmark datasets for biomedical relation extraction focus only on relations of a single type (e.g., protein–protein interactions) at the sentence level. In response, a new biomedical relation extraction dataset – BioRED [1] – with multiple entity types (e.g., gene/protein, disease, chemical) and relation pairs (e.g., gene–disease; chemical–chemical) at the document level was recently made freely available. Despite multiple attempts, the best performance on the BioRED dataset remains modest, leaving considerable room for improvement.
Track goals
This track will provide participants with access to BioRED, an open repository of published scientific research articles whose titles and abstracts are annotated for biomedical concepts and the relationships between those concepts. Participants will use this training data to design and develop NLP systems that extract asserted relationships from free text. Other open biomedical data sources may be used to complement this training data, if desired. Beyond recognizing asserted relationships, participating systems are also encouraged to identify which relations are novel findings (the key points of a manuscript), as opposed to background or other existing knowledge that can be found elsewhere.
Task Definition
The track will be organized in two sub-tasks, each aiming to spur innovation in the field. In Sub-task 1, participants will be provided with human-annotated concepts and will focus on building methods for relation extraction and classification. In Sub-task 2, teams are asked to develop an end-to-end system that identifies and classifies asserted relationships in free text.
Sub-task 1: Given the abstract and human-annotated entities, the goal is to identify all the relationships between those entities and classify them into specific types.
Sub-task 2: Given the abstract, the goal is to develop an end-to-end system to identify all the asserted relationships and classify them into specific types.
Training data
The BioRED corpus is a manually annotated corpus consisting of 600 PubMed articles, in which domain experts have exhaustively labeled mentions of biomedical concepts and all binary relationships between them that correspond to a specific set of biologically and conceptually relevant relation types (BioRED relation classes). This dataset is divided into training (500 articles) and validation (100 articles) sets.
Biomedical concepts:
- chemical
- gene/protein
- disease
- variant
- species
- cell line
Biomedical pairs:
- gene/gene
- disease/gene
- disease/chemical
- gene/chemical
- disease/variant
- chemical/chemical
- chemical/variant
Biomedical relation types:
- positive correlation
- negative correlation
- association
- binding
- co-treatment
- drug interaction
- comparison
- conversion
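The three lists above can be read as a small schema. The following is a minimal sketch, assuming lowercase labels copied from this page rather than the label strings used in the BioRED files, and assuming entities arrive as hypothetical (id, type) tuples. It shows how a Sub-task 1 system might enumerate candidate entity pairs from annotated entities and check that a predicted relation uses a listed pair and a listed relation type. It does not restrict which relation types apply to which pairs, since this page does not state that.

```python
from itertools import combinations

# Relation pairs as listed on this page, stored as unordered type sets.
# Labels are illustrative; the corpus files may use different strings.
ALLOWED_PAIRS = {
    frozenset({"gene"}),                 # gene/gene
    frozenset({"disease", "gene"}),      # disease/gene
    frozenset({"disease", "chemical"}),  # disease/chemical
    frozenset({"gene", "chemical"}),     # gene/chemical
    frozenset({"disease", "variant"}),   # disease/variant
    frozenset({"chemical"}),             # chemical/chemical
    frozenset({"chemical", "variant"}),  # chemical/variant
}

# Relation types as listed on this page.
RELATION_TYPES = {
    "positive correlation",
    "negative correlation",
    "association",
    "binding",
    "co-treatment",
    "drug interaction",
    "comparison",
    "conversion",
}


def candidate_pairs(entities):
    """Return ID pairs whose entity types form one of the listed pairs.

    `entities` is a list of (entity_id, entity_type) tuples; this input
    shape is an assumption made for illustration.
    """
    pairs = []
    for (id_a, type_a), (id_b, type_b) in combinations(entities, 2):
        if frozenset({type_a, type_b}) in ALLOWED_PAIRS:
            pairs.append((id_a, id_b))
    return pairs


def is_valid_relation(type_a, type_b, relation):
    """Check that the type pair and the relation label are both listed."""
    return (
        frozenset({type_a, type_b}) in ALLOWED_PAIRS
        and relation in RELATION_TYPES
    )


if __name__ == "__main__":
    # Hypothetical annotated entities for one abstract.
    doc_entities = [
        ("E1", "gene"),
        ("E2", "disease"),
        ("E3", "species"),
        ("E4", "gene"),
    ]
    print(candidate_pairs(doc_entities))
    print(is_valid_relation("gene", "disease", "association"))
```

Storing pairs as unordered sets means disease/gene and gene/disease are treated as the same pair, which matches how the list above is written. A classifier would then assign one of the relation types (or no relation) to each candidate pair.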
Test data
An additional set of 400 PubMed articles will be annotated and used as an independent test set.
Evaluation Metrics
Please check the CodaLab pages for Sub-task 1 and Sub-task 2.
Timeline:
- Training data available: March 1, 2023
- CodaLab evaluation website available: May 18, 2023
- Zoom meeting for participants: End of June TBA (please contact)
- Test data available: September 1, 2023
- Participant survey due: September 10
- Results returned to participants: September 10
- Short technical systems description paper due: September 30
- Invites for presentation at the workshop: October 10, TBA
- Workshop: TBA, November 11-15
Task Organizers:
- Rezarta Islamaj, Rezarta.Islamaj AT nih DOT gov
- Po-Ting Lai, Po-Ting.Lai AT nih DOT gov
- Chih-Hsuan Wei, Chih-Hsuan.Wei AT nih DOT gov
- Ling Luo, lingluo AT dlut DOT edu DOT cn
- Zhiyong Lu, Zhiyong.Lu AT nih DOT gov