Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative VII

Track 2 - NLM-CHEM track Full-text Chemical Identification and Indexing in PubMed articles [2020-01-22]

Note for BioCreative participants: To register for a track, please use the Google form.
Do not use the "Team page" tab, as it is non-functional.

NLM-CHEM track - Full-text Chemical Identification and Indexing in PubMed articles

Identifying named entities is an important building block for many complex knowledge extraction tasks. Errors in identifying relevant biomedical entities are a key impediment to accurate article retrieval, classification, and further understanding of textual semantics, such as relation extraction. Chemical entities appear throughout the biomedical research literature and are one of the entity types most frequently searched in PubMed [1]. Accurate automated identification of the chemicals mentioned in journal publications has the potential to improve many downstream NLP tasks and biomedical fields. In the near term, it should improve the retrieval of relevant articles in particular, greatly assisting researchers, indexers, and curators.

Previous work in biomedical named entity recognition (NER) and normalization (i.e. entity linking) for chemicals includes several community challenges (e.g. the CHEMDNER and BC5CDR tasks at previous BioCreative workshops [2,3]). However, indexing and curation tasks require processing full-text articles, where the demands on information retrieval and extraction are different. For example, the full text frequently contains more detailed information, such as chemical compound properties, their biological effects, and their interactions with diseases, genes, and other chemicals.

The NLM-CHEM track will consist of two tasks. Participants can choose to participate in either one or both. These tasks are:

  • Chemical Identification in full text: predicting all chemicals mentioned in recently published full-text articles, both span (i.e. named entity recognition) and normalization (i.e. entity linking) using MeSH.
  • Chemical Indexing prediction task: predicting which chemicals mentioned in recently published full-text articles should be indexed, i.e. appear in the listing of MeSH terms for the document.

Training dataset:

The NLM-CHEM corpus [4] consists of 150 full-text articles with chemical entity annotations from human experts for ~5000 unique chemical names, mapped to ~2000 MeSH identifiers. This dataset is compatible with the CHEMDNER and BC5CDR corpora used in previous BioCreative challenges.

The chemical indexing terms for each of the articles will also be provided. This information is publicly available via PubMed and can be found in the “Substances” field.
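Since the training set is released in BioC XML and JSON, participants will typically start by extracting the annotated mentions from each document. The following is a minimal sketch of that step for the JSON release, not the organizers' tooling. It assumes the general BioC JSON layout (documents, passages, annotations, locations); the infon keys `type` and `identifier`, the sample document, and the helper name `iter_chemical_mentions` are illustrative assumptions and should be checked against the distributed files.

```python
import json

# Hypothetical miniature collection in the general BioC JSON layout.
# The infon keys "type" and "identifier" are assumptions for illustration;
# verify them against the released corpus files.
SAMPLE = """
{
  "documents": [
    {
      "id": "example-1",
      "passages": [
        {
          "offset": 0,
          "text": "Aspirin inhibits cyclooxygenase.",
          "annotations": [
            {
              "id": "1",
              "infons": {"type": "Chemical", "identifier": "MESH:D001241"},
              "text": "Aspirin",
              "locations": [{"offset": 0, "length": 7}]
            }
          ]
        }
      ]
    }
  ]
}
"""


def iter_chemical_mentions(collection):
    """Yield (doc_id, start, end, text, identifier) per annotated location."""
    for document in collection.get("documents", []):
        for passage in document.get("passages", []):
            for annotation in passage.get("annotations", []):
                infons = annotation.get("infons", {})
                if infons.get("type") != "Chemical":
                    continue  # skip any non-chemical annotation types
                for location in annotation.get("locations", []):
                    start = location["offset"]
                    end = start + location["length"]
                    yield (
                        document["id"],
                        start,
                        end,
                        annotation["text"],
                        infons.get("identifier"),
                    )


mentions = list(iter_chemical_mentions(json.loads(SAMPLE)))
print(mentions)
```

To read a real file, replace `json.loads(SAMPLE)` with `json.load(open(path))` for whichever path the corpus is downloaded to.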

Evaluation dataset:

The full text of a set of PubMed articles scheduled for human indexing in 2021 will be distributed as the test set. The NLM expert indexers who annotated the training set will fully annotate a subset of these articles for all occurrences of chemical mentions, as the gold standard for the Chemical Identification task. The human expert indexing of all articles will be the gold standard for the Chemical Indexing task.

Participants in both tasks will return predictions for the entire set. Submissions to the Chemical Identification task will be evaluated against only the manually annotated subset, using metrics for both NER and normalization. Submissions to the Chemical Indexing task will be evaluated against the human expert indexing for substances for the entire set, using metrics for both the individual MeSH terms and the MeSH hierarchy. Evaluation scripts will be provided.
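As a rough illustration of the kind of scoring described above, the sketch below computes set-based precision, recall, and F1. This is not the official evaluation script, which the organizers will provide. The strict exact-span matching for NER, the tuple layouts, and the function name `precision_recall_f1` are assumptions, all document and MeSH identifiers are hypothetical, and the MeSH-hierarchy metrics are not covered.

```python
def precision_recall_f1(gold, predicted):
    """Set-based precision, recall, and F1 over hashable items."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Chemical Identification (NER), assuming strict matching:
# a prediction counts only if document, start, and end all agree.
gold_spans = {("doc1", 0, 7), ("doc1", 20, 27), ("doc2", 5, 12)}
pred_spans = {("doc1", 0, 7), ("doc1", 20, 25), ("doc2", 5, 12), ("doc2", 30, 36)}
ner_scores = precision_recall_f1(gold_spans, pred_spans)

# Chemical Indexing: compare (document, MeSH identifier) pairs.
# The identifiers below are placeholders, not real indexing data.
gold_index = {("doc1", "MESH:D000001"), ("doc2", "MESH:D000002")}
pred_index = {("doc1", "MESH:D000001"), ("doc2", "MESH:D000003")}
index_scores = precision_recall_f1(gold_index, pred_index)

print("NER      P=%.3f R=%.3f F1=%.3f" % ner_scores)
print("Indexing P=%.3f R=%.3f F1=%.3f" % index_scores)
```

Normalization could be scored the same way by comparing the set of (document, MeSH identifier) pairs derived from the mentions, again under the assumption that the official scripts use a comparable set-based definition.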

Data will be posted at BC7 NLM-Chem-track data and materials

Timeline:

  • April 20: Release of training set (in BioC XML and JSON)
  • May 7: Release of evaluation script
  • May 7: Supplementary training data release
  • May 20: Release of "Getting started with deep learning" scripts and tutorial
  • May 27: Zoom webinar "Getting started with the NLM-Chem BioCreative VII Track"
    • When: May 27, 2021, 10:00 AM Eastern Time (US and Canada)
      Register in advance for this meeting. After registering, you will receive a confirmation email containing information about joining the meeting.
  • July 20: Test set release, and results submission instructions
  • August 30: Test set prediction submission due
  • September 20: Test set evaluation returned to participants
  • October 10: Short technical systems description paper due

Task organizers:

  • Rezarta Islamaj, National Library of Medicine
  • Robert Leaman, National Library of Medicine
  • Zhiyong Lu, National Library of Medicine

References

  1. Islamaj R, Murray GC, et al. Understanding PubMed user search behavior through log analysis. Database (Oxford). 2009;2009:bap018. doi: 10.1093/database/bap018. PMID: 20157491; PMCID: PMC2797455.
  2. Krallinger M, Rabal O, et al. The CHEMDNER corpus of chemicals and drugs and its annotation principles. J Cheminform. 2015. doi: 10.1186/1758-2946-7-S1-S2. PMID: 25810773; PMCID: PMC4331692.
  3. Li J, Sun Y, Johnson RJ, et al. BioCreative V CDR task corpus: a resource for chemical disease relation extraction. Database (Oxford). 2016;2016:baw068. doi: 10.1093/database/baw068. PMID: 27161011; PMCID: PMC4860626.
  4. Islamaj R, Leaman R, et al. NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature. Sci Data. 2021. doi: 10.1038/s41597-021-00875-1. PMID: 33767203.

Contact:

  • Rezarta Islamaj, Rezarta.Islamaj AT nih DOT gov
  • Robert Leaman, Robert.Leaman AT nih DOT gov