Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative VII

Track 5 - LitCovid track Multi-label topic classification for COVID-19 literature annotation [2021-03-30]

Note for BioCreative participants: To register for a track, please use the Google form.
Do not use the "Team page" tab, as it is non-functional.

LitCovid track Multi-label topic classification for COVID-19 literature annotation

The rapid growth of biomedical literature poses a significant challenge for manual curation and interpretation. This challenge has become more evident during the COVID-19 pandemic: the number of COVID-19-related articles in the literature is growing by about 10,000 articles per month. LitCovid, a literature database of COVID-19-related papers in PubMed, has accumulated more than 100,000 articles, with millions of accesses each month by users worldwide. LitCovid is updated daily, and this rapid growth significantly increases the burden of manual curation. In particular, annotating each article with up to eight possible topics, e.g., Treatment and Diagnosis, has been a bottleneck in the LitCovid curation pipeline.

This track calls for a community effort to tackle automated topic annotation for COVID-19 literature. Topic annotation in LitCovid is a standard multi-label classification task that assigns one or more labels to each article. These topics have been demonstrated to be effective for information retrieval and have been used in many downstream applications related to LitCovid. However, annotating these topics has been a primary bottleneck for manual curation. Increasing the accuracy of automated topic prediction in COVID-19-related literature would be a timely improvement beneficial to curators and researchers worldwide.
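As an illustration of the multi-label setting described above, the sketch below encodes an article's topic set as a binary indicator vector, the representation most multi-label classifiers consume. It is a minimal sketch, not the organizers' data format: the `TOPICS` list is a hypothetical three-label subset (only Treatment and Diagnosis are named in this page; "Prevention" is assumed for illustration), and the function names are invented here.

```python
from typing import List, Set

# Hypothetical subset of the topic vocabulary. Treatment and Diagnosis are
# named in the track description; "Prevention" is an assumption.
TOPICS: List[str] = ["Treatment", "Diagnosis", "Prevention"]


def to_indicator(topics: Set[str], vocabulary: List[str]) -> List[int]:
    """Encode a set of topic labels as a 0/1 vector aligned with vocabulary."""
    unknown = topics - set(vocabulary)
    if unknown:
        raise ValueError(f"labels not in vocabulary: {sorted(unknown)}")
    return [1 if topic in topics else 0 for topic in vocabulary]


def from_indicator(vector: List[int], vocabulary: List[str]) -> Set[str]:
    """Decode a 0/1 vector back into the set of topic labels it marks."""
    return {topic for topic, flag in zip(vocabulary, vector) if flag}


# One article may carry several topics at once.
article_topics = {"Treatment", "Diagnosis"}
vector = to_indicator(article_topics, TOPICS)
decoded = from_indicator(vector, TOPICS)
```

Each position in the vector is an independent yes/no decision, which is what distinguishes this task from single-label (multi-class) classification.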

Training and development datasets

The training and development datasets contain the publicly available text of over 30,000 COVID-19-related articles and their metadata (e.g., title, abstract, journal). Articles in both datasets have been manually reviewed and annotated by in-house models.

Evaluation dataset

As with the training and development datasets, the evaluation dataset contains articles that have been manually reviewed. Participants will return predictions for the entire set. Submissions will be evaluated using both label-based and instance-based metrics that are commonly applied to multi-label classification. Evaluation scripts will be provided.
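To make the distinction between the two metric families concrete, the sketch below computes label-based F1 (macro- and micro-averaged over labels) and instance-based F1 (averaged over articles) on a toy example. It is not the official evaluation script, whose exact metrics and conventions may differ; the function names, the toy data, and the "Prevention" label are assumptions made for illustration.

```python
from typing import Dict, List, Set


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; returns 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def label_based_f1(gold: List[Set[str]], pred: List[Set[str]],
                   labels: List[str]) -> Dict[str, float]:
    """Score each label across all articles, then macro- and micro-average."""
    per_label = []
    total_tp = total_fp = total_fn = 0
    for label in labels:
        tp = sum(1 for g, p in zip(gold, pred) if label in g and label in p)
        fp = sum(1 for g, p in zip(gold, pred) if label not in g and label in p)
        fn = sum(1 for g, p in zip(gold, pred) if label in g and label not in p)
        per_label.append(f1(tp, fp, fn))
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    return {
        "macro": sum(per_label) / len(per_label),
        "micro": f1(total_tp, total_fp, total_fn),
    }


def instance_based_f1(gold: List[Set[str]], pred: List[Set[str]]) -> float:
    """Score each article's predicted label set, then average over articles."""
    scores = [f1(len(g & p), len(p - g), len(g - p)) for g, p in zip(gold, pred)]
    return sum(scores) / len(scores)


# Toy data (hypothetical): three articles with gold and predicted topic sets.
labels = ["Treatment", "Diagnosis", "Prevention"]
gold = [{"Treatment"}, {"Treatment", "Diagnosis"}, {"Prevention"}]
pred = [{"Treatment"}, {"Diagnosis"}, {"Treatment", "Prevention"}]

label_scores = label_based_f1(gold, pred, labels)
instance_score = instance_based_f1(gold, pred)
```

On this toy data the macro F1 is 5/6, the micro F1 is 0.75, and the instance-based F1 is 7/9. Label-based averages show how well each topic is predicted across the collection, while the instance-based score reflects how complete and precise the label set is for a typical article.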

Important dates:

  • Training and development set release: 15th June. The datasets can be accessed here.
  • Evaluation script: early July
  • Test set release: late August
  • Test set prediction submission instructions: late August
  • Test set prediction submission due: early September
  • Test set evaluation returned to participants: early September
  • Short technical systems description paper due: mid September
  • Paper acceptance and review returned: late September

Task organizers:

  • Qingyu Chen, National Library of Medicine
  • Alexis Allot, National Library of Medicine
  • Rezarta Islamaj, National Library of Medicine
  • Robert Leaman, National Library of Medicine
  • Zhiyong Lu, National Library of Medicine


Please contact the organizers with the subject heading "BioCreative Track 5 LitCovid questions" if you have any questions.

Status updates and FAQs

  1. The training and development datasets have been released. Please access the datasets here.