
BioCreative VI

Track 1: Interactive Bio-ID Assignment (IAT-ID) [2017-02-06]

Interactive Bio-ID Assignment Track (Bio-ID)

The training set was released on June 29, 2017, and an updated version on August 1, 2017. The BioID scorer for the evaluation is available as of August 2, 2017. Files are available here

Innovations in biomedical digital curation have emerged as a critical topic for the sustainability of biological databases and research resources. Digital curation is defined as “the active management and preservation of digital resources over the lifecycle of scholarly and scientific interest, and over time for current and future generations of users” (1). In particular, there is growing recognition that data curation needs to be integrated throughout the research lifecycle, rather than deferred to biocurators until after publication, as is the current practice for curated databases. While capturing researchers’ knowledge at the time of data generation and publishing may improve efficiency, there are significant barriers to moving curation “upstream.” It is well recognized that the adoption of common database identifiers (IDs), controlled vocabularies (CVs) and ontologies facilitates data integration and re-use; however, it is nontrivial to extract IDs, CVs and ontological terms from the free text of the scientific literature. New methods and tools are needed to support more effective and consistent curation at the time of paper submission.

The Bio-ID track aims to address these needs for Innovations in Biomedical Digital Curation (2). Publications are one of the main vehicles for disseminating experimental results: researchers develop new ideas, conduct experiments, write up their results, submit them to a journal and, if accepted after peer review, the articles are disseminated in public literature databases. Publications are also the primary source of data for knowledgebase curators, who extract and summarize the relevant data in standard formats. While researchers use both the literature and knowledge bases, the latter offer efficient platforms for querying, given the linkage of data in the literature to database objects. New ideas and hypotheses are then generated, starting the cycle anew. Currently, there is a bottleneck in data re-use, as curators spend considerable time identifying bioentities in publications and linking these entities into their databases. We hypothesize that curation would be facilitated if articles were preprocessed to link the key bioentities to the appropriate biological knowledge bases, prior to publication (benefiting publishers) and prior to curation (speeding the downstream curation process); we refer to this as bio-ID assignment.

The Bio-ID track in BioCreative VI (BC VI) will explore assignment of bio-IDs both at the pre- and post-publication stages, with the aim of facilitating downstream article curation. To do this we are bringing together the various stakeholders to discuss functional requirements and develop interoperable digital curation tools. Building on previous BioCreative experiments, including the interactive tracks and the BioC and gene/protein/chemical named entity recognition tracks, the task is designed to foster the development of an integrated and interoperable workflow of multiple text mining tools for real-world testing in pilot publishing frameworks.

We propose two parts for this task:
1-Bioentity normalization task (for text mining teams)

    The bioentity normalization task is similar to the normalization tasks in previous BioCreative challenges, in that the goal is to link bioentities mentioned in the literature to standard database identifiers. However, in this year’s challenge, we plan to collaborate with the EMBO SourceData project (sourcedata.embo.org), which makes it unique in several aspects:
  • Figure captions from full-length articles are provided.
  • Multiple bioentity types are annotated (gene/gene products, small chemicals, cell types, subcellular locations, tissues, organisms).
  • Teams can participate by annotating all or a subset of the bioentity types.
    Input/Output: The input to the text mining systems will be the PMCID, PMID, paper title, figure/panel number and associated text in BioC format; the expected output is the same fields along with bioentity and identifier markup (with offsets) in BioC format. Examples are available in the Downloads section at the bottom of this page.
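As a rough illustration of the expected shape of the output, the following sketch builds a minimal BioC-style annotated passage with Python's standard library. The element names follow the public BioC DTD, but the infon key, entity text, offsets and document ID below are hypothetical; the authoritative schema and identifier conventions are defined by the example files in the Downloads section.

```python
# Sketch of a BioC-style document with one annotated passage.
# All concrete values (IDs, offsets, entity type) are illustrative only.
import xml.etree.ElementTree as ET

doc = ET.Element("document")
ET.SubElement(doc, "id").text = "PMC1234567"  # hypothetical PMCID

passage = ET.SubElement(doc, "passage")
ET.SubElement(passage, "offset").text = "0"
ET.SubElement(passage, "text").text = "p53 localizes to the nucleus."

# One annotation marking "p53" as a gene mention at character offset 0.
ann = ET.SubElement(passage, "annotation", id="1")
type_infon = ET.SubElement(ann, "infon", key="type")
type_infon.text = "gene"
ET.SubElement(ann, "location", offset="0", length="3")
ET.SubElement(ann, "text").text = "p53"

print(ET.tostring(doc, encoding="unicode"))
```

Reading the released training files back with `ET.parse` and iterating over `passage`/`annotation` elements follows the same element layout.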
    Bioentities:
    Bioentity type           Identifier type
    gene/gene products       Entrez/UniProtKB
    small chemicals          ChEBI (primary)
    subcellular structures   GO CC
    cell lines               Cellosaurus (primary)
    cell types               Cell Ontology
    tissues and organs       Uberon
    organism                 NCBI Taxon
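For teams annotating only a subset of entity types, the pairing above can be kept as a small lookup table. This is a convenience sketch, not part of the official task materials; the keys and values are simply the rows of the table above.

```python
# Bioentity type -> identifier resource, per the track's table.
ENTITY_ID_RESOURCES = {
    "gene/gene products": "Entrez/UniProtKB",
    "small chemicals": "ChEBI (primary)",
    "subcellular structures": "GO CC",
    "cell lines": "Cellosaurus (primary)",
    "cell types": "Cell Ontology",
    "tissues and organs": "Uberon",
    "organism": "NCBI Taxon",
}

def id_resource(entity_type: str) -> str:
    """Return the identifier resource expected for a bioentity type."""
    return ENTITY_ID_RESOURCES[entity_type.lower()]
```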
    Data sets:
    The training set consists of a collection of SourceData (3) annotated captions in BioC format. Each file contains all SourceData-annotated captions for a given article, for a total of 570 articles.
    This data set is available here.

    The annotation guidelines and a description of the files can be found under Downloads at the bottom of this page, as well as in the training set material.
    Evaluation:
    The test data set is now available here. We will calculate precision, recall and F-measure, and teams will be ranked accordingly. A scorer developed by MITRE will be used for the evaluation; information about the scorer and the scorer package can be found here.
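The official evaluation uses the MITRE scorer, but micro-averaged precision/recall/F-measure over exact annotation matches can serve as a local sanity check before submission. The sketch below scores sets of (document, offset, length, identifier) tuples; this tuple layout and the example identifiers are assumptions for illustration, not the scorer's actual matching rules.

```python
# Micro-averaged precision, recall and F1 over exact-match annotations.
def prf(gold: set, predicted: set):
    tp = len(gold & predicted)  # annotations found in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: one correct match, one miss, one false positive.
gold = {("PMC1", 0, 3, "Uniprot:P04637"), ("PMC1", 20, 7, "GO:0005634")}
pred = {("PMC1", 0, 3, "Uniprot:P04637"), ("PMC1", 10, 4, "CHEBI:15377")}
p, r, f = prf(gold, pred)  # -> (0.5, 0.5, 0.5)
```

Note that the MITRE scorer may handle boundary variants and other partial matches differently, so this exact-match check is a lower bound, not a substitute for the official evaluation.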
    Deadline: Teams should submit the annotated caption results in BioC format by August 23, 9 pm EST.

2-Output review in the SourceData framework task (for curators/publishers/authors)

    In this task, the EMBO SourceData curation framework will be used to present the tagged bioentities in manuscripts’ figure legends for validation by authors/curators. Before the workshop, we will check that the output from the text mining systems is interoperable with the SourceData framework. After the workshop, we will recruit curators and authors to review the results.

Timeline:

Task                                           Date
Training data release                          Released June 29
Training data updated release                  Released August 1
Scorer release                                 Released August 2
Test data release                              Released August 16
Submission of results by teams                 August 23
Evaluated results returned to participants     August 28
Paper submission deadline                      September 8
Reviews sent                                   By September 15

Paper submission

Submit a paper (max. 4 pages) describing your system and track results for your work to be included in the conference proceedings, to be considered for a talk at the workshop, and to be considered for publication in the Database virtual issue.

  • Deadline for paper submissions is September 8
  • Instructions and paper template
  • Submission link: https://easychair.org/conferences/?conf=bc6

Organizers

  • Cecilia Arighi, U Delaware, USA
  • Lynette Hirschman, MITRE, USA
  • Thomas Lemberger, EMBO
  • Robin Liechti, Swiss Institutes of Bioinformatics
  • Cathy Wu, U Delaware, USA

With significant contributions from:

  • Donald Comeau, NCBI, NIH, USA
  • Rezarta Islamaj-Dogan, NCBI, NIH, USA
  • Samuel Bayer, MITRE, USA
  • Martin Krallinger, Spain
  • Analia Lourenço, Spain

References

    1. Lee, C. and Tibbo, H. (2007) Digital Curation and Trusted Repositories: Steps Toward Success. Journal of Digital Information, 8(2).
    2. NIH Data Science: https://datascience.nih.gov/
    3. Liechti et al. (2016) Preprint, bioRxiv, doi: https://doi.org/10.1101/058529


    Downloads