
BioCreative VI

Track 1: Interactive Bio-ID Assignment (IAT-ID) [2017-02-06]

Interactive Bio-ID Assignment Track (Bio-ID)

The training set was released on June 29, 2017, and an updated version on August 1, 2017. The BioID scorer for the evaluation is available as of August 2, 2017. Files are available here

Innovations in biomedical digital curation have emerged as a critical topic for the sustainability of biological databases and research resources. Digital curation is defined as “the active management and preservation of digital resources over the lifecycle of scholarly and scientific interest, and over time for current and future generations of users” (1). In particular, there is growing recognition that data curation needs to be integrated throughout the research lifecycle, rather than deferred to biocurators until after publication, as is the current practice for curated databases. While capturing researchers’ knowledge at the time of data generation and publishing may improve efficiency, there are significant barriers to moving curation “upstream.” It is well recognized that the adoption of common database identifiers (IDs), controlled vocabularies (CVs) and ontologies facilitates data integration and re-use; however, it is nontrivial to extract IDs, CVs and ontological terms from the free text of the scientific literature. New methods and tools are needed to support more effective and consistent curation at the time of paper submission.

The Bio-ID track aims to address these needs for Innovations in Biomedical Digital Curation (2). Publications are one of the main vehicles for disseminating experimental results: researchers develop new ideas, conduct experiments, write up their results, submit them to a journal and, if accepted after peer review, the articles are disseminated in public literature databases. Publications are also the primary source of data for knowledgebase curators, who extract and summarize the relevant data in standard formats. While researchers use both the literature and knowledge bases, the latter offer efficient platforms for querying, given the linkage of data in the literature to database objects. New ideas and hypotheses are then generated, starting the cycle anew. Currently, there is a bottleneck in data re-use, as curators spend considerable time identifying bioentities in publications and linking these entities into their databases. We hypothesize that curation would be facilitated if articles were preprocessed to link the key bioentities to the appropriate biological knowledge bases, prior to publication (benefiting publishers) and prior to curation (speeding the downstream curation process); we refer to this as bio-ID assignment.

The Bio-ID track in BioCreative VI (BC VI) will explore assignment of bio-IDs both at the pre- and post-publication stages, with the aim of facilitating downstream article curation. To do this we are bringing together the various stakeholders to discuss functional requirements and develop interoperable digital curation tools. Building on previous BioCreative experiments, including the interactive tracks and the BioC and gene/protein/chemical named entity recognition tracks, the task is designed to foster the development of an integrated and interoperable workflow of multiple text mining tools for real-world testing in pilot publishing frameworks.

We propose two parts for this task:
1-Bioentity normalization task (for text mining teams)

    The bioentity normalization task is similar to the normalization tasks in previous BioCreative challenges, in that the goal is to link bioentities mentioned in the literature to standard database identifiers. However, in this year’s challenge, we plan to collaborate with the EMBO SourceData project (sourcedata.embo.org), which makes it unique in several aspects:
  • Figure captions from full-length articles are provided.
  • Multiple bioentity types are annotated (gene/gene products, small chemicals, cell types, subcellular locations, tissues, organisms).
  • Teams can participate by annotating all or a subset of the bioentity types.
    Input/Output: The input to the text mining systems will be the PMCID, PMID, paper title, figure/panel number and associated text in BioC format; the expected output is the same fields along with bioentity and identifier markup (with offsets) in BioC format. Examples are available in the Downloads section at the bottom of this page.
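As a rough illustration of the expected shape of the output, the following sketch builds a minimal BioC-style annotated passage with Python's standard library. The element names follow the public BioC DTD, but the infon key, entity text, offsets and document ID below are hypothetical; the authoritative schema and identifier conventions are defined by the example files in the Downloads section.

```python
# Sketch of a BioC-style document with one annotated passage.
# All concrete values (IDs, offsets, entity type) are illustrative only.
import xml.etree.ElementTree as ET

doc = ET.Element("document")
ET.SubElement(doc, "id").text = "PMC1234567"  # hypothetical PMCID

passage = ET.SubElement(doc, "passage")
ET.SubElement(passage, "offset").text = "0"
ET.SubElement(passage, "text").text = "p53 localizes to the nucleus."

# One annotation marking "p53" as a gene mention at character offset 0.
ann = ET.SubElement(passage, "annotation", id="1")
type_infon = ET.SubElement(ann, "infon", key="type")
type_infon.text = "gene"
ET.SubElement(ann, "location", offset="0", length="3")
ET.SubElement(ann, "text").text = "p53"

print(ET.tostring(doc, encoding="unicode"))
```

Reading the released training files back with `ET.parse` and iterating over `passage`/`annotation` elements follows the same element layout.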
    Bioentities:
    Bioentity type           Identifier type
    gene/gene products       Entrez/UniProtKB
    small chemicals          ChEBI (primary)
    subcellular structures   GO CC
    cell lines               Cellosaurus (primary)
    cell types               Cell Ontology
    tissues and organs       Uberon
    organism                 NCBI Taxon
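For teams annotating only a subset of entity types, the pairing above can be kept as a small lookup table. This is a convenience sketch, not part of the official task materials; the keys and values are simply the rows of the table above.

```python
# Bioentity type -> identifier resource, per the track's table.
ENTITY_ID_RESOURCES = {
    "gene/gene products": "Entrez/UniProtKB",
    "small chemicals": "ChEBI (primary)",
    "subcellular structures": "GO CC",
    "cell lines": "Cellosaurus (primary)",
    "cell types": "Cell Ontology",
    "tissues and organs": "Uberon",
    "organism": "NCBI Taxon",
}

def id_resource(entity_type: str) -> str:
    """Return the identifier resource expected for a bioentity type."""
    return ENTITY_ID_RESOURCES[entity_type.lower()]
```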
    Data sets:
    The training set consists of a collection of SourceData (3) annotated captions in BioC format. Each file contains all SourceData-annotated captions for a given article, for a total of 570 articles.
    This data set is available here.

    The annotation guidelines and a description of the files can be found under Downloads at the bottom of this page, as well as in the training set material.
    Evaluation:
    The test data set is now available here. We will calculate precision, recall and F-measure, and teams will be ranked accordingly. A scorer developed by MITRE will be used for the evaluation; information about the scorer and the scorer package can be found here.
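The official evaluation uses the MITRE scorer, but micro-averaged precision/recall/F-measure over exact annotation matches can serve as a local sanity check before submission. The sketch below scores sets of (document, offset, length, identifier) tuples; this tuple layout and the example identifiers are assumptions for illustration, not the scorer's actual matching rules.

```python
# Micro-averaged precision, recall and F1 over exact-match annotations.
def prf(gold: set, predicted: set):
    tp = len(gold & predicted)  # annotations found in both sets
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: one correct match, one miss, one false positive.
gold = {("PMC1", 0, 3, "Uniprot:P04637"), ("PMC1", 20, 7, "GO:0005634")}
pred = {("PMC1", 0, 3, "Uniprot:P04637"), ("PMC1", 10, 4, "CHEBI:15377")}
p, r, f = prf(gold, pred)  # -> (0.5, 0.5, 0.5)
```

Note that the MITRE scorer may handle boundary variants and other partial matches differently, so this exact-match check is a lower bound, not a substitute for the official evaluation.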
    Deadline: Teams should submit the annotated caption results in BioC format by August 23, 9 pm EST.

2-Output review in the SourceData framework task (for curators/publishers/authors)

    In this task, the EMBO SourceData curation framework will be used to present the tagged bioentities in manuscripts’ figure legends for validation by authors/curators. Before the workshop, we will check that the output from the text mining systems is interoperable with the SourceData framework. After the workshop, we will recruit curators and authors to review the results.

Timeline:

Task                                           Date
Training data release                          Released June 29
Training data updated release                  Released August 1
Scorer release                                 Released August 2
Test data release                              Released August 16
Submission of results by teams                 August 23
Evaluated results returned to participants     August 28
Paper submission deadline                      September 8
Reviews sent                                   By September 15

Paper submission

Submit a paper (max. 4 pages) describing your system and track results for your work to be included in the conference proceedings, to be considered for a talk at the workshop, and to be considered for publication in the Database virtual issue.

  • Deadline for paper submissions is September 8
  • Instructions and paper template
  • Submission link: https://easychair.org/conferences/?conf=bc6

Organizers

  • Cecilia Arighi, U Delaware, USA
  • Lynette Hirschman, MITRE, USA
  • Thomas Lemberger, EMBO
  • Robin Liechti, Swiss Institutes of Bioinformatics
  • Cathy Wu, U Delaware, USA

With significant contributions from:

  • Donald Comeau, NCBI, NIH, USA
  • Rezarta Islamaj-Dogan, NCBI, NIH, USA
  • Samuel Bayer, MITRE, USA
  • Martin Krallinger, Spain
  • Analia Lourenço, Spain

References

    1. Lee, C. and Tibbo, H. (2007) Digital Curation and Trusted Repositories: Steps Toward Success. Journal of Digital Information, 8(2).
    2. NIH Data Science: https://datascience.nih.gov/
    3. Liechti et al. (2016) Preprint, bioRxiv, doi: https://doi.org/10.1101/058529


    Downloads