BioCreative - Track 1- Collaborative Biocurator Assistant Task (BioC)

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative V

Track 1- Collaborative Biocurator Assistant Task (BioC) [2014-10-21]

Organizers

Sun Kim, NCBI
Andrew Chatr-aryamontri, BioGRID
Rezarta Islamaj Doğan, NCBI
Donald C. Comeau, NCBI
W. John Wilbur, NCBI

Overview

Protein-protein molecular interaction (MI) information is of great importance both in experimental biology and from the perspective of systems biology and bioinformatics [1]. Thus, many efforts have been made to capture this information in molecular interaction databases such as BioGRID [2], IntAct [3] and DIP [4]. Because of the rapid growth of the biomedical literature, finding biological evidence through text mining has been a main research topic in the bio-text mining community. However, there have been few successes in improving biocuration throughput using text mining [5].

The purpose of this task is to create BioC [6]-compatible modules which complement each other and integrate into a system that assists BioGRID curators. In previous BioCreative workshops great emphasis was given to the identification of protein-protein interactions (PPI). The PPI task [1, 7, 8] was divided into subtasks, each addressed independently: article classification, interaction pair extraction, interaction sentence classification and experimental method identification. The user interaction task (IAT) [9-11] promoted the development of annotation systems that can assist in biocuration tasks by bringing text mining tool developers and database curators together. However, no attempt has been made to integrate the text mining modules developed in the formal BioCreative PPI task into one annotation tool. This may be due to interoperability and data exchange problems, or because the performance of certain extraction modules is not yet good enough.

The main goals of the task are:

To define a collaborative task for MI information extraction, so each team can develop a module independently, but can also use other modules' outputs.
To develop practical MI tools by combining or improving existing methods for full-text articles.
To improve interoperability by developing BioC-supported MI extraction modules.
To implement an annotation assistant tool by closely working with biocurators in BioGRID.
To produce a full-text benchmark set while evaluating the new biocurator assistant tool.

One distinctive feature of the task is that there is no competition among participating teams. The organizers will promote a collaborative framework and help each team to collaborate with the others for building an integrated annotation system.
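The idea of independent modules that build on each other's output can be illustrated with a minimal sketch. It assumes the standard BioC XML layout (collection, document, passage, annotation with infon, location and text); the sample text, the string-matching "modules" and the names `find_terms` and `annotate` are hypothetical and stand in for real NER components.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-passage BioC collection used only for illustration.
SAMPLE = """<collection>
  <source>example</source>
  <date>20150101</date>
  <key>example.key</key>
  <document>
    <id>doc1</id>
    <passage>
      <offset>100</offset>
      <text>BRCA1 binds BARD1 in human cells.</text>
    </passage>
  </document>
</collection>"""


def find_terms(text, terms):
    """Return sorted (start, end) spans of every term occurrence (toy matcher)."""
    spans = []
    for term in terms:
        start = text.find(term)
        while start != -1:
            spans.append((start, start + len(term)))
            start = text.find(term, start + 1)
    return sorted(spans)


def annotate(collection, ann_type, terms, prefix):
    """Append one BioC <annotation> per match to every passage, in place."""
    count = 0
    for passage in collection.iter("passage"):
        base = int(passage.findtext("offset"))  # passage offset in the document
        text = passage.findtext("text") or ""
        for start, end in find_terms(text, terms):
            ann = ET.SubElement(passage, "annotation", id=f"{prefix}{count}")
            ET.SubElement(ann, "infon", key="type").text = ann_type
            # BioC locations are document-level offsets, so add the passage base.
            ET.SubElement(ann, "location",
                          offset=str(base + start), length=str(end - start))
            ET.SubElement(ann, "text").text = text[start:end]
            count += 1
    return collection


# Two "modules" run in sequence on the same collection.
collection = ET.fromstring(SAMPLE)
annotate(collection, "Gene", ["BRCA1", "BARD1"], "G")
annotate(collection, "Species", ["human"], "S")
output_xml = ET.tostring(collection, encoding="unicode")
```

Because every module reads and writes the same format, a downstream module (for example, gene normalization) can consume the gene and species annotations without any tool-specific conversion.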

Tasks

1. Gene/protein named entity recognition (NER)

This task is to identify gene/protein mentions. Participating team(s) combine results from existing tools to improve NER performance. This may include developing more training data and/or using an approach such as active learning.
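One simple way to combine results from existing tools is span-level voting. The sketch below is only an illustration of that idea, not a method prescribed by the task; the function name `vote_spans`, the threshold and the example spans are assumptions.

```python
from collections import Counter


def vote_spans(tool_outputs, min_votes=2):
    """Keep (start, end) spans proposed by at least min_votes tools."""
    votes = Counter()
    for spans in tool_outputs:
        votes.update(set(spans))  # each tool votes at most once per span
    return sorted(span for span, n in votes.items() if n >= min_votes)


# Hypothetical character spans predicted by three NER tools on one passage.
tool_a = [(0, 5), (12, 17)]
tool_b = [(0, 5), (12, 17), (21, 26)]
tool_c = [(0, 5), (30, 35)]

consensus = vote_spans([tool_a, tool_b, tool_c])
```

Exact-span voting favors precision; a team could relax it to overlapping spans or lower `min_votes` when recall matters more to curators.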

2. Species/organism NER

This task is to identify species/organism names and normalize them to NCBI Taxonomy IDs. Participating team(s) either combine results from existing techniques or propose a new way of identifying species/organisms.

3. Normalization of gene/protein names

This task is to determine Entrez Gene IDs based on gene/protein names and species/organisms mentioned in surrounding text. Previous BioCreative sets may be used for system development. However, the system will eventually use prediction results from Tasks 1 and 2.
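A minimal sketch of species-aware normalization follows. It assumes gene mentions (Task 1) and species mentions with Taxonomy IDs (Task 2) are already available, and resolves an ambiguous name using the nearest species mention. The tiny lexicon, the nearest-mention heuristic and the function names are illustrative assumptions; a real system would use the full Entrez Gene resource, and the IDs shown should be checked against it.

```python
# Toy lexicon: (lower-cased name, NCBI Taxonomy ID) -> Entrez Gene ID.
GENE_LEXICON = {
    ("tp53", 9606): 7157,      # human TP53
    ("p53", 9606): 7157,
    ("trp53", 10090): 22059,   # mouse Trp53
    ("p53", 10090): 22059,
}


def nearest_species(gene_offset, species_mentions):
    """Return the taxonomy ID of the species mention closest to the gene.

    species_mentions is a list of (offset, taxid) pairs; returns None if empty.
    """
    if not species_mentions:
        return None
    return min(species_mentions, key=lambda m: abs(m[0] - gene_offset))[1]


def normalize(gene_text, gene_offset, species_mentions):
    """Map a gene mention to an Entrez Gene ID, or None if it cannot be resolved."""
    taxid = nearest_species(gene_offset, species_mentions)
    if taxid is None:
        return None
    return GENE_LEXICON.get((gene_text.lower(), taxid))


# "p53" is ambiguous; the surrounding species mentions decide the ID.
species = [(10, 9606), (60, 10090)]  # human at offset 10, mouse at offset 60
near_human = normalize("p53", 15, species)
near_mouse = normalize("p53", 40, species)
```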

4. Passages with PPIs

This task is to find passages describing physical PPIs. Physical interactions may appear in single or several sentences. Participating team(s) may use PPI corpora such as BioCreative, BioNLP Shared Task, AIMed and LLL for training [12], but they can also develop additional training data.
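As a point of reference, a naive baseline for candidate passages can be sketched as follows: keep a sentence if it mentions at least two known gene names and contains an interaction trigger word. This is an assumption-laden illustration, not the task's method; the trigger list, the regex sentence splitter and the name `candidate_sentences` are hypothetical, and a trained classifier on the corpora above would normally replace it.

```python
import re

# Hypothetical trigger words for physical interactions.
TRIGGERS = {"binds", "bind", "interacts", "interact", "associates", "complex"}


def candidate_sentences(text, gene_names):
    """Return sentences with >= 2 distinct gene names and a trigger word."""
    genes = set(gene_names)
    hits = []
    # Crude sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        tokens = set(re.findall(r"[A-Za-z0-9]+", sentence))
        lowered = {t.lower() for t in tokens}
        if len(tokens & genes) >= 2 and lowered & TRIGGERS:
            hits.append(sentence)
    return hits


text = ("BRCA1 interacts with BARD1. "
        "BRCA1 is expressed in breast tissue. "
        "RAD51 and BRCA2 were studied.")
passages = candidate_sentences(text, ["BRCA1", "BARD1", "RAD51", "BRCA2"])
```

This sketch works one sentence at a time, so it misses interactions described across several sentences, which the task explicitly allows.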

5. Passages with genetic interactions

This task is to find passages claiming genetic interactions. Genetic interactions may appear in single or several sentences. The BioGRID set may be used for creating a training set.

6. Passages with experimental methods for physical interactions

This task is to search for passages describing experimental methods used for finding physical interactions. There are 17 experimental methods defined in BioGRID. For this task, BioGRID, MINT, and/or IntAct may be used for training data.

7. Passages with genetic interaction types

This task is to search for passages describing genetic interaction types. These passages may overlap with the ones from Task 5. However, for Task 7, a type of genetic interaction should be clearly shown. There are 11 interaction types defined in BioGRID. The BioGRID set may be used for training data.

8. Visual tool for displaying various annotations

This task is to develop a visualization tool for highlighting annotation results from other tasks above. The tool should allow easy navigation and display user-selected annotations. Supporting full-text BioC documents is mandatory. A participating team will work closely with biocurators in BioGRID, in order to develop a visualization tool that curators find most useful.

* A sample BioC file for the BioC task is available here.

Datasets

Unlike other BioCreative tasks, no official training/test set will be released. Participating teams propose a method for each subtask and define the training set they are going to use. Teams assigned to finding passages for PPIs, genetic interactions and experimental methods are strongly encouraged to work closely with biocurators. The organizers will also assist in building the dataset, e.g. by providing available resources and converting data to BioC.

Evaluation

For evaluation, a small curation project will be performed, focused on the annotation of PPIs and experimental methods for a small set of genes linked to a disease. For example, the organizers could pick some Parkinson's disease genes that are only weakly linked to the disease on the basis of genome-wide association studies (GWAS). Between 100 and 200 full-text articles will be annotated by a team of four people using the BioGRID assistant tool. This makes it possible both to evaluate the assistant tool thoroughly and to generate a task-specific dataset. After they finish using the system, biocurators will be asked to rate the usefulness of the system and of each functionality on a scale of 0-5. They will also be encouraged to give feedback on what hinders usefulness and what may improve the system and its functions.

Dates

Jan. 2015: Call for participation
Mar. 31, 2015: Deadline for participation
Apr-Jun. 2015: Developing individual systems and iterative system integration
Jul. 6, 2015: Due date for Tasks 1, 2 and 3
Jul. 10, 2015: Due date for Tasks 4, 5, 6 and 7
Jul-Aug. 2015: Overall system evaluation
Aug. 25, 2015: Paper deadline
Sep. 1, 2015: Camera-ready deadline
Sep. 9-11, 2015: BioCreative V workshop

Participation

Teams can submit their subtask proposals to sun<dot>kim<at>nih<dot>gov with the subject line "BioC subtask proposal" by March 31st, 2015. Proposals will be reviewed by the task organizers and decisions will be given in a timely manner.

Contact information

To receive the latest task information, please subscribe to the BioCreative mailing list at https://lists.sourceforge.net/lists/listinfo/biocreative-participant. You can also post questions related to the Collaborative Biocurator Assistant Task (BioC) to the BioCreative mailing list.

References

1. Krallinger M, Vazquez M, Leitner F, Salgado D, Chatr-Aryamontri A, Winter A, Perfetto L, Briganti L, Licata L, Iannuccelli M et al: The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC Bioinformatics 2011, 12 Suppl 8:S3.
2. Chatr-Aryamontri A, Breitkreutz BJ, Heinicke S, Boucher L, Winter A, Stark C, Nixon J, Ramage L, Kolas N, O'Donnell L et al: The BioGRID interaction database: 2013 update. Nucleic Acids Research 2013, 41(Database issue):D816-823.
3. Kerrien S, Aranda B, Breuza L, Bridge A, Broackes-Carter F, Chen C, Duesbury M, Dumousseau M, Feuermann M, Hinz U et al: The IntAct molecular interaction database in 2012. Nucleic Acids Research 2012, 40(Database issue):D841-846.
4. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D: The Database of Interacting Proteins: 2004 update. Nucleic Acids Research 2004, 32(Database issue):D449-451.
5. Hirschman L, Burns GA, Krallinger M, Arighi C, Cohen KB, Valencia A, Wu CH, Chatr-Aryamontri A, Dowell KG, Huala E et al: Text mining for the biocuration workflow. Database 2012, 2012:bas020.
6. Comeau DC, Islamaj Dogan R, Ciccarese P, Cohen KB, Krallinger M, Leitner F, Lu Z, Peng Y, Rinaldi F, Torii M et al: BioC: a minimalist approach to interoperability for biomedical text processing. Database 2013, 2013:bat064.
7. Krallinger M, Leitner F, Rodriguez-Penagos C, Valencia A: Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biology 2008, 9 Suppl 2:S4.
8. Leitner F, Mardis SA, Krallinger M, Cesareni G, Hirschman LA, Valencia A: An Overview of BioCreative II.5. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2010, 7(3):385-399.
9. Arighi CN, Roberts PM, Agarwal S, Bhattacharya S, Cesareni G, Chatr-Aryamontri A, Clematide S, Gaudet P, Giglio MG, Harrow I et al: BioCreative III interactive task: an overview. BMC Bioinformatics 2011, 12 Suppl 8:S4.
10. Arighi CN, Carterette B, Cohen KB, Krallinger M, Wilbur WJ, Fey P, Dodson R, Cooper L, Van Slyke CE, Dahdul W et al: An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database 2013, 2013:bas056.
11. Matis-Mitchell S, Roberts P, Tudor CO, Arighi CN: BioCreative IV Interactive Task. In: BioCreative IV Workshop; Washington, DC. 2013: 190-203.
12. WBI corpora repository [http://corpora.informatik.hu-berlin.de]