Critical Assessment of Information Extraction in Biology. Data sets are available from Resources/Corpora and require registration.

BioCreative IV

Track 4: GO Task [2012-11-15]

Background and Objectives

Gene Ontology (GO) annotation is a common task among model organism database (MOD) groups. It is a very time-consuming and labor-intensive task, and thus often considered one of the bottlenecks in literature curation (1). There is a growing need for semi- or fully automated GO curation techniques that help database curators rapidly and accurately identify gene function information in full-length articles (2,3). Although automatically predicting GO terms from research articles is not a new problem in text mining, few studies have proven useful in assisting real-world GO curation. The lack of access to full text, the scarcity of gold-standard training data such as evidence sentences, and limited opportunities for interaction with actual GO curators have limited advances in algorithm development and their use in practical settings. The proposed task aims to address many of these issues and to promote research and tool development on this topic.

As mentioned above, the lack of training data has been one of the major stumbling blocks to advancing automated GO prediction. In BioCreative IV, therefore, we will not only provide teams with article-level gold-standard GO annotations for each full-text article, as has been done in the past, but will also provide evidence sentences for each GO annotation, compiled with help from expert GO curators (see participating MODs below). That is, to best support text-mining tool development, gold-standard sentences supporting each GO annotation will be provided based on human curation. These selected sentences should also often contain information relevant to the associated GO evidence codes. As we know from past BioCreative tasks, recognizing gene names and experimental evidence codes for protein-protein interactions in full text are difficult tasks in their own right (4). Hence, to encourage teams to focus on GO term extraction, in BioCreative IV we separate gene recognition from the GO task by including gene names and their associated identifiers in the training and test data.

Task Organizers:

  • Zhiyong Lu, NCBI
  • Kimberly Van Auken, WormBase
  • Donghui Li, TAIR
  • Cecilia Arighi, PIR

Participating MOD Group Leads:

  • Pete McQuilton, FlyBase
  • Stan Laulederkind, Rat Genome Database (RGD)
  • Donghui Li, TAIR
  • Mary Schaffer, MaizeGDB
  • Kimberly Van Auken, WormBase

SubTask A: Retrieving GO evidence sentences for relevant genes

GO evidence sentences (GOES) are critical for human curators making the associated GO annotations. For a given GO annotation, multiple evidence passages may appear in the paper, some more specific about the experimental details and others more succinct about the gene function. For this subtask, participants are given as input full-text articles together with associated gene information. For system output, teams must submit a list of GO evidence sentences for each of the given genes in the paper. For evaluation, the submitted sentence list will be compared against the gold standard, and standard precision, recall and F1 score will be computed. Each team is allowed to submit three runs. This subtask is similar to the BioCreative I GO subtask 2.1 (3), the BioCreative II interaction sentence subtask (4), and automatic GeneRIF identification (e.g. (5)) in TREC Genomics 2004 (6).
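As a rough illustration of the standard evaluation described above, the sketch below compares a submitted sentence list for one gene against the gold standard and computes precision, recall and F1. The sentence identifiers are hypothetical, and the official evaluation script (released per the schedule below) defines the authoritative submission format.

```python
def prf1(predicted, gold):
    """Precision, recall and F1 over two sets of sentence identifiers."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)  # true positives: submitted sentences in the gold standard
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical example: 2 of 3 submitted sentences match the 4 gold sentences.
p, r, f = prf1({"s1", "s2", "s5"}, {"s1", "s2", "s3", "s4"})
# p = 2/3, r = 1/2, f = 4/7
```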

SubTask B: Predicting GO terms for relevant genes

This subtask is a step toward the ultimate goal of using computers to assist human GO curation. As in SubTask A, participants are given as input full-text articles with associated gene information. For system output, teams must return a list of relevant GO terms for each of the input genes in a paper. Manually curated GO annotations will be used as the gold standard for evaluating team predictions, with the predicted GO codes compared against the gold standard for each gene in an article. Each team is allowed to submit three runs. In addition to (a) standard precision, recall and F1 score, (b) hierarchical precision, recall and F-measure will be computed, in which common ancestors of the computer-predicted and human-annotated GO codes are taken into account; see (7,8) for more details about this measure. This subtask is similar to the BioCreative I GO subtask 2.2 (3).
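The hierarchical measure of (7,8) can be sketched as follows: each GO term is expanded to the set containing itself and all of its ancestors, so a prediction that lands on an ancestor of the true term earns partial credit. The tiny ontology below is purely hypothetical, and real implementations would traverse the actual GO graph; this is an illustrative sketch, not the official scoring code.

```python
# Hypothetical toy is_a hierarchy: c is_a b is_a a, and d is_a a.
TOY_GO_PARENTS = {
    "GO:b": {"GO:a"},
    "GO:c": {"GO:b"},
    "GO:d": {"GO:a"},
}

def ancestors(term, parents=TOY_GO_PARENTS):
    """Return the term plus all ancestors reachable via is_a links."""
    out = {term}
    for p in parents.get(term, ()):
        out |= ancestors(p, parents)
    return out

def hierarchical_prf1(predicted, gold):
    """Hierarchical precision/recall/F1 over ancestor-expanded term sets."""
    pred_up = set().union(*(ancestors(t) for t in predicted))
    gold_up = set().union(*(ancestors(t) for t in gold))
    tp = len(pred_up & gold_up)
    hp = tp / len(pred_up) if pred_up else 0.0
    hr = tp / len(gold_up) if gold_up else 0.0
    hf = 2 * hp * hr / (hp + hr) if hp + hr else 0.0
    return hp, hr, hf

# Predicting the parent "GO:b" of the true term "GO:c" earns partial credit:
hp, hr, hf = hierarchical_prf1({"GO:b"}, {"GO:c"})
# hp = 1.0, hr = 2/3, hf = 0.8
```

Under the flat measure this prediction would score zero; the ancestor expansion is what rewards near-miss predictions higher in the hierarchy.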

Important Dates:

May 31, 2013 – training set released (100 articles)
August 2, 2013 – development set released (50 articles)
August 12, 2013 - evaluation script released (including detailed submission format)
September 3, 2013 – test data released (50 articles)
September 8, 2013 (11:59 EDT) – team submission due

Downloads: BC4GO corpus

Contact: Zhiyong Lu (zhiyong.lu@nih.gov)

To receive the latest task information or read archived messages, please subscribe to the biocreative mailing list at: https://lists.sourceforge.net/lists/listinfo/biocreative-participant. Once registered, you can also post questions related to the BC IV GO task to the BioCreative mailing list.

References:

1. Lu, Z. and Hirschman, L. (2012) Biocuration workflows and text mining: overview of the BioCreative 2012 Workshop Track II. Database: The Journal of Biological Databases and Curation, 2012, bas043.
2. Van Auken, K., Jaffery, J., Chan, J., Muller, H.M. and Sternberg, P.W. (2009) Semi-automated curation of protein subcellular localization: a text mining-based approach to Gene Ontology (GO) Cellular Component curation. BMC Bioinformatics, 10, 228.
3. Blaschke, C., Leon, E.A., Krallinger, M. and Valencia, A. (2005) Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics, 6 Suppl 1, S16.
4. Krallinger, M., Leitner, F., Rodriguez-Penagos, C. and Valencia, A. (2008) Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biol, 9 Suppl 2, S4.
5. Lu, Z., Cohen, K.B. and Hunter, L. (2006) Finding GeneRIFs via gene ontology annotations. Pac Symp Biocomput, 52-63.
6. Cohen, A.M. and Hersh, W.R. (2006) The TREC 2004 genomics track categorization task: classifying full text biomedical documents. J Biomed Discov Collab, 1, 4.
7. Eisner, R., Poulin, B., Szafron, D., Lu, P. and Greiner, R. (2005) Improving protein function prediction using the hierarchical structure of the Gene Ontology. Proceedings of the 2005 IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology.
8. Kiritchenko, S., Matwin, S. and Famili, A.F. (2005) Functional annotation of genes using hierarchical text categorization. Proceedings of the BioLINK SIG: Linking Literature, Information and Knowledge for Biology.