BioCreative - Task 2: Protein-Protein Interactions

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative II

Task 2: Protein-Protein Interactions [2006-04-01]

This task is organized as a collaboration between the IntAct and MINT protein interaction databases and the CNIO Structural Bioinformatics and Biocomputing group.

Background
Introduction
Task description
Data
Resources

A) Background

The study of protein interactions is one of the most pressing biological problems. Characterizing protein interaction partners is crucial to understanding not only the functional role of individual proteins but also the organization of entire biological processes.

The development of high-throughput experimental technologies, such as yeast two-hybrid screening [1] or affinity purification coupled with mass spectrometry [2], is now making it possible to study protein interactions on a much larger scale by means of bioinformatics approaches [3-5]. One limitation of these large-scale experiments is their limited accuracy. Protein interaction databases have been developed [6-7] to integrate protein interaction information from these disparate sources, e.g., high-throughput methods as well as carefully characterized individual protein interactions.

Databases such as IntAct and MINT provide interaction information in the form of well-structured database records in standard formats, constituting a useful resource for both biologists and bioinformaticians.

Because the molecular biology literature provides detailed descriptions of protein interaction experiments specifying the individual interaction partners, as well as the corresponding interaction types, it has been exploited as a resource to derive protein interaction records for interaction databases. Due to the rapid growth of the biomedical literature and the increasing number of newly discovered proteins, it is becoming difficult for the interaction database curators to keep up with the literature by manually detecting and curating protein interaction information.

This is motivating the implementation of information extraction and text mining techniques to automatically extract protein interaction information from free text. A number of approaches have been published (see [7-33] for some of the strategies), and an initial challenge evaluation has been carried out [34]. Nevertheless, a large-scale evaluation of different methods applied to existing protein interaction databases is still missing. To produce high-quality training and test data collections, as well as to set up community-wide experiments which can result in relevant and useful systems, collaboration with experts in protein interaction databases is crucial.

B) Introduction to the protein-protein interaction (PPI) extraction task

One of the main limitations for the development and evaluation of protein-protein interaction extraction methods from text is the lack of Gold Standard training data sets. This makes it cumbersome to compare existing automated extraction methods, as most results are reported using author-specific evaluation data sets; furthermore, some systems have only been evaluated using article abstracts.

In practice, biologists who search for protein interactions are not limited to abstracts, but consider full text articles to derive protein interaction information. In addition, the type of protein interaction and the experimental method used to determine whether two proteins interact are important pieces of information preserved in expert-curated databases.

For BioCreAtIvE II, the protein-protein interaction task focuses on the prediction of protein interactions from full text articles. The second BioCreAtIvE challenge is gathering expert database curators with experience in protein interaction annotation together with experts in evaluating information extraction systems adapted to the biology domain.

Among the main goals posed in this task are:

To determine the state of the art in extraction of protein-protein interaction;
To produce useful resources for training and testing protein interaction extraction systems;
To learn which approaches are successful and practical;
To monitor interesting new approaches;
To provide the biology community with useful tools to extract protein-protein interactions from texts.

This second BioCreAtIvE challenge provides the opportunity for participating systems to take advantage of the underlying collaboration with domain experts, addressing a practical task. The training and test data sets are characterized by in-depth annotations of protein interactions of full text articles. These annotations include all the manually registered mentions of the interacting proteins from the full texts, as provided by the database curators.

C) Protein-protein interaction task description

Reflecting the process by which database curators extract annotations, several sub-tasks are posed. Each participant is free to take part in any (or all) of the proposed sub-tasks.

Protein Interaction Article Sub-task 1 (IAS)

In practice, before detecting protein interaction descriptions in sentences, it is necessary to select the articles that contain information relevant to protein interactions. Although this step is critical for all subsequent steps, it has often been neglected by previously published protein interaction extraction systems. This sub-task is therefore concerned with classifying whether a given article contains protein interaction information.

Participants will need to return a ranked list of articles (identifiers) based on their relevance for protein interaction annotation. To evaluate the participating systems, the AROC (area under the receiver operating characteristic curve) measure will be used, computed on the ranked predicted collections. (Initially we had also considered using additional evaluation metrics, e.g. the utility measure [35].) The training collection will contain:

TP (True Positives): a collection of PubMed article abstracts which are relevant for protein interaction curation.
TN (True Negatives): articles which have been classified by domain expert curators from the IntAct and MINT databases as not relevant for protein interaction curation.
Likely TP (likely True Positives): a collection of PubMed identifiers of articles which have been used for protein interaction annotation by other interaction databases (namely BIND, HPRD, MPACT and GRID).
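
As an illustration of the AROC measure, the following minimal Python sketch computes the area under the ROC curve for a ranked list of article identifiers. It is not the official evaluation script: the function name and the example identifiers are hypothetical, and the sketch assumes a strict ranking without ties, in which every submitted article is either relevant or not relevant.

```python
def aroc(ranked_ids, relevant_ids):
    """Area under the ROC curve for a ranked list (most relevant first).

    Equivalent to the fraction of (relevant, non-relevant) article pairs
    in which the relevant article is ranked above the non-relevant one.
    Assumes a strict ranking without ties.
    """
    relevant = set(relevant_ids)
    n_pos = sum(1 for article_id in ranked_ids if article_id in relevant)
    n_neg = len(ranked_ids) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one relevant and one non-relevant article")

    correctly_ordered = 0
    negatives_seen = 0
    for article_id in ranked_ids:
        if article_id in relevant:
            # Every non-relevant article not yet seen is ranked below this one.
            correctly_ordered += n_neg - negatives_seen
        else:
            negatives_seen += 1
    return correctly_ordered / (n_pos * n_neg)


# Hypothetical article identifiers, for illustration only.
ranking = ["art1", "art2", "art3", "art4"]
score = aroc(ranking, relevant_ids={"art1", "art3"})
```

A perfect ranking (all relevant articles first) gives a value of 1, and a random ranking gives a value close to 0.5 on average.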

Protein Interaction Pairs Sub-task 2 (IPS)

This sub-task concerns the identification of protein-protein interaction pairs from full text articles. As training data, participants will get a collection of articles with the associated interaction pairs extracted from these articles, as well as the corresponding gene mention symbols. For the test set, participants have to provide, for each article, a ranked list of protein-protein interaction pairs. The evaluation will be in terms of precision and recall of the predicted protein interaction pairs for each article.
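
The per-article precision and recall can be sketched as follows. This is an illustrative sketch rather than the official scorer: it assumes that interaction pairs are unordered (A-B equals B-A), it ignores the rank of each prediction, and the protein identifiers in the example are made up.

```python
def precision_recall(predicted_pairs, gold_pairs):
    """Precision and recall of predicted interaction pairs for one article.

    Pairs are treated as unordered, so ("A", "B") matches ("B", "A").
    Duplicate predictions are counted once.
    """
    predicted = {frozenset(pair) for pair in predicted_pairs}
    gold = {frozenset(pair) for pair in gold_pairs}
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up identifiers, for illustration only.
predicted = [("A", "B"), ("C", "A"), ("D", "E")]
gold = [("B", "A"), ("A", "C"), ("F", "G"), ("H", "I")]
precision, recall = precision_recall(predicted, gold)
```

In this example two of the three predicted pairs are correct, and two of the four annotated pairs are recovered.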

Protein Interaction Sentences Sub-task 3 (ISS)

In practice, protein-protein interaction information for a given pair of proteins might be mentioned several times throughout a full text article. To produce a protein interaction summary, for instance, it is useful to select the most relevant sentence expressing interaction information for a given pair. Therefore one of the sub-tasks will ask participants to provide, for each protein interaction pair, a ranked list of at most 5 text passages (containing at most 3 sentences per passage) describing their interaction. For the evaluation, pooling methods will be used, as follows: all the sentences from all the systems for each document are collected. We will evaluate according to two aspects: a) the percentage of interaction-relevant sentences with respect to the total number of predicted (submitted) sentences and b) the mean reciprocal rank (MRR) of the ranked list of interaction evidence passages with respect to the manually chosen best interaction sentence. Point b) is the most important evaluation criterion.
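
The mean reciprocal rank in point b) can be illustrated with the short Python sketch below. It rests on several assumptions that the task description does not fix: a submitted passage counts as a hit only if it is identical to the manually chosen best sentence, pairs without a submission score zero, and all names and example strings are hypothetical.

```python
MAX_PASSAGES = 5  # at most 5 ranked passages per interaction pair


def reciprocal_rank(ranked_passages, best_passage):
    """Return 1/rank of the best passage, or 0.0 if it was not submitted."""
    for rank, passage in enumerate(ranked_passages[:MAX_PASSAGES], start=1):
        if passage == best_passage:  # assumption: exact string match
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(submissions, gold_best):
    """Average the reciprocal rank over all annotated interaction pairs.

    submissions: dict mapping a pair key to its ranked list of passages.
    gold_best:   dict mapping a pair key to the manually chosen best passage.
    """
    if not gold_best:
        return 0.0
    total = sum(
        reciprocal_rank(submissions.get(pair, []), best)
        for pair, best in gold_best.items()
    )
    return total / len(gold_best)


# Hypothetical pair keys and passages, for illustration only.
submissions = {"pair1": ["s1", "s2"], "pair2": ["x", "y", "best"]}
gold_best = {"pair1": "s1", "pair2": "best", "pair3": "z"}
mrr = mean_reciprocal_rank(submissions, gold_best)
```

Here the three pairs contribute 1, 1/3 and 0 respectively, so the mean is about 0.44.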

Protein Interaction Method Sub-task 4 (IMS)

For annotation purposes, as well as to judge the quality of protein interactions, it is important to know how protein interactions have been determined experimentally. In the case of protein-protein interaction annotation, considerable effort has been made to develop a controlled vocabulary of interaction methods. This sub-task refers to the identification of the type of experiment which was used to confirm a given protein-protein interaction. The experimental method description has to be mapped to a previously provided controlled hierarchical vocabulary of experimental methods [36]. Performance will be measured by the mean reciprocal rank of correctly identified interaction methods (correct MI identifiers) for each protein-protein interaction pair, compared to the previously manually annotated interaction detection methods. This hierarchical controlled vocabulary is available at MI.

D) General Data set considerations

For the training and test data sets, the annotation strategy followed by the IntAct and MINT databases has been considered. Note that proteins are uniquely identified by UniProt ID. Although IntAct and MINT annotation is done down to isoform level, the BioCreAtIvE competition mapping is done to UniProt "master" entries. A UniProt 'light' version will be distributed to the participants. Note that only entries contained in this release will be considered for evaluation, to avoid the problem of obsolete identifiers.
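
A participant might normalize predicted identifiers along the following lines. This is only a sketch under stated assumptions: it assumes that proteins are given as UniProt accessions, that isoforms carry a numeric suffix after a hyphen (for example "P12345-2"), and that the distributed release can be loaded as a plain set of accessions. The function names and example accessions are hypothetical.

```python
def to_master_accession(accession):
    """Map an isoform accession such as 'P12345-2' to its master entry."""
    base, separator, suffix = accession.partition("-")
    # Only strip the suffix when it looks like an isoform number.
    if separator and suffix.isdigit():
        return base
    return accession


def filter_predictions(predicted_accessions, release_accessions):
    """Keep master accessions present in the distributed release.

    Order is preserved and duplicates (e.g. two isoforms of the same
    master entry) are reported once.
    """
    allowed = set(release_accessions)
    kept = []
    for accession in predicted_accessions:
        master = to_master_accession(accession.strip())
        if master in allowed and master not in kept:
            kept.append(master)
    return kept


# Hypothetical accessions, for illustration only.
release = {"P12345", "Q11111"}
usable = filter_predictions(["P12345-2", "Q99999", "P12345"], release)
```

In this example the isoform is collapsed to its master entry, the duplicate is dropped, and the accession missing from the release is discarded.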

E) Additional resources

Additional useful data collections, such as protein interaction sentences derived from PubMed abstracts and links to other interaction-relevant resources will be provided as well.

References

[1] Uetz, P., et al. (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403, 623-627
[2] Gavin, A.C., et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141-147
[3] Valencia, A. and Pazos, F. (2003) Prediction of protein-protein interactions from evolutionary information. Methods Biochem Anal., 44, 411-426
[4] Enright, A.J., et al. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature, 402, 86-90
[5] Jansen, R., et al. (2003) A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302, 449-453
[6] Hermjakob, H., et al. (2004) IntAct: an open source molecular interaction database. Nucleic Acids Res., 32, D452-D455
[7] Zanzoni, A., et al. (2002) MINT: a Molecular INTeraction database. FEBS Lett., 513, 135-140
[8] Blaschke, C. and Valencia, A. (2001) The potential use of SUISEKI as a protein interaction discovery tool. Genome Inform Ser Workshop Genome Inform., 12, 123-134
[9] Marcotte, E.M., et al. (2001) Mining literature for protein-protein interactions. Bioinformatics, 17, 359-363
[10] Proux, D., et al. (2000) A pragmatic information extraction strategy for gathering data on genetic interactions. Proc Int Conf Intell Syst Mol Biol., 8, 279-285
[11] Ono, T., et al. (2001) Automated extraction of information on protein-protein interactions from the biological literature. Bioinformatics, 17, 155-161
[12] Rindflesch, T.C., et al. (1999) Mining molecular binding terminology from biomedical text. Proc AMIA Symp., 127-131
[13] Hatzivassiloglou, V. and Weng, W. (2002) Learning anchor verbs for biological interaction patterns from published text articles. Int J Med Inf., 67, 19-32
[14] Hoffmann, R. and Valencia, A. (2003) Protein interaction: same network, different hubs. Trends Genet., 19, 681-683
[15] Donaldson, I., et al. (2003) PreBIND and Textomy--mining the biomedical literature for protein-protein interactions using a support vector machine. BMC Bioinformatics, 4, 11
[16] Sekimizu, T., et al. (1998) Identifying the Interaction between Genes and Gene Products Based on Frequently Seen Verbs in Medline Abstracts. Genome Inform Ser Workshop Genome Inform., 9, 62-71
[17] Daraselia, N., et al. (2004) Extracting human protein interactions from MEDLINE using a full-sentence parser. Bioinformatics, 20, 604-611
[18] Rzhetsky, A., et al. (2004) GeneWays: a system for extracting, analyzing, visualizing, and integrating molecular pathway data. J Biomed Inform., 37, 43-53
[19] Hu, Z.Z., et al. (2004) iProLINK: an integrated protein resource for literature mining. Comput Biol Chem., 28, 409-416
[20] Koike, A. and Takagi, T. (2005) PRIME: automatically extracted PRotein Interactions and Molecular Information databasE. In Silico Biol., 5, 9-20
[21] Domedel-Puig, N. and Wernisch, L. (2005) Applying GIFT, a Gene Interactions Finder in Text, to fly literature. Bioinformatics, 21, 3582-3583
[22] Blaschke, C. and Valencia, A. (2002) The frame-based module of the Suiseki information extraction system. IEEE Intelligent Systems, 17, 14-20
[23] Katrenko, S., et al. (2005) Learning Biological Interactions from Medline Abstracts. Proc of ICML05 workshop
[24] Hao, Y., et al. (2005) Discovering patterns to extract protein-protein interactions from the literature: Part II. Bioinformatics, 21, 3294-3300
[25] Ahmed, S.T., et al. (2005) IntEx: A Syntactic Role Driven Protein-Protein Interaction extractor for Bio-Medical Text. Proc workshop ACL-05/ISMB-05, 54-61
[26] Huang, M., et al. (2004) Discovering patterns to extract protein-protein interactions from full texts. Bioinformatics, 20, 3604-3612
[27] Koike, A., et al. (2003) Kinase pathway database: an integrated protein-kinase and NLP-based protein-interaction resource. Genome Res., 13, 1231-1243
[28] Humphreys, K., et al. (2000) Two applications of information extraction to biological science journal articles: enzyme interactions and protein structures. Pac Symp Biocomput., 505-516
[29] Blaschke, C., et al. (1999) Automatic extraction of biological information from scientific text: protein-protein interactions. Proc Int Conf Intell Syst Mol Biol., 60-67
[30] Rindflesch, T.C., et al. (2000) EDGAR: extraction of drugs, genes and relations from the biomedical literature. Pac Symp Biocomput., 517-528
[31] Blaschke, C. and Valencia, A. (2001) Can bibliographic pointers for known biological data be found automatically? Protein interactions as a case study. Comp. Funct. Genom., 2, 196-206
[32] Sugiyama, K., et al. (2003) Extracting Information on Protein-Protein Interactions from Biological Literature Based on Machine Learning Approaches. Genome Informatics, 14, 699-700
[33] Friedman, C., et al. (2001) GENIES: a natural-language processing system for the extraction of molecular pathways from journal articles. Bioinformatics, 17, S74-S82
[34] Nedellec, C. (2005) Learning Language in Logic - Genic Interaction Extraction Challenge. Proc LLL05 workshop
[35] Hersh, W., et al. (2004) TREC 2004 Genomics Track Overview. http://ir.ohsu.edu/genomics/
[36] http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=MI&termId=MI%3A0001&termName=interaction%20detection%20method
[37] http://psidev.sourceforge.net/mi/controlledVocab/psi-mi.def.html#MI:0026
[38] http://cvs.sourceforge.net/viewcvs.py/psidev/psi/mi/controlledVocab/