Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

Background

Overview [2003-12-31]

The growing interest in information retrieval (IR), information extraction (IE) and text mining applied to the biological literature is related to the increasing accumulation of scientific literature (PubMed currently (2005) contains over 15,000,000 entries) as well as to the accelerated discovery of biological information obtained by characterizing biological entities (such as genes and proteins) with high-throughput and large-scale experimental techniques [1].

Computational techniques that process the biomedical literature help biologists, bioinformaticians and database curators access relevant textual information more efficiently. Many systems have been implemented that address the identification of gene/protein mentions in text, or the extraction of text-based protein-protein interactions and functional annotations, using information extraction and text mining approaches [2].

To evaluate the performance of existing tools, and to allow comparison between different strategies, common evaluation standards and data sets are crucial. In the past, most implementations have focused on different problems, often using private data sets. As a result, it has been difficult to determine how good the existing systems were or to reproduce their results. It is therefore hard to tell whether these systems would scale to real applications, and what performance could be expected on a different evaluation data set [3-4].

The importance of assessing and comparing different computational methods has been recognized previously by both the bioinformatics and the NLP communities. Researchers in natural language processing (NLP) and information extraction (IE) have, for many years now, used common evaluations to accelerate their research progress, e.g., via the Message Understanding Conferences (MUCs) [5] and the Text Retrieval Conferences (TREC) [6]. This not only resulted in the formulation of common goals but also made it possible to compare different systems and gave a certain transparency to the field. With the introduction of common evaluations and standardized evaluation metrics, it has become possible to compare approaches, to assess which techniques did and did not work, and to make progress. This progress has resulted in the creation of standard tools available to the general research community.
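
These evaluation conventions carry over directly to challenges such as BioCreAtIvE: submitted predictions are scored against a gold-standard annotation using precision, recall and the balanced F-measure. Below is a minimal sketch of that scoring, assuming predictions and gold standard are simply sets of annotated items; the function and variable names are illustrative and not part of any BioCreAtIvE tool.

    def precision_recall_f1(predicted, gold):
        """Score a set of predicted annotations against a gold standard."""
        predicted, gold = set(predicted), set(gold)
        tp = len(predicted & gold)  # true positives: items found in both sets
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # Example: three predicted gene mentions, four in the gold standard.
    p, r, f = precision_recall_f1({"p53", "BRCA1", "abc1"},
                                  {"p53", "BRCA1", "MDM2", "RAD51"})
    # p ≈ 0.67, r = 0.50, f ≈ 0.57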

The field of bioinformatics also has a tradition of competitions, for example, in protein structure prediction (CASP [7]) or gene prediction in entire genomes (at the "Genome Based Gene Structure Determination" symposium held on the Wellcome Trust Genome Campus).

There has been a lot of activity in the field of text mining in biology, including sessions at the Pacific Symposium on Biocomputing (PSB) [8], the Intelligent Systems for Molecular Biology (ISMB) and European Conference on Computational Biology (ECCB) conferences [9], as well as workshops and sessions on language and biology in computational linguistics (the Association for Computational Linguistics BioNLP SIGs).

A small number of complementary evaluations of text mining systems in biology have recently been carried out, starting with the KDD Cup [10] and the Genomics track at the TREC conference [11]. We therefore decided to set up the first BioCreAtIvE challenge, which was concerned with the identification of gene mentions in text [12], the linking of texts to actual gene entries provided by existing biological databases [13], and the extraction of human gene product (Gene Ontology) annotations from full-text articles [14]. The success of this first challenge evaluation, as well as the lessons learned from it, motivated us to carry out a second BioCreAtIvE, which should allow us to monitor improvements and build on the experience and data derived from the first challenge. As in the previous BioCreAtIvE, the main focus is on biologically relevant tasks, which should benefit the biomedical text mining community, the biology and biological database community, as well as the bioinformatics community.

References

  1. Krallinger M. and Valencia A.; Text mining and information retrieval services for Molecular Biology. Genome Biology, 6 (7), 224 (2005).
  2. Krallinger M., Alonso-Allende Erhardt R. and Valencia A.; Text-mining approaches in molecular biology and biomedicine. Drug Discovery Today 10, 439-445 (2005).
  3. Blaschke C., Hirschman L. and Valencia A.; Information extraction in molecular biology. Brief Bioinform. 3, 154-165 (2002).
  4. Hersh W.; Evaluation of biomedical text-mining systems: lessons learned from information retrieval. Brief Bioinform. 6, 344-356 (2005).
  5. Message Understanding Conferences: MUC
  6. Text Retrieval Conferences: TREC
  7. Critical Assessment of techniques for protein Structure Prediction: CASP
  8. Pacific Symposium on Biocomputing: PSB
  9. ECCB and ISMB are conferences promoted by the International Society for Computational Biology: ISCB
  10. Knowledge Discovery and Data Mining (2002): KDD
  11. TREC Genomics Track
  12. Yeh A., Morgan A., Colosimo M. and Hirschman L.; BioCreAtIvE Task 1A: gene mention finding evaluation. BMC Bioinformatics 2005, 6(Suppl 1):S2 (24 May 2005) 
  13. Hirschman L., Colosimo M., Morgan A. and Yeh A.; Overview of BioCreAtIvE task 1B: normalized gene lists. BMC Bioinformatics 2005, 6(Suppl 1):S11 (24 May 2005)
  14. Blaschke C., Andres Leon E., Krallinger M. and Valencia A.; Evaluation of BioCreAtIvE assessment of task 2. BMC Bioinformatics 2005, 6(Suppl 1):S16 (24 May 2005)