Critical Assessment of Information Extraction in Biology. Data sets are available from Resources/Corpora and require registration.

BioCreative I

Task 1A: Gene Mention Identification [2004-01-02]

Task 1 was divided into two sub-tasks, reflecting different sources of data. Task 1A focused on the identification of gene or protein names in running text [1]. The data for this task was provided by Lorrie Tanabe and John Wilbur (NCBI) [2] and was derived from annotation of single sentences taken from MEDLINE abstracts. The task was very close to the "named entity tagging" task that has been used extensively in the natural language processing community, which made it easy for groups whose main expertise was in natural language processing to participate: this was the most heavily subscribed BioCreAtIvE task, with 15 teams participating.

An example sentence is shown below:

Furthermore, as in the human gene, the 3' end of the Cacna1f gene maps within 5 kb of the 5' end of the mouse synaptophysin gene in a region orthologous to Xp11.23.

In this example, the system must identify the gene/protein names Cacna1f gene (or Cacna1f) and mouse synaptophysin gene (or, minimally, synaptophysin). However, a phrase like "the human gene" is not marked, because it is not the name of a particular gene. The answer key provides for alternative forms, e.g., Cacna1f gene or Cacna1f, as sketched below.
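To make the role of these alternatives concrete, here is a minimal sketch of how predicted mentions could be scored against such an answer key, assuming a simplified representation in which each gold mention carries a set of accepted forms. The function name and data layout are illustrative, not the official BioCreAtIvE evaluation script.

```python
# A minimal sketch of Task 1A-style scoring: a predicted mention is
# correct if it matches any accepted alternative form of a gold mention.
# Illustrative only; not the official evaluation script.

def score_sentence(predicted, gold_alternatives):
    """Count true positives, false positives, and false negatives.

    predicted:         set of mention strings produced by the system
    gold_alternatives: list of sets, one per gold mention, holding its
                       acceptable alternative forms
    """
    tp = 0
    matched = set()
    for alternatives in gold_alternatives:
        hits = alternatives & predicted
        if hits:
            tp += 1
            matched |= hits  # any accepted alternative counts as a match
    fp = len(predicted - matched)     # system mentions with no gold match
    fn = len(gold_alternatives) - tp  # gold mentions the system missed
    return tp, fp, fn

# The example sentence above: either alternative of each mention is accepted.
gold = [{"Cacna1f gene", "Cacna1f"},
        {"mouse synaptophysin gene", "synaptophysin"}]
print(score_sentence({"Cacna1f", "synaptophysin"}, gold))  # (2, 0, 0)
```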

Participants were given 10,000 annotated training sentences and were tested on an additional 5,000 blind test sentences. The main finding from Task 1A was that four different teams, using techniques such as Hidden Markov Models and Support Vector Machines, achieved F-measures over 0.80 (F-measure is the harmonic mean of precision and recall). This is somewhat lower than figures for similar tasks in the newswire domain; extraction of organization names, for example, has been done at over 0.90 F-measure. The article by Yeh et al. [1] analyzes these differences, attributing about half of the gap in F-measure to the fact that systems perform worse on longer names (also noted in [3]) and that the distribution of gene and protein names is skewed towards longer names than is seen for organization names.
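For reference, the F-measure reported above combines precision and recall as their harmonic mean. The sketch below shows the arithmetic on invented counts, not actual Task 1A results:

```python
# Illustrative only: computes F-measure from raw mention counts.
# The example numbers are invented to show the arithmetic.

def f_measure(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0  # fraction of system mentions that are correct
    recall = tp / (tp + fn) if tp + fn else 0.0     # fraction of gold mentions found
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# e.g. 4000 correct mentions, 900 spurious, 1000 missed:
print(round(f_measure(4000, 900, 1000), 3))  # 0.808
```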

Data preparation for Task 1A [1,2] had several interesting features. In particular, the data were annotated by biologists without explicit annotation guidelines. This was a novel approach to annotation: annotating named entities in newswire (e.g., person, organization, location) for the Message Understanding Conference tasks required extensive multi-page annotation guidelines [4]. For Task 1A, no systematic inter-annotator agreement studies were carried out to assess the quality of the test data. However, some post-evaluation analysis indicated that there may have been inconsistencies in how compound terms, such as "Mek-Erk1/2 pathway", were annotated.

These inconsistencies made it difficult to learn generalizations from the training data, thus reducing scores; this may also account for some of the gap between performance on the gene/protein name extraction task and on the newswire tasks. Task 1A was viewed as a "building block" task - one that could be treated as a natural language processing problem requiring no significant biological expertise. It also constitutes a first step toward more complex tasks, such as gene name normalization (Task 1B) or functional annotation of genes (Task 2).

References 

  1. Yeh AS, Morgan A, Colosimo M and Hirschman L: BioCreAtIvE Task 1A: gene mention finding evaluation. BMC Bioinformatics 6(Suppl 1):S2 (24 May 2005)
  2. Tanabe L, Xie N, Thom LH, Matten W and Wilbur WJ: GENETAG: A Tagged Corpus for Gene/Protein Named Entity Recognition. BMC Bioinformatics 6(Suppl 1):S3 (24 May 2005)
  3. Kinoshita S, Cohen KB, Ogren PV and Hunter L: BioCreAtIvE Task 1A: Entity Identification with a Stochastic Tagger. BMC Bioinformatics 6(Suppl 1):S4 (24 May 2005)
  4. MUC-7: Seventh Message Understanding Conference.