BioCreative - Task 1B: Gene Normalizations

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative I

Task 1B: Gene Normalizations [2004-01-02]

Task 1B focused on creating normalized gene lists; this is a task that is currently performed (manually) by curators for various model organism databases. This meant that there was a readily available data set for both training and testing. We chose three model organism databases (fly [1], mouse [2], yeast [3]) as sources of gene lists associated with papers. Our goal in choosing several model organisms was to encourage approaches that could be readily applied to different vocabularies.

We were committed to providing large training and test sets for this task. Because large quantities of full-text articles were difficult to obtain, we chose to provide only the MEDLINE abstracts of the articles for the evaluation. This meant that we had to edit the gene lists so that they corresponded to the genes mentioned in the abstract, rather than all the genes curated from the full-text article. We developed a procedure to automatically remove genes not found in the abstract and were able to provide a large quantity of "noisy" training data for the three organisms, together with small collections of carefully corrected development and test data [4]. We estimated the quality of the noisy training data for the three organisms. Yeast training data quality appeared to be quite good (precision 0.99, recall 0.86); fly training data was a little noisier (precision 0.92, recall 0.86); and mouse training data had poor recall (precision 0.99, recall 0.55). We also provided synonym lists for each organism, consisting of the unique gene identifier and its alternate names, as listed in the resources provided by each model organism database. Figure 2 shows a sample abstract with the associated unique gene identifiers, plus an excerpt from the lexicon, showing the many alternate names associated with genes. Although genes may be mentioned more than once in an abstract, the gene list consists of the set of unique mouse genes mentioned in the abstract.
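
The filtering step described above can be illustrated with a minimal Python sketch. It is not the procedure actually used to prepare the data (that is documented in [4]); it assumes plain case-insensitive matching of lexicon names against the abstract text. The identifiers, gene names, and function names (`LEXICON`, `find_gene_ids`, `filter_curated_list`) are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical lexicon excerpt: unique gene identifier -> alternate names.
# These entries are invented for illustration and are not real database records.
LEXICON = {
    "GENE:0001": ["abc1", "alpha binding complex 1"],
    "GENE:0002": ["xyz", "xyz protein"],
}


def find_gene_ids(abstract, lexicon):
    """Return the sorted set of unique identifiers whose names occur in the abstract."""
    text = abstract.lower()
    found = set()
    for gene_id, names in lexicon.items():
        for name in names:
            # Require non-alphanumeric boundaries so that "xyz" does not match "xyzzy".
            pattern = r"(?<![a-z0-9])" + re.escape(name.lower()) + r"(?![a-z0-9])"
            if re.search(pattern, text):
                found.add(gene_id)
                break  # one matching name is enough for this identifier
    return sorted(found)


def filter_curated_list(curated_ids, abstract, lexicon):
    """Keep only the curated identifiers that have some name mentioned in the abstract."""
    mentioned = set(find_gene_ids(abstract, lexicon))
    return [gene_id for gene_id in curated_ids if gene_id in mentioned]
```

Calling `filter_curated_list(["GENE:0001", "GENE:0002"], abstract, LEXICON)` on an abstract that mentions only "Abc1" would keep `GENE:0001` and drop `GENE:0002`. A gene that appears in the abstract under a name missing from the lexicon would be dropped as well, which is one plausible way for recall to suffer in automatically filtered data.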

Eight groups participated in Task 1B. The results [5] varied considerably by organism, from a high F-measure of 0.92 for yeast to somewhat lower scores for fly (high F-measure of 0.82) and mouse (high F-measure of 0.79). Our analysis showed that the differences among organisms could be attributed to a variety of factors, including extensive ambiguity in names and overlap of gene names with English terms (fly); complex multi-word gene names (mouse); and the quality of the training data, especially for mouse, where recall on the training data was estimated at 55%.
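
As a reminder of how precision, recall, and F-measure relate when comparing gene lists, here is a minimal Python sketch. It assumes the balanced F-measure (the harmonic mean of precision and recall) and scores a single abstract; how the official evaluation aggregated scores across abstracts is not stated here, and the function name `score_gene_list` and the identifiers are hypothetical.

```python
def score_gene_list(predicted, gold):
    """Compare a predicted gene list with a gold-standard list for one abstract.

    Both arguments are collections of unique gene identifiers.
    Returns (precision, recall, f_measure), assuming the balanced F-measure.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure
```

For example, a system that returns three identifiers, two of which are among four gold-standard identifiers, has precision 2/3 and recall 1/2, giving an F-measure of 4/7 (about 0.57).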

These results lead us to believe that tools for automated gene name identification and normalization may be ready to be incorporated into the curation process, at least where organism nomenclature is highly regular (as in yeast) and authors adhere to the model organism database conventions in the literature. However, in many cases the real task is even more complicated, for example when papers covering several organisms are analyzed simultaneously, since the same names are used for different genes in different species.

References

[1] The FlyBase Database: [http://flybase.org/]
[2] The Mouse Genome Database: [http://www.informatics.jax.org]
[3] Saccharomyces Genome Database: [http://www.yeastgenome.org]
[4] Colosimo M, Morgan A, Yeh A, Colombe J and Hirschman L; Data Preparation and Interannotator Agreement: BioCreAtIvE Task 1B. BMC Bioinformatics 6(Suppl 1):S12 (24 May 2005)
[5] Hirschman L, Colosimo M, Morgan A and Yeh A; Overview of BioCreAtIvE task 1B: Normalized Gene Lists. BMC Bioinformatics 6(Suppl 1):S11 (24 May 2005)