BioCreative - Task 1B: Human Gene Normalizations

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative II

Task 1B: Human Gene Normalizations [2006-04-01]

Premise

Systems will be required to return the EntrezGene (formerly LocusLink) identifiers corresponding to the human genes and direct gene products appearing in a given MEDLINE abstract. This is relevant to improving document indexing and retrieval, and to linking text mentions to database identifiers in support of more sophisticated information extraction tasks. It is similar to Task 1B of BioCreAtIvE I [1].

System Input

Participating groups will be given a master list of human EntrezGene identifiers, with some common gene and protein names (synonyms) for each identifier in the list. For the evaluation task, the input is a collection of plain text abstracts.
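As a rough illustration of how the master list could be used, the sketch below does a naive case-insensitive dictionary lookup of synonyms in an abstract. The in-memory lexicon layout (identifier mapped to a list of synonyms), the function name find_gene_mentions, and the identifiers in the usage note are all assumptions made for this example; the distributed lexicon file may be organised differently, and a competitive system would need much more than exact matching.

```python
import re
from typing import Dict, List, Tuple


def find_gene_mentions(abstract: str, lexicon: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return one (identifier, mention text) pair per lexicon entry found.

    `lexicon` is a hypothetical in-memory form of the master list:
    identifier -> list of synonyms.
    """
    results = []
    for gene_id, synonyms in lexicon.items():
        # Try longer synonyms first so the most specific name is reported.
        for synonym in sorted(synonyms, key=len, reverse=True):
            # Require that the match is not embedded in a longer word.
            pattern = r"(?<!\w)" + re.escape(synonym) + r"(?!\w)"
            match = re.search(pattern, abstract, flags=re.IGNORECASE)
            if match:
                # Keep the text exactly as it appears in the abstract.
                results.append((gene_id, match.group(0)))
                break  # one mention per identifier is enough
    return results
```

For example, with the made-up lexicon {"987": ["foobar", "FB1"]}, calling find_gene_mentions("We show that FooBar binds DNA.", lexicon) pairs identifier 987 with the excerpt "FooBar" as it is written in the text.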

System Output

For each abstract, the system will return a list of the EntrezGene identifiers, each with a corresponding text excerpt, for the human genes and gene products mentioned in the abstract. The excerpt required is a single mention of the gene's 'name' found in the abstract. Even if a gene is mentioned in several different places in an abstract under alternate names, only a single excerpt/mention is to be returned by the system. Although the hand annotated training file contains multiple text excerpts for each identifier, that is just meant to aid in training and only one is expected from a participating system (any one of the set would be 'correct', although getting the right text is not the main part of the evaluation).

The return format is a single file, with one entry per line and the fields delimited by tabs. The columns should be: PUBMED ID, EntrezGene (LocusLink) ID, Mention Text, and optionally Confidence. There should be no column headers or line numbers. If desired, groups may include the optional fourth column, which contains a confidence measure that ranges from 0 (no confidence) to 1 (absolute confidence). This is not a part of the main evaluation, and is included as an option for interested groups at the request of some participants. An example line with made up identifiers follows:

123456 987 foobar

If the optional confidence value is included:

123456 987 foobar .87
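The format described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name format_predictions, the tuple layout of a prediction, and the choice to keep the first mention seen and to print the confidence with two decimals are not part of the task definition.

```python
from typing import Iterable, List, Optional, Tuple

# Hypothetical prediction layout: (PubMed ID, EntrezGene ID, mention text, confidence or None)
Prediction = Tuple[str, str, str, Optional[float]]


def format_predictions(predictions: Iterable[Prediction],
                       with_confidence: bool = False) -> List[str]:
    """Build tab-delimited submission lines, one per (abstract, gene) pair."""
    seen = set()
    lines = []
    for pmid, gene_id, mention, confidence in predictions:
        key = (pmid, gene_id)
        if key in seen:
            continue  # only a single mention per gene per abstract is returned
        seen.add(key)
        fields = [pmid, gene_id, mention]
        if with_confidence:
            # The optional fourth column must lie between 0 and 1.
            if confidence is None or not 0.0 <= confidence <= 1.0:
                raise ValueError(f"confidence for {key} must be in [0, 1]")
            fields.append(f"{confidence:.2f}")
        lines.append("\t".join(fields))  # no header row, no line numbers
    return lines
```

Writing "\n".join(format_predictions(...)) to a single file would then give one entry per line, with duplicates of the same identifier within an abstract dropped.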

Evaluation

System performance will be evaluated on how well the generated EntrezGene identifier list corresponds to one generated by human annotators. To better understand which variables affect system performance, we will also try to look at how various features (e.g. term length, or the variation of annotated terms from those in the lexicon) impact performance. We are releasing a preliminary scoring script with the distributed data, but we will score the main evaluation on the single task of returning the list of gene identifiers for each abstract. If participants have alternate techniques for understanding system performance (such as using the optional confidence scores, which were not part of the original scoring script), we will try to include them as appropriate. We hope in this way both to identify optimal techniques for achieving the main task and to increase our understanding of what factors impact performance.
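To make the comparison of identifier lists concrete, here is a minimal sketch that counts matches between a system's identifiers and the annotators' identifiers per abstract and reports precision, recall and F-measure pooled over all abstracts. The choice of these measures, the pooling, and the function name score_gene_lists are assumptions for illustration; this is not the preliminary scoring script distributed with the data.

```python
from typing import Dict, Set, Tuple


def score_gene_lists(gold: Dict[str, Set[str]],
                     predicted: Dict[str, Set[str]]) -> Tuple[float, float, float]:
    """Compare identifier sets keyed by PubMed ID; return (precision, recall, F)."""
    tp = fp = fn = 0
    for pmid in set(gold) | set(predicted):
        gold_ids = gold.get(pmid, set())
        pred_ids = predicted.get(pmid, set())
        tp += len(gold_ids & pred_ids)  # identifiers the system got right
        fp += len(pred_ids - gold_ids)  # identifiers returned but not annotated
        fn += len(gold_ids - pred_ids)  # annotated identifiers that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure
```

Because only the identifier sets are compared, the mention text and the optional confidence column play no role in this sketch, in line with the statement that the main evaluation is on the identifier list alone.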

Data Selection and Annotation

Abstracts were selected from those annotated by EBI's Human GOA [2] group, since this selection is assumed to be enriched in mentions of human genes and gene products. A small group of annotators trained in molecular biology searched through the abstract text (and title), identifying mentions of genes and gene products and using UniProt and the NCBI Gene interface to find the corresponding EntrezGene identifier. Inter-annotator agreement was measured at over 90%. We will release a hand annotated training/development set of 281 abstracts, and we anticipate another 250-275 to be used in the evaluation. We have also compiled a lexicon for the human EntrezGene identifiers using common gene/protein name sources, which will be released along with the training data. Participating groups may wish to compile their own lexical resources or discover ways to prune the provided lexicon.

Five thousand abstracts from the GOA annotation set will be released along with the EntrezGene identifiers that correspond to the EBI GOA annotations. These have been derived using the UniProt-to-EntrezGene mapping from PIR [3] and may provide useful, if noisy, training data. However, this dataset has a number of limitations: most genes and proteins mentioned in the abstracts are not recorded, and the annotations, which were made against UniProt, do not map completely onto EntrezGene. Participants are requested not to download or use the EBI human GOA annotations on their own.

Funding

The MITRE contribution to this work is based upon work supported by the National Science Foundation under Grant No. 0640153. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

References

[1] Hirschman L., et al., Overview of BioCreAtIvE task 1B: normalized gene lists. BMC Bioinformatics, 2005. 6 Suppl 1: p. S11.
[2] Camon E., et al., The Gene Ontology Annotation (GOA) Database--an integrated resource of GO annotations to the UniProt Knowledgebase. In Silico Biol, 2004. 4(1): p. 5-6.
[3] Barker W.C., et al., The protein information resource (PIR). Nucleic Acids Res, 2000. 28(1): p. 41-4.