BioCreative - GN: Gene Normalization

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative III

GN: Gene Normalization [2010-03-31]

Organizers: Lu/Wilbur

The gene normalization (GN) task in BioCreative III is similar to the GN tasks in previous BioCreative challenges [1, 2] in that the goal is to link genes or gene products mentioned in the literature to standard database identifiers. However, two significant characteristics make this year’s challenge unique:

Instead of using abstracts, full-length articles are provided.
Instead of being species-specific, no species information is provided.

Both changes bring the challenge closer to the real curation task for a model organism database.

System Input and Output

The task is for participating teams to return a list of all the gene ids (Entrez Gene ids) for a given full-length article.

Training Data Selection and Annotation

Participants will have a collection of training data to work with so that they can tune their systems for optimal performance. The text of the articles is available as high-quality XML (or as PDF, if preferred) from selected journals in PubMed Central. The plan is to make a large number of full-text documents available for download for training and development in early April 2010. The training set will include two sets of annotated full-length articles:

A small number (approximately 30) of articles fully annotated by a group of trained and experienced curators invited from various model organism databases. For each article in this set, a list of normalized Entrez Gene ids will be provided.
A large number (approximately 500) of partially annotated articles. That is, not all genes mentioned in an article are annotated; only the most important ones that fall within the scope of curation are annotated, by human indexers at the National Library of Medicine. We have noted that most, though not all, of the annotated genes are taken from the abstracts. This does not necessarily mean that the remainder of the text is useless. Presumably the full text can help to decide which genes are most important in the paper, and can supply additional information, such as species information, to improve the prediction of the gene id.

Evaluation

The test data will consist of several hundred selected full-text articles, to be made available in early July, with a period of one week to process and return answers. We anticipate manually annotating at least 50 of these full-text articles (by the same group of curators, following the same annotation guidelines used for the training documents) and using pooled results from team submissions to evaluate and compare different systems.

We will use the Threshold Average Precision (TAP-k) to evaluate team submissions. The measure is closely related to the well-known average precision in information retrieval. In addition, it reflects the practice in bioinformatics of using system output scores to determine relevant results. Thus, for the GN task, we require teams to return Entrez Gene identifiers accompanied by corresponding confidence scores. We will compute TAP-5, TAP-10, and TAP-20 scores for each submitted run (each team can submit at most 3 runs). Here is an introduction to the TAP-k metric with 3 examples. For more details, we refer readers to the original paper by Carroll et al., 2010. The measure can be computed at the following website: http://www.ncbi.nlm.nih.gov/CBBresearch/Spouge/html.ncbi/tap/
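
Since TAP-k is described as closely related to average precision, a minimal Python sketch of plain (unthresholded) average precision for a single article may help fix ideas. This is not the TAP-k computation itself, which additionally applies a score threshold as defined by Carroll et al.; the function name, gene ids, and scores below are illustrative assumptions, and official scores should come from the organizers' tools.

```python
from typing import Iterable, Set, Tuple


def average_precision(ranked: Iterable[Tuple[str, float]], gold: Set[str]) -> float:
    """Plain average precision for one article (NOT the thresholded TAP-k).

    ranked: (gene id, confidence score) pairs returned by a system.
    gold:   gene ids annotated for the article.
    """
    # Rank predictions from highest to lowest confidence score.
    ordered = sorted(ranked, key=lambda pair: pair[1], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (gene_id, _score) in enumerate(ordered, start=1):
        if gene_id in gold:
            hits += 1
            # Precision at the rank of each correctly returned gene id.
            precision_sum += hits / rank
    # Normalize by the number of annotated gene ids, so missed ids lower the score.
    return precision_sum / len(gold) if gold else 0.0


# Illustrative values only: two annotated ids, three predictions.
example = average_precision(
    [("1017", 0.9), ("7157", 0.8), ("672", 0.4)],
    {"1017", "672"},
)
```

Here the correct ids sit at ranks 1 and 3, so the value is (1/1 + 2/3) / 2. The confidence scores matter only through the ranking they induce in this sketch, whereas TAP-k also uses them to set a cutoff.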

BioCreative III GN Task Submission Instructions

Team ID: a team identifier was assigned when you registered at the BioCreative Website. Please find this id on your team page as you will need it for naming your submission files.

Test data: A total of 500 full-text articles (from various BMC and PLoS journals) will be posted at the BioCreative Website for testing purposes on June 28, 2010. As with the training data, both XML and PDF formats will be provided.

Deadline: GN results are due Friday July 2, 2010 (you have until 11:59 p.m. in the time zone of your choice).

GN results: each team can submit at most 3 runs. In each submitted run, GN results need to be returned in 3 tab-delimited fields in a separate text file:

PMC article identifier
Entrez Gene identifier
Ranking score

Please sort your results numerically, first by the PMC id and then by the ranking score, both in descending order. It is your responsibility to submit results in a valid format, as described above. To verify that your result data is valid, you should run your system on the training data and evaluate the output with the Perl script “eval.pl” in the folder tap_1.1.
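
The output format and sort order described above can be sketched in Python as follows. The sketch assumes that PMC identifiers are handled as plain integers and ranking scores as floats; the name format_run and the sample values are hypothetical, and the result should still be checked with eval.pl as instructed.

```python
from typing import List, Tuple

# (PMC article id as an integer, Entrez Gene id, ranking score); an assumed layout.
Prediction = Tuple[int, str, float]


def format_run(predictions: List[Prediction]) -> str:
    """Return one tab-delimited line per prediction.

    Lines are sorted by PMC id, then by ranking score, both descending.
    """
    ordered = sorted(predictions, key=lambda p: (p[0], p[2]), reverse=True)
    lines = [f"{pmc}\t{gene}\t{score}" for pmc, gene, score in ordered]
    return "".join(line + "\n" for line in lines)


# Illustrative values only.
sample = format_run([
    (2000, "672", 0.4),
    (3000, "1017", 0.2),
    (2000, "7157", 0.9),
    (3000, "7157", 0.8),
])
```

Sorting on the (PMC id, score) pair with reverse=True keeps all lines for one article together and places the highest-scoring gene id first within each article.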

How to submit: please name your submission files according to the following pattern, with your team ID after "t" and the run number after "r":
bc3gn_t_r.txt
For instance, the second submitted run from Team #18 should be named as:
bc3gn_t18_r2.txt
Please email your results (as .txt attachments) to:

Zhiyong Lu (luzh@ncbi.nlm.nih.gov) AND
John Wilbur (wilbur@ncbi.nlm.nih.gov)
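
The file naming convention above can be sketched as a small Python helper. The pattern is inferred from the Team #18 example, and the function name submission_filename is hypothetical; the limit of 3 runs per team comes from the instructions.

```python
def submission_filename(team_id: int, run: int) -> str:
    """Build a submission file name such as bc3gn_t18_r2.txt."""
    if team_id < 0:
        raise ValueError("team_id must be a non-negative integer")
    if not 1 <= run <= 3:
        # Each team can submit at most 3 runs.
        raise ValueError("run must be 1, 2, or 3")
    return f"bc3gn_t{team_id}_r{run}.txt"


name = submission_filename(18, 2)  # "bc3gn_t18_r2.txt"
```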

System description: Participating teams are required to submit a short (1-2 pages) system description by July 16. The description should provide an overview of the approach used – please follow the template/questionnaire below. The description will be linked by team ID to the results for the GN task, to be distributed at the workshop. The description must be submitted to receive scores.

Team identifier
Please identify/describe any machine learning techniques used.
Please identify/describe any NLP techniques/components used.
Please identify/describe any external (marked up text) training data used.
Please identify/describe any external lexical resources (terminology lists) used.
Please describe any rule sets used.
Please describe if your system interacts with or uses data from any biological database(s).
Please identify/describe any other relevant resources used to train/develop your system.
Please describe the general data flow in your system.
Other information of interest.

References:

[1] Hirschman L, et al. Overview of BioCreAtIvE task 1B: normalized gene lists. BMC Bioinformatics, 2005; 6 Suppl 1:S11.
[2] Morgan AA, et al. Overview of BioCreative II gene normalization. Genome Biology, 2008; 9 Suppl 2:S3.