BioCreative - Task 1A: Gene Mention Tagging

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative II

Task 1A: Gene Mention Tagging [2006-04-01]

The Gene Mention Tagging task is concerned with the named entity extraction of gene and gene product mentions in text.

Premise

Systems will be required to return the start and end indices corresponding to all the genes and gene products mentioned in a given MEDLINE sentence. This named entity task is a crucial first step for information extraction of relationships between genes and gene products.

System Input

The input file will consist of ASCII sentences, one per line. Each sentence will be preceded on the same line by a sentence identifier.
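As a minimal sketch of reading this input in Python: the page does not state how the identifier is separated from the sentence, so the code below assumes a single space after the identifier. The function names `parse_input_line` and `read_sentences` are illustrative and not part of the task distribution.

```python
def parse_input_line(line):
    """Split one input line into (sentence_id, sentence_text).

    Assumption: the identifier is everything before the first space
    and the sentence is everything after it.
    """
    stripped = line.rstrip("\n")
    sentence_id, _, text = stripped.partition(" ")
    return sentence_id, text


def read_sentences(lines):
    """Yield (sentence_id, sentence_text) pairs, skipping blank lines."""
    for line in lines:
        if line.strip():
            yield parse_input_line(line)
```

Any iterable of lines works, for example an open file handle: `for sid, text in read_sentences(open(path, encoding="ascii")): ...`.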

System Output

Each system must output an ASCII list of reported gene name mentions, one per line, and formatted as:

sentence-identifier-1|start-offset-1 end-offset-1|optional text... 
sentence-identifier-1|start-offset-2 end-offset-2|optional text... 
sentence-identifier-1|start-offset-3 end-offset-3|optional text... 
sentence-identifier-2|start-offset-1 end-offset-1|optional text... 
sentence-identifier-3|start-offset-1 end-offset-1|optional text... 
. 
. 
.

The sentence-identifier is from the sentence of the mention. Multiple mentions from the same sentence should appear on separate lines. A sentence is not required to have any mentions. The start-offset is the number of non-whitespace characters in the sentence preceding the first character of the mention, and the end-offset is the number of non-whitespace characters in the sentence preceding the last character of the mention. If you put anything after the vertical bar following the end-offset, it will be ignored by the evaluator.
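The offset definition above can be sketched in Python as follows. The sketch assumes the system already knows the mention's ordinary 0-based character positions in the sentence (first and last character, both inclusive). The helper names and the sample sentence are hypothetical.

```python
def mention_offsets(sentence, char_start, char_end):
    """Convert character positions to non-whitespace offsets.

    char_start and char_end are 0-based indices of the first and last
    character of the mention (both inclusive). Each returned offset is
    the number of non-whitespace characters preceding that character.
    """
    start = sum(1 for c in sentence[:char_start] if not c.isspace())
    end = sum(1 for c in sentence[:char_end] if not c.isspace())
    return start, end


def format_mention(sentence_id, sentence, char_start, char_end):
    """Build one output line; the mention text fills the optional field."""
    start, end = mention_offsets(sentence, char_start, char_end)
    text = sentence[char_start:char_end + 1]
    return f"{sentence_id}|{start} {end}|{text}"


sentence = "The p53 gene is mutated."
print(format_mention("S1", sentence, 4, 6))   # S1|3 5|p53
print(format_mention("S1", sentence, 4, 11))  # S1|3 9|p53 gene
```

In the second call the space inside "p53 gene" is not counted, so the end-offset is 9 rather than 10.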

Evaluation

System performance will be scored automatically by how well the generated gene/gene product list corresponds to one generated by human annotators. Acceptable alternatives to the gold standard names, also generated by human annotators, will count as true positives.
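To illustrate how acceptable alternatives can count as true positives, here is a simplified Python sketch. It is not the official evaluator. The page does not name the metrics, so the use of precision, recall and F-measure, exact span matching, the rule that each gold mention is credited at most once, and the data layout (tuples of identifier, start-offset, end-offset) are all assumptions made for illustration.

```python
def score_mentions(predicted, gold, alternatives=None):
    """Return (precision, recall, f_measure) for exact span matches.

    predicted, gold: iterables of (sentence_id, start, end) tuples.
    alternatives: dict mapping a gold tuple to a set of acceptable
    alternative tuples for that same mention (assumed layout).
    """
    alternatives = alternatives or {}
    gold = set(gold)
    # Map every acceptable span back to the gold mention it stands for.
    span_to_gold = {g: g for g in gold}
    for g, alts in alternatives.items():
        for alt in alts:
            span_to_gold.setdefault(alt, g)

    matched = set()
    false_pos = 0
    for p in set(predicted):
        g = span_to_gold.get(p)
        if g is not None and g not in matched:
            matched.add(g)      # first hit on this gold mention
        else:
            false_pos += 1      # no match, or mention already credited

    tp = len(matched)
    fn = len(gold) - tp
    precision = tp / (tp + false_pos) if tp + false_pos else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f
```

For example, with two gold mentions, one of which has an alternative span that the system reports, plus one spurious prediction, this sketch gives precision, recall and F-measure of 0.5 each.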

Data Selection and Annotation: Sentences were selected at random from MEDLINE; half of the sentences are likely to contain genes and gene products, based on similarity to sentences with known gene names. A small group of annotators trained in biochemistry, molecular biology and genetics searched through each sentence, identifying mentions of genes and gene products, along with acceptable alternatives.
To date 20,000 sentences have been annotated. 15,000 sentences were used previously in BioCreative, and will be released as training data.

Submission Guidelines

Participants are requested to halt all system development after they obtain the test data.

Participants should email their GM submissions to the mailing list
biocreative-gm-sub-2006@lists.sourceforge.net
as a .txt attachment.

Submissions are due Oct 15 (PPI subtask 1) or Oct 22 (all other tasks/subtasks).
By submitting results, the groups agree to have their submission made public in an anonymous form at the end of the evaluation (e.g. as was done with the BioCreAtIvE 1 Task 2 submissions).

By requesting the test data, you are committed to the submission of results for that task or sub-task. If, for some reason, after receiving the test data, you are unable to submit results for a given task or subtask, you should notify the organizers promptly, and provide an email explaining why you have been unable to submit; we also ask that you provide a commitment to delete your copy of the test data.

System Description

You must submit a short system description questionnaire (1-2 pages) by Oct 31; it is required in order to receive scores. The description should give an overview of the approach used - please follow the template below. If you wish, the description may be anonymous; the description will be linked by user ID to the results for the tasks, to be distributed at the workshop.

Groups will receive their scores and the gold standard data by mid-December at the contact email address they provided. We will provide each group with its own scores only - the full set of results will be made available at the BioCreAtIvE workshop and in the associated Proceedings.

Groups are requested not to publish results of their system on the gold standard data until after the workshop.

Submission File Naming

Please name your submission files in the same format so that we can keep everything organized.

The format is TeamId_BC2_Task(_Subtask)_Run.txt.

For example, Team 60 submitting 3 runs (the max for any task/subtask) to the GM task:

T60_BC2_GM_1.txt
T60_BC2_GM_2.txt
T60_BC2_GM_3.txt
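A small Python helper can build names in this pattern. This is only a sketch: the function name is hypothetical, and the "T" prefix on the team identifier is inferred from the example above and is not stated as a rule.

```python
MAX_RUNS = 3  # the maximum for any task/subtask, per the text above


def submission_filename(team_id, task, run, subtask=None):
    """Return a name of the form TeamId_BC2_Task(_Subtask)_Run.txt."""
    if not 1 <= run <= MAX_RUNS:
        raise ValueError(f"run must be between 1 and {MAX_RUNS}")
    parts = [f"T{team_id}", "BC2", task]
    if subtask:
        parts.append(str(subtask))
    parts.append(str(run))
    return "_".join(parts) + ".txt"


print(submission_filename(60, "GM", 1))  # T60_BC2_GM_1.txt
```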

System Description Template/Questionnaire

Please note that any information provided will be made publicly available, so if you wish to remain anonymous you do not need to be specific about proprietary system components (e.g. simply note things like "proprietary gene lexicon"). However, the research community benefits from participants being as explicit as possible in these descriptions, and complete disclosure is encouraged. If some information only pertains to a particular run, please note this.

1- Team identifier:.......
2- Which task does this describe (GN, GM or PPI):........
3- Please identify/describe any machine learning techniques used:..........
4- Please identify/describe any NLP techniques/components used:........
5- Please identify/describe any external (marked up text) training data used:.........
6- Please identify/describe any external lexical resources (terminology lists) used:........
7- Please describe any rule sets used:.........
8- If your system interacts with or uses data from any biological database(s), please describe:..........
9- Please identify/describe any other relevant resources used to train/develop your system:.........
10- Please describe the general data flow in your system:..........
11- Other information of interest:.........

GM Test Set Submission Format

We want to remind participants in the GM task that you are responsible for submitting result data in a valid format, as described in the file README.GM. In order to verify that your result data is valid, you should run your system on the training data and evaluate the output with the Perl script alt_eval.perl (in the train subdirectory, described in the file train/README).
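As a quick first check before running alt_eval.perl, a rough Python sketch like the one below can flag lines that do not look like `sentence-identifier|start-offset end-offset|optional text`. It is not a substitute for the official script. The regular expression reflects only the format shown on this page, and it assumes the second vertical bar may be omitted when there is no optional text; README.GM remains the authority.

```python
import re

# identifier | start end [| optional text]
LINE_RE = re.compile(r"^(?P<sid>[^|]+)\|(?P<start>\d+) (?P<end>\d+)(?:\|(?P<text>.*))?$")


def invalid_lines(lines):
    """Return (line_number, line) pairs that fail the format check."""
    bad = []
    for number, line in enumerate(lines, 1):
        line = line.rstrip("\n")
        match = LINE_RE.match(line)
        # A mention cannot end before it starts.
        if match is None or int(match["start"]) > int(match["end"]):
            bad.append((number, line))
    return bad
```

An empty result only means that every line is well-formed; it says nothing about whether the offsets point at the intended mentions.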

GM Test Set Sentence Identifiers

Additionally, systems should not make any assumptions about the contents or meaning of sentence identifiers in the test set. When you receive test data for the final run, sentence identifiers will be randomly assigned strings. We do not plan to release source information for the test sentences until after the evaluation is complete. (This statement is not meant to imply any other limits on resources or methods that may be used.)