CEMP (chemical entity mention in patents) task
General description
The setting of the CEMP task is very similar to that of the CEM task of BioCreative IV. It requires the detection of chemical named entity mentions, but for BioCreative V we use patent abstracts instead of PubMed abstracts. Teams will receive a training set and a development set to build their predictors, and a blinded test set for which they have to submit predictions. These predictions will be evaluated against manual annotations, which will be released only after the test set submission deadline.
Participating teams have to correctly detect the start and end indices of all the chemical entity mentions. Chemical entities have been manually annotated by domain experts following well-defined annotation guidelines (the CEMP annotation guidelines). These guidelines are similar to the ones used for the CHEMDNER task at BioCreative IV, but they have been changed and updated to make them more suitable for the annotation of patent data.
Patent abstract records
Participating teams get three files to train, develop and tune their systems. One of them contains the actual patent abstract texts: plain-text, UTF-8-encoded patent abstracts in a tab-separated format with the following three columns:
1- Patent identifier
2- Title of the patent
3- Abstract of the patent
An example record can be seen below.
CA2119782C Carbamate analogs of thiaphysovenine, pharmaceutical compositions, and method for inhibiting cholinesterases Substituted carbamates of tricyclic compounds which have a cyclic sulfer atom, having the formula:(See formula I) wherein R1 is H or a linear or branched chain C1- C10 alkyl group; and R2 is selected from the group consisting of a linear or branched chain -C1-C10 alkyl group, and (See formula I) wherein R3 and R4 are independently selected from the group consisting of H and a linear or branched chain C1-C10 -alkyl group;and with the proviso that when one of R1 or R2 is a H or a methyl group the other of R1 or R2 is not H and optical isomers of the 3aS series, provide highly potent and selective cholinergic agonist and blocking activity and are useful as pharmaceutical agents. Cholinergic disease are treated with these compounds such as glaucoma, Myasthenia Gravis, Alzheimer's disease. Methods for inhibiting esterases, acetylcholinesterase and butyryl-cholinesterase are also provided.
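A minimal sketch of how such a file could be read, assuming one record per line and exactly three tab-separated columns as described above. The function name `read_patent_records` is illustrative and not part of any task distribution, and the sample title and abstract are shortened from the example record.

```python
import io

def read_patent_records(handle):
    """Yield (patent_id, title, abstract) tuples from a tab-separated stream."""
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_number}: expected 3 columns, got {len(fields)}")
        yield tuple(fields)

# Shortened sample record; a real file would be opened with encoding="utf-8".
sample = io.StringIO(
    "CA2119782C\tCarbamate analogs of thiaphysovenine\tSubstituted carbamates of tricyclic compounds\n"
)
records = list(read_patent_records(sample))
```

Splitting on the tab character directly, rather than using a CSV dialect, avoids any special treatment of quote characters that may occur in patent text.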
Detailed chemical mention annotations
Participating teams will also get a file with detailed chemical mention annotations, manually classified into one of the following chemical mention classes: abbreviation (short forms of chemical names, including abbreviations and acronyms); formula (molecular formulas); identifier (chemical database identifiers); systematic (IUPAC names of chemicals); trivial (common names of chemicals and trademark names); family (chemical families with a defined structure); and multiple (non-continuous mentions of chemicals in text).
This annotation file consists of tab-separated fields containing:
1- Patent identifier
2- Type of text from which the annotation was derived (T: Title, A: Abstract)
3- Start offset
4- End offset
5- Text string of the entity mention
6- Type of chemical entity mention (ABBREVIATION, FAMILY, FORMULA, IDENTIFIERS, MULTIPLE, SYSTEMATIC, TRIVIAL)
An example annotation can be seen below.
CA2119782C	A	12	22	carbamates	FAMILY
CA2119782C	A	128	129	H	FORMULA
CA2119782C	A	160	173	C1- C10 alkyl	FAMILY
CA2119782C	A	256	269	-C1-C10 alkyl	FAMILY
CA2119782C	A	371	372	H	FORMULA
CA2119782C	A	404	417	C1-C10 -alkyl	FAMILY
CA2119782C	A	476	477	H	FORMULA
CA2119782C	A	483	489	methyl	SYSTEMATIC
CA2119782C	A	525	526	H	FORMULA
CA2119782C	A	59	72	cyclic sulfer	SYSTEMATIC
CA2119782C	T	0	9	Carbamate	FAMILY
CA2119782C	T	21	36	thiaphysovenine	TRIVIAL
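The example annotations are consistent with zero-based start offsets and exclusive end offsets, counted separately for the title and the abstract. The sketch below checks this on four of the mentions from the example, assuming that offsets count Unicode characters rather than bytes. The names `Mention`, `parse_annotation_line` and `mention_matches` are illustrative only, and the abstract is cut off after the last mention that is checked.

```python
from typing import NamedTuple

class Mention(NamedTuple):
    patent_id: str
    section: str   # "T" for title, "A" for abstract
    start: int
    end: int
    text: str
    cem_type: str

def parse_annotation_line(line):
    """Parse one six-column, tab-separated annotation line."""
    patent_id, section, start, end, text, cem_type = line.rstrip("\n").split("\t")
    return Mention(patent_id, section, int(start), int(end), text, cem_type)

def mention_matches(mention, title, abstract):
    """Return True if slicing the right text with the offsets gives the mention string."""
    source = title if mention.section == "T" else abstract
    return source[mention.start:mention.end] == mention.text

title = ("Carbamate analogs of thiaphysovenine, pharmaceutical compositions, "
         "and method for inhibiting cholinesterases")
# Only the beginning of the abstract, enough for the two abstract mentions below.
abstract_start = "Substituted carbamates of tricyclic compounds which have a cyclic sulfer atom"

lines = [
    "CA2119782C\tA\t12\t22\tcarbamates\tFAMILY",
    "CA2119782C\tA\t59\t72\tcyclic sulfer\tSYSTEMATIC",
    "CA2119782C\tT\t0\t9\tCarbamate\tFAMILY",
    "CA2119782C\tT\t21\t36\tthiaphysovenine\tTRIVIAL",
]
mentions = [parse_annotation_line(line) for line in lines]
all_ok = all(mention_matches(m, title, abstract_start) for m in mentions)
```

A check of this kind is a cheap way to catch offset drift caused by tokenization or whitespace normalization before predictions are submitted.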
Annotation evaluation file
The evaluation file contains only the chemical mention offsets and will be used to evaluate team predictions. We do not ask teams to provide the type of each chemical mention; only the mention offset predictions will be assessed. The file consists of tab-separated columns containing:
1- Patent identifier
2- Offset string consisting of a triplet joined by the ':' character: the text type (T: Title, A: Abstract), the start offset and the end offset.
An example of the evaluation annotations can be seen below.
CA2119782C	A:12:22
CA2119782C	A:128:129
CA2119782C	A:160:173
CA2119782C	A:256:269
CA2119782C	A:371:372
CA2119782C	A:404:417
CA2119782C	A:476:477
CA2119782C	A:483:489
CA2119782C	A:525:526
CA2119782C	A:59:72
CA2119782C	T:0:9
CA2119782C	T:21:36
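The evaluation lines carry the same information as the first four columns of the detailed annotation file. A minimal sketch of that mapping, assuming the six-column layout described earlier; the function name `to_eval_line` is illustrative.

```python
def to_eval_line(annotation_line):
    """Turn a detailed annotation line into 'patent_id<TAB>section:start:end'."""
    fields = annotation_line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError("expected at least 4 tab-separated columns")
    patent_id, section, start, end = fields[:4]
    return f"{patent_id}\t{section}:{start}:{end}"

eval_line = to_eval_line("CA2119782C\tA\t12\t22\tcarbamates\tFAMILY")
```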
CEMP task prediction format
For the CEMP task we will only request the prediction of the chemical mention offsets, following a setting similar to that of the BioCreative IV CHEMDNER task on PubMed abstracts. Given a set of patent abstracts, the participants have to return the start and end indices corresponding to all the chemical entities mentioned in each document. The prediction file consists of tab-separated columns containing:
1- Patent identifier
2- Offset string consisting of a triplet joined by the ':' character: the text type (T: Title, A: Abstract), the start offset and the end offset.
3- The rank of the chemical entity returned for this document
4- A confidence score
5- The string of the chemical entity mention
An example illustrating the prediction format is shown below:
WO2009026621A1	A:12:24	1	0.99	paliperidone
WO2011115938A1	T:0:17	1	0.99	Spiro-tetracyclic
WO2011115687A2	A:0:12	1	0.99	SP-B
WO2011115687A2	T:0:22	2	0.98989	Alkylated
WO2011115687A2	A:104:117	3	0.98978	SP-B
US20050101595	A:0:13	1	0.99	Aminothiazole
US20050101595	A:60:67	2	0.98989	2-amino
US20050101595	T:0:50	3	0.98978	N-containing
US20050101595	A:29:52	4	0.98967	N-containing
WO2010147138A1	A:252:262	1	0.99	nucleotide
WO2010147138A1	A:363:373	2	0.98989	amino
WO2010147138A1	A:92:102	3	0.98978	fatty
CN103087254A	A:196:218	1	0.99	stearyl
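A minimal sketch of how prediction lines in this five-column format could be produced. It assumes that ranks restart at 1 for every patent and follow descending confidence, which is what the example suggests but is not stated explicitly. The function name `format_predictions` and the confidence scores are hypothetical; the mentions are taken from the CA2119782C annotation example.

```python
from collections import defaultdict

def format_predictions(predictions):
    """Format (patent_id, section, start, end, score, text) tuples as prediction lines.

    Mentions are grouped by patent and ranked by descending score within each patent.
    """
    by_patent = defaultdict(list)
    for prediction in predictions:
        by_patent[prediction[0]].append(prediction)

    lines = []
    for patent_id, group in by_patent.items():
        ranked = sorted(group, key=lambda p: p[4], reverse=True)
        for rank, (_, section, start, end, score, text) in enumerate(ranked, start=1):
            lines.append(f"{patent_id}\t{section}:{start}:{end}\t{rank}\t{score}\t{text}")
    return lines

# Hypothetical system output with made-up confidence scores.
predicted = [
    ("CA2119782C", "T", 21, 36, 0.8, "thiaphysovenine"),
    ("CA2119782C", "A", 12, 22, 0.95, "carbamates"),
]
prediction_lines = format_predictions(predicted)
```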
CEMP evaluation script (official)
The evaluation will be done using the BioCreative evaluation script available at: http://www.biocreative.org/resources/biocreative-ii5/evaluation-library/
In this case the INT - article classification format option will be used.
Example command:
bc-evaluate --INT team_cemp_prediction.tsv chemdner_cemp_gold_standard_train_eval.tsv > team_cemp_prediction.eval
where
--INT corresponds to the required evaluation option
team_cemp_prediction.tsv corresponds to the prediction file
chemdner_cemp_gold_standard_train_eval.tsv corresponds to the evaluation file
If you have problems with the required prediction format, run bc-evaluate with the --debug flag to find out what is wrong.
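For a quick local sanity check before running the official script, exact-match precision, recall and F-score over the offset strings can be approximated as below. This is only a sketch under the assumption that a prediction counts as correct when its patent identifier and offset triplet both match a gold entry exactly. It is not the bc-evaluate implementation, it ignores ranks and confidence scores, and its numbers may differ from the official output. The function names and the sample predictions are illustrative.

```python
def load_offsets(lines):
    """Collect (patent_id, offset_string) pairs from the first two tab-separated columns."""
    pairs = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            pairs.add((fields[0], fields[1]))
    return pairs

def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and balanced F-score over two sets of pairs."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

gold = load_offsets([
    "CA2119782C\tA:12:22",
    "CA2119782C\tT:0:9",
    "CA2119782C\tT:21:36",
])
# Hypothetical predictions: two correct offsets and one spurious one.
pred = load_offsets([
    "CA2119782C\tA:12:22\t1\t0.99\tcarbamates",
    "CA2119782C\tT:21:36\t2\t0.98\tthiaphysovenine",
    "CA2119782C\tA:23:25\t3\t0.5\tof",
])
p, r, f = precision_recall_f1(gold, pred)
```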