BioCreative - CEMP detailed task description

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative V

CEMP detailed task description [2015-07-01]

CEMP (chemical entity mention in patents) task

General description

The setting of the CEMP task is very similar to that of the CEM task of BioCreative IV. It requires the detection of chemical named entity mentions, but for BioCreative V patent abstracts are used instead of PubMed abstracts. Teams will receive a training set and a development set to build their predictor, and a blinded test set for which they have to submit predictions. These predictions will be evaluated against manual annotations, which will be released only after the test set submission deadline.
Participating teams have to correctly detect the start and end indices of all the chemical entities. Chemical entities have been manually annotated by domain experts using well-defined annotation guidelines (CEMP annotation guidelines). Those guidelines are similar to the ones used for the CHEMDNER task at BioCreative IV, but they include some changes and updates that make them more suitable for the annotation of patent data.

Patent abstract records

Participating teams get three files to train, develop and tune their systems. One of them contains the actual patent abstract texts. This file contains plain-text, UTF-8-encoded patent abstracts in a tab-separated format with the following three columns:

1- Patent identifier
2- Title of the patent
3- Abstract of the patent

An example record can be seen below.

CA2119782C	Carbamate analogs of thiaphysovenine, pharmaceutical compositions, and method for inhibiting cholinesterases	Substituted carbamates of tricyclic compounds which have a cyclic sulfer atom, having the formula:(See formula I) wherein R1 is H or a linear or branched chain C1- C10 alkyl group; and R2 is selected from the group consisting of a linear or branched chain -C1-C10 alkyl group, and (See formula I) wherein R3 and R4 are independently selected from the group consisting of H and a linear or branched chain C1-C10 -alkyl group;and with the proviso that when one of R1 or R2 is a H or a methyl group the other of R1 or R2 is not H and optical isomers of the 3aS series, provide highly potent and selective cholinergic agonist and blocking activity and are useful as pharmaceutical agents. Cholinergic disease are treated with these compounds such as glaucoma, Myasthenia Gravis, Alzheimer's disease. Methods for inhibiting esterases, acetylcholinesterase and butyryl-cholinesterase are also provided.
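
The three-column layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the official task material: the names PatentRecord and read_patent_records are hypothetical, and it assumes one record per line with exactly two tab characters separating the columns.

```python
from typing import Dict, Iterable, NamedTuple


class PatentRecord(NamedTuple):
    patent_id: str
    title: str
    abstract: str


def read_patent_records(lines: Iterable[str]) -> Dict[str, PatentRecord]:
    """Parse tab-separated lines into records keyed by patent identifier."""
    records = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        # Split on the first two tabs only, so the abstract is kept intact.
        patent_id, title, abstract = line.split("\t", 2)
        records[patent_id] = PatentRecord(patent_id, title, abstract)
    return records
```

To read a file, pass an open handle, for example open(path, encoding="utf-8"), where path is the location of the downloaded abstracts file. Plain string splitting is used here instead of a CSV parser so that quote characters inside abstracts are left untouched.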

Detailed chemical mention annotations

Participating teams will also get a file with detailed chemical mention annotations manually classified into one of the following chemical mention classes: abbreviation (short form of chemical names including abbreviations and acronyms), formula (molecular formulas), identifier (chemical database identifiers), systematic (IUPAC names of chemicals), trivial (common names of chemicals and trademark names), family (chemical families with a defined structure) and multiple (non-continuous mentions of chemicals in text).

This annotation file consists of tab-separated fields containing:

1- Patent identifier
2- Type of text from which the annotation was derived (T: Title, A: Abstract)
3- Start offset
4- End offset
5- Text string of the entity mention
6- Type of chemical entity mention (ABBREVIATION,FAMILY,FORMULA,IDENTIFIERS,MULTIPLE,SYSTEMATIC,TRIVIAL)

An example annotation can be seen below.

CA2119782C	A	12	22	carbamates	FAMILY
CA2119782C	A	128	129	H	FORMULA
CA2119782C	A	160	173	C1- C10 alkyl	FAMILY
CA2119782C	A	256	269	-C1-C10 alkyl	FAMILY
CA2119782C	A	371	372	H	FORMULA
CA2119782C	A	404	417	C1-C10 -alkyl	FAMILY
CA2119782C	A	476	477	H	FORMULA
CA2119782C	A	483	489	methyl	SYSTEMATIC
CA2119782C	A	525	526	H	FORMULA
CA2119782C	A	59	72	cyclic sulfer	SYSTEMATIC
CA2119782C	T	0	9	Carbamate	FAMILY
CA2119782C	T	21	36	thiaphysovenine	TRIVIAL
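
Judging from the example above, the offsets are zero-based character positions with an exclusive end offset, counted separately within the title and the abstract. The sketch below checks this on four of the example annotations; the names Mention, parse_annotation_line and offsets_match are hypothetical, and the abstract is truncated to its first words for brevity.

```python
from typing import NamedTuple


class Mention(NamedTuple):
    patent_id: str
    section: str  # "T" for title, "A" for abstract
    start: int
    end: int
    text: str
    mention_type: str


def parse_annotation_line(line: str) -> Mention:
    """Split one six-column annotation line into a Mention."""
    patent_id, section, start, end, text, mention_type = line.rstrip("\r\n").split("\t")
    return Mention(patent_id, section, int(start), int(end), text, mention_type)


def offsets_match(mention: Mention, title: str, abstract: str) -> bool:
    """Check that slicing the source text with the offsets yields the mention string."""
    source = title if mention.section == "T" else abstract
    return source[mention.start:mention.end] == mention.text


title = ("Carbamate analogs of thiaphysovenine, pharmaceutical compositions, "
         "and method for inhibiting cholinesterases")
# Only the beginning of the abstract is needed for the mentions checked here.
abstract = "Substituted carbamates of tricyclic compounds which have a cyclic sulfer atom"

lines = [
    "CA2119782C\tA\t12\t22\tcarbamates\tFAMILY",
    "CA2119782C\tA\t59\t72\tcyclic sulfer\tSYSTEMATIC",
    "CA2119782C\tT\t0\t9\tCarbamate\tFAMILY",
    "CA2119782C\tT\t21\t36\tthiaphysovenine\tTRIVIAL",
]
mentions = [parse_annotation_line(line) for line in lines]
all_ok = all(offsets_match(m, title, abstract) for m in mentions)
```

A check of this kind is useful before training, since a tokenizer or text-cleaning step that changes character positions would make the annotations and the text disagree.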

Annotation evaluation file

Teams also get an evaluation file that contains only the chemical mention offsets. It will be used to evaluate team predictions: we do not ask teams to provide the type of each chemical mention, and only the mention offset predictions will be assessed.

It consists of tab-separated columns containing:

1- Patent identifier
2- Offset string consisting of a triplet joined by the ':' character. You have to provide the text type (T: Title, A: Abstract), the start offset and the end offset.

An example of the evaluation annotations can be seen below.

CA2119782C	A:12:22
CA2119782C	A:128:129
CA2119782C	A:160:173
CA2119782C	A:256:269
CA2119782C	A:371:372
CA2119782C	A:404:417
CA2119782C	A:476:477
CA2119782C	A:483:489
CA2119782C	A:525:526
CA2119782C	A:59:72
CA2119782C	T:0:9
CA2119782C	T:21:36
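
Each line of the evaluation file carries the same identifier, text type and offsets as the corresponding line of the detailed annotation file. A minimal sketch of that mapping, assuming the six-column annotation layout described earlier (the function name annotation_to_eval_line is hypothetical):

```python
def annotation_to_eval_line(annotation_line: str) -> str:
    """Turn a six-column annotation line into a two-column evaluation line."""
    patent_id, section, start, end, _text, _type = annotation_line.rstrip("\r\n").split("\t")
    # The mention string and the mention type are dropped; only offsets remain.
    return f"{patent_id}\t{section}:{start}:{end}"


example = annotation_to_eval_line("CA2119782C\tA\t12\t22\tcarbamates\tFAMILY")
```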

CEMP task prediction format

For the CEMP task we will only request the prediction of the chemical mention offsets, following a setting similar to that of the BioCreative IV CHEMDNER task on PubMed abstracts. Given a set of patent abstracts, the participants have to return the start and end indices of all the chemical entities mentioned in each document.

It consists of tab-separated columns containing:

1- Patent identifier
2- Offset string consisting of a triplet joined by the ':' character. You have to provide the text type (T: Title, A: Abstract), the start offset and the end offset.
3- The rank of the chemical entity returned for this document
4- A confidence score
5- The string of the chemical entity mention

An example illustrating the prediction format is shown below:

WO2009026621A1	A:12:24	1	0.99	paliperidone
WO2011115938A1	T:0:17	1	0.99	Spiro-tetracyclic
WO2011115687A2	A:0:12	1	0.99	SP-B
WO2011115687A2	T:0:22	2	0.98989	Alkylated
WO2011115687A2	A:104:117	3	0.98978	SP-B
US20050101595	A:0:13	1	0.99	Aminothiazole
US20050101595	A:60:67	2	0.98989	2-amino
US20050101595	T:0:50	3	0.98978	N-containing
US20050101595	A:29:52	4	0.98967	N-containing
WO2010147138A1	A:252:262	1	0.99	nucleotide
WO2010147138A1	A:363:373	2	0.98989	amino
WO2010147138A1	A:92:102	3	0.98978	fatty
CN103087254A	A:196:218	1	0.99	stearyl
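
The five-column prediction lines can be produced along the following lines. This is only a sketch: the function name format_predictions and the input tuple layout are hypothetical, and ranking mentions within each document by descending confidence is an assumption that is consistent with the example above but is not stated explicitly in the task description.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# (patent_id, text type "T"/"A", start, end, confidence, mention string)
Prediction = Tuple[str, str, int, int, float, str]


def format_predictions(predictions: Iterable[Prediction]) -> List[str]:
    """Group predictions by document and emit ranked, tab-separated lines."""
    by_doc = defaultdict(list)
    for pred in predictions:
        by_doc[pred[0]].append(pred)

    lines = []
    for patent_id, preds in by_doc.items():
        # Assumption: rank 1 is the most confident mention of the document.
        ranked = sorted(preds, key=lambda p: p[4], reverse=True)
        for rank, (_, section, start, end, score, text) in enumerate(ranked, start=1):
            lines.append(f"{patent_id}\t{section}:{start}:{end}\t{rank}\t{score:.5f}\t{text}")
    return lines
```

The confidence is written with five decimal places here; the number of decimals is a choice of this sketch, not a requirement taken from the task description.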

CEMP evaluation script (official)

The evaluation will be done using the BioCreative Evaluation script available at:

http://www.biocreative.org/resources/biocreative-ii5/evaluation-library/

In this case the INT - article classification format option will be used.

Example command: bc-evaluate --INT team_cemp_prediction.tsv chemdner_cemp_gold_standard_train_eval.tsv > team_cemp_prediction.eval

where
--INT corresponds to the required evaluation option
team_cemp_prediction.tsv corresponds to the prediction file
chemdner_cemp_gold_standard_train_eval.tsv corresponds to the evaluation file

If you have problems with the required prediction format use bc-evaluate with the flag --debug to find out what is wrong.
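
For a quick sanity check during development, exact-match scores can also be approximated without the official script. The sketch below is not the BioCreative evaluation library and may differ from it in details; it assumes that a prediction counts as correct only when its patent identifier and offset string both match an entry of the evaluation file exactly. The function name score_exact_match is hypothetical.

```python
from typing import Iterable, Tuple

Key = Tuple[str, str]  # (patent identifier, offset string such as "A:12:22")


def score_exact_match(predicted: Iterable[Key], gold: Iterable[Key]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) over exactly matching mention offsets."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = [("CA2119782C", "A:12:22"), ("CA2119782C", "A:59:72"),
        ("CA2119782C", "T:0:9"), ("CA2119782C", "T:21:36")]
predicted = [("CA2119782C", "A:12:22"), ("CA2119782C", "T:0:9"),
             ("CA2119782C", "A:1:2")]
precision, recall, f1 = score_exact_match(predicted, gold)
```

Official results should always be computed with bc-evaluate as described above.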

BioCreative Markyt prediction visualization

The Markyt tool is used at the BioCreative-CHEMDNER challenge to visualize the predictions of the participating teams and to compare them with each other and with the manual annotations for the CEMP and GPRO tasks. It should make it easier to get a visual grasp of the annotations and the predictions, to obtain evaluation scores, to detect false positive (FP) and false negative (FN) predictions, and to understand errors related to particular entity mention classes and documents.