BioCreative - GPRO detailed task description

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative V

GPRO detailed task description [2015-07-02]

GPRO (gene and protein related object task)

General description

For the GPRO task, teams have to identify mentions of gene and protein related objects (GPROs) in patent titles and abstracts.

The definition of GPRO entity mentions was primarily concerned with capturing those types of mentions that are of
practical relevance (both for end users of the extracted data and for named entity recognition systems).
Therefore the covered GPRO entities had to be annotated at a sufficient level of granularity to be able to determine
whether the labeled mention can or cannot be linked to a specific gene or gene product (represented by an entry
of a biological annotation database). The annotation carried out for the CHEMDNER GPRO task was exhaustive for
the types of GPRO mentions that were previously specified. This implies that mentions of other entities, such as
chemicals or substances, should not be labeled as GPROs.

We distinguish two types of GPRO entity mentions:

(1) GPRO entity mention type 1: covering those GPRO mentions that can be normalized to a bio-entity database
record. GPRO type 1 includes the following classes: NESTED MENTIONS, IDENTIFIER, FULL NAME and ABBREVIATION.
(2) GPRO entity mention type 2: covering those GPRO mentions that in principle cannot be normalized to a unique
bio-entity database record. GPRO type 2 includes the following classes: NO CLASS, SEQUENCE, FAMILY and MULTIPLE.

Additional details are provided in the GPRO annotation guidelines.

Important note: For the GPRO task we will only use GPRO entity mentions of type 1 for evaluation purposes.

Participating teams have to correctly detect the start and end indices of all GPRO entity mentions of type 1.

Patent abstract records

Participating teams get three files to train, develop and tune their systems. One of them contains the actual patent abstract texts:
plain-text, UTF-8-encoded patent abstracts in a tab-separated format with the following three columns:

1- Patent identifier
2- Title of the patent
3- Abstract of the patent

An example patent abstract record can be seen below.

CA2119782C	Carbamate analogs of thiaphysovenine, pharmaceutical compositions, and method for inhibiting cholinesterases	Substituted carbamates of tricyclic compounds which have a cyclic sulfer atom, having the formula:(See formula I) wherein R1 is H or a linear or branched chain C1- C10 alkyl group; and R2 is selected from the group consisting of a linear or branched chain -C1-C10 alkyl group, and (See formula I) wherein R3 and R4 are independently selected from the group consisting of H and a linear or branched chain C1-C10 -alkyl group;and with the proviso that when one of R1 or R2 is a H or a methyl group the other of R1 or R2 is not H and optical isomers of the 3aS series, provide highly potent and selective cholinergic agonist and blocking activity and are useful as pharmaceutical agents. Cholinergic disease are treated with these compounds such as glaucoma, Myasthenia Gravis, Alzheimer's disease. Methods for inhibiting esterases, acetylcholinesterase and butyryl-cholinesterase are also provided.
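
A minimal sketch of reading such a file in Python is given here. It assumes exactly three tab-separated columns per line and no header row; the names PatentRecord and parse_patent_records are illustrative and not part of the task distribution.

```python
from typing import Dict, Iterable, NamedTuple


class PatentRecord(NamedTuple):
    patent_id: str
    title: str
    abstract: str


def parse_patent_records(lines: Iterable[str]) -> Dict[str, PatentRecord]:
    """Parse tab-separated patent records into a dict keyed by patent identifier."""
    records: Dict[str, PatentRecord] = {}
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines
        fields = line.split("\t")
        if len(fields) != 3:
            # Assumption: every record has exactly the three columns listed above.
            raise ValueError(
                f"line {line_no}: expected 3 columns, got {len(fields)}"
            )
        records[fields[0]] = PatentRecord(*fields)
    return records
```

With a file on disk (the file name is hypothetical), this could be called as parse_patent_records(open("patent_abstracts.tsv", encoding="utf-8")).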

Detailed GPRO mention annotations

Participating teams will also get a file with detailed GPRO mention annotations manually classified into one of the following GPRO mention classes:
NESTED MENTIONS, IDENTIFIER, FULL NAME, ABBREVIATION, NO CLASS, SEQUENCE, FAMILY and MULTIPLE.

This annotation file consists of tab-separated fields containing:

1- Patent identifier
2- Type of text from which the annotation was derived (T: Title, A: Abstract)
3- Start offset
4- End offset
5- Text string of the GPRO entity mention
6- Class of GPRO entity mention (ABBREVIATION, FAMILY, etc.)
7- Corresponding database identifier for mentions that can be normalized (GPRO type 1) or else the tag 'GPRO_TYPE_2'

An example GPRO annotation can be seen below.


CA2119782C	A	819	828	esterases	FAMILY	GPRO_TYPE_2
CA2119782C	A	830	850	acetylcholinesterase	FULL_NAME	P22303
CA2119782C	A	855	877	butyryl-cholinesterase	FULL_NAME	P06276
CA2119782C	T	93	108	cholinesterases	FULL_NAME	P06276
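
The following Python sketch parses one annotation line, separates type 1 from type 2 mentions, and checks an annotation against its source text. In the title annotation of the example, the offsets 93 and 108 behave like a zero-based, end-exclusive slice; the sketch assumes this convention holds for all annotations. The names GproAnnotation, parse_annotation, is_type_1 and offsets_match are illustrative.

```python
from typing import NamedTuple


class GproAnnotation(NamedTuple):
    patent_id: str
    section: str     # "T" for title, "A" for abstract
    start: int
    end: int
    text: str
    gpro_class: str
    identifier: str  # database identifier, or "GPRO_TYPE_2"


def parse_annotation(line: str) -> GproAnnotation:
    """Parse one tab-separated annotation line with seven fields."""
    # Assumption: surrounding spaces in a field are padding, so they are stripped.
    fields = [f.strip() for f in line.rstrip("\r\n").split("\t")]
    if len(fields) != 7:
        raise ValueError(f"expected 7 fields, got {len(fields)}")
    pid, section, start, end, text, gpro_class, identifier = fields
    return GproAnnotation(pid, section, int(start), int(end),
                          text, gpro_class, identifier)


def is_type_1(ann: GproAnnotation) -> bool:
    """Type 1 mentions carry a database identifier instead of the type 2 tag."""
    return ann.identifier != "GPRO_TYPE_2"


def offsets_match(ann: GproAnnotation, title: str, abstract: str) -> bool:
    """Check that the annotated span reproduces the mention string."""
    source = title if ann.section == "T" else abstract
    return source[ann.start:ann.end] == ann.text
```

Running offsets_match over a whole training set is a cheap way to confirm that a system reads offsets the same way the annotations were produced.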

GPRO evaluation annotation file

The evaluation file contains only the GPRO mention offsets (for type 1 mentions only) and is the file against which team predictions are evaluated. We do not ask teams to provide the class of each GPRO mention; only the mention offset predictions are assessed. We also will NOT evaluate the detection of the database identifiers
(only the GPRO type 1 mention offsets).

It consists of tab-separated columns containing:

1- Patent identifier
2- Offset string consisting of a triplet joined by the ':' character. You have to provide the text type (T: Title, A: Abstract), the start offset and the end offset.

An example of the evaluation annotations can be seen below.

CA2119782C	A:830:850
CA2119782C	A:855:877
CA2119782C	T:93:108
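
As an illustration of how the evaluation format relates to the detailed annotations, the sketch below keeps only type 1 mentions and joins section, start and end with ':'. The function names and the tuple layout of the input are assumptions made for this example.

```python
from typing import Iterable, List, Tuple

# (patent_id, section, start, end, identifier): a hypothetical reduced view
# of one detailed annotation row.
Row = Tuple[str, str, int, int, str]


def to_eval_line(patent_id: str, section: str, start: int, end: int) -> str:
    """Build one evaluation line: identifier, tab, then 'section:start:end'."""
    return f"{patent_id}\t{section}:{start}:{end}"


def eval_lines(rows: Iterable[Row]) -> List[str]:
    """Keep type 1 mentions only and convert them to evaluation lines."""
    return [
        to_eval_line(pid, section, start, end)
        for pid, section, start, end, identifier in rows
        if identifier != "GPRO_TYPE_2"
    ]
```

Applied to the four example annotations for CA2119782C, this drops the FAMILY mention and yields the three lines of the evaluation example.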

GPRO task prediction format

For the GPRO task we will only request the prediction of the GPRO type 1 mention offsets, following a setting similar to that of the CEMP task. Given a set of patent abstracts, the participants have to return the start and end indices corresponding to all the GPRO type 1 entities mentioned in each document.

It consists of tab-separated columns containing:

1- Patent identifier
2- Offset string consisting of a triplet joined by the ':' character. You have to provide the text type (T: Title, A: Abstract), the start offset and the end offset.
3- The rank of the GPRO type 1 entity returned for this document
4- A confidence score
5- The string of the GPRO type 1 entity mention

An example illustrating the prediction format is shown below:

CN103371975A	A:271:274	1	0.99	RGD
CN103371975A	A:276:306	2	0.98989	Arginine-Glycine-Aspartic
US20090312385	A:100:112	1	0.99	CB2
US20090312385	T:0:11	2	0.98989	Cannabinoid
WO2014144455A1	A:616:621	1	0.99	CARM1
WO2014144455A1	T:53:58	2	0.98989	carm1
WO2014144455A1	A:676:681	3	0.98978	CARM1
EP1087981B1	T:0:18	1	0.99	Prenyl
EP1087981B1	A:60:79	2	0.98989	prenyl
US20090274671	T:22:28	1	0.99	globin
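
A possible way to produce these lines from a system's output is sketched below in Python. It assumes that ranks are assigned per document by descending confidence, which is consistent with the example but not stated explicitly in this description; the function name and the tuple layout are illustrative.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

# (patent_id, section, start, end, confidence, mention_text): hypothetical layout
Prediction = Tuple[str, str, int, int, float, str]


def format_predictions(predictions: Iterable[Prediction]) -> List[str]:
    """Group predictions by document, rank them, and emit five-column lines."""
    by_doc = defaultdict(list)
    for pred in predictions:
        by_doc[pred[0]].append(pred)

    lines = []
    for patent_id, preds in by_doc.items():
        # Assumption: rank 1 goes to the most confident mention of a document.
        ranked = sorted(preds, key=lambda p: p[4], reverse=True)
        for rank, (_, section, start, end, conf, text) in enumerate(ranked, start=1):
            lines.append(
                f"{patent_id}\t{section}:{start}:{end}\t{rank}\t{conf}\t{text}"
            )
    return lines
```

Documents are emitted in the order in which they first appear in the input, and ties in confidence keep their input order because Python's sort is stable.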

GPRO evaluation script (official)

The evaluation will be done using the BioCreative Evaluation script available at:

http://www.biocreative.org/resources/biocreative-ii5/evaluation-library/

In this case the INT - article classification format option will be used.

Example command: bc-evaluate --INT team_gpro_prediction.tsv chemdner_gpro_gold_standard_train_eval.tsv > team_gpro_prediction.eval

where --INT corresponds to the required evaluation option
team_gpro_prediction.tsv corresponds to the prediction file
chemdner_gpro_gold_standard_train_eval.tsv corresponds to the evaluation file

If you have problems with the required prediction format, use bc-evaluate with the flag --debug to find out what is wrong.
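
Before running the official script, a prediction file can also be pre-checked against the five-column format described above. The Python sketch below is an independent check written for illustration; it is not a reimplementation of bc-evaluate and does not reproduce what its --debug flag reports. The function name and the exact rules (for example, start strictly smaller than end, rank starting at 1) are assumptions.

```python
import re
from typing import List

OFFSET_RE = re.compile(r"^(T|A):([0-9]+):([0-9]+)$")
RANK_RE = re.compile(r"^[1-9][0-9]*$")


def check_prediction_line(line: str) -> List[str]:
    """Return a list of problems found in one prediction line (empty if none)."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        return [f"expected 5 tab-separated columns, got {len(fields)}"]

    patent_id, offset, rank, confidence, _text = fields
    problems = []
    if not patent_id:
        problems.append("empty patent identifier")

    match = OFFSET_RE.match(offset)
    if match is None:
        problems.append(f"malformed offset string: {offset!r}")
    elif int(match.group(2)) >= int(match.group(3)):
        problems.append("start offset is not smaller than end offset")

    if RANK_RE.match(rank) is None:
        problems.append(f"rank is not a positive integer: {rank!r}")

    try:
        float(confidence)
    except ValueError:
        problems.append(f"confidence is not a number: {confidence!r}")
    return problems
```

Because fields are compared exactly, stray spaces around a value are reported as a problem rather than silently accepted.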

BioCreative Markyt prediction visualization

The Markyt tool is being used at the BioCreative-CHEMDNER challenge
to enable the visualization and comparison of predictions by the participating teams, and their comparison to manual annotations, for the
CEMP and GPRO tasks. It should make it easier to get a visual grasp of the annotations and predictions, to obtain evaluation scores,
to detect false positive (FP) and false negative (FN) predictions, and to understand errors related to particular entity mention classes and documents.