BioCreative - Track 2- CHEMDNER-patents

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative V

Track 2- CHEMDNER-patents [2015-08-29]

Task Organizers

CHEMDNER scientific advisory board

Participation

Background

CHEMDNER-patents tasks

Corpora and data

Test set prediction submission [2015-08-14]

Evaluation

Other resources

Prediction and annotation visualization with Markyt

Tentative timeline

BioCreative V workshop

CHEMDNER patents FAQ

Proceedings and journal special issue

Previous CHEMDNER (Biocreative IV)

References

Task Organizers

Martin Krallinger, Spanish National Cancer Research Centre
Florian Leitner, Universidad Politecnica de Madrid
Obdulia Rabal, Center for Applied Medical Research (CIMA), University of Navarra
Julen Oyarzabal, Center for Applied Medical Research (CIMA), University of Navarra
Alfonso Valencia, Spanish National Cancer Research Centre

CHEMDNER scientific advisory board

Peter Murray-Rust, Reader in Molecular Informatics, Unilever Centre, Dep. of Chemistry, University of Cambridge, UK
John P. Overington, EMBL-EBI, Wellcome Genome Campus, Hinxton, UK
Erik M. van Mulligen, Department of Medical Informatics, Erasmus University Medical Center, Rotterdam, The Netherlands
Christian Tyrchan, Computational Chemistry, AstraZeneca
Stephen K. Boyer, IBM Almaden Research Center
Markus Bundschus, Head Scientific & Business Information Services, Roche Diagnostics GmbH

Registration and participation

Teams interested in the CHEMDNER-patents task should register for Track 2 of BioCreative V. Important: first register, then go to the 'Team page', complete the team information, and select the task(s) in which you intend to participate.

Task contact: If you have additional questions, send an e-mail to Martin Krallinger.

Mailing list: You can post questions related to the CHEMDNER task to the BioCreative mailing list. To register for the BioCreative mailing list, please visit the following page: http://biocreative.sourceforge.net/mailing.html

Background

This task will address the automatic extraction of chemical and biological data from medicinal chemistry patents. Identifying and integrating all the information contained in these patents (e.g., chemical structures, their synthesis and associated biological data) is currently very hard, not only for database curators but also for life science researchers and biomedical text mining experts. Although patents contain valuable characterizations of biomedically relevant entities such as chemical compounds, genes and proteins, academic research on text mining and information extraction from patent data has been minimal. Pharmaceutical patents covering chemical compounds provide information on their therapeutic applications and, in most cases, on their primary biological targets.

This would be the first time that a biomedical text mining community challenge handles noisy text data (patents), and it could result in software that helps to derive annotations from patents. The methods resulting from this task could also provide useful insights for extracting other kinds of information from patents, and could help to better understand how to detect such information in other text collections such as full-text articles or legacy reports.

CHEMDNER-patents tasks

This task covers three essential steps for the identification of biomedically relevant descriptions of chemical compounds:

CEMP (chemical entity mention in patents, main task): the detection of chemical named entity mentions in patents (start and end indices corresponding to all the chemical entities). A detailed task description can be found: here and the CEMP corpus can be found here.

CPD (chemical passage detection, text classification task): the detection of patent titles and abstracts that mention chemical compounds. A detailed task description can be found: here and the CPD corpus can be found here.

GPRO (gene and protein related object task): for the GPRO task, teams have to identify mentions of gene and protein related objects (GPROs) in patent titles and abstracts. This task was initially called the CER (chemical entity relation) task; we have updated the name to make the purpose of the task clearer. A detailed task description can be found: here and the GPRO corpus can be found here (updated text: 2015-04-10).

Participating teams do not need to send results for all three sub-tasks. They can also send results for individual sub-tasks only.
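To make the CEMP output concrete, the following minimal Python sketch represents a chemical mention by its start and end character indices within a title or abstract. The record layout (`doc_id`, `section`, an exclusive `end` index) and the `locate` helper are assumptions made for illustration only; the required submission format is defined in the detailed task description, not here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    doc_id: str   # hypothetical patent record identifier
    section: str  # assumed label: "T" for title, "A" for abstract
    start: int    # index of the first character of the mention
    end: int      # index one past the last character (assumed convention)


def locate(doc_id: str, section: str, text: str, term: str) -> list:
    """Return one Mention for every occurrence of `term` in `text`.

    This is plain substring search, used only to show how offsets are
    derived; it is not a chemical entity recognizer.
    """
    found = []
    pos = text.find(term)
    while pos != -1:
        found.append(Mention(doc_id, section, pos, pos + len(term)))
        pos = text.find(term, pos + 1)
    return found


abstract = "Salts of aspirin and uses of aspirin in therapy."
mentions = locate("DOC1", "A", abstract, "aspirin")

for m in mentions:
    # Slicing with the stored offsets recovers the mention string.
    print(m.doc_id, m.section, m.start, m.end, abstract[m.start:m.end])
```

Storing offsets rather than strings keeps repeated mentions of the same compound distinct, which matters when every mention has to be reported.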

Data (updated text: 2015-07-29)

We selected patents that have at least one assigned IPC code corresponding to A61P (or its child IPCs) and also at least one A61K31 IPC code. These selection criteria ensured that the resulting collection is enriched in medicinal chemistry patents mentioning chemical entities. We used patents with a publication date between 2005 and 2014 and with titles and abstracts written in English (machine-translated titles/abstracts were discarded).
We selected patents from the following agencies: the World Intellectual Property Organization (WIPO), the European Patent Office (EPO), the United States Patent and Trademark Office (USPTO), Canadian Intellectual Property Office (CIPO), the German Patent and Trade Mark Office (DPMA) and the State Intellectual Property Office of the People's Republic of China (SIPO).
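The selection criteria above can be sketched as a simple filter. This is an illustration under stated assumptions, not the organizers' actual pipeline: the function names, the `language` and `machine_translated` parameters, and the use of prefix matching to capture "child" IPC codes are all hypothetical choices.

```python
from datetime import date


def _normalize(code: str) -> str:
    # "A61K 31/05" -> "A61K31/05"; assumes codes are given as plain strings
    return code.replace(" ", "").upper()


def has_a61p(ipc_codes) -> bool:
    # A61P itself or any code below it, e.g. "A61P 35/00" (prefix match)
    return any(_normalize(c).startswith("A61P") for c in ipc_codes)


def has_a61k31(ipc_codes) -> bool:
    # Main group A61K 31, written either bare or with a subgroup suffix
    return any(
        _normalize(c) == "A61K31" or _normalize(c).startswith("A61K31/")
        for c in ipc_codes
    )


def is_candidate(ipc_codes, publication_date: date,
                 language: str, machine_translated: bool) -> bool:
    """Apply the IPC, date and language criteria described in the text."""
    return (
        has_a61p(ipc_codes)
        and has_a61k31(ipc_codes)
        and 2005 <= publication_date.year <= 2014
        and language == "en"
        and not machine_translated
    )


print(is_candidate(["A61P 35/00", "A61K 31/05"], date(2010, 5, 1), "en", False))
```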
The test set collection will consist of recently published patents plus a background set in order to avoid manual correction of results. We will restrict the annotation to particular, well-defined sections of patents, with a special focus on patent abstracts. We plan to exhaustively annotate a minimum of 30,000 patent abstracts. The CHEMDNER-patents corpus will rely on a modified version of the annotation guidelines used for the BioCreative IV CHEMDNER task. These modifications are mainly intended to deal with spelling errors and spurious line breaks, and to incorporate guidelines for the annotation of biological targets (mainly genes and gene products) and therapeutic applications. We plan to follow the same annotation strategy as for the CHEMDNER task in terms of annotation tools and manual annotation by domain experts, including an inter-annotator agreement study to determine the consistency of the annotations. The annotation guidelines, together with the entire CHEMDNER-patents corpus, will be publicly available after the competition.

The CHEMDNER-patents corpora will consist of a training, development and test set, each comprising a total of 7000 manually annotated records.

Additionally, we are exploring the use of another 9000 manually annotated records, as well as an unlabeled background set of patent records, after the challenge is over (the use of this data will be discussed at the evaluation workshop in Seville).

The CHEMDNER corpora are available at:

Test set prediction submissions (updated text: 2015-08-14)

Test set predictions have to be uploaded at the following webpage:
Markyt test set submission

On this page there is a submission tab called 'Test set submissions'. You are allowed to submit up to 5 runs for each task. You have to fill in your team information and specify the task and the run. The test set abstracts are available at: Release of test data text.
You only have to enter the team e-mail and the team participant code. As soon as you do that, the system retrieves the team ID automatically (it cannot be edited) and you can choose the task and run.

Evaluation

We will use an adapted version of the BioCreative II.5 evaluation script to score the predictions. For the CPD task, the BioCreative II.5 ACT evaluation scores will be used (MCC, AUC PR, and accuracy). For the CEMP task, we will use the exact mention evaluation strategy of the CHEMDNER task, based on the balanced F-score. For the GPRO task, the same evaluation strategy as for the BioCreative II.5 IPT will be used (i.e., F-score). See the description of the evaluation script for more details. The evaluation software will also check that team predictions are compliant with the required submission format.
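As a rough illustration of two of the scores named above, the sketch below computes the balanced F-score under exact mention matching and the Matthews correlation coefficient (MCC) from a confusion matrix. It is not the official evaluation script; the tuple layout used for mentions (document id, section, start, end) and the example values are assumptions made for this example.

```python
import math


def precision_recall_f1(gold, predicted):
    """Exact-match scoring: a prediction counts only if it is identical
    (same document, section, start and end) to a gold mention."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient for a binary classifier,
    returning 0.0 when the denominator vanishes (a common convention)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical gold and predicted mentions: (doc_id, section, start, end)
gold = {("D1", "A", 9, 16), ("D1", "A", 29, 36), ("D2", "T", 0, 7)}
pred = {("D1", "A", 9, 16), ("D1", "A", 30, 36), ("D2", "T", 0, 7)}

p, r, f = precision_recall_f1(gold, pred)
print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
print(f"MCC={mcc(tp=5, fp=1, tn=3, fn=1):.3f}")
```

Note that the prediction with a shifted start offset (30 instead of 29) earns no credit: under exact matching it counts as both a false positive and a false negative.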

Other resources

A list of additional resources that might be useful for the CHEMDNER-patent task participants can be found here: CHEMDNER-patents links (updated 2015-07-21).

Prediction and annotation visualization with Markyt (updated, Tuesday, 1st July 2015)

The Markyt tool is being used in the BioCreative CHEMDNER challenge to let participating teams visualize their predictions and compare them with the manual annotations for the CEMP and GPRO tasks. It should make it easier to get a visual grasp of the annotations and the predictions, to obtain evaluation scores, to detect false positive (FP) and false negative (FN) predictions, and to understand errors related to particular entity mention classes and documents.
Markyt is a Web-based multi-purpose annotation tool. It supports the preparation of the annotated corpora, the analysis of inter-annotator agreement and refinement of annotation guidelines, and the evaluation of predictions against the competition gold standards.
Please look at the Markyt tutorial for additional details.

Tentative timeline (updated, Monday, 3rd June 2015)

January 2015: ~~Task announcement & call for participation~~

March 2nd: ~~Sample set patent abstract plain text release~~: CHEMDNER-patents sample text.

April 10th: Sample set annotation release for all subtasks (together with example predictions) (check out the corrected version, April 10th 2015): CHEMDNER patents sample set (version April 10th 2015)

June 8th 2015: Release of training set annotations, guidelines and Markyt annotation interface (updated text: 2015-06-03): CHEMDNER-patents CEMP training set, CHEMDNER-patents CPD training set and CHEMDNER-patents GPRO training set version 2 (GPRO training updated text: 2015-07-01!)

July 1st 2015: BioCreative Markyt: CHEMDNER task prediction and annotation visualization and analysis tool release (updated text: 2015-07-01).

July 8th 2015: Release of development data annotations: CHEMDNER-patents CEMP development set V02 (corrected, 2015-07-13), CHEMDNER-patents CPD development set and CHEMDNER-patents GPRO development set (updated text: 2015-07-08).

August 14th 2015: Release of test data text (instructions will be sent to each team; it will consist of the patent abstract text file, without the annotations)

August 24th 23:00 CET: Extended deadline, team results returned (instructions will be sent to each team, max. 5 runs per CHEMDNER task/team) (updated text: 2015-08-22)

August 31 2015: The evaluated results returned to the participants (updated text: 2015-08-29).

September 2nd 2015: The camera-ready system description write-ups (1-2 pages) (updated text: 2015-06-03).

September 9th-11th 2015: BioCreative evaluation workshop, Seville (updated text: 2015-06-03).

CHEMDNER session at the BioCreative V workshop

The results of the CHEMDNER patents task will be presented at the Fifth BioCreative Challenge Evaluation Workshop. The BioCreative V Workshop website is now open for registration. Please visit the website for details about the submission process and the venue.

At the BioCreative V Workshop, to be held in Seville (Spain) on September 9-11, 2015, there will be a session devoted to the CHEMDNER patents task. This session will include an overview talk presenting the datasets used and the results obtained by the participating teams. A number of teams will also be invited to present their systems. We also plan to have a discussion session where teams, task organizers and domain experts will discuss the results and future steps. Finally, during the poster session all teams will be able to present their strategies.

CHEMDNER patents FAQ

Please read the following document answering questions related to the CHEMDNER patents task: FAQ (online).

CHEMDNER patents workshop proceedings and journal special issue

Participating teams will be invited to contribute to the Proceedings of the Fifth BioCreative Challenge Evaluation Workshop. A selected number of top-performing teams will also be invited to contribute a system description paper to a special issue of a relevant journal in the field. Previous BioCreative publications are listed here.

Previous CHEMDNER (Biocreative IV)

The CHEMDNER BioCreative IV special issue was published in the Journal of Cheminformatics: Volume 7 Supplement 1, 'Text mining for chemistry and the CHEMDNER track'. The task focused on the detection of chemical entities in PubMed abstracts. The entire supplement is available from the Journal of Cheminformatics.

The special issue includes an overview paper on the task, a paper on the CHEMDNER corpus and 13 selected system description papers. Top-scoring teams obtained an F-score of 87.39% for the recognition of chemical entity mentions, a very competitive result already close to the human inter-annotator agreement (IAA). Additionally, some systems showed further improvements compared to their original submissions.

In addition, participating teams provided a short system description paper for the BioCreative workshop proceedings, see: Proceedings of the Fourth BioCreative Challenge Evaluation Workshop vol. 2.

Additional details can be found at the BioCreative IV CHEMDNER task.

References

Krallinger, M., Leitner, F., Rabal, O., Vazquez, M., Oyarzabal, J., & Valencia, A. CHEMDNER: The drugs and chemical names extraction challenge. Journal of Cheminformatics 2015, 7(Suppl 1):S1
Krallinger, M. et al. The CHEMDNER corpus of chemicals and drugs and its annotation principles. Journal of Cheminformatics 2015, 7(Suppl 1):S2
Krallinger, M., Leitner, F., Rabal, O., Vazquez, M., Oyarzabal, J., & Valencia, A. (2013, October). Overview of the chemical compound and drug name recognition (CHEMDNER) task. In BioCreative Challenge Evaluation Workshop (Vol. 2, p. 2).
Akhondi, S. A., Klenner, A. G., Tyrchan, C., Manchala, A. K., Boppana, K., Lowe, D., ... & Muresan, S. (2014). Annotated Chemical Patent Corpus: A Gold Standard for Text Mining. PloS one, 9(9), e107477.
Grego, T., Pęzik, P., Couto, F. M., & Rebholz-Schuhmann, D. (2009). Identification of chemical entities in patent documents. In Distributed Computing, Artificial Intelligence, Bioinformatics, Soft Computing, and Ambient Assisted Living (pp. 942-949). Springer Berlin Heidelberg.
Jessop, D. M., Adams, S. E., & Murray-Rust, P. (2011). Mining chemical information from Open patents. Journal of cheminformatics, 3(1), 40.
Gurulingappa, H., Müller, B., Klinger, R., Mevissen, H. T., Hofmann-Apitius, M., Friedrich, C. M., & Fluck, J. (2010). Prior Art Search in Chemistry Patents Based On Semantic Concepts and Co-Citation Analysis. In TREC.
Wishart, D. S., Knox, C., Guo, A. C., Shrivastava, S., Hassanali, M., Stothard, P., ... & Woolsey, J. (2006). DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic acids research, 34(suppl 1), D668-D672.
Zhu, F., Han, B., Kumar, P., Liu, X., Ma, X., Wei, X., ... & Chen, Y. (2010). Update of TTD: therapeutic target database. Nucleic acids research, 38(suppl 1), D787-D791.