BioCreative - Track 2 - CHEMDNER Task: Chemical compound and drug name recognition task

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative IV

Track 2 - CHEMDNER Task: Chemical compound and drug name recognition task [2012-11-15]

Background

There is increasing interest, in both academia and industry, in more efficient access to information on chemical compounds and drugs (chemical entities) described in repositories of unstructured data, including scientific articles, patents and health agency reports. Achieving this goal requires identifying mentions of chemical compounds automatically within text and indexing whole documents with the compounds described in them. The recognition of chemical entities is also crucial for subsequent text processing strategies, such as the detection of drug-protein interactions, of adverse effects of chemical compounds and their associations with toxicological endpoints, or the extraction of pathway and metabolic reaction relations. The Comparative Toxicogenomics Database (CTD) group ran a track at BioCreative III to support the identification of articles containing chemical-disease-gene relations and the extraction of these entities and relations.

Despite its importance, only a very limited number of publicly accessible chemical compound recognition systems have been released [1], even though a considerable number of methods and strategies to recognize chemicals in text have been proposed. The main bottlenecks currently encountered when implementing such systems and comparing their performance are (a) the lack of suitable training/test data, (b) the intrinsic difficulty of defining annotation guidelines for what actually constitutes a chemical compound or drug, (c) heterogeneity in the scope and textual data sources used, and (d) the limited evaluation efforts carried out so far. Important groundwork to define annotation standards for chemicals in text (as well as the construction of an annotated corpus) was carried out by Corbett and colleagues [2]. Their guidelines were originally devised for the annotation of PubMed abstracts and chemistry journals. Other relevant work, focused more on the detection and annotation of IUPAC or IUPAC-like chemical compound names, was done by Klinger et al. [3], while some chemical substance annotations can also be recovered from corpora that are not primarily concerned with chemical compounds, including the GENIA [4], CRAFT [5] and CALBC [6] corpora. Moreover, several databases and lexical resources can be useful for the annotation and detection of compound mentions. Among these resources are PubChem [7], ChEBI [8], the ‘Chemicals and Drugs’ branch of the MeSH vocabulary, and joined lexical resources such as Jochem [9] and ChemSpider [10].

Task organizers

Martin Krallinger, Spanish National Cancer Research Center (CNIO)

Obdulia Rabal, University of Navarra, Spain

Julen Oyarzabal, University of Navarra, Spain

Alfonso Valencia, Spanish National Cancer Research Center (CNIO)

Task goal

The goal of this task is to promote the implementation of systems that are able to detect mentions of chemical compounds and drugs, in particular those chemical entity mentions that can subsequently be linked to a chemical structure, rather than macromolecules such as genes and proteins, which have already been addressed in previous BioCreative efforts. The overall setting of the task and its evaluation will be based on the evaluation strategies followed for the Gene Mention recognition task of BioCreative I and II [11,12], as well as for the gene normalization task [13,14,15]. Moreover, chemical compound annotation guidelines used by previous efforts will be adapted for this task.
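The gene mention evaluations cited above scored systems with precision, recall and F-score. As a rough illustration of how such scoring could apply to chemical mentions, the sketch below compares predicted and gold-standard spans by exact match. The `score_mentions` function, the `(doc_id, start, end)` tuple layout and the sample offsets are assumptions made for this example; the official CHEMDNER evaluation script defines the actual criteria.

```python
def score_mentions(gold, predicted):
    """Exact-match precision, recall and F1 over (doc_id, start, end) tuples.

    Assumption for this sketch: a prediction counts as correct only if its
    document id and both offsets match a gold annotation exactly.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations for a single document "d1".
gold = [("d1", 4, 11), ("d1", 34, 41), ("d1", 46, 55)]
predicted = [("d1", 4, 11), ("d1", 46, 55), ("d1", 60, 64)]

precision, recall, f1 = score_mentions(gold, predicted)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

Using sets means duplicate predictions of the same span are counted once; a stricter scorer might penalize them instead.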
We foresee a considerable interest in this task by the NLP/text mining community on one side, as well as by the bioinformatics, drug discovery/biomedicine and chemoinformatics communities on the other side. As has been the case in previous BioCreative efforts (resulting in high impact papers in the field), we expect that successful participants will have the opportunity to publish their system descriptions in a journal article.

BioCreative IV CHEMDNER Track

For the BioCreative IV Workshop, we invite participants to submit predictions for the two sub-tasks of the CHEMDNER task:

a) Given a set of documents, return for each document a ranked list of the chemical entities described in it [Chemical document indexing sub-task]

b) For a given document, provide the start and end indices of all the chemical entities mentioned in it [Chemical entity mention recognition sub-task].
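To make the two kinds of output concrete, here is a minimal sketch in Python. It uses a toy dictionary lookup in place of a real recognizer, records each mention with its start and end character offsets (sub-task b), and then derives a ranked list of unique entities for the document (sub-task a). The `Mention` type, the function names, the end-exclusive offset convention and ranking by frequency are all assumptions made for illustration; they are not the official submission format or ranking criterion.

```python
from collections import Counter
from typing import NamedTuple


class Mention(NamedTuple):
    doc_id: str  # hypothetical document identifier
    start: int   # offset of the first character (inclusive)
    end: int     # offset one past the last character (exclusive)
    text: str    # the mention string as it appears in the document


def find_mentions(doc_id, text, lexicon):
    """Toy case-insensitive dictionary lookup standing in for a real recognizer."""
    lowered = text.lower()
    mentions = []
    for name in lexicon:
        key = name.lower()
        pos = lowered.find(key)
        while pos != -1:
            end = pos + len(key)
            mentions.append(Mention(doc_id, pos, end, text[pos:end]))
            pos = lowered.find(key, end)
    return sorted(mentions, key=lambda m: m.start)


def rank_entities(mentions):
    """Unique mention strings, most frequent first (ties broken alphabetically)."""
    counts = Counter(m.text for m in mentions)
    return sorted(counts, key=lambda t: (-counts[t], t))


text = "The aspirin dose was doubled, but aspirin and ibuprofen were both tolerated."
mentions = find_mentions("d1", text, ["aspirin", "ibuprofen"])

for m in mentions:
    print(m.doc_id, m.start, m.end, m.text)   # sub-task b: mention offsets
print(rank_entities(mentions))                # sub-task a: ranked entity list
```

A real system would typically rank entities by a confidence score rather than by raw frequency; frequency is used here only because the toy lookup produces no scores.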

For these two tasks the organizers will release training and test data collections. The task organizers will provide details on the annotation guidelines used and will define a list of criteria for relevant chemical compound entity types as well as for the selection of documents for annotation. The criteria examined at the mention level include (1) diversity in the character space of compound names and (2) name types and nomenclatures: systematic names (IUPAC and IUPAC-like), common or generic names, trade names, chemical identifiers (from databases and companies), acronyms and abbreviations, reference numbers, chemical structures (SMILES, InChI) and formulas. For the selection of documents, the criteria examined will include chemical disciplines (e.g. organic chemistry). Additionally, the organizers will analyze how the detected mentions can be normalized to chemical structures (at two levels: exact, or using the chemical series).

Registration

Teams can participate in the CHEMDNER task by registering for Track 2 of BioCreative IV. You can also register for other tracks. To register your team, go to the following page, which provides more detailed instructions: http://www.biocreative.org/news/biocreative-iv/team/

Corpus

Download the BioCreative IV CHEMDNER Corpus.

Dates

25th June: ~~sample data collection, detailed task description, annotation and evaluation script~~
31st July: ~~training data collection, annotations and updated guidelines (updated)~~
16th August: ~~development data annotations~~ (download here)
3rd September: ~~test set release~~ (download here)
16th September: ~~test set prediction due~~ (FINAL UPDATE!)
17th September: ~~invite teams for workshop presentation talks~~
26th September: ~~CHEMDNER workshop proceedings paper due~~ (2-8 pages systems description paper) (FINAL UPDATE!)
30th September: ~~CHEMDNER workshop proceedings paper feedback~~
7th-9th October: ~~BioCreative IV workshop~~
4th November: ~~Test set manual annotations released to CHEMDNER participating teams~~
4th November: ~~Invitation of selected teams to submit paper for the special journal issue~~
19th May (2014): ~~Test set manual annotations made publicly available online for general access~~: download here

Workshop proceedings papers for CHEMDNER task

Important: The CHEMDNER track systems description papers and the overview paper of the BioCreative IV workshop proceedings are available HERE.

CHEMDNER special issue: Journal of Cheminformatics

The Journal of Cheminformatics published a special issue on the BioCreative IV CHEMDNER task. It includes an overview paper, a paper on the CHEMDNER corpus and articles from participating teams.

Frequently asked questions (FAQ)

Please read the following document answering questions related to the CHEMDNER task: FAQ (online).

CHEMDNER task useful resources

A list of additional resources that might be useful for CHEMDNER task participants can be found here: CHEMDNER links (updated 2013-08-28).

Mailing list and contact information

You can post questions related to the CHEMDNER task to the BioCreative mailing list. To register for the BioCreative mailing list, please visit the following page: http://biocreative.sourceforge.net/mailing.html
You can also directly send questions to the organizers through e-mail: mkrallinger@cnio<dot>es

REFERENCES

[1] Vazquez, M., Krallinger, M., Leitner, F., & Valencia, A. (2011). Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications. Molecular Informatics, 30(6‐7), 506-519.

[2] Corbett, P., Batchelor, C., & Teufel, S. (2007). Annotation of chemical named entities. BioNLP 2007: Biological, translational, and clinical language processing, 57-64.

[3] Klinger, R., Kolářik, C., Fluck, J., Hofmann-Apitius, M., & Friedrich, C. M. (2008). Detection of IUPAC and IUPAC-like chemical names. Bioinformatics, 24(13), i268-i276.

[4] Kim, J. D., Ohta, T., Tateisi, Y., & Tsujii, J. I. (2003). GENIA corpus—a semantically annotated corpus for bio-textmining. Bioinformatics, 19(suppl 1), i180-i182.

[5] Bada, M., Hunter, L. E., Eckert, M., & Palmer, M. (2010, July). An overview of the CRAFT concept annotation guidelines. In Proceedings of the Fourth Linguistic Annotation Workshop (pp. 207-211). Association for Computational Linguistics.

[6] Rebholz-Schuhmann, D., et al. (2010). CALBC silver standard corpus. Journal of Bioinformatics and Computational Biology, 8(01), 163-179.

[7] Wang, Y., Xiao, J., Suzek, T. O., Zhang, J., Wang, J., & Bryant, S. H. (2009). PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic acids research, 37(suppl 2), W623-W633.

[8] Degtyarenko, K., et al. (2008). ChEBI: a database and ontology for chemical entities of biological interest. Nucleic Acids Research, 36(suppl 1), D344-D350.

[9] Hettne, K. M., Stierum, R. H., Schuemie, M. J., Hendriksen, P. J., Schijvenaars, B. J., Mulligen, E. M. V., ... & Kors, J. A. (2009). A dictionary to identify small molecules and drugs in free text. Bioinformatics, 25(22), 2983-2991.

[10] Pence, H. E., & Williams, A. (2010). ChemSpider: an online chemical information resource. Journal of Chemical Education, 87(11), 1123-1124.

[11] Yeh, A., Morgan, A., Colosimo, M., & Hirschman, L. (2005). BioCreAtIvE task 1A: gene mention finding evaluation. BMC bioinformatics, 6(Suppl 1), S2.

[12] Smith, L., Tanabe, L. K., Ando, R. J., Kuo, C. J., Chung, I. F., Hsu, C. N., ... & Wilbur, W. J. (2008). Overview of BioCreative II gene mention recognition. Genome Biology, 9(Suppl 2), S2.

[13] Hirschman, L., Colosimo, M., Morgan, A., & Yeh, A. (2005). Overview of BioCreAtIvE task 1B: normalized gene lists. BMC bioinformatics, 6(Suppl 1), S11.

[14] Morgan, A., Lu, Z., Wang, X., Cohen, A., Fluck, J., Ruch, P., ... & Hirschman, L. (2008). Overview of BioCreative II gene normalization. Genome biology, 9(Suppl 2), S3.

[15] Lu, Z., Kao, H. Y., Wei, C. H., Huang, M., Liu, J., Kuo, C. J., ... & Wilbur, W. (2011). The gene normalization task in BioCreative III. BMC bioinformatics, 12(Suppl 8), S2.

Downloads

Text Mining for Drugs and Chemical Compounds