BioCreative - BioCreative Text Mining Workshop for Biocuration

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative Text Mining Workshop for Biocuration [2013-02-18]

Biocuration 2013 Conference at Churchill College, Cambridge, UK
Tuesday, April 9, 2013 (afternoon)
To attend this workshop, register for the Biocuration 2013 Conference (http://www.ebi.ac.uk/biocuration2013/content/registration)

Presenters: Cecilia Arighi (1), Kevin Cohen (2), Martin Krallinger (3), and Zhiyong Lu (4)
1 Center for Bioinformatics and Computational Biology, University of Delaware, DE, USA
2 Computational Bioscience Program, University of Colorado School of Medicine, Aurora, CO, USA
3 Structural Biology and Biocomputing Group, Spanish National Cancer Research Centre, Madrid, Spain
4 National Center for Biotechnology Information, NIH, Bethesda, MD, USA

BioCreative: Critical Assessment of Information Extraction in Biology is an international community-wide effort that evaluates text mining and information extraction systems applied to the biological domain (http://www.biocreative.org/). A unique characteristic of this effort is its collaborative and interdisciplinary nature, as it brings together experts from various fields, including text mining, biocuration, publishing and bioinformatics. The accompanying BioCreative Workshops give these experts a place to discuss how to drive the development of text mining systems that can be integrated into the biocuration workflow and the knowledge discovery process. To address the current barriers to using text mining in biology, BioCreative has also been conducting user requirements analyses and user-based evaluations, and fostering the development of standards for text mining tool re-use and integration [1-4].

This workshop will present several text mining research topics addressed by the BioCreative efforts that are of particular relevance for literature curation. These topics include the extraction of bio-entity annotations using standard bio-ontologies (e.g. Gene Ontology annotation), the identification of bio-entities relevant for curation (e.g. chemical compounds and drugs), and the utility, usability and interoperability of text mining systems.

The aim of this workshop is to encourage active involvement of biocurators in guiding text mining system development and adoption by demonstrating and discussing past and current efforts of the BioCreative challenges. Participation in this workshop will give biocurators the opportunity to learn more about current text mining efforts useful in literature curation and will enable them to provide direct feedback to the text mining experts.

The intended audience includes both biocurators who perform literature curation and developers involved in biocuration workflows.

Text Mining for Gene Ontology (GO) annotation
GO annotation is a common task among model organism database (MOD) groups. It is a very time-consuming and labor-intensive task, and thus is often considered to be one of the bottlenecks in literature curation. There is a growing need for semi- or fully-automated GO curation techniques that will help database curators to rapidly identify relevant articles for GO curation, and accurately identify gene function information in full-length articles. This section of the workshop will touch upon the Gene Ontology task planned for BioCreative IV, including its goals, differences from previous GO annotation text mining tasks, biocurators’ involvement, evaluation, and adoption.

Chemical and Drug Named Entity Recognition (CHEMDNER)
Several databases, such as PubChem, ChEMBL and ChEBI, have been developed specifically to capture information related to chemical compounds and drugs. They constitute a crucial resource not only for chemical experts but also for research in biology and the biomedical sciences. Chemical entities are central elements for databases annotating drug-protein interactions, adverse/toxicological effects of drugs, or metabolic/biochemical pathways and reactions [5]. The number of different topics in which chemical entities play a role explains the considerable interest in efficient access to information on chemical compounds and drugs described in scientific articles, patents or health agency reports. To achieve this goal, it is crucial to be able to identify mentions of chemical compounds automatically within text, as well as to index whole documents with the compounds described in them. This talk will introduce general aspects of the annotation of chemical entities from the literature, as well as previous efforts, tools and resources related to the recognition of chemical compounds that are of interest to the biocurator community [6].
The basic characteristics of the CHEMDNER track on recognition of chemical compounds and drugs will be introduced. The track consists of two sub-tasks: the first deals with indexing documents with the chemicals they mention, and the second covers the actual recognition of compound mentions within text. Finally, the potential outcome of this challenge and its impact on manual literature curation will be discussed.
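To make the difference between the two sub-tasks concrete, here is a minimal Python sketch. It is not the CHEMDNER evaluation setup or any participant's method: the lexicon, the function names and the example sentence are all hypothetical, and simple dictionary matching stands in for a real recognition system. Mention recognition returns character offsets for every occurrence, whereas document indexing returns one entry per distinct chemical.

```python
import re

# Hypothetical toy lexicon (an assumption for illustration only);
# a real system would rely on far richer resources and models.
TOY_LEXICON = ["aspirin", "ibuprofen", "acetic acid"]


def find_mentions(text, lexicon=TOY_LEXICON):
    """Mention recognition: return (start, end, surface) for each match."""
    mentions = []
    for term in lexicon:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            mentions.append((match.start(), match.end(), match.group(0)))
    # Sort by position so mentions appear in reading order.
    return sorted(mentions)


def index_document(text, lexicon=TOY_LEXICON):
    """Document indexing: return the distinct chemicals found, lower-cased."""
    return sorted({surface.lower() for _, _, surface in find_mentions(text, lexicon)})


abstract = "Aspirin and ibuprofen were compared; aspirin was hydrolysed to acetic acid."
mentions = find_mentions(abstract)    # one entry per occurrence, with offsets
doc_index = index_document(abstract)  # one entry per distinct compound
```

In this toy example "aspirin" produces two mentions but only one index entry, which is the distinction the two sub-tasks draw.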

Usability and utility of text mining systems
A common problem faced by biocurators is that text mining systems are difficult to use or do not produce output that can be directly exploited during the literature curation process. In this respect, the BioCreative Interactive Text Mining (IAT) task [7,8] has served as a useful means to observe the approaches, standards and functionalities used by state-of-the-art text mining systems with potential applications in the biocuration domain. The IAT task also provides a means for biocurators to be directly involved in the testing of text mining systems. The benefits to biocurators participating in this activity are manifold, including: direct communication and interaction with developers; exposure to new text mining tools that can potentially be adapted and integrated into the biocuration workflow; contribution to the development of text mining systems that meet the needs of the biocuration community; and dissemination of findings in peer-reviewed journal articles. We will discuss some of the outcomes from this experience, such as metrics, the surveys conducted, and system requirements gathering, and will provide a forum for biocurators and developers to share their own experience.

Interoperability of text mining systems
Although a considerable number of natural language processing (NLP) tools have been developed to handle biological literature, integrating and adapting them to help with a particular literature curation task is still extremely cumbersome. Many researchers are building NLP and text mining tools, yet these efforts tend to be singular, isolated and difficult to combine into larger, more powerful and more capable systems. In this workshop we will present the proposal for simple XML formats to share text documents and annotations. The core concepts are simplicity, interoperability, and broad use and reuse. This means that little investment should be required to learn to use a format or a software module that processes that format, thereby enabling system developers in biocuration groups to assemble their own pipelines.
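As a rough illustration of what sharing a document and its annotations in a simple XML format could look like, the sketch below writes and reads a small annotated document with Python's standard library. The element and attribute names (document, text, annotation, offset, length, type) are assumptions made for this example only and are not the format proposed for BioCreative IV.

```python
import xml.etree.ElementTree as ET


def build_document(doc_id, text, annotations):
    """Build a hypothetical XML document with stand-off annotations.

    annotations is a list of (start, end, label) character spans.
    """
    doc = ET.Element("document", id=doc_id)
    ET.SubElement(doc, "text").text = text
    for i, (start, end, label) in enumerate(annotations):
        ann = ET.SubElement(
            doc, "annotation",
            id=f"a{i}", type=label,
            offset=str(start), length=str(end - start),
        )
        # Store the covered text too, so the file is readable on its own.
        ann.text = text[start:end]
    return doc


def read_annotations(xml_string):
    """Parse the XML back and recover (type, covered text) pairs."""
    doc = ET.fromstring(xml_string)
    text = doc.findtext("text")
    result = []
    for ann in doc.findall("annotation"):
        start = int(ann.get("offset"))
        end = start + int(ann.get("length"))
        result.append((ann.get("type"), text[start:end]))
    return result


sentence = "BRCA1 is involved in DNA repair."
doc = build_document("doc1", sentence, [(0, 5, "gene"), (21, 31, "process")])
xml_string = ET.tostring(doc, encoding="unicode")
recovered = read_annotations(xml_string)
```

Because the annotations refer to the text by offset and length, a tool written by one group can produce the file and a tool written by another can consume it with only a generic XML parser, which is the low-investment property the proposal aims for.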

Audience: Biocurators and system developers will greatly benefit from this workshop as it will touch upon subjects that aim at making the literature-based curation workflow more efficient.

Workshop agenda

Introduction to BioCreative and Biocuration (10 min)

Overview of BioCreative challenges and workshops

Interaction between the Biocuration community and BioCreative

Future of BioCreative

Gene Ontology (GO) Annotation and BioCreative, Zhiyong Lu (25 min)

GO annotation, a common task in Biocuration

Challenges of GO annotation for text mining

GO proposed task in BioCreative IV

Biocurators' involvement in GO task

Chemical and Drug Named Entity Recognition (CHEMDNER), Martin Krallinger (25 min)

Annotation of chemical compounds and drugs

Text mining efforts and their use for manual annotation of compounds

CHEMDNER track on recognition of chemical compound and drug mentions in text

Biocurators as users of text mining systems, Cecilia Arighi (25 min)

Overview of BioCreative interactive task

Interaction between the Biocuration and Text mining community

User evaluation of text mining systems and participation of the biocuration community

Sharing documents and annotations, Kevin Cohen (25 min)

Overview of interoperability proposal in BioCreative IV

Panel discussion, moderated by presenters (10 min)

Open discussion with participants

BioCreative Organizers: Cecilia N Arighi, Kevin B Cohen, Lynette Hirschman, Martin Krallinger, Zhiyong Lu, Carolyn Mattingly, Alfonso Valencia, Thomas C. Wiegers, W John Wilbur, and Cathy H Wu

References
1. Hirschman, L., Yeh, A., Blaschke, C. and Valencia, A. (2005) Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 6, S1.
2. Krallinger, M., Morgan, A., Smith, L., Leitner, F., Tanabe, L., Wilbur, J., Hirschman, L. and Valencia, A. (2008) Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biology, 9, S1.
3. Leitner, F., Mardis, S.A., Krallinger, M., Cesareni, G., Hirschman, L.A. and Valencia, A. (2010) An Overview of BioCreative II.5. IEEE/ACM Trans Comput Biol Bioinform., 7, 385-399.
4. Arighi, C., Lu, Z., Krallinger, M., Cohen, K., Wilbur, W., Valencia, A., Hirschman, L. and Wu, C. (2011) Overview of the BioCreative III Workshop. BMC Bioinformatics, 12, S1.
5. Krallinger, M., Erhardt, R.A.A. and Valencia, A. (2005) Text-mining approaches in molecular biology and biomedicine. Drug Discovery Today, 10, 439-445.
6. Vazquez, M., Krallinger, M., Leitner, F. and Valencia, A. (2011) Text Mining for Drugs and Chemical Compounds: Methods, Tools and Applications. Molecular Informatics, 30, 506-519.
7. Arighi, C., Carterette, B., Cohen, K.B., Krallinger, M., Wilbur, W., Fey, P., Dodson, R., Cooper, L., Van Slyke, C.E., Dahdul, W., Mabee, P., et al. (2013) An Overview of the BioCreative 2012 Workshop Track III: Interactive Text Mining Task. Database, 2013:bas056.
8. Arighi, C., Roberts, P., Agarwal, S., Bhattacharya, S., Cesareni, G., Chatr-aryamontri, A., Clematide, S., Gaudet, P., Giglio, M., Harrow, I. et al. (2011) BioCreative III interactive task: an overview. BMC Bioinformatics, 12, S4.