Critical Assessment of Information Extraction in Biology. Data sets are available from Resources/Corpora and require registration.

BioCreative IV

Call for Participation (Events) [2012-11-15]

BioCreative IV Challenge and Workshop

October 7-9, 2013 NCBI, Bethesda, Maryland, USA

BioCreative: Critical Assessment of Information Extraction in Biology is a community-wide effort for evaluating text mining and information extraction systems applied to the biological domain. Building on the success of the previous BioCreative Challenge Evaluations and Workshops (BioCreative I, II, II.5, III, and the 2012 workshop) [1-5], the BioCreative Organizing Committee will host the BioCreative IV Challenge at NCBI, National Institutes of Health, Bethesda, Maryland, on October 7-9, 2013. One key goal of BioCreative is the active involvement of the text mining user community in the design of the tracks, the preparation of corpora, and the testing of interactive systems. For BioCreative IV, the selection of the tracks has been driven in part by suggestions from the biocuration community during the BioCreative 2012 workshop, and by our goal of addressing interoperability -- a major barrier to the adoption of text mining tools.

BioCreative IV will consist of five tracks:

  • Track 1: Interoperability (BioC) – Development of an interoperable BioNLP module that can be seamlessly coupled to BioC compliant modules;
  • Track 2: Chemical and Drug Named Entity Recognition (CHEMDNER) – Detection of mentions of chemical compounds and drugs, in particular those chemical entity mentions that can subsequently be linked to a chemical structure;
  • Track 3: Comparative Toxicogenomics Database (CTD) Curation – Provision of Web Services to identify gene, chemical, disease, and action term mentions supporting CTD curation in PubMed abstracts;
  • Track 4: Gene Ontology (GO) curation – Development of automatic methods to aid GO curators in identifying articles with curatable GO information (triage) and extracting gene function terms and the associated evidence sentences in full-length articles;
  • Track 5: Interactive Curation (IAT) – Demonstration and evaluation of web-based systems addressing user-defined tasks, evaluated by curators on performance and usability.



    Registration

    Teams can participate in one or more of these tracks. Team registration will start on November 19 and will continue until the final commitment requested by the individual tracks.

    To register a team go to


    Important Dates

      • November 19, 2012: Team registration starts (all tracks)
      • January 2013: Guidelines release; expression of interest email (IAT)
      • May 2013: Module description submission, January-June (BioC); training data release; system document submission (IAT)
      • June 2013: Sample data collection and evaluation scripts; acceptance communication (IAT)
      • July 2013: Training/development data release; test; pairing with biocurators and dataset preparation (IAT)
      • August 2013: Module submission to repository (BioC); systems training phase; curators training (IAT)
      • September 2013: Test set release and results; evaluation (IAT)
      • October 7-9, 2013: Workshop



    Organizing Committee

  • Cecilia Arighi, University of Delaware, USA
  • Kevin Cohen, University of Colorado, USA
  • Lynette Hirschman, MITRE Corporation, USA
  • Martin Krallinger, Spanish National Cancer Centre, CNIO, Spain
  • Zhiyong Lu, National Center for Biotechnology Information (NCBI), NIH, USA
  • Carolyn Mattingly, North Carolina State University, USA
  • Catalina O. Tudor, University of Delaware, USA
  • Alfonso Valencia, Spanish National Cancer Centre, CNIO, Spain
  • Thomas Wiegers, North Carolina State University, USA
  • John Wilbur, National Center for Biotechnology Information (NCBI), NIH, USA
  • Cathy Wu, University of Delaware and Georgetown University, USA


    User Advisory Group

    Chairs: Cecilia Arighi and Zhiyong Lu

  • Judith Blake, MGI, The Jackson Laboratory, USA
  • Andrew Chatr-aryamontri, BioGrid, Canada
  • Stan Laulederkind, Rat Genome Database, USA
  • Donghui Li, TAIR, USA
  • Sherri Matis, Astrazeneca, USA
  • Fiona McCarthy, Agbase, USA
  • Peter McQuilton, Flybase, UK
  • Sandra Orchard, IntAct, UK
  • Phoebe Roberts, Pfizer, USA
  • Mary Schaeffer, MaizeGDB, USA
  • Kimberly Van-Auken, Wormbase, USA


    References

    1. Hirschman, L., A. Yeh, C. Blaschke, and A. Valencia, Overview of BioCreAtIvE: critical assessment of information extraction for biology. BMC Bioinformatics, 2005. 6 Suppl 1: p. S1. PMCID:PMC1869002
    2. Krallinger, M., A. Morgan, L. Smith, F. Leitner, L. Tanabe, J. Wilbur, L. Hirschman, and A. Valencia, Evaluation of text-mining systems for biology: overview of the Second BioCreative community challenge. Genome Biol, 2008. 9 Suppl 2: p. S1. PMCID:PMC2559980
    3. Leitner, F., S.A. Mardis, M. Krallinger, G. Cesareni, L.A. Hirschman, and A. Valencia, An Overview of BioCreative II.5. IEEE/ACM Trans Comput Biol Bioinform, 2010. 7(3): p. 385-99.
    4. Arighi, C.N., Z. Lu, M. Krallinger, K.B. Cohen, J. Wilbur, A. Valencia, L. Hirschman, and C.H. Wu, Overview of the BioCreative III Workshop. BMC Bioinformatics, 2011. 12 Suppl 8: p. S1.
    5. Krallinger, M., F. Leitner, C. Rodriguez-Penagos, and A. Valencia, Overview of the protein-protein interaction annotation extraction task of BioCreative II. Genome Biol, 2008. 9 Suppl 2: p. S4. PMCID:PMC2559988
    6. Wu, C.H., C.N. Arighi, K.B. Cohen, L. Hirschman, M. Krallinger, Z. Lu, C. Mattingly, A. Valencia, T.C. Wiegers, and W.J. Wilbur, Editorial: BioCreative-2012 Virtual Issue. Database (Oxford), 2012 (in press).


    Track 1-BioC: The BioCreative Interoperability Initiative

    Goals for this task include promoting simplicity, interoperability, and broad use and reuse of text mining modules. For this purpose we propose BioC, an interchange format for biomedical natural language processing tools. BioC is a family of simple XML formats, specified by a DTD, for sharing text documents and annotations. The proposed annotation approach allows many different annotations to be represented, including sentences, tokens, parts of speech, and named entities such as genes or diseases. BioC packages in both C++ and Java can be downloaded; the code includes basic classes for working with data in BioC format, as well as a couple of simple applications and examples. Participating teams will be asked to:

      a) Prepare a BioC module that can be seamlessly coupled with the rest of the BioC code and definitions, and that performs an important NLP or BioNLP task. The task is left to participating teams to choose, implement, and validate for the purposes of this challenge. If you are participating in any other BioCreative track and are producing a BioC-compliant module, you are welcome to submit your module to Track 1. If the module you wish to produce is independent of the other tracks, we request that you submit a proposal to the program committee for approval by the end of July 2013. The program committee wishes to approve all proposed independent projects at this stage to avoid overlapping tasks. Such a proposal can consist of a couple of paragraphs giving a high-level description of the module you wish to develop and contribute to the repository.
      b) Where necessary, prepare a corpus or otherwise make data available, in the BioC format, that will allow the challenge committee to test and judge the performance of the module produced in part a).
      c) Write a paper that describes the BioC module produced in part a), the data provided in part b), the evaluation of the module, and its proposed uses. The paper will be published as part of the BioCreative IV proceedings, and a selected number of papers will also be considered for publication in a special journal issue. An accepted module, along with the accompanying paper and data where appropriate, is understood to be a contribution to the BioC public repository. The final products must be submitted to the repository by September 8, 2013 to give the organizers sufficient time to judge the acceptability of a product.

    More information here
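    As a rough illustration of the kind of document the BioC format describes, the Python sketch below builds a one-document collection containing a single passage and a gene annotation. The element names follow the published BioC DTD (collection/document/passage/annotation), but the document ID, text, and annotation are invented for illustration.

    ```python
    import xml.etree.ElementTree as ET

    def build_bioc_collection():
        """Build a minimal BioC-style collection: one document, one passage,
        one named-entity annotation. Offsets are character offsets into the
        passage text, as in BioC."""
        collection = ET.Element("collection")
        ET.SubElement(collection, "source").text = "PubMed"
        doc = ET.SubElement(collection, "document")
        ET.SubElement(doc, "id").text = "12345"  # invented document id
        passage = ET.SubElement(doc, "passage")
        ET.SubElement(passage, "offset").text = "0"
        ET.SubElement(passage, "text").text = "BRCA1 is linked to breast cancer."
        ann = ET.SubElement(passage, "annotation", {"id": "T1"})
        infon = ET.SubElement(ann, "infon", {"key": "type"})
        infon.text = "gene"
        # "BRCA1" starts at character 0 and is 5 characters long
        ET.SubElement(ann, "location", {"offset": "0", "length": "5"})
        ET.SubElement(ann, "text").text = "BRCA1"
        return collection

    xml_string = ET.tostring(build_bioc_collection(), encoding="unicode")
    print(xml_string)
    ```

    Because the format is plain XML with a small, fixed vocabulary of elements, a module in any language can consume one BioC file and emit another, which is what makes pipeline coupling straightforward.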


    Track 2-Chemical compound and drug name recognition task (CHEMDNER)

    The goal of this task is to promote the implementation of systems that are able to detect mentions of chemical compounds and drugs, in particular those chemical entity mentions that can subsequently be linked to a chemical structure. Participating teams will provide the following predictions:

      a) Given a set of documents, return a list of chemical entities described within each of these documents.
      b) For a given document, provide the start and end indices corresponding to all the chemical entities mentioned in this document.
    For these two tasks the organizers will release training and test data collections. More information here.
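    As a toy sketch of sub-task b), the snippet below reports start/end character indices for chemical mentions found with a tiny hand-made lexicon. The lexicon entries are illustrative only; a competitive CHEMDNER system would use a trained tagger and a far larger chemical vocabulary.

    ```python
    import re

    # Toy lexicon standing in for a real chemical dictionary (hypothetical).
    CHEMICAL_LEXICON = ["aspirin", "ibuprofen", "acetylsalicylic acid"]

    def find_chemical_mentions(text):
        """Return (start, end, mention) tuples for lexicon matches, i.e. the
        start and end indices asked for in sub-task b). Longer entries are
        tried first so multi-word names win over their substrings."""
        pattern = re.compile(
            "|".join(sorted(map(re.escape, CHEMICAL_LEXICON),
                            key=len, reverse=True)),
            re.IGNORECASE)
        return [(m.start(), m.end(), m.group(0)) for m in pattern.finditer(text)]

    mentions = find_chemical_mentions("Aspirin (acetylsalicylic acid) reduces fever.")
    print(mentions)  # [(0, 7, 'Aspirin'), (9, 29, 'acetylsalicylic acid')]
    ```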


    Track 3-Comparative Toxicogenomics Database (CTD) Interoperability task

    CTD is a publicly available resource that seeks to promote understanding of the mechanisms by which drugs and environmental chemicals influence biological processes and human health. CTD curators manually curate chemical-gene/protein interactions, chemical-disease relationships, and gene-disease relationships. Building upon the tasks proposed in Track 1 (interoperability) and Track 2 (chemical and drug entity recognition), and driven by the direct biocuration needs at CTD, participating teams will be asked to provide Web Services that will enable CTD to send text passages to their remote sites in order to identify gene, chemical, disease, and action term mentions, each within the context of CTD's controlled vocabulary structure. Participating groups will be provided with training materials, a complete training corpus, and detailed testing results, including NER recall and processing response times.

    More information here
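    The Web Services contract could look something like the following JSON-in/JSON-out sketch: a passage comes in, and typed mentions keyed to a controlled vocabulary come back. The field names, vocabulary identifiers, and stub lookup are all placeholders; the actual interface is defined by the track organizers.

    ```python
    import json

    # Hypothetical term -> (type, vocabulary id) table; identifiers below are
    # placeholders, not real CTD controlled-vocabulary entries.
    VOCABULARY = {
        "cadmium": ("chemical", "chem:placeholder-1"),
        "apoptosis": ("action", "action:placeholder-2"),
    }

    def annotate_passage(request_json):
        """Given a JSON request carrying a text passage, return chemical and
        action-term mentions with character offsets and vocabulary ids. The
        stub lookup stands in for a real NER component."""
        passage = json.loads(request_json)["text"]
        mentions = []
        for term, (kind, vocab_id) in VOCABULARY.items():
            pos = passage.lower().find(term)
            if pos != -1:
                mentions.append({"term": term, "type": kind,
                                 "id": vocab_id, "offset": pos})
        return json.dumps({"mentions": mentions})

    response = annotate_passage(json.dumps({"text": "Cadmium induces apoptosis."}))
    print(response)
    ```

    The point of the shape, not the stub, is what matters: CTD posts a passage, and the remote service answers with structured, vocabulary-anchored mentions it can fold into its curation workflow.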


    Track 4-Gene Ontology Curation (GO task)

    The goal of this task is to promote research and tool development for assisting gene ontology (GO) term curation from biomedical literature, an important and common task for many Model Organism Databases such as WormBase. Participating teams will first be asked to classify whether or not an article is relevant for GO curation (a document triage task). Next, for those curatable articles, they will be asked to predict GO annotations, together with one or more supporting evidence sentences (an information extraction task). The organizers will provide teams with gold-standard GO annotations for each full text article, including evidence sentences for each GO annotation. This data set will be annotated by members of the User Advisory Group (UAG) who are active members of a variety of Model Organism Databases.

    More information here
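    For intuition, the triage step can be caricatured as scoring an abstract for curation-worthy evidence phrases. The cue list below is invented for illustration; a real triage system would train a classifier on the released gold-standard annotations over full text.

    ```python
    # Invented cue phrases suggestive of curatable GO evidence (hypothetical).
    GO_CUES = ["is required for", "localizes to", "regulates", "binds"]

    def triage_score(abstract):
        """Count GO-style cue phrases in the abstract; higher means more
        likely to contain curatable GO information."""
        text = abstract.lower()
        return sum(text.count(cue) for cue in GO_CUES)

    def is_curatable(abstract, threshold=1):
        """Binary triage decision: keep the article if it scores at or
        above the threshold."""
        return triage_score(abstract) >= threshold

    print(is_curatable("The kinase Abc1 localizes to the mitochondrion and "
                       "is required for respiration."))
    ```

    The extraction step would then run only on articles the triage step keeps, attaching GO terms and the sentences that support them.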

    Track 5-User Interactive task (IAT)

    The goal of this task is to foster interaction between system developers and biocurators in order to advance the production and adoption of text mining tools useful for biocuration. Participating teams will present a web-based system that can address a biocuration task of their choice. This is a demonstration task, but the systems will be formally evaluated by biocurators based on (i) performance (time-on-task and accuracy of the text mining-assisted task as compared to manual or some reference curation), and (ii) a subjective measure of usability/utility of the system via a user questionnaire. Participating teams should engage two end users to assist in the development phase and in evaluation dataset selection/annotation; the organizers, together with the User Advisory Group, will engage appropriate biocurators for the testing phase.

    Participating teams should submit a document by May 31, 2013 describing the system and the proposed biocuration task(s), providing the URL of a functioning system, and addressing the following aspects:

      a) Relevance and Impact: Teams should clearly describe i) the targeted user community (teams should be familiar and compliant with the needs and standards used by this community); ii) the proposed task and how it aligns with the guidelines set by the targeted user community; and iii) use cases for the application.
      b) User Interactivity: We are asking for web-based text mining systems with user interactivity (such as highlighting, editing, and exporting results). A document with the system's requirements and examples is available in the download section at the end of the page linked here.
      c) System Performance: The system should have undergone some internal benchmarking to show that it performs reasonably for user testing. Teams should (i) describe the dataset used for such benchmarking, including source (who annotated it) and size; (ii) report metrics: precision, recall, and F-measure, and/or mean average precision (MAP); and (iii) indicate the level at which the metrics were calculated (sentence vs. document), which should correspond to the level tested in the user evaluation.
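    The benchmarking metrics named in c) can be computed as in this small sketch; the identifiers and counts are invented examples.

    ```python
    def precision_recall_f1(predicted, gold):
        """Compute precision, recall, and F-measure over sets of predicted
        and gold items. Items can be sentence or document identifiers,
        matching the level at which the evaluation is run."""
        predicted, gold = set(predicted), set(gold)
        true_positives = len(predicted & gold)
        precision = true_positives / len(predicted) if predicted else 0.0
        recall = true_positives / len(gold) if gold else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1

    # 4 predictions, 3 gold items, 2 correct -> P = 1/2, R = 2/3, F = 4/7
    p, r, f = precision_recall_f1({"d1", "d2", "d3", "d4"}, {"d1", "d2", "d5"})
    print(p, r, f)
    ```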

    Teams interested in participating in this track should email Cecilia Arighi by January 15, 2013, with the subject "BioCreative IV IAT". This will allow early planning and coordination of the task; however, this notification is not a commitment. More information here.
