
BioCreative IV

Track 3 - BioCreative 2013 CTD Track [2012-11-15]

ORGANIZERS

    Thomas C. Wiegers, Allan Peter Davis, and Carolyn J. Mattingly
    North Carolina State University

BACKGROUND

The Comparative Toxicogenomics Database (CTD) is a publicly available resource that seeks to promote understanding of the mechanisms by which drugs and environmental chemicals influence biological processes and human health. CTD curators manually curate chemical-gene/protein interactions, chemical-disease relationships, and gene-disease relationships.

The BioCreative 2012 Track I Triage workshop focused on document triage for CTD. More specifically, participants developed tools that ranked articles by their curatability and identified gene/protein, chemical, and disease actors, as well as CTD interaction-related action terms.

Although the results were very impressive, they were of little direct benefit to CTD: tools developed by Track I participants were written using a wide variety of technologies and within technical infrastructures that would not easily integrate into CTD's existing text-mining pipeline. In short, interoperability was the major impediment to applying the collaboration's output directly to the CTD pipeline; obstacles include, but are not limited to, operating system compatibility, programming languages and versions, tool versioning, tool input file and library requirements, and database management system compatibility.

BIOCREATIVE IV OVERVIEW

The CTD track of BioCreative IV will focus on interoperability: can teams such as CTD directly benefit from the text mining tools developed by collaborators worldwide, without having to worry about the technical details of the tools themselves? If so, would the response times of such tools be suitable for asynchronous, batch processing-based text mining using technologies such as Web Services? CTD involvement in BioCreative IV calls for participants to build interoperable tools that can be accessed remotely by batch-oriented CTD text-mining processes; this approach, if effective, could serve as a proof of concept for decoupling a text mining integration team's technical infrastructure from the potentially disparate infrastructures of one or more text mining service providers.

Participants are asked to provide Web Services that will enable CTD to send text passages to their remote sites in order to identify gene/protein, chemical, disease, and chemical/gene-specific action term mentions, each within the context of CTD's controlled vocabulary structure. Although chemical/gene-specific action terms were included as a text mining category in BioCreative 2012 Track I, participants did only limited work in this regard. It is the hope of the organizers that chemical/gene-specific action terms will be an important focus for BioCreative IV participants, since they are such an important component of CTD-related text mining.

DETAILED REQUIREMENTS AND DELIVERABLES

Participants are asked to provide Web Services for gene/protein, chemical, disease, and/or CTD chemical/gene-specific action term named-entity recognition (NER); participants may provide tools for any one of the NER categories, all of them, or any combination thereof. The complete CTD controlled vocabularies, in multiple formats, may be found here for each of the NER categories:

Each participating team should make available a single Web Service for each individual NER category submitted for testing. The Web Service should employ a Representational State Transfer (RESTful) architecture accessed via the HTTP POST method. The RESTful URL should include the keyword "gene" for gene/protein NER, "chem" for chemical NER, "disease" for disease NER, or "action_term" for chemical/gene-specific action term NER. For example, a gene/protein NER RESTful URL might look as follows:

http://localhost:8080/rest/gene/post/

Note the inclusion of the word 'gene' in the URL; the words 'rest' and 'post' are not necessary.
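The URL convention above can be sketched as a minimal endpoint using only the Python standard library. Everything here is an illustrative assumption, not a reference implementation: the handler name is invented, and the service simply echoes the posted BioC XML back, where a real entry would run NER and add annotations before replying.

```python
import http.server
import threading
import urllib.request

class GeneNERHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical gene NER endpoint sketch: echoes the posted BioC XML.
    A real service would parse the XML, run NER, and add annotations."""

    def do_POST(self):
        if "gene" not in self.path:          # URL must contain the NER keyword
            self.send_error(404, "unknown NER category")
            return
        length = int(self.headers.get("Content-Length", 0))
        bioc_request = self.rfile.read(length)   # BioC XML from the CTD client
        bioc_response = bioc_request             # placeholder: no NER performed
        self.send_response(200)
        self.send_header("Content-Type", "application/xml")
        self.end_headers()
        self.wfile.write(bioc_response)

    def log_message(self, fmt, *args):
        pass  # keep the sketch quiet

# Exercise the handler once on an ephemeral local port.
server = http.server.HTTPServer(("localhost", 0), GeneNERHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = "http://localhost:%d/rest/gene/post/" % server.server_address[1]
req = urllib.request.Request(url, data=b"<collection/>", method="POST")
with urllib.request.urlopen(req) as resp:
    status, body = resp.status, resp.read()
server.shutdown()
print(status, body)
```

Because only the keyword matters, the same handler would serve a URL without the 'rest' and 'post' path segments.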

CTD software engineers will develop a client to test each participant entry for NER recall against a test dataset. For simplicity in testing, it is preferred that there be no security on the participant Web Service.

All communications between the CTD test client and the participant Web Service will be within the context of the BioC XML framework. For more information on BioC, please refer here. The BioC DTD file may be reviewed here.

The CTD test client will send BioC XML to the participant Web Service; the Web Service will perform NER analysis on the submitted title and abstract and return BioC XML with annotations representing NER results.
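The round trip described above can be sketched as follows. The element names follow the BioC DTD (collection, document, passage, annotation, infon), but the PMID, passage text, and annotation values here are invented for illustration and are not taken from the track's actual sample files.

```python
import xml.etree.ElementTree as ET

# Hypothetical BioC response sketch; values are invented for illustration.
response_xml = """<collection>
  <source>CTD</source>
  <date>2013-08-19</date>
  <key>CTD_key.key</key>
  <document>
    <id>12345678</id>
    <passage>
      <infon key="type">abstract</infon>
      <offset>0</offset>
      <text>Serotonin signals through the 5-HT(2B) receptor.</text>
      <annotation id="1">
        <infon key="type">gene</infon>
        <text>HTR2B</text>
      </annotation>
    </passage>
  </document>
</collection>"""

root = ET.fromstring(response_xml)
# Collect the annotated mentions that the CTD client would score for recall.
mentions = [a.findtext("text") for a in root.iter("annotation")]
print(mentions)
```

Note that the annotation text is the controlled-vocabulary form "HTR2B" rather than the surface mention "5-HT(2B)"; as explained below, either form would count for recall.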

Here is an example of a Post request that the CTD client might send to the Web Service.

Here are examples of responses from the respective NER Web Services:

There are several points to note in the sample responses:

  • The actual location of the mention is not included in the samples; although location information may be included in the participant Web Service's response, it will not be validated by CTD, is not required, and will simply be ignored.
  • Recall scores will be calculated by dividing the number of distinct curated actors successfully identified by the NER Web Service - either by the curated term itself or by one of its synonyms - by the total number of distinct curated actors. Consequently, an annotation need not reproduce the text of the abstract or title verbatim; it may instead reflect the mention translated to CTD's respective controlled vocabulary. For example, in the gene response sample above, the annotation for "5-HT(2B)" was "HTR2B", which is simply "5-HT(2B)" translated to CTD's underlying controlled gene vocabulary term "HTR2B". [NOTE: In this particular case, "5-HT(2B)" is a direct synonym for "HTR2B", so an annotation reflecting either would have been counted as a match for purposes of recall calculation.]
  • Additional information may be included in the Web Service response; the samples above reflect the minimum required information for CTD to calculate recall. For example, as indicated above, it is perfectly acceptable to include location information in the Web Service response; that information will simply be ignored.
  • Although annotation information has been converted entirely to upper case in the samples for consistency, this is not necessary. All matching for purposes of recall score calculation will be case-insensitive.
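The translation and case rules in the points above can be sketched in a few lines. The synonym table is a made-up fragment (only the "5-HT(2B)"/"HTR2B" pair is quoted from the text above), not CTD's actual vocabulary, and the function name is invented.

```python
# Hypothetical fragment of a synonym table mapping lowercased mentions
# to CTD controlled-vocabulary gene terms.
GENE_SYNONYMS = {
    "5-ht(2b)": "HTR2B",   # synonym pair cited in the track description
}

def normalize(mention: str) -> str:
    """Return the controlled-vocabulary form of a mention when known,
    otherwise the upper-cased mention (matching is case-insensitive)."""
    return GENE_SYNONYMS.get(mention.lower(), mention.upper())

print(normalize("5-HT(2B)"))  # HTR2B
print(normalize("Htr2b"))     # HTR2B
```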

The BioC key file that describes the properties of the sample requests and responses is provided below.

Finally, each participating team should provide a brief description of their respective NER tool(s). The description should be no longer than 4 pages, including figures and references; Microsoft Word format is preferred.

Participants should submit their Web Service details, along with their system descriptions, to Thomas Wiegers (tcwieger@ncsu.edu) by no later than:

August 19, 2013

The Web Service should remain available for review by CTD from August 19, 2013 through December 31, 2013. Detailed results, including NER recall and processing response times, will be provided to participants.

NER WEB SERVICE TESTING FACILITY

A utility is available to enable participants to test their named-entity recognition (NER) Web Service(s) against the Track 3 learning corpus, providing testing metrics; the utility may be accessed here:

To use the system, the participant accesses the aforementioned URL, enters a PubMed ID from the Track 3 learning corpus and the URL of the participant NER Web Service, and then selects the appropriate NER type. Pressing Submit causes a BioC XML Post request to be made to the Web Service; the BioC XML response from the Web Service is parsed and processed against the CTD curation dataset, and a report is then generated summarizing the results. Reports are provided in both text and HTML format.

Details are provided in the report that enable participants to understand how the NER metrics are calculated. For recall and precision calculations, three fields are provided for each paper's curated genes, diseases, chemicals, or action terms in the respective NER category:

  • Curated Terms - Lists the terms, if any, that were curated for the PubMed paper by CTD in the respective NER category,
  • Text Mined Terms - Lists the text mined terms, if any, that were returned from the NER Web Service for the respective PubMed paper,
  • Text Mined Hits - Provides an explanation of how matches between the curated terms and the text mined terms were determined. Because synonyms of curated terms are counted as matches, the notation CYP1-->CYP1A1, for example, indicates that the term CYP1 was text mined and is a valid synonym of the actual underlying curated term CYP1A1; alternatively, FZR1-->FZR1 indicates that the text mined term FZR1 exactly matched the curated term.

As indicated above, recall is calculated by dividing the number of curated term hits by the number of curated terms. Precision is calculated by dividing the number of curated term hits by the number of text mined terms.
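The bookkeeping just described can be sketched as follows. The curated terms, text-mined terms, and synonym table are invented examples (the CYP1/CYP1A1 and FZR1 cases mirror the notation described above); matching is case-insensitive, and each curated actor is counted at most once, per the recall definition.

```python
# Hypothetical synonym fragment: lowercased mined term -> curated term.
synonyms = {"cyp1": "CYP1A1"}

curated = ["CYP1A1", "FZR1", "TP53"]       # terms curated by CTD for a paper
text_mined = ["CYP1", "FZR1", "ACTB"]      # terms returned by the Web Service

curated_by_upper = {t.upper(): t for t in curated}
matched = set()                            # distinct curated actors hit so far
hits = []
for mined in text_mined:
    term = curated_by_upper.get(mined.upper()) or synonyms.get(mined.lower())
    if term and term not in matched:
        matched.add(term)
        hits.append(f"{mined}-->{term}")   # e.g. CYP1-->CYP1A1

recall = len(hits) / len(curated)          # hits / distinct curated terms
precision = len(hits) / len(text_mined)    # hits / text-mined terms
print(hits, recall, precision)
```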

LEARNING CORPUS AND TRAINING MATERIALS

A learning corpus in BioC XML format is provided here:

The corpus comprises 1,112 PubMed titles and abstracts, together with all curated gene/protein, chemical, and disease actors and the associated chemical/gene-specific action terms. Each curated interaction associated with an article is also provided. Please note that the interactions are provided for reference only; participants are not being asked to provide interaction-based NER tools.

The BioC key file may be found here:

This key file describes the BioC XML format in the learning corpus, as well as the sample requests and responses referenced above.

General background information regarding CTD may be reviewed here: