BioCreative - Track I - Triage

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BC Workshop '12

Track I - Triage [2011-09-09]

We invite text mining teams to develop a system that assists curators in selecting relevant articles for curation in the Comparative Toxicogenomics Database (CTD).

Given a chemical (input), the system should present the curator with a list of PubMed IDs in ranked order, from more likely to less likely curatable, along with information that will help the curator assess that ranking.

Therefore, for each abstract, the system should provide the following information in a TAB-delimited flat file format:

PubMed ID
Title
Abstract
Journal
Cited Gene Actors (Entries are '|' delimited)
Cited Chemical Actors (Entries are '|' delimited)
Cited Disease Actors (Entries are '|' delimited)
Marked-up HTML of abstract with tagged links back to CTD for all actors and terms (see note below)
Document Relevancy Score (normalized from 0, not curatable, to 1, curatable)
*Marked-up HTML of relevant sentences/phrases extracted with tagged links back to CTD for all actors and terms (sentences/phrases are '|' delimited)
*Cited Action Terms (Entries are '|' delimited)
*Cited Interactions (Entries are '|' delimited)

Fields preceded by * are optional; contributors unable to provide this information should still account for the field's position in each row and simply leave the individual entry blank. An example of the output format is provided in Downloads at the end of this page.
NOTE: Links to CTD for chemical, disease, and gene actors should take the following form:
http://ctd.mdibl.org/basicQuery.go?bqCat=<gene|chem|disease>&bq=<term>
Please refer to the output file for examples.
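The output layout described above can be sketched in Python as follows. This is only a minimal illustration, not an official reference implementation: the record keys (`pmid`, `abstract_html`, `score`, and so on), the helper names, the four-decimal score formatting, and the percent-encoding of the term in the CTD link are all assumptions made for the example. The example output file in Downloads remains the authoritative reference.

```python
from urllib.parse import quote_plus

CTD_QUERY_URL = "http://ctd.mdibl.org/basicQuery.go"
CATEGORIES = ("gene", "chem", "disease")


def ctd_link(category, term):
    """Build a CTD link of the form ...?bqCat=<category>&bq=<term>."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown CTD category: {category!r}")
    # Assumption: the term is percent-encoded; the task text does not say.
    return f"{CTD_QUERY_URL}?bqCat={category}&bq={quote_plus(term)}"


def clean(value):
    """Collapse tabs and newlines so a value cannot break the row layout."""
    return " ".join(str(value).split())


def join_entries(entries):
    """Join multi-valued fields with '|' (entries containing '|' are not handled)."""
    return "|".join(clean(entry) for entry in entries)


def format_row(rec):
    """Return one TAB-delimited row with all 12 fields in the listed order."""
    score = float(rec["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError("relevancy score must be normalized to [0, 1]")
    fields = [
        clean(rec["pmid"]),
        clean(rec["title"]),
        clean(rec["abstract"]),
        clean(rec["journal"]),
        join_entries(rec["genes"]),
        join_entries(rec["chemicals"]),
        join_entries(rec["diseases"]),
        clean(rec["abstract_html"]),
        f"{score:.4f}",  # assumed precision
        # Optional (*) fields: left blank when missing, position preserved.
        join_entries(rec.get("sentences_html", [])),
        join_entries(rec.get("action_terms", [])),
        join_entries(rec.get("interactions", [])),
    ]
    return "\t".join(fields)


def ranked_lines(records):
    """Order records from most to least likely curatable, then format them."""
    ordered = sorted(records, key=lambda r: float(r["score"]), reverse=True)
    return [format_row(r) for r in ordered]
```

Leaving the three optional fields as empty strings keeps every row at 12 columns, so a consumer can split each line on TAB without checking which fields a team chose to supply.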

To help you with system development, we provide:

The CTD curation overview that describes prioritization criteria (see downloads at the end of this page)

A training corpus (see downloads at the end of this page)
The columns for each file are as follows:

PubMed ID
Title
Abstract
Journal
Date
Curatable?
Number of Interactions
Curated Interactions (Entries are '|' delimited)
Curated Gene Actors (Entries are '|' delimited)
Curated Chemical Actors (Entries are '|' delimited)
Curated Disease Actors (Entries are '|' delimited)
Curated Action Terms (Entries are '|' delimited)
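A reader for the training corpus could look roughly like the sketch below. It assumes, without confirmation from this page, that the training files are TAB-delimited like the output format, contain one article per line, and have no header row. The column keys are names invented for this example, and the "Curatable?" value is kept as a raw string because its encoding is not specified here.

```python
import io

# Hypothetical keys, in the column order listed above.
COLUMNS = [
    "pubmed_id", "title", "abstract", "journal", "date", "curatable",
    "number_of_interactions", "curated_interactions",
    "curated_gene_actors", "curated_chemical_actors",
    "curated_disease_actors", "curated_action_terms",
]

# Columns whose entries are '|' delimited.
MULTI_VALUED = COLUMNS[7:]


def split_entries(value):
    """Split a '|' delimited field, dropping empty entries."""
    return [entry for entry in value.split("|") if entry]


def read_training(handle):
    """Yield one dict per non-empty line of a training file."""
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        parts = line.split("\t")
        # Pad short rows so missing trailing columns become empty strings.
        parts += [""] * (len(COLUMNS) - len(parts))
        row = dict(zip(COLUMNS, parts))
        for key in MULTI_VALUED:
            row[key] = split_entries(row[key])
        yield row


if __name__ == "__main__":
    # Made-up line for illustration only; not taken from the real corpus.
    sample = (
        "12345\tExample title\tExample abstract\tExample Journal\t2010\t"
        "yes\t2\tixn one|ixn two\tGENE1\tchemical A|chemical B\t\t\n"
    )
    for record in read_training(io.StringIO(sample)):
        print(record["pubmed_id"], record["curated_chemical_actors"])
```

Parsing the multi-valued columns into lists up front makes it easy to compare a system's cited actors against the curated ones during development.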

The complete CTD controlled vocabularies, in multiple formats, may be found here:

For chemicals:
http://ctd.mdibl.org/downloads/#allchems
For genes:
http://ctd.mdibl.org/downloads/#allgenes
For diseases:
http://ctd.mdibl.org/downloads/#alldiseases
For action terms:
http://ctd.mdibl.org/downloads/#gcixntypes

Test data will be released on February 6, 2012.

System requirements:

The system must be web-based and compatible with at least Mozilla Firefox 4.0 or higher.

In general, the CTD Curation Tool is web-based and integrates Java 6, JSP 2.1/Servlet 2.5, HTML5, CSS3, JavaScript 1.8.5, and AJAX in the context of an MVC architecture, in conjunction with Apache HTTP Server 2.2.15 and Tomcat 6.0.24. CTD's batch environment is Java 6. The operating environment is Red Hat Enterprise Linux 6.0. Data is stored in a PostgreSQL 9.0 database management system. It is strongly preferred that any solution be easily integrated into the existing CTD architecture.

Text Mining System Description:
Each participating team should provide a description of its system. The description should be no longer than 6 pages, including figures; Word or .rtf formats are preferred.

Important Dates

Item	Deadline	Submit via	Comment
Team Registration	November 15, 2011	web	register here
Release of Test Data	February 06, 2012	email to twiegers@mdibl.org	Subject:Track1 commitment
Text Mining System Description	February 20, 2012	email to twiegers@mdibl.org	Subject:BioCreative-2012 Track I
Submission of Benchmarking Results	February 20, 2012	email to twiegers@mdibl.org	Subject:BioCreative-2012 Track I-results
Interface Available for Testing	March 1, 2012	email to twiegers@mdibl.org	Subject:BioCreative-2012 Track I-testing
