Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BC Workshop '12

Track I - Triage [2011-09-09]

We invite text mining teams to develop a system that assists curators in selecting relevant articles for curation in the Comparative Toxicogenomics Database (CTD).

Given a chemical as input, the system should present the curator with a list of PubMed IDs ranked from most to least likely to be curatable, along with information that helps the curator assess the ranking.

Therefore, for each abstract, the system should provide the following information in a TAB-delimited flat file:

  1. PubMed ID
  2. Title
  3. Abstract
  4. Journal
  5. Cited Gene Actors (Entries are '|' delimited)
  6. Cited Chemical Actors (Entries are '|' delimited)
  7. Cited Disease Actors (Entries are '|' delimited)
  8. Marked-up HTML of abstract with tagged links back to CTD for all actors and terms (see note below)
  9. Document Relevancy Score (normalized from 0, non-curatable, to 1, curatable)
  10. *Marked-up HTML of relevant sentences/phrases extracted with tagged links back to CTD for all actors and terms (sentences/phrases are '|' delimited)
  11. *Cited Action Terms (Entries are '|' delimited)
  12. *Cited Interactions (Entries are '|' delimited)

Fields preceded by * are optional; contributors unable to provide this information should still preserve the field's position in each row and simply leave the entry blank. An example of the output format is provided in Downloads at the end of this page.
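
As an illustration of the row layout described above, the following is a minimal Python sketch, not an official implementation. The dictionary keys, the helper names `format_row` and `write_ranked`, and the three-decimal score format are assumptions made for this example; the example output file in Downloads remains the authoritative reference.

```python
# Field order follows the twelve-item list above; the key names are illustrative.
FIELDS = [
    "pubmed_id", "title", "abstract", "journal",
    "gene_actors", "chemical_actors", "disease_actors",
    "abstract_html", "relevancy_score",
    "sentences_html", "action_terms", "interactions",  # optional (*) fields
]
# Fields whose entries are '|' delimited.
LIST_FIELDS = {
    "gene_actors", "chemical_actors", "disease_actors",
    "sentences_html", "action_terms", "interactions",
}


def format_row(record):
    """Serialize one abstract as a 12-column TAB-delimited row."""
    cells = []
    for name in FIELDS:
        value = record.get(name, "")  # missing optional fields stay blank
        if name == "relevancy_score":
            score = float(record["relevancy_score"])
            if not 0.0 <= score <= 1.0:
                raise ValueError("score must be normalized to [0, 1]")
            value = f"{score:.3f}"  # assumed precision, not specified above
        elif name in LIST_FIELDS and not isinstance(value, str):
            value = "|".join(value)
        # Tabs or newlines inside a cell would break the flat-file layout.
        cells.append(str(value).replace("\t", " ").replace("\n", " "))
    return "\t".join(cells)


def write_ranked(records):
    """Return all rows, ordered from most to least likely curatable."""
    ranked = sorted(records, key=lambda r: float(r["relevancy_score"]),
                    reverse=True)
    return "\n".join(format_row(r) for r in ranked) + "\n"
```

Records that omit the optional fields still produce twelve columns, with the last three left empty.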
NOTE: Links to CTD should take the following form for chemical, disease, and gene actors:
http://ctd.mdibl.org/basicQuery.go?bqCat=<gene|chem|disease>&bq=<term>. Please refer to the output file for examples.
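
The link pattern above could be generated along the following lines. This is a sketch under stated assumptions: the helper names `ctd_link` and `tag_actor` are hypothetical, URL-encoding the term with `quote_plus` is an assumption, and wrapping each actor in a plain `<a>` element is only one plausible markup. The example output file should be checked for the exact form expected.

```python
from html import escape
from urllib.parse import quote_plus

CTD_BASE = "http://ctd.mdibl.org/basicQuery.go"
CATEGORIES = {"gene", "chem", "disease"}  # allowed bqCat values per the note


def ctd_link(category, term):
    """Build a CTD basic-query URL for one actor."""
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    # Encoding spaces and reserved characters is an assumption of this sketch.
    return f"{CTD_BASE}?bqCat={category}&bq={quote_plus(term)}"


def tag_actor(category, term):
    """Wrap a term in an HTML anchor pointing back to CTD (assumed markup)."""
    href = escape(ctd_link(category, term), quote=True)
    return f'<a href="{href}">{escape(term)}</a>'
```

For instance, `ctd_link("chem", "sodium arsenite")` yields a URL whose query string is `bqCat=chem&bq=sodium+arsenite`.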

To help you with the system development we provide:

  • The CTD curation overview that describes prioritization criteria (see downloads at the end of this page)
  • A training corpus (see downloads at the end of this page)
    The columns for each file are as follows:
    1. PubMed ID
    2. Title
    3. Abstract
    4. Journal
    5. Date
    6. Curatable?
    7. Number of Interactions
    8. Curated Interactions (Entries are '|' delimited)
    9. Curated Gene Actors (Entries are '|' delimited)
    10. Curated Chemical Actors (Entries are '|' delimited)
    11. Curated Disease Actors (Entries are '|' delimited)
    12. Curated Action Terms (Entries are '|' delimited)
  • The complete CTD controlled vocabularies, in multiple formats, may be found here:
  • Test data will be released on February 6, 2012
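
A training file with the twelve columns listed above might be loaded as sketched below. The sketch assumes the files are TAB-delimited like the output format, have no header row, and use no quoting; none of this is stated above, so the downloaded files should be inspected first. The class and function names are illustrative, and the "Curatable?" and "Number of Interactions" values are kept as raw strings because their encoding is not specified here.

```python
import csv
from dataclasses import dataclass


@dataclass
class TrainingRecord:
    pubmed_id: str
    title: str
    abstract: str
    journal: str
    date: str
    curatable: str          # raw value; encoding not specified above
    n_interactions: str     # raw value; may be blank
    interactions: list
    gene_actors: list
    chemical_actors: list
    disease_actors: list
    action_terms: list


def split_entries(cell):
    """Split a '|' delimited cell, dropping empty entries."""
    return [entry for entry in cell.split("|") if entry]


def parse_training(lines):
    """Parse an iterable of TAB-delimited lines into TrainingRecord objects."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    records = []
    for row in reader:
        if len(row) != 12:
            raise ValueError(f"expected 12 columns, got {len(row)}")
        records.append(
            TrainingRecord(*row[:7], *(split_entries(c) for c in row[7:]))
        )
    return records
```

An open file handle can be passed directly, for example `parse_training(open(path, newline="", encoding="utf-8"))`, where `path` points to one of the downloaded training files.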

  • System requirements:
    • The system must be web-based and compatible with Mozilla Firefox 4.0 or higher.
    • In general, the CTD Curation Tool is web-based and integrates Java 6, JSP 2.1/Servlet 2.5, HTML5, CSS3, JavaScript 1.8.5, and AJAX in the context of an MVC architecture, in conjunction with Apache HTTP Server 2.2.15 and Tomcat 6.0.24. CTD's batch environment is Java 6. The operating environment is Red Hat Enterprise Linux 6.0. Data is stored in a PostgreSQL 9.0 database management system. It is strongly preferred that any solution be easily integrated into the existing CTD architecture.

  • Text Mining System Description:
    Each participating team should provide a description of its system. The description should be no longer than 6 pages, including figures; Word or .rtf formats are preferred.

    Important Dates

    Item                               | Deadline          | Submit via                  | Comment
    Team Registration                  | November 15, 2011 | web                         | register here
    Release of Test Data               | February 06, 2012 | email to twiegers@mdibl.org | Subject: Track1 commitment
    Text Mining System Description     | February 20, 2012 | email to twiegers@mdibl.org | Subject: BioCreative-2012 Track I
    Submission of Benchmarking Results | February 20, 2012 | email to twiegers@mdibl.org | Subject: BioCreative-2012 Track I-results
    Interface Available for Testing    | March 1, 2012     | email to twiegers@mdibl.org | Subject: BioCreative-2012 Track I-testing

    Downloads