Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative III

IAT: Interactive Demonstration Task for Gene Indexing and Retrieval [2010-07-22]

Organizers: Arighi/Hirschman/Wu

The BioCreative interactive tasks, to be run as part of the BioCreative workshops in 2010 and 2012, are meant to provide the component modules for text mining services for biocuration. The goal is to support real-life tasks (e.g., mining genetic information and curation from the biomedical literature) by combining multiple text mining tasks to retrieve literature and extract relevant information, and to provide results that can be integrated into the curation workflow.

The BioCreative III interactive task will be a demonstration task, with the goals of gathering curator requirements, encouraging the development of prototype systems, beginning development of data exchange standards, and laying the groundwork for an evaluation of interactive annotation systems for BioCreative IV (2012). As part of these activities, the BioCreative III workshop in September 2010 will devote a session to discussing possible exchange formats that facilitate interoperability of the text mining modules.

The BioCreative III interactive task (IAT 2010) is a demonstration task, and will focus on indexing (identifying which genes are being studied in an article and linking these genes to standard database identifiers) and gene-oriented document retrieval (identifying full-text papers relevant to a selected gene). This approach will facilitate the definition of metrics and acquisition of data that are necessary for designing the evaluation of the interactive systems in the BioCreative IV challenge.

To support this activity, the BioCreative organizers will:

  • Identify a set of end users from the participating curation teams, provided by the BioCreative User Advisory Group
  • Solicit participation from system developers to provide prototype systems
  1. System Requirements

    Data source: full-text articles in XML (not PDF) from the PubMed Central Open Access collection. The data are available at http://www.biocreative.org/resources/corpora/biocreative-iii-corpus/ (file: IAT PubMedCentral XML Data).

    Indexing

    Input: user-selected PMCID

    Output: the indexing system will return a list of gene/protein identifiers linked to the appropriate database identifiers from the selected full-text article. The list of genes/proteins should be ranked for their importance or “centrality” to the article (see below for discussion/definition); such ranking could, for example, take into consideration the frequency of gene/protein mention, the sections of the article where the gene is mentioned, the mention of the gene in figures or experimental results, etc.
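The ranking criteria named above (mention frequency, article section, figures, experimental results) could be combined into a single score. The following is a minimal sketch under stated assumptions, not a method prescribed by the task: the `GeneEvidence` fields, the section weights, and the bonus values are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical weights: the task description says the section of a mention
# may matter, but does not say by how much.
SECTION_WEIGHTS = {"results": 3.0, "abstract": 2.0, "discussion": 1.0}
DEFAULT_WEIGHT = 1.0
FIGURE_BONUS = 2.0      # assumed bonus for a mention in a figure
EXPERIMENT_BONUS = 5.0  # assumed bonus for associated experimental results


@dataclass
class GeneEvidence:
    """Evidence collected for one gene/protein in one article (illustrative)."""
    name: str
    db_id: str  # e.g. an EntrezGene or UniProt identifier
    mentions_by_section: dict = field(default_factory=dict)
    in_figure: bool = False
    has_experimental_result: bool = False


def centrality_score(gene: GeneEvidence) -> float:
    """Weighted mention count plus bonuses for figures and experiments."""
    score = sum(
        SECTION_WEIGHTS.get(section.lower(), DEFAULT_WEIGHT) * count
        for section, count in gene.mentions_by_section.items()
    )
    if gene.in_figure:
        score += FIGURE_BONUS
    if gene.has_experimental_result:
        score += EXPERIMENT_BONUS
    return score


def rank_genes(genes):
    """Return the genes ordered from most to least central."""
    return sorted(genes, key=centrality_score, reverse=True)
```

Keeping the weights in plain module-level constants makes it easy to adjust them as curator feedback refines the notion of centrality.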

    Retrieval

    Input: user-selected gene

    Output: the retrieval system will return a ranked list of documents from PubMed Central (with links to the full text) that are relevant sources of information on the selected gene.
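One way to picture the retrieval output is as a lookup over previously indexed articles. This is a sketch only: the shape of the `index` dictionary, the use of a per-article gene score for ranking, and the link template are assumptions made for illustration, and the PMCIDs in the usage example are placeholders.

```python
# Assumed link format for a PubMed Central full-text article.
PMC_URL_TEMPLATE = "https://www.ncbi.nlm.nih.gov/pmc/articles/{pmcid}/"


def retrieve_documents(gene_id, index):
    """Return articles mentioning gene_id, best-scoring first.

    index maps PMCID -> {gene identifier -> score for that article}.
    """
    hits = [
        (scores[gene_id], pmcid)
        for pmcid, scores in index.items()
        if gene_id in scores
    ]
    # Highest score first; ties are broken by PMCID for a stable order.
    hits.sort(key=lambda hit: (-hit[0], hit[1]))
    return [
        {
            "pmcid": pmcid,
            "score": score,
            "url": PMC_URL_TEMPLATE.format(pmcid=pmcid),
        }
        for score, pmcid in hits
    ]


# Usage with placeholder identifiers:
example_index = {
    "PMC0000001": {"G1": 2.0, "G2": 9.0},
    "PMC0000002": {"G1": 7.5},
    "PMC0000003": {"G2": 1.0},
}
ranked = retrieve_documents("G1", example_index)
```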

    Interface:

    Display will include

    • editable list of gene/protein identifiers, including names, identifier linked to appropriate standard database (EntrezGene, UniProt), species, links to one or more mentions in the abstract or text.
    • window showing full text, including annotations of genes

    Desired capabilities will include

    • ability to sort the gene list by frequency (how many times a gene is mentioned), location (in which sections it is mentioned), experimental evidence (whether it is studied in an experiment), or a combination of these
    • support to identify gene/protein mentions in text and link a mention to a unique identifier
    • support for interactive disambiguation of gene/protein mentions based on context (e.g., other genes, species, chromosomal location) to enable the user to manually select the correct unique identifier from a set of possibilities (or to enter in the identifier if it is not present in the list)
    • ability to select a gene from the list and retrieve full text articles from PubMed Central that provide further information on the selected gene (for the retrieval subtask)
    • ability to collect event and timing information at the session level (and ideally at a finer granularity of user action)
    • Newly added: the ability to export results (e.g., as a tab-delimited file) containing the following fields is encouraged: PMCID|Gene|DB ID|ranking.

    Additional capabilities could include evidence attribution, highlighting the sentences with the gene mentions, species, or any of the attributes mentioned above.
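The export capability in the list above could look like the following sketch, which writes the four fields PMCID, Gene, DB ID, and ranking as tab-delimited text using Python's standard `csv` module. The header row, the dictionary keys, and the function names are assumptions; the task description names only the fields.

```python
import csv
import io

# Column names taken from the task description; emitting them as a header
# row is an assumption.
EXPORT_FIELDS = ["PMCID", "Gene", "DB ID", "ranking"]


def export_results(rows, handle):
    """Write one tab-delimited line per gene to an open text handle."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(EXPORT_FIELDS)
    for row in rows:
        writer.writerow(
            [row["pmcid"], row["gene"], row["db_id"], row["ranking"]]
        )


def export_to_string(rows):
    """Convenience wrapper that returns the export as a string."""
    buffer = io.StringIO()
    export_results(rows, buffer)
    return buffer.getvalue()
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` to `export_results`.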

  2. Task

    For evaluation, the BioCreative User Advisory Group (UAG) will identify users/curators who will provide feedback on the interactive demonstration systems. The UAG is made up of participants from some of the major expert-curated biological databases as well as industrial users.

    Each end user or curator will be asked to index a set of articles. The automatic indexing systems will produce a list of the genes/proteins in the article, including both the gene/protein name and the associated unique identifier (EntrezGene for genes, UniProt identifier for proteins), sorted according to a ranking established by each user/curator. The user should be able to sort the list based on frequency of gene mention and/or article section. The user will then pick one or more candidate genes from the list and retrieve articles in which the selected gene is primarily studied.

    Users will also provide feedback on what features they found useful and what additional features they would like to have. Features might include species for the gene/protein, links to mentions of genes in the article, or attributes of the gene such as chromosome location.

    NOTE on gene ranking: Genes should be ranked by overall importance in the article, which can be reflected by their frequency in the article, the location of their mentions, their association with experimental data, etc. We are still refining these criteria. So far, based on curator input, the focus has been on genes with a high frequency of mention, while also taking into consideration the section of the article in which they are mentioned. For example, a mention in the Results section counts for more than a mention in the Discussion, and a gene with associated experimental results may be more important/central than a gene with no associated results. We know that this definition does not imply that the gene is relevant for all curators, and we will refine this concept based on user and developer feedback.

    Members of the User Advisory Group (UAG) have gone through the exercise of selecting primary genes for two articles based on some of the criteria mentioned above. The results are summarized in the table below. The responses reflected two views:

    • Members who considered any gene that appeared in an experiment (including markers and control proteins) to be a primary gene
    • Members who considered a gene primary only if it had experimental support and also had biological significance in the context of the article

    However, all members agreed on the genes that were most significant in the context of the article (genes with popularity 9 in the table below). The UAG agreed that the most desirable result from a system would be to rank genes based on the second view above.

    In the context of the task it is important to know which genes are primary. In the case of PMID:19513100, gata1, e2f2, fog-1 and pRB were unanimously assigned as primary genes based on experimental support and biological significance in the context of the article. On the other hand, even though other genes such as CD71, c-kit, ter119, GFP, and beta-actin were mentioned multiple times in the Results section, these were used in the experiments either as cell type markers or as controls. Therefore, these genes should be considered secondary.

    It is also important to consider what species are mentioned. For example, both gata1 human and gata1 mouse are linked to some experimental result (disregarding the individual number of experiments or mentions), therefore both would be considered primary. This way the user would quickly know that the article not only describes some property for gata1, but specifically for human and mouse gata1.
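The primary/secondary distinction and the per-species treatment described above can be sketched as follows. The `GeneObservation` type and its field names are hypothetical. The gata1 entries follow the example in the text, while the GFP entry is illustrative only: the text says it was used as a marker or control but does not give a species, so that value is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneObservation:
    """One gene in one species, as seen in one article (illustrative)."""
    gene: str
    species: str
    has_experimental_result: bool
    is_marker_or_control: bool = False


def classify(obs: GeneObservation) -> str:
    """Primary means experimental support and not a mere marker/control."""
    if obs.has_experimental_result and not obs.is_marker_or_control:
        return "primary"
    return "secondary"


def primary_species(observations):
    """Map each primary gene to the species in which it is primary."""
    result = {}
    for obs in observations:
        if classify(obs) == "primary":
            result.setdefault(obs.gene, []).append(obs.species)
    return result


observations = [
    GeneObservation("gata1", "human", has_experimental_result=True),
    GeneObservation("gata1", "mouse", has_experimental_result=True),
    # Species not given in the text; "unspecified" is a placeholder.
    GeneObservation("GFP", "unspecified", has_experimental_result=True,
                    is_marker_or_control=True),
]
summary = primary_species(observations)  # {"gata1": ["human", "mouse"]}
```

Treating each gene and species pair as a separate observation is what lets the user see at a glance that the article covers both human and mouse gata1.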