Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative VI

Track 4: Mining protein interactions and mutations for precision medicine (PM) [2017-03-03]

Precision Medicine and Biomedical Information:

* Please click here for the test data of Precision Medicine Task.
* Please click here for the training data of Precision Medicine Task.
Data is provided in JSON and XML formats.

The precision medicine initiative (PMI) promises to identify individualized treatments based on a patient’s genetic profile and related responses. To help health professionals and researchers in the precision medicine endeavor, one goal is to leverage the knowledge available in the published scientific literature and extract clinically useful information that links genes, mutations, and diseases to specialized treatments (1).

Proteins and their interactions are the building blocks of metabolic and signaling pathways regulating cellular homeostasis (2). Understanding how allelic variation and genetic background influence the functionality of these pathways is crucial for predicting disease phenotypes and developing personalized therapeutic approaches. A crucial step is mapping the functional regions of gene products through the identification and study of mutations (naturally occurring or synthetically induced) that affect the stability and affinity of molecular interactions.

* Please see here for more details on how mutational analysis can reveal crucial regions for protein-protein interaction.

Overview of the Precision Medicine task:

Despite previous studies in protein-protein interaction (e.g. (3, 4)) and mutation extraction (e.g. (5)), no one has investigated how to combine these efforts to help assess and curate the clinical significance of genetic variants, an essential step towards precision medicine. Thus, the PM task in BioCreative VI aims to bring together the biomedical text mining community in a new BioCreative challenge task (6) focused on identifying and extracting, from the biomedical literature, protein-protein interactions changed by genetic mutations. This challenge consists of two subtasks:

  • Document Triage: Identify relevant PubMed citations describing genetic mutations affecting protein-protein interactions, and
  • Relation Extraction: Extract experimentally verified PPI affected by the presence of a genetic mutation.
    Document Triage Task:

    The training dataset will consist of a set of ~4K PubMed articles, manually labelled as relevant/not relevant by BioGRID database curators. Participants in this task will be expected to build automatic methods capable of receiving a list of PMIDs and returning a relevance-ranked judgement of the test set for triage purposes.
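As a minimal sketch of the expected output shape, the snippet below ranks a set of PMIDs by a relevance score. The cue-word scorer `score_abstract`, the 0.5 threshold, and the placeholder PMIDs are all assumptions made for illustration; they stand in for a trained classifier and are not the organizers' method.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical cue words standing in for a trained relevance classifier.
CUE_WORDS = ("mutation", "mutant", "interact", "binding")


def score_abstract(text: str) -> float:
    """Return a crude relevance score in [0, 1]: the fraction of cue words present."""
    lowered = text.lower()
    hits = sum(1 for cue in CUE_WORDS if cue in lowered)
    return hits / len(CUE_WORDS)


def rank_pmids(
    abstracts: Dict[str, str],
    scorer: Callable[[str], float] = score_abstract,
    threshold: float = 0.5,
) -> List[Tuple[str, str, float]]:
    """Return (pmid, relevant, score) triples, highest score first."""
    scored = [(pmid, scorer(text)) for pmid, text in abstracts.items()]
    # Sort by descending score; break ties by PMID so the ranking is deterministic.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return [(pmid, "yes" if s >= threshold else "no", s) for pmid, s in scored]


if __name__ == "__main__":
    demo = {
        "PMID-A": "A point mutation abolishes binding, so the proteins no longer interact.",
        "PMID-B": "We survey hospital staffing levels.",
    }
    for row in rank_pmids(demo):
        print(row)
```

Any real system would replace the scorer with a model trained on the curator-labelled articles; only the ranking step is meant to carry over.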

    Relation Extraction Task:

    A subset of the relevant articles in the Document Triage Task has been manually annotated with relevant interacting protein pairs. Each PubMed article in this set has at least one interacting pair, which is listed with the Entrez Gene IDs of the two interactors. These protein-protein interactions have been experimentally verified, and the analysis of naturally occurring or synthetic mutations has identified protein residues crucial for the interaction. Participants in this task will be expected to build automated methods capable of receiving a set of PMID documents and returning the set of interacting protein pairs (and their corresponding Entrez Gene IDs) mentioned in the text that are affected by a genetic mutation.
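One way to hold the output of this task is as a set of gene pairs per document. The sketch below assumes that the order of the two interactors does not matter, which the task description does not state explicitly. The helper names and the gene IDs are placeholders, not real Entrez Gene identifiers.

```python
from typing import Iterable, Set, Tuple

GenePair = Tuple[str, str]


def normalize_pair(gene_a: str, gene_b: str) -> GenePair:
    """Order the two gene IDs so that (a, b) and (b, a) compare equal."""
    first, second = sorted((gene_a, gene_b))
    return (first, second)


def unique_pairs(pairs: Iterable[Tuple[str, str]]) -> Set[GenePair]:
    """Collapse a list of predicted pairs into a set of order-independent pairs."""
    return {normalize_pair(a, b) for a, b in pairs}


if __name__ == "__main__":
    # Placeholder IDs; the second entry repeats the first pair in reverse order.
    predicted = [("2222", "1111"), ("1111", "2222"), ("3333", "3333")]
    print(sorted(unique_pairs(predicted)))
```

Sorting the IDs as strings is only a canonicalization device; it makes no claim about which protein is the "first" interactor.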

    The validity of the text mining methods will be evaluated using standard metrics such as average precision and F-measure. Additionally, the utility of participating systems will be assessed by a group of database curators from BioGRID.
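For orientation, the two metrics named above can be sketched with their textbook definitions. This is an illustrative implementation under that assumption; the official evaluation script may differ in details such as tie handling or averaging across documents, and it remains the authority.

```python
from typing import Hashable, List, Set, Tuple


def average_precision(ranked_pmids: List[str], relevant: Set[str]) -> float:
    """Mean of precision@k taken at each rank k where a relevant PMID appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, pmid in enumerate(ranked_pmids, start=1):
        if pmid in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never returned contribute zero.
    return precision_sum / len(relevant)


def precision_recall_f1(
    predicted: Set[Hashable], gold: Set[Hashable]
) -> Tuple[float, float, float]:
    """Set-based precision, recall and balanced F-measure (F1)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    print(average_precision(["a", "b", "c"], {"a", "c"}))
    print(precision_recall_f1({("1111", "2222")}, {("1111", "2222"), ("3333", "4444")}))
```

Average precision suits the ranked triage output, while the set-based F-measure suits the extracted gene pairs.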

    Data Format and Pre-annotations:

    The PM task organizers will provide the training dataset in multiple formats such as BioC (7). Task organizers will also provide several pre-computed annotations for all articles in the training set with automatically generated labels for diseases, genes/proteins, species, mutations, and other labels (8, 9). The BioC format is discussed in the BioC Webinar series.
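A minimal sketch of reading the JSON variant of the data is shown below. It assumes the usual BioC JSON layout, in which a collection holds a `documents` list and each document holds an `id` and a list of `passages` with `text`. The inline sample and the function name `abstracts_by_pmid` are illustrative; the released files should be checked against this assumption before relying on it.

```python
import json
from typing import Dict


def abstracts_by_pmid(collection: dict) -> Dict[str, str]:
    """Map each document id to the concatenated text of its passages."""
    texts: Dict[str, str] = {}
    for doc in collection.get("documents", []):
        passages = doc.get("passages", [])
        # Title and abstract are assumed to arrive as separate passages.
        texts[doc["id"]] = " ".join(p.get("text", "") for p in passages)
    return texts


# Hand-written sample in the assumed layout, not taken from the task data.
SAMPLE = json.loads("""
{
  "documents": [
    {
      "id": "PMID-A",
      "infons": {},
      "passages": [
        {"offset": 0, "text": "Title."},
        {"offset": 7, "text": "Abstract text."}
      ],
      "relations": []
    }
  ]
}
""")

if __name__ == "__main__":
    print(abstracts_by_pmid(SAMPLE))
```

For a real file, the `SAMPLE` literal would be replaced by `json.load` on an opened file handle.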

    Data is provided in JSON and XML formats. Please contact organizers with questions.

    Frequently Asked Questions about the Precision Medicine Task.

    • Does the Relations Task involve Named Entity Recognition?
      Answer: Yes. Participating teams need to be prepared to recognize and identify genes/proteins mentioned in the test data documents (PubMed title and abstract) and normalize them to Entrez Gene IDs.

    • Can we participate in only one of the tasks? Do we need to participate in Triage Task in order to participate in the Relations Task?
      Answer: Yes. Teams can participate in the Triage Task, the Relations Task or both.

    • Is there an evaluation script?
      Answer: Yes. The evaluation script can be downloaded from here. The evaluation script can be used to validate your submission files.

    • How do we submit our results?
      Answer: Results should be submitted via e-mail to the organizers. A submission run includes a results file that has been validated using the evaluation script and a short description of the method used to obtain those results (an abstract).

    • Where do I find the paper template?
      Answer: The Word template for your papers can be found on the BioCreative Workshop webpage here. Track participants should submit their work in the form of a short paper (4 pages). Although the template is a Word document, papers should be submitted in PDF format.

    • What is the data format for submissions?
      Answer: Task participants are encouraged to return a confidence score for each of their predictions. For the Triage task, each document is expected to have two infons: “relevant” (possible values yes/no) and “confidence” (a real number between 0 and 1).
      For the Relations task, a relevant document is expected to have a “relation” annotation. A Relation consists of two gene identifiers, a relation name (“PPIm”), and a confidence value (a real number between 0 and 1).
      Predictions for each PMID in the test dataset need to have the following lines at the document level.
      In XML:
          <infon key="relevant">yes</infon>
          <infon key="confidence">0.XY</infon>
          <relation id="R1">
            <infon key="Gene1">GENEID-1</infon>
            <infon key="Gene2">GENEID-2</infon>
            <infon key="relation">PPIm</infon>
            <infon key="confidence">0.XY</infon>
          </relation>
      In JSON:
          "infons": {
             "relevant": "no",
             "confidence": 0.XY
          },
          "relations": [
             {
                "nodes": [],
                "infons": {
                   "Gene1": GENEID-1,
                   "Gene2": GENEID-2,
                   "relation": "PPIm",
                   "confidence": 0.XY
                },
                "id": "R0"
             }
          ]

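The document-level fields described above can also be assembled programmatically. The sketch below attaches the “relevant” and “confidence” infons and any “PPIm” relations to a document held as a Python dictionary. The function name `add_predictions`, the string type used for gene IDs, and the placeholder values are assumptions; the evaluation script remains the authority on what a valid submission looks like.

```python
import json
from typing import Iterable, Tuple


def add_predictions(
    document: dict,
    relevant: bool,
    confidence: float,
    relations: Iterable[Tuple[str, str, float]] = (),
) -> dict:
    """Attach triage and PPIm predictions to a BioC-style document dict in place."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie between 0 and 1")
    infons = document.setdefault("infons", {})
    infons["relevant"] = "yes" if relevant else "no"
    infons["confidence"] = confidence

    doc_relations = document.setdefault("relations", [])
    for gene1, gene2, rel_confidence in relations:
        if not 0.0 <= rel_confidence <= 1.0:
            raise ValueError("relation confidence must lie between 0 and 1")
        doc_relations.append({
            # Number relations from R0, continuing after any already present.
            "id": f"R{len(doc_relations)}",
            "nodes": [],
            "infons": {
                "Gene1": gene1,
                "Gene2": gene2,
                "relation": "PPIm",
                "confidence": rel_confidence,
            },
        })
    return document


if __name__ == "__main__":
    # Placeholder document id and gene IDs, for illustration only.
    doc = add_predictions({"id": "PMID-A"}, True, 0.9, [("1111", "2222", 0.8)])
    print(json.dumps(doc, indent=2))
```

Running a generated file through the evaluation script before submitting is still the safest check.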
    More information

    To receive the latest task information, please subscribe to the BioCreative mailing list. You can also join the BioCreativeVITrack4 Google group.

    Important dates:

    March/April 2017: Release of training dataset
    September 8, 2017: Release of test dataset
    September 15, 2017: Team results due, submission window closes
    September 15, 2017: Team invitations for poster presentation
    September 29, 2017: Workshop papers due
    October 2017: Team invitations for oral presentation and feedback on papers
    October 15, 2017: Camera-ready papers
    October 18-20, 2017: BioCreative Workshop in Washington DC

    Task organizers:

    Rezarta Islamaj Dogan (NCBI)
    Andrew Chatr-aryamontri (BioGRID)
    Sun Kim (NCBI)
    Don Comeau (NCBI)
    Zhiyong Lu (NCBI)


    1. Singhal A, Simmons M, Lu Z. Text Mining Genotype-Phenotype Relationships from Biomedical Literature for Database Curation and Precision Medicine. PLoS Comput Biol. 2016;12(11):e1005017.

    2. Chatr-Aryamontri A, Oughtred R, Boucher L, Rust J, Chang C, Kolas NK, et al. The BioGRID interaction database: 2017 update. Nucleic Acids Res. 2017;45(D1):D369-D79.

    3. Kim S, Islamaj Dogan R, Chatr-Aryamontri A, Chang CS, Oughtred R, Rust J, et al. BioCreative V BioC track overview: collaborative biocurator assistant task for BioGRID. Database: the journal of biological databases and curation. 2016;2016.

    4. Krallinger M, Vazquez M, Leitner F, Salgado D, Chatr-Aryamontri A, Winter A, et al. The Protein-Protein Interaction tasks of BioCreative III: classification/ranking of articles and linking bio-ontology concepts to full text. BMC bioinformatics. 2011;12 Suppl 8:S3.

    5. Wei CH, Harris BR, Kao HY, Lu Z. tmVar: a text mining approach for extracting sequence variants in biomedical literature. Bioinformatics. 2013;29(11):1433-9.

    6. Huang CC, Lu Z. Community challenges in biomedical text mining over 10 years: success, failure and the future. Brief Bioinform. 2016;17(1):132-44.

    7. Comeau DC, Islamaj Dogan R, Ciccarese P, Cohen KB, Krallinger M, Leitner F, et al. BioC: a minimalist approach to interoperability for biomedical text processing. Database: the journal of biological databases and curation. 2013;2013:bat064.

    8. Wei CH, Kao HY, Lu Z. PubTator: a web-based text mining tool for assisting biocuration. Nucleic Acids Res. 2013;41(Web Server issue):W518-22.

    9. Wei CH, Leaman R, Lu Z. Beyond accuracy: creating interoperable and scalable text-mining web services. Bioinformatics. 2016;32(12):1907-10.