Text mining drug and chemical-protein interactions (DrugProt)
With the rapid accumulation of biomedical literature, it is becoming increasingly challenging to efficiently exploit the drug-related information described in scientific publications. One of the most relevant aspects of drugs and chemical compounds is their relationships with certain biomedical entities, in particular genes and proteins.
The aim of the DrugProt track (similar to the previous CHEMPROT task of BioCreative VI) is to promote the development and evaluation of systems that are able to automatically detect relations between chemical compounds/drugs and genes/proteins.
There is a range of different types of drug-gene/protein interactions, and their systematic extraction and characterization are essential to analyze, predict and explore key biomedical properties underlying high-impact biomedical applications. These application scenarios include use cases related to drug discovery, drug repurposing, drug design, metabolic engineering, modeling drug response, pharmacogenetics, drug-induced adverse reactions, molecular medicine or systems biology and bioinformatics knowledge graph mining, just to name a few.
We have therefore generated a manually annotated corpus, the DrugProt corpus, where domain experts have exhaustively labeled: (a) all chemical and gene mentions, and (b) all binary relationships between them corresponding to a specific set of biologically relevant relation types (DrugProt relation classes).
There is also an increasing interest in the integration of drug/chemical and biomedical data, understood as the curation of relationships between biological and chemical entities from text and the storage of such information in the form of structured annotation databases. Such databases are of key relevance not only for biological but also for pharmacological and clinical research.
A range of different types of chemical-protein/gene interactions are of key relevance for biology, including metabolic relations (e.g. substrates, products), inhibition, binding or induction associations.
Following previous efforts, in particular past BioCreative tracks related to protein-protein interaction (PPI) extraction [1], chemical entity recognition (CHEMDNER) [2] and the more recent ChemProt track [3], which resulted in a considerable number of advanced relation mining systems, we are organizing, as part of BioCreative VII, the DrugProt track on the automatic detection of drug/chemical interactions with genes, proteins and miRNAs.
The DrugProt track aims to address these needs and to promote the development of systems able to extract chemical-protein interactions that might be of relevance for precision medicine as well as for drug discovery and basic biomedical research.
Participating DrugProt teams will be provided with the following training corpus:
For the DrugProt track a granular interaction annotation was carried out, with the goal of covering all key relations of biomedical importance. The following 13 types of interactions will be considered for the BioCreative VII DrugProt track: INDIRECT-DOWNREGULATOR, INDIRECT-UPREGULATOR, DIRECT-REGULATOR, ACTIVATOR, INHIBITOR, AGONIST, ANTAGONIST, AGONIST-ACTIVATOR, AGONIST-INHIBITOR, PRODUCT-OF, SUBSTRATE, SUBSTRATE_PRODUCT-OF and PART-OF.
Example DrugProt entity mention annotations:
11808879 T12 GENE-Y 1860 1866 KIR6.2
11808879 T13 GENE-N 1993 2016 glutamate dehydrogenase
11808879 T14 GENE-Y 2242 2253 glucokinase
23017395 T1 CHEMICAL 216 223 HMG-CoA
23017395 T2 CHEMICAL 258 261 EPA
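The entity mention lines above could be read with a few lines of Python. This is only a minimal sketch: the column meanings (article identifier, term identifier, entity type, start offset, end offset, mention text) are inferred from the examples, and the names EntityMention and parse_entity_line are hypothetical helpers, not part of any official DrugProt tooling.

```python
from typing import NamedTuple


class EntityMention(NamedTuple):
    # Field names are assumptions inferred from the example lines.
    pmid: str
    term_id: str
    entity_type: str
    start: int
    end: int
    text: str


def parse_entity_line(line: str) -> EntityMention:
    # Split on whitespace at most 5 times so that multi-word mention
    # text (e.g. "glutamate dehydrogenase") stays in the last field.
    fields = line.strip().split(None, 5)
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}: {line!r}")
    pmid, term_id, entity_type, start, end, text = fields
    return EntityMention(pmid, term_id, entity_type, int(start), int(end), text)


example = parse_entity_line("11808879\tT13\tGENE-N\t1993\t2016\tglutamate dehydrogenase")
print(example)
```

In the examples shown, the difference between the two offsets equals the length of the mention text, which can serve as a quick sanity check when loading the files.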
Example DrugProt relation annotations:
12488248 INHIBITOR Arg1:T1 Arg2:T52
12488248 INHIBITOR Arg1:T2 Arg2:T52
23220562 ACTIVATOR Arg1:T12 Arg2:T42
23220562 ACTIVATOR Arg1:T12 Arg2:T43
23220562 INDIRECT-DOWNREGULATOR Arg1:T1 Arg2:T14
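A relation line of the kind shown above can be split into its four parts as follows. This is a hedged sketch: the Relation tuple and parse_relation_line are illustrative names, and the code assumes whitespace- or tab-separated columns exactly as in the examples.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    # Field names are illustrative; arg1/arg2 hold term identifiers such as "T1".
    pmid: str
    relation: str
    arg1: str
    arg2: str


def parse_relation_line(line: str) -> Relation:
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 columns, got {len(fields)}: {line!r}")
    pmid, relation, arg1, arg2 = fields
    if not (arg1.startswith("Arg1:") and arg2.startswith("Arg2:")):
        raise ValueError(f"arguments must be prefixed with Arg1:/Arg2:: {line!r}")
    # Drop the "Arg1:" / "Arg2:" prefixes and keep only the term identifiers.
    return Relation(pmid, relation, arg1.split(":", 1)[1], arg2.split(":", 1)[1])


lines = [
    "12488248 INHIBITOR Arg1:T1 Arg2:T52",
    "12488248 INHIBITOR Arg1:T2 Arg2:T52",
    "23220562 ACTIVATOR Arg1:T12 Arg2:T42",
    "23220562 ACTIVATOR Arg1:T12 Arg2:T43",
    "23220562 INDIRECT-DOWNREGULATOR Arg1:T1 Arg2:T14",
]
relations = [parse_relation_line(line) for line in lines]
print(relations[0])
```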
The DrugProt corpus prepared for this track is available at: DrugProt corpus.
Important dates (last update November 4th):
Timeline:
DrugProt test set prediction instructions (main DrugProt task)
In order to be able to submit your results, you need to be registered as a team (we allow academic teams, commercial participants or individual participants for the DrugProt track). Following the settings of previous BioCreative tracks, the test set will consist of a large collection of records containing a subset of 750 Gold Standard records that will be used for evaluation purposes.
The remaining 10,000 additional records (background set) are included to make sure that participating systems will be able to scale to a realistic user scenario, as well as to avoid any manual correction of the automatically generated interaction extraction results produced by participating teams.
Therefore teams will have a considerable time period to generate the test set predictions.
As part of the test set, the collection of abstracts (text documents in the same format as provided in the training collection) will be released, as well as the annotations of entity mentions (chemicals and genes).
Only the DrugProt interactions have to be generated by participating teams, while entity mentions are provided by the track organizers.
Thus, the test set includes 750 Gold Standard records (with manually annotated mentions), together with an additional collection of 10,000 background set abstracts with automatic mention annotations, making up 10,750 abstracts in total. For evaluation purposes we require predictions for the entire collection, not just the 750 abstracts, to:
1) prevent participants from manually correcting their predictions, which would be quite easy when using only 750 records
2) generate a silver standard of relations predicted by teams for the larger set of 10,000 abstracts with automatic annotations (which in turn might be useful to further improve systems in the future)
3) make sure that systems are able to cope with larger datasets, that is, that they are useful in realistic settings
The test set evaluation will be done using only the subset of 750 manually annotated Gold Standard abstracts.
Important:
- You are allowed to upload up to 5 runs per team
- The prediction file must consist of tab-separated columns containing:
1) Article identifier (PMID)
2) Predicted DrugProt relation*
3) Interactor argument 1 (Arg1: followed by the interactor term identifier, corresponds to the chemical entity)
4) Interactor argument 2 (Arg2: followed by the interactor term identifier, corresponds to the gene entity)
* Has to be one of these interaction types: INDIRECT-DOWNREGULATOR, INDIRECT-UPREGULATOR, DIRECT-REGULATOR, ACTIVATOR, INHIBITOR, AGONIST, ANTAGONIST, AGONIST-ACTIVATOR, AGONIST-INHIBITOR, PRODUCT-OF, SUBSTRATE, SUBSTRATE_PRODUCT-OF or PART-OF.
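As an illustration of the prediction file layout described above, the following sketch builds one tab-separated row and rejects relation labels outside the 13 allowed types. The function name format_prediction is hypothetical, and this is not an official submission tool; it only mirrors the four columns listed above.

```python
# The 13 interaction types accepted for the DrugProt track.
RELATION_TYPES = {
    "INDIRECT-DOWNREGULATOR", "INDIRECT-UPREGULATOR", "DIRECT-REGULATOR",
    "ACTIVATOR", "INHIBITOR", "AGONIST", "ANTAGONIST",
    "AGONIST-ACTIVATOR", "AGONIST-INHIBITOR", "PRODUCT-OF",
    "SUBSTRATE", "SUBSTRATE_PRODUCT-OF", "PART-OF",
}


def format_prediction(pmid: str, relation: str, chemical_id: str, gene_id: str) -> str:
    # Arg1 carries the chemical term identifier, Arg2 the gene term identifier.
    if relation not in RELATION_TYPES:
        raise ValueError(f"unknown relation type: {relation!r}")
    return "\t".join([pmid, relation, f"Arg1:{chemical_id}", f"Arg2:{gene_id}"])


row = format_prediction("12488248", "INHIBITOR", "T1", "T52")
print(row)
```

Checking the label before writing each row is a cheap way to catch misspelled relation types, for example a hyphen used instead of the underscore in SUBSTRATE_PRODUCT-OF.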
Evaluation:
Evaluation will be done using the following micro-averaged scores: F-measure, precision and recall. The DrugProt evaluation library is available at: DrugProt evaluation library.
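To make the micro-averaged scores concrete, here is a minimal sketch that pools all relation types and counts exact matches of (PMID, relation, Arg1, Arg2). It is not the official DrugProt evaluation library, whose matching rules may differ. The function name micro_prf, the predicted relations and the PMID "00000000" are made up for illustration; the gold relations reuse the example annotations from this page.

```python
def micro_prf(gold, predicted, evaluated_pmids):
    # Only records in the evaluated (Gold Standard) subset are scored;
    # predictions for background records are ignored.
    gold = {g for g in gold if g[0] in evaluated_pmids}
    pred = {p for p in predicted if p[0] in evaluated_pmids}
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f_measure


gold = {
    ("12488248", "INHIBITOR", "T1", "T52"),
    ("12488248", "INHIBITOR", "T2", "T52"),
    ("23220562", "ACTIVATOR", "T12", "T42"),
    ("23220562", "ACTIVATOR", "T12", "T43"),
    ("23220562", "INDIRECT-DOWNREGULATOR", "T1", "T14"),
}
predicted = {
    ("12488248", "INHIBITOR", "T1", "T52"),   # correct
    ("12488248", "INHIBITOR", "T2", "T52"),   # correct
    ("23220562", "ACTIVATOR", "T12", "T42"),  # correct
    ("23220562", "INHIBITOR", "T1", "T14"),   # wrong relation type
    ("00000000", "INHIBITOR", "T1", "T2"),    # hypothetical background record, not scored
}
precision, recall, f_measure = micro_prf(gold, predicted, {"12488248", "23220562"})
print(f"P={precision:.3f} R={recall:.3f} F={f_measure:.3f}")  # P=0.750 R=0.600 F=0.667
```

In this toy case three of the four scored predictions are correct (precision 0.75) and three of the five gold relations are found (recall 0.6).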
Large scale Text Mining sub-track (additional DrugProt task)
In addition to the traditional quality evaluation efforts that we carry out as part of the BioCreative initiative, the DrugProt shared task will also pose a novel additional sub-track specifically focusing on the scalability and processing of large datasets. The DrugProt large-scale text mining sub-track will require the automatic extraction of drug/chemical-protein/gene interactions from a large set of 2,366,081 PubMed records.
Pre-annotations of entity mentions of drugs/chemicals, as well as genes/proteins, are provided (53,993,602 entity annotations), while their interactions, following the same settings as for the main DrugProt track, have to be automatically extracted.
We will examine which teams are able to generate large-scale predictions for this collection of over 2.3 million records with almost 54 million entity annotations.
We also foresee that the results of this additional large-scale text mining sub-track will yield a highly valuable knowledge graph for biocuration and database construction efforts, network biology, and other predictive bioinformatics and drug discovery strategies.
The DrugProt large scale text mining set is available at: DrugProt Large Scale subtrack corpus.
BioCreative VII workshop proceedings and Journal Special Issue
Participating teams will be invited to contribute to the Proceedings of the Seventh BioCreative Challenge Evaluation Workshop. Proceedings papers are free of charge; details on the proceedings paper submission process, format, exact dates and review process will be provided soon.
A selected number of top performing teams will also be invited to contribute a longer system description paper to a special issue on BioCreative VII to be published in Database. Previous BioCreative publications are listed here.
DrugProt session at the BioCreative VII workshop
The results of the DrugProt task will be presented at the Seventh BioCreative Challenge Evaluation Workshop. The BioCreative VII Workshop will be held online and we foresee there will be a session devoted to the DrugProt task. This session will include an overview talk presenting the datasets used and the results obtained by the participating teams. A number of teams will also be invited to present their systems (short talks). We also plan to have a discussion session where teams, task organizers, and domain experts will discuss the results obtained and future steps. Finally, during the poster session, all teams will be able to present their participating strategies. Additional details (dates, registration, etc.) will be available at the BioCreative webpage. Registration will be free of charge for participating teams. See: Registration
Task organizers:
References
[1] Krallinger, Martin, et al. "Overview of the protein-protein interaction annotation extraction task of BioCreative II." Genome biology 9.2 (2008): 1-19.
[2] Krallinger, Martin, et al. "CHEMDNER: The drugs and chemical names extraction challenge." Journal of cheminformatics 7.1 (2015): 1-11.
[3] Krallinger, Martin, et al. "Overview of the BioCreative VI chemical-protein interaction Track." Proceedings of the sixth BioCreative challenge evaluation workshop. Vol. 1. 2017.