BioCreative - Track 1 - Text mining drug and chemical-protein interactions (DrugProt)

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative VII

Track 1 - Text mining drug and chemical-protein interactions (DrugProt) [2020-01-22]

Note for BioCreative participants: To register for a track, please use the Google form.
Do not use the "Team page" tab, as it is non-functional.

With the rapid accumulation of biomedical literature, it is becoming increasingly challenging to efficiently exploit the drug-related information described in it. One of the most relevant aspects of drugs and chemical compounds is their relationships with certain biomedical entities, in particular genes and proteins.
The aim of the DrugProt track (similar to the previous CHEMPROT task of BioCreative VI) is to promote the development and evaluation of systems that are able to automatically detect relations between chemical compounds/drugs and genes/proteins.
There is a range of different types of drug-gene/protein interactions, and their systematic extraction and characterization is essential to analyze, predict and explore key biomedical properties underlying high impact biomedical applications. These application scenarios include use cases related to drug discovery, drug repurposing, drug design, metabolic engineering, modeling drug response, pharmacogenetics, drug-induced adverse reactions, molecular medicine or systems biology and bioinformatics knowledge graph mining, just to name a few.

We have therefore generated a manually annotated corpus, the DrugProt corpus, where domain experts have exhaustively labeled: (a) all chemical and gene mentions, and (b) all binary relationships between them corresponding to a specific set of biologically relevant relation types (DrugProt relation classes).
There is also an increasing interest in the integration of drug/chemical and biomedical data, understood as the curation of relationships between biological and chemical entities from text and the storage of such information in the form of structured annotation databases. Such databases are of key relevance not only for biological but also for pharmacological and clinical research.
A range of different types of chemical-protein/gene interactions are of key relevance for biology, including metabolic relations (e.g. substrates, products), inhibition, binding or induction associations.
Following previous efforts, in particular past BioCreative tracks on protein-protein interaction (PPI) extraction [1], chemical entity recognition (CHEMDNER) [2] and the more recent ChemProt track [3], which resulted in a considerable number of advanced relation mining systems, we are organizing, as part of BioCreative VII, the DrugProt track on automatic detection of drug/chemical interactions with genes, proteins and miRNAs.

The DrugProt track aims to address these needs and to promote the development of systems able to extract chemical-protein interactions that might be of relevance for precision medicine as well as for drug discovery and basic biomedical research.

Participating DrugProt teams will be provided with the following training corpus:

PubMed abstracts

Manually annotated chemical compound mentions (see chemical annotation guidelines)

Manually annotated gene/protein mentions (see gene annotation guidelines)

Manually annotated drug/chemical-protein/gene interactions (see interaction annotation guidelines)

For the DrugProt track, a granular interaction annotation was carried out, with the goal of covering all key relations of biomedical importance. The following 13 types of interactions will be considered for the BioCreative VII DrugProt track: INDIRECT-DOWNREGULATOR, INDIRECT-UPREGULATOR, DIRECT-REGULATOR, ACTIVATOR, INHIBITOR, AGONIST, ANTAGONIST, AGONIST-ACTIVATOR, AGONIST-INHIBITOR, PRODUCT-OF, SUBSTRATE, SUBSTRATE_PRODUCT-OF and PART-OF.

Example DrugProt entity mention annotations:

11808879	T12	GENE-Y	1860	1866	KIR6.2
11808879	T13	GENE-N	1993	2016	glutamate dehydrogenase
11808879	T14	GENE-Y	2242	2253	glucokinase
23017395	T1	CHEMICAL	216	223	HMG-CoA
23017395	T2	CHEMICAL	258	261	EPA
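
The entity lines above can be read with a few lines of Python. This is only a minimal sketch: the field names (pmid, term_id, entity_type, start, end, text) are assumptions inferred from the example rows, not names taken from the annotation guidelines.

```python
from typing import NamedTuple


class EntityMention(NamedTuple):
    # Field names are illustrative; they are inferred from the example rows.
    pmid: str
    term_id: str
    entity_type: str
    start: int
    end: int
    text: str


def parse_entity_line(line: str) -> EntityMention:
    """Parse one tab-separated entity annotation line into an EntityMention."""
    pmid, term_id, entity_type, start, end, text = line.rstrip("\n").split("\t")
    return EntityMention(pmid, term_id, entity_type, int(start), int(end), text)


mention = parse_entity_line("11808879\tT12\tGENE-Y\t1860\t1866\tKIR6.2")
# In the example rows, end - start equals the length of the mention text.
assert mention.end - mention.start == len(mention.text)
```

In the example rows the difference between the two offsets equals the length of the mention text, which suggests the second offset is exclusive.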

Example DrugProt relation annotations:

12488248	INHIBITOR	Arg1:T1	Arg2:T52
12488248	INHIBITOR	Arg1:T2	Arg2:T52
23220562	ACTIVATOR	Arg1:T12	Arg2:T42
23220562	ACTIVATOR	Arg1:T12	Arg2:T43
23220562	INDIRECT-DOWNREGULATOR	Arg1:T1	Arg2:T14
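
A relation line can be parsed in the same spirit. The sketch below assumes four tab-separated fields, as in the examples above; the class and function names are hypothetical.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    # Illustrative field names; arg1/arg2 hold the bare term identifiers.
    pmid: str
    relation_type: str
    arg1: str
    arg2: str


def parse_relation_line(line: str) -> Relation:
    """Parse one tab-separated relation line such as the examples above."""
    pmid, relation_type, arg1, arg2 = line.rstrip("\n").split("\t")
    if not arg1.startswith("Arg1:") or not arg2.startswith("Arg2:"):
        raise ValueError(f"unexpected argument fields: {arg1!r}, {arg2!r}")
    # Strip the "Arg1:" / "Arg2:" prefixes to keep only the term identifiers.
    return Relation(pmid, relation_type, arg1[len("Arg1:"):], arg2[len("Arg2:"):])


relation = parse_relation_line("12488248\tINHIBITOR\tArg1:T1\tArg2:T52")
```

The term identifiers (T1, T52) refer back to entity mentions within the same abstract, so a relation is only meaningful together with its PMID.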

The DrugProt corpus prepared for this track is available at: DrugProt corpus.

Important dates (last update November 4th):

Timeline:

✓ Training set release - June 15th 2021: DrugProt training set
✓ Development set release - June 29th 2021: DrugProt development set 
✓ Evaluation library release - July 13th 2021 (UPDATED!)
✓ Test set release - July 19th 2021 (UPDATED!): DrugProt test set abstracts and entity annotations (relations have to be predicted by teams)
✓ Large scale Text Mining sub-track set release - July 19th 2021 (UPDATED!): Large scale DrugProt Additional Subtrack set abstracts and entity annotations (relations have to be predicted by teams)
✓ DrugProt Test set prediction submission due: September 22nd 2021 ("Anywhere on Earth", UTC−12:00): see instructions
✓ Large scale Text Mining additional subtrack prediction submission due: September 27th 2021 ("Anywhere on Earth", UTC−12:00): see instructions
✓ Test set evaluation returned to participants: September 29th 2021
✓ Large scale Text Mining revision returned to participants: October 2nd 2021
✓ Short technical systems description paper due: October 10th 2021
✓ Paper acceptance and review returned: October 17th 2021
✓ Revised paper submission due: October 24th 2021
Test set Gold Standard annotations to participants:  November 1st 2021
BioCreative VII Workshop:  November 8th-10th, 2021. (This workshop will be virtual).

DrugProt test set prediction instructions (main DrugProt task)

In order to be able to submit your results, you need to be registered as a team (we allow academic teams, commercial participants or individual participants for the DrugProt track).

Following the settings of previous BioCreative tracks, the test set will consist of a large collection of records, of which a subset of 750 Gold Standard records will be used for evaluation purposes.
The remaining 10,000 records (background set) are included to make sure that participating systems are able to scale to a realistic user scenario, as well as to prevent any manual correction of the automatically generated interaction extraction results by participating teams.
Teams will therefore have a considerable time period to generate the test set predictions.

As part of the test set, the collection of abstracts (text documents in the same format as provided in the training collection) will be released, as well as the annotations of entity mentions (chemicals and genes).
Only the DrugProt interactions have to be generated by participating teams, while entity mentions are provided by the track organizers.
Thus, the test set includes 750 Gold Standard records (with mentions annotated manually), together with an additional collection of 10,000 background set abstracts with automatic mention annotations, making up 10,750 abstracts in total. For evaluation purposes we require predictions for the entire collection, not just the 750 abstracts, in order to:
1) prevent participants from manually correcting their predictions, which would be quite easy with only 750 records
2) generate a silver standard of relations predicted by teams for the larger set of 10,000 automatically annotated abstracts (which in turn might be useful to further improve systems in the future)
3) make sure that systems are able to cope with larger datasets, that is, that they are useful in realistic settings
The test set evaluation will be done using only the subset of 750 manually annotated Gold Standard abstracts.
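
To illustrate the last point, scoring on the Gold Standard subset amounts to discarding predictions for background abstracts before comparison. The following is a hypothetical sketch of that filtering step, not a description of how the official evaluation is implemented; predictions are assumed to be (PMID, relation, Arg1, Arg2) tuples.

```python
def restrict_to_gold(predictions, gold_pmids):
    """Keep only predictions whose PMID belongs to the Gold Standard subset.

    predictions: iterable of (pmid, relation, arg1, arg2) tuples (assumed layout).
    gold_pmids: set of PMIDs of the manually annotated abstracts.
    """
    return {p for p in predictions if p[0] in gold_pmids}


predictions = {
    ("12488248", "INHIBITOR", "T1", "T52"),   # assume this PMID is in the gold subset
    ("99999999", "ACTIVATOR", "T3", "T7"),    # made-up background-set PMID
}
scored = restrict_to_gold(predictions, gold_pmids={"12488248"})
```

Because participants do not know which abstracts belong to the Gold Standard subset, a filter like this can only be applied on the organizers' side.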

Important:
- You are allowed to upload up to 5 runs per team
- The prediction file must consist of tab-separated columns containing:

1- Article identifier (PMID)
2- Predicted DrugProt relation*
3- interactor argument 1 (Arg1: followed by the interactor term identifier, corresponds to the chemical entity)
4- interactor argument 2 (Arg2: followed by the interactor term identifier, corresponds to the gene entity)

* Has to be one of these interaction types: INDIRECT-DOWNREGULATOR, INDIRECT-UPREGULATOR, DIRECT-REGULATOR, ACTIVATOR, INHIBITOR, AGONIST, ANTAGONIST, AGONIST-ACTIVATOR, AGONIST-INHIBITOR, PRODUCT-OF, SUBSTRATE, SUBSTRATE_PRODUCT-OF or PART-OF.
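
As a rough illustration of the format described above, a prediction line could be produced and checked as follows. The helper names are hypothetical, and the validation is only a sketch of what a team might do before uploading a run.

```python
# The 13 relation types listed above.
RELATION_TYPES = frozenset({
    "INDIRECT-DOWNREGULATOR", "INDIRECT-UPREGULATOR", "DIRECT-REGULATOR",
    "ACTIVATOR", "INHIBITOR", "AGONIST", "ANTAGONIST", "AGONIST-ACTIVATOR",
    "AGONIST-INHIBITOR", "PRODUCT-OF", "SUBSTRATE", "SUBSTRATE_PRODUCT-OF",
    "PART-OF",
})


def format_prediction(pmid: str, relation: str, chemical_id: str, gene_id: str) -> str:
    """Build one tab-separated prediction line: PMID, relation, Arg1, Arg2."""
    if relation not in RELATION_TYPES:
        raise ValueError(f"unknown relation type: {relation!r}")
    # Arg1 is the chemical entity, Arg2 is the gene entity.
    return "\t".join([pmid, relation, f"Arg1:{chemical_id}", f"Arg2:{gene_id}"])


def render_predictions(rows) -> str:
    """Join (pmid, relation, chemical_id, gene_id) rows into file content."""
    return "".join(format_prediction(*row) + "\n" for row in rows)


content = render_predictions([
    ("12488248", "INHIBITOR", "T1", "T52"),
    ("23220562", "ACTIVATOR", "T12", "T42"),
])
```

Rejecting unknown relation labels at write time catches spelling slips such as an underscore where a hyphen is expected.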

Evaluation:

Evaluation will be done using the following micro-averaged scores: f-measure, precision and recall.
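
For orientation, micro-averaged scores pool the true positives, false positives and false negatives of all relation types before computing a single precision, recall and f-measure. The sketch below is not the official evaluation library; it assumes that a prediction counts as correct only when PMID, relation type and both arguments match exactly, and that the f-measure is the balanced F1.

```python
def micro_prf(gold: set, predicted: set):
    """Micro-averaged precision, recall and F1 over relation tuples.

    Both sets hold (pmid, relation, arg1, arg2) tuples (assumed layout);
    counts are pooled over all relation types rather than averaged per type.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {
    ("12488248", "INHIBITOR", "T1", "T52"),
    ("12488248", "INHIBITOR", "T2", "T52"),
    ("23220562", "ACTIVATOR", "T12", "T42"),
}
predicted = {
    ("12488248", "INHIBITOR", "T1", "T52"),   # correct
    ("23220562", "INHIBITOR", "T12", "T42"),  # wrong relation type
}
precision, recall, f1 = micro_prf(gold, predicted)
```

With micro-averaging, frequent relation types contribute more to the final score than rare ones.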

The DrugProt evaluation library is available at: DrugProt evaluation library.

Large scale Text Mining sub-track (additional DrugProt task)

In addition to the traditional quality evaluation efforts that we carry out as part of the BioCreative initiative, the DrugProt shared task will also pose a novel additional sub-track specifically focusing on scalability and the processing of large datasets.
The DrugProt large-scale text mining sub-track will require the automatic extraction of drug/chemical-protein/gene interactions from a large set of 2,366,081 PubMed records.
Pre-annotations of entity mentions of drugs/chemicals, as well as genes/proteins, are provided (53,993,602 entity annotations), while their interactions have to be automatically extracted, following the same settings as for the main DrugProt track.
We will examine which teams are able to generate large-scale predictions for this collection of over 2.3 million records with almost 54 million entity annotations.
We also foresee that the results of this additional Large-scale Text Mining subtrack will result in a highly valuable knowledge graph for biocuration and database construction efforts, network biology, and other predictive bioinformatics and drug discovery strategies.
The DrugProt large scale text mining set is available at: DrugProt Large Scale subtrack corpus.
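
At this scale it helps to stream the entity annotations instead of loading them all into memory. The sketch below generates chemical-gene candidate pairs one record at a time; it assumes that the large-scale entity file uses the same tab-separated layout as the entity examples of the main track and that all lines of one PMID are contiguous. Neither assumption is confirmed here, and the function name is hypothetical.

```python
import itertools
from typing import Iterable, Iterator


def candidate_pairs(lines: Iterable[str]) -> Iterator[tuple[str, str, str]]:
    """Yield (pmid, chemical_id, gene_id) for every chemical-gene pair per record.

    Assumes tab-separated entity lines (PMID, term id, type, ...) and that
    lines belonging to the same PMID are adjacent in the input.
    """
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for pmid, group in itertools.groupby(rows, key=lambda row: row[0]):
        chemicals, genes = [], []
        for row in group:
            term_id, entity_type = row[1], row[2]
            if entity_type == "CHEMICAL":
                chemicals.append(term_id)
            elif entity_type.startswith("GENE"):
                genes.append(term_id)
        # Only one record's mentions are held in memory at a time.
        for chemical_id, gene_id in itertools.product(chemicals, genes):
            yield pmid, chemical_id, gene_id


sample = [
    "1\tT1\tCHEMICAL\t0\t3\tEPA",
    "1\tT2\tGENE-Y\t5\t16\tglucokinase",
    "1\tT3\tGENE-N\t20\t26\tKIR6.2",
    "2\tT1\tCHEMICAL\t0\t7\tHMG-CoA",
]
pairs = list(candidate_pairs(sample))
```

Since the function accepts any iterable of lines, an open file handle can be passed directly, so the full annotation file never needs to fit in memory.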

BioCreative VII workshop proceedings and Journal Special Issue

Participating teams will be invited to contribute to the Proceedings of the Seventh BioCreative Challenge Evaluation Workshop. Proceedings papers are free of charge. Details on the proceedings paper submission process, format, exact dates and review process will be provided soon.
A selected number of top performing teams will also be invited to contribute with a longer system description paper to a special issue on BioCreative VII to be published in Database. Previous BioCreative publications are listed here.

DrugProt session at the BioCreative VII workshop

The results of the DrugProt task will be presented at the Seventh BioCreative Challenge Evaluation Workshop. The BioCreative VII Workshop will be held online and we foresee there will be a session devoted to the DrugProt task. This session will include an overview talk presenting the datasets used and the results obtained by the participating teams. A number of teams will also be invited to present their systems (short talks). We also plan to have a discussion session where teams, task organizers, and domain experts will discuss the obtained results and future steps. Finally, during the poster session, all teams will be able to present their participating strategies. Additional details (dates, registration, etc.) will be available at the BioCreative webpage. Registration will be free of charge for participating teams. See: Registration

Task organizers:

Martin Krallinger, Barcelona Supercomputing Center, Spain

Farrokh Mehryary, University of Turku, Finland

Jouni Luoma, University of Turku, Finland

Sampo Pyysalo, University of Turku, Finland

Antonio Miranda, Barcelona Supercomputing Center, Spain

Alfonso Valencia, Barcelona Supercomputing Center, Spain

References

[1] Krallinger, Martin, et al. "Overview of the protein-protein interaction annotation extraction task of BioCreative II." Genome Biology 9.2 (2008): 1-19.
[2] Krallinger, Martin, et al. "CHEMDNER: The drugs and chemical names extraction challenge." Journal of Cheminformatics 7.1 (2015): 1-11.
[3] Krallinger, Martin, et al. "Overview of the BioCreative VI chemical-protein interaction Track." Proceedings of the Sixth BioCreative Challenge Evaluation Workshop. Vol. 1. 2017.

DrugProt contact/info

Martin Krallinger: krallinger.martin@gmail.com