BioCreative - Track 3 - Automatic extraction of medication names in tweets

Critical Assessment of Information Extraction in Biology - data sets are available from Resources/Corpora and require registration.

BioCreative VII

Track 3 - Automatic extraction of medication names in tweets [2021-02-17]

Note for BioCreative participants: To register for a track, please use the Google form.
Do not use the "Team page" tab, as it is non-functional.

Task Motivation

Twitter posts are now recognized as an important source of patient-generated data, providing unique insights into population health. A fundamental step towards incorporating Twitter data in pharmacoepidemiological research is to automatically recognize medication mentions in tweets. A common approach is to search for tweets containing lexical matches of drug names occurring in a manually compiled dictionary. Even allowing for variants and misspellings, this approach has several limitations. In our previous study [Weissenbacher et al., 2019], when using the lexical match approach on a corpus where names of drugs are rare, we retrieved only 71% of the tweets that we manually identified as mentioning a drug, and more than 45% of the tweets retrieved were false positives. For example, tweets that mention Lyrica are predominantly about the singer, Lyrica Anderson, and not about the antiepileptic drug. In addition, descriptive text and medication class mentions (such as ‘my blood pressure med’ or ‘my anti-seizure pill’), as well as compounds and ‘street’ names for medications (‘the blue pill’), present additional challenges. This competition will be an opportunity to go beyond the lexical match approach, providing new methods to improve the extraction of drugs mentioned in posts and enhancing the utility of social media for public health research.

Task Definition: Automatic extraction of medication names in tweets

The goal of this task is to extract the spans that mention a medication or dietary supplement in tweets. The dataset consists of all tweets posted by 212 Twitter users during their pregnancy. This data represents the natural and highly imbalanced distribution of drug mentions on Twitter, with only approximately 0.2% of the tweets mentioning a medication. Training and evaluating a sequence labeler on this dataset will closely model the detection of drugs in tweets in practice. A description of our baseline labeler and its evaluation can be found in [Weissenbacher et al., 2021] (see below for the link to download the labeler).

Training data: ~89,000 tweets (218 tweets mentioning at least one drug, ~89,000 tweets by the same 212 users, not mentioning drugs)
Validation data: ~39,000 tweets (93 tweets mentioning at least one drug, ~39,000 tweets by the same 212 users, not mentioning drugs)
Test data: ~54,000 tweets
Additional data: ~10,000 tweets from the training set of the #SMM4H'18 shared tasks (a balanced set of tweets mentioning drugs and phrases ambiguous with drug names)
Evaluation metric: exact and partial F1-scores for the positive class (i.e., the correct spans of drug names)
Evaluation Script: bitbucket repository
Baseline labeler + trained models: box repository
Contact information: Davy Weissenbacher (dweissen@pennmedicine.upenn.edu)
Codalab: https://competitions.codalab.org/competitions/23925
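
The exact and partial F1-scores named in the evaluation metric can be illustrated with a minimal sketch. This is not the official evaluation script (that is the one in the bitbucket repository), and its details are assumptions: spans are represented as (tweet ID, begin, end) triples with inclusive offsets, "partial" is taken to mean any character overlap with a span in the same tweet, and the function names are hypothetical.

```python
from typing import Iterable, Tuple

# (tweet_id, begin, end); inclusive offsets are an assumption of this sketch.
Span = Tuple[str, int, int]


def _f1(tp_pred: int, tp_gold: int, n_pred: int, n_gold: int) -> float:
    """Combine matched counts into an F1-score, returning 0.0 when undefined."""
    precision = tp_pred / n_pred if n_pred else 0.0
    recall = tp_gold / n_gold if n_gold else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def overlaps(a: Span, b: Span) -> bool:
    """True if both spans are in the same tweet and share at least one character."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def exact_f1(gold: Iterable[Span], pred: Iterable[Span]) -> float:
    """A prediction counts only if tweet ID, begin and end all match a gold span."""
    g, p = set(gold), set(pred)
    tp = len(g & p)
    return _f1(tp, tp, len(p), len(g))


def partial_f1(gold: Iterable[Span], pred: Iterable[Span]) -> float:
    """A prediction counts if it overlaps any gold span in the same tweet."""
    g, p = set(gold), set(pred)
    tp_pred = sum(any(overlaps(x, y) for y in g) for x in p)  # for precision
    tp_gold = sum(any(overlaps(y, x) for x in p) for y in g)  # for recall
    return _f1(tp_pred, tp_gold, len(p), len(g))
```

For instance, a predicted span that starts two characters too early is rejected by `exact_f1` but accepted by `partial_f1`. Consult the official script for the authoritative matching rules.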

Data

For each tweet, the publicly available data set contains: i. the tweet ID, ii. the text of the tweet, iii. the start and iv. end of the span, v. the text covered by the span in the tweet, vi. the normalized drug name (empty if the tweet did not mention a drug).
Note 1: if a tweet mentions 2 or more drugs, the tweet is repeated 2 or more times with the mention of each drug in each repetition as shown below. The evaluation data will just contain the tweet IDs and the text of the tweet.
Note 2: participants will not be evaluated on the normalization task, just the extraction task, i.e., retrieving the span positions.
Note 3: to comply with Twitter’s terms of use, we cannot release more than 50,000 tweets per day. Since our corpus is large, we will send it split into batches by email (please check your spam folder after registration).

tweet ID             text                                                                  Begin    End     span            drug normalized    
397783574797352960   Only 3 Arnica Balms left...                                           8        19      Arnica Balms    arnica balm        
404288692514078720   @user sudafed that I'm not sure I'm comfortable taking it.            7        13      sudafed         sudafed            
343961712334686205   I like this song!                                                     -        -       -               -
424441978835570688   @user no my body hurts, they prescribed me hydros and moltrin         44       49      hydros          hydrocodone        
424441978835570688   @user no my body hurts, they prescribed me hydros and moltrin         55       61      moltrin         motrin
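
The example rows suggest that Begin and End are 1-based, inclusive character offsets into the tweet text. This is an inference from the rows shown, not a documented guarantee. The sketch below checks that reading against the example and regroups repeated tweets (Note 1) into one entry per tweet ID. The row layout as Python tuples and the function names are hypothetical. The delimiter and encoding of the distributed files may differ.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (tweet_id, text, begin, end, span, normalized); "-" or "" marks a tweet with no drug.
Row = Tuple[str, str, str, str, str, str]

ROWS: List[Row] = [
    ("397783574797352960", "Only 3 Arnica Balms left...", "8", "19",
     "Arnica Balms", "arnica balm"),
    ("404288692514078720",
     "@user sudafed that I'm not sure I'm comfortable taking it.", "7", "13",
     "sudafed", "sudafed"),
    ("343961712334686205", "I like this song!", "-", "-", "-", "-"),
    ("424441978835570688",
     "@user no my body hurts, they prescribed me hydros and moltrin", "44", "49",
     "hydros", "hydrocodone"),
    ("424441978835570688",
     "@user no my body hurts, they prescribed me hydros and moltrin", "55", "61",
     "moltrin", "motrin"),
]


def extract(text: str, begin: int, end: int) -> str:
    """Slice a span assuming 1-based, inclusive offsets."""
    return text[begin - 1:end]


def group_spans(rows: List[Row]) -> Dict[str, List[Tuple[int, int, str]]]:
    """Collect all spans per tweet ID; tweets without drugs map to an empty list."""
    grouped: Dict[str, List[Tuple[int, int, str]]] = defaultdict(list)
    for tweet_id, text, begin, end, span, _normalized in rows:
        spans = grouped[tweet_id]  # registers the tweet even if it has no span
        if begin in ("-", ""):
            continue
        b, e = int(begin), int(end)
        if extract(text, b, e) != span:
            raise ValueError(f"offsets {b}-{e} do not cover {span!r} in tweet {tweet_id}")
        spans.append((b, e, span))
    return dict(grouped)
```

Running `group_spans(ROWS)` raises no error on the example rows, which supports the offset convention assumed here. The tweet mentioning two drugs ends up with two spans under a single ID.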

Important dates (tentative):

Training data available: March 5, 2021
Test data available: ~~September 8, 2021, 9:00 UTC~~ September 15, 2021, 9:00 UTC
System predictions for test data due: ~~September 11, 2021, 23:59 UTC~~ September 18, 2021, 23:59 UTC
Short technical systems description paper due: October 10, 2021
Paper acceptance notification: October 20, 2021
Camera ready: October 27, 2021

Task organizers:

Graciela Gonzalez-Hernandez, University of Pennsylvania, USA
Davy Weissenbacher, University of Pennsylvania, USA
Ivan Flores, University of Pennsylvania, USA
Karen O’Connor, University of Pennsylvania, USA