Social media mining for
pharmacovigilance
Graciela Gonzalez-Hernandez @gracielagon
email: [email protected]
CPeRT - Feb 20, 2017
Funded by NLM/NIH grant number 5R01LM011176
2
Social media as an “online health report”?
26% of internet users actively
discuss health information. Of that
group …1
– 30% changed behavior as a result
– 42% discussed current medical
conditions
“Extrapolating” this to Twitter...2,3
– Given 317 million active monthly users (Q3
2015): about 24 million would change their
health behavior
– Given 350,000 tweets/minute: about 38,220
tweets / minute about their current medical
conditions
1 http://www.pewinternet.org/fact-sheets/health-fact-sheet/
2 http://www.statisticbrain.com/twitter-statistics/
2 http://www.statista.com/statistics/282087/number-of-monthly-active-twitter-users/
3 www.internetlivestats.com/twitter-statistics/
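The extrapolation on this slide is simple multiplication, and it can be reproduced as a quick check. The sketch below assumes, as the slide does, that the shares reported for internet users carry over unchanged to Twitter users and tweets; the variable and function names are illustrative only.

```python
# Shares reported on the slide (footnote 1).
SHARE_DISCUSSING_HEALTH = 0.26   # internet users who actively discuss health information
SHARE_CHANGED_BEHAVIOR = 0.30    # of that group: changed behavior as a result
SHARE_CURRENT_CONDITIONS = 0.42  # of that group: discussed current medical conditions


def extrapolate(total, share_of_group):
    """Scale a total by the health-discussing share, then by a share of that group."""
    return total * SHARE_DISCUSSING_HEALTH * share_of_group


users_changing_behavior = extrapolate(317_000_000, SHARE_CHANGED_BEHAVIOR)
tweets_per_minute_on_conditions = extrapolate(350_000, SHARE_CURRENT_CONDITIONS)

print(f"{users_changing_behavior / 1e6:.1f} million users")
print(f"{tweets_per_minute_on_conditions:,.0f} tweets per minute")
```

The first figure comes out slightly above the slide's "about 24 million", and the second matches the 38,220 tweets per minute quoted above.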
3
Social media in public health monitoring
Growing interest: publications in PubMed including “social media” or “social network” went from just over 100 to 2000 over the last 10 years. Examples:
• Identifying smoking cessation patterns (Struik and Baskerville 2014),
• Identifying user social circles with common experiences (like prescription drug abuse) (Hanson et al 2014),
• Monitoring malpractice (Nakhasi et al 2012),
• Tracking infectious/viral disease spread (Broniatowski et al 2013) (Paul and Dredze, 2011)
In September of 2015, our JBI paper “Utilizing Social Media Data for Pharmacovigilance: A Review” was nominated as one of the 10 articles with the greatest potential social impact from among the more than 2500 journals published by Elsevier/Atlas.
Public health monitoring challenge: how do we get observations over time for specific groups that share interesting characteristics?
4
Social Media for health monitoring?
What do systematic reviews tell us?
• Under-reporting is a problem in current surveillance systems. A review of 37 studies from 12 countries found a median under-reporting rate of 94% (82-98%); for serious/severe ADRs, 85%. (Hazell & Shakir, Drug Saf 2006 PMID 16689555).
• Abundant reports in SM. A review of 29 studies that compared SM to other sources found a higher frequency of adverse events in social media, particularly for ‘symptom’ related and ‘mild’ adverse events. (Golder et al, Br J Clin Pharmacol 2015 PMID 26271492).
• Patient reporting brings a different perspective and more information. A review of 34 studies found that patient reporting brings novel information, more detail, and information on the severity and impact of ADRs in daily life. (Inacio et al, Br J Clin Pharmacol 2017 PMID 27558545).
Targeted, diverse, cohorts may be more easily accessible through social media.
People reveal information in social media that may not be available from FAERS or health records
• e.g., information about medication abuse, co-ingestion, sentiment regarding medications, impact on daily life…
Recruitment of cohorts via social media is something already being considered
• Shere et al “The Role of Social Media in Clinical Trials” (PMC3966825)
• Admon et al “Recruiting Pregnant Patients for Survey Research: A Head to Head Comparison of Social Media-Based Versus Clinic-Based Approaches” (PMC5215244)
5
Difficulties with social media data
Incompleteness:
• Not all health conditions may be revealed through social media posts
• While social media data may provide access to a larger population, complete data about individual cases may be difficult to obtain: a pregnant woman can be identified and detected to be taking drug X, but dosage, frequency, etc. may be missing
• Participants from the cohort may drop out at higher rates
Accessibility:
• Data from social media is dependent on the available APIs
• Data collection methods may have to be changed frequently over time
Authenticity:
• Bots – a large portion of social media is now generated by bots,
making it harder to mine reliable data
6
“Typical” Social media mining pipeline
Data collection
Annotation
Resource adaptation
Classification
Information extraction
Normalization
Case studies / validation
7
A taste of Twitter ADR lingo
• HA! Not if you're on #Seroquil. EXTREMELY vivid dreams that stay in conscious memory. Very #Freaky! Any idea why?
• I'm def suing cymbalta. I can't wait until its out of my system. Get out!!!!!!! Nowwww!!!!! You turn peaceful people into the hulk!. (c0034634 – Rage)
• Apparently, Baclofen greatly exacerbates the "AD" part of my ADHD. Average length of focus today: about 30 seconds. (c0235198 – cerebration impaired)
• The 100mg tabs of trazodone my gp prescribed are too much, now that I don't take them every night. Still zombieish after an hour awake
• Gone from 50mg to 150mg of Serequel last night. Could barely wake up this morning and I feel like my body is made of lead
8
Data collection and annotation
Phonetic spelling variants for capturing misspelled
medication names1
(http://diego.asu.edu/Publications/ADRSpell/ADRSpell.html)
• Seroquel -> siroquil, seroquil etc.
Binary and full ADR annotations2,3
Multiple trained annotators + pharmacology expert
to resolve annotation disagreements
1 Pimpalkhute et al. Phonetic Spelling Filter for Keyword Selection. AMIA Jt Summits Transl Sci Proc. 2014.
2 O’Connor et al. Pharmacovigilance on Twitter. AMIA Annu Symp Proc. 2014.
3 Ginn et al. Mining Twitter for adverse drug reaction mentions. BioTxtM. 2014.
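A rough sketch of how misspelled drug names could be generated for keyword collection. This is not the published Phonetic Spelling Filter: it assumes a simple scheme in which one-edit variants are kept only if they share a crude phonetic key with the original name. The key function, the one-edit limit, and all names are illustrative assumptions.

```python
import string

VOWELS = set("aeiouy")


def phonetic_key(word):
    """Crude phonetic key (assumption for illustration): keep the first letter,
    drop later vowels, and skip a letter that repeats the last kept letter."""
    word = word.lower()
    key = word[0]
    for ch in word[1:]:
        if ch in VOWELS or ch == key[-1]:
            continue
        key += ch
    return key


def edits1(word):
    """All strings one edit (delete, transpose, substitute, insert) away from word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)


def spelling_variants(drug_name):
    """One-edit variants that keep the same phonetic key as the drug name."""
    name = drug_name.lower()
    target = phonetic_key(name)
    return {v for v in edits1(name) if v and v != name and phonetic_key(v) == target}


variants = spelling_variants("Seroquel")
print("seroquil" in variants, "serequel" in variants)
```

With this toy key, "siroquil" has the same key as "seroquel" but is two edits away, so capturing it would require applying the one-edit step twice.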
9
Annotation example
Works to calm mania or depression but zonks me and scares me about diabetes issues reported.
• Indication: mania (C0338831)
• Indication: depression (C001157)
• ADR: drowsiness (C0013144)
• Other: diabetes
stops me from crying most of the time, blocks most of my feelings
• Indication: crying (C0010399)
• Adverse reaction: emotional indifference (C0001726)
10
Text classification
Generate a large set of features, representing semantic
properties (e.g., sentiment, polarity, and topic), from
short text nuggets1
• Combine training data from different corpora in attempts to boost
classification accuracies
• Effort in resource creation pays off
Other text classification tasks:
• Drug abuse classification2
• Drug safety classification3
1 Sarker and Gonzalez. Portable automatic text classification. J Biomed Inform. 2015.
2 Sarker et al. Social media mining for toxicovigilance. Drug Saf. 2016.
3 Patki et al. Mining adverse drug .. going beyond extraction. BioLinkSig. 2014.
11
ADR extraction
Goal: automatically extract exact mentions of ADRs
and other information
Traditional, lexicon-based approaches perform poorly on
social media text
12
ADRMine: deep learning
Our approach using conditional random fields
outperforms lexicon-based approaches1
The Shared Task at PSB 2016 showed it outperformed all
other systems.
Particularly ambiguous ADRs are captured by the “cluster”
feature
1 Nikfarjam et al. Pharmacovigilance from social media.. sequence labeling with word embedding cluster
features. JAMIA. 2015.
Publication resources: http://diego.asu.edu/Publications/ADRMine.html
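To make the “cluster” feature concrete, here is a minimal sketch of per-token features of the kind a sequence labeler such as a CRF could consume. The word-to-cluster map is a made-up toy and the feature names are assumptions; neither is taken from ADRMine itself.

```python
# Toy word-to-cluster map (made up for illustration). In practice the clusters
# would be learned from unlabeled posts, e.g. by clustering word embeddings.
WORD_CLUSTER = {
    "seroquel": "c1", "trazodone": "c1",   # drug-like words
    "zombieish": "c2", "headache": "c2",   # symptom-like words
    "100mg": "c4",                         # dosage-like words
}


def token_features(tokens, i, window=1):
    """Features for tokens[i]: the lowercased word and cluster id of the token
    itself and of each neighbor inside the context window."""
    features = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            word = tokens[j].lower()
            features[f"word[{offset}]"] = word
            features[f"cluster[{offset}]"] = WORD_CLUSTER.get(word, "none")
    return features


tokens = "Still zombieish after trazodone".split()
sentence_features = [token_features(tokens, i) for i in range(len(tokens))]
print(sentence_features[1])
```

The idea is that a colloquial word absent from any ADR lexicon can still share a cluster id with known symptom words, which gives the labeler a signal that a lexicon lookup would miss.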
13
Unsupervised learned clusters
Cluster# | Topic | Examples of clustered words
c1 | Drug | abilify, adderall, ambien, ativan, aspirin, citalopram, effexor, paxil, …
c2 | Signs/Symptoms | hangover, headache, rash, hive, …
c3 | Signs/Symptoms | anxiety, depression, disorder, ocd, mania, stabilizer, …
c4 | Drug dosage | 1000mg, 100mg, .10, 10mg, 600mg, 0.25, .05, ...
c5 | Treatment | anti-depressant, antidepressant, drug, med, medication, medicine, treat, …
c6 | Family member | brother, dad, daughter, father, husband, mom, mother, son, wife, …
c7 | Date | 1992, 2011, 2012, 23rd, 8th, april, aug, august, december, …
14
Concept normalization
A set of rule-based techniques followed by semantic similarity
based techniques1
Best F-score: 0.603
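A minimal sketch of a two-stage normalizer in the spirit of "rules first, similarity second". It is an illustration under assumptions, not the evaluated system: the lexicon is a five-entry toy built from concept IDs shown on earlier slides, and character-level string similarity from Python's `difflib` stands in for the semantic similarity step.

```python
import difflib
import re

# Toy lexicon: concept name -> concept ID, taken from the examples on earlier slides.
LEXICON = {
    "drowsiness": "C0013144",
    "rage": "C0034634",
    "emotional indifference": "C0001726",
    "crying": "C0010399",
    "mania": "C0338831",
}


def normalize(mention, threshold=0.8):
    """Return (concept_name, concept_id) for a mention, or None if nothing matches."""
    # Rule-based step: lowercase, strip punctuation, collapse whitespace, exact lookup.
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", mention.lower())
    cleaned = " ".join(cleaned.split())
    if cleaned in LEXICON:
        return cleaned, LEXICON[cleaned]
    # Fallback step: closest lexicon entry by character-level similarity
    # (a stand-in for semantic similarity, which would need embeddings or an ontology).
    close = difflib.get_close_matches(cleaned, list(LEXICON), n=1, cutoff=threshold)
    if close:
        return close[0], LEXICON[close[0]]
    return None


print(normalize("Drowsiness!"))
print(normalize("drowsines"))
print(normalize("zonks me"))
```

The last call returns None: "zonks me" means drowsiness but shares almost no characters with it, which is why a similarity measure based on meaning rather than spelling is needed for social media text.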
15
Frequency comparison: signal analysis
Drug name (Brand name): carbamazepine (Tegretol)
• Primary indications: epilepsy, trigeminal neuralgia
• Documented adverse effects (frequency): dizziness, somnolence or fatigue, unsteadiness, nausea, vomiting
• Adverse effects found in user comments (frequency): somnolence or fatigue (12.3%), allergy (5.2%), weight gain (4.1%), rash (3.5%), depression (3.2%), dizziness (2.4%), tremor/spasm (1.7%), headache (1.7%), appetite increased (1.5%), nausea (1.5%)
Drug name (Brand name): olanzapine (Zyprexa)
• Primary indications: schizophrenia, bipolar disorder
• Documented adverse effects (frequency): weight gain (65%), alteration in lipids (40%), somnolence or fatigue (26%), increased cholesterol (22%), diabetes (2%)
• Adverse effects found in user comments (frequency): weight gain (30.0%), somnolence or fatigue (15.9%), appetite increased (4.9%), depression (3.1%), tremor (2.7%), diabetes (2.6%), mania (2.3%), anxiety (1.4%), hallucination (0.7%), edema (0.6%)
Drug name (Brand name): trazodone (Oleptro)
• Primary indications: depression
• Documented adverse effects (frequency): somnolence or fatigue (46%), headache (33%), dry mouth (25%), dizziness (25%), nausea (21%)
• Adverse effects found in user comments (frequency): somnolence or fatigue (48.2%), nightmares (4.6%), insomnia (2.7%), addiction (1.7%), headache (1.6%), depression (1.3%), hangover (1.2%), anxiety attack (1.2%), panic reaction (1.1%), dizziness (0.9%)
16
Exploring Health Timelines: longitudinal data
We want to be able to relate a condition or event of
interest to prior or subsequent events or conditions.
Thus, if in the timeline of a pregnant woman we find:
• “6 years to this day that I was diagnosed with depression, bi polar
disorder and anxiety disorder. But I am still standing. God is good”
• “5 yrs today since I was diagnosed with type 1 diabetes”,
• “Stop vaping !! I have Asthma !!”
• “I took a Zyrtec this morning and I guess youre not suppose to
consume more than 1 in 24hrs the struggle”
we would want to include this information in the health timeline of the
user for further analysis.
17
Focus of this work
Address the gap in longitudinal social media based public
health surveillance
Develop natural language processing (NLP), machine learning,
and information retrieval (IR) methods to help accurately
identify a cohort of pregnant women and collect their social
media timelines
Perform preliminary analyses of the extracted health timelines
to identify limitations, and establish future research goals.
19
Data collection & classification
Collect tweets mentioning pregnancy announcements
• Based on search queries
• Time period: Jan 2014 to Sept 2015
• Example query: “i am * weeks/months pregnant”
• Query count: 18
Not all tweets returned by the search queries were legitimate pregnancy
announcements
• Example: “…I look like Im 3 months pregnant”
• Classification features: n-grams, synsets, sentiment, word clusters
Collect user timelines of positive announcements using Twitter API
DailyStrength Data
• Individual forums for different cohorts.
• Collected data from 5 forums (Pregnancy, Pregnancy After Loss Or Infertility,
Pregnancy Teens, Stillbirth, and Miscarriage)
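The example query “i am * weeks/months pregnant” can be approximated with a regular expression, which also shows why a classification step is needed afterwards. The pattern below is an assumed rendering of that one query (the wildcard is taken to be a single token, and "I'm"/"Im" are accepted as variants of "I am"); it is not one of the 18 queries actually used.

```python
import re

# Assumed regex rendering of the example query "i am * weeks/months pregnant".
ANNOUNCEMENT = re.compile(
    r"\bi(?:\s+am|['’]m|m)\s+(\S+)\s+(weeks?|months?)\s+pregnant\b",
    re.IGNORECASE,
)


def candidate_announcement(tweet):
    """Return (amount, unit) if the tweet matches the query pattern, else None."""
    match = ANNOUNCEMENT.search(tweet)
    return (match.group(1), match.group(2).lower()) if match else None


print(candidate_announcement("I am 20 weeks pregnant today!"))
print(candidate_announcement("…I look like Im 3 months pregnant"))
print(candidate_announcement("I am pregnant"))
```

The second tweet matches the pattern even though it is not a pregnancy announcement, so pattern matching alone only yields candidates that a classifier must then filter.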
20
Information extraction: concept tagging
• Tweets by pregnancy period
• Extract relevant tweets from pregnancy period, if possible.
• Tag trimester: combination of term and pattern matching.
• “I'm officially 20 weeks pregnant….” : 2nd trimester.
• The proposed algorithm covers most of the cases.
• It fails to cover ambiguous and relative time cases:
• “next week is gone b my last week pregnant who want to make a bet
lol”
Tag medications mentioned
• Dictionary of 7396 drugs total
• FDA drug classification of medication safety: 1916 drugs collected from 3 sources
• Expanded using RxNorm database
Tag health conditions (diseases, side effects…): not reported
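A sketch of trimester tagging by pattern matching, limited to explicit "N weeks/months pregnant" mentions. The week boundaries (up to 13, 14 to 27, 28 and later) follow a common clinical convention and the 4-weeks-per-month conversion is a simplification; both are assumptions here, not the rules of the proposed algorithm.

```python
import re

# Explicit gestational-age mentions such as "20 weeks pregnant" or "3 months pregnant".
GESTATION = re.compile(r"\b(\d{1,2})\s*(weeks?|wks?|months?)\s+pregnant\b", re.IGNORECASE)


def tag_trimester(tweet):
    """Return 1, 2, or 3 for an explicit 'N weeks/months pregnant' mention, else None."""
    match = GESTATION.search(tweet)
    if not match:
        return None
    amount, unit = int(match.group(1)), match.group(2).lower()
    weeks = amount * 4 if unit.startswith("month") else amount  # rough conversion
    if not 1 <= weeks <= 42:
        return None  # implausible value, leave untagged
    if weeks <= 13:
        return 1
    if weeks <= 27:
        return 2
    return 3


print(tag_trimester("I'm officially 20 weeks pregnant…."))
print(tag_trimester("next week is gone b my last week pregnant who want to make a bet lol"))
```

As on the slide, the explicit mention is tagged as 2nd trimester, while the relative-time tweet is left untagged because it contains no number to anchor on.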
21
Evaluation
• Annotation
• 1200 tweets annotated by two human annotators
• Inter-annotator agreement (kappa score) was 0.79.
• 10-fold cross-validation was used to estimate the accuracy of the
classifier.
• 15,523 users out of 35,355 were found to mention legitimate
pregnancy announcements
• Timeline extraction of these users resulted in over 30 million
tweets, all of which were indexed in Lucene.
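For reference, the kappa score mentioned above corrects raw agreement for the agreement expected by chance. The sketch below computes Cohen's kappa for two annotators; the eight labels are made up for illustration and are not the study's annotations.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' label proportions, summed over labels.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Made-up toy annotations: the annotators disagree on one item out of eight.
annotator_a = ["preg"] * 4 + ["not"] * 4
annotator_b = ["preg"] * 3 + ["not"] * 5
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 2))
```

In this toy case raw agreement is 7/8 and chance agreement is 0.5, so kappa is 0.75.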
Classification results | Precision | Recall | F-measure
isPreg | 0.83 | 0.79 | 0.81
notPreg | 0.84 | 0.77 | 0.80
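The F-measure column can be checked against the precision and recall columns, assuming it is the balanced F1 score (the harmonic mean of the two).

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F-measure) from the table above.
reported = {"isPreg": (0.83, 0.79, 0.81), "notPreg": (0.84, 0.77, 0.80)}
for label, (p, r, f) in reported.items():
    print(label, round(f_measure(p, r), 2), f)
```

Both recomputed values round to the reported figures.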
22
Use case: medication mentions
Distribution of top 10 drug mentions by trimester in Twitter:
Note on ibuprofen: it is generally not recommended during pregnancy,
especially during the third trimester. ... Ibuprofen may cause premature
closure of the fetal ductus arteriosus and prolongation of bleeding time
(https://www.drugs.com/pregnancy/ibuprofen.html)
Note on codeine: Codeine use anytime during pregnancy was associated
with planned Cesarean delivery. Third-trimester use was associated with
acute Cesarean and postpartum hemorrhage (PMC3214255)
23
Prescription drug abuse monitoring
Users post information about medication abuse on
social media
- about to be cracked on adderall to survive today
- i’m just gonna shower and overdose on Seroquel so I’ll sleep
until morning.
- popped Adderall tonight hahahah let’s finish this 100 page paper
- an oxycodone high from snorting lasts for one hour, if it is
swallowed, your looking at three hour high.
24
Adderall® vs. oxycodone abuse patterns
Supervised classification to investigate patterns of
abuse-related tweets1
1 Sarker et al. Social media mining for toxicovigilance. Drug Saf. 2016.
25
Thank you!
Contact: [email protected]
Twitter: @gracielagon
HLP lab:
https://healthlanguageprocessing.org