Learning About Medicine by Applying Machine Learning to User Generated Content:
The Case of Anorexia
Elad Yom-Tov, Microsoft Research Israel
*Pew survey, 2010
Why medicine?
• People use the Internet extensively:
– More than 77% of the US population uses the Internet
– Every day, 55% of Americans use the Internet. They do so for an average of two hours.
– More than 80% of Internet users search for medical information online, and significant medically-related activities happen on the Internet
• Large-scale medical trials are expensive and time consuming.
• Making sense of Internet data requires processing large amounts of data to produce meaningful insights
Anorexia Nervosa
A lifestyle choice?
“Thin is perfection, I'll die trying to achieve it”
“Anorexia is a lifestyle, not a diet”
“I only feel beautiful when I'm hungry”
Contacts
Data: Users
• All users who posted at least two photographs with a relevant tag (“thinspo”, “thinspiration”, “pro-ana”)
– 162 users
• All users who posted to eating disorder groups on Flickr
– 71 users
• Users who commented on or favorited at least two of the above-mentioned photos
– 683 users
Data: Photos and links
• Raw data:
– 543,891 photographs
– 2,229,489 comments
– 642,317 favorite markings
– 237,165 contact links
• Labeling:
– Users were labeled on a 5-point scale.
• Kappa = 0.51 (p < 10^-5)
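The kappa value above measures how far labelers agree beyond chance. A minimal sketch of unweighted Cohen's kappa follows, assuming two labelers (the slide does not state the number of labelers or the kappa variant used). The label lists are made up for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels on a 5-point scale (not data from the study).
labels_a = [1, 2, 3, 4, 5, 1, 2, 3]
labels_b = [1, 2, 3, 4, 5, 2, 2, 4]
kappa = cohen_kappa(labels_a, labels_b)
```

Because the scale is ordinal, a weighted kappa might be a better fit; the unweighted form is used here only to keep the sketch short.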
[Figure panels: Comments, Contacts, Favorites, Tags]
ROC
Contacts 0.74
Comments 0.74
Favorites 0.53
Tag similarity
• Modeled users with a TF-IDF weighted bag-of-tags
• Average cosine similarity:
– Pro-anorexia: 0.259
– Pro-recovery: 0.202
– Pro-recovery to pro-anorexia: 0.225
– ROC: 0.52
– Tag usage:
  • “thinspiration”: 37% pro-anorexia, 7% pro-recovery
  • “pro-anorexia”: 1.7% pro-anorexia, 2.4% pro-recovery
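The tag model above can be sketched as follows: each user becomes a sparse TF-IDF vector over the tags they used, and users are compared by cosine similarity. The weighting formula (raw count times log inverse user frequency) and the toy tag lists are assumptions; the slide does not give the exact TF-IDF variant.

```python
import math
from collections import Counter


def tfidf_vectors(user_tags):
    """Map each user to a sparse {tag: weight} TF-IDF vector."""
    n_users = len(user_tags)
    # Number of users who used each tag at least once.
    user_freq = Counter()
    for tags in user_tags.values():
        user_freq.update(set(tags))
    vectors = {}
    for user, tags in user_tags.items():
        counts = Counter(tags)
        vectors[user] = {
            tag: count * math.log(n_users / user_freq[tag])
            for tag, count in counts.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is all-zero)."""
    dot = sum(w * v.get(tag, 0.0) for tag, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical users and tags, for illustration only.
vecs = tfidf_vectors({
    "a": ["thinspo", "thinspo", "diet"],
    "b": ["thinspo", "recovery"],
    "c": ["recovery", "hope"],
})
sim_ab = cosine(vecs["a"], vecs["b"])
sim_ac = cosine(vecs["a"], vecs["c"])
```

Averaging such pairwise similarities within and across the two classes would give figures of the kind listed above. With this weighting, a tag used by every user gets weight zero.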
Is exposing pro-anorexia users to pro-recovery comments an effective intervention?
                Comments by...
Posted by...    PA      PR
PA              61%     46%
PR              61%     71%
Hazard model

                                                    Class
                                                    Pro-Anorexia   Pro-recovery
All previous times
  Number of photos                                  -0.226         -0.339
  Number of highly relevant photos                  -0.223          0.013
  Number of views                                    0.212         -0.072
  Number of views of highly relevant photos          0.164          0.023
  Number of comments from same-class users          -0.057          0.057
  Number of comments from other-class users         -0.117         -0.247
  Fraction of comments from same-class users        -0.268         -0.027
Recent
  Number of photos                                   0.213          0.167
  Number of highly relevant photos                  -0.022         -0.094
  Number of views                                    0.029          0.163
  Number of views of highly relevant photos         -0.007          0.199
  Number of comments from same-class users          -0.068          0.024
  Number of comments from other-class users          0.057         -0.002
  Fraction of comments from same-class users         0.061          0.172
How do they get there?
Data
Toolbar data over a period of 5 months, in which we identified two types of behavior:

Celebrity queries
• One of 3640 known celebrities
• Each scored for the probability of them appearing in conjunction with the word “anorexia”
• We refer to this probability as the Perceived Anorexia Score (PAS).

Anorexia queries
We define anorexic activity searching (AAS) as one of the following:
1. Tips for proana or anorexia
2. “how to … ” and proana or anorexia.
3. Proana buddy
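The three AAS rules above could be approximated with pattern matching on query text. The sketch below is an assumption about how such a filter might look, not the matching logic used in the study; the regular expressions, the accepted spellings ("proana", "pro-ana", "pro ana") and the function name `is_aas` are all hypothetical.

```python
import re

# Assumed spellings of the topic terms; the slide lists only "proana" and "anorexia".
_TOPIC = r"(?:pro[\s-]?ana|anorexia)"

_PATTERNS = [
    # Rule 1: "tips" together with proana or anorexia, in either order.
    re.compile(rf"\btips?\b.*\b{_TOPIC}\b|\b{_TOPIC}\b.*\btips?\b"),
    # Rule 2: "how to ..." followed by proana or anorexia.
    re.compile(rf"\bhow to\b.*\b{_TOPIC}\b"),
    # Rule 3: proana buddy.
    re.compile(r"\bpro[\s-]?ana\b.*\bbuddy\b"),
]


def is_aas(query):
    """Return True if the query matches any of the three assumed AAS patterns."""
    q = query.lower()
    return any(p.search(q) for p in _PATTERNS)


# Hypothetical queries, for illustration only.
examples = ["proana tips", "how to hide anorexia", "Pro-ana buddy",
            "anorexia treatment centers"]
flags = [is_aas(q) for q in examples]
```

A filter this simple would also flag queries such as "tips for anorexia recovery", so a real pipeline would likely need manual review or stricter rules.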
A total of 5,800,270 users searched for at least one celebrity in the top 2.5% of PAS, of which 3,615 also made AASs.
Clustering
• Start with a matrix of users by celebrities
– 9,188,983 users by 3,640 celebrities
• Cluster using k-means with cosine similarity
• Clusters are statistically significant by PAS, but not by occupation.
[Figure: average PAS for clusters 1 to 10 and for a random baseline; horizontal axis: Average PAS, 0 to 0.12]
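K-means with cosine similarity, as used in the clustering step above, is often implemented as "spherical" k-means: rows are scaled to unit length, each row is assigned to the centroid with the highest dot product, and centroids are re-normalized means. The sketch below assumes that variant (the slide does not say which implementation was used) and runs on a tiny made-up users-by-celebrities matrix.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(rows, k, iterations=20, seed=0):
    """Cluster rows by cosine similarity; returns one cluster index per row."""
    rng = random.Random(seed)
    points = [normalize(r) for r in rows]
    # Initialize centroids from k randomly chosen points.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # On unit vectors the dot product equals cosine similarity.
        assignment = [
            max(range(k), key=lambda c: dot(p, centroids[c])) for p in points
        ]
        # Move each centroid to the normalized mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return assignment


# Hypothetical search counts: 4 users by 4 celebrities.
matrix = [
    [5, 1, 0, 0],
    [4, 2, 0, 0],
    [0, 0, 3, 4],
    [0, 0, 1, 5],
]
clusters = spherical_kmeans(matrix, k=2)
```

At the scale quoted above (millions of users), a sparse-matrix implementation would be needed; this dense pure-Python version is only meant to show the assignment and update steps.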
Hazard models

                                                  Model 1                             Model 2
Attributes                                        Weight (s.e.)         Exp(weight)   Weight (s.e.)             Exp(weight)
Number of all searches                            1.4*10^-3 (5*10^-5)   1.00          1.4*10^-3 (5*10^-5)       1.00
Number of celebrity searches                      1.5*10^-4 (0.011)     1.00          -5.9*10^-3 (0.011) N.S.   0.99
Number of searches for top PAS celebrities        0.131 (0.008)         1.14          6.8*10^-2 (0.012)         1.07
Number of (unique) top PAS celebrities searched   -                     -             0.498 (0.061)             1.65
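The Exp(weight) column can be read as a hazard ratio: assuming a proportional hazards model (the slides do not name the model family), a one-unit increase in an attribute multiplies the hazard by exp(weight). The sketch below recomputes that column from a few of the weights reported above; the dictionary keys are shortened labels chosen for this example.

```python
import math


def hazard_ratio(weight):
    """Multiplicative change in hazard per one-unit increase in the attribute."""
    return math.exp(weight)


# Weights copied from the hazard model table; keys are shortened labels.
weights = {
    "top PAS searches (Model 1)": 0.131,
    "top PAS searches (Model 2)": 6.8e-2,
    "unique top PAS celebrities (Model 2)": 0.498,
}

ratios = {name: round(hazard_ratio(w), 2) for name, w in weights.items()}
```

Under this reading, each additional unique top-PAS celebrity searched is associated with a hazard about 1.65 times higher.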
Adding the media effect
• The Spearman correlation between the number of queries for a celebrity and the number of tweets was 0.63, so the bigger the peak (the “media buzz”), the more searches will occur.
• When focusing on queries and tweets which mentioned anorexia, this correlation is 0.68.
• AAS searchers were 1.9 times more likely to query for a high PAS celebrity in the days following a media peak compared to all other people, and 2.4 times more likely when the peak was associated with anorexia.
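The Spearman correlation quoted above is the Pearson correlation of the ranks of the two series. A minimal sketch follows; the per-celebrity query and tweet counts are made up for illustration and are not data from the study.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            result[idx] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical counts per celebrity.
queries = [120, 300, 80, 950, 400]
tweets = [15, 40, 30, 500, 35]
rho = spearman(queries, tweets)
```

Rank correlation suits this kind of data because query and tweet counts are heavily skewed, and only their ordering matters for the statistic.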
Hazard models revisited

                                                  N = 1                                         N = 7
Attributes                                        Weight (s.e.)                   Exp(weight)   Weight (s.e.)                   Exp(weight)
Number of all searches                            1.35*10^-3 (5.31*10^-5)         1.00          1.35*10^-3 (5.31*10^-5)         1.00
Number of celebrity searches                      -2.06*10^-3 (1.10*10^-2) N.S.   1.00          -2.13*10^-3 (1.11*10^-2) N.S.   1.00
Number of searches for top PAS celebrities        3.24*10^-3 (1.10*10^-2)         1.03          3.30*10^-3 (1.11*10^-2)         1.03
Number of (unique) top PAS celebrities searched   0.61 (5.70*10^-2)               1.84          0.60 (0.06)                     1.83
Peak in all Twitter activity                      0.29 (0.11)                     1.33          0.29 (0.07)                     1.33
Peak in Twitter activity related to anorexia      -0.25 (0.13) N.S.               0.78          -0.27 (0.10)                    0.77
Why is this interesting?
Summary
• As people spend ever more time on the Internet, they generate content which we can use to understand (and later hopefully improve) health and healthcare
• This content is especially useful when:
– People have less of an incentive to lie, compared to the real world
– Collecting data in the real world is hard
– Activity is largely web-driven
• BUT: Making sense of so much data requires integrating Machine Learning research with medical practice.
Q & A
Questions?