
Learning About Medicine by Applying Machine Learning to User Generated Content:

The Case of Anorexia

Elad Yom-Tov, Microsoft Research Israel

*Pew survey, 2010

Why medicine?

• People use the Internet extensively:
– More than 77% of the USA population use the Internet
– Every day, 55% of Americans use the Internet. They do so for an average of two hours.
– More than 80% of Internet users search for medical information online, and significant medically-related activities happen on the Internet

• Large-scale medical trials are expensive and time consuming.

• Making sense of Internet data requires processing large amounts of data to produce meaningful insights

Anorexia Nervosa

A lifestyle choice?

“Thin is perfection, I'll die trying to achieve it”
“Anorexia is a lifestyle, not a diet”
“I only feel beautiful when I'm hungry”

Contacts

Data: Users

• All users who posted at least two photographs with a relevant tag (“thinspo”, “thinspiration”, “pro-ana”)
– 162 users

• All users who posted to eating disorder groups on Flickr
– 71 users

• Users who commented on or favorited at least two of the above-mentioned photos
– 683 users

Data: Photos and links

• Raw data:
– 543,891 photographs
– 2,229,489 comments
– 642,317 favorite markings
– 237,165 contact links

• Labeling:
– Users were labeled on a 5-point scale.
  • Kappa = 0.51 (p < 10^-5)
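The agreement figure above can be illustrated with a minimal sketch. It assumes unweighted Cohen's kappa between two raters; the slides do not say which kappa variant or how many raters were used, and the labels below are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(rater_a)
    # Fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels on a 5-point scale (not data from the study).
labels_a = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
labels_b = [1, 2, 3, 4, 5, 1, 3, 3, 5, 5]
kappa = cohens_kappa(labels_a, labels_b)
print(round(kappa, 2))
```

For ordinal scales such as this one, a weighted kappa that penalizes near-misses less than distant disagreements is a common alternative.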

[Figure panels: Comments, Contacts, Favorites, Tags]

ROC

Contacts 0.74

Comments 0.74

Favorites 0.53

Tag similarity

• Modeled users with a TF-IDF weighted bag-of-tags

• Average cosine similarity:
– Pro-anorexia: 0.259
– Pro-recovery: 0.202
– Pro-recovery to pro-anorexia: 0.225
– ROC: 0.52
– Tag usage:
  • “thinspiration”: 37% pro-anorexia, 7% pro-recovery
  • “pro-anorexia”: 1.7% pro-anorexia, 2.4% pro-recovery
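The tag model above can be sketched as follows. This is a minimal illustration that assumes raw tag counts as term frequency and log(N / df) as inverse document frequency; the exact weighting used in the study is not given in the slides, and the users and tags below are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(user_tags):
    """Map each user to a sparse TF-IDF vector over the tags they used."""
    n_users = len(user_tags)
    doc_freq = Counter()
    for tags in user_tags.values():
        doc_freq.update(set(tags))  # count each tag once per user
    vectors = {}
    for user, tags in user_tags.items():
        tf = Counter(tags)
        vectors[user] = {
            tag: count * math.log(n_users / doc_freq[tag])
            for tag, count in tf.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(tag, 0.0) for tag, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical users and tags (not data from the study).
user_tags = {
    "a": ["thinspo", "thinspo", "beach"],
    "b": ["thinspo", "recovery"],
    "c": ["recovery", "sunset"],
}
vecs = tfidf_vectors(user_tags)
print(cosine(vecs["a"], vecs["b"]), cosine(vecs["a"], vecs["c"]))
```

Averaging `cosine` over all pairs within a class, and over all cross-class pairs, would give within-class and between-class figures of the kind reported above.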

Is exposing pro-anorexia users to pro-recovery comments an effective intervention?

(PA: pro-anorexia, PR: pro-recovery)

                 Comments by...
Posted by...     PA       PR
PA               61%      46%
PR               61%      71%

Hazard model

                                              Class
                                              Pro-Anorexia   Pro-recovery
All previous times
Number of photos                              -0.226         -0.339
Number of highly relevant photos              -0.223          0.013
Number of views                                0.212         -0.072
Number of views of highly relevant photos      0.164          0.023
Number of comments from same-class users      -0.057          0.057
Number of comments from other-class users     -0.117         -0.247
Fraction of comments from same-class users    -0.268         -0.027

Recent
Number of photos                               0.213          0.167
Number of highly relevant photos              -0.022         -0.094
Number of views                                0.029          0.163
Number of views of highly relevant photos     -0.007          0.199
Number of comments from same-class users      -0.068          0.024
Number of comments from other-class users      0.057         -0.002
Fraction of comments from same-class users     0.061          0.172

How do they get there?

Data

Toolbar data over a period of 5 months, in which we identified two types of behavior:

Celebrity queries
• Queries for one of 3,640 known celebrities
• Each celebrity is scored for the probability of appearing in conjunction with the word “anorexia”
• We refer to this probability as the Perceived Anorexia Score (PAS).
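One way to picture the PAS is as a co-occurrence frequency. The sketch below assumes the score is estimated as the fraction of a celebrity's queries that also contain the word “anorexia”; the slides do not state how the probability was actually estimated, and the function name, celebrity names, and queries are hypothetical placeholders.

```python
from collections import defaultdict


def perceived_anorexia_scores(queries, celebrities, keyword="anorexia"):
    """Estimate, per celebrity, the share of their queries that mention keyword."""
    mentions = defaultdict(int)
    co_mentions = defaultdict(int)
    for query in queries:
        text = query.lower()
        for name in celebrities:
            if name.lower() in text:
                mentions[name] += 1
                if keyword in text:
                    co_mentions[name] += 1
    # Celebrities that never appear in the log get no score at all.
    return {name: co_mentions[name] / mentions[name] for name in mentions}


# Hypothetical query log and celebrity list (placeholders, not study data).
example_queries = [
    "celebrity a anorexia",
    "celebrity a movie",
    "celebrity b new album",
    "celebrity a weight",
    "celebrity b tour",
]
scores = perceived_anorexia_scores(example_queries, ["celebrity a", "celebrity b"])
print(scores)
```

Plain substring matching is used only to keep the sketch short; a real query log would need proper name matching.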

Anorexia queries
We define anorexic activity searching (AAS) as one of the following:
1. Tips for proana or anorexia
2. “how to … ” and proana or anorexia.
3. Proana buddy
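The three AAS rules above could be approximated with a few regular expressions. This is a rough sketch: the patterns, the spelling variants of “proana”, and the function name `is_aas` are assumptions made for illustration, not the matching rules used in the study.

```python
import re

# "proana", "pro-ana", "pro ana", or "anorexia" as a whole word.
TERM = re.compile(r"\b(pro[\s-]?ana|anorexia)\b")
TIPS = re.compile(r"\btips\b")
HOW_TO = re.compile(r"\bhow to\b")
BUDDY = re.compile(r"\bpro[\s-]?ana\s+budd(y|ies)\b")


def is_aas(query):
    """Return True if the query matches one of the three AAS rules."""
    text = query.lower()
    has_term = TERM.search(text) is not None
    rule_tips = has_term and TIPS.search(text) is not None      # rule 1
    rule_how_to = has_term and HOW_TO.search(text) is not None  # rule 2
    rule_buddy = BUDDY.search(text) is not None                 # rule 3
    return rule_tips or rule_how_to or rule_buddy


for q in ["proana tips", "how to become proana", "anorexia treatment center"]:
    print(q, is_aas(q))
```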

A total of 5,800,270 users searched for at least one celebrity in the top 2.5% of PAS, of whom 3,615 also made AASs.

Clustering

• Start with a matrix of users by celebrities
– 9,188,983 users by 3,640 celebrities

• Cluster using k-means with cosine similarity

• Clusters are statistically significant by PAS, but not by occupation.
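K-means with cosine similarity is often implemented as spherical k-means: rows are scaled to unit length, each row is assigned to the centroid with the largest dot product, and centroids are renormalized after every update. The sketch below follows that recipe on a tiny hypothetical user-by-celebrity matrix. The farthest-first initialization is an assumption chosen to keep the example deterministic; the slides do not describe the initialization or the implementation used.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def spherical_kmeans(rows, k, iterations=20):
    """Cluster rows by cosine similarity; returns one cluster index per row."""
    points = [normalize(r) for r in rows]
    # Farthest-first initialization (an assumption made for this sketch).
    centroids = [points[0]]
    while len(centroids) < k:
        farthest = min(points, key=lambda p: max(dot(p, c) for c in centroids))
        centroids.append(farthest)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to the most similar centroid.
        assignment = [
            max(range(k), key=lambda c: dot(p, centroids[c])) for p in points
        ]
        # Recompute each centroid as the normalized mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return assignment


# Hypothetical counts: rows are users, columns are celebrities.
rows = [[5, 0, 0], [3, 1, 0], [0, 0, 4], [0, 1, 6]]
clusters = spherical_kmeans(rows, k=2)
print(clusters)
```

At the scale quoted above (millions of users), a sparse-matrix implementation would be needed; this version only shows the logic.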

[Figure: Average PAS for clusters 1 to 10 and for Random; axis range 0 to 0.12]

Hazard models

Attributes                                        Model 1                              Model 2
                                                  Weight (s.e.)          Exp(weight)   Weight (s.e.)              Exp(weight)
Number of all searches                            1.4*10^-3 (5*10^-5)    1.00          1.4*10^-3 (5*10^-5)        1.00
Number of celebrity searches                      1.5*10^-4 (0.011)      1.00          -5.9*10^-3 (0.011) N.S.    0.99
Number of searches for top PAS celebrities        0.131 (0.008)          1.14          6.8*10^-2 (0.012)          1.07
Number of (unique) top PAS celebrities searched   -                      -             0.498 (0.061)              1.65
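The Exp(weight) column is simply the exponential of each weight. Assuming a proportional-hazards style model (the slides do not name the exact model), it can be read as the multiplicative change in hazard per unit increase of the attribute. The short check below recomputes the column from weights quoted in the table; the dictionary keys are labels chosen for this sketch.

```python
import math

# Weights copied from the hazard models table.
weights = {
    "all searches": 1.4e-3,
    "celebrity searches (Model 2)": -5.9e-3,
    "top PAS celebrity searches (Model 1)": 0.131,
    "top PAS celebrity searches (Model 2)": 6.8e-2,
    "unique top PAS celebrities searched": 0.498,
}

# exp(weight), rounded to two decimals as in the table.
hazard_ratios = {name: round(math.exp(w), 2) for name, w in weights.items()}

for name, ratio in hazard_ratios.items():
    print(f"{name}: {ratio:.2f}")
```

Under that reading, each additional unique top-PAS celebrity searched multiplies the hazard by about 1.65.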

Adding the media effect

• The Spearman correlation between the number of queries for a celebrity and the number of tweets was 0.63, so the bigger the peak (the “media buzz”), the more searches will occur.

• When focusing on queries and tweets which mentioned anorexia, this correlation is 0.68.

• AAS searchers were 1.9 times more likely to query for a high PAS celebrity in the days following a media peak compared to all other people, and 2.4 times more likely when the peak was associated with anorexia.
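For reference, the Spearman correlation used above is the Pearson correlation of the ranks of the two series. The sketch below computes it with average ranks for ties; the query and tweet counts are hypothetical, not data from the study.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for idx in order[i:j + 1]:
            ranks[idx] = rank
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-celebrity counts.
query_counts = [10, 20, 30, 40, 50]
tweet_counts = [1, 3, 2, 5, 4]
rho = spearman(query_counts, tweet_counts)
print(round(rho, 2))
```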

Hazard models revisited

Attributes                                        N = 1                                         N = 7
                                                  Weight (s.e.)                   Exp(weight)   Weight (s.e.)                   Exp(weight)
Number of all searches                            1.35*10^-3 (5.31*10^-5)         1.00          1.35*10^-3 (5.31*10^-5)         1.00
Number of celebrity searches                      -2.06*10^-3 (1.10*10^-2) N.S.   1.00          -2.13*10^-3 (1.11*10^-2) N.S.   1.00
Number of searches for top PAS celebrities        3.24*10^-3 (1.10*10^-2)         1.03          3.30*10^-3 (1.11*10^-2)         1.03
Number of (unique) top PAS celebrities searched   0.61 (5.70*10^-2)               1.84          0.60 (0.06)                     1.83
Peak in all Twitter activity                      0.29 (0.11)                     1.33          0.29 (0.07)                     1.33
Peak in Twitter activity related to anorexia      -0.25 (0.13) N.S.               0.78          -0.27 (0.10)                    0.77

Why is this interesting?

Summary

• As people spend ever more time on the Internet, they generate content which we can use to understand (and later hopefully improve) health and healthcare

• This content is especially useful when:
– People have less of an incentive to lie, compared to the real world
– Collecting data in the real world is hard
– Activity is largely web-driven

• BUT: Making sense of so much data requires integrating Machine Learning research with medical practice.

Q & A

Questions?
