Post on 29-Jan-2016

Exploring Linkability of User Reviews

Mishari Almishari and Gene Tsudik

University of California, Irvine

Roadmap

1. Introduction
2. Data Set & Problem Settings
3. Linkability Results & Improvements
4. Discussion
5. Future Work & Conclusion

Motivation

Increasing Popularity of Reviewing Sites

Yelp, more than 39M visitors and 15M reviews in 2010

Example (review screenshot, annotated with its category and rating)

Rising awareness of privacy

How is it applied?

Traceability/Linkability

Linkability of Ad hoc Reviews

Linkability of Several Accounts

Goal

Assess the linkability in user reviews

Data Set

• 1 million reviews
• 2,000 users
• More than 300 reviews per user

Problem Settings

IR: Identified Record

AR: Anonymous Record

(Figure: a set of IRs alongside a set of ARs)

Problem Formulation

Anonymous Record (AR)

Identified Records (IR’s)

Matching Model

Top-X Linkability

X: 1 and 10

1, 5, 10, 20, …, 60
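
As a rough illustration of the Top-X measure, the sketch below counts an anonymous record as linked when its true author appears among the first X entries of the ranked IR list. The function name `top_x_linkability`, the record ids, and the user names are hypothetical and chosen only for this example; the slides do not show how the ratio was computed in practice.

```python
from typing import Dict, List


def top_x_linkability(rankings: Dict[str, List[str]],
                      true_author: Dict[str, str],
                      x: int) -> float:
    """Fraction of anonymous records whose true author is in the top x candidates."""
    if not rankings:
        return 0.0
    hits = sum(1 for ar_id, ranked in rankings.items()
               if true_author[ar_id] in ranked[:x])
    return hits / len(rankings)


# Toy data (hypothetical): each AR maps to a list of IR owners, best match first.
rankings = {
    "ar1": ["alice", "bob", "carol"],
    "ar2": ["bob", "carol", "alice"],
    "ar3": ["alice", "carol", "bob"],
    "ar4": ["carol", "alice", "bob"],
}
truth = {"ar1": "alice", "ar2": "carol", "ar3": "bob", "ar4": "carol"}

top1 = top_x_linkability(rankings, truth, 1)
top2 = top_x_linkability(rankings, truth, 2)
top3 = top_x_linkability(rankings, truth, 3)
```

In this toy data two of the four records are linked at Top-1 and three at Top-2, which is why the Top-10 figures in the results are always at least as high as the Top-1 figures.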

Problem Settings

Methodologies

(1) Naïve Bayesian Model: decreasing sorted list of IRs

(2) Kullback-Leibler Divergence (KLD): increasing sorted list of IRs

Maximum-Likelihood Estimation
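
The two models can be sketched on letter (unigram) counts as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: probabilities are estimated from counts, add-one smoothing (the `alpha` parameter) is an assumption added here to avoid taking the logarithm of zero, and the divergence is taken from the AR distribution to the IR distribution. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def letter_counts(text: str) -> Counter:
    """Count lowercase ASCII letters, ignoring everything else."""
    return Counter(ch for ch in text.lower() if ch in ALPHABET)


def smoothed_probs(counts: Counter, alpha: float = 1.0) -> Dict[str, float]:
    """Count-based estimate with add-alpha smoothing (smoothing is an assumption)."""
    total = sum(counts.values()) + alpha * len(ALPHABET)
    return {ch: (counts[ch] + alpha) / total for ch in ALPHABET}


def nb_log_likelihood(ar_counts: Counter, ir_probs: Dict[str, float]) -> float:
    """Log-probability of the AR's letters under the IR's letter distribution."""
    return sum(n * math.log(ir_probs[ch]) for ch, n in ar_counts.items())


def kld(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Kullback-Leibler divergence D(p || q) over the alphabet."""
    return sum(p[ch] * math.log(p[ch] / q[ch]) for ch in ALPHABET)


def rank_nb(ar_text: str, irs: Dict[str, str]) -> List[str]:
    """Higher likelihood is better, so sort in decreasing order."""
    ar = letter_counts(ar_text)
    scores = {user: nb_log_likelihood(ar, smoothed_probs(letter_counts(text)))
              for user, text in irs.items()}
    return sorted(scores, key=scores.get, reverse=True)


def rank_kld(ar_text: str, irs: Dict[str, str]) -> List[str]:
    """Lower divergence is better, so sort in increasing order."""
    p = smoothed_probs(letter_counts(ar_text))
    scores = {user: kld(p, smoothed_probs(letter_counts(text)))
              for user, text in irs.items()}
    return sorted(scores, key=scores.get)


# Toy identified records (hypothetical text).
irs = {
    "alice": "zzz zzz pizza pizza jazz",
    "bob": "the tea there then these",
}
ar_text = "pizza jazz zz"
nb_ranking = rank_nb(ar_text, irs)
kld_ranking = rank_kld(ar_text, irs)
```

The opposite sort directions follow from what each score means: a likelihood grows with similarity, while a divergence shrinks.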

Tokens

• Unigram
  • “privacy”: “p”, “r”, “i”, “v”, “a”, “c”, “y”
  • 26 values
• Digram
  • “privacy”: “pr”, “ri”, “iv”, “va”, “ac”, “cy”
  • 676 values
• Rating
  • 5 values
• Category
  • 28 values
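
The lexical tokens above can be extracted with a few lines of code. The sketch below assumes lowercase ASCII letters only and does not form digrams across spaces or punctuation; both choices are assumptions, since the slides give only the “privacy” example. The helper names are hypothetical.

```python
import string
from itertools import product
from typing import List

LETTERS = string.ascii_lowercase  # 26 possible unigram values


def unigrams(text: str) -> List[str]:
    """Single letters, in order of appearance."""
    return [ch for ch in text.lower() if ch in LETTERS]


def digrams(text: str) -> List[str]:
    """Adjacent letter pairs; pairs that span a non-letter are skipped."""
    t = text.lower()
    return [t[i:i + 2] for i in range(len(t) - 1)
            if t[i] in LETTERS and t[i + 1] in LETTERS]


# Every possible digram value: 26 * 26 = 676.
ALL_DIGRAMS = ["".join(pair) for pair in product(LETTERS, repeat=2)]

privacy_unigrams = unigrams("privacy")
privacy_digrams = digrams("privacy")
```

Rating and category tokens need no extraction step: each review contributes one rating value (out of 5) and one category value (out of 28).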

Unigram Results

NB - Unigram (plot: Linkability Ratio vs. Anonymous Record Size)

Size 60: LR 83% (Top-1), LR 96% (Top-10)

Digram Results

NB - Digram (plot: Linkability Ratio vs. Anonymous Record Size)

Size 20: LR 97% (Top-1)

Size 10: LR 88% (Top-1)

Improvement (1): Combining Lexical and Non-Lexical Ones

NB Model (plot: Linkability Ratio vs. Anonymous Record Size)

Gain, up to 20%

Size 60, 83% to 96%

Size 30, 60% to 80%
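
One plausible way to combine token types in a Naïve Bayes model is to add the per-type log-likelihoods, since the model already treats tokens as independent. The sketch below assumes equal weight for every token type and add-one smoothing; neither is stated in the slides, and the names `log_likelihood` and `combined_score` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict


def log_likelihood(ar_counts: Counter, ir_counts: Counter,
                   num_values: int, alpha: float = 1.0) -> float:
    """Smoothed log-probability of the AR tokens under the IR token counts."""
    total = sum(ir_counts.values()) + alpha * num_values
    return sum(n * math.log((ir_counts[tok] + alpha) / total)
               for tok, n in ar_counts.items())


def combined_score(ar: Dict[str, Counter], ir: Dict[str, Counter],
                   sizes: Dict[str, int]) -> float:
    """Sum of per-token-type log-likelihoods (equal weighting is an assumption)."""
    return sum(log_likelihood(ar[name], ir[name], sizes[name]) for name in sizes)


# Number of possible values per token type, taken from the Tokens slide.
sizes = {"unigram": 26, "rating": 5}

# Toy data (hypothetical): both users write with the same letters,
# but only alice tends to give 5-star ratings.
ar = {"unigram": Counter("goodfood"), "rating": Counter({5: 2})}
irs = {
    "alice": {"unigram": Counter("goodfoodgoodfood"),
              "rating": Counter({5: 8, 4: 2})},
    "bob": {"unigram": Counter("goodfoodgoodfood"),
            "rating": Counter({1: 7, 2: 3})},
}

unigram_only = {u: log_likelihood(ar["unigram"], ir["unigram"], 26)
                for u, ir in irs.items()}
combined = {u: combined_score(ar, ir, sizes) for u, ir in irs.items()}
```

In this toy case the letter distributions cannot separate the two users, and the rating token breaks the tie.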

What about Restricting Identified Record (IR) Size?

(Plots: NB Model and KLD Model, Linkability Ratio vs. Anonymous Record Size)

Affected by IR size

Performed better for smaller IR

Size 20 or less, improved

Improvement (2): Matching All IR’s At Once

Matching All Results

(Plots: Restricted IR and Full IR, Linkability Ratio vs. Anonymous Record Size)

Gain, up to 16%: Size 30, from 74% to 90%

Gain, up to 23%: Size 20, from 35% to 55%
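
The slides do not say how the joint matching is computed. Purely as an illustration, the sketch below uses a greedy one-to-one assignment: it assumes every anonymous record belongs to a different user, visits (AR, IR) pairs from best score to worst, and accepts a pair only if both sides are still free. The function name and the scores are hypothetical.

```python
from typing import Dict


def greedy_one_to_one(scores: Dict[str, Dict[str, float]]) -> Dict[str, str]:
    """Assign each AR to a distinct IR, taking the best-scoring pairs first."""
    pairs = sorted(((score, ar, ir)
                    for ar, row in scores.items()
                    for ir, score in row.items()),
                   reverse=True)
    assigned: Dict[str, str] = {}
    used = set()
    for _score, ar, ir in pairs:
        if ar not in assigned and ir not in used:
            assigned[ar] = ir
            used.add(ir)
    return assigned


# Toy log-likelihood scores (hypothetical); higher is better.
scores = {
    "ar1": {"alice": -10.0, "bob": -11.0},
    "ar2": {"alice": -9.0, "bob": -20.0},
}

# Matching each AR on its own picks alice twice.
independent = {ar: max(row, key=row.get) for ar, row in scores.items()}

# Matching all at once gives alice to ar2, leaving bob for ar1.
joint = greedy_one_to_one(scores)
```

An optimal assignment algorithm could replace the greedy pass; the point of the example is only that a one-to-one constraint can correct a record whose best individual match is already taken.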

Improvement (3): For Small IR Size

Changing it to: 0.5 + Review Length

(Plot: Linkability Ratio vs. Anonymous Record Size)

Size 10, 89% to 92%

Size 7, 79% to 84%

Gain up to 5%

Discussion

o Unigram and Scalability
  o 26 vs. 676
  o 59 vs. 676
  o Less than 10%
o Prolific Users
  o In the long run, will be prolific
o Anonymous Record Size
  o A set of 60 reviews, less than 20% of minimum contribution
o Detecting Spam Reviews

Future Work

o Improving more for Small AR’s
o Other Probabilistic Models
o Using Stylometry
o Review Anonymization
o Exploring Linkability in other Preference Databases

Conclusion

o Extensive Study to Assess Linkability of User Reviews
  o For large set of users
  o Using very simple features
o Users are very exposed even with simple features and large number of authors

Takeaway Point: Reviews can be accurately de-anonymized using alphabetical letter distributions

Questions?
