Clickstream Models & Sybil Detection
Gang Wang (王刚), UC Santa Barbara
Modeling User Clickstream Events
User-generated events
  E.g. profile load, link follow, photo browse, friend invite
  Assume we have event type, userID, timestamp
Intuition: Sybil users act differently from normal users
  Goal-oriented: focus on specific actions, fewer “extraneous” events
  Time-limited: focused on efficient use of time, smaller gaps?
  Forcing Sybil users to mimic normal users: a win?
[Figure: clickstream log (UserID, Event Generated, Timestamp); Legit vs. Sybils]
System Overview
[Figure: Clickstream Log, Sequence Clustering, Cluster Coloring; inputs: Known Good Users, Incoming Clickstream (?)]
Clickstream Models
Clickstream log: user clicks (click type) with timestamp
Modeling clickstreams
  Event-only Sequence Model: order of events, e.g. ABCDA
  Time-based Model: sequence of inter-arrival times, e.g. {t1, t2, t3, …}
  Hybrid Model: sequence of click events with time, e.g. A(t1)B(t2)C(t3)D(t4)A
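The three models above can be derived from the same raw log. The sketch below is only an illustration: the `(event_type, timestamp)` tuple layout, the single-letter event codes, and the function names are assumptions, not part of the original system.

```python
from typing import List, Optional, Tuple

# Hypothetical log for one user: (event_type, timestamp_in_seconds).
Click = Tuple[str, float]

def event_sequence(clicks: List[Click]) -> str:
    """Event-only model: the order of events, e.g. 'ABCDA'."""
    ordered = sorted(clicks, key=lambda c: c[1])
    return "".join(event for event, _ in ordered)

def inter_arrival_times(clicks: List[Click]) -> List[float]:
    """Time-based model: gaps between consecutive clicks."""
    times = sorted(t for _, t in clicks)
    return [later - earlier for earlier, later in zip(times, times[1:])]

def hybrid_sequence(clicks: List[Click]) -> List[Tuple[str, Optional[float]]]:
    """Hybrid model: each event paired with the gap to the next click.
    The last event has no following click, so its gap is None."""
    ordered = sorted(clicks, key=lambda c: c[1])
    gaps: List[Optional[float]] = list(inter_arrival_times(clicks)) + [None]
    return [(event, gap) for (event, _), gap in zip(ordered, gaps)]

log = [("A", 0.0), ("B", 2.0), ("C", 5.0), ("D", 6.0), ("A", 10.0)]
print(event_sequence(log))       # event-only sequence
print(inter_arrival_times(log))  # time-based sequence
print(hybrid_sequence(log))      # hybrid sequence
```

The hybrid form keeps the raw gaps here; grouping them into buckets is a separate step and is not shown.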
Clickstream Clustering
Similarity Graph
  Vertices: users (or sessions)
  Edges: weighted by the similarity score of the two users’ clickstreams
Clustering similar clickstreams together
  Graph partitioning using METIS
Q: How to compare two clickstreams?
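A minimal sketch of the similarity graph, assuming one clickstream string per user and a pluggable distance in [0, 1]. The names `build_similarity_graph` and `toy_distance` are hypothetical, and the toy distance is only a stand-in for a real clickstream distance. The partitioning step itself is not shown.

```python
from itertools import combinations
from typing import Callable, Dict, Tuple

def build_similarity_graph(
    streams: Dict[str, str],
    distance: Callable[[str, str], float],
) -> Dict[Tuple[str, str], float]:
    """Return weighted edges {(user_a, user_b): similarity} for every user pair.
    Similarity is taken as 1 - distance, assuming distance lies in [0, 1]."""
    edges = {}
    for a, b in combinations(sorted(streams), 2):
        edges[(a, b)] = 1.0 - distance(streams[a], streams[b])
    return edges

def toy_distance(s1: str, s2: str) -> float:
    """Placeholder distance: Jaccard distance over the sets of event types."""
    e1, e2 = set(s1), set(s2)
    union = e1 | e2
    return 1.0 - len(e1 & e2) / len(union) if union else 0.0

streams = {"u1": "AAB", "u2": "AAC", "u3": "AAB"}
graph = build_similarity_graph(streams, toy_distance)
for (a, b), weight in graph.items():
    print(a, b, round(weight, 3))
```

A graph partitioner such as METIS works with integer edge weights, so floating-point similarity scores like these would need to be scaled and rounded before partitioning.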
Distance Functions Of Each Model
Click Sequence (CS) Model
  Ngram overlap
    S1 = AAB, S2 = AAC
    ngram1 = {A, B, AA, AB, AAB}
    ngram2 = {A, C, AA, AC, AAC}
  Ngram+count
    S1 = AAB, S2 = AAC
    ngram1 = {A(2), B(1), AA(1), AB(1), AAB(1)}
    ngram2 = {A(2), C(1), AA(1), AC(1), AAC(1)}
    Euclidean distance between V1 = (2,1,0,1,0,1,1,0)/6 and V2 = (2,0,1,1,1,0,0,1)/6
    (dimensions ordered A, B, C, AA, AC, AB, AAB, AAC)
Time-based Model
  Compare the distribution of inter-arrival time: K-S test
Hybrid Model
  Bucketize inter-arrival time
  Compute 5grams (similar to CS Model)
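The distance functions above can be sketched as follows. The n-gram counting and the normalized Euclidean distance follow the S1 = AAB, S2 = AAC example. The exact overlap formula is not given on the slide, so Jaccard distance is used here as an assumption. The K-S statistic is written out by hand to keep the sketch dependency-free; all function names are hypothetical.

```python
import math
from bisect import bisect_right
from collections import Counter
from typing import List

def ngram_counts(seq: str, max_n: int) -> Counter:
    """Count every n-gram of length 1..max_n in the sequence."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

def ngram_overlap_distance(s1: str, s2: str, max_n: int) -> float:
    """Assumed form of 'ngram overlap': Jaccard distance between n-gram sets."""
    g1, g2 = set(ngram_counts(s1, max_n)), set(ngram_counts(s2, max_n))
    union = g1 | g2
    return 1.0 - len(g1 & g2) / len(union) if union else 0.0

def ngram_count_distance(s1: str, s2: str, max_n: int) -> float:
    """Euclidean distance between n-gram count vectors,
    each normalized by its own total count."""
    c1, c2 = ngram_counts(s1, max_n), ngram_counts(s2, max_n)
    t1, t2 = sum(c1.values()) or 1, sum(c2.values()) or 1
    grams = set(c1) | set(c2)
    return math.sqrt(sum((c1[g] / t1 - c2[g] / t2) ** 2 for g in grams))

def ks_statistic(a: List[float], b: List[float]) -> float:
    """Two-sample K-S statistic: largest gap between the two empirical CDFs.
    Both samples must be non-empty."""
    a, b = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in set(a) | set(b)
    )

print(ngram_counts("AAB", 3))
print(ngram_overlap_distance("AAB", "AAC", 3))
print(ngram_count_distance("AAB", "AAC", 3))
print(ks_statistic([1.0, 2.0], [3.0, 4.0]))
```

For the slide's example, the two normalized vectors differ by 1/6 in six dimensions, so the Euclidean distance is sqrt(6/36), roughly 0.41.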
Detection In A Nutshell
Inputs
  Trained clusters
  Input sequences for testing
Methodology: given a test sequence A
  K nearest neighbor: find the top-k nearest sequences in the trained clusters
  Nearest Cluster: find the nearest cluster based on average distance to sequences in the cluster
  Nearest Cluster (center): pre-compute the center(s) of each cluster, find the nearest cluster center
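The three lookup strategies can be compared on toy data. This sketch assumes each sequence has already been turned into a numeric vector (for example normalized n-gram counts), that a cluster center is the coordinate-wise mean, and that k-nearest-neighbor picks the majority cluster among the k closest sequences. None of these details are spelled out on the slide, and the cluster ids and labels are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

Vector = Sequence[float]

def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

def knn_label(test: Vector, clusters: Dict[str, List[Vector]], k: int = 3) -> str:
    """K nearest neighbor: majority cluster among the k closest training sequences."""
    scored = sorted(
        (euclidean(test, member), cid)
        for cid, members in clusters.items()
        for member in members
    )
    votes = Counter(cid for _, cid in scored[:k])
    return votes.most_common(1)[0][0]

def nearest_cluster_avg(test: Vector, clusters: Dict[str, List[Vector]]) -> str:
    """Nearest Cluster: smallest average distance to the cluster's sequences."""
    def avg_distance(cid: str) -> float:
        members = clusters[cid]
        return sum(euclidean(test, m) for m in members) / len(members)
    return min(clusters, key=avg_distance)

def centroid(members: List[Vector]) -> List[float]:
    """Assumed center: coordinate-wise mean of the cluster's vectors."""
    return [sum(column) / len(members) for column in zip(*members)]

def nearest_cluster_center(test: Vector, centers: Dict[str, Vector]) -> str:
    """Nearest Cluster (center): compare only against pre-computed centers."""
    return min(centers, key=lambda cid: euclidean(test, centers[cid]))

clusters = {
    "c_legit": [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0]],
    "c_sybil": [[5.0, 5.0], [5.0, 6.0], [6.0, 5.0]],
}
colors = {"c_legit": "legit", "c_sybil": "sybil"}  # hypothetical cluster labels
centers = {cid: centroid(members) for cid, members in clusters.items()}

test = [4.5, 5.0]
print(colors[knn_label(test, clusters, k=3)])
print(colors[nearest_cluster_avg(test, clusters)])
print(colors[nearest_cluster_center(test, centers)])
```

The center-based variant needs one distance computation per cluster at test time, while the other two touch every training sequence.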
Clustering Sequences
(False positives, False negatives) of users
Model (Sequence Type)              Distance Function  20 clusters  50 clusters  100 clusters
Click Sequence Model (Categories)  unigram            (3%, 6%)     (1%, 7%)     (2%, 4%)
                                   unigram+count      (1%, 4%)     (1%, 3%)     (1%, 3%)
                                   10gram             (1%, 3%)     (1%, 3%)     (2%, 2%)
                                   10gram+count       (1%, 4%)     (2%, 4%)     (1%, 2%)
Time-based Model                   K-S Test           (9%, 8%)     (2%, 10%)    (5%, 10%)
Hybrid Model (Categories)          5gram              (3%, 2%)     (2%, 2%)     (2%, 2%)
                                   5gram+count        (3%, 4%)     (4%, 5%)     (1%, 2%)
How well can each method separate Sybils from legitimate users?
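The (false positives, false negatives) pairs reported for this experiment are per-user rates. The slide does not define the denominators, so the sketch below assumes the usual convention: false positives are legitimate users labeled as Sybils, divided by all legitimate users, and false negatives are Sybils labeled as legitimate, divided by all Sybils. The label strings and the data are made up for illustration.

```python
from typing import List, Tuple

def error_rates(truth: List[str], predicted: List[str]) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) over users.
    Labels are the hypothetical strings 'legit' and 'sybil'."""
    legit = [p for t, p in zip(truth, predicted) if t == "legit"]
    sybil = [p for t, p in zip(truth, predicted) if t == "sybil"]
    fp = sum(p == "sybil" for p in legit) / len(legit) if legit else 0.0
    fn = sum(p == "legit" for p in sybil) / len(sybil) if sybil else 0.0
    return fp, fn

truth = ["legit"] * 4 + ["sybil"] * 4
predicted = ["legit", "legit", "legit", "sybil",
             "sybil", "sybil", "sybil", "legit"]
fp_rate, fn_rate = error_rates(truth, predicted)
print(fp_rate, fn_rate)
```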
Detection Accuracy
Basics
  Training on one group of users, and testing on the other group of users
  Clusters trained using Hybrid Model
Key takeaways
  High accuracy with 50 clicks in the test sequence
  Nearest Cluster (Center) method achieves high accuracy with minor computation overhead
(False positives, False negatives) of users
Number of Clicks in the Sequence (length)  K-nearest Neighbors (k=3)  Nearest Cluster (Avg. Distance)  Nearest Cluster (Center)
Length <=50                                (1.5%, 2.1%)               (0.6%, 2.6%)                     (0.4%, 2.3%)
Length <=100                               (0.9%, 1.8%)               (0.2%, 2.5%)                     (0.3%, 2.3%)
All                                        (0.6%, 3%)                 (0.4%, 2.8%)                     (0.4%, 2.3%)
Can Model Be Effective Over Time?
Experiment method
  Using the first two weeks of data to train the model
  Testing on the following two weeks of data
(False positives, False negatives) of users
Model                  K-nearest Neighbors (k=3)  Nearest Cluster (Avg. Distance)  Nearest Cluster (Center)
Click Sequence Model   (1.8%, 1%)                 (3%, 2%)                         (3%, 0.8%)
Hybrid Model           (3%, 2%)                   (3%, 1%)                         (1.2%, 1.4%)
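The time-based split used in this experiment can be sketched as below. The record layout `(user_id, event_type, timestamp)`, the start time, and the function name are assumptions for illustration. Clicks outside the four-week window are simply dropped.

```python
from typing import List, Tuple

Click = Tuple[str, str, float]  # (user_id, event_type, timestamp in seconds)
WEEK = 7 * 24 * 3600
DAY = 24 * 3600

def split_by_time(
    log: List[Click],
    start: float,
    train_weeks: int = 2,
    test_weeks: int = 2,
) -> Tuple[List[Click], List[Click]]:
    """Train on the first `train_weeks` of the log, test on the following `test_weeks`."""
    train_end = start + train_weeks * WEEK
    test_end = train_end + test_weeks * WEEK
    train = [c for c in log if start <= c[2] < train_end]
    test = [c for c in log if train_end <= c[2] < test_end]
    return train, test

log = [
    ("u1", "A", 1 * DAY),
    ("u1", "B", 13 * DAY),
    ("u2", "A", 15 * DAY),
    ("u2", "C", 27 * DAY),
    ("u3", "A", 29 * DAY),  # outside the four-week window
]
train, test = split_by_time(log, start=0.0)
print(len(train), len(test))
```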
Still Ongoing Work
With broad interest and applications
As Sybil detection tool
  Code being tested internally at Renren
    Trained with 10K users (2-week log)
    Testing on 1 million users (1-week log)
    5 Sybil clusters, 22K suspicious profiles
  Further improvement
    Training with longer clickstreams (half of the users have <5 clicks in 2 weeks)
    More conservative in labeling Sybil clusters
As user modeling tool
  Code being tested by LinkedIn as user profiler
Some Useful Tools
Graph Partitioning: METIS
  http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
Community Detection: Louvain code
  https://sites.google.com/site/findcommunities/
Other Ongoing Works/Ideas
Fighting against crowdturfing
  Crowdturfing: real users are paid to spam
  How to detect these malicious real users
    User behavior model
    Network-wide temporal anomaly detection
Information Dissemination
  Content sharing via social edges
    How often will users click on the content
    How often will users comment on the content
  Sybil detection, targeted ad placement
Questions?
http://current.cs.ucsb.edu
Thank You!