Clickstream Models & Sybil Detection

Gang Wang (王刚)

UC Santa Barbara

Modeling User Clickstream Events

User-generated events
  E.g. profile load, link follow, photo browse, friend invite
  Assume we have event type, userID, timestamp

Intuition: Sybil users act differently from normal users
  Goal-oriented: focus on specific actions, fewer “extraneous” events
  Time-limited: focused on efficient use of time, smaller gaps?
  Forcing Sybil users to mimic normal users: a win?

[Figure: events generated per userID over time (timestamp), legitimate users vs. Sybils]

System Overview

[Figure: Clickstream Log, then Sequence Clustering, then Cluster Coloring; further labels: Known Good Users, Incoming Clickstream (?)]

Clickstream Models

Clickstream: log of user clicks (click type) with timestamps

Modeling Clickstream
  Event-only Sequence Model: order of events, e.g. ABCDA
  Time-based Model: sequence of inter-arrival times, e.g. {t1, t2, t3, …}
  Hybrid Model: sequence of click events with time, e.g. A(t1)B(t2)C(t3)D(t4)A
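The three models can be illustrated with a minimal sketch that derives each representation from one user's log. It assumes the log is already filtered to a single userID and sorted by time; the function names, the sample clicks, and the bucket edges used for the hybrid model are illustrative assumptions, not values from the slides.

```python
from bisect import bisect_right

# One user's clicks as (event_type, timestamp_in_seconds), sorted by time.
# The event letters and timestamps are made-up illustration data.
clicks = [("A", 0), ("B", 2), ("C", 30), ("D", 31), ("A", 400)]

def event_sequence(clicks):
    """Event-only model: keep just the order of click types."""
    return [event for event, _ in clicks]

def inter_arrival_times(clicks):
    """Time-based model: gaps (seconds) between consecutive clicks."""
    times = [t for _, t in clicks]
    return [later - earlier for earlier, later in zip(times, times[1:])]

# Hypothetical bucket edges in seconds: <1s, 1-10s, 10-60s, >=60s.
BUCKET_EDGES = [1, 10, 60]

def hybrid_sequence(clicks, edges=BUCKET_EDGES):
    """Hybrid model: each click tagged with the bucket of the gap to the next click."""
    if not clicks:
        return []
    events = event_sequence(clicks)
    gaps = inter_arrival_times(clicks)
    tokens = [f"{e}({bisect_right(edges, g)})" for e, g in zip(events, gaps)]
    tokens.append(events[-1])  # the last click has no following gap
    return tokens

seq = event_sequence(clicks)        # ['A', 'B', 'C', 'D', 'A']
gaps = inter_arrival_times(clicks)  # [2, 28, 1, 369]
hybrid = hybrid_sequence(clicks)    # ['A(1)', 'B(2)', 'C(1)', 'D(3)', 'A']
```

Bucketizing the gaps turns the hybrid sequence into a string of discrete tokens, so the same n-gram machinery used for event-only sequences can be applied to it.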

Clickstream Clustering

Similarity Graph
  Vertices: users (or sessions)
  Edges: weighted by the similarity score of the two users’ clickstreams

Clustering similar clickstreams together
  Graph partitioning using METIS
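A rough sketch of the similarity graph as a weighted edge list follows. The similarity function is passed in as a parameter; the Jaccard similarity used here is only a placeholder assumption, and the user IDs and sequences are made up. Handing the graph to METIS is left out, since its input format and bindings are not covered in the slides.

```python
from itertools import combinations

def jaccard_similarity(seq_a, seq_b):
    """Placeholder similarity: overlap of the sets of click types (an assumption)."""
    a, b = set(seq_a), set(seq_b)
    return len(a & b) / len(a | b) if a | b else 0.0

def build_similarity_graph(clickstreams, similarity=jaccard_similarity, min_weight=0.0):
    """Return {(u, v): weight} for every user pair with weight > min_weight.

    clickstreams maps a user (or session) ID to its click sequence.
    """
    edges = {}
    for u, v in combinations(sorted(clickstreams), 2):
        w = similarity(clickstreams[u], clickstreams[v])
        if w > min_weight:
            edges[(u, v)] = w
    return edges

# Made-up example data.
streams = {"u1": "AAB", "u2": "AAC", "u3": "ABAB", "u4": "DDD"}
graph = build_similarity_graph(streams)
```

Comparing all pairs is quadratic in the number of users, so a real deployment would likely need sampling or a threshold such as `min_weight` to keep the graph sparse.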

Q: How to compare two clickstreams?

Distance Functions of Each Model

Click Sequence (CS) Model
  Ngram overlap
    S1 = AAB, S2 = AAC
    ngram1 = {A, B, AA, AB, AAB}
    ngram2 = {A, C, AA, AC, AAC}
  Ngram+count
    S1 = AAB, S2 = AAC
    ngram1 = {A(2), B(1), AA(1), AB(1), AAB(1)}
    ngram2 = {A(2), C(1), AA(1), AC(1), AAC(1)}
    Euclidean distance between V1 = (2,1,0,1,0,1,1,0)/6 and V2 = (2,0,1,1,1,0,0,1)/6

Time-based Model
  Compare the distributions of inter-arrival times
  K-S test

Hybrid Model
  Bucketize inter-arrival times
  Compute 5grams (similar to the CS Model)
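The AAB vs. AAC example can be reproduced with a short sketch. The slides do not state the exact formula for "ngram overlap"; Jaccard distance on the n-gram sets is assumed here. For ngram+count, the counts are normalized by the total number of n-grams (6 in the example), matching the /6 in V1 and V2. Function names are illustrative.

```python
from collections import Counter
from math import sqrt

def ngram_counts(seq, max_n):
    """Count every contiguous n-gram of length 1..max_n in seq."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

def overlap_distance(c1, c2):
    """Ngram overlap, taken here as Jaccard distance on the n-gram sets (assumption)."""
    s1, s2 = set(c1), set(c2)
    union = s1 | s2
    return 1 - len(s1 & s2) / len(union) if union else 0.0

def count_distance(c1, c2):
    """Ngram+count: Euclidean distance between count vectors normalized by total count.

    Assumes both sequences are non-empty.
    """
    t1, t2 = sum(c1.values()), sum(c2.values())
    return sqrt(sum((c1[g] / t1 - c2[g] / t2) ** 2 for g in set(c1) | set(c2)))

c1 = ngram_counts("AAB", 3)  # A:2, B:1, AA:1, AB:1, AAB:1
c2 = ngram_counts("AAC", 3)  # A:2, C:1, AA:1, AC:1, AAC:1
d_overlap = overlap_distance(c1, c2)  # 1 - 2/8 = 0.75
d_count = count_distance(c1, c2)      # sqrt(6)/6, about 0.408
```

The two sequences share only A and AA out of eight distinct n-grams, and their normalized count vectors differ by 1/6 in six coordinates, which gives the two distances in the comments.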

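For the time-based model, a minimal sketch of a two-sample K-S comparison of inter-arrival times is given below. It assumes the K-S statistic itself (the largest gap between the two empirical CDFs) serves as the distance; the slides only say "K-S test". The sample gaps are made up.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample K-S statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)

    def ecdf(sorted_vals, x):
        # Fraction of values <= x.
        return bisect_right(sorted_vals, x) / len(sorted_vals)

    # The largest gap is always reached at one of the observed values.
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)

# Made-up inter-arrival times in seconds.
legit_gaps = [5, 12, 40, 90, 300]
sybil_gaps = [1, 1, 2, 2, 3]
d_far = ks_statistic(legit_gaps, sybil_gaps)   # 1.0: the distributions do not overlap
d_same = ks_statistic(legit_gaps, legit_gaps)  # 0.0: identical samples
```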

Detection In A Nutshell

Inputs
  Trained clusters
  Input sequences for testing

Methodology: given a test sequence A
  K nearest neighbor: find the top-k nearest sequences in the trained clusters
  Nearest Cluster: find the nearest cluster based on average distance to sequences in the cluster
  Nearest Cluster (center): pre-compute the center(s) of each cluster, find the nearest cluster center
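The three strategies above can be sketched as follows. Several things are assumptions for illustration: clusters are keyed by their label (as if already colored), the distance is a placeholder Jaccard distance on click-type sets, and a cluster "center" is taken to be its medoid. The slides do not specify how centers are computed.

```python
from collections import Counter

def set_distance(a, b):
    """Placeholder distance: Jaccard distance on the sets of click types."""
    sa, sb = set(a), set(b)
    return 1 - len(sa & sb) / len(sa | sb)

def knn_label(test, clusters, distance, k=3):
    """Majority label among the k training sequences nearest to test."""
    scored = sorted(
        (distance(test, seq), label)
        for label, seqs in clusters.items()
        for seq in seqs
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

def nearest_cluster_avg(test, clusters, distance):
    """Cluster with the smallest average distance from test to its members."""
    return min(
        clusters,
        key=lambda c: sum(distance(test, s) for s in clusters[c]) / len(clusters[c]),
    )

def medoid(seqs, distance):
    """One possible 'center': the member with the smallest total distance to the rest."""
    return min(seqs, key=lambda s: sum(distance(s, t) for t in seqs))

def nearest_cluster_center(test, centers, distance):
    """Cluster whose pre-computed center is closest to test."""
    return min(centers, key=lambda c: distance(test, centers[c]))

# Made-up training clusters and test sequence.
clusters = {"legit": ["ABCD", "ABDC", "ACBD"], "sybil": ["AAAA", "AAAB", "AABA"]}
centers = {label: medoid(seqs, set_distance) for label, seqs in clusters.items()}

by_knn = knn_label("AAB", clusters, set_distance)
by_avg = nearest_cluster_avg("AAB", clusters, set_distance)
by_center = nearest_cluster_center("AAB", centers, set_distance)
```

The center-based variant compares the test sequence against one sequence per cluster instead of every training sequence, which is why its computation overhead is small.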

Clustering Sequences

(False positives, False negatives) of users

Model (Sequence Type)              Distance Function  20 clusters  50 clusters  100 clusters
Click Sequence Model (Categories)  unigram            (3%, 6%)     (1%, 7%)     (2%, 4%)
Click Sequence Model (Categories)  unigram+count      (1%, 4%)     (1%, 3%)     (1%, 3%)
Click Sequence Model (Categories)  10gram             (1%, 3%)     (1%, 3%)     (2%, 2%)
Click Sequence Model (Categories)  10gram+count       (1%, 4%)     (2%, 4%)     (1%, 2%)
Time-based Model                   K-S Test           (9%, 8%)     (2%, 10%)    (5%, 10%)
Hybrid Model (Categories)          5gram              (3%, 2%)     (2%, 2%)     (2%, 2%)
Hybrid Model (Categories)          5gram+count        (3%, 4%)     (4%, 5%)     (1%, 2%)

How well can each method separate Sybils from legitimate users?
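For reading the (false positives, false negatives) pairs, a small sketch of how such rates can be computed from ground-truth and predicted labels may help. It assumes "Sybil" is the positive class, that false positives are counted over legitimate users, and that false negatives are counted over Sybils; the slides do not spell out the denominators. The users and labels are made up.

```python
def error_rates(truth, predicted):
    """Return (false positive rate, false negative rate), treating 'sybil' as positive."""
    legit = [u for u, label in truth.items() if label == "legit"]
    sybil = [u for u, label in truth.items() if label == "sybil"]
    fp = sum(predicted[u] == "sybil" for u in legit)  # legit users flagged as Sybil
    fn = sum(predicted[u] == "legit" for u in sybil)  # Sybils that slipped through
    return fp / len(legit), fn / len(sybil)

# Made-up labels for six users.
truth = {"u1": "legit", "u2": "legit", "u3": "legit", "u4": "legit",
         "u5": "sybil", "u6": "sybil"}
predicted = {"u1": "legit", "u2": "sybil", "u3": "legit", "u4": "legit",
             "u5": "sybil", "u6": "legit"}
fp_rate, fn_rate = error_rates(truth, predicted)  # (0.25, 0.5)
```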

Detection Accuracy

Basics
  Train on one group of users and test on the other group
  Clusters trained using the Hybrid Model

Key takeaways
  High accuracy with 50 clicks in the test sequence (at most 1.5% false positives and 2.6% false negatives)
  Nearest Cluster (Center) method achieves high accuracy with minor computation overhead

(False positives, False negatives) of users

Sequence length (clicks)  K-nearest Neighbors (k=3)  Nearest Cluster (Avg. Distance)  Nearest Cluster (Center)
Length <=50               (1.5%, 2.1%)               (0.6%, 2.6%)                     (0.4%, 2.3%)
Length <=100              (0.9%, 1.8%)               (0.2%, 2.5%)                     (0.3%, 2.3%)
All                       (0.6%, 3%)                 (0.4%, 2.8%)                     (0.4%, 2.3%)

Can the Model Be Effective Over Time?

Experiment method
  Train the model on the first two weeks of data
  Test on the following two weeks of data

(False positives, False negatives) of users

Model                 K-nearest Neighbors (k=3)  Nearest Cluster (Avg. Distance)  Nearest Cluster (Center)
Click Sequence Model  (1.8%, 1%)                 (3%, 2%)                         (3%, 0.8%)
Hybrid Model          (3%, 2%)                   (3%, 1%)                         (1.2%, 1.4%)

Still Ongoing Work

With broad interest and applications

As Sybil detection tool
  Code being tested internally at Renren
    Trained with 10K users (2-week log)
    Testing on 1 million users (1-week log)
    5 Sybil clusters, 22K suspicious profiles
  Further improvement
    Training with longer clickstreams (half of the users have <5 clicks in 2 weeks)
    More conservative in labeling Sybil clusters

As user modeling tool
  Code being tested by LinkedIn as user profiler

Some Useful Tools

Graph Partitioning
  METIS: http://glaros.dtc.umn.edu/gkhome/metis/metis/overview

Community Detection
  Louvain code: https://sites.google.com/site/findcommunities/

Other Ongoing Works/Ideas

Fighting against crowdturfing
  Crowdturfing: real users are paid to spam
  How to detect these malicious real users?
    User behavior model
    Network-wide temporal anomaly detection

Information Dissemination
  Content sharing via social edges
    How often will a user click on the content?
    How often will a user comment on the content?
  Sybil detection, targeted ad placement

Questions?

http://current.cs.ucsb.edu

Thank You!
