jing gao 1, feng liang 1, wei fan 2, chi wang 1, yizhou sun 1, jiawei han 1 university of illinois,...

21
On Community Outliers and their Efficient Detection in Information Networks Jing Gao 1 , Feng Liang 1 , Wei Fan 2 , Chi Wang 1 , Yizhou Sun 1 , Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

Post on 19-Dec-2015

219 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

On Community Outliers and their Efficient Detection in Information

NetworksJing Gao1, Feng Liang1, Wei Fan2,

Chi Wang1, Yizhou Sun1, Jiawei Han1

University of Illinois, IBM TJ Watson

Debapriya Basu

Page 2: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

2

OBJECTIVES

Determine outliers in information networks Compare various algorithms which does the

same

Page 3: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

3

CHARACTERISTICS OF INFORMATION NETWORK

Eg Internet, Social Networking Sites Nodes – characterized by feature values Links - representative of relation between

nodes

Page 4: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

4

Outliers – anomalies, novelties Different kinds of outliers

◦ Global◦ Contextual

OUTLIERS

Page 5: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

5

OUTLIERS IN INFORMATION NETWORKS

V7

10

V9

V8

30

V10

40 70 100 110 140 160

V6 V1 V4 V5 V3 V2

Global Outlier

Salary (in $1000)

Page 6: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

6

COMMUNITY OUTLIER Unified model considering both nodes and

links Community discovery and outlier detection

are related processes

Page 7: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

7

SUMMARY OF THE APPROACH Treat each object as a multivariate data point Use K components to describe normal

community behavior and one component to denote outliers

Induce a hidden variable zi at each object indicating community

Treat network information as a graph Model the graph as a Hidden Markov Random

Field on zi

Find the local minimum of the posterior probability potential energy of the model.

Page 8: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

8

UNIFIED PROBABILISTIC MODEL

community label Z

outlier

node feature

X

link structure W

high-income:mean: 116k

std: 35k

low-income:mean: 20k

std: 12k

model parameter

sK: number of communities

Page 9: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

9

SYMBOLS IN THE MODEL

Symbol Definition

I = {1,2,3….i,..M} Indices of the objects

V = {v1,v2….vm} Set of objects

S = {s1,s2,….sm} Given attributes of objects

WM*M = {wij} Adjacency matrix containing the weights of the links

Z = {z1,…..,zm} RVs for hidden labels of objects

X = {x1,…..,xm} RVs for observed data

Ni (i ∈ I) Neighborhood of object vi

1,….,k,….K Indices of normal communities

Θ = {Θ1, Θ2,……, Θk} R.Vs for model parameters

Page 10: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

10

◦ Set of R.Vs X are conditionally independent given their labelsP(X=S|Z) = ΠP(xi=si|zi)

◦ Kth normal community is characterized by a set of parametersP(xi=si|zi =k) = P(xi=si|Θk)

◦ Outliers are characterized by uniform distribution◦ P(xi=si|zi =0) = ρ0

◦ Markov random field is defined over hidden variable Z ◦ P(zi|zI-{i}) = P(zi|zNi)

◦ The equivalent Gibbs distribution is P(Z) = exp(-U(Z))*1/H1

H1 = normalizing constant, U(Z) = sum of clique potentials.

◦ Goal is to find the configuration of z that maximizes P(X=S|Z)P(Z) for a given Θ

THE MODEL

Page 11: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

11

MODELING DATA

Continuous Data◦ Is modeled as Gaussian distribution◦ Model parameters: mean, standard deviation

Text Data◦ Is modeled as Multinomial distribution◦ Model parameters: probability of a word

appearing in a community

Page 12: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

12

COMMUNITY OUTLIER DETECTION ALGORITHM

Given Θ, find Z that maximizes P(Z|X)

Given Z, find Θthat maximizes P(X|Z)

Initialize Z

INFERENCE

PARAMETER ESTIMATION

Θ : model parametersZ: community labels

Page 13: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

13

PARAMETER ESTIMATION

Calculate model parameters◦ maximum likelihood estimation

Continuous◦ mean: sample mean of the community◦ standard deviation: square root of the sample

variance of the community Text

◦ probability of a word appearing in the community: empirical probability

Page 14: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

14

INFERENCES

Calculate Zi values◦ Given Model parameters,◦ Iteratively update the community labels of nodes at

each timestep◦ Select the label that maximizes P(Z|X,ZN)

Calculate P(Z|X,ZN) values◦ Both the node features and community labels of

neighbors if Z indicates a normal community◦ If the probability of a node belonging to any community

is low enough, label it as an outlier

Page 15: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

15

DISCUSSIONS

Setting Hyper parameters◦ a0 = threshold ◦ Λ = confidence in the network◦ K = number of communities

Initialization◦ Group outliers in clusters.◦ It will eventually get corrected.

Page 16: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

16

SIMULATED EXPERIMENTS

Data Generation◦ Generate continuous data based on Gaussian

distributions and generate labels according to the model

◦ Define r: percentage of outliers, K: number of communities

Baseline models◦ GLODA: global outlier detection (based on node

features only)◦ DNODA: local outlier detection (check the feature

values of direct neighbors)◦ CNA: partition data into communities based on

links and then conduct outlier detection in each community

Page 17: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

17

TRUE POSITIVE RATE(Top r% as outliers)

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

r=1% K=5 r=5% K=5 r=1% K=8 r=5% K=8

GLODA

DNODA

CNA

CODA

Page 18: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

18

EXPERIMENTS ON DBLP Communities

◦ data mining, artificial intelligence, database, information analysis

Sub network of Conferences Links: percentage of common authors among two

conferences Node features: publication titles in the conference

Sub network of Authors Links: co-authorship relationship Node features: titles of publications by an author

Page 19: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

19

RESULTS

Community outliers: CVPR CIKM

Page 20: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

20

CONCLUSIONS

Community Outliers

Community Outlier Detection

QUESTIONS

Page 21: Jing Gao 1, Feng Liang 1, Wei Fan 2, Chi Wang 1, Yizhou Sun 1, Jiawei Han 1 University of Illinois, IBM TJ Watson Debapriya Basu

21

REFERENCES

On Community Outliers and their Efficient Detection in Information Networks – Gao, Liang, Fan, Wang, Sun, Han

Outlier detection – Irad Ben-Gal Automated detection of outliers in real-world data –

Last, Kandel Outlier Detection for High Dimensional Data –

Aggarwal, Yu