jing gao 1, feng liang 1, wei fan 2, chi wang 1, yizhou sun 1, jiawei han 1 university of illinois,...
Post on 19-Dec-2015
219 views
TRANSCRIPT
On Community Outliers and their Efficient Detection in Information
NetworksJing Gao1, Feng Liang1, Wei Fan2,
Chi Wang1, Yizhou Sun1, Jiawei Han1
University of Illinois, IBM TJ Watson
Debapriya Basu
2
OBJECTIVES
Determine outliers in information networks Compare various algorithms which does the
same
3
CHARACTERISTICS OF INFORMATION NETWORK
Eg Internet, Social Networking Sites Nodes – characterized by feature values Links - representative of relation between
nodes
4
Outliers – anomalies, novelties Different kinds of outliers
◦ Global◦ Contextual
OUTLIERS
5
OUTLIERS IN INFORMATION NETWORKS
V7
10
V9
V8
30
V10
40 70 100 110 140 160
V6 V1 V4 V5 V3 V2
Global Outlier
Salary (in $1000)
6
COMMUNITY OUTLIER Unified model considering both nodes and
links Community discovery and outlier detection
are related processes
7
SUMMARY OF THE APPROACH Treat each object as a multivariate data point Use K components to describe normal
community behavior and one component to denote outliers
Induce a hidden variable zi at each object indicating community
Treat network information as a graph Model the graph as a Hidden Markov Random
Field on zi
Find the local minimum of the posterior probability potential energy of the model.
8
UNIFIED PROBABILISTIC MODEL
community label Z
outlier
node feature
X
link structure W
high-income:mean: 116k
std: 35k
low-income:mean: 20k
std: 12k
model parameter
sK: number of communities
9
SYMBOLS IN THE MODEL
Symbol Definition
I = {1,2,3….i,..M} Indices of the objects
V = {v1,v2….vm} Set of objects
S = {s1,s2,….sm} Given attributes of objects
WM*M = {wij} Adjacency matrix containing the weights of the links
Z = {z1,…..,zm} RVs for hidden labels of objects
X = {x1,…..,xm} RVs for observed data
Ni (i ∈ I) Neighborhood of object vi
1,….,k,….K Indices of normal communities
Θ = {Θ1, Θ2,……, Θk} R.Vs for model parameters
10
◦ Set of R.Vs X are conditionally independent given their labelsP(X=S|Z) = ΠP(xi=si|zi)
◦ Kth normal community is characterized by a set of parametersP(xi=si|zi =k) = P(xi=si|Θk)
◦ Outliers are characterized by uniform distribution◦ P(xi=si|zi =0) = ρ0
◦ Markov random field is defined over hidden variable Z ◦ P(zi|zI-{i}) = P(zi|zNi)
◦ The equivalent Gibbs distribution is P(Z) = exp(-U(Z))*1/H1
H1 = normalizing constant, U(Z) = sum of clique potentials.
◦ Goal is to find the configuration of z that maximizes P(X=S|Z)P(Z) for a given Θ
THE MODEL
11
MODELING DATA
Continuous Data◦ Is modeled as Gaussian distribution◦ Model parameters: mean, standard deviation
Text Data◦ Is modeled as Multinomial distribution◦ Model parameters: probability of a word
appearing in a community
12
COMMUNITY OUTLIER DETECTION ALGORITHM
Given Θ, find Z that maximizes P(Z|X)
Given Z, find Θthat maximizes P(X|Z)
Initialize Z
INFERENCE
PARAMETER ESTIMATION
Θ : model parametersZ: community labels
13
PARAMETER ESTIMATION
Calculate model parameters◦ maximum likelihood estimation
Continuous◦ mean: sample mean of the community◦ standard deviation: square root of the sample
variance of the community Text
◦ probability of a word appearing in the community: empirical probability
14
INFERENCES
Calculate Zi values◦ Given Model parameters,◦ Iteratively update the community labels of nodes at
each timestep◦ Select the label that maximizes P(Z|X,ZN)
Calculate P(Z|X,ZN) values◦ Both the node features and community labels of
neighbors if Z indicates a normal community◦ If the probability of a node belonging to any community
is low enough, label it as an outlier
15
DISCUSSIONS
Setting Hyper parameters◦ a0 = threshold ◦ Λ = confidence in the network◦ K = number of communities
Initialization◦ Group outliers in clusters.◦ It will eventually get corrected.
16
SIMULATED EXPERIMENTS
Data Generation◦ Generate continuous data based on Gaussian
distributions and generate labels according to the model
◦ Define r: percentage of outliers, K: number of communities
Baseline models◦ GLODA: global outlier detection (based on node
features only)◦ DNODA: local outlier detection (check the feature
values of direct neighbors)◦ CNA: partition data into communities based on
links and then conduct outlier detection in each community
17
TRUE POSITIVE RATE(Top r% as outliers)
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
r=1% K=5 r=5% K=5 r=1% K=8 r=5% K=8
GLODA
DNODA
CNA
CODA
18
EXPERIMENTS ON DBLP Communities
◦ data mining, artificial intelligence, database, information analysis
Sub network of Conferences Links: percentage of common authors among two
conferences Node features: publication titles in the conference
Sub network of Authors Links: co-authorship relationship Node features: titles of publications by an author
19
RESULTS
Community outliers: CVPR CIKM
20
CONCLUSIONS
Community Outliers
Community Outlier Detection
QUESTIONS
21
REFERENCES
On Community Outliers and their Efficient Detection in Information Networks – Gao, Liang, Fan, Wang, Sun, Han
Outlier detection – Irad Ben-Gal Automated detection of outliers in real-world data –
Last, Kandel Outlier Detection for High Dimensional Data –
Aggarwal, Yu