Detecting Collective Anomalies from Multiple Spatio-Temporal Datasets
across Different Domains
Yu Zheng
Microsoft Research, Beijing, China
http://research.microsoft.com/en-us/people/yuzheng/
Released Data & Codes
Existing Anomaly Detection
• Detecting anomalies (outliers) is sometimes more useful than regular patterns
• Existing research focuses on detecting anomalies based on a single dataset
  • May leave some anomalies undetected, or detected very late
  • Or over-detects when using a sparse dataset (false alerts)
[Figure: two panels, "An undetected example" and "A false alert". Labels: bike renting, social media, and taxi flow over regions r1 to r6; reports of sickness in a neighborhood over time, <0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,…>, where (1−μ) ≫ 3σ]
Collective Anomalies
• ST-data in different domains
  • , ,…,
  • Noise complaints: <construction, loud music, traffic, …>
  • Air quality: <good, moderate, unhealthy, …>
  • Check in: <food, entertainment, shopping, arts, …>
  • Traffic conditions: <fast, normal, congestion>
  • Epidemic: <disease 1, disease 2, …, disease n>
  • ……
[Figure: A) Traffic Sensing B) People's Complaints C) Social Media]
• Detect collective anomalies based on multiple Spatio-Temporal (ST) datasets
[Figure: spatio-temporal entries in a 2D geo-space over time intervals t1 to t4, grouped into collective anomalies a1, a2, a3]
• Collective anomalies
  • Spatio-temporal collectiveness: a collection of nearby locations () during a few consecutive time intervals ()
  • Data collectiveness: anomalous when checking multiple datasets simultaneously
An Example
[Figure: A) Raw road network B) Segmented regions; hourly snapshots at 8am, 9am, 10am, 11am, 12pm, 1pm; label δd]
Eight regions are collectively anomalous in five consecutive hours in terms of three datasets: taxicab, bike-sharing, and 311 complaints.
Benefits
• Detect an underlying problem
• Denote an early stage of an epidemic disease or the beginning of a natural disaster
• Provide a panoramic view of an event
Challenges
• Data sparsity and uncertainty
  • Difficult to estimate their true distributions based on limited observations
  • Hard to measure the deviation of an instance from its original distribution
• Different scales and distributions
  • Difficult to aggregate them into an integrated (anomalous) measurement
• Many combinations of regions and time intervals
  • High computational cost
  • Conflicts with online detection
[Figure: sparse observation vectors <0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0,…> and <1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,…>, annotated "Distribution?" and "Aggregation?"; spatio-temporal entries in a 2D geo-space over t1 to t4 with anomalies a1, a2, a3]
Methodology
• Multiple Sources Latent Topic (MSLT) Model
  • Combines multiple datasets to better estimate the underlying distribution of a sparse dataset
  • Leading to more accurate anomaly detection
• Spatio-Temporal Log-likelihood Ratio Test (ST_LRT)
  • Adapts the Likelihood Ratio Test to a spatio-temporal setting
  • Aggregates the information of multiple datasets across multiple regions to detect anomalies
• Candidate generation algorithm
  • Generate candidates using computational geometry
  • Prune unnecessary combinations based on skylines
[Figure: graphical representation of the MSLT model and its topic-word distributions across datasets D1, D2, D3]
Λ = −2 log (likelihood for null model / likelihood for alternative model)
[Figure: candidate generation. A) Retrieve candidate regions for r1 B) Find intersection between two regions C) Find combination of three regions D) Output region sets (r1, r5; r1, r5, r6; r1, r6). A plot with axes Noise and Taxi, ticks 0 to 1.0, labeled ST_LRT]
Framework
[Figure: framework overview. Components shown: A) Raw road network B) Segmented regions; Learning Distributions; MSLT Model; Circle_Based_Spatial_Check (spatial constraint); LRT; Skyline Detection; "An entry" in the 2D geo-space over t1 to t4 with anomalies a1, a2, a3; a plot with axes Noise and Taxi]
MSLT Model
• Combine multiple datasets to discover latent functions of a region
  • To better estimate the distribution of a sparse dataset
  • Different datasets in a region can mutually reinforce
  • A dataset can reference across different regions
prop(w_i) = Σ_t θ_dt φ_tw_i
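The proportion formula can be illustrated with a small sketch. The function name `word_proportions` and the numbers in `theta_d` and `phi` are hypothetical, chosen only to show how a region's topic mixture θ_d and the topic-word distributions φ combine into prop(w_i); they are not values from the talk.

```python
def word_proportions(theta_d, phi):
    """Compute prop(w_i) = sum over topics t of theta_d[t] * phi[t][i]."""
    n_topics = len(theta_d)
    n_words = len(phi[0])
    return [
        sum(theta_d[t] * phi[t][i] for t in range(n_topics))
        for i in range(n_words)
    ]


# Hypothetical region with two latent topics and three words (categories).
theta_d = [0.7, 0.3]          # topic mixture of region d
phi = [
    [0.5, 0.3, 0.2],          # word distribution of topic 1
    [0.1, 0.1, 0.8],          # word distribution of topic 2
]

prop = word_proportions(theta_d, phi)
print(prop)
```

Because each row of `phi` and `theta_d` itself sum to one, the resulting proportions also sum to one, so they can be read as the expected share of each category in the region.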
[Figure: A) Graphic representation of MSLT B) Topic-words distribution across different datasets s1, s2, s3 with word sets W1, W2, W3]
[Figure: A) Settings of MSLT B) Settings of ST_LRT. Taxi and 311 data are split into time intervals (0:00 to 4:00, …, 20:00 to 24:00; time interval 1 to time interval 6), with counts <c1, c2,…, c10> and words <w1, w2,…, w10>, …, <w51, w52,…, w60>; tc = 4:00, tc−4 = 0:00, tc−2 = 2:00]
λ = ZIP(c_i | i = 1, 2,…, 10); λ' = ZIP(c_i | i = 1, 2,…, 10)
p_i = prop(w_i) / Σ_i prop(w_i), 1 ≤ i ≤ 10
λ_i = λ × p_i, giving <λ_1, λ_2,…, λ_10>; λ'_i = λ' × p_i, giving <λ'_1, λ'_2,…, λ'_10>
• A topic model-based method:
  • A region → a document
  • Latent functions → latent topics
  • 311, bikes, taxicabs → words (dynamic)
  • POIs and road networks → keywords (static)
MSLT Model
• Learning
  • , and are fixed parameters
  • Learn and based on observed and
  • Using a stochastic EM algorithm
• Structure
  • of a region depends on its geographical properties
  • There are multiple topic-word distributions
[Figure: graphical models of Latent Dirichlet Allocation (LDA) and MSLT side by side; A) Graphic representation of MSLT B) Topic-words distribution across different datasets]
ST_LRT
• Log-Likelihood Ratio Test (LRT)
• Apply LRT to a single (ST) dataset
  • in a single region
  • in multiple regions
• Apply LRT to multiple datasets
  • Distribution estimations for different datasets
  • Aggregate anomalous degree of multiple datasets
ST_LRT
• LRT
  • testing whether a simplifying assumption for a model is valid
  • can be approximated by a chi-square distribution
An example for a single region and a single dataset
Region r, 12:00-14:00: Gaussian(200, 1300), observed x_t = 70
1) p = 70/200 = 0.35; 200 × 0.35 = 70; 1300 × 0.35 = 455
2) The maximum likelihood for the alternative model (mean to 70)
3) = 0.999
[Figure: A) region r in 12:00-14:00 with Gaussian(200, 1300) and x_t = 70 B) PDF of χ², with χ²_cdf(Λ, 1) > 0.95 at Λ = 3.84 C) region r in 12:00-14:00, 14:00-16:00, 16:00-18:00 with Poisson(8), x1 = 14; Poisson(10), x2 = 14; Poisson(6), x3 = 8]
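The single-region example can be checked numerically with a short sketch. It rests on two assumptions that the slide does not state outright: the second parameter of Gaussian(200, 1300) is read as a variance, and the alternative model scales both the mean and the variance by p = 0.35 (giving 70 and 455). The χ² CDF with one degree of freedom is computed from the error function, which is a standard identity.

```python
import math


def gaussian_logpdf(x, mean, var):
    """Log density of a normal distribution parameterized by mean and variance."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mean) ** 2 / (2 * var)


def chi2_cdf_1dof(x):
    """CDF of the chi-square distribution with one degree of freedom."""
    return math.erf(math.sqrt(x / 2.0))


null_mean, null_var = 200.0, 1300.0   # Gaussian(200, 1300), variance assumed
x_t = 70.0                            # observed value in region r

p = x_t / null_mean                   # 0.35
alt_mean = null_mean * p              # 70
alt_var = null_var * p                # 455

# Lambda = -2 log(likelihood under null / likelihood under alternative)
lam = -2.0 * (
    gaussian_logpdf(x_t, null_mean, null_var)
    - gaussian_logpdf(x_t, alt_mean, alt_var)
)
od = chi2_cdf_1dof(lam)
print(round(lam, 2), od)
```

Under these assumptions Λ comes out at roughly 14.05 and the χ² CDF exceeds 0.999, which is in line with the 0.999 shown on the slide.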
ST_LRT
• Apply LRT to multiple regions (or time slots)
Region r in three time slots: 12:00-14:00 Poisson(8), x1 = 14; 14:00-16:00 Poisson(10), x2 = 14; 16:00-18:00 Poisson(6), x3 = 8
1) ;
2) Calculate : To maximize the likelihood of the alternative model (=1)
   8 × 1.5 = 12, 10 × 1.5 = 15, 6 × 1.5 = 9
3) Λ = 5.19
od = χ²_cdf(5.19, fd = 1) = 0.978
• A dataset varies in different regions (or time slots) consistently
• A dataset changes differently in different regions (or slots)
od(s) = √(Σ_i …)
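The three-slot Poisson example can be reproduced with the sketch below. It assumes that the factor 1.5 on the slide is the single scaling factor p = Σx_i / Σλ_i = 36/24, which is the maximum-likelihood choice when all expected rates are scaled by the same constant; the helper names are hypothetical.

```python
import math


def poisson_logpmf(k, rate):
    """Log probability of observing k events under a Poisson with the given rate."""
    return k * math.log(rate) - rate - math.lgamma(k + 1)


def chi2_cdf_1dof(x):
    """CDF of the chi-square distribution with one degree of freedom."""
    return math.erf(math.sqrt(x / 2.0))


expected = [8.0, 10.0, 6.0]   # Poisson(8), Poisson(10), Poisson(6)
observed = [14, 14, 8]        # x1, x2, x3

# One shared scaling factor for the alternative model (assumed reading of "1.5").
p = sum(observed) / sum(expected)
alt_rates = [rate * p for rate in expected]   # 12, 15, 9

log_null = sum(poisson_logpmf(k, r) for k, r in zip(observed, expected))
log_alt = sum(poisson_logpmf(k, r) for k, r in zip(observed, alt_rates))

lam = -2.0 * (log_null - log_alt)
od = chi2_cdf_1dof(lam)
print(round(lam, 2), round(od, 3))
```

This gives Λ ≈ 5.19, matching the slide, and an anomalous degree of about 0.977, close to the 0.978 reported there.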
ST_LRT
• Deal with multiple datasets
  • Dealing with a sparse dataset
    • The zero-inflated Poisson (ZIP) model
    • Using latent topic-word distribution: prop(w_i) = Σ_t θ_dt φ_tw_i
X = h, with probability (1 − p) e^(−λ) λ^h / h!
[Figure: a sparse dataset <0, 0, 0, 0, 0, 0, c1, 0, 0, 0, 0, 0, c2, 0, 0,…> is fitted with ZIP to obtain λ; its categories 1: <0, 0, 0, 0, 0, 0, c1, 0, …> and 2: <0, …, 0, c2, 0, 0,…> receive λ1 and λ2 through λ_i = λ × prop(w_i), which are passed to LRT]
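A minimal sketch of the two ingredients follows. The zero case of the ZIP distribution, P(X = 0) = p + (1 − p) e^(−λ), is the standard textbook definition and is filled in here as an assumption, since only the h-event case survives on the slide. `split_rate` distributes a dataset-level λ over categories in proportion to prop(w_i), normalized over the categories; both function names and the example numbers are hypothetical.

```python
import math


def zip_pmf(h, p, lam):
    """Zero-inflated Poisson: extra mass p at zero, Poisson(lam) otherwise."""
    poisson = math.exp(-lam) * lam ** h / math.factorial(h)
    if h == 0:
        return p + (1 - p) * poisson
    return (1 - p) * poisson


def split_rate(lam, prop):
    """Assign lambda_i = lam * prop_i / sum(prop) to each category i."""
    total = sum(prop)
    return [lam * w / total for w in prop]


# Hypothetical values: a sparse dataset with 80% structural zeros.
p_zero, lam = 0.8, 2.0
print(zip_pmf(0, p_zero, lam), zip_pmf(1, p_zero, lam))

# Hypothetical category proportions taken from a topic model.
rates = split_rate(lam, [0.1, 0.3])
print(rates)
```

Splitting λ this way keeps the per-category rates consistent with the overall rate, so each category can be tested with the LRT even when its own counts are almost all zero.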
ST_LRT
• Estimate distributions for different datasets
[Flowchart: dataset s → Sparse? → (Y) ZIP → λ → λ_i = λ × prop(w_i) → LRT; (N) variance(s) ≫ mean(s)? → Y: () / N: ()]
ST_LRT
• Aggregate anomalous degrees of multiple datasets
[Figure: the Circle-Based Spatial Check (A) Retrieve candidate regions for r1 B) Find intersection between two regions C) Find combination of three regions D) Output region sets) produces region sets such as {r6, r7} and sets of spatio-temporal entries such as {<r1, t1>, <r1, t2>, <r2, t1>, <r4, t2>}; each candidate gets a vector < ,…, >, and the skyline of these ods is kept (plot axes: Noise, Taxi)]
Pruning
• If the upper bound of a set of entries is dominated by existing skyline combinations, all combinations of its subsets will be dominated by the skyline too.
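The skyline idea can be sketched as follows, assuming each candidate is summarized by a vector of anomalous degrees (one per dataset) and that larger values are more anomalous. The candidate names and values are hypothetical, and the incremental skyline routine is a generic Pareto-front computation rather than the exact algorithm of the talk.

```python
def dominates(a, b):
    """True if vector a is at least as large as b everywhere and larger somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(candidates):
    """Keep only the candidates whose od vectors are not dominated by another."""
    result = {}
    for name, od in candidates.items():
        if any(dominates(kept, od) for kept in result.values()):
            continue  # an existing skyline member already beats this candidate
        # Drop members that the new candidate dominates, then add it.
        result = {n: v for n, v in result.items() if not dominates(od, v)}
        result[name] = od
    return result


def can_prune(upper_bound, current_skyline):
    """A set whose od upper bound is dominated cannot contribute any subset."""
    return any(dominates(v, upper_bound) for v in current_skyline.values())


# Hypothetical od vectors over two datasets (noise, taxi).
candidates = {
    "a": (0.5, 1.0),
    "b": (1.0, 0.5),
    "c": (0.5, 0.5),
    "d": (0.9, 0.9),
}
sky = skyline(candidates)
print(sorted(sky))
print(can_prune((0.8, 0.4), sky))
```

Checking the upper bound first means that a whole family of subsets can be discarded with a single comparison against the current skyline, which is what makes the pruning worthwhile when the number of region and time-interval combinations is large.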
Evaluation
• Datasets
Noise complaint categories (311): Construction; Commercial – Music/Party/Talking; Park – Music/Party/Talking; House – Music/Party/Talking/TV; Street – Music/Party/Talking; Dog; Air Condition/ventilation; Traffic; Manufacturing; Others

Data sources | Properties | Values
Taxicab data (1/1/2014-1/1/2015) | number of taxicabs | 14,144
 | number of trips | 165M
 | total duration (hour) | 36.5M
 | total distances (km) | 5,671M
Bike data (1/1/2014-1/1/2015) | number of stations | 344
 | number of bikes | 6,811
 | number of trips | 8,081,216
 | total duration (hour) | 1.9M
311 complaints (5/26/2013-12/13/2014) | number of categories | 10
 | number of instances | 197,922
Road network (2013) | number of nodes | 79,315
 | number of road segments (level5) | 32,210
 | number of road segments (level>5) | 83,655
 | number of regions | 862
POIs (2013) | number of categories | 14
 | number of instances | 24,031

Data Release: http://research.microsoft.com/pubs/255670/release_data.zip
Evaluation
• Evaluation on MSLT
  • Estimating the distribution for 311 data (sparse)
  • KL-Divergence between estimations and ground truth
  • Down-sampling ground truth
[Figure: two plots of KL-Divergence (y-axis) against 1/X (x-axis, 0 to 100), comparing MSLT and Count; inset: a distribution of 311 over categories c1 to c5 in regions r1 and r2]
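The evaluation metric can be sketched as below. The direction of the divergence (ground truth relative to the estimate), the smoothing constant `eps`, and the example counts are assumptions made for illustration; the talk only states that KL-Divergence between estimations and ground truth is used.

```python
import math


def normalize(counts):
    """Turn raw category counts into a probability distribution."""
    total = sum(counts)
    if total == 0:
        return [1.0 / len(counts)] * len(counts)
    return [c / total for c in counts]


def kl_divergence(truth, estimate, eps=1e-12):
    """KL(truth || estimate); eps guards against zero estimated probabilities."""
    return sum(
        t * math.log(t / max(e, eps))
        for t, e in zip(truth, estimate)
        if t > 0
    )


# Hypothetical 311 counts over five categories in one region.
ground_truth = normalize([40, 30, 20, 10, 0])
# A count-based estimate from a heavily down-sampled version of the same data.
count_estimate = normalize([1, 0, 1, 0, 0])

print(kl_divergence(ground_truth, ground_truth))
print(kl_divergence(ground_truth, count_estimate))
```

A count-based estimate assigns zero probability to categories that happen to be missing after down-sampling, which inflates the divergence; a model that borrows information from other datasets and regions is meant to avoid exactly that.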
ID | Event Name | Address | Start Time | End Time
1 | Bowlloween 2014 New York Halloween | 624-660 W 42nd St | 10/31/2014 9PM | 11/1/2014 2AM
2 | Largest Halloween Singles Party in NYC | 247 West 37th Street | 10/31/2014 7AM | 11/1/2014 3AM
3 | Kokun Cashmere Sample and Stock Sale | 237 W 37th Street | 11/5/2014 10:30AM | 11/7/2014 5:45PM
4 | Big Apple Film Festival | 54 Varick St | 11/5/2014 6PM | 11/9/2014 11PM
5 | InterHarmony Concert Series: The Soul of élégiaque | 881 7th Avenue | 11/6/2014 8PM | 11/6/2014 10PM
6 | Hiras Master Tailors New York Trunk Show | 301 Park Avenue | 11/6/2014 9AM | 11/9/2014 1PM
7 | in Collaboration with Carnegie Halls Neighborhood Concerts | 881 Seventh Avenue | 11/7/2014 6PM | 11/7/2014 10PM
8 | Thomas/Ortiz Dance Show | 248 West 60th Street | 11/7/2014 7PM | 11/8/2014 9PM
9 | Rebecca Taylor Sample Sale | 260 5th Ave | 11/11/2014 10AM | 11/15/2014 8PM
10 | The News NYC Sample Sale | 495 Broadway | 11/13/2014 9AM | 11/15/2014 6AM
11 | Giorgio Armani Sample Sale | 317 W 33rd St | 11/15/2014 9:30AM | 11/19/2014 6:30PM
12 | Get Buzzed 4 Good Charity Event NYC | 200 5th Ave | 11/15/2014 1PM | 11/15/2014 4PM
13 | Ment’or Young Chef Competition | 462 Broadway | 11/15/2014 2PM | 11/15/2014 6PM
14 | Gotham Comedy Club | 208 West 23rd Street | 11/17/2014 6PM | 11/17/2014 9PM
15 | Kal Rieman NYC Sample Sale | 265 West 37th Street | 11/18/2014 11AM | 11/20/2014 8PM
16 | Inhabit Cashmere Sample Sale | 250 West 39th St | 11/18/2014 10AM | 11/20/2014 6PM
17 | Shoshanna NYC Sample Sale | 231 W. 39th St | 11/19/2014 10AM | 11/20/2014 6:30PM
18 | ICB / J. Press NYC Sample Sale | 530 Seventh Avenue | 11/19/2014 12AM | 11/21/2014 12AM
19 | Thanksgiving in New York City 2014 | 1675 Broadway | 11/27/2014 6AM | 11/27/2014 10PM
20 | Thanksgiving Day Dinner at Croton Reservoir Tavern | 108 West 40th St | 11/27/2014 12PM | 11/27/2014 9PM
Baselines
DB: distance-based methods
Properties: Taxi Inflow, Taxi Outflow, Bike Inflow, Bike Outflow
Single Dataset
• DB-S-Taxi-S: one property
• DB-S-Bike-S: one property
• DB-S-Taxi-B: both properties
• DB-S-Bike-B: both properties
Multi-Datasets
• DB-M-One: one of the properties satisfying the 3-time deviation
• DB-M-ALL: all the properties need to satisfy the 3-time deviation

Results
Nov. 1, 2014 to Nov. 30, 2014; events were reported by nycinsiderguide.com

Methods | Detected Anomalies/day | Hit Event IDs
DB-S-Taxi-S | 336.3 | 1, 9, 19, 20
DB-S-Bike-B | 25.7 | 9, 19, 20
DB-S-Taxi-S | 18.1 | 4, 19
DB-S-Bike-B | 1.83 | None
DB-M-One | 353.2 | 1, 4, 9, 19, 20
DB-M-ALL | 0.12 | None
ST_LRT | 28.5 | 1, 3, 9, 10, 11, 13, 15, 16, 20
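The multi-dataset baselines can be approximated with a short sketch. It assumes that "3-time deviation" means a value lying more than three standard deviations from its historical mean; the function names, property keys, and numbers are hypothetical and are not taken from the released code.

```python
from statistics import mean, pstdev


def deviates(history, value, k=3.0):
    """True if value is more than k standard deviations from the historical mean."""
    mu = mean(history)
    sigma = pstdev(history)
    return abs(value - mu) > k * sigma


def db_m_one(histories, current):
    """Flag the region if any single property deviates (DB-M-One style)."""
    return any(deviates(histories[name], current[name]) for name in histories)


def db_m_all(histories, current):
    """Flag the region only if every property deviates (DB-M-ALL style)."""
    return all(deviates(histories[name], current[name]) for name in histories)


# Hypothetical hourly history for two properties in one region.
histories = {
    "taxi_inflow": [100, 102, 98, 101, 99],
    "bike_inflow": [20, 22, 19, 21, 18],
}
current = {"taxi_inflow": 150, "bike_inflow": 20}

print(db_m_one(histories, current), db_m_all(histories, current))
```

The contrast between the two rules mirrors the results table: requiring only one property to deviate flags a very large number of regions, while requiring all of them flags almost none.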
[Figure: A) The News NYC Sample Sale; B) to E) taxi inflow, taxi outflow, bike inflow, bike outflow; F) to I) taxi inflow, taxi outflow, bike inflow, bike outflow]
od = <0.571, 0.912, 0.256>
(:18-20, : 20-22)

Data sources | Properties | | | | | (s)
Taxicab Data | In flow | 0.274 | 0.593 | 0.822 | 0.932 | 0.571
 | Out flow | 0.383 | 0.282 | 0.612 | 0.202 |
 | Total | 0.404 | 0.700 | | |
Bike Data | In flow | 0.796 | 0.901 | 0.932 | 0.901 | 0.912
 | Out flow | 0.872 | 0.953 | 0.983 | 0.987 |
 | Total | 0.882 | 0.940 | | |
311 Data | Complaints | \ | \ | \ | \ | 0.256

• Beyond distance-based methods
• Beyond a single dataset
• Beyond a single region
Conclusion
• Detect collective anomalies based on multiple datasets
• Methodology
  • MSLT
  • ST_LRT
  • Candidate generation and pruning
• Evaluated based on five datasets in NYC
• Detect all anomalies in NYC in 3 minutes
Homepage
Released Data & Codes
Thanks!
Collective Anomalies
• Formal Definition
  • Given
    • regions, …}
    • multiple datasets , …} during the recent time intervals and
    • that over a period of historical time
  • Formulate a spatio-temporal set ,,…,…,,…, .
    • is associated with a vector denoting the number of instances in each category of each dataset in region at time interval .
  • Detect , each is a collection of spatio-temporal entries from
    • , ,
    • ,
    • _)true
[Figure: spatio-temporal entries of regions r1 to r6 over time intervals t1 to t4 in a 2D geo-space; each entry carries category counts per dataset, s1: <c1, c2>, s2: <c'1, c'2>; collective anomalies a1, a2, a3]