TRANSCRIPT
Daning Hu, IFI, UZH
Network-based Business Intelligence
Network Modeling & Analysis; Design BI Applications
• Economic Networks (Financial Markets, Banking Systems, Supply Chains, …)
– Influence: Risk Contagion (e.g., bank run)
– Monitoring, Analyzing, and Simulating Contagious Risk
• Social Networks (Online Communities, Social Networking Websites, …)
– Influence: Social Contagion (e.g., Word-of-Mouth)
– Identification, Prediction, and Recommendation of Influential Individuals
Network (Relational)
Data Analytics
Prof. Dr. Daning Hu
Department of Informatics
University of Zurich
Nov 12th, 2014
Outline
Introduction: The Big Data Era
Data Mining
Relational/Network Data Analytics
Mining Longitudinal Network Data
Dynamic Network Analysis
Introduction: The Big Data Era
90% of the data in the world today has been created in the last
two years alone (IBM). Big data comes from everywhere:
sensors used to gather climate information,
posts to social media,
digital pictures and videos,
purchase records, etc.
In response, everyone has begun embracing a loosely defined
term for today's massive data sets and the challenges they
present: Big Data.
Lack of efficient and effective analytical methods for big data
Big brother issues
A Brief History of Big Data
1887-90: Herman Hollerith's census data (electric hole punching)
1935-37: FDR's Social Security Act
26 million working Americans and 3 million employers
IBM, field investigators
1943 – 1960s: WWII and Cold War
"Colossus" Project: Deciphering Nazi Codes
742M U.S. tax returns and 175M fingerprints -> Privacy Act
1990s – 2000s: Internet Age and 9/11
NSA: 1.7 billion emails and phone calls daily
Retailers amassing information on shopping habits
Wal-Mart: 460 TB cache in 2004
Social Networks and Social Media proliferated
U.S. Open Government Initiative: data.gov
2014: ? Network Data Analytics
Data Mining: Predecessor of Big Data (Analytics)
Data mining (the analysis step of the "Knowledge Discovery in
Databases" process, or KDD) is the process that attempts to
discover patterns in large data sets.
a field at the intersection of computer science and statistics
AI, machine learning, statistics, and database systems
The goal of DM is to extract information from a large data set
and transform it into an understandable form for further use
Data -> Information -> Knowledge
Involving data analysis, preprocessing & management,
model and inference considerations, complexity
considerations, post-processing of discovered structures,
visualization, and online updating (real-time).
Data Mining
The Knowledge Discovery in Databases (KDD) process is commonly
defined with the stages:
Collection and Selection
Pre-processing
Transformation
Data Mining (Analysis)
Interpretation/Evaluation
Data Mining Tasks
Major Data Mining Tasks:
Association rule mining (Dependency modeling) – Searches for
relationships between variables.
Clustering – discovering groups and structures in the data that are in
some way or another "similar", without using known structures.
Classification – generalizing known structure to apply to new data.
E.g., software classifying an e-mail as "spam". (Training dataset)
Regression – Attempts to find a function which models the data with
the least error.
Summarization – providing a more compact representation of the
data set, including visualization and report generation.
Relational/Network Data Analytics (Mining)
Network Data Analytics differs from regular DM in several
ways
Network-based Representation – Often involves large-scale
relational data and can be modeled with network
measures/metrics.
Network-based Models and Algorithms (PageRank and HITS)
The tasks are often similar to those of DM (classification, regression,
etc.), but the application goal often requires analytical insights about
the relations among entities in the data set.
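As an illustration of a network-based algorithm, the sketch below runs PageRank by power iteration on a toy directed graph. The graph, the damping factor of 0.85, and the function name are assumptions made for this example; they do not come from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on a dict {node: [nodes it links to]}."""
    nodes = set(links) | {v for targets in links.values() for v in targets}
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every node receives the "teleport" share first.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: A links to B and C, B links to C, C links to A.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(links)
```

Node C is linked from both A and B, so it ends up with a higher score than B, which is linked only from A.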
Case Study: Applications of Mutual
Information for U.S. Border Safety
Border-crossing records can be considered as a stream of text (license plates) ordered by the time of crossing. MI can be used to identify frequent co-occurrence between a
pair of vehicle crossings.
If one vehicle in the pair has a criminal record, some inferences may be made about the second vehicle if they cross together frequently.
We use conditional probability to include domain heuristics in the MI formulation.
The heuristics are derived from information recorded in multiple law-enforcement databases.
Case Study I: Association Rule Mining in COPLINK
The COPLINK dataset contains data from multiple law enforcement agencies from 1990 to 2006:
3 million incident reports
Personal and sociological information (age, ethnicity, etc.)
Time information: when two individuals co-offend
TPD, PCSD, CBP (six ports between AZ and Mexico)
An Integrated Criminal Dataset
1.44 million criminals
662,000 vehicles
                   TPD            PCSD           CBP
Number of People   662,527        640,733        17.6 M records (2.6 M vehicles)
Time Span          1990 - 2006    1990 - 2006    2004 - 2006
Table 1. Summary of the COPLINK vehicle dataset
Association Rule Mining/Learning
Inferring associations between items in the database was motivated by decision support problems faced by retail organizations (Stonebraker 1993).
An association rule (AR) is a relationship of the form A ⇒ B, where A is the antecedent item-set and B is the consequent item-set.
The antecedent and consequent item-sets can contain multiple items.
A ⇒ B holds in a transaction set D with
confidence ‘c’ if c% of transactions in D that contain A also contain B,
support ‘s’ if s% of transactions in D contain both A and B.
Association rule mining identifies all the rules that have support and confidence greater than user-specified thresholds.
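The support and confidence definitions above can be sketched in a few lines. The transactions and the function name below are hypothetical and serve only to illustrate the two measures (returned here as fractions rather than percentages).

```python
def support_confidence(transactions, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent => consequent."""
    a, b = set(antecedent), set(consequent)
    n_total = len(transactions)
    # Transactions that contain every item of A.
    n_a = sum(1 for t in transactions if a <= set(t))
    # Transactions that contain every item of both A and B.
    n_ab = sum(1 for t in transactions if (a | b) <= set(t))
    support = n_ab / n_total if n_total else 0.0
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence


# Hypothetical toy transaction set D.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
s, c = support_confidence(transactions, {"bread"}, {"milk"})
```

Here two of the four transactions contain both items (support 0.5), and two of the three transactions that contain bread also contain milk (confidence 2/3). A rule miner would keep the rule only if both values exceed the user-specified thresholds.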
Mutual Information
Mutual information is an information-theoretic measure that can
be used to identify interesting co-occurrences of objects.
It can be considered a subset of AR mining with 1-item antecedent and consequent item-sets.
The earliest definitions of MI were given by Shannon et al. (1949) and Fano (1961) as the amount of information provided by the occurrence of an event (y) about the occurrence of another event (x): $I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$
Intuitively, this concept measures whether the co-occurrence of x and y, P(x, y), is more likely than their separate occurrences, P(x)·P(y).
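A minimal sketch of this measure for pairs of license plates follows. It assumes the crossing stream has already been cut into time windows, each holding the set of plates seen in that window; the windowing, the plate strings, and the function name are illustrative assumptions. This is the plain MI ratio, not the MIW or MIC variants used in the case study.

```python
import math


def mutual_information(windows, x, y):
    """log2( P(x, y) / (P(x) * P(y)) ) estimated from co-occurrence windows."""
    n = len(windows)
    if n == 0:
        raise ValueError("need at least one window")
    p_x = sum(1 for w in windows if x in w) / n
    p_y = sum(1 for w in windows if y in w) / n
    p_xy = sum(1 for w in windows if x in w and y in w) / n
    if p_xy == 0:
        # The pair never co-occurs.
        return float("-inf")
    return math.log2(p_xy / (p_x * p_y))


# Hypothetical windows of plates that crossed close together in time.
windows = [
    {"AZ-111", "AZ-222"},
    {"AZ-111", "AZ-222"},
    {"AZ-333"},
    {"AZ-444"},
]
mi = mutual_information(windows, "AZ-111", "AZ-222")
```

Both plates appear in half of the windows and always together, so P(x, y) = 0.5 is twice P(x)·P(y) = 0.25 and the score is log2(2) = 1. A score near 0 would indicate that the two vehicles cross independently.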
Research Design (cont.)
[Figure: research design flowchart. Elements shown: border crossing data (six ports); law enforcement data (TPD, PCSD); border wait times (web-spider, Internet Archive); splitting into training data (2/3) and testing data (1/3); heuristic calculation; criminal vehicles with crossings; potential target vehicles; MIW/MIC scores; overlap subset; narcotics vehicles (Set A, Set B); evaluation.]
Research design and process explained in the following slides
Estimating Border Wait Times
[Figure: an aerial photograph of a typical U.S. port of entry (southern border), with the vehicle lanes, turn-out points, and port of entry (check points) labeled. © 2006 Google – Imagery © 2006 DigitalGlobe, Map data © 2006 NAVTEQ™]
Vehicle lanes are backed up with dozens of vehicles during peak times.
Criminal vehicles operate in groups.
If one is caught, the others turn back into Mexico.
They may join the lines one at a time or use turn-out points.
Thus, the time interval between two related vehicles is likely to be greater than or equal to
the waiting time if the second vehicle does not join the line until the first vehicle
goes through.
This needs to be taken into consideration in the calculation of MI.
Estimating Border Wait Times
CBP publishes hourly wait times on its website (BWT).
The information is posted only for the current day
No publicly available archive is maintained
A web-spider was used to systematically download the web-page for every
hour over several days in April 2006
However, the average waiting times thus obtained cannot be generalized
to the entire year
The Internet Archive (IA) contained snapshots of the BWT web page from April 10, 2004 to March 31, 2005.
These snapshots were used to obtain waiting time statistics for various days over many months in 2004 and 2005.
The statistics from the spidering process and IA were then used to calculate average waiting times for each port on an hourly basis and used in MIW.
Temporal Patterns of Border Crossings
The figure suggests that a large number (≈50%) of crossings with police contacts
happen after dark.
MIW uses this information to assign more weight to time periods with more criminal
crossings.
• Figure (a) shows the percentage of all crossings over six time periods of the day.
– 23% of all crossings take place between 8pm-Midnight.
• Figure (b) shows the percentage of all crossings by vehicles with police contacts over the six time periods.
– 27% of crossings by vehicles with police contacts happen between 8pm-Midnight.
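[Figure: pie charts of crossings by time of day, each grouped into night and day periods: (a) all crossings, (b) crossings by vehicles with police contacts.]
Time period      (a) All crossings    (b) Vehicles with police contacts
Midnight-5am     12%                  15%
5am-10am         10%                  10%
10am-2pm         20%                  14%
2pm-4pm          13%                  10%
4pm-8pm          22%                  24%
8pm-Midnight     23%                  27%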
Comparative Evaluation (cont.)
For hypothesis testing, thirty data points (ranging from top 5 to 3500 pairs) were taken for each of the measures and a t-test was done for the differences in the mean number of potentially criminal vehicles identified.
It was found that MIW performed significantly better (at the 99% level) than MIC in all but one dataset in identifying potentially criminal vehicles. The hypothesis on MIW performing better was partially supported.
Dataset                MIW - MIC
TPD dataset            0.2194
PCSD dataset           0.0001*
Tucson met. dataset    0.0009*
Case 1: Vehicle Pair Identified by MIW
This figure shows the crossing patterns of a pair of vehicles with a high MIW score.
• Vehicle C from Arizona and its occupant were arrested in Tucson for the sale of narcotics.
• Vehicle C crossed 7 times in a one-month period and crossed within a few minutes of Vehicle D.
• The crossings may be considered suspicious since they are almost always after dark and do not fit a standard work schedule.
[Figure: time of day (y-axis, ticks 0 to 2000) of the crossings by Vehicle C and Vehicle D on Jan 15, Jan 25, Jan 26, Jan 29, Feb 6, Feb 7, and Feb 14. Annotation: after dark / no fixed work schedule.]
Criminal Activity of Vehicle C & D
[Figure: Vehicles C and D, linked by MIW as frequent crossers at night in the Customs and Border Protection data, shown alongside a narcotics network and a criminal network in the Tucson metropolitan area. Inset: crossing times of Vehicles A, B, C, and D from Jan 15 to Feb 7.]
Vehicle C was found to have strong connections to a narcotics network in the
Tucson metropolitan area. It had links to other people and vehicles that had been
arrested / suspected for narcotics sales and possession in the region.
Vehicle D was also involved in criminal activity in the Tucson region.
MIW identified many other such strong cases.
A Suspect Vehicle Triple Identified
MIW scores were calculated between Vehicle F and other crossing vehicles, and a promising transitive association with Vehicle G was found. Vehicle G had crossed 3 times within minutes of Vehicle F over a 12-day period.
[Figure: time of day (y-axis, ticks 0 to 2000) of the crossings by Vehicles E, F, and G on Sep 6, Sep 11, Sep 17, Sep 18, Sep 25, Oct 4, and Oct 5 (2005). Annotation: after dark.]
• This figure shows the crossing patterns of a vehicle triple that was identified by the transitive use of MIW with support constraints.
• Vehicle F crossed 7 times in a one-month period, of which it crossed 5 times within a few minutes of Vehicle E.
• It was also found that Vehicle E had recently been involved in multiple narcotics crimes in the Tucson region.
Crime Involvement of Vehicles E and G
Vehicle E was involved in narcotics crimes and Vehicle G was found to be involved in suspicious activity and forgery.
Since the procedure used MIW, it indicates that the vehicles may have been simultaneously waiting in line at the same port of entry.
This example clearly shows that the transitive use of MIW shows promise in identifying potentially criminal vehicles.
[Figure: Vehicles E, F, and G linked by MIW in the Customs and Border Protection data, with Vehicle E tied to narcotics crimes and Vehicle G to crimes in the Tucson metropolitan area. Insets: crossing times of the vehicles.]
Mining Longitudinal Network Data:
Dynamic Network Analysis (DNA)
What / Why / How
• Model the changes in network evolution
– Temporal changes in network topological measures
– Dynamic network recovery on longitudinal data
• Study the dynamic link formation processes behind network evolution (nodes forming links -> network evolution)
– Statistical analysis of determinants behind link formation: homophily, preferential attachment, shared affiliations
• Simulate the evolution of networks
– Agent-based Modeling and Simulation
– Examine network robustness
Case Study II: A Global Terrorist Network
The Global Salafi Jihad (GSJ) network data was compiled by Dr. Marc Sageman, a former
CIA operations officer, and covers 366 terrorists:
Relations: friendship, kinship, same religious leader, operational interactions, etc.
Attributes: geographical origins, socio-economic status, education, etc.
Time information: when they joined and left GSJ
The goal of dynamic analysis
gain insights about the evolution of GSJ network
develop effective attack strategies to break down GSJ network
Sample data of GSJ terrorists
Stretching
Or Not?
Leonardo da Vinci
Vitruvian Man, 1487
Temporal Changes in Network-level Measures
[Figure: (a) average degree <k> of the network by year, 1989 to 2003; (b) degree distributions (probability of degree) for 1990, 1991, and 1993, compared with a Poisson distribution; (c) degree distributions for 1995, 1997, and 1999.]
Fig. 1. The temporal changes in (a) the average degree and (b), (c) the degree distribution
Degree = number of links a node has
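The two network-level measures in Fig. 1 can be computed from an edge list as sketched below. The edge list is a hypothetical toy snapshot, not GSJ data, and nodes without any link are ignored because they never appear in an edge list. The Poisson probability is included only as the reference curve a random network would follow.

```python
import math
from collections import Counter


def degree_stats(edges):
    """Return (average degree, degree distribution) of an undirected edge list."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    avg_k = sum(degree.values()) / n
    counts = Counter(degree.values())
    # Fraction of nodes that have each degree value.
    dist = {k: c / n for k, c in sorted(counts.items())}
    return avg_k, dist


def poisson_pmf(k, mean):
    """Probability of degree k in a random network with the given mean degree."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


# Hypothetical snapshot for one year.
edges = [("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
avg_k, dist = degree_stats(edges)
reference = {k: poisson_pmf(k, avg_k) for k in dist}
```

Repeating the calculation for one snapshot per year gives the time series of <k>, and comparing `dist` with `reference` shows how far a snapshot departs from a random network.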
Findings
There are three stages for the evolution of the GSJ network:
1989 - 1993 The emerging stage:
The network grows in size
Accelerated Growth - No. of edges increases faster than nodes
Random network topology (Poisson degree distribution)
1994 - 2000 The mature stage:
The size of the network reached its peak in 2000
Scale-free topology (Power-law degree distribution)
2001 - 2003 The disintegration stage:
Falling into small disconnected components after 9/11
Temporal Changes in Node Centrality Measures
[Figure: degree and betweenness plotted by year, 1989 to 2002.]
Fig. 2. Temporal changes in the degree and betweenness centrality of Osama Bin Laden
Degree: No. of links a node has
Betweenness of a node i: No. of shortest paths from all nodes to all
others that pass through node i
It measures i’s influence on the traffic
(information, resources) flowing through it
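A minimal sketch of both centrality measures on a toy undirected graph follows. Betweenness is computed with Brandes' algorithm, which is an assumption of this example rather than the method named in the slides. When several shortest paths connect a pair of nodes, each path contributes a fraction of a count. The adjacency lists are hypothetical.

```python
from collections import deque


def betweenness(adj):
    """Brandes' algorithm for an unweighted, undirected graph {node: [neighbors]}."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}   # predecessors on shortest paths from s
        sigma = {v: 0 for v in adj}    # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:                   # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while stack:                   # accumulate dependencies, farthest first
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was visited from both endpoints.
    return {v: b / 2 for v, b in bc.items()}


# Hypothetical path graph a - b - c - d.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
degree = {v: len(neighbors) for v, neighbors in adj.items()}
bc = betweenness(adj)
```

On this path, b and c each lie on the shortest paths of two node pairs, while the end nodes lie on none. Computing both measures on each yearly snapshot gives a time series like the one in Fig. 2.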
Findings and Possible Explanations
1994 – 1996: A sharp decrease in Bin Laden’s betweenness
1994: Saudi Arabia revoked his citizenship and expelled him
1995: Went to Sudan and was expelled again under U.S. pressure
1996: Went to Afghanistan and established camps there
1998 – 1999: Another sharp decrease in his betweenness
After the 1998 bombings of U.S. embassies, Bill Clinton ordered a freeze on
assets linked to bin Laden (top 10 most wanted)
August 1998: A failed U.S. assassination attempt on him
1999: UN imposed sanctions against Afghanistan to force the Taliban to
extradite him
Dynamic Network Recovery
The GSJ network: small, low-frequency (yearly), ad hoc!
How about large, high-frequency longitudinal network data?
Email communication network
[Figure: a timeline of link events and network snapshots at t, t+1, t+2, ...; the links within a window such as [t, t+2] form an instantaneous network.]
The relevancy horizon: which past links are relevant to the current state of the network?
How frequently to sample?
Relevancy Horizon and Sampling Period
Recovering a set of instantaneous social networks from longitudinal
network data by setting a sliding window filter
The relevancy horizon: the maximum length of time that a past event (link)
has an impact on the current network.
The sampling period: determines which events are considered to be
simultaneous and independent of each other
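The sliding-window filter can be sketched as follows. The event tuples, the half-open window convention (t - horizon, t], and the function name are assumptions made for illustration.

```python
def instantaneous_networks(events, start, end, sampling_period, relevancy_horizon):
    """Recover one link set per sampling time from timestamped link events.

    events: iterable of (time, u, v) tuples.
    A link belongs to the snapshot at time t if its event time lies in
    (t - relevancy_horizon, t].
    """
    snapshots = []
    t = start
    while t <= end:
        links = {
            frozenset((u, v))
            for (when, u, v) in events
            if t - relevancy_horizon < when <= t
        }
        snapshots.append((t, links))
        t += sampling_period
    return snapshots


# Hypothetical link events (e.g., emails or co-offenses) at times 1, 3, and 8.
events = [(1, "a", "b"), (3, "b", "c"), (8, "a", "c")]
snaps = instantaneous_networks(
    events, start=2, end=10, sampling_period=4, relevancy_horizon=6
)
```

With these values the snapshots at t = 2, 6, and 10 hold one, two, and one links: the link a-b is still relevant at t = 6 but has aged out by t = 10.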
Case Study III: A Narcotic Criminal Network
The COPLINK dataset contains 3 million police incident reports from the Tucson Police Department (1990 to 2006).
3 million incident reports and 1.44 million individuals
Their personal and sociological information (age, ethnicity, etc.)
Time information: when two individuals co-offend
AZ Inmate affiliation data: when and where an inmate was housed
A Narcotic Criminal Network
19,608 individuals involved in organized narcotic crimes
29,704 co-offending pairs (links)
                   COPLINK Narcotic Data    Arizona Inmate Data    Overlapped (identified by first name, last name and DOB)
Number of People   36,548                   165,540                19,608
Time Span          1990 - 2006              1985 - 2006            17 years
Table 1. Summary of the COPLINK dataset and the Arizona inmate dataset
Determine Sampling Period and Relevancy Horizon
The sampling period $\delta$ can be calculated based on the
Nyquist–Shannon sampling theorem: $\delta = 1/(2 f_{\max})$, where $f_{\max}$ is the
maximum frequency of link formation (i.e., co-offending in a crime).
The relevancy horizon is determined by the
within-pair response time $t_{ij}$: the time gap between two subsequent
co-offendings by i and j.
Cumulative Distribution of within-pair response time
[Figure: cumulative distribution p(t_ij) of the within-pair response time t_ij (days).]
90% of the response time gaps are
within $t_{0.90}$ = 210 days.
Set 210 days as the relevancy
horizon for our empirical analysis.
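Both parameters can be derived from the data as sketched below. The response-time gaps are made-up numbers chosen so that the 90% point lands on 210 days, and the nearest-rank percentile rule is one of several reasonable conventions, assumed here for simplicity.

```python
import math


def sampling_period(f_max):
    """Sampling period 1 / (2 * f_max), from the Nyquist-Shannon theorem."""
    return 1.0 / (2.0 * f_max)


def relevancy_horizon(response_times, coverage=0.90):
    """Smallest observed gap t such that at least `coverage` of gaps are <= t."""
    ordered = sorted(response_times)
    index = math.ceil(coverage * len(ordered)) - 1
    return ordered[max(index, 0)]


# Hypothetical within-pair response times in days.
gaps = [5, 12, 30, 45, 60, 90, 120, 150, 210, 400]
horizon = relevancy_horizon(gaps)

# Hypothetical maximum link-formation frequency of 0.5 events per day.
period = sampling_period(0.5)
```

Nine of the ten gaps are at most 210 days, so the 90% horizon is 210 days; a maximum frequency of 0.5 per day gives a sampling period of one day.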
Proportional hazards model (Cox Regression Analysis)
Homophily in age (group) and race
Shared affiliations:
Mutual acquaintances (through crimes)
Vehicle affiliation (same vehicle used by two in different crimes)
$h(t, x_1, x_2, x_3, \ldots) = h_0(t) \exp(b_1 x_1 + b_2 x_2 + b_3 x_3 + \ldots)$
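To make the formula concrete, the sketch below evaluates the exponential part of the model for a pair of individuals. The coefficients are hypothetical, not the fitted values from the study, and the covariates are assumed to be 0/1 indicators. Fitting the coefficients from data is a separate step that is not shown.

```python
import math


def hazard_ratio(coefficient):
    """Hazard ratio exp(b) for a one-unit increase in a single covariate."""
    return math.exp(coefficient)


def relative_hazard(coefficients, covariates):
    """exp(b1*x1 + b2*x2 + ...), i.e. h(t, x) divided by the baseline h0(t)."""
    return math.exp(sum(b * x for b, x in zip(coefficients, covariates)))


# Hypothetical coefficients for two binary covariates, e.g. "same race"
# and "shared vehicle"; chosen so the hazard ratios are exactly 2 and 3.
b = [math.log(2.0), math.log(3.0)]

# A pair for which both indicators are 1.
rel = relative_hazard(b, [1, 1])
```

For this pair the hazard of forming a link is 2 × 3 = 6 times the baseline. A pair with all covariates at 0 sits exactly at the baseline hazard.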
Statistical Analysis of Determinants for Link Formation
[Figure: hazard ratios (x-axis, 0 to 40) for the covariates mutualacq, gender, age, vehicle, and race.]
Fig. 3. Results of multivariate survival (Cox regression) analysis of triadic closure (link formation).
BI Application: Co-offending Prediction in COPLINK
IBM’s COPLINK is an intelligent police information system that aims to help speed up the crime detection process.
COPLINK calculates the co-offending likelihood score based on the proportional hazards model.
The output is a ranked list of individuals based on their predicted likelihood of co-offending with the suspect under investigation.
Fig. 4. Screenshots of the COPLINK system
Simulate Attacks on Dark Networks
Three attack (i.e. node removals) strategies:
Attack on hubs (highest degrees)
Attack on bridges (highest betweenness)
Real-world Attack (Attack order based on real-world data)
Simulate two types of attacks to examine the robustness of the
Dark networks
Simultaneous attacks (the degree/betweenness of nodes are NOT
updated after each removal) – Static
Progressive attacks (the degree/betweenness of nodes are updated after
each removal) – Dynamic
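The difference between the two attack types can be sketched for hub attacks as follows. Robustness is measured here as the largest connected component's share of the original node count. The toy graph and function names are assumptions for illustration; a bridge attack would follow the same pattern with betweenness in place of degree.

```python
def largest_component_fraction(adj, removed, n_original):
    """Size of the largest connected component after removals, over n_original."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:                      # depth-first search of one component
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best / n_original


def hub_attack(adj, n_remove, progressive):
    """Remove the n_remove highest-degree nodes and return the remaining fraction."""
    removed = set()

    def current_degree(v):
        return sum(1 for w in adj[v] if w not in removed)

    if progressive:
        # Dynamic: recompute degrees after every removal.
        for _ in range(n_remove):
            candidates = [v for v in adj if v not in removed]
            removed.add(max(candidates, key=current_degree))
    else:
        # Static: rank once by degree in the intact network.
        ranked = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
        removed.update(ranked[:n_remove])
    return largest_component_fraction(adj, removed, len(adj))


# Hypothetical network: hub u with three leaves, hub v with two leaves, edge u-v.
adj = {
    "u": ["u1", "u2", "u3", "v"],
    "v": ["v1", "v2", "u"],
    "u1": ["u"], "u2": ["u"], "u3": ["u"],
    "v1": ["v"], "v2": ["v"],
}
s_after_one = hub_attack(adj, 1, progressive=False)
s_after_two = hub_attack(adj, 2, progressive=True)
```

On this small graph both attack types pick the same nodes. The two rankings diverge only when removing a node changes the order of the remaining ones, which is more likely in larger networks.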
Simultaneous Vs. Progressive Attacks
* S: the relative size of the largest cluster that remains connected
Both Dark networks are more vulnerable to progressive
attacks than to simultaneous attacks.
Dynamic updates are more effective.
Hub Vs. Bridge Attacks
Both hub and bridge attacks are far more effective than real-world
arrests – Policy implications?
Both Dark networks are more vulnerable to Bridge attacks than Hub
attacks.
Bridge (highest betweenness): Field lieutenants, operational leaders, etc.
Hub (highest degree) : e.g., Bin Laden
[Figure: GSJ network. S and <s> (y-axis) versus the fraction of nodes removed (x-axis), for hub attacks and bridge attacks.]
Summary and Findings
Dynamic Network Analysis (DNA) methods are effective in
Linking network topological changes to analytical insights
Systematically capturing the link formation processes
Examining the determinants of link formation
Dark networks are
robust against real-world attacks
but vulnerable to targeted bridge attacks
DNA provides real-time decision support for fighting crimes based on
relational/network data mining.