
Daning Hu, IFI, UZH

Network-based Business Intelligence

Network Modeling & Analysis

Design BI Applications

Economic Networks: Financial Markets, Banking Systems, Supply Chains, …

Social Networks: Online Communities, Social Networking Websites, …

Influence

Risk Contagion (e.g., Bank run)

Social Contagion (e.g., Word-of-Mouth)

Monitoring, Analyzing, and Simulating Contagious Risk

Identification, Prediction, and Recommendation of Influential Individuals

Network (Relational) Data Analytics

Prof. Dr. Daning Hu

Department of Informatics

University of Zurich

Nov 12th, 2014


Outline

Introduction: The Big Data Era

Data Mining

Relational/Network Data Analytics

Mining Longitudinal Network Data

Dynamic Network Analysis

Introduction: The Big Data Era

90% of the data in the world today has been created in the last two years alone (IBM). Big data comes from everywhere:

sensors used to gather climate information,

posts to social media,

digital pictures and videos,

purchase records, etc.

In response, everyone has begun embracing a loosely defined term for today's massive data sets and the challenges they present: Big Data.

Lack of efficient and effective analytical methods for big data

Big brother issues

A Brief History of Big Data

1887-90: Herman Hollerith's census data (electric hole punching)

1935-37: FDR’s Social Security Act

26 million working Americans and 3 million employers

IBM, field investigators

1943 – 1960s: WWII and Cold War

“Colossus” project: deciphering Nazi codes

742M U.S. tax returns and 175M fingerprints -> Privacy Act

1990s – 2000s: Internet age and 9/11

NSA: 1.7 billion emails and phone calls daily

Retailers amassing information on shopping habits

Wal-Mart: 460 TB cache in 2004

Social networks and social media proliferated

U.S. Open Government Initiative: data.gov

2014: ? Network Data Analytics

Data Mining: Predecessor of Big Data (Analytics)

Data mining (the analysis step of the "Knowledge Discovery in Databases" process, or KDD) is the process that attempts to discover patterns in large data sets.

a field at the intersection of computer science and statistics

AI, machine learning, statistics, and database systems

The goal of DM is to extract information from a large data set and transform it into an understandable form for further use

Data -> Information -> Knowledge

Involving data analysis, preprocessing & management, model and inference considerations, complexity considerations, post-processing of discovered structures, visualization, and online updating (real-time).


Data Mining

The Knowledge Discovery in Databases (KDD) process is commonly defined with the stages:

Collection and Selection

Pre-processing

Transformation

Data Mining (Analysis)

Interpretation/Evaluation


Data Mining Tasks

Major Data Mining Tasks:

Association rule mining (Dependency modeling) – searches for relationships between variables.

Clustering – discovering groups and structures in the data that are in some way or another "similar", without using known structures.

Classification – generalizing known structure to apply to new data. E.g., software classifying an e-mail as "spam". (Training dataset)

Regression – attempts to find a function which models the data with the least error.

Summarization – providing a more compact representation of the data set, including visualization and report generation.


Relational/Network Data Analytics (Mining)

Network Data Analytics differs from regular DM in several ways

Network-based Representation – often involves large-scale relational data and can be modeled with network measures/metrics.

Network-based Models and Algorithms (PageRank and HITS)

The tasks are often similar to those of DM: Classification, Regression, etc. But the application goal often requires analytical insights about the relations among entities in the data set.

Case Study: Applications of Mutual Information for U.S. Border Safety

Border-crossing records can be considered as a stream of text (license plates) ordered by the time of crossing. MI can be used to identify frequent co-occurrence between a pair of vehicle crossings.

If one vehicle in the pair has a criminal record, some inferences may be made about the second vehicle if they cross together frequently.

We use conditional probability to include domain heuristics in the MI formulation.

The heuristics are derived from information recorded in multiple law-enforcement databases.

Case Study I: Association Rule Mining in COPLINK

The COPLINK dataset contains data from multiple law enforcement agencies from 1990 to 2006:

3 million incident reports

Personal and sociological information on the individuals involved (age, ethnicity, etc.)

Time information: when two individuals co-offend

TPD, PCSD, CBP (six ports between AZ and Mexico)

An Integrated Criminal Dataset

1.44 million criminals

662,000 vehicles

                    TPD           PCSD          CBP
Number of People    662,527       640,733       17.6 M records (2.6 M vehicles)
Time Span           1990 - 2006   1990 - 2006   2004 - 2006

Table 1. Summary of the COPLINK vehicle dataset

Association Rule Mining/Learning

Inferring associations between items in the database was motivated by decision support problems faced by retail organizations (Stonebraker 1993).

An association rule (AR) is a relationship of the form A ⇒ B, where A is the antecedent item-set and B is the consequent item-set.

The antecedent and consequent item-sets can contain multiple items.

A ⇒ B holds in a transaction set D with

confidence ‘c’ if c% of transactions in D that contain A also contain B,

support ‘s’ if s% of transactions in D contain both A and B.

Association rule mining identifies all the rules that have support and confidence greater than user-specified thresholds.
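The definitions of support and confidence above can be sketched in a few lines of Python. This is a minimal brute-force illustration restricted to rules with one item on each side; the function names and the toy transactions are assumptions made for the example, not part of COPLINK or of any particular mining library, and fractions are used in place of percentages.

```python
from itertools import combinations


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing the antecedent that also contain the consequent."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_a if n_a else 0.0


def mine_pair_rules(transactions, min_support, min_confidence):
    """Return (A, B, support, confidence) for 1-item rules A => B above both thresholds."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        for x, y in ((a, b), (b, a)):  # try both directions of the rule
            s = support(transactions, {x, y})
            c = confidence(transactions, {x}, {y})
            if s > min_support and c > min_confidence:
                rules.append((x, y, s, c))
    return rules


# Toy transactions, invented purely for illustration.
TRANSACTIONS = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

if __name__ == "__main__":
    for rule in mine_pair_rules(TRANSACTIONS, min_support=0.4, min_confidence=0.9):
        print(rule)
```

Practical miners avoid this exhaustive enumeration by pruning infrequent item-sets first; the sketch only makes the two measures concrete.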

Mutual Information

Mutual information is an information-theoretic measure that can be used to identify interesting co-occurrences of objects.

It can be considered a subset of AR mining with 1-item antecedent and consequent item-sets.

The earliest definitions of MI were given by Shannon et al. (1949) and Fano (1961) as the amount of information provided by the occurrence of an event (y) about the occurrence of another event (x):

$I(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$

Intuitively, this concept measures whether the co-occurrence of x and y, P(x, y), is more likely than would be expected from their separate occurrences, P(x)·P(y).
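As a rough illustration of this measure, the sketch below estimates log2 of P(x, y) / (P(x)·P(y)) from co-occurrence counts. It assumes the crossing stream has already been cut into time windows, each holding the set of license plates seen in that window; the windowing, the function name, and the plate strings are hypothetical and are not taken from the study.

```python
import math
from collections import Counter
from itertools import combinations


def pointwise_mi(windows):
    """Base-2 MI score for every pair of items that share at least one window."""
    n = len(windows)
    single = Counter()  # number of windows in which each item appears
    pair = Counter()    # number of windows in which each pair appears together
    for w in windows:
        items = sorted(set(w))
        single.update(items)
        pair.update(combinations(items, 2))
    scores = {}
    for (x, y), n_xy in pair.items():
        p_xy = n_xy / n
        p_x = single[x] / n
        p_y = single[y] / n
        scores[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Hypothetical plates and windows, for illustration only.
WINDOWS = [
    {"AAA111", "BBB222"},
    {"AAA111", "BBB222"},
    {"AAA111", "CCC333"},
    {"CCC333"},
]

if __name__ == "__main__":
    for pair_key, score in sorted(pointwise_mi(WINDOWS).items()):
        print(pair_key, round(score, 3))
```

A positive score means the pair appears together more often than independent crossings would suggest; a negative score means less often.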


Research Design (cont.)

[Figure: research design flowchart. Components: border crossing data (six ports); law enforcement data (TPD, PCSD); splitting into training data (2/3) and testing data (1/3); border wait times from a web spider and the Internet Archive; heuristic calculation; criminal vehicles with crossings; MIW/MIC scores; potential target vehicles; overlap subset; narcotics vehicles; Set A and Set B; evaluation.]

Research design and process explained in the following slides

Estimating Border Wait Times

An aerial photograph of a typical U.S. port of entry (southern border).

Vehicle lanes are backed up with dozens of vehicles during peak times.

Criminal vehicles operate in groups.

If one is caught, others turn back into Mexico.

They may join the lines one at a time or use turn-out points.

[Figure labels: vehicle lanes, turn-out points, port of entry (check points). © 2006 Google – Imagery © 2006 DigitalGlobe, Map data © 2006 NAVTEQ]

Thus, the time interval between two related vehicles is likely to be greater than or equal to the waiting time if the second vehicle does not join the line until the first vehicle goes through.

This needs to be taken into consideration in the calculation of MI.

Estimating Border Wait Times

CBP publishes hourly wait times on its website (BWT).

The information is posted only for the current day

No publicly available archive is maintained

A web spider was used to systematically download the web page every hour over several days in April 2006

However, the average waiting times thus obtained cannot be generalized to the entire year

The Internet Archive (IA) contained snapshots of the BWT web page from April 10, 2004 to March 31, 2005. These were used to obtain waiting time statistics for various days over many months in 2004 and 2005

The statistics from the spidering process and IA were then used to calculate average waiting times for each port on an hourly basis and used in MIW.
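The hourly averaging step can be sketched as a simple group-by. The record layout (port, hour of day, wait in minutes), the port names, and the sample values below are assumptions for illustration; the slides do not describe the actual format of the scraped BWT pages.

```python
from collections import defaultdict


def hourly_average_wait(samples):
    """Average wait in minutes per (port, hour-of-day) from (port, hour, wait) samples."""
    totals = defaultdict(lambda: [0.0, 0])  # (port, hour) -> [sum of waits, sample count]
    for port, hour, wait_minutes in samples:
        acc = totals[(port, hour)]
        acc[0] += wait_minutes
        acc[1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


# Hypothetical samples pooled from the spider and the archive snapshots.
SAMPLES = [
    ("Port A", 8, 30),
    ("Port A", 8, 50),
    ("Port A", 20, 10),
    ("Port B", 8, 15),
]

if __name__ == "__main__":
    for key, avg in sorted(hourly_average_wait(SAMPLES).items()):
        print(key, avg)
```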


Temporal Patterns of Border Crossings

The figure suggests that a large number (≈50%) of crossings with police contacts happen after dark.

MIW uses this information to assign more weight to time periods with more criminal crossings.

• Figure (a) shows the percentage of all crossings over six time periods of a day.

– 23% of all crossings take place between 8pm-Midnight.

• Figure (b) shows the percentage of all crossings by vehicles with police contacts over the six time periods.

– 27% of crossings by vehicles with police contacts happen between 8pm-Midnight.

Time period     (a) All crossings   (b) Crossings by vehicles with police contacts
Midnight-5am    12%                 15%
5am-10am        10%                 10%
10am-2pm        20%                 14%
2pm-4pm         13%                 10%
4pm-8pm         22%                 24%
8pm-Midnight    23%                 27%

Comparative Evaluation (cont.)

For hypothesis testing, thirty data points (ranging from top 5 to 3500 pairs) were taken for each of the measures and a t-test was done for the differences in the mean number of potentially criminal vehicles identified.

It was found that MIW performed significantly better (at the 99% level) than MIC in all but one dataset in identifying potentially criminal vehicles. The hypothesis on MIW performing better was partially supported.

Dataset               MIW - MIC
TPD dataset           0.2194
PCSD dataset          0.0001*
Tucson met. dataset   0.0009*

Case 1: Vehicle Pair Identified by MIW

This figure shows the crossing patterns of a pair of vehicles with a high MIW score.

• Vehicle C from Arizona and its occupant were arrested in Tucson for the sale of narcotics.

• Vehicle C crossed 7 times in a one-month period and crossed within a few minutes of Vehicle D.

• The crossings may be considered suspicious since they are almost always after dark and do not fit a standard work schedule.

[Figure: time of day (0 to 2000) of crossings by Vehicle C and Vehicle D on Jan 15, Jan 25, Jan 26, Jan 29, Feb 6, Feb 7, and Feb 14; annotation: after dark / no fixed work schedule.]

Criminal Activity of Vehicle C & D

[Figure: Vehicle C and Vehicle D, frequent crossers at night linked by MIW in the Customs and Border Protection data, shown against a narcotics network and a criminal network in the Tucson met. area.]

Vehicle C was found to have strong connections to a narcotics network in the Tucson metropolitan area. It had links to other people and vehicles that had been arrested / suspected for narcotics sales and possession in the region.

Vehicle D was also involved in criminal activity in the Tucson region.

MIW identified many other such strong cases.

A Suspect Vehicle Triple Identified

MIW scores were calculated between Vehicle F and other crossing vehicles, and a promising transitive association with Vehicle G was found. Vehicle G had crossed 3 times within minutes of Vehicle F over a 12-day period.

[Figure: time of day (0 to 2000) of crossings by Vehicles E, F, and G on dates in 2005 (Sep 6, Sep 11, Sep 17, Sep 18, Sep 25, Oct 4, Oct 5); annotation: after dark.]

• This figure shows the crossing patterns of a vehicle triple that was identified by the transitive use of MIW with support constraints.

• Vehicle F crossed 7 times in a one-month period, out of which it crossed 5 times within a few minutes of Vehicle E.

• It was also found that Vehicle E was involved in multiple narcotics crimes in the Tucson region in recent times.

Crime Involvement of Vehicles E and G

Vehicle E was involved in narcotics crimes and Vehicle G was found to be involved in suspicious activity and forgery.

Since the procedure used MIW, it indicates that the vehicles may have been simultaneously waiting in line at the same port of entry.

This example suggests that the transitive use of MIW is promising for identifying potentially criminal vehicles.

[Figure: MIW links among Vehicle E, Vehicle F, and Vehicle G in the Customs and Border Protection crossing data, with Vehicle E tied to narcotics crimes and Vehicle G to crimes in the Tucson met. area.]

Mining Longitudinal Network Data: Dynamic Network Analysis (DNA)

What Why How

Model the changes in network evolution

Temporal changes in network topological measures

Dynamic network recovery on longitudinal data

Studying dynamic link formation processes behind network evolution.

Nodes forming links -> Network Evolution

Statistical analysis of determinants behind link formation

Homophily

Preferential attachment

Shared affiliations

Simulate the evolution of networks

Agent-based Modeling and Simulation

Examine network robustness

Case Study II: A Global Terrorist Network

The Global Salafi Jihad (GSJ) network data was compiled by a former CIA operations officer, Dr. Marc Sageman, and covers 366 terrorists:

Relations: friendship, kinship, same religious leader, operational interactions, etc.

Attributes: geographical origins, socio-economic status, education, etc.

Timing: when they join and leave GSJ

The goal of dynamic analysis

gain insights about the evolution of the GSJ network

develop effective attack strategies to break down the GSJ network

Sample data of GSJ terrorists


Stretching

Or Not?

Leonardo da Vinci

Vitruvian Man, 1487



Temporal Changes in Network-level Measures

Fig. 1. Temporal changes in (a) the average degree <k>, 1989 to 2003, and (b), (c) the degree distribution (probability of degree) for 1990, 1991, and 1993 compared with a Poisson distribution, and for 1995, 1997, and 1999.

Degree = number of links a node has

Findings


There are three stages for the evolution of the GSJ network:

1989 - 1993 The emerging stage:

The network grows in size

Accelerated Growth - No. of edges increases faster than nodes

Random network topology (Poisson degree distribution)

1994 - 2000 The mature stage:

The size of the network reached its peak in 2000

Scale-free topology (Power-law degree distribution)

2001 - 2003 The disintegration stage:

Falling into small disconnected components after 9/11
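The two network-level measures behind these findings, the average degree <k> and the degree distribution, can be computed directly from an edge list. The sketch below is a minimal pure-Python illustration on a made-up star graph, not the GSJ data; it also includes a Poisson probability function as one way to produce the random-network reference curve. Nodes with no links do not appear in an edge list and are therefore ignored here, which is an assumption of this sketch.

```python
import math
from collections import Counter


def degrees(edges):
    """Degree of every node that appears in an undirected edge list."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def average_degree(edges):
    """Average degree <k> over nodes that have at least one link."""
    deg = degrees(edges)
    return sum(deg.values()) / len(deg) if deg else 0.0


def degree_distribution(edges):
    """Map each observed degree k to the fraction of nodes with that degree."""
    deg = degrees(edges)
    counts = Counter(deg.values())
    n = len(deg)
    return {k: c / n for k, c in sorted(counts.items())}


def poisson_pmf(k, mean):
    """Poisson probability of degree k, the reference for a random network."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


# Toy star graph: hub "h" with four leaves (illustrative only).
EDGES = [("h", "a"), ("h", "b"), ("h", "c"), ("h", "d")]

if __name__ == "__main__":
    k_avg = average_degree(EDGES)
    print("average degree:", k_avg)
    for k, p in degree_distribution(EDGES).items():
        print(k, p, round(poisson_pmf(k, k_avg), 4))
```

Applying the same functions to one edge list per year would give the yearly series that the figure summarizes.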

Temporal Changes in Node Centrality Measures

Fig. 2. Temporal changes in the degree and betweenness centrality of Osama Bin Laden, 1989 to 2002.

Degree: No. of links a node has

Betweenness of a node i: the number of shortest paths from all nodes to all others that pass through node i

Measures i’s influence on the traffic (information, resources) flowing through it
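For readers who want to see betweenness computed, here is a compact pure-Python sketch of Brandes' algorithm for unweighted, undirected graphs. It is a swapped-in standard technique, not necessarily what was used for the figure. Note one refinement over the plain count in the definition: when a pair of nodes has several shortest paths, each path contributes a fractional share. The adjacency dictionary and node names are toy assumptions.

```python
from collections import deque


def betweenness(adj):
    """Betweenness of each node; `adj` maps node -> set of neighbours (undirected)."""
    score = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        stack = []
        preds = {v: [] for v in adj}
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        sigma[s] = 1
        dist[s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = {v: 0.0 for v in adj}
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    # Each unordered pair was visited from both endpoints, so halve the totals.
    return {v: b / 2 for v, b in score.items()}


# Toy graphs (illustrative only): a path a-b-c and a star with hub h.
PATH = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
STAR = {"h": {"x", "y", "z"}, "x": {"h"}, "y": {"h"}, "z": {"h"}}

if __name__ == "__main__":
    print(betweenness(PATH))
    print(betweenness(STAR))
```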

Findings and Possible Explanations


1994 – 1996: A sharp decrease in Bin Laden’s Betweenness

1994: Saudi revoked his citizenship and expelled him

1995: Went to Sudan and was expelled again under U.S. pressure

1996: Went to Afghanistan and established camps there

1998 – 1999: Another sharp decrease in his Betweenness

After the 1998 bombings of U.S. embassies, Bill Clinton ordered a freeze on assets linked to bin Laden (top 10 most wanted)

August 1998: A failed U.S. assassination attempt on him

1999: UN imposed sanctions against Afghanistan to force the Taliban to extradite him


Dynamic Network Recovery

The GSJ Network: small, low frequency (yearly), ad hoc!

How about large, high-frequency longitudinal network data?

Email communication network

[Figure: a timeline of network snapshots at t, t+1, t+2, ...; an instantaneous network is recovered over an interval such as [t, t+2].]

The relevancy horizon: which past links are relevant to the current state of the network?

How frequently to sample?

Relevancy Horizon and Sampling Period

Recovering a set of instantaneous social networks from longitudinal network data by setting a sliding window filter

The relevancy horizon: the maximum length of time that a past event (link) has an impact on the current network.

The sampling period: determines which events are considered to be simultaneous and independent of each other.
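A sliding window filter of this kind can be sketched as follows. The sketch assumes integer timestamps, a snapshot taken every `period` time units, and a half-open window (t - horizon, t]; these conventions and the function name are choices made for the illustration, not details given in the slides.

```python
def instantaneous_networks(events, horizon, period):
    """Recover (time, edge set) snapshots from timestamped link events.

    events  : iterable of (time, u, v) link events
    horizon : relevancy horizon; a link stays in the network for this long
    period  : sampling period; spacing between consecutive snapshots
    """
    events = list(events)
    if not events:
        return []
    times = [t for t, _, _ in events]
    start, end = min(times), max(times)
    snapshots = []
    t = start
    while t <= end:
        # Keep only links whose event time falls inside the window (t - horizon, t].
        edges = {frozenset((u, v)) for te, u, v in events if t - horizon < te <= t}
        snapshots.append((t, edges))
        t += period
    return snapshots


# Toy event stream (illustrative only).
EVENTS = [(0, "a", "b"), (1, "b", "c"), (3, "a", "c")]

if __name__ == "__main__":
    for t, edges in instantaneous_networks(EVENTS, horizon=2, period=1):
        print(t, sorted(tuple(sorted(e)) for e in edges))
```

A longer horizon makes each snapshot denser, while a longer sampling period yields fewer snapshots.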

Case Study III: A Narcotic Criminal Network

The COPLINK dataset contains 3 million police incident reports from the Tucson Police Department (1990 to 2006).

3 million incident reports and 1.44 million individuals

Their personal and sociological information (age, ethnicity, etc.)

Time information: when two individuals co-offend

AZ Inmate affiliation data: when and where an inmate was housed

A Narcotic Criminal Network

19,608 individuals involved in organized narcotic crimes

29,704 co-offending pairs (links)

                    COPLINK Narcotic Data   Arizona Inmate Data   Overlapped (identified by first name, last name and DOB)
Number of People    36,548                  165,540               19,608
Time Span           1990 - 2006             1985 - 2006           17 years

Table 1. Summary of the COPLINK dataset and the Arizona inmate dataset

Determine Sampling Period and Relevancy Horizon

The sampling period $d$ can be calculated based on the Nyquist–Shannon sampling theorem: $d = \frac{1}{2 f_{max}}$, where $f_{max}$ is the maximum frequency of link formation (i.e., co-offending in a crime).

The relevancy horizon is determined by

Within-pair Response Time $t_{ij}$: the time gap between two subsequent co-offenses by i and j.

Cumulative Distribution of within-pair response time

[Figure: cumulative distribution p(t_ij) of the within-pair response time t_ij (days).]

90% of the response time gaps are within $t_{0.90}$ = 210 days.

Set 210 days as the relevancy horizon for our empirical analysis.
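One way to arrive at such a horizon is to collect the gaps between consecutive co-offenses of each pair and read off the 90th percentile. The sketch below assumes records of the form (day, i, j) and uses a nearest-rank percentile; both the record layout and the percentile convention are assumptions, and the sample events are invented.

```python
import math
from collections import defaultdict


def within_pair_gaps(events):
    """Gaps in days between consecutive co-offenses of the same pair.

    events: iterable of (day, i, j) co-offending records; pairs are unordered.
    """
    by_pair = defaultdict(list)
    for day, i, j in events:
        by_pair[frozenset((i, j))].append(day)
    gaps = []
    for days in by_pair.values():
        days.sort()
        gaps.extend(later - earlier for earlier, later in zip(days, days[1:]))
    return gaps


def percentile_nearest_rank(values, q):
    """Smallest value v such that at least a fraction q of the values are <= v."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


# Hypothetical co-offending records (day number, person, person).
EVENTS = [
    (0, "p1", "p2"), (10, "p2", "p1"), (40, "p1", "p2"),
    (5, "p3", "p4"), (205, "p3", "p4"),
]

if __name__ == "__main__":
    gaps = within_pair_gaps(EVENTS)
    print("gaps:", sorted(gaps))
    print("90th percentile:", percentile_nearest_rank(gaps, 0.9))
```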



Proportional hazards model (Cox Regression Analysis)

Homophily in age (group) and race

Shared affiliations:

Mutual acquaintances (through crimes)

Vehicle affiliation (same vehicle used by two in different crimes)

$h(t, x_1, x_2, x_3, \ldots) = h_0(t)\exp(b_1 x_1 + b_2 x_2 + b_3 x_3 + \ldots)$
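To make the role of the coefficients concrete: in a proportional hazards model the baseline h0(t) cancels when two pairs are compared, so exp(b1 x1 + b2 x2 + ...) gives the hazard of link formation relative to baseline, and exp(b) for a single covariate is its hazard ratio. The sketch below only evaluates that expression and ranks candidates by it. The coefficient values are hypothetical placeholders, not the fitted results from the study, and no model fitting is performed.

```python
import math


def relative_hazard(coefficients, covariates):
    """exp(b1*x1 + b2*x2 + ...): hazard relative to the baseline h0(t)."""
    linear = sum(b * covariates.get(name, 0.0) for name, b in coefficients.items())
    return math.exp(linear)


def rank_candidates(coefficients, candidates):
    """Candidate names sorted by decreasing relative hazard of forming a link."""
    return sorted(
        candidates,
        key=lambda name: relative_hazard(coefficients, candidates[name]),
        reverse=True,
    )


# Hypothetical coefficients chosen so the hazard ratios are 4, 2, and 1.
COEFFS = {"mutualacq": math.log(4.0), "vehicle": math.log(2.0), "age": 0.0}

# Hypothetical covariates of each candidate relative to a suspect.
CANDIDATES = {
    "person_1": {"mutualacq": 1},
    "person_2": {"vehicle": 1},
    "person_3": {},
}

if __name__ == "__main__":
    for name in rank_candidates(COEFFS, CANDIDATES):
        print(name, round(relative_hazard(COEFFS, CANDIDATES[name]), 2))
```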

Statistical Analysis of Determinants for Link Formation

Fig. 3. Results of multivariate survival (Cox regression) analysis of triadic closure (link formation): hazard ratios for mutualacq, gender, age, vehicle, and race.

IBM’s COPLINK is an intelligent police information system that aims to help speed up the crime detection process.

COPLINK calculates the co-offending likelihood score based on the proportional hazards model.

It produces a ranked list of individuals based on their predicted likelihood of co-offending with the suspect under investigation.

BI Application: Co-offending Prediction in COPLINK

Fig. 4. Screenshots of the COPLINK system



Simulate Attacks on Dark Networks

Three attack (i.e., node removal) strategies:

Attack on hubs (highest degrees)

Attack on bridges (highest betweenness)

Real-world Attack (attack order based on real-world data)

Simulate two types of attacks to examine the robustness of the Dark networks

Simultaneous attacks (the degree/betweenness of nodes are NOT updated after each removal) – Static

Progressive attacks (the degree/betweenness of nodes are updated after each removal) – Dynamic
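The difference between simultaneous and progressive attacks can be sketched for the hub strategy. The code below removes the highest-degree nodes either in a fixed order computed once (simultaneous) or with degrees recomputed after every removal (progressive), and reports the relative size of the largest connected cluster after each step. The toy path graph, the tie-breaking by node name, and the function names are assumptions; a bridge attack would follow the same pattern with betweenness in place of degree.

```python
def largest_component_size(adj, removed):
    """Size of the largest connected cluster after deleting the `removed` nodes."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack = [start]
        size = 0
        while stack:
            v = stack.pop()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        best = max(best, size)
    return best


def remaining_degree(adj, v, removed):
    """Degree of v counting only neighbours that have not been removed."""
    return sum(1 for w in adj[v] if w not in removed)


def hub_attack(adj, n_remove, progressive):
    """Remove n_remove hubs; return the relative largest-cluster size after each removal."""
    n = len(adj)
    removed = set()
    static_order = sorted(adj, key=lambda v: (-len(adj[v]), v))  # degrees fixed up front
    curve = []
    for step in range(n_remove):
        if progressive:
            candidates = [v for v in adj if v not in removed]
            target = min(candidates, key=lambda v: (-remaining_degree(adj, v, removed), v))
        else:
            target = static_order[step]
        removed.add(target)
        curve.append(largest_component_size(adj, removed) / n)
    return curve


# Toy path graph a-b-c-d-e (illustrative only).
PATH = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c", "e"}, "e": {"d"}}

if __name__ == "__main__":
    print("simultaneous:", hub_attack(PATH, 2, progressive=False))
    print("progressive: ", hub_attack(PATH, 2, progressive=True))
```

On this toy graph the progressive attack leaves a smaller largest cluster after two removals, because the second target is chosen with up-to-date degrees.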

Simultaneous Vs. Progressive Attacks

* S: the relative size of the largest cluster that remains connected

Both Dark networks are more vulnerable to progressive attacks than simultaneous attacks.

Dynamic updates are more effective

Hub Vs. Bridge Attacks

Both hub and bridge attacks are far more effective than real-world arrests – Policy implications?

Both Dark networks are more vulnerable to Bridge attacks than Hub attacks.

Bridge (highest betweenness): field lieutenants, operational leaders, etc.

Hub (highest degree): e.g., Bin Laden

[Figure: GSJ network. S and <s> against the fraction of nodes removed (0 to 1) for hub attacks and bridge attacks.]


Summary and Findings

Dynamic Network Analysis (DNA) methods are effective in

Linking network topological changes to analytical insights

Systematically capturing the link formation processes

Examining the determinants of link formation

Dark networks are robust against real-world attacks but vulnerable to targeted bridge attacks

DNA provides real-time decision support for fighting crimes based on relational/network data mining.
