
Page 1: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis

The Power of Data Mining and Machine Learning Techniques for Network Construction and Analysis

Reda Alhajj

University of Calgary, Calgary, Alberta, Canada

Global University, Beirut, Lebanon

[email protected]

Page 2: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis

BYU, Provo, USA, March 2013

General Overview

The network model provides a powerful platform to study a group of entities and their relationships

The semantics of the links in the network are determined by the application domain under investigation

A network can be constructed by considering pairwise correlation between entities or by investigating the correlation between two entities based on a global view of the data

Data mining and machine learning techniques allow for better investigation by taking a global view of the data to derive the strength of pairwise links

The combination of data mining, machine learning and network analysis would lead to a comprehensive and robust framework for data analysis.

2 Reda Alhajj, University of Calgary

Page 3: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Outline of the talk

Background on ARM, Clustering, Network Model, fuzziness

From FPM, ARM and clustering to network

Some application domains: database design, web mining, terror network analysis, outlier detection, disease biomarkers, database search

Conclusions and research directions

Page 4: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Overview of Association Rules Mining

A general model for mining domains where there is a many-to-many relationship between two sets of entities, e.g., baskets and items, or documents and words

Consider a set of items $I = \{I_1, I_2, I_3, \ldots, I_m\}$

Consider a database of transactions $D$ where each transaction $T$ is a set of items such that $T \subseteq I$

So, if $A$ is a set of items, a transaction $T$ is said to contain $A$ if and only if $A \subseteq T$

An association rule is an implication or correlation of the form:

$A \Rightarrow B$ where $A \subset I$, $B \subset I$, and $A \cap B = \emptyset$

Support and confidence are the measures generally used to filter the rules
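As an illustration of the two measures, here is a minimal Python sketch. The baskets are hypothetical and not taken from the talk; support is computed as a fraction of transactions, and the confidence of a rule A => B as support(A and B together) divided by support(A), which is the usual convention.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent => consequent."""
    both = frozenset(antecedent) | frozenset(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical baskets, not taken from the talk.
baskets = [
    frozenset({"bread", "milk"}),
    frozenset({"bread", "butter"}),
    frozenset({"bread", "milk", "butter"}),
    frozenset({"milk"}),
]

bread_milk_support = support(baskets, {"bread", "milk"})
bread_to_milk_confidence = confidence(baskets, {"bread"}, {"milk"})
```

A rule is kept only if both values reach the thresholds chosen by the analyst.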


Page 5: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Association Rules Mining: Two Steps

In general association rules mining can be reduced to the following two steps:

1. Find all frequent itemsets
Each itemset will occur at least as frequently as a minimum support count

2. Generate strong association rules from the frequent itemsets
These rules will satisfy minimum support and confidence measures

We use the outcome from the first step in part of the research and the outcome from the second step in another part of the research


Page 6: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Association Rules Mining: Apriori Algorithm

Any subset of a frequent itemset must be frequent

Apriori pruning principle: if an itemset is infrequent, its supersets should not be generated or tested

Minimum support = 2
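The pruning principle can be sketched in a few lines of Python. This is a compact level-wise version, not necessarily the exact variant shown on the slide; the four transactions are hypothetical, and the minimum support count of 2 follows the slide.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: support count} for all itemsets with count >= min_count."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    items = {item for t in transactions for item in t}
    level = count({frozenset([i]) for i in items})
    frequent = dict(level)
    k = 2
    while level:
        prev = set(level)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates that have any infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        level = count(candidates)
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical transactions; minimum support count of 2 as on the slide.
transactions = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
frequent = apriori(transactions, min_count=2)
```

Here item D occurs once, so no itemset containing D is ever generated; likewise {A, B, C} is pruned because its subset {A, B} is infrequent.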


Page 7: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Association Rule Mining: Frequent Closed Itemset

A frequent itemset X is closed if none of its immediate supersets has the same support as the itemset X

Example

Image Reference: http://www.siam.org/meetings/sdm06/proceedings/038lucchesec.pdf


Page 8: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Clustering

It is an unsupervised learning process

It is the process of distributing a given set of data instances into groups such that the similarity of instances is high within each group and low between the groups

Similarity within a cluster (intra-cluster) is measured using variance, average variance, or total within-cluster variance (TWCV)

Similarity across clusters (inter-cluster) is measured based on linkage

For clustering we need to know at least the characteristics of the instances and the similarity measure to be used in the process

Various algorithms exist for clustering, e.g., k-means and DBSCAN

Each algorithm has its advantages and disadvantages


Page 9: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Clustering


Example 1

Example 2

Page 10: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Overview of Social Network Analysis

A social network is a set of entities called actors and the links connecting them, e.g., students enrolled in the same courses, or people and their likes

A social network is mostly represented as a graph called a sociogram

Social Network Analysis (SNA) is powerful because it has foundations in math/graph theory

SNA provides a set of tools to empirically extend our theoretical intuition of the patterns that compose a social structure.

SNA provides a set of relational methods for systematically understanding and identifying connections among actors.

SNA embodies a range of theories relating types of observable social spaces and their relation to individual and group behavior.


Page 11: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Social Network Analysis: Centrality Measures

Degree: sum of connections (sum of the weights of connections in case of weighted graphs) from or to an actor

Closeness: distance of one actor to all others in the network

Betweenness: the number of shortest paths that pass through an actor

Eigenvector: measures the importance of an actor based on the importance of its neighbors
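The first two measures can be sketched for an unweighted, undirected graph as follows. The graph is hypothetical, and closeness is normalised as (n - 1) divided by the sum of shortest-path distances, which is one common convention; the talk does not say which normalisation it uses.

```python
from collections import deque


def degree(graph, node):
    """Number of neighbours of `node` in an unweighted, undirected graph."""
    return len(graph[node])


def closeness(graph, node):
    """(n - 1) divided by the sum of shortest-path distances from `node`.

    Assumes the graph is connected and has at least two nodes.
    """
    dist = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search gives hop-count distances
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    return (len(graph) - 1) / sum(dist.values())


# Hypothetical path graph a - b - c - d, stored as adjacency sets.
graph = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
```

In this path graph the inner nodes b and c score higher than the end nodes on both measures.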


Page 12: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Social Network Analysis: Centrality Measures (example)

The red nodes have the highest degree centrality

The blue node has the highest closeness and betweenness centrality

Node 7 has the highest degree centrality

Node 8 has the highest betweenness centrality

Nodes 4 and 5 have the highest closeness centrality

Example 1 Example 2

Image Reference:http://www.biomedcentral.com/

Image Reference:http://mande.co.uk/special-issues/network-models/


Page 13: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Social Network Analysis: Graph Clustering Algorithms

MST-based clustering
First finds a minimum spanning tree (MST) of the graph
Removes edges with the highest weight from the MST to form clusters of vertices (actors)

Edge betweenness clustering
The betweenness of an edge is defined as the extent to which the edge lies along shortest paths
First computes edge betweenness for all edges in the current graph
Removes edges having the highest betweenness from the graph
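The MST-based variant can be sketched with Kruskal's algorithm and a union-find structure. This is a minimal sketch under two assumptions not stated on the slide: edge weights act as distances, and the number of clusters k is given, so the k - 1 heaviest MST edges are removed. The function name and example graph are hypothetical.

```python
def mst_clusters(nodes, edges, k):
    """Split `nodes` into k clusters by cutting the k-1 heaviest MST edges.

    `edges` is a list of (weight, u, v) tuples; the graph must be connected.
    """
    def make_find(parent):
        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x
        return find

    # Kruskal: add edges in increasing weight unless they would close a cycle.
    parent = {n: n for n in nodes}
    find = make_find(parent)
    mst = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
            mst.append((weight, u, v))

    # Keep all but the k-1 heaviest MST edges, then read off the components.
    kept = sorted(mst)[: len(mst) - (k - 1)]
    parent = {n: n for n in nodes}
    find = make_find(parent)
    for _, u, v in kept:
        parent[find(u)] = find(v)
    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(clusters.values(), key=sorted)


# Hypothetical weighted graph.
nodes = ["a", "b", "c", "d"]
edges = [(1, "a", "b"), (5, "b", "c"), (1, "c", "d"), (6, "a", "d")]
two_clusters = mst_clusters(nodes, edges, k=2)
```

Edge betweenness clustering follows the same remove-and-split pattern but recomputes edge betweenness on the current graph before each removal.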


Page 14: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


One Mode versus Two Mode Networks

Queries (users) versus Tables is a two mode network

Folding is used to produce one mode networks from a two mode network

Folding is simply the multiplication of the adjacency matrix of the two mode network by its transpose

Adjacency matrix of the two mode network:

    X  Y  Z
A   1  0  0
B   1  0  1
C   1  1  0
D   1  0  1

Its transpose:

    A  B  C  D
X   1  1  1  1
Y   0  0  1  0
Z   0  1  0  1
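Folding the matrix above can be sketched in plain Python. The helper names `fold` and `transpose` are illustrative; the matrix values are the ones on the slide.

```python
def fold(matrix):
    """Multiply `matrix` by its transpose; rows become the actors of the result."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in matrix]
        for row_i in matrix
    ]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


# Two mode matrix from the slide: rows A-D, columns X-Z.
two_mode = [
    [1, 0, 0],  # A
    [1, 0, 1],  # B
    [1, 1, 0],  # C
    [1, 0, 1],  # D
]

rows_network = fold(two_mode)                # 4 x 4, links among A-D
columns_network = fold(transpose(two_mode))  # 3 x 3, links among X-Z
```

Each off-diagonal entry counts what two actors share: B and D share two columns (X and Z), so their link weight is 2. The diagonal holds each actor's own row or column total and is usually ignored when the result is read as a network.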


Page 15: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Fuzzy Sets

Generalize classical set theory through a membership function.

A membership function introduces a grey area between the black and white areas

Consider fuzzy set A, its domain D, and object x.

Membership function $\mu_A$ specifies the degree of membership of $x$ in $A$:

$\mu_A : D \to [0, 1]$

$\mu_A(x) = 0$ means $x$ does not belong to $A$.

$\mu_A(x) = 1$ means $x$ completely belongs to $A$.

Intermediate values $0 < \mu_A(x) < 1$ represent varying degrees of membership.


Page 16: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Example on Membership

The ranges of fuzzy sets

Income        Range        Centroid
Quite poor    10-10-30     -
Poor          10-30-70     30
Moderate      30-70-120    70
Rich          70-120-120   -

The membership functions found according to the centroids

[Figure: membership (0.0 to 1.0) versus income ($) with breakpoints at 10K, 30K, 70K and 120K, for the fuzzy sets quite poor, poor, moderate and rich]
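The income ranges can be turned into membership functions as sketched below. The triangular shape, with a flat shoulder where two of the three range values coincide, is an assumption read off the left-peak-right ranges; the original figure is not available in this transcript.

```python
def triangular(x, left, peak, right):
    """Membership of x in a fuzzy set given as left-peak-right.

    When left == peak or peak == right the set has a flat shoulder on that
    side, as for "quite poor" (10-10-30) and "rich" (70-120-120).
    """
    if x <= peak:
        if left == peak:
            return 1.0
        return max(0.0, (x - left) / (peak - left))
    if peak == right:
        return 1.0
    return max(0.0, (right - x) / (right - peak))


# Ranges from the slide, in thousands of dollars.
INCOME_SETS = {
    "quite poor": (10, 10, 30),
    "poor": (10, 30, 70),
    "moderate": (30, 70, 120),
    "rich": (70, 120, 120),
}


def memberships(income_k):
    """Degree of membership of an income (in $K) in each fuzzy set."""
    return {name: triangular(income_k, *abc) for name, abc in INCOME_SETS.items()}
```

Under this reading, an income of 50K is half poor and half moderate, which is the grey area the membership function is meant to capture.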


Page 17: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From FPM to Network Construction

Given a data set of M instances and N features per instance

Prepare the data for FPM by deciding on the baskets and items. Keep in mind that items are the actors in the network

Apply the FPM algorithm of your choice to find frequent sets of items; it is possible to narrow down to closed or maximal FPs

Construct the network by considering the frequent sets as follows:

Add a link between two actors i and j iff i and j exist together in at least one FP; the weight of the link is set to the number of FPs they have in common

It is possible to normalize the weights and/or remove some links based on certain criteria, such as below-average weight or weight below a predefined threshold
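The construction and the optional pruning step can be sketched as follows. The patterns are hypothetical, and dividing by the largest weight followed by a below-average cut is only one of the criteria the slide allows.

```python
from collections import Counter
from itertools import combinations


def network_from_patterns(patterns):
    """Weight of link (i, j) = number of patterns containing both i and j."""
    weights = Counter()
    for pattern in patterns:
        for i, j in combinations(sorted(pattern), 2):
            weights[(i, j)] += 1
    return weights


def prune_below_average(weights):
    """Normalise by the largest weight, then keep links at or above the mean."""
    top = max(weights.values())
    normalised = {link: w / top for link, w in weights.items()}
    mean = sum(normalised.values()) / len(normalised)
    return {link: w for link, w in normalised.items() if w >= mean}


# Hypothetical frequent patterns over items a-d.
patterns = [{"a", "b"}, {"a", "b", "c"}, {"b", "c"}, {"c", "d"}]
weights = network_from_patterns(patterns)
strong_links = prune_below_average(weights)
```

Links are stored once per unordered pair, keyed by the alphabetically sorted pair of actors.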

Page 18: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From FPM to Network Construction


Page 19: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From ARM to Network Construction

Given a data set of M instances and N features per instance

Prepare the data for ARM by deciding on the baskets and items. Keep in mind that items are the actors in the network; they will form the antecedents and consequents of the rules

Apply the ARM algorithm of your choice to find all AR’s that satisfy certain criteria

Construct the network by considering the ARs as follows:

Add a link between two actors i and j iff i and j exist together in at least one AR; the weight of the link is set to the number of ARs they have in common. It is possible to concentrate on the antecedent, the consequent, or both.

It is possible to normalize the weights and/or remove some links based on certain criteria, such as below-average weight or weight below a predefined threshold
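A minimal sketch of the rule-based construction is given below. The `scope` parameter is a hypothetical way to express the choice between antecedent, consequent, or both; the rules are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def network_from_rules(rules, scope="both"):
    """Link items that occur together in a rule; weight = number of such rules.

    `rules` is a list of (antecedent, consequent) pairs of item sets.
    `scope` selects which side of each rule is considered:
    "antecedent", "consequent", or "both" (union of the two sides).
    """
    weights = Counter()
    for antecedent, consequent in rules:
        if scope == "antecedent":
            items = set(antecedent)
        elif scope == "consequent":
            items = set(consequent)
        else:
            items = set(antecedent) | set(consequent)
        for i, j in combinations(sorted(items), 2):
            weights[(i, j)] += 1
    return weights


# Hypothetical rules: {a, b} => {c} and {a} => {c}.
rules = [({"a", "b"}, {"c"}), ({"a"}, {"c"})]
both_sides = network_from_rules(rules)
antecedents_only = network_from_rules(rules, scope="antecedent")
```

Restricting the scope to antecedents links only items that jointly imply something, which gives a sparser network than using both sides.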

Page 20: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From ARM to Network Construction

Page 21: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From Clustering to Network Construction

Given a data set of M instances and N features per instance

Prepare the data for clustering by deciding on the features to consider in computing the similarity measure

Apply either one clustering algorithm several times with different input parameters, or several clustering algorithms, to obtain one clustering solution per run

Construct the network by considering the clusters as follows:

Add a link between two actors i and j iff i and j exist together in the same cluster in at least one clustering solution; the weight of the link is set to the number of solutions in which they share a cluster

It is possible to normalize the weights and/or remove some links based on certain criteria, such as below-average weight or weight below a predefined threshold
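The cluster-based construction can be sketched as a co-occurrence count over clustering runs. The label assignments below are hypothetical; in practice they would come from repeated runs of the chosen clustering algorithms.

```python
from collections import Counter
from itertools import combinations


def network_from_clusterings(solutions):
    """Weight of link (i, j) = number of solutions placing i and j together.

    `solutions` is a list of clusterings; each clustering maps an actor to
    its cluster label for that run.
    """
    weights = Counter()
    for labels in solutions:
        for i, j in combinations(sorted(labels), 2):
            if labels[i] == labels[j]:
                weights[(i, j)] += 1
    return weights


# Hypothetical label assignments from three clustering runs.
solutions = [
    {"a": 0, "b": 0, "c": 1, "d": 1},
    {"a": 0, "b": 0, "c": 0, "d": 1},
    {"a": 1, "b": 0, "c": 0, "d": 0},
]
co_cluster = network_from_clusterings(solutions)
```

Cluster labels only need to be consistent within one run, since actors are compared run by run.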

Page 22: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Network Construction

Multiple clustering solutions


Page 23: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From the Data to Network Construction

Given a data set of M instances and N features per instance

Prepare the data for processing by deciding on the P features to consider in the analysis

Construct an M×P matrix A by considering every instance as a row and every feature as a column

Find the transpose of matrix A

Multiply matrix A by its transpose to get the M×M adjacency matrix for the target network

It is possible to normalize the weights and/or remove some links based on certain criteria, such as below-average weight or weight below a predefined threshold

Page 24: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis

NetDriller : A Powerful Social Network Analysis Tool*

Negar Koochakzadeh, Atieh Sarraf, Keivan Kianmehr, Jon Rokne, Reda Alhajj

{nkoochak, sarrafsa}@ucalgary.ca, [email protected], {alhajj, rokne}@ucalgary.ca

Social Network Analysis (SNA) is a technique first used in sociology. Recently, computer scientists have realized that this model is general enough to be applied to any domain where the entities and their interconnections can be separated into actors and their links, respectively.

Data mining techniques can strengthen SNA

Searching in the network:
Example 1: Find individuals who could monitor the information flow in an organization better than most others.
Example 2: Find individuals who have the best picture of what is happening in the network as a whole.

Closeness centrality reveals how long it takes information to spread from one individual to others in the network. Individuals who score high in closeness have the shortest paths to all others in the network.

Betweenness centrality indicates the extent to which an individual is a broker of indirect connections among all others in a network. Someone with high betweenness could be thought of as a gatekeeper of information flow. People who occur on many shortest paths among other people have the highest betweenness values.

Degree centrality indicates the extent to which an individual sends information to or receives information from neighbors.

Eigenvector centrality is calculated from the principal eigenvector of the network's adjacency matrix. A node is central to the extent that its neighbors are central.

Fuzzy Query Example: Find individuals with high centralities

Raw Dataset: People and their attributes

Social Network: Based on community detection

Fuzzy Query Result: Color hue shows degree of membership (DofM)

Fuzzy Sets: Based on multi-objective GA optimization

age  work class        education  marital status      occupation       relationship   race   sex     hours/week  native country
39   State-gov         Bachelors  Never-married       Adm-clerical     Not-in-family  White  Male    40          US
50   Self-emp-not-inc  Bachelors  Married-civ-spouse  Exec-managerial  Husband        White  Male    13          Canada
52   Self-emp-not-inc  HS-grad    Married-civ-spouse  Exec-managerial  Husband        White  Male    45          US
30   State-gov         Bachelors  Married-civ-spouse  Prof-specialty   Husband        Black  Male    40          India
25   Self-emp-not-inc  HS-grad    Never-married       Farming-fishing  Own-child      White  Male    35          Iran
43   Self-emp-not-inc  Masters    Divorced            Exec-managerial  Unmarried      White  Female  45          US
…

1 Network Construction

* ICDM 2011 IEEE International Conference on Data Mining http://cpsc.ucalgary.ca/~nkoochak/NetDriller/

Page 25: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


IMPROVING DATABASE PERFORMANCE BY BUILDING AND ANALYZING NETWORK OF TABLES FROM QUERY ACCESS PATTERNS

Page 26: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Problem Definition

Response time in a distributed or parallel database system is largely determined by how data is organized and stored on different machines/sites.

The goal is to place related data on nearby, or preferably the same, sites to minimize the response time.

The study of data distribution requires solving two problems:
1. The partitioning problem
2. The allocation problem


Page 27: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Queries (users) versus Tables


Page 28: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Overview of the analysis process

Three main steps:

1. Considering tables as items and queries as transactions, extract frequent closed itemsets

Fuzzy sets can also be built from the closed itemsets in this step

2. Use the extracted itemsets from the previous step to build the network of tables

3. Use network analysis to extract information about the tables from the network of tables


Page 29: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Step 1: Items and Transactions

Sample database
EMPLOYEE (Ssn, Fname, Lname, Dno)
DEPARTMENT (Dnumber, Dname)
PROJECT (Pnumber, Pname, Plocation, Dno)

Sample query (Q1)
SELECT Lname
FROM EMPLOYEE, DEPARTMENT
WHERE Dno = Dnumber AND Dname = 'Research'

Items: EMPLOYEE, DEPARTMENT, PROJECT

Transactions: Q1: EMPLOYEE, DEPARTMENT
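Turning a query into a transaction can be sketched with a small regular expression. This is a deliberately simple, hypothetical helper: it only handles a comma-separated FROM list (optionally with aliases) and ignores JOIN syntax and subqueries, which a real query log would need a proper SQL parser for.

```python
import re

# Text between FROM and the next clause keyword, a semicolon, or the end.
FROM_CLAUSE = re.compile(
    r"\bFROM\b(.*?)(?:\bWHERE\b|\bGROUP\b|\bORDER\b|;|$)",
    flags=re.IGNORECASE | re.DOTALL,
)


def tables_in_query(sql):
    """Return the set of table names listed in the FROM clause of `sql`."""
    match = FROM_CLAUSE.search(sql)
    if match is None:
        return frozenset()
    parts = (part.strip() for part in match.group(1).split(","))
    # The first word of each part is the table name; a second word is an alias.
    return frozenset(part.split()[0].upper() for part in parts if part)


q1 = "SELECT Lname FROM EMPLOYEE, DEPARTMENT WHERE Dno = Dnumber AND Dname = 'Research'"
transaction_q1 = tables_in_query(q1)
```

Applying the helper to every logged query yields the transaction database that the frequent itemset miner consumes.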


Page 30: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Step 1: Example (Sample Database)

Sample database schema from Fundamentals of Database Systems, Elmasri/Navathe


Page 31: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Step 1: Example (List of Queries)

List of Queries in Transaction Format

Q1 EMPLOYEE DEPARTMENT

Q2 EMPLOYEE DEPARTMENT PROJECT

Q3 EMPLOYEE DEPARTMENT

Q4 EMPLOYEE DEPARTMENT WORKS_ON PROJECT

Q5 EMPLOYEE WORKS_ON PROJECT

Q6 EMPLOYEE DEPARTMENT WORKS_ON PROJECT

Q7 EMPLOYEE DEPENDENT

Q8 EMPLOYEE WORKS_ON PROJECT

Q9 EMPLOYEE DEPENDENT

Q10 EMPLOYEE DEPENDENT

Q11 EMPLOYEE DEPARTMENT

Q12 EMPLOYEE DEPARTMENT

Q13 WORKS_ON PROJECT

Q14 WORKS_ON PROJECT

Q15 EMPLOYEE WORKS_ON PROJECT


Page 32: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Step 1: Example (Closed Itemsets)

List of frequent closed itemsets with min-support-threshold = 2

Note: 1-itemsets are omitted from the results

Itemset                                    Frequency
EMPLOYEE, DEPARTMENT, WORKS_ON, PROJECT    2
EMPLOYEE, WORKS_ON, PROJECT                5
EMPLOYEE, DEPARTMENT, PROJECT              3
EMPLOYEE, PROJECT                          6
WORKS_ON, PROJECT                          7
EMPLOYEE, DEPARTMENT                       7
EMPLOYEE, DEPENDENT                        3
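The table can be reproduced from the fifteen queries listed earlier with a brute-force sketch. Enumerating every itemset is fine for five tables but is not how a scalable closed-itemset miner works; the function name is illustrative.

```python
from itertools import combinations

# Queries Q1-Q15 in transaction format, as listed on the earlier slide.
QUERIES = [
    "EMPLOYEE DEPARTMENT",                    # Q1
    "EMPLOYEE DEPARTMENT PROJECT",            # Q2
    "EMPLOYEE DEPARTMENT",                    # Q3
    "EMPLOYEE DEPARTMENT WORKS_ON PROJECT",   # Q4
    "EMPLOYEE WORKS_ON PROJECT",              # Q5
    "EMPLOYEE DEPARTMENT WORKS_ON PROJECT",   # Q6
    "EMPLOYEE DEPENDENT",                     # Q7
    "EMPLOYEE WORKS_ON PROJECT",              # Q8
    "EMPLOYEE DEPENDENT",                     # Q9
    "EMPLOYEE DEPENDENT",                     # Q10
    "EMPLOYEE DEPARTMENT",                    # Q11
    "EMPLOYEE DEPARTMENT",                    # Q12
    "WORKS_ON PROJECT",                       # Q13
    "WORKS_ON PROJECT",                       # Q14
    "EMPLOYEE WORKS_ON PROJECT",              # Q15
]


def closed_itemsets(transactions, min_count, min_size=2):
    """Frequent closed itemsets with at least `min_size` items, by brute force."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    support = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            n = sum(1 for t in transactions if itemset <= t)
            if n >= min_count:
                support[itemset] = n
    # Closed: no proper superset has the same support. Any such superset
    # would itself be frequent, so checking `support` is sufficient.
    return {
        itemset: n
        for itemset, n in support.items()
        if len(itemset) >= min_size
        and not any(itemset < other and support[other] == n for other in support)
    }


closed = closed_itemsets([q.split() for q in QUERIES], min_count=2)
```

For example, {DEPARTMENT, PROJECT} occurs in three queries but is not closed, because {EMPLOYEE, DEPARTMENT, PROJECT} occurs in the same three.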


Page 33: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Step 1: Example (Fuzzy Sets)

Fuzzy Sets

{WORKS_ON: 0.500, PROJECT: 0.304}

{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}

{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}

{EMPLOYEE: 0.231, PROJECT: 0.261}

{EMPLOYEE: 0.269, DEPARTMENT: 0.583}

{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}

{EMPLOYEE: 0.115, DEPENDENT: 1.000}

Itemset                                    Frequency
EMPLOYEE, DEPARTMENT, WORKS_ON, PROJECT    2
EMPLOYEE, WORKS_ON, PROJECT                5
EMPLOYEE, DEPARTMENT, PROJECT              3
EMPLOYEE, PROJECT                          6
WORKS_ON, PROJECT                          7
EMPLOYEE, DEPARTMENT                       7
EMPLOYEE, DEPENDENT                        3
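The slide does not state how the membership degrees are obtained. The values shown are consistent with dividing the frequency of a closed itemset by the total frequency of all closed itemsets that contain the table; the sketch below follows that reading, which is an inference from the numbers rather than a formula given in the talk.

```python
# Frequent closed itemsets and their frequencies, from the table above.
CLOSED = {
    ("EMPLOYEE", "DEPARTMENT", "WORKS_ON", "PROJECT"): 2,
    ("EMPLOYEE", "WORKS_ON", "PROJECT"): 5,
    ("EMPLOYEE", "DEPARTMENT", "PROJECT"): 3,
    ("EMPLOYEE", "PROJECT"): 6,
    ("WORKS_ON", "PROJECT"): 7,
    ("EMPLOYEE", "DEPARTMENT"): 7,
    ("EMPLOYEE", "DEPENDENT"): 3,
}


def fuzzy_sets(closed):
    """Membership of a table in an itemset = itemset frequency divided by the
    summed frequency of all closed itemsets containing that table (assumed)."""
    totals = {}
    for itemset, freq in closed.items():
        for table in itemset:
            totals[table] = totals.get(table, 0) + freq
    return {
        itemset: {table: round(freq / totals[table], 3) for table in itemset}
        for itemset, freq in closed.items()
    }


sets = fuzzy_sets(CLOSED)
```

For instance, WORKS_ON appears in closed itemsets with frequencies 2, 5 and 7, so its membership in {WORKS_ON, PROJECT} is 7 / 14 = 0.500.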


Page 34: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Example (Fuzzy Sets)

SUGGESTED ALLOCATION, NO REPLICATION CASE

{WORKS_ON: 0.500, PROJECT: 0.304}

{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}

{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}

{EMPLOYEE: 0.231, PROJECT: 0.261}

{EMPLOYEE: 0.269, DEPARTMENT: 0.583, DEPENDENT: 1.000}

{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}

{EMPLOYEE: 0.115}

Fuzzy Sets

{WORKS_ON: 0.500, PROJECT: 0.304}

{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}

{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}

{EMPLOYEE: 0.231, PROJECT: 0.261}

{EMPLOYEE: 0.269, DEPARTMENT: 0.583}

{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}

{EMPLOYEE: 0.115, DEPENDENT: 1.000}


Page 35: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Example (Fuzzy Sets)

SUGGESTED ALLOCATION, REPLICATION CASE; AT MOST THREE REPLICAS ALLOWED

{WORKS_ON: 0.500, PROJECT: 0.304}

{EMPLOYEE: 0.192, WORKS_ON: 0.357, PROJECT: 0.217}

{EMPLOYEE: 0.115, PROJECT: 0.130, DEPARTMENT: 0.250}

{EMPLOYEE: 0.231, PROJECT: 0.261, DEPARTMENT: 0.250}

{EMPLOYEE: 0.269, DEPARTMENT: 0.583, DEPENDENT: 1.000}

{EMPLOYEE: 0.077, WORKS_ON: 0.143, PROJECT: 0.087, DEPARTMENT: 0.167}

{EMPLOYEE: 0.115, DEPENDENT: 1.000}


Page 36: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Step 2: Building the Network

Each item (table) is a node in the network

An edge exists between two nodes if they appear together in at least one frequent closed itemset

The weight of an edge between two nodes is related to the number of frequent closed itemsets in which the corresponding tables appear together

Weights are normalized


Page 37: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Step 2: Example

Network of tables

Note: Table DEPT_LOCATIONS is not included in the graph since this table did not appear in any of the queries


Page 38: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Step 3: Applying Network Analysis

Various network analysis techniques can be used to extract relationships among tables from the network of tables

Centrality measures can be used to identify the tables that are related to many other tables and consequently play a key role in linking data from different tables together

Graph clustering algorithms can be applied to find groups of tables that are frequently accessed together in queries


Page 39: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Step 3: Example (Centrality Measures)

Table         Degree (unweighted)   Closeness   Betweenness
EMPLOYEE      4                     0.40        6
DEPARTMENT    3                     0.27        4
WORKS_ON      3                     0.25        4
PROJECT       3                     0.36        4
DEPENDENT     1                     0.18        4
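The unweighted degree column can be reproduced from the seven frequent closed itemsets of Step 1, as sketched below. The closeness and betweenness columns depend on how edge weights are normalised and converted to distances, which the slides do not specify, so they are not recomputed here. Function names are illustrative.

```python
from itertools import combinations

# Frequent closed itemsets from Step 1 of the example.
CLOSED_ITEMSETS = [
    ("EMPLOYEE", "DEPARTMENT", "WORKS_ON", "PROJECT"),
    ("EMPLOYEE", "WORKS_ON", "PROJECT"),
    ("EMPLOYEE", "DEPARTMENT", "PROJECT"),
    ("EMPLOYEE", "PROJECT"),
    ("WORKS_ON", "PROJECT"),
    ("EMPLOYEE", "DEPARTMENT"),
    ("EMPLOYEE", "DEPENDENT"),
]


def co_occurrence_links(itemsets):
    """Map each pair of tables to the number of itemsets containing both."""
    links = {}
    for itemset in itemsets:
        for a, b in combinations(sorted(itemset), 2):
            links[(a, b)] = links.get((a, b), 0) + 1
    return links


def unweighted_degree(links):
    """Number of distinct neighbours of each table."""
    degree = {}
    for a, b in links:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    return degree


links = co_occurrence_links(CLOSED_ITEMSETS)
degree = unweighted_degree(links)
```

EMPLOYEE is linked to all four other tables, while DEPENDENT is linked only to EMPLOYEE, which matches the degree column above.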


Page 40: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Step 3: Example (Clustering Results)

Edge betweenness clusters
C1: EMPLOYEE, PROJECT, DEPARTMENT
C2: WORKS_ON
C3: DEPENDENT

MST clusters
C1: DEPENDENT
C2: EMPLOYEE, WORKS_ON, PROJECT
C3: DEPARTMENT

Clustering results may seem meaningless since in this example we have 5 highly correlated nodes in the graph


Page 41: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Experiment 1: Centrality Measures

This experiment has been done on a synthetic dataset of 14 tables (T0 to T13) and 20 queries, min-support-threshold = 2

High degree nodes: T10 (6), T14 (4)

High closeness nodes: T10 (0.25), T14 (0.20)

High betweenness nodes: T10 (86), T14 (49)


Page 42: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Experiment 1: Clustering Result

Edge betweenness clusters
C1: T11, T12, T13, T14
C2: T1, T0, T2
C3: T4, T5, T10, T8, T3

MST clusters
C1: T11
C2: T4, T3
C3: T5, T10, T12, T13, T8, T14, T1, T0, T2


Page 43: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Experiment 2: Centrality Measures

The experiment has been done on a synthetic dataset of 14 tables (T0 to T13) and 30 queries, min-support-threshold = 1

High degree nodes: T7 (12), T10 (11)

High closeness nodes: T10 (0.20), T7 (0.19)

High betweenness nodes: T7 (43), T10 (31)


Page 44: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Experiment 2: Clustering Result

Edge betweenness clusters
C1: T6
C2: T8
C3: T4, T5, T3, T2
C4: T1, T0
C5: T7, T10, T11, T12, T13, T14, T9

MST clusters
C1: T6, T8
C2: T11
C3: T7, T9
C4: T10, T12, T13, T14, T1, T0, T2
C5: T4, T5, T3


Page 45: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


To further demonstrate the effectiveness of the proposed approach in practice, we conducted another experiment using a synthetic query set of 1000 queries on 50 tables

Finding real data is very hard because this type of data is very sensitive and hence highly confidential

We generated the data by restricting the number of tables that can appear in the same query to at most 20; that is, one query may access at most 20 different tables, though in practice a query rarely accesses more than four or five tables


Page 46: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Page 47: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


These are four example communities:

{T6, T8, T9, T22, T23, T24, T33}

{T6, T9, T21, T37, T42, T45}

{T5, T6, T11, T13, T14, T16, T19}

{T6, T7, T9, T10, T12, T13, T19}


Page 48: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From Frequent Patterns to Network construction

Page 49: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Overview

Given a dataset, e.g., emails exchanged between a group of people, such as employees of the same company

Partition the dataset into groups based on the criteria to be studied
To study the employees, all emails are grouped such that emails of the same employee form one group

Decide on the items to be considered in the analysis
E.g., each email could be a transaction, and the words and email addresses within the header/text could be items

Mine FPs within each group and globally

Find relevant features for each group based on entropy


Page 50: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


The Proposed Framework

[Figure: framework diagram. Components: Feature Extraction Model, Network Creation Model, Statistical Analysis Model, Front End Interface and Visualization Tool. Steps shown: mine frequent closed patterns; calculate weights of features to create feature vectors; select suitable features based on entropy ranking. Data labels: Freq. Closed Pats., Features]

Page 51: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Feature Extraction Model: The Feature Vector

The feature vector related to entity $e_j$ with $m$ features is represented as

$F_j = (w(f_1), w(f_2), \ldots, w(f_m))$,

where $w(f_k)$ is the weight of the $k$-th feature $f_k$ in entity $e_j$.

Page 52: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Feature Extraction Model: Weight of a Feature

The weight of each feature is calculated using the following formula:

$w_{D_j}(f_k) = \dfrac{\mathrm{sup}_{D_j}(f_k)}{\mathrm{sup}_{D}(f_k)}$

where $w_{D_j}(f_k)$ is the weight of feature $f_k$ for entity $e_j$, $\mathrm{sup}_{D_j}(f_k)$ is the frequency of feature $f_k$ across the dataset $D_j$ of entity $e_j$, and $\mathrm{sup}_{D}(f_k)$ is the frequency of $f_k$ across the dataset $D$ of all entities $E$.
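The weight formula can be sketched as follows. The mailboxes and features are hypothetical, and "frequency" is taken here as a raw transaction count, which is an assumption; with equally sized per-entity datasets a relative frequency would change the weights only by a constant factor.

```python
def feature_weights(entity_transactions, features):
    """Weight of feature k for entity j = sup_Dj(fk) / sup_D(fk).

    `entity_transactions` maps an entity to its list of transactions (sets of
    items); `features` is a list of itemsets. Support is a raw count here.
    """
    def count(transactions, feature):
        return sum(1 for t in transactions if feature <= t)

    all_transactions = [t for ts in entity_transactions.values() for t in ts]
    vectors = {}
    for entity, transactions in entity_transactions.items():
        vector = []
        for feature in features:
            total = count(all_transactions, feature)
            vector.append(count(transactions, feature) / total if total else 0.0)
        vectors[entity] = vector
    return vectors


# Hypothetical mailboxes: each e-mail is a set of stem words.
mail = {
    "alice": [{"gas", "price"}, {"gas"}, {"meeting"}],
    "bob": [{"gas", "price"}, {"meeting"}, {"meeting", "price"}],
}
features = [frozenset({"gas"}), frozenset({"gas", "price"}), frozenset({"meeting"})]
vectors = feature_weights(mail, features)
```

Each entity ends up with one vector of m weights, and the weights of a given feature sum to 1 across entities whenever the feature occurs at all.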

Page 53: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Experimental Results: Enron E-mail dataset description

The dataset contains 500,000 e-mail messages over 150 Enron employees.

For this analysis, only inboxes with more than 1000 e-mails were considered.

From each such inbox we randomly chose 1000 e-mails, which form the e-mail dataset for the corresponding user.

Page 54: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Experimental Results: Processing Enron E-mail dataset

Identify itemsets from the e-mail dataset:

The stem words appearing in the body and the subject line of the e-mails are considered as items.

E-mail addresses inside the e-mails are identified as items as well.

The items appearing in a single e-mail form a single transaction.

In this way, for each user we build a transactional database of 1000 transactions, one for each of the 1000 sampled e-mails.

From these transactional databases we identify the globally frequent closed itemsets (corresponding to a support of 10%).

Based on entropy ranking, we chose the top 100 closed itemsets as our feature set.

Page 55: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Experimental Results: Euclidean Distance Matrix for Enron Users

user buy dean ermis jones kaminski keavey lokey may sager saibi salisbury shackleton thomas whalley ybarbo

buy 0.00 0.65 0.57 0.26 0.43 0.41 0.43 0.35 0.32 0.36 0.25 0.22 0.65 0.60 0.59

dean 0.65 0.00 0.13 0.50 0.28 0.50 0.27 0.68 0.40 0.44 0.73 0.64 0.08 0.10 0.13

ermis 0.57 0.13 0.00 0.44 0.22 0.44 0.21 0.61 0.33 0.38 0.65 0.56 0.15 0.14 0.16

jones 0.26 0.50 0.44 0.00 0.27 0.35 0.29 0.38 0.19 0.26 0.36 0.21 0.50 0.47 0.44

kaminski 0.43 0.28 0.22 0.27 0.00 0.31 0.16 0.47 0.17 0.28 0.51 0.39 0.28 0.25 0.25

keavey 0.41 0.50 0.44 0.35 0.31 0.00 0.38 0.25 0.30 0.41 0.45 0.38 0.51 0.47 0.50

lokey 0.43 0.27 0.21 0.29 0.16 0.38 0.00 0.50 0.22 0.25 0.52 0.41 0.27 0.25 0.24

may 0.35 0.68 0.61 0.38 0.47 0.25 0.50 0.00 0.40 0.45 0.35 0.33 0.69 0.65 0.67

sager 0.32 0.40 0.33 0.19 0.17 0.30 0.22 0.40 0.00 0.25 0.44 0.28 0.40 0.36 0.36

saibi 0.36 0.44 0.38 0.26 0.28 0.41 0.25 0.45 0.25 0.00 0.45 0.34 0.43 0.41 0.41

salisbury 0.25 0.73 0.65 0.36 0.51 0.45 0.52 0.35 0.44 0.45 0.00 0.30 0.75 0.70 0.70

shackleton 0.22 0.64 0.56 0.21 0.39 0.38 0.41 0.33 0.28 0.34 0.30 0.00 0.63 0.60 0.59

thomas 0.65 0.08 0.15 0.50 0.28 0.51 0.27 0.69 0.40 0.43 0.75 0.63 0.00 0.09 0.13

whalley 0.60 0.10 0.14 0.47 0.25 0.47 0.25 0.65 0.36 0.41 0.70 0.60 0.09 0.00 0.11

ybarbo 0.59 0.13 0.16 0.44 0.25 0.50 0.24 0.67 0.36 0.41 0.70 0.59 0.13 0.11 0.00

Distance cutoff point 0.30
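As a rough illustration of how a cutoff turns a distance matrix into a network, the sketch below links two users whenever their distance does not exceed the cutoff. The rule "distance <= cutoff" is an assumption, since the slide only states the cutoff value. The four users and their distances are taken from the matrix above; the function name is hypothetical.

```python
def edges_from_distances(names, dist, cutoff):
    """Return (user, user, distance) for every pair whose distance is <= cutoff."""
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):  # upper triangle only: matrix is symmetric
            if dist[i][j] <= cutoff:
                edges.append((names[i], names[j], dist[i][j]))
    return edges


# Sub-matrix of the Enron distance matrix for four of the fifteen users.
names = ["buy", "dean", "thomas", "whalley"]
dist = [
    [0.00, 0.65, 0.65, 0.60],
    [0.65, 0.00, 0.08, 0.10],
    [0.65, 0.08, 0.00, 0.09],
    [0.60, 0.10, 0.09, 0.00],
]
edges = edges_from_distances(names, dist, cutoff=0.30)
```

With these four users, dean, thomas and whalley end up mutually linked, while buy stays unconnected to them.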

Page 56: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Experimental Results: The Enron E-mail users’ social network based on e-mail usage

Page 57: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Experimental Results: The Enron E-mail users’ social network based on e-mail usage

Five clusters of Enron e-mail users:

1 saibi

2 buy, salisbury, shackleton, jones

3 dean, ermis, jones, kaminski, lokey, sager, thomas, whalley, ybarbo

4 keavey

5 may

Page 58: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From Association Rules to Network

Page 59: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Basic Steps

Given a website, the mining process can be applied along three dimensions: content, structure and log.

Actors in the network are the pages.

Construct the adjacency matrix by mining association rules from the transactional database obtained after preprocessing the web log data:

Each transaction is a set of pages accessed together in one session.

A frequent pattern mining (FPM) algorithm, e.g., Apriori or FP-growth, is applied to the derived transactional data, and association rules are derived.

Page 60: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Basic Steps

Determine frequent itemsets

Find association rules

Add the items in each rule as nodes in the graph and connect the items on the left side to the items on the right side (directed edges)

Use support and confidence to find a combined weight for each added edge

If an edge already exists, add the new weight to the existing weight of the edge

Analyze the graph using SNA techniques
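The edge-building steps above can be sketched as follows. The slides do not say how support and confidence are combined, so the product used here is only an assumed placeholder; the rules, page names and function name are hypothetical as well.

```python
from collections import defaultdict


def rules_to_graph(rules):
    """Build a weighted directed graph from association rules.

    rules: iterable of (antecedent, consequent, support, confidence).
    Returns {(source, target): weight}.
    """
    weights = defaultdict(float)
    for antecedent, consequent, support, confidence in rules:
        weight = support * confidence  # ASSUMPTION: combined weight = product
        for left in antecedent:
            for right in consequent:
                # An edge that already exists accumulates the new weight.
                weights[(left, right)] += weight
    return dict(weights)


# Hypothetical rules mined from web sessions.
rules = [
    ({"home"}, {"products"}, 0.5, 0.8),
    ({"home", "products"}, {"cart"}, 0.25, 0.5),
    ({"home"}, {"cart"}, 0.25, 0.4),
]
graph = rules_to_graph(rules)
```

Here the edge home -> cart is produced by two different rules, so its weight is the sum of both contributions.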

Page 61: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From Association Rules to Social Network

Page 62: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From Association Rules to Social Network

Analyze the web log

Determine frequent sets of pages based on frequency of pages accessed together

Determine rules and keep only those satisfying minimum confidence

Construct network of pages based on rules

Page 63: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From Association Rules to Network

Each rule is reflected in the adjacency matrix by incrementing every entry (i, j) such that pages i and j exist in the antecedent and consequent of the rule, respectively.

Entries in the adjacency matrix are normalized by dividing each value by the overall average of the values that exist in the matrix.

The network is analyzed to rank the pages by considering their in-degrees, out-degrees, betweenness and eigenvector centrality.

Pages with high betweenness centrality are considered important because they link pages from different communities.
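A minimal sketch of the matrix construction and normalization described above. It reads "the values that exist in the matrix" as the non-zero entries, which is an assumption, and it computes only weighted in- and out-degrees; betweenness and eigenvector centrality would normally come from a graph library or a tool such as Visone. The function names and example rules are hypothetical.

```python
def adjacency_from_rules(pages, rules):
    """Increment entry (i, j) for each antecedent page i and consequent page j,
    then divide every entry by the average of the non-zero entries."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    matrix = [[0.0] * n for _ in range(n)]
    for antecedent, consequent in rules:
        for a in antecedent:
            for c in consequent:
                matrix[index[a]][index[c]] += 1.0
    nonzero = [v for row in matrix for v in row if v > 0]
    if nonzero:
        mean = sum(nonzero) / len(nonzero)  # ASSUMPTION: average over non-zero entries
        matrix = [[v / mean for v in row] for row in matrix]
    return matrix


def degrees(matrix):
    """Return (in_degree, out_degree) as weighted column and row sums."""
    out_degree = [sum(row) for row in matrix]
    in_degree = [sum(col) for col in zip(*matrix)]
    return in_degree, out_degree


pages = ["home", "products", "cart"]
rules = [
    ({"home"}, {"products"}),
    ({"home", "products"}, {"cart"}),
    ({"home"}, {"cart"}),
]
matrix = adjacency_from_rules(pages, rules)
in_degree, out_degree = degrees(matrix)
```

In this example the raw counts are 1, 2 and 1, their average is 4/3, and the normalized entries become 0.75, 1.5 and 0.75.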

Page 64: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From Association Rules to Social Network

The analysis was done using the software Visone (http://visone.info/)

Betweenness Centrality measure

Page 65: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From Association Rules to Social Network

Closeness Centrality measure

Page 66: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From Association Rules to Social Network

Eigenvector Centrality measure

Page 67: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From Multi-objective GA based clustering to Network Construction

The case of Genes/Proteins

Page 68: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Motivation

In most traditional clustering algorithms, the number of clusters is given a priori.

In fact, the clustering criterion depends on more than one objective!

Cluster validation is used to assess the number of clusters.

Multi-objective clustering must work on small and large data sets.

Page 69: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Objective Functions For Clustering

Three objectives:

F1 : minimize the number of clusters

F2 : maximize the heterogeneity between clusters

F3 : maximize the within-cluster homogeneity
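To make the three objectives concrete, the sketch below evaluates one candidate clustering. The exact objective functions of the talk are not reproduced in this transcript, so the measures used here are stand-ins chosen for illustration: mean distance between centroids for heterogeneity, and mean distance of points to their own centroid for (lack of) homogeneity. All names are hypothetical.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def objectives(clusters):
    """Return (number of clusters, separation, scatter) for one clustering.

    ASSUMED measures, not the talk's formulas:
      separation = mean pairwise distance between centroids (larger is better)
      scatter    = mean distance of points to their centroid (smaller means
                   higher within-cluster homogeneity)
    """
    cents = [centroid(c) for c in clusters]
    k = len(clusters)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    separation = (
        sum(math.dist(cents[i], cents[j]) for i, j in pairs) / len(pairs)
        if pairs
        else 0.0
    )
    total = sum(len(c) for c in clusters)
    scatter = sum(
        math.dist(p, cents[i]) for i, c in enumerate(clusters) for p in c
    ) / total
    return k, separation, scatter


two_clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
one_cluster = [[(0, 0), (0, 2), (10, 0), (10, 2)]]
k2, sep2, scat2 = objectives(two_clusters)
k1, sep1, scat1 = objectives(one_cluster)
```

The single-cluster solution wins on the number of clusters but loses on scatter, which is the kind of trade-off a multi-objective search has to balance.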

Page 70: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Objective functions

Page 71: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Divide and Conquer

Basic Steps:

If the dataset to be clustered is of manageable size then it is clustered as a whole set.

Otherwise, repeat the following steps:

Partition the dataset (or set of centroids after the first iteration) into subsets of manageable size

Cluster each subset individually by applying multi-objective GA combined with validity analysis to get the centroids of the obtained clusters

If the set of all centroids is of manageable size then cluster the whole set of centroids and exit the loop

Backtrack to merge clusters that have their centroids ending up in the same final cluster
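The control flow of the steps above can be sketched as follows. The base clusterer here is a simple threshold-based "leader" procedure used only as a stand-in for the multi-objective GA with validity analysis; the parameters `max_size` (the "manageable size") and `radius`, and all function names, are hypothetical.

```python
import math


def leader_cluster(points, radius):
    """Stand-in base clusterer (NOT the multi-objective GA): a point joins the
    first cluster whose seed is within `radius`, else it seeds a new cluster.
    Returns clusters as lists of positions into `points`."""
    seeds, members = [], []
    for idx, p in enumerate(points):
        for c, seed in enumerate(seeds):
            if math.dist(p, seed) <= radius:
                members[c].append(idx)
                break
        else:
            seeds.append(p)
            members.append([idx])
    return members


def centroid(points):
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def divide_and_conquer(points, max_size, radius):
    """Cluster `points`; returns clusters as sorted lists of original indices."""
    # Each item is (representative point, original indices it stands for).
    items = [(p, [i]) for i, p in enumerate(points)]
    while len(items) > max_size:
        next_items = []
        # Partition the current set into subsets of manageable size.
        for start in range(0, len(items), max_size):
            chunk = items[start:start + max_size]
            for group in leader_cluster([rep for rep, _ in chunk], radius):
                covered = [i for g in group for i in chunk[g][1]]
                next_items.append((centroid([points[i] for i in covered]), covered))
        if len(next_items) == len(items):
            break  # no reduction possible; fall through to a single final pass
        items = next_items
    # Cluster the whole set of centroids, then map back to original points.
    final = leader_cluster([rep for rep, _ in items], radius)
    return [sorted(i for g in group for i in items[g][1]) for group in final]


points = [(0.0,), (0.2,), (5.0,), (5.1,), (0.1,), (5.2,)]
clusters = divide_and_conquer(points, max_size=3, radius=1.0)
```

Carrying the original indices along with each centroid plays the role of the backtracking step: clusters whose centroids land in the same final cluster are merged automatically.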

Page 72: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Unique Solution of Compact Clusters

Page 73: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From Alternative Solutions to Adjacency Matrix

Both the rows and the columns of the matrix are genes.

Entry (i, j) specifies the number of solutions in which Gene i and Gene j occurred in the same cluster.
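A small sketch of how such a matrix can be filled from a set of alternative clustering solutions. Genes are represented by integer indices, the diagonal is left at zero as a convention of this sketch, and the three example solutions are made up for illustration.

```python
def coassociation(n_genes, solutions):
    """Count, for each gene pair, the solutions that put both in one cluster.

    solutions: list of clusterings; each clustering is a list of clusters;
    each cluster is a list of gene indices in range(n_genes).
    """
    matrix = [[0] * n_genes for _ in range(n_genes)]
    for clustering in solutions:
        for cluster in clustering:
            for i in cluster:
                for j in cluster:
                    if i != j:  # diagonal kept at zero in this sketch
                        matrix[i][j] += 1
    return matrix


# Three hypothetical alternative solutions over four genes.
solutions = [
    [[0, 1], [2, 3]],
    [[0, 1, 2], [3]],
    [[0], [1, 2, 3]],
]
matrix = coassociation(4, solutions)
```

The resulting matrix is symmetric, and its entries can serve directly as edge weights of the gene network.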

Page 74: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


From Adjacency Matrix to Network

Page 75: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Criminal and Terror Network Analysis

Page 76: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Terror Network Analysis by Clustering

We developed a framework that employs clustering, frequent pattern mining and some social network analysis measures to determine the effectiveness of a network.

The clustering and frequent pattern mining techniques start with the adjacency matrix of the network.

For clustering, we utilize entries in the table by considering each row as an object and each column as a feature.

The features of a network member are his/her direct neighbors. We maintain the weights of the links in the case of weighted network links.


Page 77: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Multi-Objective GA based Clustering

We applied multi-objective GA based clustering


Page 78: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Terror Network Analysis by Clustering & FPM

For clustering, we consider each row as an instance and each column as a feature

We cluster instances to find important groups and individuals within the network

For frequent pattern mining, we consider each row of the adjacency matrix as a transaction and each column as an item.

We map entries into a 0/1 scale such that every entry whose value is greater than zero is assigned the value one; entries keep the value zero otherwise.

This way we can apply frequent pattern mining algorithms to determine the most influential members in a network as well as the effect of removing some members or even links between members of a network.
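The mapping from an adjacency matrix to transactions can be sketched as below. Every positive entry becomes a 1, each row becomes the set of a member's neighbors, and a simple counter stands in for a full frequent pattern mining algorithm (only itemsets up to size 2 are counted). The member names, weights and function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def to_transactions(adjacency, names):
    """One transaction per row: the members whose entry is greater than zero."""
    return [{names[j] for j, v in enumerate(row) if v > 0} for row in adjacency]


def frequent_sets(transactions, min_count, max_size=2):
    """Count itemsets up to `max_size` and keep those seen at least `min_count` times."""
    counts = Counter()
    for t in transactions:
        for size in range(1, max_size + 1):
            for combo in combinations(sorted(t), size):
                counts[combo] += 1
    return {s: c for s, c in counts.items() if c >= min_count}


names = ["a", "b", "c", "d"]
adjacency = [  # weighted, symmetric toy network
    [0, 2, 1, 0],
    [2, 0, 3, 1],
    [1, 3, 0, 1],
    [0, 1, 1, 0],
]
transactions = to_transactions(adjacency, names)
freq = frequent_sets(transactions, min_count=2)
```

In this toy network the pair (b, c) is frequent because both a and d have b and c as neighbors, which hints at b and c acting together as a hub.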


Page 79: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Terror Network Analysis

We investigate the effect of adding some links between members.

We are able to study how the various members in the network change role as the network evolves.

This is measured by applying some SNA measures on the network at each stage during the development.

We report some interesting results on various benchmark networks, including the 9/11 and Madrid bombing networks.


Page 80: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Database Search

Page 81: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Problem Definition

You tell the computer what you want in terms that mean something to you, using fuzzy sets

You ask the computer your question using the fuzzy term

The computer tells you how accurate your results are: the degree of membership


Page 82: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Related Work: Database Search

Fuzzy Data Representation

Disadvantages:

Existing databases need to be re-structured

Prevents traditional users from executing standard (non-fuzzy) queries

Extending a Query Language to support fuzzy querying without changing the database itself

Disadvantages:

Commercially available DBMSs need to support a new query language

Requires users to learn the new query language


Page 83: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Motivation

Proposing an independent intermediate translation layer to incorporate fuzziness in:

the interface/querying facility of database systems to retrieve more accurate facts

Groups within a social network may share the same intermediate layer

Recommendation system based on SNA to help users in building their intermediate layer

The intermediate layer provides the mapping between fuzziness expected by the user and the actual crisp values stored in the data repository


Page 84: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Methodology

Fuzziness can be specified:

Manually: by a human expert

Semi-automatically: a human expert decides on the number of fuzzy sets; the intermediate layer defines the fuzzy sets

Fully automatically: by the intermediate layer

The intermediate layer uses the fuzzy set specifications to map between the fuzziness expected by the user and the actual crisp values stored in the data repository
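The mapping between a fuzzy term and crisp stored values can be illustrated with a minimal sketch. It assumes manually specified triangular membership functions over a numeric attribute; the attribute name, the fuzzy set boundaries, the rows and the function names are all hypothetical and are not taken from AskFuzzy.

```python
def triangular(a, b, c):
    """Membership function rising from a to a peak at b, then falling to c."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu


# Hypothetical fuzzy sets for a crisp "salary" column.
salary_sets = {
    "low": triangular(0, 20000, 40000),
    "medium": triangular(30000, 50000, 70000),
    "high": triangular(60000, 80000, 100000),
}


def fuzzy_select(rows, attribute, fuzzy_sets, term, threshold=0.0):
    """Return (row, degree) pairs whose membership in `term` exceeds `threshold`,
    ordered from the best to the weakest match."""
    mu = fuzzy_sets[term]
    hits = [(row, mu(row[attribute])) for row in rows]
    return sorted((h for h in hits if h[1] > threshold), key=lambda h: -h[1])


rows = [
    {"name": "r1", "salary": 50000},
    {"name": "r2", "salary": 35000},
    {"name": "r3", "salary": 90000},
]
result = fuzzy_select(rows, "salary", salary_sets, "medium")
```

The stored data stay crisp; only the query side is fuzzy, and each returned row carries its degree of membership so the user can judge how well it matches the term.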


Page 85: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Intelligent Database Search


Page 86: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis

AskFuzzy: Attractive Visual Fuzzy Query Builder*

Fuzzy Query

DBMS

Fuzzy Layer

1 Data Fuzzification

2 Fuzzy Query Construction

3 Fuzzy Query Execution

* ICDE 2011 IEEE International Conference on Knowledge Engineering http://cpsc.ucalgary.ca/~nkoochak/AskFuzzy/

• Transferring numeric values to fuzzy sets: Number of Fuzzy sets, Fuzzy sets Functions

Manual

Semi-automated

Full-automated

By User, By User

By System (Initial fuzzy sets: based on clustering result; optimized fuzzy sets: based on genetic algorithm optimization)

By System (Optimization process: min number of clusters, max cluster quality)

Page 87: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


Conclusions

Data mining and machine learning techniques could be integrated with network-based analysis.

The combination would lead to

A strong framework for data analysis from various perspectives.

Global correlations within the data are considered, which leads to more realistic results

A variety of application domains could benefit from the integrated setup


Page 88: The Power of Data Mining and Machine Learning Techniques for Network Construction and  Analysis


The End! Thank you for your attention

Reda [email protected]