

Panos M. Pardalos
Center for Applied Optimization
Dept. of Industrial & Systems Engineering

Affiliated Faculty of:
Computer & Information Science & Engineering Department
Biomedical Engineering Program, McKnight Brain Institute

Data Mining and Knowledge Discovery in Dynamic Networks


Massive Datasets

The proliferation of massive datasets brings with it a series of special computational challenges. This data avalanche arises in a wide range of scientific and commercial applications. With advances in computer and information technologies, many of these challenges are beginning to be addressed.

(Abello, Pardalos & Resende, 2002, Handbook of Massive Data Sets)


Knowledge Discovery in Databases (KDD)

KDD is the process of identifying valid, novel, potentially useful, and ultimately understandable structure (models and patterns) in the data.

Understand the application domain
Create a target dataset
Remove (or correct) corrupted data
Apply data-reduction algorithms
Apply data mining algorithms
Interpret the mined patterns


Graph Representation of Massive Datasets

In many cases, it is convenient to represent a dataset as a graph (network) with certain attributes associated with its vertices and edges.
Studying the properties of these graphs often provides useful information about the internal structure of the datasets they represent.


Important Concepts

A graph G = (V, E), where V is the set of vertices and E is the set of edges
Degrees of the vertices, degree distribution
Size of connected components
Edge density
Cliques and independent sets
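The concepts listed on this slide can be made concrete with a few lines of code. The following is a minimal sketch, assuming an undirected simple graph stored as an adjacency map; the function names and the five-vertex example graph are made up for illustration and do not come from the talk.

```python
from collections import Counter, deque

def build_adjacency(vertices, edges):
    """Return an undirected adjacency map {vertex: set of neighbors}."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def degree_distribution(adj):
    """Map each degree k to the number of vertices that have degree k."""
    return Counter(len(nbrs) for nbrs in adj.values())

def connected_components(adj):
    """Return the connected components as a list of vertex sets (BFS)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        component, queue = {start}, deque([start])
        while queue:
            u = queue.popleft()
            for w in adj[u] - seen:  # unseen neighbors only
                seen.add(w)
                component.add(w)
                queue.append(w)
        components.append(component)
    return components

def edge_density(adj):
    """Fraction of the n(n-1)/2 possible edges that are present."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return m / (n * (n - 1) / 2) if n > 1 else 0.0

# Illustrative graph (hypothetical): a triangle 1-2-3 plus a separate edge 4-5.
adj = build_adjacency([1, 2, 3, 4, 5], [(1, 2), (1, 3), (2, 3), (4, 5)])
```

For graphs of the size discussed later in the talk, an in-memory adjacency map like this would not be practical; the sketch only pins down the definitions.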


Example of a graph

[Figure: an example graph on five vertices, labeled 1 to 5]


Examples of Real-Life Massive Graphs

Web graph (links between websites)
Call graph (telephone traffic data)
Market graph (stock price data)
Brain networks (neurons and connections between them)


Degree Distribution: Power Law

The degree distribution of a graph characterizes global statistical patterns underlying the dataset this graph represents.
Interestingly, the degree distributions of all the real-life graphs considered here have a well-defined power-law structure.

The probability that a vertex has degree k (i.e., k neighbors) is

$P(k) \propto k^{-\gamma}$

or, equivalently,

$\log P(k) \propto -\gamma \log k$

(“Self-organized” networks)
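One rough way to check for power-law structure is to fit a straight line to the degree distribution on a log-log scale. The sketch below assumes the usual form, probability of degree k proportional to k raised to the power -γ, and estimates γ by ordinary least squares. This is only a quick diagnostic, not necessarily the fitting procedure used for the graphs in this talk; the function name and the synthetic counts are hypothetical.

```python
import math

def fit_power_law_exponent(degree_counts):
    """Least-squares slope of log P(k) against log k; returns gamma = -slope.

    degree_counts maps a degree k to the number of vertices with that degree.
    """
    total = sum(degree_counts.values())
    points = [(math.log(k), math.log(c / total))
              for k, c in degree_counts.items() if k > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx

# Synthetic counts proportional to k^-2 (made up for illustration).
counts = {k: 1_000_000 / k**2 for k in range(1, 51)}
gamma = fit_power_law_exponent(counts)
```

On real data the tail of the distribution is noisy, so a plain log-log regression should be read as an indication rather than a precise estimate.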


Cliques and Independent Sets

A clique is a subgraph of G that has all possible edges.
Cliques represent dense clusters of “similar” objects.
An independent set is a subgraph of G with no edges.
Independent sets represent groups of “different” objects.


Maximum Clique and Independent Set Problems

Of special interest is finding the maximum clique and the maximum independent set in a graph.
The maximum clique and maximum independent set problems can be transformed into each other using the complementary graph: a clique in G is an independent set in the complement of G, and vice versa.
These problems are NP-hard.
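The transformation between the two problems can be illustrated on a tiny graph. The following sketch finds a maximum clique by exhaustive search and obtains a maximum independent set by running the same search on the complementary graph. Exhaustive search takes exponential time, so this is only an illustration of the reduction, not a practical solver; all names and the example graph are hypothetical.

```python
from itertools import combinations

def complement(vertices, edges):
    """Edges of the complementary graph: all vertex pairs not joined in G."""
    present = {frozenset(e) for e in edges}
    return [(u, v) for u, v in combinations(vertices, 2)
            if frozenset((u, v)) not in present]

def is_clique(subset, edge_set):
    """True if every pair of vertices in subset is joined by an edge."""
    return all(frozenset(p) in edge_set for p in combinations(subset, 2))

def maximum_clique(vertices, edges):
    """Exhaustive search, largest subsets first (tiny graphs only)."""
    edge_set = {frozenset(e) for e in edges}
    for size in range(len(vertices), 0, -1):
        for subset in combinations(vertices, size):
            if is_clique(subset, edge_set):
                return set(subset)
    return set()

def maximum_independent_set(vertices, edges):
    """A maximum independent set of G is a maximum clique of its complement."""
    return maximum_clique(vertices, complement(vertices, edges))

# Illustrative graph: triangle 1-2-3 with a path 3-4-5 attached.
V = [1, 2, 3, 4, 5]
E = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
clique = maximum_clique(V, E)
independent = maximum_independent_set(V, E)
```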


Finding Cliques and Independent Sets

Heuristic algorithms (no guarantee of finding an optimal solution)
Exact algorithms (find a maximum clique or independent set)


Clique Partitioning

Minimum clique partition: dividing the graph into a minimum number of distinct cliques.
This provides a natural way of partitioning a dataset represented by a graph into clusters of “similar” objects (the clustering problem), where the number of clusters is the minimum number of cliques needed to partition the graph.


Graph Coloring

Coloring essentially partitions the graph into a minimum number of independent sets.
This partitions a dataset represented by a graph into clusters of “different” objects.
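A coloring can be read directly as a partition into independent sets: each color class contains no edges. The sketch below uses a simple greedy heuristic (vertices in decreasing degree order, smallest available color), which produces a valid partition but is not guaranteed to use the minimum number of colors. Running the same routine on the complementary graph would give a clique partition. The names and the example graph are hypothetical.

```python
def greedy_coloring(adj):
    """Give each vertex the smallest color not used by its colored neighbors.

    Vertices are visited in decreasing degree order. The result is a proper
    coloring, but not necessarily one with the minimum number of colors.
    """
    color = {}
    for v in sorted(adj, key=lambda u: -len(adj[u])):
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def color_classes(color):
    """Group vertices by color; each group is an independent set."""
    classes = {}
    for v, c in color.items():
        classes.setdefault(c, set()).add(v)
    return list(classes.values())

# Illustrative graph: triangle 1-2-3 with a path 3-4-5 attached.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
classes = color_classes(greedy_coloring(adj))
```

The triangle forces at least three colors here, so the greedy result happens to be optimal for this example.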


Call Graph

The “call graph” comes from telecommunications traffic. The vertices of this graph are telephone numbers, and the edges are calls made from one number to another (including additional billing data, such as the time of the call and its duration). The challenge in studying call graphs is that they are massive. Every day AT&T handles approximately 300 million long-distance calls. (American Scientist Online, Jan-Feb 2000)
Careful analysis of the call graph could help with infrastructure planning, customer classification and marketing.
How can we visualize such massive graphs? To flash a terabyte of data on a 1000x1000 screen, you need to cram a megabyte of data into each pixel!


Call Graph

In our experiments with data from telecommunication traffic, one instance of the corresponding multigraph has 53,767,087 vertices and over 170 million edges.
It is not a connected graph: it has 3.7 million separate components, yet a giant connected component with 44,989,297 vertices was computed.
The maximum (quasi-)clique problem is considered in this giant component. We found cliques of size 30, and there were more than 14,000 of these 30-member cliques.

(Abello, Pardalos & Resende)


Call graph


Call Graph

In a battlefield situation, just counting the messages or identifying the source and the intended recipient of each message, and constructing a call graph from them, yields valuable information such as the organization of a military force.
The records in the call database are collected for commercial purposes. In order to send an itemized bill, a phone company needs to keep track of every call completed, with the originating and receiving phone numbers and the starting and ending times. The largest companies handle roughly 250 million toll calls a day, and so a month's worth of data amounts to several billion call records. AT&T reports that its database of retained records is approaching two trillion calls and more than 300 terabytes of data.


Call Graph

Historical calling patterns can be used to detect fraud, and some patterns are also of interest in marketing. For example, a company that offers a discounted rate within a "calling circle" can use information from the call graph to estimate the costs and benefits of the program.
This kind of traffic data could be compiled for other communications channels. For instance, Federal Express and other courier services keep digitized records of their deliveries, which could readily be transformed into a database of senders and receivers. With a ‘packet sniffer’ installed in the network, the same kind of data can be compiled for e-mail traffic. (American Scientist Online, Sep-Oct 2006)


Degree Distribution of the Call Graph (data by AT&T)


Market Graph

Vertices are stocks, and an edge connects two stocks if the correlation between their price fluctuations over a certain period is greater than a specified threshold.
~6000 vertices (stocks)
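The construction of the market graph can be sketched as follows. The code assumes each stock comes with a list of per-period price changes and uses the Pearson correlation coefficient. The slide does not say how price fluctuations are measured, so the input format, the ticker names, and the numbers below are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def market_graph(changes, threshold):
    """Return the edges (a, b) whose correlation exceeds the threshold.

    changes maps a ticker to its list of per-period price changes.
    """
    tickers = sorted(changes)
    edges = []
    for i, a in enumerate(tickers):
        for b in tickers[i + 1:]:
            if pearson(changes[a], changes[b]) > threshold:
                edges.append((a, b))
    return edges

# Hypothetical tickers and made-up price changes.
changes = {
    "AAA": [0.01, -0.02, 0.03, 0.00, -0.01],
    "BBB": [0.02, -0.03, 0.05, 0.01, -0.02],
    "CCC": [-0.01, 0.02, -0.02, 0.00, 0.01],
}
edges = market_graph(changes, threshold=0.5)
```

Raising the threshold removes edges and lowering it adds them, which is why the later slides report results for several thresholds.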


Market Graph

The market graph (all the considered instances for different correlation thresholds) follows the power-law model.
Using a combination of heuristic and exact algorithms, the exact solution of the maximum clique problem was found.

(Boginski, Butenko & Pardalos)


Degree Distribution of the Market Graph


Finding Cliques in the Market Graph

Applying a heuristic algorithm to find a large clique: let N(i) be the set of neighbors of the vertex i
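The steps of the heuristic are not preserved in this transcript. One common greedy scheme that fits the notation N(i) is sketched below, purely as an assumption about the kind of algorithm meant: repeatedly pick the candidate vertex with the most neighbors among the remaining candidates, add it to the clique, and restrict the candidates to its neighbors. The function name and the example graph are hypothetical.

```python
def greedy_clique(adj):
    """Grow a clique greedily; adj maps each vertex i to its neighbor set N(i).

    At each step the candidate with the most neighbors among the remaining
    candidates is added, and the candidates shrink to N(i). The result is a
    maximal clique, not necessarily a maximum one.
    """
    clique = []
    candidates = set(adj)
    while candidates:
        # sorted() makes tie-breaking deterministic (smallest label wins).
        v = max(sorted(candidates), key=lambda u: len(adj[u] & candidates))
        clique.append(v)
        candidates &= adj[v]  # assumes no self-loops, so v itself drops out
    return clique

# Illustrative graph: triangle 1-2-3 with a path 3-4-5 attached.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
found = greedy_clique(adj)
```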


Finding Cliques in the Market Graph

Preprocessing procedure: let C be the clique found by the heuristic algorithm. Recursively remove from the graph all vertices that are not in C and whose degree is less than |C|.

Denote the resulting (reduced) graph as G' = (V', E')
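The preprocessing step can be sketched in a few lines. The idea is that any vertex belonging to a clique larger than C must have at least |C| neighbors, so vertices outside C with smaller degree can be deleted, and deleting them may lower other degrees, hence the repetition. The function name and the example are hypothetical; the adjacency-map representation is an assumption.

```python
def prune_for_clique(adj, clique):
    """Repeatedly delete vertices outside `clique` whose degree is below |clique|.

    Returns the adjacency map of the reduced graph G' = (V', E').
    """
    keep = set(adj)
    protected = set(clique)
    size = len(protected)
    changed = True
    while changed:
        changed = False
        for v in list(keep):
            # Degree is measured in the current (already reduced) graph.
            if v not in protected and len(adj[v] & keep) < size:
                keep.discard(v)
                changed = True
    return {v: adj[v] & keep for v in keep}

# Illustrative graph: triangle 1-2-3 with a path 3-4-5 attached.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
reduced = prune_for_clique(adj, clique=[1, 2, 3])
```

In this example vertices 4 and 5 have degree below 3 and are removed, so only the clique itself remains and it is therefore maximum.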


Finding Cliques in the Market Graph

Using the IP formulation of the maximumclique problem to find the exact solution:
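The formula itself does not appear in this transcript. A standard integer programming formulation of the maximum clique problem on the reduced graph G' = (V', E'), given here as an assumption about what the slide showed, uses one binary variable x_i per vertex and forbids selecting two vertices that are not joined by an edge:

```latex
\begin{aligned}
\max\ & \sum_{i \in V'} x_i \\
\text{s.t.}\ & x_i + x_j \le 1 \quad \text{for all } i < j \text{ with } (i,j) \notin E', \\
& x_i \in \{0,1\} \quad \text{for all } i \in V'.
\end{aligned}
```

The number of constraints equals the number of non-adjacent pairs, which is one reason reducing the graph before solving matters.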


Maximum Clique Size for Different Correlation Thresholds

Large cliques despite very low edge density: this confirms the idea of the “globalization” of the market


Classification of Stocks Using Clique Partitioning

A clique in the market graph represents a dense cluster of stocks whose prices exhibit similar behavior over time.
Therefore, dividing the market graph into a set of distinct cliques (clique partitioning) is a natural approach to classifying stocks: it divides the set of stocks into clusters of similar objects, which is one way of solving the clustering problem.


Independent Sets in the Market Graph

The maximum independent set represents the largest “perfectly diversified” portfolio.
It is sought by solving the maximum clique problem in the complementary graph.
The preprocessing procedure could not reduce the size of the initial graph, so the exact solution could not be found.
Large diversified portfolios are hard to find.


Independent Set Sizes for Different Correlation Thresholds

Relatively small independent sets were found by the heuristic algorithm


Independent Sets in the Market Graph

Finding a perfectly diversified portfolio containing any given stock
For every vertex in the market graph, an independent set that contains this vertex was detected, and the sizes of these independent sets were almost the same. This means that it is possible to find a diversified portfolio containing any given stock using the market graph methodology.


Maximum Clique Size for Different Correlation Thresholds

Maximum clique size for various thresholds in the Food Market Graph


Independent Sets in the Market Graph

Maximum independent set size for various thresholds in the Food Market Graph


Connected Components in the Market Graph

Intuition
Two stocks are correlated if their corresponding nodes are connected by an edge.
Power-law graphs generally have a very high clustering coefficient, i.e., two nodes that are both associated with a common node tend to be associated with each other.
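The clustering coefficient mentioned here has a simple operational definition. The sketch below computes the local clustering coefficient of a vertex (the fraction of pairs of its neighbors that are themselves adjacent) and the average over all vertices. This is the common textbook definition, assumed here since the slide does not spell one out; the names and the example graph are hypothetical.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of pairs of v's neighbors that are joined by an edge."""
    nbrs = sorted(adj[v])
    if len(nbrs) < 2:
        return 0.0  # convention: undefined cases count as zero
    pairs = list(combinations(nbrs, 2))
    linked = sum(1 for a, b in pairs if b in adj[a])
    return linked / len(pairs)

def average_clustering(adj):
    """Mean of the local clustering coefficients over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)

# Illustrative graph: triangle 1-2-3 with a path 3-4-5 attached.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4}}
avg = average_clustering(adj)
```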


Connected Components in the Market Graph

[Figure: Largest Group Size by Time Period. Three panels titled "Group Size by Time Period" for (0.7), (0.6) and (0.5); x-axis: Time Period (1 to 11); y-axis: Largest Group Size.]


Connected Components in the Market Graph

Observations
The increase in the giant component size from the oldest to the newest time period indicates the globalization tendency, just as the maximum clique size and edge density do.
The giant component includes semiconductor industries, and the increase in its size corroborates the observation that the number of these industries has been increasing with time.


Additional Applications

Social Networks
Biological Networks
Transportation Networks (place of living and place of work)


References

J. Abello, P.M. Pardalos, and M.G.C. Resende (eds.), 2002. Handbook of Massive Data Sets, Kluwer Academic Publishers.
V. Boginski, S. Butenko, and P.M. Pardalos, 2003. Modeling and Optimization in Massive Graphs. In: P.M. Pardalos and H. Wolkowicz (eds.), Novel Approaches to Hard Discrete Optimization, American Mathematical Society, 17-39.
V. Boginski, S. Butenko, and P.M. Pardalos, 2003. On Structural Properties of the Market Graph. In: A. Nagurney (ed.), Innovations in Financial and Economic Networks, Edward Elgar Publishers, 28-45.


References

American Scientist Online (Jan-Feb 2000), Computing Science: Graph Theory in Practice: Part I, by Brian Hayes, Volume 88, No. 1.
American Scientist Online (Sep-Oct 2006), Connecting the Dots: Can the tools of graph theory and social-network studies unravel the next big plot?, Volume 94, No. 5.


Modeling Epileptic Brain

EEG recordings are received from electrodes located in different functional units of the brain (time series).
The values of the T-index between all pairs of electrodes are calculated.
Two electrodes are considered to be entrained in the seizure if the corresponding value of the T-index is less than T_critical.


Modeling Epileptic Brain

One can represent all the electrode locations (functional units of the brain) as the vertices of a graph.
An edge connects two vertices if the corresponding value of the T-index is less than T_critical, i.e., these electrode sites are entrained at a certain moment in time.
The evolution of the properties of this graph is investigated.
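The graph construction described on this slide can be sketched as follows: for each time moment, threshold the matrix of pairwise T-index values and track a property such as edge density over time. The T-index values, the number of electrode sites, and the critical value used below are all made up for illustration; the function names are hypothetical.

```python
def entrainment_graph(t_index, t_critical):
    """Edges (i, j), i < j, for electrode pairs with T-index below t_critical.

    t_index is a symmetric matrix given as a list of lists.
    """
    n = len(t_index)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if t_index[i][j] < t_critical]

def edge_density(n, edges):
    """Fraction of the n(n-1)/2 possible edges that are present."""
    return len(edges) / (n * (n - 1) / 2) if n > 1 else 0.0

# Made-up T-index matrices for 4 electrode sites at two time moments.
snapshots = [
    [[0.0, 5.0, 6.1, 4.2], [5.0, 0.0, 7.3, 5.5],
     [6.1, 7.3, 0.0, 4.9], [4.2, 5.5, 4.9, 0.0]],
    [[0.0, 1.2, 2.0, 1.5], [1.2, 0.0, 3.1, 1.9],
     [2.0, 3.1, 0.0, 2.2], [1.5, 1.9, 2.2, 0.0]],
]
T_CRITICAL = 2.5  # hypothetical threshold, chosen only for this example

densities = [edge_density(4, entrainment_graph(t, T_CRITICAL))
             for t in snapshots]
```

In this toy data the second snapshot is far denser than the first, which is the kind of change in graph properties over time that the slide proposes to monitor.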


Modeling Epileptic Brain

Edge density of the considered graph (dashed lines represent the moments of epileptic seizures)


Modeling Epileptic Brain

Size of the largest connected component (dashed lines represent the moments of epileptic seizures)


Modeling Epileptic Brain: Related Technique

Let A be the matrix containing the values of the T-index T_ij for all pairs of electrodes.
Solve the quadratic 0-1 problem

$\min\; x^{T} A x \quad \text{s.t.} \quad \sum_{i=1}^{n} x_i = k, \quad x \in \{0,1\}^n$

to find k electrode sites producing the minimal sum of T-indices (so-called critical sites), which means that these sites are entrained during the seizure.
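For a handful of electrode sites, the selection of k critical sites can be illustrated by brute force: enumerate every subset of size k and keep the one with the smallest sum of pairwise T-index values, which is the value of the quadratic form for the corresponding 0-1 vector. This is only a sketch of the problem statement, not the optimization method used in the cited work; the matrix values and function name are made up.

```python
from itertools import combinations

def critical_sites(A, k):
    """Pick k indices minimizing the sum of A[i][j] over all selected i, j.

    This equals x^T A x for the 0-1 vector x that marks the selected sites.
    Exhaustive enumeration: practical only for small n.
    """
    n = len(A)
    best_value, best_subset = None, None
    for subset in combinations(range(n), k):
        value = sum(A[i][j] for i in subset for j in subset)
        if best_value is None or value < best_value:
            best_value, best_subset = value, subset
    return best_subset, best_value

# Made-up symmetric T-index matrix for 4 electrode sites.
A = [
    [0.0, 1.0, 4.0, 5.0],
    [1.0, 0.0, 2.0, 6.0],
    [4.0, 2.0, 0.0, 3.0],
    [5.0, 6.0, 3.0, 0.0],
]
sites, value = critical_sites(A, 2)
```

Because A is symmetric, each selected pair contributes twice to the objective, so the pair (0, 1) with T-index 1.0 gives an objective value of 2.0.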


Summary

There are many mathematical programming techniques for addressing data mining problems in dynamic networks.
Graph-based techniques for this type of problem are a promising research area.
The performance of any approach depends on the specific dataset: there is no “universal” technique.


References

Data Mining in Biomedicine, P.M. Pardalos, V. Boginski, and A. Vazacopoulos (eds.), Springer, forthcoming.
P.M. Pardalos, W. Chaovalitwongse, L.D. Iasemidis, J.C. Sackellares, D.-S. Shiau, P.R. Carney, O.A. Prokopyev, and V.A. Yatsenko, 2004. Seizure Warning Algorithm Based on Optimization and Nonlinear Dynamics, Mathematical Programming, 101(2): 365-385.
O.A. Prokopyev, V. Boginski, W. Chaovalitwongse, P.M. Pardalos, J.C. Sackellares, and P.R. Carney, 2004. Network-Based Techniques in EEG Data Analysis and Epileptic Brain Modeling. To appear in: Data Mining in Biomedicine, P.M. Pardalos, V. Boginski, and A. Vazacopoulos (eds.), Springer.


This cosmos was not made by Gods or men, but always was, and is, and ever shall be ever-living fire.

Heraclitus - The Fire Priest (540 BC - 480 BC)