
Panos M. Pardalos
Center for Applied Optimization
Dept. of Industrial & Systems Engineering

Affiliated Faculty of:
Computer & Information Science & Engineering Department
Biomedical Engineering Program, McKnight Brain Institute

Data Mining and Knowledge Discovery in Dynamic Networks

Massive Datasets

The proliferation of massive datasets brings with it a series of special computational challenges. This data avalanche arises in a wide range of scientific and commercial applications. With advances in computer and information technologies, many of these challenges are beginning to be addressed.

(Abello, Pardalos & Resende, 2002, Handbook of Massive Datasets)

Knowledge Discovery in Databases (KDD)

KDD is the process of identifying valid, novel, potentially useful, and ultimately understandable structure (models and patterns) in the data.

Understand the application domain
Create a target dataset
Remove (or correct) corrupted data
Apply data-reduction algorithms
Apply data mining algorithms
Interpret the mined patterns

Graph Representation of Massive Datasets

In many cases, it is convenient to represent a dataset as a graph (network) with certain attributes associated with its vertices and edges
Studying the properties of these graphs often provides useful information about the internal structure of the datasets they represent

Important Concepts

A graph G = (V, E), V = set of vertices, E = set of edges
Degrees of the vertices, degree distribution
Size of connected components
Edge density
Cliques and independent sets
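The concepts listed above can be made concrete on a small graph. The following is a minimal sketch in plain Python, assuming the graph is stored as a dictionary mapping each vertex to the set of its neighbors; the function names and the five-vertex example are illustrative assumptions, not taken from the slides.

```python
from collections import deque

def degrees(adj):
    """Degree of each vertex: the number of its neighbors."""
    return {v: len(nbrs) for v, nbrs in adj.items()}

def edge_density(adj):
    """Fraction of all possible vertex pairs that are joined by an edge."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge is counted twice
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0

def connected_components(adj):
    """Return the connected components as a list of vertex sets (breadth-first search)."""
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    component.add(w)
                    queue.append(w)
        components.append(component)
    return components

# Hypothetical example (not the graph drawn on the slide):
# a triangle on vertices 1, 2, 3 and a separate edge between 4 and 5.
example = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
```

On this example the graph has 4 edges out of 10 possible pairs and two connected components, of sizes 3 and 2.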

Example of a graph

[Figure: a graph on five vertices labeled 1 to 5]

Examples of Real-Life Massive Graphs

Web graph (links between websites)
Call graph (telephone traffic data)
Market graph (stock prices data)
Brain networks (neurons and connections between them)

Degree Distribution: Power Law

The degree distribution of a graph characterizes global statistical patterns underlying the dataset this graph represents
Interestingly, the degree distributions of all the real-life graphs considered have a well-defined power-law structure:

The probability that a vertex has degree k (i.e., k neighbors) is

$P(k) \propto k^{-\gamma}$

or

$\log P(k) \propto -\gamma \log k$

(“Self-organized” networks)
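A power law appears as a straight line on a log-log plot, so its exponent can be roughly estimated from the slope of that line. The sketch below assumes a plain least-squares fit of log P(k) against log k; this is a simple illustrative estimator chosen here, not necessarily the procedure used for the graphs in these slides, and the function names are hypothetical.

```python
import math
from collections import Counter

def degree_distribution(degree_list):
    """Empirical P(k): the fraction of vertices that have degree k."""
    n = len(degree_list)
    return {k: count / n for k, count in Counter(degree_list).items()}

def fit_power_law_exponent(dist):
    """Least-squares slope of log P(k) versus log k; returns gamma = -slope.

    Assumes at least two distinct positive degrees with nonzero probability.
    """
    points = [(math.log(k), math.log(p)) for k, p in dist.items() if k > 0 and p > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx

# Synthetic check: values that follow k^(-2) exactly should give gamma close to 2.
synthetic = {k: k ** -2.0 for k in range(1, 50)}
gamma = fit_power_law_exponent(synthetic)
```

Real degree data are noisy in the tail, so a fit of this kind is only a first approximation.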

Cliques and Independent Sets

A clique is a subgraph of G that has all possible edges
Cliques represent dense clusters of “similar” objects
An independent set is a subgraph of G with no edges
Independent sets represent groups of “different” objects

Maximum Clique and Independent Set Problems

Of special interest is finding the maximum clique and the maximum independent set in the graph
The maximum clique and maximum independent set problems can be transformed into each other using the notion of the complementary graph
These problems are NP-hard
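The transformation mentioned above rests on one fact: a vertex set is independent in G exactly when it is a clique in the complement of G. A minimal sketch, assuming the same neighbor-set dictionary representation (the helper names are illustrative):

```python
from itertools import combinations

def complement(adj):
    """Complementary graph: two distinct vertices are adjacent iff they are not adjacent in adj."""
    vertices = set(adj)
    return {v: vertices - adj[v] - {v} for v in adj}

def is_clique(adj, subset):
    """True if every pair of vertices in subset is joined by an edge."""
    return all(v in adj[u] for u, v in combinations(subset, 2))

def is_independent_set(adj, subset):
    """True if no pair of vertices in subset is joined by an edge."""
    return all(v not in adj[u] for u, v in combinations(subset, 2))

# Hypothetical example: the path 1 - 2 - 3.
path = {1: {2}, 2: {1, 3}, 3: {2}}
```

Here {1, 3} is independent in the path and forms a clique in its complement, so any maximum clique algorithm run on the complement yields a maximum independent set of the original graph.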

Finding cliques and independent sets

Heuristic algorithms (no guarantee of finding an optimal solution)
Exact algorithms (finding a maximum clique or independent set)

Clique Partitioning

Minimum clique partition: dividing the graph into a minimum number of distinct cliques
This provides a natural way of partitioning a dataset represented by a graph into a number of clusters of “similar” objects (clustering problem), where the number of clusters is the minimum number of cliques needed to cover the graph

Graph Coloring

Coloring essentially represents the partitioning of the graph into a minimum number of independent sets
Partitioning a dataset represented by a graph into a number of clusters of “different” objects
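Both partitioning problems are hard to solve exactly, but a greedy heuristic shows the idea. In the sketch below, each color class of a coloring is an independent set, and coloring the complementary graph gives a partition of the original graph into cliques. The greedy rule and the largest-degree-first ordering are assumptions made for illustration; they do not guarantee a minimum number of classes.

```python
def greedy_coloring(adj):
    """Give each vertex the smallest color not used by its already-colored neighbors."""
    color = {}
    # Largest-degree-first order is a common heuristic choice (an assumption here).
    for v in sorted(adj, key=lambda u: len(adj[u]), reverse=True):
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def color_classes(color):
    """Group vertices by color; each group is an independent set."""
    classes = {}
    for v, c in color.items():
        classes.setdefault(c, set()).add(v)
    return list(classes.values())

def greedy_clique_partition(adj):
    """Partition into cliques by coloring the complementary graph."""
    vertices = set(adj)
    comp = {v: vertices - adj[v] - {v} for v in adj}
    return color_classes(greedy_coloring(comp))

# Hypothetical example: a triangle on 1, 2, 3 and a separate edge between 4 and 5.
example = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
parts = greedy_clique_partition(example)
```

On this example the heuristic recovers the two obvious clusters, {1, 2, 3} and {4, 5}.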

Call Graph

The “call graph” comes from telecommunications traffic. The vertices of this graph are telephone numbers, and the edges are calls made from one number to another (including additional billing data, such as the time of the call and its duration). The challenge in studying call graphs is that they are massive. Every day AT&T handles approximately 300 million long-distance calls. (American Scientist Online, Jan-Feb 2000)
Careful analysis of the call graph could help with infrastructure planning, customer classification and marketing.
How can we visualize such massive graphs? To flash a terabyte of data on a 1000x1000 screen, you need to cram a megabyte of data into each pixel!

In our experiments with data from telecommunication traffic, an instance of the corresponding multigraph has 53,767,087 vertices and over 170 million edges.
It is not a connected graph: it has 3.7 million separate components, yet a giant connected component with 44,989,297 vertices was computed.
The maximum (quasi-)clique problem is considered in this giant component. We found cliques of size 30, and there were more than 14,000 of these 30-member cliques

(Abello, Pardalos & Resende)


Call Graph

In a battlefield situation, just counting the messages or identifying the source and the intended recipient of each message, and so constructing a call graph, yields valuable information, such as the organization of a military force.
The records in the call database are collected for commercial purposes. In order to send an itemized bill, a phone company needs to keep track of every call completed, with the originating and receiving phone numbers and the starting and ending times. The largest companies handle roughly 250 million toll calls a day, and so a month's worth of data amounts to several billion call records. AT&T reports that its database of retained records is approaching two trillion calls and more than 300 terabytes of data.

Historical calling patterns can be used to detect fraud, and some patterns are also of interest in marketing. For example, a company that offers a discounted rate within a "calling circle" can use information from the call graph to estimate the costs and benefits of the program.
This kind of traffic data could be compiled for other communications channels. For instance, Federal Express and other courier services keep digitized records of their deliveries, which could readily be transformed into a database of senders and receivers. With a ‘packet sniffer’ installed in the network, the same data can be compiled for e-mail traffic. (American Scientist Online, Sep-Oct 2006)

Call Graph

Degree Distribution of the Call Graph (data by AT&T)

Market Graph

Vertices are stocks, and an edge connects two stocks if the correlation between their price fluctuations over a certain period is greater than a specified threshold
~6000 vertices (stocks)
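The construction described above can be sketched in a few lines. The code assumes that each stock is already given as a list of price fluctuations over the same period and that the correlation is the Pearson coefficient; the slides do not spell out either detail, and the tickers and numbers below are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length series (assumes neither is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def market_graph(fluctuations, threshold):
    """Connect two stocks if the correlation of their fluctuations exceeds threshold."""
    tickers = list(fluctuations)
    adj = {t: set() for t in tickers}
    for i, a in enumerate(tickers):
        for b in tickers[i + 1:]:
            if pearson(fluctuations[a], fluctuations[b]) > threshold:
                adj[a].add(b)
                adj[b].add(a)
    return adj

# Hypothetical data: A and B move together, C moves against them.
sample = {
    "A": [0.01, -0.02, 0.03, 0.00],
    "B": [0.02, -0.04, 0.06, 0.00],
    "C": [-0.01, 0.02, -0.03, 0.01],
}
graph = market_graph(sample, threshold=0.5)
```

With about 6000 stocks this pairwise loop examines roughly 18 million pairs, which is still manageable, but a vectorized correlation matrix would be the more practical choice at that scale.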

Market Graph

Market graph (all the considered instances for different correlation thresholds) follows the power-law model
Using the combination of heuristic and exact algorithms, the exact solution of the maximum clique problem was found

(Boginski, Butenko & Pardalos)

Degree distribution of the Market graph

Finding Cliques in the Market Graph

Applying a heuristic algorithm to find a large clique: let N(i) be the set of neighbors of the vertex i
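The steps of the heuristic itself did not survive in this text. As an illustration only, here is one simple greedy heuristic of this kind, written in terms of neighbor sets N(i); it is an assumed stand-in and should not be read as the algorithm used in the study.

```python
def greedy_clique(adj):
    """Grow a clique greedily; adj[i] plays the role of N(i).

    At each step, pick the candidate with the most neighbors among the
    remaining candidates, then keep only candidates adjacent to it.
    The result is a clique, but not necessarily a maximum one.
    """
    clique = set()
    candidates = set(adj)
    while candidates:
        v = max(candidates, key=lambda u: len(adj[u] & candidates))
        clique.add(v)
        candidates &= adj[v]  # v itself drops out because v is not in adj[v]
    return clique

# Hypothetical example: a 4-clique on 1..4, with a tail 1 - 5 - 6 attached.
example = {
    1: {2, 3, 4, 5}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3},
    5: {1, 6}, 6: {5},
}
found = greedy_clique(example)
```

The clique returned by such a heuristic gives a lower bound on the maximum clique size, which is what the preprocessing step needs.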

Finding Cliques in the Market Graph

Preprocessing procedure: let C be the clique found by the heuristic algorithm. Recursively remove from the graph all of the vertices which are not in C and whose degree is less than |C|. Such vertices cannot belong to any clique larger than C, so removing them does not change the maximum clique.

Denote the resulting (reduced) graph as G’ = (V’, E’)
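The preprocessing rule can be sketched as a peeling loop: removing one vertex lowers the degrees of its neighbors, which may in turn fall below |C| and be removed. The function name and the example graph are illustrative assumptions.

```python
def preprocess(adj, clique):
    """Repeatedly delete vertices outside `clique` whose current degree is below len(clique).

    Returns the reduced graph as a new neighbor-set dictionary; adj is not modified.
    """
    k = len(clique)
    reduced = {v: set(nbrs) for v, nbrs in adj.items()}
    stack = [v for v in reduced if v not in clique and len(reduced[v]) < k]
    removed = set()
    while stack:
        v = stack.pop()
        if v in removed:
            continue
        removed.add(v)
        for w in reduced.pop(v):
            reduced[w].discard(v)
            # A neighbor may now fall below the degree bound as well.
            if w not in clique and len(reduced[w]) < k:
                stack.append(w)
    return reduced

# Hypothetical example: a 4-clique on 1..4, with a tail 1 - 5 - 6 attached.
example = {
    1: {2, 3, 4, 5}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3},
    5: {1, 6}, 6: {5},
}
reduced = preprocess(example, clique={1, 2, 3, 4})
```

In this example both tail vertices are peeled away and only the clique remains. On a sparse graph the reduction can be large, which is what makes an exact method practical afterwards.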

Finding Cliques in the Market Graph

Using the IP formulation of the maximum clique problem to find the exact solution:
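The formula from the slide is not recoverable from this text. For reference, a standard integer programming formulation of the maximum clique problem on the reduced graph G’ = (V’, E’) is given below; whether this is exactly the formulation used on the slide is an assumption. The binary variable x_i equals 1 when vertex i is placed in the clique.

```latex
\begin{aligned}
\max \quad & \sum_{i \in V'} x_i \\
\text{s.t.} \quad & x_i + x_j \le 1 \qquad \forall\, i \ne j,\ (i,j) \notin E', \\
& x_i \in \{0, 1\} \qquad \forall\, i \in V'.
\end{aligned}
```

Each constraint forbids choosing two vertices that are not joined by an edge, so every feasible solution is a clique and the objective counts its vertices.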

Maximum Clique Size for Different Correlation Thresholds

Large cliques despite very low edge density confirm the idea about the “globalization” of the market

Classification of Stocks Using Clique Partitioning

A clique in the market graph represents a dense cluster of stocks whose prices exhibit a similar behavior over time
Therefore, dividing the market graph into a set of distinct cliques (clique partitioning) is a natural approach to classifying stocks (dividing the set of stocks into clusters of similar objects, which is an approach to solving the clustering problem)

Independent Sets in the Market Graph

Maximum independent set represents the largest “perfectly diversified” portfolio
It can be found by solving the maximum clique problem in the complementary graph
The preprocessing procedure could not reduce the size of the initial graph, so the exact solution could not be found
Large diversified portfolios are hard to find

Independent Set Sizes for Different Correlation Thresholds

Relatively small independent sets found by the heuristic algorithm

Independent Sets in the Market Graph

Finding a perfectly diversified portfolio containing any given stock
For every vertex in the market graph, an independent set that contains this vertex was detected, and the sizes of these independent sets were almost the same, which means that it is possible to find a diversified portfolio containing any given stock using the market graph methodology

Maximum Clique Size for Different Correlation Thresholds

Maximum clique size for various thresholds in Food Market Graph

Independent Sets in the Market Graph

Maximum independent set size for various thresholds in Food Market Graph

Connected Components in Market Graph

Intuition
Two stocks are correlated if their corresponding nodes are connected by an edge
Power-law graphs generally have a very high clustering coefficient, i.e., two nodes that are both associated with a common node have a high tendency to be associated with each other
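The clustering coefficient mentioned above can be computed directly from neighbor sets. This sketch uses the common local definition (the fraction of pairs of a vertex's neighbors that are themselves adjacent) and averages it over all vertices; the slides do not state which variant of the coefficient is meant, so that choice is an assumption.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of pairs of neighbors of v that are connected to each other."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention: undefined cases count as zero
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean of the local clustering coefficients over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)

# Hypothetical example: a triangle on 1, 2, 3 with a pendant vertex 4 attached to 1.
example = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
```

In the example, vertices 2 and 3 have coefficient 1, vertex 1 has coefficient 1/3 (only one of its three neighbor pairs is linked), and vertex 4 has coefficient 0.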


[Figure: Largest group size by time period, in three panels titled "Group Size by Time Period - (0.7)", "Group Size by Time Period - (0.6)", and "Group Size by Time Period - (0.5)". Horizontal axis: time period (1 to 11). Vertical axis: largest group size.]


Observations

The increase in the giant component size from the oldest to the newest time period indicates the globalization tendency, just as the maximum clique size and edge density do
The giant component includes semiconductor industries, and the increase in its size corroborates the observation that the number of these industries has been increasing with time

Additional Applications

Social Networks
Biological Networks
Transportation Networks (place of living and place of work)

References

J. Abello, P.M. Pardalos, and M.G.C. Resende (eds.), 2002. Handbook of Massive Data Sets, Kluwer Academic Publishers.
V. Boginski, S. Butenko, and P.M. Pardalos, 2003. Modeling and Optimization in Massive Graphs. In: P.M. Pardalos and H. Wolkowicz (eds.), Novel Approaches to Hard Discrete Optimization, American Mathematical Society, 17-39.
V. Boginski, S. Butenko, and P.M. Pardalos, 2003. On Structural Properties of the Market Graph. In: A. Nagurney (ed.), Innovations in Financial and Economic Networks, Edward Elgar Publishers, 28-45.
American Scientist Online (Jan-Feb 2000), Computing Science: Graph Theory in Practice: Part I, by Brian Hayes, Volume 88, No. 1.
American Scientist Online (Sep-Oct 2006), Connecting the Dots: Can the tools of graph theory and social-network studies unravel the next big plot?, Volume 94, No. 5.

Modeling Epileptic Brain

EEG recordings received from the electrodes located in different functional units of the brain (time series)
The values of T-index between all pairs of electrodes are calculated
Two electrodes are considered to be entrained in the seizure if the corresponding value of T-index is less than Tcritical.

Modeling Epileptic Brain

One can represent all the electrode locations (functional units of the brain) as the vertices of the graph.
An edge connects two vertices if the corresponding value of T-index is less than Tcritical, i.e. these electrode sites are entrained at a certain time moment.
The evolution of the properties of this graph is investigated
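The graph construction described above can be sketched as a thresholding step on the matrix of T-index values, repeated for each time moment. The matrix entries and the critical value below are made-up numbers chosen only to show the mechanics; the function names are likewise illustrative.

```python
def entrainment_graph(t_index, t_critical):
    """Connect electrode sites i and j when their T-index is below t_critical.

    t_index is a symmetric square matrix (list of lists); the diagonal is ignored.
    """
    n = len(t_index)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if t_index[i][j] < t_critical:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def edge_density(adj):
    """Fraction of electrode pairs that are entrained (joined by an edge)."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0

# Hypothetical T-index values for three electrode sites at one time moment.
t_values = [
    [0.0, 1.2, 5.0],
    [1.2, 0.0, 4.1],
    [5.0, 4.1, 0.0],
]
graph = entrainment_graph(t_values, t_critical=2.0)  # threshold is a made-up value
```

Recomputing the graph over successive time windows and recording quantities such as the edge density gives the kind of time series whose evolution is studied here.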

Modeling Epileptic Brain

Edge density of the considered graph (dashed lines represent the moments of epileptic seizures)

Modeling Epileptic Brain

Size of the largest connected component (dashed lines represent the moments of epileptic seizures)

Modeling Epileptic Brain: Related Technique

Let A be the matrix containing the values of T-index Tij for all pairs of electrodes
Solve the quadratic 0-1 problem

to find k electrode sites producing the minimal sum of T-indices (so-called critical sites), which means that these sites are entrained during the seizure
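The formula for this problem did not survive in the text. A natural reading of the description, stated here as an assumption rather than as the exact form on the slide, is the following, where n is the number of electrode sites and x_i = 1 when site i is selected:

```latex
\begin{aligned}
\min \quad & x^{T} A x = \sum_{i=1}^{n} \sum_{j=1}^{n} T_{ij}\, x_i x_j \\
\text{s.t.} \quad & \sum_{i=1}^{n} x_i = k, \\
& x \in \{0, 1\}^{n}.
\end{aligned}
```

The constraint fixes the number of selected sites at k, and the objective adds up the T-index values over all pairs of selected sites. With a symmetric A each pair is counted twice, which does not change the minimizer.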

Summary

There are many mathematical programming techniques for addressing data mining problems in dynamic networks
Graph-based techniques for this type of problem are a promising research area
Performance of any approach depends on a specific dataset; there is no “universal” technique

References

Data Mining in Biomedicine, P.M. Pardalos, V. Boginski, and A. Vazacopoulos (eds.), Springer, forthcoming.
P.M. Pardalos, W. Chaovalitwongse, L.D. Iasemidis, J.C. Sackellares, D.-S. Shiau, P.R. Carney, O.A. Prokopyev, V.A. Yatsenko, 2004. Seizure Warning Algorithm Based on Optimization and Nonlinear Dynamics, Mathematical Programming, 101(2): 365-385.
O.A. Prokopyev, V. Boginski, W. Chaovalitwongse, P.M. Pardalos, J.C. Sackellares, and P.R. Carney, 2004. Network-Based Techniques in EEG Data Analysis and Epileptic Brain Modeling. To appear in: Data Mining in Biomedicine, P.M. Pardalos, V. Boginski, and A. Vazacopoulos (eds.), Springer.

This cosmos was not made by Gods or men, but always was, and is, and ever shall be ever-living fire.

Heraclitus - The Fire Priest (540 BC - 480 BC)
