Panos M. Pardalos
Center for Applied Optimization
Dept. of Industrial & Systems Engineering
Affiliated Faculty of: Computer & Information Science & Engineering Department, Biomedical Engineering Program, McKnight Brain Institute
Data Mining and Knowledge Discovery in Dynamic Networks
Massive Datasets
The proliferation of massive datasets brings with it a series of special computational challenges. This data avalanche arises in a wide range of scientific and commercial applications. With advances in computer and information technologies, many of these challenges are beginning to be addressed.
(Abello, Pardalos & Resende, 2002, Handbook of Massive Data Sets)
Knowledge Discovery in Databases (KDD)
KDD is the process of identifying valid, novel, potentially useful, and ultimately understandable structure (models and patterns) in the data
Understand the application domain
Create a target dataset
Remove (or correct) corrupted data
Apply data-reduction algorithms
Apply data mining algorithms
Interpret the mined patterns
Graph Representation of Massive Datasets
In many cases, it is convenient to represent a dataset as a graph (network) with certain attributes associated with its vertices and edges
Studying the properties of these graphs often provides useful information about the internal structure of the datasets they represent
Important Concepts
A graph G = (V, E), V = set of vertices, E = set of edges
Degrees of the vertices, degree distribution
Size of connected components
Edge density
Cliques and independent sets
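The concepts listed above can be made concrete with a small sketch. It assumes the graph is stored as a dictionary mapping each vertex to the set of its neighbors; the function names and the five-vertex example are hypothetical and not taken from the slides.

```python
from collections import Counter, deque

def degree_distribution(adj):
    """Map each degree k to the number of vertices that have degree k."""
    return Counter(len(nbrs) for nbrs in adj.values())

def edge_density(adj):
    """Fraction of the n(n-1)/2 possible edges that are present."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge is counted twice
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0

def component_sizes(adj):
    """Sizes of the connected components, largest first (breadth-first search)."""
    seen, sizes = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue, size = deque([start]), 0
        while queue:
            v = queue.popleft()
            size += 1
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        sizes.append(size)
    return sorted(sizes, reverse=True)

# Hypothetical example: a triangle on {1, 2, 3} plus a separate edge 4-5.
example = {1: {2, 3}, 2: {1, 3}, 3: {1, 2}, 4: {5}, 5: {4}}
```

On this example the graph has 4 of the 10 possible edges and two connected components.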
Example of a graph
[Figure: a graph with five vertices labeled 1 to 5]
Examples of Real-Life Massive Graphs
Web graph (links between websites)
Call graph (telephone traffic data)
Market graph (stock prices data)
Brain networks (neurons and connections between them)
Degree Distribution: Power Law
The degree distribution of a graph characterizes global statistical patterns underlying the dataset this graph represents
Interestingly, the degree distributions of all the real-life graphs considered here have a well-defined power-law structure:
The probability that a vertex has degree k (i.e., k neighbors) is
$P(k) \propto k^{-\gamma}$
or
$\log P(k) \propto -\gamma \log k$
(“Self-organized” networks)
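A power law of the form P(k) proportional to k^(-gamma) appears as a straight line of slope -gamma on a log-log plot, so a rough estimate of the exponent can be read off a least-squares fit. The sketch below assumes the input is a dictionary from degree to vertex count; the function name is hypothetical, and a log-log regression is only a crude estimator, not necessarily the procedure used for the graphs in these slides.

```python
import math

def fit_power_law_exponent(degree_counts):
    """Estimate gamma as minus the least-squares slope of log P(k) against log k."""
    total = sum(degree_counts.values())
    # Degree 0 and empty bins are skipped because their logarithm is undefined.
    points = [(math.log(k), math.log(c / total))
              for k, c in degree_counts.items() if k > 0 and c > 0]
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return -sxy / sxx
```

The fit needs at least two distinct positive degrees; with fewer, the slope is undefined.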
Cliques and Independent Sets
A clique is a subgraph of G that has all possible edges
Cliques represent dense clusters of “similar” objects
An independent set is a subgraph of G with no edges.
Independent sets represent groups of “different” objects
Maximum Clique and Independent Set Problems
Of special interest is finding the maximum clique and the maximum independent set in the graph
Maximum clique and maximum independent set problems can be transformed to each other, using the notion of the complementary graph
These problems are NP-hard
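The transformation through the complementary graph can be illustrated in a few lines: a vertex set is a clique in G exactly when it is an independent set in the complement of G. The sketch assumes an adjacency-set dictionary without self-loops; all names are hypothetical.

```python
from itertools import combinations

def complement(adj):
    """Complementary graph: same vertices, an edge exactly where G has none."""
    vertices = set(adj)
    return {v: vertices - adj[v] - {v} for v in adj}

def is_clique(adj, subset):
    """True if every pair of vertices in subset is joined by an edge."""
    return all(v in adj[u] for u, v in combinations(subset, 2))

def is_independent_set(adj, subset):
    """True if no pair of vertices in subset is joined by an edge."""
    return not any(v in adj[u] for u, v in combinations(subset, 2))
```

Any maximum clique routine can therefore be reused for maximum independent sets by running it on `complement(adj)`.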
Finding Cliques and Independent Sets
Heuristic algorithms (no guarantee of finding an optimal solution)
Exact algorithms (finding a maximum clique or independent set)
Clique Partitioning
Minimum clique partition: dividing the graph into a minimum number of distinct cliques
This provides a natural way of partitioning a dataset represented by a graph into a number of clusters of “similar” objects (clustering problem), where the number of clusters is the minimum number of cliques needed to cover all vertices of the graph
Graph Coloring
Coloring essentially represents the partitioning of the graph into a minimum number of independent sets
Partitioning a dataset represented by a graph into a number of clusters of “different” objects
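Since minimum coloring is hard in general, a common practical substitute is a greedy coloring, which yields a valid partition into independent sets but only an upper bound on the minimum number of colors. The following is a minimal sketch under that assumption (adjacency-set dictionary, hypothetical function names), not the method used in the slides.

```python
def greedy_coloring(adj):
    """Give each vertex the smallest color not used by its already-colored neighbors."""
    color = {}
    # Visiting high-degree vertices first is a common ordering heuristic.
    for v in sorted(adj, key=lambda u: len(adj[u]), reverse=True):
        used = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

def color_classes(color):
    """Group vertices by color; each group is an independent set."""
    classes = {}
    for v, c in color.items():
        classes.setdefault(c, set()).add(v)
    return list(classes.values())
```

Running the same routine on the complementary graph gives a partition of the original graph into cliques, again without a guarantee of minimality.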
Call Graph
The “call graph” comes from telecommunications traffic. The vertices of this graph are telephone numbers, and the edges are calls made from one number to another (including additional billing data, such as the time of the call and its duration). The challenge in studying call graphs is that they are massive. Every day AT&T handles approximately 300 million long-distance calls. (American Scientist Online, Jan-Feb 2000)
Careful analysis of the call graph could help with infrastructure planning, customer classification and marketing.
How can we visualize such massive graphs? To flash a terabyte of data on a 1000x1000 screen, you need to cram a megabyte of data into each pixel!
In our experiments with data from telecommunications traffic, one instance of the corresponding multigraph has 53,767,087 vertices and over 170 million edges.
It is not a connected graph: it has 3.7 million separate components, yet a giant connected component with 44,989,297 vertices was computed.
The maximum (quasi-)clique problem is considered in this giant component. We found cliques of size 30, and there were more than 14,000 of these 30-member cliques
(Abello, Pardalos & Resende)
Call Graph
In a battlefield situation, merely counting the messages, or identifying the source and the intended recipient of each message and constructing a call graph, yields valuable information such as the organization of a military force.
The records in the call database are collected for commercial purposes. In order to send an itemized bill, a phone company needs to keep track of every call completed, with the originating and receiving phone numbers and the starting and ending times. The largest companies handle roughly 250 million toll calls a day, and so a month's worth of data amounts to several billion call records. AT&T reports that its database of retained records is approaching two trillion calls and more than 300 terabytes of data.
Historical calling patterns can be used to detect fraud, and some patterns are also of interest in marketing. For example, a company that offers a discounted rate within a "calling circle" can use information from the call graph to estimate the costs and benefits of the program.
This kind of traffic data could be compiled for other communications channels. For instance, Federal Express and other courier services keep digitized records of their deliveries, which could readily be transformed into a database of senders and receivers. With a ‘packet sniffer’ installed in the network, the same data could be compiled for e-mail traffic. (American Scientist Online, Sep-Oct 2006)
Call Graph
Degree Distribution of the Call Graph (data by AT&T)
Market Graph
Vertices are stocks, and an edge connects two stocks if the correlation between their price fluctuations over a certain period is greater than a specified threshold
~6000 vertices (stocks)
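The construction just described can be sketched directly. The code assumes each stock comes with an equal-length list of per-period price fluctuations and uses the Pearson correlation coefficient; the function names are hypothetical, and the exact definition of a fluctuation used for the real market graph is not given in these slides.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def market_graph(fluctuations, threshold):
    """Join two stocks by an edge when their correlation exceeds the threshold."""
    adj = {s: set() for s in fluctuations}
    tickers = list(fluctuations)
    for i, a in enumerate(tickers):
        for b in tickers[i + 1:]:
            if pearson(fluctuations[a], fluctuations[b]) > threshold:
                adj[a].add(b)
                adj[b].add(a)
    return adj
```

The double loop compares every pair of stocks, which is feasible for roughly 6000 vertices; raising the threshold removes edges and lowers the edge density.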
Market Graph
Market graph (all the considered instances for different correlation thresholds) follows the power-law model
Using a combination of heuristic and exact algorithms, the exact solution of the maximum clique problem was found
(Boginski, Butenko & Pardalos)
Degree Distribution of the Market Graph
Finding Cliques in the Market Graph
Applying a heuristic algorithm to find a large clique: let N(i) be the set of neighbors of the vertex i
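The details of the heuristic are not preserved in this transcript. One common greedy scheme that fits the notation N(i) is sketched below as an assumption: repeatedly add the candidate vertex with the most neighbors among the remaining candidates, then restrict the candidates to its neighborhood. The function name and the adjacency-set representation (no self-loops, sortable vertex labels) are hypothetical.

```python
def greedy_clique(adj):
    """Grow a clique greedily; returns a maximal, not necessarily maximum, clique."""
    clique = []
    candidates = set(adj)  # vertices adjacent to every vertex chosen so far
    while candidates:
        # Prefer the candidate with the most neighbors inside the candidate set;
        # sorting first makes tie-breaking deterministic.
        v = max(sorted(candidates), key=lambda u: len(adj[u] & candidates))
        clique.append(v)
        candidates &= adj[v]  # N(v) excludes v itself, so the loop terminates
    return clique
```

The result is always a clique that cannot be extended by a single vertex, but it can be smaller than the maximum clique.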
Finding Cliques in the Market Graph
Preprocessing procedure: let C be the clique found by the heuristic algorithm; recursively remove from the graph all of the vertices which are not in C and whose degree is less than |C|
Denote the resulting (reduced) graph as G’ = (V’, E’)
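The preprocessing step is safe because any vertex belonging to a clique larger than C must have at least |C| neighbors, so vertices of smaller degree can be discarded without losing a better solution. A minimal sketch follows, assuming an adjacency-set dictionary; the function name is hypothetical.

```python
def prune_below_clique_size(adj, clique):
    """Repeatedly delete vertices outside `clique` whose current degree is below len(clique)."""
    keep = set(clique)
    bound = len(keep)
    graph = {v: set(nbrs) for v, nbrs in adj.items()}  # work on a copy
    stack = [v for v in graph if v not in keep and len(graph[v]) < bound]
    while stack:
        v = stack.pop()
        if v not in graph:  # already removed through another path
            continue
        for w in graph.pop(v):
            graph[w].discard(v)
            # Removing v lowers the degree of w, which may now fall below the bound.
            if w not in keep and len(graph[w]) < bound:
                stack.append(w)
    return graph
```

The returned dictionary plays the role of the reduced graph G’ = (V’, E’); the input graph is left unchanged.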
Finding Cliques in the Market Graph
Using the IP formulation of the maximum clique problem to find the exact solution:
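The formula itself is not preserved in this transcript. A standard integer programming formulation of the maximum clique problem on the reduced graph G’ = (V’, E’), given here as an assumption about what was shown, uses one binary variable x_i per vertex and forbids selecting two non-adjacent vertices together:

```latex
\begin{aligned}
\max\ & \sum_{i \in V'} x_i \\
\text{s.t.}\ & x_i + x_j \le 1 \quad \forall\, i \ne j \text{ with } (i,j) \notin E', \\
& x_i \in \{0,1\} \quad \forall\, i \in V'.
\end{aligned}
```

Here x_i = 1 means that vertex i belongs to the clique, and the optimal objective value is the maximum clique size.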
Maximum Clique Size for Different Correlation Thresholds
Large cliques despite very low edge density: this confirms the idea of the “globalization” of the market
Classification of Stocks Using Clique Partitioning
A clique in the market graph represents a dense cluster of stocks whose prices exhibit similar behavior over time
Therefore, dividing the market graph into a set of distinct cliques (clique partitioning) is a natural approach to classifying stocks (dividing the set of stocks into clusters of similar objects, an approach to solving the clustering problem)
Independent Sets in the Market Graph
A maximum independent set represents the largest “perfectly diversified” portfolio
Finding it amounts to solving the maximum clique problem in the complementary graph
The preprocessing procedure could not reduce the size of the initial graph, so the exact solution could not be found
Large diversified portfolios are hard to find
Independent Set Sizes for Different Correlation Thresholds
Relatively small independent sets found by the heuristic algorithm
Independent Sets in the Market Graph
Finding a perfectly diversified portfolio containing any given stock
For every vertex in the market graph, an independent set that contains this vertex was detected, and the sizes of these independent sets were almost the same, which means that it is possible to find a diversified portfolio containing any given stock using the market graph methodology
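One simple way to look for an independent set containing a given stock is to start from that vertex and greedily add vertices that conflict with as few remaining candidates as possible. This is only a sketch of such a heuristic, assuming an adjacency-set dictionary with sortable vertex labels; the slides do not state which heuristic was actually used, and the function name is hypothetical.

```python
def greedy_independent_set(adj, seed):
    """Build an independent set containing `seed` by a minimum-degree greedy rule."""
    chosen = [seed]
    # Candidates are vertices not adjacent to anything chosen so far.
    candidates = set(adj) - adj[seed] - {seed}
    while candidates:
        v = min(sorted(candidates), key=lambda u: len(adj[u] & candidates))
        chosen.append(v)
        candidates -= adj[v] | {v}
    return chosen
```

Calling the function once per vertex and comparing the lengths of the results mirrors the experiment described above on a small scale.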
Maximum Clique Size for Different Correlation Thresholds
Maximum clique size for various thresholds in the Food Market Graph
Independent Sets in the Market Graph
Maximum independent set size for various thresholds in the Food Market Graph
Connected Components in Market Graph
Intuition
Two stocks are correlated if their corresponding nodes are connected by an edge
Power-law graphs generally have a very high clustering coefficient, i.e., two nodes that are both associated with a common node have a high tendency to be associated with each other
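The clustering coefficient mentioned above can be computed per vertex as the fraction of pairs of its neighbors that are themselves connected. The sketch below assumes an adjacency-set dictionary and uses the common convention that vertices with fewer than two neighbors contribute zero; the function names are hypothetical.

```python
from itertools import combinations

def local_clustering(adj, v):
    """Fraction of pairs of neighbors of v that are joined by an edge."""
    nbrs = adj[v]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention: undefined cases count as zero
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return 2 * links / (k * (k - 1))

def average_clustering(adj):
    """Mean of the local clustering coefficients over all vertices."""
    return sum(local_clustering(adj, v) for v in adj) / len(adj)
```

A value close to 1 means that neighbors of a common vertex are almost always neighbors of each other.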
Connected Components in Market Graph
[Figure: Largest Group Size by Time Period. Three panels titled “Group Size by Time Period”, labeled (0.7), (0.6), and (0.5); x-axis: Time Period (1 to 11); y-axis: Largest Group Size, with scales 0 to 250, 0 to 500, and 0 to 1,400 respectively]
Connected Components in Market Graph
Observations
The increase in the giant component size from the oldest to the newest time period indicates the globalization tendency, just as the maximum clique size and edge density do
The giant component includes semiconductor industries, and the increase in the size of the giant component corroborates the observation that the number of these industries has been increasing with time
Additional Applications
Social Networks
Biological Networks
Transportation Networks (place of living and place of work)
References
J. Abello, P.M. Pardalos, and M.G.C. Resende (eds.), 2002. Handbook of Massive Data Sets, Kluwer Academic Publishers.
V. Boginski, S. Butenko, and P.M. Pardalos, 2003. Modeling and Optimization in Massive Graphs. In: P.M. Pardalos and H. Wolkowicz (eds.), Novel Approaches to Hard Discrete Optimization, American Mathematical Society, 17-39.
V. Boginski, S. Butenko, and P.M. Pardalos, 2003. On Structural Properties of the Market Graph. In: A. Nagurney (ed.), Innovations in Financial and Economic Networks, Edward Elgar Publishers, 28-45.
American Scientist Online (Jan-Feb 2000), Computing Science: Graph Theory in Practice: Part I, by Brian Hayes, Volume 88, No. 1.
American Scientist Online (Sep-Oct 2006), Connecting the Dots: Can the tools of graph theory and social-network studies unravel the next big plot?, Volume 94, No. 5.
Modeling Epileptic Brain
EEG recordings received from the electrodes located in different functional units of the brain (time series)
The values of the T-index between all pairs of electrodes are calculated
Two electrodes are considered to be entrained in the seizure if the corresponding value of the T-index is less than T_critical.
Modeling Epileptic Brain
One can represent all the electrode locations (functional units of the brain) as the vertices of the graph.
An edge connects two vertices if the corresponding value of the T-index is less than T_critical, i.e., these electrode sites are entrained at a certain moment in time.
The evolution of the properties of this graph is investigated
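The graph construction described above amounts to thresholding a symmetric matrix. The sketch below assumes the T-index values for one time window are given as a list of lists and that electrodes are numbered from 0; the function names and the threshold passed in are hypothetical placeholders, not values from the study.

```python
def entrainment_graph(t_index, t_critical):
    """Connect electrodes i and j when their T-index is below t_critical."""
    n = len(t_index)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if t_index[i][j] < t_critical:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def edge_density(adj):
    """Fraction of possible electrode pairs that are entrained."""
    n = len(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    return 2 * m / (n * (n - 1)) if n > 1 else 0.0
```

Rebuilding the graph for successive time windows and recording quantities such as the edge density gives the kind of time series whose evolution is studied.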
Modeling Epileptic Brain
Edge density of the considered graph (dashed lines represent the moments of epileptic seizures)
Modeling Epileptic Brain
Size of the largest connected component (dashed lines represent the moments of epileptic seizures)
Modeling Epileptic Brain:Related Technique
Let A be the matrix containing the values of the T-index T_ij for all pairs of electrodes
Solve the quadratic 0-1 problem
$\min\ x^{T} A x \quad \text{s.t.}\ \sum_{i=1}^{n} x_i = k,\ x \in \{0,1\}^{n}$
(where n is the number of electrodes) to find k electrode sites producing the minimal sum of T-indices (the so-called critical sites), which means that these sites are entrained during the seizure
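For a handful of electrodes, the selection of k critical sites can be checked by exhaustive search, which makes the objective concrete. The sketch assumes a symmetric matrix with zero diagonal, so minimizing the sum over unordered pairs selects the same sites as minimizing the quadratic form; the function name is hypothetical, and realistic instances need a proper 0-1 optimization method rather than enumeration.

```python
from itertools import combinations

def critical_sites(T, k):
    """Return the k sites whose pairwise T-indices have the smallest sum (small n only)."""
    n = len(T)

    def cost(sites):
        # Sum of T[i][j] over all unordered pairs of selected sites.
        return sum(T[i][j] for i, j in combinations(sites, 2))

    return min(combinations(range(n), k), key=cost)
```

The search examines every k-subset of the n sites, so its running time grows combinatorially with n.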
Summary
There are many mathematical programming techniques for addressing data mining problems in dynamic networks
Graph-based techniques for this type of problem are a promising research area
The performance of any approach depends on the specific dataset: there is no “universal” technique
References
Data Mining in Biomedicine, P.M. Pardalos, V. Boginski, and A. Vazacopoulos (eds.), Springer, forthcoming.
P.M. Pardalos, W. Chaovalitwongse, L.D. Iasemidis, J.C. Sackellares, D.-S. Shiau, P.R. Carney, O.A. Prokopyev, V.A. Yatsenko, 2004. Seizure Warning Algorithm Based on Optimization and Nonlinear Dynamics, Mathematical Programming, 101(2): 365-385.
O.A. Prokopyev, V. Boginski, W. Chaovalitwongse, P.M. Pardalos, J.C. Sackellares, and P.R. Carney, 2004. Network-Based Techniques in EEG Data Analysis and Epileptic Brain Modeling. To appear in: Data Mining in Biomedicine, P.M. Pardalos, V. Boginski, and A. Vazacopoulos (eds.), Springer.
This cosmos was not made by Gods or men, but always was, and is, and ever shall be ever-living fire.
Heraclitus - The Fire Priest (540 BC - 480 BC)