![Page 1: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/1.jpg)
Scalable Parallel Data Mining
Eui-Hong (Sam) Han
Department of Computer Science and Engineering
Army High Performance Computing Research Center
University of Minnesota
Research Supported by NSF, DOE, Army Research Office, AHPCRC/ARL
http://www.cs.umn.edu/~han
Joint work with George Karypis, Vipin Kumar, Anurag Srivastava, and Vineet Singh
![Page 2: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/2.jpg)
What is Data Mining?
Many Definitions
Search for Valuable Information in Large Volumes of Data.
Exploration & Analysis, by Automatic or Semi-Automatic Means, of Large Quantities of Data in order to Discover Meaningful Patterns & Rules.
A Step in the KDD Process…
![Page 3: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/3.jpg)
Why Mine Data? Commercial Viewpoint...
- Lots of data is being collected and warehoused.
- Computing has become affordable.
- Competitive pressure is strong:
  - Provide better, customized services for an edge.
  - Information is becoming a product in its own right.
![Page 4: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/4.jpg)
Why Mine Data? Scientific Viewpoint...
- Data collected and stored at enormous speeds (Gbyte/hour):
  - remote sensor on a satellite
  - telescope scanning the skies
  - microarrays generating gene expression data
  - scientific simulations generating terabytes of data
- Traditional techniques are infeasible for raw data.
- Data mining for data reduction:
  - cataloging, classifying, segmenting data
  - helps scientists in hypothesis formation
![Page 5: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/5.jpg)
Data Mining Tasks...
- Classification [Predictive]
- Clustering [Descriptive]
- Association Rule Discovery [Descriptive]
- Sequential Pattern Discovery [Descriptive]
- Regression [Predictive]
- Deviation Detection [Predictive]
![Page 6: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/6.jpg)
Classification Example
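Training Set (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

| Tid | Refund | Marital Status | Taxable Income | Cheat |
|---|---|---|---|---|
| 1 | Yes | Single | 125K | No |
| 2 | No | Married | 100K | No |
| 3 | No | Single | 70K | No |
| 4 | Yes | Married | 120K | No |
| 5 | No | Divorced | 95K | Yes |
| 6 | No | Married | 60K | No |
| 7 | Yes | Divorced | 220K | No |
| 8 | No | Single | 85K | Yes |
| 9 | No | Married | 75K | No |
| 10 | No | Single | 90K | Yes |

Test Set:

| Refund | Marital Status | Taxable Income | Cheat |
|---|---|---|---|
| No | Single | 75K | ? |
| Yes | Married | 50K | ? |
| No | Married | 150K | ? |
| Yes | Divorced | 90K | ? |
| No | Single | 40K | ? |
| No | Married | 80K | ? |

Flow: Training Set → Learn Classifier → Model; the Model is applied to the Test Set.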
![Page 7: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/7.jpg)
Classification Application
Direct Marketing
Fraud Detection
Customer Attrition/Churn
Sky Survey Cataloging
![Page 8: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/8.jpg)
Example Decision Tree
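Training data (Refund and Marital Status are categorical, Taxable Income is continuous, Cheat is the class):

| Tid | Refund | Marital Status | Taxable Income | Cheat |
|---|---|---|---|---|
| 1 | Yes | Single | 125K | No |
| 2 | No | Married | 100K | No |
| 3 | No | Single | 70K | No |
| 4 | Yes | Married | 120K | No |
| 5 | No | Divorced | 95K | Yes |
| 6 | No | Married | 60K | No |
| 7 | Yes | Divorced | 220K | No |
| 8 | No | Single | 85K | Yes |
| 9 | No | Married | 75K | No |
| 10 | No | Single | 90K | Yes |

Decision tree (splitting attributes: Refund, MarSt, TaxInc):

Refund?
- Yes: NO
- No: MarSt?
  - Married: NO
  - Single, Divorced: TaxInc?
    - < 80K: NO
    - > 80K: YES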
The splitting attribute at a node is determined based on the Gini index.
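The example tree can be read as a chain of nested tests. The following is a minimal sketch, not code from the talk: the function name `predict_cheat` and the record layout are assumptions made for illustration, and an income of exactly 80K is treated as the "Yes" side, which the slide leaves unspecified.

```python
# Illustrative sketch of the example decision tree (names are hypothetical).
def predict_cheat(refund, marital_status, taxable_income_k):
    """Return the predicted Cheat label for one record."""
    if refund == "Yes":
        return "No"                      # Refund = Yes leaf
    if marital_status == "Married":
        return "No"                      # MarSt = Married leaf
    # Single or Divorced: split on taxable income (in thousands).
    # Assumption: exactly 80K falls on the "Yes" side.
    return "Yes" if taxable_income_k >= 80 else "No"


# The ten training records from the slide: (Refund, MarSt, TaxInc, Cheat).
training_records = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

correct = sum(
    predict_cheat(refund, status, income) == label
    for refund, status, income, label in training_records
)
print(f"{correct} of {len(training_records)} training records classified correctly")
```

On the training set this tree reproduces every label, which is what one would expect from a tree grown on these ten records.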
![Page 9: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/9.jpg)
Hunt’s Method
An example:
- Attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income (Continuous)
- Class: Cheat, Don’t Cheat

The tree grows in four steps:

1. A single leaf: Don’t Cheat

2. Refund?
   - Yes: Don’t Cheat
   - No: Don’t Cheat

3. Refund?
   - Yes: Don’t Cheat
   - No: Marital Status?
     - Single, Divorced: Cheat
     - Married: Don’t Cheat

4. Refund?
   - Yes: Don’t Cheat
   - No: Marital Status?
     - Single, Divorced: Taxable Income?
       - < 80K: Don’t Cheat
       - >= 80K: Cheat
     - Married: Don’t Cheat
![Page 10: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/10.jpg)
Binary Attributes: Computing GINI Index
- Splits into two partitions (test "True?": Yes leads to Node N1, No leads to Node N2).
- Effect of weighing partitions: larger and purer partitions are sought.

| Example | C1 in N1 | C1 in N2 | C2 in N1 | C2 in N2 | Gini |
|---|---|---|---|---|---|
| 1 | 0 | 4 | 6 | 0 | 0.000 |
| 2 | 3 | 4 | 3 | 0 | 0.300 |
| 3 | 4 | 2 | 4 | 0 | 0.400 |
| 4 | 6 | 2 | 2 | 0 | 0.300 |
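The Gini values on this slide can be reproduced by weighting each partition's Gini index by its share of the records. A small sketch, assuming the usual definition Gini = 1 − Σ pᵢ²; the helper names `gini` and `split_gini` are illustrative only.

```python
# Sketch: weighted Gini index of a binary split (helper names are hypothetical).
def gini(counts):
    """Gini index of one node given its per-class record counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def split_gini(partitions):
    """Gini index of a split: node Gini values weighted by node size."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)


# Each example is (N1 counts, N2 counts), with counts ordered as [C1, C2].
examples = [
    ([0, 6], [4, 0]),
    ([3, 3], [4, 0]),
    ([4, 4], [2, 0]),
    ([6, 2], [2, 0]),
]
results = [round(split_gini(e), 3) for e in examples]
print(results)
```

The second and fourth examples tie at 0.300: a slightly less pure but larger N1 can score as well as a smaller mixed one, which is the weighting effect the slide points at.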
![Page 11: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/11.jpg)
Continuous Attributes: Computing Gini Index...
For efficient computation, for each attribute:
- Sort the attribute on values.
- Linearly scan these values, each time updating the count matrix and computing the Gini index.
- Choose the split position that has the least Gini index.

Sorted values:

| Taxable Income | 60 | 70 | 75 | 85 | 90 | 95 | 100 | 120 | 125 | 220 |
|---|---|---|---|---|---|---|---|---|---|---|
| Cheat | No | No | No | Yes | Yes | Yes | No | No | No | No |

Split positions:

| Split position | Yes (<=) | Yes (>) | No (<=) | No (>) | Gini |
|---|---|---|---|---|---|
| 55 | 0 | 3 | 0 | 7 | 0.420 |
| 65 | 0 | 3 | 1 | 6 | 0.400 |
| 72 | 0 | 3 | 2 | 5 | 0.375 |
| 80 | 0 | 3 | 3 | 4 | 0.343 |
| 87 | 1 | 2 | 3 | 4 | 0.417 |
| 92 | 2 | 1 | 3 | 4 | 0.400 |
| 97 | 3 | 0 | 3 | 4 | 0.300 |
| 110 | 3 | 0 | 4 | 3 | 0.343 |
| 122 | 3 | 0 | 5 | 2 | 0.375 |
| 172 | 3 | 0 | 6 | 1 | 0.400 |
| 230 | 3 | 0 | 7 | 0 | 0.420 |
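The sort-then-scan procedure can be sketched as follows. This is an illustration under assumptions, not the talk's implementation: the function name `best_split` is hypothetical, and candidate split positions are taken as exact midpoints between adjacent sorted values (97.5 where the slide shows the rounded 97).

```python
# Sketch: one sorted pass over a continuous attribute, updating class counts
# incrementally and keeping the split with the lowest weighted Gini index.
def gini(yes, no):
    total = yes + no
    if total == 0:
        return 0.0
    return 1.0 - (yes / total) ** 2 - (no / total) ** 2


def best_split(values, labels):
    """Return (gini, threshold) of the best binary split 'value <= threshold'."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_yes = sum(1 for _, is_yes in pairs if is_yes)
    total_no = n - total_yes
    left_yes = left_no = 0
    best = (float("inf"), None)
    for i in range(n - 1):
        value, is_yes = pairs[i]
        # Move record i to the left partition: update the count matrix.
        if is_yes:
            left_yes += 1
        else:
            left_no += 1
        next_value = pairs[i + 1][0]
        if next_value == value:
            continue  # no split position between equal values
        left_n = i + 1
        score = (left_n / n) * gini(left_yes, left_no) + (
            (n - left_n) / n
        ) * gini(total_yes - left_yes, total_no - left_no)
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best


# Taxable income (in K) and Cheat label for Tid 1..10 from the example.
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat = [False, False, False, False, True, False, False, True, False, True]
score, threshold = best_split(income, cheat)
print(round(score, 3), threshold)
```

After the initial sort, the scan costs one pass per attribute because only two counters change at each step.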
![Page 12: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/12.jpg)
Need for Parallel Formulations
- Need to handle very large datasets.
- Memory limitations of sequential computers cause sequential algorithms to make multiple expensive I/O passes over data.
- Need for scalable, efficient (fast) data mining computations:
  - gain competitive advantage
  - handle larger data for greater accuracy in shorter times
![Page 13: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/13.jpg)
Constructing a Decision Tree in Parallel
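Partitioning of data only
– global reduction per node is required
– large number of classification tree nodes gives high communication cost

[Figure: n records with m categorical attributes; example class-count table for one attribute.]

| | Good | Bad |
|---|---|---|
| family | 35 | 50 |
| sport | 20 | 5 |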
![Page 14: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/14.jpg)
Synchronous Tree Construction Approach
+ No data movement is required
– Load imbalance
  • can be eliminated by breadth-first expansion
– High communication cost
  • becomes too high in lower parts of the tree
Partition Data Across Processors
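The per-node global reduction in the synchronous approach can be simulated on one machine. In this sketch each list stands in for one processor's share of the records, and summing `Counter` objects plays the role of the reduction; the records and the names `local_histogram` and `global_reduction` are made up for illustration, and a real implementation would use a message-passing reduction instead.

```python
# Sketch: every "processor" builds a class histogram for the same tree node
# from its local records; a reduction combines them without moving any data.
from collections import Counter


def local_histogram(records, attribute):
    """Count (attribute value, class) pairs in one processor's records."""
    hist = Counter()
    for rec in records:
        hist[(rec[attribute], rec["class"])] += 1
    return hist


def global_reduction(histograms):
    """Element-wise sum of the local histograms (stand-in for an all-reduce)."""
    total = Counter()
    for hist in histograms:
        total.update(hist)
    return total


# Hypothetical records, split across two processors.
partitions = [
    [{"type": "family", "class": "Good"}, {"type": "sport", "class": "Bad"}],
    [{"type": "family", "class": "Good"}, {"type": "family", "class": "Bad"}],
]
merged = global_reduction(local_histogram(p, "type") for p in partitions)
print(dict(merged))
```

Only the small histograms are exchanged, never the records, but one such exchange is needed for every node being expanded, which is where the communication cost in the lower parts of the tree comes from.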
![Page 15: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/15.jpg)
Constructing a Decision Tree in Parallel
Partitioning of classification tree nodes
– natural concurrency
– load imbalance, as the amount of work associated with each node varies
– child nodes use the same data as used by the parent node
– loss of locality
– high data movement cost

[Figure: 10,000 training records split into child nodes of 7,000 and 3,000 records, which split further into nodes of 2,000, 5,000, 2,000 and 1,000 records.]
![Page 16: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/16.jpg)
Partitioned Tree Construction Approach
+ Highly concurrent
- High communication cost due to excessive data movements
- Load imbalance
Partition Data and Nodes
![Page 17: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/17.jpg)
Hybrid Parallel Formulation
![Page 18: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/18.jpg)
Load Balancing
![Page 19: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/19.jpg)
Splitting Criterion
Switch to Partitioned Tree Construction when

$$\sum(\text{Communication Cost}) \ge \text{Moving Cost} + \text{Load Balancing}$$

Splitting criterion ensures

$$\text{Communication Cost} \le 2 \times \text{Communication Cost}_{\text{optimal}}$$

– G. Karypis and V. Kumar, IEEE Transactions on Parallel and Distributed Systems, Oct. 1994
![Page 20: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/20.jpg)
Experimental Results
Data set
– function 2 data set discussed in SLIQ paper (Mehta, Agrawal and Rissanen, EDBT’96)
– 2 class labels, 3 categorical and 6 continuous attributes
IBM SP2 with 128 processors
– 66.7 MHz CPU with 256 MB real memory
– AIX version 4
– high performance switch
![Page 21: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/21.jpg)
Speedup Comparison of the Three Parallel Algorithms
0.8 million examples 1.6 million examples
![Page 22: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/22.jpg)
Splitting Criterion Verification in the Hybrid Algorithm
$$\text{Splitting Criterion Ratio} = \frac{\sum(\text{Communication Cost})}{\text{Moving Cost} + \text{Load Balancing}}$$
0.8 million examples on 8 processors 1.6 million examples on 16 processors
![Page 23: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/23.jpg)
Speedup of the Hybrid Algorithm with Different Size Data Sets
![Page 24: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/24.jpg)
Summary of Algorithms for Categorical Attributes
Synchronous Tree Construction Approach
– no data movement required
– high communication cost as tree becomes bushy
Partitioned Tree Construction Approach
– processors work independently once partitioned completely
– load imbalance and high cost of data movement
Hybrid Algorithm
– combines good features of two approaches
– adapts dynamically according to the size and shape of trees
![Page 25: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/25.jpg)
Handling Continuous Attributes
- Sort continuous attributes at each node of the tree (as in C4.5). Expensive, hence undesirable!
- Discretize continuous attributes
  - CLOUDS (Alsabti, Ranka, and Singh, 1998)
  - SPEC (Srivastava, Han, Kumar, and Singh, 1997)
- Use a pre-sorted list for each continuous attribute
  - SPRINT (Shafer, Agrawal, and Mehta, VLDB’96)
  - ScalParC (Joshi, Karypis, and Kumar, IPPS’98)
![Page 26: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/26.jpg)
Association Rule Discovery: Definition
- Given a set of records, each of which contains some number of items from a given collection;
- Produce dependency rules which will predict occurrence of an item based on occurrences of other items.

| TID | Items |
|---|---|
| 1 | Bread, Coke, Milk |
| 2 | Beer, Bread |
| 3 | Beer, Coke, Diaper, Milk |
| 4 | Beer, Bread, Diaper, Milk |
| 5 | Coke, Diaper, Milk |

Rules Discovered:
- {Milk} --> {Coke}
- {Diaper, Milk} --> {Beer}
![Page 27: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/27.jpg)
Association Rule Discovery Application
Marketing and Sales Promotion
Supermarket Shelf Management
Inventory Management
![Page 28: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/28.jpg)
Association Rule Discovery: Support and Confidence
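| TID | Items |
|---|---|
| 1 | Bread, Milk |
| 2 | Beer, Diaper, Bread, Eggs |
| 3 | Beer, Coke, Diaper, Milk |
| 4 | Beer, Bread, Diaper, Milk |
| 5 | Coke, Bread, Diaper, Milk |

Association Rule: $X \Rightarrow_{s,\alpha} y$

Support: $s = \dfrac{\sigma(X \cup y)}{|T|}$, that is, $s = P(X, y)$

Confidence: $\alpha = \dfrac{\sigma(X \cup y)}{\sigma(X)}$, that is, $\alpha = P(y \mid X)$

Example: $\{\text{Diaper}, \text{Milk}\} \Rightarrow_{s,\alpha} \text{Beer}$

$$s = \frac{\sigma(\text{Diaper}, \text{Milk}, \text{Beer})}{\text{Total Number of Transactions}} = \frac{2}{5} = 0.4$$

$$\alpha = \frac{\sigma(\text{Diaper}, \text{Milk}, \text{Beer})}{\sigma(\text{Diaper}, \text{Milk})} = 0.66$$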
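The support and confidence of the example rule {Diaper, Milk} → Beer can be checked directly against the five transactions on this slide. A short sketch; the helper names `support_count`, `support` and `confidence` are chosen here for illustration.

```python
# Sketch: support and confidence of a rule over the five example transactions.
transactions = [
    {"Bread", "Milk"},
    {"Beer", "Diaper", "Bread", "Eggs"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Bread", "Diaper", "Milk"},
]


def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def support(antecedent, consequent, transactions):
    """Fraction of all transactions containing antecedent and consequent."""
    return support_count(antecedent | consequent, transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    return support_count(antecedent | consequent, transactions) / support_count(
        antecedent, transactions
    )


s = support({"Diaper", "Milk"}, {"Beer"}, transactions)
c = confidence({"Diaper", "Milk"}, {"Beer"}, transactions)
print(s, round(c, 3))
```

The confidence is 2/3, which the slide shows truncated to 0.66.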
![Page 29: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/29.jpg)
Handling Exponential Complexity
Given n transactions and m different items:
- number of possible association rules: $O(m 2^{m-1})$
- computation complexity: $O(n m 2^{m})$

Systematic search for all patterns, based on support constraint [Agrawal & Srikant]:
- If {A,B} has support at least the minimum support, then both A and B have support at least the minimum support.
- If either A or B has support less than the minimum support, then {A,B} has support less than the minimum support.
- Use patterns of size k-1 to find patterns of size k.
![Page 30: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/30.jpg)
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):

| Item | Count |
|---|---|
| Bread | 4 |
| Coke | 2 |
| Milk | 4 |
| Beer | 3 |
| Diaper | 4 |
| Eggs | 1 |

Pairs (2-itemsets):

| Itemset | Count |
|---|---|
| {Bread,Milk} | 3 |
| {Bread,Beer} | 2 |
| {Bread,Diaper} | 3 |
| {Milk,Beer} | 2 |
| {Milk,Diaper} | 3 |
| {Beer,Diaper} | 3 |

Triplets (3-itemsets):

| Itemset | Count |
|---|---|
| {Bread,Milk,Diaper} | 3 |
| {Milk,Diaper,Beer} | 2 |

If every subset is considered: 6C1 + 6C2 + 6C3 = 41
With support-based pruning: 6 + 6 + 2 = 14
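The first two levels of this illustration can be reproduced from the five transactions of the Support and Confidence slide. The sketch below is illustrative only: it counts the 1-itemsets, keeps those meeting the minimum support, and counts only pairs built from the surviving items; the name `count_itemsets` is hypothetical, and the third level is left out.

```python
# Sketch: support-based pruning for 1- and 2-itemsets (minimum support = 3).
from itertools import combinations
from math import comb

transactions = [
    {"Bread", "Milk"},
    {"Beer", "Diaper", "Bread", "Eggs"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Bread", "Diaper", "Milk"},
]
MIN_SUPPORT = 3


def count_itemsets(candidates, transactions):
    """Support count of every candidate itemset."""
    return {c: sum(1 for t in transactions if c <= t) for c in candidates}


items = sorted(set().union(*transactions))
item_counts = count_itemsets([frozenset([i]) for i in items], transactions)

# Only items that are frequent on their own can appear in a frequent pair.
frequent_items = [i for i in items if item_counts[frozenset([i])] >= MIN_SUPPORT]
pair_candidates = [frozenset(p) for p in combinations(frequent_items, 2)]
pair_counts = count_itemsets(pair_candidates, transactions)

# Without pruning, every itemset of size 1..3 over the 6 items is a candidate.
brute_force = comb(6, 1) + comb(6, 2) + comb(6, 3)

print(len(items), "items,", len(pair_candidates), "candidate pairs")
print("unpruned candidates up to size 3:", brute_force)
```

Dropping Coke and Eggs after the first pass cuts the pair candidates from 15 to 6.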
![Page 31: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/31.jpg)
Counting Candidates
- Frequent itemsets are found by counting candidates.
- Simple way: search for each candidate in each transaction. Expensive!!!

[Figure: N transactions matched against M candidates.]
![Page 32: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/32.jpg)
Association Rule Discovery: Hash tree for fast access.
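[Figure: candidate hash tree. The hash function has three branches, for items {1,4,7}, {2,5,8} and {3,6,9}. Candidate 3-itemsets stored in the leaves: {1,5,9}, {1,4,5}, {1,3,6}, {3,4,5}, {3,6,7}, {3,6,8}, {3,5,6}, {3,5,7}, {6,8,9}, {2,3,4}, {5,6,7}, {1,2,4}, {4,5,7}, {1,2,5}, {4,5,8}.]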
![Page 33: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/33.jpg)
Parallel Formulation of Association Rules
Need:
- Huge transaction datasets (10s of TB)
- Large number of candidates

Data Distribution:
- Partition the transaction database, or
- Partition the candidates, or
- Both
![Page 34: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/34.jpg)
Parallel Association Rules: Count Distribution (CD)
- Each processor has the complete candidate hash tree.
- Each processor updates its hash tree with local data.
- Each processor participates in global reduction to get global counts of candidates in the hash tree.
- Multiple database scans are required if the hash tree is too big to fit in the memory.
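Count Distribution can be simulated in a single process to show the data flow. In this sketch every "processor" counts the full candidate set against its own slice of the transactions, and the local counts are then summed. The transactions and candidates are invented for illustration, a plain dictionary replaces the hash tree, and a `Counter` sum stands in for the global reduction a message-passing library would perform.

```python
# Sketch: Count Distribution (CD) simulated on one machine.
from collections import Counter


def local_counts(transactions, candidates):
    """Count every candidate against one processor's local transactions."""
    counts = Counter({c: 0 for c in candidates})
    for t in transactions:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts


def count_distribution(partitions, candidates):
    """Each partition is counted independently; counts are then reduced."""
    per_processor = [local_counts(p, candidates) for p in partitions]
    global_counts = Counter()
    for counts in per_processor:
        global_counts.update(counts)  # element-wise sum of counts
    return global_counts


# Hypothetical data: three processors, each holding its share of transactions.
candidates = [frozenset({1, 2}), frozenset({1, 3}), frozenset({2, 3})]
partitions = [
    [{1, 2, 3}, {1, 3}],   # P0
    [{2, 3}, {1, 2}],      # P1
    [{1, 2, 4}],           # P2
]
global_counts = count_distribution(partitions, candidates)
print({tuple(sorted(c)): n for c, n in global_counts.items()})
```

Every processor holds all candidates, so memory per processor does not shrink as processors are added; only the transaction scan is divided.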
![Page 35: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/35.jpg)
CD: Illustration
[Figure: processors P0, P1 and P2 each hold N/p transactions and a full copy of the candidates {1,2}, {1,3}, {2,3}, {3,4}, {5,8} with their local counts; a global reduction of counts combines them.]
![Page 36: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/36.jpg)
Parallel Association Rules: Data Distribution (DD)
- Candidate set is partitioned among the processors.
- Once local data has been partitioned, it is broadcast to all other processors.
- High communication cost due to data movement.
- Redundant work due to multiple traversals of the hash trees.
![Page 37: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/37.jpg)
DD: Illustration
All-to-All Broadcast of Candidates
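[Figure: the candidates are partitioned among the processors (P0: {1,2}, {1,3}; P1: {2,3}, {3,4}; P2: {5,8}). Each processor holds N/p local transactions, receives remote data through the data broadcast, and counts its own candidates.]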
![Page 38: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/38.jpg)
Parallel Association Rules: Intelligent Data Distribution (IDD)
Data Distribution using point-to-point communication.
Intelligent partitioning of candidate sets.
- Partitioning based on the first item of candidates.
- Bitmap to keep track of local candidate items.
Pruning at the root of candidate hash tree using the bitmap.
Suitable for single data source such as database server.
With smaller candidate set, load balancing is difficult.
![Page 39: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/39.jpg)
IDD: Illustration
All-to-All Broadcast of Candidates
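[Figure: the candidates are partitioned by first item (P0: {1,2}, {1,3} with bitmask 1; P1: {2,3}, {3,4} with bitmask 2,3; P2: {5,8} with bitmask 5). Each processor holds N/p local transactions, receives remote data through a data shift, and counts its own candidates.]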
![Page 40: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/40.jpg)
Filtering Transactions in IDD
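[Figure: the candidate hash tree on a processor whose bitmask is 1, 3, 5, probed with transaction 1 2 3 5 6. The subsets starting with item 1 (1 + 2 3 5 6) and with item 3 (3 + 5 6) are followed into the tree; the subsets starting with item 2 (2 + 3 5 6) are skipped because 2 is not in the bitmask.]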
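The bitmap filter can be sketched without a hash tree. In the sketch below, a processor whose candidates all start with item 1, 3 or 5 skips any starting item of the transaction that is outside that set before enumerating subsets. The function name, the brute-force subset enumeration, and the particular local candidate list are assumptions made for illustration; only the bitmask {1, 3, 5} and the transaction 1 2 3 5 6 come from the slide.

```python
# Sketch: IDD-style filtering of a transaction by first-item bitmap.
from itertools import combinations


def count_local_candidates(transaction, local_candidates, first_item_bitmap, k):
    """Count local k-item candidates in one transaction, skipping start items
    that no local candidate begins with."""
    counts = {c: 0 for c in local_candidates}
    items = sorted(transaction)
    skipped = []
    for pos, first in enumerate(items):
        if first not in first_item_bitmap:
            skipped.append(first)  # pruned at the root: nothing local starts here
            continue
        for rest in combinations(items[pos + 1:], k - 1):
            candidate = (first,) + rest
            if candidate in counts:
                counts[candidate] += 1
    return counts, skipped


# Hypothetical local candidates (sorted tuples) for a processor with bitmask 1,3,5.
local_candidates = [(1, 2, 5), (1, 3, 6), (3, 5, 6), (3, 5, 7), (5, 6, 7)]
counts, skipped = count_local_candidates(
    transaction={1, 2, 3, 5, 6},
    local_candidates=local_candidates,
    first_item_bitmap={1, 3, 5},
    k=3,
)
print(counts)
print("start items skipped:", skipped)
```

The saving grows with the number of processors, since each processor's bitmap then covers a smaller share of the possible first items.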
![Page 41: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/41.jpg)
Parallel Association Rules: Hybrid Distribution (HD)
- Candidate set is partitioned into G groups to just fit in main memory.
  - Ensures good load balance with smaller candidate set.
- Logical processor mesh G x P/G is formed.
- Perform IDD along the column processors.
  - Data movement among processors is minimized.
- Perform CD along the row processors.
  - Smaller number of processors is involved in the global reduction operation.
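The G x P/G mesh can be made concrete with a small layout calculation. This is a sketch under stated assumptions rather than the authors' procedure: `choose_groups` picks the smallest G that lets one candidate partition fit an assumed per-processor capacity and rounds G up to a divisor of P (the rounding rule is an assumption), and `mesh_layout` maps a processor rank to its (row, column) position, where the row selects the candidate partition and the column selects the block of transactions.

```python
# Sketch: choosing G and laying out the G x P/G logical mesh for HD.
import math


def choose_groups(num_candidates, candidates_per_processor, num_processors):
    """Smallest G (dividing P) such that num_candidates / G fits in memory.
    The divisor rounding is an illustrative assumption."""
    needed = max(1, math.ceil(num_candidates / candidates_per_processor))
    for g in range(needed, num_processors + 1):
        if num_processors % g == 0:
            return g
    return num_processors


def mesh_layout(num_processors, g):
    """Map rank -> (row, column): rows share a candidate partition (CD runs
    along a row), columns share a block of transactions (IDD runs along a column)."""
    per_row = num_processors // g
    return {rank: (rank // per_row, rank % per_row) for rank in range(num_processors)}


# Hypothetical sizes: 0.7 million candidates, room for 100,000 per processor.
g = choose_groups(700_000, 100_000, 64)
layout = mesh_layout(8, 2)
print("G =", g)
print(layout)
```

With G = 1 the scheme reduces to CD on all P processors, and with G = P it reduces to IDD, so G acts as the knob between the two.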
![Page 42: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/42.jpg)
HD: Illustration
[Figure: G x P/G processor mesh with G groups of processors and P/G processors per group. Each processor holds N/P transactions and one of the candidate partitions C0, C1, C2, so each column covers N/(P/G) transactions. IDD (all-to-all broadcast of candidates) runs along the columns and CD runs along the rows.]
![Page 43: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/43.jpg)
Parallel Association Rules: Experimental Setup
- 128-processor Cray T3E
  - 600 MHz DEC Alpha (EV4)
  - 512MB of main memory per processor
  - 3-D torus interconnection network with peak unidirectional bandwidth of 430 MB/sec.
- MPI used for communications.
- Synthetic data set: avg transaction size 15 and 1000 distinct items.
- For larger data sets, multiple reads of transactions in blocks of 1000.
- HD switches to CD after 90.7% of the total computation is done.
![Page 44: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/44.jpg)
Scaleup Results (50K, 0.1%)
![Page 45: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/45.jpg)
Speedup Results (N=1.3 million, M=0.7 million)
![Page 46: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/46.jpg)
Sizeup Results (P=64, M=0.7 million)
![Page 47: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/47.jpg)
Response Time with Varying Candidate Size (P=64, N=1.3 million)
![Page 48: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/48.jpg)
SP2 Response Time with Varying Candidate Size (P=64, N=1.3 million)
![Page 49: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/49.jpg)
Parallel Association Rules: Summary of Experiments
HD shows the same linear speedup and sizeup behavior as that of CD.
HD Exploits Total Aggregate Main Memory, while CD does not.
IDD has much better scaleup behavior than DD
![Page 50: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/50.jpg)
Summary
- Data mining is a rapidly growing field
  - Fueled by enormous data collection rates, and need for intelligent analysis for business and scientific gains.
- The large and high-dimensional nature of the data requires new analysis techniques and algorithms.
- Scalable, fast parallel algorithms are becoming indispensable.
- Many research and commercial opportunities!!!
![Page 51: Scalable Parallel Data Mining Eui-Hong (Sam) Han Department of Computer Science and Engineering Army High Performance Computing Research Center University](https://reader035.vdocuments.site/reader035/viewer/2022062523/5a4d1ad17f8b9ab05997187e/html5/thumbnails/51.jpg)
Collaborators
George Karypis and Vipin Kumar
Department of Computer Science and Engineering
Army High Performance Computing Research Center
University of Minnesota
Anurag Srivastava
Digital Impact
Vineet Singh
Hewlett Packard Laboratories
http://www.cs.umn.edu/~han