An Introduction to Knowledge Discovery and Data Mining
TuBao Ho
School of Knowledge Science
Japan Advanced Institute of Science and Technology
PDCAT 2002, T.B. Ho 2
Outline
Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medicine data mining
Data, Information, Knowledge
Data: un-interpreted signals (1st, 2nd, 3rd, 4th, …: 25, 27, 21, 26, …)
Information: data equipped with meaning (temperature of the days)
Knowledge: integrated information, including facts and their relations (“justified true belief”) (E = mc²)
Data mining metaphor: extracting ore from rock
Data, Information, Knowledge
Data: 29 (income, debt) pairs, in US$ K
1. (5.6, 8.5) 2. (6.0, 13.0) 3. (11.0, 12.0) 4. (11.0, 19.0) 5. (13.5, 10.0)
6. (16.5, 20.0) 7. (17.5, 15.0) 8. (17.5, 5.0) 9. (22.5, 25.0) 10. (26.0, 7.5)
11. (30.0, 9.0) 12. (30.0, 18.0) 13. (30.0, 30.0) 14. (31.0, 14.0) 15. (32.5, 25.0)
16. (38.0, 12.0) 17. (41.0, 9.0) 18. (41.0, 22.0) 19. (43.5, 12.5) 20. (44.0, 27.5)
21. (45.0, 22.5) 22. (48.0, 28.0) 23. (52.5, 21.0) 24. (53.5, 32.0) 25. (54.0, 27.5)
26. (57.5, 18.0) 27. (59.0, 18.0) 28. (62.5, 32.5) 29. (63.0, 18.0)
Information: Mean of Debt = 18.4, Mean of Income = 34.5
Knowledge: “if income < $33K, then the person has defaulted on the loan”
[Figure: scatter plot of debt against income, marking the mean point (34.5, 18.4) and the income threshold at 33 that separates people who have defaulted on the loan from those in good status with the bank]
Knowledge Discovery and Data Mining (KDD)
KDD: the automatic extraction of non-obvious, hidden knowledge (patterns/models) from large volumes of data
Large volumes of data: 10⁶ to 10¹² bytes; one never sees the whole data set or holds it in the memory of a computer
What knowledge? How to represent and use it?
Data mining algorithms?
From Data to Knowledge
Meningitis data, Tokyo Med. & Dental Univ., 38 attributes
Kinds of values marked in the data: numerical, categorical, missing, class attribute
...
10, M, 0, 10, 10, 0, 0, 0, SUBACUTE, 37, 2, 1, 0,15,-,-, 6000, 2, 0, abnormal, abnormal,-, 2852, 2148, 712, 97, 49, F,-,multiple,,2137, negative, n, n, ABSCESS, VIRUS
12, M, 0, 5, 5, 0, 0, 0, ACUTE, 38.5, 2, 1, 0,15, -,-, 10700,4,0,normal, abnormal, +, 1080, 680, 400, 71, 59, F,-,ABPC+CZX,, 70, negative, n, n, n, BACTERIA, BACTERIA
15, M, 0, 3, 2, 3, 0, 0, ACUTE, 39.3, 3, 1, 0,15, -, -, 6000, 0,0, normal, abnormal, +, 1124, 622, 502, 47, 63, F, -,FMOX+AMK, , 48, negative, n, n, n, BACTE(E), BACTERIA
16, M, 0, 32, 32, 0, 0, 0, SUBACUTE, 38, 2, 0, 0, 15, -, +, 12600, 4, 0,abnormal, abnormal, +, 41, 39, 2, 44, 57, F, -, ABPC+CZX, ?, ?, negative, ?, n, n, ABSCESS, VIRUS
...
IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [confidence = 87.5%]
KDD: An Interdisciplinary Field
Databases: store, access, search, and update data (deduction)
Statistics: infer information from data (deduction and induction, mainly numeric data)
Machine Learning: computer algorithms that improve automatically through experience (mainly induction, symbolic data)
KDD sits at the intersection of these fields, and also draws on algorithmics, visualization, data warehouses, OLAP, etc.
KDD: New and Fast Growing Area
KDD’95, 96, 97, 98, 99, 00, 01, 02 (ACM, America)
PAKDD’97, 98, 99, 00, 01, 02 (Pacific Rim & Asia)
PKDD’97, 98, 99, 00, 01, 02 (Europe)
ICDM’01, 02 (IEEE), SDM’01, 02 (SIAM)
Industrial interest: IBM, Microsoft, Silicon Graphics, Sun, Boeing, NASA, SAS, SPSS, …
Japan: the FGCS Project focused on logic programming and reasoning; attention has been paid to knowledge acquisition and machine learning. 2001-2004: “Active Mining Project”
Why KDD?
High-powered computers (larger disks, faster CPUs) and networked data have become widely available
People gather and store so much data because they believe valuable assets are implicitly coded within it. Its true value depends on the ability to extract useful information
Manual data analysis is impractical at this scale
How to acquire knowledge for knowledge-based systems remains a central and difficult AI problem
Relational Databases
A relational database is a collection of tables, each of which is assigned a unique name and consists of a set of attributes and a set of tuples.

customer
Cust-ID | name | address | age | income | credit-info | …
C1 | Smith, Sandy | 5463 E Hasting, Burnaby BC V5A 459, Canada | 21 | $27000 | 1 | …
… | … | … | … | … | … | …

item
Item-ID | name | brand | category | type | price | place-made | supplier | cost
I3 | high-res-TV | Toshiba | high resolution | TV | $988.00 | Japan | NIkoX | $600.00
I8 | multidisc-CDplayer | Sanyo | multidisc | CD player | $369.00 | Japan | MusicFont | $120.00
… | … | … | … | … | … | … | … | …

employee
Emp-ID | name | category | group | salary | commission
E35 | Jones, Jane | home entertainment | manager | $18,000 | 2%
… | … | … | … | … | …

branch
Branch-ID | name | address
B1 | City square | 369 Cambie St., Vancouver, BC V5L 3A2, Canada
… | … | …

purchases
Trans-ID | cust-ID | empl-ID | date | time | method-paid | amount
T100 | C1 | B55 | 01/21/98 | 15:45 | Visa | $1357.00
… | … | … | … | … | … | …

item-sold
Trans-ID | item-ID | qty
T100 | I3 | 1
T100 | I8 | 2
… | … | …

works-at
Empl-ID | branch-ID
E55 | B1
… | …
Data Warehouses
A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.
[Figure: data sources in Chicago, New York, Toronto, and Vancouver are cleaned, transformed, integrated, and loaded into the data warehouse; clients reach it through query and analysis tools]
Transactional Databases
A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.
Trans_ID | list of item_IDs
T100 | I1, I3, I8, I16
T200 | I3, I5, I23
… | …
Advanced Database Systems
Object-Oriented Databases
Object-Relational Databases
Spatial Databases
Temporal Databases and Time-Series Databases
Text Databases and Multimedia Databases
Heterogeneous Databases and Legacy Databases
The World Wide Web
Spatial Databases
Spatial databases contain spatial-related information: geographic databases, VLSI chip design databases, medical and satellite image databases, etc.
Data mining may uncover patterns describing the characteristics of houses located near a specified kind of location, the climate of mountainous areas located at various altitudes, etc.
[Figure: Japanese earthquakes, 1961-1994]
Temporal and Time-Series Databases
They store time-related data. A temporal database stores relational data that include time-related attributes (timestamps with different semantics). A time-series database stores sequences of values that change with time (e.g., stock exchange data).
Data mining finds the characteristics of object evolution or the trend of change for objects: e.g., stock exchange data can be mined to uncover trends in investment strategies.
Text and Multimedia Databases
Text databases contain documents, usually highly unstructured or semi-structured. Mining them can uncover general descriptions of object classes, keywords, content associations, clustering behavior of text objects, etc.
Multimedia databases store image, audio, and video data. Applications include picture content-based retrieval, voice-email systems, video-on-demand systems, speech-based user interfaces, etc.
The World Wide Web
The Web provides an enormous source of explicit and implicit knowledge that people can navigate and search for what they need.
Example: When examining the data collected from Internet Mart, heavily trodden paths gave BT hints to regions of the site which were of key interest to its visitors.
The KDD Process
1. Understand the domain and define problems
2. Collect and preprocess data (maybe 70-90% of effort and cost in KDD)
3. Data mining: extract patterns/models (a step in the KDD process consisting of methods that produce useful patterns or models from the data)
4. Interpret and evaluate discovered knowledge
5. Put the results to practical use
KDD is inherently interactive and iterative
The KDD Process
[Figure: detailed view of the five-step KDD process. Recoverable box labels, in extraction order:]
Data organized by function
Create/select target database
Select sampling technique and sample data
Supply missing values
Normalize values
Select DM task(s)
Transform to different representation
Eliminate noisy data
Transform values
Select DM method(s)
Create derived attributes
Extract knowledge
Find important attributes & value ranges
Test knowledge
Refine knowledge
Query & report generation; aggregation & sequences; advanced methods
Data warehousing
Starting Points: Data or Mining?
Nature of Data
Flat data tables; relational database; temporal & spatial; transaction; multimedia data; text; Web
Mining tasks and methods
Classification/Prediction: decision trees, neural network, rule induction, etc.
Description: association analysis, clustering, etc.
Outline
Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medicine data mining
Primary Tasks of KDD
Predictive mining tasks perform inference on the current data in order to make predictions
Descriptive mining tasks characterize the general properties of the data in the database
Discovery of Patterns and/or Models
Models
A model is a global description of a data set, a high-level population or large-sample perspective
A model tells us about correlation between variables (regression), about hierarchies of clusters (clustering), a neural network, etc.
Patterns
A pattern is a low-level summary of a relationship, perhaps one that holds only for a few records or for only a few variables (local)
A pattern is seen as a statement S in a language L that describes a subset D(S) of a database D with a quality q(S)
IF cell_poly <= 220 AND Risk = n AND Loc_dat = + AND Nausea > 15 THEN Prediction = VIRUS [87.5%]
Datasets: Cancerous and Healthy Cells
cell | color | #nuclei | #tails | class
H1 | light | 1 | 1 | healthy
H2 | dark | 1 | 1 | healthy
H3 | light | 1 | 2 | healthy
H4 | light | 2 | 1 | healthy
C1 | dark | 1 | 2 | cancerous
C2 | dark | 2 | 1 | cancerous
C3 | light | 2 | 2 | cancerous
C4 | dark | 2 | 2 | cancerous
[Figure: drawings of the eight cells H1-H4 and C1-C4]
Classification/Prediction
Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown
Model representations: decision trees, IF-THEN rules, neural networks, mathematical formulae, etc.
Classification: A Two-Step Process
1. Model construction: classification algorithms learn a classifier (model) from the training data (H1, H2, H3, H4, C1, C2), e.g. “If color = dark and # tails = 2 Then cancerous cell”
2. Model usage: the classifier is applied to an unknown case (cancerous?)
Comparing Classification Methods
Predictive accuracy: the ability of the classifier to correctly predict unseen data
Speed: the computation cost of building and using the classifier
Robustness: the ability of the classifier to make correct predictions given noisy data or data with missing values
Scalability: the ability to construct the classifier efficiently given large amounts of data
Interpretability: the level of understanding and insight that is provided by the classifier
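Mining with Decision Trees
[Figure: decision tree for the cell data (H = healthy, C = cancerous), reconstructed from the node labels and the dataset]
#nuclei?
  1: color?
    light: H
    dark: #tails?
      1: H
      2: C
  2: color?
    light: #tails?
      1: H
      2: C
    dark: C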
General Algorithm for Tree Induction
1. Choose the “best” attribute by a given measure for attribute selection
2. Extend the tree by adding a new branch for each value of the attribute
3. Sort the training examples to the leaf nodes
4. If examples in a node belong to one class Then Stop Else Repeat steps 1-4 for leaf nodes
5. Prune the tree to avoid over-fitting
Two steps: recursively generate the tree (1-4), and prune the tree (5)
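The recursive generation phase (steps 1-4) can be sketched in Python on the cancerous/healthy cell dataset from this tutorial. This is a minimal sketch rather than the tutorial's own implementation: it assumes information gain (entropy reduction) as the attribute-selection measure, breaks ties by attribute order, and leaves out pruning (step 5). The names `entropy`, `info_gain`, `build_tree`, and `classify` are illustrative.

```python
import math
from collections import Counter

ATTRS = ["color", "nuclei", "tails"]
CELLS = [
    ({"color": "light", "nuclei": 1, "tails": 1}, "healthy"),    # H1
    ({"color": "dark",  "nuclei": 1, "tails": 1}, "healthy"),    # H2
    ({"color": "light", "nuclei": 1, "tails": 2}, "healthy"),    # H3
    ({"color": "light", "nuclei": 2, "tails": 1}, "healthy"),    # H4
    ({"color": "dark",  "nuclei": 1, "tails": 2}, "cancerous"),  # C1
    ({"color": "dark",  "nuclei": 2, "tails": 1}, "cancerous"),  # C2
    ({"color": "light", "nuclei": 2, "tails": 2}, "cancerous"),  # C3
    ({"color": "dark",  "nuclei": 2, "tails": 2}, "cancerous"),  # C4
]

def entropy(examples):
    """Shannon entropy (in bits) of the class labels."""
    counts = Counter(label for _, label in examples)
    n = len(examples)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def info_gain(examples, attr):
    """Assumed selection measure: entropy reduction after splitting on attr."""
    n = len(examples)
    remainder = 0.0
    for value in {x[attr] for x, _ in examples}:
        subset = [(x, y) for x, y in examples if x[attr] == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(examples) - remainder

def build_tree(examples, attrs):
    labels = {y for _, y in examples}
    if len(labels) == 1:                 # step 4: pure node, stop
        return labels.pop()
    if not attrs:                        # no attribute left: majority class
        return Counter(y for _, y in examples).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(examples, a))   # step 1
    rest = [a for a in attrs if a != best]
    tree = {"attr": best, "branches": {}}
    for value in sorted({x[best] for x, _ in examples}):      # step 2
        subset = [(x, y) for x, y in examples if x[best] == value]  # step 3
        tree["branches"][value] = build_tree(subset, rest)    # repeat 1-4
    return tree

def classify(tree, x):
    """Follow branches until a leaf (a class label) is reached."""
    while isinstance(tree, dict):
        tree = tree["branches"][x[tree["attr"]]]
    return tree

tree = build_tree(CELLS, ATTRS)
print(tree)
print(classify(tree, {"color": "dark", "nuclei": 2, "tails": 2}))
```

On this dataset all three attributes have the same information gain at the root, so the choice of root depends only on the tie-breaking rule; any choice still yields a tree that classifies the eight training cells correctly.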
Measures for Attribute Selection
Other Classification Methods
Neural Networks
Instance-based Classification
Genetic Algorithms
Rough Set Approach
Statistical Approaches
Support Vector Machines
etc.
Mining with Neural Networks
[Figure: neural network trained on the cells H1-H4 and C1-C4; inputs color = dark, # nuclei = 1, # tails = 2; outputs Healthy, Cancerous]
Neural Networks
Advantages:
prediction accuracy is generally high
robust, works when training examples contain errors
output may be discrete, real-valued, or a vector of several discrete or real-valued attributes
fast evaluation of the learned target function
Criticism:
long training time
difficult to understand the learned function (weights)
not easy to incorporate domain knowledge
Instance-based Classification
Instance-based classification: uses the most similar individual instances known from the past to classify a new instance
Typical approaches:
k-nearest neighbor approach: instances are represented as points in a Euclidean space
Locally weighted regression: constructs a local approximation
Case-based reasoning: uses symbolic representations and knowledge-based inference
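As an illustration of the k-nearest neighbor approach, here is a minimal Python sketch. The numeric encoding of the cell data (light = 0, dark = 1) and the function name `knn_predict` are assumptions made for this example; cell C4 is held out and used as the new instance.

```python
import math
from collections import Counter

# Cell data as points: (color: light=0 / dark=1, #nuclei, #tails).
# C4 = (1, 2, 2) is held out and used as the query below.
TRAIN = [
    ((0, 1, 1), "healthy"),    # H1
    ((1, 1, 1), "healthy"),    # H2
    ((0, 1, 2), "healthy"),    # H3
    ((0, 2, 1), "healthy"),    # H4
    ((1, 1, 2), "cancerous"),  # C1
    ((1, 2, 1), "cancerous"),  # C2
    ((0, 2, 2), "cancerous"),  # C3
]

def knn_predict(train, query, k=3):
    """Majority vote among the k training points closest to query."""
    neighbors = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict(TRAIN, (1, 2, 2), k=3))  # cancerous
```

The three nearest neighbors of the held-out cell are C1, C2, and C3, each at Euclidean distance 1, so the vote is unanimous. `math.dist` requires Python 3.8 or later.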
Genetic Algorithms (GA)
GA: based on an analogy to biological evolution
Each rule is represented by a string of bits
e.g., IF A1 and Not A2 then C2 can be encoded as 100
An initial population is created consisting of randomly generated rules
Based on the notion of survival of the fittest, a new population is formed consisting of the fittest rules and their offspring
The fitness of a rule is represented by its classification accuracy on a set of training examples
Offspring are generated by crossover and mutation
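The bit-string encoding and the two genetic operators can be sketched as follows. This is an illustrative sketch only: it assumes the two leftmost bits stand for A1 and A2 and the rightmost bit for the class (1 = C1, 0 = C2), which is consistent with the example encoding 100, and the function names are hypothetical. Fitness evaluation and selection are left out.

```python
def decode(rule):
    """Turn a 3-bit rule string into readable IF-THEN form (assumed bit layout)."""
    a1, a2, cls = rule
    condition = " AND ".join([
        "A1" if a1 == "1" else "NOT A1",
        "A2" if a2 == "1" else "NOT A2",
    ])
    return f"IF {condition} THEN {'C1' if cls == '1' else 'C2'}"

def crossover(parent_a, parent_b, point):
    """Single-point crossover: the parents swap their tails after `point`."""
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def mutate(rule, position):
    """Flip the bit at `position`."""
    flipped = "1" if rule[position] == "0" else "0"
    return rule[:position] + flipped + rule[position + 1:]

print(decode("100"))                # IF A1 AND NOT A2 THEN C2
print(crossover("100", "001", 1))   # ('101', '000')
print(mutate("100", 2))             # 101
```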
Rough Set Approach
Rough sets are used to approximately or “roughly” define equivalence classes
A rough set for a given class C is approximated by two sets:
A lower approximation (certain to be in C)
An upper approximation (possibly in C)
Finding the minimal subsets (reducts) of attributes, dependencies in data, rules, etc.
[Figure: a set X drawn over a grid of equivalence classes]
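The lower and upper approximations can be computed directly from the equivalence classes. The sketch below is an illustration on the cell dataset from this tutorial, with hypothetical function names: objects that agree on the chosen attributes fall into the same equivalence class, and the class C to approximate is the set of cancerous cells.

```python
from collections import defaultdict

# (color, #nuclei, #tails) for each cell
CELLS = {
    "H1": ("light", 1, 1), "H2": ("dark", 1, 1),
    "H3": ("light", 1, 2), "H4": ("light", 2, 1),
    "C1": ("dark", 1, 2),  "C2": ("dark", 2, 1),
    "C3": ("light", 2, 2), "C4": ("dark", 2, 2),
}
CANCEROUS = {"C1", "C2", "C3", "C4"}

def equivalence_classes(objects, attr_indices):
    """Group objects that are indiscernible on the chosen attributes."""
    groups = defaultdict(set)
    for name, values in objects.items():
        groups[tuple(values[i] for i in attr_indices)].add(name)
    return list(groups.values())

def approximations(objects, attr_indices, target):
    lower, upper = set(), set()
    for block in equivalence_classes(objects, attr_indices):
        if block <= target:   # block lies entirely inside the target
            lower |= block
        if block & target:    # block overlaps the target
            upper |= block
    return lower, upper

# Using only color and #nuclei, the cancerous class is rough:
lower, upper = approximations(CELLS, [0, 1], CANCEROUS)
print(sorted(lower))  # ['C2', 'C4']
print(sorted(upper))  # ['C1', 'C2', 'C3', 'C4', 'H2', 'H4']
```

With all three attributes every cell is its own equivalence class, so the lower and upper approximations both equal the set of cancerous cells.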
Rough Sets and Data Mining, T.Y. Lin, N. Cercone (eds.), Kluwer Academic Pub., 1997.
Rough Sets in Knowledge Discovery, L. Polkowski, A. Skowron (eds.), Physica-Verlag, 1998.
Bayesian Classification
Calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of problems
P(Ci|X) = probability that the instance X = <x1,…,xk> is of class Ci. Idea: assign to sample X the class label Ci such that P(Ci|X) is maximal
Bayes' theorem:
$$P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}$$
Naïve assumption: attribute independence
A Bayesian belief network allows conditional independence to be declared among subsets of the variables
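Under the naïve independence assumption, P(X|Ci) is the product of the per-attribute probabilities P(xj|Ci), each estimated from counts. The following Python sketch illustrates this on the cell dataset from this tutorial; the function names are hypothetical, and no smoothing is applied, so an attribute value never seen with a class gives that class probability zero.

```python
from collections import Counter, defaultdict

CELLS = [
    (("light", 1, 1), "healthy"),    # H1
    (("dark", 1, 1), "healthy"),     # H2
    (("light", 1, 2), "healthy"),    # H3
    (("light", 2, 1), "healthy"),    # H4
    (("dark", 1, 2), "cancerous"),   # C1
    (("dark", 2, 1), "cancerous"),   # C2
    (("light", 2, 2), "cancerous"),  # C3
    (("dark", 2, 2), "cancerous"),   # C4
]

def train(examples):
    """Count classes and, per class, the values of each attribute."""
    class_counts = Counter(label for _, label in examples)
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for x, label in examples:
        for j, value in enumerate(x):
            value_counts[label][j][value] += 1
    return class_counts, value_counts

def posteriors(x, class_counts, value_counts):
    """Return P(Ci | X) for every class Ci under the naive assumption."""
    n = sum(class_counts.values())
    scores = {}
    for label, count in class_counts.items():
        score = count / n                                    # P(Ci)
        for j, value in enumerate(x):
            score *= value_counts[label][j][value] / count   # P(xj | Ci)
        scores[label] = score
    total = sum(scores.values())                             # P(X)
    return {label: s / total for label, s in scores.items()}

p = posteriors(("dark", 1, 2), *train(CELLS))
print(p)                   # cancerous: 0.75, healthy: 0.25
print(max(p, key=p.get))   # cancerous
```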
Market Basket Analysis
Analyzes customer buying habits by finding associations between the different items that customers place in their “shopping baskets”
Helps develop marketing strategies by gaining insight into which items are frequently purchased together by customers
How often do people buy onigiri and beer together?
Mining with Association Rules
Association: the presence of the same color and # nuclei implies the presence of the same # tails in the same record
If color = light and # nuclei = 1 Then # tails = 1 (support = 12.5%; confidence = 50%)
If # nuclei = 2 and cell = cancerous Then # tails = 2 (support = 25%; confidence = 100%)
Support: the proportion of times that the rule applies. Confidence: the proportion of times that the rule is correct
Apriori algorithm, R. Agrawal 1993
[Figure: drawings of the eight cells H1-H4 and C1-C4]
Rule Measures: Support and Confidence
Example: Find all the rules X & Y ⇒ Z with minimum confidence and support
support s = probability that a transaction contains {X and Y and Z}
confidence c = conditional probability that a transaction having {X and Y} also contains Z
Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F
If minimum support 50%, minimum confidence 50%:
A ⇒ C (s=50%, c=66.6%)
C ⇒ A (s=50%, c=100%)
[Figure: Venn diagram of customers who buy onigiri, customers who buy beer, and customers who buy both]
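The two measures can be checked on the four transactions of this example with a few lines of Python. This is a small sketch with illustrative function names: support is the fraction of transactions containing an itemset, and confidence is the support of the whole rule divided by the support of its left-hand side.

```python
TRANSACTIONS = {
    2000: {"A", "B", "C"},
    1000: {"A", "C"},
    4000: {"A", "D"},
    5000: {"B", "E", "F"},
}

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """support(antecedent and consequent) / support(antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print(support({"A", "C"}, TRANSACTIONS))       # 0.5
print(confidence({"A"}, {"C"}, TRANSACTIONS))  # 0.666...
print(confidence({"C"}, {"A"}, TRANSACTIONS))  # 1.0
```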
Association Mining: Apriori Algorithm
It is composed of two steps:
1. Find all frequent itemsets: By definition, each of these itemsets will occur at least as frequently as a pre-determined minimum support count
2. Generate strong association rules from the frequent itemsets: By definition, these rules must satisfy minimum support and minimum confidence
Association Mining: Apriori Principle
Min. support 50%, min. confidence 50%
Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F
Frequent Itemset | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A,C} | 50%
For rule A ⇒ C:
support = support({A and C}) = 50%
confidence = support({A and C})/support({A}) = 66.6%
The Apriori principle: any subset of a frequent itemset must be frequent
The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation
1. Find the frequent itemsets: the sets of items that have minimum support
A subset of a frequent itemset must also be a frequent itemset
i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets)
C1 → … → L(i-1) → Ci → Li → C(i+1) → … → Lk
2. Use the frequent itemsets to generate association rules.
Example (min_sup_count = 2)
Transactional data D
TID | List of item_IDs
T100 | I1, I2, I5
T200 | I2, I4
T300 | I2, I3
T400 | I1, I2, I4
T500 | I1, I3
T600 | I2, I3
T700 | I1, I3
T800 | I1, I2, I3, I5
T900 | I1, I2, I3
Scan D for count of each candidate → C1
Itemset | Sup. count
{I1} | 6
{I2} | 7
{I3} | 6
{I4} | 2
{I5} | 2
Compare candidate support count with minimum support count → L1
Itemset | Sup. count
{I1} | 6
{I2} | 7
{I3} | 6
{I4} | 2
{I5} | 2
Example (min_sup_count = 2)
C2 candidates: {I1, I2}, {I1, I3}, {I1, I4}, {I1, I5}, {I2, I3}, {I2, I4}, {I2, I5}, {I3, I4}, {I3, I5}, {I4, I5}
Scan D for count of each candidate → C2
Itemset | Sup. count
{I1, I2} | 4
{I1, I3} | 4
{I1, I4} | 1
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2
{I3, I4} | 0
{I3, I5} | 1
{I4, I5} | 0
Compare candidate support count with minimum support count → L2
Itemset | Sup. count
{I1, I2} | 4
{I1, I3} | 4
{I1, I5} | 2
{I2, I3} | 4
{I2, I4} | 2
{I2, I5} | 2
Generate C3 candidates from L2: {I1, I2, I3}, {I1, I2, I5}
Scan D for count of each candidate → C3
Itemset | Sup. count
{I1, I2, I3} | 2
{I1, I2, I5} | 2
Compare candidate support count with minimum support count → L3
Itemset | Sup. count
{I1, I2, I3} | 2
{I1, I2, I5} | 2
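The level-wise search of this example can be reproduced with a short Python sketch of frequent-itemset mining by candidate generation. It is a simplified illustration, not an optimized implementation: candidates are formed by taking unions of frequent k-itemsets, and a candidate is kept only if all of its k-item subsets are frequent (the Apriori principle). The function name `apriori` and its return format are choices made for this example.

```python
from itertools import combinations

D = [
    {"I1", "I2", "I5"},        # T100
    {"I2", "I4"},              # T200
    {"I2", "I3"},              # T300
    {"I1", "I2", "I4"},        # T400
    {"I1", "I3"},              # T500
    {"I2", "I3"},              # T600
    {"I1", "I3"},              # T700
    {"I1", "I2", "I3", "I5"},  # T800
    {"I1", "I2", "I3"},        # T900
]

def apriori(transactions, min_sup_count):
    """Return {itemset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]   # C1
    frequent = {}
    k = 1
    while candidates:
        # Scan D for the count of each candidate
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Compare with the minimum support count to obtain Lk
        level = {c: n for c, n in counts.items() if n >= min_sup_count}
        frequent.update(level)
        # Join: unions of two frequent k-itemsets that have k + 1 items
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-item subset of a candidate must be frequent
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k))
        ]
        k += 1
    return frequent

for itemset, count in sorted(apriori(D, 2).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

Run on the nine transactions above, the sketch yields the same L1, L2, and L3 as the tables, and stops because the only 4-item candidate, {I1, I2, I3, I5}, has the infrequent subset {I1, I3, I5}.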
![Page 46: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/46.jpg)
Mining with Clustering
Clustering analyzes data objects without consulting a known class label.
The objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity.
Partition-based clustering for large sets of numerical data.
Hierarchical clustering, with at least $O(n^2)$ time complexity, does not seem suitable for very large datasets.
![Page 47: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/47.jpg)
What is Cluster Analysis?
A cluster is a collection of data objects such that:
Objects in this cluster are similar to one another
Objects in this cluster are dissimilar to the objects in other clusters
The process of grouping objects into clusters is called clustering
![Page 48: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/48.jpg)
Clustering in Different Fields
Statistics: for many years, the focus has been on distance-based clustering (S-Plus, SPSS, SAS)
Machine learning: unsupervised learning. In conceptual clustering, a group of objects forms a class only if it is described by a concept
KDD: Efficient and effective clustering of large databases: scalability, complex shapes and types of data, high dimensional clustering, mixed numerical and categorical data
![Page 49: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/49.jpg)
What is Good Clustering?
A good clustering method will produce high quality clusters with
high intra-class similarity (within a class)
low inter-class similarity (between classes)
The quality of clustering basically depends on the similarity measure and the cluster representative used by the method
![Page 50: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/50.jpg)
Typical Requirements of Clustering
Scalability
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noisy data
Insensitivity to the order of input records
High dimensionality
Constraint-based clustering
Interpretability and usability
![Page 51: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/51.jpg)
Clustering Methods in KDD
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Methods
![Page 52: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/52.jpg)
Partitioning Methods
Given n objects and the number k of clusters to form, a partitioning algorithm organizes the objects into a partition of k clusters
The clusters are formed to optimize an objective partitioning criterion so that the objects within a cluster are “similar”, whereas the objects of different clusters are “dissimilar”
![Page 53: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/53.jpg)
K-means Algorithm (K=2)
1. Two centers are selected randomly from the n objects
2. Form two clusters by assigning each object to its nearest center
3. Calculate two new centers
4. Re-form two new clusters
Repeat steps 2 and 3 until the stopping conditions hold
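The K-means steps on this slide can be sketched as follows. This is a minimal illustration, not the tutorial's implementation: the `init` parameter (fixed starting centers, used here so the run is reproducible), the squared Euclidean distance, and the "assignment no longer changes" stopping condition are all assumptions.

```python
import random

def kmeans(points, k=2, init=None, max_iter=100, seed=0):
    """Plain k-means on a list of numeric tuples; returns (centers, labels)."""
    rng = random.Random(seed)
    # Step 1: pick k initial centers (randomly unless `init` is given).
    centers = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each object to its nearest center.
        new_labels = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        # Stopping condition (assumed): the assignment no longer changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its cluster.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return centers, labels

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = kmeans(points, k=2, init=[(1.0, 1.0), (8.0, 8.0)])
print(labels)
print(centers)
```

With random initialization (omit `init`), different seeds may converge to different partitions, which is one reason K-means is usually restarted several times in practice.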
![Page 54: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/54.jpg)
Partitioning Methods
The k-means algorithm is sensitive to outliers
The k-medoids method uses a medoid (the most centrally located object in a cluster) as the cluster representative
The EM (Expectation Maximization) algorithm assigns each object to a cluster according to a weight representing the probability of membership
PAM (Partitioning Around Medoids)
From k-medoids to CLARA (Clustering LARge Applications)
From CLARA to CLARANS (Clustering Large Applications based on RANdomized Search)
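To see why a medoid is less sensitive to outliers than a mean, the following small sketch compares the two on one cluster containing a single extreme object. The data values and helper names are hypothetical, and Euclidean distance is an assumption.

```python
def mean_point(points):
    """Coordinate-wise mean, as used for a k-means center."""
    return tuple(sum(v) / len(points) for v in zip(*points))

def medoid(points):
    """The object whose total distance to all objects in the cluster is smallest."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(points, key=lambda p: sum(dist(p, q) for q in points))

# Four nearby objects plus one outlier (hypothetical values).
cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (50.0, 50.0)]
print(mean_point(cluster))  # pulled far toward the outlier
print(medoid(cluster))      # stays inside the dense group
```

The medoid is always an actual object of the cluster, so a single outlier cannot drag the representative into an empty region of the data space.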
![Page 55: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/55.jpg)
Hierarchical Methods
A hierarchical clustering is a sequence of partitions in which each partition is nested into the next (previous) partition in the sequence.
Partition Q is nested into partition P if every component of Q is a subset of a component of P.
$P = \{\{x_1, x_4, x_7, x_9, x_{10}\}, \{x_2, x_8\}, \{x_3, x_5, x_6\}\}$
$Q = \{\{x_1, x_4, x_9\}, \{x_7, x_{10}\}, \{x_2, x_8\}, \{x_5\}, \{x_3, x_6\}\}$
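The nesting condition can be checked mechanically. The sketch below encodes the two partitions from this slide by object index (1 for x1, 10 for x10); the function name `is_nested` is an assumption made for illustration.

```python
def is_nested(q, p):
    """True if every component of partition q is a subset of a component of p."""
    return all(any(set(b) <= set(a) for a in p) for b in q)

# Partitions P and Q from the slide, written with object indices.
P = [{1, 4, 7, 9, 10}, {2, 8}, {3, 5, 6}]
Q = [{1, 4, 9}, {7, 10}, {2, 8}, {5}, {3, 6}]

print(is_nested(Q, P))  # Q is nested into P
print(is_nested(P, Q))  # the converse does not hold
```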
![Page 56: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/56.jpg)
Hierarchical Clustering: Chameleon
Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling. Clusters are merged if the interconnectivity and closeness between them are high relative to the internal interconnectivity and closeness of objects within the clusters.
![Page 57: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/57.jpg)
Density-based Methods
Density-based methods typically regard clusters as dense regions of objects in the data space that are separated by regions of low density
DBSCAN: Based on Connected Regions with Sufficiently High Density (Nearest Neighbor Estimation)
DENCLUE: Based on Density Distribution Functions (Kernel Estimation)
DBScan result for DS2 with MinPts at 4 and Eps at (a) 5.0, (b) 3.5 and (c) 3.0
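The roles of the two DBSCAN parameters, Eps and MinPts, can be illustrated with a compact sketch. This is a simplified version written for this transcript (brute-force neighbor search, Euclidean distance, a point counts as its own neighbor), not the implementation that produced the DS2 results; the data set is hypothetical.

```python
def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    def neighbors(i):
        # Indices of all points within distance eps of point i (including i).
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]

    labels = [None] * len(points)      # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:       # not a core point: provisionally noise
            labels[i] = -1
            continue
        cluster += 1                   # start a new cluster from core point i
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:        # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:  # already assigned: do not expand again
                continue
            labels[j] = cluster
            nb = neighbors(j)
            if len(nb) >= min_pts:     # j is a core point: keep expanding
                queue.extend(nb)
    return labels

# Two dense groups and one isolated point (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = dbscan(points, eps=1.5, min_pts=4)
print(labels)
```

Lowering Eps, as in panels (a) to (c) of the figure, makes the density requirement stricter, so clusters tend to split and more points are labeled as noise.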
![Page 58: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/58.jpg)
Data and Knowledge Visualization
Tree map
Cone tree
Fisheye view
Hyperbolic tree
MagicLens
![Page 59: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/59.jpg)
KDD Products and Tools
SPSS
IBM
Silicon Graphics
SAS
Salford Systems
RuleQuest Research (C4.5)
![Page 60: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/60.jpg)
Outline
Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medicine data mining
![Page 61: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/61.jpg)
Challenges of KDD
Different types of data in different forms (mixed numeric, symbolic, text, image, voice, ...)
Large data sets ($10^6$ to $10^{12}$ bytes) and high dimensionality ($10^2$ to $10^3$ attributes) [Problems: efficiency, scalability?]
[Problems: quality, effectiveness?]
Data and knowledge are changing
Human-Computer Interaction and Visualization
![Page 62: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/62.jpg)
Large Datasets and High Dimensionality
3 attributes, each with 2 values: #instances = $2^3 = 8$, #patterns = 27
What if #attributes increases?
The sizes of the instance space and the pattern space increase exponentially
p attributes, each with d values: the size of the instance space is $d^p$
38 attributes, each with 10 values: #instances = $10^{38}$
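The counts on this slide follow from simple exponentiation, as the sketch below shows. The instance-space formula $d^p$ is stated on the slide; the pattern-space formula $(d+1)^p$ is an assumption made here (each attribute is either fixed to one of its d values or left unconstrained), chosen because it reproduces the slide's figure of 27 patterns for 3 binary attributes.

```python
def instance_space_size(p, d):
    """Number of distinct instances for p attributes with d values each."""
    return d ** p

def pattern_space_size(p, d):
    """Number of conjunctive patterns, assuming each attribute is either
    fixed to one of its d values or left as "don't care" (an assumption)."""
    return (d + 1) ** p

print(instance_space_size(3, 2))    # 3 binary attributes
print(pattern_space_size(3, 2))
print(instance_space_size(38, 10))  # 38 attributes with 10 values each
```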
![Page 63: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/63.jpg)
Possible Solutions
Scalable and efficient algorithms (scalable: given an amount of main memory, the runtime increases linearly with the number of input instances)
Sampling (instance selection)
Dimensionality reduction (feature selection)
Approximation methods
Massively parallel processing
Integration of machine learning and database management
![Page 64: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/64.jpg)
Numerical vs. Symbolic Data

| Attribute type | Structure | Operations | Examples |
|---|---|---|---|
| Nominal (categorical) | No structure | =, ≠ | Places, Color |
| Ordinal | Ordinal structure | =, ≠, ≥ | Rank, Resemblance |
| Measurable | Ring structure | =, ≠, ≥, +, × | Age, Temperature, Taste, Income, Length |

Symbolic data: combinatorial search in hypothesis spaces (machine learning)
Numerical data: often matrix-based computation (multivariate data analysis)
![Page 65: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/65.jpg)
Issues of Decision Tree Mining
Attribute selection
Pruning trees
From trees to rules (high cost of pruning)
Visualization
Data access: recent development on very large training sets, fast, efficient and scalable (in-memory and secondary storage)
(well-known systems: C4.5 and CART)
![Page 66: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/66.jpg)
Scalable Decision Tree Induction Methods
SLIQ (Mehta et al., 1996): builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (J. Shafer et al., 1996): constructs an attribute list data structure
PUBLIC (Rastogi & Shim, 1998): integrates tree splitting and tree pruning, stopping tree growth earlier
RainForest (Gehrke, Ramakrishnan & Ganti, 1998): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)
![Page 67: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/67.jpg)
Issues of Neural Network Mining
Neural networks effectively address the weakness of the symbolic AI approach in knowledge discovery (growth of the hypothesis space)
Extracting or making sense of the numeric weights associated with the interconnections of neurons, to come up with a higher level of knowledge, has been and will continue to be a challenging problem
![Page 68: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/68.jpg)
Issues of Association Rule Mining
Improving the efficiency
Database scan reduction: partitioning (Savasere 95), hashing (Park 95), sampling (Toivonen 96), dynamic itemset counting (Brin 97), finding non-redundant rules (3000 times fewer, Zaki KDD’2000)
Parallel mining of association rules
New measures of association
Interestingness and exceptional rules
Generalized and multiple-level rules
![Page 69: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/69.jpg)
Mining Scientific Data
Data Mining in Bioinformatics
Data Mining in Astronomy and Earth Sciences
Mining Physics and Chemistry data
Mining Large Image Databases
etc.
![Page 70: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/70.jpg)
Some Advanced Techniques
Support Vector Machines
Independent Component Analysis
Level Sets and Data Mining
Multi-Relational Data Mining and Logic Programming
Ensemble Methods
Distributed and High Performance Computing
etc.
![Page 71: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/71.jpg)
Outline
Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medicine data mining
![Page 72: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/72.jpg)
Possible Solutions
Scalable and efficient algorithms (scalable: given an amount of main memory, the runtime increases linearly with the number of input instances)
Massively parallel processing
Data-parallel vs. Control-parallel Data Mining
Client/Server Frameworks for Parallel Data Mining
Reference: Alex A. Freitas & Simon H. Lavington, Mining Very Large Databases with Parallel Processing, Kluwer Academic Publishers, 1998
![Page 73: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/73.jpg)
Example of a Scalable Algorithm: Mixed Similarity Measure
Mixed Similarity Measures (MSM): Goodall (1966), time $O(n^3)$; Diday and Gowda (1992); Ichino and Yaguchi (1994)
Li & Biswas (1997): time $O(n^2 \log n^2)$, space $O(n^2)$
New and Efficient MSM (Binh & Bao, 2000): time and space $O(n)$
Formulas shown on the slide: $P_{ij}$, $P^{*}_{ij}$, $\hat{P}_{ij} = 1 - \hat{P}^{*}_{ij}$
![Page 74: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/74.jpg)
Comparative Results
US Census database, 33 symbolic + 8 numeric attributes; Alpha 21264, 500 MHz, 2 GB RAM, Solaris OS (Nguyen N.B. & Ho T.B., PKDD 2000)

| #cases (size) | 500 (0.2M) | 1,000 (0.5M) | 1,500 (0.9M) | 2,000 (1.1M) | 5,000 (2.6M) | 10,000 (5.2M) | 199,523 (102M) | Complexity |
|---|---|---|---|---|---|---|---|---|
| #values | 497 | 992 | 1,486 | 1,973 | 4,858 | 9,651 | 97,799 | |
| Time of LiBis | 67.3s | 26m6.2s | 1h46m31s | 6h59m45s | >60h | not app | not app | $O(n^2 \log n^2)$ |
| Time of OURS | 0.1s | 0.2s | 0.3s | 0.5s | 2.8s | 9.2s | 36m26s | $O(n)$ |
| Memory of LiBis | 5.3M | 20.0M | 44.0M | 77.0M | 455.0M | not app | not app | $O(n^2)$ |
| Memory of OURS | 0.5M | 0.7M | 0.9M | 1.1M | 2.1M | 3.4M | 64.0M | $O(n)$ |
| Preprocessing | 0.1s | 0.1s | 0.2s | 0.5s | 0.9s | 6.2s | 127.2s | |
![Page 75: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/75.jpg)
Approaches of High Performance Computing to Data Mining
- Data-oriented approaches
  - Discretization
  - Attribute selection
  - Instance selection (sampling)
    - Single sampling
    - Iterative sampling
- Algorithm-oriented approaches
  - Fast algorithms
    - Restricted search
    - Algorithm optimization
  - Distributed mining
    - Voting
    - Model integration
    - Meta-learning
  - Parallel mining
    - Inter-processor cooperation
    - Inter-algorithm parallelization
![Page 76: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/76.jpg)
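Distributed & Parallel Data Mining
[Figure: Distributed system. The data set to be mined is split into subsets 1, ..., P; an algorithm (Alg.) is applied to each subset, and the resulting knowledge (Know.) is combined into the final knowledge.]
[Figure: Parallel system. Algorithms (Alg.) applied to the data set to be mined each produce knowledge (Know.), which is combined into the final knowledge.]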
![Page 77: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/77.jpg)
Parallel Data Mining
Rule induction
Decision trees
Neural networks
Genetic algorithms
Rough sets
Association rules
Clustering
etc.
1. Parallel Data Mining without DBMS Facilities
2. Parallel Data Mining with Database Facilities
[Figure: Exploiting data parallelism in instance-based learning. The stored cases are split into subsets 1, ..., p; each processor finds the local nearest case to the new case (Local MIN), and a Global MIN over the local results gives the nearest case.]
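The data-parallel scheme for instance-based learning (local MIN per processor, then a global MIN) can be sketched as below. This is only an illustration: threads stand in for the p processors, the round-robin split and the squared Euclidean distance are assumptions, and the list of stored cases is assumed to be non-empty.

```python
from concurrent.futures import ThreadPoolExecutor

def sq_dist(a, b):
    """Squared Euclidean distance between two cases (an assumed measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def local_min(subset, new_case):
    """Local MIN: the nearest stored case within one subset."""
    return min(subset, key=lambda case: sq_dist(case, new_case))

def nearest_case(stored_cases, new_case, p=4):
    # Split the stored cases into p subsets, one per (simulated) processor.
    subsets = [stored_cases[i::p] for i in range(p)]
    subsets = [s for s in subsets if s]
    with ThreadPoolExecutor(max_workers=len(subsets)) as pool:
        local = list(pool.map(lambda s: local_min(s, new_case), subsets))
    # Global MIN: the nearest among the local nearest cases.
    return min(local, key=lambda case: sq_dist(case, new_case))

stored = [(0.0, 0.0), (5.0, 5.0), (2.0, 1.0), (9.0, 9.0), (3.0, 3.0)]
new = (2.2, 1.1)
print(nearest_case(stored, new, p=2))
```

Only one candidate per processor has to be communicated for the global step, which is why this form of data parallelism suits nearest-neighbor methods well.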
![Page 78: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/78.jpg)
Outline
Basic concepts of KDD
KDD techniques: classification, association, clustering, visualization
Challenges and trends in KDD
KDD and high performance computing
Case studies in medicine data mining
![Page 79: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/79.jpg)
Mining Stomach Cancer Data
Each year about 50,000 people die of stomach cancer in Japan. The aim is to use data mining methods to find new and useful knowledge.
The project started in summer 1999 and involves three data mining groups and doctors at the National Cancer Center in Tokyo.
The stomach cancer database was collected over 40 years (1962-1991). The transformed data table contains data on 6,712 patients described by 83 numeric and categorical attributes.
![Page 80: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/80.jpg)
Overview of Our Data Mining Work
1. Understand the domain and define problems
2. Preprocess data
3. Data mining: extract patterns/models
4. Interpret and evaluate discovered knowledge
5. Put the results into practical use
- Use pre-operative data to predict the patient stage after the operation
- alive (3275), death after 5 years (575), death after 90 days (2552), death within 90 days (302), unknown (8).
- Transform data: converting categorical many-value attributes (280) into binary attributes
- Construct the target attribute
- Selection of 31 significant attributes by KJ and SFG methods
- Learn decision trees by See5 and CABRO with tree visualization
- Learn prediction rules by CBA, Rosetta and our method LUPC
- Meeting with medical experts every two months to evaluate the results
- Scores (1 – 5) are given to “Acceptability”, “Novelty” and “Utility” of discovered patterns
- Data mining and evaluation are off-line
![Page 81: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/81.jpg)
Learned Decision Trees with CABRO
Tightly-coupled views
T2.5D views (Trees 2.5 Dimensions)
Induced decision trees with graphical representation (easy to observe and interpret)
![Page 82: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/82.jpg)
Learned Rules and Expert Evaluation
Some discovered rules (each scored 1 to 5 for Acceptability, Novelty and Utility):
IF dcancer = S AND serosal = 3 AND peritoneal = 0 AND apnemia = 0 AND
THEN death < 90days
IF dcancer = x AND type = B3 AND peritoneal = 0 AND liver_metastasis = 3
THEN death < 90days
IF sex = M AND age < 73 AND liver_metastasis = 3 AND cardio = 1
THEN death < 90days
Most rules found are not new to medical experts
Very high false negative error in the (minority) target class
![Page 83: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/83.jpg)
User-centered Data Mining
Active participation of the user (domain experts) in the KDD process and model selection
Putting the visualization power in the KDD process
Putting domain knowledge in mining
![Page 84: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/84.jpg)
Visualization in the KDD Process
Synergistic visualization of data & knowledge in the knowledge discovery context
Appropriate interactive visualization techniques in the knowledge discovery process
![Page 85: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/85.jpg)
Significant Hypothesis Detected by Visualization
Some instances in class “alive” have metastasis = 3
![Page 86: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/86.jpg)
Putting Domain Knowledge in Mining
Exclusive constraints: If imposed, D2MS will find only rules that do not contain any of these constraints (attribute-value pairs) in their condition part.
Inclusive constraints: If imposed, D2MS will find only rules that contain at least one of these constraints (attribute-value pairs) in their condition part.
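The two kinds of constraints can be illustrated as filters over a rule set. This is not D2MS code: the dictionary representation of a rule and the two function names are hypothetical, and the sample rules are taken from the stomach cancer case study in this tutorial.

```python
def satisfies_exclusive(rule, exclusive):
    """True if the condition part contains none of the excluded pairs."""
    return not any(pair in rule["if"].items() for pair in exclusive)

def satisfies_inclusive(rule, inclusive):
    """True if the condition part contains at least one of the given pairs."""
    return any(pair in rule["if"].items() for pair in inclusive)

# Hypothetical encoding of two rules: "if" holds attribute-value pairs.
rules = [
    {"if": {"category": "R", "sex": "F", "proximal_third": 3, "middle_third": 1},
     "then": "death within 90 days"},
    {"if": {"sex": "M", "type": "B1", "liver_metastasis": 3, "middle_third": 1},
     "then": "alive"},
]
constraints = [("liver_metastasis", 3)]

without_lm = [r for r in rules if satisfies_exclusive(r, constraints)]
with_lm = [r for r in rules if satisfies_inclusive(r, constraints)]
print([r["then"] for r in without_lm])
print([r["then"] for r in with_lm])
```

In a real system the constraints would more likely prune the search during rule generation rather than filter afterwards; the post-filter here only shows the semantics.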
![Page 87: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/87.jpg)
Putting Domain Knowledge in Mining
Finding irregular rules
Find only rules for class “death within 90 days” that do not contain the characteristic attribute “liver_metastasis” and/or its combination with two other typical attributes, “Peritoneal_metastasi” and “Serosal_invasion”, by exclusive constraints.
Rule 8: acc = 1.0 (4/4), cover = 0.001 (4/6712)
IF category = R AND sex = F AND proximal_third = 3 AND middle_third = 1
THEN death within 90 days
Finding rare rules
Find rules in the class “alive” that contain the symptom “liver_metastasis” by inclusive constraints.
Rule 1: acc = 0.500 (2/4); cover = 0.001 (4/6712)
IF sex = M AND type = B1 AND liver_metastasis = 3 AND middle_third = 1
THEN class = alive
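The accuracy and coverage figures quoted with the rules can be reproduced with a small calculation. The reading used here is an assumption inferred from the fractions on the slide: accuracy is the share of matched cases that belong to the predicted class, and coverage is the share of all 6,712 cases matched by the condition part.

```python
def rule_stats(matched, correct, total):
    """Return (accuracy, coverage) for a rule.

    matched: cases satisfying the IF part
    correct: matched cases that also belong to the THEN class
    total:   all cases in the data table
    """
    return correct / matched, matched / total

# Rule 8: 4 of 4 matched cases are correct, out of 6712 patients.
acc8, cover8 = rule_stats(matched=4, correct=4, total=6712)
# Rule 1: 2 of 4 matched cases are correct.
acc1, cover1 = rule_stats(matched=4, correct=2, total=6712)
print(round(acc8, 3), round(cover8, 3))
print(round(acc1, 3), round(cover1, 3))
```

Both rules cover only four patients, so a coverage of 0.001 is a rounded value of roughly 0.0006; such rules are rare by construction and need expert review before they are trusted.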
![Page 88: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/88.jpg)
Mining Hepatitis Data with Temporal Abstraction
The hepatitis relational database was collected during 1982-2001 at the Chiba University Hospital
Our process of mining hepatitis data with temporal abstraction goes through six steps
![Page 89: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/89.jpg)
Temporal Abstraction Problems & Data Analysis
Structure and problem of temporal abstraction
Structure of basic temporal abstraction: <episode, state & trend>
Example: <ALB 3 months, low & decreasing>
Problems: finding episodes, states, and trends.
Analysis of data by statistics and visualization tools
For example, when visualizing the relation between GOT, GPT, TTT, ZTT and fibrosis stages of one patient during 1985-1993, we observed that the values of GOT, GPT, TTT, and ZTT decrease when fibrosis becomes less severe.
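A basic temporal abstraction of the form <episode, state & trend> can be sketched as a function that maps the test values within one episode to a state and a trend. Everything specific here is hypothetical: the thresholds `low` and `high`, the 5% tolerance for "stable", the use of the episode mean for the state, and the sample ALB values are illustrative only and are not the rules used in the study.

```python
def abstract_episode(values, low, high, tol=0.05):
    """Map the values observed in one episode to (state, trend)."""
    mean = sum(values) / len(values)
    # State: compare the episode mean with an assumed normal range.
    if mean < low:
        state = "low"
    elif mean > high:
        state = "high"
    else:
        state = "normal"
    # Trend: compare the last value with the first, within a tolerance.
    change = values[-1] - values[0]
    if abs(change) <= tol * abs(values[0]):
        trend = "stable"
    elif change > 0:
        trend = "increasing"
    else:
        trend = "decreasing"
    return state, trend

# Hypothetical ALB measurements over a 3-month episode.
alb_3_months = [3.6, 3.4, 3.1]
print(abstract_episode(alb_3_months, low=3.9, high=5.1))
```

Choosing the episode length and the thresholds is exactly the "finding episodes, states, and trends" problem named on the slide, so these parameters would normally come from domain experts.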
![Page 90: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/90.jpg)
Abstracted Data and Primary Results
From the relational and temporal database, we derived abstracted descriptions and converted them into symbolic data in flat data tables.
Most rules for hepatitis B and C match from 2% to 5% of the database with high accuracy. The accuracy with 10-fold cross-validation is somewhat higher than 70%.
Using the D2MS system, we found different rule sets and decision trees for distinguishing hepatitis B and C, as well as the fibrosis stages.
The patient in the first row has abstractions on “ALB 3 months” as “normal & decreasing-decreasing” (N-DD), on “ALB 6 months” as “normal & decreasing-stable” (N-DS), etc.
[Figure panels: Original data, Abstracted data, Extracted rules]
![Page 91: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/91.jpg)
Rules That Contradict Human Belief
Short term change: GOT (up), GPT (up), TTT (up), ZTT (up).
Long term change: T-CHO (down), CHE (down), ALB (down), TP (down), PLT (down), WBC (down), HGB (down), T-BIL (up), D-BIL (up), I-BIL (up), ICG-15 (up).
Many rules found contradict human belief
Rule 2: accuracy = 1.0 (12/12); coverage = 0.028 (12/426)
IF ALB2 = normal & decreasing-decreasing AND GOT4 = normal & decreasing-decreasing AND TTT4 = normal & decreasing-decreasing
THEN class = fibrosis stage F1
![Page 92: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/92.jpg)
Rules Characterizing HBV and HCV
Example of a rule for hepatitis C
The rules show the difference in temporal patterns between HBV and HCV
![Page 93: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/93.jpg)
Rules Characterizing Fibrosis Stages
Example of a rule characterizing fibrosis stage F4.
The rules show the difference in temporal patterns between fibrosis stages F0, F1, …, F4
![Page 94: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/94.jpg)
Summary
KDD concepts, methods, challenges, examples
KDD is a new, fast-growing interdisciplinary field for both research and application
Speeding up KDD algorithms is crucial
![Page 95: An Introduction to Knowledge Discovery and Data Miningbao/talks/PDCATTutorial.pdf · An Introduction to Knowledge Discovery and Data Mining TuBao Ho School of Knowledge Science](https://reader033.vdocuments.site/reader033/viewer/2022042303/5ecec23702646746d80faa1f/html5/thumbnails/95.jpg)
Recommended References
http://www.kdnuggets.com
David J. Hand, Heikki Mannila and Padhraic Smyth, Principles of Data Mining, MIT Press, 2000
Jiawei Han, Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2000
U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, AAAI/MIT Press, 1996.