dekai/600G/notes/KM_Slides_Ch12.pdf
Becerra-Fernandez, et al. -- Knowledge Management 1/e -- © 2004 Prentice Hall
Additional material © 2007 Dekai Wu
Chapter 12
Discovering New Knowledge – Data Mining
Chapter Objectives
• Introduce the student to the concept of Data Mining (DM), also known as Knowledge Discovery in Databases (KDD):
  – How it differs from knowledge elicitation from experts
  – How it differs from extracting existing knowledge from databases
• The objectives of data mining:
  – Explanation of past events (descriptive DM)
  – Prediction of future events (predictive DM)
Chapter Objectives (cont.)
• Introduce the student to the different classes of statistical methods available for DM
  – Classical statistics (e.g., regression, curve fitting, …)
  – Induction of symbolic rules
  – Neural networks (a.k.a. “connectionist” models)
• Introduce the student to the details of some of the methods described in the chapter.
Historical Perspective
• DM, a.k.a. KDD, arose at the intersection of three independently evolved research directions:
  – Classical statistics and statistical pattern recognition
  – Machine learning (from symbolic AI)
  – Neural networks
Objectives of Data Mining
• Descriptive DM seeks patterns in past actions or activities in order to affect those actions or activities
  – e.g., seek patterns indicative of fraud in past records
• Predictive DM looks at past history to predict future behavior
  – Classification classifies a new instance into one of a set of discrete predefined categories
  – Clustering groups items in the data set into different categories
  – Affinity or association finds items closely associated in the data set
Classical statistics & statistical pattern recognition
• Provide a survey of the most important statistical methods for data mining
  – Curve fitting with the least-squares method
  – Multivariate correlation
  – K-means clustering
  – Market basket analysis
  – Discriminant analysis
  – Logistic regression
Figure 12.14 – 2-D input data plotted on a graph (axes x and y)
Figure 12.15 – data and their deviations from the best-fitting equation (axes x and y)
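Least-squares curve fitting, in its simplest straight-line form, minimizes exactly the squared deviations pictured in Figure 12.15. A minimal sketch in pure Python (function and variable names are mine, not from the slides):

```python
# Fit y = a*x + b by least squares, minimizing the sum of squared deviations.
def least_squares_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope a = covariance(x, y) / variance(x); intercept b = mean_y - a*mean_x
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - a * mean_x
    return a, b

a, b = least_squares_line([0, 1, 2, 3], [1, 3, 5, 7])
print(a, b)  # 2.0 1.0  (the data lie exactly on y = 2x + 1)
```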
Induction of symbolic rules
• Present a detailed description of the symbolic approach to data mining – rule induction by learning decision trees
• Present the main algorithms for rule induction:
  – C5.0 and its ancestors, ID3 and CLS (from machine learning)
  – CART (Classification And Regression Trees) and CHAID, very similar rule-induction algorithms developed independently in statistics
• Present several example applications of rule induction
Table 12.1 – decision tables (if ordered, then decision lists)

| Name | Outlook | Temperature | Humidity | Class |
|---|---|---|---|---|
| Data sample 1 | Sunny | Mild | Dry | Enjoyable |
| Data sample 2 | Cloudy | Cold | Humid | Not Enjoyable |
| Data sample 3 | Rainy | Mild | Humid | Not Enjoyable |
| Data sample 4 | Sunny | Hot | Humid | Not Enjoyable |

Note: DS = Data Sample
Figure 12.1 – decision trees (a.k.a. classification trees)

Root node: Is the stock’s price/earnings ratio > 5?
– Yes → Has the company’s quarterly profit increased over the last year by 10% or more?
  – Yes → Is the company’s management stable?
    – Yes → Buy (leaf node)
    – No → Don’t buy
  – No → Don’t buy
– No → Don’t buy
Induction trees
• An induction tree is a decision tree holding the data samples (of the training set)
• Built progressively by gradually segregating the data samples
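One segregation step is simply grouping the training samples by an attribute’s value — a minimal sketch (function name and data encoding are mine, not from the slides):

```python
# "Gradually segregating the data samples": group samples by attribute value.
def segregate(samples, attr):
    groups = {}
    for s in samples:
        groups.setdefault(s[attr], []).append(s)
    return groups

# The four samples of Table 12.1, reduced to the Outlook attribute and class.
samples = [{"Outlook": "Sunny",  "Class": "Enjoyable"},
           {"Outlook": "Cloudy", "Class": "Not Enjoyable"},
           {"Outlook": "Rainy",  "Class": "Not Enjoyable"},
           {"Outlook": "Sunny",  "Class": "Not Enjoyable"}]
print(segregate(samples, "Outlook")["Cloudy"])
# [{'Outlook': 'Cloudy', 'Class': 'Not Enjoyable'}]
```

Applied recursively, branch by branch, this is what builds the induction trees of Figures 12.2 and 12.3.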
Figure 12.2 – simple induction tree (step 1)

Outlook splits {DS1, DS2, DS3, DS4}:
– = Sunny → DS1 (Enjoyable), DS4 (Not Enjoyable)
– = Cloudy → DS2 (Not Enjoyable)
– = Rainy → DS3 (Not Enjoyable)
Figure 12.3 – simple induction tree (step 2)

Outlook splits {DS1, DS2, DS3, DS4}:
– = Cloudy → DS2 (Not Enjoyable)
– = Rainy → DS3 (Not Enjoyable)
– = Sunny → Temperature splits DS1, DS4:
  – = Cold → none
  – = Mild → DS1 (Enjoyable)
  – = Hot → DS4 (Not Enjoyable)
Writing the induced tree as rules
• Rule 1. If the Outlook is cloudy, then the Weather is not enjoyable.
• Rule 2. If the Outlook is rainy, then the Weather is not enjoyable.
• Rule 3. If the Outlook is sunny and the Temperature is mild, then the Weather is enjoyable.
• Rule 4. If the Outlook is sunny and the Temperature is cold, then the Weather is not enjoyable.
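The induced rules are directly executable. A hypothetical encoding (function name and return convention are mine, not from the slides):

```python
# The four induced rules, written as code to show the learned tree is
# an executable classifier.
def enjoyable(outlook, temperature):
    if outlook == "Cloudy":           # Rule 1
        return False
    if outlook == "Rainy":            # Rule 2
        return False
    if outlook == "Sunny":
        if temperature == "Mild":     # Rule 3
            return True
        if temperature == "Cold":     # Rule 4
            return False
    return None  # combinations not covered by the four rules

print(enjoyable("Sunny", "Mild"))  # True
```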
Learning decision trees for classification into multiple classes
• In the previous example, we were learning a function to predict a boolean (enjoyable = true/false) output.
• The same approach can be generalized to learn a function that predicts a class (when there are multiple predefined classes/categories).
• For example, suppose we are attempting to select a KBS shell for some application:
  – with the following as our options: ThoughtGen, OffSite, Genie, SilverWorks, XS, MilliExpert
  – using the following attributes and ranges of values:
    – Development language: { Java, C++, Lisp }
    – Reasoning method: { forward, backward }
    – External interfaces: { dBase, spreadsheetXL, ASCII file, devices }
    – Cost: any positive number
    – Memory: any positive number
Table 12.2 – collection of data samples (training set), described as vectors of attributes (feature vectors)

| Language | Reasoning method | Interface method | Cost | Memory | Classification |
|---|---|---|---|---|---|
| Java | Backward | SpreadsheetXL | 250 | 128MB | MilliExpert |
| Java | Backward | ASCII | 250 | 128MB | MilliExpert |
| Java | Backward | dBase | 195 | 256MB | ThoughtGen |
| Java | * | Devices | 985 | 512MB | OffSite |
| C++ | Forward | * | 6500 | 640MB | Genie |
| LISP | Forward | * | 15000 | 5GB | SilverWorks |
| C++ | Backward | * | 395 | 256MB | XS |
| LISP | Backward | * | 395 | 256MB | XS |
Figure 12.4 – decision tree resulting from selection of the language attribute

Language splits the training set:
– = Java → MilliExpert, ThoughtGen, OffSite
– = C++ → Genie, XS
– = Lisp → SilverWorks, XS
Figure 12.5 – decision tree resulting from addition of the reasoning method attribute

Language, then Reasoning method:
– = Java: Forward → OffSite; Backward → MilliExpert, ThoughtGen, OffSite
– = C++: Forward → Genie; Backward → XS
– = Lisp: Forward → SilverWorks; Backward → XS

(OffSite’s reasoning method is “*” in Table 12.2, so it appears under both branches.)
Figure 12.6 – final decision tree

– = Java: Forward → OffSite; Backward → Interface method:
  – = SpreadsheetXL → MilliExpert
  – = ASCII → MilliExpert
  – = dBase → ThoughtGen
  – = Devices → OffSite
– = C++: Forward → Genie; Backward → XS
– = Lisp: Forward → SilverWorks; Backward → XS
Order of choosing attributes
• Note that the decision tree that is built depends greatly on which attributes you choose first
Figure 12.2 (repeated from above) – simple induction tree (step 1): splitting first on Outlook.
Figure 12.3 (repeated from above) – simple induction tree (step 2): Outlook, then Temperature.
Table 12.1 (repeated from above) – decision tables (if ordered, then decision lists).
Figure 12.7

Humidity splits {DS1, DS2, DS3, DS4}:
– = Dry → DS1 (Enjoyable)
– = Humid → DS2, DS3, DS4 (Not Enjoyable)
Order of choosing attributes (cont)
• One sensible objective is to seek the minimal tree, i.e., the smallest tree that classifies all training set samples correctly
  – Occam’s Razor principle: the simplest explanation is the best
• In what order should you choose attributes so as to obtain the minimal tree?
  – Finding the truly minimal tree is often too complex to be feasible
  – Heuristics are used instead
  – Information gain, computed using information-theoretic quantities, works best in practice
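The information-gain heuristic can be computed directly for the Table 12.1 weather data. A sketch (attribute names follow the table; everything else is my own encoding):

```python
import math

# Entropy of a list of class labels.
def entropy(labels):
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

# Information gain = entropy before the split minus the weighted
# entropy of the subsets produced by splitting on one attribute.
def information_gain(samples, attr, labels):
    total = len(samples)
    remainder = 0.0
    for v in set(s[attr] for s in samples):
        subset = [lab for s, lab in zip(samples, labels) if s[attr] == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

data = [{"Outlook": "Sunny",  "Temperature": "Mild", "Humidity": "Dry"},
        {"Outlook": "Cloudy", "Temperature": "Cold", "Humidity": "Humid"},
        {"Outlook": "Rainy",  "Temperature": "Mild", "Humidity": "Humid"},
        {"Outlook": "Sunny",  "Temperature": "Hot",  "Humidity": "Humid"}]
labels = ["Enjoyable", "Not", "Not", "Not"]

# Humidity separates the two classes perfectly, so it has the highest gain —
# which is why the Figure 12.7 tree is smaller than the Figure 12.3 tree.
print(round(information_gain(data, "Humidity", labels), 3))  # 0.811
print(round(information_gain(data, "Outlook", labels), 3))   # 0.311
```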
Artificial Neural Networks
• Provide a detailed description of the connectionist approach to data mining – neural networks
• Present the basic neural network architecture – the multi-layer feed-forward neural network
• Present the main supervised learning algorithm – backpropagation
• Present the main unsupervised neural network architecture – the Kohonen network
Figure 12.8 – simple model of a neuron: inputs x1, x2, …, xk, each multiplied by a weight w1, w2, …, wk, feed a summation Σ whose result passes through an activation function f() to produce the output y.
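The neuron of Figure 12.8 takes only a few lines of code. A minimal sketch (names are mine; the sigmoid is one of the activation functions of Figure 12.9):

```python
import math

# Weighted sum of the inputs, passed through an activation function f.
def neuron(inputs, weights, f):
    s = sum(x * w for x, w in zip(inputs, weights))
    return f(s)

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

y = neuron([1.0, 0.5, -1.0], [0.2, 0.4, 0.1], sigmoid)
print(y)  # sigmoid(0.3) ≈ 0.574
```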
Figure 12.9 – three common activation functions: the threshold function, the piece-wise linear function, and the sigmoid function (each bounded above by 1.0).
Figure 12.10 – simple single-layer neural network (inputs on the left, outputs on the right)
Figure 12.11 – two-layer neural network
Supervised Learning: Backpropagation

• An iterative learning algorithm with three phases:
  1. Presentation of the examples (input patterns with expected outputs) and feed-forward execution of the network
  2. Calculation of the associated errors, comparing the output of the previous step with the expected output, and back-propagation of this error
  3. Adjustment of the weights
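The three phases can be sketched for the simplest possible case, a single sigmoid neuron. This is a hedged illustration (all names and the learning rate are mine); real backpropagation applies the same idea layer by layer:

```python
import math

# One training iteration for a single sigmoid neuron.
def train_step(x, target, w, rate=0.5):
    # Phase 1: feed-forward execution
    y = 1.0 / (1.0 + math.exp(-sum(xi * wi for xi, wi in zip(x, w))))
    # Phase 2: compare with the expected output and compute the error signal
    delta = (target - y) * y * (1.0 - y)  # the sigmoid's derivative is y(1-y)
    # Phase 3: adjust the weights
    return [wi + rate * delta * xi for xi, wi in zip(x, w)]

w = [0.0, 0.0]
for _ in range(2000):
    w = train_step([1.0, 1.0], 1.0, w)
# After training, the neuron's output approaches the target 1.0.
print(1.0 / (1.0 + math.exp(-sum(w))))
```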
Unsupervised Learning: Kohonen Networks

• Clustering by an iterative competitive algorithm
• Note the relation to CBR
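As a rough illustration (all details assumed, not from the slides), the competitive step at the heart of a Kohonen-style network moves the winning prototype vector toward each input; a full self-organizing map also updates the winner's neighbors:

```python
# One competitive-learning step: find the closest prototype and pull it
# a fraction `rate` of the way toward the input vector.
def competitive_step(prototypes, x, rate=0.1):
    winner = min(range(len(prototypes)),
                 key=lambda i: sum((p - xi) ** 2
                                   for p, xi in zip(prototypes[i], x)))
    prototypes[winner] = [p + rate * (xi - p)
                          for p, xi in zip(prototypes[winner], x)]
    return prototypes

# Two prototypes, presented with points from two well-separated clusters.
protos = [[0.0, 0.0], [5.0, 5.0]]
for point in [[0.1, 0.2], [4.9, 5.1], [0.0, 0.3], [5.2, 4.8]]:
    protos = competitive_step(protos, point)
```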
Figure 12.12 – clusters of related data in 2-D space (axes Variable A and Variable B, showing Cluster #1 and Cluster #2)
Figure 12.13 – Kohonen self-organizing map (inputs connected to the map through weight vectors Wi)
When to use what
• Provide useful guidelines for determining what technique to use for specific problems
Table 12.3

| Goal | Input Variables (Predictors) | Output Variables (Outcomes) | Statistical Technique | Examples [SPSS, 2000] |
|---|---|---|---|---|
| Find linear combination of predictors that best separates the population | Continuous | Discrete | Discriminant Analysis | Predict instances of fraud; predict whether customers will remain or leave (churners or not); predict which customers will respond to a new product or offer; predict outcomes of various medical procedures |
| Predict the probability of the outcome being in a particular category | Continuous | Discrete | Logistic and Multinomial Regression | Predicting insurance policy renewal; predicting fraud; predicting which product a customer will buy; predicting that a product is likely to fail |
Table 12.3 (cont.)

| Goal | Input Variables (Predictors) | Output Variables (Outcomes) | Statistical Technique | Examples [SPSS, 2000] |
|---|---|---|---|---|
| Output is a linear combination of input variables | Continuous | Continuous | Linear Regression | Predict expected revenue in dollars from a new customer; predict sales revenue for a store; predict waiting time on hold for callers to an 800 number; predict length of stay in a hospital based on patient characteristics and medical condition |
| For experiments and repeated measures of the same sample | Most inputs must be Discrete | Continuous | Analysis of Variance (ANOVA) | Predict which environmental factors are likely to cause cancer |
| To predict future events whose history has been collected at regular intervals | Continuous | Continuous | Time Series Analysis | Predict future sales data from past sales records |
Table 12.4

| Goal | Input (Predictor) Variables | Output (Outcome) Variables | Statistical Technique | Examples [SPSS, 2000] |
|---|---|---|---|---|
| Predict outcome based on values of nearest neighbors | Continuous, Discrete, and Text | Continuous or Discrete | Memory-Based Reasoning (MBR) | Predicting medical outcomes |
| Predict by splitting data into subgroups (branches) | Continuous or Discrete (different techniques used based on data characteristics) | Continuous or Discrete (different techniques used based on data characteristics) | Decision Trees | Predicting which customers will leave; predicting instances of fraud |
| Predict outcome in complex non-linear environments | Continuous or Discrete | Continuous or Discrete | Neural Networks | Predicting expected revenue; predicting credit risk |
Table 12.5

| Goal | Input (Predictor) Variables | Output (Outcome) Variables | Statistical Technique | Examples [SPSS, 2000] |
|---|---|---|---|---|
| Predict by splitting data into more than two subgroups (branches) | Continuous, Discrete, or Ordinal | Discrete | Chi-square Automatic Interaction Detection (CHAID) | Predict which demographic combinations of predictors yield the highest probability of a sale; predict which factors are causing product defects in manufacturing |
| Predict by splitting data into more than two subgroups (branches) | Continuous | Discrete | C5.0 | Predict which loan customers are considered a “good” risk; predict which factors are associated with a country’s investment risk |
Table 12.5 (cont.)

| Goal | Input (Predictor) Variables | Output (Outcome) Variables | Statistical Technique | Examples [SPSS, 2000] |
|---|---|---|---|---|
| Predict by splitting data into binary subgroups (branches) | Continuous | Continuous | Classification and Regression Trees (CART) | Predict which factors are associated with a country’s competitiveness; discover which variables are predictors of increased customer profitability |
| Predict by splitting data into binary subgroups (branches) | Continuous | Discrete | Quick, Unbiased, Efficient Statistical Tree (QUEST) | Predict who needs additional care after heart surgery |
Table 12.6

| Goal | Input Variables (Predictor) | Output Variables (Outcome) | Statistical Technique | Examples [SPSS, 2000] |
|---|---|---|---|---|
| Find large groups of cases in large data files that are similar on a small set of input characteristics | Continuous or Discrete | No outcome variable | K-means Cluster Analysis | Customer segments for marketing; groups of similar insurance claims |
| To create large cluster memberships | Continuous or Discrete | No outcome variable | Kohonen Neural Networks | Cluster customers into segments based on demographics and buying patterns |
| Create small set associations and look for patterns between many categories | Logical | No outcome variable | Market Basket or Association Analysis with Apriori | Identify which products are likely to be purchased together; identify which courses students are likely to take together |
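The K-means goal in Table 12.6 alternates two steps: assign each case to its nearest centroid, then recompute each centroid as the mean of its cases. A minimal sketch (all details assumed, not from the chapter):

```python
# Plain K-means: repeat (assign points to nearest centroid, recompute means).
def k_means(points, centroids, iterations=10):
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, centroids[i])))
            clusters[nearest].append(p)
        # New centroid = componentwise mean of its cluster (keep old if empty).
        centroids = [[sum(c) / len(cl) for c in zip(*cl)] if cl else cen
                     for cl, cen in zip(clusters, centroids)]
    return centroids

pts = [[0, 0], [0, 1], [10, 10], [10, 11]]
print(k_means(pts, [[0, 0], [10, 10]]))  # [[0.0, 0.5], [10.0, 10.5]]
```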
Errors and their significance in DM
• Discuss the importance of errors in data mining studies
• Define the types of errors possible in data mining studies
Table 12.7 – Confusion Matrix

| Heart Disease Diagnostic | Predicted No Disease | Predicted Presence of Disease |
|---|---|---|
| Actual No Disease | 118 (72%) | 46 (28%) |
| Actual Presence of Disease | 43 (30.9%) | 96 (69.1%) |
In Table 12.7, the 43 cases (30.9%) with actual presence of disease but predicted no disease are the false negatives.
In Table 12.7, the 46 cases (28%) with actual no disease but predicted presence of disease are the false positives.
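The error rates follow directly from the four cells of Table 12.7 (variable names are mine):

```python
# Cells of the Table 12.7 confusion matrix.
tn, fp = 118, 46  # actual no disease: predicted no / predicted disease
fn, tp = 43, 96   # actual disease:    predicted no / predicted disease

accuracy = (tp + tn) / (tp + tn + fp + fn)
false_negative_rate = fn / (fn + tp)  # diseased patients missed
false_positive_rate = fp / (fp + tn)  # healthy patients flagged

print(round(accuracy, 3))             # 0.706
print(round(false_negative_rate, 3))  # 0.309  (the 30.9% in the table)
print(round(false_positive_rate, 3))  # 0.28   (the 28% in the table)
```

Which error matters more depends on the application: in this diagnostic setting a false negative (a missed disease) is usually far more costly than a false positive.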
Conclusions
• You should know when to use:
  – Curve-fitting algorithms
  – Statistical methods for clustering
  – The C5.0 algorithm to capture rules from examples
  – Basic feedforward neural networks with supervised learning
  – Unsupervised learning, clustering techniques, and Kohonen networks
  – Other statistical techniques