chapter 4 basic data mining technique. data warehouse and data mining chapter 4 2 content what is...

35
Chapter 4 Basic Data Mining Technique

Upload: angelica-todd

Post on 13-Jan-2016

227 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 4

Basic Data Mining Technique

Page 2: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 42Data Warehouse and Data Mining

Content• What is classification? • What is prediction?• Supervised and Unsupervised Learning• Decision trees• Association rule• K-nearest neighbor classifier

• Case-based reasoning

• Genetic algorithm

• Rough set approach

• Fuzzy set approaches

Page 3: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 43Data Warehouse and Data Mining

Data Mining Process

Page 4: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 44Data Warehouse and Data Mining

Data Mining Strategies

Page 5: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 45Data Warehouse and Data Mining

• Classification: – predicts categorical class labels– classifies data (constructs a model)

based on the training set and the values (class labels) in a classifying attribute and ....uses it in classifying new data

• Prediction: – models continuous-valued functions, i.e.,

predicts unknown or missing values

Classification vs. Prediction

Page 6: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 46Data Warehouse and Data Mining

• Typical Applications– credit approval– target marketing– medical diagnosis– treatment effectiveness analysis

Classification vs. Prediction

Page 7: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 47Data Warehouse and Data Mining

Classification Process

1. Model construction: 2. Model usage:

Page 8: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 48Data Warehouse and Data Mining

Classification Process

1. Model construction: describing a set of predetermined classes• Each tuple/sample is assumed to belong to a

predefined class, as determined by the class label attribute

• The set of tuples used for model construction: training set

• The model is represented as classification rules, decision trees, or mathematical formulae

Page 9: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 49Data Warehouse and Data Mining

1. Model Construction

TrainingData

NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no

ClassificationAlgorithms

IF rank = ‘professor’OR years > 6THEN tenured = ‘yes’

Classifier(Model)

Page 10: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 410Data Warehouse and Data Mining

2. Model usage: for classifying future or unknown objectsEstimate accuracy of the model

• The known label of test sample is compared with the classified result from the model

• Accuracy rate is the percentage of test set samples that are correctly classified by the model

• Test set is independent of training set

Classification Process

Page 11: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 411Data Warehouse and Data Mining

2. Use the Model in Prediction

Classifier

TestingData

NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes

Unseen Data

(Jeff, Professor, 4)

Tenured?

Page 12: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 412Data Warehouse and Data Mining

What Is Prediction?

• Prediction is similar to classification– 1. Construct a model

– 2. Use model to predict unknown value

• Major method for prediction is regression

– Linear and multiple regression

– Non-linear regression

• Prediction is different from classification– Classification refers to predict categorical class

label

– Prediction models continuous-valued functions

Page 13: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 413Data Warehouse and Data Mining

Issues regarding classification and prediction

1. Data Preparation

2. Evaluating Classification Methods

Page 14: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 414Data Warehouse and Data Mining

1. Data Preparation

• Data cleaning– Preprocess data in order to reduce noise and

handle missing values

• Relevance analysis (feature selection)– Remove the irrelevant or redundant attributes

• Data transformation– Generalize and/or normalize data

Page 15: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 415Data Warehouse and Data Mining

2. Evaluating Classification Methods• Predictive accuracy• Speed and scalability

– time to construct the model– time to use the model

• Robustness– handling noise and missing values

• Scalability– efficiency in disk-resident databases

• Interpretability: – understanding and insight proved by the model

• Goodness of rules– decision tree size– compactness of classification rules

Page 16: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 416Data Warehouse and Data Mining

Supervised vs. Unsupervised Learning

• Supervised learning (classification)– Supervision: The training data (observations,

measurements, etc.) are accompanied by labels indicating the class of the observations

– New data is classified based on the training set

• Unsupervised learning (clustering)– The class labels of training data is unknown

– Given a set of measurements, observations, etc. with the aim of establishing the existence of classes or clusters in the data

Page 17: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 417Data Warehouse and Data Mining

Supervised Learning

Page 18: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 418Data Warehouse and Data Mining

Unsupervised Learning

Page 19: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 419Data Warehouse and Data Mining

Classification by Decision Tree Induction

• Decision tree – A flow-chart-like tree structure– Internal node denotes a test on an attribute– Branch represents an outcome of the test– Leaf nodes represent class labels or class

distribution• Use of decision tree: Classifying an unknown sample

– Test the attribute values of the sample against the decision tree

Page 20: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 420Data Warehouse and Data Mining

Classification by Decision Tree Induction• Decision tree generation consists of two

phases1. Tree construction

•At start, all the training examples are at the root

•Partition examples recursively based on selected attributes

2. Tree pruning•Identify and remove branches that reflect noise or outliers

Page 21: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 421Data Warehouse and Data Mining

Training Dataset

age income student credit_rating buys_computer<=30 high no fair no<=30 high no excellent no30…40 high no fair yes>40 medium no fair yes>40 low yes fair yes>40 low yes excellent no31…40 low yes excellent yes<=30 medium no fair no<=30 low yes fair yes>40 medium yes fair yes<=30 medium yes excellent yes31…40 medium no excellent yes31…40 high yes fair yes>40 medium no excellent no

This follows an example from Quinlan’s ID3

Page 22: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 422Data Warehouse and Data Mining

Output: A Decision Tree for “buys_computer”

age?

overcast

student? credit rating?

no yes fairexcellent

<=30 >40

no noyes yes

yes

30..40

Page 23: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 423Data Warehouse and Data Mining

Decision Tree

Page 24: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 424Data Warehouse and Data Mining

What Is Association Mining?

• Association rule mining:– Finding frequent patterns, associations,

correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.

• Applications:– Basket data analysis, cross-marketing,

catalog design, loss-leader analysis, clustering, classification, etc.

Page 25: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 425Data Warehouse and Data Mining

Presentation of Classification Results

Page 26: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 426Data Warehouse and Data Mining

Instance-Based Methods• Instance-based learning:

– Store training examples and delay the processing (“lazy evaluation”) .....until a new instance must be classified

• Typical approaches– k-nearest neighbor approach

•Instances represented as points in a Euclidean space.

– Case-based reasoning•Uses symbolic representations and

knowledge-based inference

Page 27: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 427Data Warehouse and Data Mining

The k-Nearest Neighbor Algorithm

• All instances correspond to points in the n-D space.

• The nearest neighbor are defined in terms of Euclidean distance.

• The target function could be discrete- or real- valued.

• For discrete-valued, the k-NN returns the most common value among the k training examples nearest to xq.

• Vonoroi diagram: the decision surface induced by 1-NN for a typical set of training examples. .

_+

_ xq

+

_ _+

_

_

+

.

..

. .

Page 28: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 428Data Warehouse and Data Mining

Case-Based Reasoning

• Also uses: lazy evaluation + analyze similar instances

• Difference: Instances.... are not “points in a Euclidean space”

• Methodology– Instances represented by rich symbolic

descriptions (e.g., function graphs)– Multiple retrieved cases may be combined

Page 29: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 429Data Warehouse and Data Mining

Genetic Algorithms• GA: based on an analogy to biological evolution• Each rule is represented by a string of bits• An initial population is created consisting of

randomly generated rules– e.g., IF A1 and Not A2 then C2 can be encoded as 100

• Based on the notion of survival of the fittest, a new population is formed to consists of the fittest rules and their offsprings

• The fitness of a rule is represented by its classification accuracy on a set of training examples

• Offsprings are generated by crossover and mutation

Page 30: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 430Data Warehouse and Data Mining

Supervised genetic learning

Page 31: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 431Data Warehouse and Data Mining

Rough Set Approach• Rough sets are used to approximately or

“roughly” define equivalent classes

Page 32: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 432Data Warehouse and Data Mining

Rough Set Approach

• A rough set for a given class C is approximated by two sets:

1. a lower approximation (certain to be in C) and

2. an upper approximation (cannot be described as not belonging to C)

• Finding the minimal subsets of attributes (for feature reduction) is NP-hard

Page 33: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 433Data Warehouse and Data Mining

Fuzzy Set Approaches• Fuzzy logic uses truth values between 0.0 and

1.0 to represent the degree of membership (such as using fuzzy membership graph)

Low Medium High

baseline high

somewhat

low

Income

Fuzzy membeship

Page 34: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 434Data Warehouse and Data Mining

• Attribute values are converted to fuzzy values– e.g., income is mapped into the discrete categories

{low, medium, high} with fuzzy values calculated

• For a given new sample, more than one fuzzy value may apply

• Each applicable rule contributes a vote for membership in the categories

• Typically, the truth values for each predicted category are summed

Fuzzy Set Approaches

Page 35: Chapter 4 Basic Data Mining Technique. Data Warehouse and Data Mining Chapter 4 2 Content What is classification? What is prediction? Supervised and Unsupervised

Chapter 435Data Warehouse and Data Mining

Reference

Data Mining: Concepts and Techniques (Chapter 7 for textbook), Jiawei Han and Micheline Kamber, Intelligent Database Systems Research Lab, School of Computing Science, Simon Fraser University, Canada