
Page 1

Data Warehousing and Data Mining

Prepared By: Amit Patel
http://ajpatelit.hpage.com

Page 2

Decision Tree

• A decision tree is a flowchart-like tree structure,

• where each internal node (nonleaf node) denotes a test on an attribute,

• each branch represents an outcome of the test,

• and each leaf node (or terminal node) holds a class label.

• The topmost node in a tree is the root node.

Page 3

Decision Tree

Page 4

Algorithm: Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.

Input:
• Data partition, D, which is a set of training tuples and their associated class labels;
• Attribute_list, the set of candidate attributes;
• Attribute_selection_method, a procedure to determine the splitting criterion that “best” partitions the data tuples into individual classes. This criterion consists of a splitting attribute and, possibly, either a split point or a splitting subset.

Output: A decision tree.

Decision Tree
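To make the procedure concrete, here is a minimal Python sketch of this induction loop for the discrete-valued case (one branch per attribute value). The dictionary tree representation, the majority-class fallback, and the function names are illustrative assumptions; the slides give no code.

```python
from collections import Counter

def generate_decision_tree(D, attribute_list, attribute_selection_method):
    """Sketch of Generate_decision_tree. D is a list of (tuple_dict, label) pairs."""
    labels = [y for _, y in D]
    # If all tuples belong to the same class, return a leaf with that class.
    if len(set(labels)) == 1:
        return {"leaf": labels[0]}
    # If no candidate attributes remain, return a leaf with the majority class.
    if not attribute_list:
        return {"leaf": Counter(labels).most_common(1)[0][0]}
    # Let the supplied selection method pick the "best" splitting attribute.
    A = attribute_selection_method(D, attribute_list)
    node = {"attribute": A, "branches": {}}
    remaining = [a for a in attribute_list if a != A]
    # Grow one branch per known value of the (discrete-valued) attribute A.
    for value in {x[A] for x, _ in D}:
        Dj = [(x, y) for x, y in D if x[A] == value]
        node["branches"][value] = generate_decision_tree(
            Dj, remaining, attribute_selection_method)
    return node
```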

Page 5

Decision Tree

Page 6

Let A be the splitting attribute.
(a) If A is discrete-valued, then one branch is grown for each known value of A.
(b) If A is continuous-valued, then two branches are grown, corresponding to A ≤ split_point and A > split_point.
(c) If A is discrete-valued and a binary tree must be produced, then the test is of the form A ∈ S_A, where S_A is the splitting subset for A.


Page 8

Naïve Bayesian Classification

Procedure:
• Assume that there is a set of m samples S = {s1, s2, …, sm} (the training data set),
• where every sample si is represented as an n-dimensional vector (x1, x2, …, xn).
• Values xi correspond to attributes A1, A2, …, An respectively.
• Also, there are k classes C1, C2, …, Ck, and every sample belongs to one of these classes.
• For an unknown sample X, the classifier assigns the class Ci with the highest posterior probability. These probabilities are computed using Bayes’ Theorem:
  P(Ci|X) = [P(X|Ci) · P(Ci)] / P(X)
• As P(X) is constant for all classes, only the product P(X|Ci) · P(Ci) needs to be maximized.
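A minimal Python sketch of this procedure: estimate P(Ci) and the per-attribute likelihoods by counting over the training set, then maximize P(X|Ci) · P(Ci). The function names and data layout are assumptions, and Laplace smoothing is omitted for brevity.

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Estimate P(Ci) and P(xj | Ci) by counting over the training set."""
    m = len(samples)
    priors = {c: n / m for c, n in Counter(labels).items()}
    cond = defaultdict(lambda: defaultdict(Counter))  # class -> attr index -> value counts
    for x, c in zip(samples, labels):
        for j, value in enumerate(x):
            cond[c][j][value] += 1
    return priors, cond, Counter(labels)

def classify(x, priors, cond, class_counts):
    """Return the class Ci maximizing P(X|Ci) * P(Ci), assuming attribute independence."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for j, value in enumerate(x):
            score *= cond[c][j][value] / class_counts[c]
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```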

Page 9

Naïve Bayesian Classification

Page 10

Naïve Bayesian Classification

Page 11

Naïve Bayesian Classification

• The data tuples are described by the attributes age, income, student, and credit_rating. The class label attribute, buys_computer, has two distinct values, namely {yes, no}.

• Let the classes be:
  C1: buys_computer = “yes”
  C2: buys_computer = “no”

• The data sample we want to classify is:
  X = (age = youth, income = medium, student = yes, credit_rating = fair)

Page 12

Naïve Bayesian Classification

• We need to maximize P(X|Ci) · P(Ci), for i = 1, 2. P(Ci), the prior probability of each class, can be computed based on the training tuples:

  P(buys_computer = yes) = 9/14 = 0.643
  P(buys_computer = no) = 5/14 = 0.357

• To compute P(X|Ci), for i = 1, 2, we compute the following conditional probabilities:

Page 13

Naïve Bayesian Classification

• X = (age = youth, income = medium, student = yes, credit_rating = fair)

• P(age = youth | buys_computer = yes) = 2/9 = 0.222
• P(age = youth | buys_computer = no) = 3/5 = 0.600
• P(income = medium | buys_computer = yes) = 4/9 = 0.444
• P(income = medium | buys_computer = no) = 2/5 = 0.400
• P(student = yes | buys_computer = yes) = 6/9 = 0.667
• P(student = yes | buys_computer = no) = 1/5 = 0.200
• P(credit_rating = fair | buys_computer = yes) = 6/9 = 0.667
• P(credit_rating = fair | buys_computer = no) = 2/5 = 0.400

Page 14

Naïve Bayesian Classification

Using the above probabilities, we obtain

P(X|buys_computer = yes) = P(age = youth | buys_computer = yes) * P(income = medium | buys_computer = yes) * P(student = yes | buys_computer = yes) * P(credit_rating = fair | buys_computer = yes)

= 0.222 * 0.444 * 0.667 * 0.667 = 0.044

Page 15

Naïve Bayesian Classification

Similarly,
P(X|buys_computer = no) = 0.600 * 0.400 * 0.200 * 0.400 = 0.019.
To find the class, Ci, that maximizes P(X|Ci) · P(Ci), we compute:
P(X|buys_computer = yes) * P(buys_computer = yes) = 0.044 * 0.643 = 0.028
P(X|buys_computer = no) * P(buys_computer = no) = 0.019 * 0.357 = 0.007

Therefore, the naïve Bayesian classifier predicts buys_computer = yes for tuple X.
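The arithmetic above can be checked with a few lines of Python; the probabilities are taken directly from the counts on the previous slides.

```python
# Reproduce the buys_computer example: maximize P(X|Ci) * P(Ci).
priors = {"yes": 9 / 14, "no": 5 / 14}

# P(attribute = value | class) for X = (youth, medium, student=yes, fair).
likelihoods = {
    "yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],
    "no":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],
}

scores = {}
for c in priors:
    p_x_given_c = 1.0
    for p in likelihoods[c]:
        p_x_given_c *= p
    scores[c] = p_x_given_c * priors[c]

print(scores)                       # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))  # 'yes'
```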

Page 16

Rule-Based Classification

Using IF-THEN Rules for Classification

• Rules are a good way of representing information or bits of knowledge. A rule-based classifier uses a set of IF-THEN rules for classification. An IF-THEN rule is an expression of the form
  IF condition THEN conclusion.

• An example is rule R1:
  R1: IF age = youth AND student = yes THEN buys_computer = yes.

Page 17

• The “IF” part (or left-hand side) of a rule is known as the rule antecedent or precondition.

• The “THEN” part (or right-hand side) is the rule consequent.

• In the rule antecedent, the condition consists of one or more attribute tests (such as age = youth and student = yes) that are logically ANDed.

• The rule’s consequent contains a class prediction (in this case, we are predicting whether a customer will buy a computer).

• R1 can also be written as:
  R1: (age = youth) ∧ (student = yes) ⇒ (buys_computer = yes).

Rule-Based Classification

Page 18

• If the condition (that is, all of the attribute tests) in a rule antecedent holds true for a given tuple, we say that the rule antecedent is satisfied (or simply, that the rule is satisfied) and that the rule covers the tuple.

• A rule R can be assessed by its coverage and accuracy.

• Given a tuple, X, from a class-labeled data set, D, let:

  • n_covers be the number of tuples covered by R;

  • n_correct be the number of tuples correctly classified by R; and

  • |D| be the number of tuples in D.

Rule-Based Classification

Page 19

We can define the coverage and accuracy of R as:

coverage(R) = \frac{n_{covers}}{|D|} \qquad accuracy(R) = \frac{n_{correct}}{n_{covers}}

Rule-Based Classification

Find the coverage and accuracy for the rule R1: IF age = youth AND student = yes THEN buys_computer = yes.
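A small Python sketch of these two measures, under the assumption that a rule is a dict with an "if" part (attribute tests) and a "then" part (predicted class); this representation is illustrative, not from the slides.

```python
def covers(rule, x):
    """True if tuple x satisfies every attribute test in the rule antecedent."""
    return all(x.get(attr) == value for attr, value in rule["if"].items())

def coverage_and_accuracy(rule, D):
    """D is a list of (tuple_dict, class_label) pairs."""
    covered = [(x, y) for x, y in D if covers(rule, x)]
    n_covers = len(covered)
    n_correct = sum(1 for _, y in covered if y == rule["then"])
    coverage = n_covers / len(D)
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy

# Example rule R1 in this representation:
R1 = {"if": {"age": "youth", "student": "yes"}, "then": "yes"}
```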

Page 20

Rule-Based Classification

Page 21

• These are class-labeled tuples from the AllElectronics customer database.

• Our task is to predict whether a customer will buy a computer.

• Consider rule R1 above, which covers 2 of the 14 tuples. It can correctly classify both tuples.

• Therefore, coverage(R1) = 2/14 = 14.28% and accuracy(R1) = 2/2 = 100%.

Rule-Based Classification

Page 22

Rule-Based Classification

R2: IF age = youth AND student = no THEN buys_computer = no
  coverage(R2) = 3/14 = 21.43%
  accuracy(R2) = 3/3 = 100%

R3: IF age = senior AND credit_rating = excellent THEN buys_computer = yes
  coverage(R3) = 2/14 = 14.28%
  accuracy(R3) = 0/2 = 0%

R4: IF age = middle_aged THEN buys_computer = yes
  coverage(R4) = 4/14 = 28.57%
  accuracy(R4) = 4/4 = 100%

Page 23

Basic Sequential Covering algorithm

IF-THEN rules can be extracted directly from the training data (i.e., without having to generate a decision tree first) using a sequential covering algorithm.

Input:
• D, a data set of class-labeled tuples;
• Att_vals, the set of all attributes and their possible values.
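A minimal sketch of this loop in Python: learn one rule at a time for a class, remove the tuples it covers, and repeat. The learn_one_rule procedure and the min_quality threshold stand in for the rule-growing step, which the slides do not detail; both are assumptions.

```python
def covers(rule, x):
    """True if tuple x satisfies every attribute test in the rule antecedent."""
    return all(x.get(a) == v for a, v in rule["if"].items())

def sequential_covering(D, att_vals, target_class, learn_one_rule, min_quality=0.0):
    """Basic sequential covering. D is a list of (tuple_dict, label) pairs;
    learn_one_rule(remaining, att_vals, target_class) -> (rule, quality)."""
    rules = []
    remaining = list(D)
    while remaining:
        rule, quality = learn_one_rule(remaining, att_vals, target_class)
        if rule is None or quality <= min_quality:
            break  # no acceptable rule can be found: stop
        rules.append(rule)
        # Remove the tuples covered by the new rule and repeat on the rest.
        remaining = [(x, y) for x, y in remaining if not covers(rule, x)]
    return rules
```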

Page 24

Rule-Based Classification

Page 25

CART

Classification and Regression Trees

• University of California: a study of patients after admission for a heart attack.

• 19 variables were collected during the first 24 hours for the 215 patients who survived that period.

• Question: can the high-risk patients (those who will not survive 30 days) be identified?

Page 26

CART

[Figure: the CART tree from the study. Root: Is the minimum systolic blood pressure over the 1st 24 hours > 91? If no → high risk (H). If yes → Is age > 62.5? If no → low risk (L). If yes → Is sinus present? If yes → high risk (H); if no → low risk (L).]

Page 27

CART

Plan for construction of a CART

• Selection of the Splits

• Deciding when a node is a terminal node (i.e., not to split it any further)

• Assigning a class to each terminal node

Page 28

CART

Where do splits need to be evaluated?

Sorted by Age:

AGE  BP   SINUS_T  SURVIVE
40   91   0        SURVIVE
40   110  0        SURVIVE
40   83   1        DEAD
43   99   0        SURVIVE
43   78   1        DEAD
43   135  0        SURVIVE
45   120  0        SURVIVE
48   119  1        DEAD
48   122  0        SURVIVE
49   150  0        DEAD
49   110  1        SURVIVE

Sorted by Blood Pressure:

AGE  BP   SINUS_T  SURVIVE
43   78   1        DEAD
40   83   1        DEAD
40   91   0        SURVIVE
43   99   0        SURVIVE
40   110  0        SURVIVE
49   110  1        SURVIVE
48   119  1        DEAD
45   120  0        SURVIVE
48   122  0        SURVIVE
43   135  0        SURVIVE
49   150  0        DEAD
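In code, the candidate thresholds lie between consecutive distinct sorted values of a variable, and each one is scored by how pure the two resulting groups are. The sketch below uses the Gini index, CART's usual impurity measure (the slides do not show the measure), on the blood-pressure column above.

```python
def gini(labels):
    """Two-class Gini impurity for labels in {SURVIVE, DEAD}."""
    if not labels:
        return 0.0
    p = labels.count("SURVIVE") / len(labels)
    return 2 * p * (1 - p)

def candidate_splits(values):
    """One candidate threshold between each pair of consecutive distinct values."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def best_split(data):
    """data: list of (value, label); return the threshold minimizing weighted Gini."""
    best = None
    for t in candidate_splits([v for v, _ in data]):
        left = [y for v, y in data if v <= t]
        right = [y for v, y in data if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
        if best is None or score < best[1]:
            best = (t, score)
    return best

bp_data = [(78, "DEAD"), (83, "DEAD"), (91, "SURVIVE"), (99, "SURVIVE"),
           (110, "SURVIVE"), (110, "SURVIVE"), (119, "DEAD"), (120, "SURVIVE"),
           (122, "SURVIVE"), (135, "SURVIVE"), (150, "DEAD")]
print(best_split(bp_data))
```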

Page 29

CART

Features of CART

• Binary Splits

• Splits based only on one variable

Page 30

CART

Advantages of CART

• Can cope with any data structure or type

• Classification has a simple form

• Uses conditional information effectively

• Invariant under transformations of the variables

• Is robust with respect to outliers

• Gives an estimate of the misclassification rate

Page 31

CART

Disadvantages of CART

• CART does not use combinations of variables

• The tree can be deceptive – a variable that is not included may have been “masked” by another

• Tree structures may be unstable – a change in the sample may give different trees

• Tree is optimal at each split – it may not be globally optimal.

Page 32

Neural Networks

• Neural networks are useful for data mining and decision-support applications.

• People are good at generalizing from experience.

• Computers excel at following explicit instructions over and over.

• Neural networks bridge this gap by modeling, on a computer, the neural behavior of human brains.

Page 33

Neural Network

[Figure: a network mapping input nodes Input 0, Input 1, …, Input n to output nodes Output 0, Output 1, …, Output m.]

• Neural Networks map a set of input-nodes to a set of output-nodes

• Number of inputs/outputs is variable

• The Network itself is composed of an arbitrary number of nodes with an arbitrary topology

Neural Networks

Page 34

Neural Networks

• A neuron is a many-inputs / one-output unit.
• The output can be excited or not excited.
• Incoming signals from other neurons determine whether the neuron shall excite ("fire").
• The output is subject to attenuation in the synapses, which are the junction parts of the neuron.
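A single neuron can be sketched in a few lines of Python: a weighted sum of the incoming signals (the weights model the synaptic attenuation) passed through an activation that decides whether the unit "fires". The sigmoid activation and the sample numbers are illustrative choices, not from the slides.

```python
import math

def neuron(inputs, weights, bias):
    """Many-inputs / one-output unit: weighted sum of incoming signals,
    passed through a sigmoid activation (output near 1 means "fire")."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))

print(neuron([0.5, 0.9], [1.2, -0.4], bias=0.1))
```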

Page 35

Neural Networks

[Figure: a multilayer feed-forward neural network.]

Page 36

Prediction

(Numerical) prediction is similar to classification:
• construct a model
• use the model to predict a continuous or ordered value for a given input

Prediction is different from classification:
• classification predicts a categorical class label
• prediction models continuous-valued functions

Page 37

The major method for prediction is regression:
• model the relationship between one or more independent (predictor) variables and a dependent (response) variable

Regression analysis includes:
• linear and multiple regression
• non-linear regression
• other regression methods: generalized linear models, Poisson regression, log-linear models, regression trees

Prediction

Page 38

Linear Regression

Linear regression: involves a response variable y and a single predictor variable x

y = w0 + w1 x

where w0 (y-intercept) and w1 (slope) are regression coefficients

Method of least squares: estimates the best-fitting straight line

w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}
\qquad
w_0 = \bar{y} - w_1 \bar{x}
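A direct Python translation of these two estimates; the sample data is illustrative, not from the slides.

```python
def least_squares(xs, ys):
    """Estimate w0, w1 for y = w0 + w1*x by the method of least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Illustrative data roughly on the line y = 2x
xs = [1, 2, 3, 4, 5]
ys = [1.9, 4.1, 6.0, 8.2, 9.9]
print(least_squares(xs, ys))  # approximately (0.0, 2.0)
```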

Page 39

Multiple linear regression: involves more than one predictor variable.
• Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|).
• Ex.: for 2-D data, we may have y = w0 + w1 x1 + w2 x2.
• Many nonlinear functions can be transformed into the above.

Linear Regression

Page 40

Nonlinear Regression

Some nonlinear models can be modeled by a polynomial function.

A polynomial regression model can be transformed into a linear regression model. For example,

y = w0 + w1 x + w2 x^2 + w3 x^3

is convertible to linear form with the new variables x2 = x^2 and x3 = x^3:

y = w0 + w1 x + w2 x2 + w3 x3

(a sketch of this transformation follows below).

Other functions, such as the power function, can also be transformed into a linear model.

Some models are intractably nonlinear (e.g., a sum of exponential terms); it may still be possible to obtain least-squares estimates through extensive calculation on more complex formulae.
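A sketch of the polynomial-to-linear transformation using NumPy: each power of x becomes its own predictor column, and ordinary least squares then fits the now-linear model. The synthetic data is illustrative.

```python
import numpy as np

def polynomial_fit(x, y, degree=3):
    """Fit y = w0 + w1*x + ... + wd*x^d by treating each power as a linear predictor."""
    X = np.vander(x, degree + 1, increasing=True)  # columns: 1, x, x^2, x^3
    w, *_ = np.linalg.lstsq(X, y, rcond=None)      # ordinary least squares
    return w  # [w0, w1, w2, w3]

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x - 0.5 * x**2 + 0.1 * x**3  # synthetic, illustrative data
print(polynomial_fit(x, y))  # approximately [1.0, 2.0, -0.5, 0.1]
```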

Page 41

Information Gain

ID3 uses information gain as its attribute selection measure.

Let node N represent or hold the tuples of partition D. The attribute with the highest information gain is chosen as the splitting attribute for node N.

This attribute minimizes the information needed to classify the tuples in the resulting partitions and reflects the least randomness or “impurity” in these partitions.

Page 42

Info(D), the expected information needed to classify a tuple in D, is

Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)

Info_A(D) is the expected information required to classify a tuple from D based on the partitioning by A:

Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)

Information gain is defined as the difference between the original information requirement (i.e., based on just the proportion of classes) and the new requirement (i.e., obtained after partitioning on A):

Gain(A) = Info(D) - Info_A(D)

Information Gain (cont…)

Page 43

Information Gain (cont…)

Examples

Overall information for a database. For the 14-tuple buys_computer data used earlier (9 “yes”, 5 “no”):

Info(D) = -\frac{9}{14}\log_2\left(\frac{9}{14}\right) - \frac{5}{14}\log_2\left(\frac{5}{14}\right) = 0.940 \text{ bits}

Page 44

Information Gain (cont…)

Hence, the gain in information from such a partitioning would be

Gain(A) = Info(D) - Info_A(D)
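These formulas translate directly into Python; the (tuple_dict, label) data layout is an assumption.

```python
import math
from collections import Counter

def info(labels):
    """Info(D): expected information (entropy) needed to classify a tuple."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(D, attribute):
    """Gain(A) = Info(D) - Info_A(D); D is a list of (tuple_dict, label) pairs."""
    labels = [y for _, y in D]
    info_a = 0.0
    for value in {x[attribute] for x, _ in D}:
        subset = [y for x, y in D if x[attribute] == value]
        info_a += len(subset) / len(D) * info(subset)  # weighted partition entropy
    return info(labels) - info_a
```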

Page 45

Gain Ratio

Consider an attribute that acts as a unique identifier, such as product_ID. A split on product_ID would result in a large number of partitions (as many as there are values), each one containing just one tuple.

Because each partition is pure, the information required to classify data set D based on this partitioning would be Infoproduct_ID(D) = 0.

Therefore, the information gained by partitioning on this attribute is maximal. Clearly, such a partitioning is useless for classification.

Page 46

Gain Ratio (cont…)

The split information value,

SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \times \log_2\left(\frac{|D_j|}{|D|}\right),

represents the potential information generated by splitting the training data set, D, into v partitions, corresponding to the v outcomes of a test on attribute A. Note that, for each outcome, it considers the number of tuples having that outcome with respect to the total number of tuples in D. It differs from information gain, which measures the information with respect to classification that is acquired based on the same partitioning. The gain ratio is defined as

GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}
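A sketch of both quantities in Python, reusing info_gain from the information-gain sketch above; the (tuple_dict, label) data layout is the same assumption.

```python
import math

def split_info(D, attribute):
    """SplitInfo_A(D): potential information generated by the split itself."""
    n = len(D)
    total = 0.0
    for value in {x[attribute] for x, _ in D}:
        frac = sum(1 for x, _ in D if x[attribute] == value) / n
        total -= frac * math.log2(frac)
    return total

def gain_ratio(D, attribute):
    """GainRatio(A) = Gain(A) / SplitInfo_A(D); uses info_gain defined earlier."""
    si = split_info(D, attribute)
    return info_gain(D, attribute) / si if si > 0 else 0.0
```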

Page 47

Cluster Analysis

Cluster: a collection of data objects that are
• similar to one another within the same cluster, and
• dissimilar to the objects in other clusters.

Cluster analysis: finding similarities between data according to the characteristics found in the data, and grouping similar data objects into clusters.

Unsupervised learning: no predefined classes.

Typical applications:
• as a stand-alone tool to get insight into data distribution
• as a preprocessing step for other algorithms
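The slides name no particular algorithm; k-means is one common choice for grouping similar objects into clusters. A minimal sketch on 2-D points, with all choices here being illustrative:

```python
import random

def kmeans(points, k, iters=20):
    """Minimal k-means on 2-D points: group similar objects into k clusters."""
    centers = random.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        # Assign each point to the nearest center (similarity = squared distance).
        for p in points:
            i = min(range(k), key=lambda i: (p[0] - centers[i][0]) ** 2
                                            + (p[1] - centers[i][1]) ** 2)
            clusters[i].append(p)
        # Move each center to the mean of its cluster.
        for i, c in enumerate(clusters):
            if c:
                centers[i] = (sum(p[0] for p in c) / len(c),
                              sum(p[1] for p in c) / len(c))
    return centers, clusters
```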

Page 48

Clustering: Rich Applications

• Pattern recognition
• Spatial data analysis
  • create thematic maps in GIS by clustering feature spaces
  • detect spatial clusters, or support other spatial mining tasks
• Image processing
• Economic science (especially market research)
• WWW
  • document classification
  • clustering Weblog data to discover groups of similar access patterns

Page 49

Requirements of Clustering in DM

• Scalability
• Ability to deal with different types of attributes
• Ability to handle dynamic data
• Discovery of clusters with arbitrary shape
• Minimal requirements for domain knowledge to determine input parameters
• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• High dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

Page 50

End of Unit 5