Tuesday, October 27, 2015
TRANSCRIPT
April 20, 2023 · Data Mining: Concepts and Techniques · 1
Classification and Prediction
(Data Mining: Concepts and Techniques)
Chapter 6 Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Classification by backpropagation
Other classification methods
Prediction
Performance evaluation
Summary
An Example of Classification (Fruit Classifier)

Input features (shape: round or oval; color: red, orange, or yellow) go into the classifier; the classifier output is a class label:

shape = round, color = red    → Classifier → Apple
shape = round, color = orange → Classifier → Orange
shape = oval,  color = yellow → Classifier → Mango
A Graphical Model for a Classifier

Input features x1, x2, …, xn → Classifier → output y (class label)
Model Representation for a Classifier

The model has the form y = f (x1, x2, …, xn):

Input features x1, x2, …, xn → Classifier → output y (class label)

Main problems in classifier model construction:
- What are x1, …, xn for constructing an effective f ?
- How do we get the model f given x1, …, xn ?
- How do we collect training data with class label y for creating the model f ?
Main Data Mining Techniques: Supervised Learning

Use a training set to construct a model for forecasting the outcome of future events. There are two main types.

Classification
- constructs models that distinguish classes for future forecasts
- Applications: loan approval, customer classification, fingerprint recognition
- Model representation: decision tree, neural network

Prediction
- constructs models that predict unknown numerical values
- Applications: price prediction of various securities and assets
- Model representation: neural network, linear regression
Classification vs. Prediction

Use a training set to construct a model for forecasting the outcome of future events.

Classification
- predicts categorical class labels
- constructs a classification model to classify new data

Prediction
- predicts numerical values
- constructs a continuous-valued function to predict unknown or missing values

Typical applications: credit card approval, medical diagnosis and treatment, pattern recognition
Classification vs. Prediction
Input features x1, …, xn → Classifier → output: class label (category/nominal value)
Input features x1, …, xn → Predictor → output: predicted value (continuous value)
Classification—A Two-Step Process
1. Model construction
   - Training set: the data set used for model construction
   - Class label: each tuple/sample is assumed to belong to a predefined class (determined by the class label attribute)
   - Model representation: classification rules, decision trees, or mathematical formulae
2. Model usage: classifying future unknown objects
   - Test set: a data set independent of the training set
   - Performance evaluation: how good is the model?
     - The known class label y of each test sample is compared with the classified result y' from the classification model
     - Accuracy rate: the percentage of test samples correctly classified by the model (the ratio of samples with y = y')
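As a small illustration of the accuracy rate (the labels below are hypothetical, not from the slides), we simply count how often the known label y matches the classified result y':

```python
def accuracy_rate(y_true, y_pred):
    """Fraction of test samples whose known label y equals the classified result y'."""
    matches = sum(1 for y, y_hat in zip(y_true, y_pred) if y == y_hat)
    return matches / len(y_true)

# hypothetical test-set labels vs. classifier outputs: 3 of 4 agree
print(accuracy_rate(["yes", "no", "yes", "yes"], ["yes", "no", "no", "yes"]))  # 0.75
```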
Classification: A Two-Step Process

1. Model construction: Training data (I, O) → Classification learning algorithm → Classifier model
2. Model usage: input features → Classifier model → output class label
Classification Process (1): Example of Model Construction

Training data:

| name | rank           | years | tenured |
|------|----------------|-------|---------|
| Mike | Assistant Prof | 3     | no      |
| Mary | Assistant Prof | 7     | yes     |
| Bill | Professor      | 2     | yes     |
| Jim  | Associate Prof | 7     | yes     |
| Dave | Assistant Prof | 6     | no      |
| Anne | Associate Prof | 3     | no      |

The input features are rank and years; the class label is tenured. A classification learning algorithm produces the classifier (model) tenured = f (rank, years), for example:

IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
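The learned rule can be written directly as a function; a minimal sketch (the rule and data are from the slide, the function name is ours):

```python
def tenured(rank, years):
    """Classifier model learned from the training data:
    IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

# the rule reproduces every class label in the training data
training = [("Mike", "Assistant Prof", 3, "no"), ("Mary", "Assistant Prof", 7, "yes"),
            ("Bill", "Professor", 2, "yes"), ("Jim", "Associate Prof", 7, "yes"),
            ("Dave", "Assistant Prof", 6, "no"), ("Anne", "Associate Prof", 3, "no")]
print(all(tenured(rank, years) == label for _, rank, years, label in training))  # True
```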
Classification Process (2): Example of Model Use

Testing data (for performance evaluation of the classifier):

| name    | rank           | years | tenured |
|---------|----------------|-------|---------|
| Tom     | Assistant Prof | 2     | no      |
| Merlisa | Associate Prof | 7     | no      |
| George  | Professor      | 5     | yes     |
| Joseph  | Assistant Prof | 7     | yes     |

Unseen data: (Jeff, Professor, 4). Predict tenured?
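Continuing the sketch, applying the rule from the slide above (IF rank = 'professor' OR years > 6 THEN tenured = 'yes') to the test set gives the accuracy rate, and the model can then classify the unseen sample:

```python
def tenured(rank, years):
    """The classifier model learned on the training slide."""
    return "yes" if rank == "Professor" or years > 6 else "no"

test_set = [("Tom", "Assistant Prof", 2, "no"), ("Merlisa", "Associate Prof", 7, "no"),
            ("George", "Professor", 5, "yes"), ("Joseph", "Assistant Prof", 7, "yes")]

correct = sum(1 for _, rank, years, label in test_set if tenured(rank, years) == label)
print(correct / len(test_set))   # 0.75 (Merlisa is misclassified by the rule)
print(tenured("Professor", 4))   # prediction for the unseen sample (Jeff, Professor, 4)
```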
Supervised vs. Unsupervised Learning
Supervised learning (classification)
- Aim: establish a classifier model
- Supervision: the training data (observations, measurements, etc.) are accompanied by class labels indicating the class of each observation

Unsupervised learning (clustering)
- The class labels of the training data are unknown
- Aim: establish the classes or clusters in the data
Chapter 6 Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Classification by backpropagation
Classification based on concepts from association rule mining
Other classification methods
Prediction
Estimating classification accuracy
Summary
Issues regarding classification & prediction
1) Data preparation (data preprocessing)
- Data cleaning: preprocess data to reduce noise and handle missing values
- Feature relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize data
Issues regarding classification and prediction
2) Performance evaluation of classification methods
- Predictive accuracy
- Speed scalability (for big data analysis): time to construct the model; time to use the model
- Space scalability (for big data analysis): memory/disk required to construct/use the model
- Robustness: handling noise and missing values
- Interpretability: understanding and insight provided by the model
- Goodness of rules: size and compactness of the classification rules
Chapter 6 Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Classification by backpropagation
Classification based on concepts from association rule mining
Other classification methods
Prediction
Estimating classification accuracy
Summary
Decision Tree Induction Algorithm (A Learning Algorithm for Classification Models)

Decision tree induction
- Given: a set of training data (<I, O> = <x1, …, xn, y>)
- Aim: find f, where y = f (x1, x2, …, xn) and f is in decision-tree form; that is, construct a minimal decision tree that effectively classifies future unknown samples

Decision tree representation: a flow-chart-like tree structure
- An internal node denotes a test on an attribute of a sample
- A branch represents an outcome of the test
- A leaf node represents a class label or a class distribution
Classification in Decision Tree Induction
1. Generation of the decision tree, consisting of two phases
   - Tree construction: the tree is built node by node in a top-down manner using the training examples
   - Tree pruning: identify and remove branching subtrees that reflect noise or outliers
2. Use of the decision tree: classifying an unknown sample
   - Test the attribute values of a sample (with unknown class label) against the decision tree
An Example of a Training Dataset (for buys_PC)

| age   | income | student | credit_rating | buys_PC |
|-------|--------|---------|---------------|---------|
| <=30  | high   | no      | fair          | no      |
| <=30  | high   | no      | excellent     | no      |
| 31…40 | high   | no      | fair          | yes     |
| >40   | medium | no      | fair          | yes     |
| >40   | low    | yes     | fair          | yes     |
| >40   | low    | yes     | excellent     | no      |
| 31…40 | low    | yes     | excellent     | yes     |
| <=30  | medium | no      | fair          | no      |
| <=30  | low    | yes     | fair          | yes     |
| >40   | medium | yes     | fair          | yes     |
| <=30  | medium | yes     | excellent     | yes     |
| 31…40 | medium | no      | excellent     | yes     |
| 31…40 | high   | yes     | fair          | yes     |
| >40   | medium | no      | excellent     | no      |

The input features are age, income, student, and credit_rating; the class label is buys_PC. This follows an example from Quinlan's ID3.
A Decision Tree for Predicting buys_PC

buys_PC = f (age, student, credit rating): input features (age, student, credit rating) → f → buys_PC = yes/no

age?
  <=30  → student?
            no  → no
            yes → yes
  31…40 → yes
  >40   → credit rating?
            excellent → no
            fair      → yes

(age?, student?, and credit rating? are tests on input attributes; branch labels are attribute values; leaves are class labels for buys_PC.)
Extracting Classification Rules from Trees
Rules are easier for humans to understand
- Represent the knowledge in the form of IF-THEN rules: one rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a condition; the leaf node holds the class prediction

Examples of extracted rules (matching the decision tree and training data above):
IF age = "<=30" AND student = "no" THEN buys_computer = "no"
IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
IF age = "31…40" THEN buys_computer = "yes"
IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
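The one-rule-per-path idea translates directly into code; a sketch (function name is ours, and for age > 40 we follow the tree and training data, which give "excellent → no" and "fair → yes"):

```python
def buys_computer(age, student, credit_rating):
    """One IF-THEN rule per root-to-leaf path of the buys_PC decision tree."""
    if age == "<=30" and student == "no":
        return "no"
    if age == "<=30" and student == "yes":
        return "yes"
    if age == "31...40":
        return "yes"
    if age == ">40" and credit_rating == "excellent":
        return "no"
    if age == ">40" and credit_rating == "fair":
        return "yes"

print(buys_computer("31...40", "no", "fair"))   # yes
print(buys_computer(">40", "no", "excellent"))  # no
```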
ID3 Algorithm for Decision Tree Induction
Assumption: attributes are categorical (continuous-valued attributes are discretized in advance)

Idea of the decision tree induction algorithm:
- The tree is constructed node by node in a top-down, recursive, divide-and-conquer manner
- Key: find the most discriminating attribute at each node
- At the start, all training samples are located at the root node
- At each node, the training samples at that node are used to select the most discriminating attribute, on the basis of a heuristic or statistical measure (the attribute selection measure)
- The training samples are then partitioned into branches according to their values of the most discriminating attribute
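The top-down, divide-and-conquer procedure can be sketched compactly. This is a minimal illustration (helper names are ours) that selects the most discriminating attribute by lowest post-split entropy, i.e., highest information gain, the measure defined on the following slides:

```python
import math
from collections import Counter

def entropy(labels):
    """Expected information needed to classify a set with the given class labels."""
    total = len(labels)
    return -sum(c/total * math.log2(c/total) for c in Counter(labels).values())

def most_discriminating(rows, labels, attrs):
    """Attribute with the highest information gain = lowest post-split entropy."""
    def split_entropy(a):
        groups = {}
        for row, y in zip(rows, labels):
            groups.setdefault(row[a], []).append(y)
        return sum(len(g)/len(labels) * entropy(g) for g in groups.values())
    return min(attrs, key=split_entropy)

def id3(rows, labels, attrs):
    if len(set(labels)) == 1:              # all samples in the same class
        return labels[0]
    if not attrs:                          # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    a = most_discriminating(rows, labels, attrs)
    branches = {}
    for value in {row[a] for row in rows}: # partition samples by a's values
        idx = [i for i, row in enumerate(rows) if row[a] == value]
        branches[value] = id3([rows[i] for i in idx], [labels[i] for i in idx],
                              [x for x in attrs if x != a])
    return (a, branches)

# the buys_PC training data: (age, income, student, credit_rating) -> buys_PC
data = [("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
        ("31...40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
        (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
        ("31...40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
        ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
        ("<=30","medium","yes","excellent","yes"), ("31...40","medium","no","excellent","yes"),
        ("31...40","high","yes","fair","yes"), (">40","medium","no","excellent","no")]
attrs = ["age", "income", "student", "credit_rating"]
rows = [dict(zip(attrs, d[:4])) for d in data]
labels = [d[4] for d in data]

tree = id3(rows, labels, attrs)
print(tree[0])             # age: the most discriminating attribute at the root
print(tree[1]["31...40"])  # yes: all 31...40 samples fall in the same class
```

Running this on the buys_PC data reproduces the tree shown earlier: the root splits on age, the <=30 branch splits on student, and the >40 branch splits on credit_rating.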
Partitioning of Training Data at a Node of a Decision Tree

At the root node, all 14 training samples (the buys_PC table above) are present. age, income, student, and credit_rating are tested; age is the most discriminating attribute, so the samples are partitioned by its values:

age <=30:
| age  | income | student | credit_rating | buys_PC |
|------|--------|---------|---------------|---------|
| <=30 | high   | no      | fair          | no      |
| <=30 | high   | no      | excellent     | no      |
| <=30 | medium | no      | fair          | no      |
| <=30 | low    | yes     | fair          | yes     |
| <=30 | medium | yes     | excellent     | yes     |

age 31…40:
| age   | income | student | credit_rating | buys_PC |
|-------|--------|---------|---------------|---------|
| 31…40 | high   | no      | fair          | yes     |
| 31…40 | low    | yes     | excellent     | yes     |
| 31…40 | medium | no      | excellent     | yes     |
| 31…40 | high   | yes     | fair          | yes     |

age >40:
| age | income | student | credit_rating | buys_PC |
|-----|--------|---------|---------------|---------|
| >40 | medium | no      | fair          | yes     |
| >40 | low    | yes     | fair          | yes     |
| >40 | low    | yes     | excellent     | no      |
| >40 | medium | yes     | fair          | yes     |
| >40 | medium | no      | excellent     | no      |
Stopping Conditions (for Decision Tree Induction)
Conditions for stopping the recursive data partitioning:
- All samples at a given node belong to the same class
- There are no remaining attributes for further partitioning (majority voting is employed to label the leaf)
- There are no samples left
Attribute Selection Measures (finding the most discriminating attribute at each node)

Information gain (ID3/C4.5)
- All attributes are assumed to be categorical
- Can be modified for continuous-valued attributes

Gini index (IBM IntelligentMiner)
- All attributes are assumed to be continuous-valued
- Assumes several possible split values exist for each attribute
- May need other tools, such as clustering, to get the possible split values
- Can be modified for categorical attributes
Information Gain (ID3/C4.5): A Measure for Attribute Selection

How do we find the most discriminating attribute for each node? Idea: choose the attribute with the highest information gain at each node.

Assume there are two classes, P and N, in the training examples, and the set of training examples S contains p elements of class P and n elements of class N. The information amount (entropy) needed to decide whether an arbitrary example in S belongs to P or N is defined as

I(p, n) = -(p/(p+n)) log2(p/(p+n)) - (n/(p+n)) log2(n/(p+n))
Information Gain (in Decision Tree Induction)

Assume that, using attribute Ak, the set S is partitioned into subsets {S1, S2, …, Sv} (i.e., Ak has v distinct values). If Si contains pi examples of P and ni examples of N, the entropy E(Ak), the expected information amount needed to classify objects in all subtrees of S, is

E(Ak) = Σ (i = 1 to v) ((pi + ni) / (p + n)) · I(pi, ni)

The information amount gained by branching on Ak is

Gain(Ak) = I(p, n) - E(Ak)

At each node, select the attribute Ai with the maximal gain, i.e., Gain(Ai) ≥ Gain(Aj) for 1 ≤ j ≤ m, j ≠ i.
An Example (Attribute Selection by Information Gain)

Data: the buys_PC training set, at the root node.
- Class P: buys_computer = "yes" (p = 9)
- Class N: buys_computer = "no" (n = 5)
- I(p, n) = I(9, 5) = 0.940

Compute the entropies for the attributes age, income, student, and credit_rating. Splitting the data using age gives:

| age   | pi | ni | I(pi, ni) |
|-------|----|----|-----------|
| <=30  | 2  | 3  | 0.971     |
| 31…40 | 4  | 0  | 0         |
| >40   | 3  | 2  | 0.971     |

Total split entropy for age:

E(age) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.69

Information gain for age:

Gain(age) = I(p, n) - E(age) = 0.25

Similarly, Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Therefore attribute age is selected at the root node.
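The numbers above can be checked mechanically; a small sketch of I(p, n), the split entropy, and the resulting gain:

```python
import math

def I(p, n):
    """I(p, n): expected information for p positives and n negatives (0·log2 0 = 0)."""
    total = p + n
    return -sum(c/total * math.log2(c/total) for c in (p, n) if c)

def E(partitions, p, n):
    """Split entropy: sum over branches of ((pi + ni)/(p + n)) · I(pi, ni)."""
    return sum((pi + ni)/(p + n) * I(pi, ni) for pi, ni in partitions)

p, n = 9, 5
age = [(2, 3), (4, 0), (3, 2)]            # (pi, ni) for <=30, 31…40, >40
print(round(I(p, n), 3))                  # 0.94  (≈ 0.940)
print(round(E(age, p, n), 2))             # 0.69
print(round(I(p, n) - E(age, p, n), 2))   # 0.25
```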
Gini Index (IBM IntelligentMiner)

If a data set T contains examples from n classes, the gini index gini(T) is defined as

gini(T) = 1 - Σ (j = 1 to n) pj²

where pj is the data percentage of class j in T.

If a data set T is split into two subsets T1 and T2, with sizes N1 and N2 respectively, by using attribute Ak, the gini index after splitting, ginik(T), is defined as

ginik(T) = (N1/N) gini(T1) + (N2/N) gini(T2)

The attribute providing the smallest ginik(T) is chosen to split the node (this requires enumerating all possible splitting points for each attribute Ak).
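The two formulas are short enough to sketch directly; here the class counts 9/5 are the buys_PC data, while the binary split below is a hypothetical illustration, not one from the slides:

```python
def gini(class_counts):
    """gini(T) = 1 - sum_j pj^2, where pj is the fraction of class j in T."""
    total = sum(class_counts)
    return 1 - sum((c/total)**2 for c in class_counts)

def gini_split(part1, part2):
    """gini_k(T) = (N1/N)·gini(T1) + (N2/N)·gini(T2) for a binary split."""
    n1, n2 = sum(part1), sum(part2)
    return n1/(n1+n2) * gini(part1) + n2/(n1+n2) * gini(part2)

print(round(gini([9, 5]), 3))                 # 0.459: gini of the full buys_PC set
print(round(gini_split([6, 1], [3, 4]), 3))   # 0.367: a hypothetical 7/7 split
```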
Avoiding Overfitting in ID3/C4.5

The generated tree may overfit the training data:
- resulting in poor accuracy for unseen samples
- if there are too many branches, some may reflect anomalies due to noise or outliers

Two pruning approaches avoid overfitting:
- Prepruning: halt tree construction early, i.e., do not split a node if this would result in the goodness measure falling below a threshold (it is difficult to choose an appropriate threshold)
- Postpruning: obtain a sequence of progressively pruned trees from a "fully grown" tree, then use a data set different from the training data to decide which is the "best pruned tree"
Enhancements to Basic Decision Tree Induction

Allow continuous-valued attributes:
- Define new discrete-valued attributes by dynamically partitioning the continuous attribute values into a set of discrete intervals

Handle missing attribute values by:
- Assigning the most common value of the attribute
- Assigning a probability to each of the possible values
Why Decision Tree Induction?

Decision tree induction is a classification learning algorithm, and classification is a typical problem extensively studied by statisticians and machine learning researchers. Why use decision tree induction for classification?
- Convertible to simple, easy-to-understand classification rules (if-then rules)
- SQL queries can be used to access databases for each rule, to find its associated data and rule coverage rate
- Classification accuracy comparable with other methods
- Relatively fast learning speed (compared with other classification methods)
Presentation of Classification Results
Chapter 6 Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Classification by backpropagation
Classification based on concepts from association rule mining
Other classification methods
Prediction
Estimating classification accuracy
Summary
Bayesian Classification: Why?
Probabilistic prediction and learning:
- Predicts multiple hypotheses, calculating a probability for each

Incremental learning:
- Each training example incrementally increases or decreases the probability that a hypothesis is correct
- Prior knowledge can be combined with observed data

Standard:
- Even when Bayesian methods are computationally intractable, they can provide a benchmark standard against which other methods are compared
Bayesian Theorem

Given training data D, the posterior probability of a hypothesis h, denoted P(h|D), follows Bayes' theorem. From the definition of conditional probability,

P(h|D) = P(h ∧ D) / P(D),  P(D|h) = P(h ∧ D) / P(h)

and therefore

P(h|D) = P(D|h) · P(h) / P(D)

The value of P(h|D) can be obtained from the values of P(h), P(D), and P(D|h).
Naive Bayes Classification for m Classes

Definitions:
- X = <x1, x2, …, xn>: an n-dimensional sample
- Ci|X: the hypothesis "sample X is of class Ci"
- P(Ci|X): the probability that sample X is of class Ci; there are m such probabilities: P(C1|X), P(C2|X), …, P(Cm|X)

Goal: find f for D, where y = f (x1, x2, …, xn) and f is in naive-Bayes-classifier form. f assigns sample X to class Ci if P(Ci|X) is maximal among P(C1|X), P(C2|X), …, P(Cm|X), i.e., the naive Bayesian classifier f assigns an unknown sample X to class Ci if and only if

P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i

(X is most likely of class Ci according to the probabilities.)
Estimating A-Posteriori Probabilities

How do we find the maximal one among P(C1|X), P(C2|X), …, P(Cm|X)? According to Bayes' theorem:

P(Ci|X) = P(X|Ci) · P(Ci) / P(X)  for 1 ≤ i ≤ m

P(X) is difficult to obtain, but it is a constant when computing P(Ci|X) for all m classes, so only the values of P(X|Ci) · P(Ci) are required for 1 ≤ i ≤ m.

P(Ci) is the relative frequency of samples of class Ci, which can be computed from the training data. The remaining problem: how to compute P(X|Ci)?
Naïve Bayesian Classification
Naive assumption: attribute independence

P(X|Ci) = P(<x1, …, xn>|Ci) = P(x1|Ci) · … · P(xn|Ci)

Why this assumption? It makes P(X|Ci) computable: X itself may not occur among the Ci samples, in which case P(X|Ci) could not be estimated directly, and it requires a minimal amount of training data.

- If the k-th attribute of X is categorical: P(xk|Ci) is estimated as the relative frequency of samples having value xk for the k-th attribute (Ak = xk) in class Ci, 1 ≤ k ≤ n
- If the k-th attribute is continuous: P(xk|Ci) is estimated via a Gaussian density function (normal distribution, a function modeled by mean and variance) fitted to the k-th attribute using the data in class Ci
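For the categorical case, the relative-frequency estimate is a one-line count; a sketch (the two-attribute samples and the function name are hypothetical):

```python
def p_value_given_class(samples, labels, k, value, cls):
    """P(xk|Ci): relative frequency of value xk for attribute k among class-Ci samples."""
    in_class = [s for s, y in zip(samples, labels) if y == cls]
    return sum(1 for s in in_class if s[k] == value) / len(in_class)

# hypothetical samples: (color, shape) with class labels
samples = [("red", "round"), ("red", "oval"), ("green", "round"), ("red", "round")]
labels  = ["A", "A", "B", "B"]
print(p_value_given_class(samples, labels, 0, "red", "A"))  # 1.0: both A samples are red
print(p_value_given_class(samples, labels, 0, "red", "B"))  # 0.5: one of two B samples is red
```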
Play-Tennis Example (Predict Playing Tennis or Not on a Given Day)

Given the following training data, the problem is to predict whether to play tennis on a particular day.

| Outlook  | Temperature | Humidity | Windy | Class |
|----------|-------------|----------|-------|-------|
| sunny    | hot         | high     | false | N     |
| sunny    | hot         | high     | true  | N     |
| overcast | hot         | high     | false | P     |
| rain     | mild        | high     | false | P     |
| rain     | cool        | normal   | false | P     |
| rain     | cool        | normal   | true  | N     |
| overcast | cool        | normal   | true  | P     |
| sunny    | mild        | high     | false | N     |
| sunny    | cool        | normal   | false | P     |
| rain     | mild        | normal   | false | P     |
| sunny    | mild        | normal   | true  | P     |
| overcast | mild        | high     | true  | P     |
| overcast | hot         | normal   | false | P     |
| rain     | mild        | high     | true  | N     |

P: play, 9 records; N: do not play, 5 records.

Given an unseen input sample <rain, hot, high, false>, will we play tennis?
Play-Tennis Example: Classifying X

An unseen sample X = <rain, hot, high, false>; we want to predict "play tennis or not?"

1. The problem is to test whether P(p|X) > P(n|X), i.e., whether P(X|p)·P(p) / P(X) > P(X|n)·P(n) / P(X) (by the Bayesian theorem).
2. According to naive Bayesian classification, this reduces to testing whether P(X|p)·P(p) > P(X|n)·P(n), where
   P(X|p)·P(p) = P(<rain, hot, high, false>|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p)
   P(X|n)·P(n) = P(<rain, hot, high, false>|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n)
3. We need to know the values of the above probabilities.
Play-tennis example: estimating P(p), P(n), and P(xᵢ|C)
outlook:
  P(sunny|p) = 2/9      P(sunny|n) = 3/5
  P(overcast|p) = 4/9   P(overcast|n) = 0
  P(rain|p) = 3/9       P(rain|n) = 2/5
temperature:
  P(hot|p) = 2/9        P(hot|n) = 2/5
  P(mild|p) = 4/9       P(mild|n) = 2/5
  P(cool|p) = 3/9       P(cool|n) = 1/5
humidity:
  P(high|p) = 3/9       P(high|n) = 4/5
  P(normal|p) = 6/9     P(normal|n) = 2/5
windy:
  P(true|p) = 3/9       P(true|n) = 3/5
  P(false|p) = 6/9      P(false|n) = 2/5
P(p) = 9/14
P(n) = 5/14
Step 1: estimate each P(xₖ|Cᵢ) and P(Cᵢ) from the training data:

| Outlook | Temperature | Humidity | Windy | Class |
|----------|-------------|----------|-------|-------|
| sunny | hot | high | false | N |
| sunny | hot | high | true | N |
| overcast | hot | high | false | P |
| rain | mild | high | false | P |
| rain | cool | normal | false | P |
| rain | cool | normal | true | N |
| overcast | cool | normal | true | P |
| sunny | mild | high | false | N |
| sunny | cool | normal | false | P |
| rain | mild | normal | false | P |
| sunny | mild | normal | true | P |
| overcast | mild | high | true | P |
| overcast | hot | normal | false | P |
| rain | mild | high | true | N |

Step 2: for X = <rain, hot, high, false>, test P(X|p)·P(p) > P(X|n)·P(n)?, where P(X|Cᵢ) = ∏ₖ P(xₖ|Cᵢ)
![Page 44: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/44.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 44
Play-tennis example: classifying X
Given an unseen sample X = <rain, hot, high, false>, we want to predict: "play tennis or not?"
Then, the problem is to test "if P(p|X) > P(n|X)?", i.e., "if P(X|p)·P(p) / P(X) > P(X|n)·P(n) / P(X)?" (Bayes' theorem)
According to Naïve Bayesian Classification, test if P(X|p)·P(p) > P(X|n)·P(n)?
P(X|p)·P(p) = P(<rain, hot, high, false>|p)·P(p) = P(rain|p)·P(hot|p)·P(high|p)·P(false|p)·P(p) = 3/9 · 2/9 · 3/9 · 6/9 · 9/14 = 0.010582
P(X|n)·P(n) = P(<rain, hot, high, false>|n)·P(n) = P(rain|n)·P(hot|n)·P(high|n)·P(false|n)·P(n) = 2/5 · 2/5 · 4/5 · 2/5 · 5/14 = 0.018286
Sample X is classified in class n (P(X|p) ·P(p) < P(X|n) ·P(n) )
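The computation above can be sketched in Python; a minimal sketch that recomputes the priors and conditional probabilities directly from the 14-record training table:

```python
# Training data from the play-tennis table:
# (outlook, temperature, humidity, windy) -> class
data = [
    ("sunny", "hot", "high", "false", "N"), ("sunny", "hot", "high", "true", "N"),
    ("overcast", "hot", "high", "false", "P"), ("rain", "mild", "high", "false", "P"),
    ("rain", "cool", "normal", "false", "P"), ("rain", "cool", "normal", "true", "N"),
    ("overcast", "cool", "normal", "true", "P"), ("sunny", "mild", "high", "false", "N"),
    ("sunny", "cool", "normal", "false", "P"), ("rain", "mild", "normal", "false", "P"),
    ("sunny", "mild", "normal", "true", "P"), ("overcast", "mild", "high", "true", "P"),
    ("overcast", "hot", "normal", "false", "P"), ("rain", "mild", "high", "true", "N"),
]

def nb_score(x, cls):
    """P(X|cls) * P(cls) under the naive (conditional independence) assumption."""
    rows = [r for r in data if r[-1] == cls]
    score = len(rows) / len(data)                      # prior P(cls)
    for i, value in enumerate(x):                      # product of P(x_i | cls)
        score *= sum(1 for r in rows if r[i] == value) / len(rows)
    return score

x = ("rain", "hot", "high", "false")
p_score = nb_score(x, "P")   # 3/9 * 2/9 * 3/9 * 6/9 * 9/14 ~ 0.010582
n_score = nb_score(x, "N")   # 2/5 * 2/5 * 4/5 * 2/5 * 5/14 ~ 0.018286
print("P" if p_score > n_score else "N")   # prints N: classified as "not play"
```

Since the class-conditional counts come straight from the table, the two scores reproduce the slide's 0.010582 and 0.018286.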
![Page 45: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/45.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 45
The Independence Assumption
Makes computation possible
Yields optimal classifiers when the assumption is satisfied
But is seldom satisfied in practice, as attributes (variables) are often correlated
Can attempt to overcome this limitation by:
  Bayesian networks, which combine Bayesian reasoning with causal relationships between attributes
  (these require a larger training sample than NBC)
![Page 46: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/46.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 46
Bayesian Belief Networks (I)

(figure: a belief network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, and Dyspnea)

Assumption: the variables Family History and Smoker are correlated.

The conditional probability table for the variable LungCancer:

|     | (FH, S) | (FH, ~S) | (~FH, S) | (~FH, ~S) |
|-----|---------|----------|----------|-----------|
| LC  | 0.8     | 0.5      | 0.7      | 0.1       |
| ~LC | 0.2     | 0.5      | 0.3      | 0.9       |
![Page 47: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/47.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 47
Bayesian Belief Networks (II)
A Bayesian belief network allows class conditional independencies between subsets of the variables
It is a graphical model of causal relationships
Several classes of problems in learning Bayesian belief networks:
  Given the network structure and all related variables => easy
  Given the network structure and only some related variables => hard
  When the network structure is unknown => harder
![Page 48: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/48.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 48
Chapter 6 Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Discriminative Classification
Other Classification Methods
Prediction
Estimating classification accuracy
Summary
![Page 49: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/49.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 49
Classification: A Mathematical Mapping

Binary classification as a mathematical mapping:
  computes 2-class categorical labels
  is a binary function f: 𝒳 → 𝒴; mathematically, y = f(X), X ∈ 𝒳 ≡ ℝⁿ, y ∈ 𝒴 = {+1, −1} (or {0, 1})
  X: input, y: output
Example: classification of personal homepages, SPAM mail (an application of automatic document classification)
  y = +1 or −1 (1/0; yes/no; true/false; positive/negative)
  X = <x₁, x₂, …, xₙ> (a keyword frequency vector for a Web page)
    x₁: # of keyword 1, e.g., "homepage"
    x₂: # of keyword 2, e.g., "welcome"
    …
![Page 50: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/50.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 50
Linear Classification Problems
Example: a 2-D binary classification problem
  A sample is a 2-D point
  The data above the red line belongs to class 'x'
  The data below the red line belongs to class 'o'
  The data can be linearly classified by the red line
Classifier examples: SVM, Perceptron (an ANN)

(figure: 'x' points above and 'o' points below the red separating line)

In linear classification problems, the classification is accomplished by a linear hyperplane:
  ax + by + c = 0, i.e., w₀ + w₁x₁ + w₂x₂ = 0
![Page 51: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/51.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 51
Chapter 6 Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by Backpropagation
Other Classification Methods
Prediction
Estimating classification accuracy
Summary
![Page 52: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/52.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 52
Artificial Neural Networks(A Network of Artificial Neurons)
Advantages
  prediction accuracy is generally high
  robust: works when training examples contain errors
  output may be discrete, real-valued, or a vector of discrete or real-valued attributes
  fast evaluation of the learned target function
Criticism
  (somewhat) long training time for an optimal model
  difficult to understand the learned function (weights)
  not easy to incorporate prior domain knowledge
![Page 53: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/53.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 53
Architecture of a Typical Artificial Neural Network(Multi-layer Perceptron)
(figure: input signals enter the Input Layer, pass through the Middle Layer, and leave the Output Layer as output signals)
![Page 54: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/54.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 54
A Neuron as a Simple Computing Element
Y = f( Σᵢ₌₁ⁿ xᵢwᵢ − θ ),  where f is the activation function and θ is the threshold (bias)

(figure: input signals x₁, x₂, …, xₙ reach neuron Y through connection weights w₁, w₂, …, wₙ; the neuron emits the output signal Y)
![Page 55: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/55.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 55
Steps of a neuron’s computation
1. Compute the weighted sum of the input signals
2. Compare the result with the threshold value θ
3. Produce an output based on a transfer or activation function as follows:
A Neuron as a Simple Computing Element
Y = f( Σᵢ₌₁ⁿ xᵢwᵢ − θ )
![Page 56: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/56.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 56
Various Activation Functions f of a Neuron
(figure: plots of the step, sign, sigmoid, and linear activation functions)

  Y_step = 1 if X ≥ 0; 0 if X < 0
  Y_sign = +1 if X ≥ 0; −1 if X < 0
  Y_sigmoid = 1 / (1 + e^(−X))          (hidden neuron / output neuron)
  Y_linear = X                          (output neuron, for function approximation)

where Y = f(X) and X = Σᵢ₌₁ⁿ xᵢwᵢ
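The four activation functions translate directly into a minimal Python sketch:

```python
import math

def step(x):
    # step function: 1 if X >= 0, 0 otherwise
    return 1 if x >= 0 else 0

def sign(x):
    # sign function: +1 if X >= 0, -1 otherwise
    return 1 if x >= 0 else -1

def sigmoid(x):
    # sigmoid function: 1 / (1 + e^(-X))
    return 1.0 / (1.0 + math.exp(-x))

def linear(x):
    # linear function: Y = X
    return x
```

The sigmoid is the usual choice for hidden neurons because it is differentiable, which the backpropagation algorithm on slide 58 requires.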
![Page 57: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/57.jpg)
Construction of Classification Modelvia Network Training
The objective of network training: obtain a set of connection weights that makes almost all the training tuples classified correctly

Steps
1. Initialize the weights with random values
2. Feed one of the training samples into the network
3. Do the following for each neuron, layer by layer (loop):
   1. Compute the net input to the neuron as a weighted summation of all the inputs to the neuron
   2. Compute the output value using the activation function
   3. Compute the error by the backpropagation algorithm
   4. Adjust the weights and the bias according to the error
4. Go to Step 2 until convergence

(figure: the multi-layer network of slide 53, with the training loop indicated)
![Page 58: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/58.jpg)
Backpropagation Training Algorithm
(figure: a feed-forward network with input nodes, hidden nodes, and output nodes; the input vector enters at the bottom and the output vector leaves at the top; the j-th hidden neuron has net input I_j^H and output O_j^H)

I). The input is propagated forward (I: input, O: output; superscript H: hidden layer, superscript O: output layer):

  I_j^H = Σ_i w_ji^H · O_i^I        O_j^H = 1 / (1 + e^(−I_j^H))
  I_k^O = Σ_j w_kj^O · O_j^H        O_k^O = 1 / (1 + e^(−I_k^O))

II). The weights are updated according to the backward-propagated errors (η: learning rate, T_k: target output):

  Err_k^O = O_k^O · (1 − O_k^O) · (T_k − O_k^O)
  Err_j^H = O_j^H · (1 − O_j^H) · Σ_k w_kj^O · Err_k^O

  w_kj^O,new = w_kj^O + η · Err_k^O · O_j^H
  w_ji^H,new = w_ji^H + η · Err_j^H · O_i^I
  θ_j^H,new = θ_j^H + η · Err_j^H
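A minimal Python sketch of one forward/backward pass on a tiny 2-2-1 network; the initial weights, learning rate, and training pair are hypothetical, and bias terms are omitted for brevity:

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def train_step(x, target, W_h, W_o, eta=0.5):
    """One forward + backward pass following the slide's update rules."""
    # I. forward propagation: hidden outputs, then output-layer outputs
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in W_h]
    o = [sigmoid(sum(w * hj for w, hj in zip(row, h))) for row in W_o]
    # II. backward error propagation
    err_o = [ok * (1 - ok) * (t - ok) for ok, t in zip(o, target)]
    err_h = [hj * (1 - hj) * sum(W_o[k][j] * err_o[k] for k in range(len(o)))
             for j, hj in enumerate(h)]
    # weight updates: w_new = w + eta * Err * O
    for k, row in enumerate(W_o):
        for j in range(len(row)):
            row[j] += eta * err_o[k] * h[j]
    for j, row in enumerate(W_h):
        for i in range(len(row)):
            row[i] += eta * err_h[j] * x[i]
    return o

W_h = [[0.2, -0.3], [0.4, 0.1]]   # hypothetical initial hidden-layer weights
W_o = [[-0.5, 0.2]]               # hypothetical output-layer weights
x, target = [1.0, 0.0], [1.0]
for _ in range(1000):
    out = train_step(x, target, W_h, W_o)
print(out)   # the output moves close to the target after repeated updates
```

Repeating the step drives the network output toward the target, which is the convergence condition in Step 4 of the training procedure on the previous slide.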
![Page 59: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/59.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 59
Chapter 6 Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Classification by Backpropagation
Other classification methods
Prediction
Estimating classification accuracy
Summary
![Page 60: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/60.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 60
Other Classification Methods
SVM—Support Vector Machines k-nearest neighbor classifier Case-based reasoning Rough set approach Fuzzy set approaches
![Page 61: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/61.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 61
SVM—Support Vector Machines
A classification method for both linear and nonlinear data
For nonlinear data, a nonlinear mapping is used to transform the training data into a higher dimension
With the new dimension, it searches for the optimal linear separating hyperplane (i.e., “decision boundary”)
With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a linear hyperplane
SVM finds this separating hyperplane using support vectors ("essential" training tuples) and margins (margin widths, defined by the support vectors)
![Page 62: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/62.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 62
SVM—History and Applications
Vapnik and colleagues (1992): groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s
Features: training can be slow, but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)
Used both for classification and prediction
Applications: handwritten digit recognition, object recognition, speaker identification
![Page 63: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/63.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 63
SVM — General Concept(Find a decision boundary with a maximal margin)
(figure: two candidate decision boundaries; decision boundary 2, which has the larger margin, is better than decision boundary 1)
![Page 64: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/64.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 64
SVM— Margins and Support Vectors
(figure: two separating hyperplanes: the left with a small margin and worse support vectors, the right with a large margin and better support vectors)
![Page 65: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/65.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 65
SVM — Case 1When Data Is Linearly Separable
(figure: two classes separated by a hyperplane; m denotes the margin)

Let the data D = {(X₁, y₁), …, (X|D|, y|D|)} be the set of training tuples, where each Xᵢ is associated with the class label yᵢ
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
SVM searches for the separating hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
![Page 66: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/66.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 66
SVM – Case 1 : Linearly Separable
A separating hyperplane can be written as
W ● X + b = 0
where W={w1, w2, …, wn} is a weight vector and b a scalar (bias)
For a 2-D space, a line L ax + by +c=0 can be written as
w0 + w1 x1 + w2 x2 = 0
The hyperplanes defining the two sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ – 1 for yi = –1
Support vectors: any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin)
This becomes a constrained (convex) quadratic optimization problem: linear constraints with a quadratic objective function → Quadratic Programming (QP) → Lagrangian multipliers
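The decision rule and the margin constraints H1/H2 can be sketched directly; the weight vector and bias below are illustrative placeholders, not the solution of the QP:

```python
def hyperplane_side(w, b, x):
    """Returns +1 or -1 depending on which side of W . X + b = 0 the point lies."""
    s = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if s >= 0 else -1

def satisfies_margin(w, b, x, y):
    """Margin constraint combining H1 and H2: y * (W . X + b) >= 1."""
    return y * (sum(wi * xi for wi, xi in zip(w, x)) + b) >= 1

w, b = [1.0, 1.0], -3.0                 # hypothetical hyperplane x1 + x2 - 3 = 0
print(hyperplane_side(w, b, (4, 4)))    # +1
print(hyperplane_side(w, b, (0, 0)))    # -1
```

`satisfies_margin` is exactly the constraint the QP enforces for every training tuple; the support vectors are the tuples for which it holds with equality.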
![Page 67: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/67.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 67
Why Is SVM Effective on High Dimensional Data?
The complexity of an SVM classifier is characterized by the # of support vectors rather than by the dimensionality or the # of data tuples
The support vectors are the essential or critical training examples: they lie closest to the decision boundary (MMH)
If all other training examples were removed and the SVM training repeated, the same separating hyperplane would be found
The set of support vectors can be used to compute an (upper) bound on the expected error rate of an SVM classifier, which is independent of the data dimensionality
Thus, an SVM with a small number of support vectors can still have good generalization, even when the dimensionality of the data is high
![Page 68: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/68.jpg)
Transform the original input data into a higher-dimensional space
  A 3-D input vector is mapped into a new 6-D space: z₁ = x₁, z₂ = x₂, z₃ = x₃, …
  Search for a linear separating hyperplane in the 6-D space
二〇二三年四月二十日 Data Mining: Concepts and Techniques 68
SVM — Case 2: Linearly Inseparable

(figure: classes A1 and A2, linearly inseparable in the original space; the mapping is <x₁, x₂, x₃> → <z₁, z₂, z₃, z₄, z₅, z₆>)
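The mapping can be sketched as below; the slide lists only z₁–z₃, so the choice of z₄–z₆ (a square and cross-products of the original coordinates, as in the usual textbook example) is an assumption here:

```python
def phi(x1, x2, x3):
    # z1..z3 copy the original coordinates (as on the slide);
    # z4..z6 are an assumed nonlinear completion: x1^2, x1*x2, x1*x3
    return (x1, x2, x3, x1 * x1, x1 * x2, x1 * x3)

print(phi(2.0, 3.0, 4.0))   # (2.0, 3.0, 4.0, 4.0, 6.0, 8.0)
```

A hyperplane that is linear in the z-coordinates corresponds to a nonlinear (quadratic) surface in the original x-coordinates, which is what makes the transformed problem linearly separable.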
![Page 69: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/69.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 69
k-NN (k-Nearest Neighbor) Algorithm
A sample X is represented as X = <x₁, x₂, …, xₙ>
A training sample corresponds to a point in an n-D space
The nearest neighbors are defined in terms of Euclidean distance:
  D(X, Y) = √( Σᵢ₌₁ⁿ (xᵢ − yᵢ)² )
When given an unknown sample x_q, a k-NN classifier searches the sample space for the k training samples nearest to x_q, then decides the class of the unknown sample by majority vote
The value of k is decided heuristically

(figure: the query point x_q surrounded by '+' and '−' training points)
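The whole algorithm fits in a few lines of Python; the 2-D training points below are hypothetical:

```python
import math
from collections import Counter

def euclidean(x, y):
    # D(X, Y) = sqrt(sum over i of (x_i - y_i)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def knn_classify(training, xq, k=3):
    """Majority vote among the k training samples nearest to xq."""
    nearest = sorted(training, key=lambda s: euclidean(s[0], xq))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# hypothetical 2-D training points with two classes
training = [((1, 1), "+"), ((1, 2), "+"), ((2, 1), "+"),
            ((6, 6), "-"), ((6, 7), "-"), ((7, 6), "-")]
print(knn_classify(training, (2, 2)))   # the 3 nearest neighbors are all "+"
```

Note that there is no training phase at all: classification defers all work to query time, which is why k-NN is called a lazy learner.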
![Page 70: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/70.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 70
Discussion on the k-NN Algorithm
The k-NN algorithm works only for numeric-valued data
Enhancement: distance-weighted k-NN algorithm
  Weight the contribution of the k neighbors according to their distance to the query point (sample) x_q, giving greater weight to closer neighbors:
    w = 1 / D(x_q, xᵢ)²
  Similarly, works only for numeric-valued data
Robust to noisy data by averaging the k nearest neighbors
Curse of dimensionality: the distance between neighbors could be dominated by many irrelevant attributes
  To overcome it, stretch the axes or eliminate the least relevant attributes
![Page 71: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/71.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 71
Rough Set Approach
Rough sets are used to approximately or "roughly" define equivalence classes
A rough set for a given class C is approximated by two sets:
  a lower approximation (certain to be in C), and
  an upper approximation (cannot be described as not belonging to C)
Rough sets can also be used for feature reduction: a discernibility matrix is used to detect redundant attributes
![Page 72: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/72.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 72
Fuzzy Set Approaches
Example application: credit approval
Credit approval rule:
  IF (years_employed >= 2) AND (income >= 50K), THEN Credit = "approved"
Problem: What if a customer has had a job for at least two years and her income is $49K? Should she be approved or not?
Solution: fuzzy set approaches
  IF (years_employed is medium) AND (income is high), THEN Credit is approved
![Page 73: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/73.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 73
Fuzzy Set Approaches
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership
Attribute values are converted to fuzzy values
  e.g., income x is mapped into the discrete categories {low, medium, high} with fuzzy values
    x → ⟨μ_low, μ_medium, μ_high⟩,  μ_low, μ_medium, μ_high ∈ [0, 1]   (see the fuzzy membership graph)
  For income, $49K is transformed into ⟨0, 0.1, 0.9⟩
Each applicable rule of the rule set contributes a vote for membership in the categories
Typically, the truth values for each predicted category are summed up with weights for making the decision
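A membership function reproducing the slide's numbers can be sketched as follows; the breakpoints ($30K, $40K, $50K) are hypothetical choices made so that $49K maps to roughly ⟨0, 0.1, 0.9⟩:

```python
def fuzzify_income(income_k):
    """Map income (in $K) to fuzzy memberships <low, medium, high>.
    Piecewise-linear ramps with hypothetical breakpoints 30K, 40K, 50K."""
    low = max(0.0, min(1.0, (30 - income_k) / 10))     # fades out above 20K-30K
    medium = max(0.0, min(1.0, (50 - income_k) / 10))  # fades out toward 50K
    high = max(0.0, min(1.0, (income_k - 40) / 10))    # fades in above 40K
    return low, medium, high

print(fuzzify_income(49))   # approximately (0, 0.1, 0.9)
```

So a $49K applicant is mostly "high" income with a small "medium" membership, and both the medium and high rules would contribute weighted votes to the decision.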
![Page 74: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/74.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 74
Chapter 6 Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by backpropagation
Other Classification Methods
Prediction
Estimating classification accuracy
Summary
![Page 75: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/75.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 75
What Is Prediction?
Prediction is similar to classification
  Step 1: construct a model
  Step 2: use the model to predict unknown values
The major method for prediction is regression
  Linear multiple regression, e.g., y = β₁x₁ + β₂x₂
  Non-linear regression, e.g., y = β₁x₁ + β₂x₂² + β₃x₃³
Other method: artificial neural networks
Main difference between prediction and classification
  Classification predicts categorical class labels
  Prediction models continuous-valued functions
![Page 76: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/76.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 76
Linear regression: Y = α + βX
  Two parameters, α and β, specify the line
  They are estimated using the least squares criterion on the training samples (X₁, Y₁), (X₂, Y₂), …, (Xₙ, Yₙ)
Multiple regression: Y = b₀ + b₁X₁ + b₂X₂ + … + bₙXₙ
  Many nonlinear functions can be transformed into the above
Log-linear models (example: estimating a probability):
  p(a, b, c, d) = α_abc · β_abd · γ_acd · δ_bcd
  log p(a, b, c, d) = log α_abc + log β_abd + log γ_acd + log δ_bcd
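The least-squares fit of Y = α + βX can be sketched in a few lines of Python; the data points below are hypothetical, roughly following y = 1 + 2x:

```python
def fit_line(points):
    """Least-squares estimates of alpha and beta for Y = alpha + beta * X."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # beta = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in points)
            / sum((x - mean_x) ** 2 for x, _ in points))
    alpha = mean_y - beta * mean_x
    return alpha, beta

points = [(1, 3.1), (2, 4.9), (3, 7.2), (4, 8.8)]   # hypothetical samples
alpha, beta = fit_line(points)
```

For these four points the fit comes out near α ≈ 1.15 and β ≈ 1.94, close to the underlying line.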
Regression Analysis and Log-Linear Models in Prediction
![Page 77: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/77.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 77
Chapter 6 Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by backpropagation
Classification based on concepts from association rule mining
Other Classification Methods
Prediction
Estimating accuracy
Summary
![Page 78: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/78.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 78
Classifier Accuracy Measures
Accuracy of a classifier M, acc(M): the percentage of test samples that are correctly classified by the classifier M (created from the training set)
  Error rate (misclassification rate) of M = 1 − acc(M)
Given m classes, CMᵢ,ⱼ, an entry in a confusion matrix, indicates the # of samples in class i that are labeled by the classifier as class j
Alternative performance measures (e.g., for cancer diagnosis):
  sensitivity = TP/P            /* true positive recognition rate */
  specificity = TN/N            /* true negative recognition rate */
  precision = TP/(TP + FP)
  accuracy = (TP + TN)/(P + N) = sensitivity · P/(P + N) + specificity · N/(P + N)
This model can also be used for cost-benefit analysis
| classes | yes (computed) | no (computed) | total | recognition (%) |
|---|---|---|---|---|
| buy_computer = yes | 6954 | 46 | 7000 | 99.34 |
| buy_computer = no | 412 | 2588 | 3000 | 86.27 |
| total | 7366 | 2634 | 10000 | 95.42 |

|  | P′ (computed) | N′ (computed) |
|---|---|---|
| P (actual) | true positive | false negative |
| N (actual) | false positive | true negative |
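The measures above can be checked directly against the buy_computer confusion matrix on this slide:

```python
# Counts from the buy_computer confusion matrix
TP, FN = 6954, 46     # actual yes: correctly / incorrectly labeled
FP, TN = 412, 2588    # actual no: incorrectly / correctly labeled
P, N = TP + FN, FP + TN

accuracy = (TP + TN) / (P + N)
sensitivity = TP / P            # true positive recognition rate
specificity = TN / N            # true negative recognition rate
precision = TP / (TP + FP)

print(round(accuracy * 100, 2))   # 95.42
```

The computed sensitivity (99.34%) and specificity (86.27%) match the per-class recognition rates in the table, and accuracy is their P/(P+N)- and N/(P+N)-weighted sum.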
![Page 79: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/79.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 79
Error Measures for Prediction
Measure how far off the predicted value is from the actual known value
Loss function: measures the error between yᵢ and yᵢ′ (the predicted value)
  Absolute error: |yᵢ − yᵢ′|
  Squared error: (yᵢ − yᵢ′)²
Test error: the average loss over the test set
  Mean absolute error (MAE): Σᵢ₌₁ᵈ |yᵢ − yᵢ′| / d
  Mean squared error (MSE): Σᵢ₌₁ᵈ (yᵢ − yᵢ′)² / d
  Relative absolute error (RAE): Σᵢ₌₁ᵈ |yᵢ − yᵢ′| / Σᵢ₌₁ᵈ |yᵢ − ȳ|
  Relative squared error (RSE): Σᵢ₌₁ᵈ (yᵢ − yᵢ′)² / Σᵢ₌₁ᵈ (yᵢ − ȳ)²
The mean squared error exaggerates the presence of outliers
  Popularly used: the (square) root mean squared error and, similarly, the root relative squared error
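These four measures translate directly into Python; the actual/predicted values below are hypothetical:

```python
def mae(actual, pred):
    # mean absolute error: sum |y_i - y_i'| / d
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mse(actual, pred):
    # mean squared error: sum (y_i - y_i')^2 / d
    return sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual)

def rae(actual, pred):
    # relative absolute error: errors normalized by deviation from the mean
    mean = sum(actual) / len(actual)
    return (sum(abs(a - p) for a, p in zip(actual, pred))
            / sum(abs(a - mean) for a in actual))

def rse(actual, pred):
    # relative squared error: squared errors normalized likewise
    mean = sum(actual) / len(actual)
    return (sum((a - p) ** 2 for a, p in zip(actual, pred))
            / sum((a - mean) ** 2 for a in actual))

actual, pred = [1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.5]   # hypothetical
```

The relative measures compare the predictor against the trivial model that always outputs the mean, so values below 1 mean the predictor beats that baseline.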
![Page 80: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/80.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 80
Performance Evaluation of Classification(Methods for Estimating Average Classification Accuracy)
Partition: training-and-testing
  Use two independent data sets: training set (2/3), test set (1/3)
  Used for data sets with a large number of samples
k-fold cross-validation
  Randomly divide the data set into k subsets: S₁, S₂, …, Sₖ
  At iteration i, subset Sᵢ is used as the test set and the remaining k−1 subsets are used as training data
  A total of k iterations for computing the average accuracy
  Used for data sets of moderate size
Bootstrapping (leave-one-out)
  Similar to k-fold cross-validation with k set to s, where s is the number of initial samples
  Used for small data sets
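The k-fold partitioning step can be sketched as follows; a minimal sketch (real implementations usually also stratify the folds by class):

```python
import random

def k_fold_splits(samples, k=10, seed=0):
    """Yield (training_set, test_set) pairs; fold i is the test set at iteration i."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)          # random division of the data
    folds = [shuffled[i::k] for i in range(k)]     # the k subsets S1, ..., Sk
    for i in range(k):
        # all folds except fold i form the training set
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]

# hypothetical data set of 20 samples
splits = list(k_fold_splits(list(range(20)), k=10))
```

Each sample appears in exactly one test fold across the k iterations, so averaging the k accuracies uses every sample for testing exactly once.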
![Page 81: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/81.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 81
10-fold Cross-Validation
The data set is randomly divided into 10 subsets (1, 2, …, 10)
At each iteration, nine of the ten subsets are used for training the classifier and the tenth subset is used as the test set
The iterations are repeated 10 times, so each subset serves as the test set exactly once

(figure: at iteration 10, subsets 1–9 form the training set and subset 10 is the test set)
![Page 82: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/82.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 82
Model Comparison by ROC Curves
ROC (Receiver Operating Characteristic) curves: for visual comparison of the performance of classification models
  The vertical axis represents the TP (true positive) rate
  The horizontal axis represents the FP (false positive) rate
  The plot also shows a diagonal line (the coin-flip model)
  Model 1 is better than model 2

(figure: ROC curves for model 1 and model 2, with model 1 lying above model 2 and both above the diagonal line)
![Page 83: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/83.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 83
Model Comparison by ROC Curves
Originated from signal detection theory
Shows the trade-off between the true positive rate and the false positive rate
The area under the ROC curve (AUC) is a measure of the performance of the model
  A model with perfect accuracy will have an area of 1.0
  The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
![Page 84: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/84.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 84
Chapter 6 Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian Classification
Classification by backpropagation
Classification based on concepts from association rule mining
Other Classification Methods
Prediction
Estimating classification accuracy
Summary
![Page 85: 2015年10月27日星期二 2015年10月27日星期二 2015年10月27日星期二 Data Mining: Concepts and Techniques1 Classification and Prediction (Data Mining: Concepts and Techniques)](https://reader033.vdocuments.site/reader033/viewer/2022061501/56649f045503460f94c18147/html5/thumbnails/85.jpg)
二〇二三年四月二十日 Data Mining: Concepts and Techniques 85
Summary
Classification is an extensively studied problem (mainly in statistics, machine learning & AI)
Classification issue: how to create a classifier, i.e., find f, where y = f(x₁, x₂, …, xₙ) and f takes a DT, NBC, ANN, … form
Classification is probably one of the most widely used data mining techniques, with a lot of extensions
Scalability is an important issue for applications related to Big Data and Clouds