Chapter 11
Supervised Learning: STATISTICAL METHODS
Cios / Pedrycz / Swiniarski / Kurgan
Outline

• Bayesian Methods
– Basics of Bayesian Methods
– Bayesian Classification – General Case
– Classification that Minimizes Risk
– Decision Regions and Probability of Errors
– Discriminant Functions
– Estimation of Probability Densities
– Probabilistic Neural Network
– Constraints in Classifier Design
• Regression
– Data Models
– Simple Linear Regression
– Multiple Regression
– General Least Squares and Multiple Regression
– Assessing Quality of the Multiple Regression Model
Bayesian Methods
Statistical processing based on Bayes decision theory is a fundamental technique for pattern recognition and classification. Bayes decision theory provides a framework for statistical methods that classify patterns into classes based on the probabilities of patterns and their features.
Basics of Bayesian Methods
Let us assume an experiment involving recognition of two kinds of birds: an eagle and a hawk.
States of nature C = { “ an eagle ”, “ a hawk ” }
Values of C = { c1, c2 } = { “ an eagle ”, “ a hawk ” }
We may assume that, among a large number N of prior observations, n_eagle of them belonged to class c1 (“an eagle”) and n_hawk belonged to class c2 (“a hawk”), with n_eagle + n_hawk = N.
Basics of Bayesian Methods
A priori (prior) probability P(ci):
P(ci) denotes the (unconditional) probability that an object belongs to class ci, without any further information about the object.
Estimation of a prior P(ci) from the N prior observations: P(c1) ≈ n_eagle / N, P(c2) ≈ n_hawk / N
Basics of Bayesian Methods
The a priori probabilities P(c1) and P(c2) represent our initial knowledge (in statistical terms) about how likely it is that an eagle or a hawk may emerge even before a bird physically appears.
– Natural and best decision:
“Assign a bird to class c1 if P(c1) > P(c2); otherwise, assign it to class c2”
– The probability of classification error:
P(classification error) = P(c2) if we decide C = c1, and P(c1) if we decide C = c2
Involving Object Features in Classification
• Feature variable / feature x
– It characterizes an object and allows for better discrimination of one class from another
– We assume it to be a continuous random variable taking continuous values from a given range
– The variability of a random variable x can be expressed in probabilistic terms
– We represent the distribution of a random variable x by the class conditional probability density function (the state conditional probability density function) p(x|ci)
Involving Object Features in Classification
Examples of probability densities
Involving Object Features in Classification
• Probability density function p(x|ci)
– also called the likelihood of class ci with respect to the value x of the feature variable
– a class ci with a larger value of p(x|ci) is more likely to be the true class of an object whose feature value is x
– Joint probability density function p(ci, x): the probability density that an object is in class ci and has feature value x
– A posteriori (posterior) probability P(ci|x): the conditional probability (for i = 1, 2) that the object class is ci, given that the measured value of the feature variable is x
Involving Object Features in Classification
• Bayes’ rule / Bayes’ theorem
– From probability theory (see Appendix B): P(ci|x) = p(x|ci) P(ci) / p(x)
– The unconditional probability density function: p(x) = p(x|c1) P(c1) + p(x|c2) P(c2)
Involving Object Features in Classification
• Bayes’ rule
“The conditional probability P(ci|x) can be expressed in terms of the a priori probability P(ci), together with the class conditional probability density function p(x|ci).”
Involving Object Features in Classification
• Bayes’ decision rule
Decide C = c1 if P(c1|x) > P(c2|x); otherwise decide C = c2.
P(classification error | x) = P(c2|x) if we decide C = c1, and P(c1|x) if we decide C = c2
“This statistical classification rule is best in the sense of minimizing the probability of misclassification (the probability of classification error)”
– Bayes’ classification rule guarantees minimization of the average probability of classification error
Involving Object Features in Classification
Example
Let us consider a bird classification problem with P(c1) = P(“an eagle”) = 0.8 and P(c2) = P(“a hawk”) = 0.2, and known probability density functions p(x|c1) and p(x|c2). Assume that, for a new bird, we have measured its size x = 45 cm, and for this value we computed p(45|c1) = 2.2828 ∙ 10^-2 and p(45|c2) = 1.1053 ∙ 10^-2. The classification rule predicts class c1 (“an eagle”) because p(x|c1)P(c1) > p(x|c2)P(c2) (2.2828 ∙ 10^-2 ∙ 0.8 > 1.1053 ∙ 10^-2 ∙ 0.2). Assume additionally that the unconditional density value is known to be p(45) = 0.3. The probability of classification error is then
P(classification error | x = 45) = p(45|c2) P(c2) / p(45) = (1.1053 ∙ 10^-2 ∙ 0.2) / 0.3 ≈ 0.0074
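A minimal Python sketch of this two-class decision, using only the numbers quoted in the example (the density values and p(45) are taken as given rather than computed from a fitted model):

```python
# Two-class Bayes decision using the numbers given in the example.
priors = {"eagle": 0.8, "hawk": 0.2}                    # P(c1), P(c2)
likelihoods = {"eagle": 2.2828e-2, "hawk": 1.1053e-2}   # p(x = 45 | ci)
p_x = 0.3                                               # unconditional density p(45), as given

# Decide the class with the larger p(x|ci) * P(ci)
scores = {c: likelihoods[c] * priors[c] for c in priors}
decision = max(scores, key=scores.get)

# Conditional probability of error = posterior of the class we did not choose
rejected = "hawk" if decision == "eagle" else "eagle"
p_error = likelihoods[rejected] * priors[rejected] / p_x

print(decision, round(p_error, 4))                      # eagle 0.0074
```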
Bayesian Classification – General Case
• Bayes’ Classification Rule for Multiclass Multifeature Objects
– Real-valued features of an object form an n-dimensional column vector x ∈ Rn: x = [x1, x2, …, xn]^T
– The object may belong to one of l distinct classes (l distinct states of nature): C = {c1, c2, …, cl}
Bayesian Classification – General Case
• Bayes’ Classification Rule for Multiclass Multifeature Objects
– Bayes’ theorem: P(ci|x) = p(x|ci) P(ci) / p(x), for i = 1, 2, …, l
A priori probability: P(ci)
Class conditional probability density function: p(x|ci)
A posteriori (posterior) probability: P(ci|x)
Unconditional probability density function: p(x) = Σ p(x|ci) P(ci), summed over i = 1, …, l
Bayesian Classification – General Case
• Bayes’ Classification Rule for Multiclass Multifeature Objects
– Bayes classification rule: Assign an object with a given value x of the feature vector to class cj when
P(cj|x) > P(ci|x) for all i ≠ j, or equivalently p(x|cj) P(cj) > p(x|ci) P(ci) for all i ≠ j
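As a sketch of this rule in Python, with hypothetical Gaussian class-conditional densities and made-up priors (scipy is used only to evaluate the densities; none of these values come from the chapter):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical three-class problem: priors P(ci) and Gaussian class-conditional densities p(x|ci)
priors = np.array([0.5, 0.3, 0.2])
densities = [
    multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)),
    multivariate_normal(mean=[3.0, 0.0], cov=np.eye(2)),
    multivariate_normal(mean=[0.0, 3.0], cov=np.eye(2)),
]

def bayes_classify(x):
    """Assign x to the class cj that maximizes p(x|cj) * P(cj) (equivalently, the posterior P(cj|x))."""
    scores = np.array([d.pdf(x) for d in densities]) * priors
    return int(np.argmax(scores))

print(bayes_classify([2.5, 0.2]))   # index of the most probable class under the assumed model
```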
Classification that Minimizes Risk
• Basic Idea
To account for the fact that misclassifications of some classes are more costly than others, we base classification on a minimization criterion that involves a loss assigned to a given classification decision for a given true state of nature
• A loss function
– The cost (penalty, weight) of assigning an object to class cj when in fact the true class is ci
Classification that Minimizes Risk
• A loss matrix
– For an l-class classification problem, the loss function is arranged as an l × l matrix with entries Lij, the loss of deciding class cj when the true class is ci
• Expected (average) conditional loss (conditional risk)
– In short, R(cj | x) = Σ Lij P(ci | x), summed over i = 1, …, l
Classification that Minimizes Risk
• Overall Risk
– R = ∫ R(c(x) | x) p(x) dx, where c(x) is the class decided for x
– The overall risk R can be used as a classification criterion for minimizing the risk related to a classification decision.
• Bayes risk
– The minimal overall risk R is called the Bayes risk; minimizing it leads to a generalization of Bayes’ rule for minimization of the probability of classification error.
Classification that Minimizes Risk
• Bayes’ classification rule with Bayes risk
– Choose a decision (a class) ci for which the conditional risk is minimal: R(ci | x) = min over j = 1, …, l of R(cj | x)
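A small Python illustration of the risk-minimizing decision, with an assumed loss matrix and assumed posteriors for a single pattern x:

```python
import numpy as np

# L[i, j] is the loss of deciding class j when the true class is i (assumed values).
L = np.array([[0.0, 1.0, 4.0],
              [2.0, 0.0, 1.0],
              [8.0, 1.0, 0.0]])
posteriors = np.array([0.5, 0.3, 0.2])   # assumed P(ci | x) for one pattern x

# Conditional risk of each decision j: R(cj | x) = sum_i L[i, j] * P(ci | x)
cond_risk = L.T @ posteriors
decision = int(np.argmin(cond_risk))     # choose the decision with minimal conditional risk

# Here the minimum-risk decision (class 1) differs from the maximum-posterior decision (class 0)
print(cond_risk, decision)
```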
Classification that Minimizes Risk
• Bayesian Classification Minimizing the Probability of Error
– Symmetrical zero-one conditional loss function: Lij = 0 for i = j and Lij = 1 for i ≠ j
– With this loss, the conditional risk R(cj | x) equals the conditional probability of classification error: R(cj | x) = Σ_{i ≠ j} P(ci | x) = 1 - P(cj | x)
– The average probability of classification error is thus used as the criterion of minimization for selecting the best classification decision
Classification that Minimizes Risk
• Generalization of the Maximum Likelihood Classification
– Generalized likelihood ratio for classes cj and ci: l_ji(x) = p(x|cj) / p(x|ci)
– Generalized threshold value: θ_ji = [ (Lij - Lii) P(ci) ] / [ (Lji - Ljj) P(cj) ]
– The maximum likelihood classification rule
“Decide class cj if l_ji(x) > θ_ji; otherwise decide class ci”
Decision Regions and Probability of Errors
• Decision regions
– A classifier divides the feature space into l
disjoint decision subspaces R1,R2, … Rl
– The region Ri is a subspace such that each
realization x of a feature vector of an object falling
into this region will be assigned to a class ci
Decision Regions and Probability of Errors
• Decision boundaries (decision surfaces)
– The boundaries between adjacent decision regions; on a boundary the classification decision is tied between the neighboring classes
“The task of a classifier design is to find classification rules
that will guarantee division of a feature space into optimal
decision regions R1,R2, … Rl (with optimal decision
boundaries) that will minimize a selected classification
performance criterion”
Decision Regions and Probability of Errors
• Decision boundaries
Decision Regions and Probability of Errors
• Optimal classification with decision regions
– Average probability of correct classification: P(classification_correct) = Σ ∫_{Ri} p(x|ci) P(ci) dx, summed over i = 1, …, l
“Classification problems can be stated as choosing the decision regions Ri (thus defining a classification rule) that maximize the probability P(classification_correct) of correct classification, used as the optimization criterion”
Discriminant Functions
• Discriminant functions: di(x), i = 1, 2, …, l (one scalar function of the feature vector per class)
• Discriminant-type classifier
– It assigns an object with a given value x of the feature vector to class cj if dj(x) > di(x) for all i ≠ j
• Classification rule for a discriminant function-based classifier
1) Compute numerical values of all discriminant functions for x
2) Choose as the prediction of the true class the class cj for which the value of the associated discriminant function dj(x) is the largest:
Select a class cj for which dj(x) = max( di(x) ); i = 1, 2, …, l
Discriminant Functions
• Discriminant classifier
Discriminant Functions
• Discriminant type classifier for Bayesian classification
– The natural choice for the discriminant function is the a posteriori conditional probability: di(x) = P(ci|x)
– Practical version using Bayes’ theorem (the common denominator p(x) may be dropped): di(x) = p(x|ci) P(ci)
– Bayesian discriminant in natural logarithmic form: di(x) = ln p(x|ci) + ln P(ci)
Discriminant Functions
• Characteristics of discriminant function
– Discriminant functions define the decision boundaries that
separate the decision regions
– Generally, the decision boundaries are defined by
neighboring decision regions when the corresponding
discriminant function values are equal
– The decision boundaries are unaffected by a monotonically increasing transformation of the discriminant functions
Discriminant Functions
• Bayesian Discriminant Functions for Two Classes
– General case: two discriminant functions d1(x) and d2(x),
two decision regions R1 and R2,
and the decision boundary d1(x) = d2(x).
– Using a dichotomizer: a single discriminant function d(x) = d1(x) - d2(x); decide c1 if d(x) > 0 and c2 otherwise.
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
– Quadratic Discriminant
– Assumption: a multivariate normal (Gaussian) distribution of the feature vector x within each class
– The Bayesian discriminant (from the previous section): di(x) = ln p(x|ci) + ln P(ci)
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
– Quadratic Discriminant
– Gaussian class-conditional probability density function:
p(x|ci) = (2π)^(-n/2) |Σi|^(-1/2) exp( -(1/2) (x - μi)^T Σi^(-1) (x - μi) )
– Quadratic discriminant function (class-independent constants dropped):
di(x) = -(1/2) (x - μi)^T Σi^(-1) (x - μi) - (1/2) ln |Σi| + ln P(ci)
– Decision boundaries:
hyperquadratic surfaces in the n-dimensional feature space (hyperspheres, hyperellipsoids, hyperparaboloids, etc.)
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
Given: A pattern x, the a priori probabilities P(ci), and a training set of labeled patterns
1) Compute estimates of the mean vectors μi and the covariance matrices Σi for all classes i = 1, 2, …, l based on the training set
2) Compute values of the quadratic discriminant function for all classes
3) Choose as the prediction of the true class the class cj for which the value of the associated discriminant function dj(x) is largest:
Select a class cj for which dj(x) = max( di(x) ); i = 1, 2, …, l
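A compact Python sketch of this procedure: a plain quadratic-discriminant classifier in which the means, covariances, and priors are estimated from synthetic, illustrative training data (none of the values come from the chapter):

```python
import numpy as np

def fit_gaussian_classes(X, y):
    """Estimate the prior P(ci), mean vector and covariance matrix of each class from training data."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),            # prior P(ci)
                     Xc.mean(axis=0),             # mean vector mu_i
                     np.cov(Xc, rowvar=False))    # covariance matrix Sigma_i
    return params

def quadratic_discriminant(x, prior, mu, sigma):
    """d_i(x) = -0.5 (x - mu)^T Sigma^-1 (x - mu) - 0.5 ln|Sigma| + ln P(ci)."""
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(sigma) @ diff
            - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior))

def classify(x, params):
    scores = {c: quadratic_discriminant(x, *p) for c, p in params.items()}
    return max(scores, key=scores.get)

# Illustrative usage on synthetic two-class data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)), rng.normal([3, 3], 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(classify(np.array([2.5, 2.5]), fit_gaussian_classes(X, y)))   # expected: 1
```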
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
– Linear Discriminant:
– Assumption: equal covariance matrices for all classes, Σi = Σ (i = 1, 2, …, l)
– The quadratic discriminant then reduces, after dropping class-independent terms, to
– a linear form of the discriminant functions:
di(x) = wi^T x + wi0, where wi = Σ^(-1) μi and wi0 = -(1/2) μi^T Σ^(-1) μi + ln P(ci)
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
– Linear Discriminant:
Decision boundaries between classes i and j, for which di(x) = dj(x), are pieces of hyperplanes in n-dimensional feature space
Discriminant Functions
• Quadratic and Linear Discriminants Derived from the Bayes Rule
– The classification process using linear discriminants
1) Compute, for a given x, numerical values of discriminant functions for all classes:
2) Choose the class cj for which the value of the discriminant function dj(x) is largest:
Select a class cj for which dj(x) = max( di(x) ); i = 1, 2, …, l
Discriminant Functions
• Quadratic and Linear Discriminants
Example: Let us assume that the following two-feature patterns x ∈ R2
from two classes c1 = 0 and c2 = 1 have been drawn according to Gaussian (normal) density distributions:
Discriminant Functions
• Quadratic and Linear Discriminants
Example – The estimates of the symmetric covariance matrices for both
classes
– The linear discriminant functions for both classes
Discriminant Functions
• Quadratic and Linear Discriminants
Example – Two-class two-feature pattern dichotomizer.
Discriminant Functions
• Quadratic and Linear Discriminants
– Minimum Mahalanobis Distance Classifier
– Assumptions
– Equal covariance matrices for all classes: Σi = Σ ( i = 1, 2, …, l )
– Equal a priori probabilities for all classes: P(ci) = P
– Discriminant function: di(x) = -(1/2) (x - μi)^T Σ^(-1) (x - μi), i.e., maximize di(x) by minimizing the squared Mahalanobis distance r^2(x, μi) = (x - μi)^T Σ^(-1) (x - μi)
Discriminant Functions
• Quadratic and Linear Discriminants
– Minimum Mahalanobis Distance Classifier
– A classifier that selects the class cj whose mean vector μj is nearest to the value x, in the sense of the Mahalanobis distance, is called a minimum Mahalanobis distance classifier.
– Linear version of the minimum Mahalanobis distance classifier (the term x^T Σ^(-1) x is the same for all classes and can be dropped):
di(x) = μi^T Σ^(-1) x - (1/2) μi^T Σ^(-1) μi
Discriminant Functions
• Quadratic and Linear Discriminants
– Minimum Mahalanobis Distance Classifier
Given: The mean vectors μi for all classes (i = 1, 2, …, l), the common covariance matrix Σ, and a given value x of the feature vector
1) Compute numerical values of the Mahalanobis distances between x and the means μi for all classes.
2) Choose as the prediction of the true class the class cj for which the value of the associated Mahalanobis distance attains the minimum.
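A minimal Python sketch of this classifier, with assumed class means and an assumed shared covariance matrix:

```python
import numpy as np

def min_mahalanobis_classify(x, means, sigma):
    """Assign x to the class whose mean is nearest in squared Mahalanobis distance (shared covariance)."""
    sigma_inv = np.linalg.inv(sigma)
    dists = [(x - mu) @ sigma_inv @ (x - mu) for mu in means]
    return int(np.argmin(dists))

# Illustrative means and shared covariance matrix (assumed values, not from the chapter)
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(min_mahalanobis_classify(np.array([2.0, 0.5]), means, sigma))
```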
Discriminant Functions
• Quadratic and Linear Discriminants– Linear Discriminant for Statistically Independent
Features
– Assumptions
– Equal covariance matrices for all classes: Σi = Σ ( i = 1, 2, …, l )
– Features are statistically independent, so Σ = diag(σ1^2, σ2^2, …, σn^2)
– Discriminant function: di(x) = -(1/2) Σ_j (xj - μij)^2 / σj^2 + ln P(ci),
where μij is the j-th element of the mean vector μi of class ci
Discriminant Functions
• Quadratic and Linear Discriminants
– Linear Discriminant for Statistically Independent Features
– Discriminants
– Quadratic discriminant formula
– Linear discriminant formula
Discriminant Functions
• Quadratic and Linear Discriminants
– Linear Discriminant for Statistically Independent Features
– “Neural network” style, as a linear threshold machine: di(x) = wi^T x + wi0,
where wi is a weight vector and wi0 is a bias (threshold) term
– The decision surfaces for the linear discriminants are
pieces of hyperplanes defined by the equations di(x) - dj(x) = 0.
Discriminant Functions
• Quadratic and Linear Discriminants
– Minimum Euclidean Distance Classifier
– Assumptions
– Equal covariance matrices for all classes: Σi = Σ = σ^2 I ( i = 1, 2, …, l )
– Features are statistically independent, with equal variances
– Equal a priori probabilities for all classes: P(ci) = P
– Discriminants: di(x) = -||x - μi||^2
or, in linear form, di(x) = μi^T x - (1/2) μi^T μi
Discriminant Functions
• Quadratic and Linear Discriminants
– Minimum Euclidean Distance Classifier
– The minimum distance classifier (minimum Euclidean distance classifier) selects the class cj whose mean vector μj is nearest to the value x.
– Linear version of the minimum distance classifier: di(x) = μi^T x - (1/2) μi^T μi
Discriminant Functions
• Quadratic and Linear Discriminants (Minimum Euclidean Distance Classifier)
Given: The mean vectors μi for all classes (i = 1, 2, …, l) and a given value x of the feature vector
1) Compute numerical values of the Euclidean distances ||x - μi|| between x and the means μi for all classes.
2) Choose as the prediction of the true class the class cj for which the value of the associated Euclidean distance is smallest.
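The corresponding Python sketch for the minimum Euclidean distance classifier, again with assumed class means:

```python
import numpy as np

def min_euclidean_classify(x, means):
    """Assign x to the class whose mean vector is nearest in Euclidean distance."""
    dists = [np.linalg.norm(x - mu) for mu in means]
    return int(np.argmin(dists))

means = [np.array([0.0, 0.0]), np.array([3.0, 1.0]), np.array([0.0, 4.0])]   # assumed values
print(min_euclidean_classify(np.array([2.5, 0.5]), means))                   # -> 1
```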
Discriminant Functions
• Quadratic and Linear Discriminants
– Characteristics of the Bayesian Normal Discriminant
– Assumptions: multivariate normality within classes and equal covariance matrices across classes
– Under these assumptions the linear discriminant is equivalent to the optimal classifier
– In practice these assumptions are satisfied only approximately; due to its simple structure, the linear discriminant tends not to overfit the training data set, which may lead to stronger generalization ability for unseen cases
Estimation of Probability Densities
• Basic Idea
In Bayesian classifier design, the a priori probabilities and the class conditional probability densities are usually unknown and must be estimated from the limited number of previously observed objects. This estimation should be optimal according to a well-defined estimation criterion.
• Estimates of a priori probabilities: P(ci) ≈ Ni / N, where Ni is the number of training patterns from class ci and N is the total number of training patterns
Estimation of Probability Densities
• Estimation of the class conditional probability densities p(x|ci)
– Parametric methods: assume a specific functional form of the probability density function
– Nonparametric methods: make no assumption about the functional form of the probability density function
– Semiparametric methods: a combination of parametric and nonparametric methods
Estimation of Probability Densities
• Parametric Methods
– A priori observations of objects give the corresponding set of patterns X = {x1, x2, …, xN}
– Split the set of all patterns X, according to class, into l disjoint sets X1, X2, …, Xl
– Assume that the parametric form of the class conditional probability density is given as a function p(x|ci) = p(x|θi),
where θi is a vector of unknown parameters
Estimation of Probability Densities
• Parametric Methods
– If the probability density has a normal (Gaussian) form:
p(x|θ) = (2π)^(-n/2) |Σ|^(-1/2) exp( -(1/2) (x - μ)^T Σ^(-1) (x - μ) ), where θ = (μ, Σ)
Estimation of Probability Densities
• The Maximum Likelihood Estimation of Parameters
– Assumptions
– we are given a limited-size set of N patterns: X = {x1, x2, …, xN}
– we know a parametric form p(x|θ) of the conditional probability density function
– Goal
– The task of estimation is to find the optimal (the best according to the used criterion) value of the parameter vector θ of a given dimension m.
Estimation of Probability Densities
• The Maximum Likelihood Estimation of Parameters
– Likelihood
– The joint probability density L(θ) = Π p(xi|θ), taken over i = 1, …, N, is a function of the parameter vector θ for a given set of patterns X.
– It is called the likelihood of θ for the given set of patterns X.
– Maximum Likelihood Estimation
The value of θ that maximizes L(θ) is chosen as the optimal estimate of θ; this is the maximum likelihood estimation of the parameters.
Estimation of Probability Densities
• The Maximum Likelihood Estimation of Parameters
– Equivalently, minimize the negative natural logarithm of the likelihood L(θ): E(θ) = -ln L(θ) = -Σ ln p(xi|θ)
– For a differentiable p(xi|θ), the optimal θ satisfies ∇θ E(θ) = -Σ ∇θ ln p(xi|θ) = 0
Estimation of Probability Densities
• The Maximum Likelihood Estimation of Parameters
– For the normal form of the probability density function N(μ, Σ), with the unknown parameters μ and Σ constituting the vector θ, the maximum likelihood estimates are
μ̂ = (1/N) Σ xi   and   Σ̂ = (1/N) Σ (xi - μ̂)(xi - μ̂)^T
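A short Python check of these maximum likelihood estimators on synthetic data (the true mean and covariance below are arbitrary choices):

```python
import numpy as np

def gaussian_mle(X):
    """Maximum likelihood estimates of a multivariate normal: sample mean and (1/N-normalized) covariance."""
    mu_hat = X.mean(axis=0)
    diff = X - mu_hat
    sigma_hat = diff.T @ diff / len(X)      # note 1/N (maximum likelihood), not 1/(N-1)
    return mu_hat, sigma_hat

rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.5], [0.5, 1.0]], size=5000)
mu_hat, sigma_hat = gaussian_mle(X)
print(mu_hat)       # close to [1.0, -2.0]
print(sigma_hat)    # close to the true covariance matrix
```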
Estimation of Probability Densities
• The Maximum Likelihood Estimation of Parameters
– Example of Maximum Likelihood Estimation
– for
– The maximum likelihood estimation criterion
– The maximum likelihood estimates for the parameters:
Estimation of Probability Densities
• Nonparametric Methods
“Nonparametric methods are more general methods of probability density estimation that are based on existing data but make no assumption about the functional form of the probability density function.”
– Nonparametric techniques:
– Histogram
– Kernel-based method
– k-nearest neighbors
– Nearest neighbors
Estimation of Probability Densities
• Nonparametric Methods
General Idea: Determine an estimate of the true probability density p(x) based on the available limited-size sample
– The probability that a new pattern x will fall inside a region R: P = ∫_R p(x') dx'
– Approximation of the probability for a small region R of volume V and for continuous p(x) with almost the same values within R: P ≈ p(x) V
Estimation of Probability Densities
• Nonparametric Methods
General Idea
– The probability that, for a set of N sample patterns, exactly k of them will fall in the region R is given by the binomial distribution: P(k) = (N choose k) P^k (1 - P)^(N-k)
– Estimate of the probability P: P ≈ k / N
– Approximation of the probability density function for a given pattern x: p(x) ≈ k / (N V)
Estimation of Probability Densities
• Nonparametric Methods
– Kernel-based Method and Parzen Window
– Kernel-based method is based on fixing around a
pattern vector x a region R (and thus a region volume V )
and counting a number k of given training patterns
falling in this region by using a special kernel function
associated with the region.
– Such a kernel function is also called a Parzen window
Estimation of Probability Densities
• Nonparametric Methods
– Hypercube-type Parzen window
Volume of a hypercube with edge length h in n dimensions: V = h^n
Kernel (window) function: φ(u) = 1 if |uj| ≤ 1/2 for every component j = 1, …, n, and 0 otherwise
Total number of patterns falling within the hypercube centered at x: k = Σ φ( (x - xi) / h ), summed over i = 1, …, N
The estimate of the probability density function: p̂(x) = (1/N) Σ (1/h^n) φ( (x - xi) / h )
Estimation of Probability Densities
• Nonparametric Methods
– Smooth estimate of the probability density function
– A kernel function must satisfy two conditions:
φ(u) ≥ 0  and  ∫ φ(u) du = 1
– For example, the radially symmetric multivariate Gaussian (normal) kernel:
φ(u) = (2π)^(-n/2) exp( -(1/2) u^T u )
– The estimate of the probability density function:
p̂(x) = (1/N) Σ (1/h^n) φ( (x - xi) / h )
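A minimal Python sketch of this kernel (Parzen window) estimate with the Gaussian kernel; the data and the smoothing parameter h are illustrative assumptions. The class-conditional estimate that follows is obtained by restricting the sum to the training patterns of one class:

```python
import numpy as np

def gaussian_kde(x, samples, h):
    """p_hat(x) = (1/N) * sum_i (1/h^n) * phi((x - x_i)/h) with the Gaussian kernel phi."""
    samples = np.asarray(samples, dtype=float)
    N, n = samples.shape
    u = (x - samples) / h
    phi = (2 * np.pi) ** (-n / 2) * np.exp(-0.5 * np.sum(u * u, axis=1))
    return phi.mean() / h**n

# Toy 1-D check: samples drawn from N(0, 1), density estimated at x = 0
rng = np.random.default_rng(2)
data = rng.normal(0.0, 1.0, size=(500, 1))
print(gaussian_kde(np.array([0.0]), data, h=0.3))   # should be near 1/sqrt(2*pi) ≈ 0.40
```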
Estimation of Probability Densities
• Nonparametric Methods
– Smooth estimate of the probability density function
– The estimate of the class-dependent probability density p(x|ck) uses only the Nk training patterns belonging to class ck:
p̂(x|ck) = (1/Nk) Σ_{xi ∈ ck} (1/h^n) φ( (x - xi) / h )
– The estimate of the class-dependent p(x|ck) probability density for the Gaussian kernel:
p̂(x|ck) = (1/Nk) Σ_{xi ∈ ck} (2π h^2)^(-n/2) exp( -||x - xi||^2 / (2 h^2) )
Estimation of Probability Densities
• Nonparametric Methods
– Design issues
– The selection of a kernel function:
Parzen window, Gaussian kernel, etc.
– The selection of a smoothing parameter
– The generalization ability of the kernel-based
density estimation depends on the training set
and on smoothing parameters
Estimation of Probability Densities
• Nonparametric Methods
– K-nearest Neighbors
“A method of probability density estimation with variable-size regions”
– First, a small n-dimensional sphere is placed in the pattern space, centered at the point x.
– Second, the radius of this sphere is extended until the sphere contains exactly the fixed number k of patterns from the given training set.
– Then an estimate of the probability density for x is computed as p̂(x) = k / (N V(x)), where V(x) is the volume of the resulting sphere.
Estimation of Probability Densities
• Nonparametric Methods
– K-nearest Neighbors Classification Rule
– First, for a given x, find its k nearest neighbors in the training set (regardless of class label), based on a defined pattern distance measure.
– Second, among the selected k nearest neighbors, count the numbers ni of patterns belonging to each class ci.
– Then, the predicted class cj assigned to x is the class for which nj is the largest.
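A small Python sketch of the k-nearest neighbors classification rule with Euclidean distance on synthetic two-class data (k and the data are illustrative):

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    """Classify x by majority vote among its k nearest training patterns (Euclidean distance)."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(4, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
print(knn_classify(np.array([3.5, 3.5]), X, y, k=5))   # expected: 1
```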
Estimation of Probability Densities
• Nonparametric Methods
– Nearest Neighbor Classification Rule
“The simplest version of the k-nearest neighbors classification uses a number of neighbors k equal to one”
– Algorithm
Given: A training set Ttra of N patterns x1, x2, …, xN labeled by l classes. A new pattern x.
• Find for the given x its nearest neighbor xj in the whole training set, based on the defined pattern distance measure distance(x, xi).
• Assign to x the class cj of its nearest neighbor xj.
Estimation of Probability Densities
• Semiparametric Methods
“Combination of parametric and nonparametric methods”
– Two semiparametric methods
– Functional approximation
– Mixture models (mixtures of probability densities)
– Major advantage: component functions can be fitted precisely and locally to specific regions of the feature space, based on what the existing data reveal about the probability distribution and its modes
Estimation of Probability Densities
• Semiparametric Methods
– Functional Approximation
– Approximation of the density by a linear combination of m basis functions φi(x): p̂(x) = Σ wi φi(x), summed over i = 1, …, m
– Using a symmetric radial basis function: φi(x) = φ( ||x - ci|| ), centered at ci
Estimation of Probability Densities
• Semiparametric Methods
– Functional Approximation
– Gaussian radial basis function, the most commonly used basis function: φi(x) = exp( -||x - ci||^2 / (2 σi^2) )
– Optimization criterion for the functional approximation of density: minimize the integrated squared error ∫ ( p(x) - p̂(x) )^2 dx
– Optimal estimates for the parameters (for orthonormal basis functions): ŵi = (1/N) Σ φi(xk), summed over the N training patterns xk
Estimation of Probability Densities
• Semiparametric Methods
– The algorithm for functional approximation
Given: A training set Ttra of N patterns x1, x2, …, xN. The m orthonormal radial basis functions φi(x) (i = 1, 2, …, m), along with their parameters.
1) Compute the estimates of the unknown parameters: ŵi = (1/N) Σ φi(xk)
2) Form the model of the probability density as the functional approximation p̂(x) = Σ ŵi φi(x)
Estimation of Probability Densities
• Semiparametric Methods
– Mixture Models (Mixtures of Probability Densities)
“These models are based on a linear parametric combination of known probability density functions (for example, normal densities) localized in certain regions of the data”
– The linear mixture distribution: p(x) = Σ p(x|j) P(j), summed over the m components j, with mixing coefficients P(j) ≥ 0 and Σ P(j) = 1
– Simplified version: a mixture of Gaussian components, p(x) = Σ P(j) N(x; μj, Σj)
Estimation of Probability Densities
• Distance Between Probability Densities and the Kullback-Leibler Distance
– Distance
“We can define a distance between two densities: the true density p(x) and its approximate estimate p̂(x)”
– Kullback-Leibler distance: D_KL( p, p̂ ) = ∫ p(x) ln( p(x) / p̂(x) ) dx
Probabilistic Neural Network
• Probabilistic Neural Network
“The PNN is a hardware implementation of the kernel-based method of density estimation and of Bayesian optimal classification (providing minimization of the average probability of the classification error)”
– Optimal Bayes’ classification rule: assign x to the class cj that maximizes p̂(x|cj) P(cj)
– Kernel-based estimation of the class conditional probability density functions, as in the previous section
Probabilistic Neural Network
• Topology
Probabilistic Neural Network
• Details
– An input layer (weightless) consists of n neurons (units),
each receiving one element xi (i = 1,2,…, n) of the n-
dimensional input pattern vector x.
– A pattern layer consists of N neurons (units, nodes), each
representing one reference pattern from the training set Ttra .
– The transfer function of a pattern-layer neuron implements a kernel function (a Parzen window)
Probabilistic Neural Network
• Details– The weightless second hidden layer is the summation layer.
The number of neurons in the summation layer is equal to the number of classes l.
– A summation-layer neuron adds the outputs of the pattern-layer neurons belonging to its class; its output activation function is generally a plain sum, but it may be modified for different kernel functions.
– The output layer is the classification decision layer that implements Bayes’ classification rule by selecting the largest value and thus decides a class cj for the pattern x
Probabilistic Neural Network
• Pattern Processing “Processing of patterns by the already-designed PNN network
is performed in the feedforward manner. The input pattern is presented to the network and processed forward by each layer. The resulting output is the predicted class”
• PNN with the Radial Gaussian Kernel
– Kernel function: φ( (x - xi) / σ ) = exp( -||x - xi||^2 / (2 σ^2) )
– Transfer function of a pattern-layer neuron: the kernel value for its stored reference pattern xi
– Output activation function of a summation-layer neuron: the (scaled) sum of the kernel values over the reference patterns of its class
Probabilistic Neural Network
• PNN with the Radial Gaussian Kernel and Normalized Patterns
– Transfer function: for patterns normalized to unit length, ||x - xi||^2 = 2 (1 - x^T xi), so the pattern-neuron computation splits into a weighted sum neti = x^T wi (with input weights wi = xi) followed by an exponential output activation function
– Normalization of patterns therefore allows a simpler architecture of the pattern-layer neurons, containing here also input weights and an exponential output activation function
– The pattern-neuron output activation function: f(neti) = exp( (neti - 1) / σ^2 )
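A functional (software, not hardware) Python sketch of the PNN decision: a Gaussian-kernel density estimate per class combined with priors estimated from class frequencies. The data and the smoothing parameter σ are illustrative assumptions:

```python
import numpy as np

def pnn_classify(x, X_train, y_train, sigma=0.5):
    """PNN-style decision: kernel density estimate per class times the class prior, pick the largest."""
    scores = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]
        kernels = np.exp(-np.sum((Xc - x) ** 2, axis=1) / (2 * sigma ** 2))   # pattern layer
        scores[c] = kernels.mean() * (len(Xc) / len(X_train))                 # summation layer * prior
    return max(scores, key=scores.get)                                        # output (decision) layer

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (40, 2)), rng.normal(3, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
print(pnn_classify(np.array([2.8, 2.5]), X, y))   # expected: 1
```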
Constraints in Classifier Design
• Problems
– Will a classifier guarantee minimization of the average
probability of the classification error?
– Does a training set well represent patterns generated by a
physical phenomenon?
– Are the patterns drawn according to the probability density characteristic of the underlying phenomenon?
– Is the average probability of a classification error difficult to calculate?
Constraints in Classifier Design
• Suboptimal solutions of Bayesian classifier design
– The estimation of class conditional probabilities is based on
a limited sample
– The samples are frequently collected randomly, and not by
use of a well-planned experimental procedure
REGRESSION
• Data Models
• Simple Linear Regression Analysis
• Multiple Regression
• General Least Squares and Multiple Regression
• Assessing the Quality of the Multiple Regression
Model
Data Models
• Mathematical models
“They are useful approximate representations of the phenomena that generate data and may be used for prediction, classification, compression, or control design.”
• Black-box models
– Mathematical models obtained by processing existing data, without using the laws of physics governing the data-generating phenomena
• Regression analysis
– Data analysis and model design based on a sample from a given population
Data Models
• Categories of regression models
– Simple linear regression
– Multiple linear regression
– Neural network-based linear regression
– Polynomial regression
– Logistic regression
– Log-linear regression
– Local piecewise linear regression
– Nonlinear regression (with a nonlinear model)
– Neural network-based nonlinear regression
Data Models
• Static and dynamic models
– A static model produces outcomes based only on
the current input (no internal memory).
– A dynamic model produces outcomes based on
the current input and the past history of the model
behavior (internal memory)
Data Models
• Data gathering
– A random sample from a certain population
– N pairs (xi, yi), i = 1, 2, …, N, forming the experimental data set named Torig
Data Models
• Regression analysis “A statistical method used to discover the relationship between
variables and to design a data model that can be used to predict variable values based on other variables”
Data Models• Regression analysis
– A simple linear regression
– finds the linear relationship between two variables, x and y, and discovers a linear model, i.e., a line equation y = b + ax, which best fits the given data and can be used to predict values of y
– This modeling line is called the regression line of y on x– The equation of that line is called a regression equation
(regression model)
– Typical linear regression analysis provides a prediction of a dependent variable y based on an independent variable x
Data Models
• Visualization of Regression
– Scatter plot for height versus weight data
Data Models
• Visualization of Regression
– Scatter plot for height versus weight data
Simple Linear Regression Analysis
• Sample data and regression model
Simple Linear Regression Analysis
• Assumptions– The observations yi (i = 1, …, N) are random samples and are
mutually independent. – The regression error terms (the difference between the
predicted value and the true value) are also mutually independent, with the same distribution (normal distribution with zero mean) and constant variances
– The distribution of the error term is independent of the joint distribution of explanatory variables. It is also assumed that unknown parameters of regression models are constants
Simple Linear Regression Analysis
• Steps of Simple Linear Regression Analysis
– Evaluation of basic statistical characteristics of the data
– Estimation of the optimal parameters of a linear model
– Assessment of model quality and of the generalization ability to predict the outcome for new data
Simple Linear Regression Analysis
• Model Structure
– Nonlinear data: in general the data may follow y = f(x) + e, where the function f(x) can be nonlinear in x
– Linear form: y = b + a x + e
Simple Linear Regression Analysis
• Regression Error (residual error)
– The difference between the real value yi and the predicted value yi,est: ei = yi - yi,est
Simple Linear Regression Analysis
• Performance Criterion – Sum of Squared Errors
– The sum-of-squared-errors performance criterion (used in the same form for multiple regression): J = Σ (yi - yi,est)^2, summed over the N data points
– The minimization technique that uses this criterion is the method of least squares of errors (LSE) or, in short, the method of least squares
Simple Linear Regression Analysis
• Basic Statistical Characteristics of Data
– The mean of N samples: x̄ = (1/N) Σ xi, ȳ = (1/N) Σ yi
– The variance: Sx^2 = (1/(N-1)) Σ (xi - x̄)^2, Sy^2 = (1/(N-1)) Σ (yi - ȳ)^2
– The covariance: Sxy = (1/(N-1)) Σ (xi - x̄)(yi - ȳ)
Simple Linear Regression Analysis
• Sum of Squared Variations in y Caused by the Regression Model
– The total sum of squared variations in y: SST = Σ (yi - ȳ)^2; it splits into the variation explained by the regression, SSR = Σ (yi,est - ȳ)^2, and the unexplained (residual) variation, SSE = Σ (yi - yi,est)^2, with SST = SSR + SSE
– These formulas are used to define important regression measures (for example, the correlation coefficient)
Simple Linear Regression Analysis
• Computing Optimal Values of the Regression Model Parameters
– The optimal model parameter values have to be computed based on the given data set and the defined performance criterion
– Methods for estimation of optimal model parameter values:
– the analytical offline method
– the analytical recursive method
– iterative search for the optimal model parameters
– neural network-based regression
Simple Linear Regression Analysis
• Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model
– The general linear model structure: yest = b + a x
– The performance criterion: J(a, b) = Σ (yi - (b + a xi))^2,
with its performance curve illustrated for the special case y = a x (a model with b = 0)
Simple Linear Regression Analysis
• Simple Linear Regression Analysis, Linear Least Squares, and Design of a Model
– The optimal parameters: a = Σ (xi - x̄)(yi - ȳ) / Σ (xi - x̄)^2 = Sxy / Sx^2,  b = ȳ - a x̄
Simple Linear Regression Analysis
• Procedure for simple linear regression Given: The number N of experimental observations, and the
set of the N experimental data points { (xi, yi), i = 1, 2, …, N }
1) Compute the statistical characteristics of the data
2) Compute the estimates of the model optimal parameters using Equations
3) Assess the regression model quality, indicating how well the model fits the data. Compute:
a) Standard error of estimate
b) Correlation coefficient r
c) Coefficient of determination r2
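A minimal Python sketch of this procedure, computing the optimal parameters, the correlation coefficient r, and r^2 directly from the formulas above (the five data points are illustrative, not the chapter's four-point example):

```python
import numpy as np

def simple_linear_regression(x, y):
    """Least-squares estimates of y = b + a*x, plus the correlation coefficient r and r^2."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx, dy = x - x.mean(), y - y.mean()
    a = np.sum(dx * dy) / np.sum(dx ** 2)
    b = y.mean() - a * x.mean()
    r = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
    return a, b, r, r ** 2

x = [1.0, 2.0, 3.0, 4.0, 5.0]          # illustrative data
y = [1.4, 2.1, 2.5, 3.2, 3.6]
a, b, r, r2 = simple_linear_regression(x, y)
print(f"y = {b:.3f} + {a:.3f} x,  r = {r:.3f},  r^2 = {r2:.3f}")
```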
Simple Linear Regression Analysis
Example
– Sample of four data points
– Resulting regression line
y = 0.9 + 0.56x
Simple Linear Regression Analysis
• Optimal Parameter Values in the Minimum Least Squares Sense
– Required conditions for a valid linear regression
– The error term e = y - (b + ax) is normally distributed
– The error variance is the same for all values of x
– Errors are independent of each other.
Simple Linear Regression Analysis• Quality of the Linear Regression Model and
Linear Correlation Analysis
– Assessment of model quality
– The resulting correlation coefficient can be used as a measure of how well the trends predicted by the model follow the trends in the training data
– The coefficient of determination can be used to measure
how well the regression line fits the data points
Simple Linear Regression Analysis
• Correlation coefficient: r = Sxy / (Sx Sy)
• Coefficient of determination r^2
– the percent of variation in the dependent variable y that can be explained by the regression equation,
– the explained variation in y divided by the total variation, or
– the square of r (the correlation coefficient)
Simple Linear Regression Analysis
• Coefficient of determination
– Explained and unexplained variation in y
Simple Linear Regression Analysis
• Coefficient of determination
– Example
If the coefficient of correlation has the value r = 0.9327, then the value of the coefficient of determination is r^2 = 0.8700. This means that 87% of the total variation in y can be explained by the linear relationship between x and y, as described by the optimal regression model of the data; the remaining 13% of the total variation in y remains unexplained.
– The coefficient of determination is calculated as r^2 = SSR / SST (explained variation divided by total variation)
Simple Linear Regression Analysis
• Matrix Version of Simple Linear Regression Based on the Least Squares Method
– The matrix form of the model description (the estimate of y) for all N experimental data points: yest = X θ, where X is the N × 2 matrix whose i-th row is [1, xi] and θ = [b, a]^T
– The regression error: e = y - X θ
Simple Linear Regression Analysis
• Matrix Version of Simple Linear Regression Based on the Least Squares Method
– The performance criterion: J(θ) = (y - X θ)^T (y - X θ)
– Optimal parameters: θ̂ = (X^T X)^(-1) X^T y
– The value of the criterion for the optimal parameter vector: J(θ̂) = (y - X θ̂)^T (y - X θ̂)
– The regression error for the model with the optimal parameter vector: ê = y - X θ̂
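A short numpy sketch of the matrix formulation; the data values are illustrative. np.linalg.lstsq is used instead of forming (X^T X)^(-1) explicitly, which anticipates the numerical-stability point made later for general least squares:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])            # illustrative data, not the chapter's table
y = np.array([1.5, 2.0, 2.6, 3.1])

X = np.column_stack([np.ones_like(x), x])     # i-th row is [1, x_i], theta = [b, a]^T
theta, *_ = np.linalg.lstsq(X, y, rcond=None) # least-squares solution of y ~ X theta
b, a = theta
residuals = y - X @ theta

print(f"y = {b:.3f} + {a:.3f} x")
print("criterion value J:", float(residuals @ residuals))
```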
Simple Linear Regression Analysis
• Matrix Version of Simple Linear Regression Based on the Least Squares Method
– Example: let us consider again the dataset shown in the
following table
y = 0.56x + 0.9
Multiple Regression
• Definition
Multiple regression analysis is the statistical technique of exploring the relation (association) between a set of n independent variables and one (or, generally, several) dependent variable y, whose variability the independent variables are used to explain
– Linear multiple regression model: y = a0 + a1 x1 + a2 x2 + … + an xn + e
– Linear multiple regression model using vector notation: y = a0 + a^T x + e, with a = [a1, a2, …, an]^T
– This regression model is represented by a hyperplane in (n + 1)-dimensional space.
Multiple Regression
• Geometrical Interpretation: Regression Errors
The goal of multiple regression is to find a hyperplane in the (n + 1)-dimensional space that will best fit the data
– The performance criterion: SSE = Σ (yi - yi,est)^2, summed over the N data points
– The error variance and standard error of the estimate: s^2 = SSE / (N - n - 1), s = sqrt( SSE / (N - n - 1) )
Multiple Regression
• Degrees of Freedom
– The denominator N – n – 1 in the previous equation tells us
that in multiple regression with n independent variables, the
standard error has N – n – 1 degrees of freedom
– The degree of freedom has been reduced from N by n + 1
because n + 1 numerical parameters a0, a1, a2, …, an of the
regression model have been estimated from the data
General Least Squares and Multiple Regression
• General model description in function form
– Data model: yest = Σ θj fj(x), a linear combination of m known basis functions fj(x) of the independent variables
– Performance criterion: J(θ) = Σ (yi - yi,est)^2, summed over the N data points
– Regression error: ei = yi - yi,est
General Least Squares and Multiple Regression
• General model description in matrix form
– Data model: yest = X θ, where the i-th row of X contains the values f1(xi), f2(xi), …, fm(xi)
– Performance criterion: J(θ) = (y - X θ)^T (y - X θ)
– Optimal parameters: θ̂ = (X^T X)^(-1) X^T y
General Least Squares and Multiple Regression
• Practical, Numerically Stable Computation of the Optimal Model Parameters
– Problem
“In practice, the optimal least-squares parameters are almost never computed directly from the equation θ̂ = (X^T X)^(-1) X^T y, because of its poor numerical performance in cases when the matrix X^T X (the covariance matrix) is ill conditioned”
– Solution: various matrix decomposition methods (for example, QR or singular value decomposition)
Assessing the Quality of the Multiple Regression Model
• The Coefficient of Multiple Determination, R^2
“The percent of the variance in the dependent variable that can be explained by all of the independent variables taken together.”
– Adjusted R^2
Adjusted R^2 corrects the statistic for the number of design parameters used in the model (the n slope coefficients plus a constant) and for the number of data points N, penalizing model structures that include unnecessary parameters:
R^2_adj = 1 - (1 - R^2) (N - 1) / (N - n - 1)
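A small Python helper that computes R^2 and adjusted R^2 for a fitted multiple regression model; the two-predictor data and the fit are illustrative assumptions:

```python
import numpy as np

def r2_scores(y, y_est, n_predictors):
    """Coefficient of multiple determination R^2 and adjusted R^2 for N observations, n predictors."""
    y, y_est = np.asarray(y, float), np.asarray(y_est, float)
    N = len(y)
    sse = np.sum((y - y_est) ** 2)              # unexplained variation
    sst = np.sum((y - y.mean()) ** 2)           # total variation
    r2 = 1.0 - sse / sst
    r2_adj = 1.0 - (1.0 - r2) * (N - 1) / (N - n_predictors - 1)
    return r2, r2_adj

# Illustrative two-predictor fit (columns: intercept, x1, x2)
X = np.array([[1, 1.0, 2.0], [1, 2.0, 1.0], [1, 3.0, 4.0], [1, 4.0, 3.0], [1, 5.0, 5.0]])
y = np.array([4.9, 6.1, 9.2, 10.1, 12.8])
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(r2_scores(y, X @ theta, n_predictors=2))
```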
Assessing the Quality of the Multiple Regression Model
• Cp Statistic
– It is used to compare alternative multiple regression models: Cp = SSEn / MSEfull - (N - 2 (n + 1)), where SSEn is the error sum of squares of a candidate model with n independent variables and MSEfull is the mean squared error of the full model containing all candidate variables
– When comparing alternative regression models, the designer aims to choose models whose value of Cp is close to or below (n + 1)
Assessing the Quality of the Multiple Regression Model
• Multiple Correlation
– The value of R can be found as the positive square root of R^2 (the coefficient of multiple determination)
– It is a measure of the strength of the linear relationship between the dependent variable y and the set of independent variables x1, x2, …, xn.
– A value of R close to 1 indicates that the fit is very good
– A value near zero indicates that the model is not a good approximation of the data and cannot be efficiently used for prediction
Assessing the Quality of the Multiple Regression Model
Example “Let us consider a multiple linear regression analysis for the
data set containing N = 4 cases, composed with one dependent variable y and two independent variables x1 and x2”
– Three-dimensional data
Assessing the Quality of the Multiple Regression Model
Example – The scatter plot of data points in three-dimensional space
(x1, x2, y)
Assessing the Quality of the Multiple Regression Model
Example
– The data matrix
– The optimal model parameters
Assessing the Quality of the Multiple Regression Model
Example
– The optimal model: y = 3.1 + 0.9 x1 + 0.56 x2
– The optimal regression model in (x1, x2, y) space:
Assessing the Quality of the Multiple Regression Model
Example
– Multiple regression: regression plane model and scatter plot
Assessing the Quality of the Multiple Regression Model
Example
– The residuals (errors)
– The criterion value for the optimal parameters: 0.016