Methods in Medical Image Analysis
Statistics of Pattern Recognition: Classification and Clustering
Some content provided by Milos Hauskrecht, University of Pittsburgh Computer Science
ITK Questions?
Classification
Features
• Loosely stated, a feature is a value describing something about your data points (e.g., for pixels: intensity, local gradient, distance from a landmark, etc.)
• Multiple (n) features are put together to form a feature vector, which defines a data point’s location in n-dimensional feature space
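As a concrete illustration, here is a minimal Python/NumPy sketch of building a per-pixel feature vector; the random image and the landmark position are made-up placeholder inputs:

import numpy as np

# Placeholder inputs: a made-up intensity image and a hypothetical landmark.
image = np.random.rand(64, 64)
landmark = np.array([32.0, 32.0])

# Feature 1: raw intensity. Feature 2: local gradient magnitude.
gy, gx = np.gradient(image)
grad_mag = np.hypot(gx, gy)

# Feature 3: distance of each pixel from the landmark.
rows, cols = np.indices(image.shape)
dist = np.hypot(rows - landmark[0], cols - landmark[1])

# Each pixel becomes a point in n = 3 dimensional feature space.
features = np.stack([image.ravel(), grad_mag.ravel(), dist.ravel()], axis=1)
print(features.shape)  # (4096, 3): 4096 pixels, 3 features each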
Feature Space
• Feature Space – The theoretical n-dimensional space occupied by n input raster objects (features).
– Each feature represents one dimension, and its values represent positions along one of the orthogonal coordinate axes in feature space.
– The set of feature values belonging to a data point defines a vector in feature space.
Statistical Notation
• Class probability distribution:
p(x,y) = p(x | y) p(y)
x: feature vector – {x1,x2,x3…,xn}
y: class
p(x | y): probability of x given y
p(x,y): probability of both x and y
Example: Binary Classification
• Two class-conditional distributions:
p(x | y = 0) p(x | y = 1)
• Priors:
p(y = 0) + p(y = 1) = 1
Modeling Class Densities
• In the text, they choose to concentrate on methods that use Gaussians to model class densities
Generative Approach to Classification
1. Represent and learn the distribution:
p(x,y)
2. Use it to define probabilistic discriminant functions, e.g.
g0(x) = p(y = 0 | x)
g1(x) = p(y = 1 | x)
Generative Approach to Classification
Typical model:
p(x,y) = p(x | y) p(y)
p(x | y) = Class-conditional distributions (densities)
p(y) = Priors of classes (probability of class y)
We Want:
p(y | x) = Posteriors of classes
Class Modeling
• We model the class distributions as multivariate Gaussians
x ~ N(μ0, Σ0) for y = 0
x ~ N(μ1, Σ1) for y = 1
• Priors are based on training data, or a distribution can be chosen that is expected to fit the data well (e.g. Bernoulli distribution for a coin flip)
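A minimal sketch of estimating these model parameters from labeled training data; the two-class 2-D data below are made up, and the priors are simply the class frequencies:

import numpy as np

# Made-up training set: 100 points per class in 2-D feature space.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (100, 2)),   # class y = 0
               rng.normal(3.0, 1.0, (100, 2))])  # class y = 1
y = np.array([0] * 100 + [1] * 100)

params = {}
for c in (0, 1):
    Xc = X[y == c]
    params[c] = {
        "mu": Xc.mean(axis=0),              # mean vector of class c
        "Sigma": np.cov(Xc, rowvar=False),  # covariance matrix of class c
        "prior": len(Xc) / len(X),          # p(y = c) from class counts
    }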
Making a class decision
• We need to define discriminant functions ( gn(x) )
• We have two basic choices:
– Likelihood of data – choose the class (Gaussian) that best explains the input data x, i.e. the i maximizing p(x | y = i)
– Posterior of class – choose the class with the better posterior probability, i.e. the i maximizing p(y = i | x)
Calculating Posteriors
• Use Bayes’ Rule:
p(A | B) = p(B | A) p(A) / p(B)
• In this case:
p(y | x) = p(x | y) p(y) / p(x)
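A sketch of this posterior computation for the two-Gaussian model, using scipy.stats.multivariate_normal for the class-conditional densities; the means, covariances, and priors below are illustrative values only:

import numpy as np
from scipy.stats import multivariate_normal

# Illustrative model parameters (would normally be learned from training data).
mu = {0: np.array([0.0, 0.0]), 1: np.array([3.0, 3.0])}
Sigma = {0: np.eye(2), 1: np.eye(2)}
prior = {0: 0.5, 1: 0.5}

def posterior(x, c):
    # Bayes' rule: p(y = c | x) = p(x | y = c) p(y = c) / p(x),
    # where p(x) = sum over classes of p(x | y) p(y).
    lik = {i: multivariate_normal.pdf(x, mu[i], Sigma[i]) for i in (0, 1)}
    return lik[c] * prior[c] / sum(lik[i] * prior[i] for i in (0, 1))

x = np.array([1.0, 1.2])
print(posterior(x, 1))  # choose class 1 if this exceeds 0.5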
Linear Decision Boundary
• When covariances are the same
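One way to see this: in the log posterior ratio of the two classes, the shared quadratic term cancels when Σ0 = Σ1 = Σ, leaving

log [ p(y = 1 | x) / p(y = 0 | x) ] = (μ1 − μ0)^T Σ^-1 x + b

where b = −(1/2) μ1^T Σ^-1 μ1 + (1/2) μ0^T Σ^-1 μ0 + log [ p(y = 1) / p(y = 0) ]. This is linear in x, so setting it to zero gives a hyperplane. When Σ0 ≠ Σ1, the term (1/2) x^T (Σ0^-1 − Σ1^-1) x survives, which is what makes the boundary quadratic in the next slides.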
Quadratic Decision Boundary
• When covariances are different
Clustering
• Basic Clustering Problem:
– Distribute data into k different groups such that data points similar to each other are in the same group
– Similarity between points is defined in terms of some distance metric
• Clustering is useful for:
– Similarity/Dissimilarity analysis
• Analyze which data points in the sample are close to each other
– Dimensionality Reduction
• High-dimensional data replaced with a group (cluster) label
Distance Metrics
• Euclidean Distance, in some space (for our purposes, probably a feature space)
• Must fulfill three properties:
– Non-negativity: d(x, y) ≥ 0, with d(x, y) = 0 if and only if x = y
– Symmetry: d(x, y) = d(y, x)
– Triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)
Distance Metrics
• Common simple metrics:
– Euclidean: d(x, y) = √( Σi (xi − yi)² )
– Manhattan: d(x, y) = Σi |xi − yi|
• Both work for an arbitrary k-dimensional space
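Both metrics are one-liners in NumPy; a small sketch with made-up points:

import numpy as np

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return np.sum(np.abs(a - b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])
print(euclidean(a, b), manhattan(a, b))  # 3.6055..., 5.0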
Clustering Algorithms
• k-Nearest Neighbor
• k-Means
• Parzen Windows
k-Nearest Neighbor
• In essence, a classifier
• Requires input parameter k
– In this algorithm, k indicates the number of neighboring points to take into account when classifying a data point
• Requires training data
k-Nearest Neighbor Algorithm
• For each data point xn, choose its class by finding the most prominent class among the k nearest data points in the training set
• Use any distance measure (usually a Euclidean distance measure)
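A minimal NumPy sketch of this procedure, with made-up Gaussian training data; ties among classes are broken arbitrarily by argmax:

import numpy as np

def knn_classify(x, X_train, y_train, k=5):
    # Euclidean distance from the query point to every training point.
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:k]  # indices of the k closest training points
    # The most prominent class among the k neighbors wins.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Made-up training set: two Gaussian blobs labeled 0 and 1.
rng = np.random.default_rng(1)
X_train = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y_train = np.array([0] * 20 + [1] * 20)
print(knn_classify(np.array([2.5, 2.5]), X_train, y_train, k=5))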
k-Nearest Neighbor Algorithm
[Figure: positive (+) and negative (−) examples around a query point q1. 1-nearest neighbor: q1 takes the class of its nearest example e1. 5-nearest neighbors: q1 is classified as negative.]
k-Nearest Neighbor
• Advantages:
– Simple
– General (can work for any distance measure you want)
• Disadvantages:
– Requires well-classified training data
– Can be sensitive to the k value chosen
– All attributes are used in classification, even ones that may be irrelevant
– Inductive bias: we assume that a data point should be classified the same as points near it
k-Means
• Suitable only when data points have continuous values
• Groups are defined in terms of cluster centers (means)
• Requires input parameter k
– In this algorithm, k indicates the number of clusters to be created
• Guaranteed to converge to at least a local optimum
k-Means Algorithm
• Algorithm (a code sketch follows below):
1. Randomly initialize k mean values
2. Repeat the next two steps until there is no change in the means:
a. Partition the data using a similarity measure according to the current means
b. Move the means to the center of the data in the current partition
3. Stop when the means no longer change
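A minimal NumPy sketch of these steps, on made-up 2-D data; for simplicity the initial means are drawn from the data, and empty clusters are not handled:

import numpy as np

def kmeans(X, k, rng=None):
    rng = np.random.default_rng(0) if rng is None else rng
    # 1. Randomly initialize the k means by picking k data points.
    means = X[rng.choice(len(X), size=k, replace=False)]
    while True:
        # 2a. Partition: assign each point to its nearest mean (Euclidean).
        d = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # 2b. Move each mean to the center of its current partition.
        new_means = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        # 3. Stop when the means no longer change.
        if np.allclose(new_means, means):
            return means, labels
        means = new_means

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
means, labels = kmeans(X, k=2)
print(means)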
k-Means
• Advantages:
– Simple
– General (can work for any distance measure you want)
– Requires no training phase
• Disadvantages:
– Result is very sensitive to initial mean placement
– Can perform poorly on overlapping regions
– Doesn’t work on features with non-continuous values (can’t compute cluster means)
– Inductive bias: we assume that a data point should be classified the same as points near it
Parzen Windows
• Similar to k-Nearest Neighbor, but instead of using the k closest training data points, it uses all points within a kernel (window), weighting their contribution to the classification based on the kernel
• As with our classification algorithms, we will consider a Gaussian kernel as the window
Parzen Windows
• Assume a region defined by a d-dimensional Gaussian of scale σ
• We can define a window density function:
• Note that we consider all points in the training set, but if a point is outside of the kernel, its weight will be 0, negating its influence
p(x, σ) = (1/S) Σj=1..S G(x − xj, σ²)
where S is the number of training points and G(·, σ²) is a Gaussian kernel of scale σ
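A direct transcription of this density in NumPy, assuming an isotropic d-dimensional Gaussian kernel and made-up training data (for classification, one such density would be estimated per class):

import numpy as np

def parzen_density(x, X_train, sigma):
    # Average a Gaussian kernel of scale sigma over all S training points.
    d = X_train.shape[1]
    sq = ((X_train - x) ** 2).sum(axis=1)       # squared distances to x
    norm = (2 * np.pi * sigma ** 2) ** (d / 2)  # Gaussian normalizing constant
    return np.exp(-sq / (2 * sigma ** 2)).sum() / (norm * len(X_train))

rng = np.random.default_rng(3)
X_train = rng.normal(0, 1, (200, 2))  # made-up training points
print(parzen_density(np.array([0.0, 0.0]), X_train, sigma=0.5))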
Parzen Windows
• Advantages:
– More robust than k-Nearest Neighbor
– Excellent accuracy and consistency
• Disadvantages:
– How to choose the size of the window?
– Alone, kernel density estimation techniques provide little insight into data or problems