![Page 1: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/1.jpg)
DATA MINING: from data to information
Ronald Westra, Dep. Mathematics, Maastricht University
![Page 2: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/2.jpg)
CLUSTERING AND CLUSTER ANALYSIS
Data Mining Lecture IV [Chapter 8, section 8.4, and Chapter 9 of Principles of Data Mining by Hand, Mannila, Smyth]
![Page 3: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/3.jpg)
1. Clustering versus Classification
• classification: give a pre-determined label to a sample
• clustering: provide the relevant labels for classification from structure in
a given dataset
• clustering: maximal intra-cluster similarity and maximal inter-cluster dissimilarity
• Objectives: - 1. segmentation of space
- 2. find natural subclasses
![Page 4: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/4.jpg)
Examples of Clustering
and Classification
1. Computer Vision
![Page 5: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/5.jpg)
Examples of Clustering
and Classification: 1. Computer Vision
![Page 6: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/6.jpg)
Example of Clustering
and Classification: 1. Computer Vision
![Page 7: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/7.jpg)
Examples of Clustering and Classification:
2. Types of chemical reactions
![Page 8: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/8.jpg)
Examples of Clustering and Classification:
2. Types of chemical reactions
![Page 9: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/9.jpg)
Voronoi Clustering
Georgy Fedoseevich Voronoy
1868 - 1908
![Page 10: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/10.jpg)
Voronoi Clustering
A Voronoi diagram (also called a Voronoi tessellation, Voronoi decomposition, Dirichlet tessellation), is a special kind of decomposition of a metric space determined by distances to a specified discrete set of objects in the space, e.g., by a discrete set of points.
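The definition above can be made concrete with a minimal sketch: a point belongs to the Voronoi cell of the site it is closest to. The Euclidean metric, the function name `nearest_site` and the example coordinates are assumptions chosen for illustration, not taken from the slides.

```python
import math

def nearest_site(point, sites):
    """Return the index of the site closest to `point` (Euclidean distance)."""
    return min(range(len(sites)), key=lambda k: math.dist(point, sites[k]))

# Hypothetical generating points (sites) and query points.
sites = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
queries = [(1.0, 1.0), (3.0, 0.5), (0.5, 3.5)]

# Each query point falls in the Voronoi cell of its nearest site.
cells = [nearest_site(q, sites) for q in queries]
print(cells)  # [0, 1, 2]
```

Labelling every point of the plane this way yields the Voronoi decomposition; a different metric would give differently shaped cells.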
![Page 11: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/11.jpg)
Voronoi Clustering
![Page 12: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/12.jpg)
Voronoi Clustering
![Page 13: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/13.jpg)
Voronoi Clustering
![Page 14: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/14.jpg)
Voronoi Clustering
![Page 15: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/15.jpg)
Voronoi Clustering
![Page 16: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/16.jpg)
Voronoi Clustering
![Page 17: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/17.jpg)
Voronoi Clustering
![Page 18: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/18.jpg)
Partitional Clustering [book section 9.4]
score-functions
centroid
intra-cluster distance
inter-cluster distance
C-means [book page 303]
![Page 19: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/19.jpg)
k-means clustering (also: C-means)
The k-means algorithm assigns each point to the cluster whose center (also called the centroid) is nearest. The center is the average of all points in the cluster, i.e. each of its coordinates is the arithmetic mean of that coordinate over all points in the cluster.
![Page 20: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/20.jpg)
k-means clustering (also: C-means)
Example: The data set has three dimensions and the cluster has two points: X = (x1, x2, x3) and Y = (y1, y2, y3).
Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1 + y1)/2 and z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2
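As a small check of this example, the coordinate-wise mean can be computed directly. The helper name `centroid` and the numeric values of X and Y are illustrative assumptions.

```python
def centroid(points):
    """Coordinate-wise arithmetic mean of a list of equal-length tuples."""
    n = len(points)
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / n for d in range(dims))

# Hypothetical values for the two points of the cluster.
X = (1.0, 2.0, 3.0)
Y = (3.0, 6.0, 9.0)

Z = centroid([X, Y])
print(Z)  # (2.0, 4.0, 6.0)
```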
![Page 21: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/21.jpg)
k-means clustering (also: C-means)
This is the basic structure of the algorithm (J. MacQueen, 1967):
• Randomly generate k clusters and determine the cluster centers, or directly generate k seed points as cluster centers.
• Assign each point to the nearest cluster center.
• Recompute the new cluster centers.
• Repeat until some convergence criterion is met (usually that the assignment hasn't changed).
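The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the lecture's own implementation: the seed points are passed in explicitly (instead of being drawn at random) so that the run is reproducible, distances are Euclidean, and the function name `kmeans` and the data are hypothetical.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Plain k-means: alternate assignment and centroid update until stable."""
    centers = list(centers)
    k = len(centers)
    labels = None
    for _ in range(max_iter):
        # Assign each point to the nearest cluster center.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Convergence criterion: the assignment has not changed.
        if new_labels == labels:
            break
        labels = new_labels
        # Recompute each center as the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
# Deliberately poor seeds: both lie in the same group of points.
centers, labels = kmeans(points, centers=[(0.0, 0.0), (0.0, 1.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Even with both seeds in the same group, the two well-separated groups are recovered here; on less clear-cut data the result can depend on the seeds.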
![Page 22: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/22.jpg)
C-means [book page 303]
while the clusters Ck change
  % form clusters: each x goes to its nearest centroid
  for k=1,…,K do
    Ck = {x | ||x - rk|| < ||x - rl|| for all l ≠ k}
  end
  % compute new cluster centroids
  for k=1,…,K do
    rk = mean({x | x ∈ Ck})
  end
end
![Page 23: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/23.jpg)
k-means clustering (also: C-means)
The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. However, it does not necessarily yield the same result on each run: the resulting clusters depend on the initial assignments. The k-means algorithm minimizes the intra-cluster variance (equivalently, maximizes the inter-cluster variance), but it only guarantees a local minimum, not the global one.
![Page 24: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/24.jpg)
k-means clustering
![Page 25: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/25.jpg)
k-means clustering (also: C-means)
![Page 26: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/26.jpg)
k-means clustering (also: C-means)
![Page 27: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/27.jpg)
k-means clustering (also: C-means)
![Page 28: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/28.jpg)
Fuzzy c-means
One of the problems of the k-means algorithm is that it gives a hard partitioning of the data: each point is attributed to one and only one cluster. But points on the edge of a cluster, or near another cluster, may not belong to that cluster as strongly as points in its center.
![Page 29: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/29.jpg)
Fuzzy c-means
Therefore, in fuzzy clustering, a point does not belong to a single cluster but has a degree of belonging to each cluster, as in fuzzy logic. For each point x we have a coefficient uk(x) giving its degree of being in the k-th cluster. Usually the sum of those coefficients has to be one, so that uk(x) can be read as the probability of belonging to cluster k:
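Written out, the normalisation condition described in the text presumably reads as follows, where K denotes the number of clusters (the symbol K is borrowed from the C-means pseudocode of this lecture):

$$\sum_{k=1}^{K} u_k(x) = 1 \quad \text{for every point } x$$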
![Page 30: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/30.jpg)
Fuzzy c-means
With fuzzy c-means, the centroid of a cluster is computed as being the mean of all points, weighted by their degree of belonging to the cluster, that is:
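In the standard formulation of fuzzy c-means, which is assumed here to be what the slide shows, the weights are the memberships raised to the power of the fuzzifier m > 1:

$$c_k = \frac{\sum_{x} u_k(x)^m \, x}{\sum_{x} u_k(x)^m}$$

With the exponent dropped, this reduces to the plain weighted mean used in the Fuzzy C-means pseudocode of this lecture.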
![Page 31: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/31.jpg)
Fuzzy c-means
The degree of being in a certain cluster is related to the inverse of the distance to the cluster center. The coefficients are then normalized and fuzzified with a real parameter m > 1 so that their sum is 1. So:
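The standard fuzzy c-means membership formula matches this description and is assumed here; d denotes a distance (typically Euclidean), c_k the center of cluster k, and K the number of clusters:

$$u_k(x) = \frac{1}{\sum_{j=1}^{K} \left( \dfrac{d(c_k, x)}{d(c_j, x)} \right)^{2/(m-1)}}$$

Summing over k gives 1 by construction, and a smaller distance to c_k gives a larger u_k(x).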
![Page 32: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/32.jpg)
Fuzzy c-means
For m equal to 2, this is equivalent to normalising the coefficients linearly to make their sum 1. When m is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means.
![Page 33: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/33.jpg)
Fuzzy c-means
The fuzzy c-means algorithm is very similar to the k-means algorithm:
![Page 34: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/34.jpg)
Fuzzy c-means
• Choose a number of clusters.
• Randomly assign to each point coefficients for being in the clusters.
• Repeat until the algorithm has converged (that is, the change in the coefficients between two iterations is no more than ε, the given sensitivity threshold):
  • Compute the centroid for each cluster, using the formula above.
  • For each point, compute its coefficients of being in the clusters, using the formula above.
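The loop above can be sketched as follows. Several choices are assumptions rather than the lecture's own: the sketch starts from given initial centers instead of random coefficients (to keep the run reproducible), uses Euclidean distance, and uses the standard membership and centroid updates with fuzzifier m. The name `fuzzy_c_means` and the data are hypothetical.

```python
import math

def fuzzy_c_means(points, centers, m=2.0, eps=1e-6, max_iter=200):
    """Return (centers, memberships); memberships[i][k] is u_k(points[i])."""
    centers = list(centers)
    dims = len(points[0])
    u = None
    for _ in range(max_iter):
        # Membership update: inverse-distance weights, normalised to sum to 1.
        new_u = []
        for x in points:
            d = [math.dist(x, c) for c in centers]
            if 0.0 in d:
                # The point coincides with a center: full membership there.
                row = [1.0 if dk == 0.0 else 0.0 for dk in d]
            else:
                row = [dk ** (-2.0 / (m - 1.0)) for dk in d]
            total = sum(row)
            new_u.append([r / total for r in row])
        # Centroid update: mean of all points weighted by u^m.
        for k in range(len(centers)):
            w = [row[k] ** m for row in new_u]
            total = sum(w)
            centers[k] = tuple(
                sum(wi * x[j] for wi, x in zip(w, points)) / total
                for j in range(dims)
            )
        # Stop when no coefficient changed by more than eps.
        converged = u is not None and max(
            abs(a - b) for ra, rb in zip(u, new_u) for a, b in zip(ra, rb)
        ) <= eps
        u = new_u
        if converged:
            break
    return centers, u

points = [(0.0,), (1.0,), (9.0,), (10.0,)]
centers, u = fuzzy_c_means(points, centers=[(2.0,), (7.0,)])
print([round(c[0], 2) for c in centers])
print([[round(v, 3) for v in row] for row in u])
```

Every row of `u` sums to 1, and points close to a center receive a membership close to 1 in that cluster.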
![Page 35: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/35.jpg)
Fuzzy C-means
ujk is the membership of sample j in cluster k
ck is the centroid of cluster k
while the memberships ujk change
  % compute new memberships
  for k=1,…,K do
    for j=1,…,N do
      ujk = f(xj - ck)
    end
  end
  % compute new cluster centroids
  for k=1,…,K do
    % weighted mean
    ck = SUMj ujk xj / SUMj ujk
  end
end
![Page 36: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/36.jpg)
Fuzzy c-means
The fuzzy c-means algorithm minimizes intra-cluster variance as well, but it has the same problems as k-means: the minimum found is a local minimum, and the results depend on the initial choice of weights.
![Page 37: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/37.jpg)
Fuzzy c-means
![Page 38: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/38.jpg)
Fuzzy c-means
![Page 39: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/39.jpg)
Fuzzy c-means
![Page 40: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/40.jpg)
Fuzzy c-means
[Figure: two panels, "Trajectory of Fuzzy MultiVariate Centroids" and "Trajectory of Fuzzy C-means Centroids", each tracing centroids 1 to 5; horizontal axis from 0 to 1, vertical axis from -0.1 to 0.9.]
![Page 41: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/41.jpg)
Hierarchical Clustering [book section 9.5]
One major problem with partitional clustering is that the number of clusters (= number of classes) must be specified in advance.
This poses the question: what IS the real number of clusters in a given set of data?
Answer: it depends!
• Agglomerative methods: bottom-up
• Divisive methods: top-down
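The bottom-up idea can be sketched as follows: start with every point as its own cluster and repeatedly merge the two closest clusters. The slide does not fix how cluster-to-cluster distance is measured; single linkage (the smallest pairwise Euclidean distance) is an assumption made here, and the name `agglomerate` and the data are hypothetical.

```python
import math

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain (single linkage)."""
    clusters = [[p] for p in points]  # bottom-up: one cluster per point
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

points = [(0.0,), (1.0,), (5.0,), (6.0,), (20.0,)]
print(agglomerate(points, 3))  # [[(0.0,), (1.0,)], [(5.0,), (6.0,)], [(20.0,)]]
print(agglomerate(points, 2))  # [[(0.0,), (1.0,), (5.0,), (6.0,)], [(20.0,)]]
```

Stopping the merging at different levels gives different numbers of clusters from the same hierarchy, which is why no cluster count has to be fixed in advance.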
![Page 42: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/42.jpg)
Hierarchical Clustering
Agglomerative hierarchical clustering
![Page 43: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/43.jpg)
Hierarchical Clustering
![Page 44: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/44.jpg)
Hierarchical Clustering
![Page 45: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/45.jpg)
Hierarchical Clustering
![Page 46: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/46.jpg)
Hierarchical Clustering
![Page 47: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/47.jpg)
Example of Clustering
and Classification
![Page 48: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/48.jpg)
1. Clustering versus Classification
• classification: give a pre-determined label to a sample
• clustering: provide the relevant labels for classification from structure in
a given dataset
• clustering: maximal intra-cluster similarity and maximal inter-cluster dissimilarity
• Objectives: - 1. segmentation of space
- 2. find natural subclasses
![Page 49: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/49.jpg)
![Page 50: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/50.jpg)
DATA ANALYSIS AND UNCERTAINTY
Data Mining Lecture V [Chapter 4 of Hand, Mannila, Smyth]
![Page 51: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/51.jpg)
Random Variables [4.3]
multivariate random variables
marginal density
conditional density & dependency: p(x|y) = p(x,y) / p(y)
* example: supermarket purchases
RANDOM VARIABLES
![Page 52: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/52.jpg)
Example: supermarket purchases
X = n customers x p products; X(i,j) = Boolean variable: “Has customer #i bought a product of type j?”
nA = sum(X(:,A)) is the number of customers that bought product A
nB = sum(X(:,B)) is the number of customers that bought product B
nAB = sum(X(:,A).*X(:,B)) is the number of customers that bought both product A and product B
*** Demo: matlab: conditionaldensity
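The counts above lead directly to estimates of the marginal and conditional probabilities, since p(B|A) = p(A,B) / p(A) is estimated by nAB / nA. The sketch below is not the lecture's `conditionaldensity` demo; the purchase matrix is made-up illustration data.

```python
# Hypothetical purchase matrix: rows are customers, columns are products.
X = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
]
A, B = 0, 1  # column indices of the two products being compared
n = len(X)

nA = sum(row[A] for row in X)             # customers that bought A
nB = sum(row[B] for row in X)             # customers that bought B
nAB = sum(row[A] * row[B] for row in X)   # customers that bought both

p_B = nB / n              # marginal estimate of p(B)
p_B_given_A = nAB / nA    # conditional estimate of p(B|A)
print(nA, nB, nAB)        # 3 3 2
print(p_B, p_B_given_A)
```

In this toy data p(B|A) is larger than p(B), so buying A and buying B are not independent.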
RANDOM VARIABLES
![Page 53: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/53.jpg)
(conditionally) independent:
p(x,y) = p(x)*p(y)
i.e. :
p(x|y) = p(x)
RANDOM VARIABLES
![Page 54: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/54.jpg)
Simpson’s paradox
Observation of two different treatments for several categories of patients

| Category | Treatment A | Treatment B |
|---|---|---|
| Old | 0.20 | 0.33 |
| Young | 0.53 | 1.00 |
| Total | 0.50 | 0.40 |
RANDOM VARIABLES
![Page 55: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/55.jpg)
Explanation: the sets are POOLED:

| Category | Treatment A | Treatment B |
|---|---|---|
| Old | 2 / 10 = 0.20 | 30 / 90 = 0.33 |
| Young | 48 / 90 = 0.53 | 10 / 10 = 1.00 |
| Total | 50 / 100 = 0.50 | 40 / 100 = 0.40 |
Both for Old and Young individually, treatment B seems best, but for total Treatment A seems superior.
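The reversal can be verified with exact arithmetic on the counts from the table. Reading each fraction as favourable outcomes over patients is an assumption about what the slide's numbers measure.

```python
from fractions import Fraction

# (favourable outcomes, patients) per treatment and category, from the table.
counts = {
    "A": {"Old": (2, 10), "Young": (48, 90)},
    "B": {"Old": (30, 90), "Young": (10, 10)},
}

per_category = {
    t: {c: Fraction(s, n) for c, (s, n) in cats.items()}
    for t, cats in counts.items()
}
# Pooling adds numerators and denominators; it does not average the rates.
pooled = {
    t: Fraction(sum(s for s, _ in cats.values()), sum(n for _, n in cats.values()))
    for t, cats in counts.items()
}

for c in ("Old", "Young"):
    print(c, float(per_category["A"][c]), float(per_category["B"][c]))
print("Total", float(pooled["A"]), float(pooled["B"]))
```

Treatment B wins within each category, yet treatment A wins after pooling, because the two treatments were applied to very differently sized groups of old and young patients.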
RANDOM VARIABLES
![Page 56: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/56.jpg)
- 1st order Markov processes [example: text analysis and reconstruction with Markov chains]
- Demo: dir *.txt
- type 'werther.txt'
- [m0,m1,m2] = Markov2Analysis('werther.txt');
- CreateMarkovText(m0,m1,m2,3,80,20);
- Dutch, German, French, English, "Matlabish", "European"
- Correlation, dependency, causation
- Example 4.3
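A first-order Markov chain over characters can be sketched as follows. This is not the `Markov2Analysis` / `CreateMarkovText` code of the demo, whose internals are not shown in the slides; the function names `first_order_counts` and `generate` and the toy text are hypothetical.

```python
import random
from collections import defaultdict

def first_order_counts(text):
    """Count how often each character is followed by each other character."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def generate(counts, start, length, seed=0):
    """Sample a string in which each character depends only on the previous one."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length - 1):
        followers = counts.get(out[-1])
        if not followers:  # dead end: this character never had a successor
            break
        chars = list(followers)
        weights = [followers[c] for c in chars]
        out.append(rng.choices(chars, weights=weights)[0])
    return "".join(out)

counts = first_order_counts("abababab")
print(generate(counts, "a", 6))  # ababab
```

Trained on a real text, the same counts capture which letters tend to follow which, so the generated text looks like the language of the source while being meaningless.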
RANDOM VARIABLES
![Page 57: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/57.jpg)
Sampling [4.4]
- Large quantities of observations: central limit theorem: normally distributed
- Small sample size: modeling: OK, pattern recognition: not
- Figure 4.1
- Role of model parameters
- Probability of observing data D = {x(1), x(2), …, x(n)} with independent observations:

$$p(D \mid \theta, M) = \prod_{i=1}^{n} p(x(i) \mid \theta, M)$$

with fixed parameter set θ.
SAMPLING
![Page 58: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/58.jpg)
Estimation [4.5]
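- Properties
1. $\hat{\theta}$ is an estimator of θ; it depends on the sample, so it is a stochastic variable with $E[\hat{\theta}]$ etc.
2. Bias: $\mathrm{bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$; unbiased: $\mathrm{bias}(\hat{\theta}) = 0$, i.e. no systematic bias
3. Consistent estimator: asymptotically unbiased
4. Variance: $\mathrm{var}(\hat{\theta}) = E\big[(\hat{\theta} - E[\hat{\theta}])^2\big]$
5. Best unbiased estimator: unbiased estimator with smallest variance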
ESTIMATION
![Page 59: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/59.jpg)
- Maximum Likelihood Estimation
- Expression 4.7:

$$L(\theta \mid D) = \prod_{i=1}^{n} f(x(i) \mid \theta)$$

- scalar function in θ
- drop all terms not containing θ
- the value of θ with the highest L(θ) is the maximum likelihood estimator (MLE):

$$\hat{\theta}_{MLE} = \arg\max_{\theta} L(\theta)$$

- Example 4.4 + figure 4.2
- Demo: MLbinomial
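For a binomial model the maximisation can be sketched numerically. This is not the `MLbinomial` demo itself; the counts r and n are made-up, and the grid search is chosen for illustration only, since the binomial MLE has the closed form r/n. The binomial coefficient is dropped because it does not contain θ.

```python
import math

def binomial_log_likelihood(theta, r, n):
    """Log-likelihood of r successes in n trials, up to a theta-free constant."""
    return r * math.log(theta) + (n - r) * math.log(1.0 - theta)

r, n = 7, 20  # hypothetical data: 7 successes in 20 trials

# Evaluate the log-likelihood on a grid of candidate values in (0, 1).
grid = [i / 1000 for i in range(1, 1000)]
theta_hat = max(grid, key=lambda t: binomial_log_likelihood(t, r, n))

print(theta_hat, r / n)  # the grid maximiser coincides with r/n = 0.35
```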
Maximum Likelihood Estimation
![Page 60: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/60.jpg)
- Example 4.5: Normal distribution, log L(θ)
- Example 4.6: Sufficient statistic: nice idea but practically infeasible
- MLE: biased but consistent, O(1/n)
- In practice the log-likelihood l(θ) = log L(θ) is more attractive
- Multiple parameters: complex but relevant: EM-algorithm [treated later]
- Determination of confidence levels: example 4.8
Maximum Likelihood Estimation
![Page 61: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/61.jpg)
- Sampling using the central limit theorem (CLT) with large samples
- Bootstrap method: Example 4.9: use the observed data
  i. as the real distribution F,
  ii. to generate many sub-samples,
  iii. estimate properties in these sub-samples using the “known” distribution F,
  iv. compare the sub-sample estimate and the “real” estimate and apply the CLT.
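The bootstrap steps can be sketched as follows for the sample mean. The data values, the number of resamples and the function name `bootstrap_means` are illustrative assumptions; the sketch stops at the bootstrap standard error and leaves out the CLT-based comparison.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=2000, seed=0):
    """Means of resamples drawn with replacement from the observed data."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)]

# Hypothetical observations, treated as if they were the real distribution F.
data = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.9, 3.7, 2.5]

means = bootstrap_means(data)
estimate = statistics.fmean(data)       # the "real" estimate
std_error = statistics.stdev(means)     # spread of the sub-sample estimates
print(round(estimate, 2), round(std_error, 2))
```

The spread of the resampled means approximates the sampling variability of the estimate without any further data collection.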
Maximum Likelihood Estimation
![Page 62: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/62.jpg)
- Bayesian estimation
- Here not the frequentist approach but the subjective approach: D is given and the parameters θ are random variables.
- p(θ) represents our degree of belief in the value of θ
- in practice this means that we have to make all sorts of (wild) assumptions about p.
- prior distribution: p(θ)
- posterior distribution: p(θ|D)
- Modification of prior to posterior by Bayes' theorem:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \psi)\, p(\psi)\, d\psi}$$
BAYESIAN ESTIMATION
![Page 63: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/63.jpg)
- MAP: Maximum a Posteriori method: <θ>: mean or mode of the posterior distribution
- MAP relates to MLE
- Example 4.10
- Bayesian approach: rather than point estimates, keep full knowledge of the uncertainties in the model(s)
- This causes massive computation, which is why the approach has become popular only in the recent decade
BAYESIAN ESTIMATION
![Page 64: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/64.jpg)
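- Sequential updating:

$$p(\theta \mid D) \propto p(D \mid \theta)\, p(\theta) \qquad \text{(equation 4.10)}$$

$$p(\theta \mid D_1, D_2) \propto p(D_2 \mid \theta)\, p(D_1 \mid \theta)\, p(\theta) \qquad \text{(equation 4.15)}$$

- Reminiscent of a Markov process
- Algorithm:
  * start with p(θ)
  * new data D: use eq. 4.10 to obtain the posterior distribution
  * new data D2: use eq. 4.15 to obtain the new posterior distribution
  * repeat ad nauseam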
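Sequential updating is easy to illustrate with a model in which the posterior has a closed form. The choice of a Bernoulli likelihood with a Beta(a, b) prior is an assumption made for this sketch and does not come from the slides; with it, each observed 1 adds one to a and each observed 0 adds one to b.

```python
def update(prior, data):
    """Beta(a, b) prior plus Bernoulli observations gives a Beta posterior."""
    a, b = prior
    ones = sum(data)
    return (a + ones, b + len(data) - ones)

prior = (1, 1)        # Beta(1, 1): a flat prior on theta
D1 = [1, 0, 1, 1]     # hypothetical first batch of observations
D2 = [0, 1, 1]        # hypothetical second batch

post1 = update(prior, D1)        # posterior after D1
post2 = update(post1, D2)        # posterior after D1, then D2
batch = update(prior, D1 + D2)   # posterior from all data at once

print(post1, post2, batch)  # (4, 2) (6, 3) (6, 3)
```

Using the posterior after D1 as the prior for D2 gives the same result as processing all data in one step, which is the content of the two proportionality relations.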
BAYESIAN ESTIMATION
![Page 65: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/65.jpg)
- Large Problem: different observers may define different p(θ); the choice is subjective! hence: different results! Analysis depends on choice of p(θ)
- Often a consensus prior, e.g. the Jeffreys prior (equations 4.16 and 4.17)
- Computation of credibility interval
- Markov Chain Monte Carlo (MCMC) methods for estimating posterior distributions
- Similarly: Bayesian Belief Networks (BBN)
BAYESIAN ESTIMATION
![Page 66: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/66.jpg)
- Bayesian approach: Uncertainty_Model + Uncertainty_Params
- MLE: aim is a point estimate
- Bayesian: aim is the posterior distribution
- Bayesian estimate = weighted estimate over all models M and all parameters θ, where the weights are the likelihoods of the parameters in the different models
- PROBLEM: these weights are difficult to estimate
BAYESIAN ESTIMATION
![Page 67: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/67.jpg)
![Page 68: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/68.jpg)
PROBABILISTIC MODEL-BASED CLUSTERING USING MIXTURE
MODELS
Data Mining Lecture VI [4.5, 8.4, 9.2, 9.6 of Hand, Mannila, Smyth]
![Page 69: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/69.jpg)
Probabilistic Model-Based Clustering using Mixture Models
A probability mixture model
A mixture model is a formalism for modeling a probability density function as a sum of parameterized functions. In mathematical terms:
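Using the symbols that the lecture defines for this model (p_X, K, a_k, h and λ_k), the mixture density presumably reads:

$$p_X(x) = \sum_{k=1}^{K} a_k \, h(x \mid \lambda_k)$$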
![Page 70: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/70.jpg)
A probability mixture model
where pX(x) is the modeled probability distribution function, K is the number of components in the mixture model, and ak is the mixture proportion of component k. By definition, 0 < ak < 1 for all k = 1…K and:
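Since the ak are proportions, the condition that completes this sentence is presumably that they sum to one:

$$\sum_{k=1}^{K} a_k = 1$$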
![Page 71: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/71.jpg)
A probability mixture model
h(x | λk) is a probability distribution parameterized by λk.
Mixture models are often used when we know h(x) and we can sample from pX(x), but we would like to determine the ak and λk values.
Such situations can arise in studies in which we sample from a population that is composed of several distinct subpopulations.
![Page 72: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/72.jpg)
A common approach for ‘decomposing’ a mixture model
It is common to think of mixture modeling as a missing data problem. One way to understand this is to assume that the data points under consideration have "membership" in one of the distributions we are using to model the data. When we start, this membership is unknown, or missing. The job of estimation is to devise appropriate parameters for the model functions we choose, with the connection to the data points being represented as their membership in the individual model distributions.
![Page 73: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/73.jpg)
Probabilistic Model-Based Clustering using Mixture Models
The EM-algorithm [book section 8.4]
![Page 74: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/74.jpg)
Mixture Decomposition:
The ‘Expectation-Maximization’ Algorithm
The expectation-maximization algorithm computes the missing memberships of data points in our chosen distribution model.
It is an iterative procedure, where we start with initial parameters for our model distribution (the ak's and λk's of the model listed above).
The estimation process proceeds iteratively in two steps, the Expectation Step, and the Maximization Step.
![Page 75: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/75.jpg)
The ‘Expectation-Maximization’ Algorithm
The expectation step
With initial guesses for the parameters in our mixture model, we compute "partial membership" of each data point in each constituent distribution. This is done by calculating expectation values for the membership variables of each data point.
![Page 76: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/76.jpg)
The ‘Expectation-Maximization’ Algorithm
The maximization step
With the expectation values in hand for group membership, we can recompute plug-in estimates of our distribution parameters.
For the mixing coefficient ak, this is simply the fractional membership of all data points in the k-th distribution.
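The two steps can be sketched for a one-dimensional mixture. Taking the components h(x | λk) to be Gaussians with λk = (mean, standard deviation) is an assumption of this sketch, as are the function names, the starting values and the data.

```python
import math

def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_gmm_1d(data, a, mu, sigma, n_iter=50):
    """EM for a 1-D Gaussian mixture; returns (a, mu, sigma, memberships)."""
    a, mu, sigma = list(a), list(mu), list(sigma)
    K = len(a)
    for _ in range(n_iter):
        # E-step: partial membership of each point in each component.
        resp = []
        for x in data:
            w = [a[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(K)]
            total = sum(w)
            resp.append([wk / total for wk in w])
        # M-step: plug-in estimates weighted by the memberships.
        for k in range(K):
            nk = sum(r[k] for r in resp)
            a[k] = nk / len(data)  # fractional membership of all points
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigma[k] = max(math.sqrt(var), 1e-3)  # floor keeps the pdf finite
    return a, mu, sigma, resp

# Hypothetical data from two subpopulations, around 1 and around 5.
data = [0.8, 1.0, 1.2, 0.9, 1.1, 4.9, 5.1, 5.0, 5.2, 4.8]
a, mu, sigma, resp = em_gmm_1d(data, a=[0.5, 0.5], mu=[0.0, 6.0], sigma=[1.0, 1.0])
print([round(v, 2) for v in a], [round(v, 2) for v in mu])
```

With these starting values the estimated means settle near 1 and 5 and the mixing proportions near 0.5 each.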
![Page 77: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/77.jpg)
EM-algorithm for Clustering
Suppose we have data D, a model with parameters θ, and hidden parameters H
Interpretation: H = the class label
Log-likelihood of observed data:

$$l(\theta) = \log p(D \mid \theta) = \log \sum_{H} p(D, H \mid \theta)$$
![Page 78: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/78.jpg)
EM-algorithm for Clustering
With p the probability over the data D.
Let Q be the unknown distribution over the hidden parameters H
Then the log-likelihood is:
![Page 79: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/79.jpg)
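Then the log-likelihood is:

$$
\begin{aligned}
l(\theta) &= \log \sum_{H} p(D, H \mid \theta) \\
&= \log \sum_{H} Q(H)\, \frac{p(D, H \mid \theta)}{Q(H)} \\
&\ge \sum_{H} Q(H) \log \frac{p(D, H \mid \theta)}{Q(H)} \qquad \text{[*Jensen's inequality]} \\
&= \sum_{H} Q(H) \log p(D, H \mid \theta) + \sum_{H} Q(H) \log \frac{1}{Q(H)} = F(Q, \theta)
\end{aligned}
$$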
![Page 80: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/80.jpg)
Jensen’s inequality
For a concave-down function, the expected value of the function is less than or equal to the function of the expected value. The gray rectangle along the horizontal axis represents the probability distribution of x, which is uniform for simplicity, but the general idea applies to any distribution.
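A quick numerical illustration, assuming the concave function is the logarithm and x is uniform on the interval [1, 5] (both choices are made only for this example):

```python
import math

# Midpoint grid approximating a uniform distribution of x on [1, 5].
n = 1000
xs = [1.0 + 4.0 * (i + 0.5) / n for i in range(n)]

mean_of_log = sum(math.log(x) for x in xs) / n  # E[f(x)] with f = log
log_of_mean = math.log(sum(xs) / n)             # f(E[x])

print("E[log x] =", mean_of_log)
print("log E[x] =", log_of_mean)
```

The first number is the smaller of the two, as the inequality requires for a concave function.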
![Page 81: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/81.jpg)
EM-algorithm
So: F(Q, θ) is a lower bound on the log-likelihood function l(θ).
EM alternates between:
E-step: maximising F with respect to Q with θ fixed, and:
M-step: maximising F with respect to θ with Q fixed.
![Page 82: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/82.jpg)
EM-algorithm
E-step:
$$
Q^{k+1} = \arg\max_{Q} F(Q, \theta^{k})
$$
M-step:
$$
\theta^{k+1} = \arg\max_{\theta} F(Q^{k+1}, \theta)
$$
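The alternation can be sketched end to end for a one-dimensional mixture of two Gaussians. This is an illustration under stated assumptions, not the lecture's own implementation: the function names, the synthetic data, the starting values and the small floor on the standard deviation are all choices made for this example.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a 1-D Gaussian with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def em_gaussian_mixture(data, weights, mus, sigmas, n_iter=50):
    """Alternate E- and M-steps; return fitted parameters and the log-likelihood trace."""
    weights, mus, sigmas = list(weights), list(mus), list(sigmas)  # do not mutate inputs
    trace = []
    for _ in range(n_iter):
        # E-step: with theta fixed, the maximising Q is the posterior over labels.
        resp, log_lik = [], 0.0
        for x in data:
            joint = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(joint)
            log_lik += math.log(total)
            resp.append([j / total for j in joint])
        trace.append(log_lik)  # l(theta) before this iteration's update
        # M-step: with Q fixed, maximise F over theta (closed form for Gaussians).
        for k in range(len(weights)):
            nk = sum(r[k] for r in resp)
            mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var = sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk
            sigmas[k] = max(math.sqrt(var), 1e-6)  # guard against a collapsing component
            weights[k] = nk / len(data)
    return weights, mus, sigmas, trace

# Synthetic data from two Gaussians centred at -2 and 3 (an assumption for the demo).
random.seed(0)
data = [random.gauss(-2.0, 1.0) for _ in range(200)] + \
       [random.gauss(3.0, 1.0) for _ in range(200)]
weights, mus, sigmas, trace = em_gaussian_mixture(
    data, weights=[0.5, 0.5], mus=[-1.0, 1.0], sigmas=[1.0, 1.0])
print("means:", mus, "weights:", weights)
print("log-likelihood non-decreasing:",
      all(b >= a - 1e-8 for a, b in zip(trace, trace[1:])))
```

Because each step maximises the same lower bound F, the recorded log-likelihood should not decrease from one iteration to the next, which the last line checks up to rounding error.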
![Page 83: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/83.jpg)
Probabilistic Model-Based Clustering using Gaussian Mixtures
![Page 84: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/84.jpg)
Probabilistic Model-Based Clustering using Mixture Models
![Page 85: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/85.jpg)
![Page 86: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/86.jpg)
![Page 87: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/87.jpg)
Probabilistic Model-Based Clustering using Mixture Models
Gaussian Mixture Decomposition
Gaussian mixture decomposition is a good classifier. It allows supervised as well as unsupervised learning (finding how many classes are optimal, how they should be defined, ...), but training is iterative and time-consuming.
The idea is to set the position and width of the Gaussian distribution(s) to optimize the coverage of the learning samples.
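One common way to make "position" and "width" precise is sketched here for the one-dimensional case. The symbols K, π_k, μ_k, σ_k and x_i are notation assumed for this example rather than taken from the slides, and reading "coverage" as the log-likelihood of the learning samples is an interpretation.

$$
p(x \mid \theta) = \sum_{k=1}^{K} \pi_k \, \frac{1}{\sqrt{2\pi}\,\sigma_k} \exp\!\left( -\frac{(x - \mu_k)^2}{2\sigma_k^2} \right),
\qquad \sum_{k=1}^{K} \pi_k = 1
$$

$$
l(\theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)
$$

Here μ_k is the position and σ_k the width of the k-th Gaussian, and training adjusts θ = (π_k, μ_k, σ_k) to increase l(θ) over the n learning samples.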
![Page 88: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/88.jpg)
![Page 89: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/89.jpg)
![Page 90: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/90.jpg)
![Page 91: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/91.jpg)
![Page 92: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/92.jpg)
![Page 93: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/93.jpg)
![Page 94: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/94.jpg)
![Page 95: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/95.jpg)
![Page 96: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/96.jpg)
![Page 97: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/97.jpg)
![Page 98: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/98.jpg)
![Page 99: DATA MINING van data naar informatie Ronald Westra Dep. Mathematics Maastricht University](https://reader033.vdocuments.site/reader033/viewer/2022042703/56814043550346895dabb075/html5/thumbnails/99.jpg)
The End