Machine Learning and Statistical Analysis
![Page 1: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/1.jpg)
Jong Youl Choi, Computer Science Department
![Page 2: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/2.jpg)
Social Bookmarking
Socialized
Tags
Bookmarks
![Page 3: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/3.jpg)
![Page 4: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/4.jpg)
Principles of Machine Learning
▪ Bayes' theorem and maximum likelihood
Machine Learning Algorithms
▪ Clustering analysis
▪ Dimension reduction
▪ Classification
Parallel Computing
▪ General parallel computing architecture
▪ Parallel algorithms
![Page 5: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/5.jpg)
Definition
▪ Algorithms or techniques that enable a computer (machine) to "learn" from data.
▪ Related to many areas, such as data mining, statistics, and information theory.
Algorithm Types
▪ Unsupervised learning
▪ Supervised learning
▪ Reinforcement learning
Topics
▪ Models: Artificial Neural Network (ANN), Support Vector Machine (SVM)
▪ Optimization: Expectation-Maximization (EM), Deterministic Annealing (DA)
![Page 6: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/6.jpg)
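Posterior probability of $\theta_i$, given $X$:
$$P(\theta_i \mid X) = \frac{P(X \mid \theta_i)\, P(\theta_i)}{\sum_j P(X \mid \theta_j)\, P(\theta_j)}$$
▪ $\theta_i \in \Theta$ : parameter
▪ $X$ : observations
▪ $P(\theta_i)$ : prior (or marginal) probability
▪ $P(X \mid \theta_i)$ : likelihood
Maximum Likelihood (ML)
▪ Used to find the most plausible $\theta_i \in \Theta$, given $X$
▪ Computing the maximum likelihood (ML) or log-likelihood is an optimization problem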
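The posterior and the maximum-likelihood choice can be illustrated on a tiny discrete case. This is only a sketch: the coin-bias hypotheses, the flat prior, and the names `posterior`, `thetas`, and `ml_theta` are assumptions made for illustration, not part of the slides.

```python
def posterior(priors, likelihoods):
    """Return P(theta_i | X) for each candidate theta_i via Bayes' theorem."""
    joint = [p * l for p, l in zip(priors, likelihoods)]
    evidence = sum(joint)  # P(X) = sum_j P(X | theta_j) P(theta_j)
    return [j / evidence for j in joint]

# Hypothetical example: two candidate coin biases, observations X = (H, H, T).
thetas = [0.5, 0.8]                               # P(heads) under each hypothesis
priors = [0.5, 0.5]                               # flat prior (assumed)
likelihoods = [t * t * (1 - t) for t in thetas]   # P(X | theta)

post = posterior(priors, likelihoods)
# Maximum likelihood ignores the prior and picks the theta with the largest P(X | theta).
ml_theta = thetas[likelihoods.index(max(likelihoods))]
```

With a flat prior the posterior ranks the hypotheses in the same order as the likelihood, which is why ML is often described as the prior-free special case.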
![Page 7: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/7.jpg)
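Problem
▪ Estimate the hidden parameters ($\theta = \{\mu, \sigma\}$) from the given data, drawn from $k$ Gaussian distributions
Gaussian distribution
$$N(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$$
Maximum Likelihood
$$\theta_{ML} = \arg\max_{\theta} \sum_i \ln P(x_i \mid \theta)$$
With Gaussian ($P = N$),
$$\sum_i \ln N(x_i \mid \mu, \sigma) = \sum_i \left[ -\ln\left(\sqrt{2\pi}\,\sigma\right) - \frac{(x_i - \mu)^2}{2\sigma^2} \right]$$
▪ Solve by either brute force or a numerical method
(Mitchell, 1997)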
![Page 8: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/8.jpg)
Problems in ML estimation
▪ Observation $X$ is often incomplete
▪ Latent (hidden) variable $Z$ exists
▪ Hard to explore the whole parameter space
Expectation-Maximization algorithm
▪ Objective: find the ML estimate, averaging over the latent distribution $P(Z \mid X, \theta)$
▪ Steps
0. Init: choose a random $\theta_{old}$
1. E-step: compute the expectation under $P(Z \mid X, \theta_{old})$
2. M-step: find $\theta_{new}$ that maximizes the likelihood
3. Go to step 1 after updating $\theta_{old} \leftarrow \theta_{new}$
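The EM steps above can be sketched for a one-dimensional mixture of Gaussians. This is a minimal illustration, not the slides' own implementation: the function name `em_gmm_1d`, the fixed iteration count, the initial standard deviation of 1.0, and the small floor on sigma are all assumptions.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def em_gmm_1d(data, k=2, iters=50, seed=0):
    rng = random.Random(seed)
    # Step 0: random initial parameters (means drawn from the data).
    mus = rng.sample(data, k)
    sigmas = [1.0] * k          # assumed starting spread
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities P(Z = j | x, theta_old) for every point.
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
            total = sum(p)
            resp.append([pj / total for pj in p])
        # M-step: re-estimate each component from its weighted points.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, data)) / nj
            sigmas[j] = max(math.sqrt(var), 1e-3)   # floor avoids a collapsed component
            weights[j] = nj / len(data)
    return weights, mus, sigmas

data = [-0.5, -0.25, 0.0, 0.25, 0.5, 9.5, 9.75, 10.0, 10.25, 10.5]
weights, mus, sigmas = em_gmm_1d(data, k=2)
```

The latent variable $Z$ here is the unknown component that generated each point; the E-step never commits to a hard label, it only weights each point by how plausible each component is.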
![Page 9: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/9.jpg)
Definition
▪ Grouping unlabeled data into clusters, for the purpose of inferring hidden structures or information
Dissimilarity measurement
▪ Distance: Euclidean (L2), Manhattan (L1), …
▪ Angle: inner product, …
▪ Non-metric: rank, intensity, …
Types of Clustering
▪ Hierarchical: agglomerative or divisive
▪ Partitioning: K-means, VQ, MDS, …
(Matlab help page)
![Page 10: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/10.jpg)
Find K partitions with the total intra-cluster variance minimized
Iterative method
▪ Initialization: randomized $y_i$
▪ Assignment of $x$ ($y_i$ fixed)
▪ Update of $y_i$ ($x$ fixed)
Problem?
▪ Can become trapped in local minima
(MacKay, 2003)
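The initialization, assignment, and update steps can be sketched as follows. This is an illustrative version only; the function name `kmeans`, the tuple representation of points, and the stopping rule (centers unchanged) are assumptions.

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means on a list of equal-length tuples."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)               # initialization: random y_i
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each x goes to its nearest center (centers fixed).
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Update step: each center moves to the mean of its cluster (assignment fixed).
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:                # no change: converged
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
centers, clusters = kmeans(points, k=2)
```

On data with less clear structure, different seeds can end in different partitions, which is the local-minimum problem the slide points out.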
![Page 11: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/11.jpg)
Deterministically avoid local minima
▪ No stochastic process (random walk)
▪ Tracing the global solution by changing the level of randomness
Statistical Mechanics
▪ Gibbs distribution: $P(E_x) = \frac{1}{Z} \exp(-E_x / T)$
▪ Helmholtz free energy: $F = D - TS$
  ▪ Average energy: $D = \langle E_x \rangle$
  ▪ Entropy: $S = -\sum P(E_x) \ln P(E_x)$
  ▪ $F = -T \ln Z$
▪ In DA, $F$ is minimized
(Maxima and Minima, Wikipedia)
![Page 12: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/12.jpg)
Analogy to the physical annealing process
▪ Control energy (randomness) by temperature (high → low)
▪ Starting with a high temperature ($T = \infty$)
  ▪ Soft (or fuzzy) association probability
  ▪ Smooth cost function with one global minimum
▪ Lowering the temperature ($T \to 0$)
  ▪ Hard association
  ▪ The full complexity is revealed and clusters emerge
Minimization of $F$, using $E(x, y_j) = \|x - y_j\|^2$
▪ Iteratively, alternate between updating the association probabilities and the centers $y_j$
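A small sketch of deterministic-annealing clustering on scalar data follows. The slide's own update equations are not in the transcript, so this assumes the commonly used form: soft association $p(y_j \mid x) \propto \exp(-\|x - y_j\|^2 / T)$ and centers set to the association-weighted mean. The geometric cooling schedule, the tiny deterministic perturbation that lets coincident centers separate, and all parameter names are likewise assumptions.

```python
import math

def da_cluster(xs, k=2, t_start=100.0, t_min=0.01, cooling=0.9, inner=20, eps=1e-3):
    """Deterministic annealing for 1-D data with E(x, y_j) = (x - y_j)^2."""
    center = sum(xs) / len(xs)
    ys = [center] * k                    # at high T all centers sit at the global mean
    t = t_start
    while t > t_min:
        # Deterministic (not random) nudge so identical centers can split below
        # the critical temperature.
        ys = [y + eps * (j - (k - 1) / 2) for j, y in enumerate(ys)]
        for _ in range(inner):
            probs = []
            for x in xs:
                e = [(x - y) ** 2 for y in ys]
                e_min = min(e)           # subtract the minimum for numerical stability
                w = [math.exp(-(ej - e_min) / t) for ej in e]
                total = sum(w)
                probs.append([wj / total for wj in w])
            for j in range(k):
                mass = sum(p[j] for p in probs)
                if mass > 0:
                    ys[j] = sum(p[j] * x for p, x in zip(probs, xs)) / mass
        t *= cooling                     # lower the temperature
    return ys

ys = da_cluster([0.0, 0.5, 1.0, 9.0, 9.5, 10.0], k=2)
```

At high temperature both centers stay at the data mean (fully soft association); only once the temperature drops far enough does the split into two clusters appear, and near $T \to 0$ the updates behave like hard K-means.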
![Page 13: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/13.jpg)
Definition
▪ Process of transforming high-dimensional data into low-dimensional data to improve accuracy or understanding, or to remove noise.
Curse of dimensionality
▪ Complexity grows exponentially in volume as extra dimensions are added
Types
▪ Feature selection: choose representatives (e.g., filter, …)
▪ Feature extraction: map to lower dimensions (e.g., PCA, MDS, …)
(Koppen, 2000)
![Page 14: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/14.jpg)
Finding a map of the principal components (PCs) of the data into an orthogonal space, such that
$y = W x$, where $W \in \mathbb{R}^{d \times h}$ ($h \gg d$)
PCs: variables with the largest variances
▪ Orthogonality
▪ Linearity: optimal least mean-square error
Limitations?
▪ Strict linearity
▪ Specific distribution
▪ Large variance assumption
[Figure: data in the ($x_1$, $x_2$) plane with principal axes PC 1 and PC 2]
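The map $y = W x$ can be sketched with a singular value decomposition of the centered data. This is one common way to compute PCA, shown under the assumption that samples are stored as rows; the function name `pca` and the returned variance vector are illustrative choices.

```python
import numpy as np

def pca(X, d):
    """X has shape (n, h), one sample per row. Returns Y (n, d), W (d, h), variances."""
    Xc = X - X.mean(axis=0)                        # center each variable
    # Rows of Vt are orthonormal directions ordered by decreasing variance.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:d]                                     # top-d principal directions
    Y = Xc @ W.T                                   # y = W x for every sample
    variances = s[:d] ** 2 / (len(X) - 1)
    return Y, W, variances

# Points lying exactly on the line x2 = 2 * x1: one PC captures everything.
X = np.array([[t, 2.0 * t] for t in range(-5, 6)], dtype=float)
Y, W, variances = pca(X, d=1)
```

The sign of each principal direction is arbitrary, so comparisons against an expected direction should be made up to sign.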
![Page 15: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/15.jpg)
Like PCA, reduction of dimension by $y = R x$, where $R$ is a random matrix with i.i.d. columns and $R \in \mathbb{R}^{d \times p}$ ($p \gg d$)
Johnson-Lindenstrauss lemma
▪ When projecting onto a randomly selected subspace, the distances are approximately preserved
Generating $R$
▪ Hard to obtain an orthogonalized $R$
▪ Gaussian $R$
▪ Simple approach: choose $r_{ij} \in \{+\sqrt{3}, 0, -\sqrt{3}\}$ with probability 1/6, 4/6, 1/6, respectively
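The simple sparse construction of $R$ can be tried out directly. In this sketch the dimensions, the seed, and the $1/\sqrt{d}$ scaling are assumptions (the scaling is not stated on the slide; it is added so that expected squared lengths are unchanged), and `sparse_random_matrix` is a hypothetical helper name.

```python
import numpy as np

def sparse_random_matrix(n_out, n_in, rng):
    """Entries are +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 4/6, 1/6."""
    values = np.sqrt(3.0) * np.array([1.0, 0.0, -1.0])
    return rng.choice(values, size=(n_out, n_in), p=[1 / 6, 4 / 6, 1 / 6])

rng = np.random.default_rng(0)
p, d = 1000, 200                          # original and reduced dimensions
R = sparse_random_matrix(d, p, rng)       # R has shape (d, p)

x1, x2 = rng.normal(size=p), rng.normal(size=p)
# Assumed scaling by 1/sqrt(d): each entry of R has unit variance, so this
# keeps the expected squared norm of a projected vector equal to the original.
y1, y2 = R @ x1 / np.sqrt(d), R @ x2 / np.sqrt(d)

# Ratio of projected to original distance; values near 1 mean the distance survived.
ratio = np.linalg.norm(y1 - y2) / np.linalg.norm(x1 - x2)
```

Two thirds of the entries are zero, so the projection needs far fewer multiplications than a dense Gaussian $R$.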
![Page 16: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/16.jpg)
Dimension reduction preserving the distance proximities observed in the original data set
Loss functions
▪ Inner product
▪ Distance
▪ Squared distance
Classical MDS: minimizing STRAIN, given the dissimilarity matrix $\Delta$
▪ From $\Delta$, find the inner product matrix $B$ (double centering)
▪ From $B$, recover the coordinates $X'$ (i.e., $B = X' X'^T$)
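The two classical MDS steps can be sketched as below. The double-centering formula $B = -\frac{1}{2} J \Delta^{(2)} J$ with $J = I - \frac{1}{n} \mathbf{1}\mathbf{1}^T$ is the standard one and is assumed here, since the slide's equation is not in the transcript; `classical_mds` is an illustrative name.

```python
import numpy as np

def classical_mds(delta, dim):
    """Recover coordinates from a matrix of pairwise distances."""
    n = delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n            # centering matrix
    B = -0.5 * J @ (delta ** 2) @ J                # double centering -> inner products
    vals, vecs = np.linalg.eigh(B)                 # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:dim]             # keep the largest `dim`
    scale = np.sqrt(np.clip(vals[idx], 0.0, None)) # guard against tiny negatives
    return vecs[:, idx] * scale                    # X' such that B ~= X' X'^T

points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
delta = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
X_rec = classical_mds(delta, dim=2)
D_rec = np.linalg.norm(X_rec[:, None] - X_rec[None, :], axis=-1)
```

The recovered coordinates match the originals only up to rotation, reflection, and translation, so the check is on the pairwise distances rather than on the coordinates themselves.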
![Page 17: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/17.jpg)
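SMACOF: minimizing STRESS
▪ Majorization: for a complex $f(x)$, find a simple auxiliary $g(x, y)$ s.t. $f(x) \le g(x, y)$ and $f(y) = g(y, y)$
▪ Majorization for STRESS
▪ Minimizing the majorizing function, which involves $\mathrm{tr}(X^T B(Y) Y)$, gives the update known as the Guttman transform
(Cox, 2001)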
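A compact sketch of the SMACOF iteration with unit weights follows. It assumes the usual unweighted form of the Guttman transform, $X \leftarrow \frac{1}{n} B(Y)\, Y$, with $B(Y)_{ij} = -\delta_{ij} / d_{ij}(Y)$ off the diagonal and each diagonal entry chosen so rows sum to zero; the function names, iteration count, and test data are illustrative assumptions.

```python
import numpy as np

def pairwise_distances(X):
    return np.linalg.norm(X[:, None] - X[None, :], axis=-1)

def stress(X, delta):
    """Raw STRESS: sum over pairs i < j of (d_ij(X) - delta_ij)^2."""
    return 0.5 * ((pairwise_distances(X) - delta) ** 2).sum()

def smacof(delta, X0, iters=300):
    n = delta.shape[0]
    X = X0.copy()
    for _ in range(iters):
        D = pairwise_distances(X)
        with np.errstate(divide="ignore", invalid="ignore"):
            ratio = np.where(D > 0, delta / D, 0.0)   # delta_ij / d_ij, 0 where d_ij = 0
        B = -ratio
        B[np.diag_indices(n)] = ratio.sum(axis=1)     # rows of B sum to zero
        X = B @ X / n                                 # Guttman transform (unit weights)
    return X

points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0], [1.0, 1.0]])
delta = pairwise_distances(points)
X0 = np.random.default_rng(0).normal(size=(5, 2))     # arbitrary starting layout
X_final = smacof(delta, X0)
```

Because each step minimizes a function that lies above STRESS and touches it at the current point, STRESS cannot increase from one iteration to the next.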
![Page 18: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/18.jpg)
Competitive and unsupervised learning process for clustering and visualization
Result: similar data get closer in the model space
[Figure: Input, Model]
Learning
▪ Choose the model vector $m_j$ most similar to $x_i$
▪ Update the winner and its neighbors by $m_k = m_k + \alpha(t)\,\sigma(t)\,(x_i - m_k)$
▪ $\alpha(t)$: learning rate; $\sigma(t)$: neighborhood size
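The winner selection and neighborhood update can be sketched on scalar data with model vectors arranged on a one-dimensional grid. The linear decay of the learning rate, the Gaussian neighborhood weight, and every name in this sketch (`train_som`, `alpha`, `radius`) are assumptions; the slides only state that the update is scaled by a learning rate and a neighborhood term.

```python
import math
import random

def train_som(data, n_units=5, epochs=50, seed=0):
    """Self-organizing update on scalars; units live on a 1-D grid 0..n_units-1."""
    rng = random.Random(seed)
    models = [rng.random() for _ in range(n_units)]   # random initial model vectors
    steps = epochs * len(data)
    t = 0
    for _ in range(epochs):
        for x in data:
            frac = t / steps
            alpha = 0.5 * (1.0 - frac)                        # decaying learning rate
            radius = max(n_units / 2 * (1.0 - frac), 0.5)     # shrinking neighborhood
            # Winner: the model vector most similar to the input.
            winner = min(range(n_units), key=lambda j: abs(x - models[j]))
            for k in range(n_units):
                # Grid neighbors of the winner are pulled too, but less strongly.
                h = math.exp(-((k - winner) ** 2) / (2 * radius ** 2))
                models[k] += alpha * h * (x - models[k])
            t += 1
    return models

data = [i / 10 for i in range(11)]      # inputs spread over [0, 1]
models = train_som(data)
```

Each update is a convex step toward the input, so model vectors never leave the range spanned by the inputs and the initial values.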
![Page 19: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/19.jpg)
Definition
▪ A procedure that divides data into a given set of categories, based on a training set, in a supervised way
Generalization vs. specification
▪ Hard to achieve both
▪ Avoid overfitting (overtraining)
  ▪ Early stopping
  ▪ Holdout validation
  ▪ K-fold cross-validation
  ▪ Leave-one-out cross-validation
[Figure: validation error and training error curves, with underfitting and overfitting regions] (Overfitting, Wikipedia)
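How K-fold cross-validation partitions the data can be shown with index lists alone. This is a sketch without shuffling or a model; `k_fold_indices` is a hypothetical helper name.

```python
def k_fold_indices(n, k):
    """Yield (train, validation) index lists for K-fold cross-validation."""
    # The first n % k folds get one extra item so all n items are used.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, val
        start += size

folds = list(k_fold_indices(10, 3))
# Leave-one-out is the special case k == n: every validation fold has one item.
loo = list(k_fold_indices(4, 4))
```

Every item is used for validation exactly once, and holdout validation corresponds to evaluating just one of these splits.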
![Page 20: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/20.jpg)
Perceptron: a computational unit with a binary threshold
[Figure: weighted sum followed by an activation function]
Abilities
▪ Linearly separable decision surface
▪ Represents Boolean functions (AND, OR, NOT)
Network (multilayer) of perceptrons
▪ Various network architectures and capabilities
(Jain, 1996)
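A single threshold unit is enough to represent the Boolean functions listed above. The weights and biases below are hand-picked for illustration (many other choices work), and `perceptron` is an assumed helper name.

```python
def perceptron(weights, bias):
    """Return a unit that outputs 1 when the weighted sum plus bias is positive."""
    def unit(*inputs):
        s = sum(w * x for w, x in zip(weights, inputs)) + bias   # weighted sum
        return 1 if s > 0 else 0                                 # binary threshold
    return unit

AND = perceptron([1, 1], -1.5)   # fires only when both inputs are 1
OR = perceptron([1, 1], -0.5)    # fires when at least one input is 1
NOT = perceptron([-1], 0.5)      # fires when the input is 0
```

Each of these functions has a linear decision surface; a function such as XOR does not, which is one motivation for networks of several units.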
![Page 21: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/21.jpg)
Learning weights: random initialization and updating
Error-correction training rules
▪ Difference between training data and output: $E(t, o)$
▪ Gradient descent (batch learning)
  ▪ With $E = \sum_i E_i$, $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$
▪ Stochastic approach (on-line learning)
  ▪ Update gradient for each result
Various error functions
▪ Adding a weight regularization term ($\lambda \sum w_i^2$) to avoid overfitting
▪ Adding momentum ($\alpha \Delta w_i(n-1)$) to expedite convergence
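Batch gradient descent with a weight-regularization term and a momentum term can be sketched for a single linear unit. The squared-error function $E_i = \frac{1}{2}(t - o)^2$, the symbols `eta`, `lam`, and `alpha`, and the zero initialization (used instead of random initialization so the run is reproducible) are assumptions of this sketch.

```python
def train_linear_unit(samples, eta=0.05, lam=0.0, alpha=0.5, epochs=500):
    """samples: list of (inputs, target). Returns weights with the bias last."""
    n_inputs = len(samples[0][0])
    w = [0.0] * (n_inputs + 1)       # zeros for reproducibility (assumed)
    prev = [0.0] * len(w)            # previous weight change, for momentum
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, t in samples:         # batch: accumulate over all samples
            xb = list(x) + [1.0]     # constant input for the bias
            o = sum(wi * xi for wi, xi in zip(w, xb))
            for i, xi in enumerate(xb):
                grad[i] += -(t - o) * xi          # d/dw_i of 0.5 * (t - o)^2
        for i in range(len(w)):
            grad[i] += 2.0 * lam * w[i]           # gradient of lam * sum(w_i^2)
            step = -eta * grad[i] + alpha * prev[i]
            w[i] += step
            prev[i] = step
    return w

samples = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]  # t = 2x + 1
w = train_linear_unit(samples)
w_reg = train_linear_unit(samples, lam=1.0)
```

The on-line (stochastic) variant would apply the weight change after each sample instead of after the whole pass.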
![Page 22: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/22.jpg)
Q: How to draw the optimal linear separating hyperplane?
A: Maximize the margin
Margin maximization
▪ The distance between $H_{+1}$ and $H_{-1}$: $\frac{2}{\|w\|}$
▪ Thus, $\|w\|$ should be minimized
[Figure: margin between $H_{+1}$ and $H_{-1}$]
![Page 23: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/23.jpg)
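Constrained optimization problem
▪ Given training set $\{x_i, y_i\}$ ($y_i \in \{+1, -1\}$):
▪ Minimize: $\frac{1}{2}\|w\|^2$, subject to $y_i (w \cdot x_i + b) \ge 1$ for all $i$
Lagrangian equation with saddle points
$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_i \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right]$$
▪ Minimized w.r.t. the primal variables $w$ and $b$: $w = \sum_i \alpha_i y_i x_i$, $\sum_i \alpha_i y_i = 0$
▪ Maximized w.r.t. the dual variables $\alpha_i$ (all $\alpha_i \ge 0$)
▪ $x_i$ with $\alpha_i > 0$ (not $\alpha_i = 0$) is called a support vector (SV)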
![Page 24: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/24.jpg)
Soft Margin (non-separable case)
▪ Slack variables $\xi_i < C$
▪ Optimization with an additional constraint
Non-linear SVM
▪ Map non-linear input to feature space
▪ Kernel function $k(x, y) = \langle \phi(x), \phi(y) \rangle$
▪ Kernel classifier with support vectors $s_i$
[Figure: Input Space, Feature Space]
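That a kernel is an inner product in a feature space can be checked on a small case. The degree-2 polynomial kernel and its explicit feature map are chosen here purely as an illustration; the slides do not name a specific kernel.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def phi(x):
    """Explicit feature map for 2-D input matching the kernel (x . y)^2."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(x, y):
    """k(x, y) = (x . y)^2, computed without ever forming phi."""
    return dot(x, y) ** 2

x, y = (1.0, 2.0), (3.0, -1.0)
k_value = poly_kernel(x, y)            # inner product evaluated in input space
feature_value = dot(phi(x), phi(y))    # same quantity evaluated in feature space
```

The classifier only ever needs $k(x, y)$, so the mapping to feature space stays implicit even when that space is very high-dimensional.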
![Page 25: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/25.jpg)
Memory Architecture
▪ Shared Memory
  ▪ Symmetric Multiprocessor (SMP)
  ▪ OpenMP, POSIX, pthread, MPI
  ▪ Easy to manage but expensive
▪ Distributed Memory
  ▪ Commodity, off-the-shelf processors
  ▪ MPI
  ▪ Cost-effective but hard to maintain
Decomposition Strategy
▪ Task: e.g., Word, IE, …
▪ Data: scientific problem
▪ Pipelining: Task + Data
(Barney, 2007)
![Page 26: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/26.jpg)
Shrinking
▪ Recall: only support vectors ($\alpha_i > 0$) are used in SVM optimization
▪ Predict whether each data point is an SV or a non-SV
▪ Remove non-SVs from the problem space
Parallel SVM
▪ Partition the problem
▪ Merge data hierarchically
▪ Each unit finds support vectors
▪ Loop until convergence
(Graf, 2005)
![Page 27: Machine Learning and Statistical Analysis](https://reader034.vdocuments.site/reader034/viewer/2022052600/55806e9ad8b42a925c8b4a2d/html5/thumbnails/27.jpg)