Lecture 24: Conclusion
Andreas Wichert
Department of Computer Science and Engineering
Técnico Lisboa
• Example of ML: Decision Trees
• Linear and Nonlinear Regression
• Perceptron/Logistic Regression
• Backpropagation
• Learning Theory
• K-Means, EM-Clustering
• RBF-Networks, SVM
• Model Selection, MDL Principle
• Deep Learning
• Convolution NN
• RNN
• KL-Transform, PCA, ICA
• Autoencoders
• Feature Extraction
• KNN, Weighted Regression
• Ensemble Methods
• Bayesian Networks
• Boltzmann Machine
Example of ML: Decision Trees
Top-Down Induction of Decision Trees (ID3)
1. A ← the "best" decision attribute for the next node
2. Assign A as the decision attribute (= property) for the node
3. For each value of A, create a new descendant
4. Sort the training examples to the leaf nodes according to the attribute value of the branch
5. If all training examples are perfectly classified (same value of the target attribute), stop; else iterate over the new leaf nodes
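The five steps above map directly onto a short recursive procedure. Below is a minimal Python sketch (not from the lecture): examples are assumed to be dicts with a 'label' key, and `best_attribute` is passed in; the information-gain version of it is sketched together with the entropy code further down.

```python
# Minimal ID3 sketch: examples are dicts of attribute -> value plus a 'label'.
def id3(examples, attributes, best_attribute):
    labels = [e['label'] for e in examples]
    if len(set(labels)) == 1:                # step 5: perfectly classified -> stop
        return labels[0]
    if not attributes:                       # no attribute left: majority label
        return max(set(labels), key=labels.count)
    A = best_attribute(examples, attributes)         # step 1: the "best" attribute
    node = {'attribute': A, 'children': {}}          # step 2: A becomes the node
    for v in set(e[A] for e in examples):            # step 3: one branch per value
        subset = [e for e in examples if e[A] == v]  # step 4: sort examples to branch
        rest = [a for a in attributes if a != A]
        node['children'][v] = id3(subset, rest, best_attribute)
    return node
```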
Heuristic function: Shannon Entropy
• Shannon formalized these intuitions
• Given a universe of messages M = {m1, m2, ..., mn} and a probability p(mi) for the occurrence of each message, the information content (also called entropy) of M is given by

$$I(M) = -\sum_{i=1}^{n} p(m_i) \log_2 p(m_i)$$
• The gain from the property P is computed by subtracting the expected information E(P) needed to complete the classification from the total information:

$$E(P) = \sum_{i=1}^{n} \frac{|C_i|}{|C|} I(C_i)$$

$$\mathrm{gain}(P) = I(C) - E(P)$$
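As an illustration, a small Python sketch of these two formulas; the `examples`/`label` representation matches the ID3 sketch above and is an assumption of the example, not from the slides.

```python
import math
from collections import Counter

def entropy(labels):
    """I(M) = -sum_i p(m_i) log2 p(m_i)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(examples, attribute):
    """gain(P) = I(C) - E(P), with E(P) = sum_i |C_i|/|C| * I(C_i)."""
    labels = [e['label'] for e in examples]
    n = len(examples)
    expected = 0.0
    for v in set(e[attribute] for e in examples):
        subset = [e['label'] for e in examples if e[attribute] == v]
        expected += len(subset) / n * entropy(subset)   # |C_i|/|C| * I(C_i)
    return entropy(labels) - expected

def best_attribute(examples, attributes):
    """The heuristic used in step 1 of ID3: maximize the information gain."""
    return max(attributes, key=lambda a: gain(examples, a))
```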
Linear and Nonlinear Regression
Linear Regression
Sum-of-squares error
Design Matrix
• The dimensionality changes, since it is no longer determined by the dimension D of the input vector x
• The number of basis functions φj is M − 1, plus the bias φ0 = 1
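To make the basis-function setup concrete, a hedged numpy sketch: a polynomial basis φj(x) = x^j is an assumed example choice, the N × M design matrix Φ includes φ0 = 1 as the bias column, and the sum-of-squares solution is obtained via the pseudo-inverse.

```python
import numpy as np

def design_matrix(x, M):
    """N x M design matrix with polynomial basis phi_j(x) = x**j, j = 0..M-1."""
    return np.stack([x**j for j in range(M)], axis=1)

x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)   # toy targets
Phi = design_matrix(x, M=4)
w = np.linalg.pinv(Phi) @ t          # minimizes the sum-of-squares error
y = Phi @ w                          # fitted values
```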
Posterior Density
Relation between Regularised Least-Squares and MAP
Perceptron/Logistic Regression
Perceptron (1957)
• Linear threshold unit (LTU)
[Figure: linear threshold unit with inputs x1, x2, ..., xn, weights w1, w2, ..., wn, bias weight w0 (x0 = 1), summation Σ, and output o]
McCulloch-Pitts model of a neuron (1943)
The “bias”, a constant term that does not depend on any input value
Linearly separable patterns
[Figure: linearly separable patterns, with x0 = 1 as the bias input]
Perceptron learning rule
• Consider linearly separable problems
• How do we find appropriate weights?
• Initialize the weight vector w to some small random values
• Check whether the output o has the desired value d
• η is called the learning rate, 0 < η ≤ 1
Δw =η ⋅ (d −o) ⋅ x
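A minimal numpy sketch of this training loop; the logical-OR data set is an assumed toy example of a linearly separable problem.

```python
import numpy as np

def train_perceptron(X, d, eta=0.1, epochs=100):
    """Perceptron rule: w <- w + eta * (d - o) * x, with x0 = 1 as the bias input."""
    X = np.hstack([np.ones((len(X), 1)), X])   # prepend x0 = 1
    w = 0.01 * np.random.randn(X.shape[1])     # small random initial weights
    for _ in range(epochs):
        for x, target in zip(X, d):
            o = 1 if w @ x > 0 else 0          # linear threshold unit
            w += eta * (target - o) * x        # changes w only when o != d
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # logical OR: separable
d = np.array([0, 1, 1, 1])
w = train_perceptron(X, d)
```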
• The update rule for gradient descent is given by
Linear Unit
Sigmoid Unit
Logistic Regression
Sigmoid Unit versus Logistic Regression
Linear Unit versus Logistic Regression
Backpropagation
Back-propagation
• The algorithm gives a prescription for changing the weights wij in any feed-forward network to learn a training set of input-output pairs {xk, yk}
• We consider a simple two-layer network
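A compact numpy sketch of backpropagation for such a two-layer sigmoid network with sum-of-squares error; the XOR training set, the learning rate, and the layer sizes are assumed toy choices, and with enough epochs this typically learns XOR.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, W2, eta=0.5):
    """One backpropagation step, sum-of-squares error; bias handled by
    appending a constant 1 to the input and to the hidden layer."""
    x = np.append(x, 1.0)                       # input plus bias unit
    h = np.append(sigmoid(W1 @ x), 1.0)         # hidden plus bias unit
    o = sigmoid(W2 @ h)                         # network output
    delta_o = (o - y) * o * (1 - o)                              # output delta
    delta_h = (W2[:, :-1].T @ delta_o) * h[:-1] * (1 - h[:-1])   # hidden delta
    W2 -= eta * np.outer(delta_o, h)
    W1 -= eta * np.outer(delta_h, x)
    return W1, W2

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.5, size=(3, 3))   # 2 inputs + bias -> 3 hidden
W2 = rng.normal(scale=0.5, size=(1, 4))   # 3 hidden + bias -> 1 output
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]  # XOR
for _ in range(10000):
    for x, y in data:
        W1, W2 = backprop_step(np.array(x, float), np.array(y, float), W1, W2)
```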
Learning Theory
Bias-Variance Dilemma
Bias
Variance
VC-dimension
• The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H
• It can be shown that the VC-dimension of linear decision surfaces in a d-dimensional space (i.e., the VC-dimension of a perceptron with d inputs) is d + 1
• A perceptron in d dimensions has d + 1 parameters (including the bias); through d + 1 linearly independent points we can learn all dichotomies
• For d + 2 points in d dimensions, some vectors (at least two) can be represented as linear combinations of the others, so we cannot learn all dichotomies
K-Means, EM-Clustering
K-means Clustering
Algorithm: EM for Gaussian mixtures
RBF-Networks, SVM
Interpolation Problem
Micchelli’s Theorem
Radial Basis Function Networks
Constructing new kernels by building them out of simpler kernels as building blocks
• f(·) is any function, and q(·) is a polynomial with nonnegative coefficients
• A is a symmetric positive semidefinite matrix
• xa and xb are variables (not necessarily disjoint) with x = (xa, xb), and ka and kb are valid kernel functions over their respective spaces
Gaussian Kernel
• Since the exponential in the kernel can be expanded into an infinite power series, the feature vector that corresponds to the Gaussian kernel has infinite dimensionality
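A small sketch of the Gaussian kernel and of the composition rules from the previous slide (sum, product, and exponential of valid kernels yield valid kernels); the `linear_kernel` base kernel is an assumed example.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); its feature space is
    infinite-dimensional."""
    d = np.asarray(x, float) - np.asarray(y, float)
    return np.exp(-(d @ d) / (2 * sigma**2))

def linear_kernel(x, y):
    return float(np.dot(x, y))

# New valid kernels built out of simpler kernels as building blocks:
def k_sum(x, y):  return linear_kernel(x, y) + gaussian_kernel(x, y)   # k_a + k_b
def k_prod(x, y): return linear_kernel(x, y) * gaussian_kernel(x, y)   # k_a * k_b
def k_exp(x, y):  return np.exp(linear_kernel(x, y))                   # exp(k_a)
```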
Good Decision Boundary: Margin Should Be Large
• The decision boundary should be as far away from the data of both classes as possible
• We should maximize the margin ρ

[Figure: Class 1 and Class 2 separated by a decision boundary with margin ρ]
Dual Problem
Design of Support Vector Machine
Classify Data Points
Model Selection, MDL Principle
Attributes of the MDL Principle
• When we have two models that fit a given data sequence equally well, the MDL principle will pick the one that is the simplest in the sense that it allows the use of a shorter description of the data
• The MDL principle implements a precise form of Occam's razor, which states a preference for simple theories
• The MDL principle is a consistent model selection estimator in the sense that it converges to the true model order as the sample size increases
MDL and Regularization
Deep Learning
• It is assumed that an artificial neural network with several hidden layers is less likely to get stuck in a local minimum, and that it is easier to find the right parameters, as demonstrated by empirical experiments
How to set network parameters
[Figure: a fully connected network for 16 × 16 = 256 pixel images of digits; inputs x1, ..., x256 (ink → 1, no ink → 0), outputs y1, ..., y10 produced by a softmax layer]

Set the network parameters θ = {W1, b1, W2, b2, ..., WL, bL} such that:
• Input: an image of "1" → y1 has the maximum value
• Input: an image of "2" → y2 has the maximum value
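A minimal numpy forward pass matching this picture; the hidden-layer width of 30 and the ReLU nonlinearity in the hidden layer (introduced on the next slide) are assumptions of the sketch.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())          # subtract the max for numerical stability
    return e / e.sum()

def forward(x, params):
    """Forward pass through theta = {W1, b1, ..., WL, bL}; the softmax on
    the last layer yields the outputs y1..y10."""
    for i, (W, b) in enumerate(params):
        z = W @ x + b
        x = softmax(z) if i == len(params) - 1 else np.maximum(0, z)  # ReLU hidden
    return x

rng = np.random.default_rng(0)
params = [(rng.normal(scale=0.1, size=(30, 256)), np.zeros(30)),
          (rng.normal(scale=0.1, size=(10, 30)), np.zeros(10))]
x = rng.integers(0, 2, size=256).astype(float)   # ink -> 1, no ink -> 0
y = forward(x, params)                           # y.argmax() is the predicted digit
```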
Rectified Linear Unit (ReLU)
• f(x) = max(0, x): the function is defined as the positive part of its argument
• Does not saturate (in the positive region)
• Very computationally efficient
• Converges much faster than sigmoid/tanh in practice (e.g., 6x)
• More biologically plausible
• But: the output is not zero-centered
• Non-differentiable at zero; however, it is differentiable everywhere else, with a derivative of 0 or 1

[Figure: the ReLU function and its derivative]
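A two-line sketch making the convention explicit; picking the derivative 0 at x = 0 is an assumption of the example (1 is an equally common choice).

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)            # f(x) = max(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)         # 0 for x <= 0, 1 for x > 0
```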
Batch Normalization
• During training, a batch normalization layer does the following:
• Calculate the mean and variance of the layer's inputs
• Normalize the layer inputs using the previously calculated batch statistics
• Scale and shift in order to obtain the output of the layer
• γ and β are learned during training along with the original parameters of the network
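A sketch of these training-time steps for a batch of shape (N, D); `eps` is the usual small constant assumed for numerical stability.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-time batch normalization for a batch x of shape (N, D)."""
    mu = x.mean(axis=0)                     # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)   # normalize with the batch statistics
    return gamma * x_hat + beta             # learned scale gamma and shift beta
```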
l2 Regularization
l1 Regularization
l2 versus l1 Regularization
Regularization: Dropout
• In each forward pass, randomly set some neurons to zero (for that pass only)
• The probability of dropping is a hyperparameter; 0.5 is common
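A sketch of the commonly used "inverted" dropout variant, which rescales at training time so the test-time pass needs no change; the rescaling is standard practice, not stated on the slide.

```python
import numpy as np

def dropout(x, p_drop=0.5, training=True):
    """Zero each unit with probability p_drop during training; rescale the
    survivors by 1/(1 - p_drop) so the expected activation is unchanged."""
    if not training:
        return x
    mask = (np.random.rand(*x.shape) >= p_drop) / (1.0 - p_drop)
    return x * mask
```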
Convolution NN
• Convolutional Neural Networks
RNN
Recurrent Neural Networks
• Recurrent networks that produce an output at each time step and have recurrent connections between the hidden units: the "vanilla" or Elman RNN
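A minimal numpy sketch of such a vanilla (Elman) RNN, producing one output per time step; the tanh nonlinearity and the weight names are the usual conventions, assumed here.

```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, W_hy, b_h, b_y):
    """Vanilla (Elman) RNN: an output at every time step, with recurrent
    connections between the hidden units."""
    h = np.zeros(W_hh.shape[0])
    outputs = []
    for x in xs:                                  # one step per input x_t
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)    # hidden state carries the history
        outputs.append(W_hy @ h + b_y)            # y_t computed from the current h_t
    return outputs, h
```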
Recurrent Hidden Units
KL-Transform, PCA, ICA
The Karhunen-Loève Transform
• The eigenvalues of the covariance matrix Σ of the data set represent the variances σi² along the corresponding eigenvectors
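A numpy sketch of the KL transform / PCA; it follows the usual convention in which the eigenvalues of the covariance matrix are themselves the variances along the eigenvectors.

```python
import numpy as np

def pca(X, k):
    """KL transform / PCA: project the centered data onto the top-k
    eigenvectors of the covariance matrix Sigma."""
    Xc = X - X.mean(axis=0)                     # center the data
    Sigma = np.cov(Xc, rowvar=False)            # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(Sigma)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]       # top-k directions by variance
    return Xc @ eigvecs[:, order], eigvals[order]
```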
The Blind Source Separation Problem
PCA vs ICA
Autoencoders
• Unsupervised learning
• Data: no labels!
• Goal: learn the structure of the data
• Traditionally, autoencoders were used for dimensionality reduction or feature learning.
Undercomplete AE
[Figure: autoencoder with input x, code f(x), reconstruction x̂, encoder weights w and decoder weights w′]

• The hidden layer is undercomplete if it is smaller than the input layer
• It compresses the input
• It compresses well only for the training distribution
• The hidden nodes will be good features for the training distribution, but bad for other types of input
• Autoencoders with nonlinear encoder functions f and nonlinear decoder functions g can thus learn a more powerful nonlinear generalization of PCA (later)
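A minimal sketch of an undercomplete autoencoder forward pass with a nonlinear encoder f and decoder g; the sigmoid nonlinearity, the separate weight matrices W and W′, and the squared reconstruction error are assumptions of this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def autoencoder_forward(x, W, W_prime, b, b_prime):
    """Undercomplete autoencoder: the code f(x) has lower dimension than x;
    training would minimize the reconstruction error ||x - x_hat||^2."""
    code = sigmoid(W @ x + b)                    # encoder f(x), dim(code) < dim(x)
    x_hat = sigmoid(W_prime @ code + b_prime)    # decoder g(f(x))
    loss = np.sum((x - x_hat) ** 2)              # reconstruction error
    return code, x_hat, loss
```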
Feature Extraction
Edge detection
• Convert a 2D image into a set of curves
• Extracts salient features of the scene
• More compact than pixels
• The basic idea behind edge detection is to localize discontinuities of the intensity function in the image
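As a concrete example, a small numpy sketch that localizes such discontinuities with the Sobel operators; the Sobel filter is a standard choice, not one named on the slide.

```python
import numpy as np

# Sobel operators approximate the image gradient; a large gradient magnitude
# marks a discontinuity of the intensity function, i.e. an edge.
SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)
SOBEL_Y = SOBEL_X.T

def gradient_magnitude(img):
    H, W = img.shape
    g = np.zeros((H - 2, W - 2))
    for i in range(H - 2):
        for j in range(W - 2):
            patch = img[i:i+3, j:j+3]
            gx = np.sum(SOBEL_X * patch)     # horizontal intensity change
            gy = np.sum(SOBEL_Y * patch)     # vertical intensity change
            g[i, j] = np.hypot(gx, gy)
    return g                                 # threshold g to keep edge points
```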
• The right angle is the only place where the contour is curved and changes its direction
• At the point of extreme curvature, the information is concentrated
• Corners yield the greatest information
• More strongly curved points yield more information
• The information content of a contour is concentrated in the neighborhood of points where the absolute value of the curvature is a local maximum
KNN, Weighted Regression
K-Nearest Neighbor
• Training algorithm:
• For each training example (x, f(x)), add the example to the list
• Classification algorithm:
• Given a query instance xq to be classified
• Let x1, ..., xk be the k instances nearest to xq

$$\hat{f}(x_q) \leftarrow \operatorname*{argmax}_{v \in V} \sum_{i=1}^{k} \delta(v, f(x_i))$$

• where δ(a, b) = 1 if a = b, else δ(a, b) = 0 (Kronecker delta)
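A numpy sketch covering both the voting rule above and the mean rule of the next slide; Euclidean distance is the assumed metric.

```python
import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_q, k=3):
    """Return the most common label among the k nearest neighbors of x_q."""
    dists = np.linalg.norm(X_train - x_q, axis=1)    # Euclidean distances
    nearest = np.argsort(dists)[:k]                  # indices of x_1 .. x_k
    votes = Counter(y_train[i] for i in nearest)     # sum of delta(v, f(x_i))
    return votes.most_common(1)[0][0]

def knn_regress(X_train, y_train, x_q, k=3):
    """Continuous-valued targets: the mean of the k nearest training values."""
    nearest = np.argsort(np.linalg.norm(X_train - x_q, axis=1))[:k]
    return np.mean([y_train[i] for i in nearest])
```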
Continuous-valued target functions
• kNN can approximate continuous-valued target functions
• Calculate the mean value of the k nearest training examples rather than their most common value
Distance Weighted
• For real-valued functions
Ensemble Methods
Bootstrap a data set
• Sampling a dataset with replacement
• Define: the size of the sample and the number of repeats
• Example:
• Dataset: (0.1, 0.2, 0.3, 0.4, 0.5, 0.6)
• Randomly choose the first observation from the dataset: sample = (0.2)
• This observation is returned to the dataset, and we repeat this step 3 more times
• sample = (0.2, 0.1, 0.2, 0.6)
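The same example in a few lines of Python:

```python
import random

def bootstrap_sample(data, size):
    """Draw `size` observations with replacement, as in the example above."""
    return [random.choice(data) for _ in range(size)]

data = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
samples = [bootstrap_sample(data, size=4) for _ in range(3)]   # 3 repeats
# e.g. one sample could be [0.2, 0.1, 0.2, 0.6]
```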
Bagging
Boosting
• Boosting is a powerful technique for combining multiple base (weak) classifiers to produce a form of committee whose performance can be significantly better than that of any of the base classifiers.
• Boosting can give good results even if the base classifiers have a performance that is only slightly better than random, and hence sometimes the base classifiers are known as weak learners.
AdaBoost
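A hedged sketch of the AdaBoost loop with one-feature threshold stumps as the weak learners; the stump form and the ±1 labels are assumptions of this example. Each round fits a weak classifier on the weighted data, computes its vote α from the weighted error, and up-weights the misclassified points.

```python
import numpy as np

def adaboost(X, y, T=10):
    """AdaBoost sketch with one-feature threshold stumps; y in {-1, +1}."""
    n = len(y)
    w = np.full(n, 1.0 / n)                    # uniform weights on the data
    ensemble = []
    for _ in range(T):
        best = None
        for j in range(X.shape[1]):            # search for the best weak stump
            for thr in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.sign(X[:, j] - thr + 1e-12)
                    err = np.sum(w[pred != y])
                    if best is None or err < best[0]:
                        best = (err, j, thr, s, pred)
        err, j, thr, s, pred = best
        alpha = 0.5 * np.log((1 - err) / max(err, 1e-12))  # classifier vote
        w *= np.exp(-alpha * y * pred)         # up-weight misclassified points
        w /= w.sum()
        ensemble.append((alpha, j, thr, s))
    return ensemble

def predict(ensemble, X):
    score = sum(a * s * np.sign(X[:, j] - thr + 1e-12)
                for a, j, thr, s in ensemble)
    return np.sign(score)                      # weighted committee decision
```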
Algorithm: EM for Linear Regression Models
Bayesian Networks
Naive Bayes Classifier
• Assume a target function f: X → V, where each instance x is described by attributes a1, a2, ..., an
• The most probable value of f(x) is:

$$v_{MAP} = \operatorname*{argmax}_{v \in V} P(v) \prod_{i} P(a_i \mid v)$$
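A small Python sketch of this classifier using maximum-likelihood estimates of P(v) and P(ai | v) from counts; no smoothing is applied, which a practical version would add (e.g., Laplace smoothing).

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute tuple, class v). Returns a classifier
    computing v_MAP = argmax_v P(v) * prod_i P(a_i | v)."""
    class_counts = Counter(v for _, v in examples)
    cond = defaultdict(Counter)                    # (i, v) -> counts of a_i values
    for a, v in examples:
        for i, ai in enumerate(a):
            cond[(i, v)][ai] += 1
    n = len(examples)

    def v_map(a):
        best, best_p = None, -1.0
        for v, cv in class_counts.items():
            p = cv / n                             # P(v)
            for i, ai in enumerate(a):
                p *= cond[(i, v)][ai] / cv         # P(a_i | v), ML estimate
            if p > best_p:
                best, best_p = v, p
        return best
    return v_map
```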
Example
• In the example there are four variables: Burglary (= x2), Earthquake (= x3), Alarm (= x1), and JohnCalls (= x4).
• The corresponding network topology reflects the following "causal" knowledge:
• A burglar can set the alarm off.
• An earthquake can set the alarm off.
• The alarm can cause John to call.
Causality
Learning CPTs from Fully Observed Data
Expectation Maximization
Boltzmann Machine
• Assume the network is composed of n units. Usually the units are updated asynchronously, one at a time.
• For example, at each time step a random unit i is selected and updated with

$$z_i = b_i + \sum_j w_{ij} x_j$$

• with bi being the bias. Unit i then turns on with a probability given by the sigmoid (logistic) function

$$p(x_i = 1) = \frac{1}{1 + e^{-z_i}}$$
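A sketch of this asynchronous update as a single Gibbs step; W is assumed to be the symmetric weight matrix with zero diagonal, so w_ii * x_i contributes nothing to z_i.

```python
import numpy as np

def gibbs_update(x, W, b, i):
    """Asynchronous update of unit i: z_i = b_i + sum_j w_ij x_j, then the
    unit turns on with probability sigma(z_i)."""
    z_i = b[i] + W[i] @ x                    # total input, b_i is the bias
    p_on = 1.0 / (1.0 + np.exp(-z_i))        # sigmoid (logistic) function
    x[i] = 1.0 if np.random.rand() < p_on else 0.0
    return x
```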
Stochastic Dynamics
• The stochastic dynamics of a Boltzmann machine can be described by Gibbs sampling.
• Suppose that the system is in a state x and we have chosen an arbitrary coordinate i.
• We can then ignore the actual state of the unit xi and ask for the conditional probability
Learning
• During the training phase there are two phases to the operation of the Boltzmann machine:
• (1) Positive phase. In this phase, the network operates in its clamped condition under the direct influence of the training sample. The visible neurons are all clamped onto specific states determined by the environment.
• (2) Negative phase. In this second phase, the network is allowed to run freely, and therefore with no environmental input. The states of the units are determined randomly. The probability of finding it in any particular global state depends on the energy function.
Harmonium - Restricted Boltzmann Machine
• A restricted Boltzmann machine (RBM) has only connections between visible and hidden units, to make inference and learning easier.
• It was initially invented under the name Harmonium by Paul Smolensky in 1986.
A deep belief network
• The model learned to generate combinations of labels and images.
• To perform recognition we start with a random state of the label units and clamp the input image.
• Then we do an up-pass from the image, followed by a few iterations of the top-level layers.
• Example of ML: Decision Trees
• Linear and Nonlinear Regression
• Perceptron/Logistic Regression
• Backpropagation
• Learning Theory
• K-Means, EM-Clustering
• RBF-Networks, SVM
• Model Selection, MDL Principle
• Deep Learning
• Convolution NN
• RNN
• KL-Transform, PCA, ICA
• Autoencoders
• Feature Extraction
• KNN, Weighted Regression
• Ensemble Methods
• Bayesian Networks
• Boltzmann Machine