Anthony Kuh: Neural Networks and Learning Theory
TRANSCRIPT
EE645: Neural Networks and Learning Theory
Spring 2005
Prof. Anthony Kuh
POST 205E
Dept. of Elec. Eng.
University of Hawaii
Phone: (808)-956-7527, Fax: (808)-956-3427
Email: [email protected]
Preliminaries
Class Meeting Time: MWF 8:30-9:20
Office Hours: MWF 10-11 (or by appointment)
Prerequisites:
– Probability: EE342 or equivalent
– Linear Algebra
– Programming: Matlab or C experience
I. Introduction to neural networks
Goal: study the computational capabilities of neural network and learning systems.
Multidisciplinary field: algorithms, analysis, applications.
A. Motivation
Why study neural networks and machine learning?
– Biological inspiration (natural computation)
– Nonparametric models: adaptive learning systems, learning from examples, analysis of learning models
– Implementation
– Applications
Cognitive (human vs. computer intelligence): humans are superior to computers in pattern recognition, associative recall, and learning complex tasks. Computers are superior to humans in arithmetic computations and simple repeatable tasks.
Biological (study of the human brain): 10^10 to 10^11 neurons in the cerebral cortex, with on average 10^3 interconnections per neuron.
A neuron
Schematic of one neuron
Neural Network
Connecting many neurons together forms a neural network.
Neural network properties:
– Highly parallel (distributed computing)
– Robust and fault tolerant
– Flexible (short and long term learning)
– Handles a variety of information (often random, fuzzy, and inconsistent)
– Small and compact; dissipates very little power
B. Single Neuron
s = w^T x + w0: synaptic strength (linearly weighted sum of the inputs).
y = g(s): activation or squashing function.
[Diagram: computational node, inputs x with weights w and threshold weight w0 summed (Σ), then passed through g(·) to give output y]
Activation functions
– Linear units: g(s) = s.
– Linear threshold units: g(s) = sgn(s).
– Sigmoidal units: g(s) = tanh(Bs), B > 0.
Neural networks generally have nonlinear activation functions. The most popular models are linear threshold units and sigmoidal units.
Other types of computational units: receptive units (radial basis functions).
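The three activation functions above can be sketched directly (a minimal illustration, not from the course materials):

```python
import numpy as np

def linear(s):
    # linear unit: g(s) = s
    return s

def sgn(s):
    # linear threshold unit: +1 if s >= 0, else -1
    return np.where(s >= 0, 1.0, -1.0)

def sigmoidal(s, B=1.0):
    # sigmoidal unit: g(s) = tanh(B*s), B > 0
    return np.tanh(B * s)

s = np.linspace(-2, 2, 5)
print(linear(s), sgn(s), sigmoidal(s))
```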
C. Neural Network Architectures
Systems composed of interconnected neurons.
[Diagram: network with inputs on the left and output on the right]
A neural network is represented by a directed graph: edges represent weights and nodes represent computational units.
Definitions
A feedforward neural network has no loops in its directed graph.
Neural networks are often arranged in layers:
– A single layer feedforward neural network has one layer of computational nodes.
– A multilayer feedforward neural network has two or more layers of computational nodes.
– Computational nodes that are not output nodes are called hidden units.
D. Learning and Information Storage
1. Neural networks have computational capabilities. Where is information stored in a neural network? What are the parameters of a neural network?
2. How does a neural network work? (two phases)
– Training or learning phase (equivalent to the write phase in conventional computer memory): weights are adjusted to meet a certain desired criterion.
– Recall or test phase (equivalent to the read phase in conventional computer memory): weights are fixed as the neural network realizes some task.
Learning and Information (continued)
3) What can neural network models learn?
– Boolean functions
– Pattern recognition problems
– Function approximation
– Dynamical systems
4) What types of learning algorithms are there?
– Supervised learning (learning with a teacher)
– Unsupervised learning (no teacher)
– Reinforcement learning (learning with a critic)
Learning and Information (continued)
5) How do neural networks learn?
– Iterative algorithm: the weights of the neural network are adjusted online as training data is received.
w(k+1) = L(w(k), x(k), d(k)) for supervised learning, where d(k) is the desired output.
– Need a cost criterion. A common cost criterion is the mean squared error; for one output:
J(w) = Σ (y(k) − d(k))²
– The goal is to find the minimum of J(w) over all possible w. Iterative techniques often use gradient descent approaches.
Learning and Information (continued)
6) Learning and Generalization
A learning algorithm takes training examples as inputs and produces the concept, pattern, or function to be learned.
How good is the learning algorithm? Generalization ability measures how well the learning algorithm performs.
Need a sufficient number of training examples (LLN, typical sequences).
Occam's razor: "the simplest explanation is the best".
[Figure: regression problem, data points marked with + fit by a curve]
Learning and Information (continued)
Generalization error:
εg = εemp + εmodel
Empirical error: average error from training data (desired output vs. actual output)
Model error: due to dimensionality of class of functions or patterns
Desire class to be large enough so that empirical error is small and small enough so that model error is small.
II. Linear threshold units
A. Preliminaries
[Diagram: linear threshold unit, inputs x with weights w and threshold weight w0 summed (Σ), output y = sgn(s)]
sgn(s) = 1 if s ≥ 0, −1 if s < 0
Linearly separable
Consider a set of points with two labels: + and o.
A set of points is linearly separable if a linear threshold function can partition the + points from the o points.
[Figure: a set of linearly separable points]
Not linearly separable
A set of labeled points that cannot be partitioned by a linear threshold function is not linearly separable.
[Figure: a set of points that is not linearly separable]
B. Perceptron Learning Algorithm
An iterative learning algorithm that can find a linear threshold function to partition two sets of points.
1) w(0) arbitrary.
2) Pick a point (x(k), d(k)).
3) If w(k)^T x(k) d(k) > 0, go to 5).
4) w(k+1) = w(k) + x(k)d(k).
5) k = k+1; check if cycled through the data; if not, go to 2).
6) Otherwise stop.
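The steps above can be sketched in a few lines (a minimal sketch; the toy dataset and the zero initialization are illustrative assumptions):

```python
import numpy as np

def perceptron(X, d, max_epochs=100):
    """Perceptron learning algorithm.
    X: (N, n) inputs (first column of 1s plays the role of the threshold weight).
    d: (N,) labels in {-1, +1}; assumes the data are linearly separable."""
    w = np.zeros(X.shape[1])            # 1) w(0) arbitrary (here zero)
    for _ in range(max_epochs):
        updated = False
        for x, y in zip(X, d):          # 2) cycle through the points
            if w @ x * y <= 0:          # 3) misclassified (or on the boundary)
                w = w + y * x           # 4) w(k+1) = w(k) + x(k)d(k)
                updated = True
        if not updated:                 # 5)-6) stop after a full pass with no updates
            break
    return w

# toy linearly separable data, with a constant 1 column for the threshold weight
X = np.array([[1, 2.0, 1.0], [1, 1.5, 2.0], [1, -1.0, -1.5], [1, -2.0, -0.5]])
d = np.array([1, 1, -1, -1])
w = perceptron(X, d)
print(np.sign(X @ w))
```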
PLA comments
– Perceptron convergence theorem (requires margins)
– Sketch of proof
– Updating threshold weights
– The algorithm is based on the cost function
J(w) = − (sum of synaptic strengths of misclassified points)
w(k+1) = w(k) − µ(k)∇J(w(k)) (gradient descent)
Perceptron Convergence Theorem
Assumptions: w* is a solution with ||w*|| = 1, no threshold weight, and w(0) = 0. Let β = max ||x(k)|| and γ = min y(k)x(k)^T w*.
After k updates:
<w(k), w*> = <w(k−1) + x(k−1)y(k−1), w*> ≥ <w(k−1), w*> + γ ≥ kγ.
||w(k)||² ≤ ||w(k−1)||² + ||x(k−1)||² ≤ ||w(k−1)||² + β² ≤ kβ².
Since kγ ≤ <w(k), w*> ≤ ||w(k)|| ≤ √k β, this implies k ≤ (β/γ)² (max number of updates).
III. Linear Units
A. Preliminaries
[Diagram: linear unit, inputs x with weights w summed (Σ), output y = s = w^T x]
Model Assumptions and Parameters
Training examples (x(k), d(k)) drawn randomly.
Parameters:
– Inputs: x(k)
– Outputs: y(k)
– Desired outputs: d(k)
– Weights: w(k)
– Error: e(k) = d(k) − y(k)
Error criterion (MSE): min J(w) = E[.5 e(k)²]
Wiener solution
Define P = E[x(k)d(k)] and R = E[x(k)x(k)^T].
J(w) = .5 E[(d(k) − y(k))²]
     = .5 E[d(k)²] − E[x(k)d(k)]^T w + .5 w^T E[x(k)x(k)^T] w
     = .5 E[d(k)²] − P^T w + .5 w^T R w
Note J(w) is a quadratic function of w. To minimize J(w), find the gradient ∇J(w) and set it to 0.
∇J(w) = −P + Rw = 0, so Rw = P (Wiener solution).
If R is nonsingular, then w = R⁻¹P.
Resulting MSE = .5 E[d(k)²] − .5 P^T R⁻¹P
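As a numerical check of the Wiener solution (a sketch: the linear model d = a^T x + noise and its coefficients are made up for illustration, and R and P are replaced by sample averages):

```python
import numpy as np

rng = np.random.default_rng(0)
a = np.array([1.0, -2.0])                  # illustrative "true" weights
X = rng.standard_normal((10000, 2))        # inputs x(k)
d = X @ a + 0.1 * rng.standard_normal(10000)  # desired outputs d(k)

# sample estimates of R = E[x x^T] and P = E[x d]
R = X.T @ X / len(X)
P = X.T @ d / len(X)

w = np.linalg.solve(R, P)                  # Wiener solution: w = R^{-1} P
mse = 0.5 * np.mean((d - X @ w) ** 2)      # approximates .5 E[d^2] - .5 P^T R^{-1} P
print(w, mse)
```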
Iterative algorithms
Steepest descent algorithm (move in the direction of the negative gradient):
w(k+1) = w(k) − µ∇J(w(k)) = w(k) + µ(P − Rw(k))
Least mean square (LMS) algorithm (approximate the gradient from a single training example):
∇J(w(k)) ≈ −e(k)x(k)
w(k+1) = w(k) + µ e(k)x(k)
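A minimal LMS sketch (the target weights, noise level, and step size are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
a = np.array([0.5, 1.5])                    # illustrative target weights
mu = 0.01                                   # step size

w = np.zeros(2)
for _ in range(5000):
    x = rng.standard_normal(2)              # training input x(k)
    d = a @ x + 0.05 * rng.standard_normal()  # desired output d(k)
    e = d - w @ x                           # error e(k) = d(k) - y(k)
    w = w + mu * e * x                      # w(k+1) = w(k) + mu e(k) x(k)

print(w)                                    # should approach the Wiener solution
```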
Steepest Descent Convergence
w(k+1) = w(k) + µ(P − Rw(k)); let w* be the solution (Rw* = P). Assume R is nonsingular.
Center the weight vector: v = w − w*. Then v(k+1) = (I − µR)v(k).
Decorrelate the weight vector: u = Q⁻¹v, where R = QΛQ⁻¹ is the transformation that diagonalizes R.
u(k+1) = (I − µΛ)u(k), so u(k) = (I − µΛ)^k u(0).
Condition for convergence: 0 < µ < 2/λmax.
LMS Algorithm Properties
– Steepest descent and LMS convergence depend on the step size µ and the eigenvalues of R.
– The LMS algorithm is simple to implement.
– LMS convergence is relatively slow.
– Tradeoff between convergence speed and excess MSE.
– The LMS algorithm can track training data that is time varying.
Adaptive MMSE Methods
Training data:
– Linear MMSE: LMS, RLS algorithms
– Nonlinear: decision feedback detectors
Blind algorithms:
– Second order statistics: minimum output energy methods; reduced order approximations (PCA, multistage Wiener filter)
– Higher order statistics: cumulants, information based criteria
Designing a learning system
Given a set of training data, design a system that can realize the desired task.
[Block diagram: Inputs → Signal Processing → Feature Extraction → Neural Network → Outputs]
IV. Multilayer Networks
A. Capabilities
– Depend directly on the total number of weights and threshold values.
– A one hidden layer network with a sufficient number of hidden units can arbitrarily closely approximate any Boolean function, pattern recognition problem, and well behaved function approximation problem.
– Sigmoidal units are more powerful than linear threshold units.
B. Error backpropagation
The error backpropagation algorithm is a methodical way of implementing the LMS algorithm for multilayer neural networks.
– Two passes: forward pass (computational pass) and backward pass (weight correction pass).
– Analog computations based on the MSE criterion.
– Hidden units are usually sigmoidal units.
– Initialization: weights take on small random values.
– The algorithm may not converge to a global minimum.
– The algorithm converges more slowly than for linear networks.
– The representation is distributed.
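The two passes can be sketched for a one hidden layer network with tanh units and the MSE criterion (a minimal sketch; the XOR data, layer sizes, and step size are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
D = np.array([[-1.], [1.], [1.], [-1.]])   # XOR with +/-1 targets

# small random initial weights (avoids symmetric starting points)
W1 = 0.5 * rng.standard_normal((2, 4)); b1 = np.zeros(4)
W2 = 0.5 * rng.standard_normal((4, 1)); b2 = np.zeros(1)
mu = 0.1

for _ in range(2000):
    # forward pass: compute the outputs of the computational nodes
    H = np.tanh(X @ W1 + b1)
    Y = np.tanh(H @ W2 + b2)
    E = Y - D
    # backward pass: propagate error terms (deltas) from output layer back
    d2 = E * (1 - Y**2)                    # tanh'(s) = 1 - tanh(s)^2
    d1 = (d2 @ W2.T) * (1 - H**2)
    # gradient descent on the MSE criterion
    W2 -= mu * H.T @ d2; b2 -= mu * d2.sum(0)
    W1 -= mu * X.T @ d1; b1 -= mu * d1.sum(0)

print(np.round(Y.ravel(), 2))
```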
BP Algorithm Comments
– δs are error terms computed from the output layer back to the first layer in the dual network.
– Training is usually done online; examples are presented in random or sequential order.
– The update rule is local: a weight change only involves the connections adjacent to that weight.
– Computational complexity depends on the number of computational units.
– Initial weights are randomized to help avoid converging to poor local minima.
BP Algorithm Comments continued
– Threshold weights are updated in a similar manner to other weights (input = 1).
– A momentum term can be added to speed up convergence.
– The step size is set to a small value.
– Sigmoidal activation derivatives are simple to compute.
BP Architecture
[Diagram: a forward network computes the outputs of the computational nodes; a sensitivity (dual) network computes the error terms]
Modifications to BP Algorithm
– Batch procedure
– Variable step size
– Better approximation of gradient method (momentum term, conjugate gradient)
– Newton methods (Hessian)
– Alternate cost functions
– Regularization
– Network construction algorithms
– Incorporating time
When to stop training
– Major features are captured first; as training continues, minor features are captured.
– Look at the training error.
– Cross-validation (training, validation, and test sets).
[Figure: training error and testing error curves versus training time]
Learning is typically slow and may find flat learning areas with little improvement in the energy function.
C. Radial Basis Functions
– Use locally receptive units (potential functions).
– Transform the input space to the hidden unit space via potential functions.
– The output unit is linear.
[Diagram: inputs feed potential units Φ, whose outputs are combined by a linear unit Σ to give the output]
Potential units: Φ(x) = exp(−.5 ||x − c||² / σ²)
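A minimal RBF network sketch: Gaussian potential units in the first layer and a linear least squares fit for the output weights (the XOR-style data, the choice of one center per training point, and the width value are illustrative assumptions):

```python
import numpy as np

def rbf_features(X, centers, sigma):
    # potential units: Phi(x) = exp(-0.5 ||x - c||^2 / sigma^2)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / sigma ** 2)

# XOR-like problem: not linearly separable in the input space
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
d = np.array([-1., 1., 1., -1.])

centers = X.copy()                            # one center per training point
Phi = rbf_features(X, centers, sigma=0.5)     # first layer: fixed centers/widths
w, *_ = np.linalg.lstsq(Phi, d, rcond=None)   # second layer: linear output unit
y = Phi @ w
print(np.sign(y))
```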
Transformation of input space
[Figure: mapping Φ: X → Z takes points from the input space, where the X and O classes are not linearly separable, to a feature space where they are]
Training Radial basis functions
– Use gradient descent on the unknown parameters: centers, widths, and output weights.
– Or separate the tasks for quicker training: first layer (centers, widths), second layer (weights).
– First layer options:
  – Fix widths; determine centers from a lattice structure.
  – Fix widths; use a clustering algorithm for centers.
  – Resource allocation network.
– Second layer: use LMS to learn the weights.
Comparisons between RBFs and BP Algorithm
– RBF networks have a single hidden layer; BP networks can have many hidden layers.
– RBF (potential functions) uses locally receptive units, versus BP (sigmoidal units) with distributed representations.
– RBF networks typically have many more hidden units.
– RBF training is typically quicker.
V. Alternate Detection Method
Consider detection methods based on optimum margin classifiers, or Support Vector Machines (SVMs).
– SVMs are based on concepts from statistical learning theory.
– SVMs are easily extended to nonlinear decision regions via kernel functions.
– SVM solutions involve solving quadratic programming problems.
Optimal Margin Classifiers
[Figure: linearly separable sets of X and O points with candidate separating hyperplanes]
Given a set of points that are linearly separable: which hyperplane should you choose to separate the points?
Choose the hyperplane that maximizes the distance between the two sets of points.
Finding the Optimal Hyperplane
– Draw a convex hull around each set of points.
– Find the shortest line segment connecting the two convex hulls.
– Find the midpoint of the line segment.
– The optimal hyperplane intersects the line segment at its midpoint, perpendicular to the line segment.
[Figure: convex hulls of the X and O points, weight vector w, margins, and the optimal hyperplane]
Alternative Characterization of Optimal Margin Classifiers
[Figure: points u and v on opposite margins, weight vector w, margin width m]
Maximizing margins is equivalent to minimizing the magnitude of the weight vector:
w^T u + b = 1
w^T v + b = −1
Subtracting: w^T (u − v) = 2, so w^T (u − v)/||w|| = 2/||w|| = 2m.
Solution in 1 Dimension
O O O O O X O X X O X X X
(Some points lie on the wrong side of the hyperplane.)
If C is large, SV include
If C is small, SV include all points (scaled MMSE solution).
Note that the weight vector depends most heavily on the outer support vectors.
Comments on 1 Dimensional Solution
– A simple algorithm can be implemented to solve the 1D problem.
– The solution in multiple dimensions finds the weight vector and then projects down to 1D.
– The minimum probability of error threshold depends on the likelihood ratio.
– The MMSE solution depends on all points, whereas the SVM depends on the SVs (points that are under the margin, closer to the min. probability of error threshold).
– Min. probability of error, the MMSE solution, and the SVM in general give different detectors.
Kernel Methods
In many classification and detection problems a linear classifier is not sufficient. However, working in higher dimensions can lead to the "curse of dimensionality".
Solution: use kernel methods, where computations are done in the dual observation space.
[Figure: mapping Φ: X → Z from input space to feature space]
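The kernel idea can be illustrated with a standard polynomial-kernel identity (not specific to these slides): k(x, z) = (x^T z + 1)² equals an inner product in a 6-dimensional feature space, so all computations involving inner products can be done without ever forming Φ(x) explicitly.

```python
import numpy as np

def phi(x):
    # explicit degree-2 feature map for a 2D input:
    # (x1^2, x2^2, sqrt(2) x1 x2, sqrt(2) x1, sqrt(2) x2, 1)
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([x1 * x1, x2 * x2, s * x1 * x2, s * x1, s * x2, 1.0])

def poly_kernel(x, z):
    # k(x, z) = (x^T z + 1)^2 = <phi(x), phi(z)>, computed in the input space
    return (x @ z + 1.0) ** 2

x = np.array([1.0, 2.0]); z = np.array([3.0, -1.0])
print(phi(x) @ phi(z), poly_kernel(x, z))  # the two values agree
```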
Solving the QP problem
SVMs require solving large QP problems. However, many αs are zero (not support vectors). Break up the QP into subproblems.
– Chunking: (Vapnik 1979) numerical solution.
– Osuna algorithm: (1997) numerical solution.
– Platt algorithm: (1998) Sequential Minimal Optimization (SMO), analytical solution.
SMO Algorithm
Sequential Minimal Optimization breaks the QP problem into small subproblems that are solved analytically.
SMO solves the dual QP SVM problem by examining points that violate the KKT conditions.
The algorithm converges and consists of:
1) Search for 2 points that violate the KKT conditions.
2) Solve the QP problem for the 2 points.
3) Calculate the threshold value b.
4) Continue until all points satisfy the KKT conditions.
On numerous benchmarks the time to convergence of SMO varied from O(l) to O(l^2.2). Convergence time depends on the difficulty of the classification problem and the kernel functions used.
SVM Summary
– SVMs are based on optimum margin classifiers and are solved using quadratic programming methods.
– SVMs are easily extended to problems that are not linearly separable.
– SVMs can create nonlinear separating surfaces via kernel functions.
– SVMs can be efficiently programmed via the SMO algorithm.
– SVMs can be extended to solve regression problems.
VI. Unsupervised Learning
A. Motivation
Given a set of training examples with no teacher or critic, why do we learn?
– Feature extraction
– Data compression
– Signal detection and recovery
– Self organization
Information can be found about the data from the inputs alone.
B. Principal Component Analysis
Introduction
Consider a zero mean random vector x ∈ R^n with autocorrelation matrix R = E[xx^T].
R has eigenvectors q(1), …, q(n) and associated eigenvalues λ(1) ≥ … ≥ λ(n).
Let Q = [q(1) | … | q(n)] and let Λ be a diagonal matrix containing the eigenvalues along its diagonal.
Then R = QΛQ^T is the eigenvector/eigenvalue decomposition of R.
First Principal Component
Find max x^T R x subject to ||x|| = 1. The maximum is obtained when x = q(1), as this corresponds to x^T R x = λ(1).
q(1) is the first principal component of x and also yields the direction of maximum variance.
y(1) = q(1)^T x is the projection of x onto the first principal component.
[Figure: vector x projected onto q(1), giving y(1)]
Other Principal Components
The ith principal component is denoted by q(i) and its projection by y(i) = q(i)^T x, with E[y(i)] = 0 and E[y(i)²] = λ(i).
Note that y = Q^T x, and we can recover the data vector x from y by noting that x = Qy.
We can approximate x by taking the first m principal components (PCs) to get z: z = q(1)y(1) + … + q(m)y(m). The error is e = x − z; e is orthogonal to q(i) for 1 ≤ i ≤ m.
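The decomposition above can be checked numerically (a sketch with synthetic data, estimating R by a time average as in the batch algorithm):

```python
import numpy as np

rng = np.random.default_rng(0)
# zero mean data with most variance along the first axis (illustrative)
X = rng.standard_normal((5000, 2)) * np.array([3.0, 1.0])

R = X.T @ X / len(X)                  # sample autocorrelation matrix
lams, Q = np.linalg.eigh(R)           # eigh returns ascending eigenvalues
order = np.argsort(lams)[::-1]
lams, Q = lams[order], Q[:, order]    # reorder so lambda(1) >= ... >= lambda(n)

y = X @ Q                             # projections y = Q^T x (one per row)
z = y[:, :1] @ Q[:, :1].T             # approximation from the first PC (m = 1)
print(lams, np.mean((X - z) ** 2))    # residual variance comes from discarded PCs
```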
Diagram of PCA
[Figure: scatter of data points with the first and second principal component directions marked]
The first PC gives more information than the second PC.
Learning algorithms for PCA
Hebbian learning rule: when the presynaptic and postsynaptic signals are positive, the weight associated with the synapse increases in strength.
∆w = µ x y
[Diagram: synapse with input x, weight w, and output y]
Oja's rule
Use the normalized Hebbian rule applied to a linear neuron.
[Diagram: linear unit, inputs x with weights w summed (Σ), output y = s]
The Hebbian rule must be normalized; otherwise the weight vector will grow unbounded.
Oja's rule continued
w_i(k+1) = w_i(k) + µ x_i(k) y(k) (apply Hebbian rule)
w(k+1) = w(k+1) / ||w(k+1)|| (renormalize the weights)
Unfortunately the above rule is difficult to implement, so a modification approximates it, giving
w_i(k+1) = w_i(k) + µ y(k)(x_i(k) − y(k) w_i(k))
This is similar to the Hebbian rule with a modified input.
One can show that w(k) → q(1) with probability one, given that x(k) is zero mean, second order, and drawn from a fixed distribution.
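A minimal simulation of the modified rule (synthetic zero mean data with a dominant direction along the first axis; the step size and iteration count are assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.005

w = np.array([1.0, 1.0]) / np.sqrt(2)       # arbitrary initial unit vector
for _ in range(20000):
    x = rng.standard_normal(2) * np.array([3.0, 1.0])  # dominant variance on axis 1
    y = w @ x
    w = w + mu * y * (x - y * w)            # Oja's rule

print(w)   # should align with q(1) = (+/-1, 0) and have norm close to 1
```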
Learning other PCs
Adaptive learning rules (subtract larger PCs out)– Generalized Hebbian Algorithm– APEX
Batch Algorithm (singular value decomposition)– Approximate correlation matrix R with time averages.
Applications of PCA
– Matched filter problem: x(k) = s(k) + σv(k).
– Multiuser communications: CDMA.
– Image coding (data compression): GHA with a quantizer for PCA-based compression.
C. Independent Component Analysis
PCA decorrelates inputs. However, in many instances we may want to make the outputs independent.
[Block diagram: independent sources U → mixing matrix A → observations X → demixing matrix W → outputs Y]
The inputs U are assumed independent and the user sees X. The goal is to find W so that Y is independent.
ICA Solution
– Y = DPU, where D is a diagonal matrix and P is a permutation matrix.
– The algorithm is unsupervised. Under what assumptions is learning possible? All components of U except possibly one must be nongaussian.
– Establish a criterion to learn from (use higher order statistics): information based criteria, kurtosis function.
– Kullback-Leibler divergence:
D(f, g) = ∫ f(x) log(f(x)/g(x)) dx
ICA Information Criterion
The Kullback-Leibler divergence is nonnegative. Set f to the joint density of Y and g to the product of the marginals of Y; then
D(f, g) = −H(Y) + Σ H(Yi),
which is minimized when the components of Y are independent.
When the outputs are independent they are a permutation and scaled version of U.
Learning Algorithms
– Weights can be learned by approximating the divergence cost function using contrast functions.
– Iterative gradient estimate algorithms can be used.
– Faster convergence can be achieved with fixed point algorithms that approximate Newton's method.
– These algorithms have been shown to converge.
Applications of ICA
– Array antenna processing
– Blind source separation: speech separation, biomedical signals, financial data
D. Competitive Learning
Motivation: neurons compete with one another, with only one winner emerging.
The brain is a topologically ordered computational map; arrays of neurons self organize.
Generalized competitive learning algorithm:
1) Initialize weights.
2) Randomly choose an input.
3) Pick the winner.
4) Update the weights associated with the winner.
5) Go to 2).
Competitive Learning Algorithm
K means algorithm (no topological ordering):
– Online algorithm
– Update centers
– Reclassify points
– Converges to a local minimum
Kohonen Self Organizing Feature Map (topological ordering):
– Neurons arranged on a lattice
– The weights that are updated depend on the winner, step size, and neighborhood
– Decrease the step size and neighborhood size to get topological ordering
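The generalized competitive steps can be sketched as online K means (a minimal sketch; the two-cluster data and the decaying step size schedule are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
# two well separated clusters of points
data = np.vstack([rng.standard_normal((500, 2)) + [5, 5],
                  rng.standard_normal((500, 2)) - [5, 5]])

c = data[rng.choice(len(data), 2, replace=False)].copy()  # 1) initialize centers
for k in range(4000):
    x = data[rng.integers(len(data))]           # 2) randomly choose an input
    j = np.argmin(((c - x) ** 2).sum(1))        # 3) pick the winner (closest center)
    c[j] += (1.0 / (1 + 0.01 * k)) * (x - c[j])  # 4) update only the winner
print(c)   # centers should settle near the two cluster means
```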
KSOFM 2 dimensional lattice
[Figure: two-dimensional lattice of KSOFM neurons]
Neural Network Applications
Backgammon (feedforward network)
– 459-24-24-1 network to rate moves
– Hand crafted examples; noise helped in training
– 59% winning percentage against SUN gammontools
– Later versions used reinforcement learning
Handwritten zip codes (feedforward network)
– 16-768-192-30-10 network to distinguish digits
– Preprocessed data; 2 hidden layers act as feature detectors
– 7291 training examples, 2000 test examples
– Error rates: 0.14% on training data, 5% on test data; 1% on test data with 12% rejected
Neural Network Applications
Speech recognition
– KSOFM followed by a feedforward neural network
– 40-120 frames mapped onto a 12 by 12 Kohonen map
– Each frame composed of a 600 to 1800 element analog vector
– Output of the Kohonen map fed to the feedforward network
– Reduced search using the KSOFM map
– TI 20 word database: 98-99% correct on speaker dependent classification
Other topics
– Reinforcement learning
– Associative networks
– Neural dynamics and control
– Computational learning theory
– Bayesian learning
– Neuroscience
– Cognitive science
– Hardware implementation