TRANSCRIPT
1
Performance Optimization
Steepest Descent
2
Objective

To learn algorithms that optimize a performance index F(x), i.e., to find the value of x that minimizes F(x).
3
Basic Optimization Algorithm

$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k$

or

$\Delta\mathbf{x}_k = \mathbf{x}_{k+1} - \mathbf{x}_k = \alpha_k \mathbf{p}_k$

where p_k is the search direction and α_k is the learning rate.

[Diagram: the step from x_k to x_{k+1} along α_k p_k.]
4
Steepest Descent

Choose the next step so that the function decreases:

$F(\mathbf{x}_{k+1}) < F(\mathbf{x}_k)$
5
Steepest Descent

For small changes in x we can approximate F(x):

$F(\mathbf{x}_{k+1}) = F(\mathbf{x}_k + \Delta\mathbf{x}_k) \approx F(\mathbf{x}_k) + \mathbf{g}_k^T \Delta\mathbf{x}_k$

where

$\mathbf{g}_k \equiv \nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}$
6
Steepest Descent

$F(\mathbf{x}_{k+1}) = F(\mathbf{x}_k + \Delta\mathbf{x}_k) \approx F(\mathbf{x}_k) + \mathbf{g}_k^T \Delta\mathbf{x}_k$

If we want the function to decrease:

$\mathbf{g}_k^T \Delta\mathbf{x}_k = \alpha_k \mathbf{g}_k^T \mathbf{p}_k < 0$
7
Steepest Descent

$F(\mathbf{x}_{k+1}) = F(\mathbf{x}_k + \Delta\mathbf{x}_k) \approx F(\mathbf{x}_k) + \mathbf{g}_k^T \Delta\mathbf{x}_k$

If we want the function to decrease:

$\mathbf{g}_k^T \Delta\mathbf{x}_k = \alpha_k \mathbf{g}_k^T \mathbf{p}_k < 0$

We can maximize the decrease by choosing:

$\mathbf{p}_k = -\mathbf{g}_k$

$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{g}_k$
8
Steepest Descent

We can maximize the decrease by choosing:

$\mathbf{p}_k = -\mathbf{g}_k$

$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha_k \mathbf{g}_k$

Two general methods to select α_k (both options are sketched in the code below):
- minimize F(x) with respect to α_k
- use a predetermined value (e.g., 0.2, 1/k)
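As a rough illustration (not part of the original slides), the sketch below assumes a quadratic performance index F(x) = ½xᵀAx + dᵀx + c, for which minimizing along p = −g has the closed form α = (gᵀg)/(gᵀAg); the matrix A, vector d, and starting point are made-up values.

```python
import numpy as np

def steepest_descent_step(x, A, d, alpha=None):
    """One steepest descent step on the quadratic F(x) = 0.5*x'Ax + d'x + c.

    If alpha is None, the step size is chosen by minimizing F along the
    search direction p = -g (alpha = g'g / g'Ag for a quadratic);
    otherwise the given fixed value is used.
    """
    g = A @ x + d                       # gradient of the quadratic
    if alpha is None:
        alpha = (g @ g) / (g @ A @ g)   # exact line minimization
    return x - alpha * g

# Made-up quadratic for illustration
A = np.array([[2.0, 1.0], [1.0, 2.0]])
d = np.array([-1.0, 1.0])
x_line, x_fixed = np.zeros(2), np.zeros(2)
for _ in range(10):
    x_line = steepest_descent_step(x_line, A, d)               # minimize F w.r.t. alpha
    x_fixed = steepest_descent_step(x_fixed, A, d, alpha=0.2)  # predetermined alpha
print(x_line, x_fixed, np.linalg.solve(A, -d))                 # both approach -A^{-1}d = [1, -1]
```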
9
Example

$F(\mathbf{x}) = x_1^2 + 2x_1x_2 + 2x_2^2 + x_1, \qquad \mathbf{x}_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}$

$\nabla F(\mathbf{x}) = \begin{bmatrix} \partial F(\mathbf{x})/\partial x_1 \\ \partial F(\mathbf{x})/\partial x_2 \end{bmatrix} = \begin{bmatrix} 2x_1 + 2x_2 + 1 \\ 2x_1 + 4x_2 \end{bmatrix}, \qquad \mathbf{g}_0 = \nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_0} = \begin{bmatrix} 3 \\ 3 \end{bmatrix}$

With $\alpha = 0.1$:

$\mathbf{x}_1 = \mathbf{x}_0 - \alpha\mathbf{g}_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} - 0.1\begin{bmatrix} 3 \\ 3 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix}$

$\mathbf{x}_2 = \mathbf{x}_1 - \alpha\mathbf{g}_1 = \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix} - 0.1\begin{bmatrix} 1.8 \\ 1.2 \end{bmatrix} = \begin{bmatrix} 0.02 \\ 0.08 \end{bmatrix}$
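A small Python sketch (not in the original slides) that reproduces the iterations above with α = 0.1; NumPy is assumed.

```python
import numpy as np

def F(x):
    """Performance index from the example: F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1."""
    return x[0]**2 + 2*x[0]*x[1] + 2*x[1]**2 + x[0]

def grad_F(x):
    """Gradient of F(x)."""
    return np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

x = np.array([0.5, 0.5])
alpha = 0.1
for k in range(2):
    x = x - alpha * grad_F(x)   # steepest descent step
    print(x)                    # x1 = [0.2, 0.2], x2 = [0.02, 0.08]
```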
10
Plot

[Contour plot of F(x) over −2 ≤ x₁, x₂ ≤ 2, showing the steepest descent iterations from the example.]
11
Stable Learning Rates
Suppose that the performance index is a quadratic function:

$F(\mathbf{x}) = \tfrac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x} + \mathbf{d}^T\mathbf{x} + c, \qquad \nabla F(\mathbf{x}) = \mathbf{A}\mathbf{x} + \mathbf{d}$

Steepest descent algorithm with constant learning rate:

$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\mathbf{g}_k = \mathbf{x}_k - \alpha(\mathbf{A}\mathbf{x}_k + \mathbf{d})$

$\mathbf{x}_{k+1} = [\mathbf{I} - \alpha\mathbf{A}]\mathbf{x}_k - \alpha\mathbf{d}$

A linear dynamic system will be stable if the eigenvalues of the matrix [I − αA] are less than one in magnitude.
13
Stable Learning Rates
Let {λ₁, λ₂, …, λₙ} and {z₁, z₂, …, zₙ} be the eigenvalues and eigenvectors of the Hessian matrix A. Then

$[\mathbf{I} - \alpha\mathbf{A}]\mathbf{z}_i = \mathbf{z}_i - \alpha\mathbf{A}\mathbf{z}_i = \mathbf{z}_i - \alpha\lambda_i\mathbf{z}_i = (1 - \alpha\lambda_i)\mathbf{z}_i$

The condition for the stability of the steepest descent algorithm is then

$|1 - \alpha\lambda_i| < 1$

Assume that the quadratic function has a strong minimum point; then its eigenvalues must be positive numbers. Hence,

$\alpha < \frac{2}{\lambda_i}$

This must be true for all eigenvalues:

$\alpha < \frac{2}{\lambda_{max}}$
14
Example

$\mathbf{A} = \begin{bmatrix} 2 & 2 \\ 2 & 4 \end{bmatrix}$

$\lambda_1 = 0.764, \ \mathbf{z}_1 = \begin{bmatrix} 0.851 \\ -0.526 \end{bmatrix}, \qquad \lambda_2 = 5.24, \ \mathbf{z}_2 = \begin{bmatrix} 0.526 \\ 0.851 \end{bmatrix}$

$\alpha < \frac{2}{\lambda_{max}} = \frac{2}{5.24} = 0.38$

[Contour plots with steepest descent trajectories for α = 0.37 (stable) and α = 0.39 (unstable).]
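The stability bound can be checked numerically. The sketch below (not part of the slides) uses the example's Hessian A and the gradient ∇F(x) = Ax + d with d = [1, 0]ᵀ, and iterates the steepest descent recursion for learning rates just below and just above 2/λmax ≈ 0.38.

```python
import numpy as np

A = np.array([[2.0, 2.0], [2.0, 4.0]])   # Hessian of F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1
d = np.array([1.0, 0.0])                 # so that grad F(x) = A @ x + d

eigvals = np.linalg.eigvalsh(A)
print(eigvals, 2.0 / eigvals.max())      # ~[0.76, 5.24] and alpha_max ~ 0.38

def final_norm(alpha, steps=200):
    """Run x_{k+1} = x_k - alpha*(A x_k + d) and return ||x|| after `steps` iterations."""
    x = np.array([0.5, 0.5])
    for _ in range(steps):
        x = x - alpha * (A @ x + d)
    return np.linalg.norm(x)

print(final_norm(0.37))   # stays bounded (converges to the minimum)
print(final_norm(0.39))   # grows without bound (unstable)
```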
15
CHAPTER 10
Widrow-Hoff Learning
16
Objectives
Widrow-Hoff learning is an approximate steepest descent algorithm, in which the performance index is mean square error.
It is widely used today in many signal processing applications.
It is a precursor to the backpropagation algorithm for multilayer networks.
17
ADALINE Network
The ADALINE (Adaptive Linear Neuron) network and its learning rule, the LMS (Least Mean Square) algorithm, were proposed by Widrow and Marcian Hoff in 1960.
Both the ADALINE network and the perceptron suffer from the same inherent limitation: they can only solve linearly separable problems.
The LMS algorithm minimizes mean square error (MSE), and therefore tries to move the decision boundaries as far from the training patterns as possible.
18
ADALINE Network
$\mathbf{n} = \mathbf{Wp} + \mathbf{b}, \qquad \mathbf{a} = \mathrm{purelin}(\mathbf{Wp} + \mathbf{b})$

[Diagrams: single-layer perceptron and ADALINE network; input p (R×1), weight matrix W (S×R), bias b (S×1), net input n (S×1), output a (S×1).]
19
Single ADALINE
Setting n = 0, the equation Wp + b = 0 specifies a decision boundary. The ADALINE can be used to classify objects into two categories if they are linearly separable.

$a = \mathrm{purelin}(n) = n = {}_1\mathbf{w}^T\mathbf{p} + b = w_{1,1}p_1 + w_{1,2}p_2 + b$

[Diagram: two-input ADALINE and its decision boundary ₁wᵀp + b = 0 in the (p₁, p₂) plane, with a > 0 on one side and a < 0 on the other.]
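A minimal sketch (not from the slides) of a two-input ADALINE used as a classifier; the weights and bias below are made-up values chosen so the decision boundary is p1 + p2 − 1 = 0.

```python
import numpy as np

def purelin(n):
    """Linear transfer function of the ADALINE."""
    return n

def adaline(p, w, b):
    """Single ADALINE output: a = purelin(1w' p + b)."""
    return purelin(w @ p + b)

w = np.array([1.0, 1.0])   # made-up weights
b = -1.0                   # made-up bias -> boundary p1 + p2 - 1 = 0

for p in [np.array([0.0, 0.0]), np.array([2.0, 1.0])]:
    a = adaline(p, w, b)
    print(p, a, "category 1" if a > 0 else "category 2")
```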
20
Mean Square Error
The LMS algorithm is an example of supervised training. It adjusts the weights and biases of the ADALINE in order to minimize the mean square error, where the error is the difference between the target output (t_q) and the network output (a_q).

Collecting the parameters and the input into single vectors,

$\mathbf{x} = \begin{bmatrix} {}_1\mathbf{w} \\ b \end{bmatrix}, \qquad \mathbf{z} = \begin{bmatrix} \mathbf{p} \\ 1 \end{bmatrix}$

the network output is

$a = {}_1\mathbf{w}^T\mathbf{p} + b = \mathbf{x}^T\mathbf{z}$

MSE (E[·]: expected value):

$F(\mathbf{x}) = E[e^2] = E[(t - a)^2] = E[(t - \mathbf{x}^T\mathbf{z})^2]$
21
Mean Square Error
$\mathbf{h} = E[t\mathbf{z}]$: cross-correlation vector between the input z and its associated target t
$\mathbf{R} = E[\mathbf{z}\mathbf{z}^T]$: input correlation matrix
$c = E[t^2]$

Expanding the MSE:

$F(\mathbf{x}) = E[(t - \mathbf{x}^T\mathbf{z})^2] = E[t^2 - 2t\mathbf{x}^T\mathbf{z} + \mathbf{x}^T\mathbf{z}\mathbf{z}^T\mathbf{x}] = E[t^2] - 2\mathbf{x}^T E[t\mathbf{z}] + \mathbf{x}^T E[\mathbf{z}\mathbf{z}^T]\mathbf{x}$

$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$

Using the gradient identities

$\nabla(\mathbf{h}^T\mathbf{x}) = \nabla(\mathbf{x}^T\mathbf{h}) = \mathbf{h}$ (h: constant vector)
$\nabla(\mathbf{x}^T\mathbf{R}\mathbf{x}) = \mathbf{R}\mathbf{x} + \mathbf{R}^T\mathbf{x} = 2\mathbf{R}\mathbf{x}$ (R: symmetric matrix)

the minimum is found from

$\nabla F(\mathbf{x}) = -2\mathbf{h} + 2\mathbf{R}\mathbf{x} = 0 \quad\Rightarrow\quad \mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h}$
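The quantities c, h, and R are usually estimated from data. The sketch below (not in the slides) estimates them from made-up samples and solves ∇F(x) = 0 for the minimum-MSE weights x* = R⁻¹h.

```python
import numpy as np

def mse_statistics(z, t):
    """Estimate c = E[t^2], h = E[t z] and R = E[z z'] from samples.
    Each column of z is an input vector; t holds the corresponding targets."""
    c = np.mean(t ** 2)
    h = (z * t).mean(axis=1)
    R = (z @ z.T) / z.shape[1]
    return c, h, R

# Made-up data: targets generated by a noisy linear rule t = 0.7*z1 - 0.3*z2 + noise
rng = np.random.default_rng(0)
z = rng.standard_normal((2, 1000))
t = 0.7 * z[0] - 0.3 * z[1] + 0.05 * rng.standard_normal(1000)

c, h, R = mse_statistics(z, t)
x_star = np.linalg.solve(R, h)   # x* = R^{-1} h, the minimum of F(x) = c - 2x'h + x'Rx
print(x_star)                    # close to [0.7, -0.3]
```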
Example 1
23
Solved Problem P10.3
[Figure: contour plot of the performance surface in the (w₁,₁, w₁,₂) plane.]

So the contours of the performance surface will be circular. The center of the contours (the minimum point) is x*.
Approximate Steepest Descent
Approximate Gradient
Approximate Gradient (cont.)
Approximate Gradient (cont.)
28
LMS Algorithm
The steepest descent algorithm with constant learning rate is

$\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha \nabla F(\mathbf{x})\big|_{\mathbf{x}=\mathbf{x}_k}$

Replacing the true gradient with the single-sample estimate $\hat{\nabla}F(\mathbf{x}) = \nabla e^2(k) = -2e(k)\mathbf{z}(k)$ gives

$\mathbf{x}_{k+1} = \mathbf{x}_k + 2\alpha e(k)\mathbf{z}(k)$

Matrix notation of the LMS algorithm:

$\mathbf{W}(k+1) = \mathbf{W}(k) + 2\alpha e(k)\mathbf{p}^T(k)$
$\mathbf{b}(k+1) = \mathbf{b}(k) + 2\alpha e(k)$
The LMS algorithm is also referred to as the delta rule or the Widrow-Hoff learning algorithm.
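A compact sketch of the update above (not part of the slides), for a purelin layer with S outputs and R inputs; the example call at the bottom uses the first orange/apple step from the following slides, with a bias added purely for illustration.

```python
import numpy as np

def lms_step(W, b, p, t, alpha):
    """One LMS (Widrow-Hoff) update:
       a = W p + b,  e = t - a,
       W <- W + 2*alpha*e*p',  b <- b + 2*alpha*e."""
    a = W @ p + b
    e = t - a
    W = W + 2 * alpha * np.outer(e, p)
    b = b + 2 * alpha * e
    return W, b, e

# One output, three inputs; zero initial weights, bias added only for illustration
W = np.zeros((1, 3))
b = np.zeros(1)
W, b, e = lms_step(W, b, np.array([1.0, -1.0, -1.0]), np.array([-1.0]), alpha=0.2)
print(W, b, e)   # W = [[-0.4, 0.4, 0.4]], b = [-0.4], e = [-1.0]
```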
29
Quadratic Functions
General form of a quadratic function:

$F(\mathbf{x}) = c + \mathbf{d}^T\mathbf{x} + \tfrac{1}{2}\mathbf{x}^T\mathbf{A}\mathbf{x}$
$\nabla F(\mathbf{x}) = \mathbf{d} + \mathbf{A}\mathbf{x}$
$\nabla^2 F(\mathbf{x}) = \mathbf{A}$

ADALINE network mean square error:

$F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$

so $\mathbf{d} = -2\mathbf{h}$ and $\mathbf{A} = 2\mathbf{R}$ (A: Hessian matrix).
If the eigenvalues of the Hessian matrix are all positive, then the quadratic function will have one unique global minimum.
30
Orange/Apple Example
Training set (orange and apple patterns):

$\left\{\mathbf{p}_1 = \begin{bmatrix} 1 \\ -1 \\ -1 \end{bmatrix}, t_1 = -1\right\}, \qquad \left\{\mathbf{p}_2 = \begin{bmatrix} 1 \\ 1 \\ -1 \end{bmatrix}, t_2 = 1\right\}$

$\mathbf{R} = E[\mathbf{p}\mathbf{p}^T] = \tfrac{1}{2}\mathbf{p}_1\mathbf{p}_1^T + \tfrac{1}{2}\mathbf{p}_2\mathbf{p}_2^T = \begin{bmatrix} 1 & 0 & -1 \\ 0 & 1 & 0 \\ -1 & 0 & 1 \end{bmatrix}$

Eigenvalues of R: {0.0, 1.0, 2.0}. Since the Hessian of the MSE is A = 2R, the condition α < 2/λmax(A) becomes

$\alpha < 1/\lambda_{max}(\mathbf{R}) = 1/2.0 = 0.5$

In practical applications it might not be practical to calculate R, so the learning rate is often selected by trial and error.
31
Orange/Apple Example
Start, arbitrarily, with all the weights set to zero, and then apply the inputs p1, p2, p1, p2, etc., in that order, calculating the new weights after each input is presented.

$\mathbf{W}(0) = \begin{bmatrix} 0 & 0 & 0 \end{bmatrix}, \qquad \alpha = 0.2$

First iteration (present p1):

$a(0) = \mathbf{W}(0)\mathbf{p}(0) = \mathbf{W}(0)\mathbf{p}_1 = 0, \qquad e(0) = t(0) - a(0) = -1 - 0 = -1$

$\mathbf{W}(1) = \mathbf{W}(0) + 2\alpha e(0)\mathbf{p}^T(0) = \begin{bmatrix} -0.4 & 0.4 & 0.4 \end{bmatrix}$

Second iteration (present p2):

$a(1) = \mathbf{W}(1)\mathbf{p}(1) = \mathbf{W}(1)\mathbf{p}_2 = -0.4, \qquad e(1) = t(1) - a(1) = 1 - (-0.4) = 1.4$

$\mathbf{W}(2) = \mathbf{W}(1) + 2\alpha e(1)\mathbf{p}^T(1) = \begin{bmatrix} 0.16 & 0.96 & -0.16 \end{bmatrix}$
32
Orange/Apple Example
Third iteration (present p1 again):

$a(2) = \mathbf{W}(2)\mathbf{p}(2) = \mathbf{W}(2)\mathbf{p}_1 = -0.64, \qquad e(2) = t(2) - a(2) = -1 - (-0.64) = -0.36$

$\mathbf{W}(3) = \mathbf{W}(2) + 2\alpha e(2)\mathbf{p}^T(2) = \begin{bmatrix} 0.0160 & 1.1040 & -0.0160 \end{bmatrix}$

The algorithm converges to

$\mathbf{W}(\infty) = \begin{bmatrix} 0 & 1 & 0 \end{bmatrix}$

This decision boundary falls halfway between the two reference patterns. The perceptron rule did NOT produce such a boundary: the perceptron rule stops as soon as the patterns are correctly classified, even though some patterns may remain close to the boundary. The LMS algorithm, in contrast, minimizes the mean square error.
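A short script (not in the slides) that repeats the iterations above: it forms R for the two patterns, checks its eigenvalues, and then runs the LMS rule with α = 0.2 while alternating p1 and p2.

```python
import numpy as np

patterns = [np.array([1.0, -1.0, -1.0]), np.array([1.0, 1.0, -1.0])]   # orange, apple
targets = [-1.0, 1.0]

R = sum(np.outer(p, p) for p in patterns) / 2          # R = E[p p']
print(np.linalg.eigvalsh(R))                           # eigenvalues {0.0, 1.0, 2.0}

W = np.zeros(3)
alpha = 0.2
for k in range(40):                                    # present p1, p2, p1, p2, ...
    p, t = patterns[k % 2], targets[k % 2]
    e = t - W @ p                                      # error for the current pattern
    W = W + 2 * alpha * e * p                          # LMS update (no bias)
    if k < 3:
        print(k, W)                                    # matches W(1), W(2), W(3) above
print(W)                                               # approaches W(inf) = [0, 1, 0]
```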
Perceptron rule vs. LMS algorithm
Perceptron rule vs. LMS algorithm (cont.)
Perceptron rule vs. LMS algorithm (cont.)
Perceptron rule vs. LMS algorithm (cont.)
37
Solved Problem P10.4
Training set:

$\left\{\mathbf{p}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, t_1 = 1\right\}, \qquad \left\{\mathbf{p}_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}, t_2 = -1\right\}$

Train the network using the LMS algorithm, with the initial guess set to zero and a learning rate α = 0.25.

[Figure: trajectory of the weights in the (w₁,₁, w₁,₂) plane during training.]
38
Solved Problem P10.8
Training patterns (four classes):

class 1: $\left\{\mathbf{p}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \mathbf{p}_2 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}\right\}$, class 2: $\left\{\mathbf{p}_3 = \begin{bmatrix} 2 \\ -1 \end{bmatrix}, \mathbf{p}_4 = \begin{bmatrix} 2 \\ 0 \end{bmatrix}\right\}$

class 3: $\left\{\mathbf{p}_5 = \begin{bmatrix} -1 \\ 2 \end{bmatrix}, \mathbf{p}_6 = \begin{bmatrix} -2 \\ 1 \end{bmatrix}\right\}$, class 4: $\left\{\mathbf{p}_7 = \begin{bmatrix} -1 \\ -1 \end{bmatrix}, \mathbf{p}_8 = \begin{bmatrix} -2 \\ -2 \end{bmatrix}\right\}$

Train the network using the LMS algorithm, with the initial guess set to zero and a learning rate α = 0.04.
39
Tapped Delay Line
[Diagram: tapped delay line — the input y(k) passes through a chain of unit delays D.]

$p_1(k) = y(k), \quad p_2(k) = y(k-1), \quad \ldots, \quad p_R(k) = y(k-R+1)$

At the output of the tapped delay line we have an R-dimensional vector, consisting of the input signal at the current time and at delays of 1 to R−1 time steps.
40
Adaptive Filter
[Diagram: adaptive filter — an ADALINE whose input vector is taken from a tapped delay line on y(k), with weights w₁,₁, …, w₁,R and bias b.]

$a(k) = \mathrm{purelin}(\mathbf{Wp} + b) = \sum_{i=1}^{R} w_{1,i}\, y(k - i + 1) + b$
41
Solved Problem P10.1
[Diagram: ADALINE filter with a two-stage tapped delay line (three taps), weights w₁,₁, w₁,₂, w₁,₃ and no bias.]

$w_{1,1} = 2, \quad w_{1,2} = -1, \quad w_{1,3} = 3$

$\{y(k)\} = \{\ldots, 0, 0, 0, 5, -4, 0, 0, 0, \ldots\}, \qquad y(0) = 5, \ y(1) = -4$

Just prior to k = 0 (k < 0): three zeros have entered the filter, i.e., y(−3) = y(−2) = y(−1) = 0, so the output just prior to k = 0 is zero.

k = 0: $a(0) = \mathbf{W}\mathbf{p}(0) = \begin{bmatrix} 2 & -1 & 3 \end{bmatrix}\begin{bmatrix} 5 \\ 0 \\ 0 \end{bmatrix} = 10$
42
Solved Problem P10.1
k = 1: $a(1) = \mathbf{W}\mathbf{p}(1) = \begin{bmatrix} 2 & -1 & 3 \end{bmatrix}\begin{bmatrix} -4 \\ 5 \\ 0 \end{bmatrix} = -13$

k = 2: $a(2) = \mathbf{W}\mathbf{p}(2) = \begin{bmatrix} 2 & -1 & 3 \end{bmatrix}\begin{bmatrix} 0 \\ -4 \\ 5 \end{bmatrix} = 19$

k = 3: $a(3) = \mathbf{W}\mathbf{p}(3) = \begin{bmatrix} 2 & -1 & 3 \end{bmatrix}\begin{bmatrix} 0 \\ 0 \\ -4 \end{bmatrix} = -12$

k = 4: $a(4) = \mathbf{W}\mathbf{p}(4) = \begin{bmatrix} 2 & -1 & 3 \end{bmatrix}\begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} = 0$
43
Solved Problem P10.1
The effect of y(0) lasts from k = 0 through k = 2, so it has an influence for three time intervals. This corresponds to the length of the impulse response of this filter.

$a(-1) = 0, \quad a(0) = 10, \quad a(1) = -13, \quad a(2) = 19, \quad a(3) = -12, \quad a(4) = 0$

$a(k) = \mathbf{W}\mathbf{p}(k) = \begin{bmatrix} w_{1,1} & w_{1,2} & w_{1,3} \end{bmatrix}\begin{bmatrix} y(k) \\ y(k-1) \\ y(k-2) \end{bmatrix}$
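A quick numerical check (not part of the original solution) of the outputs above, using the weights and input sequence as reconstructed here; NumPy is assumed.

```python
import numpy as np

w = np.array([2.0, -1.0, 3.0])          # w11, w12, w13 from P10.1
y = {0: 5.0, 1: -4.0}                   # y(k) = 0 for every other k

def a(k):
    """Filter output a(k) = w11*y(k) + w12*y(k-1) + w13*y(k-2)."""
    p = np.array([y.get(k, 0.0), y.get(k - 1, 0.0), y.get(k - 2, 0.0)])
    return float(w @ p)

print([a(k) for k in range(-1, 5)])     # [0.0, 10.0, -13.0, 19.0, -12.0, 0.0]
```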
44
Solved Problem P10.6
[Diagram: ADALINE adaptive predictor — a tapped-delay-line filter with weights w₁,₁ and w₁,₂ on y(k−1) and y(k−2); the target is t(k) = y(k) and the error is e(k) = t(k) − a(k).]

Application of the ADALINE: adaptive predictor. The purpose of this filter is to predict the next value of the input signal from the two previous values. Suppose that the input signal is a stationary random process with autocorrelation function given by

$C_y(n) = E[y(k)\,y(k+n)], \qquad C_y(0) = 3, \quad C_y(1) = -1, \quad C_y(2) = -1$
45
Solved Problem P10.6
i. Sketch the contour plot of the performance index (MSE).

The input vector and target are

$\mathbf{z}(k) = \begin{bmatrix} y(k-1) \\ y(k-2) \end{bmatrix}, \qquad t(k) = y(k)$

$c = E[t^2(k)] = E[y^2(k)] = C_y(0) = 3$

$\mathbf{R} = E[\mathbf{z}\mathbf{z}^T] = \begin{bmatrix} C_y(0) & C_y(1) \\ C_y(1) & C_y(0) \end{bmatrix} = \begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix}$

$\mathbf{h} = E[t\mathbf{z}] = \begin{bmatrix} E[y(k)y(k-1)] \\ E[y(k)y(k-2)] \end{bmatrix} = \begin{bmatrix} C_y(1) \\ C_y(2) \end{bmatrix} = \begin{bmatrix} -1 \\ -1 \end{bmatrix}$
46
Solved Problem P10.6
Performance index (MSE): $F(\mathbf{x}) = c - 2\mathbf{x}^T\mathbf{h} + \mathbf{x}^T\mathbf{R}\mathbf{x}$

The optimal weights are

$\mathbf{x}^* = \mathbf{R}^{-1}\mathbf{h} = \begin{bmatrix} 3 & -1 \\ -1 & 3 \end{bmatrix}^{-1}\begin{bmatrix} -1 \\ -1 \end{bmatrix} = \begin{bmatrix} 3/8 & 1/8 \\ 1/8 & 3/8 \end{bmatrix}\begin{bmatrix} -1 \\ -1 \end{bmatrix} = \begin{bmatrix} -1/2 \\ -1/2 \end{bmatrix}$

The Hessian matrix is

$\nabla^2 F(\mathbf{x}) = \mathbf{A} = 2\mathbf{R} = \begin{bmatrix} 6 & -2 \\ -2 & 6 \end{bmatrix}$

Eigenvalues: $\lambda_1 = 4$, $\lambda_2 = 8$. Eigenvectors: $\mathbf{v}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$, $\mathbf{v}_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix}$.

The contours of F(x) will be elliptical, with the long axis of each ellipse along the 1st eigenvector, since the 1st eigenvalue has the smallest magnitude. The ellipses will be centered at x*.
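Under the values of R and h reconstructed above, the numbers on this slide can be reproduced with a few lines (not part of the original solution):

```python
import numpy as np

R = np.array([[3.0, -1.0], [-1.0, 3.0]])   # E[z z'] for the predictor in P10.6
h = np.array([-1.0, -1.0])                 # E[t z]

x_star = np.linalg.solve(R, h)             # optimal weights x* = R^{-1} h
A = 2 * R                                  # Hessian of the MSE
lam, V = np.linalg.eigh(A)

print(x_star)            # [-0.5, -0.5]
print(lam)               # eigenvalues [4., 8.]
print(2.0 / lam.max())   # maximum stable learning rate 0.25
```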
47
Solved Problem P10.6
ii. The maximum stable value of the learning rate for the LMS algorithm:

$\alpha < 2/\lambda_{max} = 2/8 = 0.25$

[Figure: elliptical contours of F(x) centered at x*.]

iii. The LMS algorithm is approximate steepest descent, so the trajectory for small learning rates will move perpendicular to the contour lines.

[Figure: contour plot of F(x) with the eigenvectors v₁ and v₂ marked and the LMS trajectory moving toward x*.]
48
Applications
Noise cancellation system to remove 60-Hz noise from EEG signal (Fig. 10.6)
Echo cancellation system in long distance telephone lines (Fig. 10.10)
Filtering engine noise from pilot’s voice signal (Fig. P10.8)