
Page 1: Artificial Intelligence Chapter 3 Neural Networks

Artificial Intelligence, Chapter 3

Neural Networks

Biointelligence Lab

School of Computer Sci. & Eng.

Seoul National University

Page 2: Artificial Intelligence Chapter 3 Neural Networks


Outline

3.1 Introduction

3.2 Training Single TLUs: Gradient Descent, the Widrow-Hoff Rule, the Generalized Delta Procedure

3.3 Neural Networks: the Backpropagation Method, Derivation of the Backpropagation Learning Rule

3.4 Generalization, Accuracy, and Overfitting

3.5 Discussion

Page 3: Artificial Intelligence Chapter 3 Neural Networks


3.1 Introduction

TLU (threshold logic unit): the basic unit of neural networks, based on some properties of biological neurons.

Training set
Input: an n-dimensional vector $\mathbf{X} = (x_1, \ldots, x_n)$ of real or Boolean values.
Output: $d_i$, the associated action (label, class, ...).

Target of training: find an $f(\mathbf{X})$ that corresponds "acceptably" to the members of the training set.
Supervised learning: labels are given along with the input vectors.

Page 4: Artificial Intelligence Chapter 3 Neural Networks


3.2 Training Single TLUs
3.2.1 TLU Geometry

Training a TLU: adjusting its variable weights.
A single TLU: perceptron, Adaline (adaptive linear element) [Rosenblatt 1962, Widrow 1962].

Elements of a TLU
Weight vector: $\mathbf{W} = (w_1, \ldots, w_n)$
Threshold: $\theta$
Output, using the weighted sum $s = \mathbf{W} \cdot \mathbf{X}$: $f = 1$ if $s \geq \theta$, $f = 0$ if $s < \theta$.
The decision boundary is the hyperplane $\mathbf{W} \cdot \mathbf{X} - \theta = 0$.
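
To make the definitions concrete, here is a minimal Python/NumPy sketch of a single TLU; the AND-style weights and threshold are illustrative choices, not values taken from the chapter.

```python
import numpy as np

def tlu_output(W, X, theta):
    """Threshold logic unit: output 1 iff the weighted sum s = W . X reaches theta."""
    s = np.dot(W, X)
    return 1 if s >= theta else 0

# Illustrative 2-input TLU realizing logical AND (weights and threshold are made up).
W, theta = np.array([1.0, 1.0]), 1.5
for X in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(X, tlu_output(W, np.array(X, dtype=float), theta))
```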

Page 5: Artificial Intelligence Chapter 3 Neural Networks


Figure 3.1 TLU Geometry

Page 6: Artificial Intelligence Chapter 3 Neural Networks


3.2.2 Augmented Vectors

Adopt the convention that the threshold is fixed at 0.
Arbitrary thresholds are handled by moving to (n + 1)-dimensional augmented vectors: the input becomes $\mathbf{X} = (x_1, \ldots, x_n, 1)$ and the weight vector $\mathbf{W} = (w_1, \ldots, w_n, w_{n+1})$, with $w_{n+1}$ playing the role of $-\theta$.

Output of the TLU: $f = 1$ if $\mathbf{W} \cdot \mathbf{X} \geq 0$, $f = 0$ if $\mathbf{W} \cdot \mathbf{X} < 0$.
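
A short sketch of the same illustrative AND unit in augmented form, with the threshold folded into the last weight:

```python
import numpy as np

def tlu_aug(W, X):
    """Augmented TLU: the threshold is absorbed into the last weight, so compare to 0."""
    return 1 if np.dot(W, X) >= 0 else 0

# Same illustrative AND unit: weights (1, 1) and threshold 1.5 become (1, 1, -1.5).
W_aug = np.array([1.0, 1.0, -1.5])
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    X_aug = np.array([x1, x2, 1.0])   # append the constant 1 to the input
    print((x1, x2), tlu_aug(W_aug, X_aug))
```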

Page 7: Artificial Intelligence Chapter 3 Neural Networks


3.2.3 Gradient Descent Methods

Training the TLU: minimize the error function by adjusting the weight values.
Batch learning vs. incremental learning.
Commonly used error function: the squared error
$$\varepsilon = (d - f)^2$$

Gradient:
$$\frac{\partial \varepsilon}{\partial \mathbf{W}} \;\overset{\text{def}}{=}\; \left( \frac{\partial \varepsilon}{\partial w_1}, \ldots, \frac{\partial \varepsilon}{\partial w_i}, \ldots, \frac{\partial \varepsilon}{\partial w_{n+1}} \right)$$

Chain rule, with $s = \mathbf{W} \cdot \mathbf{X}$ and therefore $\partial s / \partial \mathbf{W} = \mathbf{X}$:
$$\frac{\partial \varepsilon}{\partial \mathbf{W}} = \frac{\partial \varepsilon}{\partial s}\,\frac{\partial s}{\partial \mathbf{W}} = -2(d - f)\,\frac{\partial f}{\partial s}\,\mathbf{X}$$

Dealing with the nonlinearity of $\partial f / \partial s$ (the threshold function is not differentiable):
Ignore the threshold function: take $f = s$.
Replace the threshold function with a differentiable nonlinear function.

Page 8: Artificial Intelligence Chapter 3 Neural Networks


3.2.4 The Widrow-Hoff Procedure: First Solution

Weight update procedure using $f = s = \mathbf{W} \cdot \mathbf{X}$ (the threshold function is ignored).
Relabel the targets: data labeled 1 $\rightarrow$ $d = 1$, data labeled 0 $\rightarrow$ $d = -1$.

Gradient (here $\partial f / \partial s = 1$):
$$\frac{\partial \varepsilon}{\partial \mathbf{W}} = -2(d - f)\,\frac{\partial f}{\partial s}\,\mathbf{X} = -2(d - f)\,\mathbf{X}$$

New weight vector:
$$\mathbf{W} \leftarrow \mathbf{W} + c\,(d - f)\,\mathbf{X}$$

Widrow-Hoff (delta) rule:
$(d - f) > 0$: increasing $s$ decreases $(d - f)$.
$(d - f) < 0$: decreasing $s$ increases $(d - f)$.
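
A minimal sketch of a single Widrow-Hoff update in NumPy; the learning rate and the sample values are made up, while the ±1 relabeling and the update formula follow the rule above.

```python
import numpy as np

def widrow_hoff_step(W, X, d, c=0.1):
    """One Widrow-Hoff (delta rule) update: f = s = W . X, targets d in {+1, -1}."""
    f = np.dot(W, X)              # linear output, threshold function ignored
    return W + c * (d - f) * X    # W <- W + c (d - f) X

# Illustrative augmented sample (last component is the constant 1) labeled +1.
W = np.zeros(3)
X, d = np.array([1.0, 0.0, 1.0]), 1.0
W = widrow_hoff_step(W, X, d)
print(W)
```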

Page 9: Artificial Intelligence Chapter 3 Neural Networks


3.2.5 The Generalized Delta Procedure: Second Solution

Sigmoid function (differentiable) [Rumelhart et al. 1986]:
$$f(s) = \frac{1}{1 + e^{-s}}$$

Gradient (using $\partial f / \partial s = f(1 - f)$):
$$\frac{\partial \varepsilon}{\partial \mathbf{W}} = -2(d - f)\,\frac{\partial f}{\partial s}\,\mathbf{X} = -2(d - f)\,f(1 - f)\,\mathbf{X}$$

Generalized delta procedure:
$$\mathbf{W} \leftarrow \mathbf{W} + c\,(d - f)\,f(1 - f)\,\mathbf{X}$$

Target outputs: 1 and 0; the output $f$ is the output of the sigmoid function.
Since $f(1 - f) = 0$ where $f = 0$ or $f = 1$, weight changes can occur only within the 'fuzzy' region surrounding the hyperplane (near the point $f(s) = 1/2$).
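
A matching sketch of the generalized delta update for one sigmoid unit; the data and learning rate are illustrative.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def generalized_delta_step(W, X, d, c=0.5):
    """One generalized delta update for a single sigmoid unit, targets d in {0, 1}."""
    f = sigmoid(np.dot(W, X))
    return W + c * (d - f) * f * (1.0 - f) * X   # W <- W + c (d - f) f (1 - f) X

# Illustrative augmented sample labeled 1; the f(1 - f) factor shrinks the step
# when the unit is already saturated near 0 or 1.
W = np.zeros(3)
W = generalized_delta_step(W, np.array([0.0, 1.0, 1.0]), d=1.0)
print(W)
```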

Page 10: Artificial Intelligence Chapter 3 Neural Networks


Figure 3.2 A Sigmoid Function

Page 11: Artificial Intelligence Chapter 3 Neural Networks


3.2.6 The Error-Correction Procedure

Using a threshold unit, $(d - f)$ can only be $+1$ or $-1$ (it is 0 when the output is already correct, so no change is made):
$$\mathbf{W} \leftarrow \mathbf{W} + c\,(d - f)\,\mathbf{X}$$

In the linearly separable case, $\mathbf{W}$ converges to a solution after a finite number of iterations.
In the non-linearly-separable case, $\mathbf{W}$ never converges.
The Widrow-Hoff and generalized delta procedures, by contrast, find minimum-squared-error solutions even when the minimum error is not zero.
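
A sketch of the error-correction procedure as a complete training loop on an illustrative, linearly separable toy problem (AND); the data, learning rate, and epoch cap are assumptions.

```python
import numpy as np

def train_error_correction(samples, c=1.0, max_epochs=100):
    """Error-correction training of one augmented TLU; stops once all samples are correct."""
    W = np.zeros(len(samples[0][0]))
    for _ in range(max_epochs):
        errors = 0
        for X, d in samples:
            f = 1 if np.dot(W, X) >= 0 else 0
            if f != d:                        # update only on mistakes: d - f is +1 or -1
                W = W + c * (d - f) * X
                errors += 1
        if errors == 0:
            break
    return W

# Illustrative augmented AND data (last component is the constant 1).
data = [(np.array([0., 0., 1.]), 0), (np.array([0., 1., 1.]), 0),
        (np.array([1., 0., 1.]), 0), (np.array([1., 1., 1.]), 1)]
print(train_error_correction(data))
```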

Page 12: Artificial Intelligence Chapter 3 Neural Networks


3.3 Neural Networks
3.3.1 Motivation

Need for multiple TLUs: a single TLU can only realize linearly separable functions.
Feedforward network: no cycles.
Recurrent network: contains cycles (treated in a later chapter).
Layered feedforward network: the j-th layer receives input only from the (j – 1)-th layer.

Example: $f = x_1 x_2 + \bar{x}_1 \bar{x}_2$, the even-parity function of two inputs (a TLU-network sketch follows the figure caption below).

Figure 3.4 A Network of TLUs That Implements the Even-Parity Function
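
The figure's actual weights are not reproduced in this transcript; the sketch below shows one possible two-layer TLU network, with hand-picked weights, that computes the even-parity function above.

```python
import numpy as np

def tlu(W, X):
    return 1 if np.dot(W, X) >= 0 else 0

def even_parity(x1, x2):
    """Two-layer TLU network for f = x1 x2 + ~x1 ~x2 (illustrative hand-picked weights)."""
    X = np.array([x1, x2, 1.0])                # augmented input
    h1 = tlu(np.array([1., 1., -1.5]), X)      # fires for x1 AND x2
    h2 = tlu(np.array([-1., -1., 0.5]), X)     # fires for (NOT x1) AND (NOT x2)
    return tlu(np.array([1., 1., -0.5]), np.array([h1, h2, 1.0]))  # OR of the hidden units

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), even_parity(x1, x2))
```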

Page 13: Artificial Intelligence Chapter 3 Neural Networks


3.3.2 Notation

Hidden units: the neurons in all but the final layer.
Output of the j-th layer: $\mathbf{X}^{(j)}$, which is the input to the (j+1)-th layer.
Input vector: $\mathbf{X}^{(0)}$. Final output: $f$.
Weight vector of the i-th sigmoid unit in the j-th layer: $\mathbf{W}_i^{(j)}$.
Weighted sum of the i-th sigmoid unit in the j-th layer: $s_i^{(j)}$.
Number of sigmoid units in the j-th layer: $m_j$.

$$s_i^{(j)} = \mathbf{X}^{(j-1)} \cdot \mathbf{W}_i^{(j)}, \qquad \mathbf{W}_i^{(j)} = \left( w_{1i}^{(j)}, \ldots, w_{li}^{(j)}, \ldots, w_{m_{j-1}\,i}^{(j)} \right)$$

where $w_{li}^{(j)}$ is the weight connecting the l-th output of layer (j – 1) to the i-th unit of layer j.

Page 14: Artificial Intelligence Chapter 3 Neural Networks


Figure 3.4 A k-layer Network of Sigmoid Units

Page 15: Artificial Intelligence Chapter 3 Neural Networks


3.3.3 The Backpropagation Method

Gradient with respect to $\mathbf{W}_i^{(j)}$ (chain rule again, with $s_i^{(j)} = \mathbf{X}^{(j-1)} \cdot \mathbf{W}_i^{(j)}$, so $\partial s_i^{(j)} / \partial \mathbf{W}_i^{(j)} = \mathbf{X}^{(j-1)}$):
$$\frac{\partial \varepsilon}{\partial \mathbf{W}_i^{(j)}} = \frac{\partial \varepsilon}{\partial s_i^{(j)}}\,\frac{\partial s_i^{(j)}}{\partial \mathbf{W}_i^{(j)}} = -2(d - f)\,\frac{\partial f}{\partial s_i^{(j)}}\,\mathbf{X}^{(j-1)}$$

Local gradient:
$$\delta_i^{(j)} = (d - f)\,\frac{\partial f}{\partial s_i^{(j)}}$$

Weight update:
$$\mathbf{W}_i^{(j)} \leftarrow \mathbf{W}_i^{(j)} + c_i^{(j)}\,\delta_i^{(j)}\,\mathbf{X}^{(j-1)}$$

Page 16: Artificial Intelligence Chapter 3 Neural Networks


3.3.4 Computing Weight Changes in the Final Layer

Local gradient (the final layer k produces the single sigmoid output $f$, and $\partial f / \partial s^{(k)} = f(1 - f)$):
$$\delta^{(k)} = (d - f)\,\frac{\partial f}{\partial s^{(k)}} = (d - f)\,f(1 - f)$$

Weight update:
$$\mathbf{W}^{(k)} \leftarrow \mathbf{W}^{(k)} + c^{(k)}(d - f)\,f(1 - f)\,\mathbf{X}^{(k-1)}$$

Page 17: Artificial Intelligence Chapter 3 Neural Networks


3.3.5 Computing Changes to the Weights in Intermediate Layers

Local gradient: the final output $f$ depends on $s_i^{(j)}$ through the weighted sums of the sigmoids in the (j+1)-th layer, so by the chain rule

$$\delta_i^{(j)} = (d - f)\frac{\partial f}{\partial s_i^{(j)}} = (d - f)\left( \frac{\partial f}{\partial s_1^{(j+1)}}\frac{\partial s_1^{(j+1)}}{\partial s_i^{(j)}} + \cdots + \frac{\partial f}{\partial s_l^{(j+1)}}\frac{\partial s_l^{(j+1)}}{\partial s_i^{(j)}} + \cdots + \frac{\partial f}{\partial s_{m_{j+1}}^{(j+1)}}\frac{\partial s_{m_{j+1}}^{(j+1)}}{\partial s_i^{(j)}} \right)$$

$$= (d - f)\sum_{l=1}^{m_{j+1}} \frac{\partial f}{\partial s_l^{(j+1)}}\frac{\partial s_l^{(j+1)}}{\partial s_i^{(j)}} = \sum_{l=1}^{m_{j+1}} \delta_l^{(j+1)}\frac{\partial s_l^{(j+1)}}{\partial s_i^{(j)}}$$

We therefore need to compute $\dfrac{\partial s_l^{(j+1)}}{\partial s_i^{(j)}}$.

Page 18: Artificial Intelligence Chapter 3 Neural Networks


Weight Update in Hidden Layers (cont.)

Since $s_l^{(j+1)} = \mathbf{X}^{(j)} \cdot \mathbf{W}_l^{(j+1)} = \sum_{v=1}^{m_j} f_v^{(j)}\,w_{vl}^{(j+1)}$,
$$\frac{\partial s_l^{(j+1)}}{\partial s_i^{(j)}} = \sum_{v=1}^{m_j} w_{vl}^{(j+1)}\,\frac{\partial f_v^{(j)}}{\partial s_i^{(j)}}$$

For $v \neq i$: $\dfrac{\partial f_v^{(j)}}{\partial s_i^{(j)}} = 0$.
For $v = i$: $\dfrac{\partial f_v^{(j)}}{\partial s_i^{(j)}} = f_i^{(j)}\bigl(1 - f_i^{(j)}\bigr)$.

Consequently,
$$\frac{\partial s_l^{(j+1)}}{\partial s_i^{(j)}} = f_i^{(j)}\bigl(1 - f_i^{(j)}\bigr)\,w_{il}^{(j+1)}, \qquad \delta_i^{(j)} = f_i^{(j)}\bigl(1 - f_i^{(j)}\bigr)\sum_{l=1}^{m_{j+1}} \delta_l^{(j+1)}\,w_{il}^{(j+1)}$$

Page 19: Artificial Intelligence Chapter 3 Neural Networks


Weight Update in Hidden Layers (cont.)

Note the recursive form of the local gradient:
$$\delta^{(k)} = f(1 - f)(d - f), \qquad \delta_i^{(j)} = f_i^{(j)}\bigl(1 - f_i^{(j)}\bigr)\sum_{l=1}^{m_{j+1}} \delta_l^{(j+1)}\,w_{il}^{(j+1)}$$

Backpropagation: the error is propagated backwards from the output layer to the input layer; the local gradients of a later layer are used to compute the local gradients of the earlier layer.
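
Putting the recursion to work, here is a minimal NumPy sketch of backpropagation for a layered network of sigmoid units; the class name, layer-size handling, random initialization, and single-sample (incremental) updates are illustrative choices rather than the chapter's specification.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

class SigmoidNet:
    """Layered feedforward network of sigmoid units trained with backpropagation."""

    def __init__(self, layer_sizes, seed=0):
        rng = np.random.default_rng(seed)
        # weights[j][l, i] connects output l of layer j to unit i of layer j + 1;
        # the extra final row is the bias weight, fed by a constant input of 1.
        self.weights = [rng.normal(0.0, 0.5, size=(n_in + 1, n_out))
                        for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]

    def forward(self, x):
        """Return the outputs X^(0), X^(1), ..., X^(k) of every layer."""
        outputs = [np.asarray(x, dtype=float)]
        for W in self.weights:
            x_aug = np.append(outputs[-1], 1.0)       # augment with the constant 1
            outputs.append(sigmoid(x_aug @ W))        # s_i = X^(j-1) . W_i, f_i = sigmoid(s_i)
        return outputs

    def train_step(self, x, d, c=1.0):
        """One incremental backprop update on a single (input, target) pair."""
        outputs = self.forward(x)
        f = outputs[-1]
        delta = f * (1.0 - f) * (d - f)               # delta^(k) = f (1 - f) (d - f)
        for j in reversed(range(len(self.weights))):
            x_aug = np.append(outputs[j], 1.0)
            grad = np.outer(x_aug, delta)             # delta_i * X^(j-1), one column per unit
            if j > 0:                                 # recurse: delta for the previous layer
                f_prev = outputs[j]
                delta = f_prev * (1.0 - f_prev) * (self.weights[j][:-1, :] @ delta)
            self.weights[j] += c * grad               # W_i <- W_i + c * delta_i * X^(j-1)
        return float(np.sum((d - f) ** 2))            # squared error before the update
```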

Page 20: Artificial Intelligence Chapter 3 Neural Networks


3.3.5 (cont'd)

Example (even-parity function), learning rate c = 1.0.

input        target
1 0 1        0
0 0 1        1
0 1 1        0
1 1 1        1

(The third input component is the constant 1 of the augmented input vector; a usage sketch follows the figure caption below.)

Figure 3.6 A Network to Be Trained by Backprop
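
As a usage sketch, the hypothetical SigmoidNet class from the earlier code block can be trained on this table. The hidden-layer size and epoch count are assumptions, the network augments its inputs itself (so only x1 and x2 are passed in), and convergence on this small parity problem depends on the random initialization.

```python
import numpy as np

# Even-parity training pairs from the table above (the constant 1 is added inside the network).
pairs = [((1, 0), 0), ((0, 0), 1), ((0, 1), 0), ((1, 1), 1)]

net = SigmoidNet([2, 2, 1])              # 2 inputs, 2 hidden sigmoid units, 1 output
for epoch in range(5000):                # incremental updates with learning rate c = 1.0
    for x, d in pairs:
        net.train_step(np.array(x, dtype=float), np.array([float(d)]), c=1.0)

for x, d in pairs:
    out = net.forward(np.array(x, dtype=float))[-1][0]
    print(x, d, round(float(out), 3))
```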

Page 21: Artificial Intelligence Chapter 3 Neural Networks


3.4 Generalization, Accuracy, and Overfitting

Generalization ability: the NN appropriately classifies vectors that are not in the training set.
Measurement = accuracy.

Curve fitting
Compare the number of training input vectors with the number of degrees of freedom of the network.
In the case of m data points, is an (m – 1)-degree polynomial the best model? No: it cannot capture any special information.
Overfitting: the extra degrees of freedom are essentially just fitting the noise.
Given sufficient data, Occam's Razor dictates choosing the lowest-degree polynomial that adequately fits the data.
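
A small NumPy sketch of the curve-fitting point: a low-degree fit generalizes better than the (m – 1)-degree interpolant. The noisy-line data and degree choices are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 8)                       # m = 8 training points
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, x.size)   # noisy samples of a straight line

x_new = np.linspace(0.0, 1.0, 100)
y_new = 2.0 * x_new + 1.0                          # noise-free values at unseen points

for degree in (1, 7):                              # degree m - 1 = 7 can interpolate the noise
    coeffs = np.polyfit(x, y, degree)
    mse = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree}: mean squared error on unseen points = {mse:.4f}")
```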

Page 22: Artificial Intelligence Chapter 3 Neural Networks


Figure 3.7 Curve Fitting

Page 23: Artificial Intelligence Chapter 3 Neural Networks


3.4 (cont'd)

Out-of-sample error rate: the error rate on data drawn from the same underlying distribution as the training set.
Divide the available data into a training set and a validation set; usually 2/3 is used for training and 1/3 for validation.

k-fold cross-validation
Partition the data into k disjoint subsets (called folds).
Repeat training k times, each time using one fold as the validation set and the remaining k – 1 folds (combined) as the training set.
Take the average of the k validation error rates as the estimate of the out-of-sample error.
Empirically, 10-fold cross-validation is preferred.
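
A minimal sketch of the k-fold estimate; the fold-count default and the placeholder train_fn / error_fn callables are assumptions, not an API from the chapter.

```python
import numpy as np

def k_fold_error(X, y, train_fn, error_fn, k=10, seed=0):
    """Estimate the out-of-sample error as the average of the k validation-fold error rates."""
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, k)                       # k disjoint subsets (folds)
    errors = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train], y[train])             # train on the k - 1 combined folds
        errors.append(error_fn(model, X[val], y[val]))   # error rate on the held-out fold
    return float(np.mean(errors))
```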

Page 24: Artificial Intelligence Chapter 3 Neural Networks


Figure 3.8 Error Versus Number of Hidden Units

Figure 3.9 Estimate of Generalization Error Versus Number of Hidden Units

Page 25: Artificial Intelligence Chapter 3 Neural Networks


3.5 Additional Readings and Discussion

Applications: pattern recognition, automatic control, brain-function modeling.
Designing and training neural networks still requires experience and experimentation.

Major annual conferences: Neural Information Processing Systems (NIPS), International Conference on Machine Learning (ICML), Computational Learning Theory (COLT).

Major journals: Neural Computation, IEEE Transactions on Neural Networks, Machine Learning.