CSE 5526: Introduction to Neural Networks


Page 1: CSE 5526: Introduction to Neural Networks

Part I 1

CSE 5526: Introduction to Neural Networks

Instructor: DeLiang Wang

Page 2: CSE 5526: Introduction to Neural Networks

Part I 2

What is this course about?

• AI (artificial intelligence) in the broad sense, in particular learning
• The human brain and its amazing ability, e.g. vision

Page 3: CSE 5526: Introduction to Neural Networks

Part I 3

Human brain

[Figure: human brain with the lateral fissure and central sulcus labeled]

Page 4: CSE 5526: Introduction to Neural Networks

Part I 4

Brain versus computer

Brain


Computer


• Brain-like computation
  – Neural networks (NN or ANN)
  – Neural computation
• Discuss syllabus

Page 5: CSE 5526: Introduction to Neural Networks

Part I 5

A single neuron

Page 6: CSE 5526: Introduction to Neural Networks

Part I 6

A single neuron

Page 7: CSE 5526: Introduction to Neural Networks

Part I 7

Real neurons, real synapses

Properties
• Action potential (impulse) generation
• Impulse propagation
• Synaptic transmission & plasticity
• Spatial summation

Terminology
• Neurons – units – nodes
• Synapses – connections – architecture
• Synaptic weight – connection strength (either positive or negative)

Page 8: CSE 5526: Introduction to Neural Networks

Part I 8

Model of a single neuron

Page 9: CSE 5526: Introduction to Neural Networks

Part I 9

Neuronal model

$u_k = \sum_{j=1}^{m} w_{kj} x_j$  (adder, weighted sum, linear combiner)

$v_k = u_k + b_k$  (activation potential; $b_k$: bias)

$y_k = \varphi(v_k)$  (output; $\varphi$: activation function)
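As a concrete illustration of this neuronal model, here is a minimal Python/NumPy sketch (the function name, variable names, and example numbers are illustrative assumptions, not from the slides):

```python
import numpy as np

def neuron_output(w_k, b_k, x, phi):
    """Single neuron k: weighted sum, add bias, apply activation."""
    u_k = np.dot(w_k, x)   # adder / linear combiner: u_k = sum_j w_kj * x_j
    v_k = u_k + b_k        # activation potential
    return phi(v_k)        # output y_k = phi(v_k)

# Example with a signum-like activation (as used later for the McCulloch-Pitts model)
phi = lambda v: 1.0 if v >= 0 else -1.0
y_k = neuron_output(np.array([0.5, -0.2, 0.1]), 0.3, np.array([1.0, 2.0, -1.0]), phi)
```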

Page 10: CSE 5526: Introduction to Neural Networks

Part I 10

Another way of including bias: set $x_0 = +1$ and $w_{k0} = b_k$

So we have

$v_k = \sum_{j=0}^{m} w_{kj} x_j$
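The same computation with the bias absorbed into the weight vector, a small sketch following the convention above (names are illustrative):

```python
import numpy as np

def neuron_output_augmented(w_k, x, phi):
    """Bias folded into the weights: w_k[0] holds b_k and x_0 = +1 is prepended."""
    x_aug = np.concatenate(([1.0], x))   # x_0 = +1
    v_k = np.dot(w_k, x_aug)             # v_k = sum_{j=0..m} w_kj * x_j
    return phi(v_k)

phi = lambda v: 1.0 if v >= 0 else -1.0
y_k = neuron_output_augmented(np.array([0.3, 0.5, -0.2, 0.1]),  # w_k0 = b_k = 0.3
                              np.array([1.0, 2.0, -1.0]), phi)
```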

Page 11: CSE 5526: Introduction to Neural Networks

Part I 11

McCulloch-Pitts model

$x_i \in \{-1, +1\}$  (bipolar input)

$y = \varphi\!\left(\sum_{i=1}^{m} w_i x_i + b\right)$

$\varphi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ -1 & \text{if } v < 0 \end{cases}$  (a form of signum (sign) function)

Note difference from textbook; also number of neurons in the brain

Page 12: CSE 5526: Introduction to Neural Networks

Part I 12

McCulloch-Pitts model (cont.)

• Example logic gates (see blackboard; a small sketch follows below)

• McCulloch-Pitts networks (introduced in 1943) are the first class of abstract computing machines: finite-state automata
• Finite-state automata can compute any logic (Boolean) function
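A small Python sketch of McCulloch-Pitts logic gates with bipolar inputs and outputs; the specific weights and biases below are one workable choice, not necessarily the blackboard examples:

```python
def mp_neuron(weights, bias, x):
    """McCulloch-Pitts neuron: y = 1 if sum_i w_i * x_i + b >= 0, else -1."""
    v = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if v >= 0 else -1

# Hypothetical weight/bias choices for bipolar AND and OR
AND = lambda x: mp_neuron([1, 1], -1, x)   # fires only when both inputs are +1
OR  = lambda x: mp_neuron([1, 1], +1, x)   # fires when at least one input is +1

for x in [(-1, -1), (-1, 1), (1, -1), (1, 1)]:
    print(x, "AND:", AND(x), "OR:", OR(x))
```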

Page 13: CSE 5526: Introduction to Neural Networks

Part I 13

Network architecture

• View an NN as a connected, directed graph, which defines its architecture
• Feedforward nets: loop-free graph
• Recurrent nets: with loops (a small graph check is sketched below)
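A tiny sketch of this graph view (the adjacency-list representation and node names are illustrative assumptions): a net whose directed graph has no loop is feedforward, one with a loop is recurrent.

```python
def has_cycle(graph):
    """Depth-first search for a loop in a directed graph {node: [successor nodes]}."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on current path / finished
    color = {node: WHITE for node in graph}

    def visit(node):
        color[node] = GRAY
        for succ in graph.get(node, []):
            if color.get(succ, WHITE) == GRAY:                    # back edge -> loop
                return True
            if color.get(succ, WHITE) == WHITE and visit(succ):
                return True
        color[node] = BLACK
        return False

    return any(color[node] == WHITE and visit(node) for node in graph)

feedforward = {"x1": ["h1"], "x2": ["h1"], "h1": ["y1"], "y1": []}
recurrent   = {"x1": ["y1"], "y1": ["y2"], "y2": ["y1"]}          # y1 <-> y2 loop
print(has_cycle(feedforward), has_cycle(recurrent))               # False True
```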

Page 14: CSE 5526: Introduction to Neural Networks

Part I 14

Feedforward net

• Since the input layer consists of source nodes, it is typically not counted when we talk about the number of layers in a feedforward net

• For example, a 10-4-2 architecture (10 source nodes, 4 hidden neurons, 2 output neurons) counts as a two-layer net, as sketched below
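A rough sketch of a 10-4-2 net under this counting convention (random weights, arbitrary tanh activation; all names are illustrative assumptions): two weight layers, with the 10 source nodes not counted.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 source nodes (not counted), a hidden layer of 4 neurons, an output layer of 2 neurons
W1, b1 = rng.standard_normal((4, 10)), np.zeros(4)   # layer 1: hidden
W2, b2 = rng.standard_normal((2, 4)),  np.zeros(2)   # layer 2: output

def forward(x, phi=np.tanh):        # phi is an arbitrary choice here
    h = phi(W1 @ x + b1)            # hidden-layer outputs
    return phi(W2 @ h + b2)         # output-layer outputs (2 values)

y = forward(rng.standard_normal(10))
```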

Page 15: CSE 5526: Introduction to Neural Networks

Part I 15

A one-layer recurrent net

In this net, the input typically sets the initial condition of the output layer

Page 16: CSE 5526: Introduction to Neural Networks

Part I 16

Network components

• Three components characterize a neural net
  • Architecture
  • Activation function
  • Learning rule (algorithm)

Page 17: CSE 5526: Introduction to Neural Networks

Part I 17

CSE 5526: Introduction to Neural Networks

Perceptrons

Page 18: CSE 5526: Introduction to Neural Networks

Part I 18

Perceptrons

• Architecture: one-layer feedforward net
• Without loss of generality, consider a single-neuron perceptron

[Figure: one-layer feedforward net with source nodes x1, x2, …, xm feeding output neurons y1, y2, …]

Page 19: CSE 5526: Introduction to Neural Networks

Part I 19

Definition

$v = \sum_{i=1}^{m} w_i x_i + b$

$y = \varphi(v)$

$\varphi(v) = \begin{cases} 1 & \text{if } v \ge 0 \\ -1 & \text{otherwise} \end{cases}$

Hence a McCulloch-Pitts neuron, but with real-valued inputs

Page 20: CSE 5526: Introduction to Neural Networks

Part I 20

Pattern recognition

• With a bipolar output, the perceptron performs a 2-class classification problem
  • Apples vs. oranges
• How do we learn to perform a classification problem?
  • Task: the perceptron is given pairs of input $\mathbf{x}_p$ and desired output $d_p$. How do we find $\mathbf{w}$ (with $b$ incorporated) so that $y_p = d_p$ for all $p$?

Page 21: CSE 5526: Introduction to Neural Networks

Part I 21

Decision boundary

• The decision boundary for a given $\mathbf{w}$:

$g(\mathbf{x}) = \sum_{i=1}^{m} w_i x_i + b = 0$

• $g$ is also called the discriminant function for the perceptron, and it is a linear function of $\mathbf{x}$. Hence it is a linear discriminant function

Page 22: CSE 5526: Introduction to Neural Networks

Part I 22

Example

• See blackboard

Page 23: CSE 5526: Introduction to Neural Networks

Part I 23

Decision boundary (cont.)

• For an m-dimensional input space, the decision boundary is an (m ‒ 1)-dimensional hyperplane perpendicular to $\mathbf{w}$ (a one-line check is given below). The hyperplane separates the input space into two halves, with one half having y = +1 and the other half having y = -1
• When b = 0, the hyperplane goes through the origin

[Figure: x1-x2 plane with the boundary g = 0 drawn perpendicular to w, separating the x patterns (y = +1) from the o patterns (y = -1)]
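A one-line check of the perpendicularity claim: for any two points $\mathbf{x}_1$ and $\mathbf{x}_2$ on the boundary,

$\mathbf{w}^T(\mathbf{x}_1 - \mathbf{x}_2) = g(\mathbf{x}_1) - g(\mathbf{x}_2) = 0$

so $\mathbf{w}$ is orthogonal to every direction lying within the hyperplane; and when $b = 0$, $g(\mathbf{0}) = 0$, so the hyperplane contains the origin.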

Page 24: CSE 5526: Introduction to Neural Networks

Part I 24

Linear separability

• For a set of input patterns $\mathbf{x}_p$, if there exists one $\mathbf{w}$ that separates d = 1 patterns from d = -1 patterns, then the classification problem is linearly separable
• In other words, there exists a linear discriminant function that produces no classification error
• Examples: AND, OR, XOR (see blackboard; a small separability check is sketched below)
• A very important concept
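A minimal separability check in Python (the weight grid, helper names, and bipolar truth tables are illustrative assumptions): it finds separating weights for AND and OR but none for XOR; the general impossibility for XOR follows by summing the four sign constraints.

```python
import itertools

patterns = [(-1, -1), (-1, 1), (1, -1), (1, 1)]          # bipolar inputs
targets = {
    "AND": [-1, -1, -1, +1],
    "OR":  [-1, +1, +1, +1],
    "XOR": [-1, +1, +1, -1],
}

def separates(w1, w2, b, desired):
    """True if sign(w1*x1 + w2*x2 + b) matches the desired label on all four patterns."""
    return all(
        (1 if w1 * x1 + w2 * x2 + b >= 0 else -1) == d
        for (x1, x2), d in zip(patterns, desired)
    )

grid = [k * 0.5 for k in range(-4, 5)]                   # coarse grid of candidate weights
for name, desired in targets.items():
    found = next(
        ((w1, w2, b) for w1, w2, b in itertools.product(grid, repeat=3)
         if separates(w1, w2, b, desired)),
        None,
    )
    print(name, "separable on this grid:", found)
```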

Page 25: CSE 5526: Introduction to Neural Networks

Part I 25

Linear separability: a more general illustration

Page 26: CSE 5526: Introduction to Neural Networks

Part I 26

Perceptron learning rule

• Strengthen an active synapse if the postsynaptic neuron fails to fire when it should have fired; weaken an active synapse if the neuron fires when it should not have fired
• Formulated by Rosenblatt based on biological intuition

Page 27: CSE 5526: Introduction to Neural Networks

Part I 27

Quantitatively

$\mathbf{w}(n+1) = \mathbf{w}(n) + \Delta\mathbf{w}(n)$

$n$: iteration number; $\eta$: step size or learning rate

$\Delta\mathbf{w}(n) = \eta\,[d(n) - y(n)]\,\mathbf{x}(n)$

In vector form

$\Delta\mathbf{w} = \eta\,[d - y]\,\mathbf{x}$
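A minimal sketch of this learning rule in Python/NumPy (bias absorbed via $x_0 = +1$ as on the earlier slide; the stopping criterion, data, and function name are illustrative assumptions):

```python
import numpy as np

def train_perceptron(X, d, eta=0.5, max_epochs=100):
    """Perceptron learning: w(n+1) = w(n) + eta * [d(n) - y(n)] * x(n).

    X: (P, m) array of input patterns; d: (P,) array of bipolar targets (+1/-1).
    """
    X_aug = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend x_0 = +1 column
    w = np.zeros(X_aug.shape[1])                       # w(0) = 0
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(X_aug, d):
            y = 1 if w @ x >= 0 else -1
            if y != target:                            # d - y = 0 when correct
                w += eta * (target - y) * x            # update only on mistakes
                errors += 1
        if errors == 0:                                # every pattern classified correctly
            break
    return w

# Example: the linearly separable bipolar AND problem
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
d = np.array([-1, -1, -1, 1])
w = train_perceptron(X, d)
```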

Page 28: CSE 5526: Introduction to Neural Networks

Part I 28

Geometric interpretation

• Assume η = 1/2

[Figure: x1-x2 plane with x patterns (d = 1), o patterns (d = -1), and the initial weight vector w(0)]

Page 29: CSE 5526: Introduction to Neural Networks

Part I 29

Geometric interpretation

• Assume η = 1/2

[Figure: same plane, now also showing the updated weight vector w(1)]

Page 30: CSE 5526: Introduction to Neural Networks

Part I 30

Geometric interpretation

• Assume η = 1/2

[Figure: same plane with weight vectors w(0) and w(1)]

Page 31: CSE 5526: Introduction to Neural Networks

Part I 31

Geometric interpretation

• Assume η = 1/2

[Figure: same plane, now also showing w(2) after the second update]

Page 32: CSE 5526: Introduction to Neural Networks

Part I 32

Geometric interpretation

• Assume η = 1/2

[Figure: same plane with weight vectors w(0), w(1), and w(2)]

Page 33: CSE 5526: Introduction to Neural Networks

Part I 33

Geometric interpretation

• Assume η = 1/2

[Figure: same plane, now also showing w(3) after the third update]

Page 34: CSE 5526: Introduction to Neural Networks

Part I 34

Geometric interpretation

Each weight update moves w closer to d = 1 patterns, or away from d = -1 patterns. w(3) solves the classification problem

[Figure: x1-x2 plane with the final weight vector w(3) separating the x patterns (d = 1) from the o patterns (d = -1)]

Page 35: CSE 5526: Introduction to Neural Networks

Part I 35

Perceptron convergence theorem

• Theorem: If a classification problem is linearly separable, a perceptron will reach a solution in a finite number of iterations

• [Proof] Given a finite number of training patterns, because of linear separability, there exists a weight vector $\mathbf{w}_o$ so that

$d_p\, \mathbf{w}_o^T \mathbf{x}_p \ge \alpha > 0$  (1)

where $\alpha = \min_p \left( d_p\, \mathbf{w}_o^T \mathbf{x}_p \right)$

Page 36: CSE 5526: Introduction to Neural Networks

Part I 36

Proof (cont.)

• We assume that the initial weights are all zero. Let $N_p$ denote the number of times $\mathbf{x}_p$ has been used for actually updating the weight vector at some point in learning. At that time:

$\mathbf{w} = \sum_p N_p\, \eta\, [d_p - y_p]\, \mathbf{x}_p = 2\eta \sum_p N_p d_p \mathbf{x}_p$

(since an update occurs only when $y_p = -d_p$, so $d_p - y_p = 2d_p$)

Page 37: CSE 5526: Introduction to Neural Networks

Part I 37

Proof (cont.)

• Consider first

$\mathbf{w}_o^T \mathbf{w} = 2\eta \sum_p N_p d_p\, \mathbf{w}_o^T \mathbf{x}_p \ge 2\eta\alpha \sum_p N_p = 2\eta\alpha P$  (2)

due to (1), where $P = \sum_p N_p$

Page 38: CSE 5526: Introduction to Neural Networks

Part I 38

Proof (cont.)

• Now consider the change in squared length after a single update by $\mathbf{x}$:

$\Delta\|\mathbf{w}\|^2 = \|\mathbf{w} + 2\eta d\,\mathbf{x}\|^2 - \|\mathbf{w}\|^2 = 4\eta^2\|\mathbf{x}\|^2 + 4\eta d\,\mathbf{w}^T\mathbf{x} \le 4\eta^2\beta$

where $\beta = \max_p \|\mathbf{x}_p\|^2$, since upon an update $d(\mathbf{w}^T\mathbf{x}) \le 0$

Page 39: CSE 5526: Introduction to Neural Networks

Part I 39

Proof (cont.)

• By summing for P steps we have the bound:

$\|\mathbf{w}\|^2 \le 4\eta^2\beta P$  (3)

• Now square the cosine of the angle between $\mathbf{w}_o$ and $\mathbf{w}$; by (2), (3), and the Cauchy-Schwarz inequality we have

$1 \ge \dfrac{(\mathbf{w}_o^T \mathbf{w})^2}{\|\mathbf{w}_o\|^2\, \|\mathbf{w}\|^2} \ge \dfrac{4\eta^2\alpha^2 P^2}{\|\mathbf{w}_o\|^2 \cdot 4\eta^2\beta P} = \dfrac{\alpha^2 P}{\|\mathbf{w}_o\|^2\, \beta}$
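Rearranging the last inequality makes the bound on the total number of updates explicit (not stated on the slide, but immediate from it):

$P \le \dfrac{\beta\, \|\mathbf{w}_o\|^2}{\alpha^2}$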

Page 40: CSE 5526: Introduction to Neural Networks

Part I 40

Proof (cont.)

• Thus, P must be finite to satisfy the above inequality. This completes the proof

• Remarks
  • In the case of w(0) = 0, the learning rate has no effect on the proof. That is, the theorem holds no matter what η (η > 0) is (see the short argument below)
  • The solution weight vector is not unique
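A short way to see the first remark: with $\mathbf{w}(0) = \mathbf{0}$, the expression derived in the proof gives

$\mathbf{w} = 2\eta \sum_p N_p d_p \mathbf{x}_p$

so $\eta$ only rescales $\mathbf{w}$; the sign of $\mathbf{w}^T\mathbf{x}$, and hence every classification decision and every update, is the same for any $\eta > 0$, and $\eta$ cancels in the final inequality.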

Page 41: CSE 5526: Introduction to Neural Networks

Part I 41

Generalization

• Performance of a learning machine on test patterns not used during training

• Example: Class 1: handwritten "m"; Class 2: handwritten "n"
• See blackboard

• Perceptrons generalize by deriving a decision boundary in the input space. Selection of training patterns is thus important for generalization