Perceptron Multilayer



Harvard-MIT Division of Health Sciences and Technology
HST.951J: Medical Decision Support, Fall 2005
Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo

6.873/HST.951 Medical Decision Support, Spring 2005

Artificial Neural Networks

Lucila Ohno-Machado
(with many slides borrowed from Stephan Dreiseitl. Courtesy of Stephan Dreiseitl. Used with permission.)


Overview
• Motivation
• Perceptrons
• Multilayer perceptrons
• Improving generalization
• Bayesian perspective


Motivation

Images removed due to copyright reasons.
(Image captions: benign lesion, malignant lesion.)


Motivation
• Human brain
 – Parallel processing
 – Distributed representation
 – Fault tolerant
 – Good generalization capability
• Mimic structure and processing in a computational model


Biological Analogy

(Diagram: a biological neuron with dendrites, axon, and excitatory (+) and inhibitory (−) synapses, mapped onto network nodes and weights.)


Perceptrons 


(Diagram: binary input patterns 00, 01, 10, 11 are presented to an input layer and mapped through an output layer into sorted patterns.)


 Activation functions 


Perceptrons (linear machines)

(Diagram: input units Cough and Headache connect through weights to output units No disease, Pneumonia, Flu, Meningitis; error = what we wanted − what we got; the Δ rule changes the weights to decrease the error.)


Abdominal Pain Perceptron

(Diagram: inputs Male, Age, Temp, WBC, Pain Intensity, and Pain Duration connect through adjustable weights to disease outputs: Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis.)


AND

(Diagram: inputs x1, x2 with weights w1, w2 feed output y; threshold θ = 0.5.)

input (x1, x2) → output y
(0, 0) → 0
(0, 1) → 0
(1, 0) → 0
(1, 1) → 1

f(x1·w1 + x2·w2) = y
f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 0
f(1·w1 + 0·w2) = 0
f(1·w1 + 1·w2) = 1

f(a) = 1 for a > θ, 0 for a ≤ θ

some possible values for w1 and w2:
w1    w2
0.20  0.35
0.20  0.40
0.25  0.30
0.40  0.20
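A minimal runnable sketch of this threshold unit in Python (not part of the original slides); the function names are illustrative, and the weights used are one valid choice for AND with θ = 0.5.

```python
# Threshold (step) activation: 1 if a > theta, else 0, as on the slide.
def step(a, theta=0.5):
    return 1 if a > theta else 0

def perceptron_and(x1, x2, w1=0.3, w2=0.3):
    """AND perceptron: y = f(x1*w1 + x2*w2) with theta = 0.5."""
    return step(x1 * w1 + x2 * w2)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", perceptron_and(x1, x2))  # 0, 0, 0, 1
```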


Single layer neural network

(Diagram: input units i connect through weights w_ij to output units j.)

Input to unit i: a_i = measured value of variable i
Input to unit j: a_j = Σ_i w_ij a_i
Output of unit j: o_j = 1 / (1 + e^−(a_j + θ_j))


Single layer neural network

Input to unit i: a_i = measured value of variable i
Input to unit j: a_j = Σ_i w_ij a_i
Output of unit j: o_j = 1 / (1 + e^−(a_j + θ_j))

(Plot: sigmoid output curves rising from 0 to 1 over a range of roughly −15 to 15, for several series with increasing θ.)
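A short sketch of a single unit's output under these definitions; the inputs, weights, and θ values below are illustrative only.

```python
import math

def unit_output(inputs, weights, theta):
    """a_j = sum_i w_ij * a_i ; o_j = 1 / (1 + exp(-(a_j + theta_j)))."""
    a_j = sum(w * a for w, a in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(a_j + theta)))

# Changing theta shifts the sigmoid, as in the plot above.
for theta in (-5.0, 0.0, 5.0):
    print(theta, unit_output([1.0, 2.0], [0.5, -0.25], theta))
```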


Training: Minimize Error

(Diagram: input units Cough and Headache connect through weights to output units No disease, Pneumonia, Flu, Meningitis; error = what we wanted − what we got; the Δ rule changes the weights to decrease the error.)


Error Functions
• Mean Squared Error (for regression problems), where t is target, o is output: Σ(t − o)² / n
• Cross-Entropy Error (for binary classification): −Σ [t log o + (1 − t) log(1 − o)]
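The two error functions written out as a short Python sketch (helper names are illustrative, values made up).

```python
import math

def mean_squared_error(targets, outputs):
    """Sum of (t - o)^2 over cases, divided by n (regression)."""
    n = len(targets)
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n

def cross_entropy_error(targets, outputs):
    """-sum[t*log(o) + (1 - t)*log(1 - o)] (binary classification)."""
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))

print(mean_squared_error([1, 0, 1], [0.9, 0.2, 0.7]))
print(cross_entropy_error([1, 0, 1], [0.9, 0.2, 0.7]))
```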


Error function
• Convention: w := (w0, w), x := (1, x)
• w0 is the "bias"
• o = f(w · x)
• Class labels t_i ∈ {+1, −1}
• Error measure: E = −Σ_{i miscl.} t_i (w · x_i)
• How to minimize E?

(Diagram: perceptron with inputs x1, x2 and a constant bias input 1, weights w1, w2, threshold θ = 0.5, output y.)


Minimizing the Error

(Diagram: error surface over weight w; starting from the initial error at w_initial, following the derivative leads to the final error at a local minimum at w_trained.)


Gradient descent

(Diagram: error curve with a local minimum and a global minimum.)


Perceptron learning
• Find the minimum of E by iterating: w_{k+1} = w_k − η grad_w E
• E = −Σ_{i miscl.} t_i (w · x_i) ⇒ grad_w E = −Σ_{i miscl.} t_i x_i
• "online" version: pick a misclassified x_i, then w_{k+1} = w_k + η t_i x_i


Perceptron learning
• Update rule: w_{k+1} = w_k + η t_i x_i
• Theorem: perceptron learning converges for linearly separable sets
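A sketch of the online update rule under the slide's conventions (x prepended with 1, labels in {+1, −1}); the training loop and the AND data below are illustrative.

```python
def perceptron_train(samples, eta=0.1, epochs=100):
    """samples: list of (x, t) with x a feature tuple and t in {+1, -1}.
    Uses the convention w := (w0, w), x := (1, x); w0 is the bias."""
    dim = len(samples[0][0]) + 1
    w = [0.0] * dim
    for _ in range(epochs):
        updated = False
        for x, t in samples:
            xa = (1.0,) + tuple(x)                     # prepend bias input
            score = sum(wi * xi for wi, xi in zip(w, xa))
            if t * score <= 0:                         # misclassified
                w = [wi + eta * t * xi for wi, xi in zip(w, xa)]
                updated = True
        if not updated:                                # converged (separable data)
            break
    return w

# AND, with class labels in {+1, -1}
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), +1)]
print(perceptron_train(data))
```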


Gradient descent
• Simple function minimization algorithm
• Gradient is vector of partial derivatives
• Negative gradient is direction of steepest descent

(Figures: surface and contour plots of a function of two variables, with the direction of steepest descent. Figures by MIT OCW.)
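A minimal sketch of the update w_{k+1} = w_k − η grad E(w_k) on a simple quadratic; the function, starting point, and step size are illustrative.

```python
def gradient_descent(grad, w0, eta=0.1, steps=100):
    """w_{k+1} = w_k - eta * grad(w_k), componentwise."""
    w = list(w0)
    for _ in range(steps):
        g = grad(w)
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w

# E(w) = w1^2 + w2^2 has gradient (2*w1, 2*w2); its minimum is at (0, 0).
print(gradient_descent(lambda w: [2 * w[0], 2 * w[1]], [3.0, -2.0]))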


Classification Model

(Diagram: inputs Age = 34, Gender = 1, Stage = 4 connect through weights to a single output, the "Probability of being Alive" = 0.6. Independent variables x1, x2, x3; coefficients a, b, c; dependent variable p; the output is the prediction.)


Terminology
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = "weights"
• Estimates = "targets"
• Iterative step = cycle, epoch


XOR

(Diagram: inputs x1, x2 with weights w1, w2 feed output y; threshold θ = 0.5.)

input (x1, x2) → output y
(0, 0) → 0
(0, 1) → 1
(1, 0) → 1
(1, 1) → 0

f(x1·w1 + x2·w2) = y
f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0

f(a) = 1 for a > θ, 0 for a ≤ θ

some possible values for w1 and w2:
w1    w2
(left blank: there are none)


XOR

(Diagram: inputs x1, x2 with weights w1, w2 feed a hidden unit z with θ = 0.5; x1, x2, and z feed the output y through weights w3, w4, w5, also with θ = 0.5.)

input (x1, x2) → output y
(0, 0) → 0
(0, 1) → 1
(1, 0) → 1
(1, 1) → 0

y = f(w1, w2, w3, w4, w5)
f(a) = 1 for a > θ, 0 for a ≤ θ

a possible set of values for the ws:
(w1, w2, w3, w4, w5) = (0.3, 0.3, 1, 1, −2)
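A small Python check of the slide's solution, assuming w1, w2 connect x1, x2 to the hidden unit z, and w3, w4, w5 connect x1, x2, z to the output (all thresholds 0.5); the helper names are illustrative.

```python
def step(a, theta=0.5):
    return 1 if a > theta else 0

def xor_net(x1, x2, w=(0.3, 0.3, 1, 1, -2)):
    w1, w2, w3, w4, w5 = w
    z = step(x1 * w1 + x2 * w2)            # hidden unit fires only for (1, 1)
    return step(x1 * w3 + x2 * w4 + z * w5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2))   # 0, 1, 1, 0
```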


XOR

(Diagram: inputs x1, x2 connect through weights w1 to w4 to two hidden units, which connect through weights w5, w6 to the output; θ = 0.5 for all units.)

input (x1, x2) → output y
(0, 0) → 0
(0, 1) → 1
(1, 0) → 1
(1, 1) → 0

y = f(w1, w2, w3, w4, w5, w6)
f(a) = 1 for a > θ, 0 for a ≤ θ

a possible set of values for the ws:
(w1, w2, w3, w4, w5, w6) = (0.6, −0.6, −0.7, 0.8, 1, 1)


From perceptrons to multilayer perceptrons

Why?


Abdominal Pain

(Diagram: the same abdominal-pain network: inputs Male, Age, Temp, WBC, Pain Intensity, Pain Duration connect through adjustable weights to outputs Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Non-specific Pain, Cholecystitis, Small Bowel Obstruction, Pancreatitis.)


Heart Attack Network

(Diagram: inputs Pain Duration, Pain Intensity, ECG: ST elevation, Smoker, Age, Male feed a network whose output is the "Probability" of Myocardial Infarction, here 0.8.)


Multilayered Perceptrons

Perceptron:
(Diagram: input units i connect directly to output units j.)
Input to unit i: a_i = measured value of variable i
Input to unit j: a_j = Σ_i w_ij a_i
Output of unit j: o_j = 1 / (1 + e^−(a_j + θ_j))

Multilayered perceptron:
(Diagram: a layer of hidden units sits between inputs and outputs; the units j feed units k.)
Input to unit k: a_k = Σ_j w_jk o_j
Output of unit k: o_k = 1 / (1 + e^−(a_k + θ_k))
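A runnable sketch of the forward pass defined by these formulas; the layer sizes, weights, and thresholds below are illustrative only.

```python
import math

def sigmoid(a, theta=0.0):
    return 1.0 / (1.0 + math.exp(-(a + theta)))

def layer(inputs, weights, thetas):
    """One layer: unit j receives a_j = sum_i weights[i][j] * inputs[i]."""
    outs = []
    for j, theta in enumerate(thetas):
        a_j = sum(inputs[i] * weights[i][j] for i in range(len(inputs)))
        outs.append(sigmoid(a_j, theta))
    return outs

def mlp_forward(x, w_in_hidden, theta_hidden, w_hidden_out, theta_out):
    hidden = layer(x, w_in_hidden, theta_hidden)
    return layer(hidden, w_hidden_out, theta_out)

# Illustrative weights only: 3 inputs -> 2 hidden units -> 1 output.
print(mlp_forward([34.0, 2.0, 4.0],
                  [[0.6, 0.5], [0.8, 0.2], [0.1, 0.3]], [0.0, 0.0],
                  [[0.7], [0.2]], [0.0]))
```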


Neural Network Model

(Diagram: inputs Age = 34, Gender = 2, Stage = 4 connect through weights to a hidden layer, which connects through weights to the output, the "Probability of being Alive" = 0.6. Independent variables, weights, hidden layer, weights, dependent variable; the output is the prediction.)


"Combined logistic models"

(Diagram: the same inputs and output, with a subset of the weights shown: the network viewed as combined logistic models.)


(Diagram: the same network with another subset of weights shown, from inputs Age = 34, Gender = 1, Stage = 4 through the hidden layer to the output, the "Probability of being Alive" = 0.6.)


Not really, no target for hidden units...

(Diagram: the same network, with inputs Age = 34, Gender = 2, Stage = 4, a hidden layer, and output "Probability of being Alive" = 0.6; the hidden units have no observed target values.)


Hidden Units and Backpropagation

(Diagram: input units feed hidden units, which feed output units; error = what we wanted − what we got is applied at the output units by the Δ rule and backpropagated to the hidden units by the Δ rule.)


Multilayer perceptrons
• Sigmoidal hidden layer
• Can represent arbitrary decision regions
• Can be trained similarly to perceptrons


ECG Interpretation

(Diagram: inputs R-R interval, S-T elevation, QRS duration, QRS amplitude, AVF lead, P-R interval feed a network with outputs SV tachycardia, Ventricular tachycardia, LV hypertrophy, RV hypertrophy, Myocardial infarction.)


Linear Separation

Separate n-dimensional space using one (n − 1)-dimensional space.

(Diagrams: a 2D example with axes Cough and Headache, split by a line into No disease (no cough, no headache), Pneumonia (cough, no headache), Flu (no cough, headache), and Meningitis (cough, headache), with the corner patterns 00, 10, 01, 11; and a 3D example in which the binary patterns 000 to 111 are split by a plane into Treatment and No treatment.)


Another way of thinking about this...
• Have data set D = {(x_i, t_i)} drawn from probability distribution P(x, t)
• Model P(x, t) given samples D by an ANN with adjustable parameter w
• Statistics analogy:


Maximum Likelihood Estimation
• Maximize likelihood of data D
• Likelihood L = Π_i p(x_i, t_i) = Π_i p(t_i | x_i) p(x_i)
• Minimize −log L = −Σ_i log p(t_i | x_i) − Σ_i log p(x_i)
• Drop second term: does not depend on w
• Two cases: "regression" and classification


Likelihood for classification (i.e. categorical target)
• For classification, targets t are class labels
• Minimize −Σ_i log p(t_i | x_i)
• p(t_i | x_i) = y(x_i, w)^t_i (1 − y(x_i, w))^(1 − t_i) ⇒
  −log p(t_i | x_i) = −t_i log y(x_i, w) − (1 − t_i) log(1 − y(x_i, w))
• Minimizing −log L is equivalent to minimizing
  −Σ_i [t_i log y(x_i, w) + (1 − t_i) log(1 − y(x_i, w))]  (cross-entropy error)


Likelihood for "regression" (i.e. continuous target)
• For regression, targets t are real values
• Minimize −Σ_i log p(t_i | x_i)
• p(t_i | x_i) = (1/Z) exp(−(y(x_i, w) − t_i)² / (2σ²)) ⇒
  −log p(t_i | x_i) = (1/(2σ²)) (y(x_i, w) − t_i)² + log Z
• y(x_i, w) is the network output
• Minimizing −log L is equivalent to minimizing Σ_i (y(x_i, w) − t_i)²  (sum-of-squares error)
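A quick numeric check (with made-up numbers) that the Gaussian negative log-likelihood equals the sum-of-squares error scaled by 1/(2σ²) plus a constant, so both are minimized by the same w.

```python
import math

targets = [1.2, 0.7, -0.3]
outputs = [1.0, 0.9, -0.1]           # y(x_i, w) for some fixed w (illustrative)
sigma = 1.5
Z = math.sqrt(2 * math.pi) * sigma   # Gaussian normalizing constant

sse = sum((y - t) ** 2 for y, t in zip(outputs, targets))
neg_log_L = sum((y - t) ** 2 / (2 * sigma ** 2) + math.log(Z)
                for y, t in zip(outputs, targets))

# -log L = SSE / (2*sigma^2) + n*log Z, so minimizing -log L over w
# is the same as minimizing the sum-of-squares error.
print(sse, neg_log_L, sse / (2 * sigma ** 2) + len(targets) * math.log(Z))
```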


Backpropagation algorithm
• Minimizing the error function by gradient descent: w_{k+1} = w_k − η grad_w E
• Iterative gradient calculation by propagating error signals (the backpropagation algorithm)
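A compact sketch (not the lecture's own code) of backpropagation for one hidden layer of sigmoidal units with a sum-of-squares error; the network sizes, learning rate, and the XOR training data are illustrative.

```python
import math, random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, b1, W2, b2):
    h = [sigmoid(sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
         for j in range(len(b1))]
    o = [sigmoid(sum(h[j] * W2[j][k] for j in range(len(h))) + b2[k])
         for k in range(len(b2))]
    return h, o

def backprop_step(x, t, W1, b1, W2, b2, eta=0.5):
    """One step w <- w - eta * grad_w E on a single case,
    with E = 0.5 * sum_k (o_k - t_k)^2 and sigmoidal units."""
    h, o = forward(x, W1, b1, W2, b2)
    # Error signals at the output layer...
    d_out = [(o[k] - t[k]) * o[k] * (1 - o[k]) for k in range(len(o))]
    # ...propagated back to the hidden layer.
    d_hid = [h[j] * (1 - h[j]) * sum(d_out[k] * W2[j][k] for k in range(len(o)))
             for j in range(len(h))]
    for j in range(len(h)):
        for k in range(len(o)):
            W2[j][k] -= eta * d_out[k] * h[j]
    for k in range(len(o)):
        b2[k] -= eta * d_out[k]
    for i in range(len(x)):
        for j in range(len(h)):
            W1[i][j] -= eta * d_hid[j] * x[i]
    for j in range(len(h)):
        b1[j] -= eta * d_hid[j]

# Train a 2-2-1 network on XOR (illustrative sizes and learning rate).
random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
W2 = [[random.uniform(-1, 1)] for _ in range(2)]
b1, b2 = [0.0, 0.0], [0.0]
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
for _ in range(20000):
    for x, t in data:
        backprop_step(x, t, W1, b1, W2, b2)
# Outputs should move toward 0, 1, 1, 0; like any gradient descent,
# backpropagation can also get stuck in a local minimum.
for x, t in data:
    print(x, round(forward(x, W1, b1, W2, b2)[1][0], 2))
```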


Backpropagation algorithm

Problem: how to set the learning rate η?

Better: use more advanced minimization algorithms (second-order information).

(Figures: contour plots comparing minimization paths. Figures by MIT OCW.)


Backpropagation algorithm

Classification: cross-entropy error. Regression: sum-of-squares error.


Overfitting

(Diagram: the real distribution versus an overfitted model.)


Improving generalization

Problem: memorizing (x, t) combinations ("overtraining")

(Table: rows of training inputs with their targets, followed by a new input row whose target is marked "?".)


Improving generalization
• Need test set to judge performance
• Goal: represent information in data set, not noise
• How to improve generalization?
 – Limit network topology
 – Early stopping
 – Weight decay


Limit network topology
• Idea: fewer weights ⇒ less flexibility
• Analogy to polynomial interpolation:


Limit network topology 


Early Stopping

(Plots: error versus training cycles for the training set and a holdout set; the holdout error starts rising while the training error keeps falling, marking the overfitted model. A second panel shows CHD versus age for the "real" model and the overfitted model.)


Early stopping
• Idea: stop training when information (but not noise) is modeled
• Need a hold-out set to determine when to stop training
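A sketch of this hold-out based stopping rule; `train_one_epoch` and `holdout_error` are hypothetical callbacks standing in for whatever training and evaluation routines are used, and the patience value is illustrative.

```python
import copy

def train_with_early_stopping(model, train_one_epoch, holdout_error,
                              max_epochs=1000, patience=20):
    """Stop when the hold-out error has not improved for `patience` epochs;
    return the weights that were best on the hold-out set."""
    best_err = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(model)          # one pass over the training set
        err = holdout_error(model)      # error on the hold-out set
        if err < best_err:
            best_err, best_model = err, copy.deepcopy(model)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                # training error may keep falling, but the hold-out
                # error has stopped improving: stop here
                break
    return best_model, best_err
```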


Overfitting

(Plot: tss versus epochs for b = training set and a = hold-out set; the stopping criterion is marked at min(Δtss) on the hold-out curve, after which the model becomes overfitted.)


Early stopping 


Weight decay
• Idea: control smoothness of the network output by controlling the size of the weights
• Add a term α||w||² to the error function
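A small illustration (plain Python, made-up values) of how the α||w||² term changes the error and its gradient.

```python
def decayed_error(error, weights, alpha=0.01):
    """E_decay = E + alpha * ||w||^2."""
    return error + alpha * sum(w * w for w in weights)

def decayed_gradient(grad, weights, alpha=0.01):
    """grad E_decay = grad E + 2 * alpha * w, so large weights are
    pulled toward zero, keeping the network output smooth."""
    return [g + 2 * alpha * w for g, w in zip(grad, weights)]

print(decayed_error(0.35, [1.5, -2.0, 0.1]))
print(decayed_gradient([0.2, -0.1, 0.0], [1.5, -2.0, 0.1]))
```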


Bayesian perspective
• Error function minimization corresponds to the maximum likelihood (ML) estimate: a single best solution w_ML
• Can lead to overtraining
• Bayesian approach: consider the weight posterior distribution p(w|D)


Bayesian perspective
• Posterior = likelihood × prior
• p(w|D) = p(D|w) p(w) / p(D)
• Two approaches to approximating p(w|D):
 – Sampling
 – Gaussian approximation
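A toy, one-weight illustration of posterior = likelihood × prior, evaluated on a grid; this is neither of the two approximation approaches named above, and the prior, data, and model are made up for illustration.

```python
import math

# Toy one-weight example: Gaussian prior on w, Bernoulli likelihood of the data.
ws = [i / 100.0 for i in range(-300, 301)]      # grid of candidate weight values
prior = [math.exp(-w * w / 2) for w in ws]      # p(w), unnormalized
data = [(1.0, 1), (2.0, 1), (-1.0, 0)]          # (x, t) pairs, illustrative

def likelihood(w):
    L = 1.0
    for x, t in data:
        y = 1.0 / (1.0 + math.exp(-w * x))      # network output for this w
        L *= y ** t * (1 - y) ** (1 - t)        # p(t | x, w)
    return L

post = [p * likelihood(w) for w, p in zip(ws, prior)]
Z = sum(post)                                   # plays the role of p(D)
post = [p / Z for p in post]                    # posterior on the grid
print(ws[post.index(max(post))])                # weight with highest posterior
```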


Sampling from p(w|D)

(Figure: prior and likelihood.)


Sampling from p(w|D)

prior × likelihood = posterior



Bayesian example for regression


Bayesian example for classification

Model Features 


(with strong personal biases)

Method                   Modeling Effort   Examples Needed   Explanation
Rule-based Exp. Syst.    high              low               high?
Classification Trees     low               high+             "high"
Neural Nets, SVM         low               high              low
Regression Models        high              moderate          moderate
Learned Bayesian Nets    low               high+             high (beautiful when it works)


Regression vs. Neural Networks

(Diagrams: in the regression model, Y depends on X1, X2, X3 and the interaction terms X1X2, X1X3, X2X3, X1X2X3, giving (2³ − 1) possible combinations:
Y = a(X1) + b(X2) + c(X3) + d(X1X2) + ...
In the neural network, Y depends on X1, X2, X3 through hidden units, labeled "X1", "X2", "X1X3", "X1X2X3".)


Summary
• ANNs inspired by functionality of brain
• Nonlinear data model
• Trained by minimizing an error function
• Goal is to generalize well
• Avoid overtraining
• Distinguish ML and MAP solutions


Some References

Introductory and Historical Textbooks
• Rumelhart, D.E., and McClelland, J.L. (eds). Parallel Distributed Processing. MIT Press, Cambridge, 1986. (H)
• Hertz, J.A., Palmer, R.G., and Krogh, A.S. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991.
• Pao, Y.H. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, 1989.
• Bishop, C.M. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.
