Perceptron Multilayer


Page 1: Perceptron Multilayer


Harvard-MIT Division of Health Sciences and Technology
HST.951J: Medical Decision Support, Fall 2005
Instructors: Professor Lucila Ohno-Machado and Professor Staal Vinterbo

6.873/HST.951 Medical Decision Support, Spring 2005

 Artificial Neural Networks  

Lucila Ohno-Machado (with many slides borrowed from Stephan Dreiseitl. Courtesy of Stephan Dreiseitl. Used with permission.)

Page 2: Perceptron Multilayer


Overview

• Motivation

• Perceptrons

• Multilayer perceptrons

• Improving generalization

• Bayesian perspective

Page 3: Perceptron Multilayer


Motivation 

Images removed due to copyright reasons.

(Image captions: benign lesion, malignant lesion, benign lesion, malignant lesion.)

Page 4: Perceptron Multilayer


Motivation

• Human brain

 – Parallel processing

 – Distributed representation

 – Fault tolerant

 – Good generalization capability

• Mimic structure and processing in a computational model

Page 5: Perceptron Multilayer


Biological Analogy 

(Diagram: a biological neuron with dendrites, axon, and synapses, set against the network analogue of nodes connected by excitatory (+) and inhibitory (-) synapses, i.e. weights.)

Page 6: Perceptron Multilayer


Perceptrons 

Page 7: Perceptron Multilayer


(Diagram: binary input patterns (00, 01, 10, 11) presented to an input layer and mapped through an output layer into sorted patterns.)

Page 8: Perceptron Multilayer


 Activation functions 

Page 9: Perceptron Multilayer


Perceptrons (linear machines)

(Diagram: input units (Cough, Headache) connected by weights to output units (No disease, Pneumonia, Flu, Meningitis). The Δ rule changes the weights to decrease the error = what we wanted - what we got.)

Page 10: Perceptron Multilayer


Abdominal Pain Perceptron

(Diagram: input units Male, Age, Temp, WBC, Pain Intensity, Pain Duration, with example values 1, 20, 37, 10, 1, 1, connected by adjustable weights to output units Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Cholecystitis, Small Bowel Obstruction, Pancreatitis, Non-specific Pain, with example outputs 0 1 0 0 0 0 0.)

Page 11: Perceptron Multilayer


AND

A single unit y with inputs x1, x2, weights w1, w2, and threshold θ = 0.5:

y = f(x1·w1 + x2·w2), where f(a) = 1 for a > θ and f(a) = 0 for a ≤ θ

input   output
0 0     0
0 1     0
1 0     0
1 1     1

f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 0
f(1·w1 + 0·w2) = 0
f(1·w1 + 1·w2) = 1

Some possible values for w1 and w2 (a runnable check follows below):

w1      w2
0.20    0.35
0.20    0.40
0.25    0.30
0.40    0.20
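To make the threshold-unit arithmetic above concrete, here is a minimal sketch in Python (added here, not part of the original slides) that checks one of the listed weight pairs against the AND truth table; the function and variable names are illustrative.

```python
def step(a, theta=0.5):
    """Threshold activation: 1 if a > theta, else 0."""
    return 1 if a > theta else 0

def perceptron(x1, x2, w1, w2, theta=0.5):
    """Single threshold unit: y = f(x1*w1 + x2*w2)."""
    return step(x1 * w1 + x2 * w2, theta)

# One of the weight pairs from the slide; it reproduces the AND table.
w1, w2 = 0.20, 0.35
for x1, x2, target in [(0, 0, 0), (0, 1, 0), (1, 0, 0), (1, 1, 1)]:
    y = perceptron(x1, x2, w1, w2)
    print(f"{x1} {x2} -> {y} (expected {target})")
```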

Page 12: Perceptron Multilayer


Single layer neural network

Input to unit i: a_i = measured value of variable i
Input to unit j: a_j = Σ_i w_ij a_i
Output of unit j: o_j = 1 / (1 + e^-(a_j + θ_j))

(Diagram: input units i feeding output units j through weights w_ij.)
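As a quick illustration of these formulas (added, not from the slides), the sketch below computes the output of one sigmoid unit from a vector of inputs; the names and values are illustrative.

```python
import math

def unit_output(inputs, weights, theta):
    """o_j = 1 / (1 + exp(-(a_j + theta_j))), with a_j = sum_i w_ij * a_i."""
    a_j = sum(w * a for w, a in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-(a_j + theta)))

# Example: two inputs (e.g. cough = 1, headache = 0) with made-up weights.
print(unit_output([1.0, 0.0], [0.8, -0.4], theta=-0.2))
```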

Page 13: Perceptron Multilayer


Single layer neural network

Input to unit i: a_i = measured value of variable i
Input to unit j: a_j = Σ_i w_ij a_i
Output of unit j: o_j = 1 / (1 + e^-(a_j + θ_j))

(Plot: sigmoid output curves rising from 0 to 1 over inputs from about -15 to 15; several series illustrate the effect of increasing the offset θ_j.)

Page 14: Perceptron Multilayer


Training: Minimize Error

(Diagram: input units (Cough, Headache) connected by weights to output units (No disease, Pneumonia, Flu, Meningitis). The Δ rule changes the weights to decrease the error = what we wanted - what we got.)

Page 15: Perceptron Multilayer


Error Functions

• Mean Squared Error (for regression problems), where t is target and o is output:
  Σ(t - o)² / n

• Cross Entropy Error (for binary classification):
  −Σ [t log o + (1 - t) log(1 - o)]
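The following short sketch (added here, not from the slides) computes both error functions for a batch of targets and outputs; it assumes the outputs lie strictly between 0 and 1 so the logarithms are defined.

```python
import math

def mean_squared_error(targets, outputs):
    """MSE = sum((t - o)^2) / n, used for regression."""
    n = len(targets)
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / n

def cross_entropy_error(targets, outputs):
    """CE = -sum(t*log(o) + (1-t)*log(1-o)), used for binary classification."""
    return -sum(t * math.log(o) + (1 - t) * math.log(1 - o)
                for t, o in zip(targets, outputs))

targets = [1, 0, 1]
outputs = [0.9, 0.2, 0.6]
print(mean_squared_error(targets, outputs), cross_entropy_error(targets, outputs))
```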

Page 16: Perceptron Multilayer


Error function

• Convention: w := (w0, w), x := (1, x)
• w0 is the “bias”
• o = f(w · x)
• Class labels t_i ∈ {+1, -1}
• Error measure: E = -Σ_{i miscl.} t_i (w · x_i)
• How to minimize E?

(Diagram: a perceptron y with inputs x1, x2, weights w1, w2, and θ = 0.5.)

Page 17: Perceptron Multilayer


Minimizing the Error

(Figure: an error surface over the weight w, showing the initial error at the initial w, the derivative pointing downhill, and the final error at a local minimum for the trained w.)

Page 18: Perceptron Multilayer


Gradient descent

(Figure: an error curve over the weights, with both a local minimum and the global minimum.)

Page 19: Perceptron Multilayer


Perceptron learning

• Find minimum of E by iterating w_{k+1} = w_k - η grad_w E
• E = -Σ_{i miscl.} t_i (w · x_i) ⇒ grad_w E = -Σ_{i miscl.} t_i x_i
• “online” version: pick a misclassified x_i and update w_{k+1} = w_k + η t_i x_i

Page 20: Perceptron Multilayer


Perceptron learning

• Update rule: w_{k+1} = w_k + η t_i x_i (a worked sketch follows below)
• Theorem: perceptron learning converges for linearly separable sets
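As a concrete illustration of the online update rule above, here is a minimal sketch (added, not from the slides) that trains a perceptron with {+1, -1} labels and the bias folded into the weight vector via x := (1, x), as in the convention introduced earlier; the toy data and names are illustrative.

```python
def predict(w, x):
    """Sign of w . x, with labels in {+1, -1}."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

def train_perceptron(data, eta=0.1, epochs=100):
    """Online perceptron learning: w <- w + eta * t_i * x_i for misclassified x_i."""
    n = len(data[0][0])
    w = [0.0] * n
    for _ in range(epochs):
        mistakes = 0
        for x, t in data:
            if predict(w, x) != t:        # misclassified example
                w = [wi + eta * t * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                 # converged (guaranteed if separable)
            break
    return w

# Linearly separable toy data; each x is prefixed with 1 for the bias term.
data = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], +1)]
w = train_perceptron(data)
print(w, [predict(w, x) for x, _ in data])
```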

Page 21: Perceptron Multilayer


Gradient descent

• Simple function minimization algorithm
• Gradient is vector of partial derivatives
• Negative gradient is direction of steepest descent

(Contour plots of a function with gradient-descent steps. Figures by MIT OCW.)
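To make the three points above concrete, here is a small sketch (added, not from the slides) that minimizes a simple two-variable function by stepping along the negative gradient; the function and step size are illustrative.

```python
def grad(x, y):
    """Partial derivatives of f(x, y) = x^2 + 2*y^2."""
    return 2 * x, 4 * y

x, y = 3.0, 2.0          # starting point
eta = 0.1                # learning rate
for _ in range(100):
    gx, gy = grad(x, y)
    x, y = x - eta * gx, y - eta * gy   # step in the direction of steepest descent
print(x, y)              # approaches the minimum at (0, 0)
```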

Page 22: Perceptron Multilayer


Classification Model

(Diagram: inputs Age = 34, Gender = 1, Stage = 4 connected by coefficients (weights) to a single output, the “Probability of being Alive” = 0.6. Independent variables x1, x2, x3; coefficients a, b, c; dependent variable p; the output is the prediction.)

Page 23: Perceptron Multilayer


Terminology

• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = “weights”
• Estimates = “targets”
• Iterative step = cycle, epoch

Page 24: Perceptron Multilayer


XOR

A single unit y with inputs x1, x2, weights w1, w2, and θ = 0.5:

y = f(x1·w1 + x2·w2), where f(a) = 1 for a > θ and f(a) = 0 for a ≤ θ

input   output
0 0     0
0 1     1
1 0     1
1 1     0

f(0·w1 + 0·w2) = 0
f(0·w1 + 1·w2) = 1
f(1·w1 + 0·w2) = 1
f(1·w1 + 1·w2) = 0

some possible values for w1 and w2:

w1      w2
(none: no pair of weights can satisfy all four constraints at once)

Page 25: Perceptron Multilayer


XOR

input   output
0 0     0
0 1     1
1 0     1
1 1     0

Network: x1 and x2 feed a hidden unit z through w1 and w2, and also feed the output y directly through w3 and w4; z feeds y through w5. Both units use θ = 0.5 and f(a) = 1 for a > θ, 0 for a ≤ θ.

f(w1, w2, w3, w4, w5)

a possible set of values for the ws:
(w1, w2, w3, w4, w5) = (0.3, 0.3, 1, 1, -2)

Page 26: Perceptron Multilayer


XOR

input   output
0 0     0
0 1     1
1 0     1
1 1     0

Network: x1 and x2 feed two hidden units through w1–w4; the hidden units feed the output through w5 and w6. θ = 0.5 for all units, with f(a) = 1 for a > θ, 0 for a ≤ θ.

f(w1, w2, w3, w4, w5, w6)

a possible set of values for the ws (checked in the sketch below):
(w1, w2, w3, w4, w5, w6) = (0.6, -0.6, -0.7, 0.8, 1, 1)
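The sketch below (added, not from the slides) checks the two-hidden-unit weight set against the XOR truth table, under the assumed wiring that w1, w2 feed the first hidden unit, w3, w4 feed the second, and w5, w6 connect the hidden units to the output; if the slide's diagram wires the weights differently, the numbers would need to be permuted accordingly.

```python
def f(a, theta=0.5):
    """Threshold activation used on every unit."""
    return 1 if a > theta else 0

def xor_net(x1, x2, w1, w2, w3, w4, w5, w6):
    """Two hidden threshold units followed by one output threshold unit."""
    h1 = f(w1 * x1 + w2 * x2)   # roughly "x1 and not x2"
    h2 = f(w3 * x1 + w4 * x2)   # roughly "x2 and not x1"
    return f(w5 * h1 + w6 * h2)

weights = (0.6, -0.6, -0.7, 0.8, 1, 1)   # the set given on the slide
for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, "->", xor_net(x1, x2, *weights))
```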

Page 27: Perceptron Multilayer


From perceptrons to multilayer perceptrons

Why? 

Page 28: Perceptron Multilayer


Abdominal Pain

(Diagram: input units Male = 1, Age = 20, Temp = 37, WBC = 10, Pain Intensity = 1, Pain Duration = 1 connected by adjustable weights to output units Appendicitis, Diverticulitis, Perforated Duodenal Ulcer, Cholecystitis, Small Bowel Obstruction, Pancreatitis, Non-specific Pain, with example outputs 0 1 0 0 0 0 0.)

Page 29: Perceptron Multilayer


Heart Attack Network

(Diagram: input units Pain Duration, Pain Intensity, ECG: ST elevation, Smoker, Age, Male feeding an output unit Myocardial Infarction; the network output 0.8 is the “probability” of MI.)

Page 30: Perceptron Multilayer


Multilayered Perceptrons

Perceptron:
• Input to unit i: a_i = measured value of variable i
• Input to unit j: a_j = Σ_i w_ij a_i
• Output of unit j: o_j = 1 / (1 + e^-(a_j + θ_j))

Multilayered perceptron (the units j now act as hidden units between the input units i and the output units k):
• Input to unit k: a_k = Σ_j w_jk o_j
• Output of unit k: o_k = 1 / (1 + e^-(a_k + θ_k))
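A compact sketch of this two-stage forward pass (added here, not from the slides); the layer sizes, weights, and names are illustrative.

```python
import math

def sigmoid(a, theta=0.0):
    return 1.0 / (1.0 + math.exp(-(a + theta)))

def layer_forward(inputs, weights, thetas):
    """Each unit j: a_j = sum_i weights[i][j] * inputs[i], o_j = sigmoid(a_j + theta_j)."""
    outputs = []
    for j, theta in enumerate(thetas):
        a_j = sum(weights[i][j] * inputs[i] for i in range(len(inputs)))
        outputs.append(sigmoid(a_j, theta))
    return outputs

# Illustrative network: 3 inputs -> 2 hidden units -> 1 output unit.
x = [34.0, 1.0, 4.0]                        # e.g. Age, Gender, Stage
w_in_hidden = [[0.01, -0.02], [0.5, 0.3], [-0.2, 0.1]]
w_hidden_out = [[0.7], [-0.4]]
hidden = layer_forward(x, w_in_hidden, thetas=[0.0, 0.0])
output = layer_forward(hidden, w_hidden_out, thetas=[0.0])
print(hidden, output)
```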

Page 31: Perceptron Multilayer


Neural Network Model

(Diagram: independent variables Age = 34, Gender = 2, Stage = 4 feed a hidden layer through one set of weights; the hidden layer feeds the dependent variable through a second set of weights, giving the prediction “Probability of being Alive” = 0.6.)

Page 32: Perceptron Multilayer


“Combined logistic models”

(Diagram: the same network viewed as logistic models combined: inputs Age = 34, Gender = 2, Stage = 4, a hidden layer, and output “Probability of being Alive” = 0.6.)

Page 33: Perceptron Multilayer


Page 34: Perceptron Multilayer


(Diagram: the same network again, with inputs Age = 34, Gender = 1, Stage = 4, a hidden layer, and output “Probability of being Alive” = 0.6.)

Page 35: Perceptron Multilayer


Not really, no target for hidden units…

(Diagram: the same network as before, with inputs Age = 34, Gender = 2, Stage = 4, a hidden layer, and output 0.6; the hidden units have no targets of their own.)

Page 36: Perceptron Multilayer


Hidden Units and Backpropagation

(Diagram: input units feed hidden units, which feed output units. The error = what we wanted - what we got is computed at the output units; the Δ rule adjusts the output-layer weights, and the error signal is backpropagated so the Δ rule can also adjust the hidden-layer weights.)

Page 37: Perceptron Multilayer


Multilayer perceptrons

• Sigmoidal hidden layer
• Can represent arbitrary decision regions
• Can be trained similarly to perceptrons

Page 38: Perceptron Multilayer


ECG Interpretation

(Diagram: input features R-R interval, P-R interval, S-T elevation, QRS duration, QRS amplitude, AVF lead feeding output diagnoses SV tachycardia, Ventricular tachycardia, LV hypertrophy, RV hypertrophy, Myocardial infarction.)

Page 39: Perceptron Multilayer


Linear Separation

Separate n-dimensional space using one (n - 1)-dimensional space.

(Diagram: a two-feature square of cough/headache patterns (00, 01, 10, 11) labeled No disease, Pneumonia, Flu, Meningitis, and a three-feature cube (000 … 111) split into Treatment and No treatment regions by a plane.)

Page 40: Perceptron Multilayer


Another way of thinking about this…

• Have data set D = {(x_i, t_i)} drawn from probability distribution P(x, t)
• Model P(x, t) given samples D by an ANN with adjustable parameters w
• Statistics analogy:

Page 41: Perceptron Multilayer


Maximum Likelihood Estimation

• Maximize likelihood of data D
• Likelihood L = Π_i p(x_i, t_i) = Π_i p(t_i | x_i) p(x_i)
• Minimize -log L = -Σ_i log p(t_i | x_i) - Σ_i log p(x_i)
• Drop second term: does not depend on w
• Two cases: “regression” and classification

Page 42: Perceptron Multilayer


Likelihood for classification (i.e. categorical target)

• For classification, targets t are class labels
• Minimize -Σ_i log p(t_i | x_i)
• p(t_i | x_i) = y(x_i, w)^t_i (1 - y(x_i, w))^(1 - t_i) ⇒
  -log p(t_i | x_i) = -t_i log y(x_i, w) - (1 - t_i) log(1 - y(x_i, w))
• Minimizing -log L is equivalent to minimizing
  -Σ_i [t_i log y(x_i, w) + (1 - t_i) log(1 - y(x_i, w))]
  (cross-entropy error)

Page 43: Perceptron Multilayer


Likelihood for “regression” (i.e. continuous target)

• For regression, targets t are real values
• Minimize -Σ_i log p(t_i | x_i)
• p(t_i | x_i) = (1/Z) exp(-(y(x_i, w) - t_i)² / (2σ²)) ⇒
  -log p(t_i | x_i) = (1/(2σ²)) (y(x_i, w) - t_i)² + log Z
• y(x_i, w) is the network output
• Minimizing -log L is equivalent to minimizing Σ_i (y(x_i, w) - t_i)² (sum-of-squares error)
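As a quick numerical illustration (added, not from the slides), the sketch below computes the negative log-likelihood for both cases and compares it to the cross-entropy and sum-of-squares errors; the data, σ, and Z are illustrative.

```python
import math

targets = [1, 0, 1]
outputs = [0.9, 0.2, 0.6]          # y(x_i, w) for a binary classification net

# Classification: -log Bernoulli likelihood equals the cross-entropy error.
neg_log_like = -sum(math.log(y if t == 1 else 1 - y)
                    for t, y in zip(targets, outputs))
cross_entropy = -sum(t * math.log(y) + (1 - t) * math.log(1 - y)
                     for t, y in zip(targets, outputs))
print(neg_log_like, cross_entropy)                 # identical

# Regression: -log Gaussian likelihood equals sum-of-squares up to constants.
t_real = [1.2, 0.4, 2.0]
y_real = [1.0, 0.5, 1.8]
sigma = 1.0
Z = math.sqrt(2 * math.pi) * sigma
neg_log_like_reg = sum((y - t) ** 2 / (2 * sigma ** 2) + math.log(Z)
                       for t, y in zip(t_real, y_real))
sum_of_squares = sum((y - t) ** 2 for t, y in zip(t_real, y_real))
print(neg_log_like_reg, sum_of_squares / 2 + 3 * math.log(Z))   # identical
```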

Page 44: Perceptron Multilayer


Backpropagation algorithm

• Minimize the error function by gradient descent: w_{k+1} = w_k - η grad_w E
• Iterative gradient calculation by propagating error signals
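To show how the pieces fit together, here is a minimal sketch (added, not from the slides) of a one-hidden-layer sigmoid network trained by backpropagation with the sum-of-squares error on the XOR data; the layer sizes, learning rate, and initial weights are illustrative.

```python
import math
import random

random.seed(0)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# XOR training data.
data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]

# 2 inputs -> 2 hidden units -> 1 output; the biases play the role of the thetas.
n_in, n_hid = 2, 2
w_hid = [[random.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hid)]
b_hid = [0.0] * n_hid
w_out = [random.uniform(-1, 1) for _ in range(n_hid)]
b_out = 0.0
eta = 0.5

for epoch in range(10000):
    for x, t in data:
        # Forward pass.
        h = [sigmoid(sum(w_hid[j][i] * x[i] for i in range(n_in)) + b_hid[j])
             for j in range(n_hid)]
        o = sigmoid(sum(w_out[j] * h[j] for j in range(n_hid)) + b_out)

        # Backward pass for the sum-of-squares error E = (o - t)^2 / 2:
        # the error signal is propagated from the output back to the hidden layer.
        delta_o = (o - t) * o * (1 - o)
        delta_h = [delta_o * w_out[j] * h[j] * (1 - h[j]) for j in range(n_hid)]

        # Gradient-descent updates: w <- w - eta * grad_w E.
        for j in range(n_hid):
            w_out[j] -= eta * delta_o * h[j]
            b_hid[j] -= eta * delta_h[j]
            for i in range(n_in):
                w_hid[j][i] -= eta * delta_h[j] * x[i]
        b_out -= eta * delta_o

# The outputs should approach the XOR targets; as the gradient-descent slide
# warns, a run can occasionally end in a local minimum, in which case
# re-initializing the weights usually helps.
for x, t in data:
    h = [sigmoid(sum(w_hid[j][i] * x[i] for i in range(n_in)) + b_hid[j])
         for j in range(n_hid)]
    o = sigmoid(sum(w_out[j] * h[j] for j in range(n_hid)) + b_out)
    print(x, t, round(o, 3))
```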

Page 45: Perceptron Multilayer


Backpropagation algorithm

Problem: how to set the learning rate η?

Better: use more advanced minimization algorithms (second-order information)

(Contour plots comparing minimization paths. Figures by MIT OCW.)

Page 46: Perceptron Multilayer


Backpropagation algorithm

Classification: cross-entropy error
Regression: sum-of-squares error

Page 47: Perceptron Multilayer


Overfitting

(Figure: a real distribution, a fitted model, and an overfitted model.)

Page 48: Perceptron Multilayer


Improving generalization

Problem: memorizing (x, t) combinations (“overtraining”)

(Table of training pairs, roughly: 0.7 0.5 → 0; 0.9 -0.5 → 1; 1 -1.2 → -0.2; 0.3 0.6 → 1; and a new input 0.5 -0.2 whose target is “?”.)

Page 49: Perceptron Multilayer


Improving generalization

• Need test set to judge performance
• Goal: represent information in data set, not noise
• How to improve generalization?
  – Limit network topology
  – Early stopping
  – Weight decay

Page 50: Perceptron Multilayer


Limit network topology

• Idea: fewer weights ⇒ less flexibility
• Analogy to polynomial interpolation:

Page 51: Perceptron Multilayer


Limit network topology 

Page 52: Perceptron Multilayer


Early Stopping

(Figures: left, CHD vs. age with a “real” model and an overfitted model; right, error vs. training cycles for the training set and a holdout set, with the holdout error rising once the model overfits.)

Page 53: Perceptron Multilayer


Early stopping

• Idea: stop training when information (but not noise) is modeled
• Need hold-out set to determine when to stop training
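A minimal sketch of this idea (added, not from the slides): train in epochs, track the error on a hold-out set, and keep the weights from the epoch where the hold-out error was lowest. The train_one_epoch and error functions here are assumed placeholders for whatever training loop and error function are in use.

```python
import copy

def early_stopping_training(model, train_set, holdout_set,
                            train_one_epoch, error, max_epochs=1000, patience=20):
    """Stop when the hold-out error has not improved for `patience` epochs."""
    best_error = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_best = 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_set)          # e.g. one backpropagation pass
        e = error(model, holdout_set)              # error on the hold-out set
        if e < best_error:
            best_error, best_model = e, copy.deepcopy(model)
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:      # hold-out error keeps rising
                break
    return best_model
```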

Page 54: Perceptron Multilayer


Overfitting

(Figure: total sum of squares (tss) vs. epochs for the hold-out set (a) and the training set (b); the training tss keeps falling while the hold-out tss reaches a minimum and then rises. The stopping criterion is based on min(Δtss) for the hold-out set, where the overfitted model begins.)

Page 55: Perceptron Multilayer


Early stopping 

Page 56: Perceptron Multilayer


Weight decay

• Idea: control smoothness of network output by controlling size of weights
• Add term α||w||² to error function
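A small sketch (added, not from the slides) of how the α||w||² term changes a gradient-descent step: its gradient 2αw is added to the data gradient, so every update also shrinks the weights toward zero.

```python
def decayed_update(w, grad_E, eta=0.1, alpha=0.01):
    """Gradient step on E(w) + alpha * ||w||^2; the penalty adds 2*alpha*w to the gradient."""
    return [wi - eta * (gi + 2 * alpha * wi) for wi, gi in zip(w, grad_E)]

# Example: with a zero data gradient, the weights simply decay toward zero.
w = [1.0, -2.0, 0.5]
for _ in range(5):
    w = decayed_update(w, grad_E=[0.0, 0.0, 0.0])
print(w)
```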

Page 57: Perceptron Multilayer


Page 58: Perceptron Multilayer


Bayesian perspective

• Error function minimization corresponds to the maximum likelihood (ML) estimate: single best solution w_ML
• Can lead to overtraining
• Bayesian approach: consider the weight posterior distribution p(w|D)

Page 59: Perceptron Multilayer


Bayesian perspective

• Posterior = likelihood * prior
• p(w|D) = p(D|w) p(w) / p(D)
• Two approaches to approximating p(w|D):
  – Sampling
  – Gaussian approximation
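To illustrate the sampling route (added, not from the slides), here is a minimal Metropolis sketch that draws samples from p(w|D) ∝ p(D|w) p(w) for a one-weight logistic model with a Gaussian prior; the data, prior width, and proposal width are illustrative, and predictions would then average the model output over the sampled weights.

```python
import math
import random

random.seed(1)
data = [(0.5, 1), (1.5, 1), (-1.0, 0), (-0.3, 0)]   # (x_i, t_i) pairs

def log_posterior(w):
    """log p(D|w) + log p(w): Bernoulli likelihood, Gaussian prior (sigma = 1)."""
    lp = -0.5 * w * w
    for x, t in data:
        y = 1.0 / (1.0 + math.exp(-w * x))
        lp += t * math.log(y) + (1 - t) * math.log(1 - y)
    return lp

samples, w = [], 0.0
for step in range(5000):
    proposal = w + random.gauss(0.0, 0.5)            # symmetric random-walk proposal
    if math.log(random.random()) < log_posterior(proposal) - log_posterior(w):
        w = proposal                                 # accept the proposed weight
    samples.append(w)

posterior_mean = sum(samples[1000:]) / len(samples[1000:])   # discard burn-in
print(posterior_mean)
```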

Page 60: Perceptron Multilayer


Sampling from p(w|D)

(Figure: the prior and the likelihood as functions of w.)

Page 61: Perceptron Multilayer


Sampling from p(w|D) 

prior * likelihood = posterior  


Page 62: Perceptron Multilayer


Bayesian example for regression

Page 63: Perceptron Multilayer


Bayesian example for classification

Page 64: Perceptron Multilayer


Model Features (with strong personal biases)

Method                  Modeling Effort   Examples Needed   Explanation
Rule-based Exp. Syst.   high              low               high?
Classification Trees    low               high+             “high”
Neural Nets, SVM        low               high              low
Regression Models       high              moderate          moderate
Learned Bayesian Nets   low               high+             high (beautiful when it works)

Page 65: Perceptron Multilayer


Regression vs. Neural Networks

Y = a(X1) + b(X2) + c(X3) + d(X1X2) + ...

(Diagram: a regression model in which the inputs X1, X2, X3 and the explicit interaction terms X1X2, X1X3, X2X3, X1X2X3 all feed Y, giving (2³ - 1) possible combinations, contrasted with a neural network whose hidden units (labelled “X1”, “X2”, “X1X3”, “X1X2X3”) stand in for such combinations.)

Page 66: Perceptron Multilayer


Summary

• ANNs inspired by functionality of brain
• Nonlinear data model
• Trained by minimizing error function
• Goal is to generalize well
• Avoid overtraining
• Distinguish ML and MAP solutions

Page 67: Perceptron Multilayer


Some References 

Introductory and Historical Textbooks

• Rumelhart DE, McClelland JL (eds). Parallel Distributed Processing. MIT Press, Cambridge, 1986. (H)
• Hertz JA, Palmer RG, Krogh AS. Introduction to the Theory of Neural Computation. Addison-Wesley, Redwood City, 1991.
• Pao YH. Adaptive Pattern Recognition and Neural Networks. Addison-Wesley, Reading, 1989.
• Bishop CM. Neural Networks for Pattern Recognition. Clarendon Press, Oxford, 1995.