
Brief Overview of Connectionism to Understand Learning

Walter Schneider
P2476 Cognitive Neuroscience of Human Learning & Instruction
http://schneider.lrdc.pitt.edu/P2476/index.htm

Slides adapted from:
U. Oxford Connectionist Summer School 1998, http://hincapie.psych.purdue.edu/CSS/index.html
Hinton Lectures on connectionism, http://www.cs.toronto.edu/~hinton/csc321/index.html
David Plaut, http://www.cnbc.cmu.edu/~plaut/ICM/

Specific Example: NetTalk

• NetTalk: Sejnowski, T. J. & Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1, 145-168.

Learning input: phonetic transcription of a child's continuous speech.

Simple Units

Learning Rules Change Connection Weights

• Learning rules calculate the difference between the desired (correct) output and the actual output and use that difference to change the weights so as to reduce the error (see the sketch below).

Learning took about 50,000 trials.

Note: if we assume 200 words per hour (welfare household) and 5 hr/day, that is 1,000 words/day, or 50 days.
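A minimal sketch of this principle on a single connection (illustrative values only, not the actual NetTalk code): the weight is nudged in proportion to the difference between the desired and actual output, so repeated updates shrink the error.

```python
# Minimal sketch of an error-driven weight update on one connection.
# All values (learning rate, starting weight, training pair) are illustrative.
learning_rate = 0.1
weight = 0.0

def output(weight, activation_in):
    """Linear unit: output is just weight * input."""
    return weight * activation_in

for step in range(20):
    activation_in, desired = 1.0, 0.8            # one training pair
    actual = output(weight, activation_in)
    error = desired - actual                     # difference used to change the weight
    weight += learning_rate * error * activation_in
print(round(weight, 3))                          # has moved most of the way toward 0.8
```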

NetTalk Download (audio demo): Initial 0:46 (20 sec); Learns spaces 2:17 (20 s); After 10K epochs 3:50 (20 s); Transfer 5:19 (20 s). http://www.cnl.salk.edu/ParallelNetsPronounce/index.php Transfer to new words, same speaker: 78%.

Graceful Deterioration and robust processing with fast relearning

More hidden units: better performance, but slower learning

Unit Coding Unclear in Distributed Code

Hierarchical Clustering Sensible groupings

Performance characteristics

• With 120 hidden units:
  – 98% within trained units
  – 75% generalization on a dictionary of 20,012 words
  – 85% on the first pass, and 90% and 97.5% after 55 passes

• Adding 2 hidden layers of 80 units slightly improved generalization (but slows learning): 97% after 55 passes, 80% generalization.

Summary: Supervised Learning (NetTalk as an example of back-propagation learning)

• Performed computation with simple units, connection weight matrices, and parallel activation
• Learning rule provided an error signal from a supervisor to change connection weights
• It took on the order of 10^5 trials to reach good performance, going through babbling to word production
• Learning speed and generalization varied with the number of units and levels
• Showed good generalization to related words
• Developed a similarity space consistent with human clustering data
• Performance was robust to loss of units and connection noise
• Needed an expert teacher with the ability to reach into the "brain" to set the correct states

How is this like and not like human learning?

• Similar
  – Lots of trials
  – Babbling for a while before it makes sense
  – Ability to learn any language (e.g., Dutch)
  – Generalization to new words
  – Creates similarity spaces

• Dissimilar
  – Teacher shows exact correctness by activating the correct output units
  – Uses DECtalk, allowing only correct simple output
  – Very simple network, small number of units
  – Sequential presentation of the target
  – Learning reading, not babbling/speech
  – Accuracy does not reach human level
  – Unlikely to be biologically implementable (high-precision connections, back-propagating precise error across levels)
  – Does not learn from instruction, only from experience

Switch to Contrastive Hebbian Learning

Some Fundamental Concepts

• Parallel Processing
• Distributed Representations
• Learning (multiple types)
• Generalisation
• Graceful Degradation

Input   Output
0 0     0
0 1     1
1 0     1
1 1     0

[Diagram: XOR network with weights 1.0, 1.0, 1.0, 1.0, -1.0, -1.0]

Genres of Network Architecture

Introduction to Neural Computation

• Simplified Neuron
[Diagram: input connections → cell body (Σ, θ) → output connections]

• A layered neural network
[Diagram: a layer of input neurons connected to a layer of output neurons]

Introduction to Neural Computation

• A single output neuron
[Diagram: input neurons a_0 and a_1 connected by weights w_20 and w_21 to output neuron a_2]

netinput_2 = w_20·a_0 + w_21·a_1

netinput_i = Σ_j w_ij·a_j

a_i = 1.0 if netinput_i > 0.0, 0.0 otherwise
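A minimal sketch of the binary threshold unit defined above; the weight values here are assumed for illustration.

```python
# Sketch of the threshold unit above:
# netinput = sum_j(w_ij * a_j), and a_i = 1.0 if netinput > 0.0 else 0.0.
def unit_activation(weights, inputs, threshold=0.0):
    netinput = sum(w * a for w, a in zip(weights, inputs))
    return 1.0 if netinput > threshold else 0.0

w20, w21 = 0.6, -0.4                      # illustrative weights
for a0, a1 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    a2 = unit_activation([w20, w21], [a0, a1])
    print((a0, a1), "->", a2)
```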

The Mapping Principle

• Patterns of Activity
An input pattern is transformed to an output pattern.

• Activation States are Vectors
Each pattern of activity can be considered a unique point in a space of states. The activation vector identifies this point in space.
[Diagram: an input vector v_in plotted as a point in an x, y, z state space]

• Mapping Functions
T = F(S)
The network maps a source space S (the network inputs) to a target space T (the outputs).
The mapping function F is most likely complex. No simple mathematical formula can capture it explicitly.

• Hyperspace
Input states generally have a high dimensionality. Most network states are therefore considered to populate hyperspace.
[Diagram: mapping from source space S (v_in) to target space T (v_out)]

The Principle of Superposition

Matrix 1 (stores input [+1 -1 -1 +1] → output [-1 -1 +1 +1]):

  -0.25  +0.25  +0.25  -0.25
  -0.25  +0.25  +0.25  -0.25
  +0.25  -0.25  -0.25  +0.25
  +0.25  -0.25  -0.25  +0.25

Matrix 2 (stores input [-1 +1 -1 +1] → output [-1 +1 +1 -1]):

  +0.25  -0.25  +0.25  -0.25
  -0.25  +0.25  -0.25  +0.25
  -0.25  +0.25  -0.25  +0.25
  +0.25  -0.25  +0.25  -0.25

Composite Matrix (Matrix 1 + Matrix 2):

   0.0    0.0   +0.5   -0.5
  -0.5   +0.5    0.0    0.0
   0.0    0.0   -0.5   +0.5
  +0.5   -0.5    0.0    0.0
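A small sketch of superposition, assuming the two matrices are built as scaled outer products of the output and input patterns shown above (numpy is used only for the matrix arithmetic): the composite matrix still recalls both associations because the two input patterns are orthogonal.

```python
# Sketch of the superposition principle: each association is stored as a
# scaled outer product, and the composite matrix is simply their sum.
import numpy as np

in1, out1 = np.array([+1, -1, -1, +1]), np.array([-1, -1, +1, +1])
in2, out2 = np.array([-1, +1, -1, +1]), np.array([-1, +1, +1, -1])

m1 = 0.25 * np.outer(out1, in1)           # Matrix 1
m2 = 0.25 * np.outer(out2, in2)           # Matrix 2
composite = m1 + m2                        # superposition of both associations

print(composite)
print(composite @ in1)                     # recovers out1: [-1 -1  1  1]
print(composite @ in2)                     # recovers out2: [-1  1  1 -1]
```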

Hebbian Learning

• Cellular Association
"When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A's efficiency, as one of the cells firing B, is increased." (Hebb 1949, p. 50)

• Learning Connections
Take the product of the excitation of the two cells and change the value of the connection in proportion to this product.
[Diagram: a_in →(w)→ a_out]

• The Learning Rule (sketched in code after the table below)
Δw = ε·a_in·a_out, where ε is the learning rate.

• Changing Connections
If a_in = 0.5, a_out = 0.75, and ε = 0.5, then Δw = 0.5(0.75)(0.5) = 0.1875.
And if w_start = 0.0, then w_next = 0.1875.

• Calculating Correlations
[Diagram: input units 0 and 1 connected to output unit 2]

  Input (0)  Input (1)   Output (2)
  +          +           +
  +          -           -
  -          +           -
  -          -           +
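A minimal sketch of the Hebbian rule Δw = ε·a_in·a_out, reproducing the worked example above.

```python
# Sketch of the Hebbian rule: delta_w = epsilon * a_in * a_out.
# Values reproduce the worked example from the slide.
def hebbian_delta(a_in, a_out, epsilon):
    return epsilon * a_in * a_out

w = 0.0                                    # w_start
dw = hebbian_delta(a_in=0.5, a_out=0.75, epsilon=0.5)
w += dw
print(dw, w)                               # 0.1875 0.1875
```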

Models of English Past Tense

• PDP accounts
  – Single homogeneous architecture
  – Superposition
  – Competition between different verb types results in overregularisation and irregularisation
  – Vocabulary discontinuity
  – Rumelhart & McClelland 1986

[Figure: % Correct Past Tense (50-100%) vs. Training Epochs (0-200) for Regulars and Irregulars, with the vocabulary discontinuity marked. Model architecture: Wickelfeature Representation of Stem → Wickelfeature Representation of Past Tense.]

Using an Error Signal

• Orthogonality Constraint
The number of patterns is limited by the dimensionality of the network.
  – Input patterns must be orthogonal to each other
  – Similarity effects
[Diagram: input vector v_in in an x, y, z space; input neurons connected to output neurons]

• Perceptron Convergence Rule
Learning in a single-weight network: a_in →(w)→ a_out
Assume a teacher signal t_out:

  δ = t_out - a_out
  Δw = ε·(t_out - a_out)·a_in

Adaptation of connection and threshold (Rosenblatt 1958).
Note that the threshold always changes if the output is incorrect.
Blame is apportioned to a connection in proportion to the activity of the input line.

Using an Error Signal

• Perceptron Convergence Rule
"The perceptron convergence rule guarantees to find a solution to a mapping problem, provided a solution exists." (Minsky & Papert 1969)

• An Example of Perceptron Learning: Boolean OR
[Diagram: inputs connected by weights w20 and w21 to output a_out]

  Input   Output
  0 0     0
  1 0     1
  0 1     1
  1 1     1

Training the network:

  Input   Target   w20   w21   θ     a_out   δ     Δθ     Δw
  0 0     0        0.2   0.1   1.0   0       0      0      0
  1 0     1        0.2   0.1   1.0   0       1.0   -0.5    0.5
  0 1     1        0.7   0.1   0.5   0       1.0   -0.5    0.5
  1 1     1        0.7   0.6   0.0   1       0      0      0
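A sketch of the perceptron updates in the table above. The learning rate of 0.5 and the "fires when netinput ≥ θ" condition are assumptions read off from the Δθ and Δw entries rather than stated explicitly on the slide.

```python
# Sketch of perceptron learning on Boolean OR, following the training table above.
epsilon = 0.5
w = [0.2, 0.1]          # w20, w21 initial values from the table
theta = 1.0             # initial threshold

patterns = [((0, 0), 0), ((1, 0), 1), ((0, 1), 1), ((1, 1), 1)]   # Boolean OR

for (a0, a1), target in patterns:
    netinput = w[0] * a0 + w[1] * a1
    a_out = 1 if netinput >= theta else 0
    delta = target - a_out
    theta -= epsilon * delta                 # threshold always moves when output is wrong
    w[0] += epsilon * delta * a0             # blame in proportion to input activity
    w[1] += epsilon * delta * a1
    # prints the state after each update
    print((a0, a1), target, a_out, delta, round(theta, 2), [round(x, 2) for x in w])
```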

Gradient Descent

• Least Mean Square Error (LMS)
Define the error measure as the square of the discrepancy between the actual output and the desired output (Widrow-Hoff 1960):

  E = Σ_p (t_out - a_out)²

• Plot an error curve for a single-weight network
[Figure: Error vs. Weight Value]

• Make weight adjustments by performing gradient descent: always move down the slope.

  Δw = -k·dE/dw

• Calculating the Error Signal

  dE/dw = -2·(t_out - a_out)·a_in

Note that Perceptron Convergence and LMS use similar learning algorithms: the Delta Rule.

• Error Landscapes
Gradient descent algorithms adapt by moving downhill in a multi-dimensional landscape: the error surface.
Ball bearing analogy.
In a smooth landscape, the bottom will always be reached. However, the bottom may not correspond to zero error.
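A minimal sketch of LMS gradient descent on a single weight, using the derivative above. The learning rate k and the training pairs are illustrative, not taken from the slide.

```python
# Sketch of LMS gradient descent on a single weight:
# dE/dw = -2 * (t_out - a_out) * a_in, and delta_w = -k * dE/dw.
k = 0.05
w = 0.0
training = [(1.0, 0.5), (2.0, 1.0), (0.5, 0.25)]    # (a_in, t_out) pairs; targets fit w = 0.5

for epoch in range(50):
    for a_in, t_out in training:
        a_out = w * a_in                             # linear output unit
        dE_dw = -2.0 * (t_out - a_out) * a_in        # slope of the squared error
        w -= k * dE_dw                               # move downhill on the error surface
print(round(w, 3))                                    # approaches 0.5
```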

Past Tense Revisited

• Vocabulary Discontinuity
  – Up to 10 epochs: 8 irregulars + 2 regulars. Thereafter: 420 verbs, mostly regular.
  – Justification: irregulars are more frequent than regulars.

• Lack of Evidence
  – Vocabulary spurt at 2 years, whereas overregularizations occur at 3 years. Furthermore, the vocabulary spurt consists mostly of nouns.
  – Pinker and Prince (1988) show that regulars and irregulars are relatively balanced in early productive vocabularies.

[Figure (repeated): % Correct Past Tense vs. Training Epochs for Regulars and Irregulars, with the vocabulary discontinuity marked; Wickelfeature Representation of Stem → Wickelfeature Representation of Past Tense.]

Longitudinal Evidence

• Stages or phases in development?
  – Initial error-free performance.
  – Protracted period of overregularisation, but at low rates (typically < 10%).
  – Gradual recovery from error.
  – The rate of overregularisation is much less than the rate of regularisation of regular verbs.

[Figure: Performance on Irregular Verbs: Percent Correct (0-100) vs. Age in Months for four children (Adam, Sarah, Eve, Abe). Marcus et al. (1992).]

Longitudinal Evidence

• Error Characteristics
  – High-frequency irregulars are robust to overregularisation.
  – Some errors seem to be phonologically conditioned.
  – Irregularisations.

Single System Account

• Multi-layered Perceptrons
  – Hidden unit representation
  – Error correction technique
  – Plunkett & Marchman 1991
  – Type/Token distinction
  – Continuous training set

[Model: Standard Phonological Representation of Stem → Standard Phonological Representation of Past Tense]

Single System Account

• Incremental Vocabularies
  – Plunkett & Marchman (1993)
  – Initial small training set
  – Gradual expansion

• Overregularisation
  – Initial error-free performance.
  – Protracted period of overregularisation, but at low rates (typically < 5%).
  – High-frequency irregulars are robust to overregularisation.

[Figure: Simulated Performance on Irregular Verbs (Marcus et al. scoring): % Irregular Past Tense Correct (0-100) vs. Vocabulary Size (20-320).]


Linear Separability

• Boolean AND, OR and XOR

  Input   AND   OR   XOR
  0 0     0     0    0
  1 0     0     1    1
  0 1     0     1    1
  1 1     1     1    0

• Partitioning Problem Space
[Diagram: the four input points (0,0), (0,1), (1,0), (1,1) plotted for OR and XOR. OR can be partitioned by a single line; XOR cannot.]

Internal Representations

• Multi-layered Perceptrons: Solving XOR

  Input   Hidden   Target
  0 0     0 0      0
  1 0     1 0      1
  0 1     0 1      1
  1 1     0 0      0

[Diagram: input units → hidden units → output unit, with weights of 1.0, 1.0, 1.0, 1.0, -1.0, -1.0 and threshold θ = 1 at each unit]

• Representing Similarity Relations
Hidden units transform the input.
[Diagram: the input points (0,0), (0,1), (1,0), (1,1) re-represented in hidden-unit space]
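One weight assignment consistent with the diagram (threshold θ = 1, weights of 1.0 and -1.0); the exact wiring is not fully legible from the slide, so this particular assignment is an assumption that reproduces the Input/Hidden/Target table above.

```python
# Sketch of a hand-wired two-layer network for XOR (threshold theta = 1;
# weights of 1.0 and -1.0). The specific wiring here is an assumed solution.
def step(netinput, theta=1.0):
    return 1 if netinput >= theta else 0

def xor_net(i0, i1):
    h0 = step(1.0 * i0 + -1.0 * i1)      # responds only to input (1, 0)
    h1 = step(-1.0 * i0 + 1.0 * i1)      # responds only to input (0, 1)
    out = step(1.0 * h0 + 1.0 * h1)      # OR of the two hidden units
    return (h0, h1), out

for i0, i1 in [(0, 0), (1, 0), (0, 1), (1, 1)]:
    hidden, out = xor_net(i0, i1)
    print((i0, i1), hidden, out)          # matches the Input / Hidden / Target table
```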

Back Propagation

• Assignment of Blame to Hidden Units

  δ_j = Σ_i δ_i·w_ij   (a hidden unit's blame is the weighted sum of the blame of the units it projects to)

• Activation Functions

  a = 1 / (1 + e^(-netinput))   (logistic activation)

• Local Minima
[Figure: Error vs. Weight Value, showing a global and a local minimum]
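A minimal backpropagation sketch on XOR with logistic units, combining the δ_j = Σ_i δ_i·w_ij blame term with the derivative of the logistic activation. The layer sizes, learning rate, epoch count and random seed are illustrative choices, not taken from the slides.

```python
# Minimal backpropagation sketch on XOR with logistic units.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(0, 0.5, (2, 4))          # input -> hidden weights (4 hidden units assumed)
b1 = np.zeros(4)
W2 = rng.normal(0, 0.5, (4, 1))          # hidden -> output weights
b2 = np.zeros(1)
lr = 0.5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))      # a = 1 / (1 + e^(-netinput))

for epoch in range(10000):
    h = sigmoid(X @ W1 + b1)             # hidden activations
    y = sigmoid(h @ W2 + b2)             # output activations
    delta_out = (T - y) * y * (1 - y)    # output blame: error * slope of the sigmoid
    delta_hid = (delta_out @ W2.T) * h * (1 - h)   # hidden blame: sum_i(delta_i * w_ij) * slope
    W2 += lr * h.T @ delta_out           # move each weight downhill on the error surface
    b2 += lr * delta_out.sum(axis=0)
    W1 += lr * X.T @ delta_hid
    b1 += lr * delta_hid.sum(axis=0)

y = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
print(np.round(y.ravel(), 2))            # should approach [0, 1, 1, 0]; depends on the random start
```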

Learning Hierarchical Relations: Isomorphic Family Trees

Family Tree Network

Hinton Diagrams

Unit 1: Nationality
Unit 2: Generation
Unit 3: Branch of Tree