Neural Networks
Hopfield Nets and Boltzmann Machines
Fall 2017
Recap: Hopfield network

• Symmetric loopy network
• Each neuron is a perceptron with +1/−1 output
• Every neuron receives input from every other neuron
• Every neuron outputs signals to every other neuron

$$y_i = \Theta\!\left(\sum_{j\neq i} w_{ji} y_j + b_i\right),\qquad \Theta(z)=\begin{cases}+1 & \text{if } z>0\\-1 & \text{if } z\le 0\end{cases}$$
Recap: Hopfield network
• At each time, each neuron receives a “field” $\sum_{j\neq i} w_{ji} y_j + b_i$
• If the sign of the field matches its own sign, it does not respond
• If the sign of the field opposes its own sign, it “flips” to match the sign of the field

$$y_i = \Theta\!\left(\sum_{j\neq i} w_{ji} y_j + b_i\right),\qquad \Theta(z)=\begin{cases}+1 & \text{if } z>0\\-1 & \text{if } z\le 0\end{cases}$$
Recap: Energy of a Hopfield Network
$$E = -\sum_{i,\,j<i} w_{ij} y_i y_j$$

• The system will evolve until the energy hits a local minimum
• Update rule (not assuming node bias):

$$y_i = \Theta\!\left(\sum_{j\neq i} w_{ji} y_j\right),\qquad \Theta(z)=\begin{cases}+1 & \text{if } z>0\\-1 & \text{if } z\le 0\end{cases}$$

• In vector form, including a bias term (not used in Hopfield nets):

$$E = -\tfrac{1}{2}\,\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$$
Recap: Evolution
• The network will evolve until it arrives at a local minimum in the energy contour

(Figure: PE vs. state, showing local minima)

$$E = -\tfrac{1}{2}\,\mathbf{y}^T\mathbf{W}\mathbf{y}$$
Recap: Content-addressable memory
• Each of the minima is a “stored” pattern
– If the network is initialized close to a stored pattern, it will inevitably evolve to the pattern
• This is a content addressable memory
– Recall memory content from partial or corrupt values
• Also called associative memory
Recap – Analogy: Spin Glasses
• Magnetic dipoles
• Each dipole tries to align itself to the local field
– In doing so it may flip
• This will change fields at other dipoles
– Which may flip
• Which changes the field at the current dipole…
Recap – Analogy: Spin Glasses
• The total potential energy of the system:

$$E = C - \frac{1}{2}\sum_i x_i f(p_i) = C - \sum_i \sum_{j>i} \frac{r\,x_i x_j}{\left|p_i - p_j\right|^2} - \sum_i b_i x_i$$

• The system evolves to minimize the PE
– Dipoles stop flipping if any flips result in increase of PE

Total field at current dipole:

$$f(p_i) = \sum_{j\neq i} \frac{r\,x_j}{\left|p_i - p_j\right|^2} + b_i$$

Response of current dipole:

$$x_i = \begin{cases} x_i & \text{if } \operatorname{sign}\!\left(x_i f(p_i)\right) = 1 \\ -x_i & \text{otherwise} \end{cases}$$
Recap: Spin Glasses
• The system stops at one of its stable configurations
– Where PE is a local minimum
• Any small jitter from this stable configuration returns it to the stable configuration
– I.e. the system remembers its stable state and returns to it
Recap: Hopfield net computation
• Very simple
• Updates can be done sequentially, or all at once
• Convergence: iterate until the energy

$$E = -\sum_i \sum_{j>i} w_{ji} y_j y_i$$

does not change significantly any more

1. Initialize network with initial pattern:
$$y_i(0) = x_i,\quad 0 \le i \le N-1$$
2. Iterate until convergence:
$$y_i(t+1) = \Theta\!\left(\sum_{j\neq i} w_{ji} y_j\right),\quad 0 \le i \le N-1$$
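The two-step procedure above can be sketched in Python (a minimal illustration, not code from the slides; `hopfield_recall` and its arguments are names I chose):

```python
import numpy as np

def hopfield_recall(W, x, max_iters=100):
    """Run the Hopfield update y_i <- Theta(sum_{j != i} w_ji y_j)
    sequentially until no neuron flips (a local energy minimum)."""
    y = np.array(x, dtype=int)
    for _ in range(max_iters):
        changed = False
        for i in range(len(y)):
            field = W[i] @ y - W[i, i] * y[i]   # exclude self-connection
            new_yi = 1 if field > 0 else -1     # Theta(z): +1 if z > 0, else -1
            if new_yi != y[i]:
                y[i] = new_yi
                changed = True
        if not changed:                          # converged
            break
    return y

# Store one pattern with the Hebbian rule and recall from a corrupted copy
p = np.array([1, -1, -1, 1])
W = np.outer(p, p) - np.eye(4)
x = p.copy()
x[0] = -x[0]                                     # flip one bit
print(hopfield_recall(W, x))                     # recovers [ 1 -1 -1  1]
```

Sequential (asynchronous) updates are used here; the slides note that simultaneous updates are also possible.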
Examples: Content addressable memory
• http://staff.itee.uq.edu.au/janetw/cmc/chapters/Hopfield/
“Training” the network
• How do we make the network store a specific pattern or set of patterns?
– Hebbian learning
– Geometric approach
– Optimization
• Secondary question
– How many patterns can we store?
Recap: Hebbian Learning to Store a Specific Pattern
• For a single stored pattern, Hebbian learning results in a network for which the target pattern is a global minimum
Hebbian learning: $w_{ji} = y_j y_i$, i.e.

$$\mathbf{W} = \mathbf{y}_p\mathbf{y}_p^T - \mathbf{I}$$

(Figure: 5-node network storing the pattern 1, −1, −1, −1, 1)
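As a quick numerical check of this claim, a sketch (the 5-bit pattern and helper names are illustrative) showing that the stored pattern and its ghost (negation) are both stationary under the update rule:

```python
import numpy as np

def hebbian_store(y_p):
    """Single-pattern Hebbian rule: W = y_p y_p^T - I (zero self-weights)."""
    y_p = np.asarray(y_p, dtype=float)
    return np.outer(y_p, y_p) - np.eye(len(y_p))

def is_stationary(W, y):
    """True if no unit flips: Theta(W y) == y componentwise."""
    return bool(np.all(np.where(W @ y > 0, 1, -1) == y))

y_p = np.array([1, -1, -1, -1, 1])
W = hebbian_store(y_p)
print(is_stationary(W, y_p), is_stationary(W, -y_p))   # True True (ghost too)
```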
Hebbian learning: Storing a 4-bit pattern
• Left: Pattern stored. Right: Energy map
• Stored pattern has lowest energy
• Gradation of energy ensures stored pattern (or its ghost) is recalled from everywhere
Recap: Hebbian Learning to Store Multiple Patterns

• $\{p\}$ is the set of patterns to store
– Superscript $p$ represents the specific pattern
• $N_p$ is the number of patterns to store

$$w_{ji} = \sum_{p\in\{p\}} y_i^p y_j^p,\qquad \mathbf{W} = \sum_p \left(\mathbf{y}_p\mathbf{y}_p^T - \mathbf{I}\right) = \mathbf{Y}\mathbf{Y}^T - N_p\mathbf{I}$$

(Figure: two 5-node networks storing the patterns 1, −1, −1, −1, 1 and 1, 1, −1, 1, −1)
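A sketch of the multi-pattern rule (the two 5-bit example patterns are my own illustrative choice):

```python
import numpy as np

# Multi-pattern Hebbian rule: W = Y Y^T - Np * I, one pattern per column of Y
patterns = np.array([[1, -1, -1, -1,  1],
                     [1,  1, -1,  1, -1]], dtype=float)
Y = patterns.T
Np = Y.shape[1]
W = Y @ Y.T - Np * np.eye(Y.shape[0])

# Each stored pattern should be stationary: Theta(W y_p) == y_p
for p in patterns:
    print(bool(np.all(np.where(W @ p > 0, 1, -1) == p)))   # True, True
```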
How many patterns can we store?
• Hopfield: a network of $N$ neurons can store up to $0.14N$ patterns
Recap: Hebbian Learning to Store a Specific Pattern

• Consider that the network is in any stored state $\mathbf{y}^{p'}$
• At any node $k$ the field we obtain is

$$h_k^{p'} = \sum_{j\neq k} y_k^{p'} y_j^{p'} y_j^{p'} + \sum_{p\neq p'}\sum_{j\neq k} y_k^{p} y_j^{p} y_j^{p'} = (N-1)\,y_k^{p'} + \sum_{p\neq p'}\sum_{j\neq k} y_k^{p} y_j^{p} y_j^{p'}$$

• If the second “crosstalk” term sums to less than $N-1$ in magnitude, the symbol will not flip

(Figure: 5-node network storing the pattern 1, −1, −1, −1, 1; weights $w_{ji} = \sum_{p\in\{p\}} y_i^p y_j^p$)
Recap: Hebbian Learning to Store a Specific Pattern

$$h_k^{p'} = \sum_{j\neq k} y_k^{p'} y_j^{p'} y_j^{p'} + \sum_{p\neq p'}\sum_{j\neq k} y_k^{p} y_j^{p} y_j^{p'} = (N-1)\,y_k^{p'} + \sum_{p\neq p'}\sum_{j\neq k} y_k^{p} y_j^{p} y_j^{p'}$$

• If $y_k^{p'} \sum_{p\neq p'}\sum_{j\neq k} y_k^{p} y_j^{p} y_j^{p'}$ is positive, then the crosstalk term $\sum_{p\neq p'}\sum_{j\neq k} y_k^{p} y_j^{p} y_j^{p'}$ has the same sign as $y_k^{p'}$, and the symbol will not flip
• If we choose $P$ patterns at random, what is the probability that $y_k^{p'} \sum_{p\neq p'}\sum_{j\neq k} y_k^{p} y_j^{p} y_j^{p'}$ will be positive for all symbols for all $P$ of them?

(Figure: 5-node network storing the pattern 1, −1, −1, −1, 1; weights $w_{ji} = \sum_{p\in\{p\}} y_i^p y_j^p$)
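The crosstalk term can be computed directly. A sketch (the `crosstalk` helper is a name I chose; the two orthogonal 4-bit test patterns are illustrative), showing that for an orthogonal pair the crosstalk stays far below $N-1$:

```python
import numpy as np

def crosstalk(patterns, pp, k):
    """Second term of the field at node k when the net sits in stored
    pattern pp: sum over other patterns p and nodes j != k of
    y_k^p * y_j^p * y_j^pp."""
    y_pp = patterns[pp]
    total = 0.0
    for p, y_p in enumerate(patterns):
        if p == pp:
            continue
        # inner sum over j != k
        total += y_p[k] * (y_p @ y_pp - y_p[k] * y_pp[k])
    return total

patterns = np.array([[1, 1,  1,  1],
                     [1, 1, -1, -1]], dtype=float)   # orthogonal pair
# With orthogonal patterns the crosstalk magnitude stays below N - 1 = 3,
# so no symbol flips
print([crosstalk(patterns, 0, k) for k in range(4)])  # [-1.0, -1.0, -1.0, -1.0]
```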
How many patterns can we store?
• Hopfield: a network of $N$ neurons can store up to $0.14N$ patterns
• What does this really mean?
– Let's look at some examples
Hebbian learning: One 4-bit pattern
• Left: Pattern stored. Right: Energy map
• Note: Pattern is an energy well, but there are other local minima
– Where?
– Also note the “shadow” pattern
Storing multiple patterns: Orthogonality
• The maximum Hamming distance between two 𝑁-bit patterns is 𝑁/2
– Because any pattern $Y$ and its negation $-Y$ are equivalent for our purpose
• Two patterns 𝑦1and 𝑦2 that differ in 𝑁/2 bits are orthogonal
– Because 𝑦1𝑇𝑦2 = 0
• For $N = 2^M L$, where $L$ is an odd number, there are at most $2^M$ orthogonal binary patterns
– Others may be almost orthogonal
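This is easy to check numerically: for $\pm 1$ vectors the inner product equals $N$ minus twice the Hamming distance, so a distance of exactly $N/2$ gives orthogonality. A minimal illustration:

```python
import numpy as np

def hamming(a, b):
    """Number of positions where two patterns differ."""
    return int(np.sum(a != b))

a = np.array([1, 1,  1,  1])
b = np.array([1, 1, -1, -1])          # differs from a in N/2 = 2 bits

# For +/-1 patterns: a . b = N - 2 * Hamming(a, b)
print(hamming(a, b), int(a @ b))      # 2 0  -> orthogonal
```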
Two orthogonal 4-bit patterns
• Patterns are local minima (stationary and stable)
– No other local minima exist
– But patterns perfectly confusable for recall
Two non-orthogonal 4-bit patterns
• Patterns are local minima (stationary and stable)
– No other local minima exist
– Actual wells for patterns
• Patterns may be perfectly recalled!
– Note $K > 0.14N$
Three orthogonal 4-bit patterns
• All patterns are local minima (stationary and stable)
– But recall from perturbed patterns is random
Three non-orthogonal 4-bit patterns
• All patterns are local minima and recalled
– Note K > 0.14 N
– Note some “ghosts” ended up in the “well” of other patterns
• So one of the patterns has stronger recall than the other two
Four orthogonal 4-bit patterns
• All patterns are stationary, but none are stable
– Total wipe out
Four nonorthogonal 4-bit patterns
• Believe it or not, all patterns are stored for K = N!
– Only “collisions” when the ghost of one pattern occurs next to another
• [1 1 1 1] and its ghost are strong attractors (why?)
How many patterns can we store?
• Hopfield: a network of $N$ neurons can store up to $0.14N$ patterns
• Apparently a fuzzy statement
– What does it really mean to say “stores” 0.14N patterns?
• Stationary? Stable? No other local minima?
• N=4 may not be a good case (N too small)
A 6-bit pattern
• Perfectly stationary and stable
• But many spurious local minima..
– Which are “fake” memories
Two orthogonal 6-bit patterns
• Perfectly stationary and stable
• Several spurious “fake-memory” local minima..
– The figure overstates the problem: it is actually a 3-D Karnaugh map
Two non-orthogonal 6-bit patterns
• Perfectly stationary and stable
• Some spurious “fake-memory” local minima..
– But every stored pattern has a “bowl”
– Fewer spurious minima than for the orthogonal case
Three non-orthogonal 6-bit patterns
• Note: Cannot have 3 or more orthogonal 6-bit patterns..
• Patterns are perfectly stationary and stable (K > 0.14N)
• Some spurious “fake-memory” local minima..
– But every stored pattern has a “bowl”
– Fewer spurious minima than for the orthogonal 2-pattern case
Four non-orthogonal 6-bit patterns
• Patterns are perfectly stationary and stable for K > 0.14N
• Fewer spurious minima than for the orthogonal 2-pattern case
– Most fake-looking memories are in fact ghosts..
Six non-orthogonal 6-bit patterns
• Breakdown largely due to interference from “ghosts”
• But patterns are stationary, and often stable
– For K >> 0.14N
More visualization..
• Let's inspect a few 8-bit patterns
– Keeping in mind that the Karnaugh map is now a 4-dimensional tesseract
One 8-bit pattern
• It's actually cleanly stored, but there are a few spurious minima
Two orthogonal 8-bit patterns
• Both have regions of attraction
• Some spurious minima
Two non-orthogonal 8-bit patterns
• Actually have fewer spurious minima
– Not obvious from visualization..
Four orthogonal 8-bit patterns
• Successfully stored
Four non-orthogonal 8-bit patterns
• Stored with interference from ghosts..
Eight orthogonal 8-bit patterns
• Wipeout
Eight non-orthogonal 8-bit patterns
• Nothing stored
– Neither stationary nor stable
Making sense of the behavior
• It seems possible to store $K > 0.14N$ patterns
– i.e. obtain a weight matrix W such that $K > 0.14N$ patterns are stationary
– It is possible to make more than 0.14N patterns at least 1-bit stable
• So what was Hopfield talking about?
• Patterns that are non-orthogonal are easier to remember
– I.e. patterns that are closer are easier to remember than patterns that are farther!!
• Can we attempt to get greater control on the process than Hebbian learning gives us?
Bold Claim
• I can always store (up to) $N$ orthogonal patterns such that they are stationary!
– Although not necessarily stable
• Why?
“Training” the network
• How do we make the network store a specific pattern or set of patterns?
– Hebbian learning
– Geometric approach
– Optimization
• Secondary question
– How many patterns can we store?
A minor adjustment

• Note that the behavior of $E(\mathbf{y}) = \mathbf{y}^T\mathbf{W}\mathbf{y}$ with

$$\mathbf{W} = \mathbf{Y}\mathbf{Y}^T - N_p\mathbf{I}$$

is identical to the behavior with

$$\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$$

• Since

$$\mathbf{y}^T\left(\mathbf{Y}\mathbf{Y}^T - N_p\mathbf{I}\right)\mathbf{y} = \mathbf{y}^T\mathbf{Y}\mathbf{Y}^T\mathbf{y} - N N_p$$

– The energy landscape only differs by an additive constant; gradients and locations of minima remain the same
– Both matrices have the same eigenvectors
– Note: $\mathbf{Y}\mathbf{Y}^T$ is a positive semidefinite matrix
• But $\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$ is easier to analyze; hence in the following slides we will use $\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$
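A quick numerical check of this claim (random patterns; sizes are an arbitrary illustrative choice): the two energy landscapes differ by the constant $N N_p / 2$ at every state:

```python
import numpy as np

rng = np.random.default_rng(0)
N, Np = 8, 3
Y = rng.choice([-1.0, 1.0], size=(N, Np))   # Np random +/-1 patterns as columns

def energy(W, y):
    return -0.5 * y @ W @ y

W_hebb = Y @ Y.T - Np * np.eye(N)
W_adj = Y @ Y.T                             # adjusted: diagonal kept

# y^T (YY^T - Np I) y = y^T YY^T y - N*Np for every +/-1 state y,
# so the energies differ by the constant N*Np/2 everywhere
y = rng.choice([-1.0, 1.0], size=N)
print(energy(W_hebb, y) - energy(W_adj, y))  # 12.0 = N*Np/2
```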
Consider the energy function

• Reinstating the bias term for completeness' sake
– Remember that we don't actually use it in a Hopfield net

$$E = -\tfrac{1}{2}\,\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$$

• This is a quadratic! For Hebbian learning $\mathbf{W}$ is positive semidefinite, so $E$ is concave
The energy function

• $E$ is a concave quadratic

$$E = -\tfrac{1}{2}\,\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$$

– Shown from above (assuming 0 bias)
• But components of $\mathbf{y}$ can only take values $\pm 1$
– I.e. $\mathbf{y}$ lies on the corners of the unit hypercube
The energy function
• The stored values of $\mathbf{y}$ are the ones where all adjacent corners are higher on the quadratic
– Hebbian learning attempts to make the quadratic steep in the vicinity of stored patterns

$$E = -\tfrac{1}{2}\,\mathbf{y}^T\mathbf{W}\mathbf{y} - \mathbf{b}^T\mathbf{y}$$

(Figure: energy surface with stored patterns marked at hypercube corners)
Patterns you can store
• Ideally must be maximally separated on the hypercube
– The number of patterns we can store depends on the actual distance between the patterns

(Figure: energy surface marking stored patterns and their ghosts (negations))
Storing patterns

• A pattern $\mathbf{y}_p$ is stored if:
– $\operatorname{sign}(\mathbf{W}\mathbf{y}_p) = \mathbf{y}_p$ for all target patterns
• Note: for binary vectors, $\operatorname{sign}(\mathbf{y})$ is a projection
– Projects $\mathbf{y}$ onto the nearest corner of the hypercube
– It “quantizes” the space into orthants
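The projection view of $\operatorname{sign}(\cdot)$ in a minimal sketch (ties at zero are sent to $-1$, matching the $\Theta$ used earlier; the helper name is mine):

```python
import numpy as np

def nearest_corner(y):
    """Project a real vector onto the nearest corner of the {-1,+1}^N
    hypercube: the componentwise sign (with 0 mapped to -1)."""
    return np.where(y > 0, 1, -1)

v = np.array([0.3, -2.0, 1.5, -0.1])
print(nearest_corner(v))              # [ 1 -1  1 -1]
```

A pattern is then stored exactly when `nearest_corner(W @ y_p)` returns `y_p` itself.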
Storing patterns
• A pattern $\mathbf{y}_p$ is stored if:
– $\operatorname{sign}(\mathbf{W}\mathbf{y}_p) = \mathbf{y}_p$ for all target patterns
• Training: Design $\mathbf{W}$ such that this holds
• Simple solution: $\mathbf{y}_p$ is an eigenvector of $\mathbf{W}$, and the corresponding eigenvalue is positive:
$$\mathbf{W}\mathbf{y}_p = \lambda\mathbf{y}_p$$
– More generally, $\operatorname{orthant}(\mathbf{W}\mathbf{y}_p) = \operatorname{orthant}(\mathbf{y}_p)$
• How many such $\mathbf{y}_p$ can we have?
Only N patterns?

• Patterns that differ in $N/2$ bits are orthogonal
• You can have no more than $N$ orthogonal vectors in an $N$-dimensional space

(Figure: 2-D example with orthogonal patterns $(1,1)$ and $(1,-1)$)
Another random fact that should interest you
• The eigenvectors of any symmetric matrix $\mathbf{W}$ are orthogonal
• The eigenvalues may be positive or negative
Storing more than one pattern
• Requirement: Given $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_P$, design $\mathbf{W}$ such that
– $\operatorname{sign}(\mathbf{W}\mathbf{y}_p) = \mathbf{y}_p$ for all target patterns
– There are no other binary vectors for which this holds
• What is the largest number of patterns that can be stored?
Storing K orthogonal patterns

• Simple solution: Design $\mathbf{W}$ such that $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_K$ are eigenvectors of $\mathbf{W}$
– Let $\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2 \cdots \mathbf{y}_K]$, and set
$$\mathbf{W} = \mathbf{Y}\boldsymbol{\Lambda}\mathbf{Y}^T$$
– $\lambda_1, \ldots, \lambda_K$ are positive
– For $\lambda_1 = \lambda_2 = \cdots = \lambda_K = 1$ this is exactly the Hebbian rule
• The patterns are provably stationary
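A sketch of this construction with two orthogonal 4-bit patterns (an illustrative choice) and $\lambda = 1$, which reproduces the adjusted Hebbian rule $\mathbf{W} = \mathbf{Y}\mathbf{Y}^T$:

```python
import numpy as np

# Two orthogonal +/-1 patterns as columns of Y
Y = np.array([[1,  1],
              [1,  1],
              [1, -1],
              [1, -1]], dtype=float)
Lam = np.eye(2)                      # lambda_1 = lambda_2 = 1
W = Y @ Lam @ Y.T                    # = Y Y^T, the (adjusted) Hebbian rule

# Each stored pattern is an eigenvector with a positive eigenvalue
# (N = 4 here, since the +/-1 patterns are not unit-norm), hence
# stationary: sign(W y) lands back in y's own orthant
for k in range(2):
    y = Y[:, k]
    print(np.allclose(W @ y, 4 * y),
          bool(np.all(np.where(W @ y > 0, 1, -1) == y)))   # True True
```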
Hebbian rule
• In reality
– Let $\mathbf{Y} = [\mathbf{y}_1\ \mathbf{y}_2 \cdots \mathbf{y}_K\ \mathbf{r}_{K+1}\ \mathbf{r}_{K+2} \cdots \mathbf{r}_N]$, with
$$\mathbf{W} = \mathbf{Y}\boldsymbol{\Lambda}\mathbf{Y}^T$$
– $\mathbf{r}_{K+1}, \mathbf{r}_{K+2}, \ldots, \mathbf{r}_N$ are orthogonal to $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_K$
– $\lambda_1 = \lambda_2 = \cdots = \lambda_K = 1$
– $\lambda_{K+1} = \cdots = \lambda_N = 0$
• All patterns orthogonal to $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_K$ are also stationary
– Although not stable
Storing 𝑵 orthogonal patterns
• When we have 𝑁 orthogonal (or near-orthogonal) patterns 𝐲1, 𝐲2, … , 𝐲𝑁
– 𝑌 = [𝐲1 𝐲2 … 𝐲𝑁]
𝑊 = 𝑌Λ𝑌ᵀ
– 𝜆1 = 𝜆2 = ⋯ = 𝜆𝑁 = 1
• The eigenvectors of 𝑊 span the space
• Also, for any 𝐲𝑘: 𝐖𝐲𝑘 = 𝐲𝑘
64
![Page 64: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/64.jpg)
Storing 𝑵 orthogonal patterns
• The 𝑁 orthogonal patterns 𝐲1, 𝐲2, … , 𝐲𝑁 span the space
• Any pattern 𝐲 can be written as
𝐲 = 𝑎1𝐲1 + 𝑎2𝐲2 + ⋯ + 𝑎𝑁𝐲𝑁
𝐖𝐲 = 𝑎1𝐖𝐲1 + 𝑎2𝐖𝐲2 + ⋯ + 𝑎𝑁𝐖𝐲𝑁 = 𝑎1𝐲1 + 𝑎2𝐲2 + ⋯ + 𝑎𝑁𝐲𝑁 = 𝐲
• All patterns are stable
– Remembers everything
– Completely useless network
65
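The "completely useless network" claim is easy to verify numerically. A minimal sketch (my own example, using the 4×4 Hadamard patterns as the 𝑁 orthogonal columns): with 𝑁 orthogonal stored patterns the scaled Hebbian weight matrix is exactly the identity, so every ±1 vector is a fixed point.

```python
import numpy as np

# Sketch: with N orthogonal patterns the (scaled) Hebbian W = Y Y^T / N
# equals the identity, so sign(W y) = y for EVERY +/-1 vector: the net
# "remembers" everything and therefore retrieves nothing useful.
N = 4
Y = np.array([[1,  1,  1,  1],
              [1, -1,  1, -1],
              [1,  1, -1, -1],
              [1, -1, -1,  1]])     # 4 mutually orthogonal columns (Hadamard)
W = Y @ Y.T / N
assert np.allclose(W, np.eye(N))   # patterns span the space -> W is identity

# Any arbitrary pattern, stored or not, is a fixed point.
y = np.array([1, 1, -1, 1])
assert np.array_equal(np.sign(W @ y), y)
```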
![Page 65: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/65.jpg)
Storing K orthogonal patterns
• Even if we store fewer than 𝑁 patterns
– Let 𝑌 = [𝐲1 𝐲2 … 𝐲𝐾 𝐫𝐾+1 𝐫𝐾+2 … 𝐫𝑁]
𝑊 = 𝑌Λ𝑌ᵀ
– 𝐫𝐾+1, 𝐫𝐾+2, … , 𝐫𝑁 are orthogonal to 𝐲1, 𝐲2, … , 𝐲𝐾
– 𝜆1 = 𝜆2 = ⋯ = 𝜆𝐾 = 1
– 𝜆𝐾+1, … , 𝜆𝑁 = 0
• All patterns orthogonal to 𝐲1, 𝐲2, … , 𝐲𝐾 are stationary
• Any pattern that is entirely in the subspace spanned by 𝐲1, 𝐲2, … , 𝐲𝐾 is also stable (same logic as earlier)
• Only patterns that are partially in the subspace spanned by 𝐲1, 𝐲2, … , 𝐲𝐾 are unstable
– They get projected onto the subspace spanned by 𝐲1, 𝐲2, … , 𝐲𝐾
66
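The projection behaviour described above can be checked directly. A small sketch (my own example, assuming 𝐾 = 2 stored orthogonal patterns with unit eigenvalues and zero eigenvalues on the complement): 𝐖 then acts as the orthogonal projector onto the span of the stored patterns.

```python
import numpy as np

# Sketch: with K < N orthogonal stored patterns and zero eigenvalues on
# the orthogonal complement, W y projects y onto the span of the stored
# patterns; a pattern only partially in that subspace drifts toward it.
Y = np.array([[1.,  1.],
              [1., -1.],
              [1.,  1.],
              [1., -1.]])            # K = 2 stored columns, N = 4
W = Y @ Y.T / 4                      # lambda = 1 on span(Y), 0 elsewhere

y = np.array([1., 1., 1., -1.])      # only partially in span(Y)
proj = W @ y                         # its projection onto span(Y)
# The projection itself lies in the stable subspace, so W leaves it fixed.
assert np.allclose(W @ proj, proj)
```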
![Page 66: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/66.jpg)
Problem with Hebbian Rule
• Even if we store fewer than 𝑁 patterns
– Let 𝑌 = [𝐲1 𝐲2 … 𝐲𝐾 𝐫𝐾+1 𝐫𝐾+2 … 𝐫𝑁]
𝑊 = 𝑌Λ𝑌ᵀ
– 𝐫𝐾+1, 𝐫𝐾+2, … , 𝐫𝑁 are orthogonal to 𝐲1, 𝐲2, … , 𝐲𝐾
– 𝜆1 = 𝜆2 = ⋯ = 𝜆𝐾 = 1
• Problems arise because the eigenvalues are all 1.0
– This ensures stationarity of all vectors in the subspace
– What if we get rid of this requirement?
67
![Page 67: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/67.jpg)
Hebbian rule and general (non-orthogonal) vectors
𝑤𝑗𝑖 = ∑𝑝 𝑦𝑖^𝑝 𝑦𝑗^𝑝
• What happens when the patterns are not orthogonal?
• What happens when the patterns are presented more than once?
– Different patterns presented different numbers of times
– Equivalent to having unequal eigenvalues..
• Can we predict the evolution of any vector 𝐲?
– Hint: Lanczos iterations
• Can write 𝐘𝑃 = 𝐘𝑜𝑟𝑡ℎ𝑜𝐁, so 𝐖 = 𝐘𝑜𝑟𝑡ℎ𝑜𝐁Λ𝐁ᵀ𝐘𝑜𝑟𝑡ℎ𝑜ᵀ
68
![Page 68: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/68.jpg)
The bottom line
• With a network of 𝑁 units (i.e. 𝑁-bit patterns)
• The maximum number of stable patterns is actually exponential in 𝑁
– McEliece and Posner, '84
– E.g. when we had the Hebbian net with 𝑁 orthogonal base patterns, all patterns are stable
• For a specific set of 𝐾 patterns, we can always build a network for which all 𝐾 patterns are stable provided 𝐾 ≤ 𝑁
– Abu-Mostafa and St. Jacques, '85
• For large 𝑁, the upper bound on 𝐾 is actually 𝑁/(4 log 𝑁)
– McEliece et al., '87
– But this may come with many “parasitic” memories
69
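These capacity limits are easy to probe empirically. A rough sketch (my assumption: the standard Hebbian outer-product rule with the diagonal zeroed), checking which stored patterns remain fixed points of sign(𝐖𝐲):

```python
import numpy as np

# Sketch: count how many stored patterns are stable, i.e. fixed points
# of sign(W y), under the Hebbian rule (assumption: zeroed diagonal).
def hebbian(Y):
    W = Y @ Y.T / Y.shape[0]
    np.fill_diagonal(W, 0)          # no self-connections
    return W

def n_stable(W, Y):
    return sum(np.array_equal(np.sign(W @ Y[:, k]), Y[:, k])
               for k in range(Y.shape[1]))

# Two orthogonal patterns in a 4-unit net: both remain stable.
Y = np.array([[ 1,  1],
              [ 1, -1],
              [ 1,  1],
              [ 1, -1]])
assert n_stable(hebbian(Y), Y) == 2

# With many random (non-orthogonal) patterns, some become unstable,
# consistent with capacity degrading well before K reaches N.
rng = np.random.default_rng(0)
Yr = rng.choice([-1, 1], size=(16, 12))
print(n_stable(hebbian(Yr), Yr))   # typically well below 12
```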
![Page 69: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/69.jpg)
The bottom line
70
How do we find this network?
![Page 70: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/70.jpg)
The bottom line
71
Can we do something about this?
How do we find this network?
![Page 71: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/71.jpg)
A different tack
• How do we make the network store a specific pattern or set of patterns?
– Hebbian learning
– Geometric approach
– Optimization
• Secondary question
– How many patterns can we store?
72
![Page 72: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/72.jpg)
Consider the energy function
𝐸 = −(1/2)𝐲ᵀ𝐖𝐲 − 𝐛ᵀ𝐲
• This must be maximally low for target patterns
• Must be maximally high for all other patterns
– So that they are unstable and evolve into one of the target patterns
73
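The energy function above is a one-liner in code. A minimal sketch (the 2-unit weight matrix is my own toy example): a state that agrees with a positive coupling sits at lower energy than one that disagrees.

```python
import numpy as np

# Minimal sketch of the energy E(y) = -1/2 y^T W y - b^T y: target-like
# (agreeing) states should sit low, disagreeing states high.
def energy(W, b, y):
    return -0.5 * y @ W @ y - b @ y

W = np.array([[0., 1.],
              [1., 0.]])            # toy 2-unit net with +1 coupling
b = np.zeros(2)
agree    = np.array([1.,  1.])      # neurons agree with the coupling
disagree = np.array([1., -1.])
assert energy(W, b, agree) < energy(W, b, disagree)
```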
![Page 73: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/73.jpg)
Alternate Approach to Estimating the Network
𝐸(𝐲) = −(1/2)𝐲ᵀ𝐖𝐲 − 𝐛ᵀ𝐲
• Estimate 𝐖 (and 𝐛) such that
– 𝐸 is minimized for 𝐲1, 𝐲2, … , 𝐲𝑃
– 𝐸 is maximized for all other 𝐲
• Caveat: Unrealistic to expect to store more than 𝑁 patterns, but can we make those 𝑁 patterns memorable?
74
![Page 74: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/74.jpg)
Optimizing W (and b)
𝐸(𝐲) = −(1/2)𝐲ᵀ𝐖𝐲
𝐖 = argmin𝐖 ∑𝐲∈𝐘𝑃 𝐸(𝐲)
• Minimize total energy of target patterns
– Problem with this?
The bias can be captured by another fixed-value component
75
![Page 75: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/75.jpg)
Optimizing W
𝐸(𝐲) = −(1/2)𝐲ᵀ𝐖𝐲
𝐖 = argmin𝐖 ∑𝐲∈𝐘𝑃 𝐸(𝐲) − ∑𝐲∉𝐘𝑃 𝐸(𝐲)
• Minimize total energy of target patterns
• Maximize the total energy of all non-target patterns
76
![Page 76: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/76.jpg)
Optimizing W
𝐸(𝐲) = −(1/2)𝐲ᵀ𝐖𝐲
𝐖 = argmin𝐖 ∑𝐲∈𝐘𝑃 𝐸(𝐲) − ∑𝐲∉𝐘𝑃 𝐸(𝐲)
• Simple gradient descent:
𝐖 = 𝐖 + 𝜂 ( ∑𝐲∈𝐘𝑃 𝐲𝐲ᵀ − ∑𝐲∉𝐘𝑃 𝐲𝐲ᵀ )
77
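The full-batch gradient step above can be sketched directly for a tiny net, where the "all other patterns" term can be enumerated exactly. This is my own illustration (real training samples the negative term instead of enumerating all 2^N patterns, which is only feasible here because N = 3):

```python
import numpy as np
from itertools import product

# Sketch of the gradient step on the slide: lower energy at targets,
# raise it at every other pattern. For tiny N we enumerate all 2^N
# patterns to form the negative term (illustration only).
N = 3
targets = [np.array([1, 1, -1]), np.array([-1, -1, 1])]
eta = 0.01
W = np.zeros((N, N))

all_pats = [np.array(p) for p in product([-1, 1], repeat=N)]
non_targets = [p for p in all_pats
               if not any(np.array_equal(p, t) for t in targets)]

for _ in range(100):
    pos = sum(np.outer(y, y) for y in targets)      # lower target energy
    neg = sum(np.outer(y, y) for y in non_targets)  # raise everything else
    W = W + eta * (pos - neg)
    np.fill_diagonal(W, 0)          # no self-connections in a Hopfield net

# After training, the targets are fixed points of the dynamics.
for t in targets:
    assert np.array_equal(np.sign(W @ t), t)
```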
![Page 77: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/77.jpg)
Optimizing W
𝐖 = 𝐖 + 𝜂 ( ∑𝐲∈𝐘𝑃 𝐲𝐲ᵀ − ∑𝐲∉𝐘𝑃 𝐲𝐲ᵀ )
• Can “emphasize” the importance of a pattern by repeating it
– More repetitions ⇒ greater emphasis
78
![Page 78: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/78.jpg)
Optimizing W
𝐖 = 𝐖 + 𝜂 ( ∑𝐲∈𝐘𝑃 𝐲𝐲ᵀ − ∑𝐲∉𝐘𝑃 𝐲𝐲ᵀ )
• Can “emphasize” the importance of a pattern by repeating it
– More repetitions ⇒ greater emphasis
• How many of these non-target patterns do we need?
– Do we need to include all of them?
– Are all equally important?
79
![Page 79: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/79.jpg)
The training again..
• Note the energy contour of a Hopfield network for any weight 𝐖
𝐖 = 𝐖 + 𝜂 ( ∑𝐲∈𝐘𝑃 𝐲𝐲ᵀ − ∑𝐲∉𝐘𝑃 𝐲𝐲ᵀ )
(Figure: energy vs. state; the bowls will all actually be quadratic)
80
![Page 80: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/80.jpg)
The training again
• The first term tries to minimize the energy at target patterns
– Make them local minima
– Emphasize more “important” memories by repeating them more frequently
𝐖 = 𝐖 + 𝜂 ( ∑𝐲∈𝐘𝑃 𝐲𝐲ᵀ − ∑𝐲∉𝐘𝑃 𝐲𝐲ᵀ )
(Figure: energy vs. state, with the target patterns marked)
81
![Page 81: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/81.jpg)
The negative class
• The second term tries to “raise” all non-target patterns
– Do we need to raise everything?
𝐖 = 𝐖 + 𝜂 ( ∑𝐲∈𝐘𝑃 𝐲𝐲ᵀ − ∑𝐲∉𝐘𝑃 𝐲𝐲ᵀ )
(Figure: energy vs. state)
82
![Page 82: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/82.jpg)
Option 1: Focus on the valleys
• Focus on raising the valleys
– If you raise every valley, eventually they’ll all move up above the target patterns, and many will even vanish
𝐖 = 𝐖 + 𝜂 ( ∑𝐲∈𝐘𝑃 𝐲𝐲ᵀ − ∑𝐲∉𝐘𝑃 & 𝐲=𝑣𝑎𝑙𝑙𝑒𝑦 𝐲𝐲ᵀ )
(Figure: energy vs. state)
83
![Page 83: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/83.jpg)
Identifying the valleys..
• Problem: How do you identify the valleys for the current 𝐖?
𝐖 = 𝐖 + 𝜂 ( ∑𝐲∈𝐘𝑃 𝐲𝐲ᵀ − ∑𝐲∉𝐘𝑃 & 𝐲=𝑣𝑎𝑙𝑙𝑒𝑦 𝐲𝐲ᵀ )
(Figure: energy vs. state)
84
![Page 84: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/84.jpg)
Identifying the valleys..
• Initialize the network randomly and let it evolve
– It will settle in a valley
(Figure: energy vs. state)
85
![Page 85: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/85.jpg)
Training the Hopfield network
• Initialize 𝐖
• Compute the total outer product of all target patterns
– More important patterns presented more frequently
• Randomly initialize the network several times and let it evolve
– And settle at a valley
• Compute the total outer product of the valley patterns
• Update weights
𝐖 = 𝐖 + 𝜂 ( ∑𝐲∈𝐘𝑃 𝐲𝐲ᵀ − ∑𝐲∉𝐘𝑃 & 𝐲=𝑣𝑎𝑙𝑙𝑒𝑦 𝐲𝐲ᵀ )
86
![Page 86: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/86.jpg)
Training the Hopfield network: SGD version
• Initialize 𝐖
• Do until convergence, satisfaction, or death from boredom:
– Sample a target pattern 𝐲𝑝
• Sampling frequency of a pattern must reflect its importance
– Randomly initialize the network and let it evolve
• And settle at a valley 𝐲𝑣
– Update weights
• 𝐖 = 𝐖 + 𝜂 ( 𝐲𝑝𝐲𝑝ᵀ − 𝐲𝑣𝐲𝑣ᵀ )
𝐖 = 𝐖 + 𝜂 ( ∑𝐲∈𝐘𝑃 𝐲𝐲ᵀ − ∑𝐲∉𝐘𝑃 & 𝐲=𝑣𝑎𝑙𝑙𝑒𝑦 𝐲𝐲ᵀ )
87
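The SGD loop above can be sketched as follows. The details are my own assumptions, not from the slides: asynchronous updates, a fixed step size, uniformly sampled targets, and a capped number of evolution sweeps.

```python
import numpy as np

# Sketch of the SGD training loop (assumptions: asynchronous updates,
# fixed step size, a cap on the number of evolution sweeps).
rng = np.random.default_rng(1)

def evolve(W, y, max_sweeps=100):
    """Asynchronous updates until no neuron wants to flip (a valley)."""
    y = y.copy()
    for _ in range(max_sweeps):
        flipped = False
        for i in rng.permutation(len(y)):
            s = 1 if W[i] @ y > 0 else -1      # Theta(field), diag W is 0
            if s != y[i]:
                y[i] = s
                flipped = True
        if not flipped:
            break
    return y

N = 8
targets = [np.array([1, 1, 1, 1, -1, -1, -1, -1]),
           np.array([1, -1, 1, -1, 1, -1, 1, -1])]
W, eta = np.zeros((N, N)), 0.05
for _ in range(200):
    y_p = targets[rng.integers(len(targets))]      # sample a target
    y_v = evolve(W, rng.choice([-1, 1], size=N))   # random init -> valley
    W += eta * (np.outer(y_p, y_p) - np.outer(y_v, y_v))
    np.fill_diagonal(W, 0)                         # no self-connections
```

Note that when the settled valley coincides with the sampled target, the two outer products cancel and the weights stop moving, which is the desired behaviour.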
![Page 87: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/87.jpg)
Training the Hopfield network
88
![Page 88: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/88.jpg)
Which valleys?
• Should we randomly sample valleys?
– Are all valleys equally important?
(Figure: energy vs. state)
89
![Page 89: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/89.jpg)
Which valleys?
• Should we randomly sample valleys?
– Are all valleys equally important?
• Major requirement: memories must be stable
– They must be broad valleys
• Spurious valleys in the neighborhood of memories are more important to eliminate
(Figure: energy vs. state)
90
![Page 90: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/90.jpg)
Identifying the valleys..
• Initialize the network at valid memories and let it evolve
– It will settle in a valley. If this is not the target pattern, raise it
(Figure: energy vs. state)
91
![Page 91: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/91.jpg)
Training the Hopfield network
• Initialize 𝐖
• Compute the total outer product of all target patterns
– More important patterns presented more frequently
• Initialize the network with each target pattern and let it evolve
– And settle at a valley
• Compute the total outer product of the valley patterns
• Update weights
𝐖 = 𝐖 + 𝜂 ( ∑𝐲∈𝐘𝑃 𝐲𝐲ᵀ − ∑𝐲∉𝐘𝑃 & 𝐲=𝑣𝑎𝑙𝑙𝑒𝑦 𝐲𝐲ᵀ )
92
![Page 92: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/92.jpg)
Training the Hopfield network: SGD version
• Initialize 𝐖
• Do until convergence, satisfaction, or death from boredom:
– Sample a target pattern 𝐲𝑝
• Sampling frequency of a pattern must reflect its importance
– Initialize the network at 𝐲𝑝 and let it evolve
• And settle at a valley 𝐲𝑣
– Update weights
• 𝐖 = 𝐖 + 𝜂 ( 𝐲𝑝𝐲𝑝ᵀ − 𝐲𝑣𝐲𝑣ᵀ )
𝐖 = 𝐖 + 𝜂 ( ∑𝐲∈𝐘𝑃 𝐲𝐲ᵀ − ∑𝐲∉𝐘𝑃 & 𝐲=𝑣𝑎𝑙𝑙𝑒𝑦 𝐲𝐲ᵀ )
93
![Page 93: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/93.jpg)
A possible problem
• What if there’s another target pattern down-valley?
– Raising it will destroy a better-represented or stored pattern!
(Figure: energy vs. state)
94
![Page 94: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/94.jpg)
A related issue
• Really no need to raise the entire surface, or even every valley
(Figure: energy vs. state)
95
![Page 95: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/95.jpg)
A related issue
• Really no need to raise the entire surface, or even every valley
• Raise the neighborhood of each target memory
– Sufficient to make the memory a valley
– The broader the neighborhood considered, the broader the valley
(Figure: energy vs. state)
96
![Page 96: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/96.jpg)
Raising the neighborhood
• Starting from a target pattern, let the network evolve only a few steps
– Try to raise the resultant location
• Will raise the neighborhood of targets
• Will avoid the problem of down-valley targets
(Figure: energy vs. state)
97
![Page 97: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/97.jpg)
Training the Hopfield network: SGD version
• Initialize 𝐖
• Do until convergence, satisfaction, or death from boredom:
– Sample a target pattern 𝐲𝑝
• Sampling frequency of a pattern must reflect its importance
– Initialize the network at 𝐲𝑝 and let it evolve a few steps (2-4)
• And arrive at a down-valley position 𝐲𝑑
– Update weights
• 𝐖 = 𝐖 + 𝜂 ( 𝐲𝑝𝐲𝑝ᵀ − 𝐲𝑑𝐲𝑑ᵀ )
𝐖 = 𝐖 + 𝜂 ( ∑𝐲∈𝐘𝑃 𝐲𝐲ᵀ − ∑𝐲∉𝐘𝑃 & 𝐲=𝑣𝑎𝑙𝑙𝑒𝑦 𝐲𝐲ᵀ )
98
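The "evolve only a few steps" variant can be sketched as follows (my own toy setup: one hand-picked target, 3 sweeps per step, a fixed step size). If the target is already a local minimum, 𝐲𝑑 = 𝐲𝑝 and the update cancels to zero, which is exactly the desired behaviour.

```python
import numpy as np

# Sketch: start AT the target, run a few asynchronous update sweeps,
# and raise wherever the state ends up (a down-valley position).
rng = np.random.default_rng(2)

def few_steps(W, y, n_sweeps=3):
    y = y.copy()
    for _ in range(n_sweeps):
        for i in rng.permutation(len(y)):
            y[i] = 1 if W[i] @ y > 0 else -1
    return y

N = 8
target = np.array([1, -1, 1, 1, -1, -1, 1, -1])
W, eta = np.zeros((N, N)), 0.1
for _ in range(50):
    y_d = few_steps(W, target)                       # down-valley position
    W += eta * (np.outer(target, target) - np.outer(y_d, y_d))
    np.fill_diagonal(W, 0)                           # no self-connections

# With a single target this makes it a fixed point of the dynamics.
assert np.array_equal(np.sign(W @ target), target)
```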
![Page 98: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/98.jpg)
A probabilistic interpretation
𝐸(𝐲) = −(1/2)𝐲ᵀ𝐖𝐲    𝑃(𝐲) = 𝐶 exp( (1/2)𝐲ᵀ𝐖𝐲 )
• For continuous 𝐲, the energy of a pattern is a perfect analog to the negative log likelihood of a Gaussian density
• For binary 𝐲 it is the analog of the negative log likelihood of a Boltzmann distribution
– Minimizing energy maximizes log likelihood
99
![Page 99: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/99.jpg)
The Boltzmann Distribution
𝐸(𝐲) = −(1/2)𝐲ᵀ𝐖𝐲 − 𝐛ᵀ𝐲    𝑃(𝐲) = 𝐶 exp( −𝐸(𝐲)/𝑘𝑇 )    𝐶 = 1 / ∑𝐲 exp( −𝐸(𝐲)/𝑘𝑇 )
• 𝑘 is the Boltzmann constant
• 𝑇 is the temperature of the system
• The energy terms are like the log likelihood of a Boltzmann distribution at 𝑇 = 1
– Derivation of this probability is in fact quite trivial..
100
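For a tiny net the Boltzmann distribution can be computed exactly by enumerating all 2^N states (my own toy example, taking 𝑘 = 1): low-energy states receive the highest probability.

```python
import numpy as np
from itertools import product

# Sketch: enumerate all 2^N states and form the Boltzmann distribution
# P(y) = exp(-E(y)/kT) / Z, with k = 1. Low energy -> high probability.
def energy(W, b, y):
    return -0.5 * y @ W @ y - b @ y

N, T = 3, 1.0
W = np.array([[0., 1., 0.],
              [1., 0., 0.],
              [0., 0., 0.]])        # units 0 and 1 coupled, unit 2 free
b = np.zeros(N)
states = [np.array(s, dtype=float) for s in product([-1, 1], repeat=N)]
E = np.array([energy(W, b, y) for y in states])
P = np.exp(-E / T)
P /= P.sum()                        # C = 1 / sum_y exp(-E(y)/kT)

assert np.isclose(P.sum(), 1.0)
assert np.argmax(P) == np.argmin(E)  # most probable = lowest energy
```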
![Page 100: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/100.jpg)
Continuing the Boltzmann analogy
𝐸(𝐲) = −(1/2)𝐲ᵀ𝐖𝐲 − 𝐛ᵀ𝐲    𝑃(𝐲) = 𝐶 exp( −𝐸(𝐲)/𝑘𝑇 )    𝐶 = 1 / ∑𝐲 exp( −𝐸(𝐲)/𝑘𝑇 )
• The system probabilistically selects states with lower energy
– With infinitesimally slow cooling, at 𝑇 = 0, it arrives at the global minimal state
101
![Page 101: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/101.jpg)
Spin glasses and Hopfield nets
• Selecting the next state is akin to drawing a sample from the Boltzmann distribution at 𝑇 = 1, in a universe where 𝑘 = 1
(Figure: energy vs. state)
102
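The stochastic state selection can be sketched as per-unit Gibbs sampling (my own illustration; the derivation uses the standard fact that for unit 𝑖 with field 𝑧𝑖, the Boltzmann conditional reduces to 𝑃(𝑦𝑖 = +1) = sigmoid(2𝑧𝑖/𝑇)):

```python
import numpy as np

# Sketch: replace the deterministic flip rule with a stochastic one that
# samples each unit from the Boltzmann conditional over its two values:
# P(y_i = +1 | rest) = sigmoid(2 z_i / T), where z_i is the field.
rng = np.random.default_rng(0)

def gibbs_sweep(W, b, y, T=1.0):
    y = y.copy()
    for i in rng.permutation(len(y)):
        z = W[i] @ y + b[i]                        # field at unit i
        p_plus = 1.0 / (1.0 + np.exp(-2.0 * z / T))
        y[i] = 1 if rng.random() < p_plus else -1
    return y

# Two strongly coupled units: agreeing (low-energy) states dominate at T = 1.
W = np.array([[0., 2.],
              [2., 0.]])
b = np.zeros(2)
y = np.array([1, -1])
samples = []
for _ in range(200):
    y = gibbs_sweep(W, b, y)
    samples.append(y.copy())
agree = np.mean([s[0] == s[1] for s in samples])   # near sigmoid(4) ~ 0.98
```

At high 𝑇 the samples become nearly uniform; as 𝑇 → 0 the rule recovers the deterministic sign update, matching the slow-cooling argument on the previous slide.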
![Page 102: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/102.jpg)
Lookahead..
• The Boltzmann analogy
• Adding capacity to a Hopfield network
103
![Page 103: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/103.jpg)
Storing more than N patterns
• How do we increase the capacity of the network
– Store more patterns
104
![Page 104: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/104.jpg)
Expanding the network
• Add a large number of neurons whose actual values you don’t care about!
(Figure: N visible neurons plus K additional neurons)
105
![Page 105: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/105.jpg)
Expanded Network
• New capacity: ~(N+K) patterns
– Although we only care about the pattern of the first N neurons
– We’re interested in N-bit patterns
(Figure: N visible neurons plus K additional neurons)
106
![Page 106: Neural Networks · •So what was Hopfield talking about? •Patterns that are non-orthogonal easier to remember –I.e. patterns that are closer are easier to remember than patterns](https://reader033.vdocuments.site/reader033/viewer/2022042009/5e716dbf72918075bb61d132/html5/thumbnails/106.jpg)
Introducing…
• The Boltzmann machine…
• Friday please…
(Figure: N visible neurons plus K additional neurons)
107