TRANSCRIPT
CSE 150, Spring 2007 Gary Cottrell’s modifications of slides originally produced by David Kriegman
Neural Networks
Introduction to Artificial Intelligence, CSE 150
May 29, 2007
Administration
• Last programming assignment has been posted!
• Final Exam: Tuesday, June 12, 11:30-2:30
Last Lecture
• Naïve Bayes classifier (needed for programming assignment)
• Reading
  – Chapter 13 (for probability)
This Lecture
• Neural Networks
  – Representation
  – Perceptron
  – Learning in linear networks
• Reading
  – Chapter 20, section 5 only
This Lecture: Neural Networks
What & Why
• Artificial Neural Networks: a bottom-up attempt to model the functionality of the brain
• Two main areas of activity:
  – Biological: try to model biological neural systems
  – Computational
• Artificial neural networks are biologically inspired but not necessarily biologically plausible
• So other terms may be used: Connectionism, Parallel Distributed Processing, Adaptive Systems Theory
• Interest in neural networks differs according to profession
Some Applications: NETTALK vs. DECTALK
DECTALK is a system developed by DEC which reads English characters and produces, with 95% accuracy, the correct pronunciation for an input word or text pattern.
• DECTALK is an expert system which took 20 years to finalise.
• It uses a list of pronunciation rules and a large dictionary of exceptions.
NETTALK, a neural network version of DECTALK, was constructed over one summer vacation.
• After 16 hours of training, the system could read a 100-word sample text with 98% accuracy!
• When trained with 15,000 words, it achieved 86% accuracy on a test set.
NETTALK is an example of one of the advantages of the neural network approach over the symbolic approach to Artificial Intelligence.
• It is difficult to simulate the learning process in a symbolic system; rules and exceptions must be known.
• Neural systems, on the other hand, exhibit learning very clearly; the network learns by example.
Human Brain & Connectionist Model
Consider humans:
• Neuron switching time: ~0.001 second
• Number of neurons: ~10^10
• Connections per neuron: ~10^4
• Scene recognition time: ~0.1 second
• 100 inference steps doesn't look like enough; this suggests massively parallel computation

Properties of ANNs:
• Many neuron-like switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically
Perceptrons
Linear threshold units:
• Xi : inputs
• Wi : weights
• θ : threshold
The unit outputs 1 when Σi Wi Xi ≥ θ, and 0 otherwise.
Perceptrons
The threshold can easily be forced to 0 by introducing an additional weight W0 = θ from an input unit that is always -1.
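This bias trick can be sketched in a few lines of Python (a minimal illustration; the function names and example numbers here are invented, not from the slides):

```python
import numpy as np

def perceptron_output(w, x, theta):
    # Linear threshold unit: output 1 iff the weighted sum reaches the threshold.
    return 1 if np.dot(w, x) >= theta else 0

def perceptron_output_bias(w_ext, x):
    # Same unit with the threshold absorbed: prepend a constant -1 input whose
    # weight w_ext[0] = theta, so the effective threshold becomes 0.
    x_ext = np.concatenate(([-1.0], x))
    return 1 if np.dot(w_ext, x_ext) >= 0 else 0

w = np.array([0.5, -0.3])
x = np.array([1.0, 2.0])
theta = 0.2
w_ext = np.concatenate(([theta], w))  # W0 = theta
assert perceptron_output(w, x, theta) == perceptron_output_bias(w_ext, x)
```

Both forms compute the same decision, so a learning algorithm can treat the threshold as just another weight.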
How powerful is a perceptron? (Threshold = 0)
Concept Space & Linear Separability
Learning rule:
• If I'm ON and I'm supposed to be OFF, LOWER the weights from active inputs and RAISE the threshold
• If I'm OFF and I'm supposed to be ON, RAISE the weights from active inputs and LOWER the threshold
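The rule in words above translates directly into code. This sketch (the naming is my own, assuming 0/1 units and a unit learning rate) changes only the weights from active inputs:

```python
def perceptron_update(w, theta, x, output, target, lr=1.0):
    """One step of the perceptron learning rule for a 0/1 threshold unit.

    x holds 0/1 inputs, so only active inputs (x_i = 1) move their weights.
    """
    w = list(w)
    if output == 1 and target == 0:
        # ON but should be OFF: lower weights from active inputs, raise threshold.
        w = [wi - lr * xi for wi, xi in zip(w, x)]
        theta += lr
    elif output == 0 and target == 1:
        # OFF but should be ON: raise weights from active inputs, lower threshold.
        w = [wi + lr * xi for wi, xi in zip(w, x)]
        theta -= lr
    return w, theta

# A unit that wrongly fired on input (1, 0) gets that input's weight lowered:
w, theta = perceptron_update([0.0, 0.0], 0.0, (1, 0), output=1, target=0)
# w is now [-1.0, 0.0] and theta is 1.0
```

When the output already matches the target, nothing changes, which is what lets the rule converge once every example is classified correctly.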
Training Perceptrons
• STOP HERE!!!
• Now do: derivation of the learning rule for linear networks
• Derivation of the learning rule for non-linear networks
Begin: Some slides borrowed and modified from Sanjoy Dasgupta
Learning classifiers
Input space: X = {28×28 images}
Classes (or labels): Y = {0, 1, 2, …, 9}
Given some labeled examples, learn to classify images, i.e., learn a prediction rule
f : X → Y
Most work assumes a binary classification task, such as Y = {-1, +1}, e.g.
-1: {digits 0, 2, 3, 4, 5, 6, 7, 8, 9}
+1: {digit 1}
“Batch” learning
1. Start with a set of labeled examples.
2. Learn an underlying rule f : X → Y.
3. Use this to classify future examples.
Linear classifiers
The images x lie in R^784. ASSUME they have had the mean subtracted: if the data are linearly separable, then we don't have to worry about the threshold.
A linear classifier is parametrized by a weight vector w:
f_w(x) = sign(w · x)
[Figure: a weight vector w with + points on one side of the separating hyperplane and – points on the other.]
Learning linear classifiers
Given training data (x(1), y(1)), …, (x(m), y(m)),
find a vector w such that
y(i) (w · x(i)) ≥ 0 for all i = 1, …, m.
This is linear programming.
The trouble with LP
[Figure: two sets of labeled + and – points. Left, DIFFICULT: the classes are only barely separable. Right, EASY: the classes are separated by a margin r.]
When there's a margin:
Perceptron algorithm (Rosenblatt, 1958):

w = 0
while some (x, y) is misclassified:
    w = w + yx

Claim: If all points have length at most one, and there is a margin r > 0, then this algorithm makes at most 1/r² updates.
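The algorithm above is short enough to run directly. This sketch (the toy data and the choice to always fix the first misclassified point are my own) counts updates so the claim can be checked on a small example:

```python
import numpy as np

def perceptron_train(X, y, max_updates=10000):
    """Rosenblatt's perceptron: while some (x, y) is misclassified, w += y*x.

    X: rows are points (assumed to have length at most one);
    y: labels in {-1, +1}.  Returns (w, number of updates made).
    """
    w = np.zeros(X.shape[1])
    for updates in range(max_updates):
        mistakes = [i for i in range(len(y)) if y[i] * np.dot(w, X[i]) <= 0]
        if not mistakes:
            return w, updates
        i = mistakes[0]              # fix any misclassified point
        w = w + y[i] * X[i]
    return w, max_updates

# Toy linearly separable data with a clear margin.
X = np.array([[0.9, 0.1], [0.7, 0.3], [-0.8, -0.2], [-0.6, -0.4]])
y = np.array([1, 1, -1, -1])
w, n = perceptron_train(X, y)
assert all(yi * np.dot(w, xi) > 0 for xi, yi in zip(X, y))  # all correct
```

With a large margin the update count is tiny, which is exactly what the 1/r² bound predicts.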
Perceptron in action
[Figure: + and – points examined in the order 1-10; the algorithm makes mistakes on points 1 and 7.]
w = x(1) – x(7)
Sparse representation: [(1, 1), (7, -1)]

w = 0
while some (x, y) is misclassified:
    w = w + yx
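Since every update is w = w + yx, the final weight vector is just a signed sum of the points the algorithm erred on, which is exactly what the sparse [(index, label)] representation records. A small sketch (the point coordinates here are invented for illustration):

```python
import numpy as np

def weights_from_mistakes(sparse, points):
    # Expand the sparse mistake list [(index, label), ...] into the weight
    # vector: w is the signed sum of the examples the algorithm got wrong.
    return sum(label * points[idx] for idx, label in sparse)

points = {1: np.array([1.0, 0.5]), 7: np.array([0.2, -0.4])}
w = weights_from_mistakes([(1, 1), (7, -1)], points)
assert np.allclose(w, points[1] - points[7])  # matches w = x(1) - x(7)
```

Storing only the mistake indices and labels is much cheaper than storing w when the input dimension is large and mistakes are few.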
End: Some slides borrowed and modified from Sanjoy Dasgupta
• I didn't really use the following slides, but you may find them useful!!
• Most of what I did was on the board…
Increasing Expressiveness: Multi-Layer Neural Networks
2-layer Neural Net
Comments on Training
• No guarantee of convergence; training may oscillate or reach a local minimum.
• In practice, many large networks can be trained on large amounts of data for realistic problems.
• Many epochs (tens of thousands) may be needed for adequate training. Large datasets may require many hours or days of CPU time.
• Termination criteria: number of epochs; threshold on training-set error; no further decrease in error; increased error on a validation set.
• To avoid local minima: run several trials with different random initial weights.
Preference Bias: Early Stopping
• Overfitting in ANNs
• Stop training when error goes up on a validation set
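The early-stopping criterion can be sketched as a training loop that keeps the weights with the best validation error and halts once that error stops improving. Everything here (the linear toy task, the step size, the patience of 5 epochs) is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy task: fit noisy linear data by gradient descent.
w_true = np.array([1.0, -2.0, 0.5])
X_train = rng.normal(size=(40, 3))
y_train = X_train @ w_true + 0.3 * rng.normal(size=40)
X_val = rng.normal(size=(20, 3))
y_val = X_val @ w_true + 0.3 * rng.normal(size=20)

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

w = np.zeros(3)
best_w, best_val = w.copy(), mse(w, X_val, y_val)
patience, bad_epochs = 5, 0
for epoch in range(500):
    grad = 2 * X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= 0.05 * grad
    val = mse(w, X_val, y_val)
    if val < best_val:
        best_val, best_w, bad_epochs = val, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:  # validation error stopped improving: stop
            break
```

Training error would keep shrinking past this point; returning the weights at the validation minimum is what limits overfitting.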
Conclusions
• Perceptrons: learning is convergent, but expressiveness is limited.
• Multi-layer neural networks are more expressive and can be trained using gradient descent (backprop).
• They are much slower to train than other learning methods, but exploring a rich hypothesis space works well in many domains.
• The learned representation is not very accessible, and it is difficult to take advantage of domain (prior) knowledge.
• The analogy to the brain and successful applications have generated significant interest.