Robot Learning, Jeremy Wyatt, School of Computer Science, University of Birmingham
Robot Learning
Jeremy Wyatt
School of Computer Science
University of Birmingham
Plan
– Why and when
– What we can do
  – Learning how to act
  – Learning maps
  – Evolutionary Robotics
– How we do it
  – Supervised Learning
  – Learning from punishments and rewards
  – Unsupervised Learning
Learning How to Act
What can we do?
– Reaching
– Road following
– Box pushing
– Wall following
– Pole-balancing
– Stick juggling
– Walking
Learning How to Act: Reaching
We can learn from reinforcement or from a teacher (supervised learning).

Reinforcement Learning:
– Action: Move your arm
– Feedback: You received a reward of 2.1

Supervised Learning:
– Action: Move your hand to (x,y,z)
– Feedback: You should have moved to (x′,y′,z′)
Learning How to Act: Driving
– ALVINN learned to drive in 5 minutes
– Learns to copy the human response
– Feedforward multilayer neural network

[Figure: a 30×32 camera image feeds the network; the output layer encodes the steering wheel position]
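The slide's architecture can be sketched as a tiny forward pass. This is an illustrative stand-in, not ALVINN's actual code: the layer sizes, weight initialisation, and helper names (`steering_outputs`, `steering_unit`) are all assumptions chosen to match the 30×32-input, 30-output shape described above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes only: an ALVINN-style 30x32 input image,
# one small hidden layer, 30 output units over steering directions.
W1 = rng.normal(0.0, 0.1, size=(30 * 32, 5))   # input -> hidden weights
W2 = rng.normal(0.0, 0.1, size=(5, 30))        # hidden -> output weights

def steering_outputs(image):
    """Forward pass: flattened image -> hidden layer -> 30 output
    activations; the peak encodes the steering wheel position."""
    h = np.tanh(image.reshape(-1) @ W1)
    return 1.0 / (1.0 + np.exp(-(h @ W2)))     # sigmoid output units

def steering_unit(image):
    """Index of the most active output unit, i.e. the chosen direction."""
    return int(np.argmax(steering_outputs(image)))
```

With trained weights, the network is driven by feeding each camera frame through `steering_outputs` and steering towards the most active unit.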
Learning How to Act: Driving
– Network outputs form a Gaussian
– The mean encodes the driving direction
– Compare with the “correct” human action
– Compute the error for each unit given the desired Gaussian
[Figure: desired Gaussian of activation (roughly 0 to 0.07) over the output units, peaked at the unit for the correct steering direction]
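The target construction above can be sketched in a few lines. A minimal sketch, assuming 30 output units and a Gaussian of fixed width centred on the correct steering unit; the function names and the width `sigma` are hypothetical, not taken from the original system:

```python
import numpy as np

def gaussian_target(correct_unit, n_units=30, sigma=2.0):
    """Desired activation pattern: a Gaussian over the output units,
    peaked at the unit for the correct steering direction."""
    units = np.arange(n_units)
    target = np.exp(-((units - correct_unit) ** 2) / (2.0 * sigma ** 2))
    return target / target.sum()          # normalise the activations

def per_unit_error(outputs, correct_unit):
    """Training error for each output unit: actual activation minus
    the desired Gaussian value for that unit."""
    return outputs - gaussian_target(correct_unit, len(outputs))
```

Backpropagating `per_unit_error` then nudges every output unit towards its slice of the desired Gaussian, rather than training a single "correct" unit.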
Learning How to Act: Driving
The distribution of training examples from on-the-fly learning causes problems:
– The network doesn’t see how to cope with misalignments
– The network can forget a situation if it doesn’t see it for a while

Answer: generate new examples from the on-the-fly images.
Learning How to Act: Driving
– Use camera geometry to compute the new field of view
– Fill in missing pixels using information about road structure
– Transform the target steering direction
– Present the result as a new training example
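A crude sketch of the idea, with heavy simplifications: a lateral column shift stands in for the full camera-geometry transform, wrapped columns stand in for the road-structure fill-in, and the steering target is shifted by a matching amount. Every name here is hypothetical; the real system's transform is more careful than this.

```python
import numpy as np

def shifted_example(image, steering_unit, shift, n_units=30):
    """Generate a synthetic training example: shift the image sideways
    (a stand-in for the geometric field-of-view transform) and shift
    the target steering unit to compensate."""
    new_image = np.roll(image, shift, axis=1)   # columns wrap round here;
    # the real system instead fills revealed pixels from road structure
    new_target = int(np.clip(steering_unit + shift, 0, n_units - 1))
    return new_image, new_target
```

Feeding many such shifted copies of each live frame into training is what lets the network see misalignments it would never produce while driving well.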
Learning How to Act: Driving
Learning How to Act
– Obelix learns to push boxes
– Trained by Reinforcement Learning
What is Reinforcement Learning?
– Learning from punishments and rewards
– The agent moves through the world, observing states and rewards
– It adapts its behaviour to maximise some function of the reward
[Figure: a trajectory s1, a1, r1, s2, a2, s3, …, s9 with example rewards r1 = +3, r4 = −1, r5 = −1, r9 = +50]
Return: Long-term performance
– Let’s assume our agent acts according to some rules, called a policy, π
– The return Rt is a measure of the long-term reward collected after time t
– The expected return for a state–action pair is called its Q-value, Q(s,a)
[Figure: a trajectory with rewards r1 = +3, r4 = −1, r5 = −1, r9 = +50]

R_t = r_{t+1} + γ·r_{t+2} + γ²·r_{t+3} + …   (γ is the discount factor, 0 ≤ γ ≤ 1)

For the example trajectory:
R_0 = γ⁰·(+3) + γ³·(−1) + γ⁴·(−1) + γ⁸·(+50)
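The discounted return is simple to compute from a reward sequence. A minimal sketch (the helper name is hypothetical), using the rewards from the example trajectory with zeros at the unrewarded steps:

```python
def discounted_return(rewards, gamma):
    """R_t = sum over k of gamma^k * r_{t+k+1}: each step further
    into the future counts for a factor gamma less."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Rewards from the example trajectory: r1 = +3, r4 = -1, r5 = -1, r9 = +50
rewards = [3, 0, 0, -1, -1, 0, 0, 0, 50]
R0 = discounted_return(rewards, gamma=0.9)
```

With γ = 0.9 the large +50 reward eight steps away still dominates R0; with a smaller γ the agent would weight the immediate +3 much more heavily.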
One step Q-learning
– Guess how good state–action pairs are
– Take an action
– Watch the new state and reward
– Update the state–action value
Q̂(s_t, a_t) ← Q̂(s_t, a_t) + α_t [ r_{t+1} + γ · max_{b∈A} Q̂(s_{t+1}, b) − Q̂(s_t, a_t) ]

(Q̂ is the current estimate of Q; α_t is the learning rate and γ the discount factor.)
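The update rule above maps directly onto a tabular implementation. A minimal sketch with a zero-initialised Q table; the function name and state/action labels are illustrative only:

```python
from collections import defaultdict

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One-step Q-learning: move Q(s,a) towards the bootstrapped
    target r + gamma * max over b of Q(s_next, b)."""
    best_next = max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q[(s, a)]

Q = defaultdict(float)   # zero-initialised Q table: Q[(state, action)]
```

After observing, say, reward 1.0 for pushing in state `'s1'`, `q_update(Q, 's1', 'push', 1.0, 's2', ['push', 'turn'])` nudges that entry towards the target; value then propagates backwards to earlier states as updates repeat.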
Obelix
– Won’t converge with a single controller
– Works if you divide the task into behaviours
– But …
Evolutionary Robotics
Learning Maps