Neural Turing Machines (Slides)

Neural Turing Machines. Can neural nets learn programs? Alex Graves, Greg Wayne, Ivo Danihelka


TRANSCRIPT

Page 1: Neural Turing Machines (Slides)

Neural Turing Machines

Can neural nets learn programs? Alex Graves, Greg Wayne, Ivo Danihelka


Contents

1. Introduction
2. Foundational Research
3. Neural Turing Machines
4. Experiments
5. Conclusions

Pages 5-9: Neural Turing Machines (Slides)

Introduction

• First application of machine learning to logical flow and external memory

• Extends the capabilities of neural networks by coupling them to external memory

• Analogous to a Turing machine coupling a finite state machine to an infinite tape

• RNNs have been shown to be Turing-complete (Siegelmann et al. '95)

• Unlike a TM, the NTM is completely differentiable

Pages 10-18: Neural Turing Machines (Slides)

Foundational Research

• Neuroscience and Psychology
  – Concept of "working memory": short-term memory storage and rule-based manipulation
  – Also known as "rapidly created variables"
  – Observational neuroscience results in the pre-frontal cortex and basal ganglia of monkeys

• Cognitive Science and Linguistics
  – AI and Cognitive Science were contemporaneous in the 1950s-1970s
  – The two fields parted ways when neural nets received criticism (Fodor et al. '88):
    • Incapable of "variable-binding", e.g. "Mary spoke to John"
    • Incapable of handling variable-sized input
  – Motivated recurrent-network research to handle variable binding and variable-length input
  – Recursive processing was a hot debate topic regarding its role in human evolution (Pinker vs Chomsky)

• Recurrent Neural Networks
  – Broad class of machines with distributed and dynamic state
  – Long Short-Term Memory RNNs designed to handle vanishing and exploding gradients
  – Natively handle variable-length structures

Pages 19-20: Neural Turing Machines (Slides)

Neural Turing Machines (figure slides)

Pages 21-22: Neural Turing Machines (Slides)

Neural Turing Machines

1. Reading
– M_t is the N×M matrix of memory at time t
– w_t is a normalized weighting over the N locations; the read vector is r_t = Σ_i w_t(i) M_t(i)
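The read operation above can be sketched as a weighted sum over memory rows (a minimal NumPy sketch; the function name `ntm_read` is mine, not from the paper):

```python
import numpy as np

def ntm_read(M, w):
    """NTM read: r_t = sum_i w_t(i) * M_t(i).

    M : (N, W) memory matrix at time t (N locations of width W)
    w : (N,) weighting over locations; non-negative, sums to 1
    """
    return M.T @ w  # (W,) read vector, a convex combination of memory rows

# A sharp (one-hot) weighting recovers a single row exactly.
M = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
w = np.array([0.0, 1.0, 0.0])
r = ntm_read(M, w)  # -> [3., 4.]
```

Because w_t is a distribution rather than a hard index, a blurred weighting returns a blend of rows, which is what keeps the read differentiable.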

Pages 23-25: Neural Turing Machines (Slides)

Neural Turing Machines

1. Reading
2. Writing involves both erasing and adding
3. Addressing
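The erase-then-add write can be sketched as follows, following the paper's equations M̃_t(i) = M_{t-1}(i)[1 − w_t(i) e_t] and M_t(i) = M̃_t(i) + w_t(i) a_t (a minimal NumPy sketch; the function name is mine):

```python
import numpy as np

def ntm_write(M, w, e, a):
    """Erase-then-add write.

    M : (N, W) memory matrix from the previous time step
    w : (N,) write weighting
    e : (W,) erase vector, elements in (0, 1)
    a : (W,) add vector
    """
    M_erased = M * (1 - np.outer(w, e))  # erase: scale row i by (1 - w[i]*e)
    return M_erased + np.outer(w, a)     # add: deposit w[i]*a into each row

# With a one-hot weighting and e = 1, row 1 is fully overwritten by `a`.
M = np.ones((3, 2))
w = np.array([0.0, 1.0, 0.0])
new_M = ntm_write(M, w, e=np.ones(2), a=np.array([7.0, 8.0]))
```

Note that both steps are elementwise multiplies and adds, so the write, like the read, is differentiable with respect to the weighting and the erase/add vectors.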

Page 26: Neural Turing Machines (Slides)

Neural Turing Machines

• 3. Addressing
  – 1. Focusing by Content
    • Each head produces a key vector k_t of length M
    • A content-based weighting w_t^c is generated from a similarity measure, using a 'key strength' β_t
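Content focusing in the paper uses cosine similarity K and a β-sharpened softmax, w_t^c(i) ∝ exp(β_t K(k_t, M_t(i))). A minimal sketch (function name mine):

```python
import numpy as np

def content_weights(M, k, beta):
    """Content-based addressing: softmax over cosine similarities,
    with key strength beta (larger beta -> more peaked weighting)."""
    eps = 1e-8  # guards against division by zero for all-zero rows
    sim = (M @ k) / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + eps)
    z = beta * sim
    z = z - z.max()  # stabilize the softmax numerically
    w = np.exp(z)
    return w / w.sum()

# The row most similar to the key receives the most weight.
M = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
w = content_weights(M, k=np.array([1.0, 0.0]), beta=10.0)
```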

Page 27: Neural Turing Machines (Slides)

Neural Turing Machines

• 3. Addressing
  – 2. Interpolation
    • Each head emits a scalar interpolation gate g_t, blending the content weighting with the previous time step's weighting

Page 28: Neural Turing Machines (Slides)

Neural Turing Machines

• 3. Addressing
  – 3. Convolutional Shift
    • Each head emits a distribution over allowable integer shifts s_t

Page 29: Neural Turing Machines (Slides)

Neural Turing Machines

• 3. Addressing
  – 4. Sharpening
    • Each head emits a scalar sharpening parameter γ_t
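Steps 2-4 of the addressing pipeline can be sketched as below (a NumPy sketch; restricting shifts to offsets −1, 0, +1 and the sign convention of the roll are illustrative assumptions):

```python
import numpy as np

def interpolate(w_content, w_prev, g):
    """Gate between the content weighting and the previous weighting."""
    return g * w_content + (1 - g) * w_prev

def circular_shift(w, s):
    """Convolve the weighting with a shift distribution s over offsets
    (-1, 0, +1), with wraparound (a rotation of the head position)."""
    out = np.zeros(len(w))
    for offset, p in zip((-1, 0, 1), s):
        out += p * np.roll(w, offset)
    return out

def sharpen(w, gamma):
    """Raise to the power gamma >= 1 and renormalize, combating the
    blurring introduced by the shift convolution."""
    w = w ** gamma
    return w / w.sum()

# Chaining the three steps on a toy weighting:
w_c = np.array([0.1, 0.7, 0.1, 0.1])
w_final = sharpen(circular_shift(interpolate(w_c, np.full(4, 0.25), g=0.9),
                                 [0.1, 0.8, 0.1]), gamma=2.0)
```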

Pages 30-33: Neural Turing Machines (Slides)

Neural Turing Machines

• 3. Addressing (putting it all together)
  – This can operate in three complementary modes:
    • A weighting can be chosen by the content system without any modification by the location system
    • A weighting produced by the content addressing system can be chosen and then shifted
    • A weighting from the previous time step can be rotated without any input from the content-based addressing system

Pages 34-38: Neural Turing Machines (Slides)

Neural Turing Machines

• Controller Network Architecture
  – Feed-forward vs recurrent
  – The LSTM variant of the RNN controller has its own internal memory, complementary to M
  – Hidden LSTM layers are 'like' registers in a processor
  – Allows a mix of information across multiple time steps
  – Feed-forward offers better transparency

Pages 39-41: Neural Turing Machines (Slides)

Experiments

• Test the NTM's ability to learn simple algorithms like copying and sorting

• Demonstrate that solutions generalize well beyond the range of training

• Tests compare three architectures:
  – NTM with a feed-forward controller
  – NTM with an LSTM controller
  – Standard LSTM network

Page 42: Neural Turing Machines (Slides)

Experiments

• 1. Copy
  – Tests whether the NTM can store and retrieve data
  – Trained to copy sequences of 8-bit vectors
  – Sequences vary between 1 and 20 vectors
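Training data for the copy task can be generated as below (a sketch; the delimiter-channel encoding is my assumption about the exact input format, and the function name is mine):

```python
import numpy as np

def copy_task_example(rng, max_len=20, width=8):
    """One copy-task episode: the input is a random binary sequence plus a
    delimiter flag on an extra channel; the target is the same sequence,
    to be emitted after the delimiter."""
    T = rng.integers(1, max_len + 1)                 # sequence length 1..20
    seq = rng.integers(0, 2, size=(T, width)).astype(float)
    inp = np.zeros((2 * T + 1, width + 1))
    inp[:T, :width] = seq
    inp[T, width] = 1.0                              # delimiter channel
    target = np.zeros((2 * T + 1, width))
    target[T + 1:, :] = seq                          # reproduce after delimiter
    return inp, target

rng = np.random.default_rng(0)
inp, target = copy_task_example(rng)
```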

Pages 43-46: Neural Turing Machines (Slides)

Experiments

• 1. Copy (figure slides: NTM and LSTM copy results)

Page 47: Neural Turing Machines (Slides)

Experiments

• 2. Repeat Copy
  – Tests whether the NTM can learn a simple nested function
  – Extends copy by repeatedly copying the input a specified number of times
  – Training input is a random-length sequence of 8-bit binary vectors plus a scalar value for the number of copies
  – The scalar value is random between 1 and 10

Pages 48-50: Neural Turing Machines (Slides)

Experiments

• 2. Repeat Copy (figure slides)

Page 51: Neural Turing Machines (Slides)

Experiments

• 3. Associative Recall
  – Tests the NTM's ability to associate data references
  – Training input is a list of items, followed by a query item
  – Output is the subsequent item in the list
  – Each item is a sequence of three 6-bit binary vectors
  – Each 'episode' has between two and six items
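An associative-recall episode can be generated as below (a sketch under the item format above; the function name and the exact episode layout are my assumptions):

```python
import numpy as np

def associative_recall_episode(rng, n_items=3, item_len=3, width=6):
    """One episode: a list of items, a query drawn from the list (never the
    last item), and a target equal to the item that follows the query.
    Each item is `item_len` binary vectors of `width` bits."""
    items = rng.integers(0, 2, size=(n_items, item_len, width)).astype(float)
    q = rng.integers(0, n_items - 1)       # query anything but the last item
    return items, items[q], items[q + 1]   # (list, query, target)

rng = np.random.default_rng(0)
items, query, target = associative_recall_episode(rng)
```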

Pages 52-54: Neural Turing Machines (Slides)

Experiments

• 3. Associative Recall (figure slides)

Page 55: Neural Turing Machines (Slides)

Experiments

• 4. Dynamic N-Grams
  – Tests whether the NTM can rapidly adapt to new predictive distributions
  – Trained on 6-gram binary patterns over 200-bit sequences
  – Can the NTM learn the optimal estimator?
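The "optimal estimator" here is, in the paper, the Bayesian (Krichevsky-Trofimov) predictor P(B = 1 | N0, N1) = (N1 + 1/2) / (N0 + N1 + 1), where N0 and N1 count how often each bit has followed the current 5-bit context so far. A sketch (function name mine):

```python
from collections import defaultdict

def kt_predictions(bits, order=5):
    """Krichevsky-Trofimov estimator: for each position, predict
    P(next bit = 1 | context) from counts of what followed that
    context earlier in the same sequence."""
    counts = defaultdict(lambda: [0, 0])   # context -> [N0, N1]
    preds = []
    for t in range(order, len(bits)):
        ctx = tuple(bits[t - order:t])
        n0, n1 = counts[ctx]
        preds.append((n1 + 0.5) / (n0 + n1 + 1))
        counts[ctx][bits[t]] += 1          # update after predicting
    return preds

# On a constant stream the predictor starts at 0.5 and adapts toward 1.
preds = kt_predictions([1] * 20, order=5)
```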

Pages 56-58: Neural Turing Machines (Slides)

Experiments

• 4. Dynamic N-Grams (figure slides)

Page 59: Neural Turing Machines (Slides)

Experiments

• 5. Priority Sort
  – Tests whether the NTM can sort data
  – Input is a sequence of 20 random binary vectors, each with a scalar priority rating drawn from [-1, 1]
  – Target sequence is the 16 highest-priority vectors
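The target construction for this task can be sketched directly (a NumPy sketch; the function name is mine, and ordering the output by descending priority is my reading of "16 highest-priority vectors"):

```python
import numpy as np

def priority_sort_target(vectors, priorities, k=16):
    """Return the k highest-priority vectors, ordered by descending
    priority."""
    order = np.argsort(priorities)[::-1]   # indices, highest priority first
    return vectors[order[:k]]

rng = np.random.default_rng(1)
vecs = rng.integers(0, 2, size=(20, 8)).astype(float)
prios = rng.uniform(-1, 1, size=20)       # scalar ratings drawn from [-1, 1]
target = priority_sort_target(vecs, prios)
```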

Pages 60-62: Neural Turing Machines (Slides)

Experiments

• 5. Priority Sort (figure slides)

Page 63: Neural Turing Machines (Slides)

Experiments

• 6. Details
  – RMSProp algorithm
  – Momentum 0.9
  – All LSTMs had three stacked hidden layers
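A generic RMSProp-with-momentum update looks like the sketch below. Only "momentum 0.9" is from the slide; the paper uses the Graves 2013 variant, and the exact update form and other hyperparameters here are illustrative assumptions:

```python
import numpy as np

def rmsprop_step(param, grad, state, lr=1e-4, decay=0.95, momentum=0.9, eps=1e-4):
    """One RMSProp step: scale the gradient by a running RMS of recent
    gradients, then apply a momentum-accumulated update."""
    state["ms"] = decay * state["ms"] + (1 - decay) * grad ** 2
    step = lr * grad / np.sqrt(state["ms"] + eps)
    state["mom"] = momentum * state["mom"] - step
    return param + state["mom"]

p = np.array([1.0])
st = {"ms": np.zeros(1), "mom": np.zeros(1)}
p = rmsprop_step(p, grad=np.array([2.0]), state=st)
```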

Pages 64-66: Neural Turing Machines (Slides)

Experiments

• 6. Details (figure slides)

Page 67: Neural Turing Machines (Slides)

Conclusion

• Introduced a neural net architecture with external memory that is differentiable end-to-end

• Experiments demonstrate that NTMs are capable of learning simple algorithms and of generalizing beyond the training regime
