Learning to Interpret Natural Language Navigation Instructions from Observation
Ray Mooney, Department of Computer Science, University of Texas at Austin
Joint work with David Chen, Joohyun Kim, and Lu Guo
Challenge Problem: Learning to Follow Directions in a Virtual World
• Learn to interpret navigation instructions in a virtual environment by simply observing humans giving and following such directions (Chen & Mooney, AAAI-11).
• Eventual goal: Virtual agents in video games and educational software that automatically learn to take and give instructions in natural language.
Sample Environment (MacMahon et al., AAAI-06)
[Figure: overhead map of the virtual environment with labeled objects]
H – Hat Rack
L – Lamp
E – Easel
S – Sofa
B – Barstool
C – Chair
Sample Instructions
• Take your first left. Go all the way down until you hit a dead end.
• Go towards the coat hanger and turn left at it. Go straight down the hallway and the dead end is position 4.
• Walk to the hat rack. Turn left. The carpet should have green octagons. Go to the end of this alley. This is p-4.
• Walk forward once. Turn left. Walk forward twice.
[Figure: map with Start, the hat rack (H), positions 3 and 4, and End marked]
Sample Instructions
• Take your first left. Go all the way down until you hit a dead end.
• Go towards the coat hanger and turn left at it. Go straight down the hallway and the dead end is position 4.
• Walk to the hat rack. Turn left. The carpet should have green octagons. Go to the end of this alley. This is p-4.
• Walk forward once. Turn left. Walk forward twice.
Observed primitive actions: Forward, Left, Forward, Forward
Observed Training Instance in Chinese
Formal Problem Definition
Given: {(e1, w1, a1), (e2, w2, a2), …, (en, wn, an)}
  ei – a natural language instruction
  wi – a world state
  ai – an observed action sequence
Goal: Build a system that produces the correct aj given a previously unseen (ej, wj).
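As a concrete rendering of this setup, here is a minimal Python sketch of a training instance; the class and field names are illustrative, not taken from the original system.

from dataclasses import dataclass

@dataclass
class TrainingInstance:
    instruction: str    # e_i: a natural language instruction
    world_state: dict   # w_i: the observable world state
    actions: list       # a_i: the observed action sequence, e.g. ["Forward", "Left", "Forward", "Forward"]

# Goal: from {(e_1, w_1, a_1), ..., (e_n, w_n, a_n)}, learn a function
#   follow(instruction, world_state) -> action sequence
# that produces the correct a_j for a previously unseen (e_j, w_j).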
Learning system for parsing navigation instructions
• Training: each observation pairs an instruction with a world state and the follower's action trace. The Navigation Plan Constructor turns the action trace and world state into a plan, and the Semantic Parser Learner is trained on the resulting (instruction, plan) pairs to produce a Semantic Parser.
• Testing: a previously unseen instruction and world state are given to the learned Semantic Parser, and the resulting plan is executed by the Execution Module (MARCO) to produce an action trace.
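The flow in the diagram can be summarized as two small functions; the components are passed in as parameters because they stand for the boxes above (Navigation Plan Constructor, Semantic Parser Learner, Semantic Parser, MARCO), not real APIs.

def train(observations, construct_plan, learn_parser):
    # Training: (instruction, world_state, action_trace) observations are
    # turned into (instruction, plan) pairs by the plan constructor, and a
    # semantic parser is learned from those pairs.
    pairs = [(instruction, construct_plan(action_trace, world_state))
             for instruction, world_state, action_trace in observations]
    return learn_parser(pairs)

def follow(parse, execute, instruction, world_state):
    # Testing: parse the new instruction into a plan, then execute the plan
    # in the given world state (done by MARCO in the actual system).
    plan = parse(instruction)
    return execute(plan, world_state)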
Representing Linguistic Context
Context is represented by the sequence of observed actions, each followed by a verification of all observable aspects of the resulting world state.
Full context plan for the example: Turn(LEFT) → Verify(front: SOFA, front: BLUE HALL) → Travel(steps: 2) → Verify(at: SOFA)
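One way to make this context representation concrete is as an ordered list of actions, each carrying its arguments and the verified properties of the resulting state; the encoding below is an illustrative sketch, not the system's actual data structure.

from dataclasses import dataclass, field

@dataclass
class Step:
    action: str                                   # e.g. "Turn" or "Travel"
    args: dict = field(default_factory=dict)      # e.g. {"direction": "LEFT"} or {"steps": 2}
    verify: dict = field(default_factory=dict)    # observable aspects of the resulting world state

# Full context plan for the example above:
context_plan = [
    Step("Turn",   {"direction": "LEFT"}, {"front": ["SOFA", "BLUE HALL"]}),
    Step("Travel", {"steps": 2},          {"at": "SOFA"}),
]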
Possible Plans
An instruction can refer to a combinatorial number of possible plans, each composed of some subset of the full contextual description above.
Possible Plan #1
Turn and walk to the couch
  (subset: Turn( ), Travel( ), Verify(at: SOFA))
Possible Plan #2
Face the blue hall and walk 2 steps
  (subset: Turn( ), Verify(front: BLUE HALL), Travel(steps: 2))
Possible Plan #3
Turn left. Walk forward twice.
  (subset: Turn(LEFT), Travel(steps: 2))
Disambiguating Sentence Meaning
• Too many meanings to tractably enumerate them all.
• Therefore, cannot use EM to align sentences with enumerated meanings and thereby disambiguate the training data.
Learning system for parsing navigation instructions (revised)
• To cope with this ambiguity, the Navigation Plan Constructor is replaced by a Context Extractor, which produces the full context plan for each observation.
• Two new components, a Lexicon Learner and a Plan Refinement step, then select the subset of that context the instruction actually describes before it is passed to the Semantic Parser Learner.
• Testing is unchanged: the Semantic Parser maps a new instruction to a plan, which the Execution Module (MARCO) executes to produce an action trace.
Lexicon Learning
• Learn meanings of words and short phrases by finding correlations with meaning fragments.
[Figure: correlating the phrases "face", "blue hall", "walk", and "2 steps" with fragments of the meaning graph Turn( ) → Verify(front: BLUE HALL) → Travel(steps: 2)]
Lexicon Learning Algorithm
To learn the meaning of a word/short phrase w:
1. Collect all landmark plans that co-occur with w and add them to the set PosMean(w).
2. Repeatedly take intersections of all possible pairs of members of PosMean(w) and add any new entries, g, to PosMean(w).
3. Rank the entries by the scoring function.
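A minimal sketch of this algorithm in Python, representing each landmark plan as a frozenset of atomic (action, attribute) components so that graph intersection reduces to set intersection. The scoring function itself is not shown on the slide; the ranking below uses p(g|w) − p(g|¬w), the measure used in Chen & Mooney (2011), and should be read as an assumption rather than a transcription.

from collections import defaultdict
from itertools import combinations

def learn_lexicon(pairs):
    # pairs: list of (ngram, plan) co-occurrences, where each plan is a
    # frozenset of atomic components of the landmark plan.
    pos_mean = defaultdict(set)
    for ngram, plan in pairs:                 # Step 1: collect co-occurring plans
        pos_mean[ngram].add(plan)

    for candidates in pos_mean.values():      # Step 2: close under pairwise intersection
        changed = True
        while changed:
            changed = False
            for g1, g2 in combinations(list(candidates), 2):
                g = g1 & g2
                if g and g not in candidates:
                    candidates.add(g)
                    changed = True

    lexicon = {}                              # Step 3: rank candidate meanings
    for ngram, candidates in pos_mean.items():
        with_w    = [p for n, p in pairs if n == ngram]
        without_w = [p for n, p in pairs if n != ngram]
        def score(g):
            p_w    = sum(g <= p for p in with_w) / max(len(with_w), 1)
            p_notw = sum(g <= p for p in without_w) / max(len(without_w), 1)
            return p_w - p_notw               # assumed scoring: p(g|w) - p(g|not w)
        lexicon[ngram] = sorted(candidates, key=score, reverse=True)
    return lexicon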
Graph Intersection
Graph 1: "Turn and walk to the sofa."
  Turn(LEFT) → Verify(front: SOFA, front: BLUE HALL) → Travel(steps: 2) → Verify(at: SOFA)
Graph 2: "Walk to the sofa and turn left."
  Travel(steps: 1) → Verify(at: SOFA) → Turn(LEFT) → Verify(front: BLUE HALL)
Intersections:
  Turn(LEFT) → Verify(front: BLUE HALL)
  Travel( ) → Verify(at: SOFA)
Plan Refinement
• Use the learned lexicon to determine the subset of the context representing the sentence's meaning.
Example: "Face the blue hall and walk 2 steps"
  Full context plan: Turn(LEFT) → Verify(front: SOFA, front: BLUE HALL) → Travel(steps: 2) → Verify(at: SOFA)
  Refined plan: Turn( ) → Verify(front: BLUE HALL) → Travel(steps: 2)
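A rough sketch of refinement under this representation: keep only those fragments of the full context plan that are licensed by lexicon entries for n-grams appearing in the sentence. The simple union-of-matches strategy here is a simplification for illustration, not the exact procedure used in the paper.

def refine_plan(sentence_tokens, context_plan, lexicon, max_n=3):
    # sentence_tokens: list of lower-cased tokens
    # context_plan: frozenset of atomic plan components
    # lexicon: dict mapping n-gram tuples to ranked lists of meaning sets
    refined = set()
    for n in range(1, max_n + 1):
        for i in range(len(sentence_tokens) - n + 1):
            ngram = tuple(sentence_tokens[i:i + n])
            for meaning in lexicon.get(ngram, []):
                if meaning <= context_plan:   # keep only fragments observed in the context
                    refined |= meaning
    return frozenset(refined)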
Evaluation Data Statistics
• 3 maps, 6 instructors, 1–15 followers per direction
• Hand-segmented into single-sentence steps

                     Paragraph       Single-Sentence
# Instructions       706             3,236
Avg. # sentences     5.0 (±2.8)      1.0 (±0)
Avg. # words         37.6 (±21.1)    7.8 (±5.1)
Avg. # actions       10.4 (±5.7)     2.1 (±2.4)
End-to-End Execution Evaluation
• Test how well the system follows novel directions.
• Leave-one-map-out cross-validation.
• Strict metric: only correct if the final position exactly matches the goal location.
• Lower baselines:
  – Simple probabilistic generative model of executed plans without language.
  – Semantic parser trained on full context plans.
• Upper baselines:
  – Semantic parser trained on human-annotated plans.
  – Human followers.
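A sketch of this evaluation protocol, assuming each test instance records its map, instruction, world state, and goal position; the training and execution functions are placeholders for the system components above.

def end_to_end_accuracy(instances, train_fn, follow_fn):
    # Leave-one-map-out cross-validation with the strict metric: a direction
    # counts as correct only if the final position exactly matches the goal.
    maps = {inst["map"] for inst in instances}
    correct, total = 0, 0
    for held_out in sorted(maps):
        train_set = [i for i in instances if i["map"] != held_out]
        test_set  = [i for i in instances if i["map"] == held_out]
        model = train_fn(train_set)
        for inst in test_set:
            final_position = follow_fn(model, inst["instruction"], inst["world_state"])
            correct += int(final_position == inst["goal"])
            total += 1
    return correct / total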
End-to-End Execution Accuracy
                                    Single-Sentence   Paragraph
Simple Generative Model             11.08%            2.15%
Trained on Full Context             21.95%            2.66%
Trained on Refined Plans            57.28%            19.18%
Trained on Human Annotated Plans    62.67%            29.59%
Human Followers                     N/A               69.64%
Sample Successful Parse
Instruction: “Place your back against the wall of the ‘T’ intersection. Turn left. Go forward along the pink-flowered carpet hall two segments to the intersection with the brick hall. This intersection contains a hatrack. Turn left. Go forward three segments to an intersection with a bare concrete hall, passing a lamp. This is Position 5.”
Parse: Turn ( ), Verify ( back: WALL ), Turn ( LEFT ), Travel ( ), Verify ( side: BRICK HALLWAY ), Turn ( LEFT ), Travel ( steps: 3 ), Verify ( side: CONCRETE HALLWAY )
Mandarin Chinese Experiment
• Translated all the instructions from English to Chinese.
                           Single Sentences   Paragraphs
Trained on Refined Plans    58.70%             20.13%
Problem with Purely Correlational Lexicon Learning
• The correlation between an n-gram w and graph g can be affected by the context.
• Example:
  – Bigram: "the wall"
  – Sample uses:
    • "turn so the wall is on your right side"
    • "with your back to the wall turn left"
  – Co-occurring aspects of context:
    • TURN()
    • VERIFY(direction: WALL)
  – But "the wall" is simply an object involving no action.
Syntactic Bootstrapping
• Children sometimes use syntactic information to guide learning of word meanings (Gleitman, 1990).
• Complement to Pinker’s semantic bootstrapping in which semantics is used to help learn syntax.
Using POS to Aid Lexicon Learning
• Annotate each n-gram, w, with POS tags.
  – dead/JJ end/NN
• Annotate each node in the meaning graph, g, with a semantic-category tag.
  – TURN/Action VERIFY/Action FORWARD/Action
Reason: "dead end" is often followed by the action of turning around to face another direction so that there is a way to go forward.
Constraints on Lexicon Entry: (w,g)
• The n-gram w should contain a noun if and only if the graph g contains an Object
• The n-gram w should contain a verb if and only if the graph g contains an Action
dead/JJ end/NN ↔ TURN/Action VERIFY/Action FORWARD/Action
  Violates the rules (noun but no Object): remove it.
dead/JJ end/NN ↔ front/Relation WALL/Object
  Satisfies the rules: retain it.
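A small sketch of this constraint as a filter on candidate (w, g) entries; the tagging conventions (Penn Treebank POS tags for words, coarse semantic categories for graph nodes) follow the slide's examples, and the helper below is illustrative only.

def satisfies_constraints(tagged_ngram, tagged_nodes):
    # tagged_ngram: list of (word, POS) pairs, e.g. [("dead", "JJ"), ("end", "NN")]
    # tagged_nodes: list of (node, category) pairs, e.g. [("front", "Relation"), ("WALL", "Object")]
    has_noun   = any(pos.startswith("NN") for _, pos in tagged_ngram)
    has_verb   = any(pos.startswith("VB") for _, pos in tagged_ngram)
    has_object = any(cat == "Object" for _, cat in tagged_nodes)
    has_action = any(cat == "Action" for _, cat in tagged_nodes)
    # noun <-> Object and verb <-> Action must hold in both directions
    return has_noun == has_object and has_verb == has_action

# ("dead end", TURN/Action VERIFY/Action FORWARD/Action): noun but no Object -> removed
print(satisfies_constraints([("dead", "JJ"), ("end", "NN")],
                            [("TURN", "Action"), ("VERIFY", "Action"), ("FORWARD", "Action")]))  # False
# ("dead end", front/Relation WALL/Object): noun and Object, no verb or Action -> retained
print(satisfies_constraints([("dead", "JJ"), ("end", "NN")],
                            [("front", "Relation"), ("WALL", "Object")]))                        # True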
PCFG Induction Model for Grounded Language Learning (Borschinger et al. 2011)
• PCFG rules describe the generative process from MR components to the corresponding NL words.
Series of Grounded Language Learning Papers that Build Upon Each Other
• Kate & Mooney, AAAI-07
• Chen & Mooney, ICML-08
• Liang, Jordan, and Klein, ACL-09
• Kim & Mooney, COLING-10
  – Also integrates Lu, Ng, Lee, & Zettlemoyer, EMNLP-08
• Borschinger, Jones, & Johnson, EMNLP-11
• Kim & Mooney, EMNLP-12
PCFG Induction Model for Grounded Language Learning (Borschinger et al. 2011)
• Generative process:
  – Select a complete MR to describe.
  – Generate atomic MR constituents in order.
  – Each atomic MR generates NL words by a unigram Markov process.
• Parameters learned using EM (Inside–Outside).
• Parse new NL sentences by reading the top MR nonterminal from the most probable parse tree.
  – Output MRs are limited to those included in the PCFG rule set constructed from the training data.
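A toy illustration of the last generative step, with invented emission probabilities: each atomic MR constituent emits its words under a simple unigram model, so the probability of a phrase is just a product of per-word probabilities.

import math

# Invented emission distributions for two atomic MR constituents
emissions = {
    "Turn(LEFT)":       {"turn": 0.5, "left": 0.4, "the": 0.1},
    "Travel(steps: 2)": {"walk": 0.4, "forward": 0.3, "twice": 0.2, "the": 0.1},
}

def log_prob(mr, words, unseen=1e-9):
    # Log probability of an atomic MR emitting the given words under a unigram model
    dist = emissions[mr]
    return sum(math.log(dist.get(w, unseen)) for w in words)

print(log_prob("Turn(LEFT)", ["turn", "left"]))      # high: words well explained by this MR
print(log_prob("Turn(LEFT)", ["walk", "forward"]))   # very low: words unseen for this MR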
Limitations of the Borschinger et al. 2011 PCFG Approach
• Only works in low-ambiguity settings, where each sentence can refer to only a few possible MRs.
• Can only output MRs explicitly included in the PCFG constructed from the training data.
• Produces intractably large PCFGs for complex MRs with high ambiguity.
  – Would require ~10^18 productions for our navigation data.
Our Enhanced PCFG Model (Kim & Mooney, EMNLP-2012)
• Use the learned semantic lexicon to constrain the constructed PCFG.
• Limit each MR to generate only the words and phrases paired with that MR in the lexicon.
  – Only ~18,000 productions for the navigation data, compared to ~33,000 produced by Borschinger et al. for the far simpler RoboCup data.
• Output novel MRs not appearing in the PCFG by composing subgraphs from the overall context.
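A sketch of the lexicon constraint on grammar construction: an MR constituent is allowed to emit only the n-grams the learned lexicon pairs with (a fragment of) it, which is what keeps the production set small. The representation mirrors the earlier sketches and is illustrative only.

def build_constrained_productions(mr_constituents, lexicon):
    # mr_constituents: iterable of MR fragments (frozensets of plan components)
    # lexicon: dict mapping n-gram tuples to ranked lists of meaning sets
    productions = []
    for mr in mr_constituents:
        for ngram, meanings in lexicon.items():
            if any(meaning and meaning <= mr for meaning in meanings):
                productions.append((mr, ngram))   # rule: mr -> words of the n-gram
    return productions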
End-to-End Execution Evaluations
                                          Single Sentences   Paragraphs
Mapping to supervised semantic parsing    57.28%             19.18%
Our PCFG model                            57.22%             20.17%
Conclusions
• Challenge problem: Learn to follow NL instructions by just watching people follow them.
• Our goal: Learn without assuming any prior linguistic knowledge.
  – Easily adapt to new languages.
• Exploit existing work on learning for semantic parsing in order to produce structured meaning representations that can handle complex instructions.
• Encouraging initial results on learning to navigate in a virtual world, but still far from human-level performance.