people.csail.mit.edu/yuanzh/papers/greed.pdf
TRANSCRIPT
Greed is Good if Randomized: New Inference for Dependency Parsing
Yuan Zhang, CSAIL, MIT
Joint work with Tao Lei, Regina Barzilay, and Tommi Jaakkola
Inference vs. Scoring
[Figure: parsers placed along two axes: exact vs. approximate inference, and limited vs. expressive scoring functions.]
• Minimum Spanning Tree: exact inference, but a limited scoring function
• Reranking: incorporates arbitrary features
• Dual Decomposition: searches the full space
Parsing Complexity
• High-order parsing is NP-hard (McDonald et al., 2006)
• Hypothesis: parsing is easy on average
• Many NP-hard problems are easy on average:
  - MAX-SAT (Resende et al., 1997)
  - Set cover (Hochbaum, 1982)

We show:
• an analysis of average-case parsing complexity
• a simple inference algorithm based on this analysis
Our Approach
[Figure: the same landscape, with our approach placed alongside reranking and dual decomposition: approximate inference with an expressive scoring function.]
Core Idea
• Climb to the optimal tree in a few small greedy steps

Randomized Hill-climbing
For k = 1 to K:
  1) Randomly sample a dependency tree
  2) Greedily improve the tree one edge at a time
  3) Repeat (2) until convergence
Select the tree with the highest score. That's it!
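The whole procedure fits in a few dozen lines. Below is a minimal Python sketch, assuming an arc-factored (first-order) scorer `score_arc(head, child)` for illustration; the sampler and function names are stand-ins rather than the paper's implementation, and the actual parser evaluates moves under arbitrary high-order features.

```python
import random

def creates_cycle(heads, child, new_head):
    """True if attaching child -> new_head would close a cycle,
    i.e. child is an ancestor of new_head. Index 0 is ROOT."""
    node = new_head
    while node != 0 and node is not None:
        if node == child:
            return True
        node = heads[node]
    return False

def sample_tree(n):
    """Draw a random valid tree by picking random heads and redrawing
    any choice that would close a cycle (an illustrative sampler)."""
    heads = [None] * (n + 1)
    for child in range(1, n + 1):
        h = random.randint(0, n)
        while h == child or creates_cycle(heads, child, h):
            h = random.randint(0, n)
        heads[child] = h
    return heads

def hill_climb(heads, score_arc):
    """Greedily reattach one word at a time until no single head
    change improves the score, i.e. we reach a local optimum."""
    n = len(heads) - 1
    improved = True
    while improved:
        improved = False
        for child in range(1, n + 1):
            best_h = heads[child]
            for h in range(n + 1):
                if h == child or h == heads[child]:
                    continue
                if creates_cycle(heads, child, h):
                    continue
                if score_arc(h, child) > score_arc(best_h, child):
                    best_h = h
            if best_h != heads[child]:
                heads[child] = best_h
                improved = True
    return heads

def randomized_greedy(n, score_arc, K=300):
    """K restarts of sample-then-climb; return the best tree found."""
    def total(y):
        return sum(score_arc(y[c], c) for c in range(1, n + 1))
    return max((hill_climb(sample_tree(n), score_arc) for _ in range(K)),
               key=total)
```

With a first-order scorer each climb lands on one of the locally optimal trees analyzed later in the talk; with richer features the same loop applies, only each candidate move is scored on the whole tree.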
It Works!
Parsing performance on the CoNLL dataset:
• Our Full: 89.44%
• Turbo (Dual Decomposition): 88.73%
Example
"I ate an apple today"
[Figure: an animation transforming a random initial tree for the sentence into the target tree, reattaching one word's head per greedy step until the two trees match.]
Reachability: any tree can be transformed into any other tree
• maintaining a valid tree structure at every point
• using as few as d steps (d: the number of head differences, i.e. the Hamming distance between the two trees)
[Figure: a chain of trees from y(0) to y(T), one head reattachment per step.]
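The bound can be made concrete: at every step some word whose head disagrees with the target can be reattached directly to its target head without breaking treeness, so the Hamming distance drops by one per move. A small sketch of this argument, with helper names of my own:

```python
def is_ancestor(heads, a, b):
    """True if a is an ancestor of b in the tree heads (0 = ROOT)."""
    node = heads[b]
    while node != 0:
        if node == a:
            return True
        node = heads[node]
    return False

def transform(init, target):
    """Turn init into target using exactly d single-edge moves,
    every intermediate structure remaining a valid tree."""
    heads = list(init)
    diff = {i for i in range(1, len(heads)) if heads[i] != target[i]}
    steps = []
    while diff:
        # Reattaching w to target[w] is safe unless w is currently an
        # ancestor of target[w]; a safe word always exists, which is
        # exactly the reachability property.
        w = next(w for w in diff if not is_ancestor(heads, w, target[w]))
        heads[w] = target[w]
        diff.remove(w)
        steps.append((w, target[w]))
    return steps
```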
Why Greedy Has a Chance to Work
[Figure: the same chain from y(0) to y(T), with the score S(x, y(t)) increasing at every step.]

Greedy Hill-climbing
• Each move increases the score S(x, y(t))
• Arbitrary features can be used in the scoring function
Challenge: Local Optimum
[Figure: the score S plotted over trees y; greedy ascent from y(0) can get stuck at a local optimum below the global optimum.]
Hill-climbing with Restarts
Overcome local optima via restarts:
• draw several random initial trees y(0) (e.g. from a uniform distribution)
• hill-climb from each y(0) to a local optimum y(T)
• return the max-scoring tree among them
Learning Algorithm
• Follow the common max-margin framework:
    ∀y ∈ T(x): S(x, ŷ) ≥ S(x, y) + |ŷ − y| − ξ
  where ŷ is the gold tree and |ŷ − y| counts head differences
• Adopt the passive-aggressive online learning framework (Crammer et al., 2006)
• Decode with our randomized greedy algorithm
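Under a linear model S(x, y) = w · φ(x, y), the passive-aggressive step has a closed form. A sketch of a PA-I style update for the constraint above, with placeholder feature vectors and the Hamming cost; the paper's exact variant may differ in detail:

```python
import numpy as np

def pa_update(w, phi_gold, phi_pred, hamming, C=1.0):
    """One PA-I step (Crammer et al., 2006): enforce
    S(x, y_gold) >= S(x, y_pred) + |y_gold - y_pred| - xi
    with the smallest possible change to w."""
    delta = phi_gold - phi_pred            # phi(x, gold) - phi(x, pred)
    loss = hamming - np.dot(w, delta)      # hinge loss of the constraint
    sq_norm = np.dot(delta, delta)
    if loss <= 0 or sq_norm == 0:
        return w                           # constraint already satisfied
    tau = min(C, loss / sq_norm)           # step size, capped by C
    return w + tau * delta
```

Here `phi_pred` would come from the most violating tree, which the talk decodes with the randomized greedy algorithm itself.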
Analysis
We analyze the search space both theoretically and empirically:
• First-order models: theoretical and empirical analysis
• High-order models: empirical analysis (the theoretical side remains open)
Search Space Complexity: First-order
A 10-word sentence has roughly 2 billion possible trees (11^9 ≈ 2.4 billion spanning trees over the words plus ROOT), yet fewer than 512 local optima.

Theorem: For any first-order scoring function:
• there are at most 2^(n−1) locally optimal trees
• this upper bound is tight

2^(n−1) is still a lot, but it is the worst case. What about the average case?
Algorithm for Counting Local Optima
The method is based on the Chu-Liu-Edmonds algorithm:
• select the best head for each word independently
• if the chosen heads contain a cycle, contract it and count recursively
  Ø any local optimum reassigns exactly one edge in the cycle, so the count is a sum of one subproblem per cycle edge
[Figure: a worked example on "root John saw Mary" with arc weights; the cycle is broken in each possible way and the two subproblem counts are added.]
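The first phase of the count mirrors the first phase of Chu-Liu-Edmonds: choose each word's best head independently and check for a cycle. A sketch of that phase only; the contraction and the recursive sum over cycle edges are omitted:

```python
def best_heads(score_arc, n):
    """Pick each word's highest-scoring head independently (0 = ROOT)."""
    return [None] + [max((h for h in range(n + 1) if h != child),
                         key=lambda h: score_arc(h, child))
                     for child in range(1, n + 1)]

def find_cycle(heads):
    """Return one cycle among the chosen heads, or None if they already
    form a tree (in which case, ignoring ties, that tree is the only
    local optimum of this subproblem). A cycle triggers contraction and
    a recursive count, one term per cycle edge, summed as on the slide."""
    n = len(heads) - 1
    color = [0] * (n + 1)          # 0 = unvisited, 1 = on path, 2 = done
    for start in range(1, n + 1):
        path, node = [], start
        while node != 0 and color[node] == 0:
            color[node] = 1
            path.append(node)
            node = heads[node]
        if node != 0 and color[node] == 1:
            return path[path.index(node):]   # the nodes forming the cycle
        for v in path:
            color[v] = 2
    return None
```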
Empirical Results: First-order
How many local optima are there in real data? Counting optima on the English dataset:
• 50% of sentences have at most 21 local optima
• 70% of sentences have at most 121
• 90% of sentences have at most 2000
Does hill-climbing find the argmax? Finding the global optimum on English:
• sentence length ≤ 15: 100%
• sentence length > 15: 99.3%
An easy search space leads to successful decoding.
Empirical Results: High-order
Does hill-climbing find the argmax? Comparison on English against dual decomposition (Koo et al., 2010), where S_DD and S_HC are the scores of the DD and hill-climbing solutions:
• DD returns a certificate of optimality on 94.5% of sentences; on those, S_DD = S_HC 99.8% of the time
• Overall: S_DD = S_HC on 98.7%, S_DD < S_HC on 1.0%, and S_DD > S_HC on 0.3% of sentences
Experimental Setup
Implementation:
§ adaptive restarting strategy with K = 300
Datasets:
§ 14 languages from the CoNLL 2006 & 2008 shared tasks
Features:
§ up to 3rd-order (three-arc) features, as used in the MST/Turbo parsers
§ global features, as used in reranking
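The slides say only "adaptive restarting strategy with K = 300", so the stopping rule sketched below is an assumption for illustration: cap restarts at K, but quit early once further restarts stop improving the best tree.

```python
def adaptive_restarts(run_restart, K=300, patience=50):
    """Hypothetical adaptive restarting: up to K sample-and-climb
    restarts, stopping early after `patience` restarts in a row with
    no improvement. The paper's exact criterion may differ."""
    best_tree, best_score, stale = None, float("-inf"), 0
    for _ in range(K):
        tree, score = run_restart()       # one random sample + hill-climb
        if score > best_score:
            best_tree, best_score, stale = tree, score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_tree
```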
Baselines and Evaluation Measure
Baselines:
§ Turbo Parser: dual decomposition with 3rd-order features (Martins et al., 2013)
§ Sampling-based parser: MCMC sampling with global features (Zhang et al., 2014)
Evaluation measure:
§ Unlabeled Attachment Score (UAS), excluding punctuation
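The metric is simple to state in code; a small sketch, where the punctuation mask `is_punct` is the only assumed preprocessing:

```python
def uas(pred_heads, gold_heads, is_punct):
    """Unlabeled Attachment Score: fraction of non-punctuation words
    whose predicted head equals the gold head (index 0 is ROOT)."""
    correct = total = 0
    for i in range(1, len(gold_heads)):
        if is_punct[i]:
            continue                      # scores exclude punctuation
        total += 1
        correct += pred_heads[i] == gold_heads[i]
    return correct / total
```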
Comparing with Baselines
Parsing performance (UAS):
• Turbo (DD): 88.73%
• Our 3rd-order: 88.66%
• Sampling-based (MCMC): 89.23%
• Our Full w/o Tensor: 89.24%
With 3rd-order features alone we match the Turbo parser; adding global features, we match the sampling-based parser as well.
Impact of Initialization
UAS (%) on English:
• Uniform: 88.0
• Rnd-1st: 88.1
The choice of initialization has little impact.
Impact of Restarts
UAS (%) on English:
• No restarts: 85.4
• 300 restarts: 88.1
Restarts matter: they are worth 2.7 UAS points.
Convergence Property
• Score normalized by the highest score found in 3000 restarts
[Figure: convergence analysis on English; normalized score vs. number of restarts (0 to 500), with curves for sentence lengths ≤ 15 and > 15, both above 0.994 and flattening toward 1.]
Trade-off between Speed and Performance
[Figure: decoding speed on English; UAS (88 to 94) against seconds per token (2 to 10 × 10^-3), from fast to slow, for the 3rd-order model and the full model.]
Conclusion
• Analysis: we investigate the average-case complexity of parsing
• Algorithm: we introduce a simple randomized greedy inference algorithm

Source code available at: https://github.com/taolei87/RBGParser

Thank You!