K-best, Locally-pruned, Transition-based Dependency Parsing using Robust Risk Minimization
DESCRIPTION
We combine transition-based dependency parsing with a high-performing but relatively underexplored machine learning technique, Robust Risk Minimization. During decoding, we judiciously prune the next parsing states using k-best ranking. Moreover, we apply a simple post-processing step to ensure robustness. We evaluate our approach on the CoNLL'09 shared task English data and improve transition-based dependency parsing accuracy, reaching a labeled attachment score of 89.28%. We also observed near-quadratic average running time in practice for the algorithm.
TRANSCRIPT
K-best, Locally-pruned, Transition-based Dependency Parsing using Robust Risk Minimization
Jinho D. Choi, University of Colorado at Boulder
J. D. Power and Associates
September 9, 2009
Dependency Structure
• What is dependency?
- Syntactic or semantic relation between word-tokens
• Syntactic: NMOD (a beautiful woman)
• Semantic: LOC (places in this city), TMP (events in this year)
• Phrase structure vs. dependency structure
- Constituents vs. dependencies
[Figure: phrase-structure tree for "she bought a car" (S → NP VP, NP → Pro, VP → V NP, NP → Det N) vs. its dependency graph with relations SBJ (bought → she), OBJ (bought → car), and DET (car → a)]
Dependency Graph
• For a sentence s = w1 ... wn, a dependency graph Gs = (Vs, Es)
- Vs = {w0 = root, w1, ..., wn}
- Es = {(wi, r, wj) : wi → wj, wi ∈ Vs, wj ∈ Vs − {w0}, r ∈ Rs}
- Rs = the set of all dependency relations in s
• A well-formed dependency graph → a dependency tree
- Unique root, single head, connected, acyclic
- Projective vs. non-projective
[Figure: projective example "She bought a car" vs. non-projective example "She bought a car yesterday that was blue"; projective parsing is O(n) vs. O(n2) for non-projective]
Dependency Parsing Models
• Transition-based parsing model
- Transition: an operation that searches for a dependency relation between a pair of words (e.g., Left-Arc, Shift)
- Greedy search that finds local optima (locally optimized transitions) → does better on short-distance dependencies
- Nivre's algorithm (projective, O(n)), Covington's algorithm (non-projective, O(n2))
• Graph-based parsing model
- Builds a complete graph with directed, weighted edges and finds the tree with the highest score (sum of all edge weights)
- Exhaustive search that finds the global optimum (maximum spanning tree) → does better on long-distance dependencies
- Eisner's algorithm (projective, O(n2)), Edmonds' algorithm (non-projective, O(n3))
Nivre's List-based Algorithm
• Transition-based, non-projective dependency parsing algorithm
• λ1, λ2 = lists of partially processed tokens; β = a list of remaining unprocessed tokens
• Initialization: (λ1, λ2, β, A) = ([0], [ ], [1, 2, ..., n], { })
• Termination: (λ1, λ2, β, A) = ([...], [...], [ ], {...})
• Deterministic shift vs. non-deterministic shift
Nivre's List-based Algorithm: worked example
• Sentence: "She bought a car" (w0 = root)
• Initialize: (λ1, λ2, β, A) = ([root], [ ], [she, bought, a, car], { })
• Shift (she): λ1 = [root, she]
• Left-Arc: A += {she ← bought}; she moves to λ2
• Right-Arc: A += {root → bought}; root moves to λ2
• Shift (root, she, bought): λ1 = [root, she, bought], β = [a, car]
• Shift (a): λ1 = [root, she, bought, a]
• Left-Arc: A += {a ← car}; a moves to λ2
• Right-Arc: A += {bought → car}; bought moves to λ2
• Shift (bought, a, car): λ1 = [root, she, bought, a, car]
• Terminate: β = [ ], A = {she ← bought, root → bought, a ← car, bought → car}
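The transition sequence above can be replayed with a minimal sketch of the list-based configuration (λ1, λ2, β, A). This is an illustrative reconstruction, not the author's implementation: it replays a given gold transition sequence rather than predicting transitions with a classifier, and arcs are stored as unlabeled (head, dependent) pairs with token 0 as the artificial root.

```python
def parse(n, transitions):
    """Replay a transition sequence over the configuration
    (lambda1, lambda2, beta, arcs) of Nivre's list-based algorithm."""
    l1, l2, beta, arcs = [0], [], list(range(1, n + 1)), set()
    for op in transitions:
        if op == "SHIFT":
            # Move lambda1, lambda2, and the next input token into lambda1.
            l1 = l1 + l2 + [beta.pop(0)]
            l2 = []
        elif op == "LEFT-ARC":
            # Head = next input token, dependent = last token of lambda1,
            # which then moves to the front of lambda2.
            i = l1.pop()
            arcs.add((beta[0], i))       # (head, dependent)
            l2.insert(0, i)
        elif op == "RIGHT-ARC":
            # Head = last token of lambda1, dependent = next input token.
            i = l1.pop()
            arcs.add((i, beta[0]))
            l2.insert(0, i)
        elif op == "NO-ARC":
            l2.insert(0, l1.pop())
    return arcs

# she=1, bought=2, a=3, car=4 — the sequence from the worked example.
seq = ["SHIFT", "LEFT-ARC", "RIGHT-ARC", "SHIFT",
       "SHIFT", "LEFT-ARC", "RIGHT-ARC", "SHIFT"]
print(sorted(parse(4, seq)))
```

Running this yields the four arcs of the example: root → bought, bought → she, bought → car, and car → a.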
Robust Risk Minimization
• Linear binary classification algorithm
- Searches for a hyperplane h(x) = wᵀ·x − θ that separates the two classes −1 and 1, where class(xi) = (h(xi) < 0) ? −1 : 1.
- Finds ŵ and θ̂ that solve the RRM optimization problem.
• Advantages
- Learns to discount irrelevant features faster (than Perceptron).
- Deals with non-linearly separable data more flexibly.
K-best, Locally-pruned Parsing
• RRM is a binary classification algorithm.
- One-against-all method using multiple classifiers (one per transition).
- What if more than one classifier predicts a transition?
• Pick the transition with the highest score.
• What if the highest-scoring transition is not correct?
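The one-against-all setup can be sketched as follows, assuming (as an illustration, not the paper's actual code) that RRM training has produced one sparse weight vector per transition label and that features are sparse dicts; the feature names in the toy usage are made up:

```python
def dot(w, x):
    """Linear score w.x over sparse feature dicts."""
    return sum(w.get(f, 0.0) * v for f, v in x.items())

def k_best_transitions(weights, x, k=2):
    """weights: {transition label: sparse weight vector}.
    Rank all transitions by classifier score and keep the top k."""
    ranked = sorted(weights, key=lambda t: dot(weights[t], x), reverse=True)
    return ranked[:k]

# Toy usage: three one-against-all classifiers, one feature vector.
weights = {"SHIFT": {"f=car": 1.0}, "LEFT-ARC": {"f=car": 2.0},
           "RIGHT-ARC": {"p=DT": 1.0}}
x = {"f=car": 1.0}
```

With k = 1 this reduces to "pick the transition with the highest score"; k > 1 keeps alternatives alive for the k-best search described next.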
K-best, Locally-pruned Parsing
• Predicting a wrong transition at any state can generate a completely different tree (from the gold-standard tree).
• It is better to use the k-best transitions instead of the 1-best.
- Derive several trees and pick the one with the highest score.
- score(tree) = Σ_{t ∈ transitions used to derive the tree} score(t)
- Problem with the above equation (addressed yesterday):
• A tree derived by a longer sequence of transitions wins.
• Normalize the score by the total number of transitions:
• score(tree) = (1/|T|) · Σ_{t ∈ T} score(t)
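The effect of the normalization can be shown with a tiny sketch (the transition scores here are invented for illustration):

```python
def tree_score(transition_scores):
    """score(tree) = (1/|T|) * sum of transition scores."""
    return sum(transition_scores) / len(transition_scores)

short = [0.9, 0.9]                  # tree derived by 2 confident transitions
long_ = [0.9, 0.9, 0.2, 0.2, 0.2]   # tree derived by 5 transitions
# The raw sum favors the longer derivation (2.4 > 1.8), but the
# normalized average correctly prefers the shorter, more confident one.
```

This is exactly the bias the slide describes: without dividing by |T|, a tree built from more transitions accumulates a larger sum regardless of per-transition confidence.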
Post-processing
• The output from the transition-based parser is not guaranteed to be a tree but rather a forest.
- Some tokens may not have found their heads.
- For each such token, compare it against all other tokens and pick the one that gives the highest score as the head.
- For each such wj:
• Compare it against all wi with i < j and see which wi gives the highest-scoring Right-Arc transition.
• Compare it against all wk with k > j and see which wk gives the highest-scoring Left-Arc transition.
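A hedged sketch of this head-recovery step; `arc_score` is a hypothetical stand-in for the classifier's Right-Arc score (candidate head to the left) or Left-Arc score (candidate head to the right), and the token/head representation is invented for illustration:

```python
def attach_headless(tokens, heads, arc_score):
    """tokens: token ids including 0 = root; heads: {id: head id or None}.
    arc_score(i, j): score for attaching token j under candidate head i."""
    for j in tokens:
        if j != 0 and heads.get(j) is None:
            # Consider every other token as a candidate head and keep the best.
            heads[j] = max((i for i in tokens if i != j),
                           key=lambda i: arc_score(i, j))
    return heads
```

After this pass every non-root token has a head, so the forest the parser may produce becomes a connected structure.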
Feature Space
• About 14 million features
• f: form, m: lemma, p: POS tag, d: dependency label
• lm(w): leftmost dependent, ln(w): left-nearest dependent, rm(w): rightmost dependent, rn(w): right-nearest dependent
Evaluation
• Models
I. Greedy search using the highest-scoring transition
II. Best search using all predicted transitions
III. II + using the upper bound of 1
IV. III + using the lower bound of −0.1
V. III + using the lower bound of −0.2
VI. V + using the top-2 scoring transitions
VII. VI + post-processing
Evaluation
• Parsing accuracies (%)
Model:  I      II     III    IV     V      VI     VII
LAS:    87.88  87.96  88.08  88.62  88.87  88.87  89.28
UAS:    89.21  89.34  89.42  90.12  90.47  90.47  90.97
Evaluation
• Average number of transitions
[Chart: average number of transitions (0 to 1,500) by sentence length (1-10, 11-20, 21-30, 31-40, 41-50, > 50 tokens) for models I, II-III, IV, V, and VI-VII]
Summary and Conclusions
• Summary
- Transition-based, non-projective dependency parsing
- K-best, locally pruned dependency parsing
- Post-processing
- Robust Risk Minimization
• Conclusions
- It is possible to achieve higher parsing accuracy by considering k-best, locally pruned trees, while keeping near-quadratic running time in practice.
Future Work
• Parsing algorithm
- Search transitions for both the left and right sides of β[0].
- Beam search.
- Normalize scores and use priors for transitions.
• Features
- Cut off features occurring fewer times than a threshold.
- Predicate-argument structure from frameset files.
• Machine learning algorithm
- Try different values for the learning parameters.
- Compare with Perceptron and Support Vector Machines.