people.csail.mit.edu/yuanzh/papers/greed.pdf
TRANSCRIPT
Greed is Good if Randomized: New Inference for Dependency Parsing
Yuan Zhang, CSAIL, MIT
Joint work with Tao Lei, Regina Barzilay, and Tommi Jaakkola
Inference vs. Scoring
[Figure: parsers placed along two axes: exact vs. approximate inference, and limited vs. expressive scoring functions.]
• Minimum Spanning Tree: exact inference, but a limited scoring function
• Reranking: incorporates arbitrary features
• Dual Decomposition: searches the full space
Parsing Complexity
• High-order parsing is NP-hard (McDonald et al., 2006)
• Hypothesis: parsing is easy on average
• Many NP-hard problems are easy on average:
  - MAX-SAT (Resende et al., 1997)
  - Set cover (Hochbaum, 1982)

We show:
• an analysis of average-case parsing complexity
• a simple inference algorithm based on this analysis
Our Approach
[Figure: the same landscape, with our approach placed alongside reranking and dual decomposition: approximate inference with an expressive scoring function.]
Core Idea
• Climb to the optimal tree in a few small greedy steps

Randomized Hill-climbing
For k = 1 to K:
  1) Randomly sample a dependency tree
  2) Greedily improve the tree one edge at a time
  3) Repeat (2) until convergence
Select the tree with the highest score. That's it!
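The whole procedure fits in a few dozen lines. Below is a minimal Python sketch, assuming an arc-factored (first-order) scorer `score_arc(head, child)` for illustration; the sampler and function names are stand-ins rather than the paper's implementation, and the actual parser evaluates moves under arbitrary high-order features.

```python
import random

def creates_cycle(heads, child, new_head):
    """True if attaching child -> new_head would close a cycle,
    i.e. child is an ancestor of new_head. Index 0 is ROOT."""
    node = new_head
    while node != 0 and node is not None:
        if node == child:
            return True
        node = heads[node]
    return False

def sample_tree(n):
    """Draw a random valid tree by picking random heads and redrawing
    any choice that would close a cycle (an illustrative sampler)."""
    heads = [None] * (n + 1)
    for child in range(1, n + 1):
        h = random.randint(0, n)
        while h == child or creates_cycle(heads, child, h):
            h = random.randint(0, n)
        heads[child] = h
    return heads

def hill_climb(heads, score_arc):
    """Greedily reattach one word at a time until no single head
    change improves the score, i.e. we reach a local optimum."""
    n = len(heads) - 1
    improved = True
    while improved:
        improved = False
        for child in range(1, n + 1):
            best_h = heads[child]
            for h in range(n + 1):
                if h == child or h == heads[child]:
                    continue
                if creates_cycle(heads, child, h):
                    continue
                if score_arc(h, child) > score_arc(best_h, child):
                    best_h = h
            if best_h != heads[child]:
                heads[child] = best_h
                improved = True
    return heads

def randomized_greedy(n, score_arc, K=300):
    """K restarts of sample-then-climb; return the best tree found."""
    def total(y):
        return sum(score_arc(y[c], c) for c in range(1, n + 1))
    return max((hill_climb(sample_tree(n), score_arc) for _ in range(K)),
               key=total)
```

With a first-order scorer each climb lands on one of the locally optimal trees analyzed later in the talk; with richer features the same loop applies, only each candidate move is scored on the whole tree.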
It Works!
Parsing performance on the CoNLL dataset:
• Our Full: 89.44%
• Turbo (Dual Decomposition): 88.73%
Example
"I ate an apple today"
[Figure: an animation transforming a random initial tree for the sentence into the target tree, reattaching one word's head per greedy step until the two trees match.]
Reachability: any tree can be transformed into any other tree
• maintaining a valid tree structure at every point
• using as few as d steps (d: the number of head differences, i.e. the Hamming distance between the two trees)
[Figure: a chain of trees from y(0) to y(T), one head reattachment per step.]
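The bound can be made concrete: at every step some word whose head disagrees with the target can be reattached directly to its target head without breaking treeness, so the Hamming distance drops by one per move. A small sketch of this argument, with helper names of my own:

```python
def is_ancestor(heads, a, b):
    """True if a is an ancestor of b in the tree heads (0 = ROOT)."""
    node = heads[b]
    while node != 0:
        if node == a:
            return True
        node = heads[node]
    return False

def transform(init, target):
    """Turn init into target using exactly d single-edge moves,
    every intermediate structure remaining a valid tree."""
    heads = list(init)
    diff = {i for i in range(1, len(heads)) if heads[i] != target[i]}
    steps = []
    while diff:
        # Reattaching w to target[w] is safe unless w is currently an
        # ancestor of target[w]; a safe word always exists, which is
        # exactly the reachability property.
        w = next(w for w in diff if not is_ancestor(heads, w, target[w]))
        heads[w] = target[w]
        diff.remove(w)
        steps.append((w, target[w]))
    return steps
```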
Why Greedy Has a Chance to Work
[Figure: the same chain from y(0) to y(T), with the score S(x, y(t)) increasing at every step.]

Greedy Hill-climbing
• Each move increases the score S(x, y(t))
• Arbitrary features can be used in the scoring function
Challenge: Local Optimum
[Figure: the score S plotted over trees y; greedy ascent from y(0) can get stuck at a local optimum below the global optimum.]
Hill-climbing with Restarts
Overcome local optima via restarts:
• draw several random initial trees y(0) (e.g. from a uniform distribution)
• hill-climb from each y(0) to a local optimum y(T)
• return the max-scoring tree among them
Learning Algorithm
• Follow the common max-margin framework:
    ∀y ∈ T(x): S(x, ŷ) ≥ S(x, y) + |ŷ − y| − ξ
  where ŷ is the gold tree and |ŷ − y| counts head differences
• Adopt the passive-aggressive online learning framework (Crammer et al., 2006)
• Decode with our randomized greedy algorithm
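Under a linear model S(x, y) = w · φ(x, y), the passive-aggressive step has a closed form. A sketch of a PA-I style update for the constraint above, with placeholder feature vectors and the Hamming cost; the paper's exact variant may differ in detail:

```python
import numpy as np

def pa_update(w, phi_gold, phi_pred, hamming, C=1.0):
    """One PA-I step (Crammer et al., 2006): enforce
    S(x, y_gold) >= S(x, y_pred) + |y_gold - y_pred| - xi
    with the smallest possible change to w."""
    delta = phi_gold - phi_pred            # phi(x, gold) - phi(x, pred)
    loss = hamming - np.dot(w, delta)      # hinge loss of the constraint
    sq_norm = np.dot(delta, delta)
    if loss <= 0 or sq_norm == 0:
        return w                           # constraint already satisfied
    tau = min(C, loss / sq_norm)           # step size, capped by C
    return w + tau * delta
```

Here `phi_pred` would come from the most violating tree, which the talk decodes with the randomized greedy algorithm itself.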
Analysis
We analyze the search space both theoretically and empirically:
• First-order models: theoretical and empirical analysis
• High-order models: empirical analysis (the theoretical side remains open)
Search Space Complexity: First-order
A 10-word sentence has roughly 2 billion possible trees (11^9 ≈ 2.4 billion spanning trees over the words plus ROOT), yet fewer than 512 local optima.

Theorem: For any first-order scoring function:
• there are at most 2^(n−1) locally optimal trees
• this upper bound is tight

2^(n−1) is still a lot, but it is the worst case. What about the average case?
Algorithm for Counting Local Optima
The method is based on the Chu-Liu-Edmonds algorithm:
• select the best head for each word independently
• if the chosen heads contain a cycle, contract it and count recursively
  Ø any local optimum reassigns exactly one edge in the cycle, so the count is a sum of one subproblem per cycle edge
[Figure: a worked example on "root John saw Mary" with arc weights; the cycle is broken in each possible way and the two subproblem counts are added.]
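The first phase of the count mirrors the first phase of Chu-Liu-Edmonds: choose each word's best head independently and check for a cycle. A sketch of that phase only; the contraction and the recursive sum over cycle edges are omitted:

```python
def best_heads(score_arc, n):
    """Pick each word's highest-scoring head independently (0 = ROOT)."""
    return [None] + [max((h for h in range(n + 1) if h != child),
                         key=lambda h: score_arc(h, child))
                     for child in range(1, n + 1)]

def find_cycle(heads):
    """Return one cycle among the chosen heads, or None if they already
    form a tree (in which case, ignoring ties, that tree is the only
    local optimum of this subproblem). A cycle triggers contraction and
    a recursive count, one term per cycle edge, summed as on the slide."""
    n = len(heads) - 1
    color = [0] * (n + 1)          # 0 = unvisited, 1 = on path, 2 = done
    for start in range(1, n + 1):
        path, node = [], start
        while node != 0 and color[node] == 0:
            color[node] = 1
            path.append(node)
            node = heads[node]
        if node != 0 and color[node] == 1:
            return path[path.index(node):]   # the nodes forming the cycle
        for v in path:
            color[v] = 2
    return None
```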
Empirical Results: First-order
How many local optima are there in real data? Counting optima on the English dataset:
• 50% of sentences have at most 21 local optima
• 70% of sentences have at most 121
• 90% of sentences have at most 2000
Does hill-climbing find the argmax? Finding the global optimum on English:
• sentence length ≤ 15: 100%
• sentence length > 15: 99.3%
An easy search space leads to successful decoding.
Empirical Results: High-order
Does hill-climbing find the argmax? Comparison on English against dual decomposition (Koo et al., 2010), where S_DD and S_HC are the scores of the DD and hill-climbing solutions:
• DD returns a certificate of optimality on 94.5% of sentences; on those, S_DD = S_HC 99.8% of the time
• Overall: S_DD = S_HC on 98.7%, S_DD < S_HC on 1.0%, and S_DD > S_HC on 0.3% of sentences
Experimental Setup
Implementation:
§ adaptive restarting strategy with K = 300
Datasets:
§ 14 languages from the CoNLL 2006 & 2008 shared tasks
Features:
§ up to 3rd-order (three-arc) features, as used in the MST/Turbo parsers
§ global features, as used in reranking
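The slides say only "adaptive restarting strategy with K = 300", so the stopping rule sketched below is an assumption for illustration: cap restarts at K, but quit early once further restarts stop improving the best tree.

```python
def adaptive_restarts(run_restart, K=300, patience=50):
    """Hypothetical adaptive restarting: up to K sample-and-climb
    restarts, stopping early after `patience` restarts in a row with
    no improvement. The paper's exact criterion may differ."""
    best_tree, best_score, stale = None, float("-inf"), 0
    for _ in range(K):
        tree, score = run_restart()       # one random sample + hill-climb
        if score > best_score:
            best_tree, best_score, stale = tree, score, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_tree
```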
Baselines and Evaluation Measure
Baselines:
§ Turbo Parser: dual decomposition with 3rd-order features (Martins et al., 2013)
§ Sampling-based parser: MCMC sampling with global features (Zhang et al., 2014)
Evaluation measure:
§ Unlabeled Attachment Score (UAS), excluding punctuation
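The metric is simple to state in code; a small sketch, where the punctuation mask `is_punct` is the only assumed preprocessing:

```python
def uas(pred_heads, gold_heads, is_punct):
    """Unlabeled Attachment Score: fraction of non-punctuation words
    whose predicted head equals the gold head (index 0 is ROOT)."""
    correct = total = 0
    for i in range(1, len(gold_heads)):
        if is_punct[i]:
            continue                      # scores exclude punctuation
        total += 1
        correct += pred_heads[i] == gold_heads[i]
    return correct / total
```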
Comparing with Baselines
Parsing performance (UAS):
• Turbo (DD): 88.73%
• Our 3rd-order: 88.66%
• Sampling-based (MCMC): 89.23%
• Our Full w/o Tensor: 89.24%
With 3rd-order features alone we match the Turbo parser; adding global features, we match the sampling-based parser as well.
Impact of Initialization
UAS (%) on English:
• Uniform: 88.0
• Rnd-1st: 88.1
The choice of initialization has little impact.
Impact of Restarts
UAS (%) on English:
• No restarts: 85.4
• 300 restarts: 88.1
Restarts matter: they are worth 2.7 UAS points.
Convergence Property
• Score normalized by the highest score found in 3000 restarts
[Figure: convergence analysis on English; normalized score vs. number of restarts (0 to 500), with curves for sentence lengths ≤ 15 and > 15, both above 0.994 and flattening toward 1.]
Trade-off between Speed and Performance
[Figure: decoding speed on English; UAS (88 to 94) against seconds per token (2 to 10 × 10^-3), from fast to slow, for the 3rd-order model and the full model.]
Conclusion
• Analysis: we investigate the average-case complexity of parsing
• Algorithm: we introduce a simple randomized greedy inference algorithm

Source code available at: https://github.com/taolei87/RBGParser

Thank You!