optimization of nlp components for robustness and scalability

65
Optimization of NLP Components For Robustness and Scalability Dissertation Defense August 15, 2012 Jinho D. Choi [email protected] Department of Computer Science University of Colorado Boulder Friday, August 17, 2012

Upload: jinho-d-choi

Post on 11-May-2015

854 views

Category:

Technology


0 download

TRANSCRIPT

Page 1: Optimization of NLP Components for Robustness and Scalability

Optimization of NLP ComponentsFor Robustness and Scalability

Dissertation DefenseAugust 15, 2012

Jinho D. [email protected]

Department of Computer ScienceUniversity of Colorado Boulder

Friday, August 17, 2012

Page 2: Optimization of NLP Components for Robustness and Scalability

Ever since I came, Boulder has been ...• #1: Top 10 College Towns (Livability, 2012)

• #1: Top 10 Least Obese Metro Areas (Gallup Healthways, 2012)

• #1: Top 10 Happiest Cities (Gallup Healthways, 2012)

• #1: The 10 Most Educated U.S. Cities (US News, 2011)

• #1: America’s 15 Most Active Cities (Time - Healthland, 2011)

• #1: Best Quality of Life in America (Porfolio, 2011)

• #1: 20 Brainiest Cities in America (Daily Beast, 2010)

• #1: Western Cities Fare Best in Well-being (USA Today, 2010)

• #1: America's Foodiest Town (Bon Appétit, 2010)

• #1: The Best Cities to Raise an Outdoor Kid (Backpacker, 2009)

• #1: America's Top 25 Towns To Live Well (Forbes, 2009)

• #1: America's Smartest Cities (Forbes, 2008)

• #1: Top Heart Friendly Cities (American Heart Association, 2008)

2

Friday, August 17, 2012

Page 3: Optimization of NLP Components for Robustness and Scalability

Contents• Introduction

• Dependency conversion

• Experimental setup

• Part-of-speech tagging

• Dependency parsing

• Semantic role labeling

• Conclusion

3

Friday, August 17, 2012

Page 4: Optimization of NLP Components for Robustness and Scalability

Introduction• The application of NLP has ...

- Expanded to everyday computing.

- Broadened to a general audience.

‣ More attention is drawn to the practical aspects of NLP.

• NLP components should be tested for

- Robustness in handling heterogeneous data.

• Need to be evaluated on data from several different sources.

- Scalability in handling a large amount of data.

• Need to be evaluated for speed and complexity.

4

Friday, August 17, 2012

Page 5: Optimization of NLP Components for Robustness and Scalability

Introduction• Research question

- How to improve the robustness and scalability of standard NLP components.

• Goals

- To prepare gold-standard data from several different sources for in-genre and out-of-genre experiments.

- To develop a POS tagger, a dependency parser, and a semantic role labeler showing robust results across this data.

- To reduce average complexities of these components while retaining good performance in accuracy.

5

Friday, August 17, 2012

Page 6: Optimization of NLP Components for Robustness and Scalability

Introduction• Thesis statement

1. We improve the robustness of three NLP components:

• POS tagger: by building a generalized model.

• Dependency parser: by bootstrapping parse information.

• Semantic role labeler: by applying higher-order argument pruning.

2. We improve the scalability of these three components:

• POS tagger: by adapting dynamic model selection.

• Dependency parser: by optimizing the engineering of transition-based parsing algorithms.

• Semantic role labeler: by applying conditional higher-order argument pruning.

6

Friday, August 17, 2012

Page 7: Optimization of NLP Components for Robustness and Scalability

Introduction

7

DependencyConversion

Part-of-speechTrainer

Part-of-speechTagging Model

Part-of-speechTagger

Training Set:Dependency Trees+ Semantic Roles

Evaluation Set:Dependency Trees+ Semantic Roles

ConstituentTreebanks PropBanks

DependencyTrainer

DependencyParsing Model

DependencyParser

Semantic RoleTrainer

Semantic RoleLabeling Model

Semantic RoleLabeler

Start

Stop

Friday, August 17, 2012

Page 8: Optimization of NLP Components for Robustness and Scalability

Contents• Introduction

• Dependency conversion

• Experimental setup

• Part-of-speech tagging

• Dependency parsing

• Semantic role labeling

• Conclusion

8

Friday, August 17, 2012

Page 9: Optimization of NLP Components for Robustness and Scalability

Dependency Conversion• Motivation

- A small amount of manually annotated dependency trees(Rambow et al., 2002; Cmejrek et al., 2004).

- A large amount of manually annotated constituent trees(Marcus et al., 1993; Weischedel et al., 2011).

- Converting constituent trees into dependency trees→ A large amount of pseudo annotated dependency trees.

• Previous approaches

- Penn2Malt (stp.lingfil.uu.se/~nivre/research/Penn2Malt.html).

- LTH converter (Johansson and Nugues, 2007).

- Stanford converter (de Marneffe and Manning, 2008a).

9

Friday, August 17, 2012

Page 10: Optimization of NLP Components for Robustness and Scalability

Dependency Conversion• Comparison

- The Stanford and CLEAR dependency approaches generate 3.62% and 0.23% of unclassified dependencies, respectively.

- Our conversion produces 3.69% of non-projective trees.

10

Penn2Malt LTH Stanford CLEAR

Labels Malt CoNLL Stanford Stanford+

Long-distance DPs ✓ ✓ ✓ ✓ ✓ ✓Secondary DPs ✓ ✓ ✓ ✓ ✓ ✓ ✓Function Tags ✓ ✓ ✓ ✓New TB Format NO NO NO YES

Maintenance NO NO YES YES

Friday, August 17, 2012

Page 11: Optimization of NLP Components for Robustness and Scalability

Dependency Conversion (1/6)1. Input a constituent tree.

• Penn, OntoNotes, CRAFT, MiPACQ, and SHARP Treebanks.

11

NN CC NN WDT PRP VB -NONE-

joyandPeace that we *T*-1

NP

want

SBAR

WHNP-1 S

NP VP

NP

NP

Friday, August 17, 2012

Page 12: Optimization of NLP Components for Robustness and Scalability

Dependency Conversion (2/6)2. Reorder constituents related to empty categories.

• *T*: wh-movement and topicalization.

• *RNR*: right node raising.

• *ICH* and *PPA*: discontinuous constituent.

12

NN CC NN WDT PRP VB -NONE-

joyandPeace that we *T*-1

NP

want

SBAR

WHNP-1 S

NP VP

NP

NP

NN CC NN PRP VB WDT

joyandPeace we that

NP

want

SBAR

S

NP VP

WHNP-1

NP

Friday, August 17, 2012

Page 13: Optimization of NLP Components for Robustness and Scalability

Dependency Conversion (3/6)3. Handle special cases.

• Apposition, coordination, and small clauses.

13

NN CC NN PRP VB WDT

joyandPeace we that

NP

want

SBAR

S

NP VP

WHNP-1

NP

joyandPeace that we wantroot

ccconj

The original word order is preserved in the converted dependency tree.

Friday, August 17, 2012

Page 14: Optimization of NLP Components for Robustness and Scalability

Dependency Conversion (4/6)4. Handle general cases.

• Head-finding rules and heuristics.

14

NN CC NN PRP VB WDT

joyandPeace we that

NP

want

SBAR

S

NP VP

WHNP-1

NP

joyandPeace that we wantroot

ccconj

rcmod

nsubjdobj

root

Friday, August 17, 2012

Page 15: Optimization of NLP Components for Robustness and Scalability

Dependency Conversion (5/6)5. Add secondary dependencies.

• Gapping, referent, right node raising, open clausal subject.

15

NN CC NN PRP VB WDT

joyandPeace we that

NP

want

SBAR

S

NP VP

WHNP-1

NP

joyandPeace that we wantroot

ccconj

rcmod

nsubjdobj

root

ref

Friday, August 17, 2012

Page 16: Optimization of NLP Components for Robustness and Scalability

Dependency Conversion (6/6)6. Add function tags.

16

Appendix A

Constituent Treebank Tags

This appendix shows tags used in various constituent Treebanks for English (Marcus et al., 1993;

Nielsen et al., 2010; Weischedel et al., 2011; Verspoor et al., 2012). Tags followed by∗

are not the

typical Penn Treebank tags but used in some other Treebanks.

A.1 Function tags

Syntactic rolesADV Adverbial PUT Locative complement of putCLF It-cleft PRD Non-VP predicate

CLR Closely related constituent RED∗ Reduced auxiliary

DTV Dative SBJ Surface subject

LGS Logical subject in passive TPC Topicalization

NOM Nominalization

Semantic rolesBNF Benefactive MNR Manner

DIR Direction PRP Purpose or reason

EXT Extent TMP Temporal

LOC Locative VOC Vocative

Text and speech categoriesETC Et cetera SEZ Direct speech

FRM∗ Formula TTL Title

HLN Headline UNF Unfinished constituent

IMP Imperative

Table A.1: A list of function tags for English.

Friday, August 17, 2012

Page 17: Optimization of NLP Components for Robustness and Scalability

Contents• Introduction

• Dependency conversion

• Experimental setup

• Part-of-speech tagging

• Dependency parsing

• Semantic role labeling

• Conclusion

17

Friday, August 17, 2012

Page 18: Optimization of NLP Components for Robustness and Scalability

Experimental Setup• The Wall Street Journal (WSJ) models

- Train

• The WSJ 2-21 in OntoNotes (Weischedel et al., 2011).

• Total: 30,060 sentences, 731,677 tokens, 77,826 predicates.

- In-genre evaluation (Avgi)

• The WSJ 23 in OntoNotes.

• Total: 1,640 sentences, 39,590 tokens, 4,138 predicates.

- Out-of-genre evaluation (Avgo)

• 5 genres in OntoNotes, 2 genres in MiPACQ (Nielsen et al., 2010), 1 genre in SHARP.

• Total: 19,368 sentences, 265,337 tokens, 32,142 predicates.

18

Friday, August 17, 2012

Page 19: Optimization of NLP Components for Robustness and Scalability

Experimental Setup• The OntoNotes models

- Train

• 6 genres in OntoNotes.

• Total: 96,406 sentences, 1,983,012 tokens, 213,695 predicates.

- In-genre evaluation (Avgi)

• 6 genres in OntoNotes.

• Total: 13,337 sentences, 201,893 tokens, 25,498 predicates.

- Out-of-genre evaluation (Avgo)

• Same 2 genres in MiPACQ, same 1 genre in SHARP.

• Total: 7,671 sentences, 103,034 tokens, 10,782 predicates.

19

Friday, August 17, 2012

Page 20: Optimization of NLP Components for Robustness and Scalability

Experimental Setup• Accuracy

- Part-of-speech tagging

• Accuracy.

- Dependency parsing

• Labeled attachment score (LAS).

• Unlabeled attachment score (UAS).

- Semantic role labeling

• F1-score of argument identification.

• F1-score of both argument identification and classification.

20

Friday, August 17, 2012

Page 21: Optimization of NLP Components for Robustness and Scalability

Experimental Setup• Speed

- All experiments are run on an Intel Xeon 2.57GHz machine.

- Each model is run 5 times, and an average speed is measured by taking the average of middle 3 speeds.

• Machine learning algorithm

- Liblinear L2-regularization, L1-loss SVM classification(Hsieh et al., 2008).

- Designed to handle large scale, high dimensional vectors.

- Runs fast with accurate performance.

- Our implementation of LibLinear is publicly available.

21

Friday, August 17, 2012

Page 22: Optimization of NLP Components for Robustness and Scalability

Contents• Introduction

• Dependency conversion

• Experimental setup

• Part-of-speech tagging

• Dependency parsing

• Semantic role labeling

• Conclusion

22

Friday, August 17, 2012

Page 23: Optimization of NLP Components for Robustness and Scalability

Part-of-Speech Tagging• Motivation

- Supervised learning approaches do not perform well in out-of-genre experiments.

- Domain adaptation approaches require knowledge of incoming data.

- Complicated tagging or learning approaches often run slowly during decoding.

• Dynamic model selection

- Build two models, generalized and domain-specific, given one set of training data.

- Dynamically select one of the models during decoding.

23

Friday, August 17, 2012

Page 24: Optimization of NLP Components for Robustness and Scalability

Part-of-Speech Tagging• Training

1. Group training data into documents (e.g., sections in WSJ).

2. Get the document frequency of each simplified word form.

• In simplified word forms, all numerical expressions with or w/o special characters are converted to 0.

3. Build a domain-specific model using features extracted from only tokens whose DF(SW) > 1.

4. Build a generalized model using features extracted from only tokens whose DF(SW) > 2.

5. Find the cosine similarity threshold for dynamic model selection.

24

Friday, August 17, 2012

Page 25: Optimization of NLP Components for Robustness and Scalability

Part-of-Speech Tagging• Cosine similarity threshold

- During cross-validation, collect cosine-similarities between simplified word forms used for building the domain-specific model and input sentences that the domain-specific model shows advantage.

- The cosine similarity in the first 5% area becomes the threshold for dynamic model selection.

25

0 0.02 0.04 0.06

190

0

40

80

120

160

Cosine Similarity

Occ

urre

nce

5%

Friday, August 17, 2012

Page 26: Optimization of NLP Components for Robustness and Scalability

Part-of-Speech Tagging• Decoding

- Measure the cosine similarity between simplified word forms used for building the domain-specific model and each input sentence.

- If the similarity is greater than the threshold, use the domain-specific model.

- If the similarity is less than or equal to the threshold, use the generalized model.

26

Runs as fast as a single model approach.

Friday, August 17, 2012

Page 27: Optimization of NLP Components for Robustness and Scalability

Part-of-Speech Tagging• Experiments

- Baseline: using the original word forms.

- Baseline+: using lowercase simplified word forms.

- Domain: domain-specific model.

- General: generalized model.

- ClearNLP: dynamic model selection.

- Stanford: Toutanova et al., 2003.

- SVMTool: Giménez and Màrquez, 2004.

27

Friday, August 17, 2012

Page 28: Optimization of NLP Components for Robustness and Scalability

Part-of-Speech Tagging• Accuracy - WSJ models (Avgi and Avgo)

28

96.5

97.0

97.5

Baseline Baseline+ Domain General ClearNLP Stanford SVMTool

97.3197.4197.40

97.2497.39

96.9896.93

In-domain experiments

87.5

88.5

89.5

90.5

Baseline Baseline+ Domain General ClearNLP Stanford SVMTool

89.4989.92

90.7990.6190.43

88.6488.25

Out-of-domain experiments

Friday, August 17, 2012

Page 29: Optimization of NLP Components for Robustness and Scalability

Part-of-Speech Tagging• Accuracy - OntoNotes models (Avgi and Avgo)

29

96

96.2

96.4

96.6

Baseline Baseline+ Domain General ClearNLP Stanford SVMTool

96.19

96.5296.56

96.41

96.58

96.3296.23

In-domain experiments

86

87

88

89

90

Baseline Baseline+ Domain General ClearNLP Stanford SVMTool

87.61

89.2089.2689.2688.60

87.75

86.79

Out-of-domain experiments

Friday, August 17, 2012

Page 30: Optimization of NLP Components for Robustness and Scalability

Part-of-Speech Tagging• Speed comparison

30

ModelModel Tokens per sec. Millisecs. per sen.

WSJ

ClearNLP 32,654 0.44

WSJClearNLP+ 39,491 0.37

WSJStanford 250 58.06

WSJ

SVMTool 1,058 13.71

OntoNotes

ClearNLP 32,206 0.45

OntoNotesClearNLP+ 39,882 0.36

OntoNotesStanford 136 106.34

OntoNotes

SVMTool 924 15.71

• ClearNLP : as reported in the thesis.• ClearNLP+: new improved results.

Friday, August 17, 2012

Page 31: Optimization of NLP Components for Robustness and Scalability

Contents• Introduction

• Dependency conversion

• Experimental setup

• Part-of-speech tagging

• Dependency parsing

• Semantic role labeling

• Conclusion

31

Friday, August 17, 2012

Page 32: Optimization of NLP Components for Robustness and Scalability

Dependency Parsing• Goals

1. To improve the average parsing complexity for non-projective dependency parsing.

2. To reduce the discrepancy between dynamic features used for training on gold trees and decoding automatic trees.

3. To ensure well-formed dependency graph properties.

• Approach

1. Combine transitions in both projective and non-projective dependency parsing algorithms.

2. Bootstrap dynamic features during training.

3. Post-process.

32

Friday, August 17, 2012

Page 33: Optimization of NLP Components for Robustness and Scalability

Dependency Parsing• Transition decomposition

- Decompose transitions in:

• Nivre’s arc-eager algorithm (projective; Nivre, 2003).

• Nivre’s list-based algorithm (non-projective; Nivre, 2008).

33

61

5.2 Transition-based dependency parsing

5.2.1 Transition decomposition

Table 5.1 shows functional decomposition of transitions used in Nivre’s arc-eager and Covington’s

algorithms. Nivre’s arc-eager algorithm is a projective parsing algorithm that shows a worst-case

parsing complexity of O(n) (Nivre, 2003). Covington’s algorithm is a non-projective parsing al-

gorithm that shows a worst-case parsing complexity of O(n2) without backtracking (Covington,

2001). Covington’s algorithm was later formulated as a transition-based parsing algorithm by Nivre

(2008), called Nivre’s list-based algorithm. Table 5.3 shows the relation between the decomposed

transitions in Table 5.1 and the transitions from the original algorithms.

Operation Transition Description

ArcLeft-∗l ( [λ1|i], λ2, [j|β], A ) ⇒ ( [λ1|i], λ2, [j|β], A ∪ {i l← j} )Right-∗l ( [λ1|i], λ2, [j|β], A ) ⇒ ( [λ1|i], λ2, [j|β], A ∪ {i l→ j} )No-∗ ( [λ1|i], λ2, [j|β], A ) ⇒ ( [λ1|i], λ2, [j|β], A )

List∗-Shiftd|n ( [λ1|i], λ2, [j|β], A ) ⇒ ( [λ1|i|λ2|j], [ ], β, A )∗-Reduce ( [λ1|i], λ2, [j|β], A ) ⇒ ( λ1, λ2, [j|β], A )∗-Pass ( [λ1|i], λ2, [j|β], A ) ⇒ ( λ1, [i|λ2], [j|β], A )

Table 5.1: Decomposed transitions grouped into the Arc and List operations.

Operation Transition Precondition

ArcLeft-∗l [i �= 0] ∧ ¬[∃k. (i ← k) ∈ A] ∧ ¬[(i →∗ j) ∈ A]Right-∗l ¬[∃k. (k → j) ∈ A] ∧ ¬[(i ←∗ j) ∈ A]No-∗ ¬[∃l. Left-∗l ∨ Right-∗l]

List∗-Shiftd|n [λ1 = [ ]]d ∨ ¬[∃k ∈ λ1. (k �= i) ∧ ((k ← j) ∨ (k → j))]n

∗-Reduce [∃h. (h → i) ∈ A] ∧ ¬[∃k ∈ β. (i → k)]∗-Pass ¬[∗-Shiftd|n ∨ ∗-Reduce*]

Table 5.2: Preconditions of the decomposed transitions in Table 5.1.

Table 5.2 shows preconditions of the decomposed transitions in Table 5.1. Some preconditions need

to be satisfied to ensure the properties of a well-formed dependency graph (Section 2.1.2.1). The

parsing states are represented as tuples (λ1, λ2, β, A), where λ1, λ2 are lists of partially processed

tokens and β is a list of the remaining unprocessed tokens. A is a set of labeled arcs representing

This decomposition makes it easier to integrate transitions from different parsing algorithms.

Friday, August 17, 2012

Page 34: Optimization of NLP Components for Robustness and Scalability

Dependency Parsing• Transition recomposition

- Any combination of two decomposed transitions, one from each operation, can be recomposed.

- For each recomposed transition, an ARC operation is performed first and a LIST operation is performed later.

34

63

Nivre’s list-based algorithm using λ2). Section 5.2.2 shows how these decomposed transitions can

be recomposed into transitions used in several different dependency parsing algorithms.

5.2.2 Transition recomposition

Any combination of two decomposed transitions in Table 5.1, one from each operation, can be

recomposed into a new transition. For instance, the combination of Left-∗l and ∗-Reduce makes a

transition, Left-Reducel, which performs Left-∗l and ∗-Reduce sequentially; the Arc operation

is always performed before the List operation. Table 5.3 shows how these decomposed transitions

are recomposed into transitions used in different dependency parsing algorithms.

Transition Nivre’03 Covington’01 Nivre’08 C&P’11 This workLeft-Reducel � � �Left-Passl � � � �Right-Shiftn

l � �Right-Passl � � � �No-Shiftd � � � � �No-Shiftn � � � �No-Reduce � �No-Pass � � � �

Table 5.3: Transitions in different dependency parsing algorithms. The last column shows transitionsused in our parsing algorithm. The other columns show transitions used in Nivre (2003), Covington(2001), Nivre (2008), and Choi and Palmer (2011a), respectively.

Nivre’s arc-eager algorithm allows no combination of ∗-Pass, which removes or skips tokens that can

violate the projective property (Nivre’03 in Table 5.3). As a result, this algorithm performs at most

2n− 1 transitions during paring, and can produce only projective dependency trees.2 Covington’s

algorithm allows no combination of ∗-Shiftn or ∗-Reduce, which inevitably compares each token

with all tokens prior to it (Covington’01). Thus, this algorithm performs n(n+1)2 transitions during

parsing, and can produce both projective and non-projective dependency trees.

The last three algorithms in Table 5.3 show cumulative updates to Covington’s algorithm;

they add one or two transitions from Nivre’s arc-eager algorithm to Covington’s algorithm. By2 The ∗-Shiftd transitions are not counted because they do not require comparison between word tokens.

Projective Non-projective

Friday, August 17, 2012

Page 35: Optimization of NLP Components for Robustness and Scalability

Dependency Parsing• Average parsing complexity

- The number of transitions performed per sentence.

35

8010 20 30 40 50 60 70

2850

0

500

1000

1500

2000

Sentence length

# of

tran

sitio

ns

Covington'01

Nivre'08C&P'11This work

8010 20 30 40 50 60 70

330

0

50100

150200

250

Sentence length

# of

tran

sitio

ns

Nivre'08

C&P'11

This work

Friday, August 17, 2012

Page 36: Optimization of NLP Components for Robustness and Scalability

Dependency Parsing• Bootstrapping

- Transition-based dependency parsing can take advantage of dynamic features (e.g., head, leftmost/rightmost dependent).

- Features extracted from gold-standard trees during training can be different from features extracted from automatic trees during decoding.

- By bootstrapping these dynamic features, we can significantly improve parsing accuracy.

36

wi

w0 ! h < j

w1 wj-1

wj

wj-1wi+1

wi < p < j

Friday, August 17, 2012

Page 37: Optimization of NLP Components for Robustness and Scalability

Dependency Parsing

37

AutomaticFeatures

Machine LearningAlgorithm

DependencyParser

StatisticalModel

Gold-standardFeatures

Gold-standardLabels

Stop?

Begin

End

YES

NO

TrainingData

Determined bycross-validation.

Friday, August 17, 2012

Page 38: Optimization of NLP Components for Robustness and Scalability

Dependency Parsing• Post-processing

- Transition-based dependency parsing does not guarantee parse output to be a tree.

- After parsing, we find the head of each headless token by comparing it to all other tokens using the same model.

- A predicted head with the highest score that does not break tree properties becomes the head of this token.

- This post-processing technique significantly improves parsing accuracy in out-of-genre experiments.

38

Friday, August 17, 2012

Page 39: Optimization of NLP Components for Robustness and Scalability

Dependency Parsing• Experiments

- Baseline: using all recomposed transitions.

- Baseline+: Baseline with post-processing.

- ClearNLP: Baseline+ with bootstrapping.

- C&N’09: Choi and Nicolov, 2009.

- C&P’11: Choi and Palmer, 2011a.

- MaltParser: Nivre, 2009.

- MSTParser: McDonald et al., 2005.

• Use only 1st order features; with 2nd order features, accuracy is expected to be higher and speed is expected to be slower.

39

Friday, August 17, 2012

Page 40: Optimization of NLP Components for Robustness and Scalability

Dependency Parsing• Accuracy - WSJ models (Avgi and Avgo)

40

LASUAS

85

86.25

87.5

88.75

90

Baseline Baseline+ ClearNLP C&N’09 C&P’11 MaltParser MSTParser

88.3688.23

89.7489.589.6888.8188.57

86.0386.49

88.0387.7988.1087.1886.94

In-genre experiments

73

74.75

76.5

78.25

80

Baseline Baseline+ ClearNLP C&N’09 C&P’11 MaltParser MSTParser

79.2678.29

79.1879.0879.3678.6078.04

74.4674.1075.3475.2375.50

74.6874.18

Out-of-genre experiments

Friday, August 17, 2012

Page 41: Optimization of NLP Components for Robustness and Scalability

Dependency Parsing• Accuracy - OntoNotes models (Avgi and Avgo)

41

LASUAS

838485868788

Baseline Baseline+ ClearNLP C&N’09 C&P’11 MaltParser MSTParser

86.7086.40

87.5787.4887.7586.8386.54

83.6684.05

85.4985.4185.6884.7684.51

In-genre experiments

71.5

73.25

75

76.75

78.5

Baseline Baseline+ ClearNLP C&N’09 C&P’11 MaltParser MSTParser

77.9477.5477.4077.4378.05

76.6576.26

73.3073.4773.8673.8374.18

72.7372.37

Out-of-genre experiments

Friday, August 17, 2012

Page 42: Optimization of NLP Components for Robustness and Scalability

Dependency Parsing• Speed comparison - WSJ models

42

0

5

10

15

20

10 20 30 40 50 60 70 80

Mill

isec

onds

Sentence Length

ClearNLP ClearNLP+ C&N’09 C&P’11 MaltParser

1.16 ms1.61 ms 1.25 ms 1.08 ms 2.14 ms

Friday, August 17, 2012

Page 43: Optimization of NLP Components for Robustness and Scalability

Dependency Parsing• Speed comparison - OntoNotes models

43

0

5

10

15

20

10 20 30 40 50 60 70 80

Mill

isec

onds

Sentence Length

ClearNLP ClearNLP+ C&N’09 C&P’11 MaltParser

1.28 ms1.89 ms 1.26 ms 1.12 ms 2.14 ms

Friday, August 17, 2012

Page 44: Optimization of NLP Components for Robustness and Scalability

Contents• Introduction

• Dependency conversion

• Experimental setup

• Part-of-speech tagging

• Dependency parsing

• Semantic role labeling

• Conclusion

44

Friday, August 17, 2012

Page 45: Optimization of NLP Components for Robustness and Scalability

Semantic Role Labeling• Motivation

- Not all tokens need to be visited for semantic role labeling.

- A typical pruning algorithm does not work as well when automatically generated trees are provided.

- An enhanced pruning algorithm could improve argument coverage while maintaining low average labeling complexity.

• Approach

- Higher-order argument pruning.

- Conditional higher-order argument pruning.

- Positional feature separation.

45

Friday, August 17, 2012

Page 46: Optimization of NLP Components for Robustness and Scalability

Semantic Role Labeling• Semantic roles in dependency trees

46

ARG0 ARG1 ARG2 ARGM-TMP

Friday, August 17, 2012

Page 47: Optimization of NLP Components for Robustness and Scalability

Semantic Role Labeling• First-order argument pruning (1st)

- Originally designed for constituent trees.

• Considers only siblings of the predicate, predicate’s ancestors, and siblings of predicate’s ancestors argument candidates (Xue and Palmer, 2004).

- Redesigned for dependency trees.

• Considers only dependents of the predicate, predicate’s ancestors, and dependents of predicate’s ancestors argument candidates (Johansson and Nugues, 2008).

- Covers over 99% of all arguments using gold-standard trees.

- Covers only 93% of all arguments using automatic trees.

47

Friday, August 17, 2012

Page 48: Optimization of NLP Components for Robustness and Scalability

Semantic Role Labeling• Higher-order argument pruning (High)

- Considers all descendants of the predicate, predicate’s ancestors, and dependents of predicate’s ancestors argument candidates.

- Significantly improves argument coverage when automatically generated trees are used.

48

90

92

94

96

98

100

WSJ-1st ON-1st WSJ-High ON-High Gold-1st Gold-High

99.9299.4498.24

97.59

92.94

91.02

Arg

umen

t C

over

age

Friday, August 17, 2012

Page 49: Optimization of NLP Components for Robustness and Scalability

Semantic Role Labeling• Conditional higher-order argument pruning (High+)

- Reduces argument candidates using path-rules.

- Before training,

• Collect paths between predicates and their descendants whose subtrees contain arguments of the predicates.

• Collect paths between predicates and their ancestors whose direct dependents or ancestors are arguments of the predicates.

• Cut off paths whose counts are below thresholds.

- During training and decoding, skip tokens and their subtrees or ancestors whose paths to the predicates are not seen.

49

Friday, August 17, 2012

Page 50: Optimization of NLP Components for Robustness and Scalability

8010 20 30 40 50 60 70

75

0102030405060

Sentence length

# of

can

dida

tes

All

High

High+1st

Using the WSJ models (OntoNotes graph is similar)

Semantic Role Labeling• Average labeling complexity

- The number of tokens visited per predicate.

50

Friday, August 17, 2012

Page 51: Optimization of NLP Components for Robustness and Scalability

Semantic Role Labeling• Positional feature separation

- Group features by arguments’ positions with respect to their predicates.

- Two sets of features are extracted.

• All features derived from arguments on the lefthand side of the predicates are grouped in one set, SL.

• All features derived from arguments on the righthand side of the predicates are grouped in another set, SR.

- During training, build two models, ML and MR, for SL and SR.

- During decoding, use ML and MR for argument candidates on the lefthand and righthand sides of the predicates.

51

Friday, August 17, 2012

Page 52: Optimization of NLP Components for Robustness and Scalability

Semantic Role Labeling• Experiments

- Baseline: 1st order argument pruning.

- Baseline+: Baseline with positional feature separation.

- High: higher-order argument pruning.

- All: no argument pruning.

- ClearNLP: conditional higher-order argument pruning.

• Previously called High+.

- ClearParser: Choi and Palmer, 2011b.

52

Friday, August 17, 2012

Page 53: Optimization of NLP Components for Robustness and Scalability

Semantic Role Labeling• Accuracy - WSJ models (Avgi and Avgo)

53

81.7

82.0

82.3

82.6

Baseline Baseline+ High All ClearNLP ClearParser

82.2682.4282.4882.52

82.28

81.88

In-domain experiments

70.8

71.1

71.4

71.7

72

Baseline Baseline+ High All ClearNLP ClearParser

71.52

71.8571.9571.90

71.64

71.07

Out-of-domain experiments

Friday, August 17, 2012

Page 54: Optimization of NLP Components for Robustness and Scalability

Semantic Role Labeling• Accuracy - OntoNotes models (Avgi and Avgo)

54

80.5

80.9

81.3

81.7

Baseline Baseline+ High All ClearNLP ClearParser

81.6981.5281.4881.51

81.33

80.73

In-domain experiments

69.7

70.1

70.5

70.9

Baseline Baseline+ High All ClearNLP ClearParser

70.01

70.6870.81

70.6870.54

70.02

Out-of-domain experiments

Friday, August 17, 2012

Page 55: Optimization of NLP Components for Robustness and Scalability

Semantic Role Labeling• Speed comparison - WSJ models

- Milliseconds for finding all arguments of each predicate.

55

0

0.75

1.5

2.25

3

10 20 30 40 50 60 70 80

Mill

isec

onds

Sentence Length

ClearNLPClearNLP+Baseline+HighAllClearParser

Friday, August 17, 2012

Page 56: Optimization of NLP Components for Robustness and Scalability

Semantic Role Labeling• Speed comparison - OntoNotes models

56

0

0.75

1.5

2.25

3

10 20 30 40 50 60 70 80

Mill

isec

onds

Sentence Length

ClearNLPClearNLP+Baseline+HighAllClearParser

Friday, August 17, 2012

Page 57: Optimization of NLP Components for Robustness and Scalability

Contents• Introduction

• Dependency conversion

• Experimental setup

• Part-of-speech tagging

• Dependency parsing

• Semantic role labeling

• Conclusion

57

Friday, August 17, 2012

Page 58: Optimization of NLP Components for Robustness and Scalability

Conclusion• Our dependency conversion gives rich dependency

representations and can be applied to most English Treebanks.

• The dynamic model selection runs fast and shows robust POS tagging accuracy across different genres.

• Our parsing algorithm shows linear-time average parsing complexity for generating both proj. and non-proj. trees.

• The bootstrapping technique gives significant improvement on parsing accuracy.

• The higher-order argument pruning gives significant improvement on argument coverage.

• The conditional higher-order argument pruning reduces average labeling complexity without compromising the F1-score.

58

Friday, August 17, 2012

Page 59: Optimization of NLP Components for Robustness and Scalability

Conclusion• Contributions

- First time that these three components have been evaluated together on such a wide variety of English data.

- Maintained high level accuracy while improving efficiency, modularity, and portability of these components.

- Dynamic model selection and bootstrapping can be generally applicable for tagging and parsing, respectively.

- Processing all three components take about 2.49 - 2.69 ms (tagging: 0.36 - 0.37, parsing: 1.16 - 1.28, labeling: 0.97 - 1.04).

- All components are publicly available as an open source project, called ClearNLP (clearnlp.googlecode.com).

59

Friday, August 17, 2012

Page 60: Optimization of NLP Components for Robustness and Scalability

Conclusion• Future work

- Integrate the dynamic model selection approach with more sophisticated tagging algorithms.

- Evaluate our parsing approach on languages containing more non-projective dependency trees.

- Improve semantic role labeling where the quality of input parse trees is poor (using joint-inference).

60

Friday, August 17, 2012

Page 61: Optimization of NLP Components for Robustness and Scalability

Acknowledgment • We gratefully acknowledge the support of the following grants. Any contents

expressed in this material are those of the authors and do not necessarily reflect the views of any grant.

- The National Science Foundation Grants IIS-0325646, Domain Independent Semantic Parsing, CISE-CRI-0551615, Towards a Comprehensive Linguistic Annotation, CISE-CRI 0709167, Collaborative: A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu, CISE- IIS-RI-0910992, Richer Representations for Machine Translation.

- A grant from the Defense Advanced Research Projects Agency (DARPA/IPTO) under the GALE program, DARPA/CMO Contract No. HR0011-06-C-0022, subcontract from BBN, Inc.

- A subcontract from the Mayo Clinic and Harvard Children’s Hospital based on a grant from the ONC, 90TR0002/01.

- Strategic Health Advanced Research Project Area 4: Natural Language Processing.

61

Friday, August 17, 2012

Page 62: Optimization of NLP Components for Robustness and Scalability

Acknowledgment• Special thanks are due to

- Martha Palmer for practically being my mom for 5 years.

- James Martin for always encouraging me when I’m low.

- Wayne Ward for wonderful smiles.

- Bhuvana Narasimhan for bringing Hindi to my life.

- Joakim Nivre for suffering under millions of my questions.

- Nicolas Nicolov for making me feel normal when others call me “workaholic”.

- All CINC folks for letting me live (literally) at my cube.

62

Friday, August 17, 2012

Page 63: Optimization of NLP Components for Robustness and Scalability

References• Jinho D. Choi and Nicolas Nicolov. K-best, Locally Pruned, Transition-based Dependency Parsing Using

Robust Risk Minimization. In Recent Advances in Natural Language Processing V, pages 205–216. John Benjamins, 2009.

• Jinho D. Choi and Martha Palmer. Getting the Most out of Transition-based Dependency Parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:Human Language Technologies, ACL:HLT’11, pages 687–692, 2011a.

• Jinho D. Choi and Martha Palmer. Transition-based Semantic Role Labeling Using Predicate Argument Clustering. In Proceedings of ACL workshop on Relational Models of Semantics, RELMS’11, pages 37–45, 2011b.

• M. Cmejrek, J. Curín, and J. Havelka. Prague Czech-English Dependency Treebank: Any Hopes for a Common Annotation Scheme? In HLT-NAACL’04 workshop on Frontiers in CorpusAnnotation, pages 47–54, 2004.

• Jesús Giménez and Lluís Màrquez. SVMTool: A general POS tagger generator based on Support Vector Machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC’04, 2004.

• Richard Johansson and Pierre Nugues. Dependency-based Semantic Role Labeling of PropBank. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing(EMNLP’08), pages 69–78, 2008.

63

Friday, August 17, 2012

Page 64: Optimization of NLP Components for Robustness and Scalability

References• Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. A Dual Coordinate

Descent Method for Large-scale Linear SVM. In Proceedings of the 25th international conference on Machine learning, ICML’08, pages 408–415, 2008.

• Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

• Marie-Catherine de Marneffe and Christopher D. Manning. The Stanford typed dependencies representation. In Proceedings of the COLING workshop on Cross-Framework and Cross-DomainParser Evaluation, 2008a.

• Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. Non-projective Dependency Parsing using Spanning Tree Algorithms. In Proceedings of the Conference on Human LanguageTechnology and Empirical Methods in Natural Language Processing (HLT-EMNLP’05), pages523–530, 2005.

• Rodney D. Nielsen, James Masanz, Philip Ogren, Wayne Ward, James H. Martin, Guergana Savova, and Martha Palmer. An architecture for complex clinical question answering. In Proceedings of the 1st ACM International Health Informatics Symposium, IHI’10, pages 395–399, 2010.

• Joakim Nivre. An Efficient Algorithm for Projective Dependency Parsing. In Proceedings of the 8th International Workshop on Parsing Technologies, IWPT’03, pages 149–160, 2003.

• Joakim Nivre. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34(4):513–553, 2008.

64

Friday, August 17, 2012

Page 65: Optimization of NLP Components for Robustness and Scalability

References• Joakim Nivre. Non-Projective Dependency Parsing in Expected Linear Time. In Proceedings of the Joint

Conference of the 47th Annual Meeting of the ACL and the 4th International JointConference on Natural Language Processing of the AFNLP (ACL-IJCNLP’09), pages 351–359,2009.

• Owen Rambow, Cassandre Creswell, Rachel Szekely, Harriet Taber, and Marilyn Walker. A Dependency Treebank for English. In Proceedings of the 3rd International Conference on LanguageResources and Evaluation (LREC’02), 2002.

• Ralph Weischedel, Eduard Hovy, Martha Palmer, Mitch Marcus, Robert Belvin, Sameer Pradhan, Lance Ramshaw, and Nianwen Xue. OntoNotes: A Large Training Corpus for Enhanced Processing. In Joseph Olive, Caitlin Christianson, and John McCary, editors, Handbook of NaturalLanguage Processing and Machine Translation. Springer, 2011.

• Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics on HumanLanguage Technology, NAACL’03, pages 173–180, 2003.

• Nianwen Xue and Martha Palmer. Calibrating Features for Semantic Role Labeling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2004.

65

Friday, August 17, 2012