optimization of nlp components for robustness and scalability

Optimization of NLP ComponentsFor Robustness and Scalability

Dissertation DefenseAugust 15, 2012

Jinho D. [email protected]

Department of Computer ScienceUniversity of Colorado Boulder

Friday, August 17, 2012

mailto:[email protected]

mailto:[email protected]

Ever since I came, Boulder has been ...• #1: Top 10 College Towns (Livability, 2012)

• #1: Top 10 Least Obese Metro Areas (Gallup Healthways, 2012)

• #1: Top 10 Happiest Cities (Gallup Healthways, 2012)

• #1: The 10 Most Educated U.S. Cities (US News, 2011)

• #1: America’s 15 Most Active Cities (Time - Healthland, 2011)

• #1: Best Quality of Life in America (Porfolio, 2011)

• #1: 20 Brainiest Cities in America (Daily Beast, 2010)

• #1: Western Cities Fare Best in Well-being (USA Today, 2010)

• #1: America's Foodiest Town (Bon Appétit, 2010)

• #1: The Best Cities to Raise an Outdoor Kid (Backpacker, 2009)

• #1: America's Top 25 Towns To Live Well (Forbes, 2009)

• #1: America's Smartest Cities (Forbes, 2008)

• #1: Top Heart Friendly Cities (American Heart Association, 2008)

2


http://livability.com/top-10/top-10-college-towns-2011/boulder/co

http://livability.com/top-10/top-10-college-towns-2011/boulder/co

http://www.livescience.com/18889-cities-obesity-rates.html

http://www.livescience.com/18889-cities-obesity-rates.html

http://www.cbsnews.com/8301-505123_162-41142308/the-10-happiest-and-saddest-cities-in-the-us/

http://www.cbsnews.com/8301-505123_162-41142308/the-10-happiest-and-saddest-cities-in-the-us/

http://www.usnews.com/news/articles/2011/08/30/the-10-most-educated-us-cities-boulder-ann-arbor-and-washington-dc-top-the-list

http://www.usnews.com/news/articles/2011/08/30/the-10-most-educated-us-cities-boulder-ann-arbor-and-washington-dc-top-the-list

http://healthland.time.com/2011/02/18/americas-15-most-active-cities/%23fittestcity

http://healthland.time.com/2011/02/18/americas-15-most-active-cities/%23fittestcity

http://blogs.westword.com/latestword/2010/12/boulder_is_americas_brainiest_city_--_and_has_the_best_quality_of_life_too_--_says_portfoliocom.php

http://blogs.westword.com/latestword/2010/12/boulder_is_americas_brainiest_city_--_and_has_the_best_quality_of_life_too_--_says_portfoliocom.php

http://www.thedailybeast.com/articles/2010/08/27/americas-brainiest-cities.html

http://www.thedailybeast.com/articles/2010/08/27/americas-brainiest-cities.html

http://www.usatoday.com/news/nation/2010-02-15-cities_N.htm

http://www.usatoday.com/news/nation/2010-02-15-cities_N.htm

http://livepage.apple.com/

http://livepage.apple.com/

http://www.backpacker.com/august_09_the_best_cities_to_raise_an_outdoor_kid/destinations/13126/

http://www.backpacker.com/august_09_the_best_cities_to_raise_an_outdoor_kid/destinations/13126/

http://www.forbes.com/2009/05/04/towns-cities-real-estate-lifestyle-real-estate-top-towns.html

http://www.forbes.com/2009/05/04/towns-cities-real-estate-lifestyle-real-estate-top-towns.html

http://www.forbes.com/2008/02/07/americas-smartest-cities-oped-cx_apa_0207smartest_slide_2.html?thisspeed=20000

http://www.forbes.com/2008/02/07/americas-smartest-cities-oped-cx_apa_0207smartest_slide_2.html?thisspeed=20000

Contents• Introduction

• Dependency conversion

• Experimental setup

• Part-of-speech tagging

• Dependency parsing

• Semantic role labeling

• Conclusion

3


Introduction• The application of NLP has ...

- Expanded to everyday computing.

- Broadened to a general audience.

‣ More attention is drawn to the practical aspects of NLP.

• NLP components should be tested for

- Robustness in handling heterogeneous data.

• Need to be evaluated on data from several different sources.

- Scalability in handling a large amount of data.

• Need to be evaluated for speed and complexity.

4


Introduction• Research question

- How to improve the robustness and scalability of standard NLP components.

• Goals

- To prepare gold-standard data from several different sources for in-genre and out-of-genre experiments.

- To develop a POS tagger, a dependency parser, and a semantic role labeler showing robust results across this data.

- To reduce average complexities of these components while retaining good performance in accuracy.

5


Introduction• Thesis statement

1. We improve the robustness of three NLP components:

• POS tagger: by building a generalized model.

• Dependency parser: by bootstrapping parse information.

• Semantic role labeler: by applying higher-order argument pruning.

2. We improve the scalability of these three components:

• POS tagger: by adapting dynamic model selection.

• Dependency parser: by optimizing the engineering of transition-based parsing algorithms.

• Semantic role labeler: by applying conditional higher-order argument pruning.

6


Introduction

7

DependencyConversion

Part-of-speechTrainer

Part-of-speechTagging Model

Part-of-speechTagger

Training Set:Dependency Trees+ Semantic Roles

Evaluation Set:Dependency Trees+ Semantic Roles

ConstituentTreebanks PropBanks

DependencyTrainer

DependencyParsing Model

DependencyParser

Semantic RoleTrainer

Semantic RoleLabeling Model

Semantic RoleLabeler

Start

Stop








• Conclusion

8


Dependency Conversion• Motivation

- A small amount of manually annotated dependency trees(Rambow et al., 2002; Cmejrek et al., 2004).

- A large amount of manually annotated constituent trees(Marcus et al., 1993; Weischedel et al., 2011).

- Converting constituent trees into dependency trees→ A large amount of pseudo annotated dependency trees.

• Previous approaches

- Penn2Malt (stp.lingfil.uu.se/~nivre/research/Penn2Malt.html).

- LTH converter (Johansson and Nugues, 2007).

- Stanford converter (de Marneffe and Manning, 2008a).

9


Dependency Conversion• Comparison

- The Stanford and CLEAR dependency approaches generate 3.62% and 0.23% of unclassified dependencies, respectively.

- Our conversion produces 3.69% of non-projective trees.

10

Penn2Malt LTH Stanford CLEAR

Labels Malt CoNLL Stanford Stanford+

Long-distance DPs ✓ ✓ ✓ ✓ ✓ ✓Secondary DPs ✓ ✓ ✓ ✓ ✓ ✓ ✓Function Tags ✓ ✓ ✓ ✓New TB Format NO NO NO YES

Maintenance NO NO YES YES


Dependency Conversion (1/6)1. Input a constituent tree.

• Penn, OntoNotes, CRAFT, MiPACQ, and SHARP Treebanks.

11

NN CC NN WDT PRP VB -NONE-

joyandPeace that we *T*-1

NP

want

SBAR

WHNP-1 S

NP VP

NP

NP


Dependency Conversion (2/6)2. Reorder constituents related to empty categories.

• *T*: wh-movement and topicalization.

• *RNR*: right node raising.

• *ICH* and *PPA*: discontinuous constituent.

12

NN CC NN WDT PRP VB -NONE-

joyandPeace that we *T*-1

NP

want

SBAR

WHNP-1 S

NP VP

NP

NP

NN CC NN PRP VB WDT

joyandPeace we that

NP

want

SBAR

S

NP VP

WHNP-1

NP


Dependency Conversion (3/6)3. Handle special cases.

• Apposition, coordination, and small clauses.

13

NN CC NN PRP VB WDT

joyandPeace we that

NP

want

SBAR

S

NP VP

WHNP-1

NP

joyandPeace that we wantroot

ccconj

The original word order is preserved in the converted dependency tree.


Dependency Conversion (4/6)4. Handle general cases.

• Head-finding rules and heuristics.

14

NN CC NN PRP VB WDT

joyandPeace we that

NP

want

SBAR

S

NP VP

WHNP-1

NP


ccconj

rcmod

nsubjdobj

root


Dependency Conversion (5/6)5. Add secondary dependencies.

• Gapping, referent, right node raising, open clausal subject.

15

NN CC NN PRP VB WDT

joyandPeace we that

NP

want

SBAR

S

NP VP

WHNP-1

NP


ccconj

rcmod

nsubjdobj

root

ref


Dependency Conversion (6/6)6. Add function tags.

16

Appendix A

Constituent Treebank Tags

This appendix shows tags used in various constituent Treebanks for English (Marcus et al., 1993;

Nielsen et al., 2010; Weischedel et al., 2011; Verspoor et al., 2012). Tags followed by∗

are not the

typical Penn Treebank tags but used in some other Treebanks.

A.1 Function tags

Syntactic rolesADV Adverbial PUT Locative complement of putCLF It-cleft PRD Non-VP predicate

CLR Closely related constituent RED∗ Reduced auxiliary

DTV Dative SBJ Surface subject

LGS Logical subject in passive TPC Topicalization

NOM Nominalization

Semantic rolesBNF Benefactive MNR Manner

DIR Direction PRP Purpose or reason

EXT Extent TMP Temporal

LOC Locative VOC Vocative

Text and speech categoriesETC Et cetera SEZ Direct speech

FRM∗ Formula TTL Title

HLN Headline UNF Unfinished constituent

IMP Imperative

Table A.1: A list of function tags for English.








• Conclusion

17


Experimental Setup• The Wall Street Journal (WSJ) models

- Train

• The WSJ 2-21 in OntoNotes (Weischedel et al., 2011).

• Total: 30,060 sentences, 731,677 tokens, 77,826 predicates.

- In-genre evaluation (Avgi)

• The WSJ 23 in OntoNotes.


- Out-of-genre evaluation (Avgo)

• 5 genres in OntoNotes, 2 genres in MiPACQ (Nielsen et al., 2010), 1 genre in SHARP.


18


Experimental Setup• The OntoNotes models

- Train

• 6 genres in OntoNotes.

• Total: 96,406 sentences, 1,983,012 tokens, 213,695 predicates.

- In-genre evaluation (Avgi)

• 6 genres in OntoNotes.


- Out-of-genre evaluation (Avgo)

• Same 2 genres in MiPACQ, same 1 genre in SHARP.


19


Experimental Setup• Accuracy

- Part-of-speech tagging

• Accuracy.

- Dependency parsing

• Labeled attachment score (LAS).

• Unlabeled attachment score (UAS).

- Semantic role labeling

• F1-score of argument identification.

• F1-score of both argument identification and classification.

20


Experimental Setup• Speed

- All experiments are run on an Intel Xeon 2.57GHz machine.

- Each model is run 5 times, and an average speed is measured by taking the average of middle 3 speeds.

• Machine learning algorithm

- Liblinear L2-regularization, L1-loss SVM classification(Hsieh et al., 2008).

- Designed to handle large scale, high dimensional vectors.

- Runs fast with accurate performance.

- Our implementation of LibLinear is publicly available.

21








• Conclusion

22


Part-of-Speech Tagging• Motivation

- Supervised learning approaches do not perform well in out-of-genre experiments.

- Domain adaptation approaches require knowledge of incoming data.

- Complicated tagging or learning approaches often run slowly during decoding.

• Dynamic model selection

- Build two models, generalized and domain-specific, given one set of training data.

- Dynamically select one of the models during decoding.

23


Part-of-Speech Tagging• Training

1. Group training data into documents (e.g., sections in WSJ).

2. Get the document frequency of each simplified word form.

• In simplified word forms, all numerical expressions with or w/o special characters are converted to 0.

3. Build a domain-specific model using features extracted from only tokens whose DF(SW) > 1.

4. Build a generalized model using features extracted from only tokens whose DF(SW) > 2.

5. Find the cosine similarity threshold for dynamic model selection.

24


Part-of-Speech Tagging• Cosine similarity threshold

- During cross-validation, collect cosine-similarities between simplified word forms used for building the domain-specific model and input sentences that the domain-specific model shows advantage.

- The cosine similarity in the first 5% area becomes the threshold for dynamic model selection.

25

0 0.02 0.04 0.06

190

0

40

80

120

160

Cosine Similarity

Occ

urre

nce

5%


Part-of-Speech Tagging• Decoding

- Measure the cosine similarity between simplified word forms used for building the domain-specific model and each input sentence.

- If the similarity is greater than the threshold, use the domain-specific model.

- If the similarity is less than or equal to the threshold, use the generalized model.

26

Runs as fast as a single model approach.


Part-of-Speech Tagging• Experiments

- Baseline: using the original word forms.

- Baseline+: using lowercase simplified word forms.

- Domain: domain-specific model.

- General: generalized model.

- ClearNLP: dynamic model selection.

- Stanford: Toutanova et al., 2003.

- SVMTool: Giménez and Màrquez, 2004.

27


Part-of-Speech Tagging• Accuracy - WSJ models (Avgi and Avgo)

28

96.5

97.0

97.5

Baseline Baseline+ Domain General ClearNLP Stanford SVMTool

97.3197.4197.40

97.2497.39

96.9896.93

In-domain experiments

87.5

88.5

89.5

90.5


89.4989.92

90.7990.6190.43

88.6488.25

Out-of-domain experiments


Part-of-Speech Tagging• Accuracy - OntoNotes models (Avgi and Avgo)

29

96

96.2

96.4

96.6


96.19

96.5296.56

96.41

96.58

96.3296.23


86

87

88

89

90


87.61

89.2089.2689.2688.60

87.75

86.79



Part-of-Speech Tagging• Speed comparison

30

ModelModel Tokens per sec. Millisecs. per sen.

WSJ

ClearNLP 32,654 0.44

WSJClearNLP+ 39,491 0.37

WSJStanford 250 58.06

WSJ

SVMTool 1,058 13.71

OntoNotes

ClearNLP 32,206 0.45

OntoNotesClearNLP+ 39,882 0.36

OntoNotesStanford 136 106.34

OntoNotes

SVMTool 924 15.71

• ClearNLP : as reported in the thesis.• ClearNLP+: new improved results.








• Conclusion

31


Dependency Parsing• Goals

1. To improve the average parsing complexity for non-projective dependency parsing.

2. To reduce the discrepancy between dynamic features used for training on gold trees and decoding automatic trees.

3. To ensure well-formed dependency graph properties.

• Approach

1. Combine transitions in both projective and non-projective dependency parsing algorithms.

2. Bootstrap dynamic features during training.

3. Post-process.

32


Dependency Parsing• Transition decomposition

- Decompose transitions in:

• Nivre’s arc-eager algorithm (projective; Nivre, 2003).

• Nivre’s list-based algorithm (non-projective; Nivre, 2008).

33

61

5.2 Transition-based dependency parsing

5.2.1 Transition decomposition

Table 5.1 shows functional decomposition of transitions used in Nivre’s arc-eager and Covington’s

algorithms. Nivre’s arc-eager algorithm is a projective parsing algorithm that shows a worst-case

parsing complexity of O(n) (Nivre, 2003). Covington’s algorithm is a non-projective parsing al-

gorithm that shows a worst-case parsing complexity of O(n2) without backtracking (Covington,

2001). Covington’s algorithm was later formulated as a transition-based parsing algorithm by Nivre

(2008), called Nivre’s list-based algorithm. Table 5.3 shows the relation between the decomposed

transitions in Table 5.1 and the transitions from the original algorithms.

Operation Transition Description

ArcLeft-∗l ( [λ1|i], λ2, [j|β], A ) ⇒ ( [λ1|i], λ2, [j|β], A ∪ {i l← j} )Right-∗l ( [λ1|i], λ2, [j|β], A ) ⇒ ( [λ1|i], λ2, [j|β], A ∪ {i l→ j} )No-∗ ( [λ1|i], λ2, [j|β], A ) ⇒ ( [λ1|i], λ2, [j|β], A )

List∗-Shiftd|n ( [λ1|i], λ2, [j|β], A ) ⇒ ( [λ1|i|λ2|j], [ ], β, A )∗-Reduce ( [λ1|i], λ2, [j|β], A ) ⇒ ( λ1, λ2, [j|β], A )∗-Pass ( [λ1|i], λ2, [j|β], A ) ⇒ ( λ1, [i|λ2], [j|β], A )

Table 5.1: Decomposed transitions grouped into the Arc and List operations.

Operation Transition Precondition

ArcLeft-∗l [i �= 0] ∧ ¬[∃k. (i ← k) ∈ A] ∧ ¬[(i →∗ j) ∈ A]Right-∗l ¬[∃k. (k → j) ∈ A] ∧ ¬[(i ←∗ j) ∈ A]No-∗ ¬[∃l. Left-∗l ∨ Right-∗l]

List∗-Shiftd|n [λ1 = [ ]]d ∨ ¬[∃k ∈ λ1. (k �= i) ∧ ((k ← j) ∨ (k → j))]n

∗-Reduce [∃h. (h → i) ∈ A] ∧ ¬[∃k ∈ β. (i → k)]∗-Pass ¬[∗-Shiftd|n ∨ ∗-Reduce*]

Table 5.2: Preconditions of the decomposed transitions in Table 5.1.

Table 5.2 shows preconditions of the decomposed transitions in Table 5.1. Some preconditions need

to be satisfied to ensure the properties of a well-formed dependency graph (Section 2.1.2.1). The

parsing states are represented as tuples (λ1, λ2, β, A), where λ1, λ2 are lists of partially processed

tokens and β is a list of the remaining unprocessed tokens. A is a set of labeled arcs representing

This decomposition makes it easier to integrate transitions from different parsing algorithms.


Dependency Parsing• Transition recomposition

- Any combination of two decomposed transitions, one from each operation, can be recomposed.

- For each recomposed transition, an ARC operation is performed first and a LIST operation is performed later.

34

63

Nivre’s list-based algorithm using λ2). Section 5.2.2 shows how these decomposed transitions can

be recomposed into transitions used in several different dependency parsing algorithms.

5.2.2 Transition recomposition

Any combination of two decomposed transitions in Table 5.1, one from each operation, can be

recomposed into a new transition. For instance, the combination of Left-∗l and ∗-Reduce makes a

transition, Left-Reducel, which performs Left-∗l and ∗-Reduce sequentially; the Arc operation

is always performed before the List operation. Table 5.3 shows how these decomposed transitions

are recomposed into transitions used in different dependency parsing algorithms.

Transition Nivre’03 Covington’01 Nivre’08 C&P’11 This workLeft-Reducel � � �Left-Passl � � � �Right-Shiftn

l � �Right-Passl � � � �No-Shiftd � � � � �No-Shiftn � � � �No-Reduce � �No-Pass � � � �

Table 5.3: Transitions in different dependency parsing algorithms. The last column shows transitionsused in our parsing algorithm. The other columns show transitions used in Nivre (2003), Covington(2001), Nivre (2008), and Choi and Palmer (2011a), respectively.

Nivre’s arc-eager algorithm allows no combination of ∗-Pass, which removes or skips tokens that can

violate the projective property (Nivre’03 in Table 5.3). As a result, this algorithm performs at most

2n− 1 transitions during paring, and can produce only projective dependency trees.2 Covington’s

algorithm allows no combination of ∗-Shiftn or ∗-Reduce, which inevitably compares each token

with all tokens prior to it (Covington’01). Thus, this algorithm performs n(n+1)2 transitions during

parsing, and can produce both projective and non-projective dependency trees.

The last three algorithms in Table 5.3 show cumulative updates to Covington’s algorithm;

they add one or two transitions from Nivre’s arc-eager algorithm to Covington’s algorithm. By2 The ∗-Shiftd transitions are not counted because they do not require comparison between word tokens.

Projective Non-projective


Dependency Parsing• Average parsing complexity

- The number of transitions performed per sentence.

35

8010 20 30 40 50 60 70

2850

0

500

1000

1500

2000

Sentence length

# of

tran

sitio

ns

Covington'01

Nivre'08C&P'11This work

8010 20 30 40 50 60 70

330

0

50100

150200

250

Sentence length

# of

tran

sitio

ns

Nivre'08

C&P'11

This work


Dependency Parsing• Bootstrapping

- Transition-based dependency parsing can take advantage of dynamic features (e.g., head, leftmost/rightmost dependent).

- Features extracted from gold-standard trees during training can be different from features extracted from automatic trees during decoding.

- By bootstrapping these dynamic features, we can significantly improve parsing accuracy.

36

wi

w0 ! h < j

w1 wj-1

wj

wj-1wi+1

wi < p < j


Dependency Parsing

37

AutomaticFeatures

Machine LearningAlgorithm

DependencyParser

StatisticalModel

Gold-standardFeatures

Gold-standardLabels

Stop?

Begin

End

YES

NO

TrainingData

Determined bycross-validation.


Dependency Parsing• Post-processing

- Transition-based dependency parsing does not guarantee parse output to be a tree.

- After parsing, we find the head of each headless token by comparing it to all other tokens using the same model.

- A predicted head with the highest score that does not break tree properties becomes the head of this token.

- This post-processing technique significantly improves parsing accuracy in out-of-genre experiments.

38


Dependency Parsing• Experiments

- Baseline: using all recomposed transitions.

- Baseline+: Baseline with post-processing.

- ClearNLP: Baseline+ with bootstrapping.

- C&N’09: Choi and Nicolov, 2009.

- C&P’11: Choi and Palmer, 2011a.

- MaltParser: Nivre, 2009.

- MSTParser: McDonald et al., 2005.

• Use only 1st order features; with 2nd order features, accuracy is expected to be higher and speed is expected to be slower.

39


Dependency Parsing• Accuracy - WSJ models (Avgi and Avgo)

40

LASUAS

85

86.25

87.5

88.75

90

Baseline Baseline+ ClearNLP C&N’09 C&P’11 MaltParser MSTParser

88.3688.23

89.7489.589.6888.8188.57

86.0386.49

88.0387.7988.1087.1886.94

In-genre experiments

73

74.75

76.5

78.25

80


79.2678.29

79.1879.0879.3678.6078.04

74.4674.1075.3475.2375.50

74.6874.18

Out-of-genre experiments


Dependency Parsing• Accuracy - OntoNotes models (Avgi and Avgo)

41

LASUAS

838485868788


86.7086.40

87.5787.4887.7586.8386.54

83.6684.05

85.4985.4185.6884.7684.51

In-genre experiments

71.5

73.25

75

76.75

78.5


77.9477.5477.4077.4378.05

76.6576.26

73.3073.4773.8673.8374.18

72.7372.37

Out-of-genre experiments


Dependency Parsing• Speed comparison - WSJ models

42

0

5

10

15

20

10 20 30 40 50 60 70 80

Mill

isec

onds

Sentence Length

ClearNLP ClearNLP+ C&N’09 C&P’11 MaltParser

1.16 ms1.61 ms 1.25 ms 1.08 ms 2.14 ms


Dependency Parsing• Speed comparison - OntoNotes models

43

0

5

10

15

20

10 20 30 40 50 60 70 80

Mill

isec

onds

Sentence Length

ClearNLP ClearNLP+ C&N’09 C&P’11 MaltParser

1.28 ms1.89 ms 1.26 ms 1.12 ms 2.14 ms








• Conclusion

44


Semantic Role Labeling• Motivation

- Not all tokens need to be visited for semantic role labeling.

- A typical pruning algorithm does not work as well when automatically generated trees are provided.

- An enhanced pruning algorithm could improve argument coverage while maintaining low average labeling complexity.

• Approach

- Higher-order argument pruning.

- Conditional higher-order argument pruning.

- Positional feature separation.

45


Semantic Role Labeling• Semantic roles in dependency trees

46

ARG0 ARG1 ARG2 ARGM-TMP


Semantic Role Labeling• First-order argument pruning (1st)

- Originally designed for constituent trees.

• Considers only siblings of the predicate, predicate’s ancestors, and siblings of predicate’s ancestors argument candidates (Xue and Palmer, 2004).

- Redesigned for dependency trees.

• Considers only dependents of the predicate, predicate’s ancestors, and dependents of predicate’s ancestors argument candidates (Johansson and Nugues, 2008).

- Covers over 99% of all arguments using gold-standard trees.

- Covers only 93% of all arguments using automatic trees.

47


Semantic Role Labeling• Higher-order argument pruning (High)

- Considers all descendants of the predicate, predicate’s ancestors, and dependents of predicate’s ancestors argument candidates.

- Significantly improves argument coverage when automatically generated trees are used.

48

90

92

94

96

98

100

WSJ-1st ON-1st WSJ-High ON-High Gold-1st Gold-High

99.9299.4498.24

97.59

92.94

91.02

Arg

umen

t C

over

age


Semantic Role Labeling• Conditional higher-order argument pruning (High+)

- Reduces argument candidates using path-rules.

- Before training,

• Collect paths between predicates and their descendants whose subtrees contain arguments of the predicates.

• Collect paths between predicates and their ancestors whose direct dependents or ancestors are arguments of the predicates.

• Cut off paths whose counts are below thresholds.

- During training and decoding, skip tokens and their subtrees or ancestors whose paths to the predicates are not seen.

49


8010 20 30 40 50 60 70

75

0102030405060

Sentence length

# of

can

dida

tes

All

High

High+1st

Using the WSJ models (OntoNotes graph is similar)

Semantic Role Labeling• Average labeling complexity

- The number of tokens visited per predicate.

50


Semantic Role Labeling• Positional feature separation

- Group features by arguments’ positions with respect to their predicates.

- Two sets of features are extracted.

• All features derived from arguments on the lefthand side of the predicates are grouped in one set, SL.

• All features derived from arguments on the righthand side of the predicates are grouped in another set, SR.

- During training, build two models, ML and MR, for SL and SR.

- During decoding, use ML and MR for argument candidates on the lefthand and righthand sides of the predicates.

51


Semantic Role Labeling• Experiments

- Baseline: 1st order argument pruning.

- Baseline+: Baseline with positional feature separation.

- High: higher-order argument pruning.

- All: no argument pruning.

- ClearNLP: conditional higher-order argument pruning.

• Previously called High+.

- ClearParser: Choi and Palmer, 2011b.

52


Semantic Role Labeling• Accuracy - WSJ models (Avgi and Avgo)

53

81.7

82.0

82.3

82.6

Baseline Baseline+ High All ClearNLP ClearParser

82.2682.4282.4882.52

82.28

81.88


70.8

71.1

71.4

71.7

72


71.52

71.8571.9571.90

71.64

71.07



Semantic Role Labeling• Accuracy - OntoNotes models (Avgi and Avgo)

54

80.5

80.9

81.3

81.7


81.6981.5281.4881.51

81.33

80.73


69.7

70.1

70.5

70.9


70.01

70.6870.81

70.6870.54

70.02



Semantic Role Labeling• Speed comparison - WSJ models

- Milliseconds for finding all arguments of each predicate.

55

0

0.75

1.5

2.25

3

10 20 30 40 50 60 70 80

Mill

isec

onds

Sentence Length

ClearNLPClearNLP+Baseline+HighAllClearParser


Semantic Role Labeling• Speed comparison - OntoNotes models

56

0

0.75

1.5

2.25

3

10 20 30 40 50 60 70 80

Mill

isec

onds

Sentence Length

ClearNLPClearNLP+Baseline+HighAllClearParser








• Conclusion

57


Conclusion• Our dependency conversion gives rich dependency

representations and can be applied to most English Treebanks.

• The dynamic model selection runs fast and shows robust POS tagging accuracy across different genres.

• Our parsing algorithm shows linear-time average parsing complexity for generating both proj. and non-proj. trees.

• The bootstrapping technique gives significant improvement on parsing accuracy.

• The higher-order argument pruning gives significant improvement on argument coverage.

• The conditional higher-order argument pruning reduces average labeling complexity without compromising the F1-score.

58


Conclusion• Contributions

- First time that these three components have been evaluated together on such a wide variety of English data.

- Maintained high level accuracy while improving efficiency, modularity, and portability of these components.

- Dynamic model selection and bootstrapping can be generally applicable for tagging and parsing, respectively.

- Processing all three components take about 2.49 - 2.69 ms (tagging: 0.36 - 0.37, parsing: 1.16 - 1.28, labeling: 0.97 - 1.04).

- All components are publicly available as an open source project, called ClearNLP (clearnlp.googlecode.com).

59


http://clearnlp.googlecode.com

http://clearnlp.googlecode.com

Conclusion• Future work

- Integrate the dynamic model selection approach with more sophisticated tagging algorithms.

- Evaluate our parsing approach on languages containing more non-projective dependency trees.

- Improve semantic role labeling where the quality of input parse trees is poor (using joint-inference).

60


Acknowledgment • We gratefully acknowledge the support of the following grants. Any contents

expressed in this material are those of the authors and do not necessarily reflect the views of any grant.

- The National Science Foundation Grants IIS-0325646, Domain Independent Semantic Parsing, CISE-CRI-0551615, Towards a Comprehensive Linguistic Annotation, CISE-CRI 0709167, Collaborative: A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu, CISE- IIS-RI-0910992, Richer Representations for Machine Translation.

- A grant from the Defense Advanced Research Projects Agency (DARPA/IPTO) under the GALE program, DARPA/CMO Contract No. HR0011-06-C-0022, subcontract from BBN, Inc.

- A subcontract from the Mayo Clinic and Harvard Children’s Hospital based on a grant from the ONC, 90TR0002/01.

- Strategic Health Advanced Research Project Area 4: Natural Language Processing.

61


Acknowledgment• Special thanks are due to

- Martha Palmer for practically being my mom for 5 years.

- James Martin for always encouraging me when I’m low.

- Wayne Ward for wonderful smiles.

- Bhuvana Narasimhan for bringing Hindi to my life.

- Joakim Nivre for suffering under millions of my questions.

- Nicolas Nicolov for making me feel normal when others call me “workaholic”.

- All CINC folks for letting me live (literally) at my cube.

62


References• Jinho D. Choi and Nicolas Nicolov. K-best, Locally Pruned, Transition-based Dependency Parsing Using

Robust Risk Minimization. In Recent Advances in Natural Language Processing V, pages 205–216. John Benjamins, 2009.

• Jinho D. Choi and Martha Palmer. Getting the Most out of Transition-based Dependency Parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics:Human Language Technologies, ACL:HLT’11, pages 687–692, 2011a.

• Jinho D. Choi and Martha Palmer. Transition-based Semantic Role Labeling Using Predicate Argument Clustering. In Proceedings of ACL workshop on Relational Models of Semantics, RELMS’11, pages 37–45, 2011b.

• M. Cmejrek, J. Curín, and J. Havelka. Prague Czech-English Dependency Treebank: Any Hopes for a Common Annotation Scheme? In HLT-NAACL’04 workshop on Frontiers in CorpusAnnotation, pages 47–54, 2004.

• Jesús Giménez and Lluís Màrquez. SVMTool: A general POS tagger generator based on Support Vector Machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC’04, 2004.

• Richard Johansson and Pierre Nugues. Dependency-based Semantic Role Labeling of PropBank. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing(EMNLP’08), pages 69–78, 2008.

63


References• Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. A Dual Coordinate

Descent Method for Large-scale Linear SVM. In Proceedings of the 25th international conference on Machine learning, ICML’08, pages 408–415, 2008.

• Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

• Marie-Catherine de Marneffe and Christopher D. Manning. The Stanford typed dependencies representation. In Proceedings of the COLING workshop on Cross-Framework and Cross-DomainParser Evaluation, 2008a.

• Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. Non-projective Dependency Parsing using Spanning Tree Algorithms. In Proceedings of the Conference on Human LanguageTechnology and Empirical Methods in Natural Language Processing (HLT-EMNLP’05), pages523–530, 2005.

• Rodney D. Nielsen, James Masanz, Philip Ogren, Wayne Ward, James H. Martin, Guergana Savova, and Martha Palmer. An architecture for complex clinical question answering. In Proceedings of the 1st ACM International Health Informatics Symposium, IHI’10, pages 395–399, 2010.

• Joakim Nivre. An Efficient Algorithm for Projective Dependency Parsing. In Proceedings of the 8th International Workshop on Parsing Technologies, IWPT’03, pages 149–160, 2003.

• Joakim Nivre. Algorithms for deterministic incremental dependency parsing. Computational Linguistics, 34(4):513–553, 2008.

64


References• Joakim Nivre. Non-Projective Dependency Parsing in Expected Linear Time. In Proceedings of the Joint

Conference of the 47th Annual Meeting of the ACL and the 4th International JointConference on Natural Language Processing of the AFNLP (ACL-IJCNLP’09), pages 351–359,2009.

• Owen Rambow, Cassandre Creswell, Rachel Szekely, Harriet Taber, and Marilyn Walker. A Dependency Treebank for English. In Proceedings of the 3rd International Conference on LanguageResources and Evaluation (LREC’02), 2002.

• Ralph Weischedel, Eduard Hovy, Martha Palmer, Mitch Marcus, Robert Belvin, Sameer Pradhan, Lance Ramshaw, and Nianwen Xue. OntoNotes: A Large Training Corpus for Enhanced Processing. In Joseph Olive, Caitlin Christianson, and John McCary, editors, Handbook of NaturalLanguage Processing and Machine Translation. Springer, 2011.

• Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics on HumanLanguage Technology, NAACL’03, pages 173–180, 2003.

• Nianwen Xue and Martha Palmer. Calibrating Features for Semantic Role Labeling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2004.

65


optimization of nlp components for robustness and scalability

Technology

dependency parser

converted dependency

semantic role labeler

best cities

nn cc nn prp vb wdt

brainiest cities

optimization of nlp

snpvp np nn ccnn wdt