Optimization of NLP Components for Robustness and Scalability
Dissertation Defense, August 15, 2012
Jinho D. Choi ([email protected])
Department of Computer Science, University of Colorado Boulder
Friday, August 17, 2012
Ever since I came, Boulder has been ...
• #1: Top 10 College Towns (Livability, 2012)
• #1: Top 10 Least Obese Metro Areas (Gallup Healthways, 2012)
• #1: Top 10 Happiest Cities (Gallup Healthways, 2012)
• #1: The 10 Most Educated U.S. Cities (US News, 2011)
• #1: America’s 15 Most Active Cities (Time - Healthland, 2011)
• #1: Best Quality of Life in America (Portfolio, 2011)
• #1: 20 Brainiest Cities in America (Daily Beast, 2010)
• #1: Western Cities Fare Best in Well-being (USA Today, 2010)
• #1: America's Foodiest Town (Bon Appétit, 2010)
• #1: The Best Cities to Raise an Outdoor Kid (Backpacker, 2009)
• #1: America's Top 25 Towns To Live Well (Forbes, 2009)
• #1: America's Smartest Cities (Forbes, 2008)
• #1: Top Heart Friendly Cities (American Heart Association, 2008)
2
Contents
• Introduction
• Dependency conversion
• Experimental setup
• Part-of-speech tagging
• Dependency parsing
• Semantic role labeling
• Conclusion
3
Introduction
• The application of NLP has ...
- Expanded to everyday computing.
- Broadened to a general audience.
‣ More attention is drawn to the practical aspects of NLP.
• NLP components should be tested for
- Robustness in handling heterogeneous data.
• Need to be evaluated on data from several different sources.
- Scalability in handling a large amount of data.
• Need to be evaluated for speed and complexity.
4
Introduction
• Research question
- How to improve the robustness and scalability of standard NLP components.
• Goals
- To prepare gold-standard data from several different sources for in-genre and out-of-genre experiments.
- To develop a POS tagger, a dependency parser, and a semantic role labeler showing robust results across this data.
- To reduce average complexities of these components while retaining good performance in accuracy.
5
Introduction
• Thesis statement
1. We improve the robustness of three NLP components:
• POS tagger: by building a generalized model.
• Dependency parser: by bootstrapping parse information.
• Semantic role labeler: by applying higher-order argument pruning.
2. We improve the scalability of these three components:
• POS tagger: by adapting dynamic model selection.
• Dependency parser: by optimizing the engineering of transition-based parsing algorithms.
• Semantic role labeler: by applying conditional higher-order argument pruning.
6
Introduction
7
[Pipeline diagram: Constituent Treebanks + PropBanks → Dependency Conversion → Training Set and Evaluation Set (dependency trees + semantic roles) → Part-of-speech Trainer → Part-of-speech Tagging Model → Part-of-speech Tagger → Dependency Trainer → Dependency Parsing Model → Dependency Parser → Semantic Role Trainer → Semantic Role Labeling Model → Semantic Role Labeler]
Contents
• Introduction
• Dependency conversion
• Experimental setup
• Part-of-speech tagging
• Dependency parsing
• Semantic role labeling
• Conclusion
8
Dependency Conversion
• Motivation
- A small amount of manually annotated dependency trees (Rambow et al., 2002; Cmejrek et al., 2004).
- A large amount of manually annotated constituent trees (Marcus et al., 1993; Weischedel et al., 2011).
- Converting constituent trees into dependency trees → a large amount of pseudo-annotated dependency trees.
• Previous approaches
- Penn2Malt (stp.lingfil.uu.se/~nivre/research/Penn2Malt.html).
- LTH converter (Johansson and Nugues, 2007).
- Stanford converter (de Marneffe and Manning, 2008a).
9
Dependency Conversion
• Comparison
- The Stanford and CLEAR dependency approaches generate 3.62% and 0.23% of unclassified dependencies, respectively.
- Our conversion produces 3.69% of non-projective trees.
10
Converter:         Penn2Malt | LTH | Stanford | CLEAR
Labels:            Malt | CoNLL | Stanford | Stanford+
Long-distance DPs: ✓ ✓ ✓ ✓ ✓ ✓
Secondary DPs:     ✓ ✓ ✓ ✓ ✓ ✓ ✓
Function Tags:     ✓ ✓ ✓ ✓
New TB Format:     NO | NO | NO | YES
Maintenance:       NO | NO | YES | YES
Dependency Conversion (1/6)
1. Input a constituent tree.
• Penn, OntoNotes, CRAFT, MiPACQ, and SHARP Treebanks.
11
[Constituent tree for a phrase like "peace and joy that we want": an NP modified by an SBAR whose WHNP-1 ("that") is co-indexed with the trace *T*-1 in the object position of "want"]
Dependency Conversion (2/6)
2. Reorder constituents related to empty categories.
• *T*: wh-movement and topicalization.
• *RNR*: right node raising.
• *ICH* and *PPA*: discontinuous constituent.
12
[The same constituent tree before and after reordering: the empty category *T*-1 is removed and WHNP-1 ("that") is reordered into its position]
Dependency Conversion (3/6)
3. Handle special cases.
• Apposition, coordination, and small clauses.
13
[The same tree with coordination converted first: cc and conj dependencies attach the conjunction and the second conjunct]
The original word order is preserved in the converted dependency tree.
Dependency Conversion (4/6)
4. Handle general cases.
• Head-finding rules and heuristics.
14
[The fully converted dependency tree, with labels root, cc, conj, rcmod, nsubj, and dobj]
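Head-finding rules, the core of the "general cases" step, can be sketched as an ordered search over a constituent's children. The rule table and tree encoding below are illustrative assumptions, not the actual CLEAR conversion rules:

```python
# Minimal sketch of head-rule application for converting a constituent
# into a dependency head/dependent structure. HEAD_RULES is a toy table.

HEAD_RULES = {
    # phrase label -> (search direction, ordered list of preferred child labels)
    "NP":   ("right-to-left", ["NN", "NNS", "NNP", "NP"]),
    "VP":   ("left-to-right", ["VB", "VBD", "VBZ", "VP"]),
    "SBAR": ("left-to-right", ["S", "WHNP"]),
    "S":    ("left-to-right", ["VP", "S"]),
}

def find_head(label, children):
    """Return the index of the head child of a constituent.

    children: list of (label, word_or_None) pairs.
    Falls back to the leftmost child when no rule matches (a heuristic).
    """
    direction, preferred = HEAD_RULES.get(label, ("left-to-right", []))
    order = range(len(children)) if direction == "left-to-right" \
        else range(len(children) - 1, -1, -1)
    for target in preferred:          # earlier labels have higher priority
        for i in order:
            if children[i][0] == target:
                return i
    return 0

# For S -> NP("we") VP("want"), the VP child is the head.
print(find_head("S", [("NP", "we"), ("VP", "want")]))  # -> 1
```

All non-head children then become dependents of the head child's lexical head, which yields dependencies such as nsubj(want, we).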
Dependency Conversion (5/6)
5. Add secondary dependencies.
• Gapping, referent, right node raising, open clausal subject.
15
[The same dependency tree with a secondary dependency added: ref links the relative pronoun "that" to its referent]
Dependency Conversion (6/6)
6. Add function tags.
16
Appendix A: Constituent Treebank Tags
This appendix shows tags used in various constituent Treebanks for English (Marcus et al., 1993; Nielsen et al., 2010; Weischedel et al., 2011; Verspoor et al., 2012). Tags followed by * are not typical Penn Treebank tags but are used in some other Treebanks.
A.1 Function tags
Syntactic roles:
  ADV  Adverbial                     PUT  Locative complement of "put"
  CLF  It-cleft                      PRD  Non-VP predicate
  CLR  Closely related constituent   RED* Reduced auxiliary
  DTV  Dative                        SBJ  Surface subject
  LGS  Logical subject in passive    TPC  Topicalization
  NOM  Nominalization
Semantic roles:
  BNF  Benefactive   MNR  Manner
  DIR  Direction     PRP  Purpose or reason
  EXT  Extent        TMP  Temporal
  LOC  Locative      VOC  Vocative
Text and speech categories:
  ETC  Et cetera     SEZ  Direct speech
  FRM* Formula       TTL  Title
  HLN  Headline      UNF  Unfinished constituent
  IMP  Imperative
Table A.1: A list of function tags for English.
Contents
• Introduction
• Dependency conversion
• Experimental setup
• Part-of-speech tagging
• Dependency parsing
• Semantic role labeling
• Conclusion
17
Experimental Setup
• The Wall Street Journal (WSJ) models
- Train
• The WSJ 2-21 in OntoNotes (Weischedel et al., 2011).
• Total: 30,060 sentences, 731,677 tokens, 77,826 predicates.
- In-genre evaluation (Avgi)
• The WSJ 23 in OntoNotes.
• Total: 1,640 sentences, 39,590 tokens, 4,138 predicates.
- Out-of-genre evaluation (Avgo)
• 5 genres in OntoNotes, 2 genres in MiPACQ (Nielsen et al., 2010), 1 genre in SHARP.
• Total: 19,368 sentences, 265,337 tokens, 32,142 predicates.
18
Experimental Setup
• The OntoNotes models
- Train
• 6 genres in OntoNotes.
• Total: 96,406 sentences, 1,983,012 tokens, 213,695 predicates.
- In-genre evaluation (Avgi)
• 6 genres in OntoNotes.
• Total: 13,337 sentences, 201,893 tokens, 25,498 predicates.
- Out-of-genre evaluation (Avgo)
• Same 2 genres in MiPACQ, same 1 genre in SHARP.
• Total: 7,671 sentences, 103,034 tokens, 10,782 predicates.
19
Experimental Setup
• Accuracy
- Part-of-speech tagging
• Accuracy.
- Dependency parsing
• Labeled attachment score (LAS).
• Unlabeled attachment score (UAS).
- Semantic role labeling
• F1-score of argument identification.
• F1-score of both argument identification and classification.
20
Experimental Setup
• Speed
- All experiments are run on an Intel Xeon 2.57GHz machine.
- Each model is run 5 times; the reported speed is the average of the middle 3 runs.
• Machine learning algorithm
- Liblinear: L2-regularized, L1-loss SVM classification (Hsieh et al., 2008).
- Designed to handle large scale, high dimensional vectors.
- Runs fast with accurate performance.
- Our implementation of LibLinear is publicly available.
21
Contents
• Introduction
• Dependency conversion
• Experimental setup
• Part-of-speech tagging
• Dependency parsing
• Semantic role labeling
• Conclusion
22
Part-of-Speech Tagging
• Motivation
- Supervised learning approaches do not perform well in out-of-genre experiments.
- Domain adaptation approaches require prior knowledge of the incoming data.
- Complicated tagging or learning approaches often run slowly during decoding.
• Dynamic model selection
- Build two models, generalized and domain-specific, given one set of training data.
- Dynamically select one of the models during decoding.
23
Part-of-Speech Tagging
• Training
1. Group training data into documents (e.g., sections in WSJ).
2. Get the document frequency of each simplified word form.
• In simplified word forms, all numerical expressions, with or without special characters, are converted to 0.
3. Build a domain-specific model using features extracted only from tokens whose DF(SW) > 1.
4. Build a generalized model using features extracted only from tokens whose DF(SW) > 2.
5. Find the cosine similarity threshold for dynamic model selection.
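Steps 2-4 above can be sketched as follows. The simplification regex and the helper names (`simplify`, `document_frequency`) are illustrative assumptions, not the ClearNLP implementation:

```python
import re
from collections import Counter

def simplify(word):
    """Simplified word form: collapse numerical expressions, with or
    without special characters, to '0' (a toy version of the rule)."""
    w = re.sub(r"\d+([.,/:-]\d+)*", "0", word.lower())
    return re.sub(r"0+", "0", w)

def document_frequency(documents):
    """DF(SW): the number of documents each simplified word form occurs in.
    documents: list of documents, each a list of tokenized sentences."""
    df = Counter()
    for doc in documents:
        df.update({simplify(tok) for sent in doc for tok in sent})
    return df

# Two feature vocabularies are cut from one training set:
documents = [[["The", "index", "rose", "1.2", "%"]],
             [["The", "index", "fell", "0.8", "%"]],
             [["The", "markets", "were", "closed"]]]
df = document_frequency(documents)
domain_vocab  = {w for w, c in df.items() if c > 1}  # domain-specific model
general_vocab = {w for w, c in df.items() if c > 2}  # generalized model
```

The generalized model's higher cutoff keeps only word forms seen across many documents, which is what makes it less tied to the training genre.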
24
Part-of-Speech Tagging
• Cosine similarity threshold
- During cross-validation, collect the cosine similarities between the simplified word forms used to build the domain-specific model and the input sentences on which the domain-specific model shows an advantage.
- The cosine similarity at the lower 5% boundary of this distribution becomes the threshold for dynamic model selection.
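Picking the threshold amounts to taking a lower percentile of the collected similarities; a minimal sketch (the function name and the toy values are assumptions):

```python
def selection_threshold(similarities, lower_fraction=0.05):
    """Return the cosine-similarity value at the lower 5% boundary of the
    similarities collected during cross-validation (sentences on which
    the domain-specific model won)."""
    ranked = sorted(similarities)
    cut = max(0, int(len(ranked) * lower_fraction) - 1)
    return ranked[cut]

# 20 toy similarity values; 5% of 20 puts the cut at the lowest value.
sims = [0.010, 0.012, 0.020, 0.021, 0.025, 0.026, 0.028, 0.030, 0.031, 0.033,
        0.036, 0.037, 0.040, 0.041, 0.042, 0.044, 0.047, 0.050, 0.052, 0.060]
threshold = selection_threshold(sims)
```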
25
[Histogram of cosine-similarity occurrences (0 to 0.06); the threshold is taken at the lower 5% boundary of the distribution]
Part-of-Speech Tagging
• Decoding
- Measure the cosine similarity between the simplified word forms used to build the domain-specific model and each input sentence.
- If the similarity is greater than the threshold, use the domain-specific model.
- Otherwise, use the generalized model.
26
Runs as fast as a single model approach.
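Treating the vocabulary and the sentence as binary bags of simplified word forms, decoding is one set intersection plus one comparison per sentence; a sketch under those assumptions:

```python
import math

def cosine(vocab, sentence_tokens):
    """Cosine similarity between the domain vocabulary and a sentence,
    both treated as binary bags of simplified word forms."""
    sent = set(sentence_tokens)
    if not vocab or not sent:
        return 0.0
    return len(vocab & sent) / (math.sqrt(len(vocab)) * math.sqrt(len(sent)))

def select_model(vocab, sentence_tokens, threshold, domain_model, general_model):
    """Dynamic model selection: one similarity computation per sentence,
    so decoding runs as fast as a single-model tagger."""
    sim = cosine(vocab, sentence_tokens)
    return domain_model if sim > threshold else general_model
```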
Part-of-Speech Tagging
• Experiments
- Baseline: using the original word forms.
- Baseline+: using lowercase simplified word forms.
- Domain: domain-specific model.
- General: generalized model.
- ClearNLP: dynamic model selection.
- Stanford: Toutanova et al., 2003.
- SVMTool: Giménez and Màrquez, 2004.
27
Friday, August 17, 2012
Part-of-Speech Tagging
• Accuracy - WSJ models (Avgi and Avgo)
28
[Bar charts: in-domain accuracies range 96.93-97.41 and out-of-domain accuracies range 88.25-90.79 across Baseline, Baseline+, Domain, General, ClearNLP, Stanford, and SVMTool]
Part-of-Speech Tagging
• Accuracy - OntoNotes models (Avgi and Avgo)
29
[Bar charts: in-domain accuracies range 96.19-96.58 and out-of-domain accuracies range 86.79-89.26 across Baseline, Baseline+, Domain, General, ClearNLP, Stanford, and SVMTool]
Part-of-Speech Tagging
• Speed comparison
30
Model                    Tokens per sec.   Millisecs. per sen.
WSJ - ClearNLP                32,654              0.44
WSJ - ClearNLP+               39,491              0.37
WSJ - Stanford                   250             58.06
WSJ - SVMTool                  1,058             13.71
OntoNotes - ClearNLP          32,206              0.45
OntoNotes - ClearNLP+         39,882              0.36
OntoNotes - Stanford             136            106.34
OntoNotes - SVMTool              924             15.71
• ClearNLP : as reported in the thesis.
• ClearNLP+: new improved results.
Contents
• Introduction
• Dependency conversion
• Experimental setup
• Part-of-speech tagging
• Dependency parsing
• Semantic role labeling
• Conclusion
31
Dependency Parsing
• Goals
1. To improve the average parsing complexity for non-projective dependency parsing.
2. To reduce the discrepancy between dynamic features extracted while training on gold-standard trees and those extracted while decoding automatic trees.
3. To ensure well-formed dependency graph properties.
• Approach
1. Combine transitions in both projective and non-projective dependency parsing algorithms.
2. Bootstrap dynamic features during training.
3. Post-process.
32
Dependency Parsing
• Transition decomposition
- Decompose transitions in:
• Nivre’s arc-eager algorithm (projective; Nivre, 2003).
• Nivre’s list-based algorithm (non-projective; Nivre, 2008).
33
5.2 Transition-based dependency parsing
5.2.1 Transition decomposition
Table 5.1 shows the functional decomposition of transitions used in Nivre's arc-eager and Covington's algorithms. Nivre's arc-eager algorithm is a projective parsing algorithm with a worst-case parsing complexity of O(n) (Nivre, 2003). Covington's algorithm is a non-projective parsing algorithm with a worst-case parsing complexity of O(n^2) without backtracking (Covington, 2001). Covington's algorithm was later formulated as a transition-based parsing algorithm by Nivre (2008), called Nivre's list-based algorithm. Table 5.3 shows the relation between the decomposed transitions in Table 5.1 and the transitions from the original algorithms.
Table 5.1: Decomposed transitions grouped into the Arc and List operations.
Arc operations:
  Left-∗l :    ( [λ1|i], λ2, [j|β], A ) ⇒ ( [λ1|i], λ2, [j|β], A ∪ {i ←l j} )
  Right-∗l:    ( [λ1|i], λ2, [j|β], A ) ⇒ ( [λ1|i], λ2, [j|β], A ∪ {i →l j} )
  No-∗    :    ( [λ1|i], λ2, [j|β], A ) ⇒ ( [λ1|i], λ2, [j|β], A )
List operations:
  ∗-Shift(d|n): ( [λ1|i], λ2, [j|β], A ) ⇒ ( [λ1|i|λ2|j], [ ], β, A )
  ∗-Reduce    : ( [λ1|i], λ2, [j|β], A ) ⇒ ( λ1, λ2, [j|β], A )
  ∗-Pass      : ( [λ1|i], λ2, [j|β], A ) ⇒ ( λ1, [i|λ2], [j|β], A )
Table 5.2: Preconditions of the decomposed transitions in Table 5.1.
  Left-∗l :     [i ≠ 0] ∧ ¬[∃k. (i ← k) ∈ A] ∧ ¬[(i →∗ j) ∈ A]
  Right-∗l:     ¬[∃k. (k → j) ∈ A] ∧ ¬[(i ←∗ j) ∈ A]
  No-∗    :     ¬[∃l. Left-∗l ∨ Right-∗l]
  ∗-Shift(d|n): [λ1 = [ ]]d ∨ ¬[∃k ∈ λ1. (k ≠ i) ∧ ((k ← j) ∨ (k → j))]n
  ∗-Reduce    : [∃h. (h → i) ∈ A] ∧ ¬[∃k ∈ β. (i → k)]
  ∗-Pass      : ¬[∗-Shift(d|n) ∨ ∗-Reduce]
Some preconditions must be satisfied to ensure the properties of a well-formed dependency graph (Section 2.1.2.1). Parsing states are represented as tuples (λ1, λ2, β, A), where λ1 and λ2 are lists of partially processed tokens, β is a list of the remaining unprocessed tokens, and A is a set of labeled arcs representing the dependencies found so far.
This decomposition makes it easier to integrate transitions from different parsing algorithms.
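A minimal executable sketch of the state tuple and decomposed transitions above (arcs stored as (head, label, dependent) triples; an illustrative reconstruction, not the ClearNLP implementation, and preconditions are omitted):

```python
# State: (lambda1, lambda2, beta, arcs); i = top of lambda1, j = front of beta.

def left_arc(state, label):
    """Left-*: add arc i <-l- j (j becomes the head of i)."""
    l1, l2, beta, arcs = state
    i, j = l1[-1], beta[0]
    return (l1, l2, beta, arcs | {(j, label, i)})

def right_arc(state, label):
    """Right-*: add arc i -l-> j (i becomes the head of j)."""
    l1, l2, beta, arcs = state
    i, j = l1[-1], beta[0]
    return (l1, l2, beta, arcs | {(i, label, j)})

def shift(state):
    """*-Shift: move lambda2 and j onto lambda1, emptying lambda2."""
    l1, l2, beta, arcs = state
    return (l1 + l2 + [beta[0]], [], beta[1:], arcs)

def reduce_(state):
    """*-Reduce: pop the top of lambda1."""
    l1, l2, beta, arcs = state
    return (l1[:-1], l2, beta, arcs)

def pass_(state):
    """*-Pass: move the top of lambda1 to the front of lambda2."""
    l1, l2, beta, arcs = state
    return (l1[:-1], [l1[-1]] + l2, beta, arcs)

def left_reduce(state, label):
    """Recomposed transition: the Arc operation first, then the List one."""
    return reduce_(left_arc(state, label))

# Parsing "we want": 0 = artificial root, 1 = "we", 2 = "want".
state = ([0, 1], [], [2], set())
state = left_reduce(state, "nsubj")      # we <-nsubj- want   (Left-Reduce)
state = shift(right_arc(state, "root"))  # root -root-> want  (Right-Shift)
```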
Dependency Parsing
• Transition recomposition
- Any combination of two decomposed transitions, one from each operation, can be recomposed.
- For each recomposed transition, the ARC operation is performed first and the LIST operation second.
34
Nivre's list-based algorithm using λ2). Section 5.2.2 shows how these decomposed transitions can be recomposed into transitions used in several different dependency parsing algorithms.
5.2.2 Transition recomposition
Any combination of two decomposed transitions in Table 5.1, one from each operation, can be recomposed into a new transition. For instance, the combination of Left-∗l and ∗-Reduce makes a transition, Left-Reducel, which performs Left-∗l and ∗-Reduce sequentially; the Arc operation is always performed before the List operation. Table 5.3 shows how these decomposed transitions are recomposed into transitions used in different dependency parsing algorithms.
Table 5.3: Transitions in different dependency parsing algorithms. The last column shows transitions used in our parsing algorithm. The other columns show transitions used in Nivre (2003), Covington (2001), Nivre (2008), and Choi and Palmer (2011a), respectively.
Transition     Nivre'03  Covington'01  Nivre'08  C&P'11  This work
Left-Reducel      ✓                                 ✓        ✓
Left-Passl                    ✓           ✓         ✓        ✓
Right-Shiftnl     ✓                                          ✓
Right-Passl                   ✓           ✓         ✓        ✓
No-Shiftd         ✓           ✓           ✓         ✓        ✓
No-Shiftn         ✓                       ✓         ✓        ✓
No-Reduce         ✓                                          ✓
No-Pass                       ✓           ✓         ✓        ✓
Nivre's arc-eager algorithm allows no combination with ∗-Pass, which removes or skips tokens that can violate the projective property (Nivre'03 in Table 5.3). As a result, this algorithm performs at most 2n − 1 transitions during parsing, and can produce only projective dependency trees.² Covington's algorithm allows no combination with ∗-Shiftn or ∗-Reduce, which inevitably compares each token with all tokens prior to it (Covington'01). Thus, this algorithm performs n(n+1)/2 transitions during parsing, and can produce both projective and non-projective dependency trees.
The last three algorithms in Table 5.3 show cumulative updates to Covington's algorithm; they add one or two transitions from Nivre's arc-eager algorithm to Covington's algorithm.
² The ∗-Shiftd transitions are not counted because they do not require comparison between word tokens.
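The transition counts quoted above can be checked with a line of arithmetic, here for a sentence of n = 80 tokens (∗-Shiftd excluded, as in the footnote):

```python
# Worst-case transition counts per sentence of n tokens.

n = 80
arc_eager = 2 * n - 1          # Nivre'03: at most 2n - 1 transitions
covington = n * (n + 1) // 2   # Covington'01: n(n+1)/2 transitions

print(arc_eager)   # 159
print(covington)   # 3240
```

The gap between a linear and a quadratic transition count is what the recomposed algorithm tries to close on average while still allowing non-projective trees.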
Dependency Parsing
• Average parsing complexity
- The number of transitions performed per sentence.
35
[Plots of transitions performed vs. sentence length (10-80): Covington'01 grows quadratically (up to ~2850 transitions), while Nivre'08, C&P'11, and this work remain near-linear (at most ~330)]
Dependency Parsing
• Bootstrapping
- Transition-based dependency parsing can take advantage of dynamic features (e.g., head, leftmost/rightmost dependent).
- Features extracted from gold-standard trees during training can be different from features extracted from automatic trees during decoding.
- By bootstrapping these dynamic features, we can significantly improve parsing accuracy.
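The bootstrapping loop can be sketched as follows. `train`, `parse`, and `evaluate` are hypothetical stand-ins for the real learner, parser, and cross-validation score, and the memorizing "model" in the demo is a toy:

```python
def bootstrap(sents, gold_trees, train, parse, evaluate, max_iter=5):
    """Iteration 0 trains on dynamic features from gold trees; each later
    iteration re-parses the training data with the previous model so that
    dynamic features (head, leftmost/rightmost dependent) come from
    automatic trees, narrowing the train/decode discrepancy."""
    trees = gold_trees
    best_model, best_score = None, -1.0
    for _ in range(max_iter):
        model = train(sents, trees)
        trees = [parse(model, s) for s in sents]   # automatic trees
        score = evaluate(trees, gold_trees)        # cross-validation in reality
        if score <= best_score:                    # stop when no longer improving
            break
        best_model, best_score = model, score
    return best_model

# Toy stand-ins: a "model" that memorizes its training trees.
train = lambda sents, trees: dict(zip(map(tuple, sents), trees))
parse = lambda model, s: model[tuple(s)]
evaluate = lambda pred, gold: sum(p == g for p, g in zip(pred, gold)) / len(gold)

sents = [["we", "want"]]
gold = [{1: 2, 2: 0}]                # token -> head
model = bootstrap(sents, gold, train, parse, evaluate)
```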
36
Dependency Parsing
37
[Flow diagram of the bootstrapping procedure: the training data supplies gold-standard labels and, in the first round, gold-standard features; later rounds feed automatic features from the previous statistical model back into the machine learning algorithm, stopping at a point determined by cross-validation]
Dependency Parsing
• Post-processing
- Transition-based dependency parsing does not guarantee that the parse output is a tree.
- After parsing, we find the head of each headless token by scoring it against all other tokens using the same model.
- The predicted head with the highest score that does not break the tree properties becomes the head of this token.
- This post-processing technique significantly improves parsing accuracy in out-of-genre experiments.
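The post-processing step can be sketched as a constrained argmax over candidate heads. `score(head, dep)` is a hypothetical stand-in for the parsing model's score:

```python
# Attach each headless token to the highest-scoring head that keeps the
# graph a tree (single head per token, no cycles).

def creates_cycle(heads, head, dep):
    """True if attaching dep -> head would create a cycle."""
    node = head
    while node is not None:
        if node == dep:
            return True
        node = heads.get(node)
    return False

def attach_headless(tokens, heads, score):
    """heads: dict dep -> head for tokens that already have one; 0 = root."""
    for dep in tokens:
        if dep in heads:
            continue
        candidates = [h for h in [0] + tokens
                      if h != dep and not creates_cycle(heads, h, dep)]
        heads[dep] = max(candidates, key=lambda h: score(h, dep))
    return heads

# Toy scorer preferring nearby heads; tokens 2 and 3 are headless.
print(attach_headless([1, 2, 3], {1: 2}, lambda h, d: -abs(h - d)))
# -> {1: 2, 2: 3, 3: 0}
```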
38
Dependency Parsing
• Experiments
- Baseline: using all recomposed transitions.
- Baseline+: Baseline with post-processing.
- ClearNLP: Baseline+ with bootstrapping.
- C&N’09: Choi and Nicolov, 2009.
- C&P’11: Choi and Palmer, 2011a.
- MaltParser: Nivre, 2009.
- MSTParser: McDonald et al., 2005.
• All models use only 1st-order features; with 2nd-order features, accuracy is expected to be higher and speed slower.
39
Dependency Parsing
• Accuracy - WSJ models (Avgi and Avgo)
40
[Bar charts of LAS and UAS: in-genre UAS 88.23-89.74 and LAS 86.03-88.10; out-of-genre UAS 78.04-79.36 and LAS 74.10-75.50, across Baseline, Baseline+, ClearNLP, C&N'09, C&P'11, MaltParser, and MSTParser]
Dependency Parsing
• Accuracy - OntoNotes models (Avgi and Avgo)
41
[Bar charts of LAS and UAS: in-genre UAS 86.40-87.75 and LAS 83.66-85.68; out-of-genre UAS 76.26-78.05 and LAS 72.37-74.18, across Baseline, Baseline+, ClearNLP, C&N'09, C&P'11, MaltParser, and MSTParser]
Dependency Parsing
• Speed comparison - WSJ models
42
[Plot of milliseconds vs. sentence length (10-80); average ms per sentence: ClearNLP 1.61, ClearNLP+ 1.16, C&N'09 1.25, C&P'11 1.08, MaltParser 2.14]
Dependency Parsing
• Speed comparison - OntoNotes models
43
[Plot of milliseconds vs. sentence length (10-80); average ms per sentence: ClearNLP 1.89, ClearNLP+ 1.28, C&N'09 1.26, C&P'11 1.12, MaltParser 2.14]
Contents
• Introduction
• Dependency conversion
• Experimental setup
• Part-of-speech tagging
• Dependency parsing
• Semantic role labeling
• Conclusion
44
Semantic Role Labeling
• Motivation
- Not all tokens need to be visited for semantic role labeling.
- A typical pruning algorithm does not work as well when automatically generated trees are provided.
- An enhanced pruning algorithm could improve argument coverage while maintaining low average labeling complexity.
• Approach
- Higher-order argument pruning.
- Conditional higher-order argument pruning.
- Positional feature separation.
45
Semantic Role Labeling
• Semantic roles in dependency trees
46
[Example dependency tree annotated with the semantic roles ARG0, ARG1, ARG2, and ARGM-TMP]
Semantic Role Labeling
• First-order argument pruning (1st)
- Originally designed for constituent trees.
• Considers only the siblings of the predicate, the predicate's ancestors, and the siblings of the predicate's ancestors as argument candidates (Xue and Palmer, 2004).
- Redesigned for dependency trees.
• Considers only the dependents of the predicate, the predicate's ancestors, and the dependents of the predicate's ancestors as argument candidates (Johansson and Nugues, 2008).
- Covers over 99% of all arguments using gold-standard trees.
- Covers only about 93% of all arguments using automatic trees.
47
Semantic Role Labeling
• Higher-order argument pruning (High)
- Considers all descendants of the predicate, the predicate's ancestors, and the dependents of the predicate's ancestors as argument candidates.
- Significantly improves argument coverage when automatically generated trees are used.
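The difference between the two pruning schemes can be sketched over a tree stored as a dependent-to-head map (the encoding and helper names are illustrative assumptions):

```python
# Trees are dicts mapping dependent -> head; 0 is the artificial root.

def dependents(tree, head):
    return [d for d, h in tree.items() if h == head]

def descendants(tree, head):
    out = []
    for d in dependents(tree, head):
        out += [d] + descendants(tree, d)
    return out

def ancestors(tree, node):
    out = []
    while tree.get(node, 0) != 0:
        node = tree[node]
        out.append(node)
    return out

def first_order(tree, pred):
    """Dependents of the predicate, its ancestors, and their dependents."""
    cands = set(dependents(tree, pred))
    for a in ancestors(tree, pred):
        cands |= {a} | set(dependents(tree, a))
    return cands - {pred}

def higher_order(tree, pred):
    """All descendants of the predicate, its ancestors, and the ancestors'
    dependents: a wider net, so arguments mis-attached deep inside the
    predicate's subtree by an automatic parse are still reachable."""
    cands = set(descendants(tree, pred))
    for a in ancestors(tree, pred):
        cands |= {a} | set(dependents(tree, a))
    return cands - {pred}

tree = {1: 3, 2: 1, 3: 0, 4: 3, 5: 4}
print(first_order(tree, 3))   # direct dependents only
print(higher_order(tree, 3))  # also reaches the nested tokens 2 and 5
```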
48
[Bar chart of argument coverage (%): WSJ-1st 92.94, ON-1st 91.02, WSJ-High 98.24, ON-High 97.59, Gold-1st 99.44, Gold-High 99.92]
Semantic Role Labeling
• Conditional higher-order argument pruning (High+)
- Reduces argument candidates using path rules.
- Before training:
• Collect paths between predicates and their descendants whose subtrees contain arguments of the predicates.
• Collect paths between predicates and their ancestors whose direct dependents or ancestors are arguments of the predicates.
• Cut off paths whose counts are below thresholds.
- During training and decoding, skip tokens, and their subtrees or ancestors, whose paths to the predicates have not been seen.
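The path-rule mechanism can be sketched as counting label paths and filtering candidates against the surviving set. The path representation and helper names are illustrative assumptions, not the thesis's exact rule format:

```python
from collections import Counter

def collect_path_rules(training_instances, cutoff=2):
    """training_instances: iterable of (path, contains_argument) pairs,
    where a path is a tuple of dependency labels from the predicate.
    Keep paths that lead to arguments at least `cutoff` times."""
    counts = Counter(path for path, has_arg in training_instances if has_arg)
    return {path for path, c in counts.items() if c >= cutoff}

def prune(candidates, rules):
    """candidates: (token, path) pairs; skip tokens whose path to the
    predicate was never (or too rarely) seen leading to an argument."""
    return [tok for tok, path in candidates if path in rules]

instances = [(("dobj",), True), (("dobj",), True),
             (("prep", "pobj"), True), (("punct",), False)]
rules = collect_path_rules(instances)            # {("dobj",)} at cutoff 2
print(prune([(7, ("dobj",)), (9, ("punct",))], rules))  # -> [7]
```

Skipping a token also skips its whole subtree or ancestor chain, which is where the reduction in visited tokens comes from.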
49
Semantic Role Labeling
• Average labeling complexity
- The number of tokens visited per predicate.
50
[Plot of argument candidates vs. sentence length (10-80) using the WSJ models (the OntoNotes graph is similar): All visits the most tokens, followed by High, then High+ and 1st]
Semantic Role Labeling
• Positional feature separation
- Group features by the arguments' positions with respect to their predicates.
- Two sets of features are extracted.
• All features derived from arguments on the lefthand side of the predicates are grouped into one set, SL.
• All features derived from arguments on the righthand side of the predicates are grouped into another set, SR.
- During training, build two models, ML and MR, from SL and SR.
- During decoding, use ML and MR for argument candidates on the lefthand and righthand sides of the predicates, respectively.
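The separation amounts to routing each instance by token position; a sketch with hypothetical toy models standing in for the trained classifiers:

```python
def split_by_position(instances, pred_index):
    """instances: (arg_index, features, label) triples for one predicate.
    Returns the (features, label) pairs for S_L and S_R."""
    left  = [(f, y) for i, f, y in instances if i < pred_index]
    right = [(f, y) for i, f, y in instances if i > pred_index]
    return left, right

def label_argument(arg_index, features, pred_index, model_L, model_R):
    """Route the candidate to M_L or M_R by its side of the predicate."""
    model = model_L if arg_index < pred_index else model_R
    return model(features)

# Toy models: left arguments default to ARG0, right arguments to ARG1.
model_L = lambda feats: "ARG0"
model_R = lambda feats: "ARG1"
print(label_argument(1, {}, 3, model_L, model_R))  # ARG0
print(label_argument(5, {}, 3, model_L, model_R))  # ARG1
```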
51
Semantic Role Labeling
• Experiments
- Baseline: 1st order argument pruning.
- Baseline+: Baseline with positional feature separation.
- High: higher-order argument pruning.
- All: no argument pruning.
- ClearNLP: conditional higher-order argument pruning.
• Previously called High+.
- ClearParser: Choi and Palmer, 2011b.
52
Semantic Role Labeling
• Accuracy - WSJ models (Avgi and Avgo)
53
[Bar charts: in-domain F1 between 81.88 and 82.52; out-of-domain F1 between 71.07 and 71.95, across Baseline, Baseline+, High, All, ClearNLP, and ClearParser]
Semantic Role Labeling
• Accuracy - OntoNotes models (Avgi and Avgo)
54
[Bar charts: in-domain F1 between 80.73 and 81.69; out-of-domain F1 between 70.01 and 70.81, across Baseline, Baseline+, High, All, ClearNLP, and ClearParser]
Semantic Role Labeling
• Speed comparison - WSJ models
- Milliseconds for finding all arguments of each predicate.
55
[Plot of milliseconds vs. sentence length (10-80) for ClearNLP, ClearNLP+, Baseline+, High, All, and ClearParser]
Semantic Role Labeling
• Speed comparison - OntoNotes models
56
[Plot of milliseconds vs. sentence length (10-80) for ClearNLP, ClearNLP+, Baseline+, High, All, and ClearParser]
Contents
• Introduction
• Dependency conversion
• Experimental setup
• Part-of-speech tagging
• Dependency parsing
• Semantic role labeling
• Conclusion
57
Conclusion
• Our dependency conversion gives rich dependency representations and can be applied to most English Treebanks.
• The dynamic model selection runs fast and shows robust POS tagging accuracy across different genres.
• Our parsing algorithm shows linear-time average parsing complexity for generating both proj. and non-proj. trees.
• The bootstrapping technique gives significant improvement on parsing accuracy.
• The higher-order argument pruning gives significant improvement on argument coverage.
• The conditional higher-order argument pruning reduces average labeling complexity without compromising the F1-score.
58
Conclusion
• Contributions
- The first time that these three components have been evaluated together on such a wide variety of English data.
- Maintained high accuracy while improving the efficiency, modularity, and portability of these components.
- Dynamic model selection and bootstrapping are generally applicable to tagging and parsing, respectively.
- Processing all three components takes about 2.49-2.69 ms per sentence (tagging: 0.36-0.37, parsing: 1.16-1.28, labeling: 0.97-1.04).
- All components are publicly available as an open source project, called ClearNLP (clearnlp.googlecode.com).
59
Conclusion
• Future work
- Integrate the dynamic model selection approach with more sophisticated tagging algorithms.
- Evaluate our parsing approach on languages containing more non-projective dependency trees.
- Improve semantic role labeling when the quality of input parse trees is poor (using joint inference).
60
Acknowledgment
• We gratefully acknowledge the support of the following grants. Any contents expressed in this material are those of the authors and do not necessarily reflect the views of any granting agency.
- The National Science Foundation Grants IIS-0325646, Domain Independent Semantic Parsing, CISE-CRI-0551615, Towards a Comprehensive Linguistic Annotation, CISE-CRI 0709167, Collaborative: A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu, CISE- IIS-RI-0910992, Richer Representations for Machine Translation.
- A grant from the Defense Advanced Research Projects Agency (DARPA/IPTO) under the GALE program, DARPA/CMO Contract No. HR0011-06-C-0022, subcontract from BBN, Inc.
- A subcontract from the Mayo Clinic and Harvard Children’s Hospital based on a grant from the ONC, 90TR0002/01.
- Strategic Health Advanced Research Project Area 4: Natural Language Processing.
61
Acknowledgment
• Special thanks are due to
- Martha Palmer for practically being my mom for 5 years.
- James Martin for always encouraging me when I’m low.
- Wayne Ward for wonderful smiles.
- Bhuvana Narasimhan for bringing Hindi to my life.
- Joakim Nivre for suffering under millions of my questions.
- Nicolas Nicolov for making me feel normal when others call me “workaholic”.
- All CINC folks for letting me live (literally) at my cube.
62
References
• Jinho D. Choi and Nicolas Nicolov. K-best, Locally Pruned, Transition-based Dependency Parsing Using Robust Risk Minimization. In Recent Advances in Natural Language Processing V, pages 205-216. John Benjamins, 2009.
• Jinho D. Choi and Martha Palmer. Getting the Most out of Transition-based Dependency Parsing. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, ACL:HLT'11, pages 687-692, 2011a.
• Jinho D. Choi and Martha Palmer. Transition-based Semantic Role Labeling Using Predicate Argument Clustering. In Proceedings of the ACL Workshop on Relational Models of Semantics, RELMS'11, pages 37-45, 2011b.
• M. Cmejrek, J. Curín, and J. Havelka. Prague Czech-English Dependency Treebank: Any Hopes for a Common Annotation Scheme? In the HLT-NAACL'04 Workshop on Frontiers in Corpus Annotation, pages 47-54, 2004.
• Jesús Giménez and Lluís Màrquez. SVMTool: A General POS Tagger Generator Based on Support Vector Machines. In Proceedings of the 4th International Conference on Language Resources and Evaluation, LREC'04, 2004.
• Richard Johansson and Pierre Nugues. Dependency-based Semantic Role Labeling of PropBank. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, EMNLP'08, pages 69-78, 2008.
63
References
• Cho-Jui Hsieh, Kai-Wei Chang, Chih-Jen Lin, S. Sathiya Keerthi, and S. Sundararajan. A Dual Coordinate Descent Method for Large-scale Linear SVM. In Proceedings of the 25th International Conference on Machine Learning, ICML'08, pages 408-415, 2008.
• Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a Large Annotated Corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313-330, 1993.
• Marie-Catherine de Marneffe and Christopher D. Manning. The Stanford Typed Dependencies Representation. In Proceedings of the COLING Workshop on Cross-Framework and Cross-Domain Parser Evaluation, 2008a.
• Ryan McDonald, Fernando Pereira, Kiril Ribarov, and Jan Hajic. Non-projective Dependency Parsing Using Spanning Tree Algorithms. In Proceedings of the Conference on Human Language Technology and Empirical Methods in Natural Language Processing, HLT-EMNLP'05, pages 523-530, 2005.
• Rodney D. Nielsen, James Masanz, Philip Ogren, Wayne Ward, James H. Martin, Guergana Savova, and Martha Palmer. An Architecture for Complex Clinical Question Answering. In Proceedings of the 1st ACM International Health Informatics Symposium, IHI'10, pages 395-399, 2010.
• Joakim Nivre. An Efficient Algorithm for Projective Dependency Parsing. In Proceedings of the 8th International Workshop on Parsing Technologies, IWPT'03, pages 149-160, 2003.
• Joakim Nivre. Algorithms for Deterministic Incremental Dependency Parsing. Computational Linguistics, 34(4):513-553, 2008.
64
References
• Joakim Nivre. Non-Projective Dependency Parsing in Expected Linear Time. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL-IJCNLP'09, pages 351-359, 2009.
• Owen Rambow, Cassandre Creswell, Rachel Szekely, Harriet Taber, and Marilyn Walker. A Dependency Treebank for English. In Proceedings of the 3rd International Conference on Language Resources and Evaluation, LREC'02, 2002.
• Ralph Weischedel, Eduard Hovy, Martha Palmer, Mitch Marcus, Robert Belvin, Sameer Pradhan, Lance Ramshaw, and Nianwen Xue. OntoNotes: A Large Training Corpus for Enhanced Processing. In Joseph Olive, Caitlin Christianson, and John McCary, editors, Handbook of Natural Language Processing and Machine Translation. Springer, 2011.
• Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network. In Proceedings of the Annual Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, NAACL'03, pages 173-180, 2003.
• Nianwen Xue and Martha Palmer. Calibrating Features for Semantic Role Labeling. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP'04, 2004.
65