Introduction to Computational Linguistics

Eleni Miltsakaki

AUTH

Fall 2005-Lecture 9


What’s the plan for today?

• Discourse models cont’d
  – DLTAG: Lexicalized Tree Adjoining Grammar for Discourse
• A DLTAG-based system for parsing discourse
• The Penn Discourse Treebank
  – http://www.cis.upenn.edu/~pdtb


Basic references

• Anchoring a Lexicalized Tree-Adjoining Grammar for Discourse (1998)
  – B. Webber and A. Joshi
• What are Little Texts Made of? A Structural Presuppositional Account Using Lexicalized TAG
  – B. Webber, A. Joshi, A. Knott, M. Stone
• DLTAG System: Discourse Parsing with a Lexicalized Tree-Adjoining Grammar (2001)
  – K. Forbes, E. Miltsakaki, R. Prasad, A. Sarkar, A. Joshi and B. Webber
• The Penn Discourse Treebank (2004)
  – E. Miltsakaki, R. Prasad, A. Joshi and B. Webber


Motivation and basics of the DLTAG approach

• Discourse meaning: more than its parts

• Compositional vs non-compositional aspects of discourse meaning

• These two aspects are often conflated in related work

• Smooth transition from sentence level structure to discourse level structure


The DLTAG view of discourse connectives

• Discourse connectives are treated as higher-level predicates taking clausal arguments
• Basic types of discourse connectives:
  – Structural
    • Subordinate conjunctions (when, although, because, etc.)
    • Coordinate conjunctions (and, but, or)
  – “Anaphoric”
    • Adverbials (however, therefore, as a result, etc.)
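The classification of connectives above can be written down as a small lookup table. This is only an illustrative sketch: the names `ConnectiveType`, `CONNECTIVES` and `is_structural` are hypothetical and not part of the DLTAG system, and the table holds only the connectives named on the slide.

```python
from enum import Enum


class ConnectiveType(Enum):
    SUBORDINATE = "structural: subordinate conjunction"
    COORDINATE = "structural: coordinate conjunction"
    ADVERBIAL = "anaphoric: discourse adverbial"


# Only the connectives listed on the slide; a real lexicon would be much larger.
CONNECTIVES = {
    "when": ConnectiveType.SUBORDINATE,
    "although": ConnectiveType.SUBORDINATE,
    "because": ConnectiveType.SUBORDINATE,
    "and": ConnectiveType.COORDINATE,
    "but": ConnectiveType.COORDINATE,
    "or": ConnectiveType.COORDINATE,
    "however": ConnectiveType.ADVERBIAL,
    "therefore": ConnectiveType.ADVERBIAL,
    "as a result": ConnectiveType.ADVERBIAL,
}


def is_structural(connective: str) -> bool:
    """True for subordinate and coordinate conjunctions, False for adverbials."""
    return CONNECTIVES[connective.lower()] is not ConnectiveType.ADVERBIAL


print(is_structural("because"))  # True
print(is_structural("However"))  # False
```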


Elements of LTAG

• Initial and auxiliary trees
  – Initial: encode predicate-argument dependencies
  – Auxiliary: recursive, modify elementary trees
• Anchors of elementary trees are semantic predicates
• Substitution and adjunction

D-LTAG is similar

• Anchors of elementary trees are semantic features which can be lexicalized with discourse connectives


D-LTAG Structures and Semantics

• Initial Trees

(a) John failed his exam because he was lazy


Auxiliary trees

(a) Mary saw John but she decided to ignore him.

(b) Mary saw John. She decided to ignore him.

1. On the one hand, John loves Barolo.
2. So he ordered three cases of the ‘97.
3. On the other hand, he had to cancel the order
4. because he then found that he was broke.


Phenomena that DLTAG captures

• Arguments of a coherence relation can be stretched “long distance”

• Multiple discourse connectives can appear in a single sentence or even a single clause

• Coherence relations can vary in how and when they are realized lexically


Stretching arguments

• On the one hand, John loves Barolo.

• So he ordered three cases of the ’97.

• On the other hand, he had to cancel the order

• because he then found that he was broke.


Non-Compositional Semantics

• Non-defeasible vs defeasible causal connection

(a) The City Council refused the women a permit because they feared violence.

(b) The City Council refused the women a permit. They feared violence.

• Presuppositional semantics (Knott et al., 1996):
  – Defeasible rule: When people go to the zoo, they leave their work behind.

(c) John went to the zoo. However, he took his cell phone with him.


DLTAG system for parsing discourse

• Theoretical framework: DLTAG

• Main system components:
  – Sentence level parsing
  – Tree extractor
  – Tree mapper
  – Discourse input representation
  – Discourse level parsing


Parser (Sarkar, 2000)

– XTAG grammar
– One derivation per sentence

E.g. Mary was amazed


Tree extractor: identifying discourse units

(a) While she was eating lunch, she saw a dog.


Tree mapper

• From sentence level structure to discourse structure


Discourse input representation


System Architecture


Example Discourse

(a) Mary was amazed.

(b) While she was eating lunch, she saw a dog.

(c) She’d seen a lot of dogs, but this one was amazing.

(d) The dog barked and Mary smiled.

(e) Then, she gave it a sandwich.


Derived and Derivation trees


Corpus example

The pilots could play hardball by noting they were crucial to any sale or restructuring because they can refuse to fly the airplanes. If they were to insist on a low bid of, say, $200 a share, the board mightn’t be able to obtain a higher offer from the bidders because banks might hesitate to finance a transaction the pilots oppose. Also, because UAL chairman Stephen Wolf and other UAL executives have joined the pilots’ bid, the board might be able to exclude him from its deliberations in order to be fair to other bidders.

(Wall Street Journal) LEXTRACT (Xia et al 2000)


Corpus: Derivation Tree


Derived Tree


Summary points of the DLTAG system

• Implementation of D-LTAG
  – Use LTAG grammar to parse each clause
  – Use the same LTAG-based parser both at the sentence level and discourse level
  – Build the semantics compositionally from the sentence to the discourse level
  – Factor away non-compositional semantic contributions
• In the output representation
  – The semantics of the connectives forms only part of the compositional derivation of discourse relations
  – Discourse connectives are NOT viewed as names of relations


The Penn Discourse Treebank

Annotation of discourse connectives and their arguments

Large scale: annotation of the entire Penn Treebank (1 million words)


Merits of the PDTB

• Discourse relations are lexically grounded
  – Exposing a clearly defined level of discourse structure
  – Enabling annotations with high reliability
• Building on existing syntactic and semantic layers of annotation (Treebank, PropBank)
• Annotations independent of the DLTAG (or any other) framework


Project description

• Annotation of connectives in the Penn Treebank
• 30K tokens of connectives
  – 20K explicit conns + 10K implicit conns
• Annotation of ARG1 and ARG2 of conns
  – Ex. Mary left early because she was sick.
    ARG1: Mary left early
    CONN: because
    ARG2: she was sick
• Four annotators at the beginning, then two
• To come: Semantic role labels for ARG1 and ARG2
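As a rough illustration of what one annotated token carries, the example above can be stored as a small record. The class name `ConnectiveAnnotation` and its field names are hypothetical; the actual PDTB file format is not described in these slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConnectiveAnnotation:
    """One connective token with its two arguments (hypothetical layout)."""
    conn: str              # the connective, or the expression supplied for an implicit one
    arg1: str              # text of ARG1
    arg2: str              # text of ARG2
    explicit: bool = True  # False for implicit connectives


example = ConnectiveAnnotation(
    conn="because",
    arg1="Mary left early",
    arg2="she was sick",
)
print(example.arg1, "|", example.conn, "|", example.arg2)
```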


Connectives

Subordinate conjunctions

(when, because, although, etc.)

ARG1 – ARG2

(1) Because [the drought reduced U.S. stockpiles], [they have more than enough storage space for their new crop], and that permits them to wait for prices to rise.


Connectives

Coordinate conjunctions

(and, but, or, etc.)

ARG1 – ARG2

(2) [William Gates and Paul Allen in 1975 developed an early language-housekeeper system for PCs], and [Gates became an industry billionaire six years after IBM adapted one of these versions in 1981].


Connectives

Adverbials

(therefore, then, as a result, etc.)

ARG1 – ARG2

• (3) For years, costume jewelry makers fought a losing battle. Jewelry displays in department stores were often cluttered and uninspired. And the merchandise was, well, fake. As a result, marketers of faux gems steadily lost space in department stores to more fashionable rivals -- cosmetics makers.


Connectives

Implicit

(annotators provide named expression for implicit connective)

ARG1 – ARG2

(4) …[The $6 billion that some 40 companies are looking to raise in the year ending March 31 compares with only $2.7 billion raised on the capital market in the previous fiscal year]. IMPLICIT-(In contrast) [In fiscal 1984 before Mr. Gandhi came to power, only $810 million was raised].


Annotation guidelines

http://www.cis.upenn.edu/~pdtb

• What counts as a connective?
  – Including distinction between clausal adverbials and discourse adverbials
• What counts as an argument?
  – Minimally a clause
• How far does the argument extend?
  – Including distinction between arguments (ARG1 and ARG2) and supplements to arguments (SUP1 and SUP2 respectively)
  – Interesting comparison with PropBank annotations of verbs


WordFreak (T. Morton & J. Lacivita)


Comparison with the RST corpus

• RST-corpus
  1. Higher-level annot.
  2. Abstract discourse relations
  3. Doesn’t contain the basis of the relations
  4. Low inter-annotator agreement
  5. Small scale (385 WSJ files)
  6. No explicit links to Treebank
• PDTB
  1. Basic level annot.
  2. Connectives+args
  3. Relations anchored to lexical items
  4. High inter-annotator agreement
  5. Large scale (Treebank: 2,500 WSJ files)
  6. Links to Treebank and PropBank

Interesting to see how RST labels relate to semantic role assignment in PDTB


Preliminary experiments

• 10 explicit connectives (2717 tokens)
  – Therefore, as a result, instead, otherwise, nevertheless, because, although, even though, when, so that
• 386 tokens of implicit connectives
• 2 annotators


Inter-annotator agreement (1)

Measure by token (ARG1+ARG2)
• ARG1 and ARG2 counted together
• Total number of connective ARG1/ARG2 tokens = 2717

Agreement = 82.8%
• Subord. Conj. = 86%
• Adverbials = 57%
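A minimal sketch of the by-token measure, assuming each annotator's output is a mapping from a connective token id to a pair of argument spans. The span encoding, the token ids and the function name `token_agreement` are assumptions for illustration, not the project's actual tooling. A token counts as agreed only if both ARG1 and ARG2 match exactly.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]           # (start, end) offsets; an assumed encoding
Annotation = Tuple[Span, Span]   # (ARG1 span, ARG2 span) for one connective token


def token_agreement(a: Dict[str, Annotation], b: Dict[str, Annotation]) -> float:
    """Fraction of connective tokens on which both ARG1 and ARG2 match exactly."""
    tokens = a.keys() | b.keys()
    if not tokens:
        raise ValueError("no connective tokens to compare")
    # Assumption: a token missing from either annotator counts as a disagreement.
    agreed = sum(1 for t in tokens if t in a and t in b and a[t] == b[t])
    return agreed / len(tokens)


# Toy data: three connective tokens; the annotators differ on ARG1 of "c2".
annotator_a = {"c1": ((0, 15), (24, 36)), "c2": ((40, 60), (70, 90)), "c3": ((95, 110), (115, 130))}
annotator_b = {"c1": ((0, 15), (24, 36)), "c2": ((45, 60), (70, 90)), "c3": ((95, 110), (115, 130))}

print(token_agreement(annotator_a, annotator_b))  # 2 of 3 tokens agree
print(round(100 * 2249 / 2717, 1))                # 82.8, from the counts reported in the lecture
```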


Agreement per connective (1)

CONNECTIVES      AGR No.   Conn. Total   % AGR
When             868       1016          86.4%
Because          804       912           88.2%
Even though      91        103           88.3%
Although         288       352           81.8%
So that          27        34            79.4%
TOTAL SUBCONJ    2078      2417          86.0%
Nevertheless     18        47            38.3%
Otherwise        21        23            91.3%
Instead          72        118           61.0%
As a result      38        84            45.2%
Therefore        22        28            78.6%
TOTAL ADV.       171       300           57.0%
OVERALL TOTAL    2249      2717          82.8%


Inter-annotator agreement (2)

Measure by ARG (ARG1, ARG2)
• Check agreement for ARG1 and ARG2
• Total number of argument tokens = 5434 (2717 ARG1 + 2717 ARG2)

Agreement = 90.2%
– ARG1 = 86.3%
– ARG2 = 94.1%
– Subord. Conj. = 92.4%
– Adverbials = 71.8%
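The per-argument measure can be sketched the same way, again under an assumed span encoding and with hypothetical names. ARG1 and ARG2 are checked separately. Because there are equally many ARG1 and ARG2 tokens, the pooled figure is the mean of the two rates.

```python
from typing import Dict, Tuple

Span = Tuple[int, int]           # (start, end) offsets; an assumed encoding
Annotation = Tuple[Span, Span]   # (ARG1 span, ARG2 span) for one connective token


def argument_agreement(a: Dict[str, Annotation], b: Dict[str, Annotation]) -> Dict[str, float]:
    """Exact-match agreement for ARG1 and ARG2 separately, plus the pooled rate."""
    tokens = a.keys() | b.keys()
    if not tokens:
        raise ValueError("no connective tokens to compare")
    n = len(tokens)
    both = [t for t in tokens if t in a and t in b]
    arg1 = sum(1 for t in both if a[t][0] == b[t][0])
    arg2 = sum(1 for t in both if a[t][1] == b[t][1])
    return {"ARG1": arg1 / n, "ARG2": arg2 / n, "pooled": (arg1 + arg2) / (2 * n)}


# Toy data: the annotators differ only on ARG1 of "c2".
annotator_a = {"c1": ((0, 15), (24, 36)), "c2": ((40, 60), (70, 90)), "c3": ((95, 110), (115, 130))}
annotator_b = {"c1": ((0, 15), (24, 36)), "c2": ((45, 60), (70, 90)), "c3": ((95, 110), (115, 130))}

result = argument_agreement(annotator_a, annotator_b)
print(result)                       # ARG1 = 2/3, ARG2 = 1.0, pooled = 5/6
print(round(100 * 4900 / 5434, 1))  # 90.2, from the counts reported in the lecture
```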


Agreement per connective (2)

CONNECTIVES      AGR No.   Conn. Total   % AGR
When             1877      2032          92.4%
Because          1703      1824          93.4%
Even though      194       206           94.1%
Although         635       704           90.1%
So that          66        74            89.2%
TOTAL SUBCONJ    4469      4834          92.4%
Nevertheless     56        94            59.6%
Otherwise        44        46            95.7%
Instead          172       236           72.9%
As a result      110       168           65.5%
Therefore        49        56            87.5%
TOTAL ADV.       431       600           71.8%
OVERALL TOTAL    4900      5434          90.2%


Analysis of disagreement

Majority of disagreement due to ‘partial overlap’: 79%

(5) It was forced into liquidation before trial when investors yanked their funds after the government demanded a huge pre-trial asset forfeiture.

DISAGREEMENT TYPE        No.    %
Missing annotations      72     13.5%
No overlap               30     5.6%
PARTIAL OVERLAP TOTAL    422    79%
  Parentheticals         53     9.9%
  Higher verb            181    33.9%
  Dependent clause       182    34.1%
  Other                  6      1.1%
Unresolved               10     1.9%
TOTAL                    534    100%
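The percentages in the disagreement table follow directly from the counts. The short check below recomputes them. The dictionary keys are labels chosen here for illustration, and the counts are the ones reported in the table.

```python
# Counts of disagreement types as reported in the table.
counts = {
    "Missing annotations": 72,
    "No overlap": 30,
    "Partial overlap: parentheticals": 53,
    "Partial overlap: higher verb": 181,
    "Partial overlap: dependent clause": 182,
    "Partial overlap: other": 6,
    "Unresolved": 10,
}

total = sum(counts.values())
partial = sum(n for label, n in counts.items() if label.startswith("Partial overlap"))
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)                             # 534
print(partial)                           # 422
print(round(100 * partial / total, 1))   # 79.0
for label, pct in shares.items():
    print(f"{label}: {pct}%")
```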


Reanalysis of agreement

• Inter-annotator agreement counting in partial overlap
  – 94.5%
• Dealing with extent of the argument
  – Revise guidelines
  – BUT: Some disagreement will persist
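One way to relax exact match is to accept any pair of argument spans that overlap. The sketch below is an assumption about how such a measure could be computed, not the procedure behind the 94.5% figure, which the slides do not spell out. The names `overlap_type` and `relaxed_agreement` are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) offsets, end exclusive; an assumed encoding


def overlap_type(x: Span, y: Span) -> str:
    """Classify two spans as 'exact', 'partial' or 'none'."""
    if x == y:
        return "exact"
    if x[0] < y[1] and y[0] < x[1]:
        return "partial"
    return "none"


def relaxed_agreement(pairs: List[Tuple[Span, Span]]) -> float:
    """Share of span pairs that match exactly or overlap partially."""
    if not pairs:
        raise ValueError("no span pairs to compare")
    accepted = sum(1 for x, y in pairs if overlap_type(x, y) != "none")
    return accepted / len(pairs)


# Toy data: two exact matches, one partial overlap, one pair with no overlap.
pairs = [((0, 10), (0, 10)), ((0, 10), (3, 10)), ((0, 10), (12, 20)), ((5, 9), (5, 9))]

print([overlap_type(x, y) for x, y in pairs])  # ['exact', 'partial', 'none', 'exact']
print(relaxed_agreement(pairs))                # 0.75
```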


Comparing predicates

• PropBank – sentence level predicates (verbs)
  – Arity of arguments: Hard
  – Extent of the argument: Easy
• Penn Discourse Treebank – discourse predicates
  – Arity of arguments: Easy
  – Extent of the argument: Hard


Summary points for PDTB

http://www.cis.upenn.edu/~pdtb

• The Penn Discourse Treebank
  – Large scale discourse annotation
  – Basic level of annotation: connectives and their arguments
  – Links to Penn Treebank and Penn PropBank (rich substrate for extracting syntactic and semantic features)
  – Expected completion November 2005
• Inter-annotator agreement
  – Most conservative: 82.8%
  – Relaxing exact match: 94.5%