Page 1: A Monolingual Tree-based Translation Model for Sentence Simplification

24.08.2010 | Computer Science Department | UKP Lab - Prof. Dr. Iryna Gurevych | Zhemin Zhu

Zhemin Zhu, UKP, TU Darmstadt, Germany
Delphine Bernhard, LIMSI-CNRS, France
Iryna Gurevych, UKP, TU Darmstadt, Germany

A Monolingual Tree-based Translation Model for Sentence Simplification

Presenter: Zhemin Zhu

COLING2010 – Beijing, China

Page 2: A Monolingual Tree-based Translation Model for Sentence Simplification

An Example of Sentence Simplification

This month was first called Sextilis in Latin, because it was the sixth month in the old Roman calendar. The Roman calendar began in March about 735 BC with Romulus.

-- Simple Wikipedia

This month was originally named Sextilis in Latin, because it was the sixth month in the original [ten-month] Roman calendar under Romulus in 753 BC, when March was the first month of the year.

-- Wikipedia

Page 3: A Monolingual Tree-based Translation Model for Sentence Simplification

Sentence Simplification Targeted at Humans

Reading and Speech Assistance

People with Comprehension Disabilities [Carroll et al., 1999; Inui et al., 2003]

Low-literacy People [Watanabe et al., 2009]

Non-native Speakers [Siddharthan, 2002]

Children

Page 4: A Monolingual Tree-based Translation Model for Sentence Simplification

Sentence Simplification Targeted at NLP Applications

Parsing and Translation [Chandrasekar et al., 1996]

Summarization [Knight and Marcu, 2000]

Sentence Fusion [Filippova and Strube, 2008b]

Semantic Role Labeling [Vickrey and Koller, 2008]

Question Generation [Heilman and Smith, 2009]

Relation Extraction [Miwa et al., COLING2010]

Information Extraction [Jonnalagadda and Gonzalez, 2009]

Robot Command [Young KY and Liu SH, 2002]

Page 5: A Monolingual Tree-based Translation Model for Sentence Simplification

What Makes a Sentence Difficult?

1. Difficult Vocabulary → Vocabulary (Word/Phrase) Substitution

2. Complex Syntax
   Length → Splitting, Dropping
   Order → Reordering, such as passive and active

Simplification operations: Splitting, Dropping, Reordering and Substitution

This month was originally named Sextilis in Latin, because it was the sixth month in the original ten-month Roman calendar under Romulus in 753 BC, when March was the first month of the year.

-- Wikipedia

Page 6: A Monolingual Tree-based Translation Model for Sentence Simplification

Simplification Operation: Sentence Splitting

August is the eighth month of the year in the Gregorian Calendar and one of seven Gregorian months with a length of 31 days.

-- Wikipedia

August is the eighth month of the year. It has 31 days.

-- Simple Wikipedia

Page 7: A Monolingual Tree-based Translation Model for Sentence Simplification

Simplification Operation: Dropping

April is the fourth month of the year [in the Gregorian Calendar, and one of four months] with [a length of] 30 days.

-- Wikipedia

April is the fourth month of the year with 30 days.

-- Simple Wikipedia

Page 8: A Monolingual Tree-based Translation Model for Sentence Simplification

Simplification Operation: Reordering

Mr. Anthony, who runs an employment agency, decries program trading, but he isn't sure it should be strictly regulated.

-- [Siddharthan, 2006]

Mr. Anthony decries program trading. Mr. Anthony runs an employment agency. But he isn't sure it should be strictly regulated.

-- [Siddharthan, 2006]

Page 9: A Monolingual Tree-based Translation Model for Sentence Simplification

Simplification Operation: Substitution

The traditional etymology is from the Latin aperire, "to open," in allusion to its being the season when trees and flowers begin to "open," which is supported by comparison with the modern Greek use of ἁνοιξις (opening) for spring.

-- Wikipedia

The name April comes from that Latin word aperire which means "to open".

-- Simple Wikipedia

Page 10: A Monolingual Tree-based Translation Model for Sentence Simplification

Motivation

Most of the existing methods only cover one simplification operation:
   [Siddharthan, 2006] and [Petersen and Ostendorf, 2007]: Splitting
   Sentence Compression: Dropping
   [Carroll et al., 1999]: Word Substitution

In most cases, different simplification operations happen simultaneously.

It is necessary to model different simplification operations integrally.

Page 11: A Monolingual Tree-based Translation Model for Sentence Simplification

Our Contributions

The first statistical model: TSM (Tree-based Simplification Model)
   Integrally covering splitting, dropping, reordering and word/phrase substitution
   Based on the great successes of parsing and translation techniques

An Efficient Training Method for TSM
   Speeding up by monolingual word mapping

PWKP: Parallel Complex-Simple Dataset
   Obtained from Wikipedia and Simple Wikipedia

Page 12: A Monolingual Tree-based Translation Model for Sentence Simplification

Tree-based Simplification Model: TSM

Parse Trees of Complex Sentences

Splitting

Dropping

Reordering

Substitution

Simple Sentences

Probabilistic Model: EM Training

Page 13: A Monolingual Tree-based Translation Model for Sentence Simplification

Parallel Complex-Simple Dataset: PWKP

Paired articles from the Wikipedia and Simple Wikipedia

1. Article Pairing: following the “language links”

2. Plain Text Extraction: JWPL [Zesch et al., 2008]

3. Pre-processing: sentence boundary detection and tokenization with the Stanford Parser package [Klein and Manning, 2003], lemmatization with the TreeTagger [Schmid, 1994]

4. Monolingual Sentence Alignment: sentence-level TF*IDF [Nelken and Shieber, 2006]
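
Step 4 can be illustrated with a minimal sketch of sentence-level TF*IDF alignment, where every sentence is treated as its own document. This is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the `threshold` value and the function names are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(sentences):
    """Build one sparse TF*IDF vector per sentence (each sentence is a 'document')."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    df = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF (an assumption) keeps weights positive for shared words.
        vectors.append({t: c * (math.log((1 + n) / (1 + df[t])) + 1.0)
                        for t, c in tf.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def align(complex_sents, simple_sents, threshold=0.5):
    """Return (complex_index, simple_index, similarity) for pairs above the threshold."""
    vecs = tfidf_vectors(complex_sents + simple_sents)
    c_vecs, s_vecs = vecs[:len(complex_sents)], vecs[len(complex_sents):]
    pairs = []
    for i, cv in enumerate(c_vecs):
        for j, sv in enumerate(s_vecs):
            sim = cosine(cv, sv)
            if sim >= threshold:
                pairs.append((i, j, sim))
    return pairs

complex_sents = [
    "April is the fourth month of the year with a length of 30 days .",
    "The traditional etymology is from the Latin aperire .",
]
simple_sents = ["April is the fourth month of the year with 30 days ."]
pairs = align(complex_sents, simple_sents)
```

With these toy sentences only the first complex sentence is paired with the simple one; on real article pairs the threshold would need tuning against manually aligned data.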

Page 14: A Monolingual Tree-based Translation Model for Sentence Simplification

Parallel Complex-Simple Dataset: PWKP

Similarity      Precision   Recall
TF*IDF          91.3%       55.4%
Word Overlap    50.5%       55.1%
MED             13.9%       54.7%

Table 1: Monolingual Sentence Alignment

           Sentence Length   Token Length   #Pairs
Simple     20.87             4.89           108016
Complex    25.01             5.06           108016

Table 2: Statistics for the PWKP dataset

Page 15: A Monolingual Tree-based Translation Model for Sentence Simplification

TSM: Splitting

Example Complex Sentence:

August was the sixth month in the ancient Roman calendar which started in 735BC.

Page 16: A Monolingual Tree-based Translation Model for Sentence Simplification

TSM: Splitting

Question 1: Where to split the sentence?
Step 1: Segmentation

Question 2: How to make the split sentences complete and grammatical?
Step 2: Completion

Page 17: A Monolingual Tree-based Translation Model for Sentence Simplification

TSM: Splitting

Step 1: Segmentation

Word    Constituent   Length   Probability
which   SBAR          1        0.0016
which   SBAR          2        0.0835

Table 3: Segmentation Feature Table (SFT)
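
A feature table such as the SFT can be thought of as a lookup from a feature tuple to a probability. The sketch below shows one possible layout using only the two rows from Table 3. The dictionary structure, the name `segmentation_prob` and the fallback `default` for unseen features are assumptions, and the Length feature is treated as an opaque key.

```python
# Rows copied from Table 3; keys are (word, constituent, length).
SFT = {
    ("which", "SBAR", 1): 0.0016,
    ("which", "SBAR", 2): 0.0835,
}

def segmentation_prob(word, constituent, length, default=0.0):
    """Look up the segmentation probability for a candidate split point.

    `default` is a hypothetical fallback for features missing from the table.
    """
    return SFT.get((word, constituent, length), default)

# In these two rows, a split at "which" (SBAR) is far more likely for Length 2.
p1 = segmentation_prob("which", "SBAR", 1)
p2 = segmentation_prob("which", "SBAR", 2)
```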

Page 18: A Monolingual Tree-based Translation Model for Sentence Simplification

TSM: Splitting

Step 1: Segmentation

Page 19: A Monolingual Tree-based Translation Model for Sentence Simplification

TSM: Splitting

Step 2: Completion

Should the “which” be dropped?

Word    Constituent   isDropped   Probability
which   WHNP          true        1.0
which   WHNP          false       Prob.min

Table 4: Border Drop Feature Table (BDFT)

Page 20: A Monolingual Tree-based Translation Model for Sentence Simplification

TSM: Splitting

Step 2: Completion

Which parts should be copied? Where to put these parts in the new sentences?

Dependency   Constituent   isCopied   Position       Probability
gov_nsubj    VBD           true       left           0.9000
gov_nsubj    VBD           true       right          0.0994
gov_nsubj    VBD           false      left + right   0.0006

Table 5: Copy Feature Table (CFT)
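
The rows of Table 5 can be read as a distribution over copy actions for one (dependency, constituent) feature. As a rough sketch, picking the most probable action looks like this. The data layout and the name `most_likely_copy` are assumptions, and choosing the single best row in isolation is a simplification, since the slides describe a probabilistic model that combines several operations.

```python
# Rows copied from Table 5; keys are (dependency, constituent),
# values are (is_copied, position, probability) rows.
CFT = {
    ("gov_nsubj", "VBD"): [
        (True, "left", 0.9000),
        (True, "right", 0.0994),
        (False, "left + right", 0.0006),
    ],
}

def most_likely_copy(dependency, constituent):
    """Return the (is_copied, position, probability) row with the highest probability."""
    rows = CFT[(dependency, constituent)]
    return max(rows, key=lambda row: row[2])

choice = most_likely_copy("gov_nsubj", "VBD")
total = sum(p for _, _, p in CFT[("gov_nsubj", "VBD")])
```

For the rows shown, the preferred action is to copy the subject and place it to the left, and the three probabilities sum to 1.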

Page 21: A Monolingual Tree-based Translation Model for Sentence Simplification

TSM: Splitting

Page 22: A Monolingual Tree-based Translation Model for Sentence Simplification

TSM: Dropping & Reordering

Constituent   Children       Drop   Probability
NP            DT JJ NNP NN   1101   7.66E-4
NP            DT JJ NNP NN   0001   1.26E-7

Table 6: Dropping Feature Table (DFT)

Constituent   Children   Reorder   Probability
NP            DT JJ NN   012       0.8303
NP            DT JJ NN   210       0.0039

Table 7: Reordering Feature Table (RFT)
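
The Drop and Reorder columns in Tables 6 and 7 can be read as a bit mask and a permutation over a constituent's children. The sketch below applies both. The slides do not state the bit convention, so it is assumed here that `1` keeps a child and `0` drops it; under that reading the first row of Table 6 turns DT JJ NNP NN into the DT JJ NN sequence listed in Table 7. The function names are hypothetical.

```python
def apply_drop(children, mask):
    """Keep children whose mask bit is '1' (assumed convention: 1 = keep, 0 = drop)."""
    if len(children) != len(mask):
        raise ValueError("mask length must match number of children")
    return [child for child, bit in zip(children, mask) if bit == "1"]

def apply_reorder(children, order):
    """Rearrange children so that position k holds the child at index int(order[k])."""
    if sorted(order) != [str(i) for i in range(len(children))]:
        raise ValueError("order must be a permutation of child indices")
    return [children[int(i)] for i in order]

kept = apply_drop(["DT", "JJ", "NNP", "NN"], "1101")
unchanged = apply_reorder(kept, "012")
reversed_np = apply_reorder(kept, "210")
```

Here `kept` is the three-child sequence, `unchanged` leaves it as is (the high-probability row of Table 7), and `reversed_np` mirrors it (the low-probability row).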

Page 23: A Monolingual Tree-based Translation Model for Sentence Simplification

TSM: Dropping & Reordering

Page 24: A Monolingual Tree-based Translation Model for Sentence Simplification

TSM: Word/Phrase Substitution

Original (word/phrase)   Substitution (word/phrase)   Probability
ancient                  ancient                      0.963
ancient                  old                          0.0183
old                      ancient                      0.005
ancient                  than transportation          1.83E-102

Table 8: Substitution Feature Table (SubFT)

Word substitution: terminal nodes

Phrase Substitution: non-terminal nodes
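
As an illustration of how a table like the SubFT might be consulted, the sketch below ranks the candidate substitutions for a word by probability. The nested-dictionary layout and the name `ranked_substitutions` are assumptions, and the grouping of the last row (original "ancient", substitution "than transportation") is one reading of the flattened slide text.

```python
# Rows copied from Table 8: original -> {substitution: probability}.
SUBFT = {
    "ancient": {"ancient": 0.963, "old": 0.0183, "than transportation": 1.83e-102},
    "old": {"ancient": 0.005},
}

def ranked_substitutions(original):
    """Return candidate substitutions for `original`, most probable first."""
    candidates = SUBFT.get(original, {})
    return sorted(candidates.items(), key=lambda item: item[1], reverse=True)

top = ranked_substitutions("ancient")
```

In the rows shown, leaving "ancient" unchanged is the most probable outcome, "old" comes second, and implausible candidates receive vanishingly small probabilities.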

Page 25: A Monolingual Tree-based Translation Model for Sentence Simplification

TSM: Word/Phrase Substitution

Page 26: A Monolingual Tree-based Translation Model for Sentence Simplification

Speeding up

We filter out the unpromising candidates at the early stages. This is done using monolingual word mapping.

Page 27: A Monolingual Tree-based Translation Model for Sentence Simplification

Experiments

Testing dataset: 100 complex sentences with 131 parallel simple sentences from PWKP

Baseline systems:

1. Moses: state-of-the-art phrase-based SMT

2. Compression [Filippova and Strube, 2008a]

3. Compression + Substitution
   Substitution: WordNet + frequency in Simple Wikipedia articles

4. Compression + Substitution + Splitting
   Splitting: split at conjunctions and relatives

Page 28: A Monolingual Tree-based Translation Model for Sentence Simplification

Experiments: Basic Statistics

                                        Tok. Len.   Sent. Len.   #Sent.
Complex Sentences                       4.95        27.81        100
Simple Sentences                        4.76        17.86        131
1. Moses                                4.81        26.08        100
2. Compression                          4.98        18.02        103
3. Compression+Substitution             4.90        18.11        103
4. Compression+Substitution+Splitting   4.98        10.20        182
5. TSM                                  4.76        13.57        180

Page 29: A Monolingual Tree-based Translation Model for Sentence Simplification

Experiments: Translation Assessment

                                        BLEU   NIST    #Same
Complex Sentences                       0.50   6.89    100
Simple Sentences                        1.00   10.98   3
1. Moses                                0.55   7.47    25
2. Compression                          0.28   5.37    1
3. Compression+Substitution             0.19   4.51    0
4. Compression+Substitution+Splitting   0.18   4.42    0
5. TSM                                  0.38   6.21    2

Page 30: A Monolingual Tree-based Translation Model for Sentence Simplification

Experiments: Readability Assessment

                                        Flesch      Lix (Grade)   OOV%   PPL
Complex Sentences                       49.1        53.0 (10)     52.9   384
Simple Sentences                        60.4 (PE)   44.1 (8)      50.7   179
1. Moses                                54.8        48.1 (9)      52.0   363
2. Compression                          56.2        45.9 (8)      51.7   481
3. Compression+Substitution             59.1        45.1 (8)      49.5   616
4. Compression+Substitution+Splitting   65.5 (PE)   38.3 (6)      53.4   581
5. TSM                                  67.4 (PE)   36.7 (5)      50.8   353

PE: Plain English; Grade: School Year
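
For readers unfamiliar with the two readability scores, the sketch below computes them from their standard definitions: Flesch Reading Ease (higher is easier) and Lix (lower is easier). The regular-expression tokenizer and sentence splitter are simplifying assumptions, and the tooling behind the numbers in the table is not described on the slides, so the sketch is not expected to reproduce them exactly.

```python
import re

def lix(text):
    """LIX = words/sentences + 100 * long_words/words; long words have more than 6 letters."""
    sentences = [s for s in re.split(r"[.!?:]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    long_words = [w for w in words if len(w) > 6]
    return len(words) / len(sentences) + 100.0 * len(long_words) / len(words)

def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Flesch Reading Ease computed from raw counts (syllable counting is left to the caller)."""
    return 206.835 - 1.015 * (n_words / n_sentences) - 84.6 * (n_syllables / n_words)

complex_text = ("August is the eighth month of the year in the Gregorian Calendar "
                "and one of seven Gregorian months with a length of 31 days.")
simple_text = "August is the eighth month of the year. It has 31 days."

lix_complex = lix(complex_text)
lix_simple = lix(simple_text)
```

On the splitting example from the earlier slide, the simplified version gets a lower Lix score than the original, which matches the direction of the differences in the table.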

Page 31: A Monolingual Tree-based Translation Model for Sentence Simplification

Conclusions

1. Moses is not good at simplification tasks.

2. BLEU and NIST are not good evaluation metrics for sentence simplification systems.

3. TSM can achieve the best overall readability scores.

4. We contributed the PWKP dataset:

http://www.ukp.tu-darmstadt.de/software-data/data/quality-assessment/

Page 32: A Monolingual Tree-based Translation Model for Sentence Simplification

Future Work

More sophisticated features and rules to improve TSM

Extend TSM’s expressiveness to model more complex transformations: synchronous syntax is a promising direction

Evaluation methods for simplification systems: Readability Assessment

Page 33: A Monolingual Tree-based Translation Model for Sentence Simplification

Acknowledgements

Page 34: A Monolingual Tree-based Translation Model for Sentence Simplification

Thanks for your interest!

Comments & Questions!

Page 35: A Monolingual Tree-based Translation Model for Sentence Simplification

Backup: Training

EM algorithm:

Training(dataset) {
    Initialize all probability tables using the uniform distribution;
    for (several iterations) {
        reset all cnt = 0;
        for (each sentence pair <c, s> in dataset) {
            tt = buildTrainingTree(<c, s>);
            calcInsideProb(tt);
            calcOutsideProb(tt);
            update cnt for each conditioning feature in each node of tt:
                cnt = cnt + node.insideProb * node.outsideProb / root.insideProb;
        }
        updateProbability();
    }
}
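
The count update in the pseudocode above can be sketched on toy data as follows. Only the formula cnt += inside * outside / root inside comes from the slide. The node representation, the made-up probabilities and the way `normalize` renormalizes counts within rows that share the same conditioning features (a guess at what updateProbability() does) are all assumptions.

```python
from collections import defaultdict

def expected_counts(nodes, root_inside):
    """E-step: accumulate fractional counts inside * outside / root_inside per feature."""
    counts = defaultdict(float)
    for feature, inside, outside in nodes:
        counts[feature] += inside * outside / root_inside
    return counts

def normalize(counts, group_of):
    """M-step (assumed): renormalize counts within each conditioning group."""
    totals = defaultdict(float)
    for feature, c in counts.items():
        totals[group_of(feature)] += c
    return {f: c / totals[group_of(f)] for f, c in counts.items()}

# Toy nodes: (feature, inside probability, outside probability); numbers are made up.
nodes = [
    (("NP", "DT JJ NN", "012"), 0.02, 0.5),
    (("NP", "DT JJ NN", "210"), 0.005, 0.4),
    (("NP", "DT JJ NN", "012"), 0.01, 0.2),
]
counts = expected_counts(nodes, root_inside=0.02)
# Group by (constituent, children) so reorder patterns compete with each other.
probs = normalize(counts, group_of=lambda f: f[:2])
```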

Page 36: A Monolingual Tree-based Translation Model for Sentence Simplification

Backup: Training
