Preslav Nakov - The Web as a Training Set Part 2


TRANSCRIPT

Page 1

On the Stability of Web N-gram Frequencies

Page 2

Accuracy over Time: Google

Accuracy for any language, no inflections. Average coverage is shown in parentheses.

Varying time intervals, in case index changes happen periodically

(Nakov & Hearst, RANLP 2005)

Page 3

Accuracy over Time: MSN

Statistically significant

Accuracy for any language, no inflections. Average coverage is shown in parentheses.

(Nakov & Hearst, RANLP 2005)

Page 4

Accuracy by Search Engine for 6/6/2005

Accuracy for any language, no inflections. Average coverage is shown in parentheses.

Statistically significant

(Nakov & Hearst, RANLP 2005)

Page 5

Coverage by Search Engine for 6/6/2005

Coverage for any language, no inflections.

Not much variability in coverage (but Google has the biggest index)

(Nakov & Hearst, RANLP 2005)

Page 6

Conclusion

• Overall: n-gram variability does not have a statistically significant impact on performance (at least for noun compound bracketing).

(Nakov & Hearst, RANLP 2005)

Page 7

Google N-grams vs. Search Engines

Page 8

Issues with Using Search Engines

• Search engines for NLP:
  - Scientifically: not reproducible [Kilgarriff, CL 2007, “Googleology is bad science.”]
  - Practically: slow for millions of queries

Page 9

Google N-grams

• LDC2006T13: Web 1T 5-gram Version 1 [Brants & Franz, 2006]
  - N words in sequence + their count on the Web:
    o tokens: 1,024,908,267,229
    o sentences: 95,119,665,584
    o unigrams: 13,588,391
    o bigrams: 314,843,401
    o trigrams: 977,069,902
    o fourgrams: 1,313,818,354
    o fivegrams: 1,176,470,663

BUT discarded:
  - n-grams appearing less than 40 times,
  - words appearing less than 200 times.
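To make the released format concrete: each line of a Web 1T shard holds the n-gram tokens, a tab, and the Web count. A minimal sketch of loading such a file (the shard name below is hypothetical; the real shards are distributed via LDC2006T13):

```python
# A minimal sketch of scanning a Web 1T-style shard, where each line holds
# the n-gram tokens, a tab, and the Web count.
def load_ngram_counts(path, vocab=None):
    """Read 'w1 w2 ... wN<TAB>count' lines into a dict keyed by the token
    tuple, optionally keeping only n-grams whose words are all in `vocab`."""
    counts = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").rsplit("\t", 1)
            words = tuple(ngram.split(" "))
            if vocab is not None and not all(w in vocab for w in words):
                continue
            # Every stored count is >= 40 and every word occurred >= 200
            # times, because rarer items were discarded before release.
            counts[words] = int(count)
    return counts

counts = load_ngram_counts("3gm-0042")  # hypothetical shard name
```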

Page 10

Google N-gram Data Version 2

• Google N-grams Version 2 [Lin et al., LREC 2010]
  - Same source as Google N-grams Version 1
  - More pre-processing: duplicate sentence removal, sentence-length and alphabetical constraints
  - Includes part-of-speech tags!

flies 1643568 NNS|611646 VBZ|1031922
caught the flies , 11 VBD|DT|NNS|,|11
plane flies really well 10 NN|VBZ|RB|RB|10

But was never made publicly available…
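The per-tag counts in such entries can be pulled apart mechanically. A small sketch; the field layout is my own reading of the three example lines above, not an official specification:

```python
# Parse 'w1 ... wn TOTAL TAGS|COUNT TAGS|COUNT ...' where TAGS is a
# |-joined tag sequence and the last |-field of each entry is its count.
def parse_v2_entry(line, n):
    fields = line.split()
    tokens = fields[:n]
    total = int(fields[n])
    by_tags = {}
    for entry in fields[n + 1:]:
        tags, count = entry.rsplit("|", 1)   # the count is the final |-field
        by_tags[tuple(tags.split("|"))] = int(count)
    return tokens, total, by_tags

print(parse_v2_entry("flies 1643568 NNS|611646 VBZ|1031922", n=1))
# -> (['flies'], 1643568, {('NNS',): 611646, ('VBZ',): 1031922})
```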

Page 11

Microsoft Web N-grams Services

Page 12

Issues with Google N-grams

• Minor issues
  - Non-standard tokenization: hyphens, etc.
  - Truncated counts: freq(ngrams) >= 40, freq(words) >= 200
  - Too large for efficient local use: 24 GB gzipped

• Major issues
  - Mixed data: no distinction between title, body, anchor text
  - No dynamics: “dead” corpus, frozen in time

Page 13

Major Issue 1: Mixed Data

[Figure: the distinct text streams of a web page: Body, Heading, HTML Title, URL, Anchor Text, Caption, plus associated Search Queries (e.g., “google earning”, “earnings GOOG”, “gooogle quarterly report”, …)]

Page 14

Major Issue 2: Dynamics of the Web

Page 15

Microsoft Web N-grams Services

• Content types: Document Body, Document Title, Anchor Texts

• Model types: Smoothed models

• N-gram availability: unigram, bigram, trigram, N-gram with N = 4, 5

• Training size (Body): All documents indexed by Bing in the EN-US market

• Access: Hosted services by Microsoft

• Updates: Periodic updates

http://weblm.research.microsoft.com/
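Since the models are hosted rather than shipped, clients query them over HTTP. A hedged sketch of such a lookup; the path segments, parameter names, and token below are HYPOTHETICAL placeholders, as the actual REST interface is defined by the service documentation:

```python
# Hedged sketch of querying a hosted web-LM service over HTTP.
import urllib.parse
import urllib.request

def log10_joint_prob(phrase, stream="bing-body", order=3, token="YOUR_TOKEN"):
    """Request the smoothed log10 probability of `phrase` under the model
    for one content stream (body / title / anchor) and n-gram order."""
    params = urllib.parse.urlencode({"u": token, "p": phrase})
    # Hypothetical URL layout on the service host given on this slide:
    url = f"http://weblm.research.microsoft.com/{stream}/{order}/jp?{params}"
    with urllib.request.urlopen(url) as resp:
        return float(resp.read())  # assumed: plain-text log10 probability
```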

Page 16

Application: Word Breaker

(Wang, WWW 2011)

[Chart: word-breaker accuracy, roughly 95% to 99%, by n-gram order (1-gram, 2-gram, 3-gram) for each content stream: Body, Title, Query, Anchor]

Example inputs: 1800flowers, 247moms, muenchenairport, parlezvousfrancais
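The word breaker itself is a standard dynamic program: choose the segmentation whose words score highest under a language model. A minimal sketch, assuming an arbitrary `log10_prob` lookup; the toy unigram table is invented for illustration:

```python
import math

def word_break(s, log10_prob, max_word_len=20):
    """Segment a concatenated string into the word sequence with the
    highest total language-model score, via DP over split points."""
    # best[i] = (score, split) for the best segmentation of s[:i]
    best = [(-math.inf, 0)] * (len(s) + 1)
    best[0] = (0.0, 0)
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_word_len), i):
            score = best[j][0] + log10_prob(s[j:i])
            if score > best[i][0]:
                best[i] = (score, j)
    # Recover the words by walking the split points backwards
    words, i = [], len(s)
    while i > 0:
        j = best[i][1]
        words.append(s[j:i])
        i = j
    return list(reversed(words))

# Toy unigram model, for illustration only:
toy = {"1800": -2.0, "flowers": -3.0, "flower": -3.5, "s": -4.0}
print(word_break("1800flowers", lambda w: toy.get(w, -12.0)))
# -> ['1800', 'flowers']
```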

Page 17

Conclusion

There is no data like more data…

…that can be correctly exploited

Page 18

Google Book N-grams

Page 19

Google Book N-grams v.2

• 6% of all books ever published
• Syntax: POS and head-modifier relations
• Great tool to study linguistic trends
  - Esp. the evolution of syntax

https://books.google.com/ngrams
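The downloadable v2 files behind the viewer are plain tab-separated records, one (ngram, year, match_count, volume_count) tuple per line. A small sketch, assuming that published layout, for collecting the per-year counts of one word from a local, decompressed shard (the shard file name should be checked against the dataset index page cited on a later slide):

```python
from collections import Counter

def year_counts(path, target):
    """Sum match counts per year for the given n-gram string."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            ngram, year, match_count, _volume_count = \
                line.rstrip("\n").split("\t")
            if ngram == target:
                counts[int(year)] += int(match_count)
    return counts

# e.g., compare year_counts(shard, "burnt") with year_counts(shard, "burned")
# to reproduce a trend plot like the one below.
```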

Page 20

Annotated N-grams

(Lin et al., ACL 2012)

Page 21

Data Statistics

http://storage.googleapis.com/books/ngrams/books/datasetsv2.html

Page 22

Read → Book

(Lin et al., ACL 2012)

Page 23

Burnt vs. Burned

(Lin et al., ACL 2012)

Page 24

The Rise of ‘Tackle’; Relation to Football

(Lin et al., ACL 2012)

Page 25

Is the World Getting More Quantitative?

(Lin et al., ACL 2012)

Page 26

The Web of Images

Page 27

[Figure: English Web images vs. Spanish Web images for translation pairs: turtle/tortuga, candle/vela, cockatoo/cacatúa]

(Bergsma and Van Durme, IJCAI 2011)

Page 28

Application 1: Bilingual Lexicon Induction
Web-based visual similarity

(Bergsma and Van Durme, IJCAI 2011)

Color histogram features

Page 29

Application 1: Bilingual Lexicon Induction
Web-based visual similarity

(Bergsma and Van Durme, IJCAI 2011)

SIFT keypoint features

Page 30

Application 1: Bilingual Lexicon Induction
Web-based visual similarity

(Bergsma and Van Durme, IJCAI 2011)

[Figure: cosine similarities between image feature vectors (example values: 0.33, 0.55, 0.19, 0.46), aggregated either as the best match for one English image or as the average over all English images]
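A sketch of the computation behind these numbers: represent each image by a feature vector, take cosine similarities between the English and foreign image sets, and aggregate either as the best match for one English image (max) or as the average over all pairs. The color histogram here is a simplified stand-in for the paper's color-histogram and SIFT representations:

```python
import numpy as np

def color_histogram(image_rgb, bins=8):
    """image_rgb: HxWx3 uint8 array -> flattened, normalized joint histogram."""
    hist, _ = np.histogramdd(image_rgb.reshape(-1, 3),
                             bins=(bins,) * 3, range=((0, 256),) * 3)
    v = hist.ravel()
    return v / v.sum()

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def set_similarity(eng_imgs, spa_imgs, aggregate="avg"):
    """Similarity between two image sets (a candidate translation pair)."""
    sims = np.array([[cosine(color_histogram(e), color_histogram(s))
                      for s in spa_imgs] for e in eng_imgs])
    if aggregate == "max":      # best Spanish match for each English image
        return float(sims.max(axis=1).mean())
    return float(sims.mean())   # average over all English-Spanish pairs
```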

Page 31

Application 2: Lexical Preference from Images

(Bergsma and Goebel, RANLP 2011)

Can you eat “migas”?

Can you eat “carillon”?

Can you eat “mamey”?

Selectional Preference:

Is noun X a plausible object for verb Y?

Page 32

Application 2: Lexical Preference from Images

(Bergsma and Goebel, RANLP 2011)

Page 33

Application 2: Lexical Preference from Images

(Bergsma and Goebel, RANLP 2011)

For a given verb-noun pair (e.g., eat + migas):
• sum the feature vectors for all images for the noun
• apply the verb-specific weights to get a score
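A direct sketch of this scoring rule; the verb-specific weights are assumed to come from a classifier trained separately per verb (training is not shown), and the helper names in the usage comment are hypothetical:

```python
import numpy as np

def plausibility(noun_image_features, verb_weights, bias=0.0):
    """noun_image_features: list of 1-D feature vectors, one per image.
    Higher scores mean the noun is a more plausible object for the verb."""
    summed = np.sum(noun_image_features, axis=0)  # sum over the noun's images
    return float(verb_weights @ summed + bias)

# e.g., plausibility(features_for("migas"), weights_for("eat")) — a positive
# score would suggest "eat migas" is plausible.
```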