Preslav Nakov - The Web as a Training Set, Part 2
TRANSCRIPT
1
On the Stability of Web N-gram Frequencies
2
Accuracy over Time: Google
Accuracy for any language, no inflections. Average coverage is shown in parentheses.
Varying time intervals, in case index changes happen periodically
(Nakov & Hearst, RANLP 2005)
3
Accuracy over Time: MSN
Statistically significant
Accuracy for any language, no inflections. Average coverage is shown in parentheses.
(Nakov & Hearst, RANLP 2005)
4
Accuracy by Search Engine for 6/6/2005
Accuracy for any language, no inflections. Average coverage is shown in parentheses.
Statistically significant
(Nakov & Hearst, RANLP 2005)
5
Coverage by Search Engine for 6/6/2005
Coverage for any language, no inflections.
Not much variability in coverage (but Google has the biggest index)
(Nakov & Hearst, RANLP 2005)
6
Conclusion
•Overall: n-gram variability does not have statistically significant impact on performance (at least for noun compound bracketing).
(Nakov & Hearst, RANLP 2005)
7
Google N-grams vs. Search Engines
8
Issues with Using Search Engines
• Search Engines for NLP
- Scientifically: not reproducible
[Kilgarriff, CL 2007, “Googleology is bad science.”]
- Practically: slow for millions of queries
9
Google N-grams
• LDC2006T13: Web 1T 5-gram Version 1 [Brants & Franz, 2006]
- N words in sequence + their count on the Web
  o tokens: 1,024,908,267,229
  o sentences: 95,119,665,584
  o unigrams: 13,588,391
  o bigrams: 314,843,401
  o trigrams: 977,069,902
  o fourgrams: 1,313,818,354
  o fivegrams: 1,176,470,663
BUT discarded:
- n-grams appearing less than 40 times,
- words appearing less than 200 times.
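The Web 1T distribution ships as gzipped files of tab-separated "n-gram&lt;TAB&gt;count" records. A minimal streaming reader, assuming that layout (the file path and cutoff constant are illustrative), might look like:

```python
import gzip

MIN_NGRAM_COUNT = 40  # Web 1T itself discards n-grams seen fewer than 40 times


def read_web1t(path, min_count=MIN_NGRAM_COUNT):
    """Stream (ngram, count) pairs from a Web 1T-style gzipped file,
    one 'w1 w2 ... wn<TAB>count' record per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").rsplit("\t", 1)
            count = int(count)
            if count >= min_count:
                yield tuple(ngram.split()), count
```

Streaming keeps memory flat, which matters for a corpus that is 24G even compressed.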
10
Google N-gram Data Version 2
• Google N-grams Version 2 [Lin et al., LREC 2010]
- Same source as Google N-grams Version 1
- More pre-processing: duplicate sentence removal, sentence-length and alphabetical constraints
- Includes part-of-speech tags!

flies 1643568 NNS|611646 VBZ|1031922
caught the flies , 11 VBD|DT|NNS|,|11
plane flies really well 10 NN|VBZ|RB|RB|10
But was never made publicly available…
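The record layout shown above (the words, a total count, then per-tag-sequence counts with fields joined by "|") can be parsed as follows. This is a reconstruction from the slide's examples, not an official format specification:

```python
def parse_annotated(line, n):
    """Parse one POS-annotated n-gram record (layout inferred from
    the slide). For n=1: 'flies 1643568 NNS|611646 VBZ|1031922' ->
    (('flies',), 1643568, {('NNS',): 611646, ('VBZ',): 1031922})."""
    fields = line.split()
    words = tuple(fields[:n])          # the n tokens of the n-gram
    total = int(fields[n])             # total count across all taggings
    by_tags = {}
    for alt in fields[n + 1:]:         # each alternative: T1|...|Tn|count
        *tags, count = alt.split("|")
        by_tags[tuple(tags)] = int(count)
    return words, total, by_tags
```

Passing n explicitly sidesteps the ambiguity of tokens that look like counts (e.g., the "," token in the second example).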
11
Microsoft Web N-grams Services
12
Issues with Google N-grams
• Minor issues
- Non-standard tokenization: hyphens, etc.
- Truncated counts: freq(ngrams) >= 40, freq(words) >= 200
- Too large for efficient local use: 24G gzipped
• Major issues
- Mixed data: no distinction between title, body, anchor text
- No dynamics: “dead” corpus, frozen in time
13
Major Issue 1: Mixed Data
[Figure: a web page mixes several content types: Body, Heading, HTML Title, URL, Anchor Text, Caption; plus the search queries that lead to it (google earning, earnings, GOOG, gooogle quarterly report, …)]
14
Major Issue 2: Dynamics of the Web
15
Microsoft Web N-grams Services
• Content types: Document Body, Document Title, Anchor Texts
• Model types: Smoothed models
• N-gram availability: unigram, bigram, trigram, N-gram with N=4, 5
• Training size (Body): All documents indexed by Bing in the EN-US market
• Access: Hosted Services by Microsoft
• Updates: Periodical updates
http://weblm.research.microsoft.com/
16
[Chart: word-breaking accuracy from ~95% to ~99% for 1-gram, 2-gram, and 3-gram models trained on Body, Title, Query, and Anchor text]
Application: Word Breaker
(Wang, WWW 2011)
1800flowers
247moms
muenchenairport
parlezvousfrancais
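A word breaker of this kind can be sketched as a Viterbi-style segmentation maximizing unigram log-probability. The vocabulary and log-probabilities below are a toy stand-in; a real system would use web-scale counts from body, title, query, or anchor text:

```python
# Hypothetical unigram log-probabilities (illustrative, not real counts).
LOGP = {"1800": -4.0, "flowers": -3.5, "247": -4.5, "moms": -4.2,
        "muenchen": -5.0, "airport": -3.8, "parlez": -6.0,
        "vous": -5.5, "francais": -6.2}
OOV = -20.0  # heavy penalty for chunks not in the vocabulary


def word_break(s, max_len=12):
    """Return the segmentation of s maximizing summed unigram log-prob."""
    best = [0.0] + [float("-inf")] * len(s)  # best score ending at position i
    back = [0] * (len(s) + 1)                # backpointer to previous cut
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + LOGP.get(s[j:i], OOV)
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], len(s)                    # recover the cut points
    while i > 0:
        words.append(s[back[i]:i])
        i = back[i]
    return list(reversed(words))


print(word_break("1800flowers"))  # ['1800', 'flowers']
```

The dynamic program is O(len(s) * max_len); higher-order n-gram models refine the scores but keep the same structure.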
17
Conclusion
There is no data like more data…
…that can be correctly exploited
18
Google Book N-grams
19
Google Book N-grams v.2
• 6% of all books ever published
• Syntax: POS and head-modifier relations
• Great tool to study linguistic trends
• Esp. the evolution of syntax
https://books.google.com/ngrams
20
Annotated N-grams
(Lin et al., ACL 2012)
21
Data Statistics
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
22
Read--Book
(Lin et al., ACL 2012)
23
Burnt vs. Burned
(Lin et al., ACL 2012)
24
The Rise of ‘Tackle’; Relation to Football
(Lin et al., ACL 2012)
25
Is the World Getting More Quantitative?
(Lin et al., ACL 2012)
26
The Web of Images
27
English Web Images
Spanish Web Images
turtle
candle
vela
tortuga
cockatoo
cacatúa
(Bergsma and Van Durme, IJCAI 2011)
28
Application 1: Bilingual Lexicon Induction
Web-based visual similarity
(Bergsma and Van Durme, IJCAI 2011)
Color histogram features
29
Application 1: Bilingual Lexicon Induction
Web-based visual similarity
(Bergsma and Van Durme, IJCAI 2011)
SIFT keypoint features
30
Application 1: Bilingual Lexicon Induction
Web-based visual similarity
(Bergsma and Van Durme, IJCAI 2011)
[Figure: vector cosine similarities between English and Spanish image sets (values shown: 0.33, 0.55, 0.19, 0.46), comparing the best match for one English image with the average over all English images]
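The two aggregates in the figure (best match for one English image, then the average of best matches over all English images) can be sketched with plain cosine similarity over feature vectors. Function names and vectors here are illustrative, not from the paper:

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def visual_similarity(en_images, es_images):
    """For each English image, take its best-matching foreign image;
    return the average of those best-match similarities."""
    best = [max(cosine(e, f) for f in es_images) for e in en_images]
    return sum(best) / len(best)
```

The same routine works for any per-image representation, e.g., color histograms or bag-of-SIFT-keypoint vectors.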
31
Application 2: Lexical Preference from Images
(Bergsma and Goebel, RANLP 2011)
Can you eat “migas”?
Can you eat “carillon”?
Can you eat “mamey”?
Selectional Preference:
Is noun X a plausible object for verb Y?
32
Application 2: Lexical Preference from Images
(Bergsma and Goebel, RANLP 2011)
33
Application 2: Lexical Preference from Images
(Bergsma and Goebel, RANLP 2011)
For a given verb-noun pair (e.g., eat + migas):
• sum feature vectors for all images for the noun
• apply the verb-specific weights to get a score
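The recipe above amounts to a per-verb linear model over summed image features. A minimal sketch, with illustrative weights and feature dimensions (not the paper's actual features or training):

```python
def plausibility(verb_weights, image_vectors, bias=0.0):
    """Score a verb-noun pair: sum the feature vectors of all images
    retrieved for the noun, then apply the verb-specific weights."""
    summed = [0.0] * len(verb_weights)
    for vec in image_vectors:               # pool evidence across images
        for i, x in enumerate(vec):
            summed[i] += x
    return bias + sum(w * x for w, x in zip(verb_weights, summed))
```

A positive score would mark the noun as a plausible object for the verb; the weights would come from a classifier trained per verb.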