Preslav Nakov - The Web as a Training Set, Part 2
TRANSCRIPT
1
On the Stability of Web N-gram Frequencies
2
Accuracy over Time: Google
Accuracy for any language, no inflections. Average coverage is shown in parentheses.
Varying time intervals, in case index changes happen periodically
(Nakov & Hearst, RANLP 2005)
3
Accuracy over Time: MSN
Statistically significant
Accuracy for any language, no inflections. Average coverage is shown in parentheses.
(Nakov & Hearst, RANLP 2005)
4
Accuracy by Search Engine for 6/6/2005
Accuracy for any language, no inflections. Average coverage is shown in parentheses.
Statistically significant
(Nakov & Hearst, RANLP 2005)
5
Coverage by Search Engine for 6/6/2005
Coverage for any language, no inflections.
Not much variability in coverage (but Google has the biggest index)
(Nakov & Hearst, RANLP 2005)
6
Conclusion
•Overall: n-gram variability does not have statistically significant impact on performance (at least for noun compound bracketing).
(Nakov & Hearst, RANLP 2005)
7
Google N-grams vs. Search Engines
8
Issues with Using Search Engines
• Search Engines for NLP
- Scientifically: not reproducible
[Kilgarriff, CL 2007, “Googleology is bad science.”]
- Practically: slow for millions of queries
9
Google N-grams
• LDC2006T13: Web 1T 5-gram Version 1 [Brants & Franz, 2006]
- N words in sequence + their count on the Web
  o tokens: 1,024,908,267,229
  o sentences: 95,119,665,584
  o unigrams: 13,588,391
  o bigrams: 314,843,401
  o trigrams: 977,069,902
  o fourgrams: 1,313,818,354
  o fivegrams: 1,176,470,663
BUT discarded:
- n-grams appearing less than 40 times,
- words appearing less than 200 times.
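The Web 1T distribution ships as gzipped files of tab-separated "n-gram&lt;TAB&gt;count" records. A minimal streaming reader, assuming that layout (the file path and cutoff constant are illustrative), might look like:

```python
import gzip

MIN_NGRAM_COUNT = 40  # Web 1T itself discards n-grams seen fewer than 40 times


def read_web1t(path, min_count=MIN_NGRAM_COUNT):
    """Stream (ngram, count) pairs from a Web 1T-style gzipped file,
    one 'w1 w2 ... wn<TAB>count' record per line."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            ngram, count = line.rstrip("\n").rsplit("\t", 1)
            count = int(count)
            if count >= min_count:
                yield tuple(ngram.split()), count
```

Streaming keeps memory flat, which matters for a corpus that is 24G even compressed.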
10
Google N-gram Data Version 2
• Google N-grams Version 2 [Lin et al., LREC 2010]
- Same source as Google N-grams Version 1
- More pre-processing: duplicate sentence removal, sentence-length and alphabetical constraints
- Includes part-of-speech tags!

flies 1643568 NNS|611646 VBZ|1031922
caught the flies , 11 VBD|DT|NNS|,|11
plane flies really well 10 NN|VBZ|RB|RB|10
But was never made publicly available…
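The record layout shown above (the words, a total count, then per-tag-sequence counts with fields joined by "|") can be parsed as follows. This is a reconstruction from the slide's examples, not an official format specification:

```python
def parse_annotated(line, n):
    """Parse one POS-annotated n-gram record (layout inferred from
    the slide). For n=1: 'flies 1643568 NNS|611646 VBZ|1031922' ->
    (('flies',), 1643568, {('NNS',): 611646, ('VBZ',): 1031922})."""
    fields = line.split()
    words = tuple(fields[:n])          # the n tokens of the n-gram
    total = int(fields[n])             # total count across all taggings
    by_tags = {}
    for alt in fields[n + 1:]:         # each alternative: T1|...|Tn|count
        *tags, count = alt.split("|")
        by_tags[tuple(tags)] = int(count)
    return words, total, by_tags
```

Passing n explicitly sidesteps the ambiguity of tokens that look like counts (e.g., the "," token in the second example).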
11
Microsoft Web N-grams Services
12
Issues with Google N-grams
• Minor issues
- Non-standard tokenization: hyphens, etc.
- Truncated counts: freq(ngrams) >= 40, freq(words) >= 200
- Too large for efficient local use: 24G gzipped
• Major issues
- Mixed data: no distinction between title, body, anchor text
- No dynamics: “dead” corpus, frozen in time
13
Major Issue 1: Mixed Data
[Figure: a web page mixes several content types: Body, Heading, HTML Title, URL, Anchor Text, Caption; plus the search queries that lead to it (google earning, earnings, GOOG, gooogle quarterly report, …)]
14
Major Issue 2: Dynamics of the Web
15
Microsoft Web N-grams Services
• Content types: Document Body, Document Title, Anchor Texts
• Model types: Smoothed models
• N-gram availability: unigram, bigram, trigram, N-gram with N=4, 5
• Training size (Body): All documents indexed by Bing in the EN-US market
• Access: Hosted Services by Microsoft
• Updates: Periodical updates
http://weblm.research.microsoft.com/
16
[Chart: word-breaking accuracy from ~95% to ~99% for 1-gram, 2-gram, and 3-gram models trained on Body, Title, Query, and Anchor text]
Application: Word Breaker
(Wang, WWW 2011)
1800flowers
247moms
muenchenairport
parlezvousfrancais
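A word breaker of this kind can be sketched as a Viterbi-style segmentation maximizing unigram log-probability. The vocabulary and log-probabilities below are a toy stand-in; a real system would use web-scale counts from body, title, query, or anchor text:

```python
# Hypothetical unigram log-probabilities (illustrative, not real counts).
LOGP = {"1800": -4.0, "flowers": -3.5, "247": -4.5, "moms": -4.2,
        "muenchen": -5.0, "airport": -3.8, "parlez": -6.0,
        "vous": -5.5, "francais": -6.2}
OOV = -20.0  # heavy penalty for chunks not in the vocabulary


def word_break(s, max_len=12):
    """Return the segmentation of s maximizing summed unigram log-prob."""
    best = [0.0] + [float("-inf")] * len(s)  # best score ending at position i
    back = [0] * (len(s) + 1)                # backpointer to previous cut
    for i in range(1, len(s) + 1):
        for j in range(max(0, i - max_len), i):
            score = best[j] + LOGP.get(s[j:i], OOV)
            if score > best[i]:
                best[i], back[i] = score, j
    words, i = [], len(s)                    # recover the cut points
    while i > 0:
        words.append(s[back[i]:i])
        i = back[i]
    return list(reversed(words))


print(word_break("1800flowers"))  # ['1800', 'flowers']
```

The dynamic program is O(len(s) * max_len); higher-order n-gram models refine the scores but keep the same structure.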
17
Conclusion
There is no data like more data…
…that can be correctly exploited
18
Google Book N-grams
19
Google Book N-grams v.2
• 6% of all books ever published
• Syntax: POS and head-modifier relations
• Great tool to study linguistic trends
• Esp. the evolution of syntax
https://books.google.com/ngrams
20
Annotated N-grams
(Lin et al., ACL 2012)
21
Data Statistics
http://storage.googleapis.com/books/ngrams/books/datasetsv2.html
22
Read--Book
(Lin et al., ACL 2012)
23
Burnt vs. Burned
(Lin et al., ACL 2012)
24
The Rise of ‘Tackle’; Relation to Football
(Lin et al., ACL 2012)
25
Is the World Getting More Quantitative?
(Lin et al., ACL 2012)
26
The Web of Images
27
English Web Images
Spanish Web Images
turtle
candle
vela
tortuga
cockatoo
cacatúa
(Bergsma and Van Durme, IJCAI 2011)
28
Application 1: Bilingual Lexicon Induction
Web-based visual similarity
(Bergsma and Van Durme, IJCAI 2011)
Color histogram features
29
Application 1: Bilingual Lexicon Induction
Web-based visual similarity
(Bergsma and Van Durme, IJCAI 2011)
SIFT keypoint features
30
Application 1: Bilingual Lexicon Induction
Web-based visual similarity
(Bergsma and Van Durme, IJCAI 2011)
[Figure: vector cosine similarities between English and Spanish image sets (values shown: 0.33, 0.55, 0.19, 0.46), comparing the best match for one English image with the average over all English images]
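The two aggregates in the figure (best match for one English image, then the average of best matches over all English images) can be sketched with plain cosine similarity over feature vectors. Function names and vectors here are illustrative, not from the paper:

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def visual_similarity(en_images, es_images):
    """For each English image, take its best-matching foreign image;
    return the average of those best-match similarities."""
    best = [max(cosine(e, f) for f in es_images) for e in en_images]
    return sum(best) / len(best)
```

The same routine works for any per-image representation, e.g., color histograms or bag-of-SIFT-keypoint vectors.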
31
Application 2: Lexical Preference from Images
(Bergsma and Goebel, RANLP 2011)
Can you eat “migas”?
Can you eat “carillon”?
Can you eat “mamey”?
Selectional Preference:
Is noun X a plausible object for verb Y?
32
Application 2: Lexical Preference from Images
(Bergsma and Goebel, RANLP 2011)
33
Application 2: Lexical Preference from Images
(Bergsma and Goebel, RANLP 2011)
For a given verb-noun pair (e.g., eat + migas):
• sum feature vectors for all images for the noun
• apply the verb-specific weights to get a score
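The recipe above amounts to a per-verb linear model over summed image features. A minimal sketch, with illustrative weights and feature dimensions (not the paper's actual features or training):

```python
def plausibility(verb_weights, image_vectors, bias=0.0):
    """Score a verb-noun pair: sum the feature vectors of all images
    retrieved for the noun, then apply the verb-specific weights."""
    summed = [0.0] * len(verb_weights)
    for vec in image_vectors:               # pool evidence across images
        for i, x in enumerate(vec):
            summed[i] += x
    return bias + sum(w * x for w, x in zip(verb_weights, summed))
```

A positive score would mark the noun as a plausible object for the verb; the weights would come from a classifier trained per verb.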