Computing n-Gram Statistics in MapReduce · kberberi/presentations/... · 2017-12-09
TRANSCRIPT
Computing n-Gram Statistics in MapReduce
Klaus Berberich ([email protected])
Srikanta Bedathur ([email protected])
Computing n-Gram Statistics in MapReduce – Klaus Berberich / 26
n-Gram Statistics
✦ Statistics about variable-length word sequences (e.g., lord of the rings, at the end of, …) have many applications in fields including
✦ Information Retrieval
✦ Natural Language Processing
✦ Digital Humanities
Example word sequences shown on the slide:
siri how is the
rates hilton paris
the hilton paris offers great rates in the summer (d42)
thou shalt not
don’t ya
Problem Statement
How can we efficiently compute statistics about n-grams that occur at least τ times and consist of at most σ words, using MapReduce?
✦ Can be seen as a special case of frequent sequence mining (no gaps, single-item transactions only) with a slightly different notion of frequency
✦ Our focus is on large-scale document collections (millions of documents or more, natural language)
Outline
✦ Motivation
✦ Competitors & Challenges
✦ SUFFIX-σ
✦ Extensions
✦ Experimental Evaluation
✦ Conclusion
MapReduce
✦ Distributed data processing platform by Google [1]
✦ for clusters of commodity hardware
✦ handles hardware/software failures transparently
✦ available as open-source Apache Hadoop
✦ Programming model operating on key-value pairs
✦ map() : <k1,v1> -> list<k2,v2>
✦ reduce() : <k2,list<v2>> -> list<k3,v3>
✦ compare() and partition()
[1] J. Dean and S. Ghemawat: Simplified Data Processing on Large Clusters, OSDI 2004
WORD COUNT
✦ Determine counts of all individual words
map(did, content):
  for all words in content:
    emit(word, did)
reduce(word, list<did>):
  emit(word, length(list<did>))
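[Figure: WORD COUNT dataflow. Input documents d1@t1 = a x b b a y and d2@t2 = b y a x a b are read by mappers M1 … Mn, whose map() calls emit (a,d1@t1), (x,d1@t1), … and (b,d2@t2), (y,d2@t2), …. The shuffle phase applies partition() and compare() to route and sort the pairs for reducers R1 … Rm, which receive (a,d1@t1), (a,d2@t2), … and (x,d1@t1), (x,d2@t2), …. Their reduce() calls output (a,4), (b,4), … and (x,2), (y,2), ….]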
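The dataflow in the figure can be imitated in a single Python process. The following is only a sketch for illustration: `run_job`, the byte-sum partitioning rule, and the parameter `m` are assumptions made here, not part of Hadoop or of the authors' code, and nothing is actually distributed.

```python
from itertools import groupby


def map_fn(did, content):
    # Emit one (word, did) pair per word occurrence.
    for word in content.split():
        yield word, did


def reduce_fn(word, dids):
    # The number of collected document ids equals the word's count.
    yield word, len(dids)


def run_job(docs, m=2):
    # Shuffle, step 1: a partition rule routes each pair to one of m reducers.
    # (Hypothetical rule: sum of the key's bytes modulo m.)
    buckets = [[] for _ in range(m)]
    for did, content in docs.items():
        for key, value in map_fn(did, content):
            buckets[sum(key.encode()) % m].append((key, value))
    counts = {}
    for bucket in buckets:
        # Shuffle, step 2: each reducer sees its keys in sorted order.
        bucket.sort(key=lambda pair: pair[0])
        for key, group in groupby(bucket, key=lambda pair: pair[0]):
            for word, count in reduce_fn(key, [v for _, v in group]):
                counts[word] = count
    return counts


docs = {"d1@t1": "a x b b a y", "d2@t2": "b y a x a b"}
print(run_job(docs))
```

For the two example documents this yields the counts from the figure: 4 for a and b, 2 for x and y.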
N-GRAM COUNT
map(did, content):
  for k in <1 ... σ>:
    for all k-grams in content:
      emit(k-gram, did)
reduce(n-gram, list<did>):
  if length(list<did>) >= τ:
    emit(n-gram, length(list<did>))
Example: for d1@t1 = a x b b a y, map() emits
(a,d1@t1), (x,d1@t1), …, (ax,d1@t1), (xb,d1@t1), …, (axb,d1@t1), (xbb,d1@t1), …, (axbb,d1@t1), (xbba,d1@t1), …, (axbba,d1@t1), (xbbay,d1@t1), …, (axbbay,d1@t1)
[1] J. Dean and S. Ghemawat: Simplified Data Processing on Large Clusters, OSDI 2004
[2] T. Brants et al.: Large Language Models in Machine Translation, EMNLP-CoNLL 2007
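A minimal in-memory sketch of N-GRAM COUNT, assuming whitespace-tokenized documents held in a dictionary. The function name `ngram_count` and the use of word tuples as keys are choices made for this illustration only.

```python
from collections import defaultdict


def ngram_count(docs, sigma, tau):
    """docs maps a document id to its text; returns {n-gram: count}."""
    shuffled = defaultdict(list)
    # map(): emit every k-gram with 1 <= k <= sigma together with its document id.
    for did, content in docs.items():
        words = content.split()
        for k in range(1, sigma + 1):
            for i in range(len(words) - k + 1):
                shuffled[tuple(words[i:i + k])].append(did)
    # reduce(): keep only n-grams that occur at least tau times.
    return {ng: len(dids) for ng, dids in shuffled.items() if len(dids) >= tau}


docs = {"d1@t1": "a x b b a y", "d2@t2": "b y a x a b"}
print(ngram_count(docs, sigma=3, tau=2))
```

A document of n words produces n + (n-1) + … + (n-σ+1) pairs in map(), which is 21 pairs for the six-word example document when σ = 6.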
APRIORI-SCAN & APRIORI-INDEX
✦ Apriori Principle: a k-gram can occur at least τ times only if its constituent (k-1)-grams occur at least τ times
APRIORI-SCAN
(1) dictionary { }: d1@t1 = a x b b a y yields (a,d1@t1), (b,d1@t1), (x,d1@t1), (y,d1@t1)
(2) dictionary {a,b,x,y}: d1@t1 = a x b b a y yields (ax,d1@t1), (ay,d1@t1), …, (bb,d1@t1), (ba,d1@t1), …, (xb,d1@t1)
APRIORI-INDEX
(2) ab: d5@t5 [2,7], d7@t7 [1,11]
    bx: d5@t5 [8], d7@t7 [2]
    joined into abx: d5@t5 [7], d7@t7 [1]
(3) abx: d8@t8 [2,7], d9@t9 [1,11]
    bxy: d8@t8 [3]
    joined into abxy: d8@t8 [2]
[1] R. Srikant and R. Agrawal: Mining Sequential Patterns: Generalizations & Performance Improvements, EDBT 1996
[2] M. J. Zaki: SPADE: An Efficient Algorithm for Mining Frequent Sequences, ML 42(1/2):31-60, 2001
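Both competitors can be sketched in plain Python. This is a simplified, single-process reading of the slide rather than the authors' implementation: `apriori_scan` performs one pass over the documents per n-gram length (one job per pass in MapReduce), and `join_postings` joins two positional posting lists the way the APRIORI-INDEX example suggests. Both function names and the data layout are assumptions.

```python
from collections import Counter


def apriori_scan(docs, sigma, tau):
    """One scan of the documents per n-gram length k."""
    frequent = {}    # all frequent n-grams found so far, with counts
    previous = None  # frequent (k-1)-grams: the dictionary used in pass k
    for k in range(1, sigma + 1):
        counts = Counter()
        for content in docs.values():
            words = content.split()
            for i in range(len(words) - k + 1):
                gram = tuple(words[i:i + k])
                # Apriori pruning: both constituent (k-1)-grams must be frequent.
                if k == 1 or (gram[:-1] in previous and gram[1:] in previous):
                    counts[gram] += 1
        previous = {g: c for g, c in counts.items() if c >= tau}
        if not previous:
            break
        frequent.update(previous)
    return frequent


def join_postings(left, right):
    """left/right map a document id to the positions of two k-grams that
    overlap in k-1 words; returns the positions of the joined (k+1)-gram."""
    joined = {}
    for did in left.keys() & right.keys():
        follow = set(right[did])
        positions = [p for p in left[did] if p + 1 in follow]
        if positions:
            joined[did] = positions
    return joined


ab = {"d5@t5": [2, 7], "d7@t7": [1, 11]}
bx = {"d5@t5": [8], "d7@t7": [2]}
print(join_postings(ab, bx))
```

Joining the posting lists of ab and bx from the slide gives position 7 in d5@t5 and position 1 in d7@t7, matching the abx entry shown there.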
Challenges & Desiderata
✦ Single MapReduce Job (N-GRAM COUNT ✓ / APRIORI-SCAN ✗ / APRIORI-INDEX ✗)
✦ Communication Cost (N-GRAM COUNT ✗ / APRIORI-SCAN ✓ / APRIORI-INDEX ✓)
✦ Main-Memory Consumption (N-GRAM COUNT ✓ / APRIORI-SCAN ✗ / APRIORI-INDEX ✗)
✦ Ease of Implementation (N-GRAM COUNT ✓ / APRIORI-SCAN ✗ / APRIORI-INDEX ✗)
SUFFIX-σ
✦ SUFFIX-σ is based on three key ideas, inspired by methods from String Processing (e.g., suffix arrays)
✦ emit only suffixes of documents in map() to reduce communication cost
✦ partition() suffixes based on their first word
✦ sort suffixes in reverse lexicographic order to limit main-memory consumption in reduce()
Suffixes
✦ SUFFIX-σ emits only suffixes of documents in map()
✦ each of them represents multiple n-grams corresponding to its prefixes (e.g., axbbay represents a, ax, axb, axbb, axbba, and axbbay)
[Figure: the pairs for d1@t1 = a x b b a y from the N-GRAM COUNT slide: (a,d1@t1), (x,d1@t1), …, (ax,d1@t1), (xb,d1@t1), …, (axb,d1@t1), (xbb,d1@t1), …, (axbb,d1@t1), (xbba,d1@t1), …, (axbba,d1@t1), (xbbay,d1@t1), …, (axbbay,d1@t1)]
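The saving in emitted pairs can be seen in a short sketch. It assumes whitespace-tokenized text and, as a labeled assumption not stated on the slide, cuts each suffix off after σ words because longer prefixes would never be counted. `emit_suffixes` and `prefixes` are names invented for this example.

```python
def emit_suffixes(did, content, sigma):
    """map(): one pair per word position instead of one pair per n-gram."""
    words = content.split()
    # Assumption: suffixes are truncated to at most sigma words.
    return [(tuple(words[i:i + sigma]), did) for i in range(len(words))]


def prefixes(suffix):
    """All n-grams that a single suffix represents."""
    return [suffix[:k] for k in range(1, len(suffix) + 1)]


pairs = emit_suffixes("d1@t1", "a x b b a y", sigma=6)
print(len(pairs))
print(["".join(p) for p in prefixes(pairs[0][0])])
```

For the six-word example document this emits 6 pairs, compared with 21 pairs for N-GRAM COUNT at σ = 6, and the first suffix stands for a, ax, axb, axbb, axbba, and axbbay.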
Partitioning
✦ SUFFIX-σ partitions suffixes based on their first word
✦ brings together suffixes representing same n-gram
✦ crucial for computation in single MapReduce job
Example: the suffixes (axbbay,d1@t1), (xbbay,d1@t1), (yyabbx,d3@t3), (yabbx,d3@t3), (axbyyx,d4@t4), (xbyyx,d4@t4) are partitioned by first word into
(axbbay,d1@t1), (axbyyx,d4@t4)
(xbbay,d1@t1), (xbyyx,d4@t4)
(yyabbx,d3@t3), (yabbx,d3@t3)
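A small sketch of first-word partitioning on the suffixes from the example. It assumes, as in the slide's toy data, that each word is a single character, so `ord()` stands in for an integer word id. The helper `shuffle` and the choice m = 5 are made up for this illustration.

```python
from collections import defaultdict


def partition(suffix, m):
    # Only the first word decides the reducer, so all suffixes that can
    # contribute to the same n-gram meet at the same reducer.
    return ord(suffix[0]) % m


def shuffle(pairs, m):
    reducers = defaultdict(list)
    for suffix, did in pairs:
        reducers[partition(suffix, m)].append((suffix, did))
    return dict(reducers)


pairs = [("axbbay", "d1@t1"), ("xbbay", "d1@t1"), ("yyabbx", "d3@t3"),
         ("yabbx", "d3@t3"), ("axbyyx", "d4@t4"), ("xbyyx", "d4@t4")]
for reducer, received in sorted(shuffle(pairs, m=5).items()):
    print(reducer, received)
```

With fewer reducers than distinct first words, several first words share a reducer; what matters is that suffixes with the same first word are never split up.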
Sorting
✦ SUFFIX-σ sorts suffixes in reverse lexicographic order
✦ bookkeeping using stack of bounded height σ
✦ crucial for low main-memory consumption
Example (suffixes arriving at one reducer):
Input: (axbbay,d1@t1), (axbyyx,d4@t4), (abbxa,d5@t5), (aaxxa,d6@t6), (axxxa,d7@t7), (aax,d8@t9)
Sorted in reverse lexicographic order: (axxxa,d7@t7), (axbyyx,d4@t4), (axbbay,d1@t1), (abbxa,d5@t5), (aaxxa,d6@t6), (aax,d8@t9)
Stack after reading axxxa, axbyyx, and axbbay (bottom to top): a, x {d7@t7}, b {d4@t4}, b, a, y {d1@t1}
Emitted when abbxa arrives: (axbbay,1), (axbba,1), (axbb,1), (axb,2), (ax,3)
Stack afterwards (bottom to top): a {d1@t1, d4@t4, d7@t7}, b, b, x, a {d5@t5}
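The stack bookkeeping can be sketched as follows. This is one possible reading of the slide, not the authors' code: it keeps only counts rather than document sets, treats each character of the toy suffixes as a word, and the names `suffix_sigma_reduce` and `pop_down_to` are hypothetical.

```python
def suffix_sigma_reduce(sorted_suffixes, tau):
    """sorted_suffixes: suffixes sharing a first word, in reverse
    lexicographic order, one entry per occurrence.
    Returns (n-gram, count) pairs with count >= tau."""
    terms, counts, output = [], [], []

    def pop_down_to(depth):
        # Pop entries above `depth`, emitting frequent n-grams on the way.
        while len(terms) > depth:
            count = counts.pop()
            if count >= tau:
                output.append((tuple(terms), count))
            terms.pop()
            if counts:
                # Every occurrence of an n-gram also counts for its prefix.
                counts[-1] += count

    for suffix in sorted_suffixes:
        # Length of the longest common prefix of stack content and suffix.
        lcp = 0
        while lcp < min(len(terms), len(suffix)) and terms[lcp] == suffix[lcp]:
            lcp += 1
        pop_down_to(lcp)
        if len(terms) == len(suffix):
            counts[-1] += 1      # the same n-gram seen again
        else:
            for word in suffix[lcp:]:
                terms.append(word)
                counts.append(0)
            counts[-1] = 1       # the occurrence belongs to the top entry
    pop_down_to(0)               # cleanup after the last suffix
    return output


ordered = ["axxxa", "axbyyx", "axbbay", "abbxa", "aaxxa", "aax"]
for gram, count in suffix_sigma_reduce(ordered, tau=2):
    print("".join(gram), count)
```

The stack never holds more entries than the longest suffix has words, so its height is bounded by σ when suffixes are at most σ words long. With τ = 1, the pairs emitted when abbxa arrives are (axbbay,1), (axbba,1), (axbb,1), (axb,2), (ax,3), as in the example.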
SUFFIX-σ
map(did, content):
  for all suffixes in content:
    emit(suffix, did)
partition(suffix, did):
  return suffix[0] % m
compare(suffix0, suffix1):
  return -strcmp(suffix0, suffix1)
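The compare() function above translates almost directly into Python. The sketch below uses Python's built-in string ordering in place of strcmp and a hypothetical helper `sort_for_reducer`; characters again play the role of words.

```python
from functools import cmp_to_key


def compare(suffix0, suffix1):
    # Negated three-way comparison, mirroring -strcmp(): larger suffixes come first.
    return -((suffix0 > suffix1) - (suffix0 < suffix1))


def sort_for_reducer(pairs):
    # pairs are (suffix, did); only the suffix takes part in the comparison.
    return sorted(pairs, key=cmp_to_key(lambda p, q: compare(p[0], q[0])))


pairs = [("axbbay", "d1@t1"), ("axbyyx", "d4@t4"), ("abbxa", "d5@t5"),
         ("aaxxa", "d6@t6"), ("axxxa", "d7@t7"), ("aax", "d8@t9")]
print([suffix for suffix, _ in sort_for_reducer(pairs)])
```

Applied to the six suffixes from the Sorting slide, this reproduces the order shown there: axxxa, axbyyx, axbbay, abbxa, aaxxa, aax. A suffix that is a prefix of another (aax and aaxxa) comes after the longer one.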
Extensions
✦ Closed/Maximal n-Grams
✦ SUFFIX-σ can emit only prefix-closed/maximal n-grams in reduce(); additional MapReduce job then identifies suffix-closed/maximal n-grams
✦ Other Aggregations
✦ n-gram time series
✦ n-gram inverted index
Example of an n-gram inverted index:
a b: d2, d7, d9
a b c: d2, d7
b c x: d3, d6
[1] J.-B. Michel et al.: Quantitative Analysis of Culture Using Millions of Digitized Books, Science 2010
Datasets & Setup
✦ The New York Times Annotated Corpus (NYT): 1.8 million newspaper articles, 1987 – 2007, ~3 GB
✦ ClueWeb09-B (CW): 50 million web documents, 2009, ~246 GB
✦ 10 Cluster Nodes (2x6 cores, 64 GB RAM, 4x2 TB HDD, Debian 5.0.9, 1 Gbit Ethernet, CDH3u0)
✦ Implementation operates on compressed integer sequences; datasets pre-processed accordingly
Use Cases
✦ Training a Statistical Language Model (LM)
✦ σ = 5 (i.e., n-grams consisting of up to five words)
✦ τ = 10 (NYT) / τ = 100 (CW)
✦ Identifying Repeated Text Fragments (RT)
✦ σ = 100 (to also capture quotations, idioms, etc.)
✦ τ = 100 (NYT) / τ = 1,000 (CW)
Results (LM)
[Figure: bar chart of wallclock time in minutes (log scale, 1 to 10,000) on NYT and CW for N-GRAM COUNT, APRIORI-SCAN, APRIORI-INDEX, and SUFFIX-σ. Bar labels in extraction order: 81, 3, 3,809, 37, 240, 9, 309, 10.]
Results (RT)
[Figure: bar chart of wallclock time in minutes (log scale, 1 to 10,000) on NYT and CW for N-GRAM COUNT, APRIORI-SCAN, APRIORI-INDEX, and SUFFIX-σ. Bar labels in extraction order: 229, 5, 393, 62, 338, 77, 15,000, 117.]
Conclusion
✦ SUFFIX-σ – computes n-gram statistics in MapReduce
✦ based on “suffix idea” from String Processing
✦ robust to wide variety of parameter choices
✦ outperforms state-of-the-art competitors
✦ runs in a single MapReduce job, consumes little main memory, and is easy to implement
Advertisements
✦ Code: http://github.com/kberberi/mpiingrams
✦ EU Project: Longitudinal Analytics of Web Archive Data
✦ Follow-Up Work: I. Miliaraki, K. Berberich, R. Gemulla, S. Zoupanos: Mind the Gap: Large-Scale Frequent Sequence Mining, SIGMOD 2013
Thank you!
Questions?