SamudraManthan Popular terms


Upload: lucy-price

Post on 30-Dec-2015


Page 1: SamudraManthan   Popular terms

1

SamudraManthan Popular terms

Dinesh Bhirud

Prasad Kulkarni

Varada Kolhatkar

Page 2: SamudraManthan   Popular terms

2

Architecture

[Diagram: MANAGER; WORKER PROCESSORS, each running create data structures, Ngram pruning, and intra-process reduction; REDUCTION; Finding Top Ngrams]

Page 3: SamudraManthan   Popular terms

3

Data Distribution

[Diagram: MANAGER (P0) connected through the HANDSHAKE MODULE to WORKER P1, WORKER P2, ..., WORKER PN]

Handshake protocol

1. Signal Ready (W->M)

2. Data msg (M->W)

3. Next ready signal (W->M)

...

4. Terminate msg (M->W)

Page 4: SamudraManthan   Popular terms

4

Data Distribution (contd…)

Manager (processor 0) reads an article and passes it on to the other processors (workers) in a round-robin fashion

Before sending a new article to the same worker, the manager waits until the worker is ready to receive more data.

The worker processes the article and creates data structures before receiving a new article.

Sends & receives are synchronous.
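The manager side of the handshake can be sketched in plain C, without MPI calls, as two small decisions: which worker gets the next article, and what to answer when a worker signals ready. The enum and function names are hypothetical, and the sketch assumes rank 0 is the manager and ranks 1..N are workers.

```c
/* Hypothetical message tags mirroring the handshake steps on the slide. */
enum msg_tag { MSG_READY, MSG_DATA, MSG_TERMINATE };

/* Round-robin assignment: article k (0-based) goes to worker rank
 * 1 + (k mod num_workers). Rank 0 is assumed to be the manager. */
int worker_for_article(long article_index, int num_workers)
{
    return 1 + (int)(article_index % num_workers);
}

/* Manager's reply to a READY signal: send data while articles remain,
 * otherwise tell the worker to terminate. */
enum msg_tag manager_reply(long articles_sent, long total_articles)
{
    return articles_sent < total_articles ? MSG_DATA : MSG_TERMINATE;
}
```

In an MPI program these decisions would sit around blocking send and receive calls, which matches the synchronous behaviour described above.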

Page 5: SamudraManthan   Popular terms

5

Suffix Array, LCP Vector And Equivalence Classes

Suffix array is a sorted array of suffixes

LCP vector keeps track of repeating terms in suffixes

We use suffix arrays and LCP vector to partition articles into classes

Each class represents a group of Ngrams

Classes represent all Ngrams in the article and no Ngram is represented more than once

Page 6: SamudraManthan   Popular terms

6

Example

Text: A ROSE IS A ROSE

Index   S                  LCP   Trivial Classes   Non-trivial Classes
0       A ROSE             0     <0,0>             <0,1>
1       A ROSE IS A ROSE   2     <1,1>
2       IS A ROSE          0     <2,2>
3       ROSE               0     <3,3>             <3,4>
4       ROSE IS A ROSE     1     <4,4>
5                          0
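A minimal sketch of how the S and LCP columns of this example could be computed over words. It uses a naive comparison sort, not a linear-time construction, and the function name, the file-scope variables, and the n + 1 length of the LCP array are assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

static const char **g_words; /* text being indexed, one word per entry */
static int g_n;              /* number of words in the text */

/* Compare two suffixes word by word; a suffix that is a prefix of the
 * other sorts first. */
static int cmp_suffix(const void *a, const void *b)
{
    int i = *(const int *)a, j = *(const int *)b;
    while (i < g_n && j < g_n) {
        int c = strcmp(g_words[i], g_words[j]);
        if (c != 0)
            return c;
        i++;
        j++;
    }
    return (i == g_n) ? ((j == g_n) ? 0 : -1) : 1;
}

/* Fill sa[0..n-1] with suffix start positions in sorted order and
 * lcp[0..n] with the number of leading words shared by neighbouring
 * suffixes; lcp[0] and lcp[n] are 0 by convention. */
void build_suffix_array(const char **words, int n, int *sa, int *lcp)
{
    g_words = words;
    g_n = n;
    for (int i = 0; i < n; i++)
        sa[i] = i;
    qsort(sa, (size_t)n, sizeof sa[0], cmp_suffix);
    lcp[0] = 0;
    for (int i = 1; i < n; i++) {
        int a = sa[i - 1], b = sa[i], k = 0;
        while (a + k < n && b + k < n &&
               strcmp(words[a + k], words[b + k]) == 0)
            k++;
        lcp[i] = k;
    }
    lcp[n] = 0;
}
```

For the words A ROSE IS A ROSE this yields the suffix order and the LCP values 0, 2, 0, 0, 1, 0 shown in the table.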

Page 7: SamudraManthan   Popular terms

7

Advantages of Suffix Arrays

Time Complexity

There can be at most 2N-1 classes, where N is the number of words in an article

Ngrams of any size can be identified with their term frequencies (tf) in linear time

These data structures enable us to represent Ngrams of all sizes without actually storing them
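One way to enumerate the non-trivial classes in a single pass over the LCP vector is a stack-based scan, sketched below. This is the standard LCP-interval technique, not necessarily the authors' code; the struct and function names are hypothetical. A class <first,last> has tf = last - first + 1 and stands for the Ngrams whose length is greater than lbl and at most sil.

```c
#include <stdlib.h>

/* Hypothetical record for one non-trivial class <first,last>. */
struct ngram_class {
    int first; /* first suffix-array index in the class */
    int last;  /* last suffix-array index in the class */
    int sil;   /* shortest interior LCP: longest Ngram length represented */
    int lbl;   /* longest bounding LCP: Ngrams must be longer than this */
};

/* lcp has n + 1 entries with lcp[0] == lcp[n] == 0. out needs room for
 * n entries (at most n - 1 are written). Returns the number of classes
 * found, or -1 if memory allocation fails. */
int nontrivial_classes(const int *lcp, int n, struct ngram_class *out)
{
    int *stk_lcp = malloc((size_t)(n + 1) * sizeof *stk_lcp);
    int *stk_left = malloc((size_t)(n + 1) * sizeof *stk_left);
    int top = 0, count = 0;
    if (!stk_lcp || !stk_left) {
        free(stk_lcp);
        free(stk_left);
        return -1;
    }
    stk_lcp[0] = 0; /* sentinel for the whole array, never popped */
    stk_left[0] = 0;
    for (int j = 0; j < n; j++) {
        int next = lcp[j + 1];
        int left = j;
        /* Every open interval with a larger LCP ends at index j. */
        while (stk_lcp[top] > next) {
            left = stk_left[top];
            out[count].first = left;
            out[count].last = j;
            out[count].sil = stk_lcp[top];
            out[count].lbl = lcp[left] > next ? lcp[left] : next;
            count++;
            top--;
        }
        /* A larger LCP opens a new interval starting at 'left'. */
        if (stk_lcp[top] < next) {
            top++;
            stk_lcp[top] = next;
            stk_left[top] = left;
        }
    }
    free(stk_lcp);
    free(stk_left);
    return count;
}
```

With the LCP vector 0, 2, 0, 0, 1, 0 from the A ROSE IS A ROSE example, the scan reports <0,1> and <3,4>, each with tf 2. Adding the N trivial classes to at most N-1 non-trivial ones gives the 2N-1 bound.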

Page 8: SamudraManthan   Popular terms

8

Intra-Processor Reduction Problem

Suffix array data structure gives us article-level unique Ngrams with term frequencies

A processor processes multiple articles

Need to identify unique Ngrams across articles

Need to have a unique identifier for each word

Page 9: SamudraManthan   Popular terms

9

Dictionary – Our Savior

Dictionary is a sorted list of all unique words in the Gigaword corpus

Dictionary ids form a unified basis for intra/inter process reduction
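Because the dictionary is sorted, a word's id can be taken to be its position in the list and found by binary search. The sketch below assumes exactly that; the function name and the -1 return for unknown words are illustrative choices, not taken from the slides.

```c
#include <stdlib.h>
#include <string.h>

/* bsearch comparator: key is a word, elem points at a dictionary entry. */
static int cmp_word(const void *key, const void *elem)
{
    return strcmp((const char *)key, *(const char *const *)elem);
}

/* Return the position of 'word' in the sorted dictionary, used here as
 * its id, or -1 if the word is not in the dictionary. */
int dictionary_id(const char *const *dict, size_t dict_size, const char *word)
{
    const char *const *hit =
        bsearch(word, dict, dict_size, sizeof dict[0], cmp_word);
    return hit ? (int)(hit - dict) : -1;
}
```

Every process that loads the same dictionary maps a word to the same integer, which is what lets Ngrams be compared by id during reduction.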

Page 10: SamudraManthan   Popular terms

10

Intra-Processor Reduction

Used a hash table to store unique Ngrams with tf and df

Hashing function

Simple mod hashing function: H(t) = (∑ t(i)) mod HASH_SIZE, where t(i) is the dictionary id of word i in Ngram t

Hash data structure

struct ngramstore {
    int *word_id;
    int cnt;
    int doc_freq;
    struct ngramstore *chain;
};
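A minimal sketch of the mod hash and a chained insert built on the struct above. The HASH_SIZE value, the function names, and the reading of cnt as tf are assumptions; the sketch also assumes non-negative dictionary ids and that ngram_add is called once per article for each distinct Ngram, so doc_freq grows by one per call.

```c
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 1009 /* assumed table size; the slides do not give one */

struct ngramstore {
    int *word_id;             /* dictionary ids of the words in the Ngram */
    int cnt;                  /* term frequency (tf), as assumed here */
    int doc_freq;             /* document frequency (df) */
    struct ngramstore *chain; /* next entry in the same bucket */
};

/* H(t) = (sum of the dictionary ids in t) mod HASH_SIZE. */
unsigned ngram_hash(const int *word_id, int n)
{
    unsigned long sum = 0;
    for (int i = 0; i < n; i++)
        sum += (unsigned long)word_id[i];
    return (unsigned)(sum % HASH_SIZE);
}

/* Record tf occurrences of an n-word Ngram seen in one article.
 * Returns the table entry, or NULL if allocation fails. */
struct ngramstore *ngram_add(struct ngramstore **table,
                             const int *word_id, int n, int tf)
{
    size_t bytes = (size_t)n * sizeof *word_id;
    unsigned h = ngram_hash(word_id, n);
    for (struct ngramstore *e = table[h]; e; e = e->chain) {
        if (memcmp(e->word_id, word_id, bytes) == 0) {
            e->cnt += tf;
            e->doc_freq += 1;
            return e;
        }
    }
    struct ngramstore *e = malloc(sizeof *e);
    if (!e)
        return NULL;
    e->word_id = malloc(bytes);
    if (!e->word_id) {
        free(e);
        return NULL;
    }
    memcpy(e->word_id, word_id, bytes);
    e->cnt = tf;
    e->doc_freq = 1;
    e->chain = table[h]; /* push onto the front of the bucket's chain */
    table[h] = e;
    return e;
}
```

A sum-based hash ignores word order, so permutations of the same words land in the same bucket; the chain and the id-by-id comparison keep them apart.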

Page 11: SamudraManthan   Popular terms

11

Steps

• Inter-Process Reduction: Binomial Tree

[Diagram: binomial tree over processes 0 to 7]

• i varies from 0 to log2(n) - 1, where n is the number of processes

• Send -> Recv diff = (2 ^ i)

• For any iteration, recv if (id % (2^(i+1)) == 0), send if (id % (2^(i+1)) == 2^i)

• max_recv = (reductions-1) * (int)pow((double)2, i+1);

Processors enter next iteration by calling MPI_Barrier()

Iteration 0: 1 -> 0, 3 -> 2, 5 -> 4, 7 -> 6

Iteration 1: 2 -> 0, 6 -> 4

Iteration 2: 4 -> 0
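The send and receive schedule can be checked without MPI by computing each process's role per iteration. The sketch below uses the rule that in iteration i a process receives if id mod 2^(i+1) is 0 and sends if id mod 2^(i+1) is 2^i, which reproduces the 1 -> 0, 2 -> 0, 4 -> 0 pattern. The enum and function names are hypothetical.

```c
/* Hypothetical role of a process in one iteration of the reduction. */
enum role { ROLE_IDLE, ROLE_SEND, ROLE_RECV };

/* Number of iterations needed for nprocs processes: ceil(log2(nprocs)). */
int binomial_rounds(int nprocs)
{
    int r = 0;
    while ((1 << r) < nprocs)
        r++;
    return r;
}

/* Role of process 'id' in iteration 'round' (0-based). On return,
 * *partner holds the rank to exchange with, or -1 when idle. */
enum role binomial_role(int id, int nprocs, int round, int *partner)
{
    int diff = 1 << round; /* sender-receiver distance, 2^i */
    int span = diff << 1;  /* 2^(i+1) */
    *partner = -1;
    if (id % span == 0) {
        if (id + diff < nprocs) {
            *partner = id + diff;
            return ROLE_RECV;
        }
        return ROLE_IDLE; /* no sender exists at that distance */
    }
    if (id % span == diff) {
        *partner = id - diff;
        return ROLE_SEND;
    }
    return ROLE_IDLE; /* already sent in an earlier iteration */
}
```

In an MPI program each process would call something like binomial_role once per iteration and then issue the matching send or receive before the barrier.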

Page 12: SamudraManthan   Popular terms

12

Inter-Process Reduction using Hashing

Reusing our hash technique and code from intra-process reduction

All processes use binomial tree collection pattern to reduce unique Ngrams

After log n steps process 0 has the final hash with all unique Ngrams
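A sketch of what the receiving side of one reduction step might do: fold a batch of received Ngram records into the local hash table. The flat record layout, NGRAM_LEN, HASH_SIZE, and the function names are assumptions. Summing doc_freq is valid only because each article is processed by exactly one worker, so the document counts never overlap.

```c
#include <stdlib.h>
#include <string.h>

#define HASH_SIZE 1009 /* assumed table size */
#define NGRAM_LEN 3    /* assumed: one Ngram size handled at a time */

struct ngramstore {
    int *word_id;
    int cnt;
    int doc_freq;
    struct ngramstore *chain;
};

/* Hypothetical flat record for one Ngram received from a sender. */
struct ngram_record {
    int word_id[NGRAM_LEN];
    int cnt;
    int doc_freq;
};

static unsigned hash_ids(const int *word_id)
{
    unsigned long sum = 0;
    for (int i = 0; i < NGRAM_LEN; i++)
        sum += (unsigned long)word_id[i];
    return (unsigned)(sum % HASH_SIZE);
}

/* Merge received records into the local table, adding tf and df for
 * Ngrams already present. Returns 0 on success, -1 if allocation fails. */
int merge_records(struct ngramstore **table,
                  const struct ngram_record *recs, int nrecs)
{
    for (int r = 0; r < nrecs; r++) {
        unsigned h = hash_ids(recs[r].word_id);
        struct ngramstore *e = table[h];
        while (e && memcmp(e->word_id, recs[r].word_id,
                           sizeof recs[r].word_id) != 0)
            e = e->chain;
        if (e) {
            e->cnt += recs[r].cnt;
            e->doc_freq += recs[r].doc_freq;
            continue;
        }
        e = malloc(sizeof *e);
        if (!e)
            return -1;
        e->word_id = malloc(sizeof recs[r].word_id);
        if (!e->word_id) {
            free(e);
            return -1;
        }
        memcpy(e->word_id, recs[r].word_id, sizeof recs[r].word_id);
        e->cnt = recs[r].cnt;
        e->doc_freq = recs[r].doc_freq;
        e->chain = table[h];
        table[h] = e;
    }
    return 0;
}
```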

Page 13: SamudraManthan   Popular terms

13

Scaling up to GigaWord?

Goal: Reduce per-processor memory requirement

Cut off term frequency

Ngrams with low tf are not going to score high

Observation: 66% of total trigrams have term frequency 1 in 1.2GB data

Unnecessary to carry such Ngrams

Solution: Eliminate Ngrams with very low term frequency
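The cut-off can be applied as a sweep over a chained hash table that unlinks and frees every entry below a threshold. This is a sketch under assumptions: the function name and the min_tf parameter are hypothetical, cnt is read as tf, and the slides do not state the threshold actually used.

```c
#include <stdlib.h>

struct ngramstore {
    int *word_id;
    int cnt;      /* term frequency (tf), as assumed here */
    int doc_freq;
    struct ngramstore *chain;
};

/* Remove every entry whose tf is below min_tf and free its memory.
 * Returns the number of entries removed. */
int prune_low_tf(struct ngramstore **table, int table_size, int min_tf)
{
    int removed = 0;
    for (int b = 0; b < table_size; b++) {
        struct ngramstore **link = &table[b];
        while (*link) {
            struct ngramstore *e = *link;
            if (e->cnt < min_tf) {
                *link = e->chain; /* unlink from the bucket's chain */
                free(e->word_id);
                free(e);
                removed++;
            } else {
                link = &e->chain;
            }
        }
    }
    return removed;
}
```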

Page 14: SamudraManthan   Popular terms

14

Pruning – stoplist motivation

Similarly Ngrams with high df are not going to score high.

Memory hotspot

This elimination can be done only after intra-process collection

Defeats the goal of per-processor memory reduction

Need for an adaptive elimination

Page 15: SamudraManthan   Popular terms

15

Pruning - Stoplist

Ngrams such as "IN THE FIRST" scored high using TF*IDF measure

Eliminate such Ngrams to extract really interesting terms

Stoplist is a list of commonly occurring words such as “the”, “a”, “to”, “from”, “is”, “first”

Stoplist is based on our dictionary

Still evolving and currently contains 160 words

Eliminate Ngrams in which every word is on the stoplist
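Since the stoplist is based on the dictionary, the check can work on dictionary ids. The sketch below assumes the stoplist is held as a sorted array of ids and tests whether every word of an Ngram is on it; the function name and the array representation are hypothetical.

```c
#include <stdlib.h>

/* Integer comparator for bsearch. */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Return 1 if every word id of the Ngram is in the sorted stoplist
 * stop_ids, 0 otherwise. An empty Ngram trivially returns 1. */
int all_stopwords(const int *word_id, int n,
                  const int *stop_ids, size_t nstop)
{
    for (int i = 0; i < n; i++) {
        if (!bsearch(&word_id[i], stop_ids, nstop,
                     sizeof stop_ids[0], cmp_int))
            return 0;
    }
    return 1;
}
```

An Ngram like "IN THE FIRST" is dropped under this rule, while one that mixes stopwords with a content word is kept.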

Page 16: SamudraManthan   Popular terms

16

Interesting 3-grams on GigaWord

Page 17: SamudraManthan   Popular terms

17

Performance Analysis - Speedup

Page 18: SamudraManthan   Popular terms

18

Space Complexity

Memory requirement increases for higher order Ngrams

Why?

Suppose there are n unique Ngrams in each article and m such articles

For higher order Ngrams, the number of unique Ngrams increases

We store each unique Ngram in our hash data structure

In the worst case all Ngrams across articles are unique, and we have to store mn unique Ngrams per processor

Page 19: SamudraManthan   Popular terms

19

Current Limitations

Static Dictionary

M through N Interesting Ngrams

Our hash data structure is designed to handle a single Ngram size at a time

We provide M through N functionality by repeatedly building all data structures

Not a scalable approach

Page 20: SamudraManthan   Popular terms

20

Thanks