Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering

Chen LUO, Mohamed AbdElRahman
Instructor: Dr. Anshumali Shrivastava
Rice University
July 4, 2022


Page 1

Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering

Chen LUO, Mohamed AbdElRahman
Instructor: Dr. Anshumali Shrivastava

Rice University
April 21, 2023

Page 2

Outline

• Problem Background
• Theory
  • LSH Function Preserving Cosine Similarity (Dimension Reduction)
  • Fast Search Algorithm (From n^2 to n)
• Extending to NLP and Experiment

Page 3

Motivation

• What is the meaning of the word "tezgüno"?

Page 4

Motivation

• Consider the following context:

  A bottle of tezgüno is on the table.
  Everyone likes tezgüno.
  Tezgüno makes you drunk.
  We make tezgüno out of corn.

• Still not sure?

Page 5

Motivation

• Consider the following context:

  A bottle of tezgüno is on the table.
  Everyone likes tezgüno.
  Tezgüno makes you drunk.
  We make tezgüno out of corn.

  A bottle of beer is on the table.
  Everyone likes beer.
  Beer makes you drunk.
  We make beer out of corn.

"Beer" and "tezgüno" appear in similar contexts, so they have similar meanings.

Page 6

Motivation

• We want this process to be done automatically by a computer!
• So, the main task here is to find similar nouns!

Noun Clustering

Page 7

Problem Background

• Task: clustering nouns at very large scale
  • n nodes (n nouns)
  • Each node has k features (details later)
• Calculate the full similarity matrix
  • Complexity: \(O(kn^2)\)
• This cannot be tolerated when n is very large!

Page 8

Problem Background

• By 2000
  • Over 500 billion readily accessible words on the web
• Now
  • A very, very, very large amount!
• We want linear: \(O(kn)\)
• Hashing is a good way!

Page 9

Outline

• Problem Background
• Theory
  • LSH Function Preserving Cosine Similarity (Dimension Reduction)
  • Fast Search Algorithm (From n^2 to n)
• Extending to NLP and Experiment

Page 10

LSH Function Preserving Cosine Similarity

• The similarity measure between nodes is cosine similarity.
  • Cosine similarity:

\[ \cos(\theta(u, v)) = \frac{|u \cdot v|}{|u|\,|v|} \]

• We want to design a hash function that preserves this similarity.

Page 11

LSH Function Preserving Cosine Similarity

• From the paper, the hash function is defined as follows:

\[ h_r(u) = \begin{cases} 1 & : r \cdot u \ge 0 \\ 0 & : r \cdot u < 0 \end{cases} \]

• Above, r is a spherically symmetric random vector of unit length.
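A minimal Python sketch of this hash function, using only the standard library. The helper names `random_unit_vector` and `hash_bit` are illustrative and not from the slides; drawing r by normalizing Gaussian samples is one common way to get a spherically symmetric direction, assumed here.

```python
import math
import random

def random_unit_vector(k, rng):
    """Draw a spherically symmetric random unit vector in k dimensions."""
    # Normalizing i.i.d. Gaussian samples gives a uniformly random direction.
    g = [rng.gauss(0.0, 1.0) for _ in range(k)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def hash_bit(r, u):
    """h_r(u): 1 if r . u >= 0, otherwise 0."""
    dot = sum(r_i * u_i for r_i, u_i in zip(r, u))
    return 1 if dot >= 0 else 0

rng = random.Random(0)          # fixed seed so the sketch is repeatable
r = random_unit_vector(3, rng)
u = [1.0, 2.0, 0.5]
bit_u = hash_bit(r, u)
bit_neg_u = hash_bit(r, [-x for x in u])  # -u falls on the other side of the hyperplane
```

Only the sign of r · u matters, so scaling u by a positive constant never changes its bit.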

Page 12

LSH Function Preserving Cosine Similarity

• Then, for vectors u and v, we have:

\[ \Pr[h_r(u) = h_r(v)] = 1 - \frac{\theta(u, v)}{\pi} \]

• Or

\[ \Pr[h_r(u) \ne h_r(v)] = \frac{\theta(u, v)}{\pi} \]

• Directly proportional!

Page 13

LSH Function Preserving Cosine Similarity

• From the equation below:

\[ \Pr[h_r(u) \ne h_r(v)] = \frac{\theta(u, v)}{\pi} \]

• We have:

\[ \cos(\theta(u, v)) = \cos\big((1 - \Pr[h_r(u) = h_r(v)])\,\pi\big) \]

• Then, we can estimate cosine similarity by estimating:

\[ \Pr[h_r(u) = h_r(v)] \]

Page 14

LSH Function Preserving Cosine Similarity

• Each vector u can be represented by a bit stream of length d using the hash function (e.g. 001101 with d=6):

\[ \bar{u} = \{h_{r_1}(u), h_{r_2}(u), \ldots, h_{r_d}(u)\} \]

• Then \(\Pr[h_r(u) = h_r(v)]\) is closely related to the Hamming distance between the bit streams of u and v, where the Hamming distance between bit streams A and B is:

\[ \sum_i |A_i - B_i| \]
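The link between the Hamming distance and the cosine can be sketched in a few lines of Python. The fraction of agreeing bits estimates Pr[h_r(u) = h_r(v)], which is then plugged into cos((1 − Pr)π). The function names, the test vectors and the choice d = 2000 are illustrative assumptions, not values from the slides.

```python
import math
import random

def random_unit_vector(k, rng):
    g = [rng.gauss(0.0, 1.0) for _ in range(k)]
    norm = math.sqrt(sum(x * x for x in g))
    return [x / norm for x in g]

def signature(u, planes):
    """Bit stream of u: one bit h_r(u) per random vector r."""
    return [1 if sum(r_i * u_i for r_i, u_i in zip(r, u)) >= 0 else 0
            for r in planes]

def hamming(a, b):
    """Hamming distance: sum over i of |A_i - B_i|."""
    return sum(abs(x - y) for x, y in zip(a, b))

def estimate_cosine(sig_u, sig_v):
    """cos((1 - Pr[h(u) = h(v)]) * pi), with Pr estimated from the bits."""
    d = len(sig_u)
    p_equal = 1.0 - hamming(sig_u, sig_v) / d
    return math.cos((1.0 - p_equal) * math.pi)

def exact_cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

rng = random.Random(42)
d = 2000                                  # signature length (illustrative)
planes = [random_unit_vector(3, rng) for _ in range(d)]
u = [1.0, 2.0, 3.0]
v = [2.0, 1.0, 0.5]
est = estimate_cosine(signature(u, planes), signature(v, planes))
exact = exact_cosine(u, v)                # about 0.64 for these two vectors
```

A larger d gives a tighter estimate at the cost of more hashing work, which matches the trade-off reported in the evaluation pages.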

Page 15

LSH Function Preserving Cosine Similarity

• For example:
  • Given:

\[ r_1 = \left(\tfrac{\sqrt{2}}{2}, \tfrac{\sqrt{2}}{2}, \tfrac{\sqrt{2}}{2}\right), \quad r_2 = \left(-\tfrac{\sqrt{2}}{2}, \tfrac{\sqrt{2}}{2}, \tfrac{\sqrt{2}}{2}\right) \]

\[ u = (0, 0, 1), \quad v = (1, 0, 0) \]

  • Then:

\[ h_{r_1}(u) = 1 \ (r_1 \cdot u \ge 0), \qquad h_{r_2}(u) = 1 \ (r_2 \cdot u \ge 0) \]

\[ h_{r_1}(v) = 1 \ (r_1 \cdot v \ge 0), \qquad h_{r_2}(v) = 0 \ (r_2 \cdot v < 0) \]

\[ \bar{u} = [h_{r_1}(u), h_{r_2}(u)] = [1, 1], \qquad \bar{v} = [h_{r_1}(v), h_{r_2}(v)] = [1, 0] \]

  • So, the Hamming distance:

\[ \mathrm{HamDis}(\bar{u}, \bar{v}) = 1 \]
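The worked example can be checked with a short Python sketch. The equations on this page are badly garbled in the transcript, so the concrete values of r1, r2, u and v below are a reconstruction chosen to reproduce the stated bits [1,1] and [1,0]; treat them as an assumption rather than the slide's exact numbers. Only the signs of the dot products matter.

```python
import math

def hash_bit(r, u):
    """h_r(u): 1 if r . u >= 0, otherwise 0."""
    return 1 if sum(a * b for a, b in zip(r, u)) >= 0 else 0

s = math.sqrt(2) / 2
r1 = (s, s, s)       # reconstructed value (assumption)
r2 = (-s, s, s)      # reconstructed value (assumption)
u = (0, 0, 1)
v = (1, 0, 0)

sig_u = [hash_bit(r1, u), hash_bit(r2, u)]   # bits of u under r1, r2
sig_v = [hash_bit(r1, v), hash_bit(r2, v)]   # bits of v under r1, r2
ham = sum(abs(a - b) for a, b in zip(sig_u, sig_v))
print(sig_u, sig_v, ham)
```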

Page 16

LSH Function Preserving Cosine Similarity

• Convert:
  • Finding the cosine distance of two vectors
  • into finding the Hamming distance between bit streams

Dimension reduction! But the complexity is still n^2.

Page 17

Outline

• Problem Background
• Theory and Algorithm
  • LSH Function Preserving Cosine Similarity (Dimension Reduction)
  • Fast Search Algorithm (From n^2 to n)
• Extending to NLP and Experiment

Page 18

Fast Search Algorithm

• Task:
  • Given the signature of each vector, i.e. a bit stream (e.g. 1001) for each vector:

\[ \bar{u} = \{h_{r_1}(u), h_{r_2}(u), \ldots, h_{r_d}(u)\} \]

  • Find the nearest neighbors of each vector.

Page 19

Fast Search Algorithm

• Apply q random permutations to each bit stream.
  • We get q randomly permuted lists.
  • Complexity: O(n)
• For example:
  • Given a bit stream \(\bar{u} = (1011)\) and two permutations (q=2):

\[ \pi_1: (3,2,1,0) \to (0,1,2,3), \qquad \pi_2: (3,2,1,0) \to (1,2,0,3) \]

  • Then:

\[ \pi_1(\bar{u}) = (1101), \qquad \pi_2(\bar{u}) = (1011) \]
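The permutation step can be sketched in Python as follows. The transcript garbles the permutation notation, so the indexing convention here is an assumption: bit positions are labelled d-1, ..., 1, 0 from left to right, and each permutation sends the bit at a source label to a target label. Under that reading the sketch reproduces the two permuted streams 1101 and 1011 for the input 1011.

```python
def permute(bits, mapping):
    """Move the bit at each source label to its target label.

    Positions are labelled d-1, ..., 1, 0 from left to right (assumed convention).
    """
    d = len(bits)
    out = [None] * d
    for idx, ch in enumerate(bits):      # idx counts from the left
        label = d - 1 - idx              # label of this position
        out[d - 1 - mapping[label]] = ch # place the bit at its new label
    return "".join(out)

pi1 = {3: 0, 2: 1, 1: 2, 0: 3}   # (3,2,1,0) -> (0,1,2,3): reverses the stream
pi2 = {3: 1, 2: 2, 1: 0, 0: 3}   # (3,2,1,0) -> (1,2,0,3)

u = "1011"
print(permute(u, pi1), permute(u, pi2))
```

Applying one permutation to a stream of length d is O(d), so permuting all n streams q times stays linear in n for constant d and q.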

Page 20

Fast Search Algorithm

• Sort each of the q randomly permuted lists, and find the nearest B neighbors in these sorted lists.
  • Complexity: O(n log n)
• For example:
  • B=2, q=2 (constant)

[Figure: the sorted lists for permutations π1 and π2, with vector v marked in each; complexity labels Kn^2, Kn and n log n.]
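The permute, sort and scan procedure can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: `fast_neighbors` is a hypothetical name, the window looks at B entries on each side of a vector in every sorted list, and candidates are finally ranked by Hamming distance on the original bit streams.

```python
import random

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def fast_neighbors(signatures, q, B, rng):
    """signatures: dict mapping a name to its bit string (all of length d)."""
    d = len(next(iter(signatures.values())))
    candidates = {name: set() for name in signatures}
    for _ in range(q):
        perm = list(range(d))
        rng.shuffle(perm)                       # one random permutation of bit positions
        # Apply the same permutation to every bit stream, then sort lexicographically.
        permuted = sorted(("".join(sig[i] for i in perm), name)
                          for name, sig in signatures.items())
        order = [name for _, name in permuted]
        for pos, name in enumerate(order):
            lo, hi = max(0, pos - B), min(len(order), pos + B + 1)
            # Collect up to B neighbors on each side in this sorted list.
            candidates[name].update(order[lo:pos] + order[pos + 1:hi])
    # Rank each vector's candidates by Hamming distance (ties broken by name).
    return {name: sorted(cands, key=lambda c: (hamming(signatures[name], signatures[c]), c))
            for name, cands in candidates.items()}

toy = {"a": "11110000", "b": "11110001", "c": "00001111", "d": "00001110"}
# B=3 spans the whole toy list, so this result does not depend on the random permutations.
neighbors = fast_neighbors(toy, q=2, B=3, rng=random.Random(1))
print(neighbors["a"])
```

Each vector is compared with at most 2·B·q candidates instead of all n − 1 others, so for constant B and q the sorting step dominates the running time.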

Page 21

Question

• What is the Hamming distance between the two bit streams A=[0011000] and B=[1111001]?
  • Ans. Hamming(A,B)=3
• Suppose we have two 2-dimensional vectors u=[1,0], v=[0,1], and r is a spherically symmetric random vector. Then what is the value of \(\Pr[h_r(u) = h_r(v)]\)?
  • Ans. \(\Pr[h_r(u) = h_r(v)] = 1/2\)

Page 22

Outline

• Problem Background
• Theory and Algorithm
  • LSH Function Preserving Cosine Similarity (Dimension Reduction)
  • Fast Search Algorithm (From n^2 to n)
• Extending to NLP and Experiment

Page 23

Testing Datasets

Corpus | Web Corpus | Newspaper Corpus
Corpus size | Reduced from 70 million to 31 million web pages (138GB) by removing non-English docs and duplicate and near-duplicate docs | 6GB
Noun identification | Using a noun-phrase identifier | Using the dependency parser Minipar (Lin, 1994)
Unique nouns | 655,495 | 65,547
Feature identification for noun vector | For each noun phrase: ← noun phrase → | Take the grammatical context of the noun as in Minipar
Feature size | 1,306,482 | 940,154

Page 24

Calculation of Feature Vectors

• Mutual Information Vector
  Used to measure the association strength between two words. Here, it is used between a word (e) and a feature (f).

  • c_ef is the number of times word e occurred in context f
  • N is the total frequency count of all features of all words
  • n is the number of words
  • For each noun, we have MI(e) = (mi_e1, mi_e2, …, mi_ek)
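The formula itself did not survive in this transcript. A plausible reconstruction is the standard pointwise mutual information written with the symbols defined on this page; it is consistent with the worked value mi(soccer team, like) ≈ 0.4 given later in the deck if the logarithm is taken base 10. Both the exact form and the base are assumptions.

```latex
\[
  mi_{ef} = \log \frac{\dfrac{c_{ef}}{N}}
  {\dfrac{\sum_{i=1}^{n} c_{if}}{N} \times \dfrac{\sum_{j=1}^{k} c_{ej}}{N}}
\]
```

Here the first sum runs over all n words for the fixed feature f (a column total), and the second runs over all k features for the fixed word e (a row total).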

Page 25

Example

• Soccer quotes from the internet:
  "A soccer team is like a beautiful woman. When you do not tell her, she forgets she is beautiful." (Arsène Wenger)
  "In his life, a man can change wives, political parties or religions, but he cannot change his favorite soccer team." (Eduardo Hughes Galeano)

  “Don’t change your wife”

• Removing stop words, and identifying nouns:
  Soccer team like beautiful woman tell forgets beautiful. Life man change wives political parties religions change favorite soccer team.
  • Features: for each noun, the 2 words to its left and the 2 words to its right
  • All nouns (5): {soccer team, woman, wives, parties, religions}
  • All features (11): {like, beautiful, tell, forgets, man, change, political, parties, wives, religions, favorite}

Page 26

Example

Soccer team like beautiful woman tell forgets beautiful. Life man change wives political parties religions change favorite soccer team.

Noun | Like | Beautiful | Tell | Forgets | Man | Change | Political | Parties | Wives | Religions | Favorite | Total
Soccer team | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 4
Woman | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4
Wives | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 4
Parties | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 4
Religions | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 4
Total | 2 | 2 | 1 | 1 | 1 | 4 | 3 | 2 | 1 | 1 | 2 | 20

Page 27

Example

Noun | Like | Beautiful | Tell | Forgets | Man | Change | Political | Parties | Wives | Religions | Favorite | Total
Soccer team | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 4
Woman | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 4
Wives | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 4
Parties | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 4
Religions | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 4
Total | 2 | 2 | 1 | 1 | 1 | 4 | 3 | 2 | 1 | 1 | 2 | 20

MI(soccer team) = (mi(soccer team, like), mi(soccer team, beautiful), …, mi(soccer team, favorite))

\[ mi(\text{soccer team}, \text{like}) = \log \frac{1/20}{(2/20) \times (4/20)} = \log 2.5 \approx 0.4 \]

The value 0.4 corresponds to a base-10 logarithm.
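The calculation can be reproduced from the co-occurrence table with a short Python sketch. The base-10 logarithm is inferred from the value 0.4, and returning 0 for a zero count is an assumption made here to avoid log(0); the slides do not say how zero counts are handled.

```python
import math

features = ["like", "beautiful", "tell", "forgets", "man", "change",
            "political", "parties", "wives", "religions", "favorite"]
counts = {
    "soccer team": [1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1],
    "woman":       [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    "wives":       [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0],
    "parties":     [0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0],
    "religions":   [0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1],
}

N = sum(sum(row) for row in counts.values())              # total count of all features
col_totals = [sum(row[j] for row in counts.values())      # per-feature totals
              for j in range(len(features))]

def mi(noun, feature):
    """Pointwise mutual information between a noun and a feature (log base 10)."""
    j = features.index(feature)
    c_ef = counts[noun][j]
    if c_ef == 0:
        return 0.0   # assumption: zero co-occurrence mapped to 0 instead of -infinity
    row_total = sum(counts[noun])
    return math.log10((c_ef / N) / ((col_totals[j] / N) * (row_total / N)))

value = mi("soccer team", "like")
mi_vector = [mi("soccer team", f) for f in features]      # MI(soccer team)
print(round(value, 3))
```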

Page 28

Evaluation: LSH function

d↑, Error↓, Time↑

• Randomly choose 100 nouns (vectors) from the web collection (using the web corpus dataset).

(i) is for all pairs with CS(real,i) >= 0.15

Page 29

Evaluation: Fast Hamming Distance

• Randomly choose 100 nouns from the web corpus dataset. For each, calculate all pairwise Hamming distances exhaustively. Keep those >= 0.15 as the "gold standard test set".
• Obtain a list of bit streams for all nouns in the web corpus dataset for the Hamming distance calculation.
• Compare the top N elements retrieved by the fast Hamming distance search against those in the gold standard test set (calculate the percentage overlap).
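The slides do not spell out how the percentage overlap is computed. One simple reading, assumed here, is the share of the top N retrieved elements that also appear in the gold standard set; `percentage_overlap` and the toy noun lists are hypothetical.

```python
def percentage_overlap(retrieved, gold, top_n):
    """Percentage of the top_n retrieved items that are in the gold standard set."""
    top = retrieved[:top_n]
    if not top:
        return 0.0
    hits = sum(1 for item in top if item in gold)
    return 100.0 * hits / len(top)

# Toy data for illustration only.
retrieved = ["beer", "wine", "table", "vodka"]
gold = {"beer", "vodka", "wine"}
print(percentage_overlap(retrieved, gold, top_n=4))
```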

Page 30

Evaluation: Fast Hamming Distance

B↑, q↑, Accuracy↑, Search Time↑

Page 31

Evaluation: Final Similarity Lists

• Using the Newspaper Corpus.
• Randomly choose 100 nouns, calculate the top N elements using the randomized algorithm, and compare them with those produced by the Pantel and Lin (2002) system (calculate the percentage overlap).

Page 32

Summary

• Using random vectors, we represent each noun as a bit stream of length d << number of features, which results in dimensionality reduction.
• The proposed method reduces the running time from quadratic in n to kn, with a similarity accuracy of ~70%.

Page 33

Randomized Algorithms and NLP: Using Locality Sensitive Hash Functions for High Speed Noun Clustering

Chen LUO, Mohamed AbdElRahman
Instructor: Dr. Anshumali Shrivastava

Rice University
April 21, 2023

Page 34

LSH Function Preserving Cosine Similarity