Randomized Algorithms / Graph Algorithms


Machine Learning from Big Datasets

Randomized Algorithms / Graph Algorithms
William Cohen

Outline
- Randomized methods
  - SGD with the hash trick (review)
  - Other randomized algorithms
    - Bloom filters
    - Locality sensitive hashing
- Graph Algorithms

Learning as optimization for regularized logistic regression
Algorithm:
- Initialize arrays W, A of size R and set k=0
- For each iteration t=1,...,T
  - For each example (xi, yi)
    - Let V be a hash table so that pi = 1/(1 + exp(-sum_h V[h]*W[h])); k++
    - For each hash value h with V[h] > 0:
      - W[h] *= (1 - 2*lambda*mu)^(k - A[h])
      - W[h] = W[h] + lambda*(yi - pi)*V[h]
      - A[h] = k

Learning as optimization for regularized logistic regression
- Initialize arrays W, A of size R and set k=0
- For each iteration t=1,...,T
  - For each example (xi, yi)
    - k++; let V be a new hash table
    - For each j with xij > 0: V[hash(j) % R] += xij
    - Let ip=0
    - For each h with V[h] > 0:
      - W[h] *= (1 - 2*lambda*mu)^(k - A[h])
      - ip += V[h]*W[h]
      - A[h] = k
    - p = 1/(1 + exp(-ip))
    - For each h with V[h] > 0:
      - W[h] = W[h] + lambda*(yi - p)*V[h]

The W[h] *= (1 - 2*lambda*mu)^(k - A[h]) step lazily regularizes the W[h]'s, touching each weight only when it is next used; a short Python sketch of this loop follows.
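As a concrete reference, here is a minimal Python sketch of this loop. The names lam (learning rate) and mu (regularization strength) stand in for the Greek constants that did not survive extraction, and Python's built-in hash() stands in for the hash function; this is an illustrative sketch, not the code from the lecture.

    import math
    from collections import defaultdict

    def train_hashed_lr(data, R, T, lam=0.1, mu=1e-4):
        """SGD for L2-regularized logistic regression with the hash trick.
        data: iterable of (features, y) pairs, features a dict {name: value}, y in {0,1}.
        R: number of hash buckets.  lam: learning rate.  mu: regularization strength."""
        W = [0.0] * R          # hashed weight vector
        A = [0] * R            # A[h] = last update step that touched W[h]
        k = 0
        for t in range(T):
            for features, y in data:
                k += 1
                # build the hashed example V
                V = defaultdict(float)
                for j, xj in features.items():
                    if xj != 0:
                        V[hash(j) % R] += xj
                # lazy regularization + inner product
                ip = 0.0
                for h, vh in V.items():
                    W[h] *= (1 - 2 * lam * mu) ** (k - A[h])
                    ip += vh * W[h]
                    A[h] = k
                p = 1.0 / (1.0 + math.exp(-ip))
                # gradient step on the buckets this example touches
                for h, vh in V.items():
                    W[h] += lam * (y - p) * vh
        return W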

An example
- 2^26 entries = 1 Gb @ 8 bytes/weight
- 3.2M emails
- 400k users
- 40M tokens

Results

A variant of feature hashing
- Hash each feature multiple times with different hash functions
- Now, each w has k chances to not collide with another useful w
- An easy way to get multiple hash functions:
  - Generate some random strings s1, ..., sL
  - Let the k-th hash function for w be the ordinary hash of the concatenation w + sk

A variant of feature hashing
Why would this work?

Claim: with 100,000 features and 100,000,000 buckets:
- k=1: Pr(any duplication) ≈ 1
- k=2: Pr(any duplication) ≈ 0.4
- k=3: Pr(any duplication) ≈ 0.01
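A small sketch of the multi-hash variant described above, assuming L random suffix strings generated once up front; salts, hash_feature, and hashed_features are illustrative names, not part of the original slides.

    import random, string

    R = 100_000_000          # number of buckets
    L = 3                    # number of hash functions per feature

    # generate L random suffix strings once
    rng = random.Random(0)
    salts = [''.join(rng.choices(string.ascii_letters, k=16)) for _ in range(L)]

    def hash_feature(w, k):
        """k-th hash of feature w: ordinary hash of the concatenation w + s_k."""
        return hash(w + salts[k]) % R

    def hashed_features(features):
        """Map {feature: value} into the R-dimensional hashed space,
        adding each feature's value into L buckets instead of one."""
        V = {}
        for w, x in features.items():
            for k in range(L):
                h = hash_feature(w, k)
                V[h] = V.get(h, 0.0) + x
        return V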

Hash Trick - Insights
- Save memory: don't store hash keys
- Allow collisions, even though it distorts your data some
- Let the learner (downstream) take up the slack

Here's another famous trick that exploits these insights.

Bloom filters
Interface to a Bloom filter:
  BloomFilter(int maxSize, double p);
  void bf.add(String s);        // insert s
  bool bf.contains(String s);
    // If s was added, return true;
    // else with probability at least 1-p, return false;
    // else with probability at most p, return true;

I.e., a noisy set where you can test membership (and that's it)
- note: a hash table would do this in constant time and storage
- the hash trick does this as well

Bloom filters
Another implementation:
- Allocate M bits, bit[0], ..., bit[M-1]
- Pick K hash functions hash(1,s), hash(2,s), ...
  - E.g.: hash(i,s) = hash(s + randomString[i])
- To add string s:
  - For i=1 to K, set bit[hash(i,s)] = 1
- To check contains(s):
  - For i=1 to K, test bit[hash(i,s)]
  - Return true if they're all set; otherwise, return false
- We'll discuss how to set M and K soon, but for now:
  - Let M = 1.5*maxSize    // less than two bits per item!
  - Let K = 2*log(1/p)     // about right with this M
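A minimal Python sketch of this implementation, mirroring the interface above and the salted-hash construction from the slide; the class internals and the log base in the K rule are assumptions, not the lecture's code.

    import math, random, string

    class BloomFilter:
        """Bloom filter sized with the slide's rule of thumb:
        M = 1.5 * maxSize bits and K = 2 * log(1/p) hash functions (log base 2 assumed)."""
        def __init__(self, max_size, p, seed=0):
            self.M = int(1.5 * max_size)
            self.K = max(1, round(2 * math.log2(1.0 / p)))
            self.bits = bytearray(self.M)          # one byte per "bit", for simplicity
            rng = random.Random(seed)
            self.salts = [''.join(rng.choices(string.ascii_letters, k=8))
                          for _ in range(self.K)]

        def _hashes(self, s):
            # hash(i, s) = ordinary hash of s + randomString[i], as on the slide
            return [hash(s + salt) % self.M for salt in self.salts]

        def add(self, s):
            for h in self._hashes(s):
                self.bits[h] = 1

        def contains(self, s):
            return all(self.bits[h] for h in self._hashes(s))

    # usage: bf = BloomFilter(10000, 0.01); bf.add("cat")
    # bf.contains("cat") -> True; bf.contains("dog") -> False with probability >= 1-p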

Bloom filters
Analysis:
- Assume hash(i,s) is a random function
- Look at Pr(bit j is unset after n adds):
    Pr(bit j unset) = (1 - 1/m)^(kn) ≈ e^(-kn/m)
- and Pr(collision), i.e. a false positive for a string that was never added:
    p = (1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k
- ... fix m and n and minimize k:
    k = (m/n) * ln(2)

Bloom filters
Analysis:
- Plug the optimal k = (m/n)*ln(2) back into Pr(collision):
    p = (1/2)^k ≈ 0.6185^(m/n)
- Now we can fix any two of p, n, m and solve for the 3rd
- E.g., the value for m in terms of n and p:
    m = -n * ln(p) / (ln 2)^2 ≈ 1.44 * n * log2(1/p)
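A tiny worked example of these sizing formulas in Python (the function name bloom_parameters is just illustrative):

    import math

    def bloom_parameters(n, p):
        """Given n items and target false-positive rate p,
        return (m bits, k hash functions) from the formulas above."""
        m = math.ceil(-n * math.log(p) / (math.log(2) ** 2))   # m = -n ln p / (ln 2)^2
        k = max(1, round((m / n) * math.log(2)))               # k = (m/n) ln 2
        return m, k

    # e.g. n = 1,000,000 items at p = 1%  ->  roughly 9.6M bits (~1.2 MB) and k = 7
    print(bloom_parameters(1_000_000, 0.01))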

Bloom filters: demo

Bloom filters
An example application: finding items in sharded data
- Easy if you know the sharding rule
- Harder if you don't (like Google n-grams)
- Simple idea:
  - Build a BF of the contents of each shard
  - To look for a key, load in the BFs one by one, and search only the shards that probably contain the key
- Analysis: you won't miss anything, but you might look in some extra shards
  - You'll hit O(1) extra shards if you set p = 1/#shards

Bloom filters
An example application: discarding singleton features from a classifier
- Scan through the data once and check each w:
  - if bf1.contains(w): bf2.add(w)
  - else: bf1.add(w)
- Now:
  - bf1.contains(w) => w appears >= once
  - bf2.contains(w) => w appears >= 2x
- Then train, ignoring words not in bf2

Bloom filters
An example application: discarding rare features from a classifier
- seldom hurts much, and can speed up experiments
- Scan through the data once and check each w:
  - if bf1.contains(w):
    - if bf2.contains(w): bf3.add(w)
    - else: bf2.add(w)
  - else: bf1.add(w)
- Now:
  - bf2.contains(w) => w appears >= 2x
  - bf3.contains(w) => w appears >= 3x
- Then train, ignoring words not in bf3 (a sketch of this cascade follows)

Bloom filters
More on this next week...
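A sketch of the rare-feature cascade referenced above, assuming the BloomFilter class sketched earlier; frequent_words and the parameter values are illustrative.

    def frequent_words(corpus, max_size=10_000_000, p=0.01):
        """One pass over the data; afterwards bf3 approximately contains the words
        that appeared at least 3 times (with false positives at rate ~p)."""
        bf1 = BloomFilter(max_size, p)   # w appears >= once
        bf2 = BloomFilter(max_size, p)   # w appears >= 2x
        bf3 = BloomFilter(max_size, p)   # w appears >= 3x
        for doc in corpus:               # corpus: iterable of token lists
            for w in doc:
                if bf1.contains(w):
                    if bf2.contains(w):
                        bf3.add(w)
                    else:
                        bf2.add(w)
                else:
                    bf1.add(w)
        return bf3                       # train, ignoring words not in bf3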

LSH: key ideas
- Goal: map feature vector x to bit vector bx
- ensure that bx preserves similarity

Random projections

[Figure: points labeled + and - clustered around u and -u, which are 2*gamma apart]
- To make those points close, we need to project onto a direction orthogonal to the line between them
- Any other direction will keep the distant points distant
- So if I pick a random r, and r.x and r.x' are closer than gamma, then probably x and x' were close to start with

LSH: key ideas
- Goal: map feature vector x to bit vector bx
- ensure that bx preserves similarity
- Basic idea: use random projections of x
- Repeat many times:
  - Pick a random hyperplane r
  - Compute the inner product of r with x
  - Record whether x is "close to" r (r.x >= 0) as the next bit in bx
- Theory says that if x and x' have small cosine distance, then bx and bx' will have small Hamming distance

LSH: key ideas
Naive algorithm:
- Initialization:
  - For i=1 to outputBits:
    - For each feature f:
      - Draw r(i,f) ~ Normal(0,1)
- Given an instance x:
  - For i=1 to outputBits:
    - LSH[i] = sum(x[f]*r[i,f] for f with non-zero weight in x) > 0 ? 1 : 0
  - Return the bit-vector LSH
- Problem: the array of r's is very large
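A direct Python sketch of the naive algorithm, with feature vectors as dicts and random.gauss standing in for Normal(0,1); the function names are illustrative.

    import random

    def make_random_projections(features, output_bits, seed=0):
        """Naive initialization: one Normal(0,1) draw per (output bit, feature).
        features: the full feature vocabulary, so this table is |features| x outputBits
        entries - exactly the "array of r's is very large" problem noted above."""
        rng = random.Random(seed)
        return {(i, f): rng.gauss(0.0, 1.0)
                for i in range(output_bits) for f in features}

    def lsh_signature(x, r, output_bits):
        """x: dict {feature: value}.  Returns the bit vector b_x as a list of 0/1."""
        bits = []
        for i in range(output_bits):
            s = sum(xf * r[(i, f)] for f, xf in x.items() if xf != 0)
            bits.append(1 if s > 0 else 0)
        return bits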

LSH: "pooling" (van Durme)
Better algorithm:
- Initialization:
  - Create a pool:
    - Pick a random seed s
    - For i=1 to poolSize:
      - Draw pool[i] ~ Normal(0,1)
  - For i=1 to outputBits:
    - Devise a random hash function hash(i,f):
      - E.g.: hash(i,f) = hashcode(f) XOR randomBitString[i]
- Given an instance x:
  - For i=1 to outputBits:
    - LSH[i] = sum(x[f] * pool[hash(i,f) % poolSize] for f in x) > 0 ? 1 : 0
  - Return the bit-vector LSH
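And a sketch of the pooled version. Using per-bit random masks XORed with hash(f) is one plausible reading of hash(i,f) = hashcode(f) XOR randomBitString[i]; treat the exact hash construction as an assumption.

    import random

    def make_pool(pool_size, seed=1):
        rng = random.Random(seed)
        return [rng.gauss(0.0, 1.0) for _ in range(pool_size)]

    def make_bit_masks(output_bits, seed=2):
        # one random bit string per output bit, XORed with the feature's hashcode
        rng = random.Random(seed)
        return [rng.getrandbits(32) for _ in range(output_bits)]

    def pooled_lsh_signature(x, pool, masks):
        """x: dict {feature: value}.  pool: shared list of Normal(0,1) draws.
        The projection weight for (i, f) is looked up in the pool instead of stored."""
        bits = []
        for mask in masks:
            s = sum(xf * pool[(hash(f) ^ mask) % len(pool)]
                    for f, xf in x.items() if xf != 0)
            bits.append(1 if s > 0 else 0)
        return bits

    # usage: pool = make_pool(10_000); masks = make_bit_masks(64)
    #        sig = pooled_lsh_signature({"cat": 1.0, "dog": 2.0}, pool, masks)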

LSH: key ideas
Advantages:
- with pooling, this is a compact re-encoding of the data
  - you don't need to store the r's, just the pool
- leads to a very fast nearest neighbor method
  - just look at other items with bx = bx'
  - also very fast nearest-neighbor methods for Hamming distance
- similarly, leads to very fast clustering
  - cluster = all things with the same bx vector
More next week...

Graph Algorithms

Graph algorithms
- PageRank implementations
  - in memory
  - streaming, node list in memory
  - streaming, no memory
  - map-reduce

Google's PageRank
[Figure: a small graph of web sites linking to one another]
- Inlinks are good (recommendations)
- Inlinks from a good site are better than inlinks from a bad site
- but inlinks from sites with many outlinks are not as good...
- "Good" and "bad" are relative.

Google's PageRank
- Imagine a "pagehopper" that always either follows a random link, or jumps to a random page

Google's PageRank (Brin & Page, http://www-db.stanford.edu/~backrub/google.html)
- Imagine a "pagehopper" that always either follows a random link, or jumps to a random page
- PageRank ranks pages by the amount of time the pagehopper spends on a page:
  - or, if there were many pagehoppers, PageRank is the expected "crowd size"

PageRank in Memory
- Let u = (1/N, ..., 1/N); dimension = #nodes N
- Let A = adjacency matrix: [aij = 1 <=> i links to j]
- Let W = [wij = aij / outdegree(i)]
  - wij is the probability of a jump from i to j
- Let v0 = (1, 1, ..., 1), or anything else you want
- Repeat until converged:
  - Let v(t+1) = c*u + (1-c)*W*v(t)
  - c is the probability of jumping anywhere randomly
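A minimal in-memory sketch in Python, following the update v(t+1) = c*u + (1-c)*W*v(t) with u = (1/N, ..., 1/N); it assumes every node has at least one out-link (no dangling-node handling), which is an assumption beyond the slide.

    def pagerank_in_memory(links, c=0.15, iters=50):
        """links: dict {i: [j, ...]} of out-links; assumes every node is a key
        and has >= 1 out-link.  Returns a dict of PageRank scores."""
        nodes = list(links)
        N = len(nodes)
        v = {i: 1.0 / N for i in nodes}          # v0; any start vector works
        for _ in range(iters):
            v_new = {i: c / N for i in nodes}    # the c*u "random jump" term
            for i, out in links.items():
                share = (1 - c) * v[i] / len(out)
                for j in out:
                    v_new[j] += share            # (1-c) * w_ij * v[i]
            v = v_new
        return v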

Streaming PageRank
- Assume we can store v but not W in memory
- Repeat until converged:
  - Let v(t+1) = c*u + (1-c)*W*v(t)
- Store A as a row matrix: each line is "i  ji,1, ..., ji,d"  [the neighbors of i]
- Store v and v' in memory: v' starts out as c*u
- For each line "i  ji,1, ..., ji,d":
  - For each j in ji,1, ..., ji,d:
    - v'[j] += (1-c)*v[i]/d
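A sketch of one streaming pass under these assumptions (adjacency rows in a text file, only v and v' in memory); the file format and function name are illustrative.

    def streaming_pagerank_pass(adj_path, v, c=0.15):
        """One iteration of v' = c*u + (1-c)*W*v, streaming the graph from disk.
        adj_path: file with one line per node, "i j1 j2 ... jd".
        v: dict of current scores, assumed to have an entry for every node id."""
        N = len(v)
        v_new = {i: c / N for i in v}            # v' starts out as c*u
        with open(adj_path) as f:
            for line in f:
                parts = line.split()
                i, neighbors = parts[0], parts[1:]
                d = len(neighbors)
                for j in neighbors:
                    v_new[j] += (1 - c) * v[i] / d
        return v_new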

Streaming PageRank: with some long rows
- Repeat until converged:
  - Let v(t+1) = c*u + (1-c)*W*v(t)
- Store A as a list of edges: each line is "i  d(i)  j"
- Store v and v' in memory: v' starts out as c*u
- For each line "i  d  j":
  - v'[j] += (1-c)*v[i]/d

Streaming PageRank: with some long rows
- Preprocessing A: convert the row-format adjacency data into edge lines "i d(i) j" that carry each node's outdegree
- Repeat until converged:
  - Let v(t+1) = c*u + (1-c)*W*v(t)

Pure streaming:
- for each line "i, j, d(i), v(i)":
  - send to i: "incv c"
  - send to j: "incv (1-c)*v(i)/d(i)"
- Sort and add the messages to get v'
- Then, for each line "i, v":
  - send to i (so that the msg comes before the j's): "set_v v"
- then you can do the scan through the reduce inputs to reset v to v'

Streaming PageRank: constant memory
- Repeat until converged:
  - Let v(t+1) = c*u + (1-c)*W*v(t)
- PRStats: each line is "i, d(i), v(i)"
- Links: each line is "i, j"
- Phase 1: for each link "i, j": send to i: "link j"
- Combine with PRStats, sort, and reduce:
  - For each "i, d(i), v(i)":
    - Send to i: "incv c"
    - For each msg "link j":
      - Send to j: "incv (1-c)*v(i)/d(i)"
- Combine with PRStats, sort, and reduce:
  - For each "i, d(i), v(i)":
    - v' = 0
    - For each msg "incv delta": v' += delta
    - Output new "i, d(i), v'(i)"
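A compact single-machine simulation of the two sort-and-reduce phases (not a real map-reduce job); interpreting the "incv c" jump message as c/N, to match u = (1/N, ..., 1/N), is an assumption, since the slide writes just c.

    from itertools import groupby
    from operator import itemgetter

    def pagerank_mapreduce_iteration(prstats, links, c=0.15):
        """prstats: list of (i, d_i, v_i) records.  links: list of (i, j) edges.
        Simulates one iteration as two "combine with PRStats, sort, reduce" phases."""
        N = len(prstats)

        # Phase 1: join each node's "link j" messages with its PRStats record
        # and emit "incv" messages addressed to the targets.
        out = {}
        for i, j in links:
            out.setdefault(i, []).append(j)
        messages = []
        for i, d, v in prstats:
            messages.append((i, c / N))                  # send to i: incv c  (c/N assumed)
            for j in out.get(i, []):
                messages.append((j, (1 - c) * v / d))    # send to j: incv (1-c)*v(i)/d(i)

        # Phase 2: sort the incv messages by target node, join with PRStats,
        # and sum the deltas into the new v'.
        messages.sort(key=itemgetter(0))
        new_v = {i: sum(delta for _, delta in msgs)
                 for i, msgs in groupby(messages, key=itemgetter(0))}
        return [(i, d, new_v.get(i, 0.0)) for i, d, v in prstats]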

[Diagram: reducer input/output streams for the two phases.
 Phase 1 reduce input: "i, d(i), v(i)" followed by messages "~link j1", "~link j2", ...; output: "incv" messages keyed by target.
 Phase 2 reduce input: "i, d(i), v(i)" followed by messages "~incv x1", "~incv x2", ...; output: updated PRStats records.
 Each phase joins the message stream with the PRStats stream.]
