Machine Learning for Big Data
CSE547/STAT548, University of Washington
Sham Kakade
“Geometric” data structures
©Sham Kakade 2017


Page 1: “Geometric” data structures

Page 2: Announcements

• HW3  posted

• Today:
  – Review: LSH for Euclidean distance
  – Other ideas: KD-trees, ball trees, cover trees

Page 3: Image Search…

Page 4: LSH for Euclidean distance

• The family of hash functions:
• Recall R, cR, P1, P2
• Pre-processing time:
• Query time:
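For reference, the hash family being reviewed here for Euclidean distance is the p-stable construction of Datar et al.; a minimal sketch (the function name and parameters are illustrative, not from the slide):

```python
import numpy as np

def make_euclidean_lsh(dim, w, rng):
    # One hash function from the p-stable LSH family for Euclidean
    # distance: h(x) = floor((a . x + b) / w), with a ~ N(0, I) and
    # b ~ Uniform[0, w); w is the bucket width.
    a = rng.normal(size=dim)
    b = rng.uniform(0.0, w)
    return lambda x: int(np.floor((a @ x + b) / w))
```

Nearby points collide with probability at least P1 (at distance R) while far points collide with probability at most P2 (at distance cR); the full scheme concatenates k such functions per table and repeats over L tables.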

Page 5: What other guarantees might we hope for?

• Recall sorting:
• LSH:
• Voronoi:
• How about other “geometric” data structures?
• What is the ‘key’ inequality to exploit?

Page 6: KD-Trees

• Smarter approach: kd-trees
  – Structured organization of documents
• Recursively partitions points into axis-aligned boxes.
  – Enables more efficient pruning of the search space
• Examine nearby points first.
• Ignore any points that are further than the nearest point found so far.
• kd-trees work “well” in “low-to-medium” dimensions
  – We’ll get back to this…

Page 7: KD-Tree Construction

Pt   X     Y
1    0.00  0.00
2    1.00  4.31
3    0.13  2.85
…    …     …

• Start with a list of d-dimensional points.

Page 8: KD-Tree Construction

Split on X > .5

NO branch:
Pt   X     Y
1    0.00  0.00
3    0.13  2.85
…    …     …

YES branch:
Pt   X     Y
2    1.00  4.31
…    …     …

• Split the points into 2 groups by:
  – Choosing dimension d_j and value V (methods to be discussed…)
  – Separating the points into x_i[d_j] > V and x_i[d_j] ≤ V.

Page 9: KD-Tree Construction

Split on X > .5

NO branch:
Pt   X     Y
1    0.00  0.00
3    0.13  2.85
…    …     …

YES branch:
Pt   X     Y
2    1.00  4.31
…    …     …

• Consider each group separately and possibly split again (along the same or a different dimension).
  – Stopping criterion to be discussed…

Page 10: KD-Tree Construction

Split on X > .5

YES branch:
Pt   X     Y
2    1.00  4.31
…    …     …

NO branch, split again on Y > .1:
  NO:
  Pt   X     Y
  1    0.00  0.00
  …    …     …
  YES:
  Pt   X     Y
  3    0.13  2.85
  …    …     …

• Consider each group separately and possibly split again (along the same or a different dimension).
  – Stopping criterion to be discussed…

Page 11: KD-Tree Construction

• Continue splitting the points in each set
  – creates a binary tree structure
• Each leaf node contains a list of points

Page 12: KD-Tree Construction

• Keep one additional piece of information at each node:
  – The (tight) bounds of the points at or below this node.
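The construction described over the last few slides (recursive splits, leaf point lists, tight per-node bounds) can be sketched as follows; this is a minimal illustration, and `max_leaf_size`, the widest-dimension choice, and the median split are assumptions standing in for the heuristics discussed on the next slides:

```python
import numpy as np

class KDNode:
    def __init__(self, points):
        self.points = points                  # point list (kept only at leaves)
        self.lo = points.min(axis=0)          # tight lower bounds of the box
        self.hi = points.max(axis=0)          # tight upper bounds of the box
        self.dim = self.value = None          # split dimension d_j and value V
        self.left = self.right = None         # <= V side, > V side

def build_kd_tree(points, max_leaf_size=2):
    node = KDNode(points)
    if len(points) <= max_leaf_size:
        return node                           # stopping criterion: small leaf
    # heuristic choices: split the widest dimension at the median
    node.dim = int(np.argmax(node.hi - node.lo))
    node.value = float(np.median(points[:, node.dim]))
    mask = points[:, node.dim] > node.value
    if mask.all() or (~mask).all():           # degenerate split: stop here
        return node
    node.left = build_kd_tree(points[~mask], max_leaf_size)
    node.right = build_kd_tree(points[mask], max_leaf_size)
    node.points = None                        # interior nodes keep only bounds
    return node
```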

Page 13: KD-Tree Construction

• Use heuristics to make splitting decisions:
  – Which dimension do we split along?
  – Which value do we split at?
  – When do we stop?

Page 14: Many heuristics…

• median heuristic
• center-of-range heuristic
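Each of the two heuristics named above amounts to one line; a hypothetical helper for illustration:

```python
import numpy as np

def split_value(points, d, heuristic="median"):
    # Two common choices for the split value V along dimension d:
    if heuristic == "median":
        # median heuristic: split at the median coordinate
        # (gives balanced children)
        return float(np.median(points[:, d]))
    # center-of-range heuristic: split halfway between the extremes
    # (gives square-ish boxes)
    return float((points[:, d].min() + points[:, d].max()) / 2.0)
```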

Page 15: Nearest Neighbor with KD Trees

• Traverse the tree looking for the nearest neighbor of the query point.

Page 16: Nearest Neighbor with KD Trees

• Examine nearby points first:
  – Explore the branch of the tree closest to the query point first.


Page 18: Nearest Neighbor with KD Trees

• When we reach a leaf node:
  – Compute the distance to each point in the node.


Page 20: Nearest Neighbor with KD Trees

• Then backtrack and try the other branch at each node visited.

Page 21: Nearest Neighbor with KD Trees

• Each time a new closest node is found, update the distance bound.

Page 22: Nearest Neighbor with KD Trees

• Using the distance bound and bounding box of each node:
  – Prune parts of the tree that could NOT include the nearest neighbor.


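Putting the search steps of the last few slides together (greedy descent, backtrack, prune with the bounding box against the current best distance), a self-contained sketch; it repeats a minimal median-split build so it runs on its own, and all names are illustrative:

```python
import numpy as np

def build(points, leaf=2):
    # Minimal median-split kd-tree (repeated here for self-containment);
    # every node stores the tight bounding box of its points.
    lo, hi = points.min(axis=0), points.max(axis=0)
    if len(points) <= leaf:
        return {"lo": lo, "hi": hi, "pts": points}
    d = int(np.argmax(hi - lo))
    v = float(np.median(points[:, d]))
    mask = points[:, d] > v
    if mask.all() or (~mask).all():
        return {"lo": lo, "hi": hi, "pts": points}
    return {"lo": lo, "hi": hi, "d": d, "v": v,
            "le": build(points[~mask], leaf), "gt": build(points[mask], leaf)}

def box_dist(q, lo, hi):
    # Distance from q to the node's bounding box (0 if q is inside it).
    return float(np.linalg.norm(np.clip(q, lo, hi) - q))

def nn(node, q, best=(np.inf, None)):
    # Prune: no point inside this box can beat the current best distance.
    if box_dist(q, node["lo"], node["hi"]) >= best[0]:
        return best
    if "pts" in node:                 # leaf: check each stored point
        for p in node["pts"]:
            d = float(np.linalg.norm(p - q))
            if d < best[0]:
                best = (d, p)
        return best
    # Explore the branch containing the query first, then backtrack.
    near, far = ("le", "gt") if q[node["d"]] <= node["v"] else ("gt", "le")
    best = nn(node[near], q, best)
    return nn(node[far], q, best)
```

The pruning test is exact, so the answer always matches brute force; only the amount of backtracking varies with the data distribution.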

Page 25: Complexity

• For (nearly) balanced binary trees…
• Construction
  – Size:
  – Depth:
  – Median + send points left/right:
  – Construction time:
• 1-NN query
  – Traverse down tree to starting point:
  – Maximum backtrack and traverse:
  – Complexity range:
• Under some assumptions on the distribution of points, we get O(log N) but exponential in d (see citations in reading)
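For a nearly balanced tree on N points, the standard values for these blanks (filled in here from the usual analysis, not taken from the slide itself) are:

```latex
\text{Size: } O(N) \qquad \text{Depth: } O(\log N) \\
\text{Median + send points left/right: } O(N) \text{ per level} \\
\text{Construction time: } O(N \log N) \\
\text{Traverse down to a leaf: } O(\log N) \qquad
\text{Query range: } O(\log N) \text{ (good case) to } O(N) \text{ (full backtrack)}
```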

Page 26: Complexity

Page 27: K-NN with KD Trees

• Exactly the same algorithm, but maintain the distance bound as the distance to the furthest of the current k nearest neighbors.
• Complexity is:
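The "furthest of the current k" bookkeeping can be done with a max-heap; a minimal sketch (the helper name is illustrative, and points are tracked by index to keep heap comparisons well-defined):

```python
import heapq

def update_k_best(heap, dist, idx, k):
    # Keep the k closest points seen so far in a max-heap (negated
    # distances). The kd-tree pruning radius is then -heap[0][0]:
    # the distance to the furthest of the current k nearest neighbors.
    if len(heap) < k:
        heapq.heappush(heap, (-dist, idx))
    elif dist < -heap[0][0]:
        heapq.heapreplace(heap, (-dist, idx))
```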

Page 28: Approximate K-NN with KD Trees

• Before: prune when distance to bounding box > r
• Now: prune when distance to bounding box > r/α
• Will prune more than allowed, but can guarantee that if we return a neighbor at distance r, then there is no neighbor closer than r/α.
• In practice this bound is loose… can be closer to optimal.
• Saves lots of search time at little cost in quality of nearest neighbor.

Page 29: What about NN searches in high dimensions?

• KD-trees:
  – What is going wrong?
  – Can this be easily fixed?
• What do we have to utilize?
  – Utilize the triangle inequality of the metric
  – New ideas: ball trees and cover trees

Page 30: Ball Trees

Page 31: Ball Tree Construction

• Node: every node defines a ball (hypersphere), containing
  – a subset of the points (to be searched)
  – a center
  – a (tight) radius of the points
• Construction:
  – Root: start with a ball which contains all the data
  – Take a ball and make two children (nodes) as follows:
    • Make two spheres, assign each point (in the parent sphere) to its closer sphere
    • Make the two spheres in a “reasonable” manner
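One "reasonable" way to make the two child spheres is the common furthest-point pivot heuristic; a sketch under that assumption (the pivot rule and `leaf` size are illustrative, not prescribed by the slide):

```python
import numpy as np

def build_ball_tree(points, leaf=2):
    # A node is (center, radius, payload): the radius is tight over the
    # node's points; the payload is the point array at a leaf, or a
    # (child, child) pair at an interior node.
    center = points.mean(axis=0)
    radius = float(np.linalg.norm(points - center, axis=1).max())
    if len(points) <= leaf:
        return (center, radius, points)
    # Pivot heuristic: the point furthest from the center, then the
    # point furthest from that one.
    a = points[int(np.linalg.norm(points - center, axis=1).argmax())]
    b = points[int(np.linalg.norm(points - a, axis=1).argmax())]
    # Assign each point in the parent sphere to its closer pivot.
    to_a = (np.linalg.norm(points - a, axis=1)
            <= np.linalg.norm(points - b, axis=1))
    if to_a.all() or (~to_a).all():
        return (center, radius, points)      # degenerate split: make a leaf
    return (center, radius, (build_ball_tree(points[to_a], leaf),
                             build_ball_tree(points[~to_a], leaf)))
```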

Page 32: Ball Tree Search

• Given a point x, how do we find its nearest neighbor quickly?
• Approach:
  – Start: follow a greedy path through the tree
  – Backtrack and prune: rule out other paths based on the triangle inequality (just like in KD-trees)
• How good is it?
  – Guarantees:
  – Practice:
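The greedy-descent-then-prune search, with the triangle-inequality bound spelled out; a sketch that assumes nodes are (center, radius, payload) tuples, an illustrative layout rather than the lecture's notation:

```python
import numpy as np

def ball_search(node, q, best=(np.inf, None)):
    # node = (center, radius, payload); payload is a point array at a
    # leaf, or a pair of child nodes otherwise.
    center, radius, payload = node
    # Triangle inequality: any point p in this ball satisfies
    # dist(q, p) >= dist(q, center) - radius, so the whole ball can be
    # pruned when that lower bound already exceeds the current best.
    if np.linalg.norm(q - center) - radius >= best[0]:
        return best
    if isinstance(payload, np.ndarray):       # leaf: check each point
        for p in payload:
            d = float(np.linalg.norm(p - q))
            if d < best[0]:
                best = (d, p)
        return best
    # Greedy path: descend into the child whose center is closer to q
    # first, then backtrack into the other (often pruned by then).
    for child in sorted(payload, key=lambda c: np.linalg.norm(q - c[0])):
        best = ball_search(child, q, best)
    return best
```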

Page 33: Cover Trees

• What about exact NNs in general metric spaces?
• Same idea: utilize the triangle inequality of the metric (so allow for an arbitrary metric)
• What does the dimension even mean?
• Cover-tree idea:

Page 34: Intrinsic Dimension

• How does the volume grow, from radius R to 2R?
• Can we relax this idea to get at the “intrinsic” dimension?
  – This is the “doubling” dimension:
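In Euclidean R^d, growing a ball from radius R to 2R multiplies its volume by 2^d. The usual relaxation of this idea (stated here as the standard definition; the slide leaves it as a prompt) is:

```latex
% Doubling dimension of a metric space X: the smallest d such that
% every ball of radius 2R can be covered by at most 2^d balls of radius R.
\mathrm{ddim}(X) = \min \Big\{ d : \forall x \in X,\ \forall R > 0,\
  B(x, 2R) \subseteq \textstyle\bigcup_{i=1}^{2^{d}} B(x_i, R)
  \ \text{for some } x_1, \dots, x_{2^d} \in X \Big\}
```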

Page 35: NN complexities