
Page 1

MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services

Shoji Nishimura (NEC Service Platforms Labs.), Sudipto Das, Divyakant Agrawal, Amr El Abbadi

(University of California, Santa Barbara)

* Work done as a visiting researcher at UCSB

Page 2

Overview

▐ A Motivating Story

▐ Existing Technologies

▐ Our proposal

▐ Evaluation

▐ Conclusion

Page 3

Motivating Scenario: Mobile Coupon Distribution

[Figure: mobile users report their current locations to a Mobile Coupon Distributor, which issues coupons according to a distribution policy.]

Distribution Policy

• Area
• # of coupons

Page 4

Motivating Scenario: Mobile Coupon Distribution

[Figure: a large number of mobile users report their current locations; coupons are issued according to the distribution policy (area, # of coupons).]

125,000,000 subscribers in Japan

• System Scalability: large amounts of data, high throughput
• Efficient Complex Queries: multi-dimensional query, nearest neighbors query

Page 5

Existing Technologies

Criteria: Multi-dimensional Queries, Scalability

• Relational DBs
• Spatial DBs
• Commercial products, but expensive
• Open source products
• Key-Value Stores
• What We Want, at a reasonable price

Page 6

Ordered Key-Value Stores

[Figure: an index over keys (key00, key11, …, keynn) points to buckets of key-value pairs (key00/value00 … key0X/value0X, key11/value11 … key1Y/value1Y, keynn/valuenn), sorted by key.]

Good at 1-D Range Query

[Figure: the target space has the dimensions Longitude, Latitude, and Time.]

But, our target is multi-dimensional…
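As a rough illustration of why an ordered key-value store handles 1-D range queries well, the sketch below models the sorted key space as a Python list and answers a range query with two binary searches. The function name `range_scan` and the sample keys are made up for this example; a real store would scan buckets on disk rather than an in-memory list.

```python
from bisect import bisect_left, bisect_right

def range_scan(sorted_keys, low, high):
    """Return every key k with low <= k <= high from a sorted list."""
    start = bisect_left(sorted_keys, low)    # first position with key >= low
    end = bisect_right(sorted_keys, high)    # first position with key > high
    return sorted_keys[start:end]            # one contiguous slice, no filtering

# Hypothetical keys; the store keeps them sorted.
keys = sorted(["key05", "key00", "key12", "key03", "key11", "key01"])
hits = range_scan(keys, "key01", "key11")
print(hits)
```

Because the matching keys are contiguous, the cost is two index lookups plus a sequential read of exactly the answer set.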

Page 7

Naïve Solution: Linearization

[Figure: the same index and buckets as in the ordered key-value store, sorted by key.]

Projects n-D space to 1-D space

Simple, but problematic…

Apply a Z-ordering curve…

5 7 13 15

4 6 12 14

1 3 9 11

0 2 8 10
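The grid above can be reproduced with a few lines of code. This is a minimal sketch assuming 2-bit coordinates and an interleaving order that takes the x bit before the y bit, which is the order that matches the numbers on the slide; the function name `z_value` is chosen for this example only.

```python
def z_value(x, y, bits=2):
    """Interleave the bits of x and y (x bit first) into one Z-order value."""
    z = 0
    for i in reversed(range(bits)):          # most significant bit first
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

# Rebuild the 4x4 grid with the top row (y = 3) first, as drawn on the slide.
grid = [[z_value(x, y) for x in range(4)] for y in reversed(range(4))]
for row in grid:
    print(row)
```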

Page 8

Problem: False positive scans

▐ MD-query on linearized space
• Translate an MD-query to a linearized range query.
  Ex. Query from 2 to 9.
• Scan the queried linearized range.
• Filter points out of the queried area.
  Ex. blue-hatched area (4 to 7)
• Requires the boundary information of the original space.

5 7 13 15

4 6 12 14

1 3 9 11

0 2 8 10

[Figure: the queried area spans the cells from 2 to 9.]
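The false positive problem can be checked by hand on the 4x4 grid. The sketch below assumes the query rectangle has its left-bottom corner at cell 2 and its right-top corner at cell 9, i.e. x in [1, 2] and y in [0, 1]; `z_value` and `linearized_scan` are names invented for this example.

```python
def z_value(x, y, bits=2):
    """Interleave the bits of x and y (x bit first) into one Z-order value."""
    z = 0
    for i in reversed(range(bits)):
        z = (z << 1) | ((x >> i) & 1)
        z = (z << 1) | ((y >> i) & 1)
    return z

def linearized_scan(x_lo, y_lo, x_hi, y_hi, bits=2):
    """Scan the Z-range of a rectangle and separate hits from false positives."""
    size = 1 << bits
    cells = {z_value(x, y, bits): (x, y) for x in range(size) for y in range(size)}
    z_lo, z_hi = z_value(x_lo, y_lo, bits), z_value(x_hi, y_hi, bits)
    scanned = list(range(z_lo, z_hi + 1))
    hits = [z for z in scanned
            if x_lo <= cells[z][0] <= x_hi and y_lo <= cells[z][1] <= y_hi]
    false_positives = [z for z in scanned if z not in hits]
    return scanned, hits, false_positives

scanned, hits, false_positives = linearized_scan(1, 0, 2, 1)
print(scanned, hits, false_positives)
```

Half of the scanned cells (4 to 7) lie outside the rectangle, and the filter step needs the original coordinates to discard them.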

Page 9

Our Approach: MD-HBase

Build a multi-dimensional index layer on top of an ordered key-value store.

[Figure: MD-HBase places a multi-dimensional index on top of the single-dimensional index of an ordered key-value store (e.g. BigTable, HBase, …).]

Page 10

Introduce Multi-dimensional Index

▐ Multi-dimensional index (e.g. the K-d tree, the Quad tree)
• Divide a space into subspaces containing almost the same # of points.
• Organize the subspaces as a tree.

Efficient subspace pruning → avoids false positive scans

[Figure: a space is divided into subspaces, which are then organized as a tree.]

Page 11

Space Partition By the K-d tree

Binary Z-ordering space (keys are formed by bitwise interleaving of the coordinate bits):

11 | 0101 0111 1101 1111
10 | 0100 0110 1100 1110
01 | 0001 0011 1001 1011
00 | 0000 0010 1000 1010
   +--------------------
     00   01   10   11

[Figure: the same space partitioned by the K-d tree.]

How do we represent these subspaces?

Page 12

Key Idea: The longest common prefix naming scheme

[Figure: the binary Z-ordering space with subspaces labeled by prefixes such as 000* and 1***.]

Subspaces are represented as the longest common prefix of their keys!

Remarkable Property
• Preserves the boundary information of the original space

Example: subspace 1***
• Left-bottom corner: * → 0 gives 1000, i.e. (10, 00)
• Right-top corner: * → 1 gives 1111, i.e. (11, 11)
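The boundary reconstruction shown for 1*** is mechanical enough to sketch in code. The example below assumes keys are bit strings with the x bit first in each pair, as in the grid on the earlier slides; `deinterleave` and `prefix_bounds` are hypothetical helper names, not part of MD-HBase's published interface.

```python
def deinterleave(key):
    """Split an interleaved bit string (x bit first) into integer (x, y)."""
    return int(key[0::2], 2), int(key[1::2], 2)

def prefix_bounds(prefix):
    """Return (left-bottom, right-top) corners of the subspace named by prefix."""
    low = deinterleave(prefix.replace("*", "0"))    # * -> 0 gives the smallest key
    high = deinterleave(prefix.replace("*", "1"))   # * -> 1 gives the largest key
    return low, high

print(prefix_bounds("1***"))   # corners (10, 00) and (11, 11) as integers
print(prefix_bounds("01**"))
```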

Page 13

Build an index with the longest common prefix of keys

[Figure: the binary Z-ordering space partitioned into the subspaces 000*, 001*, 01**, and 1***.]

Index: 000*, 001*, 01**, 1***

Buckets: allocate one per subspace
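One way to picture the prefix index is as a sorted list of the lowest key of each subspace, searched with a binary search. This is only a sketch under that assumption: the real index lives in the ordered key-value store, and the names `PREFIXES`, `LOW_KEYS`, and `find_bucket` are invented here.

```python
from bisect import bisect_right

# Index entries from the slide; together they cover the whole 4-bit key space.
PREFIXES = ["000*", "001*", "01**", "1***"]
# Lowest key of each subspace (* -> 0). The list is already in sorted order.
LOW_KEYS = [p.replace("*", "0") for p in PREFIXES]

def find_bucket(key):
    """Return the prefix of the subspace (bucket) that contains the given key."""
    i = bisect_right(LOW_KEYS, key) - 1   # last entry whose lowest key <= key
    return PREFIXES[i]

print(find_bucket("0011"), find_bucket("0110"), find_bucket("1101"))
```

Since the subspaces do not overlap, the entry found this way is the only one whose prefix matches the key.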

Page 14

Multi-dimensional Range Query

[Figure: the binary Z-ordering space partitioned into the subspaces 000*, 001*, 01**, 10**, and 11**, with the queried area overlaid.]

Index: 000*, 001*, 01**, 10**, 11**

• Scan 0010 - 1001 on the index.
• Filter (subspace pruning): reconstruct the boundary information and check whether each subspace intersects the queried area.
• Scan the buckets of the remaining subspaces (001*, 10**).
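The query steps on this slide can be traced in a short sketch. It assumes the queried area is the rectangle with corners 0010 and 1001, i.e. x in [1, 2] and y in [0, 1], and it keeps the index as a plain Python list; all function names are hypothetical.

```python
PREFIXES = ["000*", "001*", "01**", "10**", "11**"]   # index entries on the slide

def interleave(x, y, bits=2):
    """Build the key bit string, x bit first in each pair."""
    xb, yb = format(x, f"0{bits}b"), format(y, f"0{bits}b")
    return "".join(a + b for a, b in zip(xb, yb))

def deinterleave(key):
    return int(key[0::2], 2), int(key[1::2], 2)

def bounds(prefix):
    """Reconstruct (left-bottom, right-top) corners from a prefix."""
    return (deinterleave(prefix.replace("*", "0")),
            deinterleave(prefix.replace("*", "1")))

def range_query_buckets(q_lo, q_hi):
    z_lo, z_hi = interleave(*q_lo), interleave(*q_hi)
    # Step 1: index scan over the linearized range [z_lo, z_hi].
    candidates = [p for p in PREFIXES
                  if p.replace("*", "0") <= z_hi and p.replace("*", "1") >= z_lo]
    # Step 2: subspace pruning with the reconstructed boundaries.
    kept = []
    for p in candidates:
        (x0, y0), (x1, y1) = bounds(p)
        if x0 <= q_hi[0] and x1 >= q_lo[0] and y0 <= q_hi[1] and y1 >= q_lo[1]:
            kept.append(p)
    return candidates, kept   # Step 3 would scan only the buckets in `kept`.

candidates, kept = range_query_buckets((1, 0), (2, 1))
print(candidates, kept)
```

In this trace the subspace 01** falls inside the linearized range but does not intersect the rectangle, so its bucket is never scanned.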

Page 15

K Nearest Neighbors Query

▐ The best-first algorithm can be applied.
• The most efficient technique in practical cases

▐ See our paper for the details.

[Figure: an example space with subspaces numbered 1 to 5.]
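The slide defers the details to the paper, so the following is only a generic best-first k-nearest-neighbors sketch, not the authors' implementation. It assumes buckets are held in a dictionary keyed by prefix, uses squared Euclidean distance, and visits subspaces in order of their minimum distance to the query point; all names and the sample points are made up.

```python
import heapq

def deinterleave(key):
    return int(key[0::2], 2), int(key[1::2], 2)

def bounds(prefix):
    return (deinterleave(prefix.replace("*", "0")),
            deinterleave(prefix.replace("*", "1")))

def min_dist2(q, box):
    """Squared distance from point q to the nearest point of a bounding box."""
    (x0, y0), (x1, y1) = box
    dx = max(x0 - q[0], 0, q[0] - x1)
    dy = max(y0 - q[1], 0, q[1] - y1)
    return dx * dx + dy * dy

def knn(query, buckets, k):
    # Heap entries: (distance, kind, item); kind 0 = subspace, kind 1 = point.
    heap = [(min_dist2(query, bounds(p)), 0, p) for p in buckets]
    heapq.heapify(heap)
    result = []
    while heap and len(result) < k:
        _, kind, item = heapq.heappop(heap)
        if kind == 1:
            result.append(item)            # nothing left in the heap can be closer
        else:
            for pt in buckets[item]:       # expand the closest unvisited subspace
                d = (pt[0] - query[0]) ** 2 + (pt[1] - query[1]) ** 2
                heapq.heappush(heap, (d, 1, pt))
    return result

# Hypothetical contents of the four buckets from the index slide.
buckets = {"000*": [(0, 0), (0, 1)], "001*": [(1, 1)],
           "01**": [(0, 3), (1, 2)], "1***": [(2, 0), (3, 3), (2, 2)]}
neighbors = knn((1, 1), buckets, 3)
print(neighbors)
```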

Page 16

Variations of Storage Layer

▐ Table Share Model
• Use a single table; maintain bucket boundaries
• Most space efficient
• Monitor

▐ Table per Bucket Model
• Allocate a table per bucket
• Most flexible mapping: one-to-one, one-to-many, many-to-one
• Bucket split is expensive: copy all points to the new buckets.

▐ Region per Bucket Model
• Allocate a region per bucket
• Most efficient bucket split: asynchronous bucket split
• Requires modification of HBase
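To make the cost of a bucket split concrete, here is a minimal sketch of splitting one subspace into two by fixing its first wildcard bit and redistributing the stored keys, which corresponds to the "copy all points to the new buckets" step. It assumes buckets are an in-memory dictionary of key lists; the helper names are hypothetical and no HBase API is involved.

```python
def split_prefix(prefix):
    """Fix the first * to 0 and to 1, yielding the two child subspaces."""
    i = prefix.index("*")                  # raises ValueError if nothing is left to split
    return prefix[:i] + "0" + prefix[i + 1:], prefix[:i] + "1" + prefix[i + 1:]

def matches(prefix, key):
    """True when key agrees with prefix on every non-wildcard position."""
    return all(p in ("*", k) for p, k in zip(prefix, key))

def split_bucket(buckets, prefix):
    """Replace one bucket with its two children, copying every key across."""
    keys = buckets.pop(prefix)
    for child in split_prefix(prefix):
        buckets[child] = [k for k in keys if matches(child, k)]
    return buckets

buckets = {"01**": ["0110"], "1***": ["1000", "1011", "1101"]}
split_bucket(buckets, "1***")
print(buckets)
```

Splitting 1*** this way yields 10** and 11**, the same refinement that appears between the index on the earlier slide and the one used for the range query.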

Page 17

Experimental Results: Multi-dimensional Range Query

• Dataset: 400,000,000 points
• Queries: select objects within MD ranges while varying the selectivity
• Cluster size: 16 nodes
• MD-HBase responds 10 to 100 times faster than the others, and its response time is proportional to the selectivity.

[Figure: response time (sec, log scale from 1 to 1000) versus selectivity (%, 0.01 to 10) for MD-HBase, HBase (ZOrder), and MapReduce.]

Page 18

Experimental Results: k Nearest Neighbors Query

• Dataset: 400,000,000 points
• Queries: choose a point and vary the number of neighbors
• Cluster size: 16 nodes
• MD-HBase responds in 1.5 sec when k ≦ 100, and in 11 sec even when k = 10,000.

[Figure: response time (sec, 0 to 12) versus k, the number of neighbors (1 to 10,000, log scale).]

Page 19

Experimental Results: Insert

• Dataset: spatially skewed data generated by a Zipfian distribution
• MD-HBase shows good scalability without significant overhead.

[Figure: throughput (records/sec, 0 to 250,000) versus number of nodes (0 to 20) for MD-HBase and HBase (ZOrder).]

Page 20

Conclusions

▐ Designed a scalable multi-dimensional data store.
• Scalability and efficient multi-dimensional queries
• Key idea: indexing the longest common prefix of keys
• Easily extends general ordered key-value stores.

▐ Demonstrated scalable insert throughput and excellent query performance.
• Range query: 10-100 times faster than existing technologies.
• kNN query: 1.5 s when k ≦ 100.
• Insert: 220K inserts/sec on a 16-node cluster without overhead

Thank you. Any Questions?