HGrid: A Data Model for Large Geospatial Data Sets in HBase
![Page 1: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/1.jpg)
Dan Han and Eleni Stroulia
University of Alberta
ssrg.cs.ualberta.ca
04/12/23 Cloud 2013
![Page 2: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/2.jpg)
![Page 3: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/3.jpg)
The General Research Problem
The Geospatial Problem Instance
▪ The Data Set
▪ HBase data-organization alternatives
▪ Performance analysis
Some Lessons Learned
![Page 4: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/4.jpg)
![Page 5: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/5.jpg)
![Page 6: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/6.jpg)
![Page 7: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/7.jpg)
Appropriate data models for
▪ time-series applications (MESOCA 2012)
▪ geospatial applications (CLOUD 2013)
In progress: spatio-temporal applications
![Page 8: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/8.jpg)
![Page 9: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/9.jpg)
![Page 10: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/10.jpg)
[1] built a multi-dimensional index layer on top of HBase, a one-dimensional key-value store, to perform spatial queries.
[2] presented a novel key-formulation scheme, based on the R+-tree, for spatial indexing in HBase.
Both focus on row-key design; neither discusses column and version design.
[1] Shoji Nishimura, Sudipto Das, Divyakant Agrawal, Amr El Abbadi: MD-HBase: A Scalable Multi-dimensional Data Infrastructure for Location Aware Services. Mobile Data Management (1) 2011: 7-16
[2] Ya-Ting Hsu, Yi-Chin Pan, Ling-Yin Wei, Wen-Chih Peng, Wang-Chien Lee: Key Formulation Schemes for Spatial Index in Cloud Data Managements. MDM 2012: 21-26
![Page 11: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/11.jpg)
Two Synthetic Datasets
Uniform and Zipf distributions
Based on the Bixi dataset; each object includes
▪ station ID
▪ latitude, longitude, station name, terminal name
▪ number of docks
▪ number of bikes
100 million objects (70 GB) in a 100 km × 100 km simulated space
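As a rough illustration of the record described above, the following sketch builds one synthetic object with uniformly distributed coordinates and serializes it to JSON. The field names, the planar x/y coordinates in kilometres (standing in for latitude and longitude), and the value ranges for docks and bikes are assumptions made for illustration, not the authors' generator. The Zipf-distributed variant is not shown.

```python
import json
import random

SPACE_KM = 100.0  # side length of the simulated space (from the slide)


def make_object(station_id, rng):
    """Build one synthetic station record; all field names are hypothetical."""
    docks = rng.randint(5, 40)  # assumed range, not from the source
    return {
        "station_id": station_id,
        "x_km": rng.uniform(0.0, SPACE_KM),  # stands in for longitude
        "y_km": rng.uniform(0.0, SPACE_KM),  # stands in for latitude
        "station_name": f"station-{station_id}",
        "terminal_name": f"terminal-{station_id}",
        "num_docks": docks,
        "num_bikes": rng.randint(0, docks),
    }


rng = random.Random(42)  # fixed seed so the sketch is reproducible
value = json.dumps(make_object(1, rng))  # JSON string stored as one cell value
```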
![Page 12: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/12.jpg)
Regular Grid Indexing
Row key: grid row ID
Column: grid column ID
Version: counter of objects
Value: one object in JSON format
[Figure: regular-grid table layout; axes: Row ID (00 to 03), Column ID (00 to 03), Counter]
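The regular-grid layout above can be sketched with an in-memory dictionary standing in for the HBase table. This is a minimal sketch under stated assumptions: the function names, the zero-padded key width, and the planar coordinates are hypothetical, and the list position plays the role of the version counter. Zero padding is used so that lexicographic key order matches numeric order.

```python
import json


def grid_cell(x, y, cell_size):
    """Return (row_index, col_index) of the grid cell containing point (x, y)."""
    return int(y // cell_size), int(x // cell_size)


def rg_put(table, obj, x, y, cell_size, width=4):
    """Append obj as a new version of its cell; return (row_key, column, version)."""
    row, col = grid_cell(x, y, cell_size)
    row_key, column = f"{row:0{width}d}", f"{col:0{width}d}"
    versions = table.setdefault((row_key, column), [])
    versions.append(json.dumps(obj))  # value: one object in JSON format
    return row_key, column, len(versions) - 1


table = {}  # stand-in for the HBase table (assumption)
first = rg_put(table, {"station_id": 1}, x=12.5, y=56.25, cell_size=0.5)
second = rg_put(table, {"station_id": 2}, x=12.75, y=56.4, cell_size=0.5)
```

Both points fall into the same cell, so the second object is stored as a further version under the same row key and column.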
![Page 13: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/13.jpg)
Trie-based Quad-Tree Indexing
Z-value linearization
Row key: Z-value
Column: object ID
Value: one object in JSON format
[Figure: quad-tree table layout; axes: Z-value (rows), Object ID (columns)]
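Z-value linearization interleaves the bits of a cell's two coordinates into a single number. The sketch below shows one common way to do this; the bit order (row bit above column bit), the function names, and the binary-string key format are assumptions, not necessarily the encoding used in the talk.

```python
def z_value(col, row, depth):
    """Interleave the low `depth` bits of col and row into one Z-value."""
    z = 0
    for i in range(depth):
        z |= ((col >> i) & 1) << (2 * i)      # column bit at even position
        z |= ((row >> i) & 1) << (2 * i + 1)  # row bit at odd position
    return z


def z_key(col, row, depth):
    """Fixed-width binary string so that key order follows Z-order."""
    return format(z_value(col, row, depth), f"0{2 * depth}b")


key = z_key(col=2, row=1, depth=2)  # "0110"
```

With this encoding, the horizontally adjacent cells (col 1, row 0) and (col 2, row 0) at depth 2 receive Z-values 1 and 4, so neighbours in space are not always neighbours in key order.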
![Page 14: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/14.jpg)
Quad-Tree data model
▪ More rows with a deeper tree
▪ Z-ordering linearization (violates data locality)
▪ In-time construction vs. pre-construction implies a tradeoff between query performance and memory allocation
Regular Grid data model
▪ Very easy to locate a cell by row ID and column ID
▪ Cannot handle a large space with a fine-grained grid, because in-memory indexes are subject to memory constraints
How much unrelated data is examined in a query matters a lot!
![Page 15: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/15.jpg)
[Figure: HGrid table layout over the space; axes: QTId-RowId (rows), ColumnId-ObjectId (columns), Object Attribute (third dimension); quad-tree tiles 00, 01, 10, 11; grid columns 01 to 03; cells labelled A to D]
![Page 16: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/16.jpg)
The row key is the QT Z-value + the RG row index.
The column name is the RG column index + the object ID.
The attributes of the data point are stored in the third dimension.
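The key composition described above can be sketched as follows. The hyphen separator follows the "QTId-RowId" and "ColumnId-ObjectId" labels of the layout figure; the function names, the zero-padding width, the nested dictionary standing in for the table, and the sample attribute values are assumptions for illustration.

```python
def hgrid_row_key(qt_z, rg_row, width=2):
    """Row key: quad-tree Z-value followed by the regular-grid row index."""
    return f"{qt_z}-{rg_row:0{width}d}"


def hgrid_column(rg_col, object_id, width=2):
    """Column name: regular-grid column index followed by the object ID."""
    return f"{rg_col:0{width}d}-{object_id}"


def hgrid_put(table, qt_z, rg_row, rg_col, object_id, attributes):
    """Store the attributes as a list, standing in for the third dimension."""
    row = table.setdefault(hgrid_row_key(qt_z, rg_row), {})
    row[hgrid_column(rg_col, object_id)] = list(attributes)
    return table


# Hypothetical object "s17" in tile 09, grid row 0, grid column 3.
table = hgrid_put({}, "09", 0, 3, "s17", ["Station 17", 23, 11])
```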
![Page 17: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/17.jpg)
1. Compute the minimum bounding square based on the query input location and the range
2. Compute the quad-tree tiles that overlap with the bounding square (their Z-codes)
3. Compute the indexes of all regular-grid cells in these quad-tree tiles (the secondary index of rows and columns)
4. Issue one sub-query for each selected tile of the quad-tree; process with user-level coprocessors on the HBase regions
5. Collect the results of the sub-queries at the client side
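Steps 1 to 3 above are plain geometry and can be sketched without HBase. This is a minimal sketch under stated assumptions: a planar space with its origin at (0, 0), square tiles of side `tile_size`, square cells of side `cell_size`, and hypothetical function names. The mapping from tile coordinates to Z-codes and the sub-queries of steps 4 and 5 are left out.

```python
import math


def bounding_square(x, y, radius):
    """Step 1: bounding square (min_x, min_y, max_x, max_y) of the query circle."""
    return (x - radius, y - radius, x + radius, y + radius)


def overlapping_tiles(square, tile_size, space_size):
    """Step 2: (tile_col, tile_row) of every quad-tree tile the square overlaps."""
    min_x, min_y, max_x, max_y = square
    last = round(space_size / tile_size) - 1  # index of the last tile per side
    lo_c, hi_c = max(0, int(min_x // tile_size)), min(last, int(max_x // tile_size))
    lo_r, hi_r = max(0, int(min_y // tile_size)), min(last, int(max_y // tile_size))
    return [(c, r) for r in range(lo_r, hi_r + 1) for c in range(lo_c, hi_c + 1)]


def cell_ranges(square, tile, tile_size, cell_size):
    """Step 3: inclusive (first, last) local row and column indexes in one tile."""
    min_x, min_y, max_x, max_y = square
    tile_col, tile_row = tile
    origin_x, origin_y = tile_col * tile_size, tile_row * tile_size
    per_side = round(tile_size / cell_size)  # regular-grid cells per tile side

    def local(value, origin):
        index = math.floor((value - origin) / cell_size)
        return min(per_side - 1, max(0, index))  # clip to the tile

    rows = (local(min_y, origin_y), local(max_y, origin_y))
    cols = (local(min_x, origin_x), local(max_x, origin_x))
    return rows, cols


square = bounding_square(12.5, 9.5, 1.0)
tiles = overlapping_tiles(square, tile_size=10, space_size=100)
ranges = {t: cell_ranges(square, t, tile_size=10, cell_size=1.0) for t in tiles}
```

In this example the square straddles two tiles, so two sub-queries would be issued, each restricted to its own row and column index range.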
![Page 18: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/18.jpg)
![Page 19: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/19.jpg)
![Page 20: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/20.jpg)
[Figure: query example on the grid; axis ticks 00 to 06; row keys 09-00 and 09-04]
![Page 21: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/21.jpg)
1. Estimate the search range (density-based range estimation)
2. Compute indices of rows and columns (steps 2 and 3 of Range Query)
3. Issue a scan query to retrieve the relevant data points
4. If fewer than K data points are returned, re-estimate the search range and repeat steps 2-3
5. Sort the result set by increasing distance from the input location
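The loop in the steps above can be sketched on an in-memory point list. The slides do not give the estimation formula, so this sketch assumes a circle whose expected point count is K under a known average density, and it doubles the radius on each retry; both choices, the function names, and the Euclidean distance are assumptions. The list comprehension stands in for the scan of steps 2 and 3.

```python
import math


def estimate_radius(k, density):
    """Radius of a circle expected to hold k points at the given density."""
    return math.sqrt(k / (math.pi * density))


def knn(points, center, k, density, growth=2.0):
    """Return the k points nearest to center, widening the search as needed."""
    cx, cy = center

    def dist(p):
        return math.hypot(p[0] - cx, p[1] - cy)

    radius = estimate_radius(k, density)
    while True:
        hits = [p for p in points if dist(p) <= radius]  # stand-in for the scan
        if len(hits) >= k or len(hits) == len(points):
            break
        radius *= growth  # fewer than k results: re-estimate and repeat
    return sorted(hits, key=dist)[:k]


points = [(0, 0), (1, 0), (5, 5), (0, 2), (9, 9)]
nearest = knn(points, center=(0, 0), k=3, density=0.05)
```

Searching a circle rather than a square keeps the result exact: once K points lie inside the circle, every point outside it is farther away than all of them.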
![Page 22: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/22.jpg)
Experiment Environment
A four-node cluster on virtual machines with Ubuntu on OpenStack
Hadoop 1.0.2 (replication factor 2), HBase 0.94
HBase configuration
▪ 5K caching size
▪ Block cache enabled
▪ ROWCOL bloom filter
Query-Processing Implementation
Native Java API
User-level coprocessor implementation
![Page 23: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/23.jpg)
The granularity of the grid affects query-processing performance
Explore the “best” cell configuration of each model
▪ Quad-tree => (t = 1)
▪ RG => (t = 0.1)
▪ HGrid => (T = 10, t = 0.1)
HG:≈10:0.1 fewer sub-queries, more false positives
HG:≈1:0.1 more sub-queries, fewer false positives
HG:≈10:0.01 more rows
HG:≈10:0.1 fewer rows
![Page 24: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/24.jpg)
Given a location and a radius, return the data points located within a distance less than or equal to the radius from the input location
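Because a grid-based scan returns whole cells, the candidate set can contain points outside the query circle, so a final distance check is needed. A minimal sketch of that check, assuming planar coordinates and Euclidean distance (the function name is hypothetical):

```python
import math


def within_radius(points, center, radius):
    """Keep only the points whose distance to center is at most radius."""
    cx, cy = center
    return [p for p in points if math.hypot(p[0] - cx, p[1] - cy) <= radius]


candidates = [(0, 0), (3, 4), (3, 4.1)]  # e.g. everything in the scanned cells
result = within_radius(candidates, center=(0, 0), radius=5)
```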
![Page 25: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/25.jpg)
Given the coordinates of a location, return the K points nearest to that location
![Page 26: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/26.jpg)
![Page 27: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/27.jpg)
![Page 28: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/28.jpg)
Data Organization
▪ Short row key and column name
▪ Better to have one column family and few columns
▪ Not a large amount of data in one row
▪ Row-key design should ease pruning of unrelated data
▪ The third dimension can store data as well
▪ Bloom filters should be configured to prune rows and columns
▪ Compression can reduce the amount of data transmitted
![Page 29: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/29.jpg)
Query Processing
▪ The rows scanned for one query should not exceed the scan cache size; otherwise, split the query into sub-queries
▪ “Scan” is better than “Get” for retrieving discontinuous keys, even though it also reads unrelated data
▪ Use “Scan” for small queries and coprocessors for large queries
▪ Better to split one large query into multiple sub-queries than to use one query with the row-filter mechanism
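The first lesson can be sketched as a helper that cuts a row range into chunks no larger than the scan cache. It assumes rows are addressed by consecutive integer indexes, which is a simplification of HBase's byte-string row keys, and the function name is hypothetical.

```python
def split_scan(start_row, stop_row, cache_rows):
    """Split the inclusive row range into sub-ranges of at most cache_rows rows."""
    if cache_rows <= 0:
        raise ValueError("cache_rows must be positive")
    return [
        (lo, min(lo + cache_rows - 1, stop_row))
        for lo in range(start_row, stop_row + 1, cache_rows)
    ]


sub_queries = split_scan(0, 11, cache_rows=5)
```

Each returned pair would then drive one sub-query, so no single scan exceeds the cache size.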
![Page 30: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/30.jpg)
Benefits from the good locality of the RG index; suffers from the poor locality of the Z-ordering QT linearization
▪ Performance could be improved with other linearization techniques
Can be flexibly configured and extended
▪ The QT index can be replaced by the hash code of each sub-space
▪ The granularity in the second stage can be varied from sub-space to sub-space based on the various densities
Is more suitable for homogeneously covered and discontinuous spaces
![Page 31: HGrid A Data Model for Large Geospatial Data Sets in HBase](https://reader035.vdocuments.site/reader035/viewer/2022081519/55509931b4c90595208b476b/html5/thumbnails/31.jpg)
A data model for spatio-temporal datasets
Towards general, systematic guidance for column-family and other NoSQL databases
Apply the data model to cloud-based applications and big-data analytics systems