
Parallel Geospatial Data Management for Multi-Scale Environmental Data Analysis on GPUs

Visiting Faculty: Jianting Zhang, The City College of New York (CCNY), jzhang@cs.ccny.cuny.edu
ORNL Host: Dali Wang, Climate Change Science Institute (CCSI), wangd@ornl.gov

Overview

As the spatial and temporal resolutions of Earth observatory data and Earth system simulation outputs get higher, in-situ and/or post-processing of such large amounts of geospatial data increasingly becomes a bottleneck in scientific inquiries into Earth systems and their human impacts. Existing geospatial techniques based on outdated computing models (e.g., serial algorithms and disk-resident systems), as implemented in many commercial and open source packages, are incapable of processing large-scale geospatial data at the desired level of performance. Partially supported by DOE's Visiting Faculty Program (VFP), we are investigating a set of parallel data structures and algorithms capable of utilizing the massively data parallel computing power available on commodity Graphics Processing Units (GPUs). We have designed and implemented a popular geospatial technique called Zonal Statistics, i.e., deriving histograms of points or raster cells in polygonal zones. Results show that our GPU-based parallel Zonal Statistics technique, applied to 3,000+ US counties over 20+ billion NASA SRTM 30-meter resolution Digital Elevation Model (DEM) raster cells, achieves impressive end-to-end runtimes: 101 seconds and 46 seconds on a low-end workstation equipped with an Nvidia GTX Titan GPU using cold and hot caches, respectively; 60-70 seconds using a single OLCF Titan computing node; and 10-15 seconds using 8 nodes.

[Overview figure: high-resolution satellite imagery, in-situ observation sensor data, and global/regional climate model outputs feed data assimilation and Zonal Statistics over ecological, environmental and administrative zones (ROIs); temporal trends are derived on a high-end computing facility, with work mapped to GPU thread blocks.]

Background & Motivation

Zonal Statistics on NASA Shuttle Radar Topography Mission (SRTM) Data


SQL: SELECT COUNT(*) FROM T_O, T_Z WHERE ST_WITHIN(T_O.the_geom, T_Z.the_geom) GROUP BY T_Z.z_id;

Point-in-Polygon Test

For each county, derive its histogram of elevations from raster cells that are in the polygon.
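The point-in-polygon test referred to above is typically the classic ray-crossing (even-odd rule) test. Below is a minimal illustrative sketch in CUDA C++; the function name pointInPolygon and the flat vertex arrays are assumptions, not the poster's actual code.

#include <cstdio>

// Ray-crossing (even-odd rule) point-in-polygon test.
// vx, vy: polygon vertex coordinates; n: number of vertices.
// Returns true if point (px, py) lies inside the polygon.
__host__ __device__
bool pointInPolygon(const float* vx, const float* vy, int n, float px, float py)
{
    bool inside = false;
    for (int i = 0, j = n - 1; i < n; j = i++) {
        // Toggle whenever the horizontal ray from (px, py) crosses edge (j, i).
        if (((vy[i] > py) != (vy[j] > py)) &&
            (px < (vx[j] - vx[i]) * (py - vy[i]) / (vy[j] - vy[i]) + vx[i]))
            inside = !inside;
    }
    return inside;
}

int main()
{
    // Toy example: unit square.
    float vx[] = {0.f, 1.f, 1.f, 0.f};
    float vy[] = {0.f, 0.f, 1.f, 1.f};
    printf("inside: %d, outside: %d\n",
           (int)pointInPolygon(vx, vy, 4, 0.5f, 0.5f),
           (int)pointInPolygon(vx, vy, 4, 1.5f, 0.5f));
    return 0;
}

Each edge check costs roughly 20 floating-point operations, which is the per-vertex constant used in the brute-force cost estimate below.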

• SRTM: 20×10^9 (billion) raster cells (~40 GB raw, ~15 GB compressed TIFF)
• Zones: 3,141 counties, 87,097 vertices

Brute-force point-in-polygon testing
• RT = (# of points) × (# of vertices) × (# of ops per point-in-polygon edge check) / (ops per second) = 20×10^9 × 87,097 × 20 / (10×10^9) ≈ 3.5×10^6 seconds ≈ 40 days
• Using all of Titan's 18,688 nodes: ~200 seconds
• FLOPS utilization is typically low: it can be <1% for data-intensive applications (typically <10% in HPC)

Hybrid Spatial Databases + HPC Approach
Observation: only points/cells that are close enough to polygons need to be tested.
Question: how do we pair neighboring points/cells with polygons?

Minimum Bounding Boxes (MBRs)

Step 1: divide a raster tile into blocks and generate per-block histograms.
Step 2: derive polygon MBRs and pair MBRs with blocks through a box-in-polygon test (inside/intersect).
Step 3: aggregate per-block histograms into per-polygon histograms for blocks that lie within polygons.
Step 4: for each intersecting polygon/block pair, perform a point(cell)-in-polygon test for all raster cells in the block and update the respective polygon histogram.

(A) Intersect (B) Inside (C) Outside
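As a rough host-side sketch of how Steps 2-4 fit together, assuming Step 2's inside/intersect/outside classification is already available; all names (BlockClass, dispatchBlocks) are illustrative, and the dense polygon×block loop is only for brevity, since the real pipeline only visits MBR-paired candidates.

#include <cstddef>
#include <utility>
#include <vector>

enum class BlockClass { Outside, Inside, Intersect };   // outcome of Step 2

// Fold per-block histograms (Step 1) into per-polygon histograms.
// "Inside" blocks are aggregated wholesale (Step 3); "Intersect" blocks are
// queued for exact per-cell point-in-polygon tests on the GPU (Step 4).
void dispatchBlocks(const std::vector<std::vector<int>>& blockHist,   // [block][bin]
                    const std::vector<std::vector<BlockClass>>& cls,  // [polygon][block]
                    std::vector<std::vector<int>>& polyHist,          // [polygon][bin]
                    std::vector<std::pair<int,int>>& intersectPairs)  // (polygon, block)
{
    for (std::size_t p = 0; p < cls.size(); ++p) {
        for (std::size_t b = 0; b < cls[p].size(); ++b) {
            if (cls[p][b] == BlockClass::Inside) {
                for (std::size_t bin = 0; bin < blockHist[b].size(); ++bin)
                    polyHist[p][bin] += blockHist[b][bin];        // Step 3
            } else if (cls[p][b] == BlockClass::Intersect) {
                intersectPairs.emplace_back((int)p, (int)b);      // defer to Step 4
            }
            // Outside blocks contribute nothing.
        }
    }
}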

GPU Implementations

Identifying parallelisms and mapping to hardware:
(1) Deriving per-block histograms
(2) Block-in-polygon test
(3) Aggregating histograms for “within” blocks
(4) Point-in-polygon test for individual cells and histogram update

(1) Per-block histogramming: one GPU thread block per raster block; histogram bins are updated with AtomicAdd.
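A minimal sketch of such a per-block histogramming kernel follows; NUM_BINS, the short elevation type, and the bin mapping are illustrative assumptions, not the poster's code.

#define NUM_BINS 256   // illustrative; the experiments allow up to 5000 bins

// One CUDA thread block builds the histogram of one raster block.
// A shared-memory histogram absorbs the AtomicAdd traffic; the finished
// per-block histogram is then written out to global memory.
__global__ void perBlockHistogram(const short* raster,   // elevation cells, block-major
                                  int cellsPerBlock,
                                  int* blockHist)         // [numRasterBlocks][NUM_BINS]
{
    __shared__ int sh[NUM_BINS];
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        sh[i] = 0;
    __syncthreads();

    const short* cells = raster + (long long)blockIdx.x * cellsPerBlock;
    for (int i = threadIdx.x; i < cellsPerBlock; i += blockDim.x) {
        int bin = ((int)cells[i] % NUM_BINS + NUM_BINS) % NUM_BINS;  // toy bin mapping
        atomicAdd(&sh[bin], 1);
    }
    __syncthreads();

    int* out = blockHist + blockIdx.x * NUM_BINS;
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        out[i] = sh[i];   // this thread block owns this histogram slice exclusively
}

It would be launched with one thread block per raster block and, e.g., 256 threads per block, as in perBlockHistogram<<<numRasterBlocks, 256>>>(dRaster, 360*360, dBlockHist), matching the parameters listed later.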

(2) Block-in-polygon test for each paired MBR/block (e.g., M1C1, M1C2, M2C3):
• Point-in-polygon test for each of the cell's 4 corners
• All-pair edge intersection tests between the polygon and the cell
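The edge intersection half of this test can use the standard orientation (cross-product sign) predicate; a sketch follows, with illustrative names, and collinear/touching cases ignored for brevity.

// Sign of the cross product (b-a) x (c-a): >0 left turn, <0 right turn, 0 collinear.
__host__ __device__ inline float orient(float ax, float ay, float bx, float by,
                                        float cx, float cy)
{
    return (bx - ax) * (cy - ay) - (by - ay) * (cx - ax);
}

// True if segment (p1, p2) properly crosses segment (q1, q2):
// the endpoints of each segment lie on opposite sides of the other segment.
__host__ __device__
bool segmentsIntersect(float p1x, float p1y, float p2x, float p2y,
                       float q1x, float q1y, float q2x, float q2y)
{
    float d1 = orient(q1x, q1y, q2x, q2y, p1x, p1y);
    float d2 = orient(q1x, q1y, q2x, q2y, p2x, p2y);
    float d3 = orient(p1x, p1y, p2x, p2y, q1x, q1y);
    float d4 = orient(p1x, p1y, p2x, p2y, q2x, q2y);
    return ((d1 > 0.f) != (d2 > 0.f)) && ((d3 > 0.f) != (d4 > 0.f));
}

If no block edge crosses any polygon edge and all four corners test inside, the block is inside (case B); if any edge crosses, it straddles the boundary (case A); otherwise it is outside (case C), with the degenerate case of a polygon wholly contained in a block needing one extra containment check.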

(4) Cell-in-polygon test and histogram update (polygon vertices vs. cells):
• Perfectly coalesced memory accesses
• Utilizes GPU floating-point power
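One way these two properties can be realized is to stage the polygon's vertices in shared memory once per thread block and have each thread test one cell via consecutive (coalesced) reads of the cell arrays. The sketch below is illustrative only: the vertex budget, bin mapping, and single-polygon layout are assumptions, and the ray-crossing loop is repeated inline (it assumes nVert <= MAX_VERT) so the snippet stands alone.

#define MAX_VERT 1024   // illustrative per-polygon vertex budget
#define NUM_BINS 256    // illustrative

// All thread blocks here work on one polygon; cellX/cellY/cellVal hold the
// cells of its "intersect" raster blocks. polyHist is that polygon's histogram.
__global__ void cellInPolygonUpdate(const float* vx, const float* vy, int nVert,
                                    const float* cellX, const float* cellY,
                                    const short* cellVal, int nCells,
                                    int* polyHist)
{
    __shared__ float svx[MAX_VERT], svy[MAX_VERT];
    for (int i = threadIdx.x; i < nVert; i += blockDim.x) {  // coalesced vertex staging
        svx[i] = vx[i];
        svy[i] = vy[i];
    }
    __syncthreads();

    for (int c = blockIdx.x * blockDim.x + threadIdx.x; c < nCells;
         c += gridDim.x * blockDim.x) {                      // coalesced cell reads
        float px = cellX[c], py = cellY[c];
        bool inside = false;
        for (int i = 0, j = nVert - 1; i < nVert; j = i++)   // ray-crossing test
            if (((svy[i] > py) != (svy[j] > py)) &&
                (px < (svx[j] - svx[i]) * (py - svy[i]) / (svy[j] - svy[i]) + svx[i]))
                inside = !inside;
        if (inside) {
            int bin = ((int)cellVal[c] % NUM_BINS + NUM_BINS) % NUM_BINS;
            atomicAdd(&polyHist[bin], 1);                    // update polygon histogram
        }
    }
}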

BPQ-Tree based raster compression to save disk I/O
• Idea: chop an M-bit raster into M binary bitmaps and then build a quadtree for each bitmap (Zhang et al. 2011)

• BPQ-Tree achieves a competitive compression ratio but is much more parallelization-friendly on GPUs
• Advantage 1: compressed data is streamed from disk to GPU without requiring decompression on CPUs, reducing CPU memory footprint and data transfer time
• Advantage 2: can be used in conjunction with CPU-based compression to further improve the compression ratio
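The first stage of the BPQ-Tree idea above, chopping an M-bit raster into M binary bitmaps, can be sketched as a simple CUDA kernel; the kernel name and the 32-cells-per-word packing are assumptions, and the per-bitmap quadtree construction is not shown.

// Split a 16-bit raster into 16 binary bitmaps, one per bit plane.
// Each thread packs 32 consecutive cells of every plane into one 32-bit word.
__global__ void extractBitplanes(const unsigned short* raster, int nCells,
                                 unsigned int* bitmaps)   // [16][ceil(nCells/32)] words
{
    int wordsPerPlane = (nCells + 31) / 32;
    int word = blockIdx.x * blockDim.x + threadIdx.x;
    if (word >= wordsPerPlane) return;

    for (int plane = 0; plane < 16; ++plane) {
        unsigned int bits = 0;
        for (int k = 0; k < 32; ++k) {
            int cell = word * 32 + k;
            if (cell < nCells && ((raster[cell] >> plane) & 1u))
                bits |= (1u << k);
        }
        bitmaps[plane * wordsPerPlane + word] = bits;      // plane-th binary bitmap
    }
}

A quadtree is then built over each bitmap; large uniform quadrants (all 0s or all 1s) collapse into single nodes, which is where most of the compression comes from.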

Experiments and Evaluations

Tile    Dimension        Partition Schema
1       54000×43200      2×2
2       50400×43200      2×2
3       50400×43200      2×2
4       82800×36000      2×2
5       61200×46800      2×2
6       68400×111600     4×4
Total   20,165,760,000 cells   36 partitions

Data Format                    Volume (GB)
Original (raw)                 38
TIFF compression               15
gzip compression               8.3
BPQ-Tree compression           7.3
BPQ-Tree + gzip compression    5.5

Single Node Configuration 1: Dell T5400 WS
• Intel Xeon E5405 dual quad-core processor (2.00 GHz), 16 GB RAM, PCI-E Gen2, 3×500 GB 7200 RPM disks with 32 MB cache ($5,000)
• Nvidia Quadro 6000 GPU, 448 Fermi cores (574 MHz), 6 GB, 144 GB/s ($4,500)

Single Node Configuration 2: Do-It-Yourself Workstation
• Intel Core i5-650 dual-core, 8 GB RAM, PCI-E Gen3, 500 GB 7200 RPM disk with 32 MB cache (recycled) ($1,000)
• Nvidia GTX Titan GPU, 2688 Kepler cores (837 MHz), 6 GB, 288 GB/s ($1,000)

Parameters:
• Raster chunk size (coding): 4096×4096
• Thread block size: 256
• Maximum histogram bins: 5000
• Raster block size: 0.1×0.1 degree = 360×360 cells (resolution is 1×1 arc-second)

GPU Cluster: OLCF Titan

Results

Single Node   Cold Cache   Hot Cache
Config 1      180 s        78 s
Config 2      101 s        46 s

Runtime breakdown (seconds)                         Quadro 6000   GTX Titan   8-core CPU
(Step 0) Raster decompression                       16.2          8.30        131.9
Step 1: Per-block histogramming                     21.5          13.4        /
Step 2: Block-in-polygon test                       0.11          0.07        /
Step 3: “Within-block” histogram aggregation        0.14          0.11        /
Step 4: Cell-in-polygon test and histogram update   29.7          11.4        /
Total of major steps                                67.7          33.3        /
Wall-clock end-to-end                               85            46          /

OLCF Titan scaling
# of nodes    1      2      4      8      16
Runtime (s)   60.7   31.3   17.9   10.2   7.6

Data and Pre-processing
