ICPP 2012
Indexing and Parallel Query Processing Support for Visualizing
Climate Datasets
Yu Su*, Gagan Agrawal*, Jonathan Woodring†
*The Ohio State University
†Los Alamos National Laboratory
Outline
• Motivation and Introduction
• Background
• System Overview and Optimization
• Experiment
• Conclusion
Motivation
• Science is becoming increasingly data driven
• Strong desire for efficient data visualization
• Challenges:
  – Fast data generation speed
  – Slow disk I/O and network speed
  – Worse performance during visualization
  – Different kinds of subsetting requests
• Difficult and unnecessary to visualize all the data
Data Subsetting in ParaView
• A widely used data analysis and visualization application
• Problems: Load + Filter mode
  – Load the entire data set
  – Data filtering at the visualization level
    • Threshold Filter: based on values
    • Extract Subset Filter: based on dimension info
  – Grid transformation needed during filtering
    • Regular Structured Grid -> Unstructured Grid
A Faster Solution
• Subset at the I/O level
  – User specifies the subset in one query for both dimension and value ranges
  – Reduced I/O time and memory footprint
• SQL queries in ParaView
  – Query over dimensions: API support
  – Query over values: indexing
• Bitmap Indices and Parallel Bitmap Indices
  – Efficient subsetting over values
Background: Bitmap Indexing
• FastBit: widely used in scientific data management
• Suitable for floating-point values, which are binned into small ranges
• Run-length compression (WAH, BBC)
  – Compresses each bitvector by encoding runs of consecutive 0s or 1s
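As a rough illustration of the two ideas on this slide, the sketch below builds one bitvector per value bin and then collapses each bitvector into runs. It is not FastBit's implementation and does not reproduce the WAH or BBC word formats; the function names, the equal-width binning, and the sample values are all assumptions made for the example.

```python
def build_binned_bitmaps(values, lo, hi, nbins):
    """Return one bitvector (list of 0/1) per equal-width bin over [lo, hi].

    Bit i of bin b is 1 when values[i] falls into bin b.
    Assumes every value lies within [lo, hi].
    """
    width = (hi - lo) / nbins
    bitmaps = [[0] * len(values) for _ in range(nbins)]
    for i, v in enumerate(values):
        b = min(int((v - lo) / width), nbins - 1)  # clamp v == hi into the last bin
        bitmaps[b][i] = 1
    return bitmaps


def run_length_encode(bits):
    """Collapse consecutive equal bits into [bit, run_length] pairs.

    Plain run-length encoding, used here only as a stand-in for WAH/BBC.
    """
    runs = []
    for bit in bits:
        if runs and runs[-1][0] == bit:
            runs[-1][1] += 1
        else:
            runs.append([bit, 1])
    return runs


# Hypothetical sample data: eight floating-point values binned into [0,1), [1,2), [2,3), [3,4].
temps = [0.5, 0.7, 0.9, 2.1, 2.4, 2.2, 0.1, 3.9]
bitmaps = build_binned_bitmaps(temps, 0.0, 4.0, 4)
compressed = [run_length_encode(b) for b in bitmaps]
```

Bins that match long stretches of neighboring elements, or none at all, compress to only a few runs, which is why run-length schemes suit smoothly varying scientific data.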
Bitmap Index and Dim Subset
• Run-length compression (WAH, BBC)
  – Good: compression rate, fast bitwise operations
  – Bad: the ability to locate a dim subset is lost
• Two traditional methods:
  – With bitmap indices: post-filter on dim info
  – Without bitmap indices: post-filter on values
• Two-phase optimization:
  – Index generation: distributed indices over sub-blocks
  – Index retrieval: transform dim subsetting info into bitvectors, and support fast bitwise operations
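The index-retrieval idea can be sketched as follows: the dimension ranges of a query are turned into a bitvector of their own, which is then ANDed with the value bitvector instead of post-filtering. This is a minimal sketch for a 2-D row-major array using Python integers as uncompressed bitvectors; the helper names and sample data are hypothetical, and the value bitvector is built by a scan here only to stand in for what a bitmap index would return.

```python
def dim_subset_bitvector(shape, ranges):
    """Encode a 2-D dimension subset as an integer bitvector.

    Bit i is set when the element at row-major offset i lies inside
    the half-open (lo, hi) range of every dimension.
    """
    _, cols = shape
    (r_lo, r_hi), (c_lo, c_hi) = ranges
    bits = 0
    for r in range(r_lo, r_hi):
        for c in range(c_lo, c_hi):
            bits |= 1 << (r * cols + c)
    return bits


def selected_offsets(bits):
    """List the offsets whose bit is set, in ascending order."""
    return [i for i in range(bits.bit_length()) if (bits >> i) & 1]


# Hypothetical 3 x 4 array stored in row-major order.
values = [1, 7, 3, 9,
          8, 2, 6, 4,
          5, 9, 1, 7]

# Stand-in for the bitvector a bitmap index would return for "value >= 5".
value_bits = 0
for i, v in enumerate(values):
    if v >= 5:
        value_bits |= 1 << i

# Dimension subset: rows 0..1 and columns 1..3.
dim_bits = dim_subset_bitvector((3, 4), ((0, 2), (1, 4)))

# A single bitwise AND answers the combined value + dimension query.
hits = selected_offsets(value_bits & dim_bits)
```

Because both conditions end up as bitvectors, the combined query needs no pass over the raw data values.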
System Overview
Parse the SQL expression
Parse the metadata file
Generate Query Request
Generate the index if it does not exist yet; retrieve from the index afterwards.
Optimization 1: Distributed Index Generation
Study the relationship between queries and partitions.
Partition the data based on query preference.
Index Partition Strategy
• α rate: participation rate of data elements
  – Number of elements involved in indexing / total data size
  – Worst: all elements have to be involved
  – Ideal: the elements involved are exactly those in the dim subset
• Partition strategies:
  – Strategy 1: α is proportional to the dim subsetting percentage and inversely proportional to the number of partitions.
  – Strategy 2: In general cases, where subsetting over each dimension has a similar probability, the partition should have equal preference over each dim.
  – Strategy 3: If queries only include a subset of dims, the partition should also be based on these dims.
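To make the α rate concrete, the sketch below computes it for a regular grid that is cut into sub-blocks. It assumes that every sub-block overlapping the dimension subset must participate in full, which is one plausible reading of "elements involved in indexing"; the function names and the 8 x 8 example are hypothetical.

```python
from itertools import product


def split(n, parts):
    """Split range(n) into `parts` contiguous half-open intervals of near-equal size."""
    bounds = [n * i // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def alpha_rate(shape, parts_per_dim, query):
    """Fraction of all elements that live in sub-blocks overlapping the query.

    shape:         size of each dimension
    parts_per_dim: number of partitions along each dimension
    query:         half-open (lo, hi) range per dimension
    """
    total = 1
    for n in shape:
        total *= n
    per_dim = [split(n, p) for n, p in zip(shape, parts_per_dim)]
    touched = 0
    for block in product(*per_dim):
        overlaps = all(lo < q_hi and q_lo < hi
                       for (lo, hi), (q_lo, q_hi) in zip(block, query))
        if overlaps:
            size = 1
            for lo, hi in block:
                size *= hi - lo
            touched += size
    return touched / total


# Hypothetical 8 x 8 array; the query subsets only the first dimension (25% of the data).
shape = (8, 8)
query = ((0, 2), (0, 8))
along_queried_dim = alpha_rate(shape, (4, 1), query)  # partition on the queried dim
along_other_dim = alpha_rate(shape, (1, 4), query)    # partition on the other dim
balanced = alpha_rate(shape, (2, 2), query)           # equal preference over both dims
```

With four partitions in each case, splitting along the queried dimension reaches the ideal α of 0.25, splitting along the other dimension gives the worst case of 1.0, and the balanced split lands at 0.5, which is the trade-off Strategies 2 and 3 describe.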
Optimization 2: Index Retrieval
Post-filter?
Parallel Index Architecture
L1: data file
L2: variable
L3: data block
Experiment Setup
• Goals:
  – SQL subsetting vs. Load + Filter in ParaView
  – Scalability of the parallel indexing method
  – Indexing and partition strategy vs. FastQuery
• Dataset:
  – Parallel Ocean Program
  – Data size: 33.6 GB
  – Data format: NetCDF (array-based)
• Environment:
  – IBM Xeon cluster, 8 cores, 2.53 GHz
  – 12 GB memory
Efficiency Comparison with Filtering in ParaView
• Data size: 5.6 GB
• Input: 400 queries
• Depends on subset percentage
• The general index method is better than filtering when the data subset is < 60%
• Two-phase optimization achieved a 0.71x to 11.17x speedup compared with the filtering method
Index m1: Bitmap Indexing, no optimization
Index m2: Use bitwise operation instead of post-filtering
Index m3: Use both bitwise operation and index partition
Filter: load all data + filter
Memory Comparison with Filtering in ParaView
• Data size: 5.6 GB
• Input: 400 queries
• Depends on subset percentage
• The general index method has a much smaller memory cost than the filtering method
• Two-phase optimization adds only a small extra memory cost
Index m1: Bitmap Indexing, no optimization
Index m2: Use bitwise operation instead of post-filtering
Index m3: Use both bitwise operation and index partition
Filter: load all data + filter
Scalability with Different Proc#
• Data size: 8.4 GB
• Proc#: 6, 24, 48, 96
• Input: 100 queries
• X axis: subset percentage
• Y axis: time
• Each process takes care of one sub-block
• Good scalability as the number of processes increases
Alpha Rate with Different Proc#
• Data size: 8.4 GB
• Proc#: 6, 24, 48, 96
• Input: 100 queries
• X axis: subset percentage
• Y axis: alpha rate
• More processes means more index partitions
• Good participation rate when selecting a smaller percentage of the data
Alpha Rate and IO Access Times Comparison with FastQuery
• FastQuery:
  – Builds a relational table view over the scientific dataset
  – Difference: does not consider multi-dimensional data features
• Data size: 8.4 GB, 48 processes
• Query type: value + 1st dim, value + 2nd dim, value + 3rd dim, overall
• Input: 100 queries for each query type
Efficiency Comparison with FastQuery
• Data size: 8.4 GB
• Proc#: 48
• Input: 100 queries for each query type
• Achieved a 1.41x to 2.12x speedup compared with FastQuery
Conclusion
• Big data issue in data analysis and visualization
• Find the exact data subset at the I/O level with an SQL interface and bitmap indexing
• A good speedup compared with the filtering method
• Data partition strategy and parallel indexing
• A good speedup compared with FastQuery
Thanks