
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets

Vignesh Santhanagopalan

Graduate Student, Department of CSE


Outline

• Motivation
• Challenges Involved
• Contributions
• Background
• Overview of the System Design
• Metadata Extraction and Handling
• Pre-Processing and Post-Processing Modules
• Parallelization of our System
• Experiments
• Related Work
• Conclusion

Motivation

Scientific Data Management
• Extremely large datasets

Data-Driven Applications
• Scientific simulations
• High-precision data collection instruments
• Sensors attached to a satellite

Challenges Involved

Data exists in a variety of low-level formats
• Hard for the user to extract the subset of data
• Significant effort to understand the layout of data

More efficient access to scientific datasets is needed
• Parallel Computing

Contributions

Providing a virtual relational table view over HDF5 datasets

Allowing users to specify queries using SQL statements

Supporting queries which are based on the dimensions of the dataset

Supporting queries which are based on the dimensions and attributes of the dataset

Background-HDF5

Hierarchical Data Format is the name of a set of file formats and libraries designed to store and organize large amounts of numerical data

Stores the data in a tree-like structure

Provides organization by dividing the structure into groups, datasets, and attributes

Structure of HDF5 file

Parallel HDF5

Allows users to exploit parallelism to improve I/O performance

Provides a standard parallel I/O interface and supports MPI programming

Opens a file in parallel using a communicator

Collective parallel access to a file is coordinated by all processes

Our System

Supports SQL-like data subsetting with a virtualized view of HDF5 datasets
• Metadata Extraction and Handling
• Pre-processing and Post-processing Modules

Parallel I/O optimizations with Data Virtualization
• MPI
• Query Partition

Query Structure

Support SQL like abstraction with virtualized view of HDF5 datasets

SELECT <Dataset variables>
FROM <Dataset name>
WHERE <Expression List>
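A minimal sketch of how a query of this shape could be split into its three parts. This is not the system's SQL parser: the regular expressions, the restriction to AND-joined numeric comparisons, and the dimension name `dim1` are all assumptions made for illustration.

```python
import re

# Assumed grammar: SELECT <vars> FROM <dataset> [WHERE <cond> AND <cond> ...]
QUERY_RE = re.compile(
    r"^\s*SELECT\s+(?P<select>.+?)\s+FROM\s+(?P<source>[^\s;]+)"
    r"(?:\s+WHERE\s+(?P<where>.+?))?\s*;?\s*$",
    re.IGNORECASE | re.DOTALL,
)
# One condition: <name> <operator> <number>
COND_RE = re.compile(r"^\s*(\w+)\s*(<=|>=|!=|=|<|>)\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_query(sql):
    """Split a subsetting query into variables, dataset name, and conditions."""
    match = QUERY_RE.match(sql)
    if match is None:
        raise ValueError("not a SELECT ... FROM ... [WHERE ...] query")
    variables = [v.strip() for v in match.group("select").split(",")]
    conditions = []
    where = match.group("where")
    if where:
        for part in re.split(r"\s+AND\s+", where, flags=re.IGNORECASE):
            cond = COND_RE.match(part)
            if cond is None:
                raise ValueError(f"unsupported condition: {part!r}")
            name, op, value = cond.groups()
            conditions.append((name, op, float(value)))
    return {"select": variables, "from": match.group("source"), "where": conditions}


# "dim1" is a hypothetical name for the first dimension of the dataset.
parsed = parse_query(
    "SELECT SolarZenithAngle FROM ColumnAmount03 "
    "WHERE dim1 >= 100 AND dim1 < 200 AND SolarZenithAngle > 45.5"
)
```

Here `parsed["where"]` mixes two conditions on a dimension with one on a data value, which is the distinction the pre-processing and post-processing steps rely on.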

Pre-Processing and Post-Processing Queries

System Design

[Figure: system architecture. Components shown: SQL query input; Master Process (SQL parser, Metadata descriptor, Pre-Processing Module, Query Partition, Post-Processing Module); Slave Processes (Data Access Code, Parallel HDF5); HDF5 Dataset.]

Main Steps of Our System (1/2)

Input: SQL query

Output: Necessary subset of data to the user

Process:
• For every HDF5 dataset, a metadata descriptor is generated
• An SQL parser is used to parse the SQL query to retrieve the grammar information
• Variables and dimensions from the WHERE expression of the SQL query are retrieved

Main Steps of Our System (2/2)

By evaluating the parse tree and the metadata information, a query request is generated

The data size is computed from the generated query request

Query-Partitioning module divides the query request into several sub-requests

The data results are obtained by each node based on the sub-request
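The size computation and the partitioning step can be sketched as follows. This assumes a query request is a hyperslab given as per-dimension `(start, count)` tuples and that the split is made along the first dimension only; the slides do not state the actual partitioning strategy, and the function names are hypothetical.

```python
from math import prod


def request_size(count, itemsize):
    """Bytes selected by a request with the given per-dimension counts."""
    return prod(count) * itemsize


def partition_request(start, count, num_workers):
    """Split a (start, count) request into contiguous sub-requests along dimension 0."""
    base, extra = divmod(count[0], num_workers)
    sub_requests = []
    offset = start[0]
    for worker in range(num_workers):
        # The first `extra` workers take one additional row each.
        n = base + (1 if worker < extra else 0)
        if n == 0:
            continue  # more workers than rows: the remaining workers stay idle
        sub_requests.append(((offset, *start[1:]), (n, *count[1:])))
        offset += n
    return sub_requests


# Rows 100..609 of a 720 x 1440 dataset of 4-byte floats, split over 4 workers.
start, count = (100, 0), (510, 1440)
size_bytes = request_size(count, itemsize=4)
subs = partition_request(start, count, num_workers=4)
```

Splitting along the slowest-varying dimension keeps each sub-request contiguous in a row-major layout, which is one plausible reason to partition this way.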

HDF5 File Organization

Organizes data as a collection of objects such as groups, datasets, and attributes

Groups provide logical structuring of data

Datasets contain multi-dimensional arrays of data elements
• Dataspace
• Datatype

Attributes

Metadata Extraction and Handling

For every HDF5 dataset, a metadata descriptor is generated

Metadata information for each dataset:
• Information to interpret data: Datatype
• Information to describe the logical layout of data: Dataspace
• Information about attributes attached to a dataset

Metadata Extraction Example

Datatype: Integer

Dataspace: number of dimensions and size of each dimension
• Number of dimensions: 3
• Size of dimension1: 100
• Size of dimension2: 200
• Size of dimension3: 300

Attributes
• Temperature
• Velocity
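One way to hold this example in memory is a small record type. The class and field names below are hypothetical, the path is a placeholder, and the 4-byte element size assumes a 32-bit integer, which the slide does not specify.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class MetadataDescriptor:
    """Illustrative per-dataset descriptor (not the system's actual format)."""
    path: str
    datatype: str
    itemsize: int           # bytes per element
    dims: tuple             # size of each dimension
    attributes: tuple = ()  # names of attributes attached to the dataset

    @property
    def rank(self):
        """Number of dimensions."""
        return len(self.dims)

    @property
    def num_elements(self):
        return prod(self.dims)

    @property
    def num_bytes(self):
        return self.num_elements * self.itemsize


desc = MetadataDescriptor(
    path="/example/dataset",  # placeholder, not from the slides
    datatype="Integer",
    itemsize=4,               # assumption: 32-bit integer
    dims=(100, 200, 300),
    attributes=("Temperature", "Velocity"),
)
```

With the dimensions above, `desc.num_elements` is 6,000,000, so a full read would move 24,000,000 bytes under the 4-byte assumption.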

Metadata Extraction and Handling

For each group, information regarding the datasets it contains must be extracted

Can be imagined as a table
• Row: group
• Columns: all the datasets it contains

Mapping between the dataset variables and the group

Information regarding attributes stored for each dataset
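The group table and the variable-to-group mapping can be sketched with plain dictionaries. The input is assumed to be a list of full dataset paths; `ViewingZenithAngle` is a hypothetical second dataset added so the row has more than one column.

```python
from collections import defaultdict
import posixpath


def build_group_table(dataset_paths):
    """Map each group path (row) to the sorted dataset names it contains (columns)."""
    table = defaultdict(list)
    for path in dataset_paths:
        group, name = posixpath.split(path)
        table[group].append(name)
    return {group: sorted(names) for group, names in table.items()}


def variable_to_group(table):
    """Invert the table: dataset variable name -> group that holds it.

    Assumes dataset names are unique across groups; a later group would
    otherwise overwrite an earlier one.
    """
    return {name: group for group, names in table.items() for name in names}


paths = [
    "/HDFEOS/GRIDS/ColumnAmount03/Data Fields/SolarZenithAngle",
    "/HDFEOS/GRIDS/ColumnAmount03/Data Fields/ViewingZenithAngle",  # hypothetical
]
table = build_group_table(paths)
lookup = variable_to_group(table)
```

The inverted mapping lets a variable named in a SELECT clause be resolved to its full path without the user knowing the group hierarchy.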

Example HDF5 File

GROUP "/" {
   GROUP "HDFEOS" {
      GROUP "GRIDS" {
         GROUP "ColumnAmount03" {
            GROUP "Data Fields" {
               DATASET "SolarZenithAngle" {
                  DATATYPE H5T_IEEE_F32LE
                  DATASPACE SIMPLE { ( 720, 1440 ) / ( 720, 1440 ) }
                  DATA { }
                  ATTRIBUTE "_FillValue" {
                     DATATYPE H5T_IEEE_F32LE
                     DATASPACE SIMPLE { ( 1 ) / ( 1 ) }
                     DATA { }
                  }

Path Information

For dataset SolarZenithAngle the path is
/HDFEOS/GRIDS/ColumnAmount03/Data Fields/SolarZenithAngle

For attribute _FillValue the path is
/HDFEOS/GRIDS/ColumnAmount03/Data Fields/SolarZenithAngle/_FillValue

Dataspace and datatype for SolarZenithAngle
• Datatype: Float
• Number of dimensions: 2
• Dimension size: 720 x 1440

Information about the attribute _FillValue
• Datatype: Float
• Array size: 1

Pre-Processing and Post-Processing Modules

Two different types of queries
• Queries based on dimensions
• Queries based on attributes as well

First type of query is supported by the HDF5 API, but requires
• Complete understanding of the layout of data
• Separate programs to retrieve each subset of data

Second type of query
• No direct support
• Requires detailed knowledge of the datasets, the HDF5 API, and complex programming

Pre-Processing and Post-Processing Modules

Pre-Processing Module:
• Inputs:
  – SQL grammar
  – Metadata
• Filtering is done based on dimensions of the dataset

Post-Processing Module:
• Handles queries based on the attributes
• Manually filters the data to retrieve the necessary subset
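The split between the two modules can be sketched as two functions: one turns dimension conditions into a `(start, count)` region before any data is read, the other filters the values that come back. Treating an attribute condition as a predicate on data values is an assumption here, as are the function names and the small in-memory block that stands in for a real read.

```python
def dimension_hyperslab(dims, conditions):
    """Pre-processing: reduce per-dimension range conditions to (start, count).

    `conditions` holds (dimension index, operator, integer bound) triples.
    """
    lo = [0] * len(dims)
    hi = list(dims)  # exclusive upper bounds
    for axis, op, bound in conditions:
        if op == ">=":
            lo[axis] = max(lo[axis], bound)
        elif op == ">":
            lo[axis] = max(lo[axis], bound + 1)
        elif op == "<":
            hi[axis] = min(hi[axis], bound)
        elif op == "<=":
            hi[axis] = min(hi[axis], bound + 1)
        elif op == "=":
            lo[axis], hi[axis] = max(lo[axis], bound), min(hi[axis], bound + 1)
        else:
            raise ValueError(f"unsupported operator: {op}")
    count = tuple(max(0, h - l) for l, h in zip(lo, hi))
    return tuple(lo), count


def filter_by_value(rows, start, predicate):
    """Post-processing: keep (global index, value) pairs that pass `predicate`."""
    kept = []
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if predicate(value):
                kept.append(((start[0] + i, start[1] + j), value))
    return kept


# Dimension conditions: 2 <= dim0 < 4 and dim1 < 3 on a 720 x 1440 dataset.
start, count = dimension_hyperslab((720, 1440), [(0, ">=", 2), (0, "<", 4), (1, "<", 3)])
# Stand-in for the 2 x 3 block a real read of that region would return.
block = [[10.0, 50.0, 30.0], [70.0, 20.0, 90.0]]
hits = filter_by_value(block, start, lambda v: v > 45.5)
```

Only the value filter has to touch the data, so narrowing the region first keeps the amount read as small as the dimension conditions allow.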

Parallelization

Parallel HDF5 has an MPI-I/O layer on top of HDF5

API support for parallel access through message passing

Collective I/O calls for shared access to a file

Parallelization

Master-Slave approach with Parallel HDF5 processing

Master Process:
• Parses the SQL query given by the user
• Generates the data subsetting request
• Partitions the request into several sub-requests
• Also performs post-processing

Slave Processes:
• Receive sub-requests from the master process
• Query a data chunk by accessing the HDF5 file in parallel and obtain the data results
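The division of labor can be imitated on a single machine. This sketch swaps in a thread pool for MPI processes and an in-memory list for the HDF5 file, so it shows only the control flow (partition, dispatch, merge, post-process), not Parallel HDF5 I/O; all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory stand-in for a 2-D dataset: value = 10 * row + column.
DATA = [[10 * r + c for c in range(4)] for r in range(8)]


def split_rows(first_row, num_rows, num_workers):
    """Master: partition a row range into contiguous (first_row, num_rows) sub-requests."""
    base, extra = divmod(num_rows, num_workers)
    subs, offset = [], first_row
    for w in range(num_workers):
        n = base + (1 if w < extra else 0)
        if n:
            subs.append((offset, n))
            offset += n
    return subs


def worker_read(sub_request):
    """Worker: read only the rows named by its sub-request."""
    first_row, num_rows = sub_request
    return DATA[first_row:first_row + num_rows]


def run_query(first_row, num_rows, num_workers, predicate):
    """Master: partition, dispatch, merge in request order, then post-process."""
    subs = split_rows(first_row, num_rows, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        chunks = list(pool.map(worker_read, subs))  # map preserves input order
    merged = [row for chunk in chunks for row in chunk]
    return [v for row in merged for v in row if predicate(v)]


# Rows 1..6, four workers, keep the values from column 3.
result = run_query(first_row=1, num_rows=6, num_workers=4, predicate=lambda v: v % 10 == 3)
```

In an MPI setting the dispatch would be a send of each sub-request and the merge a gather at the master; the ordering guarantee that `pool.map` gives for free would have to come from rank order.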

Experiments

Experimental Goals:
• To evaluate our system with different types of queries on Parallel HDF5
• To show the performance improvement of the parallel version over sequential subsetting
• To show our system’s capability on larger datasets
• To show the parallel scalability of our system

Experimental Setup

Dataset Used:
• Ozone Monitoring Instrument from NASA website
• Size available for download: 6.5 MB
• Extended it to 500 MB, 1 GB, 2 GB and 4 GB

Implementation execution environment:
• IBM Opteron Cluster
• Each compute node has
  – Dual core 2.3 GHz Opterons
  – 8 GB memory

Performance Comparison of Sequential and Parallel Versions (4 processors)

[Figures: one plot per dataset size: 500 MB, 1 GB, 2 GB, 4 GB]

Parallel Scalability of our System

[Figures: one plot per dataset size: 500 MB, 1 GB, 2 GB, 4 GB]

Related Work

Li Weng et al. provided an automatic data virtualization approach seven years earlier

SciDB provides a scientific database in which arrays are the natural way of storing data

Beomseok Nam et al. provide an indexing scheme for efficient retrieval of subsets of data
• No notion of data virtualization or use of parallel computing

A lot of work exists on extending relational database technology to support scientific data

Conclusion

Provide a data management approach for scientific datasets stored in HDF5

Support for SQL queries over a virtual view of data

Parallelize queries based on dimensions and also on attributes

Significant performance improvement over sequential subsetting

System scales well with varying number of nodes and different data sizes

Thank You!
