ohio state university department of computer science and engineering an approach for automatic data...

HPDC 04 talk
Ohio State University
Motivating Applications
Cancer Studies using MRI
Telepathology with Digitized Slides
Satellite Data Processing
Virtual Microscope
…
Opportunity and Issues
Can enable sharing of data in an unprecedented way
Access mechanisms for remote repositories
Complex low-level formats make accessing and processing of data difficult
Main desired functionality
Current Approaches
Good! But is it too heavyweight for read-mostly scientific data?
Manual implementation based on low-level datasets
Need detailed understanding of low-level formats
HDF5, NetCDF, etc
BinX, BFD, DFDL
Machine-readable descriptions, but the application depends on a specific layout
Data Virtualization
dataset
A Data Virtualization describes an abstract view of data.
A Data Service implements the mechanism to access and process data through the Data Virtualization.
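As a minimal sketch of the two definitions above: the raw bytes play the role of the low-level dataset, the list of named rows is the abstract view, and the functions that decode and filter the bytes act as the data service. The record layout (four little-endian 32-bit floats) and the names `scan` and `select` are assumptions made for illustration, not part of the system described in the talk.

```python
import struct
from typing import Callable, Dict, Iterator, List

# Hypothetical low-level layout: four little-endian 32-bit floats per record.
FIELDS = ("X", "Y", "Z", "SOIL")
RECORD = struct.Struct("<4f")

def scan(raw: bytes) -> Iterator[Dict[str, float]]:
    """Decode raw bytes into rows of the abstract (virtual table) view."""
    for values in RECORD.iter_unpack(raw):
        yield dict(zip(FIELDS, values))

def select(raw: bytes, predicate: Callable[[Dict[str, float]], bool]) -> List[Dict[str, float]]:
    """Return the rows of the virtual table that satisfy the predicate."""
    return [row for row in scan(raw) if predicate(row)]

# Two records packed the way a flat file might store them.
raw = RECORD.pack(0.0, 0.0, 0.0, 0.25) + RECORD.pack(1.0, 2.0, 3.0, 0.75)
rows = select(raw, lambda r: r["SOIL"] > 0.7)
```

The caller only sees attribute names and values; the byte layout stays hidden behind `scan`.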
Our Approach: Automatic Data Virtualization
Automatically create data services
A new application of compiler technology
A meta-data descriptor describes the layout of data on a repository
An abstract view is exposed to the users
This paper:
Outline
Introduction
Motivation
Experimental results
Related work
System Overview
STORM Runtime System
A middleware to support data selection, data partitioning, and data transfer operations on flat-file datasets hosted on a parallel system.
Services
Outline
Introduction
Motivation
Related work
Scientific datasets
Large volume
Distributed datasets
Multi-dimensional datasets
Filtering attributes
Design a Meta-data Description Language
Requirements
Specify the relationship of a dataset to the virtual dataset schema
Describe the dataset physical layout within a file
Describe the dataset distribution on nodes of one or more clusters
Specify the subsetting index attributes
Easy to use for data repository administrators and also convenient for our code generation
Design Overview
An Example
The dataset comprises several simulations on the same grid.
For each realization and each grid point, a number of attributes are stored.
The dataset is stored on a 4 node cluster.
Component I: Dataset Schema Description
[IPARS] //{* Dataset schema name *}
TIME = int
X = float
Y = float
Z = float
SOIL = float
SGAS = float
[IparsData] //{* Dataset name *}
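A schema component of this shape can be read with a few lines of code. The sketch below assumes the syntax is exactly what the slide shows: a bracketed section name, `NAME = type` lines, and trailing `//` comments. The function name `parse_schema` is hypothetical, and this is not the parser used by the tool.

```python
import re
from typing import Dict, Optional

COMMENT = re.compile(r"//.*$")                # trailing //{* ... *} comments
SECTION = re.compile(r"^\[(\w+)\]$")          # e.g. [IPARS]
FIELD = re.compile(r"^(\w+)\s*=\s*(\w+)$")    # e.g. TIME = int

def parse_schema(text: str) -> Dict[str, Dict[str, str]]:
    """Map each bracketed section name to its attribute -> type table."""
    schemas: Dict[str, Dict[str, str]] = {}
    current: Optional[Dict[str, str]] = None
    for line in text.splitlines():
        line = COMMENT.sub("", line).strip()
        if not line:
            continue
        section = SECTION.match(line)
        if section:
            current = schemas.setdefault(section.group(1), {})
            continue
        field = FIELD.match(line)
        if field and current is not None:
            current[field.group(1)] = field.group(2)
        else:
            raise ValueError(f"unrecognized line: {line!r}")
    return schemas

descriptor = """
[IPARS] // {* Dataset schema name *}
TIME = int
X = float
Y = float
Z = float
SOIL = float
SGAS = float
"""
schema = parse_schema(descriptor)
```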
Data Layout Description Component
An Example
Oil Reservoir Management
Use LOOP keyword for capturing the repetitive structure within a file.
The grid has 4 partitions (0~3).
“IparsData” comprises “ipars1” and “ipars2”. “ipars1” describes the data files in which the spatial coordinates are stored; “ipars2” describes the data files in which the other attributes are stored.
Component III: Dataset Layout Description
DATASET "IparsData" { //{* Name for Dataset *}
DATATYPE { IPARS } //{* Schema for Dataset *}
DATAINDEX { REL TIME }
DATASET "ipars1" {
X Y Z
} //{* end of DATASET "ipars1" *}
SOIL SGAS
$DIRID = 0:3:1 }
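The `$DIRID = 0:3:1` line reads as a `start:end:step` range. Since the slide says the grid has 4 partitions (0~3), the sketch below assumes the end value is inclusive. The helper name `expand_range` is hypothetical.

```python
from typing import List

def expand_range(spec: str) -> List[int]:
    """Expand 'start:end:step' into the inclusive list of integers."""
    start, end, step = (int(part) for part in spec.split(":"))
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(start, end + 1, step))

dir_ids = expand_range("0:3:1")
```

Under that assumption, `dir_ids` holds one entry per partition directory.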
Automatic Virtualization Using Meta-data
Our tool parses the meta-data descriptor and generates function code.
At run time, the query supplies the parameters used to invoke the generated functions, which create Aligned File Chunks.
[Figure: a dataset root with child datasets 1, 2, and 3]
Compiler Analysis
Meta-data descriptor
Create AFC
Process AFC
Find_File_Groups {
  Let S be the set of files that match against the query
  Classify files in S by the set of attributes they have
  Let S1, …, Sm be the m sets
  T = Ø
  foreach {s1, …, sm} with si in Si {
    If the values of implicit attributes are consistent {
      T = T ∪ {s1, …, sm}
    }
  }
  foreach Aligned File Chunk {
  }
}
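The file-grouping step can be sketched in runnable form as follows. The sketch assumes that one candidate group is formed by picking one file from each attribute class (a Cartesian product), and that "implicit attributes" are values such as the directory id or realization number implied by a file's path. `FileInfo`, `consistent`, and `find_file_groups` are hypothetical names, not the generated code.

```python
from itertools import product
from typing import Dict, FrozenSet, List, NamedTuple, Tuple

class FileInfo(NamedTuple):
    path: str
    attributes: FrozenSet[str]   # attributes stored in the file
    implicit: Dict[str, int]     # implicit attribute values, e.g. from the path

def consistent(group: Tuple[FileInfo, ...]) -> bool:
    """True if no implicit attribute takes two different values in the group."""
    seen: Dict[str, int] = {}
    for f in group:
        for name, value in f.implicit.items():
            if seen.setdefault(name, value) != value:
                return False
    return True

def find_file_groups(matching: List[FileInfo]) -> List[Tuple[FileInfo, ...]]:
    """Classify files by attribute set, then keep the consistent combinations."""
    classes: Dict[FrozenSet[str], List[FileInfo]] = {}
    for f in matching:
        classes.setdefault(f.attributes, []).append(f)
    if not classes:
        return []
    return [combo for combo in product(*classes.values()) if consistent(combo)]

coords = frozenset({"X", "Y", "Z"})
data = frozenset({"SOIL", "SGAS"})
files = [
    FileInfo("DIR0/COORD0", coords, {"DIRID": 0, "REL": 0}),
    FileInfo("DIR0/DATA0", data, {"DIRID": 0, "REL": 0}),
    FileInfo("DIR0/COORD1", coords, {"DIRID": 0, "REL": 1}),
    FileInfo("DIR0/DATA1", data, {"DIRID": 0, "REL": 1}),
]
groups = find_file_groups(files)
group_paths = [tuple(f.path for f in g) for g in groups]
```

Of the four possible pairings, only the two whose REL values agree survive the consistency check.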
An Example
Consider a query for selecting a subset with REL values of 0 and 1, TIME from 1 to 100.
Exclude DATA2, DATA3
Exclude COORD2, COORD3
DIR[k]/{COORD0, DATA0} DIR[k]/{COORD1, DATA1}
Create 100 Aligned File Chunks for each file group
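The counts in this example can be checked with a short enumeration. The sketch assumes that `k` ranges over the four directories given by `$DIRID = 0:3:1` and that one Aligned File Chunk is created per file group and time step. A chunk is represented here only by its (file group, TIME) key, not by real offsets and lengths.

```python
from itertools import product

dir_ids = range(4)          # assumed: DIR[0]..DIR[3], from $DIRID = 0:3:1
rel_values = (0, 1)         # REL values selected by the query
time_steps = range(1, 101)  # TIME from 1 to 100, inclusive

# One file group per (directory, realization) pair.
file_groups = [
    (f"DIR[{k}]/COORD{rel}", f"DIR[{k}]/DATA{rel}")
    for k, rel in product(dir_ids, rel_values)
]

# One chunk key per file group and time step.
chunks = [(group, t) for group in file_groups for t in time_steps]
```

Under these assumptions the query touches 8 file groups and produces 800 chunk keys in total.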
Outline
Introduction
Motivation
Related work
Experimental Setup & Design
A Linux cluster connected via switched Fast Ethernet.
Each node has a PIII 933MHz CPU, 512 MB of main memory, and three 100GB IDE disks.
Three sets of experiments:
Test the ability of our code generation tool
Layout0 - original layout from the application collaborators
Layout1 - all data stored as a table in a file
Layout2 - all data in a file, with each attribute stored as an array
Layout3 - Layout1 split into multiple files based on the value of the time step
Layout4 - like Layout3, but each attribute is stored as an array in each data file
Layout5 - data stored in 7 files: the first file holds the spatial coordinates, and the other attributes are divided among 6 files
Layout6 - like Layout5, but each attribute is stored as an array in each data file
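The difference between the table layout (Layout1) and the per-attribute array layout (Layout2) comes down to how the byte offset of a value is computed. The sketch below assumes fixed-size 4-byte values and no file header. Both are assumptions, as the slides do not give the actual record sizes.

```python
VALUE_SIZE = 4  # assumed: every attribute value occupies 4 bytes

def offset_table(record: int, attr: int, num_attrs: int) -> int:
    """Table layout: all attributes of one record are stored contiguously."""
    return (record * num_attrs + attr) * VALUE_SIZE

def offset_arrays(record: int, attr: int, num_records: int) -> int:
    """Array layout: all values of one attribute are stored contiguously."""
    return (attr * num_records + record) * VALUE_SIZE

# Attribute 1 of record 2, in a file with 5 attributes and 100 records.
table_offset = offset_table(2, 1, num_attrs=5)
array_offset = offset_arrays(2, 1, num_records=100)
```

A code generator that handles both layouts only needs to swap the offset function; the query logic above it can stay the same.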
Test the ability of our code generation tool
Oil Reservoir Management
The performance difference relative to Layout 0 is within 4%~10%.
Correctly and efficiently handle a variety of different layouts for the same data
Chart7
Execution Time (seconds)

Vary the size of queries, vary the parameters of griddb
Small Query, Large Query; usechunks=true, usechunks=false

Configure:
FFHOSTS = osumed02.epn.osc.edu
DPHOSTS = osumed02
DMHOSTS = osumed02
CLIENTHOSTS = osumed02
Query:
select * from bh where X>=0 and X<=7680 and Y>=1137 and Y<=3413 and Z>=0 and Z<=10;

Configure:
DPNAME = ???
Query: 8 * (76347*(2*4+20*8)) = 48MB
Select * from bh where RID in (0, 6, 26, 27) and TIME in [1001, 1003] and solid>0.7 and speed(OILVX, OILVY, OILVZ)<30.0;

DPNAME        DataPartitioner   IparsPartitioner   IparsRRPartitioner
CG            37.301            42.228             57.673
HW            35.677            38.532             41.425
(CG-HW)/HW    4.55%             9.59%              39.22%

FFHOSTS       1         2         4         8         16
CG            37.301    23.35     16.977    17.021    17.108
HW            35.677    22.213    15.625    15.096    15.026
(CG-HW)/HW    4.55%     5.12%     8.65%     12.75%    13.86%

DMHOSTS       1         2         4         8         16
CG            20.255    18.178    15.346    15.219    16.026
HW            17.322    15.608    14.562    14.156    14.671
(CG-HW)/HW    16.93%    16.47%    5.38%     7.51%     9.24%

5. Ipars dataset
Vary both the filter hosts and the data mover hosts

Configure:
DPNAME = DataPartitioner
USECHUNKS = false
NEW, FF and DM: 1.82, 3.21, 5.31, 8.12

Configure:
DPNAME = IparsPartitioner
USECHUNKS = false
FFHOSTS = osumed-8
DPHOSTS = osumed05
DMHOSTS = osumed-8
CLIENTHOSTS = osumed-8
Query (New):
Select * from bh where RID in (0,6,26,27) and TIME in [1001, ???] and solid>0.7 and speed(OILVX, OILVY, OILVZ)<30.0;
Query (OLD): [1001, 1020]

Size                     4.8GB      9.6GB          14.4GB         19.2GB
CG                       180        363.805        554.373        736.665
HW                       165.587    320.327        472.724        628.432
(CG-HW)/HW               8.70%      13.57%         17.27%         17.22%     (avg 14.19%)
CG relative to 4.8GB                2.0211388889   3.07985        4.0925833333
HW relative to 4.8GB                1.9344936499   2.8548376382   3.7951771576
Evaluate the Scalability of Our Tool
Scale the number of nodes hosting the Oil reservoir management dataset
Extract a 1.3GB subset of interest
The execution times scale almost linearly.
The performance difference varies between 5%~34%, with an average difference of 16%.
[Chart6: HW and CG series for 1, 2, 4, and 8 nodes]
Comparison with hand-written code
Oil reservoir management dataset stored on 16 nodes.
Performance difference is within 17%, with an average difference of 14%.
Satellite data processing stored on a single node.
Performance difference is within 4%
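The percentages quoted here are consistent with a relative difference of the form (CG - HW) / HW. The sketch below recomputes it from CG and HW timings that appear in the spreadsheet data embedded in these slides (the 4.8GB to 19.2GB runs). Pairing those timings with this slide, and the function name `percent_difference`, are assumptions.

```python
from typing import List

def percent_difference(cg: float, hw: float) -> float:
    """Relative slowdown of CG over HW, in percent."""
    return (cg - hw) / hw * 100.0

# Execution times in seconds, taken from the embedded chart data.
cg_times: List[float] = [180.0, 363.805, 554.373, 736.665]
hw_times: List[float] = [165.587, 320.327, 472.724, 628.432]

diffs = [percent_difference(c, h) for c, h in zip(cg_times, hw_times)]
average = sum(diffs) / len(diffs)
```

With these inputs the per-run differences stay below 17.3% and the average is close to 14.2%, in line with the figures on the slide.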
[Chart1: HW and CG series for 20MB, 57MB, 92MB, 718MB, and 1.2GB]
Related Work
HDF5
Oracle’s external tables
Conclusions and Future Work
An automatic approach to support data virtualization for large distributed scientific datasets in low-level formats.
Design a meta-data description language
Compiler-based strategy to generate extractor code automatically
The dataset can be stored in the format it is generated in and no effort is involved in loading it in a database system.
Experimental evaluation demonstrates the efficacy and efficiency of our tool
Future work
Integration of multiple datasets in a grid computing environment
The total storage required after loading the data in PostgreSQL is 18GB.
Create Index for both spatial coordinates and S1 in PostgreSQL.
the experiment.
[Chart1: Query 2 through Query 6; axis label: Ntuples (x1600)]
Query 2
SELECT * FROM titan WHERE x>=0 AND x<=10000 AND y>=0 AND y<=10000 AND z>=0 AND z<=100;
Query 6
SELECT * FROM titan WHERE x>=0 AND x<=1000 AND y>=0 AND y<=1000 AND z>=0 AND z<=100;
(AVHRR)
are gathered to form an instantaneous field of view
A single file of IFOVs
Small Queries
