Scibox: Online Sharing of Scientific Data via the Cloud
Jian Huang†, Xuechen Zhang†, Greg Eisenhauer†, Karsten Schwan†
Matthew Wolf†,‡, Stephane Ethierǂ, Scott Klasky‡
†CERCS Research Center, Georgia Tech    ǂPrinceton Plasma Physics Laboratory    ‡Oak Ridge National Laboratory
Supported in part by funding from the US Department of Energy for DOE SDAV SciDAC
1
Outline
• Background and Motivation
• Problems and Challenges
• Design and Implementation
• Evaluation
• Conclusion and Future Work
2
Cloud Storage is Popular
Ease of use
Pay-as-you-go model
Universal accessibility
Good scalability and durability
Systems built on cloud storage
• Dropbox, Google Drive, iCloud, SkyDrive, etc.
Scibox: focus on scientific data sharing
3
Use Cases for Cloud Storage
[Figure: data producers (Combustion Experimental Data on the Aero Cluster, GTS/LAMMPS on Vogue) share data through private and public clouds with consumers: Image Processing on a Student PC and Visualization at Georgia Tech (Atlanta), WSU (Detroit), and OSU (Columbus)]
1. Ease of use  2. Universal accessibility  3. Good scalability
4
Outline
• Background and Motivation
• Problems and Challenges
• Design and Implementation
• Evaluation
• Conclusion and Future Work
5
Cloud Storage is Not Ready for HEC
Scientific applications are data-intensive
• Generate large amounts of data
Networking bandwidth is limited
• Inadequate ingress and egress bandwidth available to/from remote cloud stores
High costs imposed by cloud providers
• Expensive for large data volumes under the pay-as-you-go model
An example:
A GTS run on 29K cores of the Jaguar machine at OLCF generates over 54 Terabytes of data in a 24-hour period.
Amazon S3: ~$0.03/GB for storage and $0.09/GB for data transfer out.
Cost: $6635.52/day, increasing with the number of collaborators.
6
Problem: Too Much Data Movement
Issue: the naïve approach transfers lots of data, even if only some of it is needed
Data producer → Cloud → Data consumer, where each time step holds arrays of variables:
Time step 0: A0 A1 A2 … An / B0 B1 B2 … Bn / C0 C1 C2 … Cn
Time step 1: A0 A1 A2 … An / B0 B1 B2 … Bn / C0 C1 C2 … Cn
…
• Scientific data formats: BP/HDF5 — structured and metadata-rich
• Standard I/O interface: ADIOS — almost transparent to users
Example: output of the GTS fusion modeling simulation includes checkpoint data, diagnosis data, visualization data, etc.; each data subset includes many elements.
Goal: Reduce data transfer from producers to consumers
7
Solutions for Minimizing Data Transfers for Data Sharing
Compression
• Helps, but compression ratios can be low for floating-point scientific data
Data selection
• Files: users can ask for subsets of data files by specifying file offsets
• Requires knowledge of the data layout in files
Content-based data indexing
• Useful, but may require large amounts of metadata
Scibox approaches:
• Filter unnecessary data at the producer side via metadata (uploads)
• Merge overlapping subsets when multiple users share the same data (uploads)
• Minimize data sharing cost in cloud storage via a new software protocol (downloads)
8
Challenge: How to Filter Data
Recall: scientific data is structured and metadata-rich
Each time step holds arrays of variables (A0 … An, B0 … Bn, C0 … Cn), and each
Variable { dimensions, attributes }
Analytics users know what can/needs to be filtered
9
Outline
• Background and Motivation
• Problems and Challenges
• Design and Implementation
• Evaluation
• Conclusion and Future Work
10
Scibox Design
D(ata)R(eduction)-Functions for data filtering
• Reduce cloud upload/download volumes
• Permit end users to identify the exact data needed for each specific analytics activity, using filters
Merge shared data objects before uploading
• Promote data sharing in the cloud to reduce data redundancy on the storage server and the data volumes transferred across the network
Utilize ADIOS metadata-rich I/O methods
New ADIOS I/O transport
• Write output to the cloud that can be directly read by subsequent, potentially remote data analytics or visualization codes
• Transparent on both the producer and consumer sides
Partial object access for private cloud storage
• Patch the OpenStack Swift object store
11
Cloud-IO Transport
[Figure: HPC applications call the ADIOS API, which plugs in transports such as MPI-IO, POSIX-IO, LIVE/DataTap, DART, HDF-5, pnetCDF, Viz Engines, the new CLOUD-IO, and others (plug-in). CLOUD-IO connects the HPC platform to Swift object storage: a proxy node hosts DR-function run-time execution, an auth node handles authentication, and each store node runs account, container, and object servers.]
12
Scibox Architecture
[Figure: a Scibox client with Cloud-IO (Write) at the data producer and a Scibox client with Cloud-IO (Read) at each data consumer in a user group, connected through cloud storage (Amazon S3/OpenStack Swift); control flow and data flow are shown separately.]
13
Sample XML File
14
<?xml version="1.0"?>
<adios-config host-language="Fortran">
  <adios-group name="restart" coordination-communicator="comm">
    <var name="mype" type="integer"/>
    <var name="numberpe" type="integer"/>
    <var name="istep" type="integer"/>
    <var name="MIMAX_VAR" type="integer"/>
    <var name="NX" type="integer"/>
    <var name="NY" type="integer"/>
    <var name="zion0_1Darray" gwrite="zion0" type="double" dimensions="MIMAX_VAR"/>
    <var name="phi_2Darray" gwrite="phi" type="double" dimensions="NX, NY"/>
    <!-- for reader.xml -->
    <rd type="8" name="phi_2Darray"
        cod="int i; double sum = 0.0; for(i = 0; i &lt; input.count; i = i+1) { sum = sum + input.vals[i]; } return sum;"/>
  </adios-group>
  <method group="restart" method="CLOUD-IO"/>
  <buffer size-MB="20" allocate-time="now"/>
</adios-config>
Data Reduction (DR) Functions
Definition
• A function defined by end users that transforms/reduces data to prepare it for cloud storage
Creation
• Programmed by end users
• Generated from higher-level descriptions
• Derived from users' I/O access patterns
Current implementation
• Customized CoD (C on Demand): requires producer-side computational resources
• DR-function library: the same DR-function specified by multiple clients is executed only once, and its output data is reused for multiple consumers
15
DR-functions Provided by Scibox
Type | Description | Example
DR1 | Max(variable) | Max(var_double_2Darray)
DR2 | Min(variable) | Min(var_double_2Darray)
DR3 | Mean(variable) | Mean(var_double_2Darray)
DR4 | Range(variable, dimension, start_pos, end_pos) | Range(var_int_1Darray, 1, 100, 1000)
DR5 | Select(variable, threshold1, threshold2) | Select var.value where var.value in (threshold1, threshold2)
DR6 | Select(variable, DR_Function1, DR_Function2) | Select var.value where var.value ≥ Mean(var)
DR7 | Select(variable1, variable2, threshold1, threshold2) | Select var2.value where var1.value in (threshold1, threshold2)
DR8 | Self-defined function | double proc(cod_exec_context ec, input_type *input, int k, int m) { int i; double sum = 0.0; double average = 0.0; for (i = 0; i < m; i++) sum += input->tmpbuf[i + k*m]; average = sum / m; return average; }
C on Demand (CoD) — consumer: a string; producer: 1. registration, 2. compile and execute on demand.
16
Data Object Management
[Figure: a Scibox super file combines the Writer.xml group metadata with Reader0.xml … ReaderN.xml per-user metadata. In the cloud these map to a Group Metadata Container, a User Metadata Container (User0 … UserN metadata objects), and a Data Container (Object 0 … Object N).]
Enables object sharing
17
Determination of Object Size
Merge overlapping data sets
• Reduces upload data size
Partial object access
• Current Amazon S3 and OpenStack Swift stores do not support this
• Users have to download the whole object, even if only a small portion of its data is needed
Two approaches used in Scibox
• Private cloud: modify the software to enable partial object access; object size is determined by predicting upload throughput
• Public cloud: limit object size considering storage pricing; object size is determined by comparing the cost with and without sharing
18
Cost Model
Definitions
α: $/GB of standard cloud storage
β: $/GB of data transfer into the cloud
γ: $/GB of data transfer out of the cloud
Assumption
n clients request Data1, Data2, …, Datan respectively
Without sharing:
With sharing (n objects can be merged into one):
19
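The formulas themselves did not survive extraction; a plausible reconstruction from the definitions above (assuming each subset is uploaded, stored for the same period, and downloaded once by its requester — this is my reading, not the slide's) is:

```latex
C_{\text{no-share}} = \sum_{i=1}^{n} (\alpha + \beta + \gamma)\,\lvert \mathit{Data}_i \rvert
\qquad
C_{\text{share}} = (\alpha + \beta)\,\Bigl\lvert \bigcup_{i=1}^{n} \mathit{Data}_i \Bigr\rvert
                 + \gamma \sum_{i=1}^{n} \lvert \mathit{Data}_i \rvert
```

Under this reading, the egress term γ is paid per consumer either way, so merging wins exactly when the merged object is smaller than the sum of the individual subsets, i.e. when the requests overlap.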
Outline
• Background and Motivation
• Problems and Challenges
• Design and Implementation
• Evaluation
• Conclusion and Future Work
20
Experimental Setup
[Figure: producers (Aerospace App on the Aero Cluster, GTS on Vogue) write to OpenStack Swift; consumers run Image Processing on the Jedi Cluster and Visualization at Georgia Tech (Atlanta), WSU (Detroit), and OSU (Columbus).]
21
Workloads
Synthetic workloads
• 10 variables shared by multiple consumers
• 1,000 requests generated by each consumer
• 8 types of DR-functions, uniformly distributed
• 1 data producer serving 1,000 requests x #client servers
• 3 self-defined DR-functions: FFT, histogram diagnosis, and average of row values of a matrix
Real workloads
• GTS workload: 128 parallel processes; consumers are in 3 different US states
• Combustion workload: 10,000 512x512 12-bit double-framed images (~1.5 MB per image)
22
Latency Breakdown of Swift Object Store
23
Put (total latency, and share spent on data transfer & server processing):
Object size (MB):   1      4      8      16     32     64     128
Latency (s):        0.48   0.88   1.28   2.17   3.65   6.99   14.64
Data transfer (%):  34.63  63.82  73.58  84.93  88.69  91.30  93.28
Remaining components: headers transfer, disk write & checksum, authorization & container checking, software overhead & others.
Get (total latency, and share spent on data transfer & server processing):
Object size (MB):   1      4      8      16     32     64     128
Latency (s):        0.43   0.61   0.92   1.49   2.62   5.05   9.36
Data transfer (%):  17.04  41.77  54.38  67.33  75.62  78.78  83.20
Remaining components: disk read, authorization & container checking, software overhead & others.
Data transfer dominates the request latency.
Execution Time of DR-functions
24
[Figure: execution time vs. variable size (64 MB to 1.5 GB) for DR1-DR3 (up to ~4 s), DR4 with result sizes from 1 KB to 128 MB (up to ~7 s), and DR5-DR7 (up to ~40 s).]
Recall DR7: Select var2.value where var1.value in (r1, r2); var1 is small.
Recall DR6: Select var.value where var.value > Mean(var); requires a double scan of the input data.
Scibox System Scalability
25
Execution time (seconds) vs. number of producers:
Producers:     1       2       4        8        16
Stock system:  842.81  968.86  1329.83  2643.06  4962.91
Scibox:        92.39   109.65  151.44   271.46   412.12
Latency (seconds) vs. number of consumers in the same sharing group:
Consumers:     2     4     8     16     32     64
Stock system:  4.34  5.1   9.54  19.66  42.25  81.91
Scibox:        1.98  2.0   3.09  6.48   12.68  21.94
• With Scibox, data is merged before upload
• With Scibox, partial object access is supported
GTS Workload
26
[Figure: latency (seconds, up to ~1500) of data sharing schemes Gatech-Scibox, Gatech-FTP, OSU-Scibox, OSU-FTP, WSU-Scibox, and WSU-FTP, broken into post-computation download, filter computation, and GTS computation.]
Measured bandwidth from GT: WSU 900 KB/s, OSU 4.4 MB/s, GT 44 MB/s
Combustion Workload
27
[Figure: performance speedup (up to ~1.8x) vs. number of clients in the same sharing group (1-16), for 1 to 32 producer processes.]
• DR-function: (ImageName, DR8, FFT)
• 10,000 images (~1.5 MB/image) shared via Scibox
Outline
• Background and Motivation
• Problems and Challenges
• Design and Implementation
• Evaluation
• Conclusion and Future Work
28
Conclusions and Future Work
Scibox: cloud-based support for scientific data sharing
• Can operate across both public and private cloud stores
Data Reduction functions
• Exploit the structured nature of scientific data
• Reduce the amount of transferred data
Transparent to applications/data producers
• Implemented in ADIOS
• Can also be applied to other I/O libraries
Future work
• Deploy Scibox on national labs' facilities to better understand potential use cases
• Additional optimizations of cloud storage for scientific data management
29
Thanks!
30