TRANSCRIPT
Performance Analysis and Diagnosis of Cloud-based DDDAS Applications
FA 9550-15-1-0184
Program Director: Dr. Frederica Darema
PIs: Mohammad Maifi Khan, Swapna Gokhale
Department of Computer Science and Engineering
University of Connecticut
[Architecture diagram: sensor streams (video; audio; sound, light, temperature; vibration, accelerometer; Twitter feed) feed a Data Handling Service. A Cloud-based Storage Layer provides the Data Replication Service, Data Lookup Service, Failure Recovery Service, and Cloud Data Storage Service. An Application Layer runs Data Analytic Applications 1 through N and End User Visualization, all on top of a System Layer.]
Predictable Performance is of utmost importance!
Challenges
• The sampling rate of different sensors and sensing modalities may change
  – Cascading effect on the cloud side
• Different data processing algorithms require different sets of resources (e.g., motif mining vs. image analysis)
• Virtualization, wide adoption of parallel and multi-threaded programs, and increasingly larger scales
  – High degree of interactive complexity
• We need a solution that can answer questions such as:
  – How is the changed sampling rate going to affect the execution?
  – Why is allocating more servers not improving the execution time?
  – How long is it going to take to finish the job?
System Architecture
[Architecture diagram: a Performance Monitor on each node feeds a Performance Modeling Framework, which combines Sensor Stream Configurations and System Configurations to produce the Expected Performance. A Diagnostic Service compares expected, actual, and target performance and drives a Resource Allocation Service. Highly important nodes are flagged to answer: what is expected to happen in response to critical events?]
• Each node i has a non-negative importance function I_i(t), normalized over the network:
  \sum_{i=1}^{N} I_i(t) = 1
  where N is the number of sensors and t is the time point.
• Each node i has a non-negative quality-of-information function QoI(s_i) \in [0, 1]:
  QoI(s_i) = 0 for s_i \le min_i, and QoI(s_i) = 1 for s_i \ge Max_i
• QoI(minimum) = f(importance)
Jin, J., Palaniswami, M., Krishnamachari, B., 2012. Rate control for heterogeneous wireless sensor networks: characterization, algorithms and performance. Computer Networks 56(17), 3783–3794.
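The piecewise QoI function above can be sketched in code. The slides fix only the endpoint behavior (0 at or below min_i, 1 at or above Max_i); the linear ramp in between is an illustrative assumption:

```python
def qoi(s, min_rate, max_rate):
    """Quality of information for sampling rate s.

    Endpoints follow the slide: 0 at or below min_rate, 1 at or above
    max_rate. The linear interpolation in between is an assumption;
    the slides only constrain the endpoints.
    """
    if s <= min_rate:
        return 0.0
    if s >= max_rate:
        return 1.0
    return (s - min_rate) / (max_rate - min_rate)
```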
Overview
• System Capacity (SC): the capacity of the bottleneck resource in the system:
  \sum_{i=1}^{N} s_i \cdot ms_i \cdot x_i \le SC
  where s_i is the sampling rate of node i, ms_i is the message size, and x_i \in \{0, 1\} is the node status.
• Rate allocation is then the optimization problem:
  maximize \sum_{i=1}^{N} QoI_i(s_i) \cdot x_i
  subject to:
    \sum_{i=1}^{N} I_i(t) = 1
    \sum_{i=1}^{N} s_i \cdot ms_i \cdot x_i \le SC
    x_i \in \{0, 1\}
Knapsack Rate Allocation Algorithm?
• A sampling rate assignment that maximizes the QoI for the whole network may not be the ideal solution
  – The node with the highest importance may have a smaller QoI_i(s_i) / (s_i \cdot ms_i) ratio due to a large message size
Preliminary Approach
• Use a threshold to separate nodes into two (or more) groups:
  – Nodes whose importance is higher than the threshold
  – The remaining nodes
• Apply the knapsack rate allocation algorithm to each group separately
• First, pick nodes from the critical group
• Next, pick nodes from the less critical group (if system capacity allows)
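The two-phase scheme above can be sketched as follows. The node representation (dicts with id, imp, s, ms, qoi fields) and the greedy value-per-weight heuristic standing in for an exact knapsack solver are both illustrative assumptions:

```python
def knapsack_select(nodes, capacity):
    """Greedy 0/1 knapsack heuristic: pick nodes by QoI per unit of
    consumed capacity (s_i * ms_i) until the budget is exhausted."""
    chosen, used = [], 0.0
    for n in sorted(nodes, key=lambda n: n["qoi"] / (n["s"] * n["ms"]), reverse=True):
        cost = n["s"] * n["ms"]
        if used + cost <= capacity:
            chosen.append(n["id"])
            used += cost
    return chosen, used

def threshold_allocate(nodes, capacity, imp_threshold):
    """Split nodes by importance, serve the critical group first, then
    fill the remaining capacity from the less critical group."""
    critical = [n for n in nodes if n["imp"] > imp_threshold]
    rest = [n for n in nodes if n["imp"] <= imp_threshold]
    chosen, used = knapsack_select(critical, capacity)
    more, _ = knapsack_select(rest, capacity - used)
    return chosen + more
```

Note the contrast with plain greedy selection: a large-message, high-importance node that loses on QoI density still gets served first once it lands in the critical group.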
Simulated Network Topology
[Figure: a 25-node network (nodes 0–24) partitioned into seven groups, Group 1 through Group 7.]
Evaluation
[Figure: importance of each node (%) over time (minutes 0–14) for nodes 1, 12, 15 (Group 1), nodes 7, 14, 22 (Group 5), and nodes 6, 8, 21 (Group 7), with the durations of Event 1 (Group 1), Event 2 (Group 5), and Event 3 (Group 7) marked.]
The Importance Metrics of Sensor Nodes
Quality of Information vs. Network Coverage
[Figure: two charts over three experiments comparing the Rate Allocation Algorithm, the Threshold-Based Rate Allocation Algorithm, and the Maximum Greedy Algorithm: (a) quality of information for the entire network (%) and (b) number of suspended nodes (0–16), each plotted against the time of the different experiments.]
How does changing the sampling rate affect the backend?
• If a server gets overloaded due to a changed sampling rate, we may lose critical data
• One way to address this is to redirect certain sensor streams to different servers on the fly
• The challenges are:
  – How to determine a possible overload in advance?
  – How to determine the set of sensors that need to be redirected?
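A minimal sketch of the second question, assuming each server advertises a byte-rate capacity, each stream's load is s_i * ms_i, and lowest-importance streams are redirected first. The policy and data layout are illustrative, not the project's algorithm:

```python
def find_redirects(streams, capacity):
    """Given the streams assigned to one server, return the ids to
    redirect elsewhere so the remaining load fits under `capacity`.
    Lowest-importance streams are moved first (illustrative policy)."""
    load = sum(s["s"] * s["ms"] for s in streams)
    redirects = []
    for s in sorted(streams, key=lambda s: s["imp"]):
        if load <= capacity:
            break
        redirects.append(s["id"])
        load -= s["s"] * s["ms"]
    return redirects
```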
Modeling Performance of Application-Layer Jobs

A Different Platform
Apache Spark™ is a widely used cloud-based platform for large-scale data processing [1, 2]. Its resilient distributed datasets (RDDs) feature supports in-memory computation.

[1] Apache Spark™. http://spark.apache.org/.
[2] Powered By Spark. https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.
Initial Idea
1. Execute a sample job and measure its performance
2. Predict the performance of the actual job based on the performance of the sample job
The sample job runs the same program as the actual job, but only on a fraction of the input data.
[Pipeline: Sampled Input → Execute Sample Job → Event Logs and Performance Info → Performance Model → Predict Performance of the Actual Job]
Apache Spark Job
• One job consists of M sequential stages
• One stage contains N tasks, which run in parallel and sequentially
• Tasks run in batches: one batch of P tasks runs in parallel, where H is the number of worker nodes
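The batching rule implies a simple count: with N tasks per stage and P parallel slots, a stage needs ceil(N / P) batches, and a job's M sequential stages add up. A sketch (the helper names are ours, not Spark's):

```python
import math

def num_batches(n_tasks, p_slots):
    """Tasks run in batches of at most P parallel tasks, so a stage
    with N tasks needs ceil(N / P) batches."""
    return math.ceil(n_tasks / p_slots)

def job_batches(stage_task_counts, p_slots):
    """A job is a sequence of M stages; since stages execute
    sequentially, the job's batch count is the sum over stages."""
    return sum(num_batches(n, p_slots) for n in stage_task_counts)
```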
Performance Metrics

Execution Time
• The execution time of a stage is determined by K_c, the number of sequential tasks running in CPU core c, and P, the total number of cores.
• The average execution time of the first batch differs from that of the subsequent batches within the same stage.
• Here n_h is the number of tasks running in host h, and P_h is the number of tasks in the first batch.
Experimental Setup

Cluster Setup

Input scale:
                One Batch    Two Batches
Reduced Scale   1.25 GB      2.5 GB
Full Scale      7.5 GB       15 GB

Example Jobs:
Job                   Job Input
WordCount             75 GB Wikipedia dump
Logistic Regression   50 GB SDSS CMD data
K-Means               50 GB SDSS CMD data
PageRank              25 GB SNAP network dataset

- Sloan Digital Sky Survey. http://www.sdss.org/.
- Stanford SNAP. http://snap.stanford.edu/.
Prediction Accuracy Calculation
• Prediction accuracy is calculated for each stage and summed up over the M stages of the job.
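The exact per-stage formula did not survive extraction from the slide; a common form, averaging per-stage relative accuracy over the M stages, would look like this (an illustrative stand-in, not necessarily the paper's definition):

```python
def prediction_accuracy(predicted, actual):
    """Average per-stage relative accuracy over the M stages.
    This relative-error form is an assumption standing in for the
    slide's lost formula."""
    assert len(predicted) == len(actual)
    m = len(actual)
    return sum(1 - abs(p - a) / a for p, a in zip(predicted, actual)) / m
```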
App – I: WordCount
[Figures: prediction accuracy, time prediction, I/O write prediction, I/O read prediction]

App – II: Logistic Regression
[Figures: prediction accuracy, time prediction]

App – III: K-Means
[Figures: prediction accuracy, time prediction, I/O write prediction, I/O read prediction]

App – IV: PageRank
[Figures: prediction accuracy, time prediction, I/O write prediction, I/O read prediction]
Can we model Interference among multiple Jobs?
Main Idea
• Model the slowdown ratio using different kinds of job mix (CPU bound, I/O bound)
• Next, based on stage models, estimate the impact on execution time per stage
• Account for the cascading effect on execution to predict the total execution time
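The three steps above can be sketched as a per-stage slowdown applied to a baseline stage-time model; the dict-based slowdown map is a placeholder for the measured job-mix ratios:

```python
def interfered_runtime(stage_times, slowdown):
    """Apply a slowdown ratio to each stage and accumulate; because
    stages are sequential, per-stage delays cascade into the total
    execution time. `slowdown` maps stage index -> ratio (>= 1),
    defaulting to no interference."""
    total = 0.0
    for i, t in enumerate(stage_times):
        total += t * slowdown.get(i, 1.0)
    return total
```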
Preliminary Evaluation
• We choose four Apache Spark jobs:
  – PageRank
  – K-Means
  – Logistic Regression
  – WordCount
• For PageRank, we use the 20 GB LiveJournal network dataset from SNAP
• K-Means and Logistic Regression use 20 GB of numerical Color-Magnitude Diagram data of galaxies from the Sloan Digital Sky Survey (SDSS)
• WordCount uses a 20 GB Wikipedia dump
Next Goal
• Modeling the interference among multiple jobs and validating the model
• Developing Algorithms for Interference Aware Job Scheduling
Long Term Goal
- Leveraging performance models for performance troubleshooting
- Modeling interference between the application layer and the storage layer
- Scalable instrumentation
- Scalable troubleshooting algorithms
Publications Relevant to this Project
• Published
  – Performance Prediction for Apache Spark Platform. Kewen Wang and Mohammad Maifi Hasan Khan. In Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications (HPCC), 2015.
  – A Closed-loop Context Aware Data Acquisition and Resource Allocation Framework for Dynamic Data Driven Applications Systems (DDDAS) on the Cloud. Nhan Nguyen and Mohammad Maifi Hasan Khan. Journal of Systems and Software, Elsevier, 2015.
  – Context Aware Data Acquisition Framework for Dynamic Data Driven Applications Systems (DDDAS). Nhan Nguyen and Mohammad Maifi Hasan Khan. In Proceedings of the 32nd IEEE Military Communication Conference (MILCOM), San Diego, CA, USA, 2013.
• Under Preparation
  – Modeling Interference on the Apache Spark Platform
  – Interference-aware Job Scheduling for the Apache Spark Platform
THANK YOU!
Please feel free to contact if you have any questions.
Mohammad Maifi Khan <[email protected]> Swapna Gokhale <[email protected]>