TRANSCRIPT
Performance Analysis and Diagnosis of Cloud-based DDDAS Applications
FA 9550-15-1-0184
Program Director: Dr. Frederica Darema
PIs: Mohammad Maifi Khan, Swapna Gokhale
Department of Computer Science and Engineering
University of Connecticut
[Architecture diagram: sensor streams (video; audio; sound, light, temperature; vibration, accelerometer; Twitter feed) feed a Data Handling Service. A Cloud-based Storage Layer provides the Data Replication Service, Data Lookup Service, Failure Recovery Service, and Cloud Data Storage Service. An Application Layer runs Data Analytic Applications 1 through N and End User Visualization, all on top of a System Layer.]
Predictable Performance is of utmost importance!
Challenges
• The sampling rate of different sensors and sensing modalities may change
  – Cascading effect on the cloud side
• Different data processing algorithms require different sets of resources (e.g., motif mining vs. image analysis)
• Virtualization, wide adoption of parallel and multi-threaded programs, and increasingly larger scales
  – High degree of interactive complexity
• We need a solution that can answer questions such as:
  – How is the changed sampling rate going to affect the execution?
  – Why is allocating more servers not improving the execution time?
  – How long is it going to take to finish the job?
System Architecture
[Architecture diagram: a Performance Monitor on each node feeds a Performance Modeling Framework, which combines Sensor Stream Configurations and System Configurations to produce the Expected Performance. A Diagnostic Service compares expected, actual, and target performance and drives a Resource Allocation Service. Highly important nodes are flagged to answer: what is expected to happen in response to critical events?]
• Each node i has a non-negative importance function I_i(t), normalized over the network:
  \sum_{i=1}^{N} I_i(t) = 1
  where N is the number of sensors and t is the time point.
• Each node i has a non-negative quality-of-information function QoI(s_i) \in [0, 1]:
  QoI(s_i) = 0 for s_i \le min_i, and QoI(s_i) = 1 for s_i \ge Max_i
• QoI(minimum) = f(importance)
Jin, J., Palaniswami, M., Krishnamachari, B., 2012. Rate control for heterogeneous wireless sensor networks: characterization, algorithms and performance. Computer Networks 56(17), 3783–3794.
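The piecewise QoI function above can be sketched in code. The slides fix only the endpoint behavior (0 at or below min_i, 1 at or above Max_i); the linear ramp in between is an illustrative assumption:

```python
def qoi(s, min_rate, max_rate):
    """Quality of information for sampling rate s.

    Endpoints follow the slide: 0 at or below min_rate, 1 at or above
    max_rate. The linear interpolation in between is an assumption;
    the slides only constrain the endpoints.
    """
    if s <= min_rate:
        return 0.0
    if s >= max_rate:
        return 1.0
    return (s - min_rate) / (max_rate - min_rate)
```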
Overview
• System Capacity (SC): the capacity of the bottleneck resource in the system:
  \sum_{i=1}^{N} s_i \cdot ms_i \cdot x_i \le SC
  where s_i is the sampling rate of node i, ms_i is the message size, and x_i \in \{0, 1\} is the node status.
• Rate allocation is then the optimization problem:
  maximize \sum_{i=1}^{N} QoI_i(s_i) \cdot x_i
  subject to:
    \sum_{i=1}^{N} I_i(t) = 1
    \sum_{i=1}^{N} s_i \cdot ms_i \cdot x_i \le SC
    x_i \in \{0, 1\}
Knapsack Rate Allocation Algorithm?
• A sampling rate assignment that maximizes the QoI for the whole network may not be the ideal solution
  – The node with the highest importance may have a smaller QoI_i(s_i) / (s_i \cdot ms_i) ratio due to a large message size
Preliminary Approach
• Use a threshold to separate nodes into two (or more) groups:
  – Nodes whose importance is higher than the threshold
  – The remaining nodes
• Apply the knapsack rate allocation algorithm to each group separately
• First, pick nodes from the critical group
• Next, pick nodes from the less critical group (if system capacity allows)
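The two-phase scheme above can be sketched as follows. The node representation (dicts with id, imp, s, ms, qoi fields) and the greedy value-per-weight heuristic standing in for an exact knapsack solver are both illustrative assumptions:

```python
def knapsack_select(nodes, capacity):
    """Greedy 0/1 knapsack heuristic: pick nodes by QoI per unit of
    consumed capacity (s_i * ms_i) until the budget is exhausted."""
    chosen, used = [], 0.0
    for n in sorted(nodes, key=lambda n: n["qoi"] / (n["s"] * n["ms"]), reverse=True):
        cost = n["s"] * n["ms"]
        if used + cost <= capacity:
            chosen.append(n["id"])
            used += cost
    return chosen, used

def threshold_allocate(nodes, capacity, imp_threshold):
    """Split nodes by importance, serve the critical group first, then
    fill the remaining capacity from the less critical group."""
    critical = [n for n in nodes if n["imp"] > imp_threshold]
    rest = [n for n in nodes if n["imp"] <= imp_threshold]
    chosen, used = knapsack_select(critical, capacity)
    more, _ = knapsack_select(rest, capacity - used)
    return chosen + more
```

Note the contrast with plain greedy selection: a large-message, high-importance node that loses on QoI density still gets served first once it lands in the critical group.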
Simulated Network Topology
[Figure: a 25-node network (nodes 0–24) partitioned into seven groups, Group 1 through Group 7.]
Evaluation
[Figure: importance of each node (%) over time (minutes 0–14) for nodes 1, 12, 15 (Group 1), nodes 7, 14, 22 (Group 5), and nodes 6, 8, 21 (Group 7), with the durations of Event 1 (Group 1), Event 2 (Group 5), and Event 3 (Group 7) marked.]
The Importance Metrics of Sensor Nodes
Quality of Information vs. Network Coverage
[Figure: two charts over three experiments comparing the Rate Allocation Algorithm, the Threshold-Based Rate Allocation Algorithm, and the Maximum Greedy Algorithm: (a) quality of information for the entire network (%) and (b) number of suspended nodes (0–16), each plotted against the time of the different experiments.]
How does changing the sampling rate affect the backend?
• If a server gets overloaded due to a changed sampling rate, we may lose critical data
• One way to address this is to redirect certain sensor streams to different servers on the fly
• The challenges are:
  – How to determine a possible overload in advance?
  – How to determine the set of sensors that need to be redirected?
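A minimal sketch of the second question, assuming each server advertises a byte-rate capacity, each stream's load is s_i * ms_i, and lowest-importance streams are redirected first. The policy and data layout are illustrative, not the project's algorithm:

```python
def find_redirects(streams, capacity):
    """Given the streams assigned to one server, return the ids to
    redirect elsewhere so the remaining load fits under `capacity`.
    Lowest-importance streams are moved first (illustrative policy)."""
    load = sum(s["s"] * s["ms"] for s in streams)
    redirects = []
    for s in sorted(streams, key=lambda s: s["imp"]):
        if load <= capacity:
            break
        redirects.append(s["id"])
        load -= s["s"] * s["ms"]
    return redirects
```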
Modeling Performance of Application-Layer Jobs

A Different Platform
Apache Spark™ is a widely used cloud-based platform for large-scale data processing [1, 2]. Its resilient distributed datasets (RDDs) feature supports in-memory computation.

[1] Apache Spark™. http://spark.apache.org/.
[2] Powered By Spark. https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark.
Initial Idea
1. Execute a sample job and measure its performance
2. Predict the performance of the actual job based on the performance of the sample job
The sample job runs the same program as the actual job, but only on a fraction of the input data.
[Pipeline: Sampled Input → Execute Sample Job → Event Logs and Performance Info → Performance Model → Predict Performance of the Actual Job]
Apache Spark Job
• One job consists of M sequential stages
• One stage contains N tasks, which run in parallel and sequentially
• Tasks run in batches: one batch of P tasks runs in parallel, where H is the number of worker nodes
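The batching rule implies a simple count: with N tasks per stage and P parallel slots, a stage needs ceil(N / P) batches, and a job's M sequential stages add up. A sketch (the helper names are ours, not Spark's):

```python
import math

def num_batches(n_tasks, p_slots):
    """Tasks run in batches of at most P parallel tasks, so a stage
    with N tasks needs ceil(N / P) batches."""
    return math.ceil(n_tasks / p_slots)

def job_batches(stage_task_counts, p_slots):
    """A job is a sequence of M stages; since stages execute
    sequentially, the job's batch count is the sum over stages."""
    return sum(num_batches(n, p_slots) for n in stage_task_counts)
```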
Performance Metrics

Execution Time
• The execution time of a stage is determined by K_c, the number of sequential tasks running in CPU core c, and P, the total number of cores.
• The average execution time of the first batch differs from that of the subsequent batches within the same stage.
• Here n_h is the number of tasks running in host h, and P_h is the number of tasks in the first batch.
Experimental Setup

Cluster Setup

Input scale:
                One Batch    Two Batches
Reduced Scale   1.25 GB      2.5 GB
Full Scale      7.5 GB       15 GB

Example Jobs:
Job                   Job Input
WordCount             75 GB Wikipedia dump
Logistic Regression   50 GB SDSS CMD data
K-Means               50 GB SDSS CMD data
PageRank              25 GB SNAP network dataset

- Sloan Digital Sky Survey. http://www.sdss.org/.
- Stanford SNAP. http://snap.stanford.edu/.
Prediction Accuracy Calculation
• Prediction accuracy is calculated for each stage and summed up over the M stages of the job.
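The exact per-stage formula did not survive extraction from the slide; a common form, averaging per-stage relative accuracy over the M stages, would look like this (an illustrative stand-in, not necessarily the paper's definition):

```python
def prediction_accuracy(predicted, actual):
    """Average per-stage relative accuracy over the M stages.
    This relative-error form is an assumption standing in for the
    slide's lost formula."""
    assert len(predicted) == len(actual)
    m = len(actual)
    return sum(1 - abs(p - a) / a for p, a in zip(predicted, actual)) / m
```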
App – I: WordCount
[Figures: prediction accuracy, time prediction, I/O write prediction, I/O read prediction]

App – II: Logistic Regression
[Figures: prediction accuracy, time prediction]

App – III: K-Means
[Figures: prediction accuracy, time prediction, I/O write prediction, I/O read prediction]

App – IV: PageRank
[Figures: prediction accuracy, time prediction, I/O write prediction, I/O read prediction]
Can we model Interference among multiple Jobs?
Main Idea
• Model the slowdown ratio using different kinds of job mix (CPU bound, I/O bound)
• Next, based on stage models, estimate the impact on execution time per stage
• Account for the cascading effect on execution to predict the total execution time
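The three steps above can be sketched as a per-stage slowdown applied to a baseline stage-time model; the dict-based slowdown map is a placeholder for the measured job-mix ratios:

```python
def interfered_runtime(stage_times, slowdown):
    """Apply a slowdown ratio to each stage and accumulate; because
    stages are sequential, per-stage delays cascade into the total
    execution time. `slowdown` maps stage index -> ratio (>= 1),
    defaulting to no interference."""
    total = 0.0
    for i, t in enumerate(stage_times):
        total += t * slowdown.get(i, 1.0)
    return total
```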
Preliminary Evaluation
• We choose four Apache Spark jobs:
  – PageRank
  – K-Means
  – Logistic Regression
  – WordCount
• For PageRank, we use the 20 GB LiveJournal network dataset from SNAP
• K-Means and Logistic Regression use 20 GB of numerical Color-Magnitude Diagram data of galaxies from the Sloan Digital Sky Survey (SDSS)
• WordCount uses a 20 GB Wikipedia dump
Next Goal
• Modeling the interference among multiple jobs and validating the model
• Developing Algorithms for Interference Aware Job Scheduling
Long Term Goal
- Leveraging performance models for performance troubleshooting
- Modeling interference between the application layer and the storage layer
- Scalable instrumentation
- Scalable troubleshooting algorithms
Publications Relevant to this Project
• Published
  – Performance Prediction for Apache Spark Platform. Kewen Wang and Mohammad Maifi Hasan Khan. In Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications (HPCC), 2015.
  – A Closed-loop Context Aware Data Acquisition and Resource Allocation Framework for Dynamic Data Driven Applications Systems (DDDAS) on the Cloud. Nhan Nguyen and Mohammad Maifi Hasan Khan. Journal of Systems and Software, Elsevier, 2015.
  – Context Aware Data Acquisition Framework for Dynamic Data Driven Applications Systems (DDDAS). Nhan Nguyen and Mohammad Maifi Hasan Khan. In Proceedings of the 32nd IEEE Military Communication Conference (MILCOM), San Diego, CA, USA, 2013.
• Under Preparation
  – Modeling Interference on the Apache Spark Platform
  – Interference-aware Job Scheduling for the Apache Spark Platform
THANK YOU!
Please feel free to contact if you have any questions.
Mohammad Maifi Khan <[email protected]> Swapna Gokhale <[email protected]>