An Experimental Study of Rapidly Alternating Bottleneck in n-Tier Applications

Qingyang Wang, Yasuhiko Kanemasa, Jack Li, Deepal Jayasinghe, Toshihiro Shimizu, Masazumi Matsubara, Motoyuki Kawaba, Calton Pu



Slide 2: Scaling Web Applications On-Demand in Cloud

Goal: good performance + cost efficiency
- High throughput + low response time
- High resource utilization

(Figure: a bottleneck tier limits both goals.)

Slide 3: What If No Bottleneck Was Detected?

(Figure: resource utilization of the four servers; no resource is saturated.)
- CPU utilization: 79.2%, 78.1%, 34.6%, 26.7% across the four servers
- Disk I/O: 0.5% on every server
- Memory usage: 30-50%

How to scale a web application when performance is bad, yet no bottleneck can be identified?

Slide 4: Rapidly Alternating Bottlenecks

(Figure: as the workload varies over time, the bottleneck shifts among the web, app, and DB servers at instants t0, t1, t2, t3.)

1. Throughput is limited with no saturated resources.
2. Duration of each bottleneck is short (e.g., < 100 ms).

Slide 5: Experimental Setup

- RUBBoS benchmark: a bulletin-board system like Slashdot
  - 24 web interactions
  - CPU intensive
- Workload consists of emulated clients
- Hardware: Intel Xeon E5607, 2 quad-core 2.26 GHz, 16 GB memory

Slide 6: Motivational Example

Response time & throughput of a 3-minute benchmark on the 4-tier application with increasing workloads.

(Figures: response time [s] and throughput [req/s] vs. workload [# users x1000]; WL 14,000 is marked in both.)

Slide 7: No Obvious Bottleneck Is Detected at WL 14,000

- Workload is CPU intensive, yet CPU utilization on the four servers (including CJDBC) is only 79.2%, 78.1%, 34.6%, and 26.7%.
- Disk I/O utilization (<5%), network I/O utilization (<20%), memory usage (<40%).

Slide 8: Rapidly Alternating Bottleneck: Sources and Detection

- Sources: we find that, other than bursty workload, system environmental conditions cause rapidly alternating bottlenecks:
  - JVM garbage collection
  - VM collocation
- Detection and visualization: we implement a fine-grained monitoring method based on passive network tracing.
  - Negligible monitoring overhead for running applications

Slide 9: Outline

- Introduction & Motivation
- Detection and Visualization
  - Fine-grained load/throughput analysis
- Two Observations of Rapidly Alternating Bottlenecks
  - JVM garbage collection (JVM GC)
  - VM collocation
- Conclusion & Future Work

Slide 10: Two Steps for Detecting Rapidly Alternating Bottlenecks

1. Find the participating servers that present transient bottlenecks (e.g., 50 ms).
2. Check whether the transient bottlenecks of each participating server occur in an alternating pattern.
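The two steps above can be sketched in code. This is a minimal illustration, not the paper's implementation: it assumes per-window active-load samples are already available, and it treats "alternating" simply as "both servers saturate somewhere in the trace, but never in the same window"; the names (`transient_bottleneck_windows`, `alternating`, `n_star`) are hypothetical.

```python
from typing import List

def transient_bottleneck_windows(active_load: List[int], n_star: int) -> List[bool]:
    """Step 1: flag each 50 ms window in which a server's active load
    reaches its saturation point N* (a transient bottleneck)."""
    return [load >= n_star for load in active_load]

def alternating(flags_a: List[bool], flags_b: List[bool]) -> bool:
    """Step 2: check whether two servers' transient bottlenecks occur
    in an alternating pattern: both servers saturate somewhere in the
    trace, but never in the same window."""
    overlap = sum(a and b for a, b in zip(flags_a, flags_b))
    return any(flags_a) and any(flags_b) and overlap == 0

# Toy 50 ms traces: the two tiers saturate in disjoint windows.
tomcat = transient_bottleneck_windows([9, 2, 8, 1, 9, 2], n_star=8)
mysql = transient_bottleneck_windows([1, 7, 2, 8, 1, 7], n_star=7)
print(alternating(tomcat, mysql))  # True
```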

Slide 11: Passive Network Tracing Infrastructure

Collect interaction messages in the system using SysViz to measure fine-grained active load and throughput on each server.

- Active load: the # of concurrent requests in a server
- Throughput: the # of completed requests of a server

(Figure: a SysViz box attached to the network switch captures timestamped request/response messages flowing among the web, AP, and DB servers.)
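A minimal sketch of how these two metrics could be derived from the (arrival, departure) timestamps that passive tracing yields; sampling the active load at each window's midpoint is an illustrative choice, not necessarily SysViz's exact method, and `per_window_metrics` is a hypothetical helper name.

```python
WINDOW = 0.05  # 50 ms windows, as in the SysViz monitoring

def per_window_metrics(requests, t_start, n_windows):
    """Compute fine-grained metrics for one server from a list of
    (arrival, departure) timestamp pairs.

    Active load of a window: # of requests concurrently inside the
    server at the window's midpoint.
    Throughput of a window:  # of requests that depart in the window.
    """
    active, completed = [], []
    for i in range(n_windows):
        lo = t_start + i * WINDOW
        hi = lo + WINDOW
        mid = (lo + hi) / 2
        active.append(sum(1 for a, d in requests if a <= mid < d))
        completed.append(sum(1 for a, d in requests if lo <= d < hi))
    return active, completed

# Four requests observed on the wire, timestamps in seconds.
reqs = [(0.00, 0.04), (0.01, 0.09), (0.06, 0.12), (0.11, 0.13)]
load, tput = per_window_metrics(reqs, t_start=0.0, n_windows=3)
print(load, tput)  # [2, 2, 1] [1, 1, 2]
```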

Slide 12: Fine-Grained Active Load Calculation in a Server

(Figure: SysViz monitoring for MySQL. Each request contributes to the server's active load between its arrival timestamp and its departure timestamp; the load is counted in consecutive 50 ms windows.)

Slide 13: Active-Load/Throughput Correlation Analysis

(Figure: server throughput vs. active load. Throughput rises with load in the non-saturation area, then levels off at TPmax beyond the saturation point N*, which separates the non-saturation area from the saturation area.)
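One plausible way to estimate N* and TPmax from per-window (active load, throughput) samples is to average throughput per load level and find the knee; the 95% threshold (`frac`) below is an illustrative assumption, not a value from the talk.

```python
from collections import defaultdict

def saturation_point(samples, frac=0.95):
    """Estimate (N*, TPmax) from (active_load, throughput) samples:
    average throughput per load level, take the peak as TPmax, and
    take N* as the smallest load whose mean throughput reaches
    frac * TPmax."""
    by_load = defaultdict(list)
    for load, tput in samples:
        by_load[load].append(tput)
    mean = {n: sum(v) / len(v) for n, v in by_load.items()}
    tp_max = max(mean.values())
    n_star = min(n for n, m in mean.items() if m >= frac * tp_max)
    return n_star, tp_max

# Synthetic samples: throughput climbs, then flattens near load 6.
samples = [(1, 100), (2, 200), (3, 300), (4, 390), (5, 460),
           (6, 500), (7, 505), (8, 500), (9, 498)]
print(saturation_point(samples))  # (6, 505.0)
```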

Slide 14: Active-Load/Throughput Analysis for MySQL at WL 14,000

(Figures: MySQL active load [#] and throughput [req/s] sampled every 50 ms over time, and MySQL throughput vs. active load; three marked regions 1-3 and the saturation point N*.)


Slide 16: Active-Load/Throughput Analysis at Workload 7,000

The system is far from saturation at WL 7,000.

(Figures: system throughput [req/s] vs. workload [# users x1000] with WL 7,000 marked; Tomcat and MySQL throughput [req/s] vs. active load [#], with a point of interest (POI) marked.)

Slide 17: Active-Load/Throughput Analysis at Workload 14,000

The system achieves the maximum throughput at WL 14,000.

(Figures: system throughput [req/s] vs. workload [# users x1000] with WL 14,000 marked; Tomcat and MySQL throughput [req/s] vs. active load [#], with the POI marked.)

Slide 18: Timeline Analysis at Workload 14,000

(Figures: Tomcat scatter graph of active load [#] and throughput [req/s] over time, with POI 4 marked; Tomcat active load vs. GC running ratio [%] over time.)

Slide 19: Timeline Analysis at Workload 14,000 (Cont.)

(Figures: Tomcat and MySQL scatter graphs of active load [#] and throughput [req/s] over time; points 4 and 4' mark the POI in each graph.)

Slide 20: Correlation Analysis of Rapidly Alternating Bottlenecks

(Figure: Tomcat active load [#] and MySQL active load [#] over time.)

Correlation coefficient: -0.42; the negative correlation suggests a rapidly alternating bottleneck.
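The reported coefficient is a standard Pearson correlation between the two tiers' per-window active-load series. A self-contained sketch with toy data (the -0.42 above comes from the real trace, not this example):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Toy per-window active loads: when Tomcat's queue is long,
# MySQL's is short, and vice versa -> negative correlation.
tomcat = [9, 2, 8, 1, 9, 3]
mysql = [2, 8, 1, 7, 2, 6]
r = pearson(tomcat, mysql)
print(round(r, 2))  # strongly negative
```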


Slide 22: Conclusion & Future Work

- Rapidly alternating bottlenecks can cause non-trivial performance loss in an n-tier system.
- We proposed a rapidly alternating bottleneck detection and visualization method through fine-grained active-load/throughput analysis.
- Ongoing work: more analysis of different types of workloads and more system factors that cause rapidly alternating bottlenecks.

Slide 23: Thank You. Any Questions?

Qingyang Wang
[email protected]

Slide 24: Backup Slides

Slide 25: Resolving Rapidly Alternating Bottlenecks

(Figures: Tomcat and MySQL throughput [req/s] vs. active load [#] before and after resolving the rapidly alternating bottleneck.)

Slide 26: Performance Gain After Resolving Rapidly Alternating Bottlenecks

(Figures: response time [s] and throughput [req/s] vs. workload [# users x1000].)

Slide 27: Active-Load/Throughput Analysis at Workload 14,000

The system achieves the maximum throughput at WL 14,000.

(Figure: system throughput [req/s] vs. workload [# users x1000] with WL 14,000 marked.)