Applied Performance Theory
kavya719
kavya
applying performance theory to practice
performance
capacity
• What's the additional load the system can support, without degrading response time?
• What're the system utilization bottlenecks?
• What's the impact of a change on response time, maximum throughput?
• How many additional servers to support 10x load?
• Is the system over-provisioned?
YOLO method
load simulation: stressing the system to empirically determine actual performance characteristics and bottlenecks. Can be incredibly powerful.
performance modeling
real-world system → (model as) → theoretical model → (analyze) → results → (translate back to the real-world system)
makes assumptions about the system: request arrival rate, service order, service times. Cannot apply the results if your system does not satisfy them.
• a cluster of many servers: the USL, scaling bottlenecks
• a single server: open, closed queueing systems; utilization law, Little's law, the P-K formula, CoDel, adaptive LIFO
• stepping back: the role of performance modeling
a single server
model I
clients → web server
"how can we improve the mean response time?"
"what's the maximum throughput of this server, given a response time target?"
[graph: response time (ms) vs. throughput (requests/second), with a response time threshold]
model the web server as a queueing system
web server: request → response
queueing delay + service time = response time

assumptions:
1. requests are independent and random, arriving at some "arrival rate"
2. requests are processed one at a time, in FIFO order; requests queue if the server is busy ("queueing delay")
3. "service time" of a request is constant
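Not from the talk, just a toy discrete-event simulation of exactly this model, assuming Poisson arrivals, FIFO service, and a constant 5 ms service time; it shows queueing delay appearing as the arrival rate approaches the service rate:

    import random

    # Toy single-server FIFO queue with random (Poisson) arrivals and constant service time.
    # Illustrative only; the numbers are made up.
    def mean_queueing_delay(arrival_rate, service_time, n_requests=100_000, seed=1):
        random.seed(seed)
        arrival = 0.0          # arrival time of the current request
        server_free_at = 0.0   # when the server finishes its current work
        total_delay = 0.0
        for _ in range(n_requests):
            arrival += random.expovariate(arrival_rate)
            start = max(arrival, server_free_at)   # queue if the server is busy
            total_delay += start - arrival         # queueing delay for this request
            server_free_at = start + service_time
        return total_delay / n_requests

    for rate in (100, 150, 180, 195):              # service time 5 ms, so saturation at 200 req/s
        print(f"{rate} req/s -> mean queueing delay {mean_queueing_delay(rate, 0.005) * 1000:.1f} ms")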
"What's the maximum throughput of this server?" i.e. given a response time target

as the arrival rate increases, server utilization increases linearly (Utilization Law):
utilization = arrival rate × service time ("busyness")
[graph: utilization vs. arrival rate]

as utilization increases, P(request has to queue) increases, so mean queue length increases, so mean queueing delay increases.
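A quick worked example of the Utilization Law, with invented numbers:

    # utilization = arrival rate x service time; the server saturates as U approaches 1.
    service_time = 0.005                        # 5 ms per request (assumed)
    for arrival_rate in (50, 100, 150, 190):    # requests/second
        print(f"{arrival_rate} req/s -> utilization {arrival_rate * service_time:.2f}")
    # Saturation is at 1 / service_time = 200 req/s.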
Pollaczek-Khinchine (P-K) formula:
mean queueing delay = [U × linear fn(mean service time) × quadratic fn(service time variability)] / (1 - U)

assuming constant service time (and so constant request sizes):
mean queueing delay ∝ U / (1 - U)
[graph: queueing delay vs. utilization (U); since response time ∝ queueing delay, response time vs. utilization has the same shape]
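To make the hockey stick concrete, a minimal sketch of the constant-service-time case above, where mean queueing delay ∝ U / (1 - U):

    # Relative mean queueing delay under the constant-service-time assumption.
    for u in (0.5, 0.8, 0.9, 0.95, 0.99):
        print(f"U = {u:.2f} -> U/(1-U) = {u / (1 - u):.1f}")
    # 1.0, 4.0, 9.0, 19.0, 99.0: the delay explodes as utilization approaches 1.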
Putting it together: as the arrival rate increases, server utilization increases linearly (Utilization Law), so mean queueing delay increases non-linearly (P-K formula), and so does response time.
[graph: response time (ms) vs. throughput (requests/second): roughly flat in the low utilization regime, rising sharply in the high utilization regime; the response time threshold sets the max throughput]
"How can we improve the mean response time?"

1. response time ∝ queueing delay: prevent requests from queuing too long
• Controlled Delay (CoDel): in Facebook's Thrift framework.
  key insight: queues are typically empty; allows short bursts, prevents standing queues

      onNewRequest(req, queue):
          if queue.lastEmptyTime() < (now - N ms):
              # Queue was last empty more than N ms ago:
              # set timeout to M << N ms
              timeout = M ms
          else:
              # Else, set timeout to N ms
              timeout = N ms
          queue.enqueue(req, timeout)

• adaptive or always LIFO: in Facebook's PHP runtime, Dropbox's Bandaid reverse proxy.
  serve the newest requests first, not old requests that are likely to expire; helps when the system is overloaded, makes no difference when it's not
• set a max queue length
• client-side concurrency control
2. response time ∝ queueing delay, which (by the P-K formula) also depends on service time:
mean queueing delay = [U × linear fn(mean service time) × quadratic fn(service time variability)] / (1 - U)
• decrease service time by optimizing application code
• decrease request size / service time variability, for example by batching requests
model II
an industry site with N sensors, uploading to a server in the cloud; the server processes data from the N sensors. Each sensor runs:

    while true:
        # upload synchronously
        ack = upload(data)
        # update state
        deleteUploaded(ack)
        # sleep for Z seconds
        sleep(Z seconds)

• requests are synchronized
• fixed number of clients
so throughput depends on response time, and queue length is bounded (≤ N), so response time is bounded.

This is called a closed system: super different from the previous web server model (an open system).
[diagram: N clients in a request/response loop with the server]
response time vs. load for closed systems
assumptions:
1. sleep time ("think time") is constant
2. requests are processed one at a time, in FIFO order
3. service time is constant

Like earlier, as the number of clients (N) increases, throughput increases to a point, i.e. until utilization is high; after that, increasing N only increases queuing.
[graph: throughput vs. number of clients: rising in the low utilization regime, flat in the high utilization regime]
What happens to response time in this regime?
Little's Law for closed systems
a request can be in one of three states in the system: sleeping (on the device), waiting (in the server queue), or being processed (in the server).
the system in this case is the entire loop, i.e. the N clients plus the server, and the total number of requests in the system includes requests across all three states.
Little's Law for closed systems:
requests in system = throughput × round-trip time of a request across the whole system
where round-trip time = sleep time + response time, and response time = queueing delay + service time.

applying it in the high utilization regime (constant throughput), and assuming constant sleep time:
N = constant × (sleep time + response time)
so response time only grows linearly with N.
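A minimal sketch of that rearrangement, with invented numbers and assuming the server is saturated so throughput is pinned at its maximum:

    # Little's Law for the closed loop: N = X * (sleep_time + response_time).
    # In the high-utilization regime X is constant (the server's max throughput),
    # so response_time = N / X - sleep_time grows linearly with N.
    max_throughput = 200.0   # requests/second at saturation (assumed)
    sleep_time = 1.0         # seconds of think time per client (assumed)
    for n_clients in (300, 400, 500):
        print(f"N = {n_clients} -> response time {n_clients / max_throughput - sleep_time:.2f} s")
    # 0.50 s, 1.00 s, 1.50 s: linear in N.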
response time vs. load for closed systems
So, response time for a closed system:
• low utilization regime: response time stays ~the same
• high utilization regime: response time grows linearly with N
[graph: response time vs. number of clients: flat, then growing linearly]

way different from an open system, where response time grows non-linearly in the high utilization regime.
[graph: response time vs. arrival rate for the open system, for comparison]
open vs. closed systems
closed systems are very different from open systems, in:
• how throughput relates to response time
• response time versus load, especially in the high load regime

uh oh…
standard load simulators typically mimic closed systems, …but the system with real users may not be one. So load simulation might predict:
• lower response times than the actual system yields
• better tolerance to request size variability
• other differences you probably don't want to find out in production…

A couple of neat papers on the topic and workarounds: "Open Versus Closed: A Cautionary Tale", "How to Emulate Web Traffic Using Standard Load Testing Tools".
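To make the open vs. closed distinction concrete, a rough sketch (not from the talk) of the two load-generation styles; send_request is a stand-in for a real HTTP/RPC call:

    import random, threading, time

    def send_request():
        time.sleep(0.05)   # stand-in for a real call; pretend the service time is 50 ms

    def closed_loop(n_clients, duration_s, think_time_s=0.1):
        # N virtual clients; each waits for its response (plus think time) before sending again,
        # so the offered load backs off as the server slows down.
        def client():
            deadline = time.time() + duration_s
            while time.time() < deadline:
                send_request()
                time.sleep(think_time_s)
        threads = [threading.Thread(target=client) for _ in range(n_clients)]
        for t in threads: t.start()
        for t in threads: t.join()

    def open_loop(arrival_rate, duration_s):
        # Requests arrive at a fixed rate (Poisson inter-arrivals), regardless of completions,
        # so queueing (and response time) can grow without bound if the server falls behind.
        deadline = time.time() + duration_s
        while time.time() < deadline:
            threading.Thread(target=send_request).start()
            time.sleep(random.expovariate(arrival_rate))

    closed_loop(n_clients=10, duration_s=2)
    open_loop(arrival_rate=100, duration_s=2)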
a cluster of servers
clients → load balancer → cluster of web servers

"How many servers do we need to support a target throughput?" (while keeping response time the same): capacity planning
"How can we improve how the system scales?": scalability
"How many servers do we need to support a target throughput?" while keeping response time the same

max throughput of a cluster of N servers = max single server throughput × N?
no, systems don't scale linearly:
• contention penalty: due to serialization for shared resources (examples: database contention, lock contention); grows as αN
• crosstalk penalty: due to coordination for coherence (examples: servers coordinating to synchronize mutable state); grows as βN²

Universal Scalability Law (USL):
throughput of N servers = N / (αN + βN² + C)
• N / C: linear scaling
• N / (αN + C): contention
• N / (αN + βN² + C): contention and crosstalk
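A minimal sketch of how the USL curve behaves, with made-up coefficients (not from the talk); with β > 0 the curve peaks and then retrogrades:

    # Throughput per the slide's USL form: X(N) = N / (alpha*N + beta*N^2 + C).
    # alpha captures contention, beta crosstalk, C the per-request baseline cost; values are invented.
    alpha, beta, C = 0.02, 0.0005, 1.0

    def usl_throughput(n):
        return n / (alpha * n + beta * n**2 + C)

    for n in (1, 8, 16, 32, 64, 128):
        print(f"{n:4d} servers -> relative throughput {usl_throughput(n):6.2f}")
    # Scaling flattens as contention grows and eventually regresses as crosstalk dominates.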
"How can we improve how the system scales?"
Avoid contention (serialization) and crosstalk (synchronization):
• smarter data partitioning: smaller partitions in Facebook's TAO cache
• smarter aggregation: in Facebook's SCUBA data store
• better load balancing strategies: best of two random choices (see the sketch after this list)
• fine-grained locking
• MVCC databases
• etc.
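A sketch of the "best of two random choices" idea mentioned above (the general technique, not any particular implementation): pick two backends at random and send the request to the less-loaded one, avoiding the cost of tracking global load precisely:

    import random

    def pick_backend(loads):
        # Best-of-two-random-choices: sample two backends, return the less-loaded one.
        a, b = random.sample(range(len(loads)), 2)
        return a if loads[a] <= loads[b] else b

    # Toy simulation: spread 10,000 requests across 10 backends.
    loads = [0] * 10
    for _ in range(10_000):
        loads[pick_backend(loads)] += 1
    print(loads)   # counts land close to 1,000 each, with no central coordinator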
stepping back
the role of performance modeling: most useful in conjunction with empirical analysis (load simulation, experiments).
modeling requires assumptions that may be difficult to practically validate, but gives us a rigorous framework to:
• determine what experiments to run: run the experiments needed to get data to fit the USL curve, response time graphs
• interpret and evaluate the results: e.g. load simulations predicted better results than your system shows
• decide what improvements give the biggest wins: improve mean service time, reduce service time variability, remove crosstalk, etc.
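For the "fit the USL curve" step, a minimal sketch assuming scipy is available; the measurements below are invented:

    import numpy as np
    from scipy.optimize import curve_fit

    # Fit the slide's USL form X(N) = N / (alpha*N + beta*N^2 + C) to measured throughput.
    def usl(n, alpha, beta, c):
        return n / (alpha * n + beta * n**2 + c)

    servers = np.array([1, 2, 4, 8, 16, 32])
    throughput = np.array([0.9, 1.8, 3.4, 6.1, 9.8, 12.5])   # made-up measurements
    (alpha, beta, c), _ = curve_fit(usl, servers, throughput, p0=(0.01, 0.001, 1.0))
    print(f"alpha={alpha:.3f} beta={beta:.4f} C={c:.2f}")
    # For this form the throughput peak is near N* = sqrt(C / beta).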
example: load simulation results with an increasing number of virtual clients (N) = 1 … 100
[graph: response time vs. number of clients; wrong shape for the response time curve, it should be one of the two closed-system curves above]
… the load simulator hit a bottleneck.
@kavya719 · speakerdeck.com/kavya719/applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References:
• Performance Modeling and Design of Computer Systems, Mor Harchol-Balter
• Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz
• Open Versus Closed: A Cautionary Tale
• How to Emulate Web Traffic Using Standard Load Testing Tools
• Queuing Theory in Practice
• Fail at Scale
• Kraken: Leveraging Live Traffic Tests
• SCUBA: Diving into Data at Facebook
On CoDel at Facebook: "An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases."
Using LIFO to select the thread to run next, to reduce mutex and cache thrashing and context-switching overhead.
utilization = throughput × service time (Utilization Law)
as throughput increases, utilization ("busyness") increases, so queueing delay increases (non-linearly), and so does response time.
[graph: latency vs. throughput, non-linear responses to load]

…is this good, or is there a bottleneck? Facebook sets target cluster capacity = 93% of theoretical; if cluster capacity is only ~90% of theoretical, there's a bottleneck to fix.
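A toy version of that capacity check, with invented numbers:

    single_server_max = 500.0    # requests/second for one server (assumed)
    n_servers = 20
    theoretical = single_server_max * n_servers     # 10,000 req/s
    measured_peak = 9_000.0                         # peak from a live traffic test (assumed)
    print(f"cluster capacity = {measured_peak / theoretical:.0%} of theoretical")   # 90%
    # Below the 93% target, so go find the bottleneck.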
[graph: throughput vs. concurrency, non-linear scaling]
microservices: systems are complex
continuous deploys: systems are in flux
load generation: need a representative workload, i.e. profile (read / write requests) and arrival pattern, including traffic bursts; capture and replay, or …use live traffic.

traffic shifting: adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic to a cluster / region / server.
![Page 2: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/2.jpg)
kavya
applyingperformance theory
to practice
performance
capacity
bull Whatrsquos the additional load the system can support without degrading response time
bull Whatrsquore the system utilization bottlenecks
bull Whatrsquos the impact of a change on response timemaximum throughput
bull How many additional servers to support 10x load
bull Is the system over-provisioned
YOLO method
load simulation Stressing the system to empirically determine actual performance characteristics bottlenecksCan be incredibly powerful
performance modeling
performance modeling
real-world system theoretical model
resultsanalyze
translate back
model as
makes assumptions about the system request arrival rate service order times cannot apply the results if your system does not satisfy them
a cluster of many servers the USL
scaling bottlenecks
a single server open closed queueing systems
utilization law Littlersquos law the P-K formula CoDel adaptive LIFO
stepping back the role of performance modeling
a single server
model I
clients web server
ldquohow can we improve the mean response timerdquo
ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo
resp
onse
tim
e (m
s)
throughput (requests second)
response time threshold
model the web server as a queueing system
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases
utilization = arrival rate service time
ldquobusynessrdquo
utili
zatio
n
arrival rate
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
ldquoHow can we improve the mean response timerdquo
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 3: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/3.jpg)
applyingperformance theory
to practice
performance
capacity
bull Whatrsquos the additional load the system can support without degrading response time
bull Whatrsquore the system utilization bottlenecks
bull Whatrsquos the impact of a change on response timemaximum throughput
bull How many additional servers to support 10x load
bull Is the system over-provisioned
YOLO method
load simulation Stressing the system to empirically determine actual performance characteristics bottlenecksCan be incredibly powerful
performance modeling
performance modeling
real-world system theoretical model
resultsanalyze
translate back
model as
makes assumptions about the system request arrival rate service order times cannot apply the results if your system does not satisfy them
a cluster of many servers the USL
scaling bottlenecks
a single server open closed queueing systems
utilization law Littlersquos law the P-K formula CoDel adaptive LIFO
stepping back the role of performance modeling
a single server
model I
clients web server
ldquohow can we improve the mean response timerdquo
ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo
resp
onse
tim
e (m
s)
throughput (requests second)
response time threshold
model the web server as a queueing system
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases
utilization = arrival rate service time
ldquobusynessrdquo
utili
zatio
n
arrival rate
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
ldquoHow can we improve the mean response timerdquo
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 4: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/4.jpg)
performance
capacity
bull Whatrsquos the additional load the system can support without degrading response time
bull Whatrsquore the system utilization bottlenecks
bull Whatrsquos the impact of a change on response timemaximum throughput
bull How many additional servers to support 10x load
bull Is the system over-provisioned
YOLO method
load simulation Stressing the system to empirically determine actual performance characteristics bottlenecksCan be incredibly powerful
performance modeling
performance modeling
real-world system theoretical model
resultsanalyze
translate back
model as
makes assumptions about the system request arrival rate service order times cannot apply the results if your system does not satisfy them
a cluster of many servers the USL
scaling bottlenecks
a single server open closed queueing systems
utilization law Littlersquos law the P-K formula CoDel adaptive LIFO
stepping back the role of performance modeling
a single server
model I
clients web server
ldquohow can we improve the mean response timerdquo
ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo
resp
onse
tim
e (m
s)
throughput (requests second)
response time threshold
model the web server as a queueing system
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases
utilization = arrival rate service time
ldquobusynessrdquo
utili
zatio
n
arrival rate
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
ldquoHow can we improve the mean response timerdquo
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems

closed systems are very different from open systems in:
• how throughput relates to response time
• response time versus load, especially in the high load regime

uh oh… standard load simulators typically mimic closed systems, but the system with real users may not be one.

So load simulation might predict:
• lower response times than the actual system yields
• better tolerance to request size variability
• other differences you probably don't want to find out in production…

A couple of neat papers on the topic and workarounds: "Open Versus Closed: A Cautionary Tale" and "How to Emulate Web Traffic Using Standard Load Testing Tools".
a cluster of servers

[diagram: clients, a load balancer, and a cluster of web servers]

capacity planning: "How many servers do we need to support a target throughput?" (while keeping response time the same)
scalability: "How can we improve how the system scales?"

Is the max throughput of a cluster of N servers = max single-server throughput × N?
No, systems don't scale linearly:
• contention penalty (the αN term): due to serialization for shared resources; examples: database contention, lock contention
• crosstalk penalty (the βN² term): due to coordination for coherence; example: servers coordinating to synchronize mutable state
Universal Scalability Law (USL)

throughput of N servers = N / (αN + βN² + C)

[graph: throughput vs. N for the three cases]
• N / C: linear scaling
• N / (αN + C): contention only
• N / (αN + βN² + C): contention and crosstalk
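A tiny sketch of that USL form, handy for eyeballing where throughput flattens or rolls over; the α, β, C values below are made-up illustration numbers, not fitted to any real system:

    def usl_throughput(n, alpha, beta, c):
        # Throughput of n servers under the talk's USL form: n / (alpha*n + beta*n^2 + c)
        return n / (alpha * n + beta * n ** 2 + c)

    # c ~ single-server cost, alpha ~ contention, beta ~ crosstalk.
    # With crosstalk, throughput peaks and then degrades as n keeps growing.
    for n in (1, 8, 32, 64, 128):
        print(n, round(usl_throughput(n, alpha=0.03, beta=0.0005, c=1.0), 1))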
"How can we improve how the system scales?"
Avoid contention (serialization) and crosstalk (synchronization):
• smarter data partitioning: smaller partitions in Facebook's TAO cache
• smarter aggregation: in Facebook's SCUBA data store
• better load balancing strategies: best of two random choices (sketched below)
• fine-grained locking
• MVCC databases
• etc.
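A minimal sketch of the "best of two random choices" strategy from the list above; the Server class and its outstanding-request counter are illustrative assumptions:

    import random

    class Server:
        def __init__(self, name):
            self.name = name
            self.outstanding = 0   # requests currently queued or in flight

    def pick_best_of_two(servers):
        # Sample two distinct servers at random and route to the less loaded one;
        # avoids the coordination cost of tracking the exact load of every server.
        a, b = random.sample(servers, 2)
        return a if a.outstanding <= b.outstanding else b

    servers = [Server(f"web{i}") for i in range(8)]
    target = pick_best_of_two(servers)
    target.outstanding += 1
    print("routing request to", target.name)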
stepping back

the role of performance modeling: it is most useful in conjunction with empirical analysis (load simulation, experiments).

modeling requires assumptions that may be difficult to validate in practice, but it gives us a rigorous framework to:
• determine what experiments to run: the experiments needed to get the data to fit the USL curve, response time graphs
• interpret and evaluate the results: e.g. load simulations predicted better results than your system shows
• decide what improvements give the biggest wins: improve mean service time, reduce service time variability, remove crosstalk, etc.
example: load simulation results with an increasing number of virtual clients, N = 1 … 100
[graph: response time vs. number of clients, from the load simulation]
wrong shape for the response time curve: it should look like one of the two curves above (the open-system or the closed-system one)
… the load simulator itself had hit a bottleneck.
kavya719
speakerdeck.com/kavya719/applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this.

References:
• Performance Modeling and Design of Computer Systems, Mor Harchol-Balter
• Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz
• Open Versus Closed: A Cautionary Tale
• How to Emulate Web Traffic Using Standard Load Testing Tools
• Queuing Theory in Practice
• Fail at Scale; Kraken: Leveraging Live Traffic Tests
• SCUBA: Diving into Data at Facebook
• On CoDel at Facebook: "An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases."
• On adaptive LIFO: using LIFO to select the thread to run next, to reduce mutex cache thrashing and context switching overhead.
utilization = throughput × service time (Utilization Law), i.e. "busyness".
as throughput increases, utilization increases, so queueing delay increases (non-linearly), and so does response time.

Facebook sets target cluster capacity = 93% of theoretical.
…is this good, or is there a bottleneck? if a cluster's measured capacity is only ~90% of theoretical, there's a bottleneck to fix.
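A back-of-the-envelope version of that capacity check; all of the numbers below are made up for illustration:

    single_server_max_rps = 500        # max throughput of one server at the latency target
    n_servers = 100
    theoretical_capacity = single_server_max_rps * n_servers

    measured_capacity = 45_000         # peak rps the cluster sustained in a live traffic test
    ratio = measured_capacity / theoretical_capacity
    print(f"cluster is at {ratio:.0%} of theoretical")   # 90%: below the 93% target, so find the bottleneck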
[graphs: latency vs. throughput (non-linear responses to load), throughput vs. concurrency (non-linear scaling)]
• microservices: systems are complex
• continuous deploys: systems are in flux

load generation needs a representative workload: the right profile of read/write requests and the right arrival pattern, including traffic bursts. capture and replay is one approach; …or use live traffic.

traffic shifting: adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic sent to a cluster, region, or server.
![Page 5: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/5.jpg)
YOLO method
load simulation Stressing the system to empirically determine actual performance characteristics bottlenecksCan be incredibly powerful
performance modeling
performance modeling
real-world system theoretical model
resultsanalyze
translate back
model as
makes assumptions about the system request arrival rate service order times cannot apply the results if your system does not satisfy them
a cluster of many servers the USL
scaling bottlenecks
a single server open closed queueing systems
utilization law Littlersquos law the P-K formula CoDel adaptive LIFO
stepping back the role of performance modeling
a single server
model I
clients web server
ldquohow can we improve the mean response timerdquo
ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo
resp
onse
tim
e (m
s)
throughput (requests second)
response time threshold
model the web server as a queueing system
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases
utilization = arrival rate service time
ldquobusynessrdquo
utili
zatio
n
arrival rate
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
ldquoHow can we improve the mean response timerdquo
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 6: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/6.jpg)
performance modeling
real-world system theoretical model
resultsanalyze
translate back
model as
makes assumptions about the system request arrival rate service order times cannot apply the results if your system does not satisfy them
a cluster of many servers the USL
scaling bottlenecks
a single server open closed queueing systems
utilization law Littlersquos law the P-K formula CoDel adaptive LIFO
stepping back the role of performance modeling
a single server
model I
clients web server
ldquohow can we improve the mean response timerdquo
ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo
resp
onse
tim
e (m
s)
throughput (requests second)
response time threshold
model the web server as a queueing system
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases
utilization = arrival rate service time
ldquobusynessrdquo
utili
zatio
n
arrival rate
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
ldquoHow can we improve the mean response timerdquo
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 7: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/7.jpg)
a cluster of many servers the USL
scaling bottlenecks
a single server open closed queueing systems
utilization law Littlersquos law the P-K formula CoDel adaptive LIFO
stepping back the role of performance modeling
a single server
model I
clients web server
ldquohow can we improve the mean response timerdquo
ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo
resp
onse
tim
e (m
s)
throughput (requests second)
response time threshold
model the web server as a queueing system
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases
utilization = arrival rate service time
ldquobusynessrdquo
utili
zatio
n
arrival rate
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
ldquoHow can we improve the mean response timerdquo
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 8: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/8.jpg)
a single server
model I
clients web server
ldquohow can we improve the mean response timerdquo
ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo
resp
onse
tim
e (m
s)
throughput (requests second)
response time threshold
model the web server as a queueing system
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases
utilization = arrival rate service time
ldquobusynessrdquo
utili
zatio
n
arrival rate
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
"How can we improve the mean response time?"

1. response time ∝ queueing delay: prevent requests from queueing too long.
• Controlled Delay (CoDel), in Facebook's Thrift framework
• adaptive or always-LIFO, in Facebook's PHP runtime and Dropbox's Bandaid reverse proxy
• set a max queue length
• client-side concurrency control

CoDel (pseudocode):

    onNewRequest(req, queue):
        if queue.lastEmptyTime() < (now - N ms):
            # Queue was last empty more than N ms ago;
            # set timeout to M << N ms.
            timeout = M ms
        else:
            # Else set timeout to N ms.
            timeout = N ms
        queue.enqueue(req, timeout)

key insight: queues are typically empty. This allows short bursts but prevents standing queues.

LIFO serves the newest requests first, not old requests that are likely to expire; it helps when the system is overloaded and makes no difference when it's not.
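A minimal runnable sketch of the same adaptive-timeout idea, assuming the M = 5 ms and N = 100 ms values quoted in the references below; this is an illustration, not Facebook's Thrift implementation.

    import time
    from collections import deque

    M, N = 0.005, 0.100   # assumed: 5 ms and 100 ms, per the CoDel quote in the references

    class AdaptiveTimeoutQueue:
        def __init__(self):
            self.items = deque()                 # (request, deadline)
            self.last_empty = time.monotonic()

        def enqueue(self, req):
            now = time.monotonic()
            if not self.items:
                self.last_empty = now
            # Standing queue (not empty for > N seconds): give new requests
            # only M seconds before they can be shed; otherwise allow N.
            timeout = M if self.last_empty < now - N else N
            self.items.append((req, now + timeout))

        def dequeue(self):
            now = time.monotonic()
            while self.items:
                req, deadline = self.items.popleft()
                if not self.items:
                    self.last_empty = now
                if deadline >= now:
                    return req                   # still within its timeout: serve it
                # expired while queued: drop it instead of serving stale work
            return None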
"How can we improve the mean response time?"

2. response time ∝ queueing delay, and by the P-K formula,
mean queueing delay = U × linear fn(mean service time) × quadratic fn(service time variability) / (1 − U)
• decrease request size (service time) variability, for example by batching requests
• decrease service time by optimizing application code
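A quick sketch (illustrative numbers) of why the variability term matters: same arrival rate and mean service time, but the variable workload waits roughly ten times longer.

    # P-K delay at the same utilization, with and without service-time variability.
    arrival_rate = 80.0          # req/s (assumed)
    mean_service = 0.010         # 10 ms (assumed)
    utilization = arrival_rate * mean_service            # 0.8

    for label, variance in (("constant service times", 0.0),
                            ("variable service times (std dev 30 ms)", 0.030 ** 2)):
        second_moment = variance + mean_service ** 2     # E[S^2]
        delay = arrival_rate * second_moment / (2 * (1 - utilization))
        print(f"{label}: mean queueing delay = {delay * 1000:.0f} ms")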
model II

N sensors at an industry site upload data to a server in the cloud; the server processes data from the N sensors. Each sensor runs a loop:

    while true:
        ack = upload(data)       # upload synchronously
        deleteUploaded(ack)      # update state
        sleep(Z seconds)         # sleep for Z seconds

• requests are synchronized
• fixed number of clients
So throughput depends on response time, and queue length is bounded (≤ N), so response time is bounded too.

This is called a closed system: super different from the previous web server model (an open system).

[diagram: N clients, each sleeping or waiting on a request, in a loop with the server]
response time vs. load for closed systems

assumptions:
1. sleep time ("think time") is constant
2. requests are processed one at a time, in FIFO order
3. service time is constant

Like earlier, as the number of clients (N) increases, throughput increases up to a point, i.e. until utilization is high; after that, increasing N only increases queueing.

[graph: throughput vs. number of clients: rising in the low utilization regime, flat in the high utilization regime]

What happens to response time in this regime?
Little's Law for closed systems

The system here is the entire loop, i.e. the N clients plus the server. A request can be in one of three states: sleeping (on the device), waiting (in the server queue), or being processed (in the server); the total number of requests in the system counts all three.

requests in system = throughput × round-trip time of a request across the whole system
                   = throughput × (sleep time + response time)
where response time = queueing delay + service time.

Applying it in the high utilization regime (constant throughput) and assuming constant sleep time:
N = constant × (sleep time + response time)
so response time only grows linearly with N.
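A small sketch of that rearrangement; the 100 req/s server capacity and 2 s sleep time are assumed, purely for illustration.

    # Little's Law in the closed-sensor model, high utilization regime:
    # N = throughput * (sleep time + response time)  =>  solve for response time.
    def closed_response_time(n_clients, max_throughput, sleep_time):
        return n_clients / max_throughput - sleep_time

    max_throughput = 100.0   # req/s the server can sustain (assumed)
    sleep_time = 2.0         # seconds between a sensor's uploads (assumed)
    for n in (250, 500, 1000):
        print(f"N = {n}: response time ≈ {closed_response_time(n, max_throughput, sleep_time):.1f} s")

Each additional client adds a fixed amount (1 / max throughput) to the response time, which is exactly the linear growth in the graph below.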
response time vs. load for closed systems

So, response time for a closed system:
• low utilization regime: response time stays ~the same
• high utilization regime: grows linearly with N

[graph: response time vs. number of clients: flat, then growing linearly]

…way different than for an open system, where response time in the high utilization regime grows non-linearly with the arrival rate.

[graph: response time vs. arrival rate for an open system: flat, then blowing up]
open vs. closed systems

Closed systems are very different from open systems in:
• how throughput relates to response time
• response time versus load, especially in the high load regime

uh oh… standard load simulators typically mimic closed systems, but the system with real users may not be one.

So load simulation might predict:
• lower response times than the actual system yields
• better tolerance to request size variability
• other differences you probably don't want to find out in production…

A couple of neat papers on the topic and workarounds: "Open Versus Closed: A Cautionary Tale" and "How to Emulate Web Traffic Using Standard Load Testing Tools".
a cluster of servers

[diagram: clients, a load balancer, and a cluster of web servers]

capacity planning: "How many servers do we need to support a target throughput?" (while keeping response time the same)
scalability: "How can we improve how the system scales?"
Is max throughput of a cluster of N servers = max single-server throughput × N?

No: systems don't scale linearly.
• contention penalty (the αN term): due to serialization for shared resources; examples: database contention, lock contention
• crosstalk penalty (the βN² term): due to coordination for coherence; examples: servers coordinating to synchronize mutable state

Universal Scalability Law (USL):

throughput of N servers = N / (αN + βN² + C)

[graph: throughput vs. N for three curves: N/C (linear scaling), N/(αN + C) (contention only), N/(αN + βN² + C) (contention and crosstalk)]
"How can we improve how the system scales?"

Avoid contention (serialization) and crosstalk (synchronization):
• smarter data partitioning: smaller partitions in Facebook's TAO cache
• smarter aggregation: in Facebook's SCUBA data store
• better load balancing strategies: best of two random choices
• fine-grained locking
• MVCC databases
• etc.
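A minimal sketch (with made-up α, β, C values, not fitted ones) of the USL curve above; in practice you would fit α and β to measured throughput at a few cluster sizes, e.g. with scipy.optimize.curve_fit, and read off where adding servers stops paying for itself.

    # Evaluate the USL-style curve: throughput(N) = N / (alpha*N + beta*N^2 + C).
    def usl_throughput(n, alpha, beta, c):
        return n / (alpha * n + beta * n ** 2 + c)

    alpha, beta, c = 0.03, 0.0005, 1.0    # illustrative contention / crosstalk / baseline terms
    for n in (1, 8, 16, 32, 64, 128):
        print(f"{n:4d} servers -> relative throughput {usl_throughput(n, alpha, beta, c):6.2f}")
    # Throughput peaks in the tens of servers, then falls as the N^2 crosstalk term dominates.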
stepping back: the role of performance modeling

Modeling requires assumptions that may be difficult to practically validate, but it gives us a rigorous framework to:
• determine what experiments to run: the experiments needed to get the data to fit the USL curve, response time graphs
• interpret and evaluate the results: e.g. load simulations predicted better results than your system shows
• decide what improvements give the biggest wins: improve mean service time, reduce service time variability, remove crosstalk, etc.

Performance modeling is most useful in conjunction with empirical analysis: load simulation, experiments.
load simulation results with increasing number of virtual clients, N = 1 … 100

[graph: measured response time vs. number of clients]

Wrong shape for the response time curve: it should be one of the two curves above. … the load simulator hit a bottleneck.
@kavya719 · speakerdeck.com/kavya719/applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this.

References:
• Performance Modeling and Design of Computer Systems, Mor Harchol-Balter
• Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz
• Open Versus Closed: A Cautionary Tale
• How to Emulate Web Traffic Using Standard Load Testing Tools
• Queueing Theory In Practice
• Fail at Scale
• Kraken: Leveraging Live Traffic Tests
• SCUBA: Diving into Data at Facebook

On CoDel at Facebook: "An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases."

On LIFO: using LIFO to select the thread to run next reduces mutex cache thrashing and context-switching overhead.
As throughput increases, utilization increases (utilization = throughput × service time, the Utilization Law, i.e. "busyness"), and queueing delay increases (non-linearly), so response time does too.

…is this good, or is there a bottleneck? Facebook sets target cluster capacity = 93% of theoretical; if measured cluster capacity is only ~90% of theoretical, there's a bottleneck to fix.
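A tiny sketch of that comparison (the server count and throughputs are made-up numbers, not Facebook's):

    # Compare measured cluster capacity against theoretical capacity
    # (theoretical = max single-server throughput * number of servers).
    servers = 40
    max_server_throughput = 500.0   # req/s one server sustains at the response-time target (assumed)
    theoretical = servers * max_server_throughput

    measured = 18_000.0             # req/s the cluster sustained at the same target (assumed)
    print(f"cluster capacity = {measured / theoretical:.0%} of theoretical")  # 90%: below target, find the bottleneck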
Systems respond non-linearly to load (latency vs. throughput) and scale non-linearly (throughput vs. concurrency); with microservices, systems are complex, and with continuous deploys, systems are in flux.

load generation: need a representative workload, i.e. the profile (read / write requests) and the arrival pattern, including traffic bursts.
• capture and replay
• …or use live traffic: traffic shifting

traffic shifting: adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic sent to a cluster / region / server.
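As a rough illustration only (a weighted random-choice balancer, not Facebook's actual edge / cluster / server mechanism), bumping one server's weight shifts a larger fraction of traffic onto it:

    import random

    weights = {"server-a": 1.0, "server-b": 1.0, "server-c": 1.0}
    weights["server-a"] = 3.0            # shift extra traffic onto server-a to probe its capacity

    def pick_server():
        names, w = zip(*weights.items())
        return random.choices(names, weights=w, k=1)[0]

    counts = {name: 0 for name in weights}
    for _ in range(100_000):
        counts[pick_server()] += 1
    print({name: round(c / 100_000, 2) for name, c in counts.items()})   # ~0.60 / 0.20 / 0.20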
![Page 9: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/9.jpg)
model I
clients web server
ldquohow can we improve the mean response timerdquo
ldquowhatrsquos the maximum throughput of this server given a response time targetrdquo
resp
onse
tim
e (m
s)
throughput (requests second)
response time threshold
model the web server as a queueing system
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases
utilization = arrival rate service time
ldquobusynessrdquo
utili
zatio
n
arrival rate
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
ldquoHow can we improve the mean response timerdquo
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 10: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/10.jpg)
model the web server as a queueing system
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases
utilization = arrival rate service time
ldquobusynessrdquo
utili
zatio
n
arrival rate
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
ldquoHow can we improve the mean response timerdquo
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 11: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/11.jpg)
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases
utilization = arrival rate service time
ldquobusynessrdquo
utili
zatio
n
arrival rate
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
ldquoHow can we improve the mean response timerdquo
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 12: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/12.jpg)
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
model the web server as a queueing system
assumptions
1 requests are independent and random arrive at some ldquoarrival raterdquo 2 requests are processed one at a time in FIFO order
requests queue if server is busy (ldquoqueueing delayrdquo) 3 ldquoservice timerdquo of a request is constant
web server
request response
queueing delay + service time = response time
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases
utilization = arrival rate service time
ldquobusynessrdquo
utili
zatio
n
arrival rate
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
"What's the maximum throughput of this server?" i.e., given a response time target

arrival rate increases, so server utilization increases linearly (Utilization Law), so mean queueing delay increases non-linearly (P-K formula), and so response time does too

[graph: response time (ms) vs. throughput (requests / second): flat in the low utilization regime, rising steeply in the high utilization regime; max throughput is where the curve crosses the response time threshold]
"How can we improve the mean response time?"

1. response time ∝ queueing delay: prevent requests from queueing too long
• Controlled Delay (CoDel), in Facebook's Thrift framework
• adaptive or always-LIFO, in Facebook's PHP runtime and Dropbox's Bandaid reverse proxy
• set a max queue length
• client-side concurrency control
CoDel's adaptive queue timeout (pseudocode):

    onNewRequest(req, queue):
        if queue.lastEmptyTime() < (now - N ms):
            # Queue was last empty more than N ms ago:
            # set timeout to M << N ms
            timeout = M ms
        else:
            # Else set timeout to N ms
            timeout = N ms
        queue.enqueue(req, timeout)
key insight: queues are typically empty
CoDel's adaptive timeout allows short bursts but prevents standing queues

adaptive LIFO serves the newest requests first, not old requests that are likely to expire anyway
it helps when the system is overloaded, and makes no difference when it's not
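A rough, runnable Python sketch of the same idea (my own, not Facebook's implementation; M = 5 ms and N = 100 ms are the values quoted in the appendix note below):

    import time
    from collections import deque

    M_MS, N_MS = 5, 100  # aggressive vs. normal queue timeouts

    class CoDelQueue:
        def __init__(self):
            self.items = deque()               # (request, deadline) pairs
            self.last_empty = time.monotonic()

        def enqueue(self, req):
            now = time.monotonic()
            if not self.items:
                self.last_empty = now
            # Standing queue (not empty for more than N ms)? Use the short timeout M.
            standing = (now - self.last_empty) * 1000 > N_MS
            timeout_ms = M_MS if standing else N_MS
            self.items.append((req, now + timeout_ms / 1000))

        def dequeue(self):
            now = time.monotonic()
            while self.items:
                req, deadline = self.items.popleft()
                if not self.items:
                    self.last_empty = now
                if now <= deadline:
                    return req
                # else: the request timed out while queued; drop it
            self.last_empty = now
            return None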
"How can we improve the mean response time?"

2. response time ∝ queueing delay = U / (1 - U) × (linear fn of mean service time) × (quadratic fn of service time variability) (P-K formula)
• decrease request size (and so service time) variability, for example by batching requests
• decrease the mean service time, by optimizing application code
model II
an industry site with N sensors; a server in the cloud processes the data from the N sensors

each sensor runs a loop:

    while true:
        # upload synchronously
        ack = upload(data)
        # update state
        deleteUploaded(ack)
        # sleep for Z seconds
        sleep(Z seconds)

• requests are synchronized
• fixed number of clients
so throughput depends on response time, and the queue length is bounded (<= N), so response time is bounded

This is called a closed system, super different from the previous web server model (an open system).

[diagram: N clients in a loop with the server: request, response, sleep]
response time vs load for closed systems

assumptions
1. sleep time ("think time") is constant
2. requests are processed one at a time, in FIFO order
3. service time is constant

Like earlier, as the number of clients (N) increases, throughput increases up to a point, i.e. until utilization is high; after that, increasing N only increases queueing.

[graph: throughput vs. number of clients: rises in the low utilization regime, flat in the high utilization regime]

What happens to response time in this regime?
Little's Law for closed systems

the "system" here is the entire loop, i.e. all N clients plus the server
a request can be in one of three states in the system: sleeping (on the device), waiting (in the server queue), or being processed (in the server)
the total number of requests in the system counts requests across all three states

Little's Law:
requests in system = throughput × round-trip time of a request across the whole system
N = throughput × (sleep time + response time)
where response time = queueing delay + service time

applying it in the high utilization regime (constant throughput), and assuming constant sleep time:
N = constant × (sleep time + response time)
so response time only grows linearly with N
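A quick worked example of that rearrangement (numbers invented, not from the talk):

    # Little's Law for the closed system: N = X * (sleep_time + response_time),
    # so response_time = N / X - sleep_time once throughput X has saturated.
    X = 50.0            # saturated server throughput, requests/second
    sleep_time = 10.0   # seconds between uploads ("think time")
    for n_sensors in (1000, 2000):
        response_time = n_sensors / X - sleep_time
        print(n_sensors, response_time)   # 1000 -> 10 s, 2000 -> 30 s: linear in N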
response time vs load for closed systems

So response time for a closed system:
• low utilization regime: stays ~the same as N increases
• high utilization regime: grows linearly with N

[graph: response time vs. number of clients: flat, then a straight line up]

way different than for an open system
[graph: response time vs. arrival rate for an open system: rises non-linearly in the high utilization regime]
open vs closed systems

closed systems are very different from open systems, in:
• how throughput relates to response time
• response time versus load, especially in the high load regime

uh oh…
standard load simulators typically mimic closed systems
…but the system with real users may not be one

So load simulation might predict:
• lower response times than the actual system yields
• better tolerance to request size variability
• other differences you probably don't want to find out in production…

A couple of neat papers on the topic and workarounds: Open Versus Closed: A Cautionary Tale; How to Emulate Web Traffic Using Standard Load Testing Tools
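To make the distinction concrete, here's a toy asyncio sketch (mine, with invented numbers) of the two kinds of load generator; the closed one naturally throttles itself when the target slows down, while the open one keeps firing:

    import asyncio, random

    async def fake_request():
        # stand-in for a call to the system under test (~50 ms on average)
        await asyncio.sleep(random.expovariate(1 / 0.05))

    async def closed_user(think_time):
        # closed loop: the next request is sent only after the previous
        # response arrives, followed by a "think time" sleep
        while True:
            await fake_request()
            await asyncio.sleep(think_time)

    async def open_generator(arrival_rate):
        # open loop: requests arrive at a fixed rate whether or not earlier
        # ones have finished, so a backlog can build without bound
        while True:
            asyncio.ensure_future(fake_request())
            await asyncio.sleep(random.expovariate(arrival_rate))

    async def main():
        users = [asyncio.ensure_future(closed_user(0.5)) for _ in range(20)]
        gen = asyncio.ensure_future(open_generator(100))
        await asyncio.sleep(5)
        for task in users + [gen]:
            task.cancel()

    asyncio.run(main())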
a cluster of servers
clients, a load balancer, a cluster of web servers

"How many servers do we need to support a target throughput?" (while keeping response time the same): capacity planning
"How can we improve how the system scales?": scalability

max throughput of a cluster of N servers = max single-server throughput × N?
no, systems don't scale linearly:
• contention penalty (the αN term): due to serialization for shared resources; examples: database contention, lock contention
• crosstalk penalty (the βN² term): due to coordination for coherence; example: servers coordinating to synchronize mutable state

Universal Scalability Law (USL):
throughput of N servers = N / (αN + βN² + C)

N / C: linear scaling
N / (αN + C): contention only
N / (αN + βN² + C): contention and crosstalk
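As a sketch of how the formula behaves (the coefficients here are made up, in the same form as above):

    def usl_throughput(n, alpha, beta, c):
        # Universal Scalability Law, in the form used above:
        #   throughput(N) = N / (alpha*N + beta*N^2 + C)
        return n / (alpha * n + beta * n * n + c)

    alpha, beta, c = 0.03, 0.0002, 1.0   # some contention, a little crosstalk
    for n in (1, 8, 32, 64, 128):
        print(n, round(usl_throughput(n, alpha, beta, c), 1))
    # throughput climbs, flattens, and eventually degrades once the
    # crosstalk term (beta * N^2) dominates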
"How can we improve how the system scales?"
Avoid contention (serialization) and crosstalk (synchronization):
• smarter data partitioning: smaller partitions in Facebook's TAO cache
• smarter aggregation: in Facebook's SCUBA data store
• better load balancing strategies: best of two random choices
• fine-grained locking
• MVCC databases
• etc.
stepping back

the role of performance modeling: most useful in conjunction with empirical analysis (load simulation, experiments)

modeling requires assumptions that may be difficult to practically validate, but gives us a rigorous framework to:
• determine what experiments to run: run the experiments needed to get data to fit the USL curve, response time graphs
• interpret and evaluate the results: load simulations predicted better results than your system shows?
• decide what improvements give the biggest wins: improve mean service time, reduce service time variability, remove crosstalk, etc.
load simulation results with increasing number of virtual clients (N) = 1 … 100

[graph: response time vs. number of clients: wrong shape for the response time curve; it should be one of the two closed-system curves above]

… the load simulator hit a bottleneck
kavya719
speakerdeck.com/kavya719/applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this.

References:
• Performance Modeling and Design of Computer Systems, Mor Harchol-Balter
• Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz
• Open Versus Closed: A Cautionary Tale
• How to Emulate Web Traffic Using Standard Load Testing Tools
• Queueing Theory In Practice
• Fail at Scale
• Kraken: Leveraging Live Traffic Tests
• SCUBA: Diving into Data at Facebook

On CoDel at Facebook: "An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases."

On LIFO: LIFO is also used to select the thread to run next, to reduce mutex cache thrashing and context-switching overhead.
as throughput increases, utilization ("busyness") increases:
utilization = throughput × service time (Utilization Law)
so queueing delay, and hence response time, increases (non-linearly)
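As a rough illustration of backing capacity out of live measurements (the numbers here are invented, not Facebook's):

    # Utilization Law applied to live counters.
    throughput = 4200.0          # requests/second currently served
    mean_service_time = 0.0002   # seconds of server time per request
    utilization = throughput * mean_service_time   # 0.84
    theoretical_max = throughput / utilization      # 5000 requests/second at U = 1.0
    print(utilization, theoretical_max)
    # if response time degrades at ~90% of that theoretical max, while the
    # target is 93%, there's a bottleneck to find and fix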
…is this good, or is there a bottleneck?
cluster capacity is ~90% of theoretical, so there's a bottleneck to fix
(Facebook sets target cluster capacity = 93% of theoretical)

non-linear responses to load (throughput vs. latency)
non-linear scaling (throughput vs. concurrency)
microservices: systems are complex
continuous deploys: systems are in flux

load generation: need a representative workload
profile (read / write requests), arrival pattern including traffic bursts
• capture and replay
• …or use live traffic

traffic shifting
adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic to a cluster / region / server
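A toy sketch of weight-based traffic shifting (the names, weights, and API are hypothetical, just to illustrate the mechanism):

    import random

    # Bumping one cluster's weight shifts a larger fraction of live traffic
    # onto it, which is how a Kraken-style test ramps load on that cluster.
    cluster_weights = {"cluster-a": 1.0, "cluster-b": 1.0, "cluster-c": 1.0}

    def pick_cluster():
        clusters = list(cluster_weights)
        weights = [cluster_weights[c] for c in clusters]
        return random.choices(clusters, weights=weights, k=1)[0]

    # ramp cluster-a to twice its normal share, watch its health metrics
    # (response time, error rate), and back off if they degrade
    cluster_weights["cluster-a"] = 2.0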
![Page 16: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/16.jpg)
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
ldquoHow can we improve the mean response timerdquo
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Little's Law for closed systems:

requests in system = throughput × round-trip time of a request across the whole system
                   = throughput × (sleep time + response time)

where, at the server, queueing delay + service time = response time.

Applying it in the high utilization regime (throughput is constant) and assuming constant sleep time, N is a linear function of response time.

So response time only grows linearly with N.
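A small sketch of what Little's Law predicts for this closed loop: below saturation, response time stays near the service time; above it, response time grows linearly with N. The 5 ms service time and 1 s think time are assumed for illustration.

```python
# Closed-system view of Little's Law: N = X * (Z + R).
# A single server caps throughput at X_max = 1 / S once it saturates.
S = 0.005   # service time: 5 ms (assumed)
Z = 1.0     # sleep ("think") time: 1 s (assumed)

def response_time(n):
    x_unsaturated = n / (Z + S)          # low utilization: R ~= S, so X ~= N / (Z + S)
    if x_unsaturated < 1.0 / S:
        return S                          # response time stays about the same
    return n * S - Z                      # high utilization: X = 1/S, so R = N*S - Z, linear in N

for n in (10, 100, 201, 400, 1000):
    print(n, round(response_time(n), 4))
```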
response time vs load for closed systems

So, response time for a closed system:
• low utilization regime: response time stays about the same
• high utilization regime: response time grows linearly with N

[graph: response time vs number of clients: flat at low utilization, then growing linearly at high utilization]

This is way different from an open system, where response time grows non-linearly with arrival rate in the high utilization regime.

[graph: response time vs arrival rate for an open system: blowing up non-linearly at high utilization]
open vs closed systems

Closed systems are very different from open systems in:
• how throughput relates to response time
• response time versus load, especially in the high load regime

uh oh… standard load simulators typically mimic closed systems, but the system with real users may not be one.

So load simulation might predict:
• lower response times than the actual system yields
• better tolerance to request size variability
• other differences you probably don't want to find out in production…

A couple of neat papers on the topic and workarounds: "Open Versus Closed: A Cautionary Tale" and "How to Emulate Web Traffic Using Standard Load Testing Tools".
a cluster of servers

[diagram: clients → load balancer → cluster of web servers]

"How many servers do we need to support a target throughput?" (while keeping response time the same): capacity planning
"How can we improve how the system scales?": scalability
Is the max throughput of a cluster of N servers = max single-server throughput × N?

No, systems don't scale linearly:
• contention penalty: due to serialization for shared resources; grows as αN. Examples: database contention, lock contention.
• crosstalk penalty: due to coordination for coherence; grows as βN². Example: servers coordinating to synchronize mutable state.
Universal Scalability Law (USL):

throughput of N servers = N / (αN + βN² + C)

• N / C: linear scaling
• N / (αN + C): contention
• N / (αN + βN² + C): contention and crosstalk
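A minimal sketch of what those three denominators do to the throughput curve. The coefficient values below are made up for illustration; in practice α, β, and C are fitted to measured throughput data, as the talk suggests.

```python
# Universal Scalability Law: throughput(N) = N / (C + alpha*N + beta*N^2).
# Illustrative coefficients; in practice, fit them to measured throughput vs. N.
C, alpha, beta = 1.0, 0.03, 0.0002

def linear(n):      return n / C                                # no penalties
def contention(n):  return n / (C + alpha * n)                  # asymptotes at 1/alpha
def usl(n):         return n / (C + alpha * n + beta * n * n)   # peaks, then declines

# With crosstalk, throughput peaks where dX/dN = 0, i.e. at N* = sqrt(C / beta).
peak_n = (C / beta) ** 0.5
print("peak at ~%d servers, throughput %.1f" % (peak_n, usl(peak_n)))
for n in (1, 10, 50, 70, 100, 200):
    print(n, round(linear(n), 1), round(contention(n), 1), round(usl(n), 1))
```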
"How can we improve how the system scales?"

Avoid contention (serialization) and crosstalk (synchronization):
• smarter data partitioning: smaller partitions in Facebook's TAO cache
• smarter aggregation: in Facebook's SCUBA data store
• better load balancing strategies: best of two random choices (sketched below)
• fine-grained locking
• MVCC databases
• etc.
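A brief sketch of the best-of-two-random-choices strategy named in the list above: pick two servers at random and send the request to the one with fewer outstanding requests. The in-memory counters here are an assumption for illustration; real load balancers track load differently.

```python
import random

class BestOfTwoBalancer:
    """Pick two servers at random, route to the less-loaded of the two."""

    def __init__(self, servers):
        self.outstanding = {s: 0 for s in servers}  # in-flight requests per server

    def pick(self):
        a, b = random.sample(list(self.outstanding), 2)
        return a if self.outstanding[a] <= self.outstanding[b] else b

    def start_request(self):
        server = self.pick()
        self.outstanding[server] += 1
        return server

    def finish_request(self, server):
        self.outstanding[server] -= 1

balancer = BestOfTwoBalancer(["web1", "web2", "web3", "web4"])
s = balancer.start_request()   # route a request
balancer.finish_request(s)     # mark it done
```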
stepping back: the role of performance modeling

Modeling requires assumptions that may be difficult to validate in practice, but it gives us a rigorous framework to:
• determine what experiments to run: the experiments needed to get data to fit the USL curve, response time graphs
• interpret and evaluate the results: e.g. when load simulations predicted better results than your system shows
• decide what improvements give the biggest wins: improve mean service time, reduce service time variability, remove crosstalk, etc.

Performance modeling is most useful in conjunction with empirical analysis: load simulation, experiments.
load simulation results with an increasing number of virtual clients (N) = 1 … 100:

[graph: response time vs number of clients, the wrong shape for the response time curve; it should be one of the two curves above (the open- or closed-system shape)]

… the load simulator hit a bottleneck.
@kavya719
speakerdeck.com/kavya719/applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this.

References:
• Performance Modeling and Design of Computer Systems, Mor Harchol-Balter
• Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz
• Open Versus Closed: A Cautionary Tale
• How to Emulate Web Traffic Using Standard Load Testing Tools
• Queuing Theory in Practice
• Fail at Scale
• Kraken: Leveraging Live Traffic Tests
• SCUBA: Diving into Data at Facebook

On CoDel at Facebook: "An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases."

On LIFO: using LIFO to select the thread to run next, to reduce mutex cache thrashing and context-switching overhead.
utilization = throughput × service time (Utilization Law)

As throughput increases, utilization ("busyness") increases, and queueing delay increases (non-linearly), so response time does too.

Facebook sets the target cluster capacity to 93% of theoretical. …is this good, or is there a bottleneck? If measured cluster capacity is ~90% of theoretical, there's a bottleneck to fix.
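A tiny worked example of using the Utilization Law this way. The per-request service time, server count, and measured peak throughput below are made-up numbers; only the 93% target comes from the talk.

```python
# Utilization Law: utilization = throughput * service_time, so a single server's
# theoretical max throughput is 1 / service_time. Numbers below are illustrative.
service_time = 0.005          # 5 ms per request (assumed)
servers = 100                 # assumed cluster size
measured_peak = 18_000        # req/s observed when the cluster is pushed to its limit (assumed)

theoretical = servers * (1.0 / service_time)    # 20,000 req/s
capacity = measured_peak / theoretical          # 0.90

print("cluster capacity = %.0f%% of theoretical" % (capacity * 100))
# ~90% is below the 93% target, so there's a bottleneck worth finding and fixing
```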
non-linear responses to load [graph: latency vs throughput]
non-linear scaling [graph: throughput vs concurrency]
microservices: systems are complex
continuous deploys: systems are in flux

load generation: we need a representative workload, both the profile (read / write requests) and the arrival pattern, including traffic bursts:
• capture and replay
• …or use live traffic

traffic shifting: adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic sent to a cluster / region / server.
![Page 17: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/17.jpg)
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
ldquoHow can we improve the mean response timerdquo
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 18: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/18.jpg)
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
P(request has to queue) increases somean queue length increases so mean queueing delay increases
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
ldquoHow can we improve the mean response timerdquo
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 19: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/19.jpg)
Pollaczek-Khinchine (P-K) formula
mean queueing delay = U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
assuming constant service time and so request sizes
mean queueing delay prop U (1 - U)
utilization (U)
resp
onse
tim
esince response time prop queueing delay
utilization (U)
queu
eing
del
ay
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
ldquoHow can we improve the mean response timerdquo
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 20: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/20.jpg)
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
low utilization regime
ldquoWhatrsquos the maximum throughput of this serverrdquo ie given a response time target
arrival rate increases
server utilization increases linearly
Utilization law
P-K formula
mean queueing delay increases non-linearly so response time too
resp
onse
tim
e (m
s)
throughput (requests second)
max throughput
low utilization regime
high utilization regime
ldquoHow can we improve the mean response timerdquo
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
"What's the maximum throughput of this server?" i.e., given a response time target.

As the arrival rate increases, server utilization increases linearly (Utilization Law), and mean queueing delay increases non-linearly, so response time does too (P-K formula).

[Figure: response time (ms) vs. throughput (requests/second). Response time is roughly flat in the low utilization regime and rises steeply in the high utilization regime; the maximum usable throughput is where the curve crosses the response time target.]
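To make that non-linear growth concrete, here is a minimal sketch (not from the talk) that evaluates the mean response time of an M/G/1 queue with the Pollaczek-Khinchine formula and finds the largest arrival rate that still meets a hypothetical 50 ms response time target; the service-time numbers are made up for illustration.

```python
# Sketch: M/G/1 mean response time via the Pollaczek-Khinchine (P-K) formula.
# Assumes Poisson arrivals; service times have mean E[S] and second moment E[S^2].

def mean_response_time(arrival_rate, mean_s, second_moment_s):
    """Mean response time = mean queueing delay + mean service time."""
    utilization = arrival_rate * mean_s
    if utilization >= 1.0:
        return float("inf")  # past saturation the queue grows without bound
    queueing_delay = arrival_rate * second_moment_s / (2 * (1 - utilization))
    return queueing_delay + mean_s

MEAN_S = 0.010                       # 10 ms mean service time (hypothetical)
SECOND_MOMENT_S = 2 * MEAN_S ** 2    # E[S^2] for an exponential service distribution
TARGET = 0.050                       # 50 ms response time target (hypothetical)

max_throughput = 0
for rps in range(1, 100):            # sweep the arrival rate, in requests/second
    if mean_response_time(rps, MEAN_S, SECOND_MOMENT_S) <= TARGET:
        max_throughput = rps

print("max throughput under the 50 ms target:", max_throughput, "req/s")
# Even though the server only saturates at 100 req/s, queueing delay already
# pushes response time past the target at around 80 req/s.
```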
"How can we improve the mean response time?"

1. response time ∝ queueing delay, so prevent requests from queuing too long:
• Controlled Delay (CoDel), in Facebook's Thrift framework
• adaptive or always LIFO, in Facebook's PHP runtime and Dropbox's Bandaid reverse proxy
• set a max queue length
• client-side concurrency control

The CoDel-style admission logic, as pseudocode:

    onNewRequest(req, queue):
        if queue.lastEmptyTime() < (now - N ms):
            # Queue was last empty more than N ms ago:
            # set an aggressive timeout of M << N ms.
            timeout = M ms
        else:
            # Else set timeout to N ms.
            timeout = N ms
        queue.enqueue(req, timeout)
CoDel's key insight: queues are typically empty, so this allows short bursts while preventing standing queues.

Adaptive LIFO serves the newest requests first, rather than old requests that are likely to expire anyway; it helps when the system is overloaded and makes no difference when it's not.
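Since the talk only describes adaptive LIFO in words, here is a toy sketch of the idea (my own illustration, not Facebook's PHP runtime or Dropbox's Bandaid; the switch-over threshold is a hypothetical stand-in for a real standing-queue signal such as CoDel's):

```python
from collections import deque

STANDING_QUEUE_THRESHOLD = 100   # hypothetical length at which we call the queue "standing"

class AdaptiveLifoQueue:
    """FIFO while the queue is healthy; LIFO once a standing queue forms,
    so the newest (still useful) requests are served first."""

    def __init__(self):
        self.items = deque()

    def enqueue(self, request):
        self.items.append(request)

    def dequeue(self):
        if not self.items:
            return None
        if len(self.items) > STANDING_QUEUE_THRESHOLD:
            return self.items.pop()        # overloaded: newest first (LIFO)
        return self.items.popleft()        # normal load: oldest first (FIFO)
```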
"How can we improve the mean response time?"

2. response time ∝ queueing delay, and by the P-K formula:

    queueing delay ∝ [U × linear fn(mean service time) × quadratic fn(service time variability)] / (1 - U)

so:
• decrease request size (service time) variability, for example by batching requests
• decrease service time, by optimizing application code
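As a back-of-the-envelope illustration of the variability term (my own sketch, with made-up numbers): at the same utilization, a service time distribution with a higher squared coefficient of variation produces proportionally more queueing delay.

```python
# P-K queueing delay in terms of utilization U, mean service time E[S], and
# squared coefficient of variation C2 = Var[S] / E[S]^2:
#   Wq = (U / (1 - U)) * E[S] * (1 + C2) / 2

def queueing_delay(utilization, mean_s, c2):
    return (utilization / (1 - utilization)) * mean_s * (1 + c2) / 2

U, MEAN_S = 0.8, 0.010   # hypothetical: 80% utilized, 10 ms mean service time
for label, c2 in [("deterministic (uniform, batched requests)", 0.0),
                  ("exponential", 1.0),
                  ("heavy-tailed (occasional huge requests)", 10.0)]:
    print(f"{label:42s} queueing delay = {queueing_delay(U, MEAN_S, c2) * 1000:6.1f} ms")
```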
model II

An industrial site has N sensors that upload data to a server in the cloud; the server processes the data from all N sensors. Each sensor runs a loop like:

    while true:
        ack = upload(data)      # upload synchronously
        deleteUploaded(ack)     # update state
        sleep(Z seconds)        # sleep for Z seconds

• requests are synchronized: a sensor sends its next request only after the previous response comes back (plus the sleep)
• there is a fixed number of clients (N)

So throughput depends on response time, and the queue length is bounded (<= N), so response time is bounded too. This is called a closed system, which is super different from the previous web server model (an open system).

[Figure: N clients in a loop with the server, each alternating between sleeping and waiting on a request/response.]
response time vs. load for closed systems

assumptions:
1. sleep time ("think time") is constant
2. requests are processed one at a time, in FIFO order
3. service time is constant

Like earlier, as the number of clients (N) increases, throughput increases up to a point, i.e. until utilization is high; after that, increasing N only increases queueing.

[Figure: throughput vs. number of clients, rising in the low utilization regime and flat in the high utilization regime.]

What happens to response time in that high utilization regime?
Little's Law for closed systems

The "system" here is the entire loop: a request can be in one of three states, sleeping (on the device), waiting (in the server queue), or being processed (in the server), and the total number of requests in the system counts requests across all three states.

Little's Law:

    requests in system = throughput × round-trip time of a request across the whole system

i.e. N = throughput × (sleep time + response time), where queueing delay + service time = response time.

Applying it in the high utilization regime (where throughput is pinned at its maximum) and assuming constant sleep time:

    response time = N / max throughput - sleep time

So response time only grows linearly with N.
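A quick worked example with made-up numbers (my own, not from the talk), just to see the linear growth fall out of Little's Law:

```python
# Little's Law for the closed sensor system, in the high utilization regime:
#   N = X * (Z + R)  =>  R = N / X - Z
# Only valid once the server is saturated, i.e. roughly N > X * (Z + service time).

MAX_THROUGHPUT = 100.0   # requests/second the server can sustain (hypothetical)
SLEEP_TIME = 10.0        # seconds each sensor sleeps between uploads (hypothetical)

def response_time(n_sensors):
    return n_sensors / MAX_THROUGHPUT - SLEEP_TIME

for n in (1500, 2000, 4000, 8000):   # all well past the ~1,000-sensor saturation point
    print(f"N={n:5d} sensors -> response time ≈ {response_time(n):5.1f} s")
```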
So, response time for a closed system:
• low utilization regime: response time stays about the same
• high utilization regime: response time grows linearly with N

[Figure: response time vs. number of clients, roughly flat at low N and growing linearly once the high utilization regime is reached.]
That is way different than for an open system, where response time stays roughly flat at low arrival rates and then blows up non-linearly in the high utilization regime.

[Figure: side by side, response time vs. arrival rate for the open system (non-linear blow-up) and response time vs. number of clients for the closed system (linear growth).]
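If it helps to see the two regimes emerge, here is a small simulation sketch of the closed model under the stated assumptions (single FIFO server, fixed think and service times); all numbers are made up, and this is my own illustration rather than anything from the talk.

```python
import heapq
import random

def simulate_closed_system(n_clients, think_time, service_time, n_requests=50_000, seed=1):
    """N clients loop: send a request, wait for the response, sleep, repeat.
    Single FIFO server. Returns (throughput, mean response time)."""
    rng = random.Random(seed)
    # (next request arrival time, client id); stagger the initial arrivals
    arrivals = [(rng.uniform(0, think_time), i) for i in range(n_clients)]
    heapq.heapify(arrivals)
    server_free = 0.0
    total_response = 0.0
    now = 0.0
    for _ in range(n_requests):
        arrival, client = heapq.heappop(arrivals)
        start = max(arrival, server_free)       # wait if the server is busy
        finish = start + service_time
        server_free = finish
        total_response += finish - arrival      # queueing delay + service time
        now = finish
        heapq.heappush(arrivals, (finish + think_time, client))
    return n_requests / now, total_response / n_requests

# Hypothetical numbers: 10 ms service time, 1 s think time -> saturation near N ≈ 100.
for n in (1, 10, 50, 100, 200, 400):
    x, r = simulate_closed_system(n, think_time=1.0, service_time=0.010)
    print(f"N={n:3d}  throughput={x:6.1f} req/s  mean response time={r * 1000:7.1f} ms")
```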
open vs. closed systems

Closed systems are very different from open systems in:
• how throughput relates to response time
• response time versus load, especially in the high load regime

Uh oh… standard load simulators typically mimic closed systems, but the system with real users may not be one. So a load simulation might predict:
• lower response times than the actual system yields
• better tolerance to request size variability
• other differences you probably don't want to find out in production…

A couple of neat papers on the topic and its workarounds: "Open Versus Closed: A Cautionary Tale" and "How to Emulate Web Traffic Using Standard Load Testing Tools".
a cluster of servers

clients -> load balancer -> cluster of web servers

capacity planning: "How many servers do we need to support a target throughput?" (while keeping response time the same)
scalability: "How can we improve how the system scales?"

Is the max throughput of a cluster of N servers simply max single-server throughput × N? No, systems don't scale linearly:
• contention penalty, due to serialization for shared resources (examples: database contention, lock contention); it grows linearly with N (the αN term)
• crosstalk penalty, due to coordination for coherence (example: servers coordinating to synchronize mutable state); it grows quadratically with N (the βN² term)
Universal Scalability Law (USL):

    throughput of N servers = N / (αN + βN² + C)

• N / C: linear scaling
• N / (αN + C): contention only
• N / (αN + βN² + C): contention and crosstalk
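A tiny sketch of what the USL curve looks like under this parameterization; the α, β, and C values below are invented purely for illustration.

```python
# Universal Scalability Law, in the talk's form: X(N) = N / (alpha*N + beta*N^2 + C)

ALPHA = 0.02     # contention cost, grows linearly with N (hypothetical)
BETA = 0.0005    # crosstalk cost, grows quadratically with N (hypothetical)
C = 1.0          # per-request cost on a single uncontended server

def throughput(n):
    return n / (ALPHA * n + BETA * n * n + C)

best_n = max(range(1, 201), key=throughput)
for n in sorted({1, 10, 25, best_n, 100, 200}):
    print(f"N={n:3d}  relative throughput={throughput(n):6.2f}")
print(f"peak at N ≈ {best_n}; past that, crosstalk makes adding servers counterproductive")
```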
"How can we improve how the system scales?"

Avoid contention (serialization) and crosstalk (synchronization):
• smarter data partitioning: smaller partitions, as in Facebook's TAO cache
• smarter aggregation, as in Facebook's SCUBA data store
• better load balancing strategies, e.g. best of two random choices (sketched below)
• fine-grained locking
• MVCC databases
• etc.
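The "best of two random choices" strategy is simple enough to sketch; this is my own toy illustration with made-up sizes, not any particular load balancer's implementation.

```python
import random

def pick_server(loads, rng):
    """Best of two random choices: sample two servers, send to the less loaded one."""
    a, b = rng.sample(range(len(loads)), 2)
    return a if loads[a] <= loads[b] else b

def simulate(n_servers=100, n_requests=100_000, seed=0):
    rng = random.Random(seed)
    random_loads = [0] * n_servers       # baseline: pick a server uniformly at random
    two_choice_loads = [0] * n_servers   # best of two random choices
    for _ in range(n_requests):
        random_loads[rng.randrange(n_servers)] += 1
        two_choice_loads[pick_server(two_choice_loads, rng)] += 1
    return max(random_loads), max(two_choice_loads)

worst_random, worst_two_choice = simulate()
print("most loaded server, uniform random placement:", worst_random)
print("most loaded server, best-of-two placement:   ", worst_two_choice)
# The hot spot on the busiest server is much smaller with two choices, even
# though both schemes average 1,000 requests per server.
```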
stepping back: the role of performance modeling

Modeling requires assumptions that may be difficult to practically validate, but it gives us a rigorous framework to:
• determine what experiments to run: the experiments needed to get data to fit the USL curve or the response time graphs
• interpret and evaluate the results: did the load simulation predict better results than your system actually shows?
• decide what improvements give the biggest wins: improve mean service time, reduce service time variability, remove crosstalk, etc.

So performance modeling is most useful in conjunction with empirical analysis, i.e. load simulation experiments.
For example: load simulation results with an increasing number of virtual clients, N = 1 … 100.

[Figure: measured response time vs. number of clients. The curve has the wrong shape for a response time curve; it should look like one of the two curves above (the open-system or the closed-system shape).]

…it turned out the load simulator itself had hit a bottleneck.
Another example: checking measured capacity against the model. By the Utilization Law, utilization ("busyness") = throughput × service time, so as throughput increases, utilization increases, and queueing delay, and with it response time, increases non-linearly. If a cluster's measured capacity is ~90% of its theoretical maximum, is that good, or is there a bottleneck? Facebook sets the target cluster capacity at 93% of theoretical, so ~90% means there's a bottleneck to fix.
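A back-of-the-envelope version of that capacity check, with entirely hypothetical numbers:

```python
# Utilization Law: utilization = throughput * service time, so one server's
# theoretical capacity is 1 / service_time, and a cluster's is servers * that.

SERVICE_TIME = 0.005               # 5 ms per request on one server (hypothetical)
SERVERS = 40
measured_peak_throughput = 7_200   # req/s at which latency started degrading (hypothetical)

theoretical_capacity = SERVERS * (1.0 / SERVICE_TIME)   # 8,000 req/s
fraction = measured_peak_throughput / theoretical_capacity

print(f"theoretical capacity: {theoretical_capacity:.0f} req/s")
print(f"measured capacity is {fraction:.0%} of theoretical")
print("below the 93% target, so go look for a bottleneck" if fraction < 0.93
      else "at or above the target")
```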
Real systems respond to load non-linearly (latency vs. throughput) and scale non-linearly (throughput vs. concurrency); microservices make the systems complex, and continuous deploys mean the systems are constantly in flux.
load generation: we need a representative workload, i.e. the right profile of read/write requests and the right arrival pattern, including traffic bursts. …so use live traffic, via:
• traffic shifting: adjust the edge, cluster, and server weights that control load balancing, to increase the fraction of traffic sent to a particular cluster, region, or server
• capture and replay
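A toy illustration of weight-based traffic shifting (my own sketch, not Facebook's Kraken): the load balancer picks a cluster in proportion to its weight, so raising one cluster's weight shifts a larger share of live traffic onto it for the test.

```python
import random

cluster_weights = {"cluster-a": 1.0, "cluster-b": 1.0, "cluster-c": 1.0}

def route(rng=random):
    clusters, weights = zip(*cluster_weights.items())
    return rng.choices(clusters, weights=weights, k=1)[0]

# Shift load toward cluster-a to probe its capacity:
cluster_weights["cluster-a"] = 3.0
sample = [route() for _ in range(10_000)]
print("fraction of traffic on cluster-a:", sample.count("cluster-a") / len(sample))  # ~0.6
```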
kavya719, speakerdeck.com/kavya719/applied-performance-theory

Special thanks to Eben Freeman for reading drafts of this.

References:
• Performance Modeling and Design of Computer Systems, Mor Harchol-Balter
• Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz
• Open Versus Closed: A Cautionary Tale
• How to Emulate Web Traffic Using Standard Load Testing Tools
• Queuing Theory In Practice
• Fail at Scale
• Kraken: Leveraging Live Traffic Tests
• SCUBA: Diving into Data at Facebook

On CoDel at Facebook: "An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases."

On LIFO: using LIFO to select the thread to run next also reduces mutex cache thrashing and context-switching overhead.
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 23: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/23.jpg)
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 24: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/24.jpg)
ldquoHow can we improve the mean response timerdquo
onNewRequest(req queue) if (queuelastEmptyTime() lt (now - N ms)) Queue was last empty more than N ms ago set timeout to M ltlt N ms timeout = M ms else Else set timeout to N ms timeout = N ms queueenqueue(req timeout)
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 25: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/25.jpg)
ldquoHow can we improve the mean response timerdquo
1 response time prop queueing delay
prevent requests from queuing too long
bull Controlled Delay (CoDel)in Facebookrsquos Thrift framework
bull adaptive or always LIFOin Facebookrsquos PHP runtime Dropboxrsquos Bandaid reverse proxy
bull set a max queue length
bull client-side concurrency control
newest requests first not old requests that are likely to expire
helps when system is overloaded makes no difference when itrsquos not
key insight queues are typically empty allows short bursts prevents standing queues
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 26: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/26.jpg)
ldquoHow can we improve the mean response timerdquo
2 response time prop queueing delay
U linear fn (mean service time) quadratic fn (service time variability) (1 - U)
P-K formula
decrease request service size variability for example by batching requests
decrease service time by optimizing application code
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
• requests are synchronized
• fixed number of clients
• throughput depends on response time
• queue length is bounded (<= N), so response time is bounded

This is called a closed system, and it is super different from the previous web server model (an open system): the N clients loop between issuing a request to the server and receiving its response.
response time vs load for closed systems
assumptions
1. sleep time ("think time") is constant
2. requests are processed one at a time, in FIFO order
3. service time is constant
Like earlier, as the number of clients (N) increases, throughput increases up to a point, i.e. until utilization is high; after that, increasing N only increases queueing. [graph: throughput vs. number of clients, with a low utilization regime and a high utilization regime]

What happens to response time in this regime?
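A minimal simulation sketch of this closed system (single FIFO server, constant think time and service time, as in the assumptions above; all parameter values are made up) that reproduces the two regimes, with throughput saturating once utilization is high:

```python
# Closed-system sketch: N clients each sleep ("think") for Z seconds, then issue
# one synchronous request to a single FIFO server with constant service time.
import heapq

def simulate_closed(num_clients, think_time=1.0, service_time=0.01, num_requests=50_000):
    # Each client starts by thinking, so its first request arrives at t = think_time.
    submissions = [(think_time, c) for c in range(num_clients)]
    heapq.heapify(submissions)
    server_free_at = 0.0
    total_response = 0.0
    for _ in range(num_requests):
        submit, client = heapq.heappop(submissions)      # earliest outstanding request
        start = max(submit, server_free_at)              # waits in the FIFO queue if the server is busy
        finish = start + service_time
        server_free_at = finish
        total_response += finish - submit                # response time = queueing delay + service time
        heapq.heappush(submissions, (finish + think_time, client))  # client sleeps, then resubmits
    throughput = num_requests / server_free_at
    mean_response = total_response / num_requests
    return throughput, mean_response

for n in (1, 10, 50, 100, 200, 400):
    x, r = simulate_closed(n)
    print(f"N={n:3d}  throughput={x:7.1f} req/s  mean response={1000*r:7.1f} ms")
```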
Little's Law for closed systems

A request can be in one of three states in the system: sleeping (on the device), waiting (in the server queue), or being processed (in the server). The total number of requests in the system counts requests across all three states, so the "system" here is the entire loop, i.e. the N clients plus the server and its queue.

requests in system = throughput × round-trip time of a request across the whole system
                   = throughput × (sleep time + response time), where queueing delay + service time = response time

Applying it in the high utilization regime (constant throughput), and assuming constant sleep time:
N = constant × (sleep time + response time)
So response time only grows linearly with N.
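Written out with symbols (N clients, throughput X, sleep/think time Z, response time R), this is just a rearrangement of Little's Law:

```latex
N = X \cdot (Z + R)
\quad\Longrightarrow\quad
R = \frac{N}{X} - Z
```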
response time vs load for closed systems

So, response time for a closed system: in the low utilization regime, response time stays roughly the same; in the high utilization regime, it grows linearly with N. [graph: response time vs. number of clients]

This is way different than for an open system, where response time grows non-linearly with the arrival rate once utilization is high. [graph: response time vs. arrival rate]
open vs closed systems
Closed systems are very different from open systems in:
• how throughput relates to response time
• response time versus load, especially in the high load regime

uh oh… standard load simulators typically mimic closed systems, but the system with real users may not be one.

So load simulation might predict:
• lower response times than the actual system yields
• better tolerance to request size variability
• other differences you probably don't want to find out in production…

A couple of neat papers on the topic and workarounds: "Open Versus Closed: A Cautionary Tale" and "How to Emulate Web Traffic Using Standard Load Testing Tools".
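To make the distinction concrete, here is a minimal sketch of the two load-generation loops (send_request is a placeholder for your client call): the closed-loop client waits for each response, so its request rate drops as the server slows down, while the open-loop generator keeps firing at a fixed arrival rate regardless.

```python
import threading, time

def closed_loop_client(send_request, think_time, stop):
    # Closed system: the next request is issued only after the previous response returns.
    while not stop.is_set():
        send_request()             # blocks until the response arrives
        time.sleep(think_time)     # "think time" before the next request

def open_loop_generator(send_request, arrival_rate, stop):
    # Open system: requests arrive at a fixed rate, independent of response times.
    interval = 1.0 / arrival_rate
    while not stop.is_set():
        threading.Thread(target=send_request, daemon=True).start()  # fire, don't wait
        time.sleep(interval)
```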
a cluster of servers

clients → load balancer → cluster of web servers
"How many servers do we need to support a target throughput?" (while keeping response time the same): capacity planning

"How can we improve how the system scales?": scalability
max throughput of a cluster of N servers = max single server throughput × N?

No, systems don't scale linearly:
• contention penalty: due to serialization for shared resources (the αN term). Examples: database contention, lock contention.
• crosstalk penalty: due to coordination for coherence (the βN² term). Examples: servers coordinating to synchronize mutable state.
Universal Scalability Law (USL):

throughput of N servers = N / (αN + βN² + C)

• N / C: linear scaling
• N / (αN + C): contention
• N / (αN + βN² + C): contention and crosstalk
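A small sketch of these three curves in code, using the same form as the slide (α, β, C would normally be fit from measured throughput at a few cluster sizes; the values below are made up for illustration):

```python
def usl_throughput(n, alpha, beta, c):
    # throughput of N servers = N / (αN + βN² + C)
    return n / (alpha * n + beta * n * n + c)

alpha, beta, c = 0.02, 0.0002, 1.0   # made-up parameters: some contention, a little crosstalk
for n in (1, 2, 4, 8, 16, 32, 64, 128):
    linear     = n / c                               # N / C: linear scaling
    contention = n / (alpha * n + c)                 # N / (αN + C): contention only
    both       = usl_throughput(n, alpha, beta, c)   # contention and crosstalk
    print(f"N={n:3d}  linear={linear:6.1f}  contention={contention:6.1f}  both={both:6.1f}")
```

With β > 0 the curve eventually turns downward: past a point, adding servers reduces total throughput.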
"How can we improve how the system scales?"

Avoid contention (serialization) and crosstalk (synchronization):
• smarter data partitioning: smaller partitions in Facebook's TAO cache
• smarter aggregation: in Facebook's SCUBA data store
• better load balancing strategies: best of two random choices
• fine-grained locking
• MVCC databases
• etc.
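As one example, a minimal sketch of the "best of two random choices" strategy (queue_depth is a hypothetical per-server load metric): sample two servers at random and route to the less loaded one, which captures most of the benefit of least-loaded routing without having to query every server.

```python
import random

def pick_server(servers, queue_depth):
    # Best of two random choices: sample two servers, send to the one with the shorter queue.
    a, b = random.sample(servers, 2)
    return a if queue_depth(a) <= queue_depth(b) else b
```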
stepping back
modeling requires assumptions that may be difficult to practically validate, but it gives us a rigorous framework to:
• determine what experiments to run: run the experiments needed to get data to fit the USL curve, response time graphs
• interpret and evaluate the results: e.g. load simulations predicted better results than your system shows
• decide what improvements give the biggest wins: improve mean service time, reduce service time variability, remove crosstalk, etc.

the role of performance modeling: most useful in conjunction with empirical analysis (load simulation, experiments)
load simulation results with increasing number of virtual clients (N) = 1 … 100: the response time vs. number of clients curve has the wrong shape. It should be one of the two curves above (open or closed), so … the load simulator hit a bottleneck.
kavya719 | speakerdeck.com/kavya719/applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References:
• Performance Modeling and Design of Computer Systems, Mor Harchol-Balter
• Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz
• Open Versus Closed: A Cautionary Tale
• How to Emulate Web Traffic Using Standard Load Testing Tools
• Queuing Theory in Practice
• Fail at Scale
• Kraken: Leveraging Live Traffic Tests
• SCUBA: Diving into Data at Facebook
On CoDel at Facebook: "An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases."
Using LIFO to select the next thread to run, to reduce mutex cache thrashing and context-switching overhead.
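A hedged sketch of the adaptive queue policy the CoDel quote above describes (the class and field names here are illustrative, not Facebook's code): while the queue drains regularly, allow requests the long timeout N; if the queue has not been empty for the last N milliseconds, a standing queue is forming, so new requests only get the short timeout M.

```python
import time
from collections import deque

class ControlledDelayQueue:
    """Illustrative only: adaptive queueing timeout in the spirit of the quote above."""
    def __init__(self, m=0.005, n=0.100):      # M = 5 ms, N = 100 ms
        self.m, self.n = m, n
        self.items = deque()                    # (deadline, request) pairs
        self.last_empty = time.monotonic()

    def enqueue(self, request):
        now = time.monotonic()
        if not self.items:
            self.last_empty = now
        # Standing queue (hasn't drained in the last N seconds): only allow the
        # short timeout M for new requests; otherwise allow the long timeout N.
        timeout = self.m if now - self.last_empty > self.n else self.n
        self.items.append((now + timeout, request))

    def dequeue(self):
        now = time.monotonic()
        while self.items and self.items[0][0] < now:
            self.items.popleft()                # expired while queued: shed it
        if not self.items:
            self.last_empty = now
            return None
        return self.items.popleft()[1]
```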
utilization ("busyness") = throughput × service time (Utilization Law)

As throughput increases, utilization increases, and queueing delay increases (non-linearly), so response time does too.

Facebook sets target cluster capacity = 93% of theoretical. …is this good, or is there a bottleneck? If cluster capacity is only ~90% of theoretical, there's a bottleneck to fix.
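A toy calculation in the spirit of the capacity check above (all numbers are made up): use the Utilization Law to get the theoretical per-server ceiling, then compare the throughput the cluster actually sustains against the target fraction of that ceiling.

```python
mean_service_time = 0.005                    # 5 ms per request (made up)
max_per_server = 1 / mean_service_time       # utilization = throughput x service time = 1 at the ceiling
servers = 100
theoretical_max = servers * max_per_server   # 20,000 req/s

target_fraction = 0.93                       # target capacity: 93% of theoretical
measured_peak = 18_000                       # peak throughput the cluster actually sustained (made up)

achieved = measured_peak / theoretical_max   # 0.90
print(f"achieved {achieved:.0%} of theoretical (target {target_fraction:.0%})")
if achieved < target_fraction:
    print("below target: there's a bottleneck to find and fix")
```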
• non-linear responses to load (latency vs. throughput)
• non-linear scaling (throughput vs. concurrency)
• microservices: systems are complex
• continuous deploys: systems are in flux
load generation: need a representative workload, i.e. the right profile (read vs. write requests) and arrival pattern, including traffic bursts. Options: capture and replay, or …use live traffic.

traffic shifting: adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic sent to a cluster, region, or server.
![Page 27: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/27.jpg)
the cloudindustry site
N sensors
server
while true upload synchronously ack = upload(data) update state sleep for Z seconds deleteUploaded(ack) sleep(Z seconds)
processes data from N sensors
model II
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 28: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/28.jpg)
bull requests are synchronized
bull fixed number of clients
throughput depends on response time
queue length is bounded (lt= N) so response time bounded
This is called a closed system super different that the previous web server model (open system)
server
N clients
] ]responserequest
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 29: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/29.jpg)
response time vs load for closed systems
assumptions
1 sleep time (ldquothink timerdquo) is constant 2 requests are processed one at a time in FIFO order 3 service time is constant
What happens to response time in this regime
Like earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing th
roug
hput
number of clients
low utilization regime
high utilization regime
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 30: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/30.jpg)
Littlersquos Law for closed systems
server
sleeping
waiting being processed
] ]
the total number of requests in the system includes requests across the states
a request can be in one of three states in the system sleeping (on the device) waiting (in the server queue) being processed (in the server)
the system in this case is the entire loop ie
N clients
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 31: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/31.jpg)
Littlersquos Law for closed systems
requests in system = throughput round-trip time of a request across the whole system
sleep time + response time
server
sleep time
queueing delay + service time = response time
] ]
So response time only grows linearly with N
N = constant response time
applying it in the high utilization regime (constant throughput) and assuming constant sleep
N clients
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 32: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/32.jpg)
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing high utilization regime
grows linearly with N
low utilization regime response time stays ~same
high utilization regime
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validate, but it gives us a rigorous framework to:
• determine what experiments to run: run the experiments needed to get data to fit the USL curve and response time graphs (see the fitting sketch below)
• interpret and evaluate the results: e.g. the load simulation predicted better results than your system shows
• decide what improvements give the biggest wins: improve mean service time, reduce service-time variability, remove crosstalk, etc.
the role of performance modeling: most useful in conjunction with empirical analysis (load simulation, experiments)
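As an example of the first bullet, a minimal sketch (with made-up measurements) of fitting α, β, and C from throughput measured at a few cluster sizes; since N/throughput is a quadratic in N, an ordinary polynomial fit recovers them:

```python
# Illustrative sketch: fit USL coefficients to measured throughput.
import numpy as np

n = np.array([1, 2, 4, 8, 16, 32])               # cluster sizes tested
x = np.array([0.97, 1.9, 3.6, 6.3, 9.9, 12.9])   # measured throughput (made up)

# throughput = N / (alpha*N + beta*N^2 + C), so N/throughput is quadratic in N
beta, alpha, c = np.polyfit(n, n / x, deg=2)
print(f"alpha={alpha:.3f}  beta={beta:.5f}  C={c:.2f}")
print("predicted throughput peak near N =", round(float(np.sqrt(c / beta))))
```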
example: load simulation results with an increasing number of virtual clients, N = 1 … 100
[graph: response time vs. number of clients; wrong shape for the response time curve, it should be one of the two closed-system curves above]
… the load simulator hit a bottleneck
kavya719
speakerdeck.com/kavya719/applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References:
• Performance Modeling and Design of Computer Systems, Mor Harchol-Balter
• Practical Scalability Analysis with the Universal Scalability Law, Baron Schwartz
• Open Versus Closed: A Cautionary Tale
• How to Emulate Web Traffic Using Standard Load Testing Tools
• Queuing Theory In Practice
• Fail at Scale
• Kraken: Leveraging Live Traffic Tests
• SCUBA: Diving into Data at Facebook
On CoDel at Facebook: "An attractive property of this algorithm is that the values of M and N tend not to need tuning. Other methods of solving the problem of standing queues, such as setting a limit on the number of items in the queue or setting a timeout for the queue, have required tuning on a per-service basis. We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases."
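A rough sketch of the rule the quote describes, as I read it (not Facebook's actual code): use the short timeout M when the queue has had no idle moment in the last N milliseconds, otherwise use the normal timeout.

```python
# Sketch of a CoDel-style adaptive queue timeout, under the assumption that
# "standing queue" means the queue has not been empty within the last N ms.
import time

M_MS = 5     # short timeout when there's a standing queue (value from the quote)
N_MS = 100   # lookback window / normal timeout (value from the quote)

class AdaptiveQueue:
    def __init__(self):
        self.items = []                      # (request, deadline) pairs
        self.last_empty = time.monotonic()

    def enqueue(self, request):
        now = time.monotonic()
        if not self.items:
            self.last_empty = now
        # If the queue has had no idle moment in the last N ms, there is a
        # standing queue: give new requests only M ms before they're shed.
        standing_queue = (now - self.last_empty) * 1000 > N_MS
        timeout_ms = M_MS if standing_queue else N_MS
        self.items.append((request, now + timeout_ms / 1000))

    def dequeue(self):
        if not self.items:
            return None
        request, deadline = self.items.pop(0)
        if not self.items:
            self.last_empty = time.monotonic()
        if time.monotonic() > deadline:
            return None                      # timed out while queued: shed it
        return request
```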
Using LIFO to select the thread to run next, to reduce mutex cache thrashing and context-switching overhead.
utilization = throughput × service time (Utilization Law); utilization is "busyness"
as throughput increases, utilization increases, so queueing delay increases (non-linearly), and so response time increases
…is this good, or is there a bottleneck?
Facebook sets target cluster capacity = 93% of theoretical; if a cluster's capacity is only ~90% of theoretical, there's a bottleneck to fix.
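A minimal sketch of that arithmetic, with made-up numbers for service time, cluster size, and the measured peak from a load test:

```python
# Illustrative sketch: use the Utilization Law to judge cluster capacity.
service_time_s = 0.005            # 5 ms mean service time per request (assumed)
servers = 100                     # cluster size (assumed)

theoretical_max_rps = servers / service_time_s        # utilization = 1.0
target_capacity_rps = 0.93 * theoretical_max_rps      # the 93% target

measured_peak_rps = 18_000                            # from a load test (assumed)
utilization = measured_peak_rps * service_time_s / servers

print(f"theoretical max: {theoretical_max_rps:.0f} rps")
print(f"target (93%):    {target_capacity_rps:.0f} rps")
print(f"measured peak:   {measured_peak_rps} rps ({utilization:.0%} of theoretical)")
# A measured peak below the 93% target (here ~90%) suggests a bottleneck to fix.
```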
[graphs: latency vs. throughput; throughput vs. concurrency]
• non-linear responses to load
• non-linear scaling
• microservices: systems are complex
• continuous deploys: systems are in flux
load generation: need a representative workload… use live traffic
traffic shifting
• capture and replay: needs a profile (read / write requests) and the arrival pattern, including traffic bursts
• traffic shifting: adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic to a cluster, region, or server
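A minimal sketch (not Facebook's system) of weight-based traffic shifting: the balancer picks a cluster in proportion to its weight, so raising one cluster's weight raises its share of live traffic; cluster names and weights are hypothetical.

```python
# Illustrative sketch: weighted routing for a traffic-shifting capacity test.
import random

weights = {"cluster-a": 1.0, "cluster-b": 1.0, "cluster-c": 1.0}   # hypothetical

def route(weights):
    clusters = list(weights)
    return random.choices(clusters, weights=[weights[c] for c in clusters])[0]

# Shift more load onto cluster-a: ~50% of traffic instead of ~33%.
weights["cluster-a"] = 2.0
sample = [route(weights) for _ in range(10_000)]
print(round(sample.count("cluster-a") / len(sample), 2))
```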
![Page 33: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/33.jpg)
response time vs load for closed systems
So response time for a closed system
number of clients
resp
onse
tim
eLike earlier as the number of clients (N) increases throughput increases to a point ie until utilization is high after that increasing N only increases queuing
arrival rate
resp
onse
tim
e
way different than for an open system
high utilization regimegrows linearly with N
low utilization regime response time stays ~same
high utilization regime high utilization regime
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 34: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/34.jpg)
open vs closed systems
bull how throughput relates to response time bull response time versus load especially in the high load regime
closed systems are very different from open systems
uh ohhellip
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 35: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/35.jpg)
standard load simulators typically mimic closed systems
A couple neat papers on the topic workarounds Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
So load simulation might predict bull lower response times than the actual system yields bull better tolerance to request size variability bull other differences you probably donrsquot want to find out in productionhellip
open vs closed systems
hellipbut the system with real users may not be one
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 36: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/36.jpg)
a cluster of servers
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 37: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/37.jpg)
clients
cluster of web servers
load balancer
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
capacity planning
ldquoHow can we improve how the system scalesrdquo scalability
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 38: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/38.jpg)
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 39: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/39.jpg)
max throughput of a cluster of N servers = max single server throughput N
ldquoHow many servers do we need to support a target throughputrdquo while keeping response time the same
no systems donrsquot scale linearly
bull contention penaltydue to serialization for shared resourcesexamples database contention lock contention
bull crosstalk penaltydue to coordination for coherence
examples servers coordinating to synchronize mutable state
αN
βN2
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 40: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/40.jpg)
Universal Scalability Law (USL)
throughput of N servers = N (αN + βN2 + C)
N (αN + βN2 + C)
NC
N(αN + C)
contention and crosstalk
linear scaling
contention
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
throughput
latency
non-linear responses to load
throughput
concurrency
non-linear scaling
microservices systems are complex
continuous deployssystems are in flux
load generation
need a representative workload
hellipuse live traffic
traffic shifting
profile (read write requests) arrival pattern including traffic bursts
capture and replay
edge weight cluster weight server weight
adjust weights that control load balancing to increase the fraction of traffic to a cluster region server
traffic shifting
![Page 41: Applied...Applied Performance Theory @kavya719. kavya. applying performance theory ... scaling bottlenecks a single server open, closed queueing systems ... model the web server as](https://reader033.vdocuments.site/reader033/viewer/2022052103/603e1b1edb011f4fee7834b6/html5/thumbnails/41.jpg)
bull smarter data partitioning smaller partitions in Facebookrsquos TAO cache
ldquoHow can we improve how the system scalesrdquo
Avoid contention (serialization) and crosstalk (synchronization)
bull smarter aggregation in Facebookrsquos SCUBA data store
bull better load balancing strategies best of two random choices bull fine-grained locking bull MVCC databases bull etc
stepping back
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
load simulation results with increasing number of virtual clients (N) = 1 hellip 100
hellip load simulator hit a bottleneck
resp
onse
tim
e
number of clients
wrong shape for response time curve
should be one of the two curves above
number of clients
resp
onse
tim
e
modeling requires assumptions that may be difficult to practically validatebut gives us a rigorous framework to
bull determine what experiments to runrun experiments needed to get data to fit the USL curve response time graphs
bull interpret and evaluate the resultsload simulations predicted better results than your system shows
bull decide what improvements give the biggest winsimprove mean service time reduce service time variability remove crosstalk etc
the role of performance modelingmost useful in conjunction with empirical analysis
load simulation experiments
kavya719speakerdeckcomkavya719applied-performance-theory
Special thanks to Eben Freeman for reading drafts of this
References Performance Modeling and Design of Computer Systems Mor Harchol-Balter Practical Scalability Analysis with the Universal Scalability Law Baron Schwartz Open Versus Closed A Cautionary Tale How to Emulate Web Traffic Using Standard Load Testing Tools
Queuing Theory In Practice
Fail at Scale Kraken Leveraging Live Traffic Tests
SCUBA Diving into Data at Facebook
On CoDel at Facebook ldquoAn attractive property of this algorithm is that the values of M and N tend not to need tuning Other methods of solving the problem of standing queues such as setting a limit on the number of items in the queue or setting a timeout for the queue have required tuning on a per-service basis We have found that a value of 5 milliseconds for M and 100 ms for N tends to work well across a wide set of use cases ldquo
Using LIFO to select thread to run next to reduce mutex cache trashing and context switching overhead
number of virtual clients (N) = 1 hellip 100
response time
concurrency (N)
wrong shape for response time curve
should be
concurrency (N)
response time
hellip load simulator hit a bottleneck
utilization = throughput service time (Utilization Law)
throughput
ldquobusynessrdquo
queueing delay increases (non-linearly) so response time
throughput increases
utilization increases
Facebook sets target cluster capacity = 93 of theoretical
hellipis this good or is there a bottleneck
cluster capacity is ~90 of theoretical so therersquos a bottleneck to fix
Facebook sets target cluster capacity = 93 of theoretical
non-linear responses to load [graph: latency vs. throughput]
non-linear scaling [graph: throughput vs. concurrency]
microservices: systems are complex
continuous deploys: systems are in flux
load generation: need a representative workload
profile (read / write requests) and arrival pattern, including traffic bursts: capture and replay
…or use live traffic: traffic shifting

traffic shifting: adjust the weights that control load balancing (edge weight, cluster weight, server weight) to increase the fraction of traffic sent to a cluster / region / server.
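A minimal sketch of the weight-based traffic shifting idea (cluster names and numbers are mine): the load balancer picks a target in proportion to its weight, so raising one cluster's weight raises the fraction of live traffic it receives during a test.

```python
# Hypothetical sketch: weighted cluster selection in a load balancer.
# Raising "cluster-b"'s weight shifts a larger share of live traffic onto it.
import random
from collections import Counter

weights = {"cluster-a": 1.0, "cluster-b": 1.0, "cluster-c": 1.0}

def pick_cluster():
    clusters = list(weights)
    return random.choices(clusters, weights=[weights[c] for c in clusters])[0]

weights["cluster-b"] = 2.0   # shift: cluster-b now gets ~50% of requests

sample = Counter(pick_cluster() for _ in range(10_000))
print({c: round(count / 10_000, 2) for c, count in sample.items()})
```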