PerfIso: Performance Isolation for Commercial Latency-Sensitive Services Călin Iorgulescu Reza Azimi Youngjin Kwon EPFL Brown University University of Texas Sameh Elnikety Manoj Syamala Vivek Narasayya Herodotus Herodotou Microsoft Research Cyprus University of Technology Paulo Tomita Alex Chen Jack Zhang Junhua Wang Microsoft Bing


PerfIso: Performance Isolation for Commercial Latency-Sensitive Services

Călin Iorgulescu (EPFL), Reza Azimi (Brown University), Youngjin Kwon (University of Texas)

Sameh Elnikety, Manoj Syamala, Vivek Narasayya (Microsoft Research), Herodotus Herodotou (Cyprus University of Technology)

Paulo Tomita, Alex Chen, Jack Zhang, Junhua Wang (Microsoft Bing)

Interactive services must feel instantaneous

≤ 0.1 s

2 / 29, 7/12/18

A single query involves hundreds of machines!

≤ 0.1 s

[Diagram label: Web Index]

Embarrassingly parallel search

Slowest response must be << 0.1 s

Multiple layers of aggregation! Just one service out of many!

Machines are provisioned for peak load

[Chart: query arrival rate by day, Wednesday through Tuesday. Request-rate variation for a Microsoft Bing sub-cluster over 1 week in 2017.]

Peak load >> Average load

Datacenters have spare resources

How can we leverage this?

Solution: colocate batch jobs with online services

• Get spare resources to do useful work

• Primary tenant – guaranteed performance

  • e.g., Bing IndexServe

• Secondary tenant – best-effort performance

  • e.g., Apache Spark

[Diagram: without colocation, the machine holds Primary and idle capacity; with colocation + PerfIso, it holds Primary and a batch job.]

PerfIso: performance isolation for online services

• Maintains P99 of response-times (10s of ms) under colocation

Provides performance isolation of Primary

• 45% of the CPU is used to do useful batch work

Increases system efficiency

• Many different interactive services and hardware setups

Deployed on over 90,000 servers
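For readers less familiar with the P99 metric used throughout, here is a minimal sketch of how a 99th-percentile latency can be computed from samples. It uses the nearest-rank definition, which is one common convention and an assumption here; the slides do not say how Bing computes it, and the latency values are made up for illustration.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical latencies in ms: 98 fast responses and 2 slow ones.
latencies_ms = [10.0] * 98 + [40.0, 300.0]
p50 = percentile(latencies_ms, 50)
p99 = percentile(latencies_ms, 99)
print(f"P50 = {p50} ms, P99 = {p99} ms")
```

A couple of slow responses barely move the median but dominate the P99, which is why the tail is the metric being protected.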

Many papers published on performance isolation

Quasar [ASPLOS ’14], Heracles [ISCA ’15], Elfen [USENIX ATC ’16]

Existing solutions do not fit our requirements

PerfIso: Requirements

1. “Black-box”: Fewest assumptions about tenants (wider applicability)

2. “Standalone”: Primary acts like it runs alone (negligible interference)

3. “Integrability”: Minimize software-stack changes (easy deployment)

Why is Performance Isolation hard?

Interactive services – highly sensitive to interference!

Leaf-servers keep 99th percentile low

• Over 10 years of optimization work!

  • e.g., compression, adaptive parallelism, etc.

How often does the 99th percentile occur?

• For 10,000 queries / s → 100 times / s

What happens in a 100-node fanout?

• Every query runs at the 99th percentile!
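The arithmetic behind these two questions can be checked with a short sketch. The function names are illustrative, and the fanout estimate assumes leaf latencies are independent, which the slides do not state.

```python
def tail_events_per_second(query_rate, percentile=99):
    """Queries per second that fall beyond the given percentile."""
    return query_rate * (100 - percentile) / 100

def expected_slow_leaves(fanout, percentile=99):
    """Average number of leaf responses per query that fall in the tail."""
    return fanout * (100 - percentile) / 100

def prob_any_leaf_in_tail(fanout, percentile=99):
    """Chance that at least one leaf is in the tail.
    Assumption: leaf latencies are independent of each other."""
    return 1 - (percentile / 100) ** fanout

rate = tail_events_per_second(10_000)
slow = expected_slow_leaves(100)
prob = prob_any_leaf_in_tail(100)
print(f"{rate:.0f} tail events per second on one leaf")
print(f"{slow:.0f} slow leaf per query on average at fanout 100")
print(f"{prob:.0%} of queries wait on at least one tail leaf")
```

Under the independence assumption, roughly 63% of queries see at least one leaf response in the tail, and on average one leaf per query lands there. That is the sense in which the leaf-level P99 effectively sets the latency of a typical query.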

The Primary demands many resources quickly

• Bing IndexServe: multi-threaded web-index server

➢ Up to 15 threads wake up in 5 μs¹

• Burstiness due to query-processing optimizations!

  • some queries will spawn many workers

• Workload arrives in bursts – exacerbates problem

¹ Constant query rate 4,000 Q/s, 500k queries experiment

The Primary must behave as if it were standalone

• Primary’s resource demands must be fulfilled instantly.

• Any delays → performance penalties incurred

• Any resource can become a performance bottleneck.

If a query is delayed, it is already too late!

PerfIso

PerfIso: Implemented as a user-mode service

• Only keeps track of Secondary’s PID

[Diagram: Primary, PerfIso, and Secondary running on top of the OS.]

PerfIso: Managed resources

• CPU: Blind Isolation

• Disk: I/O throttling

• Memory: Restrict footprint

• Network: Throttle egress packets

[Diagram: PerfIso sits between the OS and the two tenants, Primary and Secondary.]
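The slides do not say how the Secondary's disk I/O or egress traffic is throttled. As a generic illustration of throttling only, and not PerfIso's mechanism, here is a token-bucket sketch; the class name, rates, and timestamps are all hypothetical.

```python
class TokenBucket:
    """Generic rate limiter (illustrative; not taken from PerfIso).
    Tokens refill at `rate_per_s` up to `burst`; each operation spends tokens."""

    def __init__(self, rate_per_s, burst):
        self.rate = rate_per_s
        self.capacity = burst
        self.tokens = burst
        self.last = 0.0

    def allow(self, now, cost=1.0):
        # Refill for the time elapsed since the last call, capped at the burst size.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False

# Hypothetical Secondary limited to 2 operations per second with a burst of 2.
bucket = TokenBucket(rate_per_s=2.0, burst=2.0)
decisions = [bucket.allow(now=t) for t in (0.0, 0.0, 0.0, 0.5, 0.6)]
print(decisions)
```

Passing the timestamp explicitly keeps the example deterministic; a real throttle would read a monotonic clock instead.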

PerfIso: CPU is the most important resource

• CPU: Blind Isolation

CPU sharing without PerfIso

• Primary and Secondary compete for cores.

• Secondary is aggressive: no idle cores exist.

[Diagram: a machine with 12 cores, all occupied by Primary or Secondary.]

CPU Blind Isolation: Keep a “buffer” of idle cores

• PerfIso only knows the Secondary.

• Restrict Secondary by changing core affinities.

• Primary is unrestricted. Secondary is restricted.

[Diagram: a machine with 12 cores divided among Primary, restricted Secondary, and a buffer of idle cores.]

Restrict Secondary to create a buffer of idle cores.

Primary can expand into the buffer!
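Restricting the Secondary through core affinities can be sketched as follows. This is a minimal illustration under assumptions: the slides do not say which cores PerfIso picks or which OS call it uses, so the "highest-numbered cores" choice and the helper names are hypothetical, and `os.sched_setaffinity` is the Linux-only Python call used here as a stand-in.

```python
import os

def secondary_core_set(total_cores, secondary_count):
    """Pick the highest-numbered `secondary_count` cores for the Secondary.
    (Which cores to pick is an assumption, not stated in the slides.)"""
    if not 0 <= secondary_count <= total_cores:
        raise ValueError("secondary_count out of range")
    return set(range(total_cores - secondary_count, total_cores))

def affinity_mask(cores):
    """Encode core ids as a bitmask: bit i set means core i is allowed."""
    mask = 0
    for core in cores:
        mask |= 1 << core
    return mask

def restrict_process(pid, cores):
    """Apply the affinity where the platform supports it; return True if applied."""
    if not cores:
        raise ValueError("a process needs at least one core")
    if hasattr(os, "sched_setaffinity"):  # available on Linux only
        os.sched_setaffinity(pid, cores)
        return True
    return False

# 12-core machine as in the slides; the Secondary is confined to 4 cores.
cores = secondary_core_set(total_cores=12, secondary_count=4)
mask = affinity_mask(cores)
print(sorted(cores), hex(mask))
```

Only the Secondary's affinity is touched, which matches the "blind" idea: the Primary needs no cooperation and can spread onto any core, including those outside the Secondary's set.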


CPU Blind Isolation: React to bursts from Primary

• Continuously read idle core status.

• Adjust Secondary “slice” to maintain buffer.

[Diagram: a machine with 12 cores showing Primary, restricted Secondary, and the buffer of idle cores.]

CPU Blind Isolation: Secondary gets spare cores

• Allow Secondary to use spare idle cores.

• Release spare cores incrementally.
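The polling behaviour described on the last two slides can be sketched as a small control loop. The exact policy is not given in the slides, so the rules below are assumptions: shrink the Secondary by the whole deficit as soon as the idle buffer is too small, and grow it by one core per step when there is spare capacity. The buffer size of 2 and the demand trace are hypothetical.

```python
def next_secondary_cores(secondary, idle, buffer_target, total_cores):
    """One polling step: return the Secondary's new core count."""
    if idle < buffer_target:
        # A Primary burst ate into the buffer: give back the full deficit at once.
        secondary -= buffer_target - idle
    elif idle > buffer_target:
        # Spare cores beyond the buffer: release them to the Secondary one at a time.
        secondary += 1
    return max(0, min(secondary, total_cores))

def simulate(primary_demand, total_cores=12, buffer_target=2, secondary=0):
    """Replay a trace of Primary core usage; return the Secondary allocation per step."""
    history = []
    for primary in primary_demand:
        idle = max(0, total_cores - primary - secondary)
        secondary = next_secondary_cores(secondary, idle, buffer_target, total_cores)
        history.append(secondary)
    return history

# Hypothetical trace: the Primary uses 2 cores, bursts to 8, then calms down.
trace = [2, 2, 2, 2, 8, 8, 2]
history = simulate(trace)
print(history)
```

The asymmetry is deliberate in this sketch: giving cores back quickly protects the Primary during a burst, while releasing them slowly avoids oscillation when load drops.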

[Diagram: a machine with 12 cores showing Primary, restricted Secondary, the buffer of idle cores, and idle cores.]

CPU Blind Isolation: We dedicate 1 core to PerfIso

• PerfIso does continuous polling → we affinitize it to 1 core.

[Diagram: a machine with 12 cores showing PerfIso on its own core, alongside Primary, restricted Secondary, the buffer of idle cores, and idle cores.]

Evaluation

Experiment testbed

Hardware

• Intel Xeon E5 – 24 cores (48 w/ HT)

• 128 GB RAM

Primary: Bing IndexServe

• 569 GB index-slice

• Open-loop client

• 500,000 queries @ 2,000 Q / s

Secondary: CPU micro-benchmark
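An open-loop client sends queries on a fixed schedule regardless of how quickly responses come back, so a slow server cannot lower the offered load. A minimal sketch of such a schedule follows; evenly spaced arrivals are an assumption, since the slide gives only the query count and rate.

```python
def open_loop_schedule(num_queries, rate_per_s):
    """Send times in seconds for an open-loop client.
    Times depend only on the clock, never on when responses arrive."""
    interval = 1.0 / rate_per_s
    return [i * interval for i in range(num_queries)]

def run_duration_s(num_queries, rate_per_s):
    """Length of the run implied by the query count and rate."""
    return num_queries / rate_per_s

schedule = open_loop_schedule(num_queries=5, rate_per_s=2000)
duration = run_duration_s(500_000, 2000)
print([round(t * 1000, 3) for t in schedule], "ms")
print(f"500,000 queries at 2,000 Q/s take {duration:.0f} s")
```

At the slide's settings the run lasts 250 seconds, a little over four minutes.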

Single server: PerfIso protects tail-latency

Secondary: CPU-intensive micro-benchmark

[Chart: P99 latency (ms), log scale from 1 to 1000, with the SLO marked.]

• No isolation: Standalone 11.65, Colocated 349.08 (one order of magnitude worse!)

• PerfIso: Standalone 11.65, Colocated 12.07

Single server: CPU utilization 3x higher!

Secondary: CPU-intensive micro-benchmark

[Chart: CPU utilization % (Primary plus Secondary), scale 0 to 100.]

• No colocation: 21%

• PerfIso: 67%

46% of CPU time → useful work

[Inset chart: P99 latency (ms). Standalone 11.65, Colocated 12.07.]

Restricting CPU cycles does not work

Secondary: CPU-intensive micro-benchmark

[Chart: P99 latency (ms), log scale from 1 to 1000, with the SLO marked.]

• Standalone: 11.65

• No isolation: 349.08

• PerfIso: 12.07

• Restrict cycles: 33.74

Secondary → 5% of CPU cycles

P99 latency – 3x higher than SLO!

Restricting CPU cores does not work

Secondary: CPU-intensive micro-benchmark

[Chart: P99 latency (ms), log scale from 1 to 1000, with the SLO marked. Standalone 11.65, No isolation 349.08, PerfIso 12.07, Restrict cores 11.63.]

[Chart: CPU utilization % (Primary plus Secondary), scale 0 to 100. Standalone 21%, PerfIso 67%, Restrict cores 38%.]

Provisioned for peak load →

CPU utilization ~30% lower!
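The headline figures on the evaluation slides follow from the chart values by simple arithmetic, sketched below. Two assumptions are made: slowdowns are computed against the standalone P99 because the slides mark the SLO on the charts without giving its value, and the batch share is taken as the difference between colocated and standalone utilization.

```python
# Values copied from the evaluation slides.
p99_ms = {
    "standalone": 11.65,
    "no_isolation": 349.08,
    "perfiso": 12.07,
    "restrict_cycles": 33.74,
    "restrict_cores": 11.63,
}
cpu_util_pct = {"standalone": 21, "perfiso": 67, "restrict_cores": 38}

def slowdown(config):
    """P99 latency relative to the standalone run."""
    return p99_ms[config] / p99_ms["standalone"]

# Utilization added by the Secondary under PerfIso, in percentage points.
batch_share = cpu_util_pct["perfiso"] - cpu_util_pct["standalone"]
# Utilization ratio behind the "3x higher" headline.
util_gain = cpu_util_pct["perfiso"] / cpu_util_pct["standalone"]
# Utilization lost by static core restriction compared with PerfIso.
static_gap = cpu_util_pct["perfiso"] - cpu_util_pct["restrict_cores"]

for name in p99_ms:
    print(f"{name}: {slowdown(name):.2f}x standalone P99")
print(f"batch share: {batch_share} points, gain: {util_gain:.1f}x, static gap: {static_gap} points")
```

Read this way, running with no isolation is about 30 times slower at the tail, restricting cycles is about 2.9 times slower, and static core restriction keeps latency flat but gives up 29 points of utilization.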

1-hour run of 650-machine cluster

Secondary: Machine-Learning computation

[Chart: average CPU utilization % (0 to 100) over time (0 to 60 minutes).]

[Chart: Top-Level Aggregator P99 latency (ms, 0 to 40) and queries / s (0 to 5000) over time (0 to 60 minutes).]

Average CPU utilization is 50% - 80%!

Interesting details in the paper

• Effectiveness of static CPU isolation methods

  • Restricting CPU cycles

  • Restricting CPU cores

• Comparison of state-of-the-art techniques

• Managing disk, memory, and network

PerfIso: colocate batch jobs with online services

• Black-box: do not tailor to one specific service

• Robustness: favor user-mode over kernel implementation

• Headroom: some core-slack makes Primary behave like standalone

• CPU Blind Isolation → colocation without impacting service performance