elfen scheduling - usenix · elfen scheduling fine-grain principled borrowing from latency-critical...

39
Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University Stephen M Blackburn Australian National University Kathryn S McKinley Microsoft Research

Upload: nguyenbao

Post on 17-Sep-2018

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Elfen SchedulingFine-Grain Principled Borrowing from

Latency-Critical Workloads using Simultaneous Multithreading (SMT)

Xi YangAustralian National University

Stephen M BlackburnAustralian National University

Kathryn S McKinleyMicrosoft Research

Page 2: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Tail Latency Matters

2

Two second slowdown reduced revenue/user by 4.3%.

[Eric Schurman, Bing]

400 millisecond delay decreased searches/user by 0.59%.

[Jack Brutlag, Google]

Luiz André Barroso, Urs Hölzle“The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines”

“Such WSCs tend to have relatively low average utilization, spending most of its time in the10 - 50% CPU utilization range.”

Page 3: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

0

50

100

150

200

0 20 40 60 80 100 120 140 160 180

Late

ncy

(ms)

RPS

Lucene alone 50%ileLucene alone 95%ileLucene alone 99%ile

High Responsiveness—Low Utilization

3

Intel Xeon-D 1540 Broadwell

LC

Luiz André Barroso, Urs Hölzle“The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines”

“Such WSCs tend to have relatively low average utilization, spending most of its time in the10 - 50% CPU utilization range.”

1 core, no SMT

Service Level Objective100ms SLO

0

50

100

150

200

0 20 40 60 80 100 120 140 160 180

Late

ncy

(ms)

RPS

Lucene alone 99%ile

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

0 20 40 60 80 100 120 140 160 180

Uti

lizat

ion

/ no

SM

T

RPS

34% w/ SMT

67% no SMT

LC

Page 4: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Soak up Slack with Batch?

4

LC batch

core core core core

shared cache

Co-running on different coresSMT turned off

responsiveness requires idle coresbecause OS descheduling is slow

LC idle LCidlecorebatchidleLC

shared cache shared cache

core core core core t1 t2 t1 t2 t1 t2 t1 t2t1 t2 t1 t2 t1 t2 t1 t2

Co-running on different coresSMT turned off

Co-running on same core in SMT lanes

Page 5: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

0

100

200

300

400

500

600

10 20 30 40 50 60 70 80 90 100

99%i

le la

tenc

y (m

s)

RPS

Lucene alonewith IPC 1.0with IPC 0.01

SMT Co-Runner

5

Even IPC 0.01 violates SLO at low load!

while(1);

while(1) {movnti();mfence();

}

IPC 1.0

IPC 0.01

1 core, 2 SMT lanes

SLO

Great utilization!

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20 30 40 50 60 70 80 90 100

Uti

lizat

ion

/ no

SM

T

RPS

Page 6: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

6

FunctionalUnits

Load StoreQueue

IssueLogicLanes

timelucene

Simultaneous Multithreading OFF

Page 7: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

7

lane 1

FunctionalUnits

Load StoreQueue

IssueLogicLanes

Dynamically shared

Staticallypartitioned

Round-robinshared

timelucene

IPC 0.01

Active SMT lanes share critical resources

lane 2

Simultaneous Multithreading ON

Page 8: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Principled Borrowing

8

time

busy idle

Batch borrows hardware when LC is idle

Batch releases hardware when LC is busy

Can we implement principled borrowing on current hardware?

LC

batch

Page 9: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Hardware is Ready — Software is Not

9

time

Batch lane calls “mwait”

Thread sleeps, releasing hardware to OS (~2K cycles)

OS supports thread sleeping, but not hardware sleepingrelease SMT hardware to other lane

OS schedules batch lane with any ready job

LC

batch

Page 10: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

nanonap()

10

Semantics

Thread invoking nanonap releases SMT hardware without releasing SMT context

OS cannot schedule hardware context

OS can interrupt and wakeup thread

Page 11: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

nanonap()

11

per_cpu_variable: nap_flag;void nanonap() {

enter_kernel();disable_preemption();my_nap_flag = this_cpu_flag(nap_flag);monitor(my_nap_flag);mwait();enable_preemption();leave_kernel();

}

timeLC

batch

nanonap

Page 12: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Elfen Scheduler

12

Instrument batch workloads todetect LC threads & nap

Bind latency-critical threads to LC laneBind batch threads to batch lane

No change to latency-critical threads

Page 13: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Elfen Scheduler

13

time

12

1 Batch thread borrows resources, continuously checks LC lane status

2 LC starts, batch calls nanonap() to release SMT hardware resources

OS touches nap_flag to wake up batch thread

/* fast path check injected into method body */check:if (!request_lane_idle)

slow_path();

slow_path() {nanonap(); }

1

2

/* maps lane IDs to the running task */exposed SHIM signal: cpu_task_maptask_switch(task T) {

cpu_task_map[thiscpu] = T; }idle_task() { // wake up any waiting batch thread update_nap_flag_of_partner_lane(); ......

}

LC

batch

3

3

3

nanonap()

Page 14: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

0

20

40

60

80

100

120

140

10 20 30 40 50 60 70 80 90 100

99%i

le la

tenc

y (m

s)

RPS

w antlr

w bloat

w eclipse

w fop

w hsqldb

w jython

w luindex

w lusearch

w pmd

w xalan

Lucene alone

Results: Borrow Idle

14

1 core, 2 SMT lanes

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20 30 40 50 60 70 80 90 100

Uti

lizat

ion

/ no

SM

T

RPS

increased utilization 10x - 1.5x

0

20

40

60

80

100

120

140

200 400 600 800 1000

99%i

le la

tenc

y (m

s)

RPS

7 cores, 2x7 SMT lanes

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 400 600 800 1000

Uti

lizat

ion

/ no

SM

T

RPS

4x - 0.19x

same latency!

Page 15: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Beyond Mutual Exclusion

15

time

Mutual exclusion

LC

batch

Concurrent co-running with a budget

• Fixed budget policy• Refresh budget policy• Dynamic budget policy

• Borrow idle policy

nanonap nanonap nanonap

Page 16: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

16

0

20

40

60

80

100

120

140

10 20 30 40 50 60 70 80 90 100

99%i

le la

tenc

y (m

s)

RPS

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20 30 40 50 60 70 80 90 100

Uti

lizat

ion

/ no

SM

T

RPS

0

20

40

60

80

100

120

140

200 400 600 800 1000

99%i

le la

tenc

y (m

s)RPS

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 400 600 800 1000

Uti

lizat

ion

/ no

SM

T

RPS

Borrow idle on 1 core Borrow idle on 7 cores Dynamic budget 7 cores

0

20

40

60

80

100

120

140

200 400 600 800 1000

99%i

le la

tenc

y (m

s)

RPS

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 400 600 800 1000

Uti

lizat

ion

/ no

SM

T

RPS

10x - 0.2x

Page 17: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Conclusion

17

Principled borrowing for SMT

nanonap() releases STM hardware without giving it to OS

Elfen scheduler implementation & policies

90% to 25% increase in utilization

Same tail latency!

Thank you!

https://github.com/elfenscheduler

Page 18: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

BACKUP SLIDES

18

Page 19: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

0

50

100

150

200

0 20 40 60 80 100 120 140 160 180

Late

ncy

(ms)

RPS

50%ile Lucene alone95%ile Lucene alone99%ile Lucene alone

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

0 20 40 60 80 100 120 140 160 180

Uti

lizat

ion

/ no

SM

T

RPS

High Responsiveness—Low Utilization

19

Intel Xeon-D 1540 Broadwell

67% no SMT

LC

34% w/ SMT

Luiz André Barroso, Urs Hölzle“The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines”

“Such WSCs tend to have relatively low average utilization, spending most of its time in the10 - 50% CPU utilization range.”

1 core, no SMT

Service Level Objective100ms SLO

LC

Page 20: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Results: Borrow Idle

20

40

60

80

100

120

140

10 20 30 40 50 60 70 80 90 100

99%

ile la

tenc

y (m

s)

RPS

Lucene alonew antlrw bloatw eclipsew fopw hsqldbw jythonw luindexw lusearchw pmdw xalan

1 core, 2 SMT lanes

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20

30 40

50 60

70 80

90 100

Utilizatio

n /

no S

MT

RPS

40

60

80

100

120

140

200 300 400 500 600 700 800 900 10001100

RPS

7 cores, 2x7 SMT lanes

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 300

400 500

600 700

800 900

1000 1100

RPS

increased utilization 90% to 37%

increased utilization 80% to 10%

Page 21: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

21

Borrow idle on 1 cores

40

60

80

100

120

140

200 300 400 500 600 700 800 900 10001100

RPS

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 300

400 500

600 700

800 900

1000 1100

RPS

40

60

80

100

120

140

200 300 400 500 600 700 800 900 10001100

RPS

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 300

400 500

600 700

800 900

1000 1100

RPS

40

60

80

100

120

140

10 20 30 40 50 60 70 80 90 100

99%

ile la

tenc

y (m

s)

RPS

Lucene alonew antlrw bloatw eclipsew fopw hsqldbw jythonw luindexw lusearchw pmdw xalan

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20

30 40

50 60

70 80

90 100

Utilization

/ n

o S

MT

RPS

Borrow idle on 7 cores Dynamic budget 7 cores

increased utilization 100% to 10%

Page 22: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Result

22

Page 23: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Result

23

Page 24: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

BACKUP SLIDES

24

Page 25: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Tail Latency Matters

25

Two second slowdown reduced revenue/user by 4.3%.

[Eric Schurman, Bing]

400 millisecond delay decreased searches/user by 0.59%.

[Jack Brutlag, Google] p75, Luiz André Barroso, Urs Hölzle“The Datacenter as a Computer: An Introduction

to the Design of Warehouse-Scale Machines”

“Such WSCs tend to have relatively low average utilization, spending most of its time in the10 - 50% CPU utilization range.”

Page 26: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

0

50

100

150

200

20 40 60 80 100 120

140 160

180

Late

ncy

(m

s)

RPS

50%ile

95%ile

99%ile

0

50

100

150

200

20 40 60 80 100 120

140 160

180

Late

ncy

(m

s)

RPS

50%ile

High Responsiveness—Low Utilization

26

0

50

100

150

200

20 40 60 80 100 120 140 160 180

Late

ncy

(ms)

RPS

50%ile95%ile99%ile

Lucene search benchmark on one core of an Intel Xeon-D 1540 Broadwell.

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

20 40

60 80

100 120

140 160

180

Utilizatio

n /

no S

MT

RPS

67% no SMT

LC

34% w/ SMT

p75, Luiz André Barroso, Urs Hölzle“The Datacenter as a Computer: An Introduction

to the Design of Warehouse-Scale Machines”

“Such WSCs tend to have relatively low average utilization, spending most of its time in the10 - 50% CPU utilization range.”

Page 27: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Soak up Slack with Batch?

27

core core core core

shared cache

LC batch

shared cache shared cache

Co-running on different cores SMT turned off Co-running on same core in SMT lanes

LC idle core LCbatch core core core core t1 t2 t1 t2 t1 t2 t1 t2t1 t2 t1 t2 t1 t2 t1 t2

Page 28: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

SMT Co-Runner

28

0

100

200

300

400

500

600

10 20 30 40 50 60 70 80 90 100

99%

ile la

tenc

y (m

s)

RPS

Lucene alonewith IPC 0.01with IPC 1

Even IPC 0.01 violates SLO at low load!

while(1);

while(1) {movnti();mfence();

}

IPC 1.0

IPC 0.01

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20

30 40

50 60

70 80

90 100

Utilization / n

o S

MT

RPS

Page 29: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Even Low IPC Co-Runner is Destructive

29

LC

FunctionalUnits

Load StoreQueue

IssueLogicLanes

Dynamically shared

Staticallypartitioned

Round-robinshared

timelucene

IPC 0.01

Critical resources are shared by active SMT lanes.

batch

Page 30: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Principled Borrowing

30

time

busy idle

Batch lane borrows hardware resources when LC lane is idle

Batch lane releases hardware resources when LC lane is busy

Can we implement principled borrowing on current hardware?

LC

batch

Page 31: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Hardware is Ready — Software is Not

31

time

Batch lane calls “mwait”

Thread sleeps, releasing hardware to OS (~2K cycles)

When LC lane is busy, batch lane must do nothing (nap)

OS schedules batch lane with any ready job

LC

batch

Page 32: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

nanonap()

32

Thread nap semantics

Put SMT hardware lane in idle state

OS cannot schedule hardware context

OS can interrupt and wakeup thread

Page 33: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

nanonap()

33

per_cpu_variable: nap_flag;void nanonap() {

enter_kernel();disable_preemption();my_nap_flag = this_cpu_flag(nap_flag);monitor(my_nap_flag);mwait();enable_preemption();leave_kernel();

}

time

batch lane naps

LC

batch

Page 34: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Elfen Scheduler

No change to latency critical threads

Bind latency critical threads to LC laneBind batch threads to batch lane

Instrument batch workloads to detect LC threads quickly & nap

34

Page 35: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Elfen Scheduler

35

time

1 2

3

1 Batch thread borrows resources continuously checks LC lane status

2 LC starts, batch calls nanonap() to release SMT hardware context

3 OS touches nap_flag to wake up batch thread

/* fast path check injected into method body */check:if (!request_lane_idle)

slow_path();

slow_path() { nanonap(); }

1

2

/* maps lane IDs to the running task */exposed SHIM signal: cpu_task_maptask_switch(task T) {

cpu_task_map[thiscpu] = T; }idle_task() {

// wake up any waiting batch thread update_nap_flag_of_partner_lane(); ......

}

3

LC

batch

Page 36: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Results: Borrow Idle

36

40

60

80

100

120

140

10 20 30 40 50 60 70 80 90 100

99%

ile la

tenc

y (m

s)

RPS

Lucene alonew antlrw bloatw eclipsew fopw hsqldbw jythonw luindexw lusearchw pmdw xalan

1 core, 2 SMT lanes

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20

30 40

50 60

70 80

90 100

Utilization

/ n

o S

MT

RPS

40

60

80

100

120

140

200 300 400 500 600 700 800 900 10001100

RPS

7 cores, 2x7 SMT lanes

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 300

400 500

600 700

800 900

1000 1100

RPS

increased utilization 90% to 37%

increased utilization 80% to 10%

Page 37: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Beyond Mutual Exclusion

37

time

Mutual exclusion

LC

batch

Concurrent co-running with a budget

• Fixed budget policy• Refresh budget policy• Dynamic budget policy

• Borrow idle policy

nanonap nanonap nanonap

Page 38: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

38

Borrow idle on 1 cores

40

60

80

100

120

140

200 300 400 500 600 700 800 900 10001100

RPS

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 300

400 500

600 700

800 900

1000 1100

RPS

40

60

80

100

120

140

200 300 400 500 600 700 800 900 10001100

RPS

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 300

400 500

600 700

800 900

1000 1100

RPS

40

60

80

100

120

140

10 20 30 40 50 60 70 80 90 100

99%

ile la

tenc

y (m

s)

RPS

Lucene alonew antlrw bloatw eclipsew fopw hsqldbw jythonw luindexw lusearchw pmdw xalan

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20

30 40

50 60

70 80

90 100

Utilization

/ n

o S

MT

RPS

Borrow idle on 7 cores Dynamic budget 7 cores

increased utilization 100% to 10%

Page 39: Elfen Scheduling - USENIX · Elfen Scheduling Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT) Xi Yang Australian National University

Conclusion

39

Principled borrowing for SMT

nanonap() releases STM hardware without giving it to OS

Elfen scheduler implementation & policies

90% to 25% increase in utilization

Same tail latency!?

Thank you!

https: //github.com/elfenscheduler