elfen scheduling - usenix · elfen scheduling fine-grain principled borrowing from latency-critical...

Elfen SchedulingFine-Grain Principled Borrowing from

Latency-Critical Workloads using Simultaneous Multithreading (SMT)

Xi YangAustralian National University

Stephen M BlackburnAustralian National University

Kathryn S McKinleyMicrosoft Research

Tail Latency Matters

2

Two second slowdown reduced revenue/user by 4.3%.

[Eric Schurman, Bing]

400 millisecond delay decreased searches/user by 0.59%.

[Jack Brutlag, Google]

Luiz André Barroso, Urs Hölzle“The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines”

“Such WSCs tend to have relatively low average utilization, spending most of its time in the10 - 50% CPU utilization range.”

0

50

100

150

200

0 20 40 60 80 100 120 140 160 180

Late

ncy

(ms)

RPS

Lucene alone 50%ileLucene alone 95%ileLucene alone 99%ile

High Responsiveness—Low Utilization

3

Intel Xeon-D 1540 Broadwell

LC



1 core, no SMT

Service Level Objective100ms SLO

0

50

100

150

200

0 20 40 60 80 100 120 140 160 180

Late

ncy

(ms)

RPS

Lucene alone 99%ile

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

0 20 40 60 80 100 120 140 160 180

Uti

lizat

ion

/ no

SM

T

RPS

34% w/ SMT

67% no SMT

LC

Soak up Slack with Batch?

4

LC batch

core core core core

shared cache

Co-running on different coresSMT turned off

responsiveness requires idle coresbecause OS descheduling is slow

LC idle LCidlecorebatchidleLC

shared cache shared cache

core core core core t1 t2 t1 t2 t1 t2 t1 t2t1 t2 t1 t2 t1 t2 t1 t2

Co-running on different coresSMT turned off

Co-running on same core in SMT lanes

0

100

200

300

400

500

600

10 20 30 40 50 60 70 80 90 100

99%i

le la

tenc

y (m

s)

RPS

Lucene alonewith IPC 1.0with IPC 0.01

SMT Co-Runner

5

Even IPC 0.01 violates SLO at low load!

while(1);

while(1) {movnti();mfence();

}

IPC 1.0

IPC 0.01

1 core, 2 SMT lanes

SLO

Great utilization!

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20 30 40 50 60 70 80 90 100

Uti

lizat

ion

/ no

SM

T

RPS

6

FunctionalUnits

Load StoreQueue

IssueLogicLanes

timelucene

Simultaneous Multithreading OFF

7

lane 1

FunctionalUnits

Load StoreQueue

IssueLogicLanes

Dynamically shared

Staticallypartitioned

Round-robinshared

timelucene

IPC 0.01

Active SMT lanes share critical resources

lane 2

Simultaneous Multithreading ON

Principled Borrowing

8

time

busy idle

Batch borrows hardware when LC is idle

Batch releases hardware when LC is busy

Can we implement principled borrowing on current hardware?

LC

batch

Hardware is Ready — Software is Not

9

time

Batch lane calls “mwait”

Thread sleeps, releasing hardware to OS (~2K cycles)

OS supports thread sleeping, but not hardware sleepingrelease SMT hardware to other lane

OS schedules batch lane with any ready job

LC

batch

nanonap()

10

Semantics

Thread invoking nanonap releases SMT hardware without releasing SMT context

OS cannot schedule hardware context

OS can interrupt and wakeup thread

nanonap()

11

per_cpu_variable: nap_flag;void nanonap() {

enter_kernel();disable_preemption();my_nap_flag = this_cpu_flag(nap_flag);monitor(my_nap_flag);mwait();enable_preemption();leave_kernel();

}

timeLC

batch

nanonap

Elfen Scheduler

12

Instrument batch workloads todetect LC threads & nap

Bind latency-critical threads to LC laneBind batch threads to batch lane

No change to latency-critical threads

Elfen Scheduler

13

time

12

1 Batch thread borrows resources, continuously checks LC lane status

2 LC starts, batch calls nanonap() to release SMT hardware resources

OS touches nap_flag to wake up batch thread

/* fast path check injected into method body */check:if (!request_lane_idle)

slow_path();

slow_path() {nanonap(); }

1

2

/* maps lane IDs to the running task */exposed SHIM signal: cpu_task_maptask_switch(task T) {

cpu_task_map[thiscpu] = T; }idle_task() { // wake up any waiting batch thread update_nap_flag_of_partner_lane(); ......

}

LC

batch

3

3

3

nanonap()

0

20

40

60

80

100

120

140

10 20 30 40 50 60 70 80 90 100

99%i

le la

tenc

y (m

s)

RPS

w antlr

w bloat

w eclipse

w fop

w hsqldb

w jython

w luindex

w lusearch

w pmd

w xalan

Lucene alone

Results: Borrow Idle

14

1 core, 2 SMT lanes

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20 30 40 50 60 70 80 90 100

Uti

lizat

ion

/ no

SM

T

RPS

increased utilization 10x - 1.5x

0

20

40

60

80

100

120

140

200 400 600 800 1000

99%i

le la

tenc

y (m

s)

RPS

7 cores, 2x7 SMT lanes

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 400 600 800 1000

Uti

lizat

ion

/ no

SM

T

RPS

4x - 0.19x

same latency!

Beyond Mutual Exclusion

15

time

Mutual exclusion

LC

batch

Concurrent co-running with a budget

• Fixed budget policy• Refresh budget policy• Dynamic budget policy

• Borrow idle policy

nanonap nanonap nanonap

16

0

20

40

60

80

100

120

140

10 20 30 40 50 60 70 80 90 100

99%i

le la

tenc

y (m

s)

RPS

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20 30 40 50 60 70 80 90 100

Uti

lizat

ion

/ no

SM

T

RPS

0

20

40

60

80

100

120

140

200 400 600 800 1000

99%i

le la

tenc

y (m

s)RPS

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 400 600 800 1000

Uti

lizat

ion

/ no

SM

T

RPS

Borrow idle on 1 core Borrow idle on 7 cores Dynamic budget 7 cores

0

20

40

60

80

100

120

140

200 400 600 800 1000

99%i

le la

tenc

y (m

s)

RPS

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 400 600 800 1000

Uti

lizat

ion

/ no

SM

T

RPS

10x - 0.2x

Conclusion

17

Principled borrowing for SMT

nanonap() releases STM hardware without giving it to OS

Elfen scheduler implementation & policies

90% to 25% increase in utilization

Same tail latency!

Thank you!

https://github.com/elfenscheduler

BACKUP SLIDES

18

0

50

100

150

200

0 20 40 60 80 100 120 140 160 180

Late

ncy

(ms)

RPS

50%ile Lucene alone95%ile Lucene alone99%ile Lucene alone

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

0 20 40 60 80 100 120 140 160 180

Uti

lizat

ion

/ no

SM

T

RPS


19

Intel Xeon-D 1540 Broadwell

67% no SMT

LC

34% w/ SMT



1 core, no SMT

Service Level Objective100ms SLO

LC


20

40

60

80

100

120

140

10 20 30 40 50 60 70 80 90 100

99%

ile la

tenc

y (m

s)

RPS

Lucene alonew antlrw bloatw eclipsew fopw hsqldbw jythonw luindexw lusearchw pmdw xalan

1 core, 2 SMT lanes

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20

30 40

50 60

70 80

90 100

Utilizatio

n /

no S

MT

RPS

40

60

80

100

120

140

200 300 400 500 600 700 800 900 10001100

RPS


0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 300

400 500

600 700

800 900

1000 1100

RPS

increased utilization 90% to 37%


21

Borrow idle on 1 cores

40

60

80

100

120

140

200 300 400 500 600 700 800 900 10001100

RPS

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 300

400 500

600 700

800 900

1000 1100

RPS

40

60

80

100

120

140

200 300 400 500 600 700 800 900 10001100

RPS

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 300

400 500

600 700

800 900

1000 1100

RPS

40

60

80

100

120

140

10 20 30 40 50 60 70 80 90 100

99%

ile la

tenc

y (m

s)

RPS


0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20

30 40

50 60

70 80

90 100

Utilization

/ n

o S

MT

RPS

Borrow idle on 7 cores Dynamic budget 7 cores


Result

22

Result

23

BACKUP SLIDES

24

Tail Latency Matters

25

Two second slowdown reduced revenue/user by 4.3%.

[Eric Schurman, Bing]

400 millisecond delay decreased searches/user by 0.59%.

[Jack Brutlag, Google] p75, Luiz André Barroso, Urs Hölzle“The Datacenter as a Computer: An Introduction

to the Design of Warehouse-Scale Machines”


0

50

100

150

200

20 40 60 80 100 120

140 160

180

Late

ncy

(m

s)

RPS

50%ile

95%ile

99%ile

0

50

100

150

200

20 40 60 80 100 120

140 160

180

Late

ncy

(m

s)

RPS

50%ile


26

0

50

100

150

200

20 40 60 80 100 120 140 160 180

Late

ncy

(ms)

RPS

50%ile95%ile99%ile

Lucene search benchmark on one core of an Intel Xeon-D 1540 Broadwell.

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

20 40

60 80

100 120

140 160

180

Utilizatio

n /

no S

MT

RPS

67% no SMT

LC

34% w/ SMT

p75, Luiz André Barroso, Urs Hölzle“The Datacenter as a Computer: An Introduction

to the Design of Warehouse-Scale Machines”


Soak up Slack with Batch?

27

core core core core

shared cache

LC batch

shared cache shared cache

Co-running on different cores SMT turned off Co-running on same core in SMT lanes

LC idle core LCbatch core core core core t1 t2 t1 t2 t1 t2 t1 t2t1 t2 t1 t2 t1 t2 t1 t2

SMT Co-Runner

28

0

100

200

300

400

500

600

10 20 30 40 50 60 70 80 90 100

99%

ile la

tenc

y (m

s)

RPS

Lucene alonewith IPC 0.01with IPC 1

Even IPC 0.01 violates SLO at low load!

while(1);

while(1) {movnti();mfence();

}

IPC 1.0

IPC 0.01

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20

30 40

50 60

70 80

90 100

Utilization / n

o S

MT

RPS

Even Low IPC Co-Runner is Destructive

29

LC

FunctionalUnits

Load StoreQueue

IssueLogicLanes

Dynamically shared

Staticallypartitioned

Round-robinshared

timelucene

IPC 0.01

Critical resources are shared by active SMT lanes.

batch

Principled Borrowing

30

time

busy idle

Batch lane borrows hardware resources when LC lane is idle

Batch lane releases hardware resources when LC lane is busy

Can we implement principled borrowing on current hardware?

LC

batch

Hardware is Ready — Software is Not

31

time

Batch lane calls “mwait”

Thread sleeps, releasing hardware to OS (~2K cycles)

When LC lane is busy, batch lane must do nothing (nap)

OS schedules batch lane with any ready job

LC

batch

nanonap()

32

Thread nap semantics

Put SMT hardware lane in idle state

OS cannot schedule hardware context

OS can interrupt and wakeup thread

nanonap()

33

per_cpu_variable: nap_flag;void nanonap() {

enter_kernel();disable_preemption();my_nap_flag = this_cpu_flag(nap_flag);monitor(my_nap_flag);mwait();enable_preemption();leave_kernel();

}

time

batch lane naps

LC

batch

Elfen Scheduler

No change to latency critical threads

Bind latency critical threads to LC laneBind batch threads to batch lane

Instrument batch workloads to detect LC threads quickly & nap

34

Elfen Scheduler

35

time

1 2

3

1 Batch thread borrows resources continuously checks LC lane status

2 LC starts, batch calls nanonap() to release SMT hardware context

3 OS touches nap_flag to wake up batch thread

/* fast path check injected into method body */check:if (!request_lane_idle)

slow_path();

slow_path() { nanonap(); }

1

2

/* maps lane IDs to the running task */exposed SHIM signal: cpu_task_maptask_switch(task T) {

cpu_task_map[thiscpu] = T; }idle_task() {

// wake up any waiting batch thread update_nap_flag_of_partner_lane(); ......

}

3

LC

batch


36

40

60

80

100

120

140

10 20 30 40 50 60 70 80 90 100

99%

ile la

tenc

y (m

s)

RPS


1 core, 2 SMT lanes

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20

30 40

50 60

70 80

90 100

Utilization

/ n

o S

MT

RPS

40

60

80

100

120

140

200 300 400 500 600 700 800 900 10001100

RPS


0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 300

400 500

600 700

800 900

1000 1100

RPS



Beyond Mutual Exclusion

37

time

Mutual exclusion

LC

batch

Concurrent co-running with a budget

• Fixed budget policy• Refresh budget policy• Dynamic budget policy

• Borrow idle policy

nanonap nanonap nanonap

38

Borrow idle on 1 cores

40

60

80

100

120

140

200 300 400 500 600 700 800 900 10001100

RPS

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 300

400 500

600 700

800 900

1000 1100

RPS

40

60

80

100

120

140

200 300 400 500 600 700 800 900 10001100

RPS

0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

200 300

400 500

600 700

800 900

1000 1100

RPS

40

60

80

100

120

140

10 20 30 40 50 60 70 80 90 100

99%

ile la

tenc

y (m

s)

RPS


0

0.25

0.5

0.75

1

1.25

1.5

1.75

2

10 20

30 40

50 60

70 80

90 100

Utilization

/ n

o S

MT

RPS

Borrow idle on 7 cores Dynamic budget 7 cores


Conclusion

39

Principled borrowing for SMT

nanonap() releases STM hardware without giving it to OS

Elfen scheduler implementation & policies

90% to 25% increase in utilization

Same tail latency!?

Thank you!

https: //github.com/elfenscheduler

elfen scheduling - usenix · elfen scheduling fine-grain principled borrowing from latency-critical...

Documents