A Scalable Ordering Primitive for Multicore Machines

Sanidhya Kashyap Changwoo Min Kangnyeon Kim Taesoo Kim

Era of multicore machines

Scope of multicore machines

Huge hardware thread parallelism

...

How are operations executed correctly? Ordering

Becomes scalability bottleneck

Example: Read Log Update (RLU)

● Extension of RCU
● Modifies objects in a thread’s local log
● Clock maintains correct snapshot (old vs new)
● Frees objects via epoch-based reclamation

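Read Log Update (RLU) operation

[Figure: objects A to E, with copies B’ and D’ held in per-thread logs. Each thread has a log/buffer to store copies, and each log has an RLU header. Threads P and Q are shown; the global clock (22) is read into the local clock (22) on start.]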

RLU commit operation

[Figure: threads P and Q over objects A to E, with copies B’, C’, and D’ in the logs. Before the commit: write clock (∞), global clock (22). After the commit: global clock (23), local clock (22), write clock (23).]

1. P updates clocks
2. P executes RCU-epoch: waits for Q to finish

Q will read only old objects

The logical clock maintains correctness/ordering. It is maintained via atomic instructions (FAA/CAS).
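
The clock handling on the two RLU slides can be sketched as follows. This is a minimal illustration, not the RLU implementation: the names `GlobalClock`, `RluThread`, and `reads_new_copy` are hypothetical, a lock stands in for the hardware fetch-and-add, and the wait for earlier readers is only noted in a comment.

```python
import threading

INFINITY = float("inf")


class GlobalClock:
    """Shared logical clock; a lock stands in for an atomic FAA instruction."""

    def __init__(self, start=0):
        self._value = start
        self._lock = threading.Lock()

    def read(self):
        return self._value

    def fetch_and_add(self, n=1):
        # Every committing writer serializes here: this is the contended step.
        with self._lock:
            old = self._value
            self._value += n
            return old


class RluThread:
    """Hypothetical per-thread state: a local clock and a write clock."""

    def __init__(self, clock):
        self.clock = clock
        self.local_clock = 0
        self.write_clock = INFINITY  # no commit in progress

    def reader_lock(self):
        # "Read on start": snapshot the global clock.
        self.local_clock = self.clock.read()

    def commit(self):
        # Step 1 on the slide: P updates the clocks.
        self.write_clock = self.clock.read() + 1
        self.clock.fetch_and_add(1)
        # Step 2 (not modeled): wait for readers that started earlier.


def reads_new_copy(reader, writer):
    """A reader uses the writer's log copy only if it started after the commit."""
    return reader.local_clock >= writer.write_clock
```

With the slide's values (global clock 22), Q starts before P commits and keeps reading old objects, while a thread that starts after the commit sees the new copies.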

Issue with logical clock

● RLU suffers from global clock contention
  – Cache-line contention due to atomic instructions
  – Possible to circumvent with our approach

[Figure: RLU throughput (Ops/usec) versus #cores on Phi (0 to 256 cores) and ARM (0 to 96 cores), comparing Atomic and Ordo.]

How can we achieve ordering with minimal timestamping overhead?

Our proposed ordering primitive: Ordo

● Exposes a monotonically increasing clock
  – Current hardware already provides it
  – rdtscp (x86), cntvct (ARM), stick (SPARC)
● Relies on a per-core invariant hardware clock
  – Monotonically increases with constant skew regardless of dynamic frequency and voltage scaling

Challenges with Ordo

● Comparing two clocks
  – Clocks are not synchronized
  – Cores receive RESET signal at varying times
● Application:
  – Modifying algorithms to use Ordo
  – Able to compare between two timestamps

Embracing the invariant clocks

● Measure a global uncertainty window
  – Ensure a new timestamp once a window is over
  – Provides a notion of globally synchronized clock
● Measured offset MUST satisfy the invariant: the measured offset is greater than the physical offset
  – Physical offset: offset due to RESET signal
  – Measured offset: physical offset + one-way delay

Calculating global uncertainty window: ORDO_BOUNDARY

● Add one-way delay latency on each path

1) Calculate C1 timestamp
2) Notify C2 via memory
3) Get C2 timestamp
4) Repeat steps 1-3 to get the minimum

[Figure: timeline for C1 → C2 with T(C1): 0 and T(C2): 20, giving a measured offset of 20.]

● Add one-way delay latency on each path
● Repeat prior steps in opposite direction
● Do not know which clock is ahead of the other

[Figure: timeline for C2 → C1 with T(C2): 50 and T(C1): 80, giving a measured offset of 30.]

● Repeat steps for each pair of cores from C1 to Cn
● The maximum offset is the ORDO_BOUNDARY

[Figure: C1 → C2 measures 20 (T(C1): 0, T(C2): 20) and C2 → C1 measures 30 (T(C2): 50, T(C1): 80); the offset for the pair C1 ↔ C2 is the larger value, 30.]
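
The pairwise measurement described on these slides can be sketched as a small simulation. Everything here is an assumption for illustration: the skews, the delay model (`min_delay`, `jitter`), and the function names are hypothetical, and a real measurement would pin threads to cores and read the hardware timestamp counter instead of simulating it.

```python
import itertools
import random


def measure_offset(writer_skew, reader_skew, rng, runs=1000,
                   min_delay=50, jitter=200):
    """Minimum observed (reader timestamp - writer timestamp) over many runs."""
    best = None
    for _ in range(runs):
        now = rng.randrange(1_000_000)             # true time, unknown to the cores
        t_writer = now + writer_skew               # 1) writer takes its timestamp
        delay = min_delay + rng.randrange(jitter)  # 2) notify the reader via memory
        t_reader = now + delay + reader_skew       # 3) reader takes its timestamp
        delta = t_reader - t_writer                # physical offset + one-way delay
        best = delta if best is None else min(best, delta)  # 4) keep the minimum
    return best


def ordo_boundary(skews, rng):
    """Largest measured offset over every pair of cores, in both directions."""
    boundary = 0
    for i, j in itertools.combinations(range(len(skews)), 2):
        forward = measure_offset(skews[i], skews[j], rng)   # Ci -> Cj
        backward = measure_offset(skews[j], skews[i], rng)  # Cj -> Ci
        boundary = max(boundary, forward, backward)
    return boundary


# Hypothetical per-core skews caused by RESET arriving at different times.
skews = [0, 20, -35, 110]
boundary = ordo_boundary(skews, random.Random(0))
```

Because every measurement includes a non-negative one-way delay, the result is never smaller than the largest physical offset between any two simulated cores, which is the invariant the earlier slide requires.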

Ordo application

● Applicable to any timestamp-based algorithm
● Expose Ordo API for these algorithms
  – get_time(): Current hardware timestamp
  – cmp_time(t1, t2): Compare two timestamps; the result is uncertain if |t1-t2| < ORDO_BOUNDARY
  – new_time(t): Return tnew > (t + ORDO_BOUNDARY)
● Catch: Algorithms should handle uncertainty
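
A minimal sketch of the three API calls, under stated assumptions: `time.monotonic_ns()` stands in for the per-core hardware timestamp, the `ORDO_BOUNDARY` value is borrowed from the largest Xeon offset reported later in the deck purely as an illustrative constant, and the return convention of `cmp_time` (1, -1, or 0 for uncertain) is one reasonable choice rather than something the slides specify.

```python
import time

ORDO_BOUNDARY = 276  # nanoseconds; illustrative value, measured per machine in practice


def get_time():
    """Current timestamp; a stand-in for rdtscp / cntvct / stick."""
    return time.monotonic_ns()


def cmp_time(t1, t2):
    """Return 1 if t1 is certainly later, -1 if certainly earlier, 0 if uncertain."""
    if t1 > t2 + ORDO_BOUNDARY:
        return 1
    if t1 + ORDO_BOUNDARY < t2:
        return -1
    return 0  # the timestamps are within the uncertainty window


def new_time(t):
    """Spin until the clock is certainly past t, then return that timestamp."""
    while True:
        now = get_time()
        if cmp_time(now, t) == 1:
            return now
```

A caller that gets 0 from `cmp_time` has to handle the uncertainty itself, for example by retrying or by calling `new_time` to wait the window out.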

Algorithms with Ordo: handling uncertainty

● Physical to logical timestamping:
  – Rely on cmp_time() to compare two timestamps
  – Either defer or revert if comparison is uncertain
  – Use new_time() to guarantee new time
● Physical timestamping:
  – Use new_time() to access the global clock

Read Log Update (RLU(Ordo)) operation

[Figure: objects A to E, with copies B’ and D’ in the per-thread logs. Threads P and Q are shown. Q’s core clock (50) is read into its local clock (50) on start; P’s local clock is 22; the global offset is 30; the write clock is ∞.]

RLU(Ordo) commit operation

[Figure: copies B’, C’, and D’ in the logs. Local clock (50), write clock (150), global offset (30).]

1. P updates own clock
2. P executes RCU-epoch: waits for Q to finish

Q will read only old objects
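
The reader-side decision on the commit slide can be sketched with the slide's own numbers. This is an illustration only: the function names are hypothetical, and treating an uncertain comparison as "read the old object" is an assumption made here, not something the slides state.

```python
ORDO_BOUNDARY = 30  # "Global offset (30)" on the slide


def cmp_time(t1, t2, boundary=ORDO_BOUNDARY):
    """Return 1 if t1 is certainly later, -1 if certainly earlier, 0 if uncertain."""
    if t1 > t2 + boundary:
        return 1
    if t1 + boundary < t2:
        return -1
    return 0


def reads_new_copy(reader_local_clock, writer_write_clock):
    # Assumption: only a reader that certainly started after the commit
    # may use the writer's new copy; otherwise it keeps the old object.
    return cmp_time(reader_local_clock, writer_write_clock) == 1


q_local_clock = 50    # Q's local clock from the slide
p_write_clock = 150   # P's write clock after the commit
q_sees_new = reads_new_copy(q_local_clock, p_write_clock)
```

With Q's local clock at 50 and P's write clock at 150, the comparison is certain and Q keeps reading old objects, matching the slide. No shared counter is touched: P only updates its own clock.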

Algorithms modified with Ordo

● RLU
● Transactional Locking (TL2) in STM
● Database concurrency control: OCC, MVCC
● Oplog used in Linux forking functionality

See our paper

Evaluation

● Questions:
  – Measured global offset (ORDO_BOUNDARY)
  – Maximum scalability of Ordo
  – Ordo’s impact on algorithms
● Machine configurations:
  – 240 core, 8 socket Intel Xeon machine (Xeon)
  – 256 core, Intel Xeon Phi (Phi)
  – 96 core, 2 socket ARM machine (ARM)
  – 32 core, 8 socket AMD machine (AMD)

Offset between clocks

Machine          Minimum (ns)   Maximum (ns)
Intel Xeon       70             276
Intel Xeon Phi   90             270
ARM              100            1,100
AMD              93             203

● Empirically measured offset after reboots
● ORDO_BOUNDARY is the maximum offset

Timestamping with Ordo

● Ordo relies on hardware timestamping
● 17.4–285.5x faster than atomic increments

[Figure: timestamping throughput (Ops/usec/core) versus #core on Xeon (0 to 240 cores), Phi (0 to 256 cores), ARM (0 to 96 cores), and AMD (0 to 32 cores), comparing Atomic and Ordo on each machine.]

Scaling RLU with Ordo

● RLU(Ordo) is 2.1x faster on average
● Still suffers from object copy and its locking

[Figure: throughput (Ops/usec) versus #core on Xeon (0 to 240 cores), Phi (0 to 256 cores), ARM (0 to 96 cores), and AMD (0 to 32 cores). Legend: RLU 2%, RLU(Ordo) 2%.]

Discussion and limitations

● Simplifies the design and understanding of algorithms
● Not a panacea
  – Applicable when clock is contentious
● No skew consideration
● Thread ID-based timestamp comparison has its limitations

Conclusion

● Ordo is a scalable timestamping primitive
  – Relies on invariant hardware clocks
● Exposes time-based API to the user
● Applied Ordo to five concurrent algorithms
● Improves the scalability of algorithms by up to 39.7x across architectures

Backup Slides

Offset between clocks

● Clocks are not synchronized
  – 8th socket in Xeon and 2nd socket in ARM
  – Results remain consistent even after reboots and measuring after a period of time

[Figure: offset between clocks for each pair of cores on Xeon (# core 0 to 120 on both axes, scale 0 to 225) and ARM (# core 0 to 96 on both axes, scale 0 to 900).]

Sensitivity of ORDO_BOUNDARY

● Varying ORDO_BOUNDARY from 1/8x to 8x
● Cycles increase from 32.2 to 18K on the Xeon machine

[Figure: normalized throughput (axis range 0.92 to 1.08) for the 1-core, 1-socket, and 8-sockets configurations.]

Physical timestamping: Oplog

● Improves Exim performance by 1.9x at 240 cores

[Figure: Exim throughput (Messages/sec, 0k to 120k) versus #core (30 to 240), comparing Stock and Oplog(Ordo).]

Scaling database concurrency control

● Improves OCC and MVCC by 4.1–39.7x for read-only (YCSB)
● OCC (Ordo) is 1.24x faster than Tictoc and Silo (TPC-C)

[Figure: throughput (Txns/usec) versus # core on Xeon (0 to 240 cores), Phi (0 to 192 cores), ARM (0 to 96 cores), and AMD (0 to 32 cores). Legend: OCC, OCC (Ordo), MVCC, MVCC (Ordo).]

Cannot use clock synchronization protocols

● No information on minimum bounds on message delivery between/among clocks

● Protocols introduce various errors
● Can lead to mis-synchronized clocks
  – Larger or smaller than the actual physical offset

These lead to incorrect implementations of concurrent algorithms
