
Page 1: Concurrent Data Structures for Near-Memory Computing

© 2017 VMware Inc. All rights reserved.

Concurrent Data Structures for Near-Memory Computing

Zhiyu Liu (Brown)

Irina Calciu (VMware Research)

Maurice Herlihy (Brown)

Onur Mutlu (ETH)

Page 2: Concurrent Data Structures for Near-Memory Computing

Concurrent Data Structures

2

Are used everywhere: kernel, libraries, applications

Issues:
• Difficult to design and implement
• Data layout and the memory/cache hierarchy play a crucial role in performance

[Figure: cache and memory]

Page 3: Concurrent Data Structures for Near-Memory Computing

The Memory Wall

3

[Figure: memory hierarchy, CPU → L1 cache → L2 cache → L3 cache → Memory; L1/L2 accesses take < 10 ns, the L3 tens of ns, and memory hundreds of ns. At 2.2 GHz, each memory access costs roughly 220 cycles of waiting on data movement.]
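As a sanity check on that cycle count (my arithmetic, using the 2.2 GHz clock of the machine measured later in the talk and a 100 ns memory access):

\[
2.2\ \text{GHz} \times 100\ \text{ns} = \left(2.2 \times 10^{9}\ \tfrac{\text{cycles}}{\text{s}}\right) \times \left(10^{-7}\ \text{s}\right) = 220\ \text{cycles stalled per access}.
\]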

Page 4: Concurrent Data Structures for Near-Memory Computing

Near Memory Computing

4

• Also called Processing In Memory (PIM)

• Avoid data movement by doing computation in memory

• Old idea

• New advances in 3D integration and die-stacked memory

• Viable in the near future

Page 5: Concurrent Data Structures for Near-Memory Computing

Near Memory Computing: Architecture

5

• Vaults: memory partitions
• PIM cores: lightweight; each has fast access to its own vault
• Communication:
  • between a CPU and a PIM core
  • between PIM cores
  • via messages sent to buffers

[Figure: CPUs connected through a crossbar network to the PIM memory, which is divided into DRAM vaults, each paired with its own PIM core.]

Page 6: Concurrent Data Structures for Near-Memory Computing

Data Structures + Hardware

6

• Tight integration between algorithmic design and hardware characteristics

• Memory becomes an active component in managing data

• Managing data structures in PIM

• Old work: pointer chasing for sequential data structures

• Our work: concurrent data structures

Page 7: Concurrent Data Structures for Near-Memory Computing

7

[Figure: contention spectrum from LOW (cacheable) to HIGH (pointer-chasing).]

Goals: PIM Concurrent Data Structures

1. How do PIM data structures compare to state-of-the-art concurrent data structures?

2. How to design efficient PIM data structures?

Page 8: Concurrent Data Structures for Near-Memory Computing

8

[Figure: contention spectrum from LOW (cacheable) to HIGH (pointer-chasing).]

Goals: PIM Concurrent Data Structures

1. How do PIM data structures compare to state-of-the-art concurrent data structures?

2. How to design efficient PIM data structures?

e.g., skiplist & linked list with uniformly distributed keys

Page 9: Concurrent Data Structures for Near-Memory Computing

9

[Figure: a skiplist, a multi-level linked list over keys 1, 3, 4, 6, 7, 8, 10, 11, 12, 14, 16, where each higher level skips over nodes of the level below.]

Pointer Chasing Data Structures

Page 10: Concurrent Data Structures for Near-Memory Computing

10

[Figure: contention spectrum from LOW (cacheable) to HIGH (pointer-chasing).]

Goals: PIM Concurrent Data Structures

1. How do PIM data structures compare to state-of-the-art concurrent data structures?

2. How to design efficient PIM data structures?

e.g., skiplist & linked list with uniformly distributed keys

Page 11: Concurrent Data Structures for Near-Memory Computing

Naïve PIM Skiplist

11

[Figure: three CPUs issue add(30), add(5) and delete(80); each request is sent to the single PIM core whose vault holds the entire skiplist, and that PIM core performs the operations with low-latency local memory accesses.]
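A minimal software sketch of this request/response pattern (illustrative only; std::set stands in for the skiplist, a mutex-protected std::queue stands in for the hardware message buffer, and Request, cpu_send and pim_core are invented names):

#include <atomic>
#include <cstdio>
#include <mutex>
#include <queue>
#include <set>
#include <thread>
#include <vector>

// A request sent from a CPU to the PIM core's message buffer.
struct Request { bool is_add; int key; };

std::mutex buf_lock;              // stand-in for the hardware message buffer
std::queue<Request> buffer;
std::set<int> structure;          // stand-in for the skiplist stored in the vault
std::atomic<bool> done{false};

// CPU side: only enqueue a request; the CPU never touches the data structure.
void cpu_send(Request r) {
    std::lock_guard<std::mutex> g(buf_lock);
    buffer.push(r);
}

// PIM core: retrieve requests and apply them sequentially to its local vault.
void pim_core() {
    while (true) {
        Request r;
        {
            std::lock_guard<std::mutex> g(buf_lock);
            if (buffer.empty()) {
                if (done.load()) return;
                continue;                 // busy-wait; fine for a sketch
            }
            r = buffer.front();
            buffer.pop();
        }
        if (r.is_add) structure.insert(r.key);
        else          structure.erase(r.key);
    }
}

int main() {
    std::thread pim(pim_core);
    std::vector<std::thread> cpus;
    for (int c = 0; c < 3; ++c)
        cpus.emplace_back([c] {
            for (int i = 0; i < 1000; ++i)
                cpu_send({true, c * 1000 + i});   // e.g., add(5), add(30), ...
        });
    for (auto& t : cpus) t.join();
    done.store(true);
    pim.join();
    std::printf("size = %zu\n", structure.size()); // 3000
}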

Page 12: Concurrent Data Structures for Near-Memory Computing

Concurrent Data Structures

12

[Figure: the same skiplist stored in ordinary DRAM; the three CPUs run add(30), add(5) and delete(80) concurrently and traverse the skiplist themselves, paying high-latency DRAM accesses.]

Page 13: Concurrent Data Structures for Near-Memory Computing

Flat Combining

13

[Figure: the skiplist in DRAM, protected by a combiner lock; the CPUs publish add(30), add(5) and delete(80), and the thread holding the combiner lock applies them one after another with sequential, high-latency DRAM accesses.]
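As a concrete illustration of the flat-combining idea (a minimal sketch, using a shared counter in place of the skiplist; Slot, fc_increment and kThreads are invented for this example and are not the implementation benchmarked in the talk): each thread publishes its request in a slot, and whichever thread grabs the combiner lock applies all pending requests sequentially.

#include <atomic>
#include <cstdio>
#include <mutex>
#include <thread>
#include <vector>

constexpr int kThreads = 4;

struct Slot {
    std::atomic<int> pending{0};   // 1 = request published, 0 = done
    std::atomic<long> result{0};
};

Slot slots[kThreads];
std::mutex combiner_lock;          // the "combiner lock" from the slide
long sequential_counter = 0;       // the sequential data structure

long fc_increment(int tid) {
    slots[tid].pending.store(1, std::memory_order_release);
    while (slots[tid].pending.load(std::memory_order_acquire)) {
        if (combiner_lock.try_lock()) {
            // This thread is the combiner: apply all published requests
            // with plain sequential accesses to the shared structure.
            for (int i = 0; i < kThreads; ++i) {
                if (slots[i].pending.load(std::memory_order_acquire)) {
                    slots[i].result.store(++sequential_counter,
                                          std::memory_order_relaxed);
                    slots[i].pending.store(0, std::memory_order_release);
                }
            }
            combiner_lock.unlock();
        }
    }
    return slots[tid].result.load(std::memory_order_relaxed);
}

int main() {
    std::vector<std::thread> ts;
    for (int t = 0; t < kThreads; ++t)
        ts.emplace_back([t] { for (int i = 0; i < 1000; ++i) fc_increment(t); });
    for (auto& t : ts) t.join();
    std::printf("final counter = %ld\n", sequential_counter);  // 4000
}

The point of the pattern is that only the combiner touches the shared structure, so its cache lines stay warm in one core instead of bouncing between CPUs; the cost is that all operations are serialized behind the combiner.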

Page 14: Concurrent Data Structures for Near-Memory Computing

14

Skiplist Throughput

[Chart: throughput in ops/s (0 to 10M) vs. number of threads (0 to 30) for the lock-free and flat-combining (FC) skiplists. Intel Xeon E7-4850v3, 28 hardware threads, 2.2 GHz.]

Page 15: Concurrent Data Structures for Near-Memory Computing

PIM Performance

15

N: size of the skiplist
p: number of processes
LCPU: latency of a memory access from the CPU
LLLC: latency of an LLC access
LATOMIC: latency of an atomic instruction (by the CPU)
LPIM: latency of a memory access from the PIM core
LMSG: latency of a message from the CPU to the PIM core

Page 16: Concurrent Data Structures for Near-Memory Computing

PIM Performance

16

LCPU = r1 * LPIM = r2 * LLLC

LMSG = LCPU

r1 = r2 = 3 (so a CPU memory access costs 3× a PIM memory access and 3× an LLC access, i.e., LPIM = LLLC)

Page 17: Concurrent Data Structures for Near-Memory Computing

PIM Performance

17

Algorithm           | Throughput
Lock-free           | p / (B * LCPU)
Flat Combining (FC) | 1 / (B * LCPU)
PIM                 | 1 / (B * LPIM + LMSG)

B = average number of nodes accessed during a skiplist operation
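A quick comparison that follows from these formulas and the parameters on the previous slide (LMSG = LCPU and LPIM = LCPU / 3); the algebra is mine, not from the slides, writing T_X for the throughput of algorithm X:

\[
\frac{T_{\text{PIM}}}{T_{\text{FC}}} = \frac{B\,L_{\text{CPU}}}{B\,L_{\text{PIM}} + L_{\text{MSG}}} = \frac{3B}{B+3} \approx 3,
\qquad
\frac{T_{\text{lock-free}}}{T_{\text{PIM}}} = \frac{p\,(B+3)}{3B} \approx \frac{p}{3}.
\]

So under this model the naïve PIM skiplist is roughly 3× faster than flat combining, but still about p/3 slower than the lock-free skiplist at p threads.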

Page 18: Concurrent Data Structures for Near-Memory Computing

18

Skiplist Throughput

[Chart: throughput in ops/s (0 to 10M) vs. number of threads (0 to 30) for the lock-free skiplist, FC, and the expected performance of the naïve PIM skiplist.]

Page 19: Concurrent Data Structures for Near-Memory Computing

19

[Figure: contention spectrum from LOW (cacheable) to HIGH (pointer-chasing).]

Goals: PIM Concurrent Data Structures

1. How do PIM data structures compare to state-of-the-art concurrent data structures?

2. How to design efficient PIM data structures?

e.g., skiplist & linked list with uniformly distributed keys

Page 20: Concurrent Data Structures for Near-Memory Computing

New PIM algorithm: Exploit Partitioning

20

[Figure: the PIM architecture again, CPUs connected through a crossbar network to PIM memory that is split into DRAM vaults, each with its own PIM core.]

Page 21: Concurrent Data Structures for Near-Memory Computing

PIM Skiplist w/ Partitioning

21

[Figure: the skiplist is range-partitioned across Vault 1, Vault 2 and Vault 3, with partitions beginning at keys 1, 10 and 20; each vault's PIM core manages its own partition, and the CPUs cache the partition boundaries (1, 10, 20) so they can route each operation to the right vault.]
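A tiny sketch of the CPU-side routing the figure implies (illustrative only; the boundary values 1, 10, 20 mirror the figure and route_to_vault is an invented helper): the CPU looks up the cached partition boundaries and sends the operation to the vault whose key range contains the key.

#include <algorithm>
#include <cstdio>
#include <vector>

// Cached partition boundaries: vault i owns keys in [boundaries[i], boundaries[i+1]).
static const std::vector<int> boundaries = {1, 10, 20};

// Return the index of the vault (and hence the PIM core) responsible for `key`.
int route_to_vault(int key) {
    // upper_bound finds the first boundary strictly greater than key;
    // the owning vault is the one just before it.
    auto it = std::upper_bound(boundaries.begin(), boundaries.end(), key);
    if (it == boundaries.begin()) return 0;            // keys below the first boundary
    return static_cast<int>(it - boundaries.begin()) - 1;
}

int main() {
    for (int key : {5, 10, 26, 30}) {
        std::printf("add(%d) -> vault %d\n", key, route_to_vault(key));
    }
    // add(5) -> vault 0, add(10) -> vault 1, add(26) -> vault 2, add(30) -> vault 2
}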

Page 22: Concurrent Data Structures for Near-Memory Computing

PIM Performance

22

Algorithm           | Throughput
Lock-free           | p / (B * LCPU)
Flat Combining (FC) | 1 / (B * LCPU)
PIM                 | 1 / (B * LPIM + LMSG)
FC + k partitions   | k / (B * LCPU)
PIM + k partitions  | k / (B * LPIM + LMSG)

B = average number of nodes accessed during a skiplist operation
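Continuing the same back-of-the-envelope model (LMSG = LCPU, LPIM = LCPU / 3), partitioning multiplies PIM throughput by k, so against the lock-free skiplist at p threads:

\[
\frac{T_{\text{PIM}+k}}{T_{\text{lock-free}}} = \frac{k\,B\,L_{\text{CPU}}}{p\,(B\,L_{\text{PIM}} + L_{\text{MSG}})} = \frac{3kB}{p\,(B+3)} \approx \frac{3k}{p}.
\]

For example, with k = 16 partitions and p = 28 threads this predicts roughly a 1.7× advantage for the partitioned PIM skiplist; the algebra and the example numbers are mine, derived only from the table above.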

Page 23: Concurrent Data Structures for Near-Memory Computing

23

Skiplist Throughput

[Chart: throughput in ops/s (0 to 10M) vs. number of threads (0 to 30) for the lock-free skiplist, FC, and FC with 4, 8 and 16 partitions.]

Page 24: Concurrent Data Structures for Near-Memory Computing

24

Skiplist Throughput

[Chart: throughput in ops/s (0 to 18M) vs. number of threads (0 to 30) for the lock-free skiplist, FC, FC with 4, 8 and 16 partitions, and the expected performance of PIM with 8 and 16 partitions.]

Page 25: Concurrent Data Structures for Near-Memory Computing

25

[Figure: contention spectrum from LOW (cacheable) to HIGH (pointer-chasing).]

Goals: PIM Concurrent Data Structures

1. How do PIM data structures compare to state-of-the-art concurrent data structures?

2. How to design efficient PIM data structures?

e.g., FIFO queue

Page 26: Concurrent Data Structures for Near-Memory Computing

26

FIFO Queue

[Figure: a FIFO queue with a Head and a Tail; enqueues and dequeues operate at opposite ends of the queue.]

Page 27: Concurrent Data Structures for Near-Memory Computing

PIM FIFO Queue

27

[Figure: two CPUs send Deq() requests to the PIM core that owns the queue's vault; for each request the PIM core (1) retrieves the request, (2) dequeues a node, and (3) sends the node back to the requesting CPU.]

Page 28: Concurrent Data Structures for Near-Memory Computing

Pipelining

28

[Timeline: three Deq() requests, each split into its three steps (retrieve the request, dequeue a node, send back the node); the steps of consecutive requests are staggered so that sending back one result overlaps handling the next request.]

Can overlap the execution of the next request
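A back-of-the-envelope model of the benefit (the stage costs below are arbitrary units chosen for illustration, not measurements from the talk): treat each Deq() as a two-stage pipeline, where stage A is the PIM core's local work (retrieve the request, dequeue a node) and stage B is sending the node back.

#include <algorithm>
#include <cstdio>

int main() {
    // Assumed stage costs in arbitrary time units (illustrative only).
    const long work = 2;   // stage A: retrieve request + dequeue a node (local vault accesses)
    const long send = 3;   // stage B: message latency back to the CPU
    const long n = 1000;   // number of Deq() requests

    // No pipelining: the PIM core finishes sending one reply before touching the next request.
    long serial = n * (work + send);

    // Pipelining: while a reply is in flight, the core already serves the next request,
    // so the steady-state cost per request is the slower of the two stages.
    long pipelined = work + (n - 1) * std::max(work, send) + send;

    std::printf("serial:    %ld time units\n", serial);     // 5000
    std::printf("pipelined: %ld time units\n", pipelined);  // 2 + 999*3 + 3 = 3002
}

With these assumed numbers, pipelining hides most of the message latency and throughput approaches one request per max(A, B) time units.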

Page 29: Concurrent Data Structures for Near-Memory Computing

Parallelize Enqs and Deqs

29

[Figure: the queue's Head and Tail live in two different vaults (Vault 1 and Vault 2), each with its own PIM core; one core serves all enqueues and the other serves all dequeues, so the two ends of the queue are operated on in parallel.]

Page 30: Concurrent Data Structures for Near-Memory Computing

Conclusion

30

PIM is becoming feasible in the near future

We investigate Concurrent Data Structures (CDS) for PIM

Results:

• Naïve PIM data structures are less efficient than state-of-the-art CDS

• New PIM algorithms can leverage PIM features

• They outperform efficient CDS

• They are simpler to design and implement

Page 31: Concurrent Data Structures for Near-Memory Computing

Thank you!

https://research.vmware.com/

[email protected]

Page 32: Concurrent Data Structures for Near-Memory Computing

32