fast and energy-efficient in-dram bulk data copy and ... · fast and energy-efficient in-dram bulk...

49
Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu, P. B. Gibbons, M. A. Kozuch, T. C. Mowry Vivek Seshadri

Upload: buiphuc

Post on 01-May-2018

217 views

Category:

Documents


2 download

TRANSCRIPT

Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu,

P. B. Gibbons, M. A. Kozuch, T. C. Mowry

Vivek Seshadri

2

Bulk data copy and initialization

• Unnecessarily move data on the memory channel

• Degrade system performance and energy efficiency

RowClone – perform copy in DRAM with low cost

• Uses row buffer to copy large quantity of data

• Source row → row buffer → destination row

• 11X lower latency and 74X lower energy for a bulk copy

Accelerate Copy-on-Write and Bulk Zeroing

• Forking, checkpointing, zeroing (security), VM cloning

Improves performance and energy efficiency at low cost

• 27% and 17% for 8-core systems (0.01% DRAM chip area)

3

Core

Core

Cac

he

MC

Mem

ory

Channel

Limited Bandwidth

High Energy

4

Core

Core

Cac

he

MC

Mem

ory

Channel

Reduce unnecessary data movement

5

Bulk Data Copy

Bulk Data Initialization

src dst

dstval

6

Bulk Data Copy

Bulk Data Initialization

src dst

dstval

7

Forking

000000000000000

Zero initialization(e.g., security)

VM CloningDeduplication

Checkpointing

Page Migration

Many more

8

Core

Core

Cac

he

MC Channel src

dst

High latency(1046ns to copy 4KB)

Interference

High Energy(3600nJ to copy 4KB)

9

Core

Core

Cac

he

MC Channeldst

High latency

Interference

High Energy

src

XX

X

?

10

Introduction

DRAM Background

RowClone

• Fast Parallel Mode

• Pipelined Serial Mode

End-to-end Design

Evaluation

11

Mem

ory

Ch

ann

el

Ch

ip I/

O

Bank

Bank I/O

Subarray

Row Buffer

Row of DRAM Cells

12

Mem

ory

Ch

ann

el

Ch

ip I/

O

Bank I/O

ACTIVATE: Copy data from row to row buffer

READ: Transfer data to channel using the shared bus

13

VDD/2

VDD/2

0

VDD

DRAMCell

Sense Amplifier(Row Buffer)

14

VDD/2

VDD/2

0

VDD/2 + δ

0

VDDVDDVDD/2 + δ

DRAMCell

Sense Amplifier(Row Buffer)

Cell loses

charge

Amplify the differenceRestore

Cell Data

READ/WRITE

In the stable state, the sense amplifier drives the cell

ACTIVATE

15

Introduction

DRAM Background

RowClone

• Fast Parallel Mode

• Pipelined Serial Mode

End-to-end Design

Evaluation

16

r c r o ws

s t o wd r

1. Source row to row buffer

2. Row buffer to destination row

Row Buffer

r c r o ws

s r c r o w

?

17

VDD/2

VDD/2

0

VDD/2 + δ

0

VDD

VDDVDD/2 + δ

Sense Amplifier(Row Buffer)

Amplify the difference

0

Data gets copied

src

dst

18

r c r o ws

s t o wd r

Row Buffer

r c r o ws

s r c r o w

1. Activate src row (copy data from src to row buffer)

2. Activate dst row (disconnect src from row buffer, connect dst – copy data from row buffer to dst)

19

Latency Energy11x 74x

Bulk Data Copy

1046ns to 90ns 3600nJ to 40nJ

No bandwidth consumption

Very little changes to the DRAM chip

20

Location of source/destination

• Both should be in the same subarray

Size of the copy

• Copies all the data from source row to destination

21

Mem

ory

Ch

ann

el

Ch

ip I/

OBank

Shared internal bus

Overlap the latency of the read and the write1.9X latency reduction, 3.2X energy reduction

22

Mem

ory

Ch

ann

el

Ch

ip I/

O

Bank Bank I/O

Subarray

Intra subarrayUse FPM

Inter bankUse PSM

Inter subarrayUse PSM twice

23

Initialization with arbitrary data

• Initialize one row

• Copy the data to other rows

Zero initialization (most common)

• Reserve a row in each subarray (always zero)

• Copy data from reserved row (FPM mode)

• 6.0X lower latency, 41.5X lower DRAM energy

• 0.2% loss in capacity

24

02468

101214

Intr

a-S

ub

arra

y

Inte

r-B

ank

Inte

r-S

ub

arra

y

Intr

a-S

ub

arra

y

Copy Zero

Latency Reduction

020406080

Intr

a-S

ub

arra

y

Inte

r-B

ank

Inte

r-S

ub

arra

y

Intr

a-S

ub

arra

y

Copy Zero

Energy Reduction

11.6x

1.9x

6.0x

1.0x

74.4x

3.2x 1.5x

41.5x

Very low cost: 0.01% increase in die area

25

Introduction

DRAM Background

RowClone

• Fast Parallel Mode

• Pipelined Serial Mode

End-to-end Design

Evaluation

26

DRAM (RowClone)

Microarchitecture

ISA

Operating System

Application How does the software communicate occurrences of bulk copy/initialization to hardware?

How to maximize use of the Fast Parallel Mode?

How to ensure cache coherence?

Handling data reuse after zero initialization?

27

Two new instructions

• memcopy and meminit

• Similar instructions present in existing ISAs

Microarchitecture Implementation

• Checks if instructions can be sped up by RowClone

• Export instructions to the memory controller

28

RowClone modifies data in memory

• Need to maintain coherence of cached data

Similar to DMA

• Source and destination in memory

• Can leverage hardware support for DMA

Additional optimizations

29

Make operating system subarray-aware

Primitives amenable to use of FPM

• Copy-on-Write Allocate destination in same subarray as source

Use FPM to copy

• Bulk Zeroing Use FPM to copy data from reserved zero row

30

Data reuse after zero initialization• Phase 1: OS zeroes out the page

• Phase 2: Application uses cachelines of the page

RowClone• Avoids misses in phase 1• But incurs misses in phase 2

RowClone-Zero-Insert (RowClone-ZI)• Insert clean zero cachelines

31

Introduction

DRAM Background

RowClone

• Fast Parallel Mode

• Pipelined Serial Mode

End-to-end Design

Evaluation

32

Out-of-order multi-core simulator

1MB/core last-level cache

Cycle-accurate DDR3 DRAM simulator

6 Copy/Initialization intensive applications

+SPEC CPU2006 for multi-core

Performance

• Instruction throughput for single-core

• Weighted Speedup for multi-core

33

System bootup (Booting the Debian OS)

Compile (GNU C compiler – executing cc1)

Forkbench (A fork microbenchmark)

Memcached (Inserting a large number of objects)

MySql (Loading a database)

Shell script (find with ls on each subdirectory)

34

0

0.2

0.4

0.6

0.8

1

bootup compile forkbench mcached mysql shell

Frac

tion

of M

emor

y Tr

affic

Zero Copy Write Read

35

0%

10%

20%

30%

40%

50%

60%

70%

bootup compile forkbench mcached mysql shell

Com

pare

d to

Bas

elin

e

IPC Improvement Memory Energy Reduction

Improvements correlate with fraction of memory traffic due to copy/initialization

36

Reduced bandwidth consumption benefits all applications.

Run copy/initialization intensive applications with memory intensive SPEC applications.

Half the cores run copy/initialization intensive applications. Remaining half run SPEC applications.

37

0%

5%

10%

15%

20%

25%

30%

2-Core 4-Core 8-CoreImpr

ove

men

t o

ver

Bas

elin

e

System Performance Memory Energy Efficiency

Performance improvement increases with increasing core countConsistent improvement in

energy/instruction

38

Discussion on interleaving and copy granularity

Detailed analysis of the fork benchmark

Detailed multi-core results and analysis

Results with the PSM mode

Analysis of RowClone-ZI

Comparison to memory-controller-based DMA

39

Bulk data copy and initialization

• Unnecessarily move data on the memory channel

• Degrade system performance and energy efficiency

RowClone – perform copy in DRAM with low cost

• Uses row buffer to copy large quantity of data

• Source row → row buffer → destination row

• 11X lower latency and 74X lower energy for a bulk copy

Accelerate Copy-on-Write and Bulk Zeroing

• Forking, checkpointing, zeroing (security), VM cloning

Improves performance and energy efficiency at low cost

• 27% and 17% for 8-core systems (0.01% chip area overhead)

Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization

Y. Kim, C. Fallin, D. Lee, R. Ausavarungnirun, G. Pekhimenko, Y. Luo, O. Mutlu,

P. B. Gibbons, M. A. Kozuch, T. C. Mowry

Vivek Seshadri

42

2-core 4-core 8-core# Workloads 138 50 40

Weighted Speedup 15% 20% 27%Instruction Throughput 14% 15% 25%

Harmonic Speedup 13% 16% 29%Max Slowdown Reduction 6% 12% 23%

Bandwidth/Instruction Reduction 29% 27% 28%Energy/Instruction Reduction 19% 17% 17%

43

0

0.5

1

1.5

2

2.5

bootup compile forkbench mcached mysql shell

Inst

ruct

ions

per

Cyc

le

Baseline RowClone RowClone-ZI

44

0.90.95

11.051.1

1.151.2

1.251.3

1.351.4

No

rmai

zed

Wei

ghte

d S

peed

up

Baseline RowClone RowClone-ZI

45

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

2 4 8 16 32 64 128 256 512 1k 2k 4k 8k 16k

Number of Pages Updated

64MB 128MB

46

0

0.5

1

1.5

2

2.5

2 4 8 16 32 64 128 256 512 1k 2k 4k 8k 16k

No

rmal

ized

IPC

Number of Pages Updated

64MB 128MB

47

0

0.2

0.4

0.6

0.8

1

1.2

2 4 8 16 32 64 128 256 512 1k 2k 4k 8k 16k

No

rmal

ized

En

ergy

Number of Pages Updated

Baseline RowClone-PSM RowClone-FPM

48

Copy engines (Zhao et al. 2005, Jiang et al. 2009)

• Addresses cache pollution, pipeline stalls due to copy

• But requires data transfer over the memory channel

IRAM (Patterson et al. 1997)• Compute + memory using same technology

• Exploit high DRAM bandwidth

• Goal: Wider range of SIMD operations

• High cost

49

Copy/Initialization is important

• But not well known

Opportunity to perform in DRAM

• Not well known

This paper: Proof of concept

• More challenges to be addressed