computer architecture - eth z...in-memory bulk bitwise operations we can support in-dram copy, zero,...

79
Computer Architecture Lecture 7: Computation in Memory II Prof. Onur Mutlu ETH Zürich Fall 2019 10 October 2019

Upload: others

Post on 09-Jul-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Computer ArchitectureLecture 7: Computation in Memory II

Prof. Onur Mutlu

ETH Zürich

Fall 2019

10 October 2019

Page 2: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Sub-Agenda: In-Memory Computation

◼ Major Trends Affecting Main Memory

◼ The Need for Intelligent Memory Controllers

❑ Bottom Up: Push from Circuits and Devices

❑ Top Down: Pull from Systems and Applications

◼ Processing in Memory: Two Directions

❑ Minimally Changing Memory Chips

❑ Exploiting 3D-Stacked Memory

◼ How to Enable Adoption of Processing in Memory

◼ Conclusion

2

Page 3: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Processing in Memory:

Two Approaches

1. Minimally changing memory chips

2. Exploiting 3D-stacked memory

3

Page 4: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Recall: RowClone

◼ Vivek Seshadri, Yoongu Kim, Chris Fallin, Donghyuk Lee, RachataAusavarungnirun, Gennady Pekhimenko, Yixin Luo, Onur Mutlu, Michael A. Kozuch, Phillip B. Gibbons, and Todd C. Mowry,"RowClone: Fast and Energy-Efficient In-DRAM Bulk Data Copy and Initialization"Proceedings of the 46th International Symposium on Microarchitecture(MICRO), Davis, CA, December 2013. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)]

4

Page 5: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Recall: End-to-End System Design

5

DRAM (RowClone)

Microarchitecture

ISA

Operating System

ApplicationHow to communicate occurrences of bulk copy/initialization across layers?

How to maximize latency and energy savings?

How to ensure cache coherence?

How to handle data reuse?

Page 6: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Memory as an Accelerator

CPUcore

CPUcore

CPUcore

CPUcore

mini-CPUcore

videocore

GPU(throughput)

core

GPU(throughput)

core

GPU(throughput)

core

GPU(throughput)

core

LLC

Memory ControllerSpecialized

compute-capabilityin memory

Memoryimagingcore

Memory Bus

Memory similar to a “conventional” accelerator

Page 7: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

In-Memory Bulk Bitwise Operations

◼ We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ

◼ At low cost

◼ Using analog computation capability of DRAM

❑ Idea: activating multiple rows performs computation

◼ 30-60X performance and energy improvement

❑ Seshadri+, “Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology,” MICRO 2017.

◼ New memory technologies enable even more opportunities

❑ Memristors, resistive RAM, phase change mem, STT-MRAM, …

❑ Can operate on data with minimal movement

7

Page 8: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

In-DRAM AND/OR: Triple Row Activation

8

½ VDD

½ VDD

dis

A

B

C

Final StateAB + BC + AC

½ VDD+δ

C(A + B) + ~C(AB)en

0

VDD

Seshadri+, “Fast Bulk Bitwise AND and OR in DRAM”, IEEE CAL 2015.

Page 9: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

In-DRAM Bulk Bitwise AND/OR Operation

◼ BULKAND A, B → C

◼ Semantics: Perform a bitwise AND of two rows A and B and store the result in row C

◼ R0 – reserved zero row, R1 – reserved one row

◼ D1, D2, D3 – Designated rows for triple activation

1. RowClone A into D1

2. RowClone B into D2

3. RowClone R0 into D3

4. ACTIVATE D1,D2,D3

5. RowClone Result into C9

Page 10: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

More on In-DRAM Bulk AND/OR

◼ Vivek Seshadri, Kevin Hsieh, Amirali Boroumand, Donghyuk Lee, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry,"Fast Bulk Bitwise AND and OR in DRAM"IEEE Computer Architecture Letters (CAL), April 2015.

10

Page 11: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

In-DRAM NOT: Dual Contact Cell

11

Seshadri+, “Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology,” MICRO 2017.

Idea: Feed the

negated value in the sense amplifier

into a special row

Page 12: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

In-DRAM NOT Operation

12

Seshadri+, “Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology,” MICRO 2017.

Page 13: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Performance: In-DRAM Bitwise Operations

13

Page 14: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Energy of In-DRAM Bitwise Operations

14

Seshadri+, “Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology,” MICRO 2017.

Page 15: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Ambit vs. DDR3: Performance and

Energy

15

010203040506070

Performance Improvement

Energy Reduction

32X 35X

Seshadri+, “Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology,” MICRO 2017.

Page 16: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Bulk Bitwise Operations in Workloads

[1] Li and Patel, BitWeaving, SIGMOD 2013

[2] Goodwin+, BitFunnel, SIGIR 2017

Page 17: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Example Data Structure: Bitmap Index

◼ Alternative to B-tree and its variants

◼ Efficient for performing range queries and joins

◼ Many bitwise operations to perform a query

Bit

map

1

Bit

map

2

Bit

map

4

Bit

map

3

age < 18 18 < age < 25 25 < age < 60 age > 60

Page 18: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Performance: Bitmap Index on Ambit

18

Seshadri+, “Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology,” MICRO 2017.

>5.4-6.6X Performance Improvement

Page 19: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Performance: BitWeaving on Ambit

19

Seshadri+, “Ambit: In-Memory Accelerator for Bulk Bitwise Operations using Commodity DRAM Technology,” MICRO 2017.

>4-12X Performance Improvement

Page 20: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

More on In-DRAM Bulk AND/OR

◼ Vivek Seshadri, Kevin Hsieh, Amirali Boroumand, Donghyuk Lee, Michael A. Kozuch, Onur Mutlu, Phillip B. Gibbons, and Todd C. Mowry,"Fast Bulk Bitwise AND and OR in DRAM"IEEE Computer Architecture Letters (CAL), April 2015.

20

Page 21: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

More on In-DRAM Bitwise Operations

◼ Vivek Seshadri et al., “Ambit: In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology,” MICRO 2017.

21

Page 22: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

More on In-DRAM Bulk Bitwise Execution

◼ Vivek Seshadri and Onur Mutlu,"In-DRAM Bulk Bitwise Execution Engine"Invited Book Chapter in Advances in Computers, to appear in 2020.[Preliminary arXiv version]

22

Page 23: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Challenge: Intelligent Memory Device

Does memory

have to be

dumb?

23

Page 24: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Challenge and Opportunity for Future

Computing Architectures

with

Minimal Data Movement

24

Page 25: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

A Detour

on the Review Process

25

Page 26: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Ambit Sounds Good, No?

26

Review from ISCA 2016

Page 27: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Another Review

27

Another Review from ISCA 2016

Page 28: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Yet Another Review

28

Yet Another Review from ISCA 2016

Page 29: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

The Reviewer Accountability Problem

29

Page 30: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

We Have a Mindset Issue…

◼ There are many other similar examples from reviews…

❑ For many other papers…

◼ And, we are not even talking about JEDEC yet…

◼ How do we fix the mindset problem?

◼ By doing more research, education, implementation in alternative processing paradigms

30

We need to work on enabling the better future…

Page 31: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Aside: A Recommended Book

31

Raj Jain, “The Art of

Computer Systems

Performance Analysis,”

Wiley, 1991.

Page 32: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

32

Raj Jain, “The Art of

Computer Systems

Performance Analysis,”

Wiley, 1991.

Page 33: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

33

Raj Jain, “The Art of

Computer Systems

Performance Analysis,”

Wiley, 1991.

Page 34: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Suggestion to Community

We Need to Fix the Reviewer Accountability

Problem

Page 35: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Takeaway

Main Memory Needs

Intelligent Controllers

Page 36: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Takeaway

Research Community Needs

Accountable Reviewers

Page 37: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Suggestions to Reviewers

◼ Be fair; you do not know it all

◼ Be open-minded; you do not know it all

◼ Be accepting of diverse research methods: there is no single way of doing research

◼ Be constructive, not destructive

◼ Do not have double standards…

Do not block or delay scientific progress for non-reasons

Page 38: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

RowClone & Bitwise Ops in Real DRAM Chips

38https://parallel.princeton.edu/papers/micro19-gao.pdf

Page 39: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Pinatubo: RowClone and Bitwise Ops in PCM

39https://cseweb.ucsd.edu/~jzhao/files/Pinatubo-dac2016.pdf

Page 40: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Other Examples of

“Why Change? It’s Working OK!”

40

Page 41: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Mindset Issues Are Everywhere

◼ “Why Change? It’s Working OK!” mindset limits progress

◼ There are many such examples in real life

◼ Examples of Bandwidth Waste in Real Life

◼ Examples of Latency and Queueing Delays in Real Life

◼ Example of Where to Build a Bridge on the Road

41

Page 42: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Another Example

42

Page 43: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Initial RowHammer Reviews

Page 44: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Missing the Point Reviews from Micro 2013

Page 45: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Experimental DRAM Testing Infrastructure

45Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014.

TemperatureController

PC

HeaterFPGAs FPGAs

Page 46: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

46

Tested

DRAM

Modules

(129 total)

Page 47: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Fast Forward 6 Months

47

Page 48: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

More Reviews… Reviews from ISCA 2014

Page 49: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Final RowHammer Reviews

Page 50: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

RowHammer: Hindsight & Impact (I)

50

Exploiting the DRAM rowhammer bug to gain kernel privileges (Seaborn, 2015)

Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors(Kim et al., ISCA 2014)

Page 51: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

RowHammer: Hindsight & Impact (II)

◼ Onur Mutlu and Jeremie Kim,"RowHammer: A Retrospective"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD) Special Issue on Top Picks in Hardware and Embedded Security, 2019.[Preliminary arXiv version]

51

Page 52: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Suggestion to Researchers: Principle: Passion

Follow Your Passion

(Do not get derailed

by naysayers)

Page 53: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Suggestion to Researchers: Principle: Resilience

Be Resilient

Page 54: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Principle: Learning and Scholarship

Focus on

learning and scholarship

Page 55: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Principle: Learning and Scholarship

The quality of your work defines your impact

Page 56: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Sub-Agenda: In-Memory Computation

◼ Major Trends Affecting Main Memory

◼ The Need for Intelligent Memory Controllers

❑ Bottom Up: Push from Circuits and Devices

❑ Top Down: Pull from Systems and Applications

◼ Processing in Memory: Two Directions

❑ Minimally Changing Memory Chips

❑ Exploiting 3D-Stacked Memory

◼ How to Enable Adoption of Processing in Memory

◼ Conclusion

56

Page 57: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

We Need to Think Differently

from the Past Approaches

57

Page 58: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Memory as an Accelerator

CPUcore

CPUcore

CPUcore

CPUcore

mini-CPUcore

videocore

GPU(throughput)

core

GPU(throughput)

core

GPU(throughput)

core

GPU(throughput)

core

LLC

Memory ControllerSpecialized

compute-capabilityin memory

Memoryimagingcore

Memory Bus

Memory similar to a “conventional” accelerator

Page 59: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Processing in Memory:

Two Approaches

1. Minimally changing memory chips

2. Exploiting 3D-stacked memory

59

Page 60: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Opportunity: 3D-Stacked Logic+Memory

60

Logic

Memory

Other “True 3D” technologiesunder development

Page 61: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

DRAM Landscape (circa 2015)

61

Kim+, “Ramulator: A Flexible and Extensible DRAM Simulator”, IEEE CAL 2015.

Page 62: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Several Questions in 3D-Stacked PIM

◼ What are the performance and energy benefits of using 3D-stacked memory as a coarse-grained accelerator?

❑ By changing the entire system

❑ By performing simple function offloading

◼ What is the minimal processing-in-memory support we can provide?

❑ With minimal changes to system and programming

62

Page 63: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Another Example: In-Memory Graph Processing

63

◼ Large graphs are everywhere (circa 2015)

◼ Scalable large-scale graph processing is challenging

36 Million Wikipedia Pages

1.4 BillionFacebook Users

300 MillionTwitter Users

30 BillionInstagram Photos

+42%

0 1 2 3 4

128…

32 Cores

Speedup

Page 64: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Key Bottlenecks in Graph Processing

64

for (v: graph.vertices) {

for (w: v.successors) {

w.next_rank += weight * v.rank;

}

}

weight * v.rank

v

w

&w

1. Frequent random memory accesses

2. Little amount of computation

w.rank

w.next_rank

w.edges

Page 65: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Tesseract System for Graph Processing

Crossbar Network

……

DR

AM

Co

ntro

ller

NI

In-Order Core

Message Queue

PF Buffer

MTP

LP

Host Processor

Memory-MappedAccelerator Interface

(Noncacheable, Physically Addressed)

Interconnected set of 3D-stacked memory+logic chips with simple cores

Logic

Memory

Ahn+, “A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing” ISCA 2015.

Page 66: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Logic

Memory

Tesseract System for Graph Processing

66

Crossbar Network

……

DR

AM

Co

ntro

ller

NI

In-Order Core

Message Queue

PF Buffer

MTP

LP

Host Processor

Memory-MappedAccelerator Interface

(Noncacheable, Physically Addressed)

Communications viaRemote Function Calls

Page 67: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Communications In Tesseract (I)

67

Page 68: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Communications In Tesseract (II)

68

Page 69: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Communications In Tesseract (III)

69

Page 70: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Remote Function Call (Non-Blocking)

70

Page 71: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Logic

Memory

Tesseract System for Graph Processing

71

Crossbar Network

……

DR

AM

Co

ntro

ller

NI

In-Order Core

Message Queue

PF Buffer

MTP

LP

Host Processor

Memory-MappedAccelerator Interface

(Noncacheable, Physically Addressed)

Prefetching

Page 72: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Evaluated Systems

HMC-MC

128In-Order2GHz

128In-Order2GHz

128In-Order2GHz

128In-Order2GHz

102.4GB/s 640GB/s 640GB/s 8TB/s

HMC-OoO

8 OoO4GHz

8 OoO4GHz

8 OoO4GHz

8 OoO4GHz

8 OoO4GHz

8 OoO4GHz

8 OoO4GHz

8 OoO4GHz

DDR3-OoO Tesseract

32 Tesseract

Cores

Ahn+, “A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing” ISCA 2015.

Page 73: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Tesseract Graph Processing Performance

+56% +25%

9.0x

11.6x

13.8x

0

2

4

6

8

10

12

14

16

DDR3-OoO HMC-OoO HMC-MC Tesseract Tesseract-LP

Tesseract-LP-MTP

Spee

du

p

>13X Performance Improvement

Ahn+, “A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing” ISCA 2015.

On five graph processing algorithms

Page 74: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Tesseract Graph Processing Performance

74

+56% +25%

9.0x

11.6x

13.8x

0

2

4

6

8

10

12

14

16

DDR3-OoO HMC-OoO HMC-MC Tesseract Tesseract-LP

Tesseract-LP-MTP

Spee

du

p

80GB/s 190GB/s 243GB/s

1.3TB/s

2.2TB/s

2.9TB/s

0

0.5

1

1.5

2

2.5

3

3.5

DDR3-OoO HMC-OoO HMC-MC Tesseract Tesseract-LP

Tesseract-LP-MTP

Mem

ory

Ban

dw

idth

(TB

/s)

Memory Bandwidth Consumption

Page 75: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Effect of Bandwidth & Programming Model

75

2.3x

3.0x

6.5x

0

1

2

3

4

5

6

7

HMC-MC HMC-MC +PIM BW

Tesseract +Conventional BW

Tesseract

Spee

du

p

HMC-MC Bandwidth (640GB/s) Tesseract Bandwidth (8TB/s)

Bandwidth

Programming Model

(No Prefetching)

Page 76: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Tesseract Graph Processing System Energy

0

0.2

0.4

0.6

0.8

1

1.2

HMC-OoO Tesseract with Prefetching

Memory Layers Logic Layers Cores

> 8X Energy Reduction

Ahn+, “A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing” ISCA 2015.

Page 77: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Tesseract: Advantages & Disadvantages

◼ Advantages

+ Specialized graph processing accelerator using PIM

+ Large system performance and energy benefits

+ Takes advantage of 3D stacking for an important workload

+ More general than just graph processing

◼ Disadvantages

- Changes a lot in the system

- New programming model

- Specialized Tesseract cores for graph processing

- Cost

- Scalability limited by off-chip links or graph partitioning77

Page 78: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

More on Tesseract

◼ Junwhan Ahn, Sungpack Hong, Sungjoo Yoo, Onur Mutlu, and Kiyoung Choi,"A Scalable Processing-in-Memory Accelerator for Parallel Graph Processing"Proceedings of the 42nd International Symposium on Computer Architecture (ISCA), Portland, OR, June 2015. [Slides (pdf)] [Lightning Session Slides (pdf)]

78

Page 79: Computer Architecture - ETH Z...In-Memory Bulk Bitwise Operations We can support in-DRAM COPY, ZERO, AND, OR, NOT, MAJ At low cost Using analog computation capability of DRAM Idea:

Computer ArchitectureLecture 7: Computation in Memory II

Prof. Onur Mutlu

ETH Zürich

Fall 2019

10 October 2019