
Energy-Efficient Architectures

Based on STT-MRAM

by

Xiaochen Guo

Submitted in Partial Fulfillment of the

Requirements of the Degree

Doctor of Philosophy

Supervised by

Professor Engin Ipek

Department of Electrical and Computer Engineering

Arts, Sciences and Engineering

Edmund A. Hajim School of Engineering and Applied Sciences

University of Rochester

Rochester, New York

2015


Biographical Sketch

The author graduated from Beihang University, Beijing, China with a Bachelor

of Science degree in Computer Science and Engineering, in 2009. She received her

Master of Science degree in Electrical and Computer Engineering from the University

of Rochester, Rochester, NY, in 2011. She continued to pursue a doctoral degree

in Electrical and Computer Engineering at the University of Rochester under the

direction of Professor Engin Ipek. Her dissertation research leverages resistive mem-

ories to build energy-efficient processors, memory systems, and accelerators. She was

awarded the IBM Ph.D. Fellowship twice, in 2012 and 2014. She interned at Samsung

Research America, San Jose, CA, in 2011, and IBM T. J. Watson Research Center,

Yorktown Heights, NY, in 2012 and 2013.

The following publications were a result of the work conducted during doctoral

study:

• Xiaochen Guo, Mahdi Nazm Bojnordi, Qing Guo, and Engin Ipek, “Sanitizer:

Mitigating the Impact of Expensive ECC Checks in STT-MRAM based Main

Memories,” submitted to the 48th International Symposium on Microarchitec-

ture.

• Shibo Wang, Mahdi Nazm Bojnordi, Xiaochen Guo, and Engin Ipek, “Con-

tent Aware Refresh,” submitted to the 48th International Symposium on Mi-

croarchitecture.

• Qing Guo, Xiaochen Guo, Yuxin Bai, Ravi Patel, Engin Ipek, and Eby G.

Friedman, “Resistive TCAM Systems for Data-intensive Computing,” to appear

in IEEE Micro Special Issue on Alternative Computing Designs & Technologies,

2015.

• Ravi Patel, Xiaochen Guo, Qing Guo, Engin Ipek, and Eby G. Friedman,

“Reducing Switching Latency and Energy in STT-MRAM Caches with Field-

Assisted Writing”, to appear in IEEE Transactions on Very Large Scale Inte-

gration (VLSI) Systems, 2015.


• Isaac Richter, Kamil Pas, Xiaochen Guo, Ravi Patel, Ji Liu, Engin Ipek, and

Eby G. Friedman, “Memristive Accelerator for Extreme Scale Linear Solvers,” in

Proceedings of the Government Microcircuit Applications & Critical Technology

Conference, St. Louis, MO, March 2015.

• Engin Ipek, Qing Guo, Xiaochen Guo, and Yuxin Bai, “Resistive Memories

in Associative Computing,” Emerging Memory Technologies: Design, Architec-

ture, and Applications, Yuan Xie (Editor), Springer, July 2013.

• Qing Guo, Xiaochen Guo, Ravi Patel, Engin Ipek, and Eby G. Friedman,

“AC-DIMM: Associative Computing with STT-MRAM,” in Proceedings of the

40th International Symposium on Computer Architecture, Tel-Aviv, Israel, June

2013.

• Qing Guo, Xiaochen Guo, Yuxin Bai, and Engin Ipek, “A Resistive TCAM

Accelerator for Data Intensive Computing,” in Proceedings of the 44th Interna-

tional Symposium on Microarchitecture, Porto Alegre, Brazil, December 2011.

• Xiaochen Guo, Engin Ipek, and Tolga Soyata, “Resistive Computation: Avoid-

ing the Power Wall with Low-Leakage, STT-MRAM Based Computing,” in

Proceedings of the 37th International Symposium on Computer Architecture,

Saint-Malo, France, June 2010.


Acknowledgements

First and foremost, I would like to thank my advisor Prof. Engin Ipek for his

tremendous help and inspiration during these six years. Engin has been a great

teacher, mentor, and friend to me, who has always believed in me more than I have.

I am thankful to Prof. Michael Huang, without whom I would not have come to the

University of Rochester. I would also like to acknowledge NSF, IBM Research, and

Samsung for providing financial support during my graduate studies.

I want to give my grateful and sincere thanks to Prof. Eby Friedman, Prof.

Sandhya Dwarkadas, and Dr. Pradip Bose for serving on my thesis committee and

providing helpful feedback. I appreciate all of the effort that Prof. Chen Ding put in

as the Chair for my defense. I would also like to thank Dr. Tolga Soyata for providing

circuit simulation results for the STT-MRAM based microprocessor work.

I am grateful to my mentors Dr. Hillery Hunter, Dr. Pradip Bose, Dr. Alper

Buyuktosunoglu, Dr. Viji Srinivasan, and Dr. Jude Rivers at IBM research, who

helped me become an independent researcher.

I have been fortunate to collaborate with excellent colleagues in the ECE and

CS departments. I would like to thank Ravi Patel, Mahdi Nazm Bojnordi, Qing

Guo, Yanwei Song, Yuxin Bai, Shibo Wang, Benjamin Feinberg, Isaac Richter, and

Mohammad Kazemi for their help and support.

I would like to give my special thanks to my family and friends for their love,

support, and encouragement.


Abstract

As CMOS technology scales to smaller dimensions, leakage concerns are starting

to limit microprocessor performance growth. To keep dynamic power constant across

process generations, traditional MOSFET scaling theory prescribes reducing supply

and threshold voltages in proportion to device dimensions, a practice that induces an

exponential increase in subthreshold leakage. As a result, leakage power has become

comparable to dynamic power in current-generation processes, and will soon exceed

it in magnitude if voltages are scaled down any further.

The rise in sub-threshold leakage also has an adverse effect on the scaling of

semiconductor memories. DRAM density scaling has become increasingly difficult

due to the challenges in maintaining a sufficiently high storage capacitance and a

sufficiently low leakage current at nanoscale feature sizes. Non-volatile memories

(NVMs) have drawn significant attention as potential DRAM replacements because

they represent information using resistance rather than electrical charge. Spin-torque

transfer magnetoresistive RAM (STT-MRAM) is one of the most promising NVM

technologies due to its low write energy, high speed, and high endurance.

This dissertation presents a new class of energy-efficient processor and memory ar-

chitectures based on STT-MRAM. By implementing much of the on-chip storage and

combinational logic using leakage-resistant, scalable RAM blocks and lookup tables,

and by carefully re-architecting the pipeline, an STT-MRAM based implementation

of an eight-core Sun Niagara-like processor reduces chip-wide power dissipation by

1.7× and leakage power by 2.1× at the 32nm technology node, while maintaining

93% of the system throughput of a CMOS-based design.

A new memory architecture, Sanitizer, is introduced to make STT-MRAM a vi-

able DRAM replacement for main memory. Sanitizer addresses retention errors, one

of the most critical scaling problems of STT-MRAM. As the size of the storage el-

ement within an STT-MRAM cell decreases with technology scaling, STT-MRAM

retention errors are expected to become more frequent, which will require multi-bit

error-correcting code (ECC) and periodic scrubbing mechanisms. Sanitizer mitigates

the performance and energy overheads of ECC and scrubbing in future STT-MRAM


based main memories by anticipating the memory regions that will be accessed in

the near future and scrubbing them in advance. It improves performance by 1.22× and reduces end-to-end system energy by 22% over a baseline STT-MRAM system

at 22nm.


Contributors and Funding Sources

This work was supported by a dissertation committee consisting of Professors En-

gin Ipek (advisor) and Eby Friedman of the Department of Electrical and Computer

Engineering, Professor Sandhya Dwarkadas of the Computer Science Department,

and Dr. Pradip Bose from IBM Research. The committee was chaired by Professor

Chen Ding from the Computer Science Department. The following chapters of this

dissertation were jointly produced, and were funded by multiple sources.

My participation and contributions to the research as well as funding sources are as

follows.

I am the primary author of all of the chapters. For Chapter 3, I collaborated with

Dr. Tolga Soyata and Prof. Engin Ipek. Tolga Soyata provided circuit simulation

results for the STT-MRAM based lookup table. The work described in Chapter 3

was published in the proceedings of the 37th International Symposium on Computer

Architecture, and was supported by a National Science Foundation CAREER award.

For Chapter 4, I collaborated with Mahdi Nazm Bojnordi, Qing Guo, and Prof.

Engin Ipek. Mahdi Nazm Bojnordi performed the design space exploration of the

ECC logic design. Qing Guo provided power calculations for the system using McPAT.

The work described in Chapter 4 was supported by an IBM Ph.D. Fellowship.


Table of Contents

List of Tables xi

List of Figures xiii

1 Introduction 1

2 Background and Motivation 4

2.1 Technology Scaling Challenges . . . . . . . . . . . . . . . . . . . . . . 4

2.1.1 Constant Voltage Scaling . . . . . . . . . . . . . . . . . . . . . 5

2.1.2 Constant Electrical Field Scaling . . . . . . . . . . . . . . . . 8

2.1.3 Multicore Scaling . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.2 Resistive Memory Technologies . . . . . . . . . . . . . . . . . . . . . 10

2.2.1 Spin-Torque Transfer Magnetoresistive RAM

(STT-MRAM) . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.2 Phase Change Memory (PCM) . . . . . . . . . . . . . . . . . 14

2.2.3 Resistive RAM (RRAM) . . . . . . . . . . . . . . . . . . . . . 16

3 STT-MRAM based Microprocessors 18

3.1 Background for Resistive Computation . . . . . . . . . . . . . . . . . 19

3.1.1 1T-1MTJ STT-MRAM Cell . . . . . . . . . . . . . . . . . . . 19

3.1.2 Lookup-Table Based Computing . . . . . . . . . . . . . . . . . 23

3.2 Fundamental Building Blocks . . . . . . . . . . . . . . . . . . . . . . 24

3.2.1 RAM Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.2.2 Lookup Tables . . . . . . . . . . . . . . . . . . . . . . . . . . 33


3.3 Structure and Operation of An STT-MRAM based CMT Pipeline . . 44

3.3.1 Instruction Fetch . . . . . . . . . . . . . . . . . . . . . . . . . 46

3.3.2 Predecode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

3.3.3 Thread Select . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

3.3.4 Decode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53

3.3.5 Execute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.3.6 Memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

3.3.7 Write Back . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

3.4 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3.4.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

3.4.2 Circuit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

3.4.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

3.5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

3.5.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68

3.5.2 Power . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4 STT-MRAM based Main Memories 74

4.1 Background for Sanitizer . . . . . . . . . . . . . . . . . . . . . . . . . 77

4.1.1 DRAM Error Protection . . . . . . . . . . . . . . . . . . . . . 77

4.1.2 STT-MRAM Reliability . . . . . . . . . . . . . . . . . . . . . 80

4.1.3 Reliability Target . . . . . . . . . . . . . . . . . . . . . . . . . 83

4.1.4 Scrubbing Overheads . . . . . . . . . . . . . . . . . . . . . . . 84

4.2 Sanitizer Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . 85

4.2.1 Scheduling Scrub Operations . . . . . . . . . . . . . . . . . . . 89

4.2.2 Reducing the Read Overhead . . . . . . . . . . . . . . . . . . 92

4.2.3 Reducing the Write Overhead . . . . . . . . . . . . . . . . . . 98

4.2.4 Support for Chipkill ECC . . . . . . . . . . . . . . . . . . . . 101

4.3 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . 103

4.3.1 Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104

4.3.2 Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106


4.3.3 Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

4.4 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

4.4.1 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

4.4.2 Energy and Power . . . . . . . . . . . . . . . . . . . . . . . . 110

4.4.3 Area . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

4.4.4 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . 112

4.4.5 Comparison to Hierarchical ECC Combined with

Prefetching . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115

4.4.6 Comparison to DRAM . . . . . . . . . . . . . . . . . . . . . . 118

4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

5 Conclusions 121

Bibliography 125


List of Tables

2.1 Resistive memory technology comparisons [39]. . . . . . . . . . . . . . 11

3.1 STT-MRAM parameters at 32nm based on ITRS’13 projections. . . . 23

3.2 Comparison of three-bit adder implementations using STT-MRAM

LUTs, static CMOS, and a static CMOS ROM. Area estimates do

not include wiring overhead. . . . . . . . . . . . . . . . . . . . . . . 40

3.3 Instruction cache parameters. . . . . . . . . . . . . . . . . . . . . . . 49

3.4 Register file parameters. . . . . . . . . . . . . . . . . . . . . . . . . . 55

3.5 FPU parameters. Area estimates do not include wiring overhead. . . 59

3.6 L1 d-cache parameters. . . . . . . . . . . . . . . . . . . . . . . . . . 61

3.7 L2 cache parameters. . . . . . . . . . . . . . . . . . . . . . . . . . . 63

3.8 Memory controller parameters. Area estimates do not include the

wiring overhead. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

3.9 Parameters of baseline. . . . . . . . . . . . . . . . . . . . . . . . . . . 66

3.10 STT-MRAM cache parameters . . . . . . . . . . . . . . . . . . . . . 66

3.11 Simulated applications and their input sizes. . . . . . . . . . . . . . . 68

4.1 Bandwidth overhead due to scrubbing. FIT/Gbit<1, ∆=34, T=45C,

raw BER=3.4×10^-5/s and block size=64B. . . . . . . . . . . . 85

4.2 Required patrol scrubbing rates for combining Sanitizer with chipkill. 103

4.3 System architecture and core parameters. . . . . . . . . . . . . . . . . 105

4.4 STT-MRAM parameters at 22nm [16,39,85]. . . . . . . . . . . . . . . 106

4.5 Comparison of different ECC codeword sizes. . . . . . . . . . . . . . . 106


4.6 Sanitizer-8 system energy breakdown. . . . . . . . . . . . . . . . . . . 111

4.7 Peak dynamic power and leakage of Sanitizer components (eight block

configuration). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111

4.8 Area breakdown of the Sanitizer components. . . . . . . . . . . . . . 112

4.9 Raw Retention BER per second. (5% variation on ∆.) . . . . . . . . 112


List of Figures

2.1 Illustrative example of an in-plane magnetic tunnel junction (MTJ) in

(a) low-resistance parallel and (b) high-resistance anti-parallel states. 13

2.2 Illustrative example of a PCM cell. . . . . . . . . . . . . . . . . . . . . . 15

2.3 Illustrative example of resistance switching in a metal-oxide RRAM.

Adapted from [88]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.1 Illustrative example of a 1T-1MTJ cell. . . . . . . . . . . . . . . . . 20

3.2 1T-1MTJ cell switching time as a function of cell size based on Cadence-

Spectre circuit simulations at 32nm. . . . . . . . . . . . . . . . . . . . 22

3.3 Illustrative example of a RAM array organized into a hierarchy of banks

and subbanks [56]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

3.4 Illustrative example of subbank buffers. . . . . . . . . . . . . . . . . . 28

3.5 Area of different SRAM and STT-MRAM configurations. . . . . . . . 31

3.6 Leakage of different SRAM and STT-MRAM configurations. . . . . . 32

3.7 Energy of different SRAM and STT-MRAM configurations. . . . . . . 32

3.8 Latency of different SRAM and STT-MRAM configurations. . . . . . 33

3.9 Illustrative example of a three-input lookup table. . . . . . . . . . . . 34

3.10 Access energy, leakage power, read delay, and area of a single LUT

as a function of the number of LUT inputs based on Cadence-Spectre

circuit simulations at 32nm. . . . . . . . . . . . . . . . . . . . . . . . 37

3.11 Illustrative example of a resistive CMT pipeline. . . . . . . . . . . . . 45

3.12 Next PC generation using five add-one LUTS in a carry-select config-

uration. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48


3.13 Illustrative example of a subbanked register file. . . . . . . . . . . . . 55

3.14 Performance. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

3.15 Total Power. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

3.16 Leakage Power. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

4.1 Tradeoff between scrubbing frequency and ECC granularity under a

12.5% storage overhead. . . . . . . . . . . . . . . . . . . . . . . . . . 76

4.2 Illustrative example of Sanitizer and conventional scrubbing mechanisms. 86

4.3 An illustration of the proposed Sanitizer architecture. . . . . . . . . . 88

4.4 An illustrative example of a scrub queue entry. . . . . . . . . . . . . . 90

4.5 An illustrative example of the operations in a four-entry RST with an

expiration time of seven. . . . . . . . . . . . . . . . . . . . . . . . . . 94

4.6 An illustrative example of generating a maximum of three scrubbing

regions using a direction threshold equal to eight. . . . . . . . . . . . 96

4.7 An illustrative example of the proposed memory layout for a four-block

codeword. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100

4.8 Illustrative example of supporting chipkill ECC. . . . . . . . . . . . . 102

4.9 Performance improvement analysis. . . . . . . . . . . . . . . . . . . . 108

4.10 System performance comparison. . . . . . . . . . . . . . . . . . . . . 109

4.11 System energy comparison. . . . . . . . . . . . . . . . . . . . . . . . . 109

4.12 System performance with different raw BERs. . . . . . . . . . . . . . 113

4.13 Memory traffic of systems with 72GB per channel. . . . . . . . . . . . 113

4.14 Performance impact of RST size and associativity. . . . . . . . . . . . 114

4.15 Performance comparisons with different LLC size. . . . . . . . . . . . 115

4.16 Comparison to hierarchical ECC and data prefetching. . . . . . . . . 116

4.17 Performance and system energy normalized to single-channel DRAM

varying number of channels. . . . . . . . . . . . . . . . . . . . . . . . 118


Chapter 1

Introduction

Over the past two decades, the CMOS microprocessor design process has been

confronted by a number of seemingly insurmountable technological challenges (e.g.,

the memory wall [11] and the wire delay problem [1]). At each turn, new classes of

systems have been architected to meet these challenges, and microprocessor perfor-

mance has continued to scale with exponentially increasing transistor budgets. With

more than two billion transistors integrated on a single die [66], power dissipation

has become the current critical challenge facing modern chip design. On-chip power

dissipation now exhausts the maximum capability of conventional cooling technolo-

gies; any further increases will require expensive and challenging solutions (e.g., liquid

cooling), which would significantly increase overall system cost.

Multicore architectures emerged in the early 2000s as a means of avoiding the


power wall, increasing parallelism under a constant clock frequency to avoid an in-

crease in dynamic power consumption. Although multicore systems did manage to

keep power dissipation at bay for the past decade, with the impending transition to

14nm CMOS, they are starting to experience scalability problems of their own. To

maintain constant dynamic power at a given clock rate, supply and threshold voltages

must scale with feature size, but this approach induces an exponential rise in leak-

age power, which is fast approaching dynamic power in magnitude. Under this poor

scaling behavior, the number of active cores on a chip will have to grow much more

slowly than the total transistor budget allows; indeed, at 11nm, over 80% of all cores

may have to be dormant at all times to fit within the chip’s thermal envelope [43].

In parallel with the power-related problems in CMOS, DRAM is facing severe scala-

bility problems due to precise charge placement and sensing hurdles in deep-submicron

processes. In response, the industry is turning its attention to resistive memory tech-

nologies such as phase-change memory (PCM), resistive RAM (RRAM), and spin-

torque transfer magnetoresistive RAM (STT-MRAM). Resistive memories rely on

resistance rather than charge as the information carrier, and thus hold the potential

to scale to much smaller geometries than charge memories [39]. Unlike the case of

SRAM or DRAM, resistive memories rely on non-volatile information storage in a

cell, and thus exhibit near-zero leakage in the data array.

STT-MRAM is one of the most promising resistive memory technologies to replace


SRAM and DRAM due to its fast read speed [98] (< 200ps in 90nm), high density

(6F^2 [16]), scalable energy characteristics [39], and high write endurance (10^12). De-

spite these desirable features, STT-MRAM has two important drawbacks as compared

to SRAM: (1) the nominal switching speed is close to 6.7ns at 32nm, which can hurt

write throughput in many on-chip applications; and (2) the switching energy is over

one order of magnitude higher than it is in SRAM, which, if left unmanaged, can

largely offset the benefits of leakage resistance in small, heavily written RAM arrays.

Moreover, STT-MRAM is expected to suffer from frequent retention errors as tech-

nology scales, which will require multi-bit ECC and periodic scrubbing mechanisms in

future STT-MRAM based main memories. To take advantage of STT-MRAM in de-

signing energy-efficient, scalable microprocessors and memory systems, architectural

techniques that can circumvent these limitations need to be developed.

This thesis presents my work on STT-MRAM based microprocessor and mem-

ory architectures. Chapter 2 summarizes technology scaling challenges and provides

background on STT-MRAM fundamentals; Chapter 3 proposes a new class of energy-

efficient, scalable microprocessors based on resistive memories; Chapter 4 introduces

a novel memory system architecture to enable large-capacity, reliable STT-MRAM

based main memories; and Chapter 5 presents the conclusions.


Chapter 2

Background and Motivation

This thesis leverages STT-MRAM, an emerging resistive memory technology that

holds the potential to address the scaling challenges confronting conventional charge

based memories. Background material on technology scaling challenges and resistive

memory technologies is presented in this chapter.

2.1 Technology Scaling Challenges

Over the past 50 years, shrinking transistor sizes with each new generation of

CMOS technology (i.e., technology scaling, or Moore’s law [55]) has been the fun-

damental driver behind faster and cheaper processors. A given CMOS circuit, when

implemented at successive technology nodes with progressively smaller feature sizes,

exhibits the following benefits: (1) marginal costs are reduced since the area occupied


by the circuit is smaller, allowing more ICs to be integrated on a fixed sized wafer; and

(2) as a result of the faster switching times of the transistors and the reduced local

wire delay, the design typically runs faster. As the transistors shrink, more transistors

can be integrated on a fixed size die, which provides the opportunity to enrich the

computational capability of a processor with greater functionality. Better perfor-

mance, lower cost, and greater computational capability are thus the driving forces

behind technology scaling. The rest of this section discusses the different methods

employed in scaling device dimensions and voltages, and the associated problems.

2.1.1 Constant Voltage Scaling

From the 1980’s to early 1990’s, the industry adopted constant voltage scaling,

which requires the supply voltage to be kept constant (which, at the time, was 5V).

The rationale was (1) to maintain pin compatibility with peripheral devices; and (2)

to allow the clock frequency to increase rapidly from one generation to the next.

Speed. The maximum frequency that can be achieved at each technology node

depends on the propagation delay of the transistors, which is inversely related to the

transistor saturation current. According to the alpha power law model [67], the drain
current in the saturation region is characterized by the following expression:

I_{Dsat} = \frac{1}{2} \mu \frac{\varepsilon_{ox}}{t_{ox}} \frac{W}{L} (V_{GS} - V_{th})^{\alpha}, \qquad (2.1)

where µ is the electron mobility, εox is the dielectric constant of the oxide, tox is the

oxide thickness, VGS is the gate to source voltage, Vth is the threshold voltage, and α is

a constant with a value between 1 and 2. Setting VGS equal to VDD, the propagation

delay becomes:

\tau = R_{ON} C = \frac{V_{DD} C}{I_{Dsat}} = \frac{2 V_{DD} C}{\mu \frac{\varepsilon_{ox}}{t_{ox}} \frac{W}{L} (V_{DD} - V_{th})^{\alpha}}. \qquad (2.2)

Let W , L and tox respectively represent the width, length, and oxide thickness of a

transistor at the current technology node, and let W′, L′, and t′ox represent the same

parameters at the next technology node. To double the number of transistors, the

following relationships must hold:

W' = \frac{W}{1.4}, \qquad L' = \frac{L}{1.4} \qquad (2.3)

Under constant voltage scaling, the oxide thickness tox is scaled down by 1.4× as
well. As a result, the gate capacitance, which is given by C = \varepsilon_{ox} \times W \times L / t_{ox}, is reduced by
1.4× at the new technology node: C' = C/1.4. Accordingly, the delay expression (i.e.,
time constant) for the next technology node is given by:

\tau' = \frac{2 V_{DD} \frac{C}{1.4}}{\mu \frac{\varepsilon_{ox}}{t_{ox}/1.4} \frac{W/1.4}{L/1.4} (V_{DD} - V_{th})^{\alpha}} = \frac{\tau}{2}. \qquad (2.4)

Hence, the frequency, which is the inverse of the propagation delay, can be increased

by 2×.

Power. Assuming a fixed sized chip that integrates N transistors running at a

frequency f, in which a fraction a (called the activity factor) of the devices switch
every cycle, the total dynamic power is:

P_{total\,dyn} = \frac{1}{2} a N f C V_{DD}^{2}. \qquad (2.5)

As explained earlier in this section, the transistor count, clock frequency, and gate
capacitance at the new technology node are N' = 2 \times N, f' = 2 \times f, and C' = C/1.4.
Hence, the total dynamic power at the next technology node is:

P'_{total\,dyn} = \frac{1}{2} a (2 \times N)(2 \times f) \frac{C}{1.4} V_{DD}^{2} = 2.8 \times P_{total\,dyn}. \qquad (2.6)

Thus, under constant voltage scaling, the dynamic power increases by 2.8× with each

new technology generation.
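To make the arithmetic above concrete, the following minimal Python sketch recomputes the constant-voltage scaling ratios implied by Equations (2.1)-(2.6); the shrink factor of 1.4 and the value α = 2 are illustrative assumptions, not values taken from the dissertation's simulations.

    # Minimal numeric sketch of the constant-voltage scaling argument above
    # (Equations 2.1-2.6). Every quantity is a ratio "next node / current node";
    # the shrink factor S = 1.4 doubles the transistor count per generation.
    S = 1.4
    alpha = 2.0                               # alpha-power-law exponent (1 < alpha < 2)

    cap_ratio = (1/S) * (1/S) / (1/S)         # C = eps_ox*W*L/t_ox        -> 1/1.4
    cox_ratio = 1.0 / (1/S)                   # eps_ox/t_ox                -> 1.4x
    wl_ratio = (1/S) / (1/S)                  # W/L                        -> unchanged
    overdrive_ratio = 1.0                     # V_DD and V_th held constant

    # tau ~ V_DD*C / (mu*(eps_ox/t_ox)*(W/L)*(V_DD - V_th)^alpha), Equation (2.2).
    delay_ratio = (1.0 * cap_ratio) / (cox_ratio * wl_ratio * overdrive_ratio ** alpha)
    freq_ratio = 1.0 / delay_ratio

    # P ~ (1/2)*a*N*f*C*V_DD^2, Equation (2.5); N doubles each generation.
    power_ratio = 2.0 * freq_ratio * cap_ratio * 1.0 ** 2

    print(f"frequency: x{freq_ratio:.2f}, dynamic power: x{power_ratio:.2f}")
    # -> frequency: x1.96 (~2x), dynamic power: x2.80, matching Equation (2.6)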


2.1.2 Constant Electrical Field Scaling

Due to the rapid growth in dynamic power, constant voltage scaling was aban-

doned in the early 1990’s. Instead, the industry adopted the constant electrical field

scaling model first introduced by Dennard [23] in 1974. The key idea of Dennard’s

scaling theory is to simultaneously reduce transistor dimensions (width, length, and

oxide thickness), the supply voltage, and the threshold voltage, all by the same scal-

ing factor. The constant electrical field refers to the electrical fields across both the

gate and channel, which are respectively equal to V_{DD}/t_{ox} and V_{DD}/W.

Speed. According to equation (2.2) (and assuming a scaling factor of 1.4 to double

transistor count), the time constant at the next technology node under Dennard

scaling is given by:

\tau' = \frac{2 \frac{V_{DD}}{1.4} \frac{C}{1.4}}{\mu \frac{\varepsilon_{ox}}{t_{ox}/1.4} \frac{W/1.4}{L/1.4} \left(\frac{V_{DD} - V_{th}}{1.4}\right)^{\alpha}} = \frac{\tau}{1.4^{3-\alpha}}, \qquad (2.7)

Hence, under constant field scaling, the clock frequency increases by 1.4− 2×.

Power. The total dynamic power at the next technology node is found by plugging

the scaled values of the transistor count, frequency, capacitance, and supply voltage

into Equation (2.5):

P'_{total\,dyn} = \frac{1}{2} a (2 \times N)(1.4 \times f) \frac{C}{1.4} \left(\frac{V_{DD}}{1.4}\right)^{2} = P_{total\,dyn}. \qquad (2.8)

Hence, the dynamic power is kept constant under constant electrical field scaling. If

the die area is kept constant across successive technology nodes, the total dynamic

power calculation above also indicates that the dynamic power density is kept constant

as well.

Although constant electrical field scaling successfully kept dynamic power in check

throughout the 1990’s, leakage power grew exponentially due to the scaling of the

threshold voltage, rivaling dynamic power by the early 2000’s. Equation (2.9) shows

the exponential dependence of the subthreshold leakage power on the threshold volt-

age:

P_{Leakage} = V_{DD}\, \mu \frac{\varepsilon_{ox}}{t_{ox}} \frac{W}{L} V_{T}^{2}\, e^{-\frac{|V_{th}|}{n V_{T}}} \left(1 - e^{-\frac{1}{V_{T}}}\right). \qquad (2.9)

2.1.3 Multicore Scaling

Because of the exponential rise in leakage power (Section 2.1.2), industry aban-

doned constant field scaling in the first half of the 2000’s, and adopted multicore

architectures. The result was a paradigm shift in microprocessor design, in which

clock frequency would stop increasing, and performance improvements would come

from exploiting greater levels of thread level parallelism with increasing transistor


budgets. Unfortunately, without scaling down the voltage, power density continues

to increase under multicore scaling, albeit slower than it would under earlier scaling

models:

P'_{total\,dyn} = \frac{1}{2} a (2 \times N) f \frac{C}{1.4} V_{DD}^{2} = 1.4 \times P_{total\,dyn}. \qquad (2.10)

The end result is that future multicore processors will not be able to afford keeping

more than a small fraction of all cores active at any given moment [43]. Hence,

multicore scaling is soon expected to hit a power wall [24].
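The same ratio bookkeeping can be repeated for constant electrical field scaling and for multicore scaling; the sketch below (again with an assumed α = 2, an illustrative value) reproduces the 1.4× frequency gain of Equation (2.7), the roughly constant dynamic power of Equation (2.8), and the 1.4× power growth of Equation (2.10).

    # Continuing the sketch above: the same ratio bookkeeping for constant
    # electrical field scaling (Equations 2.7-2.8) and multicore scaling
    # (Equation 2.10). All quantities are "next node / current node" ratios.
    S, alpha = 1.4, 2.0
    cap_ratio = 1.0 / S                       # C shrinks by 1.4x, as before
    cox_ratio = S                             # eps_ox/t_ox grows by 1.4x

    # Dennard scaling: V_DD and V_th also shrink by 1.4x.
    vdd_ratio = overdrive_ratio = 1.0 / S
    delay_ratio = (vdd_ratio * cap_ratio) / (cox_ratio * overdrive_ratio ** alpha)
    dennard_freq = 1.0 / delay_ratio                    # = 1.4^(3-alpha), Eq. (2.7)
    dennard_power = 2.0 * dennard_freq * cap_ratio * vdd_ratio ** 2       # Eq. (2.8)

    # Multicore scaling: frequency and voltage held constant, N still doubles.
    multicore_power = 2.0 * 1.0 * cap_ratio * 1.0 ** 2                    # Eq. (2.10)

    print(f"Dennard: frequency x{dennard_freq:.2f}, dynamic power x{dennard_power:.2f}")
    print(f"Multicore: dynamic power x{multicore_power:.2f}")
    # -> Dennard: frequency x1.40, dynamic power x1.02 (~constant)
    # -> Multicore: dynamic power x1.43 (~1.4x per generation)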

2.2 Resistive Memory Technologies

In parallel with the power-related problems in CMOS, DRAM is facing severe scala-

bility problems due to precise charge placement and sensing hurdles in deep-submicron

processes. In response, the industry has turned its attention to resistive memory tech-

nologies such as phase-change memory (PCM) [19,49,63,70], resistive RAM (RRAM)

[26,87,88,93], and spin-torque transfer magnetoresistive RAM (STT-MRAM) [34,47,

73]—memory technologies that rely on resistance (e.g., a high resistance represents

a ‘1’ and a low resistance represents a ‘0’) rather than charge as the information

carrier, and thus hold the potential to scale to much smaller geometries than charge

memories [39]. Unlike the case of SRAM or DRAM, resistive memories rely on non-

volatile, resistive information storage in a cell, and thus exhibit near-zero leakage in


the data array. This section provides background material on three of the leading

resistive memory technologies, which rely on different physical mechanisms to change

the resistances of the storage elements: PCM, RRAM, and STT-MRAM. Each of

these resistive memory technologies exhibits its own advantages and disadvantages as

shown in Table 2.1.

                      STT-MRAM        PCM            RRAM
Multi-level cell      No              Yes            Yes
Endurance             10^15 Writes    10^9 Writes    10^6 - 10^12 Writes
Cell write latency    ∼4ns            ∼100ns         ∼5ns
Cell write power      ∼50µW           ∼300µW         ∼50µW

Table 2.1: Resistive memory technology comparisons [39].

A multi-level cell stores multiple bits in a single storage element to increase the

memory capacity. The storage elements in PCM and RRAM exhibit continuous re-

sistance ranges that can be partitioned into multiple subregions to represent multiple

values and to store multiple bits. The storage element in STT-MRAM has only two

stable states. Existing multi-level STT-MRAM proposals either stack two storage

elements one on top of the other, or place two storage elements in parallel [96]. An

important advantage of STT-MRAM is the write endurance, which is the maximum

number of writes to a memory cell before it wears out. STT-MRAM, therefore, is a

more desirable technology for frequently written on-chip structures as compared to

PCM and RRAM. Two significant disadvantages of all three resistive memory tech-

nologies as compared to SRAM or DRAM are the long write latency and the high


write energy. This is because changing the physical states of the storage elements is

more difficult than moving the electrons around in SRAM or DRAM.

2.2.1 Spin-Torque Transfer Magnetoresistive RAM

(STT-MRAM)

STT-MRAM [34, 39, 44, 46, 47] is a second generation MRAM technology that

addresses many of the scaling problems of commercially available toggle-mode mag-

netic RAMs. Among all resistive memories, STT-MRAM is the closest to being

a CMOS-compatible1 universal memory technology as it offers read speeds as fast

as SRAM [98] (< 200ps in 90nm), density comparable to DRAM (6F^2 [16]), scal-

able energy characteristics [39], and high write endurance (10^15). Functional array

prototypes [34, 44, 85], and CAM circuits [92] using STT-MRAM already have been

demonstrated. STT-MRAM has also been made DDR3 compatible in a commercial

product [25]. Although STT-MRAM suffers from relatively high write power and

write latency compared to SRAM, its near-zero leakage power dissipation, coupled

with its fast read speed and scalability makes it a promising candidate to take over

as the workhorse for on-chip storage in sub-22nm processes.

STT-MRAM relies on magnetoresistance to encode information. Figure 2.1 de-

picts the storage element of an MRAM cell, the magnetic tunnel junction (MTJ).

1 STT-MRAM can be integrated with a standard CMOS process through a backend process to fabricate the storage elements on metal surfaces [99].

Figure 2.1: Illustrative example of an in-plane magnetic tunnel junction (MTJ) in (a) low-resistance parallel and (b) high-resistance anti-parallel states.

An MTJ consists of two ferromagnetic layers and a tunnel barrier layer, often im-

plemented using a magnetic thin-film stack comprising Co40Fe40B20 for the ferro-

magnetic layers, and MgO for the tunnel barrier. One of the ferromagnetic layers,

the pinned layer, has a fixed magnetic spin, whereas the spin of the electrons in the

free layer can be influenced by first applying a high-amplitude current pulse through

the pinned layer to polarize the current, and then passing this spin-polarized current

through the free layer. Depending on the direction of the current, the spin polarity of

the free layer can be made either parallel or anti-parallel to that of the pinned layer.

The MTJ illustrated in Figure 2.1 is an in-plane MTJ, in which the magnetization

fields are directed in the same plane as the corresponding ferromagnetic layers. A

perpendicular MTJ [47], in which the magnetization direction of the fixed and free

layers are both orthogonal to their corresponding layers, has been proposed recently


to reduce the amplitude of the required switching current.

Applying a small bias voltage (typically 0.1V) across the MTJ causes a tunneling

current to flow through the MgO tunnel barrier without perturbing the magnetic

polarity of the free layer. The magnitude of the tunneling current—and thus, the

resistance of the MTJ—is determined by the polarity of the two ferromagnetic lay-

ers: a lower, parallel resistance (RP in Figure 2.1-a) state is experienced when the

spin polarities agree, and a higher, antiparallel resistance state is observed when the

polarities disagree (RAP in Figure 2.1-b). When the polarities of the two layers are

aligned, electrons with polarity anti-parallel to the two layers can travel through the

MTJ easily, while electrons with the same spin as the two layers are scattered. In

contrast, when the two layers have anti-parallel polarities, electrons of either polarity

are scattered by one of the two layers, leading to much lower conductivity, and thus,

higher resistance [14]. These low and high resistances are used to represent different

logic values.

2.2.2 Phase Change Memory (PCM)

The storage element in a PCM cell consists of a chalcogenide phase-change ma-

terial such as Ge2Sb2Te5 (GST) and a resistive heating element sandwiched between

two electrodes as shown in Figure 2.2. The resistance of the chalcogenide material is

determined by its atomic ordering: a crystalline state exhibits a low resistance and

an amorphous state exhibits a high resistance [70].

Figure 2.2: Illustrative example of a PCM cell.

A chalcogenide storage element

typically includes an amorphous region and a crystalline region. The volumes of these

regions determine the effective resistance of a PCM cell. To change the resistance

of a PCM cell, a high-amplitude current pulse is applied to the chalcogenide storage

element to induce Joule heating. A slow reduction in the write current gradually cools

the chalcogenide for a long enough period of time (i.e., 100ns [39]) to allow crystalline

growth; whereas an abrupt reduction in the current causes the device to retain its

amorphous state. Reading the a PCM cell involves passing a sensing current lower

than the write current to prevent disturbance, and the resulting voltage is sensed to

infer the content stored in the cell. A PCM cell exhibits a relatively large ratio of its

highest (RHIGH) and lowest (RLOW ) resistances. A less than 10KΩ low resistant and

greater than 1MΩ can be achieved [39,70]. Therefore, a multi-level PCM is possible.

However, the absolute resistance is in the mega-ohm range, which leads to large RC

delays, and hence, slow reads. PCM suffers from finite write endurance. Because


of the heating and cooling of the chalcogenide material during the writes, thermal

expansion and contraction damage the contact between the top electrode and the

chalcogenide storage element. A typical PCM cell wears out after 109 writes [39].

Many architectural techniques have been proposed to address the PCM endurance

issue [6, 27,38,40,61,62,68,69,95].

2.2.3 Resistive RAM (RRAM)

An RRAM cell consists of two metal electrodes separated by a metal-oxide insu-

lator. RRAM resistance is altered by building filaments in the insulator to create con-

ductive paths. There are two types of RRAM: conductive-bridge RAM (CBRAM) [87],

and metal-oxide resistive RAM (MeOx-RRAM) [88]. A CBRAM cell relies on the dif-

fusion of Ag or Cu ions from the metal electrodes to create conductive bridges, whereas

a MeOx-RRAM cell builds conductive filaments by evacuating oxygen ions from the

insulator. Large scale prototypes have been demonstrated with both types of RRAM

(16Gb CBRAM [26] and 32Gb MeOx-RRAM [93]). As an example, Figure 2.3 shows

the resistance changing process of a metal-oxide RRAM. When a set voltage is applied

across the two electrodes (Figure 2.3(a)), the oxygen ions are moved from the lattice

toward the anode. As shown in Figure 2.3(b), the remaining oxygen vacancies form

conductive filaments, resulting in a low resistance state. Increasing the cell resistance

requires applying a reset voltage to move oxygen ions back to the insulator, thereby

disconnecting the conductive filament from the top electrode.

Figure 2.3: Illustrative example of resistance switching in a metal-oxide RRAM. Adapted from [88]. (a) Decrease resistance. (b) Low resistance state. (c) Increase resistance. (d) High resistance state.

The reset voltage is

applied in the opposite direction to the set voltage for a bipolar RRAM (as shown

in Figure 2.3(c)), and in the same direction for a unipolar RRAM. In Figure 2.3(d),

a cell in the high resistance state is shown, in which the oxygen vacancies do not

form a path to connect the top and the bottom electrodes. The height and width of

the conductive filaments affect the cell resistance, which enables the RRAM to have

multi-level cell capability.


Chapter 3

STT-MRAM based

Microprocessors

This chapter presents resistive computation, an architectural technique that aims

at developing a new class of energy-efficient, scalable microprocessors based on emerg-

ing resistive memory technologies. Power- and performance-critical hardware re-

sources such as caches, memory controllers, and floating-point units are implemented

using spin-torque transfer magnetoresistive RAM (STT-MRAM)—a CMOS-compatible,

near-zero static-power, persistent memory that has been in development since the

early 2000s [35], and has been made DDR3 compatible in a commercial product [25].

The key idea is to implement most of the on-chip storage and combinational logic

using scalable, leakage-resistant RAM arrays and lookup tables (LUTs) constructed


from STT-MRAM to lower leakage, thereby allowing many more active cores under

a fixed power and area budget than a pure CMOS implementation could afford.

By adopting hardware structures amenable to fast and efficient LUT-based com-

puting, and by carefully re-architecting the pipeline, an STT-MRAM based imple-

mentation of an eight-core, Sun Niagara-like processor respectively reduces leakage

and total power at 32nm by 2.1× and 1.7×, while maintaining 93% of the system

throughput of a pure CMOS implementation.

3.1 Background for Resistive Computation

This section reviews background material on STT-MRAM cell structures and

lookup-table based computing.

3.1.1 1T-1MTJ STT-MRAM Cell

The most commonly used structure for an STT-MRAM memory cell is the 1T-

1MTJ cell that comprises a single MTJ, and a single transistor that acts as an access

device (Figure 3.1). Transistors are built in CMOS, and the MTJ magnetic material is

grown over the source and drain regions of the transistors through a few (typically two

or three) additional process steps. Similarly to SRAM and DRAM, 1T-1MTJ cells can

be coupled through wordlines and bitlines to form memory arrays. Each cell is read

by driving the appropriate wordline to connect the relevant MTJ to its bitline (BL)


and source line (SL), applying a small bias voltage (e.g., 0.1V ) across the two, and by

sensing the current passing through the MTJ using a current sense amplifier connected

to the bitline. Read speed is determined by how fast the capacitive wordline can be

charged to turn on the access transistor, and by how fast the bitline can be raised

to the required read voltage to sample the read-out current. The write operation, on

the other hand, requires activating the access transistor, and applying a much higher

voltage (typically VDD) that can generate sufficient current to modify the spin of the

free layer.

Figure 3.1: Illustrative example of a 1T-1MTJ cell.

An MTJ can be written in a thermal activation mode through the application of

a long, low-amplitude current pulse (>10ns), under a dynamic reversal regime with

intermediate current pulses (3-10ns), or in a precessional switching regime with a

short (<3ns), high-amplitude current pulse [35]. In a 1T-1MTJ cell with a fixed-size

MTJ, a tradeoff exists between the switching time (i.e., current pulse width) and the

cell area. In the precessional mode, the required current density Jc(τ) to switch the

state of the MTJ is inversely proportional to the switching time τ:

J_c(\tau) \propto J_{c0} + \frac{C}{\tau},

where Jc0 is a process-dependent intrinsic current density parameter, and C is a

constant that depends on the angle of the magnetization vector of the free layer [35].

Hence, operating at a faster switching time increases energy efficiency: a 2× shorter

write pulse requires a less than 2× increase in write current, and thus, lower write

energy [34, 52, 84]. Unfortunately, the highest switching speed possible with a fixed-

size MTJ is restricted by two fundamental factors: (1) the maximum current that the

cell can support during an RAP → RP transition cannot exceed VDD/RAP, since the

cell has to deliver the necessary switching current over the MTJ in its high-resistance

state, and (2) a higher switching current requires the access transistor to be sized

larger so that it can source the required current, which increases cell area 1 and hurts

the read energy and delay due to the higher gate capacitance.
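The pulse-width/energy tradeoff described above can be illustrated with a rough Python sketch of the relation J_c(τ) ∝ J_c0 + C/τ; the constants below are placeholders chosen only to show the trend, not calibrated device values from the dissertation.

    # Rough sketch of the pulse-width / write-energy tradeoff implied by
    # J_c(tau) ~ J_c0 + C/tau. The constants are placeholders; only the trend matters.
    def write_energy(tau_ns, jc0=0.5, c=1.0, vdd=1.0):
        """Relative per-bit write energy ~ I(tau) * V_DD * tau."""
        current = jc0 + c / tau_ns            # switching current tracks J_c(tau)
        return current * vdd * tau_ns

    for tau in (8.0, 4.0, 2.0):
        print(f"tau = {tau:>4} ns   relative write energy = {write_energy(tau):.2f}")
    # With J_c0 > 0, halving the pulse needs less than 2x the current, so the
    # energy I*V*tau falls (5.00 -> 3.00 -> 2.00 here). Under the conservative
    # J_c0 = 0 assumption adopted later for Figure 3.2, the current exactly
    # doubles and the write energy stays flat instead.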

Figure 3.2 shows the 1T-1MTJ cell switching time as a function of the cell area

based on Cadence-Spectre analog circuit simulations of a single cell at the 32nm

technology node, using ITRS 2013 projections on the MTJ parameters (Table 3.1),

and the BSIM-4 predictive technology model (PTM) of an NMOS transistor [97]; the

results presented here are assumed in the rest of this chapter whenever cell sizing

needs to be optimized for write speed. As the precise value of the intrinsic current

1 The MTJ is grown above the source and drain regions of the access transistor and is typically smaller than the transistor itself; consequently, the size of the access transistor determines cell area in current-generation STT-MRAM.

Figure 3.2: 1T-1MTJ cell switching time as a function of cell size based on Cadence-Spectre circuit simulations at 32nm.

density Jc0 is not included in the ITRS projections, Jc0 is conservatively assumed to

be zero, which requires a 2× increase in switching current for a 2× increase in the

switching speed. If feature size is given by F , then at a switching speed of 6.7ns, a

1T-1MTJ cell occupies a 10F^2 area—a 14.6× density advantage over SRAM, which

is a 146F^2 technology [56]. As the W/L ratio of the access transistor is increased, the

current sourcing capability of the transistor improves, which reduces the switching

time to 3.1ns at a cell size of 30F 2. Increasing the size of the transistor further causes

a large voltage drop across the MTJ, which reduces the drain-source voltage of the

access transistor, pushes the device into deep triode, and ultimately limits the current

sourcing capability. As a result, the switching time reaches an asymptote at 2.6ns,

which is accomplished at a cell size of 65F^2.

Parameter                        Value
Cell Size                        10F^2
Switching Current                50µA
Switching Time                   6.7ns
Write Energy                     0.3pJ/bit
MTJ Resistance (RLOW/RHIGH)      2.5kΩ / 6.25kΩ

Table 3.1: STT-MRAM parameters at 32nm based on ITRS’13 projections.

3.1.2 Lookup-Table Based Computing

Field programmable gate arrays (FPGAs) adopt a versatile internal organization

that leverages SRAM to store truth tables of logic functions [91]. This not only allows

a wide variety of logic functions to be represented flexibly, but also allows FPGAs to

be re-programmed almost indefinitely, making them suitable for rapid product pro-

totyping. With technology scaling, FPGAs have gradually evolved from four-input

SRAM-based truth tables to five- and six-input tables, named lookup tables (LUT)

[20]. This evolution is due to the increasing IC integration density—when LUTs are

created with higher numbers of inputs, the area they occupy increases exponentially;

however, place-and-route becomes significantly easier due to the increased function-

ality of each LUT. The selection of LUT size is technology dependent; for example,

Xilinx Virtex-6 FPGAs use both five- and six-input LUTs, which represent the opti-

mum sizing at the 40nm technology node [91].
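As a concrete illustration of the idea (illustrative code, not from the dissertation), the short Python sketch below stores the truth table of a three-input function—the carry-out of a full adder—and evaluates it with a single table read, which is the operation an SRAM- or STT-MRAM-based LUT performs in hardware.

    # A minimal sketch of lookup-table based logic: a 3-input function is stored
    # as its 8-entry truth table and "evaluated" by a memory read. In an FPGA the
    # table lives in SRAM; the proposal in this chapter stores it in STT-MRAM.
    def build_lut(func, n_inputs=3):
        """Enumerate the truth table of an n-input boolean function."""
        return [func(*(((i >> b) & 1) for b in range(n_inputs)))
                for i in range(2 ** n_inputs)]

    # Example function: the carry-out of a full adder, carry(a, b, cin).
    carry_lut = build_lut(lambda a, b, cin: (a & b) | (cin & (a ^ b)))

    def lut_read(lut, a, b, cin):
        """Evaluating the function is just an indexed read of the stored table."""
        return lut[a | (b << 1) | (cin << 2)]

    assert lut_read(carry_lut, 1, 1, 0) == 1
    assert lut_read(carry_lut, 1, 0, 0) == 0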

We propose to leverage an attractive feature of LUT-based computing other than

reconfigurability: since LUTs are constructed from memory, it is possible to im-

plement them using a leakage-resistant memory technology such as STT-MRAM to


reduce power. Similarly to other resistive memories, MRAM dissipates near-zero

leakage power in the data array; consequently, power density can be kept in check

by reducing the supply voltage with each new technology generation. (Typical STT-

MRAM read voltages of 0.1V are reported in the literature [34].) Due to its high

write power, the technology is best suited to implementing hardware structures that

are read-only or are seldom written. Previous work has explored the possibility of

leveraging MRAM to design L2 caches [83, 90], but this work is the first to consider

the possibility of implementing much of the combinational logic on the chip, as well as

microarchitectural structures such as register files and L1 caches, using STT-MRAM.

3.2 Fundamental Building Blocks

At a high-level, an STT-MRAM based resistive microprocessor consists of stor-

age resources such as register files, caches, and queues; functional units and other

combinational logic elements; and pipeline latches. Judicious partitioning of these

hardware structures between CMOS and STT-MRAM is critical to designing a well-

balanced system that exploits the unique area, speed, and power advantages of each

technology. Making this selection correctly requires analyzing two broad categories

of MRAM-based hardware units: those leveraging RAM arrays (queues, register files,

and caches), and those leveraging look-up tables (combinational logic and functional

units).


3.2.1 RAM Arrays

Large SRAM arrays are commonly organized into hierarchical structures to opti-

mize area, speed, and power tradeoffs [3]. An array comprises multiple independent

banks with separate address and data buses that can be accessed simultaneously to

improve throughput. To minimize wordline and bitline delays and to simplify decod-

ing complexity, each bank is further divided into subbanks sharing address and data

busses; unlike the case of banks, only a single subbank can be accessed at a time

(Figure 3.3). A subbank consists of multiple independent mats sharing an address

line, each of which supplies a different portion of a requested data block on every

access. Internally, each mat comprises multiple subarrays. Memory cells within each

subarray are organized as rows × columns; a decoder selects the cells connected to

the relevant wordline, whose contents are driven onto a set of bitlines to be muxed

and sensed by the column sensing circuitry. The sensed value is routed back to the

data bus of the requesting bank through a separate reply network. Different organi-

zations of a fixed-size RAM array into different numbers of banks, subbanks, mats,

and subarrays yield dramatically different area, speed, and power figures [56].
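To make the hierarchy concrete, the hypothetical Python sketch below shows one way an access address could be split into bank, subbank, mat, row, and column fields; the field widths are invented for illustration, and the real split depends on the array organization chosen.

    # Hypothetical sketch of splitting an access address across the bank /
    # subbank / mat / subarray hierarchy described above. The field widths are
    # invented for illustration; real splits depend on the organization chosen.
    from collections import namedtuple

    Fields = namedtuple("Fields", "bank subbank mat row col")
    WIDTHS = Fields(bank=2, subbank=2, mat=2, row=7, col=3)   # example widths only

    def decode(addr):
        """Peel fields off the address, least-significant field (column) first."""
        values = {}
        for name in reversed(Fields._fields):          # col, row, mat, subbank, bank
            width = getattr(WIDTHS, name)
            values[name] = addr & ((1 << width) - 1)
            addr >>= width
        return Fields(**values)

    print(decode(0x2A7F))   # which bank/subbank/mat/row/column this access hits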

STT-MRAM and SRAM arrays share much of this high-level structure with some

important differences arising from the size of a basic cell, the loading on the bitlines

and wordlines, and the underlying sensing mechanisms. In turn, these differences

result in different leakage power, access energy, delay, and area characteristics.

Figure 3.3: Illustrative example of a RAM array organized into a hierarchy of banks and subbanks [56].

Since STT-MRAM has a smaller cell size than SRAM (10F^2 vs. 146F^2), the length of

the bitlines and wordlines within a subarray can be made shorter, which reduces

the bitline and wordline capacitance and resistance, and improves both delay and

energy. In addition, unlike the case of 6T-SRAM where each cell has two access

transistors, a 1T-1MTJ cell has a single access device whose size typically is smaller

than the SRAM access transistor. This reduces the amount of gate capacitance on

the wordlines, as well as the drain capacitance attached to the bitlines, which lowers

both energy and delay. The smaller cell size of STT-MRAM implies that subarrays

can be made smaller, which shortens the global H-tree interconnect that is responsible

for a large share of the overall power, area, and delay. Importantly, unlike the case of

SRAM where each cell comprises a pair of cross-coupled inverters connected to the

supply rail, STT-MRAM does not require constant connection to VDD within a cell,

which reduces the leakage power within the data array to virtually zero.


3.2.1.1 Handling Long-Latency Writes

Despite these advantages, STT-MRAM suffers from a relatively long write la-

tency as compared to SRAM (Section 2.2.1). Leveraging STT-MRAM in designing

frequently accessed hardware structures requires (1) ensuring that critical reads are

not delayed by long-latency writes, and (2) ensuring that long write latencies do not result in re-

source conflicts that hamper pipeline throughput.

One way of accomplishing both of these goals would be to choose a heavily multi-

ported organization for frequently written hardware structures. Unfortunately, this

results in an excessive number of ports, and as area and delay grow with port count,

significantly hurts performance. For example, building an STT-MRAM based ar-

chitectural register file that would support two reads and one write per cycle with

fast, 30F^2 cells at 32nm and 4GHz would require two read ports and 13 write ports2,

which would increase total port count from 3 to 15. An alternative would be to go

to a heavily multi-banked implementation without incurring the overhead of extreme

multiporting. Regrettably, as the number of banks is increased, so does the number

of H-tree wiring resources, which quickly overrides the leakage and area benefits of

using STT-MRAM.

Instead, this chapter proposes an alternative strategy that allows high write

2 A write to the 30F^2 STT-MRAM cell takes 13 cycles (3.1ns × 4GHz), whereas a typical SRAM-based register file accepts one write per cycle. To achieve the same write throughput as the SRAM-based register file, an STT-MRAM based register file needs 13 write ports.

throughput and read-write bypassing without incurring an increase in the wiring

overhead. The key idea is to allow long-latency writes to complete locally within

each subbank without unnecessarily occupying global H-tree wiring resources. To

make this possible, each subbank is augmented with a subbank buffer—an array of

flip-flops (physically distributed across all of the mats within a subbank) that latch in the

data-in and address bits from the H-tree, and continue driving the subarray data and

address wires throughout the duration of a write while bank-level wiring resources are

released (Figure 3.4). In RAM arrays with separate read and write ports, subbank

buffers drive only the write port; reads from other locations within the array can

complete unobstructed, and it becomes possible to read the value being written to

the array directly from the subbank buffer.

Figure 3.4: Illustrative example of subbank buffers.

Subbank buffers also make it possible to perform differential writes [49], where

only bit positions that differ from their original contents are modified on a write. For

this to work, the port attached to the subbank buffer must be designed as a read-write

port; when a write is received, the subbank buffer (physically distributed across the


mats) latches in the new data and initiates a read for the original contents. Once

the data arrives, the original and the new contents are bitwise XOR’ed to generate

a mask indicating those bit positions that need to be changed. This mask is sent to

all of the relevant subarrays along with the actual data, and is used to enable the

bitline drivers. In this way, it becomes possible to perform differential writes without

incurring additional latency and energy on the global H-tree wiring. Differential

writes can reduce the number of bit flips, and thus the write energy, by significant

margins, and can make the STT-MRAM based implementation of heavily written

arrays practical.
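The masking step can be summarized in a few lines of Python; this is an illustrative sketch of the XOR-based mask generation described above, not the circuit-level implementation.

    # Illustrative sketch of the XOR-based differential-write mask described
    # above: the subbank buffer compares the incoming word with the stored word
    # and only the differing bit positions are driven into the STT-MRAM subarrays.
    def differential_write(old_word: int, new_word: int, width: int = 32):
        """Return (write_mask, number_of_bits_actually_written)."""
        mask = (old_word ^ new_word) & ((1 << width) - 1)   # 1 = bit must toggle
        return mask, bin(mask).count("1")

    mask, flips = differential_write(0xDEADBEEF, 0xDEADBEEB)
    print(f"mask = {mask:#010x}, bits written = {flips} of 32")
    # Unchanged bits dissipate no switching energy: their bitline drivers are
    # gated off by the mask.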

3.2.1.2 Modeling STT-MRAM Arrays

To derive the latency, power, and area figures for STT-MRAM arrays, we use a

modified version of CACTI 6.5 [56] augmented with 10F^2 and 30F^2 STT-MRAM cell

models. The modifications reflect four key differences between SRAM and STT-

MRAM: (1) STT-MRAM incurs additional switching latency and energy during

writes, (2) the 1T-1R STT-MRAM cell is smaller than an SRAM cell, (3) there is no

leakage current within an STT-MRAM cell, and (4) each STT-MRAM cell has one

access transistor whereas an SRAM cell has two. The subbank buffers are modeled

as part of the peripheral circuitry for each subbank.


3.2.1.3 Deciding When to Use STT-MRAM

STT-MRAM is best suited to large RAM arrays or infrequently written hardware

structures, because (1) the potential for leakage power, area, and read energy savings,

as well as read latency reduction, is higher in large arrays than in smaller

ones, and (2) infrequently written structures require a small number of subbanks.

Deciding whether it is beneficial to implement a memory structure in STT-MRAM

requires (1) determining the minimum number of subbanks required to satisfy the write accesses, and (2) comparing the area, leakage, energy, and latency of the

STT-MRAM and the SRAM based implementations.

A set of STT-MRAM and SRAM arrays with different sizes are evaluated in this

section. In the accompanying figures, “Best SRAM” and “Best STT-MRAM” repre-

sent the best configurations chosen by CACTI using an objective function that assigns

equal weights to delay, dynamic power, leakage power, cycle time, and area. The con-

figurations labeled as “2 Subbank STT-MRAM”, “4 Subbank STT-MRAM”, and

“8 Subbank STT-MRAM” are STT-MRAM configurations that force the respective

number of subbanks to be two, four, and eight. All of the evaluated configurations in

this section have a single port, a single bank, and a 32-bit access granularity.

Area. SRAM cells are larger than STT-MRAM cells. The area of the “Best STT-MRAM” configurations is therefore smaller than that of the iso-capacity “Best SRAM”


configurations in Figure 3.5. As the number of subbanks increases, however, an STT-

Figure 3.5: Area of different SRAM and STT-MRAM configurations (2KB–128KB arrays), normalized to the 2KB SRAM array.

MRAM array occupies a larger area than its SRAM-based counterpart due to the

area overhead of the subbank buffers. Hence, implementing a small and frequently

written hardware structure in STT-MRAM does not reduce area as compared to the

best SRAM implementation.

Leakage. STT-MRAM cells consume zero leakage power. As the number of subbanks

increases, the SRAM based subbank buffers and other peripheral circuits consume

greater amounts of leakage power (Figure 3.6). A small RAM structure implemented

in STT-MRAM, however, can still achieve leakage power savings: an eight subbank

2KB STT-MRAM array consumes half of the leakage power consumed by the best

2KB SRAM configuration.

Figure 3.6: Leakage of different SRAM and STT-MRAM configurations (2KB–128KB arrays), normalized to the 2KB SRAM array.

Energy. A comparison of the read energy is shown in Figure 3.7. STT-MRAM

typically consumes less read energy than SRAM because of the reduced area, and the

corresponding reduction in the energy dissipated on the (shorter) wires. Write energy

is modeled as a fixed per-bit switching energy added on top of the read energy. For

large arrays, in which STT-MRAM read energy can be as low as half of the SRAM

read energy, the total write energy can also be less than that of SRAM.

Figure 3.7: Energy of different SRAM and STT-MRAM configurations (read energy for 2KB–128KB arrays, normalized to the 2KB SRAM array).

Latency. STT-MRAM read latency increases as the number of subbanks increases

(Figure 3.8). In small, heavily subbanked STT-MRAM arrays, read latency is no longer lower than it is under the best SRAM configuration, because the subbank structures increase H-tree complexity and, with it, the H-tree delay.

Figure 3.8: Latency of different SRAM and STT-MRAM configurations (read latency for 2KB–128KB arrays, normalized to the 2KB SRAM array).

3.2.2 Lookup Tables

Although large STT-MRAM subarrays dissipate near-zero leakage power, the leak-

age power of the peripheral circuitry can be significant in smaller subarrays. With

smaller arrays, there are fewer opportunities to share sense amplifiers and decod-

ing circuitry across multiple rows and columns. One option to combat this problem

would be to utilize very large arrays to implement lookup tables of logic functions;

unfortunately, both the access time and the area overhead deteriorate with larger

arrays.


Rather than utilizing an STT-MRAM array to implement a logic function, we

rely on a specialized STT-MRAM based lookup table employing differential, dynamic current-mode logic (DyCML). Prior work in this area has

resulted in fabricated, two-input lookup tables [84] at 140nm, as well as a non-volatile

full-adder prototype [52]. Figure 3.9 depicts an example three-input LUT. The circuit

needs both complementary and pure forms of each of its inputs, and the LUT produces

complementary outputs. Consequently, when multiple LUTs are cascaded in a large

circuit, there is no need to generate extra complementary outputs.

Figure 3.9: Illustrative example of a three-input lookup table (a 3×8 MTJ decode tree driven by inputs A, B, C and their complements, a clocked footer capacitor, and a sense amplifier comparing the DEC and REF nodes).

This LUT circuit, an expanded version of what is proposed in [84], utilizes a

dynamic current source by charging and discharging the capacitor shown in Figure 3.9.

The capacitor is discharged during the clk̄ phase, and sinks current through the 3×8 decode tree during the clk phase. Keeper PMOS transistors charge the two entry nodes of the sense amplifier (SA) during the clk̄ phase, and sensing is performed during the clk phase. These two entry nodes, named DEC and REF, reach different

voltage values during the sensing phase (clk) since the sink paths from DEC to the

capacitor vs. from REF to the capacitor exhibit different resistances. The reference

MTJ needs to have a resistance between the low and high resistance values. Since

ITRS projects R_LO and R_HIGH values of 2.5kΩ and 6.25kΩ at 32nm, the midpoint of 4.375kΩ is chosen for R_REF.

Although the MTJ decoding circuitry is connected to VDD at the top and dynam-

ically connected to GND at the bottom, the voltage swing on the capacitor is much

smaller than VDD, which significantly reduces the access energy. The output of this

current mode logic operation is fed into a sense amplifier, which turns the low-swing

operation into a full-swing complementary output.

In [84], it is observed that the circuit can be expanded to higher numbers of in-

puts by expanding the decode tree. However, it is important to note that expanding

the tree beyond a certain height reduces noise margins and makes the LUT circuit

vulnerable to process variations, since it becomes increasingly difficult to detect the

difference between the high and low MTJ states due to the additional resistance in-

troduced by the transistors in series. As more and more transistors are added, their

cumulative resistance can become comparable to MTJ resistance, and fluctuations

among transistor resistances caused by process variations can make sensing challeng-

ing.


3.2.2.1 Optimal LUT Sizing for Latency, Power, and Area

Both the power and the performance of a resistive processor depend heavily on the

LUT sizes chosen to implement combinational logic blocks. This makes it necessary to

develop a detailed model to evaluate latency, area, and power tradeoffs as a function

of STT-MRAM LUT size. Figure 3.10 depicts read energy, leakage power, read delay,

and area as a function of the number of LUT inputs. LUTs with two to six inputs

(4-64 MTJs) are studied, which represent realistic LUT sizes for real circuits. As

a comparison, only five- and six-input LUTs are utilized in modern FPGAs (e.g.,

Xilinx Virtex 6) as larger LUTs do not justify the increase in latency and area for the

marginal gain in flexibility when implementing logic functions. As each LUT stores

only one bit of output, multiple LUTs are accessed in parallel with the same inputs

to produce multi-bit results (e.g., a three-bit adder that produces a four-bit output).

Read Energy. Access energy decreases slightly as LUT sizes are increased. Although

there are more internal nodes, and thus higher gate and drain capacitances, to charge

with each access on a larger LUT, the voltage swing on the footer capacitor is lower

due to the increased series resistance charging the capacitor. As a design choice,

it is possible to size up the transistors in the decode tree to trade off power against

latency and area. The overall access energy goes down from 2fJ to 1.7fJ as LUT size is

increased from two to six for the minimum-size transistors used in these simulations.

Figure 3.10: Access energy, leakage power, read delay, and area of a single LUT as a function of the number of LUT inputs, based on Cadence-Spectre circuit simulations at 32nm.

Leakage Power. The dominant leakage paths for the LUT circuit are: (1) from VDD

through the PMOS keeper transistors into the capacitor, (2) from VDD through the

footer charge/discharge NMOS to GND, and (3) the sense amplifier. Lower values of

leakage power are observed at higher LUT sizes due to higher resistance along leakage

paths (1) and (2), and due to the stack effect of the transistors in the 3 × 8 decode

tree. However, similarly to the case of read energy, sizing the decoder transistors

appropriately to trade off speed against energy can change this balance. As LUT size

is increased from two to six inputs, leakage power reduces from 550pW to 400pW.

Latency. Due to the increased series resistance of the decoder’s pull-down network

with larger LUTs, the RC time constant associated with charging the footer capacitor

goes up, and latency increases from 80 to 100ps. However, LUT speed can be increased

by sizing the decoder transistors larger at the expense of a larger area, and a higher

load capacitance for the previous stage driving the LUT. For optimal results, the

footer capacitor must also be sized appropriately. A higher capacitance allows the

circuit to work with a lower voltage swing at the expense of increased area. Lower

capacitance values cause higher voltage swings on the capacitor, thereby slowing down

the reaction time of the sense amplifier due to the lower potential difference between

the DEC and REF nodes. A 50fF capacitor was used in these simulations.

Area. Although larger LUTs amortize the leakage power of the peripheral circuitry

better and offer more functionality without incurring a large latency penalty, the area

overhead of the lookup table increases exponentially with the number of inputs. Every

new input doubles the number of transistors in the branches; as LUT size is increased

from two to six inputs, the area of the LUT increases fivefold. Nevertheless, a single

LUT can replace approximately 12 CMOS standard cells on average when implement-

ing such complex combinational logic blocks as a floating-point unit (Section 3.3.5) or


the scheduling logic of a memory controller (Section 3.3.6.4); consequently, analyses

shown later in the chapter assume six-input LUTs unless otherwise stated.

3.2.2.2 Case Study: Three-bit Adder using Static CMOS, ROM, and

STT-MRAM LUT Circuits

To study the power and performance advantages of STT-MRAM LUT-based com-

puting on a realistic circuit, Table 3.2 compares access energy, leakage power, area,

and delay figures obtained on three different implementations of a three-bit adder: (1)

a conventional, static CMOS implementation, (2) a LUT-based implementation using

the STT-MRAM (DyCML) LUTs described in Section 3.2.2, and (3) a LUT-based

implementation using conventional, CMOS-based static ROMs. Minimum size tran-

sistors are used in all three cases to keep the comparisons fair. Circuit simulations are

performed using Cadence AMS (Spectre) with Verilog-based test vector generation;

we use 32nm BSIM-4 predictive technology models (PTM) [97] of NMOS and PMOS

transistors, and the MTJ parameters presented in Table 3.1 based on ITRS’13 pro-

jections. All results are obtained under identical input vectors, minimum transistor

sizing, and a 370K temperature. Although simulations were also performed at 16nm

and 22nm nodes, results showed similar tendencies to those presented here, and are

not repeated.

Parameter        STT-MRAM LUT    Static CMOS    ROM-Based LUT
Delay            100ps           110ps          190ps
Access Energy    7.43fJ          11.1fJ         27.4fJ
Leakage Power    1.77nW          10.18nW        514nW
Area             2.40µm²         0.43µm²        17.9µm²

Table 3.2: Comparison of three-bit adder implementations using STT-MRAM LUTs, static CMOS, and a static CMOS ROM. Area estimates do not include wiring overhead.

Static CMOS. A three-bit CMOS ripple-carry adder is built using one half-adder

(HAX1) and two full-adder (FAX1) circuits based on circuit topologies used in the

OSU standard cell library [80]. Static CMOS offers the smallest area among all three

designs considered because the layout is highly regular and only 70 transistors are

required instead of the 348 required for the STT-MRAM LUT-based design. Leakage

is 5.8× higher than MRAM since the CMOS implementation has a much higher

number of leakage paths than an STT-MRAM LUT, whose subthreshold leakage is

confined to its peripheral circuitry.

STT-MRAM LUTs. A three-bit adder requires four STT-MRAM LUTs,

one for each output of the adder (three sum bits plus a carry-out bit). Since the least

significant bit of the sum depends only on two bits, it can be calculated using a two-

input LUT. Similarly, the second bit of the sum depends on a total of four bits, and

can be implemented using a four-input LUT. The most significant bit and the carry-

out bit each depend on six bits, and each of them requires a six-input LUT. Although


results presented here are based on unoptimized, minimum-size STT-MRAM LUTs,

it is possible to slow down the two- and four-input LUTs to save access energy by

sizing the transistors. The results presented here are conservative compared to this

best-case optimization scenario.
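The decomposition above can be made concrete by generating the LUT contents directly from the adder's truth table. The Python sketch below (illustrative only; variable names are assumptions, and the fabricated LUTs would store these bits in MTJs) enumerates the 64 input combinations and shows why the least significant sum bit needs only a two-input LUT while the most significant sum bit and the carry-out each need six inputs.

```python
# Sketch of how the four adder-output LUTs are populated: enumerate the 64
# input combinations (two 3-bit operands) and record each output bit.

from itertools import product

tables = {"sum0": {}, "sum1": {}, "sum2": {}, "cout": {}}
for a, b in product(range(8), repeat=2):
    s = a + b
    tables["sum0"][(a, b)] = (s >> 0) & 1
    tables["sum1"][(a, b)] = (s >> 1) & 1
    tables["sum2"][(a, b)] = (s >> 2) & 1
    tables["cout"][(a, b)] = (s >> 3) & 1

# sum0 depends only on a[0] and b[0], so a two-input LUT suffices.
assert all(v == (a ^ b) & 1 for (a, b), v in tables["sum0"].items())
# sum2 and cout depend on all six input bits and need six-input LUTs.
print(tables["cout"][(7, 7)], tables["sum2"][(7, 7)])   # 7 + 7 = 14 -> 1, 1
```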

An STT-MRAM based three-bit adder has 1.5× lower access energy than its static

CMOS counterpart due to its energy-efficient, low-swing, differential current-mode

logic implementation; however, these energy savings are achieved at the expense of a

5.6× increase in area. In a three-bit adder, a six-input STT-MRAM LUT replaces

three CMOS standard cells. The area overhead can be expected to be lower when

implementing more complex logic functions that result in many minterms, which is

when LUT-based computation is most beneficial; for instance, a single six-input LUT

is expected to replace 12 CMOS standard cells on average when implementing the

FPU (Section 3.3.5) and the memory controller scheduling logic (Section 3.3.6.4).

The most notable advantage of the STT-MRAM LUT over static CMOS is the

5.8× reduction in leakage. This is due to the significantly smaller number of leak-

age paths that are possible with an STT-MRAM LUT, which exhibits subthreshold

leakage only through its peripheral circuitry. The speed of the STT-MRAM LUT is

similar to static CMOS: although CMOS uses higher-speed standard cells, an STT-

MRAM LUT calculates all four bits in parallel using independent LUTs.


CMOS ROM-Based LUTs. To perform a head-on comparison against a LUT-

based CMOS adder, we build a 64× 4 static ROM circuit that can read all three bits

of the sum and the carry-out bit with a single lookup. Compared to a 6T-SRAM

based, reconfigurable LUT used in an FPGA, a ROM-based, fixed-function LUT is

more energy efficient, since each table entry requires either a single transistor (in the

case of a logic 1) or no transistors at all (in the case of a logic 0), rather than the six

transistors required by an SRAM cell. A 6-to-64 decoder drives one of 64 wordlines,

which activates the transistors on cells representing a logic 1. A minimum sized PMOS

pull-up transistor and a skewed inverter are employed to sense the stored logic value.

Four parallel bitlines are used for the four outputs of the adder, amortizing dynamic

energy and leakage power of the decoder over the four output bits.

The ROM-based LUT dissipates 290× higher leakage than its STT-MRAM based

counterpart. This is due to two factors: (1) transistors in the decoder circuit of

the ROM represent a significant source of subthreshold leakage, whereas the STT-

MRAM LUT uses differential current-mode logic, which connects a number of access

devices in series with each MTJ on a decode tree without any direct connections

between the access devices and VDD, and (2) the ROM-based readout mechanism

suffers from significant leakage paths within the data array itself since all unselected

devices represent sneak paths for active leakage during each access. The access energy

of the ROM-based LUT is 3.7× higher than the STT-MRAM LUT, since (1) the


decoder has to be activated with every access, and (2) the bitlines are charged to VDD

and discharged to GND using full-swing voltages, whereas the differential current-

sensing mechanism of the STT-MRAM LUT operates with low-swing voltages.

The ROM-based LUT also runs 1.9× slower than its STT-MRAM based coun-

terpart due to the serialization of the decoder access and cell readout: the input

signal has to traverse through the decoder to activate one of the wordlines, which

then selects the transistors along that wordline. Two thirds of the delay is incurred

in the decoder. Overall, the ROM-based LUT delivers the worst results on all metrics

considered due to its inherently more complex and leakage-prone design.

3.2.2.3 Deciding When to Use LUTs

Consider a three-bit adder which has two three-bit inputs and four one-bit outputs.

This function can be implemented using four six-input LUTs, whereas the VLSI

implementation requires only three standard cells, resulting in a stdcell/LUT ratio of less than one. On the other hand, an unsigned multiplier with two three-bit inputs and a

six-bit output requires six six-input LUTs or 36 standard cells, raising the same ratio

to six. As the size and complexity of a Boolean function increase, thereby requiring

more minterms after logic minimization, this ratio can be as high as 12 [13]. This is

due not only to the increased complexity of the function better utilizing the fixed size

of the LUTs, but also to the sheer size of the circuit allowing the Boolean minimizer to

amortize complex functions over multiple LUTs. As this ratio gets higher, the power

consumption and leakage advantage of LUT based circuits improve dramatically. The

observation that LUT-based implementations work significantly better for large and

complex circuits is one of our guidelines for choosing which parts of a microprocessor

should be implemented using LUTs vs. conventional CMOS.
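The ratios quoted above follow from a one-line calculation; the short Python sketch below simply divides the standard-cell count by the LUT count for the two examples in the text.

```python
# Worked stdcell/LUT ratios for the two examples discussed above.
def stdcell_per_lut(std_cells: int, luts: int) -> float:
    return std_cells / luts

print(stdcell_per_lut(3, 4))    # three-bit adder: 0.75 -> LUTs are a poor fit
print(stdcell_per_lut(36, 6))   # 3x3 unsigned multiplier: 6.0 -> LUTs win
# For large blocks such as an FPU, the ratio reported in [13] reaches ~12.
```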

3.3 Structure and Operation of An STT-MRAM

based CMT Pipeline

Figure 3.11 shows how hardware resources are partitioned between CMOS and

STT-MRAM in an example CMT system with eight single-issue in-order cores, and

eight hardware thread contexts per core. Whether a resource can be effectively im-

plemented in STT-MRAM depends on both its size and on the expected number of

writes it incurs per cycle. STT-MRAM offers dramatically lower leakage and much

higher density than SRAM, but suffers from long write latency and high write en-

ergy. Large, wire-delay dominated RAM arrays—L1 and L2 caches, TLBs, memory

controller queues, and register files—are implemented in STT-MRAM to reduce leak-

age and interconnect power, and to improve interconnect delay. Instruction and

store buffers, PC registers, and pipeline latches are kept in CMOS due to their small

size and relatively high write activity. Since LUTs are never written at runtime,

Figure 3.11: Illustrative example of a resistive CMT pipeline (instruction fetch, thread select, decode, execute, memory, and write-back stages; each resource labeled as STT-MRAM LUTs, STT-MRAM arrays, or pure CMOS).

they are used to implement such complex combinational logic blocks as the front-end

thread selection, decode, and next-PC generation logic, the floating-point unit, and

the scheduling logic of the memory controller.

An important issue that affects both power and performance for caches, TLBs, and

register files is the size of a basic STT-MRAM cell used to implement the subarrays.

With 30F² cells, write latency can be reduced by 2.2× over 10F² cells (Section 2.2.1) at the expense of lower density, higher read energy, and longer read latency. Lookup tables are constructed from dense, 10F² cells as they are never written at runtime. The register file and the L1 d-cache use 30F² cells with 3.1ns switching time, as the 6.7ns switching time of a 10F² cell has a prohibitive impact on throughput. The L2 cache and the memory controller queues are implemented with 10F² cells and are optimized for density and power rather than write speed; similarly, TLBs and the L1 i-cache are implemented using 10F² cells due to their relatively low miss rate, and

thus, low write probability.

3.3.1 Instruction Fetch

Each core’s front-end is quite typical, with a separate PC register and an eight-

deep instruction buffer per thread. The i-TLB, i-cache, next-PC generation logic,

and front-end thread selection logic are shared among all eight threads. The i-TLB

and the i-cache are built using STT-MRAM arrays; thread selection and next-PC


generation logic are implemented with STT-MRAM LUTs. Due to their small size

and high write activity, instruction buffers and PC registers are left in CMOS.

3.3.1.1 Program Counter Generation

Each thread has a dedicated, CMOS-based PC register. To compute the next

sequential PC with minimum power and area overhead, a special 6 × 7 “add one”

LUT is used rather than a general-purpose adder LUT. A 6 × 7 LUT accepts six

bits of the current PC plus a carry-in bit to calculate the corresponding six bits of

the next PC and a carry-out bit; internally, the circuit consists of two-, three-, four-,

five-, and six-input LUTs (one of each), each of which computes a different bit of the

seven bit output in parallel.

The overall next sequential PC computation unit comprises five such 6× 7 LUTs

arranged in a carry-select configuration (Figure 3.12). Carry out bits are used as the

select signals for a chain of CMOS-based multiplexers that choose either the new or

the original six bits of the PC. Hence, the delay of the PC generation logic is four

multiplexer delays, plus a single six-input LUT delay, which comfortably fits within

a 250ps clock period in circuit simulations (Section 3.5).
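A behavioral sketch of this carry-select scheme is shown below (Python, purely illustrative; it models the policy, not the LUT circuits). Each 6-bit slice is incremented in parallel by an add-one table, and the carry-outs steer a chain of multiplexers that choose between the original and incremented slice bits.

```python
# Behavioral sketch of the carry-select next-PC unit: five 6-bit "add one"
# slices computed in parallel, with carry-outs selecting incremented or
# original bits through a mux chain.

SLICE_OFFSETS = (2, 8, 14, 20, 26)   # PC bit positions covered by each slice
MASK6 = 0x3F

def add_one_lut(slice6: int):
    """Model of one 6x7 add-one LUT: returns (slice + 1) mod 64 and carry-out."""
    total = slice6 + 1
    return total & MASK6, total >> 6

def next_pc(pc: int) -> int:
    slices = [(pc >> off) & MASK6 for off in SLICE_OFFSETS]
    results = [add_one_lut(s) for s in slices]      # all LUTs fire in parallel
    out, propagate = pc & 0b11, 1                    # bits [1:0] are unchanged
    for off, original, (inc, cout) in zip(SLICE_OFFSETS, slices, results):
        chosen = inc if propagate else original      # CMOS mux per slice
        out |= chosen << off
        propagate &= cout                            # carry chains via the selects
    return out & 0xFFFFFFFF

assert next_pc(0x0000_0000) == 0x0000_0004
assert next_pc(0x0000_00FC) == 0x0000_0100          # carry ripples into slice 1
assert next_pc(0xFFFF_FFFC) == 0x0000_0000          # wraps around at 32 bits
```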

3.3.1.2 Front-End Thread Selection

Every cycle, the front-end selects one of the available threads to fetch in round-

robin order, which promotes fairness and facilitates a simple implementation. The

Figure 3.12: Next-PC generation using five 6×7 add-one LUTs in a carry-select configuration (PC bit slices starting at bits 2, 8, 14, 20, and 26; carry-out bits drive the multiplexer chain).

following conditions make a thread unselectable in the front-end: (1) an i-cache or

an i-TLB miss, (2) a full instruction buffer, or (3) a branch or jump instruction.

On an i-cache or an i-TLB miss, the thread is marked unselectable for fetch, and

is reset to a selectable state when the refill of the i-cache or the i-TLB is complete.

To facilitate front-end thread selection, the ID of the last selected thread is kept in

a three-bit CMOS register, and the next thread to fetch from is determined as the

next available, unblocked thread in round-robin order. The complete thread selection

mechanism thus requires an 11-to-3 LUT, which is built from 96 six-input LUTs

sharing a data bus with tri-state buffers—six bits of the input are sent to all of the

LUTs, and the remaining five bits are used to generate the enable signals for all

LUTs in parallel with the LUT access. (It is also possible to optimize for power by

serializing the decoding of the five bits with the LUT access, and by using the enable

signal to control the LUT clk input.)
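The policy that the 11-to-3 LUT encodes can be stated compactly in software; the Python sketch below (illustrative only, not the LUT netlist) takes the three-bit last-selected thread ID and the eight per-thread selectable bits and returns the next fetchable thread in round-robin order.

```python
# Minimal sketch of the front-end round-robin policy the 11-to-3 LUT encodes:
# 3 bits of last-selected thread ID plus 8 "selectable" bits in, 3 bits out.

NUM_THREADS = 8

def next_fetch_thread(last_selected: int, selectable: int):
    """Return the next selectable thread after last_selected in round-robin
    order, or None if every thread is blocked (miss, full buffer, branch)."""
    for step in range(1, NUM_THREADS + 1):
        candidate = (last_selected + step) % NUM_THREADS
        if (selectable >> candidate) & 1:
            return candidate
    return None

# Thread 3 was fetched last; threads 0, 5, and 6 are currently selectable.
print(next_fetch_thread(3, 0b0110_0001))   # -> 5
```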


3.3.1.3 L1 Instruction Cache and TLB

The i-cache and the i-TLB are both implemented in STT-MRAM due to

their large size and relatively low write activity. Since writes are infrequent, these

resources are each organized into a single subbank to minimize the overhead of the

peripheral circuitry, and are built using 10F² cells that reduce area, read energy,

and read latency at the expense of longer writes. The i-cache is designed with a

dedicated read port and a dedicated write port so that the front-end does not come to a complete stall during refills; threads can still fetch

from the read port in the shadow of an ongoing write. To accommodate multiple

outstanding misses from different threads, the i-cache is augmented with an eight-

entry refill queue. When a block returns from the L2 on an i-cache miss, it starts

writing to the cache immediately if the write port is available; otherwise, it is placed

in the refill queue while it waits for the write port to free up.

Parameter        SRAM (32KB)    STT-MRAM (32KB)    STT-MRAM (128KB)
Read Delay       397ps          238ps              474ps
Write Delay      397ps          6932ps             7036ps
Read Energy      35pJ           13pJ               50pJ
Write Energy     35pJ           90pJ               127pJ
Leakage Power    75.7mW         6.6mW              41.4mW
Area             0.31mm²        0.06mm²            0.26mm²

Table 3.3: Instruction cache parameters.

It is possible to leverage the 14.6× density advantage of STT-MRAM over SRAM


by either designing a similar-capacity L1 i-cache with shorter wire delays, lower read

energy, and lower area and leakage, or by designing a higher-capacity cache with

similar read latency and read energy under a similar area budget. Table 3.3 presents

latency, power, and area comparisons between a 32KB, SRAM-based i-cache; its

32KB, STT-MRAM counterpart; and a larger, 128KB STT-MRAM configuration

that fits under the same area budget³. Simply migrating the 32KB i-cache from

SRAM to STT-MRAM reduces area by 5.2×, leakage by 11.5×, read energy by 2.7×,

and read delay by one cycle at 4GHz. Leveraging the density advantage to build

a larger, 128KB cache results in more modest savings in leakage (45%) due to the

higher overhead of the CMOS-based peripheral circuitry. Write energy increases by 2.6× and 3.6× over CMOS with the 32KB and 128KB STT-MRAM caches, respectively.

3.3.2 Predecode

After fetch, instructions go through a predecode stage where a set of predecode

bits for back-end thread selection are extracted and written into the CMOS-based

instruction buffer. Predecode bits indicate if the instruction is a member of the

following equivalence classes: (1) a load or a store, (2) a floating-point or integer

divide, (3) a floating-point add/sub, compare, multiply, or an integer multiply, (4) a

branch or a jump, or (5) any other ALU operation. Each flag is generated by inspecting

the six-bit opcode, which requires a total of five six-input LUTs. The subbank ID of

³The experimental setup is described in Section 3.4.


the destination register is also extracted and recorded in the instruction buffer during

the predecode stage to facilitate back-end thread selection.

3.3.3 Thread Select

Every cycle, the back-end thread selection unit issues an instruction from one

of the available, unblocked threads. The goal is to derive a correct and balanced

issue schedule that prevents out-of-order completion; avoids structural hazards and

conflicts on L1 d-cache and register file subbanks; maintains fairness; and delivers

high throughput.

3.3.3.1 Instruction Buffer

Each thread has a private, eight-deep instruction buffer organized as a FIFO

queue. Since buffers are small and are written every few cycles with up to four new

instructions, they are implemented in CMOS as opposed to STT-MRAM.

3.3.3.2 Back-End Thread Selection Logic

Every cycle, back-end thread selection logic issues the instruction at the head of

one of the instruction buffers to be decoded and executed. The following events make

a thread unschedulable: (1) an L1 d-cache or d-TLB miss, (2) a structural hazard

on a register file subbank, (3) a store buffer overflow, (4) a data dependency on an

ongoing long-latency floating-point, integer multiply, or integer divide instruction, (5)


a structural hazard on the (unpipelined) floating-point divider, and (6) the possibility

of out-of-order completion.

The buffer entry holding a load is not recycled at the time the load issues; instead,

the entry is retained until the load is known to hit in the L1 d-cache or in the store

buffer. In the case of a miss, the thread is marked as unschedulable; when the L1

d-cache refill process starts, the thread transitions to a schedulable state, and the

load is replayed from the instruction buffer. On a hit, the load’s instruction buffer

entry is recycled as soon as the load enters the writeback stage.

Long-latency floating-point instructions and integer multiplies from a single thread

can be scheduled back-to-back so long as there are no dependencies between them. In

the case of an out-of-order completion possibility—a floating-point divide followed by

any other instruction, or any floating-point instruction other than a divide followed

by an integer instruction—the offending thread is made unschedulable for as many

cycles as needed for the danger to disappear.

Threads can also become unschedulable due to structural hazards on the un-

pipelined floating-point divider, on register file subbank write ports, or on store

buffers. As the register file is built using 30F² STT-MRAM cells with 3.1ns switching

time, the register file subbank write occupancy is 13 cycles at 4GHz. Throughout the

duration of an on-going write, the subbank is unavailable for a new write (unless it

is the same register that is being overwritten), but the read ports remain available;


hence, register file reads are not stalled by long-latency writes. If the destination sub-

bank of an instruction conflicts with an ongoing write to the same bank, the thread

becomes unschedulable until the target subbank is available. If the head of the in-

struction buffer is a store and the store buffer of the thread is full, the thread becomes

unschedulable until there is an opening in the store buffer.

In order to avoid starvation, a least recently selected (LRS) policy is used to pick

among all schedulable threads. The LRS policy is implemented using CMOS gates.

3.3.4 Decode

In the decode stage, the six-bit opcode of the instruction is inspected to generate

internal control signals for the following stages of the pipeline, and the architectural

register file is accessed to read the input operands. Every decoded signal propagated

to the execution stage thus requires a six-input LUT. For a typical, five-stage MIPS

pipeline [42] with 16 output control signals, 16 six-input LUTs suffice to accomplish

this.

3.3.4.1 Register File

Every thread has 32 integer registers and 32 floating-point registers, for a total of

512 registers (2kB of storage) per core. To enable a high-performance, low-leakage,

STT-MRAM based register file that can deliver the necessary write throughput and


single-thread latency, integer and floating-point registers from all threads are aggre-

gated in a subbanked STT-MRAM array as shown in Figure 3.13. The overall register

file consists of 32 subbanks of 16 registers each, sharing a common address bus and

a 64-bit data bus. The register file has two read ports and a write port, and the

write port is augmented with subbank buffers to allow multiple writes to proceed

in parallel on different subbanks without adding too much area, leakage, or latency

overhead (Section 3.2.1). Mapping each thread’s integer and floating-point registers

to a common subbank would significantly degrade throughput when a single thread is

running in the system, or during periods where only a few threads are schedulable due

to L1 d-cache misses. To avert this problem, each thread's registers are striped

across consecutive subbanks to improve throughput and to minimize the chance of a

subbank write port conflict. Double-precision floating-point operations require read-

ing two consecutive floating-point registers starting with an even-numbered register,

which is accomplished by accessing two consecutive subbanks and driving the 64-bit

data bus in parallel.
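The exact striping function is not specified in the text; the Python sketch below is one illustrative assumption (a low-order interleave over the 32 subbanks) that satisfies the properties described above: consecutive registers of a thread land in consecutive subbanks, and an even/odd double-precision pair maps to adjacent subbanks.

```python
# One possible striping of architectural registers across the 32 register-file
# subbanks (16 registers each). This mapping is an assumption for illustration,
# not the dissertation's exact layout.

NUM_SUBBANKS = 32
REGS_PER_THREAD = 64        # 32 integer + 32 floating-point registers

def subbank_of(thread_id: int, arch_reg: int) -> int:
    flat = thread_id * REGS_PER_THREAD + arch_reg
    return flat % NUM_SUBBANKS   # consecutive registers -> consecutive subbanks

# A single running thread spreads its writes across subbanks, so it rarely
# stalls on the 13-cycle subbank write occupancy.
print([subbank_of(0, r) for r in range(4)])    # [0, 1, 2, 3]
# An even/odd double-precision register pair sits in adjacent subbanks, letting
# the two halves drive the 64-bit data bus in parallel.
print(subbank_of(1, 2), subbank_of(1, 3))
```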

Table 3.4 lists area, read energy, and leakage power advantages that are possible

by implementing the register file in STT-MRAM. The STT-MRAM implementation

reduces leakage by 2.4× and read energy by 1.4× over CMOS; however, energy for

a full 32-bit write is increased by 22.2×. Whether the end result turns out to be a

net power savings depends on how frequently the register file is updated, and on how

Figure 3.13: Illustrative example of a subbanked register file (registers from threads T0 and T1 striped across subbanks that share the data and address busses).

effective differential writes are on a given workload.

Parameter        SRAM        STT-MRAM
Read Delay       137ps       122ps
Write Delay      137ps       3231ps
Read Energy      0.45pJ      0.33pJ
Write Energy     0.45pJ      10.0pJ
Leakage Power    3.71mW      1.53mW
Area             0.038mm²    0.042mm²

Table 3.4: Register file parameters.

3.3.5 Execute

After decode, instructions are sent to functional units to complete their execu-

tion. Bitwise logical operations, integer addition and subtraction, and logical shifts

are handled by the integer ALU, whereas floating-point addition, multiplication, and


division are handled by the floating-point unit. Similar to Sun's Niagara-1 proces-

sor [48], integer multiply and divide operations are also sent to the FPU rather than

a dedicated integer multiplier to save area and leakage power. Although the integer

ALU is responsible for 5% of the baseline leakage power consumption, many of the

operations it supports (e.g., bitwise logical operations) do not have sufficient circuit

complexity (i.e., minterms) to amortize the peripheral circuitry in a LUT-based im-

plementation. Moreover, fully pipelining an STT-MRAM based integer adder (the

power- and area-limiting unit in a typical integer ALU [51]) requires the adder to be

pipelined in two stages, but the additional power overhead of the pipeline flip-flops

largely offsets the benefits of transitioning to STT-MRAM. Consequently, the integer

ALU is left in CMOS. The FPU, on the other hand, is responsible for a large fraction

of the per-core leakage power and dynamic access energy, and is thus implemented

with STT-MRAM LUTs.

Floating-Point Unit. To compare ASIC- and LUT-based implementations of the

floating-point unit, an industrial FPU design from Gaisler Research, the GRFPU [13],

is taken as a baseline. A VHDL implementation of the GRFPU synthesizes to 100,000

gates on an ASIC design flow, and runs at 250MHz at 130nm; on a Xilinx Virtex-

2 FPGA, the unit synthesizes to 8,500 LUTs, and runs at 65MHz. Floating-point


addition, subtraction, and multiplication are fully pipelined and execute with a three-

cycle latency; floating-point division is unpipelined and takes 16 cycles.

To estimate the required pipeline depth for an STT-MRAM LUT-based imple-

mentation of the GRFPU to operate at 4GHz at 32nm, we use published numbers

on configurable logic block (CLB) delays on a Virtex-2 FPGA [2]. A CLB has a

LUT+MUX delay of 630ps and an interconnect delay of 1 to 2ns based on its place-

ment, which corresponds to a critical path of six to ten CLB delays. For STT-

MRAM, we assume a critical path delay of eight LUTs, which represents the average

of these two extremes. Assuming a buffered six-input STT-MRAM LUT delay of

130ps and a flip-flop sequencing overhead (tsetup + tC→Q) of 50ps, and conservatively

assuming a perfectly-balanced pipeline for the baseline GRFPU, we estimate that

the STT-MRAM implementation would need to be pipelined eight times deeper than

the original to operate at 4GHz, with floating-point addition, subtraction, and mul-

tiplication latencies of 24 cycles, and an unpipelined, 64-cycle floating-point divide

latency. When calculating leakage power, area, and access energy, we account for the

overhead of the increased number of flip-flops due to this deeper pipeline (flip-flop

power, area, and speed are extracted from 32nm circuit simulations of the topology

used in the OSU standard cell library [80]). We characterize and account for the

impact of loading on an STT-MRAM LUT when driving another LUT stage or a

flip-flop via Cadence-Spectre circuit simulations.
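The pipeline-depth estimate above reduces to a short calculation; the sketch below works through the arithmetic using the delays quoted in the text (assuming one LUT level fits per new pipeline stage).

```python
# Worked version of the pipeline-depth estimate (illustrative arithmetic only;
# the delays are the ones quoted in the text).

clock_ps, ff_overhead_ps, lut_delay_ps = 250, 50, 130   # 4GHz; tsetup + tC->Q
critical_path_luts = 8                                   # assumed per GRFPU stage

luts_per_stage = (clock_ps - ff_overhead_ps) // lut_delay_ps   # = 1 LUT level
depth_multiplier = critical_path_luts // luts_per_stage        # = 8x deeper

print(depth_multiplier)          # 8: each original stage becomes eight stages
print(3 * depth_multiplier)      # 24-cycle FP add/sub/mul (3 cycles originally)
```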


To estimate the pipeline depth for the CMOS implementation of the GRFPU run-

ning at 4GHz, we first scale the baseline 250MHz frequency linearly from 130nm to

32nm, which corresponds to a frequency of 1GHz at 32nm. Thus, conservatively ig-

noring the sequencing overhead, to operate at 4GHz, the circuit needs to be pipelined

4× deeper, with 12-cycle floating-point addition, subtraction, and multiplication la-

tencies, and a 64-cycle, unpipelined floating-point division. Estimating power for

CMOS (100,000 gates) requires estimating dynamic and leakage power for an av-

erage gate in a standard-cell library. We characterize the following OSU standard

cells using circuit simulations at 32nm, and use their average to estimate power for

the CMOS-based GRFPU design: INVX2, NAND2X1, NAND3X1, BUFX2, BUFX4,

AOI22X1, MUX2X1, DFFPOSX1, and XNORX1.

Table 3.5 shows the estimated leakage, dynamic energy, and area of the GRFPU

in both pure CMOS and STT-MRAM. The CMOS implementation uses 100,000

gates whereas the STT-MRAM implementation uses 8,500 LUTs. Although each

CMOS gate has lower dynamic energy than a six-input LUT, each LUT can replace

12 logic gates on average. This 12× reduction in unit count results in an overall

reduction of the total dynamic energy. Similarly, although each LUT has higher

leakage than a CMOS gate, the cumulative leakage of 8,500 LUTs reduces leakage

by 4× over the combined leakage of 100,000 gates. Area, on the other hand, is

comparable due to the reduced unit count compensating for the 5× higher area of


each LUT and the additional buffering required to cascade the LUTs. (Note that

these area estimates do not account for wiring overheads in either the CMOS or the

STT-MRAM implementations.) In summary, the FPU is a good candidate to place

in STT-MRAM since its high circuit complexity produces logic functions with many

minterms that require many CMOS gates to implement, which is exactly when a

LUT-based implementation is advantageous.

Parameter         CMOS FPU    STT-MRAM FPU
Dynamic Energy    36pJ        26.7pJ
Leakage Power     259mW       61mW
Area              0.22mm²     0.20mm²

Table 3.5: FPU parameters. Area estimates do not include wiring overhead.

3.3.6 Memory

In the memory stage, load and store instructions access the STT-MRAM based

L1 d-cache and d-TLB. To simplify the scheduling of stores and to minimize the

performance impact of contention on subbank write ports, each thread is allocated a

CMOS-based, eight-deep store buffer holding in-flight store instructions.

3.3.6.1 Store Buffers

One problem that comes up when scheduling stores is the possibility of a d-cache

subbank conflict at the time the store reaches the memory stage. Since stores require

address computation before their target d-cache subbank is known, thread selection


logic cannot determine if a store will experience a port conflict in advance. To address

this problem, the memory stage of the pipeline includes a CMOS-based, private,

eight-deep store buffer per thread. So long as a thread’s store buffer is not full,

the thread selection logic can schedule the store without knowing the destination

subbank. Stores are dispatched into and issued from store buffers in FIFO order; store

buffers also provide an associative search port to support store-to-load forwarding,

similar to the Sun Niagara-1 processor [48]. We assume relaxed consistency models

where special synchronization primitives (e.g., memory fences in weak consistency, or

acquire/release operations in release consistency) are inserted into store buffers, and

the store buffer enforces the semantics of the primitives when retiring stores and when

forwarding to loads. Since the L1 d-cache supports a single write port (but multiple

subbank buffers), only a single store can issue per cycle. Store buffers and the L1

refill queue contend for access to this shared resource, and priority is determined

based on a round-robin policy.
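The per-thread store buffer behaves like a small FIFO with an associative lookup for forwarding; the Python sketch below is a minimal illustration (names, sizes, and fields are assumptions, and the consistency-ordering rules described above are omitted).

```python
# Minimal sketch of a per-thread FIFO store buffer with an associative search
# port for store-to-load forwarding.

from collections import deque

class StoreBuffer:
    def __init__(self, depth=8):
        self.depth = depth
        self.entries = deque()                  # oldest store at the left

    def full(self):
        return len(self.entries) >= self.depth  # a full buffer blocks the thread

    def push(self, addr, data):
        assert not self.full()
        self.entries.append((addr, data))       # dispatched in program order

    def forward(self, addr):
        """Associative search: the youngest matching store forwards to the load."""
        for a, d in reversed(self.entries):
            if a == addr:
                return d
        return None                             # miss -> read the L1 d-cache

    def retire(self):
        """Issue the oldest store to the L1 write port (FIFO order)."""
        return self.entries.popleft() if self.entries else None

sb = StoreBuffer()
sb.push(0x40, 1)
sb.push(0x40, 2)
print(sb.forward(0x40))   # 2: the younger store wins the forward
print(sb.retire())        # (0x40, 1): stores still retire in program order
```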

3.3.6.2 L1 Data Cache and TLB

Both the L1 d-cache and the d-TLB are implemented using STT-MRAM arrays.

The d-cache is equipped with two read ports (one for snooping, and one for the

core) and a write port shared among all subbanks. At the time a load issues, the

corresponding thread is marked unschedulable and recycling of the instruction buffer


entry holding the load is postponed until it is ascertained that the load will not

experience a d-cache miss. Loads search the store buffer of the corresponding thread

and access the L1 d-cache in parallel, and forward from the store buffer in the case

of a hit. On a d-cache miss, the thread is marked unschedulable, and is transitioned

back to a schedulable state once the data arrives. To accommodate refills returning

from the L2, the L1 has a 16-deep, CMOS-based refill queue holding incoming data

blocks. Store buffers and the refill queue contend for access to the two subbanks

of the L1, and are given access using a round-robin policy. Since the L1 is written

frequently, it is optimized for write throughput using 30F² cells. The L1 subbank

buffers perform internal differential writes to reduce write energy.

Parameter        SRAM (32KB)    STT-MRAM (32KB, 30F²)    STT-MRAM (64KB, 30F²)
Read Delay       344ps          236ps                    369ps
Write Delay      344ps          3331ps                   3399ps
Read Energy      60pJ           31pJ                     53pJ
Write Energy     60pJ           109pJ                    131pJ
Leakage Power    78.4mW         11.0mW                   31.3mW
Area             0.54mm²        0.19mm²                  0.39mm²

Table 3.6: L1 d-cache parameters.

Table 3.6 compares the power, area, and latency characteristics of two different

STT-MRAM based L1 configurations to a baseline, 32KB CMOS implementation. A

capacity-equivalent, 32KB d-cache reduces the access latency from two clock cycles to

one, and cuts down the read energy by 1.9× due to the shorter interconnect lengths

possible with the density advantage of STT-MRAM. Leakage power is reduced by


7.1×, and area is reduced by 2.8×. An alternative, 64kB configuration requires 72%

of the area of the CMOS baseline, but increases capacity by 2×; this configuration

takes two cycles to read, and delivers a 2.5× leakage reduction over CMOS.

3.3.6.3 L2 Cache

The L2 cache is designed using 10F² STT-MRAM cells to optimize for density

and access energy rather than write speed. To ensure adequate throughput, the cache

is equipped with eight banks, each of which supports four subbanks, for a total of 32.

Each L2 bank has a single read/write port shared among all of the subbanks; unlike

the L1 d-cache and the register file, L2 subbanks are not equipped with differential

writing circuitry to minimize leakage due to the CMOS-based periphery.

Table 3.7 compares two different STT-MRAM L2 organizations to a baseline,

4MB CMOS L2. To optimize for leakage, the baseline CMOS L2 cache uses high-Vt

transistors in the data array, whereas the peripheral circuitry needs to be imple-

mented using low-Vt, high-performance transistors to maintain a 4GHz cycle time.

A capacity-equivalent, 4MB STT-MRAM based L2 reduces leakage by 2.0× and read access energy to 63% of the CMOS baseline. Alternatively, it is possible to

increase capacity to 32MB while maintaining lower area, but the leakage overhead of

the peripheral circuitry increases with capacity, and results in twice as much leakage

as the baseline.

Parameter        SRAM (4MB)    STT-MRAM (4MB)    STT-MRAM (32MB)
Read Delay       2364ps        1956ps            2760ps
Write Delay      2364ps        7752ps            8387ps
Read Energy      1268pJ        798pJ             1322pJ
Write Energy     1268pJ        952pJ             1477pJ
Leakage Power    6578mW        3343mW            12489mW
Area             82.33mm²      32.00mm²          70.45mm²

Table 3.7: L2 cache parameters.

3.3.6.4 Memory Controllers

To provide adequate memory bandwidth to eight cores, the system is equipped

with four DDR2-800 memory controllers. Memory controller read and write queues

are implemented in STT-MRAM using 10F² cells. Since the controller needs to make

decisions every DRAM clock cycle (10 processor cycles in our baseline), the impact

of write latency on scheduling efficiency and performance is negligible.

The scheduling logic of the controller is implemented using STT-MRAM LUTs.

To estimate power, performance, and area under CMOS- and MRAM-based imple-

mentations, we use a methodology similar to that employed for the floating-point

unit. We use a DDR2-800 memory controller IP core developed by HiTech [32] as

our baseline; on an ASIC design flow, the controller synthesizes to 13, 700 gates and

runs at 400MHz; on a Xilinx Virtex-5 FPGA, the same controller synthesizes to 920

CLBs and runs at 333MHz. Replacing CLB delays with STT-MRAM LUT delays,

we find that an STT-MRAM based implementation of the controller would meet the


400MHz cycle time without further modifications.

Table 3.8 compares the parameters of the CMOS and STT-MRAM based imple-

mentations. Similarly to the case of the FPU, the controller logic benefits significantly

from a LUT based design. Leakage power is reduced by 7.2×, while the energy of

writing to the scheduling queue increases by 24.4×.

Parameter          CMOS        STT-MRAM
Read Delay         185ps       154ps
Write Delay        185ps       6830ps
Read Energy        7.1pJ       5.6pJ
Write Energy       7.1pJ       173pJ
MC Logic Energy    30.0pJ      1.6pJ
Leakage Power      41.4mW      5.72mW
Area               0.097mm²    0.051mm²

Table 3.8: Memory controller parameters. Area estimates do not include the wiring overhead.

3.3.7 Write Back

In the write-back stage, an instruction writes its result back into the architectural

register file through the write port. No conflicts are possible during this stage since the

thread selection logic schedules instructions by taking register file subbank conflicts

into account. Differential writes within the register file reduce write power during

write backs.


3.4 Experimental Setup

This section presents the experimental methodology used for the evaluation. Architecture-

level simulations are conducted to model the behavior of the proposed system. Circuit-

level tools and simulators are used to evaluate the area, latency, and power. A set of

13 parallel benchmarks are evaluated on the proposed STT-MRAM based systems.

3.4.1 Architecture

We use a heavily modified version of the SESC simulator [65] to model a Niagara-

like in-order CMT system with eight cores, and eight hardware thread contexts per

core. Table 3.9 lists the microarchitectural configuration of the baseline cores and the

shared memory subsystem.

For STT-MRAM, we experiment with two different design points for the L1 and

L2 caches: (1) configurations with capacity equivalent to the CMOS baseline, where

STT-MRAM benefits from the lower interconnect delays (Table 3.10-Small), and (2)

configurations with a larger capacity that still fit within the same area budget as the

CMOS baseline, where STT-MRAM benefits from fewer misses (Table 3.10-Large).

The STT-MRAM memory controller queue write delay is set to 27 processor cycles.

We experiment with an MRAM-based register file with 32 subbanks and a write delay

of 13 cycles each, and we also evaluate the possibility of leaving the register file in

CMOS.

Processor Parameters
  Frequency                     4 GHz
  Number of cores               8
  Number of SMT contexts        8 per core
  Front-end thread select       Round Robin
  Back-end thread select        Least Recently Selected
  Pipeline organization         Single-issue, in-order
  Store buffer entries          8 per thread
L1 Caches
  iL1/dL1 size                  32kB/32kB
  iL1/dL1 block size            32B/32B
  iL1/dL1 round-trip latency    2/2 cycles (uncontended)
  iL1/dL1 ports                 1/2
  iL1/dL1 banks                 1/2
  iL1/dL1 MSHR entries          16/16
  iL1/dL1 associativity         direct mapped/2-way
  Coherence protocol            MESI
  Consistency model             Release consistency
Shared L2 Cache and Main Memory
  Shared L2 cache               4MB, 64B block, 8-way
  L2 MSHR entries               64
  L2 round-trip latency         10 cycles (uncontended)
  Write buffer                  64 entries
  DRAM subsystem                DDR2-800 SDRAM [53]
  Memory controllers            4

Table 3.9: Parameters of the baseline.

Parameter             Small         Large
iL1/dL1 size          32kB/32kB     128kB/64kB
iL1/dL1 latency       1/1 cycles    2/2 cycles
L1 write occupancy    13 cycles     13 cycles
L2 size               4MB           32MB
L2 latency            8 cycles      12 cycles
L2 write occupancy    24 cycles     23 cycles

Table 3.10: STT-MRAM cache parameters.

For structures that reside in CMOS in both the baseline and the proposed archi-

tecture (e.g., pipeline latches, store buffers), McPAT [50] is used to estimate power,

area, and latency.

3.4.2 Circuit

We use BSIM-4 predictive technology models (PTM) of NMOS and PMOS tran-

sistors at 32nm, and perform circuit simulations using Cadence AMS (Spectre) mixed

signal analyses with Verilog-based input test vectors. Only high performance transis-

tors were used in all of the circuit simulations. Temperature is set to 370K in all cases,

which is a meaningful thermal design point for the proposed processor operating at

4GHz [58].

3.4.3 Applications

A set of 13 parallel benchmarks are evaluated (Table 3.11). These include three

applications from NU-MineBench [60], two from an OpenMP implementation of the

NAS parallel benchmarks [7], two from SPEC OMP2001 [4], and six from SPLASH-

2 [89].

3.5 Evaluation

This section presents the performance and power evaluation.

Benchmark     Description               Problem size
Data Mining
  BLAST       Protein matching          12.3k sequences
  BSOM        Self-organizing map       2,048 rec., 100 epochs
  KMEANS      K-means clustering        18k pts., 18 attributes
NAS OpenMP
  MG          Multigrid solver          Class A
  CG          Conjugate gradient        Class A
SPEC OpenMP
  SWIM        Shallow water model       MinneSpec-Large
  EQUAKE      Earthquake model          MinneSpec-Large
Splash-2 Kernels
  CHOLESKY    Cholesky factorization    tk29.O
  FFT         Fast Fourier transform    1M points
  LU          Dense matrix division     512×512 to 16×16
  RADIX       Integer radix sort        2M integers
Splash-2 Applications
  OCEAN       Ocean movements           514×514 ocean
  WATER-N     Water-Nsquared            512 molecules

Table 3.11: Simulated applications and their input sizes.

3.5.1 Performance

Figure 3.14 compares the performance of four different MRAM-based CMT con-

figurations to the CMOS baseline. When the register file is placed in STT-MRAM

and the L1 and L2 cache capacities are made equivalent to CMOS, performance de-

grades by 11%. Moving the register file to CMOS improves performance, at which

point the system achieves 93% of the baseline performance. Enlarging both L1 and

L2 cache capacities under the same area budget reduces miss rates but loses the la-

tency advantage of the smaller caches; this configuration outperforms CMOS by 2%

Figure 3.14: Performance normalized to CMOS for the 13 benchmarks and their geometric mean, comparing CMOS against four STT-MRAM configurations (small L1&L2 with STT-MRAM RF, small L1&L2 with CMOS RF, large L1&L2 with CMOS RF, and small L1 with large L2 and CMOS RF).

Figure 3.15: Total power (W) for the same benchmarks and configurations.

on average. Optimizing the L2 for fewer misses (by increasing capacity under the

same area budget) while optimizing the L1s for fast hits (by migrating to a denser

STT-MRAM cache with the same capacity) delivers similar results.

In general, performance bottlenecks are application dependent. For applications

such as CG, FFT and WATER, the MRAM-based register file represents the biggest

performance hurdle. These applications encounter a higher number of subbank con-

flicts than others, and when the register file is moved to CMOS, their performance

improves significantly. EQUAKE, KMEANS, MG, and RADIX are found to be sensitive

to floating-point instruction latencies as they encounter many stalls due to depen-

dents of long-latency floating-point instructions in the 24-cycle, STT-MRAM based

floating-point pipeline. CG, CHOLESKY, FFT, RADIX, and SWIM benefit most

from increasing the cache capacities under the same area budget as CMOS by lever-

aging the density advantage of STT-MRAM.

3.5.2 Power

Figure 3.15 compares total power dissipation across the five systems. STT-MRAM

configurations that maintain the same cache sizes as CMOS reduce the total power by

1.7× over CMOS. Despite their higher performance potential, configurations which

increase cache capacity under the same area budget increase power by 1.2× over

CMOS, due to the significant amount of leakage power dissipated in the CMOS-based

Figure 3.16: Leakage power breakdown (W) by component (RF, FPU, ALU and bypass, instruction buffers and store queues, flip-flops and combinational logic, L1s and TLBs, L2, and memory controllers). Totals: 11.40W for CMOS; 5.32W and 5.34W for the small-cache STT-MRAM configurations with STT-MRAM and CMOS register files; 14.92W and 14.48W for the two large-cache configurations.

decoding and sensing circuitry in the 32MB L2 cache. Although a larger L2 can reduce

the write power by allowing for fewer L2 refills and writes to the memory controllers’

scheduling queues, the increased leakage power consumed by the peripheral circuitry

outweighs the savings on dynamic power.

Figure 3.16 shows the breakdown of leakage power among different components

for the evaluated systems. Total leakage power is reduced by 2.1× over CMOS when

the cache capacities are kept the same. Systems with a large L2 cache increase leakage

power by 1.3× due to the CMOS-based periphery. The floating-point units, which

consume 18% of the total leakage power in the CMOS baseline, benefit significantly

from an STT-MRAM based implementation. STT-MRAM based L1 caches and TLBs

together reduce leakage power by another 10%. The leakage power of the memory

controllers in STT-MRAM is negligible, whereas in CMOS it is 1.5% of the total

leakage power.

3.6 Summary

This chapter presents a new technique that reduces leakage and dynamic power in

a deep-submicron microprocessor by migrating power- and performance-critical hard-

ware resources from CMOS to STT-MRAM. We have evaluated the power and per-

formance impact of implementing on-chip caches, register files, memory controllers,

floating-point units, and various combinational logic blocks using magnetoresistive


circuits, and we have explored the critical issues that affect whether a RAM array or

a combinational logic block can be effectively implemented in MRAM. We have ob-

served significant gains in power-efficiency by partitioning on-chip hardware resources

among STT-MRAM and CMOS judiciously to exploit the unique power, area, and

speed benefits of each technology, and by carefully re-architecting the pipeline to

mitigate the performance impact of long write latencies and high write power.


Chapter 4

STT-MRAM based Main

Memories

DRAM density scaling is jeopardized by two fundamental charge retention prob-

lems in deeply scaled technology nodes: (1) the reduced storage capacitance of the

DRAM cell makes it difficult to store large amounts of charge, and (2) the stored

charge is lost faster due to increased leakage through the access transistor. Emerging

non-volatile memory (NVM) technologies aim at skirting the charge retention problem

of deeply scaled DRAM by relying on resistance—rather than electrical charge—to

represent information. However, each of the candidate NVMs comes with its own set

of shortcomings: phase change memory (PCM) and resistive random access memory


(RRAM) exhibit limited write endurance and high switching energy, while STT-

MRAM density lags multiple generations behind that of current generation DRAM.

One important reason for the lower density of STT-MRAM compared to DRAM

is the access transistor, which must be sufficiently large to supply the write current

required to switch the device. Note that in Chapter 3, STT-MRAM exhibits higher density than SRAM even when a larger access transistor is used to

supply a high write current. In fact, for embedded STT-MRAM discussed in Chap-

ter 3, the density of STT-MRAM is limited by the strict design rules. For stand-alone

STT-MRAM, aggressively reducing the dimensions of the storage element over suc-

cessive technology generations can reduce the required write current, removing one

of the major impediments to rapid capacity scaling1. Reducing the size, however,

inevitably results in a lower thermal stability and a higher probability of retention

errors, which necessitate a combination of multi-bit error correcting code (ECC) and

periodic scrubbing techniques [21,57].

Scrubbing operations are expensive, each of which requires (1) reading out a code-

word spanning one or more memory blocks before the number of accumulated errors

exceeds the correction capability of the underlying ECC mechanism, (2) checking and

correcting any errors, and (3) writing back the corrected data. Employing a stronger

ECC can help tolerate more errors before a scrub operation becomes mandatory,

thereby reducing the scrubbing frequency and the concomitant performance and energy overheads.

1Other impediments include the conventional challenges of technology scaling, such as process variability and yield.

Figure 4.1: Tradeoff between scrubbing frequency and ECC granularity under a 12.5% storage overhead. [Log-scale plot of the required scrubbing frequency (Hz) against the ECC granularity (number of 64B blocks); finer granularity implies a higher scrubbing overhead, while coarser granularity implies more expensive ECC checks.]

For a given ECC storage overhead, the ECC strength can be improved

by coarsening the ECC granularity (i.e., increasing the size of a codeword) and in-

creasing the number of errors that can be corrected in each codeword [12]. Figure 4.1

shows that coarsening the ECC granularity from one to sixteen blocks while main-

taining a fixed storage overhead reduces the required scrubbing frequency by more

than 200× (the calculation of the curve in Figure 4.1 is described in Section 4.1.3).

However, large codewords increase the access energy and bandwidth usage due to

over-fetching. Specifically, when a codeword spans multiple cache blocks, (1) a read

requires fetching multiple blocks to decode the ECC, and (2) a write requires reading

the entire codeword and updating the check bits.

We introduce Sanitizer—a low-cost, energy-efficient memory system architecture


that protects high-capacity, STT-MRAM based main memories against retention er-

rors. To avoid fetching multiple blocks from memory and performing costly ECC

checks on every read, memory regions (contiguous, 4KB sections of the physical ad-

dress space) that will be accessed in the near future are predicted and proactively

scrubbed. The key insight is that when accessing a recently scrubbed block, it is

sufficient to perform a lightweight ECC check. By anticipating the memory regions

that will be accessed in the near future and scrubbing them in advance, Sanitizer

improves performance by 1.22× and reduces end-to-end system energy by 22% over

a baseline STT-MRAM system at 22 nm.

4.1 Background for Sanitizer

Before taking an in-depth look at Sanitizer, it is instructive to review DRAM

error protection techniques, STT-MRAM fault modeling, and known techniques for

protecting STT-MRAM against retention errors.

4.1.1 DRAM Error Protection

With technology scaling, maintaining DRAM reliability has become increasingly

challenging. To address the problem, solutions that span novel devices, circuits,

architectures, and software have been devised.


4.1.1.1 Error Correcting Codes

The reliability of a memory system can be improved with the help of ECC, which

adds redundant bits to a group of data bits to form a codeword. For a specified ECC

configuration, the smallest Hamming distance between any pair of valid codewords

is called the minimum distance of the ECC; any number of errors fewer than the

minimum distance changes a valid codeword into an invalid one. For example, the

single error correction double error detection (SECDED) Hamming code has a min-

imum distance of four. On a single bit error, the original data can be restored by

finding the valid codeword closest to the invalid bit pattern. Errors due to two bit

flips can be detected but not corrected by SECDED ECC, because an erroneous bit

pattern with two errors can have the same minimum Hamming distance to multiple

valid codewords.

Protection against STT-MRAM retention errors necessitates an ECC with multi-

bit error correction capability [21,57]. BCH [10,33] and Reed Solomon codes [64] are

two widely used ECC schemes for multi-bit error correction. Sanitizer builds upon

a binary BCH code because the symbol-based Reed Solomon code is optimized for

correcting bursts of errors, which are not a common retention failure pattern in STT-

MRAM [21,57]. A binary BCH code with k data bits, capable of t-bit error correction

and (t+ 1)-bit error detection, requires r redundant bits to form an n-bit codeword,

in which n = k + r and r = t × ⌈log2(n + 1)⌉ + 1.
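To make the relation concrete, the short sketch below solves it numerically; the fixed-point iteration (needed because n itself depends on r) is our own illustration, not a procedure taken from the dissertation.

import math

def bch_redundancy(k, t):
    # Solve r = t * ceil(log2(n + 1)) + 1 with n = k + r by iterating to a fixed point.
    r = 0
    while True:
        n = k + r
        r_next = t * math.ceil(math.log2(n + 1)) + 1
        if r_next == r:
            return r
        r = r_next

# Example: 4096 data bits with 32-bit correction needs 417 redundant bits,
# which matches the sanitizer-8 GECC bits listed later in Table 4.5.
print(bch_redundancy(4096, 32))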


Sanitizer employs a hierarchical error protection mechanism comprising local and

global ECCs. The local ECC protects a single data block, while the global ECC

encodes data that spans multiple blocks. Prior work in using hierarchical ECC to

protect main memory aims at reducing the over-fetching cost of chipkill [22]. Yoon et

al. [94] propose a virtualized, multi-tier ECC architecture that decouples the physical

mapping of the data and its associated ECC. Udipi et al. [86] propose a hierarchical

ECC, which separates error detection from correction by storing the checksum and

parity bits in each memory chip. Unlike prior work, which activates the first and

second level ECCs in sequence, Sanitizer leverages knowledge of whether a memory

location has been scrubbed recently to determine if it is safe to rely on a fast, local

ECC check.

4.1.1.2 Refresh and Scrub Operations

A DRAM cell can retain sufficient charge for a limited amount of time (typically

64 ms) after it is written; consequently, cells must be refreshed periodically to protect

against information loss. Unlike DRAM, STT-MRAM does not have a charge leakage

problem. However, it suffers from retention errors due to thermal fluctuations that

may abruptly and randomly change the contents of the memory cells. Hence, unlike

the case of DRAM retention errors where charge is gradually removed from the cells,


STT-MRAM retention errors cannot be prevented using refresh. This trait neces-

sitates using error correcting codes in conjunction with scrubbing in STT-MRAM

systems [57].

A memory system protected by ECC can tolerate a fixed number of errors per

codeword. No matter how strong the underlying ECC is, however, after a sufficiently

long period of time, the number of errors that accumulate in a block can exceed

the correction capability of the ECC, thereby resulting in an uncorrectable error.

Scrubbing is a standard strategy to meet this challenge, in which a memory block is

periodically read, checked for errors, and restored to an error-free state.

4.1.2 STT-MRAM Reliability

Errors in STT-MRAM can occur during both the read and the write operations.

A read error occurs when the resistance range of the high and low states overlap

due to process variability [57, 72]. Advanced sensing schemes [18, 85] and reference

resistance tuning [85] can reduce the read errors. A write error occurs when either the

amplitude of the write current is not sufficiently high, or its duration is not sufficiently

long. Reducing the MTJ diameter and thickness can reduce the critical current

IC0 and the thermal stability factor ∆, which lowers the amplitude of the required

write current [75]. If the required write current is sufficiently low, a minimum-size

transistor can reliably and quickly switch the state of the MTJ. Repeated writes can


also lead to hard errors. However, the endurance of STT-MRAM is a less pressing

issue compared to other non-volatile memory technologies such as RRAM or PCM.

Nevertheless, if the endurance of STT-MRAM were to become a concern, techniques

proposed for PCM [38,62] could be adopted to alleviate the problem. Such techniques

are orthogonal to Sanitizer, and are beyond the scope of the problem to be solved in

this chapter.

Current generation STT-MRAM exhibits low density due to the large access tran-

sistor required to supply a sufficiently high switching current. Industry projections

indicate, however, that technology scaling will effectively address this problem; for

instance, a recent paper [73] from Everspin shows that the saturation current of a

minimum-sized transistor will be higher than the required switching current below

28nm. As technology scales, the MTJ size has to be shrunk as well, which inevitably

results in an increase in the retention error rate. The best known technology [28, 45]

at 22nm already exhibits a high retention error rate due to low thermal stability.

These retention errors are projected to be the dominant type of error in deeply scaled

STT-MRAM [57]. The retention error rate can be calculated using a closed form

analytical expression:

Pretention(∆, t) = 1 − exp(−(t/τ0) exp(−∆)),    ∆ = Eb / (kB T)    (4.1)

where t is the time elapsed since the last write, τ0 is a process-dependent constant


(typically 1ns), Eb is the temperature-independent activation energy, kB is the Boltz-

mann constant, and T is the absolute temperature in Kelvins [57]. As technology

scales, ∆ is predicted to decrease since IC0 must be reduced to allow reliable write

operations with lower current [21]. A perpendicular MTJ, in which the magnetiza-

tion direction of the fixed and free layers are both orthogonal to the tunneling barrier,

achieves a lower IC0 with a higher ∆ compared to a conventional in-plane MTJ [28];

however, even for a perpendicular MTJ, the ∆ at 20 nm is in the range of 29 to

34 [28,45], which is lower than the required ∆ (>60 [57]) for a 1GB memory without

ECC. Note that these are the ∆ values measured at room temperature; ∆ further

decreases at higher temperatures.

Due to process variations, ∆ is not uniform across all of the cells on a single chip.

Specifically, if ∆ follows a distribution characterized by a probability mass function

f(∆), the probability that a random cell has a retention error at time t is:

P(t) = Σ_{∆=∆min}^{∆max} Pretention(∆, t) f(∆).    (4.2)

This calculation is performed when computing the raw bit error rate (BER) used in

the rest of this chapter.
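Both expressions are easy to evaluate directly. The sketch below computes the per-cell retention probability of Equation (4.1) and then averages it over a discretized spread of ∆ in the spirit of Equation (4.2); the Gaussian shape of f(∆), the ±4σ integration range, and the 5% standard deviation are our assumptions for illustration, since the dissertation only states that ∆ follows some distribution.

import math

TAU0 = 1e-9  # process-dependent constant, typically 1 ns (see text)

def p_retention(delta, t):
    # Equation (4.1): probability that a cell flips within t seconds of its last write.
    return 1.0 - math.exp(-(t / TAU0) * math.exp(-delta))

def raw_ber(delta_mean, t, sigma_frac=0.05, steps=200):
    # Equation (4.2), discretized: weight p_retention by an assumed Gaussian f(delta).
    sigma = sigma_frac * delta_mean
    lo, hi = delta_mean - 4 * sigma, delta_mean + 4 * sigma
    num = den = 0.0
    for i in range(steps):
        d = lo + (hi - lo) * (i + 0.5) / steps
        w = math.exp(-0.5 * ((d - delta_mean) / sigma) ** 2)
        num += w * p_retention(d, t)
        den += w
    return num / den

# Nominal delta = 34 gives roughly 1.7e-6 per cell per second; averaging over the
# spread raises the rate, because low-delta cells dominate the error probability.
print(p_retention(34, 1.0), raw_ber(34, 1.0))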

Naeimi et al. [57] and Del Bel et al. [21] propose to use ECC and scrubbing to

protect STT-MRAM based caches against retention errors. They restrict the ECC

granularity to one cache line. Protecting STT-MRAM based main memory against


retention errors poses a greater challenge than protecting caches, because (1) it takes

longer to scrub a high capacity main memory system, and (2) scrubbing contends with

demand misses for the limited off-chip memory bandwidth. Awasthi et al. [5] propose

the light array read for PCM resistance drift detection (LARDD) technique, which

places simple ECC logic on the memory chips to detect the first sign of a PCM resis-

tance drift. This scheme would not work for STT-MRAM retention errors, because

the occurrence of one STT-MRAM retention error does not change the probability of

the next one (Equation (4.1)), whereas the observation of one PCM resistance drift

error increases the likelihood of subsequent drift errors.

4.1.3 Reliability Target

The failure in time (FIT) is a standard industrial metric to measure the reliability

of a device (e.g., a DRAM die [54]). FIT measures the number of failures in one

billion device hours. We use 1 FIT (uncorrectable errors in one billion device hours)

per Gbit as a reliability target, so that if the hard failure rate of STT-MRAM is

similar to that of DRAM (22 to 33 FIT [74, 76, 77]), the retention failures have a

minimum impact on system reliability. To achieve this 1 FIT reliability target, an

appropriate ECC code must be chosen for a desired scrubbing frequency. For a given

scrubbing frequency, the raw BER can be calculated from equations (4.1) and (4.2),

and an ECC code is chosen so that the failure probability is below 1 FIT. For a specific


ECC code that can correct t errors and detect t+ 1 errors, the failure probability of

a single ECC codeword is Pcodeword = C(n, t+1) p^(t+1) (1 − p)^(n−t−1), where p is the raw bit error probability accumulated over one scrubbing interval and n is the number

of bits in a codeword. The failures due to retention errors in each codeword and the

length of the scrubbing interval can be assumed independent from each other because

(1) all of the correctable errors are corrected after each scrubbing operation, (2) all of

the detectable but uncorrectable errors are handled by higher-level mechanisms (e.g.,

roll back in a system that supports checkpointing), and (3) the probability of having

an undetectable error typically is orders of magnitude lower as compared to that of

having a detectable error. The number of failures in one billion data bits and one

billion hours, therefore, follows a binomial distribution, and the expected number of

failures caused by retention errors is Fretention = Pcodeword × Ncodeword × Nscrub FIT,

where Ncodeword is the number of codewords that cover one billion data bits, and Nscrub

is the number of scrub operations to each memory location in one billion hours.
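For concreteness, the calculation above can be scripted as follows. The codeword parameters in the example are taken from the 8-block configuration described later (Table 4.5), and the sweep simply shows how sharply the FIT rate depends on the scrubbing interval, which is the tradeoff behind Figure 4.1; the script is an illustrative reading of the text, not the dissertation's tool.

from math import comb

def retention_fit(n, k, t_corr, raw_ber_per_s, scrub_interval_s):
    # Expected failures per Gbit in one billion device hours (Section 4.1.3).
    p = raw_ber_per_s * scrub_interval_s              # per-bit error probability between scrubs
    p_codeword = comb(n, t_corr + 1) * p ** (t_corr + 1) * (1 - p) ** (n - t_corr - 1)
    n_codeword = 1e9 / k                              # codewords covering one billion data bits
    n_scrub = 1e9 * 3600 / scrub_interval_s           # scrub operations per location in 1e9 hours
    return p_codeword * n_codeword * n_scrub

# 8-block codeword: 4096 data bits, 32-bit correction, n = 4096 + 417 = 4513,
# raw BER of 3.4e-5 per second (the Table 4.1 conditions).
for interval in (10, 20, 30, 40):
    print(interval, retention_fit(4513, 4096, 32, 3.4e-5, interval))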

4.1.4 Scrubbing Overheads

The performance penalty due to scrubbing increases in proportion to the capacity/bandwidth

ratio of the memory system. Using a stronger ECC mitigates the bandwidth over-

head. Table 4.1 shows the off-chip memory bandwidth consumed by scrubbing under

progressively stronger ECC configurations, normalized to the peak memory band-

width of the system. (Note that the 1- and 2-block configurations are not practical

because the bandwidth overhead is greater than 50%.)

System configurations     Capacity/bandwidth   4-blk ECC   8-blk ECC   16-blk ECC
Evaluated (Section 4.3)   2.16 GB/GBps          9.89%       4.41%       2.73%
SPARC M5 [59]             2.50 GB/GBps         11.64%       5.23%       3.23%
Xeon E7-8800 [37]         2.56 GB/GBps         11.99%       5.35%       3.31%
Power S822 [36]           2.67 GB/GBps         12.70%       5.48%       3.45%

Table 4.1: Bandwidth overhead due to scrubbing. FIT/Gbit < 1, ∆ = 34, T = 45°C, raw BER = 3.4×10-5/s, and block size = 64B.

The scrubbing rates of all of

the configurations in Table 4.1 are below 0.05 Hz, which is much lower than the typi-

cal DRAM refresh rates (1/(64 ms) = 15.6 Hz). However, scrubbing an STT-MRAM page

is more expensive than refreshing a DRAM page because scrubbing requires reading

the data out of the memory system. A sensitivity analysis on the capacity/bandwidth ratio is presented in Section 4.4.4.2.

4.2 Sanitizer Architecture

Sanitizer reduces the scrubbing frequency by applying BCH codes with strong er-

ror tolerance to long codewords spanning multiple cache blocks. Figure 4.2 illustrates

the operation of three different memory protection techniques: (a) frequent scrubbing

combined with a fast but weak ECC, (b) infrequent scrubbing combined with a strong

but slow ECC, and (c) Sanitizer. All three techniques perform scrubbing to remove

errors from blocks B0 and B1. Since the strong ECC can correct more errors than

the fast ECC, it allows more errors to accumulate in a codeword before scrubbing

becomes mandatory.

Figure 4.2: Illustrative example of Sanitizer and conventional scrubbing mechanisms. [Timelines of scrub reads, request reads, and ECC codeword reads to blocks B0 and B1 under (a) frequent scrubbing with a fast but weak ECC, (b) infrequent scrubbing with a strong but slow ECC, and (c) Sanitizer.]

As a result, the strong ECC requires scrubbing less frequently

than the fast ECC. However, the strong ECC has to be applied to longer codewords

spanning two cache blocks to achieve the same storage overhead as the fast ECC,

which requires reading an extra cache block with every memory access. As shown

in Figure 4.2 (b), both B0 and B1 must be accessed to perform error correction on

every read. Sanitizer addresses this problem using (1) a hierarchical error protection

mechanism, in which the strong ECC is used for infrequent scrubbing, while the fast

ECC is used for most of the ordinary memory accesses; and (2) a novel prediction

mechanism for scheduling scrub operations at the granularity of 4KB memory regions

prior to ordinary accesses, reducing the error correction cost.

Sanitizer relies on the observation that a recently scrubbed memory block tends

to accumulate relatively few errors and can be protected using a simple ECC. It uses

a global ECC (GECC) for scrubbing, and a local ECC (LECC) for detecting up to

three errors per memory block. When the LECC is applied within a short period of

time after a codeword is scrubbed, it can ensure the same FIT as the GECC. If an

error is detected by the LECC, the GECC mechanism is invoked for correction.

Figure 4.3 shows the Sanitizer datapath. For every read request, the system first

checks a recently scrubbed table (RST) 1 . On an RST hit, the memory block can

be accessed via LECC decoding; on a miss, multiple cache blocks must be read to

perform GECC decoding. Prior to decoding, the requests are enqueued in a request

queue 2 .

Figure 4.3: An illustration of the proposed Sanitizer architecture. [Per-channel request queue, scrub queue, arbiter, data buffer, GECC cache, and recently scrubbed table, together with a scrub generator and global and local ECC logic shared among all channels; numbered labels 1 through 9 mark the flow from LLC requests to memory commands and back to the LLC.]

A DDR3 controller services the memory requests, and after receiving the

corresponding data from memory, the controller uses either the LECC or the GECC

decoder for error correction 3 .

Every write requires updating both the LECC and the GECC bits. To reduce

the number of updates to the GECC, Sanitizer employs a GECC cache that stores a

limited number of recently updated GECC bits. On every write, the RST is searched

first 1 . A hit in the RST indicates that the write request can benefit from a fast

block access via LECC; therefore, the old data block is read from main memory via

a read request 2 . Next, the GECC cache is searched for the relevant GECC bits 4 .


If the GECC is found in the cache, it is overwritten with the new GECC bits 5 ;

otherwise, the old GECC bits are retrieved from main memory, updated, and placed

in the GECC cache. The GECC cache implements a write-back policy to write the

updated GECC bits to main memory 7 .

Sanitizer determines the memory locations to be scrubbed based on an epoch-

based runtime algorithm. A scrubbing epoch is a window of time whose precise du-

ration is computed as region size / (channel capacity × scrubbing frequency), which ranges from 2 µs to 10 µs.

Sanitizer determines the minimum scrubbing rate based on the ECC strength and the

error rate. The number of RST hits and misses during the current epoch are tracked

in separate counters. At the beginning of each scrubbing epoch, a scrub generator

consults these counters to determine the new memory regions to be scrubbed 8 .
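As a quick sanity check on the epoch length quoted above (assuming the 36GB-per-channel configuration evaluated later and patrol rates on the order of 0.01 Hz to 0.05 Hz):

def epoch_seconds(region_bytes, channel_bytes, scrub_hz):
    # Epoch duration = region size / (channel capacity x scrubbing frequency).
    return region_bytes / (channel_bytes * scrub_hz)

for hz in (0.01, 0.05):
    print(epoch_seconds(4 * 1024, 36 * 1024 ** 3, hz))   # about 10.6 us and 2.1 us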

4.2.1 Scheduling Scrub Operations

Unlike DRAM refresh, a scrub operation can be decomposed into accesses to

global codewords that can be scheduled through fine-grained DDR3 commands.2 On

the one hand, this creates new opportunities for more efficient command scheduling;

on the other hand, it necessitates a complex command scheduler. Sanitizer alleviates

this complexity by decoupling scrub scheduling from DDR3 command scheduling. As

shown in Figure 4.3, every scrub operation is scheduled from a scrub queue. The

requests in the scrub queue are either issued by a scrub generator, or are evicted from

2Due to the large recovery time in DRAM, ultra fine-grained refresh is not beneficial [8].


the global ECC cache. Figure 4.4 depicts an entry of the scrub queue, comprising

(1) a valid bit that indicates the scrub transaction is in progress, (2) a ready flag

indicating that the next required DDR3 command can be issued to memory without

violating any DDR3 timing constraints, (3) the number of remaining reads required to

complete fetching the global codeword, (4) the number of remaining writes required

to finish updating the global codeword, (5) a current operation flag indicating the

next command to be issued to main memory, (6) an actual operation flag that shows

the access type of the original request, and (7) address bits pointing to the global

codeword in main memory.

Figure 4.4: An illustrative example of a scrub queue entry. [Fields: valid, ready, # reads, # writes, current op, actual op, address.]

Using these flags, every scrub operation completes the following steps before leav-

ing the scrub queue: (1) fetch all of the required data blocks from memory and place

them in a data buffer; (2) send a check request to the ECC hardware and wait until

the ECC check is complete; and (3) if the check fails, correct and update the codeword

using global ECC.
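The entry fields and the three steps can be pictured with the small sketch below; the field names mirror Figure 4.4, but the class itself is an illustrative software model, not the controller's RTL, and the handling of the ECC check (step (2)) is assumed to happen elsewhere.

from dataclasses import dataclass

@dataclass
class ScrubQueueEntry:
    valid: bool        # scrub transaction in progress
    ready: bool        # next DDR3 command can issue without violating timing
    reads_left: int    # blocks still to be fetched for the global codeword
    writes_left: int   # blocks still to be written back after correction
    current_op: str    # next command to issue to main memory ("read" or "write")
    actual_op: str     # access type of the original request
    address: int       # address of the global codeword

    def issue_command(self, needs_correction):
        # Step (1): fetch the remaining data blocks into the data buffer.
        if self.reads_left > 0:
            self.reads_left -= 1
        # Step (3): after the ECC check (step (2), done elsewhere), write back
        # corrected blocks only if the check failed.
        elif needs_correction and self.writes_left > 0:
            self.writes_left -= 1
        else:
            self.valid = False   # the entry can leave the scrub queue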

As shown in Figure 4.3, an arbiter selects the DDR3 commands from the request

and scrub queues. When scheduling the scrub operations, the arbiter implements a

scheduling policy similar to the defer-until-empty (DUE) policy [82], which was orig-

inally proposed for lowering DRAM refresh overheads. By default, memory requests


are prioritized over scrub operations unless the number of deferred scrub operations

exceeds a threshold. 3 When the threshold is reached, scrub operations are prioritized

over memory requests until the scrub queue is empty. Sanitizer allows data forward-

ing from a recently scrubbed block in the data buffer to read requests in the request

queue.
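A minimal sketch of this defer-until-empty-style arbitration follows; the queues are modeled as plain lists and the state dictionary (a threshold and a draining flag) is our own bookkeeping convention, not the controller's implementation.

def pick_next_command(request_queue, scrub_queue, state):
    # Demand requests win by default; once the number of deferred scrubs reaches
    # the threshold (half the scrub queue capacity, per the footnote below),
    # scrub operations drain until the scrub queue is empty.
    if len(scrub_queue) >= state["threshold"]:
        state["draining"] = True
    if state["draining"]:
        if scrub_queue:
            return scrub_queue.pop(0)
        state["draining"] = False
    if request_queue:
        return request_queue.pop(0)
    return scrub_queue.pop(0) if scrub_queue else None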

The key to designing an efficient, fine-grained scrub scheduler is to issue scrub

requests to the memory controller at a rate slightly above the minimum scrubbing

frequency, thereby allowing sufficient slack for the controller to schedule scrub accesses

to maximize performance. For example, decreasing the duration of the scrubbing in-

terval by 2× makes it safe to schedule an individual scrub operation at any time

within the interval. However, a highly overprovisioned scrubbing rate hurts both

performance and energy. The solution that Sanitizer adopts is to incorporate a san-

itizer scrubber on top of a patrol scrubber with a slightly increased scrubbing rate:

the patrol scrubber linearly scans the physical address space to ensure that all of

the memory locations are scrubbed before a scrubbing deadline is violated; the sani-

tizer scrubber, as a result, can freely schedule extra scrub operations to any memory

location to improve performance.

3The threshold is set to half of the queue size, and the scrub frequency is sufficiently overprovisioned to ensure that no timing violations can occur due to postponed scrub operations.
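The division of labor between the two scrubbers can be pictured with the toy model below; how the sanitizer scrubber picks its regions is described in Section 4.2.2.3, and the interfaces here are illustrative only.

class PatrolScrubber:
    # Walks the physical address space linearly so every region meets its deadline.
    def __init__(self, num_regions):
        self.num_regions = num_regions
        self.cursor = 0

    def next_region(self):
        region = self.cursor
        self.cursor = (self.cursor + 1) % self.num_regions
        return region

def regions_for_epoch(patrol, predicted_regions, max_regions):
    # The sanitizer scrubber adds predicted regions on top of the mandatory
    # patrol region, up to the per-epoch limit.
    return [patrol.next_region()] + list(predicted_regions)[:max_regions - 1]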


4.2.2 Reducing the Read Overhead

Reducing the read overhead requires scheduling scrub operations in a timely fash-

ion so that most of the ordinary requests hit in the RST, and hence can be handled

using the local ECC.

4.2.2.1 Local ECC

Sanitizer employs a two-level hierarchical ECC. A codeword comprising multiple

blocks is protected by a strong, BCH based global ECC. In addition, each data block

is protected by a fast, local ECC. For a specified error rate, a stronger local ECC

can prolong the expiration time of a block—the time after which memory accesses

can no longer avoid using the global ECC. The expiration time is set to minimize

the probability that the number of errors in a cache block exceeds the protection

capability of the local ECC (probability < 10−15). (When calculating the system

FIT rate, we take into account both the local and the global ECC failures.) In order

to increase the local ECC protection strength with an acceptable storage overhead,

Sanitizer leverages the SECDED code, which can be configured either to correct one

error and detect two, or to detect three errors and correct none. Sanitizer is based

on the latter configuration. The local ECC adds an extra storage overhead of 11 bits

to a 64B cache block.
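One way to see where an expiration time of a few tens of milliseconds comes from is to bound the probability of more than three errors in a 523-bit local codeword (512 data bits plus 11 SECDED bits) by 10^-15, keeping only the dominant first term of the binomial tail. The loop below is our back-of-the-envelope reading of that constraint under the Table 4.1 raw BER, not the dissertation's exact derivation.

from math import comb

RAW_BER_PER_S = 3.4e-5       # Table 4.1 conditions
CODEWORD_BITS = 523          # 512 data bits + 11 local SECDED bits
DETECTABLE = 3

def p_exceeds_detection(t_seconds):
    p = RAW_BER_PER_S * t_seconds
    # Dominant term: exactly (DETECTABLE + 1) errors in the local codeword.
    return comb(CODEWORD_BITS, DETECTABLE + 1) * p ** (DETECTABLE + 1)

t_ms = 0
while p_exceeds_detection((t_ms + 1) / 1000.0) < 1e-15:
    t_ms += 1
print(t_ms)   # 22 ms under these assumptions, inside the 10-50 ms range cited in Section 4.2.2.2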


4.2.2.2 Recently Scrubbed Table

The RST is used to record memory regions that can be checked using the lo-

cal ECC. Memory locations that have recently been scrubbed by the in-order patrol

scrubber do not need to be added to the RST. Instead, two address comparators are

sufficient to delineate the boundaries of the regions which the patrol scrubber has re-

cently visited. Memory locations that are scrubbed out-of-order need to be recorded

in the RST. To keep the hardware overhead low, each entry of the RST represents a

4KB memory region. The RST is implemented as a set-associative cache to strike a

balance between performance and energy. (A sensitivity study on the RST param-

eters is presented in Section 4.4.4.3.) As shown in Figure 4.5, every RST entry has

a region identifier (RID), a counter (Cnt) recording the number of hits to the corre-

sponding region within the current epoch, and a time stamp (Time) that records the

expiration time.4

Every RST entry has to expire after a fixed expiration time (in the range of

10 − 50ms), which is determined by the thermal stability factor, the local ECC

strength, and the reliability target. A circular counter generates a time stamp for

each new region added to the RST. For example, when region D is added (Figure 4.5),

a counter value of six is recorded as its time stamp. The counter is incremented by

one at the end of every scrubbing epoch, and is reset to zero when it reaches the

4Each entry also has a valid bit and a scrubbing direction bit, which are omitted in the figure for simplicity.

expiration time.

Figure 4.5: An illustrative example of the operations in a four-entry RST with an expiration time of seven. [(a) Add a region; (b) evict two regions and add two new ones; (c) expire a region. Each entry holds a region identifier (RID), a hit counter (Cnt), and a time stamp (Time) supplied by a circular counter.]

All of the entries whose time stamps match the counter are evicted,

after which any new entries are added. In Figure 4.5 (c), region A is evicted because

its time stamp matches the counter.

The recently scrubbed regions might not all fit in the RST. If a particular set of

the RST is full, the entry with the lowest hit count is evicted, which is accomplished

by comparing all of the counters in the same set using comparators organized in a

tree topology. All of the hit counters are reset to zeroes at the beginning of each


scrubbing epoch to adapt to application phase behavior.
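The bookkeeping just described can be summarized with a small software model; the set indexing by RID, the dictionary storage, and the wrap-around of the circular counter are our assumptions for illustration rather than details taken from the hardware design.

class RecentlyScrubbedTable:
    def __init__(self, num_sets=4096, ways=4, expiration_epochs=7):
        self.sets = [dict() for _ in range(num_sets)]   # RID -> [hit count, time stamp]
        self.ways = ways
        self.expiration = expiration_epochs
        self.epoch_counter = 0                          # circular counter

    def lookup(self, rid):
        entry = self.sets[rid % len(self.sets)].get(rid)
        if entry is not None:
            entry[0] += 1                               # count hits within the current epoch
        return entry is not None

    def add(self, rid):
        s = self.sets[rid % len(self.sets)]
        if rid not in s and len(s) >= self.ways:
            victim = min(s, key=lambda r: s[r][0])      # evict the entry with the lowest hit count
            del s[victim]
        s[rid] = [0, self.epoch_counter]                # time stamp = current counter value

    def end_epoch(self):
        self.epoch_counter = (self.epoch_counter + 1) % (self.expiration + 1)
        for s in self.sets:
            for r in [r for r, (_, stamp) in s.items() if stamp == self.epoch_counter]:
                del s[r]                                # expire entries whose stamp matches
            for entry in s.values():
                entry[0] = 0                            # reset hit counters every epoch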

4.2.2.3 Scrub Generator

At the end of each scrubbing epoch, the scrub generator decides which memory

regions to scrub next. For the patrol scrubber, the region ID is incremented by one to

generate the next region. For the sanitizer scrubber, a missed region table (MRT) is

used to record the misses in the RST. The regions to be scrubbed next are determined

by inspecting the MRT and the RST at the end of a scrubbing epoch (Figure 4.6).

The MRT estimates the regions with frequent RST misses using a sticky sam-

pling algorithm. Every entry in the MRT represents a contiguous 4KB region of the

physical address space, and comprises (1) the address of the last read or write to the

represented region, (2) a valid bit indicating that the entry is in use, (3) an access

counter recording the number of accesses to the region, (4) a sticky counter used

to avoid evicting the entry before it collects sufficient statistics, and (5) a direction

counter to predict the scrubbing direction. On an MRT access, if the accessed region

already exists in the MRT, the access counter of that region is incremented by one;

otherwise, a new entry is added with the sticky counter set to all ones. All of the

non-zero sticky counters are decremented by one every time the MRT is accessed.

When the MRT is full, the following steps are required to decide whether a new entry

can be inserted: (1) a pseudo-random number R is generated by a linear-feedback

shift register (LFSR), and (2) R is compared to the access counter of the least frequently accessed non-sticky5 entry; if R is greater than or equal to the value of the access counter, the entry is replaced by the new one.

Figure 4.6: An illustrative example of generating a maximum of three scrubbing regions using a direction threshold equal to eight. [The scrub generator inspects the missed region table and the recently scrubbed table (entries record a RID, a hit count, and a direction flag) to select the regions to be scrubbed in the next epoch.]

The MRT tracks whether the

accesses to a given region are in ascending or descending order using the direction

counter, which is a saturating up/down counter. On every access to a valid MRT

entry, the previous address stored in the entry is compared to the new address. If the

new address is greater than the previous one, the direction counter is incremented;

otherwise, the counter is decremented.
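A software rendering of this insertion policy is sketched below; the sticky-counter width and the use of Python's random module in place of the hardware LFSR are illustrative assumptions.

import random

class MissedRegionTable:
    def __init__(self, capacity=64, sticky_init=7):
        self.capacity = capacity
        self.sticky_init = sticky_init
        self.entries = {}   # RID -> {"access": n, "sticky": n, "last_addr": a, "direction": n}

    def record_access(self, rid, addr):
        for e in self.entries.values():                 # all non-zero sticky counters decay
            if e["sticky"] > 0:
                e["sticky"] -= 1
        e = self.entries.get(rid)
        if e is not None:
            e["access"] += 1
            e["direction"] += 1 if addr > e["last_addr"] else -1   # saturating in hardware
            e["last_addr"] = addr
            return
        if len(self.entries) >= self.capacity:
            candidates = [r for r, v in self.entries.items() if v["sticky"] == 0]
            if not candidates:
                return                                  # nothing is evictable yet
            victim = min(candidates, key=lambda r: self.entries[r]["access"])
            if random.getrandbits(8) < self.entries[victim]["access"]:
                return                                  # R below the access count: keep the old entry
            del self.entries[victim]
        self.entries[rid] = {"access": 1, "sticky": self.sticky_init,
                             "last_addr": addr, "direction": 0}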

At the end of every epoch, the scrub generator needs to accomplish two tasks: (1)

to determine the maximum number of regions to be scrubbed for the next epoch, and

(2) to select the memory regions to be scrubbed by inspecting the MRT and the RST.

We observed that the maximum number of regions to be scrubbed in each epoch is a

5An MRT entry becomes non-sticky when its sticky counter equals zero.


parameter critical to performance. If too many regions are scrubbed within a fixed-

length epoch, the scrubbing overhead becomes too high and significantly degrades

performance. However, scrubbing too few regions in an epoch results in higher RST

miss rates, and ultimately a greater number of cases where the expensive global ECC

rather than the cheaper, local ECC must be used. The scrub generator determines

the maximum number of regions to be scrubbed based on the RST miss rate during

each scrubbing epoch. Two counters in the RST track the total number of accesses

and the total number of misses. At the end of each epoch, the counters are used to

compute the miss rate. The maximum number of regions to be scrubbed during the

next epoch is determined by comparing the miss rate to four predefined thresholds.

Adapting the scrubbing rate to the RST miss rate allows a high scrubbing rate at the

beginning of a burst of memory accesses, and a low scrubbing rate when most of the

memory regions recently have been scrubbed.

The scrub generator prioritizes the MRT entries over RST entries when selecting

the memory regions to be scrubbed. This is because the missed regions provide higher prediction accuracy. The following rules are followed when selecting a memory region:

(1) no duplicates are allowed in the RST, and (2) the number of newly generated

regions is not allowed to exceed an upper bound.

The scrub generator uses the region ID of the most frequently accessed MRT

entry to scrub in the next epoch. When this region is scrubbed, its region ID and a


direction flag (computed based on the direction counter) are recorded in the RST.6

To select a region based on the RST, the scrub generator computes a new region

ID according to the current region ID and the direction flag of the most frequently

accessed entry. If the flag indicates a forward direction, the closest ascending region is

selected; otherwise, the closest descending region is scrubbed. As shown in Figure 4.6,

region D in the RST is frequently accessed during the current epoch; therefore, the

scrub generator selects one of its neighboring regions (i.e., C or E) to be scrubbed in

the next epoch. In this example, due to the forward scrubbing flag of D, region E is

selected for scrubbing.

4.2.3 Reducing the Write Overhead

In a memory system protected by large BCH codewords, a write generates more

traffic than a read. On every write, an entire local codeword, as well as the global

ECC bits, need to be updated. These updates require generating new local and global

ECC bits for the corresponding blocks. Therefore, all of the data blocks that are part

of the same global codeword must be present at the memory controller before a write

can complete, which creates extra memory traffic and degrades the overall bandwidth

efficiency. Sanitizer significantly reduces these overheads by (1) eliminating the need

for fetching the entire global codeword by generating differential global ECCs, (2)

6The scrubbing flag is set to backward if the scrubbing direction counter is below a predefined threshold; otherwise, it is set to forward.


adopting a careful data layout that allows for parallel access to global ECC bits, and

(3) eliminating most of the read accesses by caching global ECCs at the memory

controller.

4.2.3.1 Global ECC Cache

Writes are optimized by caching the global ECC bits. Our experiments show that

92% of the writes are to previously updated global codewords. Sanitizer exploits

this phenomenon by adding a 256-entry, 16-way set associative SRAM cache to each

memory channel. Every cache entry contains a valid bit, tag bits, global ECC bits,

and flag bits for implementing the least recently used (LRU) replacement policy.

4.2.3.2 Global ECC Update

Figure 4.7 shows an example application of Sanitizer to a conventional nine-chip

DIMM.7 A global codeword comprising four data blocks A, B, C, and D is stored

in memory. A block is spread across the nine chips; it consists of a local codeword

(comprising 512 data bits and 11 local ECC bits), and a part of the global ECC bits.

Using a single block access, the memory controller can read or update an entire local

codeword; however, accessing a global codeword requires multiple reads and writes.

To update the global codeword, all of the four blocks (i.e., A, B, C, and D)

must be read from memory. Then, a new GECC is written to memory via multiple

7All of the chips are ×8 and transfer the data in bursts of eight beats.

accesses.

Figure 4.7: An illustrative example of the proposed memory layout for a four-block codeword. [(a) Chip organization: data blocks A, B, C, and D and their global ECC bits are distributed across the nine chips of the DIMM, with parts of each block shifted so that ECCA, ECCB, ECCC, and ECCD reside on different chips; (b) data selection at the memory controller.]

Sanitizer eliminates the block reads by performing a differential update to

global codewords. For instance, a write to block A requires the following steps: (1)

the old contents of A are retrieved from the memory by a read access, (2) a differential

global codeword is formed by computing the bitwise XOR between the old and new

contents of A, (3) a parity matrix is used to generate the differential ECC bits used

for updating the global ECC, (4) the old global ECC bits are read from the memory,

(5) the new global ECC bits are generated by XORing the differential ECC and the

old global ECC bits, (6) the new value of A and the updated local ECC are written

back to the memory in one write access, and (7) the newly generated global ECC bits

are written to the GECC cache.
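Because BCH check bits are a linear function of the data over GF(2), the update can be phrased as an XOR of ECC differences. The helper below sketches steps (2), (3), and (5); encode_gecc stands in for the parity-matrix multiplication, and its exact form, as well as the handling of the block's position inside the global codeword, is abstracted away here.

def differential_gecc_update(old_block, new_block, old_gecc, encode_gecc):
    # Step (2): differential codeword = old XOR new (unchanged blocks contribute zeros).
    diff = bytes(a ^ b for a, b in zip(old_block, new_block))
    # Step (3): differential ECC bits from the parity matrix.
    diff_ecc = encode_gecc(diff)
    # Step (5): new global ECC = old global ECC XOR differential ECC.
    return bytes(a ^ b for a, b in zip(old_gecc, diff_ecc))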


When a global ECC is evicted from the GECC cache, Sanitizer performs a fast

update to the global ECC bits in main memory by leveraging an optimized data

layout. As shown in Figure 4.7 (a), parts of each block are shifted to ensure that the

ECC bits of a global codeword are spread across the chips. (For example, B7 is shifted

right by one chip and ECCB is stored in chip 7.) Moreover, every chip supports a

base and offset addressing mode, where the base is the block address and the offset

is either zero or the chip ID. A simplified crossbar at the memory controller ensures

the right order of bits for both the local and the global codewords (Figure 4.7 (b)).

4.2.4 Support for Chipkill ECC

The goal of chipkill-level error protection is to recover the data from a failed

chip. In addition to pin failures, chipkill can protect against a burst of errors due to

wordline, bitline, or interconnect wire failures. As explained in Section 4.1.1, multi-

bit symbol codes [15, 64] are optimized for bursty errors. For example, a commercial

chipkill ECC [86] can protect against the failure of a ×4 chip by adding four check

symbols to 32 data symbols, where each symbol consists of four bits, the block size

is 128B, and the burst length is eight. When both random and bursty errors are

prevalent, two ECCs can be concatenated: one code (e.g., BCH) protects against

random errors; the other code (e.g., a symbol code) protects against bursty errors.

An example of combining Sanitizer with a single-symbol correction double-symbol


detection (SSCDSD) ECC [15, 86] is shown in Figure 4.8. For each group of 128

data bits, a subset of the Sanitizer GECC bits (BCH ECC) are appended to the data bits, and the SSCDSD ECC bits are computed by treating the BCH ECC bits as data.

Figure 4.8: Illustrative example of supporting chipkill ECC. [Each block, Block0 through Blockn, is divided into groups of 4×32 data bits followed by the corresponding GECC bits and 4×4 bits of chipkill check symbols (e.g., Data01, GECC01, chipkill).]

For example, Data01 and GECC01 are together protected by four four-bit

redundant symbols against chip failures. Note that a four-check SSCDSD code with

four-bit symbols can protect codewords up to 256 symbols [15]. The local ECC of

Sanitizer can be replaced by the chipkill ECC because the correction capability of

the SSCDSD code is strictly greater than that of the SECDED code. The failure

rates due to bursty errors reported in field studies range from 22 to 33 FIT per

chip [74,76,77]. Assuming a 27.5 FIT chip failure rate, the SSCDSD code can reduce

the failure rate of a DRAM system by a factor of 1.2×10^7. Table 4.2 reports the


patrol scrubbing rates for an STT-MRAM system with both the SSCDSD code and

Sanitizer, configured to achieve the same failure rate as a DRAM system protected

only by the SSCDSD code. The configuration of 18.75% storage overhead adds three

×4 ECC chips to every 16 data chips to hold both Sanitizer and the chipkill ECC; in contrast, the 25% storage overhead setting adds four chips to every 16 chips.

                          2 Blocks    4 Blocks    8 Blocks
18.75% Storage Overhead   0.095 Hz    0.048 Hz    0.027 Hz
25% Storage Overhead      0.026 Hz    0.014 Hz    0.010 Hz

Table 4.2: Required patrol scrubbing rates for combining Sanitizer with chipkill.

The

storage overhead for chipkill ECC is fixed at 12.5% for all of the configurations. Note

that the 1-block configurations do not need the ECC hierarchy in Sanitizer; however

their high scrubbing rates result in significant system performance overheads.

4.3 Experimental Setup

This section presents the experimental methodology used to evaluate Sanitizer.

Architecture-level simulations are conducted to model the behavior of the proposed

system. Circuit-level tools and simulators are used to evaluate the area, latency, and

power overheads of Sanitizer. We evaluate a Sanitizer-enabled system with twenty-

two applications.


4.3.1 Architecture

We use the SESC simulator [65] to model a 4GHz, eight-core out-of-order pro-

cessor. A 144GB main memory comprising DDR3-2133 compatible STT-MRAM

modules is evaluated. Detailed architecture-level parameters are listed in Table 4.3.

Notably, the precharge time (tRP) has a value lower than the corresponding DRAM

timing constraint since STT-MRAM does not precharge the bitlines during the precharge

operation. The write recovery time (tWR) is higher than the DRAM timing due to

the additional switching latency required by the STT-MRAM cells.

We use McPAT [50] to evaluate the area and power of individual components

of the processor. We use Cacti-3DD [17] to simulate the area, power, and access

latency of STT-MRAM based main memory and the storage structures associated

with Sanitizer, including the global ECC cache, the scrub queue, the recently scrubbed

table, and the missed region table (Section 4.2). Logic and memories are modeled

based on 22nm technology using parameters from ITRS 2013 [39]. STT-MRAM

specific parameters are listed in Table 4.4.

We consider various ECC codeword lengths that maintain approximately the same

ECC storage overhead (all under 12.5%). Table 4.5 shows the ECC capability and

the associated storage overheads for each coding scheme. The numbers in the top row

indicate the number of cache blocks that are considered part of a single codeword. For

instance, in base-2, two cache blocks—a total of 1024 bits—are guarded by a global

ECC.

Processor Parameters
Technology: 22nm
Frequency: 4.0 GHz
Number of cores: 8
Fetch/issue/commit width: 4/4/4
Int/FP/LdSt/Br units: 2/2/1/2
Int/FP Multiplier: 1/1
Int/FP IssueQ entries: 32/32
loadQ/storeQ/ROB entries: 24/24/96
Int/FP registers: 96/96
Branch predictor: Hybrid
Local/global/meta tables: 2K/2K/8K
BTB/RAS entries: 4K/32
IL1 cache (private): 32KB, direct-mapped, 64B block, 1-cycle hit time
DL1 cache (private): 32KB, 4-way, LRU, 64B block, 2-cycle hit time
Cache coherence: MESI protocol
L2 cache (shared): 8MB, 8-way, LRU, 64B block, 16-cycle hit time

Memory Controller Parameters
Address mapping: page interleaving
Scheduling policy: FR-FCFS
Request queue: 64 entries
Scrub queue: 32 entries
Recently scrubbed table: 8-way, 16K entries
Missed region table: 64 entries
GECC cache: 16-way, 256 entries

DDR3-2133 STT-MRAM memory system - 144 GB total capacity
Technology: 22nm
Frequency: 1066 MHz
Chip capacity: 16 Gb
Number of chips per rank: 9
Number of ranks per channel: 2
Number of channels: 4
Row buffer size: 8 KB
Timing (memory cycles): tRCD: 14, tCL: 14, tRP: 1, tRAS: 36, tRC: 37, tBURST: 4, tCCD: 4, tWTR: 8, tWR: 22, tRTP: 8, tRRD: 6, tFAW: 27

Table 4.3: System architecture and core parameters.

Area    Read current    Switching current    Switching latency    Switching energy
6 F2    10 µA           35 µA                6.5 ns               0.18 pJ

Table 4.4: STT-MRAM parameters at 22nm [16,39,85].

Under a fixed storage budget, increasing the length of the codeword brings the

benefit of a stronger ECC capability. For example, in base-2, 11 errors out of 1024

bits can be corrected, whereas in base-4, 21 errors anywhere within a group of 2048

bits can be corrected.

Denotation              base-2   base-4   base-8   sanitizer-4   sanitizer-8
Data bits               1024     2048     4096     2048          4096
LECC bits (per 64B)     0        0        0        11            11
GECC bits               122      253      508      205           417
LECC detectable bits    0        0        0        3             3
LECC correctable bits   0        0        0        0             0
GECC detectable bits    12       22       40       18            33
GECC correctable bits   11       21       39       17            32
ECC total overhead      11.9%    12.4%    12.4%    12.2%         12.3%

Table 4.5: Comparison of different ECC codeword sizes.

4.3.2 Circuits

We accurately evaluate the area, power, and latency for both the global and the

local ECC logic. The total number of gates (i.e., AND, OR, XOR, and DFF) in each

encoder and decoder unit is calculated to find the critical paths. The delay and power

consumption of each gate are evaluated via SPICE simulations at 22nm [97]. The

area is estimated based on the FreePDK45 [80] standard cells, and is scaled to 22nm.


To meet system throughput requirements, a parallel implementation with multiple

XOR trees (similar to [41]) is used to generate the local and global ECC check bits.

The design of the local and global BCH decoders is similar to prior work [81]. The

decoding process comprises three major steps [9]: (1) syndrome generation, which

reuses the XOR-tree architecture from BCH encoding; (2) finding an error-location

polynomial, which implements an iterative algorithm proposed by Strukov [81]; and

(3) finding error-location numbers using a serial implementation that alleviates the

area and power costs.

4.3.3 Applications

We evaluate a set of 22 benchmarks comprising six parallel applications from

SPLASH-2 [89] and SPEC OMP2001 [4], as well as 16 serial applications from SPEC2006 [79].

The parallel applications are simulated to completion. To reduce the execution time

of the serial applications, we use SimPoint [31] and determine a representative 100

million instruction region from each SPEC 2006 application.

4.4 Evaluation

We first evaluate the performance, energy, and area of Sanitizer. Next, we present

sensitivity studies, compare an STT-MRAM based main memory equipped with San-

itizer to a conventional DRAM system, and evaluate how Sanitizer stacks up against


a baseline STT-MRAM system that combines scrubbing with hierarchical ECC and

prefetching.

4.4.1 Performance

We study the performance of three baseline configurations and three Sanitizer

systems. Figure 4.10 compares the performance of the evaluated Sanitizer systems to

the best baseline configuration (base-4 ). Due to the additional read traffic for a GECC

check on every memory access, increasing the size of the GECC codeword results

in a performance degradation for the baseline systems. This performance penalty

effectively nullifies the benefits of using longer codewords to lower the scrubbing

rate. Consequently, base-4 outperforms base-8 and base-16 (Figure 4.9). Sanitizer

mitigates the undue data traffic by using the LECC on most of the memory accesses

(85% of the time, on average). The sanitizer-4, sanitizer-8, and sanitizer-16 systems

achieve, respectively, average speedups of 1.11×, 1.22×, and 1.14× over base-4. The

corresponding scrubbing rates are 0.098 Hz, 0.043 Hz, and 0.027 Hz.

Figure 4.9: Performance improvement analysis. [Performance normalized to base-4 for the baseline, read opt only, read opt & GECC$, and sanitizer configurations with 4-, 8-, and 16-block codewords.]

Figure 4.10: System performance comparison. [Speedup of sanitizer-4, sanitizer-8, and sanitizer-16 over the baseline (base-4) for each of the 22 benchmarks, together with the geometric mean.]

Figure 4.11: System energy comparison. [Energy of sanitizer-4, sanitizer-8, and sanitizer-16 normalized to the baseline (base-4) for each of the 22 benchmarks, together with the geometric mean.]

Figure 4.9 shows a breakdown of the performance improvements. The bars labeled

as “read opt only” represent the improvements achieved after adding the RST and

the MRT to reduce the read overheads (Section 4.2.2). The bars labeled as “read

opt & GECC$” represent the results of adding the GECC cache (Section 4.2.3.1)

on top of the RST and the MRT. Implementing the layout optimizations discussed

in Section 4.2.3.2 in addition to the read optimizations and the GECC cache gives

the full benefit of Sanitizer. The four-block configuration of read opt only exhibits

a small performance loss compared to baseline because Sanitizer requires a higher

scrubbing frequency. Read opt & GECC$ achieves average write traffic reductions

of 1.13× to 1.88× over read opt only.

4.4.2 Energy and Power

Figure 4.11 shows the end-to-end system energy. The baseline systems suffer from

two sources of energy inefficiency: (1) frequent scrubbing operations, and (2) exces-

sive memory traffic due to over-fetching. By addressing the over-fetching problem,

Sanitizer achieves lower energy consumption as compared to the most energy-efficient

baseline (base-4 ). Sanitizer-4, sanitizer-8, and sanitizer-16 respectively reduce the

system energy down to 93%, 78%, and 88% of base-4. This energy reduction is due to

two effects: (1) Sanitizer significantly reduces the data movement on memory reads

and writes, which results in lower energy; and (2) Sanitizer accelerates the execution


of the applications, which results in leakage energy savings. The energy breakdown

of the sanitizer-8 system is shown in Table 4.6.

Cores and caches   Memory controller   Main memory   Buses and interfaces   Sanitizer hardware
63.7%              7.4%                18.3%         7.9%                   2.7%

Table 4.6: Sanitizer-8 system energy breakdown.

Table 4.7 shows the peak dynamic power and the leakage power of Sanitizer.

The Sanitizer hardware consumes a peak power of 539.7 mW. The average power of

Sanitizer represents less than 3% of the total system power. The global and local

ECC hardware together constitute the major contributor to the power consumption

of Sanitizer (2.2% of the total system power); this is because of the high-performance

design choices that were made to achieve the required throughput.

(mW)      ECC Logic   Scrub Generator   RST    GECC Cache   Scrub Queue   Total
Dynamic   280.5       18.1              77.6   28.8         5.9           410.9
Leakage   98.8        0.9               12.3   14.3         2.5           128.8

Table 4.7: Peak dynamic power and leakage of Sanitizer components (eight-block configuration).

4.4.3 Area

The total area of the Sanitizer hardware corresponds to less than 1% of the pro-

cessor die area. Table 4.8 shows a breakdown of the area occupied by various system

components.

(mm2)   ECC Logic   Scrub Generator   RST    GECC Cache   Scrub Queue   Total
Area    0.41        0.002             0.12   0.12         0.004         0.66

Table 4.8: Area breakdown of the Sanitizer components.

4.4.4 Sensitivity Analysis

We study the sensitivity of Sanitizer to the raw bit error rate (BER), the memory

capacitybandwidth

ratio, and the RST parameters.

4.4.4.1 Raw BER

The raw BER has a profound effect on the required scrubbing frequency. Either a

low thermal stability factor (∆) or a high temperature can result in a high retention BER and a high scrubbing overhead (Section 4.1.2). The retention BER per second under different ∆ and temperature values is reported in Table 4.9.

Temperature (C)   ∆=37       ∆=36       ∆=35       ∆=34
45                2.5×10-6   6.0×10-6   1.4×10-5   3.4×10-5
55                6.7×10-6   1.6×10-5   3.6×10-5   8.4×10-5
65                1.7×10-5   3.8×10-5   8.7×10-5   2.0×10-4
75                4.1×10-5   9.0×10-5   2.0×10-4   4.4×10-4
85                9.3×10-5   2.0×10-4   4.4×10-4   9.5×10-4

Table 4.9: Raw retention BER per second (5% variation on ∆).

As shown in

Figure 4.12, Sanitizer significantly improves the performance when the raw BER per

second is between 10-5 and 2×10-4 (marked in bold in Table 4.9). If the raw BER is

less than 10-5, the scrubbing overhead of a baseline system with a single 64B block is

low, and Sanitizer does not exhibit significant potential. When the raw BER exceeds

2×10-4, both the baseline and the Sanitizer systems require the ECC codeword to span more than 16 blocks, which results in significant area and power overheads due to the increased complexity of the ECC logic.

Figure 4.12: System performance with different raw BERs. [Speedup over the best baseline with the corresponding raw BER, plotted against the raw BER per second (up to 2×10-4).]

4.4.4.2 Sensitivity to the Capacity/Bandwidth Ratio

The capacity of a memory channel determines the minimum amount of data that

must be scanned during scrubbing. Figure 4.13 shows the increase in the memory

traffic when increasing the memory capacity per channel from 36 GB to 72 GB.

Figure 4.13: Memory traffic of systems with 72GB per channel. [Memory traffic of base-8, base-16, sanitizer-8, and sanitizer-16, normalized to base-8 with 36GB channel capacity.]

Sanitizer is effective in suppressing the memory traffic and reducing the number of

blocked reads and writes, which results in average speedups of 1.40× to 1.42× over

base-8. Sanitizer outperform the baseline systems by greater margins as the capacitybandwidth

ratio increases.

4.4.4.3 RST Parameters

An ideal RST should be able to track information on every memory region until

the region expires. However, this capability would require a fully associative RST with

up to 80K entries, which would consume excessive power. Figure 4.14 compares the

performance of set associative RSTs to an ideal RST for sanitizer-8.

Figure 4.14: Performance impact of RST size and associativity. [Performance of 4K-, 8K-, and 16K-entry RSTs with 4, 8, and 16 ways, normalized to an ideal RST.]

The RST size has

a larger impact on the performance than the RST associativity does for the evaluated

set of benchmarks. We choose the 4-way, 16K entry RST because (1) at most four

entries can be added into the RST during every epoch; and (2) the performance of a

16K RST is close to the performance of an ideal RST, as shown in Figure 4.14.


4.4.4.4 LLC Size

The size of the last level cache affects the number of data requests sent to the

memory system. Figure 4.15 shows the geometric means of the performance achieved

by the baseline and Sanitizer configurations with different LLC sizes, averaged over

the 22 benchmarks. As one would expect, Sanitizer achieves a greater improvement

over the baseline when the LLC size is small and the off-chip traffic is heavy.

Figure 4.15: Performance comparisons with different LLC size. [Geometric-mean performance of base-4, base-8, base-16, sanitizer-4, sanitizer-8, and sanitizer-16 with 4MB, 8MB, and 16MB LLCs, normalized to base-4 with a 4MB LLC.]

4.4.5 Comparison to Hierarchical ECC Combined with Prefetching

We would like to analyze whether the performance of Sanitizer can be matched by

a straightforward combination of two existing ideas: (1) prefetching, and (2) an exten-

sion of the recently proposed non-uniform access time DRAM controller (NUAT) [71]

to STT-MRAM. Sanitizer anticipates future memory accesses and scrubs the memory

regions in advance; this is analogous to prefetching, in which future memory accesses

are predicted and the data are speculatively loaded into the last level cache. Sanitizer

leverages hierarchical ECC to allow low-overhead accesses to the recently scrubbed memory regions; NUAT [71] provides faster reads from DRAM cells that have been refreshed recently by remembering the last time the data were refreshed.

Figure 4.16: Comparison to hierarchical ECC and data prefetching. [Performance normalized to base-4 for base-4 with hierarchical ECC, base-4 with hierarchical ECC and a prefetcher, and sanitizer-8 with a prefetcher, across the 22 benchmarks with the geometric mean.]

We evaluate the performance of three systems to compare Sanitizer against this related work (Figure 4.16): (1) a base-4 system with hierarchical ECC that remembers

the recently scrubbed memory regions (similar to how NUAT remembers recently

refreshed DRAM locations) and allows low-overhead accesses to these regions; (2) a

base-4 system with hierarchical ECC and a prefetcher, which scrubs the prefetched

memory locations; and (3) a sanitizer-8 system with the same prefetcher. The first

system, which relies on a hierarchical ECC, can degrade performance. This is because

adding local ECCs under a fixed storage budget will reduce the strength of the

global ECC, requiring more frequent scrubbing to achieve the same reliability target.

We conduct a design space exploration of stream prefetchers with different parameter

settings [78], and report the prefetcher that achieves the highest average speedup on

the evaluated benchmarks. The prefetched data also is scrubbed and recorded. Using

a prefetcher on top of hierarchical ECC does not achieve the same benefit as Sanitizer

due to two reasons: (1) the aggressiveness of a prefetcher is restricted by the last level

cache capacity, whereas the predictive scrubs issued by Sanitizer do not require any

storage in the last level cache; and (2) hierarchical ECC and prefetching reduce only

the read overhead, whereas Sanitizer applies write and data layout optimizations


(Section 4.2.3) to further reduce the bandwidth overhead. The sanitizer-8 system with the prefetcher outperforms a base-4 system that uses hierarchical ECC and scrubs the prefetched memory locations by 21%.
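
The reason hierarchical ECC alone can hurt performance is a simple budgeting argument: check bits spent on local codes are no longer available to the global code. The back-of-the-envelope sketch below (Python; all parameters are hypothetical and chosen only for illustration, not the configurations evaluated above) uses the standard rule of thumb that a binary BCH code spends roughly m = ceil(log2(n + 1)) check bits per correctable error, where n is the codeword length.

import math

def bch_t(data_bits, check_bits):
    # Approximate error-correction capability of a binary BCH code:
    # roughly m check bits are needed per corrected error, where
    # m = ceil(log2(codeword length + 1)).
    m = math.ceil(math.log2(data_bits + check_bits + 1))
    return check_bits // m

def secded_bits(data_bits):
    # Check bits for a SECDED (extended Hamming) code over data_bits.
    r = 1
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1   # one extra parity bit for double-error detection

DATA_BITS = 4 * 512            # hypothetical codeword spanning four 64B blocks
BUDGET = DATA_BITS // 8        # 12.5% overhead, comparable to SECDED

t_global_only = bch_t(DATA_BITS, BUDGET)            # -> about 21
local = 4 * secded_bits(512)                        # one local SECDED per block
t_with_local = bch_t(DATA_BITS, BUDGET - local)     # -> about 17

print(t_global_only, t_with_local)

With fewer correctable errors per codeword, the global code must be scrubbed more often to meet the same uncorrectable-error target, which is the effect observed for the first system above.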

4.4.6 Comparison to DRAM

We compare an STT-MRAM based main memory with Sanitizer to a DRAM-

based system. Sanitizer closes the performance gap between STT-MRAM and DRAM

to 6% in a four-channel, two-rank-per-channel system. Figure 4.17 shows a sensitivity

study on the number of channels, in which all of the configurations have two ranks

per channel, and all of the ranks have an 18GB capacity.

Figure 4.17: Performance and system energy normalized to single-channel DRAM, varying the number of channels (configurations: DRAM, sanitizer-4, sanitizer-8, and sanitizer-16).

The performance gap between Sanitizer and DRAM is more pronounced for the 1-channel systems than it is

for the 4-channel ones. This is because a scrub operation blocks the entire channel,

whereas a refresh operation blocks only one rank. Despite the performance penalty,


the 4-channel Sanitizer systems achieve systematic energy reductions as compared to

the 4-channel DRAM system. The energy efficiency is due to three effects: (1) STT-

MRAM cells do not consume leakage energy; (2) reading an STT-MRAM cell requires

less current than reading a DRAM cell, which translates into a lower activation en-

ergy; and (3) STT-MRAM has a reduced precharge energy compared to DRAM since

precharging the bitlines is not required. Sanitizer achieves greater energy reduction

over DRAM as the number of channels is increased. This is because Sanitizer can

save more leakage energy in systems with higher memory capacity.
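
A simple first-order model (not the detailed energy model used for the results above) makes these trends explicit. Writing N_ranks for the number of ranks, T for the execution time, and N_acc for the number of memory accesses,

E_DRAM ≈ N_ranks · P_leak · T + N_acc · (E_act,DRAM + E_pre) + E_refresh,
E_STT-MRAM ≈ N_acc · E_act,STT + E_scrub,

where E_act,STT < E_act,DRAM because of the lower read current and E_pre is absent for STT-MRAM. The leakage and refresh terms grow with the number of ranks (i.e., with capacity), whereas the per-access terms do not; consequently, the relative savings of Sanitizer over DRAM widen as channels and ranks are added, matching the trend in Figure 4.17.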

4.5 Summary

Sanitizer is a new error protection mechanism that uses strong ECCs for an STT-

MRAM based memory system. To amortize the high storage overhead of a strong

ECC, Sanitizer applies BCH codes to codewords spanning multiple memory blocks.

The storage overhead is kept comparable to that of the commonly used SECDED

ECC. A hierarchical ECC structure and novel control mechanism allow for efficient

protection against errors. A global ECC is used to periodically scrub the memory,

while a majority of the memory accesses are satisfied by a low-overhead, local ECC.

Unlike conventional memory scrubbing mechanisms, Sanitizer employs a novel pre-

diction mechanism to remove errors from memory blocks prior to reads and writes.


This enables fast and low-energy accesses to clean memory locations. When com-

pared to a conventional scrubbing mechanism, the result is a 1.22× improvement in

overall system performance and a 22% reduction in system energy with a less than 1%

increase in the processor die area. As technology moves from DRAM to non-volatile

memories such as STT-MRAM, where random errors become more critical, Sanitizer

will play a key role in mitigating the impact of expensive ECC checks.
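
The control flow of this hierarchical scheme can be summarized by the following sketch (Python; the class, method, and stub names are hypothetical abstractions of the controller described earlier in this chapter, not its actual implementation):

class HierarchicalEccController:
    def __init__(self, memory, recently_scrubbed):
        self.memory = memory                        # backing-store abstraction
        self.recently_scrubbed = recently_scrubbed  # set of clean region IDs

    def read(self, block_addr, region_of):
        region = region_of(block_addr)
        if region in self.recently_scrubbed:
            # Fast path: the region was scrubbed recently, so a lightweight
            # local ECC check over this block alone is sufficient.
            return self.memory.read_block_with_local_ecc(block_addr)
        # Slow path: fetch the entire multi-block codeword, run the strong
        # global BCH decode, and mark the region as clean afterwards.
        data = self.memory.read_codeword_with_global_ecc(block_addr)
        self.recently_scrubbed.add(region)
        return data

class StubMemory:
    # Trivial stand-in so the sketch runs; real accesses go to STT-MRAM.
    def read_block_with_local_ecc(self, addr):
        return "block %d via local ECC check" % addr
    def read_codeword_with_global_ecc(self, addr):
        return "block %d via global BCH decode" % addr

ctrl = HierarchicalEccController(StubMemory(), recently_scrubbed=set())
region_of = lambda addr: addr // (4 * 64)   # hypothetical 4-block regions
print(ctrl.read(0, region_of))    # slow path: global decode
print(ctrl.read(64, region_of))   # fast path: same region is now clean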


Chapter 5

Conclusions

This thesis shows that architecting STT-MRAM as a complement to SRAM in

a high-performance microprocessor is a promising direction for improving the en-

ergy efficiency of future systems. Significant gains in energy efficiency have been observed by judiciously partitioning on-chip hardware resources between STT-MRAM and CMOS, by exploiting the unique power, area, and speed benefits of each technology, and by carefully re-architecting the pipeline. Partitioning between CMOS and

STT-MRAM should be guided by two principles: (1) large or infrequently written

structures should be implemented with STT-MRAM arrays, and (2) combinational

logic blocks with many minterms should be migrated to STT-MRAM LUTs to reduce

power. A subbank buffering technique is proposed to alleviate the long write latency

of STT-MRAM arrays. Nevertheless, for frequently written structures (e.g., register files), a heavily subbanked STT-MRAM implementation can degrade performance as compared to an SRAM based implementation.
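
These guidelines can be summarized by a simple decision rule, sketched below (Python; the thresholds are hypothetical placeholders, whereas the dissertation derives the actual partitioning from detailed power, area, and timing models):

def choose_technology(kind, size_kb=0, writes_per_kcycle=0, minterms=0):
    if kind == "array":
        # Principle 1: large or infrequently written arrays -> STT-MRAM
        # (with subbank buffering to hide the long write latency).
        if size_kb >= 64 or writes_per_kcycle < 1:
            return "STT-MRAM array"
        return "SRAM array"
    if kind == "logic":
        # Principle 2: minterm-rich combinational blocks -> STT-MRAM LUTs.
        return "STT-MRAM LUT" if minterms >= 256 else "CMOS logic"
    raise ValueError(kind)

print(choose_technology("array", size_kb=512, writes_per_kcycle=0.2))  # LLC-like structure
print(choose_technology("array", size_kb=8, writes_per_kcycle=40))     # register file
print(choose_technology("logic", minterms=1024))                       # wide decoder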

This thesis also proposes a viable approach to replace DRAM with STT-MRAM

by tolerating retention errors in high-density STT-MRAM. A new error protection

mechanism is devised in which a multi-bit, strong ECC is used to reduce the scrubbing

frequency. The ECC storage overhead is minimized by grouping multiple cache blocks

into a single ECC codeword; the overfetching overhead is kept low by relying on

local ECC checks for recently scrubbed memory regions. Ultimately, the analysis

presented in this thesis shows that deeply-scaled, large-capacity STT-MRAM with

high retention error rates can be made energy-efficient and reliable.

In addition to introducing novel architectures exploiting emerging resistive mem-

ory technologies, I also contributed to two projects that leverage resistive memories in

building hardware accelerators for data-intensive applications. Data-intensive applications such as data mining, information retrieval, video processing, and image coding

demand significant computational power and generate substantial memory traffic,

which places a heavy strain on both the off-chip memory bandwidth and the overall

system power. Ternary content addressable memories (TCAMs) are an attractive

solution to curb both the power dissipation and the off-chip bandwidth demand in a

wide range of applications. When associative lookups are implemented using TCAM,

data is processed directly on the TCAM chip, which decreases the off-chip traffic and


lowers the bandwidth demand. Often, a TCAM-based system also improves energy

efficiency by eliminating instruction processing and data movement overheads that

are present in a purely RAM based system. Unfortunately, even an area-optimized,

CMOS-based TCAM cell is over 90× larger than a DRAM cell at the same technol-

ogy node, which limits the capacity of commercially available TCAM parts to a few

megabytes, and confines their use to niche networking applications.
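
To make the notion of an associative lookup concrete, the following minimal software model (Python; functional behavior only, whereas the resistive TCAM array performs all comparisons in parallel in hardware) shows how a ternary entry, stored as a value together with a care mask, matches a search key:

def tcam_search(entries, key):
    # entries: list of (value, care_mask) pairs; bits where care_mask is 0
    # are "don't care" (the ternary X state). Returns indices of all matches.
    return [i for i, (value, care) in enumerate(entries)
            if (key ^ value) & care == 0]

table = [
    (0b10100000, 0b11110000),   # matches any key of the form 1010xxxx
    (0b10101100, 0b11111111),   # exact match only
    (0b00000000, 0b00000000),   # wildcard entry: matches every key
]
print(tcam_search(table, 0b10101100))   # -> [0, 1, 2]
print(tcam_search(table, 0b10100011))   # -> [0, 2]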

We explore TCAM-DIMM [29], a new technique that aims at cost-effective, modu-

lar integration of a high-capacity TCAM system within a general-purpose computing

platform. TCAM density is improved by more than 20× over existing, CMOS-based

parts through a novel, resistive TCAM cell and array architecture. High-capacity

resistive TCAM chips are placed on a DDR3-compatible DIMM, and are accessed

through a user-level software library with zero modifications to the processor or the

motherboard. The modularity of the resulting memory system allows TCAM to be

selectively included in systems running workloads that are amenable to TCAM-based

acceleration; moreover, when executing an application or a program phase that does

not benefit from associative search capability, the TCAM-DIMM can be configured

to provide ordinary RAM functionality. By tightly integrating TCAM with conven-

tional virtual memory, and by allowing a large fraction of the physical address space

to be made content-addressable on demand, the proposed memory system improves average performance by 4× and reduces average energy consumption by 10× on a set


of evaluated data-intensive applications.

One limitation of the TCAM-DIMM is that its use is restricted to search intensive

applications. To address this limitation, we introduce the AC-DIMM system [30]—

an associative memory system and compute engine that can be readily included in

a DDR3 socket. Using STT-MRAM, AC-DIMM implements a two-transistor, one-

resistor (2T1R) cell, which is 4.4× denser than an SRAM based TCAM cell. AC-

DIMM enables a new associative programming model, wherein a group of integrated

microcontrollers execute user-defined kernels on search results. This flexible func-

tionality allows AC-DIMM to cater to a broad range of applications. On a set of

13 evaluated benchmarks, AC-DIMM achieves an average speedup of 4.2× and an

average energy reduction of 6.5× as compared to a conventional RAM based system.
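
The associative programming model can be illustrated with a few lines of Python (the function and field names are illustrative stand-ins, not the AC-DIMM software interface): a search first selects the matching rows, and a user-defined kernel is then applied to each result, the step that the on-DIMM microcontrollers perform in hardware.

def associative_apply(rows, key, care_mask, kernel):
    # Step 1: associative search over the stored tags (done by the TCAM array).
    matches = [r for r in rows if (r["tag"] ^ key) & care_mask == 0]
    # Step 2: run the user-defined kernel on every match (done by the
    # integrated microcontrollers in AC-DIMM).
    return [kernel(r) for r in matches]

# Example: sum a field of every record whose 2-bit tag prefix is 0b10.
rows = [{"tag": 0b1001, "val": 3}, {"tag": 0b1010, "val": 5}, {"tag": 0b0110, "val": 7}]
print(sum(associative_apply(rows, key=0b1000, care_mask=0b1100,
                            kernel=lambda r: r["val"])))   # -> 8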

I believe that by leveraging emerging technologies to design novel architectures,

it will be possible to create qualitatively new opportunities for system design and

optimization, pushing the boundaries of computer architecture beyond the end of

traditional CMOS scaling.


Bibliography

[1] V. Agarwal, M. Hrishikesh, S. Keckler, and D. Burger. Clock rate vs. IPC: End

of the road for conventional microprocessors. In International Symposium on

Computer Architecture, Vancouver, Canada, June 2000.

[2] ALTERA. Stratix vs. Virtex-2 Pro FPGA performance analysis, 2004.

[3] B. Amrutur and M. Horowitz. Speed and power scaling of SRAMs. 2000.

[4] V. Aslot and R. Eigenmann. Quantitative performance analysis of the SPEC

OMPM2001 benchmarks. Scientific Programming, 11(2):105–124, 2003.

[5] M. Awasthi, M. Shevgoor, K. Sudan, B. Rajendran, R. Balasubramonian, and

V. Srinivasan. Efficient scrub mechanisms for error-prone emerging memories.

In High Performance Computer Architecture (HPCA), 2012 IEEE 18th Interna-

tional Symposium on, pages 1–12, Feb 2012.

[6] R. Azevedo, J. D. Davis, K. Strauss, P. Gopalan, M. Manasse, and S. Yekhanin.

Zombie memory: Extending memory lifetime by reviving dead blocks. SIGARCH

Comput. Archit. News, 41(3):452–463, June 2013.

[7] D. Bailey, E. Barszcz, J. Barton, D. Browning, R. Carter, L. Dagum, R. Fatoohi,

S. Fineberg, P. Frederickson, T. Lasinski, R. Schreiber, H. Simon, V. Venkatakr-

ishnan, and S. Weeratunga. NAS parallel benchmarks. Technical report, NASA

Ames Research Center, March 1994. Tech. Rep. RNR-94-007.

[8] I. S. Bhati. Scalable and Energy Efficient DRAM Refresh Techniques. Depart-

ment of Electrical and Computer Engineering University of Maryland, College

Park, 2014.


[9] R. E. Blahut. Algebraic Codes for Data Transmission. Cambridge University

Press, 1 edition, Mar. 2003.

[10] R. C. Bose and D. K. Ray-Chaudhuri. On a class of error correcting binary group

codes. Information and Control, 3(1):68–79, March 1960.

[11] D. Burger, J. R. Goodman, and A. Kagi. Memory bandwidth limitations of

future microprocessors. In International Symposium on Computer Architecture,

Philedelphia, PA, May 1996.

[12] Y. Cai, G. Yalcin, O. Mutlu, E. Haratsch, A. Cristal, O. Unsal, and K. Mai.

Flash correct-and-refresh: Retention-aware error management for increased flash

memory lifetime. In Computer Design (ICCD), 2012 IEEE 30th International

Conference on, pages 94–101, Sept 2012.

[13] E. Catovic. GRFPU – high performance IEEE-754 floating-point unit. http://www.gaisler.com/doc/grfpu_dasia.pdf.

[14] C. Chappert, A. Fert, and F. N. V. Dau. The emergence of spin electronics in

data storage. Nature Materials, 6:813–823, November 2007.

[15] C.-L. Chen. Error-correcting codes for byte-organized memory systems. Infor-

mation Theory, IEEE Transactions on, 32(2):181–185, Mar 1986.

[16] E. Chen, D. Apalkov, Z. Diao, A. Driskill-Smith, D. Druist, D. Lottis, V. Nikitin,

X. Tang, S. Watts, S. Wang, S. Wolf, A. W. Ghosh, J. Lu, S. J. Poon, M. Stan,

W. Butler, S. Gupta, C. K. A. Mewes, T. Mewes, and P. Visscher. Advances

and future prospects of spin-transfer torque random access memory. Magnetics,

IEEE Transactions on, 46(6):1873–1878, June 2010.

[17] K. Chen, S. Li, N. Muralimanohar, J.-H. Ahn, J. Brockman, and N. Jouppi.

CACTI-3DD: Architecture-level modeling for 3D die-stacked DRAM main memory.

In Design, Automation Test in Europe Conference Exhibition (DATE), 2012,

pages 33–38, March 2012.

[18] Y. Chen, H. (Helen) Li, X. Wang, W. Zhu, W. Xu, and T. Zhang. A non-

destructive self-reference scheme for spin-transfer torque random access memory


(stt-ram). In Design, Automation Test in Europe Conference Exhibition (DATE),

2010, pages 148–153, March 2010.

[19] H. Chung, B. H. Jeong, B. Min, Y. Choi, B.-H. Cho, J. Shin, J. Kim, J. Sunwoo,

J. min Park, Q. Wang, Y.-J. Lee, S. Cha, D. Kwon, S. Kim, S. Kim, Y. Rho,

M.-H. Park, J. Kim, I. Song, S. Jun, J. Lee, K. Kim, K. won Lim, W. ryul Chung,

C. Choi, H. Cho, I. Shin, W. Jun, S. Hwang, K.-W. Song, K. Lee, S. whan Chang,

W.-Y. Cho, J.-H. Yoo, and Y.-H. Jun. A 58nm 1.8V 1Gb PRAM with 6.4MB/s

program BW. In IEEE International Solid-State Circuits Conference Digest of

Technical Papers, pages 500–502, Feb 2011.

[20] M. D. Ciletti. Advanced Digital Design with the Verilog HDL. 2004.

[21] B. Del Bel, J. Kim, C. H. Kim, and S. S. Sapatnekar. Improving STT-MRAM

density through multibit error correction. In Design, Automation and Test in

Europe Conference and Exhibition (DATE), 2014, pages 1–6, March 2014.

[22] T. J. Dell. The Benefits of Chipkill-Correct ECC for PC Server Main Memory–a

white paper. IBM Microelectronics Division, 1997.

[23] R. Dennard, F. Gaensslen, V. Rideout, E. Bassous, and A. LeBlanc. Design of

ion-implanted MOSFET's with very small physical dimensions. Solid-State Circuits,

IEEE Journal of, 9(5):256–268, Oct 1974.

[24] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger.

Dark silicon and the end of multicore scaling. In Proceedings of the 38th Annual

International Symposium on Computer Architecture, ISCA ’11, pages 365–376,

New York, NY, USA, 2011. ACM.

[25] Everspin Technologies. Spin-Torque MRAM Technical Brief, 2013.

[26] R. Fackenthal, M. Kitagawa, W. Otsuka, K. Prall, D. Mills, K. Tsutsui, J. Ja-

vanifard, K. Tedrow, T. Tsushima, Y. Shibahara, and G. Hush. 16Gb ReRAM

with 200MB/s write and 1GB/s read in 27nm technology. In Solid-State Circuits

Conference Digest of Technical Papers (ISSCC), 2014 IEEE International, pages

338–339, Feb 2014.


[27] J. Fan, S. Jiang, J. Shu, Y. Zhang, and W. Zhen. Aegis: Partitioning data block

for efficient recovery of stuck-at-faults in phase change memory. In Proceedings

of the 46th Annual IEEE/ACM International Symposium on Microarchitecture,

MICRO-46, pages 433–444, New York, NY, USA, 2013. ACM.

[28] M. Gajek, J. J. Nowak, J. Z. Sun, P. L. Trouilloud, E. J. O’Sullivan, D. W. Abra-

ham, M. C. Gaidis, G. Hu, S. Brown, Y. Zhu, R. P. Robertazzi, W. J. Gallagher,

and D. C. Worledge. Spin torque switching of 20nm magnetic tunnel junctions

with perpendicular anisotropy. Applied Physics Letters, 100(13):132408, 2012.

[29] Q. Guo, X. Guo, Y. Bai, and E. Ipek. A resistive TCAM accelerator for data-

intensive computing. In International Symposium on Microarchitecture, Dec.

2011.

[30] Q. Guo, X. Guo, R. Patel, E. Ipek, and E. G. Friedman. AC-DIMM: associative

computing with STT-MRAM. In Proceedings of the 40th Annual International

Symposium on Computer Architecture, pages 189–200, New York, NY, USA,

2013. ACM.

[31] G. Hamerly, E. Perelman, J. Lau, and B. Calder. Simpoint 3.0: Faster and more

flexible program analysis. In Journal of Instruction Level Parallelism, 2005.

[32] HiTech. DDR2 memory controller IP core for FPGA and ASIC. http://www.

hitechglobal.com/IPCores/DDR2Controller.htm.

[33] A. Hocquenghem. Codes correcteurs d’erreurs. Chiffres, 2:147–158, 1959.

[34] M. Hosomi, H. Yamagishi, T. Yamamoto, K. Bessho, Y. Higo, K. Yamane, H. Ya-

mada, M. Shoji, H. Hachino, C. Fukumoto, H. Nagao, and H. Kano. A novel

nonvolatile memory with spin torque transfer magnetization switching: Spin-

RAM. In IEDM Technical Digest, pages 459–462, 2005.

[35] Y. Huai. Spin-transfer torque MRAM (STT-MRAM) challenges and prospects.

AAPPS Bulletin, 18(6):33–40, December 2008.


[36] IBM Corporation. IBM Power System S822: Scale-out application server for se-

cure infrastructure built on open technology. http://www-03.ibm.com/systems/

power/hardware/s822.

[37] Intel Corporation. Intel Xeon Processor E7-8800/4800/2800

Product Families Datasheet. http://www.intel.com/

content/dam/www/public/us/en/documents/datasheets/

xeon-e7-8800-4800-2800-families-vol-1-datasheet.pdf.

[38] E. Ipek, J. Condit, E. B. Nightingale, D. Burger, and T. Moscibroda. Dy-

namically replicated memory: Building reliable systems from nanoscale resistive

memories. In Proceedings of the Fifteenth Edition of ASPLOS on Architectural

Support for Programming Languages and Operating Systems, ASPLOS XV, pages

3–14, 2010.

[39] ITRS. International Technology Roadmap for Semiconductors: 2013 Edition.

http://www.itrs.net/Links/2013ITRS/Summary2013.htm.

[40] A. N. Jacobvitz, R. Calderbank, and D. J. Sorin. Coset coding to extend the

lifetime of memory. In High Performance Computer Architecture (HPCA2013),

2013 IEEE 19th International Symposium on, pages 222–233, Feb 2013.

[41] Z. Jun, W. Zhi-Gong, H. Qing-Sheng, and X. Jie. Optimized design for high-

speed parallel BCH encoder. In VLSI Design and Video Technology, 2005. Pro-

ceedings of 2005 IEEE International Workshop on, pages 97–100, May 2005.

[42] G. Kane. MIPS RISC Architecture. 1988.

[43] U. R. Karpuzcu, B. Greskamp, and J. Torrellas. The bubblewrap many-core: Popping cores for sequential acceleration. In International Symposium on Microarchitecture, 2009.

[44] T. Kawahara, R. Takemura, K. Miura, J. Hayakawa, S. Ikeda, Y. Lee, R. Sasaki,

Y. Goto, K. Ito, T. Meguro, F. Matsukura, H. Takahashi, H. Matsuoka,

and H. Ohno. 2 Mb SPRAM (spin-transfer torque RAM) with bit-by-bit bi-

directional current write and parallelizing-direction current read. IEEE Journal

of Solid-State Circuits, 43(1):109–120, January 2008.


[45] W. Kim, J. Jeong, Y. Kim, W. C. Lim, J.-H. Kim, J. Park, H. Shin, Y. Park,

K. Kim, S. Park, Y. Lee, K. Kim, H. Kwon, H. Park, H. S. Ahn, S. Oh, J. Lee,

S. Park, S. Choi, H.-K. Kang, and C. Chung. Extended scalability of perpendicu-

lar stt-mram towards sub-20nm mtj node. In Electron Devices Meeting (IEDM),

2011 IEEE International, pages 24.1.1–24.1.4, Dec 2011.

[46] T. Kishi, H. Yoda, T. Kai, T. Nagase, E. Kitagawa, M. Yoshikawa, K. Nishiyama,

T. Daibou, M. Nagamine, M. Amano, S. Takahashi, M. Nakayama, N. Shimo-

mura, H. Aikawa, S. Ikegawa, S. Yuasa, K. Yakushiji, H. Kubota, A. Fukushima,

M. Oogane, T. Miyazaki, and K. Ando. Lower-current and fast switching of a

perpendicular TMR for high speed and high density spin-transfer-torque MRAM.

In IEEE International Electron Devices Meeting, 2008.

[47] U. Klostermann, M. Angerbauer, U. Griming, F. Kreupl, M. Ruhrig, F. Dahmani,

M. Kund, and G. Muller. A perpendicular spin torque switching based MRAM

for the 28 nm technology node. In IEEE International Electron Devices Meeting,

2007.

[48] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded

SPARC processor. IEEE Micro, 25(2):21–29, 2005.

[49] B. Lee, E. Ipek, O. Mutlu, and D. Burger. Architecting phase-change mem-

ory as a scalable DRAM alternative. In International Symposium on Computer

Architecture, Austin, TX, June 2009.

[50] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P.

Jouppi. McPAT: An integrated power, area, and timing modeling framework for

multicore and manycore architectures. In International Symposium on Computer

Architecture, 2009.

[51] S. Mathew, M. Anders, B. Bloechel, T. Nguyen, R. Krishnamurthy, and S. Borkar. A 4-GHz 300-mW 64-bit integer execution ALU with dual supply voltages in 90-nm CMOS. In IEEE International Solid-State Circuits Conference Digest of Technical Papers, pages 162–519 Vol. 1, Feb 2004.


[52] S. Matsunaga, J. Hayakawa, S. Ikeda, K. Miura, H. Hasegawa, T. Endoh,

H. Ohno, and T. Hanyu. Fabrication of a nonvolatile full adder based on logic-in-

memory architecture using magnetic tunnel junctions. Applied Physics Express,

1(9):091301, 2008.

[53] Micron. 512Mb DDR2 SDRAM Component Data Sheet: MT47H128M4B6-

25, March 2006. http://download.micron.com/pdf/datasheets/dram/ddr2/

512MbDDR2.pdf.

[54] Micron Technology. Technical note: Understanding the quality and reliability re-

quirements for bare die applications, 2001. http://www.micron.com/~/media/

Documents/Products/Technical%20Note/NAND%20Flash/tn0014.pdf.

[55] G. Moore. Cramming more components onto integrated circuits. Proceedings of

the IEEE, 86(1):82–85, Jan 1998.

[56] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing NUCA

organizations and wiring alternatives for large caches with CACTI 6.0. In the

40th Annual IEEE/ACM International Symposium on Microarchitecture, 2007,

Chicago, IL, Dec. 2007.

[57] H. Naeimi, C. Augustine, A. Raychowdhury, S.-L. Lu, and J. Tschanz. STTRAM

scaling and retention failure. Intel Technology Journal, 17(1):54–75, 2013.

[58] U. G. Nawathe, M. Hassan, K. C. Yen, A. Kumar, A. Ramachandran, and

D. Greenhill. Implementation of an 8-core, 64-thread, power-efficient SPARC server

on a chip. IEEE Journal of Solid-State Circuits, 43(1):6–20, January 2008.

[59] Oracle Corporation. SPARC M5-32 Server Architecture. http://www.oracle.

com/us/products/servers-storage/servers/sparc/oracle-sparc/m5-32.

[60] J. Pisharath, Y. Liu, W. Liao, A. Choudhary, G. Memik, and J. Parhi. NU-

MineBench 2.0. Technical report, Northwestern University, August 2005. Tech.

Rep. CUCIS-2005-08-01.

[61] M. K. Qureshi. Pay-as-you-go: Low-overhead hard-error correction for phase

change memories. In Proceedings of the 44th Annual IEEE/ACM International


Symposium on Microarchitecture, MICRO-44, pages 318–328, New York, NY,

USA, 2011. ACM.

[62] M. K. Qureshi, J. Karidis, M. Franceschini, V. Srinivasan, L. Lastras, and

B. Abali. Enhancing lifetime and security of PCM-based main memory with

start-gap wear leveling. In Proceedings of the 42Nd Annual IEEE/ACM Inter-

national Symposium on Microarchitecture, MICRO 42, pages 14–23, 2009.

[63] M. K. Qureshi, V. Srinivasan, and J. A. Rivers. Scalable high performance main

memory system using phase-change memory technology. In Proceedings of the

36th annual international symposium on Computer architecture, ISCA ’09, pages

24–33, 2009.

[64] I. S. Reed and G. Solomon. Polynomial codes over certain finite fields. Journal

of the Society for Industrial and Applied Mathematics, 8:300–304, 1960.

[65] J. Renau, B. Fraguela, J. Tuck, W. Liu, M. Prvulovic, L. Ceze, S. Sarangi,

P. Sack, K. Strauss, and P. Montesinos. SESC simulator, Jan. 2005.

http://sesc.sourceforge.net.

[66] S. Rusu, S. Tam, H. Muljono, J. Stinson, D. Ayers, J. Chang, R. Varada,

M. Ratta, and S. Kottapalli. A 45nm 8-Core Enterprise Xeon Processor. In

Solid-State Circuits Conference - Digest of Technical Papers, 2009. ISSCC 2009.

IEEE International, pages 56–57, Feb 2009.

[67] T. Sakurai and A. Newton. Alpha-power law MOSFET model and its applications to CMOS inverter delay and other formulas. Solid-State Circuits, IEEE Journal

of, 25(2):584–594, Apr 1990.

[68] S. Schechter, G. H. Loh, K. Strauss, and D. Burger. Use ECP, not ECC, for hard

failures in resistive memories. SIGARCH Comput. Archit. News, 38(3):141–152,

June 2010.

[69] N. H. Seong, D. H. Woo, V. Srinivasan, J. Rivers, and H.-H. Lee. Safer: Stuck-

at-fault error recovery for memories. In Microarchitecture (MICRO), 2010 43rd

Annual IEEE/ACM International Symposium on, pages 115–124, Dec 2010.


[70] G. Servalli. A 45nm generation phase change memory technology. In IEEE

International Electron Devices Meeting, 2009.

[71] W. Shin, J. Yang, J. Choi, and L.-S. Kim. Nuat: A non-uniform access time

memory controller. In High Performance Computer Architecture (HPCA), 2014

IEEE 20th International Symposium on, pages 464–475, Feb 2014.

[72] J. Slaughter. Materials for Magnetoresistive Random Access Memory. Annual

Review of Materials Research, 39(1):277–296, Aug. 2009.

[73] J. Slaughter, N. Rizzo, J. Janesky, R. Whig, F. Mancoff, D. Houssameddine,

J. Sun, S. Aggarwal, K. Nagel, S. Deshpande, S. Alam, T. Andre, and P. LoPresti.

High density st-mram technology (invited). In Electron Devices Meeting (IEDM),

2012 IEEE International, pages 29.3.1–29.3.4, Dec 2012.

[74] C. Slayman, M. Ma, and S. Lindley. Impact of error correction code and dynamic

memory reconfiguration on high-reliability/low-cost server memory. In Integrated

Reliability Workshop Final Report, 2006 IEEE International, pages 190–193, Oct

2006.

[75] C. W. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. R. Stan. Relaxing

non-volatility for fast and energy-efficient STT-RAM caches. In Proceedings of

the 17th IEEE International Symposium on High Performance Computer Archi-

tecture, pages 50–61, 2011.

[76] V. Sridharan and D. Liberty. A study of DRAM failures in the field. In Pro-

ceedings of the International Conference on High Performance Computing, Net-

working, Storage and Analysis, SC ’12, pages 76:1–76:11, 2012.

[77] V. Sridharan, J. Stearley, N. DeBardeleben, S. Blanchard, and S. Gurumurthi.

Feng shui of supercomputer memory: Positional effects in DRAM and SRAM

faults. In Proceedings of the International Conference on High Performance

Computing, Networking, Storage and Analysis, SC ’13, pages 22:1–22:11, 2013.

[78] S. Srinath, O. Mutlu, H. Kim, and Y. Patt. Feedback directed prefetching: Im-

proving the performance and bandwidth-efficiency of hardware prefetchers. In


High Performance Computer Architecture, 2007. HPCA 2007. IEEE 13th Inter-

national Symposium on, pages 63–74, Feb 2007.

[79] Standard Performance Evaluation Corporation. SPEC CPU2006 Benchmark

Suite, 2006.

[80] J. E. Stine, I. Castellanos, M. Wood, J. Henson, and F. Love. Freepdk: An open-

source variation-aware design kit. In International Conference on Microelec-

tronic Systems Education, 2007. http://vcag.ecen.okstate.edu/projects/

scells/.

[81] D. Strukov. The area and latency tradeoffs of binary bit-parallel BCH decoders

for prospective nanoelectronic memories. In Signals, Systems and Computers,

2006. ACSSC ’06. Fortieth Asilomar Conference on, pages 1183–1187, Oct 2006.

[82] J. Stuecheli, D. Kaseridis, H. C. Hunter, and L. K. John. Elastic refresh: Tech-

niques to mitigate refresh penalties in high density memory. In Proceedings of

the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitec-

ture, MICRO ’43, pages 375–384, Washington, DC, USA, 2010. IEEE Computer

Society.

[83] G. Sun, X. Dong, Y. Xie, J. Li, and Y. Chen. A novel 3D stacked MRAM cache

architecture for CMPs. In High-Performance Computer Architecture, 2009.

[84] D. Suzuki, M. Natsui, S. Ikeda, H. Hasegawa, K. Miura, J. Hayakawa, T. Endoh,

H. Ohno, and T. Hanyu. Fabrication of a nonvolatile lookup-table circuit chip

using magneto/semiconductor-hybrid structure for an immediate-power-up field

programmable gate array. In VLSI Circuits, 2009 Symposium on, pages 80–81,

June 2009.

[85] K. Tsuchida, T. Inaba, K. Fujita, Y. Ueda, T. Shimizu, Y. Asao, T. Kajiyama,

M. Iwayama, K. Sugiura, S. Ikegawa, T. Kishi, T. Kai, M. Amano, N. Shimo-

mura, H. Yoda, and Y. Watanabe. A 64Mb MRAM with clamped-reference

and adequate-reference schemes. In Solid-State Circuits Conference Digest of

Technical Papers (ISSCC), 2010 IEEE International, pages 258–259, Feb 2010.


[86] A. Udipi, N. Muralimanohar, R. Balasubramonian, A. Davis, and N. Jouppi. LOT-ECC: Localized and tiered reliability mechanisms for commodity memory systems.

In Computer Architecture (ISCA), 2012 39th Annual International Symposium

on, pages 285–296, June 2012.

[87] I. Valov, R. Waser, J. R. Jameson, and M. N. Kozicki. Electrochemical metallization memories—fundamentals, applications, prospects. Nanotechnology, 22(25):254003, 2011.

[88] H. S. P. Wong, H.-Y. Lee, S. Yu, Y. S. Chen, Y. Wu, P. S. Chen, B. Lee, F. Chen,

and M. J. Tsai. Metal-oxide RRAM. Proceedings of the IEEE, 100(6):1951–1970,

June 2012.

[89] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-

2 programs: Characterization and methodological considerations. In ISCA-22,

1995.

[90] X. Wu, J. Li, L. Zhang, E. Speight, R. Rajamony, and Y. Xie. Hybrid cache

architecture with disparate memory technologies. In International Symposium

on Computer Architecture, 2009.

[91] Xilinx. Virtex-6 FPGA Family Overview, November 2009. http://www.xilinx.

com/support/documentation/data_sheets/ds150.pdf.

[92] W. Xu, T. Zhang, and Y. Chen. Spin-transfer torque magnetoresistive content

addressable memory (CAM) cell structure design with enhanced search noise

margin. In International Symposium on Circuits and Systems, 2008.

[93] T.-Y. Liu, T. H. Yan, R. Scheuerlein, Y. Chen, J. Lee, G. Balakrishnan, G. Yee,

H. Zhang, A. Yap, J. Ouyang, T. Sasaki, S. Addepalli, A. Al-Shamma, C.-Y.

Chen, M. Gupta, G. Hilton, S. Joshi, A. Kathuria, V. Lai, D. Masiwal, M. Mat-

sumoto, A. Nigam, A. Pai, J. Pakhale, C. H. Siau, X. Wu, R. Yin, L. Peng, J. Y.

Kang, S. Huynh, H. Wang, N. Nagel, Y. Tanaka, M. Higashitani, T. Minvielle,

C. Gorla, T. Tsukamoto, T. Yamaguchi, M. Okajima, T. Okamura, S. Takase,

T. Hara, H. Inoue, L. Fasoli, M. Mofidi, R. Shrivastava, and K. Quader. A

130.7mm² 2-layer 32Gb ReRAM memory device in 24nm technology. In Solid-State


Circuits Conference Digest of Technical Papers (ISSCC), 2013 IEEE Interna-

tional, pages 210–211, Feb 2013.

[94] D. H. Yoon and M. Erez. Virtualized and flexible ECC for main memory.

SIGARCH Comput. Archit. News, 38(1):397–408, Mar. 2010.

[95] D. H. Yoon, N. Muralimanohar, J. Chang, P. Ranganathan, N. Jouppi, and

M. Erez. Free-p: Protecting non-volatile memory against both hard and soft

errors. In High Performance Computer Architecture (HPCA), 2011 IEEE 17th

International Symposium on, pages 466–477, Feb 2011.

[96] Y. Zhang, L. Zhang, W. Wen, G. Sun, and Y. Chen. Multi-level cell STT-RAM:

Is it realistic or just a dream? In Computer-Aided Design (ICCAD), 2012

IEEE/ACM International Conference on, pages 526–532, Nov 2012.

[97] W. Zhao and Y. Cao. New generation of predictive technology model for sub-

45nm design exploration. In International Symposium on Quality Electronic

Design, 2006. http://ptm.asu.edu/.

[98] W. Zhao, C. Chappert, and P. Mazoyer. Spin transfer torque (STT) MRAM-

based runtime reconfiguration FPGA circuit. In ACM Transactions on Embedded

Computing Systems, 2009.

[99] J.-G. Zhu. Magnetoresistive random access memory: The path to competitive-

ness and scalability. Proceedings of the IEEE, 96(11):1786–1798, Nov 2008.