TRANSCRIPT
Runtime Solutions to Apply Non-volatile
Memories in Future Computer Systems
Hoda Aghaei Khouzani
Fall 2016
Outline
• Introduction
• Prolonging PCM Limited Lifetime
• Addressing PCM Write Latency and Energy
• Reducing DWM Access Latency
• Summary and Future Works
Trends Affecting Main Memory
DRAM has been used as main memory since 1970. However, DRAM technology scaling is ending.
[Figure: relative DRAM capacity vs. number of cores per chip, 2003-2016; memory capacity per core is expected to drop by 30% every two years [Mutlu, IMW'13] [Lim, ISCA'09]]
[Figure: chip power density (W/cm²), split into leakage and dynamic power, across technology nodes from 90nm to 20nm [Borkar, MICRO'05]]
DRAM consumes up to 40% of system energy due to the need for constant refresh cycles [Udipi, ISCA'10].
Emerging Memory Technologies
[Figure: access time (1ns to 1ms) vs. cell area per bit (F²) for SRAM, DRAM, NOR-Flash, NAND-Flash, and the emerging non-volatile technologies DWM, PCM, FeRAM, and STT-RAM; volatile technologies have high static power, non-volatile ones near-zero static power [ITRS, 2013]]
Potential Candidates for Future Main Memory
(1) Phase Change Memory (PCM)
• Advantages:
– Higher scalability than DRAM
– Near zero static power
– Similar read performance and energy to
DRAM
• Challenges:
– Limited write endurance (10⁶~10⁹)
– Long write latency (10x DRAM)
– High write energy (4x DRAM)
(2) Domain Wall Memory (DWM)
• Advantages:
– Ultra dense
– Near zero static power
– Similar read/write performance and energy
to DRAM
• Challenge:
– Sequential access structure
Outline
• Introduction
• Prolonging PCM Limited Lifetime
• Addressing PCM Write Latency and Energy
• Reducing DWM Access Latency
• Summary and Future Works
Related Works on PCM Lifetime
[1] S. Cho, et al., "Flip-N-Write: A Simple Deterministic Technique to Improve PRAM Write Performance, Energy and Endurance," in MICRO, 2009.
[2] P. Zhou, et al., "A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology," in ISCA, 2009.
[3] A. Ferreira, et al., "Increasing PCM main memory lifetime," in DATE, 2010.
[4] M. Qureshi, et al., "Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling," in MICRO, 2009.
[5] P. Zhou, et al., "A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology," in ISCA, 2009.
[6] N. H. Seong, et al., "SAFER: Stuck-at-fault Error Recovery for Memories," in MICRO, 2010.
[7] J. Fan, et al., "Aegis: partitioning data block for efficient recovery of stuck-at-faults in phase change memory," in MICRO, 2013.
Write reduction:
• Works at the bit level
• Reduces the average number of writes
• Lifetime is still limited by the most-written cell
Write balancing (wear leveling):
• Applied at different levels of granularity
• Balances writes across cells
• Incurs extra writes due to remapping
Fault tolerance:
• Applies once cells die
• Increases access latency
Goal and Challenges
In the physical domain, pages have received either a lot of writes (old pages) or a few (young pages); in the virtual domain, pages perform either a lot of writes (hot pages) or a few (cold pages).
Goal: the operating system maps hot virtual pages to young physical pages and cold virtual pages to old physical pages.
Challenges:
• How to identify the age of physical pages?
• How to identify the temperature of virtual pages?
• How frequent should remapping be?
Identification Solution
How to identify the age of physical pages? (physical domain)
• Keep an N-bit write counter per physical page, where N = log2(wear-out limit).
• For a wear-out limit of 10⁹, about 30 bits are needed per 4KB page, so the storage overhead is below 0.001 (a sizing sketch follows below).
How to identify the temperature of virtual pages? (virtual domain)
• Counters again? No:
  (1) The virtual domain is larger than the physical domain.
  (2) More importantly, there is no need, because virtual page regions have known write characteristics:
      Text: read only
      Stack: high spatial locality
      Data: high temporal locality
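As a quick check of the counter sizing above, here is a minimal Python sketch; the 4KB page size and 10⁹ wear-out limit come from the slide, and the function name is illustrative.

```python
import math

# A minimal sketch (not the dissertation's code) checking the per-page age
# counter sizing: N = ceil(log2(wear-out limit)) bits per physical page.
def age_counter_overhead(wear_out_limit=10**9, page_size_bytes=4096):
    n_bits = math.ceil(math.log2(wear_out_limit))   # ~30 bits for a 10^9 limit
    page_bits = page_size_bytes * 8                  # 32768 bits in a 4KB page
    return n_bits, n_bits / page_bits

bits, overhead = age_counter_overhead()
print(bits, f"{overhead:.5f}")   # 30 0.00092 -> overhead below 0.001
```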
How frequent should remapping be?
In existing works: wear leveling relies on many extra remaps, which results in intensive extra writes. [Figure: page ages at the start and after processes A, B, and C under existing schemes]
My idea: wear leveling can be done across different processes, so pages aged by one process are reused appropriately by the next. [Figure: page ages at the start and after processes A, B, and C under the proposed scheme]
Implementation: the mapping is decided upon page allocation in the OS, so almost no extra remaps are needed and extra writes are very limited.
System Overview
[Flowchart: on a memory access, a miss triggers wear-resistant page allocation; if too few free frames remain, age-aware page replacement runs first. On a hit that is a write and turns the frame old, the page is remapped. If too few young frames remain, the Young vs. Mid-Age boundary is redefined; if too many old frames accumulate, the Old vs. Mid-Age boundary is redefined.]
The following are added to the system:
• Three procedures
  − Wear-resistant page allocation
  − Age-aware page replacement
  − Remap hot page
• Two thresholds
  − Young-to-Midage threshold
  − Midage-to-Old threshold
Wear-resistant Page Allocation
How to quickly find the appropriate physical page? An age-aware free list with three segments is maintained:
• Young: WriteCount ≤ Y-to-M threshold
• Mid-Age: Y-to-M threshold < WriteCount ≤ M-to-O threshold
• Old: WriteCount > M-to-O threshold
On a page request (a sketch follows below):
• A hot virtual page is always served with a Young page.
• A cold virtual page is served with an Old page; if the Old list is empty, with a Mid-age page; if that is also empty, with a Young page.
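Below is a minimal Python sketch of this allocation policy, using the segment boundaries listed above; the class and method names are hypothetical, not the dissertation's implementation.

```python
from collections import deque

# A minimal sketch of the age-aware free list; segment boundaries follow the
# slide, while the class layout is an assumption.
class AgeAwareFreeList:
    def __init__(self, y_to_m, m_to_o):
        self.y_to_m, self.m_to_o = y_to_m, m_to_o
        self.young, self.mid_age, self.old = deque(), deque(), deque()

    def add_free_page(self, page, write_count):
        # Freed physical pages return to the segment matching their age.
        if write_count <= self.y_to_m:
            self.young.append(page)
        elif write_count <= self.m_to_o:
            self.mid_age.append(page)
        else:
            self.old.append(page)

    def allocate(self, virtual_page_is_hot):
        if virtual_page_is_hot:
            # Hot virtual pages are always served with a young physical page
            # (age-aware replacement is expected to keep this list non-empty).
            return self.young.popleft()
        # Cold virtual pages prefer old pages, then mid-age, then young.
        for segment in (self.old, self.mid_age, self.young):
            if segment:
                return segment.popleft()
        raise MemoryError("no free physical pages")
```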
Age-aware Page Deallocation
In the OS there is an in-use list per process. [Figure: in-use pages of processes A and B being returned to the Old / Mid-Age / Young segments of the age-aware free list]
How to select a page for deallocation? A Constrained Clock Algorithm is used (a sketch follows below), where the constraint is an upper bound on the write count of the evicted page:
• If the Young free list is empty: upper bound = Y-to-M threshold.
• Otherwise: upper bound = wear-out limit.
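The following is a minimal Python sketch of such a constrained clock sweep, assuming a simple circular list of pages with a referenced bit and a write count; it is an illustration, not the dissertation's code.

```python
# A minimal sketch of the constrained clock algorithm: a victim is chosen with
# the usual clock sweep, but only pages whose write count is within the current
# upper bound qualify. Names and data layout are hypothetical.
def select_victim(in_use_pages, young_list_empty, y_to_m, wear_out_limit):
    """in_use_pages: circular list of dicts with 'referenced' and 'write_count'."""
    upper_bound = y_to_m if young_list_empty else wear_out_limit
    n = len(in_use_pages)
    hand = 0
    for _ in range(2 * n):              # at most two sweeps over the list
        page = in_use_pages[hand]
        if page['referenced']:
            page['referenced'] = False  # give a second chance
        elif page['write_count'] <= upper_bound:
            return hand                 # unreferenced and young enough: evict
        hand = (hand + 1) % n
    return None                         # no page satisfies the constraint
```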
System Overview (recap)
• The scheme effectively balances page write counts across processes.
• Remapping is only needed to handle an extremely hot virtual page.
• The two thresholds are redefined across the PCM lifetime.
Evaluation: Methodology
• Benchmarks: SPEC 2000/2006; memory traces collected with Pin.
• PCM main memory: 128MB with a wear-out limit of 10⁶ writes per cell.
• Y-to-M threshold: 10⁴ initially.
• M-to-O threshold: 2×10⁵ initially.
Compared with:
• no-WL: no wear leveling.
• Start Gap [1]: maintains an extra empty line; every 50 writes, swaps the empty line with the next line.
• Random Swap [2]: every 512 writes, swaps the last written page with a randomly selected page.
• Segment Swap [3]: groups pages into 1MB segments; swaps the hottest and coldest segments every 2×10⁵ writes.
[1] M. Qureshi, et al., "Enhancing lifetime and security of PCM-based main memory with start-gap wear leveling," in MICRO, 2009.
[2] A. Ferreira, et al., "Increasing PCM main memory lifetime," in DATE, 2010.
[3] P. Zhou, et al., "A Durable and Energy Efficient Main Memory Using Phase Change Memory Technology," in ISCA, 2009.
Benchmark groups:
Group1: leslie3d, omnetpp, vpr
Group2: gzip, gamess, hmmer
Group3: calculix, facerec, fma3d
Group4: gcc, gromacs, milc
Lifetime Improvement
Normalized Lifetime = Total LLC Writes / (Wear-out limit × Total pages)
[Figure: normalized lifetime for single-process runs (leslie3d, omnetpp, vpr, gzip, gamess, hmmer, calculix, facerec, fma3d, gcc, gromacs, milc) and multi-process runs (Groups 1-4), comparing no-WL, Proposed, StartGap, RandomSwap, and SegmentSwap]
• no-WL exploits only 1% of the potential writes.
• The proposed scheme exploits more than 98% of the potential PCM writes.
• The other schemes exploit up to 89% of the writes; the gap is mostly due to their extra remaps.
Overhead
Overhead = (Total Page Faults + Total Remaps) × 128 / Total LLC Writes,
where 128 = Page Size / LLC Block Size = 2¹² / 2⁵.

Benchmark   no-WL   Proposed   StartGap   RandomSwap   SegmentSwap
Group1      2%      7%         145%       37%          34%
Group2      21%     10%        179%       58%          45%
Group3      2%      3%         119%       123%         33%
Group4      2%      9%         141%       51%          34%

• no-WL overhead is due to page faults.
• The proposed scheme imposes at most 10% extra writes.
• StartGap imposes the most extra writes because of its high remap frequency.
Outline
• Introduction
• Prolonging PCM Limited Lifetime
• Addressing PCM Write Latency and Energy
• Reducing DWM Access Latency
• Summary and Future Works
DRAM vs PCM

                        DRAM cell        PCM cell
Read Latency (ns)       ~10              ~10
Read Energy (pJ/bit)    4.4              2.5
Write Latency (ns)      ~10              ~100
Write Energy (pJ/bit)   5.5              14 (set) ~ 20 (reset)
Write Endurance         10¹⁶             10⁶~10⁹
Leakage Power           High             Low
Scalability             Not below 20nm   Predicted 9nm

What if a memory combined DRAM's latency, energy, and endurance (~10ns read and write, 4.4/5.5 pJ/bit, 10¹⁶ endurance) with PCM's low leakage and predicted 9nm scalability?
One attractive solution is a DRAM-PCM hybrid architecture.
DRAM-PCM Hybrid Structures
[Figure: hierarchical organization; requests from the LLC first go to DRAM, and on a miss they go to the PCM address space]
• Uses a hierarchical organization (DRAM acts as a cache for PCM).
• DRAM size is about 3% of the PCM size.
• The access policy has two steps: first access DRAM; on a miss, access PCM (a sketch follows below).
Problem: the two-step access strategy degrades memory performance.
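Here is a minimal, self-contained Python sketch of that two-step policy, with a naive eviction used purely for illustration; the class and its fields are hypothetical stand-ins, not the dissertation's simulator.

```python
# A minimal sketch of the two-step access: probe the DRAM cache first; on a
# miss, fetch from PCM, install in DRAM, and possibly write a dirty victim back
# to PCM (the "critical" case that also wears the PCM).
class HybridMemory:
    def __init__(self, dram_capacity_blocks):
        self.dram = {}                     # addr -> {'data': ..., 'dirty': bool}
        self.capacity = dram_capacity_blocks
        self.pcm = {}                      # backing store: addr -> data
        self.pcm_writes = 0

    def access(self, addr, is_write, data=None):
        if addr in self.dram:              # step 1: DRAM hit
            if is_write:
                self.dram[addr] = {'data': data, 'dirty': True}
            return self.dram[addr]['data']
        # step 2: DRAM miss, go to PCM
        block = self.pcm.get(addr, 0)
        if len(self.dram) >= self.capacity:
            victim_addr, victim = self.dram.popitem()   # naive eviction policy
            if victim['dirty']:
                self.pcm[victim_addr] = victim['data']  # critical: extra PCM write
                self.pcm_writes += 1
        self.dram[addr] = {'data': data if is_write else block,
                           'dirty': is_write}
        return block
```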
Related Work on the Hierarchical Architecture
[1] M. Qureshi, et al., "Scalable High Performance Main Memory System Using Phase-Change Memory Technology," in ISCA, 2009.
[2] A. Ferreira, et al., "Increasing PCM main memory lifetime," in DATE, 2010.
[3] H. G. Lee, et al., "An Energy- and Performance-Aware DRAM Cache Architecture for Hybrid DRAM/PCM Main Memory Systems," in ICCD, 2011.
• First hierarchical DRAM-PCM [1]: DRAM misses limit its performance.
• Clean-first replacement [2]: lower PCM writes, but a higher DRAM miss rate.
• Miss penalty reduction [3]: lower miss latency, but a higher DRAM miss rate.
Criticality of DRAM Misses
Goal: reduce DRAM misses, more specifically those misses that generate PCM writes.
Operations upon a DRAM miss: DRAM read, PCM read, DRAM write, and possibly a PCM write.
Affected system metrics: performance, energy, and endurance.
Distribution of Conflict Misses
• DRAM conflict misses: misses that hit in PCM.
• Critical conflicts: conflict misses that generate a writeback to PCM.
[Figure: number of critical conflicts per DRAM set for omnetpp, h264ref, vpr, and vortex]
There is high variation across DRAM sets!
A Motivating Example
Consider a direct-mapped DRAM with two sets. [Figure: pages A, B, C, and D accessed over time under two different mappings to the two sets; one mapping sends the interfering pages to the same set and causes repeated conflicts, while the other causes no conflicts.]
Idea: when bringing a page into DRAM, map it to a less conflicting set.
How Is the DRAM Set Determined?
[Figure: the OS translates the virtual page number into a physical page number (2ⁿ pages in PCM) upon a page fault; the low-order m bits of the physical page number give the DRAM set number.]
The DRAM set of a page therefore follows from the virtual-to-physical page mapping, which is determined by the OS upon a page fault. The proposed scheme exploits this flexibility: the OS picks a physical page whose low-order bits map the faulting page to a less conflicting DRAM set (a sketch follows below).
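A minimal Python sketch of that bit-level relationship, assuming a power-of-two number of DRAM sets; the helper names and the fallback policy are illustrative only.

```python
# Illustrative sketch: the low-order m bits of the physical page number select
# the DRAM set, so the OS can steer a faulting virtual page toward a chosen set
# by picking a suitable free physical page.
def dram_set_of(physical_page_number, num_dram_sets):
    return physical_page_number & (num_dram_sets - 1)   # low-order m bits

def pick_physical_page(free_pages, target_set, num_dram_sets):
    # Prefer a free physical page that maps to the (less conflicting) target set.
    for ppn in free_pages:
        if dram_set_of(ppn, num_dram_sets) == target_set:
            return ppn
    return free_pages[0]       # fall back to any free page (assumes one exists)
```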
How to Identify a Less Conflicting Set?
• A 2-bit saturating counter per DRAM set is used.
• Counter size is a trade-off: larger counters mean higher overhead (power and storage), smaller counters mean lower accuracy.
• For a 2MB, 4-way associative DRAM, the storage overhead is 256 bits out of 2MB (< 0.01%).
• A higher priority is given to critical conflicts.
[Flowchart: on a DRAM miss, if the access hits in PCM the set's counter is incremented; if it also causes a writeback to PCM (a critical conflict), the counter is incremented again. The updates happen in parallel with the data transfer, so there is no performance overhead.]
A counter-update sketch follows below.
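A minimal Python sketch of the per-set saturating counters, following the increment rules above; weighting a critical conflict as two increments reflects the flowchart, while the class layout itself is an assumption.

```python
# Per-set 2-bit saturating conflict counters: each DRAM miss that hits in PCM
# increments the set's counter, and a critical conflict (one that also writes
# back to PCM) counts twice. The least-conflicting set has the smallest value.
class ConflictCounters:
    def __init__(self, num_sets, bits=2):
        self.max_val = (1 << bits) - 1
        self.counters = [0] * num_sets

    def record_miss(self, set_idx, hit_in_pcm, writeback_to_pcm):
        if not hit_in_pcm:
            return                                   # not a conflict miss
        inc = 2 if writeback_to_pcm else 1           # critical conflicts weigh more
        self.counters[set_idx] = min(self.max_val, self.counters[set_idx] + inc)

    def least_conflicting_set(self):
        return min(range(len(self.counters)), key=self.counters.__getitem__)
```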
Look for a Page More Efficiently
Previously: the reference (R) bits of all in-use pages are searched by the conventional clock algorithm.
Proposed algorithm (a sketch follows below):
• Group the PCM pages mapped to the same DRAM set into a list.
• Use an outer clock over the DRAM sets (Clock_Set) and an inner clock over the pages of the selected set (Clock_Page), so only the list associated with the selected DRAM set is searched.
With n = total number of in-use pages and m = total number of DRAM sets:
• Complexity of the conventional clock: O(n)
• Complexity of our algorithm: O(m) + O(n/m)
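The sketch below illustrates this two-level search in Python, assuming pages are pre-grouped per DRAM set and each carries a referenced bit; the concrete data layout and sweep bounds are assumptions, not the dissertation's implementation.

```python
# Two-level clock: an outer hand (Clock_Set) walks the DRAM sets and an inner
# hand (Clock_Page) walks only the pages grouped under the selected set, giving
# O(m) + O(n/m) work instead of O(n).
def two_level_clock(pages_by_set, clock_set, clock_page):
    """pages_by_set: one list of page dicts (with a 'referenced' bit) per DRAM set.
    Returns (victim, new clock_set, new clock_page); victim is None if none found."""
    m = len(pages_by_set)
    for _ in range(2 * m):                     # outer hand over DRAM sets
        pages = pages_by_set[clock_set]
        if not pages:
            clock_set = (clock_set + 1) % m
            clock_page = 0
            continue
        for _ in range(2 * len(pages)):        # inner hand within one set
            page = pages[clock_page % len(pages)]
            if page['referenced']:
                page['referenced'] = False     # give a second chance
            else:
                return page, clock_set, (clock_page + 1) % len(pages)
            clock_page = (clock_page + 1) % len(pages)
        clock_set = (clock_set + 1) % m        # no victim here: advance to next set
        clock_page = 0
    return None, clock_set, clock_page
```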
Evaluation: Methodology
• Benchmarks: SPEC 2000/2006; memory traces collected with Pin.
• Last Level Cache (LLC): 1MB, 128B blocks, 4-way associative.
• PCM main memory: 1GB, 4KB page size.
• DRAM cache: 32MB, 4-way associative.
Compared with:
• Baseline [1]: basic hierarchical DRAM-PCM.
• N-Chance [2]: clean-first replacement in DRAM.
• Reduced Miss Penalty [3]: maintains an empty block per set in DRAM.
[1] M. Qureshi, et al., "Scalable High Performance Main Memory System Using Phase-Change Memory Technology," in ISCA, 2009.
[2] A. Ferreira, et al., "Increasing PCM main memory lifetime," in DATE, 2010.
[3] H. G. Lee, et al., "An Energy- and Performance-Aware DRAM Cache Architecture for Hybrid DRAM/PCM Main Memory Systems," in ICCD, 2011.
Groups of Benchmarks
Benchmarks
Group1 gromacs, h264ref, hmmer, leslie3d, omnetpp, sjeng, tonto, zeusmp
Group2 ammp, gromacs, leslie3d, mgrid, milc, sjeng, tonto, wupwise
Group3 bwaves, gzip, leslie3d, milc, sjeng, vortex, wupwise, zeusmp
Group4 facerec, gzip, h264ref, mgrid, omnetpp, tonto, vpr, wupwise
Group5 facerec, gamess, gcc, gzip, h264ref, vortex, vpr, zeusmp
Benchmarks are grouped to simulate multi-process environment
Balanced Critical Conflicts
[Figure: number of critical conflicts per DRAM set, baseline vs. proposed, for Groups 1-5]
The proposed scheme reduces the standard deviation of critical conflicts across sets by 64.78%.
Results – DRAM Misses and PCM Writes
[Figure: number of DRAM misses and number of PCM writes for Groups 1-5 under N-Chance, ReducedMissPenalty, and Proposed; all values are normalized to the baseline (basic hierarchical DRAM-PCM)]
• The proposed scheme reduces both DRAM misses and PCM writes:
  – 7% reduction in DRAM misses
  – 6% reduction in PCM writes
• N-Chance only reduces PCM writes and induces more misses.
• Reduced miss penalty increases both.
Outline
• Introduction
• Prolonging PCM Limited Lifetime
• Addressing PCM Write Latency and Energy
• Reducing DWM Access Latency
• Summary and Future Works
Domain Wall Memory (DWM)
• Made of hundreds of millions of ferromagnetic nanowires.
• Each nanowire stores many bits (domains).
• To read/write data, several access ports are placed at fixed positions.
• A shift operation is required to bring each bit under a port.
Access latency and energy are therefore affected by the shift operation.
[Figure: a nanowire with access ports and shift current applied at both ends]
Related Work on DWM
DWM has been explored at all levels of the memory hierarchy (registers, caches, main memory, storage):
• Registers: reduced shifts via register remapping and instruction scheduling [1].
• Caches: reduced shifts through compression techniques [3] and by adding read-only ports [2].
• Main memory: reduced shifts by storing data vertically [4].
[1] M. Mao, et al., "Exploration of GPGPU Register File Architecture Using Domain-wall-shift-write based Racetrack Memory," in DAC, 2014.
[2] R. Venkatesan, et al., "TapeCache: A High Density, Energy Efficient Cache Based on Domain Wall Memory," in ISLPED, 2012.
[3] H. Xu, et al., "Multilane Racetrack Caches: Improving Efficiency through Compression and Independent Shifting," in ASP-DAC, 2015.
[4] Q. Hu, et al., "Exploring Main Memory Design Based on Racetrack Memory Technology," in GLVLSI, 2016.
My work differs from them by considering metadata accesses, specifically the page table, via
(1) reducing shifts, and
(2) leveraging the access port position for metadata interpretation.
Page Table
• Contains the virtual-to-physical page address mapping.
• Modern systems adopt a hierarchical page table.
[Figure: the 4-level walk; CR3 points to the Level 4 (PML4) table, and the virtual address is split into PML4 index (bits 47-39), directory pointer index (bits 38-30), directory index (bits 29-21), table index (bits 20-12), and page offset (bits 11-0); the levels are walked in main memory down to the physical page number.]
A sketch of this index split follows below.
The latency of this 4-step process can be as much as 50% of total execution time! [1]
[1] V. Karakostas, et al., "Redundant Memory Mappings for Fast Access to Large Memories," in ISCA, 2015.
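As a concrete illustration of the index split above, here is a small Python sketch; the bit ranges are the standard x86-64 4-level split shown on the slide.

```python
# Each level's index is a 9-bit slice of the 48-bit virtual address,
# and the page offset is the low 12 bits (4KB pages).
def split_virtual_address(va):
    return {
        "pml4_idx":    (va >> 39) & 0x1FF,
        "dir_ptr_idx": (va >> 30) & 0x1FF,
        "dir_idx":     (va >> 21) & 0x1FF,
        "table_idx":   (va >> 12) & 0x1FF,
        "page_offset": va & 0xFFF,
    }
```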
Page Table Entry (PTE)
A PTE consists of control bits and address bits: V (valid), R/W, R (referenced), U/S, PWT, PCD, D (dirty), PAT, G, available bits, and the physical address. The most frequently accessed control bits are V, R, and D, which the following slides focus on.
• The Translation Lookaside Buffer (TLB) stores recently accessed PTEs.
  – Today's systems suffer more TLB misses [1].
[1] V. Karakostas, et al., "Redundant Memory Mappings for Fast Access to Large Memories," in ISCA, 2015.
Intuitive Read/Write of a PTE
[Figure: a PTE (V, R, D, address bits) laid out along a DWM track between access ports]
Currently, upon each access to the page table:
• The first bit of the PTE is aligned to the access port.
• The whole entry is read or written.
#Shifts = 2 × MaxShift, where MaxShift is the distance between two neighbouring ports.
My observation: there is no need to read the entire PTE.
Observation on PTE States
In a system with a TLB, a PTE has three states: In TLB (cached in the TLB and present in main memory), Not in TLB (present in main memory only), and Not in Memory.
[State diagram: "brought to memory" and "missed in TLB" move a PTE into In TLB; "evicted from TLB" moves it to Not in TLB; "evicted from memory" moves it to Not in Memory.]

Transition                      Action   Accessed field in PTE
Not in Memory → In TLB          Write    The entire entry
In TLB → Not in TLB             Update   Referenced (R) and/or Dirty (D) bit
Not in TLB → In TLB             Read     The entire entry
Not in TLB → Not in Memory      Update   Valid (V) bit
An Example
Consider the transition In TLB → Not in TLB, which only updates the Referenced (R) and/or Dirty (D) bit.
[Figure: a PTE (V, R, D, address) on a track; the shifts performed for this access split into three kinds]
• Alignment shifts: aligning the first bit with the access port.
• Extra shifts: shifting through bits that this transition does not access.
• Necessary shifts: shifting through the bits that it does access.
Pre-Alignment Technique
Pre-align the track based on the PTE's state:
• In TLB: pre-align to the R bit.
• Not in TLB: pre-align to the V bit.
• Not in Memory: pre-align to the bit adjacent to the V bit.
Two advantages:
(1) The alignment shifts are removed from the PTE access path.
(2) The state of the PTE is known prior to reading it.
A sketch of the rule, together with the per-transition accessed fields, follows below.
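A minimal Python sketch combining the transition table and the pre-alignment rule; the PTE field offsets and widths are illustrative assumptions, not the layout used in the dissertation.

```python
# Transition table from the slide: which PTE fields each state transition touches.
ACCESSED_FIELDS = {
    ("not_in_memory", "in_tlb"):        ["V", "R", "D", "address"],  # write entire entry
    ("in_tlb",        "not_in_tlb"):    ["R", "D"],                  # update R and/or D
    ("not_in_tlb",    "in_tlb"):        ["V", "R", "D", "address"],  # read entire entry
    ("not_in_tlb",    "not_in_memory"): ["V"],                       # update V
}

# Illustrative PTE layout as (bit offset, width); the real layout may differ.
PTE_FIELDS = {"V": (0, 1), "R": (1, 1), "D": (2, 1), "address": (3, 61)}

def pre_alignment_target(state):
    # Park the port on the bit the next expected access will need.
    if state == "in_tlb":
        return PTE_FIELDS["R"][0]        # TLB eviction updates R and/or D
    if state == "not_in_tlb":
        return PTE_FIELDS["V"][0]        # a TLB miss reads the entry from V
    return PTE_FIELDS["V"][0] + 1        # not in memory: bit adjacent to V (assumed)

def necessary_shifts(transition):
    # With pre-alignment, only the accessed bits must pass under the port.
    fields = ACCESSED_FIELDS[transition]
    first = min(PTE_FIELDS[f][0] for f in fields)
    last = max(PTE_FIELDS[f][0] + PTE_FIELDS[f][1] for f in fields)
    return last - first - 1              # shifts between first and last accessed bit
```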
Placement – Motivating Example
Multiple PTEs can share a track, but this can create a conflict of interest.
[Figure: PTE0 and PTE1 share a track. At the beginning both are Not in Memory; later PTE0 becomes In TLB while PTE1 remains Not in Memory, so the two PTEs now prefer different pre-alignment positions for the shared port.]
Idea: place PTEs with minimum conflict on the same track.
Placement
PTEs are stored at the granularity of pages (PTPages).
• PTEs in the same PTPage have high spatial locality ⇒ highly conflicting.
• PTEs of different PTPages have less spatial locality ⇒ less conflicting.
Horizontal placement: consecutive PTEs of a PTPage are packed onto the same track (Track0 holds PTE0 and PTE1 of PTPage0, Track1 holds PTE2 and PTE3, and so on), so highly conflicting PTEs share a track.
Vertical placement (proposed): the PTEs are rearranged across the tracks so that the PTEs sharing a track are not neighbouring entries of the same PTPage, keeping the highly conflicting PTEs apart.
Evaluation: Methodology
• Benchmarks: SPEC 2000/2006; TLB traces collected with Pin.
• Instruction and data TLBs: 4-way associative, 128 entries each.
• Main memory: 4GB, 4KB page size.
• Page table reserved size: 2MB.
Compared with:
• DWM-based baseline: DWM main memory, no pre-alignment, horizontal placement.
• Traditional DRAM-based baseline: DRAM main memory.
Groups of Benchmarks
Benchmarks
Group1 gemsFDTD, gobmk, gromacs, h264ref
Group2 astar, bwaves, bzip, cactus
Group3 calculix, deal, gamess, gcc
Group4 hmmer, zeusmp, leslie3d, libquantum
Group5 mcf, milc, namd, omnetpp
Group6 perlbench, sjeng, soplex, tonto
Benchmarks are grouped to simulate multi-process environment
Comparison with the DWM-based Baseline
[Figure: latency of address translation, latency and energy of context switching, and energy of address translation for Groups 1-6, comparing BaseLine, Proposed, and (where shown) the Lower bound]
• Latency of address translation: 69% reduction.
• Energy of address translation: 71% reduction.
• Latency and energy of context switching: 15% reduction.
• Lower bound = necessary shifts only.
Outline
• Introduction
• Prolonging PCM Limited Lifetime
• Addressing PCM Write Latency and Energy
• Reducing DWM Access Latency
• Summary and Future Works
Summary
• Showed the necessity of changing the main memory technology.
• Introduced two candidates:
  − Phase Change Memory (PCM)
  − Domain Wall Memory (DWM)
• Prolonged PCM main memory lifetime
  − Proposed a segment-aware (age-aware) page allocation.
• Overcame the write limitations of PCM
  − Proposed a conflict-aware page allocation for the PCM-DRAM hybrid main memory.
• Improved the performance and energy of page table accesses
  − Proposed a pre-alignment technique as well as a new PTE placement in DWM main memory.
Future Work
[Figure: the memory hierarchy (registers, caches, main memory, storage) annotated with the completed work on PCM-based and DWM-based main memory and the proposed future work on a DWM-based stack architecture and an STT-RAM-based cache]
Future Work – DWM-based Stack Architecture
Energy harvesting devices are becoming popular.
Concerns: (1) intermittent power supply; (2) devices tend to get smaller every day.
Solutions: non-volatile register files [1], or an alternative architecture such as a stack architecture (my proposal); DWM is a good choice for the stack.
Goal: improve the reliability, energy, and execution time of energy harvesting devices.
[1] K. Ma, et al., "Nonvolatile Processor Architecture Exploration for Energy-Harvesting Applications," in MICRO, 2015.
Future Work – STT-RAM-based Cache
• Over 50% of chip area is devoted to caches.
• STT-RAM is more scalable than SRAM.
• STT-RAM has a long write latency and high write energy; however, both can be traded off against the retention (non-volatility) time.

Retention Time          10 years     1 sec        10 ms
Write Latency @2GHz     22 cycles    12 cycles    6 cycles
[Jog, DAC'12]

Goal: explore the flexibility of the cache replacement algorithm to reduce the required retention time and improve access latency and energy.
Thank You!
Q&A