csc 2224: parallel computer architecture and programming
TRANSCRIPT
![Page 1: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/1.jpg)
CSC 2224: Parallel Computer Architecture and ProgrammingMain Memory Fundamentals
Prof. Gennady Pekhimenko
University of Toronto
Fall 2020
The content of this lecture is adapted from the slides of Vivek Seshadri, Donghyuk Lee, Yoongu Kim, and lectures of Onur Mutlu @ ETH and CMU
![Page 2: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/2.jpg)
Review #4
• RowClone: Fast and Energy-Efficient
In-DRAM Bulk Data Copy and Initialization
Vivek Seshadri et al., MICRO 2013
2
![Page 3: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/3.jpg)
Why Is Memory So Important?
(Especially Today)
![Page 4: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/4.jpg)
The Performance Perspective• “It’s the Memory, Stupid!” (Richard Sites, MPR, 1996)
Mutlu+, “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors,” HPCA 2003.
![Page 5: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/5.jpg)
The Energy Perspective
5
Dally, HiPEAC 2015
![Page 6: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/6.jpg)
The Energy Perspective
6
Dally, HiPEAC 2015
A memory access consumes ~1000X the energy of a complex addition
![Page 7: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/7.jpg)
The Reliability Perspective• Data from all of Facebook’s servers worldwide• Meza+, “Revisiting Memory Errors in Large-Scale Production Data Centers,” DSN’15.
7
![Page 8: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/8.jpg)
The Security Perspective
8
![Page 9: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/9.jpg)
Why Is DRAM So Slow?
![Page 10: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/10.jpg)
Motivation (by Vivek Seshadri)
Conversation with a friend from Stanford
Why is DRAM so slow?!
Really?
50 nanoseconds to serve one request? Is that a fundamental limit to DRAM’s access latency?
Him Vivek
10
![Page 11: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/11.jpg)
Understanding DRAM
What is in here?
11
DRAM
Problems related to today’s DRAM design
Solutions proposed by our research
![Page 12: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/12.jpg)
Outline
1. What is DRAM?
2. DRAM Internal Organization
3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013,
Adaptive-Latency DRAM, HPCA 2015)
– Parallelism (Subarray-level Parallelism, ISCA 2012)
12
![Page 13: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/13.jpg)
What is DRAM?
DRAM – Dynamic Random Access Memory
3455434
Array of Values
43543
98734
0
847
42
873909
1729
WRITE address, value
READ address
Accessing any location takes the same amount of time
Data needs to be constantly refreshed
13
![Page 14: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/14.jpg)
DRAM in Today’s Systems
14
Why DRAM? Why not some other memory?
![Page 15: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/15.jpg)
Von Neumann Model
Processor Memory
Program Data
T1 = Read Data[1]
T2 = Read Data[2]
T3 = T1 + T2
Write Data[3], T3
3
4
8
2
T1 = Read Data[1]
44 instruction accesses +3 data accesses
Memory performance is important15
![Page 16: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/16.jpg)
Factors that Affect Choice of Memory
1. Speed
– Should be reasonably fast compared to processor
2. Capacity
– Should be large enough to fit programs and data
3. Cost
– Should be cheap
16
![Page 17: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/17.jpg)
Why DRAM?
Access Latency
Co
stFlip-flops
SRAM
DRAM
FlashDisk
Higher Cost
Higher access latency
Favorable point in the trade-off spectrum
17
![Page 18: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/18.jpg)
Is DRAM Fast Enough?
Processor Commodity DRAM
Request
3 GHz, 2 Instructions / cycle
300 Instructions
300 Instructions
300 Instructions
300 Instructions
Independent programs
Request
Request
Request
Served in parallel?
Core Core
Core Core
18
Latency
Parallelism
50ns
![Page 19: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/19.jpg)
Outline
1. What is DRAM?
2. DRAM Internal Organization
3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013,
Adaptive-Latency DRAM, HPCA 2015)
– Parallelism (Subarray-level Parallelism, ISCA 2012)
19
![Page 20: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/20.jpg)
processor
main memory
peripheral logic
bank
mat
high latency
DRAM Organization
![Page 21: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/21.jpg)
DRAM Cell Array: Mat
peripheral logic
cell
matmat
sense amplifierw
ord
line
driver
cell
wordline
bitline
![Page 22: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/22.jpg)
Memory Element (Cell)
0 1
Component that can be in at least two states
Can be electrical, magnetic, mechanical, etc.
DRAM → Capacitor22
![Page 23: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/23.jpg)
Capacitor – Bucket of Electric Charge
Charge
“Charged” “Discharged”
0 0
Vmax
0Voltage level indicator. Not part of the actual circuit
23
![Page 24: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/24.jpg)
DRAM Chip
Contains billions of capacitors (also called cells)
0
?
24
DRAM Chip
![Page 25: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/25.jpg)
Divide and Conquer
DRAM Bank
?
25
1. Weak
2. Reading destroys state
Challenges
![Page 26: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/26.jpg)
Sense-Amplifier
Vmax
0
Vx
Vy
Vx > Vy
Vx
Vy
Vx+δ
Vx+δ
Vy-δ
Amplify the difference Stable state
26
![Page 27: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/27.jpg)
Outline
1. What is DRAM?
2. DRAM Internal Organization
– DRAM Cell
– DRAM Array
– DRAM Bank
3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013)
– Parallelism (Subarray-level Parallelism, ISCA 2012)27
![Page 28: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/28.jpg)
DRAM Cell Read Operation
Vmax/2
Vmax/2
0
Vmax/2 + δ
0
VmaxVmaxVmax/2 + δ
DRAMCell
Sense-Amplifier
Cell loses charge
Amplify the differenceRestore
Cell Data
28
Access data
12
![Page 29: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/29.jpg)
DRAM Cell Read Operation
0
DRAMCell
Sense-Amplifier
Control Signal
DRAMCell
Address
29
![Page 30: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/30.jpg)
Outline
1. What is DRAM?
2. DRAM Internal Organization– DRAM Cell
– DRAM Array
– DRAM Bank
3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013;
Adaptive-Latency DRAM, HPCA 2015)
– Parallelism (Subarray-level Parallelism, ISCA 2012)30
![Page 31: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/31.jpg)
Problem
Sense-Amplifier
Control Signal
DRAMCell
Address
Much larger than a cell
31
![Page 32: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/32.jpg)
Cost AmortizationBitline
Wordline
32
![Page 33: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/33.jpg)
DRAM Array Operation
1 0 0 1 1 0 1 0 0 1 0 1
½ ½ ½ ½ ½ ½ ½ ½ ½ ½ ½ ½
Ro
w D
eco
der
1
Ro
w A
dd
ress
? ? ? ? ? ? ? ? ? ? ??
↑
½↑
½↑
½↑
½↑
½↑
½½↓
½↓
½↓
½↓
½↓
½↓
m
1 0 0 1 1 0 1 0 0 1 0 1
Sensing & Amplification
Charge RestorationWordlineDisableCharge SharingWordline Enable
Access Data (column n)
Restore Sense-Amplifier
33
m
n
Sense-Amplifiers
![Page 34: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/34.jpg)
Outline
1. What is DRAM?
2. DRAM Internal Organization– DRAM Cell
– DRAM Array
– DRAM Bank
3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013
Adaptive-Latency DRAM, HPCA 2015)
– Parallelism (Subarray-level Parallelism, ISCA 2012)34
![Page 35: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/35.jpg)
DRAM Bank
35
How to build a DRAM bank from a DRAM array?
![Page 36: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/36.jpg)
DRAM Bank: Single DRAM Array?
36
10
00
0+
row
s
Ro
w D
eco
der
1 0 0 1 1 0 0 1
Long bitlineDifficult to sense data
in far away cells
Ro
w A
dd
ress
![Page 37: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/37.jpg)
DRAM Bank: Collection of Arrays
37
Ro
w D
eco
der
1 0 0 1 1 0 0 1
Ro
w D
eco
der
Row Address Column Read/Write
Data
Subarray
![Page 38: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/38.jpg)
DRAM Operation: Summary
38
Ro
w D
eco
der
1 0 0 1 1 0 0 1
Ro
w D
eco
der
Row Address Column Read/Write
Row m, Col n
m
1. Enable row m
2. Access col n
3. Close row
n
m
![Page 39: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/39.jpg)
DRAM Chip Hierarchy
39
Ro
w D
eco
der
1 0 0 1 1 0 0 1
Ro
w D
eco
der
Row Address Column Read/Write
Collection of Banks
Collection of Subarrays
![Page 40: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/40.jpg)
Outline
1. What is DRAM?
2. DRAM Internal Organization
3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013;
Adaptive-Latency DRAM, HPCA 2015)
– Parallelism (Subarray-level Parallelism, ISCA 2012)
40
![Page 41: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/41.jpg)
Factors That Affect Performance
1. Latency
– How fast can DRAM serve a request?
2. Parallelism
– How many requests can DRAM serve in parallel?
41
![Page 42: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/42.jpg)
DRAM Chip Hierarchy
42
Ro
w D
eco
der
1 0 0 1 1 0 0 1
Ro
w D
eco
der
Row Address Column Read/Write
Collection of Banks
Collection of Subarrays
Latency
Parallelism
![Page 43: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/43.jpg)
Outline
1. What is DRAM?
2. DRAM Internal Organization
3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013;
Adaptive-Latency DRAM, HPCA 2015)
– Parallelism (Subarray-level Parallelism, ISCA 2012)
43
![Page 44: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/44.jpg)
Subarray Size: Rows/Subarray
44
Number of rows in subarray
Latency Chip Area
![Page 45: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/45.jpg)
Subarray Size vs. Access Latency
45
Smaller subarrays => lower access latency
Shorter Bitlines => Faster access
![Page 46: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/46.jpg)
Subarray Size vs. Chip Area
Large Subarray Smaller Subarrays
46
AreaCost
Smaller subarrays => larger chip area
![Page 47: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/47.jpg)
Chip Area vs. Access Latency
0
1
2
3
4
0 10 20 30 40 50 60
No
rmal
ize
d C
hip
Are
a
Access Latency (ns)
Commodity DRAM(512 rows/subarray)
32 rows/subarray
47
Why is DRAM so slow?
![Page 48: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/48.jpg)
Chip Area vs. Access Latency
0
1
2
3
4
0 10 20 30 40 50 60
No
rmal
ize
d C
hip
Are
a
Access Latency (ns)
Commodity DRAM(512 rows/subarray)
32 rows/subarray
?
How to enable low latency without high area overhead?
48
![Page 49: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/49.jpg)
New Proposal
Large Subarray Small SubarrayOur Proposal49
Low area cost
![Page 50: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/50.jpg)
Tiered-Latency DRAM
Near Segment
Far Segment
+ Lower access latency+ Lower energy/access
- Higher access latency- Higher energy/access
Map frequently accessed data to near segment
50
![Page 51: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/51.jpg)
Results Summary
51
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Performance
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Energy Consumption
12%23%
Commodity DRAM Tiered-Latency DRAM
![Page 52: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/52.jpg)
Tiered-Latency DRAM:A Low Latency and Low Cost DRAM
Architecture
Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu
Published in the proceedings of 19th IEEE International Symposium on
High Performance Computer Architecture 2013
52
![Page 53: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/53.jpg)
DRAM Stores Data as Charge
1. Sensing2. Restore3. Precharge
DRAM cell
Sense amplifier
Three steps of charge movement
![Page 54: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/54.jpg)
Sensing RestoreTiming Parameters
Data 0
Data 1
cell
time
char
ge
Sense amplifier
DRAM Charge over Time
Why does DRAM need the extra timing margin?
In theory margin
cell
Sense amplifier
In practice
![Page 55: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/55.jpg)
1. Process Variation – DRAM cells are not equal
– Leads to extra timing margin for cell that can store small amount of charge
2. Temperature Dependence – DRAM leaks more charge at higher temperature
– Leads to extra timing margin when operating at low temperature
Two Reasons for Timing Margin
1. Process Variation – DRAM cells are not equal
– Leads to extra timing margin for cell that can store small amount of charge;
1. Process Variation – DRAM cells are not equal
– Leads to extra timing margin for cells that can store large amount of charge
![Page 56: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/56.jpg)
Same size →Same charge →
Different size →Different charge →
Same latency Different latency
Large variation in cell size →Large variation in charge →Large variation in access latency
DRAM Cells are Not EqualRealIdeal
Largest cell
Smallest cell
![Page 57: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/57.jpg)
1. Process Variation – DRAM cells are not equal
– Leads to extra timing margin for cells that can store large amount of charge
2. Temperature Dependence – DRAM leaks more charge at higher temperature
– Leads to extra timing margin when operating at low temperature
Two Reasons for Timing Margin
2. Temperature Dependence – DRAM leaks more charge at higher temperature
– Leads to extra timing margin when operating at low temperature
2. Temperature Dependence – DRAM leaks more charge at higher temperature
– Leads to extra timing margin when operating at low temperature
![Page 58: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/58.jpg)
Cells store small charge at high temperature and large charge at low temperature → Large variation in access latency
Charge Leakage ∝ Temperature
Room Temp. Hot Temp. (85°C)
Small leakage Large leakage
![Page 59: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/59.jpg)
DRAM Timing Parameters
• DRAM timing parameters are dictated by the worst case
– The smallest cell with the smallest charge in all DRAM products
– Operating at the highest temperature
• Large timing margin for the common case
→ Can lower latency for the common case
![Page 60: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/60.jpg)
TemperatureController
PC
HeaterFPGAs FPGAs
DRAM Testing Infrastructure
![Page 61: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/61.jpg)
Typical DIMM at Low Temperature
Obs 1. Faster Sensing
More charge
Strong chargeflow
Faster sensing
Typical DIMM at Low Temperature→More charge → Faster sensing
Timing(tRCD)
17% ↓No Errors
115 DIMM characterization
![Page 62: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/62.jpg)
Obs 2. Reducing Restore Time
Larger cell & Less leakage ➔Extra charge
No need to fullyrestore charge
Typical DIMM at lower temperature→More charge → Restore time reduction
Read (tRAS)
37% ↓Write (tWR)
54% ↓No Errors
115 DIMM characterization
Typical DIMM at Low Temperature
![Page 63: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/63.jpg)
Empty (0V)
Full (Vdd)
Half
Obs 3. Reducing Precharge Time
Bit
line
Sense amplifier
Sensing Precharge
Precharge ? – Setting bitline to half-full charge
Typical DIMM at Low Temperature
![Page 64: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/64.jpg)
Empty (0V) Full (Vdd)
Half
bitline
Not fully precharged
More charge→ strong sensing
Access empty cell Access full cell
Timing(tRP)
35% ↓No Errors
115 DIMM characterization
Typical DIMM at Lower Temperature→More charge → Precharge time reduction
Obs 3. Reducing Precharge Time
![Page 65: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/65.jpg)
Adaptive-Latency DRAM
• Key idea– Optimize DRAM timing parameters online
• Two components– DRAM manufacturer profiles multiple sets of reliable DRAM timing parameters at different temperatures for each DIMM– System monitors DRAM temperature & uses appropriate
DRAM timing parameters
reliable DRAM timing parameters
DRAM temperature
![Page 66: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/66.jpg)
0%5%
10%15%20%25%
sop
lex
mcf
milc
libq
lbm
gem
s
cop
y
s.cl
ust
er
gup
s
no
n-i
nte
nsi
ve
inte
nsi
ve
all-
wo
rklo
ads
Single Core Multi Core
0%5%
10%15%20%25%
sop
lex
mcf
milc
libq
lbm
gem
s
cop
y
s.cl
ust
er
gup
s
no
n-i
nte
nsi
ve
inte
nsi
ve
all-
wo
rklo
ads
Single Core Multi Core
0%5%
10%15%20%25%
sop
lex
mcf
milc
libq
lbm
gem
s
cop
y
s.cl
ust
er
gup
s
no
n-i
nte
nsi
ve
inte
nsi
ve
all-
wo
rklo
ads
Single Core Multi Core
14.0%
2.9%0%5%
10%15%20%25%
sop
lex
mcf
milc
libq
lbm
gem
s
cop
y
s.cl
ust
er
gup
s
no
n-i
nte
nsi
ve
inte
nsi
ve
all-
wo
rklo
ads
Single Core Multi-Core
10.4%
Real System Evaluation
AL-DRAM provides high performance improvement, greater for multi-core workloads
Perf
orm
ance
Imp
rove
me
nt Average
Improvement
all-
35
-wo
rklo
ad
![Page 67: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/67.jpg)
Summary: AL-DRAM
• Observation– DRAM timing parameters are dictated by the worst-case cell
(smallest cell at highest temperature)
• Our Approach: Adaptive-Latency DRAM (AL-DRAM) – Optimizes DRAM timing parameters for the common case
(typical DIMM operating at low temperatures)
• Analysis: Characterization of 115 DIMMs– Great potential to lower DRAM timing parameters (17 –
54%) without any errors
• Real System Performance Evaluation – Significant performance improvement (14% for memory-
intensive workloads) without errors (33 days)
![Page 68: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/68.jpg)
Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case
Donghyuk Lee, Yoongu Kim,
Gennady Pekhimenko, Samira Khan, VivekSeshadri, Kevin Chang, and Onur Mutlu
Published in the proceedings of 21st
International Symposium on High Performance Computer Architecture 2015
68
![Page 69: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/69.jpg)
Outline
1. What is DRAM?
2. DRAM Internal Organization
3. Problems and Solutions– Latency (Tiered-Latency DRAM, HPCA 2013;
Adaptive-Latency DRAM, HPCA 2015)
– Parallelism (Subarray-level Parallelism, ISCA 2012)
69
![Page 70: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/70.jpg)
Parallelism: Demand vs. Supply
Demand Supply
Out-of-order Execution
Multi-cores
Prefetchers
70
MultipleBanks
![Page 71: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/71.jpg)
Increasing Number of Banks?
How to improve available parallelism within DRAM?
Adding more banks → Replication of shared structures
Replication → Cost
71
![Page 72: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/72.jpg)
Our Observation
72
Time
1. Wordline enable
2. Charge sharing
3. Sense amplify
4. Charge restoration
5. Wordline disable
6. Restore sense-amps
Data Access
Local to a subarray
![Page 73: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/73.jpg)
Subarray-Level Parallelism
73
Ro
w D
eco
der
1 0 0 1 1 0 0 1
Ro
w D
eco
der
Row Address Column Read/Write
Ro
w A
dd
ress
Ro
w A
dd
ress
Replicate
Time share
![Page 74: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/74.jpg)
Subarray-Level Parallelism: Benefits
74
Time
Data Access
Data Access
Data Access
Data Access Saved Time
Commodity DRAM
Subarray-Level Parallelism
Two requests to different subarraysin same bank
![Page 75: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/75.jpg)
Results Summary
75
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Performance
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Energy Consumption
Commodity DRAM Subarray-Level Parallelism
17%
19%
![Page 76: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/76.jpg)
A Case for Exploiting Subarray-Level Parallelism (SALP) in DRAM
Yoongu Kim, Vivek Seshadri, Donghyuk Lee,Jamie Liu, Onur Mutlu
Published in the proceedings of 39th
International Symposium on Computer Architecture 2012
76
![Page 77: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/77.jpg)
Review #4
• RowClone: Fast and Energy-Efficient
In-DRAM Bulk Data Copy and Initialization
Vivek Seshadri et al., MICRO 2013
77
![Page 78: CSC 2224: Parallel Computer Architecture and Programming](https://reader033.vdocuments.site/reader033/viewer/2022042219/625a1896f857b55ce2275df8/html5/thumbnails/78.jpg)
CSC 2224: Parallel Computer Architecture and ProgrammingMain Memory Fundamentals
Prof. Gennady Pekhimenko
University of Toronto
Fall 2020
The content of this lecture is adapted from the slides of Vivek Seshadri, Donghyuk Lee, Yoongu Kim, and lectures of Onur Mutlu @ ETH and CMU