gather-scatter dram - eth z€¦ · gather-scatter dram in-dram address translation to improve the...
TRANSCRIPT
![Page 1: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/1.jpg)
Gather-Scatter DRAMIn-DRAM Address Translation to Improve the
Spatial Locality of Non-unit Strided Accesses
Vivek SeshadriThomas Mullins, Amirali Boroumand, Onur Mutlu,
Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry
![Page 2: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/2.jpg)
Executive summary
• Problem: Non-unit strided accesses– Present in many applications
– In-efficient in cache-line-optimized memory systems
• Our Proposal: Gather-Scatter DRAM– Gather/scatter values of strided access from multiple chips
– Ideal memory bandwidth/cache utilization for power-of-2 strides
– Requires very few changes to the DRAM module
• Results– In-memory databases: the best of both row store and column store
– Matrix multiplication: Eliminates software gather for SIMD optimizations
2
![Page 3: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/3.jpg)
Strided access pattern
3
Physical layout of the data structure (row store)
Record 1
Record 2
Record n
In-Memory
Database
Table
Field 1 Field 3
![Page 4: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/4.jpg)
Shortcomings of existing systems
4
Cache Line
Data unnecessarilytransferred on the
memory channel
and stored in on-
chip cache
High latency
Wasted bandwidth
Wasted cache space
High energy
![Page 5: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/5.jpg)
Prior approaches
Improving efficiency of fine-grained memory accesses
• Impulse Memory Controller (HPCA 1999)
• Adaptive/Dynamic Granularity Memory System (ISCA 2011/12)
Costly in a commodity system
• Modules that support fine-grained memory accesses
– E.g., mini-rank, threaded-memory module
• Sectored caches
5
![Page 6: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/6.jpg)
Goal: Eliminate inefficiency
6
Cache Line
Can we retrieve a only useful data?
Gather-Scatter DRAM
(Power-of-2 strides)
![Page 7: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/7.jpg)
DRAM modules have multiple chips
7
All chips within a “rank” operate in unison!
READ addr
Cache Line
?Two Challenges!
Data
Cmd/Addr
![Page 8: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/8.jpg)
Challenge 1: Chip conflicts
8
Data of each cache line is spread across all the chips!
Cache line 0
Cache line 1
Useful data mapped to only two chips!
![Page 9: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/9.jpg)
Challenge 2: Shared address bus
9
All chips share the same address bus!
No flexibility for the memory controller to
read different addresses from each chip!
One address bus for each chip is costly!
![Page 10: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/10.jpg)
Gather-Scatter DRAM
10
Column-ID-based data shuffling
(shuffle data of each cache line differently)
Pattern ID – In-DRAM address translation
(locally compute column address at each chip)
Challenge 1: Minimizing chip conflicts
Challenge 2: Shared address bus
![Page 11: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/11.jpg)
Column-ID-based data shuffling
11
Cache Line
Stage 1
Stage 2
Stage 3
Stage “n” enabled only if
nth LSB of column ID is set
DRAM Column Address
1 0 1
Ch
ip 0
Ch
ip 1
Ch
ip 2
Ch
ip 3
Ch
ip 4
Ch
ip 5
Ch
ip 6
Ch
ip 7
(implemented in the memory controller)
![Page 12: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/12.jpg)
Effect of data shuffling
12
Chip conflicts Minimal chip conflicts!
Col 0Col 1Col 2Col 3
Ch
ip 0
Ch
ip 1
Ch
ip 2
Ch
ip 3
Ch
ip 4
Ch
ip 5
Ch
ip 6
Ch
ip 7
Ch
ip 0
Ch
ip 1
Ch
ip 2
Ch
ip 3
Ch
ip 4
Ch
ip 5
Ch
ip 6
Ch
ip 7
Before shuffling After shuffling
Can be retrieved in a single command
![Page 13: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/13.jpg)
Gather-Scatter DRAM
13
Column-ID-based data shuffling
(shuffle data of each cache line differently)
Pattern ID – In-DRAM address translation
(locally compute the column address at each chip)
Challenge 1: Minimizing chip conflicts
Challenge 2: Shared address bus
![Page 14: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/14.jpg)
Per-chip column translation logic
14
READ addr, pattern
cmdaddr
pattern
CTL
AND
pattern
chip ID
addrcmd =
READ/WRITE
output address
XOR
![Page 15: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/15.jpg)
Gather-Scatter DRAM (GS-DRAM)
15
32 values contiguously stored in DRAM (at the start of a DRAM row)
read addr 0, pattern 0 (stride = 1, default operation)
read addr 0, pattern 1 (stride = 2)
read addr 0, pattern 3 (stride = 4)
read addr 0, pattern 7 (stride = 8)
![Page 16: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/16.jpg)
End-to-end system support for GS-DRAM
16
Memory
controller
Cache
Data
Store
Tag
Store
Pa
tte
rn I
D
CPU
New instructions:
pattload/pattstore
GS-DRAM
misscacheline(addr), patt
DRAM column(addr), patt
pattload reg, addr, patt
Support for coherence of
overlapping cache lines
![Page 17: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/17.jpg)
Methodology
• Simulator
– Gem5 x86 simulator
– Use “prefetch” instruction to implement pattern load
– Cache hierarchy
• 32KB L1 D/I cache, 2MB shared L2 cache
– Main Memory: DDR3-1600, 1 channel, 1 rank, 8 banks
• Energy evaluations
– McPAT + DRAMPower
• Workloads
– In-memory databases
– Matrix multiplication
17
![Page 18: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/18.jpg)
In-memory databases
18
Layouts Workloads
Row Store
Column Store
GS-DRAM
Transactions
Analytics
Hybrid
![Page 19: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/19.jpg)
Workload
• Database– 1 table with million records
– Each record = 1 cache line
• Transactions– Operate on a random record
– Varying number of read-only/write-only/read-write fields
• Analytics– Sum of one/two columns
• Hybrid– Transactions thread: random records with 1 read-only, 1
write-only
– Analytics thread: sum of one column
19
![Page 20: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/20.jpg)
Transaction throughput and energy
200
5
10
15
20
25
30
0
10
20
30
40
50
60
Th
rou
gh
pu
t(m
illi
on
s/se
con
d)
En
erg
y(m
J fo
r 1
00
00
tra
ns.
)
Row Store GS-DRAMColumn Store
3X
![Page 21: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/21.jpg)
Analytics performance and energy
21
Row Store GS-DRAMColumn Store
0.0
0.5
1.0
1.5
2.0
2.5
0
20
40
60
80
100
120
Exe
cuti
on
Tim
e(m
Se
c)
En
erg
y (
mJ)
2X
![Page 22: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/22.jpg)
Hybrid Transactions/Analytical Processing
22
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
Analytics
0
5
10
15
20
25
30
Transactions
Exe
cuti
on
Tim
e(m
Se
c)
Th
rou
gh
pu
t(m
illi
on
s/se
con
d)
Row Store GS-DRAMColumn Store
![Page 23: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/23.jpg)
Conclusion
• Problem: Non-unit strided accesses
– Present in many applications
– In-efficient in cache-line-optimized memory systems
• Our Proposal: Gather-Scatter DRAM
– Gather/scatter values of strided access from multiple chips
– Ideal memory bandwidth/cache utilization for power-of-2 strides
– Low DRAM Cost: Logic to perform two bitwise operations per chip
• Results
– In-memory databases: the best of both row store and column store
– Many more applications: scientific computation, key-value stores
23
![Page 24: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/24.jpg)
Gather-Scatter DRAMIn-DRAM Address Translation to Improve the
Spatial Locality of Non-unit Strided Accesses
Vivek SeshadriThomas Mullins, Amirali Boroumand, Onur Mutlu,
Phillip B. Gibbons, Michael A. Kozuch, Todd C. Mowry
![Page 25: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/25.jpg)
Backup
25
![Page 26: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/26.jpg)
Maintaining Cache Coherence
• Restrict each data structure to only two patterns
– Default pattern
– One additional strided pattern
• Additional invalidations on read-exclusive
requests
– Cache controller generates list of cache lines
overlapping with modified cache line
– Invalidates all overlapping cache lines
26
![Page 27: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/27.jpg)
Hybrid Transactions/Analytical Processing
27
0
5
10
15
20
25
30
w/o Pref. Pref.
0
2
4
6
8
10
w/o Pref. Pref.
Exe
cuti
on
Tim
e(m
Se
c)
Th
rou
gh
pu
t(m
illi
on
s/se
con
d)
Row Store GS-DRAMColumn Store
21
Transactions Analytics
![Page 28: Gather-Scatter DRAM - ETH Z€¦ · Gather-Scatter DRAM In-DRAM Address Translation to Improve the Spatial Locality of Non-unit Strided Accesses Vivek Seshadri Thomas Mullins, AmiraliBoroumand,](https://reader036.vdocuments.site/reader036/viewer/2022062507/5fcbeb6e68ffc358f0118f8c/html5/thumbnails/28.jpg)
Transactions Results
28
0
2
4
6
8
10
1-0-1 2-1-2 0-2-2 2-4-2 5-0-1 2-0-4 6-1-2 4-2-2
Exe
cuti
on
tim
e f
or
10
00
0
tra
ns.