prof. uri weiser,technion
TRANSCRIPT
![Page 1: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/1.jpg)
Professor Uri WeiserTechnionHaifa, Israel
Handling Memory Accesses in Big Data Environment
Chipex 2016
1The talk covers research done by: T. Horowitz , Prof. A. Kolodny, T. Morad, , Prof. A. Mendelson, Daniel Raskin, Gil Shomron, Loren Jamal, Prof. U. Weiser
![Page 2: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/2.jpg)
2
A New Architecture Avenues in Big Data Environment
The Era of Heterogeneous HW/SW fits application Dynamic tuning Accelerators performance, energy efficiency
Big Data = big In general non repeated access to all the
“Big Data” What are the implications?
![Page 3: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/3.jpg)
Heterogeneous computing :Application Specific Accelerators
Performance/power
Apps range
Continue performance trend by tuned architecture to bypass current technological hurdles
Perf
orm
ance
/pow
erAccelerators
3
Tuned architectures
Apps behavior
![Page 4: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/4.jpg)
4
A New Architecture Avenues in Big Data Environment
Heterogeneous computing – ”tuning” HW to respond to specific needs
example: Big Data memory access pattern Potential savings
Reduction of Data Movements and bypass DRAM
Bandwidth issue Potential solution
![Page 5: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/5.jpg)
Input: Unstructured data
Big Data usage of DATA
5
Read OnceNon-Temporal
Memory Access
Funnel
beta=BWout
BWin
![Page 6: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/6.jpg)
Structuring
Input: Unstructured data
Structured data (aggregation)
A
ML Model creation
Data structuring = ETL
C
B
C Model usage @ client
6
Machine Learning
![Page 7: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/7.jpg)
7
Does Big Data exhibit special memory access pattern?
It probably should since Revisiting ALL Big Data items will cause huge/slow
data transfers from Data sources There are 2 access modes of memory operations:
Temporal Memory Access Non-Temporal Memory access
Many Big Data computations exhibit a Non-Temporal Memory-Accesses and/or Funnel operation
![Page 8: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/8.jpg)
Non-Temporal Memory access Initial analysis: Hadoop-grep Single Memory Access Pattern
~50% of Hadoop-grep unique memory references are single access
8
![Page 9: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/9.jpg)
Non-Temporal Memory AccessesPreliminary Results
WordCount:Access to Storage:Non-temporal locality
Sort: Access to Storage:NO Non-temporal locality
0 5 10 15 20 25 30 35 40 45 500
1000020000300004000050000600007000080000
WordCount I/O Utilization
Time [s]
0 200 400 600 800 1000 12000
20000400006000080000
100000120000
SORT I/O
Time [s]
Access rate[KB/s]
Time
Time
9
Access rate[KB/s]
![Page 10: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/10.jpg)
10
Where energy is wasted?
• DRAM
• Limited BW
![Page 11: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/11.jpg)
From: Mark Horowitz, Stanford “Computing’s Energy Problems”
From: Bill Dally (nVidia and Stanford), Efficiency and Parallelism, the challenges of future computing11
![Page 12: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/12.jpg)
Energy:
DRAM
12
![Page 13: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/13.jpg)
Memory Subsystem - copies
L1$
L2$
LL Cache
DRAM
NV Storage
RegistersKBs
10’s KBs
MBs
TBs
GBs
10’s MBs
3GB/sec
25GB/sec
500GB/sec
TB/sec
Size
Core
BW- Source
Copy 1 (main memory)
Copy 2 (LL Cache)
Copy 3 (L2 Cache)
Copy 4 (L1 Cache)
Copy 5 (Registers) - Destination
13
![Page 14: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/14.jpg)
Memory Subsystem – DRAM bypass == DDIO
L1$
L2$
LL Cache
DRAM
NV Storage
Registers
3-20GB/sec
25GB/sec
500GB/sec
TB/sec
Core
BW- Source
Copy 1 (main memory)
Copy 2 (LL Cache)
Copy 3 (L2 Cache)
Copy 4 (L1 Cache)
Copy 5 (Registers) - Destination
Potential savings:
@ 0.5n J/B (DRAM)10 – 20 GB/s NV BW
5W – 10W
Reference: “Optimizing Read-Once Data Flow in Big-Data Applications”Morad, Ghomron, Erez, Weiser, Kolodny, in Computer Architecture Letters Journal 2016 14
![Page 15: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/15.jpg)
BandwidthWhen should we use Funnel at the Data source
15
![Page 16: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/16.jpg)
Memory Hierarchy is Optimized for A: Bandwidth issue System are built for Temporal Locality
16Highest Bandwidth
L1$
L2$
LLC Cache
DRAM
NV Storage
RegistersKBs
10’s KBs
MBs
TBs
GBs
10’s MBs
3-20GB/sec
25GB/sec
500GB/sec
TB/sec
Size
Core
BW Existing BW
NTMA Desired BW
![Page 17: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/17.jpg)
# of cores
Bandwidth[MB/s] INITIAL RESULT
S# of cores
CPUutilization
[%]
Bandwidth[MB/s]Read Once – Non-Temporal Memory Accesses
CPU Utilizations
SSD=CPU Bandwidth
# of cores
Bandwidth[MB/s]
CPUutilization
[%]Temporal Memory Accesses
# of cores
Bandwidth[MB/s]
Hint: Memory access per operation
B: Memory access per operation impact BW
CPU Utilizations
CPU Bandwidth
SSD Bandwidth
17
![Page 18: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/18.jpg)
Solution:
Flow of “Non-Temporal Data Accesses”
Core
L1$
L2$
LLC Cache
DRAM
NV Storage
Registers
Non-Temporal
Memory accesses
NTMA
Temporal locality
accesses
The Funnel
18
Use Funnel when Bandwidth bottleneck occurs- “high” memory accesses per Instruction- Limited BW- Non temporal locality memory access
*private communication with: Moinuddin Qureshi
![Page 19: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/19.jpg)
“Funnel”ing “Read-Once” data in storage
*Kang, Yangwook, Yang-suk Kee, Ethan L. Miller, and Chanik Park. "Enabling cost-effective data processing with smart ssd." In Mass Storage Systems and Technologies (MSST), 2013 IEEE 29th Symposium on, pp. 1-12. IEEE, 2013.**K. Eshghi and R. Micheloni. “SSD Architecture and PCI Express Interface”
Typical SDD architecture*
19
![Page 20: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/20.jpg)
Analytical model of the Funnel
Post process
Bandwidth (BW) IN
Bandwidth BW OUT
Funnel
B
B
= BWOUT/BWIN
20
![Page 21: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/21.jpg)
Purposed Architecture
21
PCIeTL
B
CPU performs NTMA and TMA work
SSD Storage
B
Funnel
B=Bandwidth
Baseline Configuration
PCIe
TLB
2,LcE
CPU performs TMA workSSD performs NTMA work
B
Funnel
Funnel Configurations
B
B B
21
![Page 22: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/22.jpg)
Funnel Performance
Perfo
rman
ce
impr
ovem
ent
CPU becomes bottleneck
drops as CPU becomes bottleneck
𝟏𝐏𝐂𝐈𝐞𝐁𝐖
𝟏𝐒𝐒𝐃𝐁𝐖
PCIeTL
B
CPU performs NTMA and TMA work
SSD Storage
B
Funnel
B=Bandwidth
PCIeTL
B
2,LcE
CPU performs: TMA work
SSD performs NTMA work
B
Funnel
beta beta
Pe
rfor
man
ce
22
![Page 23: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/23.jpg)
Funnel energy
Funnel improvement
SSD BW is the
bottleneck
CPU becomes the bottleneck
PCIe is the bottleneck
Energy grows as
performance
advantage shrinks
Funnel processor overhead
PCIeTL
B
CPU performs NTMA and TMA work
SSD Storage
B
Funnel
B=Bandwidth
PCIeTL
B
2,LcE
CPU performs TMA work
SSD performs NTMA work
B
Funnel
beta
En
ergy
CPU becomes the bottleneck
Funnel configuration
baseline configuration
23
![Page 24: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/24.jpg)
Solution: ?
Non-Temporal Memory Accesses should be processed as close as possible to the data sourceData that exhibit Temporal Locality should use current Memory HierarchyUse Machine Learning (context aware*) to distinguish between the two phasesOpen questions:
SW modelShared DataHW implementationComputational requirement at the “Funnel”
*Reference: “Semantic locality and Context based prefetching” Peled, Mannor, Weiser, Etsion in ISCA 2015 24
![Page 25: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/25.jpg)
Summary
Memory access is a critical path in computingFunnel should be used for:
Resolve BW systems’ bottleneck for specific applicationsSolve the System’s BW issues for “Read Once” cases
Reduction of Data movementFree up system’s memory resources (re-Spark)Simple-energy-efficient engines at the front endIssues
…
25
![Page 26: Prof. Uri Weiser,Technion](https://reader035.vdocuments.site/reader035/viewer/2022062412/58730c351a28ab99088b6e43/html5/thumbnails/26.jpg)
26