![Page 1: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/1.jpg)
INSTITUTE OF COMPUTING
TECHNOLOGY
DMA Cache: Architecturally Separate I/O Data from
CPU Data for Improving I/O Performance
Dang Tang, Yungang Bao,
Weiwu Hu, Mingyu Chen
2010.1
Institute of Computing Technology (ICT)
Chinese Academy of Sciences (CAS)
![Page 2: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/2.jpg)
The role of I/O
I/O is ubiquitous
• Loading binary files: Disk → Memory
• Browsing the web, media streaming: Network → Memory
• …
I/O is significant
• Many commercial applications are I/O intensive, e.g., databases
![Page 3: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/3.jpg)
State-of-the-Art I/O Technologies
I/O Bus: 20GB/s
• PCI-Express 2.0
• HyperTransport 3.0
• QuickPath Interconnect
I/O Devices
• SSD RAID: 1.2GB/s
• 10GE: 1.25GB/s
• Fusion-io: 8GB/s, 1M IOPS (2KB random 70/30 read/write mix)
![Page 4: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/4.jpg)
Direct Memory Access (DMA)
• DMA is used for I/O operations in all modern computers.
• DMA allows I/O subsystems to access system memory independently of the CPU.
• Many I/O devices have DMA engines, including disk drive controllers, graphics cards, network cards, sound cards and GPUs.
![Page 5: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/5.jpg)
Outline
Revisiting I/O
DMA Cache Design
Evaluations
Conclusions
![Page 6: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/6.jpg)
An Example of Disk Read: DMA Receiving Operation
[Figure: CPU, DMA Engine and Memory (Descriptor, Driver Buffer, Kernel Buffer), with steps ① to ④ of the receive operation]
• Cache Access Latency: ~20 Cycles
• Memory Access Latency: ~200 Cycles
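To put the two latencies quoted above side by side, the following minimal Python sketch computes an average access latency as a function of the fraction of accesses served from the cache. The function name `average_latency` and the simple two-level model (a hit costs the cache latency, a miss costs the memory latency) are illustrative assumptions, not something taken from the slides.

```python
CACHE_LATENCY_CYCLES = 20    # approximate figure quoted on the slide
MEMORY_LATENCY_CYCLES = 200  # approximate figure quoted on the slide


def average_latency(cache_hit_fraction):
    """Average cycles per access under an assumed two-level model."""
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    return (cache_hit_fraction * CACHE_LATENCY_CYCLES
            + (1.0 - cache_hit_fraction) * MEMORY_LATENCY_CYCLES)
```

Under this model, every additional 10% of accesses that hit in a cache rather than going to memory saves 18 cycles per access on average.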
![Page 7: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/7.jpg)
Direct Cache Access [Ram-ISCA05]
Prefetch-Hint Approach [Kumar-Micro07]
[Figure: the disk-read example revisited: CPU, DMA Engine and Memory (Descriptor, Driver Buffer, Kernel Buffer), steps ① to ④]
• This is a typical Shared-Cache Scheme
![Page 8: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/8.jpg)
Problems of Shared-Cache Scheme
• Cache pollution
• Cache thrashing
• Not suitable for other I/O
• Degrades performance when DMA requests are large (>100KB) for the “Oracle + TPC-H” application
To address this problem thoroughly, we need to investigate the characteristics of I/O data.
![Page 9: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/9.jpg)
I/O Data vs. CPU Data
[Figure: MemCtrl and HMTT; labels: I/O Data, CPU Data, I/O Data + CPU Data]
![Page 10: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/10.jpg)
A Short Ad for HMTT [Bao-Sigmetrics08]
A Hardware/Software Hybrid Memory Trace Tool
• Supports the DDR2 DIMM interface on multiple platforms
• Collects full-system off-chip memory traces
• Provides traces with semantic information, e.g., virtual address, process ID, I/O operation
• Collects traces of commercial applications, e.g., Oracle, web server
[Figure: The HMTT System]
![Page 11: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/11.jpg)
Characteristics of I/O Data (1)
[Figure: % of memory references to I/O data]
[Figure: % of references of various I/O types]
![Page 12: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/12.jpg)
Characteristics of I/O Data (2)
[Figure: I/O request size distribution]
![Page 13: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/13.jpg)
Characteristics of I/O Data (3)
Sequential access in I/O data
• Compared with CPU data, I/O data is very regular
![Page 14: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/14.jpg)
Characteristics of I/O Data (4)
Reuse Distance (RD)
• Measured as LRU stack distance
[Figure: worked example of LRU stack contents for a sequence of accesses to blocks 1 to 4; CDF of RD, where a point reads "x% of references have RD <= n"]
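The reuse-distance measurement named above can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for the talk: it assumes the reuse distance of an access is the 1-based depth of the block in an LRU stack (first touches get an infinite distance), and the names `stack_distances` and `cdf_at` are hypothetical.

```python
import math


def stack_distances(trace):
    """Return the LRU stack distance of every access in `trace`.

    The stack keeps the most recently used block at index 0. The distance
    of an access is the 1-based depth at which the block is found, or
    math.inf when the block has not been seen before.
    """
    stack = []
    distances = []
    for block in trace:
        if block in stack:
            depth = stack.index(block) + 1
            stack.remove(block)
        else:
            depth = math.inf
        stack.insert(0, block)  # the accessed block becomes most recent
        distances.append(depth)
    return distances


def cdf_at(distances, n):
    """Fraction of accesses whose stack distance is <= n."""
    return sum(1 for d in distances if d <= n) / len(distances)
```

For example, `stack_distances([1, 2, 3, 1, 2, 2])` yields three infinite distances followed by 3, 3 and 1, so `cdf_at(..., 3)` is 0.5. The list-based stack is quadratic in the worst case, which is fine for a sketch but not for full-system traces.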
![Page 15: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/15.jpg)
Characteristics of I/O Data (5)
DMA-W CPU-R
CPU-RW CPU-RW
CPU-W DMA-R
![Page 16: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/16.jpg)
Rethink I/O & DMA Operation
• 20~40% of memory references are for I/O data in I/O-intensive applications.
• Characteristics of I/O data are different from those of CPU data:
  • An explicit produce-consume relationship for I/O data
  • Reuse distance of I/O data is smaller than that of CPU data
  • References to I/O data are primarily sequential
• Separating I/O data and CPU data
![Page 17: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/17.jpg)
Separating I/O data and CPU data
Before Separating
After Separating
![Page 18: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/18.jpg)
Outline
Revisiting I/O
DMA Cache Design
Evaluations
Conclusions
![Page 19: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/19.jpg)
DMA Cache Design Issues
• Write Policy
• Cache Coherence
• Replacement Policy
• Prefetching
[Figure: Dedicated DMA Cache (DDC)]
![Page 20: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/20.jpg)
DMA Cache Design Issues: Write Policy
• Adopt a write-allocate policy
• Either a write-back or a write-through policy can be used
![Page 21: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/21.jpg)
DMA Cache Design Issues: Cache Coherence
[Figure: IO-ESI protocol for the WT policy]
[Figure: IO-MOESI protocol for the WB policy]
The only difference between IO-MOESI/IO-ESI and the original protocols is that the local source and the probe source of state transitions are exchanged.
![Page 22: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/22.jpg)
A Big Issue
How to prove the correctness of integrating the heterogeneous cache coherency protocols in a system?
![Page 23: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/23.jpg)
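A Global State Method for Heterogeneous Cache Coherence Protocol [Pong-SPAA93, Pong-JACM98]
[Figure: per-cache states of one DMA cache and several CPU caches combined into global states; OS+I+ is marked √ and MS+I+ is marked X; sample global states EI+, MI+ and S+I+ with transition labels R|E, W|* and R|I]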
![Page 24: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/24.jpg)
Global State Cache Coherence Theorem
Given N (N>1) well-defined cache protocols, they do not conflict if and only if no Conflict Global State exists in the global state transition machine.
[Figure: global state transition machine over the states I+, S+I+, EI+, MI+ and OS+I+, with transitions labeled R|*, W|*, R|I, R|E and R|M]
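One way to make the notion of a conflict global state concrete is a small predicate over the per-cache states. The Python sketch below is an illustration under assumed invariants (the usual MOESI single-writer rules); the slides do not spell out the exact definition, and the function name `is_conflict` is hypothetical. It agrees with the two combinations marked earlier in the deck: OS+I+ is valid and MS+I+ is a conflict.

```python
from collections import Counter


def is_conflict(states):
    """Return True if a combination of per-cache states breaks the assumed rules.

    `states` is a list with one letter per cache, drawn from M, O, E, S, I.
    """
    counts = Counter(states)
    valid_copies = len(states) - counts["I"]
    # Assumed rule 1: an M or E copy must be the only valid copy in the system.
    if counts["M"] + counts["E"] > 0 and valid_copies > 1:
        return True
    # Assumed rule 2: at most one cache may hold the block in the O state.
    if counts["O"] > 1:
        return True
    return False
```

A verification pass in this style would enumerate every global state reachable in the combined transition machine and check that none of them satisfies the predicate.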
![Page 25: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/25.jpg)
MOESI + ESI
[Figure: global state transition machine of the combined protocols over the states I+, S+I+, ECI+, MCI+, EDI+ and OCS+I+; transitions carry labels such as R*|*, RC|E, RC|I, RC|M, RD|I, RD|E, RD|SI, WC|*, WC|I, WD|*, WD|I and WD|SI]
6 Global States, each marked √:
• S+I+
• ECI*
• I*
• MCI*
• EDI*
• OCS*I*
![Page 26: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/26.jpg)
DMA Cache Design Issues: Replacement Policy
An LRU-like Replacement Policy
1. Invalid
2. Shared
3. Owned
4. Exclusive
5. Modified
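Reading the numbered list above as an eviction priority (Invalid lines are replaced first, Modified lines last), victim selection can be sketched as follows. This is a minimal Python illustration under that reading; breaking ties by least-recent use, the `Line` record and the `choose_victim` name are assumptions rather than details given on the slide.

```python
from dataclasses import dataclass

# Lower value = evicted earlier, following the order of the list above.
EVICTION_ORDER = {"I": 0, "S": 1, "O": 2, "E": 3, "M": 4}


@dataclass
class Line:
    tag: int
    state: str      # one of "I", "S", "O", "E", "M"
    last_used: int  # timestamp of the most recent access (assumed field)


def choose_victim(cache_set):
    """Return the index of the line to replace in one cache set.

    Lines are ranked first by coherence state, then by age, so the oldest
    line in the cheapest-to-evict state is chosen.
    """
    return min(
        range(len(cache_set)),
        key=lambda i: (EVICTION_ORDER[cache_set[i].state], cache_set[i].last_used),
    )
```

Preferring Shared over Modified victims avoids write-backs where possible, which is one plausible motivation for ordering by state before age.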
![Page 27: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/27.jpg)
DMA Cache Design Issues: Prefetching
• Adopt straightforward sequential prefetching
• Prefetching is triggered by a cache miss
• Fetch 4 blocks at a time
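The miss-triggered sequential prefetching described above can be modeled with a toy cache. The Python sketch below makes several simplifying assumptions that are not from the slides: the cache has unlimited capacity, blocks are identified by consecutive integers, and "4 blocks" is taken to mean the missing block plus the next three. The class name `SequentialPrefetchCache` is hypothetical.

```python
class SequentialPrefetchCache:
    """Toy model of miss-triggered sequential prefetching (no evictions)."""

    def __init__(self, degree=4):
        self.degree = degree  # blocks brought in per miss, demand block included
        self.blocks = set()   # block numbers currently cached
        self.misses = 0

    def access(self, block):
        """Access one block; return True on a hit, False on a miss."""
        if block in self.blocks:
            return True
        self.misses += 1
        # On a miss, fetch the demand block and its sequential successors.
        self.blocks.update(range(block, block + self.degree))
        return False
```

For a purely sequential stream of 16 blocks this model takes 4 misses instead of 16, which illustrates why regular I/O access patterns suit such a simple prefetcher.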
![Page 28: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/28.jpg)
Design Complexity vs. Design Cost
• Dedicated DMA Cache (DDC)
• Partition-Based DMA Cache (PBDC)
![Page 29: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/29.jpg)
Outline
Revisiting I/O
DMA Cache Design
Evaluations
Conclusions
![Page 30: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/30.jpg)
Speedup of Dedicated DMA Cache
![Page 31: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/31.jpg)
% of Valid Prefetched Blocks
• DMA caches exhibit impressively high prefetching accuracy.
• This is because I/O data has a very regular access pattern.
![Page 32: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/32.jpg)
Performance Comparisons
Although PBDC does not require additional on-chip storage, it can achieve about 80% of DDC’s performance improvements.
![Page 33: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/33.jpg)
Outline
Revisiting I/O
DMA Cache Design
Evaluations
Conclusions
![Page 34: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/34.jpg)
Conclusions
![Page 35: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/35.jpg)
Thanks!
Questions?
![Page 36: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/36.jpg)
Design Complexity of PBDC
![Page 37: DMA Cache Architecturally Separate I/O Data from CPU Data for Improving I/O Performance](https://reader036.vdocuments.site/reader036/viewer/2022062423/56814568550346895db23a0d/html5/thumbnails/37.jpg)
More References on Cache Coherence Protocol Verification
Fong Pong, Michel Dubois, Formal verification of complex coherence protocols using symbolic state models, Journal of the ACM (JACM), v.45 n.4, p.557-587, July 1998
Fong Pong, Michel Dubois, Verification techniques for cache coherence protocols, ACM Computing Surveys (CSUR), v.29 n.1, p.82-126, March 1997