

DMA Cache: Architecturally Separate I/O Data from CPU Data for Improving I/O Performance

Dan Tang, Yungang Bao, Weiwu Hu, Mingyu Chen

January 2010

Institute of Computing Technology (ICT)

Chinese Academy of Sciences (CAS)


The role of I/O

I/O is ubiquitous
• Loading binary files: Disk → Memory
• Browsing the web, streaming media: Network → Memory
• …

I/O is significant
• Many commercial applications are I/O intensive: databases, etc.


State-of-the-Art I/O Technologies

I/O bus: 20 GB/s
• PCI Express 2.0
• HyperTransport 3.0
• QuickPath Interconnect

I/O devices
• SSD RAID: 1.2 GB/s
• 10GE: 1.25 GB/s
• Fusion-io: 8 GB/s, 1M IOPS (2 KB random, 70/30 read/write mix)


Direct Memory Access (DMA)

• DMA is used for I/O operations in all modern computers.
• DMA allows I/O subsystems to access system memory independently of the CPU.
• Many I/O devices have DMA engines, including disk drive controllers, graphics cards, network cards, sound cards, and GPUs.


Outline

Revisiting I/O

DMA Cache Design

Evaluations

Conclusions


An Example of Disk Read: DMA Receiving Operation

[Figure: DMA engine, CPU, and memory holding the driver buffer, the DMA descriptor, and the kernel buffer. Steps ①②③: the driver sets up a descriptor, the DMA engine writes the I/O data into the driver buffer in memory, and the CPU copies it into the kernel buffer.]

• Cache access latency: ~20 cycles
• Memory access latency: ~200 cycles
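To make the figure's three steps concrete, here is a minimal C sketch of the receive path. The names (dma_desc, post_read, complete_read, the commented-out doorbell write) are illustrative assumptions, not the paper's code or a real driver API.

/* Minimal sketch of a DMA receive (disk read). Illustrative only.          */
#include <stdint.h>
#include <string.h>

struct dma_desc {                 /* descriptor the device fetches (step 1) */
    uint64_t buf_phys;            /* physical address of the driver buffer  */
    uint32_t len;                 /* number of bytes to transfer            */
    volatile uint32_t done;       /* set by the device when DMA completes   */
};

/* Step 1: the driver fills a descriptor and kicks the DMA engine.          */
static void post_read(struct dma_desc *d, uint64_t buf_phys, uint32_t len)
{
    d->buf_phys = buf_phys;
    d->len      = len;
    d->done     = 0;
    /* device_start_dma(d);  -- hypothetical doorbell write */
}

/* Step 2: the device DMA-writes len bytes into the driver buffer in memory,
 * then sets d->done (or raises an interrupt).                               */

/* Step 3: the kernel copies the data into the kernel buffer. Every CPU load
 * here costs a memory access (~200 cycles) if the freshly DMA-written data
 * sits only in memory and not in any cache.                                 */
static void complete_read(struct dma_desc *d, void *driver_buf, void *kernel_buf)
{
    while (!d->done)
        ;                         /* poll, or sleep until the interrupt */
    memcpy(kernel_buf, driver_buf, d->len);
}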


Direct Cache Access [Ram-ISCA05]

[Figure: the same disk-read example, but with Direct Cache Access the DMA-written I/O data is injected into the shared cache rather than left only in memory.]

• This is a typical shared-cache scheme.
• Prefetch-hint approach [Kumar-Micro07]


Problems of the Shared-Cache Scheme

• Cache pollution
• Cache thrashing
• Not suitable for other I/O
• Degrades performance when DMA requests are large (>100 KB) for the "Oracle + TPC-H" application

To address this problem more deeply, we need to investigate the characteristics of I/O data.


I/O Data vs. CPU Data

[Figure: the HMTT tracing tool attached at the memory controller observes the mixed off-chip traffic, from which I/O data references and CPU data references are separated.]


A Short Ad for HMTT [Bao-Sigmetrics08]

A hardware/software hybrid memory trace tool:
• Supports the DDR2 DIMM interface on multiple platforms
• Collects full-system off-chip memory traces
• Provides traces with semantic information, e.g., virtual address, process ID, I/O operation
• Collects traces of commercial applications, e.g., Oracle, web servers

[Figure: the HMTT system.]


Characteristics of I/O Data (1)

[Figure: percentage of memory references to I/O data; percentage of references of various I/O types.]


Characteristics of I/O Data (2): I/O request size distribution


Characteristics of I/O Data (3): Sequential Access in I/O Data

Compared with CPU data, I/O data accesses are very regular.


Characteristics of I/O Data (4): Reuse Distance (RD)

Reuse distance is measured as LRU stack distance: the depth in the LRU stack at which a block is found when it is referenced again.

[Figure: a worked LRU-stack example over a short address trace, and the CDF of reuse distance, read as "x% of references have RD <= n".]
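As a concrete illustration of the metric (not code from the HMTT toolchain), the following C sketch computes the LRU stack distance of each reference in an address trace: the returned value is the depth at which the address is found in the LRU stack, and 0 marks a cold (first-touch) reference.

/* Sketch: LRU stack distance (reuse distance) per reference.
 * A simple O(N * S) linear-scan stack, for clarity only.                  */
#include <stdio.h>

#define STACK_MAX 4096

static unsigned long lru[STACK_MAX];      /* lru[0] = most recently used */
static int depth = 0;

/* Return the LRU stack depth of addr (1 = top of stack, i.e. re-referenced
 * with no distinct block in between); 0 = first touch. Then move addr to
 * the top of the stack.                                                    */
static int reuse_distance(unsigned long addr)
{
    int i, pos = -1;

    for (i = 0; i < depth; i++)
        if (lru[i] == addr) { pos = i; break; }

    if (pos >= 0) {                       /* seen before */
        for (i = pos; i > 0; i--)         /* move addr back to the top */
            lru[i] = lru[i - 1];
        lru[0] = addr;
        return pos + 1;
    }

    if (depth < STACK_MAX) depth++;       /* first touch: push on top */
    for (i = depth - 1; i > 0; i--)
        lru[i] = lru[i - 1];
    lru[0] = addr;
    return 0;
}

int main(void)
{
    unsigned long trace[] = { 1, 3, 2, 4, 1, 2 };   /* tiny example trace */
    int n = (int)(sizeof trace / sizeof trace[0]);
    for (int i = 0; i < n; i++)
        printf("block %lu -> RD %d\n", trace[i], reuse_distance(trace[i]));
    return 0;   /* the last two references have RD 4 and 3 */
}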


Characteristics of I/O Data (5)

[Figure: the typical access sequence of an I/O data block, showing an explicit produce-consume pattern: DMA-W followed by CPU-R on the receive side, CPU-RW in between, and CPU-W followed by DMA-R on the send side.]


Rethink I/O & DMA Operation

• 20~40% of memory references are to I/O data in I/O-intensive applications.
• The characteristics of I/O data differ from those of CPU data:
  - There is an explicit produce-consume relationship for I/O data.
  - The reuse distance of I/O data is smaller than that of CPU data.
  - References to I/O data are primarily sequential.

→ Separating I/O data and CPU data


Separating I/O Data and CPU Data

[Figure: the memory hierarchy before separating and after separating I/O data from CPU data.]


Outline

Revisiting I/O

DMA Cache Design

Evaluations

Conclusions


DMA Cache Design Issues

• Write policy
• Cache coherence
• Replacement policy
• Prefetching

Dedicated DMA Cache (DDC)


DMA Cache Design Issues: Write Policy

• Adopt a write-allocate policy.
• Both write-back and write-through policies are available (see the sketch below).
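As a minimal illustration of these two options (the structure and function names are assumptions, not the paper's RTL), a DMA write under write-allocate either marks the line dirty for a later write-back or forwards the data to memory immediately under write-through:

/* Sketch: a DMA write in the DMA cache under write-allocate, with a
 * selectable write-back or write-through policy. Illustrative only.  */
enum write_policy { WRITE_BACK, WRITE_THROUGH };

struct dma_cache_line {
    int valid;
    int dirty;
    /* tag, data, coherence state, ... */
};

static void handle_dma_write(struct dma_cache_line *line, enum write_policy policy)
{
    if (!line->valid)
        line->valid = 1;     /* write-allocate: install the block in the DMA cache */
    /* ... merge the DMA-written data into the line ... */
    if (policy == WRITE_THROUGH)
        line->dirty = 0;     /* data is also forwarded to memory right away        */
    else
        line->dirty = 1;     /* write-back: memory is updated on eviction          */
}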


DMA Cache Design Issues: Cache Coherence

• IO-ESI protocol for the write-through policy
• IO-MOESI protocol for the write-back policy

The only difference between IO-MOESI/IO-ESI and the original protocols is that the local source and the probe source of the state transitions are exchanged.


A Big Issue

How can we prove the correctness of integrating heterogeneous cache coherence protocols in one system?


A Global State Method for Heterogeneous Cache Coherence Protocols [Pong-SPAA93, Pong-JACM98]

[Figure: a global state combines the per-cache states of one block across the DMA cache and the CPU caches; e.g., per-cache states {O, S, I} form the legal global state OS+I+ (√), while {M, I, S} form the conflict global state MS+I+ (✗). Transitions are labeled with local/probe events such as R|E, W|*, R|I.]


Global State Cache Coherence Theorem

Given N (N > 1) well-defined cache protocols, they do not conflict if and only if there does not exist any Conflict Global State in the global state transition machine.

[Figure: global state transition diagram, with transitions labeled by local/probe events (R|*, W|*, R|E, R|M, R|I). 5 reachable global states: S+I+, EI*, I*, MI*, OS*I*, all of them legal (√).]
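To pin down what a Conflict Global State is, here is a minimal C sketch of one reasonable conflict check over a global state, i.e., the combination of per-cache states that a block may be in at the same time across the CPU caches and the DMA cache (MOESI and ESI in the next slide). The predicate encodes the usual MOESI ownership invariants and is our illustration, not code from the paper or from Pong and Dubois; an exhaustive verifier would enumerate all global states reachable under the combined protocols and check that none of them satisfies it.

/* Sketch: conflict check for a global state of one memory block.     */
#include <stdbool.h>
#include <stdio.h>

enum line_state { I, S, E, O, M };

/* A block may have at most one owning copy (M, E, or O), and an M or E
 * copy may not coexist with any other valid copy.                      */
static bool is_conflict(const enum line_state *caches, int n)
{
    int valid = 0, exclusive = 0, owners = 0;
    for (int i = 0; i < n; i++) {
        if (caches[i] != I) valid++;
        if (caches[i] == M || caches[i] == E) exclusive++;
        if (caches[i] == M || caches[i] == E || caches[i] == O) owners++;
    }
    return owners > 1 || (exclusive == 1 && valid > 1);
}

int main(void)
{
    enum line_state ok[]  = { O, S, I };   /* OS+I+ : legal    */
    enum line_state bad[] = { M, I, S };   /* MS+I+ : conflict */
    printf("%d %d\n", is_conflict(ok, 3), is_conflict(bad, 3));   /* prints 0 1 */
    return 0;
}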


MOESI + ESI

[Figure: global state transition diagram for the CPU caches (MOESI) combined with the DMA cache (ESI); transitions are labeled with CPU-side (C) and DMA-side (D) read/write events. 6 reachable global states: S+I+, ECI*, I*, MCI*, EDI*, OCS*I*, all of them legal (√).]


DMA Cache Design Issues: Replacement Policy

An LRU-like replacement policy that prefers victims in the following state order (a sketch of the victim selection follows the list):
1. Invalid
2. Shared
3. Owned
4. Exclusive
5. Modified
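A minimal C sketch of such victim selection, assuming the state order above and an LRU tie-breaker (the structure names are illustrative, not the paper's RTL):

/* Sketch: pick a victim way, preferring Invalid < Shared < Owned <
 * Exclusive < Modified and breaking ties by LRU age.                 */
#include <stdint.h>

enum line_state { INVALID = 1, SHARED, OWNED, EXCLUSIVE, MODIFIED };

struct cache_line {
    enum line_state state;
    uint64_t last_used;            /* smaller = less recently used */
};

static int pick_victim(const struct cache_line *set, int ways)
{
    int victim = 0;
    for (int w = 1; w < ways; w++) {
        if (set[w].state < set[victim].state ||
            (set[w].state == set[victim].state &&
             set[w].last_used < set[victim].last_used))
            victim = w;
    }
    return victim;
}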


DMA Cache Design Issues: Prefetching

• Adopt straightforward sequential prefetching.
• Prefetching is triggered by a cache miss.
• Fetch 4 blocks at a time (sketched below).
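A minimal C sketch of this miss-triggered sequential prefetch; the hooks into the cache model (dma_cache_present, dma_cache_fetch) and the 64-byte block size are illustrative assumptions:

/* Sketch: on a DMA cache miss, fetch the missing block and the
 * following blocks, 4 in total, skipping blocks already cached.      */
#include <stdbool.h>
#include <stdint.h>

#define BLOCK_SIZE      64u
#define PREFETCH_DEGREE 4          /* demand block + 3 prefetched blocks */

static bool dma_cache_present(uint64_t blk) { (void)blk; return false; } /* stub */
static void dma_cache_fetch(uint64_t blk)   { (void)blk; }               /* stub */

static void on_dma_cache_miss(uint64_t miss_addr)
{
    uint64_t base = miss_addr & ~(uint64_t)(BLOCK_SIZE - 1);
    for (int i = 0; i < PREFETCH_DEGREE; i++) {
        uint64_t blk = base + (uint64_t)i * BLOCK_SIZE;
        if (!dma_cache_present(blk))
            dma_cache_fetch(blk);
    }
}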


Design Complexity vs. Design Cost

• Dedicated DMA Cache (DDC)
• Partition-Based DMA Cache (PBDC)
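PBDC needs no additional on-chip storage; assuming it does so by reserving part of the existing last-level cache for I/O data, the C sketch below shows a simple way-partitioning scheme with 2 of 8 LLC ways given to DMA-tagged requests. The exact partitioning policy is not spelled out on this slide, so the split and the names are illustrative assumptions.

/* Sketch: way-partitioned allocation for a partition-based DMA cache.
 * DMA-tagged requests may only allocate into the reserved ways; CPU
 * requests use the remaining ways. Illustrative only.                  */
#define LLC_WAYS 8
#define DMA_WAYS 2                /* ways [0, DMA_WAYS) reserved for I/O data */

/* Return the half-open range of ways [*first, *last) that a request
 * is allowed to allocate into.                                          */
static void allocation_ways(int is_dma_request, int *first, int *last)
{
    if (is_dma_request) { *first = 0;        *last = DMA_WAYS; }
    else                { *first = DMA_WAYS; *last = LLC_WAYS; }
}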


Outline

Revisiting I/O

DMA Cache Design

Evaluations

Conclusions


Speedup of Dedicated DMA Cache


% of Valid Prefetched Blocks

The DMA cache exhibits impressively high prefetching accuracy because I/O data has a very regular access pattern.


Performance Comparisons

Although PBDC does not require additional on-chip storage, it achieves about 80% of DDC's performance improvement.


Outline

Revisiting I/O

DMA Cache Design

Evaluations

Conclusions


Conclusions

• We have proposed a DMA cache technique to separate I/O data from CPU data.
• We adopt a global state method for integrating heterogeneous cache coherence protocols.
• Experimental results show that the DMA cache schemes are better than existing approaches that use a unified, shared cache for I/O data and CPU data.

Open problems remain, e.g.:
• Can I/O data go directly to the L1 cache?
• How should heterogeneous caches be designed for different types of data?
• How can the memory controller be optimized with awareness of I/O?

Thanks!

Questions?


RTL Emulation Platform

• LLC and DMA cache models from Loongson-2F
• DDR2 memory controller from Loongson-2F
• DDR2 DIMM model from Micron Technology

[Figure: memory traces drive the last-level cache and DMA cache models, which connect through the memory controller to the DDR2 DIMM model.]


Parameters

[Table: simulation parameters; the memory is DDR2-666.]


Normalized Speedup for WB

The baseline is the snoop-cache scheme; the DMA cache schemes exhibit better performance than the others.


DMA Write & CPU Read Hit Rate

Both the shared cache and the DMA cache exhibit high hit rates. Then where do the cycles go in the shared-cache scheme?


Breakdown of Normalized Total Cycles


Design Complexity of PBDC


More References on Cache Coherence Protocol Verification

Fong Pong, Michel Dubois. Formal verification of complex coherence protocols using symbolic state models. Journal of the ACM (JACM), 45(4):557-587, July 1998.

Fong Pong, Michel Dubois. Verification techniques for cache coherence protocols. ACM Computing Surveys (CSUR), 29(1):82-126, March 1997.
