Machine Structures Lecture 25 – Caches III


Page 1: Machine Structures Lecture 25 –  Caches III

L25 Caches III (1)

Machine Structures

Lecture 25 – Caches III

gigaom.com/2006/11/14/100gbe


100 Gig Eth?! Yesterday Infinera

demoed a system that took a 100 GigE signal, broke it up into ten 10GigE signals and sent it over

existing optical networks. Fast!!!

Page 2: Machine Structures Lecture 25 –  Caches III

What to do on a write hit?

• Write-through: update both the cache block and the corresponding word in memory.

• Write-back: update the word in the cache block, allowing the word in memory to be "stale"; add a 'dirty' bit to each block indicating that memory needs to be updated when the block is replaced; the OS flushes the cache before I/O…

• Performance trade-offs?
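The two write-hit policies can be sketched in Python (an illustrative model, not code from the lecture; the class and attribute names are made up):

```python
# Sketch of the two write-hit policies. A "block" is modeled as a dict of
# cached words; memory is a plain dict.

class WriteThroughCache:
    def __init__(self, memory):
        self.memory = memory
        self.block = {}               # cached copy of some words

    def write_hit(self, addr, value):
        self.block[addr] = value      # update the cache block...
        self.memory[addr] = value     # ...and memory at the same time


class WriteBackCache:
    def __init__(self, memory):
        self.memory = memory
        self.block = {}
        self.dirty = False            # 'dirty' bit: memory is stale

    def write_hit(self, addr, value):
        self.block[addr] = value      # update only the cache block
        self.dirty = True             # remember memory must be updated later

    def evict(self):
        if self.dirty:                # write back only if the block was modified
            self.memory.update(self.block)
            self.dirty = False
        self.block = {}


# Demo: write-back leaves memory stale until the block is evicted.
mem = {0x14: 1}
wb = WriteBackCache(mem)
wb.block[0x14] = 1                    # pretend the block holding 0x14 is resident
wb.write_hit(0x14, 2)                 # mem[0x14] is still 1 (stale)
wb.evict()                            # now mem[0x14] becomes 2
```

The trade-off is visible in the sketch: write-through pays a memory write on every hit, while write-back defers (and can batch) it at the cost of the dirty-bit bookkeeping.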

Page 3: Machine Structures Lecture 25 –  Caches III

Block Size Tradeoff (1/3)

• Pros of a larger block size:

• Spatial locality: if we access a given word, we are likely to access other nearby words soon.

• Fits the stored-program concept: if we execute one instruction, we are very likely to execute the next few instructions.

• Also works well for accesses to linear arrays.

Page 4: Machine Structures Lecture 25 –  Caches III

Block Size Tradeoff (2/3)

• Cons of a larger block size:

• A larger block means a larger miss penalty: on a miss, it takes longer to load the new block from memory.

• If the block is too big relative to the cache, there are too few blocks. Result: the miss rate goes up.

• In general, minimize the Average Memory Access Time (AMAT):

AMAT = Hit Time + Miss Penalty x Miss Rate
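The AMAT formula translates directly into a one-line helper (illustrative; the function name is made up), checked here against the numbers used in the bonus example later in the deck (hit time 1 cycle, 5% miss rate, 20-cycle penalty):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = Hit Time + Miss Rate x Miss Penalty."""
    return hit_time + miss_rate * miss_penalty

# 1 + 0.05 x 20 = 2 cycles
result = amat(hit_time=1, miss_rate=0.05, miss_penalty=20)
```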

Page 5: Machine Structures Lecture 25 –  Caches III

Block Size Tradeoff (3/3)

• Hit Time = time to find and retrieve data from the cache.

• Miss Penalty = average time to retrieve data on a miss at the current level (including the possibility of misses at lower levels).

• Hit Rate = % of requests that are found in the current level cache.

• Miss Rate = 1 - Hit Rate

Page 6: Machine Structures Lecture 25 –  Caches III

Extreme Example: One Big Block

• Cache Size = 4 bytes, Block Size = 4 bytes

• Only ONE entry (row) in the cache!

• If an item is accessed, it is likely to be accessed again soon, but unlikely to be accessed again immediately!

• The next access will likely be a miss again: we continually load data into the cache but discard it (force it out) before using it again.

• Nightmare for the cache designer: the Ping Pong Effect

(Diagram: the single cache row, with a Valid bit, a Tag, and data bytes B 0 through B 3.)
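The ping-pong effect is easy to demonstrate with a tiny simulation (a sketch; the function name is made up): with a single-entry cache, two blocks that are accessed alternately evict each other on every access, so nothing ever hits.

```python
# One-entry cache: any access to a different block evicts the previous one.
def simulate_one_block_cache(block_addresses):
    cached = None                 # tag of the single resident block (None = invalid)
    hits = misses = 0
    for block in block_addresses:
        if block == cached:
            hits += 1
        else:
            misses += 1
            cached = block        # load the new block, discarding the old one
    return hits, misses

# Two blocks "ping-pong" through the single entry: every access misses.
hits, misses = simulate_one_block_cache([0, 1, 0, 1, 0, 1])
```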

Page 7: Machine Structures Lecture 25 –  Caches III

Block Size Tradeoff Conclusions

(Figure: three plots against Block Size. Miss Penalty grows with block size. Miss Rate first falls as spatial locality is exploited, then rises once there are too few blocks, which compromises temporal locality. Average Access Time therefore reaches a minimum at an intermediate block size, rising at large sizes due to the increased Miss Penalty and Miss Rate.)

Page 8: Machine Structures Lecture 25 –  Caches III

Types of Cache Misses (1/2)

• "Three Cs" Model of Misses

• 1st C: Compulsory Misses

• occur when a program is first started

• the cache does not contain any of that program's data yet, so misses are bound to occur

• can't be avoided easily, so we won't focus on these in this course

Page 9: Machine Structures Lecture 25 –  Caches III

Types of Cache Misses (2/2)

• 2nd C: Conflict Misses

• misses that occur because two distinct memory addresses map to the same cache location

• two blocks (which unfortunately map to the same location) may keep overwriting each other

• big problem in direct-mapped caches

• how can we reduce this effect?

• Dealing with Conflict Misses

• Solution 1: make the cache bigger; this fails at some point

• Solution 2: let multiple different blocks map to the same cache index?

Page 10: Machine Structures Lecture 25 –  Caches III

Fully Associative Cache (1/3)

• Memory address fields:

• Tag: same as before

• Offset: same as before

• Index: non-existent

• What does this mean?

• no "rows" to index: each block can map to any row of the cache

• must compare against every row of the cache to determine whether the data is in the cache

Page 11: Machine Structures Lecture 25 –  Caches III

Fully Associative Cache (2/3)

• Fully Associative Cache (e.g., 32 B blocks)

• compare tags in parallel

(Diagram: a 32-bit address split into a 27-bit Cache Tag [bits 31-5] and a Byte Offset [bits 4-0]. Each cache row has a Valid bit, a Cache Tag, and data bytes B 0 through B 31, with one tag comparator per row operating in parallel.)

Page 12: Machine Structures Lecture 25 –  Caches III

Fully Associative Cache (3/3)

• Pros of a Fully Assoc Cache:

• no Conflict Misses (data can go anywhere)

• Cons of a Fully Assoc Cache:

• must compare every entry in hardware: if we have 64KB of data in the cache with a 4B offset, we need 16K comparators: infeasible

Page 13: Machine Structures Lecture 25 –  Caches III

Accessing data in a fully associative cache

• 6 addresses: 0x00000014, 0x0000001C, 0x00000034, 0x00008014, 0x00000030, 0x0000001C

• The 6 addresses break down (for convenience) into Tag and Byte Offset fields:

0000000000000000000000000001 0100

0000000000000000000000000001 1100

0000000000000000000000000011 0100

0000000000000000100000000001 0100

0000000000000000000000000011 0000

0000000000000000000000000001 1100

(Tag | Offset)
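The Tag/Offset split can be checked in Python (a sketch; `split_fully_assoc` is a made-up helper, and the 16 B blocks of the next slide give a 4-bit offset):

```python
def split_fully_assoc(addr, block_bytes=16):
    """Split an address into (tag, byte offset) for a fully associative cache."""
    offset_bits = block_bytes.bit_length() - 1   # 16 B blocks -> 4 offset bits
    tag = addr >> offset_bits                    # remaining high bits are the tag
    offset = addr & (block_bytes - 1)            # low bits select the byte
    return tag, offset

addresses = [0x00000014, 0x0000001C, 0x00000034,
             0x00008014, 0x00000030, 0x0000001C]
fields = [split_fully_assoc(a) for a in addresses]
# e.g. 0x00008014 splits into tag 0x801, offset 0x4
```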

Page 14: Machine Structures Lecture 25 –  Caches III

16 KB Fully Assoc. Cache, 16B blocks

• Valid bit: determines whether the row holds valid data (when the machine is initialized, all rows are invalid)

(Diagram: 1024 cache rows [0-1023], each with a Valid bit, a Tag, and data words 0x0-3, 0x4-7, 0x8-b, 0xc-f; every Valid bit starts at 0.)

Page 15: Machine Structures Lecture 25 –  Caches III

1. Read 0x00000014

• 0000000000000000000000000001 0100 (Tag field | Offset)

(Diagram: the same empty cache; all Valid bits are still 0.)

Page 16: Machine Structures Lecture 25 –  Caches III

Find the block to match tag (00…001)

• 0000000000000000000000000001 0100 (Tag field | Offset)

(Diagram: no row holds a valid matching tag.)

Page 17: Machine Structures Lecture 25 –  Caches III

No match. So we find a block to fill: row 0 (0000000000)

• 0000000000000000000000000001 0100 (Tag field | Offset)

Page 18: Machine Structures Lecture 25 –  Caches III

So load that data into the cache, setting the tag and valid bit

• 0000000000000000000000000001 0100 (Tag field | Offset)

(Diagram: row 0 now has Valid = 1, Tag = 1, data = a b c d.)

Page 19: Machine Structures Lecture 25 –  Caches III

Get the data at the offset (0100)

• 0000000000000000000000000001 0100 (Tag field | Offset)

(Diagram: offset 0x4 within row 0 selects the value b.)

Page 20: Machine Structures Lecture 25 –  Caches III

2. Read 0x0000001C = 0…00 0..001 1100

• 0000000000000000000000000001 1100 (Tag field | Offset)

(Diagram: row 0 still holds Valid = 1, Tag = 1, data = a b c d.)

Page 21: Machine Structures Lecture 25 –  Caches III

Find the tag in the cache

• 0000000000000000000000000001 1100 (Tag field | Offset)

(Diagram: row 0's tag, 1, matches the address's tag field.)

Page 22: Machine Structures Lecture 25 –  Caches III

The data has been found in the cache

• 0000000000000000000000000001 1100 (Tag field | Offset)

Page 23: Machine Structures Lecture 25 –  Caches III

Get the data at the offset (1100)

• 0000000000000000000000000001 1100 (Tag field | Offset)

(Diagram: offset 0xc within row 0 selects the value d.)

Page 24: Machine Structures Lecture 25 –  Caches III

3. Read 0x00000034 = 0…00 0..011 0100

• 0000000000000000000000000011 0100 (Tag field | Offset)

Page 25: Machine Structures Lecture 25 –  Caches III

Look for tag 3 (0011)

• 0000000000000000000000000011 0100 (Tag field | Offset)

Page 26: Machine Structures Lecture 25 –  Caches III

Tag (0011) not found

• 0000000000000000000000000011 0100 (Tag field | Offset)

Page 27: Machine Structures Lecture 25 –  Caches III

So load that data into the cache, setting the tag and valid bit

• 0000000000000000000000000011 0100 (Tag field | Offset)

(Diagram: a second row now has Valid = 1, Tag = 3, data = e f g h.)

Page 28: Machine Structures Lecture 25 –  Caches III

Get the data at the offset (0100)

• 0000000000000000000000000011 0100 (Tag field | Offset)

(Diagram: offset 0x4 within the Tag = 3 row selects the value f.)

Page 29: Machine Structures Lecture 25 –  Caches III

4. Read 0x00008014 = 0…10 0..001 0100

• 0000000000000000100000000001 0100 (Tag field | Offset)

Page 30: Machine Structures Lecture 25 –  Caches III

Look for tag 0x801: not found

• 0000000000000000100000000001 0100 (Tag field | Offset)

Page 31: Machine Structures Lecture 25 –  Caches III

So load that data into the cache, setting the tag and valid bit

• 0000000000000000100000000001 0100 (Tag field | Offset)

(Diagram: a third row now has Valid = 1, Tag = 0x801, data = i j k l.)

Page 32: Machine Structures Lecture 25 –  Caches III

Get the data from the cache

• 0000000000000000100000000001 0100 (Tag field | Offset)

(Diagram: offset 0x4 within the Tag = 0x801 row selects the value j.)

Page 33: Machine Structures Lecture 25 –  Caches III

Do an example yourself. What happens?

• Choose from: Cache: Hit, Miss, Miss w/ replace; Values returned: a, b, c, d, e, …, k, l

• Read address 0x00000030 ? 0000000000000000000000000011 0000

• Read address 0x0000001c ? 0000000000000000000000000001 1100

(Diagram: the cache currently holds three valid rows: Tag = 1 with a b c d, Tag = 3 with e f g h, and Tag = 0x801 with i j k l.)

Page 34: Machine Structures Lecture 25 –  Caches III

Answers

• 0x00000030: a hit; tag = 3, tag found, offset = 0, value = e

• 0x0000001c: a hit; tag = 1, tag found, offset = 0xc, value = d

• Since these are reads, the values must equal the memory values whether or not they are cached:

• 0x00000030 = e

• 0x0000001c = d

Memory (Address : Value of Word):

00000010 : a
00000014 : b
00000018 : c
0000001c : d
…
00000030 : e
00000034 : f
00000038 : g
0000003c : h
…
00008010 : i
00008014 : j
00008018 : k
0000801c : l
…

Page 35: Machine Structures Lecture 25 –  Caches III

Third Type of Cache Miss

• Capacity Misses

• misses that occur because the cache has a limited size

• misses that would not occur if we increased the size of the cache

• a sketchy definition, so just get the general idea

• This is the primary type of miss for fully associative caches.

Page 36: Machine Structures Lecture 25 –  Caches III

N-Way Set Associative Cache (1/3)

• Memory address fields:

• Tag: same as before

• Offset: same as before

• Index: points to the correct "row" (here called a set)

• So what's the difference?

• each set contains multiple blocks

• once we've found the correct set, we must compare all the tags in that set to find the data

Page 37: Machine Structures Lecture 25 –  Caches III

Associative Cache Example

• Here's a simple 2-way set associative cache.

(Diagram: a 16-word memory [addresses 0-F] mapping into a cache with two sets [cache indexes 0 and 1], each holding two blocks.)

Page 38: Machine Structures Lecture 25 –  Caches III

N-Way Set Associative Cache (2/3)

• Basic Idea

• the cache is direct-mapped with respect to sets

• each set is fully associative

• basically N direct-mapped caches working in parallel: each has its own valid bit and data

• Given a memory address:

• use the Index to find the correct set.

• compare all the Tags in that set.

• if one matches, hit!, otherwise miss.

• finally, use the offset field as before to find the desired data within the block.
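The lookup steps above can be sketched as follows (illustrative Python, not the lecture's code; each set is modeled as a dict from tag to block data, which mirrors the "compare all tags in the set" step):

```python
class SetAssocCache:
    """Minimal N-way set-associative lookup model (no replacement logic)."""

    def __init__(self, n_sets, n_ways, block_bytes):
        self.n_sets = n_sets
        self.n_ways = n_ways
        self.block_bytes = block_bytes
        self.sets = [dict() for _ in range(n_sets)]   # tag -> block data

    def split(self, addr):
        offset = addr % self.block_bytes                    # byte within block
        index = (addr // self.block_bytes) % self.n_sets    # which set
        tag = addr // (self.block_bytes * self.n_sets)      # remaining bits
        return tag, index, offset

    def lookup(self, addr):
        tag, index, offset = self.split(addr)
        target_set = self.sets[index]        # 1. Index finds the correct set
        if tag in target_set:                # 2. compare all tags in the set
            return target_set[tag][offset]   # 3. hit: offset picks the data
        return None                          #    otherwise: miss

# Demo: 2 sets, 2 ways, 4-byte blocks; preload one block and look it up.
cache = SetAssocCache(n_sets=2, n_ways=2, block_bytes=4)
tag, index, offset = cache.split(0x15)       # tag 2, set 1, offset 1
cache.sets[index][tag] = ['a', 'b', 'c', 'd']
```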

Page 39: Machine Structures Lecture 25 –  Caches III

N-Way Set Associative Cache (3/3)

• What's so great about this?

• even a 2-way set assoc cache avoids a lot of conflict misses

• hardware cost isn't that bad: we only need N comparators

• In fact, for a cache with M blocks:

• it's Direct-Mapped if it's 1-way set assoc

• it's Fully Assoc if it's M-way set assoc

• so these two are just special cases of the more general set associative design

Page 40: Machine Structures Lecture 25 –  Caches III

4-Way Set Associative Cache Circuit

(Diagram: a circuit in which the address's index selects a set and its tag is compared against all four ways in parallel.)

Page 41: Machine Structures Lecture 25 –  Caches III

Block Replacement Policy

• Direct-Mapped Cache: the index completely specifies which position a block can go in on a miss

• N-Way Set Assoc: the index specifies a set, but the block can occupy any position within the set on a miss

• Fully Associative: the block can be written into any position

• Question: if we have multiple choices, where should we write?

• if there is an empty position, we usually write to the first empty one.

• If all possible locations already have a valid block, we must pick a replacement policy: the rule by which we determine which block gets "cached out" on a miss.

Page 42: Machine Structures Lecture 25 –  Caches III

Block Replacement Policy: LRU

• LRU (Least Recently Used)

• Idea: cache out the block which has been accessed (read or write) least recently

• Pro: temporal locality; recent past use implies likely future use: in fact, this is a very effective policy

• Con: with 2-way set assoc, it's easy to keep track (one LRU bit); with 4-way or greater, it requires complicated hardware and much time to keep track of

Page 43: Machine Structures Lecture 25 –  Caches III

Block Replacement Example

• We have a 2-way set associative cache with a four-word total capacity and one-word blocks. We perform the following word accesses (ignore bytes for this problem):

0, 2, 0, 1, 4, 0, 2, 3, 5, 4

How many hits and how many misses will there be under the LRU block replacement policy?

Page 44: Machine Structures Lecture 25 –  Caches III

Block Replacement Example: LRU

• Addresses 0, 2, 0, 1, 4, 0, …

• 0: miss, bring into set 0 (loc 0)

• 2: miss, bring into set 0 (loc 1)

• 0: hit

• 1: miss, bring into set 1 (loc 0)

• 4: miss, bring into set 0 (loc 1, replacing 2, the LRU block)

• 0: hit

(Diagram: after each access, the contents of set 0 and set 1 [two locations each] are shown, with the LRU block in each set marked.)
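Continuing the full access sequence with a small LRU simulator (a sketch, not the lecture's code; the function name is made up) answers the question from the previous slide: 2 hits and 8 misses.

```python
# 2-way set associative, four-word capacity, one-word blocks, LRU replacement.
def lru_sim(accesses, n_sets=2, n_ways=2):
    sets = [[] for _ in range(n_sets)]   # each list is ordered LRU -> MRU
    hits = misses = 0
    for word in accesses:
        s = sets[word % n_sets]          # low index bit selects the set
        if word in s:
            hits += 1
            s.remove(word)               # will be re-appended as MRU
        else:
            misses += 1
            if len(s) == n_ways:         # set full: evict the LRU block
                s.pop(0)
        s.append(word)                   # mark as most recently used
    return hits, misses

hits, misses = lru_sim([0, 2, 0, 1, 4, 0, 2, 3, 5, 4])
# -> 2 hits, 8 misses (the first six accesses match the walkthrough above)
```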

Page 45: Machine Structures Lecture 25 –  Caches III

Big Idea

• How do we choose between associativity, block size, replacement & write policy?

• Design against a performance model

• Minimize: Average Memory Access Time = Hit Time + Miss Penalty x Miss Rate

• influenced by technology & program behavior

• Create the illusion of a memory that is large, cheap, and fast - on average

• How can we improve the miss penalty?

Page 46: Machine Structures Lecture 25 –  Caches III

Improving Miss Penalty

• When caches first became popular, Miss Penalty ~ 10 processor clock cycles

• Today: a 2400 MHz processor (0.4 ns per clock cycle) and 80 ns to go to DRAM: 200 processor clock cycles!

• Solution: another cache between memory and the processor cache: a Second Level (L2) Cache

(Diagram: Proc, L1 cache ($), L2 cache ($2), and DRAM memory in a hierarchy.)

Page 47: Machine Structures Lecture 25 –  Caches III

Analyzing a Multi-level Cache Hierarchy

(Diagram: Proc, L1 $, L2 $2, and DRAM, annotated with the L1 hit time, L1 miss rate, L1 miss penalty, L2 hit time, L2 miss rate, and L2 miss penalty.)

Avg Mem Access Time = L1 Hit Time + L1 Miss Rate * L1 Miss Penalty

L1 Miss Penalty = L2 Hit Time + L2 Miss Rate * L2 Miss Penalty

Avg Mem Access Time = L1 Hit Time + L1 Miss Rate * (L2 Hit Time + L2 Miss Rate * L2 Miss Penalty)
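The two-level formula folds directly into a helper (illustrative names; checked against the worked example in the bonus slides: 1-cycle L1 hit, 5% L1 miss rate, 5-cycle L2 hit, 15% L2 miss rate, 200-cycle L2 miss penalty gives 2.75 cycles):

```python
def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, l2_penalty):
    """AMAT = L1 hit + L1 miss rate * (L2 hit + L2 miss rate * L2 penalty)."""
    l1_miss_penalty = l2_hit + l2_miss_rate * l2_penalty
    return l1_hit + l1_miss_rate * l1_miss_penalty

result = amat_two_level(l1_hit=1, l1_miss_rate=0.05,
                        l2_hit=5, l2_miss_rate=0.15, l2_penalty=200)
```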

Page 48: Machine Structures Lecture 25 –  Caches III

And in Conclusion…

• We've discussed memory caching in detail. Caching in general shows up over and over in computer systems:

• filesystem cache

• web page cache

• game databases / tablebases

• software memoization

• others?

• Big idea: if something is expensive but we want to do it repeatedly, do it once and cache the result.

• Cache design choices:

• write-through vs. write-back

• size of cache: speed vs. capacity

• direct-mapped vs. associative

• for N-way set assoc: choice of N

• block replacement policy

• 2nd level cache?

• 3rd level cache?

• Use a performance model to pick between choices, depending on programs, technology, budget, …

Page 49: Machine Structures Lecture 25 –  Caches III


BONUS SLIDES

Page 50: Machine Structures Lecture 25 –  Caches III

Example

• Assume:

• Hit Time = 1 cycle

• Miss Rate = 5%

• Miss Penalty = 20 cycles

• Calculate AMAT…

• Avg mem access time = 1 + 0.05 x 20 = 1 + 1 cycles = 2 cycles

Page 51: Machine Structures Lecture 25 –  Caches III

Ways to Reduce Miss Rate

• Larger cache

• limited by cost and technology

• hit time of the first-level cache < cycle time (bigger caches are slower)

• More places in the cache to put each block of memory: associativity

• fully associative: any block can go in any line

• N-way set associative: N places for each block (direct-mapped: N = 1)

Page 52: Machine Structures Lecture 25 –  Caches III

Typical Scale

• L1

• size: tens of KB

• hit time: completes in one clock cycle

• miss rates: 1-5%

• L2:

• size: hundreds of KB

• hit time: a few clock cycles

• miss rates: 10-20%

• The L2 miss rate is the fraction of L1 misses that also miss in L2

• why so high?

Page 53: Machine Structures Lecture 25 –  Caches III

Example: with L2 cache

• Assume:

• L1 Hit Time = 1 cycle

• L1 Miss Rate = 5%

• L2 Hit Time = 5 cycles

• L2 Miss Rate = 15% (% of L1 misses that miss)

• L2 Miss Penalty = 200 cycles

• L1 miss penalty = 5 + 0.15 x 200 = 35

• Avg mem access time = 1 + 0.05 x 35 = 2.75 cycles
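Both worked examples (with and without the L2) can be reproduced directly, along with the resulting speedup:

```python
l1_hit, l1_miss_rate = 1, 0.05

# With L2: the L1 miss penalty = L2 hit time + L2 miss rate * L2 miss penalty
with_l2 = l1_hit + l1_miss_rate * (5 + 0.15 * 200)    # 2.75 cycles

# Without L2: every L1 miss pays the full 200-cycle DRAM penalty
without_l2 = l1_hit + l1_miss_rate * 200              # 11 cycles

speedup = without_l2 / with_l2                        # 4x faster with L2
```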

Page 54: Machine Structures Lecture 25 –  Caches III

Example: without L2 cache

• Assume:

• L1 Hit Time = 1 cycle

• L1 Miss Rate = 5%

• L1 Miss Penalty = 200 cycles

• Avg mem access time = 1 + 0.05 x 200 = 11 cycles

• 4x faster with the L2 cache! (2.75 vs. 11)

Page 55: Machine Structures Lecture 25 –  Caches III

An Actual CPU – Early PowerPC

• Cache

• 32 KByte instruction and 32 KByte data L1 caches

• External L2 cache interface with integrated controller and cache tags; supports up to 1 MByte of external L2 cache

• Dual Memory Management Units (MMU) with Translation Lookaside Buffers (TLB)

• Pipelining

• Superscalar (3 inst/cycle)

• 6 execution units (2 integer and 1 double-precision IEEE floating point)

Page 56: Machine Structures Lecture 25 –  Caches III

An Actual CPU – Pentium M

(Die photo: the 32KB I$ and 32KB D$ are labeled on the chip.)