CPU Caches

Jamie Allen, Director of Consulting, @jamie_allen, http://github.com/jamie-allen

Upload: shinolajla

Post on 15-Jan-2015


Agenda

• Goal
• Definitions
• Architectures
• Development Tips
• The Future

Goal

Provide you with the information you need about CPU caches so that you can improve the performance of your applications.

Why?

• Increased virtualization
  – Runtime (JVM, RVM)
  – Platforms/Environments (cloud)
• Disruptor, 2011

Definitions

SMP

• Symmetric Multiprocessor (SMP) Architecture
• Shared main memory controlled by single OS
• No more Northbridge

NUMA

• Non-Uniform Memory Access
• The organization of processors reflects the time to access data in RAM, called the “NUMA factor”
• Shared memory space (as opposed to multiple commodity machines)

Data Locality

• The most critical factor in performance. Google argues otherwise
• Not guaranteed by a JVM
• Spatial - data accessed in small regions, close together in memory
• Temporal - reused over and over in a loop, high probability it will be reused before long

Memory Controller

• Manages communication of reads/writes between the CPU and RAM
• Integrated Memory Controller on die

Cache Lines

• 32-256 contiguous bytes, most commonly 64
• Beware “false sharing”
• Use padding to ensure unshared lines
• Transferred in 64-bit blocks (8x for 64 byte lines), arriving every ~4 cycles
• Position in the line of the “critical word” matters, but not if pre-fetched
• @Contended annotation coming in Java 8
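A rough sketch of the padding idea, assuming 64-byte cache lines. PaddedCounterDemo and PaddedCounter are hypothetical names, not from the talk, and the JVM is free to reorder or drop unused fields, so manual padding is a hint rather than a guarantee. That limitation is the gap the @Contended annotation is meant to close.

```java
// Sketch only: manual padding to keep two hot counters on separate cache lines.
// Assumes 64-byte lines; field layout is ultimately decided by the JVM.
final class PaddedCounterDemo {

    static final class PaddedCounter {
        long p1, p2, p3, p4, p5, p6, p7;   // 56 bytes of padding before the hot field
        volatile long value;               // the only field that is actually written
        long q1, q2, q3, q4, q5, q6, q7;   // 56 bytes of padding after the hot field
    }

    /** Runs one writer thread per counter and returns both final values. */
    static long[] run(final long iterations) {
        final PaddedCounter a = new PaddedCounter();
        final PaddedCounter b = new PaddedCounter();
        // Each counter has exactly one writer, so value++ cannot lose updates.
        Thread t1 = new Thread(() -> { for (long i = 0; i < iterations; i++) a.value++; });
        Thread t2 = new Thread(() -> { for (long i = 0; i < iterations; i++) b.value++; });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] { a.value, b.value };
    }

    public static void main(String[] args) {
        long[] totals = run(10_000_000L);
        System.out.println(totals[0] + " " + totals[1]);
    }
}
```

Removing the p and q fields and timing run() again is one way to look for a false-sharing penalty, although the result depends on where the allocator happens to place the two objects.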

Cache Associativity

• Fully Associative: Put it anywhere
• Somewhere in the middle: n-way set-associative, 2-way skewed-associative
• Direct Mapped: Each entry can only go in one specific place
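To make the three options concrete, here is a minimal sketch of how an address could map to a set. The geometry (32 KB, 8 ways, 64-byte lines) is an assumption chosen for illustration, and SetIndexSketch is a hypothetical name. Real hardware may hash address bits differently.

```java
// Sketch only: maps an address to a set index and tag for an assumed cache geometry.
final class SetIndexSketch {
    static final int LINE_BYTES = 64;          // assumed line size
    static final int WAYS = 8;                 // assumed associativity
    static final int CACHE_BYTES = 32 * 1024;  // assumed capacity
    static final int SETS = CACHE_BYTES / (LINE_BYTES * WAYS); // 64 sets

    /** The set a line must live in; it may occupy any of the WAYS slots of that set. */
    static int setIndex(long address) {
        return (int) ((address / LINE_BYTES) % SETS);
    }

    /** The remaining high bits, stored alongside the line to identify it within its set. */
    static long tag(long address) {
        return address / ((long) LINE_BYTES * SETS);
    }

    public static void main(String[] args) {
        System.out.println(setIndex(4096) + " " + tag(4096));
    }
}
```

With WAYS set to 1 this degenerates to direct mapped (one possible place per line), and with SETS equal to 1 it becomes fully associative (any line can go anywhere).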

Cache Eviction Strategies

• Least Recently Used (LRU)
• Pseudo-LRU (PLRU): for large associativity caches
• 2-Way Set Associative
• Direct Mapped
• Others
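As an illustration of LRU, the following sketch models a single cache set in software using java.util.LinkedHashMap in access order. LruSetSketch is a hypothetical name, and hardware does not work this way internally. It only shows which line gets evicted.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch only: one cache set with true LRU replacement, modeled in software.
final class LruSetSketch {
    private final LinkedHashMap<Long, Boolean> lines;
    private long misses;

    LruSetSketch(final int ways) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.lines = new LinkedHashMap<Long, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                return size() > ways;   // evict the least recently used line when the set is full
            }
        };
    }

    /** Touches the line with the given tag; returns true on a hit, false on a miss. */
    boolean access(long tag) {
        if (lines.get(tag) != null) {
            return true;                // get() also refreshes the line's recency
        }
        misses++;
        lines.put(tag, Boolean.TRUE);
        return false;
    }

    long misses() {
        return misses;
    }
}
```

With two ways, the access sequence 1, 2, 1, 3 evicts line 2 rather than line 1, because line 1 was touched more recently.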

Cache Write Strategies

• Write through: changed cache line immediately goes back to main memory
• Write back: cache line is marked when dirty, eviction sends back to main memory
• Write combining: grouped writes of cache lines back to main memory
• Uncacheable: dynamic values that can change without warning
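The difference between write through and write back can be sketched by counting how often a single line goes back to main memory. WritePolicySketch is a hypothetical toy model, not a description of any real controller.

```java
// Sketch only: counts main-memory writes for one cache line under two write policies.
final class WritePolicySketch {
    private final boolean writeThrough;
    private boolean dirty;
    private int memoryWrites;

    WritePolicySketch(boolean writeThrough) {
        this.writeThrough = writeThrough;
    }

    /** A store that hits the cached line. */
    void store() {
        if (writeThrough) {
            memoryWrites++;     // every store is propagated immediately
        } else {
            dirty = true;       // write back: only mark the line dirty
        }
    }

    /** The line leaves the cache; a dirty line is flushed to memory exactly once. */
    void evict() {
        if (dirty) {
            memoryWrites++;
            dirty = false;
        }
    }

    int memoryWrites() {
        return memoryWrites;
    }
}
```

Five stores followed by an eviction cost five memory writes under write through and one under write back, which is why repeated writes to the same line are cheap with write back.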

Exclusive versus Inclusive

• Only relevant below L3
• AMD is exclusive
  – Progressively more costly due to eviction
  – Can hold more data
  – Bulldozer uses write through from L1d back to L2
• Intel is inclusive
  – Can be better for inter-processor memory sharing
  – More expensive as lines in L1 are also in L2 & L3
  – If evicted in a higher level cache, must be evicted below as well

Inter-Socket Communication

• GT/s – gigatransfers per second
• Quick Path Interconnect (QPI, Intel) – 8GT/s
• HyperTransport (HTX, AMD) – 6.4GT/s (?)
• Both transfer 16 bits per transmission in practice, but Sandy Bridge is really 32

MESI+F Cache Coherency Protocol

• Specific to data cache lines
• Request for Ownership (RFO): when a processor tries to write to a cache line
• Modified: the local processor has changed the cache line, implies only one who has it
• Exclusive: one processor is using the cache line, not modified
• Shared: multiple processors are using the cache line, not modified
• Invalid: the cache line is invalid, must be re-fetched
• Forward: designate to respond to requests for a cache line
• All processors MUST acknowledge a message for it to be valid
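The state changes can be sketched as a small transition function. This is a simplified textbook MESI model seen from one cache's point of view. It leaves out the Forward state and all bus messaging, and MesiSketch and its event names are hypothetical.

```java
// Sketch only: simplified MESI transitions for one cache line in one cache.
// The Forward state and the actual RFO/acknowledge messages are not modeled.
final class MesiSketch {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }
    enum Event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE }

    /**
     * @param otherCopiesExist whether another cache holds the line; only consulted
     *                         when a local read has to fetch an INVALID line
     */
    static State next(State current, Event event, boolean otherCopiesExist) {
        switch (event) {
            case LOCAL_READ:
                if (current == State.INVALID) {
                    return otherCopiesExist ? State.SHARED : State.EXCLUSIVE;
                }
                return current;             // M, E and S can be read in place
            case LOCAL_WRITE:
                return State.MODIFIED;      // from SHARED or INVALID this implies an RFO first
            case REMOTE_READ:
                // A holder drops to SHARED (a MODIFIED line is written back first).
                return current == State.INVALID ? State.INVALID : State.SHARED;
            case REMOTE_WRITE:
                return State.INVALID;       // another processor took ownership
            default:
                throw new IllegalArgumentException("unknown event: " + event);
        }
    }
}
```

Walking a line through LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE takes it from Invalid to Exclusive, Modified, Shared and back to Invalid.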

Static RAM (SRAM)

• Requires 6-8 pieces of circuitry per datum
• Cycle rate access, not quite measurable in time
• Uses a relatively large amount of power for what it does
• Data does not fade or leak, does not need to be refreshed/recharged

Dynamic RAM (DRAM)

• Requires 2 pieces of circuitry per datum
• “Leaks” charge, but not sooner than 64ms
• Reads deplete the charge, requiring subsequent recharge
• Takes 240 cycles (~100ns) to access
• Intel's Nehalem architecture - each CPU socket controls a portion of RAM, no other socket has direct access to it
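The "240 cycles (~100ns)" figure implies a clock of roughly 2.4 GHz, since a core running at f GHz completes f cycles per nanosecond. A small sketch of that arithmetic follows. ClockMath is a hypothetical helper, and the clock speed is an inference from the slide's own numbers.

```java
// Sketch only: converts between nanoseconds and core cycles for a given clock speed.
final class ClockMath {
    /** A core at ghz GHz completes ghz cycles per nanosecond. */
    static long nanosToCycles(double nanos, double ghz) {
        return Math.round(nanos * ghz);
    }

    static double cyclesToNanos(long cycles, double ghz) {
        return cycles / ghz;
    }

    public static void main(String[] args) {
        // 100 ns of DRAM latency at an assumed 2.4 GHz clock
        System.out.println(nanosToCycles(100.0, 2.4) + " cycles");
    }
}
```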

Architectures

Current Processors

• Intel
  – Nehalem (Tock), Westmere (Tick, 32nm)
  – Sandy Bridge (Tock)
  – Ivy Bridge (Tick, 22nm)
  – Haswell (Tock)
• AMD
  – Bulldozer
• Oracle
  – UltraSPARC isn't dead

“Latency Numbers Everyone Should Know”

L1 cache reference                            0.5 ns
Branch mispredict                               5 ns
L2 cache reference                              7 ns
Mutex lock/unlock                              25 ns
Main memory reference                         100 ns
Compress 1K bytes with Zippy                3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network          20,000 ns  =  20 µs
SSD random read                           150,000 ns  = 150 µs
Read 1 MB sequentially from memory        250,000 ns  = 250 µs
Round trip within same datacenter         500,000 ns  = 0.5 ms
Read 1 MB sequentially from SSD         1,000,000 ns  =   1 ms
Disk seek                              10,000,000 ns  =  10 ms
Read 1 MB sequentially from disk       20,000,000 ns  =  20 ms
Send packet CA->Netherlands->CA       150,000,000 ns  = 150 ms

• Shamelessly cribbed from this gist: https://gist.github.com/2843375, originally by Peter Norvig and amended by Jeff Dean

Measured Cache Latencies

Sandy Bridge-E        L1d      L2       L3       Main
======================================================
Sequential Access     3 clk    11 clk   14 clk   6ns
Full Random Access    3 clk    11 clk   38 clk   65.8ns

SI Software's benchmarks: http://www.sisoftware.net/?d=qa&f=ben_mem_latency

Registers

• On-core for instructions being executed and their operands
• Can be accessed in a single cycle
• There are many different types
• A 64-bit Intel Nehalem CPU had 128 Integer & 128 floating point registers

Store Buffers

• Hold data for Out of Order (OoO) execution
• Fully associative
• Prevent “stalls” in execution on a thread when the cache line is not local to a core on a write
• ~1 cycle

Level Zero (L0)

• Added in Sandy Bridge
• A cache of the last 1536 uops decoded
• Well-suited for hot loops
• Not the same as the older “trace” cache

Level One (L1)

• Divided into data and instructions
• 32K data (L1d), 32K instructions (L1i) per core on a Sandy Bridge, Ivy Bridge and Haswell
• Sandy Bridge loads data at 256 bits per cycle, double that of Nehalem
• 3-4 cycles to access L1d

Level Two (L2)

• 256K per core on a Sandy Bridge, Ivy Bridge & Haswell
• 2MB per “module” on AMD's Bulldozer architecture
• ~11 cycles to access
• Unified data and instruction caches from here up
• If the working set size is larger than L2, misses grow

Level Three (L3)

• Was a “unified” cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture. Laptops might have 6-8MB, but server-class might have 30MB
• 14-38 cycles to access

Level Four (L4)

• Some versions of Haswell will have a 128 MB L4 cache
• No latency benchmarks for this yet

Programming Tips

Striding & Pre-fetching

• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: “Memory Access Patterns are Important”
• Shows the importance of locality and striding
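A minimal sketch of the effect, not taken from the blog post mentioned above: both methods add up every element exactly once, but the strided version jumps through memory instead of walking it line by line. StrideSketch, the array size and the stride are assumptions, and the timing in main is a crude single run with no JIT warm-up, so treat any numbers as indicative only.

```java
import java.util.Arrays;

// Sketch only: the same sum computed with sequential and with strided access.
final class StrideSketch {

    /** Walks the array in address order, which hardware pre-fetchers handle well. */
    static long sumSequential(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    /** Visits every element once, but in jumps of 'stride' elements (stride must be positive). */
    static long sumStrided(int[] data, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < data.length; i += stride) {
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1 << 24];   // 64 MB of ints, larger than the caches discussed here
        Arrays.fill(data, 1);
        long t0 = System.nanoTime();
        long a = sumSequential(data);
        long t1 = System.nanoTime();
        long b = sumStrided(data, 4096);
        long t2 = System.nanoTime();
        System.out.println("sequential: " + a + " in " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("strided:    " + b + " in " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```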

Cache Misses

• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types:
  – Compulsory
  – Capacity
  – Conflict

Programming Optimizations

• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
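The CAS point can be illustrated with a retry loop on java.util.concurrent.atomic.AtomicLong. The example keeps a running maximum without taking a lock. CasMaxSketch and updateMax are hypothetical names chosen for this sketch.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch only: a lock-free "record the maximum" built on compare-and-set.
final class CasMaxSketch {

    /** Raises max to candidate if candidate is larger; returns the value in effect afterwards. */
    static long updateMax(AtomicLong max, long candidate) {
        while (true) {
            long current = max.get();
            if (candidate <= current) {
                return current;                       // nothing to do, no write issued
            }
            if (max.compareAndSet(current, candidate)) {
                return candidate;                     // our write won
            }
            // Another thread changed the value in between: re-read and retry.
        }
    }

    public static void main(String[] args) {
        AtomicLong max = new AtomicLong(5);
        System.out.println(updateMax(max, 9));
    }
}
```

The retry loop is the extra algorithmic complexity the slide refers to: there is no kernel arbitration, but the code has to cope with losing a race.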

What about Functional Programming?

• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations

Hyperthreading

• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7

Data Structures

• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
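As a sketch of the "array-based and contiguous" option, here is a minimal growable list of primitive ints. IntArrayList is a hypothetical class written for this example, not the Fastutil type of a similar name, and it is single-threaded rather than lock-free.

```java
import java.util.Arrays;

// Sketch only: a growable list backed by one contiguous int[] (no boxing, no node objects).
final class IntArrayList {
    private int[] items = new int[16];
    private int size;

    void add(int value) {
        if (size == items.length) {
            items = Arrays.copyOf(items, size * 2);   // grow by doubling, still one contiguous block
        }
        items[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
        return items[index];
    }

    int size() {
        return size;
    }

    /** Sequential walk over adjacent memory, friendly to the hardware pre-fetcher. */
    long sum() {
        long total = 0;
        for (int i = 0; i < size; i++) {
            total += items[i];
        }
        return total;
    }
}
```

Compared with a LinkedList<Integer>, iteration here touches adjacent memory and allocates no per-element objects, which addresses both the locality and the garbage concerns listed above.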

Application Memory Wall & GC

• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  – IBM's Metronome, very predictable
  – Azul's C4, very performant

Using GPUs

• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform which communicates between GPUs and CPUs via shared L3

The Future

ManyCore

• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic

Memristor

• Non-volatile static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary

Phase Change Memory (PRAM)

• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its “neuromorphic” chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed

Thanks

Credits

• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content

Page 2: Cpu Caches

Agenda

bull Goalbull Defini3onsbull Architecturesbull Development13 Tipsbull The13 Future

Goal

Provide13 you13 with13 the13 informa3on13 you13 need13 about13 CPU13 caches13 so13 that13 you13 can13 improve13 the13

performance13 of13 your13 applica3ons

Why

bull Increased13 virtualiza3on13 ndash Run3me13 (JVM13 RVM)ndash PlaRormsEnvironments13 (cloud)

bull Disruptor13 2011

Defini7ons

SMP

bull Symmetric13 Mul3processor13 (SMP)13 Architecturebull Shared13 main13 memory13 controlled13 by13 single13 OSbull No13 more13 Northbridge

NUMA

bull Non-shy‐Uniform13 Memory13 Accessbull The13 organiza3on13 of13 processors13 reflect13 the13 3me13 to13 access13 data13 in13 RAM13 called13 the13 ldquoNUMA13 factorrdquo

bull Shared13 memory13 space13 (as13 opposed13 to13 mul3ple13 commodity13 machines)

Data13 Locality

bull The13 most13 cri3cal13 factor13 in13 performance13 13 Google13 argues13 otherwise

bull Not13 guaranteed13 by13 a13 JVMbull Spa7al13 -shy‐13 reused13 over13 and13 over13 in13 a13 loop13 data13 accessed13 in13 small13 regions

bull Temporal13 -shy‐13 high13 probability13 it13 will13 be13 reused13 before13 long

Memory13 Controller

bull Manages13 communica3on13 of13 readswrites13 between13 the13 CPU13 and13 RAM

bull Integrated13 Memory13 Controller13 on13 die

Cache13 Lines

bull 32-shy‐25613 con3guous13 bytes13 most13 commonly13 64bull Beware13 ldquofalse13 sharingrdquobull Use13 padding13 to13 ensure13 unshared13 linesbull Transferred13 in13 64-shy‐bit13 blocks13 (8x13 for13 6413 byte13 lines)13 arriving13 every13 ~413 cycles

bull Posi3on13 in13 the13 line13 of13 the13 ldquocri3cal13 wordrdquo13 ma9ers13 but13 not13 if13 pre-shy‐fetched

bull Contended13 annota3on13 coming13 in13 Java13 8

Cache13 Associa7vity

bull Fully13 Associa7ve13 Put13 it13 anywherebull Somewhere13 in13 the13 middle13 n-shy‐way13 set-shy‐associa3ve13 2-shy‐way13 skewed-shy‐associa3ve

bull Direct13 Mapped13 Each13 entry13 can13 only13 go13 in13 one13 specific13 place13

Cache13 Evic7on13 Strategies

bull Least13 Recently13 Used13 (LRU)bull Pseudo-shy‐LRU13 (PLRU)13 for13 large13 associa3vity13 caches

bull 2-shy‐Way13 Set13 Associa7vebull Direct13 Mappedbull Others

Cache13 Write13 Strategies

bull Write13 through13 changed13 cache13 line13 immediately13 goes13 back13 to13 main13 memory

bull Write13 back13 cache13 line13 is13 marked13 when13 dirty13 evic3on13 sends13 back13 to13 main13 memory

bull Write13 combining13 grouped13 writes13 of13 cache13 lines13 back13 to13 main13 memory

bull Uncacheable13 dynamic13 values13 that13 can13 change13 without13 warning

Exclusive13 versus13 Inclusive

bull Only13 relevant13 below13 L3bull AMD13 is13 exclusivendash Progressively13 more13 costly13 due13 to13 evic3onndash Can13 hold13 more13 datandash Bulldozer13 uses13 write13 through13 from13 L1d13 back13 to13 L2

bull Intel13 is13 inclusivendash Can13 be13 be9er13 for13 inter-shy‐processor13 memory13 sharingndash More13 expensive13 as13 lines13 in13 L113 are13 also13 in13 L213 amp13 L3ndash If13 evicted13 in13 a13 higher13 level13 cache13 must13 be13 evicted13 below13 as13 well

Inter-shy‐Socket13 Communica7on

bull GTs13 ndash13 gigatransfers13 per13 secondbull Quick13 Path13 Interconnect13 (QPI13 Intel)13 ndash13 8GTsbull HyperTransport13 (HTX13 AMD)13 ndash13 64GTs13 ()bull Both13 transfer13 1613 bits13 per13 transmission13 in13 prac3ce13 but13 Sandy13 Bridge13 is13 really13 32

MESI+F13 Cache13 Coherency13 Protocol

bull Specific13 to13 data13 cache13 linesbull Request13 for13 Ownership13 (RFO)13 when13 a13 processor13 tries13 to13 write13 to13

a13 cache13 linebull Modified13 the13 local13 processor13 has13 changed13 the13 cache13 line13 implies13

only13 one13 who13 has13 itbull Exclusive13 one13 processor13 is13 using13 the13 cache13 line13 not13 modifiedbull Shared13 mul3ple13 processors13 are13 using13 the13 cache13 line13 not13

modifiedbull Invalid13 the13 cache13 line13 is13 invalid13 must13 be13 re-shy‐fetchedbull Forward13 designate13 to13 respond13 to13 requests13 for13 a13 cache13 linebull All13 processors13 MUST13 acknowledge13 a13 message13 for13 it13 to13 be13 valid

Sta7c13 RAM13 (SRAM)

bull Requires13 6-shy‐813 pieces13 of13 circuitry13 per13 datumbull Cycle13 rate13 access13 not13 quite13 measurable13 in13 3mebull Uses13 a13 rela3vely13 large13 amount13 of13 power13 for13 what13 it13 does

bull Data13 does13 not13 fade13 or13 leak13 does13 not13 need13 to13 be13 refreshedrecharged

Dynamic13 RAM13 (DRAM)

bull Requires13 213 pieces13 of13 circuitry13 per13 datumbull ldquoLeaksrdquo13 charge13 but13 not13 sooner13 than13 64msbull Reads13 deplete13 the13 charge13 requiring13 subsequent13 recharge

bull Takes13 24013 cycles13 (~100ns)13 to13 accessbull Intels13 Nehalem13 architecture13 -shy‐13 each13 CPU13 socket13 controls13 a13 por3on13 of13 RAM13 no13 other13 socket13 has13 direct13 access13 to13 it

Architectures

Current13 Processors

bull Intelndash Nehalem13 (Tock)13 Westmere13 (Tick13 32nm)ndash Sandy13 Bridge13 (Tock)ndash Ivy13 Bridge13 (Tick13 22nm)ndash Haswell13 (Tock)

bull AMDndash Bulldozer

bull Oraclendash UltraSPARC13 isnt13 dead

ldquoLatency13 Numbers13 Everyone13 Should13 KnowrdquoL1 cache reference 05 ns

Branch mispredict 5 ns

L2 cache reference 7 ns

Mutex lockunlock 25 ns

Main memory reference 100 ns

Compress 1K bytes with Zippy 3000 ns = 3 micros

Send 2K bytes over 1 Gbps network 20000 ns = 20 micros

SSD random read 150000 ns = 150 micros

Read 1 MB sequentially from memory 250000 ns = 250 micros

Round trip within same datacenter 500000 ns = 05 ms

Read 1 MB sequentially from SSD 1000000 ns = 1 ms

Disk seek 10000000 ns = 10 ms

Read 1 MB sequentially from disk 20000000 ns = 20 ms

Send packet CA-gtNetherlands-gtCA 150000000 ns = 150 ms

bull Shamelessly13 cribbed13 from13 this13 gist13 h9psgistgithubcom284337513 originally13 by13 Peter13 Norvig13 and13 amended13 by13 Jeff13 Dean

Measured13 Cache13 Latencies

Sandy Bridge-E L1d L2 L3 Main=======================================================================Sequential Access 3 clk 11 clk 14 clk 6nsFull Random Access 3 clk 11 clk 38 clk 658ns

SI13 Sotwares13 benchmarks13 h9pwwwsisotwarenetd=qaampf=ben_mem_latency

Registers

bull On-shy‐core13 for13 instruc3ons13 being13 executed13 and13 their13 operands

bull Can13 be13 accessed13 in13 a13 single13 cyclebull There13 are13 many13 different13 typesbull A13 64-shy‐bit13 Intel13 Nehalem13 CPU13 had13 12813 Integer13 amp13 12813 floa3ng13 point13 registers

Store13 Buffers

bull Hold13 data13 for13 Out13 of13 Order13 (OoO)13 execu3onbull Fully13 associa3vebull Prevent13 ldquostallsrdquo13 in13 execu3on13 on13 a13 thread13 when13 the13 cache13 line13 is13 not13 local13 to13 a13 core13 on13 a13 write

bull ~113 cycle

Level13 Zero13 (L0)

bull Added13 in13 Sandy13 Bridgebull A13 cache13 of13 the13 last13 153613 uops13 decodedbull Well-shy‐suited13 for13 hot13 loopsbull Not13 the13 same13 as13 the13 older13 trace13 cache

Level13 One13 (L1)

bull Divided13 into13 data13 and13 instruc3onsbull 32K13 data13 (L1d)13 32K13 instruc3ons13 (L1i)13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 and13 Haswell

bull Sandy13 Bridge13 loads13 data13 at13 25613 bits13 per13 cycle13 double13 that13 of13 Nehalem

bull 3-shy‐413 cycles13 to13 access13 L1d

Level13 Two13 (L2)

bull 256K13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 amp13 Haswell

bull 2MB13 per13 ldquomodulerdquo13 on13 AMDs13 Bulldozer13 architecture

bull ~1113 cycles13 to13 accessbull Unified13 data13 and13 instruc3on13 caches13 from13 here13 up

bull If13 the13 working13 set13 size13 is13 larger13 than13 L213 misses13 grow

Level13 Three13 (L3)

bull Was13 a13 ldquounifiedrdquo13 cache13 up13 un3l13 Sandy13 Bridge13 shared13 between13 cores

bull Varies13 in13 size13 with13 different13 processors13 and13 versions13 of13 an13 architecture13 13 Laptops13 might13 have13 6-shy‐8MB13 but13 server-shy‐class13 might13 have13 30MB

bull 14-shy‐3813 cycles13 to13 access

Level13 Four13 (L4)

bull Some13 versions13 of13 Haswell13 will13 have13 a13 12813 MB13 L413 cache

bull No13 latency13 benchmarks13 for13 this13 yet

Programming13 Tips

Striding13 amp13 Pre-shy‐fetching

bull Predictable13 memory13 access13 is13 really13 importantbull Hardware13 pre-shy‐fetcher13 on13 the13 core13 looks13 for13 pa9erns13 of13 memory13 access

bull Can13 be13 counter-shy‐produc3ve13 if13 the13 access13 pa9ern13 is13 not13 predictable

bull Mar3n13 Thompson13 blog13 post13 ldquoMemory13 Access13 Pa9erns13 are13 Importantrdquo

bull Shows13 the13 importance13 of13 locality13 and13 striding

Cache13 Misses

bull Cost13 hundreds13 of13 cyclesbull Keep13 your13 code13 simplebull Instruc3on13 read13 misses13 are13 most13 expensivebull Data13 read13 miss13 are13 less13 so13 but13 s3ll13 hurt13 performancebull Write13 misses13 are13 okay13 unless13 using13 Write13 Throughbull Miss13 types

ndash Compulsoryndash Capacityndash Conflict

Programming13 Op7miza7ons

bull Stack13 allocated13 data13 is13 cheapbull Pointer13 interac3on13 -shy‐13 you13 have13 to13 retrieve13 data13 being13 pointed13 to13 even13 in13 registers

bull Avoid13 locking13 and13 resultant13 kernel13 arbitra3onbull CAS13 is13 be9er13 and13 occurs13 on-shy‐thread13 but13 algorithms13 become13 more13 complex

bull Match13 workload13 to13 the13 size13 of13 the13 last13 level13 cache13 (LLC13 L3L4)

What13 about13 Func7onal13 Programming

bull Have13 to13 allocate13 more13 and13 more13 space13 for13 your13 data13 structures13 leads13 to13 evic3on

bull When13 you13 cycle13 back13 around13 you13 get13 cache13 misses

bull Choose13 immutability13 by13 default13 profile13 to13 find13 poor13 performance

bull Use13 mutable13 data13 in13 targeted13 loca3ons

Hyperthreading

bull Great13 for13 IO-shy‐bound13 applica3onsbull If13 you13 have13 lots13 of13 cache13 missesbull Doesnt13 do13 much13 for13 CPU-shy‐bound13 applica3onsbull You13 have13 half13 of13 the13 cache13 resources13 per13 corebull NOTE13 -shy‐13 Haswell13 only13 has13 Hyperthreading13 on13 i7

Data13 Structures

bull BAD13 Linked13 list13 structures13 and13 tree13 structuresbull BAD13 Javas13 HashMap13 uses13 chained13 bucketsbull BAD13 Standard13 Java13 collec3ons13 generate13 lots13 of13 garbage

bull GOOD13 Array-shy‐based13 and13 con3guous13 in13 memory13 is13 much13 faster

bull GOOD13 Write13 your13 own13 that13 are13 lock-shy‐free13 and13 con3guous

bull GOOD13 Fastu3l13 library13 but13 note13 that13 its13 addi3ve

Applica7on13 Memory13 Wall13 amp13 GC

bull Tremendous13 amounts13 of13 RAM13 at13 low13 costbull GC13 will13 kill13 you13 with13 compac3onbull Use13 pauseless13 GCndash IBMs13 Metronome13 very13 predictablendash Azuls13 C413 very13 performant

Using13 GPUs

bull Remember13 locality13 ma9ersbull Need13 to13 be13 able13 to13 export13 a13 task13 with13 data13 that13 does13 not13 need13 to13 update

bull AMD13 has13 the13 new13 HSA13 plaRorm13 which13 communicates13 between13 GPUs13 and13 CPUs13 via13 shared13 L3

The13 Future

ManyCore

bull David13 Ungar13 says13 gt13 2413 cores13 generally13 many13 10s13 of13 cores

bull Really13 gets13 interes3ng13 above13 100013 coresbull Cache13 coherency13 wont13 be13 possiblebull Non-shy‐determinis3c

Memristor

bull Non-shy‐vola3le13 sta3c13 RAM13 same13 write13 endurance13 as13 Flash

bull 200-shy‐30013 MB13 on13 chipbull Sub-shy‐nanosecond13 writesbull Able13 to13 perform13 processing13 13 (Probably13 not)bull Mul3state13 not13 binary

Phase13 Change13 Memory13 (PRAM)

bull Higher13 performance13 than13 todays13 DRAMbull Intel13 seems13 more13 fascinated13 by13 this13 released13 its13 neuromorphic13 chip13 design13 last13 Fall

bull Not13 able13 to13 perform13 processingbull Write13 degrada3on13 is13 supposedly13 much13 slowerbull Was13 considered13 suscep3ble13 to13 uninten3onal13 change13 maybe13 fixed

Thanks

Credits

bull What13 Every13 Programmer13 Should13 Know13 About13 Memory13 Ulrich13 Drepper13 of13 RedHat13 2007

bull Java13 Performance13 Charlie13 Huntbull WikipediaWikimedia13 Commonsbull AnandTechbull The13 Microarchitecture13 of13 AMD13 Intel13 and13 VIA13 CPUsbull Everything13 You13 Need13 to13 Know13 about13 the13 Quick13 Path13 Interconnect13 Gabriel13

TorresHardware13 Secretsbull Inside13 the13 Sandy13 Bridge13 Architecture13 Gabriel13 TorresHardware13 Secretsbull Mar3n13 Thompsons13 Mechanical13 Sympathy13 blog13 and13 Disruptor13 presenta3onsbull The13 Applica3on13 Memory13 Wall13 Gil13 Tene13 CTO13 of13 Azul13 Systemsbull AMD13 BulldozerIntel13 Sandy13 Bridge13 Comparison13 Gionatan13 Dan3bull SI13 Sotwares13 Memory13 Latency13 Benchmarksbull Mar3n13 Thompson13 and13 Cliff13 Click13 provided13 feedback13 ampaddi3onal13 content

Page 3: Cpu Caches

Goal

Provide13 you13 with13 the13 informa3on13 you13 need13 about13 CPU13 caches13 so13 that13 you13 can13 improve13 the13

performance13 of13 your13 applica3ons

Why

bull Increased13 virtualiza3on13 ndash Run3me13 (JVM13 RVM)ndash PlaRormsEnvironments13 (cloud)

bull Disruptor13 2011

Defini7ons

SMP

bull Symmetric13 Mul3processor13 (SMP)13 Architecturebull Shared13 main13 memory13 controlled13 by13 single13 OSbull No13 more13 Northbridge

NUMA

bull Non-shy‐Uniform13 Memory13 Accessbull The13 organiza3on13 of13 processors13 reflect13 the13 3me13 to13 access13 data13 in13 RAM13 called13 the13 ldquoNUMA13 factorrdquo

bull Shared13 memory13 space13 (as13 opposed13 to13 mul3ple13 commodity13 machines)

Data13 Locality

bull The13 most13 cri3cal13 factor13 in13 performance13 13 Google13 argues13 otherwise

bull Not13 guaranteed13 by13 a13 JVMbull Spa7al13 -shy‐13 reused13 over13 and13 over13 in13 a13 loop13 data13 accessed13 in13 small13 regions

bull Temporal13 -shy‐13 high13 probability13 it13 will13 be13 reused13 before13 long

Memory13 Controller

bull Manages13 communica3on13 of13 readswrites13 between13 the13 CPU13 and13 RAM

bull Integrated13 Memory13 Controller13 on13 die

Cache13 Lines

bull 32-shy‐25613 con3guous13 bytes13 most13 commonly13 64bull Beware13 ldquofalse13 sharingrdquobull Use13 padding13 to13 ensure13 unshared13 linesbull Transferred13 in13 64-shy‐bit13 blocks13 (8x13 for13 6413 byte13 lines)13 arriving13 every13 ~413 cycles

bull Posi3on13 in13 the13 line13 of13 the13 ldquocri3cal13 wordrdquo13 ma9ers13 but13 not13 if13 pre-shy‐fetched

bull Contended13 annota3on13 coming13 in13 Java13 8

Cache13 Associa7vity

bull Fully13 Associa7ve13 Put13 it13 anywherebull Somewhere13 in13 the13 middle13 n-shy‐way13 set-shy‐associa3ve13 2-shy‐way13 skewed-shy‐associa3ve

bull Direct13 Mapped13 Each13 entry13 can13 only13 go13 in13 one13 specific13 place13

Cache13 Evic7on13 Strategies

bull Least13 Recently13 Used13 (LRU)bull Pseudo-shy‐LRU13 (PLRU)13 for13 large13 associa3vity13 caches

bull 2-shy‐Way13 Set13 Associa7vebull Direct13 Mappedbull Others

Cache13 Write13 Strategies

bull Write13 through13 changed13 cache13 line13 immediately13 goes13 back13 to13 main13 memory

bull Write13 back13 cache13 line13 is13 marked13 when13 dirty13 evic3on13 sends13 back13 to13 main13 memory

bull Write13 combining13 grouped13 writes13 of13 cache13 lines13 back13 to13 main13 memory

bull Uncacheable13 dynamic13 values13 that13 can13 change13 without13 warning

Exclusive13 versus13 Inclusive

bull Only13 relevant13 below13 L3bull AMD13 is13 exclusivendash Progressively13 more13 costly13 due13 to13 evic3onndash Can13 hold13 more13 datandash Bulldozer13 uses13 write13 through13 from13 L1d13 back13 to13 L2

bull Intel13 is13 inclusivendash Can13 be13 be9er13 for13 inter-shy‐processor13 memory13 sharingndash More13 expensive13 as13 lines13 in13 L113 are13 also13 in13 L213 amp13 L3ndash If13 evicted13 in13 a13 higher13 level13 cache13 must13 be13 evicted13 below13 as13 well

Inter-shy‐Socket13 Communica7on

bull GTs13 ndash13 gigatransfers13 per13 secondbull Quick13 Path13 Interconnect13 (QPI13 Intel)13 ndash13 8GTsbull HyperTransport13 (HTX13 AMD)13 ndash13 64GTs13 ()bull Both13 transfer13 1613 bits13 per13 transmission13 in13 prac3ce13 but13 Sandy13 Bridge13 is13 really13 32

MESI+F13 Cache13 Coherency13 Protocol

bull Specific13 to13 data13 cache13 linesbull Request13 for13 Ownership13 (RFO)13 when13 a13 processor13 tries13 to13 write13 to13

a13 cache13 linebull Modified13 the13 local13 processor13 has13 changed13 the13 cache13 line13 implies13

only13 one13 who13 has13 itbull Exclusive13 one13 processor13 is13 using13 the13 cache13 line13 not13 modifiedbull Shared13 mul3ple13 processors13 are13 using13 the13 cache13 line13 not13

modifiedbull Invalid13 the13 cache13 line13 is13 invalid13 must13 be13 re-shy‐fetchedbull Forward13 designate13 to13 respond13 to13 requests13 for13 a13 cache13 linebull All13 processors13 MUST13 acknowledge13 a13 message13 for13 it13 to13 be13 valid

Sta7c13 RAM13 (SRAM)

bull Requires13 6-shy‐813 pieces13 of13 circuitry13 per13 datumbull Cycle13 rate13 access13 not13 quite13 measurable13 in13 3mebull Uses13 a13 rela3vely13 large13 amount13 of13 power13 for13 what13 it13 does

bull Data13 does13 not13 fade13 or13 leak13 does13 not13 need13 to13 be13 refreshedrecharged

Dynamic13 RAM13 (DRAM)

bull Requires13 213 pieces13 of13 circuitry13 per13 datumbull ldquoLeaksrdquo13 charge13 but13 not13 sooner13 than13 64msbull Reads13 deplete13 the13 charge13 requiring13 subsequent13 recharge

bull Takes13 24013 cycles13 (~100ns)13 to13 accessbull Intels13 Nehalem13 architecture13 -shy‐13 each13 CPU13 socket13 controls13 a13 por3on13 of13 RAM13 no13 other13 socket13 has13 direct13 access13 to13 it

Architectures

Current13 Processors

bull Intelndash Nehalem13 (Tock)13 Westmere13 (Tick13 32nm)ndash Sandy13 Bridge13 (Tock)ndash Ivy13 Bridge13 (Tick13 22nm)ndash Haswell13 (Tock)

bull AMDndash Bulldozer

bull Oraclendash UltraSPARC13 isnt13 dead

ldquoLatency13 Numbers13 Everyone13 Should13 KnowrdquoL1 cache reference 05 ns

Branch mispredict 5 ns

L2 cache reference 7 ns

Mutex lockunlock 25 ns

Main memory reference 100 ns

Compress 1K bytes with Zippy 3000 ns = 3 micros

Send 2K bytes over 1 Gbps network 20000 ns = 20 micros

SSD random read 150000 ns = 150 micros

Read 1 MB sequentially from memory 250000 ns = 250 micros

Round trip within same datacenter 500000 ns = 05 ms

Read 1 MB sequentially from SSD 1000000 ns = 1 ms

Disk seek 10000000 ns = 10 ms

Read 1 MB sequentially from disk 20000000 ns = 20 ms

Send packet CA-gtNetherlands-gtCA 150000000 ns = 150 ms

bull Shamelessly13 cribbed13 from13 this13 gist13 h9psgistgithubcom284337513 originally13 by13 Peter13 Norvig13 and13 amended13 by13 Jeff13 Dean

Measured13 Cache13 Latencies

Sandy Bridge-E L1d L2 L3 Main=======================================================================Sequential Access 3 clk 11 clk 14 clk 6nsFull Random Access 3 clk 11 clk 38 clk 658ns

SI13 Sotwares13 benchmarks13 h9pwwwsisotwarenetd=qaampf=ben_mem_latency

Registers

• On-core for instructions being executed and their operands
• Can be accessed in a single cycle
• There are many different types
• A 64-bit Intel Nehalem CPU had 128 Integer & 128 floating point registers

Store Buffers

• Hold data for Out of Order (OoO) execution
• Fully associative
• Prevent "stalls" in execution on a thread when the cache line is not local to a core on a write
• ~1 cycle

Level Zero (L0)

• Added in Sandy Bridge
• A cache of the last 1536 uops decoded
• Well-suited for hot loops
• Not the same as the older "trace" cache

Level One (L1)

• Divided into data and instructions
• 32K data (L1d), 32K instructions (L1i) per core on a Sandy Bridge, Ivy Bridge and Haswell
• Sandy Bridge loads data at 256 bits per cycle, double that of Nehalem
• 3-4 cycles to access L1d

Level Two (L2)

• 256K per core on a Sandy Bridge, Ivy Bridge & Haswell
• 2MB per "module" on AMD's Bulldozer architecture
• ~11 cycles to access
• Unified data and instruction caches from here up
• If the working set size is larger than L2, misses grow

Level Three (L3)

• Was a "unified" cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture. Laptops might have 6-8MB, but server-class might have 30MB
• 14-38 cycles to access

Level Four (L4)

• Some versions of Haswell will have a 128 MB L4 cache
• No latency benchmarks for this yet

Programming Tips

Striding & Pre-fetching

• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: "Memory Access Patterns are Important"
• Shows the importance of locality and striding
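To make the point about predictable access concrete, here is a hedged sketch that sums the same array twice: once in index order, and once in a shuffled order that gives the hardware pre-fetcher no pattern to follow. The class name, the array size, and the fixed seed are assumptions for illustration; the timings printed by `main` are a rough indication only, not a rigorous benchmark (no warm-up, no JMH).

```java
import java.util.Random;

final class AccessPatterns {
    /** Walks the array in index order, so the next address is always predictable. */
    static long sumSequential(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) sum += data[i];
        return sum;
    }

    /** Visits the elements in the order given by an index array. */
    static long sumInOrder(int[] data, int[] order) {
        long sum = 0;
        for (int idx : order) sum += data[idx];
        return sum;
    }

    /** Builds a random permutation of 0..n-1 (Fisher-Yates) from a fixed seed. */
    static int[] shuffledIndices(int n, long seed) {
        int[] order = new int[n];
        for (int i = 0; i < n; i++) order[i] = i;
        Random rnd = new Random(seed);
        for (int i = n - 1; i > 0; i--) {
            int j = rnd.nextInt(i + 1);
            int tmp = order[i];
            order[i] = order[j];
            order[j] = tmp;
        }
        return order;
    }

    public static void main(String[] args) {
        int n = 1 << 23; // 8M ints = 32 MB, assumed to exceed the last level cache
        int[] data = new int[n];
        for (int i = 0; i < n; i++) data[i] = i & 0xFF;
        int[] order = shuffledIndices(n, 42L);

        long t0 = System.nanoTime();
        long a = sumSequential(data);
        long t1 = System.nanoTime();
        long b = sumInOrder(data, order);
        long t2 = System.nanoTime();
        System.out.printf("sequential: %d ms, shuffled: %d ms, same sum: %b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
    }
}
```

Both loops do the same arithmetic on the same data; any difference in running time comes from the order in which memory is touched.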

Cache Misses

• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types:
  – Compulsory
  – Capacity
  – Conflict

Programming Optimizations

• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
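The CAS bullet above can be sketched with the JDK's `AtomicLong`. This is a minimal illustration rather than a recommended design: the explicit retry loop shows why CAS-based algorithms get more complex than a `synchronized` block, even for a plain counter. `CasCounter` and `hammer` are hypothetical names introduced for this example.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CasCounter {
    private final AtomicLong value = new AtomicLong();

    /** Lock-free increment: read, compute, and retry until compareAndSet succeeds. */
    long increment() {
        while (true) {
            long current = value.get();
            long next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next; // no other thread changed the value in between
            }
            // Another thread won the race; loop and try again on this thread.
        }
    }

    long get() {
        return value.get();
    }

    /** Runs the given number of threads, each incrementing perThread times; returns the total. */
    static long hammer(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) counter.increment();
            });
            pool[t].start();
        }
        try {
            for (Thread th : pool) th.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(hammer(2, 100_000)); // prints 200000
    }
}
```

In practice `AtomicLong.incrementAndGet()` does the same job in one call; the loop is spelled out here only to show the retry structure.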

What about Functional Programming?

• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations

Hyperthreading

• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7

Data Structures

• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
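As an illustration of "array-based and contiguous", here is a hedged sketch of an int-to-int map that uses open addressing with linear probing over two flat `int[]` arrays, instead of chained bucket nodes. It is single-threaded, fixed-capacity, and has no removal, so it is a teaching sketch and not a replacement for `HashMap` or Fastutil. `IntIntMap` and its reserved `EMPTY` key are assumptions made for this example.

```java
import java.util.Arrays;

/** Fixed-capacity int->int map: keys and values live in two contiguous primitive arrays. */
final class IntIntMap {
    private static final int EMPTY = Integer.MIN_VALUE; // reserved marker, cannot be used as a key
    private final int[] keys;
    private final int[] values;
    private final int mask;
    private int size;

    IntIntMap(int capacityPowerOfTwo) {
        if (capacityPowerOfTwo <= 0 || Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        keys = new int[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
        Arrays.fill(keys, EMPTY);
    }

    void put(int key, int value) {
        if (key == EMPTY) throw new IllegalArgumentException("reserved key");
        int i = slot(key);
        // Linear probing: a collision moves to the neighbouring slot, usually on the same cache line.
        while (keys[i] != EMPTY && keys[i] != key) i = (i + 1) & mask;
        if (keys[i] == EMPTY) {
            // Always keep one slot empty so that probe loops terminate.
            if (size == keys.length - 1) throw new IllegalStateException("map is full");
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    int get(int key, int missing) {
        int i = slot(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) return values[i];
            i = (i + 1) & mask;
        }
        return missing;
    }

    int size() {
        return size;
    }

    private int slot(int key) {
        return (key ^ (key >>> 16)) & mask; // cheap bit mix, then wrap to the table size
    }
}
```

No per-entry objects are allocated, so lookups touch neighbouring array slots rather than chasing pointers, and the map produces no garbage after construction.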

Application Memory Wall & GC

• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  – IBM's Metronome, very predictable
  – Azul's C4, very performant

Using GPUs

• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform which communicates between GPUs and CPUs via shared L3

The Future

ManyCore

• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic

Memristor

• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary

Phase Change Memory (PRAM)

• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its "neuromorphic" chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed

Thanks

Credits

• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SiSoftware's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content

Page 4: Cpu Caches

Why?

• Increased virtualization
  – Runtime (JVM, RVM)
  – Platforms/Environments (cloud)
• Disruptor, 2011

Definitions

SMP

• Symmetric Multiprocessor (SMP) Architecture
• Shared main memory controlled by single OS
• No more Northbridge

NUMA

• Non-Uniform Memory Access
• The organization of processors reflects the time to access data in RAM, called the "NUMA factor"
• Shared memory space (as opposed to multiple commodity machines)

Data Locality

• The most critical factor in performance (Google argues otherwise)
• Not guaranteed by a JVM
• Spatial - reused over and over in a loop, data accessed in small regions
• Temporal - high probability it will be reused before long
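A small sketch of spatial locality in Java terms, under the assumption that each row of an `int[][]` is its own contiguous `int[]` (the JVM does not promise that different rows sit next to each other, which is part of why locality is "not guaranteed by a JVM"). The class and method names are hypothetical. Both methods compute the same sum; only the traversal order differs.

```java
final class Traversal {
    /** Inner loop walks along one row, so consecutive reads hit neighbouring addresses. */
    static long sumRowMajor(int[][] grid) {
        long sum = 0;
        for (int r = 0; r < grid.length; r++) {
            for (int c = 0; c < grid[r].length; c++) {
                sum += grid[r][c];
            }
        }
        return sum;
    }

    /** Inner loop switches to a different row array on every step. Assumes a rectangular grid. */
    static long sumColumnMajor(int[][] grid) {
        long sum = 0;
        int cols = grid.length == 0 ? 0 : grid[0].length;
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < grid.length; r++) {
                sum += grid[r][c];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[][] grid = new int[2048][2048];
        for (int[] row : grid) java.util.Arrays.fill(row, 1);
        System.out.println(sumRowMajor(grid) == sumColumnMajor(grid)); // prints true
    }
}
```

On large grids the row-major version tends to run faster because each fetched cache line is fully used before the loop moves on, but the size of the effect depends on the machine and the JVM.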

Memory Controller

• Manages communication of reads/writes between the CPU and RAM
• Integrated Memory Controller on die

Cache Lines

• 32-256 contiguous bytes, most commonly 64
• Beware "false sharing"
• Use padding to ensure unshared lines
• Transferred in 64-bit blocks (8x for 64 byte lines), arriving every ~4 cycles
• Position in the line of the "critical word" matters, but not if pre-fetched
• @Contended annotation coming in Java 8
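The padding advice can be sketched as follows. This assumes 64-byte cache lines and relies on manual padding fields, which the JVM is free to reorder or drop, so treat the layout as a hint rather than a guarantee. `PaddedCounter` and `FalseSharingDemo` are hypothetical names for this example; the `@Contended` annotation is deliberately not used here because it lives in an internal package and needs JVM flags for application code.

```java
/** A counter whose hot field is surrounded by unused longs (assumed 64-byte cache line). */
final class PaddedCounter {
    long p1, p2, p3, p4, p5, p6, p7;   // 56 bytes of padding before the hot field
    volatile long value;               // the only field that is actually written
    long q1, q2, q3, q4, q5, q6, q7;   // 56 bytes of padding after the hot field
}

final class FalseSharingDemo {
    /** Two threads, each incrementing only its own counter; returns the combined total. */
    static long run(int iterations) {
        PaddedCounter a = new PaddedCounter();
        PaddedCounter b = new PaddedCounter();
        Thread t1 = new Thread(() -> { for (int i = 0; i < iterations; i++) a.value++; });
        Thread t2 = new Thread(() -> { for (int i = 0; i < iterations; i++) b.value++; });
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return a.value + b.value; // one writer per counter, so no increments are lost
    }

    public static void main(String[] args) {
        System.out.println(run(1_000_000)); // prints 2000000
    }
}
```

Without the padding, two small counters allocated back to back could land on the same line, and each write by one thread would invalidate the other core's copy even though the threads never touch the same field.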

Cache Associativity

• Fully Associative: Put it anywhere
• Somewhere in the middle: n-way set-associative, 2-way skewed-associative
• Direct Mapped: Each entry can only go in one specific place
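The three placement schemes differ only in how many sets the cache is divided into. A minimal sketch of the usual index calculation, assuming a 32 KB cache with 64-byte lines and simple modulo indexing (real hardware may hash addresses differently; the 8-way figure below is an illustrative assumption). `CacheGeometry` is a hypothetical helper, not a real API.

```java
final class CacheGeometry {
    final int lineBytes;
    final int ways;
    final int sets;

    CacheGeometry(int totalBytes, int lineBytes, int ways) {
        this.lineBytes = lineBytes;
        this.ways = ways;
        this.sets = totalBytes / (lineBytes * ways);
    }

    /** Set that an address maps to: its line number modulo the number of sets. */
    int setIndex(long address) {
        return (int) ((address / lineBytes) % sets);
    }

    public static void main(String[] args) {
        int size = 32 * 1024;
        CacheGeometry directMapped = new CacheGeometry(size, 64, 1);       // 512 sets of 1 line
        CacheGeometry eightWay = new CacheGeometry(size, 64, 8);           // 64 sets of 8 lines
        CacheGeometry fullyAssociative = new CacheGeometry(size, 64, 512); // 1 set of 512 lines

        long address = 4096;
        System.out.println(directMapped.setIndex(address));     // prints 64
        System.out.println(eightWay.setIndex(address));         // prints 0
        System.out.println(fullyAssociative.setIndex(address)); // prints 0
    }
}
```

In the 8-way case, addresses 0 and 4096 share set 0 but can coexist in different ways; in a direct-mapped cache, two addresses with the same index evict each other, which is where conflict misses come from.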

Cache Eviction Strategies

• Least Recently Used (LRU)
• Pseudo-LRU (PLRU) for large associativity caches
• 2-Way Set Associative
• Direct Mapped
• Others
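As a software analogy for the LRU policy (hardware implements it with a few status bits per set, not a hash map), the JDK's `LinkedHashMap` can be put in access order and told to drop its eldest entry. `LruCache` is a hypothetical name; the sketch only shows which entry LRU chooses to evict.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    LruCache(int capacity) {
        super(16, 0.75f, true); // accessOrder = true: get() moves an entry to the most-recent end
        this.capacity = capacity;
    }

    /** Called by LinkedHashMap after each insertion; returning true evicts the eldest entry. */
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity;
    }

    public static void main(String[] args) {
        LruCache<String, Integer> cache = new LruCache<>(2);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.get("a");    // "a" is now the most recently used entry
        cache.put("c", 3); // over capacity: evicts "b", the least recently used
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```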

Cache Write Strategies

• Write through: changed cache line immediately goes back to main memory
• Write back: cache line is marked when dirty, eviction sends back to main memory
• Write combining: grouped writes of cache lines back to main memory
• Uncacheable: dynamic values that can change without warning

Exclusive versus Inclusive

• Only relevant below L3
• AMD is exclusive
  – Progressively more costly due to eviction
  – Can hold more data
  – Bulldozer uses "write through" from L1d back to L2
• Intel is inclusive
  – Can be better for inter-processor memory sharing
  – More expensive as lines in L1 are also in L2 & L3
  – If evicted in a higher level cache, must be evicted below as well

Inter-Socket Communication

• GT/s – gigatransfers per second
• Quick Path Interconnect (QPI, Intel) – 8 GT/s
• HyperTransport (HTX, AMD) – 6.4 GT/s (?)
• Both transfer 16 bits per transmission in practice, but Sandy Bridge is really 32
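Transfer rate and payload width combine into bandwidth by simple multiplication. A hedged sketch of that arithmetic, using the 8 GT/s and 16-bit figures from the bullets above; the result is a raw per-direction payload figure and ignores protocol overhead, which the slides do not quantify. `LinkBandwidth` is a hypothetical helper.

```java
final class LinkBandwidth {
    /** Raw payload bandwidth in GB/s: transfers per second times bytes per transfer. */
    static double gigabytesPerSecond(double gigatransfersPerSecond, int bitsPerTransfer) {
        return gigatransfersPerSecond * bitsPerTransfer / 8.0;
    }

    public static void main(String[] args) {
        System.out.println(gigabytesPerSecond(8.0, 16)); // prints 16.0
        System.out.println(gigabytesPerSecond(8.0, 32)); // prints 32.0
    }
}
```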

MESI+F Cache Coherency Protocol

• Specific to data cache lines
• Request for Ownership (RFO), when a processor tries to write to a cache line
• Modified, the local processor has changed the cache line, implies only one who has it
• Exclusive, one processor is using the cache line, not modified
• Shared, multiple processors are using the cache line, not modified
• Invalid, the cache line is invalid, must be re-fetched
• Forward, designate to respond to requests for a cache line
• All processors MUST acknowledge a message for it to be valid
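The four core states can be written down as a small state machine. This is a simplified textbook sketch of MESI for a single cache line as seen by one core: it leaves out the Forward state, write-back traffic, and message acknowledgement, and the method names are made up for this example.

```java
enum Mesi {
    MODIFIED, EXCLUSIVE, SHARED, INVALID;

    /** This core reads the line. othersHaveCopy only matters when the line is INVALID here. */
    Mesi onLocalRead(boolean othersHaveCopy) {
        if (this != INVALID) return this;            // hit: the state does not change
        return othersHaveCopy ? SHARED : EXCLUSIVE;  // miss: the line is fetched
    }

    /** This core writes the line; from SHARED or INVALID this requires a Request for Ownership. */
    Mesi onLocalWrite() {
        return MODIFIED;
    }

    /** Another core reads the line; a MODIFIED copy is written back before being shared. */
    Mesi onRemoteRead() {
        return this == INVALID ? INVALID : SHARED;
    }

    /** Another core's Request for Ownership arrives: our copy, if any, is no longer valid. */
    Mesi onRemoteWrite() {
        return INVALID;
    }
}
```

For example, `Mesi.INVALID.onLocalRead(false)` yields `EXCLUSIVE`, a following `onLocalWrite()` yields `MODIFIED` with no bus traffic needed, and a later `onRemoteWrite()` drops the line back to `INVALID`, which is the ping-pong that makes false sharing expensive.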



Page 6: Cpu Caches

SMP

bull Symmetric13 Mul3processor13 (SMP)13 Architecturebull Shared13 main13 memory13 controlled13 by13 single13 OSbull No13 more13 Northbridge

NUMA

bull Non-shy‐Uniform13 Memory13 Accessbull The13 organiza3on13 of13 processors13 reflect13 the13 3me13 to13 access13 data13 in13 RAM13 called13 the13 ldquoNUMA13 factorrdquo

bull Shared13 memory13 space13 (as13 opposed13 to13 mul3ple13 commodity13 machines)

Data13 Locality

bull The13 most13 cri3cal13 factor13 in13 performance13 13 Google13 argues13 otherwise

bull Not13 guaranteed13 by13 a13 JVMbull Spa7al13 -shy‐13 reused13 over13 and13 over13 in13 a13 loop13 data13 accessed13 in13 small13 regions

bull Temporal13 -shy‐13 high13 probability13 it13 will13 be13 reused13 before13 long

Memory Controller

• Manages communication of reads/writes between the CPU and RAM
• Integrated Memory Controller on die

Cache Lines

• 32-256 contiguous bytes, most commonly 64
• Beware “false sharing”
• Use padding to ensure unshared lines
• Transferred in 64-bit blocks (8x for 64 byte lines), arriving every ~4 cycles
• Position in the line of the “critical word” matters, but not if pre-fetched
• @Contended annotation coming in Java 8
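
A rough sketch of false sharing and padding, assuming 64-byte cache lines (so 8 longs fill one line). Two threads each increment their own counter in an AtomicLongArray: first in neighbouring slots, which usually share a line, then in slots a full line apart. The class name, slot layout and iteration count are illustrative choices, not taken from the talk, and the measured gap will vary by CPU.

```java
import java.util.concurrent.atomic.AtomicLongArray;

final class FalseSharingDemo {
    // Assumption: 64-byte cache lines, so 8 longs span one full line.
    static final int LONGS_PER_LINE = 8;

    // Runs two threads, each incrementing only its own slot; returns elapsed nanoseconds.
    static long run(AtomicLongArray counters, int slotA, int slotB, long iterations) {
        Thread a = new Thread(() -> spin(counters, slotA, iterations));
        Thread b = new Thread(() -> spin(counters, slotB, iterations));
        long start = System.nanoTime();
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return System.nanoTime() - start;
    }

    private static void spin(AtomicLongArray counters, int slot, long iterations) {
        for (long i = 0; i < iterations; i++) {
            counters.incrementAndGet(slot);
        }
    }

    public static void main(String[] args) {
        long n = 20_000_000L;
        // Slots 0 and 1 are 8 bytes apart and usually sit on the same line.
        long adjacent = run(new AtomicLongArray(2 * LONGS_PER_LINE), 0, 1, n);
        // Slots 0 and 8 are 64 bytes apart, so they cannot share a 64-byte line.
        long padded = run(new AtomicLongArray(2 * LONGS_PER_LINE), 0, LONGS_PER_LINE, n);
        System.out.println("adjacent slots: " + adjacent / 1_000_000 + " ms");
        System.out.println("padded slots:   " + padded / 1_000_000 + " ms");
    }
}
```

The threads never touch each other's counter, so any slowdown in the adjacent case comes from the cache line bouncing between cores rather than from logical contention.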

Cache Associativity

• Fully Associative: Put it anywhere
• Somewhere in the middle: n-way set-associative, 2-way skewed-associative
• Direct Mapped: Each entry can only go in one specific place
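
The three placement policies differ only in how many sets the cache is divided into. A simplified model, assuming the set is chosen from the plain address with no hashing (real hardware may index differently); CacheGeometry and its fields are hypothetical names:

```java
final class CacheGeometry {
    final int lineSize; // bytes per cache line
    final int ways;     // lines per set
    final int sets;     // number of sets

    CacheGeometry(int cacheBytes, int lineSize, int ways) {
        this.lineSize = lineSize;
        this.ways = ways;
        this.sets = cacheBytes / (lineSize * ways);
    }

    // Drop the offset within the line, then take the line number modulo the set count.
    int setIndex(long address) {
        return (int) ((address / lineSize) % sets);
    }

    // The remaining high bits identify which line currently occupies a way of that set.
    long tag(long address) {
        return address / lineSize / sets;
    }

    public static void main(String[] args) {
        CacheGeometry direct = new CacheGeometry(32 * 1024, 64, 1);   // direct mapped: 512 sets of 1
        CacheGeometry eightWay = new CacheGeometry(32 * 1024, 64, 8); // 8-way: 64 sets of 8
        CacheGeometry full = new CacheGeometry(32 * 1024, 64, 512);   // fully associative: 1 set
        System.out.println(direct.sets + " / " + eightWay.sets + " / " + full.sets);
        // Addresses 0 and 4096 compete for the same set in the 8-way layout.
        System.out.println(eightWay.setIndex(0) + " vs " + eightWay.setIndex(4096));
    }
}
```

With one way per set every address has exactly one home (direct mapped); with a single set a line can go anywhere (fully associative); everything in between is n-way set-associative.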

Cache Eviction Strategies

• Least Recently Used (LRU)
• Pseudo-LRU (PLRU): for large associativity caches
• 2-Way Set Associative
• Direct Mapped
• Others
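
A software model of true LRU for a single cache set, sketched with java.util.LinkedHashMap in access order. This only illustrates the policy; it says nothing about how hardware implements it, and LruSet is a made-up name.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LruSet {
    private final LinkedHashMap<Long, Boolean> lines;
    private long hits;
    private long misses;

    LruSet(int wayCount) {
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.lines = new LinkedHashMap<Long, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                return size() > wayCount; // evict the least recently used line when the set is full
            }
        };
    }

    // Returns true on a hit; on a miss the line is installed, possibly evicting the LRU line.
    boolean access(long tag) {
        if (lines.get(tag) != null) { // get() also marks the line as most recently used
            hits++;
            return true;
        }
        misses++;
        lines.put(tag, Boolean.TRUE);
        return false;
    }

    long hits() { return hits; }
    long misses() { return misses; }
}
```

For a 2-way set, the access sequence 1, 2, 1, 3, 2, 3 gives miss, miss, hit, miss (evicting 2), miss (evicting 1), hit.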

Cache Write Strategies

• Write through: changed cache line immediately goes back to main memory
• Write back: cache line is marked when dirty, eviction sends back to main memory
• Write combining: grouped writes of cache lines back to main memory
• Uncacheable: dynamic values that can change without warning

Exclusive versus Inclusive

• Only relevant below L3
• AMD is exclusive
  – Progressively more costly due to eviction
  – Can hold more data
  – Bulldozer uses write through from L1d back to L2
• Intel is inclusive
  – Can be better for inter-processor memory sharing
  – More expensive as lines in L1 are also in L2 & L3
  – If evicted in a higher level cache, must be evicted below as well

Inter-Socket Communication

• GT/s – gigatransfers per second
• Quick Path Interconnect (QPI, Intel) – 8 GT/s
• HyperTransport (HTX, AMD) – 6.4 GT/s (?)
• Both transfer 16 bits per transmission in practice, but Sandy Bridge is really 32

MESI+F Cache Coherency Protocol

• Specific to data cache lines
• Request for Ownership (RFO): when a processor tries to write to a cache line
• Modified: the local processor has changed the cache line, implies only one who has it
• Exclusive: one processor is using the cache line, not modified
• Shared: multiple processors are using the cache line, not modified
• Invalid: the cache line is invalid, must be re-fetched
• Forward: designate to respond to requests for a cache line
• All processors MUST acknowledge a message for it to be valid
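
The four basic states can be sketched as a small state machine seen from one cache's point of view. This is a textbook-style simplification: it leaves out the Forward state, write-backs and the acknowledgement traffic, and the method names are invented for illustration.

```java
enum Mesi {
    MODIFIED, EXCLUSIVE, SHARED, INVALID;

    // This core reads the line. Only an Invalid line has to be fetched.
    Mesi onLocalRead(boolean otherCachesHoldLine) {
        if (this != INVALID) {
            return this;
        }
        return otherCachesHoldLine ? SHARED : EXCLUSIVE;
    }

    // This core writes the line. It always ends up as the sole, modified copy.
    Mesi onLocalWrite() {
        return MODIFIED;
    }

    // A write needs a Request for Ownership unless this core already holds the only copy.
    boolean writeNeedsRfo() {
        return this == SHARED || this == INVALID;
    }

    // Another core reads the line. Any valid copy here is now shared.
    Mesi onRemoteRead() {
        return this == INVALID ? INVALID : SHARED;
    }

    // Another core issued an RFO. The copy here is no longer valid.
    Mesi onRemoteRfo() {
        return INVALID;
    }
}
```

For example, a line loaded by a single core starts Exclusive, becomes Modified on that core's first write without any RFO, drops to Shared when a second core reads it, and becomes Invalid when that second core writes.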

Static RAM (SRAM)

• Requires 6-8 pieces of circuitry per datum
• Cycle rate access, not quite measurable in time
• Uses a relatively large amount of power for what it does
• Data does not fade or leak, does not need to be refreshed/recharged

Dynamic RAM (DRAM)

• Requires 2 pieces of circuitry per datum
• “Leaks” charge, but not sooner than 64ms
• Reads deplete the charge, requiring subsequent recharge
• Takes 240 cycles (~100ns) to access
• Intel's Nehalem architecture - each CPU socket controls a portion of RAM, no other socket has direct access to it

Architectures

Current Processors

• Intel
  – Nehalem (Tock) / Westmere (Tick, 32nm)
  – Sandy Bridge (Tock)
  – Ivy Bridge (Tick, 22nm)
  – Haswell (Tock)
• AMD
  – Bulldozer
• Oracle
  – UltraSPARC isn't dead

“Latency Numbers Everyone Should Know”

L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns
Mutex lock/unlock                            25 ns
Main memory reference                       100 ns
Compress 1K bytes with Zippy              3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network        20,000 ns  =  20 µs
SSD random read                         150,000 ns  = 150 µs
Read 1 MB sequentially from memory      250,000 ns  = 250 µs
Round trip within same datacenter       500,000 ns  = 0.5 ms
Read 1 MB sequentially from SSD       1,000,000 ns  =   1 ms
Disk seek                            10,000,000 ns  =  10 ms
Read 1 MB sequentially from disk     20,000,000 ns  =  20 ms
Send packet CA->Netherlands->CA     150,000,000 ns  = 150 ms

• Shamelessly cribbed from this gist: https://gist.github.com/2843375, originally by Peter Norvig and amended by Jeff Dean

Measured Cache Latencies

Sandy Bridge-E        L1d      L2       L3       Main
=======================================================
Sequential Access     3 clk    11 clk   14 clk   6 ns
Full Random Access    3 clk    11 clk   38 clk   65.8 ns

SI Software's benchmarks: http://www.sisoftware.net/?d=qa&f=ben_mem_latency

Registers

• On-core for instructions being executed and their operands
• Can be accessed in a single cycle
• There are many different types
• A 64-bit Intel Nehalem CPU had 128 Integer & 128 floating point registers

Store Buffers

• Hold data for Out of Order (OoO) execution
• Fully associative
• Prevent “stalls” in execution on a thread when the cache line is not local to a core on a write
• ~1 cycle

Level Zero (L0)

• Added in Sandy Bridge
• A cache of the last 1536 uops decoded
• Well-suited for hot loops
• Not the same as the older “trace” cache

Level One (L1)

• Divided into data and instructions
• 32K data (L1d), 32K instructions (L1i) per core on a Sandy Bridge, Ivy Bridge and Haswell
• Sandy Bridge loads data at 256 bits per cycle, double that of Nehalem
• 3-4 cycles to access L1d

Level Two (L2)

• 256K per core on a Sandy Bridge, Ivy Bridge & Haswell
• 2MB per “module” on AMD's Bulldozer architecture
• ~11 cycles to access
• Unified data and instruction caches from here up
• If the working set size is larger than L2, misses grow

Level Three (L3)

• Was a “unified” cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture. Laptops might have 6-8MB, but server-class might have 30MB.
• 14-38 cycles to access

Level Four (L4)

• Some versions of Haswell will have a 128 MB L4 cache
• No latency benchmarks for this yet

Programming Tips

Striding & Pre-fetching

• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: “Memory Access Patterns are Important”
• Shows the importance of locality and striding
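
A small sketch of striding, assuming a large int[] that does not fit in cache. Every element is visited exactly once whatever the stride, so the arithmetic is identical and only the order of memory accesses changes. The array size and the strides tried in main are arbitrary illustrative values; this is not the benchmark from the blog post mentioned above.

```java
final class StrideDemo {
    // Visits every element exactly once, in hops of `stride` elements.
    static long sumWithStride(int[] data, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < data.length; i += stride) {
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1 << 24]; // 64 MB of ints, larger than typical L3 sizes
        for (int i = 0; i < data.length; i++) {
            data[i] = i & 0xFF;
        }
        // Stride 1 is a linear walk; 16 ints is one 64-byte line per hop; 1024 ints is 4 KB per hop.
        for (int stride : new int[] { 1, 16, 1024 }) {
            long start = System.nanoTime();
            long sum = sumWithStride(data, stride);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println("stride " + stride + ": " + ms + " ms, sum " + sum);
        }
    }
}
```

With stride 1 each fetched line is fully used before moving on; with larger strides each hop typically touches a new line, so far more lines are transferred for the same amount of useful work.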

Cache Misses

• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types:
  – Compulsory
  – Capacity
  – Conflict

Programming Optimizations

• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
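
A minimal example of the CAS style, using java.util.concurrent.atomic.AtomicLong to track a running maximum without a lock. CasMax is a hypothetical class; the retry loop is the part that makes such algorithms "more complex" than simply taking a lock.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CasMax {
    private final AtomicLong max = new AtomicLong(Long.MIN_VALUE);

    // Lock-free update: retry until our CAS wins or the stored value is already at least as large.
    void offer(long candidate) {
        long current;
        do {
            current = max.get();
            if (candidate <= current) {
                return; // nothing to do, another value already dominates
            }
        } while (!max.compareAndSet(current, candidate));
    }

    long get() {
        return max.get();
    }
}
```

A failed compareAndSet means another thread changed the value in between, so the loop re-reads and tries again on the same thread instead of parking in the kernel.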

What about Functional Programming?

• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations

Hyperthreading

• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7

Data Structures

• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
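
As an illustration of "array-based and contiguous", here is a deliberately small int-to-int map using open addressing with linear probing over two flat int arrays. It is a sketch under stated assumptions: fixed power-of-two capacity, no removal, no resizing, not thread-safe, and Integer.MIN_VALUE is reserved as the empty marker. IntIntMap is an invented name and does not mirror any particular library's API.

```java
import java.util.Arrays;

final class IntIntMap {
    private static final int EMPTY = Integer.MIN_VALUE; // reserved marker, cannot be used as a key
    private final int[] keys;
    private final int[] values;
    private final int mask;
    private int size;

    IntIntMap(int capacityPowerOfTwo) {
        if (Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        keys = new int[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
        Arrays.fill(keys, EMPTY);
    }

    // Finds the slot holding `key`, or the first empty slot on its probe path.
    private int slot(int key) {
        int h = key * 0x9E3779B9;          // cheap multiplicative scramble
        int i = (h ^ (h >>> 16)) & mask;
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask;            // linear probe: the next candidate is adjacent in memory
        }
        return i;
    }

    void put(int key, int value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("reserved key");
        }
        int i = slot(key);
        if (keys[i] == EMPTY) {
            if (size == mask) {
                throw new IllegalStateException("map full"); // one slot stays free so probes terminate
            }
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    int getOrDefault(int key, int fallback) {
        if (key == EMPTY) {
            return fallback;
        }
        int i = slot(key);
        return keys[i] == key ? values[i] : fallback;
    }
}
```

Unlike a chained-bucket map, a lookup here touches only the key and value arrays, with no per-entry objects to chase through the heap and no boxing of keys or values.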

Application Memory Wall & GC

• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  – IBM's Metronome, very predictable
  – Azul's C4, very performant

Using GPUs

• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform which communicates between GPUs and CPUs via shared L3

The Future

ManyCore

• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic

Memristor

• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing (probably not)
• Multistate, not binary

Phase Change Memory (PRAM)

• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its “neuromorphic” chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed

Thanks

Credits

• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content

Page 7: Cpu Caches

NUMA

bull Non-shy‐Uniform13 Memory13 Accessbull The13 organiza3on13 of13 processors13 reflect13 the13 3me13 to13 access13 data13 in13 RAM13 called13 the13 ldquoNUMA13 factorrdquo

bull Shared13 memory13 space13 (as13 opposed13 to13 mul3ple13 commodity13 machines)

Data13 Locality

bull The13 most13 cri3cal13 factor13 in13 performance13 13 Google13 argues13 otherwise

bull Not13 guaranteed13 by13 a13 JVMbull Spa7al13 -shy‐13 reused13 over13 and13 over13 in13 a13 loop13 data13 accessed13 in13 small13 regions

bull Temporal13 -shy‐13 high13 probability13 it13 will13 be13 reused13 before13 long

Memory13 Controller

bull Manages13 communica3on13 of13 readswrites13 between13 the13 CPU13 and13 RAM

bull Integrated13 Memory13 Controller13 on13 die

Cache13 Lines

bull 32-shy‐25613 con3guous13 bytes13 most13 commonly13 64bull Beware13 ldquofalse13 sharingrdquobull Use13 padding13 to13 ensure13 unshared13 linesbull Transferred13 in13 64-shy‐bit13 blocks13 (8x13 for13 6413 byte13 lines)13 arriving13 every13 ~413 cycles

bull Posi3on13 in13 the13 line13 of13 the13 ldquocri3cal13 wordrdquo13 ma9ers13 but13 not13 if13 pre-shy‐fetched

bull Contended13 annota3on13 coming13 in13 Java13 8

Cache13 Associa7vity

bull Fully13 Associa7ve13 Put13 it13 anywherebull Somewhere13 in13 the13 middle13 n-shy‐way13 set-shy‐associa3ve13 2-shy‐way13 skewed-shy‐associa3ve

bull Direct13 Mapped13 Each13 entry13 can13 only13 go13 in13 one13 specific13 place13

Cache13 Evic7on13 Strategies

bull Least13 Recently13 Used13 (LRU)bull Pseudo-shy‐LRU13 (PLRU)13 for13 large13 associa3vity13 caches

bull 2-shy‐Way13 Set13 Associa7vebull Direct13 Mappedbull Others

Cache13 Write13 Strategies

bull Write13 through13 changed13 cache13 line13 immediately13 goes13 back13 to13 main13 memory

bull Write13 back13 cache13 line13 is13 marked13 when13 dirty13 evic3on13 sends13 back13 to13 main13 memory

bull Write13 combining13 grouped13 writes13 of13 cache13 lines13 back13 to13 main13 memory

bull Uncacheable13 dynamic13 values13 that13 can13 change13 without13 warning

Exclusive13 versus13 Inclusive

bull Only13 relevant13 below13 L3bull AMD13 is13 exclusivendash Progressively13 more13 costly13 due13 to13 evic3onndash Can13 hold13 more13 datandash Bulldozer13 uses13 write13 through13 from13 L1d13 back13 to13 L2

bull Intel13 is13 inclusivendash Can13 be13 be9er13 for13 inter-shy‐processor13 memory13 sharingndash More13 expensive13 as13 lines13 in13 L113 are13 also13 in13 L213 amp13 L3ndash If13 evicted13 in13 a13 higher13 level13 cache13 must13 be13 evicted13 below13 as13 well

Inter-shy‐Socket13 Communica7on

bull GTs13 ndash13 gigatransfers13 per13 secondbull Quick13 Path13 Interconnect13 (QPI13 Intel)13 ndash13 8GTsbull HyperTransport13 (HTX13 AMD)13 ndash13 64GTs13 ()bull Both13 transfer13 1613 bits13 per13 transmission13 in13 prac3ce13 but13 Sandy13 Bridge13 is13 really13 32

MESI+F13 Cache13 Coherency13 Protocol

bull Specific13 to13 data13 cache13 linesbull Request13 for13 Ownership13 (RFO)13 when13 a13 processor13 tries13 to13 write13 to13

a13 cache13 linebull Modified13 the13 local13 processor13 has13 changed13 the13 cache13 line13 implies13

only13 one13 who13 has13 itbull Exclusive13 one13 processor13 is13 using13 the13 cache13 line13 not13 modifiedbull Shared13 mul3ple13 processors13 are13 using13 the13 cache13 line13 not13

modifiedbull Invalid13 the13 cache13 line13 is13 invalid13 must13 be13 re-shy‐fetchedbull Forward13 designate13 to13 respond13 to13 requests13 for13 a13 cache13 linebull All13 processors13 MUST13 acknowledge13 a13 message13 for13 it13 to13 be13 valid

Sta7c13 RAM13 (SRAM)

bull Requires13 6-shy‐813 pieces13 of13 circuitry13 per13 datumbull Cycle13 rate13 access13 not13 quite13 measurable13 in13 3mebull Uses13 a13 rela3vely13 large13 amount13 of13 power13 for13 what13 it13 does

bull Data13 does13 not13 fade13 or13 leak13 does13 not13 need13 to13 be13 refreshedrecharged

Dynamic13 RAM13 (DRAM)

bull Requires13 213 pieces13 of13 circuitry13 per13 datumbull ldquoLeaksrdquo13 charge13 but13 not13 sooner13 than13 64msbull Reads13 deplete13 the13 charge13 requiring13 subsequent13 recharge

bull Takes13 24013 cycles13 (~100ns)13 to13 accessbull Intels13 Nehalem13 architecture13 -shy‐13 each13 CPU13 socket13 controls13 a13 por3on13 of13 RAM13 no13 other13 socket13 has13 direct13 access13 to13 it

Architectures

Current13 Processors

bull Intelndash Nehalem13 (Tock)13 Westmere13 (Tick13 32nm)ndash Sandy13 Bridge13 (Tock)ndash Ivy13 Bridge13 (Tick13 22nm)ndash Haswell13 (Tock)

bull AMDndash Bulldozer

bull Oraclendash UltraSPARC13 isnt13 dead

ldquoLatency13 Numbers13 Everyone13 Should13 KnowrdquoL1 cache reference 05 ns

Branch mispredict 5 ns

L2 cache reference 7 ns

Mutex lockunlock 25 ns

Main memory reference 100 ns

Compress 1K bytes with Zippy 3000 ns = 3 micros

Send 2K bytes over 1 Gbps network 20000 ns = 20 micros

SSD random read 150000 ns = 150 micros

Read 1 MB sequentially from memory 250000 ns = 250 micros

Round trip within same datacenter 500000 ns = 05 ms

Read 1 MB sequentially from SSD 1000000 ns = 1 ms

Disk seek 10000000 ns = 10 ms

Read 1 MB sequentially from disk 20000000 ns = 20 ms

Send packet CA-gtNetherlands-gtCA 150000000 ns = 150 ms

bull Shamelessly13 cribbed13 from13 this13 gist13 h9psgistgithubcom284337513 originally13 by13 Peter13 Norvig13 and13 amended13 by13 Jeff13 Dean

Measured13 Cache13 Latencies

Sandy Bridge-E L1d L2 L3 Main=======================================================================Sequential Access 3 clk 11 clk 14 clk 6nsFull Random Access 3 clk 11 clk 38 clk 658ns

SI13 Sotwares13 benchmarks13 h9pwwwsisotwarenetd=qaampf=ben_mem_latency

Registers

bull On-shy‐core13 for13 instruc3ons13 being13 executed13 and13 their13 operands

bull Can13 be13 accessed13 in13 a13 single13 cyclebull There13 are13 many13 different13 typesbull A13 64-shy‐bit13 Intel13 Nehalem13 CPU13 had13 12813 Integer13 amp13 12813 floa3ng13 point13 registers

Store13 Buffers

bull Hold13 data13 for13 Out13 of13 Order13 (OoO)13 execu3onbull Fully13 associa3vebull Prevent13 ldquostallsrdquo13 in13 execu3on13 on13 a13 thread13 when13 the13 cache13 line13 is13 not13 local13 to13 a13 core13 on13 a13 write

bull ~113 cycle

Level13 Zero13 (L0)

bull Added13 in13 Sandy13 Bridgebull A13 cache13 of13 the13 last13 153613 uops13 decodedbull Well-shy‐suited13 for13 hot13 loopsbull Not13 the13 same13 as13 the13 older13 trace13 cache

Level13 One13 (L1)

bull Divided13 into13 data13 and13 instruc3onsbull 32K13 data13 (L1d)13 32K13 instruc3ons13 (L1i)13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 and13 Haswell

bull Sandy13 Bridge13 loads13 data13 at13 25613 bits13 per13 cycle13 double13 that13 of13 Nehalem

bull 3-shy‐413 cycles13 to13 access13 L1d

Level13 Two13 (L2)

bull 256K13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 amp13 Haswell

bull 2MB13 per13 ldquomodulerdquo13 on13 AMDs13 Bulldozer13 architecture

bull ~1113 cycles13 to13 accessbull Unified13 data13 and13 instruc3on13 caches13 from13 here13 up

bull If13 the13 working13 set13 size13 is13 larger13 than13 L213 misses13 grow

Level13 Three13 (L3)

bull Was13 a13 ldquounifiedrdquo13 cache13 up13 un3l13 Sandy13 Bridge13 shared13 between13 cores

bull Varies13 in13 size13 with13 different13 processors13 and13 versions13 of13 an13 architecture13 13 Laptops13 might13 have13 6-shy‐8MB13 but13 server-shy‐class13 might13 have13 30MB

bull 14-shy‐3813 cycles13 to13 access

Level13 Four13 (L4)

bull Some13 versions13 of13 Haswell13 will13 have13 a13 12813 MB13 L413 cache

bull No13 latency13 benchmarks13 for13 this13 yet

Programming13 Tips

Striding13 amp13 Pre-shy‐fetching

bull Predictable13 memory13 access13 is13 really13 importantbull Hardware13 pre-shy‐fetcher13 on13 the13 core13 looks13 for13 pa9erns13 of13 memory13 access

bull Can13 be13 counter-shy‐produc3ve13 if13 the13 access13 pa9ern13 is13 not13 predictable

bull Mar3n13 Thompson13 blog13 post13 ldquoMemory13 Access13 Pa9erns13 are13 Importantrdquo

bull Shows13 the13 importance13 of13 locality13 and13 striding

Cache13 Misses

bull Cost13 hundreds13 of13 cyclesbull Keep13 your13 code13 simplebull Instruc3on13 read13 misses13 are13 most13 expensivebull Data13 read13 miss13 are13 less13 so13 but13 s3ll13 hurt13 performancebull Write13 misses13 are13 okay13 unless13 using13 Write13 Throughbull Miss13 types

ndash Compulsoryndash Capacityndash Conflict

Programming13 Op7miza7ons

bull Stack13 allocated13 data13 is13 cheapbull Pointer13 interac3on13 -shy‐13 you13 have13 to13 retrieve13 data13 being13 pointed13 to13 even13 in13 registers

bull Avoid13 locking13 and13 resultant13 kernel13 arbitra3onbull CAS13 is13 be9er13 and13 occurs13 on-shy‐thread13 but13 algorithms13 become13 more13 complex

bull Match13 workload13 to13 the13 size13 of13 the13 last13 level13 cache13 (LLC13 L3L4)

What13 about13 Func7onal13 Programming

bull Have13 to13 allocate13 more13 and13 more13 space13 for13 your13 data13 structures13 leads13 to13 evic3on

bull When13 you13 cycle13 back13 around13 you13 get13 cache13 misses

bull Choose13 immutability13 by13 default13 profile13 to13 find13 poor13 performance

bull Use13 mutable13 data13 in13 targeted13 loca3ons

Hyperthreading

bull Great13 for13 IO-shy‐bound13 applica3onsbull If13 you13 have13 lots13 of13 cache13 missesbull Doesnt13 do13 much13 for13 CPU-shy‐bound13 applica3onsbull You13 have13 half13 of13 the13 cache13 resources13 per13 corebull NOTE13 -shy‐13 Haswell13 only13 has13 Hyperthreading13 on13 i7

Data13 Structures

bull BAD13 Linked13 list13 structures13 and13 tree13 structuresbull BAD13 Javas13 HashMap13 uses13 chained13 bucketsbull BAD13 Standard13 Java13 collec3ons13 generate13 lots13 of13 garbage

bull GOOD13 Array-shy‐based13 and13 con3guous13 in13 memory13 is13 much13 faster

bull GOOD13 Write13 your13 own13 that13 are13 lock-shy‐free13 and13 con3guous

bull GOOD13 Fastu3l13 library13 but13 note13 that13 its13 addi3ve

Applica7on13 Memory13 Wall13 amp13 GC

bull Tremendous13 amounts13 of13 RAM13 at13 low13 costbull GC13 will13 kill13 you13 with13 compac3onbull Use13 pauseless13 GCndash IBMs13 Metronome13 very13 predictablendash Azuls13 C413 very13 performant

Using13 GPUs

bull Remember13 locality13 ma9ersbull Need13 to13 be13 able13 to13 export13 a13 task13 with13 data13 that13 does13 not13 need13 to13 update

bull AMD13 has13 the13 new13 HSA13 plaRorm13 which13 communicates13 between13 GPUs13 and13 CPUs13 via13 shared13 L3

The13 Future

ManyCore

bull David13 Ungar13 says13 gt13 2413 cores13 generally13 many13 10s13 of13 cores

bull Really13 gets13 interes3ng13 above13 100013 coresbull Cache13 coherency13 wont13 be13 possiblebull Non-shy‐determinis3c

Memristor

bull Non-shy‐vola3le13 sta3c13 RAM13 same13 write13 endurance13 as13 Flash

bull 200-shy‐30013 MB13 on13 chipbull Sub-shy‐nanosecond13 writesbull Able13 to13 perform13 processing13 13 (Probably13 not)bull Mul3state13 not13 binary

Phase13 Change13 Memory13 (PRAM)

bull Higher13 performance13 than13 todays13 DRAMbull Intel13 seems13 more13 fascinated13 by13 this13 released13 its13 neuromorphic13 chip13 design13 last13 Fall

bull Not13 able13 to13 perform13 processingbull Write13 degrada3on13 is13 supposedly13 much13 slowerbull Was13 considered13 suscep3ble13 to13 uninten3onal13 change13 maybe13 fixed

Thanks

Credits

bull What13 Every13 Programmer13 Should13 Know13 About13 Memory13 Ulrich13 Drepper13 of13 RedHat13 2007

bull Java13 Performance13 Charlie13 Huntbull WikipediaWikimedia13 Commonsbull AnandTechbull The13 Microarchitecture13 of13 AMD13 Intel13 and13 VIA13 CPUsbull Everything13 You13 Need13 to13 Know13 about13 the13 Quick13 Path13 Interconnect13 Gabriel13

TorresHardware13 Secretsbull Inside13 the13 Sandy13 Bridge13 Architecture13 Gabriel13 TorresHardware13 Secretsbull Mar3n13 Thompsons13 Mechanical13 Sympathy13 blog13 and13 Disruptor13 presenta3onsbull The13 Applica3on13 Memory13 Wall13 Gil13 Tene13 CTO13 of13 Azul13 Systemsbull AMD13 BulldozerIntel13 Sandy13 Bridge13 Comparison13 Gionatan13 Dan3bull SI13 Sotwares13 Memory13 Latency13 Benchmarksbull Mar3n13 Thompson13 and13 Cliff13 Click13 provided13 feedback13 ampaddi3onal13 content

Page 8: Cpu Caches

Data13 Locality

bull The13 most13 cri3cal13 factor13 in13 performance13 13 Google13 argues13 otherwise

bull Not13 guaranteed13 by13 a13 JVMbull Spa7al13 -shy‐13 reused13 over13 and13 over13 in13 a13 loop13 data13 accessed13 in13 small13 regions

bull Temporal13 -shy‐13 high13 probability13 it13 will13 be13 reused13 before13 long

Memory13 Controller

bull Manages13 communica3on13 of13 readswrites13 between13 the13 CPU13 and13 RAM

bull Integrated13 Memory13 Controller13 on13 die

Cache13 Lines

bull 32-shy‐25613 con3guous13 bytes13 most13 commonly13 64bull Beware13 ldquofalse13 sharingrdquobull Use13 padding13 to13 ensure13 unshared13 linesbull Transferred13 in13 64-shy‐bit13 blocks13 (8x13 for13 6413 byte13 lines)13 arriving13 every13 ~413 cycles

bull Posi3on13 in13 the13 line13 of13 the13 ldquocri3cal13 wordrdquo13 ma9ers13 but13 not13 if13 pre-shy‐fetched

bull Contended13 annota3on13 coming13 in13 Java13 8

Cache13 Associa7vity

bull Fully13 Associa7ve13 Put13 it13 anywherebull Somewhere13 in13 the13 middle13 n-shy‐way13 set-shy‐associa3ve13 2-shy‐way13 skewed-shy‐associa3ve

bull Direct13 Mapped13 Each13 entry13 can13 only13 go13 in13 one13 specific13 place13

Cache13 Evic7on13 Strategies

bull Least13 Recently13 Used13 (LRU)bull Pseudo-shy‐LRU13 (PLRU)13 for13 large13 associa3vity13 caches

bull 2-shy‐Way13 Set13 Associa7vebull Direct13 Mappedbull Others

Cache13 Write13 Strategies

bull Write13 through13 changed13 cache13 line13 immediately13 goes13 back13 to13 main13 memory

bull Write13 back13 cache13 line13 is13 marked13 when13 dirty13 evic3on13 sends13 back13 to13 main13 memory

bull Write13 combining13 grouped13 writes13 of13 cache13 lines13 back13 to13 main13 memory

bull Uncacheable13 dynamic13 values13 that13 can13 change13 without13 warning

Exclusive13 versus13 Inclusive

bull Only13 relevant13 below13 L3bull AMD13 is13 exclusivendash Progressively13 more13 costly13 due13 to13 evic3onndash Can13 hold13 more13 datandash Bulldozer13 uses13 write13 through13 from13 L1d13 back13 to13 L2

bull Intel13 is13 inclusivendash Can13 be13 be9er13 for13 inter-shy‐processor13 memory13 sharingndash More13 expensive13 as13 lines13 in13 L113 are13 also13 in13 L213 amp13 L3ndash If13 evicted13 in13 a13 higher13 level13 cache13 must13 be13 evicted13 below13 as13 well

Inter-shy‐Socket13 Communica7on

bull GTs13 ndash13 gigatransfers13 per13 secondbull Quick13 Path13 Interconnect13 (QPI13 Intel)13 ndash13 8GTsbull HyperTransport13 (HTX13 AMD)13 ndash13 64GTs13 ()bull Both13 transfer13 1613 bits13 per13 transmission13 in13 prac3ce13 but13 Sandy13 Bridge13 is13 really13 32

MESI+F13 Cache13 Coherency13 Protocol

bull Specific13 to13 data13 cache13 linesbull Request13 for13 Ownership13 (RFO)13 when13 a13 processor13 tries13 to13 write13 to13

a13 cache13 linebull Modified13 the13 local13 processor13 has13 changed13 the13 cache13 line13 implies13

only13 one13 who13 has13 itbull Exclusive13 one13 processor13 is13 using13 the13 cache13 line13 not13 modifiedbull Shared13 mul3ple13 processors13 are13 using13 the13 cache13 line13 not13

modifiedbull Invalid13 the13 cache13 line13 is13 invalid13 must13 be13 re-shy‐fetchedbull Forward13 designate13 to13 respond13 to13 requests13 for13 a13 cache13 linebull All13 processors13 MUST13 acknowledge13 a13 message13 for13 it13 to13 be13 valid

Sta7c13 RAM13 (SRAM)

bull Requires13 6-shy‐813 pieces13 of13 circuitry13 per13 datumbull Cycle13 rate13 access13 not13 quite13 measurable13 in13 3mebull Uses13 a13 rela3vely13 large13 amount13 of13 power13 for13 what13 it13 does

bull Data13 does13 not13 fade13 or13 leak13 does13 not13 need13 to13 be13 refreshedrecharged

Dynamic13 RAM13 (DRAM)

bull Requires13 213 pieces13 of13 circuitry13 per13 datumbull ldquoLeaksrdquo13 charge13 but13 not13 sooner13 than13 64msbull Reads13 deplete13 the13 charge13 requiring13 subsequent13 recharge

bull Takes13 24013 cycles13 (~100ns)13 to13 accessbull Intels13 Nehalem13 architecture13 -shy‐13 each13 CPU13 socket13 controls13 a13 por3on13 of13 RAM13 no13 other13 socket13 has13 direct13 access13 to13 it

Architectures

Current13 Processors

bull Intelndash Nehalem13 (Tock)13 Westmere13 (Tick13 32nm)ndash Sandy13 Bridge13 (Tock)ndash Ivy13 Bridge13 (Tick13 22nm)ndash Haswell13 (Tock)

bull AMDndash Bulldozer

bull Oraclendash UltraSPARC13 isnt13 dead

ldquoLatency13 Numbers13 Everyone13 Should13 KnowrdquoL1 cache reference 05 ns

Branch mispredict 5 ns

L2 cache reference 7 ns

Mutex lockunlock 25 ns

Main memory reference 100 ns

Compress 1K bytes with Zippy 3000 ns = 3 micros

Send 2K bytes over 1 Gbps network 20000 ns = 20 micros

SSD random read 150000 ns = 150 micros

Read 1 MB sequentially from memory 250000 ns = 250 micros

Round trip within same datacenter 500000 ns = 05 ms

Read 1 MB sequentially from SSD 1000000 ns = 1 ms

Disk seek 10000000 ns = 10 ms

Read 1 MB sequentially from disk 20000000 ns = 20 ms

Send packet CA-gtNetherlands-gtCA 150000000 ns = 150 ms

bull Shamelessly13 cribbed13 from13 this13 gist13 h9psgistgithubcom284337513 originally13 by13 Peter13 Norvig13 and13 amended13 by13 Jeff13 Dean

Measured13 Cache13 Latencies

Sandy Bridge-E L1d L2 L3 Main=======================================================================Sequential Access 3 clk 11 clk 14 clk 6nsFull Random Access 3 clk 11 clk 38 clk 658ns

SI13 Sotwares13 benchmarks13 h9pwwwsisotwarenetd=qaampf=ben_mem_latency

Registers

bull On-shy‐core13 for13 instruc3ons13 being13 executed13 and13 their13 operands

bull Can13 be13 accessed13 in13 a13 single13 cyclebull There13 are13 many13 different13 typesbull A13 64-shy‐bit13 Intel13 Nehalem13 CPU13 had13 12813 Integer13 amp13 12813 floa3ng13 point13 registers

Store13 Buffers

bull Hold13 data13 for13 Out13 of13 Order13 (OoO)13 execu3onbull Fully13 associa3vebull Prevent13 ldquostallsrdquo13 in13 execu3on13 on13 a13 thread13 when13 the13 cache13 line13 is13 not13 local13 to13 a13 core13 on13 a13 write

bull ~113 cycle

Level13 Zero13 (L0)

bull Added13 in13 Sandy13 Bridgebull A13 cache13 of13 the13 last13 153613 uops13 decodedbull Well-shy‐suited13 for13 hot13 loopsbull Not13 the13 same13 as13 the13 older13 trace13 cache

Level13 One13 (L1)

bull Divided13 into13 data13 and13 instruc3onsbull 32K13 data13 (L1d)13 32K13 instruc3ons13 (L1i)13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 and13 Haswell

bull Sandy13 Bridge13 loads13 data13 at13 25613 bits13 per13 cycle13 double13 that13 of13 Nehalem

bull 3-shy‐413 cycles13 to13 access13 L1d

Level13 Two13 (L2)

bull 256K13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 amp13 Haswell

bull 2MB13 per13 ldquomodulerdquo13 on13 AMDs13 Bulldozer13 architecture

bull ~1113 cycles13 to13 accessbull Unified13 data13 and13 instruc3on13 caches13 from13 here13 up

bull If13 the13 working13 set13 size13 is13 larger13 than13 L213 misses13 grow

Level13 Three13 (L3)

bull Was13 a13 ldquounifiedrdquo13 cache13 up13 un3l13 Sandy13 Bridge13 shared13 between13 cores

bull Varies13 in13 size13 with13 different13 processors13 and13 versions13 of13 an13 architecture13 13 Laptops13 might13 have13 6-shy‐8MB13 but13 server-shy‐class13 might13 have13 30MB

bull 14-shy‐3813 cycles13 to13 access

Level13 Four13 (L4)

bull Some13 versions13 of13 Haswell13 will13 have13 a13 12813 MB13 L413 cache

bull No13 latency13 benchmarks13 for13 this13 yet

Programming13 Tips

Striding13 amp13 Pre-shy‐fetching

bull Predictable13 memory13 access13 is13 really13 importantbull Hardware13 pre-shy‐fetcher13 on13 the13 core13 looks13 for13 pa9erns13 of13 memory13 access

bull Can13 be13 counter-shy‐produc3ve13 if13 the13 access13 pa9ern13 is13 not13 predictable

bull Mar3n13 Thompson13 blog13 post13 ldquoMemory13 Access13 Pa9erns13 are13 Importantrdquo

bull Shows13 the13 importance13 of13 locality13 and13 striding

Cache13 Misses

bull Cost13 hundreds13 of13 cyclesbull Keep13 your13 code13 simplebull Instruc3on13 read13 misses13 are13 most13 expensivebull Data13 read13 miss13 are13 less13 so13 but13 s3ll13 hurt13 performancebull Write13 misses13 are13 okay13 unless13 using13 Write13 Throughbull Miss13 types

ndash Compulsoryndash Capacityndash Conflict

Programming Optimizations

• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
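
To make the CAS bullet concrete, here is a minimal sketch of a lock-free update using `java.util.concurrent.atomic.AtomicLong`. The class `CasCounter` and its `addCapped` method are hypothetical names chosen for this example; the retry loop is the part that shows why CAS-based algorithms get more complex than simply taking a lock.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch: a capped counter updated with compare-and-set instead of a lock.
final class CasCounter {
    private final AtomicLong value = new AtomicLong();

    // Read, compute, then publish only if no other thread changed the value meanwhile.
    long addCapped(long delta, long cap) {
        while (true) {
            long current = value.get();
            long next = Math.min(cap, current + delta);
            if (value.compareAndSet(current, next)) {
                return next;
            }
            // Lost the race: another thread updated the value, so retry with a fresh read.
        }
    }

    long get() {
        return value.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CasCounter counter = new CasCounter();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter.addCapped(1, Long.MAX_VALUE);
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println(counter.get());
    }
}
```

The failed attempts are retried on the calling thread, with no hand-off to the kernel, which is the "occurs on-thread" property the slide refers to.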

What about Functional Programming?

• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations

Hyperthreading

• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7!

Data Structures

• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
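
One way to picture "array-based and contiguous" is an open-addressing map backed by two primitive arrays, sketched below. This is not how Fastutil or any particular library is implemented; `IntIntMap`, its fixed capacity and its hash-mixing constant are assumptions made for the example. The sketch is single-threaded and has no removal or resizing.

```java
import java.util.Arrays;

// Hypothetical sketch: fixed-capacity int -> int map using linear probing
// over primitive arrays (no per-entry objects, no chained buckets).
final class IntIntMap {
    private static final int EMPTY = Integer.MIN_VALUE; // reserved key, cannot be stored
    private final int[] keys;
    private final int[] values;
    private final int mask;
    private int size;

    // Capacity must be a power of two so that (hash & mask) selects a slot.
    IntIntMap(int capacityPowerOfTwo) {
        if (capacityPowerOfTwo < 2 || Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two >= 2");
        }
        keys = new int[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
        Arrays.fill(keys, EMPTY);
    }

    void put(int key, int value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("reserved key");
        }
        int slot = mix(key) & mask;
        while (keys[slot] != EMPTY && keys[slot] != key) {
            slot = (slot + 1) & mask; // probe the neighbouring slot in the same array
        }
        if (keys[slot] == EMPTY) {
            // Always keep one slot empty so probe loops are guaranteed to stop.
            if (size == mask) {
                throw new IllegalStateException("map is full");
            }
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    int getOrDefault(int key, int fallback) {
        int slot = mix(key) & mask;
        while (keys[slot] != EMPTY) {
            if (keys[slot] == key) {
                return values[slot];
            }
            slot = (slot + 1) & mask;
        }
        return fallback;
    }

    // Cheap multiplicative scramble so nearby keys spread across slots.
    private static int mix(int key) {
        int h = key * 0x9E3779B9;
        return h ^ (h >>> 16);
    }

    public static void main(String[] args) {
        IntIntMap map = new IntIntMap(16);
        map.put(42, 1);
        map.put(7, 2);
        System.out.println(map.getOrDefault(42, -1) + " " + map.getOrDefault(99, -1));
    }
}
```

A collision here is resolved by looking at the next array element, which is often on the same cache line, rather than by following a pointer to a separately allocated node.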

Application Memory Wall & GC

• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  - IBM's Metronome, very predictable
  - Azul's C4, very performant

Using GPUs

• Remember, locality matters!
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform which communicates between GPUs and CPUs via shared L3

The Future

ManyCore

• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic

Memristor

• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary

Phase Change Memory (PRAM)

• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its "neuromorphic" chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed?

Thanks

Credits

• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content

Page 9: Cpu Caches

Memory Controller

• Manages communication of reads/writes between the CPU and RAM
• Integrated Memory Controller on die

Cache Lines

• 32-256 contiguous bytes, most commonly 64
• Beware "false sharing"
• Use padding to ensure unshared lines
• Transferred in 64-bit blocks (8x for 64 byte lines), arriving every ~4 cycles
• Position in the line of the "critical word" matters, but not if pre-fetched
• @Contended annotation coming in Java 8
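
The padding advice can be sketched by hand, without relying on the `@Contended` annotation. The example below surrounds a hot `long` with seven unused `long` fields on each side, using a small class hierarchy so that the padding fields are laid out before and after the value. This relies on typical HotSpot field layout, which is an assumption and not something the language guarantees. `PaddedCounters` and its nested classes are hypothetical names.

```java
// Hypothetical sketch: manual padding so two hot counters are unlikely
// to sit on the same 64-byte cache line (layout is JVM-dependent).
final class PaddedCounters {
    static class LeftPad { long p1, p2, p3, p4, p5, p6, p7; }
    static class Hot extends LeftPad { volatile long value; }
    static final class PaddedLong extends Hot { long q1, q2, q3, q4, q5, q6, q7; }

    // Each thread increments only its own counter; returns both totals.
    static long[] run(int iterations) {
        PaddedLong a = new PaddedLong();
        PaddedLong b = new PaddedLong();
        Thread ta = new Thread(() -> bump(a, iterations));
        Thread tb = new Thread(() -> bump(b, iterations));
        ta.start();
        tb.start();
        try {
            ta.join();
            tb.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new long[] { a.value, b.value };
    }

    private static void bump(PaddedLong counter, int iterations) {
        for (int i = 0; i < iterations; i++) {
            counter.value++; // single writer per counter, so no updates are lost
        }
    }

    public static void main(String[] args) {
        long[] totals = run(10_000_000);
        System.out.println(totals[0] + " " + totals[1]);
    }
}
```

Without the padding, the two `value` fields could end up on one line, and each write by one thread would invalidate the other core's copy even though the threads never touch the same variable.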

Cache Associativity

• Fully Associative: Put it anywhere
• Somewhere in the middle: n-way set-associative, 2-way skewed-associative
• Direct Mapped: Each entry can only go in one specific place
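
As a worked example of where a line may be placed, the sketch below splits an address into offset, set index and tag for a set-associative cache. The geometry used in `main` (32 KB, 64-byte lines, 8 ways) is an assumption chosen for illustration, and `CacheGeometry` is a hypothetical class. Direct mapped corresponds to `ways = 1`, and fully associative to a single set.

```java
// Hypothetical sketch: mapping an address onto an n-way set-associative cache.
final class CacheGeometry {
    final int lineSize;
    final int ways;
    final int sets;

    CacheGeometry(int cacheBytes, int lineSize, int ways) {
        this.lineSize = lineSize;
        this.ways = ways;
        this.sets = cacheBytes / (lineSize * ways);
    }

    // Byte position inside the cache line.
    long offset(long address) {
        return address % lineSize;
    }

    // Which set the line must go into; it may occupy any of the 'ways' slots there.
    long setIndex(long address) {
        return (address / lineSize) % sets;
    }

    // Remaining high bits, stored alongside the line to identify it within its set.
    long tag(long address) {
        return address / lineSize / sets;
    }

    public static void main(String[] args) {
        CacheGeometry l1d = new CacheGeometry(32 * 1024, 64, 8);
        long address = 0x12345L;
        System.out.println("sets=" + l1d.sets
                + " set=" + l1d.setIndex(address)
                + " tag=" + l1d.tag(address)
                + " offset=" + l1d.offset(address));
    }
}
```

With this geometry, addresses that are a multiple of 4096 bytes apart compete for the same set, which is how conflict misses arise even when the cache as a whole is not full.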

Cache Eviction Strategies

• Least Recently Used (LRU)
• Pseudo-LRU (PLRU): for large associativity caches
• 2-Way Set Associative
• Direct Mapped
• Others
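
LRU eviction can be modelled in software with `java.util.LinkedHashMap` in access order. This is only a model of the policy, not of how hardware tracks recency, and the class name `LruSet` (one cache set holding a fixed number of ways) is invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: one cache set with 'ways' slots and LRU replacement.
final class LruSet<K, V> extends LinkedHashMap<K, V> {
    private final int ways;

    LruSet(int ways) {
        super(16, 0.75f, true); // true = order entries by most recent access
        this.ways = ways;
    }

    // Called by LinkedHashMap after each insertion; evicts the least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > ways;
    }

    public static void main(String[] args) {
        LruSet<Integer, String> set = new LruSet<>(2);
        set.put(1, "line A");
        set.put(2, "line B");
        set.get(1);           // touch line A so that line B becomes the oldest
        set.put(3, "line C"); // set is full, so line B is evicted
        System.out.println(set.keySet());
    }
}
```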

Cache Write Strategies

• Write through: changed cache line immediately goes back to main memory
• Write back: cache line is marked when dirty, eviction sends back to main memory
• Write combining: grouped writes of cache lines back to main memory
• Uncacheable: dynamic values that can change without warning

Exclusive versus Inclusive

• Only relevant below L3
• AMD is exclusive
  - Progressively more costly due to eviction
  - Can hold more data
  - Bulldozer uses "write through" from L1d back to L2
• Intel is inclusive
  - Can be better for inter-processor memory sharing
  - More expensive as lines in L1 are also in L2 & L3
  - If evicted in a higher level cache, must be evicted below as well

Inter-Socket Communication

• GT/s - gigatransfers per second
• Quick Path Interconnect (QPI, Intel) - 8GT/s
• HyperTransport (HTX, AMD) - 6.4GT/s (?)
• Both transfer 16 bits per transmission in practice, but Sandy Bridge is really 32

MESI+F Cache Coherency Protocol

• Specific to data cache lines
• Request for Ownership (RFO), when a processor tries to write to a cache line
• Modified, the local processor has changed the cache line, implies only one who has it
• Exclusive, one processor is using the cache line, not modified
• Shared, multiple processors are using the cache line, not modified
• Invalid, the cache line is invalid, must be re-fetched
• Forward, designate to respond to requests for a cache line
• All processors MUST acknowledge a message for it to be valid
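
The state descriptions above can be read as a small state machine. The sketch below covers only the four classic MESI states as seen by one cache for one line; it leaves out the Forward state, the bus messages and the acknowledgements, so it is a simplification for illustration and not a description of any vendor's implementation. `Mesi`, `State`, `Event` and `next` are hypothetical names.

```java
// Hypothetical sketch: simplified MESI transitions for a single cache line,
// from the point of view of one core's cache. The Forward state is omitted.
final class Mesi {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }
    enum Event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE }

    static State next(State current, Event event, boolean othersHoldLine) {
        switch (event) {
            case LOCAL_READ:
                // A miss fetches the line: Shared if another cache has it, else Exclusive.
                if (current == State.INVALID) {
                    return othersHoldLine ? State.SHARED : State.EXCLUSIVE;
                }
                return current;
            case LOCAL_WRITE:
                // From Shared or Invalid this requires a Request for Ownership first.
                return State.MODIFIED;
            case REMOTE_READ:
                // Another core reads: a Modified line is written back, then shared.
                return current == State.INVALID ? State.INVALID : State.SHARED;
            case REMOTE_WRITE:
                // Another core takes ownership, so the local copy is no longer valid.
                return State.INVALID;
            default:
                throw new IllegalArgumentException("unknown event: " + event);
        }
    }

    public static void main(String[] args) {
        State s = State.INVALID;
        s = next(s, Event.LOCAL_READ, false);
        s = next(s, Event.LOCAL_WRITE, false);
        s = next(s, Event.REMOTE_READ, true);
        s = next(s, Event.REMOTE_WRITE, true);
        System.out.println(s);
    }
}
```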

Static RAM (SRAM)

• Requires 6-8 pieces of circuitry per datum
• Cycle rate access, not quite measurable in time
• Uses a relatively large amount of power for what it does
• Data does not fade or leak, does not need to be refreshed/recharged

Dynamic RAM (DRAM)

• Requires 2 pieces of circuitry per datum
• "Leaks" charge, but not sooner than 64ms
• Reads deplete the charge, requiring subsequent recharge
• Takes 240 cycles (~100ns) to access
• Intel's Nehalem architecture - each CPU socket controls a portion of RAM, no other socket has direct access to it
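
The "240 cycles (~100ns)" figure only lines up at a particular clock speed. The small helper below makes the conversion explicit; the 2.4 GHz clock is an assumption inferred from those two numbers, and `CycleMath` is a hypothetical name.

```java
// Hypothetical sketch: converting a latency in CPU cycles to nanoseconds.
final class CycleMath {
    // One cycle lasts 1 / clockGhz nanoseconds, so n cycles take n / clockGhz ns.
    static double cyclesToNanos(long cycles, double clockGhz) {
        return cycles / clockGhz;
    }

    public static void main(String[] args) {
        System.out.println(cyclesToNanos(240, 2.4) + " ns at 2.4 GHz");
        System.out.println(cyclesToNanos(240, 3.6) + " ns at 3.6 GHz");
    }
}
```

The same cycle count is a shorter wall-clock time on a faster clock, but DRAM latency in nanoseconds is roughly fixed, so on a faster core a memory access costs more cycles, not fewer.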

Architectures

Current Processors

• Intel
  - Nehalem (Tock) / Westmere (Tick, 32nm)
  - Sandy Bridge (Tock)
  - Ivy Bridge (Tick, 22nm)
  - Haswell (Tock)
• AMD
  - Bulldozer
• Oracle
  - UltraSPARC isn't dead

"Latency Numbers Everyone Should Know"

L1 cache reference                           0.5 ns
Branch mispredict                              5 ns
L2 cache reference                             7 ns
Mutex lock/unlock                             25 ns
Main memory reference                        100 ns
Compress 1K bytes with Zippy               3,000 ns = 3 µs
Send 2K bytes over 1 Gbps network         20,000 ns = 20 µs
SSD random read                          150,000 ns = 150 µs
Read 1 MB sequentially from memory       250,000 ns = 250 µs
Round trip within same datacenter        500,000 ns = 0.5 ms
Read 1 MB sequentially from SSD        1,000,000 ns = 1 ms
Disk seek                             10,000,000 ns = 10 ms
Read 1 MB sequentially from disk      20,000,000 ns = 20 ms
Send packet CA->Netherlands->CA      150,000,000 ns = 150 ms

• Shamelessly cribbed from this gist: https://gist.github.com/2843375, originally by Peter Norvig and amended by Jeff Dean

Measured Cache Latencies

Sandy Bridge-E        L1d      L2       L3       Main
======================================================
Sequential Access     3 clk    11 clk   14 clk   6ns
Full Random Access    3 clk    11 clk   38 clk   65.8ns

SI Software's benchmarks: http://www.sisoftware.net/?d=qa&f=ben_mem_latency

Registers

• On-core for instructions being executed and their operands
• Can be accessed in a single cycle
• There are many different types
• A 64-bit Intel Nehalem CPU had 128 Integer & 128 floating point registers

Store Buffers

• Hold data for Out of Order (OoO) execution
• Fully associative
• Prevent "stalls" in execution on a thread when the cache line is not local to a core on a write
• ~1 cycle

Page 11: Cpu Caches

Cache13 Associa7vity

bull Fully13 Associa7ve13 Put13 it13 anywherebull Somewhere13 in13 the13 middle13 n-shy‐way13 set-shy‐associa3ve13 2-shy‐way13 skewed-shy‐associa3ve

bull Direct13 Mapped13 Each13 entry13 can13 only13 go13 in13 one13 specific13 place13

Cache13 Evic7on13 Strategies

bull Least13 Recently13 Used13 (LRU)bull Pseudo-shy‐LRU13 (PLRU)13 for13 large13 associa3vity13 caches

bull 2-shy‐Way13 Set13 Associa7vebull Direct13 Mappedbull Others

Cache13 Write13 Strategies

bull Write13 through13 changed13 cache13 line13 immediately13 goes13 back13 to13 main13 memory

bull Write13 back13 cache13 line13 is13 marked13 when13 dirty13 evic3on13 sends13 back13 to13 main13 memory

bull Write13 combining13 grouped13 writes13 of13 cache13 lines13 back13 to13 main13 memory

bull Uncacheable13 dynamic13 values13 that13 can13 change13 without13 warning

Exclusive13 versus13 Inclusive

bull Only13 relevant13 below13 L3bull AMD13 is13 exclusivendash Progressively13 more13 costly13 due13 to13 evic3onndash Can13 hold13 more13 datandash Bulldozer13 uses13 write13 through13 from13 L1d13 back13 to13 L2

bull Intel13 is13 inclusivendash Can13 be13 be9er13 for13 inter-shy‐processor13 memory13 sharingndash More13 expensive13 as13 lines13 in13 L113 are13 also13 in13 L213 amp13 L3ndash If13 evicted13 in13 a13 higher13 level13 cache13 must13 be13 evicted13 below13 as13 well

Inter-shy‐Socket13 Communica7on

bull GTs13 ndash13 gigatransfers13 per13 secondbull Quick13 Path13 Interconnect13 (QPI13 Intel)13 ndash13 8GTsbull HyperTransport13 (HTX13 AMD)13 ndash13 64GTs13 ()bull Both13 transfer13 1613 bits13 per13 transmission13 in13 prac3ce13 but13 Sandy13 Bridge13 is13 really13 32

MESI+F13 Cache13 Coherency13 Protocol

bull Specific13 to13 data13 cache13 linesbull Request13 for13 Ownership13 (RFO)13 when13 a13 processor13 tries13 to13 write13 to13

a13 cache13 linebull Modified13 the13 local13 processor13 has13 changed13 the13 cache13 line13 implies13

only13 one13 who13 has13 itbull Exclusive13 one13 processor13 is13 using13 the13 cache13 line13 not13 modifiedbull Shared13 mul3ple13 processors13 are13 using13 the13 cache13 line13 not13

modifiedbull Invalid13 the13 cache13 line13 is13 invalid13 must13 be13 re-shy‐fetchedbull Forward13 designate13 to13 respond13 to13 requests13 for13 a13 cache13 linebull All13 processors13 MUST13 acknowledge13 a13 message13 for13 it13 to13 be13 valid

Sta7c13 RAM13 (SRAM)

bull Requires13 6-shy‐813 pieces13 of13 circuitry13 per13 datumbull Cycle13 rate13 access13 not13 quite13 measurable13 in13 3mebull Uses13 a13 rela3vely13 large13 amount13 of13 power13 for13 what13 it13 does

bull Data13 does13 not13 fade13 or13 leak13 does13 not13 need13 to13 be13 refreshedrecharged

Dynamic13 RAM13 (DRAM)

bull Requires13 213 pieces13 of13 circuitry13 per13 datumbull ldquoLeaksrdquo13 charge13 but13 not13 sooner13 than13 64msbull Reads13 deplete13 the13 charge13 requiring13 subsequent13 recharge

bull Takes13 24013 cycles13 (~100ns)13 to13 accessbull Intels13 Nehalem13 architecture13 -shy‐13 each13 CPU13 socket13 controls13 a13 por3on13 of13 RAM13 no13 other13 socket13 has13 direct13 access13 to13 it

Architectures

Current13 Processors

bull Intelndash Nehalem13 (Tock)13 Westmere13 (Tick13 32nm)ndash Sandy13 Bridge13 (Tock)ndash Ivy13 Bridge13 (Tick13 22nm)ndash Haswell13 (Tock)

bull AMDndash Bulldozer

bull Oraclendash UltraSPARC13 isnt13 dead

ldquoLatency13 Numbers13 Everyone13 Should13 KnowrdquoL1 cache reference 05 ns

Branch mispredict 5 ns

L2 cache reference 7 ns

Mutex lockunlock 25 ns

Main memory reference 100 ns

Compress 1K bytes with Zippy 3000 ns = 3 micros

Send 2K bytes over 1 Gbps network 20000 ns = 20 micros

SSD random read 150000 ns = 150 micros

Read 1 MB sequentially from memory 250000 ns = 250 micros

Round trip within same datacenter 500000 ns = 05 ms

Read 1 MB sequentially from SSD 1000000 ns = 1 ms

Disk seek 10000000 ns = 10 ms

Read 1 MB sequentially from disk 20000000 ns = 20 ms

Send packet CA-gtNetherlands-gtCA 150000000 ns = 150 ms

bull Shamelessly13 cribbed13 from13 this13 gist13 h9psgistgithubcom284337513 originally13 by13 Peter13 Norvig13 and13 amended13 by13 Jeff13 Dean

Measured13 Cache13 Latencies

Sandy Bridge-E L1d L2 L3 Main=======================================================================Sequential Access 3 clk 11 clk 14 clk 6nsFull Random Access 3 clk 11 clk 38 clk 658ns

SI13 Sotwares13 benchmarks13 h9pwwwsisotwarenetd=qaampf=ben_mem_latency

Registers

bull On-shy‐core13 for13 instruc3ons13 being13 executed13 and13 their13 operands

bull Can13 be13 accessed13 in13 a13 single13 cyclebull There13 are13 many13 different13 typesbull A13 64-shy‐bit13 Intel13 Nehalem13 CPU13 had13 12813 Integer13 amp13 12813 floa3ng13 point13 registers

Store13 Buffers

bull Hold13 data13 for13 Out13 of13 Order13 (OoO)13 execu3onbull Fully13 associa3vebull Prevent13 ldquostallsrdquo13 in13 execu3on13 on13 a13 thread13 when13 the13 cache13 line13 is13 not13 local13 to13 a13 core13 on13 a13 write

bull ~113 cycle

Level13 Zero13 (L0)

bull Added13 in13 Sandy13 Bridgebull A13 cache13 of13 the13 last13 153613 uops13 decodedbull Well-shy‐suited13 for13 hot13 loopsbull Not13 the13 same13 as13 the13 older13 trace13 cache

Level One (L1)

• Divided into data and instructions
• 32K data (L1d), 32K instructions (L1i) per core on Sandy Bridge, Ivy Bridge and Haswell
• Sandy Bridge loads data at 256 bits per cycle, double that of Nehalem
• 3-4 cycles to access L1d

Level Two (L2)

• 256K per core on Sandy Bridge, Ivy Bridge & Haswell
• 2MB per “module” on AMD's Bulldozer architecture
• ~11 cycles to access
• Unified data and instruction caches from here up
• If the working set size is larger than L2, misses grow

Level Three (L3)

• Was a “unified” cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture. Laptops might have 6-8MB, but server-class might have 30MB
• 14-38 cycles to access

Level Four (L4)

• Some versions of Haswell will have a 128 MB L4 cache
• No latency benchmarks for this yet

Programming Tips

Striding & Pre-fetching

• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: “Memory Access Patterns are Important”
• Shows the importance of locality and striding
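
The effect of striding can be sketched with two loops that compute the same sum over one contiguous array, differing only in traversal order. This is a minimal illustration, not the code from the blog post: the class name, array sizes and the naive `System.nanoTime()` timing are assumptions, and a proper measurement would use a benchmark harness with JIT warm-up.

```java
// Illustrative sketch: identical work, two memory access patterns.
final class StrideDemo {
    // Walks memory sequentially (stride of one element), a pattern a hardware pre-fetcher can follow.
    static long sumRowMajor(int[] data, int rows, int cols) {
        long sum = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                sum += data[r * cols + c];
            }
        }
        return sum;
    }

    // Jumps `cols` elements between consecutive reads, so large arrays touch a new cache line on most reads.
    static long sumColumnMajor(int[] data, int rows, int cols) {
        long sum = 0;
        for (int c = 0; c < cols; c++) {
            for (int r = 0; r < rows; r++) {
                sum += data[r * cols + c];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int rows = 2048;
        int cols = 2048;
        int[] data = new int[rows * cols];
        for (int i = 0; i < data.length; i++) {
            data[i] = i & 7;
        }
        long t0 = System.nanoTime();
        long a = sumRowMajor(data, rows, cols);
        long t1 = System.nanoTime();
        long b = sumColumnMajor(data, rows, cols);
        long t2 = System.nanoTime();
        System.out.printf("row-major: %d ms, column-major: %d ms (sums %d, %d)%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a, b);
    }
}
```

Both methods return the same value; only the order in which cache lines are touched changes.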

Cache Misses

• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types
  – Compulsory
  – Capacity
  – Conflict
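
The three miss types can be told apart with a small simulation. The sketch below uses the textbook classification, which is an assumption on my part rather than something the slides spell out: a miss is compulsory on the first touch of a line, a capacity miss if a fully associative LRU cache of the same size would also miss, and a conflict miss otherwise. The simulated cache is direct mapped, and all names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical classifier: compares a direct-mapped cache against a fully associative LRU cache of equal size.
final class MissClassifier {
    enum Kind { HIT, COMPULSORY, CAPACITY, CONFLICT }

    private final int lines;                     // number of cache lines in both models
    private final long[] directMapped;           // line address held by each slot, -1 = empty
    private final Set<Long> everSeen = new HashSet<>();
    private final LinkedHashMap<Long, Boolean> fullyAssociative;

    MissClassifier(int lineCount) {
        this.lines = lineCount;
        this.directMapped = new long[lineCount];
        Arrays.fill(directMapped, -1L);
        // accessOrder = true keeps entries ordered from least to most recently used.
        this.fullyAssociative = new LinkedHashMap<Long, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                return size() > lineCount;
            }
        };
    }

    // lineAddress is assumed to be non-negative (byte address divided by the line size).
    Kind access(long lineAddress) {
        int slot = (int) (lineAddress % lines);
        boolean hitDirect = directMapped[slot] == lineAddress;
        directMapped[slot] = lineAddress;

        boolean firstTouch = everSeen.add(lineAddress);
        boolean hitFull = fullyAssociative.get(lineAddress) != null; // get() also refreshes LRU order
        fullyAssociative.put(lineAddress, Boolean.TRUE);

        if (hitDirect) return Kind.HIT;
        if (firstTouch) return Kind.COMPULSORY;
        return hitFull ? Kind.CONFLICT : Kind.CAPACITY;
    }

    public static void main(String[] args) {
        MissClassifier cache = new MissClassifier(2);
        for (long line : new long[] {0, 2, 0, 1, 2, 2}) {
            System.out.println("line " + line + ": " + cache.access(line));
        }
    }
}
```

Lines 0 and 2 map to the same slot in the two-line direct-mapped model, which is what produces the conflict miss in the sample sequence.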

Programming Optimizations

• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
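
The CAS point can be illustrated with a retry loop on `java.util.concurrent.atomic.AtomicLong`. This is a sketch under assumptions: the capped counter and its names are invented for the example, and it shows why CAS algorithms get more complex, since the update has to be recomputed and retried whenever another thread wins the race.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical capped counter updated with compare-and-set instead of a lock.
final class CasCounter {
    private final AtomicLong value = new AtomicLong();

    // Adds delta but never exceeds cap; returns the value this call published.
    long addCapped(long delta, long cap) {
        while (true) {
            long current = value.get();
            long next = Math.min(current + delta, cap);
            if (value.compareAndSet(current, next)) {
                return next;
            }
            // Another thread changed the value in between; retry with the fresh value.
        }
    }

    long get() {
        return value.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CasCounter counter = new CasCounter();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter.addCapped(1, Long.MAX_VALUE);
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println(counter.get());
    }
}
```

No thread ever blocks in the kernel here; a losing thread simply spins through the loop again on its own core.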

What about Functional Programming?

• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations

Hyperthreading

• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7

Data Structures

• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
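
As a rough picture of what "array-based and contiguous" means for a map, the sketch below stores keys and values in two primitive arrays and resolves collisions by linear probing instead of chained buckets. It is a hypothetical minimal structure (fixed capacity, no removal, no resizing, not thread-safe) and not the API of Fastutil or any other library.

```java
import java.util.Arrays;

// Hypothetical open-addressing map from int to int; no boxing, no per-entry node objects.
final class IntIntOpenMap {
    private static final int EMPTY = Integer.MIN_VALUE; // reserved key meaning "slot unused"
    private final int[] keys;
    private final int[] values;
    private final int mask;
    private int size;

    IntIntOpenMap(int capacityPowerOfTwo) {
        if (capacityPowerOfTwo <= 0 || Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        keys = new int[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
        Arrays.fill(keys, EMPTY);
    }

    void put(int key, int value) {
        if (key == EMPTY) throw new IllegalArgumentException("reserved key");
        int slot = mix(key) & mask;
        // Probe neighbouring slots; they usually sit on the same or the next cache line.
        while (keys[slot] != EMPTY && keys[slot] != key) {
            slot = (slot + 1) & mask;
        }
        if (keys[slot] == EMPTY) {
            // Always keep one slot empty so that probing terminates.
            if (size == keys.length - 1) throw new IllegalStateException("map is full");
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    int get(int key, int missingValue) {
        int slot = mix(key) & mask;
        while (keys[slot] != EMPTY) {
            if (keys[slot] == key) return values[slot];
            slot = (slot + 1) & mask;
        }
        return missingValue;
    }

    private static int mix(int key) {
        int h = key * 0x9E3779B9; // multiplicative scramble to spread clustered keys
        return h ^ (h >>> 16);
    }
}
```

A lookup touches two arrays rather than chasing a pointer per bucket entry, and it allocates nothing, which also speaks to the garbage concern above.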

Application Memory Wall & GC

• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  – IBM's Metronome, very predictable
  – Azul's C4, very performant

Using GPUs

• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform, which communicates between GPUs and CPUs via shared L3

The Future

ManyCore

• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic

Memristor

• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary

Phase Change Memory (PRAM)

• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its neuromorphic chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed

Thanks

Credits

• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content

Cache Eviction Strategies

• Least Recently Used (LRU)
• Pseudo-LRU (PLRU) for large associativity caches
• 2-Way Set Associative
• Direct Mapped
• Others
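
How associativity and LRU eviction interact can be sketched with a tiny simulator. The model below is an assumption-laden teaching aid, not a description of any real CPU: it splits an address into line, set and tag, keeps a logical timestamp per way, and evicts the least recently used way of the set. With one way per set it behaves as a direct-mapped cache.

```java
import java.util.Arrays;

// Hypothetical N-way set-associative cache model with true LRU per set.
final class SetAssociativeCache {
    private final int sets;
    private final int ways;
    private final int lineBytes;
    private final long[][] tags;     // tags[set][way], -1 = empty
    private final long[][] lastUsed; // logical timestamp of the last access per way
    private long clock;

    SetAssociativeCache(int sets, int ways, int lineBytes) {
        this.sets = sets;
        this.ways = ways;
        this.lineBytes = lineBytes;
        this.tags = new long[sets][ways];
        this.lastUsed = new long[sets][ways];
        for (long[] row : tags) {
            Arrays.fill(row, -1L);
        }
    }

    // Returns true on a hit. On a miss, the least recently used way of the set is replaced.
    // Addresses are assumed to be non-negative.
    boolean access(long address) {
        long line = address / lineBytes;
        int set = (int) (line % sets);
        long tag = line / sets;
        clock++;
        int victim = 0;
        for (int w = 0; w < ways; w++) {
            if (tags[set][w] == tag) {
                lastUsed[set][w] = clock;
                return true;
            }
            if (lastUsed[set][w] < lastUsed[set][victim]) {
                victim = w; // empty ways have timestamp 0, so they are filled first
            }
        }
        tags[set][victim] = tag;
        lastUsed[set][victim] = clock;
        return false;
    }

    public static void main(String[] args) {
        SetAssociativeCache directMapped = new SetAssociativeCache(4, 1, 64);
        SetAssociativeCache twoWay = new SetAssociativeCache(2, 2, 64);
        for (long address : new long[] {0, 256, 0}) {
            System.out.println(address + ": direct-mapped hit=" + directMapped.access(address)
                    + ", 2-way hit=" + twoWay.access(address));
        }
    }
}
```

Both caches in `main` hold four 64-byte lines, yet addresses 0 and 256 keep evicting each other only in the direct-mapped one.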

Cache Write Strategies

• Write through: changed cache line immediately goes back to main memory
• Write back: cache line is marked when dirty, eviction sends back to main memory
• Write combining: grouped writes of cache lines back to main memory
• Uncacheable: dynamic values that can change without warning
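
The difference between the first two strategies comes down to when main memory is written. A minimal sketch, with hypothetical names and a single cache line as the whole model, counts the memory writes each policy causes:

```java
// Hypothetical single-line model: counts how often main memory is written under each policy.
final class WritePolicyDemo {
    enum Policy { WRITE_THROUGH, WRITE_BACK }

    static final class Line {
        private final Policy policy;
        private boolean dirty;
        private int memoryWrites;

        Line(Policy policy) {
            this.policy = policy;
        }

        void store() {
            if (policy == Policy.WRITE_THROUGH) {
                memoryWrites++;   // every store is forwarded to memory right away
            } else {
                dirty = true;     // only marked; memory stays stale until eviction
            }
        }

        void evict() {
            if (dirty) {
                memoryWrites++;   // write back pays once, at eviction
                dirty = false;
            }
        }

        int memoryWrites() {
            return memoryWrites;
        }
    }

    public static void main(String[] args) {
        for (Policy policy : Policy.values()) {
            Line line = new Line(policy);
            for (int i = 0; i < 3; i++) {
                line.store();
            }
            line.evict();
            System.out.println(policy + ": " + line.memoryWrites() + " memory write(s)");
        }
    }
}
```

Three stores followed by an eviction cost three memory writes with write through and one with write back, which is why write misses are mostly a concern under write through.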

Exclusive versus Inclusive

• Only relevant below L3
• AMD is exclusive
  – Progressively more costly due to eviction
  – Can hold more data
  – Bulldozer uses write through from L1d back to L2
• Intel is inclusive
  – Can be better for inter-processor memory sharing
  – More expensive as lines in L1 are also in L2 & L3
  – If evicted in a higher level cache, must be evicted below as well

Inter-Socket Communication

• GT/s – gigatransfers per second
• Quick Path Interconnect (QPI, Intel) – 8GT/s
• HyperTransport (HTX, AMD) – 6.4GT/s (?)
• Both transfer 16 bits per transmission in practice, but Sandy Bridge is really 32
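
Transfer rates turn into bandwidth by multiplying by the bits moved per transfer. The sketch below does that conversion for the figures on this slide; it is raw, per-direction arithmetic that ignores protocol overhead, the names are illustrative, and the HyperTransport rate is the one the slide itself marks as uncertain.

```java
// Illustrative arithmetic only: raw link bandwidth from a transfer rate and a link width.
final class LinkBandwidth {
    // GT/s times bits per transfer, divided by 8 bits per byte, gives GB/s in one direction.
    static double gigabytesPerSecond(double gigatransfersPerSecond, int bitsPerTransfer) {
        return gigatransfersPerSecond * bitsPerTransfer / 8.0;
    }

    public static void main(String[] args) {
        System.out.println("QPI at 8 GT/s, 16 bits:            " + gigabytesPerSecond(8.0, 16) + " GB/s");
        System.out.println("HyperTransport at 6.4 GT/s, 16 bits: " + gigabytesPerSecond(6.4, 16) + " GB/s");
    }
}
```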

MESI+F Cache Coherency Protocol

• Specific to data cache lines
• Request for Ownership (RFO), when a processor tries to write to a cache line
• Modified, the local processor has changed the cache line, implies only one who has it
• Exclusive, one processor is using the cache line, not modified
• Shared, multiple processors are using the cache line, not modified
• Invalid, the cache line is invalid, must be re-fetched
• Forward, designate to respond to requests for a cache line
• All processors MUST acknowledge a message for it to be valid
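
The state changes can be written down as a small transition function. What follows is a simplified textbook MESI model seen from one cache's point of view, offered as an assumption rather than Intel's actual implementation: the Forward state is left out, and the event names and the `othersHoldLine` parameter are made up for the example.

```java
// Simplified textbook MESI (Forward state omitted); names are hypothetical.
final class Mesi {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }
    enum Event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE }

    // Next state of this cache's copy of a line after an event.
    // othersHoldLine only matters when an INVALID line is read locally.
    static State next(State current, Event event, boolean othersHoldLine) {
        switch (event) {
            case LOCAL_WRITE:
                return State.MODIFIED;   // preceded by an RFO unless the line was already M or E
            case REMOTE_WRITE:
                return State.INVALID;    // another processor took ownership
            case REMOTE_READ:
                return current == State.INVALID ? State.INVALID : State.SHARED;
            case LOCAL_READ:
                if (current != State.INVALID) {
                    return current;      // a valid copy can be read without a state change
                }
                return othersHoldLine ? State.SHARED : State.EXCLUSIVE;
            default:
                throw new IllegalArgumentException("unknown event " + event);
        }
    }

    // A write must first broadcast a Request for Ownership when other copies may exist.
    static boolean writeNeedsRfo(State current) {
        return current == State.SHARED || current == State.INVALID;
    }

    public static void main(String[] args) {
        State s = State.INVALID;
        s = next(s, Event.LOCAL_READ, false);
        System.out.println("after local read:  " + s);
        s = next(s, Event.LOCAL_WRITE, false);
        System.out.println("after local write: " + s);
        s = next(s, Event.REMOTE_READ, true);
        System.out.println("after remote read: " + s);
    }
}
```

The cheap path is a write to a line already held as Exclusive or Modified, since no other cache needs to be told.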

Static RAM (SRAM)

• Requires 6-8 pieces of circuitry per datum
• Cycle rate access, not quite measurable in time
• Uses a relatively large amount of power for what it does
• Data does not fade or leak, does not need to be refreshed/recharged

Dynamic RAM (DRAM)

• Requires 2 pieces of circuitry per datum
• “Leaks” charge, but not sooner than 64ms
• Reads deplete the charge, requiring subsequent recharge
• Takes 240 cycles (~100ns) to access
• Intel's Nehalem architecture - each CPU socket controls a portion of RAM, no other socket has direct access to it

Page 14: Cpu Caches

Exclusive13 versus13 Inclusive

bull Only13 relevant13 below13 L3bull AMD13 is13 exclusivendash Progressively13 more13 costly13 due13 to13 evic3onndash Can13 hold13 more13 datandash Bulldozer13 uses13 write13 through13 from13 L1d13 back13 to13 L2

bull Intel13 is13 inclusivendash Can13 be13 be9er13 for13 inter-shy‐processor13 memory13 sharingndash More13 expensive13 as13 lines13 in13 L113 are13 also13 in13 L213 amp13 L3ndash If13 evicted13 in13 a13 higher13 level13 cache13 must13 be13 evicted13 below13 as13 well

Inter-shy‐Socket13 Communica7on

bull GTs13 ndash13 gigatransfers13 per13 secondbull Quick13 Path13 Interconnect13 (QPI13 Intel)13 ndash13 8GTsbull HyperTransport13 (HTX13 AMD)13 ndash13 64GTs13 ()bull Both13 transfer13 1613 bits13 per13 transmission13 in13 prac3ce13 but13 Sandy13 Bridge13 is13 really13 32

MESI+F13 Cache13 Coherency13 Protocol

bull Specific13 to13 data13 cache13 linesbull Request13 for13 Ownership13 (RFO)13 when13 a13 processor13 tries13 to13 write13 to13

a13 cache13 linebull Modified13 the13 local13 processor13 has13 changed13 the13 cache13 line13 implies13

only13 one13 who13 has13 itbull Exclusive13 one13 processor13 is13 using13 the13 cache13 line13 not13 modifiedbull Shared13 mul3ple13 processors13 are13 using13 the13 cache13 line13 not13

modifiedbull Invalid13 the13 cache13 line13 is13 invalid13 must13 be13 re-shy‐fetchedbull Forward13 designate13 to13 respond13 to13 requests13 for13 a13 cache13 linebull All13 processors13 MUST13 acknowledge13 a13 message13 for13 it13 to13 be13 valid

Sta7c13 RAM13 (SRAM)

bull Requires13 6-shy‐813 pieces13 of13 circuitry13 per13 datumbull Cycle13 rate13 access13 not13 quite13 measurable13 in13 3mebull Uses13 a13 rela3vely13 large13 amount13 of13 power13 for13 what13 it13 does

bull Data13 does13 not13 fade13 or13 leak13 does13 not13 need13 to13 be13 refreshedrecharged

Dynamic13 RAM13 (DRAM)

bull Requires13 213 pieces13 of13 circuitry13 per13 datumbull ldquoLeaksrdquo13 charge13 but13 not13 sooner13 than13 64msbull Reads13 deplete13 the13 charge13 requiring13 subsequent13 recharge

bull Takes13 24013 cycles13 (~100ns)13 to13 accessbull Intels13 Nehalem13 architecture13 -shy‐13 each13 CPU13 socket13 controls13 a13 por3on13 of13 RAM13 no13 other13 socket13 has13 direct13 access13 to13 it

Architectures

Current13 Processors

bull Intelndash Nehalem13 (Tock)13 Westmere13 (Tick13 32nm)ndash Sandy13 Bridge13 (Tock)ndash Ivy13 Bridge13 (Tick13 22nm)ndash Haswell13 (Tock)

bull AMDndash Bulldozer

bull Oraclendash UltraSPARC13 isnt13 dead

ldquoLatency13 Numbers13 Everyone13 Should13 KnowrdquoL1 cache reference 05 ns

Branch mispredict 5 ns

L2 cache reference 7 ns

Mutex lockunlock 25 ns

Main memory reference 100 ns

Compress 1K bytes with Zippy 3000 ns = 3 micros

Send 2K bytes over 1 Gbps network 20000 ns = 20 micros

SSD random read 150000 ns = 150 micros

Read 1 MB sequentially from memory 250000 ns = 250 micros

Round trip within same datacenter 500000 ns = 05 ms

Read 1 MB sequentially from SSD 1000000 ns = 1 ms

Disk seek 10000000 ns = 10 ms

Read 1 MB sequentially from disk 20000000 ns = 20 ms

Send packet CA-gtNetherlands-gtCA 150000000 ns = 150 ms

bull Shamelessly13 cribbed13 from13 this13 gist13 h9psgistgithubcom284337513 originally13 by13 Peter13 Norvig13 and13 amended13 by13 Jeff13 Dean

Measured13 Cache13 Latencies

Sandy Bridge-E L1d L2 L3 Main=======================================================================Sequential Access 3 clk 11 clk 14 clk 6nsFull Random Access 3 clk 11 clk 38 clk 658ns

SI13 Sotwares13 benchmarks13 h9pwwwsisotwarenetd=qaampf=ben_mem_latency

Registers

bull On-shy‐core13 for13 instruc3ons13 being13 executed13 and13 their13 operands

bull Can13 be13 accessed13 in13 a13 single13 cyclebull There13 are13 many13 different13 typesbull A13 64-shy‐bit13 Intel13 Nehalem13 CPU13 had13 12813 Integer13 amp13 12813 floa3ng13 point13 registers

Store13 Buffers

bull Hold13 data13 for13 Out13 of13 Order13 (OoO)13 execu3onbull Fully13 associa3vebull Prevent13 ldquostallsrdquo13 in13 execu3on13 on13 a13 thread13 when13 the13 cache13 line13 is13 not13 local13 to13 a13 core13 on13 a13 write

bull ~113 cycle

Level13 Zero13 (L0)

bull Added13 in13 Sandy13 Bridgebull A13 cache13 of13 the13 last13 153613 uops13 decodedbull Well-shy‐suited13 for13 hot13 loopsbull Not13 the13 same13 as13 the13 older13 trace13 cache

Level13 One13 (L1)

bull Divided13 into13 data13 and13 instruc3onsbull 32K13 data13 (L1d)13 32K13 instruc3ons13 (L1i)13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 and13 Haswell

bull Sandy13 Bridge13 loads13 data13 at13 25613 bits13 per13 cycle13 double13 that13 of13 Nehalem

bull 3-shy‐413 cycles13 to13 access13 L1d

Level Two (L2)

• 256K per core on Sandy Bridge, Ivy Bridge & Haswell
• 2MB per "module" on AMD's Bulldozer architecture
• ~11 cycles to access
• Unified data and instruction caches from here up
• If the working set size is larger than L2, misses grow

Level Three (L3)

• Was a "unified" cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture. Laptops might have 6-8MB, but server-class might have 30MB
• 14-38 cycles to access

Level Four (L4)

• Some versions of Haswell will have a 128 MB L4 cache
• No latency benchmarks for this yet

Programming Tips

Striding & Pre-fetching

• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: "Memory Access Patterns are Important"
• Shows the importance of locality and striding
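
The effect of striding can be sketched by summing the same array twice: once in index order and once in large jumps. This is a minimal illustration only, not the code from the blog post mentioned above; the class name `StrideDemo`, the array size and the stride of 4096 elements are assumptions made for the example.

```java
// Sketch: sum one array sequentially and with a large stride.
// Both methods touch every element exactly once and return the same total;
// only the order of memory accesses differs.
final class StrideDemo {

    // Walks the array in index order, so consecutive reads share cache lines.
    static long sumSequential(int[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Walks the array in jumps of `stride` elements, restarting at each offset.
    static long sumStrided(int[] data, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < data.length; i += stride) {
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int[] data = new int[1 << 24];   // 16M ints = 64 MB, larger than typical caches
        java.util.Arrays.fill(data, 1);
        int stride = 4096;               // hypothetical stride, chosen only for illustration

        long t0 = System.nanoTime();
        long sequential = sumSequential(data);
        long t1 = System.nanoTime();
        long strided = sumStrided(data, stride);
        long t2 = System.nanoTime();

        System.out.println("sequential: " + sequential + " in " + (t1 - t0) / 1_000_000 + " ms");
        System.out.println("strided:    " + strided + " in " + (t2 - t1) / 1_000_000 + " ms");
    }
}
```

The timings are only indicative: there is no JIT warm-up here, and results depend on the machine. A proper comparison would use a benchmark harness.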

Cache Misses

• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types
  – Compulsory
  – Capacity
  – Conflict
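
Conflict misses happen when more lines compete for one cache set than the set has ways. The sketch below shows how an address maps to a set. It assumes a 32K L1d with 64-byte lines and 8-way associativity; the associativity is an assumption for illustration, and `CacheSetIndex` is a hypothetical name.

```java
// Sketch: which set of a set-associative cache does an address map to?
final class CacheSetIndex {
    static final int LINE_BYTES = 64;           // most common cache line size
    static final int CACHE_BYTES = 32 * 1024;   // 32K L1d
    static final int WAYS = 8;                  // assumed associativity
    static final int SETS = CACHE_BYTES / (LINE_BYTES * WAYS);

    // The line number (address / line size) modulo the set count picks the set.
    static int setIndex(long address) {
        return (int) ((address / LINE_BYTES) % SETS);
    }

    public static void main(String[] args) {
        System.out.println("sets: " + SETS);
        // Addresses exactly LINE_BYTES * SETS (4096) bytes apart land in the same set.
        System.out.println(setIndex(0) + " " + setIndex(4096) + " " + setIndex(8192));
    }
}
```

Under these assumptions, data accessed at a 4096-byte stride keeps landing in the same set, so a ninth such line evicts one of the first eight even though the rest of the cache is empty.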

Programming Optimizations

• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
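
A compare-and-swap retry loop can be sketched with `java.util.concurrent.atomic.AtomicLong`. The `CasCounter` class is hypothetical and exists only to show the shape of the loop; in practice `AtomicLong.incrementAndGet()` already does this.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch: a lock-free counter built on a CAS retry loop.
final class CasCounter {
    private final AtomicLong value = new AtomicLong();

    long increment() {
        while (true) {
            long current = value.get();
            long next = current + 1;
            // Succeeds only if no other thread changed the value in between.
            if (value.compareAndSet(current, next)) {
                return next;
            }
            // Another thread won the race: re-read and retry, without blocking.
        }
    }

    long get() {
        return value.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[4];
        for (int i = 0; i < workers.length; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < 100_000; n++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            worker.join();
        }
        System.out.println(counter.get()); // 4 threads x 100,000 increments = 400000
    }
}
```

The retry loop is the added complexity the slide refers to: every update has to tolerate losing the race and trying again.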

What about Functional Programming?

• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations

Hyperthreading

• Great for I/O-bound applications
• Also helps if you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7

Data Structures

• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
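
To make the contrast with chained buckets concrete, here is a minimal sketch of an open-addressing map that keeps primitive keys and values in flat arrays. It is not Fastutil's implementation. `IntIntMap` is a hypothetical class with a fixed capacity and no removal or resizing, and it is not thread-safe.

```java
// Sketch: int -> int map with linear probing over contiguous arrays.
// No per-entry node objects, so no pointer chasing and no garbage per insert.
final class IntIntMap {
    private final int[] keys;
    private final int[] values;
    private final boolean[] used;
    private final int mask;
    private int size;

    IntIntMap(int capacity) {
        if (capacity < 2 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two >= 2");
        }
        keys = new int[capacity];
        values = new int[capacity];
        used = new boolean[capacity];
        mask = capacity - 1;
    }

    void put(int key, int value) {
        int slot = indexFor(key);
        // On a collision, try the neighbouring slot, which is often on the same cache line.
        while (used[slot] && keys[slot] != key) {
            slot = (slot + 1) & mask;
        }
        if (!used[slot]) {
            // Always keep one slot empty so that probing is guaranteed to terminate.
            if (size == mask) {
                throw new IllegalStateException("map is full");
            }
            used[slot] = true;
            keys[slot] = key;
            size++;
        }
        values[slot] = value;
    }

    int get(int key, int defaultValue) {
        int slot = indexFor(key);
        while (used[slot]) {
            if (keys[slot] == key) {
                return values[slot];
            }
            slot = (slot + 1) & mask;
        }
        return defaultValue;
    }

    private int indexFor(int key) {
        int h = key * 0x9E3779B9;          // multiplicative scrambling of the key bits
        return (h ^ (h >>> 16)) & mask;
    }

    public static void main(String[] args) {
        IntIntMap map = new IntIntMap(16);
        map.put(1, 10);
        map.put(17, 20);
        map.put(1, 11);                    // overwrites the earlier value for key 1
        System.out.println(map.get(1, -1) + " " + map.get(17, -1) + " " + map.get(99, -1));
    }
}
```

A production map would also need resizing, deletion and load-factor tuning, which is a good reason to reach for an existing primitive-collections library first.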

Application Memory Wall & GC

• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  – IBM's Metronome, very predictable
  – Azul's C4, very performant

Using GPUs

• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform, which communicates between GPUs and CPUs via shared L3

The Future

ManyCore

• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic

Memristor

• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary

Phase Change Memory (PRAM)

• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its "neuromorphic" chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed?

Thanks

Credits

• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content

Page 15: Cpu Caches

Inter-Socket Communication

• GT/s – gigatransfers per second
• Quick Path Interconnect (QPI, Intel) – 8 GT/s
• HyperTransport (HTX, AMD) – 6.4 GT/s (?)
• Both transfer 16 bits per transmission in practice, but Sandy Bridge is really 32
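
A transfer rate turns into raw bandwidth by multiplying by the bits moved per transfer and dividing by 8. The sketch below applies this to the figures on this slide. It gives a per-direction payload rate that ignores protocol overhead; `LinkBandwidth` is a hypothetical helper, and the HyperTransport figure carries the same uncertainty as on the slide.

```java
// Sketch: raw link bandwidth from a transfer rate and a transfer width.
final class LinkBandwidth {

    // GT/s * bits per transfer / 8 bits per byte = GB/s.
    static double gigabytesPerSecond(double gigatransfersPerSecond, int bitsPerTransfer) {
        return gigatransfersPerSecond * bitsPerTransfer / 8.0;
    }

    public static void main(String[] args) {
        System.out.println("QPI, 8 GT/s x 16 bits:   " + gigabytesPerSecond(8.0, 16) + " GB/s");
        System.out.println("HTX, 6.4 GT/s x 16 bits: " + gigabytesPerSecond(6.4, 16) + " GB/s");
        System.out.println("8 GT/s x 32 bits:        " + gigabytesPerSecond(8.0, 32) + " GB/s");
    }
}
```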

MESI+F Cache Coherency Protocol

• Specific to data cache lines
• Request for Ownership (RFO): when a processor tries to write to a cache line
• Modified: the local processor has changed the cache line, implies only one who has it
• Exclusive: one processor is using the cache line, not modified
• Shared: multiple processors are using the cache line, not modified
• Invalid: the cache line is invalid, must be re-fetched
• Forward: designate to respond to requests for a cache line
• All processors MUST acknowledge a message for it to be valid
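
The state changes above can be sketched as a small transition function for one cache line as seen by one core. This is a simplified textbook MESI model, not a description of any particular processor: the Forward state and the acknowledgement messages are left out, and `MesiLine`, its enums and the `othersHoldLine` flag are hypothetical names.

```java
// Sketch: simplified MESI transitions for a single cache line in one core's cache.
final class MesiLine {
    enum State { MODIFIED, EXCLUSIVE, SHARED, INVALID }
    enum Event { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE }

    // othersHoldLine only matters when an INVALID line is read locally.
    static State next(State current, Event event, boolean othersHoldLine) {
        switch (event) {
            case LOCAL_READ:
                if (current == State.INVALID) {
                    // The line must be fetched; it is Exclusive only if nobody else has it.
                    return othersHoldLine ? State.SHARED : State.EXCLUSIVE;
                }
                return current;
            case LOCAL_WRITE:
                // A write from SHARED or INVALID first issues a Request for Ownership.
                return State.MODIFIED;
            case REMOTE_READ:
                // Another core reads: a valid copy drops to SHARED (after write-back if MODIFIED).
                return current == State.INVALID ? State.INVALID : State.SHARED;
            case REMOTE_WRITE:
                // Another core took ownership: the local copy is no longer valid.
                return State.INVALID;
            default:
                throw new IllegalArgumentException("unknown event: " + event);
        }
    }

    public static void main(String[] args) {
        State s = State.INVALID;
        s = next(s, Event.LOCAL_READ, false);   // EXCLUSIVE
        s = next(s, Event.LOCAL_WRITE, false);  // MODIFIED
        s = next(s, Event.REMOTE_READ, true);   // SHARED
        s = next(s, Event.REMOTE_WRITE, true);  // INVALID
        System.out.println(s);
    }
}
```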

Static RAM (SRAM)

• Requires 6-8 pieces of circuitry per datum
• Cycle rate access, not quite measurable in time
• Uses a relatively large amount of power for what it does
• Data does not fade or leak, does not need to be refreshed/recharged

Dynamic RAM (DRAM)

• Requires 2 pieces of circuitry per datum
• "Leaks" charge, but not sooner than 64ms
• Reads deplete the charge, requiring subsequent recharge
• Takes 240 cycles (~100ns) to access
• Intel's Nehalem architecture - each CPU socket controls a portion of RAM, no other socket has direct access to it
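
The "240 cycles (~100ns)" figure ties cycle counts to wall-clock time through the clock frequency: 240 cycles in 100 ns implies a clock of about 2.4 GHz. The sketch below does that arithmetic; `CycleMath` is a hypothetical helper, and the 3.0 GHz clock in the last line is an assumption chosen only to show the conversion.

```java
// Sketch: converting between clock cycles and nanoseconds.
final class CycleMath {

    // A clock of f GHz completes f cycles per nanosecond.
    static double nanosFor(long cycles, double clockGhz) {
        return cycles / clockGhz;
    }

    static double impliedClockGhz(long cycles, double nanos) {
        return cycles / nanos;
    }

    public static void main(String[] args) {
        System.out.println("implied clock: " + impliedClockGhz(240, 100.0) + " GHz");
        System.out.println("240 cycles at 2.4 GHz: " + nanosFor(240, 2.4) + " ns");
        System.out.println("11 cycles at an assumed 3.0 GHz: " + nanosFor(11, 3.0) + " ns");
    }
}
```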

Architectures

Current Processors

• Intel
  – Nehalem (Tock) / Westmere (Tick, 32nm)
  – Sandy Bridge (Tock)
  – Ivy Bridge (Tick, 22nm)
  – Haswell (Tock)
• AMD
  – Bulldozer
• Oracle
  – UltraSPARC isn't dead

"Latency Numbers Everyone Should Know"

L1 cache reference                         0.5 ns
Branch mispredict                            5 ns
L2 cache reference                           7 ns
Mutex lock/unlock                           25 ns
Main memory reference                      100 ns
Compress 1K bytes with Zippy             3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network       20,000 ns  =  20 µs
SSD random read                        150,000 ns  = 150 µs
Read 1 MB sequentially from memory     250,000 ns  = 250 µs
Round trip within same datacenter      500,000 ns  = 0.5 ms
Read 1 MB sequentially from SSD      1,000,000 ns  =   1 ms
Disk seek                           10,000,000 ns  =  10 ms
Read 1 MB sequentially from disk    20,000,000 ns  =  20 ms
Send packet CA->Netherlands->CA    150,000,000 ns  = 150 ms

• Shamelessly cribbed from this gist: https://gist.github.com/2843375, originally by Peter Norvig and amended by Jeff Dean
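
The table is easier to reason about as ratios. The sketch below divides a few of the figures above by one another; `LatencyRatios` is a hypothetical helper, and the constants are copied from the table rather than measured.

```java
// Sketch: relative cost of operations, using the nanosecond figures from the table.
final class LatencyRatios {
    static final double L1_NS = 0.5;
    static final double L2_NS = 7;
    static final double MAIN_MEMORY_NS = 100;
    static final double DISK_SEEK_NS = 10_000_000;

    static double ratio(double slowerNs, double fasterNs) {
        return slowerNs / fasterNs;
    }

    public static void main(String[] args) {
        System.out.println("main memory vs L1: " + ratio(MAIN_MEMORY_NS, L1_NS) + "x");
        System.out.println("main memory vs L2: " + ratio(MAIN_MEMORY_NS, L2_NS) + "x");
        System.out.println("disk seek vs main memory: " + ratio(DISK_SEEK_NS, MAIN_MEMORY_NS) + "x");
    }
}
```

On these numbers, a single trip to main memory costs as much as 200 L1 hits, which is the motivation for the locality advice throughout the deck.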

Page 18: Cpu Caches

Dynamic RAM (DRAM)

• Requires 2 pieces of circuitry per datum
• "Leaks" charge, but not sooner than 64ms
• Reads deplete the charge, requiring subsequent recharge
• Takes 240 cycles (~100ns) to access
• Intel's Nehalem architecture - each CPU socket controls a portion of RAM, no other socket has direct access to it
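
The "240 cycles (~100ns)" figure above implies a clock of roughly 2.4 GHz, since time equals cycles divided by frequency. The sketch below spells out that arithmetic. The class and method names are hypothetical, and the 2.4 GHz clock is an assumption inferred from the slide's two numbers rather than something the slide states.

```java
final class CycleMath {
    // Converts a cycle count to nanoseconds for a clock frequency given in GHz.
    // One GHz is one cycle per nanosecond, so nanoseconds = cycles / GHz.
    static double cyclesToNanos(long cycles, double clockGhz) {
        return cycles / clockGhz;
    }

    public static void main(String[] args) {
        // 2.4 GHz is assumed, inferred from "240 cycles (~100ns)".
        System.out.printf("240 cycles at 2.4 GHz = %.1f ns%n", cyclesToNanos(240, 2.4));
    }
}
```

The same conversion helps when comparing latencies quoted in clock cycles with latencies quoted in nanoseconds.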

Architectures

Current Processors

• Intel
  – Nehalem (Tock) / Westmere (Tick, 32nm)
  – Sandy Bridge (Tock)
  – Ivy Bridge (Tick, 22nm)
  – Haswell (Tock)
• AMD
  – Bulldozer
• Oracle
  – UltraSPARC isn't dead

"Latency Numbers Everyone Should Know"

L1 cache reference                          0.5 ns
Branch mispredict                             5 ns
L2 cache reference                            7 ns
Mutex lock/unlock                            25 ns
Main memory reference                       100 ns
Compress 1K bytes with Zippy              3,000 ns  =   3 µs
Send 2K bytes over 1 Gbps network        20,000 ns  =  20 µs
SSD random read                         150,000 ns  = 150 µs
Read 1 MB sequentially from memory      250,000 ns  = 250 µs
Round trip within same datacenter       500,000 ns  = 0.5 ms
Read 1 MB sequentially from SSD       1,000,000 ns  =   1 ms
Disk seek                            10,000,000 ns  =  10 ms
Read 1 MB sequentially from disk     20,000,000 ns  =  20 ms
Send packet CA->Netherlands->CA     150,000,000 ns  = 150 ms

• Shamelessly cribbed from this gist: https://gist.github.com/2843375, originally by Peter Norvig and amended by Jeff Dean

Measured Cache Latencies

Sandy Bridge-E        L1d     L2       L3       Main
====================================================
Sequential Access     3 clk   11 clk   14 clk   6ns
Full Random Access    3 clk   11 clk   38 clk   65.8ns

SI Software's benchmarks: http://www.sisoftware.net/?d=qa&f=ben_mem_latency

Registers

• On-core for instructions being executed and their operands
• Can be accessed in a single cycle
• There are many different types
• A 64-bit Intel Nehalem CPU had 128 Integer & 128 floating point registers

Store Buffers

• Hold data for Out of Order (OoO) execution
• Fully associative
• Prevent "stalls" in execution on a thread when the cache line is not local to a core on a write
• ~1 cycle

Level Zero (L0)

• Added in Sandy Bridge
• A cache of the last 1536 uops decoded
• Well-suited for hot loops
• Not the same as the older trace cache

Level One (L1)

• Divided into data and instructions
• 32K data (L1d), 32K instructions (L1i) per core on a Sandy Bridge, Ivy Bridge and Haswell
• Sandy Bridge loads data at 256 bits per cycle, double that of Nehalem
• 3-4 cycles to access L1d

Level Two (L2)

• 256K per core on a Sandy Bridge, Ivy Bridge & Haswell
• 2MB per "module" on AMD's Bulldozer architecture
• ~11 cycles to access
• Unified data and instruction caches from here up
• If the working set size is larger than L2, misses grow
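
One way to keep a working set inside a cache level is to process data in blocks sized to that level. The sketch below is a rough illustration under stated assumptions: the names are hypothetical, the 256K capacity is the per-core L2 size quoted above, and the calculation ignores Java array headers and anything else competing for the cache.

```java
final class WorkingSet {
    // Number of fixed-size elements that fit in a cache of the given capacity.
    static int elementsPerBlock(int cacheBytes, int bytesPerElement) {
        return cacheBytes / bytesPerElement;
    }

    // Sums the array block by block, making `passes` passes over each block
    // before moving on, so repeated reads touch data that was just loaded.
    static long sumInBlocks(long[] data, int blockElements, int passes) {
        long total = 0;
        for (int start = 0; start < data.length; start += blockElements) {
            int end = Math.min(start + blockElements, data.length);
            for (int p = 0; p < passes; p++) {
                for (int i = start; i < end; i++) {
                    total += data[i];
                }
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // A Java long is 8 bytes; 256 * 1024 bytes is the L2 size from the slide.
        int block = elementsPerBlock(256 * 1024, Long.BYTES);
        System.out.println("longs per 256K block: " + block);
    }
}
```

In practice a block somewhat smaller than the full cache is a safer choice, because code, stack data and other threads share the same cache.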

Level Three (L3)

• Was a "unified" cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture. Laptops might have 6-8MB, but server-class might have 30MB
• 14-38 cycles to access

Level Four (L4)

• Some versions of Haswell will have a 128 MB L4 cache
• No latency benchmarks for this yet

Programming Tips

Striding & Pre-fetching

• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: "Memory Access Patterns are Important"
• Shows the importance of locality and striding
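
The effect of stride on access predictability can be explored with a small experiment along these lines. It is a sketch, not the code from the blog post cited above: the class name, array size and stride values are assumptions chosen for illustration, and the timings it prints depend on the machine and on JIT warm-up.

```java
import java.util.Arrays;

final class StrideWalk {
    // Visits every element exactly once, but in strided order:
    // indices 0, stride, 2*stride, ... then 1, 1+stride, ... and so on.
    // A stride of 1 is a plain sequential walk.
    static long sumWithStride(long[] data, int stride) {
        long total = 0;
        for (int offset = 0; offset < stride; offset++) {
            for (int i = offset; i < data.length; i += stride) {
                total += data[i];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // 8M longs = 64 MB, larger than the L3 sizes mentioned in these slides.
        long[] data = new long[1 << 23];
        Arrays.fill(data, 1L);
        // 8 longs span 64 bytes, a common cache line size; 4096 is an arbitrary large stride.
        for (int stride : new int[] {1, 8, 4096}) {
            long t0 = System.nanoTime();
            long sum = sumWithStride(data, stride);
            long ms = (System.nanoTime() - t0) / 1_000_000;
            System.out.println("stride " + stride + ": sum=" + sum + " in " + ms + " ms");
        }
    }
}
```

Every stride does the same amount of arithmetic and returns the same sum, so any difference in elapsed time comes from how the memory is walked.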

Cache Misses

• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types:
  – Compulsory
  – Capacity
  – Conflict

Programming Optimizations

• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
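
The CAS point can be illustrated with a lock-free counter built on `java.util.concurrent.atomic.AtomicLong`. This is a minimal sketch with hypothetical class and method names. The explicit retry loop is written out to show the added complexity the slide mentions.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CasCounter {
    private final AtomicLong value = new AtomicLong();

    // Lock-free increment: read, compute, then publish only if no other thread
    // changed the value in between; otherwise retry on the same thread.
    long increment() {
        while (true) {
            long current = value.get();
            long next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    long get() {
        return value.get();
    }

    // Starts `threads` workers that each increment `perThread` times
    // and returns the final count once all of them have finished.
    static long countWithThreads(int threads, int perThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        }
        return counter.get();
    }
}
```

For a plain counter, `AtomicLong.incrementAndGet()` does the same job in one call. The hand-written loop matters when the new value depends on more involved logic.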

What about Functional Programming?

• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations
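
"Mutable data in targeted locations" might look like the following sketch: the method mutates a local scratch array that never escapes, while callers only ever see an unmodifiable result. The class and method names are hypothetical, and prefix sums are just a convenient stand-in for a hot loop found by profiling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class TargetedMutation {
    // Computes running totals. Mutation is confined to `scratch`, a contiguous
    // primitive array local to this call; the returned list cannot be modified.
    static List<Long> prefixSums(List<Long> input) {
        long[] scratch = new long[input.size()];
        long running = 0;
        int i = 0;
        for (long v : input) {
            running += v;
            scratch[i++] = running;
        }
        List<Long> result = new ArrayList<>(scratch.length);
        for (long v : scratch) {
            result.add(v);
        }
        return Collections.unmodifiableList(result);
    }
}
```

The public contract stays immutable, so the rest of the program keeps the benefits of immutability while the inner loop works on one contiguous block.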


Measured Cache Latencies

Sandy Bridge-E         L1d      L2       L3       Main
=========================================================
Sequential Access      3 clk    11 clk   14 clk   6ns
Full Random Access     3 clk    11 clk   38 clk   65.8ns

SI Software's benchmarks: http://www.sisoftware.net/?d=qa&f=ben_mem_latency

Registers

• On-core for instructions being executed and their operands
• Can be accessed in a single cycle
• There are many different types
• A 64-bit Intel Nehalem CPU had 128 Integer & 128 floating point registers

Store Buffers

• Hold data for Out of Order (OoO) execution
• Fully associative
• Prevent "stalls" in execution on a thread when the cache line is not local to a core on a write
• ~1 cycle

Level Zero (L0)

• Added in Sandy Bridge
• A cache of the last 1536 uops decoded
• Well-suited for hot loops
• Not the same as the older "trace" cache

Level One (L1)

• Divided into data and instructions
• 32K data (L1d), 32K instructions (L1i) per core on a Sandy Bridge, Ivy Bridge and Haswell
• Sandy Bridge loads data at 256 bits per cycle, double that of Nehalem
• 3-4 cycles to access L1d

Level Two (L2)

• 256K per core on a Sandy Bridge, Ivy Bridge & Haswell
• 2MB per "module" on AMD's Bulldozer architecture
• ~11 cycles to access
• Unified data and instruction caches from here up
• If the working set size is larger than L2, misses grow

Level Three (L3)

• Was a "unified" cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture. Laptops might have 6-8MB, but server-class might have 30MB
• 14-38 cycles to access

Level Four (L4)

• Some versions of Haswell will have a 128 MB L4 cache
• No latency benchmarks for this yet

Programming Tips

Striding & Pre-fetching

• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: "Memory Access Patterns are Important"
• Shows the importance of locality and striding
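
The effect of striding can be sketched in a few lines of Java. This is only an illustration, not code from the talk or from Martin Thompson's post: the class name `StrideDemo`, its methods, and the matrix size are all made up for the example. Both methods add up the same flat array; one walks it with a stride of 1, the other jumps `n` elements on every access.

```java
final class StrideDemo {
    // Fill an n*n matrix stored as one flat, contiguous int[] (row-major layout).
    static int[] fill(int n) {
        int[] data = new int[n * n];
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }
        return data;
    }

    // Stride of 1: each access touches the element right after the previous one.
    static long sumSequential(int[] data, int n) {
        long sum = 0;
        for (int row = 0; row < n; row++) {
            for (int col = 0; col < n; col++) {
                sum += data[row * n + col];
            }
        }
        return sum;
    }

    // Stride of n: each access lands n ints (n * 4 bytes) past the previous one.
    static long sumStrided(int[] data, int n) {
        long sum = 0;
        for (int col = 0; col < n; col++) {
            for (int row = 0; row < n; row++) {
                sum += data[row * n + col];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        int n = 4096; // 4096 * 4096 ints = 64 MB, larger than the L3 sizes quoted above
        int[] data = fill(n);
        long t0 = System.nanoTime();
        long a = sumSequential(data, n);
        long t1 = System.nanoTime();
        long b = sumStrided(data, n);
        long t2 = System.nanoTime();
        System.out.printf("sequential: %d ms, strided: %d ms, sums equal: %b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, a == b);
    }
}
```

The single timed pass in `main` is not a rigorous benchmark (there is no JIT warm-up), so treat the numbers as a rough indication only. A harness such as JMH would be the usual choice for real measurements.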

Cache Misses

• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types:
  – Compulsory
  – Capacity
  – Conflict

Programming Optimizations

• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
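
As a rough sketch of the CAS point above, the counter below increments without taking a lock by retrying `AtomicLong.compareAndSet` until it succeeds. The class `CasCounter` and the helper `runThreads` are hypothetical names invented for this example. In practice `AtomicLong.incrementAndGet()` already does this, and the loop is spelled out only to show where the extra algorithmic complexity comes from.

```java
import java.util.concurrent.atomic.AtomicLong;

final class CasCounter {
    private final AtomicLong value = new AtomicLong();

    // Lock-free increment: read, compute, then publish only if nobody else
    // changed the value in the meantime; otherwise retry on the same thread.
    long increment() {
        while (true) {
            long current = value.get();
            long next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }

    long get() {
        return value.get();
    }

    // Hypothetical helper: hammer one counter from several threads and return the total.
    static long runThreads(int threads, int incrementsPerThread) {
        CasCounter counter = new CasCounter();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < incrementsPerThread; j++) {
                    counter.increment();
                }
            });
            workers[i].start();
        }
        for (Thread worker : workers) {
            try {
                worker.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while waiting for workers", e);
            }
        }
        return counter.get();
    }

    public static void main(String[] args) {
        System.out.println(runThreads(4, 100_000));
    }
}
```

A thread that loses the race does not block: it loops and retries, so no kernel arbitration is involved. The shared counter's cache line still moves between cores under contention.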

What about Functional Programming?

• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations
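
One way to read "mutable data in targeted locations" is sketched below. It is an interpretation rather than code from the talk, and the class `Totals` and its methods are hypothetical. The value is immutable from the caller's point of view. Inside the factory method, a single primitive array is filled in place, so no new structure is allocated at each step.

```java
final class Totals {
    // Never exposed and never modified after construction, so instances are immutable.
    private final long[] data;

    private Totals(long[] data) {
        this.data = data;
    }

    // The only mutation happens here, on a local buffer that does not escape
    // until it is wrapped in the immutable Totals.
    static Totals runningSum(int[] values) {
        long[] buffer = new long[values.length];
        long running = 0;
        for (int i = 0; i < values.length; i++) {
            running += values[i];
            buffer[i] = running;
        }
        return new Totals(buffer);
    }

    int size() {
        return data.length;
    }

    long get(int index) {
        return data[index];
    }
}
```

For example, `Totals.runningSum(new int[] {1, 2, 3, 4})` holds the values 1, 3, 6, 10 in one contiguous `long[]`.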

Hyperthreading

• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7

Data Structures

• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
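
To make the "array-based and contiguous" advice concrete, here is a minimal sketch of an int-to-int map that uses open addressing with linear probing over two primitive arrays, instead of chained buckets of boxed entries. The class `IntIntMap` is hypothetical and deliberately limited: fixed capacity, no removal, not thread-safe, and one key value reserved as the "empty" marker. It is not how Fastutil or any particular library is implemented.

```java
import java.util.Arrays;

final class IntIntMap {
    private static final int EMPTY = Integer.MIN_VALUE; // reserved marker, cannot be used as a key

    private final int[] keys;   // contiguous primitives: no entry objects, no boxing
    private final int[] values;
    private final int mask;
    private int size;

    IntIntMap(int capacityPowerOfTwo) {
        if (Integer.bitCount(capacityPowerOfTwo) != 1) {
            throw new IllegalArgumentException("capacity must be a power of two");
        }
        keys = new int[capacityPowerOfTwo];
        values = new int[capacityPowerOfTwo];
        mask = capacityPowerOfTwo - 1;
        Arrays.fill(keys, EMPTY);
    }

    void put(int key, int value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("reserved key");
        }
        int i = slot(key);
        // Linear probing: on a collision, try the next slot in the same array.
        while (keys[i] != EMPTY && keys[i] != key) {
            i = (i + 1) & mask;
        }
        if (keys[i] == EMPTY) {
            if (size == mask) { // always keep one slot empty so probes terminate
                throw new IllegalStateException("map is full");
            }
            keys[i] = key;
            size++;
        }
        values[i] = value;
    }

    int get(int key, int defaultValue) {
        int i = slot(key);
        while (keys[i] != EMPTY) {
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) & mask;
        }
        return defaultValue;
    }

    int size() {
        return size;
    }

    // Multiplicative hash, folded so the high bits influence the chosen slot.
    private int slot(int key) {
        int h = key * 0x9E3779B9;
        return (h ^ (h >>> 16)) & mask;
    }
}
```

A lookup that collides probes neighbouring array slots, which tend to sit on the same or an adjacent cache line. A chained bucket instead follows a pointer to a separately allocated node.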

Page 27: Cpu Caches

Level13 Two13 (L2)

bull 256K13 per13 core13 on13 a13 Sandy13 Bridge13 Ivy13 Bridge13 amp13 Haswell

bull 2MB13 per13 ldquomodulerdquo13 on13 AMDs13 Bulldozer13 architecture

bull ~1113 cycles13 to13 accessbull Unified13 data13 and13 instruc3on13 caches13 from13 here13 up

bull If13 the13 working13 set13 size13 is13 larger13 than13 L213 misses13 grow

Level Three (L3)

• Was a "unified" cache up until Sandy Bridge, shared between cores
• Varies in size with different processors and versions of an architecture. Laptops might have 6-8 MB, but server-class might have 30 MB
• 14-38 cycles to access

Level Four (L4)

• Some versions of Haswell will have a 128 MB L4 cache
• No latency benchmarks for this yet

Programming Tips

Striding & Pre-fetching

• Predictable memory access is really important
• Hardware pre-fetcher on the core looks for patterns of memory access
• Can be counter-productive if the access pattern is not predictable
• Martin Thompson blog post: "Memory Access Patterns are Important"
• Shows the importance of locality and striding
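
To make the striding point concrete, here is a minimal Java sketch that sums the same array twice: once in address order and once with a large stride. The class name, method names, array size and stride are assumptions chosen for illustration, not taken from the talk or from the blog post. Timings from a naive loop like this are only indicative (JIT warm-up and the hardware both matter), so treat it as a starting point for a proper benchmark.

```java
// Illustrative sketch only: names, sizes and the stride are assumptions.
final class StrideDemo {
    // Visits every element once, in address order (friendly to the pre-fetcher).
    static long sumSequential(long[] data) {
        long sum = 0;
        for (int i = 0; i < data.length; i++) {
            sum += data[i];
        }
        return sum;
    }

    // Visits every element once, but jumps `stride` elements between accesses.
    // Assumes stride >= 1.
    static long sumStrided(long[] data, int stride) {
        long sum = 0;
        for (int start = 0; start < stride; start++) {
            for (int i = start; i < data.length; i += stride) {
                sum += data[i];
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        long[] data = new long[1 << 22]; // 32 MB of longs, larger than many L3 caches
        for (int i = 0; i < data.length; i++) {
            data[i] = i;
        }

        long t0 = System.nanoTime();
        long a = sumSequential(data);
        long t1 = System.nanoTime();
        long b = sumStrided(data, 4096); // 32 KB between consecutive accesses
        long t2 = System.nanoTime();

        System.out.println("same result:   " + (a == b));
        System.out.println("sequential ns: " + (t1 - t0));
        System.out.println("strided ns:    " + (t2 - t1));
    }
}
```

Both methods do the same arithmetic on the same data; only the order of memory accesses differs, which is the variable the slide is about.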

Cache Misses

• Cost hundreds of cycles
• Keep your code simple
• Instruction read misses are most expensive
• Data read misses are less so, but still hurt performance
• Write misses are okay unless using Write Through
• Miss types:
– Compulsory
– Capacity
– Conflict
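
The three miss types can be told apart with a small simulation. The sketch below is a toy model, not how any real CPU reports misses, and every name in it is hypothetical. It models a direct-mapped cache and applies the usual textbook classification: a miss is compulsory if the line has never been touched, capacity if a fully associative LRU cache of the same size would also have missed, and conflict otherwise. Addresses are assumed to be non-negative.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy model for illustration; all names here are hypothetical.
final class MissClassifier {
    final int lineBytes;
    final int numLines;
    final long[] directMapped;                  // line number held by each slot, -1 = empty
    final Set<Long> everSeen = new HashSet<>(); // lines touched at least once
    final LinkedHashMap<Long, Boolean> lru;     // fully associative LRU cache of the same size
    int hits, compulsory, capacity, conflict;

    MissClassifier(int lineBytes, int numLines) {
        this.lineBytes = lineBytes;
        this.numLines = numLines;
        this.directMapped = new long[numLines];
        Arrays.fill(directMapped, -1L);
        final int cap = numLines;
        // accessOrder = true turns the map into an LRU structure.
        this.lru = new LinkedHashMap<Long, Boolean>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Long, Boolean> eldest) {
                return size() > cap;
            }
        };
    }

    void access(long address) {
        long line = address / lineBytes;
        int slot = (int) (line % numLines);
        boolean directHit = directMapped[slot] == line;
        boolean lruHit = lru.get(line) != null; // get() also refreshes recency
        lru.put(line, Boolean.TRUE);
        boolean firstTouch = everSeen.add(line);
        directMapped[slot] = line;

        if (directHit) {
            hits++;
        } else if (firstTouch) {
            compulsory++;
        } else if (!lruHit) {
            capacity++;
        } else {
            conflict++;
        }
    }

    public static void main(String[] args) {
        MissClassifier cache = new MissClassifier(64, 4); // 4 lines of 64 bytes
        long[] addresses = {0, 256, 0, 256};              // both map to slot 0
        for (long a : addresses) {
            cache.access(a);
        }
        System.out.println("hits=" + cache.hits + " compulsory=" + cache.compulsory
                + " capacity=" + cache.capacity + " conflict=" + cache.conflict);
    }
}
```

In the `main` above, two addresses that share a slot keep evicting each other even though the cache is mostly empty, so the run reports two compulsory misses followed by two conflict misses.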

Programming Optimizations

• Stack allocated data is cheap
• Pointer interaction - you have to retrieve data being pointed to, even in registers
• Avoid locking and resultant kernel arbitration
• CAS is better and occurs on-thread, but algorithms become more complex
• Match workload to the size of the last level cache (LLC, L3/L4)
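
As a small illustration of the CAS bullet, the sketch below implements a bounded counter with a compare-and-set retry loop on `java.util.concurrent.atomic.AtomicLong` instead of a lock. The class and method names are made up for this example. It also hints at the "more complex" part: the read, the check and the write have to be arranged so that a lost race simply retries.

```java
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch; CasCounter and incrementIfBelow are hypothetical names.
final class CasCounter {
    private final AtomicLong value = new AtomicLong();

    // Lock-free "increment unless the limit is reached".
    boolean incrementIfBelow(long limit) {
        while (true) {
            long current = value.get();
            if (current >= limit) {
                return false;
            }
            if (value.compareAndSet(current, current + 1)) {
                return true;
            }
            // Another thread changed the value first; re-read and retry.
        }
    }

    long get() {
        return value.get();
    }

    public static void main(String[] args) throws InterruptedException {
        CasCounter counter = new CasCounter();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter.incrementIfBelow(150_000);
            }
        };
        Thread a = new Thread(work);
        Thread b = new Thread(work);
        a.start();
        b.start();
        a.join();
        b.join();
        System.out.println(counter.get()); // prints 150000
    }
}
```

No thread ever blocks or enters the kernel here; under contention a thread just spins through the loop again, which is the trade-off the slide describes.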

What about Functional Programming?

• Have to allocate more and more space for your data structures, leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations
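
One way to read "mutable data in targeted locations" is to keep mutation local to a hot loop and hand out only the finished result. The sketch below contrasts a deliberately naive copy-on-every-step version with one that writes a single array in place. Both methods and the class name are hypothetical, and real persistent collections share structure rather than copying everything, so the first method overstates the cost on purpose.

```java
import java.util.Arrays;

// Illustrative sketch; names are hypothetical and the "immutable" version is deliberately naive.
final class RunningTotals {
    // Immutable style: every step allocates a fresh, slightly larger array.
    static long[] immutableStyle(long[] input) {
        long[] acc = new long[0];
        long sum = 0;
        for (long v : input) {
            sum += v;
            long[] next = Arrays.copyOf(acc, acc.length + 1); // new allocation per step
            next[acc.length] = sum;
            acc = next;
        }
        return acc;
    }

    // Targeted mutation: one allocation, written in place, not shared until complete.
    static long[] mutableInside(long[] input) {
        long[] out = new long[input.length];
        long sum = 0;
        for (int i = 0; i < input.length; i++) {
            sum += input[i];
            out[i] = sum;
        }
        return out;
    }

    public static void main(String[] args) {
        long[] input = {1, 2, 3, 4};
        System.out.println(Arrays.toString(immutableStyle(input)));
        System.out.println(Arrays.toString(mutableInside(input)));
    }
}
```

Callers see the same result from either method; the difference is how much memory is touched while producing it, which is what a profiler would need to confirm before switching.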

Hyperthreading

• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7

Data Structures

• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
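
To show what "array-based and contiguous" can look like in practice, here is a minimal open-addressing map from `long` to `long` backed by two primitive arrays. It is a sketch with hypothetical names and several simplifying assumptions: fixed power-of-two capacity, no removal, no resizing, no thread safety, and `Long.MIN_VALUE` reserved as the empty marker. It is not the Fastutil API and not a replacement for it.

```java
import java.util.Arrays;

// Illustrative sketch; LongLongMap is a hypothetical name, not a library class.
final class LongLongMap {
    private static final long EMPTY = Long.MIN_VALUE; // reserved key marking a free slot
    private final long[] keys;
    private final long[] values;
    private final int mask;

    LongLongMap(int capacity) {
        if (capacity <= 0 || Integer.bitCount(capacity) != 1) {
            throw new IllegalArgumentException("capacity must be a positive power of two");
        }
        keys = new long[capacity];
        values = new long[capacity];
        mask = capacity - 1;
        Arrays.fill(keys, EMPTY);
    }

    private int slotFor(long key) {
        long h = key * 0x9E3779B97F4A7C15L; // multiplicative hash to spread keys
        return ((int) (h >>> 32)) & mask;
    }

    void put(long key, long value) {
        if (key == EMPTY) {
            throw new IllegalArgumentException("reserved key");
        }
        int i = slotFor(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == EMPTY || keys[i] == key) {
                keys[i] = key;
                values[i] = value;
                return;
            }
            i = (i + 1) & mask; // linear probing: the next slot is adjacent in memory
        }
        throw new IllegalStateException("map is full");
    }

    long getOrDefault(long key, long fallback) {
        if (key == EMPTY) {
            return fallback;
        }
        int i = slotFor(key);
        for (int probes = 0; probes < keys.length; probes++) {
            if (keys[i] == key) {
                return values[i];
            }
            if (keys[i] == EMPTY) {
                return fallback;
            }
            i = (i + 1) & mask;
        }
        return fallback;
    }

    public static void main(String[] args) {
        LongLongMap map = new LongLongMap(16);
        map.put(42, 1);
        map.put(42, 2);
        System.out.println(map.getOrDefault(42, -1)); // prints 2
        System.out.println(map.getOrDefault(7, -1));  // prints -1
    }
}
```

Unlike a chained `HashMap`, a lookup here touches no per-entry node objects and no boxed keys, and a collision moves to the neighbouring array slot instead of following a pointer.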

Application Memory Wall & GC

• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
– IBM's Metronome, very predictable
– Azul's C4, very performant

Using GPUs

• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform, which communicates between GPUs and CPUs via shared L3

The Future

ManyCore

• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic

Memristor

• Non-volatile, static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary

Phase Change Memory (PRAM)

• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its "neuromorphic" chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed?

Thanks

Credits

• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content

Page 34: Cpu Caches

What about Functional Programming?

• Have to allocate more and more space for your data structures, which leads to eviction
• When you cycle back around, you get cache misses
• Choose immutability by default, profile to find poor performance
• Use mutable data in targeted locations
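
As a rough illustration of the last two bullets, the sketch below contrasts a copy-on-every-update histogram with one that mutates a single primitive array inside the hot loop and only hands an unmodifiable result to callers. The `Histogram` class and its method names are hypothetical and not from the talk; whether the second version actually wins in a given application should be confirmed by profiling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical example class; names are illustrative only.
final class Histogram {
    private Histogram() {}

    // Immutable style: every update allocates a fresh copy of the counts,
    // so a long input keeps allocating new memory as it goes.
    static List<Integer> countImmutable(int[] values, int buckets) {
        List<Integer> counts = Collections.nCopies(buckets, 0);
        for (int v : values) {
            int slot = Math.floorMod(v, buckets);
            List<Integer> next = new ArrayList<>(counts); // full copy per element
            next.set(slot, next.get(slot) + 1);
            counts = Collections.unmodifiableList(next);
        }
        return counts;
    }

    // Targeted mutability: one primitive array is mutated inside the loop
    // and never escapes; callers still receive an unmodifiable list.
    static List<Integer> countMutableInside(int[] values, int buckets) {
        int[] counts = new int[buckets];
        for (int v : values) {
            counts[Math.floorMod(v, buckets)]++;
        }
        List<Integer> result = new ArrayList<>(buckets);
        for (int c : counts) {
            result.add(c);
        }
        return Collections.unmodifiableList(result);
    }

    public static void main(String[] args) {
        int[] data = {0, 1, 2, 3, 4, 5, 6, 7, 1, 1};
        System.out.println(countImmutable(data, 4));
        System.out.println(countMutableInside(data, 4));
    }
}
```

Both methods return the same counts; the difference is only in how much memory is touched while computing them, which is the part a profiler would need to confirm.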

Page 35: Cpu Caches

Hyperthreading

• Great for I/O-bound applications
• If you have lots of cache misses
• Doesn't do much for CPU-bound applications
• You have half of the cache resources per core
• NOTE - Haswell only has Hyperthreading on i7
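
One place this distinction tends to show up in JVM code is thread-pool sizing. The sketch below is a hypothetical heuristic, not something stated in the talk: it assumes 2-way Hyperthreading and gives CPU-bound work one thread per assumed physical core, while I/O-bound work uses every logical CPU. `Runtime.availableProcessors()` typically reports logical processors (hardware threads) visible to the JVM, and the `PoolSizing` class and `workerCount` method are illustrative names.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical helper; the sizing rule is an assumption to be validated by measurement.
final class PoolSizing {
    private PoolSizing() {}

    // CPU-bound work: one worker per assumed physical core.
    // I/O-bound work: one worker per logical CPU.
    static int workerCount(int logicalCpus, int threadsPerCore, boolean cpuBound) {
        if (logicalCpus < 1 || threadsPerCore < 1) {
            throw new IllegalArgumentException("counts must be positive");
        }
        if (!cpuBound) {
            return logicalCpus;
        }
        return Math.max(1, logicalCpus / threadsPerCore);
    }

    public static void main(String[] args) {
        int logical = Runtime.getRuntime().availableProcessors();
        int workers = workerCount(logical, 2, true); // 2 = assumed 2-way Hyperthreading
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        System.out.println("logical CPUs: " + logical + ", CPU-bound workers: " + workers);
        pool.shutdown();
    }
}
```

The JVM does not expose the physical core count directly, so `threadsPerCore` has to be supplied by the caller or read from the operating system.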

Page 36: Cpu Caches

Data Structures

• BAD: Linked list structures and tree structures
• BAD: Java's HashMap uses chained buckets
• BAD: Standard Java collections generate lots of garbage
• GOOD: Array-based and contiguous in memory is much faster
• GOOD: Write your own that are lock-free and contiguous
• GOOD: Fastutil library, but note that it's additive
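
A minimal sketch of the "array-based and contiguous" idea: a growable list of primitive `int` values backed by a single `int[]`, so elements sit next to each other in memory and no `Integer` objects are allocated. The `IntArrayList` class here is hypothetical and written for illustration (it is not Fastutil's API), and it is single-threaded only; it does not attempt the lock-free part.

```java
import java.util.Arrays;

// Hypothetical minimal growable list of primitive ints (not thread-safe).
final class IntArrayList {
    private int[] elements;
    private int size;

    IntArrayList(int initialCapacity) {
        if (initialCapacity < 0) {
            throw new IllegalArgumentException("negative capacity");
        }
        elements = new int[Math.max(1, initialCapacity)];
    }

    // Appends a value, doubling the backing array when it is full.
    void add(int value) {
        if (size == elements.length) {
            elements = Arrays.copyOf(elements, elements.length * 2);
        }
        elements[size++] = value;
    }

    int get(int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return elements[index];
    }

    int size() {
        return size;
    }

    // Sequential scan over contiguous memory, with no boxing or pointer chasing.
    long sum() {
        long total = 0;
        for (int i = 0; i < size; i++) {
            total += elements[i];
        }
        return total;
    }

    public static void main(String[] args) {
        IntArrayList list = new IntArrayList(16);
        for (int i = 0; i < 1000; i++) {
            list.add(i);
        }
        System.out.println("size=" + list.size() + " sum=" + list.sum());
    }
}
```

Compared with a `LinkedList<Integer>`, iteration here walks one array front to back, which is the access pattern hardware prefetchers handle best.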

Page 37: Cpu Caches

Application Memory Wall & GC

• Tremendous amounts of RAM at low cost
• GC will kill you with compaction
• Use pauseless GC
  – IBM's Metronome, very predictable
  – Azul's C4, very performant

Page 38: Cpu Caches

Using GPUs

• Remember, locality matters
• Need to be able to export a task with data that does not need to update
• AMD has the new HSA platform, which communicates between GPUs and CPUs via shared L3

Page 39: Cpu Caches

The Future

Page 40: Cpu Caches

ManyCore

• David Ungar says > 24 cores, generally many 10s of cores
• Really gets interesting above 1000 cores
• Cache coherency won't be possible
• Non-deterministic

Page 41: Cpu Caches

Memristor

• Non-volatile static RAM, same write endurance as Flash
• 200-300 MB on chip
• Sub-nanosecond writes
• Able to perform processing? (Probably not)
• Multistate, not binary

Page 42: Cpu Caches

Phase Change Memory (PRAM)

• Higher performance than today's DRAM
• Intel seems more fascinated by this, released its neuromorphic chip design last Fall
• Not able to perform processing
• Write degradation is supposedly much slower
• Was considered susceptible to unintentional change, maybe fixed

Page 43: Cpu Caches

Thanks

Page 44: Cpu Caches

Credits

• What Every Programmer Should Know About Memory, Ulrich Drepper of RedHat, 2007
• Java Performance, Charlie Hunt
• Wikipedia/Wikimedia Commons
• AnandTech
• The Microarchitecture of AMD, Intel and VIA CPUs
• Everything You Need to Know about the Quick Path Interconnect, Gabriel Torres/Hardware Secrets
• Inside the Sandy Bridge Architecture, Gabriel Torres/Hardware Secrets
• Martin Thompson's Mechanical Sympathy blog and Disruptor presentations
• The Application Memory Wall, Gil Tene, CTO of Azul Systems
• AMD Bulldozer/Intel Sandy Bridge Comparison, Gionatan Danti
• SI Software's Memory Latency Benchmarks
• Martin Thompson and Cliff Click provided feedback & additional content