Lecture 27 Computer Architecture

Adapted From: Supercomputing in Plain English, Part VII: Multicore Madness
Henry Neeman, Director, OU Supercomputing Center for Education & Research, University of Oklahoma
Wednesday October 17 2007



- The March of Progress
- Multicore/Many-core Basics
- Software Strategies for Multicore/Many-core
- A Concrete Example: Weather Forecasting


The March of Progress

OU’s TeraFLOP Cluster, 2002

- 10 racks @ 1000 lbs per rack
- 270 Pentium4 Xeon CPUs, 2.0 GHz, 512 KB L2 cache
- 270 GB RAM, 400 MHz FSB
- 8 TB disk
- Myrinet2000 Interconnect
- 100 Mbps Ethernet Interconnect
- OS: Red Hat Linux
- Peak speed: 1.08 TFLOP/s (1.08 trillion calculations per second)
- One of the first Pentium4 clusters!
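The quoted peak speed can be reproduced with a back-of-the-envelope product. The sketch below takes the CPU count and clock rate from the slide, but the figure of 2 floating point operations per clock cycle is an assumption that the slide does not state.

```python
# Back-of-the-envelope peak speed for the 2002 cluster.
CPUS = 270                 # from the slide
CLOCK_HZ = 2.0e9           # 2.0 GHz, from the slide
FLOPS_PER_CYCLE = 2        # ASSUMPTION: not stated on the slide


def peak_flops(cpus, clock_hz, flops_per_cycle):
    """Theoretical peak = CPUs x cycles per second x FLOPs per cycle."""
    return cpus * clock_hz * flops_per_cycle


peak = peak_flops(CPUS, CLOCK_HZ, FLOPS_PER_CYCLE)
print(f"{peak / 1e12:.2f} TFLOP/s")
```

Under that assumption the product matches the 1.08 TFLOP/s on the slide.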



TeraFLOP, Prototype 2006, Sale 2011


9 years from room to chip!


Moore’s Law

In 1965, Gordon Moore was an engineer at Fairchild Semiconductor.

He noticed that the number of transistors that could be squeezed onto a chip was doubling about every year; he later revised the estimate to every two years, and the figure is often quoted as every 18 months.

It turns out that computer speed is roughly proportional to the number of transistors per unit area.

Moore wrote a paper about this concept, which became known as “Moore’s Law.”
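To get a feel for what a fixed doubling period implies, the following sketch computes the growth factor after a number of years. The function name and the sample periods are illustrative; only the idea of periodic doubling comes from the text.

```python
def growth_factor(years, doubling_period_years):
    """How many times larger a quantity gets if it doubles every period."""
    return 2.0 ** (years / doubling_period_years)


# Compare a few commonly quoted doubling periods over the same time span.
for period in (1.0, 1.5, 2.0):
    factor = growth_factor(9, period)
    print(f"doubling every {period} years: about x{factor:.0f} in 9 years")
```

Small changes in the doubling period compound into very large differences over a decade, which is why the exact period matters when extrapolating.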

Moore’s Law in Practice

[Figure not recovered: a chart built up over five slides. The only curve label that survives intact is "1/Network Latency".]


Fastest Supercomputer vs. Moore

[Figure not recovered: "Fastest Supercomputer in the World", plotted by year from 1992 to 2008.]







The Tyranny of the Storage Hierarchy

The Storage Hierarchy

- Registers
- Cache memory
- Main memory (RAM)
- Hard disk
- Removable media (e.g., DVD)
- Internet

Top of the list: fast, expensive, few. Bottom of the list: slow, cheap, a lot.



RAM is Slow

- CPU: 351 GB/sec[7]
- Main memory to CPU: 10.66 GB/sec[9] (3%)

The speed of data transfer between Main Memory and the CPU is much slower than the speed of calculating, so the CPU spends most of its time waiting for data to come in or go out.

Why Have Cache?

Cache is nearly the same speed as the CPU, so the CPU doesn’t have to wait nearly as long for stuff that’s already in cache: it can do more operations per second!

- CPU: 351 GB/sec[7]
- Cache: 253 GB/sec[8] (72%)
- Main memory: 10.66 GB/sec[9] (3%)
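The percentages on these two slides are simply each bandwidth divided by the CPU figure. A minimal check, using only the numbers quoted above (the function name is illustrative):

```python
CPU_GB_PER_SEC = 351.0     # from the slide
CACHE_GB_PER_SEC = 253.0   # from the slide
RAM_GB_PER_SEC = 10.66     # from the slide


def percent_of_cpu(bandwidth_gb_per_sec):
    """Express a bandwidth as a percentage of the CPU's consumption rate."""
    return 100.0 * bandwidth_gb_per_sec / CPU_GB_PER_SEC


print(f"cache: {percent_of_cpu(CACHE_GB_PER_SEC):.0f}% of CPU speed")
print(f"RAM:   {percent_of_cpu(RAM_GB_PER_SEC):.0f}% of CPU speed")
```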

Storage Use Strategies

- Register reuse: Do a lot of work on the same data before working on new data.
- Cache reuse: The program is much more efficient if all of the data and instructions fit in cache; if not, try to use what’s in cache a lot before using anything that isn’t in cache.
- Data locality: Try to access data that are near each other in memory before data that are far.
- I/O efficiency: Do a bunch of I/O all at once rather than a little bit at a time; don’t mix calculations and I/O.
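As a small illustration of data locality, the sketch below computes the memory offsets touched by two loop orders over a column-major array (the layout Fortran uses). The helper names and the array shape are made up for illustration; the point is only the distance between consecutive accesses.

```python
def offset(r, c, nr):
    """Element offset of (r, c), 0-based, in a column-major array with nr rows."""
    return r + c * nr


def strides(offsets):
    """Distance, in elements, between each access and the next one."""
    return [b - a for a, b in zip(offsets, offsets[1:])]


NR, NC = 4, 3

# Row index innermost: consecutive accesses are adjacent in memory.
r_inner = [offset(r, c, NR) for c in range(NC) for r in range(NR)]

# Column index innermost: consecutive accesses jump NR elements at a time.
c_inner = [offset(r, c, NR) for r in range(NR) for c in range(NC)]

print(strides(r_inner))
print(strides(c_inner))
```

With the row index innermost every step is one element, which is the access pattern that makes best use of each cache line.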

A Concrete Example

- OSCER’s big cluster, topdawg, has Irwindale CPUs: single core, 3.2 GHz, 800 MHz Front Side Bus.
- The theoretical peak CPU speed is 6.4 GFLOPs (double precision) per CPU, and in practice we’ve gotten as high as 94% of that.
- So, in theory each CPU could consume 143 GB/sec.
- The theoretical peak RAM bandwidth is 6.4 GB/sec, but in practice we get about half that.
- So, any code that does less than 45 calculations per byte transferred between RAM and cache has speed limited by RAM bandwidth.
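The figures on this slide can be approximately reproduced as follows. The slide does not say how 143 GB/sec was derived; the sketch assumes three 8-byte operands per floating point operation (two inputs and one output) and binary gigabytes (2^30 bytes), which happens to give that number. Treat both as assumptions rather than the author's stated method.

```python
PEAK_FLOPS = 6.4e9          # 6.4 GFLOPs per CPU, from the slide
BYTES_PER_OPERAND = 8       # double precision
OPERANDS_PER_FLOP = 3       # ASSUMPTION: two inputs and one output per operation
GIB = 2 ** 30               # ASSUMPTION: "GB" is read as 2^30 bytes here

PEAK_RAM_GB_PER_SEC = 6.4   # from the slide
ACHIEVED_FRACTION = 0.5     # "about half that", from the slide

# Bytes per second the CPU could consume at peak, in GB/sec.
cpu_demand = PEAK_FLOPS * OPERANDS_PER_FLOP * BYTES_PER_OPERAND / GIB
# Bytes per second that RAM actually delivers, in GB/sec.
ram_supply = PEAK_RAM_GB_PER_SEC * ACHIEVED_FRACTION
ratio = cpu_demand / ram_supply

print(f"CPU could consume about {cpu_demand:.0f} GB/sec")
print(f"RAM delivers about {ram_supply:.1f} GB/sec")
print(f"demand / supply is about {ratio:.0f}")
```

Under these assumptions the demand-to-supply ratio lands close to the threshold of 45 quoted on the slide.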


Good Cache Reuse Example

A Sample Application

Matrix-Matrix Multiply

Let A, B and C be matrices of sizes $n_r \times n_c$, $n_r \times n_k$ and $n_k \times n_c$, respectively.

The definition of A = B • C is

$$a_{r,c} = \sum_{k=1}^{n_k} b_{r,k}\, c_{k,c} = b_{r,1} c_{1,c} + b_{r,2} c_{2,c} + b_{r,3} c_{3,c} + \cdots + b_{r,n_k} c_{n_k,c}$$

for $r \in \{1, \dots, n_r\}$, $c \in \{1, \dots, n_c\}$.

Matrix Multiply: Naïve Version

    SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
     &                                   nr, nc, nq)
      IMPLICIT NONE
      INTEGER,INTENT(IN) :: nr, nc, nq
      REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
      REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
      REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2

      INTEGER :: r, c, q

      DO c = 1, nc
        DO r = 1, nr
          dst(r,c) = 0.0
          DO q = 1, nq
            dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
          END DO
        END DO
      END DO
    END SUBROUTINE matrix_matrix_mult_naive

Performance of Matrix Multiply

[Figure not recovered: "Matrix-Matrix Multiply" performance plot. X axis: Total Problem Size in bytes (nr*nc+nr*nq+nq*nc), from 0 to 60,000,000.]




Tiling

- Tile: A small rectangular subdomain of a problem domain. Sometimes called a block or a chunk.
- Tiling: Breaking the domain into tiles.
- Tiling strategy: Operate on each tile to completion, then move to the next tile.
- Tile size can be set at runtime, according to what’s best for the machine that you’re running on.

Tiling Code

    SUBROUTINE matrix_matrix_mult_by_tiling (dst, src1, src2, nr, nc, nq, &
     &                                       rtilesize, ctilesize, qtilesize)
      IMPLICIT NONE
      INTEGER,INTENT(IN) :: nr, nc, nq
      REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
      REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
      REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2
      INTEGER,INTENT(IN) :: rtilesize, ctilesize, qtilesize

      INTEGER :: rstart, rend, cstart, cend, qstart, qend

      DO cstart = 1, nc, ctilesize
        cend = cstart + ctilesize - 1
        IF (cend > nc) cend = nc
        DO rstart = 1, nr, rtilesize
          rend = rstart + rtilesize - 1
          IF (rend > nr) rend = nr
          DO qstart = 1, nq, qtilesize
            qend = qstart + qtilesize - 1
            IF (qend > nq) qend = nq
            CALL matrix_matrix_mult_tile(dst, src1, src2, nr, nc, nq, &
     &                                   rstart, rend, cstart, cend, qstart, qend)
          END DO
        END DO
      END DO
    END SUBROUTINE matrix_matrix_mult_by_tiling

Multiplying Within a Tile

    SUBROUTINE matrix_matrix_mult_tile (dst, src1, src2, nr, nc, nq, &
     &                                  rstart, rend, cstart, cend, qstart, qend)
      IMPLICIT NONE
      INTEGER,INTENT(IN) :: nr, nc, nq
      ! INOUT rather than OUT: partial sums from earlier q tiles are read back
      REAL,DIMENSION(nr,nc),INTENT(INOUT) :: dst
      REAL,DIMENSION(nr,nq),INTENT(IN)    :: src1
      REAL,DIMENSION(nq,nc),INTENT(IN)    :: src2
      INTEGER,INTENT(IN) :: rstart, rend, cstart, cend, qstart, qend

      INTEGER :: r, c, q

      DO c = cstart, cend
        DO r = rstart, rend
          ! Zero the destination only on the first q tile
          IF (qstart == 1) dst(r,c) = 0.0
          DO q = qstart, qend
            dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
          END DO
        END DO
      END DO
    END SUBROUTINE matrix_matrix_mult_tile
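To check that the tiled loop structure computes the same product as the naïve one, the two can be compared on a small case. The sketch below is a Python transliteration written for illustration, not the author's code: it uses 0-based, half-open tile bounds instead of the Fortran 1-based inclusive ones, and it only demonstrates correctness, not any cache benefit.

```python
def matmul_naive(src1, src2, nr, nc, nq):
    """Plain triple loop over the whole problem."""
    dst = [[0.0] * nc for _ in range(nr)]
    for c in range(nc):
        for r in range(nr):
            for q in range(nq):
                dst[r][c] += src1[r][q] * src2[q][c]
    return dst


def matmul_tiled(src1, src2, nr, nc, nq, rtile, ctile, qtile):
    """Same product, visiting one (r, c, q) tile at a time."""
    dst = [[0.0] * nc for _ in range(nr)]
    for cstart in range(0, nc, ctile):
        cend = min(cstart + ctile, nc)      # clip the last tile
        for rstart in range(0, nr, rtile):
            rend = min(rstart + rtile, nr)
            for qstart in range(0, nq, qtile):
                qend = min(qstart + qtile, nq)
                # Work on this tile to completion before moving on.
                for c in range(cstart, cend):
                    for r in range(rstart, rend):
                        for q in range(qstart, qend):
                            dst[r][c] += src1[r][q] * src2[q][c]
    return dst


nr, nc, nq = 5, 4, 3
src1 = [[float(r + q) for q in range(nq)] for r in range(nr)]
src2 = [[float(q * c + 1) for c in range(nc)] for q in range(nq)]

same = (matmul_naive(src1, src2, nr, nc, nq)
        == matmul_tiled(src1, src2, nr, nc, nq, 2, 3, 2))
print(same)
```

Tile sizes that do not divide the problem size evenly (as in this example) exercise the clipping of the last tile in each dimension.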

Reminder: Naïve Version, Again

    SUBROUTINE matrix_matrix_mult_naive (dst, src1, src2, &
     &                                   nr, nc, nq)
      IMPLICIT NONE
      INTEGER,INTENT(IN) :: nr, nc, nq
      REAL,DIMENSION(nr,nc),INTENT(OUT) :: dst
      REAL,DIMENSION(nr,nq),INTENT(IN)  :: src1
      REAL,DIMENSION(nq,nc),INTENT(IN)  :: src2

      INTEGER :: r, c, q

      DO c = 1, nc
        DO r = 1, nr
          dst(r,c) = 0.0
          DO q = 1, nq
            dst(r,c) = dst(r,c) + src1(r,q) * src2(q,c)
          END DO
        END DO
      END DO
    END SUBROUTINE matrix_matrix_mult_naive

Performance with Tiling

[Figure not recovered: two plots titled "Matrix-Matrix Multiply Via Tiling (log-log)" and "Matrix-Matrix Multiply Via Tiling". X axis in both: Tile Size (bytes).]


The Advantages of Tiling

- It allows your code to exploit data locality better, to get much more cache reuse: your code runs faster!
- It’s a relatively modest amount of extra coding (typically a few wrapper functions and some changes to loop bounds).
- If you don’t need tiling – because of the hardware, the compiler or the problem size – then you can turn it off by simply setting the tile size equal to the problem size.


Why Does Tiling Work Here?

Cache optimization works best when the number of calculations per byte is large.

For example, with matrix-matrix multiply on an n × n matrix, there are O(n^3) calculations (on the order of n^3), but only O(n^2) bytes of data.

So, for large n, there are a huge number of calculations per byte transferred between RAM and cache.
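The growth of calculations per byte with n can be made concrete. The sketch below counts one multiply and one add per inner-loop iteration and assumes 4-byte single-precision elements for the three matrices; both the element size and the function name are assumptions for illustration.

```python
BYTES_PER_ELEMENT = 4   # ASSUMPTION: single-precision values


def calcs_per_byte(n):
    """Calculations per byte of matrix data for an n x n matrix multiply."""
    calculations = 2 * n ** 3                      # one multiply + one add, n^3 times
    data_bytes = 3 * n ** 2 * BYTES_PER_ELEMENT    # dst, src1 and src2
    return calculations / data_bytes


for n in (10, 100, 1000):
    print(f"n = {n:4d}: {calcs_per_byte(n):.1f} calculations per byte")
```

The ratio grows linearly with n, so a large enough problem has plenty of work available per byte, provided the data is reused while it is still in cache.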


Multicore/Many-core Basics


What is Multicore?

In the olden days (i.e., the first half of 2005), each CPU chip had one “brain” in it.

More recently, each CPU chip has 2 cores (brains), and, starting in late 2006, 4 cores.

Jargon: Each CPU chip plugs into a socket, so these days, to avoid confusion, people refer to sockets and cores, rather than CPUs or processors.

Each core is just like a full-blown CPU, except that it shares its socket with one or more other cores, and therefore shares its bandwidth to RAM.

Dual Core

[Figure not recovered: a chip containing two cores.]

Quad Core

[Figure not recovered: a chip containing four cores.]

Oct Core

[Figure not recovered: a chip containing eight cores.]

The Challenge of Multicore: RAM

- Each socket has access to a certain amount of RAM, at a fixed RAM bandwidth per SOCKET.
- As the number of cores per socket increases, the contention for RAM bandwidth increases too.
- At 2 cores in a socket, this problem isn’t too bad. But at 16 or 32 or 80 cores, it’s a huge problem.
- So, applications that are cache optimized will get big speedups.
- But, applications whose performance is limited by RAM bandwidth are going to speed up only as fast as RAM bandwidth speeds up.
- RAM bandwidth speeds up much more slowly than CPU speed does.
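A rough way to see the contention is to split a socket's fixed bandwidth evenly across its cores. The even split and the 6.4 GB/sec value (the theoretical peak quoted earlier for topdawg) are illustrative assumptions, not measurements.

```python
SOCKET_GB_PER_SEC = 6.4   # ASSUMPTION: illustrative fixed bandwidth per socket


def bandwidth_per_core(socket_bandwidth, cores):
    """Even split of a socket's fixed RAM bandwidth across busy cores."""
    return socket_bandwidth / cores


for cores in (1, 2, 16, 32, 80):
    share = bandwidth_per_core(SOCKET_GB_PER_SEC, cores)
    print(f"{cores:2d} cores: {share:.2f} GB/sec per core")
```

Each core's share shrinks in proportion to the core count, while each core's appetite for data does not.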

The Challenge of Multicore: Network

- Each node has access to a certain number of network ports, at a fixed number of network ports per NODE.
- As the number of cores per node increases, the contention for network ports increases too.
- At 2 cores in a socket, this problem isn’t too bad. But at 16 or 32 or 80 cores, it’s a huge problem.
- So, applications that do minimal communication will get big speedups.
- But, applications whose performance is limited by the number of MPI messages are going to speed up very very little – and may even crash the node.

Multicore/Many-core Problem

- Most multicore chip families have relatively small cache per core (e.g., 2 MB), and this problem seems likely to remain.
- Small TLBs make the problem worse: the TLB may cover only 512 KB per core rather than the full 2 MB.
- So, to get good cache reuse, you need to partition the algorithm so that each subproblem needs no more than 512 KB.


The T.L.B. on a Current Chip

On Intel Core Duo (“Yonah”):
- Cache size is 2 MB per core.
- Page size is 4 KB.
- A core’s data TLB size is 128 page table entries.
- Therefore, D-TLB only covers 512 KB of cache.
- The cost of a TLB miss is 49 cycles, equivalent to as many as 196 calculations! (4 FLOPs per cycle)

http://www.digit-life.com/articles2/cpu/rmma-via-c7.html
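Both derived numbers on this slide are simple products of the quoted figures. A minimal check, using only values from the slide (the variable names are illustrative):

```python
PAGE_SIZE_KB = 4          # from the slide
DTLB_ENTRIES = 128        # from the slide
MISS_COST_CYCLES = 49     # from the slide
FLOPS_PER_CYCLE = 4       # from the slide

# Memory reachable without a TLB miss: one page per TLB entry.
tlb_coverage_kb = DTLB_ENTRIES * PAGE_SIZE_KB

# Calculations that could have been done while waiting on one miss.
lost_calculations = MISS_COST_CYCLES * FLOPS_PER_CYCLE

print(f"D-TLB covers {tlb_coverage_kb} KB")
print(f"One TLB miss costs up to {lost_calculations} calculations")
```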

What Do We Need?

- We need much bigger caches!
- TLB must be big enough to cover the entire cache.
- It’d be nice to have RAM speed increase as fast as core counts increase, but let’s not kid ourselves.


To Learn More Supercomputing


Computer Architecture
Spring 2012
Lecture 27. CMPs & SMTs

Adapted from Mary Jane Irwin

( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

Multithreading on A Chip

- Find a way to “hide” true data dependency stalls, cache miss stalls, and branch stalls by finding instructions (from other process threads) that are independent of those stalling instructions
- Multithreading – increase the utilization of resources on a chip by allowing multiple processes (threads) to share the functional units of a single processor
  - Processor must duplicate the state hardware for each thread – a separate register file, PC, instruction buffer, and store buffer for each thread
  - The caches, TLBs, BHT, BTB can be shared (although the miss rates may increase if they are not sized accordingly)
  - The memory can be shared through virtual memory mechanisms
  - Hardware must support efficient thread context switching

Types of Multithreading on a Chip

- Fine-grain – switch threads on every instruction issue
  - Round-robin thread interleaving (skipping stalled threads)
  - Processor must be able to switch threads on every clock cycle
  - Advantage – can hide throughput losses that come from both short and long stalls
  - Disadvantage – slows down the execution of an individual thread since a thread that is ready to execute without stalls is delayed by instructions from other threads
- Coarse-grain – switches threads only on costly stalls (e.g., L2 cache misses)
  - Advantages – thread switching doesn’t have to be essentially free and much less likely to slow down the execution of an individual thread
  - Disadvantage – limited, due to pipeline start-up costs, in its ability to overcome throughput loss
    - Pipeline must be flushed and refilled on thread switches
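The round-robin interleaving with stall skipping described for fine-grain multithreading can be sketched as a tiny scheduler. This is a toy model, not a description of any real processor: the `ready` table and the function name are invented for illustration, and at most one instruction issues per cycle.

```python
def fine_grain_schedule(ready):
    """Pick one thread per cycle, round-robin, skipping stalled threads.

    ready[t][c] is True if thread t could issue an instruction in cycle c.
    Returns a list with the chosen thread index per cycle, or None when
    every thread is stalled in that cycle.
    """
    n_threads = len(ready)
    n_cycles = len(ready[0])
    schedule = []
    next_thread = 0                      # where the round-robin search starts
    for cycle in range(n_cycles):
        chosen = None
        for k in range(n_threads):
            t = (next_thread + k) % n_threads
            if ready[t][cycle]:          # skip threads that are stalled
                chosen = t
                break
        schedule.append(chosen)
        if chosen is not None:
            next_thread = (chosen + 1) % n_threads
    return schedule


# Thread 0 stalls in cycle 2, thread 1 stalls in cycle 1.
ready = [
    [True, True, False, True],
    [True, False, True, True],
]
print(fine_grain_schedule(ready))
```

In this model a stall in one thread costs no issue cycle as long as some other thread is ready, which is the throughput advantage the list above attributes to fine-grain switching.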

Simultaneous Multithreading (SMT)

- A variation on multithreading that uses the resources of a multiple-issue, dynamically scheduled processor (superscalar) to exploit both program ILP and thread-level parallelism (TLP)
  - Most SS processors have more machine level parallelism than most programs can effectively use (i.e., than have ILP)
  - With register renaming and dynamic scheduling, multiple instructions from independent threads can be issued without regard to dependencies among them
    - Need separate rename tables (ROBs) for each thread
    - Need the capability to commit from multiple threads (i.e., from multiple ROBs) in one cycle
- Intel’s Pentium 4 SMT called hyperthreading
  - Supports just two threads (doubles the architecture state)
  - Typically, each core of newer multi-core process is


Threading on a 4-way SS Processor Example

[Figure not recovered: issue-slot diagrams for Coarse MT, Fine MT and SMT, showing instructions from Thread A, Thread B, Thread C and Thread D. Axes are labeled "Issue slots →" and, apparently, "Time →".]

William Stallings
Computer Organization and Architecture
8th Edition

Chapter 18
Multicore Computers

Hardware Performance Issues

- Microprocessors have seen an exponential increase in performance
  - Improved organization
  - Increased clock frequency
- Increase in Parallelism
  - Pipelining
  - Superscalar
  - Simultaneous multithreading (SMT)
- Diminishing returns
  - More complexity requires more logic
  - Increasing chip area for coordinating and signal transfer logic
    - Harder to design, make and debug


Alternative Chip Organizations


Intel Hardware Trends

Increased Complexity

- Power requirements grow exponentially with chip density and clock frequency
  - Can use more chip area for cache
    - Smaller
    - Order of magnitude lower power requirements
- By 2015
  - 100 billion transistors on 300 mm² die
    - Cache of 100MB
    - 1 billion transistors for logic
- Pollack’s rule:
  - Performance is roughly proportional to square root of increase in complexity
    - Double complexity gives 40% more performance
- Multicore has potential for near-linear improvement
- Unlikely that one core can use all cache effectively
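The "40% more performance" figure follows directly from the square-root relationship in Pollack's rule. A minimal check (the function name is illustrative):

```python
import math


def pollack_speedup(complexity_factor):
    """Pollack's rule: performance scales with the square root of complexity."""
    return math.sqrt(complexity_factor)


gain_percent = (pollack_speedup(2.0) - 1.0) * 100.0
print(f"Doubling complexity gives about {gain_percent:.0f}% more performance")
```

By contrast, spending the same doubled transistor budget on a second core has the potential to approach a doubling of throughput, which is the near-linear improvement the slide refers to.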


Power and Memory Considerations


Chip Utilization of Transistors

Software Performance Issues

- Performance benefits dependent on effective exploitation of parallel resources
- Even small amounts of serial code impact performance
  - 10% inherently serial on 8 processor system gives only 4.7 times performance
- Communication, distribution of work and cache coherence overheads
- Some applications effectively exploit multicore processors
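The 4.7 times figure for 10% serial code on 8 processors is what Amdahl's law gives. A minimal sketch of that calculation (the function name is illustrative):

```python
def amdahl_speedup(serial_fraction, processors):
    """Amdahl's law: the serial part does not speed up, the rest divides evenly."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


speedup = amdahl_speedup(0.10, 8)
print(f"10% serial on 8 processors: {speedup:.1f}x")
```

The same formula shows the ceiling: with 10% serial code the speedup can never exceed 10 times, no matter how many processors are added.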

Effective Applications for Multicore Processors

- Database
- Servers handling independent transactions
- Multi-threaded native applications
  - Lotus Domino, Siebel CRM
- Multi-process applications
  - Oracle, SAP, PeopleSoft
- Java applications
  - Java VM is multi-thread with scheduling and memory management
  - Sun’s Java Application Server, BEA’s Weblogic, IBM Websphere, Tomcat
- Multi-instance applications
  - One application running multiple times
- E.g. Valve Game Software

Multicore Organization

- Number of core processors on chip
- Number of levels of cache on chip
- Amount of shared cache
- Next slide examples of each organization:
  - (a) ARM11 MPCore (deleted)
  - (b) AMD Opteron (deleted)
  - (c) Intel Core Duo
  - (d) Intel Core i7


Multicore Organization Alternatives

Advantages of shared L2 Cache

- Constructive interference reduces overall miss rate
- Data shared by multiple cores not replicated at cache level
- With proper frame replacement algorithms, the amount of shared cache dedicated to each core is dynamic
  - Threads with less locality can have more cache
- Easy inter-process communication through shared memory
- Cache coherency confined to L1
- Dedicated L2 cache gives each core more rapid access
  - Good for threads with strong locality
- Shared L3 cache may also improve performance

Individual Core Architecture

- Intel Core Duo uses superscalar cores
- Intel Core i7 uses simultaneous multi-threading (SMT)
  - Scales up number of threads supported
    - 4 SMT cores, each supporting 4 threads, appear as 16 cores

Intel x86 Multicore Organization - Core Duo (1)

- 2006
- Two x86 superscalar, shared L2 cache
- Dedicated L1 cache per core
  - 32KB instruction and 32KB data
- Thermal control unit per core
  - Manages chip heat dissipation
  - Maximize performance within constraints
  - Improved ergonomics
- Advanced Programmable Interrupt Controller (APIC)
  - Inter-processor interrupts between cores
  - Routes interrupts to appropriate core
  - Includes timer so OS can interrupt core

Intel x86 Multicore Organization - Core Duo (2)

- Power Management Logic
  - Monitors thermal conditions and CPU activity
  - Adjusts voltage and power consumption
  - Can switch individual logic subsystems
- 2MB shared L2 cache
  - Dynamic allocation
  - MESI support for L1 caches
  - Extended to support multiple Core Duo in SMP
    - L2 data shared between local cores or external
- Bus interface


Intel Core Duo Block Diagram

Intel x86 Multicore Organization - Core i7

- November 2008
- Four x86 SMT processors
- Dedicated L2, shared L3 cache
- Speculative pre-fetch for caches
- On chip DDR3 memory controller
  - Three 8 byte channels (192 bits) giving 32GB/s
  - No front side bus
- QuickPath Interconnection
  - Cache coherent point-to-point link
  - High speed communications between processor chips
  - 6.4G transfers per second, 16 bits per transfer
  - Dedicated bi-directional pairs
  - Total bandwidth 25.6GB/s
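The two bandwidth figures on this slide can be recomputed from the quoted parameters. The QuickPath numbers all come from the slide; for the memory controller the slide gives no transfer rate, so the sketch assumes DDR3-1333 (about 1.333 billion transfers per second per channel), which is an assumption chosen because it reproduces the 32 GB/s figure.

```python
# QuickPath figures, all from the slide.
QPI_TRANSFERS_PER_SEC = 6.4e9
QPI_BITS_PER_TRANSFER = 16
QPI_DIRECTIONS = 2                # dedicated bi-directional pairs

qpi_one_way = QPI_TRANSFERS_PER_SEC * QPI_BITS_PER_TRANSFER / 8   # bytes/sec
qpi_total = qpi_one_way * QPI_DIRECTIONS

# Memory controller: channel count and width are from the slide.
MEM_CHANNELS = 3
MEM_BYTES_PER_CHANNEL = 8
MEM_TRANSFERS_PER_SEC = 1.333e9   # ASSUMPTION: DDR3-1333, not stated on the slide

mem_total = MEM_CHANNELS * MEM_BYTES_PER_CHANNEL * MEM_TRANSFERS_PER_SEC

print(f"QuickPath: {qpi_one_way / 1e9:.1f} GB/s each way, {qpi_total / 1e9:.1f} GB/s total")
print(f"Memory:    {mem_total / 1e9:.0f} GB/s")
```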


Intel Core i7 Block Diagram


Performance Effect of Multiple Cores


Computer Architecture

Adapted from Mary Jane Irwin

( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

Multicore Xbox360 – “Xenon” processor

- To provide game developers with a balanced and powerful platform
- Three SMT processors, 32KB L1 D$ & I$, 1MB UL2 cache
  - 165M transistors total
  - 3.2 GHz Near-POWER ISA
  - 2-issue, 21 stage pipeline, with 128 128-bit registers
  - Weak branch prediction – supported by software hinting
  - In order instructions
  - Narrow cores – 2 INT units, 2 128-bit VMX units, 1 of anything else
- An ATI-designed 500 MHz GPU w/ 512MB of DDR3 DRAM
  - 337M transistors, 10MB framebuffer
  - 48 pixel shader cores, each with 4 ALUs

Xenon Diagram

[Figure not recovered: block diagram. Surviving labels: Core 0, Core 1, Core 2, 3D Core, DVD, HDD Port, Front USBs (2), Wireless, MU ports (2 USBs), Rear USB (1), Ethernet, IR, Audio Out, Flash, Systems Control, Video Out.]

The PS3 “Cell” Processor Architecture

- Composed of a Non-SMP Architecture
  - 234M transistors @ 4 GHz
  - 1 Power Processing Element, 8 “Synergistic” (SIMD) PE’s
  - 512KB L2 $
    - Massively high bandwidth (200GB/s) bus connects it to everything else
  - The PPE is strangely similar to one of the Xenon cores
    - Almost identical, really. Slight ISA differences, and fine-grained MT instead of real SMT
  - The real differences lie in the SPEs (21M transistors each)
    - An attempt to ‘fix’ the memory latency problem by giving each processor complete control over its own 256KB “scratchpad” – 14M transistors
      - Direct mapped for low latency
    - 4 vector units per SPE, 1 of everything else – 7M trans.


The PS3 “Cell” Processor Architecture


How to make use of the SPEs

What about the Software?

- Makes use of special IBM “Hypervisor”
  - Like an OS for OS’s
  - Runs both a real time OS (for sound) and non-real time (for things like AI)
- Software must be specially coded to run well
  - The single PPE will be quickly bogged down
  - Must make use of SPEs wherever possible
  - This isn’t easy, by any standard
- What about Microsoft?
  - Development suite identifies which 6 threads you’re expected to run
  - Four of them are DirectX based, and handled by the OS
  - Only need to write two threads, functionally


Next Lecture and Reminders

Reminders
- Final is Wednesday, May 2 from 1-2:50 PM in ITT 328