computer architecture - crash course · computerarchitecture crash course frédérichaziza...

51
Computer Architecture Crash course Frédéric Haziza <[email protected]> Department of Computer Systems Uppsala University Summer 2009

Upload: others

Post on 24-May-2020

8 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Computer ArchitectureCrash course

Frédéric Haziza <[email protected]>

Department of Computer Systems

Uppsala University

Summer 2009

Page 2: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Conclusions

The multicore era is already herecost of parallelism is dropping dramatically• Cache/Thread is dropping• Memory bandwidth/thread is dropping

The introduction of multicore processors will have profoundimpact on computational software

“For high-performance computing (HPC) applications,multicore processors introduces an additional layer ofcomplexity, which will force users to go through a phasechange in how they address parallelism or risk being leftbehind.”

[IDC presentation on HPC at International Supercomputing Conference, ICS07, Dresden, June 26-29 2007]

5 OS2’09 | Computer Architecture (Crash course)

Page 3: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Why multiple cores/threads?(even in your laptops)

Instruction level parallelism is running outWire delay is hurtingPower dissipation is hurtingThe memory latency/bandwidth bottleneck is becoming evenworse

7 OS2’09 | Computer Architecture (Crash course)

Page 4: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Focus: Performance

Create and explore:

1 Parallelism• Instruction level parallelism (ILP)• In some cases: Parallelism at the program/algorithm level

(“Parallel computing”)

2 Locality of data• Caches• Temporal locality• Cache blocks/lines: Spatial locality

8 OS2’09 | Computer Architecture (Crash course)

Page 5: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Old Trend 1: Deeper pipelines(exploring ILP)

9 OS2’09 | Computer Architecture (Crash course)

Page 6: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Old Trend 2: Wider pipelines(exploring more ILP)

10 OS2’09 | Computer Architecture (Crash course)

Page 7: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Old Trend 3: Deeper memory hierarchy(exploring locality of data)

11 OS2’09 | Computer Architecture (Crash course)

Page 8: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Are we hitting the wall now?

YearNow

Performance [log]

1

10

100

1000

“Moor

e’slaw

”Business as usual...

Humm, transistors canbe made smaller andfaster, but there areother problems

Possible path, but re-quires a paradigm shift

12 OS2’09 | Computer Architecture (Crash course)

Page 9: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Classical microprocessors: Whatever it takes torun one program fast

Exploring ILP (instruction-level parallelism)

Faster clocks → Deep pipelinesSuperscalar PipelinesBranch PredictionOut-of-Order ExecutionTrace CacheSpeculationPredicate ExecutionAdvanced Load Address TableReturn Address Stack...

BadNe

ws#1:Alreadyexplored

mostILP

13 OS2’09 | Computer Architecture (Crash course)

Page 10: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

#2: Future technology

BadNe

ws#2:Looongwiredelay→slowCPUs

14 OS2’09 | Computer Architecture (Crash course)

Page 11: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

#2: How much of the chip area can you reach inone cycle?

BadNe

ws#2:Looongwiredelay→slowCPUs

15 OS2’09 | Computer Architecture (Crash course)

Page 12: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Bad News #3: Mem latency/bandwidth

B

C

B+C A

16 OS2’09 | Computer Architecture (Crash course)

Page 13: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Bad News #4: Power is the limit

Power consumption is the bottleneck• Cooling servers is hard• Battery lifetime for mobile computers• Energy is money

Dissipated effect is proportional to• ˜ Frequency• ˜ Voltage2

17 OS2’09 | Computer Architecture (Crash course)

Page 14: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

1 Why mutliple threads/cores?Old TrendsBad newsSolutions?

2 Introducing concurrencyScenarioDefinitionsAmdhal’s law

3 Can we trust the hardware?

18 OS2’09 | Computer Architecture (Crash course)

Page 15: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Bad News #1: Not enough ILP→ feed one CPU with instr. from many threads

19 OS2’09 | Computer Architecture (Crash course)

Page 16: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Simltaneous multithreading

20 OS2’09 | Computer Architecture (Crash course)

Page 17: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Bad News #2: Wire delay→ Multiple small CPUs with private L1$

21 OS2’09 | Computer Architecture (Crash course)

Page 18: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Bad News #2: Wire delay→ Multiple small CPUs with private L1$

Institutionen för informationsteknologi | www.it.uu.se

Multicore processor

Mem

I R

Regs

B MMW

I R B MMW

I R B MMW

I R B MMW

I

I

I

IIssue

logic

$

!

SEK

Thread 1

Thread N

PC

PC Regs

SEK

Issue

logic

22 OS2’09 | Computer Architecture (Crash course)

Page 19: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Bad News #3: Memory latency/bandwidth→ memory accesses from many threads

B

C

B+C A

B

C

B+C A

23 OS2’09 | Computer Architecture (Crash course)

Page 20: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Bad News #4: Power consumption→ Lower the frequency → lower voltage

Pdyn = C ∗ f ∗ V 2 ≈ area ∗ freq ∗ voltage2

CPUfreq=f

VS.CPU

freq=f/2CPU

freq=f/2

Pdyn(C , f ,V ) < CfV 2 Pdyn(2C , f /2, < V ) < CfV 2

CPUfreq=f

VS.CPU CPU

CPU CPU

freq=f/2

Pdyn(C , f ,V ) < CfV 2 Pdyn(C , f /2, < V ) < 12CfV

2

24 OS2’09 | Computer Architecture (Crash course)

Page 21: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Solving all the problems (?):Exploring thread parallelism

#1 Running out of ILP

→ feed one CPU with instr. from many threads

#2 Wire delay is starting to hurt

→ Multiple small CPUs with private caches

#3 Memory is the bottleneck

→ memory accesses from many threads

#4 Power is the limit

→ Lower the frequency → lower voltage

25 OS2’09 | Computer Architecture (Crash course)

Page 22: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

In all computers very soon...

Multicore processors, probably alsowith simultaneous multithreading

26 OS2’09 | Computer Architecture (Crash course)

Page 23: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Scenario

Several cars want to drive from point A to point B.

They can compete for space on the same roadand end up either:

following each otheror competing for positions (and having accidents!).

Or they could drive in parallel lanes,thus arriving at about the same time without getting in eachother’s way.

Or they could travel different routes, using separate roads.

28 OS2’09 | Computer Architecture (Crash course)

Page 24: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Scenario

Several cars want to drive from point A to point B.

Sequential

ProgrammingThey can compete for space on the same roadand end up either:

following each other

or competing for positions (and having accidents!).

Parallel

ProgrammingOr they could drive in parallel lanes,

thus arriving at about the same time without getting in each other’s way.

Distributed

ProgrammingOr they could travel different routes, using separate roads.

29 OS2’09 | Computer Architecture (Crash course)

Page 25: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Definitions

Concurrent Program2+ processes working together to perform a task.Each process is a sequential program(= sequence of statements executed one after another)

Single thread of control vs multiple thread of control

Communication

• Shared Variables• Message Passing

Synchronization

• Mutual Exclusion• Condition Synchronization

30 OS2’09 | Computer Architecture (Crash course)

Page 26: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Correctness

Wanna write a concurrent program?What kinds of processes?

How many processes?

How should they interact?

CorrectnessEnsure that processes interaction is properly synchronized

Mutual Exclusion

Ensuring the critical sections of statements do not execute atthe same time

Condition Synchronization

Delaying a process until a given condition is true

Our focus: imperative programs and asynchronous execution31 OS2’09 | Computer Architecture (Crash course)

Page 27: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Amdhal’s law

P is the fraction of a calculation that can be parallelized(1− P) is the fraction that is sequential(i.e. cannot benefit from parallelization)

N processors

speedup(1−P)+P

(1−P)+P/N

maximum speedup

11−P

ExampleIf P = 90% ⇒ max speedup of 10no matter how large the value of N used (ie N →∞)

32 OS2’09 | Computer Architecture (Crash course)

Page 28: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Single-Processor machine

CPU

1ns200ns

3ns

sram

CacheLevel 1Level 2

sram

10ns200ns

Mem

dram

150ns200ns

Storage

5 000 000ns10 000 000ns

2000:1982:

34 OS2’09 | Computer Architecture (Crash course)

Page 29: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Memory Hierarchy

Main Memory

Level 2 cache

Level 1 cache

CPU

35 OS2’09 | Computer Architecture (Crash course)

Page 30: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Why do we miss in the cache?

Compulsory

missTouching the data for the first time

Capacity

missThe cache is too small

Conflict

missNon-ideal cache implementation (data hash to the same cache

line)

Main Memory

Miss

Cache

Hit

CPU

36 OS2’09 | Computer Architecture (Crash course)

Page 31: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Locality

Temporal

locality

Spatial

locality

Inner loop stepping through an array

A, B, C, A+1, B, C, A+2, B, C,

spatial temporal

37 OS2’09 | Computer Architecture (Crash course)

Page 32: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

MultiProcessor world - Taxonomy

SIMD MIMD

Message Passing

Fine-grained Coarse-grained

Shared Memory

UMA NUMA COMA

38 OS2’09 | Computer Architecture (Crash course)

Page 33: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Shared-Memory Multiprocessors

Memory Memory...

Interconnection network / Bus

Cache Cache...

CPU CPU

39 OS2’09 | Computer Architecture (Crash course)

Page 34: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Programming Model

Thread

$

Thread

$

Thread

$

Thread

$

Thread

$

Thread

$

Thread

$

Thread

$

Thread

$

Shared Memory

40 OS2’09 | Computer Architecture (Crash course)

Page 35: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Cache coherency

Shared Memory

A: B:

$

Thread

$

Thread

$

ThreadRead A

Read A

...

...

Read A

...

Read A

...

Write A

Read B

...

Read A

41 OS2’09 | Computer Architecture (Crash course)

Page 36: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Summing up Coherence

There can be many copies of adatum, but only one value

Too str

ong!!

!

There is a single global order of valuechanges to each datum

42 OS2’09 | Computer Architecture (Crash course)

Page 37: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Memory Ordering

The

coherence

defines a per-datum order of value changes.The

memory model

defines the order of value changes for allthe data.

What ordering does the memory system guarantees?“Contract” between the HW and the SW developersWithout it, we can’t say much about the result of a parallelexecution

43 OS2’09 | Computer Architecture (Crash course)

Page 38: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

What order for these threads?

A’ denotes a modified value to the datum at address A

Thread 1

LD AST B’LD CST D’LD E......

Thread 2

ST A’LD B’ST C’LD DST E’......

LD A happens before ST A’

44 OS2’09 | Computer Architecture (Crash course)

Page 39: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Other possible orders?

Thread 1

LD AST B’LD C

ST D’LD E......

Thread 2

ST A’LD B’ST C’LD D

ST E’......

Thread 1

LD AST B’LD C

ST D’LD E......

Thread 2

ST A’LD B’ST C’LD D

ST E’......

45 OS2’09 | Computer Architecture (Crash course)

Page 40: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Memory model flavors

Sequentially Consistent

: Programmer’s intuition

Total Store Order

: Almost Programmer’s intuition

Weak/Release Consistency

: No guaranty

46 OS2’09 | Computer Architecture (Crash course)

Page 41: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Dekker’s algorithm

Initially A = 0,B = 0“fork”

A := 1if(B==0)print(“A wins”);

B := 1if(A==0)print(“B wins”);

Can both A and B win?

Does the writebecome globallyvisible before theread is performed?

Left: The read (ie, test if B==0) can bypass the store (A := 1)Right: The read (ie, test if A==0) can bypass the store (B := 1)⇒ Both loads can be performed before any of the stores⇒ Yes, it is possible that both win!

47 OS2’09 | Computer Architecture (Crash course)

Page 42: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Dekker’s algorithm for Total Store Order

Initially A=0,B=0“fork”

A := 1Membar #StoreLoad;if(B==0)print(“A wins”);

B := 1Membar #StoreLoad;if(A==0)print(“B wins”);

Can both A and B win?

Does the writebecome globallyvisible before theread is performed?

Membar: the read is started after all previous stores have been “globallyordered”⇒ Behaves like a sequentially consistent machine⇒ No, they won’t both win. Good job Mister Programmer!

48 OS2’09 | Computer Architecture (Crash course)

Page 43: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Dekker’s algorithm, in general

Initially A = 0,B = 0“fork”

A := 1if(B==0)print(“A wins”);

B := 1if(A==0)print(“B wins”);

Can both A and B win?

The answer depends on the memory model

Remember? ...Contract between the HW and SW developers.

49 OS2’09 | Computer Architecture (Crash course)

Page 44: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

So....

Memory Modelis a tricky issue

50 OS2’09 | Computer Architecture (Crash course)

Page 45: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

New issues

Compulsory missCapacity missConflict miss

Memory Memory...

Interconnection network / Bus

Cache Cache

...

CPU CPU

Communication miss

Cache-to-cache transfer

False-sharing

Side-effect from large cache lines

What about the compiler?Code reordering? volatile keyword in C ...

51 OS2’09 | Computer Architecture (Crash course)

Page 46: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Good to know

Performance ⇒ Use of CacheMemory hierarchy ⇒ Consistency problems

To get maximal performance on a given machine,the programmer has to know about the characteristics of thememory system and has to write programs to account them

52 OS2’09 | Computer Architecture (Crash course)

Page 47: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Distributed Memory Architecture

Interconnection network

Memory Memory

...Cache Cache

CPU CPU

Communication through Message PassingOwn cache, but memory not shared⇒ No coherency problems

53 OS2’09 | Computer Architecture (Crash course)

Page 48: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Isn’t a CMP just a SMP on a chip?

54 OS2’09 | Computer Architecture (Crash course)

Page 49: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Cost of communication?

55 OS2’09 | Computer Architecture (Crash course)

Page 50: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Impact on Algorithms

For performance, we need to understand the interaction betweenalgorithms and architecture.

The ruleshave changed

We need to question old algorithms and results!

56 OS2’09 | Computer Architecture (Crash course)

Page 51: Computer Architecture - Crash course · ComputerArchitecture Crash course FrédéricHaziza Department of Computer Systems Uppsala University Summer2009

Why? Concurrency Hardware?

Criteria for algorithm design

Pre-CMP:• Communication is expensive: Minimize communication• Data locality is important• Maximize scalability for large-scale applications

Within a CMP chip today:• (On-chip) communication is almost to free• Data locality is even more important• (SMT may help by hiding some poor locality)• Scalability to 2-32 threads

In a multi-CMP system tomorrow:• Communication is sometimes almost free (on-chip), sometimes

(very) expensive (between chips)• Data locality (minimizing of-chip references) is a key to

efficiency• “Hierarchical scalability”

57 OS2’09 | Computer Architecture (Crash course)