
The Shift to Multicore Architectures

Fall 2011

Parallel Programming Practice

Bernd Burgstaller

Yonsei University

Bernhard Scholz

The University of Sydney


2

Why is it important?

“Now we're into the explicit parallelism multiprocessor era, and

this will dominate for the foreseeable future. I don't see any

technology or architectural innovation on the horizon that

might be competitive with this approach.”

-- John Hennessy, Dec. 2006 on ACM Queue

Future computing platforms will be massively parallel many-core

architectures.

We need to be able to program them.


3

Computer Science: Crisis by Crisis

“To put it quite bluntly: as long as there were no machines,

programming was no problem at all;

When we had a few weak computers, programming became a mild

problem, and now we have gigantic computers, programming has

become an equally gigantic problem.”

-- Edsger Dijkstra, 1972 Turing Award Lecture

2002 update: “...now we have gigantic, parallel, computers...”

(parallel ~ more complex than sequential!)


4

The First Software Crisis

• Period: 1960s and 1970s

• Problem: people were still programming in assembly language.

• Example MIPS assembly program to

compute GCD

• Example MIPS R4000 machine code of

the assembly program

addiu sp,sp,-32

sw ra,20(sp)

jal getint

nop

jal getint

sw v0,28(sp)

lw a0,28(sp)

move v1,v0

beq a0,v0,D

slt at,v1,a0

A: beq at,zero,B

nop

b C

subu a0,a0,v1

B: subu v1,v1,a0

C: bne a0,v1,A

slt at,v1,a0

D: jal putint

nop

lw ra,20(sp)

addiu sp,sp,32

jr ra

move v0,zero

27bdffd0 afbf0014 0c1002a8 00000000

0c1002a8 afa2001c 8fa4001c 00401825

10820008 0064082a 10200003 00000000

10000002 00832023 00641823 1483fffa

0064082a 0c1002b2 00000000 8fbf0014

27bd0020 03e00008 00001025


5

The First Software Crisis (cont.)

Disadvantages of assembly languages (and machine code):

• not portable

every hardware architecture provides its own instruction set

moving software to a different architecture means re-coding everything

• Very low abstraction level

bit-level, register-level

very hard to write (esp. for large programs)

even harder to maintain (esp. for large programs)

• Programmers were unable to produce larger and more complex

programs with assembly language.

• What was needed: higher abstraction and portability without losing performance.


6

Solution to the First Software Crisis

High-level languages

Fortran and C

Programmer programs in high-level language

Compiler translates high-level language to assembly code.

Source files → (compile) → Assembler files → (assemble) → Machine code

#include <stdio.h>

/* Subtraction-based Euclidean algorithm; assumes a > 0 and b >= 0. */
int gcd(int a, int b) {
    while (b != 0) {
        if (a > b) {
            a = a - b;
        } else {
            b = b - a;
        }
    }
    printf("The gcd is %d\n", a);
    return a;
}

addiu sp,sp,-32

sw ra,20(sp)

jal getint

nop

jal getint

sw v0,28(sp)

lw a0,28(sp)

move v1,v0

beq a0,v0,D

slt at,v1,a0

...

27bdffd0

afbf0014

0c1002a8

00000000

0c1002a8

afa2001c

8fa4001c

00401825

...

• higher abstraction
• portable
• good performance (with optimizing compilers)


7

Solution to the First Software Crisis (cont.)

A high-level language provides a

unified view for uni-processors:

• a single flow of control

• a single memory image

It hides properties of the processor:

• the processor registers

• the instruction set of the processor

• the functional units of the processor

#include <stdio.h>

/* Subtraction-based Euclidean algorithm; assumes a > 0 and b >= 0. */
int gcd(int a, int b) {
    while (b != 0) {
        if (a > b) {
            a = a - b;
        } else {
            b = b - a;
        }
    }
    printf("The gcd is %d\n", a);
    return a;
}


8

Today:

Programmers are agnostic about processors

• Solid boundary between hardware and software.

Called hardware-software interface.

• No necessity for programmers to know about the processor

High-level languages abstract away processors

Java bytecode is an executable, machine-independent program

representation

• Programmers like the freedom provided by this abstraction.


9

Current Software Crisis: The Parallel Programming Gap

• Period: 2005 to 20??
• Problem: no more performance gains for sequential programs (see next slides).
• We need continuous and reasonable performance improvements
to handle increasing complexity of software
to process larger data-sets

We need to keep portability, malleability and maintainability.

We do not want to increase complexity on programmer’s side.

parallel, but...


10

Moore’s Law

• Gordon Earle Moore, co-founder of Intel Corporation, observed in an article published in Electronics Magazine in 1965 that the number of transistors that can be placed on an integrated circuit was doubling roughly every year. In 1975 he revised the rate to a doubling approximately every two years.


11

The Future

Historically: use transistors to boost performance of single instruction streams (faster CPUs, ILP, caches).

Now: deliver more cores per chip (multicores, GPUs).

“Every year we get faster more processors.”

(Figure label: Itanium 2)


12

Bottleneck: Power density

[Figure: power density (W/cm2, log scale from 1 to 10,000) of Intel processors (4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium® processors) over the years ‘70 to ‘10, compared against a hot plate, a nuclear reactor, a rocket nozzle and the Sun’s surface.]

Source: Patrick Gelsinger, Intel Developer Forum, Spring 2004.


13

The Future

• The free performance lunch is over for sequential applications.

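• The number of transistors on a chip doubles approximately every two years (Moore’s Law), however:

Dynamic power consumption grows superlinearly with clock frequency (P ≈ C·V²·f, and a higher frequency f generally requires a higher supply voltage V)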

Wire delays

Diminishing returns from instruction-level parallelism (ILP)

DRAM access latency

No substantial performance improvement of uniprocessors in sight.

No more speed-ups for sequential applications (see next slides).

• Hardware solution:

increase the number of cores per processor

new parallel computer architectures:
• multicore CPUs
• GPGPUs
• Cell architecture (heterogeneous multicore)
• Intel Single Chip Cloud Computer (SCC)


14

Uni-Processor (i.e., one core) Performance

[Figure: uni-processor performance relative to the VAX-11/780 (log scale from 1 to 10,000) over the years 1978 to 2006. The curve is annotated with three successive growth rates: 25%/year, 52%/year and 20%/year.]

From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.


15

“Free Lunch” for Sequential Programs (1986-2002)

Until 2002, performance of uni-processors would increase 50% per year.

A program that took 100 seconds to execute on a uni-processor from the year 2000 would take only 66 seconds on next year’s uni-processor.

The annual performance increase of uni-processors dramatically slowed down around 2002.

The “free lunch” (annual performance increase for sequential programs) is over.

[Figure: the same sequential program running on the uni-processors of 2000, 2001 and 2002 takes 100 sec, 66 sec and 44 sec, respectively.]
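The 100 → 66 → 44 second progression follows from dividing the execution time by 1.5 once per year. A minimal sketch of that arithmetic, assuming a constant 50% annual gain; the function names time_after_years and print_free_lunch are hypothetical and chosen only for this illustration:

```c
#include <stdio.h>

/* Execution time after `years` of uni-processor improvement.
 * `annual_gain` is the yearly performance increase, e.g. 0.5 for 50%.
 * (Hypothetical helper, not part of the original slides.) */
double time_after_years(double base_seconds, double annual_gain, int years) {
    double t = base_seconds;
    for (int y = 0; y < years; y++) {
        t /= 1.0 + annual_gain;  /* 50% more performance => time / 1.5 */
    }
    return t;
}

/* Print the execution time of a 100-second program for 2000..2002. */
void print_free_lunch(void) {
    for (int y = 0; y <= 2; y++) {
        printf("%d: %.1f sec\n", 2000 + y, time_after_years(100.0, 0.5, y));
    }
}
```

The computed values are about 66.7 and 44.4 seconds; the slide rounds them down to 66 and 44.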


16

The Fate of Sequential Programs on Multicores

• A sequential program is restricted to a single core.

It cannot take advantage of other cores that are available.

The execution time does not improve if the sequential program is executed on a multicore. This is a problem, because the performance of uni-processors/single cores will not improve much in the future.

• To run faster, the sequential program must be parallelized, i.e., parts of it must execute on different cores in parallel.

[Figure: the sequential program takes 100 sec on a uni-processor, 100 sec on a processor with two cores and 100 sec on a processor with four cores; in each case it occupies a single core.]


17

Outlook

• Until 2002, performance of uni-processors would increase 50% per year.

So did the performance of sequential programs on uni-processors.

• A sequential program is restricted to a single core.

Performance might even decrease on future multi-core architectures because of lower

Perf/Clock ratio.

No more performance gains in foreseeable future for sequential programs on multicore

architectures.

To run faster, programs must utilize several cores at once (parallelization).

2,048 cores


18

Parallelizing Sequential Programs

Decompose

into tasks

• Identify parts (tasks) of the problem that can execute in parallel.

This is called task-decomposition.

The more parallel tasks, the higher the possible speedup.

Each task should consist of a non-negligible amount of computation.

• Why?

• Map tasks onto parallel execution units (CPU cores, GPGPU

stream processors (SPs), cluster nodes, ...)

• Implement...
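One common way to carry out the decompose-and-map steps above for a loop over an array is to cut the index range into contiguous chunks, one per execution unit. The following is a minimal sketch under that assumption; chunk_t and chunk_for_task are hypothetical names introduced here, and other decompositions (e.g., cyclic or dynamic) are equally possible.

```c
#include <stddef.h>

/* Half-open index range [begin, end) handled by one task. */
typedef struct {
    size_t begin;
    size_t end;
} chunk_t;

/* Split n items into num_tasks contiguous chunks whose sizes differ by at
 * most one item. Requires num_tasks > 0 and task_id in [0, num_tasks). */
chunk_t chunk_for_task(size_t n, size_t num_tasks, size_t task_id) {
    size_t base = n / num_tasks;   /* minimum chunk size */
    size_t extra = n % num_tasks;  /* the first `extra` tasks get one more item */
    chunk_t c;
    c.begin = task_id * base + (task_id < extra ? task_id : extra);
    c.end = c.begin + base + (task_id < extra ? 1 : 0);
    return c;
}
```

With 1024 items and 4 tasks, every task receives 256 items. Choosing fewer, larger chunks keeps the amount of computation per task non-negligible; choosing more, smaller chunks exposes more parallelism.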


19

Parallelizing Sequential Programs (cont.)

• Embarrassingly parallel:

Decompose “naturally” into many independent tasks

May result in many more tasks (1000s!) than available cores.

Example: Mandelbrot sets

• Embarrassingly sequential

do not decompose at all.

• Real-world programming problems usually in between:

Dwarfs: Classification of SW regarding computation and data movement

See “The Landscape of Parallel Computing Research: A View From Berkeley”

13 dwarfs identified in http://view.eecs.berkeley.edu/wiki/Dwarf_Mine
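To see why Mandelbrot sets count as embarrassingly parallel, consider the usual escape-time computation: the value of each pixel depends only on that pixel's coordinates. The sketch below is an illustration, not code from the course; the function names, the image region [-2, 1] x [-1.5, 1.5] and the choice of one task per image row are assumptions made here.

```c
#include <stddef.h>

/* Escape-time iteration count for the point c = cx + i*cy:
 * iterate z = z*z + c from z = 0 until |z| > 2 or max_iter is reached. */
int mandel_iters(double cx, double cy, int max_iter) {
    double x = 0.0, y = 0.0;
    int i = 0;
    while (i < max_iter && x * x + y * y <= 4.0) {
        double xt = x * x - y * y + cx;
        y = 2.0 * x * y + cy;
        x = xt;
        i++;
    }
    return i;
}

/* One task: compute a single image row. The task reads only its arguments
 * and writes only out[0..width), so rows can run in parallel without
 * communication or coordination. Requires width > 1 and height > 1. */
void mandel_row(int row, int width, int height, int max_iter, int *out) {
    double cy = -1.5 + 3.0 * row / (height - 1);
    for (int col = 0; col < width; col++) {
        double cx = -2.0 + 3.0 * col / (width - 1);
        out[col] = mandel_iters(cx, cy, max_iter);
    }
}
```

An image with 1,000 rows yields 1,000 such independent tasks, typically far more than the number of available cores.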


20


• Tasks are usually not independent of

each other:

Temporal order: some tasks can only

execute after another task has completed.

• Example: The first task (not shown) in our

example might be a setup-task that pre-loads

array m with data from a file. Tasks 1 and 2 can

only execute after the setup-task has

completed.

Communication: tasks might need to

exchange information while executing.

• Example: the partial sum in variable y needs to

be communicated to Task 1 to compute the

overall sum.

Coordination: execution of tasks might need

to be coordinated to guarantee a correct

result.

• Example: it is not allowed that 2 tasks use a

printer at the same time (i.e., in parallel). Why?
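The coordination requirement can be sketched with a mutex from POSIX threads. In this illustration the "printer" is modeled as a shared page buffer (an assumption made here so that the result can be checked); all names are hypothetical. Without the lock, characters from two jobs could interleave on the page.

```c
#include <pthread.h>
#include <stddef.h>

#define JOB_LEN 8          /* characters per print job */
#define JOBS_PER_TASK 100  /* jobs submitted by each task */

/* Hypothetical "printer": a shared page that tasks append characters to. */
static char page[2 * JOBS_PER_TASK * JOB_LEN + 1];
static size_t page_pos = 0;
static pthread_mutex_t printer_lock = PTHREAD_MUTEX_INITIALIZER;

/* Print one job; the mutex ensures only one task uses the printer at a time. */
static void print_job(char mark) {
    pthread_mutex_lock(&printer_lock);
    for (int i = 0; i < JOB_LEN; i++) {
        page[page_pos++] = mark;
    }
    pthread_mutex_unlock(&printer_lock);
}

static void *print_task(void *arg) {
    char mark = *(const char *)arg;
    for (int j = 0; j < JOBS_PER_TASK; j++) {
        print_job(mark);
    }
    return NULL;
}

/* Run two printing tasks in parallel. Returns 1 if every job appears on the
 * page as an uninterrupted run of JOB_LEN identical characters, else 0. */
int run_print_tasks(void) {
    pthread_t t1, t2;
    char a = 'A', b = 'B';
    page_pos = 0;
    if (pthread_create(&t1, NULL, print_task, &a) != 0) return 0;
    if (pthread_create(&t2, NULL, print_task, &b) != 0) {
        pthread_join(t1, NULL);
        return 0;
    }
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    page[page_pos] = '\0';
    for (size_t k = 0; k < page_pos; k += JOB_LEN) {
        for (int i = 1; i < JOB_LEN; i++) {
            if (page[k + i] != page[k]) return 0;
        }
    }
    return page_pos == (size_t)(2 * JOBS_PER_TASK * JOB_LEN);
}
```

On most systems this needs to be compiled with thread support enabled (for example with the -pthread option of GCC or Clang).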

/* sequential computation to add up integers in an array: */
int m[1024];
int sum = 0;
for (int i = 0; i < 1024; i++) {
    sum = sum + m[i];
}

Parallel sum:

[Figure: the array m (2, 11, 4, ...) is split between Task1 and Task2; the partial sum y computed by Task2 is communicated to Task1.]
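A two-task version of the summation could look as follows when written with POSIX threads. This is a minimal sketch, not the course's reference solution: the names sum_task_t, sum_range and parallel_sum are hypothetical, and the choice of splitting the array in the middle is an assumption. Task 2 runs in a worker thread and stores its partial sum (the y of the example); Task 1 runs in the calling thread and adds both parts.

```c
#include <pthread.h>
#include <stddef.h>

/* Work description and result slot for one summation task. */
typedef struct {
    const int *data;
    size_t begin, end;  /* half-open index range [begin, end) */
    long partial;       /* partial sum computed by the task */
} sum_task_t;

/* Task body: add up data[begin..end). */
static void *sum_range(void *arg) {
    sum_task_t *t = arg;
    long s = 0;
    for (size_t i = t->begin; i < t->end; i++) {
        s += t->data[i];
    }
    t->partial = s;
    return NULL;
}

/* Sum n integers with two tasks. */
long parallel_sum(const int *m, size_t n) {
    sum_task_t task1 = { m, 0, n / 2, 0 };
    sum_task_t task2 = { m, n / 2, n, 0 };
    pthread_t worker;

    if (pthread_create(&worker, NULL, sum_range, &task2) != 0) {
        sum_range(&task2);       /* no thread available: run Task 2 sequentially */
        sum_range(&task1);
        return task1.partial + task2.partial;
    }
    sum_range(&task1);           /* Task 1 runs in the calling thread */
    pthread_join(worker, NULL);  /* wait for Task 2; its partial sum is now visible */
    return task1.partial + task2.partial;
}
```

The call to pthread_join covers two of the dependencies named above: it enforces the temporal order (the overall sum is formed only after Task 2 has completed) and it is the point at which the partial sum is communicated to Task 1.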


21


• The hardware might consist of different kinds of cores.

Such a processor is called a heterogeneous multiprocessor.

• Processors with one kind of core are called homogeneous multiprocessors.

• Example: PC with IA64 multicore and GPGPU cores

four IA64 general-purpose cores

• for conventional control computations

several stream processors (SPs)

• Each SP provides several “mini-cores” plus local memory

• for data-intensive processing, allows efficient floating-point computations.

• Needs a task-decomposition that fits the underlying hardware!

[Figure: heterogeneous multicore computer consisting of an IA64 multicore and a GPGPU with several stream processors (SPs).]


22

Who will do the actual parallelization?

• The compiler?

Would be nice. Programmers could continue writing high-level language programs.

The compiler would find a task-decomposition for a given multicore processor.

Unfortunately this approach does not work (yet).

• Heterogeneous multiprocessors are especially difficult to program.

• The speed-up gained from automatic parallelization is limited.

Parallelism from automatic parallelization is called implicit parallelism.

• The programmer?

Yes! (contents of this course)

• Knows most about program to find a ‘winning’ task-decomposition.

• Needs to understand the hardware to achieve a task-decomposition that fits the underlying hardware.

• Needs to take care of communication & coordination among tasks.

Parallelism expressed by the programmer is called explicit parallelism.

The research community is working on programming languages and tools that ease this task.


23

As already mentioned...

“Now we're into the explicit parallelism multiprocessor era, and

this will dominate for the foreseeable future. I don't see any

technology or architectural innovation on the horizon that

might be competitive with this approach.”

John Hennessy

Future computing platforms will be massively parallel,

heterogeneous many-core architectures.

We need to be able to program them.


24

Multicores are here to stay…

[Figure: number of cores per chip (log scale from 1 to 512) over the years 1975 to 2010. Processors shown: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, Power4, PA 8800, Opteron, Core, Core2, CoreDuo, Core2Duo, Core2Quad, Power6, Power7, Xbox360, Xeon, BCM1480, i7 Gulftown, NVIDIA G80, NVIDIA Fermi (GTX 580), PicoChip102, Raw, Cell, Intel Tflops, Intel SCC, Cisco CRS-1, Cisco CRS-3, Sparc T1, Sparc T2, Sparc T3, ARM A9, ARM A11.]

Courtesy: Kudlur and Mahlke'08
