The Shift to Multicore Architectures
Fall 2011
Parallel Programming Practice
Bernd Burgstaller
Yonsei University
Bernhard Scholz
The University of Sydney
Why is it important?
“Now we're into the explicit parallelism multiprocessor era, and
this will dominate for the foreseeable future. I don't see any
technology or architectural innovation on the horizon that
might be competitive with this approach.”
-- John Hennessy, Dec. 2006 on ACM Queue
Future computing platforms will be massively parallel many-core
architectures.
We need to be able to program them.
Computer Science: Crisis by Crisis
“To put it quite bluntly: as long as there were no machines,
programming was no problem at all;
When we had a few weak computers, programming became a mild
problem, and now we have gigantic computers, programming has
become an equally gigantic problem.”
-- Edsger Dijkstra, 1972 Turing Award Lecture
2002 update: “...now we have gigantic, parallel, computers...”
(parallel ~ more complex than sequential!)
The First Software Crisis
• Period: 1960s and 1970s
• Problem: people were still programming in assembly language.
• Example: a MIPS assembly program that computes the GCD, followed by
the MIPS R4000 machine code of that program:
addiu sp,sp,-32
sw ra,20(sp)
jal getint
nop
jal getint
sw v0,28(sp)
lw a0,28(sp)
move v1,v0
beq a0,v0,D
slt at,v1,a0
A: beq at,zero,B
nop
b C
subu a0,a0,v1
B: subu v1,v1,a0
C: bne a0,v1,A
slt at,v1,a0
D: jal putint
nop
lw ra,20(sp)
addiu sp,sp,32
jr ra
move v0,zero
27bdffd0 afbf0014 0c1002a8 00000000
0c1002a8 afa2001c 8fa4001c 00401825
10820008 0064082a 10200003 00000000
10000002 00832023 00641823 1483fffa
0064082a 0c1002b2 00000000 8fbf0014
27bd0020 03e00008 00001025
The First Software Crisis (cont.)
Disadvantages of assembly languages (and machine code):
• Not portable
every hardware architecture provides its own instruction set
moving software to a different architecture means re-coding everything
• Very low abstraction level
bit-level, register-level
very hard to write (esp. for large programs)
even harder to maintain (esp. for large programs)
• Programmers were unable to produce larger and more complex
programs in assembly language.
• What was needed: higher abstraction and portability without losing
performance.
Solution to the First Software Crisis
High-level languages
Fortran and C
The programmer writes programs in a high-level language.
The compiler translates the high-level language to assembly code.
Source files -> compile -> Assembler files -> assemble -> Machine code
#include <stdio.h>
/* Subtraction-based GCD; assumes a > 0 and b >= 0. */
int gcd(int a, int b) {
    while (b != 0) {
        if (a > b) {
            a = a - b;
        } else {
            b = b - a;
        }
    }
    printf("The gcd is %d\n", a);
    return a;
}
addiu sp,sp,-32
sw ra,20(sp)
jal getint
nop
jal getint
sw v0,28(sp)
lw a0,28(sp)
move v1,v0
beq a0,v0,D
slt at,v1,a0
...
27bdffd0
afbf0014
0c1002a8
00000000
0c1002a8
afa2001c
8fa4001c
00401825
...
Benefits: higher abstraction, portability, and good performance
(with optimizing compilers).
Solution to the First Software Crisis (cont.)
A high-level language provides a
unified view for uni-processors:
• a single flow of control
• a single memory image
It hides properties of the processor:
• the processor registers
• the instruction set of the processor
• the functional units of the processor
#include <stdio.h>
/* Subtraction-based GCD; assumes a > 0 and b >= 0. */
int gcd(int a, int b) {
    while (b != 0) {
        if (a > b) {
            a = a - b;
        } else {
            b = b - a;
        }
    }
    printf("The gcd is %d\n", a);
    return a;
}
Today:
Programmers are agnostic about processors
• Solid boundary between hardware and software,
called the hardware-software interface.
• No necessity for programmers to know about the processor
High-level languages abstract away processors
Java bytecode is an executable, machine-independent program
representation
• Programmers like the freedom provided by this abstraction.
Current Software Crisis: The Parallel Programming Gap
• Period: 2005 to 20??
• Problem: no more performance gains for sequential programs
(see next slides).
• We need continuous and reasonable performance improvements
to handle the increasing complexity of software
to process larger data sets
• We need to keep portability, malleability and maintainability.
• We do not want to increase complexity on the programmer’s side.
Moore’s Law
• Gordon Earle Moore, co-founder of Intel Corporation, observed in
an article published in Electronics magazine in 1965 that the number
of components on an integrated circuit was doubling roughly every year.
In 1975 he revised the rate to a doubling approximately every two years.
The Future
Historically: use transistors to boost the performance of single
instruction streams (faster CPUs, ILP, caches).
Now: deliver more cores per chip (multicores, GPUs).
Previously: “Every year we get faster processors.”
Now: “Every year we get more processors.”
Bottleneck: Power density
[Figure: power density (W/cm2, log scale from 1 to 10,000) versus year (’70 to ’10) for Intel processors 4004, 8008, 8080, 8085, 8086, 286, 386, 486 and Pentium® processors. Reference levels marked on the power-density axis: hot plate, nuclear reactor, rocket nozzle, Sun’s surface.]
Source: Patrick Gelsinger, Intel Developer Forum, Spring 2004.
The Future
• The free performance lunch is over for sequential applications.
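• The number of transistors on a chip keeps doubling about every two years
(Moore’s Law; often quoted as every 18 months), however:
Dynamic power consumption grows faster than linearly with clock frequency (roughly P ~ C·V²·f, and a higher f needs a higher supply voltage V)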
Wire delays
Diminishing returns from instruction-level parallelism (ILP)
DRAM access latency
No substantial performance improvement of uniprocessors in sight.
No more speed-ups for sequential applications (see next slides).
• Hardware solution:
increase the number of cores per processor
new parallel computer architectures:
• multicore CPUs
• GPGPUs
• Cell architecture (heterogeneous multicore)
• Intel Single Chip Cloud Computer (SCC)
Uni-Processor (i.e., one core) Performance
[Figure: uni-processor performance relative to the VAX-11/780 (log scale from 1 to 10,000) for the years 1978 to 2006. Growth rates annotated on the curve: 25%/year, 52%/year, 20%/year.]
From Hennessy and Patterson, Computer Architecture: A
Quantitative Approach, 4th edition, 2006.
“Free Lunch” for Sequential Programs (1986-2002)
Until 2002, the performance of uni-processors increased by about 50% per year.
A program that took 100 seconds to execute on a uni-processor from the year 2000
would take only about 66 seconds on next year’s uni-processor.
The annual performance increase of uni-processors slowed down dramatically
around 2002.
The “free lunch” (annual performance increase for sequential programs) is over.
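The arithmetic behind the 100 / 66 / 44 second figures can be sketched as follows. This is only an illustration under the slide's assumption of a constant 50% yearly performance gain; the function name time_after_years is hypothetical and not part of the course material.

```c
#include <assert.h>

/* Hypothetical helper: projected run time after `years` processor
   generations, assuming single-core performance grows by a constant
   factor (1 + rate) per year. rate = 0.5 models the slide's 50%/year. */
double time_after_years(double base_seconds, double rate, int years)
{
    assert(years >= 0 && rate > -1.0);
    double t = base_seconds;
    for (int y = 0; y < years; y++)
        t /= 1.0 + rate;   /* 1.5x faster hardware: time shrinks by 1/1.5 */
    return t;
}
```

Calling time_after_years(100.0, 0.5, 1) gives 100 / 1.5, roughly 66.7 seconds, and two years give 100 / 1.5^2, roughly 44.4 seconds; the slide truncates these values to 66 and 44.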
[Figure: the same sequential program executed on uni-processors from the years 2000, 2001 and 2002 takes 100 sec, 66 sec and 44 sec, respectively.]
The Fate of Sequential Programs on Multicores
• A sequential program is restricted to a single core.
It cannot take advantage of other cores that are available.
The execution time does not improve if the sequential program is executed on a multicore. This is a problem, because the performance of uni-processors/single cores will not improve much in the future.
• To run faster, the sequential program must be parallelized, i.e., parts of it must execute on different cores in parallel.
[Figure: the sequential program takes 100 sec on a uni-processor, 100 sec on a processor with two cores (Core1, Core2) and 100 sec on a processor with four cores (Core1 to Core4); in each case it occupies a single core.]
Outlook
• Until 2002, the performance of uni-processors increased by about 50% per year.
So did the performance of sequential programs on uni-processors.
• A sequential program is restricted to a single core.
Its performance might even decrease on future multicore architectures because of a lower
performance-per-clock ratio.
No more performance gains are in sight for sequential programs on multicore
architectures.
To run faster, programs must utilize several cores at once (parallelization).
[Figure label: 2,048 cores]
Parallelizing Sequential Programs
Decompose
into tasks
• Identify parts (tasks) of the problem that can execute in parallel.
This is called task decomposition.
More parallel tasks -> higher speedup possible.
Each task should consist of a non-negligible amount of computation.
• Why?
• Map tasks onto parallel execution units (CPU cores, GPGPU
stream processors (SPs), cluster nodes, ...)
• Implement...
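One common way to decompose work over an array is to cut the index range into contiguous chunks, one per task. The sketch below is a minimal illustration of that idea, not the method prescribed by the slides; the type task_range and the function task_chunk are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [begin, end) assigned to one task. */
typedef struct {
    size_t begin;
    size_t end;
} task_range;

/* Hypothetical helper: split n items into num_tasks contiguous chunks
   whose sizes differ by at most one, and return the chunk of task t
   (tasks are numbered 0 .. num_tasks-1). */
task_range task_chunk(size_t n, size_t num_tasks, size_t t)
{
    assert(num_tasks > 0 && t < num_tasks);
    size_t base = n / num_tasks;    /* minimum chunk size */
    size_t extra = n % num_tasks;   /* the first `extra` tasks get one more item */
    task_range r;
    r.begin = t * base + (t < extra ? t : extra);
    r.end = r.begin + base + (t < extra ? 1 : 0);
    return r;
}
```

For example, task_chunk(1024, 2, 1) yields the range [512, 1024). Choosing fewer, larger chunks keeps the amount of computation per task non-negligible; choosing more chunks exposes more parallelism.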
Parallelizing Sequential Programs (cont.)
• Embarrassingly parallel:
Decompose “naturally” into many independent tasks
May result in many more tasks (1000s!) than available cores.
Example: Mandelbrot sets
• Embarrassingly sequential:
Problems that do not decompose at all.
• Real-world programming problems usually lie in between:
Dwarfs: Classification of SW regarding computation and data movement
See “The Landscape of Parallel Computing Research: A View From Berkeley”
13 dwarfs identified in http://view.eecs.berkeley.edu/wiki/Dwarf_Mine
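To see why Mandelbrot sets count as embarrassingly parallel, consider the sketch below. It is an illustration only: the function names, the choice of one task per image row, and the mapped region of the complex plane are assumptions made for this example. The point is that the value of each pixel depends only on its own coordinates, so rows (or even single pixels) can be computed by different tasks without any communication.

```c
#include <assert.h>

/* Number of iterations of z = z*z + c (starting from z = 0) performed
   while |z| <= 2, capped at max_iter. The result depends only on the
   point c = (cr, ci), never on neighbouring points. */
int mandel_iters(double cr, double ci, int max_iter)
{
    assert(max_iter >= 0);
    double zr = 0.0, zi = 0.0;
    int n = 0;
    while (n < max_iter && zr * zr + zi * zi <= 4.0) {
        double t = zr * zr - zi * zi + cr;
        zi = 2.0 * zr * zi + ci;
        zr = t;
        n++;
    }
    return n;
}

/* One possible task: compute a single image row into out[0 .. width-1].
   Different rows share no data, so they can run on different cores. */
void mandel_row(int row, int width, int height, int max_iter, int *out)
{
    for (int col = 0; col < width; col++) {
        double cr = -2.0 + 3.0 * col / width;    /* real axis: [-2, 1) */
        double ci = -1.0 + 2.0 * row / height;   /* imaginary axis: [-1, 1) */
        out[col] = mandel_iters(cr, ci, max_iter);
    }
}
```

An image with 1,000 rows yields 1,000 independent tasks under this decomposition, which is typically far more than the number of available cores.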
Parallelizing Sequential Programs (cont.)
• Tasks are usually not independent of
each other:
Temporal order: some tasks can only
execute after another task has completed.
• Example: The first task (not shown) in our
example might be a setup-task that pre-loads
array m with data from a file. Tasks 1 and 2 can
only execute after the setup-task has
completed.
Communication: tasks might need to
exchange information while executing.
• Example: the partial sum in variable y needs to
be communicated to Task 1 to compute the
overall sum.
Coordination: execution of tasks might need
to be coordinated to guarantee a correct
result.
• Example: two tasks must not use a printer at the same time
(i.e., in parallel). Why?
/* sequential computation to add
up integers in an array: */
int m[1024];
int sum = 0;
for (int i = 0; i < 1024; i++) {
    sum = sum + m[i];
}
[Figure: parallel sum. The array m (2, 11, 4, ...) is divided between Task1 and Task2; the partial sum y is communicated to Task1.]
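The two-task parallel sum can be sketched with POSIX threads as shown below. This is one possible implementation under stated assumptions, not the one used in the course: the names sum_task, sum_range and parallel_sum2 are hypothetical, Task1 runs in the calling thread, and Task2 runs in a second thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Work description for one task: add up m[begin .. end-1]. */
typedef struct {
    const int *m;
    size_t begin;
    size_t end;
    long partial;   /* result; read by the caller after the task is done */
} sum_task;

/* Task body, usable both as a thread entry point and as a plain call. */
static void *sum_range(void *arg)
{
    sum_task *t = arg;
    long s = 0;
    for (size_t i = t->begin; i < t->end; i++)
        s += t->m[i];
    t->partial = s;
    return NULL;
}

/* Hypothetical helper: sum n integers with two tasks. */
long parallel_sum2(const int *m, size_t n)
{
    assert(m != NULL || n == 0);
    sum_task task1 = { m, 0, n / 2, 0 };
    sum_task task2 = { m, n / 2, n, 0 };
    pthread_t thread2;

    if (pthread_create(&thread2, NULL, sum_range, &task2) != 0) {
        /* No second thread available: fall back to sequential execution. */
        sum_range(&task1);
        sum_range(&task2);
        return task1.partial + task2.partial;
    }
    sum_range(&task1);             /* Task1 works on its half meanwhile */
    pthread_join(thread2, NULL);   /* temporal order: wait until Task2 is done */
    return task1.partial + task2.partial;   /* Task2's partial sum reaches Task1 */
}
```

Here pthread_join covers two of the dependencies named above: it enforces the temporal order (the final addition happens only after Task2 has completed) and it makes Task2's partial sum, the y of the figure, safely visible to Task1.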
Parallelizing Sequential Programs (cont.)
• The hardware might consist of different kinds of cores.
Such a processor is called a heterogeneous multiprocessor.
• Processors with one kind of core are called homogeneous multiprocessors.
• Example: PC with IA64 multicore and GPGPU cores
four IA64 general-purpose cores
• for conventional control computations
several stream processors (SPs)
• Each SP provides several “mini-cores” plus local memory
• for data-intensive processing, allows efficient floating-point computations.
• Needs a task-decomposition that fits the underlying hardware!
[Figure: heterogeneous multicore computer consisting of an IA64 multicore and a GPGPU with several SPs.]
Who will do the actual parallelization?
• The compiler?
Would be nice. Programmers could continue writing high-level language programs.
The compiler would find a task-decomposition for a given multicore processor.
Unfortunately this approach does not work (yet).
• Esp. heterogeneous multiprocessors are difficult to program
• The speed-up gained from automatic parallelization is limited.
Parallelism from automatic parallelization is called implicit parallelism.
• The programmer?
Yes! (contents of this course)
• Knows the program best and can therefore find a ‘winning’ task decomposition.
• Needs to understand the hardware to achieve a task decomposition that fits it.
• Needs to take care of communication & coordination among tasks.
Parallelism specified by the programmer is called explicit parallelism.
The research community is working on programming languages and tools that ease this task.
As already mentioned...
“Now we're into the explicit parallelism multiprocessor era, and
this will dominate for the foreseeable future. I don't see any
technology or architectural innovation on the horizon that
might be competitive with this approach.”
John Hennessy
Future computing platforms will be massively parallel,
heterogeneous many-core architectures.
We need to be able to program them.
Multicores are here to stay…
[Figure: number of cores per chip (log scale from 1 to 512) versus year (1975 to 2010). Processors plotted: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, Power4, PA 8800, Opteron, Core, Core2, CoreDuo, Core2Duo, Core2Quad, i7 Gulftown, Xeon, Power6, Power7, Xbox360, BCM1480, NVIDIA G80, NVIDIA Fermi (GTX 580), PicoChip102, Raw, Cell, Intel Tflops, Intel SCC, Cisco CRS-1, Cisco CRS-3, Sparc T1, Sparc T2, Sparc T3, ARM A9, ARM A11.]
Courtesy: Kudlur and Mahlke'08