TRANSCRIPT
Another Rebirth of Mathematical Software for Linear Algebra Problems, A Personal Perspective
Symposium: The Birth of Numerical Analysis
K.U. Leuven, Belgium, October 29-30, 2007
Jack Dongarra,
University of Tennessee, Oak Ridge National Laboratory, and University of Manchester
A Growth-Factor of a Trillion in Performance in the Past 60 Years
[Chart: growth from one OPS through KiloOPS, MegaOPS, GigaOPS, TeraOPS, to PetaOPS; milestones include the 1943 Harvard Mark 1, 1948 Manchester Baby, 1949 EDSAC, 1951 Pilot ACE, 1959 IBM 7094, 1964 CDC 6600, 1976 Cray 1, 1982 Cray X-MP, 1988 Cray Y-MP, 1991 Intel Delta, 1996 T3E, 1997 ASCI Red, 2001 Earth Simulator, 2003 Cray X1, and 2005 IBM BG/L]
Scalar to superscalar to vector to SMP to DMP to massively parallel to multi-core designs
AEC urges manufacturers to look at "radical new" machine structures; this leads to CDC Star-100, TI ASC, and Illiac IV
Virtual memory from
U Manchester,
T. Kilburn
DEC ships first PDP-8
and IBM ships 360
First CS PhD U of Penn.
Richard Wexelblat
DARPA contract with U of I to build the Illiac IV
ARPANET work begins; 4 computers connected (UCSB, UCLA, SRI, and U of Utah)
Unix developed by Thompson and Ritchie
Gordon Moore: "# of gates per chip 2x / 18 months"
Wilkinson's Rounding Errors in Algebraic Processes published
1960's: Serial Computers; Algorithms and the Beginning of Math Software
IBM Stretch
delivered to LANL
CDC 6600; S. Cray's design; functional parallelism, leading to RISC (3x faster than IBM Stretch)
Fortran 66
Forsythe & Moler published; Fortran, Algol, and PL/I
Wilkinson’s The
Algebraic
Eigenvalue
Problem published
CDC 7600; pipelined architecture; 3.1 Mflop/s
Gatlinburg I
Strassen's algorithm for matrix multiply
Golub & Kahan
SVD paper
BCSLIB
Harwell Subroutine Library
1970's: Math SW Ideas Forming; Microprocessors & Parallel Beginning
NATS project conceived; concept of certified math software and the process involved with production
Purdue Math Software Symposium
IMSL founded
Intel 4004; 60 Kop/s
Cray Research founded
Intel 8008
1/4-size Illiac IV installed at NASA Ames; 15 Mflop/s achieved; 64 procs.
BLAS report in SIGNUM; Lawson, Hanson, & Krogh
Alan George on
Nested Dissection
Wilkinson's Turing Award
NAG project
begins
Handbook for Automatic Computation, Volume II: landmark in the development of numerical algorithms and software; basis for a number of software projects: EISPACK, a number of linear algebra routines in IMSL, and the F chapters of NAG
IBM 370/195; pipelined architecture; out-of-order execution; 2.5 Mflop/s
Paper by S. Reddaway on
massive bit-level parallelism
EISPACK available; 150 installations; EISPACK Users' Guide; 5 versions (IBM, CDC, Univac, Honeywell, and PDP) distributed free via Argonne Code Center
Mike Flynn’s paper on
architectural taxonomy
ARPANET; 37 computers connected
Metcalfe's Ethernet
NATS Project (1970)
♦ National Activity for Testing Software (NSF, Argonne, Texas and Stanford)
♦ Project to explore the problems of testing, certifying, disseminating and maintaining quality math software
- First EISPACK, later FUNPACK
- Influenced other "PACK"s: ELLPACK, FISHPACK, ITPACK, MINPACK, ODEPACK, QUADPACK, SPARSPAK, ROSEPACK, TOOLPACK, TESTPACK, LINPACK, LAPACK, ScaLAPACK . . .
♦ Key attributes of math software: Reliability, Robustness, Structure, Usability, and Validity
EISPACK Under Development
♦ Algol versions of the algorithms translated into Fortran
- Restructured to avoid underflow
- Format programs in a unified fashion
- Field test sites
- At ANL: B. Smith, W. Cowell, J. Boyle, B. Garbow, V. Klema, JD + Moler and Ikebe
♦ 1971 U of Michigan Summer Conference: Jim Wilkinson's algorithms & Cleve Moler's software
♦ 1972 Software released via Argonne Code Center: 5 versions (IBM, CDC, Univac, PDP, Honeywell)
♦ Software certified in the sense that reports of poor or incorrect performance "would gain the immediate attention from the developers"
1970's continued: Vector Processing in use
Level 1 BLAS
activity started
by community;
Purdue 5/74
Intel 8080
Cray 1, model for vector computing; 160 Mflop/s peak
1st issue of
Trans on Math
Software
DEC VAX 11/780; super mini; 0.14 Mflop/s; 4.3 GB virtual memory
Fortran 77
John Cocke
designs 801
First RISC proc
ICL DAP delivered
to QMC, London
Vaxination of
Groups & Depts
PORT Lib
Bell Labs
2nd LINPACK meeting at ANL: lay the groundwork and hammer out what was and was not to be included in the package. Proposal submitted to NSF.
EISPACK
2nd edition of
User’s Guide
LINPACK
test software
developed
& sent
EISPACK 2nd
release
IEEE arithmetic standard meetings; paper by Palmer on Intel standard for floating point
LINPACK software released; sent to NESC and IMSL for distribution
Level 1 BLAS
published/released
LINPACK Users' Guide; appendix: 17 machines, PDP-10 to Cray-1
LINPACK
meeting in
summer ANL
SLATEC
DOE Labs
Microsoft founded
Basic Linear Algebra Subprograms
♦ BLAS 1973-1977: Lawson, Hanson, Kincaid, Krogh
♦ Consensus on:
- Names
- Calling sequences
- Functional descriptions
- Low-level linear algebra operations
♦ A design tool / software for numerical linear algebra
♦ Improve readability and aid documentation
♦ Aid modularity and maintenance, and improve robustness of software calling the BLAS
♦ Improve portability, without sacrificing efficiency, through standardization
♦ Success results from
- Extensive public involvement
- Careful consideration of implications
♦ Vector operations and vector computers
♦ LINPACK in the works
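To make the Level-1 BLAS idea concrete, here is a minimal sketch, in Python/NumPy rather than the original Fortran, of the kind of low-level vector operations the standard names; axpy and dot below are stand-ins for the Fortran routines (SAXPY/DAXPY, SDOT/DDOT), not the actual BLAS interface.

import numpy as np

def axpy(alpha, x, y):
    # y <- alpha*x + y, the Level-1 BLAS AXPY operation
    return alpha * x + y

def dot(x, y):
    # inner product x^T y, the Level-1 BLAS DOT operation
    return float(np.dot(x, y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
print(axpy(2.0, x, y))   # [ 6.  9. 12.]
print(dot(x, y))         # 32.0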
Argonne National Lab, Summer 1976
The Accidental Benchmarker
♦ Appendix B of the LINPACK Users' Guide
- Designed to help users extrapolate execution time for the LINPACK software package
♦ First benchmark report from 1977; Cray 1 to DEC PDP-10
Dense matrices
Linear systems
Least squares problems
Singular values
1980's
Total
computers in
use in the US
exceeds 1M
IBM introduces the PC; Intel 8088/DOS
BBN Butterfly delivered
Iliac IV decommissioned
Steve Chen’s group at
Cray produces X-MP
First Denelcor HEP
installed (.21 Mflop/s)
Vector Computing (Parallel Processing Beginnings)
Total computers in use in the US exceeds 10M
DARPA starts Strategic Computing Initiative; helps fund Thinking Machines, BBN, WARP
Cosmic Cube hypercube running at Caltech; John Palmer, after seeing the Caltech machine, leaves Intel to found Ncube
FPS delivers FPS-164; start of mini-supercomputer market
CDC
introduces
Cyber 205
SGI founded by Jim
Clark and others
Loop unrolling at outer level for data locality and parallelism; amounts to matrix-vector operations
Cuppen’s method for
Symmetric Eigenvalue D&C
published
Sun Microsys, Convex,
and Alliant founded
Encore, Sequent, TMC,
SCS, Myrias founded
Cray 2 introduced
NEC SX-1 and SX-2,
Fujitsu ships VP-200
ETA System spun off from CDC
Golub & Van Loan published
SPARSPAK
Sequent
1981 Oxford Householder VIII Meeting
1980's continued
NSFNET; 5000 computers;
56 kb/s lines
Level 2 BLAS activity started; Purdue, SIAM
MathWorks founded
EISPACK third release
Netlib origins 1/3/84
IEEE Standard 754 for floating point
IBM delivers 3090 vector; 16 Mflop/s LINPACK, 138 peak
TMC demos CM1 to DARPA
Intel produces first iPSC/1 hypercube; 80286s connected via Ethernet controllers
IBM begins RP2 project
Intel Scientific Computers started by J. Rattner; produces commercial hypercube
Cray X-MP 1 processor, 21
Mflop/s Linpack
Multiflow founded
by J. Fisher; VLIW
architecture
Apple introduces Mac &
IBM introduces PC AT
IJK paper in
SIAM Review
Fujitsu VP-400; NEC SX-
2; Cray 2; Convex C1
Ncube/10; 0.1 Mflop/s LINPACK, 1 processor
FPS-264; 5.9 Mflop/s LINPACK, 38 peak
Stellar (Poduska), Ardent (Michels), and Supertek Computer founded
Denelcor closes doors
[Diagram: SMP with several processors on a shared bus to memory]
Multiflow
Netlib: Mathematical Software & More
Over 400M served
♦ Began in 1984
- JD and Eric Grosse
- Motivated by need for cost-effective, timely distribution of high-quality mathematical software to the community
♦ One of the first open source software collections
♦ Designed to send, by return electronic mail, requested items
♦ Automatic mechanism for electronic dissemination of freely available software
- Still in use and growing
- Mirrored at 9 sites around the world
♦ Moderated collection / distributed maintenance
♦ NA-DIGEST and NA-Net
- Gene Golub, Mark Kent and Cleve Moler
1980's continued
(Lost Decade for Parallel Software)
ETA Systems family of supercomputers
Sun Microsystems
introduces its first RISC WS
IBM invests in
Steve Chen’s SSI
Cray Y-MP
First NA-DIGEST
Level 3 BLAS work begun
LAPACK prospectus: development of a LA library for HPC
Stellar and Ardent merge,
forming Stardent
Ncube 2nd generation machine
# of computers in US
exceeds 50M
Blocked partitioned
algorithms pursued
Kahan's Turing Award
IBM and MIPS release
first RISC WS
# of computers in US
exceeds 30M
TMC ships CM-1; 64K
1 bit processors
AMT delivers first re-
engineered DAP
Intel produces
iPSC/2
S. Cray leaves Cray
Research to form Cray
Computer
Stellar and Ardent begin
delivering single user
graphics workstations
Level 2 BLAS paper
published
ETA out of business
Intel 80486 and i860; 1M transistors; i860: RISC & 64-bit floating point
ITPACK
Computers with Lots of Memory Hierarchy
♦ Bandwidth and Latency are the critical issues, not FLOPS
“Got Bandwidth”?
[Chart: processor-memory performance gap, 1980-2004; µProc performance improves ~60%/yr (2x/1.5 yr) while DRAM improves ~9%/yr (2x/10 yrs); the gap grows ~50%/year]
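A rough way to see the "got bandwidth?" point on any machine is to measure how fast memory can actually be streamed. The sketch below is an illustrative Python/NumPy measurement, not from the original slides, and the numbers depend entirely on the machine it runs on.

import time
import numpy as np

n = 50_000_000                        # ~0.4 GB per array, far larger than any cache
x = np.ones(n)
y = np.ones(n)

t0 = time.perf_counter()
np.add(x, y, out=y)                   # read x, read y, write y: roughly 3 * 8 * n bytes moved
elapsed = time.perf_counter() - t0
print(f"effective memory bandwidth ~ {3 * 8 * n / elapsed / 1e9:.1f} GB/s")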
1990's
Internet
World Wide Web
Motorola introduces 68040
Stardent to sell business and close
Cray C-90
Kendall Square Research delivers 32
processor KSR-1
TMC produces CM-200 and announces CM-5 MIMD computer
DEC announces the Alpha
TMC produces the first CM-5
Torvalds’ Linux
Newsgroup
posting
Workshop to consider a message passing standard; beginnings of MPI. Initial MPI draft by Dongarra, Hempel, Hey, and Walker; a community effort
NEC ships SX-3; First Japanese
parallel vector supercomputer
Alliant delivers
FX/2800 based
on i860
IBM announces RS/6000 family; has FMA instruction
Intel hypercube based on i860 chip; 128 processors
Fujitsu VP-2600
PVM project started
Level 3 BLAS published
Fortran 90
LAPACK software
released & Users’
Guide published
PBLAS for DM
LAPACK
♦ Linear Algebra library in Fortran 77
- Solution of systems of equations
- Solution of eigenvalue problems
♦ Combine algorithms from LINPACK and EISPACK into a single package
♦ Blocked partitioned algorithms
- Efficient on a wide range of computers: RISC, vector, SMPs
♦ User interface similar to LINPACK
- Single, Double, Complex, Double Complex
♦ Built on the Level 1, 2, and 3 BLAS
♦ SW used in HP-48G up to CRAY T-90
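As an illustration of the LINPACK-style interface LAPACK kept, the sketch below (Python calling SciPy's LAPACK-backed routines; not part of the original slides) factors a matrix once with a blocked LU (DGETRF underneath) and then solves with the stored factors (DGETRS).

import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

lu, piv = lu_factor(A)            # blocked LU with partial pivoting (DGETRF)
x = lu_solve((lu, piv), b)        # triangular solves with the stored factors (DGETRS)

print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # small relative residual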
1990's: For HPC Everything is Parallel (The MPI Rut)
ScaLAPACK prototype software released; first portable library for distributed memory machines
Intel, TMC and workstations using PVM
Intel Pentium systems start to ship
Internet: 34M users
Nintendo 64: more computing power than a Cray 1 and much, much better graphics
MPI-2 Finished
Fortran 95
Issues of parallel and
numerical stability
"New" algorithms:
Chaotic iteration
Sparse LU w/o pivoting
Pipeline HQR
Graph partitioning
Algorithmic bombardment
NCSA Mosaic v1 released
PVM 3.0
available
MPI-1 Finished
Templates
project
DSM architectures
Beowulf clustering
comes about
ARPACK
Super LU
PhiPAC
MUMPS
Octave
available
Recursive blocked algorithms
DOE ASCI
Program started
ScaLAPACK
♦ Library of software dealing with dense & banded routines
♦ Distributed memory - message passing
- First library for this type of architecture
♦ MIMD computers and networks of workstations
♦ Clusters of SMPs
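ScaLAPACK's dense routines rest on a 2D block-cyclic data distribution over a process grid. The following Python sketch, an illustration rather than ScaLAPACK code, shows which process in a Pr x Pc grid owns each nb x nb block of an n x n matrix.

def block_cyclic_owner(n, nb, Pr, Pc):
    # entry [I][J] gives the (process row, process col) owning global block (I, J)
    nblocks = (n + nb - 1) // nb
    return [[(I % Pr, J % Pc) for J in range(nblocks)] for I in range(nblocks)]

# Example: blocks of an n = 256 matrix with nb = 32 over a 2 x 3 process grid
for row in block_cyclic_owner(256, 32, 2, 3):
    print(row)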
From 100's of Processors to 100,000's of Processors (Parallel Processing on Steroids)
ATLAS
Self
tuning
software
PETSc
“Holy Grail”
64 bit architectures
Japanese ES 5 X
faster than US fastest
K Goto’s BLAS
Sony, Toshiba, IBM Cell
>200 Gflop/s (16 Gflop/s DP)
Floating point for the sake of games
Recursive blocked algorithms; Elmroth et al
LAPACK95
Trilinos
Multi-core chips
Tightly-Coupled Heterogeneous
System 2009
IBM BG/L; 131,072 processors
Eigen templates
Hypre
BeBOp
BLAST Standard
DARPA HPCS
Program begins
Performance Development, Top500 Data
[Chart: Top500 performance development, 1993-2007, on a log scale from 100 Mflop/s to 1 Pflop/s; in 1993 the sum was 1.17 TF/s, N=1 was 59.7 GF/s, and N=500 was 0.4 GF/s; by June 2007 the sum reached 4.92 PF/s, N=1 281 TF/s, and N=500 4.0 TF/s; milestone machines marked include the Fujitsu 'NWT', Intel ASCI Red, IBM ASCI White, NEC Earth Simulator, and IBM BlueGene/L; 'My Laptop' is shown for scale, and N=500 trails N=1 by roughly 6-8 years]
Solve Ax = b (dense) on your system and measure the performance
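A minimal sketch of that measurement in Python/NumPy follows; the real benchmark is the Fortran/HPL code, but the conventional LINPACK operation count of 2/3 n^3 + 2 n^2 flops is the same.

import time
import numpy as np

def linpack_like(n, seed=0):
    # time a dense solve of Ax = b and report Gflop/s with the LINPACK flop count
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)                      # LU factorization plus solve
    elapsed = time.perf_counter() - t0
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    resid = np.linalg.norm(A @ x - b) / (np.linalg.norm(A) * np.linalg.norm(x))
    return flops / elapsed / 1e9, resid

gflops, resid = linpack_like(2000)
print(f"{gflops:.1f} Gflop/s, scaled residual {resid:.2e}")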
Processors per System - June 2007
[Chart: histogram of the number of Top500 systems by core count, with bins from 33-64 up to 64k-128k cores; vertical axis 0-250 systems]
Usage Based on Top500
[Chart: usage shares based on Top500 data: 54%, 25%, and 18%]
Time to Rethink Software Again
♦ Must rethink the design of our software
- Another disruptive technology
- Similar to what happened with cluster computing and message passing
- Rethink and rewrite the applications, algorithms, and software
♦ Numerical libraries for example will change
- For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this
Lower Voltage
Increase Clock Rate & Transistor Density
Increasing the number of gates into a tight knot and decreasing the cycle time of the processor. We have seen an increasing number of gates on a chip and increasing clock speed. Heat is becoming an unmanageable problem; Intel processors > 100 Watts. We will not see the dramatic increases in clock speeds in the future. However, the number of gates on a chip will continue to increase.
[Diagram: evolution from a single core with cache, to a few cores sharing a cache, to many tiles of four cores each with their own cache]
Power Cost of Frequency
• Power ∝ Voltage² × Frequency (V²F)
• Frequency ∝ Voltage
• Power ∝ Frequency³
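A quick numerical consequence of the relations above, under the idealized cubic model from the slide (a sketch, not a measurement): two cores at 80% of the clock deliver 1.6x the work of one full-speed core for roughly the same power, which is the argument for multi-core.

# Idealized model from the slide: power ~ frequency^3 per core,
# throughput ~ cores x frequency.
def relative_power(cores, freq):
    return cores * freq ** 3

def relative_throughput(cores, freq):
    return cores * freq

base = (1, 1.0)     # one core at full clock
multi = (2, 0.8)    # two cores at 80% clock

print(relative_power(*multi) / relative_power(*base))            # ~1.02x the power
print(relative_throughput(*multi) / relative_throughput(*base))  # 1.6x the throughput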
What's Next?
All Large Core
Mixed Large and Small Core
All Small Core
Many Small Cores + SRAM
+ 3D Stacked Memory
Many Floating-Point Cores
Different Classes of Processor Chips: Home, Games/Graphics, Business, Scientific
80 Core
• Intel's 80-core chip: 1 Tflop/s, 62 Watts, 1.2 TB/s internal BW
Major Changes to Software
• Must rethink the design of our software
- Another disruptive technology
- Similar to what happened with cluster computing and message passing
- Rethink and rewrite the applications, algorithms, and software
• Numerical libraries for example will change
- For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this
A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA)
Algorithms follow hardware evolution in time:
- LINPACK (70's) (vector operations): relies on Level-1 BLAS operations
- LAPACK (80's) (blocking, cache friendly): relies on Level-3 BLAS operations
- ScaLAPACK (90's) (distributed memory): relies on PBLAS and message passing
- PLASMA (00's), new algorithms (many-core friendly): relies on a DAG/scheduler, block data layout, and some extra kernels
These new algorithms have very low granularity, so they scale very well (multicore, petascale computing), and they remove a lot of the dependencies among the tasks (multicore, distributed computing).
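To illustrate the block data layout these tile algorithms assume, here is a small Python/NumPy sketch (not PLASMA code): the matrix is stored as contiguous nb x nb tiles, and a fine-grained task touches only a few tiles, which is what exposes a DAG of dependencies for the scheduler.

import numpy as np

def to_tiles(A, nb):
    # store the matrix as contiguous nb x nb tiles (block data layout)
    n = A.shape[0]
    assert n % nb == 0
    return {(i, j): np.ascontiguousarray(A[i*nb:(i+1)*nb, j*nb:(j+1)*nb])
            for i in range(n // nb) for j in range(n // nb)}

def from_tiles(tiles, nb, n):
    A = np.empty((n, n))
    for (i, j), T in tiles.items():
        A[i*nb:(i+1)*nb, j*nb:(j+1)*nb] = T
    return A

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
tiles = to_tiles(A, nb=4)
# one fine-grained task: a GEMM that reads two tiles and updates a third
tiles[(1, 1)] -= tiles[(1, 0)] @ tiles[(0, 1)]

B = A.copy()
B[4:8, 4:8] -= A[4:8, 0:4] @ A[0:4, 4:8]
print(np.allclose(from_tiles(tiles, nb=4, n=8), B))   # True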
Steps in the LAPACK LU
- DGETF2 (LAPACK): factor a panel
- DLASWP (LAPACK): backward swap
- DLASWP (LAPACK): forward swap
- DTRSM (BLAS): triangular solve
- DGEMM (BLAS): matrix multiply
- then DGETF2 on the next panel, and so on
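The same sequence can be sketched directly; the Python/NumPy code below is illustrative only (the real code calls DGETF2, DLASWP, DTRSM, and DGEMM), and it applies the row swaps eagerly instead of batching them as DLASWP does.

import numpy as np

def blocked_lu(A, nb=64):
    # Right-looking blocked LU with partial pivoting, mirroring the
    # panel / swap / triangular-solve / update steps of the LAPACK code.
    A = A.copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # "DGETF2": unblocked LU of the panel A[k:, k:k+kb]
        for j in range(k, k + kb):
            p = j + int(np.argmax(np.abs(A[j:, j])))
            if p != j:                          # "DLASWP": row swap (applied eagerly here)
                A[[j, p], :] = A[[p, j], :]
                piv[[j, p]] = piv[[p, j]]
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:k+kb] -= np.outer(A[j+1:, j], A[j, j+1:k+kb])
        if k + kb < n:
            # "DTRSM": triangular solve to form the U block row
            L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
            A[k:k+kb, k+kb:] = np.linalg.solve(L11, A[k:k+kb, k+kb:])
            # "DGEMM": rank-kb update of the trailing submatrix
            A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]
    return A, piv

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
LU, piv = blocked_lu(A, nb=32)
L = np.tril(LU, -1) + np.eye(256)
U = np.triu(LU)
print(np.linalg.norm(L @ U - A[piv]) / np.linalg.norm(A))   # should be ~1e-15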
LU Timing Profile (4 processor system)
1D decomposition on SGI Origin; time for each component:
DGETF2
DLASWP(L)
DLASWP(R)
DTRSM
DGEMM
Threads – no lookahead
Bulk sync phases: DLASWP, DLASWP, DTRSM, DGEMM
Adaptive Lookahead - Dynamic
Event Driven Multithreading
Reorganizing algorithms to use this approach
Fork-Join vs. Dynamic Execution
- Fork-Join: parallel BLAS
- DAG-based: dynamic scheduling; time saved
Experiments on Intel's Quad Core Clovertown with 2 Sockets w/ 8 Threads
[Diagram: execution traces comparing fork-join with DAG-based dynamic execution]
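A toy way to see where the time goes, sketched in Python with made-up task names and durations (and assuming as many workers as there are ready tasks): fork-join waits for the slowest task of every phase, while DAG-driven execution lets a task start as soon as its own inputs are ready.

tasks = {
    # name: (duration, dependencies) -- hypothetical values for illustration
    "panel1": (4, []),
    "swap1a": (1, ["panel1"]),
    "swap1b": (1, ["panel1"]),
    "trsm1a": (2, ["swap1a"]),
    "trsm1b": (2, ["swap1b"]),
    "gemm1a": (3, ["trsm1a"]),
    "gemm1b": (5, ["trsm1b"]),
    "panel2": (4, ["gemm1a"]),    # next panel needs only part of the update
}

phases = [["panel1"], ["swap1a", "swap1b"], ["trsm1a", "trsm1b"],
          ["gemm1a", "gemm1b"], ["panel2"]]

# Fork-join: a barrier after every phase, so each phase pays for its slowest task.
fork_join = sum(max(tasks[t][0] for t in phase) for phase in phases)

# Dynamic (DAG-based): a task starts when its own predecessors finish,
# so the makespan is the critical path through the DAG.
finish = {}
def done(t):
    if t not in finish:
        dur, deps = tasks[t]
        finish[t] = dur + max((done(d) for d in deps), default=0)
    return finish[t]

dynamic = max(done(t) for t in tasks)
print(f"fork-join makespan: {fork_join}, dynamic makespan: {dynamic}")   # 16 vs 14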
With the Hype on Cell & PS3 We Became Interested
♦ The PlayStation 3's CPU is based on a "Cell" processor
♦ Each Cell contains a Power PC processor and 8 SPEs (an SPE is a processing unit; SPE: SPU + DMA engine)
- An SPE is a self-contained vector processor which acts independently from the others
- 4-way SIMD floating point units capable of a total of 25.6 Gflop/s @ 3.2 GHz (SPE ~ 25 Gflop/s peak)
- 204.8 Gflop/s peak!
- The catch is that this is for 32 bit floating point (single precision, SP)
- And 64 bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!!
- Divide SP peak by 14: a factor of 2 because of DP and 7 because of latency issues
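The peak figures on the slide follow from simple arithmetic; the sketch below assumes 8 flops per cycle per SPE (4-wide single-precision SIMD with a fused multiply-add), which is consistent with the 25.6 Gflop/s number at 3.2 GHz.

clock_ghz = 3.2
spes = 8
flops_per_cycle_sp = 4 * 2        # 4-way SIMD x fused multiply-add (assumption)

sp_peak_per_spe = clock_ghz * flops_per_cycle_sp     # 25.6 Gflop/s
sp_peak_total = spes * sp_peak_per_spe               # 204.8 Gflop/s
dp_total = sp_peak_total / 14                        # ~14.6 Gflop/s, the slide's factor of 14

print(sp_peak_per_spe, sp_peak_total, round(dp_total, 1))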
Performance of Single Precision on Conventional Processors
• We realized we have a similar situation on our commodity processors
• That is, SP is 2X as fast as DP on many systems
• The Intel Pentium and AMD Opteron have SSE2: 2 flops/cycle DP, 4 flops/cycle SP
• IBM PowerPC has AltiVec: 8 flops/cycle SP, 4 flops/cycle DP (no DP on AltiVec)
• Single precision is faster because: higher parallelism in SSE/vector units, reduced data motion, higher locality in cache

Architecture              Size   SGEMM/DGEMM   Size   SGEMV/DGEMV
AMD Opteron 246           3000   2.00          5000   1.70
UltraSparc-IIe            3000   1.64          5000   1.66
Intel PIII Coppermine     3000   2.03          5000   2.09
PowerPC 970               3000   2.04          5000   1.44
Intel Woodcrest           3000   1.81          5000   2.18
Intel XEON                3000   2.04          5000   1.82
Intel Centrino Duo        3000   2.71          5000   2.21
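A quick way to observe the same ratio on whatever machine is at hand (an illustrative Python/NumPy sketch; matmul dispatches to the SGEMM or DGEMM of whatever BLAS NumPy was built against, so the ratio will vary):

import time
import numpy as np

def gemm_seconds(dtype, n=3000, reps=3):
    # best-of-reps wall time for an n x n matrix multiply in the given precision
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n)).astype(dtype)
    B = rng.standard_normal((n, n)).astype(dtype)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        C = A @ B                     # SGEMM or DGEMM underneath
        best = min(best, time.perf_counter() - t0)
    return best

print("SGEMM/DGEMM speed ratio:",
      round(gemm_seconds(np.float64) / gemm_seconds(np.float32), 2))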
32 or 64 bit Floating Point Precision?
• A long time ago 32 bit floating point was used
- Still used in scientific apps but limited
• Most apps use 64 bit floating point
- Accumulation of round-off error: a 10 TFlop/s computer running for 4 hours performs > 1 Exaflop (10^18) ops
- Ill conditioned problems
- IEEE SP exponent bits too few (8 bits, 10^±38)
- Critical sections need higher precision
- Sometimes need extended precision (128 bit fl pt)
- However some can get by with 32 bit fl pt in some parts
• Mixed precision a possibility
- Approximate in lower precision and then refine or improve solution to high precision
Idea Goes Something Like This…
• Exploit 32 bit floating point as much as possible
- Especially for the bulk of the computation
• Correct or update the solution with selective use of 64 bit floating point to provide a refined result
• Intuitively:
- Compute a 32 bit result,
- Calculate a correction to the 32 bit result using selected higher precision and,
- Perform the update of the 32 bit result with the correction using high precision.
Mixed-Precision Iterative Refinement
♦ Iterative refinement for dense systems, Ax = b, can work this way:

L U = lu(A)                     SINGLE   O(n^3)
x = L\(U\b)                     SINGLE   O(n^2)
r = b - Ax                      DOUBLE   O(n^2)
WHILE || r || not small enough
   z = L\(U\r)                  SINGLE   O(n^2)
   x = x + z                    DOUBLE   O(n^1)
   r = b - Ax                   DOUBLE   O(n^2)
END

- Wilkinson, Moler, Stewart, & Higham provide error bound for SP fl pt results when using DP fl pt.
- It can be shown that using this approach we can compute the solution to 64-bit floating point precision.
- Requires extra storage; the total is 1.5 times normal.
- O(n^3) work is done in lower precision; O(n^2) work is done in high precision.
- Problems if the matrix is ill-conditioned in SP, with condition number around O(10^8).
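The algorithm above translates almost line-for-line into the following Python/SciPy sketch (illustrative only; LAPACK's DSGESV does the same thing in Fortran): the O(n^3) factorization and triangular solves run in single precision, while residuals and updates are accumulated in double.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    A32 = A.astype(np.float32)
    lu, piv = lu_factor(A32)                                   # SINGLE, O(n^3)
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                          # DOUBLE residual, O(n^2)
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = lu_solve((lu, piv), r.astype(np.float32))          # SINGLE, O(n^2)
        x = x + z.astype(np.float64)                           # DOUBLE update
    return x

# quick check on a well-conditioned random system (hypothetical sizes)
rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant, so well conditioned
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # double-precision-level residual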
Results for Mixed Precision Iterative Refinement for Dense Ax = b
Architecture (BLAS):
1. Intel Pentium III Coppermine (Goto)
2. Intel Pentium III Katmai (Goto)
3. Sun UltraSPARC IIe (Sunperf)
4. Intel Pentium IV Prescott (Goto)
5. Intel Pentium IV-M Northwood (Goto)
6. AMD Opteron (Goto)
7. Cray X1 (libsci)
8. IBM Power PC G5 (2.7 GHz) (VecLib)
9. Compaq Alpha EV6 (CXML)
10. IBM SP Power3 (ESSL)
11. SGI Octane (ATLAS)
• Single precision is faster than DP because:
- Higher parallelism within vector units: 4 ops/cycle (usually) instead of 2 ops/cycle
- Reduced data motion: 32 bit data instead of 64 bit data
- Higher locality in cache: more data items in cache
Architecture (BLAS-MPI)            # procs      n      DP Solve/SP Solve   DP Solve/Iter Ref   # iter
AMD Opteron (Goto - OpenMPI MX)       32       22627        1.85                1.79              6
AMD Opteron (Goto - OpenMPI MX)       64       32000        1.90                1.83              6
Quadruple Precision
Intel Xeon 3.2 GHz; reference implementation of the quad precision BLAS; accuracy: 10^-32

   n    Quad Precision Ax = b, time (s)    Iter. Refine. DP to QP, time (s)    Speedup
  100         0.29                                0.03                            9.5
  200         2.27                                0.10                           20.9
  300         7.61                                0.24                           30.5
  400        17.8                                 0.44                           40.4
  500        34.7                                 0.69                           49.7
  600        60.1                                 1.01                           59.0
  700        94.9                                 1.38                           68.7
  800       141.                                  1.83                           77.3
  900       201.                                  2.33                           86.3
 1000       276.                                  2.92                           94.8

No more than 3 steps of iterative refinement are needed.
♦ Variable precision factorization (with say < 32 bit precision) plus 64 bit refinement produces 64 bit accuracy
MUMPS and Iterative Refinement
[Bar chart: speedup over DP for single-precision solve and for iterative refinement with MUMPS on an AMD Opteron processor w/ Intel compiler; matrices from Tim Davis's collection, n = 100K - 3M; speedup axis from 0 to 2]
What about the Cell?
• Power PC at 3.2 GHz
- DGEMM at 5 Gflop/s
- AltiVec peak at 25.6 Gflop/s
- Achieved 10 Gflop/s SGEMM
• 8 SPUs
- 204.8 Gflop/s peak!
- The catch is that this is for 32 bit floating point (single precision, SP)
- And 64 bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!!
- Divide SP peak by 14: a factor of 2 because of DP and 7 because of latency issues
IBM Cell 3.2 GHz, Ax = b
[Chart: Gflop/s vs. matrix size (0-4500); SP peak 204 Gflop/s, DP peak 15 Gflop/s; curves for SP Ax=b, DP Ax=b, mixed-precision DSGESV, and a reference line for 8 SGEMMs (embarrassingly parallel); times shown: SP Ax=b .30 secs, DP Ax=b 3.9 secs, DSGESV .47 secs, an 8.3X speedup over DP]
Intriguing Potential
♦ Exploit lower precision as much as possible
- Payoff in performance: faster floating point, less data to move
♦ Automatically switch between SP and DP to match the desired accuracy
- Compute solution in SP and then a correction to the solution in DP
♦ Potential for GPU, FPGA, special purpose processors
- What about 16 bit floating point?
- Use as little as you can get away with and improve the accuracy
♦ Applies to sparse direct and iterative linear systems and to eigenvalue and optimization problems, where Newton's method is used.
Correction = A\(b - Ax)
Conclusions
♦ For the last few decades or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
♦ This strategy needs to be rebalanced; barriers to progress are increasingly on the software side.
♦ Moreover, the return on investment is more favorable to software: hardware has a half-life measured in years, while software has a half-life measured in decades.
• The high performance ecosystem is out of balance
- Hardware, OS, compilers, software, algorithms, applications
- No Moore's Law for software, algorithms and applications
Collaborators / Support
Alfredo Buttari, UTK; Julien Langou, U Colorado
Julie Langou, UTK; Piotr Luszczek, MathWorks
Jakub Kurzak, UTK; Stan Tomov, UTK