TRANSCRIPT
Another Rebirth of Mathematical Software for Linear Algebra Problems, A Personal Perspective
Symposium: The Birth of Numerical Analysis
K.U. Leuven, Belgium, October 29-30, 2007
Jack Dongarra,
University of Tennessee, Oak Ridge National Laboratory, and University of Manchester
A Growth-Factor of a Trillion in Performance in the Past 60 Years
[Chart: growth from one OPS through KiloOPS, MegaOPS, GigaOPS, TeraOPS, to PetaOPS; milestones include the 1943 Harvard Mark 1, 1948 Manchester Baby, 1949 EDSAC, 1951 Pilot ACE, 1959 IBM 7094, 1964 CDC 6600, 1976 Cray 1, 1982 Cray X-MP, 1988 Cray Y-MP, 1991 Intel Delta, 1996 T3E, 1997 ASCI Red, 2001 Earth Simulator, 2003 Cray X1, and 2005 IBM BG/L]
Scalar to superscalar to vector to SMP to DMP to massively parallel to multi-core designs
AEC urges manufacturers to look at "radical new" machine structures; this leads to CDC Star-100, TI ASC, and Illiac IV
Virtual memory from
U Manchester,
T. Kilburn
DEC ships first PDP-8
and IBM ships 360
First CS PhD U of Penn.
Richard Wexelblat
DARPA contract with U of I to build the Illiac IV
ARPANET work begins; 4 computers connected (UCSB, UCLA, SRI, and U of Utah)
Unix developed by Thompson and Ritchie
Gordon Moore: "# of gates per chip 2x / 18 months"
Wilkinson's Rounding Errors in Algebraic Processes published
1960's: Serial Computers; Algorithms and the Beginning of Math Software
IBM Stretch
delivered to LANL
CDC 6600; S. Cray's design; functional parallelism, leading to RISC (3x faster than IBM Stretch)
Fortran 66
Forsythe & Moler published; Fortran, Algol, and PL/I
Wilkinson’s The
Algebraic
Eigenvalue
Problem published
CDC 7600; pipelined architecture; 3.1 Mflop/s
Gatlinburg I
Strassen's algorithm for matrix multiply
Golub & Kahan
SVD paper
BCSLIB
Harwell Subroutine Library
1970's: Math SW Ideas Forming; Microprocessors & Parallel Beginning
NATS project conceived; concept of certified math software and the process involved with production
Purdue Math Software Symposium
IMSL founded
Intel 4004; 60 Kop/s
Cray Research founded
Intel 8008
1/4-size Illiac IV installed at NASA Ames; 15 Mflop/s achieved; 64 procs.
BLAS report in SIGNUM; Lawson, Hanson, & Krogh
Alan George on
Nested Dissection
Wilkinson's Turing Award
NAG project
begins
Handbook for Automatic Computation, Volume II: landmark in the development of numerical algorithms and software; basis for a number of software projects: EISPACK, a number of linear algebra routines in IMSL, and the F chapters of NAG
IBM 370/195; pipelined architecture; out-of-order execution; 2.5 Mflop/s
Paper by S. Reddaway on
massive bit-level parallelism
EISPACK available; 150 installations; EISPACK Users' Guide; 5 versions (IBM, CDC, Univac, Honeywell, and PDP) distributed free via Argonne Code Center
Mike Flynn’s paper on
architectural taxonomy
ARPANET; 37 computers connected
Metcalfe's Ethernet
NATS Project (1970)
♦ National Activity for Testing Software (NSF, Argonne, Texas and Stanford)
♦ Project to explore the problems of testing, certifying, disseminating and maintaining quality math software
- First EISPACK, later FUNPACK
- Influenced other "PACK"s: ELLPACK, FISHPACK, ITPACK, MINPACK, ODEPACK, QUADPACK, SPARSPAK, ROSEPACK, TOOLPACK, TESTPACK, LINPACK, LAPACK, ScaLAPACK . . .
♦ Key attributes of math software: Reliability, Robustness, Structure, Usability, and Validity
EISPACK Under Development
♦ Algol versions of the algorithms translated into Fortran
- Restructured to avoid underflow
- Format programs in a unified fashion
- Field test sites
- At ANL: B. Smith, W. Cowell, J. Boyle, B. Garbow, V. Klema, JD + Moler and Ikebe
♦ 1971 U of Michigan Summer Conference: Jim Wilkinson's algorithms & Cleve Moler's software
♦ 1972 Software released via Argonne Code Center: 5 versions (IBM, CDC, Univac, PDP, Honeywell)
♦ Software certified in the sense that reports of poor or incorrect performance "would gain the immediate attention from the developers"
1970's continued: Vector Processing in use
Level 1 BLAS
activity started
by community;
Purdue 5/74
Intel 8080
Cray 1, model for vector computing; 160 Mflop/s peak
1st issue of
Trans on Math
Software
DEC VAX 11/780; super mini; 0.14 Mflop/s; 4.3 GB virtual memory
Fortran 77
John Cocke
designs 801
First RISC proc
ICL DAP delivered
to QMC, London
Vaxination of
Groups & Depts
PORT Lib
Bell Labs
2nd LINPACK meeting at ANL: lay the groundwork and hammer out what was and was not to be included in the package. Proposal submitted to NSF.
EISPACK
2nd edition of
User’s Guide
LINPACK
test software
developed
& sent
EISPACK 2nd
release
IEEE arithmetic standard meetings; paper by Palmer on Intel standard for floating point
LINPACK software released; sent to NESC and IMSL for distribution
Level 1 BLAS
published/released
LINPACK Users' Guide; appendix: 17 machines, PDP-10 to Cray-1
LINPACK
meeting in
summer ANL
SLATEC
DOE Labs
Microsoft founded
Basic Linear Algebra Subprograms
♦ BLAS 1973-1977: Lawson, Hanson, Kincaid, Krogh
♦ Consensus on:
- Names
- Calling sequences
- Functional descriptions
- Low-level linear algebra operations
♦ A design tool / software for numerical linear algebra
♦ Improve readability and aid documentation
♦ Aid modularity and maintenance, and improve robustness of software calling the BLAS
♦ Improve portability, without sacrificing efficiency, through standardization
♦ Success results from
- Extensive public involvement
- Careful consideration of implications
♦ Vector operations and vector computers
♦ LINPACK in the works
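To make the Level-1 BLAS idea concrete, here is a minimal sketch, in Python/NumPy rather than the original Fortran, of the kind of low-level vector operations the standard names; axpy and dot below are stand-ins for the Fortran routines (SAXPY/DAXPY, SDOT/DDOT), not the actual BLAS interface.

import numpy as np

def axpy(alpha, x, y):
    # y <- alpha*x + y, the Level-1 BLAS AXPY operation
    return alpha * x + y

def dot(x, y):
    # inner product x^T y, the Level-1 BLAS DOT operation
    return float(np.dot(x, y))

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
print(axpy(2.0, x, y))   # [ 6.  9. 12.]
print(dot(x, y))         # 32.0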
Argonne National Lab, Summer 1976
The Accidental Benchmarker
♦ Appendix B of the LINPACK Users' Guide
- Designed to help users extrapolate execution time for the LINPACK software package
♦ First benchmark report from 1977; Cray 1 to DEC PDP-10
Dense matrices
Linear systems
Least squares problems
Singular values
1980's
Total
computers in
use in the US
exceeds 1M
IBM introduces the PC; Intel 8088/DOS
BBN Butterfly delivered
Iliac IV decommissioned
Steve Chen’s group at
Cray produces X-MP
First Denelcor HEP
installed (.21 Mflop/s)
Vector Computing (Parallel Processing Beginnings)
Total computers in use in the US exceeds 10M
DARPA starts Strategic Computing Initiative; helps fund Thinking Machines, BBN, WARP
Cosmic Cube hypercube running at Caltech; John Palmer, after seeing the Caltech machine, leaves Intel to found Ncube
FPS delivers FPS-164; start of mini-supercomputer market
CDC
introduces
Cyber 205
SGI founded by Jim
Clark and others
Loop unrolling at outer level for data locality and parallelism; amounts to matrix-vector operations
Cuppen’s method for
Symmetric Eigenvalue D&C
published
Sun Microsys, Convex,
and Alliant founded
Encore, Sequent, TMC,
SCS, Myrias founded
Cray 2 introduced
NEC SX-1 and SX-2,
Fujitsu ships VP-200
ETA System spun off from CDC
Golub & Van Loan published
SPARSPAK
Sequent
1981 Oxford Householder VIII Meeting
1980's continued
NSFNET; 5000 computers;
56 kb/s lines
Level 2 BLAS activity started; Purdue, SIAM
MathWorks founded
EISPACK third release
Netlib origins 1/3/84
IEEE Standard 754 for floating point
IBM delivers 3090 vector; 16 Mflop/s LINPACK, 138 peak
TMC demos CM1 to DARPA
Intel produces first iPSC/1 hypercube; 80286s connected via Ethernet controllers
IBM begins RP2 project
Intel Scientific Computers started by J. Rattner; produces commercial hypercube
Cray X-MP 1 processor, 21
Mflop/s Linpack
Multiflow founded
by J. Fisher; VLIW
architecture
Apple introduces Mac &
IBM introduces PC AT
IJK paper in
SIAM Review
Fujitsu VP-400; NEC SX-
2; Cray 2; Convex C1
Ncube/10; 0.1 Mflop/s LINPACK, 1 processor
FPS-264; 5.9 Mflop/s LINPACK, 38 peak
Stellar (Poduska), Ardent (Michels), and Supertek Computer founded
Denelcor closes doors
[Diagram: SMP with several processors on a shared bus to memory]
Multiflow
Netlib: Mathematical Software & More
Over 400M served
♦ Began in 1984
- JD and Eric Grosse
- Motivated by need for cost-effective, timely distribution of high-quality mathematical software to the community
♦ One of the first open source software collections
♦ Designed to send, by return electronic mail, requested items
♦ Automatic mechanism for electronic dissemination of freely available software
- Still in use and growing
- Mirrored at 9 sites around the world
♦ Moderated collection / distributed maintenance
♦ NA-DIGEST and NA-Net
- Gene Golub, Mark Kent and Cleve Moler
1980's continued
(Lost Decade for Parallel Software)
ETA Systems family of supercomputers
Sun Microsystems
introduces its first RISC WS
IBM invests in
Steve Chen’s SSI
Cray Y-MP
First NA-DIGEST
Level 3 BLAS work begun
LAPACK prospectus: development of a LA library for HPC
Stellar and Ardent merge,
forming Stardent
Ncube 2nd generation machine
# of computers in US
exceeds 50M
Blocked partitioned
algorithms pursued
Kahan's Turing Award
IBM and MIPS release
first RISC WS
# of computers in US
exceeds 30M
TMC ships CM-1; 64K
1 bit processors
AMT delivers first re-
engineered DAP
Intel produces
iPSC/2
S. Cray leaves Cray
Research to form Cray
Computer
Stellar and Ardent begin
delivering single user
graphics workstations
Level 2 BLAS paper
published
ETA out of business
Intel 80486 and i860; 1M transistors; i860: RISC & 64-bit floating point
ITPACK
Computers with Lots of Memory Hierarchy
♦ Bandwidth and Latency are the critical issues, not FLOPS
“Got Bandwidth”?
[Chart: processor-memory performance gap, 1980-2004; µProc performance improves ~60%/yr (2x/1.5 yr) while DRAM improves ~9%/yr (2x/10 yrs); the gap grows ~50%/year]
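A rough way to see the "got bandwidth?" point on any machine is to measure how fast memory can actually be streamed. The sketch below is an illustrative Python/NumPy measurement, not from the original slides, and the numbers depend entirely on the machine it runs on.

import time
import numpy as np

n = 50_000_000                        # ~0.4 GB per array, far larger than any cache
x = np.ones(n)
y = np.ones(n)

t0 = time.perf_counter()
np.add(x, y, out=y)                   # read x, read y, write y: roughly 3 * 8 * n bytes moved
elapsed = time.perf_counter() - t0
print(f"effective memory bandwidth ~ {3 * 8 * n / elapsed / 1e9:.1f} GB/s")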
1990's
Internet
World Wide Web
Motorola introduces 68040
Stardent to sell business and close
Cray C-90
Kendall Square Research delivers 32
processor KSR-1
TMC produces CM-200 and announces CM-5 MIMD computer
DEC announces the Alpha
TMC produces the first CM-5
Torvalds’ Linux
Newsgroup
posting
Workshop to consider a message passing standard; beginnings of MPI. Initial MPI draft by Dongarra, Hempel, Hey, and Walker; a community effort
NEC ships SX-3; First Japanese
parallel vector supercomputer
Alliant delivers
FX/2800 based
on i860
IBM announces RS/6000 family; has FMA instruction
Intel hypercube based on i860 chip; 128 processors
Fujitsu VP-2600
PVM project started
Level 3 BLAS published
Fortran 90
LAPACK software
released & Users’
Guide published
PBLAS for DM
LAPACK
♦ Linear Algebra library in Fortran 77
- Solution of systems of equations
- Solution of eigenvalue problems
♦ Combine algorithms from LINPACK and EISPACK into a single package
♦ Blocked partitioned algorithms
- Efficient on a wide range of computers: RISC, vector, SMPs
♦ User interface similar to LINPACK
- Single, Double, Complex, Double Complex
♦ Built on the Level 1, 2, and 3 BLAS
♦ SW used in HP-48G up to CRAY T-90
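As an illustration of the LINPACK-style interface LAPACK kept, the sketch below (Python calling SciPy's LAPACK-backed routines; not part of the original slides) factors a matrix once with a blocked LU (DGETRF underneath) and then solves with the stored factors (DGETRS).

import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(0)
n = 1000
A = rng.standard_normal((n, n))
b = rng.standard_normal(n)

lu, piv = lu_factor(A)            # blocked LU with partial pivoting (DGETRF)
x = lu_solve((lu, piv), b)        # triangular solves with the stored factors (DGETRS)

print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # small relative residual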
1990's: For HPC Everything is Parallel (The MPI Rut)
ScaLAPACK prototype software released; first portable library for distributed memory machines
Intel, TMC and workstations using PVM
Intel Pentium systems start to ship
Internet: 34M users
Nintendo 64: more computing power than a Cray 1 and much, much better graphics
MPI-2 Finished
Fortran 95
Issues of parallel and
numerical stability
"New" algorithms:
Chaotic iteration
Sparse LU w/o pivoting
Pipeline HQR
Graph partitioning
Algorithmic bombardment
NCSA Mosaic v1 released
PVM 3.0
available
MPI-1 Finished
Templates
project
DSM architectures
Beowulf clustering
comes about
ARPACK
Super LU
PhiPAC
MUMPS
Octave
available
Recursive blocked algorithms
DOE ASCI
Program started
ScaLAPACK
♦ Library of software dealing with dense & banded routines
♦ Distributed memory - message passing
- First library for this type of architecture
♦ MIMD computers and networks of workstations
♦ Clusters of SMPs
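ScaLAPACK's dense routines rest on a 2D block-cyclic data distribution over a process grid. The following Python sketch, an illustration rather than ScaLAPACK code, shows which process in a Pr x Pc grid owns each nb x nb block of an n x n matrix.

def block_cyclic_owner(n, nb, Pr, Pc):
    # entry [I][J] gives the (process row, process col) owning global block (I, J)
    nblocks = (n + nb - 1) // nb
    return [[(I % Pr, J % Pc) for J in range(nblocks)] for I in range(nblocks)]

# Example: blocks of an n = 256 matrix with nb = 32 over a 2 x 3 process grid
for row in block_cyclic_owner(256, 32, 2, 3):
    print(row)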
From 100's of Processors to 100,000's of Processors (Parallel Processing on Steroids)
ATLAS
Self
tuning
software
PETSc
“Holy Grail”
64 bit architectures
Japanese ES 5 X
faster than US fastest
K Goto’s BLAS
Sony, Toshiba, IBM Cell
>200 Gflop/s (16 Gflop/s DP)
Floating point for the sake of games
Recursive blocked algorithms; Elmroth et al
LAPACK95
Trilinos
Multi-core chips
Tightly-Coupled Heterogeneous
System 2009
IBM BG/L; 131,072 processors
Eigen templates
Hypre
BeBOp
BLAST Standard
DARPA HPCS
Program begins
Performance Development, Top500 Data
[Chart: Top500 performance development, 1993-2007, on a log scale from 100 Mflop/s to 1 Pflop/s; in 1993 the sum was 1.17 TF/s, N=1 was 59.7 GF/s, and N=500 was 0.4 GF/s; by June 2007 the sum reached 4.92 PF/s, N=1 281 TF/s, and N=500 4.0 TF/s; milestone machines marked include the Fujitsu 'NWT', Intel ASCI Red, IBM ASCI White, NEC Earth Simulator, and IBM BlueGene/L; 'My Laptop' is shown for scale, and N=500 trails N=1 by roughly 6-8 years]
Solve Ax = b (dense) on your system and measure the performance
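A minimal sketch of that measurement in Python/NumPy follows; the real benchmark is the Fortran/HPL code, but the conventional LINPACK operation count of 2/3 n^3 + 2 n^2 flops is the same.

import time
import numpy as np

def linpack_like(n, seed=0):
    # time a dense solve of Ax = b and report Gflop/s with the LINPACK flop count
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, n))
    b = rng.standard_normal(n)
    t0 = time.perf_counter()
    x = np.linalg.solve(A, b)                      # LU factorization plus solve
    elapsed = time.perf_counter() - t0
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    resid = np.linalg.norm(A @ x - b) / (np.linalg.norm(A) * np.linalg.norm(x))
    return flops / elapsed / 1e9, resid

gflops, resid = linpack_like(2000)
print(f"{gflops:.1f} Gflop/s, scaled residual {resid:.2e}")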
Processors per System - June 2007
[Chart: histogram of the number of Top500 systems by core count, with bins from 33-64 up to 64k-128k cores; vertical axis 0-250 systems]
Usage Based on Top500
[Chart: usage shares based on Top500 data: 54%, 25%, and 18%]
Time to Rethink Software Again
♦ Must rethink the design of our software
- Another disruptive technology
- Similar to what happened with cluster computing and message passing
- Rethink and rewrite the applications, algorithms, and software
♦ Numerical libraries for example will change
- For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this
Lower Voltage
Increase Clock Rate & Transistor Density
Increasing the number of gates into a tight knot and decreasing the cycle time of the processor. We have seen an increasing number of gates on a chip and increasing clock speed. Heat is becoming an unmanageable problem; Intel processors > 100 Watts. We will not see the dramatic increases in clock speeds in the future. However, the number of gates on a chip will continue to increase.
[Diagram: evolution from a single core with cache, to a few cores sharing a cache, to many tiles of four cores each with their own cache]
Power Cost of Frequency
• Power ∝ Voltage² × Frequency (V²F)
• Frequency ∝ Voltage
• Power ∝ Frequency³
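A quick numerical consequence of the relations above, under the idealized cubic model from the slide (a sketch, not a measurement): two cores at 80% of the clock deliver 1.6x the work of one full-speed core for roughly the same power, which is the argument for multi-core.

# Idealized model from the slide: power ~ frequency^3 per core,
# throughput ~ cores x frequency.
def relative_power(cores, freq):
    return cores * freq ** 3

def relative_throughput(cores, freq):
    return cores * freq

base = (1, 1.0)     # one core at full clock
multi = (2, 0.8)    # two cores at 80% clock

print(relative_power(*multi) / relative_power(*base))            # ~1.02x the power
print(relative_throughput(*multi) / relative_throughput(*base))  # 1.6x the throughput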
What's Next?
All Large Core
Mixed Large and Small Core
All Small Core
Many Small Cores + SRAM
+ 3D Stacked Memory
Many Floating-Point Cores
Different Classes of Processor Chips: Home, Games/Graphics, Business, Scientific
80 Core
• Intel's 80-core chip: 1 Tflop/s, 62 Watts, 1.2 TB/s internal BW
Major Changes to Software
• Must rethink the design of our software
- Another disruptive technology
- Similar to what happened with cluster computing and message passing
- Rethink and rewrite the applications, algorithms, and software
• Numerical libraries for example will change
- For example, both LAPACK and ScaLAPACK will undergo major changes to accommodate this
A New Generation of Software: Parallel Linear Algebra Software for Multicore Architectures (PLASMA)
Algorithms follow hardware evolution in time:
- LINPACK (70's) (vector operations): relies on Level-1 BLAS operations
- LAPACK (80's) (blocking, cache friendly): relies on Level-3 BLAS operations
- ScaLAPACK (90's) (distributed memory): relies on PBLAS and message passing
- PLASMA (00's), new algorithms (many-core friendly): relies on a DAG/scheduler, block data layout, and some extra kernels
These new algorithms have very low granularity, so they scale very well (multicore, petascale computing), and they remove a lot of the dependencies among the tasks (multicore, distributed computing).
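To illustrate the block data layout these tile algorithms assume, here is a small Python/NumPy sketch (not PLASMA code): the matrix is stored as contiguous nb x nb tiles, and a fine-grained task touches only a few tiles, which is what exposes a DAG of dependencies for the scheduler.

import numpy as np

def to_tiles(A, nb):
    # store the matrix as contiguous nb x nb tiles (block data layout)
    n = A.shape[0]
    assert n % nb == 0
    return {(i, j): np.ascontiguousarray(A[i*nb:(i+1)*nb, j*nb:(j+1)*nb])
            for i in range(n // nb) for j in range(n // nb)}

def from_tiles(tiles, nb, n):
    A = np.empty((n, n))
    for (i, j), T in tiles.items():
        A[i*nb:(i+1)*nb, j*nb:(j+1)*nb] = T
    return A

rng = np.random.default_rng(0)
A = rng.standard_normal((8, 8))
tiles = to_tiles(A, nb=4)
# one fine-grained task: a GEMM that reads two tiles and updates a third
tiles[(1, 1)] -= tiles[(1, 0)] @ tiles[(0, 1)]

B = A.copy()
B[4:8, 4:8] -= A[4:8, 0:4] @ A[0:4, 4:8]
print(np.allclose(from_tiles(tiles, nb=4, n=8), B))   # True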
Steps in the LAPACK LU
- DGETF2 (LAPACK): factor a panel
- DLASWP (LAPACK): backward swap
- DLASWP (LAPACK): forward swap
- DTRSM (BLAS): triangular solve
- DGEMM (BLAS): matrix multiply
- then DGETF2 on the next panel, and so on
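The same sequence can be sketched directly; the Python/NumPy code below is illustrative only (the real code calls DGETF2, DLASWP, DTRSM, and DGEMM), and it applies the row swaps eagerly instead of batching them as DLASWP does.

import numpy as np

def blocked_lu(A, nb=64):
    # Right-looking blocked LU with partial pivoting, mirroring the
    # panel / swap / triangular-solve / update steps of the LAPACK code.
    A = A.copy()
    n = A.shape[0]
    piv = np.arange(n)
    for k in range(0, n, nb):
        kb = min(nb, n - k)
        # "DGETF2": unblocked LU of the panel A[k:, k:k+kb]
        for j in range(k, k + kb):
            p = j + int(np.argmax(np.abs(A[j:, j])))
            if p != j:                          # "DLASWP": row swap (applied eagerly here)
                A[[j, p], :] = A[[p, j], :]
                piv[[j, p]] = piv[[p, j]]
            A[j+1:, j] /= A[j, j]
            A[j+1:, j+1:k+kb] -= np.outer(A[j+1:, j], A[j, j+1:k+kb])
        if k + kb < n:
            # "DTRSM": triangular solve to form the U block row
            L11 = np.tril(A[k:k+kb, k:k+kb], -1) + np.eye(kb)
            A[k:k+kb, k+kb:] = np.linalg.solve(L11, A[k:k+kb, k+kb:])
            # "DGEMM": rank-kb update of the trailing submatrix
            A[k+kb:, k+kb:] -= A[k+kb:, k:k+kb] @ A[k:k+kb, k+kb:]
    return A, piv

rng = np.random.default_rng(0)
A = rng.standard_normal((256, 256))
LU, piv = blocked_lu(A, nb=32)
L = np.tril(LU, -1) + np.eye(256)
U = np.triu(LU)
print(np.linalg.norm(L @ U - A[piv]) / np.linalg.norm(A))   # should be ~1e-15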
LU Timing Profile (4 processor system)
1D decomposition on SGI Origin; time for each component:
DGETF2
DLASWP(L)
DLASWP(R)
DTRSM
DGEMM
Threads – no lookahead
Bulk sync phases: DLASWP, DLASWP, DTRSM, DGEMM
Adaptive Lookahead - Dynamic
Event Driven Multithreading
Reorganizing algorithms to use this approach
Fork-Join vs. Dynamic Execution
- Fork-Join: parallel BLAS
- DAG-based: dynamic scheduling; time saved
Experiments on Intel's Quad Core Clovertown with 2 Sockets w/ 8 Threads
[Diagram: execution traces comparing fork-join with DAG-based dynamic execution]
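A toy way to see where the time goes, sketched in Python with made-up task names and durations (and assuming as many workers as there are ready tasks): fork-join waits for the slowest task of every phase, while DAG-driven execution lets a task start as soon as its own inputs are ready.

tasks = {
    # name: (duration, dependencies) -- hypothetical values for illustration
    "panel1": (4, []),
    "swap1a": (1, ["panel1"]),
    "swap1b": (1, ["panel1"]),
    "trsm1a": (2, ["swap1a"]),
    "trsm1b": (2, ["swap1b"]),
    "gemm1a": (3, ["trsm1a"]),
    "gemm1b": (5, ["trsm1b"]),
    "panel2": (4, ["gemm1a"]),    # next panel needs only part of the update
}

phases = [["panel1"], ["swap1a", "swap1b"], ["trsm1a", "trsm1b"],
          ["gemm1a", "gemm1b"], ["panel2"]]

# Fork-join: a barrier after every phase, so each phase pays for its slowest task.
fork_join = sum(max(tasks[t][0] for t in phase) for phase in phases)

# Dynamic (DAG-based): a task starts when its own predecessors finish,
# so the makespan is the critical path through the DAG.
finish = {}
def done(t):
    if t not in finish:
        dur, deps = tasks[t]
        finish[t] = dur + max((done(d) for d in deps), default=0)
    return finish[t]

dynamic = max(done(t) for t in tasks)
print(f"fork-join makespan: {fork_join}, dynamic makespan: {dynamic}")   # 16 vs 14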
With the Hype on Cell & PS3 We Became Interested
♦ The PlayStation 3's CPU is based on a "Cell" processor
♦ Each Cell contains a Power PC processor and 8 SPEs (an SPE is a processing unit; SPE: SPU + DMA engine)
- An SPE is a self-contained vector processor which acts independently from the others
- 4-way SIMD floating point units capable of a total of 25.6 Gflop/s @ 3.2 GHz (SPE ~ 25 Gflop/s peak)
- 204.8 Gflop/s peak!
- The catch is that this is for 32 bit floating point (single precision, SP)
- And 64 bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!!
- Divide SP peak by 14: a factor of 2 because of DP and 7 because of latency issues
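The peak figures on the slide follow from simple arithmetic; the sketch below assumes 8 flops per cycle per SPE (4-wide single-precision SIMD with a fused multiply-add), which is consistent with the 25.6 Gflop/s number at 3.2 GHz.

clock_ghz = 3.2
spes = 8
flops_per_cycle_sp = 4 * 2        # 4-way SIMD x fused multiply-add (assumption)

sp_peak_per_spe = clock_ghz * flops_per_cycle_sp     # 25.6 Gflop/s
sp_peak_total = spes * sp_peak_per_spe               # 204.8 Gflop/s
dp_total = sp_peak_total / 14                        # ~14.6 Gflop/s, the slide's factor of 14

print(sp_peak_per_spe, sp_peak_total, round(dp_total, 1))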
Performance of Single Precision on Conventional Processors
• We realized we have a similar situation on our commodity processors
• That is, SP is 2X as fast as DP on many systems
• The Intel Pentium and AMD Opteron have SSE2: 2 flops/cycle DP, 4 flops/cycle SP
• IBM PowerPC has AltiVec: 8 flops/cycle SP, 4 flops/cycle DP (no DP on AltiVec)
• Single precision is faster because: higher parallelism in SSE/vector units, reduced data motion, higher locality in cache

Architecture              Size   SGEMM/DGEMM   Size   SGEMV/DGEMV
AMD Opteron 246           3000   2.00          5000   1.70
UltraSparc-IIe            3000   1.64          5000   1.66
Intel PIII Coppermine     3000   2.03          5000   2.09
PowerPC 970               3000   2.04          5000   1.44
Intel Woodcrest           3000   1.81          5000   2.18
Intel XEON                3000   2.04          5000   1.82
Intel Centrino Duo        3000   2.71          5000   2.21
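A quick way to observe the same ratio on whatever machine is at hand (an illustrative Python/NumPy sketch; matmul dispatches to the SGEMM or DGEMM of whatever BLAS NumPy was built against, so the ratio will vary):

import time
import numpy as np

def gemm_seconds(dtype, n=3000, reps=3):
    # best-of-reps wall time for an n x n matrix multiply in the given precision
    rng = np.random.default_rng(0)
    A = rng.standard_normal((n, n)).astype(dtype)
    B = rng.standard_normal((n, n)).astype(dtype)
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        C = A @ B                     # SGEMM or DGEMM underneath
        best = min(best, time.perf_counter() - t0)
    return best

print("SGEMM/DGEMM speed ratio:",
      round(gemm_seconds(np.float64) / gemm_seconds(np.float32), 2))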
32 or 64 bit Floating Point Precision?
• A long time ago 32 bit floating point was used
- Still used in scientific apps but limited
• Most apps use 64 bit floating point
- Accumulation of round-off error: a 10 TFlop/s computer running for 4 hours performs > 1 Exaflop (10^18) ops
- Ill conditioned problems
- IEEE SP exponent bits too few (8 bits, 10^±38)
- Critical sections need higher precision
- Sometimes need extended precision (128 bit fl pt)
- However some can get by with 32 bit fl pt in some parts
• Mixed precision a possibility
- Approximate in lower precision and then refine or improve solution to high precision
Idea Goes Something Like This…
• Exploit 32 bit floating point as much as possible
- Especially for the bulk of the computation
• Correct or update the solution with selective use of 64 bit floating point to provide a refined result
• Intuitively:
- Compute a 32 bit result,
- Calculate a correction to the 32 bit result using selected higher precision and,
- Perform the update of the 32 bit result with the correction using high precision.
Mixed-Precision Iterative Refinement
♦ Iterative refinement for dense systems, Ax = b, can work this way:

L U = lu(A)                     SINGLE   O(n^3)
x = L\(U\b)                     SINGLE   O(n^2)
r = b - Ax                      DOUBLE   O(n^2)
WHILE || r || not small enough
   z = L\(U\r)                  SINGLE   O(n^2)
   x = x + z                    DOUBLE   O(n^1)
   r = b - Ax                   DOUBLE   O(n^2)
END

- Wilkinson, Moler, Stewart, & Higham provide error bound for SP fl pt results when using DP fl pt.
- It can be shown that using this approach we can compute the solution to 64-bit floating point precision.
- Requires extra storage; the total is 1.5 times normal.
- O(n^3) work is done in lower precision; O(n^2) work is done in high precision.
- Problems if the matrix is ill-conditioned in SP, with condition number around O(10^8).
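The algorithm above translates almost line-for-line into the following Python/SciPy sketch (illustrative only; LAPACK's DSGESV does the same thing in Fortran): the O(n^3) factorization and triangular solves run in single precision, while residuals and updates are accumulated in double.

import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, tol=1e-12, max_iter=30):
    A32 = A.astype(np.float32)
    lu, piv = lu_factor(A32)                                   # SINGLE, O(n^3)
    x = lu_solve((lu, piv), b.astype(np.float32)).astype(np.float64)
    for _ in range(max_iter):
        r = b - A @ x                                          # DOUBLE residual, O(n^2)
        if np.linalg.norm(r) <= tol * np.linalg.norm(b):
            break
        z = lu_solve((lu, piv), r.astype(np.float32))          # SINGLE, O(n^2)
        x = x + z.astype(np.float64)                           # DOUBLE update
    return x

# quick check on a well-conditioned random system (hypothetical sizes)
rng = np.random.default_rng(0)
n = 500
A = rng.standard_normal((n, n)) + n * np.eye(n)   # diagonally dominant, so well conditioned
b = rng.standard_normal(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b) / np.linalg.norm(b))   # double-precision-level residual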
Results for Mixed Precision Iterative Refinement for Dense Ax = b
Architecture (BLAS):
1. Intel Pentium III Coppermine (Goto)
2. Intel Pentium III Katmai (Goto)
3. Sun UltraSPARC IIe (Sunperf)
4. Intel Pentium IV Prescott (Goto)
5. Intel Pentium IV-M Northwood (Goto)
6. AMD Opteron (Goto)
7. Cray X1 (libsci)
8. IBM Power PC G5 (2.7 GHz) (VecLib)
9. Compaq Alpha EV6 (CXML)
10. IBM SP Power3 (ESSL)
11. SGI Octane (ATLAS)
• Single precision is faster than DP because:
- Higher parallelism within vector units: 4 ops/cycle (usually) instead of 2 ops/cycle
- Reduced data motion: 32 bit data instead of 64 bit data
- Higher locality in cache: more data items in cache
Architecture (BLAS-MPI)            # procs      n      DP Solve/SP Solve   DP Solve/Iter Ref   # iter
AMD Opteron (Goto - OpenMPI MX)       32       22627        1.85                1.79              6
AMD Opteron (Goto - OpenMPI MX)       64       32000        1.90                1.83              6
Quadruple Precision
Intel Xeon 3.2 GHz; reference implementation of the quad precision BLAS; accuracy: 10^-32

   n    Quad Precision Ax = b, time (s)    Iter. Refine. DP to QP, time (s)    Speedup
  100         0.29                                0.03                            9.5
  200         2.27                                0.10                           20.9
  300         7.61                                0.24                           30.5
  400        17.8                                 0.44                           40.4
  500        34.7                                 0.69                           49.7
  600        60.1                                 1.01                           59.0
  700        94.9                                 1.38                           68.7
  800       141.                                  1.83                           77.3
  900       201.                                  2.33                           86.3
 1000       276.                                  2.92                           94.8

No more than 3 steps of iterative refinement are needed.
♦ Variable precision factorization (with say < 32 bit precision) plus 64 bit refinement produces 64 bit accuracy
MUMPS and Iterative Refinement
[Bar chart: speedup over DP for single-precision solve and for iterative refinement with MUMPS on an AMD Opteron processor w/ Intel compiler; matrices from Tim Davis's collection, n = 100K - 3M; speedup axis from 0 to 2]
What about the Cell?
• Power PC at 3.2 GHz
- DGEMM at 5 Gflop/s
- AltiVec peak at 25.6 Gflop/s
- Achieved 10 Gflop/s SGEMM
• 8 SPUs
- 204.8 Gflop/s peak!
- The catch is that this is for 32 bit floating point (single precision, SP)
- And 64 bit floating point runs at 14.6 Gflop/s total for all 8 SPEs!!
- Divide SP peak by 14: a factor of 2 because of DP and 7 because of latency issues
IBM Cell 3.2 GHz, Ax = b
[Chart: Gflop/s vs. matrix size (0-4500); SP peak 204 Gflop/s, DP peak 15 Gflop/s; curves for SP Ax=b, DP Ax=b, mixed-precision DSGESV, and a reference line for 8 SGEMMs (embarrassingly parallel); times shown: SP Ax=b .30 secs, DP Ax=b 3.9 secs, DSGESV .47 secs, an 8.3X speedup over DP]
Intriguing Potential
♦ Exploit lower precision as much as possible
- Payoff in performance: faster floating point, less data to move
♦ Automatically switch between SP and DP to match the desired accuracy
- Compute solution in SP and then a correction to the solution in DP
♦ Potential for GPU, FPGA, special purpose processors
- What about 16 bit floating point?
- Use as little as you can get away with and improve the accuracy
♦ Applies to sparse direct and iterative linear systems and to eigenvalue and optimization problems, where Newton's method is used.
Correction = A\(b - Ax)
Conclusions
♦ For the last few decades or more, the research investment strategy has been overwhelmingly biased in favor of hardware.
♦ This strategy needs to be rebalanced; barriers to progress are increasingly on the software side.
♦ Moreover, the return on investment is more favorable to software: hardware has a half-life measured in years, while software has a half-life measured in decades.
• The high performance ecosystem is out of balance
- Hardware, OS, compilers, software, algorithms, applications
- No Moore's Law for software, algorithms and applications
Collaborators / Support
Alfredo Buttari, UTK; Julien Langou, U Colorado
Julie Langou, UTK; Piotr Luszczek, MathWorks
Jakub Kurzak, UTK; Stan Tomov, UTK