CS267 – Lecture 14: Automatic Performance Tuning and Sparse-Matrix-Vector Multiplication (SpMV)


  • CS267 Lecture 14

    Automatic Performance Tuning and Sparse-Matrix-Vector Multiplication (SpMV), James Demmel

    www.cs.berkeley.edu/~demmel/cs267_Spr13

  • Outline
    Motivation for Automatic Performance Tuning
    Results for sparse matrix kernels
    OSKI = Optimized Sparse Kernel Interface
    pOSKI for multicore
    Tuning Higher Level Algorithms
    Future Work, Class Projects

    BeBOP: Berkeley Benchmarking and Optimization Group
    Many results shown from current and former members
    Meet weekly, M 1-2, in 380 Soda

  • Motivation for Automatic Performance Tuning
    Writing high performance software is hard
    Make programming easier while getting high speed
    Ideal: program in your favorite high level language (Matlab, Python, PETSc) and get a high fraction of peak performance
    Reality: the best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler, ...
    Best choice can depend on knowing a lot of applied mathematics and computer science
    How much of this can we teach? How much of this can we automate?

  • Examples of Automatic Performance Tuning (1)
    Dense BLAS (sequential): PHiPAC (UCB), then ATLAS (UTK) (used in Matlab), math-atlas.sourceforge.net; internal vendor tools
    Fast Fourier Transform (FFT) & variations (sequential and parallel): FFTW (MIT), www.fftw.org
    Digital Signal Processing: SPIRAL, www.spiral.net (CMU)
    Communication Collectives (UCB, UTK)
    Rose (LLNL), Bernoulli (Cornell), Telescoping Languages (Rice), ...
    More projects, conferences, government reports, ...

  • Examples of Automatic Performance Tuning (2)
    What do dense BLAS, FFTs, signal processing, MPI reductions have in common?
    Can do the tuning off-line: once per architecture, algorithm
    Can take as much time as necessary (hours, a week)
    At run-time, algorithm choice may depend only on a few parameters (matrix dimension, size of FFT, etc.)

  • Tuning Register Tile Sizes (Dense Matrix Multiply), 333 MHz Sun Ultra 2i

    2-D slice of 3-D space; implementations color-coded by performance in Mflop/s

    16 registers, but a 2-by-3 tile size is fastest: a needle in a haystack

  • Example: Select a Matmul Implementation

  • Example: Support Vector Classification

  • Machine Learning in Automatic Performance Tuning
    References:
    Statistical Models for Empirical Search-Based Performance Tuning, Richard Vuduc, J. Demmel, and Jeff A. Bilmes, International Journal of High Performance Computing Applications, 18 (1), pp. 65-94, February 2004.
    Predicting and Optimizing System Utilization and Performance via Statistical Machine Learning, Archana Ganapathi, Computer Science PhD Thesis, University of California, Berkeley, UCB//EECS-2009-181.

  • Examples of Automatic Performance Tuning (3)
    What do dense BLAS, FFTs, signal processing, MPI reductions have in common?
    Can do the tuning off-line: once per architecture, algorithm
    Can take as much time as necessary (hours, a week)
    At run-time, algorithm choice may depend only on a few parameters (matrix dimension, size of FFT, etc.)
    Can't always do off-line tuning:
    Algorithm and implementation may strongly depend on data known only at run-time
    Ex: the sparse matrix nonzero pattern determines both the best data structure and the best implementation of sparse-matrix-vector multiplication (SpMV)
    Part of the search for the best algorithm must be done (very quickly!) at run-time

  • Source: Accelerator Cavity Design Problem (Ko via Husbands)

  • Linear Programming Matrix

  • A Sparse Matrix You Encounter Every Day

  • SpMV with Compressed Sparse Row (CSR) Storage
    Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)*x(j)

    for each row i
      for k = ptr[i] to ptr[i+1]-1 do
        y[i] = y[i] + val[k]*x[ind[k]]
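
    As a concrete reference, here is a minimal, self-contained C sketch of the CSR kernel above; the struct and function names (csr_t, spmv_csr) are illustrative, not from OSKI.

        #include <stddef.h>

        /* Compressed Sparse Row: row i's nonzeros are val[ptr[i] .. ptr[i+1]-1],
           with column indices in ind[] over the same range. */
        typedef struct {
            int n;          /* number of rows */
            int *ptr;       /* row pointers, length n+1 */
            int *ind;       /* column indices, length nnz */
            double *val;    /* nonzero values, length nnz */
        } csr_t;

        /* y = y + A*x: 2 flops per stored nonzero */
        void spmv_csr(const csr_t *A, const double *x, double *y)
        {
            for (int i = 0; i < A->n; i++) {
                double yi = y[i];
                for (int k = A->ptr[i]; k < A->ptr[i+1]; k++)
                    yi += A->val[k] * x[A->ind[k]];
                y[i] = yi;
            }
        }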

  • Example: The Difficulty of Tuning
    n = 21200, nnz = 1.5 M, kernel: SpMV

    Source: NASA structural analysis problem

  • Example: The Difficulty of Tuning
    n = 21200, nnz = 1.5 M, kernel: SpMV

    Source: NASA structural analysis problem (8x8 dense substructure)

  • Taking advantage of block structure in SpMV
    Bottleneck is the time to get the matrix from memory: only 2 flops for each nonzero in the matrix
    Don't store each nonzero with its own index; instead store each nonzero r-by-c block with one index
    Storage drops by up to 2x if r*c >> 1 (all 32-bit quantities), so the time to fetch the matrix from memory decreases
    Change both data structure and algorithm: need to pick r and c, and change the algorithm accordingly
    In the example, is r=c=8 the best choice? It minimizes storage, so it looks like a good idea
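
    A minimal sketch of the r-by-c register-blocked idea (BCSR), hard-coded to 2x2 blocks for readability; the names are illustrative, explicit zeros are assumed to have been filled in when the blocks were formed, and the number of rows is assumed even (pad otherwise).

        /* BCSR with fixed 2x2 blocks: block row I covers rows 2I and 2I+1.
           bval stores blocks contiguously (4 doubles each, row-major);
           bind holds the first column index of each block. */
        void spmv_bcsr_2x2(int n_block_rows, const int *bptr, const int *bind,
                           const double *bval, const double *x, double *y)
        {
            for (int I = 0; I < n_block_rows; I++) {
                double y0 = 0.0, y1 = 0.0;              /* registers for 2 rows */
                for (int K = bptr[I]; K < bptr[I+1]; K++) {
                    const double *b = bval + 4*K;       /* the 2x2 block */
                    int j = bind[K];                    /* its first column */
                    y0 += b[0]*x[j] + b[1]*x[j+1];
                    y1 += b[2]*x[j] + b[3]*x[j+1];
                }
                y[2*I]   += y0;
                y[2*I+1] += y1;
            }
        }

    Note that only one index is stored per block instead of one per nonzero, which is where the storage (and memory traffic) savings described above comes from.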

  • Speedups on Itanium 2: The Need for Search (register-profile plot, axes in Mflop/s)

  • Register Profile: Itanium 2 (performance ranges from 190 Mflop/s to 1190 Mflop/s across register block sizes)

  • SpMV Performance (Matrix #2): Generation 1
    Power3 - 13% of peak (100 to 195 Mflop/s), Power4 - 14% (469 to 703 Mflop/s), Itanium 1 - 7% (103 to 225 Mflop/s), Itanium 2 - 31% (276 Mflop/s to 1.1 Gflop/s)

  • Register Profiles: IBM and Intel IA-64
    Power3 - 17% of peak (122 to 252 Mflop/s), Power4 - 16% (459 to 820 Mflop/s), Itanium 1 - 8% (107 to 247 Mflop/s), Itanium 2 - 33% (190 Mflop/s to 1.2 Gflop/s)

  • SpMV Performance (Matrix #2): Generation 2
    Ultra 2i - 9% of peak (35 to 63 Mflop/s), Ultra 3 - 5% (53 to 109 Mflop/s), Pentium III - 19% (42 to 96 Mflop/s), Pentium III-M - 15% (58 to 120 Mflop/s)

  • Register Profiles: Sun and Intel x86
    Ultra 2i - 11% of peak (35 to 72 Mflop/s), Ultra 3 - 5% (50 to 90 Mflop/s), Pentium III - 21% (42 to 108 Mflop/s), Pentium III-M - 15% (58 to 122 Mflop/s)

  • Another example of tuning challenges
    More complicated non-zero structure in general

    N = 16614, NNZ = 1.1 M

  • Zoom in to top corner
    More complicated non-zero structure in general

    N = 16614, NNZ = 1.1 M

  • 3x3 blocks look natural, but ...
    More complicated non-zero structure in general

    Example: 3x3 blocking, logical grid of 3x3 cells

    But would lead to lots of fill-in

  • Extra Work Can Improve Efficiency!
    More complicated non-zero structure in general

    Example: 3x3 blocking, logical grid of 3x3 cells
    Fill in explicit zeros, unroll 3x3 block multiplies
    Fill ratio = 1.5

    On Pentium III: 1.5x speedup! The actual Mflop rate is 1.5^2 = 2.25x higher

  • Automatic Register Block Size Selection
    Selecting the r x c block size:
    Off-line benchmark: precompute Mflops(r,c) using a dense matrix A for each r x c, once per machine/architecture
    Run-time search: sample A to estimate Fill(r,c) for each r x c
    Run-time heuristic model: choose r, c to minimize time ~ Fill(r,c) / Mflops(r,c)
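
    A sketch of the heuristic as stated on the slide: combine the off-line dense benchmark Mflops(r,c) with the run-time fill estimate Fill(r,c) and pick the (r,c) minimizing estimated time per useful flop. The arrays and the 8x8 search space are assumptions for illustration, not the OSKI interface.

        /* Choose (r,c) minimizing  Fill(r,c) / Mflops(r,c).
           mflops[r][c] comes from an off-line dense benchmark;
           fill[r][c] is estimated by sampling the user's matrix at run time. */
        void choose_block_size(const double mflops[8][8], const double fill[8][8],
                               int *best_r, int *best_c)
        {
            double best = 1e300;
            for (int r = 1; r <= 8; r++)
                for (int c = 1; c <= 8; c++) {
                    double t = fill[r-1][c-1] / mflops[r-1][c-1];
                    if (t < best) { best = t; *best_r = r; *best_c = c; }
                }
        }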

  • Accurate and Efficient Adaptive Fill Estimation
    Idea: sample the matrix
    Fraction of matrix to sample: s in [0,1]; cost ~ O(s * nnz)
    Control cost by controlling s
    The search happens at run-time, so the constant matters!
    Control s automatically by computing statistical confidence intervals; idea: monitor the variance
    Cost of tuning:
    Lower bound: converting the matrix costs 5 to 40 unblocked SpMVs
    Heuristic: 1 to 11 SpMVs
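
    A hedged sketch of sampling-based fill estimation for one (r,c): scan a subset of the block rows, count how many r-by-c blocks would be created, and scale. This is a simplification of the adaptive scheme on the slide (it samples every stride-th block row rather than controlling s via confidence intervals).

        #include <stdlib.h>

        /* Estimate Fill(r,c) = (entries stored after r-by-c blocking) / nnz
           by scanning every `stride`-th block row of a CSR matrix. */
        double estimate_fill(int n, int ncols, const int *ptr, const int *ind,
                             int r, int c, int stride)
        {
            int nbc = (ncols + c - 1) / c;          /* number of block columns */
            char *mark = calloc(nbc, 1);
            int  *touched = malloc(nbc * sizeof *touched);
            long blocks = 0, nnz_sampled = 0;

            for (int I = 0; I < n / r; I += stride) {
                int ntouched = 0;
                for (int i = r*I; i < r*(I+1); i++)
                    for (int k = ptr[i]; k < ptr[i+1]; k++) {
                        int J = ind[k] / c;          /* block column hit */
                        nnz_sampled++;
                        if (!mark[J]) { mark[J] = 1; touched[ntouched++] = J; blocks++; }
                    }
                for (int t = 0; t < ntouched; t++)   /* cheap reset of marks */
                    mark[touched[t]] = 0;
            }
            free(mark); free(touched);
            return nnz_sampled ? (double)blocks * r * c / nnz_sampled : 1.0;
        }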

  • Accuracy of the Tuning Heuristics (1/4)
    NOTE: Fair flops used (ops on explicit zeros not counted as work)
    See p. 375 of Vuduc's thesis for the list of matrices

  • Accuracy of the Tuning Heuristics (2/4)

  • Accuracy of the Tuning Heuristics (2/4)DGEMV

  • Upper Bounds on Performance for blocked SpMV
    P = (flops) / (time); flops = 2 * nnz(A)
    Lower bound on time: two main assumptions
    1. Count memory ops only (streaming)
    2. Count only compulsory and capacity misses: ignore conflicts
    Account for line sizes, and for matrix size and nnz
    Charge the minimum access latency a_i at the L_i cache and a_mem at memory, e.g., from the Saavedra-Barrera and PMaC MAPS benchmarks
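
    In symbols, the bound has the following general shape (a hedged restatement of the slide's assumptions, not the exact formula from Vuduc's thesis): only memory operations are charged, with h_i loads served at cache level i at best-case latency a_i, and m misses to DRAM at latency a_mem, counting only compulsory and capacity misses.

        \[
          P_{\max} \;=\; \frac{2\,\mathrm{nnz}(A)}{T_{\min}},
          \qquad
          T_{\min} \;\ge\; \sum_{i=1}^{L} \alpha_i\, h_i \;+\; \alpha_{\mathrm{mem}}\, m .
        \]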

  • Example: L2 Misses on Itanium 2Misses measured using PAPI [Browne 00]

  • Example: Bounds on Itanium 2

  • Example: Bounds on Itanium 2

  • Example: Bounds on Itanium 2

  • Summary of Other Sequential Performance Optimizations
    Optimizations for SpMV:
    Register blocking (RB): up to 4x over CSR
    Variable block splitting: 2.1x over CSR, 1.8x over RB
    Diagonals: 2x over CSR
    Reordering to create dense structure + splitting: 2x over CSR
    Symmetry: 2.8x over CSR, 2.6x over RB
    Cache blocking: 2.8x over CSR
    Multiple vectors (SpMM): 7x over CSR
    And combinations ...
    Sparse triangular solve: hybrid sparse/dense data structure, 1.8x over CSR
    Higher-level kernels:
    AA^T*x, A^T A*x: 4x over CSR, 1.8x over RB
    A^2*x: 2x over CSR, 1.5x over RB
    [Ax, A^2x, A^3x, ..., A^kx]

  • Example: Sparse Triangular Factor
    Raefsky4 (structural problem) + SuperLU + colmmd
    N = 19779, nnz = 12.6 M

  • Cache Optimizations for A*A^T*x
    Cache level: interleave multiplication by A and A^T, so A is only fetched from memory once
    Register level: treat a_i^T as an r-by-c block row, or a diagonal row
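
    The same interleaving idea in code, shown for the companion kernel y = A^T A x (which OSKI also provides, per the later slides): each CSR row a_i is read from memory once and used twice, first in the dot product t = a_i^T x and then in the update y += t*a_i. A hedged sketch reusing the csr_t type from the earlier CSR example, not OSKI's implementation.

        /* y = y + A^T * (A*x), streaming each row of A from memory only once;
           the second pass over the row hits in cache. */
        void ata_x_csr(const csr_t *A, const double *x, double *y)
        {
            for (int i = 0; i < A->n; i++) {
                double t = 0.0;
                for (int k = A->ptr[i]; k < A->ptr[i+1]; k++)   /* t = a_i . x  */
                    t += A->val[k] * x[A->ind[k]];
                for (int k = A->ptr[i]; k < A->ptr[i+1]; k++)   /* y += t * a_i */
                    y[A->ind[k]] += t * A->val[k];
            }
        }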

  • Example: Combining Optimizations (1/2)
    Register blocking, symmetry, multiple (k) vectors
    Three low-level tuning parameters: r, c, v

  • Example: Combining Optimizations (2/2)
    Register blocking, symmetry, and multiple vectors [Ben Lee @ UCB]
    Symmetric, blocked, 1 vector: up to 2.6x over nonsymmetric, blocked, 1 vector
    Symmetric, blocked, k vectors: up to 2.1x over nonsymmetric, blocked, k vectors; up to 7.3x over nonsymmetric, nonblocked, 1 vector
    Symmetric storage: up to 64.7% savings

  • Motif/Dwarf: Common Computational Methods (Red Hot / Blue Cool)
    (Original slide is a heat map of how heavily each motif is used across application areas: Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser; the colors are not reproduced here.)
    1. Finite State Machine
    2. Combinational
    3. Graph Traversal
    4. Structured Grid
    5. Dense Matrix
    6. Sparse Matrix
    7. Spectral (FFT)
    8. Dynamic Programming
    9. N-Body
    10. MapReduce
    11. Backtrack / Branch & Bound
    12. Graphical Models
    13. Unstructured Grid

  • Why so much about SpMV? Contents of the Sparse Motif
    What is sparse linear algebra?
    Direct solvers for Ax=b, least squares: sparse Gaussian elimination, QR for least squares
    How to choose: crd.lbl.gov/~xiaoye/SuperLU/SparseDirectSurvey.pdf
    Iterative solvers for Ax=b, least squares, Ax=λx, SVD: used when SpMV is the only affordable operation on A (Krylov Subspace Methods)
    How to choose: for Ax=b see www.netlib.org/templates/Templates.html; for Ax=λx see www.cs.ucdavis.edu/~bai/ET/contents.html
    What about Multigrid? In the overlap of the sparse and (un)structured grid motifs; details later

  • How to choose an iterative solver - example
    All methods (GMRES, CGS, CG, ...) depend on SpMV (or variations)
    See www.netlib.org/templates/Templates.html for details

  • Potential Impact on Applications: Omega3P
    Application: accelerator cavity design [Ko]
    Relevant optimization techniques: symmetric storage, register blocking, reordering to create more dense blocks
    Reverse Cuthill-McKee ordering to reduce bandwidth: do breadth-first search, number nodes in reverse order visited
    Traveling Salesman Problem-based ordering to create blocks: nodes = columns of A, weight(u,v) = number of nonzeros u and v have in common, tour = ordering of columns, choose the maximum weight tour; see [Pinar & Heath '97]
    2.1x speedup on Power 4

  • Source: Accelerator Cavity Design Problem (Ko via Husbands)

  • Post-RCM Reordering

  • 100x100 Submatrix Along Diagonal

  • Microscopic Effect of RCM Reordering
    Before: Green + Red; After: Green + Blue

  • Microscopic Effect of Combined RCM+TSP Reordering
    Before: Green + Red; After: Green + Blue

  • (Omega3P)

  • How do permutations affect algorithms?
    A = original matrix, A_P = A with permuted rows and columns
    Naive approach: permute x, multiply y = A_P x, permute y
    Faster way to solve Ax=b: write A_P = P^T A P, where P is a permutation matrix; solve A_P x_P = P^T b for x_P using SpMV with A_P, then let x = P x_P
    Only need to permute vectors twice in total, not twice per iteration
    Faster way to solve Ax=λx: A and A_P have the same eigenvalues, so there are no vectors to permute! A_P x_P = λ x_P implies Ax = λx where x = P x_P
    Where else do optimizations change higher level algorithms? More later
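
    Restating the slide's algebra (standard permutation identities, nothing new):

        \[
          A_P = P^{T}AP, \qquad
          Ax=b \;\Longleftrightarrow\; A_P x_P = P^{T}b \ \text{ with } x = P x_P,
        \]
        \[
          A_P x_P = \lambda x_P \;\Longrightarrow\; A(Px_P) = \lambda (Px_P),
        \]

    so A and A_P share eigenvalues, and solution or eigenvector components are recovered with a single permutation at the end rather than one per iteration.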

  • *Tuning SpMV on Multicore

  • *Multicore SMPs UsedAMD Opteron 2356 (Barcelona)Intel Xeon E5345 (Clovertown)IBM QS20 Cell BladeSun T2+ T5140 (Victoria Falls)Source: Sam Williams

  • *Multicore SMPs Used(Conventional cache-based memory hierarchy)AMD Opteron 2356 (Barcelona)Intel Xeon E5345 (Clovertown)Sun T2+ T5140 (Victoria Falls)IBM QS20 Cell BladeSource: Sam Williams

  • *Multicore SMPs Used(Local store-based memory hierarchy)IBM QS20 Cell BladeSun T2+ T5140 (Victoria Falls)AMD Opteron 2356 (Barcelona)Intel Xeon E5345 (Clovertown)Source: Sam Williams

  • *Multicore SMPs Used(CMT = Chip-MultiThreading)AMD Opteron 2356 (Barcelona)Intel Xeon E5345 (Clovertown)IBM QS20 Cell BladeSun T2+ T5140 (Victoria Falls)Source: Sam Williams

  • *Multicore SMPs Used(threads)*SPEs onlySource: Sam Williams

  • *Multicore SMPs Used(Non-Uniform Memory Access - NUMA)AMD Opteron 2356 (Barcelona)Intel Xeon E5345 (Clovertown)IBM QS20 Cell BladeSun T2+ T5140 (Victoria Falls)*SPEs onlySource: Sam Williams

  • *Multicore SMPs Used (peak double precision flops)
    AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown), IBM QS20 Cell Blade, Sun T2+ T5140 (Victoria Falls)
    75 GFlop/s, 74 GFlop/s, 29* GFlop/s, 19 GFlop/s (*SPEs only)
    Source: Sam Williams

  • *Multicore SMPs Used (total DRAM bandwidth)
    AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown), IBM QS20 Cell Blade, Sun T2+ T5140 (Victoria Falls)
    21 GB/s (read) / 10 GB/s (write), 21 GB/s, 51 GB/s, 42 GB/s (read) / 21 GB/s (write) (*SPEs only)
    Source: Sam Williams

  • *Results from Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
    Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

  • *Test matrices
    Suite of 14 matrices, all bigger than the caches of our SMPs; we'll also include a median performance number
    Matrices: Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP
    2K x 2K dense matrix stored in sparse format; well structured (sorted by nonzeros/row); poorly structured hodgepodge; extreme aspect ratio (linear programming)
    Source: Sam Williams

  • SpMV Parallelization
    How do we parallelize a matrix-vector multiplication?
    Source: Sam Williams

  • SpMV Parallelization
    How do we parallelize a matrix-vector multiplication?
    We could parallelize by columns (sparse matrix times dense sub-vector), and in the worst case simplify the random access challenge, but: each thread would need to store a temporary partial sum, and we would need to perform a reduction (inter-thread data dependency)
    (thread 0, thread 1, thread 2, thread 3)
    Source: Sam Williams

  • SpMV Parallelization
    How do we parallelize a matrix-vector multiplication?
    By blocks of rows: no inter-thread data dependencies, but random access to x
    (thread 0, thread 1, thread 2, thread 3)
    Source: Sam Williams

  • *SpMV Performance (simple parallelization)
    Out-of-the-box SpMV performance on a suite of 14 matrices
    Simplest solution = parallelization by rows

    Scalability isn't great. Can we do better?

    (Legend: Naive Pthreads, Naive) Source: Sam Williams

  • Summary of Multicore Optimizations
    NUMA (Non-Uniform Memory Access): pin submatrices to memories close to the cores assigned to them
    Prefetch values, indices, and/or vectors; use exhaustive search on the prefetch distance
    Matrix compression: not just register blocking (BCSR); 32- or 16-bit indices, Block Coordinate format for submatrices
    Cache blocking: 2D partition of the matrix, so the needed parts of x and y fit in cache

    *

  • NUMA (Data Locality for Matrices)
    On NUMA architectures, all large arrays should be partitioned either
    explicitly (multiple malloc()s + affinity), or
    implicitly (parallelize initialization and rely on first touch)
    You cannot partition on granularities smaller than the page size:
    512 elements on x86, 2M elements on Niagara

    For SpMV, partition the matrix and perform multiple malloc()s
    Pin submatrices so they are co-located with the cores tasked to process them
    Source: Sam Williams
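
    A minimal sketch of the implicit (first-touch) option described above, using OpenMP for brevity; the explicit option would instead use one malloc per thread plus an affinity API such as libnuma. This assumes the OS places pages by first touch (the Linux default); the function name is illustrative.

        #include <omp.h>
        #include <stdlib.h>

        /* Allocate and initialize a large array so that each thread's pages are
           first touched, and therefore physically placed, near the core that
           will later use them.  The same static loop partitioning must then be
           used inside the SpMV itself. */
        double *numa_alloc_init(size_t n)
        {
            double *a = malloc(n * sizeof *a);      /* pages not yet placed */
            #pragma omp parallel for schedule(static)
            for (long i = 0; i < (long)n; i++)
                a[i] = 0.0;                         /* first touch places the page */
            return a;
        }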

  • NUMA(Data Locality for Matrices)*Source: Sam Williams

  • Prefetch for SpMV
    Software prefetch injects more MLP (memory-level parallelism) into the memory subsystem; it supplements the hardware prefetchers
    Can try to prefetch the values, the indices, the source vector, or any combination thereof
    In general, should only insert one prefetch per cache line (works best on unrolled code)

    for (all rows) {
      y0 = 0.0; y1 = 0.0; y2 = 0.0; y3 = 0.0;
      for (all tiles in this row) {
        PREFETCH(V + i + PFDistance);      /* software prefetch at a tuned distance */
        y0 += V[i  ] * X[C[i]];            /* the 4 rows of the tile share one column index C[i] */
        y1 += V[i+1] * X[C[i]];
        y2 += V[i+2] * X[C[i]];
        y3 += V[i+3] * X[C[i]];
      }
      y[r+0] = y0; y[r+1] = y1; y[r+2] = y2; y[r+3] = y3;
    }

    Source: Sam Williams

  • SpMV Performance (NUMA and Software Prefetching)
    NUMA-aware allocation is essential on memory-bound NUMA SMPs
    Explicit software prefetching can boost bandwidth and change cache replacement policies
    Cell PPEs are likely latency-limited

    Used exhaustive search for the best prefetch distance
    Source: Sam Williams

  • Matrix Compression
    Goal: minimize memory traffic
    Register blocking: choose the block size to minimize memory traffic; only power-of-2 block sizes, which simplifies the search and achieves most of the possible speedup
    Shorter indices: 32-bit, or 16-bit if possible
    Different sparse matrix formats:
    BCSR (block compressed sparse row): like CSR but with register blocks
    BCOO (block coordinate): stores the row and column index of each register block; better on very sparse sub-blocks (see cache blocking later)
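
    To make the compression options concrete, here is a hedged sketch of what per-thread submatrix storage might look like with 16-bit indices relative to the thread's starting row/column; these structs are illustrative only, not the formats used in the paper's code.

        #include <stdint.h>

        /* Register-blocked CSR (BCSR): one 16-bit column index per r-by-c block;
           indices are relative to the thread's first column, so they fit in 16
           bits as long as the thread's column span is under 65536 blocks. */
        typedef struct {
            int r, c;             /* register block size */
            int n_block_rows;
            int *bptr;            /* block-row pointers */
            uint16_t *bind;       /* relative block-column indices */
            double *bval;         /* r*c values per block, explicit zeros filled */
        } bcsr16_t;

        /* Block coordinate (BCOO): both row and column index stored per block,
           which wins when block rows are mostly empty (very sparse cache blocks). */
        typedef struct {
            int r, c;
            long n_blocks;
            uint16_t *brow, *bcol;  /* relative block row / column indices */
            double *bval;
        } bcoo16_t;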

  • ILP/DLP vs Bandwidth
    In the multicore era, which is the bigger issue: a lack of ILP/DLP (a major advantage of BCSR), or insufficient memory bandwidth per core?

    On many architectures running low arithmetic intensity kernels, there is so little available memory bandwidth per core that you won't notice a complete lack of ILP

    Perhaps we should concentrate on minimizing memory traffic rather than maximizing ILP/DLP

    Rather than benchmarking every combination, just select the register blocking that minimizes the matrix footprint

    *Source: Sam Williams

  • Matrix Compression Strategies
    Register blocking creates small dense tiles: better ILP/DLP, reduced overhead per nonzero

    Let each thread select a unique register blocking
    In this work, we only considered power-of-two register blocks and select the register blocking that minimizes memory traffic
    Source: Sam Williams

  • Matrix Compression Strategies
    Where possible we may encode indices with fewer than 32 bits, and we may also select different matrix formats

    In this work, we considered 16-bit and 32-bit indices (relative to the thread's start) and explored BCSR/BCOO (GCSR in the book chapter)
    Source: Sam Williams

  • SpMV Performance (Matrix Compression)
    After maximizing memory bandwidth, the only hope is to minimize memory traffic
    Compression: exploit register blocking, other formats, smaller indices
    Use a traffic minimization heuristic rather than search
    Benefit is clearly matrix-dependent
    Register blocking enables efficient software prefetching (one per cache line)
    Source: Sam Williams

  • Cache blocking for SpMV (Data Locality for Vectors)
    Store entire submatrices contiguously
    The columns spanned by each cache block are selected to use the same space in cache, i.e., access the same number of x(i)
    TLB blocking is a similar concept, but instead of on 8-byte granularities, it uses 4KB granularities

    (thread 0, thread 1, thread 2, thread 3)
    Source: Sam Williams

  • Cache blocking for SpMV (Data Locality for Vectors)
    Cache-blocking sparse matrices is very different than cache-blocking dense matrices
    Rather than changing loop bounds, store entire submatrices contiguously
    The columns spanned by each cache block are selected so that all submatrices place the same pressure on the cache, i.e., touch the same number of unique source vector cache lines

    TLB blocking is a similar concept, but instead of on 64-byte granularities, it uses 4KB granularities

    Source: Sam Williams

  • *Auto-tuned SpMV Performance (cache and TLB blocking)
    Fully auto-tuned SpMV performance across the suite of matrices
    Why do some optimizations work better on some architectures?

    Matrices with naturally small working sets; architectures with giant caches

    (Legend: +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naive Pthreads, Naive)
    Source: Sam Williams

  • *Auto-tuned SpMV Performance (architecture specific optimizations)
    Fully auto-tuned SpMV performance across the suite of matrices, now including a SPE/local store optimized version
    Why do some optimizations work better on some architectures?
    (Legend: +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naive Pthreads, Naive)
    Source: Sam Williams

  • *Auto-tuned SpMV Performance (max speedup)
    Fully auto-tuned SpMV performance across the suite of matrices, including the SPE/local store optimized version
    Why do some optimizations work better on some architectures?
    Maximum speedups: 2.7x, 4.0x, 2.9x, 35x
    (Legend: +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naive Pthreads, Naive)
    Source: Sam Williams

  • Optimized Sparse Kernel Interface - pOSKI
    Provides sparse kernels automatically tuned for the user's matrix & machine
    BLAS-style functionality: SpMV (Ax and A^T y)
    Hides complexity of run-time tuning

    Based on OSKI (bebop.cs.berkeley.edu/oski), an autotuner for sequential sparse matrix operations: SpMV (Ax and A^T x), A^T A x, solving sparse triangular systems, ...
    So far pOSKI only does multicore optimizations of SpMV (class projects!)
    Up to 4.5x faster SpMV (Ax) on Intel Sandy Bridge E

    On-going work by the Berkeley Benchmarking and Optimization (BeBop) group

  • Optimizations in pOSKI, so far
    Fully automatic heuristics for sparse matrix-vector multiply (Ax, A^T x):
    Register-level blocking, thread-level blocking
    SIMD, software prefetching, software pipelining, loop unrolling
    NUMA-aware allocations

    Plug-in extensibility: very advanced users may write their own heuristics, create new data structures/code variants and dynamically add them to the system

    Other optimizations under development: cache-level blocking, reordering (RCM, TSP), variable block structure, index compression, symmetric storage, etc.

  • How pOSKI Tunes (Overview)
    Library install-time (offline): 1. Build for target architecture; 2. Benchmark generated code variants on a sample dense matrix over register block size (r,c), prefetch distance (d), and SIMD implementation (imp); store the benchmark data and selected code variants
    Application run-time: 1. Partition the workload (user's matrix into per-thread submatrices), using hints from the user, history, and workload from program monitoring; 2. Evaluate models; 3. Select data structure & code (empirical & heuristic search)
    To user: matrix handle for kernel calls

  • How pOSKI Tunes (Overview)
    At library build/install-time:
    Generate code variants: a code generator (Python) generates code variants for various implementations
    Collect benchmark data: measure and record the speed of possible sparse data structures and code variants on the target architecture
    Select the best code variants & benchmark data (prefetch distance, SIMD implementation)
    Installation process uses standard, portable GNU AutoTools
    At run-time:
    Library tunes using heuristic models: models analyze the user's matrix & benchmark data to choose an optimized data structure and code
    User may re-collect benchmark data with the user's sparse matrix (under development)
    Non-trivial tuning cost: up to ~40 mat-vecs; the library limits the time it spends tuning based on the estimated workload, provided by the user or inferred by the library
    User may reduce cost by saving tuning results for the application on future runs with the same or similar matrix (under development)

  • How to Call pOSKI: Basic Usage
    May gradually migrate existing apps: Step 1, wrap existing data structures; Step 2, make BLAS-like kernel calls

    int* ptr = ..., *ind = ...; double* val = ...;   /* Matrix, in CSR format */
    double* x = ..., *y = ...;   /* Let x and y be two dense vectors */

    /* Compute y = y + Ax, 500 times */
    for( i = 0; i < 500; i++ )
      my_matmult( ptr, ind, val, ..., x, b, y );

  • How to Call pOSKI: Basic Usage
    May gradually migrate existing apps: Step 1, wrap existing data structures; Step 2, make BLAS-like kernel calls

    int* ptr = ..., *ind = ...; double* val = ...;   /* Matrix, in CSR format */
    double* x = ..., *y = ...;   /* Let x and y be two dense vectors */

    /* Step 1: Create a default pOSKI thread object */
    poski_threadarg_t *poski_thread = poski_InitThread();

    /* Step 2: Create pOSKI wrappers around this data */
    poski_mat_t A_tunable = poski_CreateMatCSR(ptr, ind, val, nrows, ncols, nnz,
                                               SHARE_INPUTMAT, poski_thread, NULL, ...);
    poski_vec_t x_view = poski_CreateVecView(x, ncols, UNIT_STRIDE, NULL);
    poski_vec_t y_view = poski_CreateVecView(y, nrows, UNIT_STRIDE, NULL);

    /* Compute y = y + Ax, 500 times */
    for( i = 0; i < 500; i++ )
      my_matmult( ptr, ind, val, ..., x, b, y );

  • How to Call pOSKI: Basic Usage
    May gradually migrate existing apps: Step 1, wrap existing data structures; Step 2, make BLAS-like kernel calls

    int* ptr = ..., *ind = ...; double* val = ...;   /* Matrix, in CSR format */
    double* x = ..., *y = ...;   /* Let x and y be two dense vectors */

    /* Step 1: Create a default pOSKI thread object */
    poski_threadarg_t *poski_thread = poski_InitThread();

    /* Step 2: Create pOSKI wrappers around this data */
    poski_mat_t A_tunable = poski_CreateMatCSR(ptr, ind, val, nrows, ncols, nnz,
                                               SHARE_INPUTMAT, poski_thread, NULL, ...);
    poski_vec_t x_view = poski_CreateVecView(x, ncols, UNIT_STRIDE, NULL);
    poski_vec_t y_view = poski_CreateVecView(y, nrows, UNIT_STRIDE, NULL);

    /* Step 3: Compute y = y + Ax, 500 times */
    for( i = 0; i < 500; i++ )
      poski_MatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view);

  • How to Call pOSKI: Tune with Explicit Hints
    User calls the tune routine (optional); may provide explicit tuning hints

    poski_mat_t A_tunable = poski_CreateMatCSR( ... );
    /* ... */

    /* Tell pOSKI we will call SpMV 500 times (workload hint) */
    poski_TuneHint_MatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view, 500);
    /* Tell pOSKI we think the matrix has 8x8 blocks (structural hint) */
    poski_TuneHint_Structure(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);

    /* Ask pOSKI to tune */
    poski_TuneMat(A_tunable);

    for( i = 0; i < 500; i++ )
      poski_MatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view);

  • How to Call pOSKI: Implicit Tuning
    Ask the library to infer the workload (optional); the library profiles all kernel calls and may periodically re-tune

    poski_mat_t A_tunable = poski_CreateMatCSR( ... );
    /* ... */

    for( i = 0; i < 500; i++ ) {
      poski_MatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view);
      poski_TuneMat(A_tunable);   /* Ask pOSKI to tune */
    }

  • How to Call pOSKI: Modify a thread object
    Ask the library to infer thread hints (optional): number of threads, threading model (PthreadPool, Pthread, OpenMP)
    Default: PthreadPool, #threads = #available cores on the system

    poski_threadarg_t *poski_thread = poski_InitThread();
    /* Ask pOSKI to use 8 threads with OpenMP */
    poski_ThreadHints(poski_thread, NULL, OPENMP, 8);
    poski_mat_t A_tunable = poski_CreateMatCSR( ..., poski_thread, ... );
    poski_MatMult( ... );

  • How to Call pOSKI: Modify a partition object
    Ask the library to infer partition hints (optional): number of partitions (#partitions = k * #threads), partitioning model (OneD, SemiOneD, TwoD)
    Default: OneD, #partitions = #threads

    Matrix:
    /* Ask pOSKI to partition into 16 sub-matrices using SemiOneD */
    poski_partitionarg_t *pmat = poski_PartitionMatHints(SemiOneD, 16);
    poski_mat_t A_tunable = poski_CreateMatCSR( ..., pmat, ... );

    Vector:
    /* Ask pOSKI to partition a vector as the SpMV input vector based on A_tunable */
    poski_partitionVec_t *pvec = poski_PartitionVecHints(A_tunable, KERNEL_MatMult, OP_NORMAL, INPUTVEC);
    poski_vec_t x_view = poski_CreateVec( ..., pvec);

  • Performance on Intel Sandy Bridge E
    Jaketown: i7-3960X @ 3.3 GHz; #cores: 6 (2 threads per core), L3: 15MB
    pOSKI SpMV (Ax) with double precision floating-point
    MKL Sparse BLAS Level 2: mkl_dcsrmv()

  • Is tuning SpMV all we can do?
    Iterative methods all depend on it
    But speedups are limited: just 2 flops per nonzero, so communication costs dominate
    Can we beat this bottleneck?
    Need to look at the next level in the stack: what do algorithms that use SpMV do? Can we reorganize them to avoid communication?
    Only way significant speedups will be possible

  • Tuning Higher Level Algorithms than SpMV
    We almost always do many SpMVs, not just one
    Krylov Subspace Methods (KSMs) for Ax=b, Ax=λx: Conjugate Gradients, GMRES, Lanczos, ...
    Do a sequence of k SpMVs to get vectors [x_1, ..., x_k], then find the best solution x as a linear combination of [x_1, ..., x_k]
    Main cost is k SpMVs; since communication usually dominates, can we do better?
    Goal: make the communication cost independent of k
    Parallel case: O(log P) messages, not O(k log P) - optimal, same bandwidth as before
    Sequential case: O(1) messages and bandwidth, not O(k) - optimal
    Achievable when the matrix is partitionable with low surface-to-volume ratio

  • Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, ..., A^kx]
    Replace k iterations of y = Ax with [Ax, A^2x, ..., A^kx]

    Example: A tridiagonal, n=32, k=3 (figure plots the levels x, Ax, A^2x, A^3x over entries 1..32)
    Works for any well-partitioned A

  • Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, ..., A^kx]
    Replace k iterations of y = Ax with [Ax, A^2x, ..., A^kx]
    Sequential Algorithm

    Example: A tridiagonal, n=32, k=3; the levels x, Ax, A^2x, A^3x are computed block by block in Steps 1, 2, 3, 4

  • Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, ..., A^kx]
    Replace k iterations of y = Ax with [Ax, A^2x, ..., A^kx]
    Parallel Algorithm

    Example: A tridiagonal, n=32, k=3, partitioned across Proc 1, Proc 2, Proc 3, Proc 4
    Each processor communicates once with its neighbors, then works on an (overlapping) trapezoid

  • Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, ..., A^kx]
    Same idea works for general sparse matrices
    Simple block-row partitioning -> (hyper)graph partitioning

    Top-to-bottom processing -> Traveling Salesman Problem

  • Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, ..., A^kx]
    (Figure: the tridiagonal example partitioned across Proc 1, Proc 2, Proc 3, Proc 4)

  • Locally Dependent Entries for [x, Ax, ..., A^8x], A tridiagonal, 2 processors
    Can be computed without communication
    k=8 fold reuse of A
    (Proc 1, Proc 2)

  • Remotely Dependent Entries for [x, Ax, ..., A^8x], A tridiagonal, 2 processors
    One message to get the data needed to compute the remotely dependent entries, not k=8 messages

    Minimizes the number of messages = latency cost
    Price: redundant work proportional to the surface/volume ratio
    (Proc 1, Proc 2)
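
    A hedged sketch of this idea for the tridiagonal example: each process fetches a halo of depth k from each neighbor once, then computes all k levels on its extended block with no further communication, redoing some work in the halo. The MPI exchange itself is elided; the names and the assumption that the block is in the interior of the domain (global boundaries handled separately) are illustrative.

        /* Each process owns x[0..n_local-1] plus k ghost entries on each side,
           already filled by ONE exchange with each neighbor.  A is tridiagonal
           with constant stencil (lo, d, up): (A v)(i) = lo*v(i-1)+d*v(i)+up*v(i+1). */
        void matrix_powers_tridiag(int n_local, int k,
                                   double lo, double d, double up,
                                   const double *v, /* length n_local + 2k */
                                   double **V)      /* V[j]: level j, same length */
        {
            int n_ext = n_local + 2*k;
            for (int i = 0; i < n_ext; i++) V[0][i] = v[i];
            for (int j = 1; j <= k; j++) {
                /* after j applications of A, only entries at distance >= j from
                   the ends of the extended block are still correct: the owned
                   part plus a shrinking trapezoid of redundant work */
                for (int i = j; i < n_ext - j; i++)
                    V[j][i] = lo*V[j-1][i-1] + d*V[j-1][i] + up*V[j-1][i+1];
            }
            /* owned results: V[j][k .. k+n_local-1] for j = 1..k */
        }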

  • Fewer Remotely Dependent Entries for [x, Ax, ..., A^8x], A tridiagonal, 2 processors
    Reduce the redundant work by half
    (Proc 1, Proc 2)

  • Remotely Dependent Entries for [x,Ax, A2x,A3x], 2D Laplacian

  • Remotely Dependent Entries for [x,Ax,A2x,A3x], A irregular, multiple processors

  • Speedups on Intel Clovertown (8 core)

  • Performance Results
    Measured multicore (Clovertown) speedups up to 6.4x
    Measured/modeled sequential out-of-core speedup up to 3x
    Modeled parallel petascale speedup up to 6.9x
    Modeled parallel grid speedup up to 22x

    Sequential speedup is due to bandwidth, and works for many problem sizes
    Parallel speedup is due to latency, and works for smaller problems on many processors
    Multicore results used both techniques

  • Avoiding Communication in Iterative Linear Algebra
    k steps of a typical iterative solver for sparse Ax=b or Ax=λx do k SpMVs with the starting vector, then find the best solution among all linear combinations of these k+1 vectors
    Many such Krylov Subspace Methods: Conjugate Gradients, GMRES, Lanczos, Arnoldi, ...
    Goal: minimize communication in Krylov Subspace Methods
    Assume the matrix is well-partitioned, with modest surface-to-volume ratio
    Parallel implementation: conventional is O(k log p) messages, because of k calls to SpMV; new is O(log p) messages - optimal
    Serial implementation: conventional is O(k) moves of data from slow to fast memory; new is O(1) moves of data - optimal
    Lots of speedup possible (modeled and measured); price: some redundant computation
    Much prior work: see Mark Hoemmen's PhD thesis and other papers at bebop.cs.berkeley.edu

  • Minimizing Communication of GMRES to solve Ax=b
    GMRES: find x in span{b, Ab, ..., A^k b} minimizing || Ax-b ||_2
    Cost of k steps of standard GMRES vs new GMRES

    Standard GMRES:
      for i = 1 to k
        w = A * v(i-1)
        MGS(w, v(0), ..., v(i-1))       ... Modified Gram-Schmidt
        update v(i), H
      endfor
      solve LSQ problem with H

    Sequential: #words_moved = O(k*nnz) from SpMV + O(k^2 n) from MGS
    Parallel: #messages = O(k) from SpMV + O(k^2 log p) from MGS

    Communication-avoiding GMRES:
      W = [ v, Av, A^2v, ..., A^kv ]
      [Q,R] = TSQR(W)                   ... Tall Skinny QR
      build H from R, solve LSQ problem

    Sequential: #words_moved = O(nnz) from SpMV + O(kn) from TSQR
    Parallel: #messages = O(1) from computing W + O(log p) from TSQR
    Oops - W is from the power method, precision lost!

  • Monomial basis [Ax, ..., A^kx] fails to converge; a different polynomial basis does converge

  • Speedups of GMRES on 8-core Intel Clovertown
    Requires co-tuning kernels [MHDY09]

  • New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems
    On modern computer architectures, communication between processors takes longer than a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communication between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.

    FY 2010 Congressional Budget, Volume 4, FY2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67.
    President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress.
    CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD); Tall-Skinny QR (Grigori, Hoemmen, Langou, JD)

  • Tuning space for Krylov Methods
    Classifications of sparse operators for avoiding communication:
    Explicit indices or nonzero entries cause most communication, along with vectors
    Ex: with stencils (all implicit), all communication is for vectors
    Operations: [x, Ax, A^2x, ..., A^kx] or [x, p1(A)x, p2(A)x, ..., pk(A)x]
    Number of columns in x
    [x, Ax, A^2x, ..., A^kx] and [y, A^T y, (A^T)^2 y, ..., (A^T)^k y], or [y, A^T A y, (A^T A)^2 y, ..., (A^T A)^k y]; return all vectors or just the last one
    Cotuning and/or interleaving: W = [x, Ax, A^2x, ..., A^kx] and {TSQR(W) or W^T W or ...}; ditto, but throw away W
    Preconditioned versions

    Indices explicit (O(nnz)), nonzero entries explicit (O(nnz)): CSR and variations
    Indices implicit (o(nnz)), nonzero entries explicit (O(nnz)): vision, climate, AMR, ...
    Indices explicit (O(nnz)), nonzero entries implicit (o(nnz)): graph Laplacian
    Indices implicit (o(nnz)), nonzero entries implicit (o(nnz)): stencils

  • Possible Class Projects
    Come to BeBOP meetings (M 1-2, 380 Soda)
    Experiment with SpMV on different architectures: which optimizations are most effective?
    Try to speed up particular matrices of interest: data mining, bottom solver from AMR
    Explore the tuning space of the [x, Ax, ..., A^kx] kernel
    Different matrix representations (last slide)
    New Krylov subspace methods, preconditioning
    Experiment with new frameworks (SPF, Halide)
    More details available

  • Extra Slides

  • Extensions
    Other Krylov methods: Arnoldi, CG, Lanczos, ...
    Preconditioning: solve MAx=Mb where the preconditioning matrix M is chosen to make the system easier
    M approximates A^-1 somehow, but cheaply, to accelerate convergence
    Cheap as long as contributions from distant parts of the system can be compressed: sparsity, low rank
    No implementations yet (class projects!)

  • Design Space for [x, Ax, ..., A^kx] (1/3)
    Mathematical operation:
    How many vectors to keep? All: Krylov Subspace Methods; keep only the last vector A^kx (Jacobi, Gauss-Seidel)
    Improving conditioning of the basis: W = [x, p1(A)x, p2(A)x, ..., pk(A)x], where pi(A) is a degree-i polynomial chosen to reduce cond(W)
    Preconditioning (Ay=b -> MAy=Mb): [x, Ax, MAx, AMAx, MAMAx, ..., (MA)^kx]

  • Design Space for [x, Ax, ..., A^kx] (2/3)
    Representation of sparse A: the zero pattern may be explicit or implicit, and the nonzero entries may be explicit or implicit; implicit saves memory and communication

    Representation of dense preconditioners M: low-rank off-diagonal blocks (semiseparable)

    Explicit pattern, explicit nonzeros: general sparse matrix
    Implicit pattern, explicit nonzeros: image segmentation
    Explicit pattern, implicit nonzeros: Laplacian(graph)
    Implicit pattern, implicit nonzeros: multigrid (AMR), stencil matrix, e.g. tridiag(-1,2,-1)

  • Design Space for [x, Ax, ..., A^kx] (3/3)
    Parallel implementation: from simple indexing with redundant flops proportional to the surface/volume ratio, to complicated indexing with fewer redundant flops
    Sequential implementation: depends on whether the vectors fit in fast memory
    Reordering rows and columns of A: important in both the parallel and sequential cases; can be reduced to a pair of Traveling Salesman Problems
    Plus all the optimizations for one SpMV!

  • Summary of Communication-Avoiding Linear Algebra (CALA)
    Lots of related work, some going back to the 1960s; reports discuss this comprehensively, not here
    Our contributions: several new algorithms, improvements on old ones, preconditioning, unifying parallel and sequential approaches to avoiding communication
    The time for these algorithms has come, because of growing communication costs
    Why avoid communication just for linear algebra motifs?

  • Optimized Sparse Kernel Interface - OSKI
    Provides sparse kernels automatically tuned for the user's matrix & machine
    BLAS-style functionality: SpMV (Ax and A^T y), TrSV
    Hides complexity of run-time tuning
    Includes new, faster locality-aware kernels: A^T A x, A^k x
    Faster than standard implementations: up to 4x faster matvec, 1.8x trisolve, 4x A^T A*x
    For advanced users & solver library writers
    Available as a stand-alone library (OSKI 1.0.1h, 6/07) and as a PETSc extension (OSKI-PETSc .1d, 3/06)
    bebop.cs.berkeley.edu/oski
    Under development: pOSKI for multicore

  • How OSKI Tunes (Overview)
    Library install-time (offline): 1. Build for target architecture; 2. Benchmark generated code variants, producing benchmark data and heuristic models
    Application run-time: 1. Evaluate models; 2. Select data structure & code, using the user's matrix, workload from program monitoring, and history
    To user: matrix handle for kernel calls
    Extensibility: advanced users may write & dynamically add code variants and heuristic models to the system

  • How OSKI Tunes (Overview)
    At library build/install-time:
    Pre-generate and compile code variants into dynamic libraries
    Collect benchmark data: measure and record the speed of possible sparse data structures and code variants on the target architecture
    Installation process uses standard, portable GNU AutoTools
    At run-time:
    Library tunes using heuristic models: models analyze the user's matrix & benchmark data to choose an optimized data structure and code
    Non-trivial tuning cost: up to ~40 mat-vecs; the library limits the time it spends tuning based on the estimated workload, provided by the user or inferred by the library
    User may reduce cost by saving tuning results for the application on future runs with the same or similar matrix

  • Optimizations in OSKI, so far
    Fully automatic heuristics for:
    Sparse matrix-vector multiply: register-level blocking; register-level blocking + symmetry + multiple vectors; cache-level blocking
    Sparse triangular solve with register-level blocking and switch-to-dense optimization
    Sparse A^T A*x with register-level blocking
    User may select other optimizations manually: diagonal storage optimizations, reordering, splitting; tiled matrix powers kernel (A^k*x); all available in dynamic libraries; accessible via a high-level embedded script language
    Plug-in extensibility: very advanced users may write their own heuristics, create new data structures/code variants and dynamically add them to the system
    pOSKI under construction: optimizations for multicore - more later

  • How to Call OSKI: Basic Usage
    May gradually migrate existing apps: Step 1, wrap existing data structures; Step 2, make BLAS-like kernel calls

    int* ptr = ..., *ind = ...; double* val = ...;   /* Matrix, in CSR format */
    double* x = ..., *y = ...;   /* Let x and y be two dense vectors */

    /* Compute y = y + Ax, 500 times */
    for( i = 0; i < 500; i++ )
      my_matmult( ptr, ind, val, ..., x, b, y );

  • How to Call OSKI: Basic Usage
    May gradually migrate existing apps: Step 1, wrap existing data structures; Step 2, make BLAS-like kernel calls

    int* ptr = ..., *ind = ...; double* val = ...;   /* Matrix, in CSR format */
    double* x = ..., *y = ...;   /* Let x and y be two dense vectors */

    /* Step 1: Create OSKI wrappers around this data */
    oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols,
                                                SHARE_INPUTMAT, ...);
    oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
    oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

    /* Compute y = y + Ax, 500 times */
    for( i = 0; i < 500; i++ )
      my_matmult( ptr, ind, val, ..., x, b, y );

  • How to Call OSKI: Basic Usage
    May gradually migrate existing apps: Step 1, wrap existing data structures; Step 2, make BLAS-like kernel calls

    int* ptr = ..., *ind = ...; double* val = ...;   /* Matrix, in CSR format */
    double* x = ..., *y = ...;   /* Let x and y be two dense vectors */

    /* Step 1: Create OSKI wrappers around this data */
    oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols,
                                                SHARE_INPUTMAT, ...);
    oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
    oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

    /* Step 2: Compute y = y + Ax, 500 times */
    for( i = 0; i < 500; i++ )
      oski_MatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view);

  • How to Call OSKI: Tune with Explicit Hints
    User calls the tune routine; may provide explicit tuning hints (OPTIONAL)

    oski_matrix_t A_tunable = oski_CreateMatCSR( ... );
    /* ... */

    /* Tell OSKI we will call SpMV 500 times (workload hint) */
    oski_SetHintMatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view, 500);
    /* Tell OSKI we think the matrix has 8x8 blocks (structural hint) */
    oski_SetHint(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);

    oski_TuneMat(A_tunable);   /* Ask OSKI to tune */

    for( i = 0; i < 500; i++ )
      oski_MatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view);

  • How the User Calls OSKI: Implicit Tuning
    Ask the library to infer the workload; the library profiles all kernel calls and may periodically re-tune

    oski_matrix_t A_tunable = oski_CreateMatCSR( ... );
    /* ... */

    for( i = 0; i < 500; i++ ) {
      oski_MatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view);
      oski_TuneMat(A_tunable);   /* Ask OSKI to tune */
    }

  • Quick-and-dirty Parallelism: OSKI-PETSc
    Extend PETSc's distributed memory SpMV (MATMPIAIJ)
    PETSc: each process (p0, p1, p2, p3) stores diag (all-local) and off-diag submatrices

    OSKI-PETSc: add OSKI wrappers; each submatrix is tuned independently

  • OSKI-PETSc Proof-of-Concept Results
    Matrix 1: Accelerator cavity design (R. Lee @ SLAC); N ~ 1 M, ~40 M non-zeros; 2x2 dense block substructure; symmetric
    Matrix 2: Linear programming (Italian Railways); short-and-fat: 4k x 1M, ~11M non-zeros; highly unstructured; big speedup from cache-blocking: no native PETSc format
    Evaluation machine: Xeon cluster; peak: 4.8 Gflop/s per node

  • Accelerator Cavity Matrix

  • OSKI-PETSc Performance: Accel. Cavity

  • Linear Programming Matrix

  • OSKI-PETSc Performance: LP Matrix

  • Tuning SpMV for Multicore: Architectural Review

  • Cost of memory access in Multicore

    When physically partitioned, cache or memory access is non-uniform (latency and bandwidth to memory/cache addresses vary)
    NUMA = Non-Uniform Memory Access
    NUCA = Non-Uniform Cache Access
    UCA & UMA architecture:

    *Source: Sam Williams

  • Cost of Memory Access in Multicore
    When physically partitioned, cache or memory access is non-uniform (latency and bandwidth to memory/cache addresses vary)
    NUMA = Non-Uniform Memory Access
    NUCA = Non-Uniform Cache Access
    NUCA & UMA architecture:

    *Source: Sam Williams

  • Cost of Memory Access in Multicore
    When physically partitioned, cache or memory access is non-uniform (latency and bandwidth to memory/cache addresses vary)
    NUMA = Non-Uniform Memory Access
    NUCA = Non-Uniform Cache Access
    NUCA & NUMA architecture:

    *Source: Sam Williams

  • Cost of Memory Access in Multicore
    Proper cache locality to maximize performance

    *Source: Sam Williams

  • Cost of Memory Access in Multicore
    Proper DRAM locality to maximize performance

    *Source: Sam Williams

  • Optimizing Communication Complexity of Sparse Solvers
    Need to modify high level algorithms to use the new kernel
    Example: GMRES for Ax=b where A = 2D Laplacian
    x lives on an n-by-n mesh, partitioned on a √p-by-√p processor grid
    A has a 5 point stencil (Laplacian): (Ax)(i,j) = linear_combination(x(i,j), x(i,j±1), x(i±1,j))
    Ex: 18-by-18 mesh on a 3-by-3 processor grid

  • Minimizing Communication
    What is the cost = (#flops, #words, #messages) of s steps of standard GMRES?

    GMRES, ver. 1:
      for i = 1 to s
        w = A * v(i-1)
        MGS(w, v(0), ..., v(i-1))
        update v(i), H
      endfor
      solve LSQ problem with H

    (Each processor owns an (n/√p)-by-(n/√p) piece of the mesh)
    Cost(A * v) = s * (9n^2/p, 4n/√p, 4)
    Cost(MGS = Modified Gram-Schmidt) = s^2/2 * (4n^2/p, log p, log p)
    Total cost ~ Cost(A * v) + Cost(MGS)
    Can we reduce the latency?

  • Minimizing Communication
    Cost(GMRES, ver. 1) = Cost(A*v) + Cost(MGS)
    = (9sn^2/p, 4sn/√p, 4s) + (2s^2n^2/p, s^2 log p / 2, s^2 log p / 2)
    How much latency cost from A*v can you avoid? Almost all

  • Minimizing Communication
    Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
    = (9sn^2/p, 4sn/√p, 8) + (2s^2n^2/p, s^2 log p / 2, s^2 log p / 2)
    How much latency cost from MGS can you avoid? Almost all

  • Minimizing Communication
    Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
    = (9sn^2/p, 4sn/√p, 8) + (2s^2n^2/p, s^2 log p / 2, s^2 log p / 2)
    How much latency cost from MGS can you avoid? Almost all

    GMRES, ver. 3:
      W = [ v, Av, A^2v, ..., A^sv ]
      [Q,R] = TSQR(W)      ... Tall Skinny QR (see Lecture 11)
      build H from R, solve LSQ problem

  • Minimizing Communication
    Cost(GMRES, ver. 3) = Cost(W) + Cost(TSQR)
    = (9sn^2/p, 4sn/√p, 8) + (2s^2n^2/p, s^2 log p / 2, log p)
    Latency cost independent of s, just log p - optimal
    Oops - W is from the power method, so precision is lost. What to do?
    Use a different polynomial basis, not the monomial basis W = [v, Av, A^2v, ...]:
    Newton basis W_N = [v, (A - θ_1 I)v, (A - θ_2 I)(A - θ_1 I)v, ...]
    or Chebyshev basis W_C = [v, T_1(v), T_2(v), ...]

  • Tuning Higher Level Algorithms
    So far we have tuned a single sparse matrix kernel, y = A^T*A*x, motivated by a higher level algorithm (SVD)
    What can we do by extending tuning to a higher level?
    Consider Krylov subspace methods for Ax=b, Ax=λx: Conjugate Gradients (CG), GMRES, Lanczos, ...
    Inner loop does y=A*x, dot products, saxpys, scalar ops
    Inner loop costs at least O(1) messages, so k iterations cost at least O(k) messages
    Our goal: show how to do k iterations with O(1) messages
    Possible payoff: make Krylov subspace methods much faster on machines with slow networks; memory bandwidth improvements too (not discussed)
    Obstacles: numerical stability, preconditioning, ...

  • Krylov Subspace Methods for Solving Ax=b
    Compute a basis for a subspace V by doing y = A*x k times
    Find the best solution in that Krylov subspace V
    Given starting vector x1, V is spanned by x2 = A*x1, x3 = A*x2, ..., xk = A*x(k-1)
    GMRES finds an orthogonal basis of V by Gram-Schmidt, so it actually does a different set of SpMVs than in the last bullet

  • Example: Standard GMRES
    r = b - A*x1,  beta = length(r),  v1 = r / beta     ... length(r) = sqrt(sum_i r_i^2)
    for m = 1 to k do
      w = A*v(m)                               ... at least O(1) messages
      for i = 1 to m do                        ... Gram-Schmidt
        h(i,m) = dotproduct(v(i), w)           ... at least O(1) messages, or log(p)
        w = w - h(i,m) * v(i)
      end for
      h(m+1,m) = length(w)                     ... at least O(1) messages, or log(p)
      v(m+1) = w / h(m+1,m)
    end for
    find y minimizing length( H_k * y - beta*e1 )   ... small, local problem
    new x = x1 + V_k * y,  where V_k = [v1, v2, ..., vk]
    O(k^2), or O(k^2 log p), messages altogether

  • Example: Computing [Ax,A2x,A3x,,Akx] for A tridiagonal Different basis for same Krylov subspaceWhat can Proc 1 compute without communication?x(1:30):(Ax)(1:30):(A2x)(1:30):(A8x)(1:30):Proc 1Proc 2. . .

  • Example: Computing [Ax,A2x,A3x,,Akx] for A tridiagonal x(1:30):(Ax)(1:30):(A2x)(1:30):. . .(A8x)(1:30):Proc 1Proc 2Computing missing entries with 1 communication, redundant work

  • x(1:30):(Ax)(1:30):(A2x)(1:30):. . .(A8x)(1:30):Proc 1Proc 2Saving half the redundant workExample: Computing [Ax,A2x,A3x,,Akx] for A tridiagonal

  • Example: Computing [Ax,A2x,A3x,,Akx] for LaplacianA = 5pt Laplacian in 2D,Communicated point for k=3 shown

  • Latency-Avoiding GMRES (1)
    r = b - A*x1,  beta = length(r),  v1 = r / beta      ... O(log p) messages
    W_(k+1) = [v1, A*v1, A^2*v1, ..., A^k*v1]            ... O(1) messages
    [Q, R] = qr(W_(k+1))                                 ... QR decomposition, O(log p) messages
    H_k = R(:, 2:k+1) * (R(1:k,1:k))^-1                  ... small, local problem
    find y minimizing length( H_k * y - beta*e1 )        ... small, local problem
    new x = x1 + Q_k * y                                 ... local problem
    O(log p) messages altogether - independent of k

  • Latency-Avoiding GMRES (2)
    [Q, R] = qr(W_(k+1))      ... QR decomposition, O(log p) messages

    Easy, but not so stable, way to do it:
    X(myproc) = W_(k+1)(myproc)^T * W_(k+1)(myproc)    ... local computation
    Y = sum_reduction(X(myproc))                       ... O(log p) messages; Y = W_(k+1)^T * W_(k+1)
    R = (cholesky(Y))^T                                ... small, local computation
    Q(myproc) = W_(k+1)(myproc) * R^-1                 ... local computation
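
    In matrix form, the "easy but not so stable" factorization above is a Cholesky-QR (a restatement of the slide's steps, not a new method):

        \[
          Y \;=\; W_{k+1}^{T} W_{k+1} \;=\; \sum_{q=1}^{p} W_{k+1}(q)^{T}\, W_{k+1}(q)
          \quad \text{(one reduction, } O(\log p) \text{ messages)},
        \]
        \[
          R \;=\; \mathrm{chol}(Y)^{T}, \qquad
          Q(q) \;=\; W_{k+1}(q)\, R^{-1} \quad \text{(local)},
        \]

    so that W_(k+1) = QR with Q^T Q = R^{-T}(W^T W)R^{-1} = I, at the price of squaring the condition number of W inside Y.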

  • Numerical example (1)
    Diagonal matrix with n=1000, A_ii from 1 down to 10^-5
    Instability as k grows, after many iterations

  • Numerical Example (2)
    Partial remedy: restarting periodically (every 120 iterations)
    Other remedies: high precision, a different basis than [v, A*v, ..., A^k*v]

  • Operation Counts for [Ax, A^2x, A^3x, ..., A^kx] on p procs

    1D mesh (tridiagonal), per-proc cost:
      #messages:   Standard 2k            Optimized 2
      #words sent: Standard 2k            Optimized 2k
      #flops:      Standard 5kn/p         Optimized 5kn/p + 5k^2
      memory:      Standard (k+4)n/p      Optimized (k+4)n/p + 8k

    3D mesh (27 pt stencil), per-proc cost:
      #messages:   Standard 26k           Optimized 26
      #words sent: Standard 6kn^2 p^(-2/3) + 12kn p^(-1/3) + O(k)
                   Optimized 6kn^2 p^(-2/3) + 12k^2 n p^(-1/3) + O(k^3)
      #flops:      Standard 53kn^3/p      Optimized 53kn^3/p + O(k^2 n^2 p^(-2/3))
      memory:      Standard (k+28)n^3/p + 6n^2 p^(-2/3) + O(n p^(-1/3))
                   Optimized (k+28)n^3/p + 168kn^2 p^(-2/3) + O(k^2 n p^(-1/3))

  • Summary and Future Work
    Dense: LAPACK, ScaLAPACK, communication primitives
    Sparse: kernels, stencils, higher level algorithms
    All of the above on new architectures: vector, SMPs, multicore, Cell, ...
    High level support for tuning: specification language, integration into compilers

  • Extra Slides

  • A Sparse Matrix You Encounter Every Day
    Who am I?
    I am a
    Big Repository
    Of useful
    And useless
    Facts alike.

    Who am I?

    (Hint: Not your e-mail inbox.)

  • What about the Google Matrix?
    Google approach:
    Approx. once a month: rank all pages using the connectivity structure; find the dominant eigenvector of a matrix
    At query-time: return the list of pages ordered by rank
    Matrix: A = αG + (1-α)(1/n)uu^T
    Markov model: surfer follows a link with probability α, jumps to a random page with probability 1-α
    G is the n x n connectivity matrix [n ~ billions]: g(i,j) is non-zero if page i links to page j, normalized so each column sums to 1; very sparse, about 7-8 non-zeros per row (power law distribution)
    u is a vector of all 1 values
    The steady-state probability x(i) of landing on page i is the solution to x = Ax
    Approximate x by the power method: x = A^k x0; in practice, k ~ 25
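
    A minimal sketch of the power iteration described above, reusing the csr_t type and spmv_csr from the earlier CSR example; the 2-norm normalization is an illustrative addition (the slide only states x ≈ A^k x0 with k ~ 25), and for the actual Google matrix A would be applied implicitly as an SpMV with G plus a rank-one correction rather than stored explicitly.

        #include <math.h>
        #include <stdlib.h>
        #include <string.h>

        /* Approximate the dominant eigenvector of A by k steps of the power method. */
        void power_method(const csr_t *A, double *x, int k)
        {
            double *y = calloc(A->n, sizeof *y);
            for (int step = 0; step < k; step++) {
                memset(y, 0, A->n * sizeof *y);
                spmv_csr(A, x, y);                         /* y = A*x */
                double norm = 0.0;
                for (int i = 0; i < A->n; i++) norm += y[i]*y[i];
                norm = sqrt(norm);
                for (int i = 0; i < A->n; i++) x[i] = y[i] / norm;
            }
            free(y);
        }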

  • Current Work
    Public software release
    Impact on library designs: Sparse BLAS, Trilinos, PETSc, ...
    Integration in large-scale applications: DOE (accelerator design; plasma physics); geophysical simulation based on Block Lanczos (A^T A*X; LBL)
    Systematic heuristics for data structure selection?
    Evaluation of emerging architectures: revisiting vector micros
    Other sparse kernels: matrix triple products, A^k*x
    Parallelism
    Sparse benchmarks (with UTK) [Gahvari & Hoemmen]
    Automatic tuning of MPI collective ops [Nishtala, et al.]

  • Summary of High-Level Themes
    Kernel-centric optimization (vs. basic block, trace, path optimization, for instance); aggressive use of domain-specific knowledge
    Performance bounds modeling: evaluating software quality; architectural characterizations and consequences
    Empirical search: hybrid on-line/run-time models; statistical performance models; exploit information from sampling and measuring

  • Related Work
    My bibliography: 337 entries so far
    Sample area 1: code generation - generative & generic programming, sparse compilers, domain-specific generators
    Sample area 2: empirical search-based tuning - kernel-centric (linear algebra, signal processing, sorting, MPI, ...); compiler-centric (profiling + FDO, iterative compilation, superoptimizers, self-tuning compilers, continuous program optimization)

  • Future Directions (A Bag of Flaky Ideas)
    Composable code generators and search spaces
    New application domains - PageRank: multilevel block algorithms for topic-sensitive search?
    New kernels: cryptokernels - rich mathematical structure germane to performance; lots of hardware
    New tuning environments: parallel, Grid, whole systems
    Statistical models of application performance: statistical learning of concise parametric models from traces for architectural evaluation; compiler/automatic derivation of parametric models

  • Possible Future Work
    Different architectures: new FP instruction sets (SSE2), SMP / multicore platforms, vector architectures
    Different kernels
    Higher level algorithms: parallel versions of kernels with optimized communication, block algorithms (e.g. Lanczos), XBLAS (extra precision)
    Produce benchmarks: augment the HPCC Benchmark; make it possible to combine optimizations easily for any kernel
    Related tuning activities (LAPACK & ScaLAPACK)

  • Review of Tuning by Illustration(Extra Slides)

  • Splitting for Variable Blocks and Diagonals
    Decompose A = A1 + A2 + ... + At
    Detect canonical structures (sampling), split, and tune each Ai
    Improve performance and save storage
    New data structures: unaligned block CSR (relax alignment in rows & columns), row-segmented diagonals

  • Example: Variable Block Row (Matrix #12)
    2.1x over CSR, 1.8x over RB

  • Example: Row-Segmented Diagonals
    2x over CSR

  • Mixed Diagonal and Block Structure

  • Summary
    Automated block size selection: empirical modeling and search
    Register blocking for SpMV, triangular solve, ATA*x

    Not fully automated: given a matrix, selecting splittings and transformations involves lots of combinatorial problems, e.g., TSP reordering to create dense blocks (Pinar '97; Moon, et al. '04)

  • Extra Slides


  • Problem Context
    Sparse kernels abound: models of buildings, cars, bridges, economies, …; the Google PageRank algorithm
    Historical trends: sparse matrix-vector multiply (SpMV) runs at ~10% of peak; 2x faster with hand-tuning; tuning is becoming more difficult over time
    Promise of automatic tuning: PHiPAC/ATLAS, FFTW, …
    Challenges to high performance: not dense linear algebra! Complex data structures (indirect, irregular memory access); performance depends strongly on run-time inputs

  • Key Questions, Ideas, Conclusions
    How to tune basic sparse kernels automatically? Empirical modeling and search: up to 4x speedups for SpMV; 1.8x for triangular solve; 4x for ATA*x; 2x for A2*x; 7x for multiple vectors
    What are the fundamental limits on performance? Kernel-, machine-, and matrix-specific upper bounds; achieve 75% or more of the bound for SpMV, limiting low-level tuning
    Consequences for architecture?
    General techniques for empirical search-based tuning? Statistical models of performance

  • Road Map
    Sparse matrix-vector multiply (SpMV) in a nutshell
    Historical trends and the need for search
    Automatic tuning techniques
    Upper bounds on performance
    Statistical models of performance

  • Compressed Sparse Row (CSR) Storage
    Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)*x(j)

    for each row i
      for k = ptr[i] to ptr[i+1]-1 do
        y[i] = y[i] + val[k]*x[ind[k]]
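    A self-contained C rendering of the CSR kernel above may make the data structure concrete; the 3x3 example matrix is made up, and the function name spmv_csr is illustrative rather than a library API.

```c
/* Runnable C version of the CSR SpMV kernel sketched above.
 * Example 3x3 matrix (made up):
 *   [ 1 0 2 ]
 *   [ 0 3 0 ]
 *   [ 4 0 5 ]                                                     */
#include <stdio.h>

static void spmv_csr(int n, const int *ptr, const int *ind,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {                 /* for each row i         */
        double t = y[i];
        for (int k = ptr[i]; k < ptr[i+1]; k++)   /* non-zeros of row i     */
            t += val[k] * x[ind[k]];              /* irregular access to x  */
        y[i] = t;
    }
}

int main(void)
{
    int    ptr[] = {0, 2, 3, 5};
    int    ind[] = {0, 2, 1, 0, 2};
    double val[] = {1, 2, 3, 4, 5};
    double x[]   = {1, 1, 1};
    double y[]   = {0, 0, 0};

    spmv_csr(3, ptr, ind, val, x, y);
    printf("y = [%g %g %g]\n", y[0], y[1], y[2]);   /* expect [3 3 9] */
    return 0;
}
```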

  • Road Map
    Sparse matrix-vector multiply (SpMV) in a nutshell
    Historical trends and the need for search
    Automatic tuning techniques
    Upper bounds on performance
    Statistical models of performance

  • Historical Trends in SpMV Performance
    The data: uniprocessor SpMV performance since 1987; untuned and tuned implementations; cache-based superscalar micros, some vectors; LINPACK for comparison

  • SpMV Historical Trends: Mflop/s

  • Example: The Difficulty of Tuning
    n = 21216; nnz = 1.5 M; kernel: SpMV

    Source: NASA structural analysis problem

  • Still More Surprises
    More complicated non-zero structure in general

  • Still More Surprises
    More complicated non-zero structure in general
    Example: 3x3 blocking; logical grid of 3x3 cells

  • Historical Trends: Mixed News
    Observations: Good news: Moore's-law-like behavior. Bad news: untuned is 10% of peak or less, and worsening. Good news: tuned is roughly 2x better today, and improving. Bad news: tuning is complex.

    (Not really news: SpMV is not LINPACK)

    Questions: Application: automatic tuning? Architect: what machines are good for SpMV?

  • Road Map
    Sparse matrix-vector multiply (SpMV) in a nutshell
    Historical trends and the need for search
    Automatic tuning techniques: SpMV [SC'02; IJHPCA '04b]; sparse triangular solve (SpTS) [ICS/POHLL '02]; ATA*x [ICCS/WoPLA '03]
    Upper bounds on performance
    Statistical models of performance

  • SPARSITY: Framework for Tuning SpMV
    SPARSITY: automatic tuning for SpMV [Im & Yelick '99]
    General approach: identify and generate an implementation space; search the space using empirical models & experiments
    Prototype library and heuristic for choosing the register block size; also cache-level blocking, multiple vectors
    What's new? New block-size selection heuristic (within 10% of optimal; replaces the previous version); expanded implementation space (variable block splitting, diagonals, combinations); new kernels: sparse triangular solve, ATA*x, Ar*x

  • Automatic Register Block Size Selection
    Selecting the r x c block size
    Off-line benchmark (characterize the machine): precompute Mflops(r,c) using a dense matrix for each r x c; once per machine/architecture
    Run-time search (characterize the matrix): sample A to estimate Fill(r,c) for each r x c
    Run-time heuristic model: choose r, c to maximize Mflops(r,c) / Fill(r,c)
    Run-time costs: up to ~40 SpMVs (empirical worst case)
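    A minimal sketch of the run-time heuristic (not the SPARSITY/OSKI code): pick the (r,c) that maximizes Mflops(r,c)/Fill(r,c), where the Mflops table comes from the off-line dense benchmark and the fill table from run-time sampling. Both tables below are placeholders, not measured data.

```c
/* Choose (r,c) maximizing benchmarked dense Mflops(r,c) / estimated
 * Fill(r,c).  mflops[][] would come from the off-line register-blocking
 * benchmark; fill[][] from sampling the matrix at run time.            */
#include <stdio.h>

#define RMAX 8
#define CMAX 8

static void choose_block_size(double mflops[RMAX][CMAX],
                              double fill[RMAX][CMAX],
                              int *rbest, int *cbest)
{
    double best = 0.0;
    *rbest = *cbest = 1;
    for (int r = 1; r <= RMAX; r++)
        for (int c = 1; c <= CMAX; c++) {
            /* estimated blocked performance = dense Mflops / fill overhead */
            double est = mflops[r-1][c-1] / fill[r-1][c-1];
            if (est > best) { best = est; *rbest = r; *cbest = c; }
        }
}

int main(void)
{
    double mflops[RMAX][CMAX], fill[RMAX][CMAX];
    /* Placeholder model: bigger blocks run faster on the dense benchmark
     * but incur more fill on a sparse matrix.                            */
    for (int r = 0; r < RMAX; r++)
        for (int c = 0; c < CMAX; c++) {
            mflops[r][c] = 100.0 + 20.0*r + 10.0*c;
            fill[r][c]   = 1.0 + 0.05*r*c;
        }
    int r, c;
    choose_block_size(mflops, fill, &r, &c);
    printf("chosen block size: %dx%d\n", r, c);
    return 0;
}
```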

  • Accuracy of the Tuning Heuristics (1/4)
    NOTE: "fair" flops used (ops on explicit zeros not counted as work); DGEMV shown for reference

  • Accuracy of the Tuning Heuristics (2/4) (DGEMV shown for reference)

  • Accuracy of the Tuning Heuristics (3/4) (DGEMV shown for reference)

  • Accuracy of the Tuning Heuristics (4/4) (DGEMV shown for reference)

  • Road Map
    Sparse matrix-vector multiply (SpMV) in a nutshell
    Historical trends and the need for search
    Automatic tuning techniques
    Upper bounds on performance [SC'02]
    Statistical models of performance

  • Motivation for Upper Bounds Model
    Questions: Speedups are good, but what is the speed limit, independent of instruction scheduling and selection? What machines are good for SpMV?

  • Upper Bounds on Performance: Blocked SpMV
    P = (flops) / (time); flops = 2 * nnz(A)
    Lower bound on time, under two main assumptions: (1) count memory ops only (streaming); (2) count only compulsory and capacity misses (ignore conflicts)
    Account for line sizes; account for matrix size and nnz
    Charge the minimum access latency α_i at the Li cache and α_mem at memory, e.g., from the Saavedra-Barrera and PMaC MAPS benchmarks
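    The following is a deliberately simplified, assumption-laden sketch of such a bound (two memory levels, all matrix and vector data streamed once, placeholder latencies and line size); it is meant only to show the shape of the model P_upper = 2*nnz / T_lower, not to reproduce the paper's bounds. The matrix dimensions reuse the n = 21216, nnz = 1.5M example from the earlier slide; the fill ratio and machine numbers are invented.

```c
/* Simplified two-level SpMV upper-bound model: charge every load the L1
 * hit latency and every (compulsory) cache line of matrix/vector traffic
 * the extra memory latency.  All machine parameters are placeholders.   */
#include <stdio.h>

static double spmv_upper_bound_mflops(long nnz, long m, long n, int r, int c,
                                      double fill,        /* stored/true nnz */
                                      double alpha_L1_ns, /* L1 hit latency  */
                                      double alpha_mem_ns,/* memory latency  */
                                      int line_bytes)
{
    double flops = 2.0 * nnz;

    /* Bytes that must be streamed at least once: blocked values, block
     * column indices, row pointers, and the source/destination vectors. */
    double matrix_bytes = 8.0*nnz*fill + 4.0*nnz*fill/(r*c) + 4.0*(m/r + 1);
    double vector_bytes = 8.0 * (m + n);
    double lines        = (matrix_bytes + vector_bytes) / line_bytes;

    /* Loads issued by the kernel: one per stored value and index, plus
     * roughly one per destination element.                              */
    double loads = nnz*fill*2.0 + m;

    double t_ns = loads*alpha_L1_ns + lines*(alpha_mem_ns - alpha_L1_ns);
    return flops / (t_ns * 1e-9) / 1e6;   /* Mflop/s */
}

int main(void)
{
    double p = spmv_upper_bound_mflops(1500000L, 21216L, 21216L, 2, 2,
                                       1.1, 1.0, 150.0, 128);
    printf("modeled upper bound: %.0f Mflop/s\n", p);
    return 0;
}
```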

  • Example: Bounds on Itanium 2

  • Example: Bounds on Itanium 2

  • Example: Bounds on Itanium 2

  • Fraction of Upper Bound Across Platforms

  • Achieved Performance and Machine Balance
    Machine balance [Callahan '88; McCalpin '95]: Balance = Peak Flop Rate / Bandwidth (flops per double)
    Ideal balance for mat-vec: 2 flops per double; for SpMV, even less

    SpMV is essentially streaming: 1 / (average load time to stream one array) ≈ effective bandwidth
    Sustained balance = peak flops / model bandwidth

  • Where Does the Time Go?
    Most time is assigned to memory
    Caches "disappear" when line sizes are equal
    Strictly increasing line sizes

  • Execution Time Breakdown: Matrix 40

  • Speedups with Increasing Line Size

  • Summary: Performance Upper Bounds
    What is the best we can do for SpMV? Limits to low-level tuning of blocked implementations; refinements?
    What machines are good for SpMV? Partial answer: balance characterization
    Architectural consequences? Example: strictly increasing line sizes

  • Road Map
    Sparse matrix-vector multiply (SpMV) in a nutshell
    Historical trends and the need for search
    Automatic tuning techniques
    Upper bounds on performance
    Tuning other sparse kernels
    Statistical models of performance [FDO '00; IJHPCA '04a]

  • Statistical Models for Automatic Tuning
    Idea 1: a statistical criterion for stopping a search. A general search model: generate an implementation, measure its performance, repeat; stop when the probability of being within ε of optimal falls below a threshold; the distribution can be estimated on-line.
    Idea 2: statistical performance models. Problem: choose 1 among m implementations at run-time; sample performance off-line and build a statistical model.
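    As a stand-in for Idea 1 (this is not the criterion from the cited paper), the sketch below stops a random search once it has seen no eps-improvement in W consecutive samples, where W is chosen so that, if an improving draw had probability at least p_min, missing it W times in a row has probability at most delta. perf_of() is a hypothetical measurement hook returning fake numbers.

```c
/* Random search with a simple probabilistic stopping window:
 * (1 - p_min)^W <= delta  =>  W = ceil(log(delta)/log(1 - p_min)).     */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double perf_of(int impl)           /* hypothetical: time one variant */
{
    return 100.0 + (impl % 97);           /* fake Mflop/s for illustration  */
}

static int search(int n_impls, double eps, double p_min, double delta)
{
    int W = (int)ceil(log(delta) / log(1.0 - p_min));
    int best = -1, since_improve = 0;
    double best_perf = 0.0;

    while (since_improve < W) {
        int cand = rand() % n_impls;             /* random implementation    */
        double p = perf_of(cand);
        if (p > best_perf * (1.0 + eps)) {       /* meaningful improvement   */
            best = cand; best_perf = p; since_improve = 0;
        } else {
            since_improve++;
        }
    }
    printf("stopped after window W=%d; best ~ %.1f Mflop/s\n", W, best_perf);
    return best;
}

int main(void) { search(1000, 0.05, 0.01, 0.05); return 0; }
```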

  • Example: Select a Matmul Implementation

  • Example: Support Vector Classification

  • Road Map
    Sparse matrix-vector multiply (SpMV) in a nutshell
    Historical trends and the need for search
    Automatic tuning techniques
    Upper bounds on performance
    Tuning other sparse kernels
    Statistical models of performance
    Summary and Future Work

  • Summary of High-Level Themes
    Kernel-centric optimization (vs. basic block, trace, or path optimization, for instance); aggressive use of domain-specific knowledge
    Performance bounds modeling: evaluating software quality; architectural characterizations and consequences
    Empirical search: hybrid off-line/run-time models
    Statistical performance models: exploit information from sampling and measuring

  • Related Work
    My bibliography: 337 entries so far
    Sample area 1: code generation (generative & generic programming; sparse compilers; domain-specific generators)
    Sample area 2: empirical search-based tuning; kernel-centric (linear algebra, signal processing, sorting, MPI, …) and compiler-centric (profiling + FDO, iterative compilation, superoptimizers, self-tuning compilers, continuous program optimization)

  • Future Directions (A Bag of Flaky Ideas)
    Composable code generators and search spaces
    New application domains; PageRank: multilevel block algorithms for topic-sensitive search?
    New kernels: crypto kernels (rich mathematical structure germane to performance; lots of hardware)
    New tuning environments: parallel, Grid, whole systems
    Statistical models of application performance: statistical learning of concise parametric models from traces for architectural evaluation; compiler/automatic derivation of parametric models

  • Acknowledgements
    Super-advisors: Jim and Kathy
    Undergraduate RAs: Attila, Ben, Jen, Jin, Michael, Rajesh, Shoaib, Sriram, Tuyet-Linh
    See pages xvi–xvii of the dissertation.

  • TSP-based Reordering: Before (Pinar '97; Moon, et al. '04)

  • TSP-based Reordering: After (Pinar '97; Moon, et al. '04)

    Up to 2x speedups over CSR

  • Example: L2 Misses on Itanium 2 (misses measured using PAPI [Browne '00])

  • Example: Distribution of Blocked Non-Zeros

  • Register Profile: Itanium 2 (performance ranges from 190 to 1190 Mflop/s across register block sizes)

  • Register Profiles: Sun and Intel x86
    Fraction of peak: Ultra 2i 11%, Ultra 3 5%, Pentium III-M 15%, Pentium III 21%
    (figure annotations, Mflop/s: 72, 35, 90, 50, 108, 42, 122, 58)

  • Register Profiles: IBM and Intel IA-64
    Fraction of peak: Power3 17%, Power4 16%, Itanium 2 33%, Itanium 1 8%
    (figure annotations: 252 Mflop/s, 122 Mflop/s, 820 Mflop/s, 459 Mflop/s, 247 Mflop/s, 107 Mflop/s, 1.2 Gflop/s, 190 Mflop/s)

  • Accurate and Efficient Adaptive Fill Estimation
    Idea: sample the matrix; fraction of the matrix to sample: s ∈ [0,1]; cost ~ O(s * nnz)
    Control cost by controlling s; searching at run-time means the constant matters!
    Control s automatically by computing statistical confidence intervals; idea: monitor the variance
    Cost of tuning; lower bound: converting the matrix costs 5 to 40 unblocked SpMVs; heuristic: 1 to 11 SpMVs
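    A simplified sketch of fill estimation by sampling block rows of a CSR matrix follows (without the adaptive confidence-interval control described on the slide); the function name estimate_fill and the toy data are illustrative only.

```c
/* Estimate Fill(r,c) by sampling a fraction s of the block rows.
 * Fill = (entries stored after r x c blocking, zeros included) / true nnz. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static double estimate_fill(int m, int n, const int *ptr, const int *ind,
                            int r, int c, double s)
{
    int nbrows = (m + r - 1) / r, nbcols = (n + c - 1) / c;
    char *seen = calloc(nbcols, 1);       /* block columns hit in this row  */
    long stored = 0, true_nnz = 0;

    for (int I = 0; I < nbrows; I++) {
        if ((double)rand() / RAND_MAX > s) continue;   /* sample block rows */
        /* A real implementation would reset only the touched flags to keep
         * the cost ~ O(s * nnz); memset is used here for brevity.          */
        memset(seen, 0, nbcols);
        for (int i = I*r; i < (I+1)*r && i < m; i++)
            for (int k = ptr[i]; k < ptr[i+1]; k++) {
                int J = ind[k] / c;
                if (!seen[J]) { seen[J] = 1; stored += (long)r * c; }
                true_nnz++;
            }
    }
    free(seen);
    return true_nnz ? (double)stored / (double)true_nnz : 1.0;
}

int main(void)
{
    /* 4x4 example with a dense 2x2 block in the top-left corner. */
    int ptr[] = {0, 2, 4, 5, 6};
    int ind[] = {0, 1,  0, 1,  2,  3};
    double fill = estimate_fill(4, 4, ptr, ind, 2, 2, 1.0);  /* s = 1: exact */
    printf("estimated fill for 2x2 blocking: %.2f\n", fill);
    return 0;
}
```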

  • Sparse/Dense Partitioning for SpTS
    Partition L into sparse parts (L1, L2) and a dense trailing triangle LD
    Perform SpTS in three steps: SPARSITY optimizations for steps (1)–(2); DTRSV for step (3)
    Tuning parameters: block size, size of the dense triangle

  • SpTS Performance: Power3

  • Summary of SpTS and AAT*x Results
    SpTS: similar to SpMV; 1.8x speedups; limited benefit from low-level tuning
    AAT*x, ATA*x: cache interleaving alone gives up to 1.6x speedups; register + cache gives up to 4x speedups (1.8x over register only); similar heuristic, same accuracy (~10% of optimal); further from the upper bounds (60–80%), so there is an opportunity for better low-level tuning a la PHiPAC/ATLAS
    Matrix triple products? Ak*x? Preliminary work

  • Register Blocking: Speedup

  • Register Blocking: Performance

  • Register Blocking: Fraction of Peak

  • Example: Confidence Interval Estimation

  • Costs of Tuning

  • Splitting + UBCSR: Pentium III

  • Splitting + UBCSR: Power4

  • Splitting+UBCSR Storage: Power4

  • Example: Variable Block Row (Matrix #13)

  • Dense Tuning is Hard, Too
    Even dense matrix multiply can be notoriously difficult to tune

  • Dense matrix multiply: surprising performance as register tile size varies.

  • Preliminary Results (Matrix Set 2): Itanium 2
    (figure; matrix classes: Dense, FEM, FEM (var), Bio, LP, Econ, Stat)

  • Multiple Vector Performance

  • What about the Google Matrix?
    Google approach: approximately once a month, rank all pages using the connectivity structure, i.e., find the dominant eigenvector of a matrix; at query time, return the list of pages ordered by rank
    Matrix: A = αG + (1-α)(1/n)uu^T
    Markov model: a surfer follows a link with probability α and jumps to a random page with probability 1-α
    G is the n x n connectivity matrix [n ≈ 3 billion]; g_ij is non-zero if page i links to page j; normalized so each column sums to 1; very sparse: about 7–8 non-zeros per row (power-law distribution)
    u is the vector of all 1s
    The steady-state probability x_i of landing on page i is the solution of x = Ax
    Approximate x by the power method: x ≈ A^k x_0; in practice, k ≈ 25

  • MAPS Benchmark Example: Power4

  • MAPS Benchmark Example: Itanium 2

  • Saavedra-Barrera Example: Ultra 2i

  • Summary of Results: Pentium III

  • Summary of Results: Pentium III (3/3)

  • Execution Time Breakdown (PAPI): Matrix 40

  • Preliminary Results (Matrix Set 1): Itanium 2
    (figure; matrix classes: LP, FEM, FEM (var), Assorted, Dense)

  • Tuning Sparse Triangular Solve (SpTS)
    Compute x = L^(-1)*b where L is sparse lower triangular and x & b are dense
    L from sparse LU has rich dense substructure: the dense trailing triangle can account for 20–90% of the matrix non-zeros
    SpTS optimizations: split into a sparse trapezoid and a dense trailing triangle; use tuned dense BLAS (DTRSV) on the dense triangle; use SPARSITY register blocking on the sparse part
    Tuning parameters: size of the dense trailing triangle, register block size
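    A minimal sketch of the split solve follows: L is treated as [L11 0; L21 L22] with L11 and L21 sparse (CSR, diagonal stored last in each row of L11) and the trailing d x d triangle L22 dense. A tuned version would call the BLAS DTRSV for step (3); a plain loop stands in here so the example is self-contained, and all data and names are illustrative.

```c
/* Split sparse/dense lower-triangular solve L*x = b in three steps.     */
#include <stdio.h>

/* (1) sparse solve L11 * x1 = b1 (diagonal stored last in each CSR row). */
static void sptrsv_csr(int n1, const int *ptr, const int *ind,
                       const double *val, const double *b, double *x)
{
    for (int i = 0; i < n1; i++) {
        double t = b[i];
        int k;
        for (k = ptr[i]; k < ptr[i+1] - 1; k++)   /* off-diagonals         */
            t -= val[k] * x[ind[k]];
        x[i] = t / val[k];                        /* diagonal entry        */
    }
}

/* (2) b2 <- b2 - L21 * x1 (sparse matrix-vector multiply, in place).     */
static void spmv_sub(int d, const int *ptr, const int *ind,
                     const double *val, const double *x1, double *b2)
{
    for (int i = 0; i < d; i++)
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            b2[i] -= val[k] * x1[ind[k]];
}

/* (3) dense trailing solve L22 * x2 = b2 (column-major; DTRSV in a tuned
 *     implementation).                                                   */
static void dtrsv_lower(int d, const double *L22, const double *b2, double *x2)
{
    for (int i = 0; i < d; i++) {
        double t = b2[i];
        for (int j = 0; j < i; j++)
            t -= L22[i + j*d] * x2[j];
        x2[i] = t / L22[i + i*d];
    }
}

int main(void)
{
    /* Toy 4x4 L: 2x2 sparse leading block, 2x2 dense trailing triangle. */
    int    p11[] = {0, 1, 3}; int i11[] = {0,  0, 1}; double v11[] = {2,  1, 3};
    int    p21[] = {0, 1, 2}; int i21[] = {1,  0};    double v21[] = {1,  1};
    double L22[] = {4, 2,  0, 5};                     /* column-major 2x2 */
    double b[]   = {2, 7, 5, 9}, x[4];

    sptrsv_csr(2, p11, i11, v11, b, x);               /* step (1)         */
    spmv_sub  (2, p21, i21, v21, x, b + 2);           /* step (2)         */
    dtrsv_lower(2, L22, b + 2, x + 2);                /* step (3)         */
    printf("x = [%g %g %g %g]\n", x[0], x[1], x[2], x[3]); /* 1 2 0.75 1.3 */
    return 0;
}
```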

  • Sparse Kernels and Optimizations
    Kernels: sparse matrix-vector multiply (SpMV), y=A*x; sparse triangular solve (SpTS), x=T^(-1)*b; y=AAT*x, y=ATA*x; powers (y=Ak*x), sparse triple product (R*A*RT), …
    Optimization techniques (implementation space): register blocking; cache blocking; multiple dense vectors (x); A has special structure (e.g., symmetric, banded, …); hybrid data structures (e.g., splitting, switch-to-dense, …); matrix reordering
    How and when do we search? Off-line: benchmark implementations. Run-time: estimate matrix properties and evaluate performance models based on the benchmark data.

  • Cache Blocked SpMV on LSI Matrix: Ultra 2i
    A: 10k x 255k, 3.7M non-zeros

    Baseline: 16 Mflop/s

    Best block size & performance: 16k x 64k, 28 Mflop/s

  • Cache Blocking on LSI Matrix: Pentium 4
    A: 10k x 255k, 3.7M non-zeros

    Baseline: 44 Mflop/s

    Best block size & performance: 16k x 16k, 210 Mflop/s

  • Cache Blocked SpMV on LSI Matrix: Itanium
    A: 10k x 255k, 3.7M non-zeros

    Baseline: 25 Mflop/s

    Best block size & performance: 16k x 32k, 72 Mflop/s

  • Cache Blocked SpMV on LSI Matrix: Itanium 2
    A: 10k x 255k, 3.7M non-zeros

    Baseline: 170 Mflop/s

    Best block size & performance: 16k x 65k, 275 Mflop/s

  • Inter-Iteration Sparse Tiling (1/3) [Strout, et al., '01]
    Let A be 6x6 tridiagonal; consider y = A^2*x, computed as t = A*x, y = A*t
    Nodes: vector elements; edges: matrix elements a_ij

  • Inter-Iteration Sparse Tiling (2/3) [Strout, et al., '01]
    Let A be 6x6 tridiagonal; consider y = A^2*x, computed as t = A*x, y = A*t
    Nodes: vector elements; edges: matrix elements a_ij

    Orange = everything needed to compute y1; reuse a11, a12

  • Inter-Iteration Sparse Tiling (3/3) [Strout, et al., '01]
    Let A be 6x6 tridiagonal; consider y = A^2*x, computed as t = A*x, y = A*t
    Nodes: vector elements; edges: matrix elements a_ij

    Orange = everything needed to compute y1; reuse a11, a12
    Grey = everything needed to compute y2, y3; reuse a23, a33, a43
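    For reference, the untiled baseline these slides start from is just two CSR SpMVs, t = A*x followed by y = A*t, so A is read twice; the sketch below shows that baseline on made-up tridiagonal data and notes where inter-iteration tiling would reorder the work.

```c
/* Baseline y = A^2*x via two passes; inter-iteration sparse tiling
 * [Strout, et al., '01] would interleave the two loops so the entries
 * needed for each small group of y's are reused while still in cache.   */
#include <stdio.h>

static void spmv(int n, const int *ptr, const int *ind,
                 const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            t += val[k] * x[ind[k]];
        y[i] = t;
    }
}

int main(void)
{
    /* 6x6 tridiagonal A with 1 on the three diagonals (as on the slides). */
    enum { N = 6 };
    int ptr[N+1], ind[3*N]; double val[3*N];
    int nz = 0;
    for (int i = 0; i < N; i++) {
        ptr[i] = nz;
        for (int j = i-1; j <= i+1; j++)
            if (j >= 0 && j < N) { ind[nz] = j; val[nz] = 1.0; nz++; }
    }
    ptr[N] = nz;

    double x[N] = {1, 1, 1, 1, 1, 1}, t[N], y[N];
    spmv(N, ptr, ind, val, x, t);   /* t = A*x : first read of A  */
    spmv(N, ptr, ind, val, t, y);   /* y = A*t : second read of A */
    for (int i = 0; i < N; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```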

  • Inter-Iteration Sparse Tiling: Issues
    Tile sizes (colored regions) grow with the number of iterations and with increasing out-degree; G is likely to have a few nodes with high out-degree (e.g., Yahoo)
    Mathematical tricks to limit tile size? Judicious dropping of edges [Ng '01]

  • Summary and Questions
    Need to understand both matrix structure and the machine; BeBOP: a suite of techniques to deal with different sparse structures and architectures
    Google matrix problem: established techniques within an iteration; ideas for inter-iteration optimizations; the mathematical structure of the problem may help
    Questions: What is the structure of G? What are the computational bottlenecks? Enabling future computations, e.g., topic-sensitive PageRank (the multiple-vector version) [Haveliwala '02]?
    See www.cs.berkeley.edu/~richie/bebop/intel/google for more info, including more complete Itanium 2 results.

  • Exploiting Matrix Structure
    Symmetry (numerical or structural): reuse matrix entries; can combine with register blocking, multiple vectors, …
    Matrix splitting: split the matrix, e.g., into r x c and 1 x 1 parts; no fill overhead
    Large matrices with random structure (e.g., Latent Semantic Indexing (LSI) matrices): the technique is cache blocking; store the matrix as 2^i x 2^j sparse submatrices; effective when the x vector is large; currently, search to find the fastest size
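    A minimal sketch of the symmetry optimization: store only the lower triangle in CSR and let each off-diagonal entry update both y_i and y_j, so every stored value is used twice per multiply. The 3x3 matrix is made up, and this is a sketch rather than the BeBOP/OSKI implementation.

```c
/* Symmetric SpMV from the lower triangle only: each stored a_ij (i>j)
 * contributes to y[i] via x[j] and, mirrored, to y[j] via x[i].         */
#include <stdio.h>

static void spmv_symm_lower(int n, const int *ptr, const int *ind,
                            const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++) {
            int j = ind[k];
            t += val[k] * x[j];                 /* contribution to y[i]   */
            if (j != i)
                y[j] += val[k] * x[i];          /* mirrored entry a_ji    */
        }
        y[i] += t;
    }
}

int main(void)
{
    /* Lower triangle of the symmetric matrix [[2,1,0],[1,3,4],[0,4,5]]. */
    int    ptr[] = {0, 1, 3, 5};
    int    ind[] = {0,  0, 1,  1, 2};
    double val[] = {2,  1, 3,  4, 5};
    double x[]   = {1, 2, 3}, y[] = {0, 0, 0};

    spmv_symm_lower(3, ptr, ind, val, x, y);
    printf("y = [%g %g %g]\n", y[0], y[1], y[2]);   /* expect [4 19 23] */
    return 0;
}
```

    Relative to full CSR, this roughly halves the matrix data streamed per multiply, which is the intuition behind the symmetric-storage speedups on the following slide.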

  • Symmetric SpMV Performance: Pentium 4

  • SpMV with Split Matrices: Ultra 2i

  • Cache Blocking on Random Matrices: Itanium
    Speedup on four banded random matrices.

  • Sparse Kernels and Optimizations
    Kernels: sparse matrix-vector multiply (SpMV), y=A*x; sparse triangular solve (SpTS), x=T^(-1)*b; y=AAT*x, y=ATA*x; powers (y=Ak*x), sparse triple product (R*A*RT), …
    Optimization techniques (implementation space): register blocking; cache blocking; multiple dense vectors (x); A has special structure (e.g., symmetric, banded, …); hybrid data structures (e.g., splitting, switch-to-dense, …); matrix reordering
    How and when do we search? Off-line: benchmark implementations. Run-time: estimate matrix properties and evaluate performance models based on the benchmark data.

  • Register Blocked SpMV: Pentium III

  • Register Blocked SpMV: Ultra 2i

  • Register Blocked SpMV: Power3

  • Register Blocked SpMV: Itanium

  • Possible Optimization Techniques
    Within an iteration, i.e., computing (G+uuT)*x once: cache block G*x; on linear programming matrices and matrices with random structure (e.g., LSI), 1.5–4x speedups; the best block size is matrix and machine dependent; reordering and/or splitting of G to separate dense structure (rows, columns, blocks)
    Between iterations, e.g., (G+uuT)^2*x:
      (G+uuT)^2*x = G^2*x + (Gu)(uT*x) + u(uT*G*x) + u(uT*u)(uT*x)
    Compute Gu, uT*G, and uT*u once for all iterations; for G^2*x, use inter-iteration tiling to read G only once

  • Multiple Vector Performance: Itanium

  • Sparse Kernels and Optimizations
    Kernels: sparse matrix-vector multiply (SpMV), y=A*x; sparse triangular solve (SpTS), x=T^(-1)*b; y=AAT*x, y=ATA*x; powers (y=Ak*x), sparse triple product (R*A*RT), …
    Optimization techniques (implementation space): register blocking; cache blocking; multiple dense vectors (x); A has special structure (e.g., symmetric, banded, …); hybrid data structures (e.g., splitting, switch-to-dense, …); matrix reordering
    How and when do we search? Off-line: benchmark implementations. Run-time: estimate matrix properties and evaluate performance models based on the benchmark data.

  • SpTS Performance: Itanium (see the POHLL '02 workshop paper, at ICS '02)

  • Sparse Kernels and Optimizations
    Kernels: sparse matrix-vector multiply (SpMV), y=A*x; sparse triangular solve (SpTS), x=T^(-1)*b; y=AAT*x, y=ATA*x; powers (y=Ak*x), sparse triple product (R*A*RT), …
    Optimization techniques (implementation space): register blocking; cache blocking; multiple dense vectors (x); A has special structure (e.g., symmetric, banded, …); hybrid data structures (e.g., splitting, switch-to-dense, …); matrix reordering
    How and when do we search? Off-line: benchmark implementations. Run-time: estimate matrix properties and evaluate performance models based on the benchmark data.

  • Optimizing AAT*x
    Kernel: y = AAT*x, where A is sparse and x & y are dense; arises in linear programming and in computing the SVD
    Conventional implementation: compute z = AT*x, then y = A*z
    Elements of A can be reused; when the a_k represent blocks of columns, register blocking can be applied (see the sketch below)
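    For the closely related kernel y = ATA*x (also listed on the kernels slide), the reuse idea is easy to sketch in C: since A^T*A is the sum of a_i*a_i^T over the rows a_i of A, each CSR row can be read once and used twice, first in a dot product with x and then in a scaled accumulation into y, instead of making two separate passes z = A*x, y = A^T*z. The example data is made up.

```c
/* y = A^T * (A * x) with per-row reuse of the CSR data.                 */
#include <stdio.h>

static void atax_csr(int m, int n, const int *ptr, const int *ind,
                     const double *val, const double *x, double *y)
{
    for (int j = 0; j < n; j++) y[j] = 0.0;
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)     /* t = a_i . x          */
            t += val[k] * x[ind[k]];
        for (int k = ptr[i]; k < ptr[i+1]; k++)     /* y += t * a_i         */
            y[ind[k]] += t * val[k];                /* row reused from cache */
    }
}

int main(void)
{
    /* A = [[1,2],[0,3]], x = [1,1]:  A*x = [3,3],  A^T*(A*x) = [3,15].  */
    int    ptr[] = {0, 2, 3};
    int    ind[] = {0, 1, 1};
    double val[] = {1, 2, 3};
    double x[]   = {1, 1}, y[2];

    atax_csr(2, 2, ptr, ind, val, x, y);
    printf("y = [%g %g]\n", y[0], y[1]);
    return 0;
}
```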

  • Optimized AAT*x Performance: Pentium III

  • Current Directions
    Applying new optimizations: other split data structures (variable block, diagonal, …); matrix reordering to create block structure; structural symmetry
    New kernels (triple product R*A*RT, powers Ak, …); tuning parameter selection
    Building an automatically tuned sparse matrix library: extending the Sparse BLAS; leveraging existing sparse compilers as code-generation infrastructure
    More thoughts on this topic tomorrow

  • Related Work
    Automatic performance tuning systems: PHiPAC [Bilmes, et al., '97], ATLAS [Whaley & Dongarra '98], FFTW [Frigo & Johnson '98], SPIRAL [Pueschel, et al., '00], UHFFT [Mirkovic and Johnsson '00], MPI collective operations [Vadhiyar & Dongarra '01]
    Code generation: FLAME [Gunnels & van de Geijn, '01]; sparse compilers [Bik '99], Bernoulli [Pingali, et al., '97]; generic programming: Blitz++ [Veldhuizen '98], MTL [Siek & Lumsdaine '98], GMCL [Czarnecki, et al. '98], …
    Sparse performance modeling: [Temam & Jalby '92], [White & Saddayappan '97], [Navarro, et al., '96], [Heras, et al., '99], [Fraguela, et al., '99], …

  • More Related Work
    Compiler analysis and models: CROPS [Carter, Ferrante, et al.]; serial sparse tiling [Strout '01]; TUNE [Chatterjee, et al.]; iterative compilation [O'Boyle, et al., '98]; Broadway compiler [Guyer & Lin, '99]; [Brewer '95]; ADAPT [Voss '00]
    Sparse BLAS interfaces: BLAST Forum (Chapter 3); NIST Sparse BLAS [Remington & Pozo '94]; SparseLib++; SPARSKIT [Saad '94]; Parallel Sparse BLAS [Fillipone, et al. '96]

  • Context: Creating High-Performance Libraries
    Application performance is dominated by a few computational kernels
    Today: kernels are hand-tuned by the vendor or the user
    Performance tuning challenges: performance is a complicated function of kernel, architecture, compiler, and workload; tuning is tedious and time-consuming
    Successful automated approaches: dense linear algebra (ATLAS/PHiPAC); signal processing (FFTW/SPIRAL/UHFFT)

  • Cache Blocked SpMV on LSI Matrix: Itanium

  • Sustainable Memory Bandwidth

  • Multiple Vector Performance: Pentium 4

  • Multiple Vector Performance: Itanium

  • Multiple Vector Performance: Pentium 4

  • Optimized AAT*x Performance: Ultra 2i

  • Optimized AAT*x Performance: Pentium 4

  • Tuning Pays Off: PHiPAC

  • Tuning Pays Off: ATLAS
    Extends the applicability of PHiPAC; incorporated in Matlab (with the rest of LAPACK)

  • Register Tile Sizes (Dense Matrix Multiply): 333 MHz Sun Ultra 2i

    2-D slice of 3-D space; implementations color-coded by performance in Mflop/s

    16 registers, but 2-by-3 tile size fastest

  • High Precision GEMV (XBLAS)

  • High Precision Algorithms (XBLAS)
    Double-double (a high-precision word represented as a pair of doubles); many variations on these algorithms; we currently use Bailey's
    Exploiting extra-wide registers: suppose s(1), …, s(n) have f-bit fractions and SUM has an F > f bit fraction
    Consider the following algorithm for S = Σ_{i=1..n} s(i): sort so that |s(1)| ≥ |s(2)| ≥ … ≥ |s(n)|; SUM = 0; for i = 1 to n, SUM = SUM + s(i); end for; sum = SUM
    Theorem (D., Hida): Suppose F …
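    The "extra-wide accumulator" idea above can be illustrated with a tiny C experiment (a sketch only, not the XBLAS code): f-bit summands (float) are sorted by decreasing magnitude and accumulated in an F-bit register (double), and the result is compared with naive float accumulation on made-up, badly scaled data.

```c
/* Sorted summation into a wider accumulator vs. naive float summation.  */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int by_decreasing_magnitude(const void *a, const void *b)
{
    double d = fabsf(*(const float *)b) - fabsf(*(const float *)a);
    return (d > 0) - (d < 0);
}

int main(void)
{
    enum { N = 100000 };
    static float s[N];
    for (int i = 0; i < N; i++)                 /* made-up, ill-scaled data */
        s[i] = (i % 2 ? 1.0f : -1.0f)
             * (1.0f + 1e-7f * i)
             * (i % 7 ? 1e-3f : 1e4f);

    float naive = 0.0f;                          /* f-bit accumulator       */
    for (int i = 0; i < N; i++) naive += s[i];

    qsort(s, N, sizeof s[0], by_decreasing_magnitude);
    double wide = 0.0;                           /* F > f bit accumulator   */
    for (int i = 0; i < N; i++) wide += s[i];

    printf("naive float sum:                %.8g\n", naive);
    printf("sorted, double-accumulated sum: %.17g\n", wide);
    return 0;
}
```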
  • More Extra Slides

  • Opteron (1.4 GHz, 2.8 Gflop/s peak)
    Beats ATLAS DGEMV's 365 Mflop/s
    Non-symmetric peak = 504 Mflop/s; symmetric peak = 612 Mflop/s

  • Awards
    Best Paper, Intern. Conf. Parallel Processing, 2004: "Performance models for evaluation and automatic performance tuning of symmetric sparse matrix-vector multiply"
    Best Student Paper, Intern. Conf. Supercomputing, Workshop on Performance Optimization via High-Level Languages and Libraries, 2003 (Best Student Presentation too), to Richard Vuduc: "Automatic performance tuning and analysis of sparse triangular solve"
    Finalist, Best Student Paper, Supercomputing 2002, to Richard Vuduc: "Performance Optimization and Bounds for Sparse Matrix-vector Multiply"

    Best Presentation Prize, MICRO-33: 3rd ACM Workshop on Feedback-Directed Dynamic Optimization, 2000, to Richard Vuduc: "Statistical Modeling of Feedback Data in an Automatic Tuning System"

  • Accuracy of the Tuning Heuristics (4/4)

  • Can Match DGEMV Performance

  • Fraction of Upper Bound Across Platforms

  • Achieved Performance and Machine Balance
    Machine balance [Callahan '88; McCalpin '95]: Balance = Peak Flop Rate / Bandwidth (flops per double); lower is better (i.e., one can hope to reach a higher fraction of the peak flop rate)
    Ideal balance for mat-vec: 2 flops per double; for SpMV, even less

  • Where Does the Time Go?
    Most time is assigned to memory
    Caches "disappear" when line sizes are equal
    Strictly increasing line sizes

  • Execution Time Breakdown: Matrix 40

  • Execution Time Breakdown (PAPI): Matrix 40

  • Speedups with Increasing Line Size

  • Summary: Performance Upper Bounds
    What is the best we can do for SpMV? Limits to low-level tuning of blocked implementations; refinements?
    What machines are good for SpMV? Partial answer: balance characterization
    Architectural consequences? It helps to have strictly increasing line sizes

  • Evaluating algorithms and machines for SpMV

    Metrics: speedups; Mflop/s ("fair" flops); fraction of peak
    Questions: Speedups are good, but what is the best, independent of instruction scheduling and selection? Can SpMV be further improved or not? What machines are good for SpMV?

  • Tuning Dense BLAS: PHiPAC

  • Tuning Dense BLAS: ATLAS
    Extends the applicability of PHiPAC; incorporated in Matlab (with the rest of LAPACK)

  • Statistical Models for Automatic Tuning
    Idea 1: a statistical criterion for stopping a search. A general search model: generate an implementation, measure its performance, repeat; stop when the probability of being within ε of optimal falls below a threshold; the distribution can be estimated on-line.
    Idea 2: statistical performance models. Problem: choose 1 among m implementations at run-time; sample performance off-line and build a statistical model.

  • SpMV Historical Trends: Fraction of Peak

  • Motivation for Automatic Performance Tuning of SpMV
    Historical trends: sparse matrix-vector multiply (SpMV) runs at 10% of peak or less
    Performance depends on machine, kernel, and matrix; the matrix is known only at run-time; the best data structure + implementation can be surprising
    Our approach: empirical performance modeling and algorithm search

  • Accuracy of the Tuning Heuristics (3/4)

  • Accuracy of the Tuning Heuristics (3/4)DGEMV

  • Sparse CALA Summary
    Optimizing one SpMV at a time is too limiting; take k steps of a Krylov subspace method (GMRES, CG, Lanczos, Arnoldi)
    Assume the matrix is well-partitioned, with a modest surface-to-volume ratio
    Parallel implementation: conventional, O(k log p) messages; CALA, O(log p) messages (optimal)
    Serial implementation: conventional, O(k) moves of data from slow to fast memory; CALA, O(1) moves of data (optimal)
    Can incorporate some preconditioners: need to be able to compress interactions between distant i, j (hierarchical, semiseparable matrices, …)
    Lots of speedup possible (measured and modeled); price: some redundant computation

  • Potential Impact on Applications: T3P
    Application: accelerator design [Ko]; 80% of time spent in SpMV
    Relevant optimization techniques: symmetric storage, register blocking
    On a single-processor Itanium 2: 1.68x speedup (532 Mflop/s, or 15% of the 3.6 Gflop/s peak); 4.4x speedup with multiple (8) vectors (1380 Mflop/s, or 38% of peak)

  • Locally Dependent Entries for [x, Ax], A tridiagonal
    2 processors; can be computed without communication (Proc 1 / Proc 2)

  • Locally Dependent Entries for [x, Ax, A^2x], A tridiagonal
    2 processors; can be computed without communication (Proc 1 / Proc 2)

  • Locally Dependent Entries for [x, Ax, …, A^3x], A tridiagonal
    2 processors; can be computed without communication (Proc 1 / Proc 2)

  • Locally Dependent Entries for [x, Ax, …, A^4x], A tridiagonal
    2 processors; can be computed without communication (Proc 1 / Proc 2)

  • Sequential [x, Ax, …, A^4x], with memory hierarchy
    One read of the matrix from slow memory, not k = 4
    Minimizes words moved (the bandwidth cost); no redundant work
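    For contrast with the one-read scheme above, the conventional matrix powers kernel below performs k separate CSR SpMVs and therefore reads A from slow memory k times; the communication-avoiding reorganization (blocking rows, with ghost entries, so that A is read only once) is described on the slides but not implemented in this made-up tridiagonal example.

```c
/* Conventional [x, Ax, ..., A^k x]: k full passes over A.               */
#include <stdio.h>

#define N 8
#define K 4

static void spmv(const int *ptr, const int *ind, const double *val,
                 const double *x, double *y)
{
    for (int i = 0; i < N; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            t += val[k] * x[ind[k]];
        y[i] = t;
    }
}

int main(void)
{
    int ptr[N+1], ind[3*N]; double val[3*N];
    int nz = 0;
    for (int i = 0; i < N; i++) {               /* tridiagonal A of ones   */
        ptr[i] = nz;
        for (int j = i-1; j <= i+1; j++)
            if (j >= 0 && j < N) { ind[nz] = j; val[nz] = 1.0; nz++; }
    }
    ptr[N] = nz;

    double V[K+1][N];                            /* V[j] = A^j * x          */
    for (int i = 0; i < N; i++) V[0][i] = 1.0;
    for (int j = 1; j <= K; j++)
        spmv(ptr, ind, val, V[j-1], V[j]);       /* one full read of A each */

    printf("(A^%d x)[0] = %g\n", K, V[K][0]);
    return 0;
}
```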

    Speaker notes from the original slides:

    Start-up based on SPIRAL?

    A 2-D slice of the 3-D core matmul space, with all other generator options fixed. This shows the performance (Mflop/s) of various core matmul implementations on a small in-L2-cache workload. The best is the dark red square at (m0=2, k0=1, n0=8), which achieved 620 Mflop/s. This experiment was performed on a Sun Ultra-10 workstation (333 MHz Ultra II processor, 2 MB L2 cache) using the Sun cc compiler v5.0 with the flags -dalign -xtarget=native -xO5 -xarch=v8plusa. The space is discrete and highly irregular. (This is not even the worst example!)

    Paper: "Statistical Models for Empirical Search-Based Performance Tuning", International Journal of High Performance Computing Applications, 18 (1), pp. 65-94, February 2004; Richard Vuduc, James W. Demmel, and Jeff A. Bilmes.

    Archana Ganapathi, "Predicting and Optimizing System Utilization and Performance via Statistical Machine Learning", Computer Science PhD Thesis, University of California, Berkeley, UCB//EECS-2009-181.

    First 20,000 columns of the 4k x 1M matrix rail4284.

    You probably use a sparse matrix every day, albeit indirectly. Here, I show a million x million submatrix of the web connectivity graph, constructed from an archive at the Stanford WebBase.

    A one-slide crash course on a basic sparse matrix data structure (compressed sparse row, or CSR, format) and a basic kernel: matrix-vector multiply. In CSR, store each non-zero value plus an extra index for each non-zero: less storage than dense overall, but a higher storage cost per non-zero. Sparse matrix-vector multiply (SpMV): no reuse of A, just of the vectors x and y, taken to be dense. SpMV with CSR storage: indirect, irregular accesses to x!

    So what you'd like to do is: unroll the k loop (but you need to know how many non-zeros per row); hoist y[i] (not a problem, absent aliasing); eliminate ind[i] (it's an extra load, but you need to know the non-zero pattern); reuse elements of x (but you need to know the non-zero pattern).

    FEM discretization from a structural analysis problem. Source: matrix "raefsky", provided by NASA via the University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices/Simon/raefsky3.html

    Experiment: try all block sizes that divide 8x8 (16 implementations in all). Platform: 900 MHz Itanium 2, 3.6 Gflop/s peak speed, Intel v7.0 compiler. Good speedups (4x) but at an unexpected block size (4x2). Figure taken from the Im, Yelick, Vuduc IJHPCA paper, to appear.

    An enlarged submatrix for ex11, from an FEM fluid flow application. Original matrix: n = 16614, nnz = 1.1M. So…