CS267 – Lecture 14: Automatic Performance Tuning and Sparse-Matrix-Vector Multiplication (SpMV)


  • CS267 Lecture 14

    Automatic Performance Tuning and Sparse-Matrix-Vector Multiplication (SpMV), James Demmel

    www.cs.berkeley.edu/~demmel/cs267_Spr13

  • Outline
    Motivation for Automatic Performance Tuning
    Results for sparse matrix kernels
    OSKI = Optimized Sparse Kernel Interface
    pOSKI for multicore
    Tuning Higher Level Algorithms
    Future Work, Class Projects

    BeBOP: Berkeley Benchmarking and Optimization Group
    Many results shown from current and former members
    Meet weekly, M 1-2, in 380 Soda

  • Motivation for Automatic Performance Tuning
    Writing high performance software is hard
    Make programming easier while getting high speed
    Ideal: program in your favorite high level language (Matlab, Python, PETSc) and get a high fraction of peak performance
    Reality: the best algorithm (and its implementation) can depend strongly on the problem, computer architecture, compiler, ...
    Best choice can depend on knowing a lot of applied mathematics and computer science
    How much of this can we teach? How much of this can we automate?

  • Examples of Automatic Performance Tuning (1)
    Dense BLAS (sequential): PHiPAC (UCB), then ATLAS (UTK) (used in Matlab), math-atlas.sourceforge.net; internal vendor tools
    Fast Fourier Transform (FFT) & variations (sequential and parallel): FFTW (MIT), www.fftw.org
    Digital Signal Processing: SPIRAL, www.spiral.net (CMU)
    Communication Collectives (UCB, UTK)
    Rose (LLNL), Bernoulli (Cornell), Telescoping Languages (Rice), ...
    More projects, conferences, government reports, ...

  • Examples of Automatic Performance Tuning (2)
    What do dense BLAS, FFTs, signal processing, MPI reductions have in common?
    Can do the tuning off-line: once per architecture, algorithm
    Can take as much time as necessary (hours, a week)
    At run-time, algorithm choice may depend only on a few parameters (matrix dimension, size of FFT, etc.)

  • Tuning Register Tile Sizes (Dense Matrix Multiply), 333 MHz Sun Ultra 2i

    2-D slice of 3-D space; implementations color-coded by performance in Mflop/s

    16 registers, but a 2-by-3 tile size is fastest: a needle in a haystack

  • Example: Select a Matmul Implementation

  • Example: Support Vector Classification

  • Machine Learning in Automatic Performance Tuning
    References:
    Statistical Models for Empirical Search-Based Performance Tuning, Richard Vuduc, J. Demmel, and Jeff A. Bilmes, International Journal of High Performance Computing Applications, 18 (1), pp. 65-94, February 2004.
    Predicting and Optimizing System Utilization and Performance via Statistical Machine Learning, Archana Ganapathi, Computer Science PhD Thesis, University of California, Berkeley, UCB//EECS-2009-181.

  • Examples of Automatic Performance Tuning (3)
    What do dense BLAS, FFTs, signal processing, MPI reductions have in common?
    Can do the tuning off-line: once per architecture, algorithm
    Can take as much time as necessary (hours, a week)
    At run-time, algorithm choice may depend only on a few parameters (matrix dimension, size of FFT, etc.)
    Can't always do off-line tuning:
    Algorithm and implementation may strongly depend on data known only at run-time
    Ex: the sparse matrix nonzero pattern determines both the best data structure and the best implementation of sparse-matrix-vector multiplication (SpMV)
    Part of the search for the best algorithm must be done (very quickly!) at run-time

  • Source: Accelerator Cavity Design Problem (Ko via Husbands)

  • Linear Programming Matrix

  • A Sparse Matrix You Encounter Every Day

  • SpMV with Compressed Sparse Row (CSR) Storage
    Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)*x(j)

    for each row i
      for k = ptr[i] to ptr[i+1]-1 do
        y[i] = y[i] + val[k]*x[ind[k]]
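
    As a concrete reference, here is a minimal, self-contained C sketch of the CSR kernel above; the struct and function names (csr_t, spmv_csr) are illustrative, not from OSKI.

        #include <stddef.h>

        /* Compressed Sparse Row: row i's nonzeros are val[ptr[i] .. ptr[i+1]-1],
           with column indices in ind[] over the same range. */
        typedef struct {
            int n;          /* number of rows */
            int *ptr;       /* row pointers, length n+1 */
            int *ind;       /* column indices, length nnz */
            double *val;    /* nonzero values, length nnz */
        } csr_t;

        /* y = y + A*x: 2 flops per stored nonzero */
        void spmv_csr(const csr_t *A, const double *x, double *y)
        {
            for (int i = 0; i < A->n; i++) {
                double yi = y[i];
                for (int k = A->ptr[i]; k < A->ptr[i+1]; k++)
                    yi += A->val[k] * x[A->ind[k]];
                y[i] = yi;
            }
        }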

  • Example: The Difficulty of Tuning
    n = 21200, nnz = 1.5 M, kernel: SpMV

    Source: NASA structural analysis problem

  • Example: The Difficulty of Tuning
    n = 21200, nnz = 1.5 M, kernel: SpMV

    Source: NASA structural analysis problem (8x8 dense substructure)

  • Taking advantage of block structure in SpMV
    Bottleneck is the time to get the matrix from memory: only 2 flops for each nonzero in the matrix
    Don't store each nonzero with its own index; instead store each nonzero r-by-c block with one index
    Storage drops by up to 2x if r*c >> 1 (all 32-bit quantities), so the time to fetch the matrix from memory decreases
    Change both data structure and algorithm: need to pick r and c, and change the algorithm accordingly
    In the example, is r=c=8 the best choice? It minimizes storage, so it looks like a good idea
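
    A minimal sketch of the r-by-c register-blocked idea (BCSR), hard-coded to 2x2 blocks for readability; the names are illustrative, explicit zeros are assumed to have been filled in when the blocks were formed, and the number of rows is assumed even (pad otherwise).

        /* BCSR with fixed 2x2 blocks: block row I covers rows 2I and 2I+1.
           bval stores blocks contiguously (4 doubles each, row-major);
           bind holds the first column index of each block. */
        void spmv_bcsr_2x2(int n_block_rows, const int *bptr, const int *bind,
                           const double *bval, const double *x, double *y)
        {
            for (int I = 0; I < n_block_rows; I++) {
                double y0 = 0.0, y1 = 0.0;              /* registers for 2 rows */
                for (int K = bptr[I]; K < bptr[I+1]; K++) {
                    const double *b = bval + 4*K;       /* the 2x2 block */
                    int j = bind[K];                    /* its first column */
                    y0 += b[0]*x[j] + b[1]*x[j+1];
                    y1 += b[2]*x[j] + b[3]*x[j+1];
                }
                y[2*I]   += y0;
                y[2*I+1] += y1;
            }
        }

    Note that only one index is stored per block instead of one per nonzero, which is where the storage (and memory traffic) savings described above comes from.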

  • Speedups on Itanium 2: The Need for Search (register-profile plot, axes in Mflop/s)

  • Register Profile: Itanium 2 (performance ranges from 190 Mflop/s to 1190 Mflop/s across register block sizes)

  • SpMV Performance (Matrix #2): Generation 1
    Power3 - 13% of peak (100 to 195 Mflop/s), Power4 - 14% (469 to 703 Mflop/s), Itanium 1 - 7% (103 to 225 Mflop/s), Itanium 2 - 31% (276 Mflop/s to 1.1 Gflop/s)

  • Register Profiles: IBM and Intel IA-64
    Power3 - 17% of peak (122 to 252 Mflop/s), Power4 - 16% (459 to 820 Mflop/s), Itanium 1 - 8% (107 to 247 Mflop/s), Itanium 2 - 33% (190 Mflop/s to 1.2 Gflop/s)

  • SpMV Performance (Matrix #2): Generation 2
    Ultra 2i - 9% of peak (35 to 63 Mflop/s), Ultra 3 - 5% (53 to 109 Mflop/s), Pentium III - 19% (42 to 96 Mflop/s), Pentium III-M - 15% (58 to 120 Mflop/s)

  • Register Profiles: Sun and Intel x86
    Ultra 2i - 11% of peak (35 to 72 Mflop/s), Ultra 3 - 5% (50 to 90 Mflop/s), Pentium III - 21% (42 to 108 Mflop/s), Pentium III-M - 15% (58 to 122 Mflop/s)

  • Another example of tuning challenges
    More complicated non-zero structure in general

    N = 16614, NNZ = 1.1 M

  • Zoom in to top corner
    More complicated non-zero structure in general

    N = 16614, NNZ = 1.1 M

  • 3x3 blocks look natural, but ...
    More complicated non-zero structure in general

    Example: 3x3 blocking, logical grid of 3x3 cells

    But would lead to lots of fill-in

  • Extra Work Can Improve Efficiency!
    More complicated non-zero structure in general

    Example: 3x3 blocking, logical grid of 3x3 cells
    Fill in explicit zeros, unroll 3x3 block multiplies
    Fill ratio = 1.5

    On Pentium III: 1.5x speedup! The actual Mflop rate is 1.5^2 = 2.25x higher

  • Automatic Register Block Size Selection
    Selecting the r x c block size:
    Off-line benchmark: precompute Mflops(r,c) using a dense matrix A for each r x c, once per machine/architecture
    Run-time search: sample A to estimate Fill(r,c) for each r x c
    Run-time heuristic model: choose r, c to minimize time ~ Fill(r,c) / Mflops(r,c)
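
    A sketch of the heuristic as stated on the slide: combine the off-line dense benchmark Mflops(r,c) with the run-time fill estimate Fill(r,c) and pick the (r,c) minimizing estimated time per useful flop. The arrays and the 8x8 search space are assumptions for illustration, not the OSKI interface.

        /* Choose (r,c) minimizing  Fill(r,c) / Mflops(r,c).
           mflops[r][c] comes from an off-line dense benchmark;
           fill[r][c] is estimated by sampling the user's matrix at run time. */
        void choose_block_size(const double mflops[8][8], const double fill[8][8],
                               int *best_r, int *best_c)
        {
            double best = 1e300;
            for (int r = 1; r <= 8; r++)
                for (int c = 1; c <= 8; c++) {
                    double t = fill[r-1][c-1] / mflops[r-1][c-1];
                    if (t < best) { best = t; *best_r = r; *best_c = c; }
                }
        }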

  • Accurate and Efficient Adaptive Fill Estimation
    Idea: sample the matrix
    Fraction of matrix to sample: s in [0,1]; cost ~ O(s * nnz)
    Control cost by controlling s
    The search happens at run-time, so the constant matters!
    Control s automatically by computing statistical confidence intervals; idea: monitor the variance
    Cost of tuning:
    Lower bound: converting the matrix costs 5 to 40 unblocked SpMVs
    Heuristic: 1 to 11 SpMVs
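
    A hedged sketch of sampling-based fill estimation for one (r,c): scan a subset of the block rows, count how many r-by-c blocks would be created, and scale. This is a simplification of the adaptive scheme on the slide (it samples every stride-th block row rather than controlling s via confidence intervals).

        #include <stdlib.h>

        /* Estimate Fill(r,c) = (entries stored after r-by-c blocking) / nnz
           by scanning every `stride`-th block row of a CSR matrix. */
        double estimate_fill(int n, int ncols, const int *ptr, const int *ind,
                             int r, int c, int stride)
        {
            int nbc = (ncols + c - 1) / c;          /* number of block columns */
            char *mark = calloc(nbc, 1);
            int  *touched = malloc(nbc * sizeof *touched);
            long blocks = 0, nnz_sampled = 0;

            for (int I = 0; I < n / r; I += stride) {
                int ntouched = 0;
                for (int i = r*I; i < r*(I+1); i++)
                    for (int k = ptr[i]; k < ptr[i+1]; k++) {
                        int J = ind[k] / c;          /* block column hit */
                        nnz_sampled++;
                        if (!mark[J]) { mark[J] = 1; touched[ntouched++] = J; blocks++; }
                    }
                for (int t = 0; t < ntouched; t++)   /* cheap reset of marks */
                    mark[touched[t]] = 0;
            }
            free(mark); free(touched);
            return nnz_sampled ? (double)blocks * r * c / nnz_sampled : 1.0;
        }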

  • Accuracy of the Tuning Heuristics (1/4)
    NOTE: Fair flops used (ops on explicit zeros not counted as work)
    See p. 375 of Vuduc's thesis for the list of matrices

  • Accuracy of the Tuning Heuristics (2/4)

  • Accuracy of the Tuning Heuristics (2/4)DGEMV

  • Upper Bounds on Performance for blocked SpMV
    P = (flops) / (time); flops = 2 * nnz(A)
    Lower bound on time: two main assumptions
    1. Count memory ops only (streaming)
    2. Count only compulsory and capacity misses: ignore conflicts
    Account for line sizes, and for matrix size and nnz
    Charge the minimum access latency a_i at the L_i cache and a_mem at memory, e.g., from the Saavedra-Barrera and PMaC MAPS benchmarks
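
    In symbols, the bound has the following general shape (a hedged restatement of the slide's assumptions, not the exact formula from Vuduc's thesis): only memory operations are charged, with h_i loads served at cache level i at best-case latency a_i, and m misses to DRAM at latency a_mem, counting only compulsory and capacity misses.

        \[
          P_{\max} \;=\; \frac{2\,\mathrm{nnz}(A)}{T_{\min}},
          \qquad
          T_{\min} \;\ge\; \sum_{i=1}^{L} \alpha_i\, h_i \;+\; \alpha_{\mathrm{mem}}\, m .
        \]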

  • Example: L2 Misses on Itanium 2Misses measured using PAPI [Browne 00]

  • Example: Bounds on Itanium 2

  • Example: Bounds on Itanium 2

  • Example: Bounds on Itanium 2

  • Summary of Other Sequential Performance Optimizations
    Optimizations for SpMV:
    Register blocking (RB): up to 4x over CSR
    Variable block splitting: 2.1x over CSR, 1.8x over RB
    Diagonals: 2x over CSR
    Reordering to create dense structure + splitting: 2x over CSR
    Symmetry: 2.8x over CSR, 2.6x over RB
    Cache blocking: 2.8x over CSR
    Multiple vectors (SpMM): 7x over CSR
    And combinations ...
    Sparse triangular solve: hybrid sparse/dense data structure, 1.8x over CSR
    Higher-level kernels:
    AA^T*x, A^T A*x: 4x over CSR, 1.8x over RB
    A^2*x: 2x over CSR, 1.5x over RB
    [Ax, A^2x, A^3x, ..., A^kx]

  • Example: Sparse Triangular Factor
    Raefsky4 (structural problem) + SuperLU + colmmd
    N = 19779, nnz = 12.6 M

  • Cache Optimizations for A*A^T*x
    Cache level: interleave multiplication by A and A^T, so A is only fetched from memory once
    Register level: treat a_i^T as an r-by-c block row, or a diagonal row
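
    The same interleaving idea in code, shown for the companion kernel y = A^T A x (which OSKI also provides, per the later slides): each CSR row a_i is read from memory once and used twice, first in the dot product t = a_i^T x and then in the update y += t*a_i. A hedged sketch reusing the csr_t type from the earlier CSR example, not OSKI's implementation.

        /* y = y + A^T * (A*x), streaming each row of A from memory only once;
           the second pass over the row hits in cache. */
        void ata_x_csr(const csr_t *A, const double *x, double *y)
        {
            for (int i = 0; i < A->n; i++) {
                double t = 0.0;
                for (int k = A->ptr[i]; k < A->ptr[i+1]; k++)   /* t = a_i . x  */
                    t += A->val[k] * x[A->ind[k]];
                for (int k = A->ptr[i]; k < A->ptr[i+1]; k++)   /* y += t * a_i */
                    y[A->ind[k]] += t * A->val[k];
            }
        }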

  • Example: Combining Optimizations (1/2)
    Register blocking, symmetry, multiple (k) vectors
    Three low-level tuning parameters: r, c, v

  • Example: Combining Optimizations (2/2)
    Register blocking, symmetry, and multiple vectors [Ben Lee @ UCB]
    Symmetric, blocked, 1 vector: up to 2.6x over nonsymmetric, blocked, 1 vector
    Symmetric, blocked, k vectors: up to 2.1x over nonsymmetric, blocked, k vectors; up to 7.3x over nonsymmetric, nonblocked, 1 vector
    Symmetric storage: up to 64.7% savings

  • Motif/Dwarf: Common Computational Methods (Red Hot / Blue Cool)
    (Original slide is a heat map of how heavily each motif is used across application areas: Embed, SPEC, DB, Games, ML, HPC, Health, Image, Speech, Music, Browser; the colors are not reproduced here.)
    1. Finite State Machine
    2. Combinational
    3. Graph Traversal
    4. Structured Grid
    5. Dense Matrix
    6. Sparse Matrix
    7. Spectral (FFT)
    8. Dynamic Programming
    9. N-Body
    10. MapReduce
    11. Backtrack / Branch & Bound
    12. Graphical Models
    13. Unstructured Grid

  • Why so much about SpMV? Contents of the Sparse Motif
    What is sparse linear algebra?
    Direct solvers for Ax=b, least squares: sparse Gaussian elimination, QR for least squares
    How to choose: crd.lbl.gov/~xiaoye/SuperLU/SparseDirectSurvey.pdf
    Iterative solvers for Ax=b, least squares, Ax=λx, SVD: used when SpMV is the only affordable operation on A (Krylov Subspace Methods)
    How to choose: for Ax=b see www.netlib.org/templates/Templates.html; for Ax=λx see www.cs.ucdavis.edu/~bai/ET/contents.html
    What about Multigrid? In the overlap of the sparse and (un)structured grid motifs; details later

  • How to choose an iterative solver - example
    All methods (GMRES, CGS, CG, ...) depend on SpMV (or variations)
    See www.netlib.org/templates/Templates.html for details

  • Potential Impact on Applications: Omega3P
    Application: accelerator cavity design [Ko]
    Relevant optimization techniques: symmetric storage, register blocking, reordering to create more dense blocks
    Reverse Cuthill-McKee ordering to reduce bandwidth: do breadth-first search, number nodes in reverse order visited
    Traveling Salesman Problem-based ordering to create blocks: nodes = columns of A, weight(u,v) = number of nonzeros u and v have in common, tour = ordering of columns, choose the maximum weight tour; see [Pinar & Heath '97]
    2.1x speedup on Power 4

  • Source: Accelerator Cavity Design Problem (Ko via Husbands)

  • Post-RCM Reordering

  • 100x100 Submatrix Along Diagonal

  • Microscopic Effect of RCM Reordering
    Before: Green + Red; After: Green + Blue

  • Microscopic Effect of Combined RCM+TSP Reordering
    Before: Green + Red; After: Green + Blue

  • (Omega3P)

  • How do permutations affect algorithms?
    A = original matrix, A_P = A with permuted rows and columns
    Naive approach: permute x, multiply y = A_P x, permute y
    Faster way to solve Ax=b: write A_P = P^T A P, where P is a permutation matrix; solve A_P x_P = P^T b for x_P using SpMV with A_P, then let x = P x_P
    Only need to permute vectors twice in total, not twice per iteration
    Faster way to solve Ax=λx: A and A_P have the same eigenvalues, so there are no vectors to permute! A_P x_P = λ x_P implies Ax = λx where x = P x_P
    Where else do optimizations change higher level algorithms? More later
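
    Restating the slide's algebra (standard permutation identities, nothing new):

        \[
          A_P = P^{T}AP, \qquad
          Ax=b \;\Longleftrightarrow\; A_P x_P = P^{T}b \ \text{ with } x = P x_P,
        \]
        \[
          A_P x_P = \lambda x_P \;\Longrightarrow\; A(Px_P) = \lambda (Px_P),
        \]

    so A and A_P share eigenvalues, and solution or eigenvector components are recovered with a single permutation at the end rather than one per iteration.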

  • *Tuning SpMV on Multicore

  • *Multicore SMPs UsedAMD Opteron 2356 (Barcelona)Intel Xeon E5345 (Clovertown)IBM QS20 Cell BladeSun T2+ T5140 (Victoria Falls)Source: Sam Williams

  • *Multicore SMPs Used(Conventional cache-based memory hierarchy)AMD Opteron 2356 (Barcelona)Intel Xeon E5345 (Clovertown)Sun T2+ T5140 (Victoria Falls)IBM QS20 Cell BladeSource: Sam Williams

  • *Multicore SMPs Used(Local store-based memory hierarchy)IBM QS20 Cell BladeSun T2+ T5140 (Victoria Falls)AMD Opteron 2356 (Barcelona)Intel Xeon E5345 (Clovertown)Source: Sam Williams

  • *Multicore SMPs Used(CMT = Chip-MultiThreading)AMD Opteron 2356 (Barcelona)Intel Xeon E5345 (Clovertown)IBM QS20 Cell BladeSun T2+ T5140 (Victoria Falls)Source: Sam Williams

  • *Multicore SMPs Used(threads)*SPEs onlySource: Sam Williams

  • *Multicore SMPs Used(Non-Uniform Memory Access - NUMA)AMD Opteron 2356 (Barcelona)Intel Xeon E5345 (Clovertown)IBM QS20 Cell BladeSun T2+ T5140 (Victoria Falls)*SPEs onlySource: Sam Williams

  • *Multicore SMPs Used (peak double precision flops)
    AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown), IBM QS20 Cell Blade, Sun T2+ T5140 (Victoria Falls)
    75 GFlop/s, 74 GFlop/s, 29* GFlop/s, 19 GFlop/s (*SPEs only)
    Source: Sam Williams

  • *Multicore SMPs Used (total DRAM bandwidth)
    AMD Opteron 2356 (Barcelona), Intel Xeon E5345 (Clovertown), IBM QS20 Cell Blade, Sun T2+ T5140 (Victoria Falls)
    21 GB/s (read) / 10 GB/s (write), 21 GB/s, 51 GB/s, 42 GB/s (read) / 21 GB/s (write) (*SPEs only)
    Source: Sam Williams

  • *Results from Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
    Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

  • *Test matrices
    Suite of 14 matrices, all bigger than the caches of our SMPs; we'll also include a median performance number
    Matrices: Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP
    2K x 2K dense matrix stored in sparse format; well structured (sorted by nonzeros/row); poorly structured hodgepodge; extreme aspect ratio (linear programming)
    Source: Sam Williams

  • SpMV Parallelization
    How do we parallelize a matrix-vector multiplication?
    Source: Sam Williams

  • SpMV Parallelization
    How do we parallelize a matrix-vector multiplication?
    We could parallelize by columns (sparse matrix times dense sub-vector), and in the worst case simplify the random access challenge, but: each thread would need to store a temporary partial sum, and we would need to perform a reduction (inter-thread data dependency)
    (thread 0, thread 1, thread 2, thread 3)
    Source: Sam Williams

  • SpMV Parallelization
    How do we parallelize a matrix-vector multiplication?
    By blocks of rows: no inter-thread data dependencies, but random access to x
    (thread 0, thread 1, thread 2, thread 3)
    Source: Sam Williams

  • *SpMV Performance (simple parallelization)
    Out-of-the-box SpMV performance on a suite of 14 matrices
    Simplest solution = parallelization by rows

    Scalability isn't great. Can we do better?

    (Legend: Naive Pthreads, Naive) Source: Sam Williams

  • Summary of Multicore Optimizations
    NUMA (Non-Uniform Memory Access): pin submatrices to memories close to the cores assigned to them
    Prefetch values, indices, and/or vectors; use exhaustive search on the prefetch distance
    Matrix compression: not just register blocking (BCSR); 32- or 16-bit indices, Block Coordinate format for submatrices
    Cache blocking: 2D partition of the matrix, so the needed parts of x and y fit in cache

    *

  • NUMA (Data Locality for Matrices)
    On NUMA architectures, all large arrays should be partitioned either
    explicitly (multiple malloc()s + affinity), or
    implicitly (parallelize initialization and rely on first touch)
    You cannot partition on granularities smaller than the page size:
    512 elements on x86, 2M elements on Niagara

    For SpMV, partition the matrix and perform multiple malloc()s
    Pin submatrices so they are co-located with the cores tasked to process them
    Source: Sam Williams
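
    A minimal sketch of the implicit (first-touch) option described above, using OpenMP for brevity; the explicit option would instead use one malloc per thread plus an affinity API such as libnuma. This assumes the OS places pages by first touch (the Linux default); the function name is illustrative.

        #include <omp.h>
        #include <stdlib.h>

        /* Allocate and initialize a large array so that each thread's pages are
           first touched, and therefore physically placed, near the core that
           will later use them.  The same static loop partitioning must then be
           used inside the SpMV itself. */
        double *numa_alloc_init(size_t n)
        {
            double *a = malloc(n * sizeof *a);      /* pages not yet placed */
            #pragma omp parallel for schedule(static)
            for (long i = 0; i < (long)n; i++)
                a[i] = 0.0;                         /* first touch places the page */
            return a;
        }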

  • NUMA(Data Locality for Matrices)*Source: Sam Williams

  • Prefetch for SpMV
    Software prefetch injects more MLP (memory-level parallelism) into the memory subsystem; it supplements the hardware prefetchers
    Can try to prefetch the values, the indices, the source vector, or any combination thereof
    In general, should only insert one prefetch per cache line (works best on unrolled code)

    for (all rows) {
      y0 = 0.0; y1 = 0.0; y2 = 0.0; y3 = 0.0;
      for (all tiles in this row) {
        PREFETCH(V + i + PFDistance);      /* software prefetch at a tuned distance */
        y0 += V[i  ] * X[C[i]];            /* the 4 rows of the tile share one column index C[i] */
        y1 += V[i+1] * X[C[i]];
        y2 += V[i+2] * X[C[i]];
        y3 += V[i+3] * X[C[i]];
      }
      y[r+0] = y0; y[r+1] = y1; y[r+2] = y2; y[r+3] = y3;
    }

    Source: Sam Williams

  • SpMV Performance (NUMA and Software Prefetching)
    NUMA-aware allocation is essential on memory-bound NUMA SMPs
    Explicit software prefetching can boost bandwidth and change cache replacement policies
    Cell PPEs are likely latency-limited

    Used exhaustive search for the best prefetch distance
    Source: Sam Williams

  • Matrix Compression
    Goal: minimize memory traffic
    Register blocking: choose the block size to minimize memory traffic; only power-of-2 block sizes, which simplifies the search and achieves most of the possible speedup
    Shorter indices: 32-bit, or 16-bit if possible
    Different sparse matrix formats:
    BCSR (block compressed sparse row): like CSR but with register blocks
    BCOO (block coordinate): stores the row and column index of each register block; better on very sparse sub-blocks (see cache blocking later)
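
    To make the compression options concrete, here is a hedged sketch of what per-thread submatrix storage might look like with 16-bit indices relative to the thread's starting row/column; these structs are illustrative only, not the formats used in the paper's code.

        #include <stdint.h>

        /* Register-blocked CSR (BCSR): one 16-bit column index per r-by-c block;
           indices are relative to the thread's first column, so they fit in 16
           bits as long as the thread's column span is under 65536 blocks. */
        typedef struct {
            int r, c;             /* register block size */
            int n_block_rows;
            int *bptr;            /* block-row pointers */
            uint16_t *bind;       /* relative block-column indices */
            double *bval;         /* r*c values per block, explicit zeros filled */
        } bcsr16_t;

        /* Block coordinate (BCOO): both row and column index stored per block,
           which wins when block rows are mostly empty (very sparse cache blocks). */
        typedef struct {
            int r, c;
            long n_blocks;
            uint16_t *brow, *bcol;  /* relative block row / column indices */
            double *bval;
        } bcoo16_t;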

  • ILP/DLP vs Bandwidth
    In the multicore era, which is the bigger issue: a lack of ILP/DLP (a major advantage of BCSR), or insufficient memory bandwidth per core?

    On many architectures running low arithmetic intensity kernels, there is so little available memory bandwidth per core that you won't notice a complete lack of ILP

    Perhaps we should concentrate on minimizing memory traffic rather than maximizing ILP/DLP

    Rather than benchmarking every combination, just select the register blocking that minimizes the matrix footprint

    *Source: Sam Williams

  • Matrix Compression Strategies
    Register blocking creates small dense tiles: better ILP/DLP, reduced overhead per nonzero

    Let each thread select a unique register blocking
    In this work, we only considered power-of-two register blocks and select the register blocking that minimizes memory traffic
    Source: Sam Williams

  • Matrix Compression Strategies
    Where possible we may encode indices with fewer than 32 bits, and we may also select different matrix formats

    In this work, we considered 16-bit and 32-bit indices (relative to the thread's start) and explored BCSR/BCOO (GCSR in the book chapter)
    Source: Sam Williams

  • SpMV Performance (Matrix Compression)
    After maximizing memory bandwidth, the only hope is to minimize memory traffic
    Compression: exploit register blocking, other formats, smaller indices
    Use a traffic minimization heuristic rather than search
    Benefit is clearly matrix-dependent
    Register blocking enables efficient software prefetching (one per cache line)
    Source: Sam Williams

  • Cache blocking for SpMV (Data Locality for Vectors)
    Store entire submatrices contiguously
    The columns spanned by each cache block are selected to use the same space in cache, i.e., access the same number of x(i)
    TLB blocking is a similar concept, but instead of on 8-byte granularities, it uses 4KB granularities

    (thread 0, thread 1, thread 2, thread 3)
    Source: Sam Williams

  • Cache blocking for SpMV (Data Locality for Vectors)
    Cache-blocking sparse matrices is very different than cache-blocking dense matrices
    Rather than changing loop bounds, store entire submatrices contiguously
    The columns spanned by each cache block are selected so that all submatrices place the same pressure on the cache, i.e., touch the same number of unique source vector cache lines

    TLB blocking is a similar concept, but instead of on 64-byte granularities, it uses 4KB granularities

    Source: Sam Williams

  • *Auto-tuned SpMV Performance (cache and TLB blocking)
    Fully auto-tuned SpMV performance across the suite of matrices
    Why do some optimizations work better on some architectures?

    Matrices with naturally small working sets; architectures with giant caches

    (Legend: +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naive Pthreads, Naive)
    Source: Sam Williams

  • *Auto-tuned SpMV Performance (architecture specific optimizations)
    Fully auto-tuned SpMV performance across the suite of matrices, now including a SPE/local store optimized version
    Why do some optimizations work better on some architectures?
    (Legend: +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naive Pthreads, Naive)
    Source: Sam Williams

  • *Auto-tuned SpMV Performance (max speedup)
    Fully auto-tuned SpMV performance across the suite of matrices, including the SPE/local store optimized version
    Why do some optimizations work better on some architectures?
    Maximum speedups: 2.7x, 4.0x, 2.9x, 35x
    (Legend: +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naive Pthreads, Naive)
    Source: Sam Williams

  • Optimized Sparse Kernel Interface - pOSKI
    Provides sparse kernels automatically tuned for the user's matrix & machine
    BLAS-style functionality: SpMV (Ax and A^T y)
    Hides complexity of run-time tuning

    Based on OSKI (bebop.cs.berkeley.edu/oski), an autotuner for sequential sparse matrix operations: SpMV (Ax and A^T x), A^T A x, solving sparse triangular systems, ...
    So far pOSKI only does multicore optimizations of SpMV (class projects!)
    Up to 4.5x faster SpMV (Ax) on Intel Sandy Bridge E

    On-going work by the Berkeley Benchmarking and Optimization (BeBop) group

  • Optimizations in pOSKI, so far
    Fully automatic heuristics for sparse matrix-vector multiply (Ax, A^T x):
    Register-level blocking, thread-level blocking
    SIMD, software prefetching, software pipelining, loop unrolling
    NUMA-aware allocations

    Plug-in extensibility: very advanced users may write their own heuristics, create new data structures/code variants and dynamically add them to the system

    Other optimizations under development: cache-level blocking, reordering (RCM, TSP), variable block structure, index compression, symmetric storage, etc.

  • How pOSKI Tunes (Overview)
    Library install-time (offline): 1. Build for target architecture; 2. Benchmark generated code variants on a sample dense matrix over register block size (r,c), prefetch distance (d), and SIMD implementation (imp); store the benchmark data and selected code variants
    Application run-time: 1. Partition the workload (user's matrix into per-thread submatrices), using hints from the user, history, and workload from program monitoring; 2. Evaluate models; 3. Select data structure & code (empirical & heuristic search)
    To user: matrix handle for kernel calls

  • How pOSKI Tunes (Overview)
    At library build/install-time:
    Generate code variants: a code generator (Python) generates code variants for various implementations
    Collect benchmark data: measure and record the speed of possible sparse data structures and code variants on the target architecture
    Select the best code variants & benchmark data (prefetch distance, SIMD implementation)
    Installation process uses standard, portable GNU AutoTools
    At run-time:
    Library tunes using heuristic models: models analyze the user's matrix & benchmark data to choose an optimized data structure and code
    User may re-collect benchmark data with the user's sparse matrix (under development)
    Non-trivial tuning cost: up to ~40 mat-vecs; the library limits the time it spends tuning based on the estimated workload, provided by the user or inferred by the library
    User may reduce cost by saving tuning results for the application on future runs with the same or similar matrix (under development)

  • How to Call pOSKI: Basic Usage
    May gradually migrate existing apps: Step 1, wrap existing data structures; Step 2, make BLAS-like kernel calls

    int* ptr = ..., *ind = ...; double* val = ...;   /* Matrix, in CSR format */
    double* x = ..., *y = ...;   /* Let x and y be two dense vectors */

    /* Compute y = y + Ax, 500 times */
    for( i = 0; i < 500; i++ )
      my_matmult( ptr, ind, val, ..., x, b, y );

  • How to Call pOSKI: Basic Usage
    May gradually migrate existing apps: Step 1, wrap existing data structures; Step 2, make BLAS-like kernel calls

    int* ptr = ..., *ind = ...; double* val = ...;   /* Matrix, in CSR format */
    double* x = ..., *y = ...;   /* Let x and y be two dense vectors */

    /* Step 1: Create a default pOSKI thread object */
    poski_threadarg_t *poski_thread = poski_InitThread();

    /* Step 2: Create pOSKI wrappers around this data */
    poski_mat_t A_tunable = poski_CreateMatCSR(ptr, ind, val, nrows, ncols, nnz,
                                               SHARE_INPUTMAT, poski_thread, NULL, ...);
    poski_vec_t x_view = poski_CreateVecView(x, ncols, UNIT_STRIDE, NULL);
    poski_vec_t y_view = poski_CreateVecView(y, nrows, UNIT_STRIDE, NULL);

    /* Compute y = y + Ax, 500 times */
    for( i = 0; i < 500; i++ )
      my_matmult( ptr, ind, val, ..., x, b, y );

  • How to Call pOSKI: Basic Usage
    May gradually migrate existing apps: Step 1, wrap existing data structures; Step 2, make BLAS-like kernel calls

    int* ptr = ..., *ind = ...; double* val = ...;   /* Matrix, in CSR format */
    double* x = ..., *y = ...;   /* Let x and y be two dense vectors */

    /* Step 1: Create a default pOSKI thread object */
    poski_threadarg_t *poski_thread = poski_InitThread();

    /* Step 2: Create pOSKI wrappers around this data */
    poski_mat_t A_tunable = poski_CreateMatCSR(ptr, ind, val, nrows, ncols, nnz,
                                               SHARE_INPUTMAT, poski_thread, NULL, ...);
    poski_vec_t x_view = poski_CreateVecView(x, ncols, UNIT_STRIDE, NULL);
    poski_vec_t y_view = poski_CreateVecView(y, nrows, UNIT_STRIDE, NULL);

    /* Step 3: Compute y = y + Ax, 500 times */
    for( i = 0; i < 500; i++ )
      poski_MatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view);

  • How to Call pOSKI: Tune with Explicit Hints
    User calls the tune routine (optional); may provide explicit tuning hints

    poski_mat_t A_tunable = poski_CreateMatCSR( ... );
    /* ... */

    /* Tell pOSKI we will call SpMV 500 times (workload hint) */
    poski_TuneHint_MatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view, 500);
    /* Tell pOSKI we think the matrix has 8x8 blocks (structural hint) */
    poski_TuneHint_Structure(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);

    /* Ask pOSKI to tune */
    poski_TuneMat(A_tunable);

    for( i = 0; i < 500; i++ )
      poski_MatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view);

  • How to Call pOSKI: Implicit Tuning
    Ask the library to infer the workload (optional); the library profiles all kernel calls and may periodically re-tune

    poski_mat_t A_tunable = poski_CreateMatCSR( ... );
    /* ... */

    for( i = 0; i < 500; i++ ) {
      poski_MatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view);
      poski_TuneMat(A_tunable);   /* Ask pOSKI to tune */
    }

  • How to Call pOSKI: Modify a thread object
    Ask the library to infer thread hints (optional): number of threads, threading model (PthreadPool, Pthread, OpenMP)
    Default: PthreadPool, #threads = #available cores on the system

    poski_threadarg_t *poski_thread = poski_InitThread();
    /* Ask pOSKI to use 8 threads with OpenMP */
    poski_ThreadHints(poski_thread, NULL, OPENMP, 8);
    poski_mat_t A_tunable = poski_CreateMatCSR( ..., poski_thread, ... );
    poski_MatMult( ... );

  • How to Call pOSKI: Modify a partition object
    Ask the library to infer partition hints (optional): number of partitions (#partitions = k * #threads), partitioning model (OneD, SemiOneD, TwoD)
    Default: OneD, #partitions = #threads

    Matrix:
    /* Ask pOSKI to partition into 16 sub-matrices using SemiOneD */
    poski_partitionarg_t *pmat = poski_PartitionMatHints(SemiOneD, 16);
    poski_mat_t A_tunable = poski_CreateMatCSR( ..., pmat, ... );

    Vector:
    /* Ask pOSKI to partition a vector as the SpMV input vector based on A_tunable */
    poski_partitionVec_t *pvec = poski_PartitionVecHints(A_tunable, KERNEL_MatMult, OP_NORMAL, INPUTVEC);
    poski_vec_t x_view = poski_CreateVec( ..., pvec);

  • Performance on Intel Sandy Bridge E
    Jaketown: i7-3960X @ 3.3 GHz; #cores: 6 (2 threads per core), L3: 15MB
    pOSKI SpMV (Ax) with double precision floating-point
    MKL Sparse BLAS Level 2: mkl_dcsrmv()

  • Is tuning SpMV all we can do?
    Iterative methods all depend on it
    But speedups are limited: just 2 flops per nonzero, so communication costs dominate
    Can we beat this bottleneck?
    Need to look at the next level in the stack: what do algorithms that use SpMV do? Can we reorganize them to avoid communication?
    Only way significant speedups will be possible

  • Tuning Higher Level Algorithms than SpMV
    We almost always do many SpMVs, not just one
    Krylov Subspace Methods (KSMs) for Ax=b, Ax=λx: Conjugate Gradients, GMRES, Lanczos, ...
    Do a sequence of k SpMVs to get vectors [x_1, ..., x_k], then find the best solution x as a linear combination of [x_1, ..., x_k]
    Main cost is k SpMVs; since communication usually dominates, can we do better?
    Goal: make the communication cost independent of k
    Parallel case: O(log P) messages, not O(k log P) - optimal, same bandwidth as before
    Sequential case: O(1) messages and bandwidth, not O(k) - optimal
    Achievable when the matrix is partitionable with low surface-to-volume ratio

  • Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, ..., A^kx]
    Replace k iterations of y = Ax with [Ax, A^2x, ..., A^kx]

    Example: A tridiagonal, n=32, k=3 (figure plots the levels x, Ax, A^2x, A^3x over entries 1..32)
    Works for any well-partitioned A

  • Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, ..., A^kx]
    Replace k iterations of y = Ax with [Ax, A^2x, ..., A^kx]
    Sequential Algorithm

    Example: A tridiagonal, n=32, k=3; the levels x, Ax, A^2x, A^3x are computed block by block in Steps 1, 2, 3, 4

  • Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, ..., A^kx]
    Replace k iterations of y = Ax with [Ax, A^2x, ..., A^kx]
    Parallel Algorithm

    Example: A tridiagonal, n=32, k=3, partitioned across Proc 1, Proc 2, Proc 3, Proc 4
    Each processor communicates once with its neighbors, then works on an (overlapping) trapezoid

  • Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, ..., A^kx]
    Same idea works for general sparse matrices
    Simple block-row partitioning -> (hyper)graph partitioning

    Top-to-bottom processing -> Traveling Salesman Problem

  • Communication Avoiding Kernels: The Matrix Powers Kernel [Ax, A^2x, ..., A^kx]
    (Figure: the tridiagonal example partitioned across Proc 1, Proc 2, Proc 3, Proc 4)

  • Locally Dependent Entries for [x, Ax, ..., A^8x], A tridiagonal, 2 processors
    Can be computed without communication
    k=8 fold reuse of A
    (Proc 1, Proc 2)

  • Remotely Dependent Entries for [x, Ax, ..., A^8x], A tridiagonal, 2 processors
    One message to get the data needed to compute the remotely dependent entries, not k=8 messages

    Minimizes the number of messages = latency cost
    Price: redundant work proportional to the surface/volume ratio
    (Proc 1, Proc 2)
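
    A hedged sketch of this idea for the tridiagonal example: each process fetches a halo of depth k from each neighbor once, then computes all k levels on its extended block with no further communication, redoing some work in the halo. The MPI exchange itself is elided; the names and the assumption that the block is in the interior of the domain (global boundaries handled separately) are illustrative.

        /* Each process owns x[0..n_local-1] plus k ghost entries on each side,
           already filled by ONE exchange with each neighbor.  A is tridiagonal
           with constant stencil (lo, d, up): (A v)(i) = lo*v(i-1)+d*v(i)+up*v(i+1). */
        void matrix_powers_tridiag(int n_local, int k,
                                   double lo, double d, double up,
                                   const double *v, /* length n_local + 2k */
                                   double **V)      /* V[j]: level j, same length */
        {
            int n_ext = n_local + 2*k;
            for (int i = 0; i < n_ext; i++) V[0][i] = v[i];
            for (int j = 1; j <= k; j++) {
                /* after j applications of A, only entries at distance >= j from
                   the ends of the extended block are still correct: the owned
                   part plus a shrinking trapezoid of redundant work */
                for (int i = j; i < n_ext - j; i++)
                    V[j][i] = lo*V[j-1][i-1] + d*V[j-1][i] + up*V[j-1][i+1];
            }
            /* owned results: V[j][k .. k+n_local-1] for j = 1..k */
        }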

  • Fewer Remotely Dependent Entries for [x, Ax, ..., A^8x], A tridiagonal, 2 processors
    Reduce the redundant work by half
    (Proc 1, Proc 2)

  • Remotely Dependent Entries for [x,Ax, A2x,A3x], 2D Laplacian

  • Remotely Dependent Entries for [x,Ax,A2x,A3x], A irregular, multiple processors

  • Speedups on Intel Clovertown (8 core)

  • Performance Results
    Measured multicore (Clovertown) speedups up to 6.4x
    Measured/modeled sequential out-of-core speedup up to 3x
    Modeled parallel petascale speedup up to 6.9x
    Modeled parallel grid speedup up to 22x

    Sequential speedup is due to bandwidth, and works for many problem sizes
    Parallel speedup is due to latency, and works for smaller problems on many processors
    Multicore results used both techniques

  • Avoiding Communication in Iterative Linear Algebra
    k steps of a typical iterative solver for sparse Ax=b or Ax=λx do k SpMVs with the starting vector, then find the best solution among all linear combinations of these k+1 vectors
    Many such Krylov Subspace Methods: Conjugate Gradients, GMRES, Lanczos, Arnoldi, ...
    Goal: minimize communication in Krylov Subspace Methods
    Assume the matrix is well-partitioned, with modest surface-to-volume ratio
    Parallel implementation: conventional is O(k log p) messages, because of k calls to SpMV; new is O(log p) messages - optimal
    Serial implementation: conventional is O(k) moves of data from slow to fast memory; new is O(1) moves of data - optimal
    Lots of speedup possible (modeled and measured); price: some redundant computation
    Much prior work: see Mark Hoemmen's PhD thesis and other papers at bebop.cs.berkeley.edu

  • Minimizing Communication of GMRES to solve Ax=b
    GMRES: find x in span{b, Ab, ..., A^k b} minimizing || Ax-b ||_2
    Cost of k steps of standard GMRES vs new GMRES

    Standard GMRES:
      for i = 1 to k
        w = A * v(i-1)
        MGS(w, v(0), ..., v(i-1))       ... Modified Gram-Schmidt
        update v(i), H
      endfor
      solve LSQ problem with H

    Sequential: #words_moved = O(k*nnz) from SpMV + O(k^2 n) from MGS
    Parallel: #messages = O(k) from SpMV + O(k^2 log p) from MGS

    Communication-avoiding GMRES:
      W = [ v, Av, A^2v, ..., A^kv ]
      [Q,R] = TSQR(W)                   ... Tall Skinny QR
      build H from R, solve LSQ problem

    Sequential: #words_moved = O(nnz) from SpMV + O(kn) from TSQR
    Parallel: #messages = O(1) from computing W + O(log p) from TSQR
    Oops - W is from the power method, precision lost!

  • Monomial basis [Ax, ..., A^kx] fails to converge; a different polynomial basis does converge

  • Speedups of GMRES on 8-core Intel Clovertown
    Requires co-tuning kernels [MHDY09]

  • New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems
    On modern computer architectures, communication between processors takes longer than a floating point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communication between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.

    FY 2010 Congressional Budget, Volume 4, FY2010 Accomplishments, Advanced Scientific Computing Research (ASCR), pages 65-67.
    President Obama cites Communication-Avoiding Algorithms in the FY 2012 Department of Energy Budget Request to Congress.
    CA-GMRES (Hoemmen, Mohiyuddin, Yelick, JD); Tall-Skinny QR (Grigori, Hoemmen, Langou, JD)

  • Tuning space for Krylov Methods
    Classifications of sparse operators for avoiding communication:
    Explicit indices or nonzero entries cause most communication, along with vectors
    Ex: with stencils (all implicit), all communication is for vectors
    Operations: [x, Ax, A^2x, ..., A^kx] or [x, p1(A)x, p2(A)x, ..., pk(A)x]
    Number of columns in x
    [x, Ax, A^2x, ..., A^kx] and [y, A^T y, (A^T)^2 y, ..., (A^T)^k y], or [y, A^T A y, (A^T A)^2 y, ..., (A^T A)^k y]; return all vectors or just the last one
    Cotuning and/or interleaving: W = [x, Ax, A^2x, ..., A^kx] and {TSQR(W) or W^T W or ...}; ditto, but throw away W
    Preconditioned versions

    Indices explicit (O(nnz)), nonzero entries explicit (O(nnz)): CSR and variations
    Indices implicit (o(nnz)), nonzero entries explicit (O(nnz)): vision, climate, AMR, ...
    Indices explicit (O(nnz)), nonzero entries implicit (o(nnz)): graph Laplacian
    Indices implicit (o(nnz)), nonzero entries implicit (o(nnz)): stencils

  • Possible Class Projects
    Come to BeBOP meetings (M 1-2, 380 Soda)
    Experiment with SpMV on different architectures: which optimizations are most effective?
    Try to speed up particular matrices of interest: data mining, bottom solver from AMR
    Explore the tuning space of the [x, Ax, ..., A^kx] kernel
    Different matrix representations (last slide)
    New Krylov subspace methods, preconditioning
    Experiment with new frameworks (SPF, Halide)
    More details available

  • Extra Slides

  • Extensions
    Other Krylov methods: Arnoldi, CG, Lanczos, ...
    Preconditioning: solve MAx=Mb where the preconditioning matrix M is chosen to make the system easier
    M approximates A^-1 somehow, but cheaply, to accelerate convergence
    Cheap as long as contributions from distant parts of the system can be compressed: sparsity, low rank
    No implementations yet (class projects!)

  • Design Space for [x, Ax, ..., A^kx] (1/3)
    Mathematical operation:
    How many vectors to keep? All: Krylov Subspace Methods; keep only the last vector A^kx (Jacobi, Gauss-Seidel)
    Improving conditioning of the basis: W = [x, p1(A)x, p2(A)x, ..., pk(A)x], where pi(A) is a degree-i polynomial chosen to reduce cond(W)
    Preconditioning (Ay=b -> MAy=Mb): [x, Ax, MAx, AMAx, MAMAx, ..., (MA)^kx]

  • Design Space for [x, Ax, ..., A^kx] (2/3)
    Representation of sparse A: the zero pattern may be explicit or implicit, and the nonzero entries may be explicit or implicit; implicit saves memory and communication

    Representation of dense preconditioners M: low-rank off-diagonal blocks (semiseparable)

    Explicit pattern, explicit nonzeros: general sparse matrix
    Implicit pattern, explicit nonzeros: image segmentation
    Explicit pattern, implicit nonzeros: Laplacian(graph)
    Implicit pattern, implicit nonzeros: multigrid (AMR), stencil matrix, e.g. tridiag(-1,2,-1)

  • Design Space for [x, Ax, ..., A^kx] (3/3)
    Parallel implementation: from simple indexing with redundant flops proportional to the surface/volume ratio, to complicated indexing with fewer redundant flops
    Sequential implementation: depends on whether the vectors fit in fast memory
    Reordering rows and columns of A: important in both the parallel and sequential cases; can be reduced to a pair of Traveling Salesman Problems
    Plus all the optimizations for one SpMV!

  • Summary of Communication-Avoiding Linear Algebra (CALA)
    Lots of related work, some going back to the 1960s; reports discuss this comprehensively, not here
    Our contributions: several new algorithms, improvements on old ones, preconditioning, unifying parallel and sequential approaches to avoiding communication
    The time for these algorithms has come, because of growing communication costs
    Why avoid communication just for linear algebra motifs?

  • Optimized Sparse Kernel Interface - OSKI
    Provides sparse kernels automatically tuned for the user's matrix & machine
    BLAS-style functionality: SpMV (Ax and A^T y), TrSV
    Hides complexity of run-time tuning
    Includes new, faster locality-aware kernels: A^T A x, A^k x
    Faster than standard implementations: up to 4x faster matvec, 1.8x trisolve, 4x A^T A*x
    For advanced users & solver library writers
    Available as a stand-alone library (OSKI 1.0.1h, 6/07) and as a PETSc extension (OSKI-PETSc .1d, 3/06)
    bebop.cs.berkeley.edu/oski
    Under development: pOSKI for multicore

  • How OSKI Tunes (Overview)
    Library install-time (offline): 1. Build for target architecture; 2. Benchmark generated code variants, producing benchmark data and heuristic models
    Application run-time: 1. Evaluate models; 2. Select data structure & code, using the user's matrix, workload from program monitoring, and history
    To user: matrix handle for kernel calls
    Extensibility: advanced users may write & dynamically add code variants and heuristic models to the system

  • How OSKI Tunes (Overview)
    At library build/install-time:
    Pre-generate and compile code variants into dynamic libraries
    Collect benchmark data: measure and record the speed of possible sparse data structures and code variants on the target architecture
    Installation process uses standard, portable GNU AutoTools
    At run-time:
    Library tunes using heuristic models: models analyze the user's matrix & benchmark data to choose an optimized data structure and code
    Non-trivial tuning cost: up to ~40 mat-vecs; the library limits the time it spends tuning based on the estimated workload, provided by the user or inferred by the library
    User may reduce cost by saving tuning results for the application on future runs with the same or similar matrix

  • Optimizations in OSKI, so far
    Fully automatic heuristics for:
    Sparse matrix-vector multiply: register-level blocking; register-level blocking + symmetry + multiple vectors; cache-level blocking
    Sparse triangular solve with register-level blocking and switch-to-dense optimization
    Sparse A^T A*x with register-level blocking
    User may select other optimizations manually: diagonal storage optimizations, reordering, splitting; tiled matrix powers kernel (A^k*x); all available in dynamic libraries; accessible via a high-level embedded script language
    Plug-in extensibility: very advanced users may write their own heuristics, create new data structures/code variants and dynamically add them to the system
    pOSKI under construction: optimizations for multicore - more later

  • How to Call OSKI: Basic Usage
    May gradually migrate existing apps: Step 1, wrap existing data structures; Step 2, make BLAS-like kernel calls

    int* ptr = ..., *ind = ...; double* val = ...;   /* Matrix, in CSR format */
    double* x = ..., *y = ...;   /* Let x and y be two dense vectors */

    /* Compute y = y + Ax, 500 times */
    for( i = 0; i < 500; i++ )
      my_matmult( ptr, ind, val, ..., x, b, y );

  • How to Call OSKI: Basic Usage
    May gradually migrate existing apps: Step 1, wrap existing data structures; Step 2, make BLAS-like kernel calls

    int* ptr = ..., *ind = ...; double* val = ...;   /* Matrix, in CSR format */
    double* x = ..., *y = ...;   /* Let x and y be two dense vectors */

    /* Step 1: Create OSKI wrappers around this data */
    oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols,
                                                SHARE_INPUTMAT, ...);
    oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
    oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

    /* Compute y = y + Ax, 500 times */
    for( i = 0; i < 500; i++ )
      my_matmult( ptr, ind, val, ..., x, b, y );

  • How to Call OSKI: Basic Usage
    May gradually migrate existing apps: Step 1, wrap existing data structures; Step 2, make BLAS-like kernel calls

    int* ptr = ..., *ind = ...; double* val = ...;   /* Matrix, in CSR format */
    double* x = ..., *y = ...;   /* Let x and y be two dense vectors */

    /* Step 1: Create OSKI wrappers around this data */
    oski_matrix_t A_tunable = oski_CreateMatCSR(ptr, ind, val, num_rows, num_cols,
                                                SHARE_INPUTMAT, ...);
    oski_vecview_t x_view = oski_CreateVecView(x, num_cols, UNIT_STRIDE);
    oski_vecview_t y_view = oski_CreateVecView(y, num_rows, UNIT_STRIDE);

    /* Step 2: Compute y = y + Ax, 500 times */
    for( i = 0; i < 500; i++ )
      oski_MatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view);

  • How to Call OSKI: Tune with Explicit Hints
    User calls the tune routine; may provide explicit tuning hints (OPTIONAL)

    oski_matrix_t A_tunable = oski_CreateMatCSR( ... );
    /* ... */

    /* Tell OSKI we will call SpMV 500 times (workload hint) */
    oski_SetHintMatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view, 500);
    /* Tell OSKI we think the matrix has 8x8 blocks (structural hint) */
    oski_SetHint(A_tunable, HINT_SINGLE_BLOCKSIZE, 8, 8);

    oski_TuneMat(A_tunable);   /* Ask OSKI to tune */

    for( i = 0; i < 500; i++ )
      oski_MatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view);

  • How the User Calls OSKI: Implicit Tuning
    Ask the library to infer the workload; the library profiles all kernel calls and may periodically re-tune

    oski_matrix_t A_tunable = oski_CreateMatCSR( ... );
    /* ... */

    for( i = 0; i < 500; i++ ) {
      oski_MatMult(A_tunable, OP_NORMAL, ..., x_view, ..., y_view);
      oski_TuneMat(A_tunable);   /* Ask OSKI to tune */
    }

  • Quick-and-dirty Parallelism: OSKI-PETSc
    Extend PETSc's distributed memory SpMV (MATMPIAIJ)
    PETSc: each process (p0, p1, p2, p3) stores diag (all-local) and off-diag submatrices

    OSKI-PETSc: add OSKI wrappers; each submatrix is tuned independently

  • OSKI-PETSc Proof-of-Concept Results
    Matrix 1: Accelerator cavity design (R. Lee @ SLAC); N ~ 1 M, ~40 M non-zeros; 2x2 dense block substructure; symmetric
    Matrix 2: Linear programming (Italian Railways); short-and-fat: 4k x 1M, ~11M non-zeros; highly unstructured; big speedup from cache-blocking: no native PETSc format
    Evaluation machine: Xeon cluster; peak: 4.8 Gflop/s per node

  • Accelerator Cavity Matrix

  • OSKI-PETSc Performance: Accel. Cavity

  • Linear Programming Matrix

  • OSKI-PETSc Performance: LP Matrix

  • Tuning SpMV for Multicore: Architectural Review

  • Cost of memory access in Multicore

    When physically partitioned, cache or memory access is non-uniform (latency and bandwidth to memory/cache addresses vary)
    NUMA = Non-Uniform Memory Access
    NUCA = Non-Uniform Cache Access
    UCA & UMA architecture:

    *Source: Sam Williams

  • Cost of Memory Access in Multicore
    When physically partitioned, cache or memory access is non-uniform (latency and bandwidth to memory/cache addresses vary)
    NUMA = Non-Uniform Memory Access
    NUCA = Non-Uniform Cache Access
    NUCA & UMA architecture:

    *Source: Sam Williams

  • Cost of Memory Access in Multicore
    When physically partitioned, cache or memory access is non-uniform (latency and bandwidth to memory/cache addresses vary)
    NUMA = Non-Uniform Memory Access
    NUCA = Non-Uniform Cache Access
    NUCA & NUMA architecture:

    *Source: Sam Williams

  • Cost of Memory Access in Multicore
    Proper cache locality to maximize performance

    *Source: Sam Williams

  • Cost of Memory Access in Multicore
    Proper DRAM locality to maximize performance

    *Source: Sam Williams

  • Optimizing Communication Complexity of Sparse Solvers
    Need to modify high level algorithms to use the new kernel
    Example: GMRES for Ax=b where A = 2D Laplacian
    x lives on an n-by-n mesh, partitioned on a √p-by-√p processor grid
    A has a 5 point stencil (Laplacian): (Ax)(i,j) = linear_combination(x(i,j), x(i,j±1), x(i±1,j))
    Ex: 18-by-18 mesh on a 3-by-3 processor grid

  • Minimizing Communication
    What is the cost = (#flops, #words, #messages) of s steps of standard GMRES?

    GMRES, ver. 1:
      for i = 1 to s
        w = A * v(i-1)
        MGS(w, v(0), ..., v(i-1))
        update v(i), H
      endfor
      solve LSQ problem with H

    (Each processor owns an (n/√p)-by-(n/√p) piece of the mesh)
    Cost(A * v) = s * (9n^2/p, 4n/√p, 4)
    Cost(MGS = Modified Gram-Schmidt) = s^2/2 * (4n^2/p, log p, log p)
    Total cost ~ Cost(A * v) + Cost(MGS)
    Can we reduce the latency?

  • Minimizing Communication
    Cost(GMRES, ver. 1) = Cost(A*v) + Cost(MGS)
    = (9sn^2/p, 4sn/√p, 4s) + (2s^2n^2/p, s^2 log p / 2, s^2 log p / 2)
    How much latency cost from A*v can you avoid? Almost all

  • Minimizing Communication
    Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
    = (9sn^2/p, 4sn/√p, 8) + (2s^2n^2/p, s^2 log p / 2, s^2 log p / 2)
    How much latency cost from MGS can you avoid? Almost all

  • Minimizing Communication
    Cost(GMRES, ver. 2) = Cost(W) + Cost(MGS)
    = (9sn^2/p, 4sn/√p, 8) + (2s^2n^2/p, s^2 log p / 2, s^2 log p / 2)
    How much latency cost from MGS can you avoid? Almost all

    GMRES, ver. 3:
      W = [ v, Av, A^2v, ..., A^sv ]
      [Q,R] = TSQR(W)      ... Tall Skinny QR (see Lecture 11)
      build H from R, solve LSQ problem

  • Minimizing Communication
    Cost(GMRES, ver. 3) = Cost(W) + Cost(TSQR)
    = (9sn^2/p, 4sn/√p, 8) + (2s^2n^2/p, s^2 log p / 2, log p)
    Latency cost independent of s, just log p - optimal
    Oops - W is from the power method, so precision is lost. What to do?
    Use a different polynomial basis, not the monomial basis W = [v, Av, A^2v, ...]:
    Newton basis W_N = [v, (A - θ_1 I)v, (A - θ_2 I)(A - θ_1 I)v, ...]
    or Chebyshev basis W_C = [v, T_1(v), T_2(v), ...]

  • Tuning Higher Level Algorithms
    So far we have tuned a single sparse matrix kernel, y = A^T*A*x, motivated by a higher level algorithm (SVD)
    What can we do by extending tuning to a higher level?
    Consider Krylov subspace methods for Ax=b, Ax=λx: Conjugate Gradients (CG), GMRES, Lanczos, ...
    Inner loop does y=A*x, dot products, saxpys, scalar ops
    Inner loop costs at least O(1) messages, so k iterations cost at least O(k) messages
    Our goal: show how to do k iterations with O(1) messages
    Possible payoff: make Krylov subspace methods much faster on machines with slow networks; memory bandwidth improvements too (not discussed)
    Obstacles: numerical stability, preconditioning, ...

  • Krylov Subspace Methods for Solving Ax=b
    Compute a basis for a subspace V by doing y = A*x k times
    Find the best solution in that Krylov subspace V
    Given starting vector x1, V is spanned by x2 = A*x1, x3 = A*x2, ..., xk = A*x(k-1)
    GMRES finds an orthogonal basis of V by Gram-Schmidt, so it actually does a different set of SpMVs than in the last bullet

  • Example: Standard GMRES
    r = b - A*x1,  beta = length(r),  v1 = r / beta     ... length(r) = sqrt(sum_i r_i^2)
    for m = 1 to k do
      w = A*v(m)                               ... at least O(1) messages
      for i = 1 to m do                        ... Gram-Schmidt
        h(i,m) = dotproduct(v(i), w)           ... at least O(1) messages, or log(p)
        w = w - h(i,m) * v(i)
      end for
      h(m+1,m) = length(w)                     ... at least O(1) messages, or log(p)
      v(m+1) = w / h(m+1,m)
    end for
    find y minimizing length( H_k * y - beta*e1 )   ... small, local problem
    new x = x1 + V_k * y,  where V_k = [v1, v2, ..., vk]
    O(k^2), or O(k^2 log p), messages altogether

  • Example: Computing [Ax,A2x,A3x,,Akx] for A tridiagonal Different basis for same Krylov subspaceWhat can Proc 1 compute without communication?x(1:30):(Ax)(1:30):(A2x)(1:30):(A8x)(1:30):Proc 1Proc 2. . .

  • Example: Computing [Ax,A2x,A3x,,Akx] for A tridiagonal x(1:30):(Ax)(1:30):(A2x)(1:30):. . .(A8x)(1:30):Proc 1Proc 2Computing missing entries with 1 communication, redundant work

  • x(1:30):(Ax)(1:30):(A2x)(1:30):. . .(A8x)(1:30):Proc 1Proc 2Saving half the redundant workExample: Computing [Ax,A2x,A3x,,Akx] for A tridiagonal

  • Example: Computing [Ax,A2x,A3x,,Akx] for LaplacianA = 5pt Laplacian in 2D,Communicated point for k=3 shown

  • Latency-Avoiding GMRES (1)
    r = b - A*x1,  beta = length(r),  v1 = r / beta      ... O(log p) messages
    W_(k+1) = [v1, A*v1, A^2*v1, ..., A^k*v1]            ... O(1) messages
    [Q, R] = qr(W_(k+1))                                 ... QR decomposition, O(log p) messages
    H_k = R(:, 2:k+1) * (R(1:k,1:k))^-1                  ... small, local problem
    find y minimizing length( H_k * y - beta*e1 )        ... small, local problem
    new x = x1 + Q_k * y                                 ... local problem
    O(log p) messages altogether - independent of k

  • Latency-Avoiding GMRES (2)
    [Q, R] = qr(W_(k+1))      ... QR decomposition, O(log p) messages

    Easy, but not so stable, way to do it:
    X(myproc) = W_(k+1)(myproc)^T * W_(k+1)(myproc)    ... local computation
    Y = sum_reduction(X(myproc))                       ... O(log p) messages; Y = W_(k+1)^T * W_(k+1)
    R = (cholesky(Y))^T                                ... small, local computation
    Q(myproc) = W_(k+1)(myproc) * R^-1                 ... local computation
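
    In matrix form, the "easy but not so stable" factorization above is a Cholesky-QR (a restatement of the slide's steps, not a new method):

        \[
          Y \;=\; W_{k+1}^{T} W_{k+1} \;=\; \sum_{q=1}^{p} W_{k+1}(q)^{T}\, W_{k+1}(q)
          \quad \text{(one reduction, } O(\log p) \text{ messages)},
        \]
        \[
          R \;=\; \mathrm{chol}(Y)^{T}, \qquad
          Q(q) \;=\; W_{k+1}(q)\, R^{-1} \quad \text{(local)},
        \]

    so that W_(k+1) = QR with Q^T Q = R^{-T}(W^T W)R^{-1} = I, at the price of squaring the condition number of W inside Y.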

  • Numerical example (1)
    Diagonal matrix with n=1000, A_ii from 1 down to 10^-5
    Instability as k grows, after many iterations

  • Numerical Example (2)
    Partial remedy: restarting periodically (every 120 iterations)
    Other remedies: high precision, a different basis than [v, A*v, ..., A^k*v]

  • Operation Counts for [Ax, A^2x, A^3x, ..., A^kx] on p procs

    1D mesh (tridiagonal), per-proc cost:
      #messages:   Standard 2k            Optimized 2
      #words sent: Standard 2k            Optimized 2k
      #flops:      Standard 5kn/p         Optimized 5kn/p + 5k^2
      memory:      Standard (k+4)n/p      Optimized (k+4)n/p + 8k

    3D mesh (27 pt stencil), per-proc cost:
      #messages:   Standard 26k           Optimized 26
      #words sent: Standard 6kn^2 p^(-2/3) + 12kn p^(-1/3) + O(k)
                   Optimized 6kn^2 p^(-2/3) + 12k^2 n p^(-1/3) + O(k^3)
      #flops:      Standard 53kn^3/p      Optimized 53kn^3/p + O(k^2 n^2 p^(-2/3))
      memory:      Standard (k+28)n^3/p + 6n^2 p^(-2/3) + O(n p^(-1/3))
                   Optimized (k+28)n^3/p + 168kn^2 p^(-2/3) + O(k^2 n p^(-1/3))

  • Summary and Future Work
    Dense: LAPACK, ScaLAPACK, communication primitives
    Sparse: kernels, stencils, higher level algorithms
    All of the above on new architectures: vector, SMPs, multicore, Cell, ...
    High level support for tuning: specification language, integration into compilers

  • Extra Slides

  • A Sparse Matrix You Encounter Every Day
    Who am I?
    I am a
    Big Repository
    Of useful
    And useless
    Facts alike.

    Who am I?

    (Hint: Not your e-mail inbox.)

  • What about the Google Matrix?
    Google approach:
    Approx. once a month: rank all pages using the connectivity structure; find the dominant eigenvector of a matrix
    At query-time: return the list of pages ordered by rank
    Matrix: A = αG + (1-α)(1/n)uu^T
    Markov model: surfer follows a link with probability α, jumps to a random page with probability 1-α
    G is the n x n connectivity matrix [n ~ billions]: g(i,j) is non-zero if page i links to page j, normalized so each column sums to 1; very sparse, about 7-8 non-zeros per row (power law distribution)
    u is a vector of all 1 values
    The steady-state probability x(i) of landing on page i is the solution to x = Ax
    Approximate x by the power method: x = A^k x0; in practice, k ~ 25
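
    A minimal sketch of the power iteration described above, reusing the csr_t type and spmv_csr from the earlier CSR example; the 2-norm normalization is an illustrative addition (the slide only states x ≈ A^k x0 with k ~ 25), and for the actual Google matrix A would be applied implicitly as an SpMV with G plus a rank-one correction rather than stored explicitly.

        #include <math.h>
        #include <stdlib.h>
        #include <string.h>

        /* Approximate the dominant eigenvector of A by k steps of the power method. */
        void power_method(const csr_t *A, double *x, int k)
        {
            double *y = calloc(A->n, sizeof *y);
            for (int step = 0; step < k; step++) {
                memset(y, 0, A->n * sizeof *y);
                spmv_csr(A, x, y);                         /* y = A*x */
                double norm = 0.0;
                for (int i = 0; i < A->n; i++) norm += y[i]*y[i];
                norm = sqrt(norm);
                for (int i = 0; i < A->n; i++) x[i] = y[i] / norm;
            }
            free(y);
        }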

  • Current Work
    Public software release
    Impact on library designs: Sparse BLAS, Trilinos, PETSc, ...
    Integration in large-scale applications: DOE (accelerator design; plasma physics); geophysical simulation based on Block Lanczos (A^T A*X; LBL)
    Systematic heuristics for data structure selection?
    Evaluation of emerging architectures: revisiting vector micros
    Other sparse kernels: matrix triple products, A^k*x
    Parallelism
    Sparse benchmarks (with UTK) [Gahvari & Hoemmen]
    Automatic tuning of MPI collective ops [Nishtala, et al.]

  • Summary of High-Level Themes
    Kernel-centric optimization (vs. basic block, trace, path optimization, for instance); aggressive use of domain-specific knowledge
    Performance bounds modeling: evaluating software quality; architectural characterizations and consequences
    Empirical search: hybrid on-line/run-time models; statistical performance models; exploit information from sampling and measuring

  • Related Work
    My bibliography: 337 entries so far
    Sample area 1: code generation - generative & generic programming, sparse compilers, domain-specific generators
    Sample area 2: empirical search-based tuning - kernel-centric (linear algebra, signal processing, sorting, MPI, ...); compiler-centric (profiling + FDO, iterative compilation, superoptimizers, self-tuning compilers, continuous program optimization)

  • Future Directions (A Bag of Flaky Ideas)
    Composable code generators and search spaces
    New application domains - PageRank: multilevel block algorithms for topic-sensitive search?
    New kernels: cryptokernels - rich mathematical structure germane to performance; lots of hardware
    New tuning environments: parallel, Grid, whole systems
    Statistical models of application performance: statistical learning of concise parametric models from traces for architectural evaluation; compiler/automatic derivation of parametric models

  • Possible Future Work
    Different architectures: new FP instruction sets (SSE2), SMP / multicore platforms, vector architectures
    Different kernels
    Higher level algorithms: parallel versions of kernels with optimized communication, block algorithms (e.g. Lanczos), XBLAS (extra precision)
    Produce benchmarks: augment the HPCC Benchmark; make it possible to combine optimizations easily for any kernel
    Related tuning activities (LAPACK & ScaLAPACK)

  • Review of Tuning by Illustration(Extra Slides)

  • Splitting for Variable Blocks and Diagonals
    Decompose A = A1 + A2 + ... + At
    Detect canonical structures (sampling), split, and tune each Ai
    Improve performance and save storage
    New data structures: unaligned block CSR (relax alignment in rows & columns), row-segmented diagonals

  • Example: Variable Block Row (Matrix #12)
    2.1x over CSR, 1.8x over RB

  • Example: Row-Segmented Diagonals
    2x over CSR

  • Mixed Diagonal and Block Structure

  • Summary
    Automated block size selection: empirical modeling and search
    Register blocking for SpMV, triangular solve, ATA*x

    Not fully automated: given a matrix, selecting splittings and transformations involves lots of combinatorial problems, e.g., TSP reordering to create dense blocks (Pinar '97; Moon, et al. '04)

  • Extra Slides


  • Problem Context
    Sparse kernels abound: models of buildings, cars, bridges, economies, …; the Google PageRank algorithm
    Historical trends: sparse matrix-vector multiply (SpMV) runs at ~10% of peak; 2x faster with hand-tuning; tuning is becoming more difficult over time
    Promise of automatic tuning: PHiPAC/ATLAS, FFTW, …
    Challenges to high performance: not dense linear algebra! Complex data structures (indirect, irregular memory access); performance depends strongly on run-time inputs

  • Key Questions, Ideas, Conclusions
    How to tune basic sparse kernels automatically? Empirical modeling and search: up to 4x speedups for SpMV; 1.8x for triangular solve; 4x for ATA*x; 2x for A2*x; 7x for multiple vectors
    What are the fundamental limits on performance? Kernel-, machine-, and matrix-specific upper bounds; achieve 75% or more of the bound for SpMV, limiting low-level tuning
    Consequences for architecture?
    General techniques for empirical search-based tuning? Statistical models of performance

  • Road Map
    Sparse matrix-vector multiply (SpMV) in a nutshell
    Historical trends and the need for search
    Automatic tuning techniques
    Upper bounds on performance
    Statistical models of performance

  • Compressed Sparse Row (CSR) Storage
    Matrix-vector multiply kernel: y(i) ← y(i) + A(i,j)*x(j)

    for each row i
      for k = ptr[i] to ptr[i+1]-1 do
        y[i] = y[i] + val[k]*x[ind[k]]
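    A self-contained C rendering of the CSR kernel above may make the data structure concrete; the 3x3 example matrix is made up, and the function name spmv_csr is illustrative rather than a library API.

```c
/* Runnable C version of the CSR SpMV kernel sketched above.
 * Example 3x3 matrix (made up):
 *   [ 1 0 2 ]
 *   [ 0 3 0 ]
 *   [ 4 0 5 ]                                                     */
#include <stdio.h>

static void spmv_csr(int n, const int *ptr, const int *ind,
                     const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {                 /* for each row i         */
        double t = y[i];
        for (int k = ptr[i]; k < ptr[i+1]; k++)   /* non-zeros of row i     */
            t += val[k] * x[ind[k]];              /* irregular access to x  */
        y[i] = t;
    }
}

int main(void)
{
    int    ptr[] = {0, 2, 3, 5};
    int    ind[] = {0, 2, 1, 0, 2};
    double val[] = {1, 2, 3, 4, 5};
    double x[]   = {1, 1, 1};
    double y[]   = {0, 0, 0};

    spmv_csr(3, ptr, ind, val, x, y);
    printf("y = [%g %g %g]\n", y[0], y[1], y[2]);   /* expect [3 3 9] */
    return 0;
}
```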

  • Road Map
    Sparse matrix-vector multiply (SpMV) in a nutshell
    Historical trends and the need for search
    Automatic tuning techniques
    Upper bounds on performance
    Statistical models of performance

  • Historical Trends in SpMV Performance
    The data: uniprocessor SpMV performance since 1987; untuned and tuned implementations; cache-based superscalar micros, some vectors; LINPACK for comparison

  • SpMV Historical Trends: Mflop/s

  • Example: The Difficulty of Tuning
    n = 21216; nnz = 1.5 M; kernel: SpMV

    Source: NASA structural analysis problem

  • Still More Surprises
    More complicated non-zero structure in general

  • Still More Surprises
    More complicated non-zero structure in general
    Example: 3x3 blocking; logical grid of 3x3 cells

  • Historical Trends: Mixed News
    Observations: Good news: Moore's-law-like behavior. Bad news: untuned is 10% of peak or less, and worsening. Good news: tuned is roughly 2x better today, and improving. Bad news: tuning is complex.

    (Not really news: SpMV is not LINPACK)

    Questions: Application: automatic tuning? Architect: what machines are good for SpMV?

  • Road Map
    Sparse matrix-vector multiply (SpMV) in a nutshell
    Historical trends and the need for search
    Automatic tuning techniques: SpMV [SC'02; IJHPCA '04b]; sparse triangular solve (SpTS) [ICS/POHLL '02]; ATA*x [ICCS/WoPLA '03]
    Upper bounds on performance
    Statistical models of performance

  • SPARSITY: Framework for Tuning SpMV
    SPARSITY: automatic tuning for SpMV [Im & Yelick '99]
    General approach: identify and generate an implementation space; search the space using empirical models & experiments
    Prototype library and heuristic for choosing the register block size; also cache-level blocking, multiple vectors
    What's new? New block-size selection heuristic (within 10% of optimal; replaces the previous version); expanded implementation space (variable block splitting, diagonals, combinations); new kernels: sparse triangular solve, ATA*x, Ar*x

  • Automatic Register Block Size Selection
    Selecting the r x c block size
    Off-line benchmark (characterize the machine): precompute Mflops(r,c) using a dense matrix for each r x c; once per machine/architecture
    Run-time search (characterize the matrix): sample A to estimate Fill(r,c) for each r x c
    Run-time heuristic model: choose r, c to maximize Mflops(r,c) / Fill(r,c)
    Run-time costs: up to ~40 SpMVs (empirical worst case)
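    A minimal sketch of the run-time heuristic (not the SPARSITY/OSKI code): pick the (r,c) that maximizes Mflops(r,c)/Fill(r,c), where the Mflops table comes from the off-line dense benchmark and the fill table from run-time sampling. Both tables below are placeholders, not measured data.

```c
/* Choose (r,c) maximizing benchmarked dense Mflops(r,c) / estimated
 * Fill(r,c).  mflops[][] would come from the off-line register-blocking
 * benchmark; fill[][] from sampling the matrix at run time.            */
#include <stdio.h>

#define RMAX 8
#define CMAX 8

static void choose_block_size(double mflops[RMAX][CMAX],
                              double fill[RMAX][CMAX],
                              int *rbest, int *cbest)
{
    double best = 0.0;
    *rbest = *cbest = 1;
    for (int r = 1; r <= RMAX; r++)
        for (int c = 1; c <= CMAX; c++) {
            /* estimated blocked performance = dense Mflops / fill overhead */
            double est = mflops[r-1][c-1] / fill[r-1][c-1];
            if (est > best) { best = est; *rbest = r; *cbest = c; }
        }
}

int main(void)
{
    double mflops[RMAX][CMAX], fill[RMAX][CMAX];
    /* Placeholder model: bigger blocks run faster on the dense benchmark
     * but incur more fill on a sparse matrix.                            */
    for (int r = 0; r < RMAX; r++)
        for (int c = 0; c < CMAX; c++) {
            mflops[r][c] = 100.0 + 20.0*r + 10.0*c;
            fill[r][c]   = 1.0 + 0.05*r*c;
        }
    int r, c;
    choose_block_size(mflops, fill, &r, &c);
    printf("chosen block size: %dx%d\n", r, c);
    return 0;
}
```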

  • Accuracy of the Tuning Heuristics (1/4)
    NOTE: "fair" flops used (ops on explicit zeros not counted as work); DGEMV shown for reference

  • Accuracy of the Tuning Heuristics (2/4) (DGEMV shown for reference)

  • Accuracy of the Tuning Heuristics (3/4) (DGEMV shown for reference)

  • Accuracy of the Tuning Heuristics (4/4) (DGEMV shown for reference)

  • Road Map
    Sparse matrix-vector multiply (SpMV) in a nutshell
    Historical trends and the need for search
    Automatic tuning techniques
    Upper bounds on performance [SC'02]
    Statistical models of performance

  • Motivation for Upper Bounds Model
    Questions: Speedups are good, but what is the speed limit, independent of instruction scheduling and selection? What machines are good for SpMV?

  • Upper Bounds on Performance: Blocked SpMV
    P = (flops) / (time); flops = 2 * nnz(A)
    Lower bound on time, under two main assumptions: (1) count memory ops only (streaming); (2) count only compulsory and capacity misses (ignore conflicts)
    Account for line sizes; account for matrix size and nnz
    Charge the minimum access latency α_i at the Li cache and α_mem at memory, e.g., from the Saavedra-Barrera and PMaC MAPS benchmarks
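    The following is a deliberately simplified, assumption-laden sketch of such a bound (two memory levels, all matrix and vector data streamed once, placeholder latencies and line size); it is meant only to show the shape of the model P_upper = 2*nnz / T_lower, not to reproduce the paper's bounds. The matrix dimensions reuse the n = 21216, nnz = 1.5M example from the earlier slide; the fill ratio and machine numbers are invented.

```c
/* Simplified two-level SpMV upper-bound model: charge every load the L1
 * hit latency and every (compulsory) cache line of matrix/vector traffic
 * the extra memory latency.  All machine parameters are placeholders.   */
#include <stdio.h>

static double spmv_upper_bound_mflops(long nnz, long m, long n, int r, int c,
                                      double fill,        /* stored/true nnz */
                                      double alpha_L1_ns, /* L1 hit latency  */
                                      double alpha_mem_ns,/* memory latency  */
                                      int line_bytes)
{
    double flops = 2.0 * nnz;

    /* Bytes that must be streamed at least once: blocked values, block
     * column indices, row pointers, and the source/destination vectors. */
    double matrix_bytes = 8.0*nnz*fill + 4.0*nnz*fill/(r*c) + 4.0*(m/r + 1);
    double vector_bytes = 8.0 * (m + n);
    double lines        = (matrix_bytes + vector_bytes) / line_bytes;

    /* Loads issued by the kernel: one per stored value and index, plus
     * roughly one per destination element.                              */
    double loads = nnz*fill*2.0 + m;

    double t_ns = loads*alpha_L1_ns + lines*(alpha_mem_ns - alpha_L1_ns);
    return flops / (t_ns * 1e-9) / 1e6;   /* Mflop/s */
}

int main(void)
{
    double p = spmv_upper_bound_mflops(1500000L, 21216L, 21216L, 2, 2,
                                       1.1, 1.0, 150.0, 128);
    printf("modeled upper bound: %.0f Mflop/s\n", p);
    return 0;
}
```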

  • Example: Bounds on Itanium 2

  • Example: Bounds on Itanium 2

  • Example: Bounds on Itanium 2

  • Fraction of Upper Bound Across Platforms

  • Achieved Performance and Machine Balance
    Machine balance [Callahan '88; McCalpin '95]: Balance = Peak Flop Rate / Bandwidth (flops per double)
    Ideal balance for mat-vec: 2 flops per double; for SpMV, even less

    SpMV is essentially streaming: 1 / (average load time to stream one array) ≈ effective bandwidth
    Sustained balance = peak flops / model bandwidth

  • Where Does the Time Go?
    Most time is assigned to memory
    Caches "disappear" when line sizes are equal
    Strictly increasing line sizes

  • Execution Time Breakdown: Matrix 40

  • Speedups with Increasing Line Size

  • Summary: Performance Upper Bounds
    What is the best we can do for SpMV? Limits to low-level tuning of blocked implementations; refinements?
    What machines are good for SpMV? Partial answer: balance characterization
    Architectural consequences? Example: strictly increasing line sizes

  • Road Map
    Sparse matrix-vector multiply (SpMV) in a nutshell
    Historical trends and the need for search
    Automatic tuning techniques
    Upper bounds on performance
    Tuning other sparse kernels
    Statistical models of performance [FDO '00; IJHPCA '04a]

  • Statistical Models for Automatic Tuning
    Idea 1: a statistical criterion for stopping a search. A general search model: generate an implementation, measure its performance, repeat; stop when the probability of being within ε of optimal falls below a threshold; the distribution can be estimated on-line.
    Idea 2: statistical performance models. Problem: choose 1 among m implementations at run-time; sample performance off-line and build a statistical model.
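    As a stand-in for Idea 1 (this is not the criterion from the cited paper), the sketch below stops a random search once it has seen no eps-improvement in W consecutive samples, where W is chosen so that, if an improving draw had probability at least p_min, missing it W times in a row has probability at most delta. perf_of() is a hypothetical measurement hook returning fake numbers.

```c
/* Random search with a simple probabilistic stopping window:
 * (1 - p_min)^W <= delta  =>  W = ceil(log(delta)/log(1 - p_min)).     */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static double perf_of(int impl)           /* hypothetical: time one variant */
{
    return 100.0 + (impl % 97);           /* fake Mflop/s for illustration  */
}

static int search(int n_impls, double eps, double p_min, double delta)
{
    int W = (int)ceil(log(delta) / log(1.0 - p_min));
    int best = -1, since_improve = 0;
    double best_perf = 0.0;

    while (since_improve < W) {
        int cand = rand() % n_impls;             /* random implementation    */
        double p = perf_of(cand);
        if (p > best_perf * (1.0 + eps)) {       /* meaningful improvement   */
            best = cand; best_perf = p; since_improve = 0;
        } else {
            since_improve++;
        }
    }
    printf("stopped after window W=%d; best ~ %.1f Mflop/s\n", W, best_perf);
    return best;
}

int main(void) { search(1000, 0.05, 0.01, 0.05); return 0; }
```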

  • Example: Select a Matmul Implementation

  • Example: Support Vector Classification

  • Road Map
    Sparse matrix-vector multiply (SpMV) in a nutshell
    Historical trends and the need for search
    Automatic tuning techniques
    Upper bounds on performance
    Tuning other sparse kernels
    Statistical models of performance
    Summary and Future Work

  • Summary of High-Level Themes
    Kernel-centric optimization (vs. basic block, trace, or path optimization, for instance); aggressive use of domain-specific knowledge
    Performance bounds modeling: evaluating software quality; architectural characterizations and consequences
    Empirical search: hybrid off-line/run-time models
    Statistical performance models: exploit information from sampling and measuring

  • Related Work
    My bibliography: 337 entries so far
    Sample area 1: code generation (generative & generic programming; sparse compilers; domain-specific generators)
    Sample area 2: empirical search-based tuning; kernel-centric (linear algebra, signal processing, sorting, MPI, …) and compiler-centric (profiling + FDO, iterative compilation, superoptimizers, self-tuning compilers, continuous program optimization)

  • Future Directions (A Bag of Flaky Ideas)
    Composable code generators and search spaces
    New application domains; PageRank: multilevel block algorithms for topic-sensitive search?
    New kernels: crypto kernels (rich mathematical structure germane to performance; lots of hardware)
    New tuning environments: parallel, Grid, whole systems
    Statistical models of application performance: statistical learning of concise parametric models from traces for architectural evaluation; compiler/automatic derivation of parametric models

  • Acknowledgements
    Super-advisors: Jim and Kathy
    Undergraduate RAs: Attila, Ben, Jen, Jin, Michael, Rajesh, Shoaib, Sriram, Tuyet-Linh
    See pages xvi–xvii of the dissertation.

  • TSP-based Reordering: Before (Pinar '97; Moon, et al. '04)

  • TSP-based Reordering: After (Pinar '97; Moon, et al. '04)

    Up to 2x speedups over CSR

  • Example: L2 Misses on Itanium 2 (misses measured using PAPI [Browne '00])

  • Example: Distribution of Blocked Non-Zeros

  • Register Profile: Itanium 2 (performance ranges from 190 to 1190 Mflop/s across register block sizes)

  • Register Profiles: Sun and Intel x86
    Fraction of peak: Ultra 2i 11%, Ultra 3 5%, Pentium III-M 15%, Pentium III 21%
    (figure annotations, Mflop/s: 72, 35, 90, 50, 108, 42, 122, 58)

  • Register Profiles: IBM and Intel IA-64
    Fraction of peak: Power3 17%, Power4 16%, Itanium 2 33%, Itanium 1 8%
    (figure annotations: 252 Mflop/s, 122 Mflop/s, 820 Mflop/s, 459 Mflop/s, 247 Mflop/s, 107 Mflop/s, 1.2 Gflop/s, 190 Mflop/s)

  • Accurate and Efficient Adaptive Fill Estimation
    Idea: sample the matrix; fraction of the matrix to sample: s ∈ [0,1]; cost ~ O(s * nnz)
    Control cost by controlling s; searching at run-time means the constant matters!
    Control s automatically by computing statistical confidence intervals; idea: monitor the variance
    Cost of tuning; lower bound: converting the matrix costs 5 to 40 unblocked SpMVs; heuristic: 1 to 11 SpMVs
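    A simplified sketch of fill estimation by sampling block rows of a CSR matrix follows (without the adaptive confidence-interval control described on the slide); the function name estimate_fill and the toy data are illustrative only.

```c
/* Estimate Fill(r,c) by sampling a fraction s of the block rows.
 * Fill = (entries stored after r x c blocking, zeros included) / true nnz. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static double estimate_fill(int m, int n, const int *ptr, const int *ind,
                            int r, int c, double s)
{
    int nbrows = (m + r - 1) / r, nbcols = (n + c - 1) / c;
    char *seen = calloc(nbcols, 1);       /* block columns hit in this row  */
    long stored = 0, true_nnz = 0;

    for (int I = 0; I < nbrows; I++) {
        if ((double)rand() / RAND_MAX > s) continue;   /* sample block rows */
        /* A real implementation would reset only the touched flags to keep
         * the cost ~ O(s * nnz); memset is used here for brevity.          */
        memset(seen, 0, nbcols);
        for (int i = I*r; i < (I+1)*r && i < m; i++)
            for (int k = ptr[i]; k < ptr[i+1]; k++) {
                int J = ind[k] / c;
                if (!seen[J]) { seen[J] = 1; stored += (long)r * c; }
                true_nnz++;
            }
    }
    free(seen);
    return true_nnz ? (double)stored / (double)true_nnz : 1.0;
}

int main(void)
{
    /* 4x4 example with a dense 2x2 block in the top-left corner. */
    int ptr[] = {0, 2, 4, 5, 6};
    int ind[] = {0, 1,  0, 1,  2,  3};
    double fill = estimate_fill(4, 4, ptr, ind, 2, 2, 1.0);  /* s = 1: exact */
    printf("estimated fill for 2x2 blocking: %.2f\n", fill);
    return 0;
}
```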

  • Sparse/Dense Partitioning for SpTS
    Partition L into sparse parts (L1, L2) and a dense trailing triangle LD
    Perform SpTS in three steps: SPARSITY optimizations for steps (1)–(2); DTRSV for step (3)
    Tuning parameters: block size, size of the dense triangle

  • SpTS Performance: Power3

  • Summary of SpTS and AAT*x Results
    SpTS: similar to SpMV; 1.8x speedups; limited benefit from low-level tuning
    AAT*x, ATA*x: cache interleaving alone gives up to 1.6x speedups; register + cache gives up to 4x speedups (1.8x over register only); similar heuristic, same accuracy (~10% of optimal); further from the upper bounds (60–80%), so there is an opportunity for better low-level tuning a la PHiPAC/ATLAS
    Matrix triple products? Ak*x? Preliminary work

  • Register Blocking: Speedup

  • Register Blocking: Performance

  • Register Blocking: Fraction of Peak

  • Example: Confidence Interval Estimation

  • Costs of Tuning

  • Splitting + UBCSR: Pentium III

  • Splitting + UBCSR: Power4

  • Splitting+UBCSR Storage: Power4

  • Example: Variable Block Row (Matrix #13)

  • Dense Tuning is Hard, Too
    Even dense matrix multiply can be notoriously difficult to tune

  • Dense matrix multiply: surprising performance as register tile size varies.

  • Preliminary Results (Matrix Set 2): Itanium 2
    (figure; matrix classes: Dense, FEM, FEM (var), Bio, LP, Econ, Stat)

  • Multiple Vector Performance

  • What about the Google Matrix?
    Google approach: approximately once a month, rank all pages using the connectivity structure, i.e., find the dominant eigenvector of a matrix; at query time, return the list of pages ordered by rank
    Matrix: A = αG + (1-α)(1/n)uu^T
    Markov model: a surfer follows a link with probability α and jumps to a random page with probability 1-α
    G is the n x n connectivity matrix [n ≈ 3 billion]; g_ij is non-zero if page i links to page j; normalized so each column sums to 1; very sparse: about 7–8 non-zeros per row (power-law distribution)
    u is the vector of all 1s
    The steady-state probability x_i of landing on page i is the solution of x = Ax
    Approximate x by the power method: x ≈ A^k x_0; in practice, k ≈ 25

  • MAPS Benchmark Example: Power4

  • MAPS Benchmark Example: Itanium 2

  • Saavedra-Barrera Example: Ultra 2i

  • Summary of Results: Pentium III

  • Summary of Results: Pentium III (3/3)

  • Execution Time Breakdown (PAPI): Matrix 40

  • Preliminary Results (Matrix Set 1): Itanium 2
    (figure; matrix classes: LP, FEM, FEM (var), Assorted, Dense)

  • Tuning Sparse Triangular Solve (SpTS)
    Compute x = L^(-1)*b where L is sparse lower triangular and x & b are dense
    L from sparse LU has rich dense substructure: the dense trailing triangle can account for 20–90% of the matrix non-zeros
    SpTS optimizations: split into a sparse trapezoid and a dense trailing triangle; use tuned dense BLAS (DTRSV) on the dense triangle; use SPARSITY register blocking on the sparse part
    Tuning parameters: size of the dense trailing triangle, register block size
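    A minimal sketch of the split solve follows: L is treated as [L11 0; L21 L22] with L11 and L21 sparse (CSR, diagonal stored last in each row of L11) and the trailing d x d triangle L22 dense. A tuned version would call the BLAS DTRSV for step (3); a plain loop stands in here so the example is self-contained, and all data and names are illustrative.

```c
/* Split sparse/dense lower-triangular solve L*x = b in three steps.     */
#include <stdio.h>

/* (1) sparse solve L11 * x1 = b1 (diagonal stored last in each CSR row). */
static void sptrsv_csr(int n1, const int *ptr, const int *ind,
                       const double *val, const double *b, double *x)
{
    for (int i = 0; i < n1; i++) {
        double t = b[i];
        int k;
        for (k = ptr[i]; k < ptr[i+1] - 1; k++)   /* off-diagonals         */
            t -= val[k] * x[ind[k]];
        x[i] = t / val[k];                        /* diagonal entry        */
    }
}

/* (2) b2 <- b2 - L21 * x1 (sparse matrix-vector multiply, in place).     */
static void spmv_sub(int d, const int *ptr, const int *ind,
                     const double *val, const double *x1, double *b2)
{
    for (int i = 0; i < d; i++)
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            b2[i] -= val[k] * x1[ind[k]];
}

/* (3) dense trailing solve L22 * x2 = b2 (column-major; DTRSV in a tuned
 *     implementation).                                                   */
static void dtrsv_lower(int d, const double *L22, const double *b2, double *x2)
{
    for (int i = 0; i < d; i++) {
        double t = b2[i];
        for (int j = 0; j < i; j++)
            t -= L22[i + j*d] * x2[j];
        x2[i] = t / L22[i + i*d];
    }
}

int main(void)
{
    /* Toy 4x4 L: 2x2 sparse leading block, 2x2 dense trailing triangle. */
    int    p11[] = {0, 1, 3}; int i11[] = {0,  0, 1}; double v11[] = {2,  1, 3};
    int    p21[] = {0, 1, 2}; int i21[] = {1,  0};    double v21[] = {1,  1};
    double L22[] = {4, 2,  0, 5};                     /* column-major 2x2 */
    double b[]   = {2, 7, 5, 9}, x[4];

    sptrsv_csr(2, p11, i11, v11, b, x);               /* step (1)         */
    spmv_sub  (2, p21, i21, v21, x, b + 2);           /* step (2)         */
    dtrsv_lower(2, L22, b + 2, x + 2);                /* step (3)         */
    printf("x = [%g %g %g %g]\n", x[0], x[1], x[2], x[3]); /* 1 2 0.75 1.3 */
    return 0;
}
```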

  • Sparse Kernels and Optimizations
    Kernels: sparse matrix-vector multiply (SpMV), y=A*x; sparse triangular solve (SpTS), x=T^(-1)*b; y=AAT*x, y=ATA*x; powers (y=Ak*x), sparse triple product (R*A*RT), …
    Optimization techniques (implementation space): register blocking; cache blocking; multiple dense vectors (x); A has special structure (e.g., symmetric, banded, …); hybrid data structures (e.g., splitting, switch-to-dense, …); matrix reordering
    How and when do we search? Off-line: benchmark implementations. Run-time: estimate matrix properties and evaluate performance models based on the benchmark data.

  • Cache Blocked SpMV on LSI Matrix: Ultra 2i
    A: 10k x 255k, 3.7M non-zeros

    Baseline: 16 Mflop/s

    Best block size & performance: 16k x 64k, 28 Mflop/s

  • Cache Blocking on LSI Matrix: Pentium 4
    A: 10k x 255k, 3.7M non-zeros

    Baseline: 44 Mflop/s

    Best block size & performance: 16k x 16k, 210 Mflop/s

  • Cache Blocked SpMV on LSI Matrix: Itanium
    A: 10k x 255k, 3.7M non-zeros

    Baseline: 25 Mflop/s

    Best block size & performance: 16k x 32k, 72 Mflop/s

  • Cache Blocked SpMV on LSI Matrix: Itanium 2
    A: 10k x 255k, 3.7M non-zeros

    Baseline: 170 Mflop/s

    Best block size & performance: 16k x 65k, 275 Mflop/s

  • Inter-Iteration Sparse Tiling (1/3) [Strout, et al., '01]
    Let A be 6x6 tridiagonal; consider y = A^2*x, computed as t = A*x, y = A*t
    Nodes: vector elements; edges: matrix elements a_ij

  • Inter-Iteration Sparse Tiling (2/3) [Strout, et al., '01]
    Let A be 6x6 tridiagonal; consider y = A^2*x, computed as t = A*x, y = A*t
    Nodes: vector elements; edges: matrix elements a_ij

    Orange = everything needed to compute y1; reuse a11, a12

  • Inter-Iteration Sparse Tiling (3/3) [Strout, et al., '01]
    Let A be 6x6 tridiagonal; consider y = A^2*x, computed as t = A*x, y = A*t
    Nodes: vector elements; edges: matrix elements a_ij

    Orange = everything needed to compute y1; reuse a11, a12
    Grey = everything needed to compute y2, y3; reuse a23, a33, a43
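    For reference, the untiled baseline these slides start from is just two CSR SpMVs, t = A*x followed by y = A*t, so A is read twice; the sketch below shows that baseline on made-up tridiagonal data and notes where inter-iteration tiling would reorder the work.

```c
/* Baseline y = A^2*x via two passes; inter-iteration sparse tiling
 * [Strout, et al., '01] would interleave the two loops so the entries
 * needed for each small group of y's are reused while still in cache.   */
#include <stdio.h>

static void spmv(int n, const int *ptr, const int *ind,
                 const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            t += val[k] * x[ind[k]];
        y[i] = t;
    }
}

int main(void)
{
    /* 6x6 tridiagonal A with 1 on the three diagonals (as on the slides). */
    enum { N = 6 };
    int ptr[N+1], ind[3*N]; double val[3*N];
    int nz = 0;
    for (int i = 0; i < N; i++) {
        ptr[i] = nz;
        for (int j = i-1; j <= i+1; j++)
            if (j >= 0 && j < N) { ind[nz] = j; val[nz] = 1.0; nz++; }
    }
    ptr[N] = nz;

    double x[N] = {1, 1, 1, 1, 1, 1}, t[N], y[N];
    spmv(N, ptr, ind, val, x, t);   /* t = A*x : first read of A  */
    spmv(N, ptr, ind, val, t, y);   /* y = A*t : second read of A */
    for (int i = 0; i < N; i++) printf("y[%d] = %g\n", i, y[i]);
    return 0;
}
```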

  • Inter-Iteration Sparse Tiling: Issues
    Tile sizes (colored regions) grow with the number of iterations and with increasing out-degree; G is likely to have a few nodes with high out-degree (e.g., Yahoo)
    Mathematical tricks to limit tile size? Judicious dropping of edges [Ng '01]

  • Summary and Questions
    Need to understand both matrix structure and the machine; BeBOP: a suite of techniques to deal with different sparse structures and architectures
    Google matrix problem: established techniques within an iteration; ideas for inter-iteration optimizations; the mathematical structure of the problem may help
    Questions: What is the structure of G? What are the computational bottlenecks? Enabling future computations, e.g., topic-sensitive PageRank (the multiple-vector version) [Haveliwala '02]?
    See www.cs.berkeley.edu/~richie/bebop/intel/google for more info, including more complete Itanium 2 results.

  • Exploiting Matrix Structure
    Symmetry (numerical or structural): reuse matrix entries; can combine with register blocking, multiple vectors, …
    Matrix splitting: split the matrix, e.g., into r x c and 1 x 1 parts; no fill overhead
    Large matrices with random structure (e.g., Latent Semantic Indexing (LSI) matrices): the technique is cache blocking; store the matrix as 2^i x 2^j sparse submatrices; effective when the x vector is large; currently, search to find the fastest size
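    A minimal sketch of the symmetry optimization: store only the lower triangle in CSR and let each off-diagonal entry update both y_i and y_j, so every stored value is used twice per multiply. The 3x3 matrix is made up, and this is a sketch rather than the BeBOP/OSKI implementation.

```c
/* Symmetric SpMV from the lower triangle only: each stored a_ij (i>j)
 * contributes to y[i] via x[j] and, mirrored, to y[j] via x[i].         */
#include <stdio.h>

static void spmv_symm_lower(int n, const int *ptr, const int *ind,
                            const double *val, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++) {
            int j = ind[k];
            t += val[k] * x[j];                 /* contribution to y[i]   */
            if (j != i)
                y[j] += val[k] * x[i];          /* mirrored entry a_ji    */
        }
        y[i] += t;
    }
}

int main(void)
{
    /* Lower triangle of the symmetric matrix [[2,1,0],[1,3,4],[0,4,5]]. */
    int    ptr[] = {0, 1, 3, 5};
    int    ind[] = {0,  0, 1,  1, 2};
    double val[] = {2,  1, 3,  4, 5};
    double x[]   = {1, 2, 3}, y[] = {0, 0, 0};

    spmv_symm_lower(3, ptr, ind, val, x, y);
    printf("y = [%g %g %g]\n", y[0], y[1], y[2]);   /* expect [4 19 23] */
    return 0;
}
```

    Relative to full CSR, this roughly halves the matrix data streamed per multiply, which is the intuition behind the symmetric-storage speedups on the following slide.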

  • Symmetric SpMV Performance: Pentium 4

  • SpMV with Split Matrices: Ultra 2i

  • Cache Blocking on Random Matrices: Itanium
    Speedup on four banded random matrices.

  • Sparse Kernels and Optimizations
    Kernels: sparse matrix-vector multiply (SpMV), y=A*x; sparse triangular solve (SpTS), x=T^(-1)*b; y=AAT*x, y=ATA*x; powers (y=Ak*x), sparse triple product (R*A*RT), …
    Optimization techniques (implementation space): register blocking; cache blocking; multiple dense vectors (x); A has special structure (e.g., symmetric, banded, …); hybrid data structures (e.g., splitting, switch-to-dense, …); matrix reordering
    How and when do we search? Off-line: benchmark implementations. Run-time: estimate matrix properties and evaluate performance models based on the benchmark data.

  • Register Blocked SpMV: Pentium III

  • Register Blocked SpMV: Ultra 2i

  • Register Blocked SpMV: Power3

  • Register Blocked SpMV: Itanium

  • Possible Optimization Techniques
    Within an iteration, i.e., computing (G+uuT)*x once: cache block G*x; on linear programming matrices and matrices with random structure (e.g., LSI), 1.5–4x speedups; the best block size is matrix and machine dependent; reordering and/or splitting of G to separate dense structure (rows, columns, blocks)
    Between iterations, e.g., (G+uuT)^2*x:
      (G+uuT)^2*x = G^2*x + (Gu)(uT*x) + u(uT*G*x) + u(uT*u)(uT*x)
    Compute Gu, uT*G, and uT*u once for all iterations; for G^2*x, use inter-iteration tiling to read G only once

  • Multiple Vector Performance: Itanium

  • Sparse Kernels and Optimizations
    Kernels: sparse matrix-vector multiply (SpMV), y=A*x; sparse triangular solve (SpTS), x=T^(-1)*b; y=AAT*x, y=ATA*x; powers (y=Ak*x), sparse triple product (R*A*RT), …
    Optimization techniques (implementation space): register blocking; cache blocking; multiple dense vectors (x); A has special structure (e.g., symmetric, banded, …); hybrid data structures (e.g., splitting, switch-to-dense, …); matrix reordering
    How and when do we search? Off-line: benchmark implementations. Run-time: estimate matrix properties and evaluate performance models based on the benchmark data.

  • SpTS Performance: Itanium (see the POHLL '02 workshop paper, at ICS '02)

  • Sparse Kernels and Optimizations
    Kernels: sparse matrix-vector multiply (SpMV), y=A*x; sparse triangular solve (SpTS), x=T^(-1)*b; y=AAT*x, y=ATA*x; powers (y=Ak*x), sparse triple product (R*A*RT), …
    Optimization techniques (implementation space): register blocking; cache blocking; multiple dense vectors (x); A has special structure (e.g., symmetric, banded, …); hybrid data structures (e.g., splitting, switch-to-dense, …); matrix reordering
    How and when do we search? Off-line: benchmark implementations. Run-time: estimate matrix properties and evaluate performance models based on the benchmark data.

  • Optimizing AAT*x
    Kernel: y = AAT*x, where A is sparse and x & y are dense; arises in linear programming and in computing the SVD
    Conventional implementation: compute z = AT*x, then y = A*z
    Elements of A can be reused; when the a_k represent blocks of columns, register blocking can be applied (see the sketch below)
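    For the closely related kernel y = ATA*x (also listed on the kernels slide), the reuse idea is easy to sketch in C: since A^T*A is the sum of a_i*a_i^T over the rows a_i of A, each CSR row can be read once and used twice, first in a dot product with x and then in a scaled accumulation into y, instead of making two separate passes z = A*x, y = A^T*z. The example data is made up.

```c
/* y = A^T * (A * x) with per-row reuse of the CSR data.                 */
#include <stdio.h>

static void atax_csr(int m, int n, const int *ptr, const int *ind,
                     const double *val, const double *x, double *y)
{
    for (int j = 0; j < n; j++) y[j] = 0.0;
    for (int i = 0; i < m; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)     /* t = a_i . x          */
            t += val[k] * x[ind[k]];
        for (int k = ptr[i]; k < ptr[i+1]; k++)     /* y += t * a_i         */
            y[ind[k]] += t * val[k];                /* row reused from cache */
    }
}

int main(void)
{
    /* A = [[1,2],[0,3]], x = [1,1]:  A*x = [3,3],  A^T*(A*x) = [3,15].  */
    int    ptr[] = {0, 2, 3};
    int    ind[] = {0, 1, 1};
    double val[] = {1, 2, 3};
    double x[]   = {1, 1}, y[2];

    atax_csr(2, 2, ptr, ind, val, x, y);
    printf("y = [%g %g]\n", y[0], y[1]);
    return 0;
}
```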

  • Optimized AAT*x Performance: Pentium III

  • Current Directions
    Applying new optimizations: other split data structures (variable block, diagonal, …); matrix reordering to create block structure; structural symmetry
    New kernels (triple product R*A*RT, powers Ak, …); tuning parameter selection
    Building an automatically tuned sparse matrix library: extending the Sparse BLAS; leveraging existing sparse compilers as code-generation infrastructure
    More thoughts on this topic tomorrow

  • Related Work
    Automatic performance tuning systems: PHiPAC [Bilmes, et al., '97], ATLAS [Whaley & Dongarra '98], FFTW [Frigo & Johnson '98], SPIRAL [Pueschel, et al., '00], UHFFT [Mirkovic and Johnsson '00], MPI collective operations [Vadhiyar & Dongarra '01]
    Code generation: FLAME [Gunnels & van de Geijn, '01]; sparse compilers [Bik '99], Bernoulli [Pingali, et al., '97]; generic programming: Blitz++ [Veldhuizen '98], MTL [Siek & Lumsdaine '98], GMCL [Czarnecki, et al. '98], …
    Sparse performance modeling: [Temam & Jalby '92], [White & Saddayappan '97], [Navarro, et al., '96], [Heras, et al., '99], [Fraguela, et al., '99], …

  • More Related Work
    Compiler analysis and models: CROPS [Carter, Ferrante, et al.]; serial sparse tiling [Strout '01]; TUNE [Chatterjee, et al.]; iterative compilation [O'Boyle, et al., '98]; Broadway compiler [Guyer & Lin, '99]; [Brewer '95]; ADAPT [Voss '00]
    Sparse BLAS interfaces: BLAST Forum (Chapter 3); NIST Sparse BLAS [Remington & Pozo '94]; SparseLib++; SPARSKIT [Saad '94]; Parallel Sparse BLAS [Fillipone, et al. '96]

  • Context: Creating High-Performance Libraries
    Application performance is dominated by a few computational kernels
    Today: kernels are hand-tuned by the vendor or the user
    Performance tuning challenges: performance is a complicated function of kernel, architecture, compiler, and workload; tuning is tedious and time-consuming
    Successful automated approaches: dense linear algebra (ATLAS/PHiPAC); signal processing (FFTW/SPIRAL/UHFFT)

  • Cache Blocked SpMV on LSI Matrix: Itanium

  • Sustainable Memory Bandwidth

  • Multiple Vector Performance: Pentium 4

  • Multiple Vector Performance: Itanium

  • Multiple Vector Performance: Pentium 4

  • Optimized AAT*x Performance: Ultra 2i

  • Optimized AAT*x Performance: Pentium 4

  • Tuning Pays Off: PHiPAC

  • Tuning Pays Off: ATLAS
    Extends the applicability of PHiPAC; incorporated in Matlab (with the rest of LAPACK)

  • Register Tile Sizes (Dense Matrix Multiply): 333 MHz Sun Ultra 2i

    2-D slice of 3-D space; implementations color-coded by performance in Mflop/s

    16 registers, but 2-by-3 tile size fastest

  • High Precision GEMV (XBLAS)

  • High Precision Algorithms (XBLAS)
    Double-double (a high-precision word represented as a pair of doubles); many variations on these algorithms; we currently use Bailey's
    Exploiting extra-wide registers: suppose s(1), …, s(n) have f-bit fractions and SUM has an F > f bit fraction
    Consider the following algorithm for S = Σ_{i=1..n} s(i): sort so that |s(1)| ≥ |s(2)| ≥ … ≥ |s(n)|; SUM = 0; for i = 1 to n, SUM = SUM + s(i); end for; sum = SUM
    Theorem (D., Hida): Suppose F …
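    The "extra-wide accumulator" idea above can be illustrated with a tiny C experiment (a sketch only, not the XBLAS code): f-bit summands (float) are sorted by decreasing magnitude and accumulated in an F-bit register (double), and the result is compared with naive float accumulation on made-up, badly scaled data.

```c
/* Sorted summation into a wider accumulator vs. naive float summation.  */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

static int by_decreasing_magnitude(const void *a, const void *b)
{
    double d = fabsf(*(const float *)b) - fabsf(*(const float *)a);
    return (d > 0) - (d < 0);
}

int main(void)
{
    enum { N = 100000 };
    static float s[N];
    for (int i = 0; i < N; i++)                 /* made-up, ill-scaled data */
        s[i] = (i % 2 ? 1.0f : -1.0f)
             * (1.0f + 1e-7f * i)
             * (i % 7 ? 1e-3f : 1e4f);

    float naive = 0.0f;                          /* f-bit accumulator       */
    for (int i = 0; i < N; i++) naive += s[i];

    qsort(s, N, sizeof s[0], by_decreasing_magnitude);
    double wide = 0.0;                           /* F > f bit accumulator   */
    for (int i = 0; i < N; i++) wide += s[i];

    printf("naive float sum:                %.8g\n", naive);
    printf("sorted, double-accumulated sum: %.17g\n", wide);
    return 0;
}
```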
  • More Extra Slides

  • Opteron (1.4 GHz, 2.8 Gflop/s peak)
    Beats ATLAS DGEMV's 365 Mflop/s
    Non-symmetric peak = 504 Mflop/s; symmetric peak = 612 Mflop/s

  • Awards
    Best Paper, Intern. Conf. Parallel Processing, 2004: "Performance models for evaluation and automatic performance tuning of symmetric sparse matrix-vector multiply"
    Best Student Paper, Intern. Conf. Supercomputing, Workshop on Performance Optimization via High-Level Languages and Libraries, 2003 (Best Student Presentation too), to Richard Vuduc: "Automatic performance tuning and analysis of sparse triangular solve"
    Finalist, Best Student Paper, Supercomputing 2002, to Richard Vuduc: "Performance Optimization and Bounds for Sparse Matrix-vector Multiply"

    Best Presentation Prize, MICRO-33: 3rd ACM Workshop on Feedback-Directed Dynamic Optimization, 2000, to Richard Vuduc: "Statistical Modeling of Feedback Data in an Automatic Tuning System"

  • Accuracy of the Tuning Heuristics (4/4)

  • Can Match DGEMV Performance

  • Fraction of Upper Bound Across Platforms

  • Achieved Performance and Machine Balance
    Machine balance [Callahan '88; McCalpin '95]: Balance = Peak Flop Rate / Bandwidth (flops per double); lower is better (i.e., one can hope to reach a higher fraction of the peak flop rate)
    Ideal balance for mat-vec: 2 flops per double; for SpMV, even less

  • Where Does the Time Go?
    Most time is assigned to memory
    Caches "disappear" when line sizes are equal
    Strictly increasing line sizes

  • Execution Time Breakdown: Matrix 40

  • Execution Time Breakdown (PAPI): Matrix 40

  • Speedups with Increasing Line Size

  • Summary: Performance Upper Bounds
    What is the best we can do for SpMV? Limits to low-level tuning of blocked implementations; refinements?
    What machines are good for SpMV? Partial answer: balance characterization
    Architectural consequences? It helps to have strictly increasing line sizes

  • Evaluating algorithms and machines for SpMV

    Metrics: speedups; Mflop/s ("fair" flops); fraction of peak
    Questions: Speedups are good, but what is the best, independent of instruction scheduling and selection? Can SpMV be further improved or not? What machines are good for SpMV?

  • Tuning Dense BLAS: PHiPAC

  • Tuning Dense BLAS: ATLAS
    Extends the applicability of PHiPAC; incorporated in Matlab (with the rest of LAPACK)

  • Statistical Models for Automatic Tuning
    Idea 1: a statistical criterion for stopping a search. A general search model: generate an implementation, measure its performance, repeat; stop when the probability of being within ε of optimal falls below a threshold; the distribution can be estimated on-line.
    Idea 2: statistical performance models. Problem: choose 1 among m implementations at run-time; sample performance off-line and build a statistical model.

  • SpMV Historical Trends: Fraction of Peak

  • Motivation for Automatic Performance Tuning of SpMV
    Historical trends: sparse matrix-vector multiply (SpMV) runs at 10% of peak or less
    Performance depends on machine, kernel, and matrix; the matrix is known only at run-time; the best data structure + implementation can be surprising
    Our approach: empirical performance modeling and algorithm search

  • Accuracy of the Tuning Heuristics (3/4)

  • Accuracy of the Tuning Heuristics (3/4)DGEMV

  • Sparse CALA Summary
    Optimizing one SpMV at a time is too limiting; take k steps of a Krylov subspace method (GMRES, CG, Lanczos, Arnoldi)
    Assume the matrix is well-partitioned, with a modest surface-to-volume ratio
    Parallel implementation: conventional, O(k log p) messages; CALA, O(log p) messages (optimal)
    Serial implementation: conventional, O(k) moves of data from slow to fast memory; CALA, O(1) moves of data (optimal)
    Can incorporate some preconditioners: need to be able to compress interactions between distant i, j (hierarchical, semiseparable matrices, …)
    Lots of speedup possible (measured and modeled); price: some redundant computation

  • Potential Impact on Applications: T3P
    Application: accelerator design [Ko]; 80% of time spent in SpMV
    Relevant optimization techniques: symmetric storage, register blocking
    On a single-processor Itanium 2: 1.68x speedup (532 Mflop/s, or 15% of the 3.6 Gflop/s peak); 4.4x speedup with multiple (8) vectors (1380 Mflop/s, or 38% of peak)

  • Locally Dependent Entries for [x, Ax], A tridiagonal
    2 processors; can be computed without communication (Proc 1 / Proc 2)

  • Locally Dependent Entries for [x, Ax, A^2x], A tridiagonal
    2 processors; can be computed without communication (Proc 1 / Proc 2)

  • Locally Dependent Entries for [x, Ax, …, A^3x], A tridiagonal
    2 processors; can be computed without communication (Proc 1 / Proc 2)

  • Locally Dependent Entries for [x, Ax, …, A^4x], A tridiagonal
    2 processors; can be computed without communication (Proc 1 / Proc 2)

  • Sequential [x, Ax, …, A^4x], with memory hierarchy
    One read of the matrix from slow memory, not k = 4
    Minimizes words moved (the bandwidth cost); no redundant work
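    For contrast with the one-read scheme above, the conventional matrix powers kernel below performs k separate CSR SpMVs and therefore reads A from slow memory k times; the communication-avoiding reorganization (blocking rows, with ghost entries, so that A is read only once) is described on the slides but not implemented in this made-up tridiagonal example.

```c
/* Conventional [x, Ax, ..., A^k x]: k full passes over A.               */
#include <stdio.h>

#define N 8
#define K 4

static void spmv(const int *ptr, const int *ind, const double *val,
                 const double *x, double *y)
{
    for (int i = 0; i < N; i++) {
        double t = 0.0;
        for (int k = ptr[i]; k < ptr[i+1]; k++)
            t += val[k] * x[ind[k]];
        y[i] = t;
    }
}

int main(void)
{
    int ptr[N+1], ind[3*N]; double val[3*N];
    int nz = 0;
    for (int i = 0; i < N; i++) {               /* tridiagonal A of ones   */
        ptr[i] = nz;
        for (int j = i-1; j <= i+1; j++)
            if (j >= 0 && j < N) { ind[nz] = j; val[nz] = 1.0; nz++; }
    }
    ptr[N] = nz;

    double V[K+1][N];                            /* V[j] = A^j * x          */
    for (int i = 0; i < N; i++) V[0][i] = 1.0;
    for (int j = 1; j <= K; j++)
        spmv(ptr, ind, val, V[j-1], V[j]);       /* one full read of A each */

    printf("(A^%d x)[0] = %g\n", K, V[K][0]);
    return 0;
}
```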

    Speaker notes from the original slides:

    Start-up based on SPIRAL?

    A 2-D slice of the 3-D core matmul space, with all other generator options fixed. This shows the performance (Mflop/s) of various core matmul implementations on a small in-L2-cache workload. The best is the dark red square at (m0=2, k0=1, n0=8), which achieved 620 Mflop/s. This experiment was performed on a Sun Ultra-10 workstation (333 MHz Ultra II processor, 2 MB L2 cache) using the Sun cc compiler v5.0 with the flags -dalign -xtarget=native -xO5 -xarch=v8plusa. The space is discrete and highly irregular. (This is not even the worst example!)

    Paper: "Statistical Models for Empirical Search-Based Performance Tuning", International Journal of High Performance Computing Applications, 18 (1), pp. 65-94, February 2004; Richard Vuduc, James W. Demmel, and Jeff A. Bilmes.

    Archana Ganapathi, "Predicting and Optimizing System Utilization and Performance via Statistical Machine Learning", Computer Science PhD Thesis, University of California, Berkeley, UCB//EECS-2009-181.

    First 20,000 columns of the 4k x 1M matrix rail4284.

    You probably use a sparse matrix every day, albeit indirectly. Here, I show a million x million submatrix of the web connectivity graph, constructed from an archive at the Stanford WebBase.

    A one-slide crash course on a basic sparse matrix data structure (compressed sparse row, or CSR, format) and a basic kernel: matrix-vector multiply. In CSR, store each non-zero value plus an extra index for each non-zero: less storage than dense overall, but a higher storage cost per non-zero. Sparse matrix-vector multiply (SpMV): no reuse of A, just of the vectors x and y, taken to be dense. SpMV with CSR storage: indirect, irregular accesses to x!

    So what you'd like to do is: unroll the k loop (but you need to know how many non-zeros per row); hoist y[i] (not a problem, absent aliasing); eliminate ind[i] (it's an extra load, but you need to know the non-zero pattern); reuse elements of x (but you need to know the non-zero pattern).

    FEM discretization from a structural analysis problem. Source: matrix "raefsky", provided by NASA via the University of Florida Sparse Matrix Collection, http://www.cise.ufl.edu/research/sparse/matrices/Simon/raefsky3.html

    Experiment: try all block sizes that divide 8x8 (16 implementations in all). Platform: 900 MHz Itanium 2, 3.6 Gflop/s peak speed, Intel v7.0 compiler. Good speedups (4x) but at an unexpected block size (4x2). Figure taken from the Im, Yelick, Vuduc IJHPCA paper, to appear.

    An enlarged submatrix for ex11, from an FEM fluid flow application. Original matrix: n = 16614, nnz = 1.1M. So…