High Performance Matrix Computations / Calcul Matriciel Haute Performance



Page 1: High Performance Matrix Computations/Calcul Matriciel Haute

High Performance Matrix Computations / Calcul Matriciel Haute Performance

J.-Y. L'Excellent (INRIA/LIP-ENS Lyon) [email protected]

in collaboration with P. Amestoy, M. Daydé, L. Giraud (ENSEEIHT-IRIT)

2007-2008

1/ 627

Page 2: High Performance Matrix Computations/Calcul Matriciel Haute

Outline

Introduction
  Introduction to high-performance computers
  Architectural evolutions
  Programming
  Matrix computations and high-performance computing
  Grid computing - Internet computing
  Conclusion

2/ 627

Page 3: High Performance Matrix Computations/Calcul Matriciel Haute

Introduction
  Introduction to high-performance computers
  Architectural evolutions
  Programming
  Matrix computations and high-performance computing
  Grid computing - Internet computing
  Conclusion

3/ 627

Page 4: High Performance Matrix Computations/Calcul Matriciel Haute

- Benefits of high-performance computing:
  - Time-critical applications
  - Larger problem sizes
  - Reduced response time
  - Lower computing costs

- Difficulties:
  - Data access: complex memory hierarchy -> exploit the locality of data references
  - Identifying and managing parallelism in an application -> algorithmic approach

4/ 627

Page 5: High Performance Matrix Computations/Calcul Matriciel Haute

Parallel systems: finally coming of age!

- The most powerful machines have a high degree of parallelism
- The price/performance ratio is attractive
- Only a few vendors remain in the race
- More stable systems
- Application software and libraries are available
- Industrial and commercial use: no longer confined to research laboratories
- But: significant algorithmic work, and validation/maintenance are difficult.

New trends:
- 1 core per chip -> multi-core chips
- supercomputing -> metacomputing ("grid computing")

5/ 627

Page 6: High Performance Matrix Computations/Calcul Matriciel Haute

Classes of computers

- Compute servers:
  - Usable on a wide range of applications
  - Multiprogramming and time sharing
  - Workstations, departmental servers, computing centres
- More specific computers:
  - Efficient on a more limited class of problems (high degree of parallelism)
  - Because of their architecture or of software limitations
  - For example massively parallel architectures (MPP, PC clusters, ...)
  - Large gains possible, with an attractive cost/performance ratio
- Specialized computers:
  - Solve one problem (image processing, crash test, ...)
  - Hardware and software designed for that target application
  - Very large gains possible, with a very attractive cost/performance ratio
  - For example, the MDGRAPE-3 machine (molecular dynamics) installed in Japan reaches 1 PFlop/s!

6/ 627

Page 7: High Performance Matrix Computations/Calcul Matriciel Haute

Needs in scientific computing

Traditional science

1. Build a theory,
2. Perform experiments or build a system.

- too difficult (e.g. very large wind tunnels)
- too expensive (building an aircraft just for a few experiments)
- too slow (waiting for the climate / the universe to evolve)
- too dangerous (weapons, drugs, experiments on the climate)

Scientific computing

- simulate the behaviour of complex systems through numerical simulation.
- physical laws + numerical algorithms + high-performance computers

7/ 627

Page 8: High Performance Matrix Computations/Calcul Matriciel Haute

Examples in scientific computing

- Time constraints: climate forecasting

8/ 627

Page 9: High Performance Matrix Computations/Calcul Matriciel Haute

Some examples in scientific computing

- Cost constraints: wind tunnels, crash simulation, ...

9/ 627

Page 10: High Performance Matrix Computations/Calcul Matriciel Haute

Scale Constraints

- large scale: climate modelling, pollution, astrophysics

- tiny scale: combustion, quantum chemistry

10/ 627

Page 11: High Performance Matrix Computations/Calcul Matriciel Haute

Why parallel processing?

- Unmet computing needs in many disciplines (to solve significant problems)
- Single-processor performance is close to physical limits
  Cycle time 0.5 nanosecond <-> 4 GFlop/s (with 2 floating-point operations / cycle)
- A 20 TFlop/s computer => 5000 processors -> massively parallel computers
- Not because it is the simplest way, but because it is necessary
- Current target (2010): a 3 PFlop/s supercomputer with 500 TBytes of memory?
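A minimal sketch (not from the slides' code) of the arithmetic behind the two claims above: peak rate from clock and flops/cycle, and the processor count implied by a 20 TFlop/s target.

  def peak_gflops(clock_ghz, flops_per_cycle):
      """Peak rate of one processor in GFlop/s."""
      return clock_ghz * flops_per_cycle

  per_proc = peak_gflops(clock_ghz=2.0, flops_per_cycle=2)  # 0.5 ns cycle -> 4 GFlop/s
  target_tflops = 20.0
  nprocs = target_tflops * 1000 / per_proc                  # 20 TFlop/s / 4 GFlop/s
  print(per_proc, "GFlop/s per processor ->", int(nprocs), "processors")  # 4.0 -> 5000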

11/ 627

Page 12: High Performance Matrix Computations/Calcul Matriciel Haute

Some units for high-performance computing

Speed

1 MFlop/s   1 Megaflop/s   10^6  operations / second
1 GFlop/s   1 Gigaflop/s   10^9  operations / second
1 TFlop/s   1 Teraflop/s   10^12 operations / second
1 PFlop/s   1 Petaflop/s   10^15 operations / second

Memory

1 kB / 1 ko   1 kilobyte   10^3  bytes
1 MB / 1 Mo   1 Megabyte   10^6  bytes
1 GB / 1 Go   1 Gigabyte   10^9  bytes
1 TB / 1 To   1 Terabyte   10^12 bytes
1 PB / 1 Po   1 Petabyte   10^15 bytes

12/ 627

Page 13: High Performance Matrix Computations/Calcul Matriciel Haute

Performance measures

- Number of floating-point operations per second (not MIPS)
- Peak performance:
  - What appears in vendors' advertisements
  - Assumes that all processing units are active
  - You are guaranteed not to go faster:
    Peak performance = #functional units / cycle time (sec.)
- Actual performance:
  - Usually much lower than the former, unfortunately

13/ 627

Page 14: High Performance Matrix Computations/Calcul Matriciel Haute

The ratio (actual performance / peak performance) is often low!!
Let P be a program:

1. Sequential processor:
   - 1 scalar unit (1 GFlop/s)
   - Execution time of P: 100 s

2. Parallel machine with 100 processors:
   - Each processor: 1 GFlop/s
   - Peak performance: 100 GFlop/s

3. If P = sequential code (10%) + parallelized code (90%):
   - Execution time of P: 0.9 + 10 = 10.9 s
   - Actual performance: 9.2 GFlop/s

4. Actual performance / peak performance = 0.1
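A hedged sketch reproducing the slide's arithmetic (100 s of work, 100 processors of 1 GFlop/s each, 90% of the code parallelized):

  t1 = 100.0                    # sequential execution time (s)
  peak_per_proc = 1.0           # GFlop/s
  nprocs = 100
  work = t1 * peak_per_proc     # total work: 100 GFlop
  f_seq, f_par = 0.10, 0.90
  t_par = f_seq * t1 + f_par * t1 / nprocs   # 10 + 0.9 = 10.9 s
  real = work / t_par                        # ~9.2 GFlop/s
  peak = nprocs * peak_per_proc              # 100 GFlop/s
  print(t_par, round(real, 1), round(real / peak, 2))   # 10.9  9.2  0.09 (~0.1)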

14/ 627

Page 15: High Performance Matrix Computations/Calcul Matriciel Haute

Amdahl's law

- f_s: fraction of an application that cannot be parallelized
  f_p = 1 - f_s: fraction of parallelized code
  N: number of processors

- Amdahl's law:

  t_N >= (f_p / N + f_s) t_1 >= f_s t_1

  Speed-up: S = t_1 / t_N <= 1 / (f_s + f_p / N) <= 1 / f_s

(Figure: sequential vs parallel parts of the execution times t_1, t_2, t_3, ...; asymptotically t_inf = f_s t_1.)
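A small sketch of the speed-up bound stated above (the helper name is mine):

  def amdahl_speedup(f_s, n):
      """Upper bound on speed-up with sequential fraction f_s on n processors."""
      f_p = 1.0 - f_s
      return 1.0 / (f_s + f_p / n)

  for n in (1, 10, 100, 1000, 10**6):
      print(n, round(amdahl_speedup(0.1, n), 2))   # tends to 1/f_s = 10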

15/ 627

Page 16: High Performance Matrix Computations/Calcul Matriciel Haute

Computer                          procs   LINPACK n=100   LINPACK n=1000   Peak
Intel WoodCrest (1 core, 3GHz)      1         3018             6542        12000
HP ProLiant (1 core, 3.8GHz)        1         1852             4851         7400
HP ProLiant (1 core, 3.8GHz)        2           -              8197        14800
IBM eServer (1.9GHz, Power5)        1         1776             5872         7600
IBM eServer (1.9GHz, Power5)        8           -             34570        60800
Fujitsu Intel Xeon (3.2GHz)         1         1679             3148        12800
Fujitsu Intel Xeon (3.2GHz)         2           -              5151         6400
SGI Altix (1.5GHz Itanium2)         1         1659             5400         6000
NEC SX-8 (2 GHz)                    1         2177            14960        16000
Cray T932                          32    1129 (1 proc.)      29360        57600
Hitachi S-3800/480                  4     408 (1 proc.)      20640        32000

Table: Performance (MFlop/s) for the solution of a system of linear equations (from the LINPACK Benchmark, Dongarra [07])

16/ 627

Page 17: High Performance Matrix Computations/Calcul Matriciel Haute

(Figure: Grand challenge problems (1995). Applications such as the 2D airfoil, oil reservoir modelling, 48-hour and 72-hour weather modelling, vehicle signature, 3D plasma modelling, chemical dynamics, pharmaceutical design, structural biology, global change, human genome, fluid turbulence, vehicle dynamics, ocean circulation, viscous fluid dynamics, superconductor modelling, quantum chromodynamics and vision, positioned by memory requirements (10 MB to 1 TB) and computing-speed requirements (100 MFlop/s to 1 TFlop/s) over the years 1980-1995 and beyond.)

17/ 627

Page 18: High Performance Matrix Computations/Calcul Matriciel Haute

Machine             Small problem    Large problem
PFlop/s computer         -           36 seconds
TFlop/s computer     2 seconds       10 hours
CM2 64K              30 minutes      1 year
CRAY-YMP-8           4 hours         10 years
ALLIANT FX/80        5 days          250 years
SUN 4/60             1 month         1500 years
VAX 11/780           9 months        14,000 years
IBM AT               9 years         170,000 years
APPLE MAC            23 years        450,000 years

Table: Speed of various computers on a Grand Challenge problem in 1995 (after J.J. Dongarra)

Since then, "Grand Challenge" problems have grown!

18/ 627


Page 20: High Performance Matrix Computations/Calcul Matriciel Haute

Introduction
  Introduction to high-performance computers
  Architectural evolutions
  Programming
  Matrix computations and high-performance computing
  Grid computing - Internet computing
  Conclusion

19/ 627

Page 21: High Performance Matrix Computations/Calcul Matriciel Haute

Architectural evolutions: history

For $1,000: a personal computer that is faster, with more memory and more disk, than a $1,000,000 computer from the 1970s -> technology and design!

- During the first 25 years of computing, progress came from technology and architecture
- Since the 1970s:
  - design based on integrated circuits
  - performance: +25-30% per year for the mainframes and minis that dominated the industry
- Since the end of the 1970s: emergence of the microprocessor
  - better exploitation of advances in integration than mainframes and minis (which were less integrated)
  - progress and cost advantage (mass production): more and more machines are based on microprocessors
  - possibility of a faster rate of improvement: 35% per year

20/ 627

Page 22: High Performance Matrix Computations/Calcul Matriciel Haute

Architectural evolutions: history

- Two changes in the market made it easier to introduce new architectures:
  1. decreasing use of assembly language (binary compatibility less important)
  2. standard operating systems, independent of the architecture (e.g. UNIX)
  => development of a new family of architectures: RISC, from 1985 on
- performance: +50% per year!!!
- Consequences:
  - more power:
    - Performance of a PC > CRAY C90 (1995)
    - Much lower price
  - Domination of microprocessors
    - PCs, workstations
    - Minis replaced by microprocessor-based servers
    - Mainframes replaced by multiprocessors with a small number of RISC processors (SMP)
    - Supercomputers based on RISC processors (mostly MPP)

20/ 627

Page 23: High Performance Matrix Computations/Calcul Matriciel Haute

Moore’s law

- Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of integrated circuits would double every 24 months.
- It also served as a target for manufacturers.
- It has been distorted over time:
  - 24 -> 18 months
  - number of transistors -> performance

21/ 627

Page 24: High Performance Matrix Computations/Calcul Matriciel Haute

How to increase computing speed?

- Increase the clock frequency with faster technologies

  We are reaching the limits:
  - Chip design
  - Power consumption and heat dissipation
  - Cooling => space problems

- We can still miniaturize, but:
  - not indefinitely
  - the resistance of the conductors (R = rho * l / s) increases, and...
  - resistance is responsible for energy dissipation (Joule effect).
  - capacitance effects are hard to control

  Note: 1 nanosecond = the time for a signal to travel along 30 cm of cable

- Cycle time of 1 nanosecond <-> 2 GFlop/s (with 2 floating-point operations per cycle)

22/ 627

Page 25: High Performance Matrix Computations/Calcul Matriciel Haute

The only solution: parallelism

- parallelism: simultaneous execution of several instructions within a program
- Inside a processor:
  - micro-instructions
  - pipelined processing
  - overlapping of instructions executed by distinct units
  -> transparent for the programmer (handled by the compiler or at run time)
- Between distinct processors or cores:
  - different instruction streams are executed
  -> implicit synchronizations (compiler, automatic parallelization) or explicit ones (user)

23/ 627

Page 26: High Performance Matrix Computations/Calcul Matriciel Haute

High-performance central processing units

Key concept: pipelined processing:

- The execution of an (arithmetic) operation is broken down into several sub-operations
- Each sub-operation is executed by a dedicated functional unit = a stage (assembly-line work)
- Example for a dyadic operation (a <- b x c):

  T1. Separate mantissa and exponent
  T2. Multiply the mantissas
  T3. Add the exponents
  T4. Normalize the result
  T5. Attach the sign to the result

24/ 627

Page 27: High Performance Matrix Computations/Calcul Matriciel Haute

Example for dyadic operations (continued)

- Assumption: the operation a <- b x c is carried out in 5 elementary steps T1, T2, ..., T5 of one cycle each. How many processor cycles does the following loop take?

  for i = 1 to N
     A(i) = B(i) * C(i)
  end for

- Non-pipelined processing: N * 5 cycles
- Pipelined processing (assembly line): N + 5 cycles
  - 1st cycle: T1(1)
  - 2nd cycle: T1(2), T2(1)
  - 3rd cycle: T1(3), T2(2), T3(1)
  - ...
  - k-th cycle: T1(k), T2(k-1), T3(k-2), T4(k-3), T5(k-4)
  - ...
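A small sketch of the cycle counts quoted on this slide (startup of one pass through the 5 stages, then one result per cycle):

  def cycles(n, stages=5, pipelined=True):
      # pipelined: fill the pipe once, then one result per cycle (~ n + stages)
      # non-pipelined: every element pays the full latency (n * stages)
      return n + stages if pipelined else n * stages

  for n in (10, 100, 1000):
      print(n, cycles(n, pipelined=False), cycles(n, pipelined=True))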

25/ 627

Page 28: High Performance Matrix Computations/Calcul Matriciel Haute

Impact of the CRAY approach

The CRAY approach (1980s) had a major impact on supercomputer design:

- the fastest possible clock
- a sophisticated pipelined vector unit
- vector registers
- very high-performance memory
- shared-memory multiprocessors
- vector processors:
  - exploit the regularity of the processing applied to the elements of a vector
  - pipelined processing
  - commonly used on supercomputers
  - vectorization done by the compiler

26/ 627

Page 29: High Performance Matrix Computations/Calcul Matriciel Haute

RISC processors

- RISC processors: introduced on the market around 1990
  "the attack of the killer micros"
  - pipelining of scalar operations
  - performance close to that of vector processors at equal frequency
  - more efficient on scalar problems
- CISC (Complex Instruction Set Computer)
  - Efficiency through better instruction encoding
- RISC (Reduced Instruction Set Computer)
  - Concept studied at the end of the 1970s
  - Reduce the number of cycles per instruction to 1

    Simple instruction set
    -> Simplified hardware
    -> Shorter cycle time

27/ 627

Page 30: High Performance Matrix Computations/Calcul Matriciel Haute

- Key ideas in the design of RISC processors:
  - Instructions decoded in 1 cycle
  - Only the essential is implemented in hardware
  - Load/store interface with memory
  - Intensive use of pipelining to obtain one result per cycle, even for complex operations
  - High-performance memory hierarchy
  - Simple instruction format
  - Superscalar or superpipelined RISC: several functional units

28/ 627

Page 31: High Performance Matrix Computations/Calcul Matriciel Haute

Computer                          procs   LINPACK n=100   LINPACK n=1000   Peak
Intel WoodCrest (1 core, 3GHz)      1         3018             6542        12000
HP ProLiant (1 core, 3.8GHz)        1         1852             4851         7400
IBM eServer (1.9GHz, Power5)        1         1776             5872         7600
SGI Altix (1.6GHz Itanium2)         1         1765             5943         6400
AMD Opteron (2.19GHz)               1         1253             3145         4284
Fujitsu Intel Xeon (3.2GHz)         1         1679             3148        12800
AMD Athlon (1GHz)                   1          832             1705         3060
Compaq ES45 (1GHz)                  1          824             1542         2000

Current performance of a vector processor:
NEC SX-8 (2 GHz)                    1         2177            14960        16000
NEC SX-8 (2 GHz)                    8           -             75140       128000

Table: Performance of RISC processors (MFlop/s, LINPACK Benchmark, Dongarra [07])

29/ 627

Page 32: High Performance Matrix Computations/Calcul Matriciel Haute

Multi-core architectures

Observations
- The number of components per chip will keep increasing
- The clock frequency cannot increase much further (heat/cooling)
- It is hard to find enough parallelism in the instruction stream of a single process

Multi-core
- several cores inside a single processor
- seen as several logical processors by the user
- But: multi-threading is required at the application level

30/ 627

Page 33: High Performance Matrix Computations/Calcul Matriciel Haute

The Cell processor

- The PS3 is based on a Cell processor (Sony, Toshiba, IBM)
- 1 Cell = one PowerPC + 8 SPEs (Synergistic Processing Elements)
- 1 SPE = SIMD vector processor + DMA = 25.6 GFlop/s
- 204 GFlop/s of peak performance in 32-bit arithmetic
  (14.6 GFlop/s in 64-bit)
- Hence a renewed interest in 32-bit computation
  - Mixing single- and double-precision arithmetic (see [?])
  - Typically: 32-bit for the bulk of the computation, 64-bit to improve accuracy
  - Not only on the Cell processor

Page 34: High Performance Matrix Computations/Calcul Matriciel Haute

Example of mixed-precision arithmetic

- Solve Ax = b, A sparse, with the sparse direct solver MUMPS
- Compare single precision + iterative refinement to a double-precision run (the number of iterative refinement steps is indicated on the figure).

(Figure: speed-up obtained with respect to double precision; results from A. Buttari et al., 2007)
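A toy dense sketch of the idea only (the slide uses the sparse solver MUMPS; here the principle is mimicked with NumPy): solve in single precision, then refine with residuals computed in double precision.

  import numpy as np

  rng = np.random.default_rng(0)
  n = 200
  A = rng.standard_normal((n, n)) + n * np.eye(n)    # well-conditioned test matrix
  b = rng.standard_normal(n)

  A32, b32 = A.astype(np.float32), b.astype(np.float32)
  x = np.linalg.solve(A32, b32).astype(np.float64)   # "cheap" single-precision solve

  for step in range(5):                              # refinement in double precision
      r = b - A @ x                                  # residual in float64
      dx = np.linalg.solve(A32, r.astype(np.float32)).astype(np.float64)
      x += dx
      print(step, np.linalg.norm(r) / np.linalg.norm(b))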

32/ 627

Page 35: High Performance Matrix Computations/Calcul Matriciel Haute

Year        Computer(s)                                        MFlop/s
1955-65     CDC 6600                                           1 - 10
1965-75     CDC 7600, IBM 370/195, ILLIAC IV                   10 - 100
1975-85     CRAY-1, XMP, CRAY 2, CDC CYBER 205,                100 - 1000
            FUJITSU VP400, NEC SX-2
1985-1995   CRAY-YMP, C90, ETA-10, NEC SX-3,                   1000 - 100,000
            FUJITSU VP2600
1995-2005   CRAY T3E 1.2 TFlop/s, INTEL 1.8 TFlop/s,
            IBM SP 16 TFlop/s, HP 20 TFlop/s,
            NEC 40 TFlop/s, IBM Blue Gene 180 TFlop/s
2008 -      Roadrunner 1 PFlop/s

Table: Evolution of performance by decade

Page 36: High Performance Matrix Computations/Calcul Matriciel Haute

Problems

- In practice, we often reach only 10% of the peak performance
- Faster processors -> faster data access needed:
  - memory organisation,
  - inter-processor communication
- More complex hardware: pipelines, technology, networks, ...
- More complex software: compilers, operating systems, programming languages, parallelism management, ... and applications

It is becoming harder to program efficiently

34/ 627

Page 37: High Performance Matrix Computations/Calcul Matriciel Haute

Memory speed vs processor speed

- Processor performance: +60% per year
- DRAM memory: +9% per year

-> The ratio (processor performance / memory access time) increases by about 50% per year!!

35/ 627

Page 38: High Performance Matrix Computations/Calcul Matriciel Haute

Memory bandwidth problems

- Data access is a crucial problem in modern computers
- Increasing the computing speed without increasing the memory bandwidth -> bottleneck
  (MFlop/s are easier to obtain than MB/s of memory bandwidth)
- Processor cycle time -> 2 GHz (0.5 ns)
  Memory cycle time  -> ~20 ns for SRAM
                        ~50 ns for DRAM

36/ 627

Page 39: High Performance Matrix Computations/Calcul Matriciel Haute

How to obtain high memory bandwidth?

- Several access paths between memory and processors
  - CRAY XMP and YMP:
    - 2 vector loads + 1 vector store + 1 I/O
    - used to access distinct vectors
  - NEC SX:
    - multiple access paths can also be used to load a single vector
    - (improves the bandwidth, but not the latency!)
- Several memory modules accessed simultaneously (interleaving)
- Pipelined memory accesses
- Hierarchically organised memory
- The way data are accessed can affect performance:
  - Minimize cache misses
  - Minimize memory paging
  - Locality: improve the ratio of references to local memories over references to remote memories

37/ 627

Page 40: High Performance Matrix Computations/Calcul Matriciel Haute

(Figure: Example of a memory hierarchy.)

Level            Size              Average access time (# cycles) hit/miss
Registers        -                 < 1
Cache level #1   1 - 128 KB        1-2 / 8-66
Cache level #2   256 KB - 16 MB    6-15 / 30-200
Main memory      1 - 10 GB         10 - 100
Remote memory    -                 500 - 5000
Disks            -                 700,000 / 6,000,000

38/ 627

Page 41: High Performance Matrix Computations/Calcul Matriciel Haute

Memory design for a large number of processors?

How can 100 processors access data stored in a shared memory (technology, interconnect, price?)
-> Affordable solution: physically distributed memory (each processor has its own local memory)

- 2 solutions:
  - globally addressable local memories: virtual shared-memory computers
  - explicit data transfers between processors through message passing
- Scalability requires:
  - memory bandwidth that grows linearly with processor speed
  - communication bandwidth that grows with the number of processors
- Cost/performance ratio -> distributed memory and a good cost/performance ratio for the processors

39/ 627

Page 42: High Performance Matrix Computations/Calcul Matriciel Haute

Multiprocessor architecture

A large number of processors -> physically distributed memory

Logical          Physical organisation
organisation     Shared (32 procs max)            Distributed
Shared           shared-memory multiprocessors    global address space (hard/soft) on top of messages;
                                                  virtual shared memory
Distributed      message emulation (buffers)      message passing

Table: Organisation of the processors

Note: programming standards

Shared logical organisation: threads, OpenMP
Distributed logical organisation: PVM, MPI, sockets

40/ 627

Page 43: High Performance Matrix Computations/Calcul Matriciel Haute

(Figure: Example of a shared-memory architecture: processors P1, P2, ..., Pn connected through an interconnection network to a shared memory.)

41/ 627

Page 44: High Performance Matrix Computations/Calcul Matriciel Haute

(Figure: Example of a distributed-memory architecture: processors P1, P2, ..., Pn, each with its own local memory (LM), connected by an interconnection network.)

42/ 627

Page 45: High Performance Matrix Computations/Calcul Matriciel Haute

Remarks

Physically shared memory
- Uniform access time to the whole memory

Physically distributed memory
- Access time depends on where the data is located

Logically shared memory
- Single address space
- Implicit communication between the processors through the shared memory

Logically distributed memory
- Several private address spaces
- Explicit communication (messages)

43/ 627

Page 46: High Performance Matrix Computations/Calcul Matriciel Haute

Terminology

SMP (Symmetric Multi Processor) architecture

- Shared memory (physically and logically)
- Identical access time to memory
- From the application point of view, similar to multi-core architectures (1 core = 1 logical processor)
- But communication is much faster within multi-cores (latency < 3 ns, bandwidth > 20 GB/s) than within SMPs (latency ~ 60 ns, bandwidth ~ 2 GB/s)

NUMA (Non Uniform Memory Access) architecture

- Physically distributed, logically shared memory
- Easier to increase the number of processors than with SMP
- Access time depends on where the data is located
- Local accesses are faster than remote accesses
- Hardware maintains cache coherence (ccNUMA)

44/ 627

Page 47: High Performance Matrix Computations/Calcul Matriciel Haute

Examples

- Physically and logically shared memory (SMP):
  most supercomputers with a small number of processors: multi-processor workstations (SUN: up to 64 processors, ...), NEC, SGI Power Challenge, ...
- Physically and logically distributed memory:
  clusters of single-processor PCs, IBM SP2, T3D, T3E, ...
- Physically distributed, logically shared memory (NUMA):
  BBN, KSR, SGI Origin, SGI Altix, ...

45/ 627

Page 48: High Performance Matrix Computations/Calcul Matriciel Haute

Clusters of multiprocessors

- Several levels of memory and of interconnection networks -> non-uniform access time
- A common memory shared by a small number of processors (SMP node)
- Possibly distinct programming tools (message passing between the clusters, ...)
- Examples:
  - clusters of dual- or quad-processor nodes,
  - IBM SP (CINES, IDRIS): several nodes of 4 to 32 Power4+ ...

46/ 627

Page 49: High Performance Matrix Computations/Calcul Matriciel Haute

(Figure: Example of a clustered architecture: SMP nodes (several processors sharing a memory) and processors with local memories (LM), connected by several levels of networks.)

47/ 627

Page 50: High Performance Matrix Computations/Calcul Matriciel Haute

Networks of computers

- Evolution from centralized computing to computing distributed over networks of computers
  - Growing power of workstations
  - Attractive from a cost point of view
  - Identical processors in workstations and MPPs
- Parallel computing and distributed computing can converge:
  - programming model
  - software environment: PVM, MPI, ...
- Effective performance can vary enormously for a given application
- Heterogeneous / homogeneous
- Rather oriented towards coarse-grain parallelism (independent tasks, ...)
- Performance highly dependent on communication (bandwidth and latency)
- Load of the network and of the computers varies during execution
  Load balancing of the computations?

48/ 627

Page 51: High Performance Matrix Computations/Calcul Matriciel Haute

(Figure: Example of a network of computers: individual computers, a cluster and a multiprocessor connected through two networks.)

49/ 627

Page 52: High Performance Matrix Computations/Calcul Matriciel Haute

Multiprocessors vs networks of machines

- Distributed systems (networks of machines): relatively slow communication and independent systems
- Parallel systems (multiprocessor architectures): faster communication (faster interconnection network) and more homogeneous systems

These two classes of architectures are converging and the border between them is blurred:

- clusters and clusters of clusters
- distributed operating systems (e.g. MACH and CHORUS OS) can manage both
- multiprocessor versions of UNIX
- often the same development environments

50/ 627

Page 53: High Performance Matrix Computations/Calcul Matriciel Haute

Introduction
  Introduction to high-performance computers
  Architectural evolutions
  Programming
  Matrix computations and high-performance computing
  Grid computing - Internet computing
  Conclusion

51/ 627

Page 54: High Performance Matrix Computations/Calcul Matriciel Haute

Flynn's classification

- S.I.S.D.: Single Instruction, Single Data stream
  - single-processor architecture
  - conventional von Neumann computer
  - examples: SUN, PC
- S.I.M.D.: Single Instruction, Multiple Data stream
  - the processors execute the same instruction synchronously on different data (e.g. elements of a vector, of a matrix, of an image)
  - a control unit broadcasts the instructions
  - identical processors
  - Examples: CM-2, MasPar, ...
  - more recently: each of the 8 SPEs of the CELL processor behaves like a SIMD system

52/ 627

Page 55: High Performance Matrix Computations/Calcul Matriciel Haute

- M.I.S.D.: does not exist
- M.I.M.D.: Multiple Instructions, Multiple Data stream
  - the processors execute different instructions asynchronously on different data
  - possibly heterogeneous processors
  - each processor has its own control unit
  - examples: ALLIANT, CONVEX, CRAYs, IBM SP, BEOWULF clusters, multi-processor servers, networks of workstations, ...

53/ 627

Page 56: High Performance Matrix Computations/Calcul Matriciel Haute

SIMD and MIMD programming models

- Advantages of SIMD:
  - Ease of programming and debugging
  - Synchronized processors -> minimal synchronization costs
  - A single copy of the program
  - Simple instruction decoding
- Advantages of MIMD:
  - More flexible, much more general
  - Examples:
    - shared memory: OpenMP, POSIX threads
    - distributed memory: PVM, MPI (from C/C++/Fortran)

54/ 627

Page 57: High Performance Matrix Computations/Calcul Matriciel Haute

Introduction
  Introduction to high-performance computers
  Architectural evolutions
  Programming
  Matrix computations and high-performance computing
  Grid computing - Internet computing
  Conclusion

55/ 627

Page 58: High Performance Matrix Computations/Calcul Matriciel Haute

Matrix computations and high-performance computing

- General approach in scientific computing:

  1. Simulation problem (continuous problem)
  2. Application of physical laws (partial differential equations)
  3. Discretization, formulation in finite dimension
  4. Solution of linear systems (Ax = b)
  5. (Analysis of the results, possibly revisiting the model or the method)

- Solving linear systems is the fundamental algorithmic kernel. Parameters to take into account:
  - Properties of the system (symmetry, positive definiteness, conditioning, over-determined, ...)
  - Structure: dense or sparse,
  - Size: several million equations?

56/ 627

Page 59: High Performance Matrix Computations/Calcul Matriciel Haute

Partial differential equations

- Modelling of a physical phenomenon
- Differential equations involving:
  - forces
  - moments
  - temperatures
  - velocities
  - energies
  - time
- Analytical solutions are rarely available

57/ 627

Page 60: High Performance Matrix Computations/Calcul Matriciel Haute

Examples of partial differential equations

- Find the electric potential φ for a given charge distribution:
  ∇²φ = f, i.e. Δφ = f, or
  ∂²φ/∂x² + ∂²φ/∂y² + ∂²φ/∂z² = f(x, y, z)

- Heat equation (Fourier's equation):
  ∂²u/∂x² + ∂²u/∂y² + ∂²u/∂z² = (1/α) ∂u/∂t
  with
  - u = u(x, y, z, t): temperature,
  - α: thermal diffusivity of the medium.

- Wave propagation equations, Schrödinger equation, Navier-Stokes, ...

58/ 627

Page 61: High Performance Matrix Computations/Calcul Matriciel Haute

Discretization (the step that follows the physical modelling)

The numerical analyst's job:

- Build a mesh
- Choose the solution methods and study their behaviour
- Study the loss of information due to moving to finite dimension

Main discretization techniques

- Finite differences
- Finite elements
- Finite volumes

59/ 627

Page 62: High Performance Matrix Computations/Calcul Matriciel Haute

Discretization with finite differences (1D)

- Basic approximation (ok if h is small enough):

  (du/dx)(x) ≈ ( u(x + h) - u(x - h) ) / (2h)

- Results from Taylor's formula:

  u(x + h) = u(x) + h du/dx + (h²/2) d²u/dx² + (h³/6) d³u/dx³ + O(h⁴)

- Replacing h by -h:

  u(x - h) = u(x) - h du/dx + (h²/2) d²u/dx² - (h³/6) d³u/dx³ + O(h⁴)

- Thus:

  d²u/dx² = ( u(x + h) - 2u(x) + u(x - h) ) / h² + O(h²)
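A quick numerical check (illustration only) of the centered second-derivative formula above, on u = sin, whose second derivative is -sin:

  import math

  def second_derivative(u, x, h):
      return (u(x + h) - 2.0 * u(x) + u(x - h)) / h**2

  x = 1.0
  exact = -math.sin(x)
  for h in (1e-1, 1e-2, 1e-3):
      err = abs(second_derivative(math.sin, x, h) - exact)
      print(h, err)          # error shrinks roughly like h^2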

60/ 627

Page 63: High Performance Matrix Computations/Calcul Matriciel Haute

Discretization with finite differences (1D)

  d²u/dx² = ( u(x + h) - 2u(x) + u(x - h) ) / h² + O(h²)

3-point stencil for the centered difference approximation to the second-order derivative:  [ 1  -2  1 ]

61/ 627

Page 64: High Performance Matrix Computations/Calcul Matriciel Haute

Finite Differences for the Laplacian Operator (2D)

Assuming the same mesh size h in the x and y directions,

  Δu(x, y) ≈ ( u(x-h, y) - 2u(x, y) + u(x+h, y) ) / h²  +  ( u(x, y-h) - 2u(x, y) + u(x, y+h) ) / h²

  Δu(x, y) ≈ (1/h²) ( u(x-h, y) + u(x+h, y) + u(x, y-h) + u(x, y+h) - 4u(x, y) )

5-point stencils for the centered difference approximation to the Laplacian operator: (left) standard, (right) skewed.

62/ 627

Page 65: High Performance Matrix Computations/Calcul Matriciel Haute

(Figure: 27-point stencil used for 3D geophysical applications (collaboration S. Operto and J. Virieux, Geoazur).)

Page 66: High Performance Matrix Computations/Calcul Matriciel Haute

1D example

- Consider the problem

  -u''(x) = f(x) for x in (0, 1)
  u(0) = u(1) = 0

- x_i = i * h, i = 0, ..., n+1, f(x_i) = f_i, u(x_i) = u_i, h = 1/(n+1)

- Centered difference approximation:

  -u_{i-1} + 2u_i - u_{i+1} = h² f_i   (u_0 = u_{n+1} = 0),

- We obtain a linear system Au = f, or (for n = 6):

  (1/h²) |  2 -1  0  0  0  0 |  | u1 |   | f1 |
         | -1  2 -1  0  0  0 |  | u2 |   | f2 |
         |  0 -1  2 -1  0  0 |  | u3 | = | f3 |
         |  0  0 -1  2 -1  0 |  | u4 |   | f4 |
         |  0  0  0 -1  2 -1 |  | u5 |   | f5 |
         |  0  0  0  0 -1  2 |  | u6 |   | f6 |
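A minimal sketch that builds and solves this tridiagonal system with NumPy, using f = π² sin(πx) so that the exact solution sin(πx) is known:

  import numpy as np

  n = 6
  h = 1.0 / (n + 1)
  x = np.arange(1, n + 1) * h
  A = (np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1)
       + np.diag(-np.ones(n - 1), -1)) / h**2
  f = np.pi**2 * np.sin(np.pi * x)
  u = np.linalg.solve(A, f)
  print(np.max(np.abs(u - np.sin(np.pi * x))))   # small discretization error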

64/ 627

Page 67: High Performance Matrix Computations/Calcul Matriciel Haute

Slightly more complicated (2D)

Consider an elliptic PDE:

  -∂( a(x,y) ∂u/∂x )/∂x - ∂( b(x,y) ∂u/∂y )/∂y + c(x,y) u = g(x,y)   on Ω

  u(x,y) = 0   on ∂Ω

  0 <= x, y <= 1
  a(x,y) > 0
  b(x,y) > 0
  c(x,y) >= 0

65/ 627

Page 68: High Performance Matrix Computations/Calcul Matriciel Haute

- Case of a regular 2D mesh:

  (Figure: unit square with a regular grid of interior points numbered 1 to 4 in each direction.)

  discretization step: h = 1/(n+1), n = 4

- 5-point finite difference scheme:

  ∂( a(x,y) ∂u/∂x )/∂x |_{ij} = [ a_{i+1/2,j} (u_{i+1,j} - u_{i,j}) - a_{i-1/2,j} (u_{i,j} - u_{i-1,j}) ] / h² + O(h²)

- Similarly:

  ∂( b(x,y) ∂u/∂y )/∂y |_{ij} = [ b_{i,j+1/2} (u_{i,j+1} - u_{i,j}) - b_{i,j-1/2} (u_{i,j} - u_{i,j-1}) ] / h² + O(h²)

Page 69: High Performance Matrix Computations/Calcul Matriciel Haute

- a_{i+1/2,j}, b_{i+1/2,j}, c_{ij}, ... are known.
- With the ordering of the unknowns of the example, we obtain a linear system of the form:

  Ax = b,

- where

  x1 <-> u_{1,1} = u( 1/(n+1), 1/(n+1) )
  x2 <-> u_{2,1} = u( 2/(n+1), 1/(n+1) )
  x3 <-> u_{3,1}
  x4 <-> u_{4,1}
  x5 <-> u_{1,2}, ...

- and A is n² by n², b is of size n², with the following structure:

67/ 627

Page 70: High Performance Matrix Computations/Calcul Matriciel Haute

(16 x 16 sparsity pattern, n = 4, natural ordering: A is block tridiagonal; each diagonal block is a 4 x 4 tridiagonal matrix coupling u_{1,j}, ..., u_{4,j}, and the off-diagonal blocks are diagonal matrices coupling u_{i,j} to u_{i,j-1} and u_{i,j+1}; the zeros marked in the original pattern indicate the absence of coupling between u_{4,j} and u_{1,j+1}. The right-hand side b contains g11, g21, g31, g41, g12, ..., g44.)
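A small sketch (constant coefficients a = b = 1, c = 0, i.e. the 5-point Laplacian) that assembles this n² x n² matrix for n = 4 and prints its sparsity pattern:

  import numpy as np

  n = 4
  N = n * n
  A = np.zeros((N, N))
  for j in range(n):           # y index
      for i in range(n):       # x index
          k = j * n + i        # natural (row-wise) ordering, as on the slide
          A[k, k] = 4.0
          if i > 0:     A[k, k - 1] = -1.0
          if i < n - 1: A[k, k + 1] = -1.0
          if j > 0:     A[k, k - n] = -1.0
          if j < n - 1: A[k, k + n] = -1.0

  for row in A:
      print("".join("x" if v != 0 else "." for v in row))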

68/ 627

Page 71: High Performance Matrix Computations/Calcul Matriciel Haute

Solution of the linear system

Often the most costly part of a numerical simulation code

- Direct methods:
  - LU factorization followed by triangular substitutions
  - parallelism depends strongly on the structure of the matrix
- Iterative methods:
  - usually rely on sparse matrix-vector products (which can be done in parallel)
  - an algebraic preconditioner is useful

-> Need for high-performance linear algebra kernels
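A hedged sketch of the two families on a small 1D Laplacian (it assumes SciPy is available; production sparse direct solvers such as MUMPS are separate packages):

  import numpy as np
  import scipy.sparse as sp
  import scipy.sparse.linalg as spla

  n = 1000
  A = sp.diags([-1, 2, -1], [-1, 0, 1], shape=(n, n), format="csc")
  b = np.ones(n)

  x_direct = spla.splu(A).solve(b)     # direct: sparse LU + triangular solves
  x_iter, info = spla.cg(A, b)         # iterative: conjugate gradient (A is SPD)
  print(info, np.linalg.norm(x_direct - x_iter))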

69/ 627

Page 72: High Performance Matrix Computations/Calcul Matriciel Haute

Evolution of a complex phenomenon

- Examples: climate modelling, evolution of radioactive waste, ...
- heat equation:

  Δu(x, y, z, t) = ∂u(x, y, z, t)/∂t
  u(x, y, z, t0) = u0(x, y, z)

- Discretization in both space and time (1D case):
  - Explicit approaches:

    ( u_j^{n+1} - u_j^n ) / ( t_{n+1} - t_n ) = ( u_{j+1}^n - 2 u_j^n + u_{j-1}^n ) / h²

  - Implicit approaches:

    ( u_j^{n+1} - u_j^n ) / ( t_{n+1} - t_n ) = ( u_{j+1}^{n+1} - 2 u_j^{n+1} + u_{j-1}^{n+1} ) / h²

- Implicit approaches are preferred (more stable, larger timesteps possible) but are more numerically intensive: a sparse linear system must be solved at each iteration.

Need for high performance linear algebra kernels
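A minimal sketch of the implicit (backward Euler) variant above: each time step requires solving (I + dt/h² T) u^{n+1} = u^n, with T the tridiagonal [-1, 2, -1] matrix.

  import numpy as np

  n, dt = 50, 1e-3
  h = 1.0 / (n + 1)
  x = np.arange(1, n + 1) * h
  T = (np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1)
       + np.diag(-np.ones(n - 1), -1))
  M = np.eye(n) + (dt / h**2) * T      # implicit operator, stable for any dt

  u = np.sin(np.pi * x)                # initial temperature profile
  for step in range(10):
      u = np.linalg.solve(M, u)        # one (banded/sparse) solve per time step
  print(u.max())                       # amplitude decays, as expected for heat flow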

70/ 627

Page 73: High Performance Matrix Computations/Calcul Matriciel Haute

Discretization with finite elements

- Consider a partial differential equation of the form (Poisson equation):

  Δu = ∂²u/∂x² + ∂²u/∂y² = f
  u = 0 on ∂Ω

- we can show (using Green's formula) that the previous problem is equivalent to:

  a(u, v) = -∫_Ω f v dx dy   for all v such that v = 0 on ∂Ω

  where a(u, v) = ∫_Ω ( ∂u/∂x ∂v/∂x + ∂u/∂y ∂v/∂y ) dx dy

71/ 627

Page 74: High Performance Matrix Computations/Calcul Matriciel Haute

Finite element scheme: 1D Poisson equation

- Δu = ∂²u/∂x² = f, u = 0 on ∂Ω
- Equivalent to

  a(u, v) = g(v) for all v (with v|∂Ω = 0)

  where a(u, v) = ∫_Ω ∂u/∂x ∂v/∂x and g(v) = -∫_Ω f(x) v(x) dx
  (in 1D: similar to integration by parts)

- Idea: we look for u of the form u = Σ_k α_k Φ_k(x),
  with (Φ_k)_{k=1,n} a basis of functions such that Φ_k is linear on every element E_i,
  and Φ_k(x_i) = δ_ik (= 1 if k = i, 0 otherwise).

  (Figure: hat functions Φ_{k-1}, Φ_k, Φ_{k+1} on Ω; Φ_k is supported on the elements E_k and E_{k+1} around node x_k.)

72/ 627

Page 75: High Performance Matrix Computations/Calcul Matriciel Haute

Finite element scheme: 1D Poisson equation

(Figure: hat functions Φ_{k-1}, Φ_k, Φ_{k+1} around node x_k, elements E_k and E_{k+1}.)

- We rewrite a(u, v) = g(v) for all Φ_k:

  a(u, Φ_k) = g(Φ_k) for all k  <=>  Σ_i α_i a(Φ_i, Φ_k) = g(Φ_k)

  a(Φ_i, Φ_k) = ∫_Ω ∂Φ_i/∂x ∂Φ_k/∂x = 0 when |i - k| >= 2

- k-th equation, associated with Φ_k:

  α_{k-1} a(Φ_{k-1}, Φ_k) + α_k a(Φ_k, Φ_k) + α_{k+1} a(Φ_{k+1}, Φ_k) = g(Φ_k)

- a(Φ_{k-1}, Φ_k) = ∫_{E_k}     ∂Φ_{k-1}/∂x ∂Φ_k/∂x
  a(Φ_{k+1}, Φ_k) = ∫_{E_{k+1}} ∂Φ_{k+1}/∂x ∂Φ_k/∂x
  a(Φ_k, Φ_k)     = ∫_{E_k}     ∂Φ_k/∂x ∂Φ_k/∂x + ∫_{E_{k+1}} ∂Φ_k/∂x ∂Φ_k/∂x

73/ 627

Page 76: High Performance Matrix Computations/Calcul Matriciel Haute

Finite element scheme: 1D Poisson equation

From the point of view of E_k, we have a 2x2 contribution (element) matrix:

  | ∫_{E_k} ∂Φ_{k-1}/∂x ∂Φ_{k-1}/∂x    ∫_{E_k} ∂Φ_{k-1}/∂x ∂Φ_k/∂x |   | I_{E_k}(Φ_{k-1}, Φ_{k-1})   I_{E_k}(Φ_{k-1}, Φ_k) |
  | ∫_{E_k} ∂Φ_{k-1}/∂x ∂Φ_k/∂x        ∫_{E_k} ∂Φ_k/∂x ∂Φ_k/∂x     | = | I_{E_k}(Φ_k, Φ_{k-1})       I_{E_k}(Φ_k, Φ_k)     |

(Figure: mesh 0-1-2-3-4 on Ω, elements E_1, ..., E_4, basis functions Φ_1, Φ_2, Φ_3.)

Assembling the element contributions gives the system

  | I_{E_1}(Φ_1,Φ_1) + I_{E_2}(Φ_1,Φ_1)   I_{E_2}(Φ_1,Φ_2)                        0                                    |   | α_1 |   | g(Φ_1) |
  | I_{E_2}(Φ_2,Φ_1)                      I_{E_2}(Φ_2,Φ_2) + I_{E_3}(Φ_2,Φ_2)     I_{E_3}(Φ_2,Φ_3)                     | x | α_2 | = | g(Φ_2) |
  | 0                                     I_{E_3}(Φ_3,Φ_2)                        I_{E_3}(Φ_3,Φ_3) + I_{E_4}(Φ_3,Φ_3)  |   | α_3 |   | g(Φ_3) |
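A minimal sketch of this assembly process for -u'' = f on (0,1) with piecewise-linear hat functions on a uniform mesh: each element contributes a 2x2 block, added into the global matrix.

  import numpy as np

  n_nodes = 5                  # nodes 0..4, elements E_1..E_4 (as in the figure)
  h = 1.0 / (n_nodes - 1)
  K = np.zeros((n_nodes, n_nodes))
  ke = (1.0 / h) * np.array([[ 1.0, -1.0],
                             [-1.0,  1.0]])   # element matrix of the integrals I_E(., .)
  for e in range(n_nodes - 1):                # assemble A = sum_e C(T_e)
      K[e:e + 2, e:e + 2] += ke

  K_int = K[1:-1, 1:-1]        # eliminate boundary nodes (u = 0 on the boundary)
  print(K_int)                 # tridiagonal (1/h) * [ 2 -1; -1 2 -1; -1 2 ]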

74/ 627

Page 77: High Performance Matrix Computations/Calcul Matriciel Haute

Finite element scheme in higher dimension

- Can be used in higher dimensions
- The mesh can be irregular
- Φ_i can be a higher-degree polynomial
- The matrix pattern depends on the mesh connectivity/ordering

75/ 627

Page 78: High Performance Matrix Computations/Calcul Matriciel Haute

Finite element scheme in higher dimension

- Set of elements (tetrahedra, triangles) to assemble; for a triangle T with vertices i, j, k the element contribution is:

  C(T) = | a^T_{i,i}  a^T_{i,j}  a^T_{i,k} |
         | a^T_{j,i}  a^T_{j,j}  a^T_{j,k} |
         | a^T_{k,i}  a^T_{k,j}  a^T_{k,k} |

Needs for the parallel case

- Assemble the sparse matrix A = Σ_i C(T_i): graph colouring algorithms
- Parallelization domain by domain: graph partitioning
- Solution of Ax = b: high-performance matrix computation kernels

75/ 627

Page 79: High Performance Matrix Computations/Calcul Matriciel Haute

Other example: linear least squares

- mathematical model + approximate measurements => estimate the parameters of the model
- m "experiments" + n parameters x_i:
  min ||Ax - b|| with:
  - A in R^{m x n}, m >= n: data matrix
  - b in R^m: vector of observations
  - x in R^n: parameters of the model
- Solving the problem:
  - Factorize A = QR, with Q orthogonal and R triangular
  - ||Ax - b|| = ||Q^T Ax - Q^T b|| = ||Q^T QRx - Q^T b|| = ||Rx - Q^T b||
- Problems can be large (meteorological data, ...), sparse or not

-> Again, we need high-performance algorithms
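A small sketch of the QR approach described above, on a synthetic problem (fitting a line y = x0 + x1*t to noisy observations):

  import numpy as np

  rng = np.random.default_rng(1)
  m = 50
  t = np.linspace(0.0, 1.0, m)
  A = np.column_stack([np.ones(m), t])       # m x n data matrix, m >= n
  b = 2.0 + 3.0 * t + 0.01 * rng.standard_normal(m)

  Q, R = np.linalg.qr(A)                     # A = QR, Q orthogonal, R triangular
  x = np.linalg.solve(R, Q.T @ b)            # solve Rx = Q^T b
  print(x)                                   # close to [2, 3]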

76/ 627

Page 80: High Performance Matrix Computations/Calcul Matriciel Haute

Software aspects, parallelization of industrial simulation codes

- Distinguish between:
  - Porting codes and optimizing them on SMP machines
    - Local changes in the code
    - No major change in the global solution method
    - Possible substitution of computational kernels
  - Developing a parallel code for distributed-memory machines -> different algorithms needed
  - Developing optimized parallel libraries (e.g. solvers for linear systems), where portability and efficiency are essential
- How to take the characteristics of a parallel machine into account?
  - Some of the most efficient sequential algorithms cannot be parallelized
  - Some algorithms that are suboptimal in sequential are very good in parallel
- Major problem: how to reuse existing codes?

77/ 627

Page 81: High Performance Matrix Computations/Calcul Matriciel Haute

Introduction
  Introduction to high-performance computers
  Architectural evolutions
  Programming
  Matrix computations and high-performance computing
  Grid computing - Internet computing
  Conclusion

78/ 627

Page 82: High Performance Matrix Computations/Calcul Matriciel Haute

Grid computing - Internet computing

The Internet can serve as a support for running distributed applications, in addition to its role of providing access to information.

Interest
- Familiar interface
- Availability of basic tools:
  - Universal naming space (URL)
  - Information transfer protocol (HTTP)
  - Information management in standard formats (HTML-XML)
- Web = a primitive operating system for distributed applications?

79/ 627

Page 83: High Performance Matrix Computations/Calcul Matriciel Haute

Grid computing - Internet computing

The Internet can serve as a support for running distributed applications, in addition to its role of providing access to information.

Problems
- Where and how are the programs executed?
  - On the server side -> CGI scripts, servlets, ...
  - On the client side -> scripts in a browser extension (plugin) or applets, ...
- How to guarantee security?
  - A major problem, not completely solved
  - Protection of the sites
  - Encryption of the information
  - Restrictions on the execution conditions
  - Traceability
- And in the end, who benefits from the result of executing the service?

79/ 627

Page 84: High Performance Matrix Computations/Calcul Matriciel Haute

Grid computing

- Make resources on the Net transparently accessible: processing capacity, expert software, databases, ...
  3 types of grids: information sharing, storage, computing
- Problems:
  - Locate and return the solutions - or the software - in a form directly usable and familiar to the user
  - Examples: NetSolve, DIET, Globus, NEOS, Ninf, Legion, ...
- Mechanisms involved:
  - Sockets, RPC, client-server, HTTP, Corba, CGI scripts, Java, ...
  - Calls from C or Fortran codes
  - Possibly more interactive interfaces: Java consoles, HTML pages, ...
- American and European initiatives: EuroGrid (CERN + ESA + ...), GRID 5000, ...

80/ 627

Page 85: High Performance Matrix Computations/Calcul Matriciel Haute

Computing grids: an attempt at classification (T. Priol, INRIA)

- A multitude of terms: P2P Computing, Metacomputing, Virtual Supercomputing, Desktop Grid, Pervasive Computing, Utility Computing, Mobile Computing, Internet Computing, PC Grid Computing, On Demand Computing, ...
- Virtual Supercomputing: grids of supercomputers;
- Desktop Grid, Internet Computing: a grid made up of a very large number of PCs (10,000 - 1,000,000);
- Metacomputing: association of application servers;
- P2P Computing: peer-to-peer computing infrastructure: each entity can alternately be a client or a server.

81/ 627

Page 86: High Performance Matrix Computations/Calcul Matriciel Haute

(Figure: Vision of the "grid" in the USA.)

82/ 627

Page 87: High Performance Matrix Computations/Calcul Matriciel Haute

Peer-to-peer: SETI@home

- 500,000 PCs searching for extraterrestrial intelligence
- Signal analysis
- A peer retrieves a data set from the Arecibo radio telescope
- The peer analyses the data (300 kB, 3 TFlop of computation, 10 hours) when it is idle
- The results are sent back to the SETI team
- 35 TFlop/s on average
- A source of inspiration for many companies

83/ 627

Page 88: High Performance Matrix Computations/Calcul Matriciel Haute

Peer-to-peer: SETI@home

                      Total               Last 24 hours
Users                 5436301             0 new users
Results received      2005637370          780175
Total CPU time        2378563.061 years   539.796 years
Flops                 7.406171e+21        3.042682e+18

(Same bullet points as on the previous slide: 500,000 PCs searching for extraterrestrial intelligence; signal analysis; a peer retrieves a data set from the Arecibo radio telescope and analyses it (300 kB, 3 TFlop of computation, 10 hours) when idle; the results are sent back to the SETI team; 35 TFlop/s on average; a source of inspiration for many companies.)

83/ 627

Page 89: High Performance Matrix Computations/Calcul Matriciel Haute

Google (after J. Dongarra)

- 2600 requests per second (200 x 10^6 per day)
- 100 countries
- 8 x 10^9 documents indexed
- 450,000 Linux systems in several data centres
- Electrical power consumption of 20 MW ($2 million per month)
- Ranking of the pages <-> eigenvalues of a transition probability matrix (a 1 in entry (i, j) means that there is a link from page i to page j)
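An illustration only (not Google's actual code): the ranking vector is the dominant eigenvector of a column-stochastic transition matrix, which can be approximated by power iteration.

  import numpy as np

  links = np.array([[0, 1, 1, 0],      # links[i, j] = 1 if page i links to page j
                    [0, 0, 1, 0],
                    [1, 0, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
  P = links.T / links.sum(axis=1)      # column j: probabilities of leaving page j
  d = 0.85                             # damping factor (an assumption of this sketch)
  n = P.shape[0]
  G = d * P + (1.0 - d) / n            # "Google matrix": column-stochastic

  r = np.full(n, 1.0 / n)
  for _ in range(100):                 # power iteration: r <- G r
      r = G @ r
      r /= r.sum()
  print(r)                             # stationary ranking of the 4 pages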

84/ 627

Page 90: High Performance Matrix Computations/Calcul Matriciel Haute

RPC and grid computing: GridRPC (F. Desprez, INRIA)

- Simple idea:
  - Build the RPC programming model on top of the grid
  - use the resources (data + services) available on the network
  - Mixed parallelism: data-driven within a server and task-driven between servers.
- Required functionality:

  1. Load balancing (service location, performance evaluation, scheduling)
  2. IDL (Interface Definition Language)
  3. Mechanisms for handling data persistence and replication.
  4. Security, fault tolerance, interoperability between middlewares (GridRPC)

85/ 627

Page 91: High Performance Matrix Computations/Calcul Matriciel Haute

RPC and grid computing: GridRPC (continued)

- Examples:
  - NetSolve (Univ. Tennessee): the oldest, based on sockets
  - DIET: Graal team (LIP)
    A recent, widely used tool.
    Substantial work on task scheduling, deployment and data management.

86/ 627

Page 92: High Performance Matrix Computations/Calcul Matriciel Haute

Introduction
  Introduction to high-performance computers
  Architectural evolutions
  Programming
  Matrix computations and high-performance computing
  Grid computing - Internet computing
  Conclusion

87/ 627

Page 93: High Performance Matrix Computations/Calcul Matriciel Haute

Evolution of high-performance computing

- Virtually shared memory:
  - clusters
  - Deeper memory hierarchy
- Clusters of machines
  - Often based on PCs (Pentium or DEC Alpha, NT or LINUX)
- Parallel programming (shared memory, message passing, data parallel):
  - Standardization efforts: OpenMP and POSIX threads, MPI, HPF, ...
- MPPs and clusters
  - represent the future of high-performance computing
  - the ratio communication speed / computing power is often low compared with shared-memory multiprocessors
  - increasingly integrated into the overall computing resources of a company

88/ 627

Page 94: High Performance Matrix Computations/Calcul Matriciel Haute

Programming environments

- Parallel computing cannot be avoided
- Software always lags behind the architectures:
  - Operating systems
  - Automatic parallelization
  - Application software and scientific libraries
- For massively parallel architectures:
  - Programming standards: MPI, or MPI + threads (POSIX/OpenMP)
  - Languages: most often C or Fortran
  - Need for development tools (debuggers, compilers, performance analysers, libraries, ...)
  - Development and maintenance are difficult, and the tuning tools are hard to use.

89/ 627

Page 95: High Performance Matrix Computations/Calcul Matriciel Haute

Solving problems from scientific computing

- Parallel computing is necessary to solve problems of "reasonable" size
- Computations involving matrices are often the most critical in memory and time
- Needs: parallel numerical methods, dense and sparse linear algebra, graph algorithms
- Algorithms must adapt:
  - to parallel architectures
  - to programming models
- portability and efficiency?
  The best way to obtain a parallel program is to design a parallel algorithm!!!!

90/ 627

Page 96: High Performance Matrix Computations/Calcul Matriciel Haute

HPC spectrum (after J. Dongarra)

(Figure: a spectrum ranging from peer-to-peer (SETI@home), grid-based computing and networks of workstations, through Beowulf clusters and clusters with special interconnects, up to parallel distributed-memory TFlop machines.)

Distributed systems
- Gather (unused) resources
- Steal cycles
- System software manages resources
- 10% - 20% overhead is OK
- Resources drive applications
- Completion time not critical
- Time-shared
- Heterogeneous

Massively // systems
- Bounded set of resources
- Apps grow to consume all cycles
- Application manages resources
- 5% overhead is maximum
- Apps drive purchase of equipment
- Real-time constraints
- Space-shared
- Homogeneous

91/ 627

Page 97: High Performance Matrix Computations/Calcul Matriciel Haute

Outline

High-performance computers: general concepts
  Introduction
  Processor organisation
  Memory organisation
  Internal organisation and performance of vector processors
  Organisation of RISC processors
  Data reuse (in registers)
  Cache memory
  Data reuse (in caches)
  Virtual memory
  Data reuse (in memory)
  Processor interconnection
  The top 500 supercomputers in June 2007
  Conclusion

92/ 627

Page 98: High Performance Matrix Computations/Calcul Matriciel Haute

High-performance computers: general concepts
  Introduction
  Processor organisation
  Memory organisation
  Internal organisation and performance of vector processors
  Organisation of RISC processors
  Data reuse (in registers)
  Cache memory
  Data reuse (in caches)
  Virtual memory
  Data reuse (in memory)
  Processor interconnection
  The top 500 supercomputers in June 2007
  Conclusion

93/ 627

Page 99: High Performance Matrix Computations/Calcul Matriciel Haute

Introduction

- Designing a supercomputer
  - Determine which characteristics matter (application domain)
  - Maximum performance subject to the cost constraints (purchase, maintenance, power consumption)
- Designing a processor:
  - Instruction set
  - Functional and logical organisation
  - Implementation (integration, power supply, ...)
- Examples of functional constraints vs application domain:
  - General-purpose machine: balanced performance over a wide range of workloads
  - Scientific computing: fast floating-point arithmetic
  - Business: databases, transaction processing, ...

94/ 627

Page 100: High Performance Matrix Computations/Calcul Matriciel Haute

- Use of the architectures
  - Ever-growing memory requirements:
    x 1.5 - 2 per year for an average code (i.e. 1 address bit every 1-2 years)
  - Over the last 25 years, assembly language has been replaced by high-level languages -> compilers / code optimization
- Technological evolution
  - Integrated circuits (CPU): density +50% per year
  - DRAM semiconductors (memory):
    - Density +60% per year
    - Cycle time: -30% every 10 years
    - Size: multiplied by 4 every 3 years

95/ 627

Page 101: High Performance Matrix Computations/Calcul Matriciel Haute

CPU performance

- CPUtime = #ProgramCycles / ClockRate
- #ProgramCycles = #ProgramInstructions x avg. #cycles per instruction
- Thus performance (CPUtime) depends on three factors:

  1. clock cycle time
  2. #cycles per instruction
  3. number of instructions

  But those factors are inter-dependent:

- ClockRate depends on hardware technology and processor organization
- #cycles per instruction depends on organization and instruction set architecture
- #instructions depends on instruction set and compiler
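A sketch of the relation above with made-up numbers (illustration only): the same program on two hypothetical designs, trading CPI against clock rate.

  def cpu_time(n_instructions, cycles_per_instruction, clock_rate_hz):
      program_cycles = n_instructions * cycles_per_instruction
      return program_cycles / clock_rate_hz

  print(cpu_time(1e9, 2.0, 2.0e9))   # 1.0 s
  print(cpu_time(1e9, 1.2, 1.5e9))   # 0.8 s: the slower clock can still win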

96/ 627

Page 102: High Performance Matrix Computations/Calcul Matriciel Haute

High-performance computers: general concepts
  Introduction
  Processor organisation
  Memory organisation
  Internal organisation and performance of vector processors
  Organisation of RISC processors
  Data reuse (in registers)
  Cache memory
  Data reuse (in caches)
  Virtual memory
  Data reuse (in memory)
  Processor interconnection
  The top 500 supercomputers in June 2007
  Conclusion

97/ 627

Page 103: High Performance Matrix Computations/Calcul Matriciel Haute

Pipeline

- Pipeline = assembly-line principle
  - a computation is split into a number of sub-computations carried out by different units (the stages of the pipeline)
  - the stages operate simultaneously on different operands (elements of vectors, for example)
  - once the pipeline is full, one result is obtained per basic cycle time
- RISC processor:
  - Pipelining of independent scalar operations:

    a = b + c
    d = e + f

  - The executable code is more complex on RISC:

    do i = 1, n
       a(i) = b(i) + c(i)
    enddo

98/ 627

Page 104: High Performance Matrix Computations/Calcul Matriciel Haute

- Corresponding code:

  i = 1
  loop: load b(i) into register #1
        load c(i) into register #2
        register #3 = register #1 + register #2
        store register #3 into a(i)
        i = i + 1 and test for end of loop

- Exploiting the pipeline -> loop unrolling:

  do i = 1, n, 4
     a(i  ) = b(i  ) + c(i  )
     a(i+1) = b(i+1) + c(i+1)
     a(i+2) = b(i+2) + c(i+2)
     a(i+3) = b(i+3) + c(i+3)
  enddo

99/ 627

Page 105: High Performance Matrix Computations/Calcul Matriciel Haute

- On a vector processor:

  do i = 1, n
     a(i) = b(i) + c(i)
  enddo

  load vector b into register #1
  load vector c into register #2
  register #3 = register #1 + register #2
  store register #3 into vector a

  Stripmining: if n > nb (the size of the vector registers)

  do i = 1, n, nb
     ib = min( nb, n-i+1 )
     do ii = i, i + ib - 1
        a(ii) = b(ii) + c(ii)
     enddo
  enddo

100/ 627

Page 106: High Performance Matrix Computations/Calcul Matriciel Haute

Trade-offs in pipeline design

- Many stages:
  - higher startup cost
  - performance more sensitive to the ability to feed the pipeline
  - allows the cycle time to be reduced
- Fewer stages:
  - more complex sub-instructions
  - harder to decrease the cycle time

101/ 627

Page 107: High Performance Matrix Computations/Calcul Matriciel Haute

Problems with data dependencies

- Example:

  do i = 2, n
     a(i) = a(i-1) + 1
  enddo

  with all a(i) initialized to 1.

- Scalar execution:

  Step 1: a(2) = a(1) + 1 = 1 + 1 = 2
  Step 2: a(3) = a(2) + 1 = 2 + 1 = 3
  Step 3: a(4) = a(3) + 1 = 3 + 1 = 4
  ...

102/ 627

Page 108: High Performance Matrix Computations/Calcul Matriciel Haute

- Vector execution: a pipeline with p stages -> p elements in the pipeline

  Pipeline stages
  -------------------------------------------------------
  Time        1       2      3    ...   p     output
  -------------------------------------------------------
  t0          a(1)
  t0 + dt     a(2)    a(1)
  t0 + 2dt    a(3)    a(2)   a(1)
  ....
  t0 + pdt    a(p+1)  a(p)   ...        a(2)  a(1)
  -------------------------------------------------------

  Hence:

  a(2) = a(1) + 1 = 1 + 1 = 2
  a(3) = a(2) + 1 = 1 + 1 = 2
  ...

  because the initial value of a(2) is used.

  Result of the vector execution != result of the scalar execution
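A NumPy illustration of the same hazard: the "vectorized" update reads the old values of a (like the pipelined execution sketched above), while the scalar loop reads values it has just written.

  import numpy as np

  a_scalar = np.ones(8)
  for i in range(1, 8):
      a_scalar[i] = a_scalar[i - 1] + 1    # sequential: 1, 2, 3, 4, ...

  a_vector = np.ones(8)
  a_vector[1:] = a_vector[:-1] + 1         # all right-hand sides use the old a: 1, 2, 2, 2, ...

  print(a_scalar)
  print(a_vector)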

103/ 627

Page 109: High Performance Matrix Computations/Calcul Matriciel Haute

Overlapping

- Use functional units in parallel on independent operations. Example:

  do i = 1, n
     A(i) = B(i) * C(i)
     D(i) = E(i) + F(i)
  enddo

  (Figure: B and C feed a pipelined multiplier producing A; E and F feed a pipelined adder producing D.)

  Time_overlapping = max(Startup_mul, Startup_add + dt) + n x dt
  Time_no_overlap  = Startup_mul + n x dt + Startup_add + n x dt

- Advantages: parallelism between the independent functional units, and more flops per cycle

104/ 627

Page 110: High Performance Matrix Computations/Calcul Matriciel Haute

Chaining (chaınage)

I La sortie d’une unite fonctionnelle est dirigee directement versl’entree d’une autre unite fonctionnelle

I Exemple :

do i = 1, nA(i) = ( B(i) * C(i) ) + D(i)

enddo

D

A

Pipelined multiplier Pipelined adderB

C

Timechaining = Startupmul + Startupadd + n × dtTimenochaining = Startupmul + n×dt+Startupadd + n×dt

I Avantages : plus de flops par cyle, exploitation de la localitedes donnees, economie de stockage intermediaire
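A sketch of the timing formulas of the last two slides, with made-up startup costs (in cycles) to see the effect of chaining and overlapping:

  def no_chaining(n, start_mul, start_add, dt=1):
      return start_mul + n * dt + start_add + n * dt

  def chaining(n, start_mul, start_add, dt=1):
      return start_mul + start_add + n * dt

  def overlapping(n, start_mul, start_add, dt=1):
      return max(start_mul, start_add + dt) + n * dt

  for n in (100, 10_000):
      print(n, no_chaining(n, 10, 8), chaining(n, 10, 8), overlapping(n, 10, 8))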

105/ 627

Page 111: High Performance Matrix Computations/Calcul Matriciel Haute

High-performance computers: general concepts
  Introduction
  Processor organisation
  Memory organisation
  Internal organisation and performance of vector processors
  Organisation of RISC processors
  Data reuse (in registers)
  Cache memory
  Data reuse (in caches)
  Virtual memory
  Data reuse (in memory)
  Processor interconnection
  The top 500 supercomputers in June 2007
  Conclusion

106/ 627

Page 112: High Performance Matrix Computations/Calcul Matriciel Haute

Locality of references

Programs tend to reuse data and instructions they have used recently.

- Often a program spends 90% of its time in only 10% of its code.
- This also applies - not as strongly - to data accesses:
  - temporal locality: recently accessed items are likely to be accessed again in the future
  - spatial locality: items whose addresses are near one another tend to be referenced close together in time.

107/ 627

Page 113: High Performance Matrix Computations/Calcul Matriciel Haute

Concept of memory hierarchy - 1

In hardware : smaller is faster

Example :

I On a high-performance computer using the same technology (pipelining, overlapping, . . . ) for memory:

  I signal propagation is a major cause of delay, thus larger memories → more signal delay and more levels to decode addresses.

  I smaller memories are faster because the designer can use more power per memory cell.

108/ 627

Page 114: High Performance Matrix Computations/Calcul Matriciel Haute

Concept of memory hierarchy - 2

Make use of principle of locality of references

I Data most recently used - or nearby data - are very likely to be accessed again in the future

I Try to have recently accessed data in the fastest memory

I Because smaller is faster → use smaller memories to hold the most recently used items close to the CPU and successively larger memories farther away from the CPU

→ Memory hierarchy

109/ 627

Page 115: High Performance Matrix Computations/Calcul Matriciel Haute

Typical memory hierarchy

Level          Size      Access time   Bandwidth (MB/s)   Technology      Managed by
Registers      ≤ 1 KB    2-5 ns        400-32,000         (BI)CMOS        compiler
Cache          ≤ 4 MB    3-10 ns       800-5,000          CMOS SRAM       hardware
Main memory    ≤ 4 GB    80-400 ns     400-2,000          CMOS DRAM       OS
Disk           ≥ 1 GB    5 × 10^6 ns   4-32               magnetic disk   OS/user

110/ 627

Page 116: High Performance Matrix Computations/Calcul Matriciel Haute

Speed of computation relies on high memory bandwidth

(Figure : X <- Y + Z ; Y and Z flow from memory through a pipelined adder and X is stored back — data and instruction flow requirement.)

Bandwidth required (in Words/sec) = (NI*LI + 3*NO) / cycle time
   NI = Nb. Instructions/cycle     LI = Nb. Words/Instruction
   NO = Nb. Operations/cycle       1 Word = 8 Bytes

Example :
   Machine                Cycle time (nsec)   Required bandwidth
   Digital alpha 21064          5                0.6   GW/sec
   Intel i860 XP               20                0.275 GW/sec   (NI=2, LI=1/2, NO=3/2)
   1 CPU CRAY-C90               4.2              2.8   GW/sec
   1 CPU NEC SX3/14             2.9             16     GW/sec   (NO=16)

111/ 627

Page 117: High Performance Matrix Computations/Calcul Matriciel Haute

Memory interleaving

"The memory is subdivided into several independent memory modules (banks)"

Two basic ways of distributing the addresses.
Example : Real a(256), memory size 2^10 = 1024 Words divided into 8 banks.

  Low order interleaving ("well adapted to pipelining memory access") :
    bank 1 : a(1), a(9),  ..., a(249)
    bank 2 : a(2), a(10), ..., a(250)
    bank 3 : a(3), a(11), ..., a(251)
    ...
    bank 7 : a(7), ...,        a(255)
    bank 8 : a(8), a(16), ..., a(256)

  High order interleaving :
    bank 1 : a(1), a(2), ..., a(128)
    bank 2 : a(129),      ..., a(256)
    ...
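A minimal sketch (program name and word addressing are illustrative) of the two address-to-bank mappings described above:

   program interleaving
     ! Which bank a word address falls into, for the example above
     ! (memory of 2^10 = 1024 words divided into 8 banks).
     implicit none
     integer, parameter :: nbanks = 8, memsize = 1024
     integer :: addr
     do addr = 0, 15
        print *, 'address', addr, &
                 '  low-order bank  =', mod(addr, nbanks) + 1, &
                 '  high-order bank =', addr / (memsize / nbanks) + 1
     enddo
   end program interleaving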

112/ 627

Page 118: High Performance Matrix Computations/Calcul Matriciel Haute

Effect of bank cycle time

Example : low order interleaved memory, 4 banks, bank cycle time 3 CP (clock periods).

   Column access                    Row access
   Real a(4,2)                      Real a(4,2)
   Do j=1,2                         Do i=1,4
      Do i=1,4                         Do j=1,2
         ... = a(i,j)                     ... = a(i,j)
      Enddo                            Enddo
   Enddo                            Enddo
   → 10 Clock Periods               → 18 Clock Periods

Bank cycle time : time interval during which the bank cannot be referenced again
Bank Conflict   : consecutive accesses to the same bank in less than the bank cycle time
Stride          : memory address interval between successive elements

(Figure : timeline of the accesses a(1,1), a(2,1), ..., a(4,2) over the 4 banks for both loop orders, 1 CP per access.)
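A minimal sketch (assumed setup as in the example above: 4 banks, low-order interleaving, column-major a(4,2), one word per element) showing which bank each reference hits for the two loop orders:

   program bank_trace
     implicit none
     integer, parameter :: nbanks = 4
     integer :: i, j, addr
     ! Column access (stride 1): successive references hit successive banks
     do j = 1, 2
        do i = 1, 4
           addr = (j-1)*4 + (i-1)
           print *, 'a(', i, ',', j, ') -> bank', mod(addr, nbanks) + 1
        enddo
     enddo
     ! Row access (stride 4): successive references all hit the same bank
     do i = 1, 4
        do j = 1, 2
           addr = (j-1)*4 + (i-1)
           print *, 'a(', i, ',', j, ') -> bank', mod(addr, nbanks) + 1
        enddo
     enddo
   end program bank_trace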

113/ 627

Page 119: High Performance Matrix Computations/Calcul Matriciel Haute

Calculateurs haute-performance: concepts generaux
  Introduction
  Organisation des processeurs
  Organisation memoire
  Organisation interne et performance des processeurs vectoriels
  Organisation des processeurs RISC
  Reutilisation des donnees (dans les registres)
  Memoire cache
  Reutilisation des donnees (dans les caches)
  Memoire virtuelle
  Reutilisation des donnees (en memoire)
  Interconnexion des processeurs
  Les supercalculateurs du top 500 en Juin 2007
  Conclusion

114/ 627

Page 120: High Performance Matrix Computations/Calcul Matriciel Haute

Organisation interne et performance des processeurs vectoriels (d'apres J. Dongarra)

I Soit l’operation vectorielle triadique :

do i = 1, n
   y(i) = alpha * ( x(i) + y(i) )
enddo

I On a 5 operations :

  1. Load vecteur x
  2. Load vecteur y
  3. Addition x + y
  4. Multiplication alpha × ( x + y )
  5. Store dans vecteur y

115/ 627

Page 121: High Performance Matrix Computations/Calcul Matriciel Haute

I Organisations de processeur considerees :

1. Sequentielle
2. Arithmetique chainee
3. Load memoire et arithmetique chaines
4. Load memoire, arithmetique et store memoire chaines
5. Recouvrement des loads memoire et operations chainees

I Notations :

a : startup pour load memoire
b : startup pour addition
c : startup pour multiplication
d : startup pour store memoire

116/ 627

Page 122: High Performance Matrix Computations/Calcul Matriciel Haute

Sequential Machine Organization

(Timing diagram : load x (startup a), load y (a), add (b), mult (c) et store (d) s'executent strictement l'un apres l'autre ; le chemin memoire est occupe pendant chaque load et pendant le store.)

Chained Arithmetic

(Timing diagram : la multiplication est chainee a l'addition ; les loads et le store occupent le chemin memoire l'un apres l'autre.)

117/ 627

Page 123: High Performance Matrix Computations/Calcul Matriciel Haute

Chained Load and Arithmetic

(Timing diagram : l'addition et la multiplication sont chainees directement aux loads memoire de x et y.)

Chained Load, Arithmetic and Store

(Timing diagram : loads, addition, multiplication et store sont tous chaines en un seul pipeline.)

118/ 627

Page 124: High Performance Matrix Computations/Calcul Matriciel Haute

Overlapped Load with Chained Operations

(Timing diagram : les loads de x et y se recouvrent sur des chemins memoire separes (paths 1 et 2), l'add/mult chaines suivent, et le store utilise un troisieme chemin memoire (path 3).)

119/ 627

Page 125: High Performance Matrix Computations/Calcul Matriciel Haute

Calculateurs haute-performance: concepts generaux
  Introduction
  Organisation des processeurs
  Organisation memoire
  Organisation interne et performance des processeurs vectoriels
  Organisation des processeurs RISC
  Reutilisation des donnees (dans les registres)
  Memoire cache
  Reutilisation des donnees (dans les caches)
  Memoire virtuelle
  Reutilisation des donnees (en memoire)
  Interconnexion des processeurs
  Les supercalculateurs du top 500 en Juin 2007
  Conclusion

120/ 627

Page 126: High Performance Matrix Computations/Calcul Matriciel Haute

Organisation des processeurs RISC

The execution pipeline

(Figure : 5-stage execution pipeline — Instruction Fetch, Instruction Decode, Execution, Memory access and branch completion, Write back (write results in the register file).)

Example (DLX processor, Hennessy and Patterson, 96 [?])

I Pipeline increases the instruction throughput
I Pipeline hazards: prevent the next instruction from executing
  I Structural hazards: arising from hardware resource conflicts
  I Data hazards: due to dependencies between instructions
  I Control hazards: branches for example

121/ 627

Page 127: High Performance Matrix Computations/Calcul Matriciel Haute

Instruction Level Parallelism (ILP)

I Pipelining: overlap execution of independent operations → Instruction Level Parallelism

I Techniques for increasing the amount of parallelism among instructions:
  I reduce the impact of data and control hazards
  I increase the ability of the processor to exploit parallelism
  I compiler techniques to increase ILP

I Main techniques
  I loop unrolling
  I basic and dynamic pipeline scheduling
  I dynamic branch prediction
  I issuing multiple instructions per cycle
  I compiler dependence analysis
  I software pipelining
  I trace scheduling / speculation
  I . . .

122/ 627

Page 128: High Performance Matrix Computations/Calcul Matriciel Haute

Instruction Level Parallelism (ILP)

I A simple and common way to increase the amount of parallelism is to exploit parallelism among iterations of a loop : Loop Level Parallelism

I Several techniques :
  I Unrolling a loop statically by the compiler or dynamically by the hardware
  I Use of vector instructions

123/ 627

Page 129: High Performance Matrix Computations/Calcul Matriciel Haute

ILP: Dynamic scheduling

I Hardware rearranges the instruction execution to reduce the stalls.

I Advantage: handles cases where dependences are unknown at compile time and simplifies the compiler

I But: significant increase in hardware complexity

I Idea: execute instructions as soon as their data are available → out-of-order execution

I Handling exceptions becomes tricky

124/ 627

Page 130: High Performance Matrix Computations/Calcul Matriciel Haute

ILP: Dynamic scheduling

I Scoreboarding: technique allowing out-of-order instruction execution when resources are sufficient and when there are no data dependences

I full responsibility for instruction issue and execution

I goal : try to maintain an execution rate of one instruction / clock by executing instructions as early as possible

I requires multiple instructions to be in the EX stage simultaneously → multiple functional units and/or pipelined units

I Scoreboard table records/updates data dependences + status of functional units

I Limits:
  I amount of parallelism available between instructions
  I number of scoreboard entries: set of instructions examined (window)
  I number and type of functional units

125/ 627

Page 131: High Performance Matrix Computations/Calcul Matriciel Haute

ILP: Dynamic scheduling

I Other approach : Tomasulo’s approach (register renaming)

I Suppose compiler has issued:

F10 <- F2 x F2
F2  <- F0 + F6

I Rename F2 to F8 in the second instruction (assuming F8 is not used)

F10 <- F2 x F2
F8  <- F0 + F6

I Can be used in conjunction with scoreboarding

126/ 627

Page 132: High Performance Matrix Computations/Calcul Matriciel Haute

ILP : Multiple issue

I CPI cannot be less than one unless more than one instruction is issued each cycle → multiple-issue processors (CPI: average nb of cycles per instruction)

I Two types :
  I superscalar processors
  I VLIW processors (Very Long Instruction Word)

I Superscalar processors issue a varying number of instructions per cycle, either statically scheduled by the compiler or dynamically (e.g. using scoreboarding). Typically 1 - 8 instructions per cycle with some constraints.

I VLIW processors issue a fixed number of instructions formatted either as one large instruction or as a fixed instruction packet : inherently statically scheduled by the compiler

127/ 627

Page 133: High Performance Matrix Computations/Calcul Matriciel Haute

Impact of ILP : example

This example is from J.L. Hennessy and D.A. Patterson (1996) [?].

I Original Fortran code

do i = 1000, 1, -1
   x(i) = x(i) + temp
enddo

I Pseudo-assembler code

       R1 <- address(x(1000))
       load temp       -> F2

Loop : load x(i)       -> F0
       F4 = F0 + F2
       store F4        -> x(i)
       R1 = R1 - #8          % decrement pointer
       BNEZ R1, Loop         % branch until end of loop

128/ 627

Page 134: High Performance Matrix Computations/Calcul Matriciel Haute

I Architecture

(Figure : pipelined processor with IF, ID, MEM, WB stages around an Integer Unit (1 stage), an FP adder (4 stages, pipelined), an FP multiplier (4 stages, pipelined) and a Divide unit (not pipelined).)

Example of pipelined processor (DLX processor, Hennessy and Patterson, 96 [?])

129/ 627

Page 135: High Performance Matrix Computations/Calcul Matriciel Haute

I Latency: # cycles between the instruction that produces a result and the instruction that uses the result

I Initiation interval : # cycles between issuing 2 instructions of the same type

I Latency = 0 means results can be used next cycle

Functional unit   Latency   Initiation interval
Integer ALU          0              1
Loads                1              1
FP add               3              1
FP mult              3              1
FP divide           24             24

Characteristics of the processor

Inst. producing result   Inst. using result   Latency
FP op                    FP op                   3
FP op                    store double            2
Load double              FP op                   1
Load double              store double            0

Latency between instructions

Latency FP op to store double : forwarding hardware passes the result from the ALU directly to the memory input.

130/ 627

Page 136: High Performance Matrix Computations/Calcul Matriciel Haute

I Straightforward code

                                   #cycle
Loop : load x(i)  -> F0              1     load lat. = 1
       stall                         2
       F4 = F0 + F2                  3
       stall                         4     FP op -> store = 2
       stall                         5
       store F4   -> x(i)            6
       R1 = R1 - #8                  7
       BNEZ R1, Loop                 8
       stall                         9     delayed branch = 1

I 9 cycles per iteration

I Cost of calculation 9,000 cycles

I Peak performance : 1 flop/cycle

I Effective performance : 1/9 of peak

131/ 627

Page 137: High Performance Matrix Computations/Calcul Matriciel Haute

I With a better scheduling

                                   #cycle
Loop : load x(i)  -> F0              1     load lat. = 1
       stall                         2
       F4 = F0 + F2                  3
       R1 = R1 - #8                  4     Try to keep int. unit busy
       BNEZ R1, Loop                 5
       store F4   -> x(i)            6     Hide delayed branching by store

I 6 cycles per iteration

I Cost of calculation 6,000 cycles

I Effective performance : 1/6 of peak

132/ 627

Page 138: High Performance Matrix Computations/Calcul Matriciel Haute

I Using loop unrolling (depth = 4)

do i = 1000, 1, -4
   x(i  ) = x(i  ) + temp
   x(i-1) = x(i-1) + temp
   x(i-2) = x(i-2) + temp
   x(i-3) = x(i-3) + temp
enddo

133/ 627

Page 139: High Performance Matrix Computations/Calcul Matriciel Haute

I Pseudo-assembler code (loop unrolling, depth=4):
                                    #cycle
Loop : load x(i)   -> F0              1    1 stall
       F4 = F0 + F2                   3    2 stalls
       store F4    -> x(i)            6
       load x(i-1) -> F6              7    1 stall
       F8 = F6 + F2                   9    2 stalls
       store F8    -> x(i-1)         12
       load x(i-2) -> F10            13    1 stall
       F12 = F10 + F2                15    2 stalls
       store F12   -> x(i-2)         18
       load x(i-3) -> F14            19    1 stall
       F16 = F14 + F2                21    2 stalls
       store F16   -> x(i-3)         24
       R1 = R1 - #32                 25
       BNEZ R1, Loop                 26
       stall                         27

I 27 cycles per iteration

I Cost of calculation : (1000/4) × 27 = 6750 cycles

I Effective performance : 1000/6750 = 15% of peak

134/ 627

Page 140: High Performance Matrix Computations/Calcul Matriciel Haute

I Using loop unrolling (depth = 4) and scheduling

                                    #cycle
Loop : load x(i)   -> F0              1
       load x(i-1) -> F6              2
       load x(i-2) -> F10             3
       load x(i-3) -> F14             4
       F4  = F0  + F2                 5
       F8  = F6  + F2                 6
       F12 = F10 + F2                 7
       F16 = F14 + F2                 8
       store F4    -> x(i)            9
       store F8    -> x(i-1)         10
       store F12   -> x(i-2)         11
       R1 = R1 - #32                 12
       BNEZ R1, Loop                 13
       store F16   -> x(i-3)         14

I 14 cycles per iteration

I Cost of calculation : (1000/4) × 14 = 3500 cycles

I Effective performance : 1000/3500 = 29% of peak

135/ 627

Page 141: High Performance Matrix Computations/Calcul Matriciel Haute

I Now assume a superscalar pipeline : integer and floating point operations can be issued simultaneously

I Using loop unrolling with depth = 5

      Integer inst.       | Float. inst. | #cycle
      ----------------------------------------------
Loop: load x(i)   -> F0   |              |   1
      load x(i-1) -> F6   |              |   2
      load x(i-2) -> F10  | F4 =F0 +F2   |   3
      load x(i-3) -> F14  | F8 =F6 +F2   |   4
      load x(i-4) -> F18  | F12=F10+F2   |   5
      store F4  -> x(i)   | F16=F14+F2   |   6
      store F8  -> x(i-1) | F20=F18+F2   |   7
      store F12 -> x(i-2) |              |   8
      store F16 -> x(i-3) |              |   9
      R1 = R1 - #40       |              |  10
      BNEZ R1, Loop       |              |  11
      store F20 -> x(i-4) |              |  12

I 12 cycles per iteration

I Cost of calculation : (1000/5) × 12 = 2400 cycles

I Effective performance : 1000/2400 = 42% of peak

I Performance limited by the balance between int. and float. instr.

136/ 627

Page 142: High Performance Matrix Computations/Calcul Matriciel Haute

Survol des processeurs RISC

I Processeur RISC pipeline : exemple de pipeline d'execution a 4 etages

(Figure : fetch (stage #1) → decode (stage #2) → execute (stage #3) → write result (stage #4).)

137/ 627

Page 143: High Performance Matrix Computations/Calcul Matriciel Haute

Survol des processeurs RISC

I Processeur RISC superscalaire :
  I plusieurs pipelines
  I plusieurs instructions chargees + decodees + executees simultanement

(Figure : 4 pipelines en parallele, chacun avec les etages fetch → decode → execute → write result.)

I souvent operation entiere / operation flottante / load memoire
I probleme : dependances
I largement utilises : DEC, HP, IBM, Intel, SGI, Sparc, . . .

137/ 627

Page 144: High Performance Matrix Computations/Calcul Matriciel Haute

Survol des processeurs RISC

I Processeur RISC superpipeline :
  I plus d'une instruction initialisee par temps de cycle
  I pipeline plus rapide que l'horloge systeme
  I exemple : sur MIPS R4000, l'horloge interne du pipeline est 2 (ou 4) fois plus rapide que l'horloge systeme externe

I Superscalaire + superpipeline : non existant

137/ 627

Page 145: High Performance Matrix Computations/Calcul Matriciel Haute

Exemples de processeurs RISC

Processor        MHz    Exec. pipe   D-cache            I-cache   inst./   Peak
                        #stages      (KB)               (KB)      cycle    Perf.
DEC 21064        200    7/9          8                  8         2        200
DEC 21164        437    7/9          8 + 96 + 4 MB      8         4        874
HP PA 7200       120    -            2 MB ext.                    2        240
HP PA 8000       180    -            2 MB ext.                    4        720
IBM Power         66    6            32-64              8-32      4        132
IBM Power2       71.5   6            128-256            8-32      6        286
MIPS R8000        75    5/7          16 + 4 MB ext.     16        4        300
MIPS R10000      195    -            32 + 4 MB ext.     32        4        390
MIPS R12000      300    -            32 + 8 MB ext.     32        4        600
Pentium Pro      200    -            512 ext.                              200
UltraSPARC I     167    -            16 + 512 KB ext.   16        2        334
UltraSPARC II    200    -            1 MB ext.          16        2        400

138/ 627

Page 146: High Performance Matrix Computations/Calcul Matriciel Haute

Calculateurs haute-performance: concepts generaux
  Introduction
  Organisation des processeurs
  Organisation memoire
  Organisation interne et performance des processeurs vectoriels
  Organisation des processeurs RISC
  Reutilisation des donnees (dans les registres)
  Memoire cache
  Reutilisation des donnees (dans les caches)
  Memoire virtuelle
  Reutilisation des donnees (en memoire)
  Interconnexion des processeurs
  Les supercalculateurs du top 500 en Juin 2007
  Conclusion

139/ 627

Page 147: High Performance Matrix Computations/Calcul Matriciel Haute

Reutilisation des donnees (dans les registres)

I Ameliorer l’acces aux donnees et exploiter la localite spatialeet temporelle des references memoire

I Deroulage de boucles : reduit le nombre d’acces memoire enutilisant le plus de registres possible

I Utiliser des scalaires temporaires

I Distribution de boucles : si nombre de donnees reutilisables >nombre de registres : substituer plusieurs boucles a une seule

140/ 627

Page 148: High Performance Matrix Computations/Calcul Matriciel Haute

Deroulage de boucle

Objectif : reduire le nombre d'acces memoire et ameliorer le pipeline des operations flottantes.

I Produit matrice-vecteur : y ← y + A^t × x

do ...
   do ...
      y(i) = y(i) + x(j)*A(j,i)
   enddo
enddo

I 2 variantes :
  I AXPY :
      do j = 1, N
         do i = 1, N
            ...
  I DOT :
      do i = 1, N
         do j = 1, N
            ...

141/ 627

Page 149: High Performance Matrix Computations/Calcul Matriciel Haute

DOT variant

Processeurs RISC mieux adaptes a DOT que AXPY

do i = 1, N
   temp = 0.
   do j = 1, N
      temp = temp + x(j)*A(j,i)
   enddo
   y(i) = y(i) + temp
enddo

Stride = 1 dans la boucle la plus interne

load A(j,i)
load x(j)
perform x(j)*A(j,i) + temp

Ratio Flops/references memoire = 2/2 = 1

142/ 627

Page 150: High Performance Matrix Computations/Calcul Matriciel Haute

Reutilisation de x(j) : deroulage a une profondeur 2

* Cleanup odd iteration
      i = MOD(N,2)
      if ( i >= 1 ) then
         do j = 1, N
            y(i) = y(i) + x(j)*A(j,i)
         enddo
      end if
* Main loop
      imin = i + 1
      do i = imin, N, 2
         temp1 = 0.
         temp2 = 0.
         do j = 1, N
            temp1 = temp1 + A( j,i-1) * x(j)
            temp2 = temp2 + A( j,i  ) * x(j)
         enddo
         y(i-1) = y(i-1) + temp1
         y(i  ) = y(i  ) + temp2
      enddo

143/ 627

Page 151: High Performance Matrix Computations/Calcul Matriciel Haute

load A(j,i-1)
load x(j)
perform A(j,i-1) * x(j) + temp1
load A(j,i)
perform A(j,i  ) * x(j) + temp2

I Ratio Flops/references memoire = 4/3

I Deroulage a une profondeur de 4 : 8/5 (voir l'esquisse ci-dessous)

I Deroulage a une profondeur k : 2k/(k+1)
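A minimal sketch of the depth-4 unrolled DOT inner loop that gives the 8/5 ratio quoted above (the subroutine name is illustrative; N is assumed to be a multiple of 4, otherwise a cleanup loop as above is needed):

   subroutine matvec_dot4(n, a, x, y)
     ! DOT variant of y = y + A^t x, unrolled to depth 4.
     implicit none
     integer, intent(in) :: n
     double precision, intent(in)    :: a(n,n), x(n)
     double precision, intent(inout) :: y(n)
     integer :: i, j
     double precision :: temp1, temp2, temp3, temp4
     do i = 1, n, 4
        temp1 = 0.d0 ; temp2 = 0.d0 ; temp3 = 0.d0 ; temp4 = 0.d0
        do j = 1, n
           ! x(j) is loaded once and reused for 4 columns of A:
           ! 5 memory references and 8 flops per iteration -> ratio 8/5
           temp1 = temp1 + a(j,i  ) * x(j)
           temp2 = temp2 + a(j,i+1) * x(j)
           temp3 = temp3 + a(j,i+2) * x(j)
           temp4 = temp4 + a(j,i+3) * x(j)
        enddo
        y(i  ) = y(i  ) + temp1
        y(i+1) = y(i+1) + temp2
        y(i+2) = y(i+2) + temp3
        y(i+3) = y(i+3) + temp4
     enddo
   end subroutine matvec_dot4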

144/ 627

Page 152: High Performance Matrix Computations/Calcul Matriciel Haute

Figure: Effect of loop unrolling on HP 715/64 — performance (MFlops) of y = A^t x versus problem size, for the rolled code and unrolling depths 2, 4 and 8.

145/ 627

Page 153: High Performance Matrix Computations/Calcul Matriciel Haute

Figure: Effect of loop unrolling on CRAY T3D — performance (MFlops) of y = A^t x versus problem size, for the rolled code and unrolling depths 2, 4 and 8.

146/ 627

Page 154: High Performance Matrix Computations/Calcul Matriciel Haute

AXPY variant

Habituellement preferee sur processeurs vectoriels

do j = 1, N
   do i = 1, N
      y(i) = y(i) + x(j)*A(j,i)
   enddo
enddo

Stride > 1 dans la boucle la plus interne

load A(j,i)
load y(i)
perform x(j)*A(j,i) + y(i)
store result in y(i)

Ratio Flops/references memoire = 2/3

147/ 627

Page 155: High Performance Matrix Computations/Calcul Matriciel Haute

Reutilisation de y(i) : deroulage a profondeur 2

* Cleanup odd iteration
      j = MOD(N,2)
      if ( j .GE. 1 ) then
         do i = 1, N
            y(i) = y(i) + x(j)*A(j,i)
         enddo
      end if
* Main loop
      jmin = j + 1
      do j = jmin, N, 2
         do i = 1, N
            y(i) = y(i) + A(j-1,i)*x(j-1) + A(j,i)*x(j)
         enddo
      enddo

Page 156: High Performance Matrix Computations/Calcul Matriciel Haute

load y(i)
load A(j-1,i)
perform A(j-1,i) * x(j-1) + y(i)
load A(j,i)
perform A(j,i) * x(j) + y(i)
store result in y(i)

I Ratio Flops/references memoire = 1

I Deroulage a profondeur 4 → Ratio = 4/3 (voir l'esquisse ci-dessous)

I Deroulage a profondeur p → Ratio = 2p/(2+p)
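A minimal sketch of the depth-4 unrolled AXPY loop giving the 4/3 ratio quoted above (the subroutine name is illustrative; N is assumed to be a multiple of 4, otherwise a cleanup loop as above is needed):

   subroutine matvec_axpy4(n, a, x, y)
     ! AXPY variant of y = y + A^t x, unrolled to depth 4.
     implicit none
     integer, intent(in) :: n
     double precision, intent(in)    :: a(n,n), x(n)
     double precision, intent(inout) :: y(n)
     integer :: i, j
     do j = 1, n, 4
        do i = 1, n
           ! 6 memory references (4 loads of A, load + store of y) and
           ! 8 flops per iteration -> ratio 8/6 = 4/3
           y(i) = y(i) + a(j  ,i)*x(j  ) + a(j+1,i)*x(j+1) &
                       + a(j+2,i)*x(j+2) + a(j+3,i)*x(j+3)
        enddo
     enddo
   end subroutine matvec_axpy4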

149/ 627

Page 157: High Performance Matrix Computations/Calcul Matriciel Haute

Calculateurs haute-performance: concepts generaux
  Introduction
  Organisation des processeurs
  Organisation memoire
  Organisation interne et performance des processeurs vectoriels
  Organisation des processeurs RISC
  Reutilisation des donnees (dans les registres)
  Memoire cache
  Reutilisation des donnees (dans les caches)
  Memoire virtuelle
  Reutilisation des donnees (en memoire)
  Interconnexion des processeurs
  Les supercalculateurs du top 500 en Juin 2007
  Conclusion

150/ 627

Page 158: High Performance Matrix Computations/Calcul Matriciel Haute

Organisation d’une memoire cache

I Cache
  I Buffer rapide entre les registres et la memoire principale
  I Divise en lignes de cache

I Ligne de cache
  I Unite de transfert entre cache et memoire principale

I Defaut de cache
  I Reference a une donnee non presente dans le cache
  I Strategie de choix d'une ligne a remplacer (LRU parmi les eligibles)
  I Une ligne de cache contenant la donnee est chargee de la memoire principale dans le cache

I Probleme de la coherence de cache sur les multiprocesseurs a memoire partagee

I Rangement des donnees dans les caches
  I correspondance memoire ↔ emplacements dans le cache

151/ 627

Page 159: High Performance Matrix Computations/Calcul Matriciel Haute

I Strategies les plus courantes :
  I "direct mapping"
  I "fully associative"
  I "set associative"

I Conception des caches :
  I L octets par ligne de cache
  I K lignes par ensemble (K est le degre d'associativite)
  I N ensembles

I Correspondance simple entre l'adresse en memoire et un ensemble (voir l'esquisse ci-dessous) :
  I N = 1 : cache "fully associative"
  I K = 1 : cache "direct mapped"
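A minimal sketch (hypothetical helper, not from the course) of this address-to-set correspondence, with L bytes per line and N sets:

   integer function cache_set(byte_address, l, n)
     ! Set index of a byte address for a cache with lines of l bytes and
     ! n sets (n = 1: fully associative; one line per set: direct mapped).
     implicit none
     integer, intent(in) :: byte_address, l, n
     cache_set = mod(byte_address / l, n)
   end function cache_set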

152/ 627

Page 160: High Performance Matrix Computations/Calcul Matriciel Haute

I "Direct mapping"
  I Chaque bloc en memoire ↔ un placement unique dans le cache
  I Recherche de donnees dans le cache peu couteuse (mais remplacement couteux)
  I Probleme de contention entre les blocs

(Figure : chaque ligne de la memoire principale est associee a une ligne unique du cache.)

I "Fully associative"
  I Pas de correspondance a priori
  I Recherche de donnees dans le cache couteuse

153/ 627

Page 161: High Performance Matrix Computations/Calcul Matriciel Haute

I "Set associative"
  I Cache divise en plusieurs ensembles
  I Chaque bloc en memoire peut etre dans l'une des lignes de l'ensemble
  I "4-way set associative" : 4 lignes par ensemble

(Figure : une ligne de la memoire principale peut aller dans l'une des 4 lignes de l'ensemble #k du cache.)

154/ 627

Page 162: High Performance Matrix Computations/Calcul Matriciel Haute

Gestion des caches

I Cout d’un defaut de cache : entre 2 et 50 C (temps de cycle)I “Copyback”

I Pas de m-a-j lorsqu’une ligne de cache est modifiee, exceptelors d’un cache flush ou d’un defaut de cache

Memoire pas toujours a jour.Pas de probleme de coherence si les processeurs modifient des

lignes de cache independantes

I “Writethrough”I Donnee ecrite en memoire chaque fois qu’elle est modifiee

Donnees toujours a jour.Pas de probleme de coherence si les processeurs modifient des

donnees independantes

155/ 627

Page 163: High Performance Matrix Computations/Calcul Matriciel Haute

Cache coherency problem

(Figure : two processors with private caches both holding a copy of the same cache line (X and Y).)

I Cache coherency mechanisms to:
  I avoid processors accessing old copies of data (copyback and writethrough)
  I update memory by forcing copyback
  I invalidate old cache lines

I Example of mechanism (snooping):
  I assume writethrough policy
  I Each processor observes the memory accesses from others
  I If a write operation occurs that corresponds to a local cacheline, invalidate the local cacheline

156/ 627


Page 165: High Performance Matrix Computations/Calcul Matriciel Haute

Processor     Line size     Level   Size            Organization    Miss     Access/cycle
DEC 21164     32 B          1       8 KB            Direct-mapped    2 C     2
                            2*      96 KB           3-way ass.      ≥ 8 C    2
                            3*      1-64 MB         Direct-mapped   ≥ 12 C   2
IBM Power2    128 B/256 B   1       128 KB/256 KB   4-way ass.       8 C     2
MIPS R8000    16 B          1       16 KB           Direct-mapped    7 C     2
                            2*      4-16 MB         4-way ass.      50 C     2

Cache configurations on some computers.   * : data + instruction cache

I Current trends:
  I Large caches of several MBytes
  I Several levels of cache

157/ 627

Page 166: High Performance Matrix Computations/Calcul Matriciel Haute

Calculateurs haute-performance: concepts generaux
  Introduction
  Organisation des processeurs
  Organisation memoire
  Organisation interne et performance des processeurs vectoriels
  Organisation des processeurs RISC
  Reutilisation des donnees (dans les registres)
  Memoire cache
  Reutilisation des donnees (dans les caches)
  Memoire virtuelle
  Reutilisation des donnees (en memoire)
  Interconnexion des processeurs
  Les supercalculateurs du top 500 en Juin 2007
  Conclusion

158/ 627

Page 167: High Performance Matrix Computations/Calcul Matriciel Haute

Reutilisation des donnees (dans les caches)

Example

I cache 10 times faster than memory, hits 90% of the time.
I What is the gain from using the cache ?

I Cost cache miss: tmiss

I Cost cache hit: thit = 0.1× tmiss

I Average cost:

90%(0.1× tmiss) + 10%× tmiss

I gain = (tmiss × 100%) / (90% × (0.1 × tmiss) + 10% × tmiss)
       = 1 / ((0.9 × 0.1) + 0.1) = 1 / 0.19 = 5.3

(similar to Amdahl’s law)

159/ 627


Page 169: High Performance Matrix Computations/Calcul Matriciel Haute

Reutilisation des donnees (dans les caches)

Il est critique d'utiliser au maximum les donnees dans le cache ↔ ameliorer le % de succes de cache

I Exemple : effet du % de defauts de cache sur un code donne

I Pmax : performance lorsque toutes les donnees tiennent dans le cache (hit ratio = 100%). Tmin : temps correspondant.

I Lecture de donnee dans le cache par une instruction et execution : thit = 1 cycle

I Temps d'acces a une donnee lors d'un defaut de cache : tmiss = 10 ou 20 cycles (execution de l'instruction : tmiss + thit)

I Ttotal = %hits.thit + %misses × (tmiss + thit)

I Topt = 100%× thit

I Perf = Topt / Ttotal

160/ 627

Page 170: High Performance Matrix Computations/Calcul Matriciel Haute

Tmiss   %hits   Tps hits   Tps misses   Ttotal   Perf.
  -     100%     1.00        0.00        1.00    100%
 10      99%     0.99        0.11        1.10     91%
 20      99%     0.99        0.22        1.21     83%
 10      95%     0.95        0.55        1.50     66%
 20      95%     0.95        1.10        2.05     49%

Table: Effet des defauts de cache sur la performance d'un code (exprimes en pourcentages vs pas de defaut de cache).

161/ 627

Page 171: High Performance Matrix Computations/Calcul Matriciel Haute

Efficient cache utilization: Exercise

Reuse as much as possible the data held in cache ↔ improve the cache hit ratio

I Cache : single block of CS (cache size) words
I When cache is full: LRU line returned to memory
I Copy-back: memory updated only when a modified block is removed from cache
I For simplicity, we assume cache line size L=1

Example from D. Gannon and F. Bodin :

do i=1,n
   do j=1,n
      a(j) = a(j) + b(i)
   enddo
enddo

1. Compute the cache hit ratio (assume n much larger than CS).

2. Propose a modification to improve the cache hit ratio.

162/ 627

Page 172: High Performance Matrix Computations/Calcul Matriciel Haute

I Total number of memory references = 3 × n^2, i.e. n^2 loads for a, n^2 stores for a, and n^2 loads for b (assuming the compiler is stupid).

I Total number of flops = n^2

I Cache empty at beginning of calculations.

I Inner loop:

do j=1,n
   a(j) = a(j) + b(i)
enddo

Each iteration reads a(j) and b(i), and writes a(j).
For i=1 → access to a(1:n)
For i=2 → access to a(1:n)
As n >> CS, a(j) is no longer in cache when accessed again, therefore:

I each read of a(j)  → 1 miss
I each write of a(j) → 1 hit
I each read of b(i)  → 1 hit (except the first one)

I Hit ratio = # of hits / Mem.Refs = 2/3 = 66%

163/ 627

Page 173: High Performance Matrix Computations/Calcul Matriciel Haute

blocked version

The inner loop is blocked into blocks of size nb < CS so that nb elements of a can be kept in cache and entirely updated with b(1:n).

do j=1,n,nb
   jb = min(nb,n-j+1)   ! nb may not divide n
   do i=1,n
      do jj=j,j+jb-1
         a(jj) = a(jj) + b(i)
      enddo
   enddo
enddo

164/ 627

Page 174: High Performance Matrix Computations/Calcul Matriciel Haute

To clarify, we load the cache explicitly; it is managed as a 1D array : CA(0:nb)

do j=1,n,nb
   jb = min(nb,n-j+1)
   CA(1:jb) = a(j:j+jb-1)
   do i=1,n
      CA(0) = b(i)
      do jj=j,j+jb-1
         CA(jj-j+1) = CA(jj-j+1) + CA(0)
      enddo
   enddo
   a(j:j+jb-1) = CA(1:jb)
enddo

Each load into cache is a miss, each store to cache is a hit.

165/ 627

Page 175: High Performance Matrix Computations/Calcul Matriciel Haute

I Total memory references = 3n^2

I Total misses:
  I load a : (n/nb) × nb = n
  I load b : (n/nb) × n  = n^2/nb
  I Total  : n + n^2/nb

I Total hits = 3n^2 − n − n^2/nb = (3 − 1/nb) × n^2 − n

Hit ratio = hits / Mem.Refs ≈ 1 − 1/(3 nb) ≈ 100% if nb is large enough.

166/ 627

Page 176: High Performance Matrix Computations/Calcul Matriciel Haute

Calculateurs haute-performance: concepts generaux
  Introduction
  Organisation des processeurs
  Organisation memoire
  Organisation interne et performance des processeurs vectoriels
  Organisation des processeurs RISC
  Reutilisation des donnees (dans les registres)
  Memoire cache
  Reutilisation des donnees (dans les caches)
  Memoire virtuelle
  Reutilisation des donnees (en memoire)
  Interconnexion des processeurs
  Les supercalculateurs du top 500 en Juin 2007
  Conclusion

167/ 627

Page 177: High Performance Matrix Computations/Calcul Matriciel Haute

Memoire virtuelle

I Memoire reelle : code et donnees doivent etre loges en memoire centrale (CRAY)

I Memoire virtuelle : mecanisme de pagination entre la memoire et les disques

  Une pagination memoire excessive peut avoir des consequences dramatiques sur la performance !!!!

I TLB :
  I Translation Lookaside Buffer : correspondance entre l'adresse virtuelle et l'adresse reelle d'une page en memoire
  I TLB sur IBM Power4/5 : 1024 entrees
  I Defaut de TLB : 36 C environ

I AIX offre la possibilite d'augmenter la taille des pages (jusqu'a 16 MB) pour limiter les defauts de TLB.

168/ 627

Page 178: High Performance Matrix Computations/Calcul Matriciel Haute

Calculateurs haute-performance: concepts generaux
  Introduction
  Organisation des processeurs
  Organisation memoire
  Organisation interne et performance des processeurs vectoriels
  Organisation des processeurs RISC
  Reutilisation des donnees (dans les registres)
  Memoire cache
  Reutilisation des donnees (dans les caches)
  Memoire virtuelle
  Reutilisation des donnees (en memoire)
  Interconnexion des processeurs
  Les supercalculateurs du top 500 en Juin 2007
  Conclusion

169/ 627

Page 179: High Performance Matrix Computations/Calcul Matriciel Haute

Exercice sur la reutilisation des donnees (en memoire)

(inspire de Dongarra, Duff, Sorensen, van der Vorst [?])

C ← C + A × B
A, B, C : matrices n × n, n = 20000, stockees par colonnes

I Calculateur vectoriel (Performance de crete 50 GFlop/s)

I Memoire virtuelle (remplacement page : LRU)

I 1 page memoire = 2 Mmots = 100 colonnes de A, B, ou C (1 mot = 8 bytes)

I 1 defaut de page ≈ 10^-4 secondes

I Stockage de A, B, et C : 3 × 400 Mmots = 3 × 3.2 GB = 9.6 GB

I capacite memoire : 128 pages, soit 128 × 2 Mmots = 256 Mmots = 2 GB → A, B, C ne peuvent etre stockees totalement

170/ 627

Page 180: High Performance Matrix Computations/Calcul Matriciel Haute

Variante (1) : ijk

do i = 1, n
   do j = 1, n
      do k = 1, n
         Cij <- Cij + Aik * Bkj
      enddo
   enddo
enddo

1. Quel est le nombre de defauts de pages et le temps de calcul de cette variante (ijk) ?

2. Quel est le nombre de defauts de pages et le temps de calcul de la variante (jki) ?

3. Quel est le nombre de defauts de pages et le temps de calcul de la variante (jki) avec blocage sur j et k par blocs de taille 4 pages memoire ?

171/ 627

Page 181: High Performance Matrix Computations/Calcul Matriciel Haute

Variante (1) : ijk

do i = 1, n
   do j = 1, n
      do k = 1, n
         Cij <- Cij + Aik * Bkj
      enddo
   enddo
enddo

Si acces en sequence aux colonnes d'une matrice : 1 defaut de page toutes les 100 colonnes.
Acces a une ligne de A → n/100 = 200 defauts de page.
D'ou 200 × 20000^2 = 8 × 10^10 defauts de page.
8 × 10^10 defauts de page × 10^-4 sec = 8 Msec ≈ 128 jours de calcul

172/ 627

Page 182: High Performance Matrix Computations/Calcul Matriciel Haute

Variante (2) : jki

do j = 1, n
   do k = 1, n
      do i = 1, n
         Cij <- Cij + Aik * Bkj
      enddo
   enddo
enddo

Pour chaque j :

I toutes les colonnes de A sont accedees : n × 200 defauts de page (au total)

I acces aux colonnes de B et C : 200 defauts de page

I total ≈ 4 × 10^6 defauts de page

Temps d'execution ≈ 4 × 10^6 × 10^-4 sec = 400 sec

173/ 627

Page 183: High Performance Matrix Computations/Calcul Matriciel Haute

Variante (3) : jki bloque

Les matrices sont partitionnees en blocs de colonnes tq bloc-colonne (nb = 400 colonnes) = 4 pages memoire.

Reutilisation maximale des sous-matrices en memoire.

* Organisation des calculs sur des sous-matrices
      do j = 1, n, nb
         jb = min(n-j+1,nb)
         do k = 1, n, nb                    ! sectioning loops
            kb = min(n-k+1,nb)
*           Multiplication sur les sous-matrices
*           C(1:n,j:j+jb-1) <- C(1:n,j:j+jb-1)
*                            + A(1:n,k:k+kb-1) * B(k:k+kb-1,j:j+jb-1)
            do jj = j, j+jb-1
               do kk = k, k+kb-1
                  do i = 1, n
                     C(i,jj) <- C(i,jj) + A(i,kk) * B(kk,jj)
                  enddo
               enddo
            enddo
         enddo
      enddo

Page 184: High Performance Matrix Computations/Calcul Matriciel Haute

Defauts de page :

I nb = 400 colonnes (4 pages memoire)

I acces a B et C : defauts de page lors de la boucle en j : 200 defauts de page

I n/nb acces (boucle en j) a A par blocs de colonnes, pour chaque indice k : 200, soit n/nb × 200 au total.

I Total ≈ (n/nb + 2) × 200 defauts de page

I nb = 400 donc n/nb = 50

I et donc ≈ 10^4 defauts de page

I Temps de chargement memoire = 1 sec

Attention : le temps de calcul n'est plus negligeable !!
Temps = 2 × n^3 / vitesse ≈ 320 secondes

Idees identiques au blocage pour cache.
Blocage : tres efficace pour exploiter au mieux une hierarchie memoire (cache, memoire virtuelle, . . . )

175/ 627

Page 185: High Performance Matrix Computations/Calcul Matriciel Haute

Calculateurs haute-performance: concepts generaux
  Introduction
  Organisation des processeurs
  Organisation memoire
  Organisation interne et performance des processeurs vectoriels
  Organisation des processeurs RISC
  Reutilisation des donnees (dans les registres)
  Memoire cache
  Reutilisation des donnees (dans les caches)
  Memoire virtuelle
  Reutilisation des donnees (en memoire)
  Interconnexion des processeurs
  Les supercalculateurs du top 500 en Juin 2007
  Conclusion

176/ 627

Page 186: High Performance Matrix Computations/Calcul Matriciel Haute

Interconnexion des processeurs

I Reseaux constitues d'un certain nombre de boites de connexion et de liens

I Commutation de circuits : chemin cree physiquement pour toute la duree d'un transfert (ideal pour un gros transfert)

I Commutation de paquets : des paquets formes de donnees + controle trouvent eux-memes leur chemin

I Commutation integree : autorise les deux commutations precedentes

I Deux familles de reseaux distinctes par leur conception et leur usage :
  I Reseaux mono-etage
  I Reseaux multi-etages

177/ 627

Page 187: High Performance Matrix Computations/Calcul Matriciel Haute

Reseau Crossbar

(Figure : crossbar 4 × 4 — toute entree peut etre connectee a toute sortie par un point de croisement.)

Toute entree peut etre connectee a toute sortie sans blocage.
Theoriquement le plus rapide des reseaux, mais concevable seulement pour un faible nombre d'Entrees/Sorties.
Utilise sur les calculateurs a memoire partagee : Alliant, Cray, Convex, . . .

178/ 627

Page 188: High Performance Matrix Computations/Calcul Matriciel Haute

Reseaux multi-etages

Constitues de plus d'un etage de boitiers de connexion. Systeme de communication permettant le plus grand nombre possible de permutations entre un nombre fixe d'entrees et de sorties.
A chaque entree (ou sortie) est associee une unite fonctionnelle.
Nombre d'entrees = nombre de sorties = 2^p.

Figure: Exemple de reseau multi-etage avec p=3 (8 entrees et 8 sorties numerotees 0 a 7).

Reseaux bidirectionnels ou doublement du reseau.

179/ 627

Page 189: High Performance Matrix Computations/Calcul Matriciel Haute

Boite de connexion elementaire

Element de base dans la construction d'un reseau : connexion entre deux entrees et deux sorties

I Boite a deux fonctions (B2F) permettant les connexions directe et croisee, controlees par un bit

I Boite a quatre fonctions (B4F) permettant les connexions directe, croisee, a distribution basse et haute, controlees par deux bits.

180/ 627

Page 190: High Performance Matrix Computations/Calcul Matriciel Haute

I Topologie : mode d'assemblage des boites de connexion pour former un reseau de N = 2^p entrees / N sorties. La plupart des reseaux sont composes de p etages de N/2 boites.

I Exemple : Reseau Omega
  Topologie basee sur le "Perfect Shuffle", permutation sur des vecteurs de 2^p elements.

(Figure : le "Perfect Shuffle" applique aux 8 entrees 0..7.)

Le reseau Omega reproduit a chaque etage un "Perfect Shuffle". Autorise la distribution d'une entree sur toutes les sorties ("broadcast").

181/ 627

Page 191: High Performance Matrix Computations/Calcul Matriciel Haute

(Figure : Reseau Omega 8 × 8 — 3 etages de 4 boites de connexion (A a L), entrees et sorties numerotees 0 a 7.)

Reseau Omega 8 × 8.

182/ 627

Page 192: High Performance Matrix Computations/Calcul Matriciel Haute

I Autre topologie possible (reseau Butterfly, BBN, Meiko CS2)

(Figure : reseau Butterfly 8 × 8 — 3 etages de 4 boites de connexion (A a L), entrees et sorties numerotees 0 a 7.)

183/ 627

Page 193: High Performance Matrix Computations/Calcul Matriciel Haute

Reseaux mono-etage

I Realisent un nombre fini de permutations entre les entrees et les sorties, chacune de ces permutations faisant l'objet d'une connexion physique (en general canal bidirectionnel). Generalement statique.

(Figure : exemple de reseau mono-etage reliant 6 processeurs.)

I Tres utilise dans les architectures a memoire locale

184/ 627

Page 194: High Performance Matrix Computations/Calcul Matriciel Haute

I Exemples :
  I Bus partage

(Figure : processeurs Proc#0 ... Proc#n, chacun avec cache et memoire locale, relies par un BUS a la memoire principale.)

Largement utilise sur SMP (SGI, SUN, DEC, . . . )

185/ 627

Page 195: High Performance Matrix Computations/Calcul Matriciel Haute

I Anneau
  (Figure : processeurs Proc 0, Proc 1, ..., Proc n relies en anneau.)

I Grille
  (Figure : processeurs relies en grille 2D.)

  Utilise sur Intel DELTA et PARAGON, . . .

186/ 627

Page 196: High Performance Matrix Computations/Calcul Matriciel Haute

I Shuffle Exchange : Perfect Shuffle avec en plus Proc #i connecte a Proc #(i+1)
  (Figure : 8 processeurs 0..7 avec liens "shuffle" et liens d'echange.)

I N-cube ou hypercube : Proc #i connecte au Proc #j si i et j different d'un seul bit.
  (Figure : 8 processeurs 0..7 relies en 3-cube.)

I Grand classique utilise sur hypercubes Intel (iPSC/1, iPSC/2, iPSC/860), machines NCUBE, CM2, . . .

187/ 627

Page 197: High Performance Matrix Computations/Calcul Matriciel Haute

Figure: 4-Cube in space.

188/ 627

Page 198: High Performance Matrix Computations/Calcul Matriciel Haute

Topologies usuelles pour les architectures distribuees

I Notations :
  I # procs = N = 2^p
  I diametre = d (chemin critique entre 2 procs)
  I # liens = w

I Anneau : d = N/2, w = N

I Grille 2D : d = 2 × (N^(1/2) − 1), w = 2 × N^(1/2) × (N^(1/2) − 1)

I Tore 2D (grille avec rebouclage sur les bords) : d = N^(1/2), w = 2 × N
  (Figure : tore 2D de processeurs.)

I Hypercube ou p-Cube : d = p, w = N × p / 2 (voir l'esquisse ci-dessous)
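A minimal sketch (program name and the values p = 4, i = 5 are illustrative) that evaluates the formulas above for N = 2^p processors and lists the hypercube neighbours of a processor (numbers differing by exactly one bit):

   program topologies
     implicit none
     integer :: p, n, sq, k, i
     p = 4
     n = 2**p
     sq = nint(sqrt(real(n)))
     print *, 'Ring      : d =', n/2,       '  w =', n
     print *, 'Grid 2D   : d =', 2*(sq-1),  '  w =', 2*sq*(sq-1)
     print *, 'Torus 2D  : d =', sq,        '  w =', 2*n
     print *, 'Hypercube : d =', p,         '  w =', n*p/2
     i = 5
     ! Neighbours of processor i: flip one bit at a time
     print *, 'Neighbours of proc', i, ':', (ieor(i, 2**k), k = 0, p-1)
   end program topologies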

189/ 627

Page 199: High Performance Matrix Computations/Calcul Matriciel Haute

Remarques

I Tendance actuelle :
  I Reseaux hierarchiques/multi-etages
  I Beaucoup de redondances (bande passante, connexions simultanees)

I Consequence sur les calculateurs haute performance :
  I Peu de difference de cout selon sources/destinations
  I La conception des algorithmes paralleles ne prend plus en compte la topologie des reseaux (anneaux, . . . )

190/ 627

Page 200: High Performance Matrix Computations/Calcul Matriciel Haute

Calculateurs haute-performance: concepts generaux
  Introduction
  Organisation des processeurs
  Organisation memoire
  Organisation interne et performance des processeurs vectoriels
  Organisation des processeurs RISC
  Reutilisation des donnees (dans les registres)
  Memoire cache
  Reutilisation des donnees (dans les caches)
  Memoire virtuelle
  Reutilisation des donnees (en memoire)
  Interconnexion des processeurs
  Les supercalculateurs du top 500 en Juin 2007
  Conclusion

191/ 627

Page 201: High Performance Matrix Computations/Calcul Matriciel Haute

Statistiques Top 500 (voir www.top500.org)

I Liste des 500 machines les plus puissantes au monde

I Mesure : GFlop/s pour la resolution de Ax = b, A matrice dense.

I Mises a jour 2 fois par an (Juin/ISC, Novembre/SC).

I Sur les 10 dernieres annees la performance a augmente plus vite que la loi de Moore :
  I 1997 :
    I #1 = 1.1 TFlop/s
    I #500 = 7.7 GFlop/s
  I 2007 :
    I #1 = 280 TFlop/s
    I #500 = 4 TFlop/s
  I 2008 : Roadrunner
    I #1 = 1 PFlop/s (1026 TFlop/s)
    I #500 = 4 TFlop/s

192/ 627

Page 202: High Performance Matrix Computations/Calcul Matriciel Haute

Quelques remarques generales (Juin 2007)

I Architectures IBM Blue Gene dominent dans le top 10.

I NEC "Earth Simulator supercomputer" (36 TFlop/s, 5120 processeurs vectoriels) est aujourd'hui numero 20. Est reste en tete de juin 2002 a juin 2004.

I Il faut 56 TFlop/s pour entrer dans le Top 10 (contre 15 TFlop/s en juin 2005)

I Somme accumulee : 4.95 PFlop/s (contre 1.69 PFlop/s en juin 2005)

I Le 500ieme (4 TFlop/s) aurait ete 216ieme il y a 6 mois.

193/ 627

Page 203: High Performance Matrix Computations/Calcul Matriciel Haute

Remarques generales (Juin 2007 suite)

I Domaine d’activiteI Recherche 25%, Accademie 18%, Industrie 53%I Par contre 100% du TOP10 pour recherche et accademie.I France (10/500) dont 8 pour l’industrie.

I ProcesseursI 289 systemes bases sur de l’Intel (dont 31→ 205 sur le Xeon

Woodcrest, bi-cœur)I 107 sur des AMD (dont 90 : bi-cœurs Opteron)I 85 sur de l’IBM Power 3, 4 ou 5I 10 sur des HP PA-RISCI 4 sur des NEC (vectoriels)I 3 sur des SparcI 2 sur des CRAY (vectoriels)I 6/500 (18/500 en 2005) bases sur des processeurs vectoriels.

I Architecture107 MPP (Cray SX1, IBM SP, NEC SX, SGI ALTIX, HitatchiSR) pour 393 Clusters

194/ 627

Page 204: High Performance Matrix Computations/Calcul Matriciel Haute

Analyse des sites - Definitions

I Rang: Position dans le top 500.

I Rpeak : Performance crete de la machine en nombre d'operations flottantes par seconde.

I Rmax : Performance maximum obtenue sur le test LINPACK.

I Nmax : Taille du probleme ayant servi a obtenir Rmax.

I Power : Watts consommes (voir aussi www.green500.org)
  I Plus/moins performant du top 500 : 480 Mflops/Watt et 4 Mflops/Watt
  I Juin 2008, #1 Top500 : 437 Mflops/Watt, est 3ieme au green500 (#2 : 205 Mflops/Watt)
  I Gain de 131 Mflops/Watt par rapport a Novembre 2007 (utilisation du processeur Cell, voir Section 2 Introduction)
  I Gain de 0.4 Mflops (entre Juin 2007 et 2008) seulement sur le bas du classement

195/ 627

Page 205: High Performance Matrix Computations/Calcul Matriciel Haute

Top 10 mondial (Juin 2007)

Rang - Configuration             Implantation                 #proc.   Rmax     Rpeak    Year
                                                                       TFlops   TFlops
1  - IBM eServer BlueGene        DOE/NNSA/LLNL                131072   280      367      2005
2  - Cray XT4/XT3 (1)            Oak Ridge National Lab        23016   101      119      2006
3  - Cray RedStorm (2)           NNSA/Sandia Lab               26544   101      127      2006
4  - IBM eServer BlueGene        IBM TJWatson Res. Ctr.        40960    91      114      2005
5  - IBM eServer BlueGene        New York Ctr. in CS           36864    82      103      2007
6  - IBM eServer pSeries (3)     DOE/NNSA/LLNL                 12208    75       92      2006
7  - IBM eServer Blue Gene       Nanotechnology (4)            32768    73       91      2007
8  - DELL PowerEdge (5)          Nat. Ctr. Supercomp. Appl.    10240    62       94      2007
9  - IBM cluster (6)             Barcelona Supercomp. Ctr.     10240    62       94      2006
10 - SGI Altix 4700-1.6GHz       Leibniz Rechenzentrum          9728    56       62      2007
12 - Tera-10 Novascale (7)       CEA                            9968    52       63      2006

(1) Opteron 2.6 GHz dual core    (2) Opteron 2.4 GHz dual core    (3) p5 1.9 GHz
(4) Rensselaer Polytech. Inst. (nanotech.)    (5) 2.33 GHz - Infiniband
(6) PPC 2.3 GHz - Myrinet        (7) Ita2 1.6 GHz - Quadrics

196/ 627

Page 206: High Performance Matrix Computations/Calcul Matriciel Haute

Top 7 mondial (Juin 2005)

Rang - Configuration                      Implantation                #proc.   Rmax     Rpeak    Nmax
                                                                               TFlops   TFlops   (10^3)
1 - IBM eServer BlueGene Solution         DOE/NNSA/LLNL                65536   136      183      1278
2 - IBM eServer BlueGene Solution         IBM TJWatson Res. Ctr.       40960    91      114       983
3 - SGI Altix 1.5GHz                      NASA/Ames Res.Ctr./NAS       10160    51       60      1290
4 - NEC Earth Simulator                   Earth Simul. Center           5120    36       41      1075
5 - IBM cluster, PPC-2.2GHz-Myri.         Barcelona Supercomp. Ctr.     4800    27       42       977
6 - IBM eServer BlueGene Solution         ASTRON/Univ. Groningen       12288    27       34       516
7 - NOW Itanium2-1.4GHz-Quadrix           Los Alamos Nat. Lab.          8192    19       22       975

Stockage du probleme de taille 10^6 = 8 Terabytes

197/ 627

Page 207: High Performance Matrix Computations/Calcul Matriciel Haute

Constructeur   Nombre   Pourcent.   Σ Rmax      Σ Rpeak     Σ Procs
                                    (TFlop/s)   (TFlop/s)
IBM             192      38.4        2060        3121        679128
HP              201      40.2        1193        1860        227028
Dell             22       4.4         427         616         67264
Cray Inc.        11       2.2         359         438         81070
SGI              19       3.8         281         317         48464
NEC               4       0.8          53          59          5952
Self-made         5       1.0          48          79         10056
Sun               7       1.4          43          59          5952
Fujitsu           4       0.8          25          47          7488
All             500     100          4946        7183       1221114

Statistiques constructeurs Top 500, nombre de systemes installes.

Page 208: High Performance Matrix Computations/Calcul Matriciel Haute

Analyse des sites francais – Juin 2007

Rang     - Configuration           Implantation         #proc.   Rmax (GFlops)   Rpeak (GFlops)
12       - NovaScale 5160 (8)      CEA                   9968     52840           63798
22       - NovaScale 3045 (9)      CEA                   6144     35130           39321
38       - IBM Blue Gene L         EDF R&D               8192     18665           22937
110      - HP Cluster (10)         HP                    1024      8751           12288
238      - HP Cluster (11)         Industrie alim.        668      5210            8016
329      - HP Cluster (12)         IT Service Prov.       640      4992            7680
349      - IBM BladeCenter (13)    Finance               2000      4925            8800
394      - IBM Cluster (14)        PSA Peugeot           1184      4673            6157
458-459  - IBM eServer (15)        Total SA              1024      4307            7782
480      - HP Cluster Xeon (16)    Industrie alim.        688      4173            6420
489-490  - NEC SX8R (2.2 GHz)      Meteo-France           128      4058             405

(8) Ita2, 1.6 GHz, Infiniband     (9) Ita2, 1.6 GHz, Quadrics     (10) Xeon 3 GHz, Infiniband
(11) Xeon 3 GHz, GigEthernet      (12) Xeon 3 GHz, GigEthernet    (13) Opteron 2.2 GHz
(14) Opteron 2.6 GHz, Infiniband  (15) pSeries 1.9 GHz, Myrinet   (16) 2.33 GHz, GigEthernet

199/ 627

Page 209: High Performance Matrix Computations/Calcul Matriciel Haute

Analyse des sites francais – Juin 2005

Rang - Configuration                      Implantation            #proc.   Rmax     Rpeak    Nmax
                                                                           GFlops   GFlops   (10^3)
77  - HP AlphaServer SC45, 1GHz           CEA                      2560     3980     5120     360
238 - HP Cluster P4 Xeon-2.4GHz           Finance                   512     1831     3276
251 - IBM Cluster Xeon 2.4GHz-Gig-E       Total                    1024     1755     4915     335
257 - HP Cluster P4 Xeon-2.4GHz           Caylon                    530     1739     3392
258 - HP Cluster P4 Xeon-2.4GHz           Caylon                    530     1739     3392
266 - IBM Cluster Xeon 2.4GHz-Gig-E       Soc.Gen.                  968     1685     4646
281 - IBM eServer (1.7GHz Power4+)        CNRS-IDRIS                384     1630     2611
359 - SGI Altix 1.5GHz                    CEG Gramat (armement)     256     1409     1536
384 - HP Superdome 875MHz                 FranceTelec.              704     1330     2464
445 - HP Cluster Xeon 3.2 GHz             Soc.Gen.                  320     1228     2048

200/ 627

Page 210: High Performance Matrix Computations/Calcul Matriciel Haute

Repartition geographique

Afrique : 1        Oceanie : 5

Amerique : 295                 Europe : 127
  Bresil        2                Allemagne    24
  Canada       10                France       13
  Mexique       2                Italie        5
  USA         281                RU           42
                                 Espagne       6
                                 Russie        5

Asie : 72
  Chine        13
  India         8
  Japon        23
  S. Arabia     2

201/ 627

Page 211: High Performance Matrix Computations/Calcul Matriciel Haute

Analyse des plates-formes a usage academique

Amerique : 44                  Europe : 33
  Canada            4            Allemagne      6
  Etats-Unis       39            Belgique       1
  Mexique           1            Espagne        3
                                 Finlande       2
Oceanie : 2                      France         0
  Australie         1            Italie         1
  Nouvelle Zelande  1            Norvege        1
                                 Pays-Bas       2
Asie : 11                        Royaume Uni    7
  Japon             8            Russie         4
  Chine             1            Suede          4
  Taiwan            1            Suisse         1
  Coree du Sud      1
  Turquie           1

202/ 627

Page 212: High Performance Matrix Computations/Calcul Matriciel Haute

Type de processeurs

203/ 627

Page 213: High Performance Matrix Computations/Calcul Matriciel Haute

Evolution de la performance

204/ 627

Page 214: High Performance Matrix Computations/Calcul Matriciel Haute

Exemples d’architecture de supercalculateurs

I Machines de type scalaire
  I MPP IBM SP (NERSC-LBNL, IDRIS (France))
  I CRAY XT3/4 (Oak Ridge National Lab)
  I Cluster DELL (NCSA)
  I Non-Uniform Memory Access (NUMA) computer SGI Altix (Nasa Ames)
  I IBM Blue Gene

I Machines de type vectoriel
  I NEC (Earth Simulator Center, Japon)
  I CRAY X1 (Oak Ridge Nat. Lab.)

I Machine a base de processeur Cell
  I Roadrunner (Los Alamos National Lab (LANL))

205/ 627

Page 215: High Performance Matrix Computations/Calcul Matriciel Haute

MPP IBM SP NERSC-LBNL

(Figure : 416 noeuds de 16 processeurs relies par un reseau ; 12 GBytes de memoire par noeud.)

I 416 noeuds de 16 processeurs
I Processeur a 375 MHz (1.5 GFlop/s)
I Memoire : 4.9 Terabytes
I 6656 processeurs (Rpeak = 9.9 TFlop/s)

Remarque : machine precedente (en 2000) : Cray T3E (696 procs a 900 MFlop/s et 256 MBytes)

Supercalculateur du Lawrence Berkeley National Lab. (installe en 2001)

206/ 627

Page 216: High Performance Matrix Computations/Calcul Matriciel Haute

MPP IBM SP CNRS-IDRIS

(Figure : 12 noeuds de 32 processeurs (+ X noeuds de 4 procs) relies par un reseau ; 128 GBytes de memoire par noeud.)

I Processeur a 1.3 GHz (5.2 GFlop/s)
I Memoire : 1.5 Terabytes
I 384 processeurs (Rpeak = 2.61 TFlop/s)

Supercalculateur de l'IDRIS (installe en 2004)

207/ 627

Page 217: High Performance Matrix Computations/Calcul Matriciel Haute

Cluster DELL ”Abe” (NCSA, Illinois)

I Performance : Rpeak = 94 TFlop/s, Rmax = 62.7 TFlop/s

I Architecture (9600 cores) :
  I 1200 nœuds (bi-Xeon) a 2.33 GHz
  I Chaque Xeon : 4 cœurs
  I 4 flops/cycle/cœur (9.33 GFlop/s)
  I Memoire : 90 TB (1 GB par cœur)
  I Infiniband → applications
  I GigEthernet → systeme + monitoring
  I IO : 170 TB a 7.5 GB/s

208/ 627

Page 218: High Performance Matrix Computations/Calcul Matriciel Haute

Non Uniform Memory Access Computer SGI Altix

I 4.1 TBytes de memoire globalement adressable

(Figure : 128 C-Bricks de 2 noeuds de 2 processeurs, relies par un reseau ; 16 GB de memoire par noeud.)

I 128 C-Bricks de 2 noeuds de 2 procs
I 1.5 GHz Itanium 2 (6 GFlop/s/proc)
I Memoire : 4.1 Terabytes
I 512 processeurs (Rpeak = 3.1 TFlop/s)

Remarque : NUMA et latence — noeud (145 ns) ; C-Brick (290 ns) ; entre C-Bricks (+ 150 a 400 ns)

Supercalculateur SGI Altix (installe a NASA-Ames en 2004).
2007 : #10 = Altix, 63 TFlop/s, 9728 cœurs, 39 TB, Allemagne

209/ 627

Page 219: High Performance Matrix Computations/Calcul Matriciel Haute

NEC Earth Simulator Center (characteristiques)

I 640 NEC/SX6 nodes

I 5120 CPU (8 GFlop/s) → 40 TFlop/s

I 2 $ Billions, 7 MWatts.

210/ 627

Page 220: High Performance Matrix Computations/Calcul Matriciel Haute

NEC Earth Simulator Center (architecture)

(Figure : 640 noeuds relies par un crossbar complet ; chaque noeud contient 8 processeurs arithmetiques (unite vectorielle + unite scalaire avec registres et cache) et 16 GBytes de memoire partagee.)

I 640 noeuds (8 Arith. Proc.) → 40 TFlop/s (Rpeak → 16 flops // par AP)
I Vector unit (500 MHz) : 8 ensembles de pipes (8 × 2 × 0.5 = 8 GFlop/s)
I Memoire totale : 10 TBytes

Supercalculateur NEC (installe a Tokyo en 2002)

211/ 627

Page 221: High Performance Matrix Computations/Calcul Matriciel Haute

Cray X1 d’Oak Ridge National Lab.

I Performance : 6.4 TFlop/s, 2 Terabytes, Rmax = 5.9 TFlop/s

I Architecture : 504 Multi Stream Processors (MSP) :
  I 126 noeuds
  I Chaque noeud a 4 MSP et 16 GBytes de memoire "flat".
  I Chaque MSP a 4 Single Stream Processors (SSP)
  I Chaque SSP a une unite vectorielle et une unite superscalaire, total 3.2 GFlop/s.

212/ 627

Page 222: High Performance Matrix Computations/Calcul Matriciel Haute

Cray X1 node

213/ 627

Page 223: High Performance Matrix Computations/Calcul Matriciel Haute

Blue Gene L (65536 dual-procs, 360 TFlops peak)

I Systeme d’exploitationminimal (non threade)

I Consommation limitee:I 32 TB mais seulement

512 MB de memoire parnoeud !

I un noeud = 2 PowerPC a700 MHz (2x2.8 GFlop/s)

I 2.8 GFlop/s ou 5.6GFlop/s crete par noeud

I Plusieurs reseaux rapidesavec redondances

214/ 627

Page 224: High Performance Matrix Computations/Calcul Matriciel Haute

Blue gene: efficace aussi en Mflops/watt

215/ 627

Page 225: High Performance Matrix Computations/Calcul Matriciel Haute

Clusters a base de processeurs Cell

I rack QS20 = 2 processeurs Cell (512 MB / processeur)

I racks connectes entre eux par switchs GigEthernet

I Chaque Cell=205 GFlop/s (32 bits)

I Installation au CINES (Montpellier) :
  I 2 racks IBM QS20
  I performance crete : 820 GFlop/s
  I memoire : seulement 2 GB !

I reste tres experimental et difficile a programmer

216/ 627

Page 226: High Performance Matrix Computations/Calcul Matriciel Haute

Pour entrer dans l'ere du Petascale : Roadrunner

I Los Alamos National Lab et IBM

I 18 clusters de 170 noeuds de calcul

I Par noeud : 2 AMD Opteron dual-core et 4 proc. IBM PowerXCell 8i (machine complete : 12240 PowerXCell)

I Performance IBM PowerXCell 8i : 110 GFlop/s (flottant 64 bits)

I 122400 cores et 98 Terabytes

I Rmax=1026 Teraflops; Rpeak 1376 Teraflops; 2.3 MWatts

217/ 627

Page 227: High Performance Matrix Computations/Calcul Matriciel Haute

Roadrunner (suite)

I Difference Cell BroadBand Engine (CBE) et IBM PowerXCell 8i
  I Amelioration significative de la performance des calculs 64 bits (100 GFlop/s / 15 GFlop/s)
  I Memoire plus rapide

I Programmation du Roadrunner
  I 3 compilateurs : jeux d'instructions Opteron, PowerPC et Cell SPE
  I Gestion explicite des donnees et programmes entre Opteron, PowerPC et Cell.

218/ 627

Page 228: High Performance Matrix Computations/Calcul Matriciel Haute

Machines auxquelles on a acces depuis le LIP

I Calculateurs des centres nationaux (pas dans le top 500)
  I IDRIS : 1024 processeurs Power4 IBM, 3 noeuds NEC SX8
  I CINES : 9 noeuds de 32 Power4 IBM, SGI Origin 3800 (768 processeurs), . . .

I Calculateurs regionaux/locaux :
  I icluster2 a Grenoble : 100 bi-processeurs Itanium (en cours de renouvellement)
  I clusters de la federation lyonnaise de calcul haute performance
  I Grid 5000 (noeud de Lyon : 127 bi-processeurs Opteron, 1 core/proc)

219/ 627

Page 229: High Performance Matrix Computations/Calcul Matriciel Haute

Programmes nationaux d’equipement

USA : Advanced Simulation and Computing Program (anciennement Accelerated Strategic Computing Initiative)

I http://www.nnsa.doe.gov/asc
I Debut du projet : 1995, DOE (Dept. of Energy)
I Objectifs : 1 PetaFlop/s

France : le projet Grid 5000 (en plus des centres de calcul CNRS : IDRIS et CINES)

I http://www.grid5000.org
I Debut du projet : 2004 (Ministere de la Recherche)
I Objectifs : reseau de 5000 machines sur 8 sites repartis (Bordeaux, Grenoble, Lille, Lyon, Nice, Rennes, Toulouse)

220/ 627

Page 230: High Performance Matrix Computations/Calcul Matriciel Haute

Previsions

I BlueGeneL et ses successeurs: ≈ 3 PFlop/s en 2010

I Projet japonais (10 PFlop/s en 2011).

I Juin 2008 : architectures a base de noeuds hybrides incluant des processeurs vectoriels/Cell

221/ 627

Page 231: High Performance Matrix Computations/Calcul Matriciel Haute

Calculateurs haute-performance: concepts generaux
  Introduction
  Organisation des processeurs
  Organisation memoire
  Organisation interne et performance des processeurs vectoriels
  Organisation des processeurs RISC
  Reutilisation des donnees (dans les registres)
  Memoire cache
  Reutilisation des donnees (dans les caches)
  Memoire virtuelle
  Reutilisation des donnees (en memoire)
  Interconnexion des processeurs
  Les supercalculateurs du top 500 en Juin 2007
  Conclusion

222/ 627

Page 232: High Performance Matrix Computations/Calcul Matriciel Haute

Conclusion

I Performance :
  I Horloge rapide
  I Parallelisme interne au processeur
    I Traitement pipeline
    I Recouvrement, chainage des unites fonctionnelles
  I Parallelisme entre processeurs

I Mais :
  I Acces aux donnees :
    I Organisation memoire
    I Communications entre processeurs
  I Complexite du hardware
  I Techniques de compilation : pipeline / vectorisation / parallelisation

Comment exploiter efficacement l’architecture ?

223/ 627

Page 233: High Performance Matrix Computations/Calcul Matriciel Haute

Ecriture de code efficace (I) : MFLOPS ou MIPS ?

I MFLOPS : floating point operations / sec. Ne depend pas du calculateur.

I MIPS : instructions de bas niveau. Depend du calculateur.

I Watt : code efficace sur des machines a faible consommation en Watt par proc. (exemple des proc. Cell).

I Precision des calculs : travail partiel en precision numerique affaiblie (plus efficace).

224/ 627

Page 234: High Performance Matrix Computations/Calcul Matriciel Haute

Ecriture de code efficace (II)

I Facteurs architecturaux influencant la performance :
  I debit et latence memoire
  I couts des communications et de synchronisation
  I temps d'amorcage des unites vectorielles
  I besoins en entrees/sorties

I Facteurs dependant de l'application :
  I parallelisme (depend des algorithmes retenus)
  I regularite des traitements
  I equilibrage des traitements
  I volume de communications (localite)
  I granularite - scalabilite

I Localite des donnees (spatiale et temporelle) : encore plus critique sur les architectures Cell et GPU (Graphical Processing Unit)

225/ 627

Page 235: High Performance Matrix Computations/Calcul Matriciel Haute

Notion de calcul potentiellement efficace

I Proposition : soient x et y des vecteurs et A, B, C des matrices d'ordre n ; le noyau de calcul (1) x = x + αy est potentiellement moins efficace que le noyau (2) y = A × x + y, qui est potentiellement moins efficace que le noyau (3) C = C + A × B

I Exercice : justifier la proposition precedente.

226/ 627

Page 236: High Performance Matrix Computations/Calcul Matriciel Haute

I La mesure du rapport entre le nombre d'operations flottantes et le nombre de references memoire pour chacun des noyaux de calcul explique ce potentiel.

I x = x + αy
  I 3n references memoire
  I 2n operations flottantes
  I rapport Flops/Ref = 2/3

I y = A × x + y
  I n^2 references memoire
  I 2n^2 operations flottantes
  I rapport Flops/Ref = 2

I C = C + A × B
  I 4n^2 references memoire
  I 2n^3 operations flottantes
  I rapport Flops/Ref = n/2

I Typiquement vitesse(3) = 5 × vitesse(2) et vitesse(2) = 3 × vitesse(1) . . . si on utilise des bibliotheques optimisees !
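A titre d'illustration (esquisse ne figurant pas dans les transparents), on peut chronometrer naivement les trois noyaux en Fortran 90 ; les tailles, la mesure par cpu_time et les boucles triples sont des choix arbitraires, une BLAS optimisee ferait bien mieux :

  ! Esquisse : comparaison grossiere des trois noyaux (1), (2), (3)
  program ratio_flops_refs
    implicit none
    integer, parameter :: n = 400
    double precision :: x(n), y(n), A(n,n), B(n,n), C(n,n)
    double precision :: alpha, t0, t1, t2, t3
    integer :: i, j, k
    alpha = 2.0d0
    call random_number(x); call random_number(y)
    call random_number(A); call random_number(B); call random_number(C)
    call cpu_time(t0)
    x = x + alpha*y                        ! noyau (1) : 2n flops, 3n refs
    call cpu_time(t1)
    y = matmul(A, x) + y                   ! noyau (2) : 2n^2 flops, ~n^2 refs
    call cpu_time(t2)
    do j = 1, n                            ! noyau (3) : 2n^3 flops, ~4n^2 refs
       do k = 1, n
          do i = 1, n
             C(i,j) = C(i,j) + A(i,k)*B(k,j)
          end do
       end do
    end do
    call cpu_time(t3)
    print *, 'axpy :', t1-t0, '  gemv :', t2-t1, '  gemm :', t3-t2
  end program ratio_flops_refs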

227/ 627

Page 237: High Performance Matrix Computations/Calcul Matriciel Haute

Limites de l'optimisation de code et de la vectorisation/parallelisation automatiques

C ← α× A× B + βC (DGEMM du BLAS)

      DO 40 j = 1, N
         ...
         DO 30 l = 1, K
            IF ( B( l, j ).NE.ZERO ) THEN
               TEMP = ALPHA * B( l, j )
               DO 20 i = 1, M
                  C( i, j ) = C( i, j ) + TEMP * A( i, l )
   20          CONTINUE
            END IF
   30    CONTINUE
   40 CONTINUE

La plupart des compilateurs parallelisent la boucle d'indice j et optimisent / vectorisent la boucle d'indice i

228/ 627

Page 238: High Performance Matrix Computations/Calcul Matriciel Haute

Table: Performance (MFlops) de versions differentes de GEMM sur processeurs RISC avec des matrices 128 × 128.

  Calculateur          standard   optimise   perf. de crete
  DEC 3000/300 AXP        23.1       48.4        150.0
  HP 715/64               16.9       38.4        128.0
  IBM RS6000/750          25.2       96.1        125.0
  Pentium 4              113        975         3600

I Plupart des optimisations realisees par les compilateurs sur la boucle interne

I En theorie, tres bon potentiel grace au rapport entre operations flottantes et references memoire (4n^2 references memoire, 2n^3 operations flottantes), i.e. un rapport n/2, mais les compilateurs ne savent pas l'exploiter !!

229/ 627

Page 239: High Performance Matrix Computations/Calcul Matriciel Haute

I Optimisation de code :
  I Ameliorer l'acces aux donnees et exploiter la localite spatiale et temporelle des references memoire
  I Deroulage de boucles : reduit le nombre d'acces memoire en ameliorant la reutilisation des registres, permet aussi une meilleure exploitation du parallelisme interne aux processeurs
  I Blocage pour une utilisation efficace du cache : ameliore la localite spatiale et temporelle (voir l'esquisse de produit par blocs apres cette liste)
  I Copie des donnees dans des tableaux de travail pour forcer la localite et eviter des "strides" critiques (pas toujours possible car parfois trop couteux)
  I "prefetch" des donnees
  I Utilisation de l'assembleur (cas desespere !!)
  I Utilisation de bibliotheques optimisees (cas ideal !!)
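Esquisse (hors cours) d'un produit matrice-matrice par blocs en Fortran 90 ; la taille de bloc nb est un parametre arbitraire a regler selon la taille du cache :

  program gemm_bloque
    implicit none
    integer, parameter :: n = 256, nb = 32
    double precision :: A(n,n), B(n,n), C(n,n)
    integer :: i, j, k, ii, jj, kk
    call random_number(A); call random_number(B)
    C = 0.0d0
    do jj = 1, n, nb              ! blocs de colonnes de C
       do kk = 1, n, nb           ! blocs de la dimension commune
          do ii = 1, n, nb        ! blocs de lignes de C
             do j = jj, min(jj+nb-1, n)
                do k = kk, min(kk+nb-1, n)
                   do i = ii, min(ii+nb-1, n)
                      C(i,j) = C(i,j) + A(i,k)*B(k,j)
                   end do
                end do
             end do
          end do
       end do
    end do
    print *, 'verification :', maxval(abs(C - matmul(A,B)))
  end program gemm_bloque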

230/ 627

Page 240: High Performance Matrix Computations/Calcul Matriciel Haute

Utilisation d'une bibliotheque optimisee

I Des noyaux de calcul matrice x matrice optimises existent :
  I ATLAS - Automatically Tuned Linear Algebra Software.
    http://netlib.enseeiht.fr/atlas/
  I GotoBLAS, from Univ. Texas at Austin
    http://www.cs.utexas.edu/users/flame/goto/

Figure: Comparaison de la performance de noyaux de calcul en algebre lineaire (BLAS) (J. Dongarra)

Page 241: High Performance Matrix Computations/Calcul Matriciel Haute

Outline

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

232/ 627

Page 242: High Performance Matrix Computations/Calcul Matriciel Haute

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

233/ 627

Page 243: High Performance Matrix Computations/Calcul Matriciel Haute

Linear algebra

I Linear algebra: branch of mathematics that deals with solutions of systems of linear equations and the related geometric notions of vector spaces and linear transformations.

I "linear" comes from the fact that the equation

  ax + by = c

  defines a line (in two-dimensional geometry).
I similar to the form of a system of linear equations:

  ai1 x1 + ai2 x2 + . . . + ain xn = bi ,   i = 1, . . . , m

I Linear transformation T from a vector space V to W:
  T(v1 + v2) = T(v1) + T(v2)
  T(α v1) = α T(v1)
I Linear transformations (rotations, projections, . . . ) are often represented by matrices. If

  A = [  0  1 ]          [ x ]
      [ -2  2 ]  ,   v = [ y ] ,
      [  1  0 ]

  then T : v −→ Av is a linear transformation from IR2 to IR3, defined by T(x, y) = (y, −2x + 2y, x).

234/ 627

Page 244: High Performance Matrix Computations/Calcul Matriciel Haute

Use of Linear algebra

Continuous problem → Discretization → Mathematical representation involving vectors and matrices

This leads to problems involving vectors and matrices, in particular:

I systems of linear equations (sparse, dense, symmetric, unsymmetric, well conditioned, . . . )

I least-square problems

I eigenvalue problems

235/ 627

Page 245: High Performance Matrix Computations/Calcul Matriciel Haute

I Resolution de Ax = b
  I A generale carree : factorisation LU avec pivotage
  I A symetrique definie positive : factorisation de Cholesky LLt ou LDLt
  I A symetrique indefinie : factorisation LDLt
  I A rectangulaire m × n avec m ≥ n : factorisation QR

I Problemes aux moindres carres min_x ||Ax − b||2
  I Si rang(A) maximal : factorisation de Cholesky ou QR
  I Sinon QR avec pivotage sur les colonnes ou Singular Value Decomposition (SVD)

I Problemes aux valeurs propres Ax = λx
  I Exemple : determiner les frequences de resonance d'un pont / d'un avion
  I Techniques a base de transformations orthogonales : decomposition de Schur, Hessenberg, reduction a une matrice tri-diagonale

I Problemes generalises :
  I Ax = λBx et AtAx = µ2 BtBx : Schur et SVD generalisees

I Implantation efficace critique

236/ 627

Page 246: High Performance Matrix Computations/Calcul Matriciel Haute

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

237/ 627

Page 247: High Performance Matrix Computations/Calcul Matriciel Haute

System of linear equations ?

Example:

 2 x1 -  1 x2 +  3 x3 =  13
-4 x1 +  6 x2 -  5 x3 = -28
 6 x1 + 13 x2 + 16 x3 =  37

can be written under the form:

Ax = b,

with A = [  2  -1   3 ]        [ x1 ]             [  13 ]
         [ -4   6  -5 ] ,  x = [ x2 ] ,  and  b = [ -28 ]
         [  6  13  16 ]        [ x3 ]             [  37 ]

238/ 627

Page 248: High Performance Matrix Computations/Calcul Matriciel Haute

Gaussian Elimination
Example:

2x1 − x2 + 3x3 = 13 (1)

−4x1 + 6x2 − 5x3 = −28 (2)

6x1 + 13x2 + 16x3 = 37 (3)

With 2 * (1) + (2) → (2) and -3*(1) + (3) → (3) we obtain:

2x1 − x2 + 3x3 = 13 (4)

0x1 + 4x2 + x3 = −2 (5)

0x1 + 16x2 + 7x3 = −2 (6)

Thus x1 is eliminated. With -4*(5) + (6) → (6):

2x1 − x2 + 3x3 = 13

0x1 + 4x2 + x3 = −2

0x1 + 0x2 + 3x3 = 6

The linear system is then solved by backward (x3 → x2 → x1) substitution:
x3 = 6/3 = 2,  x2 = (1/4)(−2 − x3) = −1,  and finally  x1 = (1/2)(13 − 3x3 + x2) = 3

239/ 627

Page 249: High Performance Matrix Computations/Calcul Matriciel Haute

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

240/ 627

Page 250: High Performance Matrix Computations/Calcul Matriciel Haute

LU Factorization

I Find L unit lower triangular and U upper triangular such that: A = L × U

  A = [  2  -1   3 ]   [  1  0  0 ]   [ 2  -1  3 ]
      [ -4   6  -5 ] = [ -2  1  0 ] × [ 0   4  1 ]
      [  6  13  16 ]   [  3  4  1 ]   [ 0   0  3 ]

I Procedure to solve Ax = b
  I A = LU
  I Solve Ly = b (descente / forward elimination)
  I Solve Ux = y (remontee / backward substitution)

Ax = (LU)x = L(Ux) = Ly = b
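As an illustration (a sketch, not from the original slides), the two triangular solves can be written in Fortran 90; it assumes L is stored with its unit diagonal and no pivoting was used, and it is tested on the 3 x 3 example above:

  program descente_remontee
    implicit none
    integer, parameter :: n = 3
    double precision :: L(n,n), U(n,n), b(n), x(n), y(n)
    integer :: i
    L = reshape((/1d0,-2d0,3d0, 0d0,1d0,4d0, 0d0,0d0,1d0/), (/n,n/))
    U = reshape((/2d0,0d0,0d0, -1d0,4d0,0d0, 3d0,1d0,3d0/), (/n,n/))
    b = (/13d0, -28d0, 37d0/)
    do i = 1, n                              ! descente : L y = b
       y(i) = b(i) - dot_product(L(i,1:i-1), y(1:i-1))
    end do
    do i = n, 1, -1                          ! remontee : U x = y
       x(i) = (y(i) - dot_product(U(i,i+1:n), x(i+1:n))) / U(i,i)
    end do
    print *, 'x =', x                        ! attendu : 3, -1, 2
  end program descente_remontee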

241/ 627

Page 251: High Performance Matrix Computations/Calcul Matriciel Haute

From Gaussian Elimination to LU Factorization

A = A^(1), b = b^(1), A^(1) x = b^(1):

  [ a11  a12  a13 ] [ x1 ]   [ b1 ]      (2) ← (2) − (1) × a21/a11
  [ a21  a22  a23 ] [ x2 ] = [ b2 ]      (3) ← (3) − (1) × a31/a11
  [ a31  a32  a33 ] [ x3 ]   [ b3 ]

A^(2) x = b^(2):

  [ a11  a12      a13     ] [ x1 ]   [ b1     ]      b2^(2) = b2 − a21 b1/a11 . . .
  [ 0    a22^(2)  a23^(2) ] [ x2 ] = [ b2^(2) ]      a32^(2) = a32 − a31 a12/a11 . . .
  [ 0    a32^(2)  a33^(2) ] [ x3 ]   [ b3^(2) ]

Finally (3) ← (3) − (2) × a32^(2)/a22^(2) gives A^(3) x = b^(3):

  [ a11  a12      a13     ] [ x1 ]   [ b1     ]
  [ 0    a22^(2)  a23^(2) ] [ x2 ] = [ b2^(2) ]      a33^(3) = a33^(2) − a32^(2) a23^(2)/a22^(2) . . .
  [ 0    0        a33^(3) ] [ x3 ]   [ b3^(3) ]

Typical Gaussian elimination at step k:

  aij^(k+1) = aij^(k) − (aik^(k) / akk^(k)) akj^(k) ,   for i > k
  (and aij^(k+1) = aij^(k) for i ≤ k)

242/ 627

Page 252: High Performance Matrix Computations/Calcul Matriciel Haute

From Gaussian Elimination to LU factorization

  aij^(k+1) = aij^(k) − (aik^(k) / akk^(k)) akj^(k) ,   for i > k
  aij^(k+1) = aij^(k) ,                                 for i ≤ k

I One step of Gaussian elimination can be written A^(k+1) = L^(k) A^(k) (and b^(k+1) = L^(k) b^(k)), where L^(k) is the identity matrix except for column k, which contains −l(k+1,k), . . . , −l(n,k) below the diagonal, with

  lik = aik^(k) / akk^(k)

I After n − 1 steps, A^(n) = U = L^(n−1) . . . L^(1) A, which gives A = LU with

  L = [L^(1)]^(−1) . . . [L^(n−1)]^(−1)

  i.e. L is the unit lower triangular matrix whose entry (i, j), i > j, is simply the multiplier lij.

Page 253: High Performance Matrix Computations/Calcul Matriciel Haute

LU Factorization Algorithm

I Overwrite matrix A: we store aij^(k), k = 2, . . . , n in A(i,j)

I In the end, A = A^(n) = U

  do k=1, n-1
    L(k,k) = 1
    do i=k+1, n
      L(i,k) = A(i,k)/A(k,k)
      do j=k, n        (better than: do j=1,n)
        A(i,j) = A(i,j) - L(i,k) * A(k,j)
      end do
    end do
  end do
  L(n,n) = 1

I Matrix A at each step: (figure: zeros appear below the diagonal, column after column; the upper part holds the computed rows of U)

244/ 627

Page 254: High Performance Matrix Computations/Calcul Matriciel Haute

I Avoid building the zeros under the diagonal
I Before

  L(n,n)=1
  do k=1, n-1
    L(k,k) = 1
    do i=k+1, n
      L(i,k) = A(i,k)/A(k,k)
      do j=k, n
        A(i,j) = A(i,j) - L(i,k) * A(k,j)

I After

  L(n,n)=1
  do k=1, n-1
    L(k,k) = 1
    do i=k+1, n
      L(i,k) = A(i,k)/A(k,k)
      do j=k+1, n
        A(i,j) = A(i,j) - L(i,k) * A(k,j)

245/ 627

Page 255: High Performance Matrix Computations/Calcul Matriciel Haute

I Use the lower triangle of array A to store the L(i,k) multipliers

I Before:

  L(n,n)=1
  do k=1, n-1
    L(k,k) = 1
    do i=k+1, n
      L(i,k) = A(i,k)/A(k,k)
      do j=k+1, n
        A(i,j) = A(i,j) - L(i,k) * A(k,j)

I After (diagonal 1 of L is not stored):

  do k=1, n-1
    do i=k+1, n
      A(i,k) = A(i,k)/A(k,k)
      do j=k+1, n
        A(i,j) = A(i,j) - A(i,k) * A(k,j)
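A runnable sketch (not from the slides) of this in-place version, assuming no pivoting and nonzero pivots, tested on the 3 x 3 matrix used in the examples:

  program lu_en_place
    implicit none
    integer, parameter :: n = 3
    double precision :: A(n,n)
    integer :: i, j, k
    A = reshape((/2d0,-4d0,6d0, -1d0,6d0,13d0, 3d0,-5d0,16d0/), (/n,n/))
    do k = 1, n-1
       do i = k+1, n
          A(i,k) = A(i,k) / A(k,k)             ! multiplicateur l(i,k)
          do j = k+1, n
             A(i,j) = A(i,j) - A(i,k) * A(k,j) ! mise a jour du bloc restant
          end do
       end do
    end do
    do i = 1, n
       print '(3f8.3)', (A(i,j), j = 1, n)     ! L (strictement sous la diagonale) et U
    end do
  end program lu_en_place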

246/ 627

Page 256: High Performance Matrix Computations/Calcul Matriciel Haute

I More compact array syntax (Matlab, scilab, Fortran 90):

  do k=1, n-1
    A(k+1:n,k) = A(k+1:n,k) / A(k,k)
    A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k) * A(k,k+1:n)
  end do

I corresponds to a rank-1 update: (figure: at step k, the computed rows of U and the L multipliers are already in place; the trailing submatrix A(k+1:n,k+1:n) receives the rank-1 update A(k+1:n,k) * A(k,k+1:n))

247/ 627

Page 257: High Performance Matrix Computations/Calcul Matriciel Haute

What we have computed

I we have stored the L and U factors in A:
  I A(i,j), i > j corresponds to lij
  I A(i,j), i ≤ j corresponds to uij
I and we had lii = 1, i = 1, n
I Finally, (figure: the upper triangle of A holds U, the strict lower triangle holds L)

  A = L + U − I

248/ 627

Page 258: High Performance Matrix Computations/Calcul Matriciel Haute

Nombre d’operations flottantes (flops)

I Dans la descente Ly = b, calcul de la k-eme inconnue :

  yk = bk − Σ_{j=1}^{k−1} Lkj yj

  Soit (k−1) multiplications et (k−1) additions, pour k de 1 a n.
  Donc n^2 − n flops au total.

I Idem pour la remontee Ux = y
I Nombre de flops dans la factorisation de Gauss :
  I n − k divisions
  I (n − k)^2 multiplications, (n − k)^2 additions
  I k = 1, 2, . . . , n − 1
  I total : ≈ 2n^3/3

249/ 627

Page 259: High Performance Matrix Computations/Calcul Matriciel Haute

Exercise

  do k=1, n-1
    A(k+1:n,k) = A(k+1:n,k) / A(k,k)
    A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k) * A(k,k+1:n)
  end do

Compute the LU factorization of A = [  2  -1   3 ]
                                    [ -4   6  -5 ]
                                    [  6  13  16 ]

Answer: A = [  1  0  0 ]   [ 2  -1  3 ]
            [ -2  1  0 ] × [ 0   4  1 ]
            [  3  4  1 ]   [ 0   0  3 ]

250/ 627


Page 261: High Performance Matrix Computations/Calcul Matriciel Haute

Remark

I Assume that a decomposition A = LU exists with
  I L = (lij), i,j = 1 . . . n, lower triangular with unit diagonal
  I U = (uij), i,j = 1 . . . n, upper triangular
I Computing the LU product, we have:

  aij = Σ_{k=1}^{i−1} lik ukj + uij         for i ≤ j
  aij = Σ_{k=1}^{j−1} lik ukj + lij ujj     for i > j

I Renaming i → K in the 1st equation and j → K in the 2nd:

  uKj = aKj − Σ_{k=1}^{K−1} lKk ukj                    for j ∈ {K, . . . , N}
  liK = (1/uKK) (aiK − Σ_{k=1}^{K−1} lik ukK)          for i ∈ {K+1, . . . , N}

I Explicit computation of uKj and liK for K = 1 to n
I Finally, the same computations are performed, but in a different order (called left-looking); a sketch is given below.
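A sketch (not from the slides) of this left-looking ordering, assuming no pivoting and nonzero pivots, verified on the usual 3 x 3 example:

  program lu_doolittle
    implicit none
    integer, parameter :: n = 3
    double precision :: A(n,n), L(n,n), U(n,n)
    integer :: i, j, k
    A = reshape((/2d0,-4d0,6d0, -1d0,6d0,13d0, 3d0,-5d0,16d0/), (/n,n/))
    L = 0d0; U = 0d0
    do k = 1, n
       L(k,k) = 1d0
       do j = k, n                 ! ligne k de U
          U(k,j) = A(k,j) - dot_product(L(k,1:k-1), U(1:k-1,j))
       end do
       do i = k+1, n               ! colonne k de L
          L(i,k) = (A(i,k) - dot_product(L(i,1:k-1), U(1:k-1,k))) / U(k,k)
       end do
    end do
    print *, 'residu max L*U - A :', maxval(abs(matmul(L,U) - A))
  end program lu_doolittle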

251/ 627

Page 262: High Performance Matrix Computations/Calcul Matriciel Haute

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

252/ 627

Page 263: High Performance Matrix Computations/Calcul Matriciel Haute

Vector Norms

Definition

A vector norm is a function f : IRn −→ IR such that

  f(x) ≥ 0,  x ∈ IRn;   f(x) = 0 ⇔ x = 0
  f(x + y) ≤ f(x) + f(y),  x, y ∈ IRn
  f(αx) = |α| f(x),  α ∈ IR, x ∈ IRn

p-norm: ‖x‖p = (|x1|^p + |x2|^p + . . . + |xn|^p)^{1/p}

The most important p-norms are the 1, 2, and ∞ norms:

  ‖x‖1 = |x1| + |x2| + . . . + |xn|
  ‖x‖2 = (|x1|^2 + |x2|^2 + . . . + |xn|^2)^{1/2} = (x^T x)^{1/2}
  ‖x‖∞ = max_{1≤i≤n} |xi|

253/ 627

Page 264: High Performance Matrix Computations/Calcul Matriciel Haute

Vector Norms – Some properties

I Cauchy-Schwarz inequality: |x^T y| ≤ ‖x‖2 ‖y‖2
  (proof based on 0 ≤ ‖x − λy‖2^2 with λ = x^T y / ‖y‖2^2)

I All norms on IRn are equivalent:
  ∀ ‖.‖α and ‖.‖β, ∃ c1, c2 s.t. c1 ‖x‖α ≤ ‖x‖β ≤ c2 ‖x‖α

I In particular:

  ‖x‖2 ≤ ‖x‖1 ≤ √n ‖x‖2
  ‖x‖∞ ≤ ‖x‖2 ≤ √n ‖x‖∞
  ‖x‖∞ ≤ ‖x‖1 ≤ n ‖x‖∞
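As a small numerical illustration (not from the slides), one can check these equivalence inequalities on a random vector:

  program normes_vecteur
    implicit none
    integer, parameter :: n = 10
    double precision :: x(n), n1, n2, ninf
    call random_number(x)
    x = x - 0.5d0                      ! composantes positives et negatives
    n1   = sum(abs(x))                 ! norme 1
    n2   = sqrt(dot_product(x, x))     ! norme 2
    ninf = maxval(abs(x))              ! norme infini
    print *, '||x||1, ||x||2, ||x||inf :', n1, n2, ninf
    print *, 'n2 <= n1 <= sqrt(n)*n2 ?', n2 <= n1 .and. n1 <= sqrt(dble(n))*n2
    print *, 'ninf <= n2 <= sqrt(n)*ninf ?', ninf <= n2 .and. n2 <= sqrt(dble(n))*ninf
  end program normes_vecteur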

254/ 627

Page 265: High Performance Matrix Computations/Calcul Matriciel Haute

Matrix Norms

I As for vector norms:

  f(A) ≥ 0, A ∈ IRm×n;   f(A) = 0 ⇔ A = 0
  f(A + B) ≤ f(A) + f(B),  A, B ∈ IRm×n
  f(αA) = |α| f(A),  α ∈ IR, A ∈ IRm×n

I Most matrix norms satisfy
  I ‖AB‖ ≤ ‖A‖ × ‖B‖

I Norms induced by the p-norms on vectors:

  ‖A‖p = max_{x ≠ 0} ‖Ax‖p / ‖x‖p = max_{‖x‖p = 1} ‖Ax‖p
  ‖A‖1 = max_{1≤j≤n} Σ_{i=1}^{m} |aij|
  ‖A‖∞ = max_{1≤i≤m} Σ_{j=1}^{n} |aij|
  ‖A‖p ≥ ρ(A) = max_{1≤i≤n} |λi|   (A square)

I Frobenius norm:

  ‖A‖F = ( Σ_{i=1}^{m} Σ_{j=1}^{n} |aij|^2 )^{1/2} ,   ‖A‖F^2 = Σ_i σi^2 = trace(A^T A)

255/ 627

Page 266: High Performance Matrix Computations/Calcul Matriciel Haute

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

256/ 627

Page 267: High Performance Matrix Computations/Calcul Matriciel Haute

I Considerons le systeme lineaire :

  [ .780  .563 ] × [x] = [ .217 ]
  [ .913  .659 ]         [ .254 ]

I Supposons que l'on obtienne les resultats suivants par 2 methodes differentes :

  x1 = [  0.314 ]    et    x2 = [  0.999 ]
       [ -0.87  ]              [ -1.00  ]

I Quelle solution est la meilleure ?
I Residus :

  b − Ax1 = [ .0000001 ]    et    b − Ax2 = [ .001343 ]
            [ 0        ]                   [ .001572 ]

I x1 est la meilleure solution car elle possede le plus petit residu
I Solution exacte :

  x* = [  1 ]
       [ -1 ]

I En realite, x2 est plus precis.

Notion de bonne solution : ambigue

257/ 627

Page 268: High Performance Matrix Computations/Calcul Matriciel Haute

Sensibilite des problemes

I Soit A = [ .780  .563 ]    matrice presque singuliere
           [ .913  .659 ]

I Soit A' = [ .780  .563001095 ]    matrice singuliere
            [ .913  .659       ]

  → une perturbation des donnees en O(10^−6) rend le probleme insoluble

I Autre probleme si A est proche d'une matrice singuliere : un petit changement sur A et/ou b → perturbations importantes sur la solution

Cela n'est pas lie a l'algorithme de resolution utilise

258/ 627

Page 269: High Performance Matrix Computations/Calcul Matriciel Haute

Representation des reels en machine

I Reels codes en machine avec nombre fini de chiffres

I Representation d'un reel flottant normalise :

  x = (−1)^s m × 2^e

I Plupart des calculateurs : base = 2 (norme IEEE), mais aussi 8 (octal), 16 (IBM) ou 10 (calculettes)

I macheps : precision machine, i.e. plus petit reel positif tel que 1 + macheps > 1

I La norme IEEE definit :
  I le format des nombres
  I les modes d'arrondis possibles
  I le traitement des exceptions (overflow, division par zero, . . . )
  I les procedures de conversion (en decimal, . . . )
  I l'arithmetique

259/ 627

Page 270: High Performance Matrix Computations/Calcul Matriciel Haute

I Simple precision IEEE :

   31 | 30        23 | 22                0
  ----------------------------------------
    s |   exposant   |      mantisse

  Exposant code sur 8 bits, mantisse 23 bits plus 1 implicite.

I Double precision IEEE :

   63 | 62        52 | 51                0
  ----------------------------------------
    s |   exposant   |      mantisse

  Exposant sur 11 bits, mantisse 52 bits plus 1 implicite.

I Simple precision :
  I macheps ≈ 1.2 × 10^−7
  I xmin ≈ 1.2 × 10^−38
  I xmax ≈ 3.4 × 10^38

I Double precision :
  I macheps ≈ 2.2 × 10^−16
  I xmin ≈ 2.2 × 10^−308
  I xmax ≈ 1.8 × 10^308
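A titre d'illustration (hors transparents), les fonctions intrinseques Fortran donnent directement ces parametres sur la machine utilisee (les valeurs attendues sont proches de celles ci-dessus) :

  program parametres_ieee
    implicit none
    real :: s
    double precision :: d
    print *, 'simple : macheps =', epsilon(s), ' xmin =', tiny(s), ' xmax =', huge(s)
    print *, 'double : macheps =', epsilon(d), ' xmin =', tiny(d), ' xmax =', huge(d)
  end program parametres_ieee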

260/ 627

Page 271: High Performance Matrix Computations/Calcul Matriciel Haute

Nombres speciaux

I ±∞ : signe, mantisse = 0, exposant max
I NaN : signe, mantisse ≠ 0, exposant max
I ±0 : signe, mantisse = 0, exposant min
I Nombres denormalises : signe, mantisse ≠ 0, exposant min

Remarques

I 0/0, √−1 → NaN
I 1/(−0) → −∞
I NaN op x → NaN
I Exceptions : overflow, underflow, divide by zero, invalid (NaN)
I Possibilite d'arret avec un message d'erreur ou bien poursuite des calculs

261/ 627

Page 272: High Performance Matrix Computations/Calcul Matriciel Haute

Analyse d’erreur en arithmetique flottante

I Avec la norme IEEE (modele pour le calcul a precision finie) :
  fl(x op y) = (x op y)(1 + ε) avec |ε| ≤ u

I fl(x) : x represente en arithmetique flottante
I op = +, −, ×, /
I u = macheps : precision machine

I Exemple :

  fl(x1 + x2 + x3) = fl((x1 + x2) + x3)
                   = ((x1 + x2)(1 + ε1) + x3)(1 + ε2)
                   = x1(1 + ε1)(1 + ε2) + x2(1 + ε1)(1 + ε2) + x3(1 + ε2)
                   = x1(1 + e1) + x2(1 + e2) + x3(1 + e3)

  avec chaque |ei| < 2 macheps.
I Somme exacte de valeurs modifiees xi(1 + ei), avec |ei| < 2u
I Analyse d'erreur inverse : un algorithme est dit backward stable s'il donne la solution exacte pour des donnees legerement modifiees (ici xi(1 + ei)).

262/ 627

Page 273: High Performance Matrix Computations/Calcul Matriciel Haute

Analyse d’erreur inverse

I solution approchee = solution exacte d'un probleme modifie
I quelle taille d'erreur sur les donnees peut expliquer l'erreur sur la solution ?
I solution approchee OK si c'est la solution exacte d'un probleme avec des donnees proches

(figure : erreur directe entre y = F(x) et y' = F(x'), erreur inverse entre x et x')

Conditionnement

I Pb bien conditionne : ‖x − x'‖ petit ⇒ ‖f(x) − f(x')‖ petit
I Sinon : probleme sensible ou mal conditionne
I Sensibilite ou conditionnement : changement relatif de la solution / changement relatif des donnees

  = | (f(x') − f(x)) / f(x) |  /  | (x' − x) / x |

263/ 627

Page 274: High Performance Matrix Computations/Calcul Matriciel Haute

Erreur sur la resolution de Ax = b

I Representation de A (et b) en machine inexacte : resolution d'un probleme perturbe

  (A + E) x̂ = b + f

  avec E = (eij), |eij| ≤ u × |aij| et |fi| ≤ u × |bi|.
  x̂ : meilleure solution accessible

I A quel point x̂ est-il proche de x ?

I Si un algorithme calcule xalg et si ‖x − xalg‖/‖x‖ est grand, deux raisons sont possibles :
  I le probleme mathematique est tres sensible aux perturbations (et alors, ‖x̂ − x‖ pourra etre grand aussi)
  I l'algorithme se comporte mal en precision finie

I L'analyse des erreurs inverses permet de discriminer ces deux cas (Wilkinson, 1963)

264/ 627

Page 275: High Performance Matrix Computations/Calcul Matriciel Haute

Notion de conditionnement d’un systeme lineaire

A  --F-->  x  t.q. Ax = b
A + ∆A  --F-->  x + ∆x  t.q. (A + ∆A)(x + ∆x) = b

Alors

  ‖∆x‖ / ‖x‖ ≤ K(A) ‖∆A‖ / ‖A‖    avec K(A) = ‖A‖ ‖A^−1‖.

I K(A) est le conditionnement de l'application F.
I Si ‖∆A‖ ≈ macheps ‖A‖ (precision machine) alors erreur relative ≈ K(A) × macheps
  (A singuliere : κ(A) = +∞)

265/ 627

Page 276: High Performance Matrix Computations/Calcul Matriciel Haute

Backward error of an algorithm

I Let x be the computed solution. We have:

  err = min { ε > 0 such that ‖∆A‖ ≤ ε‖A‖, ‖∆b‖ ≤ ε‖b‖, (A + ∆A)x = b + ∆b }
      = ‖Ax − b‖ / (‖A‖ ‖x‖ + ‖b‖).

I Proof:
  I   (A + ∆A)x = b + ∆b
    ⇒ b − Ax = ∆A x − ∆b
    ⇒ ‖b − Ax‖ ≤ ‖∆A‖ ‖x‖ + ‖∆b‖
    ⇒ ‖r‖ ≤ ε (‖A‖ ‖x‖ + ‖b‖)
    ⇒ ‖r‖ / (‖A‖ ‖x‖ + ‖b‖) ≤ min ε = err

266/ 627

Page 277: High Performance Matrix Computations/Calcul Matriciel Haute

Backward error of an algorithm

I Let x be the computed solution. We have:

  err = min { ε > 0 such that ‖∆A‖ ≤ ε‖A‖, ‖∆b‖ ≤ ε‖b‖, (A + ∆A)x = b + ∆b }
      = ‖Ax − b‖ / (‖A‖ ‖x‖ + ‖b‖).

I Proof:
  I The bound is attained for

    ∆Amin = ( ‖A‖ / ( ‖x‖ (‖A‖ ‖x‖ + ‖b‖) ) ) r x^T    and    ∆bmin = − ( ‖b‖ / (‖A‖ ‖x‖ + ‖b‖) ) r .

    We have ∆Amin x − ∆bmin = r, with

    ‖∆Amin‖ = ‖A‖ ‖r‖ / (‖A‖ ‖x‖ + ‖b‖)    and    ‖∆bmin‖ = ‖b‖ ‖r‖ / (‖A‖ ‖x‖ + ‖b‖).

266/ 627

Page 278: High Performance Matrix Computations/Calcul Matriciel Haute

Backward error of an algorithm

I Let x be the computed solution. We have:

  err = min { ε > 0 such that ‖∆A‖ ≤ ε‖A‖, ‖∆b‖ ≤ ε‖b‖, (A + ∆A)x = b + ∆b }
      = ‖Ax − b‖ / (‖A‖ ‖x‖ + ‖b‖).

I Furthermore, it can be shown that

  Relative forward error ≤ Condition Number × Backward Error

266/ 627

Page 279: High Performance Matrix Computations/Calcul Matriciel Haute

Ce qu’il faut retenir

I Conditionnement (cas general) :

  κ(A, b) = ‖A^−1‖ ( ‖A‖ + ‖b‖/‖x‖ )

  mesure la sensibilite du probleme mathematique

I Erreur inverse d'un algorithme : ‖Ax − b‖ / (‖A‖ ‖x‖ + ‖b‖)
  → mesure la fiabilite de l'algorithme
  → a comparer a la precision machine ou a l'incertitude sur les donnees

I Prediction de l'erreur :
  Erreur directe ≤ conditionnement × erreur inverse

Une esquisse de calcul de ces quantites est donnee ci-dessous.
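Esquisse (hors transparents) calculant, en norme infinie et avec l'inverse 2 x 2 explicite, le conditionnement et l'erreur inverse des deux solutions approchees x1 et x2 de l'exemple [.780 .563 ; .913 .659] vu precedemment :

  program erreur_inverse_cond
    implicit none
    double precision :: A(2,2), Ainv(2,2), b(2), x1(2), x2(2), det, condA
    A  = reshape((/0.780d0, 0.913d0, 0.563d0, 0.659d0/), (/2,2/))
    b  = (/0.217d0, 0.254d0/)
    x1 = (/0.314d0, -0.87d0/)
    x2 = (/0.999d0, -1.00d0/)
    det  = A(1,1)*A(2,2) - A(1,2)*A(2,1)
    Ainv = reshape((/A(2,2), -A(2,1), -A(1,2), A(1,1)/), (/2,2/)) / det
    condA = norme_inf(A) * norme_inf(Ainv)
    print *, 'kappa_inf(A)        =', condA
    print *, 'erreur inverse (x1) =', berr(x1)
    print *, 'erreur inverse (x2) =', berr(x2)
  contains
    double precision function norme_inf(M)
      double precision, intent(in) :: M(2,2)
      norme_inf = maxval(sum(abs(M), dim=2))   ! max des sommes de lignes
    end function norme_inf
    double precision function berr(x)
      double precision, intent(in) :: x(2)
      berr = maxval(abs(matmul(A,x) - b)) / (norme_inf(A)*maxval(abs(x)) + maxval(abs(b)))
    end function berr
  end program erreur_inverse_cond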

267/ 627

Page 280: High Performance Matrix Computations/Calcul Matriciel Haute

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

268/ 627

Page 281: High Performance Matrix Computations/Calcul Matriciel Haute

Soit A = [ ε  1 ]   [ 1    0 ]   [ ε  1       ]
         [ 1  1 ] = [ 1/ε  1 ] × [ 0  1 − 1/ε ]

κ2(A) = O(1) : A est bien conditionnee. Si on resout :

  [ ε  1 ] [ x1 ]   [ 1 + ε ]
  [ 1  1 ] [ x2 ] = [ 2     ]

Solution exacte x* = (1, 1).

269/ 627

Page 282: High Performance Matrix Computations/Calcul Matriciel Haute

En faisant varier ε on a :

     ε         ‖x* − x‖ / ‖x*‖
   10^−3         6 × 10^−6
   10^−6         2 × 10^−11
   10^−9         9 × 10^−8
   10^−12        9 × 10^−5
   10^−15        7 × 10^−2

Table: Precision relative de la solution en fonction de ε.

I Donc meme si A est bien conditionnee, l'elimination de Gauss introduit des erreurs

I Explication : le pivot ε est trop petit

270/ 627

Page 283: High Performance Matrix Computations/Calcul Matriciel Haute

I Solution : echanger les lignes 1 et 2 de A

  [ 1  1 ] [ x1 ]   [ 2     ]
  [ ε  1 ] [ x2 ] = [ 1 + ε ]

  → precision parfaite !

I Pivotage partiel : pivot choisi a chaque etape = plus grand element de la colonne
I Avec pivotage partiel :
  1. PA = LU ou P matrice de permutation
  2. Ly = Pb
  3. Ux = y
I LU avec pivotage : backward stable

  ‖Ax − b‖ / (‖A‖ × ‖x‖) ≈ u              (1)
  ‖x − x*‖ / ‖x*‖ ≈ u × κ(A)              (2)

  1. la LU donne de faibles residus independamment du conditionnement de A
  2. la precision depend du conditionnement :
     si u ≈ 10^−q et κ∞(A) ≈ 10^p, alors x a approximativement (q − p) chiffres corrects

271/ 627

Page 284: High Performance Matrix Computations/Calcul Matriciel Haute

Factorisation LU avec pivotage

  do k = 1 a n-1
    find l such that |A(l,k)| = max |A(j,k)|, j = k a n
    if |A(l,k)| = 0
      exit      // A is (almost) singular
    endif
    if k != l, swap rows k and l in A (and in b)
    A(k+1:n,k) = A(k+1:n,k) / A(k,k)
    A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k)*A(k,k+1:n)
  end do
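Version executable (esquisse hors cours) de cet algorithme, sous les hypotheses suivantes : la permutation est appliquee aussi a b, et le seuil de singularite 1d-12 est arbitraire :

  program lu_pivotage_partiel
    implicit none
    integer, parameter :: n = 3
    double precision :: A(n,n), b(n), tmp_row(n), tmp
    integer :: k, l(1), i, j
    A = reshape((/2d0,-4d0,6d0, -1d0,6d0,13d0, 3d0,-5d0,16d0/), (/n,n/))
    b = (/13d0, -28d0, 37d0/)
    do k = 1, n-1
       l = maxloc(abs(A(k:n,k))) + k - 1           ! indice du plus grand pivot
       if (abs(A(l(1),k)) < 1d-12) stop 'matrice (presque) singuliere'
       if (l(1) /= k) then                         ! echange des lignes k et l
          tmp_row = A(k,:);  A(k,:) = A(l(1),:);  A(l(1),:) = tmp_row
          tmp = b(k);  b(k) = b(l(1));  b(l(1)) = tmp
       end if
       A(k+1:n,k) = A(k+1:n,k) / A(k,k)
       do j = k+1, n
          A(k+1:n,j) = A(k+1:n,j) - A(k+1:n,k) * A(k,j)
       end do
    end do
    do i = 1, n
       print '(3f8.3)', (A(i,j), j = 1, n)
    end do
  end program lu_pivotage_partiel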

272/ 627

Page 285: High Performance Matrix Computations/Calcul Matriciel Haute

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

273/ 627

Page 286: High Performance Matrix Computations/Calcul Matriciel Haute

Systemes bande

      | x x 0 0 0 |
      | x x x 0 0 |
  A = | 0 x x x 0 |      largeur de bande = 3
      | 0 0 x x x |      A tridiagonale
      | 0 0 0 x x |

Exploitation de la structure bande lors de la factorisation : L et U bidiagonales

      | x 0 0 0 0 |          | x x 0 0 0 |
      | x x 0 0 0 |          | 0 x x 0 0 |
  L = | 0 x x 0 0 |      U = | 0 0 x x 0 |
      | 0 0 x x 0 |          | 0 0 0 x x |
      | 0 0 0 x x |          | 0 0 0 0 x |

→ on peut donc reduire le nombre d'operations

274/ 627

Page 287: High Performance Matrix Computations/Calcul Matriciel Haute

Systemes bande

I KL: nombre de sous-diagonales de A

I KU: nombre de sur-diagonales de A

I KL+KU+1: largeur de bande

Question : si p = KL = KU (largeur totale 2p+1), quel est le nombre d'operations de l'algorithme de factorisation LU (sans pivotage) ?

Reponse : (n − p) × (p divisions + p^2 multiplications + p^2 additions) + (2/3)(p − 1)^3
≈ 2np^2 flops (quand n >> p), au lieu de 2n^3/3.

Pivotage partiel ⇒ la largeur de bande augmente !!

I echange des lignes k et i, avec A(i,k) = max(A(j,k), j > k)
I KL' = KL
I KU' = KL + KU

275/ 627


Page 290: High Performance Matrix Computations/Calcul Matriciel Haute

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

276/ 627

Page 291: High Performance Matrix Computations/Calcul Matriciel Haute

Matrices symetriques

I A symetrique : on ne stocke que la partie triangulaire inferieure ou superieure de A
I A = LU, A^t = A ⇔ LU = U^t L^t. Donc U (L^t)^−1 = L^−1 U^t = D diagonale et U = D L^t, soit A = L (D L^t) = L D L^t
I Exemple :

  |  4  -8  -4 |   |  1  0  0 |   | 4  0  0 |   | 1 -2 -1 |
  | -8  18  14 | = | -2  1  0 | * | 0  2  0 | * | 0  1  3 |
  | -4  14  25 |   | -1  3  1 |   | 0  0  3 |   | 0  0  1 |

I Resolution :
  1. A = LDLt
  2. Ly = b
  3. Dz = y
  4. Lt x = z
I LDLt : n^3/3 flops (au lieu de 2n^3/3 pour la LU)
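Esquisse (hors cours) d'une factorisation L D L^t, sous les hypotheses : A symetrique, pas de pivotage, pivots non nuls ; elle est verifiee sur l'exemple 3 x 3 ci-dessus :

  program ldlt_demo
    implicit none
    integer, parameter :: n = 3
    double precision :: A(n,n), L(n,n), D(n)
    integer :: i, j
    A = reshape((/4d0,-8d0,-4d0, -8d0,18d0,14d0, -4d0,14d0,25d0/), (/n,n/))
    L = 0d0
    do j = 1, n
       L(j,j) = 1d0
       D(j) = A(j,j) - sum(D(1:j-1) * L(j,1:j-1)**2)
       do i = j+1, n
          L(i,j) = (A(i,j) - sum(D(1:j-1) * L(i,1:j-1) * L(j,1:j-1))) / D(j)
       end do
    end do
    print *, 'D =', D                               ! attendu : 4, 2, 3
    do i = 1, n
       print '(3f8.3)', (L(i,j), j = 1, n)
    end do
  end program ldlt_demo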

277/ 627

Page 292: High Performance Matrix Computations/Calcul Matriciel Haute

Matrices symetriques et pivotage

I pas de stabilite numerique garantie sur A a priori → pivotage
I maintien de la symetrie → pivotage diagonal, mais insuffisant
I approches possibles : Aasen, Bunch & Kaufman, . . .
I En general on cherche : P A P^t = L D L^t ou
  P : matrice de permutation
  L : triangulaire inferieure
  D : somme de matrices diagonales 1 × 1 et 2 × 2

           | 1 0 0 0 |   | x 0 0 0 |   | 1 0 0 0 |t
           | x 1 0 0 |   | 0 x x 0 |   | x 1 0 0 |
  PAPt  =  | x 0 1 0 | * | 0 x x 0 | * | x 0 1 0 |
           | x x x 1 |   | 0 0 0 x |   | x x x 1 |
                L             D             Lt

I Examples of 2x2 pivots:

  | 0 1 |        | eps1  1    |
  | 1 0 |        | 1     eps2 |

I Determination du pivot plus complexe : 2 colonnes examinees a chaque etape

278/ 627

Page 293: High Performance Matrix Computations/Calcul Matriciel Haute

I Let PAPt = [ E   Ct ]
             [ C   B  ] .   If E is a 2x2 pivot, form E^−1 to get:

  PAPt = [ I        0 ]   [ E   0              ]   [ I   E^−1 Ct ]
         [ C E^−1   I ]   [ 0   B − C E^−1 Ct  ]   [ 0   I       ]

I Possible pivot selection algorithm (Bunch-Parlett):

  µ1 = max_i |aii| ;  µ2 = max_ij |aij|
  if µ1 ≥ α µ2 (for a given α > 0)
     choose the largest 1x1 diagonal pivot; permute s.t. |e11| = µ1
  else
     choose a 2x2 pivot s.t. |e21| = µ2

I Choice of α to minimize the growth factor, i.e., the magnitude of the entries in B − C E^−1 Ct, with E 1x1 or 2x2
I 1x1 pivot (µ1 ≥ α µ2), C has 1 column:
  |B − C (1/µ1) Ct|ij ≤ max_ij |Bij| + max_ij (|ci cj|/µ1) ≤ µ2 + µ2^2/µ1 = µ2 (1 + µ2/µ1) ≤ µ2 (1 + 1/α)
I 2x2 pivot: one can show that the bound is ((3 − α)/(1 − α)) µ2
I Choosing α s.t. (3 − α)/(1 − α) = (1 + 1/α)^2 (two 1x1 pivots) gives α = (1 + √17)/8.
I Unfortunately, the previous algorithm requires between n^3/12 and n^3/6 comparisons, and is too costly.

279/ 627

Page 294: High Performance Matrix Computations/Calcul Matriciel Haute

I More efficient variants exist, also with a good backward error
I Example: Bunch-Kaufman algorithm (1977)
  Determination of the first pivot:

  α ← (1 + √17)/8 ≈ 0.64
  r ← index of the largest element colmax = |ar1| below the diagonal
  if |a11| ≥ α × colmax
     1x1 pivot a11 is ok
  else
     rowmax = |arp| = largest element in row r
     if rowmax × |a11| ≥ α × colmax^2
        1x1 pivot a11 is ok
     elseif |arr| ≥ α × rowmax
        1x1 pivot arr is ok, permute
     else
        2x2 pivot [ a11  ar1 ]
                  [ ar1  arr ]   is chosen;
        interchange rows r and 2
     endif
  endif

280/ 627

Page 295: High Performance Matrix Computations/Calcul Matriciel Haute

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

281/ 627

Page 296: High Performance Matrix Computations/Calcul Matriciel Haute

Factorisation de Cholesky

I A definie positive si x^t A x > 0 ∀ x ≠ 0
I A symetrique definie positive → factorisation de Cholesky
  A = L L^t avec L triangulaire inferieure
I Par identification :

  [ A11  A12 ]   [ L11  0   ]   [ L11  L21 ]
  [ A21  A22 ] = [ L21  L22 ] × [ 0    L22 ]

I De la :

  A11 = L11^2           →  L11 = (A11)^{1/2}           (7)
  A21 = L21 × L11       →  L21 = A21 / L11             (8)
  A22 = L21^2 + L22^2   →  L22 = (A22 − L21^2)^{1/2}   (9)
  . . .                                                (10)

I Pas de pivotage ; Cholesky est backward stable
I Factorisation : ≈ n^3/3 flops

282/ 627

Page 297: High Performance Matrix Computations/Calcul Matriciel Haute

Algorithme de factorisation de type Cholesky

  do k=1, n
    A(k,k) = sqrt(A(k,k))
    A(k+1:n,k) = A(k+1:n,k) / A(k,k)
    do j=k+1, n
      A(j:n,j) = A(j:n,j) - A(j:n,k) * A(j,k)
    end do
  end do

I Schema similaire a la LU, mais on ne met a jour que le triangle inferieur
I Rappel, LU factorization :

  A(k+1:n,k) = A(k+1:n,k) / A(k,k)
  A(k+1:n,k+1:n) = A(k+1:n,k+1:n) - A(k+1:n,k) * A(k,k+1:n)
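Version executable (esquisse hors cours) de l'algorithme ci-dessus, sous l'hypothese que A est symetrique definie positive (seule la partie inferieure est referencee) :

  program cholesky_demo
    implicit none
    integer, parameter :: n = 3
    double precision :: A(n,n), L(n,n)
    integer :: i, j, k
    A = reshape((/4d0,-8d0,-4d0, -8d0,18d0,14d0, -4d0,14d0,25d0/), (/n,n/))
    L = A
    do k = 1, n
       L(k,k) = sqrt(L(k,k))
       L(k+1:n,k) = L(k+1:n,k) / L(k,k)
       do j = k+1, n
          L(j:n,j) = L(j:n,j) - L(j:n,k) * L(j,k)
       end do
    end do
    do j = 2, n                    ! on annule la partie superieure restante
       L(1:j-1,j) = 0d0
    end do
    print *, 'residu max L*L^t - A :', maxval(abs(matmul(L, transpose(L)) - A))
    do i = 1, n
       print '(3f9.4)', (L(i,j), j = 1, n)
    end do
  end program cholesky_demo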

283/ 627

Page 298: High Performance Matrix Computations/Calcul Matriciel Haute

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

284/ 627

Page 299: High Performance Matrix Computations/Calcul Matriciel Haute

Factorisation QR

I Definition d'un ensemble de vecteurs orthonormes x1, . . . , xk :
  I xi^t xj = 0 ∀ i ≠ j
  I xi^t xi = 1
I Matrice orthogonale Q : les vecteurs colonnes de Q sont orthonormes, Q Q^t = I, Q^−1 = Q^t
I Factorisation QR :

  A = Q R    (de maniere equivalente, R = Q^t A)

  I Q orthogonale
  I R triangulaire superieure

285/ 627

Page 300: High Performance Matrix Computations/Calcul Matriciel Haute

Exemple

  [ 1  -8 ]   [ 1/3  -2/3  -2/3 ]   [ 3   6 ]
  [ 2  -1 ] = [ 2/3  -1/3   2/3 ] × [ 0  15 ]
  [ 2  14 ]   [ 2/3   2/3  -1/3 ]   [ 0   0 ]

            = Q × R

286/ 627

Page 301: High Performance Matrix Computations/Calcul Matriciel Haute

Factorisation QR

I Factorisation QR obtenue en general par applications successives de transformations orthogonales sur les donnees :

  Q = Q1 . . . Qn    ou les Qi sont des matrices orthogonales simples telles que Q^t A = R

I Transformations utilisees :
  I Reflexions de Householder
  I Rotations de Givens
  I Procede de Gram-Schmidt (auquel cas Q est de taille m × n et R est de taille n × n)

287/ 627

Page 302: High Performance Matrix Computations/Calcul Matriciel Haute

Reflexions de Householder

H = I − 2 v v^t ou v est un vecteur de IRn tel que ‖v‖2 = 1.
H est orthogonale et symetrique.
Permet en particulier d'annuler tous les elements d'un vecteur sauf une composante.

I Exemple :

  x = [  2 ]        u = x + ‖x‖2 [ 1 ]   [  5 ]              u
      [ -1 ]                     [ 0 ] = [ -1 ]    et    v = ----
      [  2 ]                     [ 0 ]   [  2 ]              ‖u‖2

  Alors :

  H = I − 2 v v^t = (1/15) × [ -10    5  -10 ]
                             [   5   14    2 ]
                             [ -10    2   11 ]

  Donc :

  H × x = [ -3 ]
          [  0 ]
          [  0 ]

288/ 627

Page 303: High Performance Matrix Computations/Calcul Matriciel Haute

Reflexions de Householder

(figure : x et son image Hx, symetriques par rapport a l'hyperplan orthogonal a Vect{u})

289/ 627

Page 304: High Performance Matrix Computations/Calcul Matriciel Haute

Reflexions de Householder

Vecteur de Householder : u = x ± ‖x‖2 e1, puis v = u / ‖u‖2

Permettent d'obtenir des matrices de la forme :

      [ a11  a12  a13 ]
      [ 0    a22  a23 ]
  A = [ 0    a32  a33 ]
      [ 0    a42  a43 ]
      [ 0    a52  a53 ]

Soit H telle que :

      [ a22 ]   [ a'22 ]
  H × [ a32 ] = [ 0    ]
      [ a42 ]   [ 0    ]
      [ a52 ]   [ 0    ]

Si H' = [ 1  0 ]    alors    H' × A = [ a11  a12   a13  ]
        [ 0  H ]                      [ 0    a'22  a'23 ]
                                      [ 0    0     a'33 ]
                                      [ 0    0     a'43 ]
                                      [ 0    0     a'53 ]

290/ 627

Page 305: High Performance Matrix Computations/Calcul Matriciel Haute

I Triangularisation d'une matrice 4 × 3 : Q = H1 × H2 × H3

  | x x x |      | x x x |      | x x x |      | x x x |
  | x x x |  H1  | 0 x x |  H2  | 0 x x |  H3  | 0 x x |
  | x x x |  ->  | 0 x x |  ->  | 0 0 x |  ->  | 0 0 x |  = R
  | x x x |      | 0 x x |      | 0 0 x |      | 0 0 0 |

I QR backward stable, avec une erreur inverse meilleure que LU
I Nombre d'operations ≈ (4/3) n^3 (une esquisse est donnee ci-dessous)
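Esquisse (hors cours) d'une factorisation QR par reflexions de Householder, sous les hypotheses : m ≥ n, signe du vecteur de Householder choisi pour eviter l'annulation ; les signes de Q et R peuvent differer de ceux de l'exemple Gram-Schmidt :

  program qr_householder
    implicit none
    integer, parameter :: m = 3, n = 2
    double precision :: A(m,n), A0(m,n), Q(m,m), v(m), beta
    integer :: i, k
    A = reshape((/1d0,2d0,2d0, -8d0,-1d0,14d0/), (/m,n/))
    A0 = A
    Q = 0d0
    do i = 1, m
       Q(i,i) = 1d0                     ! Q accumulera le produit des H_k
    end do
    do k = 1, n
       v = 0d0
       v(k:m) = A(k:m,k)
       v(k) = v(k) + sign(sqrt(dot_product(v(k:m), v(k:m))), v(k))
       beta = dot_product(v, v)
       if (beta > 0d0) then             ! H = I - 2 v v^t / (v^t v)
          A = A - (2d0/beta) * spread(v, 2, n) * spread(matmul(v, A), 1, m)
          Q = Q - (2d0/beta) * spread(matmul(Q, v), 2, m) * spread(v, 1, m)
       end if
    end do
    do i = 1, m
       print '(2f9.4)', A(i,1), A(i,2)  ! R est dans A (triangulaire sup.)
    end do
    print *, 'residu max Q*R - A :', maxval(abs(matmul(Q, A) - A0))
  end program qr_householder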

291/ 627

Page 306: High Performance Matrix Computations/Calcul Matriciel Haute

Rotations de Givens

I Rotation 2 × 2 :

  G(θ) = [  c  s ]    orthogonale, avec c = cos(θ) et s = sin(θ).
         [ -s  c ]

I Utilisation : x = (x1, x2)

  c = x1 / (x1^2 + x2^2)^{1/2}    et    s = −x2 / (x1^2 + x2^2)^{1/2}

  y = (y1, y2) = G^t x ; alors y2 = 0

I Permet d'annuler certains elements d'une matrice

292/ 627

Page 307: High Performance Matrix Computations/Calcul Matriciel Haute

Rotations de Givens

I Exemple : factorisation QR de

      [ r11  r12  r13 ]
  A = [ 0    r22  r23 ]
      [ 0    0    r33 ]
      [ v1   v2   v3  ]

I Determiner (c, s) tels que :

  [  c  s ]t   [ r11 ]   [ r'11 ]
  [ -s  c ]  × [ v1  ] = [ 0    ]

293/ 627

Page 308: High Performance Matrix Computations/Calcul Matriciel Haute

Rotations de Givens

I Rotation dans le plan (1,4) :

             [  c  0  0  s ]                   [ r'11  r'12  r'13 ]
  G(1, 4) =  [  0  1  0  0 ]    G(1, 4)t × A = [ 0     r22   r23  ]
             [  0  0  1  0 ]                   [ 0     0     r33  ]
             [ -s  0  0  c ]                   [ 0     v'2   v'3  ]

I Rotations successives pour annuler les autres elements : 4n^3/3 flops pour triangulariser la matrice.
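Petite esquisse (hors cours) du calcul de (c, s) annulant la deuxieme composante d'un couple (a, b), suivie de l'application de G^t :

  program givens_demo
    implicit none
    double precision :: a, b, c, s, r, y1, y2
    a = 3d0
    b = 4d0
    r = sqrt(a*a + b*b)
    c =  a / r
    s = -b / r
    ! y = G^t x avec G = [c s; -s c]  =>  y1 = c*a - s*b, y2 = s*a + c*b
    y1 = c*a - s*b
    y2 = s*a + c*b
    print *, 'c, s =', c, s
    print *, 'y =', y1, y2            ! attendu : (5, 0)
  end program givens_demo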

294/ 627

Page 309: High Performance Matrix Computations/Calcul Matriciel Haute

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

295/ 627

Page 310: High Performance Matrix Computations/Calcul Matriciel Haute

Gram-Schmidt Process

I Hypothesis: a basis of a subspace is available

I Goal: Build an orthonormal basis of that subspace

I Very useful in iterative methods, where:
  I each iterate is searched for in a subspace of increasing dimension
  I one needs to maintain a basis of good quality

296/ 627

Page 311: High Performance Matrix Computations/Calcul Matriciel Haute

Gram-Schmidt Process

Consider two linearly independent vectors x1 and x2

I q1 = x1 / ‖x1‖2 has norm 1.

I x2 − (x2, q1) q1 is orthogonal to q1:

  (x2 − (x2, q1) q1, q1) = x2^t q1 − (x2^t q1) q1^t q1 = 0

I q2 = (x2 − (x2, q1) q1) / ‖x2 − (x2, q1) q1‖2 has norm 1

297/ 627

Page 312: High Performance Matrix Computations/Calcul Matriciel Haute

Gram-Schmidt Process

1. Compute r11 = ‖x1‖2; if r11 = 0 stop
2. q1 = x1 / r11
3. For j = 2, . . . , r Do
   (q1, . . . , qj−1 form an orthonormal basis)
4.   rij ← xj^t qi , for i = 1, 2, . . . , j − 1
5.   q ← xj − Σ_{i=1}^{j−1} rij qi
6.   rjj = ‖q‖2; if rjj = 0 stop
7.   qj ← q / rjj
8. EndDo
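As an illustration (not from the slides), a direct transcription of this classical Gram-Schmidt process (in practice the modified variant is numerically preferable), tested on x1 = (1,2,2)^T and x2 = (−8,−1,14)^T:

  program gram_schmidt
    implicit none
    integer, parameter :: n = 3, r = 2
    double precision :: X(n,r), Q(n,r), Rm(r,r), w(n)
    integer :: i, j
    X = reshape((/1d0,2d0,2d0, -8d0,-1d0,14d0/), (/n,r/))
    Rm = 0d0
    Rm(1,1) = sqrt(dot_product(X(:,1), X(:,1)))
    Q(:,1)  = X(:,1) / Rm(1,1)
    do j = 2, r
       w = X(:,j)
       do i = 1, j-1
          Rm(i,j) = dot_product(X(:,j), Q(:,i))
          w = w - Rm(i,j) * Q(:,i)
       end do
       Rm(j,j) = sqrt(dot_product(w, w))
       Q(:,j)  = w / Rm(j,j)
    end do
    print *, 'R =', Rm(1,1), Rm(1,2), Rm(2,2)       ! attendu : 3, 6, 15
    print *, 'residu max Q*R - X :', maxval(abs(matmul(Q, Rm) - X))
  end program gram_schmidt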

298/ 627

Page 313: High Performance Matrix Computations/Calcul Matriciel Haute

Remarks (on the Gram-Schmidt algorithm above)

I From steps 5-7, it is clear that xj = Σ_{i=1}^{j} rij qi

I We note X = [x1, x2, . . . , xr] and Q = [q1, q2, . . . , qr]

I Let R be the r-by-r upper triangular matrix whose nonzeros are the ones defined by the algorithm.

I Then the above relation can be written as

  X = QR,

  where Q is n-by-r and R is r-by-r.

299/ 627

Page 314: High Performance Matrix Computations/Calcul Matriciel Haute

Example: x1 = (1, 2, 2)^T , x2 = (−8, −1, 14)^T

I r11 = ‖x1‖2 = 3,

  q1 = x1 / r11 = (1/3) [ 1 ]       r12 = x2^T q1 = 18/3 = 6,    and    q = x2 − r12 q1
                        [ 2 ]
                        [ 2 ]

  q = [ -8 ] − 6 × (1/3) [ 1 ]      et    r22 = ‖q‖ = 15,   q2 = q/‖q‖ = (1/3) [ -2 ]
      [ -1 ]             [ 2 ]                                                 [ -1 ]
      [ 14 ]             [ 2 ]                                                 [  2 ]

I Ce qui correspond a la factorisation :

  [ 1  -8 ]   [ 1/3  -2/3 ]             [ 1/3  -2/3  -2/3 ]   [ 3   6 ]
  [ 2  -1 ] = [ 2/3  -1/3 ] × [ 3  6 ] = [ 2/3  -1/3   2/3 ] × [ 0  15 ]
  [ 2  14 ]   [ 2/3   2/3 ]   [ 0 15 ]   [ 2/3   2/3  -1/3 ]   [ 0   0 ]

Factorisation QR ou Q orthogonale et R triangulaire superieure

300/ 627

Page 315: High Performance Matrix Computations/Calcul Matriciel Haute

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

301/ 627

Page 316: High Performance Matrix Computations/Calcul Matriciel Haute

Problemes aux moindres carres

Soit A : m × n, b ∈ IRm, m ≥ n (et le plus souvent m >> n)

I Probleme : trouver x tel que Ax = b
I Systeme sur-determine : existence d'une solution pas garantie. Donc on cherche la meilleure solution au sens d'une norme :

  min_x ‖Ax − b‖2

  (figure : b, sa projection Ax sur Span{A} et le residu r = b − Ax)

I Principales approches : equations normales ou factorisation QR

302/ 627

Page 317: High Performance Matrix Computations/Calcul Matriciel Haute

Equations normales

min_x ‖Ax − b‖2 ↔ min_x ‖Ax − b‖2^2

  ‖Ax − b‖2^2 = (Ax − b)^t (Ax − b) = x^t A^t A x − 2 x^t A^t b + b^t b

I Derivee nulle par rapport a x : 2 A^t A x − 2 A^t b = 0 ⇒ systeme de taille (n × n)

  A^t A x = A^t b

I A^t A symetrique semi-definie positive, definie positive si A est de rang maximal (rang(A) = n)
I Resolution : avec Cholesky, A^t A = L D L^t
  Probleme : κ(A^t A) = κ(A)^2 ; pas backward stable
I (A^t A)^−1 A^t : pseudo-inverse de A

303/ 627

Page 318: High Performance Matrix Computations/Calcul Matriciel Haute

Resolution par factorisation QR

Si Q est une matrice orthogonale :

  ‖Ax − b‖ = ‖Q^t (Ax − b)‖ = ‖(Q^t A) x − (Q^t b)‖

I A : m × n, Q : m × m tel que A = QR

  Q^t A = R = [ R1 ]   n
              [ 0  ]   m − n

  R1 est triangulaire superieure. En posant :

  Q^t b = [ c ]   n
          [ d ]   m − n

I on a donc :

  ‖Ax − b‖2^2 = ‖Q^t A x − Q^t b‖2^2 = ‖R1 x − c‖2^2 + ‖d‖2^2

I si rang(A) = rang(R1) = n, alors la solution est donnee par R1 x = c
I nombre de flops ≈ 2 n^2 × m
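Esquisse (hors cours) de cette resolution, sous les hypotheses : A de rang maximal et QR "mince" obtenue ici par Gram-Schmidt pour rester court (Householder etant preferable en pratique) :

  program moindres_carres_qr
    implicit none
    integer, parameter :: m = 3, n = 2
    double precision :: A(m,n), Q(m,n), R(n,n), b(m), c(n), x(n), w(m)
    integer :: i, j
    A = reshape((/1d0,2d0,2d0, -8d0,-1d0,14d0/), (/m,n/))
    b = (/1d0, 1d0, 1d0/)
    R = 0d0
    do j = 1, n                                  ! QR mince par Gram-Schmidt
       w = A(:,j)
       do i = 1, j-1
          R(i,j) = dot_product(A(:,j), Q(:,i))
          w = w - R(i,j) * Q(:,i)
       end do
       R(j,j) = sqrt(dot_product(w, w))
       Q(:,j) = w / R(j,j)
    end do
    c = matmul(transpose(Q), b)                  ! c = Q^t b
    do i = n, 1, -1                              ! remontee R1 x = c
       x(i) = (c(i) - dot_product(R(i,i+1:n), x(i+1:n))) / R(i,i)
    end do
    print *, 'x =', x
    print *, 'gradient des equations normales :', matmul(transpose(A), matmul(A,x) - b)
  end program moindres_carres_qr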

304/ 627

Page 319: High Performance Matrix Computations/Calcul Matriciel Haute

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

305/ 627

Page 320: High Performance Matrix Computations/Calcul Matriciel Haute

Problemes aux valeurs propres

I Resolution de Ax = λx, ou λ : valeurs propres et x : vecteurs propres
I Polynome caracteristique : p(λ) = det(A − λI) (revient a chercher λ tel que A − λI est singuliere)
I Soit T non singuliere et Ax = λx :

  (T^−1 A T)(T^−1 x) = λ (T^−1 x)

  A et T^−1 A T sont des matrices dites similaires ; elles ont les memes valeurs propres.
  T : transformation de similarite

306/ 627

Page 321: High Performance Matrix Computations/Calcul Matriciel Haute

Problemes aux valeurs propres

On prend T = Q, orthogonale

I A ← Q^t A Q est tres interessant
I backward stable avec des transformations de Householder ou Givens
I Q^t A Q similaire a (A + E) avec ‖E‖ ≈ u × ‖A‖
I On cherche donc a determiner Q tel que les valeurs propres de Q^t A Q soient evidentes

307/ 627

Page 322: High Performance Matrix Computations/Calcul Matriciel Haute

Exemple

A matrice 2 × 2 a valeurs propres reelles.
On peut toujours trouver (c, s) tels que :

  [  c  s ]t   [ a11  a12 ]   [  c  s ]   [ λ1  t  ]
  [ -s  c ]  × [ a21  a22 ] × [ -s  c ] = [ 0   λ2 ] = S

λ1 et λ2 sont les valeurs propres de A : decomposition de Schur

I Si y est vecteur propre de S alors x = Qy est vecteur propre de A
I Sensibilite d'une valeur propre aux perturbations : fonction de l'independance de son vecteur propre par rapport aux vecteurs propres des autres valeurs propres

308/ 627

Page 323: High Performance Matrix Computations/Calcul Matriciel Haute

Valeurs propres: methodes iteratives

Methode de la puissance

vn+1 = Avn/||Avn||

avec v0 pris au hasard

I converge vers v tel que Av = λ1 v (si |λ1| > |λ2| ≥ . . . ≥ |λn|)
I Preuve :
  - si v0 = Σ αi xi avec (xi) : base de vecteurs propres, alors
  - A^k v0 = A^k (Σ αi xi) = Σ_{i=1}^{n} αi λi^k xi = α1 λ1^k ( x1 + Σ_{i=2}^{n} (αi/α1) (λi/λ1)^k xi )
  - avec (λi/λ1)^k → 0
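Esquisse (hors cours) de cette methode, sous les hypotheses : matrice symetrique de test, nombre d'iterations fixe a l'avance, v0 aleatoire :

  program methode_puissance
    implicit none
    integer, parameter :: n = 3, niter = 100
    double precision :: A(n,n), v(n), lambda
    integer :: k
    A = reshape((/4d0,-8d0,-4d0, -8d0,18d0,14d0, -4d0,14d0,25d0/), (/n,n/))
    call random_number(v)
    do k = 1, niter
       v = matmul(A, v)
       v = v / sqrt(dot_product(v, v))          ! normalisation
    end do
    lambda = dot_product(v, matmul(A, v))       ! quotient de Rayleigh
    print *, 'valeur propre dominante approchee :', lambda
    print *, 'residu ||Av - lambda v|| :', sqrt(sum((matmul(A,v) - lambda*v)**2))
  end program methode_puissance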

309/ 627

Page 324: High Performance Matrix Computations/Calcul Matriciel Haute

Valeurs propres: methodes iteratives

Shift-and-invert

(A− µI )vk+1 = vk

I methode de la puissance appliquee a (A− µI )−1

I permet d’obtenir la valeur propre la plus proche de µ

I factorisation (par exemple, LU) de (A− µI )

I a chaque iteration: Ly = vk , puis Uvk+1 = y

Des ameliorations existent pour accelerer la convergence (Lanczos,. . . ).

309/ 627

Page 325: High Performance Matrix Computations/Calcul Matriciel Haute

Notions et techniques generales pour l'algebre lineaire
  Introduction / Gaussian Elimination / LU Factorization / Vector and Matrix norms / Erreur, sensibilite, conditionnement / Factorisation LU avec pivotage / Systemes bande / Matrices symetriques / Factorisation de Cholesky / Factorisation QR / Gram-Schmidt Process / Problemes aux moindres carres / Problemes aux valeurs propres / Decomposition en valeurs singulieres (SVD)

310/ 627

Page 326: High Performance Matrix Computations/Calcul Matriciel Haute

Decomposition en valeurs singulieres (SVD)

I A ∈ IRm×n ; alors il existe U et V matrices orthogonales telles que :

  A = U Σ V^t

  decomposition en valeurs singulieres.

I Remarque :

  A^t A = V Σ^2 V^t    et    A A^t = U Σ^2 U^t

I U ∈ IRm×m est formee des m vecteurs propres orthonormes associes aux valeurs propres de A A^t.
I V ∈ IRn×n est formee des n vecteurs propres orthonormes associes aux valeurs propres de A^t A.
I Σ matrice diagonale constituee des valeurs singulieres de A, qui sont les racines carrees des valeurs propres de A^t A (avec σ1 ≥ σ2 ≥ . . . ≥ σn).

311/ 627

Page 327: High Performance Matrix Computations/Calcul Matriciel Haute

I Si A est de rang r < n, alors σr+1 = σr+2 = . . . = σn = 0.
I Tres utile dans certaines applications, notamment lorsque rang(A) n'est pas maximal :
  I moindres carres,
  I valeurs propres,
  I determination precise du rang d'une matrice

312/ 627

Page 328: High Performance Matrix Computations/Calcul Matriciel Haute

Outline

Efficient dense linear algebra libraries
  Use of scientific libraries / Level 1 BLAS and LINPACK / BLAS / LU Factorization / LAPACK / Linear algebra for distributed memory architectures / BLACS (Basic Linear Algebra Communication Subprograms) / PBLAS : parallel BLAS for distributed memory machines / ScaLAPACK / Recursive algorithms

313/ 627

Page 329: High Performance Matrix Computations/Calcul Matriciel Haute

Efficient dense linear algebra libraries
  Use of scientific libraries / Level 1 BLAS and LINPACK / BLAS / LU Factorization / LAPACK / Linear algebra for distributed memory architectures / BLACS (Basic Linear Algebra Communication Subprograms) / PBLAS : parallel BLAS for distributed memory machines / ScaLAPACK / Recursive algorithms

314/ 627

Page 330: High Performance Matrix Computations/Calcul Matriciel Haute

Use of scientific libraries

(a) Robustness

(b) Efficiency

(c) Portability

(d) Usable on a wide range of applications

(a)+(b)+(c) should be true for all scientific software

I Robustness:
  I Reliability of the computations (backward stable algorithms)
  I In particular, if the input is far from an underflow/overflow threshold, the code should not produce underflow/overflow.

315/ 627

Page 331: High Performance Matrix Computations/Calcul Matriciel Haute

I Efficiency:
  I Good performance
  I No performance degradation for large-scale problems
  I Execution time should not vary too much for problems of identical size
I Portability:
  I Code should be written in a standard language
  I Source code can be compiled on an arbitrary machine with an arbitrary compiler; execution should be correct (robustness) and efficient
I Wide range of applications:
  I Can be used on several problems/data structures (example: matrices in the BLAS library can be dense, symmetric, packed, band)

316/ 627

Page 332: High Performance Matrix Computations/Calcul Matriciel Haute

Use of scientific libraries and parallelism

Two main models for parallelism:

I shared address space (example: multi-processor or multi-core workstation):
  I all processors have access to the same logical memory
  I works like POSIX threads; the system maps threads to different cores/processors
  I parallelism can be transparent to the user of the library
  I standards: POSIX threads, OpenMP
I distributed memory model (example: cluster):
  I each processor has its own memory
  I each processor has a network interface
  I communication and synchronization require message passing
  I standards: PVM, MPI

317/ 627

Page 333: High Performance Matrix Computations/Calcul Matriciel Haute

Example: dot product on 2 processors - shared memory

(dot = 0 initially)

thread 1 on proc 1                  thread 2 on proc 2

loc_s1 = 0                          loc_s2 = 0
do i = 1, n/2                       do i = n/2+1, n
  loc_s1 = loc_s1 + x(i)*y(i)         loc_s2 = loc_s2 + x(i)*y(i)
enddo                               enddo
dot = dot + loc_s1                  dot = dot + loc_s2

Result could be wrong

I problem: dot = dot + loc_s is not atomic
I possible solution: mutual exclusion with locks (critical sections)

318/ 627


Page 335: High Performance Matrix Computations/Calcul Matriciel Haute

Example: dot product on 2 processors - shared memory

(dot = 0 initially)

thread 1 on proc 1                  thread 2 on proc 2

loc_s1 = 0                          loc_s2 = 0
do i = 1, n/2                       do i = n/2+1, n
  loc_s1 = loc_s1 + x(i)*y(i)         loc_s2 = loc_s2 + x(i)*y(i)
enddo                               enddo
lock                                lock
dot = dot + loc_s1                  dot = dot + loc_s2
unlock                              unlock

I problem: dot = dot + loc_s is not atomic
I possible solution: mutual exclusion with locks (critical sections), as above; an OpenMP sketch is given below
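As an illustration (not from the slides), an OpenMP version of the same computation, assuming the code is compiled with OpenMP enabled (e.g. -fopenmp); the reduction clause avoids the explicit critical section:

  program dot_openmp
    use omp_lib
    implicit none
    integer, parameter :: n = 1000000
    double precision :: x(n), y(n), dot
    integer :: i
    call random_number(x)
    call random_number(y)
    dot = 0.0d0
    !$omp parallel do reduction(+:dot)
    do i = 1, n
       dot = dot + x(i)*y(i)
    end do
    !$omp end parallel do
    print *, 'dot =', dot, '  threads =', omp_get_max_threads()
  end program dot_openmp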

318/ 627

Page 336: High Performance Matrix Computations/Calcul Matriciel Haute

Dot product on 2 processors – Message Passing

Suppose that initially:

I p1 owns x(1:n/2) and y(1:n/2)
I p2 owns x(n/2+1:n) and y(n/2+1:n)

Processor 1:                                Processor 2:
s_loc = dot_seq(x(1:n/2), y(1:n/2))         s_loc = dot_seq(x(n/2+1:n), y(n/2+1:n))
send s_loc to P2                            send s_loc to P1
receive s_remote from P2                    receive s_remote from P1
s = s_loc + s_remote                        s = s_loc + s_remote

Correctness depends on send/receive protocols

I asynchronous: ok
I rendezvous protocol: deadlock
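As an illustration (not from the slides), a minimal MPI sketch of the distributed dot product, assuming each process generates its local data; MPI_Allreduce replaces the explicit send/receive pair and avoids the deadlock issue mentioned above:

  program dot_mpi
    use mpi
    implicit none
    integer, parameter :: nloc = 500000
    double precision :: x(nloc), y(nloc), s_loc, s
    integer :: ierr, rank
    call MPI_Init(ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    call random_number(x)
    call random_number(y)
    s_loc = dot_product(x, y)                     ! contribution locale
    call MPI_Allreduce(s_loc, s, 1, MPI_DOUBLE_PRECISION, MPI_SUM, &
                       MPI_COMM_WORLD, ierr)
    if (rank == 0) print *, 'dot =', s
    call MPI_Finalize(ierr)
  end program dot_mpi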

319/ 627


Page 338: High Performance Matrix Computations/Calcul Matriciel Haute

Calling parallel libraries

shared memory parallelism

I Parallelism (threads) can be created inside the call
  I threads
  I OpenMP standard
I Can be transparent for the user

distributed memory parallelism

I Each processor executes a program
I Each processor calls the library function (SPMD)
I Data distribution must be specified in the API
  I data replicated on the processors (large memory usage)
  I data only on one (master) processor initially (bottleneck on the master)
  I chunks of data on each processor.

There exist other parallel programming models (SIMD or data parallel, BSP, mixed shared-distributed programming, . . . )

320/ 627

Page 339: High Performance Matrix Computations/Calcul Matriciel Haute

Parallelism and portable libraries

I Historically: each parallel machine was unique, along with its programming model and programming language
I For each new type of machine, start development again
I Now we distinguish the programming model from the underlying machine, so we can write portably correct code
  I shared memory: OpenMP directives on top of threads (loop parallelism, . . . )
  I distributed memory: MPI most portable

321/ 627

Page 340: High Performance Matrix Computations/Calcul Matriciel Haute

Efficient dense linear algebra libraries
  Use of scientific libraries / Level 1 BLAS and LINPACK / BLAS / LU Factorization / LAPACK / Linear algebra for distributed memory architectures / BLACS (Basic Linear Algebra Communication Subprograms) / PBLAS : parallel BLAS for distributed memory machines / ScaLAPACK / Recursive algorithms

322/ 627

Page 341: High Performance Matrix Computations/Calcul Matriciel Haute

Level 1 BLAS and LINPACK

I First effort to define a standard for
  I basic vector operations used in linear algebra (BLAS, later called BLAS 1)
  I a portable package to solve systems of linear equations (LINPACK)
I The LINPACK/BLAS 1 standard was defined in 1979
I Goals/motivations:
  I ease the design of numerical codes
  I better readability
  I efficiency: optimized versions or assembler
  I robustness, reliability and portability improved (standardization)
I Level 1 BLAS used in LINPACK

323/ 627

Page 342: High Performance Matrix Computations/Calcul Matriciel Haute

Linpack performance (MFlops)

  Computer                     Peak perf   Effective perf   Efficiency
  ALLIANT FX/2800 (14 proc)       560            31            0.06
  CONVEX C-210                     50            17            0.34
  CONVEX C-3810 (1 proc)          106            37            0.35
  CONVEX C-240 (4 proc)           126            27            0.21
  CRAY-XMP-1                      235            70            0.28
  CRAY-XMP-4 (4 proc)             940           178            0.22
  CRAY-2 (4 proc)                1951           129            0.066
  CRAY-YMP-1                      333           161            0.48
  CRAY-YMP-8 (8 proc)            2664           275            0.10
  CRAY C-90 (1 proc)             1000           326            0.33
  FUJITSU VP 2600/10             5000           249            0.05
  HITACHI S-820/80               3000           107            0.036
  IBM RS/6000-530                  50            13            0.26
  IBM RS/6000-550                  83            27            0.34
  NEC SX-2                       1300            43            0.033
  NEC SX-3                       5500           314            0.06

324/ 627

Page 343: High Performance Matrix Computations/Calcul Matriciel Haute

Why performance is small ?

I Memory contention
  In LINPACK, the main kernel used is:
  SAXPY : y ← y + α × x
  SAXPY : 2 loads, 1 multiplication, 1 addition, and 1 store
  Ratio flops/memory ref = 2/3
  Does not allow for an efficient use of the memory hierarchy (data are not reused)
I Objective
  Increase the flops/memory ref ratio
I How?
  Re-use several times data that are in scalar/vector registers or in low-level cache
  → definition of higher level BLAS (matrix operations)

325/ 627

Page 344: High Performance Matrix Computations/Calcul Matriciel Haute

Efficient dense linear algebra libraries
  Use of scientific libraries / Level 1 BLAS and LINPACK / BLAS / LU Factorization / LAPACK / Linear algebra for distributed memory architectures / BLACS (Basic Linear Algebra Communication Subprograms) / PBLAS : parallel BLAS for distributed memory machines / ScaLAPACK / Recursive algorithms

326/ 627

Page 345: High Performance Matrix Computations/Calcul Matriciel Haute

BLAS library

BLAS : Basic Linear Algebra Subprograms
3 levels:

I BLAS 1 : vector-vector operations - complexity O(n)
I BLAS 2 : matrix-vector operations - complexity O(n^2)
I BLAS 3 : matrix-matrix operations - complexity O(n^3)

                typical operation     # flops    memory accesses    ratio
  BLAS 1 1979   y = αx + y              2n          3n + 1           2/3
  BLAS 2 1988   y = αAx + βy            2n^2        n^2 + 3n         2
  BLAS 3 1990   C = αAB + βC            2n^3        4n^2             n/2

327/ 627

Page 346: High Performance Matrix Computations/Calcul Matriciel Haute

BLAS performance

328/ 627

Page 347: High Performance Matrix Computations/Calcul Matriciel Haute

BLAS Benefits

The BLAS offer several benefits:

1. Robustness:
   low-level details (such as the treatment of exceptions like overflow) are handled by the library.

2. Portability/Efficiency:
   thanks to the standardization of the API. Machine-dependent optimizations are left to the vendors/system administrators. Nowadays, available on all scientific computers.

3. Readability:
   modular description of the mathematical algorithms (Matlab-like).

The subroutines are available for the four standard arithmetics:

1. single real: prefix S,

2. double real: prefix D,

3. single complex: prefix C,

4. double complex: prefix Z.

Page 348: High Performance Matrix Computations/Calcul Matriciel Haute

BLAS 1: quick overview

  scal : x = αx               axpy : y = αx + y
  swap : y ↔ x                copy : y = x
  dot  : dot = x^T y          nrm2 : nrm2 = ‖x‖2
  min, max search             generating and applying plane rotations

call DAXPY(N, ALPHA, X, INCX, Y, INCY)

330/ 627

Page 349: High Performance Matrix Computations/Calcul Matriciel Haute

BLAS 2: quick overview

α, β are scalars, x, y are vectors, A is a general matrix, T is a triangular matrix and H a Hermitian matrix.

I Matrix-vector product

  y = αAx + βy       y = αA^T x + βy      y = αA^H x + βy
  x = Tx             x = T^T x            x = T^H x

I Rank-one and rank-two updates

  A = αxy^T + A      A = αxy^t + αyx^t + A
  H = αxx^H + H      H = αxy^H + αyx^H + H

I Solution of triangular systems

  x = T^−1 x         x = T^−T x           x = T^−H x

call DGEMV(TRANS, M, N, ALPHA, A, LDA, X, INCX, BETA, Y, INCY)

Page 350: High Performance Matrix Computations/Calcul Matriciel Haute

BLAS 2: naming scheme

1. first character: data type (S, D, C, Z)

2. characters 2 and 3: matrix type

I GE : general matrix.I GB : general band matrix.I HE : Hermitian matrix.I SY : symmetric matrix.I SP : symmetric matrix in ”packed” format.I HP : Hermitian matrix in ”packed” format.I HB : Hermitian band matrix.I SB : symmetric band matrix.I TR : triangular matrix.I TP : triangular matrix in ”packed” format.I TB : triangular band matrix.

3. characters 4 and 5: operation type

I MV : matrix-vector product y = αAx + y.
I R  : rank-one update A = A + αxyᵀ.
I R2 : rank-two update A = A + αxyᵀ + αyxᵀ.
I SV : triangular system solution x = T⁻¹x.

Page 351: High Performance Matrix Computations/Calcul Matriciel Haute

BLAS 3: quick overview
A, B, C are general matrices and T is a triangular matrix.

I Matrix-matrix product

  C = αAB + βC      C = αAᵀB + βC
  C = αABᵀ + βC     C = αAᵀBᵀ + βC

I Rank-k and rank-2k updates of a symmetric matrix

  C = αAAᵀ + βC     C = αAᵀA + βC
  C = αAᵀB + αBᵀA + βC     C = αABᵀ + αBAᵀ + βC

I Multiply a matrix by a triangular matrix

  B = αTB           B = αTᵀB
  B = αBT           B = αBTᵀ

I Solving triangular systems with multiple right-hand sides

  B = αT⁻¹B         B = αT⁻ᵀB
  B = αBT⁻¹         B = αBT⁻ᵀ

call DGEMM(TRANSA, TRANSB, M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC)
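For instance, a minimal DGEMM call (the 2 x 2 matrices are made up for the example):

  program demo_dgemm
    implicit none
    integer, parameter :: n = 2
    double precision   :: a(n,n), b(n,n), c(n,n)
    a = reshape([1d0, 3d0, 2d0, 4d0], [n, n])   ! column-major: A = [1 2; 3 4]
    b = reshape([1d0, 0d0, 0d0, 1d0], [n, n])   ! B = identity
    c = 0d0
    ! C <- 1.0*A*B + 0.0*C
    call dgemm('N', 'N', n, n, n, 1d0, a, n, b, n, 0d0, c, n)
    print *, c        ! C equals A
  end program demo_dgemm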

Page 352: High Performance Matrix Computations/Calcul Matriciel Haute

BLAS 3: naming scheme

1. first character: data type (S, D, C, Z)

2. characters 2 and 3 : matrix type

I GE : general matrix.
I HE : Hermitian matrix.
I SY : symmetric matrix.
I TR : triangular matrix.

3. characters 4 and 5 : operation type

I MM  : matrix-matrix product C = αAB + βC.
I RK  : rank-k update of a symmetric or Hermitian matrix C = αAAᵀ + βC.
I R2K : rank-2k update of a symmetric or Hermitian matrix C = αABᵀ + αBAᵀ + βC.
I SM  : solution of a triangular system with multiple right-hand sides B = T⁻¹B.

Page 353: High Performance Matrix Computations/Calcul Matriciel Haute

Performance of the BLAS

I Today's processors can achieve high performance, but this requires extensive machine-specific hand tuning.

I Routines have a large design space with many parameters: blocking sizes, loop nesting permutations, loop unrolling depths, ...

I Complicated interactions with the increasingly sophisticated micro-architectures of modern microprocessors.

I Need for quick/dynamic deployment of optimized routines:
  I ATLAS - Automatically Tuned Linear Algebra Software.
  I PHiPAC from Berkeley.

I More recent approach:
  I FLAME / Goto BLAS from Univ. Texas at Austin
  I Main idea: minimize TLB misses

335/ 627

Page 354: High Performance Matrix Computations/Calcul Matriciel Haute

Optimized Blas

Peak performance of the Power 4 : 3.5 GFlops

Page 355: High Performance Matrix Computations/Calcul Matriciel Haute

Optimized Blas

Peak performance of the Itanium 2 : 3.7 GFlops

Page 356: High Performance Matrix Computations/Calcul Matriciel Haute

Atlas performance

J. Dongarra figures

Page 357: High Performance Matrix Computations/Calcul Matriciel Haute

Conclusion on BLAS

I standard ⇒ used to design portable and efficient codes

I optimized BLAS libraries available

Never code vector/matrix operations yourself, always rely on optimized kernels.

I parallel BLAS kernels (exploit parallelism inside the BLAS routines)

  I Pros: portability + parallelism is hidden from the user
  I Cons: not always the most efficient way to parallelize an application
  I frequent on multicores/multiprocessors with shared memory
  I rare on multiprocessors with virtual shared memory
  I distributed memory: PBLAS and message passing

339/ 627

Page 358: High Performance Matrix Computations/Calcul Matriciel Haute

Parallel BLAS performance

I Parallel BLAS have existed for a long time

Computer      Prec      1 proc.    # procs:   1      2      4      8      16     24
BBN TC2000    32 bits      7.8                6.6   13.4   26.2   52.1   98.8  124.4
              64 bits      2.7                2.5    4.9    9.7   19.2   37.2   47.0
KSR1          64 bits     27.5               25.4   42.9   81.9  165.4  305.4  418.3

Table: Performance in MFlops of GEMM using square matrices of order 512 on the BBN TC2000 and the KSR1.

I Several BLAS libraries (ATLAS, Goto BLAS, vendors' BLAS) provide threaded parallelism that can be efficiently exploited on SMP or multi-core architectures

340/ 627

Page 359: High Performance Matrix Computations/Calcul Matriciel Haute

Efficient dense linear algebra librariesUse of scientific librariesLevel 1 BLAS and LINPACKBLASLU FactorizationLAPACKLinear algebra for distributed memory architecturesBLACS (Basic Linear Algebra Communication Subprograms)PBLAS : parallel BLAS for distributed memory machinesScaLAPACKRecursive algorithms

341/ 627

Page 360: High Performance Matrix Computations/Calcul Matriciel Haute

LU Factorization

I Solution of Ax = b
I Factorization PA = LU with P a permutation matrix
I Forward-backward substitution:
  I Ly = Pb
  I Ux = y

I Gaussian elimination:

  do ...
    do ...
      do ...
        a(i,j) = a(i,j) - a(i,k) * a(k,j) / a(k,k)
      end do
    end do
  end do

Order of nested loops → 6 alternatives, 3 column-oriented

342/ 627

Page 361: High Performance Matrix Computations/Calcul Matriciel Haute

Column-oriented variants

I KJI- SAXPY (right-looking)

do k=1, n-1
  do j=k+1, n
    do i=k+1, n
      a(i,j) = a(i,j) - a(i,k) * a(k,j) / a(k,k)
    end do
  end do
end do

Note: divisions of the columns by the pivot are omitted.
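For completeness, a minimal sketch of the same kji (right-looking) kernel with the omitted column scaling included, assuming no pivoting is needed; A is overwritten by U (upper triangle) and by the multipliers of L (strict lower triangle, unit diagonal implicit):

  subroutine lu_kji(a, n)
    implicit none
    integer, intent(in)             :: n
    double precision, intent(inout) :: a(n,n)
    integer :: i, j, k
    do k = 1, n-1
       do i = k+1, n
          a(i,k) = a(i,k) / a(k,k)                  ! multipliers l(i,k)
       end do
       do j = k+1, n
          do i = k+1, n
             a(i,j) = a(i,j) - a(i,k) * a(k,j)      ! rank-1 update
          end do
       end do
    end do
  end subroutine lu_kji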

343/ 627

Page 362: High Performance Matrix Computations/Calcul Matriciel Haute

Column-oriented variants

I JKI- GAXPY (left-looking)

do j=2, n
  do k=1, j-1
    do i=k+1, n
      a(i,j) = a(i,j) - a(i,k) * a(k,j) / a(k,k)
    end do
  end do
end do

Note: divisions of the columns by the pivot are omitted.

344/ 627

Page 363: High Performance Matrix Computations/Calcul Matriciel Haute

Column-oriented variants

I JIK- SDOT

do j=2, n
  do i=2, n
    do k=1, min(i,j)-1
      a(i,j) = a(i,j) - a(i,k) * a(k,j) / a(k,k)
    end do
  end do
end do

Note: divisions of the columns by the pivot are omitted.

345/ 627

Page 364: High Performance Matrix Computations/Calcul Matriciel Haute

LU factorization: Crout variant
At each step K, build the Kth row and the Kth column.

do K=2, n
  ! build row K (=i)
  i=K
  do k=1, K-1
    do j=K, n
      a(i,j) = a(i,j) - a(i,k) * a(k,j) / a(k,k)
    end do
  end do
  ! build column K (=j)
  j=K
  do k=1, K-1
    do i=K+1, n
      a(i,j) = a(i,j) - a(i,k) * a(k,j) / a(k,k)
    end do
  end do
end do

346/ 627

Page 365: High Performance Matrix Computations/Calcul Matriciel Haute

Blocked algorithms

(Key to get high performance)

( A11 A12 A13 )   ( L11         ) ( U11 U12 U13 )
( A21 A22 A23 ) = ( L21 L22     ) (     U22 U23 )
( A31 A32 A33 )   ( L31 L32 L33 ) (         U33 )

That is equivalent (equating terms with A):

A11 = L11 U11,   A12 = L11 U12,              A13 = L11 U13,
A21 = L21 U11,   A22 = L21 U12 + L22 U22,    A23 = L21 U13 + L22 U23,
A31 = L31 U11,   A32 = L31 U12 + L32 U22,    A33 = L31 U13 + L32 U23 + L33 U33.

347/ 627

Page 366: High Performance Matrix Computations/Calcul Matriciel Haute

Various variants

A11 = L11 U11,   A12 = L11 U12,              A13 = L11 U13,
A21 = L21 U11,   A22 = L21 U12 + L22 U22,    A23 = L21 U13 + L22 U23,
A31 = L31 U11,   A32 = L31 U12 + L32 U22,    A33 = L31 U13 + L32 U23 + L33 U33.

Postponing some updates and changing the order in which they are computed leads to different variants:

Left looking      Right looking      Crout (i,j,k variant)

348/ 627

Page 367: High Performance Matrix Computations/Calcul Matriciel Haute

Left looking LU

A11 = L11 U11,   A12 = L11 U12,              A13 = L11 U13,
A21 = L21 U11,   A22 = L21 U12 + L22 U22,    A23 = L21 U13 + L22 U23,
A31 = L31 U11,   A32 = L31 U12 + L32 U22,    A33 = L31 U13 + L32 U23 + L33 U33.

Step 1: factor the first block column; the other blocks are untouched:

  ( [L11, U11] )        ( A11 )
  (  L21       )  = LU  ( A21 )        (A12, A13, A22, A23, A32, A33 untouched)
  (  L31       )        ( A31 )

Page 368: High Performance Matrix Computations/Calcul Matriciel Haute

Left looking LU

A11 = L11 U11,   A12 = L11 U12,              A13 = L11 U13,
A21 = L21 U11,   A22 = L21 U12 + L22 U22,    A23 = L21 U13 + L22 U23,
A31 = L31 U11,   A32 = L31 U12 + L32 U22,    A33 = L31 U13 + L32 U23 + L33 U33.

Step 2: compute the second block column:

  U12 = L11⁻¹ A12

  ( [L22, U22] )        ( ( A22 )   ( L21 )     )
  (  L32       )  = LU  ( ( A32 ) − ( L31 ) U12 )        (A13, A23, A33 untouched)

Page 369: High Performance Matrix Computations/Calcul Matriciel Haute

Left looking LU

A11 = L11 U11,   A12 = L11 U12,              A13 = L11 U13,
A21 = L21 U11,   A22 = L21 U12 + L22 U22,    A23 = L21 U13 + L22 U23,
A31 = L31 U11,   A32 = L31 U12 + L32 U22,    A33 = L31 U13 + L32 U23 + L33 U33.

Step 3: compute the last block column:

  ( U13 )   ( L11     )⁻¹ ( A13 )
  ( U23 ) = ( L21 L22 )   ( A23 )

  [L33, U33] = LU ( A33 − ( L31 L32 ) ( U13 ) )
                                      ( U23 )

Page 370: High Performance Matrix Computations/Calcul Matriciel Haute

Right-looking LU

A11 = L11 U11,   A12 = L11 U12,              A13 = L11 U13,
A21 = L21 U11,   A22 = L21 U12 + L22 U22,    A23 = L21 U13 + L22 U23,
A31 = L31 U11,   A32 = L31 U12 + L32 U22,    A33 = L31 U13 + L32 U23 + L33 U33.

Step 1:

  [L11, U11] = LU (A11)

  ( U12 U13 ) = L11⁻¹ ( A12 A13 )

  ( L21 )   ( A21 )
  ( L31 ) = ( A31 ) U11⁻¹

  ( A22(1) A23(1) )   ( A22 A23 )   ( L21 )
  ( A32(1) A33(1) ) = ( A32 A33 ) − ( L31 ) ( U12 U13 )

Page 371: High Performance Matrix Computations/Calcul Matriciel Haute

Right-looking LU

Step 2:

  [L22, U22] = LU ( A22(1) )
  U23 = L22⁻¹ A23(1)
  L32 = A32(1) U22⁻¹
  A33(2) = A33(1) − L32 U23

Step 3:

  [L33, U33] = LU ( A33(2) )

351/ 627

Page 372: High Performance Matrix Computations/Calcul Matriciel Haute

Crout LU

A11 = L11 U11,   A12 = L11 U12,              A13 = L11 U13,
A21 = L21 U11,   A22 = L21 U12 + L22 U22,    A23 = L21 U13 + L22 U23,
A31 = L31 U11,   A32 = L31 U12 + L32 U22,    A33 = L31 U13 + L32 U23 + L33 U33.

Step 1:

  [L11, U11] = LU (A11)

  ( U12 U13 ) = L11⁻¹ ( A12 A13 )

  ( L21 )   ( A21 )
  ( L31 ) = ( A31 ) U11⁻¹          (A22, A23, A32, A33 not yet updated)

Page 373: High Performance Matrix Computations/Calcul Matriciel Haute

Crout LU

Step 2:

  ( A22(1/2) A23(1/2) ) = ( A22 A23 ) − L21 ( U12 U13 )
  A32(1/2) = A32 − L31 U12                               (A33 not yet updated)

  ( [L22, U22] )        ( A22(1/2) )
  (  L32       )  = LU  ( A32(1/2) )

  U23 = L22⁻¹ A23(1/2)

353/ 627

Page 374: High Performance Matrix Computations/Calcul Matriciel Haute

Crout LU

Step 3:

  A33(1/2) = A33 − ( L31 L32 ) ( U13 )
                               ( U23 )

  [L33, U33] = LU ( A33(1/2) )

354/ 627

Page 375: High Performance Matrix Computations/Calcul Matriciel Haute

Performance of blocked algorithms

n             100       500      1000     1500
F77 loops    0.0240     2.87    30.19   105.82
BLAS 1       0.0057     0.40    11.97    44.81
BLAS 3       0.0021     0.18     1.42     4.68

Elapsed time (seconds) of Cholesky factorization on an SGI O2K.

Note the n³ growth for the Fortran loops (Cholesky complexity: n³/3 flops).

355/ 627

Page 376: High Performance Matrix Computations/Calcul Matriciel Haute

Use of BLAS 3 in LU factorization

All these block algorithms can be expressed using BLAS 3 kernels.
Example: right-looking LU (KJI-SAXPY). At each step the matrix is split into the panel Bk (current block column), the row block Uk, and the trailing submatrix Ck; then:

1. Unblocked (pivoting) factorization of the panel Bk
2. Compute the row block Uk : TRSM
3. Update the trailing submatrix Ck : GEMM

I All variants → same number of flops

I Different memory access

I Efficiency depends on relative BLAS 3 efficiency
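A minimal sketch (assuming no pivoting, with a hypothetical routine name and block size nb) of how the three steps above map onto BLAS 3 calls:

  subroutine blocked_lu(a, n, nb)
    implicit none
    integer, intent(in)             :: n, nb
    double precision, intent(inout) :: a(n,n)
    integer :: k, kb, rest, kk, i, j
    do k = 1, n, nb
       kb   = min(nb, n - k + 1)
       rest = n - k - kb + 1
       ! 1. unblocked factorization of the panel B_k = A(k:n, k:k+kb-1)
       do kk = k, k + kb - 1
          do i = kk + 1, n
             a(i,kk) = a(i,kk) / a(kk,kk)             ! multipliers
          end do
          do j = kk + 1, k + kb - 1
             do i = kk + 1, n
                a(i,j) = a(i,j) - a(i,kk) * a(kk,j)   ! update inside the panel
             end do
          end do
       end do
       if (rest > 0) then
          ! 2. row block U_k:  A12 <- L11^{-1} A12            (TRSM)
          call dtrsm('L', 'L', 'N', 'U', kb, rest, 1d0, a(k,k), n, a(k,k+kb), n)
          ! 3. trailing update: A22 <- A22 - L21 * U12        (GEMM)
          call dgemm('N', 'N', rest, rest, kb, -1d0, a(k+kb,k), n, &
                     a(k,k+kb), n, 1d0, a(k+kb,k+kb), n)
       end if
    end do
  end subroutine blocked_lu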

356/ 627

Page 377: High Performance Matrix Computations/Calcul Matriciel Haute

BLAS 3 operations (n=500, nb=64) for LU variants

Variant          Routine        % Operations   % Time   Avg. MFlops
Left-looking     DGEMM               49          32         438
                 DTRSM               41          45         268
                 unblocked LU        10          20         146
Right-looking    DGEMM               82          56         414
                 DTRSM                8          23         105
                 unblocked LU        10          19         151
Crout            DGEMM               82          57         438
                 DTRSM                8          24         105
                 unblocked LU        10          16         189

357/ 627

Page 378: High Performance Matrix Computations/Calcul Matriciel Haute

Other block algorithms

Most linear algebra algorithms can be recast in block variants:

I Linear systems: symmetric positive definite (LLᵀ), symmetric indefinite (LDLᵀ)

I Eigensolvers

I Linear least-squares: QR decomposition based on Householder transformations

358/ 627

Page 379: High Performance Matrix Computations/Calcul Matriciel Haute

Example of performance with parallel BLAS

Speed-up on 8 processors vs. 1 processor with 1000x1000 matrices,CRAY YMP.

I LU factorization: 6.1

I Cholesky factorization: 6.2

I LDLᵀ factorization: 5.3

I QR factorization: 6.8

359/ 627

Page 380: High Performance Matrix Computations/Calcul Matriciel Haute

Efficient dense linear algebra librariesUse of scientific librariesLevel 1 BLAS and LINPACKBLASLU FactorizationLAPACKLinear algebra for distributed memory architecturesBLACS (Basic Linear Algebra Communication Subprograms)PBLAS : parallel BLAS for distributed memory machinesScaLAPACKRecursive algorithms

360/ 627

Page 381: High Performance Matrix Computations/Calcul Matriciel Haute

LAPACK: Linear Algebra PACKage
Scientific library developed in Fortran 77, intensively using BLAS 3 routines (600 000 lines of code).

I Supersedes LINPACK (Ax = b) and EISPACK (Ax = λx)

I Scope:
  I linear equations,
  I linear least-squares,
  I standard eigenvalue and singular value problems,
  I generalized eigenvalue problems.

I Components:
  I driver routines: solve the complete problem (e.g. solve a linear system);
  I expert routines: similar to drivers, but provide the user with more numerical information (e.g. estimate the condition number of a matrix, compute only a subset of the eigenpairs, etc.);
  I computational routines: perform a distinct computational task (e.g. an LU or QR factorization).
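As an illustration of a driver routine, a minimal sketch of a call to DGESV (LU with partial pivoting followed by the solve); the 3 x 3 system is made up for the example:

  program demo_dgesv
    implicit none
    integer, parameter :: n = 3, nrhs = 1
    double precision   :: a(n,n), b(n,nrhs)
    integer            :: ipiv(n), info
    ! column-major: rows of A are (3,2,0), (0,2,-5), (2,0,3)
    a = reshape([3d0, 0d0, 2d0,  2d0, 2d0, 0d0,  0d0, -5d0, 3d0], [n, n])
    b(:,1) = [5d0, 1d0, 0d0]
    call dgesv(n, nrhs, a, n, ipiv, b, n, info)
    if (info /= 0) print *, 'dgesv failed, info =', info
    print *, b(:,1)     ! solution overwrites b
  end program demo_dgesv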

361/ 627

Page 382: High Performance Matrix Computations/Calcul Matriciel Haute

The LAPACK library

I Good numerical robustness (rely on ”clean” IEEE arithmetic)

I First public release: 1991. Available on netlib

I Latest release: 3.1.1, February 2007

I Main credits: Cray Research, Univ. Kentucky, Univ. of Tennessee, Courant Institute, NAG Ltd, Rice Univ., Argonne Nat. Lab., Oak Ridge Nat. Lab.

I Parallel implementation on shared memory/multicores inherited from the parallel BLAS (efficiency limited on large numbers of processors/cores)

I Evolution for distributed memory multiprocessors: ScaLAPACK

362/ 627


Page 384: High Performance Matrix Computations/Calcul Matriciel Haute

Case Study: Cholesky Factorization on multicores

(slides from Alfredo Buttari, Jack Dongarra, Jakub Kurzak andJulien Langou)

363/ 627

Page 385: High Performance Matrix Computations/Calcul Matriciel Haute

Parallelism in LAPACK: Cholesky factorization

DPOTF2 (BLAS 2): non-blocked factorization of the panel

DTRSM (BLAS 3): updates by applying the transformation computed in DPOTF2

DGEMM (DSYRK) (BLAS 3): updates the trailing submatrix

(U = Lᵀ)

Page 386: High Performance Matrix Computations/Calcul Matriciel Haute

Parallelism in LAPACK: Cholesky factorization

BLAS 2 operations cannot be efficiently parallelized because they are bandwidth bound:
• strict synchronizations
• poor parallelism
• poor scalability

Page 387: High Performance Matrix Computations/Calcul Matriciel Haute

Parallelism in LAPACK: Cholesky factorization

The execution flow is filled with stalls due to synchronizations and sequential operations.

(Figure: execution timeline.)

Page 388: High Performance Matrix Computations/Calcul Matriciel Haute

Parallelism in LAPACK: Cholesky factorization

Tiling the operations:

do [for each step]
   DPOTF2 on [the diagonal tile]
   for all [tiles of the panel below] do DTRSM on [the tile] end
   for all [tiles of the trailing submatrix] do DGEMM on [the tile] end
end

Page 389: High Performance Matrix Computations/Calcul Matriciel Haute

Parallelism in LAPACK: Cholesky factorization

Cholesky can be represented as a Directed Acyclic Graph (DAG) where nodes are subtasks and edges are dependencies among them.

As long as dependencies are not violated, tasks can be scheduled in any order.

(Figure: DAG of the tiled Cholesky factorization; nodes are labelled by tile indices 1:1, 2:1, 2:2, . . . , 5:5.)

Page 390: High Performance Matrix Computations/Calcul Matriciel Haute

Parallelism in LAPACK: Cholesky factorization

(Figure: execution timeline of the tiled, DAG-scheduled algorithm.)

  higher flexibility
  some degree of adaptivity
  no idle time
  better scalability

Cost: 1/3 n³, n³, 2n³

Page 391: High Performance Matrix Computations/Calcul Matriciel Haute

Parallelism in LAPACK: block data layout

(Figure: Column-Major storage vs. Block data layout.)

Page 393: High Performance Matrix Computations/Calcul Matriciel Haute

Parallelism in LAPACK: block data layout

(Figure: blocking speedup of DGEMM and DTRSM as a function of the block size (64, 128, 256).)

The use of block data layout storage can significantly improve performance.

Page 394: High Performance Matrix Computations/Calcul Matriciel Haute

Cholesky: performance

(Figure: Cholesky factorization on a dual Clovertown — Gflop/s vs. problem size (up to 10000) for the asynchronous 2D-blocking code and for LAPACK + threaded BLAS.)

Page 395: High Performance Matrix Computations/Calcul Matriciel Haute

Cholesky: performance

(Figure: Cholesky factorization on an 8-way dual Opteron — Gflop/s vs. problem size (up to 15000) for the asynchronous 2D-blocking code and for LAPACK + threaded BLAS.)

Page 396: High Performance Matrix Computations/Calcul Matriciel Haute

Efficient dense linear algebra libraries
  Use of scientific libraries
  Level 1 BLAS and LINPACK
  BLAS
  LU Factorization
  LAPACK
  Linear algebra for distributed memory architectures
  BLACS (Basic Linear Algebra Communication Subprograms)
  PBLAS : parallel BLAS for distributed memory machines
  ScaLAPACK
  Recursive algorithms

375/ 627

Page 397: High Performance Matrix Computations/Calcul Matriciel Haute

Linear algebra for distributed memory architectures

I Difficulties:
  I Distribute the data on the processors
  I Define enough parallel tasks, but not too many
  I Explicit message passing between processors

I Example: LU factorization

  Suppose that
  I each processor holds part of the matrix,
  I each processor performs the update operations on its part.
  Which data distribution should be used?

376/ 627

Page 398: High Performance Matrix Computations/Calcul Matriciel Haute

Data distribution for dense matrices

(Figure: 1D block-cyclic distribution — block columns dealt out cyclically to processes 1, 2, 3, 4 — and 2D block-cyclic distribution — matrix blocks dealt out cyclically over a 2 x 2 grid of processes 0, 1, 2, 3.)

Main reasons for 2D block-cyclic:

1. good load balance, minimize communication,

2. use of BLAS 3 on each processor.

Need to communicate blocks of matrices between processors: BLACS
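A small sketch of the 2D block-cyclic mapping itself, assuming blocks of size mb x nb on a nprow x npcol grid with the first block held by process (0,0):

  ! process coordinates (prow, pcol) owning global entry (i,j), 1-based indices
  subroutine owner(i, j, mb, nb, nprow, npcol, prow, pcol)
    implicit none
    integer, intent(in)  :: i, j, mb, nb, nprow, npcol
    integer, intent(out) :: prow, pcol
    prow = mod((i-1)/mb, nprow)
    pcol = mod((j-1)/nb, npcol)
  end subroutine owner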

377/ 627

Page 399: High Performance Matrix Computations/Calcul Matriciel Haute

Efficient dense linear algebra libraries
  Use of scientific libraries
  Level 1 BLAS and LINPACK
  BLAS
  LU Factorization
  LAPACK
  Linear algebra for distributed memory architectures
  BLACS (Basic Linear Algebra Communication Subprograms)
  PBLAS : parallel BLAS for distributed memory machines
  ScaLAPACK
  Recursive algorithms

378/ 627

Page 400: High Performance Matrix Computations/Calcul Matriciel Haute

BLACS (Basic Linear Algebra Communication Subprograms) (User's Guide, J. J. Dongarra, R. C. Whaley)

I Set of communication routines to implement linear algebra algorithms on distributed memory architectures

I Portable

I Available on top of MPI or PVM, and on CMMD (Thinking Machines), MPL (IBM SPx), NX (Intel), . . .

I SPMD model
I Main concept: communication based on 2D arrays:
  I M x N rectangular matrices
  I trapezoidal matrices

379/ 627

Page 401: High Performance Matrix Computations/Calcul Matriciel Haute

(Figure: the trapezoidal matrix shapes handled by the BLACS, depending on UPLO = 'U' or 'L' and on whether M <= N or M > N; the diagonal offsets involved are n-m+1 and m-n+1.)

380/ 627

Page 402: High Performance Matrix Computations/Calcul Matriciel Haute

I Processes organized in a 2D grid P × Q such that P × Q = N, the number of processes

I Processes identified by their row/column indices

I Example: 8 processes in a 2 x 4 grid

         0   1   2   3
     0   0   1   2   3
     1   4   5   6   7

I Building a grid:
  I BLACS_GRIDINIT
  I BLACS_GRIDMAP

I Termination:
  I grid: BLACS_GRIDEXIT
  I blacs: BLACS_EXIT

381/ 627

Page 403: High Performance Matrix Computations/Calcul Matriciel Haute

BLACS: communication routines

I send/receive

I broadcast

I Naming:
  I character 1: type of data (S, D, C, Z, I)
  I characters 2+3: data structure
    I GE : general matrix
    I TR : trapezoidal matrix (upper, lower, unit or not)
  I characters 5 and 6: function
    I SD : send
    I RV : receive
    I BS : broadcast (sender side)
    I BR : broadcast (receiver side)

382/ 627

Page 404: High Performance Matrix Computations/Calcul Matriciel Haute

Examples

I send a trapezoidal matrix:
  TRSD2D(ICONTXT, M, N, A, LDA, RDEST, CDEST)

I receive a general (rectangular) matrix:
  GERV2D(ICONTXT, M, N, A, LDA, RSRC, CSRC)

I broadcast a general matrix:
  GEBS2D(ICONTXT, SCOPE, TOP, M, N, A, LDA)

I broadcast reception:
  GEBR2D(ICONTXT, SCOPE, TOP, M, N, A, LDA, RSRC, CSRC)
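A minimal sketch of a BLACS program (to be run on at least 4 processes, e.g. on top of MPI) that builds a 2 x 2 grid and sends a 2 x 2 block from process (0,0) to process (0,1); the data values are made up for the example:

  program blacs_demo
    implicit none
    integer :: ictxt, nprow, npcol, myrow, mycol, iam, nprocs
    double precision :: a(2,2)
    call blacs_pinfo(iam, nprocs)
    call blacs_get(-1, 0, ictxt)                ! default system context
    nprow = 2; npcol = 2
    call blacs_gridinit(ictxt, 'Row-major', nprow, npcol)
    call blacs_gridinfo(ictxt, nprow, npcol, myrow, mycol)
    if (myrow == 0 .and. mycol == 0) then
       a = 1d0
       call dgesd2d(ictxt, 2, 2, a, 2, 0, 1)    ! send to process (0,1)
    else if (myrow == 0 .and. mycol == 1) then
       call dgerv2d(ictxt, 2, 2, a, 2, 0, 0)    ! receive from process (0,0)
       print *, 'received', a
    end if
    if (myrow >= 0) call blacs_gridexit(ictxt)
    call blacs_exit(0)
  end program blacs_demo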

383/ 627

Page 405: High Performance Matrix Computations/Calcul Matriciel Haute

Parameters

I ICONTXT: context (identifies the grid of processes)

I SCOPE: one process, complete row, complete column or all processes

I TOP: network topology emulated

I M: number of rows of A

I N: number of columns of A

I A: matrix to be sent, A(LDA, *)

I RSRC: row of the sender/receiver

I CSRC: column of the sender/receiver

Global operators

I GAMX : maximum

I GAMN : minimum

I GSUM : summation

384/ 627

Page 406: High Performance Matrix Computations/Calcul Matriciel Haute

Efficient dense linear algebra libraries
  Use of scientific libraries
  Level 1 BLAS and LINPACK
  BLAS
  LU Factorization
  LAPACK
  Linear algebra for distributed memory architectures
  BLACS (Basic Linear Algebra Communication Subprograms)
  PBLAS : parallel BLAS for distributed memory machines
  ScaLAPACK
  Recursive algorithms

385/ 627

Page 407: High Performance Matrix Computations/Calcul Matriciel Haute

PBLAS: parallel BLAS (LAPACK Working Note 100; Choi, Dongarra, Ostrouchov, Petitet, Walker, and Whaley)

I Parallel operations based on BLAS 1 + 2 + 3, built on top of the sequential BLAS (similar interface)

I Based on the BLACS
  Computations: BLAS / communications: BLACS

I 2D cyclic data distribution → good scalability/load balance

I Used to develop (part of) ScaLAPACK

I PBLAS: subset of the BLAS (no operations on band or packed matrices), plus some extra operations (matrix transpose)

386/ 627

Page 408: High Performance Matrix Computations/Calcul Matriciel Haute

I Level 1 PBLAS:
  x ↔ y (swap)          xᵀy (dot product)
  x ← αx (scal)         ‖x‖₂ (nrm2)
  y ← x (copy)          ‖re(x)‖₁ + ‖im(x)‖₁ (asum)
  y ← αx + y (axpy)     largest element of a vector (amax)

I Level 2 PBLAS:
  I matrix-vector multiplication
  I rank-1 updates
  I multiplication by a triangular matrix
  I triangular system solving

387/ 627

Page 409: High Performance Matrix Computations/Calcul Matriciel Haute

I Level 3 PBLAS:
  I matrix-matrix multiplication
  I rank-k and rank-2k updates
  I multiplication of a matrix by a triangular matrix
  I solution of triangular systems
  I matrix transpose (C ← βC + αAᵀ)

388/ 627

Page 410: High Performance Matrix Computations/Calcul Matriciel Haute

Storage, initialization of distributed matrices

Let A be M × N with M = N = 5, partitioned in 2 x 2 blocks:

  a11 a12 a13 a14 a15
  a21 a22 a23 a24 a25
  a31 a32 a33 a34 a35
  a41 a42 a43 a44 a45
  a51 a52 a53 a54 a55

389/ 627

Page 411: High Performance Matrix Computations/Calcul Matriciel Haute

Storage, initialization of distributed matrices

On a 2 by 2 grid of processors (process columns 0, 1, 0):

------------------------------------------

a11 a12 | a13 a14 | a15

0 a21 a22 | a23 a24 | a25

------------------------------------------

a31 a32 | a33 a34 | a35

1 a41 a42 | a43 a44 | a45

------------------------------------------

0 a51 a52 | a53 a54 | a55

I Proc (0,0): a11, a21, a51, a12, a22, a52, a15, a25, a55 (3x3 local array)

I Proc (0,1): a13, a23, a53, a14, a24, a54 (3x2 local array)

I . . .

I Redistribution routines exist

389/ 627

Page 412: High Performance Matrix Computations/Calcul Matriciel Haute

Array descriptor: integer array of length 9

DESC()   Name       Scope    Definition
1        DTYPE_A    Global   Descriptor type (DTYPE_A = 1 for dense matrices).
2        CTXT_A     Global   BLACS context indicating the BLACS process grid over which the global matrix is distributed.
3        M_A        Global   Number of rows in the global array A.
4        N_A        Global   Number of columns in the global array A.
5        MB_A       Global   Blocking factor used to distribute the rows of the array.
6        NB_A       Global   Blocking factor used to distribute the columns of the array.
7        RSRC_A     Global   Process row over which the first row of the array A is distributed.
8        CSRC_A     Global   Process column over which the first column of the array A is distributed.
9        LLD_A      Local    Leading dimension of the local array.
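In practice the descriptor is usually filled with the ScaLAPACK tool routine DESCINIT; a minimal sketch, assuming a BLACS grid (ictxt, nprow, myrow) already set up as in the earlier example and a hypothetical 1000 x 1000 matrix distributed with 64 x 64 blocks starting on process (0,0):

  integer            :: desca(9), info, lld
  integer, external  :: numroc
  ! local leading dimension = number of rows of A stored on this process row
  lld = max(1, numroc(1000, 64, myrow, 0, nprow))
  call descinit(desca, 1000, 1000, 64, 64, 0, 0, ictxt, lld, info)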

Page 413: High Performance Matrix Computations/Calcul Matriciel Haute

Example of PBLAS call

I BLAS call:
  CALL DGEMM (TRANSA, TRANSB, M, N, K, ALPHA,
 $            A(IA,JA), LDA, B(IB,JB), LDB,
 $            BETA, C(IC,JC), LDC)

I PBLAS call:
  CALL PDGEMM (TRANSA, TRANSB, M, N, K, ALPHA,
 $             A, IA, JA, DESCA, B, IB, JB, DESCB,
 $             BETA, C, IC, JC, DESCC)

391/ 627

Page 414: High Performance Matrix Computations/Calcul Matriciel Haute
Page 415: High Performance Matrix Computations/Calcul Matriciel Haute

Efficient dense linear algebra libraries
  Use of scientific libraries
  Level 1 BLAS and LINPACK
  BLAS
  LU Factorization
  LAPACK
  Linear algebra for distributed memory architectures
  BLACS (Basic Linear Algebra Communication Subprograms)
  PBLAS : parallel BLAS for distributed memory machines
  ScaLAPACK
  Recursive algorithms

393/ 627

Page 416: High Performance Matrix Computations/Calcul Matriciel Haute

ScaLAPACK Software Hierarchy

Goal: reuse most of the existing dense linear algebra software.

(Figure: software hierarchy. ScaLAPACK is built on top of the PBLAS and LAPACK; the PBLAS rely on the BLACS (communication) and the BLAS (local computation); the BLACS sit on top of a message passing library (MPI, PVM, ...). LAPACK/BLAS are the local components, BLACS/message passing the global ones.)

394/ 627

Page 417: High Performance Matrix Computations/Calcul Matriciel Haute

ScaLAPACK : Right-looking LU Factorization

Conversion of LAPACK codes

I Sequential LU:

  1. Factor a column block (I_AMAX, _SWAP, _GER)
  2. Apply the pivoting to the rest of the matrix (_SWAP)
  3. Update the submatrix (_TRSM followed by _GEMM)

I Parallel implementation with the PBLAS:

  1. Factor a column block (P_AMAX, P_SWAP, P_GER)
  2. Apply the pivoting to the rest of the matrix (P_SWAP)
  3. Update the submatrix (P_TRSM followed by P_GEMM)

395/ 627

Page 418: High Performance Matrix Computations/Calcul Matriciel Haute
Page 419: High Performance Matrix Computations/Calcul Matriciel Haute
Page 420: High Performance Matrix Computations/Calcul Matriciel Haute

398/ 627

Page 421: High Performance Matrix Computations/Calcul Matriciel Haute

ScaLAPACK: out-of-core algorithms

I Based on left-looking variants of LU, QR and Cholesky
I Same idea as for cache, but with much higher latency and smaller bandwidth
I QR easier than LU (no pivoting, more flops)

399/ 627

Page 422: High Performance Matrix Computations/Calcul Matriciel Haute

Performance models for ScaLAPACK

I Communication: volume = Cv N², number of messages = Cm N / NB

I Flops: Cf N³

  T(N, P) = (Cf N³ / P) tf + (Cv N² / √P) tv + (Cm N / NB) tm

  (tf: average time for a flop, tm: latency, 1/tv: bandwidth)

400/ 627

Page 423: High Performance Matrix Computations/Calcul Matriciel Haute

Efficient dense linear algebra libraries
  Use of scientific libraries
  Level 1 BLAS and LINPACK
  BLAS
  LU Factorization
  LAPACK
  Linear algebra for distributed memory architectures
  BLACS (Basic Linear Algebra Communication Subprograms)
  PBLAS : parallel BLAS for distributed memory machines
  ScaLAPACK
  Recursive algorithms

401/ 627

Page 424: High Performance Matrix Computations/Calcul Matriciel Haute

Recursive algorithms

Example 1: LU factorization

1. Split matrix A into two rectangles of size m x n/2.
   If there is only 1 column, divide the column by the pivot and return.

2. Apply the LU algorithm to the left part: A11 = LU, with A21 updated.

3. Apply the transformations to the right part (triangular solve A12 = L⁻¹A12 and matrix multiplication A22 = A22 − A21 A12).

4. Apply the LU algorithm to the right (square) part.

→ Matrices with n/2, n/4, n/8, . . . columns

Example 2: Matrix-matrix multiplication:

  ( A11 A12 ) ( B11 B12 )   ( A11 B11 + A12 B21   A11 B12 + A12 B22 )
  ( A21 A22 ) ( B21 B22 ) = ( A21 B11 + A22 B21   A21 B12 + A22 B22 )

with recursive blocking for each matrix-matrix multiplication.
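A minimal sketch of such a recursive multiplication (C ← C + AB, n assumed to be a power of two; below a cutoff the work is left to ordinary loops, or to GEMM in a real code):

  recursive subroutine recmm(n, a, lda, b, ldb, c, ldc)
    implicit none
    integer, intent(in) :: n, lda, ldb, ldc
    double precision, intent(in)    :: a(lda,*), b(ldb,*)
    double precision, intent(inout) :: c(ldc,*)
    integer :: h, i, j, k
    if (n <= 64) then                       ! cutoff: plain triple loop
       do j = 1, n
          do k = 1, n
             do i = 1, n
                c(i,j) = c(i,j) + a(i,k) * b(k,j)
             end do
          end do
       end do
       return
    end if
    h = n / 2
    ! eight half-size products: C11 += A11*B11 + A12*B21, etc.
    call recmm(h, a(1,1),     lda, b(1,1),     ldb, c(1,1),     ldc)
    call recmm(h, a(1,h+1),   lda, b(h+1,1),   ldb, c(1,1),     ldc)
    call recmm(h, a(1,1),     lda, b(1,h+1),   ldb, c(1,h+1),   ldc)
    call recmm(h, a(1,h+1),   lda, b(h+1,h+1), ldb, c(1,h+1),   ldc)
    call recmm(h, a(h+1,1),   lda, b(1,1),     ldb, c(h+1,1),   ldc)
    call recmm(h, a(h+1,h+1), lda, b(h+1,1),   ldb, c(h+1,1),   ldc)
    call recmm(h, a(h+1,1),   lda, b(1,h+1),   ldb, c(h+1,h+1), ldc)
    call recmm(h, a(h+1,h+1), lda, b(h+1,h+1), ldb, c(h+1,h+1), ldc)
  end subroutine recmm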

402/ 627

Page 425: High Performance Matrix Computations/Calcul Matriciel Haute

Recursive algorithms

I automatic adaptation to the cache size (at all levels)

I easier to tune and often very efficient compared to classical approaches

I can exploit recursive data layouts
  I 'Z' or 'U' storage for unsymmetric matrices
  I recursive packed storage possible for symmetric matrices (Cholesky factorization, . . . )

403/ 627

Page 426: High Performance Matrix Computations/Calcul Matriciel Haute

Conclusion

I Standards have been defined for the most useful linear algebra operations

I One should not try to write his/her own routines
I Efficient sequential and parallel implementations are available:
  I shared memory architectures
  I distributed memory architectures

I But . . . a higher degree of parallelism is now needed (multicores, thousands of processors)

Software is always one step behind architectures

404/ 627

Page 427: High Performance Matrix Computations/Calcul Matriciel Haute

Parallelism in Linear Algebra software so far

(Figure: shared memory — LAPACK on top of threaded BLAS, itself on top of PThreads/OpenMP; distributed memory — ScaLAPACK on top of PBLAS, itself on top of BLACS + MPI. In both cases the parallelism is extracted in the lowest layers.)

Page 429: High Performance Matrix Computations/Calcul Matriciel Haute

Outline

Introduction to Sparse Matrix Computations
  Motivation and main issues
  Sparse matrices
  Gaussian elimination

407/ 627

Page 430: High Performance Matrix Computations/Calcul Matriciel Haute

A selection of references

I Books

I Duff, Erisman and Reid, Direct Methods for Sparse Matrices, Clarendon Press, Oxford, 1986.

I Dongarra, Duff, Sorensen and H. A. van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, SIAM, 1991.

I George, Liu, and Ng, Computer Solution of Sparse Positive Definite Systems, book to appear.

I Articles

I Gilbert and Liu, Elimination structures for unsymmetric sparse LU factors, SIMAX, 1993.

I Liu, The role of elimination trees in sparse factorization, SIMAX, 1990.

I Heath, Ng and Peyton, Parallel Algorithms for Sparse Linear Systems, SIAM Review, 1991.

408/ 627

Page 431: High Performance Matrix Computations/Calcul Matriciel Haute

Introduction to Sparse Matrix Computations
  Motivation and main issues
  Sparse matrices
  Gaussian elimination

409/ 627

Page 432: High Performance Matrix Computations/Calcul Matriciel Haute

Motivations

I Solution of linear systems of equations → key algorithmic kernel

  Continuous problem
        ↓
  Discretization
        ↓
  Solution of a linear system Ax = b

I Main parameters:
  I Numerical properties of the linear system (symmetry, positive definiteness, conditioning, . . . )
  I Size and structure:
    I Large (> 100000 × 100000 ?), square/rectangular
    I Dense or sparse (structured / unstructured)
  I Target computer (sequential/parallel)

→ Algorithmic choices are critical

410/ 627

Page 433: High Performance Matrix Computations/Calcul Matriciel Haute

Motivations for designing efficient algorithms

I Time-critical applications

I Solve larger problems

I Decrease elapsed time (parallelism ?)

I Minimize cost of computations (time, memory)

411/ 627

Page 434: High Performance Matrix Computations/Calcul Matriciel Haute

Difficulties

I Access to data:
  I Computer: complex memory hierarchy (registers, multilevel cache, main memory (shared or distributed), disk)
  I Sparse matrix: large irregular dynamic data structures
  → Exploit the locality of references to data on the computer (design algorithms providing such locality)

I Efficiency (time and memory):
  I The number of operations and the memory usage depend very much on the algorithm used and on the numerical and structural properties of the problem.
  I The algorithm depends on the target computer (vector, scalar, shared, distributed, clusters of Symmetric Multi-Processors (SMP), GRID).
  → Algorithmic choices are critical

412/ 627

Page 435: High Performance Matrix Computations/Calcul Matriciel Haute

Introduction to Sparse Matrix Computations
  Motivation and main issues
  Sparse matrices
  Gaussian elimination

413/ 627

Page 436: High Performance Matrix Computations/Calcul Matriciel Haute

Sparse matrices

Example:

  3 x1 + 2 x2          = 5
         2 x2 - 5 x3   = 1
  2 x1         + 3 x3  = 0

can be represented as Ax = b, where

      ( 3  2  0 )        ( x1 )         ( 5 )
  A = ( 0  2 -5 ) ,  x = ( x2 ) ,  b =  ( 1 )
      ( 2  0  3 )        ( x3 )         ( 0 )

Sparse matrix: only the nonzeros are stored.

414/ 627

Page 437: High Performance Matrix Computations/Calcul Matriciel Haute

Sparse matrix ?

(Figure: sparsity pattern of the original matrix dwt 592.rua (N=592, NZ=5104), from the structural analysis of a submarine.)

415/ 627

Page 438: High Performance Matrix Computations/Calcul Matriciel Haute

Factorization process

Solution of Ax = b

I A is unsymmetric:
  I A is factorized as A = LU, where L is a lower triangular matrix and U is an upper triangular matrix.
  I Forward-backward substitution: Ly = b, then Ux = y.

I A is symmetric:
  I A = LDLᵀ or LLᵀ

I A is rectangular m × n with m ≥ n and we solve min_x ‖Ax − b‖₂:
  I A = QR, where Q is orthogonal (Q⁻¹ = Qᵀ) and R is triangular.
  I Solve: y = Qᵀb, then Rx = y.

416/ 627


Page 441: High Performance Matrix Computations/Calcul Matriciel Haute

Difficulties

I Only non-zero values are stored

I Factors L and U have far more nonzeros than A

I Data structures are complex

I Computations are only a small portion of the code (the rest is data manipulation)

I Memory size is a limiting factor → out-of-core solvers

417/ 627

Page 442: High Performance Matrix Computations/Calcul Matriciel Haute

Key numbers:

1- Average size: 100 MB matrix; factors = 2 GB; flops = 10 GFlops

2- A bit more "challenging" (Lab. Geosciences Azur, Valbonne):
  I Complex matrix arising in 2D, 16 × 10⁶ unknowns, 150 × 10⁶ nonzeros
  I Storage: 5 GB (12 GB with the factors ?)
  I Flops: tens of TeraFlops

3- Typical performance (MUMPS):
  I PC Linux (P4, 2 GHz): 1.0 GFlops/s
  I Cray T3E (512 procs): speed-up ≈ 170, perf. 71 GFlops/s

418/ 627

Page 443: High Performance Matrix Computations/Calcul Matriciel Haute

Typical test problems:

BMW car body, 227,362 unknowns, 5,757,996 nonzeros, MSC.Software

Size of factors: 51.1 million entries
Number of operations: 44.9 × 10⁹

419/ 627

Page 444: High Performance Matrix Computations/Calcul Matriciel Haute

Typical test problems:

BMW crankshaft, 148,770 unknowns, 5,396,386 nonzeros, MSC.Software

Size of factors: 97.2 million entries
Number of operations: 127.9 × 10⁹

420/ 627

Page 445: High Performance Matrix Computations/Calcul Matriciel Haute

Sources of parallelism

Several levels of parallelism can be exploited:

I At the problem level: the problem can be decomposed into sub-problems (e.g. domain decomposition)

I At the matrix level, arising from its sparse structure

I At the submatrix level, within dense linear algebra computations (parallel BLAS, . . . )

421/ 627

Page 446: High Performance Matrix Computations/Calcul Matriciel Haute

Data structure for sparse matrices

I The storage scheme depends on the pattern of the matrix and on the type of access required:
  I band or variable-band matrices
  I "block bordered" or block tridiagonal matrices
  I general matrices
  I row, column or diagonal access

422/ 627

Page 447: High Performance Matrix Computations/Calcul Matriciel Haute

Data formats for a general sparse matrix A

What needs to be represented

I Assembled matrices: MxN matrix A with NNZ nonzeros.

I Elemental matrices (unassembled): M x N matrix A with NELT elements.

I Arithmetic: Real (4 or 8 bytes) or complex (8 or 16 bytes)

I Symmetric (or Hermitian)→ store only part of the data.

I Distributed format ?

I Duplicate entries and/or out-of-range values ?

423/ 627

Page 448: High Performance Matrix Computations/Calcul Matriciel Haute

Classical Data Formats for Assembled Matrices

I Example of a 3x3 matrix with NNZ=5 nonzeros (a11, a22, a23, a31, a33):

      ( a11          )
  A = (      a22 a23 )
      ( a31      a33 )

I Coordinate format
  IRN [1:NNZ] = 1 3 2 2 3
  JCN [1:NNZ] = 1 1 2 3 3
  VAL [1:NNZ] = a11 a31 a22 a23 a33

I Compressed Sparse Column (CSC) format
  IRN [1:NNZ] = 1 3 2 2 3
  VAL [1:NNZ] = a11 a31 a22 a23 a33
  COLPTR [1:N+1] = 1 3 4 6
  column J is stored in IRN/VAL locations COLPTR(J) ... COLPTR(J+1)-1

I Compressed Sparse Row (CSR) format:
  similar to CSC, but row by row

I Diagonal format:
  NDIAG = 3
  IDIAG = -2 0 1
  VAL = ( na  na  a31
          a11 a22 a33
          na  a23 na )
  (na: not accessed)

424/ 627


Page 452: High Performance Matrix Computations/Calcul Matriciel Haute

Sparse Matrix-vector products

Assume we want to compute Y ← AX.
Various algorithms for the matrix-vector product, depending on the sparse matrix format:

I Coordinate format:

  Y(1:N) = 0
  DO i = 1, NNZ
     Y(IRN(i)) = Y(IRN(i)) + VAL(i) * X(JCN(i))
  ENDDO

I CSC format:

  Y(1:N) = 0
  DO J = 1, N
     DO I = COLPTR(J), COLPTR(J+1)-1
        Y(IRN(I)) = Y(IRN(I)) + VAL(I) * X(J)
     ENDDO
  ENDDO

I Diagonal format:

  Y(1:N) = 0
  DO K = 1, NDIAG
     DO I = max(1, 1-IDIAG(K)), min(N, N-IDIAG(K))
        Y(I) = Y(I) + VAL(I,K) * X(I+IDIAG(K))
     END DO
  END DO
425/ 627


Page 455: High Performance Matrix Computations/Calcul Matriciel Haute

Example of elemental matrix format

       1 ( -1  2  3 )           3 ( 2 -1  3 )
  A1 = 2 (  2  1  1 ) ,   A2 =  4 ( 1  2 -1 )
       3 (  1  1  1 )           5 ( 3  2  1 )

I N=5, NELT=2, NVAR=6, and A = Σ_{i=1..NELT} Ai

  ELTPTR [1:NELT+1] = 1 4 7
  ELTVAR [1:NVAR]   = 1 2 3 3 4 5
  ELTVAL [1:NVAL]   = -1 2 1 2 1 1 3 1 1 2 1 3 -1 2 2 3 -1 1

I Remarks:
  I NVAR = ELTPTR(NELT+1) - 1
  I NVAL = Σ Si² (unsymmetric) or Σ Si(Si+1)/2 (symmetric), with Si = ELTPTR(i+1) - ELTPTR(i)
  I storage of elements in ELTVAL: by columns
426/ 627

Page 456: High Performance Matrix Computations/Calcul Matriciel Haute

File storage: Rutherford-Boeing

I Standard ASCII format for files
I Header + Data (CSC format). Key xyz:
  I x = [rcp] (real, complex, pattern)
  I y = [suhzr] (symmetric, unsymmetric, Hermitian, skew symmetric, rectangular)
  I z = [ae] (assembled, elemental)
  I ex: M T1.RSA, SHIP003.RSE

I Supplementary files: right-hand sides, solution, permutations, . . .

I Canonical format introduced to guarantee a unique representation (order of entries in each column, no duplicates).

427/ 627

Page 457: High Performance Matrix Computations/Calcul Matriciel Haute

File storage: Rutherford-Boeing

DNV-Ex 1 : Tubular joint-1999-01-17 M_T1

1733710 9758 492558 1231394 0

rsa 97578 97578 4925574 0

(10I8) (10I8) (3e26.16)

1 49 96 142 187 231 274 346 417 487

556 624 691 763 834 904 973 1041 1108 1180

1251 1321 1390 1458 1525 1573 1620 1666 1711 1755

1798 1870 1941 2011 2080 2148 2215 2287 2358 2428

2497 2565 2632 2704 2775 2845 2914 2982 3049 3115

...

1 2 3 4 5 6 7 8 9 10

11 12 49 50 51 52 53 54 55 56

57 58 59 60 67 68 69 70 71 72

223 224 225 226 227 228 229 230 231 232

233 234 433 434 435 436 437 438 2 3

4 5 6 7 8 9 10 11 12 49

50 51 52 53 54 55 56 57 58 59

...

-0.2624989288237320E+10 0.6622960540857440E+09 0.2362753266740760E+11

0.3372081648690030E+08 -0.4851430162799610E+08 0.1573652896140010E+08

0.1704332388419270E+10 -0.7300763190874110E+09 -0.7113520995891850E+10

0.1813048723097540E+08 0.2955124446119170E+07 -0.2606931100955540E+07

0.1606040913919180E+07 -0.2377860366909130E+08 -0.1105180386670390E+09

0.1610636280324100E+08 0.4230082475435230E+07 -0.1951280618776270E+07

0.4498200951891750E+08 0.2066239484615530E+09 0.3792237438608430E+08

0.9819999042370710E+08 0.3881169368090200E+08 -0.4624480572242580E+08

428/ 627

Page 458: High Performance Matrix Computations/Calcul Matriciel Haute

Introduction to Sparse Matrix Computations
  Motivation and main issues
  Sparse matrices
  Gaussian elimination

429/ 627

Page 459: High Performance Matrix Computations/Calcul Matriciel Haute

Gaussian elimination

A = A(1), b = b(1), A(1) x = b(1):

  ( a11 a12 a13 ) ( x1 )   ( b1 )      row 2 ← row 2 − row 1 × a21/a11
  ( a21 a22 a23 ) ( x2 ) = ( b2 )      row 3 ← row 3 − row 1 × a31/a11
  ( a31 a32 a33 ) ( x3 )   ( b3 )

A(2) x = b(2):

  ( a11   a12     a13   ) ( x1 )   ( b1    )      b2(2)  = b2 − a21 b1/a11 , . . .
  (  0   a22(2)  a23(2) ) ( x2 ) = ( b2(2) )      a32(2) = a32 − a31 a12/a11 , . . .
  (  0   a32(2)  a33(2) ) ( x3 )   ( b3(2) )

Finally A(3) x = b(3):

  ( a11   a12     a13   ) ( x1 )   ( b1    )
  (  0   a22(2)  a23(2) ) ( x2 ) = ( b2(2) )      a33(3) = a33(2) − a32(2) a23(2) / a22(2) , . . .
  (  0    0      a33(3) ) ( x3 )   ( b3(3) )

Typical Gaussian elimination step k:   aij(k+1) = aij(k) − aik(k) akj(k) / akk(k)

430/ 627

Page 460: High Performance Matrix Computations/Calcul Matriciel Haute

Relation with the A = LU factorization

I One step of Gaussian elimination can be written A(k+1) = L(k) A(k), where L(k) is the identity matrix except for column k, which contains −l(k+1,k), . . . , −l(n,k) below the diagonal, with

    lik = aik(k) / akk(k).

I Then A(n) = U = L(n−1) . . . L(1) A, which gives A = L U with

    L = [L(1)]⁻¹ . . . [L(n−1)]⁻¹,

  the unit lower triangular matrix whose entry (i,j) below the diagonal is li,j.

I In dense codes, the entries of L and U overwrite the entries of A.

I Furthermore, if A is symmetric, A = LDLᵀ with dkk = akk(k):

  A = LU = Aᵀ = UᵀLᵀ implies U(Lᵀ)⁻¹ = L⁻¹Uᵀ = D diagonal, and U = DLᵀ, thus A = L(DLᵀ) = LDLᵀ

Page 461: High Performance Matrix Computations/Calcul Matriciel Haute

Gaussian elimination and sparsity

Step k of LU factorization (akk pivot):

I For i > k, compute lik = aik/akk (= a′ik)

I For i > k, j > k:

    a′ij = aij − (aik × akj) / akk
  or
    a′ij = aij − lik × akj

I If aik ≠ 0 and akj ≠ 0 then a′ij ≠ 0

I If aij was zero → its new non-zero value must be stored: this is fill-in

(Figure: at step k, a nonzero aik in column k and a nonzero akj in row k create a new nonzero — a fill-in — in position (i,j) where aij was zero.)

432/ 627

Page 462: High Performance Matrix Computations/Calcul Matriciel Haute

I The same holds for Cholesky:

  I For i > k, compute lik = aik / √akk (= a′ik)

  I For i > k, j > k, j ≤ i (lower triangle):

      a′ij = aij − (aik × ajk) / akk
    or
      a′ij = aij − lik × ljk
433/ 627

Page 463: High Performance Matrix Computations/Calcul Matriciel Haute

Example

I Original matrix

x x x x x
x x
x   x
x     x
x       x

I Matrix is full after the first step of elimination

I After reordering the matrix (1st row and column ↔ last row and column)

434/ 627

Page 464: High Performance Matrix Computations/Calcul Matriciel Haute

x       x
  x     x
    x   x
      x x
x x x x x

I No fill-in
I Ordering the variables has a strong impact on
  I the fill-in
  I the number of operations

435/ 627

Page 465: High Performance Matrix Computations/Calcul Matriciel Haute

Table: Benefits of sparsity on a matrix of order 2021 x 2021 with 7353 nonzeros (Dongarra et al., 1991).

Procedure                      Total storage   Flops         Time (sec.) on CRAY J90
Full system                    4084 Kwords     5503 × 10⁶    34.5
Sparse system                    71 Kwords     1073 × 10⁶     3.4
Sparse system and reordering     14 Kwords       42 × 10³     0.9

436/ 627

Page 466: High Performance Matrix Computations/Calcul Matriciel Haute

Efficient implementation of sparse solvers

I Indirect addressing is often used in sparse calculations, e.g. sparse SAXPY:

  do i = 1, m
     A( ind(i) ) = A( ind(i) ) + alpha * w( i )
  enddo

I Even if manufacturers provide hardware for improving indirect addressing,
  it penalizes the performance

I Solution: switch to dense calculations as soon as the matrix is not sparse enough

437/ 627

Page 467: High Performance Matrix Computations/Calcul Matriciel Haute

Effect of switch to dense calculations

Matrix from a 5-point discretization of the Laplacian on a 50 × 50 grid (Dongarra et al., 1991)

Density for switch   Order of       Millions   Time
to full code         full matrix    of flops   (sec.)
No switch                  0            7       21.8
1.00                      74            7       21.4
0.80                     190            8       15.0
0.60                     235           11       12.5
0.40                     305           21        9.0
0.20                     422           50        5.5
0.10                     531          100        3.7
0.005                   1420         1908        6.1

Sparse structure should be exploited if density < 10%.

438/ 627

Page 468: High Performance Matrix Computations/Calcul Matriciel Haute

Outline

Ordering sparse matrices
  Objectives/Outline
  Fill-reducing orderings

439/ 627

Page 469: High Performance Matrix Computations/Calcul Matriciel Haute

Ordering sparse matrices
  Objectives/Outline
  Fill-reducing orderings

440/ 627

Page 470: High Performance Matrix Computations/Calcul Matriciel Haute

Ordering sparse matrices: objectives/outline

I Reduce fill-in and the number of operations during factorization (local and global heuristics):
  I Increase parallelism (wide tree)
  I Decrease memory usage (deep tree)
  I Equivalent orderings (traverse the tree to minimize working memory)

I Reorder unsymmetric matrices to special forms:
  I block upper triangular matrix;
  I with (large) non-zero entries on the diagonal (maximum transversal).

I Combining approaches

441/ 627

Page 471: High Performance Matrix Computations/Calcul Matriciel Haute

Ordering sparse matrices
  Objectives/Outline
  Fill-reducing orderings

442/ 627

Page 472: High Performance Matrix Computations/Calcul Matriciel Haute

Fill-reducing orderings

Three main classes of methods for minimizing fill-in during factorization

I Global approach: the matrix is permuted into a matrix with a given pattern
  I Fill-in is restricted to occur within that structure
  I Cuthill-McKee (block tridiagonal matrix)
  I Nested dissections ("block bordered" matrix)

443/ 627

Page 473: High Performance Matrix Computations/Calcul Matriciel Haute

Fill-reducing orderings

I Local heuristics: at each step of the factorization, selection of the pivot that is likely to minimize fill-in.
  I The method is characterized by the way pivots are selected.
  I Markowitz criterion (for a general matrix).
  I Minimum degree (for symmetric matrices).

I Hybrid approaches: once the matrix is permuted in order to obtain a block structure, local heuristics are used within the blocks.

443/ 627

Page 474: High Performance Matrix Computations/Calcul Matriciel Haute

Cuthill-McKee and Reverse Cuthill-McKee

Consider the matrix:

      ( x x   x x   )
      ( x x         )
  A = (     x x x   )
      ( x   x x   x )
      ( x   x   x   )
      (       x   x )

The corresponding graph has vertices 1, . . . , 6 and edges 1-2, 1-4, 1-5, 3-4, 3-5, 4-6.

444/ 627

Page 475: High Performance Matrix Computations/Calcul Matriciel Haute

Cuthill-McKee algorithm

I Goal: reduce the profile/bandwidth of the matrix (the fill is restricted to the band structure)

I Level sets (as in a Breadth First Search) are built from the vertex of minimum degree (priority to the vertex of smallest number).
  We get: S1 = {2}, S2 = {1}, S3 = {4, 5}, S4 = {3, 6}, and thus the ordering 2, 1, 4, 5, 3, 6.

The reordered matrix is:

      ( x x         )
      ( x x x x     )
  A = (   x x   x x )
      (   x   x x   )
      (     x x x   )
      (     x     x )
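A minimal sketch of the level-set construction: a plain breadth-first search from the starting vertex (XADJ/ADJ are assumed CSR-like adjacency arrays; ties inside a level are simply taken in insertion order, whereas a production Cuthill-McKee code would also sort neighbours by degree):

  subroutine cuthill_mckee(n, xadj, adj, start, perm)
    implicit none
    integer, intent(in)  :: n, xadj(n+1), adj(*), start
    integer, intent(out) :: perm(n)
    logical :: visited(n)
    integer :: head, tail, v, k, w
    visited = .false.
    perm(1) = start; visited(start) = .true.
    head = 1; tail = 1
    do while (head <= tail)
       v = perm(head); head = head + 1
       do k = xadj(v), xadj(v+1) - 1
          w = adj(k)
          if (.not. visited(w)) then
             tail = tail + 1
             perm(tail) = w
             visited(w) = .true.
          end if
       end do
    end do
    ! Reverse Cuthill-McKee is simply perm(n:1:-1).
  end subroutine cuthill_mckee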

445/ 627

Page 476: High Performance Matrix Computations/Calcul Matriciel Haute

Reverse Cuthill-McKee

I The ordering is the reverse of that obtained using Cuthill-McKee, i.e. on the example: 6, 3, 5, 4, 1, 2

I The reordered matrix is:

      ( x     x     )
      (   x x x     )
  A = (   x x   x   )
      ( x x   x x   )
      (     x x x x )
      (         x x )

I More efficient than Cuthill-McKee at reducing the envelope of the matrix.

446/ 627

Page 477: High Performance Matrix Computations/Calcul Matriciel Haute

Illustration: Reverse Cuthill-McKee on matrix dwt 592.rua

Harwell-Boeing matrix dwt 592.rua, structural computing on a submarine. NZ(LU factors) = 58202

(Figure: sparsity patterns of the original matrix (nz = 5104) and of the factorized matrix (nz = 58202).)

447/ 627

Page 478: High Performance Matrix Computations/Calcul Matriciel Haute

Illustration: Reverse Cuthill-McKee on matrix dwt 592.rua

NZ(LU factors) = 16924

(Figure: sparsity patterns of the RCM-permuted matrix (nz = 5104) and of the factorized permuted matrix (nz = 16924).)

447/ 627

Page 479: High Performance Matrix Computations/Calcul Matriciel Haute

Nested Dissection

Recursive approach based on graph partitioning.

(Figure: graph partitioning with separators S1, S2, S3; in the permuted matrix, the subdomain blocks come first and the separator variables are ordered last, giving a "block bordered" structure.)

448/ 627

Page 480: High Performance Matrix Computations/Calcul Matriciel Haute

Local heuristics to reduce fill-in during factorization

Let G(A) be the graph associated with a matrix A that we want to order using local heuristics.
Let Metric be such that Metric(vi) < Metric(vj) implies that vi is a better pivot candidate than vj.

Generic algorithm
Loop until all nodes are selected:
  Step 1: select the current node p (so-called pivot) with minimum metric value,
  Step 2: update the elimination graph,
  Step 3: update Metric(vj) for all non-selected nodes vj.

Step 3 should only be applied to nodes for which the Metric value might have changed.

449/ 627

Page 481: High Performance Matrix Computations/Calcul Matriciel Haute

Reordering unsymmetric matrices: the Markowitz criterion

I At step k of Gaussian elimination (Ak is the remaining active submatrix, the first k-1 rows/columns of L and U being already computed):

  I r_i^k = number of non-zeros in row i of Ak
  I c_j^k = number of non-zeros in column j of Ak
  I A candidate pivot aij must be large enough and should minimize (r_i^k − 1) × (c_j^k − 1), for all i, j ≥ k

I Minimum degree: Markowitz criterion for symmetric diagonally dominant matrices

450/ 627

Page 482: High Performance Matrix Computations/Calcul Matriciel Haute

Minimum degree algorithm

I Step 1: select the vertex that possesses the smallest number of neighbors in G0.

(Figure: (a) a sparse symmetric matrix of order 10 and (b) its elimination graph.)

The node/variable selected is 1, of degree 2.

451/ 627

Page 483: High Performance Matrix Computations/Calcul Matriciel Haute

I Notation for the elimination graph:
  I Let Gk = (Vk, Ek) be the graph built at step k.
  I Gk describes the structure of Ak after eliminating k pivots.
  I Gk is non-oriented (Ak is symmetric).
  I Fill-in in Ak ≡ adding edges in the graph.

452/ 627

Page 484: High Performance Matrix Computations/Calcul Matriciel Haute

Illustration

Step 1: elimination of pivot 1

(Figure: (a) the elimination graph after removing pivot 1 and (b) the factors and active submatrix; initial nonzeros, fill-in, and nonzeros in the factors are distinguished.)

Page 485: High Performance Matrix Computations/Calcul Matriciel Haute

Minimum degree algorithm applied to the graph:

I Step k: select the node with the smallest number of neighbors.

I Gk is built from Gk−1 by suppressing the pivot and adding the edges corresponding to fill-in.

454/ 627

Page 486: High Performance Matrix Computations/Calcul Matriciel Haute

Illustration (cont’d)

Graphs G1, G2, G3 and the corresponding reduced matrices.

(Figure: (a) successive elimination graphs and (b) factors and active submatrices; original nonzeros, modified original nonzeros, fill-in, and nonzeros in the factors are distinguished.)

455/ 627

Page 487: High Performance Matrix Computations/Calcul Matriciel Haute

Minimum Degree does not always minimize fill-in !!!

Consider the following matrix and its corresponding elimination graph.

(Figure: a 9 x 9 example. Remark: using the initial ordering produces no fill-in. Step 1 of Minimum Degree selects pivot 5 (minimum degree = 2); updating the graph adds the edge (4,6), i.e. fill-in.)

456/ 627

Page 488: High Performance Matrix Computations/Calcul Matriciel Haute

Efficient implementation of Minimum Degree

Reduce the time complexity

1. Accelerate the selection of pivots and the update of the graph:

  I 1.1 Supervariables (or indistinguishable nodes): if several variables have the same adjacency structure in Gk, they can be eliminated simultaneously.

  I 1.2 Two non-adjacent nodes of the same degree can be eliminated simultaneously (multiple eliminations).

  I 1.3 The degree update of the neighbours of the pivot can be done in an approximate way (Approximate Minimum Degree).

457/ 627

Page 489: High Performance Matrix Computations/Calcul Matriciel Haute

Reduce the memory complexity

2. Decrease the size of the working space
   Using the elimination graph, the working space is of order O(#nonzeros in factors).

I Fill-in: let pivot be the pivot at step k.
  If i ∈ Adj_{Gk−1}(pivot) then Adj_{Gk−1}(pivot) ⊂ Adj_{Gk}(i):
  the structure of the pivot column is included in the filled structure of column i.

I We can then use an implicit representation of fill-in by defining the notion of element (variable already eliminated) and the quotient graph. A variable of the quotient graph is adjacent to variables and elements.

I One can show that, for all k ∈ [1 . . . N], the size of the quotient graph is O(G0).

458/ 627

Page 490: High Performance Matrix Computations/Calcul Matriciel Haute

Influence on the structure of factors

Harwell-Boeing matrix dwt 592.rua, structural computing on a submarine. NZ(LU factors) = 58202

(Figure: sparsity pattern of the original matrix, nz = 5104.)

459/ 627

Page 491: High Performance Matrix Computations/Calcul Matriciel Haute

Structure of the factors after permutation

(Figure: factors obtained with Minimum Degree (nz = 15110) and with MMD (1.1+1.2+2) (nz = 14838).)

The detection of supervariables allows one to build more regularly structured factors (easier factorization).

460/ 627

Page 492: High Performance Matrix Computations/Calcul Matriciel Haute

Comparison of 3 implementations of Minimum Degree

I Let V0 be the initial algorithm (based on the elimination graph)

I MMD: the version including 1.1 + 1.2 + 2 (Multiple Minimum Degree, Liu 1985, 1989), used in MATLAB

I AMD: the version including 1.1 + 1.3 + 2 (Approximate Minimum Degree, Amestoy, Davis, Duff 1995).

461/ 627

Page 493: High Performance Matrix Computations/Calcul Matriciel Haute

Execution times (secs) on a SUN Sparc 10

Matrix      Order    Nonzeros          Minimum Degree
                                  V0       MMD      AMD
dwt 2680    2680     13853        35       0.2      0.2
  Min. memory size                250KB    110KB    110KB
Wang4       26068    75552        -        11       5
Orani678    2529     85426        -        125      5

I Fill-in is similar

I Memory space for MMD and AMD: ≈ 2 × NZ integers

I V0 was not able to perform the reordering for the last 2 matrices (lack of memory after 2 hours of computation)

462/ 627

Page 494: High Performance Matrix Computations/Calcul Matriciel Haute

Minimum fill-in heuristics

Recalling the generic algorithm
Let G(A) be the graph associated with a matrix A that we want to order using local heuristics.
Let Metric be such that Metric(v_i) < Metric(v_j) ≡ v_i is a better pivot candidate than v_j.

Generic algorithm
Loop until all nodes are selected
  Step 1: select the current node p (the so-called pivot) with minimum metric value,
  Step 2: update the elimination (or quotient) graph,
  Step 3: update Metric(v_j) for all non-selected nodes v_j.
Step 3 should only be applied to nodes whose Metric value might have changed. (A small sketch of this generic loop is given below.)
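To make the loop concrete, here is a minimal Python sketch of this generic greedy ordering (an assumed toy representation of the graph as a dict of adjacency sets, not the course's code); with the degree as metric it reduces to basic Minimum Degree.

```python
def greedy_order(adj, metric=lambda adj, v: len(adj[v])):
    """Generic local-heuristic ordering; default metric = current degree."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    order = []
    while adj:
        p = min(adj, key=lambda v: metric(adj, v))    # Step 1: best pivot
        order.append(p)
        nbrs = adj.pop(p)                             # Step 2: eliminate p,
        for u in nbrs:                                #   its neighbours form a clique
            adj[u].discard(p)
            adj[u] |= (nbrs - {u})
        # Step 3 is done lazily here: metrics are recomputed at selection time;
        # a real code would update only the nodes whose metric may have changed.
    return order

# 4-cycle example: every node has degree 2; eliminating one adds a fill edge
print(greedy_order({1: {2, 4}, 2: {1, 3}, 3: {2, 4}, 4: {1, 3}}))
```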

463/ 627

Page 495: High Performance Matrix Computations/Calcul Matriciel Haute

Minimum fill based algorithm

I Metric(v_i) is the amount of fill-in that v_i would introduce if it were selected as a pivot.

I Illustration: r has degree d = 4 and a fill-in metric of d × (d − 1)/2 = 6, whereas s has degree d = 5 but a fill-in metric of d × (d − 1)/2 − 9 = 1.

[Figure: graph with nodes r, s, i1–i5, j1–j4 illustrating the two fill-in metrics.]

464/ 627

Page 496: High Performance Matrix Computations/Calcul Matriciel Haute

Minimum fill-in properties

I The situation typically occurs when {i1, i2, i3} and {i2, i3, i4, i5} were adjacent to two already selected nodes (here e2 and e1).

[Figure: the same graph before and after the elimination of the previously selected nodes e1 and e2.]

I The elimination of a node v_k affects the degree of the nodes adjacent to v_k. The fill-in metric of Adj(Adj(v_k)) is also affected.

I Illustration: selecting r affects the fill-in metric of i1 (because of the fill edge (j3, j4)).

Page 497: High Performance Matrix Computations/Calcul Matriciel Haute

How to compute the fill-in metrics

Computing the exact minimum fill-in metric is too costly

I Only nodes adjacent to current pivot are updated.

I Only approximate metrics (using clique structures) are computed.

I Let d_k be the degree of node k; d_k × (d_k − 1)/2 is an upper bound of the fill (s → d_s = 5 → d_s × (d_s − 1)/2 = 10).

I Several possibilities:

1. Deduce the clique area of the "last" selected pivot adjacent to k (s → clique of e2).

2. Deduce the largest clique area over all adjacent selected pivots (s → clique of e1).

3. If instead we use AMD for d_k, then the cliques of all adjacent selected pivots can be deduced.

466/ 627

Page 498: High Performance Matrix Computations/Calcul Matriciel Haute

Outline

Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachImpact of fill reduction algorithm on the shape of the treePostorderings and memory usageEquivalent orderings and elimination treesComparison between 3 approaches for LU factorizationTask mapping and schedulingDistributed memory approachesSome parallel solversCase study: comparison of MUMPS and SuperLUConcluding remarks

467/ 627

Page 499: High Performance Matrix Computations/Calcul Matriciel Haute

Factorization of sparse matrices

Outline

1. Introduction

2. Elimination tree and multifrontal method

3. Comparison between multifrontal,frontal and generalapproaches for LU factorization

4. Task mapping and scheduling

5. Distributed memory approaches: fan-in, fan-out, multifrontal

6. Some parallel solvers; case study on MUMPS and SuperLU.

7. Concluding remarks

468/ 627

Page 500: High Performance Matrix Computations/Calcul Matriciel Haute

Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachImpact of fill reduction algorithm on the shape of the treePostorderings and memory usageEquivalent orderings and elimination treesComparison between 3 approaches for LU factorizationTask mapping and schedulingDistributed memory approachesSome parallel solversCase study: comparison of MUMPS and SuperLUConcluding remarks

469/ 627

Page 501: High Performance Matrix Computations/Calcul Matriciel Haute

Recalling the Gaussian elimination

Step k of LU factorization (akk pivot):

I For i > k compute l_ik = a_ik / a_kk (= a'_ik),

I For i > k, j > k such that a_ik and a_kj are nonzero:

  a'_ij = a_ij − (a_ik × a_kj) / a_kk

I If a_ik ≠ 0 and a_kj ≠ 0 then a'_ij ≠ 0

I If a_ij was zero → its new nonzero value must be stored

I Orderings (minimum degree, Cuthill-McKee, ND) limit the fill-in and the number of operations, and modify the task graph
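As a tiny illustration (an assumed 3×3 example, not from the slides), one elimination step creates nonzeros exactly where a_ik and a_kj are both nonzero:

```python
import numpy as np

A = np.array([[4.0, 1.0, 2.0],
              [1.0, 3.0, 0.0],   # a_23 = 0
              [2.0, 0.0, 5.0]])  # a_32 = 0
k = 0
for i in range(k + 1, 3):
    lik = A[i, k] / A[k, k]                 # l_ik
    A[i, k + 1:] -= lik * A[k, k + 1:]      # a'_ij = a_ij - a_ik*a_kj/a_kk
    A[i, k] = lik                           # store the multiplier in place
print(A)   # positions (2,3) and (3,2) became nonzero: fill-in
```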

470/ 627

Page 502: High Performance Matrix Computations/Calcul Matriciel Haute

Three-phase scheme to solve Ax = b

1. Analysis step
   I Preprocessing of A (symmetric/unsymmetric orderings, scalings)
   I Build the dependency graph (elimination tree, eDAG, ...)

2. Factorization (A = LU, LDL^T, LL^T, QR)
   Numerical pivoting

3. Solution based on factored matrices
   I Triangular solves: Ly = b, then Ux = y
   I Improvement of the solution (iterative refinement), error analysis
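The same three phases appear in library interfaces; for instance, a rough sketch with SciPy's SuperLU wrapper (illustrative use, random test matrix assumed):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

A = sp.random(100, 100, density=0.05, format="csc") + 10 * sp.eye(100, format="csc")
b = np.ones(100)

lu = spla.splu(A, permc_spec="COLAMD")   # analysis (ordering) + factorization
x = lu.solve(b)                          # solution: Ly = Pb, then Ux = y
print(np.linalg.norm(A @ x - b))         # residual check
```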

471/ 627

Page 503: High Performance Matrix Computations/Calcul Matriciel Haute

Control of numerical stability: numerical pivoting

I In dense linear algebra partial pivoting commonly used (ateach step the largest entry in the column is selected).

I In sparse linear algebra, flexibility to preserve sparsity isoffered :

I Partial threshold pivoting: eligible pivots are not too small with respect to the maximum in the column.

  Set of eligible pivots = { r : |a^(k)_{rk}| ≥ u × max_i |a^(k)_{ik}| }, where 0 < u ≤ 1.

I Then, among eligible pivots, select one that better preserves sparsity.

I u is called the threshold parameter (u = 1 → partial pivoting).
I It restricts the maximum possible growth of a_ij = a_ij − (a_ik × a_kj)/a_kk.
I u ≈ 0.1 is often chosen in practice.

I Symmetric indefinite case: requires 2-by-2 pivots, e.g. ( 0 1 ; 1 0 ).

Page 504: High Performance Matrix Computations/Calcul Matriciel Haute

Threshold pivoting and numerical accuracy

Table: Effect of the variation of the threshold parameter u on a 541 × 541 matrix with 4285 nonzeros (Dongarra et al., 91).

u          Nonzeros in LU factors     Error
1.0               16767               3 × 10^-9
0.25              14249               6 × 10^-10
0.1               13660               4 × 10^-9
0.01              15045               1 × 10^-5
10^-4             16198               1 × 10^2
10^-10            16553               3 × 10^23

473/ 627

Page 505: High Performance Matrix Computations/Calcul Matriciel Haute

Iterative refinement for linear systems

Suppose that a solver has computed A = LU (or LDL^T or LL^T), and a solution x to Ax = b.

1. Compute r = b− Ax.

2. Solve LU δx = r.

3. Update x = x + δx.

4. Repeat if necessary/useful.
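A minimal sketch of this loop, reusing the factors computed by SciPy's splu (illustrative; a production code would compute the residual in higher precision):

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

def refine(A, b, lu, steps=3):
    x = lu.solve(b)
    for _ in range(steps):
        r = b - A @ x                 # 1. residual
        dx = lu.solve(r)              # 2. LU dx = r (two triangular solves)
        x = x + dx                    # 3. update
        if np.linalg.norm(r) <= 1e-14 * np.linalg.norm(b):
            break                     # 4. stop when it is no longer useful
    return x

A = sp.random(200, 200, density=0.02, format="csc") + 10 * sp.eye(200, format="csc")
b = np.ones(200)
x = refine(A, b, spla.splu(A))
```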

474/ 627

Page 506: High Performance Matrix Computations/Calcul Matriciel Haute

Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachImpact of fill reduction algorithm on the shape of the treePostorderings and memory usageEquivalent orderings and elimination treesComparison between 3 approaches for LU factorizationTask mapping and schedulingDistributed memory approachesSome parallel solversCase study: comparison of MUMPS and SuperLUConcluding remarks

475/ 627

Page 507: High Performance Matrix Computations/Calcul Matriciel Haute

Elimination tree and Multifrontal approach

We recall that:

I The elimination tree expresses dependencies between thevarious steps of the factorization.

I It also exhibits parallelism arising from the sparse structure ofthe matrix.

Building the elimination tree

I Permute matrix (to reduce fill-in) PAPT.

I Build filled matrix AF = L + LT where PAPT = LLT

I Transitive reduction of associated filled graph

→ Each column corresponds to a node of the graph. Each nodek of the tree corresponds to the factorization of a frontalmatrix whose row structure is that of column k of AF .

476/ 627

Page 508: High Performance Matrix Computations/Calcul Matriciel Haute

Illustration of multifrontal factorization

We assume pivots are chosen down the diagonal in order.

[Figure: filled matrix of a 4×4 example (fill-in entries marked F) and the corresponding elimination graph, with node structures {1,3,4}, {2,3,4}, {3,4} and {4}.]

Treatment at each node:

I Assembly of the frontal matrix using the contributions fromthe sons.

I Gaussian elimination on the frontal matrix

477/ 627

Page 509: High Performance Matrix Computations/Calcul Matriciel Haute

I Elimination of variable 1 (a_11 pivot)
I Assembly of the frontal matrix:

      1   3   4
  1   x   x   x
  3   x
  4   x

I Contributions: a^(1)_ij = −(a_i1 × a_1j)/a_11 for i > 1, j > 1, on a_33, a_44, a_34 and a_43:

  a^(1)_33 = −(a_31 × a_13)/a_11     a^(1)_34 = −(a_31 × a_14)/a_11
  a^(1)_43 = −(a_41 × a_13)/a_11     a^(1)_44 = −(a_41 × a_14)/a_11

The terms −(a_i1 × a_1j)/a_11 of the contribution matrix are stored for later updates.

478/ 627

Page 510: High Performance Matrix Computations/Calcul Matriciel Haute

I Elimination of variable 2 (a_22 pivot)
I Assembly of the frontal matrix: update of the elements of the pivot row and column using contributions from previous updates (none here)

      2   3   4
  2   x   x   x
  3   x
  4   x

I Contributions on a_33, a_34, a_43 and a_44:

  a^(2)_33 = −(a_32 × a_23)/a_22     a^(2)_34 = −(a_32 × a_24)/a_22
  a^(2)_43 = −(a_42 × a_23)/a_22     a^(2)_44 = −(a_42 × a_24)/a_22

479/ 627

Page 511: High Performance Matrix Computations/Calcul Matriciel Haute

I Elimination of variable 3
I Assembly of the frontal matrix

Update using the previous contributions:

  a'_33 = a_33 + a^(1)_33 + a^(2)_33
  a'_34 = a_34 + a^(1)_34 + a^(2)_34   (a_34 = 0)
  a'_43 = a_43 + a^(1)_43 + a^(2)_43   (a_43 = 0)
  a'_44 = a^(1)_44 + a^(2)_44, stored as a so-called contribution matrix.

480/ 627

Page 512: High Performance Matrix Computations/Calcul Matriciel Haute

Note that a_44 is only partially summed, since contributions are transferred only between a son and its father.

      3   4
  3   x   x
  4   x

I Contribution on a_44: a^(3)_44 = a'_44 − (a'_43 × a'_34)/a'_33
I Elimination of variable 4: the frontal matrix involves only a_44: a_44 = a_44 + a^(3)_44
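The dense kernel performed at each node can be sketched as follows (assumed shapes; the npiv fully summed variables are eliminated and the Schur complement is the contribution block sent to the father):

```python
import numpy as np

def partial_factor(F, npiv):
    """Partial (unsymmetric) Gaussian elimination of a frontal matrix."""
    F = F.astype(float).copy()
    for k in range(npiv):                        # only fully summed variables
        F[k + 1:, k] /= F[k, k]                  # column of L (multipliers)
        F[k + 1:, k + 1:] -= np.outer(F[k + 1:, k], F[k, k + 1:])
    return F, F[npiv:, npiv:]                    # factors + contribution block

# e.g. a 3x3 front {1,3,4}: one pivot, a 2x2 contribution block
front = np.array([[2.0, 1.0, 1.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
factors, cb = partial_factor(front, npiv=1)
print(cb)    # the terms -(a_i1*a_1j)/a_11 stored for later updates
```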

481/ 627

Page 513: High Performance Matrix Computations/Calcul Matriciel Haute

The multifrontal method (Duff, Reid’83)

[Figure: 5×5 matrix A and the pattern of L + U − I, with the fill-in entries marked.]

Memory is divided into two parts (that can overlap in time):

I the factors

I the active memory (the active frontal matrix plus a stack of contribution blocks)

[Figure: active memory layout (factors / stack of contribution blocks / active frontal matrix) and the elimination tree, which represents the task dependencies.]

482/ 627

Page 514: High Performance Matrix Computations/Calcul Matriciel Haute

Multifrontal method

From children to parent

I ASSEMBLY: gather/scatter operations (indirect addressing)

I ELIMINATION: dense partial Gaussian elimination, Level 3 BLAS (TRSM, GEMM)

I CONTRIBUTION to parent

483/ 627


Page 520: High Performance Matrix Computations/Calcul Matriciel Haute

Supernodal methods

Definition

A supernode (or supervariable) is a set of contiguous columns inthe factors L that share essentially the same sparsity structure.

I All algorithms (ordering, symbolic factor., factor., solve)generalize to blocked versions.

I Use of efficient matrix-matrix kernels (improve cache usage).

I Same concept as supervariables for elimination tree/minimumdegree ordering.

I Supernodes and pivoting: pivoting inside a supernode doesnot increase fill-in.

484/ 627

Page 521: High Performance Matrix Computations/Calcul Matriciel Haute

Amalgamation

I Goal
  I Exploit a more regular structure in the original matrix
  I Decrease the amount of indirect addressing
  I Increase the size of frontal matrices

I How?
  I Relax the number of nonzeros of the matrix
  I Amalgamation of nodes of the elimination tree

485/ 627

Page 522: High Performance Matrix Computations/Calcul Matriciel Haute

I Consequences?
  I Increase in the total amount of flops
  I But decrease of indirect addressing
  I And increase in performance

I Remark: if {i, i1, i2, ..., i_f} is a son of node {j, j1, j2, ..., j_p} and if {j, j1, j2, ..., j_p} ⊂ {i1, i2, ..., i_f}, then the amalgamation of i and j is without fill-in.

Amalgamation of supernodes (same lower diagonal structure) is without fill-in.

486/ 627

Page 523: High Performance Matrix Computations/Calcul Matriciel Haute

Illustration of amalgamation

[Figure: original 6×6 matrix (fill-in entries marked F) and its elimination tree, with node structures {1,4}, {2,4,5}, {3,5,6} (leaves), then {4,5,6}, {5,6} and {6} (root).]

Structure of node i = frontal matrix, noted {i, i1, i2, ..., i_f}.

487/ 627

Page 524: High Performance Matrix Computations/Calcul Matriciel Haute

Illustration of amalgamation

Amalgamation

[Figure: two amalgamated trees – one amalgamation WITHOUT fill-in, and one WITH fill-in where the merged node {3,4,5,6} creates the fill-in entries (3,4) and (4,3); the leaves {1,4} and {2,4,5} are unchanged.]

488/ 627

Page 525: High Performance Matrix Computations/Calcul Matriciel Haute

Amalgamation and Supervariables

Amalgamation of supervariables does not cause fill-in.
Initial graph:

[Figure: graph with nodes 1–13.]

Reordering: 1, 3, 4, 2, 6, 8, 10, 11, 5, 7, 9, 12, 13
Supervariables: {1, 3, 4}; {2, 6, 8}; {10, 11}; {5, 7, 9, 12, 13}

489/ 627

Page 526: High Performance Matrix Computations/Calcul Matriciel Haute

Supervariables and multifrontal method

AT EACH NODE

F_22 ← F_22 − F_12^T F_11^{-1} F_12

Pivots can ONLY be chosen from the F_11 block, since F_22 is NOT fully summed.

490/ 627

Page 527: High Performance Matrix Computations/Calcul Matriciel Haute

Parallelization: two levels of parallelism

I Arising from sparsity: between nodes of the elimination tree (first level of parallelism)

I Within each node: parallel dense LU factorization (BLAS) (second level of parallelism)

[Figure: elimination tree of LU fronts – going up the tree, node parallelism increases while tree parallelism decreases.]

491/ 627

Page 528: High Performance Matrix Computations/Calcul Matriciel Haute

Exploiting the second level of parallelism is crucial

Multifrontal factorization

Computer         #procs    (1) MFlops (speed-up)    (2) MFlops (speed-up)
Alliant FX/80       8            15 (1.9)                 34 (4.3)
IBM 3090J/6VF       6           126 (2.1)                227 (3.8)
CRAY-2              4           316 (1.8)                404 (2.3)
CRAY Y-MP           6           529 (2.3)               1119 (4.8)

Performance summary of the multifrontal factorization on matrix BCSSTK15.In column (1), we exploit only parallelism from the tree. In column (2), wecombine the two levels of parallelism.

492/ 627

Page 529: High Performance Matrix Computations/Calcul Matriciel Haute

Other features

I Dynamic management of parallelism:
  I Pool of tasks for exploiting the two levels of parallelism
  I Assembly operations are also parallel (but involve indirect addressing)

I Dynamic management of data
  I Storage of the LU factors, frontal and contribution matrices
  I The amount of memory available may conflict with exploiting maximum parallelism

493/ 627

Page 530: High Performance Matrix Computations/Calcul Matriciel Haute

Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachImpact of fill reduction algorithm on the shape of the treePostorderings and memory usageEquivalent orderings and elimination treesComparison between 3 approaches for LU factorizationTask mapping and schedulingDistributed memory approachesSome parallel solversCase study: comparison of MUMPS and SuperLUConcluding remarks

494/ 627

Page 531: High Performance Matrix Computations/Calcul Matriciel Haute

Impact of fill reduction on the shape of the tree

Reordering technique   Shape of the tree – observations

AMD
  I Deep, well-balanced
  I Large frontal matrices on top

AMF
  I Very deep, unbalanced
  I Small frontal matrices

495/ 627

Page 532: High Performance Matrix Computations/Calcul Matriciel Haute

Reordering technique   Shape of the tree – observations

PORD
  I Deep, unbalanced
  I Small frontal matrices

SCOTCH
  I Very wide, well-balanced
  I Large frontal matrices

METIS
  I Wide, well-balanced
  I Smaller frontal matrices (than SCOTCH)

Page 533: High Performance Matrix Computations/Calcul Matriciel Haute

Importance of the shape of the tree

Suppose that each node in the tree corresponds to a task that:
  - consumes temporary data from the children,
  - produces temporary data, which is passed to the parent node.

I Wide tree
  I Good parallelism
  I Many temporary blocks to store
  I Large memory usage

I Deep tree
  I Less parallelism
  I Smaller memory usage

497/ 627

Page 534: High Performance Matrix Computations/Calcul Matriciel Haute

Impact of fill-reducing heuristics

Size of factors (millions of entries)

           METIS    SCOTCH    PORD     AMF      AMD
gupta2      8.55     12.97     9.77     7.96     8.08
ship 003   73.34     79.80    73.57    68.52    91.42
twotone    25.04     25.64    28.38    22.65    22.12
wang3       7.65      9.74     7.99     8.90    11.48
xenon2     94.93    100.87   107.20   144.32   159.74

Peak of active memory (millions of entries)

           METIS    SCOTCH    PORD     AMF      AMD
gupta2     58.33    289.67    78.13    33.61    52.09
ship 003   25.09     23.06    20.86    20.77    32.02
twotone    13.24     13.54    11.80    11.63    17.59
wang3       3.28      3.84     2.75     3.62     6.14
xenon2     14.89     15.21    13.14    23.82    37.82

498/ 627

Page 535: High Performance Matrix Computations/Calcul Matriciel Haute

Impact of fill-reducing heuristics

Number of operations (millions)

           METIS      SCOTCH      PORD        AMF         AMD
gupta2      2757.8     4510.7     4993.3      2790.3      2663.9
ship 003   83828.2    92614.0   112519.6     96445.2    155725.5
twotone    29120.3    27764.7    37167.4     29847.5     29552.9
wang3       4313.1     5801.7     5009.9      6318.0     10492.2
xenon2     99273.1   112213.4   126349.7    237451.3    298363.5

Matrix coneshl (SAMTECH, 1 million equations)

Matrix    Ordering   Factor entries   Total memory required   Floating-point operations
coneshl   METIS       687 × 10^6         8.9 GBytes               1.6 × 10^12
          PORD        746 × 10^6         8.4 GBytes               2.2 × 10^12

499/ 627

Page 536: High Performance Matrix Computations/Calcul Matriciel Haute

Impact of fill-reducing heuristics/MUMPS

Time for factorization (seconds)

                   1p      16p     32p     64p     128p
coneshl   METIS    970      60      41      27      14
          PORD    1264     104      67      41      26
audi      METIS   2640     198     108      70      42
          PORD    1599     186     146      83      54

Matrices with quasi-dense rows: impact on the ordering time (seconds) for the gupta2 matrix

            AMD    METIS    QAMD
Analysis    361      52       23
Total       379      76       59

500/ 627

Page 537: High Performance Matrix Computations/Calcul Matriciel Haute

Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachImpact of fill reduction algorithm on the shape of the treePostorderings and memory usageEquivalent orderings and elimination treesComparison between 3 approaches for LU factorizationTask mapping and schedulingDistributed memory approachesSome parallel solversCase study: comparison of MUMPS and SuperLUConcluding remarks

501/ 627

Page 538: High Performance Matrix Computations/Calcul Matriciel Haute

Trees, topological orderings and postorderings

I A rooted tree is a tree for which one node has been selected to be the root.

I A topological ordering of a rooted tree is an ordering that numbers children nodes before their parent.

I Postorderings are topological orderings which number the nodes of any subtree consecutively.
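A small sketch (hypothetical parent→children representation) of how a postordering can be computed; every subtree receives consecutive numbers:

```python
def postorder(children, root):
    order, stack = [], [(root, False)]
    while stack:
        node, visited = stack.pop()
        if visited:
            order.append(node)                 # numbered after all its children
        else:
            stack.append((node, True))
            for c in reversed(children.get(node, [])):
                stack.append((c, False))
    return {node: k + 1 for k, node in enumerate(order)}

# toy tree: 6 is the root, 5 has children 3 and 4, 2 has child 1
print(postorder({6: [2, 5], 5: [3, 4], 2: [1]}, 6))
```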

502/ 627

Page 539: High Performance Matrix Computations/Calcul Matriciel Haute

Trees, topological orderings and postorderings

[Figure: a connected graph G on nodes u, v, w, x, y, z and two rooted spanning trees – one numbered with a topological ordering, the other with a postordering.]

502/ 627

Page 540: High Performance Matrix Computations/Calcul Matriciel Haute

Postorderings and memory usage

I Assumptions:
  I Tree processed from the leaves to the root
  I Parents processed as soon as all children have completed (postorder of the tree)
  I Each node produces and sends temporary data consumed by its father.
I Exercise: in which sense is a postordering-based tree traversal more interesting than a random topological ordering?

I Furthermore, memory usage also depends on the postordering chosen:

[Figure: a tree with nodes a–i (root i); the postordering (a b c d e f g h i) gives the best memory usage, the postordering (h f d b a c e g i) the worst.]

503/ 627


Page 542: High Performance Matrix Computations/Calcul Matriciel Haute

Example 1: Processing a wide tree

[Figure: a wide tree (nodes 1–7, root 7) and successive snapshots of the memory (factor space, stack of contribution blocks, unused and non-free space) while the tree is processed; many contribution blocks are stacked at the same time in the active memory.]

Page 544: High Performance Matrix Computations/Calcul Matriciel Haute

Example 2: Processing a deep tree

[Figure: a deep tree (nodes 1–4, root 4) and the corresponding memory snapshots, detailing for node 3 the allocation, assembly, factorization and stack steps.]

Page 545: High Performance Matrix Computations/Calcul Matriciel Haute

Modelling the problem

I Mi : memory peak for complete subtree rooted at i ,

I tempi : temporary memory produced by node i ,

I mparent : memory for storing the parent.

[Figure: a parent node with three children subtrees, of memory peaks M1, M2, M3 and temporary productions temp1, temp2, temp3.]

M_parent = max( max_{j=1..nbchildren} ( M_j + Σ_{k=1}^{j−1} temp_k ),  m_parent + Σ_{j=1}^{nbchildren} temp_j )     (11)

Objective: order the children to minimize Mparent

507/ 627


Page 547: High Performance Matrix Computations/Calcul Matriciel Haute

Memory-minimizing schedules

Theorem.

[Liu, 86] The minimum of max_j ( x_j + Σ_{i=1}^{j−1} y_i ) is obtained when the sequence (x_i, y_i) is sorted in decreasing order of x_i − y_i.

Corollary

An optimal child sequence is obtained by rearranging the children nodes in decreasing order of M_i − temp_i.

Interpretation: at each level of the tree, a child with a relatively large memory peak in its subtree (M_i large with respect to temp_i) should be processed first.

⇒ Apply on the complete tree, starting from the leaves (or from the root with a recursive approach).

Page 548: High Performance Matrix Computations/Calcul Matriciel Haute

Optimal tree reordering

Objective: Minimize peak of stack memory

Tree_Reorder(T):
Begin
  for all i in the set of root nodes do
    Process_Node(i)
  end for
End

Process_Node(i):
  if i is a leaf then
    Mi = mi
  else
    for j = 1 to nbchildren do
      Process_Node(j-th child)
    end for
    Reorder the children of i in decreasing order of (Mj − tempj)
    Compute Mparent at node i using Formula (11)
  end if
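A Python sketch of this procedure (hypothetical dictionaries children / temp / m holding the tree, the temporary sizes and the node storage sizes):

```python
def tree_reorder(children, temp, m, root):
    M = {}
    def process(i):
        kids = children.get(i, [])
        if not kids:
            M[i] = m[i]
            return
        for j in kids:
            process(j)
        kids.sort(key=lambda j: M[j] - temp[j], reverse=True)   # corollary above
        running = peak = 0
        for j in kids:                       # max_j ( M_j + sum_{k<j} temp_k )
            peak = max(peak, M[j] + running)
            running += temp[j]
        M[i] = max(peak, m[i] + running)     # Formula (11)
    process(root)
    return M[root]

children = {"p": ["a", "b"]}
print(tree_reorder(children, {"a": 1, "b": 3}, {"a": 5, "b": 4, "p": 2}, "p"))
```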

Page 549: High Performance Matrix Computations/Calcul Matriciel Haute

Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachImpact of fill reduction algorithm on the shape of the treePostorderings and memory usageEquivalent orderings and elimination treesComparison between 3 approaches for LU factorizationTask mapping and schedulingDistributed memory approachesSome parallel solversCase study: comparison of MUMPS and SuperLUConcluding remarks

510/ 627

Page 550: High Performance Matrix Computations/Calcul Matriciel Haute

Equivalent orderings of symmetric matrices

Let F be the filled matrix of a symmetric matrix A (that is, F = L + L^T, where A = LL^T). G^+(A) = G(F) is the associated filled graph.

Definition

[Equivalent orderings] P and Q are said to be equivalent orderings iff G^+(PAP^T) = G^+(QAQ^T).
By extension, a permutation P is said to be an equivalent ordering of a matrix A iff G^+(PAP^T) = G^+(A).

It can be shown that an equivalent reordering also preserves the amount of arithmetic operations for sparse Cholesky factorization.

511/ 627

Page 551: High Performance Matrix Computations/Calcul Matriciel Haute

Relation with elimination trees

I Let A be a reordered matrix, and G^+(A) be its filled graph.

I In the elimination tree, any tree traversal (that processes children before parents) corresponds to an equivalent ordering P of A, and the elimination tree of PAP^T is identical to that of A.

[Figure: matrix A (fill-in entries marked F), its filled graph G^+(A) on nodes u, v, w, x, y, z, and the elimination tree of A labelled with a topological ordering.]

512/ 627

Page 552: High Performance Matrix Computations/Calcul Matriciel Haute

Tree rotations

Definition

An ordering that does not introduce any fill is referred to as a perfect ordering.

The natural ordering is a perfect ordering of the filled matrix F.

Theorem.

For any node x of G^+(A) = G(F), there exists a perfect ordering on G(F) such that x is numbered last.

I Essence of tree rotations:
  I Nodes in the clique of x in F are numbered last
  I The relative ordering of the other nodes is preserved.

513/ 627

Page 553: High Performance Matrix Computations/Calcul Matriciel Haute

Example of equivalent orderings

On the right-hand side, a tree rotation is applied on w (the clique of w is {w, x}; for the other nodes, the relative ordering w.r.t. the tree on the left is preserved).

[Figure: filled matrix F and two elimination trees on nodes u, v, w, x, y, z – the original tree with a postordering, and the tree obtained after the rotation on w.]

Remark: tree rotations can help reduce the temporary memory usage!


Page 555: High Performance Matrix Computations/Calcul Matriciel Haute

Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachImpact of fill reduction algorithm on the shape of the treePostorderings and memory usageEquivalent orderings and elimination treesComparison between 3 approaches for LU factorizationTask mapping and schedulingDistributed memory approachesSome parallel solversCase study: comparison of MUMPS and SuperLUConcluding remarks

515/ 627

Page 556: High Performance Matrix Computations/Calcul Matriciel Haute

Comparison between 3 approaches for LUfactorization

We compare three general approaches for sparse LU factorization

I General technique

I Frontal method

I Multifrontal approach

Distributed-memory multifrontal and supernodal approaches will be compared in another section.

516/ 627

Page 557: High Performance Matrix Computations/Calcul Matriciel Haute

Description of the 3 approaches

I General technique
  I Numerical and sparsity pivoting performed at the same time
  I Dynamic sparse data structures
  I Good preservation of sparsity: local decisions influenced by numerical choices

I Frontal method
  I Extension of band or variable-band schemes
  I No indirect addressing is required in the innermost loop (data are stored in dense matrices)
  I Simple data structures, fast methods, easier to implement
  I Very popular, easy out-of-core implementation
  I Sequential by nature

517/ 627

Page 558: High Performance Matrix Computations/Calcul Matriciel Haute

Description of the 3 approaches

I Multifrontal approach
  I Can be seen as an extension of the frontal method
  I Analysis phase to compute an ordering
  I The ordering can then be perturbed by numerical pivoting
  I Dense matrices are used in the innermost loops

Compared to frontal schemes:
  − more complex to implement (assembly of dense matrices, management of numerical pivoting)
  + preserves the sparse structure in a better way

518/ 627

Page 559: High Performance Matrix Computations/Calcul Matriciel Haute

Frontal method

[Figure: band matrix A and the frontal matrix F^(j) moving down the band.]

Step 1 (elimination of a_11): the frontal matrix F^(1) holds the values needed for the elimination of a_11; the update values f_ij = (a_i1 × a_1j)/a_11 are generated.

Step j (elimination of a_jj, j = 2, ...): the frontal matrix F^(j) holds the updates from the previous steps, plus row j and column j of the original matrix.

I Properties: the band is treated as full → an efficient reordering to minimize the bandwidth is crucial.
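A quick sketch with SciPy's Reverse Cuthill-McKee (illustrative, random symmetric pattern instead of dwt_592):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import reverse_cuthill_mckee

A = sp.random(500, 500, density=0.01, format="csr")
A = (A + A.T + sp.eye(500)).tocsr()                  # symmetric pattern
perm = reverse_cuthill_mckee(A, symmetric_mode=True) # bandwidth-reducing order
B = A[perm, :][:, perm]                              # permuted matrix PAP^T
rows, cols = B.nonzero()
print("bandwidth after RCM:", np.abs(rows - cols).max())
```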

519/ 627

Page 560: High Performance Matrix Computations/Calcul Matriciel Haute

Band reduction: Illustration

[Spy plots: original matrix, nz = 5104; factors, nz = 58202.]

Figure: Matrix dwt 592.rua (N=512, NZ=2007); structural analysis on a submarine.

520/ 627

Page 561: High Performance Matrix Computations/Calcul Matriciel Haute

Reordering: Reverse Cuthill-McKee

Reordered matrix and factors after reordering (RCM)

[Spy plots: reordered matrix, nz = 5104; factors after RCM reordering, nz = 16924.]

521/ 627

Page 562: High Performance Matrix Computations/Calcul Matriciel Haute

Frontal vs Multifrontal methods

[Figure: frontal method – at step 2 the whole band is considered as full; multifrontal method – considering frontal matrices is sufficient, and several fronts move ahead simultaneously (the treatment of blocks 1 and 2 is independent).]

522/ 627

Page 563: High Performance Matrix Computations/Calcul Matriciel Haute

Characteristics of multifrontal method:

I More complex data structures.

I Usually more efficient for preserving sparsity than frontaltechniques

I Parallelism arising from sparsity.

523/ 627

Page 564: High Performance Matrix Computations/Calcul Matriciel Haute

Illustration: comparison between 3 software for LU

I General approach (MA38: Davis and Duff):
  I Control of fill-in: Markowitz criterion
  I Numerical stability: partial pivoting
  I Numerical and sparsity pivoting are performed in one step

I Multifrontal method (MA41, Amestoy and Duff):
  I Reordering before numerical factorization
  I Minimum-degree type of reordering
  I Partial pivoting for numerical stability

524/ 627

Page 565: High Performance Matrix Computations/Calcul Matriciel Haute

I Frontal method (MA42, Duff and Scott):
  I Reordering before numerical factorization
  I Reordering for decreasing the bandwidth
  I Partial pivoting for numerical stability

I All these codes (MA38, MA41, MA42) are available in the HSL library.

525/ 627

Page 566: High Performance Matrix Computations/Calcul Matriciel Haute

Test problems from the Harwell-Boeing and Tim Davis (Univ. Florida) collections.

Matrix          Order    Nb of nonzeros   Description
Orani678.rua     2526        90158        Economic modelling
Onetone1.rua    36057       341088        Harmonic balance method
Garon2.rua      13535       390607        2D Navier-Stokes
Wang3.rua       26064       177168        3D simulation of semiconductor
mhda416.rua       416         8562        Spectral problem in hydrodynamics
rim.rua         22560      1014951        CFD nonlinear problem

526/ 627

Page 567: High Performance Matrix Computations/Calcul Matriciel Haute

Execution times on a SUN

Matrix (Method)          Flops    Size of factors    Time
                                  (×10^6 words)      (seconds)
Onetone1.rua (×10^9)
  General                   2            5              59
  Frontal                  19          115            6392
  Multifrontal              8           10             193
Garon2.rua (×10^8)
  General                  40            8              95
  Frontal                  20            9              86
  Multifrontal              4            2               8
mhda416.rua (×10^5)
  General                  24         0.16            0.80
  Frontal                   3         0.02            0.07
  Multifrontal             16         0.04            0.11

527/ 627

Page 568: High Performance Matrix Computations/Calcul Matriciel Haute

Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachImpact of fill reduction algorithm on the shape of the treePostorderings and memory usageEquivalent orderings and elimination treesComparison between 3 approaches for LU factorizationTask mapping and schedulingDistributed memory approachesSome parallel solversCase study: comparison of MUMPS and SuperLUConcluding remarks

528/ 627

Page 569: High Performance Matrix Computations/Calcul Matriciel Haute

Task mapping and scheduling

I Assign tasks to processors to achieve a goal: makespan minimization, memory minimization, ...

I Many approaches:
  I Static: build the schedule before the execution and follow it at run-time
    I Advantage: very efficient, since it has a global view of the system
    I Drawback: requires a very good model of the platform
  I Dynamic: take scheduling decisions dynamically at run-time
    I Advantage: reactive to the evolution of the platform and easy to use on several platforms
    I Drawback: decisions taken with local criteria (a decision which seems to be good at time t can have very bad consequences at time t + 1)

529/ 627

Page 570: High Performance Matrix Computations/Calcul Matriciel Haute

Influence of scheduling on the makespan

Objective:

Assign processes/tasks to processors so that the completion time, also called the makespan, is minimized. (We may also say that we minimize the maximum total processing time on any processor.)

530/ 627

Page 571: High Performance Matrix Computations/Calcul Matriciel Haute

Task scheduling on shared memory computers

The data can be shared between processors without any communication.

I Dynamic scheduling of the tasks (pool of "ready" tasks).

I Each processor selects a task (the order can influence the performance).

I Example of a "good" topological ordering (w.r.t. time):

[Figure: elimination tree with 18 nodes numbered in a topological order that is good with respect to time.]

This ordering is not so good in terms of working memory.

531/ 627

Page 572: High Performance Matrix Computations/Calcul Matriciel Haute

Static scheduling: subtree-to-subcube (or proportional) mapping

Main objective: reduce the volume of communication between processors.

I Recursively partition the processors "equally" between the children of a given node (a small recursive sketch is given after the figure below).
I Initially all processors are assigned to the root node.
I Good at localizing communication, but not so easy if there is no overlapping between processor partitions at each step.

[Figure: elimination tree mapped onto 5 processors by proportional mapping – the root gets {1,2,3,4,5}, its children get {1,2,3} and {4,5}, and so on down to single processors on the subtrees.]

Mapping of the tasks onto the 5 processors
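A recursive sketch of the proportional mapping (hypothetical data: children lists, subtree weights, and a list of processor ids):

```python
def proportional_map(children, weight, node, procs, mapping=None):
    mapping = {} if mapping is None else mapping
    mapping[node] = list(procs)
    kids = children.get(node, [])
    total = sum(weight[c] for c in kids)
    start = 0.0
    for c in kids:
        # slice of the processor list proportional to the child's weight
        end = start + len(procs) * weight[c] / total
        sub = procs[int(start):max(int(start) + 1, int(round(end)))]
        proportional_map(children, weight, c, sub, mapping)
        start = end
    return mapping

children = {"root": ["a", "b"], "a": [], "b": []}
print(proportional_map(children, {"a": 3.0, "b": 2.0}, "root", [0, 1, 2, 3, 4]))
```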

532/ 627

Page 573: High Performance Matrix Computations/Calcul Matriciel Haute

Mapping of the tree onto the processors
Objective: find a layer L0 such that the subtrees rooted in L0 can be mapped onto the processors with a good balance.

Construction and mapping of the initial level L0
Begin
  L0 ← roots of the assembly tree
  repeat
    Find the node q in L0 whose subtree has the largest computational cost
    L0 ← (L0 \ {q}) ∪ children of q
    Greedy mapping of the nodes of L0 onto the processors
    Estimate the load unbalance
  until load unbalance < threshold
End

[Figure: steps A, B and C of the construction of L0.]

Page 574: High Performance Matrix Computations/Calcul Matriciel Haute

Decomposition of the tree into levels

I Determination of Level L0 based on subtree cost.

[Figure: decomposition of the tree into levels L0, L1, L2, L3; the subtree roots lie on L0.]

I Mapping of top of the tree can be dynamic.

I Could be useful for both shared and distributed memory algo.

534/ 627

Page 575: High Performance Matrix Computations/Calcul Matriciel Haute

Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachImpact of fill reduction algorithm on the shape of the treePostorderings and memory usageEquivalent orderings and elimination treesComparison between 3 approaches for LU factorizationTask mapping and schedulingDistributed memory approachesSome parallel solversCase study: comparison of MUMPS and SuperLUConcluding remarks

535/ 627

Page 576: High Performance Matrix Computations/Calcul Matriciel Haute

Distributed memory sparse solvers

536/ 627

Page 577: High Performance Matrix Computations/Calcul Matriciel Haute

Computational strategies for parallel direct solvers

I The parallel algorithm is characterized by:I Computational graph dependencyI Communication graph

I Three classical approaches

1. “Fan-in”2. “Fan-out”3. “Multifrontal”

537/ 627

Page 578: High Performance Matrix Computations/Calcul Matriciel Haute

Preamble: left- and right-looking approaches for Cholesky factorization

I cmod(j, k): modification of column j by column k, k < j,

I cdiv(j): division of column j by a scalar

Left-looking approach
for j = 1 to n do
  for k ∈ Struct(row L_{j,1:j−1}) do
    cmod(j, k)
  end for
  cdiv(j)
end for

Right-looking approach
for k = 1 to n do
  cdiv(k)
  for j ∈ Struct(col L_{k+1:n,k}) do
    cmod(j, k)
  end for
end for
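A dense Python sketch of the left-looking variant (the sparse codes simply restrict the loops to the nonzero structure Struct(·)); the right-looking variant just swaps the loop order:

```python
import numpy as np

def cholesky_left_looking(A):
    n = A.shape[0]
    L = np.tril(A.astype(float).copy())
    for j in range(n):
        for k in range(j):                     # cmod(j, k)
            L[j:, j] -= L[j, k] * L[j:, k]
        L[j, j] = np.sqrt(L[j, j])             # cdiv(j)
        L[j + 1:, j] /= L[j, j]
    return L

A = np.array([[4.0, 1.0, 1.0], [1.0, 3.0, 0.0], [1.0, 0.0, 2.0]])
L = cholesky_left_looking(A)
print(np.allclose(L @ L.T, A))                 # True
```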

538/ 627

Page 579: High Performance Matrix Computations/Calcul Matriciel Haute

Illustration of Left and right looking

modified

Left−looking Right−looking

used for modification

539/ 627

Page 580: High Performance Matrix Computations/Calcul Matriciel Haute

Assumptions and Notations

I Assumptions :

I We assume that each column of L / each node of the tree is assigned to a single processor.

I Each processor is in charge of computing cdiv(j) for the columns j that it owns.

I Notations:
  I mycols(p) is the set of columns owned by processor p.
  I map(j) gives the processor owning column j (or task j).
  I procs(L_{*k}) = { map(j) | j ∈ Struct(L_{*k}) }
    (only the processors in procs(L_{*k}) require updates from column k – they correspond to ancestors of k in the tree).

540/ 627

Page 581: High Performance Matrix Computations/Calcul Matriciel Haute

Fan-in variant (similar to left-looking)
Demand-driven algorithm: the data required are aggregated update columns computed by the sending processor.

Fan-in (for processor p)
for j = 1 to n do
  u = 0
  for all k ∈ Struct(row L_{j,1:j−1}) ∩ mycols(p) do
    u = u + cmod(j, k)
  end for
  if map(j) ≠ p then
    Send u to processor map(j)
  end if
  if map(j) == p then
    Incorporate u in column j
    Receive all necessary update aggregates on column j and incorporate them in column j
    cdiv(j)
  end if
end for

Page 582: High Performance Matrix Computations/Calcul Matriciel Haute

Fan-in variant (similar to left looking)

[Figure: left-looking picture – columns 1, 2 and 3 are used to modify column 4 (Cholesky: for j = 1 to n, cmod(j, k) for k ∈ Struct(L_{j,*}), then cdiv(j)).]

If map(1) = map(2) = map(3) = p and map(4) ≠ p, (only) one message is sent by p to update column 4 → this exploits the data locality of the proportional mapping.

Page 583: High Performance Matrix Computations/Calcul Matriciel Haute

Fan-in variant

[Figure: a parent node mapped on P4 whose children are mapped on P0, P1, P2 and P3.]

If ∀i ∈ children, map(i) = P0 and map(father) ≠ P0, (only) one message is sent by P0 → this exploits the data locality of the proportional mapping.

542/ 627


Page 590: High Performance Matrix Computations/Calcul Matriciel Haute

Fan-out variant (similar to right-looking)
Data-driven algorithm. Fan-out(p):

for all leaf nodes j ∈ mycols(p) do
  cdiv(j)
  send column L_{*j} to procs(col L_{*j})
  mycols(p) = mycols(p) \ {j}
end for
while mycols(p) ≠ ∅ do
  Receive any column (say L_{*k}) of L
  for j ∈ Struct(col L_{*k}) ∩ mycols(p) do
    cmod(j, k)
    if column j is completely updated then
      cdiv(j)
      send column L_{*j}
      mycols(p) = mycols(p) \ {j}
    end if
  end for
end while

Page 591: High Performance Matrix Computations/Calcul Matriciel Haute

Fan-out variant (similar to right-looking)

[Figure: right-looking picture – as soon as a column is computed by cdiv(k), it is used to update the columns of Struct(L_{*,k}) (Cholesky: for k = 1 to n, cdiv(k), then cmod(j, k) for j ∈ Struct(L_{*,k})).]

If map(2) = map(3) = p and map(4) ≠ p, then 2 messages (for columns 2 and 3) are sent by p to update column 4.

Page 592: High Performance Matrix Computations/Calcul Matriciel Haute

Fan-out variant

[Figure: a parent node mapped on P4 whose children are mapped on P0, P1, P2 and P3.]

If ∀i ∈ children, map(i) = P0 and map(father) ≠ P0, then n messages (where n is the number of children) are sent by P0 to update the processor in charge of the father.

544/ 627


Page 597: High Performance Matrix Computations/Calcul Matriciel Haute

Fan-out variant

Properties of fan-out:

I Historically the first implemented.
I Incurs greater interprocessor communication than the fan-in (or multifrontal) approach, both in terms of
  I total number of messages
  I total volume
I Does not exploit the data locality of the proportional mapping.
I Improved algorithm (local aggregation):
  I Send aggregated update columns instead of individual factor columns, for columns mapped on a single processor.
  I Improves the exploitation of the data locality of the proportional mapping.
  I But the memory increase to store the aggregates can be critical (as in fan-in).

Page 598: High Performance Matrix Computations/Calcul Matriciel Haute

Multifrontal variant

[Figure: elimination tree and right-looking picture for the multifrontal method – at each node a full frontal matrix is built, partially factored, and its contribution block (CB) is sent to the father.]

Algorithm ("multifrontal method"):
for k = 1 to n do
  Build the full frontal matrix with all indices in Struct(L_{*,k})
  Partial factorization
  Send the contribution block to the father
end for

546/ 627

Page 599: High Performance Matrix Computations/Calcul Matriciel Haute

Multifrontal variant

[Figure: communication schemes for the three approaches – (a) Fan-in, (b) Fan-out, (c) Multifrontal – on a small tree whose leaves are mapped on P0 and whose upper nodes are mapped on P1 and P2.]

547/ 627


Page 604: High Performance Matrix Computations/Calcul Matriciel Haute

Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachImpact of fill reduction algorithm on the shape of the treePostorderings and memory usageEquivalent orderings and elimination treesComparison between 3 approaches for LU factorizationTask mapping and schedulingDistributed memory approachesSome parallel solversCase study: comparison of MUMPS and SuperLUConcluding remarks

548/ 627

Page 605: High Performance Matrix Computations/Calcul Matriciel Haute

Some parallel solvers

549/ 627

Page 606: High Performance Matrix Computations/Calcul Matriciel Haute

Shared memory sparse direct codes

Code       Technique            Scope     Availability (www.)
MA41       Multifrontal         UNS       cse.clrc.ac.uk/Activity/HSL
MA49       Multifrontal QR      RECT      cse.clrc.ac.uk/Activity/HSL
PanelLLT   Left-looking         SPD       Ng
PARDISO    Left-right looking   UNS       Schenk
PSL†       Left-looking         SPD/UNS   SGI product
SPOOLES    Fan-in               SYM/UNS   netlib.org/linalg/spooles
SuperLU    Left-looking         UNS       nersc.gov/∼xiaoye/SuperLU
WSMP‡      Multifrontal         SYM/UNS   IBM product

† Only object code for SGI is available

550/ 627

Page 607: High Performance Matrix Computations/Calcul Matriciel Haute

Distributed-memory sparse direct codes

Code      Technique         Scope     Availability (www.)
CAPSS     Multifrontal LU   SPD       netlib.org/scalapack
MUMPS     Multifrontal      SYM/UNS   graal.ens-lyon.fr/MUMPS
PaStiX    Fan-in            SPD       see caption§
PSPASES   Multifrontal      SPD       cs.umn.edu/∼mjoshi/pspases
SPOOLES   Fan-in            SYM/UNS   netlib.org/linalg/spooles
SuperLU   Fan-out           UNS       nersc.gov/∼xiaoye/SuperLU
S+        Fan-out†          UNS       cs.ucsb.edu/research/S+
WSMP‡     Multifrontal      SYM       IBM product

§ dept-info.labri.u-bordeaux.fr/∼ramet/pastix

‡ Only object code for IBM is available. No numerical pivoting performed.

551/ 627


Page 609: High Performance Matrix Computations/Calcul Matriciel Haute

Factorization of sparse matricesIntroductionElimination tree and Multifrontal approachImpact of fill reduction algorithm on the shape of the treePostorderings and memory usageEquivalent orderings and elimination treesComparison between 3 approaches for LU factorizationTask mapping and schedulingDistributed memory approachesSome parallel solversCase study: comparison of MUMPS and SuperLUConcluding remarks

552/ 627

Page 610: High Performance Matrix Computations/Calcul Matriciel Haute

MUMPS (Multifrontal sparse solver)
Amestoy, Duff, Guermouche, Koster, L'Excellent, Pralet

1. Analysis and preprocessing
   • Preprocessing (max. transversal, scaling)
   • Fill-in reduction on A + A^T
   • Partial static mapping (elimination tree)

2. Factorization
   • Multifrontal (elimination tree of A + A^T), Struct(L) = Struct(U)
   • Partial threshold pivoting
   • Node parallelism
     - Partitioning (1D front – 2D root)
     - Dynamic distributed scheduling

3. Solution step and iterative refinement

Features: real/complex, symmetric/unsymmetric matrices; distributed input; assembled/elemental format; Schur complement; multiple sparse right-hand sides.

Page 611: High Performance Matrix Computations/Calcul Matriciel Haute

SuperLU (Gaussian elimination with static pivoting)
X.S. Li and J.W. Demmel

1. Analysis and preprocessing
   • Preprocessing (max. transversal, scaling)
   • Fill-in reduction on A + A^T
   • Static mapping on a 2D grid of processes

2. Factorization
   • Fan-out (elimination DAGs)
   • Static pivoting: if |a_ii| < √ε ‖A‖, set a_ii to √ε ‖A‖
   • 2D irregular block-cyclic partitioning (based on the supernode structure)
   • Pipelining / BLAS3-based factorization

3. Solution step and iterative refinement

Features: parallel analysis; real and complex matrices; multiple right-hand sides.

Page 612: High Performance Matrix Computations/Calcul Matriciel Haute

MUMPS: dynamic scheduling

Graph of tasks = tree
Each task = partial factorization of a dense matrix
Some parallel tasks are mapped at runtime (80 %)

[Figure: elimination tree with a 2D static decomposition – the subtrees at the bottom are mapped statically onto processors P0–P3, while for the upper nodes the slave processors of the 1D pipelined factorizations (e.g. P3 and P0 chosen by P2) are selected dynamically at runtime.]

555/ 627

Page 615: High Performance Matrix Computations/Calcul Matriciel Haute

Node-level parallelism in the multifrontal solver – MUMPS: pipelined factorization

[Figure: 1D pipelined factorization of a front – the master process (P2) factors a block of NPIV pivot rows and sends it (message BLOCK_FACT); the slave processes (P3, P4) apply the TRSM + GEMM updates to their rows of L.]

556/ 627

Page 616: High Performance Matrix Computations/Calcul Matriciel Haute

SuperLU: 2D block cyclic layout and data structures

[Figure: 2 × 3 process mesh (0 1 2 / 3 4 5), 2D block-cyclic layout of the global matrix L/U on this mesh, and the storage of a block column of L – an index array (number of blocks, then for each block: block number, number of full rows, row subscripts i1, i2, ...) and an nzval array with its leading dimension (LDA).]

557/ 627

Page 617: High Performance Matrix Computations/Calcul Matriciel Haute

Trace of execution (bbmat, 8 proc. CRAY T3E)

[Figure: two execution traces of the 8 processes (application work vs. MPI communication) over short time windows.]

558/ 627

Page 618: High Performance Matrix Computations/Calcul Matriciel Haute

Test problems

Real Unsymmetric Assembled
Matrix      Order        NZ     StrSym   Origin
bbmat       38744   1771722       54     R.-B. (CFD)
ecl32       51993    380415       93     EECS Dept. UC Berkeley
invextr1    30412   1793881       97     Parasol (Polyflow)
fidapm11    22294    623554       99     SPARSKIT2 (CFD)
garon2      13535    390607      100     Davis (CFD)
lhr71c      70304   1528092        0     Davis (Chem Eng)
lnsp3937     3937     25407       87     R.-B. (CFD)
mixtank     29957   1995041      100     Parasol (Polyflow)
rma10       46835   2374001       98     Davis (CFD)
twotone    120750   1224224       14     R.-B. (circuit sim)

Real Symmetric Assembled (rsa)
bmwcra 1   148770   5396386      100     Parasol (MSC.Software)
cranksg2    63838   7106348      100     Parasol (MSC.Software)
inline 1   503712  18660027      100     Parasol (MSC.Software)

StrSym : structural symmetry;

R.-B. : Rutherford-Boeing set.

559/ 627

Page 619: High Performance Matrix Computations/Calcul Matriciel Haute

Impact of preprocessing and numerical issues

I Objective: maximize the diagonal entries of the permuted matrix
I MC64 (Harwell Subroutine Library) code from Duff and Koster (1999)

I Unsymmetric permutation (maximum weighted matching) and scaling

I The preprocessed matrix B = D1 A Q D2 is such that |b_ii| = 1 and |b_ij| ≤ 1

I Expectations:
  I MUMPS: reduce the number of off-diagonal pivots and postponed variables (reduce numerical fill-in)
  I SuperLU: reduce the number of modified diagonal entries
  I Improve accuracy.
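A rough sketch of the underlying idea (illustrative only – this is a generic weighted bipartite matching, not the MC64 algorithm, and the scaling D1, D2 is omitted):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import min_weight_full_bipartite_matching

A = sp.random(50, 50, density=0.1, format="csr") + sp.eye(50, format="csr")
C = A.copy()
# maximizing the product of matched entries == minimizing sum of -log|a_ij|;
# the +1 shift keeps every stored edge weight positive without changing the optimum
C.data = 1.0 - np.log(np.abs(C.data) / np.abs(C.data).max())
rows, cols = min_weight_full_bipartite_matching(C)
perm = np.empty(50, dtype=int)
perm[rows] = cols                  # row i is matched with column perm[i]
B = A[:, perm]                     # permuted matrix: large entries on the diagonal
```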

560/ 627

Page 620: High Performance Matrix Computations/Calcul Matriciel Haute

MC64 and Flops (10^9) for the factorization (AMD ordering)

Matrix      MC64   StrSym    MUMPS        SuperLU
lhr71c       No       0      1431.0(∗)       –
             Yes     21         1.4         0.5
twotone      No      28      1221.1       159.0
             Yes     43        29.3         8.0
fidapm11     No     100         9.7         8.9
             Yes     29        28.5        22.0

(∗) Estimated during the analysis.
– Not enough memory to run the factorization.

561/ 627

Page 621: High Performance Matrix Computations/Calcul Matriciel Haute

Backward error analysis: Berr = max_i |r|_i / (|A|·|x| + |b|)_i

[Figure: backward errors Berr (from 10^-16 to 10^0) for MUMPS and SuperLU on bbmat, ecl32, invextr1, fidapm11, garon2, lnsp3937, mixtank, rma10 and twotone, with and without MC64.]

One step of iterative refinement generally leads to Berr ≈ ε.
Cost(1 step of iterative refinement) ≈ Cost(LUx = b − Ax).

562/ 627

Page 622: High Performance Matrix Computations/Calcul Matriciel Haute

Factorization cost study

I SuperLU preserves/exploits the sparsity/asymmetry better than MUMPS. This results in:
  ++ smaller size of factors (less memory)
  ++ fewer operations
  ++ more independence/parallelism
  −− extra cost of taking the asymmetry into account
  −− smaller block size for the BLAS-3 kernels

563/ 627

Page 623: High Performance Matrix Computations/Calcul Matriciel Haute

Cost of preserving sparsity (time on T3E, 4 procs)

[Figure: bar chart of SuperLU/MUMPS ratios – size of factors, flops, factorization time and solve time – for bbmat (StrSym 50), fidapm11 (46), twotone (43) and lhr71c (21).]

564/ 627

Page 624: High Performance Matrix Computations/Calcul Matriciel Haute

Nested Dissection versus Minimum Degree orderings (time on T3E, 4 procs)

[Figure: left – flops (×10^9) with AMD and ND for MUMPS and SuperLU on bbmat, ecl32, invextr1 and mixtank; right – factorization time ratio AMD/ND for both solvers on the same matrices.]

565/ 627

Page 625: High Performance Matrix Computations/Calcul Matriciel Haute

Communication issues

[Figure: average communication volume (MBytes) and average message size (KBytes) on 64 processors, for MUMPS and SuperLU with the AMD and ND orderings, on bbmat, ecl32, invextr1, mixtank and twotone.]

566/ 627

Page 626: High Performance Matrix Computations/Calcul Matriciel Haute

Time ratios of the numerical phases: Time(SuperLU) / Time(MUMPS)

[Figure: ratios for the factorization (left) and the solve (right) phases on 4 to 512 processors, for bbmat, ecl32, invextr1, mixtank and twotone.]

567/ 627

Page 627: High Performance Matrix Computations/Calcul Matriciel Haute

Time (in seconds) of the numerical phases

[Figure: time in seconds of the factorization (left) and solve (right) phases versus number of processors (4 to 512) for MUMPS and SuperLU on bbmat, ecl32, invextr1, mixtank and twotone.]


Performance analysis on 3-D grid problems (rectangular grids, Nested Dissection ordering)

[Figure: Megaflop rate (left) and efficiency (right) versus number of processors (1 to 128) for MUMPS-SYM, MUMPS-UNS and SuperLU.]


Summary

I Sparsity and total memory
  - SuperLU preserves sparsity better
  - SuperLU: ≈ 20% less memory on 64 processors (asymmetry - fan-out/multifrontal)
I Communication
  - Global volume is comparable
  - MUMPS: much smaller (/10) number of messages
I Factorization / solve time
  - MUMPS is faster on nprocs ≤ 64
  - SuperLU is more scalable
I Accuracy
  - MUMPS provides a better initial solution
  - SuperLU: one step of iterative refinement is often enough


Factorization of sparse matrices
Introduction
Elimination tree and Multifrontal approach
Impact of fill reduction algorithm on the shape of the tree
Postorderings and memory usage
Equivalent orderings and elimination trees
Comparison between 3 approaches for LU factorization
Task mapping and scheduling
Distributed memory approaches
Some parallel solvers
Case study: comparison of MUMPS and SuperLU
Concluding remarks


Concluding remarks

I Key parameters in selecting a method
  1. Functionalities of the solver
  2. Characteristics of the matrix
     I Numerical properties and pivoting
     I Symmetric or general
     I Pattern and density
  3. Preprocessing of the matrix
     I Scaling
     I Reordering for minimizing fill-in
  4. Target computer (architecture)
I Substantial gains can be achieved with an adequate solver, in terms of numerical precision, computing and storage
I Good knowledge of matrix and solvers
I Many challenging problems
I Active research area


Outline

Iterative Methods
Basic iterative methods (stationary methods)
Krylov subspace methods
Preconditioning


Iterative Methods

Principles:

I Generates a sequence of approximations x^(k) to the solution
I Essentially involves matrix-vector products
I Often linked to preconditioning techniques: Ax = b → MAx = Mb
I Evaluation of a method: speed of convergence


Direct method vs. Iterative method

Direct
I Very general technique
I High numerical accuracy
I Sparse matrices with irregular patterns
I Factorization of A
I May be costly in terms of memory for factors
I Factors can be reused for successive/multiple right-hand sides

Iterative
I Efficiency depends on the type of the problem
I Convergence: preconditioning
I Numerical properties: structure of A
I Requires the product of A by a vector
I Less costly in terms of memory and possibly flops
I Solutions with successive right-hand sides can be problematic


Iterative Methods
Basic iterative methods (stationary methods)
Krylov subspace methods
Preconditioning


Basic iterative methods (stationary methods)

Definition

An iterative method is called stationary if x^(k+1) can be expressed as a function of x^(k) only.

I Residual at iteration k: r^(k) = b − A x^(k)
I i-th component:
    r_i^(k) = b_i − ∑_j a_ij x_j^(k) = b_i − ∑_{j≠i} a_ij x_j^(k) − a_ii x_i^(k)
I Idea: try to "reset" the r_i components to 0. This gives:
    Do i = 1, n
      x_i^(k+1) = (1/a_ii) (b_i − ∑_{j≠i} a_ij x_j^(k))
    EndDo

Jacobi iteration
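A minimal dense NumPy sketch of this Jacobi iteration (illustrative only; a production code would work on a sparse matrix and test convergence):

import numpy as np

def jacobi(A, b, x0, niter=100):
    # x_i^(k+1) = (b_i - sum_{j != i} a_ij x_j^(k)) / a_ii
    D = np.diag(A)                # diagonal entries a_ii
    R = A - np.diagflat(D)        # off-diagonal part
    x = x0.copy()
    for _ in range(niter):
        x = (b - R @ x) / D       # every component uses x^(k) only
    return x

# Small diagonally dominant example (Jacobi converges here)
A = np.array([[4., 1., 0.], [1., 4., 1.], [0., 1., 4.]])
b = np.array([1., 2., 3.])
print(jacobi(A, b, np.zeros(3)))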


Gauss-Seidel method

I Jacobi:
    Do i = 1, n
      x_i^(k+1) = (1/a_ii) (b_i − ∑_{j=1}^{i−1} a_ij x_j^(k) − ∑_{j=i+1}^{n} a_ij x_j^(k))
    EndDo
I Remark that one does not use the latest information
I Gauss-Seidel iteration:
    Do i = 1, n
      x_i^(k+1) = (1/a_ii) (b_i − ∑_{j=1}^{i−1} a_ij x_j^(k+1) − ∑_{j=i+1}^{n} a_ij x_j^(k))
    EndDo


Stationary Methods: Matrix Approach

Decompose A as

A = L + D + U

where D is the diagonal of A, and L (resp. U) is the strictly lower (resp. upper) triangular part.

I Given a non-singular matrix M, we define the recurrence:
    x^(k+1) = M^{−1}(b − (A − M) x^(k)) = x^(k) + M^{−1} r^(k)
  (Note: y = M^{−1}z means "Solve My = z for y")
I Jacobi iteration: M = D
    x^(k+1) = D^{−1}(b − (A − D) x^(k)) = x^(k) + D^{−1} r^(k)
I Gauss-Seidel iteration: M = D + L
    x^(k+1) = (D + L)^{−1}(b − (A − D − L) x^(k)) = x^(k) + (D + L)^{−1} r^(k)


Variants

Successive over-relaxation (SOR):

x^(k+1) = ω x_GaussSeidel^(k+1) + (1 − ω) x^(k)
        = x^(k) + ω (D + ωL)^{−1} (b − A x^(k))

Choice of ω:

I Theoretical optimal values for limited classes of problems
I Problematic in general

Many other variants depending on the choice of the matrix M (block Jacobi, block Gauss-Seidel, ...)
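A dense sketch of the generic stationary iteration x^(k+1) = x^(k) + M^{−1} r^(k) for the three splittings above; M is kept lower triangular so that applying M^{−1} is just a triangular solve, never an explicit inverse:

import numpy as np
from scipy.linalg import solve_triangular

def stationary_solve(A, b, x0, method="jacobi", omega=1.5, niter=200):
    # M = D (Jacobi), M = D + L (Gauss-Seidel), M = D/omega + L (SOR)
    D = np.diag(np.diag(A))
    L = np.tril(A, k=-1)
    M = {"jacobi": D, "gauss-seidel": D + L, "sor": D / omega + L}[method]
    x = x0.copy()
    for _ in range(niter):
        r = b - A @ x
        x += solve_triangular(M, r, lower=True)   # solve M z = r
    return x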


Convergence properties

Previous methods follow the model: x^(k+1) = x^(k) + M^{−1}(b − A x^(k)).
Knowing that the solution x∗ satisfies Ax∗ = b, this gives:
x^(k+1) − x∗ = x^(k) − x∗ + M^{−1}(Ax∗ − A x^(k))
Thus:

    x^(k+1) − x∗ = (I − M^{−1}A)(x^(k) − x∗)

Theorem.

The sequence (x^(k))_{k=1,2,...} defined by

    x^(k+1) = x^(k) + M^{−1}(b − A x^(k))

converges for all x^(0) to the solution x∗ iff the spectral radius of I − M^{−1}A satisfies the inequality ρ(I − M^{−1}A) < 1.
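The condition can be checked numerically on small examples by computing ρ(I − M^{−1}A) directly (dense sketch):

import numpy as np

def iteration_spectral_radius(A, M):
    # Convergence for every x^(0) iff rho(I - M^{-1} A) < 1
    G = np.eye(A.shape[0]) - np.linalg.solve(M, A)
    return max(abs(np.linalg.eigvals(G)))

A = np.array([[4., 1., 0.], [1., 4., 1.], [0., 1., 4.]])
print(iteration_spectral_radius(A, np.diag(np.diag(A))))   # Jacobi splitting
print(iteration_spectral_radius(A, np.tril(A)))            # Gauss-Seidel splitting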


Convergence properties

Proof.

x^(k) − x∗ = (I − M^{−1}A)^k (x^(0) − x∗)

⇒ Let (λ, v) be an eigenpair of G := I − M^{−1}A.
   G^k v = λ^k v, thus lim_{k→∞} G^k = 0 ⇒ |λ| < 1

⇐ Based on the Jordan decomposition: there exists a matrix V such that G = V^{−1} J V, where J is block diagonal with Jordan blocks of the form

    J_i = [ λ_i  1               ]
          [      λ_i  1          ]
          [           ...   1    ]
          [                 λ_i  ]

Then G^k = (V^{−1} J V)^k = V^{−1} J^k V, and we can check that each diagonal block of J^k tends to 0 if |λ_i| < 1.


Typical steps to design an iterative method:

I Propose a matrix M for which linear systems of the form Mz = d are "easy" to solve
I Classes of matrices are identified for which the iteration matrix G = I − M^{−1}A satisfies ρ(G) < 1
I Find further results about ρ(G) to gain intuition on convergence speed


Convergence properties

Theorem.

If A ∈ ℂ^{n×n} is strictly diagonally dominant, then the Jacobi iterations converge.

Proof.

I − D^{−1}A = −D^{−1}(L + U)
ρ(I − D^{−1}A) ≤ ‖D^{−1}(L + U)‖_∞ = max_i ∑_{j≠i} |a_ij / a_ii| < 1

Theorem.

If A is symmetric positive definite, then the Gauss-Seidel iterations converge (for any x_0).


Implementation of the Jacobi iteration

I Matrix-vector product using CSR format:

c$omp parallel do shared(ia,ja,val,x,y) private(i)
      do k = 1, n
         y(k) = 0.0d0
         do i = ia(k), ia(k+1)-1
            y(k) = y(k) + val(i)*x(ja(i))
         enddo
      enddo

I Jacobi iteration x^(k+1) ← x^(k) + D^{−1}(b − (L + U) x^(k)) can be vectorized and parallelized similarly to a matrix-vector product


Example of implementation of Gauss-Seidel

Consider the 5-point stencil

        1
    1  −4   1
        1

applied to the Poisson equation (∆g(x, y) = f(x, y), g = 0 on the boundary) on an N×M grid; we obtain an NM-by-NM block tridiagonal system Ag = f:

A = [  T   −I_N                 ]        T = [  4  −1              ]
    [ −I_N   T   −I_N           ]            [ −1   4  −1          ]
    [        ...   ...   −I_N   ]            [      ...  ...  −1   ]
    [              −I_N     T   ]            [           −1    4   ]

g = (G(1,1), ..., G(N,1), G(1,2), ..., G(N,2), ..., G(N,M))^T,
f = (f_11, ..., f_N1, f_12, ..., f_N2, ..., f_NM)^T

Which form does the Gauss-Seidel iteration take for this system ?


With the convention G(i, j) = 0 for i = 0, i = N+1, j = 0 or j = M+1, the Gauss-Seidel iteration takes the form:

DO j = 1, M
  DO i = 1, N
    G(i,j) = (f(i,j) + G(i-1,j) + G(i+1,j) + G(i,j-1) + G(i,j+1)) / 4
  ENDDO
ENDDO

No storage required for the matrix in this case!

Parallel implementation for shared memory machines (M = N case), sweeping the grid by anti-diagonals:

DO k = 1, N
C$OMP PARALLEL DO PRIVATE(i,j)
  DO j = 1, k
    i = k - j + 1
    G(i,j) = (f(i,j) + G(i-1,j) + G(i+1,j) + G(i,j-1) + G(i,j+1)) / 4
  ENDDO
ENDDO
DO k = N+1, 2*N-1
...


Conclusion on stationary methods

I Relatively easy to implement and parallelize

I Often depend on parameters that are difficult to forecast (example: ω in SOR)

I Convergence difficult to guarantee in finite precision arithmetic

I Krylov methods are preferred


Iterative Methods
Basic iterative methods (stationary methods)
Krylov subspace methods
Preconditioning


Krylov method : some background

Aleksei Nikolaevich Krylov
1863-1945, Russia, maritime engineer.
His research spans a wide range of topics, including shipbuilding, magnetism, artillery, mathematics, astronomy, and geodesy. In 1904 he built the first machine in Russia for integrating ODEs. In 1931 he published a paper on what is now called the "Krylov subspace".

Definition

Let A ∈ IR^{n×n} and r ∈ IR^n; the space denoted by K(r, A, m) (with m ≤ n) and defined by

    K(r, A, m) = Span{r, Ar, ..., A^{m−1}r}

is referred to as the Krylov space of dimension m associated with A and r.


Why use this search space?

For the sake of simplicity of exposure, we often assume x_0 = 0. This does not mean a loss of generality, because the situation x_0 ≠ 0 can be transformed with a simple shift to the system Ay = b − Ax_0, for which obviously y_0 = 0.
The minimal polynomial q(t) of A is the unique monic polynomial of minimal degree such that q(A) = 0. It is constructed from the eigenvalues of A as follows. If the distinct eigenvalues of A are λ_1, ..., λ_ℓ and if λ_j has index m_j (the size of the largest Jordan block associated with λ_j), then the sum of all indices is

    m = ∑_{j=1}^{ℓ} m_j,   and   q(t) = ∏_{j=1}^{ℓ} (t − λ_j)^{m_j}.    (12)

When A is diagonalizable, m is the number of distinct eigenvalues of A; when A is a Jordan block of size n, then m = n.


Example

A = V^{−1} [ 3  1        ] V
           [    3        ]
           [       4     ]
           [          4  ]

I eigenvalue 3 of index 2
I eigenvalue 4 of index 1
I This gives m = 3 and q(t) = (t − 3)^2 (t − 4), whereas the characteristic polynomial is (t − 3)^2 (t − 4)^2.
I Property: q(A) = 0


If we write

    q(t) = ∏_{j=1}^{ℓ} (t − λ_j)^{m_j} = ∑_{j=0}^{m} α_j t^j,

then the constant term is α_0 = ∏_{j=1}^{ℓ} (−λ_j)^{m_j}. Therefore α_0 ≠ 0 iff A is nonsingular. Furthermore, from

    0 = q(A) = α_0 I + α_1 A + ... + α_m A^m,    (13)

it follows that

    A^{−1} = −(1/α_0) ∑_{j=0}^{m−1} α_{j+1} A^j.

This description of A^{−1} portrays x = A^{−1}b immediately as a member of the Krylov space of dimension m associated with A and b, denoted by K(b, A, m) = Span{b, Ab, ..., A^{m−1}b}.


Taxonomy of the Krylov subspace approaches

The Krylov methods for identifying x_m ∈ K(b, A, m) can be distinguished in four classes:

I The Ritz-Galerkin approach (FOM, CG, ...): construct x_m such that the residual is orthogonal to the current subspace: b − Ax_m ⊥ K(b, A, m).
I The minimum norm residual approach (GMRES, ...): construct x_m ∈ K(b, A, m) such that ||b − Ax_m||_2 is minimal.
I The Petrov-Galerkin approach: construct x_m such that b − Ax_m is orthogonal to some other m-dimensional subspace.
I The minimum norm error approach: construct x_m ∈ A^T K(b, A, m) such that ||x_m − x∗||_2 is minimal.


Constructing a basis of K(b,A,m)

I Obvious choice: b, Ab, ..., A^{m−1}b
  I Not very attractive from the numerical point of view, because the vectors A^j b become more and more collinear to the eigenvector associated with the largest eigenvalue.
  I In finite arithmetic, this leads to a loss of rank: suppose A is diagonalizable, A = VDV^{−1}; then A^k b = V D^k (V^{−1}b).
I A better choice is the Arnoldi procedure.


Arnoldi

Walter Edwin Arnoldi
1917-1995, USA.
His main research subjects covered vibration of propellers, engines and aircraft, high speed digital computers, aerodynamics and acoustics of aircraft propellers, lift support in space vehicles and structural materials.
"The principle of minimized iterations in the solution of the eigenvalue problem", Quart. of Appl. Math., Vol. 9, 1951.


The Arnoldi procedure

This procedure builds an orthonormal basis of K(A, b,m).

Arnoldi's algorithm
1: v_1 = b/‖b‖
2: for j = 1, 2, ..., m − 1 do
3:   Compute h_{i,j} = v_i^T A v_j for i = 1, ..., j
4:   Compute w_j = A v_j − ∑_{i=1}^{j} h_{i,j} v_i
5:   Compute h_{j+1,j} = ‖w_j‖
6:   Exit if (h_{j+1,j} = 0)
7:   Compute v_{j+1} = w_j / h_{j+1,j}
8: end for
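A NumPy sketch of the procedure, written in the numerically preferable modified Gram-Schmidt form (equivalent to the classical formulation above in exact arithmetic):

import numpy as np

def arnoldi(A, b, m):
    # Builds v_1, ..., v_m (orthonormal basis of K(b, A, m)) and the
    # (m x m-1) upper Hessenberg matrix H such that A V[:, :m-1] = V H.
    n = len(b)
    V = np.zeros((n, m))
    H = np.zeros((m, m - 1))
    V[:, 0] = b / np.linalg.norm(b)
    for j in range(m - 1):                   # steps j = 1, ..., m-1 of the slide
        w = A @ V[:, j]
        for i in range(j + 1):
            H[i, j] = V[:, i] @ w            # h_{i,j}
            w -= H[i, j] * V[:, i]           # orthogonalize against v_i
        H[j + 1, j] = np.linalg.norm(w)
        if H[j + 1, j] == 0.0:               # exact breakdown: subspace is invariant
            return V[:, : j + 1], H[: j + 2, : j + 1]
        V[:, j + 1] = w / H[j + 1, j]
    return V, H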


The Arnoldi procedure properties

Proposition

If the Arnoldi procedure does not stop before the m-th step, the vectors v_1, ..., v_m form an orthonormal basis of the Krylov subspace K(A, b, m).

Proof.
The vectors are orthogonal by construction. That they span K(A, b, m) follows from the fact that each vector v_j is of the form q_{j−1}(A) v_1, where q_{j−1} is a polynomial of degree j − 1. This can be shown by induction. For j = 1 it is true as v_1 = q_0(A) v_1 with q_0 = 1. Assume that it is true for all j and consider v_{j+1}. We have:

    h_{j+1,j} v_{j+1} = A v_j − ∑_{i=1}^{j} h_{i,j} v_i = A q_{j−1}(A) v_1 − ∑_{i=1}^{j} h_{i,j} q_{i−1}(A) v_1.

So v_{j+1} can be expressed as q_j(A) v_1, where q_j is of degree j.


Conjugate Gradients

Conjugate Gradient Method

I Solve Ax = b, with A symmetric positive definite

I Belongs to Ritz-Galerkin approaches (construct x_m ∈ K(b, A, m) such that b − Ax_m ⊥ K(b, A, m))
I First introduced by Hestenes and Stiefel in 1952

Definition

Two non-zero vectors u and v are conjugate (with respect to A) if u^T A v = 0.

Because A is symmetric positive definite, the left-hand side defines an inner product ⟨u, v⟩_A := ⟨Au, v⟩ = ⟨u, Av⟩ = u^T A v.


Conjugate Gradients

Let (p_k)_{k=1...n} be a sequence of conjugate directions. They form a basis of IR^n, so the solution of Ax = b can be written:

    x∗ = α_1 p_1 + · · · + α_n p_n.

Computing the α_k:

    A x∗ = α_1 A p_1 + · · · + α_n A p_n = b,
    p_k^T A x∗ = α_1 p_k^T A p_1 + · · · + α_k p_k^T A p_k + · · · + α_n p_k^T A p_n = α_k p_k^T A p_k = p_k^T b,

    α_k = p_k^T b / (p_k^T A p_k) = ⟨p_k, b⟩ / ⟨p_k, p_k⟩_A = ⟨p_k, b⟩ / ‖p_k‖_A^2.

Possible (direct) method to build a solution:

1. Build a set of n conjugate directions

2. Compute the coefficients αk (and x∗)


Conjugate Gradient Method: main principles

Optimization point of view

Note that the solution x∗ is also the unique minimizer of the quadratic function f(x) = (1/2) x^T A x − b^T x, x ∈ IR^n.
Steepest descent algorithm: search the successive x_k by moving from x_k to x_{k+1} in the direction −grad f(x_k) = −∇f(x_k) = b − A x_k = r_k.
So it makes sense to choose as first direction: p_0 = r_0 ∈ K(r_0, A, 1).
This gives x_1 = x_0 + α_0 p_0, where α_0 = r_0^T r_0 / (p_0^T A p_0) was defined above.
Remark that r_1 = b − A x_1 = p_0 − α_0 A p_0 is orthogonal to p_0 (by construction). We then choose p_1 ∈ K(r_0, A, 2) such that p_1^T A p_0 = 0.
More generally:

I r_{k+1} ⊥ span(r_0, ..., r_k) = span(p_0, ..., p_k) = K(r_0, A, k)
I Choose p_{k+1} of the form r_{k+1} + β_k p_k ∈ K(r_0, A, k + 1)

Conjugacy condition: p_{k+1} ⊥ A p_k ⇒ β_k = −(p_k^T A r_{k+1}) / (p_k^T A p_k)


[Figure: successive iterates x_0, x_1, x_2, x_3 with residuals r_1, r_2 and conjugate directions p_0, p_1, p_2.]

    p_i^T A p_j = 0 for i ≠ j
    r_k ⊥ p_i for i = 1 ... k − 1


Geometric interpretation in 2D

Minimize f(x) = (1/2) x^T A x − b^T x

Steepest descent: orthogonal directions
Conjugate gradients: A-orthogonal (or conjugate) directions

[Figure: iterates x_0, ..., x_3 of steepest descent (left) and x_0, x_1, x_2 of conjugate gradients (right).]


The algorithm (after simplifications)

CG algorithm

1: r_0 = b − A x_0, p_0 = r_0
2: for j = 0, 1, ... do
3:   α_j = (r_j^T r_j)/(p_j^T A p_j)
4:   x_{j+1} = x_j + α_j p_j
5:   r_{j+1} = r_j − α_j A p_j
6:   if r_{j+1} "sufficiently" small then
7:     Exit the loop
8:   end if
9:   β_j = (r_{j+1}^T r_{j+1})/(r_j^T r_j)
10:  p_{j+1} = r_{j+1} + β_j p_j
11: end for
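The same algorithm as a NumPy sketch (A symmetric positive definite; dense here purely for brevity — a matrix-vector routine is all that is really needed):

import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, maxiter=1000):
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for _ in range(maxiter):
        Ap = A @ p                    # the only operation involving A
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:     # "r sufficiently small"
            break
        p = r + (rs_new / rs) * p     # beta_j = rs_new / rs
        rs = rs_new
    return x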


CG: Implementation and parallelization issues

Storage: four vectors (x, p, Ap, r)
Main kernels involved:

I one matrix-vector product per iteration (parallelizable)
I two dot products (latency + synchronization!)
  I No issue on shared memory computers – BLAS 1 routine
  I Distributed memory computer: each processor computes its local contribution using the components it owns, followed by a reduction.
  I Warning: in finite elements, one must decide who is responsible for the variables at the interface between 2 processors


CG: Convergence properties

I Let x_k be the k-th iterate generated by the CG algorithm and κ(A) the ratio λ_max/λ_min. Then

    ‖x_k − x∗‖_A ≤ 2 ( (√κ(A) − 1) / (√κ(A) + 1) )^k ‖x_0 − x∗‖_A.

I Much better if λ_max/λ_min is close to 1.
I If A is diagonalizable with m distinct eigenvalues, then CG converges in at most m steps (minimal polynomial).
I Furthermore, convergence is quicker if the eigenvalues are clustered.
I How to improve κ / better cluster the eigenvalues?

Preconditioning


Iterative Methods
Basic iterative methods (stationary methods)
Krylov subspace methods
Preconditioning


Driving principles to design preconditioners

Find a non-singular matrix M such that MA has "better" properties with respect to the convergence behaviour of the selected Krylov solver:

I MA has fewer distinct eigenvalues,
I MA ≈ I in some sense.


The preconditioner constraints

The preconditioner should

I be cheap to compute and to store,

I be cheap to apply,

I ensure a fast convergence.

With a good preconditioner, the solution time for the preconditioned system should be significantly less than for the unpreconditioned system.


The particular case of CG

For CG, let M be given in factorized form (i.e. M = CC^T); then CG can be applied to

    Ã x̃ = b̃,

with Ã = C^T A C, C x̃ = x and b̃ = C^T b. Let us define:

    x_k = C x̃_k,
    p_k = C p̃_k,
    r̃_k = C^T r_k,
    z_k = C C^T r_k.


Using Ã = C^T A C, x_k = C x̃_k, p_k = C p̃_k, r̃_k = C^T r_k, we can write the CG algorithm for both the preconditioned variables and the unpreconditioned ones.

Conjugate Gradient algorithm

1. Compute r̃_0 = b̃ − Ã x̃_0 and p̃_0 = r̃_0
2. For k = 0, 1, 2, ... Do
3.   α_k = r̃_k^T r̃_k / p̃_k^T Ã p̃_k = r_k^T C C^T r_k / p_k^T A p_k = r_k^T z_k / p_k^T A p_k
4.   x̃_{k+1} = x̃_k + α_k p̃_k    (multiply by C)    ⇒  x_{k+1} = x_k + α_k p_k
5.   r̃_{k+1} = r̃_k − α_k Ã p̃_k   (multiply by C^{−T}) ⇒  r_{k+1} = r_k − α_k A p_k
6.   β_k = r̃_{k+1}^T r̃_{k+1} / r̃_k^T r̃_k = r_{k+1}^T C C^T r_{k+1} / r_k^T C C^T r_k = r_{k+1}^T z_{k+1} / r_k^T z_k
7.   p̃_{k+1} = r̃_{k+1} + β_k p̃_k  (multiply by C)    ⇒  p_{k+1} = C C^T r_{k+1} + β_k p_k = z_{k+1} + β_k p_k
8.   if x_k accurate enough then stop
9. EndDo


Writing the algorithm only using the unpreconditioned variables leads to:

Preconditioned Conjugate Gradient algorithm
1. Compute r_0 = b − A x_0, z_0 = M r_0 and p_0 = z_0
2. For k = 0, 1, 2, ... Do
3.   α_k = r_k^T z_k / p_k^T A p_k
4.   x_{k+1} = x_k + α_k p_k
5.   r_{k+1} = r_k − α_k A p_k
6.   z_{k+1} = M r_{k+1}
7.   β_k = r_{k+1}^T z_{k+1} / r_k^T z_k
8.   p_{k+1} = z_{k+1} + β_k p_k
9.   if x_k accurate enough then stop
10. EndDo


MA has fewer distinct eigenvalues: an example

Let 𝒜 = [ A   B^T ]    and    P = [ A        0        ]
        [ C    0  ]               [ 0   C A^{−1} B^T  ]

Then P^{−1} 𝒜 has three distinct eigenvalues.
[Murphy, Golub, Wathen, SIAM SISC, 21 (6), 2000]


Preconditioner taxonomy

There are two main classes of preconditioners:

I Implicit preconditioners: approximate A with a matrix M such that solving the linear system Mz = r is easy.
I Explicit preconditioners: approximate A^{−1} with a matrix M and just perform z = Mr.

The governing ideas in the design of the preconditioners are very similar to those followed to define iterative stationary schemes. Consequently, all the stationary methods can be used to define preconditioners.


Stationary methods

Let x0 be given and M ∈ IRn×n a nonsingular matrix, compute

xk = xk−1 + M(b − Axk−1).

Note that b − A x_{k−1} = A(x∗ − x_{k−1}) ⇒ the best M is A^{−1}.
The stationary scheme converges to x∗ = A^{−1} b for any x_0 iff ρ(I − MA) < 1, where ρ(·) denotes the spectral radius.
Let A = L + D + U:

I M = I : Richardson method,

I M = D−1 : Jacobi method,

I M = (L + D)−1 : Gauss-Seidel method.

Notice that M always has a special structure and the inverse must never be explicitly computed (z = B^{−1} y reads: solve the linear system Bz = y).


Preconditioner location

Several possibilities exist to solve Ax = b:

I Left preconditioner:
    M A x = M b.
I Right preconditioner:
    A M y = b   with x = M y.
I Split preconditioner, if M = M_1 M_2:
    M_2 A M_1 y = M_2 b   with x = M_1 y.

Notice that the spectra of MA, AM and M_2 A M_1 are identical (for any matrices B and C, the eigenvalues of BC are the same as those of CB).


Some classical algebraic preconditioners

I Incomplete factorization: IC, ILU(p), ILU(p, τ)
I SPAI (Sparse Approximate Inverse): compute the sparse approximate inverse by minimizing the Frobenius norm ‖MA − I‖_F
I FSAI (Factorized Sparse Approximate Inverse): compute the sparse approximate inverse of the Cholesky factor by minimizing the Frobenius norm ‖I − GL‖_F
I AINV (Approximate Inverse): compute the sparse approximate inverse of the LDU or LDL^T factors using an incomplete biconjugation process


Incomplete factorizations

One variant of the LU factorization writes:

IKJ variant - top looking variant
1. for i = 2, ..., n do
2.   for k = 1, ..., i − 1 do
3.     a_{i,k} = a_{i,k} / a_{k,k}
4.     for j = k + 1, ..., n do
5.       a_{i,j} = a_{i,j} − a_{i,k} ∗ a_{k,j}
6.     end for
7.   end for
8. end for


Zero fill-in ILU - ILU(0)

Let NZ(A) denote the set of (row, column) indices of the nonzero entries of A.

ILU(0)
1. for i = 2, ..., n do
2.   for k = 1, ..., i − 1 and (i, k) ∈ NZ(A) do
3.     a_{i,k} = a_{i,k} / a_{k,k}
4.     for j = k + 1, ..., n and (i, j) ∈ NZ(A) do
5.       a_{i,j} = a_{i,j} − a_{i,k} ∗ a_{k,j}
6.     end for
7.   end for
8. end for
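A direct (dense, educational) Python transcription of ILU(0): updates are applied only at positions belonging to NZ(A), so the factors have exactly the sparsity pattern of A.

import numpy as np

def ilu0(A):
    # ILU(0) on a dense copy of A; the result holds the strict lower part of L
    # (the unit diagonal of L is not stored) and the upper part of U.
    n = A.shape[0]
    nz = (A != 0)                       # the pattern NZ(A)
    F = A.astype(float).copy()
    for i in range(1, n):
        for k in range(i):
            if not nz[i, k]:
                continue
            F[i, k] /= F[k, k]
            for j in range(k + 1, n):
                if nz[i, j]:
                    F[i, j] -= F[i, k] * F[k, j]
    return F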


Level of fill in ILU - ILU(p)

Definition

The initial level of fill of an entry a_{i,j} is defined by:

    lev(i, j) = 0 if a_{i,j} ≠ 0 or i = j, ∞ otherwise.

Each time this entry is modified in line 5 of the LU top looking algorithm, its level of fill is updated by

    lev(i, j) = min{lev(i, j), lev(i, k) + lev(k, j) + 1}.    (14)


A first example

A = [ x  x  x  x ]        lev = [ 0  0  0  0 ]        [ 0  0  0  0 ]
    [ x  x       ]              [ 0  0  ∞  ∞ ]   →    [ 0  0  1  1 ]
    [ x     x    ]              [ 0  ∞  0  ∞ ]        [ 0  1  0  1 ]
    [ x        x ]              [ 0  ∞  ∞  0 ]        [ 0  1  1  0 ]

ILU(p)
1. for all nonzero entries a_{i,j} set lev(i, j) = 0
2. for i = 2, ..., n do
3.   for k = 1, ..., i − 1 and lev(i, k) ≤ p do
4.     a_{i,k} = a_{i,k} / a_{k,k}
5.     for j = k + 1, ..., n do
6.       a_{i,j} = a_{i,j} − a_{i,k} ∗ a_{k,j}
7.       lev(i, j) = min{lev(i, j), lev(i, k) + lev(k, j) + 1}
8.       if lev(i, j) > p then a_{i,j} = 0
9.     end for
10.  end for
11. end for


Another example

[ ×  ·  ·  ×  ·  ×  ·  × ]        [ 0  ∞  ∞  0  ∞  0  ∞  0 ]
[ ·  ×  ·  ·  ×  ·  ·  · ]        [ ∞  0  ∞  ∞  0  ∞  ∞  ∞ ]
[ ·  ×  ×  ·  ·  ·  ×  · ]        [ ∞  0  0  ∞  1  ∞  0  ∞ ]
[ ×  ·  ·  ×  ×  ·  ·  · ]   →    [ 0  ∞  ∞  0  0  1  ∞  1 ]
[ ·  ×  ×  ·  ×  ×  ·  · ]        [ ∞  0  0  ∞  0  0  ∞  1 ]
[ ·  ·  ×  ×  ·  ×  ×  · ]        [ ∞  ∞  0  0  2  0  0  2 ]
[ ×  ·  ·  ·  ·  ×  ×  · ]        [ 0  ∞  ∞  1  2  0  0  1 ]
[ ·  ·  ·  ·  ×  ·  ·  × ]        [ ∞  ∞  ∞  ∞  0  1  2  0 ]

It may require storing many fill-ins that are small in absolute value.


ILU(p, τ): dual threshold strategy

I Fix a drop tolerance τ and a number p of fill-ins to be allowed in each row of the incomplete LU factors. At each step of the elimination process, drop all fill-ins that are smaller than τ times the 2-norm of the current row; among the remaining ones, keep only the p largest.
I Trade-off between the amount of fill-in (construction time and application time for the preconditioner) and the decrease of the number of iterations.
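Dual-threshold incomplete factorizations are available in standard libraries; for instance SciPy exposes SuperLU's incomplete LU through spilu, whose drop_tol and fill_factor parameters play roles analogous to τ and p. A hedged usage sketch (this is SciPy's ILUTP-style factorization, not the exact ILU(p, τ) algorithm above):

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 1000
A = sp.diags([-1.0, 4.0, -1.0], [-1, 0, 1], shape=(n, n), format="csc")
b = np.ones(n)

ilu = spla.spilu(A, drop_tol=1e-4, fill_factor=10)      # incomplete LU factors
M = spla.LinearOperator((n, n), matvec=ilu.solve)       # preconditioner M ≈ A^{-1}
x, info = spla.gmres(A, b, M=M)                         # preconditioned GMRES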


Incomplete factorization vs. megaflop performance

The preconditioning step requires the solution of two sparse triangular systems, which can lead to poor performance on vector computers due to the data dependencies. Special treatment can be implemented for structured matrices (FD matrices).

Computer      MFlop/s rate for PCG   MFlop/s rate for PCG with structure   MFlop/s rate for CG
NEC SX-3               60                         607                             1124
Cray C-90              56                         444                              737
RS 6000                19                          18                               21


The SPAI preconditioner

The idea is to compute the sparse approximate inverse as the matrix M which minimizes ‖I − MA‖_F (or ‖I − AM‖_F for right preconditioning) subject to certain sparsity constraints. The choice of the Frobenius norm is motivated by the identity:

    ‖I − AM‖_F^2 = ∑_{j=1}^{n} ‖e_j − A m_{∗,j}‖_2^2    (15)

where e_j is the j-th unit vector and m_{∗,j} is the j-th column of M. Because of the sparsity constraint, only a few rows and columns of A are used to compute each column of M. The least-squares problems can be solved efficiently with dense QR.
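A toy dense sketch of this column-by-column construction with a prescribed pattern (pattern selection, the hard part of SPAI, is not addressed; patterns[j] is a hypothetical input listing the allowed nonzero rows of column j):

import numpy as np

def approximate_inverse_fixed_pattern(A, patterns):
    # For each column j, minimize ||e_j - A m_{*,j}||_2 with the nonzeros of
    # m_{*,j} restricted to patterns[j]; columns are independent (parallel).
    n = A.shape[0]
    M = np.zeros((n, n))
    for j in range(n):
        J = np.asarray(patterns[j])
        e = np.zeros(n); e[j] = 1.0
        mJ, *_ = np.linalg.lstsq(A[:, J], e, rcond=None)   # small least squares
        M[J, j] = mJ
    return M

# A common a priori choice: the pattern of A itself, columnwise
# patterns = [np.nonzero(A[:, j])[0] for j in range(A.shape[0])]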


The SPAI preconditioner (cont)

I Embarrassingly parallel construction.
I Preconditioner application reduces to a sparse matrix-vector product.
I For some applications the pattern of the inverse can be prescribed a priori (e.g. by considering the pattern of powers of A, ParaSails code by E. Chow).

The main difficulty consists in the determination of the sparsity pattern.


Hybrid approaches

One route to the solution of large sparse linear systems in parallel scientific computing is the use of hybrid methods that combine direct and iterative methods. These techniques inherit the advantages of each approach, namely the limited amount of memory and easy parallelization for the iterative component, and the numerical robustness of the direct part.

I Block preconditioners (block Jacobi, algebraic Schwarz variants, ...).

I Domain decomposition techniques.
