
© 2015 IBM Corporation

STAC-A2™ Benchmark on POWER8®

Bishop Brock, Frank Liu, Karthick Rajamani
IBM Research, Austin, Texas
{bcbrock,frankliu,karthick}@us.ibm.com

Workshop for High-Performance Computational Finance
WHPCF'15
November 20, 2015

IBM Research

With thanks to Kenneth Hill of the University of Florida and Julian Demouth of NVIDIA



Summary

● Workload characterization and performance analysis demonstrate that STAC-A2™ is a well-rounded single-system HPC benchmark for risk analytics

● POWER8-based Linux® systems demonstrate consistently high performance at every scale (and we explain why)

→ Current POWER8-based systems are capable and competitive platforms for computational finance



Simplified POWER8 Core: SMT2, SMT4, SMT8

[Diagram: core block diagram split into Thread Set 0 and Thread Set 1 (SMT2+); labels: NB, BR, Regs0, Regs1, UniQ0, UniQ1, FXU, LSU, LU, DP FP, 128-bit DP SIMD, CR, Crypto]

Up to 8 threads in 2 thread sets; dispatch/complete up to 3 non-branch + 1 branch per cycle per thread set

Register file/renames shared equally by thread set

Thread sets only dispatch to 'their' UniQ

Execution units are split between UniQs (thread sets)



STAC® and the STAC-A2™ Benchmark

● Securities Technology Analysis Center (STAC)
– Coordinates the STAC Benchmark Council™
– Publishes and audits several benchmarks covering technologies important to capital markets

● STAC-A2 Benchmark: Risk Management
– Computes numerous sensitivities (Greeks) of a multi-asset, exotic call option (lookback, best-of) with early exercise
– Modeled by Monte Carlo simulation of the Heston stochastic volatility model
– Priced using the Longstaff-Schwartz algorithm
– Greeks computed using finite differences

The STAC-A2 benchmark suite measures the performance, scalability, quality and energy efficiency of any hardware/software system that is capable of implementing the benchmark specification. In this presentation we focus only on the performance and scalability benchmarks.



STAC-A2 Benchmark Dimensions and Challenges

● Performance/Scalability Dimensions
– Assets: The number of correlated assets: O(assets²)
– Timesteps: Granularity of discretization: O(timesteps)
– Paths: Number of Monte Carlo paths: O(paths)

● Challenges
– Unit-normal random number generation
– Dense matrix operations (custom correlation routine, beaucoup ILP)
– SQRT/DIVIDE-intensive Monte Carlo kernel (little ILP)
– Cache-efficient data management
– Efficient numerical methods (custom SVD routine)
– Bandwidth-limited performance in certain benchmarks
– Single-system parallel load balancing
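To illustrate the O(assets²) correlation step named above, here is a minimal NumPy sketch that turns independent unit normals into correlated ones with a Cholesky factor. The 3 x 3 correlation matrix, the function name and the sample count are hypothetical, and this is not the custom correlation routine used in the benchmark solution.

```python
import numpy as np

def correlated_normals(corr, n_samples, seed=0):
    """Return an (n_samples, assets) array of unit normals with correlation `corr`."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(corr)  # lower-triangular L with L @ L.T == corr
    z = rng.standard_normal((n_samples, corr.shape[0]))  # independent draws
    return z @ chol.T  # dense multiply: O(assets^2) work per sample

# Hypothetical correlation matrix for three assets
corr = np.array([[1.0, 0.5, 0.2],
                 [0.5, 1.0, 0.3],
                 [0.2, 0.3, 1.0]])
samples = correlated_normals(corr, 200_000)
```

The dense product is regular and independent across samples, which is consistent with the slide's remark that this phase offers plenty of instruction-level parallelism.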



IBM STAC-A2 Solution Structure

[Diagram: RNG → Monte Carlo Sim. → Arrays → Complex Scenario Gen. → Longstaff-Schwartz Pricing, with stages labeled (b) and (c)]

Monte Carlo simulation produces a number of paths x timesteps arrays (scenarios) which must be stored for later analysis.

Finite difference example:

$$\Theta = \frac{y_{\Delta t} - y}{\Delta t}$$

where $y$ is the unmodified option value scenario and $y_{\Delta t}$ is the modified expiration scenario.

A master thread (a) spawns multiple worker threads that perform Monte Carlo simulation in parallel (b). Simulation is partitioned by paths; each thread creates the same path-partition of every array. Scenarios are (generated and) priced by cohorts of 1 or more threads (c). Finally the master thread computes the finite differences (d).
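The finite-difference step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a closed-form Black-Scholes European call stands in for the benchmark's Monte Carlo pricer, the "modified expiration" is taken to mean the expiration bumped by +Δt, and all names and parameter values are hypothetical.

```python
import math

def bs_call(spot, strike, rate, sigma, maturity):
    """Black-Scholes European call: a stand-in pricer, not the STAC-A2 option."""
    vol = sigma * math.sqrt(maturity)
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma**2) * maturity) / vol
    d2 = d1 - vol
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    return spot * cdf(d1) - strike * math.exp(-rate * maturity) * cdf(d2)

def theta_fd(pricer, maturity, dt):
    """Forward difference (y_dt - y) / dt with the expiration bumped by dt."""
    y = pricer(maturity)          # unmodified scenario
    y_dt = pricer(maturity + dt)  # modified-expiration scenario (assumed +dt)
    return (y_dt - y) / dt

# Hypothetical inputs: at-the-money call, one year to expiry, one-day bump
pricer = lambda t: bs_call(100.0, 100.0, 0.05, 0.2, t)
theta = theta_fd(pricer, 1.0, 1.0 / 252)
```

In the benchmark each Greek needs at least one additional priced scenario of this kind, which is why the scenario counts are much larger than the asset counts.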



Key “Greeks” Workloads + Characterization

Workload        Goal         Assets  Timesteps  Paths  Result   Scenarios  Memory*
Baseline        Speed        5       252        25K    0.317 s  93         2.5 GiB
Large Problem   Speed        10      1260       100K   28.9 s   308        143 GiB
Asset Scaling   Max Assets   *       252        25K    78       15642*     309 GiB
Path Scaling    Max Paths    5       252        *      28M      93         2.8 TiB

* Values unique to our solution

[Figure: stacked bar chart "Approximate Profile Breakdown by Application Phase, POWER8 S824*"; x-axis: Baseline, Large Problem, Asset Scaling, Path Scaling; y-axis: 0% to 100%; phases: RNG + Correlation, Monte Carlo (Heston Model), Array Transpose, Longstaff-Schwartz Pricing]

* 2 x POWER8 @ 3.52 GHz (nominal) / 3.925 GHz (turbo); 24 total cores; 1 TiB memory; IBM XL C/C++; RHEL 7 Big-endian



Memory Bandwidth Sensitivity

POWER8 S824 Relative Performance vs. Available Memory Bandwidth*
(performance relative to full bandwidth)

Workload                      Full (192 GB/sec per socket)  Half (96 GB/sec per socket)  Quarter (48 GB/sec per socket)
Baseline                      1                             0.96                         0.82
Large Problem                 1                             0.93                         0.72
Asset Scaling (65 Assets)     1                             1                            0.99
Path Scaling (8M Paths/STC)   1                             0.67                         0.44

*Not a STAC-A2 benchmark



Performance Comparisons – STAC-A2 Benchmarks

Legend                            STAC SUT*    Description
2 x POWER8                        IBM150305    2 x IBM POWER8 @ 3.52 GHz; 24 total cores; 1 TiB Memory
2 x Haswell-EP + 2 x Xeon PHI     INTC151028   2 x Intel Xeon E5-2697 v3 @ 2.6 GHz + 2 x Intel Xeon PHI 7120P @ 1.24 GHz; 28 Haswell cores + 122 PHI cores; 256 GiB Memory
4 x Haswell-EX                    INTC150811   4 x Intel Xeon E7-8890 v3 @ 2.50 GHz; 72 total cores; 1 TiB Memory
NVIDIA K80                        NVDA141116   1 x NVIDIA Tesla K80 GPU; Supermicro SYS-2027GR-TRFH Host; 128 MiB Memory
2 x Haswell-EP + Xeon PHI         INTC140815   2 x Intel Xeon E5-2699 v3 @ 2.3 GHz + 1 x Intel Xeon PHI 7120A @ 1.24 GHz; 36 Haswell cores + 61 PHI cores; 256 GiB Memory
2 x Haswell-EP                    INTC140814   2 x Intel Xeon E5-2699 v3 @ 2.3 GHz; 36 total cores; 256 GiB Memory

*For details see www.stacresearch.com/<STAC SUT>

[Figure: three bar-chart panels comparing the six systems.
Panel 1: Relative Workload Performance (Baseline, Large Problem)
Panel 2: Max Assets (Asset Scaling)
Panel 3: Max Paths (Path Scaling)
Legend: 2 x POWER8; 2 x Haswell-EP + 2 x Xeon PHI; 4 x Haswell-EX; NVIDIA K80; 2 x Haswell-EP + Xeon PHI; 2 x Haswell-EP]

Intel, Xeon, PHI, NVIDIA, Tesla and Supermicro are trademarks or registered trademarks of their respective owners and are used for identification purposes only



Performance by SMT Mode

SMT Mode Performance Relative to SMT1, STAC-A2 Baseline Greeks Workload*

SMT Mode   1 Core   1 Socket (12 Cores)   2 Sockets (24 Cores)
SMT1       1        1                     1
SMT2       1.61     1.68                  1.66
SMT4       2.00     2.04                  2.12
SMT8       1.96     1.97                  1.99

*Not a STAC-A2 benchmark



System-Level Scalability Comparisons

[Figure: bar chart "Scaling Single Core Performance to Half/Full System Performance* (Higher is Better, 1.0 = Ideal)"; x-axis: Experiment (Core → Half System, Core → Full System); y-axis: Fraction of theoretical speedup, based on number of cores (0 to 1)]

Systems compared:
• 2 x POWER8: Half = 12 cores, Full = 24 cores, SMT4
• 4 x Haswell-EX: Half = 36 cores, Full = 72 cores, HyperThreading
• 2 x Haswell-EP: Half = 18 cores, Full = 36 cores, HyperThreading

*Data is from the official STAC-A2 audit reports for these systems; however, these are not considered STAC-A2 benchmarks. Please see the "Performance Comparisons – STAC-A2 Benchmarks" slide for full details of the systems being compared here.



Optimizing Transpose for Monte Carlo Simulation

Straightforward (i.e., not using complex blocking schemes), cache-efficient Longstaff-Schwartz pricing effectively requires transposing time-major data into path-major storage.

[Diagram: Simulation (data generated time major) and Pricing (data analyzed path major), each shown as a path 0 … path N-1 by t0 … t_timesteps-1 array; Path Pair n + 0 through Path Pair n + 7 are inserted into a block-row, which is finally block-transposed]

Our Solution:

A blocked, in-place matrix transpose using no additional storage, taking advantage of the fact that (post-processed) simulation data is created one path (pair) at a time

Simple and relatively efficient despite:
• Moderately poor locality
• Requiring a second pass over the data

Simple heuristics based on working set size improve performance for large working sets

2% - 9% of total benchmark run time



Fast, Parallel Longstaff-Schwartz Algorithm

[Diagram: paths x timesteps array (path0 … pathN-1 by t0 … t_timesteps-1) partitioned by path among Thread0 – Thread3]

• Paths are partitioned by thread, similar to Monte Carlo simulation

• Threads synchronize (twice per timestep here) and complete each timestep in lock-step

• Scalability is limited by synchronization and communication overheads

• For the Baseline case (25,000 paths) performance improves for up to 16 – 20 SMT4 threads per scenario

• Least-Squares Monte Carlo (LSMC or Longstaff-Schwartz) is an algorithm for valuing early-exercise options.

• LSMC operates serially, backwards in time.

• At each time step, LSMC estimates the optimal exercise value by least squares regression using a cross section of simulated data, approximating this value as a linear combination of basis functions.

• We describe a fast, easily parallelizable LSMC method based on a QR factorization algorithm specific to row Vandermonde matrices.

• This approach is used for “microbenchmarks” and for Path Scaling, where utilizing every thread optimizes performance.

• This approach was inspired by NVIDIA's description of their LSMC algorithm for STAC-A2.



Future Work

● Exploiting the high-bandwidth CPU ↔ GPU NVLINK™ unique to future OpenPOWER systems

[Image: Current OpenPOWER Foundation Gold Members*]

* http://www.smartercomputingblog.com/power-systems/openpower-hpc-big-data-analytics/



Special Notices

Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied.

All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions.

Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.

A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.

Other company, product and service names may be trademarks or service marks of others and are used for identification purposes only.



Backup



POWER8 Simultaneous Multi-threading (SMT)

● SMT modes represent different partitioning of core resources: Fetch/Dispatch; Reg. Files; Execution Pipes

● SMT mode is determined by the number of active hardware threads/core, regardless of the number of threads configured per core
– SMT mode switch is automatic as threads enter/exit idle states
– An enhancement over POWER7

● SMT modes: SMT1, SMT2, SMT4 and SMT8 entered when 1, 2, 3 – 4, 5 – 8 threads are active respectively

● Most STAC-A2 benchmarks run in SMT4; some SMT2
– SMT4 here means we bind 4 software threads to 4 of 8 configured HW threads per core
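The mapping from active hardware threads to SMT mode stated above can be written as a small lookup. The function name is hypothetical; the thresholds are exactly those on the slide.

```python
def smt_mode(active_threads):
    """Return the POWER8 SMT mode entered for a given number of active HW threads per core."""
    if not 1 <= active_threads <= 8:
        raise ValueError("a POWER8 core runs between 1 and 8 active hardware threads")
    for mode in (1, 2, 4, 8):  # SMT1, SMT2, SMT4, SMT8
        if active_threads <= mode:
            return f"SMT{mode}"

modes = [smt_mode(n) for n in range(1, 9)]
```

Binding 4 software threads per core therefore keeps the core in SMT4 even when 8 hardware threads are configured, as long as the other 4 stay idle.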



Simplified POWER8 Core: SMT1

[Diagram: core block diagram in SMT1 with a single thread; labels: NB, BR, Regs, Regs (Mirror), UniQ0, UniQ1, FXU, LSU, LU, DP FP, 128-bit DP SIMD, CR, Crypto]

Dispatch/complete up to 6 non-branch + 2 branch per cycle

Register files/renames are mirrored in SMT1, allowing use of either UniQ

Unified issue queues; BRanch and Condition Register have private queues; Crypto issued from either UniQ

Execution units: pairs of Double Precision Floating Point pipes implement 2-way DP SIMD; 1 FiXed point Unit, 1 Load-Store Unit, 1 Load Unit per side.

Note each DP FP can also function as 2 x Single-Precision FP



Optimizing Data Transpose for Monte Carlo Sim.

Straightforward (i.e., not using complex blocking schemes), cache-efficient Longstaff-Schwartz pricing effectively requires transposing time-major data into path-major storage. Traditional out-of-place rectangular transpose requires too much memory. Traditional in-place transpose is too slow.

[Diagram: Simulation (data generated time major) and Pricing (data analyzed path major), each shown as a path 0 … path N-1 by t0 … t_timesteps-1 array; "Insert Pair n + 0", "Insert Pair n + 1", etc. scatter each path pair's t0 … t_timesteps-1 values across the path-major array]

Simple Insertion of time-major data into a path-major array is a disaster!

• Cache lines only partially populated when touched
• Line may be evicted to memory before being touched again
• Each line is in a unique virtual page

Instead we use the blocked approach illustrated on the next slide.



Illustration of In-Place Blocked Transpose for a 16x16 Block-Row




[Diagram: a block-row spanning t0 … t_timesteps-1, divided into 16 x 16 blocks (t0 – t15, t16 – t31, t32 – t47, …). "Insert Pair n + 0" through "Insert Pair n + 7" write Path Pair n + 0 … Path Pair n + 7 into every block, followed by a (Deferred) Block Transpose]

Path pairs (original + antithetic) are simulated together and inserted into the blocked (16 x 16) and padded destination array block-row such that the blocks are transposed. (Eventually) transposing the blocks then brings the data into the correct orientation.
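The insertion-then-transpose idea can be sketched with NumPy for a single block-row. This is a simplified illustration, not the production routine: it assumes the pricing layout is one row per timestep with the 16 paths of the block-row adjacent, treats the block-row as a standalone contiguous array, and uses hypothetical names throughout.

```python
import numpy as np

BLOCK = 16  # block edge from the slide: 16 paths x 16 timesteps

def insert_path(strip, row, series):
    """Write one path's time series into row `row` of every 16 x 16 block."""
    blocks = strip.reshape(-1, BLOCK, BLOCK)       # view: (block, row, column)
    padded = np.zeros(blocks.shape[0] * BLOCK)     # pad timesteps to a block multiple
    padded[:series.size] = series
    blocks[:, row, :] = padded.reshape(-1, BLOCK)  # contiguous 16-value writes

def transpose_blocks(strip):
    """Deferred step: transpose each 16 x 16 block in place."""
    for blk in strip.reshape(-1, BLOCK, BLOCK):
        blk[...] = blk.T.copy()

timesteps = 40
rng = np.random.default_rng(0)
paths = rng.standard_normal((BLOCK, timesteps))  # time major: one row per simulated path
t_padded = -(-timesteps // BLOCK) * BLOCK        # round up to a multiple of 16
strip = np.zeros((t_padded, BLOCK))              # destination block-row
for r in range(BLOCK):
    insert_path(strip, r, paths[r])
transpose_blocks(strip)                          # now strip[t, r] == paths[r, t]
```

Each insertion writes 16 adjacent doubles per block instead of one value per destination row, which is the locality benefit the slide contrasts with Simple Insertion.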



Transpose Performance

Any technique other than the Simple Insertion gives good performance, and all techniques benefit from manual prefetching. 2% - 9% of total benchmark run time is currently spent doing this transpose (varies by workload).

If the working set (16 paths per simultaneously generated array, per thread, per core) appears to fit in the 8 MiB L3, it is advantageous to transpose each block-row as soon as it is filled; otherwise it is better to defer the final block transpose until the entire array is filled.

Our latest results also suggest that Blocked Insertion with Immediate Transpose is more efficient than using a library out-of-place transposition routine (DGETMO) for L3-contained working sets. [Each thread transposes the section of the array it created.]†

Performance of Monte Carlo Simulation by Transpose Method, Relative to Blocked Insertion with Immediate Transpose†

Workload (Working Set, MiB)   Blocked (Deferred Transpose)   Simple Insertion   Buffered Insertion*
Baseline (6.4)                0.91                           0.63               0.94
Large Problem (93.5)          1.04                           0.46               1.00
Asset Scaling (797)           1.06                           0.85               0.95
Path Scaling (6.4)            0.93                           0.63               1.01
5:126:25000 (3.2)             0.88                           0.75               0.97
5:1260:25000 (32.0)           1.02                           0.37               1.07

† Not a STAC-A2 Benchmark
* Discussed in the paper



Parallel Longstaff-Schwartz Algorithm

● After simulation each (complex) scenario is reduced to a single value using the Longstaff-Schwartz algorithm

● Multi-threaded parallelization is needed in cases where the number of threads exceeds the number of scenarios
– “Microbenchmarks”, e.g., Delta, where 96 threads price 10 scenarios
– Path Scaling: 93 scenarios are divided into 5 partitions due to limited memory; partitions contain 10 to 40 scenarios

● We may also use parallelization heuristically to improve load balancing, even when scenarios > threads

[Diagram: Scenario Array (paths x timesteps) → Longstaff-Schwartz → 3.276517]



Longstaff-Schwartz (Least-Squares Monte Carlo [LSMC]) in a Nutshell

• Prior to expiration the holder will exercise an in-the-money option if the future discounted cash flow from holding the option is expected to be less than the current value.

• The value of the option is maximized if the exercise happens as soon as this is true.

• LSMC estimates the future value by least squares regression using a cross section of simulated data, approximating this value as a linear combination of basis functions.

• LSMC is robust to the choice of basis functions; STAC-A2 specifies the use of polynomial basis functions

[Figure: "Longstaff-Schwartz Example Polynomial Curve Fit"; x-axis: Current Payoff (Exercise Now), 0 to 6; y-axis: Future Payoff, 0 to 6; shows the Least-Squares Fit, the Current = Future line, and the Exercise Now and Defer Exercise regions]

Analysis performed for each timestep. The Singular Value Decomposition (SVD) approach to least-squares is preferred for numerical stability.
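The backward induction described above can be sketched compactly in NumPy. The sketch makes several simplifying assumptions that differ from STAC-A2: a single-asset American put under geometric Brownian motion replaces the multi-asset Heston lookback option, the polynomial degree and all parameter values are hypothetical, and `numpy.linalg.lstsq` (an SVD-based solver) stands in for the custom routine.

```python
import numpy as np

def lsmc_american_put(s0, strike, rate, sigma, maturity, steps, paths, degree=3, seed=0):
    """Least-Squares Monte Carlo price of an American put (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    dt = maturity / steps
    disc = np.exp(-rate * dt)
    z = rng.standard_normal((paths // 2, steps))
    z = np.concatenate([z, -z])  # original + antithetic path pairs
    increments = (rate - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z
    s = s0 * np.exp(np.cumsum(increments, axis=1))  # s[:, k] is the price at time (k + 1) * dt
    cash = np.maximum(strike - s[:, -1], 0.0)       # payoff at expiration
    for k in range(steps - 2, -1, -1):              # serial, backwards in time
        cash *= disc                                # discount future cash flow one step
        exercise = np.maximum(strike - s[:, k], 0.0)
        itm = exercise > 0.0                        # regress on in-the-money paths only
        if itm.sum() > degree:
            a = np.vander(s[itm, k] / strike, degree + 1, increasing=True)
            coef, *_ = np.linalg.lstsq(a, cash[itm], rcond=None)
            continuation = a @ coef                 # estimated value of deferring exercise
            idx = np.flatnonzero(itm)[exercise[itm] > continuation]
            cash[idx] = exercise[idx]               # exercise now on these paths
    return disc * cash.mean()

price = lsmc_american_put(36.0, 40.0, 0.06, 0.2, 1.0, steps=50, paths=20_000)
```

The regression at each timestep only touches one column of prices and the running cash-flow vector, which is why the storage orientation of the scenario arrays matters for cache efficiency.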



Longstaff-Schwartz Parallelization Schemes

By Timestep

[Diagram: paths x timesteps array (path0 … pathN-1 by t0 … t_timesteps-1) with timesteps assigned to Thread0 – Thread3]

• Threads “leapfrog” backwards in time, computing SVD in parallel

• Final evaluation is still serial: a thread must wait for the next timestep to complete before evaluating the regression

• Scalability severely limited by the ratio T_SVD : T_eval

By Path-Partition

[Diagram: paths x timesteps array (path0 … pathN-1 by t0 … t_timesteps-1) partitioned by path among Thread0 – Thread3]

• Paths are partitioned by thread, similar to Monte Carlo simulation

• Threads synchronize (twice per timestep here) and complete each timestep in lock-step

• Scalability is limited by synchronization and communication overheads

• For the Baseline case (25,000 paths) performance improves for up to 16 – 20 SMT4 threads per scenario



Path-Parallel Least Squares Solution from QR Factorization and SVD

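• Design matrix A: STAC-A2 specifies polynomial basis functions, hence A is a row Vandermonde matrix

$$A_{\text{paths}\times n} = \begin{bmatrix} & & \vdots & & \\ 1 & p_i & p_i^2 & \cdots & p_i^{n-1} \\ & & \vdots & & \end{bmatrix}$$

• The QR factorization of A. This is path-parallelizable in general for any basis functions; however, we use a fast QR factorization for Vandermonde matrices requiring trivial parallel communication

$$A_{\text{paths}\times n} = Q_{\text{paths}\times n}\, R_{n\times n}$$

• The SVD of R

$$R = U \Sigma V^T$$

• The SVD of A

$$A = (QU)\, \Sigma V^T$$

• Solution for coefficients $x^T$ minimizing $\lVert Ax - b \rVert_2$, where b is the discounted future cash flow

$$x^T = b^T \left[ (QU)\, \Sigma^{-1} V^T \right] = b^T_{\text{paths}}\, W_{\text{paths}\times n}$$

• Both $b^T$ and $W$ can be path-partitioned between N threads. Each thread j computes n partial coefficients

$$x_i^T(j) = b^T(j)\, W(j)_i \quad \text{for } i = 1, \ldots, n$$

$$x^T = \begin{bmatrix} b^T(1) & b^T(2) & \cdots & b^T(N) \end{bmatrix} \times \begin{bmatrix} W(1)_1 & W(1)_2 & \cdots & W(1)_n \\ W(2)_1 & W(2)_2 & \cdots & W(2)_n \\ \vdots & \vdots & \ddots & \vdots \\ W(N)_1 & W(N)_2 & \cdots & W(N)_n \end{bmatrix}$$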
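The path-partitioned solution can be checked numerically with a short NumPy sketch. It is an illustration under assumptions: NumPy's general (Householder-based) QR replaces the fast Vandermonde QR, the "threads" are simulated by splitting row indices in a single process, and the price and cash-flow data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
paths, n, n_threads = 10_000, 4, 4
p = rng.uniform(0.5, 1.5, paths)  # hypothetical normalized prices
b = np.maximum(1.0 - p, 0.0) + 0.01 * rng.standard_normal(paths)  # stand-in cash flows

A = np.vander(p, n, increasing=True)  # row i is [1, p_i, p_i^2, ..., p_i^(n-1)]
Q, R = np.linalg.qr(A)                # reduced QR: Q is paths x n, R is n x n
U, s, Vt = np.linalg.svd(R)           # SVD of the small n x n factor only
W = (Q @ U) / s @ Vt                  # W = (QU) Sigma^-1 V^T, paths x n

# Each "thread" j owns a slice of paths and computes n partial coefficients
chunks = np.array_split(np.arange(paths), n_threads)
partials = [b[idx] @ W[idx] for idx in chunks]
x = np.sum(partials, axis=0)          # thread join: sum the partial coefficients

x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)  # reference least-squares solution
```

Only the n partial coefficients per thread need to be communicated, which matches the slide's claim that the scheme requires trivial parallel communication.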



High-Level Sketch – Per-thread Operations

Loop over timesteps:

$$M_{i,j} = \sum_{k=\text{firstPath}_j}^{\text{lastPath}_j} p_k^{\,i-1} \quad \text{for } i = 1, \ldots, 2n-1$$

• Each thread j computes the sums of the first 2n-1 powers of its share of prices for in-the-money paths

Thread Join

$$R = f(M_1, \ldots, M_{2n-1})$$

$$W(j) = g(A(j), \mathrm{SVD}(R))$$

$$x_i^T(j) = b^T(j)\, W(j)_i \quad \text{for } i = 1, \ldots, n$$

• Each thread j combines the partial sums to construct R and its SVD, which is then used to create partial coefficients.

Thread Join

Use the regression $A(j)\,x$ to compute the new future value

• In total, 3 passes are made over the price data at each timestep, stressing caches and memory bandwidth as the number of paths per partition increases.

• The working set here consists of 3 column vectors: the price, the future value, and an index vector recording in-the-money paths.
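Why 2n-1 power sums are enough to build R can be seen from the Gram matrix of a Vandermonde matrix, which is a Hankel matrix of power sums. The sketch below uses a Cholesky factorization as a stand-in for the function f; that substitution is an assumption made for illustration and is not the recursion used in the presented solution.

```python
import numpy as np

rng = np.random.default_rng(2)
paths, n, n_threads = 8_000, 4, 4
p = rng.uniform(0.5, 1.5, paths)  # hypothetical normalized prices

# Per-thread partial power sums for exponents 0 .. 2n-2 (2n-1 sums per thread)
parts = np.array_split(p, n_threads)
M = np.array([[np.sum(part ** i) for part in parts] for i in range(2 * n - 1)])
moments = M.sum(axis=1)  # thread join: combine the partial sums

# Gram matrix H = A^T A is Hankel: H[a, c] = sum_k p_k^(a + c) = moments[a + c]
idx = np.arange(n)
H = moments[idx[:, None] + idx[None, :]]
R = np.linalg.cholesky(H).T  # upper triangular with R^T R = H (stand-in for f)

A = np.vander(p, n, increasing=True)
R_qr = np.linalg.qr(A)[1]    # reference R from a general QR (rows may differ in sign)
```

Every thread can build the same small R from the shared moments without touching the other threads' price data.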



All-Threads Cohorts vs. Small-Threads Cohorts

[Figure: bar chart "Greeks Throughput by Parallelization Method and Available Memory Bandwidth"*; x-axis: Paths - Method (1M ATC, 1M STC, 8M ATC, 8M STC, 16M ATC, 16M STC); y-axis: Paths per Second (0 to 70000); series: Full (192 GB/sec per socket), Half (96 GB/sec per socket), Quarter (48 GB/sec per socket)]

• All-Threads-Cohort (ATC) mode – using all threads to price each scenario – significantly improves performance and reduces memory bandwidth dependency vs. using cohorts with smaller numbers of threads (Small-Threads-Cohort, STC) for large numbers of paths.

• Speedups are primarily due to better load balancing.

• Memory effects are due to smaller cache footprints and ideal NUMA locality.

[Diagram: paths x timesteps array (path0 … pathN-1 by t0 … t_timesteps-1) partitioned by path among Thread0 – Thread3]

* Not a STAC-A2 Benchmark


Least Squares Fitting

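Given observed pairs $(x_i, y_i)$, find a polynomial model which describes the relationship.

$$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n \tag{1}$$

Expanding into matrix format

$$\begin{bmatrix} 1 & x_1 & \cdots & x_1^n \\ 1 & x_2 & \cdots & x_2^n \\ \vdots & \vdots & & \vdots \\ 1 & x_M & \cdots & x_M^n \end{bmatrix} \cdot \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} \tag{2}$$

Or in matrix form:

$$A \cdot x = b \tag{3}$$

Least squares solution minimizes the residue

$$\arg\min_x \lVert r \rVert_2^2 = \arg\min_x \lVert A \cdot x - b \rVert_2^2 \tag{4}$$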


Solutions to Least Squares Problems

Requires more observed data points than the order of the polynomial: $M > n$. Overdetermined problem.

Skipped existence and uniqueness ...

Several methods to solve the LS problems. For example, the pseudo-inverse method:

$$A^T A \cdot x = A^T b \tag{5}$$

$A^T A$ is square and full rank, so we can invert it to compute x:

$$x = \left( A^T A \right)^{-1} A^T b \tag{6}$$

Works fine for small n. Issues when n gets large: lots of data movement; bad condition number if not properly scaled.

Better: SVD (singular value decomposition)

$$A = U \Sigma V^T \tag{7}$$

Both U and V are unitary, and Σ is diagonal. Much more stable:

$$x = V \Sigma^{-1} U^T \cdot b \tag{8}$$
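The two solution routes can be compared directly in NumPy. The data below are synthetic (a noisy quadratic with hypothetical coefficients), and the sketch is only meant to show that both routes agree on a well-conditioned problem while the normal equations square the condition number.

```python
import numpy as np

rng = np.random.default_rng(3)
x_obs = rng.uniform(0.0, 1.0, 500)
y_obs = 1.0 + 2.0 * x_obs - 3.0 * x_obs**2 + 0.01 * rng.standard_normal(500)

A = np.vander(x_obs, 3, increasing=True)  # columns 1, x, x^2

# Pseudo-inverse (normal equations) route, Eqns. (5)-(6)
coef_normal = np.linalg.solve(A.T @ A, A.T @ y_obs)

# SVD route, Eqns. (7)-(8): x = V Sigma^-1 U^T b
U, s, Vt = np.linalg.svd(A, full_matrices=False)
coef_svd = Vt.T @ ((U.T @ y_obs) / s)

cond_A = np.linalg.cond(A)           # 2-norm condition number of A
cond_gram = np.linalg.cond(A.T @ A)  # equals cond_A squared
```

Because the condition number of the Gram matrix is the square of that of A, the normal-equation route loses roughly twice as many digits as the SVD route when the basis is poorly scaled.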


Singular Value Decomposition via QR decomposition

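One route to the SVD is through the QR decomposition

$$A = Q \begin{bmatrix} R \\ 0 \end{bmatrix} \tag{9}$$

Q is orthogonal and R is upper triangular

Classic methods: Householder transformation or Givens rotation. Similar to Gaussian elimination for matrix factorization, but using orthogonal transformations

Very much a serial operation

$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \longrightarrow \begin{bmatrix} a^*_{11} & a^*_{12} & \cdots & a^*_{1n} \\ 0 & a^*_{22} & \cdots & a^*_{2n} \\ \vdots & \vdots & & \vdots \\ 0 & a^*_{n2} & \cdots & a^*_{nn} \end{bmatrix} \tag{10}$$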


Demeure’s QR Factorization Method

Take advantage of the special structure of A to speed up the QR factorization.

For LS problems, the A matrix is a Vandermonde matrix.

Sketch of the strategy:

The QR factorization is

$$A = Q \Sigma B^T \tag{11}$$

with Q orthogonal, $B^T$ upper triangular with 1 on the diagonal, and Σ diagonal with real values.

From Eqn. (11), if E is the inverse of $B^T$:

$$A E = Q \Sigma \tag{12}$$

If $H = A^T A$:

$$H E = B \Sigma^2 \tag{13}$$

Also

$$A^T Q = B \Sigma \tag{14}$$

From Eqn. (13), recursively compute E as column space of H; use E to compute Q from Eqn. (12); use Q to compute B from Eqn. (14)


Flow

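Matrix represented by vector p

$$A = \begin{bmatrix} 1 & p_1 & p_1^2 & \cdots & p_1^{n+1} \\ 1 & p_2 & p_2^2 & \cdots & p_2^{n+1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & p_M & p_M^2 & \cdots & p_M^{n+1} \end{bmatrix} = \begin{bmatrix} 1 & p & p^2 & \cdots & p^{n+1} \end{bmatrix} \tag{15}$$

procedure VandermondeQR                      ▷ initialization
    σ_1 ← m
    for j in 2, …, 2n − 1 do
        B_{j,1} ← ‖p^{j−1}‖_1 / σ_1
    μ_1 ← B_{2,1}
    ν_1 ← B_{3,1}
    σ_2 ← σ_1 (ν_1 − μ_1²)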


Flow (cont’d)

procedure VandermondeQR                      ▷ initialization cont’d
    for j in 3, …, 2n − 2 do
        B_{j,2} ← (σ_1/σ_2) (B_{j+1,1} − μ_1 B_{j,1})
    Q_{:,1} = 1/√σ_1
    Q_{:,2} = (p − μ_1)/√σ_2
    for k in 3, …, n do                      ▷ main recursion loop
        μ_{k−1} ← B_{k,k−1}
        ν_{k−1} ← B_{k+1,k−1}
        σ_k ← σ_{k−1} (ν_{k−1} − ν_{k−2} + μ_{k−1}(μ_{k−2} − μ_{k−1}))
        for j in k + 1, …, 2n − k do
            B_{j,k} ← (σ_{k−1}/σ_k) (B_{j+1,k−1} − B_{j,k−2} + (μ_{k−2} − μ_{k−1}) B_{j,k−1})
        Q_{:,k} ← √(σ_{k−1}/σ_k) · ((p + μ_{k−2} − μ_{k−1}) · Q_{:,k−1} − √(σ_{k−1}/σ_{k−2}) · Q_{:,k−2})
    Σ ← diag(√σ_1, √σ_2, …, √σ_n)
    B ← B_{1:n,:}
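The structure this procedure exploits can be illustrated with a simpler, related technique: building the orthonormal columns of Q with a three-term recurrence for orthogonal polynomials (the Stieltjes procedure). The sketch below is a swapped-in method, not the moment-based recursion above: it computes inner products directly from the vectors, and its function name and test data are hypothetical.

```python
import numpy as np

def vandermonde_q(p, n):
    """Orthonormal basis for the columns [1, p, ..., p^(n-1)] via a three-term recurrence."""
    m = p.size
    q = np.empty((m, n))
    norm2 = np.empty(n)  # squared norms of the unnormalized basis vectors
    prev, cur, prev_norm2 = np.zeros(m), np.ones(m), 1.0
    for k in range(n):
        norm2[k] = cur @ cur
        q[:, k] = cur / np.sqrt(norm2[k])
        alpha = (p * cur) @ cur / norm2[k]                 # recurrence shift
        beta = norm2[k] / prev_norm2 if k > 0 else 0.0     # ratio of squared norms
        prev, cur = cur, (p - alpha) * cur - beta * prev   # next orthogonal vector
        prev_norm2 = norm2[k]
    return q, norm2

rng = np.random.default_rng(4)
p = rng.uniform(0.5, 1.5, 2_000)  # hypothetical normalized prices
n = 5
Q, norm2 = vandermonde_q(p, n)
A = np.vander(p, n, increasing=True)
R = Q.T @ A  # upper triangular up to rounding
```

Every operation is an elementwise update or a dot product over p, so the work splits across path partitions with only a few scalars exchanged per column. The first squared norm equals the number of observations m, matching the initialization of the procedure above.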


Characteristics

Many operations are local; no large data movements

Can be parallelized easily
