
© 2015 IBM Corporation

STAC-A2™ Benchmark on POWER8®

Bishop Brock, Frank Liu, Karthick Rajamani
IBM Research, Austin, Texas
{bcbrock,frankliu,karthick}@us.ibm.com

Workshop for High-Performance Computational Finance (WHPCF'15)
November 20, 2015

With thanks to Kenneth Hill of the University of Florida and Julian Demouth of NVIDIA


Summary

● Workload characterization and performance analysis demonstrate that STAC-A2™ is a well-rounded single-system HPC benchmark for risk analytics

● POWER8-based Linux® systems demonstrate consistently high performance at every scale (and we explain why)

→ Current POWER8-based systems are capable and competitive platforms for computational finance


Simplified POWER8 Core: SMT2, SMT4, SMT8

[Figure: simplified POWER8 core block diagram for SMT2 and higher modes. Thread Set 0 is served by Regs0 and UniQ0, Thread Set 1 by Regs1 and UniQ1. Each side shows FXU, LSU and LU units and a pair of DP FP pipes (128-bit DP SIMD); BR, CR and Crypto units are also shown.]

Up to 8 threads in 2 thread sets; dispatch/complete up to 3 non-branch + 1 branch per cycle per thread set

Register file/renames shared equally by thread set

Thread sets only dispatch to "their" UniQ

Execution units are split between UniQs (thread sets)


STAC® and the STAC-A2™ Benchmark

● Securities Technology Analysis Center (STAC)
– Coordinates the STAC Benchmark Council™
– Publishes and audits several benchmarks covering technologies important to capital markets

● STAC-A2 Benchmark: Risk Management
– Computes numerous sensitivities (Greeks) of a multi-asset, exotic call option (lookback, best-of) with early exercise
– Modeled by Monte Carlo simulation of the Heston stochastic volatility model
– Priced using the Longstaff-Schwartz algorithm
– Greeks computed using finite differences

The STAC-A2 benchmark suite measures the performance, scalability, quality and energy efficiency of any hardware/software system that is capable of implementing the benchmark specification. In this presentation we focus only on the performance and scalability benchmarks.


STAC-A2 Benchmark Dimensions and Challenges

● Performance/Scalability Dimensions
– Assets: the number of correlated assets: O(assets²)
– Timesteps: granularity of discretization: O(timesteps)
– Paths: number of Monte Carlo paths: O(paths)

● Challenges
– Unit-normal random number generation
– Dense matrix operations (custom correlation routine, abundant ILP)
– SQRT/DIVIDE-intensive Monte Carlo kernel (little ILP)
– Cache-efficient data management
– Efficient numerical methods (custom SVD routine)
– Bandwidth-limited performance in certain benchmarks
– Single-system parallel load balancing
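To illustrate why the asset dimension costs O(assets²), here is a minimal sketch of correlating independent unit-normal draws. It assumes a Cholesky factor of the correlation matrix, which is a common textbook choice and not necessarily the custom correlation routine used in this solution; `correlated_normals` and the 3-asset matrix are hypothetical.

```python
import numpy as np

def correlated_normals(corr, n_draws, seed=0):
    """Return an (n_draws, assets) array of unit normals with correlation `corr`."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(corr)  # lower-triangular L with L @ L.T == corr
    z = rng.standard_normal((n_draws, corr.shape[0]))  # independent unit normals
    return z @ chol.T  # about assets^2 multiply-adds per draw

# Hypothetical 3-asset correlation matrix, for illustration only.
corr = np.array([[1.0, 0.5, 0.2],
                 [0.5, 1.0, 0.3],
                 [0.2, 0.3, 1.0]])
draws = correlated_normals(corr, 200_000)
print(np.round(np.corrcoef(draws, rowvar=False), 2))
```

The dense matrix product is the part with abundant instruction-level parallelism: every output element is an independent dot product.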


IBM STAC-A2 Solution Structure

[Figure: data flow from RNG and Monte Carlo simulation through arrays and complex scenario generation to Longstaff-Schwartz pricing; stages (a) to (d) are described below.]

Monte Carlo simulation produces a number of paths × timesteps arrays (scenarios), which must be stored for later analysis.

A master thread (a) spawns multiple worker threads that perform Monte Carlo simulation in parallel (b). Simulation is partitioned by paths; each thread creates the same path-partition of every array. Scenarios are (generated and) priced by cohorts of 1 or more threads (c). Finally, the master thread computes the finite differences (d).

Finite difference example: $\Theta = \dfrac{y_{\Delta t} - y}{\Delta t}$, where $y$ is the unmodified option value scenario and $y_{\Delta t}$ is the modified expiration scenario.
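The finite-difference step (d) is a one-line calculation once the scenario values exist. The sketch below uses hypothetical option values and a one-trading-day shift; the function name `theta` and the numbers are assumptions for illustration, not benchmark data.

```python
def theta(y_shifted, y_base, dt):
    """Forward finite difference: (y_dt - y) / dt."""
    return (y_shifted - y_base) / dt

# Hypothetical values: the option is worth 10.00 unmodified and 9.99
# when expiration is shifted by one trading day (dt = 1/252 years).
example = theta(9.99, 10.00, 1.0 / 252.0)
print(example)
```

Each Greek needs at least one extra priced scenario per bumped input, which is why the scenario count grows quickly with the number of assets.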


Key “Greeks” Workloads + Characterization

Workload       Goal        Assets  Timesteps  Paths  Result   Scenarios  Memory*
Baseline       Speed       5       252        25K    0.317 s  93         2.5 GiB
Large Problem  Speed       10      1260       100K   28.9 s   308        143 GiB
Asset Scaling  Max Assets  *       252        25K    78       15642*     309 GiB
Path Scaling   Max Paths   5       252        *      28M      93         2.8 TiB

* Values unique to our solution

[Figure: Approximate Profile Breakdown by Application Phase, POWER8 S824*. Stacked bars (0% to 100%) for the Baseline, Large Problem, Asset Scaling and Path Scaling workloads, broken down into Longstaff-Schwartz Pricing, Array Transpose, Monte Carlo (Heston Model), and RNG + Correlation.]

* 2 x POWER8 @ 3.52 GHz (nominal) / 3.925 GHz (turbo); 24 total cores; 1 TiB memory; IBM XL C/C++; RHEL 7 Big-endian


Memory Bandwidth Sensitivity

[Figure: POWER8 S824 Relative Performance vs. Available Memory Bandwidth*. Grouped bars for Baseline, Large Problem, Asset Scaling (65 Assets) and Path Scaling (8M Paths/STC); series are Full (192 GB/sec per socket), Half (96 GB/sec per socket) and Quarter (48 GB/sec per socket). Y axis: performance relative to full bandwidth, 0 to 1.2. Bar labels in extraction order: 1, 1, 1, 1, 0.96, 0.931, 0.67, 0.82, 0.72, 0.99, 0.44.]

*Not a STAC-A2 benchmark


Performance Comparisons – STAC-A2 Benchmarks

Legend                         STAC SUT*   Description
2 x POWER8                     IBM150305   2 x IBM POWER8 @ 3.52 GHz; 24 total cores; 1 TiB Memory
2 x Haswell-EP + 2 x Xeon PHI  INTC151028  2 x Intel Xeon E5-2697 v3 @ 2.6 GHz + 2 x Intel Xeon PHI 7120P @ 1.24 GHz; 28 Haswell cores + 122 PHI cores; 256 GiB Memory
4 x Haswell-EX                 INTC150811  4 x Intel Xeon E7-8890 v3 @ 2.50 GHz; 72 total cores; 1 TiB Memory
NVIDIA K80                     NVDA141116  1 x NVIDIA Tesla K80 GPU; Supermicro SYS-2027GR-TRFH Host; 128 MiB Memory
2 x Haswell-EP + Xeon PHI      INTC140815  2 x Intel Xeon E5-2699 v3 @ 2.3 GHz + 1 x Intel Xeon PHI 7120A @ 1.24 GHz; 36 Haswell cores + 61 PHI cores; 256 GiB Memory
2 x Haswell-EP                 INTC140814  2 x Intel Xeon E5-2699 v3 @ 2.3 GHz; 36 total cores; 256 GiB Memory

*For details see www.stacresearch.com/<STAC SUT>

[Figure: bar charts comparing the six systems above. Relative Workload Performance for Baseline and Large Problem (axis 0 to 1.6); Max Assets for Asset Scaling (axis 0 to 90); Max Paths for Path Scaling (axis 0 to 30.0E+6).]

Intel, Xeon, PHI, NVIDIA, Tesla and Supermicro are trademarks or registered trademarks of their respective owners and are used for identification purposes only


Performance by SMT Mode

SMT Mode Performance Relative to SMT1, STAC-A2 Baseline Greeks Workload*

Mode  1 Core  1 Socket (12 Cores)  2 Sockets (24 Cores)
SMT1  1       1                    1
SMT2  1.61    1.68                 1.66
SMT4  2.00    2.04                 2.12
SMT8  1.96    1.97                 1.99

*Not a STAC-A2 benchmark


System-Level Scalability Comparisons

[Figure: Scaling Single Core Performance to Half/Full System Performance* (higher is better, 1.0 = ideal). Bars for two experiments, Core → Half System and Core → Full System. Y axis: fraction of theoretical speedup based on number of cores, 0 to 1. Systems: 2 x POWER8 (half = 12 cores, full = 24 cores, SMT4); 4 x Haswell-EX (half = 36 cores, full = 72 cores, HyperThreading); 2 x Haswell-EP (half = 18 cores, full = 36 cores, HyperThreading).]

*Data is from the official STAC-A2 audit reports for these systems; however, these are not considered STAC-A2 benchmarks. Please see the Performance Comparisons slide (page 10) for full details of the systems being compared here.


Optimizing Transpose for Monte Carlo Simulation

Straightforward (i.e., not using complex blocking schemes), cache-efficient Longstaff-Schwartz pricing effectively requires transposing time-major data into path-major storage.

[Figure: two arrays with axes path 0 … path N-1 and t0 … t(timesteps-1). Simulation: data generated time major. Pricing: data analyzed path major. Below, block-rows built from Path Pair n + 0 through Path Pair n + 7.]

Our Solution:

A blocked, in-place matrix transpose using no additional storage, taking advantage of the fact that (post-processed) simulation data is created one path (pair) at a time

Simple and relatively efficient despite:
• Moderately poor locality
• Requiring a second pass over the data

Simple heuristics based on working set size improve performance for large working sets

2% - 9% of total benchmark run time


Fast, Parallel Longstaff-Schwartz Algorithm

• Least-Squares Monte Carlo (LSMC or Longstaff-Schwartz) is an algorithm for valuing early-exercise options.

• LSMC operates serially, backwards in time.

• At each time step, LSMC estimates the optimal exercise value by least squares regression using a cross section of simulated data, approximating this value as a linear combination of basis functions.

• We describe a fast, easily parallelizable LSMC method based on a QR factorization algorithm specific to row Vandermonde matrices.

• This approach is used for “microbenchmarks” and for Path Scaling, where utilizing every thread optimizes performance.

• This approach was inspired by NVIDIA's description of their LSMC algorithm for STAC-A2.

[Figure: a scenario array with axes path0 … pathN-1 and t0 … t(timesteps-1), split into horizontal path partitions owned by Thread0, Thread1, Thread2 and Thread3.]

• Paths are partitioned by thread, similar to Monte Carlo simulation

• Threads synchronize (twice per timestep here) and complete each timestep in lock-step

• Scalability is limited by synchronization and communication overheads

• For the Baseline case (25,000 paths) performance improves for up to 16 – 20 SMT4 threads per scenario


Future Work

● Exploiting the high-bandwidth CPU ↔ GPU NVLINK™ unique to future OpenPOWER systems

[Figure: Current OpenPOWER Foundation Gold Members*]

* http://www.smartercomputingblog.com/power-systems/openpower-hpc-big-data-analytics/


Special Notices

Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.

IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA.

All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.

The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied.

All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions.

Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.

A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.

Other company, product and service names may be trademarks or service marks of others and are used for identification purposes only.


Backup


POWER8 Simultaneous Multi-threading (SMT)

● SMT modes represent different partitioning of core resources: Fetch/Dispatch; Reg. Files; Execution Pipes

● SMT mode is determined by the number of active hardware threads per core, regardless of the number of threads configured per core
– SMT mode switch is automatic as threads enter/exit idle states
– An enhancement over POWER7

● SMT modes SMT1, SMT2, SMT4 and SMT8 are entered when 1, 2, 3 – 4, or 5 – 8 threads are active, respectively

● Most STAC-A2 benchmarks run in SMT4; some in SMT2
– SMT4 here means we bind 4 software threads to 4 of 8 configured HW threads per core
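The mode rule stated above (1, 2, 3 – 4, or 5 – 8 active threads) can be written as a small lookup. This is only a sketch of the rule as stated on the slide; `smt_mode` is a hypothetical helper, not a system API.

```python
def smt_mode(active_threads):
    """Map the number of active hardware threads on a core to its SMT mode."""
    if not 1 <= active_threads <= 8:
        raise ValueError("a POWER8 core has between 1 and 8 active threads")
    if active_threads == 1:
        return "SMT1"
    if active_threads == 2:
        return "SMT2"
    if active_threads <= 4:
        return "SMT4"
    return "SMT8"

modes = [smt_mode(n) for n in range(1, 9)]
print(modes)
```

Binding 4 software threads to a core configured with 8 hardware threads therefore runs the core in SMT4, because only the active threads count.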


Simplified POWER8 Core: SMT1

[Figure: simplified POWER8 core block diagram in SMT1 mode (single thread). Regs and Regs (Mirror) feed UniQ0 and UniQ1; each side shows FXU, LSU and LU units and a pair of DP FP pipes (128-bit DP SIMD); BR, CR and Crypto units are also shown.]

Dispatch/complete up to 6 non-branch + 2 branch per cycle

Register files/renames are mirrored in SMT1, allowing use of either UniQ

Unified issue queues; BRanch and Condition Register have private queues; Crypto issued from either UniQ

Execution units: pairs of Double Precision Floating Point pipes implement 2-way DP SIMD; 1 FiXed point Unit, 1 Load-Store Unit, 1 Load Unit per side.

Note each DP FP can also function as 2 x Single-Precision FP


Optimizing Data Transpose for Monte Carlo Sim.

Straightforward (i.e., not using complex blocking schemes), cache-efficient Longstaff-Schwartz pricing effectively requires transposing time-major data into path-major storage. Traditional out-of-place rectangular transpose requires too much memory. Traditional in-place transpose is too slow.

[Figure: Simulation array (data generated time major) and Pricing array (data analyzed path major), both with axes path 0 … path N-1 and t0 … t(timesteps-1). Path pairs n + 0 through n + 7 are inserted one at a time (Insert Pair n + 0, Insert Pair n + 1, etc.).]

Simple insertion of time-major data into a path-major array is a disaster!

• Cache lines are only partially populated when touched
• A line may be evicted to memory before being touched again
• Each line is in a unique virtual page

Instead we use the blocked approach illustrated on the next slide.


Illustration of In-Place Blocked Transpose for a 16x16 Block-Row

[Figure: a block-row holding Path Pair n + 0 through Path Pair n + 7, divided into 16-timestep blocks (t0 to t15, t16 to t31, t32 to t47, …, up to t(timesteps-1)). Insert Pair n + 0, Insert Pair n + 1, …, Insert Pair n + 7 fill the blocks; a (Deferred) Block Transpose then produces the final layout.]

Path pairs (original + antithetic) are simulated together and inserted into the blocked (16 x 16) and padded destination array block-row such that the blocks are transposed. (Eventually) transposing the blocks then brings the data into the correct orientation.
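The two-step idea described above (contiguous insertion into transposed 16 x 16 blocks, then an in-place transpose of each block) can be sketched with NumPy. This is a simplified illustration under stated assumptions: the destination is a row-major array indexed [timestep, path], paths are inserted singly instead of as original/antithetic pairs, and all function names are hypothetical.

```python
import numpy as np

BLOCK = 16  # 16 doubles = 128 bytes per contiguous write

def padded(n):
    """Round n up to a multiple of BLOCK."""
    return -(-n // BLOCK) * BLOCK

def blocked_insert(dest, path_index, path_values):
    """Store one path so that each 16x16 block holds its data transposed."""
    col0 = (path_index // BLOCK) * BLOCK   # block-column owning this path
    local = path_index % BLOCK             # position of the path inside the block
    for t0 in range(0, path_values.size, BLOCK):
        # One contiguous 16-element write per block instead of 16 strided ones.
        dest[t0 + local, col0:col0 + BLOCK] = path_values[t0:t0 + BLOCK]

def transpose_blocks(dest):
    """Second pass: transpose every 16x16 block in place."""
    for r in range(0, dest.shape[0], BLOCK):
        for c in range(0, dest.shape[1], BLOCK):
            dest[r:r + BLOCK, c:c + BLOCK] = dest[r:r + BLOCK, c:c + BLOCK].T.copy()

def insert_then_transpose(paths_data):
    """paths_data[p, t] is generated one path at a time; returns out[t, p]."""
    n_paths, timesteps = paths_data.shape
    dest = np.zeros((padded(timesteps), padded(n_paths)))
    for p in range(n_paths):
        values = np.zeros(padded(timesteps))
        values[:timesteps] = paths_data[p]
        blocked_insert(dest, p, values)
    transpose_blocks(dest)
    return dest[:timesteps, :n_paths]

data = np.arange(40 * 37, dtype=float).reshape(40, 37)  # 40 paths, 37 timesteps
result = insert_then_transpose(data)
print(np.array_equal(result, data.T))
```

Calling `transpose_blocks` once at the end corresponds to the deferred variant; transposing each block-row as soon as its 16 paths are filled would correspond to the immediate variant.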


Transpose Performance

Any technique other than the Simple Insertion gives good performance, and all techniques benefit from manual prefetching. 2% - 9% of total benchmark run time is currently spent doing this transpose (varies by workload).

If the working set (16 paths per simultaneously generated array, per thread, per core) appears to fit in the 8 MiB L3, it is advantageous to transpose each block-row as soon as it is filled; otherwise it is better to defer the final block transpose until the entire array is filled.

Our latest results also suggest that Blocked Insertion with Immediate Transpose is more efficient than using a library out-of-place transposition routine (DGETMO) for L3-contained working sets. [Each thread transposes the section of the array it created.]†

Performance of Monte Carlo Simulation by Transpose Method, Relative to Blocked Insertion with Immediate Transpose†

Workload (Working Set, MiB)  Blocked (Deferred Transpose)  Simple Insertion  Buffered Insertion*
Baseline (6.4)               0.91                          0.63              0.94
Large Problem (93.5)         1.04                          0.46              1.00
Asset Scaling (797)          1.06                          0.85              0.95
Path Scaling (6.4)           0.93                          0.63              1.01
5:126:25000 (3.2)            0.88                          0.75              0.97
5:1260:25000 (32.0)          1.02                          0.37              1.07

† Not a STAC-A2 Benchmark
* Discussed in the paper


Parallel Longstaff-Schwartz Algorithm

● After simulation each (complex) scenario is reduced to a single value using the Longstaff-Schwartz algorithm

● Multi-threaded parallelization is needed in cases where the number of threads exceeds the number of scenarios
– “Microbenchmarks”, e.g., Delta, where 96 threads price 10 scenarios
– Path Scaling: 93 scenarios are divided into 5 partitions due to limited memory; partitions contain 10 to 40 scenarios

● We may also use parallelization heuristically to improve load balancing, even when scenarios > threads

[Figure: a scenario array (paths x timesteps) is reduced by Longstaff-Schwartz to a single value, e.g. 3.276517.]


Longstaff-Schwartz (Least-Squares Monte Carlo [LSMC]) in a Nutshell

• Prior to expiration the holder will exercise an in-the-money option if the future discounted cash flow from holding the option is expected to be less than the current value.

• The value of the option is maximized if the exercise happens as soon as this is true.

• LSMC estimates the future value by least squares regression using a cross section of simulated data, approximating this value as a linear combination of basis functions.

• LSMC is robust to the choice of basis functions; STAC-A2 specifies the use of polynomial basis functions

[Figure: Longstaff-Schwartz Example, Polynomial Curve Fit. X axis: Current Payoff (Exercise Now), 0 to 6; Y axis: Future Payoff, 0 to 6. Shows the least-squares fit and the line Current = Future, which separates the Exercise Now region from the Defer Exercise region.]

Analysis is performed for each timestep. The Singular Value Decomposition (SVD) approach to least squares is preferred for numerical stability.
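The exercise rule above can be sketched in a few lines of NumPy. To stay short, this example deliberately swaps in a much simpler problem than the benchmark: a single-asset American put on geometric Brownian motion paths rather than the multi-asset exotic call under the Heston model. `lsmc_put_value` and all parameter values are hypothetical; `np.linalg.lstsq` is used because it solves the regression through an SVD-based routine.

```python
import numpy as np

def lsmc_put_value(paths, strike, rate, dt, degree=3):
    """Least-Squares Monte Carlo value of a put; paths[i, t] is path i at step t."""
    discount = np.exp(-rate * dt)
    cash = np.maximum(strike - paths[:, -1], 0.0)    # payoff at expiration
    for t in range(paths.shape[1] - 2, 0, -1):       # serially, backwards in time
        cash *= discount                             # future cash flow seen from t
        payoff = np.maximum(strike - paths[:, t], 0.0)
        itm = np.flatnonzero(payoff > 0.0)           # regress on in-the-money paths
        if itm.size <= degree:
            continue
        A = np.vander(paths[itm, t], degree + 1, increasing=True)  # polynomial basis
        coef = np.linalg.lstsq(A, cash[itm], rcond=None)[0]
        exercise = itm[payoff[itm] > A @ coef]       # current value beats expected future
        cash[exercise] = payoff[exercise]
    return discount * cash.mean()

# Hypothetical test problem: spot 36, strike 40, 6% rate, 20% volatility, 1 year.
rng = np.random.default_rng(0)
n_paths, n_steps, dt = 20_000, 50, 1.0 / 50
spot, strike, rate, vol = 36.0, 40.0, 0.06, 0.2
z = rng.standard_normal((n_paths, n_steps))
log_steps = (rate - 0.5 * vol**2) * dt + vol * np.sqrt(dt) * z
log_paths = np.hstack([np.zeros((n_paths, 1)), np.cumsum(log_steps, axis=1)])
paths = spot * np.exp(log_paths)
value = lsmc_put_value(paths, strike, rate, dt)
print(round(value, 3))
```

The loop over `t` is inherently serial, which is why the parallelization schemes discussed in this deck split the work within a timestep rather than across timesteps.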


Longstaff-Schwartz Parallelization Schemes

By Timestep

[Figure: a scenario array with axes path0 … pathN-1 and t0 … t(timesteps-1), split into vertical timestep columns assigned in turn to Thread0, Thread1, Thread2 and Thread3.]

• Threads “leapfrog” backwards in time, computing SVD in parallel

• Final evaluation is still serial: a thread must wait for the next timestep to complete before evaluating the regression

• Scalability is severely limited by the ratio $T_{SVD} : T_{eval}$

By Path-Partition

[Figure: the same array split into horizontal path partitions owned by Thread0, Thread1, Thread2 and Thread3.]

• Paths are partitioned by thread, similar to Monte Carlo simulation

• Threads synchronize (twice per timestep here) and complete each timestep in lock-step

• Scalability is limited by synchronization and communication overheads

• For the Baseline case (25,000 paths) performance improves for up to 16 – 20 SMT4 threads per scenario


Path-Parallel Least Squares Solution from QR Factorization and SVD

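• Design matrix A: STAC-A2 specifies polynomial basis functions, hence A is a row Vandermonde matrix, with row $i$ built from the price $p_i$:

$$A_{paths \times n} = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ 1 & p_i & p_i^2 & \cdots & p_i^{n-1} \\ \vdots & \vdots & \vdots & & \vdots \end{bmatrix}$$

• The QR factorization of A. This is path-parallelizable in general for any basis functions; however, we use a fast QR factorization for Vandermonde matrices requiring trivial parallel communication:

$$A_{paths \times n} = Q_{paths \times n} R_{n \times n}$$

• The SVD of R:

$$R = U \Sigma V^T$$

• The SVD of A:

$$A = (QU) \Sigma V^T$$

• Solution for coefficients $x^T$ minimizing $\|Ax - b\|_2$, where b is the discounted future cash flow:

$$x^T = b^T \left[ (QU) \Sigma^{-1} V^T \right] = b^T_{paths} W_{paths \times n}$$

• Both $b^T$ and W can be path-partitioned between N threads. Each thread j computes n partial coefficients:

$$x^T = \begin{bmatrix} b^T(1) & b^T(2) & \cdots & b^T(N) \end{bmatrix} \times \begin{bmatrix} W(1)_1 & W(1)_2 & \cdots & W(1)_n \\ W(2)_1 & W(2)_2 & \cdots & W(2)_n \\ \vdots & \vdots & \ddots & \vdots \\ W(N)_1 & W(N)_2 & \cdots & W(N)_n \end{bmatrix}$$

$$x_i^T(j) = b^T(j) W(j)_i \quad \text{for } i = 1, \dots, n$$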
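The chain of factorizations above can be checked numerically. The sketch below uses NumPy's general-purpose `qr` and `svd` (not the fast Vandermonde factorization) and simulates the N threads with a loop over path partitions; the function names and test data are assumptions for illustration.

```python
import numpy as np

def regression_weights(prices, n_basis):
    """W = (QU) Sigma^-1 V^T for the row Vandermonde matrix built from prices."""
    A = np.vander(prices, n_basis, increasing=True)  # rows [1, p, p^2, ...]
    Q, R = np.linalg.qr(A)                           # A = QR, Q is paths x n
    U, s, Vt = np.linalg.svd(R)                      # R = U Sigma V^T
    return ((Q @ U) / s) @ Vt                        # column scaling applies Sigma^-1

def path_parallel_coefficients(W, b, n_threads):
    """Sum the per-partition partial coefficients x^T(j) = b^T(j) W(j)."""
    partitions = np.array_split(np.arange(b.size), n_threads)
    partials = [b[rows] @ W[rows] for rows in partitions]  # one entry per "thread"
    return np.sum(partials, axis=0)

rng = np.random.default_rng(1)
prices = rng.uniform(0.5, 1.5, size=1000)   # hypothetical in-the-money prices
cash_flow = rng.standard_normal(1000)       # hypothetical discounted cash flows b
W = regression_weights(prices, 4)
x = path_parallel_coefficients(W, cash_flow, n_threads=4)
reference = np.linalg.lstsq(np.vander(prices, 4, increasing=True),
                            cash_flow, rcond=None)[0]
print(np.allclose(x, reference))
```

Only the n partial coefficients per thread need to be combined, so the communication volume is independent of the number of paths.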


High-Level Sketch – Per-thread Operations

Loop over timesteps:

$$M_{i,j} = \sum_{k = firstPath_j}^{lastPath_j} p_k^{\,i-1} \quad \text{for } i = 1, \dots, 2n-1$$

Thread Join

$$R = f(M_1, \dots, M_{2n-1})$$

$$W(j) = g(A(j), SVD(R))$$

$$x_i^T(j) = b^T(j) W(j)_i \quad \text{for } i = 1, \dots, n$$

Thread Join

Use the regression $A(j)\,x$ to compute the new future value

• Each thread j computes the sums of the first 2n-1 powers of its share of prices for in-the-money paths

• Each thread j combines the partial sums to construct R and its SVD, which is then used to create partial coefficients.

• In total, 3 passes are made over the price data at each timestep, stressing caches and memory bandwidth as the number of paths per partition increases.

• The working set here consists of 3 column vectors: the price, the future value, and an index vector recording in-the-money paths.
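To see why the power sums are enough to build R, note that $(A^T A)_{rc} = \sum_k p_k^{\,r+c-2} = M_{r+c-1}$, a Hankel matrix of the combined sums. The sketch below swaps in a plain Cholesky factorization of that matrix as a stand-in for the function $f$; this deck's fast Vandermonde QR is a different (and numerically more careful) way to obtain the same factor. Function names and data are hypothetical.

```python
import numpy as np

def partial_power_sums(prices, n):
    """M_i = sum_k p_k^(i-1), i = 1..2n-1, over one thread's share of prices."""
    return np.array([np.sum(prices ** i) for i in range(2 * n - 1)])

def r_from_power_sums(M, n):
    """Build the Hankel matrix A^T A from the sums and return upper-triangular R."""
    gram = np.array([[M[r + c] for c in range(n)] for r in range(n)])
    return np.linalg.cholesky(gram).T   # gram = R^T R

rng = np.random.default_rng(2)
prices = rng.uniform(0.5, 1.5, size=1000)   # hypothetical in-the-money prices
n = 4
shares = np.array_split(prices, 4)          # one share per simulated thread
M = sum(partial_power_sums(share, n) for share in shares)  # the "thread join"
R = r_from_power_sums(M, n)

A = np.vander(prices, n, increasing=True)
R_qr = np.linalg.qr(A, mode="r")
print(np.allclose(np.abs(R), np.abs(R_qr)))  # equal up to row signs
```

Each thread contributes only 2n-1 numbers to the join, which is the "trivial parallel communication" that makes the path-partitioned scheme attractive.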


All-Threads Cohorts vs. Small-Threads Cohorts

[Figure: Greeks Throughput by Parallelization Method and Available Memory Bandwidth*. X axis (Paths - Method): 1M ATC, 1M STC, 8M ATC, 8M STC, 16M ATC, 16M STC. Y axis: paths per second, 0 to 70000. Series: Full (192 GB/sec per socket), Half (96 GB/sec per socket), Quarter (48 GB/sec per socket). An inset shows an array with axes path0 … pathN-1 and t0 … t(timesteps-1) partitioned among Thread0 to Thread3.]

• All-Threads-Cohort (ATC) mode, which uses all threads to price each scenario, significantly improves performance and reduces memory bandwidth dependency vs. using cohorts with smaller numbers of threads (Small-Threads-Cohort, STC) for large numbers of paths.

• Speedups are primarily due to better load balancing.

• Memory effects are due to smaller cache footprints and ideal NUMA locality.

* Not a STAC-A2 Benchmark


Least Squares Fitting

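Given observed pairs $(x_i, y_i)$, find a polynomial model which describes the relationship.

$$y = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n \quad (1)$$

Expanding into matrix format:

$$\begin{bmatrix} 1 & x_1 & \cdots & x_1^n \\ 1 & x_2 & \cdots & x_2^n \\ \vdots & \vdots & & \vdots \\ 1 & x_M & \cdots & x_M^n \end{bmatrix} \cdot \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} \quad (2)$$

Or in matrix form:

$$A \cdot x = b \quad (3)$$

The least squares solution minimizes the residual:

$$\arg\min_x \|r\|_2^2 = \arg\min_x \|A \cdot x - b\|_2^2 \quad (4)$$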


Solutions to Least Squares Problems

Requires more observed data points than the order of the polynomial: $M > n$ (an overdetermined problem). Existence and uniqueness are skipped here.

Several methods solve LS problems. For example, the pseudo-inverse method:

$$A^T A \cdot x = A^T b \quad (5)$$

$A^T A$ is square and has full rank, so we can invert it to compute x:

$$x = \left( A^T A \right)^{-1} A^T b \quad (6)$$

This works fine for small n. Issues arise when n gets large: lots of data movement, and a bad condition number if not properly scaled.

Better: SVD (singular value decomposition)

$$A = U \Sigma V^T \quad (7)$$

Both U and V are unitary, and $\Sigma$ is diagonal. Much more stable:

$$x = V \Sigma^{-1} U^T b \quad (8)$$
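Equations (6) and (8) can be compared directly on a small polynomial fit. This is a minimal sketch with made-up data; `solve_normal_equations` and `solve_svd` are hypothetical helper names, and the thin SVD (`full_matrices=False`) is used so that U has the same shape as A.

```python
import numpy as np

def solve_normal_equations(A, b):
    """Eq. (6): x = (A^T A)^-1 A^T b, solved without forming the inverse."""
    return np.linalg.solve(A.T @ A, A.T @ b)

def solve_svd(A, b):
    """Eq. (8): x = V Sigma^-1 U^T b, using the thin SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return Vt.T @ ((U.T @ b) / s)

x_obs = np.linspace(0.0, 1.0, 50)                # M = 50 observations
true_coef = np.array([1.0, -2.0, 0.5, 3.0])      # hypothetical cubic model
A = np.vander(x_obs, 4, increasing=True)
b = A @ true_coef                                # noise-free data for the check

coef_ne = solve_normal_equations(A, b)
coef_svd = solve_svd(A, b)
print(np.allclose(coef_ne, true_coef), np.allclose(coef_svd, true_coef))
print(np.linalg.cond(A.T @ A) / np.linalg.cond(A))
```

Forming $A^T A$ squares the condition number of the problem, which is the stability concern noted above; both methods agree on a small, well-scaled example like this one.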


Singular Value Decomposition via QR decomposition

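One path to the SVD is through the QR decomposition

$$A = Q \begin{bmatrix} R \\ 0 \end{bmatrix} \quad (9)$$

Q is orthogonal and R is upper triangular

Classic methods: Householder transformation or Givens rotation. Similar to Gaussian elimination for matrix factorization, but using orthogonal transformations

Very much a serial operation

$$\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \longrightarrow \begin{bmatrix} a^*_{11} & a^*_{12} & \cdots & a^*_{1n} \\ 0 & a^*_{22} & \cdots & a^*_{2n} \\ \vdots & \vdots & & \vdots \\ 0 & a^*_{n2} & \cdots & a^*_{nn} \end{bmatrix} \quad (10)$$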


Demeure’s QR Factorization Method

Take advantage of the special structure of A to speed up the QR factorization.

For LS problems, the A matrix is a Vandermonde matrix

Sketch of the strategy

The QR factorization is

$$A = Q \Sigma B^T \quad (11)$$

with Q orthogonal, $B^T$ upper triangular with 1 on the diagonal, and $\Sigma$ diagonal with real values.

From Eqn. (11), if E is the inverse of $B^T$:

$$A E = Q \Sigma \quad (12)$$

If $H = A^T A$:

$$H E = B \Sigma^2 \quad (13)$$

Also

$$A^T Q = B \Sigma \quad (14)$$

From Eqn. (13), recursively compute E as column space of H; use E to compute Q from Eqn. (12); use Q to compute B from Eqn. (14)
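Equations (12) to (14) follow from (11) in a few lines. The derivation below is a sketch that only assumes Q has orthonormal columns ($Q^T Q = I$) and that $\Sigma$ is diagonal, as stated above.

```latex
\begin{aligned}
A &= Q\Sigma B^{T},\quad E = (B^{T})^{-1}
  &&\Rightarrow\quad AE = Q\Sigma B^{T}(B^{T})^{-1} = Q\Sigma \\
H &= A^{T}A = B\Sigma\,Q^{T}Q\,\Sigma B^{T} = B\Sigma^{2}B^{T}
  &&\Rightarrow\quad HE = B\Sigma^{2}B^{T}(B^{T})^{-1} = B\Sigma^{2} \\
A^{T}Q &= B\Sigma\,Q^{T}Q = B\Sigma
\end{aligned}
```

Since H depends only on sums of powers of the prices, everything needed to build $\Sigma$ and B can be obtained without touching the full matrix A again.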


Flow

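Matrix represented by vector p

$$A = \begin{bmatrix} 1 & p_1 & p_1^2 & \cdots & p_1^{n+1} \\ 1 & p_2 & p_2^2 & \cdots & p_2^{n+1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & p_M & p_M^2 & \cdots & p_M^{n+1} \end{bmatrix} = \begin{bmatrix} 1 & p & p^2 & \cdots & p^{n+1} \end{bmatrix} \quad (15)$$

procedure VandermondeQR
    (initialization)
    σ_1 ← m
    for j in 2, …, 2n − 1 do
        B_{j,1} ← ‖p^{j−1}‖_1 / σ_1
    μ_1 ← B_{2,1}
    ν_1 ← B_{3,1}
    σ_2 ← σ_1 (ν_1 − μ_1²)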


Flow (cont’d)

procedure VandermondeQR
    (initialization, cont'd)
    for j in 3, …, 2n − 2 do
        B_{j,2} ← (σ_1/σ_2) (B_{j+1,1} − μ_1 B_{j,1})
    Q_{:,1} = 1/√σ_1
    Q_{:,2} = (p − μ_1)/√σ_2
    (main recursion loop)
    for k in 3, …, n do
        μ_{k−1} ← B_{k,k−1}
        ν_{k−1} ← B_{k+1,k−1}
        σ_k ← σ_{k−1} (ν_{k−1} − ν_{k−2} + μ_{k−1} (μ_{k−2} − μ_{k−1}))
        for j in k + 1, …, 2n − k do
            B_{j,k} ← (σ_{k−1}/σ_k) (B_{j+1,k−1} − B_{j,k−2} + (μ_{k−2} − μ_{k−1}) B_{j,k−1})
        Q_{:,k} ← √(σ_{k−1}/σ_k) · ((p + μ_{k−2} − μ_{k−1}) · Q_{:,k−1} − √(σ_{k−1}/σ_{k−2}) · Q_{:,k−2})
    Σ ← diag(√σ_1, √σ_2, …, √σ_n)
    B ← B_{1:n,:}


Characteristics

Many operations are local, with no large amount of data movement

Can be parallelized easily
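As a concrete check of the VandermondeQR flow from the preceding slides, here is a NumPy transcription. It is a sketch under stated assumptions: A has n columns $1, p, \dots, p^{n-1}$, the scalar recurrence variable is written `sigma`, plain sums of powers stand in for the 1-norms (the two agree for positive prices), and arrays are padded so that the 1-based indices of the pseudocode can be used unchanged.

```python
import numpy as np

def vandermonde_qr(p, n):
    """Return Q, Sigma, B with A = Q @ Sigma @ B.T for A = [1, p, ..., p^(n-1)]."""
    p = np.asarray(p, dtype=float)
    m = p.size
    B = np.zeros((2 * n + 1, n + 1))   # 1-based indexing; row/column 0 unused
    Q = np.zeros((m, n + 1))
    sigma, mu, nu = np.zeros(n + 1), np.zeros(n + 1), np.zeros(n + 1)

    # Initialization: column 1 of B holds the scaled power sums.
    sigma[1] = m
    B[1, 1] = 1.0
    for j in range(2, 2 * n):          # j = 2 .. 2n-1
        B[j, 1] = np.sum(p ** (j - 1)) / sigma[1]
    Q[:, 1] = 1.0 / np.sqrt(sigma[1])
    if n >= 2:
        mu[1], nu[1] = B[2, 1], B[3, 1]
        sigma[2] = sigma[1] * (nu[1] - mu[1] ** 2)
        B[2, 2] = 1.0
        for j in range(3, 2 * n - 1):  # j = 3 .. 2n-2
            B[j, 2] = (sigma[1] / sigma[2]) * (B[j + 1, 1] - mu[1] * B[j, 1])
        Q[:, 2] = (p - mu[1]) / np.sqrt(sigma[2])

    # Main recursion loop: three-term recurrences for B and Q.
    for k in range(3, n + 1):
        mu[k - 1], nu[k - 1] = B[k, k - 1], B[k + 1, k - 1]
        sigma[k] = sigma[k - 1] * (nu[k - 1] - nu[k - 2]
                                   + mu[k - 1] * (mu[k - 2] - mu[k - 1]))
        B[k, k] = 1.0
        for j in range(k + 1, 2 * n - k + 1):   # j = k+1 .. 2n-k
            B[j, k] = (sigma[k - 1] / sigma[k]) * (
                B[j + 1, k - 1] - B[j, k - 2] + (mu[k - 2] - mu[k - 1]) * B[j, k - 1])
        Q[:, k] = np.sqrt(sigma[k - 1] / sigma[k]) * (
            (p + mu[k - 2] - mu[k - 1]) * Q[:, k - 1]
            - np.sqrt(sigma[k - 1] / sigma[k - 2]) * Q[:, k - 2])

    return Q[:, 1:], np.diag(np.sqrt(sigma[1:])), B[1:n + 1, 1:]

prices = np.linspace(0.5, 1.5, 50)     # hypothetical price vector
Q, Sigma, B = vandermonde_qr(prices, 4)
A = np.vander(prices, 4, increasing=True)
print(np.allclose(Q @ Sigma @ B.T, A), np.allclose(Q.T @ Q, np.eye(4)))
```

Only the power sums in the initialization need data from all paths; the column updates of Q are elementwise in p, so each thread can apply them to its own path partition independently.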