© 2015 IBM Corporation
STAC-A2™ Benchmark on POWER8®
Bishop Brock, Frank Liu, Karthick Rajamani
IBM Research, Austin, Texas
{bcbrock,frankliu,karthick}@us.ibm.com
Workshop for High-Performance Computational Finance (WHPCF'15)
November 20, 2015
IBM Research
With thanks to Kenneth Hill of the University of Florida and Julian Demouth of NVIDIA
2
IBM Research
STAC-A2TM Benchmark on POWER8 WHPCF'15 – Nov. 20, 2015 © 2015 IBM Corporation
Summary
● Workload characterization and performance analysis demonstrate that STAC-A2™ is a well-rounded single-system HPC benchmark for risk analytics
● POWER8-based Linux® systems demonstrate consistently high performance at every scale (and we explain why)
→ Current POWER8-based systems are capable and competitive platforms for computational finance
Simplified POWER8 Core: SMT2, SMT4, SMT8
[Figure: block diagram of the POWER8 core in SMT2 and higher modes. Thread Set 0 and Thread Set 1 each have their own register file (Regs0, Regs1) and unified issue queue (UniQ0, UniQ1). Labeled units: FXU, LSU, LU, DP FP pipes (128-bit DP SIMD), BR, CR, Crypto.]
● Up to 8 threads in 2 thread sets; dispatch/complete up to 3 non-branch + 1 branch per cycle per thread set
● Register file/renames shared equally by thread set
● Thread sets only dispatch to 'their' UniQ
● Execution units are split between UniQs (thread sets)
STAC® and the STAC-A2™ Benchmark
● Securities Technology Analysis Center (STAC)
– Coordinates the STAC Benchmark Council™
– Publishes and audits several benchmarks covering technologies important to capital markets
● STAC-A2 Benchmark: Risk Management
– Computes numerous sensitivities (Greeks) of a multi-asset, exotic call option (lookback, best-of) with early exercise
– Modeled by Monte Carlo simulation of the Heston stochastic volatility model
– Priced using the Longstaff-Schwartz algorithm
– Greeks computed using finite differences
The STAC-A2 benchmark suite measures the performance, scalability, quality and energy efficiency of any hardware/software system that is capable of implementing the benchmark specification. In this presentation we focus only on the performance and scalability benchmarks.
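To make the Monte Carlo step concrete, here is a minimal single-asset sketch of Heston path simulation with antithetic path pairs. It is not the benchmark implementation: the log-Euler scheme with full truncation, the function name `simulate_heston_pair`, and all parameter names are assumptions chosen for illustration, and the real workload is multi-asset with correlated assets.

```python
import math
import random


def simulate_heston_pair(s0, v0, mu, kappa, theta, xi, rho, dt, timesteps, rng):
    """Simulate one Heston price path and its antithetic twin.

    Assumed scheme (not from the slides): log-Euler for the price and
    Euler with full truncation for the variance. Returns two lists of
    prices, each of length timesteps + 1.
    """
    price = [s0, s0]
    variance = [v0, v0]
    paths = ([s0], [s0])
    for _ in range(timesteps):
        # Two correlated unit-normal draws shared by both paths of the pair.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        for k, sign in enumerate((1.0, -1.0)):  # original, then antithetic
            vp = max(variance[k], 0.0)          # full truncation
            root = math.sqrt(vp * dt)           # the SQRT-heavy inner kernel
            price[k] *= math.exp((mu - 0.5 * vp) * dt + root * sign * z1)
            variance[k] += kappa * (theta - vp) * dt + xi * root * sign * z2
            paths[k].append(price[k])
    return paths


if __name__ == "__main__":
    rng = random.Random(1)
    original, antithetic = simulate_heston_pair(
        100.0, 0.04, 0.0, 1.0, 0.04, 0.2, -0.5, 1.0 / 252, 252, rng)
    print(len(original), original[-1], antithetic[-1])
```

Each timestep depends on the previous one, which is consistent with the "little ILP" characterization of the Monte Carlo kernel; parallelism comes from simulating many independent paths.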
STAC-A2 Benchmark Dimensions and Challenges
● Performance/Scalability Dimensions
– Assets: The number of correlated assets: O(assets²)
– Timesteps: Granularity of discretization: O(timesteps)
– Paths: Number of Monte Carlo paths: O(paths)
● Challenges
– Unit-normal random number generation
– Dense matrix operations (custom correlation routine, beaucoup ILP)
– SQRT/DIVIDE-intensive Monte Carlo kernel (little ILP)
– Cache-efficient data management
– Efficient numerical methods (custom SVD routine)
– Bandwidth-limited performance in certain benchmarks
– Single-system parallel load balancing
IBM STAC-A2 Solution Structure
[Figure: solution pipeline with blocks labeled RNG, Monte Carlo Sim., Arrays, Complex Scenario Gen., Arrays, Longstaff-Schwartz Pricing.]
Monte Carlo simulation produces a number of paths x timesteps arrays (scenarios) which must be stored for later analysis.
Finite difference example:
\[ \Theta = \frac{y_{\Delta t} - y}{\Delta t} \]
y is the unmodified option value scenario
y_{Δt} is the modified expiration scenario
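The finite-difference example above amounts to one subtraction and one division per Greek once the two scenario values are priced. A minimal sketch, in which the function name `finite_difference` and the sample numbers are illustrative assumptions:

```python
def finite_difference(bumped_value, base_value, bump):
    """Forward finite difference: (y_bumped - y) / bump."""
    return (bumped_value - base_value) / bump


if __name__ == "__main__":
    # Hypothetical scenario values: y = 10.0, y_dt = 10.5, dt = 0.25.
    print(finite_difference(10.5, 10.0, 0.25))
```

The cost of a Greek is therefore dominated by generating and pricing the extra (bumped) scenario, not by the difference itself.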
A master thread (a) spawns multiple worker threads that perform Monte Carlo simulation in parallel (b). Simulation is partitioned by paths; each thread creates the same path-partition of every array. Scenarios are (generated and) priced by cohorts of 1 or more threads (c). Finally the master thread computes the finite differences (d).
Key “Greeks” Workloads + Characterization
Workload        Goal         Assets  Timesteps  Paths  Result   Scenarios  Memory*
Baseline        Speed        5       252        25K    0.317 s  93         2.5 GiB
Large Problem   Speed        10      1260       100K   28.9 s   308        143 GiB
Asset Scaling   Max Assets   *       252        25K    78       15642*     309 GiB
Path Scaling    Max Paths    5       252        *      28M      93         2.8 TiB
* Values unique to our solution
[Figure: Approximate Profile Breakdown by Application Phase, POWER8 S824*. Stacked bars (0% to 100%) for Baseline, Large Problem, Asset Scaling and Path Scaling. Phases: RNG + Correlation, Monte Carlo (Heston Model), Array Transpose, Longstaff-Schwartz Pricing.]
* 2 x POWER8 @ 3.52 GHz (nominal) / 3.925 GHz (turbo); 24 total cores; 1 TiB memory; IBM XL C/C++; RHEL 7 Big-endian
Memory Bandwidth Sensitivity
[Figure: POWER8 S824 Relative Performance vs. Available Memory Bandwidth*. Y axis: performance relative to full bandwidth. Bar labels, read in extraction order:]
Bandwidth per socket    Baseline  Large Problem  Asset Scaling (65 Assets)  Path Scaling (8M Paths/STC)
Full (192 GB/sec)       1         1              1                          1
Half (96 GB/sec)        0.96      0.93           1                          0.67
Quarter (48 GB/sec)     0.82      0.72           0.99                       0.44
*Not a STAC-A2 benchmark
Performance Comparisons – STAC-A2 Benchmarks
STAC SUT* (legend): Description
● IBM150305 (2 x POWER8): 2 x IBM POWER8 @ 3.52 GHz; 24 total cores; 1 TiB Memory
● INTC151028 (2 x Haswell-EP + 2 x Xeon PHI): 2 x Intel Xeon E5-2697 v3 @ 2.6 GHz + 2 x Intel Xeon PHI 7120P @ 1.24 GHz; 28 Haswell cores + 122 PHI cores; 256 GiB Memory
● INTC150811 (4 x Haswell-EX): 4 x Intel Xeon E7-8890 v3 @ 2.50 GHz; 72 total cores; 1 TiB Memory
● NVDA141116 (NVIDIA K80): 1 x NVIDIA Tesla K80 GPU; Supermicro SYS-2027GR-TRFH Host; 128 MiB Memory
● INTC140815 (2 x Haswell-EP + Xeon PHI): 2 x Intel Xeon E5-2699 v3 @ 2.3 GHz + 1 x Intel Xeon PHI 7120A @ 1.24 GHz; 36 Haswell cores + 61 PHI cores; 256 GiB Memory
● INTC140814 (2 x Haswell-EP): 2 x Intel Xeon E5-2699 v3 @ 2.3 GHz; 36 total cores; 256 GiB Memory
*For details see www.stacresearch.com/<STAC SUT>
[Figure: three bar charts comparing the six systems. Relative Workload Performance (Baseline, Large Problem); Max Assets (Asset Scaling); Max Paths (Path Scaling).]
Intel, Xeon, PHI, NVIDIA, Tesla and Supermicro are trademarks or registered trademarks of their respective owners and are used for identification purposes only
Performance by SMT Mode
[Figure: SMT Mode Performance Relative to SMT1, STAC-A2 Baseline Greeks Workload*. Y axis: performance relative to SMT1. Bar labels:]
SMT Mode  1 Core  1 Socket (12 Cores)  2 Sockets (24 Cores)
SMT1      1       1                    1
SMT2      1.61    1.68                 1.66
SMT4      2.00    2.04                 2.12
SMT8      1.96    1.97                 1.99
*Not a STAC-A2 benchmark
System-Level Scalability Comparisons
[Figure: Scaling Single Core Performance to Half/Full System Performance* (Higher is Better, 1.0 = Ideal). Y axis: fraction of theoretical speedup, based on number of cores. Experiments: Core → Half System, Core → Full System.]
Systems compared:
● 2 x POWER8: Half = 12 cores, Full = 24 cores, SMT4
● 4 x Haswell-EX: Half = 36 cores, Full = 72 cores, HyperThreading
● 2 x Haswell-EP: Half = 18 cores, Full = 36 cores, HyperThreading
*Data is from the official STAC-A2 audit reports for these systems; however, these are not considered STAC-A2 benchmarks. Please see the "Performance Comparisons – STAC-A2 Benchmarks" slide for full details of the systems being compared here.
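The scaling metric on this slide, the fraction of theoretical speedup based on number of cores, can be written as measured speedup divided by core count. The sketch below assumes that reading; the function name `scaling_fraction` and the sample numbers are hypothetical and are not values from the audit reports.

```python
def scaling_fraction(single_core_perf, system_perf, cores):
    """Fraction of ideal speedup: (system_perf / single_core_perf) / cores.

    Performance is any throughput-like quantity (higher is better).
    A result of 1.0 means perfectly linear scaling with core count.
    """
    if cores < 1 or single_core_perf <= 0.0:
        raise ValueError("need cores >= 1 and positive single-core performance")
    return (system_perf / single_core_perf) / cores


if __name__ == "__main__":
    # Hypothetical: a 24-core system running 18x faster than one core.
    print(scaling_fraction(1.0, 18.0, 24))
```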
Optimizing Transpose for Monte Carlo Simulation
Straightforward (i.e., not using complex blocking schemes), cache-efficient Longstaff-Schwartz pricing effectively requires transposing time-major data into path-major storage.
[Figure: two paths x timesteps arrays (path 0 to path N-1 by t_0 to t_{timesteps-1}). Simulation: data generated time major. Pricing: data analyzed path major. Path pairs n + 0 through n + 7 are inserted into block-rows, which are finally transposed.]
Our Solution:
A blocked, in-place matrix transpose using no additional storage, taking advantage of the fact that (post-processed) simulation data is created one path (pair) at a time
Simple and relatively efficient despite:
• Moderately poor locality
• Requiring a second pass over the data
Simple heuristics based on working set size improve performance for large working sets
2% - 9% of total benchmark run time
Fast, Parallel Longstaff-Schwartz Algorithm
• Least-Squares Monte Carlo (LSMC or Longstaff-Schwartz) is an algorithm for valuing early-exercise options.
• LSMC operates serially, backwards in time.
• At each time step, LSMC estimates the optimal exercise value by least squares regression using a cross section of simulated data, approximating this value as a linear combination of basis functions.
• We describe a fast, easily parallelizable LSMC method based on a QR factorization algorithm specific to row Vandermonde matrices.
• This approach is used for "microbenchmarks" and for Path Scaling, where utilizing every thread optimizes performance.
• This approach was inspired by NVIDIA's description of their LSMC algorithm for STAC-A2.
[Figure: paths x timesteps array (path 0 to path N-1 by t_0 to t_{timesteps-1}) partitioned by path among Thread 0 to Thread 3.]
• Paths are partitioned by thread similar to Monte Carlo simulation
• Threads synchronize (twice per timestep here) and complete each timestep in lock-step
• Scalability is limited by synchronization and communication overheads
• For the Baseline case (25,000 paths) performance improves for up to 16 – 20 SMT4 threads per scenario
Future Work
● Exploiting the high-bandwidth CPU ↔ GPU NVLINK™ unique to future OpenPOWER systems
[Figure: logos of current OpenPOWER Foundation Gold Members*]
* http://www.smartercomputingblog.com/power-systems/openpower-hpc-big-data-analytics/
Special Notices
Information in this document concerning non-IBM products was obtained from the suppliers of these products or other public sources. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products.
IBM may have patents or pending patent applications covering subject matter in this document. The furnishing of this document does not give you any license to these patents. Send license inquiries, in writing, to IBM Director of Licensing, IBM Corporation, New Castle Drive, Armonk, NY 10504-1785 USA.
All statements regarding IBM future direction and intent are subject to change or withdrawal without notice, and represent goals and objectives only.
The information contained in this document has not been submitted to any formal IBM test and is provided "AS IS" with no warranties or guarantees either expressed or implied.
All examples cited or described in this document are presented as illustrations of the manner in which some IBM products can be used and the results that may be achieved. Actual environmental costs and performance characteristics will vary depending on individual client configurations and conditions.
Any performance data contained in this document was determined in a controlled environment. Actual results may vary significantly and are dependent on many factors including system hardware configuration and software design and configuration. Some measurements quoted in this document may have been made on development-level systems. There is no guarantee these measurements will be the same on generally-available systems. Some measurements quoted in this document may have been estimated through extrapolation. Users of this document should verify the applicable data for their specific environment.
A full list of U.S. trademarks owned by IBM may be found at: http://www.ibm.com/legal/copytrade.shtml.
Other company, product and service names may be trademarks or service marks of others and are used for identification purposes only.
Backup
POWER8 Simultaneous Multi-threading (SMT)
● SMT modes represent different partitioning of core resources: Fetch/Dispatch; Reg. Files; Execution Pipes
● SMT mode is determined by the number of active hardware threads/core, regardless of the number of threads configured per core
– SMT mode switch is automatic as threads enter/exit idle states
– An enhancement over POWER7
● SMT modes: SMT1, SMT2, SMT4 and SMT8 entered when 1, 2, 3 – 4, 5 – 8 threads are active respectively
● Most STAC-A2 benchmarks run in SMT4; some SMT2
– SMT4 here means we bind 4 software threads to 4 of 8 configured HW threads per core
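The mapping from active hardware threads to SMT mode stated above is small enough to express directly. A minimal sketch; the function name `smt_mode` is an illustrative assumption and is not an operating-system or firmware interface:

```python
def smt_mode(active_threads):
    """Return the POWER8 SMT mode entered for a given number of active
    hardware threads on one core, per the thresholds on this slide:
    1 -> SMT1, 2 -> SMT2, 3-4 -> SMT4, 5-8 -> SMT8."""
    if not 1 <= active_threads <= 8:
        raise ValueError("a POWER8 core has between 1 and 8 active threads")
    if active_threads == 1:
        return "SMT1"
    if active_threads == 2:
        return "SMT2"
    if active_threads <= 4:
        return "SMT4"
    return "SMT8"


if __name__ == "__main__":
    for n in range(1, 9):
        print(n, smt_mode(n))
```

Binding 4 software threads to a core with 8 configured hardware threads therefore yields SMT4, because only the active count matters.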
Simplified POWER8 Core: SMT1
[Figure: block diagram of the POWER8 core running a single thread. Register files (Regs and Regs (Mirror)) feed unified issue queues UniQ0 and UniQ1. Labeled units: FXU, LSU, LU, DP FP pipes (128-bit DP SIMD), BR, CR, Crypto.]
● Dispatch/complete up to 6 non-branch + 2 branch per cycle
● Register files/renames are mirrored in SMT1, allowing use of either UniQ
● Unified issue queues; BRanch and Condition Register have private queues; Crypto issued from either UniQ
● Execution units: pairs of Double Precision Floating Point pipes implement 2-way DP SIMD; 1 FiXed point Unit, 1 Load-Store Unit, 1 Load Unit per side
● Note each DP FP can also function as 2 x Single-Precision FP
Optimizing Data Transpose for Monte Carlo Sim.
Straightforward (i.e., not using complex blocking schemes), cache-efficient Longstaff-Schwartz pricing effectively requires transposing time-major data into path-major storage. Traditional out-of-place rectangular transpose requires too much memory. Traditional in-place transpose is too slow.
[Figure: two paths x timesteps arrays (path 0 to path N-1 by t_0 to t_{timesteps-1}). Simulation: data generated time major. Pricing: data analyzed path major. Path pairs n + 0, n + 1, etc. are inserted one after another into the path-major array.]
Simple insertion of time-major data into a path-major array is a disaster!
• Cache lines only partially populated when touched
• Line may be evicted to memory before being touched again
• Each line is in a unique virtual page
Instead we use the blocked approach illustrated on the next slide.
Illustration of In-Place Blocked Transpose for a 16x16 Block-Row
[Figure: a destination block-row divided into 16 x 16 blocks covering t_0 to t_15, t_16 to t_31, t_32 to t_47, and so on up to t_{timesteps-1}. Path pairs n + 0 through n + 7 are inserted in turn (Insert Pair n + 0, Insert Pair n + 1, ..., Insert Pair n + 7), followed by a (deferred) block transpose.]
Path pairs (original + antithetic) are simulated together and inserted into the blocked (16 x 16) and padded destination array block-row such that the blocks are transposed. (Eventually) transposing the blocks then brings the data into the correct orientation.
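As a rough illustration of why blocking helps, the sketch below performs a generic cache-blocked (tiled) transpose from a time-major array to a path-major array. It is deliberately not the in-place, padded, insert-then-transpose scheme described above: it is out-of-place, and the function name `blocked_transpose` and the nested-list layout are assumptions made for readability.

```python
def blocked_transpose(time_major, block=16):
    """Tiled out-of-place transpose.

    time_major[t][p] holds the value of path p at timestep t.
    Returns path_major with path_major[p][t] == time_major[t][p].
    Work proceeds one block x block tile at a time so that both the
    source rows and destination rows of a tile stay cache resident.
    """
    timesteps = len(time_major)
    paths = len(time_major[0]) if timesteps else 0
    path_major = [[0.0] * timesteps for _ in range(paths)]
    for t0 in range(0, timesteps, block):
        for p0 in range(0, paths, block):
            for t in range(t0, min(t0 + block, timesteps)):
                row = time_major[t]
                for p in range(p0, min(p0 + block, paths)):
                    path_major[p][t] = row[p]
    return path_major


if __name__ == "__main__":
    data = [[10 * t + p for p in range(5)] for t in range(3)]
    print(blocked_transpose(data, block=2))
```

The 16 x 16 default mirrors the block size named on the slide; the partial tiles at the edges are handled by the `min` bounds rather than by padding.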
Transpose Performance
Any technique other than the Simple Insertion gives good performance, and all techniques benefit from manual prefetching. 2% - 9% of total benchmark run time is currently spent doing this transpose (varies by workload).
If the working set (16 paths per simultaneously generated array, per thread, per core) appears to fit in the 8 MiB L3, it is advantageous to transpose each block-row as soon as it is filled; otherwise it is better to defer the final block transpose until the entire array is filled.
Our latest results also suggest that Blocked Insertion with Immediate Transpose is more efficient than using a library out-of-place transposition routine (DGETMO) for L3-contained working sets. [Each thread transposes the section of the array it created.]†
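The working-set heuristic described above reduces to one comparison against the L3 capacity. A minimal sketch; the function name `choose_transpose` and the returned strings are illustrative assumptions, while the 8 MiB threshold and the sample working-set sizes come from this slide:

```python
L3_MIB = 8.0  # per-core L3 capacity quoted on the slide


def choose_transpose(working_set_mib, l3_mib=L3_MIB):
    """Pick the block-transpose timing from the working set size.

    "immediate": transpose each block-row as soon as it is filled.
    "deferred":  transpose all blocks after the whole array is filled.
    """
    return "immediate" if working_set_mib <= l3_mib else "deferred"


if __name__ == "__main__":
    for name, size in [("Baseline", 6.4), ("Large Problem", 93.5)]:
        print(name, choose_transpose(size))
```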
[Figure: Performance of Monte Carlo Simulation by Transpose Method, Relative to Blocked Insertion with Immediate Transpose†. Y axis: relative performance. X axis: workload (working set, MiB). Bar labels:]
Workload (Working Set, MiB)  Blocked (Deferred Transpose)  Simple Insertion  Buffered Insertion*
Baseline (6.4)               0.91                          0.63              0.94
Large Problem (93.5)         1.04                          0.46              1.00
Asset Scaling (797)          1.06                          0.85              0.95
Path Scaling (6.4)           0.93                          0.63              1.01
5:126:25000 (3.2)            0.88                          0.75              0.97
5:1260:25000 (32.0)          1.02                          0.37              1.07
† Not a STAC-A2 Benchmark
* Discussed in the paper
Parallel Longstaff-Schwartz Algorithm
● After simulation each (complex) scenario is reduced to a single value using the Longstaff-Schwartz algorithm
● Multi-threaded parallelization is needed in cases where the number of threads exceeds the number of scenarios
– "Microbenchmarks", e.g., Delta, where 96 threads price 10 scenarios
– Path Scaling: 93 scenarios are divided into 5 partitions due to limited memory; partitions contain 10 to 40 scenarios
● We may also use parallelization heuristically to improve load balancing, even when scenarios > threads
[Figure: a Scenario Array (paths x timesteps) passes through Longstaff-Schwartz and is reduced to a single value, e.g. 3.276517.]
Longstaff-Schwartz (Least-Squares Monte Carlo [LSMC]) in a Nutshell
• Prior to expiration the holder will exercise an in-the-money option if the future discounted cash flow from holding the option is expected to be less than the current value.
• The value of the option is maximized if the exercise happens as soon as this is true.
• LSMC estimates the future value by least squares regression using a cross section of simulated data, approximating this value as a linear combination of basis functions.
• LSMC is robust to the choice of basis functions; STAC-A2 specifies the use of polynomial basis functions
[Figure: Longstaff-Schwartz Example, Polynomial Curve Fit. X axis: Current Payoff (Exercise Now). Y axis: Future Payoff. A least-squares fit through the simulated points is compared with the line Current = Future, separating the Exercise Now region from the Defer Exercise region.]
Analysis performed for each timestep. The Singular Value Decomposition (SVD) approach to least-squares is preferred for numerical stability.
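The exercise decision described above can be sketched for a single backward timestep. This is an illustration under stated assumptions, not the benchmark code: it regresses on the current payoff (as in the example plot), solves the least-squares problem through the normal equations instead of the SVD the slides prefer, and the names `lsmc_step`, `polyfit_normal_equations` and `solve_linear` are hypothetical.

```python
def solve_linear(matrix, rhs):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def polyfit_normal_equations(xs, ys, degree):
    """Polynomial least squares via A^T A x = A^T b (fine for small degree)."""
    n = degree + 1
    gram = [[sum(x ** (a + b) for x in xs) for b in range(n)] for a in range(n)]
    rhs = [sum((x ** a) * y for x, y in zip(xs, ys)) for a in range(n)]
    return solve_linear(gram, rhs)


def lsmc_step(payoff_now, future_cashflow, discount, degree=2):
    """One backward LSMC step; returns the cash flow per path as seen now."""
    discounted = [discount * c for c in future_cashflow]
    itm = [i for i, v in enumerate(payoff_now) if v > 0.0]  # in-the-money paths
    updated = discounted[:]
    if len(itm) > degree:
        coeffs = polyfit_normal_equations(
            [payoff_now[i] for i in itm], [discounted[i] for i in itm], degree)
        for i in itm:
            continuation = sum(c * payoff_now[i] ** k for k, c in enumerate(coeffs))
            if payoff_now[i] > continuation:  # exercising beats holding
                updated[i] = payoff_now[i]
    return updated


if __name__ == "__main__":
    print(lsmc_step([0.0, 1.0, 2.5, 4.0], [5.0, 2.0, 2.0, 2.0], 1.0, degree=1))
```

Out-of-the-money paths never exercise and simply carry their discounted future cash flow backwards; only in-the-money paths enter the regression.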
Longstaff-Schwartz Parallelization Schemes
By Timestep
[Figure: paths x timesteps array (path 0 to path N-1 by t_0 to t_{timesteps-1}) with timesteps assigned in turn to Thread 0 through Thread 3.]
• Threads "leapfrog" backwards in time, computing SVD in parallel
• Final evaluation is still serial: thread must wait for next timestep to complete before evaluating the regression
• Scalability severely limited by ratio T_SVD : T_eval
By Path-Partition
[Figure: paths x timesteps array (path 0 to path N-1 by t_0 to t_{timesteps-1}) partitioned by path among Thread 0 through Thread 3.]
• Paths are partitioned by thread similar to Monte Carlo simulation
• Threads synchronize (twice per timestep here) and complete each timestep in lock-step
• Scalability is limited by synchronization and communication overheads
• For the Baseline case (25,000 paths) performance improves for up to 16 – 20 SMT4 threads per scenario
Path-Parallel Least Squares Solution from QR Factorization and SVD
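• Design matrix A: STAC-A2 specifies polynomial basis functions, hence A is a row Vandermonde matrix
\[ A_{\text{paths} \times n} = \begin{bmatrix} \vdots & \vdots & \vdots & & \vdots \\ 1 & p_i & p_i^{2} & \cdots & p_i^{\,n-1} \\ \vdots & \vdots & \vdots & & \vdots \end{bmatrix} \]
• The QR factorization of A. This is path-parallelizable in general for any basis functions; however, we use a fast QR factorization for Vandermonde matrices requiring trivial parallel communication
\[ A_{\text{paths} \times n} = Q_{\text{paths} \times n} \, R_{n \times n} \]
• The SVD of R
\[ R = U \Sigma V^{T} \]
• The SVD of A
\[ A = (QU) \Sigma V^{T} \]
• Solution for coefficients x^T minimizing \( \lVert Ax - b \rVert_2 \), where b is the discounted future cash flow
\[ x^{T} = b^{T} \left[ (QU) \Sigma^{-1} V^{T} \right] = b^{T}_{\text{paths}} \, W_{\text{paths} \times n} \]
• Both b^T and W can be path-partitioned between N threads. Each thread j computes n partial coefficients
\[ x^{T} = \begin{bmatrix} b^{T}(1) & b^{T}(2) & \cdots & b^{T}(N) \end{bmatrix} \times \begin{bmatrix} W(1)_1 & W(1)_2 & \cdots & W(1)_n \\ W(2)_1 & W(2)_2 & \cdots & W(2)_n \\ \vdots & \vdots & \ddots & \vdots \\ W(N)_1 & W(N)_2 & \cdots & W(N)_n \end{bmatrix} \]
\[ x_i^{T}(j) = b^{T}(j) \, W(j)_i \quad \text{for } i = 1, \ldots, n \]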
High-Level Sketch – Per-thread Operations
Loop over timesteps:
1. Each thread j computes the sums of the first 2n-1 powers of its share of prices for in-the-money paths:
\[ M_{i,j} = \sum_{k=\text{firstPath}_j}^{\text{lastPath}_j} p_k^{\,i-1} \quad \text{for } i = 1, \ldots, 2n-1 \]
Thread Join
2. Each thread j combines the partial sums to construct R and its SVD, which is then used to create partial coefficients:
\[ R = f(M_1, \ldots, M_{2n-1}) \]
\[ W(j) = g(A(j), \mathrm{SVD}(R)) \]
\[ x_i^{T}(j) = b^{T}(j) \, W(j)_i \quad \text{for } i = 1, \ldots, n \]
Thread Join
3. Use the regression \( A(j)\,x \) to compute the new future value
• In total, 3 passes are made over the price data at each timestep, stressing caches and memory bandwidth as the number of paths per partition increases.
• The working set here consists of 3 column vectors: the price, the future value, and an index vector recording in-the-money paths.
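The first per-thread step above can be illustrated with plain Python threads. The sketch computes each partition's power sums in parallel, joins, and combines them into the n x n Gram matrix A^T A of the Vandermonde design matrix, whose (a, b) entry is the sum of p^(a+b). This only shows that the 2n-1 power sums carry enough information for the shared factorization step; the slides' functions f and g are not reproduced, and the names `partial_power_sums` and `path_parallel_gram` are assumptions.

```python
import threading


def partial_power_sums(prices, n):
    """Sums of p^0 .. p^(2n-2) over one thread's share of the prices."""
    sums = [0.0] * (2 * n - 1)
    for p in prices:
        term = 1.0
        for i in range(2 * n - 1):
            sums[i] += term
            term *= p
    return sums


def path_parallel_gram(prices, n, num_threads):
    """Gram matrix A^T A for an n-column Vandermonde matrix built from
    `prices`, assembled from per-thread partial power sums."""
    chunk = (len(prices) + num_threads - 1) // num_threads
    partials = [None] * num_threads

    def worker(j):
        # Each thread touches only its own contiguous partition of paths.
        partials[j] = partial_power_sums(prices[j * chunk:(j + 1) * chunk], n)

    threads = [threading.Thread(target=worker, args=(j,)) for j in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # corresponds to the first "Thread Join"
    total = [sum(column) for column in zip(*partials)]
    return [[total[a + b] for b in range(n)] for a in range(n)]


if __name__ == "__main__":
    print(path_parallel_gram([1.0, 2.0, 3.0, 4.0], n=2, num_threads=2))
```

The only data exchanged between threads is 2n-1 numbers per partition, which is why the communication cost is small compared with the pass over the price data.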
All-Threads Cohorts vs. Small-Threads Cohorts
[Figure: Greeks Throughput by Parallelization Method and Available Memory Bandwidth*. Y axis: paths per second (0 to 70000). X axis (Paths - Method): 1M ATC, 1M STC, 8M ATC, 8M STC, 16M ATC, 16M STC. Series: Full (192 GB/sec per socket), Half (96 GB/sec per socket), Quarter (48 GB/sec per socket).]
• All-Threads-Cohort (ATC) mode, using all threads to price each scenario, significantly improves performance and reduces memory bandwidth dependency vs. using cohorts with smaller numbers of threads (Small-Threads-Cohort, STC) for large numbers of paths.
• Speedups are primarily due to better load balancing.
• Memory effects are due to smaller cache footprints and ideal NUMA locality.
[Figure: paths x timesteps array (path 0 to path N-1 by t_0 to t_{timesteps-1}) partitioned by path among Thread 0 through Thread 3.]
* Not a STAC-A2 Benchmark
Least Squares Fitting
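Given observed pairs \( (x_i, y_i) \), find a polynomial model which describes the relationship.
\[ y = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n \qquad (1) \]
Expanding into matrix format
\[ \begin{bmatrix} 1 & x_1 & \cdots & x_1^n \\ 1 & x_2 & \cdots & x_2^n \\ \vdots & \vdots & & \vdots \\ 1 & x_M & \cdots & x_M^n \end{bmatrix} \cdot \begin{bmatrix} a_0 \\ a_1 \\ \vdots \\ a_n \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_M \end{bmatrix} \qquad (2) \]
Or in matrix form:
\[ A \cdot x = b \qquad (3) \]
Least squares solution minimizes the residual
\[ \arg\min_x \lVert r \rVert_2^2 = \arg\min_x \lVert A \cdot x - b \rVert_2^2 \qquad (4) \]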
Solutions to Least Squares Problems
Requires more observed data points than order of polynomial: \( M > n \). Overdetermined problem.
Existence and uniqueness are skipped here.
Several methods exist to solve LS problems. For example, the pseudo-inverse method:
\[ A^T A \cdot x = A^T b \qquad (5) \]
\( A^T A \) is square and full rank, so we can invert it to compute x:
\[ x = \left( A^T A \right)^{-1} A^T b \qquad (6) \]
Works fine for small n. Issues when n gets large: lots of data movement; bad condition number if not properly scaled
Better: SVD (singular value decomposition)
\[ A = U \Sigma V^T \qquad (7) \]
Both U and V are unitary, and \( \Sigma \) is diagonal. Much more stable:
\[ x = V \Sigma^{-1} U^T \cdot b \qquad (8) \]
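The step from the normal equations (5) to the SVD solution can be filled in as a short derivation. It assumes the thin SVD of a full-rank A, so that \( U^T U = I \), \( V^T V = V V^T = I \), and \( \Sigma \) is invertible.

```latex
\begin{align*}
A^T A &= V \Sigma U^T U \Sigma V^T = V \Sigma^2 V^T, &
A^T b &= V \Sigma U^T b, \\
V \Sigma^2 V^T x &= V \Sigma U^T b
  && \text{(substitute into } A^T A\, x = A^T b\text{)} \\
V^T x &= \Sigma^{-1} U^T b
  && \text{(multiply on the left by } \Sigma^{-2} V^T\text{)} \\
x &= V \Sigma^{-1} U^T b .
\end{align*}
```

The SVD form never builds \( A^T A \), whose condition number is the square of that of A, which is one way to see why it is the more stable route.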
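Singular Value Decomposition via QR decomposition
One path to the SVD is through the QR decomposition
\[ A = Q \begin{bmatrix} R \\ 0 \end{bmatrix} \qquad (9) \]
Q is orthogonal and R is upper triangular
Classic methods: Householder transformation or Givens rotation. Similar to Gaussian elimination for matrix factorization, but using orthogonal transformations
Very much a serial operation
\[ \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} \longrightarrow \begin{bmatrix} a^{*}_{11} & a^{*}_{12} & \cdots & a^{*}_{1n} \\ 0 & a^{*}_{22} & \cdots & a^{*}_{2n} \\ \vdots & \vdots & & \vdots \\ 0 & a^{*}_{n2} & \cdots & a^{*}_{nn} \end{bmatrix} \qquad (10) \]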
Demeure's QR Factorization Method
Take advantage of the special structure of A to speed up the QR factorization.
For LS problems, the A matrix is a Vandermonde matrix
Sketch of the strategy
The QR factorization is
\[ A = Q \Sigma B^T \qquad (11) \]
with Q orthogonal, \( B^T \) upper triangular with 1 on the diagonal, and \( \Sigma \) diagonal with real values.
From Eqn. (11), if E is the inverse of \( B^T \):
\[ A E = Q \Sigma \qquad (12) \]
If \( H = A^T A \):
\[ H E = B \Sigma^2 \qquad (13) \]
Also
\[ A^T Q = B \Sigma \qquad (14) \]
From Eqn. (13), recursively compute E as column space of H; use E to compute Q from Eqn. (12); use Q to compute B from Eqn. (14)
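Equations (12) through (14) follow from Eqn. (11) in a few lines. The derivation below assumes only what the slide states: \( Q^T Q = I \), \( \Sigma \) real and diagonal (so \( \Sigma^T = \Sigma \)), and \( E = (B^T)^{-1} \).

```latex
\begin{align*}
A E &= Q \Sigma B^T E = Q \Sigma
  && \text{since } B^T E = I \quad\Rightarrow (12) \\
H = A^T A &= B \Sigma Q^T Q \Sigma B^T = B \Sigma^2 B^T \\
H E &= B \Sigma^2 B^T E = B \Sigma^2
  && \Rightarrow (13) \\
A^T Q &= B \Sigma Q^T Q = B \Sigma
  && \Rightarrow (14)
\end{align*}
```

Because E is upper triangular with unit diagonal and B is lower triangular, Eqn. (13) can be solved one column at a time, which is what makes a recursive construction possible.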
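Flow
Matrix represented by vector p
\[ A = \begin{bmatrix} 1 & p_1 & p_1^2 & \cdots & p_1^{n+1} \\ 1 & p_2 & p_2^2 & \cdots & p_2^{n+1} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & p_M & p_M^2 & \cdots & p_M^{n+1} \end{bmatrix} = \begin{bmatrix} 1 & p & p^2 & \cdots & p^{n+1} \end{bmatrix} \qquad (15) \]
procedure VandermondeQR
  (initialization)
  \( \sigma_1 \leftarrow m \)
  for j in 2, ..., 2n − 1 do
    \( B_{j,1} \leftarrow \lVert p^{\,j-1} \rVert_1 / \sigma_1 \)
  \( \mu_1 \leftarrow B_{2,1} \)
  \( \nu_1 \leftarrow B_{3,1} \)
  \( \sigma_2 \leftarrow \sigma_1 (\nu_1 - \mu_1^2) \)
  for j in 3, ..., 2n − 2 do
    \( B_{j,2} \leftarrow (\sigma_1 / \sigma_2)\,(B_{j+1,1} - \mu_1 B_{j,1}) \)
  \( Q_{:,1} = 1 / \sqrt{\sigma_1} \)
  \( Q_{:,2} = (p - \mu_1) / \sqrt{\sigma_2} \)
  (main recursion loop)
  for k in 3, ..., n do
    \( \mu_{k-1} \leftarrow B_{k,k-1} \)
    \( \nu_{k-1} \leftarrow B_{k+1,k-1} \)
    \( \sigma_k \leftarrow \sigma_{k-1} \left( \nu_{k-1} - \nu_{k-2} + \mu_{k-1} (\mu_{k-2} - \mu_{k-1}) \right) \)
    for j in k + 1, ..., 2n − k do
      \( B_{j,k} \leftarrow (\sigma_{k-1} / \sigma_k) \left( B_{j+1,k-1} - B_{j,k-2} + (\mu_{k-2} - \mu_{k-1}) B_{j,k-1} \right) \)
    \( Q_{:,k} \leftarrow \sqrt{\sigma_{k-1} / \sigma_k} \cdot \left( (p + \mu_{k-2} - \mu_{k-1}) \cdot Q_{:,k-1} - \sqrt{\sigma_{k-1} / \sigma_{k-2}} \cdot Q_{:,k-2} \right) \)
  \( \Sigma \leftarrow \mathrm{diag}(\sqrt{\sigma_1}, \sqrt{\sigma_2}, \ldots, \sqrt{\sigma_n}) \)
  \( B \leftarrow B_{1:n,:} \)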
Characteristics
Many operations are local, no large amount of data movements
Can be parallelized easily
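To illustrate why a structure-exploiting factorization is local and easy to parallelize, the sketch below builds an orthonormal basis for the columns of a Vandermonde matrix with a three-term recurrence (the Stieltjes/Lanczos procedure). This is a swapped-in, textbook technique and not Demeure's algorithm from the preceding slides; the function name `vandermonde_q` is an assumption. Every step is either elementwise over the paths or a dot product, so a path-partitioned version would need only a few scalar reductions per column.

```python
import math


def dot(u, v):
    """Inner product; in a path-parallel setting this is a reduction."""
    return sum(a * b for a, b in zip(u, v))


def vandermonde_q(p, n):
    """Orthonormal columns spanning {1, p, p^2, ..., p^(n-1)}.

    Returns a list of n columns, each a list of len(p) floats.
    Requires at least n distinct values in p; otherwise a norm is zero
    and the division below fails.
    """
    m = len(p)
    q_prev = [0.0] * m
    q_curr = [1.0 / math.sqrt(m)] * m
    columns = [q_curr]
    beta_prev = 0.0
    for _ in range(1, n):
        pq = [pi * qi for pi, qi in zip(p, q_curr)]   # elementwise, local
        alpha = dot(pq, q_curr)                       # scalar reduction
        v = [a - alpha * b - beta_prev * c
             for a, b, c in zip(pq, q_curr, q_prev)]  # elementwise, local
        beta = math.sqrt(dot(v, v))                   # scalar reduction
        q_next = [x / beta for x in v]
        columns.append(q_next)
        q_prev, q_curr, beta_prev = q_curr, q_next, beta
    return columns


if __name__ == "__main__":
    for column in vandermonde_q([1.0, 2.0, 3.0, 4.0, 5.0], 3):
        print([round(x, 4) for x in column])
```

For large n the plain recurrence gradually loses orthogonality in floating point, so this is suitable only for the small polynomial degrees typical of the regression step.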