![Page 1: Automatically Adapting Programs for Mixed-Precision Floating-Point Computation Mike Lam and Jeff Hollingsworth University of Maryland, College Park Bronis](https://reader035.vdocuments.site/reader035/viewer/2022062618/5513d5de5503463a298b5363/html5/thumbnails/1.jpg)
Automatically Adapting Programs
for Mixed-Precision Floating-Point Computation
Mike Lam and Jeff Hollingsworth
University of Maryland, College Park
Bronis de Supinski and Matt LeGendre
Lawrence Livermore National Lab
Background
• Floating point represents real numbers as (± sgnf × 2^exp)
  o Sign bit
  o Exponent
  o Significand (“mantissa” or “fraction”)
• Finite precision
  o Single-precision: 24 bits (~7 decimal digits)
  o Double-precision: 53 bits (~16 decimal digits)
  o Introduces rounding error
[Figure: IEEE single (32 bits): sign (1 bit), exponent (8 bits), significand (23 bits)]
[Figure: IEEE double (64 bits): sign (1 bit), exponent (11 bits), significand (52 bits)]
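The rounding error of the two formats can be made concrete with a short Python sketch. It is only an illustration: the helper names `to_single` and `relative_error` are hypothetical, and single precision is emulated by round-tripping through the standard `struct` module. The real number 0.1 is stored in each format and its exact relative error is compared against the unit roundoff (2^-24 for single, 2^-53 for double).

```python
import struct
from fractions import Fraction

def to_single(x: float) -> float:
    """Round a Python float (an IEEE double) to the nearest IEEE single."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def relative_error(stored: float, exact: Fraction) -> Fraction:
    """Exact relative error of a stored binary value against a rational."""
    return abs(Fraction(stored) - exact) / exact

exact = Fraction(1, 10)                              # the real number 0.1
err_double = relative_error(0.1, exact)              # 53-bit significand
err_single = relative_error(to_single(0.1), exact)   # 24-bit significand

print(f"single: {float(err_single):.2e}  (bound 2^-24 = {2.0**-24:.2e})")
print(f"double: {float(err_double):.2e}  (bound 2^-53 = {2.0**-53:.2e})")
```

Both errors stay below their bounds, and the single-precision error is larger by many orders of magnitude. That gap is what a mixed-precision variant trades for speed.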
Motivation
• Double precision is ubiquitous
  o Necessary for some computations
  o Lack of easy-to-use techniques for reasoning about precision
• Single precision is preferable
  o Faster computation
    o Tesla K20X: 3.95 TFlops (singles) vs. 1.31 TFlops (doubles)
    o Intel Xeon Phi: 2.15 TFlops (singles) vs. 1.07 TFlops (doubles)
    o Standard CPUs: 2x operations w/ SSE vector operations
  o Reduced memory pressure
    o Up to 50% footprint reduction
    o Data movement is a bottleneck for some domains
Desire: Balance speed (singles) with accuracy (doubles)
Mixed Precision
Mixed-precision linear solver algorithm:
 1: LU ← PA
 2: solve Ly = Pb
 3: solve Ux_0 = y
 4: for k = 1, 2, ... do
 5:   r_k ← b - Ax_{k-1}
 6:   solve Ly = Pr_k
 7:   solve Uz_k = y
 8:   x_k ← x_{k-1} + z_k
 9:   check for convergence
10: end for
Steps 5 and 8 (red text on the original slide) are performed in double precision; all other steps are single-precision.
• Use double precision where necessary
• Use single precision where possible
• Nearly 2x speedups [Baboulin2008]
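The solver algorithm on this slide can be sketched in plain Python. This is a minimal illustration rather than the implementation measured in [Baboulin2008]. Single precision is emulated by rounding every intermediate result with a hypothetical `f32` helper. The function names are assumptions. The sketch also assumes that the double-precision steps are the residual (step 5) and the update (step 8), as in standard iterative refinement. The convergence test (step 9) is applied to the residual before each correction solve.

```python
import struct

def f32(x):
    """Round a double to the nearest single-precision value (emulation)."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def lu_factor_f32(A):
    """Step 1: LU <- PA with partial pivoting, rounded to single."""
    n = len(A)
    LU = [[f32(v) for v in row] for row in A]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(LU[i][k]))
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] = f32(LU[i][k] / LU[k][k])
            for j in range(k + 1, n):
                LU[i][j] = f32(LU[i][j] - f32(LU[i][k] * LU[k][j]))
    return LU, perm

def lu_solve_f32(LU, perm, b):
    """Solve Ly = Pb, then Ux = y, in emulated single precision."""
    n = len(LU)
    y = [f32(b[p]) for p in perm]
    for i in range(n):                       # forward substitution (unit L)
        for j in range(i):
            y[i] = f32(y[i] - f32(LU[i][j] * y[j]))
    x = y[:]
    for i in reversed(range(n)):             # back substitution
        for j in range(i + 1, n):
            x[i] = f32(x[i] - f32(LU[i][j] * x[j]))
        x[i] = f32(x[i] / LU[i][i])
    return x

def refine(A, b, tol=1e-12, max_iter=20):
    n = len(A)
    LU, perm = lu_factor_f32(A)                        # step 1 (single)
    x = lu_solve_f32(LU, perm, b)                      # steps 2-3 (single)
    for _ in range(max_iter):                          # step 4
        r = [b[i] - sum(A[i][j] * x[j] for j in range(n))
             for i in range(n)]                        # step 5 (double)
        if max(abs(v) for v in r) < tol:               # step 9
            break
        z = lu_solve_f32(LU, perm, r)                  # steps 6-7 (single)
        x = [x[i] + z[i] for i in range(n)]            # step 8 (double)
    return x

A = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
b = [1.0, 2.0, 3.0]
print(refine(A, b))   # exact solution is [-3/47, 15/47, 22/47]
```

The factorization and the triangular solves run in single precision. Only the residual and the update run in double, and for this well-conditioned example the result is still accurate to roughly double precision.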
Our Goal
Use automated analysis techniques to prototype mixed-precision variants and provide insight about a program’s precision level requirements.
Framework
CRAFT: Configurable Runtime Analysis for Floating-point Tuning
• Static binary instrumentation
  o Parse binary on disk
  o Replace or augment floating-point instructions with new code
  o Rewrite modified binary
• Dynamic analysis
  o Run modified program on representative data set
  o Produce results and recommendations
Previous Work
• Cancellation detection [WHIST’11]
  o Reports loss of precision due to subtraction
  o Provides insight regarding numerical behavior
• Range tracking
  o Reports per-instruction min/max values
  o Provides insight regarding low dynamic ranges
• Mixed-precision variants
  o Replaces double-precision instructions and operands
  o Provides insight regarding precision-level sensitivity
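One way to picture cancellation detection is to compare binary exponents before and after a subtraction. The sketch below is an assumption about how such a check could work, since the slide does not give the mechanism. The function name `cancelled_bits` is hypothetical, and a zero result is arbitrarily reported as the loss of all 53 significand bits.

```python
import math

def cancelled_bits(a: float, b: float) -> int:
    """Estimate the leading binary digits cancelled when computing a - b."""
    if a == 0.0 or b == 0.0:
        return 0                        # nothing to cancel against
    result = a - b
    if result == 0.0:
        return 53                       # every significand bit cancelled
    exp_in = max(math.frexp(a)[1], math.frexp(b)[1])
    exp_out = math.frexp(result)[1]
    return max(0, exp_in - exp_out)     # drop in exponent = digits lost

print(cancelled_bits(1.0000001, 1.0))   # nearly equal operands
print(cancelled_bits(3.0, -1.0))        # effectively an addition
```

A large count flags a subtraction whose result keeps few significant digits of its operands. That is the kind of per-instruction report the first bullet describes.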
Implementation
• In-place replacement
  o Narrowed focus: doubles → singles
  o In-place downcast conversion
  o Flag in the high bits to indicate replacement
[Figure: a 64-bit double is converted by an in-place downcast; the replaced double carries the flag 0x7FF4DEAD (labelled “Non-signalling NaN”) in its high 32 bits and the single-precision value in its low 32 bits]
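The in-place encoding can be illustrated at the bit level. The sketch below models a 64-bit memory slot as a Python integer. The flag value 0x7FF4DEAD comes from the slide, while the helper names (`double_bits`, `replace_in_place`, `is_replaced`, `single_value`) are hypothetical.

```python
import struct

FLAG = 0x7FF4DEAD  # high 32 bits that mark a replaced double (from the slide)

def double_bits(x: float) -> int:
    """Raw 64-bit pattern of a double."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def replace_in_place(bits: int) -> int:
    """Downcast the double stored in `bits` and flag the slot as replaced."""
    value = struct.unpack("<d", struct.pack("<Q", bits))[0]
    single = struct.unpack("<I", struct.pack("<f", value))[0]
    return (FLAG << 32) | single        # flag in high word, single in low word

def is_replaced(bits: int) -> bool:
    """True if the high word carries the replacement flag."""
    return (bits >> 32) == FLAG

def single_value(bits: int) -> float:
    """Read the single-precision value from the low 32 bits."""
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

slot = replace_in_place(double_bits(1.5))
print(hex(slot), is_replaced(slot), single_value(slot))
```

Because the flag pattern is a NaN when the slot is read as a double, a value that escapes into unmodified double-precision code shows up as a NaN instead of a plausible wrong number.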
Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar
1  movsd 0x601e38(%rax, %rbx, 8) → %xmm0
2  mulsd -0x78(%rsp) * %xmm0 → %xmm0
3  addsd -0x4f02(%rip) + %xmm0 → %xmm0
4  movsd %xmm0 → 0x601e38(%rax, %rbx, 8)
Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar
1  movsd 0x601e38(%rax, %rbx, 8) → %xmm0
2  mulss -0x78(%rsp) * %xmm0 → %xmm0
3  addss -0x4f02(%rip) + %xmm0 → %xmm0
4  movsd %xmm0 → 0x601e38(%rax, %rbx, 8)
Example
gvec[i,j] = gvec[i,j] * lvec[3] + gvar
1  movsd 0x601e38(%rax, %rbx, 8) → %xmm0
   check/replace -0x78(%rsp) and %xmm0
2  mulss -0x78(%rsp) * %xmm0 → %xmm0
   check/replace -0x4f02(%rip) and %xmm0
3  addss -0x4f02(%rip) + %xmm0 → %xmm0
4  movsd %xmm0 → 0x601e38(%rax, %rbx, 8)
Replacement Code
    push %rax
    push %rbx
    <for each input operand>
        <copy input into %rax>
        mov  %rbx, 0xffffffff00000000
        and  %rax, %rbx              # extract high word
        mov  %rbx, 0x7ff4dead00000000
        test %rax, %rbx              # check for flag
        je   next                    # skip if replaced
        <copy input into %rax>
        cvtsd2ss %rax, %rax          # down-cast value
        or   %rax, %rbx              # set flag
        <copy %rax back into input>
    next:
        <next operand>
    pop %rbx
    pop %rax
    <replaced instruction>           # e.g. addsd => addss
Dyninst
• Binary analysis framework
  o Parses executable files (InstructionAPI & ParseAPI)
  o Inserts instrumentation (DyninstAPI)
  o Supports full binary modification (PatchAPI)
  o Rewrites binary executable files (SymtabAPI)
dyninst.org
Block Editing
[Figure labels: original instruction in block; block splits; initialization; check/replace; double → single conversion]
Overhead

| Benchmark (name.CLASS) | Average Overhead |
|---|---|
| bt.A | 50.6X |
| cg.A | 6.1X |
| ep.A | 13.8X |
| ft.A | 10.1X |
| lu.A | 28.5X |
| mg.A | 14.0X |
| sp.A | 19.5X |
Binary Editing
[Figure: CRAFT (“mutator”) takes the original double-precision binary (“mutatee”) and a mixed-precision configuration (parser & GUI) and produces a modified mixed-precision binary]
Configuration
Automated Search
• Manual mixed-precision replacement
  o Hard to use without intuition regarding potential replacements
• Automatic mixed-precision analysis
  o Try lots of configurations (empirical auto-tuning)
  o Test with user-defined verification routine and data set
  o Exploit program control structure: replace larger structures (modules, functions) first
  o If coarse-grained replacements fail, try finer-grained subcomponent replacements
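The coarse-to-fine strategy can be sketched as a breadth-first walk over the program structure. This is an illustrative outline, not CRAFT's search code. The program is modelled as a hypothetical `children` mapping (module → functions → instructions), and `verify` stands in for running the modified binary through the user-defined verification routine.

```python
from collections import deque

def instructions_under(node, children):
    """All leaf instructions contained in a module, function, or instruction."""
    kids = children.get(node, [])
    if not kids:
        return {node}
    found = set()
    for kid in kids:
        found |= instructions_under(kid, children)
    return found

def breadth_first_search(root, children, verify):
    """Replace whole structures first; descend only where that fails."""
    accepted, tested = set(), 0
    queue = deque([root])
    while queue:
        node = queue.popleft()
        candidate = instructions_under(node, children)
        tested += 1
        if verify(candidate):                        # coarse replacement passed
            accepted |= candidate
        else:                                        # try finer-grained pieces
            queue.extend(children.get(node, []))
    return accepted, tested

# Toy program: one module, two functions, one precision-sensitive instruction.
children = {"app": ["f", "g"], "f": ["f:1", "f:2"], "g": ["g:1", "g:2"]}
sensitive = {"g:2"}
verify = lambda replaced: not (replaced & sensitive)

accepted, tested = breadth_first_search("app", children, verify)
print(sorted(accepted), tested)
```

In the toy run the whole-module replacement fails, function `f` passes as a unit, and only `g` is split into individual instructions. A real search would also need to re-verify the union of accepted pieces, since replacements that pass individually may fail together.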
System Overview
Example Results
NAS Results

| Benchmark (name.CLASS) | Candidate Instructions | Configurations Tested | Instructions Replaced (% Static) | Instructions Replaced (% Dynamic) |
|---|---|---|---|---|
| bt.W | 6,647 | 3,854 | 76.2 | 85.7 |
| bt.A | 6,682 | 3,832 | 75.9 | 81.6 |
| cg.W | 940 | 270 | 93.7 | 6.4 |
| cg.A | 934 | 229 | 94.7 | 5.3 |
| ep.W | 397 | 112 | 93.7 | 30.7 |
| ep.A | 397 | 113 | 93.1 | 23.9 |
| ft.W | 422 | 72 | 84.4 | 0.3 |
| ft.A | 422 | 73 | 93.6 | 0.2 |
| lu.W | 5,957 | 3,769 | 73.7 | 65.5 |
| lu.A | 5,929 | 2,814 | 80.4 | 69.4 |
| mg.W | 1,351 | 458 | 84.4 | 28.0 |
| mg.A | 1,351 | 456 | 84.1 | 24.4 |
| sp.W | 4,772 | 5,729 | 36.9 | 45.8 |
| sp.A | 4,821 | 5,044 | 51.9 | 43.0 |
AMGmk Results
• Algebraic MultiGrid microkernel
  o Multigrid method is iterative and highly adaptive
  o Good candidate for replacement
• Automatic search
  o Complete conversion (100% replacement)
• Manually-rewritten version
  o Speedup: 175 sec to 95 sec (1.8X)
  o Conventional x86_64 hardware
SuperLU Results
• Package for LU decomposition and linear solves
  o Reports final error residual (useful for thresholding)
  o Both single- and double-precision versions
• Verified manual conversion via automatic search
  o Used error from provided single-precision version as threshold
  o Final config matched single-precision profile (99.9% replacement)

| Threshold | Instructions Replaced (% Static) | Instructions Replaced (% Dynamic) | Final Error |
|---|---|---|---|
| 1.0e-03 | 99.1 | 99.9 | 1.59e-04 |
| 1.0e-04 | 94.1 | 87.3 | 4.42e-05 |
| 7.5e-05 | 91.3 | 52.5 | 4.40e-05 |
| 5.0e-05 | 87.9 | 45.2 | 3.00e-05 |
| 2.5e-05 | 80.3 | 26.6 | 1.69e-05 |
| 1.0e-05 | 75.4 | 1.6 | 7.15e-07 |
| 1.0e-06 | 72.6 | 1.6 | 4.77e-07 |
Future Work
• Memory-based analysis
• Case studies
• Search optimization
Conclusion
Automated binary modification can build prototype mixed-precision program variants.
Automated search can provide insight to focus mixed-precision implementation efforts.
Thank you!
sf.net/p/crafthpc