IBM Research
© 2014 IBM Corporation
Predicting GPU Performance from CPU Runs Using Machine Learning
IBM T. J. Watson Research Center
Yorktown Heights, NY USA
Ioana Baldini, Stephen Fink, Erik Altman
Pitfalls: Porting costs time/money
Sometimes GPU speedup is not worth the effort
To exploit GPGPU acceleration … need to port code to the GPGPU programming model
Problem: Can a non-expert predict whether a data-parallel application is likely to perform well on a GPU, without incurring the costs of porting?
Maybe this is easy?
Suppose we already have a data-parallel CPU version, e.g. OpenMP
Hypothesis: Since data parallelism is data parallelism,
• SMP speedup predicts GPU speedup
[Figure: plot with axes "OpenMP speedup" and "GPU speedup", each ranging from 0 to 9]
What is the correlation between OpenMP and GPU performance?
ρ = 0.146
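A minimal sketch of how a correlation figure like this could be computed from paired speedup measurements. It assumes ρ denotes Pearson's correlation coefficient (the slide does not say which coefficient was used), and the speedup values are made-up placeholders, not the measurements behind ρ = 0.146.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical per-benchmark speedups over single-thread CPU (illustrative only).
openmp_speedup = [7.8, 8.1, 6.9, 7.5, 8.4, 7.2]
gpu_speedup = [0.6, 5.2, 1.1, 8.7, 0.9, 3.4]

rho = pearson(openmp_speedup, gpu_speedup)
print(f"rho = {rho:.3f}")
```

A value near 0 means the OpenMP speedup carries little linear information about the GPU speedup, which is what motivates looking for other features.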
Selected results, OpenMP vs. GPU
• Similar scaling on CPU/OpenMP
• GPU results vary based on device/code
[Figure: speedup vs. single-thread CPU]
GPU performance prediction: traditional approaches
Detailed analytic model of algorithm
Detailed model of GPU architecture
Accurate Performance Models +
• Relatively simple code structures
• Reasoning about non-trivial transformations
• Expert knowledge
• Complex mappings
• Device-specific models
Desired Solution: Fully automatic, for non-experts
• No static analysis
• No detailed architectural models
• Automatically tune for new device
Our Approach: Use machine learning to learn from past experiences porting sequential code to GPUs
Input:
• corpus of past ports from sequential (C) to GPU/parallel code
• features from sequential code
Output:
• model of GPU speedup
Notes:
• Does not explicitly model specific transformations or architectural detail
  - Unlikely to produce highly accurate model, but rough estimate may be useful
• Input to model includes human expert knowledge embodied in previous ports
System architecture
Application Features – Dynamic Instrumentation (Pin)
| Category | Feature | Mnemonic |
|---|---|---|
| Computation | Arithmetic and logic instructions | ALU |
| Computation | SIMD-based instructions | SIMD |
| Memory | Memory loads | LD |
| Memory | Memory stores | ST |
| Memory | Memory fences | FENCE |
| Control Flow | Conditional and unconditional branches | BR |
| OpenMP | Speedup of 12 threads over sequential execution | OMP |
| Aggregates | Total number of instructions | TOTAL |
| Aggregates | Ratio of computation over memory | ALU-MEM |
| Aggregates | Ratio of computation over GPU communication | ALU-COMM |
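As a rough illustration of how the aggregate features might be derived from raw counts, here is a sketch under stated assumptions. The slides do not give formulas, so ALU-MEM is assumed to be ALU / (LD + ST) and ALU-COMM is assumed to be ALU divided by the bytes moved between host and GPU. `RawCounts` and `build_features` are hypothetical names, and the counts would in practice come from a Pin tool that is not shown here.

```python
from dataclasses import dataclass

@dataclass
class RawCounts:
    """Dynamic instruction counts for one run (hypothetical container)."""
    alu: int
    simd: int
    ld: int
    st: int
    fence: int
    br: int
    total: int
    comm_bytes: int     # assumed: data moved between host and GPU
    seq_time_s: float   # sequential run time
    omp_time_s: float   # run time with 12 OpenMP threads

def build_features(c: RawCounts) -> dict:
    """Map raw counts to the feature mnemonics listed in the table."""
    mem = c.ld + c.st
    return {
        "ALU": c.alu, "SIMD": c.simd, "LD": c.ld, "ST": c.st,
        "FENCE": c.fence, "BR": c.br, "TOTAL": c.total,
        "OMP": c.seq_time_s / c.omp_time_s,
        # Assumed definitions of the two ratio features:
        "ALU-MEM": c.alu / mem if mem else float("inf"),
        "ALU-COMM": c.alu / c.comm_bytes if c.comm_bytes else float("inf"),
    }

# Made-up counts, for illustration only.
example = RawCounts(alu=600, simd=50, ld=200, st=100, fence=0, br=50,
                    total=1000, comm_bytes=300, seq_time_s=12.0, omp_time_s=1.5)
features = build_features(example)
print(features)
```

Whether the original system fed raw counts or counts normalized by TOTAL into the classifiers is not stated in the slides.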
Supervised Learning: [WEKA 3] Binary/Ternary classifiers
Source: [Martin 95]
• Nearest Neighbors with Generalized Exemplars (NNGE)
• Support Vector Machines (SVM)
Methodology
Runs
• 48 runs total
• Cross validation:
  - Leave one out
  - Leave two out
Benchmarks
• Parboil 2.0
• Rodinia 2.2
System
• Dual-processor Intel Xeon X5690
• 12 cores total @ 3.47 GHz
• ATI FirePro v9800
• Nvidia Tesla C2050
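The leave-one-out procedure can be sketched as follows. This is not the WEKA setup used for the results: a plain 1-nearest-neighbor classifier stands in for NNGE/SVM to keep the example self-contained, and the feature vectors and labels are invented.

```python
import math

def nearest_neighbor_predict(train, query):
    """Return the label of the training sample closest to `query` (Euclidean)."""
    _, label = min(train, key=lambda s: math.dist(s[0], query))
    return label

def leave_one_out_accuracy(samples):
    """samples: list of (feature_vector, label). Hold each one out in turn,
    train on the rest, and count how often the held-out label is predicted."""
    correct = 0
    for i, (features, label) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        if nearest_neighbor_predict(train, features) == label:
            correct += 1
    return correct / len(samples)

# Hypothetical 2-feature samples; the label is True when "GPU Speedup > 1".
samples = [
    ((0.9, 0.1), True), ((0.8, 0.2), True), ((0.85, 0.15), True),
    ((0.2, 0.7), False), ((0.1, 0.9), False), ((0.15, 0.8), False),
]
accuracy = leave_one_out_accuracy(samples)
print(accuracy)
```

Leave-two-out follows the same pattern, holding out every pair of runs instead of every single run (for example by iterating over `itertools.combinations(range(len(samples)), 2)`).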
Learning whether GPU acceleration is beneficial
Binary classifier: “GPU Speedup > 1”
[Figure: Classifier Accuracy]
Features Selected

| Device | SVM | NNGE |
|---|---|---|
| Tesla | ALU, LD, BR, TOTAL | ALU, LD, ST, ALU-MEM, BR, TOTAL |
| FirePro | ALU, LD, BR, TOTAL | ALU, LD, BR, TOTAL, OMP |
Learning the speedup factor of GPU execution (SVM)
5 Binary classifiers:
• “GPU Speedup > 1”
• “GPU Speedup > 2”
• “GPU Speedup > 3”
• “GPU Speedup > 4”
• “GPU Speedup > 5”
[Figure: Classifier Accuracy]
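A sketch of how the five threshold classifiers relate to a single speedup estimate. The training labels follow directly from the thresholds. The rule for combining the five predictions into one class is an assumption made for illustration, since the slides do not describe one.

```python
THRESHOLDS = (1, 2, 3, 4, 5)

def threshold_labels(speedup):
    """One boolean training label per classifier: is speedup > k?"""
    return {k: speedup > k for k in THRESHOLDS}

def speedup_class(predictions):
    """Combine per-threshold predictions into a class 0..5.

    Assumed rule: the class is the largest k such that every classifier
    up to and including k predicts True (0 if "Speedup > 1" is False).
    """
    cls = 0
    for k in THRESHOLDS:
        if not predictions[k]:
            break
        cls = k
    return cls

labels = threshold_labels(3.4)   # a measured speedup of 3.4x (made up)
print(labels)
print(speedup_class(labels))
```

Stopping at the first False keeps the result well defined even when independent classifiers disagree, for example when "> 2" is predicted False but "> 3" is predicted True.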
![Page 14: Predicting GPU Performance from CPU Runs Using Machine … · 2017. 11. 29. · Pitfalls: Porting costs time/money ... Correct 39 of 46 runs. Effective accuracy = 91% Classifer prediction](https://reader036.vdocuments.site/reader036/viewer/2022071007/5fc4424eeeb215620936f637/html5/thumbnails/14.jpg)
IBM Research
© 2014 IBM Corporation 14
Predicting the Best Device for OpenCL runs
Which of 3 devices is best for a particular program?
• 3-way classifier using NNGE: “CPU”, “Tesla”, “FirePro”
Correct 39 of 46 runs. Effective accuracy = 91%
[Figure: Classifier prediction; axis: Normalized Performance]
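Note that 39 of 46 is about 85% as a raw hit rate, so the 91% "effective accuracy" must be measured differently. The slides do not define it. One plausible reading, consistent with the "Normalized Performance" axis, is the average performance obtained on the predicted device relative to the best device. The sketch below illustrates that assumed definition with invented measurements.

```python
def best_device(perf):
    """perf maps device name -> performance (higher is better)."""
    return max(perf, key=perf.get)

def raw_accuracy(runs, predicted):
    """Fraction of runs where the predicted device is the best one."""
    hits = sum(best_device(p) == d for p, d in zip(runs, predicted))
    return hits / len(runs)

def mean_normalized_performance(runs, predicted):
    """Average of (performance on predicted device) / (performance on best).
    This is an assumed definition, not one taken from the slides."""
    ratios = [p[d] / max(p.values()) for p, d in zip(runs, predicted)]
    return sum(ratios) / len(ratios)

# Hypothetical measurements for three runs (illustrative only).
runs = [
    {"CPU": 1.0, "Tesla": 4.0, "FirePro": 3.0},
    {"CPU": 2.0, "Tesla": 1.0, "FirePro": 1.6},
    {"CPU": 1.0, "Tesla": 5.0, "FirePro": 4.5},
]
predicted = ["Tesla", "CPU", "FirePro"]  # the last prediction is a near miss

print(raw_accuracy(runs, predicted))
print(mean_normalized_performance(runs, predicted))
print(39 / 46)  # raw hit rate reported on the slide
```

Under this reading a misprediction that still lands on a nearly-as-fast device costs little, which would explain an effective figure above the raw hit rate.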
Summary of Results
• Classifiers are ~80% effective at deciding the GPU speedup class (“GPU Speedup > k”) for k = 1, …, 5
• A small set of features (ALU, LD, BR) is sufficient
• Predictions are effective for heterogeneous scheduling problem
Bottom Line
• Is 80% classifier accuracy good enough? Useful?
• Machine Learning doesn’t illuminate cause-and-effect
• Early days – results are promising, but much more exploration needed
Backup
Related Work
– Jia et al. [14] used regression techniques to predict the results of simulations of GPU architectures.
– Grewe et al. [7], [8] use machine learning to help split code between cores of a CPU and a GPU.
– GROPHECY [23] predicts GPU performance for loops.
  • Uses a detailed analytic model with pattern-matching static analysis.
– Kremlin [6] provides a methodology to discover opportunities for parallelization in sequential code.
  • Primarily by deriving upper bounds on potential parallelism from a dynamic critical path measurement.
– Parallel Prophet [16] estimates parallel performance in the presence of memory contention.
  • Based on instrumented, annotated serial programs.
– Hong et al. [11] model GPU performance from GPU memory access patterns and computational density per memory request.
  • It requires a detailed model of the code to run on the device.