Technische Universität München
On the Importance of Thread Placement on
Multicore Architectures
Tobias Klug
HPCLatAm 2011
Keynote
Cordoba, Argentina
August 31, 2011
Motivation: Many possibilities…
… can lead to non-deterministic runtimes...
... but don't have to
The autopin Approach
• User-level tool
• Start multi-threaded application under autopin control
• User can specify pinnings of interest
• Pin threads to cores
• Assess performance of chosen pinning using performance counters
• Try alternative pinnings until optimal pinning is found
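On Linux, pinning a thread to a core comes down to setting its CPU affinity mask. The following is a minimal sketch, assuming a Linux host and using Python's `os.sched_setaffinity`; the helper name `pin_to_cpu` is hypothetical and is not part of autopin.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the caller to a single CPU and return the resulting mask.

    Hypothetical helper for illustration. Returns None on platforms
    that do not provide sched_setaffinity (anything but Linux).
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {cpu})   # pid 0 refers to the caller
    return os.sched_getaffinity(0)


if hasattr(os, "sched_getaffinity"):
    allowed = os.sched_getaffinity(0)    # CPUs we may currently run on
    target = min(allowed)                # pick any CPU from the allowed set
    print("pinned to:", pin_to_cpu(target))
    os.sched_setaffinity(0, allowed)     # restore the original mask
```

A tool that tries several pinnings would call such a helper once per thread and per candidate placement.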
Performance Counters
• Multiple Event Sensors
– ALU Utilization
– Branch Prediction
– Cache Events (L1/L2/TLB)
– Bus Utilization
• Two Uses:
– Read: Get Precise Count of Events in Code Regions => Counting
– Interrupt on Overflow => Statistical Sampling
• Well-known tools:
– OProfile
– perfctr
– Intel VTune
– perfmon2
perfmon2
• Kernel patch + library (libpfm)
• Generic interface for PMU access
• Portable: implementations for IA32, x64, IA64, MIPS, Power
• Allows for per-thread and system-wide monitoring
• Support for counting and sampling
• pfmon:
– attach to running threads
– fork new processes and attach to them
– fully exploit performance counters
Algorithm
init_autopin(pinningList, initTime, program);
for (i = 0; i < numOfPinnings; i++) {
    pinThreads(pinningList[i]);
    runThreads(warmupTime);
    p1 = readPerformanceCounters();
    runThreads(sampleTime);
    p2 = readPerformanceCounters();
    performanceRate[i] = (p2 - p1) / sampleTime;
}
pinThreads(bestPinning);
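The loop above can be sketched in runnable form. This is an illustration under stated assumptions, not autopin's implementation: the pinning, run, and counter-read operations are passed in as callables, the best pinning is assumed to be the one with the highest measured event rate, and `FakeMachine` is a hypothetical stand-in whose counter advances at a fixed rate per pinning.

```python
def autopin(pinnings, pin_threads, run_threads, read_counter,
            warmup_time, sample_time):
    """Measure each candidate pinning and settle on the fastest one."""
    if not pinnings:
        raise ValueError("need at least one candidate pinning")
    rates = []
    for pinning in pinnings:
        pin_threads(pinning)
        run_threads(warmup_time)          # let caches settle after re-pinning
        p1 = read_counter()
        run_threads(sample_time)
        p2 = read_counter()
        rates.append((p2 - p1) / sample_time)   # events per second
    # Assumption: a higher event rate (e.g. retired instructions) is better.
    best = max(range(len(pinnings)), key=lambda i: rates[i])
    pin_threads(pinnings[best])
    return pinnings[best], rates


class FakeMachine:
    """Hypothetical stand-in for real hardware, used only for the demo."""

    def __init__(self, rate_by_pinning):
        self.rate_by_pinning = rate_by_pinning
        self.current = None
        self.counter = 0.0

    def pin(self, pinning):
        self.current = pinning

    def run(self, seconds):
        self.counter += self.rate_by_pinning[self.current] * seconds

    def read(self):
        return self.counter


machine = FakeMachine({(0, 1): 1.0e9, (0, 2): 1.5e9, (0, 4): 1.2e9})
best, rates = autopin(list(machine.rate_by_pinning), machine.pin,
                      machine.run, machine.read,
                      warmup_time=1.0, sample_time=10.0)
print("best pinning:", best)
```

Passing the three operations as callables keeps the search loop independent of how counters are read, which is in the spirit of the modular design mentioned in the outlook.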
NUMA automatic page migration
Automatic Page Migration
• Kernel patch from Lee Schermerhorn
• Thread moves to new NUMA node:
– remove PTE references of “old” NUMA node
– pages are now unmapped
• Next access to page causes page fault
• Modified kernel routines pull the page to the local node (migrate on fault)
• Update PTE
• Controlled via cpusets
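The migrate-on-fault sequence can be illustrated with a toy model. This is only a sketch of the bookkeeping described above, not kernel code; the `Page` and `Task` classes and their fields are hypothetical names introduced for the example.

```python
class Page:
    def __init__(self, node):
        self.node = node      # NUMA node that currently holds the page
        self.mapped = True    # whether a PTE currently references the page


class Task:
    def __init__(self, node, num_pages):
        self.node = node
        self.pages = [Page(node) for _ in range(num_pages)]
        self.migrations = 0

    def move_to(self, new_node):
        """Thread moves: drop PTE references to pages on other nodes."""
        self.node = new_node
        for page in self.pages:
            if page.node != new_node:
                page.mapped = False

    def access(self, index):
        """Touch a page; an unmapped page faults and is pulled local."""
        page = self.pages[index]
        if not page.mapped:                # page fault
            if page.node != self.node:
                page.node = self.node      # migrate on fault
                self.migrations += 1
            page.mapped = True             # update PTE
        return page.node


task = Task(node=0, num_pages=4)
task.move_to(1)
task.access(0)
task.access(2)
print([p.node for p in task.pages], task.migrations)
```

Only pages that are actually touched after the move get migrated; untouched pages stay on the old node, which keeps the cost proportional to the working set.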
Experimental Setup
• Caneland:
– Intel Tigertown: Quad-Core, 2x4MB L2/socket, 2.93GHz clock rate
– 4-way, 4x1066MHz FSB, 64MB snoop filter, UMA
• Clovertown:
– Intel Clovertown: Quad-Core, 2x4MB L2, 2.66GHz clock rate
– 2-way, 2x1333MHz FSB, UMA
• Barcelona:
– AMD K10: Quad-Core, 4x512kB L2, 1x2MB L3, 1.9GHz clock rate
– 2-way, 1000MHz HyperTransport, NUMA
• Linux Kernel 2.6.23 with perfmon2 patches
• Intel Compiler Suite
Evaluation
• SPEC OMP
– Benchmark consists of real scientific applications
– OpenMP
– PC: INSTRUCTIONS_RETIRED
– Several multicore architectures examined
• Memory Throughput
• MPI
• Electric power consumption
SPEC OMP
Benchmark     Description
310.wupwise   quantum chromodynamics
312.swim      shallow water modeling
314.mgrid     multi-grid solver in 3D potential field
316.applu     parabolic/elliptic partial differential equations
318.galgel    fluid dynamics analysis of oscillatory instability
320.equake    finite element simulation of earthquake modeling
324.apsi      weather prediction
326.gafort    genetic algorithm code
328.fma3d     finite-element crash simulation
330.art       neural network simulation of adaptive resonance theory
332.ammp      computational chemistry
Caneland
Caneland: Runtimes (10s sample time)
Caneland: Runtimes (30s sample time)
Results
[Table: rows 310.wupwise, 312.swim, 314.mgrid, 316.applu, 320.equake, 324.apsi, 328.fma3d, 330.art, 332.ammp; columns Caneland (2, 4, 8), Clovertown (2, 4), Barcelona (2, 4); cell entries not recoverable]
Clovertown
Barcelona
Results (w/o NUMA patch)
[Table: rows 310.wupwise, 312.swim, 314.mgrid, 316.applu, 320.equake, 324.apsi, 328.fma3d, 330.art, 332.ammp; columns Caneland (2, 4, 8), Clovertown (2, 4), Barcelona (2, 4); cell entries not recoverable]
Results (with NUMA patch)
[Table: rows 310.wupwise, 312.swim, 314.mgrid, 316.applu, 320.equake, 324.apsi, 328.fma3d, 330.art, 332.ammp; columns Caneland (2, 4, 8), Clovertown (2, 4), Barcelona (2, 4); cell entries not recoverable]
Results SPEC OMP
• Optimal pinning found in all but 2 cases
– in those cases, autopin's alternative was less than 5% slower
• Overhead less than 3% on UMA platform
• Overhead less than 7.5% on NUMA platform (kernel-level page migration)
Barcelona
Memory Bandwidth
• STREAM (John McCalpin)
• synthetic benchmark
• copy and computation operations on large FP arrays
• Data reuse avoided
• Kernels:
– copy: a[i] = b[i]
– scale: a[i] = q*b[i]
– sum: a[i] = b[i] + c[i]
– triad: a[i] = b[i] + q*c[i]
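The four kernels are simple enough to write out directly. The sketch below is not the STREAM benchmark itself: it uses plain Python lists, so the timing mostly reflects interpreter overhead rather than memory bandwidth. The bytes-per-element figures follow the usual STREAM accounting for 8-byte values (16 bytes for copy and scale, 24 for sum and triad); the function names are chosen for this example.

```python
import time


def copy(b):
    return [x for x in b]                        # a[i] = b[i]


def scale(b, q):
    return [q * x for x in b]                    # a[i] = q*b[i]


def vector_sum(b, c):
    return [x + y for x, y in zip(b, c)]         # a[i] = b[i] + c[i]


def triad(b, c, q):
    return [x + q * y for x, y in zip(b, c)]     # a[i] = b[i] + q*c[i]


# Bytes moved per array element, counting one 8-byte read or write per operand.
BYTES_PER_ELEMENT = {"copy": 16, "scale": 16, "sum": 24, "triad": 24}


def bandwidth_mb_per_s(kernel_name, n, seconds):
    """Convert a kernel run over n elements into MB/s (1 MB = 10**6 bytes)."""
    return BYTES_PER_ELEMENT[kernel_name] * n / seconds / 1e6


n = 200_000
b = [1.0] * n
c = [2.0] * n
start = time.perf_counter()
a = triad(b, c, 3.0)
elapsed = time.perf_counter() - start
if elapsed > 0:
    print("triad: %.1f MB/s" % bandwidth_mb_per_s("triad", n, elapsed))
```

In a real measurement the arrays must be much larger than the last-level cache, so that every element is fetched from memory.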
Clovertown
PET (Positron Emission Tomography)
• Nuclear medicine imaging
• Visualizes functional processes
(e.g. tumor diagnostics)
• fixed detector ring around patient
• radioisotopes injected into body
• Positron annihilates with an electron → 2 photons emitted about 180 degrees apart
• coincidence circuit
Image: Wikipedia
Image Reconstruction
g = A f
g: known measurement vector
f: unknown image vector
A: system matrix (describes characteristics of detector ring)
MLEM (maximum likelihood expectation maximization) iteratively approximates the solution of this linear system
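A minimal sketch of the standard MLEM update for g = A f, using a dense matrix stored as nested lists. This is the textbook iteration, not the reconstruction code used in the talk; the function name `mlem`, the uniform starting image, and the tiny 2x2 system are assumptions made for the example.

```python
def mlem(A, g, iterations=50):
    """Textbook MLEM: f_j <- f_j / s_j * sum_i A[i][j] * g[i] / (A f)[i]."""
    m, n = len(A), len(A[0])
    # Sensitivity s_j: column sums of the system matrix.
    sens = [sum(A[i][j] for i in range(m)) for j in range(n)]
    f = [1.0] * n                                # uniform initial image
    for _ in range(iterations):
        # Forward projection of the current image estimate.
        proj = [sum(A[i][j] * f[j] for j in range(n)) for i in range(m)]
        # Ratio of measured to estimated data (0 where nothing is projected).
        ratio = [g[i] / proj[i] if proj[i] > 0 else 0.0 for i in range(m)]
        # Back projection of the ratio, normalized by the sensitivity.
        f = [f[j] / sens[j] * sum(A[i][j] * ratio[i] for i in range(m))
             if sens[j] > 0 else 0.0
             for j in range(n)]
    return f


A = [[1.0, 1.0],
     [1.0, 0.0]]
f_true = [2.0, 3.0]
g = [sum(A[i][j] * f_true[j] for j in range(2)) for i in range(2)]
estimate = mlem(A, g, iterations=200)
print(estimate)
```

In a real PET reconstruction A is very large and sparse, so the forward and back projections dominate the runtime.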
Clovertown
Conclusion and Outlook
• Pinning is essential on multicore systems
• Will become even more important on many-core architectures
• Tools can reliably find optimal pinnings on UMA and NUMA architectures
• Outlook – autopin2
– new design: perf performance counters subsystem
– flexible and modular design:
  • perfmon, perf
  • Energy, runtime, user-defined objective functions
  • Back channel from application to autopin2
Questions?