Make HPC Easy with Domain-Specific Languages and High-Level Frameworks
Biagio Cosenza, Ph.D., DPS Group, Institut für Informatik, Universität Innsbruck, Austria


  • Slide 1
  • Make HPC Easy with Domain-Specific Languages and High-Level Frameworks. Biagio Cosenza, Ph.D., DPS Group, Institut für Informatik, Universität Innsbruck, Austria
  • Slide 2
  • HPC Seminar at FSP Scientific Computing, Innsbruck, May 15th, 2013. Outline: complexity in HPC (parallel hardware, optimizations, programming models); harnessing complexity (automatic tuning, automatic parallelization, DSLs); abstractions for HPC; related work in Insieme
  • Slide 3
  • COMPLEXITY IN HPC
  • Slide 4
  • Complexity in Hardware. The need for parallel computing; parallelism in hardware; the three walls: the power wall, the memory wall, and the Instruction Level Parallelism (ILP) wall
  • Slide 5
  • The Power Wall. Power is expensive, but transistors are free: we can put more transistors on a chip than we have the power to turn on. The power efficiency challenge: performance per watt is the new metric, and systems are often constrained by power and cooling. This forces us to concede the battle for maximum performance of individual processing elements, in order to win the war for application efficiency through optimizing total system performance. Example: the Intel Pentium 4 HT 670 (released May 2005) has a clock rate of 3.8 GHz, while the Intel Core i7 3930K Sandy Bridge (released Nov. 2011) has a clock rate of 3.2 GHz.
  • Slide 6
  • The Memory Wall. The growing disparity of speed between the CPU and memory outside the CPU chip would become an overwhelming bottleneck. It changes the way we optimize programs: optimize for memory rather than for computation. E.g. a multiply is no longer considered a prohibitively slow operation when compared to a load or a store (see the sketch below).
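As a hedged illustration of "optimize for memory" (not from the slides), the sketch below sums the same row-major matrix twice, once in row order and once in column order. The arithmetic is identical; only the memory access pattern changes, and on cached hardware that pattern usually dominates the run time. Function names and signatures are illustrative assumptions.

```cpp
#include <cstddef>
#include <vector>

// Row-major traversal: consecutive iterations touch adjacent elements,
// so caches and the hardware prefetcher are used effectively.
double sum_row_major(const std::vector<double>& a, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            s += 2.0 * a[i * n + j];
    return s;
}

// Column-major traversal of the same row-major buffer: each access jumps
// n elements ahead, so most accesses miss in cache. The multiplies are the
// same; the loads are what make this version slower for large n.
double sum_col_major(const std::vector<double>& a, std::size_t n) {
    double s = 0.0;
    for (std::size_t j = 0; j < n; ++j)
        for (std::size_t i = 0; i < n; ++i)
            s += 2.0 * a[i * n + j];
    return s;
}
```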
  • Slide 7
  • The ILP Wall. There are diminishing returns on finding more ILP. Instruction Level Parallelism is the potential overlap among instructions, and many ILP techniques exist: instruction pipelining, superscalar execution, out-of-order execution, register renaming, branch prediction. The goal of compiler and processor designers is to identify and take advantage of as much ILP as possible, but there is an increasing difficulty in finding enough parallelism in a single instruction stream to keep a high-performance single-core processor busy (see the sketch below).
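A minimal sketch (not from the slides) of why ILP matters and why it is limited: the first reduction forms one long dependency chain, so a superscalar, out-of-order core cannot overlap its additions; the second uses four independent accumulators, exposing more ILP from the same sequential source. Names are illustrative.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: each addition depends on the previous result, so the
// pipeline is limited by the latency of a single dependency chain.
double sum_one_chain(const std::vector<double>& v) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        s += v[i];
    return s;
}

// Four accumulators: the additions within an iteration are independent,
// giving the out-of-order core several chains to execute in parallel.
double sum_four_chains(const std::vector<double>& v) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    const std::size_t n = v.size();
    for (; i + 4 <= n; i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i)  // remaining elements
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}
```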
  • Slide 8
  • Parallelism in Hardware. Examples of current parallel processors:
    Xeon Phi 5110P (Intel): 60 cores (240 threads) at 1.053 GHz; 8 GB memory, 320 GB/s bandwidth; 512-bit SIMD
    Tesla K20X (NVidia): 2688 CUDA cores arranged in SMs, roughly 1 GHz; 6 GB memory, 250 GB/s bandwidth; SIMT with 32-thread warps
    FirePro S10000 (AMD): 2x1792 stream processors at 825 MHz; 6 GB memory, 480 GB/s bandwidth (dual); SIMT with 64-thread wavefronts
    Cortex A50 (ARM): up to 16 cores (4x4 cluster); 4 GB memory and banked L2 cache
    TILE-Gx8072 (Tilera): 72 cores at 1.0 GHz; 23 MB of on-chip cache (32 KB per core, 256 KB L2 per core, 18 MB L3); SIMD on 32-, 16-, and 8-bit operations
    Power7+ (IBM): 8-core SCM (64 cores with 4 drawers), 4 SMT threads per core, at 4.14 GHz; 2 MB L2 cache (256 KB per core) and 32 MB L3 cache (4 MB per core) on the 8-core SCM
  • Slide 9
  • The Many-core Challenge. Many-core vs multi-core: multi-core architectures and programming models suitable for 2 to 32 processors will not easily and incrementally evolve to serve many-core systems of 1000s of processors. Many-core is the future, e.g. the Tilera TILE-Gx8072.
  • Slide 10
  • What does it mean? Hardware is evolving: the number of cores is the new Megahertz. We need a new programming model, new system software, and new supporting architectures that are naturally parallel.
  • Slide 11
  • New Challenges. Make it easy to write programs that execute efficiently on highly parallel computing systems; the target should be 1000s of cores per chip. Maximize productivity. Programming models should be independent of the number of processors and should support successful models of parallelism, such as task-level parallelism, word-level parallelism, and bit-level parallelism. Autotuners should play a larger role than conventional compilers in translating parallel programs.
  • Slide 12
  • Parallel Programming Models. They define, implicitly or explicitly, some properties: how the application is divided into parallel tasks; the mapping of computational tasks to processing elements; the distribution of data to memory elements (locating data in smaller, closer memories increases the performance of the implementation); the mapping of communication to the interconnection network (interconnect bottlenecks can be avoided by changing the communication pattern of the application); and inter-task synchronization (the style and mechanisms of synchronization can influence not only performance, but also functionality). A small example follows below.
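As a hedged example of these properties in a concrete model (OpenMP is chosen here only for brevity; the slide does not single it out), the sketch below shows how one directive answers several of the questions above: the loop iterations are the parallel tasks, the runtime maps them to processing elements without fixing their number in the source, and the reduction clause is the inter-task synchronization. Compile with OpenMP enabled (e.g. -fopenmp on GCC/Clang); without it the pragma is ignored and the code runs serially.

```cpp
#include <cstddef>
#include <vector>

// Dot product: iterations are divided among threads by the OpenMP runtime;
// the reduction clause combines the per-thread partial sums at the end.
double dot(const std::vector<double>& x, const std::vector<double>& y) {
    double result = 0.0;
    #pragma omp parallel for reduction(+ : result)
    for (long i = 0; i < static_cast<long>(x.size()); ++i)
        result += x[i] * y[i];
    return result;
}
```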
  • Slide 13
  • Parallel Programming Models: Real-Time Workshop (MathWorks), Binary Modular Data Flow Machine (TU Munich and AS Nuremberg), MPI, Pthreads, MapReduce (Google), StreamIt (MIT & Microsoft), CUDA (NVidia), OpenCL (Khronos Group), Brook (Stanford), DataCutter (Maryland), OpenMP, Threading Building Blocks (Intel), Cilk (MIT), NESL (CMU), HPCS Chapel (Cray), HPCS X10 (IBM), HPCS Fortress (Sun), Sequoia (Stanford), Charm (Illinois), Erlang, Borealis (Brown), HMPP, OpenACC
  • Slide 15
  • Reconsidering Applications. What are common parallel kernel applications? Parallel patterns: instead of traditional benchmarks, design and evaluate parallel programming models and architectures on parallel patterns. A parallel pattern (dwarf) is an algorithmic method that captures a pattern of computation and communication, e.g. dense linear algebra, sparse linear algebra, spectral methods (see the sketch below). Metrics: scalability. An old belief was that less-than-linear scaling for a multi-processor application is a failure; with the new hardware trend this is no longer true: any speedup is OK!
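A hedged illustration of one dwarf (not from the slides): naive dense matrix multiplication is the canonical instance of the dense linear algebra pattern, a regular computation with a fixed communication structure, which is what makes such patterns useful yardsticks for programming models and architectures.

```cpp
#include <cstddef>
#include <vector>

// Naive n x n matrix multiply over row-major buffers: C = A * B.
// The triple loop is the textbook "dense linear algebra" pattern.
void matmul(const std::vector<double>& A, const std::vector<double>& B,
            std::vector<double>& C, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = acc;
        }
    }
}
```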
  • Slide 16
  • HARNESSING COMPLEXITY
  • Slide 17
  • Harnessing Complexity. Compiler approaches (DSLs, automatic parallelization, ...) and library-based approaches.
  • Slide 18
  • What can a compiler do for us? Optimize code; automatic tuning; automatic code generation (e.g. in order to support different hardware); automatically parallelize code. A small autotuning sketch follows below.
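A minimal sketch of the automatic tuning idea (not from the slides, and not the Insieme infrastructure): a driver times several parameterizations of the same kernel and keeps the fastest one. Real autotuners search much larger spaces and often work offline or at compile time; the kernel, candidate block sizes, and problem size below are illustrative assumptions.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Tunable kernel: sum a vector in blocks of a configurable size.
static double blocked_sum(const std::vector<double>& v, std::size_t block) {
    double s = 0.0;
    for (std::size_t b = 0; b < v.size(); b += block)
        for (std::size_t i = b; i < b + block && i < v.size(); ++i)
            s += v[i];
    return s;
}

int main() {
    const std::vector<double> data(1 << 22, 1.0);
    const std::size_t candidates[] = {64, 256, 1024, 4096};

    std::size_t best_block = candidates[0];
    double best_time = 1e300;
    for (std::size_t block : candidates) {
        const auto t0 = std::chrono::steady_clock::now();
        volatile double sink = blocked_sum(data, block);  // keep the call alive
        (void)sink;
        const double t = std::chrono::duration<double>(
                             std::chrono::steady_clock::now() - t0).count();
        if (t < best_time) { best_time = t; best_block = block; }
    }
    std::printf("selected block size: %zu (%.4f s)\n", best_block, best_time);
    return 0;
}
```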
  • Slide 19
  • Automatic Parallelization. There are critical opinions on parallel programming models. The other way: auto-parallelizing compilers (sequential code => parallel code). Wen-mei Hwu, University of Illinois at Urbana-Champaign, "Why sequential programming models could be the best way to program many-core systems", http://view.eecs.berkeley.edu/w/images/3/31/Micro-keynote-hwu-12-11-2006_.pdf
  • Slide 20
  • Automatic Parallelization. Nowadays compilers have new analysis tools, such as the polyhedral model, but performance is still far from that of a manual parallelization approach. Pipeline: IR => polyhedral model => analyses & transformations, applied to affine loop nests such as the one sketched below.
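The loop example on this slide is cut off in the transcript. As a hedged substitute (not the author's code), the sketch below shows the kind of affine loop nest the polyhedral model represents exactly: loop bounds and array subscripts are affine functions of the loop indices, so the iteration space is a polyhedron on which dependences can be computed and transformations such as tiling, fusion, and parallelization applied.

```cpp
#include <cstddef>

// 4-point Jacobi-style stencil over an n x n row-major grid.
// Bounds (1 .. n-2) and subscripts (i-1, i+1, j-1, j+1) are affine in i and j,
// which is what makes the nest amenable to polyhedral analysis.
void stencil(const float* A, float* B, std::size_t n) {
    for (std::size_t i = 1; i + 1 < n; ++i)
        for (std::size_t j = 1; j + 1 < n; ++j)
            B[i * n + j] = 0.25f * (A[(i - 1) * n + j] + A[(i + 1) * n + j] +
                                    A[i * n + j - 1] + A[i * n + j + 1]);
}
```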