Programming Models and Architectures for ManyCore Systems:
Challenges and Opportunities for the next 10 years.
Roberto Vaccaro & Lorenzo Verdoscia
Institute for High Performance Computing and Networking
National Research Council – Italy
Workshop, December 19, Napoli - Italy
CNR Bioinformatics
Introduction
■ The computational and storage needs of workloads in several areas, such as life science, are growing exponentially.
■ Overcoming heterogeneity/computing barriers. The scientist should be allowed to look at the data:
• easily,
• wherever it may be,
• with sufficient processing power for any desired algorithm to process it.
Introduction
■ In life science the scientist's requirements concern a range of different scales, from the local parallel component processor to the global architectural level of the cross-organizational grid.
■ Integrated solutions capable of facing the problems at the different architectural levels are needed.
Grid of Clusters
Cluster
Commodity Machine
Microprocessor
Wide Area Network
Local Area Network
System Level Network
Network on Chip
■ ManyCore Chip
■ Photonic Networks for intra-chip, inter-chip, box interconnects
Introduction
(*) T. Agerwala, M. Gupta, “Systems research challenges: A scale-out perspective”, IBM Journal of Research & Development, Vol. 50, No. 2/3, March/May 2006, pp. 173-180
Introduction
■ An ensemble of N nodes, each comprising p computing elements.
■ The p elements are tightly coupled via shared memory (e.g., SMP, DSM).
■ The N nodes are loosely coupled, i.e., distributed memory.
■ p is greater than N.
■ The distinction is which layer gives us the most power through parallelism.
Introduction
■ GRIDs built over wide-area networks & across organisational boundaries.
■ Lack of (further) improvement in network latency.
The currently prevailing synchronous approach to distributed programming
(using RPC primitives, for example)
will have to be replaced with an
ASYNCHRONOUS PROGRAMMING APPROACH that is more
- delay-tolerant
- failure-resilient
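The contrast can be sketched in a few lines of Python (asyncio serves only as an illustration here; the service names and delays are invented):

```python
import asyncio

async def call_service(name, delay):
    # A hypothetical remote call; asyncio.sleep stands in for network latency.
    await asyncio.sleep(delay)
    return "result from " + name

async def main():
    # Asynchronous style: issue both requests up front so their latencies
    # overlap, instead of blocking on each RPC in turn.
    return await asyncio.gather(call_service("align", 0.05),
                                call_service("search", 0.02))

results = asyncio.run(main())
print(results)  # gather preserves submission order
```

The total wait is bounded by the slowest call, not by the sum of all calls, which is what makes the style delay-tolerant.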
Introduction
■ A first step in that direction:
- peer-to-peer (P2P) architectures,
- service-oriented architectures (SOA),
capable of supporting reuse of both functionalities and data.
■ Using P2P architectures and protocols it is possible to:
- realize distributed systems without any centralized control or hierarchical organisation,
- achieve scalable and reliable location and exchange of scientific data and software in a decentralised manner.
■ Service-Oriented Architecture (SOA) and the web-service infrastructures that assist in their implementation facilitate reuse of functionality.
(*) G. Kandaswamy et al., “Building Web Services for Scientific Grid Applications”, IBM Journal of Research & Development, Vol. 50, No. 2/3, March/May 2006, pp. 249-260
Introduction
■ The possibility of locating and invoking a service across machine and organisational boundaries (both in a synchronous and an asynchronous manner) is the fundamental primitive provided by the SOA infrastructure.
■ Computational scientists will be able to flexibly orchestrate SOA services into computational workflows.
Introduction
■ Appropriate programming-language abstractions for science have to be provided.
■ Fortran and the Message Passing Interface (MPI) are no longer appropriate for the architecture described above.
■ By using abstract machines it is possible to mix compilation and interpretation, as well as to integrate code written in different languages seamlessly into an application or service.
A viable approach
■ Define a Multilevel Integrated Programming Model
■ Explore the management of concurrency in processor design on a range of different scales:
- from instructions to programs,
- from microgrids to global grids.
■ Evaluate the possibility and modalities of implementing an integrated H/W and S/W system capable of giving the right answer in terms of:
- inter/intra-processor latency;
- a more delay-tolerant and failure-resilient programming approach;
- capability of data and functionality reuse at the global architecture level (distributed, cross-organisational);
- capability to take advantage of parallel and distributed resources.
Introduction
By Little’s law, the amount of concurrency needed to hide the latency of memory accesses will continue to increase as the gap between memory and processor speed grows. Since the memory latency is improving at a rate of only roughly 6% each year, the gap is projected to continue growing even as the increase in processor speed decreases from the historic rate of about 60% each year to about 20% each year.
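Little's law makes this concrete: the concurrency needed equals throughput times latency. The numbers below are illustrative assumptions, not measurements:

```python
def ops_in_flight(throughput_per_cycle, latency_cycles):
    # Little's law: required concurrency = throughput x latency.
    return throughput_per_cycle * latency_cycles

# Sustaining 4 memory operations per cycle against a 200-cycle DRAM
# latency requires 800 independent operations in flight.
print(ops_in_flight(4, 200))  # 800

# The gap grows: ~20%/year processor improvement against ~6%/year latency
# improvement means the required concurrency rises ~13% per year.
print(round(1.20 / 1.06, 3))  # 1.132
```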
Computer hardware industry
In 2005 there was a historic change of direction for the computer hardware industry.
● The major microprocessor companies all announced that
future products would be single-chip multiprocessors
future performance improvements would rely on
○ software-specified parallelism
rather than
○ additional software-transparent parallelism extracted automatically by the microarchitecture
Computer hardware industry
■ It is meaningful that a multibillion-dollar industry has bet its future on solving the general-purpose parallel computing problem,
even if
so many have previously attempted but failed to provide a satisfactory approach.
■ In order to tackle the parallel processing problem, innovative solutions are urgently needed, which in turn require extensive codevelopment of hardware and software.
Computer hardware industry
■ Advances in integrated circuit technology impose new challenges about how to implement a high-performance, low-power application on processors made of hundreds of cores running at 200 MHz, rather than on one traditional processor running at 20 GHz.
■ The convergence of the high-performance and embedded industries.
Computer hardware industry
Multicore or Manycore?
■Multicore will obviously help multiprogrammed workloads, which contain a mix of independent sequential tasks, but how will individual tasks become faster?
■Switching from sequential to modestly parallel computing will make programming much more difficult without rewarding this greater effort with a dramatic improvement in power-performance.
■Multicore is unlikely to be the ideal answer, and sneaking up on the problem of parallelism via multicore solutions is likely to fail.
Computer hardware industry
■We desperately need a new solution for parallel hardware and software.
■Compatibility with old binaries and C programs is valuable to industry, and some researchers are trying to help multicore product plans succeed.
■We have been thinking bolder thoughts. Our aim is to realize thousands of processors on a chip for new applications, and we welcome new programming models and new architectures if they simplify the efficient programming of such highly parallel systems.
■Rather than multicore, we are focused on “manycore”.
Computer hardware industry
■Between February 2005 and December 2006, a group of researchers at the University of California at Berkeley from many backgrounds (circuit design, computer architecture, massively parallel computing, computer-aided design, embedded h/w and s/w, programming languages, compilers, scientific programming, and numerical analysis) met to discuss parallelism from these many angles.
■The result of borrowing the good ideas regarding parallelism from those different disciplines is the report:
“The Landscape of Parallel Computing Research: A View from Berkeley”
Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, Katherine A. Yelick
Electrical Engineering and Computer Sciences
University of California at Berkeley
Technical Report No. UCB/EECS-2006-183
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html
December 18, 2006
The Landscape
■Seven critical questions used to frame the landscape of parallel computing research:
1. What are the applications?
2. What are common kernels of the applications?
3. What are the hardware building blocks?
4. How to connect them?
5. How to describe applications and kernels?
6. How to program the hardware?
7. How to measure success?
■The report does not have all the answers:
- on some questions, non-conventional and provocative perspectives are offered;
- on others, seemingly obvious but sometimes-neglected perspectives are stated.
The Landscape
Embedded versus High Performance Computing
They have more in common looking forward than they did in the past:
1. Both are concerned with power, whether it is battery life for cell phones or the cost of electricity and cooling in a data center.
2. Both are concerned with hardware utilization. Embedded systems are always sensitive to cost, but efficient use of hardware is also required when you spend $10M to $100M for high-end servers.
3. As the size of embedded software increases over time, the fraction of hand tuning must be limited, and so the importance of software reuse must increase.
4. Since both embedded and high-end servers now connect to networks, both need to prevent unwanted accesses and viruses.
The Landscape
■The biggest difference between the two targets is the traditional emphasis on real-time computing in embedded systems, where the computer and the program need to be just fast enough to meet the deadlines, and there is no benefit to running faster.
■Running faster is usually valuable in server computing.
■As server applications become more media-oriented, real time may become more important for server computing as well.
Information Society Technologies (IST)
Network of Excellence on High Performance Embedded Architectures and Compilers (HiPEAC)
Mateo Valero (UPC Barcelona), HiPEAC Coordinator, introducing the publication of the first HiPEAC research roadmap (*), wrote:
“From the document it is clear that there are many challenges ahead of us in the design of future high-performance embedded systems. Some of them are familiar, such as the memory wall, the power problem, and the interconnection bottleneck. Others are new, like the proper support for reconfigurable components, fast simulation techniques for multi-core systems, and new programming paradigms for parallel programming.”
(*) K. De Bosschere, W. Luk, X. Martorell, N. Navarro, M. O’Boyle, D. Pnevmatikatos, A. Ramirez, P. Sainrat, A. Seznec, P. Stenström, and O. Temam, “High-Performance Embedded Architecture and Compilation Roadmap”, Transactions on HiPEAC I, Lecture Notes in Computer Science 4050, pp. 5-29, Springer-Verlag, 2007
Parallelism
For at least three decades the promise of parallelism has fascinated researchers.
■In the past, parallel computing efforts have shown promise and gathered investment, but in the end, uniprocessor computing always prevailed.
■This time, general-purpose computing is taking an irreversible step toward parallel architectures.
●This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs in novel software and architectures for parallelism.
●This plunge into parallelism is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional uniprocessor architectures.
CW in Computer Architecture
Old & New Conventional Wisdom (CW) in Computer Architecture
1. Old CW: Power is free, but transistors are expensive.▪New CW is the “Power wall”: Power is expensive, but transistors are “free”. That is, we can put more transistors on a chip than we have the power to turn on.
2. Old CW: If you worry about power, the only concern is dynamic power. ▪ New CW: For desktops and servers, static power due to leakage can be 40% of total power.
3. Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors occurring only at the pins. ▪ New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates.
4. Old CW: By building upon prior successes, we can continue to raise the level of abstraction and hence the size of hardware designs. ▪ New CW: Wire delay, noise, cross coupling (capacitive and inductive), manufacturing variability, reliability, clock jitter, design validation, and so on conspire to stretch the development time and cost of large designs at 65 nm or smaller feature sizes.
Guiding principles illustrating how everything is changing in computing.
CW in Computer Architecture
5. Old CW: Researchers demonstrate new architecture ideas by building chips.▪New CW: The cost of masks at 65 nm feature size, the cost of Electronic Computer Aided Design software to design such chips, and the cost of design for GHz clock rates means researchers can no longer build believable prototypes. Thus, an alternative approach to evaluating architectures must be developed.
6. Old CW: Performance improvements yield both lower latency and higher bandwidth. ▪ New CW: Across many technologies, bandwidth improves by at least the square of the improvement in latency.
7. Old CW: Multiply is slow, but load and store is fast. ▪ New CW is the “Memory wall”: Load and store is slow, but multiply is fast. Modern microprocessors can take 200 clocks to access Dynamic Random Access Memory (DRAM), but even floating-point multiplies may take only four clock cycles.
8. Old CW: We can reveal more instruction-level parallelism (ILP) via compilers and architecture innovation. Examples from the past include branch prediction, out-of-order execution, speculation, and Very Long Instruction Word systems. ▪ New CW is the “ILP wall”: There are diminishing returns on finding more ILP.
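Two of these CW pairs reduce to simple arithmetic; the factors below are illustrative:

```python
def min_bandwidth_gain(latency_gain):
    # CW #6: across many technologies, bandwidth improves by at least
    # the square of the improvement in latency.
    return latency_gain ** 2

# A 4x improvement in latency implies at least a 16x improvement in bandwidth.
print(min_bandwidth_gain(4.0))  # 16.0

# CW #7 ("memory wall") in numbers: ~200 clocks to DRAM versus ~4 clocks
# for a floating-point multiply -- a 50x imbalance.
print(200 / 4)  # 50.0
```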
CW in Computer Architecture
9. Old CW: Uniprocessor performance doubles every 18 months.
▪ New CW is Power Wall + Memory Wall + ILP Wall = Brick Wall. In 2006, performance is a factor of three below the traditional doubling every 18 months that we enjoyed between 1986 and 2002. The doubling of uniprocessor performance may now take 5 years.
10.Old CW: Don’t bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer.
▪ New CW: It will be a very long wait for a faster sequential computer.
11. Old CW: Increasing clock frequency is the primary method of improving processor performance.
▪ New CW: Increasing parallelism is the primary method of improving processor performance.
12. Old CW: Less than linear scaling for a multiprocessor application is failure.
▪ New CW: Given the switch to parallel computing, any speedup via parallelism is a success.
CW in Computer Architecture
Uniprocessor Performance (SPECint)
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.
Sea change in chip design: multiple “cores” or processors per chip.
• VAX: 25%/year, 1978 to 1986
• RISC + x86: 52%/year, 1986 to 2002
• RISC + x86: ??%/year, 2002 to present
CW in Computer Architecture
The State of Hardware
■A negative picture of the state of hardware is painted by the CW-pairs-based analysis.
■There are compensating positives as well:
●Moore’s Law continues: it will soon be possible to put thousands of simple processors on a single, economical chip;
●very low latency & very high bandwidth for the communication between these processors within a chip;
●monolithic manycore microprocessors
- represent a very different design point from traditional multichip multiprocessors,
- provide promise for the development of new architectures and programming models.
Applications and Dwarfs
■ Mining the parallelism experience of the high-performance computing community to see if there are lessons we can learn for a broader view of parallel computing.
The hypothesis
● is not that traditional scientific computing is the future of parallel computing;
● is that the body of knowledge created in building programs that run well on massively parallel computers may prove useful in parallelizing future applications.
■ Many of the authors from other areas, such as embedded computing, were surprised at how well future applications in their domain mapped closely to problems in scientific computing.
■ The traditional way to guide and evaluate architecture innovation is to study a benchmark suite based on existing programs, such as EEMBC (Embedded Microprocessor Benchmark Consortium), SPEC (Standard Performance Evaluation Corporation), or SPLASH (Stanford Parallel Applications for Shared Memory).
Applications and Dwarfs
■ It is currently unclear how to express a parallel computation best: a very big obstacle to innovation in parallel computing.
■ It seems unwise to let a set of existing source code drive an investigation into parallel computing.
■ There is a need to find a higher level of abstraction for reasoning about parallel application requirements.
■ The main aim is to delineate application requirements in a manner that is not overly specific to individual applications or the optimizations used for certain hardware platforms.
■ It is possible to draw broader conclusions about hardware requirements.
■ The approach is to define a number of “Dwarfs”, each of which captures a pattern of computation and communication common to a class of important applications.
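To make the notion concrete, here is a toy instance of one such pattern, the dense linear algebra dwarf (a sketch for illustration, not an example from the report): unit-stride access over dense arrays, with O(n^3) compute on O(n^2) data.

```python
def matmul(a, b):
    # Classic triple-loop dense matrix multiply: the computation and data
    # movement pattern that defines the dense linear algebra dwarf.
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c

c = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
print(c)  # [[19.0, 22.0], [43.0, 50.0]]
```

Membership in the dwarf is defined by this access/compute shape, not by any particular application that happens to use it.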
Applications and Dwarfs
■ Phil Colella identified seven numerical methods that he believed would be important for science and engineering for at least the next decade.
■ The Seven Dwarfs:
● constitute classes where membership in a class is defined by similarity in computation and data movement;
● are specified at a high level of abstraction to allow reasoning about their behavior across a broad range of applications.
Applications and Dwarfs
Seven Dwarfs, their descriptions, corresponding NAS benchmarks, and example computers.
Applications and Dwarfs
Extensions to the original Seven Dwarfs.
Recognition, Mining, Synthesis (RMS)
Intel’s RMS and how it maps down to functions that are more primitive. Of the five categories at the top of the figure, Computer Vision is classified as Recognition, Data Mining is Mining, and Rendering, Physical Simulation, and Financial Analytics are Synthesis. [Chen 2006]
Intel “Era of Tera” Computation Categories
Parallel Programming Models
Comparison of 10 current parallel programming models for 5 critical tasks, sorted from most explicit to most implicit. High-performance computing applications [Pancake and Bergmark 1990] and embedded applications [Shah et al 2004a] suggest these tasks must be addressed one way or the other by a programming model: 1) dividing the application into parallel tasks; 2) mapping computational tasks to processing elements; 3) distribution of data to memory elements; 4) mapping of communication to the interconnection network; and 5) inter-task synchronization.
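The five tasks surface even in a trivial program. In the sketch below (the decomposition is invented for illustration; Python's thread pool merely stands in for the processing elements), each task is marked in a comment:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum_of_squares(chunk):
    # (1) the application is divided into independent parallel tasks
    return sum(x * x for x in chunk)

data = list(range(1000))
# (3) data distribution: deal the elements out round-robin into 4 chunks
chunks = [data[i::4] for i in range(4)]

# (2) task-to-processing-element mapping: hand each chunk to a pool worker;
# (4) communication: partial results travel back through the map() results
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(partial_sum_of_squares, chunks))

# (5) synchronization: all workers have finished before we combine
total = sum(partials)
print(total)
```

An explicit model makes the programmer spell out all five decisions; an implicit one hides most of them behind the runtime, as the pool does here.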
Limits of Performance of Dwarfs
Limits to the performance of dwarfs, inspired by a suggestion by IBM that a packaging technology could offer virtually infinite memory bandwidth. While the memory wall limited performance for almost half the dwarfs, memory latency is a bigger problem than memory bandwidth.
Transistor Integration Capacity
Transistor integration capacity
Pollack’s Rule
Pollack's Rule
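Pollack's Rule says single-core performance grows roughly as the square root of the area (transistor budget) spent on the core. Its arithmetic, under the idealized assumption of fully parallel work, favors many small cores:

```python
import math

def core_perf(area):
    # Pollack's Rule: single-core performance ~ sqrt(core area).
    return math.sqrt(area)

BUDGET = 16.0  # an assumed area budget, in arbitrary units

one_big_core = core_perf(BUDGET)            # pour the whole budget into one core
many_small_cores = BUDGET * core_perf(1.0)  # 16 unit cores, work fully parallel
print(one_big_core, many_small_cores)  # 4.0 16.0
```

The 4x gap is the manycore argument in miniature; Amdahl's law, below, is what erodes it when the work is not fully parallel.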
Frequency and Power Consumption
Frequency and Power Consumption
ManyCore System
Illustration of a Many Core System
Amdahl’s Law Limits Parallel Speedup
Amdahl's Law limits parallel speedup
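The limit follows directly from Amdahl's law, speedup = 1 / (s + (1 - s)/N) for serial fraction s on N cores; a quick check:

```python
def amdahl_speedup(serial_fraction, n_cores):
    # Amdahl's law: speedup = 1 / (s + (1 - s) / N)
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# Even a 1% serial fraction caps speedup near 1/s = 100, no matter how
# many cores are added.
for n in (10, 100, 1000):
    print(n, round(amdahl_speedup(0.01, n), 1))
```

At 1000 cores the speedup is only about 91, already close to the 1/s ceiling of 100.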
Core Performances
Performance of Large, Medium, and Small Cores
Fine Grain Power Management
Fine grain power management
Network Power Estimate
Network power estimate
Three Dimensional Interconnect With Stacking
Three dimensional interconnect with stacking
Assembly of 3D Memory
Assembly of 3D memory
Recommended points from Berkeley
■ The overarching goal should be to make it easy to write programs that execute efficiently on highly parallel computing systems
■ The target should be 1000s of cores per chip, as these chips are built from processing elements that are the most efficient in MIPS per watt, MIPS per area of silicon, and MIPS per development dollar.
■ Instead of traditional benchmarks, use 13 “Dwarfs” to design and evaluate parallel programming models and architectures.
A dwarf is an algorithmic method that captures a pattern of computation and communication.
■ “Autotuners” should play a larger role than conventional compilers in translating parallel programs.
■ To maximize programmer productivity, future programming models must be more human-centric than the conventional focus on hardware or applications.
■ To be successful, programming models should be independent of the number of processors.
Recommended points from Berkeley
■ To maximize application efficiency, programming models should support a wide range of data types and successful models of parallelism: task-level parallelism, word-level parallelism, and bit-level parallelism.
■ Architects should not include features that significantly affect performance or energy if programmers cannot accurately measure their impact via performance counters and energy counters.
■ Traditional operating systems will be deconstructed and operating system functionality will be orchestrated using libraries and virtual machines.
■ To explore the design space rapidly, use system emulators based on FPGAs that are highly scalable and low cost.
Maybe they missed some key point, for example:
whenever possible, computational execution should happen in an asynchronous manner.
Why Asynchronous
■ Low power consumption,
… due to fine-grain clock gating and zero standby power consumption.
■ High operating speed,
… operating speed is determined by actual local latencies rather than global worst-case latency.
■ Less emission of electro-magnetic noise,
… the local clocks tend to tick at random points in time.
■ Robustness towards variations in supply voltage, temperature, and fabrication process parameters,
… timing is based on matched delays (and can even be insensitive to circuit and wire delays).
■ Better composability and modularity,
… because of the simple handshake interfaces and the local timing.
■ No clock distribution and clock skew problems,
… there is no global signal that needs to be distributed with minimal phase skew across the circuit.
Auto-tuners
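An auto-tuner searches a space of code variants empirically, timing each on the target machine and keeping the fastest, instead of trusting a compiler's static performance model. A toy sketch (the kernel and the candidate block sizes are invented for illustration):

```python
import timeit

DATA = list(range(5000))

def blocked_sum(block):
    # One candidate code variant: sum DATA in chunks of `block` elements.
    # `block` stands in for the tile sizes a real autotuner would search over.
    return sum(sum(DATA[i:i + block]) for i in range(0, len(DATA), block))

def autotune(candidates, trials=10):
    # Time every variant on this machine and keep the fastest one.
    timings = {b: timeit.timeit(lambda: blocked_sum(b), number=trials)
               for b in candidates}
    return min(timings, key=timings.get)

best_block = autotune([8, 64, 512])
print("selected block size:", best_block)
```

All variants compute the same result; only their speed on the machine at hand differs, which is exactly the degree of freedom an autotuner exploits.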
Computational Model
■ Designing clever parallel hardware and then working out how to program it is a big mistake.
■ Designing parallel programming languages and then working out how to implement them is usually a mistake.
■ Developing the right computational model alongside languages & hardware is the Key.
Computational Model
■ Think about systems, not just hardware or software.
■ There is lots of (possibly) relevant work, e.g.:
- Dataflow (Single Assignment)
- Graph Rewriting (Functional Languages)
- Bulk Synchronous Parallelism (BSP)
- Transactional Memory
■ Don’t ignore previous work, and particularly don’t re-invent the wheel!
Language Effectiveness
[Chart: language effectiveness of C, C++, and Java, 1970 to 2005]
Language Effectiveness
[Chart: language effectiveness versus Moore's Law (log scale), 1970 to 2005]
CISC Architecture
■ A huge effort has gone into improving the performance of the sequential instruction stream.
■ The resulting complexity has grown unmanageable.
■ Even with 1 billion transistors on a chip, what more can be done?
[Diagram: sequential-performance techniques: renaming, out-of-order execution, pipelining, speculative execution, prefetching, branch prediction, value prediction]
TRIPS Prototype
Cyclops-64 Architecture
Cyclops-64 Programming Models and System Software Support

■ Application programming API: UPC +/-, Co-array Fortran, OpenMP-XN, EARTH-C +/-, MPI, …
■ Cyclops Thread Virtual Machine:
– Thread management: thread creation & termination, scheduling, dynamic memory management, load balancing, others
– Fine-grain multithreading: fibers, async function invocation, thread synchronization
– Shared-memory operations: put/get, put/get with sync, acquire/release, Location Consistency
■ Toolchain: Kcc/gcc compiler
■ System software research: a base execution model (fine-grain multithreading, e.g. EARTH, CARE), an advanced execution/programming model (percolation), and infrastructure and tools (simulation/emulation, analytical modeling)

[Diagram: the Cyclops-64 chip. Cyclops-64 ISA; processors built from paired thread units (TU) and scratchpad memories (SP) sharing an FPU; a crossbar network connecting the processors to the on-chip memory banks; an A-switch and DMA engine with communication ports for the 3D mesh inter-chip network to other chips; off-chip memory at 4 GB/sec x 6; IDE HDD at 50 MB/sec; 1 Gbit/s Ethernet; a 24x24 system (24 PC cards in 1 shishkebab) targeting 1 PetaFlops]
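The "put/get with sync" shared-memory operation listed in the Cyclops Thread Virtual Machine layer can be sketched as a full/empty-flag cell. This is a hedged illustration using hypothetical names (SyncCell, put_sync, get_sync), not the actual Cyclops-64 runtime API:

```python
# Sketch of put/get with synchronization: a put deposits a value and
# marks the cell full; a get succeeds only when the cell is full and
# marks it empty again. On real hardware the consumer would block
# rather than raise, but the full/empty discipline is the same idea.

class SyncCell:
    def __init__(self):
        self.full = False
        self.value = None

    def put_sync(self, v):
        self.value, self.full = v, True

    def get_sync(self):
        if not self.full:
            raise RuntimeError("consumer would block: cell empty")
        self.full = False
        return self.value

cell = SyncCell()
cell.put_sync(7)        # producer thread deposits a value
print(cell.get_sync())  # consumer proceeds only after the put -> 7
```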
hHLDS
The homogeneous High Level Dataflow System (hHLDS) model
Firing rules in the classical model
Let A = {a1, …, an} be the set of actors and L = {l1, …, ln} be the set of links.
A dataflow graph is a labelled directed graph G = (N, E), where N = A ∪ L is the set of nodes and E ⊆ (A × L) ∪ (L × A) is the set of edges.
Firing of an actor requires a token on each input link and no token on any output link.
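The classical firing rule can be checked mechanically: an actor is enabled when every input link carries a token and every output link is empty. A small sketch (illustrative; link and actor names are hypothetical):

```python
# Classical dataflow firing rule: enabled iff all input links hold a
# token AND no output link holds one.

def enabled(actor, tokens, inputs, outputs):
    """tokens: dict link -> bool; inputs/outputs: dict actor -> [links]."""
    return (all(tokens[l] for l in inputs[actor])
            and not any(tokens[l] for l in outputs[actor]))

inputs  = {"plus": ["l1", "l2"]}
outputs = {"plus": ["l3"]}

tokens = {"l1": True, "l2": True, "l3": False}
print(enabled("plus", tokens, inputs, outputs))  # -> True

tokens["l3"] = True  # output link still occupied: actor must wait
print(enabled("plus", tokens, inputs, outputs))  # -> False
```

The second check is what distinguishes the classical rule from the hHLDS rule introduced next, which drops the condition on output links.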
hHLDS
The hHLDS model
Special actors in the classical model (Merge, Switch, Gate, and the Decider that controls them) are characterized by having heterogeneous I/O conditions.
[Diagram: Merge with data inputs A, B, a T/F control input, and output L; Switch with data input A, a T/F control input, and outputs L and R; Decider with inputs A, B and boolean output L; Gate]
hHLDS
Any actor has two input links and one output link, and consumes and produces only data tokens.
Firing of an actor requires a token on each input link; its effect is to consume all input tokens and, possibly, produce a token on its output link.
[Diagram: two hHLDS graphs: (a) the expression a + b*c, built from a '*' actor feeding a '+' actor; (b) the conditional "if b ≤ c then a", where the boolean token from a '≤' actor conditions the '+' actor's output]
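The two example graphs can be walked through directly. A hedged sketch of the hHLDS rule stated above: every actor has exactly two inputs and one output, fires when both input tokens are present, and a boolean token can condition whether a value is emitted (illustrative only; the operator table is an assumption for the example):

```python
# Each hHLDS actor consumes exactly two input tokens and produces at
# most one output token.

def fire(op, x, y):
    ops = {"+": lambda a, b: a + b,
           "*": lambda a, b: a * b,
           "<=": lambda a, b: a <= b}
    return ops[op](x, y)

a, b, c = 1, 2, 3

# Graph (a): a + b*c, the '*' actor's output token feeds the '+' actor.
print(fire("+", a, fire("*", b, c)))  # -> 7

# Graph (b): "if b <= c then a", the boolean token gates emission of a.
if fire("<=", b, c):
    print(a)  # -> 1
```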
hHLDS
Comparison between the two models
[Diagram: the same program expressed as (a) a classical dataflow graph using Switch/Merge actors with T/F-labelled control arcs, and (b) an hHLDS graph of 14 numbered two-input actors]

input (a, c);
b := 1;
repeat
  if a > 1 then a := a / 2
  else a := a * 5;
  b := b * 3;
until b = c;
output (d)
The hHLDS model
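The slide's pseudocode can be run directly. A sketch of a faithful translation, under two stated assumptions: "a / 2" is read as integer division (as a dataflow actor on integer tokens would compute it), and the output d is taken to be the final value of a, since the slide leaves d's binding implicit:

```python
# Translation of: input (a, c); b := 1;
# repeat if a > 1 then a := a / 2 else a := a * 5; b := b * 3
# until b = c; output (d)

def program(a, c):
    b = 1
    while True:          # repeat ... until b = c
        if a > 1:
            a = a // 2   # assumed integer division
        else:
            a = a * 5
        b = b * 3
        if b == c:
            break
    return a             # output (d), assumed to be the final a

print(program(8, 27))  # -> 1  (a: 8 -> 4 -> 2 -> 1 while b: 3 -> 9 -> 27)
```

Note the loop terminates only when b reaches exactly c, i.e. when c is a power of 3; both graph versions in the diagram encode the same repeat/until dependence.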
Dataflow Computational Model
[Diagram: a dataflow machine: initial values and data flow from memory through a graph of '+' actors, and the results flow back to memory]