
Reinvention of Computing for Many-Core Parallelism Requires Addressing Programmer’s Productivity

Uzi Vishkin

Common wisdom [cf. tribal lore collected by DARPA HPCS, 2005]: Programming for parallelism is easy. It is the programming for performance that makes it hard.

Reinvention of Computing for Many-Core Parallelism Requires Addressing Productivity

Uzi Vishkin

A less fatalistic position: Programming for parallelism is easy. But the difficulty of programming for performance depends on the system.

Productivity in Parallel Computing: The large parallel machines story

Funding of productivity: $650M, DARPA HPCS (High Productivity Computing Systems), ~2002.
• Met Gflops goals: up by 1000X since the mid-90’s; Exascale talk & plans. Met power goals. Also: groomed eloquent spokespeople.
• Progress on productivity: no agreed benchmarks, no spokesperson. Elusive! In fact, not much has changed since “as intimidating and time consuming as programming in assembly language” (NSF Blue Ribbon Committee, 2003), or even the “parallel software crisis” (CACM, 1991).
• Common-sense engineering: an untreated bottleneck means diminished returns on other improvements; the bottleneck becomes more critical.
• Next 10 years: new specific programs on flops and power. What about productivity?!
• Reality: an economic island. Cleared by marketing: DOE applications.

Enter: mainstream many-cores. Every CS major should be able to program many-cores.

Coherence Issue. “When you come to a fork in the road, take it!” - Yogi Berra

Camp 1: Many of the US’s best minds opt for occupations that do not involve programming.
• NSF tries to lure them to CS in high school by: (1) presenting the steady march and broad reach of computing across the sciences, industries, culture and society, correcting the current narrow focus on programming in the introductory course [New Programs Aim to Lure Young Into Digital Jobs, NYTimes, 12/09]; (2) productivity; (3) computational thinking.

Camp 2: Power/performance. Reinvent mainstream computing for parallelism.
• Vendors try to build many-cores that require decomposition-first programming. Railroading to a productivity “disaster area”. Hacking. Insufficient support from parallel algorithm design & analysis. Short on outreach/productivity/abstraction.

Unintended outcome of “taking the fork” (prod. vs. power/perf.)
• Camp cheerleaders: core CS (algorithm design & analysis style) is radical. Peer review favors both sides over the center. Centrists as extremists is an oxymoron!
• Building wrong expectations among prospective CS majors. Disappointment will lead to “Get me out of this major”.
• The pool of CS majors to be engaged in decomposition-first is too limited (after subtracting the lured-to-breadth-over-programming and the core).

Consequences of “taking the fork” surrealism
• Eventual casualties: # of students, credibility & productivity.
• Research/comparison of several holistic parallel platforms could: (i) prevent much of the damage, (ii) build up the real diversity needed for natural selection, and (iii) advise the NSF on programs that otherwise could cancel one another.

Lessons from the Invention of Computing
“It should be noted that in comparing codes four viewpoints must be kept in mind, all of them of comparable importance:
• Simplicity and reliability of the engineering solutions required by the code;
• Simplicity, compactness and completeness of the code;
• Ease and speed of the human procedure of translating mathematically conceived methods into the code [“COMPUTATIONAL THINKING”], and also of finding and correcting errors in coding or of applying to it changes that have been decided upon at a later stage;
• Efficiency of the code in operating the machine near its full intrinsic speed.”
- H. Goldstine, J. von Neumann. Planning and Coding of Problems for an Electronic Computing Instrument, 1947

Take home
- Comparing codes is a pivotal and broad issue
- Concern for productivity is as old as computing (development-time)
- Human process: intellectual/algorithm/planning plus skill/coding
- Contrast with: the tendency to understand a HW upgrade from application code (even if the machine is not yet built; A. Ghuloum, Intel, CACM 9/09), an unreasonable expectation of application code developers

How was the “human procedure” addressed? Answer: Basically, by abstraction and induction.

1. Since general-purpose computing is about a platform for your future (whatever) program, as opposed to a specific application, a general method for the human procedure was key

2. GvN47 based coding on mathematical induction (known from math proofs and as an axiom of the natural numbers)

3. It worked for establishing serial computing. This method led to simplicity, compactness and completeness of the resulting code. References:

- Knuth67, The Art of Computer Programming. Vol. 1: Fundamental Algorithms. Chapter 1: Basic Concepts. 1.1 Algorithms. 1.2 Mathematical Preliminaries. 1.2.1 Mathematical Induction. Properties of algorithms: 1. Finiteness. 2. Definiteness. 3. Input. 4. Output. 5. Effectiveness.

Gold standards. Definiteness: induction. Effectiveness: the “uniform cost criterion” [AHU74] abstraction.
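To make the induction “gold standard” concrete, here is a minimal C sketch (a hypothetical example, not from the slides): a serial loop whose correctness is argued inductively via a loop invariant, in the spirit of GvN47.

```c
#include <stdio.h>

/* Serial summation of a[0..n-1].
 * Inductive argument (loop invariant): at the start of iteration i,
 * sum == a[0] + ... + a[i-1].
 * Base case: i == 0 and sum == 0.
 * Inductive step: adding a[i] re-establishes the invariant for i+1.
 * At loop exit (i == n) the invariant yields the desired result. */
static int sum_array(const int *a, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += a[i];
    }
    return sum;
}

int main(void) {
    int a[] = {3, 1, 4, 1, 5};
    printf("%d\n", sum_array(a, 5)); /* prints 14 */
    return 0;
}
```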

“Killer app” for general-purpose many-cores: Let the app-dreamers do their magic

• Oxymoron? General-purpose means no one application in particular. Not really: if possible, a killer application would be helpful.
• However, it is wrong as a condition for progress. General-purpose computing is an infrastructure for the IT sector and the economy.
• The general-purpose computing infrastructure has been realized by the software spiral (the cyclic process of hardware improvements leading to software improvements that lead back to hardware improvements, and so on; Andy Grove, Intel).
• Instituting a parallel software spiral is a killer application for many-cores: as in the past, app-dreamers will invent uses.

Not surprisingly, the killer application is also an infrastructure.
• Government has a role in building infrastructure. Instituting a parallel software spiral merits government funding. However, there is insufficient empowerment for creating and developing alternative platforms to the point of establishing their merit.

Serial Abstraction & A Parallel Counterpart Example
• Rudimentary abstraction that made serial computing simple: any single instruction available for execution in a serial program executes immediately.

Abstracts away different execution times for different operations (e.g., the memory hierarchy). Used by programmers to conceptualize serial computing and supported by hardware and compilers. The program provides the instruction to be executed next (inductively).

• Rudimentary abstraction for making parallel computing simple: indefinitely many instructions that are available for concurrent execution execute immediately, dubbed Immediate Concurrent Execution (ICE).

Step-by-step (inductive) explication of the instructions available next for concurrent execution. The # of processors is not even mentioned. Falls back on the serial abstraction if there is 1 instruction per step.

[Figure: serial execution, based on the serial abstraction, vs. parallel execution, based on the parallel abstraction. The parallel view asks: what could I do in parallel at each step, assuming unlimited hardware? Plotting # of ops against time: serially, Time = Work (total # of ops); in parallel, Time << Work.]
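As a concrete illustration of the ICE/work-depth way of describing an algorithm, here is a minimal C sketch (an illustrative example, not XMT code: the inner loop merely simulates, serially, one round of operations that ICE would execute concurrently): summing an array by rounds of pairwise additions. The work is N-1 additions, while the number of rounds, i.e., the time under ICE, is only log2(N), so Time << Work.

```c
#include <stdio.h>

#define N 8

int main(void) {
    int a[N] = {3, 1, 4, 1, 5, 9, 2, 6};
    int rounds = 0, work = 0;

    /* Work-depth (ICE-style) description of summation: in each round,
     * all available pairwise additions "execute immediately"; the
     * serial inner loop below only simulates one such round. */
    for (int stride = 1; stride < N; stride *= 2) {
        for (int i = 0; i + stride < N; i += 2 * stride) {
            a[i] += a[i + stride];  /* concurrent under ICE */
            work++;
        }
        rounds++;                   /* one time step of the abstraction */
    }

    /* Here: rounds = log2(N) = 3, work = N - 1 = 7, so Time << Work. */
    printf("sum=%d rounds=%d work=%d\n", a[0], rounds, work);
    return 0;
}
```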

CACM’10: Using simple abstraction to guide the reinvention of computing for parallelism

[Overall: old Work-Depth description. Only “minimalist abstraction”: ICE builds only on induction, itself a rudimentary concept]

• [SV82] conjectured that the rest (the full PRAM algorithm) is just a matter of skill

• Lots of evidence that “work-depth” works. Used as framework in PRAM algorithms texts: JaJa-92, KKT-01

• ICE is in line with the PRAM: the only really successful parallel algorithmic theory. A latent, though not widespread, knowledge base

• Widely agreed: work&depth are necessary. Jury is out on: what else. Our position: as little as possible.
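For reference, the standard bridge from the work-depth measures to running time on a real machine (a textbook scheduling bound found in the PRAM texts cited above, e.g., JaJa-92; it is not stated on the slide) is

$$ T_p = O\!\left(\frac{\mathrm{Work}}{p} + \mathrm{Depth}\right), $$

where $T_p$ is the time on $p$ processors: with $p = 1$ this reduces to Time = Work (the serial case), while with enough processors Time approaches Depth, i.e., Time << Work.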

Workflow from parallel algorithms to programming versus trial-and-error

[Figure: two workflows from a parallel algorithm to hardware. Option 1 (trial & error): parallel algorithmic thinking (PAT); domain decomposition or task decomposition; program; prove correctness; tune (while inter-thread bandwidth is insufficient: rethink the algorithm to take better advantage of cache); hardware. Option 2: parallel algorithmic thinking (ICE/WD/PRAM); program (still correct); compiler (still correct); hardware.]

Is Option 1 good enough for the parallel programmer’s model? Options 1B and 2 start with a PRAM algorithm, but not option 1A. Options 1A and 2 represent a workflow, but not option 1B.

Not possible in the 1990s. Possible now: XMT@UMD. Why settle for less?

Mark Twain on the PRAM

We should be careful to get out of an experience only the wisdom that is in it— and stop there; lest we be like the cat that sits down on a hot stove-lid. She will never sit down on a hot stove-lid again— and that is well; but also she will never sit down on a cold one anymore— Mark Twain

PRAM algorithms did not become standard CS knowledge in 1988-90 because of the “hot stove-lid”: no 1990s implementable computer architecture allowed programmers to look at a computer as a PRAM

The XMT project @UMD changed that

PS: NVidia was happy to report success with 2 PRAM algorithms in IPDPS09. Great to see that from a major vendor.

[These 2 algorithms are decomposition-based, unlike most PRAM algorithms. Freshmen programmed the same 2 algorithms on our XMT machine.]

The Parallel Programmer’s Productivity Landscape. Postulation: a continental divide.

How different can the productivity of many-core architectures be? Answer: very! Metaphor: raindrops falling a short distance apart can have very different outcomes; think of programmer’s productivity as the cost of producing usable water. The decomposition-first programming side requires domain decomposition or task decomposition, which have not worked in spite of big investment (it looks greener, since it has been invested in; but what if its water goes to the ocean while the arid side drains to Sweetwater?). The work-depth initial abstraction is decomposition-free (arid, under-invested); it requires a leap of faith for investment.

[Figure: continental-divide map. The side labeled “Decomposition-first programming” drains to the Ocean; the side labeled “Work-depth programming” drains to the Great Lakes.]

Validation of Ease of Programming To Date
1. Comparison with MPI by DARPA-HPCS SW engineering leaders [HochsteinBasiliVGilbert].
2. Teachability demonstrated so far [TorbertVTzurEllison, SIGCSE’10 to appear]:
- To a freshman class with 11 non-CS students. Some programming assignments: median finding, merge-sort, integer-sort & sample-sort. Other teachers:
- Magnet HS teacher. Downloaded the simulator, assignments, and class notes from the XMT page. Self-taught. Recommends: teach XMT first. Easiest to set up (simulator), program, and analyze: ability to anticipate performance (as in serial). Works not just for the embarrassingly parallel. Also teaches OpenMP, MPI, CUDA. Look up the keynote at CS4HS’09@CMU + the interview with the teacher.
- High school & middle school students (some 10 years old) from underrepresented groups, taught by an HS math teacher.

Teachability: a necessary (but not sufficient) condition for ease-of-programming, which is itself a necessary (but not sufficient) condition for productivity.

Hence, teachability is as good a benchmark as any out there for productivity.

Conclusion
- Want future mainstream programmers to embrace general-purpose parallelism (every CS major; for common SW architectures). Yet, in the past:
- Insufficient evidence on productivity. Yet, a history of repeated surprise: parallel machines repel programmers.

Research Drivers

1. Empower select holistic (HW+SW) parallel platforms for merit-based comparison. Imagine a new world with the given platform. Consider all aspects: e.g., is it sufficient for reinstating the SW spiral? Is the barrier-to-entry for creative applications low enough? How will the CS curriculum look? Who will be attracted to study CS? Then, gather evidence:

2. Methodically compare the productivity (development-time, run-time) of platforms. Ownership-stake role for an Indian partner (Prof. PJ Narayan, IIIT, Hyderabad): India is the largest producer of SW; a new platform requires sufficient Indian interest; lead the benchmarking/comparison for productivity, etc.

For this session: coming from algorithms, computer vision and computational biology, compare select platforms for performance, for productivity (development-time and run-time), and overall for reinstating the SW spiral. Benchmark algorithms and applications based on their inherent parallelism for future machine platforms, as opposed to using existing code written for yesterday’s (serial or parallel) machines. Issue: how to benchmark for productivity?

Not just a theory. XMT: prototyped HW&SW

There has never been a successful general-purpose parallel computer (easy to program, good speedups, up & down scalable). IF you could program it: great speedups.

Motivation: Fix the IF

64-core, 75 MHz FPGA prototype [SPAA’07, Computing Frontiers’08]. Original explicit multi-threaded (XMT) architecture [SPAA98].

Interconnection network for 128 cores. 9mm x 5mm, IBM 90nm process. 400 MHz prototype [HotInterconnects’07].

Same design as the 64-core FPGA. 10mm x 10mm, IBM 90nm process. 150 MHz prototype.

The design scales to 1000+ cores on-chip

Programmer’s Model: Engineering Workflow
• Arbitrary CRCW work-depth algorithm. Reason about correctness & complexity in the synchronous model.
• SPMD reduced synchrony
– Threads advance at their own speed, not in lockstep.
– Main construct: the spawn-join block. Note: can start any number of processes at once. Can express locality (“decomposition-second”).
– Prefix-sum (ps). Independence of order semantics (IOS). (A sketch of the spawn-join/ps pattern appears below.)
– Establish correctness & complexity by relating to the WD analyses.
– Circumvents “the problem with threads”, e.g., [Lee].
• Tune (compiler or expert programmer): (i) length of the sequence of round trips to memory, (ii) QRQW, (iii) WD. [VCL08]

• Trial & error contrast: similar start, then: while (insufficient inter-thread bandwidth) do { rethink algorithm to take better advantage of cache }

[Figure: program execution as a sequence of spawn ... join parallel segments interleaved with serial code.]
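Here is a minimal sketch of the spawn-join/prefix-sum pattern in plain C (an illustration only: a serial loop and a fetch-and-add helper stand in for XMT’s spawn-join block and ps instruction, whose exact XMTC syntax is not shown here): array compaction, where each virtual thread that finds a nonzero element uses a prefix-sum on a shared counter to claim a unique slot in the output. Independence of order semantics means that any arrival order of the ps operations yields a correct (if differently ordered) result.

```c
#include <stdio.h>

/* Plain-C stand-ins for the XMT constructs (illustrative only). */
static int base = 0;                 /* shared ps base counter            */
static int ps(int increment) {       /* fetch-and-add, standing in for ps */
    int old = base;
    base += increment;
    return old;
}

int main(void) {
    int a[8] = {0, 7, 0, 3, 0, 0, 5, 2};
    int out[8];

    /* "spawn(0, 7)": one virtual thread per index i; the serial loop
     * below only simulates them.  The threads are independent, so the
     * order in which their ps calls occur does not matter (IOS). */
    for (int i = 0; i < 8; i++) {
        if (a[i] != 0) {
            int slot = ps(1);        /* claim a unique output slot        */
            out[slot] = a[i];
        }
    }
    /* "join": all threads are done; base holds the compacted length.     */

    for (int i = 0; i < base; i++) printf("%d ", out[i]);
    printf("\n");                    /* prints: 7 3 5 2                   */
    return 0;
}
```

On XMT the per-index threads would run concurrently and ps would be the architecture’s hardware prefix-sum primitive; the point of the sketch is the decomposition-free style the workflow slide advocates.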

Performance
• Simulation of 1024 processors: 100X speedup on a standard benchmark suite for VHDL gate-level simulation [GV06].

• [SPAA’09]: ~10X relative to an Intel Core 2 Duo with a 64-processor XMT; same silicon area as 1 commodity processor (core).

• Promise of 100X with 1024 processors also for irregular, fine-grained parallelism with up- and down-scalability.

Some Credits
Grad students: George Caragea, James Edwards, David Ellison, Fuat Keceli, Beliz Saybasili, Alex Tzannes. Recent grads: Aydin Balkan, Mike Horak, Xingzhi Wen.
• Industry design experts (pro bono)
• Rajeev Barua, compiler. Co-advisor of 2 CS grad students. 2008 NSF grant.
• Gang Qu, VLSI and power. Co-advisor.
• Steve Nowick, Columbia U., asynchronous computing. Co-advisor. 2008 NSF team grant.
• Ron Tzur, U. Colorado, K12 education. Co-advisor. 2008 NSF seed funding. K12: Montgomery Blair Magnet HS, MD; Thomas Jefferson HS, VA; Baltimore (inner city) Ingenuity Project; Middle School 2009 Summer Camp, Montgomery County Public Schools.
• Marc Olano, UMBC, computer graphics. Co-advisor.
• Tali Moreshet, Swarthmore College, power. Co-advisor.
• Bernie Brooks, NIH. Co-advisor.
• Marty Peckerar, microelectronics.
• Igor Smolyaninov, electro-optics.
• Funding: NSF, NSA 2008 deployed XMT computer, NIH.
• 6 issued patents. More patent applications.
• Informal industry partner: Intel.
• Reinvention of Computing for Parallelism: selected as a Maryland Research Center of Excellence (MRCE) by USM. Not yet funded. 17 members, including UMBC, UMBI, UMSOM. Mostly applications.