petabricks: a language and compiler based on autotuning · 10/17/2011 · • joe is oblivious...
PetaBricks: A Language and Compiler based on Autotuning
Saman Amarasinghe
Joint work with Jason Ansel, Marek Olszewski, Cy Chan, Yee Lok Wong, Maciej Pacula, Una-May O’Reilly and Alan Edelman
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
Tuesday, October 25, 2011
Outline
• The Three Side Stories
– Performance and Parallelism with Multicores
– Future Proofing Software
– Evolution of Programming Languages
• Three Observations
• PetaBricks
– Language
– Compiler
– Results
– Variable Precision
– Sibling Rivalry
Today: The Happily Oblivious Average Joe Programmer
• Joe is oblivious about the processor
– Moore’s law brings Joe performance
– Sufficient for Joe’s requirements
• Joe has built a solid boundary between Hardware and Software
– High level languages abstract away the processors
– Ex: Java bytecode is machine independent
• This abstraction has provided a lot of freedom for Joe
• Parallel Programming is only practiced by a few experts
[Figure: Moore’s Law. Number of transistors (10,000 to 1,000,000,000) and performance relative to the VAX-11/780 (0.1 to 100,000), 1978 to 2016, for the 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium and Itanium 2. Growth-rate annotations: 25%, 52%, ??%. From David Patterson; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
[Figure: Uniprocessor Performance (SPECint). Performance relative to the VAX-11/780 (0.1 to 100,000) and number of transistors (10,000 to 1,000,000,000), 1978 to 2016, for the 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium and Itanium 2. Growth-rate annotations: 25%, 52%. From David Patterson; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
Squandering of the Moore’s Dividend
• 10,000x performance gain in 30 years! (~46% per year)
• Where did this performance go?
• Last decade we concentrated on correctness and programmer productivity
• Little to no emphasis on performance
• This is reflected in:
– Languages
– Tools
– Research
– Education
• Software Engineering: Only engineering discipline where performance or efficiency is not a central theme
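The growth figures on this slide can be checked with ordinary compound-growth arithmetic. The sketch below is only that arithmetic; the function names are illustrative and do not come from the talk. Under this formula, a 10,000x gain spread over a full 30 years corresponds to roughly 36% per year, while a 46% annual rate reaches 10,000x in roughly 24 years.

```python
import math


def annual_growth_rate(total_gain: float, years: float) -> float:
    """Constant yearly rate r such that (1 + r) ** years == total_gain."""
    return total_gain ** (1.0 / years) - 1.0


def years_for_gain(total_gain: float, annual_rate: float) -> float:
    """Years needed to reach total_gain when growing by annual_rate each year."""
    return math.log(total_gain) / math.log(1.0 + annual_rate)


rate_over_30_years = annual_growth_rate(10_000, 30)   # about 0.36
years_at_46_percent = years_for_gain(10_000, 0.46)    # about 24.3
```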
Matrix Multiply
• Abstraction and Software Engineering
– Immutable Types
– Dynamic Dispatch
– Object Oriented
• High Level Languages
• Memory Management
– Transpose for unit stride
– Tile for cache locality
• Vectorization
• Prefetching
• Parallelization
Speedup shown across successive builds of this slide: 220x, 522x, 1,117x, 2,271x, 7,514x, 12,316x, 33,453x, 87,042x, 296,260x
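Two of the memory optimizations named on this slide, transposing for unit stride and tiling for cache locality, can be sketched as loop restructurings. The code below is a minimal illustration and not the implementation measured in the talk: the function names, the list-of-lists matrix representation and the default tile size are all assumptions. In pure Python the interpreter overhead hides the cache effect, so the sketch shows only how the loops change; the speedups on the slide come from compiled code.

```python
def matmul_naive(a, b):
    """Textbook triple loop: b is read down a column, one whole row apart per step."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            total = 0.0
            for k in range(m):
                total += a[i][k] * b[k][j]
            c[i][j] = total
    return c


def matmul_transposed_tiled(a, b, tile=2):
    """Same product, with b transposed once and the loops blocked into tiles."""
    n, m, p = len(a), len(b), len(b[0])
    # Transpose for unit stride: both operands are now walked along rows.
    bt = [list(column) for column in zip(*b)]
    c = [[0.0] * p for _ in range(n)]
    # Tile for locality: finish one tile-sized block of work before moving on.
    for i0 in range(0, n, tile):
        for j0 in range(0, p, tile):
            for k0 in range(0, m, tile):
                for i in range(i0, min(i0 + tile, n)):
                    row_a = a[i]
                    for j in range(j0, min(j0 + tile, p)):
                        row_bt = bt[j]
                        total = c[i][j]
                        for k in range(k0, min(k0 + tile, m)):
                            total += row_a[k] * row_bt[k]
                        c[i][j] = total
    return c
```

Both functions return the same matrix; only the order in which memory is touched differs, which is why the best tile size depends on the cache of the machine running the code.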
Matrix Multiply
• Typical Software Engineering Approach
– In Java
– Object oriented
– Immutable
– Abstract types
– No memory optimizations
– No parallelization
• Good Performance Engineering Approach
– In C/Assembly
– Memory optimized (blocked)
– BLAS libraries
– Parallelized (to 4 cores)
296,260x
• In Comparison: Lowest to Highest MPG in transportation
14,700x
294,000x
[Figure: Uniprocessor Performance (SPECint). Performance relative to the VAX-11/780 (0.1 to 100,000) and number of transistors (10,000 to 1,000,000,000), 1978 to 2016, for the 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium and Itanium 2. Growth-rate annotations: 25%, 52%, ??%. From David Patterson; from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006.]
Performance and Parallelism
• No more automatic performance gains → Performance has to come from somewhere else
– Better languages
– Disciplined programming
– Performance engineering
– Plus…
• Parallelism
– Moore’s law morphed from providing performance to providing parallelism
– But… Parallelism IS performance
Joe the Parallel Programmer
• Moore’s law is not bringing any more performance gains
• If Joe needs performance he has to deal with multicores
– Joe has to deal with performance
– Joe has to deal with parallelism
Can Joe Handle This?
• Today: Programmer is oblivious to performance.
• Current Trajectory: Programmer handles parallelism and performance tuning.
• Better Trajectory: Programmer handles concurrency. Compiler finds the best parallel mapping and optimizes for performance.
Conquering the Multicore Menace
• Parallelism Extraction
– The world is parallel, but most computer science is based in sequential thinking
– Parallel Languages
  – Natural way to describe the maximal concurrency in the problem
– Parallel Thinking
  – Theory, Algorithms, Data Structures → Education
• Parallelism Management
– Mapping algorithmic parallelism to a given architecture
– Find the best performance possible
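One simple way to "find the best performance possible" on a given machine is empirical search: run each candidate configuration and keep the one that measures best. The sketch below illustrates that idea only and is not the PetaBricks tuner; `autotune`, `candidates`, `run` and `measure` are hypothetical names introduced here. The `measure` hook is optional so that a cost model or a repeated-trial timer can replace the single wall-clock measurement.

```python
import time


def autotune(candidates, run, measure=None):
    """Return the candidate configuration with the lowest measured cost."""
    if measure is None:
        # Default cost: wall-clock time of one call to run(candidate).
        def measure(candidate):
            start = time.perf_counter()
            run(candidate)
            return time.perf_counter() - start

    best, best_cost = None, float("inf")
    for candidate in candidates:
        cost = measure(candidate)
        if cost < best_cost:
            best, best_cost = candidate, cost
    return best
```

A caller might pass a list of tile sizes or thread counts as `candidates` and a function that runs the kernel with that setting as `run`. Because the search is repeated on each target machine, the same source can settle on different configurations for different architectures.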
In the meantime… the experts practicing
• They needed to get the last ounce of the performance from hardware
• They had problems that are too big or too hard
• They worked on the biggest newest machines
• Porting the software to take advantage of the latest hardware features
• Spending years (lifetimes) on a specific kernel
Lifetime of Software >> Hardware
• Lifetime of a software application is 30+ years
• Lifetime of a computer system is less than 6 years
• New hardware every 3 years
• Multiple Ports
• “Software Quality deteriorates in each port”
• Huge problem for these expert programmers
Not a problem for Joe
[Figure: number of cores per chip (1 to 512) by year (1970 to 20??). Processors plotted: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, Raw, Power4, Opteron, Power6, Niagara, Yonah, PExtreme, Tanglewood, Cell, Intel Tflops, Xbox360, Cavium Octeon, Raza XLR, PA-8800, Cisco CSR-1, Picochip PC102, Broadcom 1480, Opteron 4P, Xeon MP, Ambric AM2045. Annotation: Program written in 1970 still works and is much faster today.]
• Moore’s law gains were sufficient
• Targeted the same machine model from 1970 to now
• New reality: changing machine model
• Joe is in the same boat with the expert programmers
Future Proofing Software
• No single machine model anymore
– Between different processor types
– Between different generations within the same family
• Programs need to be written once and used anywhere, anytime
– Java did it for portability
– We need to do it for performance
n To be an effective language that can future-proof programsn Restrict the choices when a property is hard to automate or constant
across architectures of current and future à expose to the usern Features that are automatable and variable à hide from the user
n A little forevern Hide the architectural detailsn Good solutions not visiblen Mediocre performance n But will work forevern Ex: HPF
n A lot nown Expose the architectural detailsn Good performance nown In a local miniman Will be obsolete soonn Heroic effort needed to get outn Ex: MPI
Languages and Future Proofing
Tuesday, October 25, 2011
![Page 55: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/55.jpg)
n To be an effective language that can future-proof programsn Restrict the choices when a property is hard to automate or constant
across architectures of current and future à expose to the user
n A little forevern Hide the architectural detailsn Good solutions not visiblen Mediocre performance n But will work forevern Ex: HPF
n A lot nown Expose the architectural detailsn Good performance nown In a local miniman Will be obsolete soonn Heroic effort needed to get outn Ex: MPI
Languages and Future Proofing
Tuesday, October 25, 2011
![Page 56: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/56.jpg)
Languages and Future Proofing
• To be an effective language that can future-proof programs:
– Restrict the choices when a property is hard to automate or constant across current and future architectures → expose to the user
– Features that are automatable and variable → hide from the user
• A little forever
– Hide the architectural details
– Good solutions not visible
– Mediocre performance
– But will work forever
– Ex: HPF
• A lot now
– Expose the architectural details
– Good performance now
– In a local minimum
– Will be obsolete soon
– Heroic effort needed to get out
– Ex: MPI
Ancient Days…
• Computers had limited power
• Compiling was a daunting task
• Languages helped by limiting choice
• Overconstrained programming languages that express only a single choice of:
– Algorithm
– Iteration order
– Data layout
– Parallelism strategy
…as we progressed…
• Computers got faster
• More cycles available to the compiler
• Wanted to optimize the programs, to make them run better and faster
…and we ended up at
• Computers are extremely powerful
• Compilers want to do a lot
• But…the same old overconstrained languages
– They don’t provide too many choices
• Heroic analysis to rediscover some of the choices
– Data dependence analysis
– Data flow analysis
– Alias analysis
– Shape analysis
– Interprocedural analysis
– Loop analysis
– Parallelization analysis
– Information flow analysis
– Escape analysis
– …
Need to Rethink Languages
• Give Compiler a Choice
– Express ‘intent’ not ‘a method’
– Be as verbose as you can
• Muscle outpaces brain
– Compute cycles are abundant
– Complex logic is too hard
Observation 1: Algorithmic Choice
• For many problems there are multiple algorithms
– In most cases there is no single winner
– An algorithm will be the best performing for a given:
  – Input size
  – Amount of parallelism
  – Communication bandwidth / synchronization cost
  – Data layout
  – Data itself (sparse data, convergence criteria etc.)
• Multicores expose many of these to the programmer
– Exponential growth of cores (impact of Moore’s law)
– Wide variation of memory systems, type of cores etc.
• No single algorithm can be the best for all the cases
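A minimal Python sketch of how input size alone can change the winning algorithm: insertion sort has low overhead on tiny inputs, merge sort scales better on large ones, and a hybrid switches between them at a cutoff. The function names and the default `cutoff` value are illustrative assumptions, not part of PetaBricks; the best cutoff would differ from machine to machine.

```python
def insertion_sort(xs):
    """O(n^2) comparisons, but very low overhead on tiny inputs."""
    out = list(xs)
    for i in range(1, len(out)):
        key = out[i]
        j = i - 1
        while j >= 0 and out[j] > key:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = key
    return out


def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


def hybrid_sort(xs, cutoff=16):
    """Pick an algorithm per sub-problem based on its size.

    `cutoff` is a hypothetical tunable; inputs of at most max(cutoff, 1)
    elements go to insertion sort, larger ones are split and merged.
    """
    if len(xs) <= max(cutoff, 1):
        return insertion_sort(xs)
    mid = len(xs) // 2
    return merge(hybrid_sort(xs[:mid], cutoff), hybrid_sort(xs[mid:], cutoff))
```

Every cutoff gives the same sorted result; only the running time changes, which is what makes the cutoff a safe parameter to search over.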
Observation 2: Natural Parallelism
• The world is a parallel place
– It is natural to many, e.g. mathematicians
  – ∑, sets, simultaneous equations, etc.
• It seems that computer scientists have a hard time thinking in parallel
– We have unnecessarily imposed sequential ordering on the world
  – Statements executed in sequence
  – for i = 1 to n
  – Recursive decomposition (given f(n) find f(n+1))
• This was useful at one time to limit the complexity… but it is a big problem in the era of multicores
Observation 3: Autotuning
• Good old days → model-based optimization
• Now
– Machines are too complex to accurately model
– Compiler passes have many subtle interactions
– Thousands of knobs and billions of choices
• But…
– Computers are cheap
– We can do end-to-end execution of multiple runs
– Then use machine learning to find the best choice
[Figure: stacked layers labeled Algorithmic Complexity, Compiler Complexity, Memory System Complexity, Processor Complexity]
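The "run it and measure" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `measure_seconds`, `autotune`, and `workload` are hypothetical, and a plain exhaustive search over a handful of candidates stands in for the machine learning step mentioned on the slide.

```python
import time


def measure_seconds(run, config, repeats=3):
    """Time `run(config)` end to end and keep the fastest of a few repeats."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run(config)
        best = min(best, time.perf_counter() - start)
    return best


def autotune(candidates, cost):
    """Return the candidate with the lowest measured cost, plus all costs.

    Exhaustive search; a real autotuner would explore a far larger space
    with a smarter search strategy.
    """
    costs = {config: cost(config) for config in candidates}
    best = min(costs, key=costs.get)
    return best, costs


def workload(block_size, n=20000):
    """Toy workload (hypothetical): sum a list in chunks of `block_size`."""
    data = list(range(n))
    return sum(sum(data[i:i + block_size]) for i in range(0, n, block_size))
```

A call such as `autotune([1, 8, 64, 512], lambda b: measure_seconds(workload, b))` would report whichever block size happened to run fastest on the current machine; the answer is expected to vary between machines, which is the point of measuring instead of modeling.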
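PetaBricks Language
transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
  // Base case, compute a single element
  to(AB.cell(x,y) out)
  from(A.row(y) a, B.column(x) b) {
    out = dot(a, b);
  }
}
• Implicitly parallel description
[Figure: matrix A with dimensions c and h, B with dimensions w and c, and AB with dimensions w and h; cell (x, y) of AB is computed from row y of A and column x of B]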
PetaBricks Language
transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
  // Base case, compute a single element
  to(AB.cell(x,y) out)
  from(A.row(y) a, B.column(x) b) {
    out = dot(a, b);
  }

  // Recursively decompose in c
  to(AB ab)
  from(A.region(0,   0, c/2, h) a1,
       A.region(c/2, 0, c,   h) a2,
       B.region(0, 0,   w, c/2) b1,
       B.region(0, c/2, w, c  ) b2) {
    ab = MatrixAdd(MatrixMultiply(a1, b1),
                   MatrixMultiply(a2, b2));
  }
}
• Implicitly parallel description
• Algorithmic choice
[Figure: A split into a1 and a2, B split into b1 and b2, producing AB]
PetaBricks Language
transform MatrixMultiply
from A[c,h], B[w,c]
to AB[w,h]
{
  // Base case, compute a single element
  to(AB.cell(x,y) out)
  from(A.row(y) a, B.column(x) b) {
    out = dot(a, b);
  }

  // Recursively decompose in c
  to(AB ab)
  from(A.region(0,   0, c/2, h) a1,
       A.region(c/2, 0, c,   h) a2,
       B.region(0, 0,   w, c/2) b1,
       B.region(0, c/2, w, c  ) b2) {
    ab = MatrixAdd(MatrixMultiply(a1, b1),
                   MatrixMultiply(a2, b2));
  }

  // Recursively decompose in w
  to(AB.region(0,   0, w/2, h) ab1,
     AB.region(w/2, 0, w,   h) ab2)
  from(A a,
       B.region(0,   0, w/2, c) b1,
       B.region(w/2, 0, w,   c) b2) {
    ab1 = MatrixMultiply(a, b1);
    ab2 = MatrixMultiply(a, b2);
  }

  // Recursively decompose in h
  to(AB.region(0, 0,   w, h/2) ab1,
     AB.region(0, h/2, w, h  ) ab2)
  from(A.region(0, 0,   c, h/2) a1,
       A.region(0, h/2, c, h  ) a2,
       B b) {
    ab1 = MatrixMultiply(a1, b);
    ab2 = MatrixMultiply(a2, b);
  }
}
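To see that the four rules really are interchangeable ways of computing the same product, here is a rough Python analogue. It is a sketch under assumptions: matrices are plain lists of rows (so A has h rows and c columns, B has c rows and w columns), and the `choose(c, h, w)` callback is a hypothetical stand-in for whatever configuration a tuner would produce.

```python
def base(A, B):
    """Base case: every output cell is a dot product of a row and a column."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def add(X, Y):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]


def multiply(A, B, choose):
    """Multiply A (h rows, c cols) by B (c rows, w cols).

    `choose(c, h, w)` returns one of "base", "c", "w", "h"; a rule whose
    dimension is too small to split falls back to the base case.
    """
    h, c, w = len(A), len(B), len(B[0])
    rule = choose(c, h, w)
    if rule == "c" and c >= 2:
        m = c // 2  # split the shared dimension, then add the partial products
        a1, a2 = [r[:m] for r in A], [r[m:] for r in A]
        return add(multiply(a1, B[:m], choose), multiply(a2, B[m:], choose))
    if rule == "w" and w >= 2:
        m = w // 2  # split the output into left and right halves
        left = multiply(A, [r[:m] for r in B], choose)
        right = multiply(A, [r[m:] for r in B], choose)
        return [l + r for l, r in zip(left, right)]
    if rule == "h" and h >= 2:
        m = h // 2  # split the output into top and bottom halves
        return multiply(A[:m], B, choose) + multiply(A[m:], B, choose)
    return base(A, B)
```

Any `choose` function yields the same result, for example `multiply(A, B, lambda c, h, w: "c" if c > 1 else "w")`; only the amount of work, its locality, and the available parallelism differ.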
PetaBricks Language
transform Strassen
from A11[n,n], A12[n,n], A21[n,n], A22[n,n],
     B11[n,n], B12[n,n], B21[n,n], B22[n,n]
through M1[n,n], M2[n,n], M3[n,n], M4[n,n], M5[n,n], M6[n,n], M7[n,n]
to C11[n,n], C12[n,n], C21[n,n], C22[n,n]
{
  to(M1 m1) from(A11 a11, A22 a22, B11 b11, B22 b22) using(t1[n,n], t2[n,n]) {
    MatrixAdd(t1, a11, a22);
    MatrixAdd(t2, b11, b22);
    MatrixMultiplySqr(m1, t1, t2);
  }
  to(M2 m2) from(A21 a21, A22 a22, B11 b11) using(t1[n,n]) {
    MatrixAdd(t1, a21, a22);
    MatrixMultiplySqr(m2, t1, b11);
  }
  to(M3 m3) from(A11 a11, B12 b12, B22 b22) using(t1[n,n]) {
    MatrixSub(t1, b12, b22);
    MatrixMultiplySqr(m3, a11, t1);
  }
  to(M4 m4) from(A22 a22, B21 b21, B11 b11) using(t1[n,n]) {
    MatrixSub(t1, b21, b11);
    MatrixMultiplySqr(m4, a22, t1);
  }
  to(M5 m5) from(A11 a11, A12 a12, B22 b22) using(t1[n,n]) {
    MatrixAdd(t1, a11, a12);
    MatrixMultiplySqr(m5, t1, b22);
  }
  to(M6 m6) from(A21 a21, A11 a11, B11 b11, B12 b12) using(t1[n,n], t2[n,n]) {
    MatrixSub(t1, a21, a11);
    MatrixAdd(t2, b11, b12);
    MatrixMultiplySqr(m6, t1, t2);
  }
  to(M7 m7) from(A12 a12, A22 a22, B21 b21, B22 b22) using(t1[n,n], t2[n,n]) {
    MatrixSub(t1, a12, a22);
    MatrixAdd(t2, b21, b22);
    MatrixMultiplySqr(m7, t1, t2);
  }
  to(C11 c11) from(M1 m1, M4 m4, M5 m5, M7 m7) {
    MatrixAddAddSub(c11, m1, m4, m7, m5);
  }
  to(C12 c12) from(M3 m3, M5 m5) {
    MatrixAdd(c12, m3, m5);
  }
  to(C21 c21) from(M2 m2, M4 m4) {
    MatrixAdd(c21, m2, m4);
  }
  to(C22 c22) from(M1 m1, M2 m2, M3 m3, M6 m6) {
    MatrixAddAddSub(c22, m1, m3, m6, m2);
  }
}
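The seven products and four sums in the transform can be checked numerically with a small Python sketch that uses scalars in place of the n×n blocks. One assumption is made here: `MatrixAddAddSub(out, x, y, z, s)` is taken to compute x + y + z − s, which is what Strassen's identities require but is not spelled out on the slide.

```python
def strassen_2x2(a11, a12, a21, a22, b11, b12, b21, b22):
    """Seven products M1..M7 followed by the four output combinations."""
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    c11 = m1 + m4 + m7 - m5  # assumed meaning of MatrixAddAddSub
    c12 = m3 + m5
    c21 = m2 + m4
    c22 = m1 + m3 + m6 - m2  # assumed meaning of MatrixAddAddSub
    return c11, c12, c21, c22


def direct_2x2(a11, a12, a21, a22, b11, b12, b21, b22):
    """Textbook 2x2 product with eight multiplications, for comparison."""
    return (a11 * b11 + a12 * b21, a11 * b12 + a12 * b22,
            a21 * b11 + a22 * b21, a21 * b12 + a22 * b22)
```

Both functions return the same four values for any inputs; Strassen trades one multiplication for extra additions, which only pays off for large enough blocks, so it is a natural candidate for a tuned choice.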
Language Support for Algorithmic Choice
• Algorithmic choice is the key aspect of PetaBricks
• Programmer can define multiple rules to compute the same data
• Compiler reuses rules to create hybrid algorithms
• Can express choices at many different granularities
Synthesized Outer Control Flow
• Outer control flow synthesized by compiler
• Another choice that the programmer should not make
– By rows?
– By columns?
– Diagonal? Reverse order? Blocked?
– Parallel?
• Instead programmer provides explicit producer-consumer relations
• Allows compiler to explore choice space
Another Example
transform RollingSum
from A[n]
to B[n]
{
  // rule 0: use the previously computed value
  B.cell(i) from (A.cell(i) a, B.cell(i-1) leftSum) {
    return a + leftSum;
  }

  // rule 1: sum all elements to the left
  B.cell(i) from (A.region(0, i) in) {
    return sum(in);
  }
}
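A rough Python rendering of the two RollingSum rules shows the trade-off between them. It assumes that the region read by rule 1 includes element i, so that both rules produce the same inclusive prefix sum; the `use_rule0` callback is a hypothetical stand-in for a per-cell choice made by a tuner.

```python
def rule0(a, b, i):
    """Rule 0: reuse the previously computed value B[i-1]. Needs i >= 1."""
    return a[i] + b[i - 1]


def rule1(a, i):
    """Rule 1: sum A[0..i] directly (assumes the region includes index i)."""
    return sum(a[: i + 1])


def rolling_sum(a, use_rule0):
    """Fill B left to right, picking a rule for each cell.

    Rule 0 is only legal for i >= 1, so cell 0 always uses rule 1.
    """
    b = [0] * len(a)
    for i in range(len(a)):
        if i >= 1 and use_rule0(i):
            b[i] = rule0(a, b, i)
        else:
            b[i] = rule1(a, i)
    return b
```

Rule 0 does constant work per cell but forces a left-to-right order; rule 1 does work proportional to i per cell, yet every cell is independent and could be computed in parallel. Mixing the two, for example using rule 1 at block boundaries and rule 0 inside each block, is the kind of hybrid the slides refer to.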
Compilation Process
• Applicable Regions
• Choice Grids
• Choice Dependency Graphs
Applicable Regions
// rule 0: use the previously computed value
B.cell(i) from (A.cell(i) a, B.cell(i-1) leftSum) {
  return a + leftSum;
}
Applicable Region: 1 ≤ i < n

// rule 1: sum all elements to the left
B.cell(i) from (A.region(0, i) in) {
  return sum(in);
}
Applicable Region: 0 ≤ i < n
![Page 104: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/104.jpg)
Choice Grids
• Divide data space into symbolic regions with common sets of choices
• In this simple example:
  – A: Input (no choices)
  – B: [0; 1) = rule 1
  – B: [1; n) = rule 0 or rule 1
• Applicable regions map rules → symbolic data
• Choice grids map symbolic data → rules
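The inversion from applicable regions to a choice grid can be sketched in a few lines of Python. This is an illustration only: the compiler works on symbolic bounds, whereas this sketch assumes a concrete `n`, and `choice_grid` is a hypothetical helper name.

```python
def choice_grid(applicable, n):
    """Split [0, n) into maximal intervals whose cells share one set of rules.

    applicable maps a rule name to its half-open applicable region (lo, hi).
    """
    # Every region boundary strictly inside (0, n) starts a new grid cell.
    cuts = sorted({0, n} | {b for lo, hi in applicable.values()
                            for b in (lo, hi) if 0 < b < n})
    grid = []
    for lo, hi in zip(cuts, cuts[1:]):
        # A rule is a valid choice if its region covers the whole interval.
        rules = sorted(r for r, (a, b) in applicable.items() if a <= lo and hi <= b)
        grid.append(((lo, hi), rules))
    return grid


n = 8
regions = {"rule 0": (1, n), "rule 1": (0, n)}
print(choice_grid(regions, n))
# [((0, 1), ['rule 1']), ((1, 8), ['rule 0', 'rule 1'])]
```

With the RollingSum regions this reproduces the grid on the slide: only rule 1 for the first cell, either rule for the rest.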
![Page 105: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/105.jpg)
Choice Dependency Graphs
• Adds dependency edges between symbolic regions
• Edges annotated with directions and rules
• Many compiler passes on this IR to:
  – Simplify complex dependency patterns
  – Add choices
![Page 106: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/106.jpg)
PetaBricks Flow
1. PetaBricks source code is compiled
2. An autotuning binary is created
3. Autotuning occurs creating a choice configuration file
4. Choices are fed back into the compiler to create a static binary
![Page 107: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/107.jpg)
Autotuning
• Based on two building blocks:
  – A genetic tuner
  – An n-ary search algorithm
• Flat parameter space
• Compiler generates a dependency graph describing this parameter space
• Entire program tuned from bottom up
![Page 109: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/109.jpg)
Sort
[Figure: plot of Time versus Size]
![Page 111: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/111.jpg)
Algorithmic Choice in Sorting
![Page 116: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/116.jpg)
Future Proofing Sort
| System | Hardware | Cores used | Scalability | Algorithm Choices (w/ switching points) |
|---|---|---|---|---|
| Mobile | Core 2 Duo Mobile | 2 of 2 | 1.92 | IS(150) 8MS(600) 4MS(1295) 2MS(38400) QS(∞) |
| Xeon 1-way | Xeon E7340 (2 x 4 core) | 1 of 8 | - | IS(75) 4MS(98) RS(∞) |
| Xeon 8-way | Xeon E7340 (2 x 4 core) | 8 of 8 | 5.69 | IS(600) QS(1420) 2MS(∞) |
| Niagara | Sun Fire T200 | 8 of 8 | 7.79 | 16MS(75) 8MS(1461) 4MS(2400) 2MS(∞) |
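A hedged Python sketch of how a tuned configuration such as the "Mobile" row could drive a sort. It assumes IS is insertion sort, kMS is k-way merge sort, QS is quicksort, and that each number in parentheses is the input size below which that algorithm is chosen. The encoding `MOBILE` and all function names are hypothetical; radix sort (RS) is not covered.

```python
import heapq
import math

# Assumed encoding of the "Mobile" row: (algorithm, upper size bound).
MOBILE = [("IS", 150), ("8MS", 600), ("4MS", 1295), ("2MS", 38400), ("QS", math.inf)]


def insertion_sort(xs):
    out = list(xs)
    for i in range(1, len(out)):
        key, j = out[i], i - 1
        while j >= 0 and out[j] > key:
            out[j + 1] = out[j]
            j -= 1
        out[j + 1] = key
    return out


def choose(config, n):
    """Return the first algorithm whose upper bound exceeds n."""
    for name, bound in config:
        if n < bound:
            return name
    return config[-1][0]


def hybrid_sort(xs, config=MOBILE):
    n = len(xs)
    if n <= 1:
        return list(xs)
    name = choose(config, n)
    if name == "IS":
        return insertion_sort(xs)
    if name == "QS":
        pivot = xs[n // 2]
        less = [x for x in xs if x < pivot]
        equal = [x for x in xs if x == pivot]
        more = [x for x in xs if x > pivot]
        # Sub-problems re-enter the selector, so the algorithm can switch.
        return hybrid_sort(less, config) + equal + hybrid_sort(more, config)
    # "kMS": split into k parts, sort each recursively, then k-way merge.
    k = int(name[:-2])
    step = -(-n // k)  # ceiling division
    parts = [hybrid_sort(xs[i:i + step], config) for i in range(0, n, step)]
    return list(heapq.merge(*parts))
```

Because every recursive call goes back through `choose`, a large input starts with quicksort, drops to progressively wider merge sorts as the pieces shrink, and finishes with insertion sort, which is the composition the table row describes under the stated reading.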
| Run On \ Trained On | Mobile | Xeon 1-way | Xeon 8-way | Niagara |
|---|---|---|---|---|
| Mobile | - | 1.09x | 1.67x | 1.47x |
| Xeon 1-way | 1.61x | - | 2.08x | 2.50x |
| Xeon 8-way | 1.59x | 2.14x | - | 2.35x |
| Niagara | 1.12x | 1.51x | 1.08x | - |
![Page 118: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/118.jpg)
Matrix Multiply
[Figure: plot of Time versus Size]
![Page 120: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/120.jpg)
Eigenvector Solve
[Figure: plot of Time versus Size]
![Page 123: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/123.jpg)
Variable Accuracy Algorithms
• Lots of algorithms where the accuracy of output can be tuned:
  – Iterative algorithms (e.g. solvers, optimization)
  – Signal processing (e.g. images, sound)
  – Approximation algorithms
• Can trade accuracy for speed
• All the user wants: solve to a certain accuracy as fast as possible, using whatever algorithms are necessary!
![Page 125: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/125.jpg)
A Very Brief Multigrid Intro
• Used to iteratively solve PDEs over a gridded domain
• Relaxations update points using neighboring values (stencil computations)
• Restrictions and Interpolations compute new grid with coarser or finer discretization
[Figure: Resolution versus Compute Time, showing the steps: relax on current grid, restrict to coarser grid, interpolate to finer grid]
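The three operations above can be sketched as a small V-cycle in plain Python. This is an illustration under stated assumptions, not the talk's implementation: it solves the 1D problem -u'' = f with zero boundary values (the slides target 2D grids), uses weighted Jacobi for relaxation, full weighting for restriction and linear interpolation, and expects n = 2^k - 1 interior points. All function names are hypothetical.

```python
def relax(u, f, h, sweeps=2, omega=2.0 / 3.0):
    """Weighted-Jacobi relaxation: update each point from its neighbours."""
    n = len(u)
    for _ in range(sweeps):
        old = u[:]
        for i in range(n):
            left = old[i - 1] if i > 0 else 0.0
            right = old[i + 1] if i < n - 1 else 0.0
            u[i] = (1 - omega) * old[i] + omega * 0.5 * (left + right + h * h * f[i])
    return u


def residual(u, f, h):
    """r = f - A u for the 3-point discretization of -u''."""
    n = len(u)
    r = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        r.append(f[i] - (2 * u[i] - left - right) / (h * h))
    return r


def restrict(r):
    """Full weighting onto the grid with half the resolution."""
    nc = (len(r) - 1) // 2
    return [(r[2 * j] + 2 * r[2 * j + 1] + r[2 * j + 2]) / 4 for j in range(nc)]


def interpolate(e, n):
    """Linear interpolation from the coarse grid back to n fine points."""
    fine = [0.0] * n
    for j, v in enumerate(e):
        fine[2 * j + 1] = v  # coarse points coincide with odd fine points
    for i in range(0, n, 2):
        left = fine[i - 1] if i > 0 else 0.0
        right = fine[i + 1] if i < n - 1 else 0.0
        fine[i] = 0.5 * (left + right)
    return fine


def v_cycle(u, f, h):
    n = len(u)
    if n == 1:
        return [h * h * f[0] / 2]  # direct solve on the coarsest grid
    u = relax(list(u), f, h)  # relax on current grid
    coarse_r = restrict(residual(u, f, h))  # restrict to coarser grid
    e = v_cycle([0.0] * len(coarse_r), coarse_r, 2 * h)
    u = [ui + ei for ui, ei in zip(u, interpolate(e, n))]  # interpolate and correct
    return relax(u, f, h)
```

Calling `v_cycle` repeatedly on the same right-hand side drives the residual down; the number of sweeps, the number of cycles and the coarsest level are the knobs a tuner could adjust.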
![Page 133: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/133.jpg)
Multigrid Cycles
Standard Approaches: V-Cycle, W-Cycle, Full MG V-Cycle
Relaxation operator?
How many iterations?
How coarse do we go?
![Page 137: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/137.jpg)
Multigrid Cycles
• Generalize the idea of what a multigrid cycle can look like
• Example: [Figure: a cycle shape with relaxation steps and a direct or iterative shortcut]
• Goal: Auto-tune cycle shape for specific usage
![Page 138: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/138.jpg)
Algorithmic Choice in Multigrid
• Need framework to make fair comparisons
• Perspective of a specific grid resolution
• How to get from A to B?
  – Direct
  – Iterative
  – Recursive: Restrict, ?, Interpolate
![Page 139: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/139.jpg)
Algorithmic Choice in Multigrid
• Tuning cycle shape!
  – Examples of recursive options: standard V-cycle; take a shortcut at a coarser resolution; iterating with shortcuts
![Page 142: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/142.jpg)
Algorithmic Choice in Multigrid
• Tuning cycle shape!
  – Once we pick a recursive option, how many times do we iterate?
• Number of iterations depends on what accuracy we want at the current grid resolution!
[Figure: points A, B, C, D in order of higher accuracy]
![Page 143: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/143.jpg)
Optimal Subproblems
• Plot all cycle shapes for a given grid resolution:
[Figure: scatter of cycle shapes with an arrow marking "Better"; keep only the optimal ones!]
• Idea: Maintain a family of optimal algorithms for each grid resolution
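"Keep only the optimal ones" can be read as a Pareto filter. The sketch below assumes each candidate cycle shape has been measured as a (name, time, accuracy) triple, where lower time and higher accuracy are better. The axes, the helper name `pareto_front` and the sample numbers are illustrative assumptions, not data from the talk.

```python
def pareto_front(candidates):
    """Drop every candidate that another one beats on both time and accuracy."""
    best = []
    # Visit fastest first; for equal times, the more accurate one comes first.
    for name, time, acc in sorted(candidates, key=lambda c: (c[1], -c[2])):
        # Keep a slower candidate only if it is more accurate than all faster ones.
        if not best or acc > best[-1][2]:
            best.append((name, time, acc))
    return best


# Made-up measurements for illustration only.
shapes = [
    ("shape-a", 1.0, 1e3),
    ("shape-b", 2.0, 1e2),   # slower and less accurate than shape-a: dropped
    ("shape-c", 3.0, 1e5),
    ("shape-d", 5.0, 1e7),
    ("shape-e", 6.0, 1e6),   # slower and less accurate than shape-d: dropped
]
print([name for name, _, _ in pareto_front(shapes)])
# ['shape-a', 'shape-c', 'shape-d']
```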
![Page 146: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/146.jpg)
The Discrete Solution
• Problem: Too many optimal cycle shapes to remember
• Solution: Remember the fastest algorithms for a discrete set of accuracies
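A small sketch of the discrete solution: for each accuracy target, remember only the fastest candidate that reaches it. It assumes larger accuracy values are better and that a target is met when the measured accuracy is at least the target. The helper `fastest_per_bin` and the measurements are hypothetical; the bin values reuse the `1e1 1e3 1e5 1e7` set that appears elsewhere in the talk.

```python
def fastest_per_bin(candidates, bins):
    """Map each accuracy target to the fastest (name, time, accuracy) meeting it."""
    chosen = {}
    for target in bins:
        good_enough = [c for c in candidates if c[2] >= target]
        # None marks a target that no measured candidate reaches.
        chosen[target] = min(good_enough, key=lambda c: c[1]) if good_enough else None
    return chosen


# Made-up measurements for illustration only.
measured = [("shape-a", 1.0, 1e3), ("shape-c", 3.0, 1e5), ("shape-d", 5.0, 1e7)]
table = fastest_per_bin(measured, [1e1, 1e3, 1e5, 1e7])
print({target: entry[0] for target, entry in table.items()})
# {10.0: 'shape-a', 1000.0: 'shape-a', 100000.0: 'shape-c', 10000000.0: 'shape-d'}
```

The table has one entry per bin regardless of how many cycle shapes were explored, which is what keeps the search tractable.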
![Page 149: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/149.jpg)
Use Dynamic Programming
• Only search cycle shapes that utilize optimized sub-cycles in recursive calls
• Build optimized algorithms from the bottom up
• Allow shortcuts to stop recursion early
• Allow multiple iterations of sub-cycles to explore time vs. accuracy space
![Page 152: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/152.jpg)
Auto-tuning the V-cycle

transform Multigrid_k
from X[n,n], B[n,n]
to Y[n,n]
{
  // Base case
  // Direct solve

  OR

  // Base case
  // Iterative solve at current resolution

  OR

  // Recursive case
  // For some number of iterations
  //   Relax
  //   Compute residual and restrict
  //   Call Multigrid_i for some i
  //   Interpolate and correct
  //   Relax
}

• Algorithmic choice
  – Shortcut base cases
  – Recursively call some optimized sub-cycle
• Iterations and recursive accuracy let us explore accuracy versus performance space
• Only remember “best” versions
![Page 157: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/157.jpg)
Variable Accuracy Keywords
• accuracy_variable – tunable variable
• accuracy_metric – returns accuracy of output
• accuracy_bins – set of discrete accuracy bins
• generator – creates random inputs for accuracy measurement

transform Multigrid_k
from X[n,n], B[n,n]
to Y[n,n]
accuracy_variable numIterations
accuracy_metric Poisson2D_metric
accuracy_bins 1e1 1e3 1e5 1e7
generator Poisson2D_Generator
![Page 162: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/162.jpg)
Training the Discrete Solution
77
MultigridAlgorithm
MultigridAlgorithm
MultigridAlgorithm
MultigridAlgorithm
Accuracy 1 Accuracy 2 Accuracy 3 Accuracy 4
Optimized
Resolution i
Resolutioni
Tuesday, October 25, 2011
![Page 163: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/163.jpg)
Training the Discrete Solution
77
MultigridAlgorithm
MultigridAlgorithm
MultigridAlgorithm
MultigridAlgorithm
Accuracy 1 Accuracy 2 Accuracy 3 Accuracy 4
Optimized
Resolution i
Resolutioni
MultigridAlgorithm
MultigridAlgorithm
MultigridAlgorithm
MultigridAlgorithm
Resolutioni+1 Training
Resolution i+1
Tuesday, October 25, 2011
![Page 164: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/164.jpg)
Training the Discrete Solution
77
MultigridAlgorithm
MultigridAlgorithm
MultigridAlgorithm
MultigridAlgorithm
Accuracy 1 Accuracy 2 Accuracy 3 Accuracy 4
Optimized
Resolution i
Resolutioni
MultigridAlgorithm
MultigridAlgorithm
MultigridAlgorithm
MultigridAlgorithm
Resolutioni+1 Training
Resolution i+1
Tuesday, October 25, 2011
![Page 165: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/165.jpg)
Training the Discrete Solution
Figure: multigrid algorithms for Accuracy 1 to Accuracy 4 at Resolution i and at Resolution i+1, both labeled as optimized.
![Page 166: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/166.jpg)
Training the Discrete Solution
Figure: a grid of multigrid algorithms labeled by accuracy (Accuracy 1 to Accuracy 4) and by resolution (Coarser to Finer). Legend: tuning order; possible choice (shortcuts not shown); training; optimized.
![Page 168: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/168.jpg)
Training the Discrete Solution
Figure: a grid of multigrid algorithms labeled by accuracy (Accuracy 1 to Accuracy 4) and by resolution (Coarser to Finer), with one label reading 2x. Legend: tuning order; possible choice (shortcuts not shown); training; optimized.
![Page 169: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/169.jpg)
Training the Discrete Solution
Figure: the grid of multigrid algorithms (Accuracy 1 to Accuracy 4, Coarser to Finer) extended with further resolution levels, with labels 2x and Optimized. Legend: tuning order; possible choice (shortcuts not shown); training; optimized.
![Page 171: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/171.jpg)
Training the Discrete Solution
Figure: the grid of multigrid algorithms (Accuracy 1 to Accuracy 4, Coarser to Finer), with labels 2x, 1x, and Optimized. Legend: tuning order; possible choice (shortcuts not shown); training; optimized.
![Page 172: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/172.jpg)
Training the Discrete Solution
Figure: the full grid of multigrid algorithms (Accuracy 1 to Accuracy 4, Coarser to Finer), with labels 2x and 1x and with the levels labeled Optimized. Legend: tuning order; possible choice (shortcuts not shown).
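The bottom-up training order in these slides can be sketched as a small dynamic program. The sketch below is an illustration under stated assumptions, not the PetaBricks tuner: `tune`, `Plan`, `iterations_needed`, the cost functions, and the `cycle_gain` model are hypothetical stand-ins, the "2x" and "1x" labels are read here as repeat counts, and a real autotuner would time candidates on the machine rather than consult a formula. For each resolution level, from coarser to finer, and for each accuracy target, it keeps the cheaper of a direct solve or a cycle that reuses an already optimized plan from the next coarser level, repeated until the target is met.

```python
from dataclasses import dataclass

# Discrete accuracy targets, matching the levels named in the slides.
ACCURACY_LEVELS = [1e1, 1e3, 1e5, 1e7]


@dataclass(frozen=True)
class Plan:
    cost: float        # modeled time for this (level, accuracy) pair
    choice: str        # "direct" or "recurse"
    sub_accuracy: int  # index of the coarser-level plan reused (-1 for direct)
    iterations: int    # how often the cycle is repeated (0 for direct)


def iterations_needed(gain, target):
    """Smallest n such that gain**n >= target."""
    n, reached = 0, 1.0
    while reached < target:
        reached *= gain
        n += 1
    return n


def tune(num_levels, direct_cost, cycle_cost, cycle_gain):
    """Fill a table of best plans, coarsest level first.

    direct_cost(level): assumed time of a direct solve at that level
    cycle_cost(level):  assumed time of one cycle, excluding the coarse solve
    cycle_gain(acc):    assumed accuracy gained by one cycle whose coarse
                        solve has accuracy acc (hypothetical model)
    """
    table = []
    for level in range(num_levels):
        row = []
        for target in ACCURACY_LEVELS:
            # A direct solve is always available and is assumed exact.
            best = Plan(direct_cost(level), "direct", -1, 0)
            if level > 0:
                # Try every already optimized plan from the coarser level.
                for k, sub_acc in enumerate(ACCURACY_LEVELS):
                    gain = cycle_gain(sub_acc)
                    if gain <= 1.0:
                        continue  # this choice would never converge
                    n = iterations_needed(gain, target)
                    cost = n * (cycle_cost(level) + table[level - 1][k].cost)
                    if cost < best.cost:
                        best = Plan(cost, "recurse", k, n)
            row.append(best)
        table.append(row)
    return table


# Toy cost model (made up for illustration only).
table = tune(
    num_levels=3,
    direct_cost=lambda level: 16.0 ** level,
    cycle_cost=lambda level: 4.0 ** level,
    cycle_gain=lambda acc: min(acc, 100.0),
)

for level, row in enumerate(table):
    for target, plan in zip(ACCURACY_LEVELS, row):
        print(level, target, plan.choice, plan.sub_accuracy, plan.iterations)
```

In this toy model the coarsest level always solves directly, level 1 still prefers the direct solve for the 10^7 target, and level 2 recurses for every target. Because each finer level only looks up finished rows of the table, a level never has to be retuned once it is optimized.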
![Page 173: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/173.jpg)
Example: Auto-tuned 2D
Figure: panels for Accy. 10, Accy. 10^3, and Accy. 10^7; axis labels: Finer, Coarser.
![Page 174: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/174.jpg)
Auto-tuned Cycles for
Cycle shapes for accuracy levels a) 10, b) 10^3, c) 10^5, d) 10^7
![Page 175: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/175.jpg)
Auto-tuned Cycles for
Cycle shapes for accuracy levels a) 10, b) 10^3, c) 10^5, d) 10^7
Optimized substructures visible in cycle shapes
![Page 177: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/177.jpg)
Poisson
Figure: time versus matrix size.
![Page 178: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/178.jpg)
Poisson
Figure: time versus matrix size.
![Page 179: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/179.jpg)
Binpacking – Algorithmic Choices
Figure: algorithmic choices plotted by accuracy and data size.
![Page 180: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/180.jpg)
Outline
• The Three Side Stories
– Performance and Parallelism with Multicores
– Future Proofing Software
– Evolution of Programming Languages
• Three Observations
• PetaBricks
– Language
– Compiler
– Results
– Variable Precision
– Sibling Rivalry
![Page 181: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/181.jpg)
Issues with Offline Tuning
• Offline-tuning workflow burdensome
– Programs often not re-autotuned when they should be
– e.g. apt-get install fftw does not re-autotune
– Hardware upgrades / large deployments
– Transparent migration in the cloud
• Can't adapt to dynamic conditions
– System load
– Input types
![Page 182: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/182.jpg)
SiblingRivalry: an Online Approach
• Split available resources in half
• Process identical requests on both halves
• Race two candidate configurations (safe and experimental) and terminate slower algorithm
• Initial slowdown (from duplicating the request) can be overcome by autotuner
• Surprisingly, reduces average power consumption per request
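The race described above can be sketched with two child processes. This is a minimal illustration, not SiblingRivalry itself: `race` and `sleeper` are hypothetical names, the workloads are stand-in sleep commands, and the sketch does not partition cores between the two siblings, which a real deployment would presumably do.

```python
import subprocess
import sys
import time


def race(safe_cmd, experimental_cmd, poll_interval=0.01):
    """Run both commands at once and return the name of the first to succeed.

    The slower sibling is killed as soon as a winner is known. A sibling
    that exits with a non-zero code is dropped from the race.
    """
    procs = {
        "safe": subprocess.Popen(safe_cmd),
        "experimental": subprocess.Popen(experimental_cmd),
    }
    running = dict(procs)
    try:
        while running:
            for name, proc in list(running.items()):
                code = proc.poll()
                if code is None:
                    continue          # still working on the request
                if code == 0:
                    return name       # first successful finisher wins
                del running[name]     # a crashed candidate cannot win
            time.sleep(poll_interval)
        raise RuntimeError("both candidates failed")
    finally:
        for proc in procs.values():
            if proc.poll() is None:
                proc.kill()           # terminate the slower sibling
            proc.wait()               # reap the child process


def sleeper(seconds):
    # Hypothetical stand-in for "serve the request with this configuration".
    return [sys.executable, "-c", f"import time; time.sleep({seconds})"]


# The same request goes to both halves; here the experimental
# configuration is assumed to be the faster one.
winner = race(sleeper(2.0), sleeper(0.1))
print(winner)
```

The request is answered as soon as either sibling finishes, so a bad experimental configuration costs at most the time of the safe one. What the tuner does with the winner is outside this sketch.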
![Page 183: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/183.jpg)
Experimental Setup
![Page 184: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/184.jpg)
SiblingRivalry: throughput
![Page 185: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/185.jpg)
SiblingRivalry: energy usage (on AMD48)
![Page 190: PetaBricks: A Language and Compiler based on Autotuning · 10/17/2011 · • Joe is oblivious about the processor – Moore’s law bring Joe performance – Sufficient for Joe’s](https://reader033.vdocuments.site/reader033/viewer/2022060905/60a0a76d32e8fe648c49f406/html5/thumbnails/190.jpg)
Conclusion
• Time has come for languages based on autotuning
• Convergence of multiple forces
– The Multicore Menace
– Future proofing when machine models are changing
– Use more muscle (compute cycles) than brain (human cycles)
• PetaBricks – We showed that it can be done!
• Will programmers accept this model?
– A little more work now to save a lot later
– Complexities in testing, verification and validation