01/19/2005 CS267-Lecture 1 1
CS267/E233 Applications of Parallel Computers
Lecture 1: Introduction
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr05
Outline
• Introduction
• Large important problems require powerful computers
• Why powerful computers must be parallel processors
• Why writing (fast) parallel programs is hard
• Principles of parallel computing performance
• Structure of the course
Why we need powerful computers
Units of Measure in HPC
• High Performance Computing (HPC) units are:
  - Flops: floating point operations
  - Flop/s: floating point operations per second
  - Bytes: size of data (a double precision floating point number is 8 bytes)
• Typical sizes are millions, billions, trillions…
  Mega   Mflop/s = 10^6 flop/sec    Mbyte = 2^20 = 1048576 ~ 10^6 bytes
  Giga   Gflop/s = 10^9 flop/sec    Gbyte = 2^30 ~ 10^9 bytes
  Tera   Tflop/s = 10^12 flop/sec   Tbyte = 2^40 ~ 10^12 bytes
  Peta   Pflop/s = 10^15 flop/sec   Pbyte = 2^50 ~ 10^15 bytes
  Exa    Eflop/s = 10^18 flop/sec   Ebyte = 2^60 ~ 10^18 bytes
  Zetta  Zflop/s = 10^21 flop/sec   Zbyte = 2^70 ~ 10^21 bytes
  Yotta  Yflop/s = 10^24 flop/sec   Ybyte = 2^80 ~ 10^24 bytes
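As a sanity check on the table, the rate prefixes step by factors of 10^3 while the storage prefixes step by factors of 2^10; a few lines of Python (a sketch, not part of the original slides) make the pattern explicit:

```python
# Sketch: HPC unit prefixes. Rates grow by 10^3 per step,
# storage by 2^10 per step (so Mbyte = 2^20 ~ 10^6 bytes, etc.).
prefixes = ["Mega", "Giga", "Tera", "Peta", "Exa", "Zetta", "Yotta"]

for i, name in enumerate(prefixes):
    rate = 10 ** (6 + 3 * i)   # Mflop/s = 10^6 flop/s, Gflop/s = 10^9, ...
    size = 2 ** (20 + 10 * i)  # Mbyte = 2^20 bytes, Gbyte = 2^30, ...
    print(f"{name:<6} {rate:.0e} flop/s   {size} bytes (~{size:.2e})")
```

Note that the power-of-2 sizes slightly exceed the corresponding powers of 10, which is why the table writes "~".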
Simulation: The Third Pillar of Science
• Traditional scientific and engineering paradigm:
  1) Do theory or paper design.
  2) Perform experiments or build system.
• Limitations:
  - Too difficult -- build large wind tunnels.
  - Too expensive -- build a throw-away passenger jet.
  - Too slow -- wait for climate or galactic evolution.
  - Too dangerous -- weapons, drug design, climate experimentation.
• Computational science paradigm:
  3) Use high performance computer systems to simulate the phenomenon
  - Based on known physical laws and efficient numerical methods.
Some Particularly Challenging Computations
• Science
  - Global climate modeling
  - Biology: genomics; protein folding; drug design
  - Astrophysical modeling
  - Computational chemistry
  - Computational materials science and nanoscience
• Engineering
  - Semiconductor design
  - Earthquake and structural modeling
  - Computational fluid dynamics (airplane design)
  - Combustion (engine design)
  - Crash simulation
• Business
  - Financial and economic modeling
  - Transaction processing, web services and search engines
• Defense
  - Nuclear weapons -- test by simulations
  - Cryptography
Economic Impact of HPC
• Airlines:
  - System-wide logistics optimization systems on parallel systems.
  - Savings: approx. $100 million per airline per year.
• Automotive design:
  - Major automotive companies use large parallel systems (500+ CPUs) for CAD-CAM, crash testing, structural integrity and aerodynamics.
  - Savings: approx. $1 billion per company per year.
• Semiconductor industry:
  - Semiconductor firms use large systems (500+ CPUs) for device electronics simulation and logic validation.
  - Savings: approx. $1 billion per company per year.
• Securities industry:
  - Savings: approx. $15 billion per year for U.S. home mortgages.
$5B World Market in Technical Computing
[Chart: technical computing market share by application segment, 1998-2003. Segments: Scientific Research and R&D; Simulation; Mechanical Design/Engineering Analysis; Mechanical Design and Drafting; Imaging; Geoscience and Geo-engineering; Electrical Design/Engineering Analysis; Economics/Financial; Digital Content Creation and Distribution; Classified Defense; Chemical Engineering; Biosciences; Technical Management and Support; Other.]
Source: IDC 2004, from NRC Future of Supercomputing Report
Global Climate Modeling Problem
• Problem is to compute:
  f(latitude, longitude, elevation, time) →
  temperature, pressure, humidity, wind velocity
• Approach:
  - Discretize the domain, e.g., a measurement point every 10 km
  - Devise an algorithm to predict weather at time t+δt given the state at time t
• Uses:
  - Predict major events, e.g., El Niño
  - Use in setting air emissions standards
Source: http://www.epm.ornl.gov/chammp/chammp.html
Global Climate Modeling Computation
• One piece is modeling the fluid flow in the atmosphere
  - Solve the Navier-Stokes equations
  - Roughly 100 flops per grid point with a 1 minute timestep
• Computational requirements:
  - To match real time, need 5 x 10^11 flops in 60 seconds = 8 Gflop/s
  - Weather prediction (7 days in 24 hours) → 56 Gflop/s
  - Climate prediction (50 years in 30 days) → 4.8 Tflop/s
  - To use in policy negotiations (50 years in 12 hours) → 288 Tflop/s
• To double the grid resolution, computation goes up by 8x to 16x
• State-of-the-art models require integration of atmosphere, ocean, sea-ice, and land models, plus possibly carbon cycle, geochemistry and more
• Current models are coarser than this
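The rates above all come from one multiplication: the required flop rate is the real-time rate (~8 Gflop/s, from 5 x 10^11 flops per simulated minute) times how much faster than real time the simulation must run. A short Python sketch of that arithmetic, using the slide's rounded 8 Gflop/s constant:

```python
# Sketch of the slide's arithmetic. Required rate = real-time rate x
# (simulated time / wall-clock time). 8 Gflop/s is the slide's rounded
# real-time figure (5e11 flops per simulated minute / 60 s ~ 8.3e9).
REAL_TIME_RATE = 8e9  # flop/s

def required_rate(sim_seconds, wall_seconds):
    """Flop/s needed to simulate sim_seconds within wall_seconds."""
    return REAL_TIME_RATE * sim_seconds / wall_seconds

DAY = 24 * 3600
YEAR = 365 * DAY
print(f"weather, 7 days in 24 h: {required_rate(7 * DAY, DAY) / 1e9:.0f} Gflop/s")
print(f"climate, 50 yr in 30 d:  {required_rate(50 * YEAR, 30 * DAY) / 1e12:.1f} Tflop/s")
print(f"policy,  50 yr in 12 h:  {required_rate(50 * YEAR, DAY / 2) / 1e12:.0f} Tflop/s")
```

The results land on or very near the slide's figures; the small differences on the last two lines come from rounding the real-time rate to 8 Gflop/s.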
High Resolution Climate Modeling on NERSC-3 – P. Duffy, et al., LLNL
A 1000 Year Climate Simulation
• Warren Washington and Jerry Meehl, National Center for Atmospheric Research; Bert Semtner, Naval Postgraduate School; John Weatherly, U.S. Army Cold Regions Research and Engineering Laboratory; et al.
• http://www.nersc.gov/news/science/bigsplash2002.pdf
• Demonstration of the Community Climate System Model (CCSM2)
• A 1000-year simulation shows long-term, stable representation of the earth’s climate.
• 760,000 processor hours used
• Temperature change shown
Climate Modeling on the Earth Simulator System
• Development of the Earth Simulator started in 1997, in order to build a comprehensive understanding of global environmental changes such as global warming.
• Construction was completed at the end of February 2002, and practical operation started on March 1, 2002.
• 35.86 Tflops (87.5% of peak performance) was achieved on the Linpack benchmark.
• 26.58 Tflops was obtained by a global atmospheric circulation code.
Astrophysics: Binary Black Hole Dynamics
• Massive supernova cores collapse to black holes.
• At a black hole's center, spacetime breaks down.
• Critical test of theories of gravity -- from General Relativity to Quantum Gravity.
• Indirect observation -- most galaxies have a black hole at their center.
• Gravity waves show the black hole directly, including detailed parameters.
• Binary black holes are the most powerful sources of gravity waves.
• Simulation is extraordinarily complex -- the evolution disrupts the spacetime!
Heart Simulation
• Problem is to compute blood flow in the heart
• Approach:
  - Modeled as an elastic structure in an incompressible fluid.
  - The "immersed boundary method" due to Peskin and McQueen.
  - 20 years of development in the model.
  - Many applications other than the heart: blood clotting, inner ear, paper making, embryo growth, and others.
  - Uses a regularly spaced mesh (set of points) for evaluating the fluid.
• Uses:
  - Current model can be used to design artificial heart valves.
  - Can help in understanding effects of disease (leaky valves).
  - Related projects look at the behavior of the heart during a heart attack.
  - Ultimately: real-time clinical work.
Heart Simulation Calculation
• This involves solving the Navier-Stokes equations.
  - A 64^3 mesh was possible on a Cray YMP, but 128^3 is required for an accurate model (it would have taken 3 years).
  - Done on a Cray C90 -- 100x faster and 100x more memory.
  - Until recently, limited to vector machines.
• Needs more features:
  - Electrical model of the heart, and details of muscles, e.g.,
    - Chris Johnson
    - Andrew McCulloch
  - Lungs, circulatory systems
Heart Simulation
Source: www.psc.org
Animation of lower portion of the heart
Parallel Computing in Data Analysis
• Finding information amidst large quantities of data
• General themes of sifting through large, unstructured data sets:
- Has there been an outbreak of some medical condition in a community?
- Which doctors are most likely involved in fraudulent charging to Medicare?
- When should white socks go on sale?
- What advertisements should be sent to you?
• Data collected and stored at enormous speeds (Gbyte/hour)
- remote sensor on a satellite
- telescope scanning the skies
- microarrays generating gene expression data
- scientific simulations generating terabytes of data
- NSA analysis of telecommunications
Why powerful computers are
parallel
Tunnel Vision by Experts
• “I think there is a world market for maybe five computers.”
- Thomas Watson, chairman of IBM, 1943.
• “There is no reason for any individual to have a computer in their home”
- Ken Olson, president and founder of Digital Equipment Corporation, 1977.
• “640K [of memory] ought to be enough for anybody.”
- Bill Gates, chairman of Microsoft, 1981.
Slide source: Warfield et al.
Technology Trends: Microprocessor Capacity
2X transistors/chip every 1.5 years -- called "Moore's Law"
Microprocessors have become smaller, denser, and more powerful.
Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor density of semiconductor chips would double roughly every 18 months.
Slide source: Jack Dongarra
Impact of Device Shrinkage
• What happens when the feature size (transistor size) shrinks by a factor of x?
• Clock rate goes up by x, because wires are shorter
  - actually less than x, because of power consumption
• Transistors per unit area go up by x^2
• Die size also tends to increase
  - typically by another factor of ~x
• Raw computing power of the chip goes up by ~x^4!
  - of which x^3 is devoted either to parallelism or locality
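The ~x^4 figure is just the product of the three factors above; a one-function sketch (illustrative only, using the slide's optimistic clock assumption):

```python
# Sketch: shrink feature size by x, multiply the gains together.
def raw_power_gain(x):
    clock = x                   # shorter wires -> clock up by ~x (optimistic)
    transistors = x**2 * x      # x^2 from density, another ~x from a bigger die
    return clock * transistors  # ~x^4 overall; the x^3 part is transistors,
                                # spent on parallelism or locality

print(raw_power_gain(2))  # halving the feature size -> 16x raw power
```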
Microprocessor Transistors per Chip
[Left chart: transistors per chip, 1,000 to 100,000,000 (log scale), 1970-2005, for i4004, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, R10000. Right chart: clock rate (MHz), 0.1 to 1000 (log scale), 1970-2000.]
But there are limiting forces: Increased cost and difficulty of manufacturing
• Moore’s 2nd law (Rock’s law)
Demo of 0.06 micron CMOS
More Limits: How fast can a serial computer be?
• Consider a 1 Tflop/s sequential machine:
  - Data must travel some distance, r, to get from memory to CPU.
  - To get 1 data element per cycle, data must make the trip 10^12 times per second; even at the speed of light, c = 3x10^8 m/s, that requires r < c/10^12 = 0.3 mm.
• Now put 1 Tbyte of storage in a 0.3 mm x 0.3 mm area:
  - Each word occupies about 3 square Angstroms, or the size of a small atom.
• No choice but parallelism
[Diagram: 1 Tflop/s, 1 Tbyte sequential machine with r = 0.3 mm]
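The back-of-envelope numbers can be rechecked directly. This is a sketch, not part of the original slides; how many "square Angstroms" you get depends on whether you count per word or per bit, so the code reports both:

```python
# Back-of-envelope check for the 1 Tflop/s, 1 Tbyte sequential machine.
c = 3e8                      # speed of light, m/s
cycles = 1e12                # 1 Tflop/s -> 10^12 memory trips per second
r = c / cycles               # farthest memory can be: 3e-4 m = 0.3 mm
area = r * r                 # 0.3 mm x 0.3 mm chip, in m^2
words = 1e12 / 8             # 1 Tbyte of 8-byte words
per_word = area / words      # m^2 available per word
per_bit = area / (8e12)      # m^2 available per bit
ANG2 = 1e-20                 # one square Angstrom in m^2
print(f"r = {r * 1e3:.1f} mm")
print(f"area per word = {per_word / ANG2:.0f} sq. Angstroms")
print(f"area per bit  = {per_bit / ANG2:.1f} sq. Angstroms (atom-sized)")
```

Either way, every bit ends up packed into an atom-scale patch, which is the point of the slide.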
Performance on Linpack Benchmark
[Chart: Linpack Rmax of the max, mean, and min TOP500 systems (Gflops, log scale), June 1993 - November 2004. Milestones: ASCI Red, ASCI White, Earth Simulator. Nov 2004: IBM Blue Gene/L, 70.7 Tflops Rmax.]
www.top500.org
Why writing (fast) parallel programs is hard
Principles of Parallel Computing
• Finding enough parallelism (Amdahl’s Law)
• Granularity
• Locality
• Load balance
• Coordination and synchronization
• Performance modeling
All of these things make parallel programming even harder than sequential programming.
“Automatic” Parallelism in Modern Machines
• Bit-level parallelism
  - within floating point operations, etc.
• Instruction-level parallelism (ILP)
  - multiple instructions execute per clock cycle
• Memory system parallelism
  - overlap of memory operations with computation
• OS parallelism
  - multiple jobs run in parallel on commodity SMPs
There are limits to all of these -- for very high performance, the user must identify, schedule and coordinate the parallel tasks.
Finding Enough Parallelism
• Suppose only part of an application seems parallel
• Amdahl's law
  - let s be the fraction of work done sequentially, so (1-s) is the fraction parallelizable
  - let P = number of processors
  Speedup(P) = Time(1)/Time(P)
             <= 1/(s + (1-s)/P)
             <= 1/s
• Even if the parallel part speeds up perfectly, performance may be limited by the sequential part
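The bound is easy to tabulate; a minimal sketch:

```python
# Amdahl's law: upper bound on speedup with sequential fraction s on P processors.
def speedup(s, P):
    return 1.0 / (s + (1.0 - s) / P)

# Even 1% sequential work caps the speedup near 1/s = 100x,
# no matter how many processors you add:
for P in (10, 100, 1000, 10**6):
    print(f"P = {P:>7}: speedup <= {speedup(0.01, P):.1f}")
```

Note how slowly the bound approaches 1/s: with s = 0.01, a thousand processors deliver a speedup of only about 91.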
Overhead of Parallelism
• Given enough parallel work, this is the biggest barrier to getting desired speedup
• Parallelism overheads include:- cost of starting a thread or process
- cost of communicating shared data
- cost of synchronizing
- extra (redundant) computation
• Each of these can be in the range of milliseconds (=millions of flops) on some systems
• Tradeoff: the algorithm needs sufficiently large units of work to run fast in parallel (i.e. large granularity), but not so large that there is not enough parallel work
Locality and Parallelism
• Large memories are slow, fast memories are small
• Storage hierarchies are large and fast on average
• Parallel processors, collectively, have large, fast caches
  - the slow accesses to "remote" data we call "communication"
• Algorithm should do most work on local data
[Diagram: a conventional storage hierarchy (Proc/Cache, L2 Cache, L3 Cache, Memory), beside several such hierarchies connected by potential interconnects.]
Processor-DRAM Gap (latency)
[Chart: relative performance vs. time, 1980-2000 (log scale). CPU performance ("Moore's Law") grows ~60%/yr; DRAM improves ~7%/yr; the processor-memory performance gap grows ~50%/yr.]
Load Imbalance
• Load imbalance is the time that some processors in the system are idle due to
- insufficient parallelism (during that phase)
- unequal size tasks
• Examples of the latter- adapting to “interesting parts of a domain”
- tree-structured computations
- fundamentally unstructured problems
• Algorithm needs to balance load
Measuring Performance
Improving Real Performance
[Chart: Teraflops, 0.1 to 1,000 (log scale), 1996-2004, showing Peak Performance diverging from Real Performance (the Performance Gap).]
• Peak performance grows exponentially, a la Moore's Law
  - In the 1990s, peak performance increased 100x; in the 2000s, it will increase 1000x
• But efficiency (the performance relative to the hardware peak) has declined
  - was 40-50% on the vector supercomputers of the 1990s
  - now as little as 5-10% on the parallel supercomputers of today
• Close the gap through...
  - mathematical methods and algorithms that achieve high performance on a single processor and scale to thousands of processors
  - more efficient programming models and tools for massively parallel supercomputers
Performance Levels
• Peak advertised performance (PAP)
  - You can't possibly compute faster than this speed
• LINPACK
  - The "hello world" program for parallel computing
  - Solve Ax=b using Gaussian elimination, highly tuned
• Gordon Bell Prize winning applications performance
  - The right application/algorithm/platform combination plus years of work
• Average sustained applications performance
  - What one can reasonably expect for standard applications
When reporting performance results, these levels are often confused, even in reviewed publications.
Performance on Linpack Benchmark
[Chart: Linpack Rmax of the max, mean, and min TOP500 systems (Gflops, log scale), June 1993 - November 2004. Milestones: ASCI Red, ASCI White, Earth Simulator. Nov 2004: IBM Blue Gene/L, 70.7 Tflops Rmax.]
www.top500.org
Performance Levels (for example on NERSC-3)
• Peak advertised performance (PAP): 5 Tflop/s
• LINPACK (TPP): 3.05 Tflop/s
• Gordon Bell Prize winning applications performance : 2.46 Tflop/s
- Material Science application at SC01
• Average sustained applications performance: ~0.4 Tflop/s
- Less than 10% of peak!
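Expressed as fractions of peak, the NERSC-3 levels above (all figures from the slide) look like this:

```python
# NERSC-3 performance levels from the slide, as fractions of the 5 Tflop/s peak.
PEAK = 5.0  # Tflop/s, peak advertised performance (PAP)
levels = {
    "LINPACK (TPP)":      3.05,
    "Gordon Bell app":    2.46,
    "avg sustained apps": 0.4,
}
for name, tflops in levels.items():
    print(f"{name:<20} {tflops:.2f} Tflop/s = {100 * tflops / PEAK:.0f}% of peak")
```

The last line is where the "less than 10% of peak" remark comes from: 0.4/5 = 8%.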
Course Organization
Who is in the class?
• This class is listed as both a CS and Engineering class
• Normally a mix of CS, EE, and other engineering and science students
• This class seems to be about:- 40% CS
- 15% EE
- 45% Other (Bio, Civil, Mechanical, Math…)
• For final projects we encourage interdisciplinary teams- This is the way parallel scientific software is generally built
First Assignment
• See home page for details.
• Find an application of parallel computing and build a web page describing it.
  - Choose something from your research area, or from the web or elsewhere.
• Create a web page describing the application.
  - Describe the application and provide a reference (or link)
  - Describe the platform where this application was run
  - Find peak and LINPACK performance for the platform and its rank on the TOP500 list
  - Find performance of your selected application
  - What ratio of sustained to peak performance is reported?
  - Evaluate the project: How did the application scale, i.e. was speed roughly proportional to the number of processors? What were the major difficulties in obtaining good performance? What tools and algorithms were used?
• Send us (Jim and Yozo) the link (we will publish a list online)
• Due next week, Wednesday (1/26)
Rough Schedule of Topics
• Introduction
• Parallel Programming Models and Machines- Shared Memory and Multithreading
- Distributed Memory and Message Passing
- Data parallelism
• Sources of Parallelism in Simulation
• Tools - Languages (UPC)
- Performance Tools
- Visualization
- Environments
• Algorithms- Dense Linear Algebra
- Partial Differential Equations (PDEs)
- Particle methods
- Load balancing, synchronization techniques
- Sparse matrices
• Applications: biology, climate, combustion, astrophysics
• Project Reports
Reading Materials
• Some on-line texts:
- Demmel’s notes from CS267 Spring 1999, which are similar to 2000 and 2001. However, they contain links to html notes from 1996.
- http://www.cs.berkeley.edu/~demmel/cs267_Spr99/
- Simon’s notes from Fall
- http://www.nersc.gov/~simon/cs267/
- Ian Foster’s book, “Designing and Building Parallel Programs”.
- http://www-unix.mcs.anl.gov/dbpp/
• Potentially useful texts:- “Sourcebook for Parallel Computing”, by Dongarra, Foster, Fox, ..
- A general overview of parallel computing methods
- “Performance Optimization of Numerically Intensive Codes” by Stefan Goedecker and Adolfy Hoisie
- This is a practical guide to optimization, mostly for those of you who have never done any optimization
• Reports on Supercomputing (see web page)
Requirements
• Fill out on-line account request for Millennium machine.
- See course web page for pointer
• Fill out class survey- Handout in class, or see course web page for pointer
• Build a web page
  - Every week or two, students will report explorations, ideas, proposed work, and completed work to the TA via an organized webpage
• There will be 3-4 programming assignments geared towards hands-on experience, in interdisciplinary teams
• There will be a Final Project
  - Teams of 2-3; interdisciplinary is best
  - Interesting applications or advances in systems
  - Presentation (poster session)
  - Conference quality paper
What you should get out of the course
In depth understanding of:
• When is parallel computing useful?
• Understanding of parallel computing hardware options.
• Overview of programming models (software) and tools.
• Some important parallel applications and their algorithms
• Performance analysis and tuning
Administrative Information
• Instructors:
  - Jim Demmel, 737 Soda, [email protected]
  - TA: Yozo Hida, 515 Soda, [email protected]
• Accounts – fill out forms- Submit online form for Millennium account
- See web page for pointer:
- NERSC account forms will be available later from Yozo
• Lecture notes are based on previous semester notes:- Jim Demmel, David Culler, David Bailey, Bob Lucas, Kathy Yelick
and Horst Simon
• Most class material and lecture notes are at: - http://www.cs.berkeley.edu/~demmel/cs267_Spr05
Extra slides
Transaction Processing
• Parallelism is natural in relational operators: select, join, etc.
• Many difficult issues: data partitioning, locking, threading.
[Chart: transaction throughput (tpmC) vs. number of processors (0-120), as of Mar. 15, 1996, for Tandem Himalaya, IBM PowerPC, DEC Alpha, SGI PowerChallenge, HP PA, and other systems.]
SIA Projections for Microprocessors
Compute power ~ 1/(feature size)^3
[Chart: feature size (microns) and millions of transistors per chip vs. year of introduction, 1995-2010 (log scale); based on F. S. Preston, 1997.]
Much of the Performance is from Parallelism
[Chart: performance growth over time, attributed successively to bit-level parallelism, instruction-level parallelism, and (next) thread-level parallelism?]