high performance information retrieval and mems cad on intel itanium james demmel mathematics and...
Post on 19-Dec-2015
TRANSCRIPT
![Page 1: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/1.jpg)
High Performance Information Retrieval and MEMS CAD on Intel Itanium
James Demmel
Mathematics and EECS
UC Berkeley
www.cs.berkeley.edu/~demmel/Itanium_121001.ppt
Joint work with David Culler, Michael Jordan, William Kahan, Katherine Yelick, Zhaojun Bai (UC Davis)
![Page 2: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/2.jpg)
Outline
What’s up with Millennium
Automatic performance tuning
Applications to
  SUGAR – a MEMS CAD system
  Information Retrieval
Future Work
![Page 3: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/3.jpg)
Millennium
Cluster of clusters at UC Berkeley
  309 CPU cluster in Soda Hall
  Smaller clusters across campus
Made possible by Intel equipment grant
Significant other support
  NSF, Sun, Microsoft, Nortel, campus
www.millennium.berkeley.edu
![Page 4: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/4.jpg)
Millennium Topology
![Page 5: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/5.jpg)
Millennium Usage Oct 1 – 11, 2001
[Figure: Snapshots of Millennium Jobs Running. x-axis: Hour (1 to 249); y-axis: Number of Jobs (0 to 800)]
100% utilization for last few days
About half the jobs are parallel
![Page 6: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/6.jpg)
Usage highlights
AMANDA
  Antarctic Muon And Neutrino Detector Array
  amanda.berkeley.edu
  128 scientists from 15 universities and institutes in the U.S. and Europe
TEMPEST
  EUV lithography simulations via 3D electromagnetic scattering
  cuervo.eecs.berkeley.edu/Volcano/
  Study the defect printability on multilayer masks
Titanium
  High performance Java dialect for scientific computing
  www.cs.berkeley.edu/projects/titanium
  Implementation of shared address space, and use of SSE2
Digital Library Project
  Large database of images
  elib.cs.berkeley.edu/
  Used to run spectral image segmentation algorithm for clustering, search on images
![Page 7: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/7.jpg)
Usage highlights (continued)
CS 267
  Graduate class in parallel computing, 33 enrolled
  www.cs.berkeley.edu/~dbindel/cs267ta
  Homework
Disaster Response
  Help find people after Sept 11, set up immediately afterwards
  safe.millennium.berkeley.edu
  48K reports in database, linked to other survivor databases
MEMS CAD (MicroElectroMechanical Systems Computer Aided Design)
  Tool to help design MEMS systems
  Used this semester in EE 245, 93 enrolled
  sugar.millennium.berkeley.edu
  More later in talk
Information Retrieval
  Development of faster information retrieval algorithms
  www.cs.berkeley.edu/~jordan
  More later in talk
Many applications are part of CITRIS
![Page 8: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/8.jpg)
Performance Tuning
Motivation: performance of many applications dominated by a few kernels
MEMS CAD -> Nonlinear ODEs -> Nonlinear equations -> Linear equations -> Matrix multiply
  Matrix-by-matrix or matrix-by-vector
  Dense or Sparse
Information retrieval by LSI -> Compress term-document matrix -> … -> Sparse mat-vec multiply
Information retrieval by LDA -> Maximum likelihood estimation -> … -> Solve linear systems
Many other examples (not all linear algebra)
![Page 9: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/9.jpg)
Conventional Performance Tuning
Vendor or user hand tunes kernels
Drawbacks:
  Very time consuming and difficult work
  Even with intimate knowledge of architecture and compiler, performance hard to predict
  Must be redone for every architecture, compiler
  Compiler technology often lags architecture
Not just a compiler problem:
  Best algorithm may depend on input, so some tuning must occur at run-time
  Not all algorithms semantically or mathematically equivalent
![Page 10: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/10.jpg)
Automatic Performance Tuning
Approach: for each kernel
  1. Identify and generate a space of algorithms
  2. Search for the fastest one, by running them
What is a space of algorithms?
  Depends on kernel and input
  May vary
    instruction mix and order
    memory access patterns
    data structures
    mathematical formulation
When do we search?
  Once per kernel and architecture
  At compile time
  At run time
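The two-step approach (generate a space of algorithms, then search it by running each candidate) can be sketched in a few lines. This is only an illustration under stated assumptions, not the generator or search used by PHIPAC, ATLAS, or Sparsity: the algorithm space here is just the block size of a pure-Python blocked matrix multiply, and the names `blocked_matmul` and `search_fastest` are hypothetical.

```python
import time

def blocked_matmul(A, B, bs):
    """Multiply A (n x k) by B (k x m) one bs-by-bs tile at a time.

    Each value of bs is one point in the (hypothetical) algorithm space.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, bs):
        for p0 in range(0, k, bs):
            for j0 in range(0, m, bs):
                for i in range(i0, min(i0 + bs, n)):
                    Ci, Ai = C[i], A[i]
                    for p in range(p0, min(p0 + bs, k)):
                        a, Bp = Ai[p], B[p]
                        for j in range(j0, min(j0 + bs, m)):
                            Ci[j] += a * Bp[j]
    return C

def search_fastest(candidates, A, B, repeats=3):
    """Run every candidate block size and keep the best observed time for each."""
    timings = {}
    for bs in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            blocked_matmul(A, B, bs)
            best = min(best, time.perf_counter() - t0)
        timings[bs] = best
    # The winner is whichever candidate ran fastest on this machine.
    return min(timings, key=timings.get), timings

# Small illustrative inputs; a real search would use representative sizes.
n = 24
A = [[float((i * 7 + j * 3) % 5) for j in range(n)] for i in range(n)]
B = [[float((i + 2 * j) % 4) for j in range(n)] for i in range(n)]
best_bs, timings = search_fastest([4, 8, 12, 24], A, B, repeats=2)
print("fastest block size on this machine:", best_bs)
```

The winning block size depends on the machine and on the input, which is why the search has to be rerun per architecture, or at run time when the best choice depends on the data.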
![Page 11: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/11.jpg)
Tuning pays off – PHIPAC (Bilmes, Asanovic, Vuduc, Demmel)
![Page 12: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/12.jpg)
Tuning pays off – ATLAS (Dongarra, Whaley)
Extends applicability of PHIPAC
Incorporated in Matlab (with rest of LAPACK)
![Page 13: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/13.jpg)
Other Automatic Tuning Projects
FFTs and Signal Processing
  FFTW (www.fftw.org)
    Given dimension n of FFT, choose best implementation at runtime by assembling prebuilt kernels for small factors of n
    Widely used, won 1999 Wilkinson Prize for Numerical Software
  SPIRAL (www.ece.cmu.edu/~spiral)
    Extensions to other transforms, DSPs
  UHFFT
    Extensions to higher dimension, parallelism
Special session at ICCS 2001
  Organized by Yelick and Demmel
  www.ucalgary.ca/iccs
  Proceedings available
Pointers to automatic tuning projects at www.cs.berkeley.edu/~yelick/iccs-tune
![Page 14: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/14.jpg)
Search for optimal register tile sizes on Sun Ultra 10
16 registers, but 2-by-3 tile size fastest
![Page 15: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/15.jpg)
Search for Optimal L0 block size in dense matmul
4% of versions exceed 60% of peak on Pentium II-300
![Page 16: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/16.jpg)
High precision dense mat-vec multiply
![Page 17: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/17.jpg)
Tuning Sparse matrix operations
Sparsity
  Optimizes y = A*x for a particular sparse A
  Im and Yelick
Algorithm space
  Different code organization, instruction mixes
  Different register blockings (change data structure and fill of A)
  Different cache blocking
  Different number of columns of x
  Different matrix orderings
Software and papers available
  www.cs.berkeley.edu/~yelick/sparsity
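To make "register blockings (change data structure and fill of A)" concrete, here is a minimal sketch of an r-by-c blocked sparse matrix-vector multiply. It is an illustration under assumptions, not the Sparsity implementation: blocks are held in a Python dictionary rather than in block-CSR arrays, and the names `to_blocked`, `fill_ratio`, and `blocked_spmv` are hypothetical. The point is the trade-off: larger blocks give dense inner loops but store explicit zeros (fill).

```python
def to_blocked(entries, r, c):
    """Group nonzeros {(i, j): value} into dense r-by-c blocks.

    Positions inside a block that hold no nonzero are stored as explicit
    zeros; this is the "fill" that register blocking introduces.
    """
    blocks = {}
    for (i, j), v in entries.items():
        block = blocks.setdefault((i // r, j // c), [[0.0] * c for _ in range(r)])
        block[i % r][j % c] = v
    return blocks

def fill_ratio(entries, blocks, r, c):
    """Stored values (including zero fill) divided by true nonzeros."""
    return len(blocks) * r * c / len(entries)

def blocked_spmv(blocks, x, n_rows, r, c):
    """Compute y = A*x from the blocked representation."""
    y = [0.0] * n_rows
    for (bi, bj), block in blocks.items():
        for di in range(r):
            i = bi * r + di
            if i >= n_rows:
                break  # block hangs over the bottom edge of the matrix
            acc = 0.0
            for dj in range(c):
                j = bj * c + dj
                if j < len(x):  # block may hang over the right edge
                    acc += block[di][dj] * x[j]
            y[i] += acc
    return y

# 4-by-4 example: one full 2-by-2 block plus two isolated diagonal entries.
entries = {(0, 0): 1.0, (0, 1): 2.0, (1, 0): 3.0, (1, 1): 4.0,
           (2, 2): 5.0, (3, 3): 6.0}
x = [1.0, 1.0, 1.0, 1.0]
for r, c in [(1, 1), (2, 2), (2, 3)]:
    blocks = to_blocked(entries, r, c)
    print((r, c), blocked_spmv(blocks, x, 4, r, c), fill_ratio(entries, blocks, r, c))
```

Every block size gives the same product; only the amount of stored fill and the shape of the inner loop change, which is why the best blocking has to be chosen per matrix and per machine.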
![Page 18: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/18.jpg)
Speedups on SPMV from Sparsity on Sun Ultra 1/170 – 1 RHS
![Page 19: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/19.jpg)
Speedups on SPMV from Sparsity on Sun Ultra 1/170 – 9 RHS
![Page 20: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/20.jpg)
Sparsity reg blocking results on P4 for FEM/fluids matrix 1
![Page 21: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/21.jpg)
Sparsity reg blocking results on P4 for FEM/fluids matrix 2
![Page 22: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/22.jpg)
Sparsity cache blocking results on P4 for LSI
![Page 23: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/23.jpg)
Tuning other sparse operations
Symmetric matrix-vector multiply A*x
Solve a triangular system of equations T^(-1)*x
A^T*A*x
  Kernel of Information Retrieval via LSI
  Same number of memory references as A*x
A^2*x, A^k*x
  Kernel of Information Retrieval used by Google
  Changes calling algorithm
A^T*M*A
  Matrix triple product
  Used in multigrid solver
…
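One way to see why the product of A transposed, A, and x can cost about as many matrix reads as A*x alone: if a_i is row i of A, the result is the sum over i of a_i * (a_i . x), so each row can be used twice while it is still at hand. The sketch below assumes rows stored as lists of (column, value) pairs; the function name `ata_times_x` is hypothetical and this is not taken from the Sparsity code.

```python
def ata_times_x(rows, x):
    """Compute A^T * (A * x) in a single pass over the rows of sparse A.

    rows[i] is a list of (column, value) pairs for row i of A.
    """
    y = [0.0] * len(x)
    for row in rows:
        # t is entry i of A*x: the dot product of this row with x.
        t = sum(v * x[j] for j, v in row)
        # Reuse the same row immediately to accumulate its share of A^T * (A*x).
        for j, v in row:
            y[j] += v * t
    return y

# A = [[1, 2],
#      [0, 3]]
rows = [[(0, 1.0), (1, 2.0)], [(1, 3.0)]]
print(ata_times_x(rows, [1.0, 1.0]))  # A*x = [3, 3], so A^T*(A*x) = [3, 15]
```

Computing A*x first and then multiplying by the transpose would stream the matrix from memory twice; the fused loop reads each row once.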
![Page 24: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/24.jpg)
Symmetric Sparse Matrix-Vector Multiply on P4
![Page 25: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/25.jpg)
Sparse Triangular Solve on P4
![Page 26: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/26.jpg)
Applications to SUGAR – a tool for MEMS CAD
Demmel, Bai, Pister, Govindjee, Agogino, Gu, …
Input: description of MicroElectroMechanical System (as netlist)
Output:
  DC, steady state, modal, transient analyses to assess behavior
  CIF for fabrication
Simulation capabilities
  Beams and plates (linear, nonlinear, prestressed,…)
  Electrostatic forces, circuits
  Thermal expansion, Couette damping
Availability
  Matlab
    Publicly available
    www-bsac.eecs.berkeley.edu/~cfm
    249 registered users, many unregistered
  Web service – M & MEMS
    Runs on Millennium
    sugar.millennium.berkeley.edu
    Now in use in EE 245 at UCB…96 users
Lots of new features being added, including interface to measurements
![Page 27: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/27.jpg)
Micromirror (Last, Pister)
Laterally actuated torsionally suspended micromirror
Over 10K dof, 100 line netlist (using subnets)
DC and frequency analysis
All algorithms reduce to previous kernels
![Page 28: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/28.jpg)
Information Retrieval
Jordan
Collaboration with Intel team building probabilistic graphical models
Better alternatives to LSI for document modeling and search
Latent Dirichlet Allocation (LDA)
  Model documents as union of themes, each with own word distribution
  Maximum likelihood fit to find themes in set of documents, classify them
  Computational bottleneck is solution of enormous linear systems
  One of largest Millennium users
Kernel ICA
  Estimate set of sources s and mixing matrix A from samples x = A*s
  New way to sample such that sources are as independent as possible
  Again reduces to linear algebra kernels
Identifying influential documents
  Given hyperlink patterns of documents, which are most influential?
  Basis of Google (eigenvector of link matrix -> sparse matrix vector multiply)
  Applying Markov chain and perturbation theory to assess reliability
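The link between "eigenvector of link matrix" and sparse matrix-vector multiply can be illustrated with plain power iteration: each step is one sparse mat-vec with the link matrix. This is a minimal sketch, not Google's algorithm or the method studied in the talk. It assumes every document has at least one outgoing link and that the link graph is strongly connected and aperiodic, so that the iteration converges; the name `influence_scores` is hypothetical.

```python
def influence_scores(links, n, tol=1e-12, max_iter=1000):
    """Power iteration for the dominant eigenvector of a link matrix.

    links maps each document index to the list of documents it links to.
    Each document splits its current score evenly among its targets, so
    one pass over `links` is one sparse matrix-vector multiply.
    """
    x = [1.0 / n] * n
    for _ in range(max_iter):
        y = [0.0] * n
        for src, targets in links.items():
            share = x[src] / len(targets)
            for dst in targets:
                y[dst] += share
        # Stop once successive iterates agree in the 1-norm.
        if sum(abs(a - b) for a, b in zip(x, y)) < tol:
            return y
        x = y
    return x

# Three documents: 0 links to 1 and 2, 1 links to 2, 2 links to 0.
scores = influence_scores({0: [1, 2], 1: [2], 2: [0]}, 3)
print([round(s, 6) for s in scores])  # stationary scores are 0.4, 0.2, 0.4
```

Because the whole cost is repeated sparse mat-vecs, a tuned A*x (or A^k*x) kernel speeds up this computation directly.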
![Page 29: High Performance Information Retrieval and MEMS CAD on Intel Itanium James Demmel Mathematics and EECS UC Berkeley demmel/Itanium_121001.ppt](https://reader030.vdocuments.site/reader030/viewer/2022032800/56649d395503460f94a12e59/html5/thumbnails/29.jpg)
Future Work
Exploit Itanium Architecture
  128 (82-bit) floating point registers
  fused multiply-add instruction
  predicated instructions
  rotating registers for software pipelining
  prefetching instructions
  three levels of cache
Tune current and wider set of kernels
Incorporate into
  SUGAR
  Information Retrieval
Further automate performance tuning
  Generation of algorithm space generators