Xeon Phi - Odd Dwarfs
Thomas Lange, Kai Neumann
February 12, 2015
Thomas Lange, Kai Neumann | February 12, 2015 0/39
Outline
1 Dwarf No. 1: Graph Traversal
2 Dwarf No. 2: MapReduce
3 Dwarf No. 3: Dense Linear Algebra
4 Dwarf No. 4: Spectral Methods
5 Conclusion
6 Sources
Dwarf No. 1: Graph Traversal - Overview
Dwarf No. 1: Graph Traversal - Chosen Paper
- Paper: "Using the Intel Many Integrated Core to accelerate graph traversal"
- Published July 2014 in the International Journal of High Performance Computing Applications
- Focuses on accelerating breadth-first search (BFS) on the MIC architecture
Dwarf No. 1: Graph Traversal - Graph Definition
- BFS on scale-free graphs
- Graphs can be parameterized by:
  - scale: the graph has 2^scale vertices
  - edgefactor: edgefactor × 2^scale undirected edges ⇒ edgefactor undirected edges per vertex
- Since the graph is undirected, the average degree is 2 × edgefactor
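The parameterization above can be written out as a tiny helper (hypothetical names, not from the paper):

```python
def graph_params(scale, edgefactor):
    """Size of a scale-free benchmark graph from its two parameters."""
    vertices = 2 ** scale             # 2^scale vertices
    edges = edgefactor * vertices     # undirected edges in total
    avg_degree = 2 * edgefactor       # each undirected edge counts at both endpoints
    return vertices, edges, avg_degree

print(graph_params(20, 16))  # (1048576, 16777216, 32)
```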
Dwarf No. 1: Graph Traversal - Level-Synchronized BFS
- Given a graph G(V,E), BFS explores all vertices and produces a breadth-first spanning tree
- All vertices at one level are processed before any vertices further from the source vertex
- The active set of vertices is called the frontier
- In every step, all vertices reachable from the frontier (in one step) are explored
- The set of all unvisited vertices reachable from the frontier (out) is added to the set of visited vertices vis
- In the next step, the newly reached vertices (out) become the new frontier
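A minimal sequential sketch of this level-synchronized scheme (Python sets standing in for the paper's bitmaps):

```python
def level_synchronized_bfs(adj, source):
    """Level-synchronized BFS: process the whole frontier (in), collect
    newly reached vertices (out), merge them into vis, then swap."""
    p = {source: source}        # predecessor map / spanning tree
    vis = {source}              # visited vertices
    frontier = {source}         # 'in' set for the current level
    while frontier:
        out = set()
        for u in frontier:
            for v in adj[u]:
                if v not in vis:
                    vis.add(v)
                    p[v] = u
                    out.add(v)
        frontier = out          # 'out' becomes the next frontier
    return p
```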
Dwarf No. 1: Graph Traversal - Level-Synchronized BFS: Algorithm
Figure: Level-synchronized BFS: in: frontier, out: frontier for the next step, vis: set of visited vertices [Source1]
Dwarf No. 1: Graph Traversal - Top-Down & Bottom-Up
Two ways to explore new vertices in BFS:
- Top-down:
  - All vertices adjacent to the current frontier are explored
  - If an unvisited vertex is found, it is added to the set of newly reached vertices out (the frontier for the next step)
  - Better if the frontier is small
- Bottom-up:
  - Instead of exploring outward from the frontier, each unvisited vertex is explored if it has a neighbor in the frontier
  - Better if the frontier is big
⇒ A hybrid approach benefits from both
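The hybrid scheme can be sketched as two interchangeable step functions plus a frontier-size heuristic; the 10% switch-over threshold below is an illustrative guess, not the paper's tuned rule:

```python
def top_down_step(adj, frontier, vis, p):
    """Scan outward from the frontier; cheap when the frontier is small."""
    out = set()
    for u in frontier:
        for v in adj[u]:
            if v not in vis:
                vis.add(v); p[v] = u; out.add(v)
    return out

def bottom_up_step(adj, frontier, vis, p):
    """Scan every unvisited vertex for a frontier neighbor; cheap when
    the frontier is big."""
    out = set()
    for v in adj:
        if v in vis:
            continue
        for u in adj[v]:
            if u in frontier:
                vis.add(v); p[v] = u; out.add(v)
                break
    return out

def hybrid_bfs(adj, source, threshold=0.1):
    """Choose the step direction per level based on the frontier size."""
    p, vis, frontier = {source: source}, {source}, {source}
    while frontier:
        step = top_down_step if len(frontier) < threshold * len(adj) else bottom_up_step
        frontier = step(adj, frontier, vis, p)
    return p
```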
Dwarf No. 1: Graph Traversal - Hybrid BFS
Figure: Hybrid BFS: chooses whether top-down or bottom-up is used [Source1]
Dwarf No. 1: Graph Traversal - Data Representation
- The graph G is represented as an adjacency matrix stored in compressed-sparse-row (CSR) format
- The predecessor map p is an array of integers (vertex indices)
- in, out, and vis are bitmaps
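CSR stores the adjacency structure in two flat arrays; a small sketch with plain Python lists standing in for the paper's packed arrays:

```python
def to_csr(adj, n):
    """Build CSR arrays: the neighbors of vertex v live in
    col_idx[row_ptr[v]:row_ptr[v + 1]]."""
    row_ptr, col_idx = [0], []
    for v in range(n):
        col_idx.extend(adj.get(v, []))   # append v's neighbor list
        row_ptr.append(len(col_idx))     # record where the next row starts
    return row_ptr, col_idx

def neighbors(row_ptr, col_idx, v):
    """Slice out the neighbor list of v."""
    return col_idx[row_ptr[v]:row_ptr[v + 1]]
```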
Dwarf No. 1: Graph Traversal - Optimization: Top-Down BFS
- Goal: optimize for the MIC architecture
- Exploit two levels of parallelism:
  - Explore each vertex in the frontier in parallel ⇒ multi-threading
  - Inspect each neighbor of a vertex in parallel ⇒ SIMD
Figure: Optimized top-down BFS [Source1]
- Data race in step 3: multiple threads access p and out
Dwarf No. 1: Graph Traversal - Optimization: Top-Down BFS
Figure: Optimized top-down BFS [Source1]
- The left data race still produces a valid predecessor map
- The right data race needs further investigation
Dwarf No. 1: Graph Traversal - Optimization: Top-Down BFS
- Some bits may not be set to 1 in out
- Solution: repair the out array with the predecessor map
- In each iteration, instead of writing the index of the predecessor, write the negative index
- Use this information to restore the bitmap information
- This can be done in parallel
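The repair pass can be sketched as below. Two assumptions of this sketch (not stated in the slides): vertex indices start at 1 so that negation is unambiguous, and the loop the paper runs in parallel is sequential here:

```python
def repair_out(p, out_bits):
    """Fix bits the data race may have dropped: predecessor entries
    written this level carry a negated index, so flip the sign back
    and set the corresponding bit in out."""
    for v in range(len(p)):
        if p[v] < 0:            # vertex was discovered in this iteration
            out_bits[v] = 1     # restore the possibly-missing bit
            p[v] = -p[v]        # recover the true predecessor index
    return p, out_bits

# Vertices 1 and 3 were discovered this level (negative entries).
p, out_bits = repair_out([0, -1, 1, -2], [0, 0, 0, 0])
```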
Dwarf No. 1: Graph Traversal - Optimization: Bottom-Up BFS
- Typically handles more vertices at the same time (compared to top-down)
- Parallelization idea: each thread handles multiple vertices and uses SIMD to explore vertices in parallel
Figure: Optimized bottom-up BFS [Source1]
Dwarf No. 1: Graph Traversal - Performance Comparison
Figure: Performance comparison of native BFS [Source1]
Dwarf No. 1: Graph Traversal - Heterogeneous BFS
- Idea: partition and offload the work of bottom-up levels to the MIC
- Run the top-down computation on the host only
- Hide the data-transfer overhead with asynchronous communication
- Host system: Intel Xeon E5-2692 (12 cores at 2.20 GHz, 32 KB L1, 256 KB L2, 30 MB shared L3)
Dwarf No. 1: Graph Traversal - Heterogeneous BFS
Figure: Time chart of heterogeneous BFS [Source1]
Dwarf No. 1: Graph Traversal - Performance Comparison
Figure: Performance of different task partition ratios [Source1]
Dwarf No. 1: Graph Traversal - Performance: Heterogeneous BFS
Figure: Performance at different graph scales [Source1]
Dwarf No. 1: Graph Traversal - Conclusion
- Performance is about 1.4× faster than the CPU-only version
- Graph traversal can benefit from heterogeneous architectures
- The MIC is only efficient for parts of the computation and at large scale
- The heterogeneous version is not always faster: there has to be enough computation to compensate for the communication overhead
- Also: the published benchmarks do not include the initial time to copy the graph G to the MIC
Dwarf No. 2: MapReduce - Overview
Dwarf No. 2: MapReduce - Map Reduce
Phoenix++: a state-of-the-art MapReduce framework for multi-core CPUs
- can run on the Xeon Phi without any changes
- is not aware of Xeon Phi hardware features
Dwarf No. 2: MapReduce - Performance Loss of Phoenix++
- Poor VPU usage: the compiler is unable to vectorize the code effectively
- High memory latency: a large number of random memory accesses plus small L2 caches → many cache misses
- Small memory (8 GB)
Dwarf No. 2: MapReduce - MRPhi Optimizations
- Vectorization-friendly code → the compiler vectorizes map operations automatically
- Make use of SIMD parallelism
- Pipeline the map and reduce phases
  → map function: heavy computational workload
  → reduce function: many memory accesses
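The pipelining point can be illustrated with a toy word count in which a background reducer drains intermediate pairs while the mapper keeps computing. This is a sketch of the idea only, not MRPhi's implementation:

```python
import queue
import threading
from collections import Counter

def pipelined_word_count(chunks):
    """Overlap the compute-heavy map phase with the memory-heavy reduce
    phase by streaming intermediate results through a bounded queue."""
    q = queue.Queue(maxsize=64)
    totals = Counter()

    def reducer():
        while True:
            pairs = q.get()
            if pairs is None:           # sentinel: mapping is finished
                break
            totals.update(pairs)        # reduce: scattered memory accesses

    worker = threading.Thread(target=reducer)
    worker.start()
    for chunk in chunks:
        q.put(Counter(chunk.split()))   # map: per-chunk computation
    q.put(None)
    worker.join()
    return totals

print(pipelined_word_count(["a b a", "b c"]))
```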
Dwarf No. 2: MapReduce - Comparison
Figure: [Source2]
Comparison of MRPhi on the Xeon Phi to Phoenix++ on a Xeon CPU
Figure: [Source2]
- Using vectorization and SIMD instructions is very important on the Xeon Phi
- The small local cache is inefficient for a large number of random memory accesses
- The system overhead is higher: it takes 30-50% of the running time (on the Xeon ≤ 10%)
Dwarf No. 3: Dense Linear Algebra - Overview
Dwarf No. 3: Dense Linear Algebra - Chosen Paper
- Paper: "HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi"
- MAGMA MIC library: an open-source, high-performance library targeting heterogeneous architectures
- Tries to provide DLA functionality equivalent to the LAPACK library
- Describes QR factorization as an example
Dwarf No. 3: Dense Linear Algebra - QR Factorization
- QR factorization consists of two different types of operations:
  - Level-2 BLAS: vector-matrix operations
  - Level-3 BLAS: matrix-matrix operations
- Problem: Level-2 BLAS operations are not efficient on the MIC architecture
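To make the Level-2 vs. Level-3 split concrete, here is a didactic pure-Python Householder QR: forming each reflector is vector-level (Level-2-like) work, while applying it to all trailing columns is the matrix-matrix (Level-3-like) bulk. This is a textbook sketch, not MAGMA code:

```python
def householder_qr(A):
    """Return (H, R) where H is the accumulated product of reflectors
    (so H applied to A gives R) and R is upper triangular."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    H = [[float(i == j) for j in range(m)] for i in range(m)]
    for k in range(n):
        # Level-2-like step: build the Householder vector for column k
        x = [R[i][k] for i in range(k, m)]
        norm = sum(t * t for t in x) ** 0.5
        v = x[:]
        v[0] += norm if x[0] >= 0 else -norm   # sign choice avoids cancellation
        vv = sum(t * t for t in v)
        if vv == 0.0:
            continue
        # Level-3-like step: apply I - 2*v*v^T/vv to the trailing matrix
        for M, width in ((R, n), (H, m)):
            for j in range(width):
                dot = sum(v[i - k] * M[i][j] for i in range(k, m))
                c = 2.0 * dot / vv
                for i in range(k, m):
                    M[i][j] -= c * v[i - k]
    return H, R
```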
Dwarf No. 3: Dense Linear Algebra - Main Idea of the Paper
- Benefit from both types of hardware architecture: CPU and MIC
- Run CPU-optimized parts on the CPU and MIC-optimized parts on the Xeon Phi
- Hide synchronization overhead as much as possible
Figure: Computational pattern for hybrid factorizations in MAGMA [Source3]
Dwarf No. 3: Dense Linear Algebra - Performance: CPU vs. CPU+Phi
Figure: Performance comparison of the CPU-only version and the hybrid version [Source3]
Dwarf No. 3: Dense Linear Algebra - Conclusion: DLA
- The hybrid version is more than 2× faster
- However, the comparison is not entirely fair: a Phi-only version would perform poorly
- Memory-bound (Level-2 BLAS) operations are executed on the CPU in parallel
Dwarf No. 4: Spectral Methods - Overview
Dwarf No. 4: Spectral Methods - Spectral Methods
- 1D FFT computations
- Stampede cluster
- Texas Advanced Computing Center
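The kernel being scaled here is the 1D discrete Fourier transform; a textbook recursive radix-2 Cooley-Tukey reference version (not the paper's low-communication algorithm) looks like this:

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])          # DFT of the even-indexed samples
    odd = fft(x[1::2])           # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out
```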
Dwarf No. 4: Spectral Methods - Execution Time
Figure: Xeon and Xeon Phi execution time [Source4]
Dwarf No. 4: Spectral Methods - Performance
Figure: Performance comparison [Source4]
- Xeon Phi: 6.7 TFLOPS with 512 nodes
- Fujitsu K computer: 206 TFLOPS with 81K nodes
- 5× per-node performance
Conclusion
What is Xeon Phi good for?
- High parallelism
- Utilizing heterogeneous architectures
- Easy learning curve and probably higher productivity
- Large vectors (512-bit SIMD)
- More memory bandwidth
- However, it is important to make use of the MIC architecture: use the highly parallel architecture and the SIMD capabilities
Sources
1. Using the Intel Many Integrated Core to accelerate graph traversal. International Journal of High Performance Computing Applications, 28 February 2014. Tao Gao, Yutong Lu, Baida Zhang, Guang Suo
2. Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor, 1 September 2013. Mian Lu, Lei Zhang, Huynh Phung Huynh, Zhongliang Ong, Yun Liang, Bingsheng He, Rick Siow Mong Goh, Richard Huynh
3. Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel Xeon Phi Coprocessor, 2013. Alexander Heinecke, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Alexander Kobotov, Roman Dubtsov, Greg Henry, Aniruddha G. Shet, George Chrysos, Pradeep Dubey
4. Tera-Scale 1D FFT with Low-Communication Algorithm and Intel Xeon Phi Coprocessors. Jongsoo Park, Ganesh Bikshandi, Karthikeyan Vaidyanathan, Ping Tak Peter Tang, Pradeep Dubey, Daehyun Kim
Credits
1. Introduction: Thomas Lange
2. Dwarf 1 - Graph Traversal: Kai Neumann
3. Dwarf 2 - MapReduce: Thomas Lange
4. Dwarf 3 - Dense Linear Algebra: Kai Neumann
5. Dwarf 4 - Spectral Methods: Thomas Lange
6. Conclusion: Thomas Lange & Kai Neumann