

Page 1: Xeon Phi - Odd Dwarfs

Xeon Phi - Odd Dwarfs

Thomas Lange, Kai Neumann

February 12, 2015

Thomas Lange, Kai Neumann | February 12, 2015 0/39


Page 2: Xeon Phi - Odd Dwarfs

1 Dwarf No. 1: Graph Traversal

2 Dwarf No. 2: MapReduce

3 Dwarf No. 3: Dense-Linear Algebra

4 Dwarf No. 4: Spectral Methods

5 Conclusion

6 Sources

Outline

Page 3: Xeon Phi - Odd Dwarfs

1 Dwarf No. 1: Graph Traversal

2 Dwarf No. 2: MapReduce

3 Dwarf No. 3: Dense-Linear Algebra

4 Dwarf No. 4: Spectral Methods

5 Conclusion

6 Sources

Dwarf No. 1: Graph Traversal -

Overview

Thomas Lange, Kai Neumann | February 12, 2015 3/39

Page 4: Xeon Phi - Odd Dwarfs

- Paper: Using the Intel Many Integrated Core to Accelerate Graph Traversal

- Published July 2014 in the 'International Journal of High Performance Computing Applications'

- Focuses on accelerating breadth-first search (BFS) on the MIC architecture

Dwarf No. 1: Graph Traversal - Chosen Paper

Page 5: Xeon Phi - Odd Dwarfs

- BFS on scale-free graphs

- Graphs can be parameterized by:
  - scale: the graph has 2^scale vertices
  - edgefactor: edgefactor × 2^scale undirected edges ⇒ edgefactor undirected edges per vertex

- Since the graph is undirected: average degree of 2 × edgefactor
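The sizes above follow directly from the two parameters; a quick sketch (scale and edgefactor as defined on this slide):

```python
def graph_size(scale, edgefactor):
    """Vertex count, undirected edge count, and average degree
    for a graph parameterized by scale and edgefactor."""
    n_vertices = 2 ** scale
    n_edges = edgefactor * n_vertices        # undirected edges
    avg_degree = 2 * n_edges / n_vertices    # each undirected edge touches 2 vertices
    return n_vertices, n_edges, avg_degree

# Example with scale 20 and edgefactor 16:
print(graph_size(20, 16))  # (1048576, 16777216, 32.0)
```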

Dwarf No. 1: Graph Traversal - Graph Definition

Page 6: Xeon Phi - Odd Dwarfs

- Given a graph G(V, E), BFS explores all vertices and produces a breadth-first spanning tree

- All vertices at one level are processed before any vertices further from the source vertex

- The active set of vertices is called the frontier

- In every step, all vertices reachable from the frontier (in one step) are explored

- The set of all unvisited vertices reachable from the frontier (out) is added to the set of visited vertices vis

- In the next step, the newly reached vertices (out) become the new frontier
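A minimal sequential sketch of the level-synchronized scheme (plain Python sets stand in for the paper's bitmaps; the per-vertex loops are what the paper parallelizes):

```python
def level_sync_bfs(adj, source):
    """Level-synchronized BFS: the whole frontier is processed
    before the next level starts.
    adj: dict vertex -> list of neighbours; returns predecessor map p."""
    p = {source: source}      # predecessor map (source points to itself)
    vis = {source}            # set of visited vertices
    frontier = {source}       # 'in' set of the current level
    while frontier:
        out = set()           # vertices newly reached in this level
        for u in frontier:
            for v in adj[u]:
                if v not in vis:      # unvisited neighbour joins next level
                    vis.add(v)
                    p[v] = u
                    out.add(v)
        frontier = out        # 'out' becomes the next frontier
    return p
```

Running it on a small square graph (0-1, 0-2, 1-3, 2-3) yields a spanning tree rooted at the source.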

Dwarf No. 1: Graph Traversal - Level Synchronized BFS

Page 7: Xeon Phi - Odd Dwarfs

Figure: Level-synchronized BFS: in: frontier, out: frontier for the next step, vis: set of visited vertices [Source1]

Dwarf No. 1: Graph Traversal - Level Synchronized BFS: Algorithm

Page 8: Xeon Phi - Odd Dwarfs

Two ways to explore new vertices in BFS:

- Top-down:
  - All vertices adjacent to the current frontier are explored
  - If an unvisited vertex is found, it is added to the set of newly reached vertices out (the frontier for the next step)
  - Better if the frontier is small

- Bottom-up:
  - Instead of exploring outward from the frontier, each unvisited vertex checks whether it has a neighbour in the frontier
  - Better if the frontier is big

⇒ A hybrid approach benefits from both
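The hybrid idea can be sketched as follows; the switching threshold below is a made-up heuristic for illustration, not the paper's actual policy:

```python
def top_down_step(adj, frontier, vis, p):
    """One top-down level: expand every frontier vertex."""
    out = set()
    for u in frontier:
        for v in adj[u]:
            if v not in vis:
                vis.add(v)
                p[v] = u
                out.add(v)
    return out

def bottom_up_step(adj, frontier, vis, p):
    """One bottom-up level: every unvisited vertex looks for a parent
    in the frontier and stops at the first one found."""
    out = set()
    for v in adj:
        if v not in vis:
            for u in adj[v]:
                if u in frontier:
                    vis.add(v)
                    p[v] = u
                    out.add(v)
                    break
    return out

def hybrid_bfs(adj, source, threshold=0.1):
    """Choose bottom-up once the frontier holds more than `threshold`
    of all vertices (hypothetical heuristic), otherwise top-down."""
    p, vis, frontier = {source: source}, {source}, {source}
    while frontier:
        if len(frontier) > threshold * len(adj):
            frontier = bottom_up_step(adj, frontier, vis, p)
        else:
            frontier = top_down_step(adj, frontier, vis, p)
    return p
```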

Dwarf No. 1: Graph Traversal - Top-down & Bottom-up

Page 9: Xeon Phi - Odd Dwarfs

Figure: Hybrid BFS: choose whether Top-down or Bottom-up is used [Source1]

Dwarf No. 1: Graph Traversal - Hybrid BFS

Page 10: Xeon Phi - Odd Dwarfs

- The graph G is represented as an adjacency matrix stored in compressed-sparse-row (CSR) format

- The predecessor map p as an array of integers (vertex indices)

- in, out, vis as bitmaps
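A minimal sketch of the CSR layout, with plain Python lists standing in for the integer arrays:

```python
def to_csr(adj, n):
    """Build compressed-sparse-row arrays for an n-vertex graph.
    row_ptr[v]..row_ptr[v+1] delimits v's neighbours inside col_idx."""
    row_ptr = [0] * (n + 1)
    for v in range(n):
        row_ptr[v + 1] = row_ptr[v] + len(adj.get(v, ()))
    col_idx = []
    for v in range(n):
        col_idx.extend(sorted(adj.get(v, ())))
    return row_ptr, col_idx

def neighbours(row_ptr, col_idx, v):
    """All neighbours of v in O(degree) with no per-vertex allocation."""
    return col_idx[row_ptr[v]:row_ptr[v + 1]]
```

The contiguous col_idx array is what makes SIMD inspection of a vertex's neighbours practical.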

Dwarf No. 1: Graph Traversal - Data representation

Page 11: Xeon Phi - Odd Dwarfs

- Goal: optimize for the MIC architecture

- Exploit two levels of parallelism:
  - Explore each vertex in the frontier in parallel ⇒ multi-threading
  - Inspect each neighbour of a vertex in parallel ⇒ SIMD

Figure: Optimized Top-Down BFS [Source1]

- Data race in step 3: multiple threads access p and out

Dwarf No. 1: Graph Traversal - Optimization: Top-Down BFS

Page 12: Xeon Phi - Odd Dwarfs

Figure: Optimized Top-Down BFS [Source1]

- The left data race still produces a valid predecessor map

- The right data race needs further investigation

Dwarf No. 1: Graph Traversal - Optimization: Top-Down BFS

Page 13: Xeon Phi - Odd Dwarfs

- Some bits may not be set to 1 in out

- Solution: repair the out bitmap with the predecessor map
  - In each iteration, instead of writing the index of the predecessor, write the negative index
  - Use this information to restore the bitmap afterwards
  - Can be done in parallel
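A sketch of the repair pass; the exact encoding of "freshly discovered" is an assumption here (this sketch stores -(u + 1) so that predecessor index 0 is also representable as a negative value):

```python
def repair_out(p, out_bits):
    """Restore out-bitmap bits lost to the benign data race.
    Assumed convention: while a level is processed, a newly discovered
    vertex v stores -(u + 1) instead of its real predecessor u. The
    repair loop is independent per vertex, so it parallelizes trivially."""
    for v, pred in enumerate(p):
        if pred is not None and pred < 0:
            out_bits[v] = 1       # this bit may have been lost in the race
            p[v] = -pred - 1      # decode back to the real predecessor u
    return out_bits, p
```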

Dwarf No. 1: Graph Traversal - Optimization: Top-Down BFS

Page 14: Xeon Phi - Odd Dwarfs

- Typically handles more vertices at the same time (compared to top-down)

- Parallelization idea: each thread handles multiple vertices and uses SIMD to explore vertices in parallel

Figure: Optimized Bottom-Up [Source1]

Dwarf No. 1: Graph Traversal - Optimization: Bottom-Up BFS

Page 15: Xeon Phi - Odd Dwarfs

Figure: Performance Comparison of native BFS [Source1]

Dwarf No. 1: Graph Traversal - Performance Comparison


Page 17: Xeon Phi - Odd Dwarfs

- Idea: partition and offload the work of the bottom-up levels to the MIC

- Run the top-down computation on the host only

- Hide the data-transfer overhead with asynchronous communication

- Host system: Intel Xeon E5-2692 (12 cores at 2.20 GHz, 32 KB L1 and 256 KB L2 per core, 30 MB shared L3)

Dwarf No. 1: Graph Traversal - Heterogeneous BFS

Page 18: Xeon Phi - Odd Dwarfs

Figure: Time chart of heterogeneous BFS [Source1]

Dwarf No. 1: Graph Traversal - Heterogeneous BFS

Page 19: Xeon Phi - Odd Dwarfs

Figure: Performance of different task partition ratios [Source1]

Dwarf No. 1: Graph Traversal - Performance Comparison

Page 20: Xeon Phi - Odd Dwarfs

Figure: Performance of different Graph scales [Source1]

Dwarf No. 1: Graph Traversal - Performance: heterogeneous BFS

Page 21: Xeon Phi - Odd Dwarfs

- Performance is about 1.4 times faster than the CPU-only version

- Graph traversal can benefit from heterogeneous architectures

- The MIC is only efficient for parts of the computation and at large scale

- The heterogeneous version is not always faster: there has to be enough computation to compensate for the communication overhead

- Also: the published benchmarks do not include the initial time to copy the graph G to the MIC

Dwarf No. 1: Graph Traversal - Conclusion

Page 22: Xeon Phi - Odd Dwarfs

Dwarf No. 2: MapReduce - Overview

Page 23: Xeon Phi - Odd Dwarfs

Phoenix++

A state-of-the-art MapReduce framework for multi-core CPUs:

- can run on the Xeon Phi without any changes

- is not aware of Xeon Phi hardware features

Dwarf No. 2: MapReduce - Map Reduce


Page 26: Xeon Phi - Odd Dwarfs

- Poor VPU usage: the compiler is unable to vectorize the code effectively

- High memory latency: a large number of random memory accesses and small L2 caches → high cache-miss rates

- Small memory (8 GB)

Dwarf No. 2: MapReduce - performance loss of Phoenix++


Page 29: Xeon Phi - Odd Dwarfs

- Vectorization-friendly code → the compiler vectorizes map operations automatically

- Makes use of SIMD parallelism

- Pipelines the map and reduce phases → map function: heavy computational workload; reduce function: many memory accesses
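The pipelining idea can be sketched with a bounded queue between a mapping thread and a reducing consumer; this is a toy word count standing in for MRPhi's scheme, not its implementation:

```python
import threading
import queue

def word_count(chunks):
    """Pipeline map and reduce: the compute-heavy map phase runs in a
    worker thread while the memory-heavy reduce phase consumes its
    output concurrently through a bounded buffer."""
    q = queue.Queue(maxsize=4)           # bounded buffer between the phases

    def mapper():
        for chunk in chunks:
            pairs = [(w, 1) for w in chunk.split()]  # map: emit (word, 1)
            q.put(pairs)                 # blocks when the reducer lags behind
        q.put(None)                      # end-of-stream marker

    counts = {}
    t = threading.Thread(target=mapper)
    t.start()
    while True:                          # reduce: merge counts as they arrive
        pairs = q.get()
        if pairs is None:
            break
        for w, c in pairs:
            counts[w] = counts.get(w, 0) + c
    t.join()
    return counts
```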

Dwarf No. 2: MapReduce - MRPhi Optimizations


Page 32: Xeon Phi - Odd Dwarfs

Figure: [Source2]

Comparing MRPhi on the Xeon Phi to Phoenix++ on a Xeon CPU:

- Using vectorization or SIMD instructions is very important on the Xeon Phi

- The small local cache is inefficient for a large number of random memory accesses

- The system overhead is higher: it takes 30-50% of the running time (on the Xeon ≤ 10%)

Dwarf No. 2: MapReduce - Comparison


Page 37: Xeon Phi - Odd Dwarfs

Dwarf No. 3: Dense-Linear Algebra - Overview

Page 38: Xeon Phi - Odd Dwarfs

- Paper: HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi

- MAGMA MIC library: an open-source, high-performance library targeting heterogeneous architectures

- Tries to provide DLA functionality equivalent to the LAPACK library

- Describes QR factorization as an example

Dwarf No. 3: Dense-Linear Algebra - Chosen Paper

Page 39: Xeon Phi - Odd Dwarfs

- QR factorization consists of two different types of operations:
  - Level-2 BLAS: vector-matrix operations
  - Level-3 BLAS: matrix-matrix operations

- Problem: Level-2 BLAS operations are not efficient on the MIC architecture
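To make the Level-2 BLAS character visible, here is an unblocked Householder QR in plain Python, for illustration only: each reflector application below is a vector-matrix update, which is what blocked variants such as MAGMA's aggregate into Level-3 matrix multiplies.

```python
from math import sqrt, copysign

def householder_qr(A):
    """Unblocked Householder QR of an m x n matrix (list of rows).
    Every reflector application is a Level-2 BLAS-style
    vector-matrix operation. Returns (Q, R) with A = Q R."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    Q = [[float(i == j) for j in range(m)] for i in range(m)]
    for k in range(n):
        # Build the reflector that zeroes column k below the diagonal
        x = [R[i][k] for i in range(k, m)]
        normx = sqrt(sum(xi * xi for xi in x))
        if normx == 0.0:
            continue
        alpha = -copysign(normx, x[0])
        v = x[:]
        v[0] -= alpha
        normv = sqrt(sum(vi * vi for vi in v))
        v = [vi / normv for vi in v]
        # Apply H = I - 2 v v^T to the trailing submatrix (Level-2 BLAS)
        for j in range(k, n):
            s = sum(v[i] * R[k + i][j] for i in range(m - k))
            for i in range(m - k):
                R[k + i][j] -= 2.0 * v[i] * s
        # Accumulate Q <- Q H
        for r in range(m):
            s = sum(Q[r][k + i] * v[i] for i in range(m - k))
            for i in range(m - k):
                Q[r][k + i] -= 2.0 * s * v[i]
    return Q, R
```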

Dwarf No. 3: Dense-Linear Algebra - QR factorization

Page 40: Xeon Phi - Odd Dwarfs

- Benefit from both types of hardware: CPU and MIC

- Run the CPU-optimized parts on the CPU and the MIC-optimized parts on the Xeon Phi

- Hide the synchronization overhead as much as possible

Figure: Computational Pattern for hybrid factorizations in MAGMA [Source3]

Dwarf No. 3: Dense-Linear Algebra - Main Idea of the Paper

Page 41: Xeon Phi - Odd Dwarfs

Figure: Performance comparison of the CPU-only version and the hybrid version [Source3]

Dwarf No. 3: Dense-Linear Algebra - Performance: CPU vs CPU+Phi

Page 42: Xeon Phi - Odd Dwarfs

- The hybrid version performs more than 2x faster

- However, this is not a fair comparison: a Phi-only version would perform poorly

- Memory-bound operations (Level-2 BLAS) are executed on the CPU in parallel

Dwarf No. 3: Dense-Linear Algebra - Conclusion: DLA

Page 43: Xeon Phi - Odd Dwarfs

Dwarf No. 4: Spectral Methods - Overview

Page 44: Xeon Phi - Odd Dwarfs

- 1D FFT computations

- Stampede cluster

- Texas Advanced Computing Center
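For reference, the kernel in question: a toy single-node radix-2 Cooley-Tukey 1D FFT, without any of the low-communication distribution tricks from the paper:

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey 1D FFT.
    len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])          # FFT of the even-indexed samples
    odd = fft(x[1::2])           # FFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]  # twiddle factor
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out
```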

Dwarf No. 4: Spectral Methods - Spectral Methods

Page 45: Xeon Phi - Odd Dwarfs

Figure: Xeon and Xeon Phi execution time [Source4]

Dwarf No. 4: Spectral Methods - Execution Time

Page 46: Xeon Phi - Odd Dwarfs

Figure: Performance comparison [Source4]

- Xeon Phi: 6.7 TFLOPS with 512 nodes

- Fujitsu K computer: 206 TFLOPS with 81K nodes

- ⇒ 5x per-node performance

Dwarf No. 4: Spectral Methods - Performance


Page 50: Xeon Phi - Odd Dwarfs

What is Xeon Phi good for?

- High parallelism

- Utilizing heterogeneous architectures

- Easy learning curve and probably higher productivity

- Large vectors (512-bit SIMD)

- More memory bandwidth

- However, it is important to make use of the MIC architecture: use the highly parallel architecture and the SIMD capabilities

Conclusion


Page 57: Xeon Phi - Odd Dwarfs

1. Using the Intel Many Integrated Core to Accelerate Graph Traversal. Tao Gao, Yutong Lu, Baida Zhang, Guang Suo. International Journal of High Performance Computing Applications, published 28 February 2014.

2. Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor. Mian Lu, Lei Zhang, Huynh Phung Huynh, Zhongliang Ong, Yun Liang, Bingsheng He, Rick Siow Mong Goh, Richard Huynh. 1 September 2013.

3. Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel Xeon Phi Coprocessor. Alexander Heinecke, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Alexander Kobotov, Roman Dubtsov, Greg Henry, Aniruddha G. Shet, George Chrysos, Pradeep Dubey. 2013.

4. Tera-Scale 1D FFT with Low-Communication Algorithm and Intel Xeon Phi Coprocessors. Jongsoo Park, Ganesh Bikshandi, Karthikeyan Vaidyanathan, Ping Tak Peter Tang, Pradeep Dubey, Daehyun Kim.

Sources

Page 58: Xeon Phi - Odd Dwarfs

1. Introduction: Thomas Lange

2. Dwarf 1 - Graph Traversal: Kai Neumann

3. Dwarf 2 - MapReduce: Thomas Lange

4. Dwarf 3 - Dense Linear Algebra: Kai Neumann

5. Dwarf 4 - Spectral Methods: Thomas Lange

6. Conclusion: Thomas Lange & Kai Neumann

Credits
