Xeon Phi - Odd Dwarfs
Thomas Lange, Kai Neumann
February 12, 2015
Thomas Lange, Kai Neumann | February 12, 2015 0/39
Outline
1 Dwarf No. 1: Graph Traversal
2 Dwarf No. 2: MapReduce
3 Dwarf No. 3: Dense Linear Algebra
4 Dwarf No. 4: Spectral Methods
5 Conclusion
6 Sources
Dwarf No. 1: Graph Traversal - Overview
Dwarf No. 1: Graph Traversal - Chosen Paper
- Paper: "Using the Intel Many Integrated Core to accelerate graph traversal"
- Published July 2014 in the International Journal of High Performance Computing Applications
- Focuses on accelerating breadth-first search (BFS) on the MIC architecture
Dwarf No. 1: Graph Traversal - Graph Definition
- BFS on scale-free graphs
- Graphs can be parameterized by:
  - scale: the graph has 2^scale vertices
  - edgefactor: edgefactor × 2^scale undirected edges ⇒ edgefactor undirected edges per vertex
- Since the graph is undirected, the average degree is 2 × edgefactor
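The parameterization above can be written out as a tiny helper (hypothetical names, not from the paper):

```python
def graph_params(scale, edgefactor):
    """Size of a scale-free benchmark graph from its two parameters."""
    vertices = 2 ** scale             # 2^scale vertices
    edges = edgefactor * vertices     # undirected edges in total
    avg_degree = 2 * edgefactor       # each undirected edge counts at both endpoints
    return vertices, edges, avg_degree

print(graph_params(20, 16))  # (1048576, 16777216, 32)
```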
Dwarf No. 1: Graph Traversal - Level-Synchronized BFS
- Given a graph G(V,E), BFS explores all vertices and produces a breadth-first spanning tree
- All vertices at one level are processed before any vertices further from the source vertex
- The active set of vertices is called the frontier
- In every step, all vertices reachable from the frontier (in one step) are explored
- The set of all unvisited vertices reachable from the frontier (out) is added to the set of visited vertices vis
- In the next step, the newly reached vertices (out) become the new frontier
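A minimal sequential sketch of this level-synchronized scheme (Python sets standing in for the paper's bitmaps):

```python
def level_synchronized_bfs(adj, source):
    """Level-synchronized BFS: process the whole frontier (in), collect
    newly reached vertices (out), merge them into vis, then swap."""
    p = {source: source}        # predecessor map / spanning tree
    vis = {source}              # visited vertices
    frontier = {source}         # 'in' set for the current level
    while frontier:
        out = set()
        for u in frontier:
            for v in adj[u]:
                if v not in vis:
                    vis.add(v)
                    p[v] = u
                    out.add(v)
        frontier = out          # 'out' becomes the next frontier
    return p
```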
Dwarf No. 1: Graph Traversal - Level-Synchronized BFS: Algorithm
Figure: Level-synchronized BFS: in: frontier, out: frontier for the next step, vis: set of visited vertices [Source1]
Dwarf No. 1: Graph Traversal - Top-Down & Bottom-Up
Two ways to explore new vertices in BFS:
- Top-down:
  - All vertices adjacent to the current frontier are explored
  - If an unvisited vertex is found, it is added to the set of newly reached vertices out (the frontier for the next step)
  - Better if the frontier is small
- Bottom-up:
  - Instead of exploring outward from the frontier, each unvisited vertex is explored if it has a neighbor in the frontier
  - Better if the frontier is big
⇒ A hybrid approach benefits from both
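The hybrid scheme can be sketched as two interchangeable step functions plus a frontier-size heuristic; the 10% switch-over threshold below is an illustrative guess, not the paper's tuned rule:

```python
def top_down_step(adj, frontier, vis, p):
    """Scan outward from the frontier; cheap when the frontier is small."""
    out = set()
    for u in frontier:
        for v in adj[u]:
            if v not in vis:
                vis.add(v); p[v] = u; out.add(v)
    return out

def bottom_up_step(adj, frontier, vis, p):
    """Scan every unvisited vertex for a frontier neighbor; cheap when
    the frontier is big."""
    out = set()
    for v in adj:
        if v in vis:
            continue
        for u in adj[v]:
            if u in frontier:
                vis.add(v); p[v] = u; out.add(v)
                break
    return out

def hybrid_bfs(adj, source, threshold=0.1):
    """Choose the step direction per level based on the frontier size."""
    p, vis, frontier = {source: source}, {source}, {source}
    while frontier:
        step = top_down_step if len(frontier) < threshold * len(adj) else bottom_up_step
        frontier = step(adj, frontier, vis, p)
    return p
```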
Dwarf No. 1: Graph Traversal - Hybrid BFS
Figure: Hybrid BFS: chooses whether top-down or bottom-up is used [Source1]
Dwarf No. 1: Graph Traversal - Data Representation
- The graph G is represented as an adjacency matrix stored in compressed-sparse-row (CSR) format
- The predecessor map p is an array of integers (vertex indices)
- in, out, and vis are bitmaps
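CSR stores the adjacency structure in two flat arrays; a small sketch with plain Python lists standing in for the paper's packed arrays:

```python
def to_csr(adj, n):
    """Build CSR arrays: the neighbors of vertex v live in
    col_idx[row_ptr[v]:row_ptr[v + 1]]."""
    row_ptr, col_idx = [0], []
    for v in range(n):
        col_idx.extend(adj.get(v, []))   # append v's neighbor list
        row_ptr.append(len(col_idx))     # record where the next row starts
    return row_ptr, col_idx

def neighbors(row_ptr, col_idx, v):
    """Slice out the neighbor list of v."""
    return col_idx[row_ptr[v]:row_ptr[v + 1]]
```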
Dwarf No. 1: Graph Traversal - Optimization: Top-Down BFS
- Goal: optimize for the MIC architecture
- Exploit two levels of parallelism:
  - Explore each vertex in the frontier in parallel ⇒ multi-threading
  - Inspect each neighbor of a vertex in parallel ⇒ SIMD
Figure: Optimized top-down BFS [Source1]
- Data race in step 3: multiple threads access p and out
Dwarf No. 1: Graph Traversal - Optimization: Top-Down BFS
Figure: Optimized top-down BFS [Source1]
- The left data race still produces a valid predecessor map
- The right data race needs further investigation
Dwarf No. 1: Graph Traversal - Optimization: Top-Down BFS
- Some bits may not be set to 1 in out
- Solution: repair the out array with the predecessor map
- In each iteration, instead of writing the index of the predecessor, write the negative index
- Use this information to restore the bitmap information
- This can be done in parallel
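The repair pass can be sketched as below. Two assumptions of this sketch (not stated in the slides): vertex indices start at 1 so that negation is unambiguous, and the loop the paper runs in parallel is sequential here:

```python
def repair_out(p, out_bits):
    """Fix bits the data race may have dropped: predecessor entries
    written this level carry a negated index, so flip the sign back
    and set the corresponding bit in out."""
    for v in range(len(p)):
        if p[v] < 0:            # vertex was discovered in this iteration
            out_bits[v] = 1     # restore the possibly-missing bit
            p[v] = -p[v]        # recover the true predecessor index
    return p, out_bits

# Vertices 1 and 3 were discovered this level (negative entries).
p, out_bits = repair_out([0, -1, 1, -2], [0, 0, 0, 0])
```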
Dwarf No. 1: Graph Traversal - Optimization: Bottom-Up BFS
- Typically handles more vertices at the same time (compared to top-down)
- Parallelization idea: each thread handles multiple vertices and uses SIMD to explore vertices in parallel
Figure: Optimized bottom-up BFS [Source1]
Dwarf No. 1: Graph Traversal - Performance Comparison
Figure: Performance comparison of native BFS [Source1]
Dwarf No. 1: Graph Traversal - Heterogeneous BFS
- Idea: partition and offload the work of bottom-up levels to the MIC
- Run the top-down computation on the host only
- Hide the data-transfer overhead with asynchronous communication
- Host system: Intel Xeon E5-2692 (12 cores at 2.20 GHz, 32 KB L1, 256 KB L2, 30 MB shared L3)
Dwarf No. 1: Graph Traversal - Heterogeneous BFS
Figure: Time chart of heterogeneous BFS [Source1]
Dwarf No. 1: Graph Traversal - Performance Comparison
Figure: Performance of different task partition ratios [Source1]
Dwarf No. 1: Graph Traversal - Performance: Heterogeneous BFS
Figure: Performance at different graph scales [Source1]
Dwarf No. 1: Graph Traversal - Conclusion
- Performance is about 1.4× faster than the CPU-only version
- Graph traversal can benefit from heterogeneous architectures
- The MIC is only efficient for parts of the computation and at large scale
- The heterogeneous version is not always faster: there has to be enough computation to compensate for the communication overhead
- Also: the published benchmarks do not include the initial time to copy the graph G to the MIC
Dwarf No. 2: MapReduce - Overview
Dwarf No. 2: MapReduce - Map Reduce
Phoenix++: a state-of-the-art MapReduce framework for multi-core CPUs
- can run on the Xeon Phi without any changes
- is not aware of Xeon Phi hardware features
Dwarf No. 2: MapReduce - Performance Loss of Phoenix++
- Poor VPU usage: the compiler is unable to vectorize the code effectively
- High memory latency: a large number of random memory accesses plus small L2 caches → many cache misses
- Small memory (8 GB)
Dwarf No. 2: MapReduce - MRPhi Optimizations
- Vectorization-friendly code → the compiler vectorizes map operations automatically
- Make use of SIMD parallelism
- Pipeline the map and reduce phases
  → map function: heavy computational workload
  → reduce function: many memory accesses
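The pipelining point can be illustrated with a toy word count in which a background reducer drains intermediate pairs while the mapper keeps computing. This is a sketch of the idea only, not MRPhi's implementation:

```python
import queue
import threading
from collections import Counter

def pipelined_word_count(chunks):
    """Overlap the compute-heavy map phase with the memory-heavy reduce
    phase by streaming intermediate results through a bounded queue."""
    q = queue.Queue(maxsize=64)
    totals = Counter()

    def reducer():
        while True:
            pairs = q.get()
            if pairs is None:           # sentinel: mapping is finished
                break
            totals.update(pairs)        # reduce: scattered memory accesses

    worker = threading.Thread(target=reducer)
    worker.start()
    for chunk in chunks:
        q.put(Counter(chunk.split()))   # map: per-chunk computation
    q.put(None)
    worker.join()
    return totals

print(pipelined_word_count(["a b a", "b c"]))
```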
Dwarf No. 2: MapReduce - Comparison
Figure: [Source2]
Comparison of MRPhi on the Xeon Phi to Phoenix++ on a Xeon CPU
Figure: [Source2]
- Using vectorization and SIMD instructions is very important on the Xeon Phi
- The small local cache is inefficient for a large number of random memory accesses
- The system overhead is higher: it takes 30-50% of the running time (on the Xeon ≤ 10%)
Dwarf No. 3: Dense Linear Algebra - Overview
Dwarf No. 3: Dense Linear Algebra - Chosen Paper
- Paper: "HPC Programming on Intel Many-Integrated-Core Hardware with MAGMA Port to Xeon Phi"
- MAGMA MIC library: an open-source, high-performance library targeting heterogeneous architectures
- Tries to provide DLA functionality equivalent to the LAPACK library
- Describes QR factorization as an example
Dwarf No. 3: Dense Linear Algebra - QR Factorization
- QR factorization consists of two different types of operations:
  - Level-2 BLAS: vector-matrix operations
  - Level-3 BLAS: matrix-matrix operations
- Problem: Level-2 BLAS operations are not efficient on the MIC architecture
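To make the Level-2 vs. Level-3 split concrete, here is a didactic pure-Python Householder QR: forming each reflector is vector-level (Level-2-like) work, while applying it to all trailing columns is the matrix-matrix (Level-3-like) bulk. This is a textbook sketch, not MAGMA code:

```python
def householder_qr(A):
    """Return (H, R) where H is the accumulated product of reflectors
    (so H applied to A gives R) and R is upper triangular."""
    m, n = len(A), len(A[0])
    R = [row[:] for row in A]
    H = [[float(i == j) for j in range(m)] for i in range(m)]
    for k in range(n):
        # Level-2-like step: build the Householder vector for column k
        x = [R[i][k] for i in range(k, m)]
        norm = sum(t * t for t in x) ** 0.5
        v = x[:]
        v[0] += norm if x[0] >= 0 else -norm   # sign choice avoids cancellation
        vv = sum(t * t for t in v)
        if vv == 0.0:
            continue
        # Level-3-like step: apply I - 2*v*v^T/vv to the trailing matrix
        for M, width in ((R, n), (H, m)):
            for j in range(width):
                dot = sum(v[i - k] * M[i][j] for i in range(k, m))
                c = 2.0 * dot / vv
                for i in range(k, m):
                    M[i][j] -= c * v[i - k]
    return H, R
```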
Dwarf No. 3: Dense Linear Algebra - Main Idea of the Paper
- Benefit from both types of hardware architecture: CPU and MIC
- Run CPU-optimized parts on the CPU and MIC-optimized parts on the Xeon Phi
- Hide synchronization overhead as much as possible
Figure: Computational pattern for hybrid factorizations in MAGMA [Source3]
Dwarf No. 3: Dense Linear Algebra - Performance: CPU vs. CPU+Phi
Figure: Performance comparison of the CPU-only version and the hybrid version [Source3]
Dwarf No. 3: Dense Linear Algebra - Conclusion: DLA
- The hybrid version is more than 2× faster
- However, the comparison is not entirely fair: a Phi-only version would perform poorly
- Memory-bound (Level-2 BLAS) operations are executed on the CPU in parallel
Dwarf No. 4: Spectral Methods - Overview
Dwarf No. 4: Spectral Methods - Spectral Methods
- 1D FFT computations
- Stampede cluster
- Texas Advanced Computing Center
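The kernel being scaled here is the 1D discrete Fourier transform; a textbook recursive radix-2 Cooley-Tukey reference version (not the paper's low-communication algorithm) looks like this:

```python
import cmath

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])          # DFT of the even-indexed samples
    odd = fft(x[1::2])           # DFT of the odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out
```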
Dwarf No. 4: Spectral Methods - Execution Time
Figure: Xeon and Xeon Phi execution time [Source4]
Dwarf No. 4: Spectral Methods - Performance
Figure: Performance comparison [Source4]
- Xeon Phi: 6.7 TFLOPS with 512 nodes
- Fujitsu K computer: 206 TFLOPS with 81K nodes
- 5× per-node performance
Conclusion
What is Xeon Phi good for?
- High parallelism
- Utilizing heterogeneous architectures
- Easy learning curve and probably higher productivity
- Large vectors (512-bit SIMD)
- More memory bandwidth
- However, it is important to make use of the MIC architecture: use the highly parallel architecture and the SIMD capabilities
Sources
1. Using the Intel Many Integrated Core to accelerate graph traversal. International Journal of High Performance Computing Applications, 28 February 2014. Tao Gao, Yutong Lu, Baida Zhang, Guang Suo
2. Optimizing the MapReduce Framework on Intel Xeon Phi Coprocessor, 1 September 2013. Mian Lu, Lei Zhang, Huynh Phung Huynh, Zhongliang Ong, Yun Liang, Bingsheng He, Rick Siow Mong Goh, Richard Huynh
3. Design and Implementation of the Linpack Benchmark for Single and Multi-Node Systems Based on Intel Xeon Phi Coprocessor, 2013. Alexander Heinecke, Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Alexander Kobotov, Roman Dubtsov, Greg Henry, Aniruddha G. Shet, George Chrysos, Pradeep Dubey
4. Tera-Scale 1D FFT with Low-Communication Algorithm and Intel Xeon Phi Coprocessors. Jongsoo Park, Ganesh Bikshandi, Karthikeyan Vaidyanathan, Ping Tak Peter Tang, Pradeep Dubey, Daehyun Kim
Credits
1. Introduction: Thomas Lange
2. Dwarf 1 - Graph Traversal: Kai Neumann
3. Dwarf 2 - MapReduce: Thomas Lange
4. Dwarf 3 - Dense Linear Algebra: Kai Neumann
5. Dwarf 4 - Spectral Methods: Thomas Lange
6. Conclusion: Thomas Lange & Kai Neumann