Hybrid Parallel Programming for Massive Graph Analysis
Kamesh Madduri ([email protected])
Computational Research Division, Lawrence Berkeley National Laboratory
SIAM Annual Meeting 2010, July 12, 2010
(Source: graphanalysis.org/SIAM-AN10/05_Madduri.pdf)
Hybrid Parallel Programming
Large‐scale graph analysis utilizing
• Clusters of x86 multicore processors
  – MPI + OpenMP/UPC
• CPU+GPU
  – MPI + OpenCL
• FPGAs, accelerators
  – Host code + accelerator code
Why hybrid programming?
• Traditional sources of performance improvement are flat‐lining
• We need new algorithms that exploit large on‐chip memory, shared caches, and high DRAM bandwidth
Image source: Herb Sutter, “The Free Lunch is Over”, Dr. Dobb’s Journal, 2009.
This talk: Two case studies
• MPI + OpenMP on shared‐memory multicore processor clusters
  – Graph analytics on online social network crawls and synthetic “power‐law” random graphs
  – Traversal and simplification of a DNA fragment assembly string graph arising in a de novo short‐read genome assembly algorithm
Characterizing large‐scale graph‐theoretic computations
[Figure: example queries placed on two axes, locality characteristics (streaming data/local computation vs. random/global memory accesses) against data size and computational complexity (N = 10^4 to 10^12 vertices/edges; O(N), O(N log N), O(N^2); peta+ scale).]
• Random/global memory accesses: enumerate all contacts of K within X hops; find all events in the past six months similar to event “Y”
• Streaming data/local computation: enumerate all friends of K; list today’s top trending events
Parallelization Strategy
[Figure: strategies overlaid on the same locality vs. data‐size axes (N = 10^4 to 10^12; O(N), O(N log N), O(N^2); peta+ scale).]
• Random/global memory accesses: partition the network and aggressively attempt to minimize communication; at smaller data sizes, replicate the graph on each node
• Streaming data/local computation: partition the network
Minimizing Communication
• Irregular and memory‐intensive graph problems: intra‐ and inter‐node communication costs (+ I/O costs, memory latency) typically dominate local computational complexity
• Key to parallel performance: enhance data locality, avoid superfluous inter‐node communication
  – Avoid a P‐way partitioning of the graph
  – Create P·MP/MG replicas instead
[Figure: a multicore node, with cores and their caches above a memory controller hub and DRAM; the graph + data structures reside in DRAM. Example: P = 12, MG/MP = 4, giving 3 replicas, each with a 4‐way partitioning.]
Real‐world data
Assembled a collection of graphs for algorithm performance analysis, from some of the largest publicly‐available network data sets.

| Name | # vertices | # edges | Type |
|---|---|---|---|
| Amazon-2003 | 473.30 K | 3.50 M | co-purchase |
| eu-2005 | 862.00 K | 19.23 M | www |
| Flickr | 1.86 M | 22.60 M | social |
| wiki-Talk | 2.40 M | 5.02 M | collab |
| orkut | 3.07 M | 223.00 M | social |
| cit-Patents | 3.77 M | 16.50 M | cite |
| Livejournal | 5.28 M | 77.40 M | social |
| uk-2002 | 18.50 M | 198.10 M | www |
| USA-road | 23.90 M | 29.00 M | transp. |
| webbase-2001 | 118.14 M | 1.02 B | www |
“2D” Graph Partitioning Strategy
• Tuned for graphs with unbalanced degree distributions and incremental updates
  – Sort vertices by degree
  – Form roughly MG/MP local communities around “high‐degree” vertices & partition adjacencies
  – Reorder vertices by degree, assign contiguous chunks to each of the MG/MP nodes
  – Assign ownership of any remaining low‐degree vertices to processes
• Comparison: 1D p‐way partitioning, 1D p‐way partitioning with vertices shuffled
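A minimal serial sketch of the degree-based reordering and chunk assignment (hypothetical helper `degree_partition`; the actual scheme additionally forms communities around high-degree vertices, partitions their adjacencies, and maintains replicas):

```python
from collections import defaultdict

def degree_partition(edges, n_parts):
    """Toy sketch: reorder vertices by decreasing degree, then hand out
    contiguous chunks of the ordering to each of n_parts owners."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    ordered = sorted(degree, key=lambda x: -degree[x])  # high degree first
    chunk = (len(ordered) + n_parts - 1) // n_parts
    return {v: i // chunk for i, v in enumerate(ordered)}

edges = [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), (3, 4)]
owner = degree_partition(edges, 2)
# vertex 0 has the highest degree, so it lands in the first chunk
```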
Parallel Breadth‐First Search Implementation
[Figure: example graph with 10 vertices (0–9) and its adjacency matrix, split into two vertex partitions; BFS starts from source vertex 0; vertex 4 is “high‐degree”.]
• Expensive preprocessing (partitioning + reordering) step, currently untimed
Parallel BFS Implementation
• Concurrency in each phase limited by the size of the frontier array
• Local computation: inspecting adjacencies, creating a list of unvisited vertices
• Parallel communication step: all‐to‐all exchange of frontier vertices
  – Potentially P^2 exchanges
  – Partitioning, replication, and reordering significantly reduce the number of messages
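The structure above can be sketched as a single-process, level-synchronous BFS in Python (a simplification: the distributed version partitions the adjacency data across nodes and replaces the frontier hand-off with an all-to-all exchange):

```python
from collections import defaultdict

def bfs_levels(edges, source):
    """Level-synchronous BFS: each iteration expands the whole frontier."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:            # local computation: inspect adjacencies
            for v in adj[u]:
                if v not in level:    # unvisited: goes into the new frontier
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier      # in MPI, preceded by an all-to-all exchange
    return level

levels = bfs_levels([(0, 5), (0, 3), (5, 8), (3, 6), (6, 9), (8, 9)], 0)
```

Concurrency at each depth is bounded by `len(frontier)`, which is exactly the limitation noted above.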
Single‐node Multicore Optimizations
1. Software prefetching on Intel Nehalem (supports 32 loads and 20 stores in flight)
   – Speculative loads of the index array and adjacencies of frontier vertices reduce compulsory cache misses.
2. Aligning adjacency lists to optimize memory accesses
   – 16‐byte aligned loads and stores are faster.
   – Alignment helps reduce cache misses due to fragmentation.
   – 16‐byte aligned non‐temporal stores (during creation of the new frontier) are fast.
3. SIMD SSE integer intrinsics to process “high‐degree” vertex adjacencies.
4. Fast atomics (BFS is lock‐free w/ low contention, and CAS‐based intrinsics have very low overhead).
5. Hugepage support (significant TLB miss reduction).
6. NUMA‐aware memory allocation exploiting the first‐touch policy.
Parallel Performance
• 32 nodes of NERSC’s Carver system
  – dual‐socket, quad‐core Intel Nehalem 2.67 GHz processors per node
  – 24 GB DDR3 1333 MHz memory per node, or roughly 768 GB aggregate memory
[Figure: BFS performance (billions of edges traversed per second, 0–12) for three parallel strategies (1D partitioning, 1D partitioning + randomization, hybrid) on an Orkut crawl (3.07 M vertices, 227 M edges) and a synthetic RMAT network (2 billion vertices, 32 billion edges).]
Single‐node performance: 300‐500 M traversed edges/second.
Genome Assembly Preliminaries
[Figure: assembly pipeline. A sample is run through a sequencer to produce reads (e.g. ACATCGTCTG, TCGCGCTGAA); the genome assembler aligns the reads, forms contigs, and “scaffolds” the contigs into a genome (a collection of long DNA sequences/strings of nucleotides, e.g. ACACGTGTGCACTACTGCACTCTACTCCACTGACTA).]
De novo Genome Assembly
• Genome assembly: “a big jigsaw puzzle”
• De novo: Latin expression meaning “from the beginning”
  – No prior reference organism
  – Computationally falls within the NP‐hard class of problems
[Figure: wet‐lab pipeline (DNA extraction, fragment the DNA, clone into vectors, isolate vector DNA, sequence the library); the resulting read fragments (CTCTAGCTCTAA, AAGTCTCTAA, …) are the input to genome assembly.]
Eulerian path‐based strategies
• Break up the (short) reads into overlapping strings of length k. With k = 5:
  ACGTTATATATTCTA → ACGTT, CGTTA, GTTAT, TTATA, …, TTCTA
  CCATGATATATTCTA → CCATG, CATGA, ATGAT, TGATA, …, TTCTA
• Construct a de Bruijn graph (a directed graph representing overlap between strings)
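The read break-up above can be sketched as a one-line helper (a minimal illustration, assuming plain Python strings):

```python
def kmers(read, k):
    """Break a read into its len(read) - k + 1 overlapping length-k substrings."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]

ks = kmers("ACGTTATATATTCTA", 5)
# ks starts ACGTT, CGTTA, GTTAT, ... and ends with TTCTA
```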
de Bruijn graphs
• Each (k‐1)‐mer represents a node in the graph
• An edge exists from node a to node b iff there exists a k‐mer such that its prefix is a and suffix is b
[Figure: de Bruijn graph of AAGACTCCGACTGGGACTTT with 3‐mer nodes: main path AAG → AGA → GAC → ACT → CTT → TTT, with loops through CTC, TCC, CCG, CGA and through CTG, TGG, GGG, GGA.]
• Traverse the graph (if possible, identifying an Eulerian path) to form contigs
• However, the correct assembly is just one of the many possible Eulerian paths
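The node/edge definition above translates directly into code; a toy in-memory construction (the real pipeline uses the distributed, sorted edge representation described later in the talk):

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Nodes are (k-1)-mers; every k-mer contributes a prefix -> suffix edge."""
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].append(kmer[1:])
    return graph

g = de_bruijn(["AAGACTCCGACTGGGACTTT"], 4)
# node ACT branches three ways: to CTC, CTG, and CTT
```

The branch at ACT is why an Eulerian traversal is ambiguous: several edge orderings produce valid paths, but only one matches the original sequence.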
Steps in the de Bruijn graph‐based assembly scheme
1. Preprocessing (FASTQ input data)
2. Kmer spectrum (determine the appropriate value of k to use)
3. Preliminary de Bruijn graph construction
4. Vertex/edge compaction (lossless transformations; compute‐ and memory‐intensive)
5. Error resolution + further graph compaction (yields sequences after error resolution)
6. Scaffolding
Graph construction
• Store edges only; represent vertices (kmers) implicitly
• Distributed graph representation
• Sort by start vertex
• Edge storage format: for read ‘x’ with consecutive kmers ACTAGGC and CTAGGCA, store the edge (ACTAGGCA), orientation, edge direction, edge id (y), originating read id (x), and edge count; 2 bits per nucleotide
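The 2-bits-per-nucleotide packing mentioned above can be sketched as follows (hypothetical `pack`/`unpack` helpers, not the talk's actual storage code):

```python
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

def pack(seq):
    """Pack a nucleotide string into an integer, 2 bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value

def unpack(value, length):
    """Recover the original string from its 2-bit-per-base packing."""
    return "".join(BASES[(value >> (2 * i)) & 3] for i in reversed(range(length)))

edge = pack("ACTAGGCA")  # an 8-base edge fits in 16 bits
```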
Vertex compaction
• High percentage of unique kmers: try compacting kmers from the same read first
  – If the kmer length is k, potentially a k‐times space reduction!
  – Example: ACTAG, CTAGG, TAGGA, AGGAC → ACTAGGAC
• Parallelization: computation can be done locally by sorting by read ID and traversing unit‐cardinality kmers
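A serial sketch of the within-read compaction (assuming consecutive kmers from the same read overlap by k-1 characters):

```python
def compact(chain):
    """Merge a chain of consecutive kmers, each overlapping the previous
    one by k-1 characters, into a single compacted string."""
    merged = chain[0]
    for km in chain[1:]:
        assert km[:-1] == merged[-(len(km) - 1):], "need k-1 overlap"
        merged += km[-1]  # each successive kmer contributes one new base
    return merged

print(compact(["ACTAG", "CTAGG", "TAGGA", "AGGAC"]))  # prints ACTAGGAC
```

Four 5-mers (20 characters) collapse to one 8-character string, illustrating the up-to-k-times space reduction.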
Summary of various steps and analysis
A metagenomic data set (140 million reads, 20G bp), k = 45.

| Step | Memory footprint | Approach used | Parallelism & computational kernels |
|---|---|---|---|
| 1. Preprocessing | minimal | Streaming file read and write, kmer merging | “Pleasantly parallel”, I/O‐intensive |
| 2. Kmer spectrum | ~200 GB | 3 local sorts, 1 AlltoAll communication step | Parallel sort, AlltoAllv |
| 3. Graph construction | ~320 GB | Two sorts | Fully local computation |
| 4. Graph compaction | ~60 GB | 3+ local sorts, 2 AlltoAll communication steps + local graph traversal | AlltoAllv + local computation |
| 5. Error detection | ~35 GB | Connected components | Intensive AlltoAll communication |
| 6. Scaffolding | ? GB | Euler tours over components | Mostly local computation |
Parallel Implementation Details
• The data set under consideration requires 320 GB for in‐memory processing
  – NERSC Franklin system [Cray XT4, 2.3 GHz quad‐core Opterons, 8 GB memory per node]
  – Experimented with 64 nodes (256‐way parallelism) and 128 nodes (512‐way)
• MPI across nodes + OpenMP within a node
• Local sort: multicore‐parallel radix sort
• Global sort: bin data in parallel + AlltoAll comm. + local sort + AlltoAll comm.
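The global sort can be simulated serially; below, plain list shuffling stands in for the AlltoAll exchange and Python's built-in sort for the multicore radix sort (a sketch of the communication pattern only, not the actual implementation):

```python
def global_sort(per_rank_data, n_ranks):
    """Simulate bin-in-parallel + AlltoAll + local-sort across n_ranks."""
    flat = [x for d in per_rank_data for x in d]
    lo, hi = min(flat), max(flat) + 1
    width = (hi - lo + n_ranks - 1) // n_ranks
    # Step 1: each rank bins its keys by destination rank.
    outbox = [[[] for _ in range(n_ranks)] for _ in range(n_ranks)]
    for rank, data in enumerate(per_rank_data):
        for x in data:
            outbox[rank][min((x - lo) // width, n_ranks - 1)].append(x)
    # Step 2: the AlltoAll exchange: rank d receives bin d from every rank.
    received = [[x for r in range(n_ranks) for x in outbox[r][d]]
                for d in range(n_ranks)]
    # Step 3: local sort (a multicore-parallel radix sort in the real code).
    return [sorted(chunk) for chunk in received]

ranks = global_sort([[9, 1, 5], [4, 8, 2]], 2)
# rank 0 now holds the globally smaller keys, rank 1 the larger ones
```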
Parallel Performance
[Figure: execution time (seconds) per assembly step (preprocessing, kmer frequency, graph construction, graph compaction) on 64 nodes (340 seconds total) and 128 nodes (213 seconds total).]
• Comparison: Velvet (open‐source serial code) takes ~12 hours on a 500 GB machine
Talk Summary
• Two examples of “hybrid” parallel programming for analyzing large‐scale graphs
  – Up to 3x faster with hybrid approaches on 32 nodes
• For the two different types of graphs, the strategies to achieve high performance differ
  – Social and information networks: low diameter, difficult to generate balanced partitions with low edge cuts
  – DNA fragment string graph: O(N) diameter, multiple connected components
• Single‐node multicore optimizations + communication optimization (reducing data volume and number of messages in the all‐to‐all exchange)
Acknowledgments
BERKELEY PAR LAB