how well do cpu, gpu and hybrid graph processing
TRANSCRIPT
HowwelldoCPU,GPUandHybridGraph
ProcessingFrameworksPerform?
TanujKrAasawat,TahsinReza,MateiRipeanuNetworkedSystemsLaboratory(NetSysLab)UniversityofBritishColumbia
NetworkedSystemsLaboratory(NetSysLab)UniversityofBritishColumbia
Agolfcourse…
…a(nudist)beach
(…and199daysofraineachyear)
Graphs are Everywhere
4
1B users 150B friendships
100B neurons 700T connections
Challenges in Graph Processing
Data-dependentmemory
accesspatterns
Largememoryfootprint
Poorlocality
Lowcompute-to-memoryaccessratio
Varyingdegreesofparallelism(bothintra-andinter-stage)
Graph500“mini”graphrequires128GB.
Processing Elements Characteristics
Data-dependentmemory
accesspatternsLargeCaches
Largememoryfootprint >1TB
CPUs
Poorlocality
Massivehardwaremultithreading
~16GB
GPUs
Lowcompute-to-memoryaccessratio
Caches
Varyingdegreesofparallelism(bothintra-andinter-stage)
Graph500“mini”graphrequires128GB.
Assemble a hybrid platform?
Graph Processing Frameworks
ProgrammingModel
(VertexProgramming/LinearAlgebra)
Architecture(Single-nodeorDistributed)
HighPerformance
CPU/GPU/Hybrid
Motivation
Howarchitectureandprogrammingmodelcombinationimprovesperformanceandefficiencyofthesystemasawhole?
Graph Processing Frameworks Architecture Model Programming
Model Vertex
Programming CPU
CPU+Distributed LinearAlgebra
VertexProgramming
Multi-GPU
GPU LinearAlgebra
Framework
GaloisUTexas,Austin
GraphMatIntel
GunrockUC,Davis
NvgraphNvidia
TotemUBC
CPU+multi-GPU VertexProgramming
Benchmark Algorithms
• PageRank• Rankingwebpages• Computeintensive
• SingleSourceShortestPaths(SSSP)• IProuting,Transportationnetworks
• Breadth-FirstSearch(BFS)• Findingconnectedcomponent,subroutine• Memoryintensive
Evaluation Metrics
§ RawPerformance§ TraversedEdgesPerSecond(TEPS):TraversedEdges/ExecutionTime
§ EnergyConsumption§ AveragePowerconsumed*ExecutionTime
§ Scalability§ Strongscalingw.r.tprocessingunits
Testbed Characteristics System1
CPU 2xIntelXeonE5-2695v3(Haswell)
#CPUCores 28
HostMemory 512GBDDR4
L3Cache 70MB
PCIe 3.0–x16
GPU 2xNvidiaTeslaK40c
GPUThreadCount
2880
GPUMemory 12GB
Datasets Graph #Vertices #Edges MaxDegree Avg.Degree
RealWorld
Com-Orkut 3M 234M 33,313 78
liveJournal 4.8M 68M 20,292 14
Road-USA 28.8M 47.9M 9 1.6
Twitter 52M 3.9B 3,691,240 75
Synthetic
RMAT22 4M 128M 168,729 32
RMAT23 8M 256M 272,808 32
RMAT24 16M 512M 439,994 32
RMAT27 128M 4B 3,910,241 32
WDC,2012
Memory Consumption
Framework Memorylayout PageRank SSSP BFS
Nvgraph CSC(PageRank,SSSP)andCSR(BFS)
1,159(1.8x) 1,111(1.0x) 683(1.0x)
Gunrock CSRandCOO 641(1.0x) 1,582(1.4x) 1,443(2.1x)
Galois CSR 1,599(2.5x) 2,074(1.9x) 1,432(2.1x)
GraphMat* DCSC 2,818(4.4x) 2,786(2.5x) 2,980(4.4x)
Totem-2S CSR 1,275(2.0x) 2,198(2.0x) 1,282(1.9x)
Totem-2S2G CSR 1,628(2.5x) 2,587(2.3x) 1,658(2.4x)
MemoryConsumption(inMB)forRMAT22graph(edgelistsize:512MB)
9,354MBduringpre-processing
step
Experimental Results 1. Raw Performance - PageRank
02468
1012141618
Orkut LiveJournal RMAT22 RMAT23 RMAT24 RMAT27 Twitter
Billion
TEPS/Iteratio
nNvgraph Gunrock Totem-1G GaloisGraphMat Totem-2S Totem-2S2G
Fastest:Totem-2SNvgraphvsGraphMat
Experimental Results 1. Raw Performance - SSSP
0.000.501.001.502.002.503.003.504.004.50
Orkut LiveJournalRoad_USA RMAT22 RMAT24 RMAT27 Twitter
Billion
TEPS
Nvgraph Gunrock Totem-1G GaloisGraphMat Totem-2S Totem-2S2G
Fastest:Totem-2SCSCissuitableforPageRank
20
4 3
1 0 1 3 3 6 80 1 2 3 4 5*
0 2 3 6 7 80 1 2 3 4 5*
1 2 3 0 2 4 0 20 1 2 3 4 5 6 7
3 4 0 1 3 4 1 30 1 2 3 4 5 6 7
CSRRepresentation
CSCRepresentation
rowPtrVertexId
colPtr
edgeList
VertexId
edgeList
GraphLayoutinMemory
Experimental Results 1. Raw Performance - BFS
0
20
40
60
80
100
120
Orkut LiveJournal RMAT22 RMAT24 RMAT27 Twitter
Billion
TEPS
Nvgraph Gunrock Totem-1G GaloisGraphMat Totem-2S Totem-2S2G
Fastest:Totem-2SNvgraphvsGraphMatCSRsuitableforBFS
Hybrid:~2x
Experimental Results 2. Energy Consumption – GPU Frameworks – Orkut Workload
1
10
100
1,000
Nvgraph
Gunrock
Totem-1G
Totem-2S
Totem-2S2G
Nvgraph
Gunrock
Totem-1G
Totem-2S
Totem-2S2G
Nvgraph
Gunrock
Totem-1G
Totem-2S
Totem-2S2G
PageRank SSSP BFS
Energy(w
att-sec)
Experimental Results 2. Energy Consumption – GPU Frameworks – Orkut Workload
1
10
100
1,000
Nvgraph
Gunrock
Totem-1G
Totem-2S
Totem-2S2G
Nvgraph
Gunrock
Totem-1G
Totem-2S
Totem-2S2G
Nvgraph
Gunrock
Totem-1G
Totem-2S
Totem-2S2G
PageRank SSSP BFS
Energy(w
att-sec)
Experimental Results 2. Energy Consumption – CPU Frameworks – Twitter Workload
1
10
100
1,000
10,000
100,000
Galois
GraphM
at
Totem-2S
Totem-2S2G
Galois
GraphM
at
Totem-2S
Totem-2S2G
Galois
GraphM
at
Totem-2S
Totem-2S2G
PageRank SSSP BFS
Energy(w
att-second
)
Experimental Results 2. Energy Consumption – CPU Frameworks – Twitter Workload
1
10
100
1,000
10,000
100,000
Galois
GraphM
at
Totem-2S
Totem-2S2G
Galois
GraphM
at
Totem-2S
Totem-2S2G
Galois
GraphM
at
Totem-2S
Totem-2S2G
PageRank SSSP BFS
Energy(w
att-second
)
EnergyEfficient:Totem-2S
Summary
• GPU+LinearAlgebra|CPU+Vertexprogramming=GoodMatch• GPUbasedframeworks:?• CPUbasedframeworks:Totem-2S• TotemHybrid:Greenest• CSCPageRank• CSRBFS,SSSP
Discussion
Does hybrid have the future potential?
020004000600080001000012000140001600018000
02468
1012141618
BFS SSSP PR BFS SSSP PR
4S 2S2G
Energy(W
att-Sec)
ExecutionTime(secon
ds)ExecutionTime Energy
Totem-4SvsTotem-2S2GforRMAT30(edgelistsize:128GB)4SMachine:4xIntelXeonE7-4870v2(Ivybridge),with1,536GBmemory
27
Hybrid Graph Processing
Data-dependentmemory
accesspatterns
LargeCaches+summarydatastructures
Largememoryfootprint >1TB
CPUs
Poorlocality
Massivehardwaremultithreading
16GB!
GPUs
Lowcompute-to-memoryaccessratio
Caches+summarydatastructures
Varyingdegreesofparallelism(bothintra-andinter-stage)
GraphProcessing
LowDegreeHighDegree
Questions
code@:netsyslab.ece.ubc.ca