NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Yuichiro Yasui*1, Katsuki Fujisawa*1, Kazushige Goto*2

*1 Chuo University & JST CREST  *2 Intel Corporation

Outline

1.  Background
2.  Breadth-first Search (BFS)
3.  Proposal: NUMA-optimized parallel BFS
4.  Numerical Results
5.  Conclusion

Background

•  Large-scale graphs in various fields
   –  US road network: 24 million vertices & 58 million edges
   –  Twitter follow-ship network: 61.6 million vertices & 1.47 billion edges
   –  Neuronal network @ Human Brain Project: 89 billion vertices & 100 trillion edges
   –  Cyber-security: 15 billion log entries / day
•  Application fields
   –  Transportation
   –  Social network
   –  Cyber-security
   –  Bioinformatics
•  Fast and scalable graph processing by using HPC

Importance of graph processing

•  BFS is an important and fundamental graph processing kernel
   –  Obtains the distance (number of hops) from a source as a stand-alone result
   –  Building block of many algorithms (BC, max. flow, max. independent set)
•  Typical graph processing kernels
   –  concurrent search (breadth-first search)
   –  optimization (single-source shortest path)
   –  edge-oriented (maximal independent set)
•  Problems for fast & scalable computation of BFS: low arithmetic intensity & irregular memory accesses

[Figure: Application field → Relationships → Graph → Graph processing (Step 1, Step 2, Step 3: breadth-first search) → Results → Understanding]

Graph500 Benchmark

•  Measures computer performance in graph processing such as BFS (breadth-first search) using the TEPS ratio
•  TEPS ratio = # of traversed edges per second
•  Kronecker graph
   –  2^SCALE vertices and 2^(SCALE+4) edges
   –  synthetic scale-free network
•  Benchmark flow
   1.  Generation: graph generation from the input parameters SCALE and edgefactor (= 16)
   2.  Construction: graph construction
   3.  BFS: BFS and validation, x 64 iterations
•  Results: SCALE, edgefactor, BFS time, traversed edges, and TEPS of each iteration; the score is the median TEPS

http://www.graph500.org
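The size and score definitions above can be sketched in a few lines. This is only an illustration of the arithmetic, not the official Graph500 reference code; the function names and the three timings are hypothetical.

```python
from statistics import median

EDGEFACTOR = 16  # default edgefactor stated on the slide


def kronecker_size(scale, edgefactor=EDGEFACTOR):
    """Return (vertices, edges) of a Kronecker graph: 2^SCALE and edgefactor * 2^SCALE."""
    n = 2 ** scale
    return n, edgefactor * n


def teps(traversed_edges, bfs_seconds):
    """Traversed edges per second for a single BFS run."""
    return traversed_edges / bfs_seconds


def median_teps(runs):
    """runs: iterable of (traversed_edges, bfs_seconds), one pair per BFS iteration."""
    return median(teps(m, t) for m, t in runs)


n, m = kronecker_size(26)
# Hypothetical timings for three BFS runs (the benchmark itself uses 64).
runs = [(m, 0.10), (m, 0.08), (m, 0.12)]
score = median_teps(runs)
print(n, m, score / 1e9)
```

With edgefactor 16 the edge count is 16 * 2^SCALE = 2^(SCALE+4), which matches the Kronecker graph sizes on the slide.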

Contribution

•  Efficient hybrid algorithm of BFS [Beamer2011, 2012]
   –  reduces unnecessary edge traversal
   –  Hybrid BFS: 5.1 GTEPS
•  Our proposal: NUMA-optimized hybrid algorithm
   –  Improves locality of memory access
   –  Library for considering NUMA carefully
   –  Column-wise graph partitioning
•  Results on a 4-way Intel Xeon E5 (64 CPU cores)
   –  Scalable: scales well up to 64 threads
   –  Fast: 11.15 GTEPS and 2.2x speedup compared with the original hybrid algorithm

Outline

1.  Background
2.  Breadth-first Search (BFS)
3.  NUMA architecture
4.  Proposal: NUMA-optimized parallel BFS
5.  Numerical Results

Breadth-first Search (BFS)

•  Obtains the level of each vertex from the source vertex
•  Level = number of hops away from the source

Input: graph G and source
Output: tree rooted at the source

[Figure: BFS expanding from the source to Level 1, Level 2, and Level 3]
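As a minimal sketch of the input/output contract above, the following sequential BFS returns the level of every reachable vertex together with the parent pointers that form the BFS tree. The adjacency-dictionary layout and the example graph are assumptions made for illustration.

```python
from collections import deque


def bfs_levels(adj, source):
    """Return (level, parent) dictionaries for all vertices reachable from source."""
    level = {source: 0}
    parent = {source: source}  # the root is its own parent
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in level:  # first visit fixes the hop count
                level[v] = level[u] + 1
                parent[v] = u
                queue.append(v)
    return level, parent


# Small undirected example graph (hypothetical).
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
level, parent = bfs_levels(adj, 0)
print(level)
print(parent)
```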

Hybrid BFS for low-diameter graphs

•  Efficient for low-diameter graphs
   –  scale-free and/or small-world property, such as social networks
•  Used by the higher-ranked entries in the Graph500 benchmark
•  Hybrid algorithm [Beamer2011, 2012]
   –  combines the top-down algorithm and the bottom-up algorithm
   –  reduces unnecessary edge traversal
   –  Top-down: efficient for a small frontier (frontier smaller than its neighbors)
   –  Bottom-up: efficient for a large frontier (frontier larger than its neighbors)

[Figure: the step from Level k (frontier) to Level k+1 (neighbors), with a switch between the top-down algorithm and the bottom-up algorithm]
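The switching idea can be sketched as a level-synchronous BFS that picks a direction at every level. The switching rule below (compare the edges leaving the frontier with the edges of the unvisited vertices, scaled by `alpha`) is a simplified assumption for illustration, not the exact heuristic of Beamer et al., and `alpha=4` is an arbitrary value.

```python
def hybrid_bfs(adj, source, alpha=4):
    """Level-synchronous BFS on an undirected graph (adj must be symmetric).

    alpha is a hypothetical tuning knob; it is not taken from the slides.
    Returns (level list, direction chosen at each level).
    """
    n = len(adj)
    level = [-1] * n
    level[source] = 0
    frontier = [source]
    directions = []
    depth = 0
    while frontier:
        m_f = sum(len(adj[u]) for u in frontier)  # edges leaving the frontier
        m_u = sum(len(adj[v]) for v in range(n) if level[v] == -1)  # edges of unvisited vertices
        next_frontier = []
        if m_f * alpha > m_u:
            # Large frontier: every unvisited vertex looks for a parent in the frontier.
            directions.append("bottom-up")
            in_frontier = set(frontier)
            for v in range(n):
                # any() stops at the first frontier neighbor (early termination).
                if level[v] == -1 and any(u in in_frontier for u in adj[v]):
                    next_frontier.append(v)
            for v in next_frontier:
                level[v] = depth + 1
        else:
            # Small frontier: expand the outgoing edges of the frontier.
            directions.append("top-down")
            for u in frontier:
                for v in adj[u]:
                    if level[v] == -1:
                        level[v] = depth + 1
                        next_frontier.append(v)
        frontier = next_frontier
        depth += 1
    return level, directions


adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
level, directions = hybrid_bfs(adj, 0)
level_td, directions_td = hybrid_bfs(adj, 0, alpha=0)  # alpha=0 forces top-down
print(level, directions)
```

Both directions compute the same levels; only the number of edges inspected differs, which is what the switch tries to minimize.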

Top-down algorithm

•  Explores the outgoing edges of the frontier queue QF
•  Appends unvisited vertices to the neighbor queue QN
•  Efficient for a small frontier
•  Performs unnecessary edge traversals for a large frontier

[Figure: top-down steps from the source (Level 0) through Level 1, Level 2, and Level 3; QF is the current level and QN the next level, with the unnecessary edge traversals marked]
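A single top-down step can be sketched as follows. The function also counts every inspected edge so the unnecessary traversals (edges that lead to already visited vertices) become visible. The function name, the `visited` set, and the example graph are illustrative assumptions.

```python
def top_down_step(adj, qf, visited):
    """Expand frontier qf once; return (qn, number of edges traversed)."""
    qn = []
    traversed = 0
    for u in qf:
        for v in adj[u]:
            traversed += 1  # every outgoing edge of the frontier is inspected
            if v not in visited:
                visited.add(v)
                qn.append(v)
    return qn, traversed


adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
visited = {0}
qn1, edges1 = top_down_step(adj, [0], visited)  # small frontier: no wasted edges
qn2, edges2 = top_down_step(adj, qn1, visited)  # larger frontier: most edges are wasted
print(qn1, edges1)
print(qn2, edges2)
```

In the second step only one of the four inspected edges discovers a new vertex; the other three are the kind of unnecessary traversal the slide refers to.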

Bottom-up algorithm

•  Explores the frontier queue QF from the unvisited vertices: each unvisited vertex scans its edges for a neighbor in QF
•  Appends the unvisited vertices adjacent to the frontier to the neighbor queue QN
•  Efficient for a large frontier
•  Performs unnecessary edge traversals for a small frontier

[Figure: bottom-up steps from the source through Level 1, Level 2, and Level 3; the unvisited vertices search for the frontier QF and the vertices that find it form QN]
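A single bottom-up step can be sketched in the same style. Each unvisited vertex scans its own edges and stops at the first neighbor found in the frontier. The names and the example graph are illustrative assumptions, and the graph is assumed to be undirected.

```python
def bottom_up_step(adj, qf, visited):
    """Let every unvisited vertex search for a frontier neighbor.

    Returns (qn, number of edges traversed). adj must be symmetric.
    """
    in_qf = set(qf)
    qn = []
    traversed = 0
    for v in range(len(adj)):
        if v in visited:
            continue
        for u in adj[v]:
            traversed += 1
            if u in in_qf:  # found a parent: stop scanning this vertex
                qn.append(v)
                break
    visited.update(qn)
    return qn, traversed


adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
visited = {0}
qn1, edges1 = bottom_up_step(adj, [0], visited)
print(qn1, edges1)
```

For this small frontier the step inspects six edges to find two vertices, whereas expanding the frontier's two outgoing edges would have sufficed, which illustrates why bottom-up is wasteful when the frontier is small.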

Hybrid BFS combines top-down and bottom-up

[Figure: at each step from Level k (frontier) to Level k+1 (neighbors), the hybrid algorithm switches between the top-down algorithm and the bottom-up algorithm]

Hybrid algorithm of Beamer et al. [1]

•  Two different traversal kernels: top-down and bottom-up.
•  Top-down
   –  traverses the neighbors of the frontier.
   –  performance depends on the frontier size.
•  Bottom-up
   –  finds the frontier from the vertices in the candidate neighbors (all unvisited vertices).
   –  performance depends on the number of unvisited vertices.
•  This lazy estimation of candidate neighbors increases the number of edges traversed.

Traversed edges of a Kronecker graph (SCALE 26)

Level    Top-down only (mF)    Bottom-up only (mB)    Hybrid (switch), min(mF, mB)
0                         2          2,103,840,895                               2
1                    66,206          1,766,587,029                          66,206
2               346,918,235             52,677,691                      52,677,691
3             1,727,195,615             12,820,854                      12,820,854
4                29,557,400                103,184                         103,184
5                    82,357                 21,467                          21,467
6                       221                 21,240                             227
Total         2,103,820,036          3,936,072,360                      65,689,631
Ratio               100.00%                187.09%                           3.12%

Fig: Top-down for small frontier
Fig: Bottom-up for large frontier

[1] S. Beamer et al.: Direction-optimizing breadth-first search, SC'12, 2012.
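The totals and ratios of the traversed-edges table can be rechecked with a short script. The per-level numbers are copied from the table; the rule "take the cheaper direction at each level" is the idealized min(mF, mB) choice and is an assumption about how the Hybrid column was formed.

```python
# (mF, mB) per level, copied from the SCALE 26 table.
LEVELS = [
    (2, 2_103_840_895),
    (66_206, 1_766_587_029),
    (346_918_235, 52_677_691),
    (1_727_195_615, 12_820_854),
    (29_557_400, 103_184),
    (82_357, 21_467),
    (221, 21_240),
]

top_down_total = sum(mf for mf, _ in LEVELS)
bottom_up_total = sum(mb for _, mb in LEVELS)
best_total = sum(min(mf, mb) for mf, mb in LEVELS)
choices = ["top-down" if mf <= mb else "bottom-up" for mf, mb in LEVELS]

bottom_up_ratio = round(100 * bottom_up_total / top_down_total, 2)
best_ratio = round(100 * best_total / top_down_total, 2)
print(top_down_total, bottom_up_total, best_total)
print(choices)
print(bottom_up_ratio, best_ratio)
```

The idealized per-level minimum sums to 65,689,625, six edges fewer than the Hybrid total of 65,689,631 in the table, because the table lists 227 rather than 221 at level 6. Both values round to the same 3.12%.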

Outline

1.  Background
2.  Breadth-first Search (BFS)
3.  Proposal: NUMA-optimized parallel BFS
4.  Numerical Results
5.  Conclusion

How to speed up the hybrid algorithm?

•  NUMA architecture
   –  Non-uniform memory access
   –  Each CPU socket has a local RAM
   –  Fast local RAM and slow non-local RAM
•  Frequent non-local memory accesses on a NUMA architecture
   –  The graph G and the working data (QF, QN, visited flags) are spread across the local memories
   –  Both top-down and bottom-up steps touch them while moving from Level k to Level k+1

[Figure: 4-socket Intel Xeon E5 system; each 8-core Xeon E5-4640 has processor cores with L1/L2 caches, a shared L3 cache, and its own RAM; the sockets are joined by an interconnect, and memory accesses are marked as local or non-local]

Difficulty of considering NUMA architecture

1.  How to distribute the graph and the data to each local RAM?
    –  G = G0 ∪ G1 ∪ G2 ∪ G3
2.  How to bind each partial graph and its data to each NUMA unit?

[Figure: the partial graphs G0 to G3 and the data blocks B0 to B3 placed on CPU0/RAM0 through CPU3/RAM3; the pair CPU3/RAM3 is labeled NUMA unit 3]

ULIBC: Ubiquity Library for Intelligently Binding Cores

1.  NUMACTL (command-line tool, library for C/C++)
2.  Intel compiler Thread Affinity Interface (API)
    ⇒  CPU affinity + local memory binding
3.  ULIBC (our library, library for C/C++)
    ⇒  CPU affinity + local memory binding + processor topology
    –  Processor ID: index of the logical processor core
    –  Package ID: index of the CPU socket
    –  Core ID: index of the physical core in each CPU socket

•  Reads the processor topology of each CPU core from the Linux device files /sys/devices/system/*
•  At a parallel region, maps each Thread ID to a Processor ID, Package ID, and Core ID
   –  sched_setaffinity system call
   –  mbind system call
•  Supports the scatter and compact policies
   –  round-robin on CPU sockets
•  ULIBC makes it possible to manage NUMA carefully
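The thread-to-core mapping can be sketched as below. This is not ULIBC's API: the function names are hypothetical, and the policy semantics are assumed to follow the common meaning of these names (scatter places consecutive threads round-robin across CPU sockets, compact fills one socket before moving to the next). Pinning uses Python's `os.sched_setaffinity`, which wraps the same Linux system call; memory binding via mbind has no standard-library equivalent and is not shown.

```python
import os


def thread_to_core(thread_id, n_packages, cores_per_package, policy="scatter"):
    """Map a thread ID to (package_id, core_id) under an assumed policy meaning."""
    t = thread_id % (n_packages * cores_per_package)
    if policy == "scatter":
        # Round-robin on CPU sockets: neighbors in thread order land on different sockets.
        return t % n_packages, t // n_packages
    if policy == "compact":
        # Fill one socket completely before using the next one.
        return t // cores_per_package, t % cores_per_package
    raise ValueError(f"unknown policy: {policy}")


def pin_caller(processor_id):
    """Pin the caller to one logical processor (Linux only); returns False elsewhere."""
    if hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {processor_id})  # pid 0 means the caller
        return True
    return False


# Example layout: 4 sockets x 8 cores (hypothetical, loosely modeled on the 4-way Xeon).
scatter = [thread_to_core(t, 4, 8, "scatter") for t in range(6)]
compact = [thread_to_core(t, 4, 8, "compact") for t in range(6)]
print(scatter)
print(compact)
```

Translating a (package, core) pair into the logical Processor ID that `pin_caller` expects requires the topology read from /sys/devices/system, which is the part the slides attribute to ULIBC.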

NUMA-optimized column-wise graph partitioning

•  Row-wise graph partitioning of the adjacency matrix: O(m) mostly non-local memory accesses
•  Column-wise graph partitioning of the adjacency matrix: O(m) local memory accesses only
•  Divides G = (V, A) into partial graphs Gk = (Vk, Ak) and binds each to local RAM k
   –  Ak is a set of adjacency lists that holds the incoming edges to Vk

[Figure: adjacency matrix with rows i and columns j, partitioned row-wise and column-wise, showing which blocks the step from the frontier (Level k) to the neighbors (Level k+1) touches]
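The partitioning rule can be sketched as follows. Splitting V into contiguous, equally sized blocks is an assumption made for illustration (the slides do not state how Vk is chosen), and the function name is hypothetical. Placing each part in the RAM of NUMA unit k is outside what plain Python can express and is not shown.

```python
def partition_column_wise(n, edges, n_units):
    """Split directed edges (u, v) so that part k holds the incoming edges of V_k.

    V_k is assumed to be the k-th contiguous block of vertex IDs.
    Returns (block size, list of {v: [sources of incoming edges]}).
    """
    block = (n + n_units - 1) // n_units  # ceil(n / n_units)
    parts = [dict() for _ in range(n_units)]
    for u, v in edges:
        owner = v // block  # the destination vertex decides the owner
        parts[owner].setdefault(v, []).append(u)
    return block, parts


edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4), (4, 7), (6, 0)]
block, parts = partition_column_wise(8, edges, 4)
for k, a_k in enumerate(parts):
    print(k, a_k)
```

Because every edge is stored with the unit that owns its destination, updating the visited flag or level of a destination vertex never leaves that unit's memory.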

NUMA-optimized top-down

•  Explores the outgoing edges of the frontier queue QF
•  Appends unvisited vertices to the neighbor queue QN
•  Efficient for a small frontier
•  Performs unnecessary edge traversals for a large frontier

[Figure: top-down steps from the source (Level 0) to Level 3 with frontier QF and neighbors QN; the part handled by NUMA unit 3 is highlighted]

Details of NUMA-optimized top-down

•  Explores the outgoing edges Ak of the frontier queue QF
•  Appends unvisited vertices to the neighbor queue QNk

[Figure: NUMA units 0 to 2 each hold a frontier QFk and a local neighbor queue QNk; an all-gather combines the neighbors QN into the Level 2 frontier QF]
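One way to picture the partitioned top-down step is the sequential simulation below. Each loop iteration over `parts` stands in for the threads of one NUMA unit; the contiguous block layout, the function names, and the way the all-gather is modeled as list concatenation are assumptions for illustration only.

```python
def split_edges_by_target(n, edges, n_units):
    """Part k keeps the edges (u, v) whose target v lies in block k, indexed by u."""
    block = (n + n_units - 1) // n_units
    parts = [dict() for _ in range(n_units)]
    for u, v in edges:
        parts[v // block].setdefault(u, []).append(v)
    return parts


def numa_top_down_step(parts, qf, visited):
    """Every unit scans the whole frontier but only follows its own edges A_k."""
    local_qn = []
    for a_k in parts:  # stands in for one NUMA unit
        qn_k = []
        for u in qf:
            for v in a_k.get(u, ()):
                if not visited[v]:  # v belongs to this unit, so the flag is local
                    visited[v] = True
                    qn_k.append(v)
        local_qn.append(qn_k)
    next_qf = [v for qn_k in local_qn for v in qn_k]  # all-gather of the QN_k
    return local_qn, next_qf


n = 8
edges = [(0, 1), (0, 2), (0, 5), (1, 3), (2, 3), (2, 6)]
parts = split_edges_by_target(n, edges, 4)
visited = [False] * n
visited[0] = True
qn_a, qf_a = numa_top_down_step(parts, [0], visited)
qn_b, qf_b = numa_top_down_step(parts, qf_a, visited)
print(qn_a, qf_a)
print(qn_b, qf_b)
```

Since each destination vertex has exactly one owner, two units never race on the same visited flag in this layout.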

NUMA-optimized bottom-up

•  Explores the frontier queue QF from the unvisited vertices
•  Appends the unvisited vertices adjacent to the frontier to the neighbors QN
•  Efficient for a large frontier
•  Performs unnecessary edge traversals for a small frontier

[Figure: bottom-up steps from the source to Level 3 with frontier QF and neighbors QN; the part handled by NUMA unit 3 is highlighted]

Details of NUMA-optimized bottom-up

•  Each NUMA unit k explores the frontier queue QF from its unvisited vertices
•  Appends the adjacent vertices to the neighbors QNk

[Figure: NUMA units 0 to 3 each search the Level 1 frontier QF from their own unvisited vertices and fill a local queue QN0 to QN3; an all-gather combines them into the Level 2 frontier QF]
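The partitioned bottom-up step can be simulated sequentially in the same way. Each unit owns a block of vertices together with their incoming edges and reads the frontier as shared, read-only data. The contiguous block layout and all names are assumptions for illustration.

```python
def split_incoming(n, edges, n_units):
    """Part k holds, for each v in block k, the sources of its incoming edges."""
    block = (n + n_units - 1) // n_units
    parts = [dict() for _ in range(n_units)]
    for u, v in edges:
        parts[v // block].setdefault(v, []).append(u)
    return block, parts


def numa_bottom_up_step(n, block, parts, qf, visited):
    """Each unit checks its own unvisited vertices against the shared frontier."""
    in_qf = set(qf)  # read-only for every unit
    local_qn = []
    for k, a_k in enumerate(parts):  # stands in for one NUMA unit
        qn_k = []
        for v in range(k * block, min((k + 1) * block, n)):
            if visited[v]:
                continue
            # any() stops at the first incoming edge that starts in the frontier.
            if any(u in in_qf for u in a_k.get(v, ())):
                qn_k.append(v)
        for v in qn_k:
            visited[v] = True
        local_qn.append(qn_k)
    next_qf = [v for qn_k in local_qn for v in qn_k]  # all-gather of the QN_k
    return local_qn, next_qf


n = 8
edges = [(0, 1), (0, 2), (0, 5), (1, 3), (2, 3), (2, 6)]
block, parts = split_incoming(n, edges, 4)
visited = [False] * n
visited[0] = True
qn_a, qf_a = numa_bottom_up_step(n, block, parts, [0], visited)
qn_b, qf_b = numa_bottom_up_step(n, block, parts, qf_a, visited)
print(qn_a, qf_a)
print(qn_b, qf_b)
```

Only the frontier is read across units; the adjacency lists, the visited flags, and the output queue of a unit all refer to vertices that unit owns.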

Outline

1.  Background
2.  Breadth-first Search (BFS)
3.  NUMA-optimized parallel BFS
4.  Numerical Results
5.  Conclusion

Machine specification

•  4-way Intel Xeon E5 (8-core Xeon E5-4640 per socket)
   –  CentOS 6.4 (Kernel 2.6.32)
   –  GCC 4.4.7
   –  64 logical CPU cores
   –  4 NUMA units x 16 logical cores
•  4-way AMD Opteron 6174 (12 cores per socket)
   –  Fedora 19 (Kernel 3.11.2)
   –  GCC 4.8.1
   –  48 CPU cores
   –  8 NUMA units x 6 cores

[Figure: block diagrams of both systems; each socket has a shared L3 cache and its own RAM, and the sockets are joined by an interconnect]

TEPS ratio varied with problem size

•  Ours achieves 11.15 GTEPS for a Kronecker graph (SCALE 26): 67 million vertices and 1 billion edges
•  Ours is 2.2x faster than the original hybrid algorithm
   –  Hybrid [Beamer2011, 2012]: peak performance of 5.1 GTEPS on a 4-way Intel Xeon E7-8870 (Westmere-EX arch.)
   –  Hybrid + NUMA (this paper): 11.15 GTEPS on a 4-way Intel Xeon E5-4640 (SandyBridge-EP arch.)

[Figure: TEPS ratio versus SCALE (20 to 29) for Hybrid and Hybrid + NUMA; higher is better]

Strong scaling on Intel/AMD systems

•  Scales well up to a number of threads equal to the number of cores
•  4-way Intel Xeon: 11.15 GTEPS
•  4-way AMD Opteron: 6.17 GTEPS
•  Speedup annotations in the plot: 40 threads: x40; 64 threads: x28

[Figure: strong scaling versus number of threads (1, 2, 4, 8, 12, 16, 24, 32, 48, 64) for ideal, 4-way SandyBridge-EP, and 4-way MagnyCours]

Twitter network (follow-ship network 2009)

•  41 million vertices and 1.47 billion edges
•  An (i, j)-edge connects User i and User j
•  Our NUMA-optimized BFS on the 4-way Xeon system: 180 ms / BFS ⇒ 8.1 GTEPS
•  Six degrees of separation

Frontier size in BFS with source as User 21,804,357

Lv      Frontier size    Freq. (%)    Cum. Freq. (%)
0                   1         0.00              0.00
1                   7         0.00              0.00
2               6,188         0.01              0.01
3             510,515         1.23              1.24
4          29,526,508        70.89             72.13
5          11,314,238        27.16             99.29
6             282,456         0.68             99.97
7              11,536         0.03            100.00
8                 673         0.00            100.00
9                  68         0.00            100.00
10                 19         0.00            100.00
11                 10         0.00            100.00
12                  5         0.00            100.00
13                  2         0.00            100.00
14                  2         0.00            100.00
15                  2         0.00            100.00
Total      41,652,230       100.00                 -
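The Twitter figures can be rechecked with a few lines of arithmetic. The frontier sizes, the edge count, and the 180 ms timing are taken from the slide; the variable names are illustrative.

```python
# Frontier size per BFS level (Lv 0 to 15), from the Twitter table.
FRONTIER = [1, 7, 6_188, 510_515, 29_526_508, 11_314_238, 282_456, 11_536,
            673, 68, 19, 10, 5, 2, 2, 2]

total = sum(FRONTIER)
freq = [100 * size / total for size in FRONTIER]

# Cumulative frequency: running sum of the per-level frequencies.
cum_freq = []
running = 0.0
for f in freq:
    running += f
    cum_freq.append(running)

# TEPS from the reported BFS time: edges / seconds.
edges = 1.47e9
seconds = 0.180
gteps = edges / seconds / 1e9

print(total)
print(round(freq[4], 2), round(cum_freq[4], 2))
print(round(gteps, 2))
```

Levels 4 and 5 together contain about 98% of the visited vertices, and 1.47 billion edges in 180 ms comes to roughly 8.17 GTEPS, which the slide reports as 8.1 GTEPS.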

Graph500 benchmark

•  Fastest single-node entry on the 4th list (June 2012)
•  Fastest CPU-based single-node entry on the 6th list (June 2013)

[Figure: excerpts of the Graph500 lists with our entries marked; labels: 4-way Intel Xeon Westmere-EX, 4-way Intel Xeon SandyBridge-EP, Convey 4 FPGA + 2 CPU, Rank 26, Rank 57, 8.2 GTEPS, 11.1 GTEPS]

http://www.graph500.org

1st Green Graph500 list (June 2013)

•  Measures power efficiency using the TEPS/W ratio
•  Results on various systems such as Android, Linux, and Mac
•  NVIDIA Tegra3 (Android NDK) and Intel/AMD architectures run the same source code

Small Data category (ours)

Rank 1    ASUS tablet TF700T, NVIDIA Tegra3 (4-core)    64.1 MTEPS/W (150 MTEPS)
Rank 2    Intel NUC (Linux)                             53.8 MTEPS/W (1.1 GTEPS)
Rank 3    Mac mini                                      53.5 MTEPS/W (1.9 GTEPS)

http://green.graph500.org

Conclusion

•  NUMA-optimized hybrid BFS algorithm
   –  Reduces unnecessary edge traversals and remote RAM accesses by carefully considering NUMA
•  Numerical results on a 4-way Intel Xeon
   –  scales well up to 64 threads (scalable)
   –  achieves 11.15 GTEPS (fast)
   –  2.2x speedup compared with the original hybrid algorithm
•  Graph500 & Green Graph500
   –  Fastest single-node entry in June 2012
   –  Most power-efficient entry in June 2013

Future work

•  Further optimizing the NUMA-optimized BFS
   –  BigData2013 version: 11 GTEPS
   –  Latest version: 26 GTEPS (2.4x faster)
•  Distributed-memory parallel computation

[Figure: TEPS ratio versus SCALE (20 to 29) for the BigData2013 version and the latest version]
