numa-optimized parallel breadth-first search on multicore single-node system

Post on 24-May-2015


NUMA-optimized Parallel Breadth-first Search on Multicore Single-node System

Yuichiro Yasui*1, Katsuki Fujisawa*1, Kazushige Goto*2

*1 Chuo University & JST CREST  *2 Intel Corporation

Outline

1.  Background
2.  Breadth-first Search (BFS)
3.  Proposal: NUMA-optimized parallel BFS
4.  Numerical Results
5.  Conclusion

Background
•  Large-scale graphs in various fields
–  US road network: 24 million vertices & 58 million edges
–  Twitter follow-ship (social network): 61.6 million vertices & 1.47 billion edges
–  Neuronal network @ Human Brain Project: 89 billion vertices & 100 trillion edges
–  Cyber-security: 15 billion log entries / day
•  Fast and scalable graph processing by using HPC
•  Application fields: transportation, social network, cyber-security, bioinformatics

Importance of graph processing
•  BFS is an important and fundamental graph processing kernel
–  Obtains relationship of distance (hops) as stand-alone
–  Used in many algorithms (BC, max. flow, max. independent set)
•  Kinds of graph processing
–  concurrent search (breadth-first search)
–  optimization (single source shortest path)
–  edge-oriented (maximal independent set)

[Figure: flow with the labels application field, relationships, graph, graph processing, results, and understanding]

Problems of fast & scalable computation of BFS: low arithmetic intensity & irregular memory accesses

[Figure: breadth-first search, Step 1 to Step 3]

Graph500 Benchmark
•  Measures computer performance using the TEPS ratio in graph processing such as BFS (breadth-first search)
•  TEPS ratio = # of traversed edges per second
•  Input parameters: SCALE and edgefactor (= 16)
•  Benchmark flow: 1. Generation, 2. Construction, 3. BFS with validation (x 64 iterations)
•  Results: SCALE, edgefactor, BFS time, traversed edges, TEPS; the reported TEPS ratio is the median TEPS
•  Kronecker graph
–  2^SCALE vertices and 2^(SCALE+4) edges
–  synthetic scale-free network

http://www.graph500.org
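The problem size and the TEPS ratio described on this slide can be worked through in a few lines. The following is a minimal sketch, not the benchmark's reference code; the function names are hypothetical, and the median is used because the slide labels the score "Median TEPS".

```python
from statistics import median


def kronecker_size(scale, edgefactor=16):
    """Return (vertices, edges) for a Kronecker graph with 2^SCALE vertices."""
    vertices = 2 ** scale
    edges = edgefactor * vertices  # edgefactor = 16 gives 2^(SCALE+4) edges
    return vertices, edges


def teps(traversed_edges, seconds):
    """Traversed edges per second for one BFS run."""
    return traversed_edges / seconds


def median_teps(runs):
    """runs: iterable of (traversed_edges, bfs_time_in_seconds) pairs."""
    return median(teps(m, t) for m, t in runs)


# SCALE 26 corresponds to about 67 million vertices and 1 billion edges.
print(kronecker_size(26))
```

With SCALE 26 the sizes agree with the "67 million vertices and 1 billion edges" figure quoted in the results slides.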

Contribution
•  Efficient hybrid algorithm of BFS [Beamer2011, 2012]
–  reduces unnecessary edge traversal
–  5.1 GTEPS
•  Our proposal: NUMA-optimized hybrid algorithm
–  Improves locality of memory access
–  Library for considering NUMA carefully
–  Column-wise graph partitioning
•  Results on 4-way Intel Xeon E5 (64 CPU cores)
–  Scalable: scales well up to 64 threads
–  Fast: 11.15 GTEPS and 2.2x speedup compared with the original hybrid algorithm

Outline

1.  Background
2.  Breadth-first Search (BFS)
3.  NUMA architecture
4.  Proposal: NUMA-optimized parallel BFS
5.  Numerical Results

Breadth-first Search (BFS)
•  Obtains the level of each vertex from the source vertex
•  Level = certain # of hops away from the source

Input: Graph G and source
Output: Tree with root as source

[Figure: BFS tree rooted at the source, with Level 1, Level 2, and Level 3]
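The input/output contract above (graph and source in, levels and a tree out) can be illustrated with a textbook queue-based BFS. This is a minimal sequential sketch for orientation only; the function name and the dictionary-based graph representation are assumptions, not the implementation used in the talk.

```python
from collections import deque


def bfs_levels(adj, source):
    """Return (level, parent) dicts for all vertices reachable from source.

    adj maps each vertex to a list of its neighbors.
    parent encodes the BFS tree; the source is its own parent.
    """
    level = {source: 0}
    parent = {source: source}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, []):
            if v not in level:          # first visit fixes the hop count
                level[v] = level[u] + 1
                parent[v] = u
                queue.append(v)
    return level, parent


adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2, 4], 4: [3]}
level, parent = bfs_levels(adj, 0)
print(level)
```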

Hybrid BFS for low-diameter graph
•  Efficient for low-diameter graphs
–  scale-free and/or small-world property such as social networks
•  At higher ranks in Graph500 benchmark
•  Hybrid algorithm [Beamer2011, 2012]
–  combines top-down algorithm and bottom-up algorithm
–  reduces unnecessary edge traversal
•  Top-down algorithm: efficient for a small frontier (frontier < neighbors)
•  Bottom-up algorithm: efficient for a large frontier (frontier > neighbors)

[Figure: frontier at level k and neighbors at level k+1 for the top-down and bottom-up algorithms, with a switch between the two]

Top-down algorithm
•  Explores outgoing edges of frontier queue QF
•  Appends unvisited vertices into neighbor queue QN
•  Efficient for a small frontier
•  Has unnecessary edge traversal for a large frontier

[Figure: QF and QN advancing from the source (Level 0) through Levels 1 to 3, with unnecessary edge traversals marked]
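One level of the top-down direction can be sketched as follows. This is an illustrative sequential version under assumed data structures (adjacency lists in a Python list, a parent array with None for unvisited vertices); the function name is hypothetical. The returned edge count makes the "unnecessary edge traversal" visible.

```python
def top_down_step(adj, frontier, parent):
    """Expand frontier queue QF by one level.

    Returns (QN, scanned): the neighbor queue and the number of edges examined.
    """
    neighbors = []
    scanned = 0
    for u in frontier:              # every vertex in QF
        for v in adj[u]:            # every outgoing edge of u
            scanned += 1
            if parent[v] is None:   # unvisited: claim it and append to QN
                parent[v] = u
                neighbors.append(v)
    return neighbors, scanned


adj = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2]]
parent = [0, None, None, None]      # vertex 0 is the source
qn, scanned = top_down_step(adj, [0], parent)
print(qn, scanned)                  # level 1
qn, scanned = top_down_step(adj, qn, parent)
print(qn, scanned)                  # level 2
```

In the second call six edges are examined but only one leads to an unvisited vertex, which is the waste the slide refers to when the frontier is large.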

Bottom-up algorithm
•  Explores frontier queue QF from unvisited vertices
•  Appends adjacent vertices into neighbors QN
•  Efficient for a large frontier
•  Has unnecessary edge traversal for a small frontier

[Figure: unvisited vertices searching for QF, from the source through Levels 1 to 3, with unnecessary edge traversals marked]
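One level of the bottom-up direction can be sketched in the same style. Again this is a sequential illustration under assumed data structures (incoming adjacency lists, a boolean frontier bitmap, a parent array with None for unvisited vertices), with a hypothetical function name.

```python
def bottom_up_step(in_adj, in_frontier, parent):
    """Let every unvisited vertex look for a parent in the frontier.

    Returns (QN, scanned): the newly found vertices and the edges examined.
    """
    neighbors = []
    scanned = 0
    for v in range(len(in_adj)):
        if parent[v] is not None:   # already visited, nothing to do
            continue
        for u in in_adj[v]:         # incoming edges of the unvisited vertex
            scanned += 1
            if in_frontier[u]:      # found a frontier vertex: stop early
                parent[v] = u
                neighbors.append(v)
                break
    return neighbors, scanned


in_adj = [[1, 2], [0, 2, 3], [0, 1, 3], [1, 2]]
parent = [0, None, None, None]
qn, scanned = bottom_up_step(in_adj, [True, False, False, False], parent)
print(qn, scanned)                  # small frontier: some scans find nothing
qn, scanned = bottom_up_step(in_adj, [False, True, True, False], parent)
print(qn, scanned)                  # large frontier: the first edge succeeds
```

The early `break` is what saves work when the frontier is large; with a small frontier, most scans run to the end of the list without finding a parent.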

Hybrid BFS combines top-down and bottom-up

[Figure: frontier at level k and neighbors at level k+1 for the top-down algorithm and for the bottom-up algorithm, with a switch between the two]

Hybrid algorithm of Beamer et al. [1]
•  Two different traversal kernels: top-down and bottom-up.
•  Top-down
–  traverses neighbors of the frontier.
–  performance depends on frontier size.
•  Bottom-up
–  finds the frontier from vertices in candidate neighbors (all unvisited vertices).
–  performance depends on the number of unvisited vertices.
•  This lazy estimation of candidate neighbors increases the number of edges traversed.

Traversed edges of Kronecker graph (SCALE 26)

Level        Top-down mF     Bottom-up mB   Hybrid min(mF, mB)
0                      2    2,103,840,895                    2
1                 66,206    1,766,587,029               66,206
2            346,918,235       52,677,691           52,677,691
3          1,727,195,615       12,820,854           12,820,854
4             29,557,400          103,184              103,184
5                 82,357           21,467               21,467
6                    221           21,240                  227
Total      2,103,820,036    3,936,072,360           65,689,631
Ratio            100.00%          187.09%                3.12%

[Figure: top-down for a small frontier; bottom-up for a large frontier; the two direction switches are marked on the table]

[1] S. Beamer et al.: Direction-optimizing breadth-first search, SC'12, 2012.
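The Total and Ratio rows of the traversed-edges table can be re-derived from the per-level counts. The sketch below copies the numbers exactly as printed on the slide; note that the hybrid entry for level 6 is printed as 227, which is not min(mF, mB) for that row, and it is kept as printed because the printed total depends on it. The variable names are illustrative.

```python
# Per-level traversed edges for the SCALE 26 Kronecker graph, as printed.
m_f = [2, 66_206, 346_918_235, 1_727_195_615, 29_557_400, 82_357, 221]
m_b = [2_103_840_895, 1_766_587_029, 52_677_691, 12_820_854,
       103_184, 21_467, 21_240]
hybrid = [2, 66_206, 52_677_691, 12_820_854, 103_184, 21_467, 227]

total_top_down = sum(m_f)
total_bottom_up = sum(m_b)
total_hybrid = sum(hybrid)


def ratio(total):
    """Percentage of edges traversed relative to pure top-down."""
    return f"{100.0 * total / total_top_down:.2f}%"


print(total_top_down, ratio(total_top_down))
print(total_bottom_up, ratio(total_bottom_up))
print(total_hybrid, ratio(total_hybrid))
```

The hybrid traverses roughly 3% of the edges that pure top-down does, which is the reduction in unnecessary edge traversal that the hybrid algorithm is credited with.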

Outline

1.  Background
2.  Breadth-first Search (BFS)
3.  Proposal: NUMA-optimized parallel BFS
4.  Numerical Results
5.  Conclusion

How to speed up the hybrid algorithm?
•  NUMA architecture
–  Non-uniform memory access
–  Each CPU socket has a local RAM
–  Fast local RAM and slow non-local RAM
•  Frequent non-local memory accesses on NUMA architecture
–  Graph G and the working data (QF, QN, visited flags) are spread across the local memories

[Figure: 4-socket Intel Xeon E5 system; each 8-core Xeon E5 4640 has processor cores with L1/L2 cache, a shared L3 cache, and its own RAM, and the sockets are joined by an interconnect; local and non-local accesses are marked]

Difficulty of considering NUMA architecture
1.  How to distribute the graph and data to each local RAM?
–  G is divided into partial graphs G0, G1, G2, G3, with data B0, B1, B2, B3
2.  How to bind each partial graph and its data to each NUMA unit?
–  G0 to G3 are bound to CPU0 to CPU3 with RAM0 to RAM3

[Figure: G split into G0 to G3 and placed on four NUMA units, each a CPU with its local RAM; NUMA unit 3 is highlighted]

ULIBC: Ubiquity Library for Intelligently Binding Cores

1.  NUMACTL (command line tool, library for C/C++)
2.  Intel compiler Thread Affinity Interface (API)
    → CPU affinity + local memory binding
3.  ULIBC (our library, library for C/C++)
    → CPU affinity + local memory binding + processor topology

Processor topology for each CPU core
–  Processor ID: index of logical processor core
–  Package ID: index of CPU socket
–  Core ID: index of physical core in each CPU socket

•  Topology is read from Linux device files: /sys/devices/system/*
•  At a parallel region, each thread ID is mapped to a processor ID, package ID, and core ID
–  sched_setaffinity system call
–  mbind system call
•  Supports scatter and compact policy; round-robin on CPU sockets
•  ULIBC makes it possible to manage NUMA carefully
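The thread-to-core mapping behind the scatter and compact policies can be sketched as a pure function. This is not ULIBC's API: the function name, its parameters, and the exact placement formulas are assumptions made for illustration, with "scatter" read as round-robin over CPU sockets and "compact" read as filling one socket before moving to the next. On Linux, the resulting placement could then be applied with an affinity call such as Python's `os.sched_setaffinity`, which is left out here to keep the sketch portable.

```python
def place_threads(num_threads, num_sockets, cores_per_socket, policy="scatter"):
    """Map thread IDs to (thread_id, package_id, core_id) tuples.

    Hypothetical helper: the placement rules are assumptions, not ULIBC's.
    """
    placement = []
    for tid in range(num_threads):
        if policy == "scatter":
            # Round-robin on CPU sockets: consecutive threads alternate sockets.
            package = tid % num_sockets
            core = (tid // num_sockets) % cores_per_socket
        elif policy == "compact":
            # Fill every core of one socket before using the next socket.
            package = (tid // cores_per_socket) % num_sockets
            core = tid % cores_per_socket
        else:
            raise ValueError(f"unknown policy: {policy}")
        placement.append((tid, package, core))
    return placement


# 4 sockets x 16 logical cores, as in the 4-way Xeon system of the talk.
print(place_threads(4, 4, 16, "scatter"))
print(place_threads(4, 4, 16, "compact"))
```

Scatter spreads a small number of threads over all sockets (and therefore over all memory controllers), while compact keeps them close together on one socket.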

NUMA-opt. Column-wise Graph Partitioning
•  Divides G = (V, A) into partial graphs Gk = (Vk, Ak) and binds each to local RAM k
–  Ak is a set of adjacency lists that holds incoming edges to Vk
•  Row-wise graph partitioning: O(m) mostly non-local memory accesses
•  Column-wise graph partitioning: O(m) local memory accesses only

[Figure: adjacency matrix split into A0 to A3 row-wise and column-wise, with the vertex block Vk, an edge (i, j), and the frontier (Level k) and neighbors (Level k+1) marked]
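The partitioning rule stated above (Ak holds the incoming edges to Vk) can be sketched as follows. The assignment of vertices to contiguous, equally sized blocks and the function name are assumptions for illustration; the sketch only builds the partial adjacency lists and does not perform any real memory binding.

```python
def partition_column_wise(num_vertices, edges, num_units):
    """Split directed edges (u, v) by the owner of the destination v.

    Returns (parts, block): parts[k] maps a source u to its destinations in
    V_k, so every write caused by an edge in parts[k] targets a vertex in V_k.
    """
    block = -(-num_vertices // num_units)   # ceiling division: vertices per unit
    parts = [dict() for _ in range(num_units)]
    for u, v in edges:
        k = v // block                      # owner of the destination vertex
        parts[k].setdefault(u, []).append(v)
    return parts, block


# Undirected example graph on 8 vertices, stored as edges in both directions.
undirected = [(0, 1), (0, 2), (0, 4), (1, 3), (2, 6), (4, 5), (5, 7)]
edges = [e for (a, b) in undirected for e in ((a, b), (b, a))]
parts, block = partition_column_wise(8, edges, 4)
for k, a_k in enumerate(parts):
    print(k, a_k)
```

Because each Ak contains only edges that end in Vk, updates to per-vertex state such as the visited flag or the parent stay within the block owned by unit k.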

NUMA-optimized Top-down
•  Explores outgoing edges of frontier queue QF
•  Appends unvisited vertices into neighbor queue QN
•  Efficient for a small frontier
•  Has unnecessary edge traversal for a large frontier

[Figure: frontier QF and neighbors QN advancing from the source (Level 0) through Levels 1 to 3, with unnecessary edge traversals marked]

Details of NUMA-optimized Top-down
•  Explores outgoing edges Ak of frontier queue QFk
•  Appends unvisited vertices into neighbor queue QNk

[Figure: NUMA units 0 to 3 each expand Level 1 to Level 2 with their own queues (QF0/QN0, QF1/QN1, QF2/QN2, and likewise on unit 3); the local neighbor queues are combined by an all-gather into the Level 2 frontier QF]
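The per-unit top-down step followed by an all-gather can be simulated sequentially. This is only a sketch of the data flow under assumptions: the loop over units stands in for threads pinned to each NUMA unit, the partial adjacency lists are written out by hand for a small 8-vertex graph split into four blocks of two vertices, and all names are hypothetical.

```python
def numa_top_down_step(parts, frontier, parent):
    """One top-down level over partial adjacency lists A_0 .. A_{K-1}.

    parts[k] maps a frontier vertex u to its neighbors inside V_k.
    Returns (next_frontier, local_queues).
    """
    local_queues = []
    for a_k in parts:                    # conceptually one NUMA unit each
        qn_k = []
        for u in frontier:
            for v in a_k.get(u, []):     # v always belongs to this unit's V_k
                if parent[v] is None:
                    parent[v] = u
                    qn_k.append(v)
        local_queues.append(qn_k)
    # All-gather: concatenate the local neighbor queues into the next frontier.
    next_frontier = [v for qn_k in local_queues for v in qn_k]
    return next_frontier, local_queues


# Incoming-edge lists per unit; V_0={0,1}, V_1={2,3}, V_2={4,5}, V_3={6,7}.
parts = [
    {0: [1], 1: [0], 2: [0], 4: [0], 3: [1]},
    {0: [2], 1: [3], 6: [2]},
    {0: [4], 4: [5], 5: [4], 7: [5]},
    {2: [6], 5: [7]},
]
parent = [0] + [None] * 7                # vertex 0 is the source
frontier, queues = numa_top_down_step(parts, [0], parent)
print(frontier, queues)
frontier, queues = numa_top_down_step(parts, frontier, parent)
print(frontier, queues)
```

Each unit reads the shared frontier but only writes parent entries and queue entries for vertices it owns, which is the locality property the column-wise partitioning is meant to provide.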

NUMA-optimized Bottom-up
•  Explores frontier queue QF from unvisited vertices
•  Appends adjacent vertices into neighbors QN
•  Efficient for a large frontier
•  Has unnecessary edge traversal for a small frontier

[Figure: unvisited vertices searching for frontier QF, from the source through Levels 1 to 3, with neighbors QN and unnecessary edge traversals marked]

Details of NUMA-optimized Bottom-up
•  Explores frontier queue QFk from unvisited vertices
•  Appends adjacent vertices into neighbors QNk

[Figure: NUMA units 0 to 3 each search from their unvisited vertices with their own queues (QF0/QN0, QF1/QN1, QF2/QN2, QF3/QN3) between Level 1 and Level 2; the results are combined by an all-gather into the Level 2 frontier QF]

Outline

1.  Background
2.  Breadth-first Search (BFS)
3.  NUMA-optimized parallel BFS
4.  Numerical Results
5.  Conclusion

Machine specification
•  4-way Intel Xeon E5 (8-core Xeon E5 4640 per socket)
–  CentOS 6.4 (Kernel 2.6.32)
–  GCC 4.4.7
–  64 logical CPU cores
–  4 NUMA units x 16 logical cores
•  4-way AMD Opteron 6174 (12 cores per socket)
–  Fedora 19 (Kernel 3.11.2)
–  GCC 4.8.1
–  48 CPU cores
–  8 NUMA units x 6 cores

[Figure: both systems drawn as sockets with processor cores and L1/L2 cache, local RAM, and an interconnect; the Xeon sockets also show a shared L3 cache]

TEPS ratio varied with problem size
•  Ours achieves 11.15 GTEPS for Kronecker graph (SCALE 26): 67 million vertices and 1 billion edges
•  Ours is 2.2x faster than the original hybrid algorithm
–  Hybrid [Beamer2011, 2012]: peak performance 5.1 GTEPS on 4-way Intel Xeon E5-8870 (Westmere-EX arch.)
–  Hybrid + NUMA (this paper): 11.15 GTEPS on 4-way Intel Xeon E5-4640 (SandyBridge-EP arch.)

[Figure: GTEPS (0 to 14, higher is better) versus SCALE (20 to 29) for Hybrid and Hybrid + NUMA]

Strong scaling on Intel/AMD System
•  Scales well up to the point where # of threads equals # of cores
•  4-way Intel Xeon: 11.15 GTEPS
•  4-way AMD Opteron: 6.17 GTEPS

[Figure: speedup versus number of threads (1, 2, 4, 8, 12, 16, 24, 32, 48, 64) for ideal, 4-way SandyBridge-EP, and 4-way MagnyCours; annotations: "40 threads: x40" and "64 threads: x28"]

Twitter network (follow-ship network 2009)
•  41 million vertices and 1.47 billion edges; an (i, j)-edge connects User i and User j
•  Frontier size in BFS with source as User 21,804,357

Lv      Frontier size   Freq. (%)   Cum. Freq. (%)
0                   1        0.00             0.00
1                   7        0.00             0.00
2               6,188        0.01             0.01
3             510,515        1.23             1.24
4          29,526,508       70.89            72.13
5          11,314,238       27.16            99.29
6             282,456        0.68            99.97
7              11,536        0.03           100.00
8                 673        0.00           100.00
9                  68        0.00           100.00
10                 19        0.00           100.00
11                 10        0.00           100.00
12                  5        0.00           100.00
13                  2        0.00           100.00
14                  2        0.00           100.00
15                  2        0.00           100.00
Total      41,652,230      100.00                -

•  Six degrees of separation
•  Our NUMA-optimized BFS on 4-way Xeon system: 180 ms / BFS → 8.1 GTEPS

Graph500 benchmark
•  Fastest of single-node on 4th list (June 2012)
•  Fastest of CPU-based single-node on 6th list (June 2013)

[Figure: excerpts of the Graph500 lists with our entries marked; labels in order: ours, 4-way Intel Xeon Westmere-EX; ours, 4-way Intel Xeon SandyBridge-EP; 8.2 GTEPS; Rank 26; Rank 57, 11.1 GTEPS; Convey 4 FPGA + 2 CPU]

http://www.graph500.org

1st Green Graph500 list on June 2013
•  Measures power efficiency using the TEPS/W ratio
•  Results on various systems such as Android, Linux, and Mac
•  Small Data category (ours)
–  Rank 1: ASUS tablet TF700T, NVIDIA Tegra3 (4-core), Android NDK: 64.1 MTEPS/W (150 MTEPS)
–  Rank 2: Intel NUC (Linux): 53.8 MTEPS/W (1.1 GTEPS)
–  Rank 3: Mac mini: 53.5 MTEPS/W (1.9 GTEPS)
•  Runs on NVIDIA Tegra3 and Intel/AMD arch. with the same source code

http://green.graph500.org

Conclusion
•  NUMA-optimized Hybrid BFS algorithm
–  Reduces unnecessary edge traversals (Hybrid) and remote RAM accesses (NUMA) by carefully considering NUMA
•  Numerical results on 4-way Intel Xeon
–  scales well up to 64 threads (scalable)
–  achieves 11.15 GTEPS (fast)
–  2.2x speedup compared with the original Hybrid
•  Graph500 & Green Graph500
–  Fastest single-node in June 2012
–  Most power-efficient in June 2013

Future work
•  Further optimizing NUMA-optimized BFS
–  BigData2013 version: 11 GTEPS
–  Latest version: 26 GTEPS (2.4x faster)
•  Distributed-memory parallel computation

[Figure: GTEPS (0 to 30) versus SCALE (20 to 29) for the latest version and the BigData2013 version]
