TRANSCRIPT

Work-Efficient Parallel Skyline Computation for the GPU
Kenneth S. Bøgh, Sean Chester, Ira Assent
[email protected]
Data-Intensive Systems Group, Aarhus University, Denmark
Harvard University, 11 February 2016
What this talk will cover

1. An introduction to General Purpose computing on Graphics Processing Units (GPGPU)
2. An introduction to the skyline operator
3. A review of state-of-the-art algorithms for computing skylines
4. An introduction of parallel search trees for:
   - multicore CPUs
   - GPUs
5. Current research at DASlab

Kenneth S. Bøgh, Sean Chester, Ira Assent (Aarhus University). Parallel Skyline Computation. HU, 11 Feb 2016.
What is a GPU?

1. Graphics Processing Unit: specialized hardware for graphics
2. Massively parallel (2688 cores in our card)
3. More power efficient than CPUs (21 vs. 5 GFLOPS/watt)
4. More processing power per $
5. Used as an accelerator card: the extreme in terms of scale-up
Key differences between CPU and GPU

- Separate memory: data must be transferred back and forth
- Higher memory bandwidth (×4) and latency (×2)
- No prefetcher, and a small cache (1.5 MB for 2688 cores)
- 2048 threads per 192 cores (2 threads per core on the CPU)
- Groups of 32 threads execute in lockstep

[Diagram: the two memory hierarchies. CPU: CPU RAM; shared cache, 2 MB per core; 256 KB L2 and 2×32 KB L1 per core. GPU: GPU RAM; shared cache, 1.5 MB; 64 KB read/write plus 48 KB read-only per group of 192 cores.]
The CPU and GPU threading models

- CPU threads execute independently
- GPU threads execute in step-locked groups of 32 called warps
- Threads of a warp must agree on what instruction to execute next
- Otherwise some threads will halt while the others execute

[Diagram: a branching search tree over nodes A–F. CPU threads CPU1 and CPU2 follow different branches independently; on the GPU, when thread 1 of a warp takes one branch, threads 2–32 are halted until execution reconverges.]
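The cost of divergence described above can be sketched with a toy CPU-side simulation. The slot accounting below is a deliberate simplification of real warp schedulers, meant only to show why a half-divergent warp issues far more execution slots than it uses:

```python
# Toy simulation of warp-style lockstep execution (not real GPU code).
# All "threads" must issue the same instruction each cycle; threads whose
# branch is not currently executing sit idle, wasting their slot.

def simulate_warp(branch_taken, work_if_taken, work_if_not):
    """Count execution slots for a warp that may diverge on one branch.

    branch_taken: list of bools, one per thread in the warp.
    work_if_taken / work_if_not: instruction counts on each path.
    Returns (slots issued by the warp, slots doing useful work).
    """
    n = len(branch_taken)
    cycles = 0
    # The warp serializes the two paths: first the threads whose predicate
    # holds run (the rest halt), then the other group runs.
    if any(branch_taken):
        cycles += work_if_taken
    if not all(branch_taken):
        cycles += work_if_not
    useful = sum(work_if_taken if t else work_if_not for t in branch_taken)
    return cycles * n, useful

# A fully convergent warp wastes nothing; a divergent one pays for both paths.
print(simulate_warp([True] * 32, 10, 20))                # (320, 320)
print(simulate_warp([True] * 16 + [False] * 16, 10, 20)) # (960, 480)
```

In the divergent case only half of the issued slots do useful work, which is why skyline algorithms for the GPU try to make the threads of a warp agree on their control flow.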
Example: Finding a conference hotel

- Close to the conference location, to make you happy
- Cheap, to make your department happy
- Skyline query: minimize price and distance, returning all best trade-offs

[Figure: hotels plotted by price (x-axis) and distance (y-axis), with two points p and q highlighted.]
Example: Finding a conference hotel

A point p dominates* another point q if:
- p is preferable or equivalent to q in all dimensions
- p is strictly preferable to q in at least one dimension

The skyline [1] consists of the points that are not dominated.

[Figure: the price/distance plot again; q lies above and to the right of p, so p dominates q.]

*This is the same concept as Pareto dominance from economics, but applied to databases.
[1] S. Börzsönyi et al., "The skyline operator", Proc. ICDE, 2001.
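The definition above translates directly into code. A minimal quadratic baseline, assuming all dimensions are to be minimized (price, distance, ...):

```python
def dominates(p, q):
    """p dominates q: p <= q in every dimension, p < q in at least one."""
    return (all(pi <= qi for pi, qi in zip(p, q))
            and any(pi < qi for pi, qi in zip(p, q)))

def skyline(points):
    """Return all points not dominated by any other (O(n^2) baseline)."""
    return [q for q in points
            if not any(dominates(p, q) for p in points if p != q)]

hotels = [(50, 9), (120, 1), (80, 4), (90, 5), (140, 2)]  # (price, distance)
print(skyline(hotels))  # (90, 5) and (140, 2) are dominated
```

Every algorithm in this talk computes exactly this set; they differ only in how many of the pairwise dominance tests they manage to avoid.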
The state of parallel skylines

- GGS [3] is the state-of-the-art GPU skyline algorithm, run on an Nvidia GTX Titan with 2688 cores at 0.8 GHz
- BSkyTree [7] is the sequential state-of-the-art, run on a 3.4 GHz Intel i7-3770

[Figure: running time (s) and dominance tests per point vs. cardinality (×10⁶), comparing BSkyTree and GGS.]

[3] K.S. Bøgh et al., "Efficient GPU-based skyline computation", Proc. DaMoN, 2013.
[7] J. Lee and S.-w. Hwang, "Scalable skyline computation using a balanced pivot selection technique", Inf. Syst., 2014.
Monotonic sorting

1. Compute a monotonic score for each data point
2. Sort the data by the score
3. for i = 0, ..., n − 1 do
4.   Append point i to the candidate buffer if no point in the candidate buffer dominates i

[Diagram: a candidate buffer growing on the left, unprocessed points on the right.]
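The four steps above can be sketched as follows, using the coordinate sum as the monotonic score (an arbitrary choice; any function that is monotone in every dimension works). Because a dominating point always has a strictly lower sum than the points it dominates, a dominator is always processed first, so comparing each point only against the candidate buffer is sufficient:

```python
def dominates(p, q):
    """p dominates q under minimization in all dimensions."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline_sort_first(points):
    # Steps 1-2: compute a monotonic score (here: sum) and sort by it.
    ordered = sorted(points, key=sum)
    candidates = []
    # Steps 3-4: append point i if no buffered candidate dominates it.
    for point in ordered:
        if not any(dominates(c, point) for c in candidates):
            candidates.append(point)
    return candidates

hotels = [(50, 9), (120, 1), (80, 4), (90, 5), (140, 2)]
print(skyline_sort_first(hotels))  # [(50, 9), (80, 4), (120, 1)]
```

The buffer never needs pruning: a point that enters it can never be dominated by a later, higher-scoring point.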
Object-based partitioning

- Partitions the data recursively
- Builds a search tree on the fly to minimize data point comparisons
- Stores bit masks in nodes to minimize dominance tests

[Figure: six points A–F in the plane, and the search tree built over them step by step.]
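One way to see how bit masks cut down dominance tests: pick a pivot point and record, for each point, the set of dimensions in which it is worse than the pivot. A sketch under stated assumptions (the pivot choice and helper names are illustrative, not the exact scheme of the cited algorithms; all dimensions are minimized):

```python
def mask(point, pivot):
    """Bit d is set iff the point is worse (larger) than the pivot in dim d."""
    m = 0
    for d, (x, v) in enumerate(zip(point, pivot)):
        if x > v:
            m |= 1 << d
    return m

def may_dominate(mask_p, mask_q):
    """If p dominates q, then every dimension where p is worse than the
    pivot is also a dimension where q is worse, so p's mask must be a
    subset of q's. Pairs failing this check need no full dominance test."""
    return mask_p & ~mask_q == 0

pivot = (100, 5)
p, q, r = (80, 4), (90, 6), (120, 3)
print(may_dominate(mask(p, pivot), mask(q, pivot)))  # True: must test p vs. q
print(may_dominate(mask(r, pivot), mask(q, pivot)))  # False: skip r vs. q
```

Comparing two small integers with bitwise operations is far cheaper than a full per-dimension dominance test, which is why storing such masks in the tree nodes pays off.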
Control flow of Hybrid

[Diagram: worker threads α run Phase I against the solution tree, then Phase II against the solution tree, followed by a sequential update of the solution tree.]

Phase I is ideal; Phase II is cache-resident; the update phase is sequential.
Static median/quartile based partitioning

- Fixed two-level tree, based on median and quartile values
- Can be built in parallel
- Enables predictable branching

[Figure: the two-level tree over points p0–p6. The first level branches on median masks M; the second on quartile masks Q. Under M = 01 sit p0 (Q = 10), p6 (Q = 11), and p5 (Q = 01); under M = 10 sit p4 (Q = 01), p3 (Q = 10), and p2 (Q = 11); under M = 11 sits p1 (Q = 00).]

Point  Median    Quartile
p0     M0 = 01   Q0 = 10
p6     M6 = 01   Q6 = 11
p5     M5 = 01   Q5 = 01
p4     M4 = 10   Q4 = 01
p3     M3 = 10   Q3 = 10
p2     M2 = 10   Q2 = 11
p1     M1 = 11   Q1 = 00
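The masks in the table can be computed as in the sketch below. This assumes bit d of the median mask records whether a point lies above the global median in dimension d, and that the quartile mask refines this within the point's median half; that is my reading of the slide, not a verified reconstruction of the paper's scheme:

```python
# Sketch of static two-level (median/quartile) masks. Assumption: bit d of
# M means "above the median in dimension d"; bit d of Q means "above the
# quartile boundary within that half". Illustrative, not the exact scheme.
from statistics import median

def level_masks(points):
    dims = len(points[0])
    meds = [median(p[d] for p in points) for d in range(dims)]
    masks = []
    for p in points:
        m = q = 0
        for d in range(dims):
            above = p[d] > meds[d]
            if above:
                m |= 1 << d
            # The quartile boundary is the median of p's half in dim d.
            half = [x[d] for x in points if (x[d] > meds[d]) == above]
            if p[d] > median(half):
                q |= 1 << d
        masks.append((m, q))
    return masks

points = [(1, 1), (2, 2), (3, 3), (4, 4)]
print(level_masks(points))  # [(0, 0), (0, 3), (3, 0), (3, 3)]
```

Because the boundaries are fixed global values, every point's masks can be computed independently, which is what makes the tree buildable in parallel with predictable branching. As with pivot masks, p can dominate q only if p's median mask is a subset of q's, with the quartile mask refining the check one level down.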
The SkyAlign workflow

[Figure: warps w1–w4 traverse the static two-level tree over p0–p6 (median masks M at the first level, quartile masks Q at the second). At each node a warp compares masks, descends when the comparison is inconclusive, and runs full dominance tests only where needed, e.g., w3 against P3 and P2. All threads of a warp stay aligned on the same step: Compare, Descent, or Dominance test.]
Experimental setup

- Intel i7-3770 with 4 cores at 3.4 GHz and hyperthreading enabled
- Nvidia GTX Titan with 2688 cores at 0.8 GHz
- Transfer of data to and from the GPU is included in the running time
- Tree building is included in the running time

Compared algorithms:
- BSkyTree [7]: state-of-the-art sequential algorithm
- Hybrid [4]: the proposed multicore algorithm (run with 8 threads)
- GGS [3]: previous state-of-the-art, tree-less GPU algorithm
- SkyAlign [2]: the proposed GPU algorithm

Download all code: http://cs.au.dk/research-at-cs/data-intensive-systems/repository/

[2] K.S. Bøgh et al., "Work-efficient parallel skyline computation for the GPU", PVLDB, 8:9, 962–973, 2015.
[3] K.S. Bøgh et al., "Efficient GPU-based skyline computation", Proc. DaMoN, 2013.
[4] S. Chester et al., "Scalable parallelization of skyline computation for multi-core processors", Proc. ICDE, 2015.
[7] J. Lee and S.-w. Hwang, "Scalable skyline computation using a balanced pivot selection technique", Inf. Syst., 2014.
Evaluating running time

[Figure: running time (ms) vs. cardinality (×10⁶) and vs. dimensionality, on anticorrelated and independent data, for BSkyTree, Hybrid, GGS, and SkyAlign.]
Evaluating dominance tests

[Figure: dominance tests per point vs. cardinality (×10⁶) and vs. dimensionality, on anticorrelated and independent data, for BSkyTree, Hybrid, GGS, and SkyAlign.]
Evaluating work

[Figure: total work vs. cardinality (×10⁶) and vs. dimensionality, on anticorrelated and independent data, for BSkyTree, Hybrid, GGS, and SkyAlign.]
Evaluating running time on the CPU

[Figure: running time (ms) vs. cardinality and vs. dimensionality, on anticorrelated and independent data, for Hybrid, GGS, and SkyAlign.]
Scalability

[Figure: running time (ms) vs. number of cores on a 2×14-core and a 4×8-core machine, on anticorrelated and independent data, for Hybrid, GGS, and SkyAlign.]
Evaluating clocks per instruction

[Figure: CPI vs. cardinality and vs. dimensionality, on anticorrelated and independent data, for Hybrid, GGS, and SkyAlign.]
Current research: The RUM conjecture

- Trade-offs are present in all parts of computer science
- Each field has its own major components between which trade-offs are made
- The Data Systems Laboratory has recently formalized this for data systems
- The result is the RUM conjecture
Current research: The RUM conjecture

- Read overhead: the overhead of reading data
- Update overhead: the overhead of updating data
- Memory overhead: the additional storage used
- Optimize for at most two, at the cost of the third

[Figure: a triangle with Read, Update, and Memory at its corners; an example array 8 4 9 1 5 0 2, into which a 3 is then inserted.]
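A toy illustration of the trade-off (my own example, not from the talk): a sorted array is read-optimized but pays on updates, an append-only log is update-optimized but pays on reads, and indexing the log buys reads back by spending memory:

```python
# Illustrative RUM trade-off: three containers with the same interface
# but different read/update/memory costs.
import bisect

class SortedArray:
    """Read-optimized: O(log n) lookups, but O(n) shifting on insert."""
    def __init__(self):
        self.data = []
    def insert(self, x):
        bisect.insort(self.data, x)
    def contains(self, x):
        i = bisect.bisect_left(self.data, x)
        return i < len(self.data) and self.data[i] == x

class AppendLog:
    """Update-optimized: O(1) appends, but O(n) scans on lookup."""
    def __init__(self):
        self.data = []
    def insert(self, x):
        self.data.append(x)
    def contains(self, x):
        return x in self.data

class IndexedLog(AppendLog):
    """Buys reads back by spending memory on an auxiliary hash index."""
    def __init__(self):
        super().__init__()
        self.index = set()
    def insert(self, x):
        super().insert(x)
        self.index.add(x)
    def contains(self, x):
        return x in self.index

for structure in (SortedArray(), AppendLog(), IndexedLog()):
    for v in (8, 4, 9, 1, 5, 0, 2):
        structure.insert(v)
    print(type(structure).__name__, structure.contains(5))  # all True
```

All three answer the same queries; each simply pays for them in a different currency, which is the point of the conjecture.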
(Animation frames condensed: the slide steps through the same example under three layouts.)
Unsorted log: 8 4 9 1 5 0 2 — appending 3 or updating 9 to 7 is cheap; every read must scan the whole log
Sorted array: 0 1 2 4 5 8 9 — reads are cheap; inserting 3 shifts elements to keep order
Partitioned array with fences <4 and <9: 1 2 0 | 5 8 4 | 9 — inserting 3 only touches one partition; ghost values (G) pre-allocate empty slots so inserts avoid shifting, at the cost of extra memory
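One way to spend memory to lower both read and update overhead, loosely following the partitioned example on the slide (class and method names are my own sketch, not from the talk): fence keys route each operation to a single partition, and pre-allocated ghost slots let inserts land without shifting anything.

```python
GHOST = None  # a pre-allocated empty slot ("ghost value")

class PartitionedStore:
    """Fixed fence keys split the key space; each partition keeps
    spare ghost slots so an insert only touches one small partition."""
    def __init__(self, fences, capacity=4):
        self.fences = fences                               # e.g. [4, 9]
        self.parts = [[GHOST] * capacity
                      for _ in range(len(fences) + 1)]

    def _part(self, x):
        for i, f in enumerate(self.fences):                # route by fences
            if x < f:
                return self.parts[i]
        return self.parts[-1]

    def insert(self, x):
        part = self._part(x)
        for i, slot in enumerate(part):                    # fill a ghost slot
            if slot is GHOST:
                part[i] = x
                return
        raise OverflowError("partition full: would need a split")

    def contains(self, x):                                 # scan one partition
        return x in self._part(x)

s = PartitionedStore(fences=[4, 9])
for v in [1, 2, 0, 5, 8, 4, 9, 3]:
    s.insert(v)
print(s.contains(3))   # True
```

Reads scan only one partition and inserts never shift data, but the ghost slots are pure memory overhead: the RUM trade-off paid in the third corner.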
Open questions
Which of the approaches is better?
How many partitions should we choose?
How should the partitions be distributed?
How should ghost values be distributed?
Can we extend this idea to indexes?
Kenneth S. Bøgh, Sean Chester, Ira Assent (Aarhus University)Parallel Skyline Computation HU, 11 Feb 2016 22 / 22