x86opti 05 s5yata

37
Remove Branches in BitVector Select Operations - marisa 0.2.2 - Susumu Yata @s5yata Brazil, Inc. 30 March 2013 1 Brazil, Inc.

Upload: s5yata

Post on 28-Jun-2015

TRANSCRIPT

Page 1: X86opti 05 s5yata

Remove Branches in BitVector Select Operations

- marisa 0.2.2 -

Susumu Yata (@s5yata)

Brazil, Inc.

30 March 2013

Page 2: X86opti 05 s5yata

Who I Am

Job
Brazil, Inc. (groonga developer)
We need R&D software engineers.

Personal research & development
Tries: darts-clone, marisa-trie, etc.
Corpus: Nihongo Web Corpus 2010 (NWC 2010)

Page 3: X86opti 05 s5yata

BitVector and Marisa

Relationships between BitVector and Marisa.

Page 4: X86opti 05 s5yata

BitVector

What’s BitVector?
A sequence of bits

Operations
BitVector::get(i)
BitVector::rank(i)
BitVector::select(i)

Page 5: X86opti 05 s5yata

BitVector – Get Operations

Interface
BitVector::get(i)

Description
The i-th bit (“0” or “1”)

Index: 0 1 2 … i-1 i i+1 … n-2 n-1
Bit:   0 0 1 …  0  1  1  …  0   0

Get! (the bit at index i, here “1”)

Page 6: X86opti 05 s5yata

BitVector – Rank Operations

Interface
BitVector::rank(i)

Description
The number of “1”s up to the i-th bit

Index: 0 1 2 … i-1 i i+1 … n-2 n-1
Bit:   0 0 1 …  0  1  1  …  0   0

How many “1”s?

Page 7: X86opti 05 s5yata

BitVector – Select Operations

Interface
BitVector::select(i)

Description
The position of the i-th “1”

Index: 0 1 2 … n-2 n-1
Bit:   0 0 1 …  0   0

Where is the i-th “1”?
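As a reference point, the three operations can be sketched naively in linear time. This is not marisa's implementation: the class name NaiveBitVector is hypothetical, indices are assumed to be 0-based, and rank(i) is assumed to count the 1s strictly before position i (the slide does not say whether the i-th bit itself is included).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical reference class, not marisa's BitVector.
class NaiveBitVector {
 public:
  explicit NaiveBitVector(std::vector<bool> bits) : bits_(std::move(bits)) {}

  // The i-th bit.
  bool get(std::size_t i) const { return bits_[i]; }

  // Number of 1s in positions [0, i) (assumed convention).
  std::size_t rank(std::size_t i) const {
    assert(i <= bits_.size());
    std::size_t count = 0;
    for (std::size_t k = 0; k < i; ++k) count += bits_[k];
    return count;
  }

  // Position of the i-th 1, counting from i = 0.
  // Returns the vector length if there are fewer than i + 1 ones.
  std::size_t select(std::size_t i) const {
    for (std::size_t k = 0; k < bits_.size(); ++k) {
      if (bits_[k] && i-- == 0) return k;
    }
    return bits_.size();
  }

 private:
  std::vector<bool> bits_;
};
```

Under these conventions rank and select are inverses on the set bits: rank(select(i)) equals i whenever the i-th 1 exists.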

Page 8: X86opti 05 s5yata

Marisa

Who’s Marisa?
An ordinary human magician

What’s Marisa?
A static and space-efficient dictionary

Data structure
Recursive LOUDS-based Patricia tries

Site
http://code.google.com/p/marisa-trie

Page 9: X86opti 05 s5yata

Marisa – Patricia

Patricia is a labeled tree.
Keys = Tree + Labels

ID  Key
0   “Argentina”
1   “Armenia”
2   “Brazil”
3   “Canada”
4   “Cyprus”

Node  Label
1     “Ar”
2     “Brazil”
3     ‘C’
4     “gentina”
5     “menia”
6     “anada”
7     “yprus”

[Figure: the Patricia trie for these keys. The root has children 1, 2 and 3; node 1 has children 4 and 5; node 3 has children 6 and 7.]
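"Keys = Tree + Labels" can be illustrated with a minimal sketch that rebuilds each key by walking from a node up to the root and concatenating labels. The flat parent array, the choice of 0 as the root's number, and the function name restore_key are assumptions made for illustration; marisa stores the tree as a LOUDS bit vector rather than as parent pointers.

```cpp
#include <string>

// Hypothetical flat form of the example trie: node k has parent
// kParent[k] and label kLabel[k]; node 0 is assumed to be the root.
const int kParent[8] = {-1, 0, 0, 0, 1, 1, 3, 3};
const char* const kLabel[8] = {"",        "Ar",    "Brazil", "C",
                               "gentina", "menia", "anada",  "yprus"};

// Rebuild a key by climbing to the root and prepending each label.
std::string restore_key(int node) {
  std::string key;
  for (; node > 0; node = kParent[node]) key.insert(0, kLabel[node]);
  return key;
}
```

With this layout, nodes 4, 5, 2, 6 and 7 restore the five keys in ID order.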

Page 10: X86opti 05 s5yata

Marisa – Recursiveness

Unfortunately, this margin is too small…

Keys = Tree + Labels
Labels = Tree + Labels
Labels = Tree + Labels <- Reasonable
Labels = Tree + Labels
Labels = Tree + Labels
Labels = Tree + Labels
Labels = Tree + Labels
…

Page 11: X86opti 05 s5yata

Marisa – BitVector Usage

LOUDS
Level-Order Unary Degree Sequence

Terminal flags
A node is terminal (“1”) or not (“0”).

Link flags
A node has a link to its multi-byte label (“1”) or has a built-in single-byte label (“0”).
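A LOUDS bit string can be sketched as follows. The convention used here (a "10" super-root prefix, then one "1" per child followed by a "0" for each node in level order) is one common textbook form and is an assumption; the slides do not state marisa's exact bit layout. The function name louds_encode is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Encode a tree given as per-node child counts in level order.
// Assumed convention: "10" for a super-root, then for each node
// one '1' per child followed by a terminating '0'.
std::string louds_encode(const std::vector<std::size_t>& degrees) {
  std::string bits = "10";
  for (std::size_t d : degrees) {
    bits.append(d, '1');
    bits.push_back('0');
  }
  return bits;
}
```

For the eight-node example trie (child counts 3, 2, 0, 2, 0, 0, 0, 0 in level order) this yields 17 bits, that is 2n + 1 bits for n nodes.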

Page 12: X86opti 05 s5yata

Marisa – BitVector Usage

LOUDS
BitVector::get(), select()

Terminal flags
BitVector::get(), rank(), select()

Link flags
BitVector::get(), rank()

Page 13: X86opti 05 s5yata

Implementations

How to implement Rank/Select operations.

Page 14: X86opti 05 s5yata

Rank Dictionary

Index structures
r_idx[x].abs = rank(512 · x), x = 0, 1, 2, …
r_idx[x].rel[y] = rank(512 · x + 64 · y) - rank(512 · x), y = 1, 2, 3, …, 7

Calculation
abs + rel + popcnt()

Page 15: X86opti 05 s5yata

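Rank Operations

Time complexity = O(1)

[Figure: the bit vector is divided into 512-bit blocks, each covered by r_idx.abs; one 512-bit block is divided into eight 64-bit blocks covered by r_idx.rel; the remaining 64-bit block is handled by popcnt().]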
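The abs + rel + popcnt() calculation can be sketched as below. The struct layout, field widths and the names RankEntry and RankDict are assumptions, not marisa's actual data layout; rank(i) is assumed to count the 1s in positions [0, i), and __builtin_popcountll is the GCC/Clang builtin.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical two-level rank index following the slide's definitions.
struct RankEntry {
  std::uint64_t abs = 0;      // rank(512 * x)
  std::uint16_t rel[8] = {};  // rel[y] = rank(512 * x + 64 * y) - rank(512 * x)
};

class RankDict {
 public:
  explicit RankDict(std::vector<std::uint64_t> words)
      : words_(std::move(words)), index_(words_.size() / 8 + 1) {
    std::uint64_t total = 0;
    // One extra step (w == size) records the counts at the very end.
    for (std::size_t w = 0; w <= words_.size(); ++w) {
      RankEntry& e = index_[w / 8];
      if (w % 8 == 0) e.abs = total;
      e.rel[w % 8] = static_cast<std::uint16_t>(total - e.abs);
      if (w < words_.size()) total += __builtin_popcountll(words_[w]);
    }
  }

  // Number of 1s in bit positions [0, i): abs + rel + popcnt().
  std::uint64_t rank(std::size_t i) const {
    assert(i <= words_.size() * 64);
    const RankEntry& e = index_[i / 512];
    std::uint64_t r = e.abs + e.rel[(i / 64) % 8];
    if (i % 64 != 0) {
      const std::uint64_t mask = (std::uint64_t{1} << (i % 64)) - 1;
      r += __builtin_popcountll(words_[i / 64] & mask);
    }
    return r;
  }

 private:
  std::vector<std::uint64_t> words_;
  std::vector<RankEntry> index_;
};
```

Each query touches one index entry and at most one 64-bit word, which is where the O(1) bound comes from.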

Page 16: X86opti 05 s5yata

Select Dictionary

Index structure
s_idx[x] = select(512 · x), x = 0, 1, 2, …

Calculation
1. Limit the range by using s_idx.
2. Limit the range by using r_idx[x].abs.
3. Limit the range by using r_idx[x].rel[y].
4. Find the i-th “1” in the range.

Page 17: X86opti 05 s5yata

Select Operations

[Figure: s_idx narrows the search to a run of 512-bit blocks; r_idx.abs narrows it to one 512-bit block; r_idx.rel narrows it to one of the eight 64-bit blocks; the final round finds the bit inside that 64-bit block.]
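The successive narrowing can be sketched as follows. This is a simplified illustration under stated assumptions, not marisa's code: the struct SelectDict and the field name r_abs are hypothetical, the rel level is replaced by a plain walk over the eight words, and the final round uses a naive bit-clearing loop instead of the techniques discussed later.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch; index names loosely mirror the slide.
struct SelectDict {
  std::vector<std::uint64_t> words;  // bits, padded to whole 512-bit blocks
  std::vector<std::uint64_t> r_abs;  // r_abs[x] = rank(512 * x), plus sentinel
  std::vector<std::uint64_t> s_idx;  // s_idx[x] = select(512 * x), plus sentinel

  explicit SelectDict(std::vector<std::uint64_t> w) : words(std::move(w)) {
    words.resize((words.size() + 7) / 8 * 8);  // pad with zero words
    std::uint64_t total = 0;
    for (std::size_t k = 0; k < words.size(); ++k) {
      if (k % 8 == 0) r_abs.push_back(total);
      for (unsigned b = 0; b < 64; ++b) {
        if ((words[k] >> b) & 1) {
          if (total % 512 == 0) s_idx.push_back(64 * k + b);
          ++total;
        }
      }
    }
    r_abs.push_back(total);
    s_idx.push_back(64 * words.size());
  }

  // Position of the i-th 1 (0-based); requires i < total number of 1s.
  std::uint64_t select(std::uint64_t i) const {
    // 1. s_idx limits the candidate 512-bit blocks to [lo, hi).
    std::size_t lo = s_idx[i / 512] / 512;
    std::size_t hi = (s_idx[i / 512 + 1] + 511) / 512;
    // 2. Binary search for the block with r_abs[lo] <= i < r_abs[lo + 1].
    while (hi - lo > 1) {
      const std::size_t mid = (lo + hi) / 2;
      if (r_abs[mid] <= i) lo = mid; else hi = mid;
    }
    i -= r_abs[lo];
    // 3. Walk the block's 64-bit words (a rel table would avoid this).
    std::size_t k = 8 * lo;
    for (;; ++k) {
      const std::uint64_t c = __builtin_popcountll(words[k]);
      if (i < c) break;
      i -= c;
    }
    // 4. Final round: clear the lowest set bit i times, then locate the next.
    std::uint64_t x = words[k];
    for (; i > 0; --i) x &= x - 1;
    return 64 * k + __builtin_ctzll(x);
  }
};
```

Step 4 is the part the rest of the talk optimizes: it is the only step whose cost depends on data-dependent branching inside a single 64-bit word.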

Page 18: X86opti 05 s5yata

Select Final Round

Binary search & table lookup
Three-level branches

[Figure: the 64-bit block as eight 8-bit bytes; three levels of “if” (one, two, then four comparisons) pick one byte, which is then resolved by table lookup.]

Page 19: X86opti 05 s5yata

Improvements

How to remove the branches in the final round.

Page 20: X86opti 05 s5yata

Original

// x is the final 64-bit block (uint64_t).
x = x - ((x >> 1) & MASK_55);
x = (x & MASK_33) + ((x >> 2) & MASK_33);
x = (x + (x >> 4)) & MASK_0F;
x *= MASK_01;  // Tricky popcount
if (i < ((x >> 24) & 0xFF)) {  // The first-level branch
  if (i < ((x >> 8) & 0xFF)) {  // The second-level branch
    if (i < (x & 0xFF)) {  // The third-level branch
      // The first byte contains the i-th “1”.
    } else {
      // The second byte contains the i-th “1”.
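The slide's listing stops after the first two leaves. A complete version of the three-level binary search might look like the sketch below; the function names and the shape of the remaining branches are assumptions extrapolated from the visible part, and only the byte index is computed (the in-byte table lookup is left out).

```cpp
#include <cstdint>

const std::uint64_t MASK_55 = 0x5555555555555555ULL;
const std::uint64_t MASK_33 = 0x3333333333333333ULL;
const std::uint64_t MASK_0F = 0x0F0F0F0F0F0F0F0FULL;
const std::uint64_t MASK_01 = 0x0101010101010101ULL;

// Byte k of the result = number of 1s in bytes 0..k of x.
std::uint64_t byte_prefix_counts(std::uint64_t x) {
  x = x - ((x >> 1) & MASK_55);
  x = (x & MASK_33) + ((x >> 2) & MASK_33);
  x = (x + (x >> 4)) & MASK_0F;
  return x * MASK_01;
}

// Index (0..7) of the byte holding the i-th 1 (0-based) of `block`.
// Requires i < popcount(block). Three data-dependent branches per call.
unsigned select_byte_branchy(std::uint64_t block, unsigned i) {
  const std::uint64_t x = byte_prefix_counts(block);
  if (i < ((x >> 24) & 0xFF)) {      // first level: low or high 4 bytes
    if (i < ((x >> 8) & 0xFF)) {     // second level
      return (i < (x & 0xFF)) ? 0 : 1;
    }
    return (i < ((x >> 16) & 0xFF)) ? 2 : 3;
  }
  if (i < ((x >> 40) & 0xFF)) {      // second level
    return (i < ((x >> 32) & 0xFF)) ? 4 : 5;
  }
  return (i < ((x >> 48) & 0xFF)) ? 6 : 7;
}
```

Because the outcome of each comparison depends on the data, these branches are hard to predict, which is the motivation for the branch-free variants that follow.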

Page 21: X86opti 05 s5yata

Tips – Tricky PopCount

x = x - ((x >> 1) & MASK_55);
x = (x & MASK_33) + ((x >> 2) & MASK_33);
x = (x + (x >> 4)) & MASK_0F;

[Figure: worked example for one byte. Bits 0 1 1 1 0 0 1 0 give 2-bit counts 1 2 0 1, then 4-bit counts 3 1, then the byte count 4.]
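The three steps can be traced on the figure's byte (0x72, bits 0 1 1 1 0 0 1 0). The sketch below only exposes the intermediate values; the struct and function names are hypothetical.

```cpp
#include <cstdint>

// Intermediate values of the three popcount steps.
struct PopcountSteps {
  std::uint64_t pairs;    // a count (0..2) in every 2-bit field
  std::uint64_t nibbles;  // a count (0..4) in every 4-bit field
  std::uint64_t bytes;    // a count (0..8) in every byte
};

PopcountSteps popcount_steps(std::uint64_t x) {
  PopcountSteps s;
  s.pairs = x - ((x >> 1) & 0x5555555555555555ULL);
  s.nibbles = (s.pairs & 0x3333333333333333ULL) +
              ((s.pairs >> 2) & 0x3333333333333333ULL);
  s.bytes = (s.nibbles + (s.nibbles >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
  return s;
}
```

For 0x72 the stages are 0x61 (fields 1, 2, 0, 1), 0x31 (fields 3, 1) and 0x04, matching the figure.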

Page 22: X86opti 05 s5yata

Tips – Tricky PopCount

// MASK_01 = 0x0101010101010101ULL;
// x = x + (x << 8) + (x << 16) + (x << 24) + …;
x *= MASK_01;

[Figure: per-byte counts 4 1 3 5 2 6 3 4 (most significant byte first) become the running totals 28 24 23 20 15 13 7 4.]
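The multiplication acts as a sum of shifted copies, so byte k of the product holds the total of bytes 0..k. A small sketch using the figure's values (function names are hypothetical):

```cpp
#include <cstdint>

const std::uint64_t MASK_01 = 0x0101010101010101ULL;

// Byte k of the result = sum of bytes 0..k of `counts`
// (valid while every running sum stays below 256).
std::uint64_t byte_prefix_sums(std::uint64_t counts) {
  return counts * MASK_01;
}

// The same value written out as shifted additions.
std::uint64_t byte_prefix_sums_by_shifts(std::uint64_t counts) {
  std::uint64_t sum = 0;
  for (unsigned s = 0; s < 64; s += 8) sum += counts << s;
  return sum;
}
```

With per-byte popcounts no running sum can exceed 64, so no carry ever crosses a byte boundary.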

Page 23: X86opti 05 s5yata

+ SSE2 (After PopCount)

// y[0 … 7] = i + 1;
__m128i y = _mm_cvtsi64_si128((i + 1) * MASK_01);
__m128i z = _mm_cvtsi64_si128(x);

// Compare the 16 8-bit signed integers in y and z.
// y[k] = (y[k] > z[k]) ? 0xFF : 0x00;
y = _mm_cmpgt_epi8(y, z);  // PCMPGTB

// The j-th byte contains the i-th “1”.
// TABLE is a 128-byte pre-computed table.
uint8_t j = TABLE[_mm_movemask_epi8(y)];

Page 24: X86opti 05 s5yata

Tips – PCMPGTB

y = _mm_cvtsi64_si128((i + 1) * MASK_01);
z = _mm_cvtsi64_si128(x);
// y[k] = (y[k] > z[k]) ? 0xFF : 0x00;
y = _mm_cmpgt_epi8(y, z);

Example with i + 1 = 20 (most significant byte first):

z:      28   24   23   20   15   13   7    4
y:      20   20   20   20   20   20   20   20
y > z:  0x00 0x00 0x00 0x00 0xFF 0xFF 0xFF 0xFF

Page 25: X86opti 05 s5yata

+ Tricks (After Comparison)

uint64_t j = _mm_cvtsi128_si64(y);

// Calculation without TABLE
j = ((j & MASK_01) * MASK_01) >> 56;

// Calculation with BSR
j = (63 - __builtin_clzll(j + 1)) / 8;

// Calculation with popcnt (SSE4.2 or SSE4a)
j = __builtin_popcountll(j) / 8;
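The three alternatives should agree on any comparison result, which always has 0xFF in its lowest k bytes and 0x00 above (k is at most 7 when the i-th "1" exists). The sketch below checks this portably: compare_mask is a scalar stand-in for PCMPGTB on the low eight bytes, not the intrinsic itself, and all function names are hypothetical. The builtins are GCC/Clang specific.

```cpp
#include <cstdint>

const std::uint64_t MASK_01 = 0x0101010101010101ULL;

// Scalar stand-in for PCMPGTB on the low 8 bytes: byte k of the result
// is 0xFF if (i + 1) > byte k of x, else 0x00.
std::uint64_t compare_mask(std::uint64_t x, unsigned i) {
  std::uint64_t m = 0;
  for (unsigned k = 0; k < 8; ++k) {
    if (i + 1 > ((x >> (8 * k)) & 0xFF)) m |= std::uint64_t{0xFF} << (8 * k);
  }
  return m;
}

// Count the 0xFF bytes by summing one bit from each byte.
unsigned byte_index_mul(std::uint64_t j) {
  return static_cast<unsigned>(((j & MASK_01) * MASK_01) >> 56);
}

// j + 1 is a power of two, 2^(8k); its bit index divided by 8 is k.
unsigned byte_index_bsr(std::uint64_t j) {
  return (63 - __builtin_clzll(j + 1)) / 8;
}

// Each 0xFF byte contributes exactly 8 set bits.
unsigned byte_index_popcnt(std::uint64_t j) {
  return __builtin_popcountll(j) / 8;
}
```

For the running totals 28 24 23 20 15 13 7 4 and i = 19, the mask is 0x00000000FFFFFFFF and all three functions return 4.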

Page 26: X86opti 05 s5yata

– SSE2 (Simple and Fast)

// x is the final 64-bit block (uint64_t).
x = x - ((x >> 1) & MASK_55);
x = (x & MASK_33) + ((x >> 2) & MASK_33);
x = (x + (x >> 4)) & MASK_0F;
x *= MASK_01;  // Tricky popcount

uint64_t y = (i + 1) * MASK_01;
uint64_t z = x | MASK_80;
// Compare the 8 7-bit unsigned integers in y and z.
z = (z - y) & MASK_80;
uint8_t j = __builtin_ctzll(z) / 8;

Page 27: X86opti 05 s5yata

Tips – Comparison

uint64_t y = (i + 1) * MASK_01;
uint64_t z = x | MASK_80;
// Compare the 8 7-bit unsigned integers in y and z.
z = (z - y) & MASK_80;

Example with i + 1 = 20 = 0x14 (most significant byte first):

y:                    0x14 0x14 0x14 0x14 0x14 0x14 0x14 0x14
z = x | MASK_80:      0x9C 0x98 0x97 0x94 0x8F 0x8D 0x87 0x84
(z - y) & MASK_80:    0x80 0x80 0x80 0x80 0x00 0x00 0x00 0x00
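Putting the pieces together, a complete branch-free byte search followed by an in-byte step could look like this sketch. The byte search follows the slide; the in-byte step uses a short bit-clearing loop as a stand-in for the table lookup, and the function name select_in_word is hypothetical. MASK_80 is assumed to be 0x8080808080808080.

```cpp
#include <cstdint>

const std::uint64_t MASK_55 = 0x5555555555555555ULL;
const std::uint64_t MASK_33 = 0x3333333333333333ULL;
const std::uint64_t MASK_0F = 0x0F0F0F0F0F0F0F0FULL;
const std::uint64_t MASK_01 = 0x0101010101010101ULL;
const std::uint64_t MASK_80 = 0x8080808080808080ULL;

// Bit position (0..63) of the i-th 1 (0-based) in `block`.
// Requires i < popcount(block).
unsigned select_in_word(std::uint64_t block, unsigned i) {
  // Byte k of x = number of 1s in bytes 0..k of block.
  std::uint64_t x = block;
  x = x - ((x >> 1) & MASK_55);
  x = (x & MASK_33) + ((x >> 2) & MASK_33);
  x = (x + (x >> 4)) & MASK_0F;
  x *= MASK_01;

  // Bit 7 of byte k survives iff the running total of byte k is >= i + 1.
  // Setting bit 7 first keeps every byte >= 0x80 > i + 1, so the
  // subtraction never borrows across bytes.
  const std::uint64_t y = (i + 1) * MASK_01;
  std::uint64_t z = x | MASK_80;
  z = (z - y) & MASK_80;
  const unsigned j = __builtin_ctzll(z) / 8;  // first byte that reaches i + 1

  // Number of 1s before byte j (0 when j == 0).
  const unsigned before = static_cast<unsigned>(((x << 8) >> (8 * j)) & 0xFF);

  // In-byte step: a loop here; the slides use a table lookup instead.
  unsigned byte = static_cast<unsigned>((block >> (8 * j)) & 0xFF);
  for (unsigned r = i - before; r > 0; --r) byte &= byte - 1;
  return 8 * j + __builtin_ctz(byte);
}
```

The byte search itself contains no data-dependent branch; only the stand-in loop at the end does, which is why a lookup table is preferable there.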

Page 28: X86opti 05 s5yata

+ SSSE3 (For PopCount)

// Get lower nibbles and upper nibbles of x.
__m128i lower = _mm_cvtsi64_si128(x & MASK_0F);
__m128i upper = _mm_cvtsi64_si128(x & MASK_F0);
upper = _mm_srli_epi32(upper, 4);
// Use PSHUFB for counting “1”s in each nibble.
__m128i table = _mm_set_epi8(4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, 2, 1, 1, 0);
lower = _mm_shuffle_epi8(table, lower);
upper = _mm_shuffle_epi8(table, upper);
// Merge the counts to get the number of “1”s in each byte.
x = _mm_cvtsi128_si64(_mm_add_epi8(lower, upper));
x *= MASK_01;

Page 29: X86opti 05 s5yata

Tips – PSHUFB

lower = _mm_cvtsi64_si128(x & MASK_0F);
table = _mm_set_epi8(4, 3, 3, 2, 3, 2, 2, 1, 3, 2, 2, 1, …);
// Perform a parallel 16-way lookup.
lower = _mm_shuffle_epi8(table, lower);

Example (most significant byte first):

lower (nibble values):       12 8 7 4 15 13 7 4
table (index 15 down to 0):  4 3 3 2 3 2 2 1 3 2 2 1 2 1 1 0
result (counts):             2 1 3 1 4 3 3 1
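The nibble-table idea can be checked without SSSE3 by doing the same lookups one byte at a time. This is a scalar stand-in for the two PSHUFB lookups and the byte-wise add, not the intrinsic code; the names are hypothetical.

```cpp
#include <cstdint>

// kNibbleCount[n] = number of 1s in the 4-bit value n. These are the
// PSHUFB table's values, listed here from index 0 upward.
const std::uint8_t kNibbleCount[16] = {0, 1, 1, 2, 1, 2, 2, 3,
                                       1, 2, 2, 3, 2, 3, 3, 4};

// Byte k of the result = number of 1s in byte k of x.
std::uint64_t byte_counts_by_nibble_table(std::uint64_t x) {
  std::uint64_t counts = 0;
  for (unsigned k = 0; k < 8; ++k) {
    const unsigned byte = static_cast<unsigned>((x >> (8 * k)) & 0xFF);
    const std::uint64_t c = kNibbleCount[byte & 0x0F] + kNibbleCount[byte >> 4];
    counts |= c << (8 * k);
  }
  return counts;
}
```

PSHUFB performs all sixteen lookups of one register in a single instruction, whereas this loop does them sequentially; the resulting per-byte counts are the same as those from the three-step bit trick.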

Page 30: X86opti 05 s5yata

Evaluation

How effective the improvements are.

Page 31: X86opti 05 s5yata

Environment

OS
Mac OS X 10.8.3 (64-bit)

CPU
Core i7 3720QM – Ivy Bridge
2.6GHz – up to 3.6GHz

Compiler
Apple LLVM version 4.2 (clang-425.0.24) (based on LLVM 3.2svn)

Page 32: X86opti 05 s5yata

Data

Source
Japanese Wikipedia page titles
gzip -cd jawiki-20130328-all-titles-in-ns0.gz | LC_ALL=C sort -R > data

Details
Number of keys: 1,367,750
Average length: 21.14 bytes
Total length: 28,919,893 bytes

Page 33: X86opti 05 s5yata

Binaries

marisa 0.2.1
./configure CXX=clang++ --enable-popcnt
make
tools/marisa-benchmark < data

marisa 0.2.2
./configure CXX=clang++ --enable-sse4
make
tools/marisa-benchmark < data

Page 34: X86opti 05 s5yata

Results – marisa 0.2.1

Without improvements
Baseline

#Tries  Size[KB]  Build[Kqps]  Lookup[Kqps]  Reverse[Kqps]  Prefix[Kqps]  Predict[Kqps]
1       11,811    724          1,105         1,223          1,038         711
2       8,639     632          790           877            753           453
3       8,001     621          750           816            708           406
4       7,788     591          723           791            687           391
5       7,701     590          712           781            680           384

Page 35: X86opti 05 s5yata

Results – marisa 0.2.2

With improvements
Same size
Faster operations

#Tries  Size[KB]  Build[Kqps]  Lookup[Kqps]  Reverse[Kqps]  Prefix[Kqps]  Predict[Kqps]
1       11,811    757          1,198         1,359          1,115         772
2       8,639     657          873           1,000          820           503
3       8,001     621          817           924            770           453
4       7,788     613          797           900            752           438
5       7,701     610          787           884            737           427

Page 36: X86opti 05 s5yata

Results – Improvements

Improvement ratios
Same size
Faster operations

#Tries  Size[%]  Build[%]  Lookup[%]  Reverse[%]  Prefix[%]  Predict[%]
1       0.00     +4.56     +8.42      +11.12      +7.42      +8.58
2       0.00     +3.96     +10.52     +14.03      +8.90      +11.04
3       0.00     0.00      +8.93      +13.24      +8.76      +11.58
4       0.00     +3.72     +10.24     +13.78      +9.46      +12.02
5       0.00     +3.39     +10.53     +13.19      +8.38      +11.20
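The ratios appear to be the relative change in throughput between the two versions. The slides do not state the formula, so the sketch below is an assumption; it reproduces the first row from the rounded Kqps figures (for example 724 to 757 Kqps for Build gives about +4.56%). The function name is hypothetical.

```cpp
// Relative change in percent, assuming (after - before) / before.
double improvement_percent(double before_kqps, double after_kqps) {
  return (after_kqps - before_kqps) / before_kqps * 100.0;
}
```

Other rows may differ in the last digit when recomputed this way, since the published Kqps values are themselves rounded.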

Page 37: X86opti 05 s5yata

Conclusion

“Any sufficiently advanced technology is indistinguishable from magic.”

“Any sufficiently advanced technique is indistinguishable from magic.”

“You are a magician.”