optimising column stores with statistical analysis

How do Column Stores Work?

Turning Rows into Columns

Sales

Product  Customer   Date        Sale
Beer     Thomas     2011-11-25  2 GBP
Beer     Thomas     2011-11-25  2 GBP
Vodka    Thomas     2011-11-25  10 GBP
Whiskey  Christian  2011-11-25  5 GBP
Whiskey  Christian  2011-11-25  5 GBP
Vodka    Alexei     2011-11-25  10 GBP
Vodka    Alexei     2011-11-25  10 GBP

Product

ID  Value
1   Beer
2   Beer
3   Vodka
4   Whiskey
5   Whiskey
6   Vodka
7   Vodka

Customer

ID  Customer
1   Thomas
2   Thomas
3   Thomas
4   Christian
5   Christian
6   Alexei
7   Alexei

And so on… until…

And we get…

Product

ID  Value
1   Beer
2   Beer
3   Vodka
4   Whiskey
5   Whiskey
6   Vodka
7   Vodka

Customer

ID  Customer
1   Thomas
2   Thomas
3   Thomas
4   Christian
5   Christian
6   Alexei
7   Alexei

Date

ID  Date
1   2011-11-25
2   2011-11-25
3   2011-11-25
4   2011-11-25
5   2011-11-25
6   2011-11-25
7   2011-11-25

Sale

ID  Sale
1   2 GBP
2   2 GBP
3   10 GBP
4   5 GBP
5   5 GBP
6   10 GBP
7   10 GBP
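The row-to-column split shown in these tables can be sketched in a few lines. This is only an illustration: the helper name `to_columns` and the in-memory list layout are assumptions, not how any particular column store does it.

```python
# Hypothetical helper: split a row-oriented table into one list per column.
# The data is the Sales table from the slides.
COLUMN_NAMES = ("Product", "Customer", "Date", "Sale")

rows = [
    ("Beer", "Thomas", "2011-11-25", "2 GBP"),
    ("Beer", "Thomas", "2011-11-25", "2 GBP"),
    ("Vodka", "Thomas", "2011-11-25", "10 GBP"),
    ("Whiskey", "Christian", "2011-11-25", "5 GBP"),
    ("Whiskey", "Christian", "2011-11-25", "5 GBP"),
    ("Vodka", "Alexei", "2011-11-25", "10 GBP"),
    ("Vodka", "Alexei", "2011-11-25", "10 GBP"),
]


def to_columns(rows, names):
    """Return {column name: list of values}; list index i holds row ID i + 1."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


columns = to_columns(rows, COLUMN_NAMES)
print(columns["Product"])
```

Each list keeps the original row order, so position plays the role of the ID column in the slides.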

And what now?

Product

ID  Value
1   Beer
2   Beer
3   Vodka
4   Whiskey
5   Whiskey
6   Vodka
7   Vodka

Run-length encode

Product’

ID   Value
1-2  Beer
3    Vodka
4-5  Whiskey
6-7  Vodka
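The step from Product to Product’ can be sketched as follows. The function name `run_length_encode` and the `(first_id, last_id, value)` tuple layout are assumptions chosen to mirror the ID ranges on the slide.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (first_id, last_id, value) runs.

    IDs are 1-based to match the slides.
    """
    runs = []
    next_id = 1
    for value, group in groupby(values):
        length = sum(1 for _ in group)
        runs.append((next_id, next_id + length - 1, value))
        next_id += length
    return runs


product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
product_rle = run_length_encode(product)
for first_id, last_id, value in product_rle:
    print(first_id, last_id, value)
```

Seven values collapse to four runs here. Vodka appears in two separate runs because its rows are not adjacent.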

Applying Compression

Product’

ID   Value
1-2  Beer
3    Vodka
4-5  Whiskey
6-7  Vodka

Customer’

ID   Customer
1-3  Thomas
4-5  Christian
6-7  Alexei

Date’

ID   Date
1-7  2011-11-25

Sale’

ID   Sale
1-2  2 GBP
3    10 GBP
4-5  5 GBP
6-7  10 GBP

Insights

• With dictionary encoding, every value can be assumed to fit in a machine word (64 bits)

• Compressed size is proportional to the total number of run lengths (RL) across all columns

• The number of RL depends on the ordering of rows

Product’

ID   Value
1-2  Beer
3    Vodka
4-5  Whiskey
6-7  Vodka

(each row of Product’ is one RL)
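The dictionary insight can be illustrated with a minimal sketch. The function name `dictionary_encode` and the choice of assigning codes in order of first appearance are assumptions made for the example.

```python
def dictionary_encode(values):
    """Replace each value by a small integer code.

    Returns (dictionary, codes), where dictionary maps value -> code and
    codes are assigned in order of first appearance (an arbitrary choice).
    """
    dictionary = {}
    codes = []
    for value in values:
        # setdefault stores the next free code the first time a value is seen.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes


product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
dictionary, codes = dictionary_encode(product)
print(dictionary)
print(codes)
```

As long as a column has fewer than 2**64 distinct values, every code fits a 64-bit word, whatever the length of the original strings.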

Ordering Example

Product  Customer
Beer     Thomas
Beer     Thomas
Vodka    Thomas
Whiskey  Christian
Whiskey  Christian
Vodka    Alexei
Vodka    Alexei

Product  Customer
Beer     Thomas
Whiskey  Christian
Vodka    Thomas
Whiskey  Christian
Beer     Thomas
Vodka    Alexei
Vodka    Alexei

Product  Customer
Beer     Thomas
Whiskey  Christian
Vodka    Thomas
Whiskey  Christian
Beer     Thomas
Vodka    Alexei

VS.

Product  Customer
Beer     Thomas
Vodka
Whiskey  Christian
Vodka    Alexei
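To put numbers on the two orderings, the sketch below counts the runs in every column. The helper names `count_runs` and `total_runs` are assumptions, and the row lists are the two seven-row orderings from the example.

```python
from itertools import groupby


def count_runs(values):
    """Number of maximal runs of consecutive equal values."""
    return sum(1 for _ in groupby(values))


def total_runs(rows):
    """Sum of run counts over all columns of a row-oriented table."""
    return sum(count_runs(column) for column in zip(*rows))


grouped_rows = [
    ("Beer", "Thomas"), ("Beer", "Thomas"), ("Vodka", "Thomas"),
    ("Whiskey", "Christian"), ("Whiskey", "Christian"),
    ("Vodka", "Alexei"), ("Vodka", "Alexei"),
]
shuffled_rows = [
    ("Beer", "Thomas"), ("Whiskey", "Christian"), ("Vodka", "Thomas"),
    ("Whiskey", "Christian"), ("Beer", "Thomas"),
    ("Vodka", "Alexei"), ("Vodka", "Alexei"),
]

print(total_runs(grouped_rows))   # 4 Product runs + 3 Customer runs = 7
print(total_runs(shuffled_rows))  # 6 Product runs + 6 Customer runs = 12
```

The same seven rows need 7 runs in one order and 12 in the other, which is the whole reason row ordering matters for compression.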

There is some overhead…

                   Cluster on ID  Heap
Data Size          327MB          327MB
Column Index Size  59MB           142MB

Manipulating the Rules

Rule of Thumb?

“Sort by lowest cardinality column first”

Rationale: low cardinality columns have potential for long RL

(C1, C2): 68MB
(C2, C1): 61MB

Lowest first is worse!

OK, so what about highest first?

Loose correlation

(C1, C2): 64MB
(C2, C1): 68MB

Highest first is worse!
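Since neither rule of thumb holds up, one option is simply to measure each column order. The sketch below sorts the rows lexicographically under a given column order and counts the resulting runs. The toy data and the helper names `total_runs` and `runs_after_sort` are hypothetical and are not the dataset behind the sizes above.

```python
from itertools import groupby


def total_runs(rows):
    """Sum of run counts over all columns of a row-oriented table."""
    return sum(sum(1 for _ in groupby(column)) for column in zip(*rows))


def runs_after_sort(rows, column_order):
    """Sort rows lexicographically by the given column order, then count runs."""
    ordered = sorted(rows, key=lambda row: tuple(row[i] for i in column_order))
    return total_runs(ordered)


# Hypothetical toy table: C1 has 2 distinct values, C2 has 3, all combinations.
toy_rows = [(c1, c2) for c1 in "ab" for c2 in "xyz"]

runs_c1_first = runs_after_sort(toy_rows, (0, 1))
runs_c2_first = runs_after_sort(toy_rows, (1, 0))
print(runs_c1_first, runs_c2_first)
```

On this uniform, uncorrelated toy table the lowest cardinality column does win (8 runs against 9). The measured sizes in the slides show that skew and correlation in real data can reverse that, so the run count has to be checked rather than assumed.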

What are we looking for?

1) Values that are skewed or have low cardinality

2) Columns that correlate/cluster with other columns

Just Read the Magic Code?

• Values with low cardinality are easy (COUNT DISTINCT)

• Is there a more general way to classify the notion of “predictable content of a column”?

• Yes, Entropy:
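For reference, the standard Shannon entropy of a column X, where p(x) is the fraction of rows holding the value x, is given below. The base-2 logarithm (entropy in bits) is an assumption, chosen because it is consistent with the figure of about 20 quoted for a column of 1,000,000 distinct values.

```latex
H(X) = -\sum_{x \in X} p(x) \log_2 p(x)
```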

Coming to Terms with Entropy

• Intuition: a single number expressing the amount of “surprise” at seeing a value in a column

• Consider an example:

                  SKEW   SPLAT  ID
COUNT DISTINCT    10001  10001  1000000
DISTINCT / COUNT  0.01   0.01   1

[Figure: histogram of each column]

Calculate and Evaluate

Column  Entropy
SKEW    ≈ 0.21
SPLAT   ≈ 13
ID      ≈ 20
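A minimal sketch of how such entropy figures can be computed, assuming Shannon entropy in bits. The function name `column_entropy` and the three small sample columns are hypothetical stand-ins. They are not the SKEW, SPLAT and ID columns from the slides, whose data is not available here.

```python
from collections import Counter
from math import log2


def column_entropy(values):
    """Shannon entropy of a column in bits: sum of p * log2(1 / p)."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * log2(total / c) for c in counts.values())


# Hypothetical columns of 1024 rows each.
id_col = list(range(1024))                  # every value distinct
splat_col = [i % 16 for i in range(1024)]   # 16 values, evenly spread
skew_col = [0] * 1000 + list(range(1, 25))  # one value dominates

for name, col in (("skew", skew_col), ("splat", splat_col), ("id", id_col)):
    print(name, round(column_entropy(col), 2))
```

For an all-distinct column the entropy is log2 of the row count, so 1,000,000 unique IDs give about 19.9 bits, in line with the ≈ 20 above. A dominant value pulls the entropy towards zero even when the distinct count is high, which is what COUNT DISTINCT alone cannot show.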

New theory: Lower Entropy First

You will NEVER win

Take best of these

Columns that “cluster” with other columns?

• Is there a way to calculate this?

• Yes indeed, information theory to the rescue again

• Mutual information:

• “The information left in Y, given that I know X”
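For reference, the standard identities relating these quantities are given below. H(Y | X) is the information left in Y once X is known, and I(X;Y) is the part of H(Y) that knowing X removes.

```latex
I(X;Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y \mid X)
```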

Mutual WHAT?

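[Figure: diagram of X and Y labelled H(X | Y), H(Y | X) and I(X;Y)]

From I(X;Y) we can find the distance

[Figure: columns C1, C2 and C3 in the information plane, joined by d(c1,c2) and d(c2,c3)]

“Find the minimal distance that visits all columns in the information plane”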
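A sketch of how a distance between two columns might be computed from entropies. The exact distance formula used in the talk is not recoverable from the slides, so this example assumes the variation of information, d(X,Y) = H(X | Y) + H(Y | X) = H(X,Y) - I(X;Y), which is one standard metric built from I(X;Y). All function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy in bits of a sequence of hashable values."""
    counts = Counter(values)
    total = len(values)
    return sum((c / total) * log2(total / c) for c in counts.values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired column values."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def information_distance(xs, ys):
    """Assumed distance: variation of information, H(X,Y) - I(X;Y)."""
    return entropy(list(zip(xs, ys))) - mutual_information(xs, ys)


# Columns from the Sales example.
product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
customer = ["Thomas"] * 3 + ["Christian"] * 2 + ["Alexei"] * 2
sale = ["2 GBP", "2 GBP", "10 GBP", "5 GBP", "5 GBP", "10 GBP", "10 GBP"]

d_product_sale = information_distance(product, sale)
d_product_customer = information_distance(product, customer)
print(d_product_sale, d_product_customer)
```

Product and Sale determine each other in this table, so their distance is zero, while Product and Customer only partly predict each other and end up further apart.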

So, how does THAT work?

Better… not impressive… but more consistent

Take best of these

Medicine is compared against placebo

What else does d(X,Y) tell us?

• Consider this fact table:

d(A, B) is zero!

What is our expected estimate of rows?

Dodge this!

Fix:

Why is this so Hard?

Reflecting on Information Distance

“Find the shortest path that visits all cities on a map”

Picture Credits: RUC.dk

How many routes are there?

n! = n * (n-1) * (n-2) * … * 1

Travelling Salesman Problem (TSP)
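To make the n! growth concrete, the sketch below compares an exhaustive search over all column orders with a greedy nearest-neighbour walk. The 4 x 4 distance matrix is invented for illustration, and the function names are hypothetical. The greedy walk is a generic TSP heuristic, not necessarily the one used in the talk.

```python
from itertools import permutations
from math import factorial


def path_length(path, dist):
    """Total distance along consecutive stops of an open path."""
    return sum(dist[a][b] for a, b in zip(path, path[1:]))


def brute_force_path(dist):
    """Try all n! orders; only feasible for a handful of columns."""
    return min(permutations(range(len(dist))), key=lambda p: path_length(p, dist))


def nearest_neighbour_path(dist, start=0):
    """Greedy heuristic: always move to the closest unvisited column."""
    path = [start]
    unvisited = set(range(len(dist))) - {start}
    while unvisited:
        here = path[-1]
        nxt = min(unvisited, key=lambda c: dist[here][c])
        path.append(nxt)
        unvisited.remove(nxt)
    return path


# Hypothetical symmetric distances between four columns.
dist = [
    [0, 1, 4, 6],
    [1, 0, 2, 5],
    [4, 2, 0, 3],
    [6, 5, 3, 0],
]

best = brute_force_path(dist)
greedy = nearest_neighbour_path(dist)
print(factorial(len(dist)), path_length(best, dist), path_length(greedy, dist))
```

With four columns there are only 24 orders to try, but 20 columns already give roughly 2.4 x 10^18. The greedy walk does about n^2 distance lookups instead, at the price of sometimes settling for a longer path.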

There are MORE than n! routes

• What if lexicographical ordering of the columns isn’t best?

  • Daniel Lemire et al: “Reordering Rows for Better Compression: Beyond the Lexicographic Order” ( http://arxiv.org/pdf/1207.2189.pdf )

• Some may be ruled out immediately (e.g. don’t go to Skagen from Copenhagen and then to Roskilde)

• Local optima are an issue



Heuristics are your Best Bet

• “Find Minimum RLE” can be shown to be NP-complete

  • There is no known fast algorithm that finds the optimal ordering

• I have shown you one heuristic

  • Moderate gain for a small effort

  • Shown that 2x gains are possible

  • Any ordering is (typically) better than random, often by a lot

• I wrote a tool to help analyse: TableStat.exe

  • Interested? Come up to talk after

  • I need more real life datasets to test on

P = NP? We just don’t know
