optimising column stores with statistical analysis
DESCRIPTION
A presentation about column stores, how they work and how you can optimise compression with them
TRANSCRIPT
![Page 1: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/1.jpg)
How do Column Stores Work?
![Page 2: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/2.jpg)
Turning Rows into Columns

Sales

| Product | Customer | Date | Sale |
| --- | --- | --- | --- |
| Beer | Thomas | 2011-11-25 | 2 GBP |
| Beer | Thomas | 2011-11-25 | 2 GBP |
| Vodka | Thomas | 2011-11-25 | 10 GBP |
| Whiskey | Christian | 2011-11-25 | 5 GBP |
| Whiskey | Christian | 2011-11-25 | 5 GBP |
| Vodka | Alexei | 2011-11-25 | 10 GBP |
| Vodka | Alexei | 2011-11-25 | 10 GBP |

Product

| ID | Value |
| --- | --- |
| 1 | Beer |
| 2 | Beer |
| 3 | Vodka |
| 4 | Whiskey |
| 5 | Whiskey |
| 6 | Vodka |
| 7 | Vodka |

Customer

| ID | Customer |
| --- | --- |
| 1 | Thomas |
| 2 | Thomas |
| 3 | Thomas |
| 4 | Christian |
| 5 | Christian |
| 6 | Alexei |
| 7 | Alexei |

And so on… until…
![Page 3: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/3.jpg)
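And we get…

Product

| ID | Value |
| --- | --- |
| 1 | Beer |
| 2 | Beer |
| 3 | Vodka |
| 4 | Whiskey |
| 5 | Whiskey |
| 6 | Vodka |
| 7 | Vodka |

Customer

| ID | Customer |
| --- | --- |
| 1 | Thomas |
| 2 | Thomas |
| 3 | Thomas |
| 4 | Christian |
| 5 | Christian |
| 6 | Alexei |
| 7 | Alexei |

Date

| ID | Date |
| --- | --- |
| 1 | 2011-11-25 |
| 2 | 2011-11-25 |
| 3 | 2011-11-25 |
| 4 | 2011-11-25 |
| 5 | 2011-11-25 |
| 6 | 2011-11-25 |
| 7 | 2011-11-25 |

Sale

| ID | Sale |
| --- | --- |
| 1 | 2 GBP |
| 2 | 2 GBP |
| 3 | 10 GBP |
| 4 | 5 GBP |
| 5 | 5 GBP |
| 6 | 10 GBP |
| 7 | 10 GBP |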
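The decomposition above can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular engine stores its data; the names `ROWS`, `COLUMN_NAMES` and `rows_to_columns` are made up for the example, and the row ID is assumed to be the position in each list plus one.

```python
from typing import Dict, List, Sequence, Tuple

# The seven sales rows from the slide: (Product, Customer, Date, Sale).
ROWS: List[Tuple[str, str, str, str]] = [
    ("Beer", "Thomas", "2011-11-25", "2 GBP"),
    ("Beer", "Thomas", "2011-11-25", "2 GBP"),
    ("Vodka", "Thomas", "2011-11-25", "10 GBP"),
    ("Whiskey", "Christian", "2011-11-25", "5 GBP"),
    ("Whiskey", "Christian", "2011-11-25", "5 GBP"),
    ("Vodka", "Alexei", "2011-11-25", "10 GBP"),
    ("Vodka", "Alexei", "2011-11-25", "10 GBP"),
]
COLUMN_NAMES = ("Product", "Customer", "Date", "Sale")


def rows_to_columns(rows: Sequence[Sequence[str]],
                    names: Sequence[str]) -> Dict[str, List[str]]:
    """Split row tuples into one list per column (ID = list position + 1)."""
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


columns = rows_to_columns(ROWS, COLUMN_NAMES)
for row_id, value in enumerate(columns["Product"], start=1):
    print(row_id, value)
```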
![Page 4: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/4.jpg)
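And what now?

Product

| ID | Value |
| --- | --- |
| 1 | Beer |
| 2 | Beer |
| 3 | Vodka |
| 4 | Whiskey |
| 5 | Whiskey |
| 6 | Vodka |
| 7 | Vodka |

Run length encode

Product’

| ID | Value |
| --- | --- |
| 1-2 | Beer |
| 3 | Vodka |
| 4-5 | Whiskey |
| 6-7 | Vodka |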
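A minimal sketch of the run length encoding step, assuming 1-based inclusive row IDs as in the Product’ table. The function name `run_length_encode` is hypothetical, and a real column store will use a more compact physical layout.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def run_length_encode(values: Sequence[str]) -> List[Tuple[int, int, str]]:
    """Collapse consecutive equal values into (first_id, last_id, value) runs."""
    runs = []
    next_id = 1
    for value, group in groupby(values):
        length = sum(1 for _ in group)
        runs.append((next_id, next_id + length - 1, value))
        next_id += length
    return runs


product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
for first, last, value in run_length_encode(product):
    id_range = str(first) if first == last else f"{first}-{last}"
    print(id_range, value)
```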
![Page 5: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/5.jpg)
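Applying Compression

Product’

| ID | Value |
| --- | --- |
| 1-2 | Beer |
| 3 | Vodka |
| 4-5 | Whiskey |
| 6-7 | Vodka |

Customer’

| ID | Customer |
| --- | --- |
| 1-3 | Thomas |
| 4-5 | Christian |
| 6-7 | Alexei |

Date’

| ID | Date |
| --- | --- |
| 1-7 | 2011-11-25 |

Sale’

| ID | Sale |
| --- | --- |
| 1-2 | 2 GBP |
| 3 | 10 GBP |
| 4-5 | 5 GBP |
| 6-7 | 10 GBP |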
![Page 6: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/6.jpg)
Insights
• With a dictionary, every value can be assumed to fit in a machine word (64 bits)
• Compressed size is proportional to the total number of run lengths (RL) across all columns
• The number of RLs depends on the ordering of the rows

Product’

| ID | Value |
| --- | --- |
| 1-2 | Beer |
| 3 | Vodka |
| 4-5 | Whiskey |
| 6-7 | Vodka |

One RL = one row of the encoded column
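The dictionary insight can be illustrated with a small sketch: each distinct value is replaced by an integer code, so every stored entry has the same fixed width regardless of how long the original string was. The function name `dictionary_encode` and the choice to assign codes in order of first appearance are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple


def dictionary_encode(values: Sequence[str]) -> Tuple[Dict[str, int], List[int]]:
    """Replace each value with a small integer code (assigned on first appearance)."""
    dictionary: Dict[str, int] = {}
    codes: List[int] = []
    for value in values:
        # setdefault only inserts when the value has not been seen before.
        codes.append(dictionary.setdefault(value, len(dictionary)))
    return dictionary, codes


product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
dictionary, codes = dictionary_encode(product)
print(dictionary)
print(codes)
```

Run length encoding is then applied to the integer codes rather than to the strings.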
![Page 7: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/7.jpg)
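Ordering Example

Rows in the original order

| Product | Customer |
| --- | --- |
| Beer | Thomas |
| Beer | Thomas |
| Vodka | Thomas |
| Whiskey | Christian |
| Whiskey | Christian |
| Vodka | Alexei |
| Vodka | Alexei |

After run length encoding: 4 RL in Product (Beer, Vodka, Whiskey, Vodka) and 3 RL in Customer (Thomas, Christian, Alexei), 7 in total

VS.

The same rows in a different order

| Product | Customer |
| --- | --- |
| Beer | Thomas |
| Whiskey | Christian |
| Vodka | Thomas |
| Whiskey | Christian |
| Beer | Thomas |
| Vodka | Alexei |
| Vodka | Alexei |

After run length encoding: 6 RL in Product and 6 RL in Customer, 12 in total

| Product | Customer |
| --- | --- |
| Beer | Thomas |
| Whiskey | Christian |
| Vodka | Thomas |
| Whiskey | Christian |
| Beer | Thomas |
| Vodka | Alexei |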
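The effect of row ordering can be checked by counting runs directly. The sketch below uses the two orderings from this slide; `count_runs` and `total_runs` are hypothetical helper names, and the total run count is used as a stand-in for compressed size.

```python
from itertools import groupby
from typing import Sequence, Tuple


def count_runs(values: Sequence[str]) -> int:
    """Number of runs of consecutive equal values in one column."""
    return sum(1 for _ in groupby(values))


def total_runs(rows: Sequence[Tuple[str, ...]]) -> int:
    """Sum of run counts over every column of a row-ordered table."""
    return sum(count_runs(column) for column in zip(*rows))


original_order = [
    ("Beer", "Thomas"), ("Beer", "Thomas"), ("Vodka", "Thomas"),
    ("Whiskey", "Christian"), ("Whiskey", "Christian"),
    ("Vodka", "Alexei"), ("Vodka", "Alexei"),
]
different_order = [
    ("Beer", "Thomas"), ("Whiskey", "Christian"), ("Vodka", "Thomas"),
    ("Whiskey", "Christian"), ("Beer", "Thomas"),
    ("Vodka", "Alexei"), ("Vodka", "Alexei"),
]

print(total_runs(original_order))   # 4 Product runs + 3 Customer runs
print(total_runs(different_order))  # 6 Product runs + 6 Customer runs
```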
![Page 8: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/8.jpg)
There is some overhead…

| | Cluster on ID | Heap |
| --- | --- | --- |
| Data Size | 327MB | 327MB |
| Column Index Size | 59MB | 142MB |
![Page 9: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/9.jpg)
Manipulating the Rules
![Page 10: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/10.jpg)
Rule of Thumb?
“Sort by lowest cardinality column first”
Rationale: Low cardinality columns have potential for long RL
(C1, C2): 68MB
(C2, C1): 61MB
Lowest first is worse!
x N
![Page 11: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/11.jpg)
OK, so what about highest first?
Loose correlation
(C1, C2): 64MB
(C2, C1): 68MB
Highest first is worse!
![Page 12: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/12.jpg)
What are we looking for?
1) Values that are skewed or have low cardinality
2) Columns that correlate/cluster with other columns
![Page 13: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/13.jpg)
Just Read the Magic Code?
• Values with low cardinality are easy (COUNT DISTINCT)
• Is there a more general way to classify the notion of “predictable content of a column”?
• Yes, Entropy:
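The formula on the slide was an image and did not survive extraction. Assuming it was the standard Shannon entropy in bits (which is consistent with the figures quoted later in the deck, since log2(1000000) ≈ 20), it reads:

```latex
H(X) = -\sum_{x} p(x) \log_2 p(x)
```

Here p(x) is the fraction of rows in column X that hold the value x.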
![Page 14: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/14.jpg)
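Coming to Terms with Entropy
• Intuition: A single number expressing the amount of “surprise” at seeing a value in a column
• Consider an example:

| | SKEW | SPLAT | ID |
| --- | --- | --- | --- |
| Histogram | (image not recovered) | (image not recovered) | (image not recovered) |
| COUNT DISTINCT | 10001 | 10001 | 1000000 |
| DISTINCT / COUNT | 0.01 | 0.01 | 1 |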
![Page 15: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/15.jpg)
Calculate and Evaluate

| Column | Entropy (bits) |
| --- | --- |
| SKEW | ≈ 0.21 |
| SPLAT | ≈ 13 |
| ID | ≈ 20 |
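These figures can be reproduced approximately with a short sketch that computes Shannon entropy from value counts. The histograms themselves were not recovered, so the three distributions below are assumptions: each has 1,000,000 rows and the distinct counts from the previous slide, with ID all unique, SPLAT spread evenly, and SKEW dominated by a single value. The name `entropy_bits` is hypothetical.

```python
import math
from typing import Iterable


def entropy_bits(counts: Iterable[int]) -> float:
    """Shannon entropy in bits, given how many rows hold each distinct value."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts)


# ASSUMED distributions, each over 1,000,000 rows.
id_counts = [1] * 1_000_000                  # every value unique
splat_counts = [100] * 9_901 + [99] * 100    # 10001 values, evenly spread
skew_counts = [990_000] + [1] * 10_000       # 10001 values, one dominates

for name, counts in (("SKEW", skew_counts),
                     ("SPLAT", splat_counts),
                     ("ID", id_counts)):
    print(name, round(entropy_bits(counts), 2))
```

SKEW and SPLAT have the same COUNT DISTINCT, yet very different entropy, which is why entropy is the more general measure of how predictable a column is.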
![Page 16: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/16.jpg)
New theory: Lower Entropy First
You will NEVER win
Take best of these
![Page 17: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/17.jpg)
Columns that “cluster” with other columns?
• Is there a way to calculate this?
• Yes indeed, information theory to the rescue again
• Mutual information:
• “The information I gain about Y, given that I know X”
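The slide's formula was an image and was not recovered. In standard information theory notation, which I assume matches what was shown, mutual information relates to the entropies as follows:

```latex
I(X;Y) = H(X) + H(Y) - H(X,Y) = H(Y) - H(Y \mid X)
```

H(Y | X) is the uncertainty that remains in Y once X is known, so I(X;Y) is the part of H(Y) that knowing X removes.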
![Page 18: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/18.jpg)
Mutual WHAT?
H(X | Y)   H(Y | X)   I(X;Y)
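A minimal sketch of how these three quantities could be computed for two columns, using the Product and Customer columns from the earlier slides. It relies on the standard identities H(X|Y) = H(X,Y) - H(Y) and I(X;Y) = H(X) + H(Y) - H(X,Y); the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence


def entropy(values: Sequence) -> float:
    """Shannon entropy in bits of the values in one column."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def information_measures(x: Sequence, y: Sequence) -> Dict[str, float]:
    """Conditional entropies and mutual information of two equal-length columns."""
    h_x, h_y = entropy(x), entropy(y)
    h_xy = entropy(list(zip(x, y)))  # joint entropy of the value pairs
    return {
        "H(X|Y)": h_xy - h_y,
        "H(Y|X)": h_xy - h_x,
        "I(X;Y)": h_x + h_y - h_xy,
    }


product = ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"]
customer = ["Thomas", "Thomas", "Thomas", "Christian", "Christian", "Alexei", "Alexei"]
measures = information_measures(product, customer)
for name, bits in measures.items():
    print(name, round(bits, 3))
```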
![Page 19: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/19.jpg)
From I(X;Y) we can find the distance
[Figure: columns C1, C2 and C3 plotted in the information plane (axes X and Y), joined by the distances d(c1,c2) and d(c2,c3)]
“Find the minimal distance that visits all columns in the information plane”
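The distance formula itself was not recovered from the slide. The sketch below assumes one common choice, the variation of information d(X,Y) = H(X,Y) - I(X;Y), and walks the columns greedily, always moving to the nearest unvisited column. Both the metric and the greedy nearest-neighbour walk (starting from the lowest-entropy column) are assumptions for illustration, not necessarily the speaker's method.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def entropy(values: Sequence) -> float:
    """Shannon entropy in bits of the values in one column."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def distance(x: Sequence, y: Sequence) -> float:
    """ASSUMED metric: variation of information, H(X,Y) - I(X;Y)."""
    h_xy = entropy(list(zip(x, y)))
    return 2 * h_xy - entropy(x) - entropy(y)


def greedy_column_order(columns: Dict[str, List]) -> List[str]:
    """Start at the lowest-entropy column, then hop to the nearest unvisited one."""
    remaining = dict(columns)
    current = min(remaining, key=lambda name: entropy(remaining[name]))
    order = [current]
    current_values = remaining.pop(current)
    while remaining:
        nearest = min(remaining,
                      key=lambda name: distance(current_values, remaining[name]))
        order.append(nearest)
        current_values = remaining.pop(nearest)
    return order


sales = {
    "Product": ["Beer", "Beer", "Vodka", "Whiskey", "Whiskey", "Vodka", "Vodka"],
    "Customer": ["Thomas", "Thomas", "Thomas", "Christian", "Christian", "Alexei", "Alexei"],
    "Date": ["2011-11-25"] * 7,
    "Sale": ["2 GBP", "2 GBP", "10 GBP", "5 GBP", "5 GBP", "10 GBP", "10 GBP"],
}
order = greedy_column_order(sales)
print(order)
print(distance(sales["Product"], sales["Sale"]))
```

In this tiny table Product and Sale determine each other, so their distance under the assumed metric is zero. A greedy walk is only a heuristic and can miss the shortest tour.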
![Page 20: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/20.jpg)
So, how does THAT work?
Better… not impressive… more consistent
Take best of these
![Page 21: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/21.jpg)
Medicine is compared against placebo
![Page 22: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/22.jpg)
What else does d(X,Y) tell us?
• Consider this fact table:
d(A, B) is zero!
What is our expected estimate of rows?
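The fact table on the slide was an image and was not recovered. One reading of the question, offered here as an assumption, is that a row estimate which treats A and B as independent goes badly wrong when d(A, B) is zero, because B adds no filtering beyond A. The table and function names below are hypothetical.

```python
from collections import Counter
from typing import Sequence


def independence_estimate(a_col: Sequence, b_col: Sequence, a_value, b_value) -> float:
    """Estimated rows for A = a AND B = b if A and B are assumed independent."""
    n = len(a_col)
    selectivity_a = Counter(a_col)[a_value] / n
    selectivity_b = Counter(b_col)[b_value] / n
    return n * selectivity_a * selectivity_b


def actual_rows(a_col: Sequence, b_col: Sequence, a_value, b_value) -> int:
    """Rows that really match A = a AND B = b."""
    return sum(1 for a, b in zip(a_col, b_col) if a == a_value and b == b_value)


# Hypothetical fact table: B is just a relabelling of A, so d(A, B) is zero.
a_col = [i % 10 for i in range(1000)]
b_col = [f"b{value}" for value in a_col]

print(independence_estimate(a_col, b_col, 3, "b3"))
print(actual_rows(a_col, b_col, 3, "b3"))
```

In this made-up table the independence estimate is ten times too low.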
![Page 23: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/23.jpg)
Dodge this!
Fix:
![Page 24: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/24.jpg)
Why is this so Hard?
![Page 25: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/25.jpg)
Reflecting on Information Distance
“Find the shortest path that visits all cities on a map”
Picture Credits: RUC.dk
How many routes are there?
n! = n * (n-1) * (n-2) * … * 1
Travelling Salesman Problem (TSP)
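For a handful of columns the n! column orderings can simply be enumerated. The sketch below sorts the seven sales rows lexicographically by every possible column order and keeps the one with the fewest total runs. It is a brute-force illustration under that assumption only, and becomes impractical quickly as n grows; the helper names are hypothetical.

```python
import math
from itertools import groupby, permutations
from typing import List, Sequence, Tuple

ROWS: List[Tuple[str, str, str, str]] = [
    ("Beer", "Thomas", "2011-11-25", "2 GBP"),
    ("Beer", "Thomas", "2011-11-25", "2 GBP"),
    ("Vodka", "Thomas", "2011-11-25", "10 GBP"),
    ("Whiskey", "Christian", "2011-11-25", "5 GBP"),
    ("Whiskey", "Christian", "2011-11-25", "5 GBP"),
    ("Vodka", "Alexei", "2011-11-25", "10 GBP"),
    ("Vodka", "Alexei", "2011-11-25", "10 GBP"),
]
NAMES = ("Product", "Customer", "Date", "Sale")


def total_runs(rows: Sequence[Sequence[str]]) -> int:
    """Total number of runs over all columns for the given row order."""
    return sum(sum(1 for _ in groupby(column)) for column in zip(*rows))


def best_lexicographic_order(rows, names):
    """Try all n! column orders; return (fewest runs, winning column order)."""
    best_runs, best_order = None, None
    for order in permutations(range(len(names))):
        ordered = sorted(rows, key=lambda row: tuple(row[i] for i in order))
        runs = total_runs(ordered)
        if best_runs is None or runs < best_runs:
            best_runs, best_order = runs, order
    return best_runs, tuple(names[i] for i in best_order)


print(math.factorial(len(NAMES)), "orderings to try")
best_runs, best_order = best_lexicographic_order(ROWS, NAMES)
print(total_runs(ROWS), "runs in the original order")
print(best_runs, "runs when sorted by", best_order)
```

On these seven rows the search brings the total down from 12 runs to 10. With 4 columns there are only 24 orderings, but 20 columns already give about 2.4 × 10^18.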
![Page 26: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/26.jpg)
There are MORE than n! routes
• What if lexicographical ordering of the columns isn’t best?
• Daniel Lemire et al: “Reordering Rows for Better Compression: Beyond the Lexicographic Order” ( http://arxiv.org/pdf/1207.2189.pdf )
• Some may be ruled out immediately (ex: don’t go to Skagen from Copenhagen and then to Roskilde)
• Local optima are an issue
![Page 27: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/27.jpg)
Heuristics are your Best Bet
• “Find Minimum RLE” can be shown to be NP-complete
• There is no known fast algorithm that finds the optimal ordering
• I have shown you one heuristic
• Moderate gain for a small effort
• Shown that 2x gains are possible
• Any ordering is (typically) better than random, often by a lot
• I wrote a tool to help analyse: TableStat.exe
• Interested? Come up to talk after
• I need more real life datasets to test on
![Page 28: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/28.jpg)
![Page 29: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/29.jpg)
![Page 30: Optimising Column stores with statistical analysis](https://reader033.vdocuments.site/reader033/viewer/2022052900/5561ad5fd8b42ae1538b5530/html5/thumbnails/30.jpg)
P = NP? We just don’t know