less is more: novel approaches to mysql compression for modern data sets - percona live 2016
TRANSCRIPT
Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets
Ernie Souhrada, Database Engineer / Bit Wrangler, Pinterest
Percona Live Data Performance Conference – 19 April 2016
• Introductions • The Data Explosion • Stand Back, I’m Going to Math • So Many Options, So Little CPU • Don’t Try This At Home • Not Your Grandfather’s GZIP • Ooh, Shiny Numbers! • Q&A
Agenda
2 Less Is More: Novel Approaches to MySQL Compression for Modern Data Sets– Ernie Souhrada, Database Engineer @ Pinterest – Percona Live 2016
My god, it’s full of cats!
Who am I? • Database Engineer at Pinterest (January 2015) – One of two people solely responsible for hundreds of TB of MySQL data – Also loosely affiliated with HBase and Core SRE teams
• Previously: Percona, Sun, assorted random small companies • Jack of many trades, master of some
Why am I here? • Interested in almost EVERYTHING (not just tech) • Mathematician by training; compression is fundamentally a math problem.
Who Am I, Why Am I Here?
Turning technical skill into cat food since 1996
“Every two days now we create as much information as we did from the dawn of civilization up to 2003.” – Eric Schmidt, Google [1]
He said this in 2010. • Mostly user-generated content – Over 2 million cat videos on YouTube in 2015 [2] – Lots of unstructured data, not easily put into relational form
• Don’t forget the NSA! – Although nobody really knows how much data they have….
The Data Explosion
Because ‘DELETE’ is a four-letter word.
The Data Explosion
In 2012, there were 2.1 billion people on the internet[3]
2012
The Data Explosion
Two years later, that number rose to 2.4 billion[4]
2014
The Data Explosion
Drowning in a sea of bits
Storage costs are stabilizing[5]
$0.02/GB
The Data Explosion
Drowning in a sea of bits
But data volume is still increasing! 2016: 1.1 ZB of global IP traffic per year (>1 billion GB/month) 2019: 2 ZB[6]
2011: 1.8 ZB of information created 2012: 2.8 ZB 2020: 40 ZB[7]
The Data Explosion
Mo’ data, mo’ problems.
TRUNCATE is also a four-letter word. (So is DROP…) The Data Explosion
What to do?
• Delete – Some organizations afraid to delete anything – Creation velocity still a problem
• Collect less?
• Pray to the storage gods?
• Panic!
• Spend the money, buy more storage – May be inevitable – ROI and efficiency still matter
Trading CPU cycles for disk space since 2015 The Data Explosion
Compression to the rescue! • Well, sort of.
• Workload matters. • Structure of data matters.
• Decrease velocity of data growth • Thank you, Gordon Moore!
Compressed pins are compressed. The Data Explosion
Pinterest, 12 months ago: • Lots of data stored as JSON blobs • Workload is read-heavy, but not overall QPS-heavy • No compression being used • i2.4xlarge for DB servers (3TB of disk) • Estimated disk space exhaustion around EOQ1 2016
• More servers? • Bigger servers? • Panic?
Compressed pins are compressed. The Data Explosion
Pinterest, today: • Pin data still stored as JSON blobs • i2.4xlarge for DB servers (3TB of disk) • Workload profile hasn’t changed much • InnoDB page compression being used
• Approximately 50% space reduction • Reduction in data growth velocity • Disk space exhaustion estimated Q2 2017
• Still looking for ways to do more with our existing resources
Entropy is more than just the heat death of the universe. Stand Back, I’m Going To Math
Entropy: A mathematical measure of information or uncertainty. • Computed as a function of a probability distribution. • Claude Shannon (1948): A Mathematical Theory of Communication
More formally: Suppose X is a discrete random variable which takes on values from a finite set X. Then the entropy of the random variable X is defined to be:
$$H(X) = -\sum_{x \in X} P(x) \log_2 P(x)$$
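To make the definition concrete, here is a minimal Python sketch that evaluates the entropy formula for a known distribution and estimates it from observed symbol frequencies. The function names and sample inputs are illustrative assumptions, not part of the talk.

```python
import math
from collections import Counter


def shannon_entropy(probabilities):
    """Return H(X) in bits for an iterable of probabilities that sum to 1."""
    # Terms with P(x) = 0 are skipped: p * log2(p) tends to 0 as p tends to 0.
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


def empirical_entropy(data):
    """Estimate bits of entropy per symbol from observed symbol frequencies."""
    counts = Counter(data)
    total = len(data)
    return shannon_entropy(c / total for c in counts.values())


fair_coin = shannon_entropy([0.5, 0.5])     # maximal uncertainty for two outcomes
biased_coin = shannon_entropy([0.9, 0.1])   # more predictable, so less entropy
sample = empirical_entropy(b"aaaabbcd")     # symbol frequencies 1/2, 1/4, 1/8, 1/8

print(fair_coin, biased_coin, sample)
```

The more predictable the source, the lower the entropy, and the more room a compressor has to work with.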
Encoding to binary strings for fun and profit Stand Back, I’m Going To Math
An encoding is a function that maps elements from the set X to the set of finite binary strings:
$$f : X \to \{0,1\}^*$$
Extend this to finite sequences (strings) of elements:
$$f(x_1 x_2 x_3 \ldots x_k) = f(x_1) \,\|\, f(x_2) \,\|\, f(x_3) \,\|\, \cdots \,\|\, f(x_k)$$
where || is the concatenation operator. So, we can really think of the encoding like this:
$$f : X^* \to \{0,1\}^*$$
For a given set X, there are infinitely many encodings. Why?
But not just any encoding will do. Stand Back, I’m Going To Math
• Injective – Guarantees an unambiguous decoding
• Prefix-free – Allows sequential decoding, no memory required – An encoding is prefix-free if there do not exist distinct elements x, y in X and a string S in {0,1}* such that f(x) = f(y) || S
• Lossless – Informally, exactly what it sounds like: given an encoded string E, we can decode it back precisely into the original string S
• Efficient! – Use as few bits as possible to encode each string. – How low can we go?
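The prefix-free property is easy to check mechanically, and it is what makes bit-by-bit decoding possible. The sketch below uses two made-up codes over a four-letter alphabet (both are assumptions for illustration) and treats codewords as strings of "0" and "1" characters.

```python
def is_prefix_free(code):
    """True if no codeword is a prefix of (or equal to) another symbol's codeword."""
    words = list(code.values())
    return not any(
        i != j and a.startswith(b)
        for i, a in enumerate(words)
        for j, b in enumerate(words)
    )


def encode(symbols, code):
    """Concatenate the codewords for each symbol in order."""
    return "".join(code[s] for s in symbols)


def decode(bits, code):
    """Decode left to right; a prefix-free code needs no lookahead or backtracking."""
    inverse = {word: symbol for symbol, word in code.items()}
    decoded, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            decoded.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(decoded)


good = {"a": "0", "b": "10", "c": "110", "d": "111"}
bad = {"a": "0", "b": "01", "c": "11"}  # "0" is a prefix of "01"

print(is_prefix_free(good), is_prefix_free(bad))
print(decode(encode("badcab", good), good))
```

With the second code, the bit string "011" could be read as "a" followed by "c" or as "b" followed by the start of something else, which is exactly the ambiguity the prefix-free condition rules out.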
A little theory before some practice. Stand Back, I’m Going To Math
One more definition. Suppose that we have a string $x_1 \cdots x_k$ such that each $x_i$ in the string occurs independently according to a specified probability distribution. The probability of any such string (note that the elements of the string do not need to be distinct) is given by:
$$P(x_1 \cdots x_k) = \prod_{i=1}^{k} P(x_i)$$
This is just basic probability. Consider a fair coin that gets flipped twice. Possible outcomes are: HH, HT, TH, TT, each with probability 1/2 × 1/2 = 1/4.
CAT BREAK! Stand Back, I’m Going To Math
Efficiency cat likes short strings Stand Back, I’m Going To Math
The efficiency of a particular encoding f is defined as the weighted average length of an encoding of an element of X.
$$\ell(f) = \sum_{x \in X} P(x)\,|f(x)|$$
Where |y| denotes the length of string y.
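As a quick worked example of this weighted average, the sketch below compares a fixed-length code with a variable-length one for the same four-symbol distribution. The distribution and both codes are assumptions chosen so the arithmetic is easy to follow.

```python
def average_length(code, probabilities):
    """Weighted average codeword length: sum of P(x) * |f(x)| over all x."""
    return sum(probabilities[x] * len(code[x]) for x in code)


probabilities = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
fixed_code = {"a": "00", "b": "01", "c": "10", "d": "11"}
variable_code = {"a": "0", "b": "10", "c": "110", "d": "111"}

fixed_avg = average_length(fixed_code, probabilities)        # every symbol costs 2 bits
variable_avg = average_length(variable_code, probabilities)  # frequent symbols cost less

print(fixed_avg, variable_avg)
```

Giving the most frequent symbol the shortest codeword brings the average down from 2 bits to 1.75 bits per symbol.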
Putting it all together Stand Back, I’m Going To Math
Source Coding Theorem (informally stated): A string S of length N, whose elements are drawn from X according to a probability distribution with entropy H(X), can be compressed into more than N*H(X) bits with negligible risk of data loss as N → ∞, but it cannot be compressed into fewer than N*H(X) bits without virtually guaranteeing data loss.
$$H(X) \le \ell(f) < H(X) + 1$$
What does this mean? It provides a bound on encoding efficiency for lossless compression algorithms.
Proof is left as an exercise to the reader. But you can use Huffman coding to actually find an efficient code that satisfies the above.
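For the curious, a compact Huffman construction fits in a few lines of Python. This is a textbook-style sketch rather than anything shown in the talk; the example distribution is an assumption, and ties between equal weights are broken arbitrarily.

```python
import heapq
import math
from itertools import count


def huffman_code(probabilities):
    """Build a binary Huffman code for a dict of symbol -> probability."""
    if len(probabilities) == 1:
        return {symbol: "0" for symbol in probabilities}
    tiebreak = count()  # unique counter so the heap never has to compare dicts
    heap = [(p, next(tiebreak), {symbol: ""}) for symbol, p in probabilities.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        # Merge the two least likely subtrees, prefixing their codewords with 0 and 1.
        p0, _, left = heapq.heappop(heap)
        p1, _, right = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in left.items()}
        merged.update({s: "1" + w for s, w in right.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]


def entropy(probabilities):
    """Shannon entropy in bits of a dict of symbol -> probability."""
    return -sum(p * math.log2(p) for p in probabilities.values() if p > 0)


def average_length(code, probabilities):
    """Weighted average codeword length."""
    return sum(probabilities[x] * len(code[x]) for x in code)


probabilities = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}
code = huffman_code(probabilities)
h = entropy(probabilities)
avg = average_length(code, probabilities)

print(code)
print(f"H(X) = {h:.3f} bits, average length = {avg:.3f} bits")
```

For this distribution the average codeword length lands between H(X) and H(X) + 1, as the bound says it should.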
Looking at things differently Stand Back, I’m Going To Math
It’s not possible to have an average information content of more than one bit per bit of message without losing data. On average, English text has roughly one bit of entropy per letter.[8] ASCII is an 8-bit encoding. It should come as no surprise that English text compresses quite well.
The last slide on theory, I promise Stand Back, I’m Going To Math
We don’t necessarily have to think of individual letters.
- Bigrams, trigrams
- Words or tokens (think about SQL keywords or a JSON document)
Some strings come out smaller when compressed. Some come out larger. There’s no universal encoding that works equally well for every set of source strings.
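The "some come out larger" point is easy to see with Python's standard zlib module. The sample payloads below are assumptions: a repetitive JSON-like string stands in for structured data, and os.urandom stands in for data with no exploitable structure.

```python
import os
import zlib

repetitive = b'{"id": 1, "type": "pin", "board": "cats"}' * 200
random_bytes = os.urandom(len(repetitive))

for label, payload in (("repetitive JSON", repetitive), ("random bytes", random_bytes)):
    compressed = zlib.compress(payload)
    print(f"{label}: {len(payload)} -> {len(compressed)} bytes")
```

The repetitive input shrinks dramatically, while the random input grows by a few bytes of framing overhead because there is nothing for the compressor to exploit.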
• “Old” compression technology • Application layer • SQL functions: COMPRESS() / UNCOMPRESS() • ARCHIVE storage engine • InnoDB page compression
• “New” compression technology • TokuDB • MyRocks • MySQL 5.7 “punch hole” transparent compression • Server-level column compression… what?!
So Many Options, So Little CPU!
Compression sounds great! I want some for my database, too.
Don’t Try This At Home
Just because you can do something doesn’t mean you should.
Application-Level Compression
The Good: • Not limited in choice of algorithm • Scales horizontally with app servers • Minimizes network traffic • Works with any storage engine • Fine-grained control over what to compress and what to leave alone
The Bad: • Might require a lot of code retrofit • Significant operational overhead in the event of incidents • Potentially significant loss of SQL functionality – WHERE clauses on compressed data – SQL functions
Unless you’re Batman. Then be Batman. Don’t Try This At Home
When might you consider it? • New projects, maybe • Existing projects, maybe not • The data to be compressed doesn’t need anything more than store/retrieve • You’re OK with the output of ‘SHOW PROCESSLIST’ screwing up your terminal • Network bandwidth is at a premium but CPU is plentiful (MySQL on Mars?)
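A minimal sketch of what application-level compression might look like: the application serializes a document, compresses it, and hands the database an opaque blob. The pin fields here are hypothetical, and zlib is just one possible algorithm; the database calls themselves are left out.

```python
import json
import zlib


def pack(document):
    """Serialize a dict to compact JSON and compress it for storage in a BLOB column."""
    raw = json.dumps(document, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, 6)


def unpack(blob):
    """Reverse pack(): decompress the blob and parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))


# Hypothetical pin document; the field names are assumptions for illustration.
pin = {"id": 12345, "board": "cats", "description": "cat " * 50, "tags": ["cat"] * 20}

blob = pack(pin)
raw_size = len(json.dumps(pin).encode("utf-8"))
print(f"{raw_size} bytes of JSON -> {len(blob)} bytes stored")
print(unpack(blob) == pin)
```

The server only ever sees the compressed bytes, which is why WHERE clauses and SQL functions on the contents stop working, and why the binary payload can make a mess of a terminal.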
Don’t Try This At Home
You’re not Batman.
SQL Function Compression (COMPRESS/UNCOMPRESS)
The Good: • Works with any storage engine • Fine-grained control over what to compress and what to leave alone
The Bad: • All of the same negatives of application-level compression but without any of the major benefits. • Extra load on the MySQL server
When might you consider it? • For any serious project, probably never
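For reference, the MySQL manual describes the value produced by COMPRESS() as a 4-byte uncompressed length (low byte first) followed by the zlib-compressed data, with a trailing "." appended if the result would otherwise end in a space. The sketch below mimics that documented layout in Python; the function names are hypothetical and the output has not been checked byte for byte against a server.

```python
import struct
import zlib


def mysql_style_compress(data: bytes) -> bytes:
    """Mimic the documented COMPRESS() layout: 4-byte length (low byte first) + zlib stream."""
    if not data:
        return b""  # empty strings are stored as empty strings
    blob = struct.pack("<I", len(data)) + zlib.compress(data)
    if blob.endswith(b" "):
        blob += b"."  # documented guard against trailing-space trimming
    return blob


def mysql_style_uncompress(blob: bytes) -> bytes:
    """Reverse mysql_style_compress() and check the stored length."""
    if not blob:
        return b""
    (expected_length,) = struct.unpack("<I", blob[:4])
    inflater = zlib.decompressobj()
    data = inflater.decompress(blob[4:])  # any trailing "." ends up in unused_data
    if len(data) != expected_length:
        raise ValueError("length header does not match payload")
    return data


payload = b'{"board": "cats", "note": "meow"}' * 10
blob = mysql_style_compress(payload)
print(len(payload), len(blob), mysql_style_uncompress(blob) == payload)
```

Either way the CPU work lands on the database server rather than on horizontally scalable app servers, which is the main argument against this approach.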
Don’t Try This At Home
Included for the sake of completeness only
ARCHIVE Storage Engine The Good: • Convenient • Mature
The Bad: • No UPDATE or DELETE • SELECT is a table scan • Not a usable general-purpose engine
When might you consider it? • Data that never needs to be updated and is rarely accessed • Data that can be lost or regenerated in an emergency
Honey, I shrunk the database!
Not Your Grandfather’s GZIP
InnoDB Page Compression (pre-5.7) The Good: • Mature • No need to retrofit code • Decent compression ratio • Reasonably performant for many things
The Bad: • Memory inefficient • Not as space-efficient as it could be • Not much configurability
When might you consider it? • Read-mostly workloads of low to moderate concurrency • For many users, it’s still the only game in town
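One way to see why page compression is "not as space-efficient as it could be" is a toy model: each 16KB page is compressed and must fit into a fixed on-disk block (8KB here), and whatever is left over inside that block is wasted. This sketch is a deliberately simplified assumption; it ignores page headers, the modification log, and everything else InnoDB really does.

```python
import os
import zlib

PAGE_SIZE = 16 * 1024       # default InnoDB page size
KEY_BLOCK_SIZE = 8 * 1024   # fixed on-disk block size chosen for the table


def model_page(page, key_block_size=KEY_BLOCK_SIZE):
    """Toy model: does the compressed page fit, and how much of the block is unused?"""
    compressed = len(zlib.compress(page))
    if compressed > key_block_size:
        # In the real engine this would force the page to be split and retried.
        return {"compressed": compressed, "fits": False, "wasted": 0}
    return {"compressed": compressed, "fits": True, "wasted": key_block_size - compressed}


json_page = (b'{"id": 1, "board": "cats", "note": "meow"}' * 500)[:PAGE_SIZE]
random_page = os.urandom(PAGE_SIZE)

print(model_page(json_page))
print(model_page(random_page))
```

A highly compressible page still occupies the full block, and a poorly compressible one does not fit at all, so the fixed block size costs you in both directions.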
Eh.
Not Your Grandfather’s GZIP
InnoDB Punch-Hole Compression (5.7+) The Good: • Configurable choice of algorithm • No need to retrofit code • No more buffer pool inefficiency
The Bad: • Immature • Crashed my test server • FS fragmentation • Doesn’t seem to play well with XFS
When might you consider it? • Maybe 5.8, but that’s just my opinion. • Maybe if you’re using FusionIO NVMFS
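The idea behind hole punching can be sketched with similar arithmetic: the page is compressed, written at the start of its 16KB slot, and the filesystem is asked to release the unused tail, which it can only do in whole filesystem blocks. The 4KB block size and the use of zlib below are assumptions for illustration, and this is a model of the space accounting only, not of the actual I/O path.

```python
import os
import zlib

PAGE_SIZE = 16 * 1024   # default InnoDB page size
FS_BLOCK = 4 * 1024     # assumed filesystem block size


def punched_size(page, fs_block=FS_BLOCK):
    """Toy model: disk space left in use after the unused tail of the slot is released."""
    compressed = len(zlib.compress(page))
    blocks = -(-compressed // fs_block)  # ceiling division
    # If compression saves less than one whole block, nothing can be released.
    return min(blocks * fs_block, len(page))


json_page = (b'{"id": 1, "board": "cats", "note": "meow"}' * 500)[:PAGE_SIZE]
random_page = os.urandom(PAGE_SIZE)

print(punched_size(json_page), punched_size(random_page))
```

Savings come in filesystem-block increments, and every compressed page leaves a hole in the file, which is where the fragmentation concern comes from.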
Hole-punching revisited (or, how I learned to stop worrying and love deadlocks) Not Your Grandfather’s GZIP
InnoDB Punch-Hole Compression (5.7+) continued.
Lots of this in dmesg:
[203516.812112] XFS: possible memory allocation deadlock in kmem_alloc (mode:0x250)

CPUs reporting nontrivial IO wait and nothing else:
05:54:38 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
05:54:39 PM  all    0.31    0.00    0.00    6.20    0.00    0.00    0.00    0.00   93.49
05:54:39 PM    0    1.00    0.00    0.00   13.00    0.00    0.00    0.00    0.00   86.00
05:54:39 PM    1    1.00    0.00    0.00   12.00    0.00    0.00    0.00    0.00   87.00
05:54:39 PM    2    0.00    0.00    0.00   12.00    0.00    0.00    0.00    0.00   88.00
05:54:39 PM    3    0.00    0.00    0.00   10.00    0.00    0.00    0.00    0.00   90.00
05:54:39 PM    4    1.00    0.00    0.00   12.00    0.00    0.00    0.00    0.00   87.00
05:54:39 PM    5    0.00    0.00    0.00   13.13    0.00    0.00    0.00    0.00   86.87
05:54:39 PM    6    3.00    0.00    1.00   11.00    0.00    0.00    0.00    0.00   85.00
05:54:39 PM    7    0.00    0.00    0.00   14.14    0.00    0.00    0.00    0.00   85.86
05:54:39 PM    8    1.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.00
05:54:39 PM    9    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
05:54:39 PM   10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
05:54:39 PM   11    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
05:54:39 PM   12    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
05:54:39 PM   13    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
05:54:39 PM   14    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
05:54:39 PM   15    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
What does Tokutek mean, anyway?
Not Your Grandfather’s GZIP
TokuDB The Good: • Fully transactional • Very good compression ratio • Optimized for high write volume • Code changes not likely needed
The Bad: • Reads can be slower than InnoDB • MySQL’s datadir becomes a mess • Some InnoDB constructs unsupported • Limited MySQL community knowledge
When might you consider it? • Lower-end storage technology (slow SSD vs. Flash) • Data that can benefit from multiple clustering indexes (time series data, perhaps) • Dedicated server (no InnoDB)
Get your rocks on!
Not Your Grandfather’s GZIP
RocksDB (MyRocks) The Good: • Fully transactional • Good compression ratio • Optimized for high write volume • Generally very fast • Low write amplification
The Bad: • Not GA yet. • Currently only available as part of Facebook MySQL 5.6 • Some InnoDB constructs unsupported • Locking behavior different from InnoDB
When might you consider it? • Need high compression ratio • Concerned about SSD burnout • Becomes available separately from FB-MySQL
Hey, I didn’t see THAT in the manual
Not Your Grandfather’s GZIP
InnoDB Column Compression The Good: • Configurable compression dictionary • Very good compression ratio possible • Excellent performance under load • Very memory-efficient
The Bad: • Not yet released to the public (not GA)
When should you consider it? • Storage of a lot of JSON, XML, or other compressible BLOB data • After it becomes GA
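The benefit of a predefined dictionary can be demonstrated with zlib's preset-dictionary feature, which Python exposes through the zdict argument. This is a stand-in for the general idea only; the talk does not describe the server's internals, and the pin fields below are hypothetical.

```python
import json
import zlib


def compress_with_dict(data, dictionary):
    """Compress data with a zlib preset dictionary."""
    compressor = zlib.compressobj(zdict=dictionary)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob, dictionary):
    """Decompress data that was compressed with the same preset dictionary."""
    decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()


# Hypothetical documents; one sample pin serves as the whole dictionary.
sample_pin = {"id": 1, "board_id": 10, "description": "a cat",
              "link": "https://example.com/a", "is_video": False}
new_pin = {"id": 2, "board_id": 99, "description": "another cat",
           "link": "https://example.com/b", "is_video": True}

dictionary = json.dumps(sample_pin).encode("utf-8")
payload = json.dumps(new_pin).encode("utf-8")

plain = zlib.compress(payload)
with_dict = compress_with_dict(payload, dictionary)

print(len(payload), len(plain), len(with_dict))
print(decompress_with_dict(with_dict, dictionary) == payload)
```

A single small document has little internal repetition, so ordinary compression barely helps; the dictionary supplies the field names up front, so each row only pays for what is unique to it. The catch is that the same dictionary must be available to decompress.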
But first… A CAT. Ooh, Shiny Numbers!
There are so many of them Ooh, Shiny Numbers!
Recall that we’ve already gone from uncompressed to InnoDB page compression • Performance is good • We think we can do better on disk space efficiency
However… • Not going to engage in massive code rewrite • ARCHIVE engine isn’t relevant to us • MyRocks isn’t yet in a state where we’d spend significant time on it
So… • Page compression • Column compression without dictionary • Column compression with dictionary of various sizes • TokuDB • Punch-hole (or not...)
Servers, start your engines Ooh, Shiny Numbers!
Choose a typical ‘pins’ shard, of which there are thousands. Call it N. • Shard N contains about 20GB of raw, uncompressed data • InnoDB page compression brings this down to around 10GB
• Up to 20% fragmentation overhead • Run ‘OPTIMIZE TABLE’ and we go down to 8.4GB – this is our starting point
• Set up several test servers with various compression configurations
Server A: page compressed – the control
Server B: column compression, no dictionary
Server C: column compression, one pin dictionary
Server D: column compression, four pin dictionary
Server E: column compression, eight pin dictionary
Server F: column compression, 32K dictionary
Server G: TokuDB, default settings
They don’t lie. And 65% of all statistics are made up. Ooh, Shiny Numbers!
                          Server A  Server B  Server C  Server D  Server E  Server F  Server G
Size (GB)                      8.4       8.2       5.4       5.4       5.4       5.2       3.6
dump rate (rows/sec)         52.2K     33.3K     34.3K     32.4K     30.6K       25K     53.5K
replication, 1 thread         2:40      2:52      2:35      2:57      2:47      3:00      6:36
replication, 16 threads       0:19      0:19      0:21      0:19      0:19      0:22      1:46
RO QPS, 16 threads             35K   40K-50K   40K-50K   40K-50K   40K-50K   40K-50K       20K
P99.9999 (RO)                 10ms      10ms      10ms      10ms      10ms      10ms      40ms
RW QPS, 16 threads         25K-30K   30K-40K   30K-40K   30K-40K   30K-40K   30K-40K       18K
P99.9999 (RW)                 30ms      25ms      25ms      25ms      25ms      25ms      40ms
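The size row is easier to interpret as percentages. This short sketch computes the space saved by each configuration relative to the page-compressed control (Server A) and relative to the roughly 20GB of raw data mentioned in the test setup; the sizes are the ones reported in the table.

```python
RAW_GB = 20.0  # approximate uncompressed size of the shard, from the test setup
size_gb = {"A": 8.4, "B": 8.2, "C": 5.4, "D": 5.4, "E": 5.4, "F": 5.2, "G": 3.6}


def percent_saved(size, reference):
    """Percentage of space saved relative to a reference size."""
    return 100.0 * (1.0 - size / reference)


saved_vs_control = {s: percent_saved(gb, size_gb["A"]) for s, gb in size_gb.items()}
saved_vs_raw = {s: percent_saved(gb, RAW_GB) for s, gb in size_gb.items()}

for server in size_gb:
    print(f"Server {server}: {saved_vs_control[server]:5.1f}% vs. page compression, "
          f"{saved_vs_raw[server]:5.1f}% vs. raw")
```

By this arithmetic, a one-pin dictionary saves about 36% on top of page compression, and TokuDB saves about 57%.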
Replication resync rate, single thread Ooh, Shiny Numbers!
Replication resync rate, 16-thread MTS Ooh, Shiny Numbers!
Interpreting the images on the pages to come
For the graphs on the next several slides: • Server A (page compression) is RED
• Server B (column compression, no dictionary) is LIGHT GREEN
• Server C (column compression, one pin) is BLUE
• Server D (column compression, four pins) is LIGHT BLUE
• Server E (column compression, eight pins) is DARK RED
• Server F (column compression, 32K of pins) is PURPLE
• Server G (TokuDB) is GOLD/YELLOW
A Key to the Graphics Kingdom
SELECT 256, 128, 32, 16, 8, 4, 1 threads(pquery) Ooh, Shiny Numbers
p99.9 Read Performance (Log Scale y-axis) Ooh, Shiny Numbers
Read performance for ALL the 9s! (p99.9999) Ooh, Shiny Numbers
Read/write QPS for 16, 8, 4, 1, 32, 64, 128 threads Ooh, Shiny Numbers
P99.9 write performance for the previous graph (log10 scale) Ooh, Shiny Numbers
P99.9999 overall performance for the previous QPS (r/w) graph (log10 scale) Ooh, Shiny Numbers
What’d we get out of this? Summary Results
• Even with just the simplest predefined dictionary (a single pin, which captures all of the JSON field names), we get dramatically improved space efficiency. With a better dictionary, we can likely do even better, and at our scale, a few percent can be a nontrivial improvement.
• At low concurrency (running threads <= number of cores), there isn’t too much difference between column compression and page compression when it comes to performance.
• At higher concurrency (number of running threads > number of cores in the machine), page compression falls over pretty badly on the read-only test. Column compression continues working quite well up to 256 active threads and perhaps even higher.
• TokuDB wins on compression easily, but otherwise doesn’t do that well for our workload in a default configuration (and with all the other tables on the server still InnoDB).
• Column compression looks like a serious winner, at least for what we need. I don’t think we’ll be the only ones.
Credit where credit is due. Notes & References
[1] http://techcrunch.com/2010/08/04/schmidt-data/
[2] http://nymag.com/scienceofus/2015/06/heres-a-study-about-internet-cats.html
[3] https://www.domo.com/blog/2012/06/how-much-data-is-created-every-minute/
[4] https://www.domo.com/blog/2014/04/data-never-sleeps-2-0/
[5] http://www.mkomo.com/cost-per-gigabyte-update
[6] http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white_paper_c11-481360.html
[7] http://www.webopedia.com/quick_ref/just-how-much-data-is-out-there.html
[8] http://people.seas.harvard.edu/~jones/cscie129/papers/stanford_info_paper/entropy_of_english_9.htm
Questions? Answers! email: [email protected] | twitter: @denshikarasu | pinterest engineering blog: https://engineering.pinterest.com
We are hiring! https://careers.pinterest.com