data diversity at

Data Diversity.

When MySQL can’t do everything.

• MySQL Scalability @ Booking, 2006-2014

• OpenStack @ SysEleven, 2014-2016

• Production Scalability @ Booking, 2016-

Kristian Köhntopp.

A lot of MySQL.

• Over 6000 Instances running.

• Over 100 Replication chains.

• Very diverse workload and sizes.

Trying to stay current. And failing.

Running mostly 5.6. Upgrade project stalled due to various problems.

Parts on 5.7. Where it is running, it performs nicely and provides highly desirable features.

Key population on MariaDB. We need to be able to execute our workload on both strains of MySQL at all times.

Testing Prereleases in production. We would like to find problems before GA.

MySQL does a lot.

But not everything.

First “NoSQL” database.

• Used for search autocompletion, much faster than MySQL.

• Now using an in-house thing, Brick.

• ES now used for log handling at Booking.

Elasticsearch.

• 20+ Clusters, largest 150 nodes, several hundred TB of syslog data.

• Hitting size limits, lots of toil.

• ES Clusters unstable at size.

• Sync times unacceptable.

• Blocking index creation.

• Takes up to a minute for data to show up in searches in smaller clusters.

Large clusters.

Brick.

In-house full text search system, based on Lucene. Managed garbage collection, tons of sort, ordering plugins for Lucene.

• Fulltext search in MySQL does not meet requirements.

• MySQL Scalability limits: 5x memory.

• Clustering, Sharding story?

• ES 2x-10x faster on small, simple data sets.

• Difference more pronounced on larger sets.

Why not MySQL?

• Support not useful at our scale.

• We know Java, JVM, Lucene, Cluster Comms, Scaling.

• We match and track issues upstream.

• We would be much more interested in engineering access, instead.

• ES mostly works for us within its limits.

• Elastic is present on GitHub, we are present at their conferences, speak to their engineers.

• You get better support by not paying. WTF?

Working with Elastic?

Big Data.

2012: Aggregations fail. Aggregating one hour of data in the “dw” (Data Warehouse) schema took longer than one hour.

Skunkworks: Take decommissioned database boxes, hand-install Hadoop, run proof-of-concept dw queries.

Manual parallelisation: Hand-crafted Perl utilises all cores on a single machine, buying us time to try things.

60 cores, 56x speedup: Running aggregations in parallel provides a nearly linear speedup.
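To put "nearly linear" into numbers, here is a minimal sketch of the arithmetic behind the 60 cores and 56x figures. Applying Amdahl's law is an assumption made here for illustration; the slides do not say how the speedup was modelled, and the function names are hypothetical.

```python
def parallel_efficiency(speedup: float, cores: int) -> float:
    """Fraction of the ideal (linear) speedup that was actually achieved."""
    return speedup / cores


def amdahl_serial_fraction(speedup: float, cores: int) -> float:
    """Serial fraction s implied by Amdahl's law.

    Amdahl's law: speedup = 1 / (s + (1 - s) / cores).
    Solving for s gives s = (cores / speedup - 1) / (cores - 1).
    """
    return (cores / speedup - 1) / (cores - 1)


# Figures from the slide: 60 cores, 56x speedup.
efficiency = parallel_efficiency(56, 60)
serial = amdahl_serial_fraction(56, 60)
print(f"efficiency: {efficiency:.1%}, implied serial fraction: {serial:.2%}")
```

Under that assumption the job ran at roughly 93% of ideal scaling, which is what one would expect from hourly aggregations that barely depend on each other.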

Today.

Getting a budget: Run a proper install, learn a lot. Standard HP boxes do not work so well; using a distro vs. using vanilla Hadoop.

Eight clusters: 8 clusters + 2 sandboxes, 4 with Hive, 27 PB of data present (2012-08, going to cut off @ 2y).

Running Hadoop.

Events and event processing, used in everything from monitoring to BI,

Aggregations, DWH/BI, Reporting, Analytics

Running HBase.

Real time monitoring, Front End Roundtripping,

MySQL Time Machine & Replication, Mesos/Marathon Integration.

Frontend to Hadoop and back: Data collected, aggregated and filtered back, in realtime.

Aggregated Visitors: Aggregate front end logs, produce per object looker statistics.

Aggregated Bookers: Aggregate reservation stats, produce per object booker statistics.

Aggregated Properties: Per location, find similar properties, produce set counts.


Hadoop Cluster Stats

[lhr4] $ hdfs dfs -df -h
Filesystem            Size    Used    Available  Use%
hdfs://nameservice1   41.2 P  27.0 P  10.8 P     66%

[lhr4] $ htloc raw_events
/hive_tables/raw_tables/raw_events

[lhr4] $ hdfs dfs -du -h -s /hive_tables/raw_tables/raw_events
4.9 P  14.6 P  /hive_tables/raw_tables/raw_events
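A short sketch of how the numbers in the command output above relate to each other. It assumes the two `-du` columns are the logical size and the disk space consumed including all replicas, which is how recent Hadoop versions report it; the variable names are illustrative.

```python
# Figures copied from the command output above, in petabytes.
capacity_pb, used_pb = 41.2, 27.0
raw_events_logical_pb, raw_events_on_disk_pb = 4.9, 14.6

# Cluster fill level, to compare with the Use% column.
use_percent = 100 * used_pb / capacity_pb

# On-disk size divided by logical size approximates the replication factor.
replication = raw_events_on_disk_pb / raw_events_logical_pb

# Share of all used space taken up by the raw_events table.
raw_events_share = raw_events_on_disk_pb / used_pb

print(f"use: {use_percent:.1f}%")
print(f"replication: {replication:.2f}x")
print(f"raw_events share of used space: {raw_events_share:.0%}")
```

The ratio comes out very close to 3, consistent with the HDFS default replication factor, and raw_events alone accounts for a bit more than half of the used space.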

Mostly HIVE.

• Over 95% of the workload: HQL.

• Declarative, powerful, familiar.

• ODBC Endpoint: Excel, R.

• Well known to data people, good tooling.

Quietly HBase.

• Running for one year now, 400 nodes.

• Real Time Metrics.

• Not widely advertised.

• “Have you considered MySQL first?”


MySQL Time Machine

HBase Replication

Matterhorn 1

17:20

• MySQL did not scale to parallel processing.

• MySQL does not meet requirements for DWH processing.

• MySQL and Petabytes do not mix.

Why not MySQL?

• Hadoop is a lot like Lego: under load, parts fall off.

• CDH is 1+ year behind Hadoop current.

• Few clusters, many interests: Diverse workload. Interference.

• Operational challenges: No interruptions.

• Toil, Bugs, Interference.

Hadoop Ecosystem?

• Hadoop: not built for deployment from vanilla repos.

The distro question.


• Vendor per-seat licensing does not scale for us.

• Control issues: if you do stuff past our admin system, you are off support.

• Integration with existing provisioning and monitoring.

• Build in-house knowhow.




• Booking:

• save on licensing, hire people instead.

• partnership vs. customer/vendor.

• Some contributions upstream, but not very active.


• Isolation problems, known defects

Educating users.


• Schema and access require a lot of rethinking:

• Table design, index usage.

• Hotspotting (overwhelming a single node).


• Mandatory user education.

• Mandatory table review.

• This is not the Booking.com way.
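Hotspotting, as listed above, typically shows up when row keys are monotonic (timestamps, sequence numbers) and every write lands on the same node. One common mitigation is to salt the row key with a hash-derived bucket prefix. The following is a minimal sketch of that idea, not the table design used at Booking; `NUM_BUCKETS` and `salted_row_key` are hypothetical names.

```python
import hashlib

NUM_BUCKETS = 16  # hypothetical; usually chosen relative to the number of region servers


def salted_row_key(natural_key: str, num_buckets: int = NUM_BUCKETS) -> str:
    """Prefix the key with a stable bucket number derived from its hash.

    The same natural key always maps to the same bucket, so point lookups
    still work, while neighbouring keys are spread over the key space.
    """
    digest = hashlib.md5(natural_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % num_buckets
    return f"{bucket:02d}|{natural_key}"


# Sequential timestamps would otherwise all sort next to each other.
keys = [f"2016-10-05T12:00:{second:02d}" for second in range(60)]
buckets_hit = {salted_row_key(key).split("|", 1)[0] for key in keys}
print(f"{len(keys)} sequential keys spread over {len(buckets_hit)} buckets")
```

The price is that a range scan over the natural key now has to fan out over every bucket, which is one reason schema and access patterns need rethinking.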


• Enterprise hardware: Smart Controllers, small, fast disks, expensive redundancy features.

• All of this is useless for Hadoop.

• Buy bulk from Taiwanese maker.

The hardware question.

Value Adding yourself into obsolescence.

https://twitter.com/davykestens/status/654334869267312640


Graphing.

Graphite at Booking: A hackathon project that escaped from the cage.

32 Frontend Servers, 200 Store Servers in two data centers, 2M unique metrics per second (8M hitting the stores), 130 TB metrics in total, 11 Gbps traffic on the backend.

Graphite setup. After mutation


Graphite @ Scale: How to store million metric per second

Vladimir Smirnov, LinuxCon Europe 2016, 5 October 2016

It’s on github.

carbonzipper: github.com/dgryski/carbonzipper

carbonserver: github.com/grobian/carbonserver

carbonapi: github.com/dgryski/carbonapi

carbon-c-relay: github.com/grobian/carbon-c-relay

• MySQL does not meet the requirements for time series data.

• Data deletion kills the server (Partitions help a bit).

• MySQL does not cluster, so scalability limit.

Why not MySQL?

• Python does not scale. At all.

• Practically rewrote the backend completely in Go and C.

• It won’t grow another 10x.

• Ops heavy, scales linearly with cluster size, metrics discovery needs improvement. Storage side is a goner.

• Time Series Data at scale is hard.

Graphite problems

Graphite and Community.

• “Raintank”, startup in the monitoring space, hiring Grafana and Graphite people.

• We are working with the Graphite community, but we have probably outgrown all their use cases.

• We are most likely on our own with all of this.

Cassandra.

2014: PoC for an S3-like Photo Storage: BOSOS (Booking Simple Object STorage).

2015: Perl Tooling (previously: Go Tooling), DBD::cassandra, mearc (40T MySQL instance) because MariaDB cassandra engine, bizmail archive, RTM, Event lookup table, “that chatbot project”, PII storage?

2016: DBD::cassandra becomes CPAN module, Experiment tool data.

Cassandra

15 clusters

200 nodes

700 TB data

3 datacenters

1-2 ms response time

• Any large data set that needs sharding and simple access.

• If you need indexes, sharding is a pain, but Cassandra does that.

• Using CQL (limited features, but declarative).
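The "sharding and simple access" pattern above comes down to mapping a partition key onto a node without the application having to care. A simplified sketch of a token ring illustrates the idea; it is not Cassandra's actual partitioner, and the class, the MD5-based token and the node names are all assumptions made for illustration.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a 64-bit token (MD5-based; an illustrative stand-in)."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class TokenRing:
    """Each node owns several positions on a ring of tokens."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [t for t, _ in self._ring]

    def owner(self, partition_key: str) -> str:
        """Return the node owning the first ring position after the key's token."""
        idx = bisect.bisect_right(self._tokens, token(partition_key)) % len(self._ring)
        return self._ring[idx][1]


ring = TokenRing(["node-a", "node-b", "node-c"])
for key in ("photo:1", "photo:2", "photo:3"):
    print(key, "->", ring.owner(key))
```

With this layout, adding a node only moves the keys that the new node takes over; everything else stays where it was, which is what makes automated sharding workable.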

How we are using it.

• no cluster, scaling limits.

• no automated sharding, sharding by proxy is dangerous.

• large BLOB storage in MySQL does not meet requirements.

• We tried the MariaDB Cassandra engine, but that did not work out.

Why not MySQL?

• Individual node failures: non-issue. Rolling restarts fine.

• “Never run the latest version”, running the oldest supported version.

• Secondary indexes do not work the way one would expect from MySQL. We are still recovering from that.

Cassandra problems

Cassandra and community.

• We do not have commercial support, running community version. Considering buying support for community version, but as of now not large enough deployment to justify that.

• Considering NRE for a few things.

• Cassandra community on IRC is amazing.

• We sometimes file tickets, and are beginning to upstream fixes.

• DBD::cassandra, Cassandra::client

Redis.

Used a lot as better memcache, as a Queue.

Used with persistence only in sessapp (temp storage for session data), LRU expiring to MySQL, previously hacked version of Redis, 50 + 50 nodes.
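The slides do not describe how "LRU expiring to MySQL" is implemented. As a rough sketch of the general idea, here is a bounded in-memory store that spills its least recently used entries into a slower backing store; a plain dict stands in for MySQL, and the class name is hypothetical.

```python
from collections import OrderedDict


class LruWithSpill:
    """Keep the hottest entries in memory, move cold ones to a backing store."""

    def __init__(self, capacity, backing_store):
        self.capacity = capacity
        self.backing_store = backing_store  # stand-in for MySQL in this sketch
        self._hot = OrderedDict()

    def put(self, key, value):
        self._hot[key] = value
        self._hot.move_to_end(key)  # mark as most recently used
        while len(self._hot) > self.capacity:
            old_key, old_value = self._hot.popitem(last=False)  # evict the LRU entry
            self.backing_store[old_key] = old_value

    def get(self, key):
        if key in self._hot:
            self._hot.move_to_end(key)
            return self._hot[key]
        if key in self.backing_store:
            # Promote a cold entry back into memory on access.
            value = self.backing_store.pop(key)
            self.put(key, value)
            return value
        return None


mysql_stand_in = {}
sessions = LruWithSpill(capacity=2, backing_store=mysql_stand_in)
for session_id in ("s1", "s2", "s3"):
    sessions.put(session_id, {"user": session_id})
print(sorted(mysql_stand_in))
```

In this sketch an entry lives in exactly one place at a time, so the backing store only ever holds sessions that have gone cold.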


No support, no community work.

Standard deployment, boring.

Also using.

Postgres - used in our Puppet deployment.

Riak - used in the event processing pipeline, future unclear.

RocksDB - used in “Smart AV”, dedicated availability service.

• Supposedly there is MongoDB running somewhere at Booking.

• Have been searching for it.

• Found no one willing to admit to it.

MongoDB?

Conclusions for MySQL @ Booking.

• Recurring theme: need to shard, distributed database, multithreaded per shard.

• MySQL breakdown at special use cases:

• Fulltext, DWH, TSDB, BLOB Store, Column Store.

Conclusions for NoSQL @ Booking.

• Trend towards declarative languages in NoSQL systems.

• Distributed systems are hard (Surprise!)

Conclusions for Upstream Interaction.

• We tend to become self-supporting wrt support.

• We tend to become contributors in some way.

• We tend to need “engineering access”.

We are hiring.

Check https://workingatbooking.com/ or talk to us at our booth.
