Using Riak for Events Storage and Analysis at Booking.com
TRANSCRIPT
EVENT STRUCTURE
• Provides info about subsystems
• Data
• Deep HashMap
• Timestamp
• Type + Subtype
• The rest: specific data
• Schema-less
EXAMPLE 1: WEB APP EVENT
• Large event
• Info about user actions
• Requests, user type
• Timings
• Warnings, errors
• Etc…
{
    timestamp => 12345,
    type      => 'WEB',
    subtype   => 'app',
    action    => {
        is_normal_user => 1,
        pageview_id    => '188a362744c301c2',
        # ...
    },
    tuning => {
        the_request => 'GET /display/...',
        bytes_body  => 35,
        wallclock   => 111,
        nr_warnings => 0,
        # ...
    },
    # ...
}
EXAMPLE 2: AVAILABILITY CLUSTER EVENT
• Small event
• Cluster provides availability info
• Event: Info about request types and timings
{
    type      => 'FAV',
    subtype   => 'fav',
    timestamp => 1401262979,
    dc        => 1,
    tuning    => {
        flatav => {
            cluster       => '205',
            sum_latencies => 21,
            role          => 'fav',
            num_queries   => 7,
        },
    },
}
EVENTS FLOW PROPERTIES
• Read-only
• Schema-less
• Continuous, ordered, timed
• 15,000 events per second
• 1.25 billion events per day
• Peak at 70 MB/s, minimum 25 MB/s
• 100 GB per hour
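A quick back-of-the-envelope check (illustrative Python, not from the talk) shows how these figures relate to each other:

```python
# Sanity check of the event-flow figures above.
EVENTS_PER_SEC = 15_000
events_per_day = EVENTS_PER_SEC * 86_400   # seconds in a day
print(events_per_day)                      # 1296000000, matching "1.25 billion/day"

GB_PER_HOUR = 100
avg_mb_per_sec = GB_PER_HOUR * 1024 / 3600 # average throughput in MB/s
print(round(avg_mb_per_sec, 1))            # 28.4, between the 25 MB/s min and 70 MB/s peak
```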
SERIALIZATION
• JSON didn’t work for us (slow, too big, lacks features)
• Created Sereal in 2012
• “Sereal, a new, binary data serialization format that provides high-performance, schema-less serialization”
• Added Sereal encoder & decoder in Erlang in 2014
GRAPHS
• Graph in real time (a few seconds of lag)
• Graph as many systems as possible
• General platform health check
DECISION MAKING
• Strategic decisions (use facts)
• Long-term or short-term
• Technical / non-technical reporting
SHORT TERM ANALYSIS
• From 10 seconds ago to 8 days ago
• Code deployment checks and rollback
• Anomaly Detector
A/B TESTING
• Our core philosophy: use facts
• It means: do A/B testing
• Concept of Experiments
• Events provide data to compare
WHAT WE WANT
• Storage security
• Mass write performance
• Mass read performance
• Easy administration
• Very scalable
WE CHOSE RIAK
• Security: cluster, distributed, very robust
• Good and predictable read / write performance
• The easiest to set up and administrate
• Advanced features (MapReduce, triggers, 2i, CRDTs …)
• Riak Search
• Multi Datacenter Replication
CLUSTER
• Commodity hardware
• All nodes serve data
• Data replication
• Gossip between nodes
• No master
Ring of servers
RIAK: ADVANCED FEATURES
• MapReduce
• Secondary indexes
• Riak Search
• Multi DataCenter Replication
BACKEND: BITCASK
• Log-based storage backend
• Append-only files
• Advanced expiration
• Predictable performance
• Perfect for reading sequential data
HARDWARE
• 16 nodes
• 12 CPU cores (Xeon 2.5 GHz)
• 192 GB RAM
• network 1 Gbit/s
• 8 TB (raid 6)
• Cluster space: 128 TB
RIAK CONFIGURATION
• Vnodes: 256
• Replication: n_val = 3
• Expiration: 8 days
• 4 GB files
• Compaction only when file is full
• Compact only once a day
[Diagram: web, api, and db servers push their events every second into the events storage]
• 1 blob per EPOCH / DC / CELL / TYPE / SUBTYPE
• 500 KB max chunks
DATA
• Bucket name: “data”
• Key: “12345:1:cell0:WEB:app:chunk0”
• Value: serialized compressed data
• About 120 keys per second
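The key naming scheme can be sketched like this (illustrative Python; helper name `data_key` is mine, not from the talk):

```python
# Hypothetical sketch of the data-key scheme described above:
# one blob per EPOCH / DC / CELL / TYPE / SUBTYPE, split into <= 500 KB chunks.
def data_key(epoch, dc, cell, type_, subtype, chunk):
    return f"{epoch}:{dc}:{cell}:{type_}:{subtype}:chunk{chunk}"

print(data_key(12345, 1, "cell0", "WEB", "app", 0))
# 12345:1:cell0:WEB:app:chunk0
```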
METADATA
• Bucket name: “metadata”
• Key: epoch-dc, e.g. “12345-2”
• Value: list of data keys: [ “12345:1:cell0:WEB:app:chunk0”, “12345:1:cell0:WEB:app:chunk1” … “12345:4:cell0:EMK::chunk3” ]
• Stored as a pipe-separated value
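A minimal sketch of the pipe-separated encoding (illustrative Python; the talk's clients are Perl and Java):

```python
# The metadata value is simply the list of data keys joined with "|".
keys = [
    "12345:1:cell0:WEB:app:chunk0",
    "12345:1:cell0:WEB:app:chunk1",
]
metadata_value = "|".join(keys)           # what gets stored under key "12345-1"
assert metadata_value.split("|") == keys  # readers split it back into data keys
print(metadata_value)
# 12345:1:cell0:WEB:app:chunk0|12345:1:cell0:WEB:app:chunk1
```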
PUSH DATA IN
• In each DC, in each cell, Loggers push to Riak
• 2 protocols: REST or ProtoBuf
• Every second:
• Push data values to Riak, async
• Wait for success
• Push metadata
JAVA
Bucket DataBucket = riakClient.fetchBucket("data").execute();
DataBucket.store("12345:1:cell0:WEB:app:chunk0", Data1).execute();
DataBucket.store("12345:1:cell0:WEB:app:chunk1", Data2).execute();
DataBucket.store("12345:1:cell0:WEB:app:chunk2", Data3).execute();

Bucket MetaDataBucket = riakClient.fetchBucket("metadata").execute();
MetaDataBucket.store("12345-1", metaData).execute();
riakClient.shutdown();
Perl
my $client = Riak::Client->new(…);
$client->put(data => '12345:1:cell0:WEB:app:chunk0', $data1);
$client->put(data => '12345:1:cell0:WEB:app:chunk1', $data2);
$client->put(data => '12345:1:cell0:WEB:app:chunk2', $data3);
$client->put(metadata => '12345-1', $metadata, 'text/plain');
GET DATA OUT
• Request metadata for epoch-DC
• Parse value
• Filter out unwanted types / subtypes
• Fetch the data keys
Perl
my $client = Riak::Client->new(…);
my @array = split /\|/, $client->get(metadata => '12345-1');
my @filtered_array = grep { /WEB/ } @array;
$client->get(data => $_) foreach @filtered_array;
STREAMING
• Fetch 1 second every second
• Or a range ( last 10 min )
• Client generates all the epochs for the range
• Fetch all epochs from Riak
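The epoch generation step can be sketched like this (illustrative Python; the real clients are Perl and Java):

```python
# To stream a time range, the client generates every epoch (one per second)
# in the range, then fetches each epoch's metadata and data from Riak.
def epochs_for_range(start_epoch, end_epoch):
    return list(range(start_epoch, end_epoch + 1))

# e.g. the last 10 minutes before epoch 1401262979:
epochs = epochs_for_range(1401262979 - 600, 1401262979)
print(len(epochs))  # 601 one-second epochs to fetch
```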
EXAMPLES
• Continuous fetch => Graphite (every second)
• Continuous fetch => Anomaly Detector (last 2 min)
• Continuous fetch => Experiment analysis (last day)
• Continuous fetch => Hadoop
• Manual requests => test, debug, investigate
• Batch fetch => ad hoc analysis
• => Huge number of fetches
[Diagram: the events storage feeds the Graphite cluster, the Anomaly Detector, the experiment cluster, the Hadoop cluster, MySQL analysis, and manual requests, each consumer pulling about 50 MB/s]
REALTIME
• 1 second of data
• Stored in < 1 sec
• Available after < 1 sec
• Issue: network saturation
WHAT TAKES TIME
• Takes a lot of time:
  • Fetching data out
  • Decompressing
• Takes almost no time:
  • Crunching data
MAPREDUCE
• Send code to be executed
• Works fine for 1 job
• Takes < 1s to process 1s of data
• Doesn’t work for multiple jobs
• Has to be written in Erlang
Riak post-commit hook
[Diagram: on each node host, new data sent for storage triggers the Riak service's post-commit hook, which passes the metadata key over a socket to a local REST service; the REST service fetches, decompresses, and processes all tasks]
HOOK CODE
metadata_stored_hook(RiakObject) ->
    Key = riak_object:key(RiakObject),
    [ Epoch, _DC ] = binary:split(Key, <<"-">>),
    Data = riak_object:get_value(RiakObject),
    DataKeys = binary:split(Data, <<"|">>, [ global ]),
    %% the hook targets the REST service on the local node
    {ok, Hostname} = inet:gethostname(),
    send_to_REST(Epoch, Hostname, DataKeys),
    ok.

send_to_REST(Epoch, Hostname, DataKeys) ->
    Method = post,
    URL = "http://" ++ Hostname ++ ":5000?epoch="
          ++ binary_to_list(Epoch),
    HTTPOptions = [ { timeout, 4000 } ],
    Options = [ { body_format, string }, { sync, false },
                { receiver, fun(_ReplyInfo) -> ok end } ],
    Body = iolist_to_binary(mochijson2:encode(DataKeys)),
    httpc:request(Method, {URL, [], "application/json", Body},
                  HTTPOptions, Options),
    ok.
REST SERVICE
• In Perl, using PSGI, Starman, preforks
• Allows writing data crunchers in Perl
• Also supports loading code on demand
ADVANTAGES
• CPU usage and execution time can be capped
• Data is local to processing
• The two systems are decoupled
• The REST service can be written in any language
• Data processing done all at once
• Data is decompressed only once
DISADVANTAGES
• Only for incoming data (streaming), not old data
• Can’t easily use cross-second data
• What if the companion service goes down?
FUTURE
• Use this companion to generate optional small values
• Use Riak Search to index and search those
• Network usage (PUT and GET): 3 × 13/16 + 2 × 3/16 ≈ 2.81
• Plus gossip
• Inside network > 3 × outside network
• Usually it’s not a problem
• But in our case: big values, constant PUTs, lots of GETs
• Sadly, only 1 Gbit/s
• => network bandwidth issue
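The 2.81 factor can be reproduced under my reading of the slide (a sketch, not from the talk): with n_val = 3 replicas on a 16-node ring, a randomly chosen coordinating node already holds one of the replicas with probability 3/16, so it only has to forward two copies instead of three.

```python
# Expected number of internal copies per external PUT (my interpretation).
nodes, n_val = 16, 3
p_local = n_val / nodes                     # coordinator is a primary: 3/16
internal_copies = n_val * (1 - p_local) + (n_val - 1) * p_local
print(internal_copies)                      # 2.8125, the slide's ~2.81
```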
THE BANDWIDTH SOLUTIONS
1. Optimize GET for network usage, not speed
2. Don’t choose a node at random
WARNING
• Possible only because data is read-only
• Data has internal checksum
• No conflict possible
• Corruption detected
THE BANDWIDTH SOLUTIONS
1. Optimize GET for network usage, not speed
2. Don’t choose a node at random
• bucket = “metadata”
• key = “12345”
Hash = hashFunction(bucket + key)
RingStatus = getRingStatus
PrimaryNodes = Fun(Hash, RingStatus)
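The idea above can be sketched in miniature (illustrative Python: the hash function, node names, and ring layout here are stand-ins for Riak's actual consistent hashing, which the real client mimics):

```python
# Hypothetical sketch of client-side primary-node selection: hash the
# bucket+key onto the ring, then take the next n_val nodes. Sending the GET
# directly to a primary avoids an extra internal hop.
import hashlib

NODES = [f"node{i}" for i in range(16)]   # the 16-node ring
N_VAL = 3                                 # replication factor

def primary_nodes(bucket, key):
    h = int.from_bytes(hashlib.sha1((bucket + key).encode()).digest()[:8], "big")
    start = h % len(NODES)
    # three consecutive positions on the ring hold the replicas
    return [NODES[(start + i) % len(NODES)] for i in range(N_VAL)]

print(primary_nodes("metadata", "12345"))
```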
WARNING
• Possible only if:
  • The nodes list is monitored
  • In case of a failed node, we default to random
  • Data is requested in a uniform way
CONCLUSION
• We used only Riak Open Source
• No training, self-taught, small team
• Riak is a great solution
• Robust, fast, scalable, easy
• Very flexible and hackable
• Helps us continue scaling