Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013
Building “sexy” real-time analytics systems

Upload: lpgauth

Post on 16-Jun-2015

DESCRIPTION

In the world of real-time bidding (RTB), it is crucial to get performance metrics as soon as possible. This is why AdGear built its own real-time analytics system. In this talk, Louis-Philippe shares what he learned building this system and introduces Swirl, AdGear's lightweight distributed stream processor. He also gives some pointers on how to build a subset of SQL to power your distributed jobs.

Talk objectives:

• Introduce Swirl, a lightweight distributed stream processor

• Implement a subset of SQL (lexer + parser + boolean logic)

• Demo a real-time graphing web interface powered by Swirl, Cowboy, Bullet and D3.js

TRANSCRIPT

Page 1: Building Sexy Real-Time Analytics Systems - Erlang Factory NYC / Toronto 2013

Building “sexy” real-time analytics systems

AdGear is a full-stack ad platform for publishers and advertisers, with advanced analytics, attribution measurement, ad serving, and real-time bidding technology.

Real-time bidding (RTB)

Real-time reporting... why?

• help clients make informed decisions

• should I increase the bid price?

• should I bid on exchange X?

• inventory control (brand safety)

• debugging (bot detection, creative audits)

“Sexy” real-time analytics systems

“Sexy”?

• elegant backend

• beautiful user interface

Architecture #1 (3 years ago)

• ssh

• node.js

• socket.io

Problems

• no SMP support

• each process needs to be monitored

• requires load-balancing (nginx)

• duplicated state (per process)

• duplicated work (de-serialization)

• bad error handling (event loop explodes)

• callbacks...

* promise construct

Architecture #2 (1.5 years ago)

• ssh_channel *

• gproc (pub sub)

• ETS counters

• bullet (cowboy)

* https://gist.github.com/lpgauth/6529807

Architecture #2

1. receive buffered events, split and de-serialize

2. each event is sent to a collector process (3) using gproc (pubsub) for filtering

3. collector (gen_server) aggregates messages using ETS counters and flushes every second

4. bullet handler serializes the aggregates (tab2list to JSON)
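
The four steps of Architecture #2 can be sketched roughly as follows. This is an illustration in Python, not AdGear's Erlang code: the newline-delimited JSON wire format, the `Collector` class, and the predicate-based filter (standing in for gproc pubsub with match specs) are all assumptions made for the example.

```python
import json
from collections import Counter


def split_events(buffer: bytes):
    """Step 1: split a newline-delimited buffer and de-serialize each event.
    (The wire format is an assumption; the slides do not specify it.)"""
    return [json.loads(line) for line in buffer.splitlines() if line.strip()]


class Collector:
    """Steps 2-3: receives events, keeps those passing its filter, and
    aggregates them into counters (a stand-in for gproc + ETS counters)."""

    def __init__(self, predicate, key_fields):
        self.predicate = predicate
        self.key_fields = key_fields
        self.counters = Counter()

    def handle(self, event):
        if self.predicate(event):
            key = tuple(event.get(field) for field in self.key_fields)
            self.counters[key] += 1

    def flush(self):
        """Return the aggregates and reset; the slides flush every second."""
        rows, self.counters = list(self.counters.items()), Counter()
        return rows


def to_json(rows, key_fields):
    """Step 4: serialize flushed aggregates for the web client."""
    return json.dumps([dict(zip(key_fields, key), count=n) for key, n in rows])


# Usage: count clicks per exchange from one buffered chunk.
buffer = (b'{"event": "click", "exchange": "x"}\n'
          b'{"event": "impression", "exchange": "x"}\n'
          b'{"event": "click", "exchange": "x"}\n')
collector = Collector(lambda e: e["event"] == "click", ["exchange"])
for event in split_events(buffer):
    collector.handle(event)
payload = to_json(collector.flush(), ["exchange"])
```

Note that every event still travels to the collector before being counted, which is the property the next slide identifies as the bottleneck.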

Problems

• ssh_channel process and collector process are bottlenecks

• number of messages increases with the number of clients

• requires lots of bandwidth for large streams

• limited filtering (match specs)

Improvements... (6 months ago)

• optimize collector’s msg loop (gen_server to proc_lib)

• use ssh compression

• added support for openssh zlib compression *

• R16B02

* https://github.com/lpgauth/otp/tree/openssh_zlib

This worked for a while...

“Hey man, it would be very cool if you could show in real-time the number of bid requests per domain for Friday’s demo... Can you do it?” - boss

Sure.

What did I just agree to...

• I only have 3 days to build this...

• bid request stream is too large to aggregate in a central location (1+ Gbit/s, 80K+ requests/s)

Strategy for demo

1. move aggregation upstream

2. use ETS match select to find table ids (filtering)

3. increment counters in process (no message!)

4. periodically flush aggregates via message to collector node

5. collector node increments local counters and periodically flushes aggregates to bullet handler
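
The point of this strategy is that events are counted where they are produced and only small counter deltas cross the network. A minimal sketch of that idea, in Python for illustration (the class names `LocalAggregator` and `CollectorNode` are hypothetical, and the ETS-based filtering of step 2 is omitted):

```python
from collections import Counter


class LocalAggregator:
    """Runs next to the event source: counts in process, ships only deltas."""

    def __init__(self):
        self.counters = Counter()

    def incr(self, key, n=1):
        # Step 3: a plain in-process increment, no message per event.
        self.counters[key] += n

    def flush(self):
        """Step 4: return the accumulated deltas and start a fresh window."""
        deltas, self.counters = dict(self.counters), Counter()
        return deltas


class CollectorNode:
    """Step 5: merges the deltas received from every upstream node."""

    def __init__(self):
        self.totals = Counter()

    def receive(self, deltas):
        self.totals.update(deltas)  # Counter.update adds counts


# 10,000 events on each of two nodes collapse into two small messages.
nodes = [LocalAggregator(), LocalAggregator()]
for node in nodes:
    for i in range(10_000):
        domain = "example.com" if i % 2 else "example.org"
        node.incr(("bid_request", domain))

collector = CollectorNode()
messages = [node.flush() for node in nodes]
for deltas in messages:
    collector.receive(deltas)
```

Each flushed message here carries two counters regardless of how many events were seen in the window, so bandwidth scales with the number of distinct keys rather than with event volume.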

Success!

Introducing swirl! “lightweight distributed stream processor”

Swirl components

• “dynamic” streams (swirl_stream)

• simple behavior that implements a map-reduce like interface (swirl_flow)

• powerful filtering language (swirl_ql)

• process registry (swirl_tracker)

Streams

Flows

* application:start(swirl).

swirl_flow behavior

Mapper Node

1. process “emits” event

2. lookup in ETS if there’s a flow that matches the stream name and filter

3. if there’s a match, call flow_mod:map/4

4. if map returns counters, increment in ETS

5. swirl_mapper periodically flushes aggregates to reducer node
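
A rough sketch of the mapper-side steps, written in Python for illustration. The flow registry, the `DomainFlow` module, and the argument order of `map` are assumptions; the slides only state that the callback is `flow_mod:map/4`, and a module-level `Counter` stands in for the ETS table.

```python
from collections import Counter

flows = []            # registered flows (stand-in for the ETS flow table)
counters = Counter()  # stand-in for the ETS counter table


class DomainFlow:
    """Hypothetical flow module: one counter per (flow, domain)."""

    @staticmethod
    def map(flow_id, stream_name, event, opts):
        # Four arguments to mirror map/4; the real argument list may differ.
        return [((flow_id, event["domain"]), 1)]


def register_flow(flow_id, stream_name, predicate, flow_mod, opts=None):
    flows.append((flow_id, stream_name, predicate, flow_mod, opts or {}))


def emit(stream_name, event):
    """Steps 1-4: match flows on stream name and filter, map, then count."""
    for flow_id, name, predicate, flow_mod, opts in flows:
        if name == stream_name and predicate(event):
            updates = flow_mod.map(flow_id, stream_name, event, opts) or []
            for key, n in updates:
                counters[key] += n


def flush():
    """Step 5: snapshot the aggregates for the reducer node and reset."""
    snapshot = dict(counters)
    counters.clear()
    return snapshot


# Usage: one flow on the "bid_requests" stream, restricted to exchange 3.
register_flow("f1", "bid_requests", lambda e: e.get("exchange") == 3, DomainFlow)
emit("bid_requests", {"exchange": 3, "domain": "example.com"})
emit("bid_requests", {"exchange": 3, "domain": "example.com"})
emit("bid_requests", {"exchange": 4, "domain": "example.com"})  # filtered out
emit("clicks", {"exchange": 3, "domain": "example.com"})        # other stream
aggregates = flush()
```

Events that match no flow cost only the lookup, which is what keeps `emit` cheap enough to call from the hot path.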

Reducer Node

1. swirl_tracker receives mapper aggregates and forwards them to reducer

2. reducer increments counters in ETS

3. reducer flushes counters to flow_mod:reduce/4
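
The reducer side can be pictured in the same way. In this Python sketch the `Reducer` class, the `CollectingFlow` module, and the arguments passed to `reduce` are hypothetical; only the arity of `flow_mod:reduce/4` comes from the slides.

```python
from collections import Counter, defaultdict


class CollectingFlow:
    """Hypothetical flow module that records what reduce receives."""
    results = []

    @classmethod
    def reduce(cls, flow_id, period, aggregates, opts):
        # Four arguments to mirror reduce/4; the real argument list may differ.
        cls.results.append((flow_id, period, aggregates))


class Reducer:
    def __init__(self, flow_mods):
        self.flow_mods = flow_mods            # flow id -> flow module
        self.counters = defaultdict(Counter)  # flow id -> merged counters

    def receive(self, flow_id, aggregates):
        """Steps 1-2: merge one mapper's aggregates into the local counters."""
        self.counters[flow_id].update(aggregates)

    def flush(self, period):
        """Step 3: hand the merged counters to each flow's reduce callback."""
        for flow_id, merged in self.counters.items():
            self.flow_mods[flow_id].reduce(flow_id, period, dict(merged), {})
        self.counters.clear()


# Usage: two mapper nodes report the same key within one period.
reducer = Reducer({"f1": CollectingFlow})
reducer.receive("f1", {"example.com": 2})
reducer.receive("f1", {"example.com": 3, "example.org": 1})
reducer.flush(period=1)
```

Because the reducer only adds counters together, the order in which mapper aggregates arrive does not affect the result.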

Swirl-ql

• sql where clause like syntax

• supported operators:

• AND / OR

• <, <=, =, >, <>

• IN (x, y) / NOT IN (x, y, z)

• IS NULL / IS NOT NULL (undefined)

* https://github.com/lpgauth/swirl-ql

Swirl-ql

• examples:

• “event IN (‘impression’, ‘click’)”

• “buyer_id IS NOT NULL AND buyer_id <> 3”

• “event = ‘impressions’ AND (buyer_id IN (3, 5) OR buyer_id IS NULL)”
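
To make the "lexer + parser + boolean logic" idea concrete, here is a small sketch of a WHERE-clause subset covering the operators listed on the slides. It is a hand-written recursive-descent parser in Python, not the leex/yecc grammar swirl-ql uses, and its NULL handling (a missing field fails every comparison but satisfies NOT IN) is an assumption of this example.

```python
import operator
import re

KEYWORDS = {"AND", "OR", "IN", "NOT", "IS", "NULL"}
COMPARE = {"<": operator.lt, "<=": operator.le, "=": operator.eq,
           ">": operator.gt, "<>": operator.ne}
TOKEN_RE = re.compile(
    r"\s*(?:(?P<num>\d+)|'(?P<str>[^']*)'|(?P<op><=|<>|<|>|=|\(|\)|,)|(?P<word>\w+))")

def tokenize(text):
    """Lexer: turn the filter string into (kind, value) tokens."""
    text, tokens, pos = text.rstrip(), [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected input: {text[pos:]!r}")
        pos = m.end()
        if m.group("num") is not None:
            tokens.append(("value", int(m.group("num"))))
        elif m.group("str") is not None:
            tokens.append(("value", m.group("str")))
        elif m.group("op"):
            tokens.append(("op", m.group("op")))
        elif m.group("word").upper() in KEYWORDS:
            tokens.append(("kw", m.group("word").upper()))
        else:
            tokens.append(("field", m.group("word")))
    return tokens

class Parser:
    """Grammar: expr := term (OR term)*, term := factor (AND factor)*."""
    def __init__(self, tokens):
        self.tokens, self.i = tokens, 0

    def peek(self):
        return self.tokens[self.i] if self.i < len(self.tokens) else (None, None)

    def take(self, kind, value=None):
        tok = self.peek()
        if tok[0] != kind or (value is not None and tok[1] != value):
            raise SyntaxError(f"expected {value or kind}, got {tok[1]!r}")
        self.i += 1
        return tok[1]

    def accept(self, kind, value):
        if self.peek() == (kind, value):
            self.i += 1
            return True
        return False

    def expr(self):
        node = self.term()
        while self.accept("kw", "OR"):
            node = ("or", node, self.term())
        return node

    def term(self):
        node = self.factor()
        while self.accept("kw", "AND"):
            node = ("and", node, self.factor())
        return node

    def factor(self):
        if self.accept("op", "("):
            node = self.expr()
            self.take("op", ")")
            return node
        field = self.take("field")
        if self.accept("kw", "IS"):
            negated = self.accept("kw", "NOT")
            self.take("kw", "NULL")
            return ("notnull" if negated else "null", field)
        negated = self.accept("kw", "NOT")
        if self.accept("kw", "IN"):
            self.take("op", "(")
            values = [self.take("value")]
            while self.accept("op", ","):
                values.append(self.take("value"))
            self.take("op", ")")
            return ("notin" if negated else "in", field, values)
        op = self.take("op")
        if negated or op not in COMPARE:
            raise SyntaxError(f"unexpected operator {op!r}")
        return ("cmp", op, field, self.take("value"))

def parse(text):
    parser = Parser(tokenize(text))
    ast = parser.expr()
    if parser.peek() != (None, None):
        raise SyntaxError(f"trailing input: {parser.peek()[1]!r}")
    return ast

def evaluate(ast, event):
    """Boolean logic: walk the AST against an event dict (missing = NULL)."""
    kind = ast[0]
    if kind == "and":
        return evaluate(ast[1], event) and evaluate(ast[2], event)
    if kind == "or":
        return evaluate(ast[1], event) or evaluate(ast[2], event)
    if kind in ("null", "notnull"):
        return (event.get(ast[1]) is None) == (kind == "null")
    if kind in ("in", "notin"):
        return (event.get(ast[1]) in ast[2]) == (kind == "in")
    _, op, field, value = ast
    actual = event.get(field)
    return actual is not None and COMPARE[op](actual, value)
```

For instance, `evaluate(parse("buyer_id IS NOT NULL AND buyer_id <> 3"), {"buyer_id": 5})` is true, while the same filter is false for `{"buyer_id": 3}` and for an event with no `buyer_id`. Parsing once and evaluating the tuple-based AST per event keeps the per-event cost to a tree walk.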

Swirl-ql

• leex / yecc for parsing (use lex / yacc doc)

• pattern match ftw!

• use hipe (~200% speed gain in micro benchmarks)

• 0.286 vs 0.097 microseconds *

• experimenting with dynamic compilation

* http://theory.stanford.edu/~sergei/papers/sigmod10-index.pdf

Swirl limitations

• best-effort (hard problem!)

• netsplits

• crash

• in-memory only

Todo

• node discovery

• code distribution

• resource limitation

• better documentation!

Architecture #3 (now!)

• swirl

• bullet (cowboy)

Demo!

* https://github.com/lpgauth/swirl-demo

pssst: we’re hiring!

Thank You!

twitter: lpgauth github: lpgauth