From Node.js to Scala - with a 100x performance boost
TRANSCRIPT
FROM NODE.JS TO SCALA with a 100x perf. boost!
BY ITAMAR RAVID | MAY 3, 2016
AGENDA
WE’LL TALK ABOUT…
• What we do, our challenges and what led us to Scala and Akka;
• How we redesigned our core data processing service;
• Some useful lessons and patterns.
There will be relatively little node.js bashing. Promise.
BIGPANDA: THE ANSWER TO ALERT FATIGUE
RABBIT IS DOWN! NO FREE SPACE!
INBOUND QUEUE OVERFLOWING! OUTBOUND QUEUE OVERFLOWING!
APPLICATION HEALTH CRITICAL! TOO MANY FAILED HTTP REQS!
rabbit-1, ping
rabbit-2, disk
queue-1, size
queue-2, size
app1, health
app2, 500 codes
RabbitMQ cluster
ping disk
RabbitMQ node 3
queue size queue size
API server
health failed reqs
Correlation Algorithm
CorrelationStage
NormalizationStage
IN TERMS OF STREAMS…
RABBIT IS DOWN! NO FREE SPACE!
INBOUND QUEUE OVERFLOWING! OUTBOUND QUEUE OVERFLOWING!
APPLICATION HEALTH CRITICAL! TOO MANY FAILED HTTP REQS!
Nagios event source
Datadog event source
AppDynamics event source
rabbit-1, ping
rabbit-2, disk
queue-1, size
queue-2, size
app1, health
app2, 500 codes
RabbitMQ cluster
ping disk
RabbitMQ node 3
queue size queue size
API server
health failed reqs
Correlation Algorithm
CHALLENGE 1 SCALING TO MEET CUSTOMER LOAD
HIGH-LEVEL ARCHITECTURE
API servers
API servers
API servers
Normalization Correlation
Correlation
Correlation
RabbitMQ Exchange Normalization
Normalization
RabbitMQ Exchange
Mongo
RabbitMQ Exchange
USAGE OF RABBITMQ
Correlation
Correlation
Correlation
RabbitMQ Cons. Hash
Queue (Customers A, B, C)
Queue (Customers D, E, F)
Queue (Customers X, Y, Z)
Route by hash on customer
DATA FOR A GIVEN CUSTOMER MUST BE PROCESSED SERIALLY,
IN ORDER. SO…
(ALERT) STORMS
MEET REALITY!
Not fun!
A hiccup in a customer’s datacenter => an entire queue is blocked
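A minimal Scala sketch of why this happens, assuming a plain modulo hash in place of the consistent-hash exchange. The queue names and the helpers queueFor and sharesQueueWith are hypothetical and not taken from the original system: every customer hashed to the same queue waits behind the one sending a storm.

```scala
object CustomerRouting {
  // Hypothetical queue names; the slides only show three queues keyed by customer groups.
  val queues: Vector[String] = Vector("queue-0", "queue-1", "queue-2")

  // Simplified stand-in for the consistent-hash exchange: a plain modulo over
  // the customer's hash code. Real consistent hashing uses a hash ring.
  def queueFor(customer: String): String =
    queues(Math.floorMod(customer.hashCode, queues.size))

  // All other customers that land on the same queue as `customer`, i.e. everyone
  // who is delayed when that customer floods the queue.
  def sharesQueueWith(customer: String, all: Seq[String]): Seq[String] =
    all.filter(c => c != customer && queueFor(c) == queueFor(customer))
}
```

With more customers than queues, at least two customers always share a queue, so a storm from one delays the other even though their data is unrelated.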
CHALLENGE 2 CORRELATION PREVIEW
CORRELATION
Same host, 4 hours …
MATCHING RULES
+
INCIDENT rabbit-1
ping disk
rabbit-1, ping, t=5
rabbit-1, disk, t=7
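The matching rule on this slide can be sketched in Scala as follows. This is an illustration only: the Alert and Incident types, the correlate function and the exact rule (same host, within `window` time units of the incident's first alert, with t measured in the same unit as the window) are assumptions, not the production algorithm.

```scala
object Correlation {
  // Hypothetical shapes; the slides only show "host, check, t=..." triples.
  final case class Alert(host: String, check: String, t: Long)
  final case class Incident(host: String, alerts: Vector[Alert])

  // Assumed rule: an alert joins the most recent incident on the same host
  // when it arrives within `window` time units of that incident's first alert;
  // otherwise it opens a new incident. Alerts are processed in time order.
  def correlate(alerts: Seq[Alert], window: Long): Vector[Incident] =
    alerts.sortBy(_.t).foldLeft(Vector.empty[Incident]) { (incidents, alert) =>
      val idx = incidents.lastIndexWhere { i =>
        i.host == alert.host && alert.t - i.alerts.head.t <= window
      }
      if (idx >= 0) {
        val existing = incidents(idx)
        incidents.updated(idx, existing.copy(alerts = existing.alerts :+ alert))
      } else {
        incidents :+ Incident(alert.host, Vector(alert))
      }
    }
}
```

Under these assumptions, the two rabbit-1 alerts at t=5 and t=7 fall into one incident with a window of 4 and into two incidents with a window of 1.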
CORRELATION
MATCHING RULES
+
INCIDENT rabbit-1
ping disk
Same host, 4 hours 30 minutes
rabbit-1, ping, t=5
rabbit-1, disk, t=7
CORRELATION
MATCHING RULES
+
INCIDENT
rabbit-1, ping, t=5
rabbit-1, disk, t=7
Same host, 4 hours 30 minutes
?
A CORRELATION TIME-MACHINE
1 2 3 4 5 6 7 8 9 10 … N
ALERTS WE’RE HERE
START FROM HERE (DC OUTAGE)
Correlation Servers
OFFSETS
THIS MEANS…
REPLAY DETERMINISTIC FAST
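One way to read these three properties, as a hedged Scala sketch: if state is rebuilt by a pure function folded over the stored alerts, then restarting from any offset replays the same events to the same result. Event, State, step and replay are hypothetical names chosen for this illustration.

```scala
object Replay {
  // Hypothetical event and state types for illustration.
  final case class Event(offset: Long, host: String, status: String)
  type State = Map[String, String] // host -> last known status

  // Pure transition function: the same state and event always give the same result.
  def step(state: State, event: Event): State =
    state.updated(event.host, event.status)

  // Replays the log from `fromOffset` (inclusive). Because `step` is pure,
  // replaying the same range always rebuilds the same state.
  def replay(log: Seq[Event], fromOffset: Long, initial: State = Map.empty): State =
    log.iterator.filter(_.offset >= fromOffset).foldLeft(initial)(step)
}
```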
SOLUTIONS!
EXISTING CORRELATION SOLUTION
Processing Stage
Mongo
Rabbit Rabbit Rabbit Rabbit Processing Stage
Processing Stage
PROCESSING STAGE - A NODE.JS CALLBACK.
Shared mutable state
No isolation
No replay
t
DESIRED SOLUTION
Processing Stage Rabbit Rabbit Processing Stage
Mongo
Processing Stage
NODE.JS - PLATFORM LIMITATIONS
HEAP SIZE - LIMITED TO 1.7GB
SINGLE THREADED :-(
TypeError: undefined is not a function
COMPONENTS
DURABLE EVENT STREAM
PLATFORM
COMPUTING FRAMEWORK
ACTOR-BASED SOLUTION
Node Manager
Customer A Pipeline
KafkaReader
Algorithm runner
MongoWriter
RabbitWriter
Customer B Pipeline
Customer C Pipeline
SUPERVISION
MESSAGING
customer_a_inputs
NEXT-GEN SOLUTION
Node Manager
Customer A Pipeline
KafkaReader
Algorithm runner
MongoWriter
RabbitWriter
Customer B Pipeline
Customer C Pipeline
SUPERVISION
MESSAGING
FAILURE
ISOLATION
customer_a_inputs
NEXT-GEN SOLUTION
Node Manager
Customer A Pipeline
KafkaReader
Algorithm runner
MongoWriter
RabbitWriter
Customer B Pipeline
Customer C Pipeline
SUPERVISION
MESSAGING
SEPARATE DISPATCHERS
FOR QOS-TUNING
customer_a_inputs
SCALING OUT
Node 1
ClusterManager
Node Manager
Node 2
Node Manager
Node 3
Node Manager
LESSONS LEARNED
PRUNING AN INFINITE DATA STREAM
1 2 3 4 5 6 7 8 9 10 … N
PRUNING AN INFINITE DATA STREAM
1 2 3 4 5 6 7 8 9 10 … N
t=10, Critical t=8, OK
PRUNING AN INFINITE DATA STREAM
5 6 7 8 9 10 … N
t=8, OK
MISSING ALERTS :-(
PRUNING STREAMS THAT RESULT IN STATE REQUIRES STATE RECOVERY.
PRUNING AN INFINITE DATA STREAM
5 6 7 8 9 10 … N
Snapshot Repository
<data …> lastOffset: 4
<data …> lastOffset: 8
<data …> lastOffset: 10
ON BOOT, THE LATEST SNAPSHOT IS LOADED
AND THE STREAM SEEKS TO THE STORED OFFSET.
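The boot sequence might look like the following Scala sketch. Snapshot, Event, step and recover are hypothetical names, the in-memory sequences stand in for the snapshot repository and the stream, and the stream is assumed to be ordered by offset.

```scala
object SnapshotRecovery {
  // Hypothetical types; the slides only show "<data …> lastOffset: N".
  final case class Event(offset: Long, host: String, status: String)
  final case class Snapshot(state: Map[String, String], lastOffset: Long)

  def step(state: Map[String, String], e: Event): Map[String, String] =
    state.updated(e.host, e.status)

  // On boot: take the snapshot with the highest lastOffset (or an empty one),
  // then apply only the events stored after that offset.
  def recover(snapshots: Seq[Snapshot], stream: Seq[Event]): Snapshot = {
    val latest = snapshots.sortBy(_.lastOffset).lastOption
      .getOrElse(Snapshot(Map.empty, 0L))
    val tail = stream.filter(_.offset > latest.lastOffset)
    val newOffset = tail.lastOption.map(_.offset).getOrElse(latest.lastOffset)
    Snapshot(tail.foldLeft(latest.state)(step), newOffset)
  }
}
```

Since the state up to lastOffset lives in the snapshot, events at or below that offset are no longer needed for recovery and can be pruned from the stream.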
PRUNING AN INFINITE DATA STREAM
CHALLENGES: - COMPACTNESS - SCHEMA EVOLUTION
kryo/chill with a manual de/serializer <=> Map[String, Any]
Schema evolution support with some caveats
Big datasets are only a few MBs in size
KEY TAKEAWAYS
USE SNAPSHOTS TO PRUNE STREAMS
JSON IS NOT THE ONLY SOLUTION!
FAULT-TOLERANCE THROUGH BEHAVIORAL TRAITS
INPUTS
MSG BATCHES
Kafka reader
Algorithm Runner
MongoWriter
RabbitWriter
FAULT-TOLERANCE THROUGH BEHAVIORAL TRAITS
INPUTS
MSG BATCHES 1
PIPELINING BETWEEN STAGES
Kafka reader
Algorithm Runner
MongoWriter
RabbitWriter
FAULT-TOLERANCE THROUGH BEHAVIORAL TRAITS
INPUTS
MSG BATCHES 2 1
PIPELINING BETWEEN STAGES
Kafka reader
Algorithm Runner
MongoWriter
RabbitWriter
FAULT-TOLERANCE THROUGH BEHAVIORAL TRAITS
INPUTS
MSG BATCHES 3 2 1
PIPELINING BETWEEN STAGES
Kafka reader
Algorithm Runner
MongoWriter
RabbitWriter
FAULT-TOLERANCE THROUGH BEHAVIORAL TRAITS
Kafka reader
Algorithm Runner
MongoWriter
RabbitWriter
INPUTS
MSG BATCHES 3 2 1
PIPELINING BETWEEN STAGES
RETRYING
Persistent failure will restart the entire pipeline
FAULT-TOLERANCE THROUGH BEHAVIORAL TRAITS
INPUTS
MSG BATCHES 4 3 2 1
PIPELINING BETWEEN STAGES
RETRYING
Kafka reader
Algorithm Runner
MongoWriter
RabbitWriter
FAULT-TOLERANCE THROUGH BEHAVIORAL TRAITS
INPUTS
MSG BATCHES 4 3 2 1
PIPELINING BETWEEN STAGES
RETRYING
Kafka reader
Algorithm Runner
MongoWriter
RabbitWriter
KEY TAKEAWAYS
CAPTURE COMMON ACTOR BEHAVIOR USING TRAITS
(BUT MAKE SURE THEY COMPOSE!)
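A sketch of the idea in plain Scala, without Akka: Stage stands in for an actor's message handler, and Retrying and Counting are stackable traits built with abstract override. All names here are hypothetical. The sketch also hints at why composition needs care: the mixin order changes what each trait observes.

```scala
import scala.util.{Failure, Success, Try}

object StageTraits {
  // Minimal stand-in for an actor's message handler (no Akka dependency here).
  trait Stage {
    def process(batch: Seq[String]): Try[Int]
  }

  // Stackable behavior: retry the wrapped stage up to maxRetries extra times.
  trait Retrying extends Stage {
    def maxRetries: Int = 3
    abstract override def process(batch: Seq[String]): Try[Int] = {
      var result = super.process(batch)
      var left = maxRetries
      while (result.isFailure && left > 0) {
        result = super.process(batch)
        left -= 1
      }
      result // still a Failure if every attempt failed
    }
  }

  // Stackable behavior: count how many times the wrapped stage was invoked.
  trait Counting extends Stage {
    var calls = 0
    abstract override def process(batch: Seq[String]): Try[Int] = {
      calls += 1
      super.process(batch)
    }
  }

  // Hypothetical writer that fails its first `failures` calls, then succeeds.
  class FlakyWriter(failures: Int) extends Stage {
    private var remaining = failures
    def process(batch: Seq[String]): Try[Int] =
      if (remaining > 0) {
        remaining -= 1
        Failure(new RuntimeException("write failed"))
      } else Success(batch.size)
  }
}
```

With `new FlakyWriter(2) with Counting with Retrying`, Counting sits inside the retry loop and sees every attempt; with `with Retrying with Counting`, it sits outside and sees a single call. A result that is still a Failure after all retries is the case where, per the slides, the entire pipeline would be restarted.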
DEFERRING AND CONTROLLING STATE MUTATION
PREVIOUSLY:
Processing Stage
Mongo
Processing Stage
Processing Stage
HERE BE RACE CONDITIONS!
DEFERRING AND CONTROLLING STATE MUTATION
Algorithm runner
Mongo
Mongo Writer
Instructions
AN INTERPRETER
DEFERRING STATE MUTATION
id1 id2 id1 id1 id2 id2 id1 id2 id1 id1 id2 id2
Mongo
get
set
OPTIMIZE ME!
FOLDING INSTRUCTIONS TO REDUCE I/O
id1 -> inst1 :: inst2 :: inst3 … :: Nil
id2 -> inst1 :: inst2 :: inst3 … :: Nil
Mongo
getMultiple setMultiple
foldLeft(initialObject)(processInstruction)
KEY TAKEAWAYS
DECOUPLE STATE MUTATION FROM PROCESSING
OPTIMIZE STATE MUTATION WHEN INTERPRETING
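The interpreter idea can be sketched in Scala as below. The instruction types and the body of processInstruction are invented for illustration; getMultiple and setMultiple are passed in as plain functions standing in for bulk Mongo calls and are not a real driver API.

```scala
object InstructionFolding {
  // Hypothetical instruction set; the slides do not list the real one.
  sealed trait Instruction { def id: String }
  final case class SetField(id: String, key: String, value: String) extends Instruction
  final case class RemoveField(id: String, key: String) extends Instruction

  type Doc = Map[String, String]

  // Applies a single instruction to a document, returning the new document.
  def processInstruction(doc: Doc, inst: Instruction): Doc = inst match {
    case SetField(_, k, v) => doc.updated(k, v)
    case RemoveField(_, k) => doc - k
  }

  // Interprets a whole batch with one bulk read and one bulk write,
  // instead of a get/set round trip per instruction.
  def interpret(batch: Seq[Instruction],
                getMultiple: Set[String] => Map[String, Doc],
                setMultiple: Map[String, Doc] => Unit): Unit = {
    // groupBy keeps each id's instructions in their original order
    val byId = batch.groupBy(_.id)
    val current = getMultiple(byId.keySet)
    val updated = byId.map { case (id, insts) =>
      val initialObject = current.getOrElse(id, Map.empty[String, String])
      id -> insts.foldLeft(initialObject)(processInstruction)
    }
    setMultiple(updated)
  }
}
```

Because the algorithm only emits instructions and the writer alone touches the store, the I/O pattern can be changed or optimized without touching the processing logic.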
MEASURE!
Dropwizard Metrics + metrics-scala:
KEY TAKEAWAYS
INSTRUMENT AWAY!
FINAL NUMBERS AND BENEFITS
OVERALL RATE IMPROVEMENT:
~ 16 events/s on a single node.js process at peak
1600-2500 events/s on a single pipeline at peak
ISOLATION
Actor-per-Customer; failure isolation
SCALABILITY
More nodes => more actors; reduced I/O
COMPLETE DETERMINISM
Actions determined entirely by Kafka contents; amazing for debugging!
Q&A
WE’RE HIRING!
GROCERY LIST
RabbitMQ - op-rabbit
MongoDB - reactivemongo
Kafka - kafka-clients
Zookeeper - curator
Dependency Injection - scaldi
Logging - log4j2, scala-logging, raven-log4j2
Metrics - Dropwizard Metrics, metrics-scala
Config - Typesafe Config
JSON - play-json
Binary serde - kryo/chill