developing a real-time engine with akka, cassandra, and spray

Post on 15-Apr-2017


Developing a Real-time Engine with Akka, Cassandra, and Spray
Jacob Park

What are Paytm Labs and Paytm?
• Paytm Labs is a data-driven lab that tackles difficult problems in fraud, recommendations, ratings, and platforms for Paytm.
• Paytm is the world's fastest growing mobile-first marketplace and payment ecosystem, serving over 100 million people who make over 1.5 million business transactions, representing $1.7 billion of goods and services exchanged annually.

What is Akka?
• Akka (http://akka.io/):
• “Akka is a toolkit and runtime for building highly concurrent, distributed, and resilient message-driven applications on the JVM.”
• Packages: “akka-actor”, “akka-remote”, “akka-cluster”, “akka-persistence”, “akka-http”, and “akka-stream”.

What is Cassandra?
• Cassandra (http://cassandra.apache.org/):
• “The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance.”

What is Spray?
• Spray (http://spray.io/):
• “Spray is an open-source toolkit for building REST/HTTP-based integration layers on top of Scala and Akka.”
• Packages: “spray-caching”, “spray-can”, “spray-http”, “spray-httpx”, “spray-io”, “spray-json”, “spray-routing”, “spray-servlet”.

What is Maquette?
• A real-time fraud rule engine that core operational platforms call synchronously to evaluate fraud.
• Its core technologies include Akka, Cassandra, and Spray.

Why Akka, Cassandra, and Spray?
• Akka, Cassandra, and Spray are highly performant and developer-friendly, treat failures as a first-class concept, and provide great support for clustering, which helps ensure responsiveness, resiliency, and elasticity when creating Reactive Systems.

Maquette in a Nutshell
[Diagram: the Maquette Actor System, made up of the HTTP, Environment, and Executor layers.]

HTTP Layer
• Utilize Spray-Can for a fast HTTP endpoint.
• Utilize Jackson for JSON deserialization/serialization.
• Utilize a separate dispatcher for the Bulkhead Pattern.
• Expose a normalized yet flexible schema for integration.
• Request Handling: Worst → Best
  • Cameo Pattern (Per-request Actor),
  • Ask Pattern (Future),
  • RequestHandlerPool (Akka Router Pool).

HTTP Layer
trait FraudRoute extends BaseRoute with ActorLogging { this: Actor =>

  import SprayJacksonSupportUtils._

  override protected def receiveRequest(
    delegateActorRef: ActorRef,
    parentUriPath: Path
  ): Actor.Receive = {
    case incomingHttpRequest @ HttpRequest(
      HttpMethods.POST,
      requestUri,
      requestHeaders,
      requestEntity,
      requestProtocol
    ) if requestUri.path startsWith parentUriPath =>
      val senderActorRef = sender()

      unmarshalHttpEntityAndDelegateRequest(
        requestEntity,
        delegateActorRef,
        senderActorRef
      )
  }
}

Environment Layer
• A tree of actors responsible for managing a cache or pool of the Contexts and Dependencies required to evaluate incoming requests.
• A Context is a Document Message which wraps configurations for evaluating requests.
• A Dependency is a Document Message which wraps optimized queries to Cassandra.

Environment Layer
• Map incoming requests to a Context by forking a template with .copy().
• Forward the forked Context to the Executor Layer in the same or a different JVM with an Akka Router.
• Consider implementing a custom router to favour locality of execution on the same JVM until responsiveness requires distribution.

Environment Layer
• Always pre-compute and pre-optimize the Environment Layer as a whole.
• Allow Contexts to be pre-computed and updated remotely.
• Design Contexts and Dependencies for optimization by allowing arithmetic reduction or sorts.
• Prefer pairing each EnvironmentActor with a ProxyActor and a StateActor, so that the whole environment stays cached and can be recovered quickly after a failure.

Environment Layer
type EnvironmentStateActorRefFactory =
  (EnvironmentProxyActorContext, EnvironmentProxyActorSelf) => ActorRef
type EnvironmentActorRefFactory =
  (EnvironmentProxyActorContext, EnvironmentProxyActorSelf) => ActorRef

class EnvironmentProxyActor(
  environmentStateActorRefFactory: EnvironmentStateActorRefFactory,
  environmentActorRefFactory: EnvironmentActorRefFactory
) extends Actor with ActorLogging {

  val environmentStateActorRef = environmentStateActorRefFactory(context, self)
  val environmentActorRef = environmentActorRefFactory(context, self)

  override def receive: Receive =
    receiveEnvironmentState orElse
      receiveFraudRequest orElse
      receiveEnvironmentLocalCommand orElse
      receiveEnvironmentRemoteCommand
}

Environment Layer
class EnvironmentStateActor(
  environmentProxyActorRef: ActorRef,
  databaseInstance: Database
) extends Actor with ActorLogging {
  import EnvironmentStateActor._
  import EnvironmentStateFactory._
  import EnvironmentStateLifecycleStrategy._
  import EnvironmentStateRepository._

  var environmentState: Option[EnvironmentState] = None

  override def receive: Receive =
    receiveLocalCommand orElse
      receiveRemoteCommand

  object EnvironmentStateLifecycleStrategy { ... }

  object EnvironmentStateFactory { ... }

  object EnvironmentStateRepository { ... }
}

Environment Layer
class EnvironmentActor(
  environmentProxyActor: ActorRef,
  executorActorRef: ActorRef,
  bootActorRef: ActorRef
) extends Actor with ActorLogging {
  import EnvironmentActor._
  import EnvironmentLifecycleStrategy._

  var environmentState: Option[EnvironmentState] = None

  override def receive: Receive =
    receiveEnvironmentState orElse
      receiveFraudRequest

  def forkedMaquetteContext(fraudRequest: FraudRequest): Option[MaquetteContext] = {
    val forkedMaquetteContextOption = for {
      actualEnvironmentState <- environmentState
      actualBaseMaquetteContext <- actualEnvironmentState.maquetteContextMap
        .get(fraudRequest.evaluationType)
      actualForkMaquetteContext = actualBaseMaquetteContext
        .copy(fraudRequest = fraudRequest)
    } yield actualForkMaquetteContext

    forkedMaquetteContextOption
  }
}
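
The forking step can also be tried in isolation. The following is a minimal, self-contained miniature, not Maquette's actual code: the FraudRequest and MaquetteContext fields, the ContextForking object, and the "payment" template are all hypothetical stand-ins, and the request is modelled as an Option so that the template can exist before any request arrives.

```scala
// Hypothetical, simplified stand-ins for the real Maquette types.
final case class FraudRequest(evaluationType: String, payload: Map[String, String])

final case class MaquetteContext(
  evaluationType: String,
  ruleNames: List[String],            // pre-computed once, shared by every fork
  fraudRequest: Option[FraudRequest]  // empty in the template, filled per request
)

object ContextForking {
  // Pre-computed templates, keyed by evaluation type.
  val templates: Map[String, MaquetteContext] = Map(
    "payment" -> MaquetteContext("payment", List("velocity", "blacklist"), None)
  )

  // Fork the matching template; None means no template is registered for the type.
  def fork(request: FraudRequest): Option[MaquetteContext] =
    templates.get(request.evaluationType).map(_.copy(fraudRequest = Some(request)))
}
```

Because case classes are immutable and copy() only creates a new instance that shares the existing field references, a fork is cheap and leaves the template untouched for the next request.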

Executor Layer
• A pipeline of actors responsible for scheduling execution of Tasks defined within a Context with the specified Dependencies, executing the Tasks, and coordinating the results of the Tasks to provide a response.
• A Task is an optimized set of executable rules.

Executor Layer
• Ideally, the Executor Layer should be stateless to allow easy recovery from failures.
• Ideally, keep the Executor Layer available across the cluster.

Executor Layer
type ExecutorRouterActorRefFactory =
  (ExecutorActorContext, ExecutorActorSelf) => ActorRef
type ExecutorCoordinatorActorRefFactory =
  (ExecutorActorContext, ExecutorActorSender, ExecutorActorNext, MaquetteContext, Timeout) => ActorRef

class ExecutorActor(
  executorRouterActorRefFactory: ExecutorRouterActorRefFactory,
  executorCoordinatorActorRefFactory: ExecutorCoordinatorActorRefFactory,
  actionActorRef: ActorRef
) extends Actor with ActorLogging {
  import ExecutorActor._
  import ExecutorSchedulerStrategy._

  val executorRouterActorRef: ActorRef = executorRouterActorRefFactory(context, self)

  override def receive: Receive =
    receiveMaquetteContext orElse
      receiveMaquetteResult

  object ExecutorSchedulerStrategy {
    def scheduleExecution(maquetteContext: MaquetteContext): Unit = { ... }
  }
}

Executor Layer
• Design a Task as a functional and monadic data structure.
• Following functional programming, a Task should isolate side effects from pure functions.
• As a monad, a Task is easy to optimize: its composition and reduction properties allow a high degree of parallelization.
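
To make the idea concrete, here is a minimal sketch of what a monadic Task could look like. It is not Maquette's Task (which the talk does not show); the Task class, the Dependencies alias, and both rules are hypothetical. The sketch is essentially a reader-style monad: a Task only describes a computation over its input, and nothing runs until run is called.

```scala
// Hypothetical sketch: a Task is a pure description of a computation over
// some input D (for example, pre-fetched Dependencies). Nothing is evaluated
// until `run` is called, so side effects stay outside the Task itself.
final case class Task[D, A](run: D => A) {
  def map[B](f: A => B): Task[D, B] = Task(d => f(run(d)))
  def flatMap[B](f: A => Task[D, B]): Task[D, B] = Task(d => f(run(d)).run(d))
  // Combine two independent Tasks. This sketch evaluates them one after the
  // other; since neither depends on the other, an executor could parallelize them.
  def zip[B](that: Task[D, B]): Task[D, (A, B)] = Task(d => (run(d), that.run(d)))
}

object TaskExample {
  type Dependencies = Map[String, Double]

  // Two hypothetical rules, each reading one value from the dependencies.
  val highAmount: Task[Dependencies, Boolean] = Task(_.getOrElse("amount", 0.0) > 1000.0)
  val manyAttempts: Task[Dependencies, Boolean] = Task(_.getOrElse("attempts", 0.0) > 3.0)

  // Composition through map/flatMap, written as a for-comprehension.
  val suspicious: Task[Dependencies, Boolean] =
    for {
      a <- highAmount
      b <- manyAttempts
    } yield a && b
}
```

Calling TaskExample.suspicious.run(Map("amount" -> 5000.0, "attempts" -> 5.0)) evaluates both rules against the same input and returns true; the composed Task itself remains an immutable value that can be cached or reordered before it is run.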

Executor Layer
case class Query(
  selectComponent: Select,
  fromComponent: From,
  whereComponent: Where
) {
  def + (that: Query): Query = {
    this.copy(selectComponent =
      Select(this.selectComponent.columnNames union that.selectComponent.columnNames)
    )
  }

  def - (that: Query): Query = {
    this.copy(selectComponent =
      Select(this.selectComponent.columnNames diff that.selectComponent.columnNames)
    )
  }
}

Note: An example of a Rule object is not shown as it is a trade secret.
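
One way to read the Query arithmetic is as a reduction: if several rules need columns from the same table and partition, their queries can be folded into one before anything is sent to Cassandra. The sketch below illustrates that reading under stated assumptions. Select, From, and Where are simplified hypothetical stand-ins, the table and column names are invented, and, like the original +, the merge keeps the left-hand from and where components without checking that they match.

```scala
// Simplified, hypothetical stand-ins for the components used by Query.
final case class Select(columnNames: Set[String])
final case class From(tableName: String)
final case class Where(clause: String)

final case class Query(selectComponent: Select, fromComponent: From, whereComponent: Where) {
  // Same semantics as the original `+`: only the selected columns are combined.
  def +(that: Query): Query =
    copy(selectComponent = Select(selectComponent.columnNames union that.selectComponent.columnNames))
}

object QueryReduction {
  private val from = From("transactions")
  private val where = Where("user_id = ?")

  // Three hypothetical rules, each needing a few columns from the same rows.
  val perRule: List[Query] = List(
    Query(Select(Set("amount")), from, where),
    Query(Select(Set("amount", "merchant")), from, where),
    Query(Select(Set("created_at")), from, where)
  )

  // A single merged query stands in for three separate reads.
  val merged: Query = perRule.reduce(_ + _)
}
```

Because set union is associative and commutative, the reduction gives the same result regardless of the order in which the per-rule queries are combined.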

Executor Layer
• For a Task object, consider an external DSL that is interpreted into executable, immutable graphs or even compiled to Java bytecode.
• Scala Parser Combinators: https://github.com/scala/scala-parser-combinators
• Parboiled2: https://github.com/sirthias/parboiled2
• ANTLR: http://www.antlr.org/

Executor Layer
object QueryParser extends JavaTokenParsers {
  def parseQuery(queryString: String): Try[Query] = {
    parseAll(queryStatement, queryString) ...
  }

  object QueryGrammar {
    lazy val queryStatement: Parser[Query] =
      selectClause ~ fromClause ~ opt(whereClause) ~ ";" ^^ {
        case selectComponent ~ fromComponent ~ whereComponent ~ ";" =>
          Query(selectComponent, fromComponent, whereComponent.getOrElse(Where.Empty))
      }
  }

  object SelectGrammar { ... }
  object FromGrammar { ... }
  object WhereGrammar { ... }
  object StaticClauseGrammar { ... }
  object DynamicClauseGrammar { ... }
  object InterpolationTypeGrammar { ... }
  object DataTypeGrammar { ... }
  object LexicalGrammar { ... }
}

Note: An example of a Rule parser is not shown as it is a trade secret.

Abstracting Concurrency for High Parallelism Tasks
• Scala Futures.
• Scala Parallel Collections.
• Akka Router Pool.
• Akka Streams.

Scala Futures
• “A Future is an object holding a value which may become available at some point.”

val f = for {
  a <- Future(10 / 2)
  b <- Future(a + 1)
  c <- Future(a - 1)
  if c > 3
} yield b * c

f foreach println
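
In the for-comprehension above, each generator desugars to a nested flatMap, so the future that computes c is only created after b has completed, even though both depend only on a. A minimal sketch of one way to let the two run concurrently follows; it assumes the global ExecutionContext is acceptable, and ParallelFutures and combined are illustrative names.

```scala
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object ParallelFutures {
  def combined(): Future[Int] =
    Future(10 / 2).flatMap { a =>
      // Both futures are created, and therefore scheduled, before either
      // result is consumed, so they may execute concurrently.
      val fb = Future(a + 1)
      val fc = Future(a - 1)
      for {
        b <- fb
        c <- fc if c > 3
      } yield b * c
    }
}
```

The result is the same as in the original example (a = 5, b = 6, c = 4, so the future completes with 24); only the scheduling differs. Whether the two futures actually overlap still depends on the ExecutionContext, which is the lack of control the talk points out.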

Scala Futures
• Advantages: Efficient, Highly Parallel, Simple Monadic Abstraction.
• Disadvantages: Lacks Communication, Lacks Low-Level Concurrency Control, JVM-bound.
• Note: Monadic Futures Enqueue All Operations to ExecutionContext ⇒ Lack of Control over Context-Switching.

Scala Parallel Collections
• Scala Parallel Collections is a package in the Scala standard library which allows collections to execute operations in parallel.

(0 until 100000).par
  .filter(x => x.toString == x.toString.reverse)

Scala Parallel Collections
• Advantages: Very Efficient, Highly Parallel, Control of Parallelism Level.
• Disadvantages: Lacks Communication, Order-dependent Operations Run Sequentially (foldLeft() and foldRight()), Non-determinism and Side-effect Hazards Hidden by the Abstraction, JVM-bound.

Akka Router Pool
• An Akka Router Pool maintains a pool of child actors to which it forwards messages.
• If an Akka Router Pool is configured with an appropriate dispatcher, mailbox, supervisor, and routing logic, it provides a highly parallel yet elastic construct for executing tasks.

Akka Router Pool
val routerSupervisionStrategy = OneForOneStrategy() {
  case _ => SupervisorStrategy.Restart
}
val routerPool = FromConfig
  .withSupervisorStrategy(routerSupervisionStrategy)
val routerProps = routerPool.props(
  ExecutorWorkerActor.props(accessLayer)
    .withDispatcher(DispatcherConfigPath)
)

context.actorOf(
  props = routerProps,
  name = RouterName
)

Akka Router Pool
• Advantages:
  • Work-Pull Pattern = Rate Limiting.
  • Bounded Mailbox = Backpressure.
  • SupervisionStrategy = Failure.
  • Scheduler = Timeout.
  • Router Resizer = Predictive Parallelism & Scaling.
  • Dispatcher Throughput = Predictive Context Switching.
  • Location Transparency = JVM Unbound.

Akka Router Pool
• Disadvantages:
  • Complex optimizations or implementation required.
  • Stateful actors can introduce problems with mutability and a lack of idempotence.
  • Actors that must communicate beyond their parent-child tree can lead to complex graphs.

Akka Streams
• “Akka Streams is an implementation of Reactive Streams, which is a standard for asynchronous stream processing with non-blocking backpressure.”

implicit val system = ActorSystem("reactive-tweets")
implicit val materializer = ActorMaterializer()

val authors: Source[Author, Unit] =
  tweets
    .filter(_.hashtags.contains(akka))
    .map(_.author)

authors.runWith(Sink.foreach(println))

Akka Streams
• Advantages: Backpressure and Failure as First-class Concepts, Concurrency Control, Simple Monadic Abstraction, Graph API, Bi-directional Channels.
• Disadvantages: Too New = Risk for Production.
• Current: JVM-bound; Potentially: Distributed Streaming.
• Current: No Graph Optimization; Potentially: Macro-based Optimization.

Maquette Performance
• In a staging environment with 10 Cassandra nodes, 4 Maquette nodes, and an HAProxy load balancer: ~40 000 requests per second with a mean response time of 10 milliseconds with 50 rules.
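
As a rough back-of-the-envelope reading of these figures, assuming a steady state and an even balance of load across the four Maquette nodes (neither is stated in the talk), Little's law relates the throughput λ and the mean response time W to the mean number of requests in flight L:

```latex
L = \lambda W = 40\,000\ \text{req/s} \times 0.010\ \text{s} = 400\ \text{requests in flight}
\qquad
\frac{40\,000\ \text{req/s}}{4\ \text{nodes}} = 10\,000\ \text{req/s per node},
\quad
\frac{400}{4} = 100\ \text{requests in flight per node}
```

These are averages only; they say nothing about tail latency or about how the load was actually distributed.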

Tips
• Investigate Akka Streams for Akka HTTP.
• Investigate CPU usage and memory consumption: YourKit or VisualVM and Eclipse MAT.
• Utilize Kamon for real-time metrics to StatsD or a third-party service like Datadog.
• If implementing a DSL or a complex actor-based graph, remember to utilize ScalaTest and Akka TestKit properly.
• Utilize Gatling.io for load and scenario-based testing.

Tips
• We used Cassandra 2.1.6 as our main data store for Maquette. Operating Cassandra caused us many pains.
• Mastering Apache Cassandra (2nd Edition): http://www.amazon.com/Mastering-Apache-Cassandra-Second-Edition-ebook/dp/B00VAG2WZO

Tips
• Investigate the Play Framework with Akka Cluster to create a web application for operations.
  • Commands to operate instances in the cluster.
  • Commands to configure instances in real time.
  • GUI for data scientists and business analysts to easily define and configure rules.

Tips
• Utilize Kafka to publish audits, which can be used to monitor rules through a Logstash, Elasticsearch, and Kibana flow, and archived in HDFS.
• Consider using Kafka to replay audits as requests, so the real-time engine can be run offline for tuning rules.

Resources
• The Reactive Manifesto:
  • http://www.reactivemanifesto.org/
• Reactive Messaging Patterns with the Actor Model:
  • http://www.amazon.ca/Reactive-Messaging-Patterns-Actor-Model/dp/0133846830
• Learning Concurrent Programming in Scala:
  • http://www.amazon.com/Learning-Concurrent-Programming-Aleksandar-Prokopec/dp/1783281413
• Akka Concurrency:
  • http://www.amazon.ca/Akka-Concurrency-Derek-Wyatt/dp/0981531660

Thank you!
Jacob Park
jacob@paytm.com
park.jacob.96@gmail.com
