
How jKool Analyzes Streaming Data in Real Time with DataStax
Charles Rich
VP of Product Management
jKool – jKoolcloud.com

Thank you for joining. We will begin shortly.

http://jkoolcloud.com/

Webinar Housekeeping

• All attendees placed on mute
• Input questions at any time using the online interface

© 2015 jKool, All Rights Reserved. 3

Agenda

• jKool Overview

• jKool Technology

• Challenges

• Why We Selected Cassandra and DataStax

• Demo

jKool Overview


• jKool – Founded 2014 as a spin-off from Nastel Technologies
  – Expertise in building scalable real-time analytics
• Initial Vision
  – Address the big data problems we saw at customers
    • Inability to analyze data fast enough to take action and address problems
    • Too much data – too little time
  – Provide real-time, in-memory analytics (our heritage)
  – Leverage open source
  – SaaS (or on-premises)
  – Simplicity


What is jKool?

A solution to Find and Fix Problems Faster (operational intelligence)

DevOps can use jKool to get real-time diagnostics for entire applications: logs, metrics and transactions.

– Detect anomalies, 2 clicks to root cause
– Discover log, transaction topologies
– Analyze app behavior
– Diagnose and determine causality

• An alternative to Splunk or Elasticsearch
  – Fraction of the cost of Splunk
  – Much easier to use than Elasticsearch


Business Value: Instant Insight

Provide high quality app experiences for customers - Improve customer satisfaction

Enable DevOps to:
– Fix problems faster
  • Faster problem resolution, eliminate false alarms
– Deliver releases sooner
  • Less time patching and more time innovating
– Be proactive
  • Spot trends and prevent problems


Features

• Web-based, mobile-friendly dashboard
  – Designed for simplicity and power
• Real-time & historical visualization
  – Flexible, user-configurable
• Analytics immediately detect outliers
  – Aggregation, summarization and comparison, including count, min, max, avg, bucketing, filtering and Bollinger Bands
• Ease of use
  – Talk to your data using an English-like query language
• Scale to handle the largest volumes of data
  – NoSQL architecture provides elastic scalability
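The Bollinger analytic listed above can be illustrated with a short sketch. It assumes the conventional definition (a moving average plus or minus k standard deviations over a trailing window) and is not jKool's implementation; the function names, the window size, k and the sample data are all made up for illustration.

```python
from statistics import mean, pstdev

def band(chunk, k=2.0):
    """Lower and upper Bollinger band for one window of values."""
    mid = mean(chunk)
    spread = k * pstdev(chunk)
    return mid - spread, mid + spread

def outliers(values, window=5, k=2.0):
    """Indexes of points that fall outside the band of the preceding window."""
    flagged = []
    for i in range(window, len(values)):
        lower, upper = band(values[i - window:i], k)
        if not lower <= values[i] <= upper:
            flagged.append(i)
    return flagged

# Hypothetical response times in milliseconds with one obvious spike.
latencies = [100, 102, 98, 101, 99, 100, 250, 101]
print(outliers(latencies))  # the spike at index 6 is the only point flagged
```

Because the band is computed from the window that precedes each point, a single spike is flagged without needing a fixed threshold chosen in advance.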


jKool Does Machine Data

• Sequence, Order, Group, Store

• Relationships

• Compute Timing

• Summarization, comparisons

• Triggers based on continuous queries (CEP)
  – Subscribe to events min elapsedtime, avg elapsedtime, max elapsedtime where eventname="Buy" show as linechart
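To make the subscription above concrete, here is a rough sketch of what evaluating such a continuous query could involve: aggregates that are updated as each matching event arrives rather than recomputed from storage. The field names `eventname` and `elapsedtime` come from the query itself; the `RunningStats` class and the sample events are hypothetical and say nothing about how jKool's engine is actually built.

```python
class RunningStats:
    """Incrementally maintained min/avg/max for one continuous query."""

    def __init__(self, event_name):
        self.event_name = event_name
        self.count = 0
        self.total = 0.0
        self.minimum = None
        self.maximum = None

    def on_event(self, event):
        """Update the aggregates if the event matches the filter."""
        if event.get("eventname") != self.event_name:
            return False
        elapsed = event["elapsedtime"]
        self.count += 1
        self.total += elapsed
        self.minimum = elapsed if self.minimum is None else min(self.minimum, elapsed)
        self.maximum = elapsed if self.maximum is None else max(self.maximum, elapsed)
        return True

    def snapshot(self):
        """Current values, as they would be pushed to a subscriber."""
        avg = self.total / self.count if self.count else None
        return {"min": self.minimum, "avg": avg, "max": self.maximum}

stats = RunningStats("Buy")
stream = [
    {"eventname": "Buy", "elapsedtime": 120},
    {"eventname": "Sell", "elapsedtime": 900},  # ignored by the filter
    {"eventname": "Buy", "elapsedtime": 80},
    {"eventname": "Buy", "elapsedtime": 100},
]
for event in stream:
    if stats.on_event(event):
        print(stats.snapshot())  # one updated data point per matching event
```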


Real-time, In-Memory Analytics

jKool Analyzes Time-Series Data

Technology

• Elastic Architecture
  – Linear scalability
  – Highly extensible
  – Fast, in-memory analysis
• Open Source
  – NoSQL DB, tools and instrumentation
  – No schema to maintain
• FatPipes
  – Micro-services for ultimate flexibility, change and configuration


RESTful


Key to Real-time Analytics

• Process streams as they arrive while avoiding I/O on the analysis path
  – Streams are split into a real-time queue and a persistence queue, with eventual consistency
• Both queues have to be processed in parallel
  – Writing to the persistence layer first and analyzing afterwards will not achieve near real-time processing
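The split described above can be sketched in a few lines. This is only an illustration of the idea, assuming two in-process queues and two worker threads; a real deployment would use the messaging and clustering layers discussed later, and every name here (`fan_out`, `analyze`, `persist`, the event fields) is hypothetical.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker that the stream has ended

def fan_out(events, realtime_q, persistence_q):
    """Put every incoming event on both queues, then signal end of stream."""
    for event in events:
        realtime_q.put(event)
        persistence_q.put(event)
    realtime_q.put(STOP)
    persistence_q.put(STOP)

def analyze(q, results):
    """Real-time path: works purely in memory and never waits on storage."""
    total = 0
    while (event := q.get()) is not STOP:
        total += event["elapsedtime"]
    results["total_elapsed"] = total

def persist(q, store):
    """Persistence path: appending to a list stands in for a database write."""
    while (event := q.get()) is not STOP:
        store.append(event)

events = [{"id": i, "elapsedtime": 10 * i} for i in range(1, 6)]
realtime_q, persistence_q = queue.Queue(), queue.Queue()
results, store = {}, []

workers = [
    threading.Thread(target=analyze, args=(realtime_q, results)),
    threading.Thread(target=persist, args=(persistence_q, store)),
]
for w in workers:
    w.start()
fan_out(events, realtime_q, persistence_q)
for w in workers:
    w.join()

print(results["total_elapsed"], len(store))
```

The analysis worker never blocks on the persistence worker, so a slow write only delays when the stored copy catches up, which is where the eventual consistency comes from.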


Why clustered computing platforms?

• STORM paired with Kafka/JMS and CEP
  – Clustered way to process incoming real-time streams
    • STORM handles clustering/distribution
    • Kafka/JMS for messaging between grids
  – Split streaming workload across the cluster
  – Achieve linear scalability for incoming real-time streams
• Apache Spark (alternative to MapReduce)
  – For distributing queries and trend analysis
  – Micro-batching for historical analytics
  – Loading large datasets into memory (across different nodes)
  – Running queries against large datasets

Web Interface: DevOps Application Owner



Challenges: Meeting our Objectives

• Store everything, analyze everything…
• Combined real-time & historical analytics
• Fast response, flexible query capabilities
  – Aimed at business users
  – Insulate us from underlying software
  – Hide complexity
• Scale for ingesting data-in-motion
• Scale for storing data-at-rest
• Elasticity & operational efficiency
• Ease of monitoring & management


Challenges: What we experienced

• So many technology options (…so little time…)
  – Deciding on the right combination is key early on
• Cassandra/Solr deployment (it was a learning experience for us)
  – Lots of configuration, memory management, replication options
• Monitoring, managing clusters
  – Cassandra/Solr, STORM, ZooKeeper, messaging
  – Leverage parent company’s AutoPilot technology
• Achieving near real-time analytics proved extremely challenging – but we did it!
  – Keeping track of latencies across the cluster
  – Estimating computational capacity required to crunch incoming streams


Challenges: DB was the bottleneck

• Needed high performance DB platform

• SQL (Oracle, MySQL, etc.)
  – No scale. We have had a lot of experience with our customers’ issues with this at our parent company, Nastel…
  – RAM was “the” bottleneck. Commits take too long, and while that is happening everything else stops
• NoSQL
  – Cassandra/Solr (DSE)
  – Hadoop/MapReduce
  – MongoDB
• Clustered Computing Platforms
  – STORM
  – MapReduce
  – Spark (we learned about this while building jKool)


Why we chose Cassandra/Solr?

• Pros:
  – Simple to set up & scale for clustered deployments
  – Scalable, resilient, fault-tolerant (easy replication)
  – Ability to have data automatically expire (TTL – necessary for our pricing model)
  – Configurable replication strategy
  – Great for heavy write workloads
    • Write performance was better than Hadoop
    • Insert rate was of paramount importance for us: getting data in as fast as possible was our goal
    • Java driver balances the load amongst the nodes in a cluster for us (master-slave would never have worked for us)
  – Solr provides a way to index all incoming data (essential)
  – DSE provides a nice integration between Cassandra and Solr
• Cons:
  – Susceptible to GC pauses (memory management)
    • The more memory, the more GC pauses
    • Less memory and more nodes seems a better approach than one big “honking” server (we see 6-8 GB as optimal, so far)
  – Data compaction tasks may hang
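The TTL point in the list above is easy to picture with a toy model. In Cassandra the expiry is requested per write in CQL with `USING TTL <seconds>`; the sketch below only mimics that behaviour in memory so the semantics are visible. The `TTLStore` class, the injected clock and the one-day and seven-day retention periods are hypothetical and are not taken from jKool's pricing model.

```python
class TTLStore:
    """Toy key-value store where each entry expires after its TTL."""

    def __init__(self, clock):
        self._clock = clock  # callable returning the current time in seconds
        self._rows = {}      # key -> (value, expires_at)

    def insert(self, key, value, ttl_seconds):
        """Store a value that stops being readable after ttl_seconds."""
        self._rows[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        """Return the value, or None if it is missing or has expired."""
        row = self._rows.get(key)
        if row is None:
            return None
        value, expires_at = row
        if self._clock() >= expires_at:
            del self._rows[key]  # lazily drop the expired entry
            return None
        return value

DAY = 86400
now = [0]  # mutable fake clock so the example does not have to sleep
store = TTLStore(clock=lambda: now[0])
store.insert("event-1", {"eventname": "Buy"}, ttl_seconds=1 * DAY)
store.insert("event-2", {"eventname": "Sell"}, ttl_seconds=7 * DAY)

now[0] = 2 * DAY  # two days later only the seven-day entry is still readable
print(store.get("event-1"), store.get("event-2"))
```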



Why not Hadoop MapReduce?

• MapReduce too slow for real-time workloads
  – OK for batch, not so great for real time
  – Needs to be paired with other technologies for query (Hive/Pig)
  – Complex to set up, run and operate
• Our goals were simplicity first…
• Opted for STORM/Spark wrapped with FatPipes, our own micro-services platform, instead of the MapReduce functionality


Why we chose Cassandra/Solr vs. Mongo?

• Why not Mongo?
  – Global write-lock performance concerns…
• Cassandra/Solr
  – Java-based (our project was in Java)
  – Easy to scale and replicate data
  – Flexible read & write consistency levels (ALL, QUORUM, ANY, etc.)
  – Did we say Java? Yes. (We like Java…)
• Flexible choice of platform coverage
  – Great for time-series data streams (market focus for jKool)
• Inherent query limitations in Cassandra solved via Solr integration (provided with DSE, as mentioned earlier)
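The consistency levels mentioned above trade latency against how many replicas must acknowledge an operation. As a small worked example, Cassandra's QUORUM level requires a majority of replicas, floor(RF / 2) + 1, and a read is guaranteed to overlap the latest write when the replicas read plus the replicas written exceed the replication factor. The helper names and the replication factor of 3 below are illustrative assumptions, not jKool's configuration.

```python
def quorum(replication_factor):
    """Replicas that must respond at QUORUM: a majority of the replicas."""
    return replication_factor // 2 + 1

def reads_see_latest_write(read_replicas, write_replicas, replication_factor):
    """True when every read set overlaps every write set (R + W > RF)."""
    return read_replicas + write_replicas > replication_factor

rf = 3  # hypothetical replication factor
q = quorum(rf)
print(q)                                  # 2 of 3 replicas
print(reads_see_latest_write(q, q, rf))   # QUORUM reads + QUORUM writes overlap
print(reads_see_latest_write(1, 1, rf))   # ONE + ONE is faster but may read stale data
```

For a write-heavy workload, a lower write consistency level keeps inserts fast at the cost of a weaker guarantee on reads.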


What we learned

• Consider your application
  – Read-heavy or write-heavy? Both?
• Evaluate performance of course, but consider the user
  – We needed simplicity: setup and scale (us and end user)
  – We needed reliability – not planning on targeting data engineers
  – We needed auto pruning (TTL)
  – We needed easy search
• DSE had this…the others did not provide all of this
  – We chose DSE.


jKool in Real Time – A Live Demo

Thank you!

Input questions at any time using the online interface

More information on jKool at: jKoolCloud.com
