[strata] sparkta

Post on 18-Jul-2015

SPARKTA

A real-time analytics platform based on Apache Spark

London, May 2015

FIRST SPARK PLATFORM. APR 2014

20+ INTERNATIONAL PROJECTS WITH SPARK

1. PLATFORM OVERVIEW

[Diagram: Stratio platform architecture, built up over several slides]

Data sources: CRM, ERP, Call Center, BI; internal data, external data

Applications: BI, ad hoc app

Interfaces: ODBC, JDBC, REST API

Storage: HDFS, S3, ElasticSearch, MongoDB, Cassandra, Redis, Oracle, DB2, other databases

Customer lake

STRATIO INGESTION: Ingests, transforms

STRATIO STREAMING: Analyzes and processes real-time streaming

STRATIO CROSSDATA: A unified SQL interface. Consult & analyze.

STRATIO QUANTUM: Machine learning and algorithms

STRATIO DEEP: Processes & combines with Spark. Combines with Spark data from any source.

STRATIO DATAVIS: Creates and designs dashboards and reports

Technologies labeled on the diagram: Streaming, Apache Kite, Apache Flume, MLlib

Data combination through time

Real-time: ephemeral tables

Past: stored tables

Future: Quantum tables

INFORMATIONAL + OPERATIONAL WITHOUT NEED TO REPLICATE DATA

[Diagram labels: OPERATIONAL; Oracle, DB2, other databases, MongoDB, Teradata]

2. REAL-TIME: Beyond cool dashboards

The time is NOW

We all know this story already

Social media and networking sites are a part of the fabric of everyday life, changing the way the world shares and accesses information.

The overwhelming amount of information gathered not only from messages, updates and images but also readings from sensors, GPS signals and many other sources was the origin of a (big) technological revolution.

Remember? VOLUME, VARIETY & VELOCITY

Look at these sexy infographics!

We all love data visualization

Insights from this vast amount of data allow us to learn from users and explore our own world.

We can follow in real-time the evolution of a topic, an event or even an incident just by exploring aggregated data.

Delivering real-time business on the Internet

But beyond cool visualizations, there are some core services delivered in real-time, using aggregated data to answer common questions as fast as possible.

These services are the heart of the business behind their nice logos.

Site traffic, user engagement monitoring, service health, APIs, internal monitoring platforms, real-time dashboards

Aggregated data feeds directly to end users, publishers, and advertisers, among others.

Pushing business processes to perform faster

Digital companies, born to deliver their services in real-time, have changed the expectations of many other businesses.

Real-time information makes it possible for a company to be much more agile than its competitors, improving its business responses and gaining insights into its performance.

Listen to your data

[Diagram: data sources around the CLIENT: POS terminal, accounts, loans and credits, insurance, broker, mortgages, cards, deposits, ATM, online gateway, application logs, social networks, transactions, geolocation, CRM]

Whereas business intelligence is data gathered for the purpose of analyzing trends over time, operational intelligence provides a picture of what is currently happening within a process.

And we can listen to almost everything! Orders, transactions, clicks, calls, bookings, internal services...

...and start delivering real-time services

Real-time monitoring could be really nice, but your company needs to work in the same way as digital companies:

Rethinking existing processes to deliver them faster, better.

Creating new opportunities for competitive advantages.

2. REAL-TIME: Challenges at Stratio

Real-time fraud monitoring

[Diagram labels: DATA RECEIVER, REAL-TIME AGGREGATION, CONSOLIDATION, dashboarding, reporting, FRAUD DETECTION]

Leveraging the power of Spark Streaming, we have developed some fraud detection solutions, aggregating data in real-time to work better with machine learning algorithms.

Extract, Transform and Aggregate

By combining Apache Flume and Spark Streaming we have deployed complex topologies to deal with data coming from heterogeneous sources.

The full solution allows us to transform and aggregate data on the fly (data cleaning, normalization and enrichment).

[Diagram labels: REAL-TIME AGGREGATION, dashboarding, reporting]
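The clean, normalize, enrich and aggregate steps named on this slide can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the Flume and Spark Streaming topology used in the projects: the event fields (`user`, `amount`, `ip`), the IP-prefix lookup table and all function names are assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical lookup table used for enrichment (not from the talk).
COUNTRY_BY_IP_PREFIX = {"10.": "internal", "81.": "ES", "17.": "US"}


def clean(event):
    """Data cleaning: return None for events missing mandatory fields."""
    if not event.get("user") or event.get("amount") is None:
        return None
    return event


def normalize(event):
    """Normalization: trimmed lower-case user name, amount as float."""
    return {**event,
            "user": event["user"].strip().lower(),
            "amount": float(event["amount"])}


def enrich(event):
    """Enrichment: attach a country derived from the IP prefix."""
    ip = event.get("ip", "")
    country = next((c for prefix, c in COUNTRY_BY_IP_PREFIX.items()
                    if ip.startswith(prefix)), "unknown")
    return {**event, "country": country}


def aggregate(events):
    """Aggregate event count and total amount per country, one event at a time."""
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for raw in events:
        event = clean(raw)
        if event is None:
            continue  # unusable event, skip it
        event = enrich(normalize(event))
        bucket = totals[event["country"]]
        bucket["count"] += 1
        bucket["amount"] += event["amount"]
    return dict(totals)
```

Because every event is processed independently before it is folded into the running totals, the same functions could in principle be applied to each record of a stream rather than to a finished list.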

Custom data sources and storage

Each project requires specific inputs and data stores, dealing with different kinds of events.

From click stream activity to bank transactions...

[Diagram labels: DATA STREAM, LOADING, TRANSFORM, CUSTOM LOGS]

Towards a generic real-time aggregation platform

At Stratio, we have implemented several real-time analytics projects based on Apache Spark, Kafka, Flume, Cassandra, or MongoDB.

These technologies were always a perfect fit, but soon we found ourselves writing the same pieces of integration code over and over again.

This is how SPARKTA was born.

3. ELSEWHERE

#1 RainBird from Twitter

Some folks from Twitter shared some thoughts about their real-time needs at Strata (2011).

They worked on a generic platform in order to deal with pre-calculated data from a huge number of events.

It allows them to deal with:

Data structures

Hierarchical aggregation

Temporal aggregation

Multiple formulas

http://goo.gl/ykvQa

CURRENT STATE: Still not open source
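To make "temporal aggregation" concrete, here is a minimal in-memory sketch of pre-calculated counters kept at several time granularities. It is not RainBird's code or API, which was never published; the class name `TemporalCounter`, the three granularities and the bucket formats are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime, timezone

# Granularities and the strftime pattern that names each bucket
# (an illustrative choice, not taken from the talk).
GRANULARITIES = {
    "minute": "%Y-%m-%dT%H:%M",
    "hour": "%Y-%m-%dT%H",
    "day": "%Y-%m-%d",
}


class TemporalCounter:
    """Keeps one pre-calculated counter per (key, granularity, bucket)."""

    def __init__(self):
        self.counts = Counter()

    def record(self, key, timestamp, value=1):
        # A single event updates every granularity at write time,
        # so reads never have to scan raw events.
        for name, pattern in GRANULARITIES.items():
            self.counts[(key, name, timestamp.strftime(pattern))] += value

    def get(self, key, granularity, timestamp):
        bucket = timestamp.strftime(GRANULARITIES[granularity])
        return self.counts[(key, granularity, bucket)]
```

The design trades extra writes (one per granularity) for constant-time reads, which is the usual reason to pre-calculate aggregates when the number of events is huge.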

#2 Countandra

Countandra is a hierarchical distributed counting engine that exploits the excellent write and read performance of Cassandra.

It supports:

Geographically distributed counting.

Easy HTTP-based interface to insert counts.

Hierarchical counting such as com.mywebsite.music.

Retrieves counts, sums and squares in near real-time.

Simple HTTP queries provide the desired output in JSON format.

Queries can be sliced by period, such as lasthour, lastyear and so on, for minutely, hourly, daily and monthly values.

https://github.com/milindparikh/Countandra

CURRENT STATE: Rather deprecated
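Hierarchical counting on dotted keys such as com.mywebsite.music can be sketched as follows. This is a toy in-memory version, not Countandra's implementation or its HTTP interface: the class `HierarchicalCounter` and its methods are hypothetical, and "squares" is assumed here to mean the running sum of squared values.

```python
from collections import defaultdict


class HierarchicalCounter:
    """Every insert on a dotted key also updates all of its parent prefixes."""

    def __init__(self):
        # prefix -> [count, sum, sum of squares]
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])

    @staticmethod
    def prefixes(key):
        """'com.mywebsite.music' -> 'com', 'com.mywebsite', 'com.mywebsite.music'."""
        parts = key.split(".")
        return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

    def insert(self, key, value=1.0):
        for prefix in self.prefixes(key):
            entry = self.stats[prefix]
            entry[0] += 1
            entry[1] += value
            entry[2] += value * value

    def query(self, key):
        """Return a JSON-friendly dict; unknown keys yield zeros."""
        count, total, squares = self.stats.get(key, (0, 0.0, 0.0))
        return {"count": count, "sum": total, "squares": squares}
```

Keeping the count, the sum and the sum of squares together is enough to derive a mean and a variance for any level of the hierarchy without storing the individual values.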

#3 ThunderRain from Intel

ThunderRain is a Real-Time Analytical Processing (RTAP) example using Spark and Shark, which can be best characterized by the following four salient properties:

Data continuously streamed in & processed in near real-time

Real-time data queried and presented in an online fashion

Real-time and history data combined and mined interactively

Predominant RAM-based processing

https://github.com/thunderain-project/thunderain

CURRENT STATE: Rather deprecated

#4 TSAR from Twitter

TSAR (the TimeSeries AggregatoR) is a flexible, reusable, end-to-end service architecture on top of Summingbird.

Twitter really needs a truly robust real-time aggregation service considering their scaling and evolving needs.

They realized that many time-series applications call for essentially the same architecture, with only slight variations in the data model.

https://blog.twitter.com/2014/tsar-a-timeseries-aggregator

CURRENT STATE: Still not open source

Towards a generic real-time aggregation platform

Some initiatives have tried to solve this problem, but until now most of them were complex or obsolete while others were not open source.

For this reason, Stratio created SPARKTA: an open source and full-featured platform for real-time analytics, based on Apache Spark.

This is why SPARKTA was conceived.

4. THIS