BI, Reporting and Analytics on Apache Cassandra 27/10/2015 Victor Coustenoble Solutions Engineer [email protected] @vizanalytics

Upload: victor-coustenoble

Post on 16-Apr-2017


Page 1: BI, Reporting and Analytics on Apache Cassandra

BI, Reporting and Analytics on Apache Cassandra

27/10/2015

Victor Coustenoble, Solutions Engineer
[email protected]
@vizanalytics

Page 2: BI, Reporting and Analytics on Apache Cassandra

Agenda

• DataStax & Apache Cassandra
• Data Modeling and CQL
• Data Access
• Reporting and Analytics
• DataStax Enterprise Analytics
• Architectures
• Hadoop + Cassandra use cases

©2014 DataStax Confidential. Do not distribute without consent.

Page 3: BI, Reporting and Analytics on Apache Cassandra

DataStax & Apache Cassandra

Page 4: BI, Reporting and Analytics on Apache Cassandra

Elevator Pitch

“DataStax delivers Apache Cassandra in a database platform purpose-built for the performance and availability demands of Web, Mobile, and IoT applications, giving enterprises a secure, always-on database that remains operationally simple when scaled in a single datacenter or across multiple datacenters and clouds.”

Page 5: BI, Reporting and Analytics on Apache Cassandra

Page 6: BI, Reporting and Analytics on Apache Cassandra

No Vertical Market Concentration

Page 7: BI, Reporting and Analytics on Apache Cassandra

Functional use cases

Messaging

Collections/Playlists

Fraud detection

Recommendation/ Personalization

Internet of things/ Sensor data

Page 8: BI, Reporting and Analytics on Apache Cassandra

Apache Cassandra™

• Massively scalable, open source, NoSQL, distributed database built for modern, mission-critical online applications
• Written in Java; a hybrid of Amazon Dynamo and Google BigTable
• Masterless with no single point of failure
• Distributed and data center aware
• 100% uptime
• Predictable scaling
• High performance
• Multi data center
• Time series
• Tunable consistency
• Simple to operate
• CQL language
• OpsCenter / DevCenter

BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

Page 9: BI, Reporting and Analytics on Apache Cassandra

Data Modeling and CQL

Page 10: BI, Reporting and Analytics on Apache Cassandra

Data Modeling

Cassandra is not like well-known RDBMS systems:
• Not a relational model
• No foreign keys, no joins, no aggregations
• Modeling is guided by the queries to be supported: data access paths and actions (filtering, grouping and ordering needs)

Denormalisation
• Combine columns from different tables into a single table (“materialized view”), no joins!
• Better performance, less data traffic
• Don’t be afraid to duplicate data or to write data
• Avoid joins at client level
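The query-first idea can be illustrated with a small in-memory sketch. This is a toy model, not Cassandra code: the class, the two dictionary "tables" and their names are hypothetical, chosen only to show one logical fact being written once per query it must serve.

```python
from collections import defaultdict


class DenormalisedPlayers:
    """Two toy 'tables', each shaped for exactly one query (hypothetical names)."""

    def __init__(self):
        self.players_by_team = defaultdict(dict)   # team_name -> {player_name: jersey}
        self.players_by_jersey = defaultdict(set)  # jersey -> {(team_name, player_name)}

    def add_player(self, team_name, player_name, jersey):
        # One logical fact, written once per query it has to serve.
        self.players_by_team[team_name][player_name] = jersey
        self.players_by_jersey[jersey].add((team_name, player_name))

    def players_of(self, team_name):
        """Query 1: a single lookup by team, no join."""
        return dict(self.players_by_team.get(team_name, {}))

    def wearing(self, jersey):
        """Query 2: a single lookup by jersey number, no join."""
        return sorted(self.players_by_jersey.get(jersey, set()))
```

The price of the duplication is on the write path: changing a jersey number would mean updating both structures, which is the trade-off the slide asks you not to be afraid of.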

Page 11: BI, Reporting and Analytics on Apache Cassandra

Cassandra Data Model

• Based on Google Bigtable
• Row-oriented column family
• De-normalised

CREATE TABLE sporty_league (
  team_name varchar,
  player_name varchar,
  jersey int,
  PRIMARY KEY (team_name, player_name)
);

SELECT * FROM sporty_league;

The primary key uniquely identifies a row. A composite primary key consists of:
• A partition key
• One or more clustering columns

e.g. PRIMARY KEY (partition key, clustering columns, ...)

• The partition key determines on which node the partition resides
• Data is ordered by clustering column within the partition
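A rough way to picture the two roles of the primary key is the following in-memory sketch. It is a toy model under stated assumptions: the three node names are hypothetical, and hashing with MD5 modulo the node count merely stands in for Cassandra's partitioner and token ring, which work differently in detail.

```python
import hashlib
from bisect import insort
from collections import defaultdict

# Hypothetical three-node cluster; a real cluster uses a token ring.
NODES = ["node-a", "node-b", "node-c"]


def node_for(partition_key: str) -> str:
    """Pick a node by hashing the partition key (toy stand-in for the partitioner)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    token = int.from_bytes(digest[:8], "big")
    return NODES[token % len(NODES)]


class ToySportyLeague:
    """In-memory model of PRIMARY KEY (team_name, player_name)."""

    def __init__(self):
        # partition key -> rows kept sorted by the clustering column
        self._partitions = defaultdict(list)

    def insert(self, team_name: str, player_name: str, jersey: int) -> None:
        rows = self._partitions[team_name]
        # Same primary key means same row: an insert overwrites (upsert).
        rows[:] = [row for row in rows if row[0] != player_name]
        insort(rows, (player_name, jersey))

    def select(self, team_name: str):
        """Return one partition, already ordered by player_name."""
        return list(self._partitions.get(team_name, []))


league = ToySportyLeague()
league.insert("PSG", "Zlatan", 10)
league.insert("PSG", "Cavani", 9)
print(node_for("PSG"), league.select("PSG"))
```

All rows of one team live together and come back sorted by player name, which is why a query restricted to a single partition key is cheap.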

Page 12: BI, Reporting and Analytics on Apache Cassandra

CQL – Cassandra Query Language

• Data types: BLOB, UUID, TIMEUUID, User Defined Type …
• User Defined Functions, User Defined Aggregates
• Collections: Map, List, Set
• TTL (Time-To-Live) at column level
• Counters
• Lightweight Transactions (LWT): solve race conditions with IF NOT EXISTS
• Batch statements
• Secondary Index

• Very similar to RDBMS SQL syntax
• Core DML and DDL commands supported: INSERT, UPDATE, DELETE, SELECT, CREATE, GRANT …

INSERT INTO sporty_league (team_name, player_name, jersey) VALUES ('PSG', 'Zlatan', 10);
SELECT player_name AS nom_joueur FROM sporty_league WHERE team_name = 'PSG';

DevCenter
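The race condition that IF NOT EXISTS addresses can be sketched with a toy in-memory table. This is only an illustration: the class and method names are hypothetical, and a single Python process hides the hard part, which in Cassandra is getting the replicas to agree on the check-and-write.

```python
class ToyUsers:
    """Toy table keyed by username, contrasting a plain insert with a conditional one."""

    def __init__(self):
        self._rows = {}

    def insert(self, username, email):
        # Plain INSERT behaves as an upsert: the last write silently wins.
        self._rows[username] = email

    def insert_if_not_exists(self, username, email) -> bool:
        """Conditional insert; the return value mirrors the [applied] result column."""
        if username in self._rows:
            return False
        self._rows[username] = email
        return True

    def get(self, username):
        return self._rows.get(username)


users = ToyUsers()
print(users.insert_if_not_exists("zlatan", "first@example.com"))
print(users.insert_if_not_exists("zlatan", "second@example.com"))
```

With the conditional form, the second client learns that its write was not applied instead of overwriting the first registration.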

Page 13: BI, Reporting and Analytics on Apache Cassandra

Data Access

Page 14: BI, Reporting and Analytics on Apache Cassandra

Cassandra Data Access

CQL language via cqlsh (command line), DevCenter (development environment) or drivers

• Drivers using the Cassandra native protocol
• CQL COPY command
• Import/export tools for bulk loading
• Connectors in ETL solutions (Talend, Informatica)
• Via the analytics layers, Spark and Hadoop
• Via ODBC/JDBC drivers

Page 15: BI, Reporting and Analytics on Apache Cassandra

Cassandra Clients - Native Driver

DataStax drivers available and supported: Java, Python, C#, C++, Ruby, Node.js, PHP (many more to come, like Scala, Go…)

This includes:
• Load balancing
  • Data centre aware
  • Latency aware
  • Token aware
• Reconnection policies
• Retry policies
  • Downgrading consistency
• Plus others…
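The idea behind a "downgrading consistency" retry can be sketched with a little arithmetic. The quorum formula (replication factor divided by two, rounded down, plus one) is standard; the `downgrade` function below is a simplified, hypothetical illustration of the concept and not the drivers' actual policy code.

```python
def quorum(replication_factor: int) -> int:
    """Number of replicas that form a quorum."""
    return replication_factor // 2 + 1


def required_replicas(level: str, replication_factor: int) -> int:
    """Replicas that must answer for a given consistency level."""
    return {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }[level]


def downgrade(level: str, replication_factor: int, alive_replicas: int):
    """Keep the requested level if reachable, else fall back to a weaker one.

    Returns None when no replica is alive (the request cannot be served).
    """
    if alive_replicas >= required_replicas(level, replication_factor):
        return level
    for candidate in ("THREE", "TWO", "ONE"):
        if alive_replicas >= required_replicas(candidate, replication_factor):
            return candidate
    return None


print(quorum(3), downgrade("QUORUM", 3, 1))
```

Downgrading trades consistency for availability, so it only makes sense for requests where a possibly stale or less durable answer is acceptable.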

Page 16: BI, Reporting and Analytics on Apache Cassandra

ODBC / JDBC Connections

ODBC drivers
• For SparkSQL (SQL engine on Spark), via the SparkSQL JDBC/ODBC Thrift server
• For Hive (Hadoop SQL engine)
• For Cassandra directly (ANSI SQL or CQL queries)

JDBC drivers
• For SparkSQL (SQL engine on Spark), via the SparkSQL JDBC/ODBC Thrift server
• For Cassandra directly (in progress)
• Community JDBC drivers exist but are not officially supported

Page 17: BI, Reporting and Analytics on Apache Cassandra

Reporting & Analytics

Page 18: BI, Reporting and Analytics on Apache Cassandra

Real-Time / Operational Analytics Use Cases

• Recommendation Engine
• Internet of Things
• Fraud Detection
• Risk Analysis
• Buyer Behaviour Analytics
• Telematics, Logistics
• Business Intelligence
• Infrastructure Monitoring
• …

Page 19: BI, Reporting and Analytics on Apache Cassandra

How to do analytics on Cassandra data?

Remember: Cassandra = NO JOIN, NO GROUP BY, filter on primary key only

2 solutions:
• CQL with predictable queries
• Joins and aggregations on the fly:
  • Server level => needs a distributed processing framework: Hadoop or Spark
  • Client level => possible but risky!
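Why the client-level option is risky can be made concrete with a toy sketch that counts reads. Everything here is hypothetical (the `CountingStore` class, the sample rows); it only shows that a join done in the client issues separate reads per partition and pulls every matching row over the network.

```python
class CountingStore:
    """Toy table: partition key -> list of rows, counting every read."""

    def __init__(self, partitions):
        self._partitions = partitions
        self.reads = 0

    def read(self, partition_key):
        self.reads += 1
        return self._partitions.get(partition_key, [])


def client_side_join(teams, players, team_names):
    """Join team rows with player rows in the client, one partition at a time."""
    joined = []
    for name in team_names:
        for team in teams.read(name):
            for player in players.read(name):
                joined.append({"team_name": name, **team, **player})
    return joined


teams = CountingStore({"PSG": [{"city": "Paris"}]})
players = CountingStore({"PSG": [{"player_name": "Zlatan", "jersey": 10}]})
rows = client_side_join(teams, players, ["PSG"])
print(rows, teams.reads + players.reads)
```

The number of reads grows with the number of partitions joined, and all of the work lands on one client machine, which is the case for pushing joins to Spark or Hadoop instead.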

Page 20: BI, Reporting and Analytics on Apache Cassandra

Reporting and Dashboard

• Static and operational dashboards and reports created for a specific Cassandra application.
• CQL, Solr queries and DataStax drivers
• KPIs and aggregations pre-calculated by a scheduled batch or on the fly during insert.
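The two pre-calculation strategies can be contrasted in a short sketch. It is a toy in-memory model with hypothetical names (`ViewStats`, `video_id`, `day`); in Cassandra the running total could be held in a counter column, for example, but that mapping is an assumption of this illustration.

```python
from collections import Counter


class ViewStats:
    """Keeps raw events plus a KPI that is maintained as data arrives."""

    def __init__(self):
        self.events = []                # raw detail rows
        self.views_per_day = Counter()  # pre-computed KPI keyed by (video_id, day)

    def record_view(self, video_id, day):
        # On-the-fly variant: update the aggregate in the same step as the insert.
        self.events.append((video_id, day))
        self.views_per_day[(video_id, day)] += 1

    def recompute(self):
        # Scheduled-batch variant: rebuild the aggregate from the raw events.
        self.views_per_day = Counter(self.events)

    def dashboard_views(self, video_id, day):
        """The dashboard reads one pre-computed value, with no GROUP BY at query time."""
        return self.views_per_day[(video_id, day)]
```

Either way the dashboard query stays a simple key lookup, which fits the "predictable queries" constraint.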

Page 21: BI, Reporting and Analytics on Apache Cassandra

BI & Data Visualization tools

For BI and data visualization tools like Tableau Software, Power BI, QlikView, Excel…

• DataStax ODBC driver: SQL joins and aggregations executed at client level!
• Spark ODBC driver (from Databricks or Microsoft): SQL translated into Spark jobs and executed at server level

Page 22: BI, Reporting and Analytics on Apache Cassandra

Tableau Software

Databricks Spark ODBC Driver for SparkSQL
Live SQL queries to Spark, or extract data to the local client

Page 23: BI, Reporting and Analytics on Apache Cassandra

Power BI Desktop

Support for On-Prem Spark distributions

“The new data source in this month’s release is support for On-Prem Spark distributions. Last month, we added support for Microsoft Azure HDInsight Spark, and this month we’re expanding to other Spark distributions. This new connector can be found under the “Other” category in the “Get Data” dialog.”

http://blogs.msdn.com/b/powerbi/archive/2015/09/23/44-new-features-in-the-power-bi-desktop-september-update.aspx

Microsoft Spark ODBC Driver

Page 24: BI, Reporting and Analytics on Apache Cassandra

Notebook

Run code (Spark or CQL) from a Web browser
Notebooks like Zeppelin, Spark Notebook, Jupyter

For example Zeppelin:
• Examples available for Cassandra
• CQL language interpreter
• https://github.com/doanduyhai/incubator-zeppelin

Page 25: BI, Reporting and Analytics on Apache Cassandra

DataStax Enterprise Analytics

Page 26: BI, Reporting and Analytics on Apache Cassandra

Analytics with DataStax Enterprise

There are 4 ways to do analytics on Cassandra data:
• Reporting with CQL queries
• Integrated Search (Solr)
• Integrated Batch Analytics (Hadoop integrated) on Cassandra
• Integrated Near Real-Time Analytics (Spark)

• Virtual multi data centers optimised as required: different workloads, hardware, availability, etc.
• Cassandra will replicate the data for you; no ETL is necessary
• Cassandra node started with Solr, Hadoop or Spark

Diagram labels: Cassandra Replication, Transactions, Analytics

Page 27: BI, Reporting and Analytics on Apache Cassandra

Enterprise Search & Powerful Secondary Index
• Built-in enterprise search on Cassandra data via a strong Apache Solr and Lucene integration
• Facets, filtering, geospatial search, text analysis, joins, etc.
• Real-time indexing and search operations
• Search queries from CQL and REST/Solr
• Solr shortcomings addressed:
  • No bottleneck: clients can read/write to any Solr node
  • Search index partitioning and replication for scalability and availability
  • Multi-DC support
  • Data durability (Solr lacks a write-ahead log, so data can be lost)

Diagram labels: Cassandra Replication, Customer Facing, Search Nodes

Page 28: BI, Reporting and Analytics on Apache Cassandra

Batch Analytics - Hadoop
• Integrated Hadoop 1.0.4
• CFS (Cassandra File System), no HDFS
• No single point of failure
• No Hadoop complexity: every node is built the same
• Hive / Pig / Sqoop / Mahout

Diagram labels: Cassandra Replication, Customer Facing, Hadoop Nodes

Page 29: BI, Reporting and Analytics on Apache Cassandra

Real-Time Analytics - Spark
• Tight integration between Apache Spark and Cassandra
• Distributed processing: “in-memory Map/Reduce”, multi-threaded, best for iterative workloads
• GraphX, MLlib (machine learning), SparkSQL, Spark Streaming (real-time processing)
• Thrift JDBC/ODBC Spark server, Spark Job Server
• Apache Solr integration
• DataStax / Databricks partnership
• 10x to 100x the speed of MapReduce

Diagram labels: Cassandra Replication, Customer Facing, Spark Nodes, « Big Data » SDK

Page 30: BI, Reporting and Analytics on Apache Cassandra

Real-time or Batch Analytics

Diagram labels: Data Enrichment, Batch Processing, Machine Learning, Pre-computed aggregates, Data, No ETL

Page 31: BI, Reporting and Analytics on Apache Cassandra

Spark Use Cases

• Load data from various sources
• Analytics (join, aggregate, transform, …)
• Sanitize, validate, normalize data
• Schema migration, data conversion

Page 32: BI, Reporting and Analytics on Apache Cassandra

Architectures

Page 33: BI, Reporting and Analytics on Apache Cassandra

Workload Isolation

No ETL

Page 34: BI, Reporting and Analytics on Apache Cassandra

Hot / Cold Data in a DataStax architecture

Hot Data: Online Operational Application

Cold Data: Offline Application

DataStax Cassandra Enterprise

© 2014 DataStax, All Rights Reserved. Company Confidential

Page 35: BI, Reporting and Analytics on Apache Cassandra

DataStax Enterprise + Data Warehouse / Hadoop

Write Intensive: Internet of Things, activity logs for fraud and recommendation, messages

Read Intensive: Catalogue, Playlist, Recommendation, Fraud Alert, Personalization

Operational Search, Dashboard and Reporting

Offline Applications: Historical Analysis, OLAP, Complex Analytics, Self-Service BI

Diagram labels: Data Warehouse, Hadoop cluster, Computation Engine, Multidimensional Cube

Page 36: BI, Reporting and Analytics on Apache Cassandra

Cassandra + Hadoop Use Cases

Page 37: BI, Reporting and Analytics on Apache Cassandra

Ooyala Use Case : Hadoop + Cassandra

By leveraging data stored in Apache Cassandra, Ooyala is helping their customers take a more strategic approach when delivering a digital video experience, so they can get ahead in this fast-evolving space.

http://www.datastax.com/resources/casestudies/ooyala

San Francisco-based video services company Ooyala provides a suite of technologies and services that support content owners in managing, analyzing and monetizing the digital video they publish online, on mobile devices, and through the over-the-top distribution platform for delivering Internet video to television.

Page 38: BI, Reporting and Analytics on Apache Cassandra

Spotify Use Case : Hadoop + Cassandra

https://labs.spotify.com/2015/01/09/personalization-at-spotify-using-cassandra/

Personalization at Spotify using Cassandra

Page 39: BI, Reporting and Analytics on Apache Cassandra

Thanks

We power the big data apps that transform business.
