Apache Cassandra and Python for Analyzing Streaming Big Data
![Page 1: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/1.jpg)
Apache Cassandra and Python
For streaming Big Data
Prajod S Vettiyattil
Architect, Wipro
@prajods
https://in.linkedin.com/in/prajod
Nishant Sahay
Architect, Wipro
@nsahaytech
https://in.linkedin.com/in/nishantsahay
Open Source India, Nov 2015
Database track
![Page 2: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/2.jpg)
Agenda
1. Time Series Data Analysis
2. Spark, Python, Cassandra and D3
3. Business problem
4. Solution using Logical Architecture
5. Data Processor
6. Data Persistence
7. Data Visualization
![Page 3: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/3.jpg)
What this session is about
What
Big Data
Streaming
Time Series
How
Spark
Python
Cassandra
D3.js, Node.js
![Page 4: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/4.jpg)
Tools: Python, Spark, Cassandra, Node and D3
• Python and Spark for Big Data processing
• Cassandra for persistence and serving
• D3 for visualization
• Node for
  • Enabling scalability
  • Data aggregation
![Page 5: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/5.jpg)
Python
• Popular with open source projects
• Wide support base
• Strong in data science
• Visualization libraries
• Statistics functions
![Page 6: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/6.jpg)
Cassandra
• NoSQL database
• Column family
• Dynamic columns
• AP in CAP theorem
• Tunable consistency
• Suited for time series storage
![Page 7: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/7.jpg)
D3.js
• Data-driven documents
• SVG, HTML, CSS and JavaScript
• Fine-grained control of screen elements
• Plethora of UI widgets
![Page 8: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/8.jpg)
Business Problem
• Handle streaming data
  • Stock ticks
  • Weather movements
  • Satellite captures
  • Astronomical observations
  • Large Hadron Collider
• Ingest
• Persist
• Visualize
• Analysing stock prices
![Page 9: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/9.jpg)
Logical Solution Architecture
Time Series Data Producer (IoT devices, stock ticks)
Data Processor (pySpark)
Data Persistence (Cassandra)
Visualization Aggregator (Node.js)
Visualization (D3.js)
![Page 10: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/10.jpg)
Data Processor: pySpark
• Apache Spark is a big data processor
  • Streaming data
  • Batch data
  • Lambda architecture
• pySpark for using Python's power on top of Spark
• Python
  • Machine learning
  • Statistics
  • Visualization
• Cassandra integration
  • pyspark-cassandra adapter from TargetHolding
![Page 11: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/11.jpg)
Logical Architecture diagram of Spark
Apache Spark
Spark SQL | MLlib | GraphX | SparkR | pySpark | Spark Streaming
![Page 12: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/12.jpg)
Apache Spark: Core
• In-memory processing for Big Data
• Cached intermediate data sets
• Multi-step DAG-based execution
• Resilient Distributed Datasets (RDDs)
![Page 13: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/13.jpg)
pySpark and Cassandra
Java
Python
Cassandra
![Page 14: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/14.jpg)
Apache Spark: Processing stock ticks
• Ingest stock tick stream, coming in at a high rate
• Calculate moving average of stock prices
• Insert the average of prices into Cassandra
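The moving-average step can be sketched in plain Python. This is only an illustration of the calculation, not the pySpark job from the talk: the function name `moving_average`, the window size of 3 and the sample prices are assumptions made for the example.

```python
from collections import deque


def moving_average(prices, window):
    """Yield the simple moving average over the last `window` prices."""
    recent = deque(maxlen=window)  # the oldest price drops out automatically
    for price in prices:
        recent.append(price)
        # Until the window fills up, average over the prices seen so far.
        yield sum(recent) / len(recent)


# Hypothetical tick prices, in arrival order.
ticks = [559.45, 556.45, 556.65, 556.50]
averages = list(moving_average(ticks, window=3))
```

In a streaming job the same calculation would run per stock over a sliding time window, and each resulting average would be written to Cassandra.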
![Page 15: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/15.jpg)
Data Persistence - Cassandra
• Masterless: Peer to peer
• Built to scale: Scales to support millions of operations per second
• High availability: No single point of failure
• Ease of use: Operational simplicity, CQL for developers
• It is supposedly battle tested at Facebook, Apple and Netflix :-)
![Page 16: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/16.jpg)
Data Persistence - Cassandra
[Figure: Cassandra ring of eight nodes, n1 to n8]
Write request: the partition key hashes to a value owned by n1
n8 – Coordinator node
n1 – Primary responsible node handling the request
n2, n3 – Replication nodes (RF=3)
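Replica selection on the ring can be sketched as follows. This is a simplified model under stated assumptions: MD5 stands in for Cassandra's partitioner, each node gets one evenly spaced token, and replicas are the next nodes clockwise (the simple, non-rack-aware placement). The names `token_for` and `replicas_for` are made up for the example.

```python
import hashlib
from bisect import bisect_left

NODES = ["n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8"]
RING_SIZE = 2 ** 32

# Assumption: one evenly spaced token per node, sorted by token value.
NODE_TOKENS = [(i * RING_SIZE // len(NODES), node) for i, node in enumerate(NODES)]


def token_for(partition_key: str) -> int:
    """Hash a partition key onto the ring (MD5 is a stand-in hash)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def replicas_for(partition_key: str, rf: int = 3) -> list:
    """Return the primary node followed by the next rf-1 nodes clockwise."""
    tokens = [token for token, _ in NODE_TOKENS]
    # First node whose token is >= the key's token; wrap around past the end.
    start = bisect_left(tokens, token_for(partition_key)) % len(NODE_TOKENS)
    return [NODE_TOKENS[(start + i) % len(NODE_TOKENS)][1] for i in range(rf)]


replicas = replicas_for("GAZP:2015-11-10")  # three consecutive nodes on the ring
```

Any node can act as coordinator for a request; it forwards the write to the replicas computed this way.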
![Page 17: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/17.jpg)
Cassandra Data Model – Skinny Rows
Skinny rows: the primary key contains only the partition key (here a composite partition key).
CREATE TABLE stock_info (stock_id text, date text, price double, PRIMARY KEY ((stock_id, date)));
Logical view:
stock_id | date       | price
GAZP     | 2015-11-11 | 556.50
GAZP     | 2015-11-10 | 556.65
Disk view (Node n1, Node n4):
GAZP:2015-11-11 | price = 556.50
GAZP:2015-11-10 | price = 556.65
![Page 18: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/18.jpg)
Cassandra Data Model – Wide Rows
Wide rows: the primary key contains columns (clustering columns) other than the partition key, giving a compound primary key (partition + clustering).
CREATE TABLE stock_ticker (stock_id text, price double, event_time timestamp, PRIMARY KEY (stock_id, event_time));
Logical view:
stock_id | price  | date       | event_time
GAZP     | 559.45 | 2015-11-10 | 2015-11-10 09:30:00
GAZP     | 556.45 | 2015-11-10 | 2015-11-10 13:30:00
GAZP     | 556.65 | 2015-11-11 | 2015-11-11 18:00:00
Disk view (Node n1):
GAZP | 2015-11-10 13:30:00:price = 556.45 | 2015-11-10 09:30:00:price = 559.45 | 2015-11-11 16:00:00:price = 556.65
![Page 19: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/19.jpg)
Time Series – Cassandra Data Model
Wide row + row partition
CREATE TABLE stock_info (stock_id text, date text, price double, event_time timestamp, PRIMARY KEY ((stock_id, date), event_time));
Logical view:
stock_id | price  | date       | event_time
GAZP     | 559.45 | 2015-11-10 | 2015-11-10 09:30:00
GAZP     | 556.45 | 2015-11-10 | 2015-11-10 13:30:00
GAZP     | 556.65 | 2015-11-11 | 2015-11-11 18:00:00
Disk view (Node n1, Node n6):
GAZP:2015-11-10 | 2015-11-10 13:30:00:price = 556.45 | 2015-11-10 09:30:00:price = 559.45
GAZP:2015-11-11 | 2015-11-11 18:00:00:price = 556.65
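The effect of the (stock_id, date) partition key can be sketched in plain Python: ticks are grouped into one bucket per stock and day, and ordered by event_time inside each bucket. This only mimics the layout and does not talk to Cassandra; `bucket_by_day` is a hypothetical helper, and ascending order is assumed as the clustering order.

```python
from collections import defaultdict
from datetime import datetime

# Sample ticks as (stock_id, event_time, price), in arrival order.
ticks = [
    ("GAZP", datetime(2015, 11, 10, 13, 30), 556.45),
    ("GAZP", datetime(2015, 11, 10, 9, 30), 559.45),
    ("GAZP", datetime(2015, 11, 11, 18, 0), 556.65),
]


def bucket_by_day(rows):
    """Group ticks into partitions keyed by (stock_id, date)."""
    partitions = defaultdict(list)
    for stock_id, event_time, price in rows:
        partition_key = (stock_id, event_time.strftime("%Y-%m-%d"))
        partitions[partition_key].append((event_time, price))
    for cells in partitions.values():
        cells.sort()  # order cells by event_time within the partition
    return dict(partitions)


partitions = bucket_by_day(ticks)
```

Each key in `partitions` corresponds to one partition, so a single stock's history is spread over as many partitions as there are days.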
![Page 20: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/20.jpg)
Summary – Cassandra Data Model
Skinny Row (Node n1, Node n4)
GAZP:2015-11-11 | price = 556.50
GAZP:2015-11-10 | price = 556.65
Wide Row (Node n1)
GAZP | 2015-11-10 13:30:00:price = 556.45 | 2015-11-10 09:30:00:price = 559.45 | 2015-11-11 16:00:00:price = 556.65
Wide Row + Row Partition (Node n1, Node n6)
GAZP:2015-11-10 | 2015-11-10 13:30:00:price = 556.45 | 2015-11-10 09:30:00:price = 559.45
GAZP:2015-11-11 | 2015-11-11 18:00:00:price = 556.65
Optimize with expiring columns, or split the day bucket into multiple rows
![Page 21: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/21.jpg)
Node.js, Cassandra and D3.js
Web UI Layer (browser): D3.js graph
Server Layer: ExpressJS, cassandra-driver
Database Layer: Cassandra DB
Web UI to server: REST-based polling, get JSON data
Server to database: CQL select of time series data
![Page 22: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/22.jpg)
Data Aggregator
• Node.js is a proxy for data aggregation
  • Expose REST endpoint for visualization
  • Retrieve data from Cassandra
  • Data transformation as per business need
• ExpressJS: Flexible web application framework
• Datastax cassandra-driver: client library for Apache Cassandra
• EJS: For quick templating of on-the-fly node application
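The data transformation step can be sketched as below. The talk does this in Node.js; Python is used here only to keep the examples in one language. The function name `to_series` is hypothetical, rows are assumed to arrive as (event_time, price) pairs, naive timestamps are assumed to be UTC, and the output uses the {x, y} point shape that time series charting toolkits such as Rickshaw consume.

```python
import json
from datetime import datetime, timezone


def to_series(rows):
    """Convert (event_time, price) rows into a JSON list of {x, y} points."""
    points = [
        {
            # Assumption: naive timestamps are UTC; x is epoch seconds.
            "x": int(event_time.replace(tzinfo=timezone.utc).timestamp()),
            "y": price,
        }
        for event_time, price in sorted(rows)  # oldest point first
    ]
    return json.dumps(points)


rows = [
    (datetime(2015, 11, 10, 13, 30), 556.45),
    (datetime(2015, 11, 10, 9, 30), 559.45),
]
payload = to_series(rows)  # JSON string a REST endpoint could return
```

The browser can then poll the endpoint and hand the parsed points straight to the graph.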
![Page 23: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/23.jpg)
Visualization - Frameworks
• D3 for transformation of time series data into visual information
  • Consume REST API
  • Generate customized data-driven graphs and visualization
• Rickshaw is a JavaScript toolkit for creating interactive time series graphs
  • Built on D3.js
  • Generate time series graph
![Page 24: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/24.jpg)
Visualization – Graphs
[Figure: time series graphs with series labelled Price, Moving Average, Trade Volume and Stock Price]
![Page 25: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/25.jpg)
Summary
• Processing time series data
  • Apache Spark
  • Cassandra
  • Node.js
  • D3.js
![Page 26: Apache Cassandra and Python for Analyzing Streaming Big Data](https://reader031.vdocuments.site/reader031/viewer/2022021813/588b2e621a28abed688b7007/html5/thumbnails/26.jpg)
QUESTIONS