Real-Time Analytics with DSE
TRANSCRIPT
© 2014 DataStax, All Rights Reserved
Hadoop Limitations
• Master / Slave Architecture
• Every Processing Step requires Disk IO
• Difficult API and Programming Model
• Designed for batch-mode jobs
• No event-streaming / real-time
• Complex Ecosystem
Apps in the early 2000s were written for → Apps today are written for:
• Single machines → Clusters of machines
• Single core processors → Multicore processors
• Expensive RAM → Cheap RAM
• Expensive disk → Cheap disk
• Slow networks → Fast networks
• Few concurrent users → Lots of concurrent users
• Small data sets → Large data sets
• Latency in seconds → Latency in milliseconds
What is Spark?
• Fast and general compute engine for large-scale data processing
• Fault Tolerant Distributed Datasets
• Distributed Transformation on Datasets
• Integrated Batch, Iterative and Streaming Analysis
• In Memory Storage with Spill-over to Disk
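For example, a minimal sketch of a distributed transformation kept in memory with spill-over to disk (the input file path is a hypothetical):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("what-is-spark-sketch"))

// Distributed transformations over a fault-tolerant dataset (RDD)
val lines  = sc.textFile("events.log")                      // hypothetical input file
val errors = lines.filter(_.contains("ERROR"))
                  .persist(StorageLevel.MEMORY_AND_DISK)    // in memory, spilling to disk

println(errors.count())                                     // first action materializes and caches the RDD
println(errors.filter(_.contains("timeout")).count())       // second action reuses the cached data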
Advantages of Spark
• Improves efficiency through:
  • In-memory data sharing
  • General computation graphs - lazily evaluates data
  • 10x faster on disk, 100x faster in memory than Hadoop MR
• Improves usability through:
  • Rich APIs in Java, Scala, Python
  • 2 to 5x less code
  • Interactive shell
Scala for Data Analytics
• Functional Paradigm is ideal for Data Analytics
• Strongly Typed - Enforce Schema at Every Layer
• Immutable by Default - Event Logging
• Declarative instead of Imperative - Focus on Transformation, not Implementation
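As a small illustration of the declarative, immutable style (with hypothetical ratings data), describe the transformation rather than the loop that implements it:

case class Rating(userId: Int, movieId: Int, stars: Double)

val ratings = Seq(Rating(1, 10, 4.5), Rating(1, 20, 3.0), Rating(2, 10, 5.0))

// Average stars per movie: transformations over immutable collections, no mutable state
val avgByMovie = ratings
  .groupBy(_.movieId)
  .map { case (movie, rs) => movie -> rs.map(_.stars).sum / rs.size }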
Spark is about Data Analytics
• How do we get data into Spark?
• How can we work with large datasets?
• What do we do with the results of the analytics?
Connector Token Range Mapping (diagram)
The Spark Cassandra Connector uses the DataStax Java Driver to read from and write to C*. Each Spark Executor maintains a connection to the C* cluster, and RDDs are read in as splits based on sets of tokens (Tokens 1-1000, Tokens 1001-2000, ...) that together cover the full token range.
Spark Cassandra Connector
• Data locality-aware (speed)
• Read from and Write to Cassandra
• Cassandra Tables Exposed as RDD and DataFrames
• Server-Side filters (where clauses)
• Cross-table operations (JOIN, UNION, etc.)
• Mapping of Java Types to Cassandra Types
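A minimal sketch of these features, assuming sc is a SparkContext configured with spark.cassandra.connection.host, the music/albums schema used later in this deck, that artist_name can be filtered server-side (clustering column or secondary index), and a hypothetical target table albums_by_artist(artist_name, id):

import com.datastax.spark.connector._

// Cassandra table exposed as an RDD, reading only the columns we need
val albums = sc.cassandraTable("music", "albums")
  .select("id", "artist_name")
  .where("artist_name = ?", "Miles Davis")   // server-side filter (CQL WHERE)

println(albums.count())

// Writing an RDD back to another Cassandra table
albums.map(row => (row.getString("artist_name"), row.getString("id")))
  .saveToCassandra("music", "albums_by_artist", SomeColumns("artist_name", "id"))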
Spark Cassandra Connector
• Open Source Project
• Requires maintaining separate Cassandra and Spark Clusters
• Spark Master is not Highly Available without Zookeeper
• Submitting Spark Applications requires setting hard coded Spark Master and Cassandra Locations
DataStax Enterprise Platform - Workload Segregation without ETL
• Cassandra - OLTP Database
• Analytics - Streaming and Analytics
• Search - All Data Searchable
• Graph - Graph Data Structure (coming this year)
DSE Analytics with Spark - Internal / Administrative Benefits
• DSE Analytic Nodes configured to run Spark
• No need to run a separate Spark Cluster
• Simplified Deployment and Management
• No need to specify Spark Master and Cassandra Host
• High Availability of Spark Master
Spark Master High Availability
• High Availability Spark Master with automatic leader election
• Detects when the Spark Master is down via gossip
• Uses Paxos to elect the Spark Master
• Stores Spark Worker metadata in Cassandra
• No need to run Zookeeper
DSE Analytics with Spark - Integration Benefits
• Integration of Analytics and Search
• Spark Job Server
• SparkSQL and HiveQL access of Cassandra Data
• Streaming Resiliency with the Kafka Direct API via the Cassandra File System
DSE 4.8 Analytics + Search
• Allows Analytics Jobs to use Solr Queries
• Allows searching for data across partitions

val table = sc.cassandraTable("music", "albums")
val result = table.select("id", "artist_name")
  .where("solr_query='artist_name:Miles*'")
  .collect
Data Center Replication (diagram)
Data Center 1 - US East (DSE Analytics, streaming analysis) and Data Center 2 - US West (DSE Analytics, batch analysis) replicate to each other.
Network Traffic Analysis Architecture (diagram)
Spark Streaming consumes from an active Kafka cluster (with a passive Kafka replica) into DSE Analytics for streaming analysis and batch analysis.
Common Use Cases
• Personalization
• Banking Fraud Detection
• Website Click Stream Analysis
• Login Monitoring
Spark Notebook (diagram)
Notebooks connect to a Spark Notebook Server, which talks to the Cassandra cluster (Cassandra and Analytics nodes) through the Spark Connector.
Apache Spark Notebook
• Reactive / Dynamic Graphs based on Scala, SQL and DataFrames
• Spark Streaming
• Example notebooks covering visualization, machine learning, streaming, graph analysis, genomics analysis
• SVG / Sliders - interactive graphs
• Tune and Configure Each Notebook Separately
• https://github.com/andypetrella/spark-notebook
Demo of Streaming in the Real World - Spark At Scale Project
• Based on Real World Use Cases
• Simulate a real world streaming use case
• Test throughput of Spark Streaming
• Best Practices for scaling
• https://github.com/retroryan/SparkAtScale
Spark At Scale Demo Application (diagram)
A Web Service and Legacy Systems feed data into DataStax Enterprise Platform clusters.
Spark Streaming with Kafka Direct Approach
• Use the Kafka Direct Approach (No Receivers) - see the sketch below
• Queries Kafka Directly
• Automatically Parallelizes based on Kafka Partitions
• Exactly-Once Processing - Only Move Offset after Processing
• Resiliency without copying data
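A minimal sketch of the direct approach (Spark Streaming 1.3+ with spark-streaming-kafka; sc, the broker address, and the "ratings" topic are assumptions):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(5))
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

// No receivers: one RDD partition per Kafka partition, offsets tracked by the stream itself
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("ratings"))

stream.map { case (_, value) => value }.print()
ssc.start()
ssc.awaitTermination()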
Spark Streaming Deployment
• Don't build fat jars!
• Use spark-submit --packages to specify dependencies Maven-style
• Test submit options to match load:
  --executor-memory 4G
  --total-executor-cores 15
Spark Streaming Monitoring
• If Processing Time > Batch Duration, the Total Delay keeps growing
• A growing delay eventually leads to Out of Memory Errors
Data Modeling using Event Sourcing
• Append-Only Logging (see the sketch below)
• Database of Facts
• Snapshots or Roll-Ups
• Why Delete Data any more?
• Replay Events
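A minimal sketch of this model with the connector, assuming sc is an existing SparkContext and a hypothetical keyspace store with tables events(user_id, event_time bigint, event_type, payload) and event_counts_by_user(user_id, event_count):

import com.datastax.spark.connector._

case class Event(userId: String, eventTime: Long, eventType: String, payload: String)

// Facts are only appended, never updated in place
val newEvents = sc.parallelize(Seq(
  Event("user-1", System.currentTimeMillis(), "login", "{}"),
  Event("user-1", System.currentTimeMillis(), "click", """{"page":"home"}""")))
newEvents.saveToCassandra("store", "events",
  SomeColumns("user_id", "event_time", "event_type", "payload"))

// A roll-up / snapshot built by replaying the event log
sc.cassandraTable[Event]("store", "events")
  .map(e => (e.userId, 1L))
  .reduceByKey(_ + _)
  .saveToCassandra("store", "event_counts_by_user", SomeColumns("user_id", "event_count"))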
Why Spark SQL?
• Creating and Running Spark Programs Faster
  • Write less code
  • Read less data
  • Let the optimizer do the hard work
• Spark SQL Catalyst optimizer
DataFrame
• Distributed collection of data
• Similar to a Table in an RDBMS
• Common API for reading/writing data
• API for selecting, filtering, aggregating and plotting structured data
DataFrame Part 2
• Sources such as Cassandra, structured data files, tables in Hive, external databases, or existing RDDs
• Optimization and code generation through the Spark SQL Catalyst optimizer
• Decorator around RDD - previously SchemaRDD
Write Less Code: Input & Output
• Unified interface to reading/writing data in a variety of formats (see the sketch below)
• Spark Notebook Example
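A minimal sketch of that unified read/write interface against Cassandra (Spark 1.4-style SQLContext; sc is an existing SparkContext, the music keyspace is from the earlier example, and miles_albums is a hypothetical, pre-created target table):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Read a Cassandra table as a DataFrame
val albums = sqlContext.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "music", "table" -> "albums"))
  .load()

// Filter with the DataFrame API; Catalyst optimizes the plan
val milesAlbums = albums.filter(albums("artist_name").startsWith("Miles"))

// Write the result back through the same interface
milesAlbums.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "music", "table" -> "miles_albums"))
  .save()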
Key to Scaling - Configuring Kafka Topics
•Number of Partitions per Topic — Degree of parallelism
•Directly Affects Spark Streaming Parallelism
•bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 5 --topic ratings
Populating Kafka Topics
val record = new ProducerRecord[String, String](feederExtension.kafkaTopic, partNum, key, nxtRating.toString)
val future = feederExtension.producer.send(record, new Callback {
  override def onCompletion(metadata: RecordMetadata, e: Exception): Unit = if (e != null) e.printStackTrace()
})
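For context, a minimal sketch of the setup the snippet above assumes (in the demo project feederExtension wraps a KafkaProducer and the topic name; the broker address here is an assumption), including the imports used above:

import java.util.Properties
import org.apache.kafka.clients.producer.{Callback, KafkaProducer, ProducerRecord, RecordMetadata}

val props = new Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
val producer = new KafkaProducer[String, String](props)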
Demo: Twitter Streaming Language Classifier (pipeline diagram)
Spark Streaming collects tweets from the Twitter API into an HDFS dataset; Spark SQL handles ETL and queries; Spark featurizes the text and MLlib trains a classifier whose model is stored in HDFS; Spark Streaming then scores incoming tweets with a language filter and writes results to Cassandra.
Demo: Twitter Streaming Language Classifier
From tweets to ML features, approximated as sparse vectors:
1. Extract text from the tweet (https://twitter.com/andy_bf/status/): "Ceci n'est pas un tweet"
2. Sequence the text as bigrams: tweet.sliding(2).toSeq → ("Ce", "ec", "ci", …)
3. Convert bigrams into integers: seq.map(_.hashCode()) → (2178, 3230, 3174, …)
4. Index into a sparse tf vector: seq.map(_.hashCode() % 1000) → (178, 230, 174, …)
5. Increment feature counts: Vector.sparse(1000, …) → (1000, [102, 104, …], [0.0455, 0.0455, …])
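Put together, a small sketch of this featurization (a hypothetical helper approximating the demo's Utils.featurize, not its exact code):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

def featurize(text: String, numFeatures: Int = 1000): Vector = {
  val bigrams = text.sliding(2).toSeq
  // Hash each bigram into a bounded index space (hashCode can be negative, so wrap it)
  val indices = bigrams.map(b => ((b.hashCode % numFeatures) + numFeatures) % numFeatures)
  // Term frequency: each occurrence contributes 1 / number-of-bigrams
  val tf = indices.groupBy(identity)
    .map { case (idx, hits) => (idx, hits.size.toDouble / bigrams.size) }
    .toSeq.sortBy(_._1)
  Vectors.sparse(numFeatures, tf)
}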
Demo: Twitter Streaming Language Classifier
Sample Code + Output: gist.github.com/ceteri/835565935da932cb59a2
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils
import org.apache.spark.mllib.clustering.KMeansModel
import org.apache.spark.mllib.linalg.Vector

val conf = new SparkConf()
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(5))

// Utils.getAuth, Utils.featurize, modelFile and clust are defined in the linked gist
val tweets = TwitterUtils.createStream(ssc, Utils.getAuth)
val statuses = tweets.map(_.getText)

// Load the pre-trained K-means model and keep only tweets in the requested language cluster
val model = new KMeansModel(ssc.sparkContext.objectFile[Vector](modelFile.toString).collect())
val filteredTweets = statuses.filter(t => model.predict(Utils.featurize(t)) == clust)
filteredTweets.print()

ssc.start()
ssc.awaitTermination()
Sample output: tweets grouped by language cluster, e.g. CLUSTER 1 contains Japanese tweets and CLUSTER 4 contains Arabic tweets.