cassandra summit sept 2015 - real time advanced analytics with spark and cassandra recommendations...


Upload: chris-fregly

Post on 21-Apr-2017


TRANSCRIPT

Page 1: Cassandra Summit Sept 2015 - Real Time Advanced Analytics with Spark and Cassandra Recommendations Machine Learning Graph Processing

IBM | spark.tc

Cassandra Summit 2015
Real time Advanced Analytics with Spark and Cassandra

Chris Fregly, Principal Data Solutions Engineer
IBM Spark Technology Center
Sept 24, 2015

Power of data. Simplicity of design. Speed of innovation.

Page 2: Who am I?

Streaming Platform Engineer
  Not a Photographer or Model
Streaming Data Engineer
  Netflix Open Source Committer
Data Solutions Engineer
  Apache Contributor
Principal Data Solutions Engineer
  IBM Technology Center

Page 3: Advanced Apache Spark Meetup (Organizer)

Total Spark Experts: ~1000
Mean RSVPs per Meetup: ~300
Mean Attendance: ~50-60% of RSVPs

I’m lucky to work for a company/boss that lets me do this full-time!

Come work with me!

We’ll kick ass!

Page 4: Recent and Future Meetups

Spark-Cassandra Connector w/ Russell Spitzer (DataStax) & Me
  Sept 21st, 2015 <-- Great turnout and interesting questions!
Project Tungsten Data Structs+Algos: CPU & Memory Optimizations
  Nov 12th, 2015
Text-based Advanced Analytics and Machine Learning
  Jan 14th, 2016
ElasticSearch-Spark Connector w/ Costin Leau (Elastic.co) & Me
  Feb 16th, 2016
Spark Internals Deep Dive
  Mar 24th, 2016
Spark SQL Catalyst Optimizer Deep Dive
  Apr 21st, 2016

Page 5: Topics of this Talk

① Recommendations
② Live Interactive Demo!
③ DataFrames
④ Catalyst Optimizer and Query Plans
⑤ Data Sources API
⑥ Creating and Contributing Custom Data Source
⑦ Partitions, Pruning, Pushdowns
⑧ Native + Third-Party Data Source Impls
⑨ Spark SQL Performance Tuning

New features of Spark 1.5!!

Audience Participation Required!!

Page 6: Live, Interactive Demo!

Spark After Dark
Generating High-Quality Dating Recommendations

Real-time Advanced Analytics
Machine Learning, Graph Processing

Page 7: Recommendations

Non-Personalized
  “Cold Start” Problem
  Top K
  PageRank
Personalized
  User-User, User-Item, Item-Item
  Collaborative Filtering

Page 8: Types of User Feedback

Explicit
  ratings, likes
Implicit
  searches, clicks, hovers, views, scrolls, pauses

Used to train models for future recommendations

Page 9: Similarity

① Euclidean: linear measure
  Suffers from magnitude bias
② Cosine: angle measure
  Adjust for magnitude bias
③ Jaccard: Set intersection / union
  Suffers from popularity bias
④ Log Likelihood
  Adjust for popularity bias

  Ali Matei Reynold Patrick Andy
Kimberly 1 1 1 1
Leslie 1 1
Meredith 1 1 1
Lisa 1 1 1
Holden 1 1 1 1 1
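
The first three measures above can be compared on a small example. This is a plain-Python sketch, not the code used in the demo. The `likes` vectors are hypothetical (the cell positions of the slide's table are not recoverable here), and log likelihood is left out for brevity.

```python
import math

# Hypothetical binary "like" vectors over the same five profiles.
# These values are illustrative and are not the slide's table.
likes = {
    "kimberly": [1, 1, 1, 1, 0],
    "leslie":   [1, 1, 0, 0, 0],
    "holden":   [1, 1, 1, 1, 1],
}

def euclidean(a, b):
    """Straight-line distance; grows with vector magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a, b):
    """Cosine of the angle between a and b; ignores magnitude."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def jaccard(a, b):
    """Size of the intersection over size of the union of liked items."""
    set_a = {i for i, x in enumerate(a) if x}
    set_b = {i for i, y in enumerate(b) if y}
    return len(set_a & set_b) / len(set_a | set_b)
```

Scaling a vector (for example `[1, 1, 0]` versus `[2, 2, 0]`) changes the Euclidean distance but leaves the cosine at 1, which is the magnitude bias the slide refers to.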

Page 10: Comparing Similarity

All-pairs Similarity
  aka. Pair-wise Similarity, Similarity join
Naïve Implementation
  O(m * n^2) shuffle; m = rows, n = cols
Clever
  Approximate
  Reduce m: Sampling and bucketing
  Reduce n: Sparse matrix, factor out frequent vals (0?)
  Locality Sensitive Hashing
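
The naïve cost above can be seen in a single-machine sketch. This is plain Python with a hypothetical 3 x 4 matrix and a simple dot product standing in for the similarity measure. It is not a distributed implementation and does not model the shuffle itself.

```python
from itertools import combinations

def all_pairs_column_similarity(rows):
    """Dot-product similarity for every pair of columns.

    With m rows and n columns there are n * (n - 1) / 2 column pairs,
    and each pair reads all m rows, hence roughly O(m * n^2) work.
    """
    n = len(rows[0])
    sims = {}
    for i, j in combinations(range(n), 2):
        sims[(i, j)] = sum(row[i] * row[j] for row in rows)
    return sims

# Hypothetical ratings matrix: 3 users (rows) x 4 profiles (columns).
matrix = [
    [1, 1, 0, 1],
    [0, 1, 1, 1],
    [1, 0, 0, 1],
]
sims = all_pairs_column_similarity(matrix)
```

Sampling rows shrinks m, and skipping zero entries shrinks the work done per column pair, which is where the "clever" variants on the slide save time.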

Page 11: Audience Participation Required!!

Instructions for you:
① Navigate to sparkafterdark.com
② Click on 3 actors & 3 actresses

-> You are here ->

github.com/fluxcapacitor/
hub.docker.com/r/fluxcapacitor/pipeline/

Page 12: DataFrames

Inspired by R and Pandas DataFrames
Cross language support
  SQL, Python, Scala, Java, R
Levels performance of Python, Scala, Java, and R
  Generates JVM bytecode vs serialize/pickle objects to Python
DataFrame is Container for Logical Plan
  Transformations are lazy and represented as a tree
  Catalyst Optimizer creates physical plan
  DataFrame.rdd returns the underlying RDD if needed
Custom UDF using registerFunction()
New, experimental UDAF support

Use DataFrames instead of RDDs!!

Page 13: Catalyst Optimizer

Converts logical plan to physical plan
Manipulate & optimize DataFrame transformation tree
  Subquery elimination – use aliases to collapse sub-queries
  Constant folding – replace expression with constant
  Simplify filters – remove unnecessary filters
  Predicate/filter pushdowns – avoid unnecessary data load
  Projection collapsing – avoid unnecessary projections
Hooks for custom rules
  Rules = Scala Case Classes
    val newPlan = MyFilterRule(analyzedPlan)
  Implements o.a.s.sql.catalyst.rules.Rule
  Apply to any plan stage
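
Constant folding, one of the rules listed above, can be illustrated on a toy expression tree. This is a plain-Python sketch that only mimics the idea of a rule rewriting a tree. The `Lit`, `Col`, and `Add` classes and the `fold_constants` function are hypothetical names and are not Catalyst's API.

```python
from dataclasses import dataclass
from typing import Union

@dataclass(frozen=True)
class Lit:
    value: int

@dataclass(frozen=True)
class Col:
    name: str

@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"

Expr = Union[Lit, Col, Add]

def fold_constants(expr: Expr) -> Expr:
    """Rewrite the tree bottom-up, replacing Add(Lit, Lit) with one Lit."""
    if isinstance(expr, Add):
        left = fold_constants(expr.left)
        right = fold_constants(expr.right)
        if isinstance(left, Lit) and isinstance(right, Lit):
            return Lit(left.value + right.value)
        return Add(left, right)
    return expr

# rating + (1 + 2) becomes rating + 3
plan = Add(Col("rating"), Add(Lit(1), Lit(2)))
optimized = fold_constants(plan)
```

The rule is a pure function from one tree to another, which is the same shape as the `MyFilterRule(analyzedPlan)` call on the slide.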

Page 14: Plan Debugging

gendersCsvDF.select($"id", $"gender").filter("gender != 'F'").filter("gender != 'M'").explain(true)

Requires explain(true)

DataFrame.queryExecution.logical
DataFrame.queryExecution.analyzed
DataFrame.queryExecution.optimizedPlan
DataFrame.queryExecution.executedPlan

Page 15: Plan Visualization & Join/Aggregation Metrics

Effectiveness of Filter
Cost-based Optimization is Applied
Peak Memory for Joins and Aggs
Optimized CPU-cache-aware Binary Format Minimizes GC & Improves Join Perf (Project Tungsten)

New in Spark 1.5!

Page 16: Data Sources API

Execution (o.a.s.sql.execution.commands.scala)
  RunnableCommand (trait/interface)
    ExplainCommand (impl: case class)
    CacheTableCommand (impl: case class)
Relations (o.a.s.sql.sources.interfaces.scala)
  BaseRelation (abstract class)
    TableScan (impl: returns all rows)
    PrunedFilteredScan (impl: column pruning and predicate pushdown)
    InsertableRelation (impl: insert or overwrite data using SaveMode)
Filters (o.a.s.sql.sources.filters.scala)
  Filter (abstract class for all filter pushdowns for this data source)
    EqualTo
    GreaterThan
    StringStartsWith

Page 17: Creating a Custom Data Source

Study Existing Native and Third-Party Data Source Impls
Native: JDBC (o.a.s.sql.execution.datasources.jdbc)
  class JDBCRelation extends BaseRelation
    with PrunedFilteredScan
    with InsertableRelation
Third-Party: Cassandra (o.a.s.sql.cassandra)
  class CassandraSourceRelation extends BaseRelation
    with PrunedFilteredScan
    with InsertableRelation

Page 18: Contributing a Custom Data Source

spark-packages.org
  Managed by
  Contains links to externally-managed github projects
  Ratings and comments
  Spark version requirements of each package
Examples
  https://github.com/databricks/spark-csv
  https://github.com/databricks/spark-avro
  https://github.com/databricks/spark-redshift

Page 19: Partitions, Pruning, Pushdowns

Page 20: Demo Dataset (from previous Spark After Dark talks)

RATINGS
========
UserID,ProfileID,Rating (1-10)

GENDERS
========
UserID,Gender (M,F,U)

<-- Totally Anonymous -->

Page 21: Partitions

Partition based on data usage patterns
  /genders.parquet/gender=M/…
                  /gender=F/…   <-- Use case: access users by gender
                  /gender=U/…
Partition Discovery
  On read, infer partitions from organization of data (ie. gender=F)
Dynamic Partitions
  Upon insert, dynamically create partitions
  Specify field to use for each partition (ie. gender)
  SQL: INSERT INTO TABLE genders PARTITION (gender) SELECT …
  DF: gendersDF.write.format("parquet").partitionBy("gender").save(…)
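
The directory convention behind partition discovery and dynamic partitions can be sketched in a few lines. This is plain Python for illustration only. The function names and the `part-00000.parquet` file name are hypothetical, and this is not how Spark implements either feature.

```python
def discover_partitions(path):
    """Extract key=value directory segments from a file path."""
    partitions = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            partitions[key] = value
    return partitions

def partition_dir(base, field, row):
    """Directory a row would be written to when partitioning by `field`."""
    return f"{base}/{field}={row[field]}"

# Hypothetical file name under the layout shown on the slide.
example = "/genders.parquet/gender=F/part-00000.parquet"
found = discover_partitions(example)

# A row with gender U lands in its own directory on write.
target = partition_dir("/genders.parquet", "gender", {"id": 1, "gender": "U"})
```

Reading infers the `gender` column from the path, and writing derives the path from the `gender` value, so the same layout serves both directions.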

Page 22: Pruning

Partition Pruning
  Filter out entire partitions of rows on partitioned data
  SELECT id, gender FROM genders where gender = 'U'
Column Pruning
  Filter out entire columns for all rows if not required
  Extremely useful for columnar storage formats
    Parquet, ORC
  SELECT id, gender FROM genders
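
Both kinds of pruning can be shown on a toy in-memory table. This is a plain-Python sketch. The `genders` dictionary, the extra `age` column, and the `scan` function are hypothetical and exist only to make the skipped data visible.

```python
# Hypothetical partitioned table: partition value -> rows in that partition.
genders = {
    "M": [{"id": 1, "gender": "M", "age": 30}],
    "F": [{"id": 2, "gender": "F", "age": 25}],
    "U": [{"id": 3, "gender": "U", "age": 41},
          {"id": 4, "gender": "U", "age": 19}],
}

def scan(table, columns, partition_value=None):
    """Read only the needed partitions and columns.

    Returns the selected rows and the list of partitions that were read.
    """
    if partition_value is None:
        wanted = list(table)  # no pruning: read every partition
    else:
        wanted = [p for p in table if p == partition_value]  # partition pruning
    rows = [
        {c: row[c] for c in columns}  # column pruning
        for p in wanted
        for row in table[p]
    ]
    return rows, wanted

# SELECT id, gender FROM genders WHERE gender = 'U'
rows, partitions_read = scan(genders, ["id", "gender"], partition_value="U")
```

Only the `U` partition is touched and the `age` column is never copied, which mirrors the two queries on the slide.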

Page 23: Pushdowns

Predicate (aka Filter) Pushdowns
  Predicate returns {true, false} for a given function/condition
  Filters rows as deep into the data source as possible
  Data Source must implement PrunedFilteredScan
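
A predicate pushdown can be sketched as a data source that receives the required columns and the filters, and applies both before handing rows back. This is plain Python and only loosely modeled on the idea behind PrunedFilteredScan. The `EqualTo` and `GreaterThan` classes here are Python stand-ins named after the filters listed earlier, and `InMemorySource`, `build_scan`, and the sample rows are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EqualTo:
    attribute: str
    value: object

@dataclass(frozen=True)
class GreaterThan:
    attribute: str
    value: object

def matches(row, f):
    """Evaluate one pushed-down filter against one row."""
    if isinstance(f, EqualTo):
        return row[f.attribute] == f.value
    if isinstance(f, GreaterThan):
        return row[f.attribute] > f.value
    raise ValueError(f"unsupported filter: {f!r}")

class InMemorySource:
    """Toy data source that applies filters before returning rows."""

    def __init__(self, rows):
        self.rows = rows

    def build_scan(self, required_columns, filters):
        for row in self.rows:
            if all(matches(row, f) for f in filters):
                yield {c: row[c] for c in required_columns}

# Hypothetical ratings rows (UserID, ProfileID, Rating).
source = InMemorySource([
    {"userId": 1, "profileId": 10, "rating": 9},
    {"userId": 1, "profileId": 11, "rating": 3},
    {"userId": 2, "profileId": 10, "rating": 8},
])
result = list(source.build_scan(
    ["profileId", "rating"],
    [EqualTo("userId", 1), GreaterThan("rating", 5)],
))
```

Rows that fail a filter never leave the source, so the caller receives one row instead of three.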

Page 24: Cassandra Pushdown Rules

Determines which filter predicates can be pushed down to Cassandra.
 1. Only push down no-partition key column predicates with =, >, <, >=, <= predicate
 2. Only push down primary key column predicates with = or IN predicate.
 3. If there are regular columns in the pushdown predicates, they should have at least one EQ expression on an indexed column and no IN predicates.
 4. All partition column predicates must be included in the predicates to be pushed down, only the last part of the partition key can be an IN predicate. For each partition column, only one predicate is allowed.
 5. For cluster column predicates, only last predicate can be non-EQ predicate including IN predicate, and preceding column predicates must be EQ predicates. If there is only one cluster column predicate, the predicates could be any non-IN predicate.
 6. There is no pushdown predicates if there is any OR condition or NOT IN condition.
 7. We're not allowed to push down multiple predicates for the same column if any of them is equality or IN predicate.

spark-cassandra-connector/…/o.a.s.sql.cassandra.PredicatePushDown.scala

Page 25: Native Spark SQL Data Sources

Page 26: Spark SQL Native Data Sources - Source Code

Page 27: JSON Data Source

DataFrame
  val ratingsDF = sqlContext.read.format("json")
    .load("file:/root/pipeline/datasets/dating/ratings.json.bz2")
  -- or --
  val ratingsDF = sqlContext.read.json("file:/root/pipeline/datasets/dating/ratings.json.bz2")   <-- Convenience Method
SQL Code
  CREATE TABLE genders USING json
  OPTIONS (path "file:/root/pipeline/datasets/dating/genders.json.bz2")

Page 28: JDBC Data Source

Add Driver to Spark JVM System Classpath
  $ export SPARK_CLASSPATH=<jdbc-driver.jar>
DataFrame
  val jdbcConfig = Map("driver" -> "org.postgresql.Driver",
    "url" -> "jdbc:postgresql://hostname:port/database",
    "dbtable" -> "schema.tablename")
  sqlContext.read.format("jdbc").options(jdbcConfig).load()
SQL
  CREATE TABLE genders USING jdbc
  OPTIONS (url, dbtable, driver, …)

Page 29: Parquet Data Source

Configuration
  spark.sql.parquet.filterPushdown=true
  spark.sql.parquet.mergeSchema=true
  spark.sql.parquet.cacheMetadata=true
  spark.sql.parquet.compression.codec=[uncompressed,snappy,gzip,lzo]
DataFrames
  val gendersDF = sqlContext.read.format("parquet")
    .load("file:/root/pipeline/datasets/dating/genders.parquet")
  gendersDF.write.format("parquet").partitionBy("gender")
    .save("file:/root/pipeline/datasets/dating/genders.parquet")
SQL
  CREATE TABLE genders USING parquet
  OPTIONS (path "file:/root/pipeline/datasets/dating/genders.parquet")

Page 30: ORC Data Source

Configuration
  spark.sql.orc.filterPushdown=true
DataFrames
  val gendersDF = sqlContext.read.format("orc")
    .load("file:/root/pipeline/datasets/dating/genders")
  gendersDF.write.format("orc").partitionBy("gender")
    .save("file:/root/pipeline/datasets/dating/genders")
SQL
  CREATE TABLE genders USING orc
  OPTIONS (path "file:/root/pipeline/datasets/dating/genders")

Page 31: Third-Party Data Sources

spark-packages.org

Page 32: CSV Data Source (Databricks)

Github
  https://github.com/databricks/spark-csv
Maven
  com.databricks:spark-csv_2.10:1.2.0
Code
  val gendersCsvDF = sqlContext.read
    .format("com.databricks.spark.csv")
    .load("file:/root/pipeline/datasets/dating/gender.csv.bz2")
    .toDF("id", "gender")   <-- toDF() defines column names

Page 33: Avro Data Source (Databricks)

Github
  https://github.com/databricks/spark-avro
Maven
  com.databricks:spark-avro_2.10:2.0.1
Code
  val df = sqlContext.read
    .format("com.databricks.spark.avro")
    .load("file:/root/pipeline/datasets/dating/gender.avro")

Page 34: Redshift Data Source (Databricks)

Github
  https://github.com/databricks/spark-redshift
Maven
  com.databricks:spark-redshift:0.5.0
Code
  val df: DataFrame = sqlContext.read
    .format("com.databricks.spark.redshift")
    .option("url", "jdbc:redshift://<hostname>:<port>/<database>…")
    .option("query", "select x, count(*) from my_table group by x")
    .option("tempdir", "s3n://tmpdir")
    .load()

Copies to S3 for fast, parallel reads vs single Redshift Master bottleneck

Page 35: ElasticSearch Data Source (Elastic.co)

Github
  https://github.com/elastic/elasticsearch-hadoop
Maven
  org.elasticsearch:elasticsearch-spark_2.10:2.1.0
Code
  val esConfig = Map("pushdown" -> "true", "es.nodes" -> "<hostname>", "es.port" -> "<port>")
  df.write.format("org.elasticsearch.spark.sql").mode(SaveMode.Overwrite)
    .options(esConfig).save("<index>/<document>")

Page 36: Cassandra Data Source (DataStax)

Github
  https://github.com/datastax/spark-cassandra-connector
Maven
  com.datastax.spark:spark-cassandra-connector_2.10:1.5.0-M1
Code
  ratingsDF.write.format("org.apache.spark.sql.cassandra")
    .mode(SaveMode.Append)
    .options(Map("keyspace"->"dating","table"->"ratings"))
    .save()

Page 37: REST Data Source (Databricks)

Coming Soon!
https://github.com/databricks/spark-rest?

Michael Armbrust
Spark SQL Lead @ Databricks

Page 38: DynamoDB Data Source (IBM Spark Tech Center)

Coming Soon!

https://github.com/cfregly/spark-dynamodb

Me Erlich

Page 39: SparkSQL Performance Tuning (o.a.s.sql.SQLConf)

spark.sql.inMemoryColumnarStorage.compressed=true
  Automatically selects column codec based on data
spark.sql.inMemoryColumnarStorage.batchSize
  Increase as much as possible without OOM – improves compression and GC
spark.sql.inMemoryPartitionPruning=true
  Enable partition pruning for in-memory partitions
spark.sql.tungsten.enabled=true
  Code Gen for CPU and Memory Optimizations (Tungsten aka Unsafe Mode)
spark.sql.shuffle.partitions
  Increase from default 200 for large joins and aggregations
spark.sql.autoBroadcastJoinThreshold
  Increase to tune this cost-based, physical plan optimization
spark.sql.hive.metastorePartitionPruning
  Predicate pushdown into the metastore to prune partitions early
spark.sql.planner.sortMergeJoin
  Prefer sort-merge (vs. hash join) for large joins
spark.sql.sources.partitionDiscovery.enabled & spark.sql.sources.parallelPartitionDiscovery.threshold
  Enable automatic partition discovery when loading data

Page 40: Freg-a-palooza Upcoming World Tour

① New York Strata (Sept 29th – Oct 1st)
② London Spark Meetup (Oct 12th)
③ Scotland Data Science Meetup (Oct 13th)
④ Dublin Spark Meetup (Oct 15th)
⑤ Barcelona Spark Meetup (Oct 20th)
⑥ Madrid Spark Meetup (Oct 22nd)
⑦ Amsterdam Spark Summit (Oct 27th – Oct 29th)
⑧ Delft Dutch Data Science Meetup (Oct 29th)
⑨ Brussels Spark Meetup (Oct 30th)
⑩ Zurich Big Data Developers Meetup (Nov 2nd)

High probability I’ll end up in jail

Page 41: IBM Spark Tech Center is Hiring!

Only Fun, Collaborative People - No Erlichs!

Sign up for our newsletter at

Thank You!

Power of data. Simplicity of design. Speed of innovation.

Chris Fregly @cfregly

Page 42: Power of data. Simplicity of design. Speed of innovation.

IBM Spark