Source: meetupfiles.meetup.com/18532292/apache spark.pdf

Apache Spark
Sreeram Nudurupati, May 2015

What is Spark?
A distributed computing platform designed to be:

Fast

Fast to develop distributed applications

Fast to run distributed applications

General Purpose

A single framework to handle a variety of workloads

Batch, interactive, iterative, streaming, SQL

Spark Architecture

How to run Spark?
Local

Not really distributed computing

Cluster

Standalone Scheduler + Shared File System

YARN

Mesos

Amazon EC2 + S3

Google Compute Engine + Mesosphere

Databricks Cloud

Spark Cluster

• Mesos
• YARN
• Standalone

Spark Basics
RDD - Resilient Distributed Datasets

Spark’s primary abstraction

A distributed collection of items called elements

Can be created from a variety of sources

Immutable

RDD Visualized

(Diagram: RDDs 1-3, each split into partitions, with the partitions distributed across Nodes 1-4)

RDD Operations
Transformations

Operate on an RDD and return a new RDD

Are lazily evaluated

Actions

Return a value after running a computation on an RDD

Lazy Evaluation

Evaluation happens only when an action is called

Deferring decisions enables better runtime optimization

(Diagram: Transformation 1 (map) → Transformation 2 (filter) → Action (collect), which brings the data back to the Driver)

DataFrames
Extension of the RDD API and a Spark SQL abstraction

Distributed collection of data with named columns

Equivalent to RDBMS tables or data frames in R/Pandas

Can be built from a variety of structured data sources

Hive tables, JSON, Databases, RDDs etc.

Why DataFrames?
Many data formats are structured

Schema-on-read

The data has inherent structure, and a schema is needed to make sense of it

RDD programming with structured data is not intuitive

SchemaRDD = RDD[Row] + Schema

Write SQLs

Use Domain Specific Language (DSL)

RDD vs DataFrame
DataFrame

Inbuilt support for a variety of data formats

A more feature rich DSL

Memory management with Java objects is challenging

GC-free managed memory is planned for the future

Execution optimized by Catalyst

JVM bytecode generated for any/all APIs


DataFrame Ops
printSchema - prints the schema

show(N) - shows N rows

join - joins two DFs

apply - returns the selected column

select - returns a new DF with the selected columns

selectExpr - selects using SQL expressions

filter - same as where

groupBy - groups using the specified columns

saveAs(JSON/Parquet/Table) - saves the DF in the given format

saveAsTable - saves to a Hive table

createJDBCTable - saves to a JDBC database

SQLContext Ops
parquetFile - loads a Parquet file into a DF

jsonFile - loads a JSON file into a DF

load - creates a DF from a data source

createExternalTable - creates a Hive external table

jdbc - creates a DF from a database table over JDBC

sql - executes a SQL query

table - returns the specified table as a DF

cacheTable - caches a table in memory

Demo

What Next?
Spark Community: spark.apache.org/community.html

Worldwide Events: goo.gl/2YqJZK

Video, presentation archives: spark-summit.org

Dev resources: databricks.com/spark/developer-resources

Workshops: databricks.com/services/spark-training

Books: Learning Spark, Advanced Analytics with Apache Spark

Github: https://github.com/snudurupati/spark_training
