Introduction to Spark with Scala


Introduction to Spark with Scala

Himanshu Gupta
Software Consultant
Knoldus Software LLP

Who am I?

Himanshu Gupta (@himanshug735)

Software Consultant at Knoldus Software LLP

Spark & Scala enthusiast


Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo

What is Apache Spark?

A fast and general engine for large-scale data processing, with libraries for SQL, streaming, and advanced analytics.

Spark History

2009 - Project begins at the UC Berkeley AMPLab
2010 - Open sourced
2013 - Enters the Apache Incubator; Cloudera support; Spark Summit 2013
2014 - Becomes an Apache top-level project; Spark Summit 2014
2015 - DataFrames

Spark Stack

Img src - http://spark.apache.org/

Fastest Growing Open Source Project

Img src - https://databricks.com/blog/2015/03/31/spark-turns-five-years-old.html

Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo

Code Size

Img src - http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

Word Count Example

Hadoop MapReduce (Java):

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Spark (Scala):

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Daytona GraySort Record: 100 TB of data to sort

Img src - http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

Hadoop (2013): 2100 nodes, 72 minutes
Spark (2014): 206 nodes, 23 minutes

Runs Everywhere

Img src - http://spark.apache.org/

Who is using Apache Spark?

Img src - http://www.slideshare.net/datamantra/introduction-to-apache-spark-45062010

Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo

Brief Introduction to RDD

RDD stands for Resilient Distributed Dataset:

A fault-tolerant, distributed collection of objects.

In Spark, all work is expressed in one of the following ways:
1) Creating new RDD(s)
2) Transforming existing RDD(s)
3) Calling operations on RDD(s)
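As a quick illustration of these three kinds of work (a minimal sketch, assuming an already-created SparkContext named sc):

val nums  = sc.parallelize(1 to 100)  // 1) create a new RDD from a local collection
val evens = nums.filter(_ % 2 == 0)   // 2) transform it into a new RDD (nothing runs yet)
val count = evens.count()             // 3) call an action, which actually runs the job; returns 50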

Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)

This is the Spark Configuration

Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)

This is the Spark Context

Contd...

Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")

Extract lines from the text file

Contd...

Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_, 1))

Map lines to words

Contd...

Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCountRDD = words.reduceByKey(_ + _)

Word Count RDD

Contd...

Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCountRDD = words.reduceByKey(_ + _)
val wordCount = wordCountRDD.collect

collect starts the computation and returns the (word, count) pairs

Contd...

Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCountRDD = words.reduceByKey(_ + _)
val wordCount = wordCountRDD.collect

flatMap, map, and reduceByKey are Transformations (lazy); collect is an Action (it triggers the computation)
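Putting the steps together as a standalone application (a minimal sketch against the Spark 1.2.x API used in this deck; setAppName is required outside the shell, and demo.txt stands for any local text file):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._ // brings reduceByKey into scope on Spark 1.2.x

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("WordCount")
    val sc = new SparkContext(conf)

    val lines = sc.textFile("demo.txt")                  // RDD of lines
    val words = lines.flatMap(_.split(" ")).map((_, 1))  // RDD of (word, 1) pairs
    val wordCountRDD = words.reduceByKey(_ + _)          // transformation: still lazy
    val wordCount = wordCountRDD.collect()               // action: runs the job

    wordCount.foreach { case (word, count) => println(s"$word: $count") }
    sc.stop()
  }
}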

Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo

Brief Introduction to Spark Streaming

Img src - http://spark.apache.org/

How Spark Streaming Works?

Img src - http://spark.apache.org/

Why do we need Spark Streaming?

High Level API:

TwitterUtils.createStream(...)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(10), Seconds(5)) // counting tweets on a sliding window

Fault Tolerant:

Integration:

Img src - http://spark.apache.org/

Integrated with Spark SQL, MLlib, GraphX...
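One concrete way this integration shows up: each micro-batch of a DStream is an ordinary RDD, so core RDD (or Spark SQL/MLlib) code can run on it via foreachRDD. A minimal sketch, assuming a words DStream of (word, 1) pairs like the one built in the following slides:

words.foreachRDD { rdd =>
  // rdd is a plain RDD for this batch; any RDD, Spark SQL, or MLlib code can use it
  // (on Spark 1.2.x, reduceByKey needs: import org.apache.spark.SparkContext._)
  val top3 = rdd.reduceByKey(_ + _).map(_.swap).top(3) // three most frequent words
  top3.foreach(println)
}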

Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)

Specify the Spark configuration

Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))

Set up the Streaming Context

Contd...

Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

This is the ReceiverInputDStream: the lines DStream holds one RDD per batch interval (at time 0-1, 1-2, 2-3, 3-4, ...)

Contd...

Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).map((_, 1))

map creates a new DStream (a sequence of RDDs): each batch of the lines DStream is mapped to a batch of the words/pairs DStream

Contd...

Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCounts = words.reduceByKey(_ + _)

Groups the DStream by words: each batch of the words/pairs DStream is reduced to a batch of the wordCount DStream

Contd...

Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCounts = words.reduceByKey(_ + _)

ssc.start()

Starts the streaming and the computation
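For reference, the whole example as a standalone application (a minimal sketch against the Spark Streaming 1.2.x API; note that a DStream needs an output operation such as print() before start(), that awaitTermination() keeps the application alive, and that a socket receiver needs at least two local threads, hence local[2]):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._ // pair DStream ops on Spark 1.2.x

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the socket receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" ")).map((_, 1))
    val wordCounts = words.reduceByKey(_ + _)

    wordCounts.print()     // output operation: required before start()
    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // keep the application running
  }
}

You can feed it test input from another terminal with: nc -lk 9999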

Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo

How to Install Spark?

Download Spark from:
http://spark.apache.org/downloads.html

Extract it to a suitable directory.

Go to the directory in a terminal and run the following command:

mvn -DskipTests clean package

Now Spark is ready to run in interactive mode:

./bin/spark-shell
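To verify the build, you can run the word count interactively (a minimal sketch; spark-shell provides the SparkContext as sc, and demo.txt stands for any local text file):

// inside spark-shell
val lines = sc.textFile("demo.txt")
val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)
counts.take(5).foreach(println) // print a few (word, count) pairs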

sbt Setup

name := "Spark Demo"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.2.1",
  "org.apache.spark" %% "spark-streaming" % "1.2.1",
  "org.apache.spark" %% "spark-sql"       % "1.2.1",
  "org.apache.spark" %% "spark-mllib"     % "1.2.1"
)

Agenda

● What is Spark?
● Why do we need Spark?
● Brief introduction to RDD
● Brief introduction to Spark Streaming
● How to install Spark?
● Demo

Demo

Download Code

https://github.com/knoldus/spark-scala

References

http://spark.apache.org/

http://spark-summit.org/2014

http://spark.apache.org/docs/latest/quick-start.html

http://stackoverflow.com/questions/tagged/apache-spark

https://www.youtube.com/results?search_query=apache+spark

http://apache-spark-user-list.1001560.n3.nabble.com/

http://www.slideshare.net/paulszulc/apache-spark-101-in-50-min

Presenter:
himanshu@knoldus.com
@himanshug735

Organizer:
@Knolspeak
http://www.knoldus.com
http://blog.knoldus.com

Thanks
