
Writing your own RDD for fun and profit

by Paweł Szulc @rabbitonweb

Writing my own RDD? What for?

● To write your own RDD, you need to understand, to some extent, the internal mechanics of Apache Spark
● Writing your own RDD will prove you understand them well
● When connecting to external storage, it is reasonable to create your own RDD for it

Outline

1. The Recap
2. The Internals
3. The Fun & Profit

Part I - The Recap

RDD - the definition

RDD stands for resilient distributed dataset.

Resilient - if data is lost, it can be recreated
Distributed - stored on nodes across the cluster
Dataset - initial data comes from some distributed storage

RDD - example

val logs = sc.textFile("hdfs://logs.txt")

From the Hadoop Distributed File System; logs is the RDD.

val numbers = sc.parallelize(List(1, 2, 3, 4))

Programmatically, from a collection of elements; numbers is the RDD.

RDD - example

val logs = sc.textFile("logs.txt")

val lcLogs = logs.map(_.toLowerCase)            // creates a new RDD

val errors = lcLogs.filter(_.contains("error")) // and yet another RDD

Performance alert?!?! Not really - transformations are lazy, so nothing has been computed yet.

RDD - Operations

1. Transformations
   a. Map
   b. Filter
   c. FlatMap
   d. Sample
   e. Union
   f. Intersect
   g. Distinct
   h. GroupByKey
   i. ...

2. Actions
   a. Reduce
   b. Collect
   c. Count
   d. First
   e. Take(n)
   f. TakeSample
   g. SaveAsTextFile
   h. ...

RDD - example

val logs = sc.textFile("logs.txt")

val lcLogs = logs.map(_.toLowerCase)

val errors = lcLogs.filter(_.contains("error"))

val numberOfErrors = errors.count

count is an action: it triggers the computation, and numberOfErrors holds the calculated value (a Long).
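For completeness, here is a minimal, self-contained sketch of this pipeline that runs locally. It assumes the spark-core dependency on the classpath and a hypothetical logs.txt in the working directory; everything else is the standard SparkContext API.

import org.apache.spark.{SparkConf, SparkContext}

object ErrorCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM, using all available cores
    val sc = new SparkContext(new SparkConf().setAppName("error-count").setMaster("local[*]"))

    val logs   = sc.textFile("logs.txt")            // transformation: nothing is read yet
    val lcLogs = logs.map(_.toLowerCase)            // transformation: a new RDD
    val errors = lcLogs.filter(_.contains("error")) // transformation: yet another RDD
    val numberOfErrors = errors.count()             // action: triggers the whole pipeline

    println(s"errors: $numberOfErrors")
    sc.stop()
  }
}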

Partitions?

A partition represents a subset of data within your distributed collection.

The number of partitions is tightly coupled with the level of parallelism.
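As a quick illustration of that coupling (a sketch, assuming an existing SparkContext sc and the hypothetical logs.txt again), you can both request a partition count and inspect the one you actually got:

val logs = sc.textFile("logs.txt", minPartitions = 8)  // ask for at least 8 input splits
println(logs.getNumPartitions)                         // how many partitions were actually created

val numbers = sc.parallelize(1 to 1000, numSlices = 4) // exactly 4 partitions
println(numbers.getNumPartitions)

getNumPartitions is available from Spark 1.6; on older versions use rdd.partitions.length.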

Partitions evaluation

val counted = sc.textFile(..).count

[Animation: the file's partitions live on node 1, node 2 and node 3; each partition is evaluated by its own task, in parallel, and the partial counts are combined into counted.]

Pipeline

map count

[Animation: within one partition, map and count are pipelined - each partition is processed end to end by a single task.]

But what if...

val startings = allShakespeare
  .filter(_.trim != "")
  .groupBy(_.charAt(0))
  .mapValues(_.size)
  .reduceByKey { case (acc, length) => acc + length }

And now what?

filter can still be evaluated independently on each partition, but groupBy needs all the values for a given key in one place - data has to move between partitions.

Shuffling

groupBy introduces a shuffle:

● Spark waits for the calculations on all partitions before moving on.
● Data flies around the cluster, so that all values for a given key end up in the same partition.

Stage

The shuffle splits the job into stages: stage 1 pipelines filter (plus the map side of the groupBy), stage 2 pipelines mapValues and reduceByKey. Within a stage, each partition is processed by a single task.
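You can see this split yourself with RDD.toDebugString, which prints the RDD lineage; a new indentation level in the output marks a shuffle, i.e. a stage boundary. A sketch, assuming a SparkContext sc and a hypothetical shakespeare.txt:

val allShakespeare = sc.textFile("shakespeare.txt")

val startings = allShakespeare
  .filter(_.trim != "")
  .groupBy(_.charAt(0))
  .mapValues(_.size)
  .reduceByKey { case (acc, length) => acc + length }

// Prints the lineage of RDDs; the indented block under the shuffled RDD
// corresponds to everything that runs in the earlier stage.
println(startings.toDebugString)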

Part II - The Internals

What is a RDD?

Resilient Distributed Dataset

The running example is an event journal:

10  10/05/2015 10:14:01 UserInitialized Ania Nowak
10  10/05/2015 10:14:55 FirstNameChanged Anna
12  10/05/2015 10:17:03 UserLoggedIn
12  10/05/2015 10:21:31 UserLoggedOut
...
198 13/05/2015 21:10:11 UserInitialized Jan Kowalski

[Diagram: the journal distributed across node 1, node 2 and node 3.]

What is a RDD?

RDD needs to hold 3 chunks of information in order to do its work:

1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data

What is a partition?

A partition represents a subset of data within your distributed collection.

override def getPartitions: Array[Partition] = ???

How this subset is defined depends on the type of the RDD.
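The Partition contract itself is tiny: it only has to carry its index, plus whatever your compute method will need later. A minimal sketch, with made-up names:

import org.apache.spark.Partition

// Anything compute() needs at evaluation time (offsets, paths, seeds, ...)
// can be carried in the partition as plain serializable fields.
case class CustomPartition(index: Int, path: String) extends Partition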

example: HadoopRDD

val journal = sc.textFile("hdfs://journal/*")

How is a HadoopRDD partitioned?

In a HadoopRDD, each partition corresponds exactly to a file chunk (input split) in HDFS.

example: HadoopRDD

[Diagram: the journal's four HDFS file chunks spread across node 1, node 2 and node 3 - each chunk becomes one partition of the HadoopRDD.]

example: HadoopRDD

class HadoopRDD[K, V](...) extends RDD[(K, V)](sc, Nil) with Logging {
  ...
  override def getPartitions: Array[Partition] = {
    val jobConf = getJobConf()
    SparkHadoopUtil.get.addCredentials(jobConf)
    val inputFormat = getInputFormat(jobConf)
    if (inputFormat.isInstanceOf[Configurable]) {
      inputFormat.asInstanceOf[Configurable].setConf(jobConf)
    }
    val inputSplits = inputFormat.getSplits(jobConf, minPartitions)
    val array = new Array[Partition](inputSplits.size)
    for (i <- 0 until inputSplits.size) {
      array(i) = new HadoopPartition(id, i, inputSplits(i))
    }
    array
  }
}

example: MapPartitionsRDD

val journal = sc.textFile("hdfs://journal/*")

val fromMarch = journal.filter {
  case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1)
}

How is a MapPartitionsRDD partitioned?

MapPartitionsRDD inherits its partition information from its parent RDD.

example: MapPartitionsRDD

class MapPartitionsRDD[U: ClassTag, T: ClassTag](...) extends RDD[U](prev) {
  ...
  override def getPartitions: Array[Partition] = firstParent[T].partitions
}
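The same two pieces - getPartitions delegating to the parent, and compute transforming the parent's iterator - are all you need for your own narrow-dependency RDD. A minimal sketch (UpperCaseRDD is a made-up name, not part of Spark):

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// A one-to-one ("narrow") RDD: each output partition depends on exactly one parent partition.
class UpperCaseRDD(prev: RDD[String]) extends RDD[String](prev) {

  // Reuse the parent's partitioning as-is, just like MapPartitionsRDD does.
  override def getPartitions: Array[Partition] = firstParent[String].partitions

  // Evaluate a partition by pulling the parent's iterator for the same split
  // and transforming it lazily.
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    firstParent[String].iterator(split, context).map(_.toUpperCase)
}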

What is a RDD?

RDD needs to hold 3 chunks of information in order to do its work:

1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data

RDD parent

sc.textFile("hdfs://journal/*")
  .groupBy(extractDate _)
  .map { case (date, events) => (date, events.size) }
  .filter {
    case (date, size) => LocalDate.parse(date) isAfter LocalDate.of(2015, 3, 1)
  }
  .take(300)
  .foreach(println)

Stripped down to the bare call chain:

sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()

Directed acyclic graph

sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()

HadoopRDD -> ShuffledRDD -> MapPartitionsRDD -> MapPartitionsRDD

Two types of parent dependencies:

1. narrow dependency
2. wide dependency
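You can inspect these dependencies at runtime. A quick sketch, assuming a SparkContext sc and the journal path from before:

val grouped = sc.textFile("hdfs://journal/*").groupBy(_.take(10))
val mapped  = grouped.map { case (k, events) => (k, events.size) }

// groupBy created a wide (shuffle) dependency on its parent:
println(grouped.dependencies)   // e.g. List(org.apache.spark.ShuffleDependency@...)

// map created a narrow (one-to-one) dependency on its parent:
println(mapped.dependencies)    // e.g. List(org.apache.spark.OneToOneDependency@...)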

Directed acyclic graph

sc.textFile() .groupBy() .map { } .filter { } .take() .foreach()

Stage 1 | Stage 2

The wide dependency introduced by groupBy is where the graph is cut into Stage 1 and Stage 2.

What is a RDD?

RDD needs to hold 3 chunks of information in order to do its work:

1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data

Running Job aka materializing DAG

Stage 1 | Stage 2

sc.textFile() .groupBy() .map { } .filter { } .collect()

collect() is the action here. Actions are implemented using the sc.runJob method.

Running Job aka materializing DAG

/**
 * Run a function on a given set of partitions in an RDD and return the results as an array.
 */
def runJob[T, U](
    rdd: RDD[T],
    func: Iterator[T] => U,
): Array[U]

Running Job aka materializing DAG

/**
 * Return an array that contains all of the elements in this RDD.
 */
def collect(): Array[T] = {
  val results = sc.runJob(this, (iter: Iterator[T]) => iter.toArray)
  Array.concat(results: _*)
}

Running Job aka materializing DAG

/**

* Return the number of elements in the RDD.

*/

def count(): Long = sc.runJob(this, Utils.getIteratorSize _).sum
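runJob is a public method on SparkContext, so you can build your own actions on top of it exactly as collect and count do. Here is a sketch of a hypothetical countNonEmpty action (not part of Spark's API), added through an implicit class:

import org.apache.spark.rdd.RDD

object CustomActions {
  implicit class RichStringRDD(val rdd: RDD[String]) extends AnyVal {
    // A custom action: one function runs per partition via runJob,
    // and the per-partition results are combined on the driver.
    def countNonEmpty(): Long =
      rdd.sparkContext
        .runJob(rdd, (it: Iterator[String]) => it.count(_.nonEmpty).toLong)
        .sum
  }
}

// usage: import CustomActions._ ; then myStringRdd.countNonEmpty()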

Multiple jobs for single action

/**
 * Take the first num elements of the RDD. It works by first scanning one partition, and use
 * the results from that partition to estimate the number of additional partitions needed to
 * satisfy the limit.
 */
def take(num: Int): Array[T] = {
  while (buf.size < num && partsScanned < totalParts) {
    (…)
    val left = num - buf.size
    val res = sc.runJob(this, (it: Iterator[T]) => it.take(left).toArray, p, allowLocal = true)
    (…)
    res.foreach(buf ++= _.take(num - buf.size))
    partsScanned += numPartsToTry
    (…)
  }
  buf.toArray
}

Running Job aka materializing DAG

def runJob[T, U](
    rdd: RDD[T],
    func: Iterator[T] => U,
): Array[U]

Where does the Iterator[T] for each partition come from? From the RDD's compute method.

Running Job aka materializing DAG

/**
 * :: DeveloperApi ::
 * Implemented by subclasses to compute a given partition.
 */
@DeveloperApi
def compute(split: Partition, context: TaskContext): Iterator[T]

What is a RDD?

RDD actually needs to hold 3 + 2 chunks of information in order to do its work:

1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
4. data locality
5. partitioner

Data Locality: HDFS example

[Diagram: the journal's HDFS chunks on node 1, node 2 and node 3 - Spark prefers to schedule each partition's task on a node that already holds that partition's block.]
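Data locality is exposed to custom RDDs through getPreferredLocations. A toy sketch (PinnedRDD and the hostnames are made up for illustration):

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// One partition per preferred host.
case class PinnedPartition(index: Int, host: String) extends Partition

class PinnedRDD(sc: SparkContext, hosts: Seq[String]) extends RDD[String](sc, Nil) {

  override def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (h, i) => PinnedPartition(i, h): Partition }.toArray

  // Data locality: the scheduler will try to run each partition's task on its host.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[PinnedPartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    Iterator(s"evaluated with a preference for ${split.asInstanceOf[PinnedPartition].host}")
}

// usage: new PinnedRDD(sc, Seq("node1.example.com", "node2.example.com")).collect()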

What is a RDD?

RDD needs to hold 3 + 2 chunks of information in order to do its work:

1. a pointer to its parent
2. how its internal data is partitioned
3. how to evaluate its internal data
4. data locality
5. partitioner

Spark performance - shuffle optimization

join - joining two RDDs normally shuffles both of them.

map groupBy join - after the groupBy, that side is already hash-partitioned by key, so the join can avoid re-shuffling it.

Optimization: shuffle avoided if data is already partitioned.

map groupBy map join - a plain map on the grouped data discards the partitioner (it may change the keys), so the join has to shuffle that side again.

map groupBy mapValues join - mapValues keeps the keys, and therefore the partitioner, so the shuffle is still avoided.
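The rule of thumb is visible directly on the partitioner field (a sketch, assuming a SparkContext sc): mapValues preserves the parent's partitioner while map discards it, and that is exactly what decides whether the next join has to shuffle.

import org.apache.spark.HashPartitioner

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
  .partitionBy(new HashPartitioner(4))        // now hash-partitioned by key

println(pairs.map(identity).partitioner)      // None - map may change keys, partitioner dropped
println(pairs.mapValues(_ + 1).partitioner)   // Some(HashPartitioner@...) - keys untouched, kept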

Part III - The Fun & Profit

It’s all on github!

http://bit.do/scalapolis

RandomRDD

sc.random()
  .take(3)
  .foreach(println)

210
-321
21312

sc.random(maxSize = 10, numPartitions = 4)
  .take(10)
  .foreach(println)
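The speaker's actual implementation is in the repository linked above; what follows is an independent sketch of how such an RDD could look. The parameter names maxSize and numPartitions come from the slides; the semantics (maxSize random numbers per partition) and everything else are assumptions.

import scala.util.Random
import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical partition: an index plus a seed, so each partition is reproducible.
case class RandomPartition(index: Int, seed: Long) extends Partition

// A source RDD (no parent, hence Nil dependencies) that generates random integers on demand.
class RandomRDD(sc: SparkContext, maxSize: Int, numPartitions: Int) extends RDD[Int](sc, Nil) {

  override def getPartitions: Array[Partition] =
    (0 until numPartitions).map(i => RandomPartition(i, Random.nextLong()): Partition).toArray

  override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
    val rnd = new Random(split.asInstanceOf[RandomPartition].seed)
    // Each partition contributes up to maxSize random numbers.
    Iterator.fill(maxSize)(rnd.nextInt())
  }
}

// The sc.random(...) syntax from the slides can be provided with an implicit class.
object RandomRDD {
  implicit class RandomContext(val sc: SparkContext) extends AnyVal {
    def random(maxSize: Int = 100, numPartitions: Int = 2): RDD[Int] =
      new RandomRDD(sc, maxSize, numPartitions)
  }
}

// usage: import RandomRDD._ ; sc.random(maxSize = 10, numPartitions = 4).take(10).foreach(println)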

CensorshipRDD

val statement =
  sc.parallelize(List("We", "all", "know that", "Hadoop rocks!"))
    .censor()
    .collect().toList.mkString(" ")

println(statement)

sc.parallelize(List("We", "all", "know that", "Hadoop rocks!"))
  .censor().collectLegal().foreach(println)

We
all
know that
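Again, the real implementation lives in the repository; here is a hedged sketch of one way to get this behaviour: censor() wraps the parent RDD, and collectLegal() is a custom action that drops the "illegal" entries entirely. The banned-word list, the "[CENSORED]" placeholder and all names are assumptions.

import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical: treat anything mentioning Hadoop as "illegal", matching the slides' output.
class CensorshipRDD(prev: RDD[String]) extends RDD[String](prev) {
  private val banned = Seq("Hadoop")

  override def getPartitions: Array[Partition] = firstParent[String].partitions

  // The censored view: illegal entries are blanked out rather than removed.
  override def compute(split: Partition, context: TaskContext): Iterator[String] =
    firstParent[String].iterator(split, context).map { s =>
      if (banned.exists(b => s.contains(b))) "[CENSORED]" else s
    }

  // A custom action: collect only the entries that were legal to begin with.
  def collectLegal(): Array[String] =
    sparkContext.runJob(
      firstParent[String],
      (it: Iterator[String]) => it.filterNot(s => banned.exists(b => s.contains(b))).toArray
    ).flatten
}

object CensorshipRDD {
  implicit class Censorable(val rdd: RDD[String]) extends AnyVal {
    def censor(): CensorshipRDD = new CensorshipRDD(rdd)
  }
}

// usage: import CensorshipRDD._ ; sc.parallelize(...).censor().collectLegal()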

Fin

Paweł Szulc
paul.szulc@gmail.com
Twitter: @rabbitonweb
http://rabbitonweb.com
http://github.com/rabbitonweb
http://bit.do/scalapolis
