Spark: Cluster Computing with Working Sets --Aaron 2013/03/28
TRANSCRIPT
Spark: Cluster Computing with Working Sets
--Aaron 2013/03/28
Overview
• Motivation
• Spark
• Resilient Distributed Datasets (RDD)
• Shared Variables
• Experiments
Motivation
Hadoop
• Advantages: scalability, fault tolerance
• Disadvantage: acyclic data flow makes it unsuitable for iterative algorithms
Disadvantages of Hadoop(1)
• Iterative jobs: Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter.
• Interactive analytics: When running interactive queries on large datasets, a user should be able to load a dataset of interest into memory across a number of machines and query it repeatedly.
Disadvantages of Hadoop(2)
• In most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system.
• This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times.
Spark
Programming Model
• Driver Program: implements the high-level control flow of the application and launches various operations in parallel.
• Resilient Distributed Datasets (RDD)
• Parallel Operations: pass a function to apply to a dataset.
• Shared Variables: can be used in functions running on the cluster.
Framework of Spark
Resilient Distributed Datasets (RDD)
Resilient Distributed Datasets
• A resilient distributed dataset (RDD) is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
• The elements of an RDD need not exist in physical storage; instead, a handle to an RDD contains enough information to compute the RDD starting from data in reliable storage. This means that RDDs can always be reconstructed if nodes fail.
Changing the Persistence of RDD
• By default, RDDs are lazy and ephemeral.
• A user can alter the persistence of an RDD through two actions:
– Cache action: hints that the RDD should be kept in memory after the first time it is computed, because it will be reused.
– Save action: evaluates the dataset and writes it to a distributed file system such as HDFS.
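The lazy-by-default behavior with an opt-in cache hint can be sketched in plain Python. This is a conceptual mock, not Spark's actual implementation; the class and method names (`LazyDataset`, `materialize`) are illustrative.

```python
# Conceptual sketch (not Spark's real implementation): a dataset is lazy by
# default and only materializes when forced; cache() keeps the computed
# result in memory so later uses are served without recomputation.
class LazyDataset:
    def __init__(self, compute_fn):
        self._compute = compute_fn   # deferred computation
        self._cached = None
        self._cache_enabled = False

    def cache(self):
        self._cache_enabled = True   # hint: keep result after first compute
        return self

    def materialize(self):
        if self._cached is not None:
            return self._cached      # reuse the in-memory copy
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result

calls = []
ds = LazyDataset(lambda: calls.append(1) or [x * 2 for x in range(3)]).cache()
first = ds.materialize()
second = ds.materialize()            # served from cache; compute ran once
```

Note that without `cache()`, every `materialize()` would rerun the computation, which mirrors the "ephemeral" default described above.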
Operations on RDD(1)
• Transformations (create an RDD): operations on either (1) data in stable storage or (2) other RDDs.
• Examples of transformations include map, filter, and join.
Operations on RDD(2)
• Actions: operations that return a value to the application or export data to a storage system.
• Examples of actions include count, collect, and save.
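The transformation/action split can be illustrated with a tiny mock. This is a conceptual sketch only; `MockRDD` is an invented name and the real Spark API has many more operations.

```python
# Conceptual sketch: transformations (map, filter) build a new lazy dataset;
# actions (count, collect) force evaluation and return a value to the driver.
class MockRDD:
    def __init__(self, data_fn):
        self._data_fn = data_fn          # deferred: nothing computed yet

    # --- transformations: return a new MockRDD, no work happens here ---
    def map(self, f):
        return MockRDD(lambda: [f(x) for x in self._data_fn()])

    def filter(self, p):
        return MockRDD(lambda: [x for x in self._data_fn() if p(x)])

    # --- actions: evaluate the chain and return a plain value ---
    def collect(self):
        return self._data_fn()

    def count(self):
        return len(self._data_fn())

rdd = MockRDD(lambda: [1, 2, 3, 4])
evens = rdd.map(lambda x: x * 10).filter(lambda x: x % 20 == 0)
# evens.collect() == [20, 40]
```

Building `evens` does no work; only calling `collect` or `count` evaluates the chained transformations.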
Operations on RDD(3)
• Programmers can call a persist method to indicate which RDDs they want to reuse in future operations.
• Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM.
• Users can set a persistence priority on each RDD to specify which in-memory data should spill to disk first.
Example1 of RDD
• These datasets will be stored as a chain of objects capturing the lineage of each RDD. Each dataset object contains a pointer to its parent and information about how the parent was transformed.
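The chain-of-objects idea can be sketched as follows. This is a simplified illustration (the names `LineageNode` and `recompute` are invented), showing how a lost dataset is rebuilt by walking parent pointers back to stable storage.

```python
# Conceptual sketch: each dataset records a pointer to its parent plus the
# transformation applied, so lost data can be recomputed from its lineage.
class LineageNode:
    def __init__(self, parent, transform, source=None):
        self.parent = parent        # None for a base dataset
        self.transform = transform  # how the parent was transformed
        self.source = source        # base data in "stable storage"

    def recompute(self):
        if self.parent is None:
            return list(self.source)            # reread from stable storage
        return self.transform(self.parent.recompute())

base = LineageNode(None, None, source=[1, 2, 3])
mapped = LineageNode(base, lambda d: [x + 1 for x in d])
filtered = LineageNode(mapped, lambda d: [x for x in d if x > 2])
# If filtered's data is lost, walking the chain rebuilds it from base.
```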
Lineage Chain of Example1
Example2 of RDD
Lineage Chain of Example2
RDD Objects
• Each RDD object implements the same simple interface, which consists of five pieces of information: a set of partitions, a set of dependencies on parent RDDs, a function for computing the dataset from its parents, and metadata about its partitioning scheme and preferred data placement.
Examples of RDD Operations
• HDFS files (lines):
– partitions(): returns one partition for each block of the file (with the block's offset stored in each Partition object)
– preferredLocations: gives the nodes the block is on
– iterator: reads the block
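A minimal sketch of this interface for an HDFS-backed file, using in-memory dictionaries in place of real blocks. The class name `HdfsLinesRDD` and its constructor arguments are illustrative, not Spark's actual types.

```python
# Conceptual sketch of the per-RDD interface above, specialized for a file
# split into blocks (block contents and locations mocked as dictionaries).
class HdfsLinesRDD:
    def __init__(self, blocks, block_locations):
        self._blocks = blocks                  # block id -> list of lines
        self._locations = block_locations      # block id -> nodes holding it

    def partitions(self):
        # one partition per HDFS block
        return list(self._blocks.keys())

    def preferred_locations(self, partition):
        # nodes where that block already resides (used for locality)
        return self._locations[partition]

    def iterator(self, partition):
        # read the block's contents
        return iter(self._blocks[partition])

rdd = HdfsLinesRDD(
    {"block-0": ["a", "b"], "block-1": ["c"]},
    {"block-0": ["node1"], "block-1": ["node2", "node3"]},
)
```

A scheduler built on this interface would place the task for `block-1` on `node2` or `node3` to avoid moving data.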
Dependencies between RDDs(1)
• Narrow Dependencies: each partition of the parent RDD is used by at most one partition of the child RDD (1:1). map leads to a narrow dependency.
• Wide Dependencies: multiple child partitions may depend on one parent partition (1:N). join leads to wide dependencies.
Dependencies between RDDs(2)
• Narrow dependencies allow pipelined execution on one cluster node, which can compute all the parent partitions. For example, one can apply a map followed by a filter on an element-by-element basis.
• Wide dependencies require data from all parent partitions to be available and shuffled across the nodes using a MapReduce-like operation.
• Recovery after a node failure is more efficient with a narrow dependency than with a wide dependency.
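The two dependency shapes can be shown on plain partitioned lists. This is a conceptual sketch: a per-partition `map` needs only its own parent partition (narrow), while a hash-based redistribution reads from every parent partition (wide).

```python
# Conceptual sketch: narrow vs wide dependencies on partitioned data.
parent = [[1, 2], [3, 4], [5, 6]]        # three parent partitions

# Narrow (map): child partition i depends only on parent partition i,
# so each partition can be processed in a local pipeline.
mapped = [[x * 10 for x in part] for part in parent]

# Wide (shuffle): redistribute every element by hash into 2 child
# partitions; each child partition may receive data from every parent.
children = [[], []]
for part in parent:
    for x in part:
        children[x % 2].append(x)        # cross-partition data movement
```

Losing one `mapped` partition requires recomputing only one parent partition; losing one `children` partition requires rereading all three, which is the recovery asymmetry noted above.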
Parallel Operations
• reduce: Combines dataset elements using an associative function to produce a result at the driver program.
• collect: Sends all elements of the dataset to the driver program.
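Why reduce requires an associative function can be sketched with per-partition partial results. This is an illustration of the two-phase pattern, not Spark's internal code.

```python
# Conceptual sketch: reduce combines elements with an associative function,
# first within each partition (on workers), then across partial results
# (at the driver). Associativity makes the split-then-combine order safe.
from functools import reduce

partitions = [[1, 2, 3], [4, 5], [6]]
add = lambda a, b: a + b

partials = [reduce(add, part) for part in partitions]   # worker-side
total = reduce(add, partials)                           # driver-side

# collect simply gathers every partition's elements at the driver
collected = [x for part in partitions for x in part]
```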
Shared Variables
Shared Variables
• Programmers invoke operations like map, filter, and reduce by passing closures (functions) to Spark. Normally, when Spark runs a closure on a worker node, the variables used in the closure are copied to that worker.
• However, Spark also lets programmers create two restricted types of shared variables to support two simple but common usage patterns.
Broadcast Variables
• When one creates a broadcast variable b with a value v, v is saved to a file in a shared file system. The serialized form of b is a path to this file. When b’s value is queried on a worker node, Spark first checks whether v is in a local cache, and reads it from the file system if it isn’t.
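The broadcast protocol described above can be mocked with two dictionaries. This is a conceptual sketch: `shared_fs` stands in for the shared file system and the function names are invented.

```python
# Conceptual sketch: a broadcast value is written once to shared storage;
# its serialized form is just a path. Each worker reads the file at most
# once, serving later lookups from a local cache.
shared_fs = {}        # stands in for a shared file system
reads = []            # records file-system reads, for illustration

def broadcast(name, value):
    shared_fs[name] = value
    return name       # the "serialized form" of b is the path to the file

def read_on_worker(path, local_cache):
    if path not in local_cache:           # check the local cache first
        reads.append(path)                # only a miss touches the FS
        local_cache[path] = shared_fs[path]
    return local_cache[path]

b = broadcast("/bcast/lookup", {"a": 1, "b": 2})
cache = {}                                # one worker's local cache
v1 = read_on_worker(b, cache)
v2 = read_on_worker(b, cache)             # served locally, no second read
```

The point of the design is that shipping the closure only costs the size of the path, while the (possibly large) value moves to each worker at most once.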
Accumulators
• Each accumulator is given a unique ID when it is created. When the accumulator is saved, its serialized form contains its ID and the “zero” value for its type.
• On the workers, a separate copy of the accumulator is created for each thread that runs a task using thread-local variables, and is reset to zero when a task begins. After each task runs, the worker sends a message to the driver program containing the updates it made to various accumulators.
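The per-task copy-and-merge semantics can be sketched without threads. This is a simplified illustration of the protocol above (one accumulator, sequential "tasks"); real Spark uses thread-local copies and message passing.

```python
# Conceptual sketch: each task gets a private accumulator copy reset to the
# type's "zero"; after the task finishes, its update is sent back and the
# driver merges it into the global value.
driver_total = 0                  # the accumulator's value at the driver

def run_task(task_data):
    local_acc = 0                 # per-task copy, reset to zero
    for x in task_data:
        local_acc += x            # updates stay local during the task
    return local_acc              # the "message" sent back to the driver

for task in [[1, 2], [3], [4, 5]]:
    driver_total += run_task(task)    # driver applies each task's update
```

Because tasks only ever add to a zeroed local copy, the merged result is independent of task ordering, which is what makes accumulators safe under parallel execution.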
Experiments
Results(1)
• Dataset: 29 GB
• Environment: 20 "m1.xlarge" EC2 nodes with 4 cores each
• Algorithm: logistic regression
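The benchmarked workload's iterative shape can be sketched in a few lines. This is a toy one-feature gradient-descent version (data, learning rate, and iteration count are invented), shown only to make clear why rescanning the same dataset every iteration rewards in-memory caching.

```python
# Conceptual sketch: logistic regression applies a gradient step to the
# SAME dataset on every iteration; a framework that rereads the data from
# disk each pass pays the I/O cost 100 times here.
import math

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]   # (feature, label)
w = 0.0
for _ in range(100):                 # the dataset is reused every pass
    grad = sum((1 / (1 + math.exp(-w * x)) - y) * x for x, y in data)
    w -= 0.5 * grad                  # gradient step on the full dataset
```

With this toy separable data the weight moves in the positive direction, correctly ordering the two classes.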
Results(2)
Results(3)
Thank You!!!