Spark: Cluster Computing with Working Sets --Aaron 2013/03/28
TRANSCRIPT
Spark: Cluster Computing with Working Sets
--Aaron 2013/03/28
Overview
• Motivation
• Spark
• Resilient Distributed Datasets (RDD)
• Shared Variables
• Experiments
Motivation
Hadoop
• Advantages: scalability, fault tolerance
• Disadvantage: acyclic data flow makes it unsuitable for iterative algorithms
Disadvantages of Hadoop(1)
• Iterative jobs: Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter.
• Interactive analytics: When running interactive queries on large datasets, a user should be able to load a dataset of interest into memory across a number of machines and query it repeatedly.
Disadvantages of Hadoop(2)
• In most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system.
• This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times.
Spark
Programming Model
• Driver Program: implements the high-level control flow of the application and launches various operations in parallel.
• Resilient Distributed Datasets (RDD)
• Parallel Operations: pass a function to apply to a dataset.
• Shared Variables: can be used in functions running on the cluster.
Framework of Spark
Resilient Distributed Datasets (RDD)
Resilient Distributed Datasets
• A resilient distributed dataset (RDD) is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.
• The elements of an RDD need not exist in physical storage; instead, a handle to an RDD contains enough information to compute the RDD starting from data in reliable storage. This means that RDDs can always be reconstructed if nodes fail.
Changing the Persistence of RDD
• By default, RDDs are lazy and ephemeral.
• A user can alter the persistence of an RDD through two actions:
– Cache action: hints that the RDD should be kept in memory after the first time it is computed, because it will be reused.
– Save action: evaluates the dataset and writes it to a distributed file system such as HDFS.
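The lazy-by-default behavior with an opt-in cache hint can be sketched in plain Python. This is a conceptual mock, not Spark's actual implementation; the class and method names (`LazyDataset`, `materialize`) are illustrative.

```python
# Conceptual sketch (not Spark's real implementation): a dataset is lazy by
# default and only materializes when forced; cache() keeps the computed
# result in memory so later uses are served without recomputation.
class LazyDataset:
    def __init__(self, compute_fn):
        self._compute = compute_fn   # deferred computation
        self._cached = None
        self._cache_enabled = False

    def cache(self):
        self._cache_enabled = True   # hint: keep result after first compute
        return self

    def materialize(self):
        if self._cached is not None:
            return self._cached      # reuse the in-memory copy
        result = self._compute()
        if self._cache_enabled:
            self._cached = result
        return result

calls = []
ds = LazyDataset(lambda: calls.append(1) or [x * 2 for x in range(3)]).cache()
first = ds.materialize()
second = ds.materialize()            # served from cache; compute ran once
```

Note that without `cache()`, every `materialize()` would rerun the computation, which mirrors the "ephemeral" default described above.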
Operations on RDD(1)
• Transformations (create an RDD): operations on either (1) data in stable storage or (2) other RDDs.
• Examples of transformations include map, filter, and join.
Operations on RDD(2)
• Actions: operations that return a value to the application or export data to a storage system.
• Examples of actions include count, collect, and save.
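The transformation/action split can be illustrated with a tiny mock. This is a conceptual sketch only; `MockRDD` is an invented name and the real Spark API has many more operations.

```python
# Conceptual sketch: transformations (map, filter) build a new lazy dataset;
# actions (count, collect) force evaluation and return a value to the driver.
class MockRDD:
    def __init__(self, data_fn):
        self._data_fn = data_fn          # deferred: nothing computed yet

    # --- transformations: return a new MockRDD, no work happens here ---
    def map(self, f):
        return MockRDD(lambda: [f(x) for x in self._data_fn()])

    def filter(self, p):
        return MockRDD(lambda: [x for x in self._data_fn() if p(x)])

    # --- actions: evaluate the chain and return a plain value ---
    def collect(self):
        return self._data_fn()

    def count(self):
        return len(self._data_fn())

rdd = MockRDD(lambda: [1, 2, 3, 4])
evens = rdd.map(lambda x: x * 10).filter(lambda x: x % 20 == 0)
# evens.collect() == [20, 40]
```

Building `evens` does no work; only calling `collect` or `count` evaluates the chained transformations.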
Operations on RDD(3)
• Programmers can call a persist method to indicate which RDDs they want to reuse in future operations.
• Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM.
• Users can set a persistence priority on each RDD to specify which in-memory data should spill to disk first.
Example1 of RDD
• These datasets will be stored as a chain of objects capturing the lineage of each RDD. Each dataset object contains a pointer to its parent and information about how the parent was transformed.
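The chain-of-objects idea can be sketched as follows. This is a simplified illustration (the names `LineageNode` and `recompute` are invented), showing how a lost dataset is rebuilt by walking parent pointers back to stable storage.

```python
# Conceptual sketch: each dataset records a pointer to its parent plus the
# transformation applied, so lost data can be recomputed from its lineage.
class LineageNode:
    def __init__(self, parent, transform, source=None):
        self.parent = parent        # None for a base dataset
        self.transform = transform  # how the parent was transformed
        self.source = source        # base data in "stable storage"

    def recompute(self):
        if self.parent is None:
            return list(self.source)            # reread from stable storage
        return self.transform(self.parent.recompute())

base = LineageNode(None, None, source=[1, 2, 3])
mapped = LineageNode(base, lambda d: [x + 1 for x in d])
filtered = LineageNode(mapped, lambda d: [x for x in d if x > 2])
# If filtered's data is lost, walking the chain rebuilds it from base.
```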
Lineage Chain of Example1
Example2 of RDD
Lineage Chain of Example2
RDD Objects
• Each RDD object implements the same simple interface, which consists of five pieces of information: a set of partitions, a set of dependencies on parent RDDs, a function for computing the dataset from its parents, and metadata about its partitioning scheme and preferred data placement.
Examples of RDD Operations
• HDFS files (lines):
– partitions(): returns one partition for each block of the file (with the block's offset stored in each Partition object)
– preferredLocations: gives the nodes the block is on
– iterator: reads the block
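A minimal sketch of this interface for an HDFS-backed file, using in-memory dictionaries in place of real blocks. The class name `HdfsLinesRDD` and its constructor arguments are illustrative, not Spark's actual types.

```python
# Conceptual sketch of the per-RDD interface above, specialized for a file
# split into blocks (block contents and locations mocked as dictionaries).
class HdfsLinesRDD:
    def __init__(self, blocks, block_locations):
        self._blocks = blocks                  # block id -> list of lines
        self._locations = block_locations      # block id -> nodes holding it

    def partitions(self):
        # one partition per HDFS block
        return list(self._blocks.keys())

    def preferred_locations(self, partition):
        # nodes where that block already resides (used for locality)
        return self._locations[partition]

    def iterator(self, partition):
        # read the block's contents
        return iter(self._blocks[partition])

rdd = HdfsLinesRDD(
    {"block-0": ["a", "b"], "block-1": ["c"]},
    {"block-0": ["node1"], "block-1": ["node2", "node3"]},
)
```

A scheduler built on this interface would place the task for `block-1` on `node2` or `node3` to avoid moving data.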
Dependencies between RDDs(1)
• Narrow Dependencies: each partition of the parent RDD is used by at most one partition of the child RDD (1:1). map leads to a narrow dependency.
• Wide Dependencies: multiple child partitions may depend on one parent partition (1:N). join leads to wide dependencies.
Dependencies between RDDs(2)
• Narrow dependencies allow pipelined execution on one cluster node, which can compute all the parent partitions. For example, one can apply a map followed by a filter on an element-by-element basis.
• Wide dependencies require data from all parent partitions to be available and shuffled across the nodes using a MapReduce-like operation.
• Recovery after a node failure is more efficient with a narrow dependency than with a wide dependency.
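The two dependency shapes can be shown on plain partitioned lists. This is a conceptual sketch: a per-partition `map` needs only its own parent partition (narrow), while a hash-based redistribution reads from every parent partition (wide).

```python
# Conceptual sketch: narrow vs wide dependencies on partitioned data.
parent = [[1, 2], [3, 4], [5, 6]]        # three parent partitions

# Narrow (map): child partition i depends only on parent partition i,
# so each partition can be processed in a local pipeline.
mapped = [[x * 10 for x in part] for part in parent]

# Wide (shuffle): redistribute every element by hash into 2 child
# partitions; each child partition may receive data from every parent.
children = [[], []]
for part in parent:
    for x in part:
        children[x % 2].append(x)        # cross-partition data movement
```

Losing one `mapped` partition requires recomputing only one parent partition; losing one `children` partition requires rereading all three, which is the recovery asymmetry noted above.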
Parallel Operations
• reduce: Combines dataset elements using an associative function to produce a result at the driver program.
• collect: Sends all elements of the dataset to the driver program.
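Why reduce requires an associative function can be sketched with per-partition partial results. This is an illustration of the two-phase pattern, not Spark's internal code.

```python
# Conceptual sketch: reduce combines elements with an associative function,
# first within each partition (on workers), then across partial results
# (at the driver). Associativity makes the split-then-combine order safe.
from functools import reduce

partitions = [[1, 2, 3], [4, 5], [6]]
add = lambda a, b: a + b

partials = [reduce(add, part) for part in partitions]   # worker-side
total = reduce(add, partials)                           # driver-side

# collect simply gathers every partition's elements at the driver
collected = [x for part in partitions for x in part]
```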
Shared Variables
Shared Variables
• Programmers invoke operations like map, filter, and reduce by passing closures (functions) to Spark. Normally, when Spark runs a closure on a worker node, the variables used in the closure are copied to that worker.
• However, Spark also lets programmers create two restricted types of shared variables to support two simple but common usage patterns.
Broadcast Variables
• When one creates a broadcast variable b with a value v, v is saved to a file in a shared file system. The serialized form of b is a path to this file. When b’s value is queried on a worker node, Spark first checks whether v is in a local cache, and reads it from the file system if it isn’t.
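The broadcast protocol described above can be mocked with two dictionaries. This is a conceptual sketch: `shared_fs` stands in for the shared file system and the function names are invented.

```python
# Conceptual sketch: a broadcast value is written once to shared storage;
# its serialized form is just a path. Each worker reads the file at most
# once, serving later lookups from a local cache.
shared_fs = {}        # stands in for a shared file system
reads = []            # records file-system reads, for illustration

def broadcast(name, value):
    shared_fs[name] = value
    return name       # the "serialized form" of b is the path to the file

def read_on_worker(path, local_cache):
    if path not in local_cache:           # check the local cache first
        reads.append(path)                # only a miss touches the FS
        local_cache[path] = shared_fs[path]
    return local_cache[path]

b = broadcast("/bcast/lookup", {"a": 1, "b": 2})
cache = {}                                # one worker's local cache
v1 = read_on_worker(b, cache)
v2 = read_on_worker(b, cache)             # served locally, no second read
```

The point of the design is that shipping the closure only costs the size of the path, while the (possibly large) value moves to each worker at most once.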
Accumulators
• Each accumulator is given a unique ID when it is created. When the accumulator is saved, its serialized form contains its ID and the “zero” value for its type.
• On the workers, a separate copy of the accumulator is created for each thread that runs a task using thread-local variables, and is reset to zero when a task begins. After each task runs, the worker sends a message to the driver program containing the updates it made to various accumulators.
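The per-task copy-and-merge semantics can be sketched without threads. This is a simplified illustration of the protocol above (one accumulator, sequential "tasks"); real Spark uses thread-local copies and message passing.

```python
# Conceptual sketch: each task gets a private accumulator copy reset to the
# type's "zero"; after the task finishes, its update is sent back and the
# driver merges it into the global value.
driver_total = 0                  # the accumulator's value at the driver

def run_task(task_data):
    local_acc = 0                 # per-task copy, reset to zero
    for x in task_data:
        local_acc += x            # updates stay local during the task
    return local_acc              # the "message" sent back to the driver

for task in [[1, 2], [3], [4, 5]]:
    driver_total += run_task(task)    # driver applies each task's update
```

Because tasks only ever add to a zeroed local copy, the merged result is independent of task ordering, which is what makes accumulators safe under parallel execution.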
Experiments
Results(1)
• Dataset: 29 GB
• Environment: 20 "m1.xlarge" EC2 nodes with 4 cores each
• Algorithm: logistic regression
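The benchmarked workload's iterative shape can be sketched in a few lines. This is a toy one-feature gradient-descent version (data, learning rate, and iteration count are invented), shown only to make clear why rescanning the same dataset every iteration rewards in-memory caching.

```python
# Conceptual sketch: logistic regression applies a gradient step to the
# SAME dataset on every iteration; a framework that rereads the data from
# disk each pass pays the I/O cost 100 times here.
import math

data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]   # (feature, label)
w = 0.0
for _ in range(100):                 # the dataset is reused every pass
    grad = sum((1 / (1 + math.exp(-w * x)) - y) * x for x, y in data)
    w -= 0.5 * grad                  # gradient step on the full dataset
```

With this toy separable data the weight moves in the positive direction, correctly ordering the two classes.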
Results(2)
Results(3)
Thank You!!!