Spark

Small presentation about Spark.
![Page 1: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/1.jpg)
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Presentation by Mário Almeida
![Page 2: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/2.jpg)
Outline
● Motivation
● RDDs overview
● Spark
● Data sharing
● Example: log mining
● Fault tolerance
● Example: logistic regression
● RDD representation
● Evaluation
● Conclusion
![Page 3: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/3.jpg)
Motivation
How to perform large-scale data analytics?
● MapReduce
● Dryad

Problems?
● How to reuse intermediate results? Through a distributed file system? Pregel? → Overhead, and no abstraction for general reuse!
● How to provide fault tolerance efficiently? Shared memory? Key-value stores? Piccolo? → Fine-grained updates make this expensive!
![Page 4: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/4.jpg)
RDDs Overview
● Read-only, partitioned collections of records
● Created through transformations on data in stable storage or on other RDDs
● Carry information on the lineage of transformations that built them
● Give control over partitioning and persistence (e.g. non-serialized in-memory storage)
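The properties above can be sketched in plain Python. This is a toy model for illustration only, not the Spark API: partitions are immutable, transformations return new RDDs, and each RDD records the lineage that produced it.

```python
# Toy model of an RDD: a read-only, partitioned collection whose
# transformations build new RDDs and record their lineage.
# Illustrative sketch only -- NOT the actual Spark API.

class ToyRDD:
    def __init__(self, partitions, lineage=()):
        # Partitions stored as immutable tuples: records are read-only.
        self.partitions = tuple(tuple(p) for p in partitions)
        # Lineage: the chain of transformation names that produced this RDD.
        self.lineage = lineage

    def map(self, fn):
        # A transformation never mutates data; it returns a new RDD
        # with one more entry appended to the lineage.
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return ToyRDD(new_parts, self.lineage + ("map",))

    def filter(self, pred):
        new_parts = [[x for x in p if pred(x)] for p in self.partitions]
        return ToyRDD(new_parts, self.lineage + ("filter",))


base = ToyRDD([[1, 2], [3, 4]])          # "loaded from stable storage"
doubled = base.map(lambda x: x * 2)       # derived RDD
evens = doubled.filter(lambda x: x > 4)   # another derived RDD
print(evens.lineage)                      # ('map', 'filter')
```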
![Page 5: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/5.jpg)
Spark
● Exposes RDDs through a language-integrated API.
● RDDs can be used in actions, which return a value or export it to a storage system (e.g. count, collect and save).
● The persist method indicates which RDDs to reuse (by default they are stored in memory).
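The transformation/action split can be mimicked in plain Python (a sketch, not Spark itself): transformations only record what to do, while actions such as count and collect force evaluation and return a value.

```python
# Sketch of lazy transformations vs. eager actions, in plain Python.
# NOT the Spark API -- just a model of the evaluation strategy.

class LazyRDD:
    def __init__(self, source, ops=()):
        self.source = source   # underlying data ("stable storage")
        self.ops = ops         # recorded, not-yet-applied transformations

    def map(self, fn):
        return LazyRDD(self.source, self.ops + (("map", fn),))

    def filter(self, pred):
        return LazyRDD(self.source, self.ops + (("filter", pred),))

    def _compute(self):
        data = list(self.source)
        for kind, fn in self.ops:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    # Actions: these actually run the recorded pipeline.
    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())


rdd = LazyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.count())    # 5
print(rdd.collect())  # [0, 4, 16, 36, 64]
```

In real Spark, persist would additionally cache the computed result so repeated actions do not recompute the whole pipeline.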
![Page 6: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/6.jpg)
Data Sharing in MapReduce

Overhead: replication, serialization, and disk I/O!
![Page 7: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/7.jpg)
Data Sharing in Spark
Memory access is 10-100x faster than network and disk.
![Page 8: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/8.jpg)
Example - Log Mining
Load error messages into memory, then interactively search for patterns.

1 TB of data scanned in 5-7 s (vs. 170 s for on-disk data)
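The log-mining flow on this slide, mocked with plain Python lists (in Spark this would roughly be a textFile load, a filter on lines starting with "ERROR", and a persist of the filtered RDD; the sample log lines below are made up):

```python
# Plain-Python mock of the slide's log-mining flow, not actual Spark code.
# Hypothetical sample data for illustration.

log_lines = [
    "INFO starting job",
    "ERROR timeout connecting to db",
    "WARN low memory",
    "ERROR disk full",
    "ERROR timeout reading shard",
]

# "Load error messages into memory" once...
errors = [line for line in log_lines if line.startswith("ERROR")]

# ...then run repeated interactive queries against the cached subset,
# never re-reading the full log.
timeouts = [line for line in errors if "timeout" in line]
print(len(errors))    # 3
print(len(timeouts))  # 2
```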
![Page 9: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/9.jpg)
Fault Tolerance
RDDs keep track of the transformations used to build them. This lineage can be used to recover lost data.
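A toy illustration of lineage-based recovery (not Spark internals): a lost partition of a derived RDD is recomputed by replaying the recorded transformations over the surviving parent partition, with no data replication.

```python
# Toy illustration of lineage-based recovery, not Spark internals.

def recompute(parent_partition, lineage):
    """Replay the lineage (a list of map functions) over a parent partition."""
    data = list(parent_partition)
    for fn in lineage:
        data = [fn(x) for x in data]
    return data

parent = [[1, 2, 3], [4, 5, 6]]                # parent RDD's partitions
lineage = [lambda x: x + 1, lambda x: x * 10]  # map(+1), then map(*10)

derived = [recompute(p, lineage) for p in parent]
print(derived)        # [[20, 30, 40], [50, 60, 70]]

# Simulate losing partition 1, then recover it from lineage alone:
derived[1] = None
derived[1] = recompute(parent[1], lineage)
print(derived[1])     # [50, 60, 70]
```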
![Page 10: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/10.jpg)
Example - Logistic Regression
Many machine learning algorithms are iterative in nature: they run iterative optimization procedures.

Repeated MapReduce steps compute the gradient, but the data is loaded into memory only once!
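A minimal logistic-regression loop in plain Python (a sketch, with a tiny made-up 1-D dataset) shows why caching matters: every iteration re-reads the same dataset, so keeping it in memory avoids repeated disk I/O.

```python
# Minimal logistic regression by gradient descent, plain Python.
# The dataset below is hypothetical; the point is the access pattern:
# the SAME data is scanned on every iteration.
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny 1-D dataset, loaded ONCE (in Spark: a persisted RDD).
points = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]  # (x, label)

w = 0.0
for _ in range(100):                     # repeated "MapReduce" steps
    # map: per-point gradient contribution; reduce: sum
    grad = sum((sigmoid(w * x) - y) * x for x, y in points)
    w -= 0.5 * grad                      # gradient step

# w should now separate the classes: positive x -> label 1
print(sigmoid(w * 2.0) > 0.5)   # True
print(sigmoid(w * -2.0) < 0.5)  # True
```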
![Page 11: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/11.jpg)
Logistic Regression Performance
30 GB dataset; 20 machines * 4 cores with 15 GB RAM each
Hadoop: 127 s per iteration
Spark: 174 s for the first iteration, 6 s afterwards
![Page 12: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/12.jpg)
Representing RDDs
● Narrow dependencies allow pipelined execution.
● Wide dependencies require data from all parent partitions.
● Wide dependencies are harder to recover from!
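The contrast can be sketched in plain Python (a toy model, not Spark internals): a map-like narrow dependency computes each output partition from exactly one parent partition, while a groupByKey-like wide dependency must read every parent partition before any output partition is complete.

```python
# Toy contrast between a narrow and a wide dependency.
# Hypothetical partitioned (key, value) data for illustration.

parent = [[("a", 1), ("b", 2)], [("a", 3), ("c", 4)]]

# Narrow dependency: map over each partition in isolation.
# Partition i of the output needs only partition i of the parent,
# so the two partitions can be computed independently and pipelined.
mapped = [[(k, v * 10) for k, v in part] for part in parent]

# Wide dependency: groupByKey must touch EVERY parent partition
# before any single output group is complete (a shuffle).
grouped = {}
for part in parent:              # reads all partitions
    for k, v in part:
        grouped.setdefault(k, []).append(v)

print(mapped[0])   # [('a', 10), ('b', 20)] -- from parent[0] alone
print(grouped)     # {'a': [1, 3], 'b': [2], 'c': [4]}
```

Recovery follows the same shape: a lost narrow-dependency partition needs one parent partition recomputed, while a lost wide-dependency partition may require re-reading all of them.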
![Page 13: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/13.jpg)
Evaluation - Iteration times
● Extra MapReduce job needed to convert data to binary format
● Heartbeat protocol overhead
● Computation-intensive workload
![Page 14: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/14.jpg)
Evaluation - number of machines
Logistic regression: 25.3x & 20.7x speedups
K-means: 1.9x & 3.2x speedups
![Page 15: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/15.jpg)
Evaluation - Partitioning
PageRank algorithm on a 54 GB dataset, building a link graph of 4 million articles.
![Page 16: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/16.jpg)
Evaluation - Failures
100 GB working set
![Page 17: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/17.jpg)
Conclusion
● Spark is up to 20x faster than Hadoop for iterative applications (avoids I/O and serialization overhead).
● Can interactively scan 1 TB with 5-7 s latency.
● Quick recovery (rebuilds lost RDD partitions from lineage).
● Pregel/HaLoop can be built on top of Spark.
● Best suited for batch applications that apply the same operation to all elements of a dataset.
![Page 18: Spark](https://reader030.vdocuments.site/reader030/viewer/2022020105/554a0fe8b4c905825d8b49af/html5/thumbnails/18.jpg)
References
● Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing (Zaharia et al., NSDI 2012)
● SlideShare: /Hadoop_Summit/spark-and-shark