Massive Online Analysis - Storm,Spark
presentation by
R. Kishore Kumar
Research Scholar
Department of Computer Science & Engineering
Indian Institute of Technology, Kharagpur
Kharagpur-721302, India
17/06/2016(R Kishore Kumar IITKGP) Massive Online Analysis - Storm,Spark 17/06/2016 1 / 1
Overview
Introduction to Big Data - Examples
– Weather forecasting management
  – Raytheon, working with NASA and NOAA.
  – Every day, the system receives about 60 GB of data from weather satellites.
  – Using sophisticated algorithms, they process that data into environmental data records, such as cloud coverage and height. They also produce sea ice concentrations, surface temperatures, and atmospheric pressure.
– Flight takeoff and landing
  – It produces 5 to 10 GB of data.
– User behaviour
  – It produces 500+ TB of data per day.
Introduction
– What is Big Data?
– Big Data is more data, and more varieties of data, than a conventional database can handle.
– The term also refers to the many tools and techniques that have emerged to help users mine valuable information from this massive data.
Introduction cont..
– Characteristics of Big Data
  – Huge volume of data
  – Complexity of data types and structures
  – Speed or velocity of new data creation
– Criteria for Big Data Projects
  – Speed of decision making
  – Analysis flexibility
  – Throughput
Different Types of Data
– Structured, e.g. transaction data
– Semi-structured, e.g. XML data files that are self-describing and defined by an XML Schema
– Quasi-structured, e.g. web clickstream data that may contain some inconsistencies in data values and formats
– Unstructured, e.g. text documents, PDFs, images, audio, and video
1 TB (1024 gigabytes): Oracle RDBMS
1 PB (1024 terabytes): PDF, Excel, Word, PPT; content management
1 EB (1024 petabytes): YouTube, tweets, Facebook, wikis; NoSQL, Hadoop
HDFS Architecture
– Hadoop Distributed File System (HDFS)
  – HDFS is a file system designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware.
  – HDFS is the primary storage system used by Hadoop applications.
Figure : HDFS Architecture
HDFS Features
– HDFS Features
  – Highly fault tolerant (each block is replicated on 3 different nodes by default).
  – Suitable for applications with large data sets.
  – Can be built out of commodity hardware.
– HDFS Drawback
  – Data has to be written to disk every time.
Apache Spark
Spark uses a restricted abstraction of distributed shared memory: the Resilient Distributed Dataset (RDD).
An RDD is a collection of elements partitioned across the nodes of the cluster that can be operated on in parallel.
RDDs are partitioned to be distributed across nodes.
RDDs are created by starting with a file in the Hadoop file system.
Invoke Spark operations on each RDD to do computation.
Offers efficient fault tolerance.
Resilient Distributed Datasets
Motivation
A Fault-Tolerant Abstraction for In-Memory Cluster Computing
MapReduce greatly simplified big data analysis on large, unreliable clusters
But as soon as it got popular, users wanted more:
– More complex, multi-stage applications (e.g. iterative machine learning and graph processing)
– More interactive ad-hoc queries
Response:
Specialized frameworks for some of these apps (e.g. Pregel for graph processing)
Resilient Distributed Datasets
Motivation
Complex apps and interactive queries both need one thing that MapReduce lacks:
Efficient primitives for data sharing
MapReduce
In MapReduce, the only way to share data across jobs is stable storage, which is slow!
Resilient Distributed Datasets
Resilient Distributed Datasets
Challenge
Challenge
How to design a distributed memory abstraction
that is both fault-tolerant and efficient?
Challenge
Challenge
Existing storage abstractions have interfaces based on fine-grained updates to mutable state
RAMCloud, databases, distributed memory, Piccolo
Challenge
Requires replicating data or logs across nodes for fault tolerance
Costly for data-intensive apps
10-100x slower than memory write
Solution:
Resilient Distributed Datasets (RDDs)
Restricted form of distributed shared memory
Immutable, partitioned collections of records
Can only be built through coarse-grained deterministic transformations (map, filter, join, ...)
Resilient Distributed Datasets (RDDs)
Efficient fault recovery using lineage
Log one operation to apply to many elements
Recompute lost partitions on failure
No cost if nothing fails
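The lineage idea above can be illustrated without Spark at all. What follows is a toy model in plain Scala, not Spark's implementation: the names `ToyRdd`, `Source`, `Mapped`, and `Filtered` are hypothetical, and a "partition" is just a `Vector`. Each dataset records only the coarse-grained operation that produced it, so a lost partition can be rebuilt by replaying that operation on the parent partition.

```scala
// Toy model of lineage-based recovery (hypothetical names; not Spark's internals).
sealed trait ToyRdd[A] {
  def numPartitions: Int
  // Recompute one partition from lineage; no replica of the data is needed.
  def compute(partition: Int): Vector[A]
  def map[B](f: A => B): ToyRdd[B] = Mapped(this, f)
  def filter(p: A => Boolean): ToyRdd[A] = Filtered(this, p)
}

// Stands in for data held in stable storage (e.g. file blocks).
final case class Source[A](blocks: Vector[Vector[A]]) extends ToyRdd[A] {
  def numPartitions: Int = blocks.size
  def compute(partition: Int): Vector[A] = blocks(partition)
}

// One logged operation ("map with f") covers every element of every partition.
final case class Mapped[A, B](parent: ToyRdd[A], f: A => B) extends ToyRdd[B] {
  def numPartitions: Int = parent.numPartitions
  def compute(partition: Int): Vector[B] = parent.compute(partition).map(f)
}

final case class Filtered[A](parent: ToyRdd[A], p: A => Boolean) extends ToyRdd[A] {
  def numPartitions: Int = parent.numPartitions
  def compute(partition: Int): Vector[A] = parent.compute(partition).filter(p)
}

object LineageDemo {
  def main(args: Array[String]): Unit = {
    val source = Source(Vector(Vector(1, 2, 3), Vector(4, 5, 6)))
    val result = source.map(_ * 10).filter(_ % 20 == 0)

    // Pretend the node holding partition 1 failed: only that partition
    // is recomputed, by replaying map and filter on the parent block.
    println(result.compute(1)) // prints Vector(40, 60)
  }
}
```

The model stores two small operation records regardless of how many elements the dataset holds, which is the sense in which logging one operation replaces replicating the data.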
Resilient Distributed Datasets
Generality of RDDs
Despite their restrictions, RDDs can express surprisingly many parallel algorithms
These naturally apply the same operation to many items
Unify many current programming models
– Data flow models: MapReduce, Dryad, SQL, ...
– Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (HaLoop), bulk incremental, ...
Support new apps that these models don't
Tradeoff Space
Spark Programming Interface
DryadLINQ-like API in the Scala language
Usable interactively from Scala interpreter
Provides:
– Resilient distributed datasets (RDDs)
– Operations on RDDs: transformations (build new RDDs) and actions (compute and output results)
– Control of each RDD's partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)
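A minimal sketch of the three pieces listed above, assuming Spark is on the classpath and run in local mode. The object name `RddInterfaceSketch`, the function `totalsByUser`, and the sample records are made up for illustration; `partitionBy`, `persist`, `reduceByKey`, and `collect` are standard RDD operations.

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object RddInterfaceSketch {
  // Hypothetical helper: sums amounts per user.
  def totalsByUser(sc: SparkContext, records: Seq[(String, Int)]): Map[String, Int] = {
    val purchases = sc.parallelize(records)

    // Partitioning control: records with the same key land in the same partition.
    // Persistence control: keep partitions in RAM, spilling to disk if needed.
    val byUser = purchases
      .partitionBy(new HashPartitioner(4))
      .persist(StorageLevel.MEMORY_AND_DISK)

    // Transformation: describes a new RDD, nothing is computed yet.
    val totals = byUser.reduceByKey(_ + _)

    // Action: runs the job and returns the results to the driver.
    totals.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    // Local mode and the app name are assumptions for a laptop-sized demo.
    val conf = new SparkConf().setAppName("rdd-interface-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val sample = Seq(("alice", 3), ("bob", 5), ("alice", 4)) // made-up data
      totalsByUser(sc, sample).foreach { case (user, total) => println(s"$user -> $total") }
    } finally {
      sc.stop()
    }
  }
}
```

Because `byUser` already has a partitioner, the later `reduceByKey` can typically reuse that layout rather than shuffling the data again.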
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns
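The slide's code did not survive extraction, so the following is only a sketch of what such a session could look like, not the presenter's original. It assumes a log file at `log.txt`, error lines that start with `ERROR`, and the search patterns `timeout` and `disk`; all of these, and the names `LogMiningSketch`, `cachedErrors`, and `countMatching`, are placeholders.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object LogMiningSketch {
  // Keep only error lines and mark them for in-memory caching.
  // The "ERROR" prefix is an assumed log format.
  def cachedErrors(lines: RDD[String]): RDD[String] =
    lines.filter(_.startsWith("ERROR")).cache()

  // Count the error lines that mention a pattern.
  def countMatching(errors: RDD[String], pattern: String): Long =
    errors.filter(_.contains(pattern)).count()

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("log-mining-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val errors = cachedErrors(sc.textFile("log.txt")) // assumed path
      // The first query reads the file and fills the cache;
      // later queries can be served from memory.
      println(countMatching(errors, "timeout"))
      println(countMatching(errors, "disk"))
    } finally {
      sc.stop()
    }
  }
}
```

In an interactive Scala shell the same two helper calls could be typed one at a time, which is the usage pattern the slide describes.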
Fault Recovery
RDDs track the graph of transformations that built them (their lineage) to rebuild lost data
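Spark exposes this lineage for inspection through `RDD.toDebugString`. The sketch below, with a made-up object name and toy input, prints the chain of parent RDDs for a small pipeline; the exact text of the output depends on the Spark version, so none is shown here.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object LineageInspection {
  // Builds a small pipeline and returns Spark's description of its lineage.
  def lineageOfSquares(sc: SparkContext): String = {
    val squaresOfEvens = sc.parallelize(1 to 10)
      .filter(_ % 2 == 0)
      .map(n => n * n)
    // Describes this RDD and the parents it would be recomputed from.
    squaresOfEvens.toDebugString
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("lineage-inspection").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(lineageOfSquares(sc))
    } finally {
      sc.stop()
    }
  }
}
```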
Fault Recovery Results
Example: Page Rank
Optimizing Placement
PageRank Performance
Implementation
Programming Models Implemented on Spark
Open Source Community
Related works
Behavior with Insufficient RAM
Scalability
Breaking Down the Speedup
Spark Operations
Task Scheduler
Apache Spark
RDDs can only be created by reading data from stable storage or by applying Spark operations to existing RDDs.
Spark defines two kinds of RDD operations:
– Transformations, which apply the same operation to every record of an RDD to generate a separate, new RDD, e.g. map().
– Actions, which aggregate the RDD to produce a computation result, e.g. reduce() (runs on all nodes in parallel).
In our implementation, each iteration consists of two map operations and two reduce operations.
Like Hadoop, the efficiency of Spark depends heavily on the parallelism of the algorithm itself.
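The split between transformations and actions can be seen in a few lines. This is a minimal sketch with made-up names (`TransformationsVsActions`, `sumOfSquares`) and toy input, assuming Spark in local mode; it is not the implementation the slide refers to.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object TransformationsVsActions {
  // Sums the squares of a non-empty sequence (reduce fails on an empty RDD).
  def sumOfSquares(sc: SparkContext, xs: Seq[Int]): Int = {
    val numbers = sc.parallelize(xs)

    // Transformation: map() only describes a new RDD; no job runs here.
    val squares = numbers.map(x => x * x)

    // Action: reduce() triggers the job, combines values within each
    // partition in parallel, and returns a single result to the driver.
    squares.reduce(_ + _)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("transformations-vs-actions").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      println(sumOfSquares(sc, Seq(1, 2, 3, 4))) // prints 30
    } finally {
      sc.stop()
    }
  }
}
```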
Apache Spark
A second abstraction in Spark is shared variables that can be used in parallel operations.
When Spark runs a function in parallel as a set of tasks on different nodes, it ships a copy of each variable used in the function to each task.
Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only added to, such as counters and sums.
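Both kinds of shared variable can be sketched together. The example assumes Spark 2.0 or later, where `SparkContext.longAccumulator` is available (earlier releases used a different accumulator API); the lookup table, the country codes, and the names `SharedVariablesSketch` and `resolve` are invented for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SharedVariablesSketch {
  // Returns (names of the known codes, number of unknown codes).
  def resolve(sc: SparkContext, codes: Seq[String]): (Seq[String], Long) = {
    // Broadcast variable: a read-only lookup table cached on each node
    // instead of being shipped with every task.
    val names = sc.broadcast(Map("IN" -> "India", "US" -> "United States"))

    // Accumulator: tasks can only add to it; the driver reads the total.
    val unknown = sc.longAccumulator("unknown codes")

    val resolved = sc.parallelize(codes).flatMap { code =>
      val hit = names.value.get(code)
      if (hit.isEmpty) unknown.add(1)
      hit.toList
    }.collect().toSeq

    (resolved, unknown.value)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("shared-variables-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (resolved, missing) = resolve(sc, Seq("IN", "XX", "US"))
      println(s"resolved=$resolved missing=$missing")
    } finally {
      sc.stop()
    }
  }
}
```

One caveat worth keeping in mind: accumulator updates made inside a transformation, as here, can be applied more than once if a task is re-executed, so counts of this kind are best treated as diagnostics unless the update happens inside an action.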
Apache Spark
Figure : Spark Architecture
Spark Engine
Figure : Spark Engine
Approaching with Apache Spark
– Approaching with Apache Spark
– http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
Approaching with Apache Spark
– val input = sc.textFile("log.txt")
– val splitedLines = input.map(line => line.split(" ")).map(words => (words(0), 1)).reduceByKey((a, b) => a + b)
Conclusion
Thank You