map reduce vs spark
![Page 1: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/1.jpg)
MapReduce vs/and Spark
Tudor Lapusan
BigData Romanian Tour - Timisoara
![Page 2: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/2.jpg)
History
![Page 3: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/3.jpg)
MapReduce basic functionalities
● Fault tolerance
● Monitoring & status updates
● Scalability
![Page 4: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/4.jpg)
Hadoop MapReduce
Input Map Reduce Output
![Page 5: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/5.jpg)
Hadoop MapReduce
Input Map Shuffle Reduce Output
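As a rough illustration of the Input, Map, Shuffle, Reduce, Output flow, the sketch below runs a word count with plain Scala collections. It does not use Hadoop at all; the object name `MapReducePhases` and its functions are hypothetical and only stand in for what the framework does across many machines.

```scala
// Plain Scala collections standing in for the MapReduce phases (no Hadoop involved).
object MapReducePhases {
  // Map phase: emit a (word, 1) pair for every word in a line.
  def mapPhase(line: String): Seq[(String, Int)] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).map(word => (word, 1)).toSeq

  // Shuffle phase: group all emitted values by key.
  def shufflePhase(pairs: Seq[(String, Int)]): Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (word, kvs) => (word, kvs.map(_._2)) }

  // Reduce phase: combine the values of one key into a single result.
  def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  // Input -> Map -> Shuffle -> Reduce -> Output
  def wordCount(input: Seq[String]): Map[String, Int] =
    shufflePhase(input.flatMap(mapPhase)).map { case (w, c) => reducePhase(w, c) }

  def main(args: Array[String]): Unit = {
    val input = Seq("spark and hadoop", "hadoop mapreduce")
    wordCount(input).toSeq.sortBy(_._1).foreach(println)
  }
}
```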
![Page 6: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/6.jpg)
MapReduce DAG
A
D
B
C
E
F
![Page 7: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/7.jpg)
Spark
● RDD
● Operations: Transformations and Actions
![Page 8: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/8.jpg)
RDD - Resilient Distributed Dataset
An RDD is a fault-tolerant collection of elements, distributed across many servers, on which we can perform parallel operations.
![Page 9: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/9.jpg)
RDD
Scala code
// sc is the SparkContext that spark-shell provides
val data = Array(1, 2, 3, 4, 5, 6, 7, 8)
val rddData = sc.parallelize(data) // distribute the local array as an RDD
![Page 10: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/10.jpg)
RDD
Scala code
val rddFile = sc.textFile("data.txt")
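The snippets on these slides rely on the `sc` value that `spark-shell` provides. Outside the shell, one way to obtain it is sketched below, assuming Spark is on the classpath and a local master is acceptable. The object name `CreateRddSketch` and its helper names are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object CreateRddSketch {
  // Assumption: "local[*]" runs Spark inside this JVM; a cluster would use another master URL.
  def localContext(appName: String): SparkContext =
    new SparkContext(new SparkConf().setAppName(appName).setMaster("local[*]"))

  // Same data as on the slide, turned into an RDD.
  def fromCollection(sc: SparkContext): RDD[Int] = {
    val data = Array(1, 2, 3, 4, 5, 6, 7, 8)
    sc.parallelize(data.toSeq)
  }

  def main(args: Array[String]): Unit = {
    val sc = localContext("create-rdd-sketch")
    try println(fromCollection(sc).count())
    finally sc.stop() // always release the context
  }
}
```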
![Page 11: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/11.jpg)
RDD persistence
MEMORY_ONLY
MEMORY_AND_DISK
MEMORY_ONLY_SER
MEMORY_AND_DISK_SER
DISK_ONLY
MEMORY_ONLY_2
MEMORY_AND_DISK_2
OFF_HEAP
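A storage level from the list above is passed to `persist`. The sketch below assumes a local Spark setup; the object name `PersistSketch` and the sample computation are hypothetical and chosen only to show an RDD being reused by two actions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistSketch {
  // Returns (sum of squares, number of even squares) for 1..1000.
  def totals(sc: SparkContext): (Int, Long) = {
    val squares = sc.parallelize(1 to 1000).map(x => x * x)
    // Request in-memory storage, spilling to disk for partitions that do not fit.
    squares.persist(StorageLevel.MEMORY_AND_DISK)
    val total = squares.reduce(_ + _)              // first action computes and stores the partitions
    val evens = squares.filter(_ % 2 == 0).count() // later actions can reuse the stored partitions
    squares.unpersist()
    (total, evens)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: local master, for illustration only.
    val sc = new SparkContext(
      new SparkConf().setAppName("persist-sketch").setMaster("local[*]"))
    try println(totals(sc))
    finally sc.stop()
  }
}
```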
![Page 12: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/12.jpg)
Transformations
RDD 1
RDD 2
Transformations are operations on RDDs that return new RDDs
![Page 13: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/13.jpg)
Transformations
RDD 1
Input RDD {1,2,3,4,5,6}
map x => x + 1 (on the input RDD): Map RDD {2,3,4,5,6,7}
filter x => x != 4 (on the input RDD): Filter RDD {1,2,3,5,6}
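The two transformations on this slide could be written roughly as follows, assuming a local Spark setup. The object name `TransformationsSketch` is hypothetical. `collect()` is itself an action and is used here only to inspect the results, since transformations alone do not trigger any computation.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TransformationsSketch {
  // Each transformation returns a new RDD and leaves the input RDD unchanged.
  def mapped(input: RDD[Int]): RDD[Int] = input.map(x => x + 1)
  def filtered(input: RDD[Int]): RDD[Int] = input.filter(x => x != 4)

  def main(args: Array[String]): Unit = {
    // Assumption: local master, for illustration only.
    val sc = new SparkContext(
      new SparkConf().setAppName("transformations-sketch").setMaster("local[*]"))
    try {
      val input = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))
      println(mapped(input).collect().mkString(","))   // values of the Map RDD
      println(filtered(input).collect().mkString(",")) // values of the Filter RDD
    } finally sc.stop()
  }
}
```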
![Page 14: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/14.jpg)
Actions
RDD 1
Actions are operations on an RDD that return a final value or write the data to an external storage system.
RDD 1
![Page 15: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/15.jpg)
Actions
RDD 1
Input RDD {1,2,3,4,5,6}
map x => x + 1 (on the input RDD): Map RDD {2,3,4,5,6,7}
filter x => x != 4 (on the input RDD): Filter RDD {1,2,3,5,6}
Actions: count() = 6, take(2) = {1,2}, saveAsTextFile()
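The three actions named on this slide might be called as in the sketch below, again assuming a local Spark setup. The object name `ActionsSketch` and the temporary output directory are hypothetical. `saveAsTextFile` expects a path that does not exist yet, which is why the code resolves a fresh subdirectory.

```scala
import java.nio.file.Files
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object ActionsSketch {
  // count() and take(n) return values to the driver program.
  def countAndTake(input: RDD[Int]): (Long, Seq[Int]) =
    (input.count(), input.take(2).toSeq)

  def main(args: Array[String]): Unit = {
    // Assumption: local master, for illustration only.
    val sc = new SparkContext(
      new SparkConf().setAppName("actions-sketch").setMaster("local[*]"))
    try {
      val input = sc.parallelize(Seq(1, 2, 3, 4, 5, 6))
      println(countAndTake(input))
      // saveAsTextFile writes one part file per partition into a new directory.
      val out = Files.createTempDirectory("actions-sketch").resolve("out").toString
      input.saveAsTextFile(out)
      println(s"written to $out")
    } finally sc.stop()
  }
}
```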
![Page 16: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/16.jpg)
Spark DAG
RDD 1
RDD 2
RDD 4
RDD 6
RDD 3
RDD 5
Action
Transformation
Stage
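One way to look at the DAG that Spark builds is `toDebugString`, which prints the lineage of an RDD. The sketch below assumes a local Spark setup; the object name `DagSketch` and the sample data are hypothetical. `reduceByKey` requires a shuffle, and Spark places a stage boundary at each shuffle, so the printed lineage shows two levels.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object DagSketch {
  // Two transformations; nothing is computed until an action is called.
  def wordCounts(sc: SparkContext): RDD[(String, Int)] =
    sc.parallelize(Seq("a", "b", "a", "c"))
      .map(word => (word, 1)) // narrow transformation, stays in the same stage
      .reduceByKey(_ + _)     // shuffle, starts a new stage

  def main(args: Array[String]): Unit = {
    // Assumption: local master, for illustration only.
    val sc = new SparkContext(
      new SparkConf().setAppName("dag-sketch").setMaster("local[*]"))
    try {
      val counts = wordCounts(sc)
      println(counts.toDebugString)          // lineage of the RDD, no job is run
      counts.collect().foreach(println)      // the action triggers execution of the DAG
    } finally sc.stop()
  }
}
```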
![Page 17: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/17.jpg)
Spark DAG vs MapReduce DAG
RDD 1
RDD 2
RDD 4
RDD 6
RDD 3
RDD 5
A
B
D C
E
F
![Page 18: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/18.jpg)
Programming languages
MapReduce: Java, Ruby, Perl, Python, PHP, R, C++
Spark: Java, Scala, Python
![Page 19: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/19.jpg)
Ease of use
- Spark is easier to program and includes an interactive mode.
- Hadoop MapReduce is harder to program, but many tools are available to make it easier.
![Page 20: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/20.jpg)
Performance : Sort Benchmark 2013
![Page 21: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/21.jpg)
Performance : Sort Benchmark 2014
![Page 22: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/22.jpg)
Costs
![Page 23: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/23.jpg)
Costs : hardware recommendation
| | Spark | Hadoop MapReduce |
|---|---|---|
| Cores | 8-16 | 4 |
| Memory | 8 GB to hundreds of GB | 24 GB |
| Disks | 4-8 | 4-6 one-TB disks |
| Network | 10 Gigabit or more | 1 Gigabit Ethernet |
| Source | Spark recommendation | Hortonworks recommendation |
![Page 24: Map reduce vs spark](https://reader035.vdocuments.site/reader035/viewer/2022062308/55c577a7bb61eb290a8b4571/html5/thumbnails/24.jpg)
Costs : developers