Hadoop/Spark Non-Technical Basics
Zitao Liu
Department of Computer Science, University of Pittsburgh
September 24, 2015
Zitao Liu (University of Pittsburgh) www.zitaoliu.com September 24, 2015 1 / 17
Big Data Analytics
Big Data Analytics always requires two components:
A filesystem to store big data.
A computation framework to analyze big data.
Hadoop
Apache Hadoop
Too many meanings are associated with “Hadoop”. Let’s look at Apache Hadoop first.
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
Apache Hadoop
The base Apache Hadoop framework is composed of the following modules:
Hadoop Common
Hadoop Distributed File System (HDFS) - storage
Hadoop YARN
Hadoop MapReduce - processing
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework.
HDFS is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines.
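The core idea can be sketched in plain Python: a large file is split into fixed-size blocks, and each block is replicated across several machines. This is a toy model, not Hadoop's actual code; the 128 MB block size and 3x replication are common HDFS defaults, and the round-robin placement is a simplification of HDFS's rack-aware policy.

```python
# Illustrative sketch only: how a distributed file system like HDFS
# splits a large file into fixed-size blocks and assigns each block
# (plus replicas) to machines in a cluster.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default block size
REPLICATION = 3                 # HDFS replicates each block, 3 by default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, length) pairs covering file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

def place_blocks(blocks, machines, replication=REPLICATION):
    """Round-robin placement across machines (real HDFS is rack-aware)."""
    placement = {}
    for idx, _ in blocks:
        placement[idx] = [machines[(idx + r) % len(machines)]
                          for r in range(replication)]
    return placement

# A 300 MB file becomes three blocks: 128 MB, 128 MB, and 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_blocks(blocks, ["node1", "node2", "node3", "node4"])
```

Because blocks (not whole files) are the unit of storage, a file larger than any single disk can still be stored, and a lost machine costs only replicas that exist elsewhere.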
Hadoop MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of:
Map procedure
Reduce procedure
Figure 1: Image from http://tessera.io/docs-datadr/
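The model is easiest to see in a word count, the classic MapReduce example. Real Hadoop jobs implement Mapper and Reducer classes in Java; this plain-Python sketch only illustrates the programming model: map emits key-value pairs, the framework shuffles (groups by key), and reduce aggregates each group.

```python
# A minimal word count in the MapReduce style, in plain Python.
# Pipeline: map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return (key, sum(values))

lines = ["big data needs big tools", "spark and hadoop are big data tools"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts["big"] == 3, counts["data"] == 2
```

The power of the model is that map calls and reduce calls are independent of one another, so the framework can run them on thousands of machines in parallel.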
Hadoop Ecosystem
Hadoop Ecosystem includes:
Distributed Filesystem, such as HDFS.
Distributed Programming, such as MapReduce, Pig, Spark.
SQL-On-Hadoop, such as Hive, Drill, Presto.
NoSQL Databases:
Column Data Model, such as HBase, Cassandra.
Document Data Model, such as MongoDB.
· · ·
MapReduce vs. Spark
A quick history:
Figure 2: Image from http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
Advantages of MapReduce
MapReduce has proven to be an ideal platform for implementing complex batch applications as diverse as:
analyzing system logs
running ETL
computing web indexes
powering personal recommendation systems
· · ·
Limitations of MapReduce
Some limitations of MapReduce:
Batch-mode processing (one-pass computation model)
Difficult to program directly in MapReduce
Performance bottlenecks
In short, MR doesn’t compose well for a large number of applications.
Therefore, people built specialized systems as workarounds, such as Spark.
Details can be found in http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf.
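The composition problem can be seen in a toy pipeline: every chained MapReduce job must write its full output to the distributed file system, and the next job must read it all back. The sketch below mimics that with local temp files as a hypothetical stand-in for the HDFS round trip between jobs.

```python
# Why one-pass MapReduce composes poorly: each chained "job" materializes
# its entire output to disk before the next job can start.
import json, os, tempfile

def run_job(transform, input_path, output_path):
    """One 'job': read all records from disk, transform, write back to disk."""
    with open(input_path) as f:
        records = json.load(f)
    with open(output_path, "w") as f:
        json.dump(transform(records), f)

workdir = tempfile.mkdtemp()
paths = [os.path.join(workdir, f"stage{i}.json") for i in range(3)]

with open(paths[0], "w") as f:
    json.dump([1, 2, 3, 4], f)

# Two chained jobs: square, then keep values > 4. Each round-trips disk.
run_job(lambda rs: [r * r for r in rs], paths[0], paths[1])
run_job(lambda rs: [r for r in rs if r > 4], paths[1], paths[2])

with open(paths[2]) as f:
    result = json.load(f)  # [9, 16]
```

An iterative algorithm that needs tens of passes pays this disk round trip on every pass, which is exactly the cost Spark's in-memory model avoids.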
Apache Spark
Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). It is a framework for writing fast, distributed programs.
Faster (an in-memory approach): 10 times faster than MapReduce for certain applications. Better for iterative algorithms in ML.
Clean, concise APIs in Scala, Java and Python.
Interactive query analysis (from the Scala and Python shells).
Real-time analysis (Spark Streaming).
Advantages of Spark
Low-latency computations by caching the working dataset in memory and then performing computations at memory speeds.
Efficient iterative algorithms, with subsequent iterations sharing data through memory, or repeatedly accessing the same dataset.
Figure 3: Image from http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-applications/
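The caching idea behind the speedup can be sketched in a few lines. This is a toy model, not Spark's actual RDD API: without `cache()` the source is re-read on every pass (like the one-pass MapReduce model), while with `cache()` the working set is loaded once and every later pass runs at memory speed.

```python
# Toy model of dataset caching for iterative workloads.
class Dataset:
    def __init__(self, load_fn):
        self._load = load_fn      # expensive "read from disk" function
        self._cached = None
        self.loads = 0            # count how often we hit the slow path

    def cache(self):
        """Load once and keep the working set in memory."""
        self._cached = self._load()
        self.loads += 1
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached   # memory-speed access
        self.loads += 1
        return self._load()       # re-read every time, like one-pass jobs

uncached = Dataset(lambda: list(range(1000)))
cached = Dataset(lambda: list(range(1000))).cache()

# An "iterative algorithm": 10 passes over the same data.
for _ in range(10):
    sum(uncached.collect())
    sum(cached.collect())

# uncached.loads == 10, cached.loads == 1
```

Ten passes cost ten slow loads without caching but only one with it, which is why iterative ML algorithms benefit the most.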
Apache Spark
Spark has the upper hand as long as we're talking about iterative computations that need to pass over the same data many times.
But when it comes to one-pass, ETL-like jobs, for example data transformation or data integration, then MapReduce is the better deal - this is what it was designed for [1].
[1] https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
Apache Spark Cost
The memory in the Spark cluster should be at least as large as the amount of data you need to process, because the data has to fit into memory for optimal performance. So, if you need to process really big data, Hadoop will definitely be the cheaper option, since hard disk space comes at a much lower rate than memory space [2].
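A back-of-the-envelope calculation illustrates the cost argument. The per-gigabyte prices below are hypothetical round numbers, not from the source; the actual figures change over time, but the roughly two-orders-of-magnitude gap between memory and disk is the point.

```python
# Hypothetical hardware prices (assumptions, not sourced data):
DATA_GB = 10_000      # example working set: 10 TB
RAM_PER_GB = 5.00     # assumed price of memory, $/GB
DISK_PER_GB = 0.05    # assumed price of disk, $/GB

# Spark's optimal case wants the data resident in cluster RAM;
# a disk-based MapReduce job only needs it on disk.
spark_memory_cost = DATA_GB * RAM_PER_GB
hadoop_disk_cost = DATA_GB * DISK_PER_GB
# spark_memory_cost == 50000.0, hadoop_disk_cost == 500.0
```

Under these assumed prices, provisioning RAM for the full working set costs about 100x more than provisioning disk for it, which is the trade-off the paragraph above describes.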
[2] https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
Thank You
Q & A