Hadoop/Spark Non-Technical Basics
Zitao Liu
Department of Computer Science, University of Pittsburgh
September 24, 2015
Zitao Liu (University of Pittsburgh) www.zitaoliu.com September 24, 2015 1 / 17
Big Data Analytics
Big Data Analytics always requires two components:
A filesystem to store big data.
A computation framework to analyze big data.
Hadoop
Apache Hadoop
Too many meanings are associated with “Hadoop”. Let’s look at Apache Hadoop first.
Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.
Apache Hadoop
The base Apache Hadoop framework is composed of the following modules:
Hadoop Common
Hadoop Distributed File System (HDFS) - storage
Hadoop YARN
Hadoop MapReduce - processing
Hadoop Distributed File System (HDFS)
The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework.
HDFS is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines.
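The core idea can be sketched in plain Python: a large file is split into fixed-size blocks, and each block is replicated across several machines. This is a toy model, not Hadoop's actual code; the 128 MB block size and 3x replication are common HDFS defaults, and the round-robin placement is a simplification of HDFS's rack-aware policy.

```python
# Illustrative sketch only: how a distributed file system like HDFS
# splits a large file into fixed-size blocks and assigns each block
# (plus replicas) to machines in a cluster.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, a common HDFS default block size
REPLICATION = 3                 # HDFS replicates each block, 3 by default

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, length) pairs covering file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((len(blocks), length))
        offset += length
    return blocks

def place_blocks(blocks, machines, replication=REPLICATION):
    """Round-robin placement across machines (real HDFS is rack-aware)."""
    placement = {}
    for idx, _ in blocks:
        placement[idx] = [machines[(idx + r) % len(machines)]
                          for r in range(replication)]
    return placement

# A 300 MB file becomes three blocks: 128 MB, 128 MB, and 44 MB.
blocks = split_into_blocks(300 * 1024 * 1024)
placement = place_blocks(blocks, ["node1", "node2", "node3", "node4"])
```

Because blocks (not whole files) are the unit of storage, a file larger than any single disk can still be stored, and a lost machine costs only replicas that exist elsewhere.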
Hadoop MapReduce
MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.
A MapReduce program is composed of:
Map procedure
Reduce procedure
Figure 1: Image from http://tessera.io/docs-datadr/
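The model is easiest to see in a word count, the classic MapReduce example. Real Hadoop jobs implement Mapper and Reducer classes in Java; this plain-Python sketch only illustrates the programming model: map emits key-value pairs, the framework shuffles (groups by key), and reduce aggregates each group.

```python
# A minimal word count in the MapReduce style, in plain Python.
# Pipeline: map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the input line."""
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return (key, sum(values))

lines = ["big data needs big tools", "spark and hadoop are big data tools"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
# counts["big"] == 3, counts["data"] == 2
```

The power of the model is that map calls and reduce calls are independent of one another, so the framework can run them on thousands of machines in parallel.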
Hadoop Ecosystem
Hadoop Ecosystem includes:
Distributed Filesystem, such as HDFS.
Distributed Programming, such as MapReduce, Pig, Spark.
SQL-On-Hadoop, such as Hive, Drill, Presto.
NoSQL Databases:
Column Data Model, such as HBase, Cassandra.
Document Data Model, such as MongoDB.
· · ·
MapReduce vs. Spark
A quick history:
Figure 2: Image from http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf
Advantages of MapReduce
MapReduce has proven to be an ideal platform for implementing complex batch applications as diverse as:
analyzing system logs
running ETL
computing web indexes
powering personal recommendation systems
· · ·
Limitations of MapReduce
Some limitations of MapReduce:
Batch-mode processing (one-pass computation model)
Difficult to program directly in MapReduce
Performance bottlenecks
In short, MR doesn’t compose well for a large number of applications.
Therefore, people built specialized systems as workarounds, such as Spark.
Details can be found in http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf.
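The composition problem can be seen in a toy pipeline: every chained MapReduce job must write its full output to the distributed file system, and the next job must read it all back. The sketch below mimics that with local temp files as a hypothetical stand-in for the HDFS round trip between jobs.

```python
# Why one-pass MapReduce composes poorly: each chained "job" materializes
# its entire output to disk before the next job can start.
import json, os, tempfile

def run_job(transform, input_path, output_path):
    """One 'job': read all records from disk, transform, write back to disk."""
    with open(input_path) as f:
        records = json.load(f)
    with open(output_path, "w") as f:
        json.dump(transform(records), f)

workdir = tempfile.mkdtemp()
paths = [os.path.join(workdir, f"stage{i}.json") for i in range(3)]

with open(paths[0], "w") as f:
    json.dump([1, 2, 3, 4], f)

# Two chained jobs: square, then keep values > 4. Each round-trips disk.
run_job(lambda rs: [r * r for r in rs], paths[0], paths[1])
run_job(lambda rs: [r for r in rs if r > 4], paths[1], paths[2])

with open(paths[2]) as f:
    result = json.load(f)  # [9, 16]
```

An iterative algorithm that needs tens of passes pays this disk round trip on every pass, which is exactly the cost Spark's in-memory model avoids.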
Apache Spark
Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). It is a framework for writing fast, distributed programs.
Faster (an in-memory approach): 10 times faster than MapReduce for certain applications. Better for iterative algorithms in ML.
Clean, concise APIs in Scala, Java and Python.
Interactive query analysis (from the Scala and Python shells).
Real-time analysis (Spark Streaming).
Advantages of Spark
Low-latency computations by caching the working dataset in memory and then performing computations at memory speeds.
Efficient iterative algorithms, with subsequent iterations sharing data through memory, or repeatedly accessing the same dataset.
Figure 3: Image from http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-applications/
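The caching idea behind the speedup can be sketched in a few lines. This is a toy model, not Spark's actual RDD API: without `cache()` the source is re-read on every pass (like the one-pass MapReduce model), while with `cache()` the working set is loaded once and every later pass runs at memory speed.

```python
# Toy model of dataset caching for iterative workloads.
class Dataset:
    def __init__(self, load_fn):
        self._load = load_fn      # expensive "read from disk" function
        self._cached = None
        self.loads = 0            # count how often we hit the slow path

    def cache(self):
        """Load once and keep the working set in memory."""
        self._cached = self._load()
        self.loads += 1
        return self

    def collect(self):
        if self._cached is not None:
            return self._cached   # memory-speed access
        self.loads += 1
        return self._load()       # re-read every time, like one-pass jobs

uncached = Dataset(lambda: list(range(1000)))
cached = Dataset(lambda: list(range(1000))).cache()

# An "iterative algorithm": 10 passes over the same data.
for _ in range(10):
    sum(uncached.collect())
    sum(cached.collect())

# uncached.loads == 10, cached.loads == 1
```

Ten passes cost ten slow loads without caching but only one with it, which is why iterative ML algorithms benefit the most.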
Apache Spark
Spark has the upper hand as long as we're talking about iterative computations that need to pass over the same data many times.
But when it comes to one-pass, ETL-like jobs, for example data transformation or data integration, then MapReduce is the better deal - this is what it was designed for [1].
[1] https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
Apache Spark Cost
The memory in the Spark cluster should be at least as large as the amount of data you need to process, because the data has to fit into memory for optimal performance. So, if you need to process really big data, Hadoop will definitely be the cheaper option, since hard disk space comes at a much lower rate than memory space [2].
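A back-of-the-envelope calculation illustrates the cost argument. The per-gigabyte prices below are hypothetical round numbers, not from the source; the actual figures change over time, but the roughly two-orders-of-magnitude gap between memory and disk is the point.

```python
# Hypothetical hardware prices (assumptions, not sourced data):
DATA_GB = 10_000      # example working set: 10 TB
RAM_PER_GB = 5.00     # assumed price of memory, $/GB
DISK_PER_GB = 0.05    # assumed price of disk, $/GB

# Spark's optimal case wants the data resident in cluster RAM;
# a disk-based MapReduce job only needs it on disk.
spark_memory_cost = DATA_GB * RAM_PER_GB
hadoop_disk_cost = DATA_GB * DISK_PER_GB
# spark_memory_cost == 50000.0, hadoop_disk_cost == 500.0
```

Under these assumed prices, provisioning RAM for the full working set costs about 100x more than provisioning disk for it, which is the trade-off the paragraph above describes.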
[2] https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/
Thank You
Q & A