
Page 1: Hadoop/Spark Non-Technical Basics

Hadoop/Spark Non-Technical Basics

Zitao Liu

Department of Computer Science, University of Pittsburgh

[email protected]

September 24, 2015

Zitao Liu (University of Pittsburgh) www.zitaoliu.com September 24, 2015 1 / 17

Page 2: Hadoop/Spark Non-Technical Basics

Big Data Analytics

Big Data Analytics always requires two components:

A filesystem to store big data.

A computation framework to analyze big data.


Page 3: Hadoop/Spark Non-Technical Basics

Big Data Analytics

Big Data Analytics always requires two components:

A filesystem to store big data.

A computation framework to analyze big data.

Hadoop


Page 4: Hadoop/Spark Non-Technical Basics

Apache Hadoop

Too many meanings are associated with “Hadoop”. Let’s look at Apache Hadoop first.

Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.


Page 5: Hadoop/Spark Non-Technical Basics

Apache Hadoop

The base Apache Hadoop framework is composed of the following modules:

Hadoop Common

Hadoop Distributed File System

Hadoop YARN

Hadoop MapReduce


Page 6: Hadoop/Spark Non-Technical Basics

Apache Hadoop

The base Apache Hadoop framework is composed of the following modules:

Hadoop Common

Hadoop Distributed File System (HDFS) - storage

Hadoop YARN

Hadoop MapReduce - processing


Page 7: Hadoop/Spark Non-Technical Basics

Hadoop Distributed File System (HDFS)

The Hadoop Distributed File System (HDFS) is a distributed, scalable, and portable file system written in Java for the Hadoop framework.

HDFS is a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

HDFS stores large files (typically in the range of gigabytes to terabytes) across multiple machines.
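The storage idea can be sketched in a few lines: a large file is chopped into fixed-size blocks, and each block is stored on several nodes. This is a minimal illustration, not the real HDFS code; the block size, node names, and round-robin placement are simplifications chosen for clarity (real HDFS uses 128 MB blocks and rack-aware placement).

```python
# Minimal sketch of HDFS-style storage (illustrative, not the real implementation):
# split a file's bytes into fixed-size blocks, then store each block on
# several nodes so the cluster tolerates machine failures.

BLOCK_SIZE = 8    # real HDFS defaults to 128 MB; tiny here for illustration
REPLICATION = 3   # HDFS default replication factor

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Chop a byte string into consecutive fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes, replication: int = REPLICATION):
    """Assign each block to `replication` distinct nodes (simple round-robin;
    real HDFS placement is rack-aware)."""
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

data = b"a conceptually very large file, gigabytes to terabytes in practice"
blocks = split_into_blocks(data)
placement = place_replicas(len(blocks), nodes=["node1", "node2", "node3", "node4"])
```

Reassembling the blocks in order recovers the original file, and every block lives on three machines, so losing any single node loses no data.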


Page 8: Hadoop/Spark Non-Technical Basics

Hadoop MapReduce

MapReduce is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of:

Map procedure

Reduce procedure

Figure 1: Image from http://tessera.io/docs-datadr/
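The two procedures above can be shown with word count, the canonical MapReduce example, sketched here in plain Python. The shuffle step (grouping map output by key) is done by the framework in real Hadoop; here it is written out explicitly so all three stages are visible.

```python
# Word count in the MapReduce style, in plain Python for illustration.
# map emits (word, 1) pairs; shuffle groups the pairs by key; reduce sums
# the counts for each word. In real Hadoop the shuffle is done by the framework.

from collections import defaultdict

def map_phase(line):
    """Map: emit a (word, 1) pair for every word in the line."""
    return [(word, 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle: group all values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: sum the counts for one word."""
    return key, sum(values)

lines = ["hadoop spark hadoop", "spark spark"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
# counts == {"hadoop": 2, "spark": 3}
```

Because map runs independently per line and reduce independently per key, both stages parallelize naturally across a cluster.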


Page 9: Hadoop/Spark Non-Technical Basics

Hadoop Ecosystem

The Hadoop Ecosystem includes:

Distributed Filesystem, such as HDFS.

Distributed Programming, such as MapReduce, Pig, Spark.

SQL-On-Hadoop, such as Hive, Drill, Presto.

NoSQL Databases:

Column Data Model, such as HBase, Cassandra.

Document Data Model, such as MongoDB.

· · ·


Page 10: Hadoop/Spark Non-Technical Basics

MapReduce vs. Spark

A quick history:

Figure 2: Image from http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf


Page 11: Hadoop/Spark Non-Technical Basics

Advantages of MapReduce

MapReduce has proven to be an ideal platform for implementing complex batch applications as diverse as:

analyzing system logs

running ETL

computing web indexes

powering personal recommendation systems

· · ·


Page 12: Hadoop/Spark Non-Technical Basics

Limitations of MapReduce

Some limitations of MapReduce:

Batch mode processing (one-pass computation model)

Difficult to program directly in MapReduce

Performance bottlenecks

In short, MR doesn’t compose well for a large number of applications.

Therefore, people built specialized systems as workarounds, such as Spark.

Details can be found in http://stanford.edu/~rezab/sparkclass/slides/itas_workshop.pdf.


Page 13: Hadoop/Spark Non-Technical Basics

Apache Spark

Spark fits into the Hadoop open-source community, building on top of the Hadoop Distributed File System (HDFS). It is a framework for writing fast, distributed programs.

Faster (an in-memory approach): 10 times faster than MapReduce for certain applications. Better for iterative algorithms in ML.

Clean, concise APIs in Scala, Java and Python.

Interactive query analysis (from the Scala and Python shells).

Real-time analysis (Spark Streaming).


Page 14: Hadoop/Spark Non-Technical Basics

Advantages of Spark

Low-latency computations by caching the working dataset in memory and then performing computations at memory speeds.

Efficient iterative algorithms by having subsequent iterations share data through memory, or repeatedly accessing the same dataset.

Figure 3: Image from http://blog.cloudera.com/blog/2013/11/putting-spark-to-use-fast-in-memory-computing-for-your-big-data-applications/
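The advantage above can be made concrete with a small simulation. This is not Spark code: disk I/O is modeled with a simple counter, and the "cache" is just a Python variable, so the sketch only illustrates why an iterative job benefits from reading its input once instead of once per pass.

```python
# Sketch (not real Spark): why caching helps iterative algorithms.
# A MapReduce-style job re-reads its input from disk on every pass, while a
# Spark-style job loads the dataset once, keeps it in memory, and iterates.
# Disk I/O is simulated with a counter rather than real file access.

disk_reads = {"count": 0}

def read_from_disk():
    """Simulated disk read: bump the counter and return the dataset."""
    disk_reads["count"] += 1
    return list(range(10))

def one_pass(data):
    """One iteration of some algorithm over the dataset."""
    return sum(data)

# MapReduce-style: each iteration is a separate job that re-reads the input.
for _ in range(5):
    one_pass(read_from_disk())
mr_reads = disk_reads["count"]        # 5 disk reads for 5 iterations

# Spark-style: read once, keep the working dataset cached in memory.
disk_reads["count"] = 0
cached = read_from_disk()             # analogous to caching an RDD in memory
for _ in range(5):
    one_pass(cached)
spark_reads = disk_reads["count"]     # 1 disk read for 5 iterations
```

With real datasets the per-pass cost is dominated by disk and network I/O, which is exactly what the cached version avoids.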


Page 15: Hadoop/Spark Non-Technical Basics

Apache Spark

Spark has the upper hand as long as we’re talking about iterative computations that need to pass over the same data many times.

But when it comes to one-pass, ETL-like jobs, for example data transformation or data integration, MapReduce is the better choice; this is what it was designed for [1].

[1] https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/

Page 16: Hadoop/Spark Non-Technical Basics

Apache Spark Cost

The memory in the Spark cluster should be at least as large as the amount of data you need to process, because the data has to fit into memory for optimal performance. So, if you need to process really big data, Hadoop will definitely be the cheaper option, since hard disk space comes at a much lower rate than memory space [2].

[2] https://www.xplenty.com/blog/2014/11/apache-spark-vs-hadoop-mapreduce/

Page 17: Hadoop/Spark Non-Technical Basics

Thank you

Thank You

Q & A
