Hadoop-2.6.0 Slides
TRANSCRIPT
Taming Big Data with Hadoop
Kul Subedi
Introduction
● What is Big Data?
Properties of Big Data
Cont...
● Large and growing data files
● Commonly measured in terabytes or petabytes
● Unstructured data
● May not fit in a “relational database model”
● Derived from users, applications, systems, and sensors
Problem: Data Throughput Mismatch
● Standard spinning hard drive: 60-120 MB/sec
● Standard solid-state drive: 250-500 MB/sec
● Hard drive capacity is growing
● Online data is growing

Moving data on and off of disk is the bottleneck.
Cont...
● One terabyte (TB) of data takes 10,000 seconds (approximately 167 minutes) to read at 100 MB/sec (spinning disk)
● One TB takes 2,000 seconds (approximately 33 minutes) to read at 500 MB/sec (solid state)

One TB is a “small” file size. Parallel data access is essential for Big Data.
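The arithmetic on this slide can be checked in a few lines (a quick sketch; the 100 MB/sec figure is taken from the spinning-disk range on the previous slide):

```python
TB = 10**12  # one terabyte in bytes (decimal)
MB = 10**6

def read_time_seconds(size_bytes, throughput_mb_per_sec):
    """Time to read a file sequentially at a sustained throughput."""
    return size_bytes / (throughput_mb_per_sec * MB)

print(read_time_seconds(TB, 100) / 60)  # spinning disk: ~167 minutes
print(read_time_seconds(TB, 500) / 60)  # solid state: ~33 minutes
```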
Problem (1): Scaling
Agenda
● Hadoop Definition
● Hadoop Ecosystem
● History
● Hadoop Design Principles
● HDFS and MapReduce (Demo)
● Conclusion
Definition
● A framework of open-source tools, libraries, and methodologies for the distributed processing of large data sets
● Scales up from single servers to thousands of machines, each offering local computation and storage
Cont...
● The project includes:
❏ Hadoop Common
❏ HDFS
❏ YARN
❏ MapReduce
Cont...
● Other Hadoop-related projects:
❏ Pig
❏ Hive
❏ Tez
❏ Spark
❏ HBase
❏ Ambari, etc.
Hadoop Usage Modes
● Administrators
❏ Installation
❏ Monitor/manage the system
❏ Tune the system
● End users
❏ Design MapReduce applications
❏ Import/export data
❏ Work with various Hadoop tools
Hadoop History
● Developed by Doug Cutting and Michael J. Cafarella
● Based on Google’s MapReduce technology
● Designed to handle large amounts of data and to be robust
● Donated to the Apache Software Foundation in 2006 by Yahoo!
Cont...
Application areas:
❏ Social media
❏ Retail
❏ Financial services
❏ Web search
❏ Anywhere there are large amounts of unstructured data
Cont...
Prominent users:
❏ Yahoo!
❏ Facebook
❏ Amazon
❏ eBay
❏ American Airlines
❏ The New York Times, and many others
Design Principles
● Moving computation is cheaper than moving data
● Hardware will fail; manage it
● Hide execution details from the user
● Use streaming data access
● Use a simple file-system coherency model
Cont...
What Hadoop is not: a replacement for SQL; it is not always fast and efficient, and it is not good for quick ad-hoc querying.
HDFS
HDFS Architecture
1. Where do I read or write data?
2. Use these DataNodes.
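The two-step exchange on this slide can be sketched as a toy model (not Hadoop’s actual RPC interface; the paths, block names, and node names are illustrative):

```python
class NameNode:
    """Toy NameNode holding file -> [(block, [datanodes])] metadata."""
    def __init__(self):
        self.metadata = {
            "/in/input.txt": [("blk_1", ["dn-1", "dn-3"]),
                              ("blk_2", ["dn-2", "dn-3"])],
        }

    def get_block_locations(self, path):
        # Step 1: client asks "Where do I read or write data?"
        # Step 2: NameNode answers "Use these data nodes."
        return self.metadata[path]

nn = NameNode()
for block, nodes in nn.get_block_locations("/in/input.txt"):
    print(block, nodes)
```

The client then contacts the listed DataNodes directly for the block data; the NameNode never sits on the data path.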
NameNode
● Only one per cluster (the “master node”)
● Stores filesystem metadata such as filenames, permissions, directories, and blocks
● Kept in RAM for fast access
● Persisted to disk
● The NameNode is the brain of the outfit
DataNode
● Many per cluster (“slave nodes”)
● Stores individual file “blocks” but knows nothing about them except the block name
● Reports regularly to the NameNode: “Hey, I am alive, and I have these blocks”
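The heartbeat described above can be modelled in a few lines (a toy sketch, not Hadoop’s wire protocol; all names are illustrative):

```python
import time
from dataclasses import dataclass, field

@dataclass
class DataNode:
    """Toy DataNode that reports its liveness and block list."""
    node_id: str
    blocks: set = field(default_factory=set)

    def heartbeat(self):
        # "Hey, I am alive, and I have these blocks"
        return {"node": self.node_id,
                "alive_at": time.time(),
                "blocks": sorted(self.blocks)}

dn = DataNode("dn-1", {"blk_1002", "blk_1001"})
report = dn.heartbeat()
print(report["node"], report["blocks"])
```

If heartbeats from a node stop arriving, the NameNode treats it as dead and re-replicates its blocks elsewhere.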
HDFS Presents
● Transparency● Replication
HDFS Properties
● Files are immutable
No updates, no appends
● Disk access is optimized for sequential reads
Data is stored in large “blocks” (128 MB by default)
● Corruption is avoided
“Blocks” are verified with a checksum when stored and read
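The checksum idea can be illustrated with Python’s zlib.crc32 (a simplified sketch; HDFS checksums fixed-size chunks of each block, 512 bytes by default, and uses CRC32C rather than plain CRC32):

```python
import zlib

BYTES_PER_CHECKSUM = 512  # mirrors HDFS's dfs.bytes-per-checksum default

def checksum_block(block):
    """Compute one CRC per fixed-size chunk of a block (done on write)."""
    return [zlib.crc32(block[i:i + BYTES_PER_CHECKSUM])
            for i in range(0, len(block), BYTES_PER_CHECKSUM)]

def verify_block(block, stored_crcs):
    """On read, recompute the chunk CRCs and compare with the stored ones."""
    return checksum_block(block) == stored_crcs

block = bytes(2048)          # a toy 2 KB "block"
crcs = checksum_block(block)
print(verify_block(block, crcs))                 # data intact
print(verify_block(b"\x01" + block[1:], crcs))   # corruption detected
```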
Cont...
● High throughput
Avoid contention; have the system share as little information and as few resources as possible
● Fault tolerant
Loss of a disk, a machine, or a rack of machines should not lead to data loss
Client Reading From HDFS
Client Writing To HDFS
Demo
● File system health check
❏ hdfs fsck /
● List the file system contents
❏ hdfs dfs -ls /
● Create a directory
❏ hdfs dfs -mkdir /data1
● Upload a file to HDFS
❏ hdfs dfs -put input.txt /in
Cont...
● Input directory in HDFS: /in
● Output directory in HDFS: /output
NameNode High Availability
● NFS Filer● Quorum Journal Manager (QJM)
MapReduce
● A programming model for processing large data in a distributed fashion over a cluster of commodity machines
● Introduced by Google
● Uses two key steps: mapping and reducing
Cont...
● Almost all data can be mapped into <key, value> pairs somehow
● Keys and values may be of any type: strings, integers, dummy types, even <K,V> pairs themselves, and so on
● Scale-free programming
❏ If a program works for a 1 KB file, it can work for any file size
MapReduce Word Count Data Flow

Input:
  see spot run
  run spot run
  see the cat
InputSplit: each line becomes one input record
Map (emit one <word,1> pair per word):
  see,1 spot,1 run,1
  run,1 spot,1 run,1
  see,1 the,1 cat,1
Shuffle (group pairs by key):
  cat,1 | run,1 run,1 run,1 | see,1 see,1 | spot,1 spot,1 | the,1
Reduce (sum each group):
  cat,1 run,3 see,2 spot,2 the,1
Output:
  see,2 spot,2 run,3 the,1 cat,1
Example: Hello World Program
● wget www.gutenberg.org/files/2600/2600.txt
● python mapper.py < input.txt | sort | python reducer.py
● cat *.txt | python mapper.py | sort | python reducer.py
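The mapper.py and reducer.py invoked above are not shown on the slides; a minimal word-count pair in the Hadoop Streaming style might look like this single-file sketch (function names are illustrative):

```python
from itertools import groupby

def map_lines(lines):
    """Mapper: emit one tab-separated (word, 1) pair per word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"

def reduce_pairs(sorted_lines):
    """Reducer: sum the counts per word; input must be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"

if __name__ == "__main__":
    # Emulate `cat input | mapper | sort | reducer` in one process.
    lines = ["see spot run", "run spot run", "see the cat"]
    for out in reduce_pairs(sorted(map_lines(lines))):
        print(out)
```

In a real streaming job each half lives in its own script reading sys.stdin, and the cluster performs the sort between the two phases.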
Demo
● cat input1.txt | python mapper.py
● cat input1.txt | python mapper.py | sort
● cat input1.txt | python mapper.py | sort | python reducer.py
● cat input1.txt | python mapper.py | sort | python reducer.py > output.txt
How to run a job on the cluster?
● Using the Streaming interface (the original slide omits the -reducer flag, which the job needs):
❏ hadoop jar /opt/hadoop/hadoop-2.6.0/share/hadoop/tools/lib/hadoop-streaming-2.6.0.jar -file ./mapper.py -mapper ./mapper.py -file ./reducer.py -reducer ./reducer.py -input /in/input.txt -output /output/run1
Web Interfaces
● NameNode: http://10.0.0.160:50070
● ResourceManager: http://10.0.0.160:8088
● /opt/hadoop/hadoop-2.6.0/share/hadoop/hdfs/webapps/hdfs contains the view
Prerequisites
● Java
● ssh
● rsync
Installation
● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0-src.tar.gz
● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0-src.tar.gz.mds
Cont...
● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz
● wget apache.osuosl.org/hadoop/common/hadoop-2.6.0/hadoop-2.6.0.tar.gz.mds
Integrity Check
● md5sum hadoop-2.6.0.tar.gz
● cat hadoop-2.6.0.tar.gz.mds | grep -i md5
Startup Code
● https://github.com/kpsubedi/BigData
Open Source
● Apache Hadoop http://hadoop.apache.org/
Commercial Big Data Players
● Hortonworks http://hortonworks.com/
● Cloudera http://www.cloudera.com/content/cloudera/en/home.html
● MapR https://www.mapr.com/
● Others
Conclusion
● Thank you
References
● Apache Hadoop http://hadoop.apache.org/
● Hortonworks http://hortonworks.com/
● Cloudera http://www.cloudera.com/content/cloudera/en/downloads.html
● MapR https://www.mapr.com/
● The Google File System http://static.googleusercontent.com/media/research.google.com/en/us/archive/gfs-sosp2003.pdf
References (1)
● MapReduce: Simplified Data Processing on Large Clusters http://static.googleusercontent.com/media/research.google.com/en/us/archive/mapreduce-osdi04.pdf