Hadoop Introduction
![Page 1: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/1.jpg)
Training (Day – 1)
Introduction
![Page 2: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/2.jpg)
Big Data

Four parameters:
–Velocity: streaming data and large-volume data movement.
–Volume: scale from terabytes to zettabytes.
–Variety: manage the complexity of multiple relational and non-relational data types and schemas.
–Voracity: produced data has to be consumed fast, before it becomes meaningless.
![Page 3: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/3.jpg)
Not just internet companies

Big Data shouldn't be a silo. It must be an integrated part of the enterprise information architecture.
![Page 4: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/4.jpg)
Data >> Information >> Business Value

Retail – By combining data feeds on inventory, transactional records, social media and online trends, retailers can make real-time decisions around product and inventory mix adjustments, marketing and promotions, pricing or product quality issues.
Financial Services – By combining data across various groups and services like financial markets, money manager and lending, financial services companies can gain a comprehensive view of their individual customers and markets.
Government – By collecting and analyzing data across agencies, location and employee groups, the government can reduce redundancies, identify productivity gaps, pinpoint consumer issues and find procurement savings across agencies.
Healthcare – Big data in healthcare could be used to help improve hospital operations, provide better tracking of procedural outcomes, help accelerate research and grow knowledge bases more rapidly. According to a 2011 Cleveland Clinic study, leveraging big data is estimated to drive $300B in annual value through an 8% systematic cost savings.
![Page 5: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/5.jpg)
Processing Granularity

[Diagram: a spectrum of processing granularity, from single-core and multi-core machines (single or multi-processor) through clusters of processors with shared or distributed memory, to grids of clusters and cloud computing. Parallelism levels span pipelined (instruction level), concurrent (thread level), service (object level), indexed (file level), mega (block level), and virtual (system level); data sizes range from small to large. Reference: Bina Ramamurthy, 2011.]

Broadly, the approach in HPC is to distribute the work across a cluster of machines, which access a shared file system hosted by a SAN. Grids of clusters with a distributed file system and MapReduce suit embarrassingly parallel processing.
![Page 6: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/6.jpg)
How to Process Big Data?

Need to process large datasets (>100 TB):
–Just reading 100 TB of data can be overwhelming.
–Takes ~11 days to read on a standard computer.
–Takes about a day across a 10 Gbit link (a very high-end storage solution).
–On a single node (at 50 MB/s): ~23 days.
–On a 1,000-node cluster: ~33 minutes.
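The figures above follow from simple division. A minimal sketch, assuming the 50 MB/s per-node streaming rate quoted on the slide and data striped evenly across nodes:

```python
# Back-of-envelope read times for a 100 TB dataset, assuming the
# per-node streaming rate of 50 MB/s quoted above (all figures rough).
TB = 10**12  # bytes

def read_time_seconds(data_bytes, nodes, mb_per_sec=50):
    """Time to scan the data when it is striped evenly across `nodes` machines."""
    per_node = data_bytes / nodes
    return per_node / (mb_per_sec * 10**6)

single = read_time_seconds(100 * TB, nodes=1)
cluster = read_time_seconds(100 * TB, nodes=1000)
print(f"1 node:     {single / 86400:.0f} days")   # ~23 days
print(f"1000 nodes: {cluster / 60:.0f} min")      # ~33 min
```

This is the core argument for scaling out: the scan time drops linearly with the number of nodes, as long as each node reads its own local data.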
![Page 7: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/7.jpg)
Examples
•Web logs
•RFID
•sensor networks
•social networks
•social data (due to the social data revolution)
•Internet text and documents
•Internet search indexing
•call detail records
•astronomy
•atmospheric science
•genomics
•biogeochemical, biological, and other complex and/or interdisciplinary scientific research
•military surveillance
•medical records
•photography archives
•video archives
•large-scale e-commerce
![Page 8: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/8.jpg)
Not so easy…
Moving data from a storage cluster to a computation cluster is not feasible.

In large clusters:
–Failure is expected, rather than exceptional.
–Computers fail every day.
–Data is corrupted or lost.
–Computations are disrupted.
–The number of nodes in a cluster may not be constant.
–Nodes can be heterogeneous.

It is very expensive to build reliability into each application:
–The programmer worries about errors, data motion, communication…
–Traditional debugging and performance tools don't apply.

We need a common infrastructure and a standard set of tools to handle this complexity:
–Efficient, scalable, fault-tolerant, and easy to use.
![Page 9: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/9.jpg)
Why are Hadoop and MapReduce needed?

The answer to this question comes from another trend in disk drives: seek time is improving more slowly than transfer rate.
Seeking is the process of moving the disk’s head to a particular place on the disk to read or write data.
It characterizes the latency of a disk operation, whereas the transfer rate corresponds to a disk’s bandwidth.
If the data access pattern is dominated by seeks, it will take longer to read or write large portions of the dataset than streaming through it, which operates at the transfer rate.
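The seek-vs-stream trade-off can be put in numbers. A sketch with assumed (but realistic) figures of a 10 ms average seek and a 100 MB/s transfer rate; the workload sizes are illustrative:

```python
# Why streaming beats seeking for bulk access: assumed figures of a
# 10 ms average seek and 100 MB/s sequential transfer rate.
SEEK_S = 0.010        # seconds per seek (assumed)
TRANSFER_BPS = 100e6  # bytes per second (assumed)

def seek_read(n_records, record_bytes):
    """One seek per record, then transfer just that record."""
    return n_records * (SEEK_S + record_bytes / TRANSFER_BPS)

def stream_read(total_bytes):
    """One seek, then stream the whole region at the transfer rate."""
    return SEEK_S + total_bytes / TRANSFER_BPS

# Reading just 1% of a 1 TB disk as scattered 100-byte records...
scattered = seek_read(n_records=10**8, record_bytes=100)
# ...versus streaming the entire 1 TB sequentially:
full_scan = stream_read(10**12)
print(f"scattered reads: {scattered / 3600:.0f} h")  # hundreds of hours
print(f"full scan:       {full_scan / 3600:.1f} h")  # under 3 hours
```

Touching even a small fraction of the data via seeks costs more than streaming all of it, which is why MapReduce is built around sequential scans.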
![Page 10: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/10.jpg)
Why are Hadoop and MapReduce needed?
On the other hand, for updating a small proportion of records in a database, a traditional B-Tree (the data structure used in relational databases, which is limited by the rate at which it can perform seeks) works well.
For updating the majority of a database, a B-Tree is less efficient than MapReduce, which uses Sort/Merge to rebuild the database.
MapReduce can be seen as a complement to an RDBMS.
MapReduce is a good fit for problems that need to analyze the whole dataset, in a batch fashion, particularly for ad hoc analysis.
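A crude cost model makes the B-Tree vs. Sort/Merge trade-off concrete. The figures here are assumptions for illustration (10 ms per seek, 100 MB/s streaming, 100-byte records), not measurements:

```python
# Crude cost model: seek-based in-place updates vs. a streaming rebuild.
# All constants are illustrative assumptions.
SEEK_S, TRANSFER_BPS, REC = 0.010, 100e6, 100

def btree_update(n_total, n_changed):
    """Each changed record costs roughly one seek."""
    return n_changed * SEEK_S

def rebuild(n_total):
    """Stream the whole dataset out and back in (read + write)."""
    return 2 * n_total * REC / TRANSFER_BPS

N = 10**9  # one billion records
# Updating 1 record in 100,000: the B-Tree wins.
assert btree_update(N, N // 100000) < rebuild(N)
# Updating 10% of records: the streaming rebuild wins.
assert btree_update(N, N // 10) > rebuild(N)
```

Under these assumptions the crossover arrives at a surprisingly small update fraction, which is why bulk updates favor the Sort/Merge rebuild that MapReduce uses.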
![Page 11: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/11.jpg)
Why are Hadoop and MapReduce needed?
![Page 12: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/12.jpg)
Hadoop distributions
•Apache™ Hadoop™
•Apache Hadoop-based Services for Windows Azure
•Cloudera's Distribution Including Apache Hadoop (CDH)
•Hortonworks Data Platform
•IBM InfoSphere BigInsights
•Platform Symphony MapReduce
•MapR Hadoop Distribution
•EMC Greenplum MR (using MapR's M5 Distribution)
•Zettaset Data Platform
•SGI Hadoop Clusters (uses Cloudera distribution)
•Grand Logic JobServer
•OceanSync Hadoop Management Software
•Oracle Big Data Appliance (uses Cloudera distribution)
![Page 13: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/13.jpg)
What’s up with the names?
When naming software projects, Doug Cutting seems to have been inspired by his family.
Lucene is his wife’s middle name, and her maternal grandmother’s first name.
His son, as a toddler, used Nutch as the all-purpose word for meal and later named a yellow stuffed elephant Hadoop.
Doug said he “was looking for a name that wasn’t already a web domain and wasn’t trademarked, so I tried various words that were in my life but not used by anybody else. Kids are pretty good at making up words.”
![Page 14: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/14.jpg)
Hadoop features
•Distributed framework for processing and storing data, generally on commodity hardware.
•Completely open source.
•Written in Java:
–Runs on Linux, Mac OS X, Windows, and Solaris.
–Client apps can be written in various languages.
•Scalable: store and process petabytes, scale by adding Hardware
•Economical: 1000’s of commodity machines
•Efficient: run tasks where data is located
•Reliable: data is replicated, failed tasks are rerun
•Primarily used for batch data processing, not real-time / user facing applications
![Page 15: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/15.jpg)
Components of Hadoop
•HDFS (Hadoop Distributed File System)
–Modeled on GFS.
–A reliable, high-bandwidth file system that can store terabytes and petabytes of data.
•Map-Reduce
–Uses the Map/Reduce metaphor from the Lisp language.
–A distributed processing framework that processes the data stored in HDFS as key-value pairs.

[Diagram: clients submit input data to the processing framework, which runs Map tasks followed by Reduce tasks over the DFS: Input → Map → Shuffle & Sort → Reduce → Output.]
![Page 16: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/16.jpg)
HDFS

•Very large distributed file system:
–10K nodes, 100 million files, 10 PB.
–Linearly scalable.
–Supports large files (in GBs or TBs).
•Economical:
–Uses commodity hardware.
–Nodes fail every day; failure is expected, rather than exceptional.
–The number of nodes in a cluster is not constant.
•Optimized for batch processing.
![Page 17: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/17.jpg)
HDFS Goals
•Highly fault-tolerant: runs on commodity hardware, which can fail frequently.
•High throughput of data access: streaming access to data.
•Large files: a typical file is gigabytes to terabytes in size; support for tens of millions of files.
•Simple coherency: write-once-read-many access model.
![Page 18: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/18.jpg)
HDFS: Files and Blocks
•Data organization:
–Data is organized into files and directories.
–Files are divided into uniformly sized large blocks, typically 128 MB.
–Blocks are distributed across cluster nodes.
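The block division above is simple arithmetic. A minimal sketch of HDFS-style block layout, using the 128 MB block size from the slide (the function name is illustrative):

```python
# Sketch of HDFS-style block division: a file is split into fixed-size
# blocks (128 MB, per the slide) that can live on different nodes.
BLOCK = 128 * 1024 * 1024  # 128 MB

def to_blocks(file_size):
    """Return (block_index, block_length) pairs for a file of `file_size` bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        blocks.append((len(blocks), min(BLOCK, file_size - offset)))
        offset += BLOCK
    return blocks

# A 300 MB file becomes two full 128 MB blocks plus one 44 MB tail block.
print(to_blocks(300 * 1024 * 1024))
```

Only the last block is short; all others are exactly the block size, which keeps placement and scheduling uniform.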
•Fault tolerance:
–Blocks are replicated (default 3) to handle hardware failure.
–Replication is rack-aware, for performance and fault tolerance.
–Checksums of the data are kept for corruption detection and recovery.
–The client reads both checksum and data from the DataNode; if the checksum fails, it tries other replicas.
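The checksum-then-fall-back read path above can be sketched as follows. The function and data layout are illustrative stand-ins, not the real HDFS client API; HDFS does use CRC32-based checksums:

```python
# Sketch of the read path above: verify a block's checksum and fall back
# to another replica on mismatch. Names are illustrative, not HDFS APIs.
import zlib

def read_block(replicas):
    """`replicas` is a list of (data, stored_checksum) pairs, one per DataNode."""
    for data, stored in replicas:
        if zlib.crc32(data) == stored:  # checksum matches: data is intact
            return data                 # first healthy replica wins
    raise IOError("all replicas corrupt")

good = b"block payload"
ok_sum = zlib.crc32(good)
corrupt = (b"bit-rotted!!", ok_sum)     # data no longer matches its checksum
# The corrupt replica is skipped; the healthy one is returned.
assert read_block([corrupt, (good, ok_sum)]) == good
```

Replication turns a corrupted block from data loss into a retry, which is why three replicas on commodity disks is usually enough.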
![Page 19: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/19.jpg)
HDFS: Files and Blocks
•High throughput:
–The client talks to both the NameNode and the DataNodes.
–Data is not sent through the NameNode.
–File system throughput scales nearly linearly with the number of nodes.
•HDFS exposes block placement so that computation can be migrated to the data.
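The metadata/data separation above is what lets throughput scale. A toy model of the read flow, with dictionaries standing in for the NameNode and DataNodes (all names here are illustrative, not the real HDFS API):

```python
# Sketch of the read flow: the client asks the NameNode only for block
# *locations*, then pulls the bytes straight from the DataNodes.
namenode = {"/logs/a.log": {0: "dn1", 1: "dn3"}}  # file -> block -> DataNode
datanodes = {
    "dn1": {("/logs/a.log", 0): b"first block "},
    "dn3": {("/logs/a.log", 1): b"second block"},
}

def read_file(path):
    locations = namenode[path]                   # metadata-only request
    data = b""
    for block_id in sorted(locations):
        dn = locations[block_id]
        data += datanodes[dn][(path, block_id)]  # bulk data bypasses NameNode
    return data

print(read_file("/logs/a.log"))  # b'first block second block'
```

Because the NameNode only answers small metadata queries, it never becomes a bandwidth bottleneck; every added DataNode adds read capacity.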
![Page 20: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/20.jpg)
HDFS Components
•NameNode:
–Manages namespace operations such as opening, creating, and renaming files.
–Maps each file name to its list of blocks and their locations.
–Holds file metadata.
–Performs authorization and authentication.
–Collects block reports from DataNodes on block locations.
–Re-replicates missing blocks.
–Keeps the entire namespace in memory, plus checkpoints and a journal.
•DataNode:
–Handles block storage on multiple volumes, and data integrity.
–Clients access blocks directly from DataNodes for reads and writes.
–Periodically sends block reports to the NameNode.
–Creates, deletes, and replicates blocks on instruction from the NameNode.
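The block-report bookkeeping above can be sketched in a few lines. This is a simplified model of the idea, not the real protocol; the target replica count of 3 matches the HDFS default mentioned earlier:

```python
# Sketch: the NameNode aggregates DataNode block reports and flags
# under-replicated blocks for re-replication. Simplified model.
TARGET_REPLICAS = 3

def under_replicated(block_reports):
    """`block_reports` maps DataNode name -> set of block IDs it holds."""
    counts = {}
    for blocks in block_reports.values():
        for b in blocks:
            counts[b] = counts.get(b, 0) + 1
    return {b for b, n in counts.items() if n < TARGET_REPLICAS}

reports = {
    "dn1": {"blk_1", "blk_2"},
    "dn2": {"blk_1", "blk_2"},
    "dn3": {"blk_1"},           # dn3 lost blk_2, e.g. to a disk failure
}
print(under_replicated(reports))  # {'blk_2'}
```

Missing replicas surface automatically on the next round of reports, so recovery needs no operator intervention.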
![Page 21: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/21.jpg)
HDFS Architecture

[Diagram: the NameNode (the master) holds the namespace metadata, e.g. name: /users/joeYahoo/myFile – blocks: {1, 3}; name: /users/bobYahoo/someData.gzip – blocks: {2, 4, 5}. The DataNodes (the slaves) store replicated copies of blocks 1–5. The client exchanges metadata with the NameNode but performs block I/O directly against the DataNodes.]
![Page 22: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/22.jpg)
Hadoop DFS Interface

Simple commands:
hdfs dfs -ls, -du, -rm, -rmr

Uploading files:
hdfs dfs -copyFromLocal foo mydata/foo

Downloading files:
hdfs dfs -copyToLocal mydata/foo foo
hdfs dfs -cat mydata/foo

Admin:
hdfs dfsadmin -report
![Page 23: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/23.jpg)
Map Reduce - Introduction
•Parallel job-processing framework
•Written in Java
•Close integration with HDFS
•Provides:
–Automatic partitioning of a job into sub-tasks
–Automatic retry on failures
–Linear scalability
–Locality of task execution
–A plugin-based framework for extensibility
![Page 24: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/24.jpg)
Map-Reduce
•MapReduce programs are executed in two main phases, called mapping and reducing.
•In the mapping phase, MapReduce takes the input data and feeds each data element to the mapper.
•In the reducing phase, the reducer processes all the outputs from the mapper and arrives at a final result.
•The mapper is meant to filter and transform the input into something the reducer can aggregate over.
•MapReduce uses lists and (key, value) pairs as its main data primitives.
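The two phases above fit in a few lines for the classic word-count example. This is a single-process sketch of the programming model, not Hadoop's actual runtime:

```python
# The map and reduce phases in miniature: a word count where map emits
# (word, 1) pairs and reduce sums the values grouped under each key.
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    return word, sum(counts)

def map_reduce(lines):
    shuffled = defaultdict(list)      # shuffle & sort: group values by key
    for line in lines:
        for key, value in mapper(line):
            shuffled[key].append(value)
    return dict(reducer(k, v) for k, v in sorted(shuffled.items()))

print(map_reduce(["to be or", "not to be"]))
# {'be': 2, 'not': 1, 'or': 1, 'to': 2}
```

Note that the mapper never sees global state and the reducer only sees one key's values; it is exactly this isolation that lets Hadoop run thousands of copies in parallel.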
![Page 25: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/25.jpg)
Map-Reduce
Map-Reduce program:
–Based on two functions: Map and Reduce.
–Every Map/Reduce program must specify a Mapper and, optionally, a Reducer.
–Both operate on key-value pairs.

Map-Reduce works like a Unix pipeline:
cat input | grep | sort | uniq -c | cat > output
Input | Map | Shuffle & Sort | Reduce | Output

For example, counting opened sessions per user:
cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist

Map function: takes a key/value pair and generates a set of intermediate key/value pairs.
map(k1, v1) -> list(k2, v2)
Reduce function: takes the intermediate values associated with the same intermediate key.
reduce(k2, list(v2)) -> list(k3, v3)
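The signatures above can be made executable with a tiny engine that accepts any map(k1, v1) -> list(k2, v2) and reduce(k2, list(v2)) -> list(k3, v3) pair. The engine and the "session opened" example data are illustrative sketches, not Hadoop APIs:

```python
# A tiny engine honoring the map/reduce signatures from the slide:
# map(k1, v1) -> [(k2, v2)] and reduce(k2, [v2]) -> [(k3, v3)].
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, map_fn, reduce_fn):
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))           # map phase
    intermediate.sort(key=itemgetter(0))              # shuffle & sort by key
    output = []
    for k2, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(k2, [v for _, v in group]))  # reduce phase
    return output

# The "session opened" pipeline as map/reduce: key = user, value = count.
lines = [(1, "session opened for user alice"),
         (2, "session opened for user bob"),
         (3, "session opened for user alice")]
m = lambda _, line: [(line.split()[-1], 1)]
r = lambda user, ones: [(user, sum(ones))]
print(run_mapreduce(lines, m, r))  # [('alice', 2), ('bob', 1)]
```

The sort-then-group step plays the role of the Unix `sort | uniq -c` stages; Hadoop performs it across the network between the map and reduce tasks.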
![Page 26: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/26.jpg)
Map-Reduce on Hadoop
![Page 27: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/27.jpg)
Hadoop and its elements

[Diagram: input files (File 1 … File N) in HDFS are divided into splits (Split 1 … Split M). Each split is consumed by a Record Reader and processed by a Map task (Map 1 … Map M) on a cluster machine (Machine 1 … Machine M); optional Combiners (Combiner 1 … Combiner C) pre-aggregate the (key, value) pairs. A Partitioner divides the map output into partitions (Partition 1 … Partition P), which are fetched by Reduce tasks (Reducer 1 … Reducer R) running on machine x; the reducers write output files (File 1 … File O) back to HDFS.]
![Page 28: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/28.jpg)
Hadoop Eco-system
•Hadoop Common: the common utilities that support the other Hadoop subprojects.
•Hadoop Distributed File System (HDFS™): a distributed file system that provides high-throughput access to application data.
•Hadoop MapReduce: a software framework for distributed processing of large data sets on compute clusters.
•Other Hadoop-related projects at Apache include:
–Avro™: a data serialization system.
–Cassandra™: a scalable multi-master database with no single points of failure.
–Chukwa™: a data collection system for managing large distributed systems.
–HBase™: a scalable, distributed database that supports structured data storage for large tables.
–Hive™: a data warehouse infrastructure that provides data summarization and ad hoc querying.
–Mahout™: a scalable machine learning and data mining library.
–Pig™: a high-level data-flow language and execution framework for parallel computation.
–ZooKeeper™: a high-performance coordination service for distributed applications.
![Page 29: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/29.jpg)
Exercise – task
You have time-series data (timestamp, ID, value) collected from 10,000 sensors every millisecond. Your central system stores this data and allows more than 500 people to concurrently access it and execute queries on it. While the last month of data is accessed most frequently, some analytics algorithms build models using historical data as well.

•Task:
–Provide an architecture for such a system that meets the following goals:
–Fast
–Available
–Fair
–Or, provide analytics-algorithm and data-structure design considerations (e.g. k-means clustering, or regression) for three months' worth of this data.
•Group / individual presentation
![Page 30: Hadoop introduction](https://reader030.vdocuments.site/reader030/viewer/2022032422/55a9822e1a28ab65458b468f/html5/thumbnails/30.jpg)
End of session
Day – 1: Introduction