Introduction
• Hadoop is a large-scale distributed batch processing infrastructure.
• Can be used on a single machine, but it is meant to run on hundreds or thousands of computers, each with several processor cores.
• Designed to efficiently distribute large amounts of work across a set of machines.
• Built to process "web-scale" or "big" data on the order of hundreds of gigabytes to terabytes or petabytes.
• At this scale, it is likely that the input data set will not even fit on a single computer's hard drive, much less in memory.
• Includes a distributed file system which breaks up input data and sends fractions of the original data to several machines in the cluster to hold.
• As a result, the problem is processed in parallel using all of the machines in the cluster, and output results are computed as efficiently as possible.
Introduction
• Challenges at Large Scale
– Large volume of data requires distributing parts of the problem to multiple machines to handle in parallel.
– In a distributed environment, partial failures are an expected and common occurrence.
– Synchronization between multiple machines.
– If nodes in a distributed system can explicitly communicate with one another, then application designers must be cognizant of risks associated with such communication patterns (more RPC calls required).
– The ability to continue computation in the face of failures becomes more challenging.
– Compute hardware has finite resources available to it. The major resources include processor time, memory, hard drive space, and network bandwidth.
Introduction
Hadoop is designed to efficiently process large volumes of information by connecting many commodity computers together to work in parallel.
What is Hadoop?
• At Google, MapReduce operations are run on a special file system called Google File System (GFS) that is highly optimized for this purpose.
• GFS is not open source.
• Doug Cutting, with later backing from Yahoo!, built an open-source implementation based on Google's published GFS design and called it Hadoop Distributed File System (HDFS).
• The software framework that supports HDFS, MapReduce and other related entities is called the project Hadoop or simply Hadoop.
• This is open source and distributed by Apache.
8 Fallacies of Distributed Computing
1. The network is reliable.
2. Latency is zero.
3. Bandwidth is infinite.
4. The network is secure.
5. Topology doesn't change.
6. There is one administrator.
7. Transport cost is zero.
8. The network is homogeneous.
Parts of Hadoop
• Hadoop Distributed File System (HDFS)
– is a file system that can store very large data sets by scaling out across a cluster of hosts. It has specific design and performance characteristics; in particular, it is optimized for throughput instead of latency, and it achieves high availability through replication instead of hardware redundancy.
• MapReduce
– is a data processing paradigm that takes a specification of how the data will be input and output from its two stages (called map and reduce) and then applies this across arbitrarily large data sets.
MapReduce integrates tightly with HDFS, ensuring that wherever possible, MapReduce tasks run directly on the HDFS nodes that hold the required data.
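The idea of running tasks on the nodes that already hold the data can be illustrated with a small sketch. This is not Hadoop's scheduler; the function `assign_task`, the node names, and the block IDs are all hypothetical, and the fallback rule (take the first free node) is an assumption made to keep the example short.

```python
from typing import Dict, List, Optional, Set


def assign_task(block_id: str,
                block_locations: Dict[str, Set[str]],
                free_nodes: List[str]) -> Optional[str]:
    """Pick a node for the map task that reads block_id.

    Prefer a free node that already stores the block (data-local).
    Otherwise fall back to the first free node, which would have to
    fetch the block over the network. Return None if no node is free.
    """
    if not free_nodes:
        return None
    holders = block_locations.get(block_id, set())
    for node in free_nodes:
        if node in holders:
            return node
    return free_nodes[0]


# Hypothetical cluster state: which nodes hold a copy of which block.
block_locations = {
    "blk_1": {"node-a", "node-b", "node-c"},
    "blk_2": {"node-b", "node-d", "node-e"},
}

local_choice = assign_task("blk_1", block_locations, ["node-d", "node-b"])
remote_choice = assign_task("blk_2", block_locations, ["node-a"])
print(local_choice, remote_choice)
```

In the first call a free node holding `blk_1` exists, so the task goes there; in the second call no free node holds `blk_2`, so the task runs remotely.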
HDFS
• HDFS is a distributed file system, meaning that it spreads storage across multiple nodes.
• Key features:
– HDFS stores files in blocks typically at least 64 MB in size, much larger than the 4-32 KB seen in most file systems.
– HDFS is optimized for throughput over latency; it is very efficient at streaming read requests for large files but poor at seek requests for many small ones.
– HDFS is optimized for workloads that are generally of the write-once and read-many type.
• Each storage node runs a process called a DataNode that manages the blocks on that host, and these are coordinated by a master NameNode process running on a separate host. Instead of handling disk failures by having physical redundancies in disk arrays or similar strategies, HDFS uses replication. Each of the blocks comprising a file is stored on multiple nodes within the cluster, and the HDFS NameNode constantly monitors reports sent by each DataNode to ensure that failures have not dropped any block below the desired replication factor. If this does happen, it schedules the addition of another copy within the cluster.
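The block and replication bookkeeping described above can be sketched in a few lines. This is a toy model, not the NameNode's actual logic: the names `block_count` and `under_replicated`, the DataNode names, and the block IDs are hypothetical, and the replication factor of 3 is an assumption for the example (the slides do not state a value).

```python
from typing import Dict, Set

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the typical minimum block size quoted above
REPLICATION_FACTOR = 3         # assumed value for this sketch


def block_count(file_size_bytes: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of blocks needed to store a file (integer ceiling division)."""
    return (file_size_bytes + block_size - 1) // block_size


def under_replicated(reports: Dict[str, Set[str]],
                     live_nodes: Set[str],
                     factor: int = REPLICATION_FACTOR) -> Dict[str, int]:
    """Return {block_id: missing_replicas} for blocks below the target factor.

    reports maps each DataNode name to the block IDs in its last report.
    Only reports from nodes in live_nodes count as available replicas.
    """
    counts = {block: 0 for blocks in reports.values() for block in blocks}
    for node, blocks in reports.items():
        if node in live_nodes:
            for block in blocks:
                counts[block] += 1
    return {block: factor - n for block, n in counts.items() if n < factor}


one_gib_blocks = block_count(1024 ** 3)  # a 1 GiB file split into 64 MB blocks

reports = {
    "dn1": {"blk_1", "blk_2"},
    "dn2": {"blk_1", "blk_2"},
    "dn3": {"blk_1"},
    "dn4": {"blk_2"},
}
healthy = under_replicated(reports, {"dn1", "dn2", "dn3", "dn4"})
after_failure = under_replicated(reports, {"dn1", "dn2", "dn3"})  # dn4 is down
print(one_gib_blocks, healthy, after_failure)
```

When `dn4` stops reporting, `blk_2` falls one copy short of the target, which is the condition that would trigger scheduling of another replica.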
MapReduce
• Basic Introduction: http://architecture-soa-bpm-eai.blogspot.com/2013/07/map-reduce-for-my-sophomore.html
MapReduce
• MapReduce is a processing paradigm which provides a series of transformations from a source to a result data set.
• Principles:
– In the simplest case, the input data is fed to the map function and the resultant temporary data to a reduce function.
– The developer only defines the data transformations; Hadoop's MapReduce framework manages the process of how to apply these transformations to the data across the cluster in parallel.
– Though the underlying ideas may not be novel, a major strength of Hadoop is in how it has brought these principles together into an accessible and well-engineered platform.
– Unlike traditional relational databases that require structured data with well-defined schemas, MapReduce and Hadoop work best on semi-structured or unstructured data.
– Hadoop provides a standard specification (that is, interface) for the map and reduce functions, and implementations of these are often referred to as mappers and reducers.
– A typical MapReduce job will comprise a number of mappers and reducers.
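The map and reduce stages can be illustrated with the classic word count, written here as a single-process Python sketch. It does not use Hadoop's API: the names `mapper`, `shuffle`, `reducer`, and `run_job` are hypothetical, and the grouping step that Hadoop performs between the two stages across the cluster is simulated with an in-memory dictionary.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Map stage: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group intermediate values by key, standing in for the framework's shuffle."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce stage: sum all counts emitted for one word."""
    return key, sum(values)


def run_job(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on a small in-memory input."""
    mapped = (pair for line in lines for pair in mapper(line))
    grouped = shuffle(mapped)
    return dict(reducer(key, values) for key, values in sorted(grouped.items()))


result = run_job(["the quick brown fox", "the lazy dog", "The fox"])
print(result)
```

Only `mapper` and `reducer` carry the problem-specific logic; everything else is plumbing, which is the part a framework such as Hadoop takes over and runs in parallel.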
Hadoop Ecosystem
• Although Hadoop is best known for MapReduce and its distributed filesystem (HDFS, renamed from NDFS), the term is also used for a family of related projects that fall under the umbrella of infrastructure for distributed computing and large-scale data processing.
• Common: A set of components and interfaces for distributed file systems and general I/O (serialization, Java RPC, persistent data structures).
• Avro: A serialization system for efficient, cross-language RPC and persistent data storage.
• MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
• HDFS: A distributed file system that runs on large clusters of commodity machines.
• Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
• Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL, which the runtime engine translates to MapReduce jobs, for querying the data.
• HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage, and supports both batch-style computations using MapReduce and point queries (random reads).
• ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used for building distributed applications.
• Sqoop: A tool for efficiently moving data between relational databases and HDFS.
Hadoop 1 Ecosystem
Hadoop 2 Ecosystem
Relational Database vs. Hadoop
Relational Database Hadoop