Facts
Data-intensive applications with petabytes of data
Web pages: 20+ billion web pages x 20 KB = 400+ terabytes
One computer can read 30-35 MB/sec from disk, so reading the web would take roughly four months
With 1,000 machines, the same job takes about 3 hours
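A quick back-of-the-envelope check of the numbers above (all figures are the slide's rough estimates, not measurements; decimal units assumed):

```python
# Slide's rough estimates, decimal units (1 TB = 1e12 bytes).
pages = 20e9            # 20+ billion web pages
bytes_per_page = 20e3   # ~20 KB per page
disk_rate = 35e6        # one disk reads ~30-35 MB/sec; take the high end

total_bytes = pages * bytes_per_page
total_tb = total_bytes / 1e12
print(f"total: {total_tb:.0f} TB")              # 400 TB

seconds_one = total_bytes / disk_rate
days_one = seconds_one / 86400
print(f"one machine: {days_one:.0f} days")      # ~132 days, i.e. 4+ months

hours_1000 = seconds_one / 1000 / 3600          # perfect 1000-way split
print(f"1000 machines: {hours_1000:.1f} hours") # ~3.2 hours
```

The 1000-machine figure assumes a perfectly even split of the data, which is exactly the kind of partitioning MapReduce is built to approximate.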
Single-thread performance doesn't matter: the problems are large, and total throughput per price matters more than peak performance
Stuff breaks, so we need more reliability:
- If you have one server, it may stay up three years (1,000 days)
- If you have 10,000 servers, expect to lose ten a day
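The failure arithmetic above is just a ratio: with roughly one failure per 1,000 server-days, a 10,000-server fleet sees about ten failures per day.

```python
# Expected failures per day for a fleet, given the slide's rough figure
# of one failure per ~1,000 server-days.
mtbf_days = 1_000                # one server stays up ~1,000 days
servers = 10_000
expected_failures_per_day = servers / mtbf_days
print(expected_failures_per_day)  # 10.0
```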
“Ultra-reliable” hardware doesn't really help. At large scale, even super-fancy reliable hardware still fails, albeit less often:
- software still needs to be fault-tolerant
- commodity machines without fancy hardware give better performance per price
What is Hadoop?
Hadoop is a framework for running applications on large clusters of commodity hardware, enabling distributed processing of big data that may be stored across many physical locations.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Hadoop includes:
- HDFS, a distributed filesystem
- Map/Reduce, a programming model for offline (batch) computation over data stored in HDFS
HDFS design assumptions and goals:
- Hardware failure is the norm rather than the exception
- Moving computation is cheaper than moving data
- HDFS runs on commodity hardware
- HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware
- It provides high-throughput access to application data
- It is suitable for applications that have large data sets
NameNode and DataNodes: HDFS has a master/slave architecture
- NameNode: manages the file system namespace and regulates access to files by clients
- DataNodes: usually one per node in the cluster; they manage storage attached to the nodes that they run on
- A file is split into one or more blocks, and these blocks are stored in a set of DataNodes
- The NameNode executes file system namespace operations like opening, closing, and renaming files and directories; it also determines the mapping of blocks to DataNodes
- The DataNodes are responsible for serving read and write requests from the file system's clients; they also perform block creation, deletion, and replication upon instruction from the NameNode
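The NameNode's bookkeeping described above can be sketched in a few lines. This is a toy illustration, not HDFS code: the 64 MB block size and replication factor of 3 match classic HDFS defaults, but the round-robin placement below is a stand-in for HDFS's real rack-aware placement policy.

```python
# Toy sketch: split a file into fixed-size blocks and map each block
# to a set of DataNodes (what the NameNode tracks as block locations).
BLOCK_SIZE = 64 * 1024 * 1024   # classic HDFS default block size
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file of file_size bytes occupies (last may be partial)."""
    return (file_size + block_size - 1) // block_size

def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

nodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file -> 4 blocks
placement = place_blocks(blocks, nodes)
print(blocks)       # 4
print(placement)
```

Because every block lives on several DataNodes, losing a node costs no data, and the NameNode can re-replicate the affected blocks elsewhere.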
MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then input to the reduce tasks
Typically the compute nodes and the storage nodes are the same; that is, the MapReduce framework and HDFS run on the same set of nodes, which lets the framework schedule tasks on the nodes where the data is already present
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node
The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks
The slaves execute the tasks as directed by the master
Applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes.
The Hadoop job client then submits the job and configuration to the JobTracker
The JobTracker then assumes responsibility for distributing the software/configuration to the slaves, scheduling and monitoring tasks, and providing status and diagnostic information to the job client
Let's Simulate
MapReduce operates on <key, value> pairs: the job takes a set of <key, value> pairs as input and produces a set of <key, value> pairs as output
Process
Consider a simple example
File 1: Hello World Bye World
File 2: Hello Hadoop Goodbye Hadoop
For the given sample input the first map emits:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
The second map emits:
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
After using the combiner, the output of the first map is:
<Bye, 1> <Hello, 1> <World, 2>
The output of the second map is:
<Goodbye, 1> <Hadoop, 2> <Hello, 1>
Thus the output of the job is:
<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
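The walkthrough above can be run as a plain-Python simulation. Real Hadoop jobs implement the Mapper/Reducer interfaces, typically in Java; this toy version only mimics the data flow of map, combine, and reduce for word count.

```python
from collections import Counter

def map_phase(text):
    """Mapper: emit a <word, 1> pair for every word in the input split."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Sum counts per key and sort by key (used as both combiner and reducer)."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

file1 = "Hello World Bye World"
file2 = "Hello Hadoop Goodbye Hadoop"

out1 = combine(map_phase(file1))
out2 = combine(map_phase(file2))
print(out1)   # [('Bye', 1), ('Hello', 1), ('World', 2)]
print(out2)   # [('Goodbye', 1), ('Hadoop', 2), ('Hello', 1)]

# Reduce phase: merge the combined map outputs and sum again.
result = combine(out1 + out2)
print(result) # [('Bye', 1), ('Goodbye', 1), ('Hadoop', 2), ('Hello', 2), ('World', 2)]
```

Reusing the same summing function as both combiner and reducer is what makes a combiner legal here: word count's reduction is associative and commutative, so summing partial counts locally before the shuffle does not change the final result.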
References
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
http://www.aosabook.org/en/hdfs.html
http://hadoop.apache.org/
Thank You