Facts
Data-intensive applications with petabytes of data
Web pages: 20+ billion web pages x 20 KB = 400+ terabytes
One computer can read 30-35 MB/sec from disk, so reading the web would take roughly four months
With 1,000 machines, the same job takes about 3 hours
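A quick back-of-the-envelope check of the numbers above (all figures are the slide's rough estimates, not measurements; decimal units assumed):

```python
# Slide's rough estimates, decimal units (1 TB = 1e12 bytes).
pages = 20e9            # 20+ billion web pages
bytes_per_page = 20e3   # ~20 KB per page
disk_rate = 35e6        # one disk reads ~30-35 MB/sec; take the high end

total_bytes = pages * bytes_per_page
total_tb = total_bytes / 1e12
print(f"total: {total_tb:.0f} TB")              # 400 TB

seconds_one = total_bytes / disk_rate
days_one = seconds_one / 86400
print(f"one machine: {days_one:.0f} days")      # ~132 days, i.e. 4+ months

hours_1000 = seconds_one / 1000 / 3600          # perfect 1000-way split
print(f"1000 machines: {hours_1000:.1f} hours") # ~3.2 hours
```

The 1000-machine figure assumes a perfectly even split of the data, which is exactly the kind of partitioning MapReduce is built to approximate.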
Single-thread performance doesn't matter: the problems are large, and total throughput per price matters more than peak performance
Stuff breaks, so we need more reliability:
- If you have one server, it may stay up three years (1,000 days)
- If you have 10,000 servers, expect to lose ten a day
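The failure arithmetic above is just a ratio: with roughly one failure per 1,000 server-days, a 10,000-server fleet sees about ten failures per day.

```python
# Expected failures per day for a fleet, given the slide's rough figure
# of one failure per ~1,000 server-days.
mtbf_days = 1_000                # one server stays up ~1,000 days
servers = 10_000
expected_failures_per_day = servers / mtbf_days
print(expected_failures_per_day)  # 10.0
```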
“Ultra-reliable” hardware doesn't really help. At large scale, even super-fancy reliable hardware still fails, albeit less often:
- software still needs to be fault-tolerant
- commodity machines without fancy hardware give better performance per price
What is Hadoop?
Hadoop is a framework for running applications on large clusters of commodity hardware, enabling distributed processing of big data that may be stored across many physical locations.
The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.
It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.
Hadoop includes:
- HDFS, a distributed filesystem
- Map/Reduce, a programming model for offline (batch) computation over data stored in HDFS
HDFS design assumptions and goals:
- Hardware failure is the norm rather than the exception
- Moving computation is cheaper than moving data
- HDFS runs on commodity hardware
- HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware
- It provides high-throughput access to application data
- It is suitable for applications that have large data sets
NameNode and DataNodes: HDFS has a master/slave architecture
- NameNode: manages the file system namespace and regulates access to files by clients
- DataNodes: usually one per node in the cluster; they manage storage attached to the nodes that they run on
- A file is split into one or more blocks, and these blocks are stored in a set of DataNodes
- The NameNode executes file system namespace operations like opening, closing, and renaming files and directories; it also determines the mapping of blocks to DataNodes
- The DataNodes are responsible for serving read and write requests from the file system's clients; they also perform block creation, deletion, and replication upon instruction from the NameNode
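The NameNode's bookkeeping described above can be sketched in a few lines. This is a toy illustration, not HDFS code: the 64 MB block size and replication factor of 3 match classic HDFS defaults, but the round-robin placement below is a stand-in for HDFS's real rack-aware placement policy.

```python
# Toy sketch: split a file into fixed-size blocks and map each block
# to a set of DataNodes (what the NameNode tracks as block locations).
BLOCK_SIZE = 64 * 1024 * 1024   # classic HDFS default block size
REPLICATION = 3                 # default replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Number of blocks a file of file_size bytes occupies (last may be partial)."""
    return (file_size + block_size - 1) // block_size

def place_blocks(num_blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

nodes = ["dn1", "dn2", "dn3", "dn4"]
blocks = split_into_blocks(200 * 1024 * 1024)   # a 200 MB file -> 4 blocks
placement = place_blocks(blocks, nodes)
print(blocks)       # 4
print(placement)
```

Because every block lives on several DataNodes, losing a node costs no data, and the NameNode can re-replicate the affected blocks elsewhere.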
MapReduce is a software framework for easily writing applications that process vast amounts of data (multi-terabyte data sets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
A MapReduce job usually splits the input data-set into independent chunks which are processed by the map tasks in a completely parallel manner.
The framework sorts the outputs of the maps, which are then input to the reduce tasks
Typically the compute nodes and the storage nodes are the same; that is, the MapReduce framework and HDFS run on the same set of nodes, which lets the framework schedule tasks on the nodes where the data is already present
The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node
The master is responsible for scheduling the jobs' component tasks on the slaves, monitoring them, and re-executing failed tasks
The slaves execute the tasks as directed by the master
Applications specify the input/output locations and supply map and reduce functions via implementations of appropriate interfaces and/or abstract classes.
The Hadoop job client then submits the job and configuration to the JobTracker
The JobTracker then assumes responsibility for distributing the software/configuration to the slaves, scheduling and monitoring tasks, and providing status and diagnostic information to the job client
Let's Simulate
MapReduce operates on <key, value> pairs: the job takes a set of <key, value> pairs as input and produces a set of <key, value> pairs as output
Process
Consider a simple example
File 1: Hello World Bye World
File 2: Hello Hadoop Goodbye Hadoop
For the given sample input the first map emits:
<Hello, 1> <World, 1> <Bye, 1> <World, 1>
The second map emits:
<Hello, 1> <Hadoop, 1> <Goodbye, 1> <Hadoop, 1>
After using the combiner, the output of the first map is:
<Bye, 1> <Hello, 1> <World, 2>
The output of the second map is:
<Goodbye, 1> <Hadoop, 2> <Hello, 1>
Thus the output of the job is:
<Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
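The walkthrough above can be run as a plain-Python simulation. Real Hadoop jobs implement the Mapper/Reducer interfaces, typically in Java; this toy version only mimics the data flow of map, combine, and reduce for word count.

```python
from collections import Counter

def map_phase(text):
    """Mapper: emit a <word, 1> pair for every word in the input split."""
    return [(word, 1) for word in text.split()]

def combine(pairs):
    """Sum counts per key and sort by key (used as both combiner and reducer)."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return sorted(counts.items())

file1 = "Hello World Bye World"
file2 = "Hello Hadoop Goodbye Hadoop"

out1 = combine(map_phase(file1))
out2 = combine(map_phase(file2))
print(out1)   # [('Bye', 1), ('Hello', 1), ('World', 2)]
print(out2)   # [('Goodbye', 1), ('Hadoop', 2), ('Hello', 1)]

# Reduce phase: merge the combined map outputs and sum again.
result = combine(out1 + out2)
print(result) # [('Bye', 1), ('Goodbye', 1), ('Hadoop', 2), ('Hello', 2), ('World', 2)]
```

Reusing the same summing function as both combiner and reducer is what makes a combiner legal here: word count's reduction is associative and commutative, so summing partial counts locally before the shuffle does not change the final result.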
References
https://hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html
http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html
http://www.aosabook.org/en/hdfs.html
http://hadoop.apache.org/
Thank You