
BIG DATA ANALYTICS USING APACHE HADOOP

SEMINAR REPORT

Submitted in partial fulfilment of

the requirements for the award of Bachelor of Technology Degree

in Computer Science and Engineering

of the University of Kerala

Submitted by

ABIN BABY
Roll No: 1

Seventh Semester
B.Tech Computer Science and Engineering

DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
COLLEGE OF ENGINEERING TRIVANDRUM
2014


DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
COLLEGE OF ENGINEERING

TRIVANDRUM

CERTIFICATE

This is to certify that this seminar report entitled “BIG DATA ANALYTICS USING

APACHE HADOOP” is a bonafide record of the work done by Abin Baby, under our

guidance towards partial fulfilment of the requirements for the award of the Degree of

Bachelor of Technology in Computer Science and Engineering of the University of

Kerala during the year 2011-2015.

Dr. Abdul Nizar A            Mrs. Sabitha S          Mrs. Rani Koshi
Professor                    Assoc. Professor        Assoc. Professor
Dept. of CSE                 Dept. of CSE            Dept. of CSE
(Head of the Department)     (Guide)                 (Guide)


ACKNOWLEDGEMENTS

I would like to express my sincere gratitude and heartfelt indebtedness to my guide Dr. Abdul Nizar, Head of the Department, Department of Computer Science and Engineering, for his valuable guidance and encouragement in pursuing this seminar.

I am also very much thankful to Mrs. Sabitha S, Associate Professor, Department of Computer Science and Engineering, for her help and support.

I also extend my hearty gratitude to the Seminar Co-ordinator, Mrs. Rani Koshi, Associate Professor, Department of CSE, College of Engineering Trivandrum, for providing the necessary facilities and for her sincere co-operation. My sincere thanks are extended to all the teachers of the Department of CSE and to all my friends for their help and support.

Above all, I thank God for the immense grace and blessings at all stages of the project.

Abin Baby


ABSTRACT

The paradigm of processing huge datasets has shifted from centralized to distributed architectures. As enterprises began gathering large volumes of data, they found that the data could not be processed by any of the existing centralized solutions. Apart from time constraints, they faced problems of efficiency, performance and elevated infrastructure cost when processing the data in a centralized environment.

With the help of distributed architectures, these large organizations were able to overcome the problem of extracting relevant information from a huge data dump. One of the best open source tools in the market that harnesses a distributed architecture to solve data processing problems is Apache Hadoop. Using Apache Hadoop's components, such as data clusters, the MapReduce model and distributed processing, various location-based and other complex data problems can be resolved and the relevant information fed back into the system, thereby improving the user experience.


TABLE OF CONTENTS

1. Introduction

2. Big Data
   2.1 Attributes of Big Data
   2.2 Big Data Applications

3. Big Data Analytics
   3.1 Challenges of Big Data Analytics
   3.2 Objectives of Big Data Analytics

4. Apache Hadoop

5. Hadoop Distributed File System
   5.1 Architecture

6. Hadoop MapReduce
   6.1 Architecture
   6.2 MapReduce Paradigm
   6.3 Comparing RDBMS and MapReduce

7. Apache Hadoop Ecosystem

8. Conclusion

9. Bibliography


CHAPTER 1

INTRODUCTION

The amount of data generated every day is expanding drastically. Big data is a popular term used to describe data on the scale of zettabytes. Governments, companies and many other organisations try to acquire and store data about their citizens and customers in order to know them better and to predict customer behaviour. Social networking websites generate new data every second, and handling such data is one of the major challenges companies face. Data stored in data warehouses is of little use while it remains in a raw format; proper analysis and processing must be done in order to produce usable information from it. Big Data deals with large and complex datasets that can be structured, semi-structured or unstructured and will typically not fit into memory to be processed. They have to be processed in place, which means that computation has to be done where the data resides.

Big data challenges include analysis, capture, curation, search, sharing, storage, transfer, visualization and information privacy. The trend towards larger data sets is due to the additional information derivable from analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found to spot business trends, prevent diseases, combat crime and so on.

Big Data usually includes datasets whose sizes are beyond the ability of commonly used software tools to capture, manage and process within a tolerable elapsed time; it is not possible for such systems to process this amount of data within the time frame mandated by the business. Big Data volumes are a constantly moving target, as of 2012 ranging from a few dozen terabytes to many petabytes of data in a single dataset. Faced with this seemingly insurmountable challenge, entirely new platforms, called Big Data platforms, have emerged.

New tools are being used to handle such large amounts of data in a short time. Apache Hadoop is a Java-based programming framework used for processing large data sets in a distributed computing environment. Hadoop is used in systems with multiple nodes that can together process terabytes of data. Hadoop uses its own file system, HDFS, which facilitates fast transfer of data, can sustain node failures and avoids failure of the system as a whole. Hadoop uses the MapReduce model, which breaks the big data down into smaller chunks and performs operations on them. The Hadoop framework is used by many big companies such as Google, Yahoo! and IBM for applications such as search engines, advertising, and information gathering and processing.

Various technologies work hand-in-hand to accomplish such tasks: the Spring for Apache Hadoop framework for the basic foundations and the running of MapReduce jobs, Apache Maven for building the code, REST web services for communication, and Apache Hadoop itself for distributed processing of the huge dataset.


CHAPTER 2

BIG DATA

Every day, we create 2.5 quintillion bytes of data — so much that 90% of the data

in the world today has been created in the last two years alone. This data comes from

everywhere: sensors used to gather climate information, posts to social media sites,

digital pictures and videos, purchase transaction records, and cell phone GPS signals to

name a few. This data is big data.

The data lying in the servers of a company was just data until yesterday, sorted and filed. Suddenly the term Big Data became popular, and now the data in a company is Big Data. The term covers each and every piece of data an organization has stored so far. It includes data stored in clouds and even URLs that have been bookmarked. A company might not have digitized all of its data, and it may not have structured all of it, but all the digital, paper, structured and unstructured data of the company is now Big Data.

Big data is an all-encompassing term for any collection of data sets so large and

complex that it becomes difficult to process using traditional data processing

applications. It refers to the large amounts, at least terabytes, of poly-structured data

that flows continuously through and around organizations, including video, text, sensor

logs, and transactional records. The business benefits of analyzing this data can be

significant. According to a recent study by the MIT Sloan School of Management,

organizations that use analytics are twice as likely to be top performers in their industry

as those that don’t.

Big data burst upon the scene in the first decade of the 21st century, and the first

organizations to embrace it were online and startup firms. In a nutshell, Big Data is your

data. It's the information owned by your company, obtained and processed through new

techniques to produce value in the best way possible.


Companies have sought for decades to make the best use of information to improve their business capabilities. However, it is the structure (or lack thereof) and the size of Big Data that makes it so unique. Big Data is also special because it represents both significant information, which can open new doors, and the way this information is analyzed to help open those doors. The analysis goes hand-in-hand with the information, so in this sense "Big Data" represents a noun ("the data") and a verb ("combing the data to find value"). The days of keeping company data in Microsoft Office documents on carefully organized file shares are behind us, much like the bygone era of sailing across the ocean in tiny ships. That 50 gigabyte file share in 2002 looks quite tiny compared to a modern-day 50 terabyte marketing database containing customer preferences and habits.

The world's technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, about 2.5 exabytes (2.5 x 10^18 bytes) of data were created every day. The challenge for large enterprises is determining who should own big data initiatives that straddle the entire organization.

Some of the popular organizations that hold Big Data are as follows:

• Facebook: It has 40 PB of data and captures 100 TB/day

• Yahoo!: It has 60 PB of data

• Twitter: It captures 8 TB/day

• eBay: It has 40 PB of data and captures 50 TB/day

How much data is considered Big Data differs from company to company. Though it is true that one company's Big Data is another's small data, there is something common: the data does not fit in memory or on a single disk, it arrives in a rapid influx that needs to be processed, and it would benefit from a distributed software stack. For some companies 10 TB of data would be considered Big Data, while for others it would be 1 PB. So only you can determine whether your data is really Big Data. It is sufficient to say that it starts in the low terabyte range.


ATTRIBUTES OF BIG DATA

As far back as 2001, industry analyst Doug Laney (currently with Gartner) articulated the now mainstream definition of big data as the three Vs: volume, velocity and variety.

Volume. The quantity of data that is generated is very important in this context. It is the size of the data which determines its value and potential and whether it can actually be considered Big Data or not. The name 'Big Data' itself contains a term related to size, hence the characteristic. Many factors contribute to the increase in data volume: transaction-based data stored through the years, unstructured data streaming in from social media, and increasing amounts of sensor and machine-to-machine data being collected. In the past, excessive data volume was a storage issue; but with decreasing storage costs, other issues emerge, including how to determine relevance within large data volumes and how to use analytics to create value from relevant data.

It is estimated that 2.5 quintillion bytes of data are generated every day.

40 zettabytes of data will be created by 2020, an increase of 30 times over 2005.

6 billion people around the world are using mobile phones.

Velocity. The term 'velocity' in this context refers to the speed of generation of data, or how fast the data is generated and processed to meet the demands and challenges which lie ahead on the path of growth and development. Data is streaming in at unprecedented speed and must be dealt with in a timely manner. RFID tags, sensors and smart metering are driving the need to deal with torrents of data in near-real time. Reacting quickly enough to deal with data velocity is a challenge for most organizations.

20 hours of video are uploaded every minute.

2.9 million emails are sent every second.

The New York Stock Exchange captures 1 TB of information in every trading session.

Variety. The next aspect of Big Data is its variety. This means that the category to which Big Data belongs is also a very essential fact that needs to be known by the data analysts. This helps the people who closely analyze the data and are associated with it to use it effectively to their advantage, thus upholding the importance of Big Data. Data today comes in all types of formats: structured, numeric data in traditional databases; information created from line-of-business applications; unstructured text documents, email, video, audio, stock ticker data and financial transactions. Managing, merging and governing different varieties of data is something many organizations still grapple with.

The global size of data in health care was about 150 exabytes as of 2011.

30 billion pieces of content are shared on Facebook every month.

4 billion hours of video are watched on YouTube each month.

Veracity. In addition to the increasing velocities and varieties of data, data flows can be highly inconsistent, with periodic peaks. Is something trending in social media? Daily, seasonal and event-triggered peak data loads can be challenging to manage, even more so with unstructured data involved. This is a factor which can be a problem for those who analyse the data.

Poor data quality costs the US economy around $3.1 trillion a year.

Complexity. Data management can become a very complex process, especially when large volumes of data come from multiple sources. These data need to be linked, connected and correlated in order to grasp the information that they are supposed to convey. This situation is therefore termed the 'complexity' of Big Data.

Volatility. Big data volatility refers to how long data is valid and how long it should be stored. In this world of real-time data, you need to determine at what point data is no longer relevant to the current analysis.


BIG DATA APPLICATIONS

ADVERTISING: Big data analytics helps companies like Google and other advertising companies to identify the behaviour of a person and to target ads accordingly. Big data analytics enables more personal and targeted ads.

ONLINE MARKETING: Big data analytics is used by online retailers like Amazon, eBay and Flipkart to identify their potential customers, give them offers, and vary the price of products according to trends.

HEALTH CARE: The average amount of data per hospital is expected to increase from 167 TB to 665 TB by 2015. With Big Data, medical professionals can improve patient care and reduce cost by extracting relevant clinical information.

CUSTOMER SERVICE: Service representatives can use data to gain a more holistic view of their customers, understanding their likes and dislikes in real time.

INSURANCE: An insurance or citizen service provider can apply advanced analytics on data to detect fraud quickly.

UNDERSTANDING AND OPTIMIZING BUSINESS PERFORMANCE: Big data is also

increasingly used to optimize business processes. Retailers are able to optimize their

stock based on predictions generated from social media data, web search trends and

weather forecasts. One particular business process that is seeing a lot of big data

analytics is supply chain or delivery route optimization. 

FINANCIAL TRADING: High-Frequency Trading (HFT) is an area where big data finds a lot of use today. Here, big data algorithms are used to make trading decisions. Today, the majority of equity trading takes place via algorithms that increasingly take into account signals from social media networks and news websites to make buy and sell decisions in split seconds.


IMPROVING AND OPTIMIZING CITY MANAGEMENT: Big data is used to improve

many aspects of our cities and countries. For example, it allows cities to optimize traffic

flows based on real time traffic information as well as social media and weather data. A

number of cities are currently piloting big data analytics with the aim of turning

themselves into Smart Cities, where the transport infrastructure and utility processes

are all joined up.

IMPROVING SECURITY AND LAW ENFORCEMENT: Big data is applied heavily in

improving security and enabling law enforcement. I am sure you are aware of the

revelations that the National Security Agency (NSA) in the U.S. uses big data analytics to

foil terrorist plots (and maybe spy on us). Others use big data techniques to detect and

prevent cyber attacks. 

OPTIMIZING MACHINE AND DEVICE PERFORMANCE: Big data analytics helps machines and devices become smarter and more autonomous. For example, big data

tools are used to operate Google’s self-driving car. The Toyota Prius is fitted with

cameras, GPS as well as powerful computers and sensors to safely drive on the road

without the intervention of human beings. Big data tools are also used to optimize

energy grids using data from smart meters. 

IMPROVING SCIENCE AND RESEARCH: Science and research are currently being transformed by the new possibilities big data brings. Take, for example, CERN, the European particle physics laboratory, with its Large Hadron Collider, the world's largest and most powerful particle accelerator. Experiments to unlock the secrets of our universe, how it started and how it works, generate huge amounts of data. The CERN data center has 65,000

processors to analyze its 30 petabytes of data. However, it uses the computing powers

of thousands of computers distributed across 150 data centers worldwide to analyze

the data. 

IMPROVING SPORTS PERFORMANCE: Most elite sports have now embraced big data analytics. We have the IBM SlamTracker tool for tennis tournaments; we use video analytics that track the performance of every player in a football or baseball game; and sensor technology in sports equipment such as basketballs or golf clubs allows us to get feedback (via smart phones and cloud servers) on our game and how to improve it.


CHAPTER 3

BIG DATA ANALYTICS

Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead "massively parallel software running on tens, hundreds, or even thousands of servers".

Rapidly ingesting, storing, and processing big data requires a cost effective

infrastructure that can scale with the amount of data and the scope of analysis. Most

organizations with traditional data platforms—typically relational database

management systems (RDBMS) coupled to enterprise data warehouses (EDW) using

ETL tools—find that their legacy infrastructure is either technically incapable or

financially impractical for storing and analyzing big data. A traditional ETL process

extracts data from multiple sources, then cleanses, formats, and loads it into a data

warehouse for analysis. When the source data sets are large, fast, and unstructured,

traditional ETL can become the bottleneck, because it is too complex to develop, too

expensive to operate, and takes too long to execute.

By most accounts, 80 percent of the development effort in a big data project goes into data integration and only 20 percent goes toward data analysis. Furthermore, a traditional EDW platform can cost upwards of USD 60K per terabyte; analyzing one petabyte, the amount of data Google processes in one hour, would cost USD 60M. Clearly "more of the same" is not a big data strategy that any CIO can afford, so more efficient analytics for Big Data are required.

Big Analytics delivers competitive advantage in two ways compared to the

traditional analytical model. First, Big Analytics describes the efficient use of a simple

model applied to volumes of data that would be too large for the traditional analytical

environment. Research suggests that a simple algorithm with a large volume of data is

more accurate than a sophisticated algorithm with little data. The algorithm is not the

competitive advantage; the ability to apply it to huge amounts of data—without

compromising performance—generates the competitive edge.

Second, Big Analytics refers to the sophistication of the model itself. Increasingly,

analysis algorithms are provided directly by database management system (DBMS)

vendors. To pull away from the pack, companies must go well beyond what is provided

and innovate by using newer, more sophisticated statistical analysis.

To analyze such a large volume of data, big data analytics is typically performed

using specialized software tools and applications for predictive analytics, data mining,

text mining, forecasting and data optimization. Collectively these processes are separate

but highly integrated functions of high-performance analytics. Using big data tools and

software enables an organization to process extremely large volumes of data that a

business has collected to determine which data is relevant and can be analyzed to drive

better business decisions in the future.


CHALLENGES OF BIG DATA ANALYTICS

For most organizations, big data analysis is a challenge. Consider the sheer

volume of data and the many different formats of the data

(both structured and unstructured data) collected across the entire organization and

the many different ways different types of data can be combined, contrasted and

analyzed to find patterns and other useful information. 

1. Meeting the need for speed: In today's hypercompetitive business environment, companies not only have to find and analyze the relevant data they need, they must find it quickly. Visualization helps organizations perform analyses and make decisions much more rapidly, but the challenge is going through the sheer volumes of data and accessing the level of detail needed, all at high speed. One possible solution is hardware: some vendors are using increased memory and powerful parallel processing to crunch large volumes of data extremely quickly. Another method is putting data in memory and using a grid computing approach, where many machines are used to solve a problem.

2. Understanding the data: It takes a lot of understanding to get data in the right

shape so that you can use visualization as part of data analysis. For example, if the data

comes from social media content, you need to know who the user is in a general sense –

such as a customer using a particular set of products – and understand what it is you’re

trying to visualize out of the data. One solution to this challenge is to have the proper

domain expertise in place. Make sure the people analyzing the data have a deep

understanding of where the data comes from, what audience will be consuming the data

and how that audience will interpret the information.

3. Addressing data quality: Even if you can find and analyze data quickly and put it in the proper context for the audience that will be consuming the information, the value of data for decision-making purposes will be jeopardized if the data is not accurate or timely. This is a challenge with any data analysis, but when considering the volumes of information involved in big data projects, it becomes even more pronounced. Again, data visualization will only prove to be a valuable tool if the data quality is assured. To address this issue, companies need to have a data governance or information management process in place to ensure the data is clean.

4. Displaying meaningful results: Plotting points on a graph for analysis becomes

difficult when dealing with extremely large amounts of information or a variety of

categories of information. For example, imagine you have 10 billion rows of retail SKU

data that you’re trying to compare. The user trying to view 10 billion plots on the screen

will have a hard time seeing so many data points. One way to resolve this is to cluster

data into a higher-level view where smaller groups of data become visible. By grouping

the data together, or “binning,” you can more effectively visualize the data.

5. Dealing with outliers: The graphical representations of data made possible by

visualization can communicate trends and outliers much faster than tables containing

numbers and text. Users can easily spot issues that need attention simply by glancing at

a chart. Outliers typically represent about 1 to 5 percent of data, but when you’re

working with massive amounts of data, viewing 1 to 5 percent of the data is rather

difficult. How do you represent those points without getting into plotting issues?

Possible solutions are to remove the outliers from the data (and therefore from the

chart) or to create a separate chart for the outliers.

The availability of new in-memory technology and high-performance analytics

that use data visualization is providing a better way to analyze data more quickly than

ever. Visual analytics enables organizations to take raw data and present it in a

meaningful way that generates the most value. Nevertheless, when used with big data,

visualization is bound to lead to some challenges. If you’re prepared to deal with these

hurdles, the opportunity for success with a data visualization strategy is much greater.


OBJECTIVES OF BIG DATA ANALYTICS

1. Cost Reduction from Big Data Technologies: Some organizations pursuing big data

believe strongly that MIPS and terabyte storage for structured data are now most

cheaply delivered through big data technologies like Hadoop clusters. Organizations

that were focused on cost reduction made the decision to adopt big data tools primarily

within the IT organization on largely technical and economic criteria.

2. Time Reduction from Big Data: The second common objective of big data technologies and solutions is time reduction. Many companies make use of big data analytics to generate real-time decisions and save a lot of time. Another key objective

involving time reduction is to be able to interact with the customer in real time, using

analytics and data derived from the customer experience.

3. Developing New Big Data-Based Offerings: One of the most ambitious things an

organization can do with big data is to employ it in developing new product and service

offerings based on data. Many of the companies that employ this approach are online

firms, which have an obvious need to employ data-based products and services.

4. Supporting Internal Business Decisions: The primary purpose behind traditional,

“small data” analytics was to support internal business decisions. What offers should be

presented to a customer? Which customers are most likely to stop being customers

soon? How much inventory should be held in the warehouse? How should we price our

products?

These types of decisions employ big data when there are new, less structured data

sources that can be applied to the decision. For example, any data that can shed light on

customer satisfaction is helpful, and much data from customer interactions is

unstructured.


CHAPTER 4

APACHE HADOOP

Doug Cutting and Mike Cafarella helped create Apache Hadoop in 2005 out of necessity, as data from the web exploded and grew far beyond the ability of traditional systems to handle it. Hadoop was initially inspired by papers published by

Google outlining its approach to handling an avalanche of data, and has since become

the de facto standard for storing, processing and analyzing hundreds of terabytes, and

even petabytes of data.

Apache Hadoop is an open source distributed software platform for storing and

processing data. Written in Java, it runs on a cluster of industry-standard servers

configured with direct-attached storage. Using Hadoop, you can store petabytes of data

reliably on tens of thousands of servers while scaling performance cost-effectively by

merely adding inexpensive nodes to the cluster.

Apache Hadoop is 100% open source, and pioneered a fundamentally new way

of storing and processing data. Instead of relying on expensive, proprietary hardware

and different systems to store and process data, Hadoop enables distributed parallel

processing of huge amounts of data across inexpensive, industry-standard servers that

both store and process the data, and can scale without limits. With Hadoop, no data is

too big. And in today’s hyper-connected world where more and more data is being

created every day, Hadoop’s breakthrough advantages mean that businesses and

organizations can now find value in data that was recently considered useless.

Reveal Insight From All Types of Data, From All Types of Systems

Hadoop can handle all types of data from disparate systems: structured, unstructured, log files, pictures, audio files, communications records, email, just about anything you can think of, regardless of its native format. Even when different types of data have been stored in unrelated systems, one can dump it all into a Hadoop cluster with no prior need for a schema. In other words, we do not need to know how we intend to query the data before storing it; Hadoop lets one decide later, and over time it can reveal answers to questions that one never even thought to ask.

By making all data usable, not just what is in databases, Hadoop lets one see relationships that were hidden before and reveal answers that have always been just out of reach. One can start making more decisions based on hard data instead of hunches and look at complete data sets, not just samples.

Redefine the Economics of Data: Keep Everything, Forever, Online

In addition, Hadoop’s cost advantages over legacy systems redefine the

economics of data. Legacy systems, while fine for certain workloads, simply were not

engineered with the needs of Big Data in mind and are far too expensive to be used for

general purpose with today's largest data sets.

One of the cost advantages of Hadoop is that, because it relies on an internally redundant data structure and is deployed on industry-standard servers rather than expensive specialized data storage systems, you can afford to store data that was not previously viable to keep online. And we all know that once data is on tape, it is essentially the same as if it had been deleted, accessible only in extreme circumstances.

Hadoop consists of the Hadoop Common package, which provides filesystem and OS-level abstractions, a MapReduce engine (either MapReduce/MR1 or YARN/MR2) and the Hadoop Distributed File System (HDFS). The Hadoop Common package contains the necessary Java Archive (JAR) files and scripts needed to start Hadoop.

For effective scheduling of work, every Hadoop-compatible file system should

provide location awareness: the name of the rack (more precisely, of the network

switch) where a worker node is. Hadoop applications can use this information to run

work on the node where the data is, and, failing that, on the same rack/switch, reducing

backbone traffic. HDFS uses this method when replicating data to try to keep different

copies of the data on different racks. The goal is to reduce the impact of a rack power

outage or switch failure, so that even if these events occur, the data may still be

readable.


HDFS: Self-healing, high-bandwidth, clustered storage.

MapReduce: Distributed, fault-tolerant resource management, coupled with scalable data processing.

YARN: YARN is the architectural center of Hadoop that allows multiple data processing

engines such as interactive SQL, real-time streaming, data science and batch processing

to handle data stored in a single platform, unlocking an entirely new approach to

analytics. It is the foundation of the new generation of Hadoop and is enabling

organizations everywhere to realize a modern data architecture.

YARN is the prerequisite for Enterprise Hadoop, providing resource management and a central platform to deliver consistent operations, security, and data governance tools across Hadoop clusters.

YARN also extends the power of Hadoop to incumbent and new technologies found within the data center so that they too can take advantage of cost-effective, linear-scale storage and processing. It provides ISVs and developers a consistent framework for writing data access applications that run in Hadoop.


CHAPTER 5

HADOOP DISTRIBUTED FILE SYSTEM (HDFS)

HDFS is Hadoop's own rack-aware filesystem, a UNIX-based data storage layer of Hadoop derived from concepts of the Google File System. An

important characteristic of Hadoop is the partitioning of data and computation across

many (thousands of) hosts, and the execution of application computations in parallel,

close to their data. On HDFS, data files are replicated as sequences of blocks in the

cluster. A Hadoop cluster scales computation capacity, storage capacity, and I/O

bandwidth by simply adding commodity servers. HDFS can be accessed from

applications in many different ways. Natively, HDFS provides a Java API for applications

to use.
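As an illustration of that Java API, the following is a minimal sketch that reads a file back out of HDFS and copies it to standard output. It assumes a Hadoop 2.x client on the classpath; the NameNode address hdfs://namenode:8020 and the path /user/demo/input.txt are placeholder values used only for this example.

import java.io.InputStream;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this is usually
        // picked up from core-site.xml on the classpath
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        InputStream in = null;
        try {
            // The NameNode resolves the path to block locations and the
            // client then streams the block contents from the DataNodes
            in = fs.open(new Path("/user/demo/input.txt"));
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            IOUtils.closeStream(in);
            fs.close();
        }
    }
}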

The Hadoop clusters at Yahoo! span 40,000 servers and store 40 petabytes of

application data, with the largest Hadoop cluster being 4,000 servers. Also, one hundred

other organizations worldwide are known to use Hadoop.

HDFS was designed to be a scalable, fault-tolerant, distributed storage system

that works closely with MapReduce. HDFS will “just work” under a variety of physical

and systemic circumstances. By distributing storage and computation across many

servers, the combined storage resource can grow with demand while remaining

economical at every size.

HDFS supports parallel reading and writing and is optimized for streaming reads and writes; the bandwidth scales linearly with the number of nodes. HDFS provides a block replication factor, which is normally 3, i.e. every block is replicated three times on different nodes. This helps to achieve higher fault tolerance.

These specific features ensure that Hadoop clusters are highly functional and highly available:


• Rack awareness allows consideration of a node's physical location when allocating storage and scheduling tasks.

• Minimal data motion: MapReduce moves compute processes to the data on HDFS and not the other way around. Processing tasks can occur on the physical node where the data resides. This significantly reduces the network I/O patterns, keeps most of the I/O on the local disk or within the same rack, and provides very high aggregate read/write bandwidth.

• Utilities diagnose the health of the file system and can rebalance the data across different nodes.

• Rollback allows system operators to bring back the previous version of HDFS after an upgrade, in case of human or system errors.

• A standby NameNode provides redundancy and supports high availability.

• High operability: Hadoop handles the different kinds of failures in a cluster that might otherwise require operator intervention. This design allows a single operator to maintain a cluster of thousands of nodes.

ARCHITECTURE

HDFS stores large files by dividing them into blocks (usually 64 or 128 MB) and replicating the blocks on three or more servers. Data is organized into files and directories.

HDFS has a master/slave architecture. An HDFS cluster consists of a single NameNode, a master server that manages the file system namespace and regulates access to files by clients. In addition, there are a number of DataNodes, usually one per

node in the cluster, which manage storage attached to the nodes that they run on. HDFS

exposes a file system namespace and allows user data to be stored in files. Internally, a

file is split into one or more blocks and these blocks are stored in a set of DataNodes.

NameNode: The NameNode executes file system namespace operations such as opening, closing, and renaming files and directories. It also determines the mapping of blocks to DataNodes. The NameNode actively monitors the number of replicas of a block. When a replica of a

block is lost due to a DataNode failure or disk failure, the NameNode creates another


replica of the block. The NameNode maintains the namespace tree and the mapping of

blocks to DataNodes, holding the entire namespace image in RAM.

DataNodes: These are responsible for serving read and write requests from the file system's

clients. The DataNodes also perform block creation, deletion, and replication upon

instruction from the NameNode.

The NameNode and DataNode are pieces of software designed to run on commodity

machines. These machines typically run a GNU/Linux operating system (OS). HDFS is

built using the Java language; any machine that supports Java can run the NameNode or

the DataNode software. Usage of the highly portable Java language means that HDFS can

be deployed on a wide range of machines. A typical deployment has a dedicated

machine that runs only the NameNode software. Each of the other machines in the

cluster runs one instance of the DataNode software. HDFS provides automatic block replication if any node fails. Unlike RAID systems, a failed disk or node need not be repaired immediately; typically, repair is done periodically for a collection of failures, which makes it more efficient.
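To make the NameNode's block-to-DataNode mapping described above concrete, the sketch below asks for the block locations of a file through the same Java API. The path /user/demo/big-file.dat is a placeholder; the hosts printed for each block depend on the cluster's replication factor and topology.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        // Configuration is read from core-site.xml / hdfs-site.xml on the classpath
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus status = fs.getFileStatus(new Path("/user/demo/big-file.dat"));

        // The NameNode reports which DataNodes hold a replica of each block
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.printf("offset %d, length %d, replicas: %s%n",
                    block.getOffset(), block.getLength(),
                    String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}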


CHAPTER 6

HADOOP MAP REDUCE

Central to the scalability of Apache Hadoop is the distributed processing

framework known as MapReduce. MapReduce helps programmers solve data-parallel

problems for which the data set can be sub-divided into small parts and processed

independently. MapReduce is an important advance because it allows ordinary

developers, not just those skilled in high-performance computing, to use parallel

programming constructs without worrying about the complex details of intra-cluster

communication, task monitoring, and failure handling. MapReduce simplifies all that.

MapReduce is a programming model for processing large datasets distributed on

a large cluster. MapReduce is the heart of Hadoop. Its programming paradigm allows

performing massive data processing across thousands of servers configured with

Hadoop clusters. It is derived from Google's MapReduce. Hadoop MapReduce is a software framework for easily writing applications which process large amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.

ARCHITECTURE

MapReduce is also implemented over a master-slave architecture. Classic MapReduce covers job submission, job initialization, task assignment, task execution, progress and status updates, and job completion-related activities, which are mainly managed by the JobTracker node and executed by TaskTrackers. A client application submits a job to the JobTracker. The input is then divided across the cluster, and the JobTracker calculates the number of map and reduce tasks to be processed. It commands the TaskTrackers to start executing the job. Each TaskTracker copies the required resources to its local machine and launches a JVM to run the map and reduce programs over the data. Along with this, the TaskTracker periodically sends updates to the JobTracker; these can be considered heartbeats that report the job ID, job status, and usage of resources. A minimal job-submission sketch in Java follows the list below.

• JobTracker: This is the master node of the MapReduce system, which manages the jobs and resources in the cluster (the TaskTrackers). The JobTracker tries to schedule each map task as close as possible to the actual data being processed, ideally on a TaskTracker that runs on the same DataNode as the underlying block.

• TaskTracker: These are the slaves deployed on each machine. They are responsible for running the map and reduce tasks as instructed by the JobTracker.
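From the client side, the job submission described above boils down to a small driver program. The following sketch uses the classic org.apache.hadoop.mapred API that matches the JobTracker/TaskTracker model; it submits a pass-through job built from the library's IdentityMapper and IdentityReducer, with the input and output paths taken from the command line. Real applications would substitute their own mapper and reducer classes.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;

public class ClassicJobSubmission {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(ClassicJobSubmission.class);
        conf.setJobName("identity pass-through");

        // IdentityMapper and IdentityReducer simply forward their records;
        // real jobs plug in their own implementations here
        conf.setMapperClass(IdentityMapper.class);
        conf.setReducerClass(IdentityReducer.class);

        // With the default TextInputFormat the map output is (LongWritable, Text)
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Submits the job to the JobTracker and waits until it completes
        JobClient.runJob(conf);
    }
}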


MAP REDUCE PARADIGM

As noted above, Hadoop MapReduce processes large amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner. The paradigm is divided into two phases, Map and Reduce, which mainly deal with key and value pairs of data. The Map and Reduce phases run sequentially in a cluster; the output of the Map phase becomes the input of the Reduce phase. These phases are explained as follows:

• Map phase: Once the input is divided, the resulting splits are assigned to TaskTrackers to perform the Map phase. The map function is applied to the data, emitting key and value pairs as the output of the Map phase.

• Reduce phase: The master node then collects the answers to all the subproblems and combines them to form the output, the answer to the problem it was originally trying to solve.


The five common steps of this parallel computing model are as follows:

1. Prepare the Map() input: the input data is read row by row and a key/value pair is emitted per row, or the pairing can be changed explicitly as per the requirement.
   Map input: list (k1, v1)

2. Run the user-provided Map() code.
   Map output: list (k2, v2)

3. Shuffle the Map output to the Reduce processors, grouping identical keys together and sending them to the same reducer.

4. Run the user-provided Reduce() code: this phase runs the custom reducer code designed by the developer over the shuffled data and emits the final key and value.
   Reduce input: (k2, list(v2))
   Reduce output: (k3, v3)

5. Produce the final output: finally, the master node collects all the reducer output, combines it and writes it to a text file.

EXAMPLE
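Word count is the canonical illustration of this paradigm: for an input line "hello world hello", the map phase emits (hello, 1), (world, 1), (hello, 1), and after the shuffle the reduce phase emits (hello, 2) and (world, 1). Below is a condensed sketch of such a job written against the org.apache.hadoop.mapreduce API; it assumes Hadoop 2.x client libraries on the classpath and is intended only as an illustration of the steps above.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: emit (word, 1) for every word in the input line
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: sum the counts shuffled to each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    // Driver: wires the mapper and reducer together and submits the job
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar (the name wordcount.jar here is only a placeholder), the job can be launched with: hadoop jar wordcount.jar WordCount <input dir> <output dir>. The output directory must not already exist.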


COMPARING RDBMS AND MAP-REDUCE

                 Traditional RDBMS              MapReduce
Data Size        Gigabytes (Terabytes)          Petabytes (Exabytes)
Data Format      Relational tables              Key/value pairs
Access           Interactive and batch          Batch
Updates          Read and write many times      Write once, read many times
Structure        Static schema                  Dynamic schema
Integrity        High (ACID)                    Low
Scaling          Non-linear                     Linear
Processing       Online transactions            Offline batch processing
Querying         Declarative queries            Functional programming


CHAPTER 7

APACHE HADOOP ECOSYSTEM

The Apache Hadoop ecosystem consists of many other components apart from Hadoop HDFS and Hadoop MapReduce. They include:

1. Apache Pig (scripting platform): Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs. Pig's language layer currently consists of a textual language called Pig Latin, which is easy to use, optimized, and extensible.

2. Apache Hive: Hive is a data warehouse system for Hadoop that facilitates easy data

summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop

compatible file systems. It provides a mechanism to project structure onto this data and

query the data using a SQL-like language called HiveQL. Hive also allows traditional

map/reduce programmers to plug in their custom mappers and reducers when it is

inconvenient or inefficient to express this logic in HiveQL.

3. Apache HBase: HBase (Hadoop Database) is a distributed, column-oriented database. HBase uses HDFS for its underlying storage. It supports both batch-style computations using MapReduce and point queries (random reads).

The main components of HBase are described below:

• HBase Master is responsible for negotiating load balancing across all RegionServers and maintaining the state of the cluster. It is not part of the actual data storage or retrieval path.

• RegionServer is deployed on each machine; it hosts data and processes I/O requests.


4. Apache Mahout: Mahout is a scalable machine learning library that implements many different approaches to machine learning. The project currently contains implementations of algorithms for classification, clustering, frequent itemset mining, genetic programming and collaborative filtering. Mahout scales to reasonably large data sets by leveraging algorithm properties or by implementing versions based on Apache Hadoop.

5. Apache HCatalog: HCatalog is a metadata abstraction layer for referencing data without using the underlying filenames or formats. It insulates users and scripts from how and where the data is physically stored. WebHCat provides a REST-like web API for

HCatalog and related Hadoop components. Application developers make HTTP requests

to access the Hadoop MapReduce, Pig, Hive, and HCatalog DDL from within the

applications. Data and code used by WebHCat is maintained in HDFS. HCatalog DDL

commands are executed directly when requested. MapReduce, Pig, and Hive jobs are

placed in queue by WebHCat and can be monitored for progress or stopped as required.

Developers also specify a location in HDFS into which WebHCat should place Pig, Hive,

and MapReduce results.

6. Apache ZooKeeper: ZooKeeper is a centralized service for maintaining

configuration information, naming, providing distributed synchronization, and

providing group services which are very useful for a variety of distributed systems.

HBase is not operational without ZooKeeper.

7. Apache Flume: Flume is a top-level project at the Apache Software Foundation.

While it can function as a general purpose event queue manager, in the context of

Hadoop it is most often used as a log aggregator, collecting log data from many diverse

sources and moving them to a centralized data store.


CONCLUSION

Big data is considered the next big thing in the world of information technology. The use of Big Data is becoming a crucial way for leading companies to

outperform their peers. Big Data will help to create new growth opportunities and

entirely new categories of companies, such as those that aggregate and analyse industry

data. Sophisticated analytics of big data can substantially improve decision-making,

minimise risks, and unearth valuable insights that would otherwise remain hidden.

Big data analytics is the process of examining big data to uncover hidden

patterns, unknown correlations and other useful information that can be used to make

better decisions. With big data analytics, data scientists and others can analyze huge

volumes of data that conventional analytics and business intelligence solutions can't

touch.

Apache Hadoop is a popular open source big data analytics tool. Using Hadoop,

we can store petabytes of data reliably on tens of thousands of servers while scaling

performance cost-effectively by merely adding inexpensive nodes to the cluster.

Instead of relying on expensive, proprietary hardware and different systems to store

and process data, Hadoop enables distributed parallel processing of huge amounts of

data across inexpensive, industry-standard servers that both store and process the data,

and can scale without limits. 


BIBLIOGRAPHY

[1] J. Nandimath, E. Banerjee, A. Patil, P. Kakade and S. Vaidya, "Big Data Analysis Using Apache Hadoop," in 2013 IEEE 14th International Conference on Information Reuse and Integration (IRI).

[2] R. Grover and M. J. Carey, "Extending Map-Reduce for Efficient Predicate-Based Sampling," in 2012 IEEE 28th International Conference on Data Engineering (ICDE).

[3] K. Singh and R. Kaur, "Hadoop: Addressing Challenges of Big Data," in 2014 IEEE International Advance Computing Conference (IACC).

[4] P. Chandarana and M. Vijayalakshmi, "Big Data Analytics Frameworks," in 2014 International Conference on Circuits, Systems, Communication and Information Technology Applications (CSCITA).

[5] "MapReduce: Simplified Data Processing on Large Clusters." Available at http://labs.google.com/papers/mapreduceosdi04.pdf

[6] Jason Venner, "Pro Hadoop: Build Scalable Distributed Applications in the Cloud."

[7] Tom White, "Hadoop: The Definitive Guide," O'Reilly.
