
Page 1: Hadoop(Big Data)

HADOOP(BIG DATA) INTRODUCTION

07/03/2014

ASML-OBIEE

BABY MANI DEEPA S

Technology [email protected]

Page 2: Hadoop(Big Data)


Confidentiality Statement

Confidentiality and Non-Disclosure Notice: The information contained in this document is confidential and proprietary to TATA Consultancy Services. This information may not be disclosed, duplicated or used for any other purposes. The information contained in this document may not be released in whole or in part outside TCS for any purpose without the express written permission of TATA Consultancy Services.

Tata Code of Conduct

We, in our dealings, are self-regulated by a Code of Conduct as enshrined in the Tata Code of Conduct. We request your support in helping us adhere to the Code in letter and spirit. We request that any violation or potential violation of the Code by any person be promptly brought to the notice of the Local Ethics Counselor or the Principal Ethics Counselor or the CEO of TCS. All communication received in this regard will be treated and kept as confidential.

Page 3: Hadoop(Big Data)

Table of Contents

1. Introduction ................................................................ 4
   1.1 Hadoop Cluster .......................................................... 4
2. Big data analytics .......................................................... 5
   2.1 Predictive analytics .................................................... 5
   2.2 Data mining ............................................................. 5
   2.3 NoSQL database .......................................................... 5
   2.4 MapReduce framework ..................................................... 6
3. Apache Hadoop framework ..................................................... 7
   3.1 Hadoop Common ........................................................... 7
   3.2 Hadoop Distributed File System (HDFS) .................................. 7
   3.3 Hadoop YARN (Yet Another Resource Negotiator) ......................... 10
   3.4 Hadoop MapReduce ....................................................... 11
4. Pig and Hive ............................................................... 12
5. Hadoop Uniqueness .......................................................... 13

Page 4: Hadoop(Big Data)

1. Introduction

Big data is the term for collections of data sets so large and complex that they become difficult to process using on-hand database management tools or traditional data processing applications. When the volume of data becomes enormous, we turn to big data techniques, since it is practically impossible to handle such volumes with a traditional RDBMS.

1.1 Hadoop Cluster

Hadoop is a software framework that supports data-intensive processes and enables applications to work with big data. Hadoop is based on the MapReduce model. Two well-known companies that use Hadoop to process their large data sets are Facebook and Yahoo. The Hadoop platform can solve problems where the required analysis is deep, complex, and over unstructured data, but still needs to be done in a reasonable time. Apache Hadoop is an open-source software framework. Hadoop maintains and manages the data across many independent servers. An individual user cannot directly gain access to the data, because the data is divided among these servers. Additionally, a single piece of data can be replicated on multiple servers, which keeps the data available in case of a disaster or a single machine failure.

Page 5: Hadoop(Big Data)

2. Big data analytics

Big data analytics is the process of examining large volumes of many types of data to uncover hidden patterns; this provides competitive advantages over rival organizations and results in business benefits. The primary goal of big data analytics is to help companies make better business decisions by enabling data scientists and other users to analyze huge volumes of transaction data, as well as other data sources that may be left untapped by conventional business intelligence (BI) programs.

Big data analytics can be done with the software tools commonly used as part of advanced analytics disciplines such as:

• Predictive analytics
• Data mining
• NoSQL databases
• MapReduce framework

2.1 Predictive analytics

Predictive analytics is the branch of data mining concerned with the prediction of future probabilities and trends. The central element of predictive analytics is the predictor, a variable that can be measured for an individual or other entity to predict future behaviour. For example, an insurance company is likely to take into account potential driving safety predictors such as age, gender, and driving record when issuing car insurance policies.

2.2 Data mining

Data mining is sorting through data to identify patterns and establish relationships. Data mining parameters include:

• Association - looking for patterns where one event is connected to another event.
• Sequence or path analysis - looking for patterns where one event leads to another, later event.
• Classification - looking for new patterns (this may result in a change in the way the data is organized).
• Clustering - finding and visually documenting groups of facts not previously known.
• Forecasting - discovering patterns in data that can lead to reasonable predictions about the future (this area of data mining is known as predictive analytics).

2.3 NoSQL database

A NoSQL ("Not Only SQL") database seeks to solve the scalability and big data performance issues that relational databases weren't designed to address. NoSQL is especially useful when an enterprise needs to access and analyse massive amounts of unstructured data, or data that is stored remotely on multiple virtual servers in the cloud.

Page 6: Hadoop(Big Data)


2.4 MapReduce framework

The framework is divided into two parts:

• Map - a function that parcels out work to different nodes in the distributed cluster.
• Reduce - a function that collates the work and resolves the results into a single value.

The MapReduce framework is fault-tolerant because each node in the cluster is expected to report back periodically with completed work and status updates. If a node remains silent for longer than the expected interval, a master node takes note and reassigns the work to other nodes. The model is conceptually simple but very powerful when combined with the Hadoop framework. There are two major steps:

• Map - the master node takes the input, divides it into smaller chunks, and hands them out to worker nodes.
• Reduce - the partial results from the workers are collected and combined into one unified answer, which is returned as the output.

Both of these steps use functions that operate on key-value pairs. Because this process runs on many nodes in parallel, the framework produces results quickly.
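To make the key-value flow concrete, here is a sketch of the two steps written against Hadoop's Java MapReduce API (the org.apache.hadoop.mapreduce package), using the classic word-count problem. The class names TokenizerMapper and IntSumReducer are illustrative, not part of Hadoop itself; the driver that wires them into a job is sketched later in section 3.4.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map step: each map task receives a chunk of the input and emits
  // an intermediate (word, 1) pair for every word it sees.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        word.set(tokens.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce step: the framework groups the intermediate pairs by key, so
  // each reduce call sees one word together with all of its counts;
  // summing them gives the unified answer for that word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  // The driver that configures and submits the job is sketched in
  // section 3.4 (Hadoop MapReduce).
}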

Page 7: Hadoop(Big Data)


3. Apache Hadoop framework

The Apache Hadoop framework is composed of the following modules:

• Hadoop Common - contains libraries and utilities needed by other Hadoop modules.
• Hadoop Distributed File System (HDFS) - a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
• Hadoop YARN (Yet Another Resource Negotiator) - a resource-management platform responsible for managing compute resources in clusters and using them to schedule users' applications.
• Hadoop MapReduce - a programming model for large-scale data processing.

3.1 Hadoop Common

The Hadoop Common package provides file system and OS-level abstractions, a MapReduce engine, the necessary Java Archive (JAR) files and scripts needed to start Hadoop, source code, documentation, and a contribution section that includes projects from the Hadoop community.

3.2 Hadoop Distributed File System (HDFS)

HDFS is composed of interconnected clusters of nodes where files and directories reside. An HDFS cluster consists of:

• A single node, known as the NameNode, which manages file system namespace operations such as opening, closing, and renaming files and directories, and which regulates client access to files. The NameNode also maps data blocks to DataNodes, which handle read and write requests from HDFS clients. The current design has a single NameNode for each cluster, and the NameNode keeps the entire namespace image in RAM. The HDFS namespace is a hierarchy of files and directories, represented on the NameNode by inodes. Inodes record attributes such as permissions, modification and access times, and namespace and disk space quotas. Image: the inodes and the list of blocks that define the metadata of the name system are called the image. Journal: each client-initiated transaction is recorded in the journal.

• DataNodes - a cluster can have thousands of DataNodes and tens of thousands of HDFS clients, since each DataNode may execute multiple application tasks concurrently. DataNodes store data as blocks within files. The file content is split into large blocks (typically 128 megabytes), and each block of the file is independently replicated at multiple DataNodes. The blocks are stored on the local file system of the DataNodes. Each block replica on a DataNode is represented by two files in the local native file system: the first contains the data itself, and the second records the block's metadata, including checksums for the data and the generation stamp. The size of the data file equals the actual length of the block and does not need extra space to round it up to the nominal block size, as in traditional file systems; thus, if a block is half full it needs only half the space of a full block on the local drive.

Page 8: Hadoop(Big Data)


Handshake: During startup each DataNode connects to the NameNode and performs a handshake. The purpose of the handshake is to verify the namespace ID and the software version of the DataNode. If either does not match that of the NameNode, the DataNode automatically shuts down.

Namespace ID: The namespace ID is assigned to the file system instance when it is formatted, and it is persistently stored on all nodes of the cluster. Nodes with a different namespace ID are not able to join the cluster, thus protecting the integrity of the file system. A DataNode that is newly initialized and has no namespace ID is permitted to join the cluster and receives the cluster's namespace ID.

Storage ID: After the handshake the DataNode registers with the NameNode. DataNodes persistently store their unique storage IDs. The storage ID is an internal identifier of the DataNode, which makes it recognizable even if it is restarted with a different IP address or port. The storage ID is assigned to the DataNode when it registers with the NameNode for the first time and never changes after that. DataNodes also create, delete, and replicate data blocks according to instructions from the governing NameNode. The NameNode actively monitors the number of replicas of each block; when a replica is lost due to a DataNode failure or disk failure, the NameNode creates another replica of the block. The NameNode maintains the namespace tree and the mapping of blocks to DataNodes, holding the entire namespace image in RAM. The NameNode is the repository for all HDFS metadata, and user data never flows through the NameNode.

Heartbeats: A DataNode identifies the block replicas in its possession to the NameNode by sending a block report. A block report contains the block ID, the generation stamp and the length of each block replica the server hosts. The first block report is sent immediately after DataNode registration. Subsequent block reports are sent every hour and provide the NameNode with an up-to-date view of where block replicas are located in the cluster. During normal operation DataNodes send heartbeats to the NameNode to confirm that the DataNode is operating and that the block replicas it hosts are available. The default heartbeat interval is three seconds. If the NameNode does not receive a heartbeat from a DataNode within ten minutes, the NameNode considers the DataNode to be out of service and the block replicas hosted by that DataNode to be unavailable. The NameNode then schedules creation of new replicas of those blocks on other DataNodes. Heartbeats from a DataNode also carry information about total storage capacity, the fraction of storage in use, and the number of data transfers currently in progress; these statistics are used for the NameNode's block allocation and load-balancing decisions. The NameNode does not directly send requests to DataNodes; it sends instructions to the DataNodes by replying to their heartbeats. The instructions include commands to replicate blocks to other nodes, remove local block replicas, re-register and send an immediate block report, or shut down the node.

These features ensure that Hadoop clusters are highly functional and highly available:

• Rack awareness allows a node's physical location to be considered when allocating storage and scheduling tasks.
• Minimal data motion - MapReduce moves compute processes to the data on HDFS, not the other way around. Processing tasks can occur on the physical node where the data resides, which significantly reduces network I/O, keeps most of the I/O on the local disk or within the same rack, and provides very high aggregate read/write bandwidth.
• Utilities diagnose the health of the file system and can rebalance the data across nodes.

Page 9: Hadoop(Big Data)


• Rollback allows system operators to bring back the previous version of HDFS after an upgrade, in case of human or system errors.
• A standby NameNode provides redundancy and supports high availability.
• Highly operable - Hadoop handles different types of cluster failures that might otherwise require operator intervention. This design allows a single operator to maintain a cluster of thousands of nodes.

HDFS has demonstrated scalability of up to 200 PB of storage and a single cluster of 4,500 servers, supporting close to a billion files and blocks.

The NameNode is a multithreaded system and processes requests simultaneously from multiple clients. Saving a transaction to disk can become a bottleneck, since all other threads must wait until the synchronous flush-and-sync procedure initiated by one of them completes. To optimize this process, the NameNode batches multiple transactions: when one of the NameNode's threads initiates a flush-and-sync operation, all the transactions batched at that time are committed together. The remaining threads only need to check that their transactions have been saved and do not need to initiate a flush-and-sync operation of their own.

Secondary NameNode: The HDFS file system includes a so-called secondary NameNode, a name that misleads some people into thinking that when the primary NameNode goes offline, the secondary NameNode takes over. In fact, the secondary NameNode regularly connects to the primary NameNode and builds snapshots of the primary NameNode's directory information, which the system then saves to local or remote directories. These checkpointed images can be used to restart a failed primary NameNode without having to replay the entire journal of file system actions and then edit the log to create an up-to-date directory structure. Because the NameNode is the single point for storage and management of metadata, it can become a bottleneck for supporting a huge number of files, especially a large number of small files. HDFS Federation, a newer addition, aims to tackle this problem to a certain extent by allowing multiple NameNodes, each managing an independent portion of the file system namespace.

File reads and writes: An application adds data to HDFS by creating a new file and writing the data to it. After the file is closed, the bytes written cannot be altered or removed, except that new data can be added to the file by reopening it for append. HDFS implements a single-writer, multiple-reader model. The HDFS client that opens a file for writing is granted a lease for the file; no other client can write to the file. The writing client periodically renews the lease by sending a heartbeat to the NameNode. When the file is closed, the lease is revoked. The lease duration is bound by a soft limit and a hard limit. Until the soft limit expires, the writer is certain of exclusive access to the file. If the soft limit expires and the client fails to close the file or renew the lease, another client can preempt the lease. If the hard limit (one hour) expires and the client has failed to renew the lease, HDFS assumes that the client has quit, automatically closes the file on behalf of the writer, and recovers the lease. The writer's lease does not prevent other clients from reading the file; a file may have many concurrent readers.
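As a rough illustration of this single-writer, append-only model, the sketch below uses the HDFS Java client API (org.apache.hadoop.fs.FileSystem) to create a file, reopen it for append, and read it back. The class name and the path are illustrative only, and the append call assumes the cluster has append support enabled.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    // Reads core-site.xml / hdfs-site.xml from the classpath, including
    // the NameNode address (fs.defaultFS).
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    Path file = new Path("/user/demo/events.log");   // illustrative path

    // Create the file; this client now holds the write lease, so no other
    // client can write to the file until it is closed.
    try (FSDataOutputStream out = fs.create(file, /* overwrite */ true)) {
      out.write("first record\n".getBytes(StandardCharsets.UTF_8));
    }

    // After close, the written bytes cannot be altered, but the file can
    // be reopened for append to add new data at the end.
    try (FSDataOutputStream out = fs.append(file)) {
      out.write("second record\n".getBytes(StandardCharsets.UTF_8));
    }

    // The writer's lease does not block readers; a file may have many
    // concurrent readers.
    try (FSDataInputStream in = fs.open(file);
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        System.out.println(line);
      }
    }

    fs.close();
  }
}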

Page 10: Hadoop(Big Data)


Replication Management: The NameNode endeavors to ensure that each block always has the intended number of replicas. The NameNode detects that a block has become under- or over-replicated when a block report from a DataNode arrives. When a block becomes over-replicated, the NameNode chooses a replica to remove; it prefers to remove a replica from the DataNode with the least amount of available disk space. The goal is to balance storage utilization across DataNodes without reducing the block's availability. When a block becomes under-replicated, it is put in the replication priority queue. A block with only one replica has the highest priority, while a block with a number of replicas greater than two-thirds of its replication factor has the lowest priority.

Checkpoint: The persistent record of the image stored in the NameNode's local native file system is called a checkpoint.

CheckpointNode and BackupNode: The NameNode in HDFS, in addition to its primary role of serving client requests, can alternatively execute one of two other roles, either a CheckpointNode or a BackupNode; the role is specified at node startup. The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. A more recently introduced feature of HDFS is the BackupNode. Like a CheckpointNode, the BackupNode is capable of creating periodic checkpoints, but in addition it maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode. The BackupNode accepts the journal stream of namespace transactions from the active NameNode, saves them in a journal on its own storage directories, and applies these transactions to its own namespace image in memory. The NameNode treats the BackupNode as a journal store in the same way as it treats journal files in its storage directories. If the NameNode fails, the BackupNode's image in memory and the checkpoint on disk are a record of the latest namespace state.

3.3 Hadoop YARN (Yet Another Resource Negotiator)

The fundamental idea of YARN is to split the two major responsibilities of the MapReduce JobTracker, resource management and job scheduling/monitoring, into separate daemons: a global ResourceManager and a per-application ApplicationMaster (AM). The ResourceManager and the per-node slave, the NodeManager (NM), form the new, generic system for managing applications in a distributed manner. The main components of YARN are:

• ResourceManager - the ultimate authority that arbitrates resources among all the applications in the system. The per-application ApplicationMaster is, in effect, a framework-specific entity tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the component tasks.
• NodeManager - YARN's per-node agent, which takes care of an individual compute node in a Hadoop cluster. This includes keeping up to date with the ResourceManager (RM), overseeing the life-cycle management of containers, monitoring the resource usage (memory, CPU) of individual containers, tracking node health, managing logs, and running auxiliary services that may be exploited by different YARN applications.

Page 11: Hadoop(Big Data)

MapReduce is a computational paradigm in which an application is divided into self-contained units of work. Each of these units of work can be executed on any node in the cluster.

3.4 Hadoop MapReduce

A MapReduce job splits the input data set into independent "chunks" that are processed by map tasks in parallel. The framework sorts the map outputs, which are then input to the reduce tasks. Job inputs and outputs are stored in the file system. The MapReduce framework and HDFS typically run on the same set of nodes, which enables the framework to schedule tasks on nodes that contain the data.

The MapReduce framework consists of a single master JobTracker and one slave TaskTracker per node. The master is responsible for scheduling job component tasks on the slaves, monitoring tasks, and re-executing failed tasks. The slaves execute tasks as directed by the master. Minimally, applications specify input and output locations and supply map and reduce functions through implementations of the appropriate interfaces or abstract classes. Although the Hadoop framework is implemented in Java, MapReduce applications do not have to be written in Java.

HDFS uses a master/slave architecture in which one device (the master) controls one or more other devices (the slaves). A small Hadoop cluster includes a single master and multiple worker nodes. The master node consists of a JobTracker, TaskTracker, NameNode and DataNode. A slave or worker node acts as both a DataNode and TaskTracker, though it is possible to have data-only worker nodes and compute-only worker nodes. The JobTracker schedules map or reduce jobs to TaskTrackers with an awareness of the data location. For example, if node A contains data (x,y,z) and node B contains data (a,b,c), the JobTracker schedules node B to perform map or reduce tasks on (a,b,c) and node A to perform map or reduce tasks on (x,y,z). This reduces the amount of traffic that goes over the network and prevents unnecessary data transfer.
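To show how these pieces fit together, here is a sketch of a minimal job driver for the hypothetical WordCount mapper and reducer from section 2.4: it only specifies the input and output locations in HDFS and supplies the map and reduce classes, as described above. The class names and paths are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // Map and reduce implementations from the earlier sketch.
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);  // optional local pre-aggregation
    job.setReducerClass(WordCount.IntSumReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input and output locations in HDFS (illustrative paths).
    FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
    FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));

    // Submit the job and wait; the framework schedules the map and
    // reduce tasks on nodes close to the data.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}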

Page 12: Hadoop(Big Data)


4. Pig and Hive

Pig is a high-level platform for creating MapReduce programs that run on Hadoop. Hive is a data warehouse infrastructure built on Hadoop for analysis and aggregation (summarization) of data. Both compile down to MapReduce jobs. Pig is a procedural language in which one describes the procedures to apply to data in Hadoop, while Hive is a SQL-like declarative language. Yahoo uses both Pig and Hive in its Hadoop toolkit.

Apache Pig, Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets. At present, Pig's infrastructure layer consists of a compiler that produces sequences of MapReduce programs. Pig's language layer currently consists of a textual language called Pig Latin, which is easy to use, optimized, and extensible.
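As a small, illustrative taste of Pig Latin, the script below counts word occurrences in a text file; the relation names and paths are made up for the example, and Pig compiles the script into a sequence of MapReduce jobs.

-- Load each line of the input file as a single chararray field.
lines = LOAD '/user/demo/input.txt' AS (line:chararray);

-- Split every line into words and flatten the resulting bags of words.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Group identical words together and count each group.
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS total;

-- Write the (word, total) pairs back to HDFS.
STORE counts INTO '/user/demo/wordcount_out';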

Apache Hive, Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. It provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL. Hive also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
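For illustration, the HiveQL below projects a table structure onto delimited files that already sit in HDFS and then runs an ad-hoc aggregation over them; the table name, columns, and location are invented for the example, and Hive turns the query into MapReduce jobs behind the scenes.

-- Project a schema onto delimited files that already live in HDFS.
CREATE EXTERNAL TABLE IF NOT EXISTS page_views (
  user_id   STRING,
  page_url  STRING,
  view_time STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/user/demo/page_views';

-- Ad-hoc query: the ten most viewed pages.
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url
ORDER BY views DESC
LIMIT 10;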

Apache HCatalog, HCatalog is a metadata abstraction layer for referencing data without using the underlying file names or formats. It insulates users and scripts from how and where the data is physically stored.

Apache HBase, HBase (Hadoop Database) is a distributed, column-oriented database. HBase uses HDFS for its underlying storage. It supports both batch-style computations using MapReduce and point queries (random reads).

The main components of HBase are:

• HBase Master - responsible for negotiating load balancing across all RegionServers and maintaining the state of the cluster. It is not part of the actual data storage or retrieval path.
• RegionServer - deployed on each machine; it hosts data and processes I/O requests.

Apache ZooKeeper, ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services, all of which are very useful for a variety of distributed systems. HBase is not operational without ZooKeeper.
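As a sketch of the point queries mentioned above, the snippet below uses the HBase Java client API (the Connection/Table interfaces of HBase 1.x and later) to write and read back a single row. It assumes a table named "users" with a column family "info" already exists; the table, row key, and column names are illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
  public static void main(String[] args) throws Exception {
    // The ZooKeeper quorum and other settings come from hbase-site.xml.
    Configuration conf = HBaseConfiguration.create();

    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Point write: one row keyed by user id, one cell in family "info".
      Put put = new Put(Bytes.toBytes("user-1001"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
      table.put(put);

      // Point query (random read): fetch the row back by its key.
      Get get = new Get(Bytes.toBytes("user-1001"));
      Result result = table.get(get);
      byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
      System.out.println("name = " + Bytes.toString(name));
    }
  }
}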

Apache Oozie, Apache Oozie is a workflow/coordination system to manage Hadoop jobs.

Page 13: Hadoop(Big Data)


5. Hadoop Uniqueness

Hadoop enables a computing solution that is:

Scalable - New nodes can be added as needed, without changing data formats, how data is loaded, how jobs are written, or the applications on top.

Cost effective - Hadoop brings massively parallel computing to commodity servers. The result is a sizeable decrease in the effective cost per terabyte of storage, which in turn makes it affordable to model all your data.

Flexible - Hadoop is schema-less and can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.

Fault tolerant - When you lose a node, the system redirects work to another location of the data and continues processing without missing a beat.

Page 14: Hadoop(Big Data)

Contact

For more information, contact [email protected] (Email Id of ISU)

About Tata Consultancy Services (TCS)

Tata Consultancy Services is an IT services, consulting and business solutions organization that delivers real results to global business, ensuring a level of certainty no other firm can match. TCS offers a consulting-led, integrated portfolio of IT and IT-enabled infrastructure, engineering and assurance services. This is delivered through its unique Global Network Delivery Model™, recognized as the benchmark of excellence in software development. A part of the Tata Group, India's largest industrial conglomerate, TCS has a global footprint and is listed on the National Stock Exchange and Bombay Stock Exchange in India.

For more information, visit us at www.tcs.com.

IT Services | Business Solutions | Consulting

All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content / information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced, republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS. Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws, and could result in criminal or civil penalties. Copyright © 2011 Tata Consultancy Services Limited

Thank You