
Scale-Out Storage

The key to harnessing the power of Big Data in a cost-effective, futuristic and sustainable way

    The big questions around Big Data are:

    How can we speed up processing time?

    How can we ensure accuracy and quality of output?

    What can we do about the ever-increasing volume of data?

    What can we do about the different formats/types of data to be processed?

    Storage system design is the key to supporting these four aspects in order to enable companies to derive value from Big Data.

Big Data is best dealt with at the storage level with scale-out storage systems, either proprietary or open source. This white paper provides an analysis of Scale-Out Storage systems, their advantages and how Tata Consultancy Services (TCS) can help you derive power from Big Data.


    White Paper


    About the Authors

    Reena Dayal

Reena Dayal has over 18 years of experience in the IT industry. During her tenure at TCS, she has played key managerial roles in program management, product development, delivery, technologies and solution architecture, and currently heads HiTech Solution Central for the HiTech Industrial Solution Unit. In this role, she is responsible for creating and driving comprehensive solutions around technologies and domains relevant to the HiTech industry. Reena is the Chairperson of the Storage Networking Industry Association – India (SNIA – India), a member of the Lucknow Management Association (LMA) and a member of the Association for Computing Machinery (ACM).

    Amit Shukla

Amit Shukla has more than 15 years of experience in the Government, Education and HiTech industries, where he has occupied various roles including IT Infrastructure Architect, Solutions and Pre-Sales Head, Project Manager and Delivery/Support Manager for IT infrastructure. He currently leads the HiTech Storage CoE and the Storage Solution Labs, where he has been instrumental in providing storage and cloud ecosystem solutions.

    Udayan Singh

Udayan Singh has ten years of industry experience in software development, QA and sustenance, primarily in Storage Product Engineering and Telecom. He currently leads Storage and Platform Product Engineering and Innovate@HiTech in HTSC, where he is responsible for developing solution accelerators for storage/platform and for multiple innovation initiatives in the HiTech ISU. Udayan has presented and published papers at SDC 2008 and 2009, and has filed patents in storage technology. He is also a member of CSI and IEEE, and leads the SNIA Cloud Storage SIG in India.


    Table of Contents

1. Big Data Challenges in the Technology Industry

2. SOS: Answer to the Big Data Problem

3. Advantages of Scale-Out Storage

4. Difference Between Scale-Out and Scale-Up Storage

5. Key Considerations While Designing Big Data Solutions

6. An Example of a Scale-Out System Using Hadoop

7. Conclusion

Big Data Challenges in the Technology Industry

Technology companies in particular are greatly impacted by the problem of Big Data. The biggest challenge in terms of storage is how capacity can be scaled almost without limit without affecting existing infrastructure.

Here it is important to emphasize the nature of a Big Data problem. The Big Data challenge is mainly seen as one of large data volumes. However, just because a business is dealing with large volumes of data does not mean it has a ‘Big Data problem’; the question is what the business wants to do with the data. The real challenge of Big Data is the nature of analytics that can be derived from massive and dynamically changing data at an incredibly fast rate. Storage systems have to support this need and hence form the very foundation of tackling the ‘Big Data problem’.

Another aspect that needs to be considered is speed. With the emergence of high-speed computing, cloud technology and demanding applications from diverse domains, predicting the growth of data manually is almost impossible. Failure to predict such data growth, and the bandwidth required to move it, can translate directly into business loss. To counter the ever-increasing need for capacity and performance, storage solutions must deliver web-scale performance, linear scalability, the ability to scale on demand and high-end performance. This speed requirement is not just about high IOPS but IOPS at scale which, in networked storage, translates to both high IOPS and high bandwidth.
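To make the IOPS-versus-bandwidth distinction concrete, here is a small sizing sketch; the IOPS targets and I/O sizes below are illustrative assumptions, not figures from any particular system:

```python
# Hypothetical sizing sketch: translate an IOPS target into the network
# bandwidth a networked scale-out storage system must sustain.

def required_bandwidth_mbps(iops: int, io_size_kb: int) -> float:
    """Bandwidth in MB/s implied by `iops` operations of `io_size_kb` each."""
    return iops * io_size_kb / 1024

# 100,000 IOPS of small 4 KB random I/O implies modest bandwidth...
small_io = required_bandwidth_mbps(100_000, 4)    # ~390 MB/s
# ...but the same IOPS at 64 KB (e.g. media chunks) implies far more.
large_io = required_bandwidth_mbps(100_000, 64)   # ~6,250 MB/s

print(f"4 KB I/O:  {small_io:.1f} MB/s")
print(f"64 KB I/O: {large_io:.1f} MB/s")
```

The same IOPS figure can thus imply very different network and controller bandwidth, which is why both must be designed for.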

Analyzing data has become harder not only because there is more of it but because it comes from new sources and in various types. Blogs, comments and social platforms such as Facebook and Twitter contribute data that is unstructured in nature and cannot be crunched the way data from relational databases can. The need to mine different types of content has led to new data analysis platforms, most notably the open source Hadoop framework. Mobility also plays a vital role: for optimum business execution, continuous, mobile access to enterprise applications is a high-priority requirement, and multiple instances of the same piece of data are needed to ensure such seamless access on the fly.
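As a toy illustration of why such data cannot be queried like relational rows, consider the posts below (made up for this sketch): they must first be parsed and have features extracted before anything can be counted or aggregated.

```python
# Illustration with made-up data: unstructured social posts need parsing
# and extraction before analysis, unlike ready-to-query relational rows.

import re

posts = [
    "Loving the new release! #storage #bigdata",
    "Downtime again?! #fail",
]

def extract_hashtags(text: str) -> list:
    # Pull out every "#word" token from free-form text.
    return re.findall(r"#(\w+)", text)

tags = [t for p in posts for t in extract_hashtags(p)]
print(tags)  # ['storage', 'bigdata', 'fail']
```

Frameworks such as Hadoop exist precisely to run this kind of extract-then-aggregate work in parallel over very large unstructured corpora.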

    Scale-Out Storage (SOS) systems help meet the capacity, speed and type requirements posed by the ‘Big Data problem’.



    SOS: Answer to the Big Data Problem

In the last two decades there has been a lot of fine-tuning in the architecture of monolithic Scale-Up Storage subsystems, but the basic architecture remains the same: single or dual controllers with storage capacity in the form of HDDs at the backend, and disk subsystems that provide capacity and a number of spindles. Since the controllers govern performance in terms of IOPS and throughput, saturation is reached as capacity is increased by adding disks (scaling up). Scale-Out Storage systems, on the other hand, mitigate the challenges created by the evolving size and forms of data, which were a huge challenge for legacy storage systems in contemporary, dynamic data centers. Figures 1 and 2 contrast Scale-Up and Scale-Out performance.

Figure 1 – Scale-Up Performance (MBPS over time and number of disk spindles: performance saturates, with a stepped $$ expense at each frame upgrade beyond the initial investment)

Figure 2 – Scale-Out Performance (MBPS over time and number of disk spindles: linear scalability, with incremental $$ expense at each node addition beyond the initial investment)

Scale-Out architecture is a modular architecture in which numerous nodes are pooled and configured to create an extensive storage infrastructure. Each node contributes disk capacity and compute power, and is connected through the network. As nodes are added to the system, capacity, performance and throughput go up. Figure 3 outlines the environment of a typical Scale-Out Storage system.
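The linear growth described above can be sketched with a toy model; the per-node capacity and throughput figures below are assumptions for illustration, not vendor data:

```python
# Illustrative model: in a scale-out pool, aggregate capacity and
# throughput grow linearly with node count (hypothetical per-node figures).

from dataclasses import dataclass, field

@dataclass
class Node:
    capacity_tb: float = 48.0        # assumed raw disk capacity per node
    throughput_mbps: float = 800.0   # assumed per-node throughput contribution

@dataclass
class ScaleOutPool:
    nodes: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        # Online expansion: a new node simply joins the pool.
        self.nodes.append(node)

    @property
    def capacity_tb(self) -> float:
        return sum(n.capacity_tb for n in self.nodes)

    @property
    def throughput_mbps(self) -> float:
        return sum(n.throughput_mbps for n in self.nodes)

pool = ScaleOutPool()
for _ in range(4):
    pool.add_node(Node())
print(pool.capacity_tb, pool.throughput_mbps)  # 192.0 3200.0
```

In a real system the scaling is near-linear rather than perfectly linear, since inter-node network traffic adds some overhead, but the additive model captures the design intent.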

The objective of Scale-Out Storage is to expand the storage infrastructure according to data center requirements and dynamic business needs. Scale-Out Storage systems not only enable storage expansion but also sustain functionality and performance as they expand. Scale-Out Storage platforms are a fitting solution for IT decision makers to manage exponential data growth and ensure a low cost of ownership for the storage infrastructure, along with linear performance growth enabled by parallel processing.

Scale-Out Storage first emerged as a Network Attached Storage (NAS) solution, because customers required storage infrastructure that could sustain the colossal data flows of high performance computing and unstructured data. Scale-Out NAS solutions enhance capacity and performance together by adding nodes that work in parallel. These nodes are managed as a single system and addressed by a single namespace, which results in ease of implementation and management.


    Figure 3 – Scale-Out Environment



    Advantages of Scale-Out Storage

Essentially, Scale-Out Storage provides not only linear scalability of storage performance but also the flexibility of incremental investment, giving the storage infrastructure agility and a pay-as-you-grow model. Hence, it is a best fit for the cloud environment, which is one of the biggest producers of Big Data.

Some of the advantages are:

Flexible and Efficient Capacity Management

Scale-Out Storage, when implemented with virtual storage solutions, provides the ability to expand storage space with zero downtime. Planning storage for application systems becomes more flexible when the Scale-Out Storage platform is combined with “thin provisioning” technology.

Scale-Out Storage also provides maximum utilization of overall disk capacity. Historically, market prices of disk drives have always shown a downward trend, so delaying disk procurement yields cost benefits. Figure 4 compares disk cost and capacity trends.

Efficient Performance Management

Scale-Out Storage architecture aggregates storage controller power and bandwidth to enhance performance. When a storage solution has reached its capacity and performance limits, there is no need to replace the entire system with a new, high-end one; the required amount of storage can simply be added to sustain performance. There are two methods by which Scale-Out Storage aggregates multiple controllers to perform storage processing.

Figure 4 – Hard Drive Cost vs. Capacity (1995–2011: cost per gigabyte fell from $625 through $220, $123, $14, $2, 60¢, 29¢ and 9¢ to about 5¢ today, while drive capacity grew from 1GB through 4GB, 80GB, 250GB, 500GB, 1TB, 2TB and 3TB to 4TB)


Deploying an Agent on the Host

This method involves installing an agent on the host to perform the computing needed to assign data. The agent periodically communicates with the controllers to exchange information about where data is stored, so it can fetch the data the host wants directly from the correct controller. The benefit of this method is high performance; the downside is the requirement of an external agent program on the host.

Deploying a Primary Controller

In this method, a primary controller handles the communication between the host and the other controllers. Data is accessed through the primary controller in two stages: the host first sends a request to the primary controller, which in turn points to the correct data requested by the host. This method is less efficient in performance terms than the first, but no external agent program is needed on the host.

Robust Management and System Security

Integrity and security of data are the most significant factors in any data center. Most Storage Area Network (SAN) architectures used by enterprises recommend RAID 5 or even RAID 6 data protection. Enterprise Scale-Out Storage solutions offer widespread attributes such as storage drive monitoring, drive snapshots and close integration with manageability solutions. These attributes are complemented by the fact that there are no single points of failure, which ensures a high probability of recovery even in the case of hardware failure. During a maintenance window in a Scale-Out Storage infrastructure, storage administrators can temporarily move data from systems scheduled for maintenance to other systems. With a virtualized storage infrastructure, maintenance or upgrades can be performed at runtime, eliminating maintenance windows altogether. These capabilities result in improved management, high availability and robust business continuity plans around storage systems.

Futuristic and Sustained Applicability

A key consideration for business decision makers is whether moving all applications to Scale-Out Storage is required and whether it will replace all legacy storage solutions. This is answered by the futuristic and sustained architecture of Scale-Out Storage, which can co-exist with legacy storage solutions in a hybrid environment and be used for a very long period of time.
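The agent-on-the-host method described above amounts to client-side placement: every agent can compute, on its own, which controller owns a piece of data and go there in one hop. The sketch below illustrates the idea; the controller names and the simple modulo-hash scheme are hypothetical, chosen only for illustration.

```python
# Sketch of client-side placement (hypothetical scheme): the host agent
# hashes the object name to pick the owning controller directly, avoiding
# the extra hop through a primary controller.

import hashlib

CONTROLLERS = ["ctrl-0", "ctrl-1", "ctrl-2", "ctrl-3"]  # assumed cluster

def owning_controller(object_name: str) -> str:
    """Deterministic placement: every host agent computes the same answer."""
    digest = hashlib.md5(object_name.encode()).hexdigest()
    return CONTROLLERS[int(digest, 16) % len(CONTROLLERS)]

# The agent can go straight to the right controller in one hop:
print(owning_controller("video/clip-0042.mp4"))
```

A plain modulo hash remaps most data when a controller is added; production scale-out systems typically use consistent hashing or a placement table for exactly that reason, but the one-hop-lookup principle is the same.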

Difference Between Scale-Out and Scale-Up Storage

The key difference between Scale-Out and Scale-Up Storage is that scale-out storage architecture consists of multiple “commodity based storage” systems clustered at the storage controller level, while scale-up storage provides an abstraction layer between its components, which are connected to a single controller.

The “commodity based storage” systems of scale-out storage can accommodate various types of disks such as SATA, SSD, SAS, and so on. For enterprise implementation, scale-out storage is expanded simply by incrementally adding heterogeneous storage systems. In scale-out storage, replication ports require distinct mapping on a controller-by-controller basis. Scale-out design is complex to implement and drives increased network or fabric connectivity costs. Scale-out storage requires management and capacity planning similar to that of multiple independent modular storage arrays, due to its multi-node architecture. Scale-out storage systems have high availability and no single point of failure, thanks to the clustered architecture and the existence of multiple controller nodes. The network connectivity between controller nodes at times leads to performance overhead in scale-out systems. Scale-out systems also require “multi-pathing” software on hosts to ensure controlled data paths between host and target.

In scale-up storage the component disks should be similar or compatible with each other. To get maximum performance, scalability and availability, scale-up storage is architected with application-specific integrated circuits (ASICs). In scale-up storage, mapping is not required at the controller level, as there is only a single controller per storage system. While this makes the overall design and implementation simpler, the initial acquisition cost is high. Scale-up storage has a consolidated, integrated architecture, which reduces administrative overheads. Scale-up systems are susceptible to single points of failure if not architected carefully, since they are single-controller systems. Their integrated design also means fewer connections and optimized performance for mission-critical applications. “Multi-pathing” software is not mandatory for hosts connected to scale-up systems. Figure 5 demonstrates the difference between the two.

    Figure 5 – Storage Solution Architecture – Traditional (Scale up) Vs. Emerging (Scale out)

Scale-Up Architecture

To scale up (or scale vertically) means to add resources to a single node in a system, typically by adding CPUs or memory to a single computer.

Supports adding or upgrading existing components with more powerful resources.

Scale-up storage cannot span multiple locations.

Scale-up storage cannot easily be expanded beyond the maximum capacity of any single storage node.

Scale-Out Architecture

To scale out (or scale horizontally) means to add more nodes to a system, such as adding a new computer to a distributed software application.

Provides integration and management of data over petabytes.

Designed for rapid growth in file or unstructured data environments.


The most important difference between Scale-Up and Scale-Out Storage lies in the application of these architectures. If the business requirement is high IOPS, a scale-up system is required; scalability in scale-up systems comes at a high cost, since most are built on proprietary platform architectures. Scale-out architectures, on the other hand, cater to high-throughput requirements: when dealing with large volumes of data, the data can be spread across multiple nodes.

Key Considerations While Designing Big Data Solutions

There are several considerations which must be kept in mind while designing Big Data storage solutions.

What are the IOPS and bandwidth requirements? This is an important question to ask, since even within Big Data speed requirements, actual speeds can differ. For example, a video repository with video search has higher IOPS and bandwidth requirements than a customer market analytics workload.

What is the nature of the data? Is it structured, semi-structured, unstructured or a combination of all three? For example, social media analytics involves a lot of unstructured data, whereas supply chain analytics coupled with social CRM would be a mix of structured, semi-structured and unstructured data.

Where does the data reside? Is it all in one physical location or spread across different locations? The data could be within the boundaries of the enterprise and within the same data center, spread across multiple locations or data centers, or come from outside the enterprise as well.

Is the data tiered? What are the archiving requirements? Is it possible to place data in archival tiers for retention purposes? While dealing with such data, it must be determined how data from multiple tiers will fit into the new solution.

What are the analytics-driven data extraction requirements? How often does old data need to be accessed, and does it often get mixed with new data access? This may again be linked to the archiving or tiering of data.

An analysis of all these requirements is necessary to define an effective architecture for a Big Data storage solution.

An Example of a Scale-Out System Using Hadoop

A solution to the Big Data demand for higher IOPS and storage scalability is the Hadoop framework, which is available under the Apache License 2.0.

As depicted in Figure 5 below, a Hadoop cluster consists of a Name Node and N Data Nodes, connected through an IP network. The Hadoop framework has the following components:

Hadoop Distributed File System (HDFS), which stores files in Hadoop across different nodes and maintains fault tolerance.

MapReduce Engine, which consists of a Job Tracker (for receiving requests from applications to process the data) and Task Trackers (for performing the work assigned on the data stored in the data nodes).
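The MapReduce flow can be illustrated with a word count, the canonical MapReduce example. The sketch below is a pure-Python simulation of the map, group and reduce phases, not code against any specific Hadoop API:

```python
# Word count in the MapReduce style: map emits (word, 1) pairs, the
# framework groups pairs by key, and reduce sums the counts per word.

from collections import defaultdict

def map_phase(line):
    # In Hadoop, Task Trackers run this on the node holding the data block.
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    return word, sum(counts)

def run_job(lines):
    grouped = defaultdict(list)
    for line in lines:                    # "map" over every input line
        for word, one in map_phase(line):
            grouped[word].append(one)     # shuffle/group by key
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(run_job(["big data big storage", "scale out storage"]))
# {'big': 2, 'data': 1, 'storage': 2, 'scale': 1, 'out': 1}
```

The key point for scale-out storage is that the map step runs where the data lives, so adding data nodes adds both capacity and processing power.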

Figure 5 shows a Master (running the Name Node and Job Tracker) connected over the IP network to Slaves (each running a Data Node and Task Tracker on commodity hardware), which together serve analytics applications.

    While architecting solutions, the major benefits achieved using the Hadoop framework are:

Better Scalability: Nodes with compute and storage can be added on the fly, without bringing down the system or affecting ongoing operations, and the cluster easily scales to multi-petabyte environments.

Cost Effective: Hadoop uses commodity hardware and does not require specialized hardware, which would otherwise increase the cost of compute and storage.

Improved Performance: Analytics applications demanding greater performance achieve better results, as the data is processed on the data node where it resides.

    Figure 5 – A Hadoop Cluster

Figure 6 shows applications such as Nutch (crawler), Pentaho (BI tool), RHIPE (statistics), Oozie (process orchestrator), Mahout (insights and mining) and Pig (data access) on top of the MapReduce processing framework, alongside ZooKeeper (cluster communication) and a Hive/HBase data store, all over Hadoop HDFS (storage), Java Virtual Machine 6.0 (Sun/Oracle) and a Red Hat OS. Surrounding the stack are Migration, Architecture, Deployment & Support Services; Validation & PoC Services; and Integration & Testing Services.


Figure 6 depicts the overall technology stack around Hadoop, which most Scale-Out Storage solutions today incorporate.

Figure 6 – Hadoop Technology Stack

Conclusion

Futuristic trends such as Cloud, Mobility and Social Computing are the biggest contributors to the Big Data challenge. Today, organizations across verticals face the challenge not only of managing increasingly large data volumes in their real-time systems but also of analyzing that data to make the right decisions. Big Data is going to impact the IT architecture of organizations in a big way.

The scale at which data is growing and the speed at which it needs to be processed are breaking the boundaries of the existing Scale-Up Storage architecture. Big Data storage solutions have to be scalable in terms of petabytes and should deliver the same kind of performance even after capacity has increased manifold.

Scale-Out Storage platforms are emerging as the leader among existing storage technologies. The ‘Big Data problem’ is best addressed at the storage level by Scale-Out Storage systems, which can be proprietary or open source. This is the answer for the ‘changing face of data’, which now wears an entirely new look and demands extensive scalability, performance, longevity and fault tolerance. That is the major reason almost every storage vendor is coming up with Scale-Out Storage with parallel file systems. Some of the key Scale-Out architectures are EMC Isilon, IBM SONAS, NetApp Engenio, Hitachi HNAS, HP IBRIX and many more. Scale-Out architecture comfortably scales into the petabyte range while offering high performance for reads and writes, no matter the data volume.

Another way of implementing scale-out solutions is to select Hadoop and its community components, leveraging solution accelerators to provide services on commodity hardware. Firms such as TCS bring their industry-specific experience to bear through Proof of Concept (PoC), integration, migration and deployment/support services across most Hadoop components.

Modern businesses looking for a solution to handle their Big Data easily and effectively should consider a composite strategy: take a step-by-step approach of performing a PoC, analyzing the results and only then deploying the solution where the benefits can be substantiated.

References

The following references were used in the preparation of this document:

    http://hadoop.apache.org/

    http://en.wikipedia.org/wiki/Apache_Hadoop


All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content / information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced, republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS. Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws, and could result in criminal or civil penalties. Copyright © 2012 Tata Consultancy Services Limited

IT Services
Business Solutions
Outsourcing

Contact

To know more about how we help companies in the HiTech Industry overcome their challenges to achieve real business results, contact [email protected]

    About TCS HiTech Industry Solution Unit

TCS' HiTech Industry Solution Unit provides optimal, customized, and comprehensive solutions across varied High Tech industry segments: Computer Platform and Services Companies, Software Firms, Electronics and Semiconductor Companies, and Professional Services Firms (Legal, HR, Tax & Accounting and Consulting & Advisory/Analyst firms).

    Building on its vast experience in engineering, business process transformation, innovation and IT solutions, TCS offers a comprehensive portfolio of services that maximize growth, manage risk, and reduce costs. The TCS HiTech Industry Solution Unit partners with High Tech enterprises to provide end-to-end solutions which help realize operational excellence, innovation and greater profitability.

    For more information, visit us at http://www.tcs.com/industries/high_tech

About Tata Consultancy Services (TCS)

Tata Consultancy Services is an IT services, consulting and business solutions organization that delivers real results to global business, ensuring a level of certainty no other firm can match. TCS offers a consulting-led, integrated portfolio of IT and IT-enabled infrastructure, engineering and assurance services. This is delivered through its unique Global Network Delivery Model™, recognized as the benchmark of excellence in software development. A part of the Tata Group, India’s largest industrial conglomerate, TCS has a global footprint and is listed on the National Stock Exchange and Bombay Stock Exchange in India.

For more information, visit us at www.tcs.com

Subscribe to TCS White Papers

TCS.com RSS: http://www.tcs.com/rss_feeds/Pages/feed.aspx?f=w

Feedburner: http://feeds2.feedburner.com/tcswhitepapers
