
Scale-Out Storage

The key to harnessing the power of Big Data in a cost-effective, futuristic and sustainable way

    The big questions around Big Data are:

    How can we speed up processing time?

    How can we ensure accuracy and quality of output?

    What can we do about the ever-increasing volume of data?

    What can we do about the different formats/types of data to be processed?

    Storage system design is the key to supporting these four aspects in order to enable companies to derive value from Big Data.

Big Data is best dealt with at the storage level with scale-out storage systems, either proprietary or open source. This white paper provides an analysis of Scale-Out Storage systems, their advantages and how Tata Consultancy Services (TCS) can help you derive power from Big Data.


    White Paper


    About the Authors

    Reena Dayal

Reena Dayal has over 18 years of experience in the IT industry. During her tenure at TCS, she has played key managerial roles in program management, product development, delivery, technologies and solution architecture, and currently heads HiTech Solution Central for the HiTech Industrial Solution Unit. In this role, she is responsible for creating and driving comprehensive solutions around technologies and domains relevant to the HiTech industry. Reena is the Chairperson of the Storage Networking Industry Association – India (SNIA – India), a member of the Lucknow Management Association (LMA) and a member of the Association for Computing Machinery (ACM).

    Amit Shukla

Amit Shukla has more than 15 years of experience in the Government, Education and HiTech industries, where he has occupied various roles including IT Infrastructure Architect, Solutions and Pre-Sales Head, Project Manager and Delivery/Support Manager for IT infrastructure. He currently leads the HiTech Storage CoE and the Storage Solution Labs, where he has been instrumental in providing storage and cloud ecosystem solutions.

    Udayan Singh

Udayan Singh has ten years of industry experience in software development, QA and sustenance, primarily in Storage Product Engineering and Telecom. He currently leads Storage and Platform Product Engineering and Innovate@HiTech in HTSC, where he is responsible for developing solution accelerators for storage/platform and for multiple innovation initiatives in the HiTech ISU. Udayan has presented and published papers at SDC 2008 and 2009, and has filed patents in storage technology. He is also a member of CSI and IEEE, and leads the SNIA Cloud Storage SIG in India.


    Table of Contents

1. Big Data Challenges in the Technology Industry

2. SOS: Answer to the Big Data Problem

3. Advantages of Scale-Out Storage

4. Difference Between Scale-Out and Scale-Up Storage

5. Key Considerations While Designing Big Data Solutions

6. An Example of a Scale-Out System Using Hadoop

7. Conclusion

Big Data Challenges in the Technology Industry

Technology companies in particular are greatly impacted by the problem of Big Data. The biggest challenge in terms of storage is how capacity can be scaled almost without limit without affecting existing infrastructure.

Here it is important to emphasize the nature of a Big Data problem. The Big Data challenge is mainly seen as one of large data volumes. However, just because a business is dealing with large volumes of data does not mean it has a ‘Big Data problem’; the question is what the business wants to do with the data. The real challenge of Big Data is the nature of analytics that can be derived from massive and dynamically changing data at an incredibly fast rate. Storage systems have to support this need and hence form the very foundation of tackling the ‘Big Data problem’.

Another aspect that needs to be considered is speed. With the emergence of high-speed computing, cloud technology and demanding applications from diverse domains, predicting the growth of data manually is almost impossible. Failure to predict such data growth, and the bandwidth required to move it, can translate directly into business loss. To counter the ever-increasing need for capacity and performance, storage solutions must deliver web-scale performance, linear scalability, the ability to scale on demand and high-end performance. This speed requirement is not just about high IOPS but IOPS at scale which, in networked storage, translates to both high IOPS and high bandwidth.
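To make the IOPS-versus-bandwidth distinction concrete, here is a small sizing sketch; the IOPS targets and I/O sizes below are illustrative assumptions, not figures from any particular system:

```python
# Hypothetical sizing sketch: translate an IOPS target into the network
# bandwidth a networked scale-out storage system must sustain.

def required_bandwidth_mbps(iops: int, io_size_kb: int) -> float:
    """Bandwidth in MB/s implied by `iops` operations of `io_size_kb` each."""
    return iops * io_size_kb / 1024

# 100,000 IOPS of small 4 KB random I/O implies modest bandwidth...
small_io = required_bandwidth_mbps(100_000, 4)    # ~390 MB/s
# ...but the same IOPS at 64 KB (e.g. media chunks) implies far more.
large_io = required_bandwidth_mbps(100_000, 64)   # ~6,250 MB/s

print(f"4 KB I/O:  {small_io:.1f} MB/s")
print(f"64 KB I/O: {large_io:.1f} MB/s")
```

The same IOPS figure can thus imply very different network and controller bandwidth, which is why both must be designed for.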

Analyzing data has become harder not only because there is more of it but because it comes from new sources and in various types. Blogs, comments and social platforms such as Facebook and Twitter contribute data that is unstructured in nature and cannot be crunched the way data from relational databases can. The need to mine different types of content has led to new data analysis platforms, most notably the open source Hadoop framework. Mobility also plays a vital role: for optimum business execution, continuous, mobile access to enterprise applications is a high-priority requirement, and multiple instances of the same piece of data are needed to ensure such seamless access on the fly.
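As a toy illustration of why such data cannot be queried like relational rows, consider the posts below (made up for this sketch): they must first be parsed and have features extracted before anything can be counted or aggregated.

```python
# Illustration with made-up data: unstructured social posts need parsing
# and extraction before analysis, unlike ready-to-query relational rows.

import re

posts = [
    "Loving the new release! #storage #bigdata",
    "Downtime again?! #fail",
]

def extract_hashtags(text: str) -> list:
    # Pull out every "#word" token from free-form text.
    return re.findall(r"#(\w+)", text)

tags = [t for p in posts for t in extract_hashtags(p)]
print(tags)  # ['storage', 'bigdata', 'fail']
```

Frameworks such as Hadoop exist precisely to run this kind of extract-then-aggregate work in parallel over very large unstructured corpora.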

    Scale-Out Storage (SOS) systems help meet the capacity, speed and type requirements posed by the ‘Big Data problem’.



    SOS: Answer to the Big Data Problem

In the last two decades there has been a lot of fine-tuning in the architecture of monolithic Scale-Up Storage subsystems, but the basic architecture remains the same: single or dual controllers with storage capacity in the form of HDDs at the backend, and disk subsystems that provide capacity and a number of spindles. Since the controllers govern performance in terms of IOPS and throughput, saturation is reached as capacity is increased by adding disks (scaling up). Scale-Out Storage systems, on the other hand, mitigate the challenges created by the evolving size and forms of data, which were a huge challenge for legacy storage systems in contemporary, dynamic data centers. Figures 1 and 2 contrast Scale-Up and Scale-Out performance.

Figure 1 – Scale-Up Performance (MBPS over time and number of disk spindles: performance saturates, with a stepped $$ expense at each frame upgrade beyond the initial investment)

Figure 2 – Scale-Out Performance (MBPS over time and number of disk spindles: linear scalability, with incremental $$ expense at each node addition beyond the initial investment)

Scale-Out architecture is a modular architecture in which numerous nodes are pooled and configured to create an extensive storage infrastructure. Each node contributes disk capacity and compute power, and is connected through the network. As nodes are added to the system, capacity, performance and throughput go up. Figure 3 outlines the environment of a typical Scale-Out Storage system.
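The linear growth described above can be sketched with a toy model; the per-node capacity and throughput figures below are assumptions for illustration, not vendor data:

```python
# Illustrative model: in a scale-out pool, aggregate capacity and
# throughput grow linearly with node count (hypothetical per-node figures).

from dataclasses import dataclass, field

@dataclass
class Node:
    capacity_tb: float = 48.0        # assumed raw disk capacity per node
    throughput_mbps: float = 800.0   # assumed per-node throughput contribution

@dataclass
class ScaleOutPool:
    nodes: list = field(default_factory=list)

    def add_node(self, node: Node) -> None:
        # Online expansion: a new node simply joins the pool.
        self.nodes.append(node)

    @property
    def capacity_tb(self) -> float:
        return sum(n.capacity_tb for n in self.nodes)

    @property
    def throughput_mbps(self) -> float:
        return sum(n.throughput_mbps for n in self.nodes)

pool = ScaleOutPool()
for _ in range(4):
    pool.add_node(Node())
print(pool.capacity_tb, pool.throughput_mbps)  # 192.0 3200.0
```

In a real system the scaling is near-linear rather than perfectly linear, since inter-node network traffic adds some overhead, but the additive model captures the design intent.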

The objective of Scale-Out Storage is to expand the storage infrastructure according to data center requirements and dynamic business needs. Scale-Out Storage systems not only enable storage expansion but also sustain functionality and performance as they expand. Scale-Out Storage platforms are a fitting solution for IT decision makers to manage exponential data growth and ensure a low cost of ownership for the storage infrastructure, along with linear performance growth enabled by parallel processing.

Scale-Out Storage first emerged as a Network Attached Storage (NAS) solution, because customers required storage infrastructure that could sustain the colossal data flows of high performance computing and unstructured data. Scale-Out NAS solutions enhance capacity and performance together by adding nodes that work in parallel. These nodes are managed as a single system and addressed by a single namespace, which results in ease of implementation and management.


    Figure 3 – Scale-Out Environment



    Advantages of Scale-Out Storage

Essentially, Scale-Out Storage provides not only linear scalability of storage performance but also the flexibility of incremental investment, giving the storage infrastructure agility and a pay-as-you-grow model. Hence, it is a best fit for the cloud environment, which is one of the biggest producers of Big Data.

Some of the advantages are:

Flexible and Efficient Capacity Management

Scale-Out Storage, when implemented with virtual storage solutions, provides the ability to expand storage space with zero downtime. Planning storage for application systems becomes more flexible when the Scale-Out Storage platform is combined with “thin provisioning” technology.

Scale-Out Storage also provides maximum utilization of overall disk capacity. Historically, market prices of disk drives have always shown a downward trend, so delaying disk procurement yields cost benefits. Figure 4 compares disk cost and capacity trends.

Efficient Performance Management

Scale-Out Storage architecture aggregates storage controller power and bandwidth to enhance performance. When a storage solution has reached its capacity and performance limits, there is no need to replace the entire system with a new, high-end one; the required amount of storage can simply be added to sustain performance. There are two methods by which Scale-Out Storage aggregates multiple controllers to perform storage processing.

Figure 4 – Hard Drive Cost vs. Capacity (1995–2011: cost per gigabyte fell from $625 through $220, $123, $14, $2, 60¢, 29¢ and 9¢ to about 5¢ today, while drive capacity grew from 1GB through 4GB, 80GB, 250GB, 500GB, 1TB, 2TB and 3TB to 4TB)


Deploying an Agent on the Host

This method involves installing an agent on the host to perform the computing needed to assign data. The agent periodically communicates with the controllers to exchange information about where data is stored, so it can fetch the data the host wants directly from the correct controller. The benefit of this method is high performance; the downside is the requirement of an external agent program on the host.

Deploying a Primary Controller

In this method, a primary controller handles the communication between the host and the other controllers. Data is accessed through the primary controller in two stages: the host first sends a request to the primary controller, which in turn points to the correct data requested by the host. This method is less efficient in performance terms than the first, but no external agent program is needed on the host.

Robust Management and System Security

Integrity and security of data are the most significant factors in any data center. Most Storage Area Network (SAN) architectures used by enterprises recommend RAID 5 or even RAID 6 data protection. Enterprise Scale-Out Storage solutions offer widespread attributes such as storage drive monitoring, drive snapshots and close integration with manageability solutions. These attributes are complemented by the fact that there are no single points of failure, which ensures a high probability of recovery even in the case of hardware failure. During a maintenance window in a Scale-Out Storage infrastructure, storage administrators can temporarily move data from systems scheduled for maintenance to other systems. With a virtualized storage infrastructure, maintenance or upgrades can be performed at runtime, eliminating maintenance windows altogether. These capabilities result in improved management, high availability and robust business continuity plans around storage systems.

Futuristic and Sustained Applicability

A key consideration for business decision makers is whether moving all applications to Scale-Out Storage is required and whether it will replace all legacy storage solutions. This is answered by the futuristic and sustained architecture of Scale-Out Storage, which can co-exist with legacy storage solutions in a hybrid environment and be used for a very long period of time.
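The agent-on-the-host method described above amounts to client-side placement: every agent can compute, on its own, which controller owns a piece of data and go there in one hop. The sketch below illustrates the idea; the controller names and the simple modulo-hash scheme are hypothetical, chosen only for illustration.

```python
# Sketch of client-side placement (hypothetical scheme): the host agent
# hashes the object name to pick the owning controller directly, avoiding
# the extra hop through a primary controller.

import hashlib

CONTROLLERS = ["ctrl-0", "ctrl-1", "ctrl-2", "ctrl-3"]  # assumed cluster

def owning_controller(object_name: str) -> str:
    """Deterministic placement: every host agent computes the same answer."""
    digest = hashlib.md5(object_name.encode()).hexdigest()
    return CONTROLLERS[int(digest, 16) % len(CONTROLLERS)]

# The agent can go straight to the right controller in one hop:
print(owning_controller("video/clip-0042.mp4"))
```

A plain modulo hash remaps most data when a controller is added; production scale-out systems typically use consistent hashing or a placement table for exactly that reason, but the one-hop-lookup principle is the same.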

Difference Between Scale-Out and Scale-Up Storage

The key difference between Scale-Out and Scale-Up Storage is that scale-out storage architecture consists of multiple “commodity based storage” systems clustered at the storage controller level, while scale-up storage provides an abstraction layer between its components, which are connected to a single controller.

The “commodity based storage” systems of scale-out storage can accommodate various types of disks such as SATA, SSD, SAS, and so on. For enterprise implementation, scale-out storage is expanded simply by incrementally adding heterogeneous storage systems. In scale-out storage, replication ports require distinct mapping on a controller-by-controller basis. Scale-out design is complex to implement and drives increased network or fabric connectivity costs. Scale-out storage requires management and capacity planning similar to that of multiple independent modular storage arrays, due to its multi-node architecture. Scale-out storage systems have high availability and no single point of failure, thanks to the clustered architecture and the existence of multiple controller nodes. The network connectivity between controller nodes at times leads to performance overhead in scale-out systems. Scale-out systems also require “multi-pathing” software on hosts to ensure controlled data paths between host and target.

In scale-up storage the component disks should be similar or compatible with each other. To get maximum performance, scalability and availability, scale-up storage is architected with application-specific integrated circuits (ASICs). In scale-up storage, mapping is not required at the controller level, as there is only a single controller per storage system. While this makes the overall design and implementation simpler, the initial acquisition cost is high. Scale-up storage has a consolidated, integrated architecture, which reduces administrative overheads. Scale-up systems are susceptible to single points of failure if not architected carefully, since they are single-controller systems. Their integrated design also means fewer connections and optimized performance for mission-critical applications. “Multi-pathing” software is not mandatory for hosts connected to scale-up systems. Figure 5 demonstrates the difference between the two.

    Figure 5 – Storage Solution Architecture – Traditional (Scale up) Vs. Emerging (Scale out)

Scale-Up Architecture

To scale up (or scale vertically) means to add resources to a single node in a system, typically by adding CPUs or memory to a single computer.

Supports adding or upgrading existing components with more powerful resources.

Scale-up storage cannot span multiple locations.

Scale-up storage cannot easily be expanded beyond the maximum capacity of any single storage node.

Scale-Out Architecture

To scale out (or scale horizontally) means to add more nodes to a system, such as adding a new computer to a distributed software application.

Provides integration and management of data over petabytes.

Designed for rapid growth in file or unstructured data environments.


The most important difference between Scale-Up and Scale-Out Storage lies in the application of these architectures. If the business requirement is high IOPS, a scale-up system is required; scalability in scale-up systems comes at a high cost, since most are built on proprietary platform architectures. Scale-out architectures, on the other hand, cater to high-throughput requirements: when dealing with large volumes of data, the data can be spread across multiple nodes.

Key Considerations While Designing Big Data Solutions

There are several considerations which must be kept in mind while designing Big Data storage solutions.

What are the IOPS and bandwidth requirements? This is an important question to ask, since even within Big Data speed requirements, actual speeds can differ. For example, a video repository with video search has higher IOPS and bandwidth requirements than a customer market analytics workload.

What is the nature of the data? Is it structured, semi-structured, unstructured or a combination of all three? For example, social media analytics involves a lot of unstructured data, whereas supply chain analytics coupled with social CRM would be a mix of structured, semi-structured and unstructured data.

Where does the data reside? Is it all in one physical location or spread across different locations? The data could be within the boundaries of the enterprise and within the same data center, spread across multiple locations or data centers, or come from outside the enterprise as well.

Is the data tiered? What are the archiving requirements? Is it possible to place data in archival tiers for retention purposes? While dealing with such data, it must be determined how data from multiple tiers will fit into the new solution.

What are the analytics-driven data extraction requirements? How often does old data need to be accessed, and does it often get mixed with new data access? This may again be linked to the archiving or tiering of data.

An analysis of all these requirements is necessary to define an effective architecture for a Big Data storage solution.

An Example of a Scale-Out System Using Hadoop

A solution to the Big Data demand for higher IOPS and storage scalability is the Hadoop framework, which is available under the Apache License 2.0.

As depicted in Figure 5 below, a Hadoop cluster consists of a Name Node and N Data Nodes, connected through an IP network. The Hadoop framework has the following components:

Hadoop Distributed File System (HDFS), which stores files in Hadoop across different nodes and maintains fault tolerance.

MapReduce Engine, which consists of a Job Tracker (for receiving requests from applications to process the data) and Task Trackers (for performing the work assigned on the data stored in the data nodes).
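The MapReduce flow can be illustrated with a word count, the canonical MapReduce example. The sketch below is a pure-Python simulation of the map, group and reduce phases, not code against any specific Hadoop API:

```python
# Word count in the MapReduce style: map emits (word, 1) pairs, the
# framework groups pairs by key, and reduce sums the counts per word.

from collections import defaultdict

def map_phase(line):
    # In Hadoop, Task Trackers run this on the node holding the data block.
    for word in line.lower().split():
        yield word, 1

def reduce_phase(word, counts):
    return word, sum(counts)

def run_job(lines):
    grouped = defaultdict(list)
    for line in lines:                    # "map" over every input line
        for word, one in map_phase(line):
            grouped[word].append(one)     # shuffle/group by key
    return dict(reduce_phase(w, c) for w, c in grouped.items())

print(run_job(["big data big storage", "scale out storage"]))
# {'big': 2, 'data': 1, 'storage': 2, 'scale': 1, 'out': 1}
```

The key point for scale-out storage is that the map step runs where the data lives, so adding data nodes adds both capacity and processing power.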

Figure 5 shows a Master (running the Name Node and Job Tracker) connected over the IP network to Slaves (each running a Data Node and Task Tracker on commodity hardware), which together serve analytics applications.

    While architecting solutions, the major benefits achieved using the Hadoop framework are:

Better Scalability: Nodes with compute and storage can be added on the fly, without bringing down the system or affecting ongoing operations, and the cluster easily scales to multi-petabyte environments.

Cost Effective: Hadoop uses commodity hardware and does not require specialized hardware, which would otherwise increase the cost of compute and storage.

Improved Performance: Analytics applications demanding greater performance achieve better results, as the data is processed on the data node where it resides.

    Figure 5 – A Hadoop Cluster

Figure 6 shows applications such as Nutch (crawler), Pentaho (BI tool), RHIPE (statistics), Oozie (process orchestrator), Mahout (insights and mining) and Pig (data access) on top of the MapReduce processing framework, alongside ZooKeeper (cluster communication) and a Hive/HBase data store, all over Hadoop HDFS (storage), Java Virtual Machine 6.0 (Sun/Oracle) and a Red Hat OS. Surrounding the stack are Migration, Architecture, Deployment & Support Services; Validation & PoC Services; and Integration & Testing Services.


Figure 6 depicts the overall technology stack around Hadoop, which most Scale-Out Storage solutions today incorporate.

Figure 6 – Hadoop Technology Stack

Conclusion

Futuristic trends such as Cloud, Mobility and Social Computing are the biggest contributors to the Big Data challenge. Today, organizations across verticals face the challenge not only of managing increasingly large data volumes in their real-time systems but also of analyzing that data to make the right decisions. Big Data is going to impact the IT architecture of organizations in a big way.

The scale at which data is growing and the speed at which it needs to be processed are breaking the boundaries of the existing Scale-Up Storage architecture. Big Data storage solutions have to be scalable in terms of petabytes and should deliver the same kind of performance even after capacity has increased manifold.

Scale-Out Storage platforms are emerging as the leader among existing storage technologies. The ‘Big Data problem’ is best addressed at the storage level by Scale-Out Storage systems, which can be proprietary or open source. This is the answer for the ‘changing face of data’, which now wears an entirely new look and demands extensive scalability, performance, longevity and fault tolerance. That is the major reason almost every storage vendor is coming up with Scale-Out Storage with parallel file systems. Some of the key Scale-Out architectures are EMC Isilon, IBM SONAS, NetApp Engenio, Hitachi HNAS, HP IBRIX and many more. Scale-Out architecture comfortably scales into the petabyte range while offering high performance for reads and writes, no matter the data volume.

Another way of implementing scale-out solutions is to select Hadoop and its community components, leveraging solution accelerators to provide services on commodity hardware. Firms such as TCS bring their industry-specific experience to bear through Proof of Concept (PoC), integration, migration and deployment/support services across most Hadoop components.

Modern businesses looking for a solution to handle their Big Data easily and effectively should consider a composite strategy: take a step-by-step approach of performing a PoC, analyzing the results and only then deploying the solution where the benefits can be substantiated.

References

The following references were used in the preparation of this document:

    http://hadoop.apache.org/

    http://en.wikipedia.org/wiki/Apache_Hadoop


All content / information present here is the exclusive property of Tata Consultancy Services Limited (TCS). The content / information contained here is correct at the time of publishing. No material from here may be copied, modified, reproduced, republished, uploaded, transmitted, posted or distributed in any form without prior written permission from TCS. Unauthorized use of the content / information appearing here may violate copyright, trademark and other applicable laws, and could result in criminal or civil penalties. Copyright © 2012 Tata Consultancy Services Limited

IT Services
Business Solutions
Outsourcing

Contact

To know more about how we help companies in the HiTech Industry overcome their challenges to achieve real business results, contact [email protected]

    About TCS HiTech Industry Solution Unit

TCS' HiTech Industry Solution Unit provides optimal, customized, and comprehensive solutions across varied High Tech industry segments: Computer Platform and Services Companies, Software Firms, Electronics and Semiconductor Companies, and Professional Services Firms (Legal, HR, Tax & Accounting and Consulting & Advisory/Analyst firms).

    Building on its vast experience in engineering, business process transformation, innovation and IT solutions, TCS offers a comprehensive portfolio of services that maximize growth, manage risk, and reduce costs. The TCS HiTech Industry Solution Unit partners with High Tech enterprises to provide end-to-end solutions which help realize operational excellence, innovation and greater profitability.

    For more information, visit us at http://www.tcs.com/industries/high_tech

About Tata Consultancy Services (TCS)

Tata Consultancy Services is an IT services, consulting and business solutions organization that delivers real results to global business, ensuring a level of certainty no other firm can match. TCS offers a consulting-led, integrated portfolio of IT and IT-enabled infrastructure, engineering and assurance services. This is delivered through its unique Global Network Delivery Model™, recognized as the benchmark of excellence in software development. A part of the Tata Group, India’s largest industrial conglomerate, TCS has a global footprint and is listed on the National Stock Exchange and Bombay Stock Exchange in India.

For more information, visit us at www.tcs.com

Subscribe to TCS White Papers

TCS.com RSS: http://www.tcs.com/rss_feeds/Pages/feed.aspx?f=w

Feedburner: http://feeds2.feedburner.com/tcswhitepapers
