
THE INS AND OUTS OF HARNESSING HADOOP

Hadoop clusters make it easier for organizations to process and analyze streams of big data. But there are limits to what the open source technology can do, plus implementation and management challenges.

BY JACK VAUGHAN AND ED BURNS

UNLOCKING THE BUSINESS BENEFITS IN BIG DATA

CONTENTS

1. YARN SPINS NEW FLEXIBILITY
2. AVOID DASHED EXPECTATIONS
3. MORE TO IT THAN MORE NODES
4. NEED FOR ANALYTICS SPEED

INTRODUCTION

The Hadoop distributed processing framework presents IT, data management and analytics teams with new opportunities for processing, storing and using data, particularly in big data applications.

But it also confronts them with new challenges as they look to deploy and work with Hadoop systems. And because Hadoop and the large number of open source technologies surrounding it are evolving quickly, organizations must be prepared for frequent changes, most immediately in the Hadoop 2 release.

The new version, which the Apache Software Foundation released in October 2013, will eventually take Hadoop far beyond its current core configuration, which combines the Hadoop Distributed File System (HDFS) with Java-based MapReduce programs. Early-adopter companies are using that pairing to help them deal with large amounts of transaction data, as well as server and network log files, sensor data, social media feeds, text documents, image files and other types of unstructured and semi-structured data.

Hadoop typically runs on clusters of commodity servers, resulting in relatively low data processing and storage costs. And because of its ability to handle data with very light structure, Hadoop applications can take advantage of new information sources that don't lend themselves to traditional databases, said Tony Cosentino, an analyst at Ventana Research.

But Cosentino added in an email that implementations of the existing Hadoop architecture are restricted by its batch-processing orientation, which makes it more akin to a truck than a sports car on performance. Hadoop "is ideally suited where time latency is not an issue and where significant amounts of data need to be processed," he said.

In its HDFS-MapReduce configuration, Hadoop is very good at analysis of very large, static unstructured data sets consisting of many terabytes or even petabytes of information, said William Bain, CEO of ScaleOut Software Inc., a vendor of data grid software. As an example, he cited a sentiment analysis application on a huge chunk of Twitter data aimed at discerning what customers are thinking, and tweeting, about a company or its products.

Like Cosentino, Bain emphasized that because of its batch nature and large startup overhead on processing jobs, Hadoop generally hasn't been useful in real-time analysis of live data sets, at least not as it's currently constituted. But some vendors have recently introduced query engines designed to support ad hoc analysis of Hadoop data.

Data warehousing applications involving large volumes of data are good targets for Hadoop uses, according to Sanjay Sharma, a principal architect at software development services provider Impetus Technologies Inc. How large? It varies, he said: "Tens of terabytes is a sweet spot for Hadoop, but if there is great complexity to the unstructured data, it could be tens of gigabytes."

Some users, such as car-shopping information website operator Edmunds.com Inc., have deployed Hadoop and related technologies to replace their traditional data warehouses. But Hadoop clusters often are being positioned as landing pads and staging areas for the data gushing into organizations. In such cases, data can be pared down by MapReduce, transformed into or summarized in a relational structure and moved along to an enterprise data warehouse or data marts for analysis by business users and analytics professionals. That approach also provides increased flexibility: The raw data can be kept in a Hadoop system and modeled for analysis as needed, using extract, load and transform processes.
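To make that landing-pad pattern concrete, here is a minimal Java sketch, not drawn from any of the deployments described in this story, of a MapReduce job that pares raw web server logs down to per-page hit counts. The log layout, field positions and class names are assumptions for illustration; the summarized output is the kind of compact, structured result that could then be loaded into a warehouse table.

```java
// Hypothetical example: summarize raw, tab-delimited web server log lines into
// per-page hit counts before loading the results into a data warehouse.
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageHitSummary {

    // Parse each raw log line and emit (requested page, 1).
    public static class ParseMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text page = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > 2) {             // skip malformed lines
                page.set(fields[2]);             // assumed: third field is the page URL
                context.write(page, ONE);
            }
        }
    }

    // Sum the hit counts for each page; the output is a compact, structured
    // summary suitable for bulk loading into a relational warehouse table.
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long total = 0;
            for (LongWritable v : values) {
                total += v.get();
            }
            context.write(key, new LongWritable(total));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "page hit summary");
        job.setJarByClass(PageHitSummary.class);
        job.setMapperClass(ParseMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-aggregate on map nodes to cut shuffle traffic
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // raw logs in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // summarized output
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```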

Sharma describes such implementations as a "data lake" for downstream processing. Colin White, president of consultancy BI Research, uses the term "business refinery." In a report released in February 2013, Gartner Inc. analysts Mark Beyer and Ted Friedman wrote that using Hadoop to collect and prepare data for analysis in a data warehouse was the most-cited strategy for supporting big data analytics applications in a survey conducted by the research and consulting company. An even 50% of the 272 respondents said their organizations planned to do so during the next 12 months.

    The vibrancy of the open source ecosystem that surrounds Hadoop can hardly be overstated.

From its earliest days, Hadoop has attracted software developers looking to create add-on tools to fill in gaps in its functionality. For example, there are HBase, Hive and Pig: respectively, a distributed database, a SQL-style data warehouse and a high-level language for developing data analysis programs in MapReduce. Other supporting actors that have become Hadoop subprojects or Apache projects in their own right include Ambari, for provisioning, managing and monitoring Hadoop clusters; Cassandra, a NoSQL database; and ZooKeeper, which maintains configuration data and synchronizes distributed operations across clusters.
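As a small illustration of how one of those ecosystem components is used from application code, here is a hedged sketch of writing a row to HBase through its Java client API. The table name ("web_events"), column family ("d") and row key scheme are hypothetical, and the exact client classes vary by HBase release; this follows the interfaces introduced in HBase 1.0.

```java
// Hypothetical sketch: insert one row into an existing HBase table.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("web_events"))) {
            // Row key and columns are placeholders chosen purely for illustration.
            Put put = new Put(Bytes.toBytes("user123#2013-10-15T12:00:00"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("page"), Bytes.toBytes("/home"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("status"), Bytes.toBytes("200"));
            table.put(put);
        }
    }
}
```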

1. YARN SPINS NEW FLEXIBILITY

And now Hadoop 2 is entering the picture. Central to the update is YARN, an overhauled resource manager that enables applications other than MapReduce programs to work with HDFS. By doing so, YARN (short, good-naturedly, for Yet Another Resource Negotiator) is meant to free Hadoop from its reliance on batch processing while still providing backward compatibility with existing application programming interfaces.
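The backward-compatibility point is easy to miss, so here is a minimal sketch: an ordinary MapReduce driver written against the standard org.apache.hadoop.mapreduce API needs no code changes to run on YARN. The only YARN-specific piece is the mapreduce.framework.name setting, which in practice normally lives in mapred-site.xml rather than in code; the identity job below exists purely for illustration.

```java
// A deliberately trivial job (identity mapper and reducer) to show that the
// Hadoop 1-style MapReduce API runs unchanged once YARN is the framework.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class YarnCompatibleDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Selecting YARN is a configuration choice, shown in code only for clarity.
        conf.set("mapreduce.framework.name", "yarn");

        Job job = Job.getInstance(conf, "yarn compatibility check");
        job.setJarByClass(YarnCompatibleDriver.class);
        job.setMapperClass(Mapper.class);     // identity mapper
        job.setReducerClass(Reducer.class);   // identity reducer
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```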

"YARN is the key difference for Hadoop 2.0," Cosentino said, using the release's original name. Instead of letting a MapReduce job see itself as the only tenant on HDFS, he added, it allows for multiple workloads to run concurrently. One early example comes from Yahoo, which has implemented the Storm complex event processing software on top of YARN to aid in funneling data about the activities of website users into a Hadoop cluster.

FIGURE 1: MINORITY GROUP

Hadoop and MapReduce might be all the buzz, but the percentage of organizations choosing not to deploy the software might surprise you.

In use now: 9%
Plan to add: 29%
No current plans to add: 62%

Based on responses from 387 IT, business intelligence, analytics and business professionals in organizations with data warehouses installed or under development. Source: TechTarget's 2013 Analytics & Data Warehousing Reader Survey.

Hadoop 2 also is due to bring high availability improvements, through a new feature that enables users to create a federated name (or master) node architecture in HDFS instead of relying on a single node to control an entire cluster. Meanwhile, commercial vendors are brewing up additional management-tool elixirs, such as new job schedulers and cluster provisioning software, in an effort to further boost Hadoop's enterprise readiness.

Hadoop use still isn't widespread. In a 2013 survey of TechTarget readers on business intelligence, analytics and data warehousing technologies, the percentage of active Hadoop and MapReduce users was still in the single digits, and nearly two-thirds of respondents said their organizations had no current plans to deploy the technologies (see Figure 1). Even in companies with big data programs in place or planned, Hadoop ranked fourth on the list of technologies being used or eyed to help underpin the initiatives (see Figure 2).

Because Hadoop is novel to most users, deploying it can present unfamiliar challenges to project teams, especially if they don't have experience with open source software or parallel processing on distributed clusters.

    Even seasoned IT hands may find surprises in working with Hadoop, for much assembly typically is required.

FIGURE 2: DOWN THE LIST

Hadoop isn't the first choice among the top technologies being used by organizations to support their big data environments.

Mainstream relational databases or data warehouses: 55%
Specialized analytical databases: 52%
Data warehouse appliances: 46%
Hadoop clusters: 41%

Based on responses from 222 IT, business intelligence, analytics and business professionals in organizations with active or planned big data management and analytics programs; respondents were asked to choose all technologies that applied. Source: TechTarget's 2013 Analytics & Data Warehousing Reader Survey.


2. AVOID DASHED EXPECTATIONS

IT managers and corporate executives might look at what the large Internet companies that first honed Hadoop are doing with it and see a chance to do bigger systems at less cost, said Ofir Manor, a product manager and technical architect at Gene by Gene Ltd., a Houston-based genetic testing services company. But Manor, who also writes a blog on data technologies, added that those expectations can be difficult to meet.

"It's relatively easy to do a small Hadoop implementation and try it out," he said. "Playing with the technology can be fun. But to move it to the infrastructure level is hard." In addition to the technical challenges, another issue Manor cited is that IT operations often work in silos, with separate teams handling systems administration, database administration, storage, networking, security, application development and so on. That approach can lead to problems in managing Hadoop clusters, he warned: Hadoop requires more teamwork than usual, and enterprises may fall into a "which team owns the platform?" debate.

Navigating the open source software culture can be a hurdle for some companies, too. The commercial distributions of Hadoop offered by a variety of IT vendors do help simplify the process of rolling out and supporting the software. But Manor said organizations have to ask themselves if they're ready and willing to commit their own developers to involvement in the Hadoop community, which can aid in efforts to take full advantage of the technology.

    Successfully implementing Hadoop requires first coming to terms with the process of setting up the computer cluster that will run the software. And while clusters are usually built around low-cost and easy-to-use servers, there are numerous configuration settings and issues to work through up front.

"Hadoop is a very complex environment. There are a lot of moving parts," said Douglas Moore, a consultant at Think Big Analytics, a consulting and development services provider that focuses on big data deployments. Moore said a Hadoop implementation team needs to make sure the size and overall design of its system are sufficient to handle the pipeline of data that will be fed into the cluster. Job scheduling routines and the performance of disk drives and other hardware components can also factor into the Hadoop-cluster performance equation.



For example, RAID Level 0 striping of data across a disk array, typically turned on by default in Hadoop systems, can shackle I/O speeds to the rate of the slowest drive in an array. In addition, a single disk failure can take down an entire array and temporarily knock all of a cluster node's data offline. As a result, various Hadoop vendors and consultants recommend configuring the disks in a cluster as separate devices or limiting RAID striping to pairs of disks.


HADOOP BREAKS FREE

Hadoop initially was the province of the large Internet companies that created it, and the likes of eBay, Facebook, LinkedIn, Twitter and Yahoo remain marquee users of the technology. But the number of other types of organizations that are looking to ride the Hadoop surge is growing.

NASA is using Hadoop to make climate data available via a cloud-based service to researchers outside its walls. To help predict and improve crop yields, agricultural and chemical company Monsanto is loading geospatial data from internal and external sources into Hadoop for processing and then moving the files to HBase, its companion NoSQL database, for analysis. Data storage technology vendor NetApp uses Hadoop to cull log data from sensors to monitor the performance of its equipment at customer sites. Telecom service provider China Mobile Group Guangdong built a Hadoop-based system to support online bill payments and provide new data analytics capabilities internally.

Marketing and advertising analysis is another common application for Hadoop and related big data technologies. Edmunds.com, which publishes automobile pricing data and vehicle reviews online, deployed a combination of Hadoop and HBase to help business analysts fine-tune its paid-search marketing and keyword bidding processes. Retailer Kohl's plans to use Hadoop to enable business users to analyze store and website data. And Luminar, a company that analyzes data about Hispanic consumers in the U.S. for retailers, manufacturers and other clients, replaced a traditional data warehouse with a Hadoop system to power its analytical modeling.

But Colin White, president of consultancy BI Research, thinks Hadoop has the potential to spark new and innovative applications, not just to step in and take the place of traditional systems. "My concern is in seeing Hadoop used for a bunch of workloads in which it is reinventing the wheel," he said. "I'd rather see it moving in the direction of solving problems we haven't solved."

JACK VAUGHAN


Also, because Hadoop is so often combined with supporting software such as HBase and Hive, pinpointing the sources of performance problems can be, well, problematic. In working with clients to optimize cluster performance, Moore and his fellow consultants find that in many cases the first suspect isn't necessarily the culprit.

"We've been brought in for technology assessments by people who think they had an issue with HBase failing," he said. "But the fact is, the problem could be with how their workflow is set up, how they're rolling jobs into a cluster."

3. MORE TO IT THAN MORE NODES

The use of commodity servers makes it relatively inexpensive to add more nodes to a cluster. And with the fast-paced growth of Google, Twitter and other Web powerhouses, and the corresponding expansion of their data processing requirements, scaling out clusters as needed to boost performance became a common strategy. But that approach isn't likely to fly in more traditional organizations, said Vin Sharma, director of product marketing for Hadoop at Intel Corp.

It's true that "throw another node at it" may have become a mantra at fast-growing Web monsters, but it won't be repeated in the typical enterprise, Sharma said. Instead, he expects to see a focus on troubleshooting performance problems. Doing so in a Hadoop cluster, though, is more complicated than in the average system, he said. It requires expertise that not every organization has in-house.

The first order of business once a cluster is set up, according to Sharma, is to deploy performance monitoring tools to help identify bottlenecks. He also recommends checking MapReduce applications to ensure that they've been designed for optimal performance on a cluster. "If [an application] requires a lot of network communication, it may not be a good fit."
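One hedged illustration of that kind of check: MapReduce's built-in job counters show how much map output crosses the network during the shuffle, which is often the first place to look for the communication-heavy behavior Sharma describes. The helper below assumes a completed Job object from the standard org.apache.hadoop.mapreduce API; the threshold and messages are illustrative, not a recommendation from anyone quoted here.

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.TaskCounter;

public class ShuffleVolumeReport {

    // Print the built-in counters that reveal how much data a finished job
    // pushed across the network during the shuffle phase. If shuffle bytes stay
    // close to total map output bytes, a combiner or a more selective mapper
    // could cut network traffic substantially.
    public static void report(Job completedJob) throws Exception {
        Counters counters = completedJob.getCounters();
        long mapOutputBytes = counters.findCounter(TaskCounter.MAP_OUTPUT_BYTES).getValue();
        long shuffleBytes = counters.findCounter(TaskCounter.REDUCE_SHUFFLE_BYTES).getValue();

        System.out.printf("Map output bytes:     %,d%n", mapOutputBytes);
        System.out.printf("Reduce shuffle bytes: %,d%n", shuffleBytes);

        if (mapOutputBytes > 0 && shuffleBytes > mapOutputBytes / 2) {
            System.out.println("Most map output is crossing the network; "
                    + "consider adding a combiner via job.setCombinerClass(...).");
        }
    }
}
```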

Hadoop itself might not be the right choice to begin with: The feverish interest in the technology shouldn't obscure the fact that it's not the best option for every application, Cosentino cautioned. "Don't think about technology first," he said. "Think first about the business problem you're trying to solve, because you may not even need a Hadoop cluster."

And while it's tempting, the inclination to follow the lead of the Internet giants down the Hadoop path shouldn't be an absolute, Manor said, noting that the needs of those companies and other types of businesses are often different. "The tools to solve the [online] scalability issue are not always a good fit for enterprise challenges," he said.

One particular case in point: real-time analytics applications involving ad hoc querying of Hadoop data. Hadoop is optimized to crunch through large data sets, but its batch-processing power doesn't equate to data analysis speed.

4. NEED FOR ANALYTICS SPEED

And Jan Gelin, vice president of technical operations at Rubicon Project, said analytics speed is something that the online advertising broker needs, badly. The company, based in Playa Vista, Calif., offers a platform for advertisers to use in bidding for ad space on webpages as Internet users visit the pages. The system allows the advertisers to see information about website visitors before making bids, in order to ensure that ads will only be seen by interested consumers. Gelin said the process involves "a lot of analytics and it all has to happen in fractions of a second."

Rubicon leans heavily on Hadoop to help power the ad-bidding platform. The key, Gelin said, is to pair it with other technologies that can handle true real-time analysis. Like Yahoo, Rubicon uses the Storm processing engine to capture and quickly analyze large amounts of data as part of the ad bidding process. Storm then sends the data into a cluster running MapR Technologies Inc.'s Hadoop distribution. The Hadoop cluster is primarily used to transform the data to prepare it for more traditional analytical applications, such as business intelligence reporting. Even for that stage, though, much of the information is loaded into a Greenplum analytical database for access by users.

Gelin said the sheer volume of data that Rubicon produces on a daily basis meant it would need a system capable of processing all the information. That's where Hadoop comes in. But, he added, "you can't take away the fact that Hadoop is a batch-processing system. There's other things on top of Hadoop you can play around with that are actually like real real-time."

Several Hadoop vendors are trying to eliminate the real-time analytics restrictions.

Cloudera Inc. got the ball rolling in April 2013 by releasing its Impala query engine, promising the ability to run interactive SQL queries against Hadoop data in near real time. Pivotal, a data management and analytics spinoff from EMC Corp. and its VMware subsidiary, followed three months later with a similar query engine named Hawq. Also looking to get in the game is Splunk Inc., which focuses on capturing streams of machine-generated data; it began beta-testing a Hadoop data analysis tool called Hunk in June 2013.
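For a sense of how these SQL-on-Hadoop engines are used from an application, here is a hedged sketch of issuing a query from Java over JDBC. It uses the open source HiveServer2 driver and URL form because those are widely documented; the host, port, credentials, table and columns are placeholders, and the specific driver class and connection string differ by engine and distribution (Impala, Hawq and Hunk each ship their own connectivity options).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HadoopSqlQueryExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connection details; real values depend on the query
        // engine, cluster host names and security configuration in use.
        String url = "jdbc:hive2://hadoop-gateway.example.com:10000/default";

        Class.forName("org.apache.hive.jdbc.HiveDriver");  // HiveServer2 JDBC driver
        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             // Ad hoc aggregate over data already sitting in Hadoop; the
             // page_hits table and its columns are hypothetical.
             ResultSet rs = stmt.executeQuery(
                     "SELECT page, SUM(hits) AS total_hits "
                   + "FROM page_hits GROUP BY page ORDER BY total_hits DESC LIMIT 10")) {
            while (rs.next()) {
                System.out.println(rs.getString("page") + "\t" + rs.getLong("total_hits"));
            }
        }
    }
}
```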

Hadoop 2 also aids the cause by opening up Hadoop systems to non-MapReduce applications. With all the new tools and capabilities, Hadoop may soon be up to the real-time challenge, said Mike Gualtieri, an analyst at Forrester Research Inc. One big factor working in its favor, he added, is that vendors as well as Hadoop users are determined to make the technology function in real or near real time for analytics applications.

"Hadoop is fundamentally a batch operation environment," Gualtieri said. "However, because of the distributed architecture and because a lot of use cases have to do with putting data into Hadoop, a lot of vendors or even the end users are saying, 'Hey, why can't we do more real-time or ad hoc queries against Hadoop?' And it's a good question."

Gualtieri sees two main challenges. First, he said, most of the new Hadoop query engines still aren't as fast as queries run against mainstream relational databases. Tools like Impala and Hawq provide interfaces that enable end users to write queries in the SQL programming language. The queries then get translated into MapReduce for execution on a Hadoop cluster, but that process is inherently slower than running a SQL query directly against a relational database, according to Gualtieri.

The second challenge that Gualtieri sees is that Hadoop currently is a read-only system once data has been written into HDFS. Users can't easily insert, delete or modify individual pieces of data stored in the file system like they can in a relational database, he said. While the challenges are real, Gualtieri thinks they can be overcome. For example, Hadoop 2 includes a capability for appending data to HDFS files.
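A hedged sketch of what that append capability looks like from the HDFS Java API follows; the file path is hypothetical, and append must be enabled on the cluster (it is on by default in Hadoop 2). Note the contrast with a relational database: the API lets a client add bytes to the end of an existing file, but it still offers no way to update or delete an individual record in place.

```java
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsAppendExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();              // picks up core-site.xml / hdfs-site.xml
        Path logFile = new Path("/data/landing/events.log");   // hypothetical existing file in HDFS

        try (FileSystem fs = FileSystem.get(conf);
             FSDataOutputStream out = fs.append(logFile)) {
            // New records can only be added at the end of the file; existing
            // bytes cannot be modified or deleted in place.
            out.write("2013-10-15T12:00:00\tuser123\t/home\n".getBytes(StandardCharsets.UTF_8));
        }
    }
}
```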

Gartner analyst Nick Heudecker wrote in an email that even though the new query engines don't support true real-time analytics functionality, they do enable end users with less technical expertise to access and analyze data stored in Hadoop. That can decrease the cycle time and cost associated with running Hadoop analytics jobs because MapReduce developers no longer have to write queries, he said.

Organizations will have to decide for themselves whether that's enough of a justification for deploying such tools. Despite all the hype, Hadoop isn't a magic bullet, said Patricia Gorla, a consultant at IT services provider OpenSource Connections LLC. What's important, Gorla said, is finding the best fit for the technology, and not trying to force-fit it into a systems architecture where it doesn't belong. "Hadoop is good at what it's good at and not at what it's not," she said.



ABOUT THE AUTHORS

JACK VAUGHAN is site editor of SearchDataManagement.com. He covers topics such as big data management, data warehousing, databases and data integration. Vaughan previously was an editor for TechTarget's SearchSOA.com, SearchVB.com, TheServerSide.net and SearchDomino.com websites. Email him at [email protected].

ED BURNS is site editor of SearchBusinessAnalytics.com; in that position, he covers business intelligence, analytics and data visualization technologies and topics. He previously was a news writer for TechTarget's SearchHealthIT.com website, and he has also written for a variety of daily and weekly newspapers in eastern Massachusetts. Email him at [email protected].

The Ins and Outs of Harnessing Hadoop is a SearchBusinessAnalytics.com e-publication.

    Scot Petersen Editorial Director

    Jason Sparapani Managing Editor, E-Publications

    Joe Hebert Associate Managing Editor, E-Publications

    Craig Stedman Executive Editor

    Melanie Luna Managing Editor

    Mark Brunelli News Director

    Linda Koury Director of Online Design

    Neva Maniscalco Graphic Designer

    Doug Olender Publisher

    [email protected]

    Annie Matthews Director of Sales

    [email protected]

    TechTarget Inc. 275 Grove Street, Newton, MA 02466

    www.techtarget.com

© 2013 TechTarget Inc. No part of this publication may be transmitted or reproduced in any form or by any means without written permission from the publisher. TechTarget reprints are available through The YGS Group.

About TechTarget: TechTarget publishes media for information technology professionals. More than 100 focused websites enable quick access to a deep store of news, advice and analysis about the technologies, products and processes crucial to your job. Our live and virtual events give you direct access to independent expert commentary and advice. At IT Knowledge Exchange, our social community, you can get advice and share solutions with peers and experts.
