putting your big data management strategy on right track

PUTTING YOUR BIG DATA STRATEGY ON THE RIGHT TRACK

Big data brings a mix of technologies into organizations, and harnessing those tools can be a challenge. But there are steps IT teams can take to put their projects on the path to success.

BY JACK VAUGHAN

UNLOCKING THE BUSINESS BENEFITS IN BIG DATA

1 FINDING THE RIGHT TOOLS
2 DON’T COUNT OUT THE DATA WAREHOUSE
3 DATA BY ANY OTHER NAME
4 GROWING PAINS


Surging volumes of structured and unstructured data, what we’ve come to know as big data, are putting IT and data management teams under the gun. Information of all types is engulfing computer systems in many organizations, complicating efforts to pull valuable business insights out of it through big data analytics initiatives. At the same time, a cavalcade of new technologies has arrived to help companies cope with the data influx, but sorting through those technologies is often an intimidating task in itself.

In addition, IT managers must assess whether Hadoop clusters, NoSQL databases and other big data management tools can fit comfortably into existing systems architectures or if architectural modifications are needed to accommodate them. The answer varies based on factors such as planned uses, organizational structures and IT maturity.

And the burgeoning business-side interest in extracting business value and deriving competitive advantages from vaults of big data means that there isn’t a lot of time to make those assessments and choose between the available technology options. In more and more companies, big data is viewed as a precious resource that business leaders and data scientists want to sift through like prospectors looking for precious metals.

This “big data gold rush” puts added pressure on IT and data management strategists to quickly deliver systems that can handle the growing amounts, and increasing variety, of incoming data.

One of the biggest issues in planning a big data strategy is where to put all the data for processing and analysis. It wasn’t long ago that transactional data was the primary concern and that the options for managing it boiled down to a handful of relational databases. Multidimensional databases, columnar software and other specialized analytical engines added some choices for warehousing data from transaction systems for analysis. Even so, in many companies the big decision was: enterprise data warehouse (EDW) or collection of independent data marts?

But things have changed. Collecting and analyzing data from social media sites, sensors, system logs and other nontransactional sources has become a priority for many organizations. And big data technologies that can support those initiatives have proliferated to such an extent that the number of different, and disparate, options is dizzying.

Matthew Aslett, an enterprise software analyst at research and advisory company The 451 Group, has depicted the plethora of data storage and management choices now available in the form of a London Underground subway map, arraying the available technologies as stations along color-coded lines representing different product categories. In addition to conventional databases, a sampling of those categories includes Hadoop file system implementations as well as schemaless NoSQL databases and “NewSQL” hybrids that use SQL-based relational data models but aim to provide NoSQL-like levels of data scalability. Heightening the potential for buyer bewilderment even more, some categories house technologies of widely varying stripes. In particular, NoSQL is an umbrella term that encompasses a diverse mix of graph databases; document, column and key-value stores; and other types of repositories.

Initially, many big data applications were “greenfield” projects that didn’t face some of the issues of typical application development initiatives, such as the need to integrate with legacy systems or structured data sources. Often, technology-savvy data analysts and other business users took a first hack at doing something with unstructured or semi-structured data under the radar of IT and business intelligence managers, taking advantage of the open source nature of Hadoop and many NoSQL tools. But big data is definitely on the corporate radar now, and the drive to incorporate nontransactional forms of data into mainstream analytics processes is making effective deployment and management of big data systems by IT teams a necessity.

There are some fundamental steps that companies can take to get started on harnessing big data technologies and putting their projects on the path to success. Let’s take a closer look at a few of them.

1 FINDING THE RIGHT TOOLS

It’s still early in the big data adoption cycle, and different companies are trying out different technologies, sometimes with the same end goal, as a look at available user case studies shows:



• NoSQL databases are being used to analyze network failure and degradation patterns, manage digital assets and track and correlate Web server log activity, among other applications.

• Hadoop systems are being employed for uses such as matching highway traffic patterns with cell phone usage data, evaluating consumer buying behavior for more targeted ethnic demographics and creating new financial services products based on real-time analysis of customer activity.

• NewSQL databases have been tapped to support applications that include automating real-time pricing for air travel and improving the scalability of utility database systems.

• Analytical databases have been applied in initiatives such as dissecting website user activity and uncovering trends in GPS information collected from taxis.

The key is to pick the right database for the job at hand, in the same way bettors at a race track try to choose “the horse for the course,” a phrase that refers to the ability of some thoroughbreds to run better on dirt or grass, or on a dry or muddy track. But multiple database horses might be required for different courses within a big data environment.

ThoughtWorks Inc., a Chicago-based software development services company that also sells application lifecycle management tools, has created a hypothetical online retail application framework to illustrate the concept of polyglot persistence, or using a variety of database technologies to handle different types of data based on which technology is the best fit in each individual case. For example, a key-value NoSQL data store might be best for managing website user-session data as part of the retail framework, according to the ThoughtWorks model. But it envisions the use of four other flavors of NoSQL databases for tasks such as processing online shopping-cart data, powering the site’s recommendation engine and storing user activity logs.

And SQL-based relational databases still have their place in this new polyglot world. In the online retail framework, relational technology is depicted as a good fit for financial data that requires transactional updates and is best served by a tabular structure. Reporting also could be the province of a relational database with SQL interfaces at the ready for exchanging data with reporting tools.
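To make the polyglot persistence idea concrete, here is a minimal Python sketch of a routing table that sends each kind of data to a category of store. Only the session, financial and reporting assignments come from the framework as described above; the other three assignments, along with every name in the code, are illustrative assumptions rather than ThoughtWorks' actual design.

```python
# Store categories; names are hypothetical labels, not product names.
KEY_VALUE = "key-value store"
RELATIONAL = "relational database"
GRAPH = "graph database"
COLUMN = "column store"

# Data kind -> store category.
ROUTES = {
    "user_session": KEY_VALUE,    # described in the article
    "financial": RELATIONAL,      # described in the article
    "reporting": RELATIONAL,      # described in the article
    "shopping_cart": KEY_VALUE,   # assumption for illustration
    "recommendation": GRAPH,      # assumption for illustration
    "activity_log": COLUMN,       # assumption for illustration
}


def store_for(data_kind: str) -> str:
    """Return the store category configured for a kind of data."""
    try:
        return ROUTES[data_kind]
    except KeyError:
        raise ValueError(f"no store configured for {data_kind!r}") from None


session_store = store_for("user_session")
ledger_store = store_for("financial")
```

Keeping the mapping in one place means that a decision to move, say, activity logs to a different technology touches a single entry instead of every caller.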

Relational databases are efficient at processing transactions, and through their support for characteristics such as transactional atomicity and consistency, they offer reliability and data recovery capabilities that NoSQL technologies typically can’t match. But relational software often isn’t suited to text and other unstructured forms of big data. And it requires “a lot of maintenance on the back end,” including the need to carefully construct data schemas and modify them when business requirements change, said Pramod Sadalage, a principal consultant at ThoughtWorks. Those issues are minimized with NoSQL and Hadoop offerings.

“What we’re saying is, ‘Give the things that belong to a certain task to a certain database,’” Sadalage said. “If you have, for example, a [product] catalog, put it in a database that is well suited for that; then searches go faster.”
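The atomicity guarantee mentioned above can be seen in a small Python sketch that uses the standard library's sqlite3 module as a stand-in for a relational database. The accounts table, its CHECK constraint and the transfer function are hypothetical, chosen only to show a multi-statement update that either fully applies or fully rolls back.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` between accounts as one atomic unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # Credit first, so a failed debit leaves a partial update to undo.
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE id = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE id = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False  # the CHECK constraint rejected an overdraft


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "id TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('checking', 100), ('savings', 0)")
conn.commit()

ok = transfer(conn, "checking", "savings", 60)      # fits within the balance
failed = transfer(conn, "checking", "savings", 60)  # would overdraw checking
balances = dict(conn.execute("SELECT id, balance FROM accounts"))
```

In the second call the credit to savings has already been applied when the debit is rejected, and the rollback undoes it, so the balances are the same as after the first transfer.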

2 DON’T COUNT OUT THE DATA WAREHOUSE

Big data management projects might be born because existing data warehouse systems are beginning to sag under the weight of the data that is flooding into organizations. But that doesn’t mean data warehouses are all of a sudden obsolete, just that the nature of warehousing data is changing to make room for big data. “Different styles of data warehouse architecture have come and gone over the years,” said Philip Russom, data management research director at The Data Warehousing Institute (TDWI) in Renton, Wash. “As we move to bigger volumes and diversity of data, we have to again evolve the data warehouse, just as we have in the past.”

Hadoop-based big data systems initially were viewed as potential data warehouse killers, but that sentiment has largely given way to expectations of peaceful coexistence. For example, 78% of 263 IT professionals, business users and consultants surveyed by TDWI in November 2012 said they thought Hadoop systems could be a useful complement to their data warehouses for supporting advanced analytics applications. In addition, 41% saw Hadoop as an effective staging area for information on its way to a data warehouse. Asked if Hadoop clusters could fully replace an EDW, more than half of the respondents said no; just 4% said yes (see Figure 1).

Russom thinks that using Hadoop to stage data for loading into data warehouses is a “beachhead” for big data technologies in companies. But the staging process itself is one aspect of data warehousing that has changed significantly in recent years, he said. In many cases, raw data is likely to pile up in Hadoop systems and initially be analyzed there. “In the old days, the data staging area was pretty temporary,” Russom said. “But it has evolved to become a kind of archive.”

Even so, he doesn’t expect those archives to exist in isolation, disconnected from data warehouses. Some of the data will be moved into EDWs, perhaps in the form of aggregated analytics results, and the two technologies increasingly are being used in tandem, according to Russom. “Hadoop-enabled analytics are sometimes deployed in silos, but the trend is toward integrating Hadoop and EDW data at analysis time for maximal visibility into business performance,” he wrote in a report about the TDWI survey.

FIGURE 1: HADOOP VERSUS THE DATA WAREHOUSE

Question                                                                       Yes    Maybe   No
Can the Hadoop Distributed File System (HDFS) augment your enterprise
data warehouse?                                                                50%    47%     3%
Can the HDFS replace your enterprise data warehouse?                           4%     37%     59%

SOURCE: THE DATA WAREHOUSING INSTITUTE. BASED ON A SURVEY OF 263 IT PROFESSIONALS, BUSINESS USERS AND CONSULTANTS CONDUCTED IN NOVEMBER 2012.

3 DATA BY ANY OTHER NAME

Big data projects begun as skunkworks or standalone undertakings do run the risk of creating information silos. To prevent that, organizations should incorporate them into an overall data management strategy from the start, said Mark Beyer, an analyst at Gartner Inc. in Stamford, Conn. That means asking many of the same questions IT teams ask about conventional data as part of data quality and governance programs, he added. For example, where did a particular set of big data come from, how long must it be kept and does it need to be remediated before being used?

Beyer said applying proven data management processes to pools of big data is especially important with information that comes from external sources, including what he described as “crowdsourced” data collected from Facebook, Twitter and other social networks. With such data, “you don’t know if the ‘create case’ matches the use case,” he said. Understanding the origins of data and factors such as how fast it changes is crucial to effective big data management, he advised.
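One way to keep those governance questions attached to the data is to record the answers alongside each data set. The Python sketch below is a minimal illustration of that idea; the DatasetRecord class, its field names and the sample values are all hypothetical and are not drawn from Gartner or any particular governance tool.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class DatasetRecord:
    """Answers to basic governance questions about one data set."""
    name: str
    source: str              # where did the data come from?
    external: bool           # collected outside the organization?
    acquired_on: date
    retention_days: int      # how long must it be kept?
    needs_remediation: bool  # must it be cleaned before use?

    def expires_on(self) -> date:
        """Last day of the retention period."""
        return self.acquired_on + timedelta(days=self.retention_days)

    def ready_for_analysis(self, today: date) -> bool:
        """Usable only if already remediated and still within retention."""
        return not self.needs_remediation and today <= self.expires_on()


# Hypothetical example: social media data that has not been cleaned yet.
mentions = DatasetRecord(
    name="brand_mentions",
    source="Twitter",
    external=True,
    acquired_on=date(2013, 1, 1),
    retention_days=90,
    needs_remediation=True,
)
expiry = mentions.expires_on()
usable = mentions.ready_for_analysis(date(2013, 2, 1))
```

A record like this does not make external data more accurate, but it does make the unanswered questions visible before the data reaches an analytics job.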

The bottom line, Beyer said, is that “big data assets are no more accurate than any other digital information,” and often less so. As a result, he warned IT managers to get ready for a bumpy ride: “Big data is an invader. Big data breaks things. You don’t control it.” Asserting control over the data once it’s in an organization’s systems could mean the difference between success and failure in making effective use of the information.

4 GROWING PAINS

It’s also important to recognize that technologies such as Hadoop, its associated MapReduce programming model and NoSQL databases aren’t automatic cure-alls for a company’s data management needs. In addition to the data quality and governance challenges, technical complexities lurk around the corners of big data environments.

For many companies, complexity comes in the form of Java-based development. Java is the programming language of choice for Hadoop and other big data technologies. But even the large army of experienced Java developers faces challenges in working with Hadoop because it doesn’t include native support for SQL. As a result, developers can run into difficulties in creating MapReduce programs to distill Hadoop data into subsets for processing on different compute nodes in a cluster, said Paul Dix, CEO and founder of Errplane, a New York-based consultancy and developer of application monitoring software. “Most Java developers face issues in how they think about processing data into the MapReduce paradigm,” said Dix, who also is a member of the New York Hadoop User Group. “They have to learn how to write MapReduce code to work with Hadoop; they have to learn to structure the problem correctly.”

A SNAPSHOT OF THE BIG DATA TECHNOLOGY LANDSCAPE

IT architects building big data systems have a variety of technology components at their disposal.

• Distributions of the Hadoop file system and related MapReduce programming model are offered by Cloudera, Hortonworks, MapR Technologies and other vendors.

• Hadoop is not an island: The open source software framework is backed by a long list of supporting tools, including Hive, HBase, Pig, HCatalog and ZooKeeper.

• NoSQL database technology has grown into a flourishing market seemingly overnight, populated with products such as CouchDB, Cassandra, MongoDB, RavenDB, Redis, Riak, Neo4j and InfiniteGraph.

• Hybrid mixes of relational and non-relational technologies are emerging. Referred to as “NewSQL” databases, they include the likes of VoltDB, NuoDB, ScaleBase and Drizzle.

• Analytical databases based on a mix of relational, columnar and massively parallel processing technology include Sybase IQ, Teradata Aster, IBM Netezza, HP Vertica, Greenplum and ParAccel.

Programming directly in MapReduce isn’t the only path developers can take. “There are a lot of ways to do Hadoop without writing MapReduce programs from scratch,” said Paul Mackles, senior manager of software architecture at software vendor Adobe Systems Inc. in San Jose, Calif. For example, Hive, an open source Hadoop offshoot, offers a table-based data model and a SQL-like language that automatically compiles queries into MapReduce jobs for analyzing data in Hadoop systems. Apache Pig is a separate platform with a high-level language for creating highly parallelized MapReduce programs. In addition, software vendors such as Cloudera Inc. are starting to offer their own SQL query engines for Hadoop.

Mixing Java skills and SQL add-ons doesn’t assure Hadoop success, though. Converting queries to MapReduce in Hive “works fairly well, but it isn’t always a clean transition,” Dix said.

Hive queries often require tuning to attain the best possible performance, according to Mackles. Data joins are “not its strong suit,” he said during a presentation at TDWI’s 2013 BI Executive Summit in Las Vegas. Working with MapReduce typically incurs performance hits at the start of query jobs and imposes more processing overhead while they’re running, he added.

Finding a good starting point for a would-be Hadoop development team can help build both skills and confidence. One possible starter project recommended by Dix: putting Web server log files into a Hadoop cluster and then applying MapReduce to the data to find out, say, average response times on webpages or the number of page-loading errors generated by a Web application. “That’s the low-hanging fruit,” he said.
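The shape of that starter project can be tried out before a cluster exists. The following Python sketch simulates the map, shuffle and reduce phases in a single process to compute average response time and error counts per page. It is not Hadoop code: the three-field log format, the sample lines and the choice to count only 5xx statuses as errors are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical log format: "<path> <HTTP status> <response time in ms>"
LOG_LINES = [
    "/home 200 120",
    "/home 200 80",
    "/cart 500 300",
    "/cart 200 100",
    "malformed line",
]


def map_line(line):
    """Map: emit (path, (response_ms, is_error)) for each well-formed line."""
    parts = line.split()
    if len(parts) != 3 or not parts[1].isdigit() or not parts[2].isdigit():
        return  # skip lines that do not match the assumed format
    path, status, ms = parts
    yield path, (int(ms), int(status) >= 500)  # assumption: 5xx is an error


def shuffle(pairs):
    """Shuffle: group all mapped values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_path(path, values):
    """Reduce: average response time and error count for one path."""
    total_ms = sum(ms for ms, _ in values)
    errors = sum(1 for _, is_error in values if is_error)
    return path, total_ms / len(values), errors


mapped = (pair for line in LOG_LINES for pair in map_line(line))
results = sorted(reduce_path(p, v) for p, v in shuffle(mapped).items())
```

The point Dix makes about structuring the problem shows up here: the work has to be split into a per-line step that needs no shared state and a per-key step that sees only the values for one page.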

Mackles listed a variety of new and upgraded tools that are being developed to help organizations get over the big data hump. They include a second-generation version of MapReduce called YARN; a table and storage management utility named HCatalog; and Hadoop 2.0, which is available in an alpha release and is designed to make real-time processing and querying of Hadoop data more feasible, among other improvements. “Hadoop has been around long enough that I think a lot of the shortcomings are pretty well known,” Mackles said, adding that Hadoop 2.0 addresses many of the issues.

Those technologies and others might well help the big data management and analytics cause, but they further add to the vast and growing forest of tools that IT, data warehousing and data management professionals need to navigate in planning and managing deployments. It’s a challenge that likely will be faced in more and more companies, though. In the TDWI survey, only 10% of the respondents said their organizations had Hadoop systems in production use, but another 51% said they expected to be Hadoop users within three years. The corporate spotlight will be on the IT teams responsible for building scalable big data systems and integrating them into existing data warehousing and analytics environments. Finding the right technologies, and managing the process in a way that gets the most out of them, will help keep the glare of that light from getting too hot.


ABOUT THE AUTHOR

JACK VAUGHAN is news and site editor of SearchDataManagement.com. He covers big data management, data warehousing, databases and data integration. Vaughan was an editor for TechTarget’s SearchSOA.com, SearchVB.com, TheServerSide.net and SearchDomino.com websites. Email him at [email protected].

Putting Your Big Data Strategy on the Right Track is a SearchBusinessAnalytics.com e-publication.

Scot Petersen, Editorial Director

Jason Sparapani, Managing Editor, E-Publications

Joe Hebert, Associate Managing Editor, E-Publications

Craig Stedman, Executive Editor

Melanie Luna, Managing Editor

Mark Brunelli, News Director

Linda Koury, Director of Online Design

Neva Maniscalco, Graphic Designer

Doug Olender, Publisher
[email protected]

Ed Laplante, Director of Sales
[email protected]

TechTarget Inc. 275 Grove Street, Newton, MA 02466
www.techtarget.com

© 2013 TechTarget Inc. No part of this publication may be transmitted or reproduced in any form or by any means without written permission from the publisher. TechTarget reprints are available through The YGS Group.

About TechTarget: TechTarget publishes media for information technology professionals. More than 100 focused websites enable quick access to a deep store of news, advice and analysis about the technologies, products and processes crucial to your job. Our live and virtual events give you direct access to independent expert commentary and advice. At IT Knowledge Exchange, our social community, you can get advice and share solutions with peers and experts.
