

DM Direct October 2007

Analysis of Large Data Volumes: Challenges and Solutions
by Joseph Rozenfeld

“I have too much data to analyze” is a common complaint of business analysts. Today’s definition of a very large data set has evolved dramatically from what was considered large ten years ago. During my early days at Arbor Software, the company that pioneered the concept of online analytical processing (OLAP) and later became Hyperion Solutions (which was acquired by Oracle), I was chartered with extending our product’s capabilities to support very large data sets. At the time, several gigabytes of data were considered large... and tens of gigabytes were considered huge. My work resulted in a set of product improvements that helped to analyze gigabytes of data at reasonable speeds.

But even then, it was already becoming apparent that analysis of very large data sets would continue to be a significant problem. With recent advances in storage technology and the decline in storage prices, a terabyte of data is not uncommon - and large data sets are now measured in petabytes.

While storage technology advances have been staggering, progress in analytical data processing has been marginal. It is possible to find a piece of data in a petabyte-size storage system (Google, Yahoo! and other search engine technology vendors are a testament to that), but analyzing this data to find correlations and trends meaningful to a business analyst remains a huge challenge.

“Why do we need to analyze this data - and what is the nature of this data that grew in size by a factor of millions over the last ten years?” The answer is simple: we live in a world that is getting more digitized and more connected every day. We use networks to talk, shop, read and work. We all have a digital life that is only growing bigger. Our music and photo libraries are stored and managed by network providers, our answering machines are virtual mailboxes, instant messenger IDs are on our business cards and our reference libraries are online. We spend hours every day in this digital world, so it’s no wonder the amount of data we access online is growing at an exponential – and unstoppable – rate.

What’s more, most businesses today use IT to support every conceivable business function. What this means is that trillions of digital interactions and transactions are generated and carried out by various networks hourly and daily. Where does all this data go? Some ends up in databases; most ends up in log files discarded on a regular basis, because even a petabyte-sized storage system is not large enough to keep all this transaction data gathered over an extended period of time.

The ability to analyze this largely untapped data is the holy grail for business intelligence (BI) practitioners and business owners alike. Imagine, for a moment, what business would be like if companies could analyze all the data flowing through their systems, instead of just a tiny fraction of it:
• A network security analyst could preempt an insider intrusion threat if he could quickly analyze all network transactions along with the HR database transactions related to hiring, firing and compensation changes.
• A CFO could discover financial improprieties if he could analyze financial transactions along with network system log transactions.
• A marketer could make real-time adjustments to a broadly executed marketing campaign if he could analyze transactions from the Web site along with transactions from the enterprise resource planning (ERP) system and call detail records from the call center.

There is much insight to be gained by analyzing large volumes and all types of corporate data, and yet we are compelled not to ask those questions, because our existing BI technologies lack the analytical capabilities to answer them.

This article examines the challenges of analyzing large volumes of complex transactional data and proposes new approaches to data analysis that go beyond conventional BI technology.

Data and Information
Too often the words “data” and “information” are used interchangeably, when there is a very significant distinction between the two. An easy way to think about the difference is as follows:
• Data is the input for the analysis system, and
• Information is the output.

Analysis is the process of turning data into information.
While it may seem basic, this distinction is important, because it opens up a different way to approach the problem of analyzing large volumes of data, instead of relying on traditional database- and data warehouse-centric approaches.

Let’s examine a simple scenario of network flow data analysis: Network flow data is captured for every single chunk (packet) of data that moves on the network. The simple operation of a person looking at a page on a Web site will generate quite a few network flow transactions that capture both the request going from the user to the Web site, and the response going from the Web site to the user. A single network flow transaction is comprised of a source (a source IP address), a target (a destination IP address), and some notion of the size of the data moved. At this level of granularity, the network flow transaction is a good example of data - not information - with the analytical value of an individual transaction close to zero. But as soon as multiple network flow transactions are associated with a single Web page lookup and are aggregated, one could get access to some basic operational information, such as:
• How much data was transferred for a particular Web page?
• How long did it take?
• Were there any errors generated in the process?

And suddenly the Web site operator gains valuable insight into overall Web site performance.
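To make the aggregation step concrete, here is a minimal sketch in Python of how the flow records behind a single page view might be rolled up into those operational answers. The record fields (source IP, destination IP, bytes moved, duration, error flag) are illustrative assumptions, not the format of any particular flow collector.

```python
from dataclasses import dataclass

@dataclass
class FlowRecord:
    """One network flow transaction: raw data, not yet information."""
    src_ip: str          # source IP address
    dst_ip: str          # destination IP address
    bytes_moved: int     # size of the data moved
    duration_ms: int     # how long the transfer took
    error: bool          # whether the transfer failed

def summarize_page_view(flows):
    """Aggregate the flow records behind a single Web page lookup
    into basic operational information."""
    return {
        "total_bytes": sum(f.bytes_moved for f in flows),
        "total_time_ms": sum(f.duration_ms for f in flows),
        "error_count": sum(1 for f in flows if f.error),
    }

# Example: a page view that generated three flow transactions.
page_flows = [
    FlowRecord("10.0.0.5", "192.0.2.10", 48_000, 120, False),
    FlowRecord("10.0.0.5", "192.0.2.10", 310_000, 450, False),
    FlowRecord("192.0.2.10", "10.0.0.5", 1_200, 30, True),
]
print(summarize_page_view(page_flows))
# {'total_bytes': 359200, 'total_time_ms': 600, 'error_count': 1}
```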

But there is much more to be gleaned from these transactions than this simple information. If we continue with the data aggregation and summarization exercise in this example, we might even get to business information such as:
• Quality of service – how much traffic does a particular user generate on the network over a fixed period of time, and how many network errors occur in the process?
• IT chargeback – how much network traffic does a business application generate over a month?
• Compliance and intrusion detection – which users have the highest traffic volume on the network?

These are the questions a business analyst would be interested in asking. But traditional BI tools have not been able to deliver the answers.
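As a rough illustration of the last question above - which users have the highest traffic volume - the following Python sketch ranks source IP addresses by total bytes over an analyzed window. The field names and the mapping from IP address to user are assumptions made for the example.

```python
from collections import defaultdict

def top_talkers(flows, n=3):
    """Rank source IPs by total bytes sent over the analyzed window.
    `flows` is an iterable of (src_ip, bytes_moved) pairs."""
    totals = defaultdict(int)
    for src_ip, bytes_moved in flows:
        totals[src_ip] += bytes_moved
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:n]

flows = [
    ("10.0.0.5", 310_000),
    ("10.0.0.7", 1_200_000),
    ("10.0.0.5", 48_000),
    ("10.0.0.9", 5_000),
]
print(top_talkers(flows))
# [('10.0.0.7', 1200000), ('10.0.0.5', 358000), ('10.0.0.9', 5000)]
```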

This example not only illustrates the differences between data and information, it also explains what needs to happen to improve the process of creating operational and business information from data.

Requirements for the Process of Converting Data into Information

The network flow data example touches on every single requirement for an effective process of converting data into information, including:
• A large data set must be associated with other data (for example, network flow data associated with business application data) to produce meaningful information (see the sketch after this list).
• Several processing steps (aggregations and summarization) may be necessary, first to turn data into operational information, then into business information.
• The entire process has to be reasonably fast (it doesn’t help anyone to identify a security breach a month after it occurs).
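A minimal sketch of the first requirement - associating network flow data with business application data - might look like the following. The IP-to-application mapping and the field names are hypothetical, standing in for whatever inventory or CMDB data an organization actually holds.

```python
from collections import defaultdict

# Hypothetical association data: which destination IP belongs to which
# business application (e.g., drawn from an application inventory).
APP_BY_IP = {
    "192.0.2.10": "online-banking",
    "192.0.2.20": "crm",
}

def traffic_by_application(flows):
    """Associate each flow (dst_ip, bytes_moved) with a business
    application and aggregate bytes per application -- the kind of
    join an IT chargeback report needs."""
    totals = defaultdict(int)
    for dst_ip, bytes_moved in flows:
        app = APP_BY_IP.get(dst_ip, "unattributed")
        totals[app] += bytes_moved
    return dict(totals)

flows = [("192.0.2.10", 500_000), ("192.0.2.20", 80_000), ("203.0.113.9", 4_000)]
print(traffic_by_application(flows))
# {'online-banking': 500000, 'crm': 80000, 'unattributed': 4000}
```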

Unfortunately, these three requirements conflict with each other. The more data that needs to be processed and the more intelligence we want to gain from it, the longer the process is going to take. It’s not surprising that the majority of analytical applications designed for such large data sets as Web and network traffic data are mostly focused on:
• Event correlation, because it can be done on a smaller data set; or
• Operational information, because it requires the least amount of processing.
As we try to extract more valuable information from data, the analysis process takes longer and longer, and as we try to apply this process to large data sets, we begin to hit the performance brick wall.

We know that conventional analytical solutions are severely limited in their capability to get information out of large data sets; so let’s explore the alternatives.

Extracting Information from Large Data Volumes Remains Challenging

There are several known approaches to information extraction when dealing with very large data volumes:
• Search - a technique that is often confused with analysis. Extracting information is a process of transforming data. While a search process is efficient when applied to large data volumes, it merely finds what is already inside data instead of producing information.
• BI or business analytics (BA) - an approach relying on database technologies. While this is the most common approach to the problem, it is fundamentally flawed and breaks down when dealing with very large data sets.

With the BI/BA approach, the main roadblock to analyzing very large data sets is latency:
• If terabytes of data are generated hourly and daily, a highly scalable database is necessary just to keep up with this data volume and get the new data into the database. We saw a major bank trying to analyze its daily Web traffic data using a database. It required about 23 hours to add 24 hours’ worth of data to the database, and then another two hours to run analytical queries against this data - roughly 25 hours of processing for every 24 hours of data. It was only natural that the bank was falling behind every day and was forced to start sampling data, which created credibility problems for its business analysts.
• If terabytes of data need to be perused in order to run an analytical query, and if the database is growing daily, the latency of analytical queries will increase exponentially and eventually will render the entire system unusable. This is the main reason why business information is rarely available for very large data sets.

New Technologies are Emerging to Address the Data Volume Challenge

Three problems need to be solved when dealing with analysis of very large data sets:
• First, it is necessary to create associations between multiple data sources that can be quite complex. For instance, mapping a Web page to network traffic to understand the data flow across different URLs on the page is a nontrivial problem, if addressed in a generic way.
• Second, there must be a way to do data analysis without first putting data into a database. Data insertion and subsequent data indexing are among the slowest of database operations.
• Third, there must be a way to analyze only new data without sacrificing the quality of information. The ability to create information from new data, as well as from information already produced from old data, is the only way to deal with the exponential complexity of running analytical queries against very large data sets (see the sketch after this list).
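One way to picture the third point - building information from new data plus the information already produced from old data, instead of re-scanning everything - is a mergeable summary that is folded forward batch by batch. The sketch below is a generic illustration of that idea, not a description of any particular product’s engine.

```python
def summarize(new_flows):
    """Summarize only the newly arrived flow records
    (each record is a (src_ip, bytes_moved) pair)."""
    return {
        "bytes": sum(b for _, b in new_flows),
        "count": len(new_flows),
    }

def merge(previous, incremental):
    """Fold the summary of new data into the information already
    produced from old data -- the old raw data is never revisited."""
    return {
        "bytes": previous["bytes"] + incremental["bytes"],
        "count": previous["count"] + incremental["count"],
    }

running = {"bytes": 0, "count": 0}
batches = [
    [("10.0.0.5", 48_000)],
    [("10.0.0.7", 310_000), ("10.0.0.9", 5_000)],
]
for batch in batches:
    running = merge(running, summarize(batch))
print(running)   # {'bytes': 363000, 'count': 3}
```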

Some technologies attempt to solve some parts of this puzzle. For example, streaming databases address the second half of the problem by focusing on data analysis within a relatively small time window and thus focusing only on analyzing new data.

Specialized analytical applications, such as Web analytics and network traffic analytics, are trying to improve the first half of the problem by streamlining the database insertion process. Through Web site or network instrumentation and exact mapping of this instrumentation to the database, these applications can gain some performance improvements. But only a few companies are addressing the problem as a whole.

Only a few years ago, a couple dozen gigabytes of data was considered very large. But with advances in storage technology and lower costs for storage, it’s not unusual for companies today to deal with terabytes - or even petabytes - of data. What has eluded most companies is the ability to convert this data into meaningful business information.

However, new alternatives to traditional BI are coming onto the market, and not a moment too soon. I anticipate that as high-volume data analysis solutions become more pervasive, the bar will be raised dramatically for what people expect from their business information systems. Just as the basic reporting capabilities of a decade ago forever changed how we manage and measure our businesses, so too will high-volume data analysis become a powerful and necessary requirement for doing business in the coming years.

Case Study: One of the Largest Content Delivery Networks in the Country

Objective: A reporting and analytics solution to provide timely visibility into very large volumes of data.

Challenges: Data collected is semi-structured and complex, and customers demand near real-time analysis and reporting.

Solution: XML-based BI solution
Results:
• A rapidly deployed, easy-to-use reporting and analytics solution;
• Built to handle huge volumes of complex semi-structured data; and
• A complete set of BI features and functionality at a fraction of the cost and time.
The subject of this case study is the fastest growing global service provider for accelerating applications and content over the Internet. The company provides network infrastructure on demand, optimizing application and content delivery while shifting bandwidth, computing and storage requirements to their own infrastructure. Large multinational corporations such as Verizon Business and Hewlett Packard use the company’s services to ensure that their customers receive LAN-like response times when accessing their Web applications from anywhere in the world.

Business Challenge
As part of its core solution offering, the company collects large quantities of data, including Web site response times, throughput and other pieces of information related to overall application performance and availability. Their customers were increasingly asking for a reporting and analytics solution integrated directly into the company’s global overlay network that could be used to sift through, analyze and provide visibility into overall application performance. Over time, as the company continued to broaden its reach into larger accounts, providing this capability had become absolutely essential. Their operations team also was looking for tools that would increase their ability to measure the operational efficiency of the network, manage SLAs, and provide “low latency” technical support by being able to pinpoint issues through the use of advanced analytics.

These requirements posed a significant data analysis challenge for the product management team. The data that the company collects for its customers is semi-structured and quite complex, and data volumes are massive. Their network manages hundreds of millions of transactions daily, and this number will be in the billions in less than a year. This level of activity generates 500 gigabytes of log files daily, with this number expected to quadruple in a year, eventually reaching two terabytes per day, or nearly one petabyte of data annually.

At the same time, the company needed to be able to provide near real-time analysis and reporting to meet customer and internal demands and differentiate itself from its competitors. They knew that processing such large volumes of nontabular data using a traditional BI system would be prohibitively expensive and slow. The company needed an entirely different approach to BI in order to achieve its goal of providing comprehensive, fast reporting and analytics to its customer base and internal operations.

The Solution
The company implemented an XML-based analytical solution to provide its customers with immediate visibility into the performance of the managed Web infrastructure, while giving its internal operations team fast insight into network performance and the ability to perform customer value analysis and pinpoint problems quickly. The technology was designed from the ground up to concurrently address large volumes of traditional and nontraditional data sources, such as the activity log files collected in the company’s system.

The solution uses XML as a common layer to significantly reduce system complexity while offering advanced functionality that cannot be achieved by traditional BI technology. By using XML to tie together different pieces of the BI stack into an integrated, “virtual” technology stack, the solution operates on data where it resides, with no movement or restructuring of data required.

The company was able to analyze large amounts of data coming directly out of activity log files and various network applications without the need to transform and store this data in a data warehouse. This results in extremely high data throughput and near real-time analysis and reporting.
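As a generic illustration of operating on log data where it resides - not a description of Skytide’s actual XML-based engine - the following sketch streams a hypothetical activity log file and computes a per-site average response time on the fly, with no load step into a database. The file name and column names are assumptions for the example.

```python
import csv
from collections import defaultdict

def response_time_by_site(log_path):
    """Stream a hypothetical activity log (CSV with 'site' and
    'response_ms' columns) and compute the average response time
    per site without first loading the data into a database."""
    totals = defaultdict(lambda: [0, 0])   # site -> [sum_ms, count]
    with open(log_path, newline="") as f:
        for row in csv.DictReader(f):
            acc = totals[row["site"]]
            acc[0] += int(row["response_ms"])
            acc[1] += 1
    return {site: s / c for site, (s, c) in totals.items()}

# Usage (assuming such a log file exists):
# print(response_time_by_site("activity-2007-10-01.log"))
```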

The company can now offer its customer base rapidly deployed reporting and analytics services that are easy to use and built to handle the complexity of the data volumes and structures that are typical in today’s highly interactive Web environment. At the same time, they can now also provide internal users with immediate insight into network performance while enabling fast customer value analysis and the ability to quickly identify and handle problems as they arise. The solution was built to scale as the business grows, both with new customers and volumes of data. Scalability is increasingly valuable as the company pursues larger customers who use analytics as a baseline for making technology investment decisions.

Joseph Rozenfeld is co-founder and vice president of Products for Skytide. He may be reached at [email protected].

www.skytide.com