
Analysis of Large Data Volumes

Challenges and Solutions

By Joseph E. Rozenfeld,

Co-Founder and Vice President of Strategy & Solutions

Skytide, Inc.

(650) 292-1912

[email protected]

Skytide

1820 Gateway Drive

San Mateo, CA 94404

650.292.1900

Table of Contents

Introduction

Today’s Business Intelligence Challenges

Need for Data Analysis

Data and Information

Requirements for Converting Data into Information

Extracting Information from Large Data Volumes Remains Challenging

New Technologies to Address the Data Volume Challenge

Conclusion

About Skytide

Analysis of Large Data Volumes: Challenges and Solutions

Introduction

Today’s definition of a very large data set has evolved dramatically from what was considered large ten years ago. The reason is simple: we live in a world that is getting more digitized every day. We spend hours every day in this digital world, so it’s no wonder the amount of data we access online is growing at an exponential – and unstoppable – rate. What’s more, most businesses today use IT to support every conceivable business function.

What this means is that trillions of digital interactions and transactions are generated and carried out by various networks hourly and daily. And with recent advances in storage technology and lower storage prices, storing a terabyte of this data is no longer uncommon. In fact, large data sets are now measured in petabytes.

And yet while storage technology advances have been staggering, progress in analytical data processing has been marginal.

This article examines the challenges of analyzing large volumes of complex transactional data and proposes new approaches to data analysis that go beyond conventional BI technology. It also provides a real-world example of how one large content delivery network is using new XML-based analytical technology to provide near real-time reporting on massive volumes of data.

Today’s Business Intelligence Challenges

“I have too much data to analyze” is a common complaint of business analysts.

Today’s definition of a very large data set has evolved dramatically from what was considered large ten years ago. During my early days at Arbor Software, the company that pioneered the concept of online analytical processing (OLAP) and later became Hyperion Solutions (which was recently acquired by Oracle), I was chartered with extending our product’s capabilities to support very large data sets. At the time, several gigabytes of data were considered large, and tens of gigabytes were considered huge. My work resulted in a set of product improvements that indeed helped to analyze gigabytes of data at reasonable speeds.

But even then, it was already becoming apparent that analysis of very large data sets would continue to be a significant problem. With recent advances in storage technology and the storage price decline, a terabyte of data is not uncommon – and large data sets are now measured in petabytes.

While storage technology advances have been staggering, progress in analytical data processing has been marginal. It is possible to find a piece of data in a petabyte-size storage system (Google, Yahoo, and other search engine technology vendors are a testament to that), but analyzing this data to find correlations and trends meaningful to a business analyst remains a huge challenge.

Author Bio:

Joseph Rozenfeld is Vice President of Strategy and Solutions and Co-Founder of Skytide. He has more than 20 years of software development and management experience and has founded or co-founded four companies, including Skytide and ChainCast Networks. As executive vice president and CTO of ChainCast Networks, the first provider of commercial peer-to-peer software for broadcast streaming, he grew the company into the largest streaming provider for terrestrial radio broadcasters in the U.S., with a client list including ClearChannel, NTT, Cox, and ABC. Joseph was also a founding engineer, development manager, and architect of the Essbase and IBM DB2 OLAP servers at Hyperion Solutions (formerly Arbor Software). Joseph holds an M.S. in Computer Science and a B.S. in Applied Mathematics from Moscow Polytechnique University.

Need for Data Analysis

Why do we need to analyze this data, and what is the nature of this data that grew in size by a factor of millions over the last ten years?

The answer is simple: we live in a world that is getting more digitized and more connected every day. We use networks to talk, shop, read, and work. We all have a digital life that is only growing bigger. Our music and photo libraries are stored and managed by network providers, our answering machines are virtual mailboxes, instant messenger IDs are on our business cards, our reference libraries are online. We spend hours every day in this digital world, so it’s no wonder the amount of data we access online is growing at an exponential – and unstoppable – rate.

What’s more, most businesses today use IT to support every conceivable business function. What this means is that trillions of digital interactions and transactions are generated and carried out by various networks hourly and daily. Where does all this data go? Some ends up in databases; most ends up in log files that are discarded on a regular basis, because even a petabyte-sized storage system is not large enough to keep all this transaction data gathered over an extended period of time.

The ability to analyze this largely untapped data is the Holy Grail for business intelligence (BI) practitioners and business owners alike. Imagine, for a moment, what business would be like if companies could analyze all the data flowing through their systems, instead of just a tiny fraction of it:

• A network security analyst could preempt an insider intrusion threat if he could quickly analyze all network transactions along with the HR database transactions related to hiring, firing, and compensation changes.

• A CFO could discover financial improprieties if he could analyze financial transactions along with network system log transactions.

• A marketer could make real-time adjustments to a broadly executed marketing campaign if he could analyze transactions from the website along with transactions from the ERP system and call detail records from the call center.

There is much insight to be gained by analyzing large volumes and all types of corporate data, and yet we leave those questions unasked, because our existing BI technologies lack the analytical capabilities to answer them.

Data and Information

Too often the words “data” and “information” are used interchangeably, when there is a very significant distinction between the two. An easy way to think about the difference is as follows:

• Data is the input for the analysis system

• Information is the output

Analysis is the process of turning data into information.

While it may seem basic, this distinction is important, because it opens up a different way to approach the problem of analyzing large volumes of data, instead of relying on traditional database- and data warehouse-centric approaches.

Let’s examine a simple scenario of network flow data analysis:

Network flow data is captured for every single chunk (packet) of data that moves on the network. The simple operation of a person looking at a page on a website will generate quite a few network flow transactions that capture both the request going from the user to the website and the response going from the website to the user. A single network flow transaction consists of a source (a source IP address), a target (a destination IP address), and some notion of the size of the data moved.

At this level of granularity, the network flow transaction is a good example of data, not information, with the analytical value of an individual transaction close to zero. But as soon as multiple network flow transactions are associated with a single web page lookup and aggregated, one can get access to some basic operational information such as:

• How much data was transferred for a particular web page?

• How long did it take?

• Were there any errors generated in the process?

And suddenly the website operator gains valuable insight into overall website performance.
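
To make the aggregation step concrete, here is a minimal Python sketch of the rollup just described. It is an illustration only, not any vendor's implementation; the FlowRecord fields and the request_id key that ties flows to a single page lookup are assumptions made for the example.

    from dataclasses import dataclass
    from collections import defaultdict

    @dataclass
    class FlowRecord:
        """One network flow transaction: raw data, not information."""
        request_id: str   # hypothetical key tying flows to one page lookup
        src_ip: str       # source IP address
        dst_ip: str       # destination IP address
        bytes_moved: int  # size of the data moved
        duration_ms: int  # time taken by this flow
        error: bool       # whether the flow ended in an error

    def summarize_page_lookups(flows):
        """Aggregate raw flow records into per-page operational information."""
        pages = defaultdict(lambda: {"bytes": 0, "duration_ms": 0, "errors": 0})
        for f in flows:
            page = pages[f.request_id]
            page["bytes"] += f.bytes_moved        # how much data was transferred?
            page["duration_ms"] += f.duration_ms  # how long did it take? (flows treated as sequential)
            page["errors"] += int(f.error)        # were any errors generated?
        return dict(pages)

    # Usage: one page lookup that generated three flow transactions.
    flows = [
        FlowRecord("page-42", "10.0.0.7", "203.0.113.5", 1_200, 35, False),
        FlowRecord("page-42", "203.0.113.5", "10.0.0.7", 48_000, 180, False),
        FlowRecord("page-42", "203.0.113.5", "10.0.0.7", 0, 15, True),
    ]
    print(summarize_page_lookups(flows))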

But there is much more to be gleaned from these transactions than this simple information. If we continue with the data aggregation and summarization exercise in this example, we might even get to business information such as:

• Quality of Service – how much traffic does a particular user generate on the network over a fixed period of time, and how many network errors occur in the process?

• IT Chargeback – how much network traffic does a business application generate over a month?

• Compliance and Intrusion Detection – which users have the highest traffic volume on the network?
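
Continuing the same illustrative exercise, a second round of aggregation can turn those operational rollups into business information. The sketch below ranks users by the traffic they generate over a fixed period (the compliance question above); the user-to-IP mapping stands in for an association with a second data source and is entirely assumed.

    from collections import Counter
    from datetime import datetime, timedelta

    # Assumed association with a second data source: which user owns which IP.
    user_for_ip = {"10.0.0.7": "alice", "10.0.0.9": "bob"}

    # Flow summaries reduced to (source IP, bytes moved, timestamp) tuples.
    flow_summaries = [
        ("10.0.0.7", 48_000, datetime(2007, 6, 1, 9, 15)),
        ("10.0.0.9", 1_200_000, datetime(2007, 6, 1, 9, 20)),
        ("10.0.0.7", 3_500, datetime(2007, 6, 2, 14, 5)),
    ]

    def top_talkers(flows, start, window):
        """Rank users by traffic volume generated within a fixed period."""
        totals = Counter()
        for src_ip, nbytes, ts in flows:
            if start <= ts < start + window:
                totals[user_for_ip.get(src_ip, "unknown")] += nbytes
        return totals.most_common()  # highest-traffic users first

    print(top_talkers(flow_summaries, datetime(2007, 6, 1), timedelta(days=30)))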

These are the questions a business analyst would be interested in asking. But traditional BI tools have not been able to deliver the answers.

This example not only illustrates the difference between data and information but also explains what needs to happen to improve the process of creating operational and business information from data.

Requirements for Converting Data into Information

The network flow data example touches on every single requirement for an effective process of converting data into information, including:

• A large data set must be associated with other data to produce meaningful information; for example, network flow data associated with business application data.

• Several processing steps may be necessary, first to turn data into operational information, then into business information; for example, aggregation and summarization.

• The entire process has to be reasonably fast; for example, it doesn’t help anyone to identify a security breach a month after it occurs.

Unfortunately, these three requirements conflict with each other. The more data that needs to be processed and the more intelligence we want to gain from it, the longer the process is going to take. It’s not surprising that the majority of analytical applications designed for such large data sets as web and network traffic data are mostly focused on:

1) Event correlation, which can be done on a smaller data set; or

2) Operational information, which requires the least amount of processing.

Figure 1: A new approach is needed to get business information out of large data volumes

As we try to extract more valuable information from data, the analysis process takes longer and longer, and as we try to apply this process to large data sets, we begin to hit the performance brick wall.

We know that conventional analytical solutions are severely limited in their capability to get information out of large data sets; so let’s explore the alternatives.

Extracting Information from Large Data Volumes Remains Challenging

There are several known approaches to information extraction when dealing with very large data volumes:

• Search – a technique that is often confused with analysis. Extracting information is a process of transforming data. While a search process is efficient when applied to large data volumes, it merely finds what is already inside the data instead of producing information.

• Business Intelligence (BI) or Business Analytics (BA) – an approach relying on database technologies. While this is the most common approach to the problem, it is fundamentally flawed and breaks down when dealing with very large data sets.

With the BI/BA approach, the main roadblock to analyzing very large data sets is latency:

• If terabytes of data are generated hourly and daily, a highly scalable database is necessary just to keep up with this data volume and get the new data into the database. We saw a major bank trying to analyze its daily web traffic data using a database. It required about 23 hours to add 24 hours' worth of data to the database, and then another two hours to run analytical queries against this data. It was only natural that the bank was falling behind every day and was forced to start sampling data, which created credibility problems for its business analysts (the arithmetic is sketched after this list).

• If terabytes of data need to be perused in order to run an analytical query, and if the database is growing daily, the latency of analytical queries will increase exponentially and eventually will render the entire system unusable. This is the main reason why business information is rarely available for very large data sets.
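
The arithmetic behind the bank example above is simple but worth making explicit. A back-of-the-envelope sketch, using only the figures quoted in that example, shows why the backlog could only grow:

    # Back-of-the-envelope sketch using only the figures quoted above.
    data_per_day = 24          # hours of web traffic generated each day
    load_time = 23             # hours to insert one day of data into the database
    query_time = 2             # hours to run the analytical queries against it

    work_per_day = load_time + query_time          # 25 hours of processing per 24-hour day
    backlog_growth = work_per_day - data_per_day   # the pipeline loses 1 hour every day

    print(f"Backlog grows by {backlog_growth} hour(s) per day")
    # After a month the system is more than a full day behind, which is what
    # pushed the bank toward sampling its data.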

New Technologies to Address the Data Volume Challenge

Three problems need to be solved when dealing with analysis of very large data sets:

• First, it is necessary to create associations between multiple data sources that can be quite complex. For instance, mapping a web page to network traffic to understand the data flow across different URLs on the page is a non-trivial problem, if addressed in a generic way.

• Second, there must be a way to do data analysis without first putting data into a database. Data insertion and subsequent data indexing are among the slowest of database operations.

• Third, there must be a way to analyze only new data without sacrificing the quality of information. The ability to create information from new data, as well as from information already produced from old data, is the only way to deal with the exponential complexity of running analytical queries against very large data sets (a minimal sketch follows this list).
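
A minimal sketch of what the third requirement implies, assuming a simple additive metric: previously produced information is kept, and only the aggregates derived from newly arrived data are folded into it. This illustrates incremental aggregation in general, not any particular product.

    # Information already produced from old data (e.g. total bytes per user).
    stored_info = {"alice": 51_500, "bob": 1_200_000}

    def fold_in(new_aggregates, stored):
        """Merge aggregates computed from new data into stored information,
        without ever re-scanning the old data."""
        for user, nbytes in new_aggregates:
            stored[user] = stored.get(user, 0) + nbytes
        return stored

    # Only today's new transactions were scanned to produce these aggregates.
    todays_aggregates = [("alice", 4_200), ("carol", 900)]
    print(fold_in(todays_aggregates, stored_info))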

Some technologies attempt to solve some parts of this puzzle. For example, streaming databases address the second half of the problem by focusing on data analysis within a relatively small time window and thus analyzing only new data.

Specialized analytical applications, such as web analytics and network traffic analytics, are trying to improve the first half of the problem by streamlining the database insertion process. Through website or network instrumentation and exact mapping of this instrumentation to the database, these applications can gain some performance improvements. But only a few companies are addressing the problem as a whole.

Conclusion

Only a few years ago, a couple dozen gigabytes of data was considered very large. But with advances in storage technology and lower costs for storage, it’s not unusual for companies today to deal with terabytes – or even petabytes – of data. What has eluded most companies is the ability to convert this data into meaningful business information.

However, new alternatives to traditional BI are coming onto the market, and not a moment too soon. We anticipate that as high-volume data analysis solutions become more pervasive, the bar will be raised dramatically for what people expect from their business information systems. Just as the basic reporting capabilities of a decade ago forever changed how we manage and measure our businesses, so too will high-volume data analysis become a powerful and necessary requirement for doing business in the coming years.

About Skytide

Skytide is a leading provider of next-generation analytical solutions that provide an unprecedented view into what is driving business performance. Skytide’s breakthrough technology uses XML as a common layer to dramatically reduce system complexity while offering advanced functionality that cannot be achieved by traditional BI technology. Application areas for Skytide technology include contact centers, risk and security management, compliance, and other areas of business that generate significant volumes of mission-critical unstructured and semi-structured data. Skytide partners include IBM, Sun Microsystems, and Inxight. Based in San Mateo, Calif., Skytide is a privately held company funded by Granite Ventures and El Dorado Ventures. For more information, please visit www.skytide.com.