The Big Data Paradigm Shift: Insight Through Automation



In this white paper, you will learn about:

• Big Data’s combinatorial explosion
• Current and emerging technologies
• Automation as the new way to leverage insight within Big Data
• An algorithmic approach to the Big Data revolution


CONTENTS

01 / EXECUTIVE SUMMARY
02 / THE BIG DATA PARADIGM SHIFT
03 / LANDSCAPE OF EXISTING METHODS AND TOOLS
04 / EMERGENCE OF BIG DATA TOOLS
05 / THE NEED FOR A NEW BIG DATA ANALYTICS APPROACH
06 / EMCIEN’S ALGORITHMIC APPROACH TO BIG DATA
07 / NOT JUST THEORY: SOLVING REAL-WORLD PROBLEMS
08 / CONCLUSION


... a study of progress over a 15-year span on a benchmark production-planning task. Over that time, the speed of completing the calculations improved by a factor of 43 million. Of the total, a factor of roughly 1,000 was attributable to faster processor speeds. Yet a factor of 43,000 was due to improvements in the efficiency of software algorithms.”

Martin Grotschel, a German scientist and mathematician, quoted in a White House advisory report, December 2010


Executive Summary

Big Data promises greater insight, competitive advantage, and the possibility of solving problems that have yet to be imagined. These insights will come from software applications that automate the analytics process. While infrastructure may be an important building block, this paper focuses on the algorithms that will deliver the insights organizations need.

This paper does the following:

• Proposes a paradigm shift away from analysts’ one-to-one relationship with data toward a relationship with algorithms.

• Explains the combinatorial explosion that makes Big Data Analytics impossible for old data analytics tools and methodologies.

• Examines current and emerging technologies.

• Suggests that methods which handle Big Data the same way smaller data sets are handled miss the potential of Big Data.

• Proposes that rather than search and crunch data, organizations need the ability to automate the process of analyzing, visualizing and ultimately leveraging the insight within their data.

• Introduces an algorithmic approach that provides an efficient, sustainable, automated way to delve into Big Data, detect patterns, and discover insights hidden within that data.




The Big Data Paradigm Shift

The Need for a Paradigm Shift

Humankind has always possessed a love for data. We can do remarkable things with data and have built remarkable tools to collect, store, sift, sort, splice, dice, chart, report, predict, and visualize it. Data can change the way we perceive the world and how we interact with it. But the world is changing. Data used to be scarce, and tiny bits of it were extremely valuable. Tiny bits of it are still extremely valuable, but now there is so much data that finding the valuable bits can be extremely difficult.

Big Data demands that organizations change the way they interact with data. In the past, analysts could stand by the data faucet and collect what was needed in a paper cup, but now data is the ocean in which they are floating. Paper cups are useless here. Why? Because the search for data is over. Data is everywhere. It creates noise. Now analysts are searching for the signal amidst the noise. They are looking for the important bits. And in this vast sea of information, that task is overwhelming. Current tools and methodologies are failing when it comes to finding the most critical information in a time-sensitive, cost-effective manner. The reason? Many of these emerging tools and technologies are trying to approach this challenge with new ways of doing the same old things. But a bigger cup is not what’s needed. That won’t solve it. What is needed is a completely new approach to Big Data Analytics.

In this new approach, the only way to find the signal is to automate the process of data-to-insight conversion. And automation requires algorithms—fast, sophisticated, highly optimized algorithms.

The volume of Big Data demands a change in the relationship with data: the work must shift from human to machine. The algorithms have to do the work, not the humans. In this brave new world, the machines and algorithms are the protagonists. The role of the analyst is to select the best algorithms and evaluate the results based on speed, quality and economics.

The Promise of Data and the Search for Insight

Why is the world obsessed with data? Because the promise of data is insight. In the last few years, organizations have become exceptionally good at collecting data, and as the cost of storage has dropped, companies are now drowning in that “Big Data.” However, the business world has hit a wall where the amount of data available far exceeds the human capacity to process it. The amount of data also exceeds the capabilities of existing analytics and intelligence tools, which have served as mere data-shovels or pickaxes in the search for the gold that is insight.

Acquiring insight inevitably involves querying a database, or, most likely, several databases. Many analysts are mashing up data across data silos in an attempt to discover the connections between data points. For example, marketers are aggregating customer demographics, purchase data and social media data, while purchasers are aggregating supplier data with procurement and pricing data. This process produces a variety of data sets of different types and qualities.

A simple query, for example, might ask for specific values within a subset of columns. So the real question becomes, “How many queries will it take to answer even one of these questions?” Consider how many queries one might make into even a small set of data, such as a table containing just ten columns:

• If each column has 2 possible values, there are 59,048 possible queries.

• If each column has 3 possible values, there are 1,048,575 possible queries.

To think of it another way, a database with 100 columns and 6 choices per column yields more possible queries than there are atoms in the universe.
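These counts follow from simple arithmetic: a query either ignores a column or fixes it to one of its possible values, so a table with n columns and k values per column admits (k + 1)^n combinations, minus one for the empty query. A minimal Python sketch of that calculation (the function name is ours, for illustration only):

```python
# Count the possible queries over a table: each column is either ignored or
# fixed to one of its values, giving (k + 1)**n combinations minus the empty query.
def possible_queries(columns: int, values_per_column: int) -> int:
    return (values_per_column + 1) ** columns - 1

print(possible_queries(10, 2))    # 59,048
print(possible_queries(10, 3))    # 1,048,575
print(possible_queries(100, 6))   # ~3.2e84, more than the ~1e80 atoms in the observable universe
```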



[Figure: Exponential explosion of queries. Number of queries (1 million to 30 billion) plotted against the number of variables per column (0 to 14).]

The Limitations of Search and Query-based Approaches

Data becomes unwieldy due to the number of rows or the number of columns, or both. Data mash-ups, described previously, create a lot of columns. High-volume transactional systems have lots of rows or records. However, having millions of records is not the problem. The depth of the data—the number of rows—merely impacts processing time in a linear fashion and can be reduced with fast or parallel computing. Thus, executing a query is simple enough. The problem is the width of the data, because the number of possible queries explodes exponentially with the number of columns.

As a result, the real task of extracting insight from data is formulating the right queries. And manually laboring through thousands of queries to find the ones that deliver insight is not an efficient way to derive value from data. Therefore, when it comes to Big Data, the big challenge is knowing the right query.


Landscape of Existing Methods and Tools

Over 85% of all data is unstructured.[1] However, existing methods and tools are designed to analyze structured data. A high-level categorization of analytics tools is critical to understanding the state of Big Data Analytics.

Statistical Tool Kits

The purpose of statistical analysis is to make inferences from samples of data, especially when data is scarce. In the era of Big Data, scarcity is not the problem. Traditional statistical methods have severe limitations in the realm of Big Data for the following reasons:

• Statistical methods break down as dimensionality increases.

• In unstructured data, dimensions are not well defined.

• Attempts to define dimension for unstructured data result in millions of dimensions.

Data Mining

Data mining is a catchall phrase for a very broad category of methods. Essentially, it is a method for sifting through very large amounts of data in an attempt to find useful information. It implies “digging through tons of data” to uncover patterns and relationships contained within the business activity and history. Data mining involves manually slicing and dicing the data until a pattern becomes obvious, or using software that analyzes the data automatically.

The first limitation of data mining is that the data has to be put in a structured format first, such as a database. The second limitation is that most forms of data mining require that the analyst knows what to look for. For example, in classification and clustering analysis, the analyst is trying to find instances of known categories, such as people who have a high probability of defaulting on their mortgages. In anomaly detection, the analyst is looking for instances that do not match the known normal patterns or known suspicious patterns, such as people who pay cash for one-way plane tickets.
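For instance, a conventional anomaly check starts from a definition of “normal” supplied up front by the analyst. The sketch below is illustrative only; the payment data and the two-standard-deviation cut-off are assumptions, not any particular product’s method:

```python
# Illustrative sketch of conventional anomaly detection: the analyst must first
# define "normal" (here, the mean of a numeric field) and decide what counts as
# anomalous (here, more than 2 standard deviations from that mean).
from statistics import mean, stdev

def flag_outliers(values, threshold=2.0):
    mu, sigma = mean(values), stdev(values)
    return [v for v in values if abs(v - mu) > threshold * sigma]

payments = [102, 98, 97, 105, 99, 101, 3500, 100, 96, 103]
print(flag_outliers(payments))  # [3500] -- found only because "normal" was defined in advance
```

The point is the workflow, not the arithmetic: the category, pattern, or threshold has to be known in advance, which is exactly the limitation described above.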

Data Visualization

Data visualization is the study of the visual representation of data, meaning information that has been abstracted in some schematic form, including attributes or variables for the units of information.[2] Humans are better equipped to consume visual data than text. As we know, a picture is worth a thousand words.

While visualization tools are interesting, they rely on human evaluation to extract insight and knowledge. A more severe limitation of visualization is that the visuals can focus on only two or three dimensions at most before the amount of information becomes overwhelming. In practice, visualization is a good test for small samples, but it is not a sustainable method for gaining insight into large volumes of high-dimensionality data.

Consider a scenario in which there aren’t enough pixels on the screen to represent each item. An analyst can easily inspect a friendship network with 10-100 people, but not one with a billion.

Business Intelligence & Analytics

Business Intelligence (BI) is a catchall phrase for ad hoc reports created in a database. These are typically pre-canned reports based on metrics that users are comfortable reporting. “Analytics” includes any computation performed for reporting. Hence, BI tools are now called analytics. BI was created as a way to extract data from the database. While it continues to serve that purpose, it is time and labor intensive and is not intended to surface insights.




Limitations of Existing Tools

The overwhelming shortcoming of all these methods is that they are query-based and labor-intensive. Big Data offers a virtually infinite number of possible queries, so all of these methods rely on analysts to produce the right questions. Any method that puts that burden on the user is a game-stopper.

Although search remains the go-to information access interface, reliance on search needs to end. Search is not enough. A new type of information-processing focus is needed.

The major shortcomings of the existing tools are as follows:

• Search helps you find what you already know to look for; it doesn’t help you discover things about which you’re completely unaware.

• Query-based tools are time-consuming because search-based approaches require a virtually infinite number of queries.

• Statistical methods are largely limited to numerical data; over 85% of data is unstructured.


Emergence of Big Data Tools

Because Big Data includes data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process within a tolerable elapsed time, new technologies are emerging to address the challenges brought on by these large quantities of data. These technologies can be categorized into two groups: Hadoop-based solutions and In-Memory based solutions.


Hadoop and Hadoop-based Tools

While Hadoop is not an analytics tool per se, it is often mistaken for one. Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. It supports the running of applications on large clusters of commodity hardware.

Hadoop is used to break big tasks into smaller ones so that they can be run in parallel to gain speed and efficiency. This is great for a query on a large volume data set. The data set can be cut into smaller pieces, and the same query can be run on each smaller set. Hadoop aims to lower costs by storing data in chunks across many inexpensive servers and storage systems. The software can help speed up certain types of simple calculations by sending many queries to multiple machines at the same time. The technology has spawned a set of new start-ups, such as Hortonworks Inc. and Cloudera Inc., which help companies implement it.
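The division of labor is easy to picture without Hadoop itself. The toy sketch below uses plain Python multiprocessing rather than the Hadoop APIs: it splits a record set into chunks, runs the same simple “query” on every chunk in parallel, and combines the partial results:

```python
# Toy illustration of the split-and-combine idea (not Hadoop): cut the data into
# chunks, run the same query on each chunk in parallel, then merge the results.
from multiprocessing import Pool

def count_errors(chunk):
    # The "query" applied to each chunk: count the records flagged as errors.
    return sum(1 for record in chunk if record["status"] == "error")

if __name__ == "__main__":
    records = [{"status": "error" if i % 50 == 0 else "ok"} for i in range(1, 1_000_001)]
    chunks = [records[i:i + 100_000] for i in range(0, len(records), 100_000)]
    with Pool() as pool:
        partial_counts = pool.map(count_errors, chunks)  # same query, many chunks, in parallel
    print(sum(partial_counts))  # 20,000 error records, combined from 10 partial counts
```

As noted below, this speeds up simple, known calculations; it does not by itself reveal which query was worth running.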

Hadoop helps companies store large amounts of data but doesn’t provide critical insights based on the naturally occurring connections within the data. The impulse to store lots of data because it is cheap to do so can lead to storing too much data, which can make answering simple questions more difficult.

IT professionals and analysts are asking the following questions:

• Where is the insight? What is the data telling us?

• How can I prove the return on investment?

As a result, in-memory databases are gaining attention in an attempt to come closer to the goal of real-time business processing.[3,4]

In-Memory-based Appliance

Some of these approaches have been around for a long time in areas such as telecommunications or fields related to embedded databases. An example is SAP’s HANA (High Performance Analytical Appliance). This in-memory paradigm is now touted as the future database paradigm for Big Data.

The primary limitations of in-memory are cost and size. There is a significant limit to the amount of data that can be held in memory. If you need to perform Big Data–style analysis and you want to see the bigger picture, in-memory is not enough. The cost is prohibitive and it is not sustainable.


The Need for a New Big Data Analytics Approach

While these emerging technologies are attempting to address the challenge of Big Data, at the end of the day, they are heavy-handed and time-consuming because they lack automated intelligence for gaining insight. It’s time for an entirely new approach.

This new approach demands a paradigm shift that focuses on the following:

• A fundamental change in the role played by analysts from data-miners to insight-evaluators.

• Fast and efficient algorithms that automatically convert data to insight for evaluation.

• Continual improvement of these algorithms to keep up with the speed of data and critical need for timely insights.


Old Paradigm: Data Analyst Digs for Insights by Manually Querying a Database

• Analysis takes from months to years
• Specialized skills in math and computer science required
• Operational and business intelligence

New Paradigm: Algorithms Automatically Surface Insights to Evaluate

• Automatic insights in seconds to minutes
• Anyone (no specialized skills required)
• Immediate insight and perspective


Emergence of Algorithms as a New Class of Big Data Software Tools

The size and speed of Big Data demands true automation, in which work is offloaded from human to machine. This automation happens with algorithms, which are designed for calculation, data processing, and automated reasoning. Algorithms are designed for tasks that are beyond human comprehension and require the speed of machines. This is the realm of Big Data.

One of the most dramatic and game-changing examples of an algorithm was designed by Alan Turing to automatically decode German Navy messages at Bletchley Park during WWII. In this instance, the urgency was critical and demanded an automated approach to convert the data to intelligence. There were 158 million million million (158,000,000,000,000,000,000) possible ways that a message could be coded by the German Enigma machine. The decryption effort against the naval Enigma traffic the Allies codenamed Shark succeeded, and the results changed the course of the war. The Allies won because they had a competitive advantage. Bringing it to the present, the use of Big Data in the 2012 United States presidential election changed the face of political campaigns forever.

Emcien’s approach to Big Data is to automate the process of data-to-insight in a timely and cost effective manner through sophisticated algorithms. The algorithms leverage advanced mathematics to solve complex problems of an unimaginable size, thereby pushing the frontier of innovation and competition. The following section details Emcien’s algorithmic approach to Big Data Analytics.


Emcien’s Algorithmic Approach to Big Data

Rather than search and crunch data, organizations need the ability to analyze, visualize and ultimately leverage the patterns and connections within their data. Emcien’s innovation is a suite of automatic pattern detection algorithms. These algorithms utilize a graph data model that captures the interconnectedness of the data elements and creates a very elegant representation of high-volume data with unknown structure. These fast, sophisticated algorithms automatically detect patterns and self-organize what they find, thereby providing immediate insight and perspective.

Here is an outline of how the algorithm works:

1. Assesses the data in order to identify and measure connections between data points.

2. Converts the original high density/low value (structured, semi-structured or unstructured) data into a low density/high value graph.

3. Builds the graph on non-Euclidean distances. This is important, as most of the data is unstructured and non-numeric. The distances and strengths are computed in non-Euclidean space. (For example, you may be closer to your family than to your friends, but that closeness is not a Euclidean distance.)

4. Computes millions of data points across the graphs to enable patterns to emerge.

5. Distills the noise to allow the signal to emerge. This is made possible because the noise has patterns and the algorithms are designed to detect these patterns.

6. Enables the key topographical elements of the graph to emerge. The algorithm then ranks these elements and brings the most significant into focus.

7. Categorizes these elements based on the application and outputs the insight.

If the data is structured:

• Each row in the data table is considered an event.
• Each cell in the row is converted to a node in the graph.
• Cells that co-occur in a row (event) are connected by an arc.

If the data is unstructured:

• Every word or data element is converted to a node.
• Two words or data elements that occur simultaneously in an event are connected by an arc.
• An event may be defined as a single document, message, email exchange, etc.
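As a concrete illustration of the structured-data rules above (a minimal sketch, not Emcien’s implementation; the sample rows are invented), each row becomes an event, each cell value becomes a node, and co-occurrence within a row adds weight to the arc between two nodes:

```python
# Minimal sketch of building a weighted co-occurrence graph from structured rows:
# each row is an event, each cell value is a node, and cells that co-occur in a
# row are connected by an arc whose weight counts their co-occurrences.
from collections import Counter
from itertools import combinations

rows = [
    ["atlanta", "credit_card", "electronics"],
    ["atlanta", "credit_card", "groceries"],
    ["boston",  "cash",        "electronics"],
]

nodes = set()
edge_weights = Counter()
for row in rows:                                  # each row is an event
    cells = set(row)                              # each distinct cell value is a node
    nodes.update(cells)
    for a, b in combinations(sorted(cells), 2):
        edge_weights[(a, b)] += 1                 # co-occurrence strengthens the arc

print(len(nodes))                                 # 6 nodes
print(edge_weights[("atlanta", "credit_card")])   # weight 2: the pair co-occurs in two events
```

For unstructured data the same loop applies, with words or extracted entities standing in for cell values and a document, message, or email exchange standing in for a row.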



Understanding the Make-up of Graphs Based on Connections

The graph data model is very flexible and displays a distinct topography based on the density of connections. A cross-sectional view of the graph data model will typically expose the following layers:

Layer 1 (Very Noisy Connections): Typically the most highly prevalent data, this layer is composed of high-volume interactions that may be mundane and blatantly obvious.

Layer 2 (Highly Connected Nodes): Lying just below the noise, this second layer is composed of the first signal that is interesting. This layer exhibits distinct patterns based on crowd behavior.

Layer 3 (Weaker Connections): The third layer is a weaker signal and displays the non-obvious connections. These relate to events that are less frequent and may be connected in non-obvious ways.

Layer 4 (The Faint Signal): Composed of very weak connections and interactions, this last layer is of interest for security and surveillance. In many cases, this layer only emerges when the data is very rich in entities, causing connections to emerge in very non-obvious ways.

Advantages of Emcien’s Graph Data Model

The graph data model exhibits a topography that signifies relationships and connectedness in a way that is not possible through any other method. Emcien’s algorithms have been designed to surface these patterns. Listed below are a few key attributes that help describe the characteristics of the algorithms.

Software: A critical distinction of Emcien’s graph data model is that it is software. Emcien’s software provides the computational engine with a data representation that lends itself to high-speed computing. As a result, the software runs on typical commodity computing environments.

Algorithmic Layer: Although some products on the market model the graph into the database layer or hardware layer, they do not have an algorithmic layer and, therefore, require the user to query the systems based on the old data-inquiring paradigm. Algorithms automate the data analysis process, which is an absolute requirement for efficient Big Data analytics.

Compact Representation of Data: Data is big because the number of events can grow exponentially as the various entities are continually interacting. The graph representation is ideal for Big Data because it creates a very compact representation of the data. This is because the number of entities grows more slowly and reaches a natural steady state. In the graph data model these interactions translate to connection weights, allowing the graph model to encapsulate very big data in smaller structures.

Noise Elimination: The graph data model can be thought of in terms of layers, based on the connectedness of the data elements. The highly connected and noisy nodes are at the top layer, and the weak connections lie buried deep in the graph. The noisy connections can be overwhelming and tend to render graph models burdensome. Emcien utilizes a suite of patented algorithms to automatically distance the noise and detect critical patterns that relate to highly significant and relevant information.
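One way to picture the layering and noise-distancing described above is to bucket each connection by its strength relative to the strongest connection and set the noisiest layer aside. This is a simplified sketch, not the patented algorithms; the cut-off ratios are illustrative assumptions:

```python
# Simplified sketch of layering connections by relative strength so the noisy top
# layer can be set aside and weaker, more interesting signals can surface.
def layer_of(weight, max_weight):
    ratio = weight / max_weight
    if ratio > 0.75:
        return 1   # very noisy connections
    if ratio > 0.25:
        return 2   # highly connected nodes
    if ratio > 0.05:
        return 3   # weaker connections
    return 4       # the faint signal

edges = {("a", "b"): 980, ("a", "c"): 400, ("c", "d"): 60, ("d", "e"): 3}
strongest = max(edges.values())
layers = {pair: layer_of(w, strongest) for pair, w in edges.items()}
signal = {pair: w for pair, w in edges.items() if layers[pair] != 1}  # set the noise layer aside

print(layers)   # {('a', 'b'): 1, ('a', 'c'): 2, ('c', 'd'): 3, ('d', 'e'): 4}
print(signal)   # {('a', 'c'): 400, ('c', 'd'): 60, ('d', 'e'): 3}
```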


Not Just Theory: Solving Real-World Problems

The representation and visualization of complex networks as graphs helps surface critical, time-sensitive intelligence. One of the most important tasks in graph analysis is to identify closely connected network components comprising nodes that share similar properties. Detecting communities is of significant value in retail, healthcare, banking and intelligence work - verticals where loosely federated communities deliver insight and intelligence into the profile of a customer base or any other group being analyzed.

How Can This Model Be Applied? Emcien’s Pattern Detection Engine:


• Intelligence: Surfaces critical correlations between people that merit serious attention, determines key individuals in targeted social networks, and geo-locates persons of interest and their networks around the world – from gangs to terrorists.

• Network Security: Auto-detects intrusion patterns and surfaces suspicious activity by providing immediate insight into highly linked variables. It then automatically identifies anomalies without the user having to query the data. For example, Emcien analyzes millions to billions of transactions to identify patterns in source and destination-IP addresses, ports, days, times and activity – to show you what you should be paying attention to. Emcien eliminates over 95% of the noise and identifies patterns that are “surprising” or that deviate from the norm.

• Fraud Detection: Surfaces patterns in money laundering and fraud by identifying groups of customers, locations, or transaction types that occur together in banking transactions.

• Customer Analytics: Surfaces insights on customer buying patterns, locations, demographics, loyalty, savings, lifestyle and insurance.

• Healthcare Analytics: Analyzes massive volumes of clinical data on medications, allergies, medical claims, pharmacy therapies, lab results, medical records, clinician notes and more in order to surface patterns.

• Performance and Operations Analytics: Analyzes raw information about the performance and operations of every element of an organization, which can be interpreted to increase profitability or improve customer service.

In short, Emcien tackles one of the biggest challenges with Big Data, namely “What are the right questions to ask?” Emcien’s pattern-detection engine quickly discovers the value within massive data sets by making connections between disparate, seemingly unrelated bits of information and by finding the highest-ranked of these connections to focus on – which reveals time-sensitive, mission-critical insights.


Conclusion

The Big Data Analytics revolution is underway. This revolution is a historic and game-changing expansion of the role that information plays in business, government and consumer realms. To harness the power of this data revolution, a paradigm shift is required. Organizations must be able to do more than query their Big Data stores; search is no longer enough.

Up until now in the history of data analysis, the objective of queries was to find the signal in the noise. And it worked because we had clear-cut business questions and the size of the data was smaller, the data set was more complete, and we usually knew what we were looking for. We were playing in the realm of known knowns and known unknowns. In the new world of Big Data, it is now more important to know what to ignore. Because unless you know what to ignore, you’ll never get a chance to pay attention to what’s really important. Using algorithms to first ignore the noise and then find the insights is the way of the new world.

Extracting insight from Big Data requires analytics methods that are fundamentally different from traditional querying, mining, and statistical analysis on small samples. Big Data is often noisy, dynamic, heterogeneous, unstructured, inter-related and untrustworthy.

The combinatorial explosion requires new methods for finding insight in Big Data, and economics and time-criticality demand that these methods be more sophisticated. As stated earlier, manually laboring through thousands of queries to find the ones that deliver insight is not an efficient way to derive value from data.

Emcien’s technology provides a “Command Center” for Big Data, automatically interpreting the data, discovering patterns, identifying complex and significant relationships, and surfacing the most relevant questions that lead to the insights analysts need to know.


About Emcien Corp.

Emcien’s automatic pattern-detection engine converts data to actionable insight that organizations can use immediately. Emcien breaks through time, cost and scale barriers that limit the ability to operationalize the value of data for mission-critical applications. Our patented algorithms recognize what’s important, defocus what’s not, evaluate all possible combinations and deliver the optimal results automatically. Emcien’s engine, fueled by several highly competitive NSF grants and years of research at Georgia Tech and MIT, is delivering unprecedented value to organizations across sectors that depend on immediate insight for success—banking, healthcare, insurance, retail, Intelligence and others. Visit emcien.com to learn more.


Sources

1. Christopher C. Shilakes and Julie Tylman, “Enterprise Information Portals,” Merrill Lynch, 16 November 1998.

2. Michael Friendly, “Milestones in the history of thematic cartography, statistical graphics, and data visualization,” 2008.

3. J. Vascellaro, “Hadoop Has Promise but Also Problems,” The Wall Street Journal, 23 February 2012.

4. R. Srinivasan, “Enterprise Hadoop: Five Issues With Hadoop That Need Addressing,” blog, 28 May 2012.