Generating Value from Big Data
TRANSCRIPT
2015 EMC Proven Professional Knowledge Sharing 2
Table of Contents
Data Explosion – A Flashback
Big Data Overview
Big Data – Problem or Opportunity?
Layers of Big Data
Big Data Analytics – Introduction to Hadoop
Stepping into Data Analytics – A Few Guidelines
Challenges of Big Data
Big Data use cases
Latest Trends in Big Data
Common Myths about Big Data
Conclusion
Disclaimer: The views, processes, or methodologies published in this article are those of the
author. They do not necessarily reflect EMC Corporation’s views, processes, or methodologies.
Data Explosion – A Flashback
Many years ago, data was recorded on clay tablets and stone. More recently, the invention of the computer enabled data to be stored on magnetic discs. This allowed data to be stored, retrieved, shared, and reused for purposes not imagined when it was collected. When the computer first came on the scene, only computer operators or company staff generated data, and the magnitude was much smaller. With the evolution of the Internet, many more users started generating data and the volume grew significantly. Today, sources such as sensors, cameras, mobile phones, and websites generate huge volumes of data compared to earlier times. A data explosion has taken place over the past years, as the statistics in Figure 1 clearly show.
Figure 1: Explosion of data in recent years
“Every 2 days we create as much information as we did from the beginning of time until 2003.”
“Every day we create 2.5 quintillion bytes of data. Over 90% of all the data available in the world has been created in the past 2 years alone.”
“It is expected that by 2020 the amount of digital information in existence will have grown from
3.2 Zettabytes today to 40 Zettabytes.”
More than 80% of the data available today is unstructured (e.g. text, audio, and video), which is very difficult to process using traditional Relational Database Management Systems (RDBMS).
Source: IDC
Figure 2: Structured and Unstructured data
Big Data has become a well-known buzzword in the IT industry and beyond. While there is no standard definition for Big Data, the one I find most appropriate is this: “A huge collection of data sets, both structured and unstructured, which is difficult to process using a traditional database”.
Big Data Overview
Suppose we have a 100 MB document that is difficult to send, a 100 MB image that is difficult to view, or a 100 TB video that is difficult to edit. In any of these cases, we have a Big Data problem. Or suppose company ‘A’ is able to process a 300 TB video while company ‘B’ cannot; we would say that company ‘B’ has a Big Data problem. Thus, as you can see, Big Data can be system-specific or organization-specific.
Big Data is not only about the size of the data; it relates to velocity and variety as well. These are known as the 3 V’s of Big Data – Volume, Velocity, and Variety. In addition, a fourth V, Veracity, is often added to the Big Data dimensions.
Figure 3: The 3 V’s of Big Data
Velocity and Volume focus on the speed and amount of data. Variety and Veracity refer to the category and trustworthiness of data, respectively.
Volume (how much): Capturing and processing large quantities of data.
Velocity (changing frequency and real time processing): Processing data in real time.
Variety (category of data): How many different types of data can be processed, e.g. email,
video, audio, text, log files, and various transaction data.
Veracity (truthfulness and confidence): How much you can trust the data for its legitimacy when
it is pouring in from various sources.
Big Data – Problem or Opportunity?
Whether Big Data is perceived as a problem or an opportunity depends on how you or your organization approaches it. In the beginning, it is certainly a problem for all, as we do not have the right infrastructure and resources to process it, and we are unsure about the insights and value that will be derived from it. Once the data is processed with the right tools, however, it can lead to breakthrough insights that support strategic decision making and thereby the growth of the organization. In a survey conducted by Supply Chain LLC, 76% of customers considered Big Data an opportunity rather than a problem. A recent CNBC quote supports this notion: “Data is the new oil - in its raw form oil has little value; once processed it helps power the world”.
Meanwhile, a study conducted by IBM in 2012 found that nearly half of the respondents considered customer satisfaction their top priority. They see Big Data as a key to understanding customers and predicting their behavior. Organizations are interested in investing in Big Data solutions for various reasons, spanning many business processes, as illustrated in Figure 4. Analysis of operational data appears to be the biggest driver for adopting Big Data solutions.
Figure 4: Top drivers for using Big Data Analytics or Business Intelligence
Layers of Big Data
To gain actionable insight from data, it has to pass through different stages, as illustrated in Figure 5. A Big Data platform has four layers that everyone should be aware of.
1. Data Sources Layer: This is the first stage, where the organization accumulates its data from various sources, e.g. social networking sites, emails, transaction records, and data residing in existing databases. It is best to perform a detailed assessment of the problem you are going to address and how it helps the business, and to measure that against the data you currently have. You may need to tap new sources of data.
Figure 5: Layers in Big Data
2. Data Storage Layer: This is where data gets stored once it is collected from various sources. As computing and storage capacity have increased over the decades, data storage has attained prime importance in Big Data. When considering a file system for storing data, keep in mind that it should be easily accessible, free from cyber threats, and easy to implement and manage. Google came up with such a file system, GFS (the Google File System), over a decade ago, but did not release it as open source. Later, Yahoo did a lot of work in this area and came up with the Hadoop Distributed File System (HDFS), releasing it as open source under the Apache Software Foundation. HDFS can run on commodity hardware and handles large-scale data with the help of MapReduce (the component of Hadoop that handles data processing).
3. Data Processing or Analysis Layer: This phase analyzes the data collected in the previous phase to derive insights from it. MapReduce is the common tool used in this analysis. It is a programming model, and an associated implementation, for processing and generating large data sets with a parallel, distributed algorithm on a cluster. The analytics phase reveals the trends and patterns of a particular business.
4. Data Output Layer: In this phase, insights gained from the analysis phase are delivered to those who are supposed to act on them. These outputs can take the form of charts, figures, and key recommendations. Clear and concise communication is a must here, as the decision makers may not have a thorough knowledge of statistics.
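The MapReduce model used in the processing layer can be sketched in plain Python. This is an illustrative, single-machine sketch of the map, shuffle, and reduce phases only, not Hadoop's actual API; all function names and sample documents here are invented for illustration.

```python
from collections import defaultdict

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    for word in document.lower().split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle: group all values by key, as Hadoop does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data is big", "data is valuable"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
print(counts["big"], counts["data"], counts["is"])  # → 2 2 2
```

In a real cluster the map and reduce calls run in parallel on many machines; the model itself is exactly this simple.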
Big Data Analytics – Introduction to Hadoop
There are many Big Data technologies, such as Hadoop and NoSQL databases like MongoDB. Hadoop is useful when dealing with massive amounts of data, as it provides the parallel processing capability to handle Big Data. It is a framework of tools created by Doug Cutting and developed further at Yahoo, inspired by Google's published papers on the Google File System (which itself was not open source). The framework was made open source and is distributed by Apache. Hadoop can run on commodity hardware such as Linux servers. HDFS and MapReduce are the two core components of the Hadoop framework.
Figure 6 depicts machines arranged in parallel at the bottom; each machine runs a data node and a task tracker. The data node is the storage component of HDFS, and the task tracker is the MapReduce worker. The data node holds the data sets, and the task tracker performs operations on them. The task trackers across machines need to be controlled and synchronized, which is done by a job tracker. A name node coordinates all the data nodes, handling the distribution of data going to each machine.
Figure 6: Hadoop Framework
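The arrangement in Figure 6 can be made concrete with a toy model: a "name node" splits a file into fixed-size blocks and records which "data nodes" hold each block, replicating each block for fault tolerance. This is a simplified sketch of the idea only, not HDFS itself; the node names, block size, and replication factor are invented for the example.

```python
BLOCK_SIZE = 8          # bytes per block (real HDFS uses 64 or 128 MB)
REPLICATION = 2         # copies of each block (HDFS defaults to 3)
DATA_NODES = ["node1", "node2", "node3"]

def split_into_blocks(data, size=BLOCK_SIZE):
    """Split a file's bytes into fixed-size blocks, as HDFS does."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Name-node role: record which data nodes hold each block."""
    block_map = {}
    for i, block in enumerate(blocks):
        # round-robin placement; real HDFS also considers rack locality
        holders = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        block_map[i] = {"data": block, "nodes": holders}
    return block_map

file_bytes = b"hello big data world!"
block_map = place_blocks(split_into_blocks(file_bytes), DATA_NODES)
for idx, entry in block_map.items():
    print(idx, entry["nodes"])
```

Because every block lives on more than one node, the loss of a single machine does not lose data, which is what lets Hadoop run safely on commodity hardware.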
Stepping into Data Analytics – A Few Guidelines
Data is valuable only if it helps lead to better decisions. Even a wealth of data holds little value when organizations are unable to translate it into meaningful insights that drive their business. Success in deriving insight depends not on the size of the data but on the effectiveness of the analytics used to generate results. Let’s look at a few steps.
How to define your data: What outcome does the organization aim to achieve by processing the data? Who is interested in it, and who is going to invest in it? Are there other major initiatives or decisions that can support this effort, and how should they be prioritized? Are there new results that were not possible with the existing system? To what degree will the new results change the future of the organization? How do we integrate the new outcome into the organization?
Creating the data framework: Understanding the data by preparing charts, figures, and tables helps indicate where the important insights lie. Prepare sketches and designs and relate things to one another. Filter out the unwanted data and concentrate on the most important data. Prioritize and summarize the data, then start creating visuals from it. It is advisable to use visualization tools. Tips that help in visualizing include:
- Look for trends rather than looking at a single data value
- Look for relationships and derive correlations
- Examine the data over a time range, e.g. week over week or month over month
- Examine the data from the perspective of others; their views are important for insights
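As a small illustration of the tips above (trends over a time range rather than single values), the following sketch computes week-over-week changes for a series of weekly website visits. The numbers are invented for the example.

```python
# Hypothetical weekly visit counts, oldest first
weekly_visits = [1200, 1350, 1280, 1500, 1620, 1580, 1750]

# Week-over-week change: compare each week to the previous one,
# rather than staring at any single data value
wow_change = [cur - prev for prev, cur in zip(weekly_visits, weekly_visits[1:])]

# A simple trend signal: the average change across all weeks.
# A positive value suggests growth despite week-to-week dips.
avg_change = sum(wow_change) / len(wow_change)

print(wow_change)   # → [150, -70, 220, 120, -40, 170]
print(avg_change > 0)  # → True
```

Individual weeks dip (weeks 3 and 6 here), but the trend over the whole range is clearly upward; this is exactly why examining a time range beats reading off one value.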
Bringing data into action: The techniques above will help reveal insights in the data. There is no standard procedure that every organization can adopt; it is up to the organization to choose the techniques best suited to producing excellent insights.
Challenges of Big Data
In its simplest form, there are four major challenges that every organization will have to face when implementing a Big Data platform.
1. Ownership: Since Big Data is heavily business-oriented, the organization’s top management will have to play a major role; they should be the leaders of Big Data projects. Big Data is helping organizations of all sizes make better business decisions, save costs, improve customer service, deliver better user experiences, and identify security risks. The insights gleaned from Big Data, and the corresponding organizational changes, have to be managed very carefully. That is why top management has to play a vital role.
2. Data: Identifying the correct and most relevant data is another challenge, as there are various sources of it. Only relevant data will produce meaningful insights to guide management in taking critical decisions. For example, if the organization wants to analyze the customer experience on its website, it would be better to collect failed login attempts and other related error logs from the site rather than logging only successful connections.
3. People: For a successful Big Data project, the team should be a mixture of Data
Scientists, Technology experts, and Business owners. Data Scientists will use their skills
and expertise to correlate data sets, identify patterns, and generate the final insights.
Technology experts form the core of the Big Data initiative by playing a role in identifying
the right set of software and hardware tools required for the platform.
Business Owners define the outcome and work with Data Scientists and Technology
experts to achieve the outcome at the right time.
4. Technology: This is the backbone of the Big Data platform in any organization. Hardware infrastructure and software tools are the two technology components that need to be in place according to the organization’s requirements. Cloud computing can be one option from a hardware infrastructure point of view. Tools such as Hadoop, NoSQL databases, and MongoDB should be identified and selected for collecting, processing, storing, and analyzing data sets.
All of the challenges above need to be addressed and managed in a balanced way for the Big Data project to succeed. Neglecting any one of them will create serious problems for Big Data projects.
Big Data use cases
Normally, organizations will not reveal their Big Data strategies to others, out of fear that doing so might affect their competitive edge in the market. However, this fear may be groundless, as Big Data projects are starting to provide benefits; that is why the market for Hadoop and NoSQL services is growing fast. A September 2013 study by open source research firm Wikibon, for instance, forecasts an annual Big Data software growth rate of 45% through 2017.
1. 360 degree view of the customer: Online marketing firms and other retailers want to know how much time customers spend on their websites, what pages they search, what they like, and when they leave. All this unstructured data is processed and analyzed along with the transactional data stored in structured form in the company’s ERP system. Social media sentiment is also added to the mix, and the whole can give a 360 degree view of the customer. Before making offers to a customer, these retailers know what the customer has bought in the past, along with their sentiment and behavior patterns.
2. Smart devices: Today, almost all machines and pieces of hardware are built with a number of sensors that can transfer a lot of information when the devices are connected over the Internet. They collect information such as device health, usage, and security. This mostly unstructured (and some structured) data is processed in the background to derive correlations and patterns that reveal device performance and customer usage over a period of time.
3. Optimizing the data warehouse: Customers can identify their archived and unstructured data and store it more cost-effectively on a Hadoop architecture.
4. Information security: Security vendors and large organizations with sophisticated
security architectures can leverage Hadoop as a platform that can offer reliable and
cheap data protection. Fraud analysis and identification will be much easier as this
platform can derive thousands of relations and patterns.
5. Health Care: A Big Data platform can play a major role in predicting outbreaks of contagious diseases and in diagnosing illness by processing information collected over many years and pulling patterns and trends from it. Data from disease control efforts, hospitals, and accident reports can easily show which geographical areas are over-served or under-served by current health care efforts.
6. Object analytics: Object analytics looks for connections among various objects, e.g. places, things, transactions, and locations. Billions of such data points can be used to predict suspicious activity in a location and to detect fraud.
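The 360 degree view described in the first use case essentially joins data from several sources on a customer key. A minimal sketch of that idea follows; all records, field names, and customer names are invented for illustration.

```python
# Structured transactions, as they might sit in a hypothetical ERP system
transactions = [
    {"customer": "alice", "item": "laptop", "amount": 900},
    {"customer": "bob", "item": "phone", "amount": 400},
]

# Clickstream pages and social-media sentiment, keyed by customer
clickstream = {"alice": ["home", "laptops", "checkout"], "bob": ["home", "phones"]}
sentiment = {"alice": "positive", "bob": "neutral"}

def customer_360(name):
    """Merge the three sources into one combined view of a customer."""
    return {
        "customer": name,
        "purchases": [t for t in transactions if t["customer"] == name],
        "pages_viewed": clickstream.get(name, []),
        "sentiment": sentiment.get(name, "unknown"),
    }

view = customer_360("alice")
print(view["sentiment"], len(view["purchases"]))  # → positive 1
```

At Big Data scale the same join runs across billions of records on a cluster, but the shape of the result, one merged profile per customer, is the same.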
Latest Trends in Big Data
More companies are getting involved in Big Data analytics to keep pace with their competition. With the Internet of Things (IoT) growing fast, customers are very aware (and not always happy) when a business uses their data in an underhanded way. Consequently, companies that are keen on Big Data analytics have to ensure they have a strong data management plan to handle customer data in an ethical manner. Let’s look at some of the interesting trends.
1. Open Source: The ongoing trend toward open source products such as Hadoop, Spark, and R is going to continue, and these products will command even bigger markets in 2015. The stability of any open source product must be ensured before placing it alongside your existing database systems. Commercial distributions of open source are becoming more popular than pure open source implementations because the code changes less often in the former, so functionality is not disrupted as frequently.
2. Cognitive computing: This makes new classes of problems computable by addressing complex situations characterized by ambiguity and uncertainty. The goal of cognitive computing is to develop technology that senses and responds to stimuli much as the human brain does. Many pioneers in IT are already in this area, and the trend will increase in 2015.
3. Big Data analytics in the cloud: Hadoop and its related framework of tools were originally designed to work on clusters of physical machines. Now the trend has changed, and an increasing number of technologies, including Hadoop, are available for processing data in the cloud. Google’s BigQuery data analytics service and Amazon’s Redshift hosted BI data warehouse are good examples.
4. NoSQL: Alternatives to traditional SQL-based relational databases, NoSQL (short for
“Not Only SQL”) databases are rapidly gaining popularity as tools for use in specific
kinds of analytic applications. That momentum will continue to grow.
5. In-Memory analytics: An in-memory database management system relies primarily on main memory for data storage. Main-memory databases are faster than disk-optimized databases because the internal optimization algorithms are simpler and execute fewer CPU instructions, and accessing data in memory eliminates seek time when querying, providing faster and more predictable performance than disk.
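The in-memory idea from the last trend can be tried with Python's standard sqlite3 module, which can keep an entire database in main memory. This is only a small-scale stand-in for the in-memory engines the trend refers to, not a recommendation of a particular product; the table and data are invented for the example.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM, so queries never touch disk
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("west", 250.0), ("east", 50.0)],
)

# Aggregation runs entirely in memory; no seek time is involved
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # → [('east', 150.0), ('west', 250.0)]
conn.close()
```

The trade-off, of course, is that an in-memory database vanishes when the process ends, which is why production in-memory engines pair RAM speed with some form of durable persistence.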
Common Myths about Big Data
Those starting their Big Data journey should be aware of some common myths so that the
project will not be a waste of time or manpower.
1. Big is simple: We know that Apache Hadoop can store and process tons of data, and it provides built-in fault tolerance, such as in-cluster replication, to improve cluster availability. However, HDFS does not natively provide a solution for advanced data protection or disaster recovery. For such functionality, enhanced Hadoop distributions, such as the one from MapR, would be required.
2. Fast analytics using Hadoop: A common misconception about Hadoop is that it is fast. It was designed for high-throughput, batch-style processing that reduces the impact of common hardware failures, not for low latency. These days, however, there are a number of enhancements that address performance, among them integration with traditional databases, streaming data, and in-memory processing products.
3. Store everything: Big Data hype has created the impression that a Big Data platform can store all the data an enterprise has, forever. That may be technically true, but the ultimate purpose of a data analytics solution would not be served. The truth is that the solution is faster, more efficient, and more cost-effective when you store only the data you need on the framework.
4. Start it as others are doing it: This is the wrong approach, at least to Big Data, as it can waste money and effort. Putting tons of data on a scalable cluster and expecting the Data Scientist to pull out insights for you will not work. As with any other project, success mostly depends on having a thought-out plan and strategy in place to drive the whole framework of tools and other resources, including Data Scientists.
Conclusion
Big Data is a buzzword heard just about everywhere. Why is Big Data getting this much attention? Because it has the potential to profoundly affect the way we do business. In the past, we looked at small data sets to make our decisions. Now, with the Internet of Things and advances in technology, we bring computing to huge sets of data rather than moving the data to the technology. This has become what is known as Big Data Analytics.
Big Data is about analytics, not storage. Start with questions, not data. Not all problems are Big Data problems. We will have to audit our data, classify problems, and set an approach; invest in up-skilling resources, build the parts, and plan the whole. In the coming years, Big Data is going to transform how we live, how we work, and how we think. Using the positive sides of Big Data Analytics and the corresponding insights holds the promise of changing the world. Hence, Big Data is a big deal!
EMC believes the information in this publication is accurate as of its publication date. The
information is subject to change without notice.
THE INFORMATION IN THIS PUBLICATION IS PROVIDED “AS IS.” EMC CORPORATION
MAKES NO REPRESENTATIONS OR WARRANTIES OF ANY KIND WITH RESPECT TO
THE INFORMATION IN THIS PUBLICATION, AND SPECIFICALLY DISCLAIMS IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
Use, copying, and distribution of any EMC software described in this publication requires an
applicable software license.