
Dr. Chen, Data Base Management

Chapter 6: Big Data and Analytics

Jason C. H. Chen, Ph.D.

Professor of MIS

School of Business Administration

Gonzaga University

Spokane, WA 99258

[email protected]

Dr. Chen, Business Intelligence 2

Learning Objectives

• Learn what Big Data is and how it is changing the world of analytics

• Understand the motivation for and business drivers of Big Data analytics

• Become familiar with the wide range of enabling technologies for Big Data analytics

• Learn about Hadoop, MapReduce, and NoSQL as they relate to Big Data analytics

• Understand the role of, and the capabilities/skills needed for, the data scientist as a new analytics profession

(Continued…)

Learning Objectives

• Compare and contrast the complementary uses of data warehousing and Big Data

• Become familiar with the vendors of Big Data tools and services

• Understand the need for and appreciate the capabilities of stream analytics

• Learn about the applications of stream analytics


Opening Vignette…

Big Data Meets Big Science at CERN

• Situation

• Problem

• Solution

• Results

• Answer & discuss the case questions.

Questions for the Opening Vignette

• What is CERN, and why is it important to the world of science?

• How does the Large Hadron Collider work? What does it produce?

• What is the essence of the data challenge at CERN? How significant is it?

• What was the solution? How were the Big Data challenges addressed with this solution?

• What were the results? Do you think the current solution is sufficient?

• 1. What is CERN, and why is it important to the world of science?

• CERN is the European Organization for Nuclear Research. It plays a leading role in fundamental studies of physics.

• It has been instrumental in many key global innovations and breakthrough discoveries in theoretical physics and today operates the world’s largest particle physics laboratory, home to the Large Hadron Collider (LHC).

• 2. How does the Large Hadron Collider work? What does it produce?

• The LHC houses the world’s largest and most sophisticated scientific instruments to study the basic constituents of matter—the fundamental particles. These instruments include purpose-built particle accelerators and detectors.

• Accelerators boost beams of particles to very high energies before the beams are forced to collide with each other or with stationary targets. Detectors observe and record the results of these collisions, which happen at or near the speed of light. This process provides physicists with clues about how the particles interact, and provides insights into the fundamental laws of nature.

• 3. What is the essence of the data challenge at CERN? How significant is it?

• Collision events in the LHC occur 40 million times per second, resulting in 15 petabytes of data produced annually at the CERN Data Centre. CERN does not have the capacity to process all of the data that it generates, and therefore relies on numerous other research centers all around the world to access and process the data.

• Processing all the data from these experiments requires an enormously complex, distributed, and heterogeneous computing and data management system. With such vast quantities of data, both structured and unstructured, information discovery is a big challenge for CERN.

• 4. What was the solution? How were the Big Data challenges addressed with this solution?

• CMS’s data management and workflow management (DMWM) group created the Data Aggregation System (DAS), built on MongoDB (a Big Data management infrastructure), to provide the ability to search and aggregate information across this complex data landscape.

• MongoDB’s schema-less structure allowed flexible data structures to be stored and indexed.
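A rough illustration of why a schema-less store suits this problem: documents describing the same kind of thing need not share the same fields, unlike rows in a fixed relational table. The sketch below uses plain Python dictionaries and a hypothetical `find` helper; the dataset and site names are invented for illustration, and this is not MongoDB’s actual query API.

```python
# Documents with differing fields, as a schema-less store would accept.
# (Hypothetical data; not MongoDB's API.)
docs = [
    {"dataset": "/Zmm/Run2011A", "site": "T1_US_FNAL", "nevents": 1_200_000},
    {"dataset": "/Zmm/Run2011A", "site": "T2_CH_CERN"},              # no nevents
    {"dataset": "/Wenu/Run2011B", "site": "T1_US_FNAL", "size_gb": 842},
]

def find(collection, **criteria):
    """Return documents matching all key/value criteria, like a simple query."""
    return [d for d in collection
            if all(d.get(k) == v for k, v in criteria.items())]

fnal = find(docs, site="T1_US_FNAL")
print(len(fnal))  # 2 documents match, despite their differing fields
```

A relational table would force every row into one column set; here each document carries only the fields it has, and queries still work across all of them.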

• 5. What were the results? Do you think the current solution is sufficient?

• DAS is used 24 hours a day, seven days a week, by CMS physicists, data operators, and data managers at research facilities around the world. The performance of MongoDB has been outstanding, offering a free-text query system that is fast and scalable.

• Without help from DAS, information lookup would have taken orders of magnitude longer. The current solution is outstanding, but more improvements can be made, and CERN is looking to apply Big Data approaches beyond CMS.

Big Data – Definition and Concepts

• Big Data means different things to people with different backgrounds and interests

• Traditionally, “Big Data” = massive volumes of data (e.g., the volume of data at CERN, NASA, Google, …)

• Where does Big Data come from? Everywhere! Web logs, RFID, GPS systems, sensor networks, social networks, Internet-based text documents, Internet search indexes, call detail records, astronomy, atmospheric science, biology, genomics, nuclear physics, biochemical experiments, medical records, scientific research, military surveillance, multimedia archives, …

Technology Insights 6.1
The Data Size Is Getting Big, Bigger, …

• Large Hadron Collider – 1 PB/sec

• Boeing jet – 20 TB/hr

• Facebook – 500 TB/day

• YouTube – 1 TB per 4 min

• The proposed Square Kilometer Array telescope (the world’s proposed biggest telescope) – 1 EB/day

Big Data – Definition and Concepts

• Big Data is a misnomer!

• Big Data is more than just “big”

• The Vs that define Big Data: Volume, Variety, Velocity, Veracity, Variability, Value

A High-Level Conceptual Architecture for Big Data Solutions (by AsterData/Teradata)

[Figure: Teradata Unified Data Architecture, system conceptual view – Big Data sources (ERP, SCM, CRM, images, audio and video, machine logs, text, Web and social) flow through event processing into a data platform, discovery platform, and integrated data warehouse (access, manage, move), which feed analytic tools and apps (math and stats, data mining, business intelligence, applications, languages, marketing) used by marketing executives, operational systems, frontline workers, customers, partners, engineers, data scientists, and business analysts.]

Fundamentals of Big Data Analytics

• Big Data by itself, regardless of the size, type, or speed, is worthless

• Big Data + “big” analytics = value

• With the value proposition, Big Data also brought about big challenges: effectively and efficiently capturing, storing, and analyzing Big Data. A new breed of technologies is needed (developed, purchased, hired, or outsourced …)

Big Data Considerations

• You can’t process the amount of data that you want to because of the limitations of your current platform.

• You can’t include new/contemporary data sources (e.g., social media, RFID, sensor, Web, GPS, textual data) because they do not comply with the data storage schema.

• You need to (or want to) integrate data as quickly as possible to be current in your analysis.

• You want to work with a schema-on-demand data storage paradigm because of the variety of data types involved.

• The data is arriving so fast at your organization’s doorstep that your traditional analytics platform cannot handle it.

• …


Critical Success Factors for

Big Data Analytics

• A clear business need (alignment with the vision and the strategy)

• Strong, committed sponsorship (executive champion)

• Alignment between the business and IT strategy

• A fact-based decision-making culture

• A strong data infrastructure

• The right analytics tools

• The right people with the right skills

Critical Success Factors for Big Data Analytics

[Figure: Keys to success with Big Data analytics – a clear business need; strong, committed sponsorship; alignment between the business and IT strategy; a fact-based decision-making culture; a strong data infrastructure; the right analytics tools; personnel with advanced analytical skills.]

Enablers of Big Data Analytics

• In-memory analytics – storing and processing the complete data set in RAM

• In-database analytics – placing analytic procedures close to where the data is stored

• Grid computing & MPP – use of many machines and processors in parallel (MPP = massively parallel processing)

• Appliances – combining hardware, software, and storage in a single unit for performance and scalability
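The first two enablers can be contrasted with a small sketch. SQLite stands in here for a real analytics platform, and the `sales` table and its values are invented for illustration: in-memory analytics pulls the whole data set into RAM and computes there, while in-database analytics pushes the computation to where the data lives, so only the result crosses the wire.

```python
import sqlite3

# A toy data set standing in for a large fact table. (Hypothetical values.)
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount REAL)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("east", 100.0), ("west", 250.0), ("east", 50.0)])

# In-memory analytics: fetch the data set into RAM, then analyze it there.
rows = con.execute("SELECT amount FROM sales").fetchall()
in_memory_total = sum(amount for (amount,) in rows)

# In-database analytics: push the computation down to the database engine,
# so only the aggregated result is returned.
(in_db_total,) = con.execute("SELECT SUM(amount) FROM sales").fetchone()

print(in_memory_total == in_db_total)  # True (400.0 either way)
```

At Big Data scale the second approach avoids moving terabytes across the network just to sum them, which is the motivation behind in-database analytics.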

Challenges of Big Data Analytics

• Data volume – the ability to capture, store, and process the huge volume of data in a timely manner

• Data integration – the ability to combine data quickly and at reasonable cost

• Processing capabilities – the ability to process the data quickly, as it is captured (i.e., stream analytics)

• Data governance (… security, privacy, access)

• Skill availability (… data scientists)

• Solution cost (ROI)

Business Problems Addressed by Big Data Analytics

• Process efficiency and cost reduction

• Brand management

• Revenue maximization (e.g., cross-selling/ up-selling/ bundled-selling)

• Enhanced customer experience

• Churn identification, customer recruiting

• Improved customer service

• Identifying new products and market opportunities

• Risk management

• Regulatory compliance

• Enhanced security capabilities

• …

Application Case 6.2

Top 5 Investment Bank Achieves Single Source of the Truth

Questions for Discussion

1. How can Big Data benefit large-scale trading banks?

2. How did MarkLogic’s infrastructure help ease the leveraging of Big Data?

3. What were the challenges, the proposed solution, and the obtained results?

Application Case 6.2

Before: it was difficult to identify financial exposure across many systems (separate copies of the derivatives trade store).

After: it was possible to analyze all contracts in a single database (MarkLogic Server eliminates the need for 20 database copies).

Moving from many old systems to a unified new system.

• 1. How can Big Data benefit large-scale trading banks?

• Big Data can potentially handle the high-volume, high-variability, continuously streaming data that trading banks need to deal with.

• Traditional relational databases are often unable to keep up with the data demand.

• 2. How did MarkLogic’s infrastructure help ease the leveraging of Big Data?

• MarkLogic was able to meet two needs: (1) upgrading the existing Oracle and Sybase platforms, and (2) compliance with regulatory requirements.

• Their solution provided better performance, scalability, and faster development for future requirements than competitors’.

• Most importantly, MarkLogic was able to eliminate the need for replicated database servers by providing a single server with timely access to the data.

• 3. What were the challenges, the proposed solution, and the obtained results?

• The Bank’s legacy system was built on relational database technology. As data volumes and variability increased, the legacy system was not fast enough to respond to growing business needs and requirements.

• It was unable to deliver real-time alerts to manage market and counterparty credit positions in the desired timeframe. Big Data offered the scalability to address this problem.

• The benefits included a new alert feature, less downtime for maintenance, much faster capacity to process complex changes, and reduced operations costs.

Big Data Technologies

• There are a number of technologies for processing and analyzing Big Data, but most share some common characteristics. Namely, they:

  take advantage of commodity hardware to enable scale-out and parallel-processing techniques;

  employ non-relational data storage capabilities to process unstructured and semi-structured data;

  apply advanced analytics and data visualization technology to Big Data to convey insights to end users.

• Three major technologies will transform the business analytics and data management markets: MapReduce, Hadoop, and NoSQL

Big Data Technologies

• MapReduce …

• Hadoop …

  Hive

  Pig

  HBase

  Flume

  Oozie

  Ambari

  Avro

  Mahout, Sqoop, HCatalog, …

Big Data Technologies – MapReduce

• MapReduce

  distributes the processing of very large multi-structured data files across a large cluster of ordinary machines/processors by breaking the processing into small units of work

  is a programming model, not a programming language

• Goal

  achieving high performance with many “simple” computers

  processing and analyzing large volumes of multi-structured data in a timely manner

• Developed and popularized by Google

• Example tasks: indexing the Web for search, graph analysis, text analysis, machine learning, …

MapReduce Processes

• Objective: count the number of squares of each color

• Input: a set of colored squares

• Responsibility of programmers: coding the map and reduce programs; the remainder of the processing is handled by the software system implementing the MapReduce programming model

• Processes: the MapReduce system first reads the input file and splits it into multiple pieces. In this example there are two splits, but in a real-life scenario the number of splits would typically be much higher. These splits are then processed by multiple map programs running in parallel on the nodes of the cluster.

• Role of each map program: group the data in a split by color. The MapReduce system then takes the output from each map program and merges (shuffle/sort) the results for input to the reduce program, which calculates the sum of the number of squares of each color. In this example, only one copy of the reduce program is used.
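The color-counting process just described can be sketched as a single-process simulation of the programming model. The input squares below are invented for illustration, and real MapReduce runs the map programs on separate cluster nodes rather than in one script; this only shows the map, shuffle/sort, and reduce steps in miniature.

```python
from itertools import groupby

# Input: a set of colored squares, pre-divided into two splits,
# as in the slide's example. (Colors are made up for illustration.)
splits = [["red", "blue", "red", "green"],
          ["blue", "red", "green", "green"]]

def map_program(split):
    """Map: emit a (color, 1) pair for each square in the split."""
    return [(color, 1) for color in split]

def reduce_program(color, counts):
    """Reduce: sum the counts emitted for one color."""
    return color, sum(counts)

# Shuffle/sort: merge all map outputs and group them by color.
mapped = [pair for split in splits for pair in map_program(split)]
mapped.sort()
result = dict(reduce_program(color, [c for _, c in group])
              for color, group in groupby(mapped, key=lambda kv: kv[0]))
print(result)  # {'blue': 2, 'green': 3, 'red': 3}
```

Only `map_program` and `reduce_program` are the programmer’s responsibility; the splitting, shuffling, and scheduling lines stand in for what the MapReduce system does automatically.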

Big Data Technologies – MapReduce

• How does MapReduce work?

[Fig. 6.4: A Graphical Depiction of the MapReduce Process – raw data flows through the map function and then the reduce function, which emits the count of squares of each color.]

Big Data Technologies – Hadoop

• Hadoop is an open source framework for storing and analyzing massive amounts of distributed, unstructured data in a cost- and time-effective manner.

• Originally created by Doug Cutting at Yahoo!

• Hadoop clusters run on inexpensive commodity hardware, so projects can scale out inexpensively

• Hadoop is now part of the Apache Software Foundation

• Open source – hundreds of contributors continuously improve the core technology

• MapReduce + Hadoop = Big Data core technology

Big Data Technologies – Hadoop

• How Does Hadoop Work?

  Access unstructured and semi-structured data (e.g., log files, social media feeds, other data sources).

  Break the data up into “parts,” which are then loaded into a file system made up of multiple nodes running on commodity hardware using the Hadoop Distributed File System (HDFS).

  Each “part” is replicated multiple times and loaded into the file system for replication and failsafe processing.

  One node acts as the Facilitator and another as the Job Tracker.

    Facilitator: communicates back to the client information such as which nodes are available, where in the cluster certain data resides, and which nodes have failed.

    Job Tracker: determines which data it needs to access to complete the job and where in the cluster that data is located.

  The Job Tracker submits the query to the relevant nodes rather than bringing all the data back into a central location for processing. Processing then occurs at each node simultaneously, or in parallel – an essential characteristic of Hadoop.

Big Data Technologies – Hadoop

• How Does Hadoop Work? (cont.) – Jobs are distributed to the clients, and once completed, the results are collected and aggregated using MapReduce.

  When each node has finished processing its given job, it stores the results.

  The client initiates a “Reduce” job through the Job Tracker, in which the results of the map phase stored locally on the individual nodes are aggregated to determine the “answer” to the original query and then loaded onto another node in the cluster.

  The client accesses these results, which can then be loaded into one of a number of analytic environments for analysis. The MapReduce job has now been completed.

  Once the MapReduce phase is complete, the processed data is ready for further analysis by data scientists and others with advanced data analytics skills.
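As a rough sketch of the map-then-aggregate flow just described, the snippet below simulates “nodes” with threads, each counting words in its local part of a log file, and then reduces the local results into one answer. The log lines are invented, and real Hadoop ships the code to the data on HDFS rather than simulating nodes in one process.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Each "part" stands in for a block of a large log file stored on a
# different node. (Hypothetical data, for illustration only.)
parts = ["error warn info", "info info error", "warn error error"]

def node_map(part):
    """Runs on one 'node': count word occurrences in the local part."""
    return Counter(part.split())

# Map phase: every node processes its own part in parallel.
with ThreadPoolExecutor() as pool:
    local_results = list(pool.map(node_map, parts))

# Reduce phase: aggregate the locally stored results into the answer.
totals = sum(local_results, Counter())
print(totals["error"])  # 4
```

The essential point mirrored here is that no node ever sees the whole data set; only the small per-node summaries are moved and combined.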

Big Data Technologies – Hadoop

• Hadoop Technical Components

  Hadoop Distributed File System (HDFS)

  Name Node (primary facilitator)

  Secondary Node (backup to the Name Node)

  Job Tracker

  Slave Nodes (the grunts of any Hadoop cluster)

• Additionally, the Hadoop ecosystem is made up of a number of complementary sub-projects: NoSQL (Cassandra, HBase), DW (Hive), …

  NoSQL = “not only SQL”

Big Data Technologies – Hadoop: Demystifying Facts

• Hadoop consists of multiple products

• Hadoop is open source but available from vendors too

• Hadoop is an ecosystem, not a single product

• HDFS is a file system, not a DBMS

• Hive resembles SQL but is not standard SQL

• Hadoop and MapReduce are related but not the same

• MapReduce provides control for analytics, not analytics

• Hadoop is about data diversity, not just data volume

• Hadoop complements a DW; it’s rarely a replacement

• Hadoop enables many types of analytics, not just Web analytics


Data Scientist

“The Sexiest Job of the 21st Century” – Thomas H. Davenport and D. J. Patil, Harvard Business Review, October 2012

• Data Scientist = Big Data guru: one with the skills to investigate Big Data

• Very high salaries, very high expectations

• Where do Data Scientists come from?

  M.S./Ph.D. in MIS, CS, IE, … and/or Analytics

  There is not a specific degree program for DS!

  PE, PML, … DSP (Data Science Professional)

Skills That Define a Data Scientist

• Curiosity and creativity

• Internet and social media / social networking technologies

• Programming, scripting, and hacking

• Data access and management (both traditional and new data systems)

• Domain expertise, problem definition, and decision modeling

• Communication and interpersonal skills

A Typical Job Post for a Data Scientist

[Figure: sample job posting for a data scientist]

Application Case 6.4

Big Data and Analytics in Politics

Questions for Discussion

1. What is the role of analytics and Big Data in modern-day politics?

2. Do you think Big Data analytics could change the outcome of an election?

3. What do you think are the challenges, the potential solution, and the probable results of the use of Big Data analytics in politics?

Application Case 6.4
Big Data and Analytics in Politics

INPUT: Data Sources

• Census data: population specifics, age, race, sex, income, etc.

• Election databases: party affiliations, previous election outcomes, trends and distributions

• Market research: polls, recent trends and movements

• Social media: Facebook, Twitter, LinkedIn, newsgroups, blogs, etc.

• Web (in general): Web pages, posts and replies, search trends, etc.

• Other data sources

OUTPUT: Goals

• Raise money contributions

• Increase the number of volunteers

• Organize movements

• Mobilize voters to get out and vote

• Other goals and objectives

• …

Big Data & Analytics (data mining, Web mining, text mining, multimedia mining)

• Predicting outcomes and trends

• Identifying associations between events and outcomes

• Assessing and measuring sentiments

• Profiling (clustering) groups with similar behavioral patterns

• Other knowledge nuggets

• 1. What is the role of analytics and Big Data in modern-day politics?

• Big Data and analytics have a lot to offer to modern-day politics. The main characteristics of Big Data, namely volume, variety, and velocity (the three Vs), readily apply to the kind of data that is used for political campaigns.

• Big Data analytics can help predict election outcomes as well as target potential voters and donors, and has become a critical part of political campaigns.

• 2. Do you think Big Data analytics could change the outcome of an election?

• It may well have changed the outcome of the 2008 and 2012 elections. For example, in 2012, Barack Obama had a data advantage, and the depth and breadth of his campaign’s digital operation, from political and demographic data mining to voter sentiment and behavioral analysis, reached beyond anything politics had ever seen.

• So there was a significant difference in expertise and comfort level with modern technology between the two parties. The usage and expertise gap between the party lines may disappear over time, and this will even the playing field. But new data regarding voter sympathies may itself change the thinking and ideological directions of the major parties. This in itself could shift election outcomes.

• 3. What do you think are the challenges, the potential solution, and the probable results of the use of Big Data analytics in politics?

• One of the big challenges is to ensure some sort of reliability in news media coverage of election issues. For example, most people, including the so-called political experts (who often rely on gut feelings and experiences), thought the 2012 presidential election would be very close. They were wrong.

• On the other hand, data analysts, using data-driven analytical models, predicted that Barack Obama would win easily, with close to 99 percent certainty. The results will be an ever-increasing use of Big Data analytics in politics, both by the parties and candidates themselves and by the news media and analysts who cover them.

Big Data and Data Warehousing

• Two paradigms in BI: Data Warehouse and Big Data. Both compete to turn data into actionable information.

• However, in recent years, the variety and complexity of data have made the data warehouse incapable of keeping up with changing needs.

• Big Data: a new paradigm that the world of IT was forced to develop, not because of the volume of structured data but because of the variety and the velocity.

Big Data and Data Warehousing

• What is the impact of Big Data on the DW?

  Big Data and RDBMSs do not go nicely together

  Will Hadoop replace data warehousing/RDBMSs?

• Use cases for Hadoop

  Hadoop is the repository and refinery for raw data

  Hadoop is a powerful, economical, and active archive

• Use cases for data warehousing

  Data warehouse performance

  Integrating data that provides business value

  Interactive BI tools

Hadoop versus Data Warehouse: When to Use Which Platform

Coexistence of Hadoop and DW

• The following scenarios support an environment where Hadoop and the RDBMS are separate from each other and connectivity software is used to exchange data between the two systems:

1. Use Hadoop for storing and archiving multi-structured data

2. Use Hadoop for filtering, transforming, and/or consolidating multi-structured data

3. Use Hadoop to analyze large volumes of multi-structured data and publish the analytical results

4. Use a relational DBMS that provides MapReduce capabilities as an investigative computing platform

5. Use a front-end query tool to access and analyze data

Coexistence of Hadoop and DW

[Figure: Hadoop and the data warehouse linked by connectivity software. Source: Teradata]

Big Data Vendors

• The Big Data vendor landscape (software, hardware, services, …) is developing very rapidly

• A representative list would include:

  Cloudera – cloudera.com

  MapR – mapr.com

  Hortonworks – hortonworks.com

  Also IBM (Netezza, InfoSphere), Oracle (Exadata, Exalogic), Microsoft, Amazon, Google, …

Top 10 Big Data Vendors with Primary Focus on Hadoop

[Figure: bar chart of the top 10 vendors; the axis runs from $0 to $70.]


Technology Insights 6.4

How to Succeed with Big Data

1. Simplify

2. Coexist

3. Visualize

4. Empower

5. Integrate

6. Govern

7. Evangelize

Big Data and Stream Analytics

• Data-in-motion analytics and real-time data analytics

• One of the Vs in Big Data = Velocity

• The analytic process of extracting actionable information from continuously flowing/streaming data

• Why stream analytics? It may not be feasible to store the data, or the data may quickly lose its value

• Stream Analytics vs. Perpetual Analytics

  “Stream analytics” involves applying transaction-level logic to real-time observations; used when transaction volumes are high and the time-to-decision is too short.

  “Perpetual analytics” describes the process of performing (near) real-time analytics on data streams; used when the mission is critical and transaction volumes can be managed in real time.
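A minimal sketch of stream analytics in this sense: process each observation as it arrives, keep only a small window of recent values rather than storing the whole stream, and flag anomalies in real time. The window size, threshold, and sensor readings below are invented for illustration.

```python
from collections import deque

def stream_alerts(readings, window=5, threshold=2.0):
    """Flag readings that deviate from the rolling mean of the last
    `window` values, processing the stream one item at a time without
    storing it in full."""
    recent = deque(maxlen=window)   # bounded memory: only the window is kept
    alerts = []
    for value in readings:
        if len(recent) == window:
            mean = sum(recent) / window
            if abs(value - mean) > threshold:
                alerts.append(value)
        recent.append(value)
    return alerts

# Hypothetical sensor stream: steady around 10, with one spike.
print(stream_alerts([10, 10.2, 9.9, 10.1, 10, 10.1, 17.5, 10.2]))  # [17.5]
```

Because memory use is bounded by the window, the same loop works whether the stream carries eight readings or eight billion, which is the point of analyzing data in motion.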

• Q: Why is stream analytics becoming more popular?

• Stream analytics is becoming increasingly popular for two main reasons: first, time-to-action has become an ever-decreasing value, and second, we now have the technological means to capture and process the data while it is being created.

• Application Case 6.7: Turning Machine-Generated Streaming Data into Valuable Business Insights

Stream Analytics: A Use Case in the Energy Industry

[Figure: sensor data (energy production system status), meteorological data (wind, light, temperature, etc.), and usage data (smart meters, smart grid devices) flow through data integration and temporary staging into streaming analytics (predicting usage, production, and anomalies) and a permanent storage area; the analytics drive capacity decisions and pricing decisions for the energy production system (traditional and/or renewable) and the energy consumption system (residential and commercial).]


Stream Analytics Applications

• e-Commerce

• Telecommunication

• Law Enforcement and Cyber Security

• Power Industry

• Financial Services

• Health Services

• Government

End-of-Chapter Application Case
Discovery Health Turns Big Data into Better Healthcare

Questions for Discussion

1. How big is “big data” for Discovery Health?

2. What big data sources did Discovery Health use for their analytic solutions?

3. What were the main data/analytics challenges Discovery Health was facing?

4. What were the main solutions they have produced?

5. What were the initial results/benefits? What do you think will be the future of “big data” analytics at Discovery?

• 1. How big is “big data” for Discovery Health?

• Discovery Health serves 2.7 million members, and its claims system generates a million new rows of data each day. For analytics purposes, they want to use three years of data, which comes to over a billion rows of claims data alone.

• 2. What big data sources did Discovery Health use for their analytic solutions?

• The data sources include members’ claims data, pathology results, members’ questionnaires, data from pharmacies, hospital admissions, and billing data, to name a few. Looking to the future, Discovery is also starting to analyze unstructured data, such as text-based surveys and comments from online feedback forms.

• 3. What were the main data/analytics challenges Discovery Health was facing?

• The challenge faced by Discovery Health was similar to that faced by any company operating in a Big Data environment. Given the huge amount of data that Discovery Health was generating, how could they identify the key insights that their business and their members’ health depend on?

• To find the needles of vital information in the big data haystack, the company needed not only a sophisticated data-mining and predictive modeling solution but also an analytics infrastructure with the power to deliver results at the speed of business.

• 4. What were the main solutions they have produced?

• To achieve this transformation in its analytics capabilities, Discovery worked with BITanium, an IBM Business Partner with deep expertise in operational deployments of advanced analytics technologies.

• The solution included the use of IBM SPSS and PureData System for Analytics.

• 5. What were the initial results/benefits? What do you think will be the future of “big data” analytics at Discovery?

• Discovery Health is now able to run three years’ worth of data for its 2.7 million members through complex statistical models to deliver actionable insights in a matter of minutes. This helps them to better predict and prevent health risks, and to identify and eliminate fraud. The results have been transformative.

• From an analytics perspective, the speed of the solution gives Discovery more scope to experiment with and optimize its models. From a broader business perspective, the combination of SPSS and PureData technologies gives Discovery the ability to put actionable data in the hands of its decision makers faster.


End of the Chapter

• Questions, comments