Foundation for Success: How Big Data Fits in an Information Architecture
DESCRIPTION
BDIA Roundtable Live Webcast on April 9, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=c84869fcca958d278b210cfca2a023a0
Big Data can offer big value and big challenges, and there are lots of solutions and promises out there. But to harness the most insight from Big Data, organizations need to solve pain points with more than triage. Since data challenges continue to permeate the information landscape, businesses would do well to incorporate solutions that fit into the infrastructure and provide a sustainable method for managing and analyzing Big Data. Register for this Roundtable Webcast to hear veteran Analysts Robin Bloor, Mike Ferguson and Richard Winter as they offer their perspectives on the evolving Big Data industry. They’ll comment on the proposed Big Data Information Architecture, and take questions from the audience. This is the second event of The Bloor Group's Interactive Research Report for 2014, which will focus on illuminating optimal Big Data Information Architectures. The series will include a dozen interviews with today's Big Data visionaries, plus three interactive Webcasts and a detailed findings report. Visit InsideAnalysis.com for more information.
TRANSCRIPT
Grab some coffee and enjoy the pre-show banter before the top of the hour!
“The Inevitable Shift: How Big Data Impacts Enterprise Architecture”
RoundTable Webcast | April 9, 2014
Host
Eric Kavanagh, CEO, The Bloor Group, @eric_kavanagh
Findings Webcast June 25, 2014
Big Data Information Architecture
Roundtable Webcast April 9, 2014
Exploratory Webcast January 22, 2014
#BigDataArch
Analysts
Robin Bloor Chief Analyst, The Bloor Group
Richard Winter President & Founder, WinterCorp
Mike Ferguson Managing Director, Intelligent Business Strategies
BIG DATA
Hadoop as the Data Reservoir
Big Data and the Data Reservoir
BDIA: The Story So Far
Robin Bloor, Ph.D.
Big Data – A Poorly Defined Term
WHAT IS BIG DATA?
Business data
Traditional data
Log file data
Operational data
Mobile data
Location data
Social network data
Public data
Commercial databases
Streaming data
Internet of Things
A TRANSACTION is a MOLECULE of ATOMIC EVENTS
The ATOM of data has become the EVENT
Atoms and Molecules
The Traffic Cop (Events)
DATA FLOW is becoming a driving factor
This suggests the need for a DATA RESERVOIR
Hadoop as the Data Reservoir
Big Data and the Data Reservoir
The Workload Paradigm Shift
• Previously, we viewed database workloads as an I/O optimization problem
• With analytics the workload is a very variable mix of I/O and calculation
• No databases were built precisely for this – not even Big Data databases
The Big Data Applications
It’s pretty much all about BI & ANALYTICS
The Biological System
• Our human control system works at different speeds:
  • Almost instant reflex
  • Swift response
  • Considered response
• Organizations will gradually implement similar control systems
• This suggests a data-flow-based architecture
• The EDW is memory
The Corporate Biological System
• Right now this division into two different data flows is already occurring
• Currently we can distinguish between:
  • Real-time/Business-time applications
  • Analytical applications
• We should build specific architectures for this
WINTERCORP
THE LARGE SCALE DATA MANAGEMENT EXPERTS
Big Data Information Architecture Bloor Group Roundtable
Richard Winter WinterCorp
April 2014
Big Data and the Data Reservoir
From Robin’s charts:
©2010 Winter Corporation. All Rights Reserved.
It’s About the Platforms & Their Roles
• Data Warehouse
• Data Mart
• Data Refinery
• Data Landing Zone
• Data Discovery
• Graph Analytics
• Etc.
© 2012, 2013, 2014 WINTER CORPORATION, BELMONT MA. ALL RIGHTS RESERVED.
Data Refining Example: Data from Turbines
Data Refining Example: Data Management Requirements
1. Hundreds of TB or more of data per week
2. Raw data life: few hours to a few days
3. Challenge: find the important events or trends quickly
4. Massive analysis problem
5. When analyzing, read entire files
6. Keep only the significant data
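The requirements above amount to a filter: read each raw file in full, keep only the readings that matter, and let the rest expire. A minimal sketch of that idea follows. The reading format (turbine id, timestamp, vibration value) and the fixed threshold are hypothetical; the slides do not describe the actual turbine data or the analysis used.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Reading:
    turbine_id: str
    timestamp: int    # seconds since epoch (hypothetical field)
    vibration: float  # hypothetical sensor value


def refine(readings: Iterable[Reading], threshold: float) -> List[Reading]:
    """Scan every reading once and keep only those above the threshold."""
    return [r for r in readings if r.vibration > threshold]


# Toy "raw file": four readings, two of which are significant.
raw = [
    Reading("T1", 0, 0.2),
    Reading("T1", 1, 0.9),
    Reading("T2", 0, 0.1),
    Reading("T2", 1, 1.4),
]
significant = refine(raw, threshold=0.8)
```

In a real refinery the scan would run in parallel over whole files, which is why the slide pairs this workload with full-file reads rather than indexed lookups.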
Business Example: Enterprise Data Warehouse
Enterprise Data Warehouse: Data Management Requirements
1. Data volume
   a. TB to PB – all retained for at least five years
   b. Continual growth of data and workload
2. Data sources: hundreds to thousands
   a. Data sources change their feeds frequently
   b. New data sources are frequent
3. Challenges
   a. Data must be correct
   b. Data must be integrated
4. Typical enterprise data lifetime: decades
5. Analytic application lifetime: years
6. Many thousands of data users (10^4 – 10^6)
7. Hundreds of analytic applications
8. Thousands of one-time analyses
9. Tens of thousands of complex queries
Some Platform Examples
Requirement                       Platform
Data Refinery                     Hadoop
Complex SQL Query                 Data Warehouse
Enforce/Manage Business Rules     Data Warehouse
Intensive Batch Processing        Hadoop
Simple Data Mart                  Multiple Options
Data Discovery                    New Category
Integrated Data                   Data Warehouse
Data Landing Zone                 Hadoop
Document Store                    Multiple Options
Stream Processing                 Multiple Options
ETL                               Multiple Options
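The requirement-to-platform pairs above can be held as a simple lookup, which is one way to make the "roles" idea concrete. This is only an illustrative sketch: the dictionary copies the pairs from the slide, while the function name and the "Unknown" fallback are assumptions added for the example.

```python
# Requirement -> platform, as listed on the slide (keys lower-cased for lookup).
PLATFORM_FOR = {
    "data refinery": "Hadoop",
    "complex sql query": "Data Warehouse",
    "enforce/manage business rules": "Data Warehouse",
    "intensive batch processing": "Hadoop",
    "simple data mart": "Multiple Options",
    "data discovery": "New Category",
    "integrated data": "Data Warehouse",
    "data landing zone": "Hadoop",
    "document store": "Multiple Options",
    "stream processing": "Multiple Options",
    "etl": "Multiple Options",
}


def suggest_platform(requirement: str) -> str:
    """Return the slide's platform for a requirement, or 'Unknown' (hypothetical fallback)."""
    return PLATFORM_FOR.get(requirement.strip().lower(), "Unknown")
```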
Understand the Platform Cost Tradeoffs
• Cost tradeoffs can be surprising – platform cost is not always the driver
• Requires a total cost framework & systematic approach
• “Big Data: What Does it Really Cost?” wintercorp.com/tcod-report
Data Platforms: A Changing Picture
• Categories are not settled
• Data Warehouse has a continuing, major role
• Hadoop has a major role
• Everything else is in flux
Big Data Information Architecture
Mike Ferguson
Managing Director, Intelligent Business Strategies
Bloor Group Big Data Roundtable
April 2014
Twitter: @mikeferguson1
For Many Years The Traditional Data Warehouse and BI Environment Has Been Used For Analysis & Reporting
[Diagram: operational systems feed a data integration / DQ layer that loads the data warehouse & data marts (DW); a BI tools platform produces reports & analytics, delivered through a web portal to employees, partners and customers.]
However There Are New Types of Data That Businesses Now Want to Analyse
§ Web data
• Clickstream data, e-commerce logs
• Social networks data e.g., Twitter
§ Semi-structured data e.g., e-mail, XML, JSON
§ Unstructured content
• How much is TEXT worth to you?
§ Sensor data
• Temperature, light, vibration, location, liquid flow, pressure, RFIDs
§ Vertical industries structured transaction data
• E.g. Telecom call data records, retail
Source: Analytics: The Real-World Use of Big Data, Said Business School Oxford and IBM
The Impact of Big Data – We Now Have Different Platforms Optimised For Different Analytical Workloads
Big Data workloads now mean we require multiple platforms for analytical processing:
• Streaming data: real-time stream processing & decision management
• NoSQL DBMS (e.g. graph DB): graph analysis
• Hadoop data store: investigative analysis, data refinery; advanced analytics (multi-structured data)
• Analytical RDBMS (DW appliance): data mining, model development; advanced analytics (structured data)
• Data warehouse RDBMS (EDW, DW & marts): traditional query, reporting & analysis
• MDM (CRUD on product, asset and customer data): master data management
Hadoop Is A Platform At The Heart of Big Data Analytics – There Are Multiple Ways To Access Hadoop
[Diagram: ways to access data in Hadoop]
• BI tools / apps issue SQL through a vendor SQL on Hadoop engine
• SQL, Pig Latin scripts and Java MapReduce applications run on MapReduce
• APIs to HDFS, HBase, Cascading
• webHDFS (an HTTP interface to HDFS that has REST APIs)
• Hadoop 2.0 frameworks (MapReduce and others) run on YARN over HDFS files, indexes and partitions
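Of the access paths on this slide, webHDFS is the easiest to show in a few lines, because a request is just a URL of the form `/webhdfs/v1/<path>?op=<OPERATION>`. The sketch below only builds such URLs and does not contact a cluster. The host name is hypothetical, and port 50070 is assumed as the Hadoop 2.x NameNode HTTP default; a given cluster may use a different port or require authentication.

```python
from typing import Dict, Optional
from urllib.parse import quote, urlencode


def webhdfs_url(host: str, path: str, op: str, port: int = 50070,
                params: Optional[Dict[str, str]] = None) -> str:
    """Build a WebHDFS REST URL for an absolute HDFS path and an operation."""
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **(params or {})})
    return f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"


# List a directory, then open a file as a named user (both URLs are illustrative).
list_url = webhdfs_url("namenode.example.com", "/data/logs", "LISTSTATUS")
open_url = webhdfs_url("namenode.example.com", "/data/logs/a.log", "OPEN",
                       params={"user.name": "alice"})
```

Fetching `list_url` with any HTTP client would return a JSON listing on a cluster where WebHDFS is enabled.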
Popular Hadoop Use Cases
§ Hadoop as a data refinery
• Offloading data integration from a DW
§ Hadoop for investigative analysis in an analytical sandbox
§ Hadoop as an on-line data warehouse archive
The Hadoop Data Refinery
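[Diagram: sources (XML, JSON, social, web logs, web, cloud, ERP, CRM, SCM, Ops, NoSQL DB) flow into Hadoop for ELT processing; the refined data and insights are provisioned to the EDW, data marts, DW appliance, analytical DBMS and graph DBMS.]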
A Centralised Hadoop Based Data Refinery is One Way to Scale at Reduced Cost
Data Hub - Consume, Clean, Integrate, Analyse And Provision Data From Hadoop To Any Analytical Platform
[Diagram: sources (feeds, sensors, RDBMS, files, office docs, social, cloud, web logs, web services) land in a Hadoop staging area / landing zone. ELT processing, run as generated MapReduce ELT jobs, and exploratory analysis in a sandbox produce business insight, which is provisioned to the EDW, DW & marts, a DW appliance for advanced analytics (structured data), and a NoSQL DB e.g. graph DB.]
Sometimes analysts refer to this as a Data Refinery.
What is the purpose of the data refinery?
Is it to process un-modelled data or all data?
Investigative Analysis Can Be Done In A Hadoop Sandbox
[Diagram: a Hadoop sandbox for investigative / exploratory analysis by data scientists, producing new business insight]
Multi-structured data loaded into the sandbox:
• Click stream web log data
• Customer interaction data
• Social interaction data (e.g. Twitter, Facebook)
• Sensor data
• Rich media data (video, audio)
• External web content
• Documents
• Internal web content
• Seismic data (oil & gas)
Other inputs:
• Master data from the MDM system (CRUD on asset, customer, product)
• Historical data: archived DW data from the EDW / mart
Joining Big Data With Master Data During Exploratory Analysis Can Produce Insight for Competitive Advantage
[Diagram: big data + master data = business value created]
• Big data: streaming data, graph data (NoSQL DB e.g. graph DB), multi-structured data such as sentiment and sensor data
• Master data (CRUD): customer, product, asset
• Business value created: customer sentiment & product sentiment; customer online behaviour; prospects & influencers; field service optimization, risk management, asset performance
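The join this slide describes can be sketched as a lookup of each big data event against the master record it refers to. The example below joins sentiment scores to customer master data and averages them per customer segment. All record layouts, ids and the `segment` attribute are hypothetical; the slide names only the data categories, not their structure.

```python
from collections import defaultdict
from typing import Dict, Iterable

# Hypothetical master data keyed by customer id.
master_customers = {
    "C1": {"name": "Acme Ltd", "segment": "gold"},
    "C2": {"name": "Birch plc", "segment": "silver"},
}

# Hypothetical sentiment events derived from social data.
sentiment_events = [
    {"customer_id": "C1", "score": 0.5},
    {"customer_id": "C1", "score": -0.25},
    {"customer_id": "C2", "score": 0.75},
    {"customer_id": "C9", "score": 0.5},  # no master record, so it cannot be joined
]


def sentiment_by_segment(events: Iterable[dict],
                         customers: Dict[str, dict]) -> Dict[str, float]:
    """Join events to master data on customer_id, then average score per segment."""
    scores = defaultdict(list)
    for event in events:
        customer = customers.get(event["customer_id"])
        if customer is None:
            continue  # unmatched events are dropped in this sketch
        scores[customer["segment"]].append(event["score"])
    return {segment: sum(vals) / len(vals) for segment, vals in scores.items()}


result = sentiment_by_segment(sentiment_events, master_customers)
```

The unmatched `C9` event hints at why master data quality matters here: events that cannot be tied to a known customer, product or asset contribute nothing to the result.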
New Insights Can Be Added Into A DW To Enrich What You Already Know
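[Diagram: data from social, web logs, web and cloud sources is analysed by data scientists in a sandbox, e.g. deriving insight from social web sites for sentiment analytics. The new insights are loaded through data integration (DI) into the DW alongside data from operational systems.]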
Alternatively New Insights In Hadoop Can Be Integrated With A DW Using Data Virtualization To Provide Enriched Information
[Diagram: data scientists derive new insights in a Hadoop sandbox from social, web logs, web and cloud data, e.g. deriving insight from social web sites for sentiment analytics. OLTP systems feed the DW through data integration (DI). A data virtualisation layer combines the DW with the new insights, reached via SQL on Hadoop.]
Using Hadoop As A Data Archive Means Data Can Be Kept On-line, Analysed And Still Integrated With Data In The DW
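[Diagram: OLTP systems feed the DW through data integration (DI). Unused data, or data older than n years, is archived from the DW to Hadoop. A data virtualisation layer, using SQL on Hadoop, combines the archived data with the DW to deliver new insights.]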
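The age-based part of the archive rule (data older than n years) is simple to state in code. The sketch below splits rows into those that stay in the DW and those that move to the archive. The `order_date` column and the row layout are hypothetical, and the "unused data" criterion from the slide is not modelled because it would need usage statistics.

```python
from datetime import date
from typing import List, Tuple


def cutoff_date(today: date, max_age_years: int) -> date:
    """Return the date max_age_years before today."""
    try:
        return today.replace(year=today.year - max_age_years)
    except ValueError:
        # today is 29 February and the target year is not a leap year
        return today.replace(year=today.year - max_age_years, day=28)


def split_for_archive(rows: List[dict], today: date,
                      max_age_years: int) -> Tuple[List[dict], List[dict]]:
    """Split rows into (keep in DW, move to archive) by their order_date."""
    cutoff = cutoff_date(today, max_age_years)
    keep = [r for r in rows if r["order_date"] >= cutoff]
    archive = [r for r in rows if r["order_date"] < cutoff]
    return keep, archive


rows = [
    {"id": 1, "order_date": date(2008, 1, 1)},
    {"id": 2, "order_date": date(2009, 4, 9)},
    {"id": 3, "order_date": date(2013, 6, 1)},
]
keep, archive = split_for_archive(rows, today=date(2014, 4, 9), max_age_years=5)
```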
Real-time Data From NoSQL DBMSs Can Also Be Joined To DW Data Using Data Virtualization
[Diagram: OLTP systems feed the DW through data integration (DI). Social, web log and sensor data lands in a NoSQL DB (column family DB or document DB). A data virtualisation layer joins the two to deliver real-time insights.]
Nested data like JSON needs to be handled by the data virtualisation server.
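Handling nested data usually means turning it into flat rows before it can be joined to relational DW tables. A minimal sketch of one common approach follows: nested objects become dotted column names, and one array is exploded into a row per element. The document layout is hypothetical, and this is not how any particular data virtualisation product does it.

```python
import json
from typing import List


def flatten(obj: dict, prefix: str = "") -> dict:
    """Flatten nested objects into a single dict with dotted keys."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def explode(record: dict, array_field: str) -> List[dict]:
    """Return one flat row per element of record[array_field]."""
    parent = flatten({k: v for k, v in record.items() if k != array_field})
    return [
        {**parent, **flatten(item, prefix=array_field + ".")}
        for item in record.get(array_field, [])
    ]


# Hypothetical sensor document as it might sit in a document DB.
doc = json.loads("""
{"device": "d1",
 "location": {"city": "Manchester"},
 "readings": [{"t": 1, "temp": 20.5}, {"t": 2, "temp": 21.0}]}
""")
rows = explode(doc, "readings")
```

Each resulting row carries the parent columns (`device`, `location.city`) alongside one reading, which is the shape a SQL join against DW data expects.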
Investigative Analysis Can Be Done In A Graph DBMS – New Insight Can Also Come From Graph Analysis
[Diagram: data scientists load multi-structured data and structured master data (MDM system: CRUD on asset, customer, product) into a graph DBMS for investigative / exploratory analysis, producing new business insight.]
SQL Access To Big Data - Options
• SQL access to big data in Hadoop
• SQL access to big data in an analytical RDBMS
• SQL access to streaming data in motion
• SQL access to big data via data virtualisation (data virtualisation server, DW)
• SQL access to a combination of the above
SQL on Hadoop Challenges – Multi-structured Data May Need to Be Analysed
JSON data: SQL??
{
  "firstName": "Wayne",
  "lastName": "Rooney",
  "age": 25,
  "address": {
    "streetAddress": "21 Sir Matt Busby Way",
    "city": "Manchester",
    "country": "England",
    "postalCode": "M1 6DY"
  },
  "phoneNumbers": [
    { "type": "home", "number": "0161-123-1234" },
    { "type": "mobile", "number": "07779-123234" }
  ]
}
Text data: SQL??
Image data: SQL??
SQL on Hadoop Challenges – Multi-structured Data May Need to Be Analysed
Web log data: SQL??
Tab delimited file data: SQL??
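Before SQL can be applied to files like these, each line has to be given a schema at read time. The sketch below shows that step for both file types. It assumes the web log uses the Common Log Format, which the slide does not state, and the column names for the tab-delimited file are hypothetical.

```python
import csv
import io
import re
from typing import List, Optional

# Common Log Format (an assumption about the log layout).
CLF = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line: str) -> Optional[dict]:
    """Return the named fields of one log line, or None if it does not match."""
    match = CLF.match(line)
    return match.groupdict() if match else None


def parse_tsv(text: str, columns: List[str]) -> List[dict]:
    """Apply column names to tab-delimited text, one dict per line."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [dict(zip(columns, row)) for row in reader]


log_row = parse_log_line(
    '192.0.2.1 - - [09/Apr/2014:10:00:00 +0000] "GET /index.html HTTP/1.1" 200 512'
)
tsv_rows = parse_tsv("1\tAlice\n2\tBob\n", ["id", "name"])
```

SQL on Hadoop engines typically do the equivalent by attaching a table definition and a parser to files that stay in place, rather than loading them first.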
Hadoop Storage Is Independent of Any SQL Engine Accessing HDFS - Multiple SQL Engines Can Coexist On The Same Data
§ Key points about Hadoop
• It is possible to have MULTIPLE SQL engines on the same data
• Different SQL engines run on different Hadoop frameworks (M/R, Tez, Spark) or on no framework at all, i.e. directly access HDFS or HBase data
• Storage is independent of any SQL engine
Source: Hortonworks
Relational DBMS / Hadoop Integration – Several Vendors Have Integrated RDBMS with Hadoop to Run Analytics
[Diagram: SQL and XQuery requests go to a relational DBMS, which uses external polymorphic table function(s) to reach HDFS / HBase / Hive.]
• The RDBMS optimizer handles transparent access to external analytical platforms on behalf of the user
• Allows joins across data in a single RDBMS and Hadoop
• RDBMS and Hadoop could be deployed on the same hardware cluster or on different hardware clusters
Vendor examples:
• CitusDB
• Exasol EXAPowerlytics
• IBM PureData System for Analytics and DB2 HDFS clients
• Oracle HDFS Client
• Pivotal HAWQ PXF
• Teradata SQL-H
Self-Service Access To Big Data Via Data Virtualization
[Diagram: a business analyst uses self-service BI (self-service data discovery & visualisation, or a dashboard server) on top of a data virtualization and optimization layer. That layer reaches personal & office data, predictive models, transaction systems and the DW, which is loaded by data management tools (ETL, DQ, etc.).]
BUT what about optimization? Can the data virtualisation server push down analytics to underlying platforms to make them do the work?
Product examples: Cirro, Cisco, Denodo, Informatica Data Services, ScleraDB
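The push-down question can be illustrated with a toy comparison: either the virtualisation layer pulls every row and aggregates them itself, or it sends the aggregation to the source and receives only the result. The sketch below uses an in-memory SQLite database purely as a stand-in for an underlying platform. It does not represent any of the products listed above, and the `sales` table is hypothetical.

```python
import sqlite3

# Stand-in "underlying platform" with a small hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("north", 5), ("south", 7)],
)


def totals_without_pushdown(conn):
    """Fetch every row, then aggregate in the calling layer."""
    rows = conn.execute("SELECT region, amount FROM sales").fetchall()
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0) + amount
    return totals, len(rows)


def totals_with_pushdown(conn):
    """Let the source aggregate and fetch only the grouped result."""
    rows = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region"
    ).fetchall()
    return dict(rows), len(rows)


pulled_totals, pulled_rows = totals_without_pushdown(conn)
pushed_totals, pushed_rows = totals_with_pushdown(conn)
```

Both calls give the same totals, but the pushed-down version moves one row per region instead of one row per sale, and that gap grows with data volume.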
Conclusions - People In Different Roles In The Analytical Landscape Need to Work Together To Deliver Value
• Data Scientist (sandbox): exploratory analysis; model producer
• Business Analyst (analytical): model consumer; data discovery & visualisation; information producer (build reports, build and publish dashboards)
• Business Manager / Operations Worker (operational): information consumer; decision maker; action taker
www.intelligentbusiness.biz
Twitter: @mikeferguson1
Tel/Fax (+44)1625 520700
Thank You!
ROUNDTABLE DISCUSSION
Questions? Use #BigDataArch or the Q&A
THANK YOU!
REGISTER FOR BDIA WEBCASTS AT: http://insideanalysis.com/research/big-data-information-architecture
Image on Slide 53 borrowed from http://www.apieceofmonologue.com/2012/08/stanley-kubrick-film-photography-design.html