Evolving Data Warehouse Architectures in the Age of Big Data
Philip Russom, April 15, 2014
TDWI would like to thank the companies that sponsored the 2014 TDWI Best Practices research report, Evolving Data Warehouse Architectures.
This presentation is based on the findings of that report.
STAY TUNED: At the end of this webinar, learn how to download a free copy of the report.
Agenda
• Definitions of Data Warehouse Architectures
• Drivers of Change
• Benefits & Barriers
• From EDWs to DWEs
• Role of Hadoop
• Analytics versus Reporting
• Trends among Architectural Components and Practices
• Top Ten Priorities
PLEASE TWEET @pRussom, #TDWI, #EDW, #DataWarehouse, #DataArchitecture, #Analytics, #Hadoop
Upcoming Points
• There isn’t one, single architecture for all data warehouses (DWs)
– Each org is different
• Expect multiple architectures
– A well-designed DW has multiple architectural layers
– Architectural approaches get mixed together into hybrids
– A DW architecture interacts with architectures for data integration, reporting, analytics, operational applications, etc.
• The warehouse is still vital, even central
– But it’s evolving into a multi-platform environment
– Architecture is more important than ever, but now as a logical design that’s deployed over multiple physical platforms
• Please don’t ask me to draw a Reference Architecture for DWs
– Given the current diversity, there isn’t just one. But I’ll describe many.
What do you think data warehouse architecture is? Select all that apply.
Source: TDWI survey run in late 2013. Based on 1197 responses from 538 respondents. 2.2 responses per respondent, on average.
Logical versus Physical DW Architectures
And Other Architectural Components that Coexist
• Logical architecture – mostly about data models and their relationships, with a focus on how these represent organizational entities and processes
– Data standards – including standards for data modeling, data quality metrics, interfaces for data integration, programming style, format standards, etc.
• Physical architecture – mostly a plan for deploying data and data structures based on the workload and platform requirements of each
– System architecture – a topology of hardware servers and software servers, plus the interfaces and networks that tie them together
Today’s Focus
Drivers of Change
Does your primary enterprise data warehouse have an architectural design?
Yes 79%
No 18%
Don’t know 3%
Source: TDWI survey run in late 2013.
Based on 538 respondents.
Is the architecture of your data warehouse environment evolving?
Yes – moderately 54%
Yes – dramatically 22%
No – except with DW updates 22%
Don’t know 2%
What technical issues or practices are driving change in your DW architecture?
Advanced analytics 57%
Increasing data volumes 56%
Real-time operations 41%
Business performance mgt 38%
OLAP 30%
Non-relational data 25%
Virtualization of data 23%
Cloud adoption 21%
Streaming data 15%
What business issues or practices are driving change in your DW architecture?
Competitiveness 45%
Fast-paced business processes 43%
Compliance 29%
Funding 29%
Sponsorship 26%
Reorganizations 25%
Centralizing business control 30%
Departmental power struggles 19%
Mergers and acquisitions 18%
Benefits of Multi-Platform Architecture
In priority order, based on survey responses
• All data analytics, in general (61%)
– Many new platforms are built for analytics: DW appliances, columnar databases, NoSQL databases, Hadoop.
– With a multi-platform portfolio, users can match an analytic workload to the best platform.
• A diverse platform portfolio can handle a diverse range of data types.
– This is key to embracing the unstructured and schema-free data types found in most big data.
– Enables broad data exploration and discovery (43%)
• A more diverse platform portfolio can aid a business
– Additional platforms are key to addressing new business requirements (36%), especially data-oriented ones like analytics (61%), more numerous business insights (34%), business optimization (30%)…
• Handling data in real time usually requires an additional purpose-built system.
– Traditional relational databases and batch-oriented Hadoop systems were not built for real-time operations (33%), though many organizations need faster business processes (26%).
• Adding low-cost platforms to a DW environment makes big data more affordable.
– DW appliances, columnar RDBMSs, Hadoop & NoSQL all lower the cost of data staging for data warehousing (20%) and of data archiving (16%).
Source: TDWI survey run in late 2013.
Based on 538 respondents.
Barriers to Multi-Platform Architecture
In priority order, based on survey responses
• Inadequate staffing or skills (47%) is the most prominent barrier.
– Immaturity with new data types and sources (23%), plus new technologies for Hadoop, event processing, and so on, leaves teams unprepared for the complexity of multi-platform designs (25%).
• As usual, organizational and business issues should be settled first.
– Data ownership and other politics (43%), a lack of business sponsorship (38%),
a lack of a compelling business case (25%)
• A number of data management issues should be addressed.
– Data integration complexity (36%), poor data quality (34%), lack of data
architecture (29%), and data security, privacy, and governance issues (25%)
• As with any new IT initiative, proper funding is key.
– Account for the cost of acquiring multiple platforms (25%) and the cost of
administering multiple platforms (27%)
Source: TDWI survey run in late 2013.
Based on 538 respondents.
WHY CAN’T A DATA WAREHOUSE DO EVERYTHING?
“Square Peg” Workloads may not fit
“Round Hole” DW Architectures
• Most data warehouses were designed and
optimized for common deliverables and methods:
– Standard reports, dashboards, performance mgt,
online analytic processing (OLAP)
– This is a design and architectural decision made by users, not a failing of
vendor platforms
• Can/should all DW & analytic workloads run on your EDW?
– If your EDW can handle multiple mixed concurrent workloads with
performance and without impeding other workloads, then run all workloads
(including analytics) on the EDW, for simplicity’s sake
– If not, you may need additional data platforms for some workloads
Multi-Platform Data Warehouse Environments
• Many enterprise data warehouses (EDWs) are evolving into
multi-platform data warehouse environments (DWEs).
• Users continue to add standalone data platforms to their warehouse tool and platform portfolio.
• The new platforms don’t replace the core warehouse, because it is still the best platform for the data that goes into standard reports, dashboards, performance management, and OLAP.
• Instead, the new platforms complement the warehouse,
because they are optimized for workloads that manage,
process, and analyze new forms of big data, non-structured
data, and real-time data.
Ramifications of a Multi-Platform DW Environment
• Workload-centric DW architecture
– Assumes that some workloads and their data are best offloaded from the
core DW and taken to a platform more suited to them
– Candidates include workloads and data for advanced analytics (not OLAP), SQL-based analytics, unstructured data, massive big data, and real time
• Distributed DW architecture
– This simply means that data and data structures (as defined in a logical
architectural layer) are distributed across multiple physical data platforms
– Again, the logical layer is the “big picture” needed with many platforms
• A distributed DW architecture is both good and bad
– Good if it serves the unique requirements of multiple workloads and the
users that depend on them
– Bad if platforms proliferate like the dreaded data marts of yore
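The workload-centric idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a reference design: the workload labels, the platform names, and the functions `route_workload` and `platforms_in_use` are hypothetical, chosen to mirror the workload and platform types named on these slides. Anything without a known better home stays on the core warehouse.

```python
# Hypothetical mapping of workload types to the platform assumed to suit them.
# Labels and platform names are illustrative assumptions, not survey results.
CORE_DW = "core data warehouse"

OFFLOAD_TARGETS = {
    "advanced_analytics": "analytic sandbox or DW appliance",
    "sql_based_analytics": "columnar DBMS",
    "unstructured_data": "Hadoop (HDFS)",
    "massive_big_data": "Hadoop (HDFS)",
    "real_time": "real-time ODS or event processing",
}

# Workloads the slides describe as still best served by the core warehouse.
CORE_WORKLOADS = {"standard_reports", "dashboards", "performance_mgt", "olap"}


def route_workload(workload: str) -> str:
    """Return the platform a workload would run on in this sketch."""
    if workload in CORE_WORKLOADS:
        return CORE_DW
    # Offload only when a better-suited platform is known;
    # otherwise keep the workload on the core DW for simplicity.
    return OFFLOAD_TARGETS.get(workload, CORE_DW)


def platforms_in_use(workloads: list[str]) -> set[str]:
    """Distinct physical platforms that a set of workloads is spread across."""
    return {route_workload(w) for w in workloads}
```

Watching the size of the set returned by `platforms_in_use` is one simple way to notice when a distributed design starts to proliferate platforms.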
Growing Complexity in DW System Architectures
• The technology stack for DW, BI, analytics, and data integration has always been a multi-platform environment.
• What’s new? The trend toward a portfolio of many data platforms has accelerated.
Figure: components accumulating in DW system architectures over the passage of time. Labels shown: star or snowflake schema data warehouse, detailed source data, data staging areas, OLAP cubes, OLAP DBMSs, multidimensional data models, metrics for performance mgt, federated data marts, customer mart or ODS, real-time ODS, DW from a merger, DW appliances, columnar DBMSs, NoSQL databases, Hadoop Distributed File System, MapReduce, analytic sandbox, data federation & virtualization, streaming data tools, complex event processing.
Which of the following best describes your extended data warehouse environment today?
• Pure, central, monolithic EDWs are relatively rare (15%, far left)
• Likewise, environments without a DW are equally rare (15%, far right)
• EDWs mix well in hybrid environments (68%, middle three)
Responses, from EDW (far left) to DWE (far right):
Central monolithic EDW with no other data platforms 15%
Central EDW with a few additional data platforms 37%
Central EDW with many additional data platforms 16%
Many workload-specific data platforms; EDW is present but not the center 15%
No true EDW, but many workload-specific data platforms instead 15%
Other 2%
Source: TDWI survey run in late 2013.
Based on 538 respondents.
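The 68% figure for hybrid environments is the sum of the three middle categories on this slide. A short check in Python, assuming the chart figures pair with their labels in the order they appear in the transcript (that pairing is an assumption; the variable names are illustrative):

```python
# Shares in percent, ordered from the EDW end of the spectrum to the DWE end.
# The label-to-figure pairing follows the transcript order and is an assumption.
shares = {
    "Central monolithic EDW with no other data platforms": 15,
    "Central EDW with a few additional data platforms": 37,
    "Central EDW with many additional data platforms": 16,
    "Many workload-specific data platforms; EDW present but not the center": 15,
    "No true EDW, but many workload-specific data platforms instead": 15,
    "Other": 2,
}

values = list(shares.values())

# "Middle three" = every named category except the far-left and far-right ones.
hybrid_share = sum(values[1:4])

# All categories together should account for every respondent.
total_share = sum(values)

print(f"Hybrid environments: {hybrid_share}%")
print(f"All categories: {total_share}%")
```

Because both end categories are 15%, the hybrid total does not depend on which middle label carries which figure.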
Which of the following best describes your organization’s strategy for evolving your DW environment and its architecture, relative to big data?
• Most survey respondents plan to extend an existing DW (41%, far left)
• Fewer will deploy new data platforms (25%)
• 29% have no strategy for DW evolution or addressing big data
Source: TDWI survey run in late 2013.
Based on 538 respondents.
Extend existing core DW to accommodate big data and other new requirements 41%
Deploy new data management systems specifically for big data, analytics, real time, etc. 25%
No strategy for DW architecture, though we need one 23%
No strategy for DW architecture, because we don't need one 6%
Other 5%
Hadoop is a Useful Addition to DW Architectures
IT COMPLEMENTS AND EXTENDS DATA WAREHOUSES
• HDFS extends DW Architectures
– Managing multi-structured data
– Repository for detailed source data
– Processing big data for analytics
– Advanced forms of algorithmic analytics
– Data staging on steroids
– ELT push-down processing
– Inexpensive compared to average DW
• Hadoop also contributes outside DWs
– Imagine HDFS as shared infrastructure, similar to SAN & NAS
– Imagine a huge, live archive
– Imagine content mgt on steroids
Reporting and Analytics have Different
Requirements for Data and DW Architecture
• Reporting is mostly about entities and facts you know well, represented by highly polished data that you know well.
• Carefully modeled and cleansed data with rich metadata and master data that’s managed in a data warehouse.
• Most users designed their DWs first and foremost as a repository for reporting and similar practices such as OLAP, performance management, dashboards, and operational BI.
• Advanced analytics enables the discovery of new facts you didn’t know, based on the exploration and analysis of data that’s probably new to you.
• Unlike the pristine data that reports operate on, advanced analytics works best with detailed source data in its original (even messy) form, using discovery-oriented technologies, such as ad hoc queries, search, mining, statistics, predictive algorithms, and natural language processing.
Commitment & Growth Components relative to DW Architecture
• Analytics is driving most adoption of new platforms & features.
– In-memory analytics (36%), analytic sandboxes (29%)
• Managing non-relational big data is also a pressing need for
many organizations.
– HDFS (34%), open-source MapReduce (32%), vendor-built
MapReduce (25%), NoSQL databases (24%)
• Real-time is just as important as analytics and big data.
– In-memory database (34%), in-database analytics (29%), solid-state
drives (25%), real-time data (24%)
• Relational technology is more relevant than ever, but in
updated forms.
– Columnar DBMSs (27%), DW appliances (23%)
Some components are poised for aggressive adoption by users.
Top Ten Priorities for DW Architecture
These are recommendations, requirements, or rules that can guide you.
1. Recognize that successful data warehouse architectures have integrated logical and physical layers, plus other components.
2. Determine the business and technical drivers in your organization, and let those determine the evolution of your DW architecture.
3. Beware that the leading barrier to successful DW architecture is inadequate staffing and skills.
4. Address other barriers for sponsorship, funding, and improvements to data management infrastructure.
5. Turn on unused features in existing platforms.
6. Establish DW architectures and standards, but be open to exceptions.
7. Be open to hybrids and alternate standards.
8. Consider Hadoop as a DW complement.
9. Remember that analytics and reporting have different data and DW architectural requirements.
10. Don’t expect the new stuff to replace the old stuff.
Download a free copy of the report that this Webinar is based on
• Download the report in a PDF file at: tdwi.org/bpreports
• Feel free to distribute the PDF file of any TDWI Best Practices Report
EVOLVING DATA WAREHOUSE ARCHITECTURES IN THE AGE OF BIG DATA
Philip Russom, Research Director for Data Mgt, TDWI
www.bit.ly/PhilipRussom
@pRussom on Twitter
linkedin.com/in/philiprussom
Q & A