denodo datafest 2016: logical data lakes: performance considerations
TRANSCRIPT
![Page 1: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/1.jpg)
O C T O B E R 1 8 , 2 0 1 6 S A N F R A N C I S C O B A Y A R E A , C A
#DenodoDataFest
RAPID, AGILE DATA STRATEGIESFor Accelerating Analytics, Cloud, and Big Data Initiatives.
![Page 2: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/2.jpg)
Logical Data Lakes: Performance Considerations
Dr. Alberto Pan
Denodo, CTO
![Page 3: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/3.jpg)
Agenda1. Logical Architectures for Analytics: The Role of Data Virtualization
2. Example Scenario: The Numbers
3. Performance: The “Move Processing to the Data” Paradigm
4. Performance: Cost-based Optimization in Logical Architectures
![Page 4: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/4.jpg)
Logical Architectures for Analytics
The Role of Data Virtualization
![Page 5: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/5.jpg)
5
The Logical Data Warehouse
![Page 6: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/6.jpg)
6
Logical Data LakeReal-TimeDecision
Management
Alerts
ScorecardsDashboards
Reporting
Data DiscoverySelf-Service
Search
Predictive Analytics
Statistical Analytics (R)
Text Analytics
Data Mining
Data Warehouse
Sensor Data
Machine Data (Logs)
Social Data
Clickstream Data
Internet Data
Image and Video
Enterprise Content (Unstructured)
Big Data
Enterprise Applications
Traditional Enterprise
Data
Cloud
Cloud Applications
Metadata Management, Data Governance, Data Security
NoSQL
EDWIn-Memory
(SAP Hana, …)Analytical
Appliances
Cloud DW(Redshift,..)
ODS
Big Data ETL
CDC
Sqoop
(Flume, Kafka, …)
Real-Time Data Access (On-Demand / Streaming)
Batch
YARN / Workload Management
HDFS
HiveSparkDrill
Impala
Storm HBase SolrHunk
DW Streams NoSQL SearchSQL
Hadoop
TezMapRed.
![Page 7: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/7.jpg)
7
The Logical Data Warehouse
![Page 8: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/8.jpg)
8
The Logical Data Warehouse: Data Virtualization
![Page 9: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/9.jpg)
9
Logical Data LakeReal-TimeDecision
Management
Alerts
ScorecardsDashboards
Reporting
Data DiscoverySelf-Service
Search
Predictive Analytics
Statistical Analytics (R)
Text Analytics
Data Mining
Data Warehouse
Sensor Data
Machine Data (Logs)
Social Data
Clickstream Data
Internet Data
Image and Video
Enterprise Content (Unstructured)
Big Data
Enterprise Applications
Traditional Enterprise
Data
Cloud
Cloud Applications
Metadata Management, Data Governance, Data Security
NoSQL
EDWIn-Memory
(SAP Hana, …)Analytical
Appliances
Cloud DW(Redshift,..)
ODS
Big Data ETL
CDC
Sqoop
(Flume, Kafka, …)
Real-Time Data Access (On-Demand / Streaming)
Batch
YARN / Workload Management
HDFS
HiveSparkDrill
Impala
Storm HBase SolrHunk
DW Streams NoSQL SearchSQL
Hadoop
TezMapRed.
![Page 10: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/10.jpg)
10
Logical Data Lake: Data Virtualization
Real-TimeDecision
Management
Alerts
ScorecardsDashboards
Reporting
Data DiscoverySelf-Service
Search
Predictive Analytics
Statistical Analytics (R)
Text Analytics
Data MiningData Warehouse
Sensor Data
Machine Data (Logs)
Social Data
Clickstream Data
Internet Data
Image and Video
Enterprise Content (Unstructured)
Big Data
Enterprise Applications
Traditional Enterprise
Data
Cloud
Cloud Applications
NoSQL
EDWIn-Memory
(SAP Hana, …)Analytical
Appliances
Cloud DW(Redshift,..)
ODS
Big Data ETL
CDC
Sqoop
(Flume, Kafka, …)
Data Virtualization
Real-Time Data Access (On-Demand / Streaming)
Data Caching
Data
Serv
ices
Data Search & Discovery
Governance
Security
Optimization
Data
Abstr
action
Data
Tra
nsfo
rmation
Data
Federa
tionBatch
YARN / Workload Management
HDFS
HiveSparkDrill
Impala
Storm HBase SolrHunk
DW Streams NoSQL SearchSQL
Hadoop
TezMapRed.
![Page 11: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/11.jpg)
11
The Role of Data Virtualization
Combines data from several systems and publish it to the desired format with a few clicks
Denodo Data Virtualization is the only option verifying:
![Page 12: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/12.jpg)
12
Example: Distributed Report
Demo Scenario of LDW
Sales Data (TPC-DS)(280M)
Customer (TPC-DS)(2M)
MDM
12
SQL on Hadoop
Total Sales by Customer
![Page 13: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/13.jpg)
13
The Role of Data Virtualization
Combines data from several systems and publish it to the desired format with a few clicks
Abstracts applications from changes in the underlying infrastructure
Expose different logical views over the same data
Single entry point to apply Security and Governance
Denodo Data Virtualization is the only option verifying:
![Page 14: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/14.jpg)
14
DW + MDM Dimensions
Common LDW Patterns
TimeDimension
Fact table(sales) Product
Dimension
Retailer Dimension
EDW MDM
14
![Page 15: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/15.jpg)
DW Historical Offloading (Cold Data Storage)
Common LDW Patterns
Time Dimension Fact table(sales) Product Dimension
Retailer Dimension
EDW Hadoop
Current Sales Historical Sales
15
![Page 16: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/16.jpg)
DW + Cloud Dimensional Data
Common LDW Patterns
Time Dimension Fact table(sales) Product Dimension
Customer Dimension
EDW CRM
SFDC Customer
16
![Page 17: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/17.jpg)
Data Warehouse Federation
Common LDW Patterns
Time Dimensi
on
Fact table(sales)
CustomerDimension
Region
EDW
City
EDW
Fact table(Fidelity)
Customer Dimension
17
![Page 18: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/18.jpg)
Example Scenario: The Numbers
![Page 19: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/19.jpg)
19
Denodo has done extensive testing using queries from the standard benchmarking test
TPC-DS* and the following scenario
Compares the performance of a federated approach in Denodo with an MPP system where
all the data has been replicated via ETL
Customer Dim.2 M rows
Sales Facts290 M rows
Items Dim.400 K rows
* TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
vs.Sales Facts290 M rows
Items Dim.400 K rows
Customer Dim.2 M rows
Denodo 6.0 ArchitecturePerformance Comparison – Logical Data Warehouse vs. Physical Data Warehouse
![Page 20: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/20.jpg)
20
Denodo 6.0 Architecture
Query DescriptionReturned
RowsTime Netezza
Time Denodo (Federated Oracle,
Netezza & SQL Server)
Optimization Technique (automatically selected)
Total sales by customer 1,99 M 20.9 sec. 21.4 sec. Full aggregation push-down
Total sales by customer and year between 2000 and 2004
5,51 M 52.3 sec. 59.0 sec Full aggregation push-down
Total sales by item brand 31,35 K 4.7 sec. 5.0 sec. Partial aggregation push-down
Total sales by item where sale price less than current
list price17,05 K 3.5 sec. 5.2 sec On the fly data movement
Performance Comparison – Logical Data Warehouse vs. Physical Data Warehouse
![Page 21: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/21.jpg)
Performance
The “Move Processing to the Data” Paradigm
![Page 22: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/22.jpg)
22
Move Processing to the Data
Process the data where it resides
Process the data locally where it resides
DV System combines partial results
Minimizes network traffic
Leverages specialized data sources
![Page 23: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/23.jpg)
23
Move Processing to the Data: Example 1
Obtain Total Sales By Product (Naive Strategy)
Naive Strategy: 350M rows moved through the network
![Page 24: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/24.jpg)
24
Move Processing to the Data: Example 1
Obtain Total Sales By Product (Move Processing to the Data)
Denodo Strategy: 30k rows moved through the network
![Page 25: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/25.jpg)
25
Move Processing to the Data: Example 1 (Alternative)
Obtain Total Sales By Product (Dimension Table Replicated)
Denodo Strategy: Dimension Table Replicated
Denodo Strategy: 20k rows moved through the network
![Page 26: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/26.jpg)
26
Move Processing to the Data: Example 2
Execution Strategy: Full aggregation pushdown not possible Two possible techniques: - On-the-fly data movement- Partial aggregation pushdown
Maximum Sales Discount By Product in The last year
Sales Discount: list_price (Product) – sale_price (Sales)
![Page 27: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/27.jpg)
27
Move Processing to the Data: Example 2
Maximum Sales Discount By Product in the last year: On-the-fly Data Movement
Move Products Data to a Temp table in the DW : 20K rows moved through the network + 10K
rows inserted in the DW
Execute full query on the DW: 10k rows through the network
![Page 28: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/28.jpg)
28
Move Processing to the Data: Example 2
Maximum Sales Discount By Product in the last year: Partial aggregation Pushdown
Products DB: 10K rows through the network
Data Warehouse: #rows through the network = 10K * average
#sale_prices_per_product
![Page 29: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/29.jpg)
Performance
Cost-based Optimization in Logical Architectures
![Page 30: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/30.jpg)
30
How to Choose the Best Execution Plan?Cost-Based Optimization in Data Virtualization
Data statistics to estimate size of intermediate result sets
Data Source Indexes (and other physical structures)
Execution Model of data sources: e.g. Parallel Databases VS Hadoop clusters VS Relational Databases
Features of data sources (e.g. number of processing cores in parallel database or Hadoop Cluster)
Data Transfer rate
Must take into account:
![Page 31: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/31.jpg)
Q&AFind more details at: datavirtualization.bloghttp://www.datavirtualizationblog.com/myths-in-data-virtualization-performance/
![Page 32: Denodo DataFest 2016: Logical Data Lakes: Performance Considerations](https://reader031.vdocuments.site/reader031/viewer/2022022202/5884afad1a28ab76798b6735/html5/thumbnails/32.jpg)
Thank you!
© Copyright Denodo Technologies. All rights reservedUnless otherwise specified, no part of this PDF file may be reproduced or utilized in any for or by any means, electronic or mechanical, including photocopying and microfilm, without prior the written authorization from Denodo Technologies.
O C T O B E R 1 8 , 2 0 1 6 S A N F R A N C I S C O B A Y A R E A , C A
#DenodoDataFest