How to Achieve Fast Data Performance in Big Data, Logical Data Warehouse, and Operational Scenarios
Post on 09-Jan-2017
TRANSCRIPT
Performance in Denodo 6.0
Pablo Alvarez, Principal Technical Account Manager
Agenda
1. Debunking the myths of virtual performance
2. Query Optimizer
3. Cache
4. Resource Management
5. Further Reading
It is a common assumption that a virtualized solution will be much slower than a persisted approach via ETL, because:
1. A large amount of data is moved through the network for each query
2. Network transfer is slow
But is this really true?
Debunking the myths of virtual performance
1. Complex queries can be solved transferring moderate data volumes when the right techniques are applied
   - Operational queries: predicate delegation produces small result sets
   - Logical Data Warehouse and Big Data: Denodo uses the characteristics of the underlying star schemas to apply query rewriting rules that maximize delegation to specialized sources (especially heavy GROUP BY operations) and minimize data movement
2. Current networks are almost as fast as reading from disk
   - 10 Gb and 100 Gb Ethernet are commodities
Performance Comparison: Logical Data Warehouse vs. Physical Data Warehouse
Scenario:
- Customer Dimension: 2 M rows
- Sales Facts: 290 M rows
- Items Dimension: 400 K rows
• Denodo has done extensive testing using queries from the standard benchmarking test TPC-DS* against this scenario
• The baseline was set using the same queries with all data in a Netezza appliance
* TPC-DS is the de facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
Performance Comparison: Logical Data Warehouse vs. Physical Data Warehouse

Query Description | Returned Rows | Time Netezza | Time Denodo (Federating Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected)
Total sales by customer | 1.99 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down
Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down
Total sales by item brand | 31.35 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down
Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec. | 5.2 sec. | On-the-fly data movement
Performance and optimizations in Denodo 6.0
Focused on 3 core concepts:
1. Dynamic Multi-Source Query Execution Plans
   - Leverages processing power & architecture of data sources
   - Dynamic to support ad hoc queries
   - Uses statistics for cost-based query plans
2. Selective Materialization
   - Intelligent caching of only the most relevant and often-used information
3. Optimized Resource Management
   - Smart allocation of resources to handle high concurrency
   - Throttling to control and mitigate source impact
   - Resource plans based on rules
Performance and optimizations in Denodo 6.0
Comparing optimizations in DV vs. ETL
Although Data Virtualization is a data integration platform, architecturally speaking it is more similar to an RDBMS:
- Uses relational logic
- Metadata is equivalent to that of a database
- Enables ad hoc querying
Key difference between ETL engines and DV:
- ETL engines are optimized for static bulk movements (fixed data flows)
- Data virtualization is optimized for queries (dynamic execution plan per query)
Therefore, the performance architecture presented here resembles that of an RDBMS.
Query Optimizer
How Dynamic Query Optimizer Works: Step by Step
Metadata Query Tree
• Maps query entities (tables, fields) to actual metadata
• Retrieves execution capabilities and restrictions for the views involved in the query
Static Optimizer
• Query delegation
• SQL rewriting rules (removal of redundant filters, tree pruning, join reordering, transformation push-up, star-schema rewritings, etc.)
• Data movement query plans
Cost-Based Optimizer
• Picks optimal JOIN methods and orders based on data distribution statistics, indexes, transfer rates, etc.
Physical Execution Plan
• Creates the calls to the underlying systems in their corresponding protocols and dialects (SQL, MDX, WS calls, etc.)
How Dynamic Query Optimizer Works
Example: Logical Data Warehouse
Total sales by retailer and product during the last month for the brand ACME
(Star schema: fact table "sales" joined to the Time, Product, and Retailer dimensions, spread across the EDW and MDM sources)

SELECT retailer.name,
       product.name,
       SUM(sales.amount)
FROM sales
  JOIN retailer ON sales.retailer_fk = retailer.id
  JOIN product ON sales.product_fk = product.id
  JOIN time ON sales.time_fk = time.id
WHERE time.date < ADDMONTH(NOW(), -1)
  AND product.brand = 'ACME'
GROUP BY product.name, retailer.name
How Dynamic Query Optimizer Works
Example: Non-optimized plan
All joins and the final GROUP BY on product.name and retailer.name are executed in the virtualization layer. Row estimates from the execution tree: the sales scan transfers 1,000,000,000 rows, the dimension branches return 100, 10, and 30 rows, and 10,000,000 rows reach the final aggregation. The queries sent to the sources are:

SELECT sales.retailer_fk, sales.product_fk,
       sales.time_fk, sales.amount
FROM sales

SELECT retailer.name, retailer.id
FROM retailer

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'

SELECT time.date, time.id
FROM time
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
How Dynamic Query Optimizer Works
Step 1: Applies JOIN reordering to maximize delegation
The sales–time join is delegated to the source, so the fact branch transfers 100,000,000 rows instead of the full table; the retailer and product branches return 100 and 10 rows, and 10,000,000 rows reach the final GROUP BY on product.name and retailer.name:

SELECT sales.retailer_fk, sales.product_fk, sales.amount
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)

SELECT retailer.name, retailer.id
FROM retailer

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
How Dynamic Query Optimizer Works
Step 2: Partial aggregation push-down
Since the JOIN is on foreign keys (1-to-many), and the GROUP BY is on attributes from the dimensions, the optimizer applies the partial aggregation push-down optimization. The fact branch now returns only 10,000 pre-aggregated rows, and the final GROUP BY on product.name and retailer.name produces 1,000 rows in the virtualization layer:

SELECT sales.retailer_fk, sales.product_fk,
       SUM(sales.amount)
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
GROUP BY sales.retailer_fk, sales.product_fk

SELECT retailer.name, retailer.id
FROM retailer

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
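The row-count reduction from partial aggregation push-down can be sketched in Python (the data and the `partial_aggregate` helper are hypothetical; the function stands in for the GROUP BY query delegated to the fact source):

```python
from collections import defaultdict

# Hypothetical fact rows: (retailer_fk, product_fk, amount)
sales = [
    (1, 10, 5.0), (1, 10, 7.0), (1, 11, 3.0),
    (2, 10, 4.0), (2, 11, 6.0), (2, 11, 1.0),
]
retailer = {1: "R-One", 2: "R-Two"}      # id -> name
product = {10: "P-Ten", 11: "P-Eleven"}  # id -> name

def partial_aggregate(rows):
    """What the source executes: GROUP BY on the foreign keys.

    This collapses the fact rows to at most
    (#retailers x #products) rows before anything is transferred."""
    totals = defaultdict(float)
    for r_fk, p_fk, amount in rows:
        totals[(r_fk, p_fk)] += amount
    return totals

# The DV layer only joins the small pre-aggregated result with the
# dimensions and re-groups on the dimension attributes.
pushed_down = partial_aggregate(sales)
final = defaultdict(float)
for (r_fk, p_fk), total in pushed_down.items():
    final[(retailer[r_fk], product[p_fk])] += total

# 6 fact rows were reduced to 4 transferred rows
assert len(sales) == 6 and len(pushed_down) == 4
```

Because the GROUP BY keys are the PKs of the dimensions, re-aggregating the partial totals in the DV layer gives exactly the same result as aggregating the raw fact rows.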
How Dynamic Query Optimizer Works
Step 3: Selects the right JOIN strategy based on costs for data volume estimations
The 10-row product branch drives a NESTED JOIN: the matching product ids are pushed into the fact query as an IN condition, so that branch now returns only 1,000 rows. The 100-row retailer branch uses a HASH JOIN, and the final GROUP BY produces 1,000 rows:

SELECT sales.retailer_fk, sales.product_fk,
       SUM(sales.amount)
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
  AND sales.product_fk IN (1, 2, …)
GROUP BY sales.retailer_fk, sales.product_fk

SELECT retailer.name, retailer.id
FROM retailer

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
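A nested join with IN-list push-down can be sketched like this (toy in-memory tables; `fact_query` stands in for the rewritten query sent to the fact source):

```python
# Hypothetical tables
product = [  # (id, name, brand)
    (1, "Anvil", "ACME"), (2, "Rocket", "ACME"), (3, "Glue", "Other"),
]
sales = [  # (product_fk, amount) -- pre-aggregated rows at the fact source
    (1, 100.0), (2, 50.0), (3, 75.0),
]

# Step A: execute the small dimension branch first
acme_ids = [pid for (pid, _name, brand) in product if brand == "ACME"]

# Step B (nested join): rewrite the fact query with an IN condition,
# so only matching rows ever leave the source
def fact_query(in_list):
    return [(fk, amt) for (fk, amt) in sales if fk in in_list]

rows = fact_query(acme_ids)
assert rows == [(1, 100.0), (2, 50.0)]  # the non-ACME row never travels
```

This is why a nested join wins when one branch is tiny: the cost of executing it first is repaid by the filter it lets the optimizer push into the large branch.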
How Dynamic Query Optimizer Works: Summary
Automatic JOIN reordering groups branches that go to the same source, to maximize query delegation and reduce processing in the DV layer. End users don't need to worry about the optimal "pairing" of the tables.
The partial aggregation push-down optimization is key in these scenarios. Based on PK-FK restrictions, it pushes the aggregation (on the PKs) to the DW:
- Leverages the processing power of the DW, which is optimized for these aggregations
- Significantly reduces the data transferred through the network (from 1 billion rows to 10,000 in the example)
The cost-based optimizer picks the right JOIN strategies based on estimations of data volumes, existence of indexes, transfer rates, etc. Denodo estimates costs differently for parallel databases (Vertica, Netezza, Teradata) than for regular databases, to take into account the different way those systems operate (distributed data, parallel processing, different aggregation techniques, etc.).
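The kind of decision the cost-based optimizer makes can be illustrated with a deliberately simplified cost model (the formula and constants below are invented for illustration and are not Denodo's actual cost functions):

```python
def choose_join_strategy(left_rows, right_rows, can_push_filter):
    """Toy cost model for picking a JOIN method.

    NESTED JOIN is cheap when one side is tiny and its keys can be
    pushed into the other side as a filter (index / IN-list), so only
    a fraction of the large side is transferred.
    HASH JOIN is cheap when both sides must be transferred anyway."""
    small, large = min(left_rows, right_rows), max(left_rows, right_rows)
    # Nested: transfer the small side, then a filtered scan of the large side
    nested_cost = small + (large / 1000.0 if can_push_filter else large)
    # Hash: transfer both sides, build a hash table on the small one, probe
    hash_cost = small + large
    return "NESTED" if nested_cost < hash_cost else "HASH"

# 10-row product branch vs. 100M-row fact branch with filter pushdown:
assert choose_join_strategy(10, 100_000_000, can_push_filter=True) == "NESTED"
# Two branches that must be fully transferred anyway:
assert choose_join_strategy(100, 1_000, can_push_filter=False) == "HASH"
```

The real optimizer also weighs statistics, indexes, and transfer rates per source, but the shape of the decision is the same: estimate the rows each plan moves and pick the cheapest.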
How Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data
- Pruning of unnecessary JOIN branches (based on 1-to-many associations) when the attributes of the 1-side are not projected
  - Relevant for horizontal partitioning and "fat" semantic models when queries do not need attributes from all the tables
  - Unnecessary tables are removed from the query (even for single-source models)
- Pruning of UNION branches based on incompatible filters
  - Enables detection of unnecessary UNION branches in vertical partitioning scenarios
- Automatic data movement
  - Creation of temp tables in one of the systems to enable complete delegation of a federated branch
  - The target source needs to have the "data movement" option enabled for this optimization to be taken into account
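UNION branch pruning can be sketched as follows (hypothetical partitioned views, each declaring the year range its filter covers; branches whose range cannot match the query filter are dropped before execution):

```python
# Hypothetical UNION branches over a partitioned sales model:
# each branch covers a (low, high) year range
branches = {
    "sales_2014": (2014, 2014),
    "sales_2015": (2015, 2015),
    "sales_2016": (2016, 2016),
}

def prune_union(branches, query_year):
    """Keep only branches whose partition range can satisfy the filter.

    Branches with an incompatible filter are removed from the plan,
    so their sources are never queried at all."""
    return [name for name, (lo, hi) in branches.items()
            if lo <= query_year <= hi]

surviving = prune_union(branches, 2016)
assert surviving == ["sales_2016"]  # two of three sources are never hit
```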
Caching
Caching
Real time vs. caching
Sometimes real-time access & federation are not a good fit:
- Sources are slow (e.g. text files, cloud apps like Salesforce.com)
- A lot of data processing is needed (e.g. complex combinations, transformations, matching, cleansing, etc.)
- Access is limited, or the impact on the sources has to be mitigated
For these scenarios, Denodo can replicate just the relevant data in the cache.
Caching
Overview
Denodo's cache system is based on an external relational database:
- Traditional (Oracle, SQL Server, DB2, MySQL, etc.)
- MPP (Teradata, Netezza, Vertica, Redshift, etc.)
- In-memory storage (Oracle TimesTen, SAP HANA)
It works at the view level and allows hybrid access (real-time / cached) within an execution tree.
Cache control (population / maintenance):
- Manually: user initiated, at any time
- Time based: using the TTL or the Denodo Scheduler
- Event based: e.g. using JMS messages triggered in the DB
Caching
Caching options
Denodo offers two different types of cache:
Partial:
- Query-by-query cache
- Useful for caching only the most commonly requested data
- More adequate for representing the capabilities of non-relational sources, like web services or APIs with input parameters
Full:
- Similar to the concept of a materialized view
- Incrementally updateable at row level, to avoid unnecessary full refresh loads
- Offers full push-down capabilities to the source, including GROUP BY and JOIN operations
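The partial (query-by-query) mode can be sketched as a keyed cache with a TTL; this is a minimal illustration, not Denodo's implementation, and `slow_source` is a hypothetical stand-in for the real data source:

```python
import time

class PartialCache:
    """Minimal sketch of a query-by-query (partial) cache with a TTL.

    Keys are the query parameters; only queries that were actually
    executed get cached, unlike a full materialization."""

    def __init__(self, fetch, ttl_seconds):
        self.fetch = fetch          # function that hits the real source
        self.ttl = ttl_seconds
        self.store = {}             # params -> (timestamp, rows)

    def query(self, params):
        entry = self.store.get(params)
        if entry is not None and time.time() - entry[0] < self.ttl:
            return entry[1]         # cache hit: source not contacted
        rows = self.fetch(params)   # miss or expired TTL: go to source
        self.store[params] = (time.time(), rows)
        return rows

calls = []
def slow_source(params):
    calls.append(params)
    return [f"row-for-{params}"]

cache = PartialCache(slow_source, ttl_seconds=60)
cache.query("customer=42")
cache.query("customer=42")          # served from cache
assert calls == ["customer=42"]     # the source was hit only once
```

This matches the partial-cache use case above: hot parameter combinations are answered locally, while unseen ones still go to the source.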
Hybrid Performance for SaaS sources
Incremental Queries (available July 2016)
Merge cached data and fresh data to provide fully up-to-date results with minimum latency. Example with Salesforce 'Leads': the cache holds the leads updated at 1:00 AM; an incremental query retrieves only the leads changed or added since 1:00 AM and merges them with the cache to return up-to-date data.
1. Salesforce 'Leads' data is cached in VDP at 1:00 AM
2. A query needing Leads data arrives at 11:00 AM
3. Only new/changed leads are retrieved through the WAN
4. The response is up-to-date, but the query is much faster
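The merge step can be sketched like this (hypothetical lead rows; `fetch_delta` stands in for the "changed/added since" query sent over the WAN):

```python
from datetime import datetime

# Cached 'Leads' snapshot taken at 1:00 AM (hypothetical rows, keyed by id)
cached = {
    "L1": {"id": "L1", "status": "New"},
    "L2": {"id": "L2", "status": "Open"},
}
last_refresh = datetime(2016, 7, 1, 1, 0)

def fetch_delta(since):
    """Stands in for the incremental source query:
    'get Leads changed/added since <since>'."""
    return [
        {"id": "L2", "status": "Qualified"},   # changed after 1:00 AM
        {"id": "L3", "status": "New"},         # added after 1:00 AM
    ]

def incremental_query(cached, since):
    merged = dict(cached)
    for row in fetch_delta(since):   # only the delta crosses the WAN
        merged[row["id"]] = row      # fresh rows override stale ones
    return merged

result = incremental_query(cached, last_refresh)
assert result["L2"]["status"] == "Qualified" and "L3" in result
```

The result is fully up-to-date even though only two rows, not the whole Leads table, were transferred.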
Resource Management
Resource Management
Advanced Memory Management
- Dynamic data buffers to control federation of sources with different data retrieval speeds, which guarantees a low memory footprint
- All operations are memory-constrained to prevent monopolization of resources by a single query; the constraints are adjustable
- Swapping data to disk to handle large data sets without overloading memory
- On-the-fly modification of execution plans to prevent exceeding memory thresholds
Server Throttling Mechanisms
- Control settings to limit concurrency (max queries, max threads, etc.)
- Waiting queues for inbound connections
- Connection pools for data sources
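The throttling idea — a hard cap on concurrent queries with extra requests waiting in a queue — can be sketched with a semaphore (a generic illustration of the mechanism, not Denodo's implementation):

```python
import threading

class Throttle:
    """Sketch of a max-concurrency limit with an implicit waiting queue:
    at most `max_queries` run at once; extra requests block until a
    slot frees up."""

    def __init__(self, max_queries):
        self.slots = threading.Semaphore(max_queries)

    def run(self, query_fn):
        with self.slots:            # waits here if all slots are taken
            return query_fn()

running = 0
peak = 0
lock = threading.Lock()

def fake_query():
    """Records how many queries were in flight at once."""
    global running, peak
    with lock:
        running += 1
        peak = max(peak, running)
    # ... query work would happen here ...
    with lock:
        running -= 1
    return "ok"

throttle = Throttle(max_queries=2)
threads = [threading.Thread(target=throttle.run, args=(fake_query,))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert peak <= 2  # the cap was never exceeded despite 8 requests
```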
Resource Management
Enterprise Resource Manager
Apply resource restrictions based on a set of rules:
- Rules classify sessions into groups (e.g. by user, role, application, source IP, etc.)
  - E.g. sessions from the application 'single customer view' are assigned to a group called 'high priority transactional'
- Restrictions are applied to each group: change priority, change concurrency settings, change max timeouts, etc.
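The classify-then-restrict pattern can be sketched as follows (the rule predicates, group names, and restriction fields are invented for illustration):

```python
# Hypothetical resource-manager rules: each rule is a predicate over
# session attributes plus the group it assigns matching sessions to.
rules = [
    (lambda s: s["application"] == "single customer view",
     "high priority transactional"),
    (lambda s: s["role"] == "analyst", "reporting"),
]
# Hypothetical per-group restrictions
restrictions = {
    "high priority transactional": {"priority": "high", "max_concurrent": 50},
    "reporting": {"priority": "low", "max_concurrent": 5, "max_timeout_s": 300},
    "default": {"priority": "normal", "max_concurrent": 10},
}

def classify(session):
    """First matching rule wins; unmatched sessions get the default group."""
    for predicate, group in rules:
        if predicate(session):
            return group
    return "default"

session = {"application": "single customer view",
           "role": "clerk", "ip": "10.0.0.7"}
group = classify(session)
assert restrictions[group]["priority"] == "high"
```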
Further Reading
Further Reading
Check also the following articles written by our CTO Alberto Pan on our blog:
• Myths in data virtualization performance
  http://www.datavirtualizationblog.com/myths-in-data-virtualization-performance/
• Performance of Data Virtualization in Logical Data Warehouse scenarios
  http://www.datavirtualizationblog.com/performance-data-virtualization-logical-data-warehouse-scenarios/
• Physical vs Logical Data Warehouse: the numbers
  http://www.datavirtualizationblog.com/physical-logical-data-warehouse-performance-numbers/
Thanks!
www.denodo.com info@denodo.com
© Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization of Denodo Technologies.