How to Achieve Fast Data Performance in Big Data, Logical Data Warehouse, and Operational Scenarios
Post on 09-Jan-2017
TRANSCRIPT
Performance in Denodo 6.0
Pablo Alvarez, Principal Technical Account Manager
Agenda
1. Debunking the myths of virtual performance
2. Query Optimizer
3. Cache
4. Resource Management
5. Further Reading
It is a common assumption that a virtualized solution will be much slower than a persisted approach via ETL, because:
1. A large amount of data is moved through the network for each query
2. Network transfer is slow
But is this really true?
Debunking the myths of virtual performance
1. Complex queries can be solved transferring moderate data volumes when the right techniques are applied
   - Operational queries: predicate delegation produces small result sets
   - Logical Data Warehouse and Big Data: Denodo uses the characteristics of the underlying star schemas to apply query rewriting rules that maximize delegation to specialized sources (especially heavy GROUP BY operations) and minimize data movement
2. Current networks are almost as fast as reading from disk
   - 10 Gb and 100 Gb Ethernet are commodities
Performance Comparison: Logical Data Warehouse vs. Physical Data Warehouse
Scenario:
- Customer Dimension: 2 M rows
- Sales Facts: 290 M rows
- Items Dimension: 400 K rows
• Denodo has done extensive testing using queries from the standard benchmarking test TPC-DS* against this scenario
• The baseline was set using the same queries with all data in a Netezza appliance
* TPC-DS is the de facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
Performance Comparison: Logical Data Warehouse vs. Physical Data Warehouse

Query Description | Returned Rows | Time Netezza | Time Denodo (Federating Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected)
Total sales by customer | 1.99 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down
Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down
Total sales by item brand | 31.35 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down
Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec. | 5.2 sec. | On-the-fly data movement
Performance and optimizations in Denodo 6.0
Focused on 3 core concepts:
1. Dynamic Multi-Source Query Execution Plans
   - Leverages processing power & architecture of data sources
   - Dynamic to support ad hoc queries
   - Uses statistics for cost-based query plans
2. Selective Materialization
   - Intelligent caching of only the most relevant and often-used information
3. Optimized Resource Management
   - Smart allocation of resources to handle high concurrency
   - Throttling to control and mitigate source impact
   - Resource plans based on rules
Performance and optimizations in Denodo 6.0
Comparing optimizations in DV vs. ETL
Although Data Virtualization is a data integration platform, architecturally speaking it is more similar to an RDBMS:
- Uses relational logic
- Metadata is equivalent to that of a database
- Enables ad hoc querying
Key difference between ETL engines and DV:
- ETL engines are optimized for static bulk movements (fixed data flows)
- Data virtualization is optimized for queries (dynamic execution plan per query)
Therefore, the performance architecture presented here resembles that of an RDBMS.
Query Optimizer
How Dynamic Query Optimizer Works: Step by Step
Metadata Query Tree
• Maps query entities (tables, fields) to actual metadata
• Retrieves execution capabilities and restrictions for the views involved in the query
Static Optimizer
• Query delegation
• SQL rewriting rules (removal of redundant filters, tree pruning, join reordering, transformation push-up, star-schema rewritings, etc.)
• Data movement query plans
Cost-Based Optimizer
• Picks optimal JOIN methods and orders based on data distribution statistics, indexes, transfer rates, etc.
Physical Execution Plan
• Creates the calls to the underlying systems in their corresponding protocols and dialects (SQL, MDX, WS calls, etc.)
How Dynamic Query Optimizer Works
Example: Logical Data Warehouse
Total sales by retailer and product during the last month for the brand ACME
(Star schema: fact table "sales" joined to the Time, Product, and Retailer dimensions, spread across the EDW and MDM sources)

SELECT retailer.name,
       product.name,
       SUM(sales.amount)
FROM sales
  JOIN retailer ON sales.retailer_fk = retailer.id
  JOIN product ON sales.product_fk = product.id
  JOIN time ON sales.time_fk = time.id
WHERE time.date < ADDMONTH(NOW(), -1)
  AND product.brand = 'ACME'
GROUP BY product.name, retailer.name
How Dynamic Query Optimizer Works
Example: Non-optimized plan
All joins and the final GROUP BY on product.name and retailer.name are executed in the virtualization layer. Row estimates from the execution tree: the sales scan transfers 1,000,000,000 rows, the dimension branches return 100, 10, and 30 rows, and 10,000,000 rows reach the final aggregation. The queries sent to the sources are:

SELECT sales.retailer_fk, sales.product_fk,
       sales.time_fk, sales.amount
FROM sales

SELECT retailer.name, retailer.id
FROM retailer

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'

SELECT time.date, time.id
FROM time
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
How Dynamic Query Optimizer Works
Step 1: Applies JOIN reordering to maximize delegation
The sales–time join is delegated to the source, so the fact branch transfers 100,000,000 rows instead of the full table; the retailer and product branches return 100 and 10 rows, and 10,000,000 rows reach the final GROUP BY on product.name and retailer.name:

SELECT sales.retailer_fk, sales.product_fk, sales.amount
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)

SELECT retailer.name, retailer.id
FROM retailer

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
How Dynamic Query Optimizer Works
Step 2: Partial aggregation push-down
Since the JOIN is on foreign keys (1-to-many), and the GROUP BY is on attributes from the dimensions, the optimizer applies the partial aggregation push-down optimization. The fact branch now returns only 10,000 pre-aggregated rows, and the final GROUP BY on product.name and retailer.name produces 1,000 rows in the virtualization layer:

SELECT sales.retailer_fk, sales.product_fk,
       SUM(sales.amount)
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
GROUP BY sales.retailer_fk, sales.product_fk

SELECT retailer.name, retailer.id
FROM retailer

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
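The row-count reduction from partial aggregation push-down can be sketched in Python (the data and the `partial_aggregate` helper are hypothetical; the function stands in for the GROUP BY query delegated to the fact source):

```python
from collections import defaultdict

# Hypothetical fact rows: (retailer_fk, product_fk, amount)
sales = [
    (1, 10, 5.0), (1, 10, 7.0), (1, 11, 3.0),
    (2, 10, 4.0), (2, 11, 6.0), (2, 11, 1.0),
]
retailer = {1: "R-One", 2: "R-Two"}      # id -> name
product = {10: "P-Ten", 11: "P-Eleven"}  # id -> name

def partial_aggregate(rows):
    """What the source executes: GROUP BY on the foreign keys.

    This collapses the fact rows to at most
    (#retailers x #products) rows before anything is transferred."""
    totals = defaultdict(float)
    for r_fk, p_fk, amount in rows:
        totals[(r_fk, p_fk)] += amount
    return totals

# The DV layer only joins the small pre-aggregated result with the
# dimensions and re-groups on the dimension attributes.
pushed_down = partial_aggregate(sales)
final = defaultdict(float)
for (r_fk, p_fk), total in pushed_down.items():
    final[(retailer[r_fk], product[p_fk])] += total

# 6 fact rows were reduced to 4 transferred rows
assert len(sales) == 6 and len(pushed_down) == 4
```

Because the GROUP BY keys are the PKs of the dimensions, re-aggregating the partial totals in the DV layer gives exactly the same result as aggregating the raw fact rows.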
How Dynamic Query Optimizer Works
Step 3: Selects the right JOIN strategy based on costs for data volume estimations
The 10-row product branch drives a NESTED JOIN: the matching product ids are pushed into the fact query as an IN condition, so that branch now returns only 1,000 rows. The 100-row retailer branch uses a HASH JOIN, and the final GROUP BY produces 1,000 rows:

SELECT sales.retailer_fk, sales.product_fk,
       SUM(sales.amount)
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
  AND sales.product_fk IN (1, 2, …)
GROUP BY sales.retailer_fk, sales.product_fk

SELECT retailer.name, retailer.id
FROM retailer

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'
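A nested join with IN-list push-down can be sketched like this (toy in-memory tables; `fact_query` stands in for the rewritten query sent to the fact source):

```python
# Hypothetical tables
product = [  # (id, name, brand)
    (1, "Anvil", "ACME"), (2, "Rocket", "ACME"), (3, "Glue", "Other"),
]
sales = [  # (product_fk, amount) -- pre-aggregated rows at the fact source
    (1, 100.0), (2, 50.0), (3, 75.0),
]

# Step A: execute the small dimension branch first
acme_ids = [pid for (pid, _name, brand) in product if brand == "ACME"]

# Step B (nested join): rewrite the fact query with an IN condition,
# so only matching rows ever leave the source
def fact_query(in_list):
    return [(fk, amt) for (fk, amt) in sales if fk in in_list]

rows = fact_query(acme_ids)
assert rows == [(1, 100.0), (2, 50.0)]  # the non-ACME row never travels
```

This is why a nested join wins when one branch is tiny: the cost of executing it first is repaid by the filter it lets the optimizer push into the large branch.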
How Dynamic Query Optimizer Works: Summary
Automatic JOIN reordering groups branches that go to the same source, to maximize query delegation and reduce processing in the DV layer. End users don't need to worry about the optimal "pairing" of the tables.
The partial aggregation push-down optimization is key in these scenarios. Based on PK-FK restrictions, it pushes the aggregation (on the PKs) to the DW:
- Leverages the processing power of the DW, which is optimized for these aggregations
- Significantly reduces the data transferred through the network (from 1 billion rows to 10,000 in the example)
The cost-based optimizer picks the right JOIN strategies based on estimations of data volumes, existence of indexes, transfer rates, etc. Denodo estimates costs differently for parallel databases (Vertica, Netezza, Teradata) than for regular databases, to take into account the different way those systems operate (distributed data, parallel processing, different aggregation techniques, etc.).
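The kind of decision the cost-based optimizer makes can be illustrated with a deliberately simplified cost model (the formula and constants below are invented for illustration and are not Denodo's actual cost functions):

```python
def choose_join_strategy(left_rows, right_rows, can_push_filter):
    """Toy cost model for picking a JOIN method.

    NESTED JOIN is cheap when one side is tiny and its keys can be
    pushed into the other side as a filter (index / IN-list), so only
    a fraction of the large side is transferred.
    HASH JOIN is cheap when both sides must be transferred anyway."""
    small, large = min(left_rows, right_rows), max(left_rows, right_rows)
    # Nested: transfer the small side, then a filtered scan of the large side
    nested_cost = small + (large / 1000.0 if can_push_filter else large)
    # Hash: transfer both sides, build a hash table on the small one, probe
    hash_cost = small + large
    return "NESTED" if nested_cost < hash_cost else "HASH"

# 10-row product branch vs. 100M-row fact branch with filter pushdown:
assert choose_join_strategy(10, 100_000_000, can_push_filter=True) == "NESTED"
# Two branches that must be fully transferred anyway:
assert choose_join_strategy(100, 1_000, can_push_filter=False) == "HASH"
```

The real optimizer also weighs statistics, indexes, and transfer rates per source, but the shape of the decision is the same: estimate the rows each plan moves and pick the cheapest.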
How Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data
- Pruning of unnecessary JOIN branches (based on 1-to-many associations) when the attributes of the 1-side are not projected
  - Relevant for horizontal partitioning and "fat" semantic models when queries do not need attributes from all the tables
  - Unnecessary tables are removed from the query (even for single-source models)
- Pruning of UNION branches based on incompatible filters
  - Enables detection of unnecessary UNION branches in vertical partitioning scenarios
- Automatic data movement
  - Creation of temp tables in one of the systems to enable complete delegation of a federated branch
  - The target source needs to have the "data movement" option enabled for this optimization to be taken into account
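UNION branch pruning can be sketched as follows (hypothetical partitioned views, each declaring the year range its filter covers; branches whose range cannot match the query filter are dropped before execution):

```python
# Hypothetical UNION branches over a partitioned sales model:
# each branch covers a (low, high) year range
branches = {
    "sales_2014": (2014, 2014),
    "sales_2015": (2015, 2015),
    "sales_2016": (2016, 2016),
}

def prune_union(branches, query_year):
    """Keep only branches whose partition range can satisfy the filter.

    Branches with an incompatible filter are removed from the plan,
    so their sources are never queried at all."""
    return [name for name, (lo, hi) in branches.items()
            if lo <= query_year <= hi]

surviving = prune_union(branches, 2016)
assert surviving == ["sales_2016"]  # two of three sources are never hit
```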
Caching
Caching
Real time vs. caching
Sometimes real-time access & federation are not a good fit:
- Sources are slow (e.g. text files, cloud apps like Salesforce.com)
- A lot of data processing is needed (e.g. complex combinations, transformations, matching, cleansing, etc.)
- Access is limited, or the impact on the sources has to be mitigated
For these scenarios, Denodo can replicate just the relevant data in the cache.
Caching
Overview
Denodo's cache system is based on an external relational database:
- Traditional (Oracle, SQL Server, DB2, MySQL, etc.)
- MPP (Teradata, Netezza, Vertica, Redshift, etc.)
- In-memory storage (Oracle TimesTen, SAP HANA)
It works at the view level and allows hybrid access (real-time / cached) within an execution tree.
Cache control (population / maintenance):
- Manually: user initiated, at any time
- Time based: using the TTL or the Denodo Scheduler
- Event based: e.g. using JMS messages triggered in the DB
Caching
Caching options
Denodo offers two different types of cache:
Partial:
- Query-by-query cache
- Useful for caching only the most commonly requested data
- More adequate for representing the capabilities of non-relational sources, like web services or APIs with input parameters
Full:
- Similar to the concept of a materialized view
- Incrementally updateable at row level, to avoid unnecessary full refresh loads
- Offers full push-down capabilities to the source, including GROUP BY and JOIN operations
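The partial (query-by-query) mode can be sketched as a keyed cache with a TTL; this is a minimal illustration, not Denodo's implementation, and `slow_source` is a hypothetical stand-in for the real data source:

```python
import time

class PartialCache:
    """Minimal sketch of a query-by-query (partial) cache with a TTL.

    Keys are the query parameters; only queries that were actually
    executed get cached, unlike a full materialization."""

    def __init__(self, fetch, ttl_seconds):
        self.fetch = fetch          # function that hits the real source
        self.ttl = ttl_seconds
        self.store = {}             # params -> (timestamp, rows)

    def query(self, params):
        entry = self.store.get(params)
        if entry is not None and time.time() - entry[0] < self.ttl:
            return entry[1]         # cache hit: source not contacted
        rows = self.fetch(params)   # miss or expired TTL: go to source
        self.store[params] = (time.time(), rows)
        return rows

calls = []
def slow_source(params):
    calls.append(params)
    return [f"row-for-{params}"]

cache = PartialCache(slow_source, ttl_seconds=60)
cache.query("customer=42")
cache.query("customer=42")          # served from cache
assert calls == ["customer=42"]     # the source was hit only once
```

This matches the partial-cache use case above: hot parameter combinations are answered locally, while unseen ones still go to the source.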
Hybrid Performance for SaaS sources
Incremental Queries (available July 2016)
Merge cached data and fresh data to provide fully up-to-date results with minimum latency. Example with Salesforce 'Leads': the cache holds the leads updated at 1:00 AM; an incremental query retrieves only the leads changed or added since 1:00 AM and merges them with the cache to return up-to-date data.
1. Salesforce 'Leads' data is cached in VDP at 1:00 AM
2. A query needing Leads data arrives at 11:00 AM
3. Only new/changed leads are retrieved through the WAN
4. The response is up-to-date, but the query is much faster
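The merge step can be sketched like this (hypothetical lead rows; `fetch_delta` stands in for the "changed/added since" query sent over the WAN):

```python
from datetime import datetime

# Cached 'Leads' snapshot taken at 1:00 AM (hypothetical rows, keyed by id)
cached = {
    "L1": {"id": "L1", "status": "New"},
    "L2": {"id": "L2", "status": "Open"},
}
last_refresh = datetime(2016, 7, 1, 1, 0)

def fetch_delta(since):
    """Stands in for the incremental source query:
    'get Leads changed/added since <since>'."""
    return [
        {"id": "L2", "status": "Qualified"},   # changed after 1:00 AM
        {"id": "L3", "status": "New"},         # added after 1:00 AM
    ]

def incremental_query(cached, since):
    merged = dict(cached)
    for row in fetch_delta(since):   # only the delta crosses the WAN
        merged[row["id"]] = row      # fresh rows override stale ones
    return merged

result = incremental_query(cached, last_refresh)
assert result["L2"]["status"] == "Qualified" and "L3" in result
```

The result is fully up-to-date even though only two rows, not the whole Leads table, were transferred.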
Resource Management
Resource Management
Advanced Memory Management
- Dynamic data buffers to control federation of sources with different data retrieval speeds, which guarantees a low memory footprint
- All operations are memory-constrained to prevent monopolization of resources by a single query; the constraints are adjustable
- Swapping data to disk to handle large data sets without overloading memory
- On-the-fly modification of execution plans to prevent exceeding memory thresholds
Server Throttling Mechanisms
- Control settings to limit concurrency (max queries, max threads, etc.)
- Waiting queues for inbound connections
- Connection pools for data sources
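The throttling idea — a hard cap on concurrent queries with extra requests waiting in a queue — can be sketched with a semaphore (a generic illustration of the mechanism, not Denodo's implementation):

```python
import threading

class Throttle:
    """Sketch of a max-concurrency limit with an implicit waiting queue:
    at most `max_queries` run at once; extra requests block until a
    slot frees up."""

    def __init__(self, max_queries):
        self.slots = threading.Semaphore(max_queries)

    def run(self, query_fn):
        with self.slots:            # waits here if all slots are taken
            return query_fn()

running = 0
peak = 0
lock = threading.Lock()

def fake_query():
    """Records how many queries were in flight at once."""
    global running, peak
    with lock:
        running += 1
        peak = max(peak, running)
    # ... query work would happen here ...
    with lock:
        running -= 1
    return "ok"

throttle = Throttle(max_queries=2)
threads = [threading.Thread(target=throttle.run, args=(fake_query,))
           for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert peak <= 2  # the cap was never exceeded despite 8 requests
```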
Resource Management
Enterprise Resource Manager
Apply resource restrictions based on a set of rules:
- Rules classify sessions into groups (e.g. by user, role, application, source IP, etc.)
  - E.g. sessions from the application 'single customer view' are assigned to a group called 'high priority transactional'
- Restrictions are applied to each group: change priority, change concurrency settings, change max timeouts, etc.
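The classify-then-restrict pattern can be sketched as follows (the rule predicates, group names, and restriction fields are invented for illustration):

```python
# Hypothetical resource-manager rules: each rule is a predicate over
# session attributes plus the group it assigns matching sessions to.
rules = [
    (lambda s: s["application"] == "single customer view",
     "high priority transactional"),
    (lambda s: s["role"] == "analyst", "reporting"),
]
# Hypothetical per-group restrictions
restrictions = {
    "high priority transactional": {"priority": "high", "max_concurrent": 50},
    "reporting": {"priority": "low", "max_concurrent": 5, "max_timeout_s": 300},
    "default": {"priority": "normal", "max_concurrent": 10},
}

def classify(session):
    """First matching rule wins; unmatched sessions get the default group."""
    for predicate, group in rules:
        if predicate(session):
            return group
    return "default"

session = {"application": "single customer view",
           "role": "clerk", "ip": "10.0.0.7"}
group = classify(session)
assert restrictions[group]["priority"] == "high"
```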
Further Reading
Further Reading
Check also the following articles written by our CTO Alberto Pan on our blog:
• Myths in data virtualization performance
  http://www.datavirtualizationblog.com/myths-in-data-virtualization-performance/
• Performance of Data Virtualization in Logical Data Warehouse scenarios
  http://www.datavirtualizationblog.com/performance-data-virtualization-logical-data-warehouse-scenarios/
• Physical vs Logical Data Warehouse: the numbers
  http://www.datavirtualizationblog.com/physical-logical-data-warehouse-performance-numbers/
Thanks!
www.denodo.com info@denodo.com
© Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization of Denodo Technologies.