load distribution of analytical query workloads for database cluster architecture

Review onLoad Distribution of Analytical Query Workloads

For Database Cluster Architectures

Research Paperby

Thomas Phan Yahoo!, Inc., Sunnyvale, CA, USA

Wen-Syan Li IBM Almaden Research Center, San Jose, CA, USA

Introduction

Database Cluster

- A group of databases

Ex: a company may have two identical database systems, one as a production system for online transactions and one as a hot standby. And also some additional database systems which have a subset of base tables for various day today applications. These systems have their dedicated missions in the day time, but they could be left idle or under-utilized in the evening when batch workloads are processed. In such systems, query workloads can be distributed across different servers for better performance.

Load Distribution of Analytical Query Workloads

For Database Cluster Architectures

Introduction Cont.• Online query workloads - touch a relatively small region of

the data on the disk. Such queries are said to be selective

• Analytical Query Workloads - Touch large volumes of the disk.

• On Line Analytical Processing (OLAP) OLAP is an approach to answer multi-dimensional analytical (MDA) queries swiftly. OLAP is part of the broader category of business intelligence, which also encompasses relational database, report writing and data mining.

MQTs are required in OLAP applications as the query workloads tend to have complex structures and syntax.

• Materialized query tables (MQT)MQT is a table whose definition is based on the result of a query, and whose data is in the form of pre-computed results that are taken from one or more tables.

Idea of the Research

Framework for coordinating and optimizing execution of OLAP query workloads across a cluster of database servers with shared-nothing architecture.

For a database cluster, such an optimization is achieved when,the maximum completion time of the workloads across all database servers is minimized

Completion time at each database server includes,

MQT and index building time + query workload execution time

Expected Outcome : Server – MQT – Query mapping mechanism where workload execution time is minimized

Current ScenarioLoad distribution by dividing the workload into multiple sub-workloads and assign each sub-workload to a database server in some greedy manner, such as a round robin distribution.

Problems Identified

‽ Queries routed to a database server may not be collocated with their needed MQTs

‽ Some MQTs may not t in the data server that has a limited disk space

‽ Some sub-workloads may be more expensive to execute than others, so some server may be idle while other servers are still busy

‽ Some servers may be more powerful than others

Proposed Methodology

Proposed Methodology Cont.Key component - Scheduler

(operates on the queries and the MQTs produced by an existing MQT Advisor product)

Scheduler distributes the workload's queries and MQTs according to query-to-server and MQT-to-server mappings.

Distribution problem number of different distribution combinations =

Even for small parameters, the solution space grows exponentially, making an exhaustive search infeasible

Approach

To search through the solution space, Genetic Algorithm (GA) search heuristic is used which finds a near-optimal mapping of

• queries-to-servers • MQTs-to-servers and also• query-to-MQTs requirement mapping

Other options may had used• Tabu search• Simulated annealing• Steepest-ascent hill climbing

The justification for using GA as the searching algorithm instead of other methods, is not sufficient.

Genetic Algorithm DesignChromosome - Specific mappings of queries-to-servers and MQTs-to-servers along with its associated workload execution time

Best organism – mapping with lowest workload execution time

Previous work depicts, having two separate sets of chromosomes for a multi-objective optimization has a hard time converging.

To overcome that a single unified chromosome is used that represents collocations of queries and MQTs onto the same server.

Genetic Algorithm Crossover

GA Evaluation Function• Initialize the execution times for all servers in the chromosome.

• Check to see what MQTs have been materialized and at what server and accordingly adds these materialization times to the respective servers.

• key loop that looks across all collocations.• If a query needs a particular MQT, the query's execution with

the MQT is added to the execution time if the query and MQT are collocated. If they are not collocated, then the query's execution time without the MQT is added

• The function returns the maximum execution time among the servers.

• And it is omitted from the chromosome, go for next generation.

Mapping Optimization

Developed Genetic Algorithm is compared with an exhaustive search and three standard greedy algorithms.

And the paper shows that GA is a better solution in finding optimal mapping mechanism.

Narrowing down the research in to OLAP query workloads and usage of MQTs is not clearly justified.

And also there are some problems in the method followed to find the better solution.

Literature Review 1. Need to justify the single server approach Vs database cluster approach• Only describes the use of QCC (Query Cost Calibrator) in federated systems and not in database clusters

2. No enough evidence to emphasize the relativity of GA to solve optimization probleme.g. Evolutionary Optimization of File Assignment for a Large-Scale Video-on- Demand System by Jun Guo

3. Reasons for the use of scheduler algorithms for experiment is not clearly mentioned

4. Purpose for the use of OLAP is not identified• OLAP is used because MQTs are required OLAP

5. No review on the preparation of greedy algorithmse.g. Valuated matroids: a new look at the greedy algorithm by Andreas W.M. Dress, Walter Wenzel

What’s new on this topic ?• IQuery Performance Prediction for Analytical Workloads by Jennie Duggan

Uses a combination of isolated and concurrent query execution samples, as well as new query workload features and metrics that can capture how different query classes behave for various levels of resource availability and contention.

• Dynamic Prioritization of Database Queries by Sivaramakrishnan Narayanan, Florian Waas Presents a mechanism that continuously determines and re-computes the ideal target velocity of concurrent database processes based on their run-time statistics to achieveprioritization. In this scheme, every process autonomously adjusts its resource consumption using basic control theory principles.

Sources and data collection•5 types of scheduler algorithms

GAExhaustiveGreedy1-MQTs-then-queriesGreedy2-queries-then-MQTsGreedy3-no-MQTs

•2 data sourcesworkload produced by the standard TPC benchmark H

generator

synthetic benchmark based on the queries and MQTs from the TPC benchmarks

Tests implemented in standard C++ and run on an off the shelf desk top computer running Red Hat Linux.

GA run for 100 generations with a population size of 100 chromosomes.

Workload running time with increasing number of queriesExhaustive search and GA

Comparison against exhaustive search shows that GA gives very close results as in exhaustive search which implies accuracy.

Workload running time with increasing number of servers

GA, exhaustive search, greedy1, 2, 3

Workload running time with increasing number of queries

Issues

•Assumption as having infinite disk space But disk space is limited

•Tested for homogeneous servers only But if heterogeneous!!!

•Scalability tested up to limited number queries only But when increased???

•Emphasis on the suitability of GA from the very beginning. But results show Greedy 2 is also significant

Conclusion

-problem identification is successful-prior conclusions of results???-further improvements for algorithms is needed

Thank You !

load distribution of analytical query workloads for database cluster architecture

Technology

servers mqts

workloads queries

online query workloads

cluster of database

maximum execution time

querys execution time

data server

day time