TRANSCRIPT
Elasca: Workload-Aware Elastic Scalability for Partition-Based Database Systems
Taha Rafiq
MMath Thesis Presentation
24/04/2013
Slide 2: Outline
1. Introduction & Motivation
2. VoltDB & Elastic Scale-Out Mechanism
3. Partition Placement Problem
4. Workload-Aware Optimizer
5. Experiments & Results
6. Supporting Multi-Partition Transactions
7. Conclusion
Slide 3: INTRODUCTION & MOTIVATION
Slide 4: DBMS Scalability
• Replication
• Partitioning
Slide 5: Traditional (DBMS) Scalability
Scalability: the ability of a system to be enlarged to handle a growing amount of work
[Diagram: Higher Load → Add Resources → Better Performance, at the cost of expensive downtime]
Slide 6: Elastic (DBMS) Scalability
Elasticity: the use of computing resources that vary dynamically to meet a variable workload
[Diagram: Higher Load → Dynamically Add Resources → Better Performance, with no downtime]
Slide 7: Elastically Scaling a Partition-Based DBMS via Re-Partitioning
[Diagram: scale out splits Partition 1 on Node 1 into Partition 1 (Node 1) and Partition 2 (Node 2); scale in merges them back onto Node 1]
Slide 8: Elastically Scaling a Partition-Based DBMS via Partition Migration
[Diagram: Node 1 initially holds P1–P4; scale out migrates P3 and P4 to a new Node 2, leaving P1 and P2 on Node 1; scale in migrates them back]
Slide 9: Partition Migration for Elastic Scalability
• Mechanism: how to add/remove nodes and move partitions
• Policy/Strategy: which partitions to move, when, and where during scale out/scale in
Slide 10: Elasca
Elasca = Elastic Scale-Out Mechanism + Partition Placement & Migration Optimizer
Slide 11: VOLTDB & ELASTIC SCALE-OUT MECHANISM
Slide 12: What is VoltDB?
• In-memory, partition-based DBMS
– No disk access = very fast
• Shared-nothing architecture, serial execution
– No locks
• Stored procedures
– No arbitrary transactions
• Replication
– Fault tolerance & durability
Slide 13: VoltDB Architecture
[Diagram: three nodes, each with a Client Interface, an Initiator, and two execution-site threads (ES1, ES2) hosting partitions (P1 P2 / P3 P1 / P2 P3); clients connect to the client interfaces]
Slide 14: Single-Partition Transactions
[Diagram: the architecture from slide 13; a single-partition transaction runs entirely at the one execution site hosting its partition]
Slide 15: Multi-Partition Transactions
[Diagram: the architecture from slide 13; a multi-partition transaction is coordinated by one execution site (ES1) and touches partitions on multiple nodes]
Slide 16: Elastic Scale-Out Mechanism
[Diagram: a scale-out node with its own Client Interface, Initiator, and execution sites (ES1, ES4) takes over partitions P1 and P4, whose original copies are marked failed]
Slide 17: Overcommitting Cores
• VoltDB suggests: partitions per node < cores per node
• Wasted resources when load is low or data access is skewed
• Idea: aggregate extra partitions on each node and scale out when load increases
Slide 18: PARTITION PLACEMENT PROBLEM
Slide 19: Given… Cluster and System Specifications
• Number of CPU cores
• Memory
• Max. number of nodes
Slide 20: Given… Load Per Partition
[Bar chart: requests per second for partitions P1–P8, on a scale of 0 to 3000]
Slide 21: Given… Size of Each Partition
[Bar chart: size in MB for partitions P1–P8, on a scale of 0 to 1200]
Slide 22: Given… Current Partition-to-Node Assignment
[Table: current assignment of partitions P1–P8 to Nodes 1–3]
Slide 23: Find… Optimal Partition-to-Node Assignment (For Next Time Interval)
[Table: partitions P1–P8 against Nodes 1–3, with every entry to be determined (?)]
Slide 24: Optimization Objectives
• Maximize throughput: match the performance of a static, fully provisioned system
• Minimize resources used: use the minimum number of nodes required to meet performance demands
Slide 25: Optimization Objectives
• Minimize data movement: data movement adversely affects system performance and incurs network costs
• Balance load effectively: minimizes the risk of overloading a node during the next time interval
Slide 26: WORKLOAD-AWARE OPTIMIZER
Slide 27: System Overview
[Diagram: system overview of the workload-aware optimizer]
Slide 28: Statistics Collected
• α: the maximum number of transactions that can be executed on a partition per second
– The max capacity of an execution site
• β: the CPU overhead of host-level tasks
– How much CPU capacity the Initiator uses
Slide 29: Effect of β
[Chart: effect of β]
Slide 30: Estimating CPU Load
[Equations for: the CPU load generated by each partition, the average CPU load of host-level tasks per node, and the average CPU load per node]
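The three estimates named on this slide can be sketched in Python. This is a hypothetical reconstruction from the statistics α and β on slide 28, not the thesis's exact formulas: per-partition load is taken as request rate divided by α, and per-node load adds the host-level overhead β.

```python
# Hypothetical sketch of the slide-30 estimates (the exact formulas are in
# the thesis): alpha = max transactions/sec a partition can execute
# (slide 28), beta = CPU overhead of host-level tasks such as the Initiator.

def partition_cpu_load(request_rate, alpha):
    """CPU load generated by one partition, as a fraction of an
    execution site's capacity."""
    return request_rate / alpha

def node_cpu_load(partition_rates, alpha, beta):
    """Average CPU load of a node: the sum of its partitions' loads
    plus the host-level overhead beta."""
    return sum(partition_cpu_load(r, alpha) for r in partition_rates) + beta
```

For example, a node hosting partitions at 1500 and 500 requests/second with α = 4000 and β = 0.25 would be estimated at 0.75 of a core-equivalent.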
Slide 31: Optimizer Details
• Mathematical optimization vs. heuristics
• Mixed-Integer Linear Programming (MILP)
• Can be solved using any general-purpose solver (we use IBM ILOG CPLEX)
• Applicable to a wide variety of scenarios
Slide 32: Objective Function
Minimizes data movement as the primary objective and balances load as the secondary objective
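In words, the slide gives data movement as the primary term and load balance as a secondary term, weighted by ε (the subject of the next slide). A sketch of such an objective, with hypothetical symbols since the exact MILP formulation is in the thesis:

```latex
% Hypothetical sketch, not the thesis's exact formulation.
% x_{p,n} = 1 if partition p is assigned to node n,
% m_{p,n} = data moved if p lands on n, L_n = estimated load of node n.
\min \sum_{p}\sum_{n} m_{p,n}\, x_{p,n}
     \;+\; \varepsilon \cdot \bigl(\max_{n} L_n - \min_{n} L_n\bigr)
```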
Slide 33: Effect of ε
[Chart: effect of ε]
Slide 34: Minimizing Resources Used
• Calculate the minimum number of nodes that can handle the load of all the partitions
– Non-integer assignment
• Explicitly tell the optimizer how many nodes to use
• If the optimizer can't find a solution with the minimum N nodes, it tries again with N + 1 nodes
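The loop described on this slide can be sketched as follows (hypothetical names; in Elasca the inner call is the MILP solve, which is passed in here as a function):

```python
import math

# Hypothetical sketch of slide 34's node-minimization loop; solve(n)
# stands in for one call to the MILP optimizer with n nodes allowed.

def min_nodes(partition_loads, node_capacity):
    """Lower bound from the non-integer assignment: total load divided
    by per-node CPU capacity, rounded up."""
    return math.ceil(sum(partition_loads) / node_capacity)

def solve_with_fewest_nodes(partition_loads, node_capacity, max_nodes, solve):
    """Ask the optimizer to use the minimum node count; if that is
    infeasible, retry with N + 1 nodes up to the cluster maximum."""
    n = min_nodes(partition_loads, node_capacity)
    while n <= max_nodes:
        solution = solve(n)        # hypothetical MILP call; None = infeasible
        if solution is not None:
            return n, solution
        n += 1
    return None
```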
Slide 35: Constraints
• Replication: replicas of a given partition must be assigned to different nodes
• CPU capacity: the sum of the loads of a node's partitions must be less than the node's capacity
• Memory capacity: all the partitions assigned to a node must fit in its memory
• Host-level tasks: the overhead of host-level tasks must not exceed the capacity of a single core
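A feasibility check equivalent to these four constraints can be sketched in plain Python (the optimizer encodes them as MILP constraints; the names and parameter shapes here are illustrative):

```python
# Hypothetical validity check for the slide-35 constraints on a candidate
# placement; all names are illustrative, not the thesis's notation.

def is_feasible(assignment, loads, sizes, cpu_capacity, mem_capacity, beta):
    """assignment maps (partition, replica_index) -> node; loads/sizes map
    partition -> CPU load / size in MB; beta is the host-level overhead
    in core-equivalents. Returns True iff all four constraints hold."""
    # Replication: replicas of a given partition on different nodes.
    replica_nodes = {}
    for (part, _replica), node in assignment.items():
        replica_nodes.setdefault(part, []).append(node)
    if any(len(set(ns)) != len(ns) for ns in replica_nodes.values()):
        return False
    # Host-level tasks: overhead must not exceed the capacity of one core.
    if beta > 1.0:
        return False
    for node in set(assignment.values()):
        parts = [p for (p, _r), n in assignment.items() if n == node]
        # CPU capacity: partition loads plus host overhead fit the node.
        if sum(loads[p] for p in parts) + beta > cpu_capacity:
            return False
        # Memory capacity: assigned partitions must fit in memory.
        if sum(sizes[p] for p in parts) > mem_capacity:
            return False
    return True
```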
Slide 36: Staggering Scale In
• A fluctuating workload can result in excessive data movement
• Staggering scale in mitigates this problem
• Delay scaling in by s time steps
• Slightly more resources used, in exchange for stability
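One way to realize the staggered scale-in described here is a small state machine: a minimal sketch, assuming scale-out is applied immediately while scale-in waits for s consecutive requests (the class and names below are illustrative, not from the thesis).

```python
# Hypothetical sketch of slide 36's staggered scale-in policy.

class StaggeredScaler:
    """Apply scale-out immediately, but delay scale-in until the
    optimizer has requested fewer nodes for s consecutive time steps."""

    def __init__(self, s, initial_nodes):
        self.s = s
        self.current = initial_nodes
        self.pending = 0   # consecutive intervals that wanted fewer nodes

    def decide(self, desired_nodes):
        if desired_nodes >= self.current:
            self.current = desired_nodes   # scale out (or hold) right away
            self.pending = 0
        else:
            self.pending += 1
            if self.pending >= self.s:     # enough consecutive requests
                self.current = desired_nodes
                self.pending = 0
        return self.current
```

A brief fluctuation (one interval asking for fewer nodes) then no longer triggers data movement; only a sustained drop does, at the cost of slightly higher resource use.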
Slide 37: EXPERIMENTAL EVALUATION
Slide 38: Optimizers Evaluated
• ELASCA: our workload-aware optimizer
• ELASCA-S: ELASCA with staggered scale in
• OFFLINE: an offline optimizer that minimizes resources used and data movement
• GREEDY: a greedy first-fit optimizer
• SCO: a static, fully provisioned system (no optimization)
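The GREEDY baseline is described only as a greedy first-fit optimizer; a plausible sketch (the thesis's exact heuristic may differ) is decreasing-load first-fit bin packing:

```python
# Hypothetical first-fit-decreasing placement, as one reading of the
# GREEDY baseline on slide 38.

def greedy_first_fit(loads, node_capacity):
    """loads maps partition -> CPU load. Visit partitions in decreasing
    load order; place each on the first node with spare capacity,
    opening a new node when none fits. Returns partition -> node index."""
    placement, node_load = {}, []
    for part in sorted(loads, key=loads.get, reverse=True):
        for i, used in enumerate(node_load):
            if used + loads[part] <= node_capacity:
                placement[part], node_load[i] = i, used + loads[part]
                break
        else:   # no existing node fits: open a new one
            placement[part] = len(node_load)
            node_load.append(loads[part])
    return placement
```

Greedy placement is fast but ignores data movement: it may shuffle many partitions between intervals, which is exactly what the MILP objective penalizes.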
Slide 39: Benchmarks Used
• TPC-C: modified to be cleanly partitioned and fit in memory (3.6 GB)
• TATP: Telecommunication Application Transaction Processing benchmark (250 MB)
• YCSB: Yahoo! Cloud Serving Benchmark with a 50/50 read/write ratio (1 GB)
Slide 40: Dynamic Workloads
• Varying the aggregate request rate
– Periodic waveforms: sine, triangle, sawtooth
• Skewing the data access
– Temporal skew
– Statistical distributions: uniform, normal, categorical, Zipfian
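The periodic waveforms named above can be sketched as a small generator; the base rate, amplitude, and period below are illustrative, not the experiment's actual settings:

```python
import math

# Hypothetical generator for the slide-40 request-rate waveforms.

def request_rate(shape, t, period, base, amplitude):
    """Aggregate request rate at time t for a periodic waveform that
    oscillates between base and base + amplitude."""
    phase = (t % period) / period          # position within the cycle, [0, 1)
    if shape == "sine":
        return base + amplitude * (1 + math.sin(2 * math.pi * phase)) / 2
    if shape == "triangle":
        return base + amplitude * (1 - abs(2 * phase - 1))
    if shape == "sawtooth":
        return base + amplitude * phase
    raise ValueError(f"unknown shape: {shape}")
```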
Slide 41: Temporal Skew
[Charts: load across partitions P1–P8 at time steps t = 1, 2, 3, 4, and back to t = 1; the heavily loaded partitions change from step to step]
Slide 42: Experimental Setup
• Each experiment run for 1 hour
• 15 time intervals
– Optimizer run every four minutes
• Combination of simulation and actual runs
– Exact numbers for data movement, resources used, and load balance obtained through simulation
• Cluster of 4 nodes, plus 2 separate client machines
Slide 43: Data Movement (TPC-C), Triangle Wave (f = 1) [chart]
Slide 44: Data Movement (TPC-C), Triangle Wave (f = 1), Zipfian Skew [chart]
Slide 45: Data Movement (TPC-C), Triangle Wave (f = 4) [chart]
Slide 46: Computing Resources Saved (TPC-C), Triangle Wave (f = 1) [chart]
Slide 47: Load Balance (TPC-C), Triangle Wave (f = 1) [chart]
Slide 48: Database Throughput (TPC-C), Sine Wave (f = 2) [chart]
Slide 49: Database Throughput (TPC-C), Sine Wave (f = 2), Normal Skew [chart]
Slide 50: Database Throughput (TATP), Sine Wave (f = 2) [chart]
Slide 51: Database Throughput (YCSB), Sine Wave (f = 2) [chart]
Slide 52: Database Throughput (TPC-C), Triangle Wave (f = 4) [chart]
Slide 53: Optimizer Scalability [chart]
Slide 54: SUPPORTING MULTI-PARTITION TRANSACTIONS
Slide 55: Factors Affecting Performance
• Maximum MPT throughput (η): the maximum number of transactions an execution site can coordinate per second
• Probability of MPTs (p_mpt): the percentage of transactions that are MPTs
• Partitions involved in MPTs: the number of partitions involved in each MPT
Slide 56: Changes to Model
The CPU load generated by each partition is the sum of:
1. Load due to transaction work (same as SPTs)
2. Load due to coordinating MPTs
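Folding the MPT coordination cost into the per-partition load estimate might look like the sketch below; this is a hypothetical formula (the exact model is in the thesis), using η and p_mpt from slide 55 and the per-partition execution capacity α from slide 28:

```python
# Hypothetical sketch of the slide-56 extended load model.

def partition_cpu_load_with_mpts(request_rate, p_mpt, alpha, eta):
    """Per-partition CPU load, per slide 56: (1) load due to transaction
    work, as for single-partition transactions, plus (2) load due to
    coordinating this partition's multi-partition transactions, where
    eta is the max MPTs an execution site can coordinate per second."""
    work = request_rate / alpha                  # transaction work
    coordination = (request_rate * p_mpt) / eta  # MPT coordination
    return work + coordination
```

Because η is typically much smaller than α, even a modest MPT fraction can dominate a partition's estimated load.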
Slide 57: Maximum MPT Throughput [chart]
Slide 58: Probability of MPTs [chart]
Slide 59: Effect on Resources Saved [chart]
Slide 60: Effect on Data Movement [chart]
Slide 61: CONCLUSION
Slide 62: Related Work
• Data replication and partitioning
• Database consolidation
• Live database migration
• Key-value stores
• Data placement
Slide 63: Elasca
Elasca = Elastic Scale-Out Mechanism + Partition Placement & Migration Optimizer
Slide 64: Conclusion
• Elasca = Mechanism + Optimizer
• Workload-aware optimizer
– Meets performance demands
– Minimizes computing resources used
– Minimizes data movement
– Effectively balances load
• Scalable to large problem sizes in an online setting
Slide 65: Future Work
• Migrating to VoltDB 3.0
– Intelligent client routing, master/slave partitions
• Supporting multi-partition transactions
• Automated parameter tuning
• Transaction mixes
• Workload prediction
Slide 66: Thank You
Questions?