Gamma DBMS Part 1: Physical Database Design
Shahram Ghandeharizadeh
Computer Science Department
University of Southern California
Outline
  Alternative architectures: Shared-Disk versus Shared-Nothing
  Declustering techniques
Shared-Disk Architecture
  Emerged in the 1980s.
  Many clients share storage and data; data remains available when a client fails.
  [Figure: clients connected through a network to shared data]
Shared-Disk Architecture
  Advantages:
    Many clients share storage and data.
    Redundancy is implemented in one place, protecting all clients from disk failure.
    Centralized backup: the administrator does not care/know how many clients are on the network sharing storage.
  [Figure: clients on a network sharing storage; labels: High Availability, Data Backup, Data Sharing]
Network failures
  What about network failures?
    Two host bus adapters per server.
    Each server connected to a different switch.
Shared-Disk Architecture
  Storage Area Network (SAN):
    Block-level access.
    Write to storage is immediate.
    Specialized hardware, including switches, host bus adapters, disk chassis, battery-backed caches, etc.
    Expensive.
    Supports transaction processing systems.
  Network Attached Storage (NAS):
    File-level access.
    Write to storage might be delayed.
    Generic hardware.
    Inexpensive.
    Not appropriate for transaction processing systems.
Concepts and Terminology
  Virtualization:
    Available storage is represented as one HUGE disk drive, e.g., a SAN with a thousand 1.5 TB disks provides on the order of a Petabyte of storage (1.5 PB of raw capacity).
    Available storage is partitioned into Logical Unit Numbers (LUNs).
    A LUN is presented to one or more servers.
    A LUN appears as a disk drive to a server.
  The SAN places blocks across physical disks intelligently to balance load.
What to do when a PC fails?
Shared-Nothing
  Each node (blade) consisted of one processor, memory, and a disk drive.
  [Figure: nodes CPU 1 ... CPU N connected by a network]
Shared-Nothing
  Each node (blade) may consist of one or several processors, memory, and one or several disk drives.
  [Figure: Node 1 ... Node M connected by a network; each node has CPU 1 ... CPU n and DRAM 1 ... DRAM D]
Shared-Nothing
  Partition resources to construct logical nodes. With an 8-CPU PC, construct eight logical nodes, each with a CPU, a fraction of memory, and one disk drive.
  [Figure: logical nodes CPU 1 ... CPU nM connected by a network]
Data Declustering
  Data is partitioned across the nodes (why?):
    Random/round-robin,
    Hash partitioning,
    Range partitioning.
  Each piece of a table is termed a fragment.
  Single-attribute declustering strategies.
  Two multi-attribute declustering strategies:
    1. Multi-Attribute GrId deClustering (MAGIC)
    2. Bubba’s Extended Range Declustering (BERD)
Horizontal Declustering
  [Figure: logical view and physical view of the Emp table]

  Emp
    name    age  salary
    Bob     20   10K
    Shideh  18   35K
    Ted     50   60K
    Kevin   62   120K
    Angela  55   140K
    Mike    45   90K
Horizontal Declustering
  No partitioning attribute: Random and Round-robin.
  Single-attribute declustering strategies:
    Hash,
    Range.
  Note: the database administrator must choose one attribute as the partitioning attribute.
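A minimal sketch of round-robin declustering, which needs no partitioning attribute. The function name and the choice of three nodes are assumptions made for illustration; the rows are the Emp rows from the slides, with salary in thousands.

```python
from itertools import cycle

# (name, age, salary in thousands), taken from the Emp table on the slides.
EMP = [
    ("Bob", 20, 10), ("Shideh", 18, 35), ("Ted", 50, 60),
    ("Kevin", 62, 120), ("Angela", 55, 140), ("Mike", 45, 90),
]

def round_robin_decluster(rows, num_nodes):
    """Hand rows to the nodes in turn; no attribute value is consulted."""
    fragments = [[] for _ in range(num_nodes)]
    for node, row in zip(cycle(range(num_nodes)), rows):
        fragments[node].append(row)
    return fragments

fragments = round_robin_decluster(EMP, 3)
for node, fragment in enumerate(fragments):
    print(node, [name for name, _, _ in fragment])
```

Because placement ignores attribute values, every selection has to be sent to all nodes, which is the tradeoff that hash and range declustering address.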
Hash Declustering
  salary is the partitioning attribute; the hash function is salary % 3.

  Emp (logical view)
    name    age  salary
    Bob     20   10K
    Shideh  18   35K
    Ted     50   60K
    Kevin   62   120K
    Angela  55   140K
    Mike    45   90K

  Physical view (three fragments, in the order drawn on the slide)
    Fragment  name    age  salary
    first     Ted     50   60K
    first     Kevin   62   120K
    second    Bob     20   10K
    second    Mike    45   90K
    third     Shideh  18   35K
    third     Angela  55   140K
Hash Declustering
  Selections with equality predicates referencing the partitioning attribute are directed to a single node:
    Retrieve Emp where salary = 60K
      SELECT * FROM Emp WHERE salary = 60K
  Equality predicates referencing a non-partitioning attribute and range predicates are directed to all nodes:
    Retrieve Emp where age = 20
    Retrieve Emp where salary < 20K
      SELECT * FROM Emp WHERE salary < 20K
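The routing rule above can be sketched as follows. This is an illustration only: the function names are hypothetical, and salaries are assumed to be expressed in thousands when salary % 3 is applied. Taking the modulo literally, Mike's 90K row hashes to the same node as Ted and Kevin, so the fragments computed here need not match the slide's drawing row for row.

```python
# (name, age, salary in thousands), taken from the Emp table on the slides.
EMP = [
    ("Bob", 20, 10), ("Shideh", 18, 35), ("Ted", 50, 60),
    ("Kevin", 62, 120), ("Angela", 55, 140), ("Mike", 45, 90),
]
NUM_NODES = 3

def hash_node(salary_k):
    """Hash function from the slide: salary % 3 (units of thousands assumed)."""
    return salary_k % NUM_NODES

def hash_decluster(rows):
    """Place each row on the node chosen by hashing its salary."""
    fragments = {node: [] for node in range(NUM_NODES)}
    for row in rows:
        fragments[hash_node(row[2])].append(row)
    return fragments

def nodes_for_selection(attribute, operator, value):
    """Only an equality predicate on the partitioning attribute is single-node."""
    if attribute == "salary" and operator == "=":
        return [hash_node(value)]
    return list(range(NUM_NODES))

fragments = hash_decluster(EMP)
print(nodes_for_selection("salary", "=", 60))  # one node
print(nodes_for_selection("age", "=", 20))     # all nodes
print(nodes_for_selection("salary", "<", 20))  # all nodes
```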
Range Declustering
  salary is the partitioning attribute.

  Emp (logical view)
    name    age  salary
    Bob     20   10K
    Shideh  18   35K
    Ted     50   60K
    Kevin   62   120K
    Angela  55   140K
    Mike    45   90K

  Physical view (one fragment per salary range)
    Range      name    age  salary
    0-50K      Bob     20   10K
    0-50K      Shideh  18   35K
    51K-100K   Ted     50   60K
    51K-100K   Mike    45   90K
    101K-∞     Kevin   62   120K
    101K-∞     Angela  55   140K
Range Declustering
  Equality and range predicates referencing the partitioning attribute are directed to a subset of nodes:
    Retrieve Emp where salary = 60K
    Retrieve Emp where salary < 20K
    In our example, both queries are directed to one node.
  Predicates referencing a non-partitioning attribute are directed to all nodes.
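A small sketch of how a range-partitioned relation might route the two example predicates. The boundary list mirrors the slide's ranges (0-50K, 51K-100K, 101K-∞); the function names and the zero-based node numbering are assumptions.

```python
from bisect import bisect_left

# Upper bounds (in thousands) of every range except the last, open-ended one.
UPPER_BOUNDS = [50, 100]

def range_node(salary_k):
    """Index of the range (and node) that stores a given salary."""
    return bisect_left(UPPER_BOUNDS, salary_k)

def nodes_for_salary_interval(low_k, high_k):
    """Nodes whose range overlaps the closed interval [low_k, high_k]."""
    return list(range(range_node(low_k), range_node(high_k) + 1))

print([range_node(60)])                  # salary = 60K
print(nodes_for_salary_interval(0, 19))  # salary < 20K
```

Both calls return a single node, in line with the slide's remark; a predicate on age carries no information about salary ranges and would still go to every node.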
An iPSC/2 Intel Hypercube
  Year is 1988!
  32-processor hypercube.
  Each node consists of:
    80386 processor (12 MHz),
    2 MB DRAM,
    333 MB disk.
  A hypercube interconnect supporting parallel transmission of messages among nodes.
Software Architecture
  Each node stores its fragment on its local disk drive.
  Each node may build a B+-tree (clustered/non-clustered) and hash index on its fragment of a relation.
  Each node has its own concurrency control and crash recovery mechanism.
[Figures: Software Architecture diagrams]
Software Architecture
  Processes executing on one node shared memory, identical to today’s threads!
  At initialization time, a node would start a fixed number of threads (processes).
  All threads listen on a well-defined socket, waiting for the Scheduler to dispatch work to them.
  A message contains the identity that the operator should assume:
    A “switch” statement would enable a thread to become a select, project, hash-join build, hash-join probe, etc.
  The message specifies the role of the thread.
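The dispatch idea can be sketched with modern threads. This is not Gamma's code: an in-process queue stands in for the socket, a dictionary lookup stands in for the "switch" statement, and the operator set, message format, and function names are all assumptions.

```python
import queue
import threading

def select(rows, predicate):
    """Keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]

def project(rows, columns):
    """Keep only the listed column positions of every row."""
    return [tuple(row[i] for i in columns) for row in rows]

# Plays the part of the "switch": the role named in a message picks the operator.
OPERATORS = {"select": select, "project": project}

def worker(inbox, outbox, fragment):
    """Wait for messages; each one tells this thread which operator to become."""
    while True:
        message = inbox.get()
        if message is None:  # shutdown signal from the scheduler
            break
        role, argument = message
        outbox.put((role, OPERATORS[role](fragment, argument)))

def run(messages, fragment, num_threads=2):
    """Start a fixed pool of threads, dispatch the messages, gather the results."""
    inbox, outbox = queue.Queue(), queue.Queue()
    threads = [threading.Thread(target=worker, args=(inbox, outbox, fragment))
               for _ in range(num_threads)]
    for thread in threads:
        thread.start()
    for message in messages:
        inbox.put(message)
    for _ in threads:
        inbox.put(None)
    for thread in threads:
        thread.join()
    results = {}
    while not outbox.empty():
        role, rows = outbox.get()
        results[role] = rows
    return results

FRAGMENT = [("Ted", 50, 60), ("Mike", 45, 90)]
results = run([("select", lambda row: row[1] > 45), ("project", (0,))], FRAGMENT)
print(results)
```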
A Comparison of Range & Hash
  Closed simulation model of a 32-node Gamma:
    A client generates a range selection predicate: X < age < Y.
    The age attribute value is unique, with values ranging from 0 to 999,999 (1 million rows).
    A client does not generate a new request until its pending request is processed by Gamma and returned.
    The system is multi-programmed by increasing the number of clients in the system.
    A multi-programming level of 8 means there are 8 clients generating requests to the system (independent of one another).
  A 0.01% selection predicate retrieves 100 rows.
    With a clustered B+-tree index, the 100 rows are grouped together in a few disk pages.
    With range partitioning, the predicate is processed by one node.
    With hash partitioning, the predicate is processed by all 32 nodes, with the scheduler coordinating the execution of the predicate on each node and gathering the results from every node.
  [Figure: 32-node Gamma, range partitioned on age: 0 - 31,249; 31,250 - 62,499; ...; 968,750 - 999,999]
Declustering Techniques: Tradeoffs
  Range selection predicate using a clustered B+-tree, 0.01% selectivity (10 records)
  [Figure: throughput (queries/second) versus multiprogramming level; curves: Range, Hash/Random/Round-robin]
A Comparison of Range & Hash
  Closed simulation model of a 32-node Gamma (range predicate X < age < Y; unique age values from 0 to 999,999; a client issues a new request only after its pending request returns):
    A 1% selection predicate retrieves 10,000 rows.
    With a clustered B+-tree index, the 10,000 rows are grouped together.
    With range partitioning, the predicate is processed using one or two nodes.
    With hash partitioning, the predicate is processed by all the nodes, with the scheduler coordinating the execution of the predicate.
  [Figure: 32-node Gamma, range partitioned on age: 0 - 31,249; 31,250 - 62,499; ...; 968,750 - 999,999]
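To see why a range predicate touches one or two nodes under range partitioning, the sketch below counts the fragments overlapped by a run of consecutive age values. It assumes 32 equal ranges of 31,250 values each, as in the figure; the function name is hypothetical.

```python
ROWS = 1_000_000
NODES = 32
FRAGMENT_SIZE = ROWS // NODES  # 31,250 consecutive age values per node

def nodes_touched(first_age, num_rows):
    """Range fragments overlapped by num_rows consecutive ages starting at first_age."""
    last_age = first_age + num_rows - 1
    return last_age // FRAGMENT_SIZE - first_age // FRAGMENT_SIZE + 1

# A 0.01% predicate (100 rows) usually stays inside one fragment,
# unless it happens to straddle a range boundary.
print(nodes_touched(0, 100), nodes_touched(31_200, 100))

# A 1% predicate (10,000 rows) is shorter than a fragment, so it never
# touches more than two nodes.
print(max(nodes_touched(start, 10_000) for start in range(0, ROWS - 10_000 + 1, 50)))

# With hash partitioning the qualifying rows are scattered, so all nodes participate.
NODES_TOUCHED_WITH_HASH = NODES
```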
Tradeoffs (Cont…)
  Range selection predicate using a clustered B+-tree, 1% selectivity (1000 records)
  [Figure: throughput (queries/second) versus multiprogramming level; curves: Range, Hash/Random/Round-robin]
Why Does Range Perform Poorly?
  Note: Range performed poorly because the query (1% selection) imposed a high workload onto a node!
    For a query with a minimal (0.01% selection) workload requirement, Range is ideal!
  Two reasons:
    Random generation of selection predicates does NOT mean uniform distribution of workload across nodes.
    The number of ranges is the same as the number of nodes, causing the tail-end servers to observe a lower load.
  [Figure: the 27 ways to assign 3 requests (R1, R2, R3) to the 3 nodes]
    6 ideal cases place one request on each node, e.g., R1 | R2 | R3.
    21 cases are not uniform, e.g., {R1, R3} | R2; 3 of them place {R1, R2, R3} on a single node.
  27 ways to assign 3 requests to the 3 nodes! Only 6 result in a uniform distribution of requests.
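The count of 27 assignments can be checked by brute force. The sketch below enumerates every way to place 3 requests on 3 nodes and groups the outcomes by the heaviest node load; the function name is hypothetical.

```python
from collections import Counter
from itertools import product

def classify_assignments(num_requests=3, num_nodes=3):
    """Count assignments by the largest number of requests landing on one node."""
    counts = Counter()
    for assignment in product(range(num_nodes), repeat=num_requests):
        heaviest = max(Counter(assignment).values())
        counts[heaviest] += 1
    return counts

counts = classify_assignments()
print(sum(counts.values()))  # total number of assignments
print(counts[1])             # uniform: one request per node
print(counts[2], counts[3])  # two on one node; all three on one node
```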
Tradeoffs (Cont…)
  Simple range partitioning may lead to load imbalance for queries with high selectivity:
    Low performance: increased response time and low system throughput.
  Consider a table that maintains the grade of students for different exams, range partitioned on the grade.
  Ranges (one per node): 0-19 | 20-39 | 40-59 | 60-79 | 80-100
Tradeoffs (Cont…)
  Assume a range predicate overlaps 3 partitions, e.g.,
    0 < grade < 45
    45 < grade < 90
  Ranges (one per node): 0-19 | 20-39 | 40-59 | 60-79 | 80-100
Tradeoffs (Cont…)
  Higher response time because 2 nodes sit idle while 3 nodes process the query (assuming overhead of parallelism is negligible).
  Example: 45 < grade < 90 over the ranges 0-19 | 20-39 | 40-59 | 60-79 | 80-100
Tradeoffs (Cont…)
  Lower throughput because node 3 becomes a bottleneck.
    Assuming even distribution of access to ranges, when node 3 is utilized 100%, nodes 2 and 4 have a 66% utilization, while nodes 1 and 5 are utilized 33%.
  Ranges (one per node): 0-19 | 20-39 | 40-59 | 60-79 | 80-100
Hybrid Range Partitioning [VLDB’90]
  To minimize the impact of load imbalance, construct more ranges than nodes, e.g., 10 ranges for a 5-node system.
  Predicates such as “0 < grade < 45” are now directed to all nodes.
  Assuming even distribution of access to ranges, where the workload consists of predicates utilizing 3 sequential ranges, when node 3 becomes 100% utilized, nodes 2 and 4 are now utilized 83%, while nodes 1 and 5 are utilized 66%.

    Node  Ranges
    1     0-10, 51-60
    2     11-20, 61-70
    3     21-30, 71-80
    4     31-40, 81-90
    5     41-50, 91-100
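The utilization figures for simple and hybrid range partitioning can be reproduced with a short count. The sketch assumes every predicate covers 3 consecutive ranges, every starting range is equally likely, and range i is stored on node i mod 5 (nodes numbered from 0 here, from 1 on the slides); the function name is hypothetical.

```python
from collections import Counter

def relative_utilization(num_ranges, num_nodes, window=3):
    """Load of each node relative to the busiest node."""
    hits = Counter()
    for start in range(num_ranges - window + 1):
        for range_id in range(start, start + window):
            hits[range_id % num_nodes] += 1
    busiest = max(hits.values())
    return [hits[node] / busiest for node in range(num_nodes)]

simple = relative_utilization(num_ranges=5, num_nodes=5)
hybrid = relative_utilization(num_ranges=10, num_nodes=5)
print([round(u, 2) for u in simple])
print([round(u, 2) for u in hybrid])
```

The slides truncate 2/3 to 66%; with rounding the same values print as 0.67. Doubling the number of ranges raises the least-loaded nodes from one third to two thirds of the busiest node's load.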
Multi-Attribute Declustering [SIGMOD’92]
  Queries with minimal resource requirements should be directed to a few processors. Why? Overhead of parallelism:
    1. Impacts query response time adversely,
    2. Wastes system resources, reducing throughput.
  OLTP has come a long way:
    Heaviest transaction in TPC-C reads approximately 400 records.
    Assuming no disk accesses, a low-end PC processes this transaction in < 1 ms.
    Transactions should be single-sited!
  [Figure labels: Range, Round-robin]
Multi-Attribute Declustering (E.g.)
  Recall the Emp(name, age, salary) table.
  Workload consists of two queries, each with a 50% frequency of occurrence:
    Query A, range query referencing the age attribute. On average, retrieves 5 tuples.
      Retrieve Emp where age > 21 and age < 22.
    Query B, range query referencing the salary attribute. On average, retrieves 10 tuples.
      Retrieve Emp where salary > 50K and salary < 50.5K
  Access methods:
    A non-clustered B+-tree index on age
    A clustered B+-tree index on salary
  Ideally, both queries should be directed to one node.
Multi-Attribute Declustering (E.g. Cont...)
  Range decluster Emp using age as the partitioning attribute.
  Assuming a system configured with nine nodes, the number of employed nodes is:

             Range     Ideal
    A        50% * 1   50% * 1
    B        50% * 9   50% * 1
    Average  5         1
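The averages in the table are frequency-weighted sums. A minimal sketch of that arithmetic, with a hypothetical helper name:

```python
def average_nodes(workload):
    """workload: (frequency, nodes employed) pairs; returns the expected nodes per query."""
    return sum(frequency * nodes for frequency, nodes in workload)

range_on_age = [(0.5, 1), (0.5, 9)]  # Query A uses 1 node, Query B uses all 9
ideal = [(0.5, 1), (0.5, 1)]         # both queries use a single node

print(average_nodes(range_on_age), average_nodes(ideal))
```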
MAGIC
  Construct a multi-attribute grid directory on the Emp table.
  Each dimension corresponds to a partitioning attribute.
  Each cell represents a fragment of the relation.

  Grid directory (each entry is the node that stores the cell):

                   Salary
    Age      0-20  21-25  26-30  31-35  36-40  41-70
    10-20     1     1      4      4      7      7
    21-25     1     1      4      4      7      7
    26-30     2     2      5      5      8      8
    31-35     2     2      5      5      8      8
    36-40     3     3      6      6      0      0
    41-60     3     3      6      6      0      0
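A sketch of how a grid directory limits the nodes a single-attribute query visits. The grid values are the ones on the slide; the function name, the axis convention (axis 0 is a row slice, axis 1 a column slice), and the use of diagonal-only cells to imitate perfectly correlated attributes are assumptions made for illustration.

```python
# Node assignment of each grid cell, copied from the slide.
GRID = [
    [1, 1, 4, 4, 7, 7],
    [1, 1, 4, 4, 7, 7],
    [2, 2, 5, 5, 8, 8],
    [2, 2, 5, 5, 8, 8],
    [3, 3, 6, 6, 0, 0],
    [3, 3, 6, 6, 0, 0],
]

def nodes_for_slice(grid, index, axis, populated=None):
    """Nodes holding the cells of one row (axis=0) or one column (axis=1).

    populated, if given, is the set of (row, column) cells that contain tuples;
    empty cells need not be visited.
    """
    if axis == 0:
        cells = [(index, col) for col in range(len(grid[0]))]
    else:
        cells = [(row, index) for row in range(len(grid))]
    if populated is not None:
        cells = [cell for cell in cells if cell in populated]
    return sorted({grid[row][col] for row, col in cells})

# Any single row or column slice maps to 3 of the 9 nodes.
print(nodes_for_slice(GRID, 0, axis=0), nodes_for_slice(GRID, 0, axis=1))

# If tuples fall only on the diagonal, each slice maps to one node.
diagonal = {(i, i) for i in range(6)}
print(nodes_for_slice(GRID, 2, axis=0, populated=diagonal))
```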
MAGIC (Low Correlation)
  Low correlation between salary and age attribute values:
  [Figure: the grid directory with the tuples plotted as points]

             MAGIC     Range     Ideal
    A        50% * 3   50% * 1   50% * 1
    B        50% * 3   50% * 9   50% * 1
    Avg      3         5         1
MAGIC (High Correlation)
  High correlation between salary and age attribute values:
  [Figure: the grid directory with the tuples plotted as points]

             MAGIC     Range     Ideal
    A        50% * 1   50% * 1   50% * 1
    B        50% * 1   50% * 9   50% * 1
    Avg      1         5         1
BERD
  Range partition Emp using the salary attribute.
  For the age attribute, construct an auxiliary relation containing:
    1. The age attribute value of each record
    2. Node containing that record
  Range partition the auxiliary relation using the age attribute value.
BERD
  salary is the primary partitioning attribute.

  Physical view of Emp (one fragment per salary range)
    Range      name    age  salary
    0-50K      Bob     20   10K
    0-50K      Shideh  18   35K
    51K-100K   Ted     50   60K
    51K-100K   Mike    45   90K
    101K-∞     Kevin   62   120K
    101K-∞     Angela  55   140K
BERD, Auxiliary relation
  The auxiliary relation holds, for each Emp record, its age value and the node containing that record:

    age  node
    20   0
    18   0
    50   1
    45   1
    62   2
    55   2

  Range partition the auxiliary relation using the age attribute:

    Aux.age range  (age, node) entries
    0-20           (20, 0), (18, 0)
    21-52          (50, 1), (45, 1)
    53-∞           (62, 2), (55, 2)

  [Figure: the three Aux.age fragments (0-20, 21-52, 53-∞) shown alongside the three Salary fragments of Emp (0-50K, 51K-100K, 101K-∞)]
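A sketch of the two-step lookup that the auxiliary relation enables for a predicate on age. The data is copied from the slides (salary in thousands); the function name and the dictionary layout are assumptions, and the sketch does not model which node stores each auxiliary fragment.

```python
from bisect import bisect_left

# Emp fragments, range partitioned on salary: 0-50K, 51K-100K, 101K-infinity.
PRIMARY = {
    0: [("Bob", 20, 10), ("Shideh", 18, 35)],
    1: [("Ted", 50, 60), ("Mike", 45, 90)],
    2: [("Kevin", 62, 120), ("Angela", 55, 140)],
}

# Auxiliary (age, node) entries, range partitioned on age: 0-20, 21-52, 53-infinity.
AUX_UPPER_BOUNDS = [20, 52]
AUXILIARY = {
    0: [(20, 0), (18, 0)],
    1: [(50, 1), (45, 1)],
    2: [(62, 2), (55, 2)],
}

def select_by_age(low, high):
    """Return (data nodes visited, matching rows) for low <= age <= high."""
    first = bisect_left(AUX_UPPER_BOUNDS, low)
    last = bisect_left(AUX_UPPER_BOUNDS, high)
    # Step 1: consult only the auxiliary fragments whose age range overlaps the predicate.
    data_nodes = {
        node
        for fragment in range(first, last + 1)
        for age, node in AUXILIARY[fragment]
        if low <= age <= high
    }
    # Step 2: visit only the data nodes named by the auxiliary entries.
    rows = [row for node in sorted(data_nodes)
            for row in PRIMARY[node] if low <= row[1] <= high]
    return sorted(data_nodes), rows

print(select_by_age(45, 50))
print(select_by_age(50, 55))
```

The extra hop through the auxiliary relation is the cost of this scheme; the data nodes visited afterwards depend on how age and salary values are correlated.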
BERD (Cont…)
  High correlation between age and salary attribute values:

             BERD      Range     Ideal
    A        50% * 1   50% * 1   50% * 1
    B        50% * 1   50% * 9   50% * 1
    Avg      1         5         1
BERD (Cont…)
  Low correlation between age and salary attribute values:

             BERD      Range     Ideal
    A        50% * 1   50% * 1   50% * 1
    B        50% * 9   50% * 9   50% * 1
    Avg      5         5         1

  Is it possible to avoid lookup in the auxiliary table?
Experimental environment
  Verified simulation model of the Gamma database machine.
  A 32-processor system.
  Database consists of a 100,000-tuple table based on the Wisconsin Benchmark.
Experimental Design
  Correlation between partitioning attribute values: Low, High
  Workload characteristics (A, B): (Low, Low), (Low, Moderate), (Moderate, Low), (Moderate, Moderate)
  Multiprogramming level
[Figures: throughput (queries/second) versus multiprogramming level, one plot per configuration]
  Low-Low Query Mix (Low Correlation)
  Low-Low Query Mix (High Correlation)
  Low-Moderate Mix (Low Correlation)
  Low-Moderate Mix (High Correlation)
  Moderate-Moderate Mix (Low Correlation)
  Moderate-Moderate Mix (High Correlation)
Advantages of MAGIC
  Provides superior performance when compared to BERD and Range.
  Constructs the grid directory using the workload of the relation. Changes the shape of the grid directory in order to compensate for the different frequencies of access to the partitioning attributes.
  Minimizes the overhead of parallelism.
  Supports partial declustering of a relation in large systems.
Summary
  Given the fast speed of CPUs, each query/transaction should ideally be processed by one node.
Parallelism versus Efficient Servers
  Even if all queries and transactions become single-sited, parallelism is no substitute for smart algorithms that make a single server efficient.
  Why?
Why?
  Assume a single server that can process one request per second.
  Two choices:
    1. Extend it with Flash and obtain a throughput of 3 requests per second.
    2. Buy two additional servers and partition the data across the 3 servers.
  Given 3 simultaneous requests issued to each alternative:
    The single-processor system will process 3 requests per second.
    The 3-node system may not provide a throughput of 3 requests per second.
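One way to quantify the gap is to average the time the 3-node system needs to finish the batch. The sketch assumes each request independently targets one of the 3 nodes uniformly at random and that every node serves one request per second; both assumptions and the function name are illustrative, not taken from the slides.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def expected_completion_time(num_requests=3, num_nodes=3):
    """Expected seconds until the busiest node has served all of its requests."""
    assignments = list(product(range(num_nodes), repeat=num_requests))
    total = Fraction(0)
    for assignment in assignments:
        total += max(Counter(assignment).values())
    return total / len(assignments)

seconds = expected_completion_time()
print(seconds, float(seconds))
```

Under these assumptions the batch takes 17/9 seconds on average, against 1 second for the single server extended with Flash, because only the uniform assignments keep all three nodes busy.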
  [Figure: the 27 ways to assign 3 requests (R1, R2, R3) to the 3 nodes; 6 ideal cases place one request on each node, the other 21 do not]
  27 ways to assign 3 requests to the 3 nodes!
Brain Teaser
  Given N servers and M requests, compute the probability of:
    M/N requests per node.
  Number of ways M requests may map onto N servers and the probability of each scenario.
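One possible line of attack, offered as a sketch rather than the intended answer: if each request independently picks one of the N servers uniformly at random (an assumption), there are N^M equally likely mappings, and the perfectly even ones are counted by a multinomial coefficient. The function name is hypothetical.

```python
from fractions import Fraction
from math import factorial

def uniform_probability(num_requests, num_nodes):
    """Probability that every node receives exactly M/N requests."""
    if num_requests % num_nodes != 0:
        return Fraction(0)  # an even split is impossible
    per_node = num_requests // num_nodes
    # Multinomial coefficient: M! / ((M/N)!)^N even mappings out of N^M.
    favorable = factorial(num_requests) // factorial(per_node) ** num_nodes
    return Fraction(favorable, num_nodes ** num_requests)

print(uniform_probability(3, 3))  # the 3-request, 3-node case from the slides
print(uniform_probability(4, 2))
```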
  Reward for correct answer: