@Carnegie Mellon Databases

Inspector Joins

Shimin Chen (1), Phillip B. Gibbons (2), Todd C. Mowry (1,2), Anastassia Ailamaki (1)

(1) Carnegie Mellon University
(2) Intel Research Pittsburgh


Exploiting Information about Data

The ability to improve a query depends on the quality of the available information

General statistics on relations are inadequate
• May lead to incorrect decisions for specific queries
• Especially true for join queries

Previous approaches exploiting dynamic information
• Collecting information from previous queries: multi-query optimization [Sellis’88], materialized views [Blakeley et al. 86], join indices [Valduriez’87]
• Dynamic re-optimization of query plans [Kabra&DeWitt’98] [Markl et al. 04]

This study exploits the inner structure of hash joins


Exploiting Multi-Pass Structure of Hash Joins

Idea:
• Examine the actual data in the I/O partitioning phase
• Extract useful information to improve the join phase

[Diagram: phase 1, I/O partitioning with inspection, followed by phase 2, join]

Extra information greatly helps phase 2


Using Extracted Information

Enable a new join phase algorithm
• Reduce the primary performance bottleneck in hash joins, i.e. poor CPU cache performance
• Optimized for multi-processor systems

Choose the most suitable join phase algorithm for special input cases

[Diagram: inspection during I/O partitioning yields extracted information, which is used to decide the join phase algorithm: simple hash join, cache partitioning, cache prefetching, or the new algorithm]


Outline

Motivation

Previous hash join algorithms

Hash join performance on SMP systems

Inspector join

Experimental results

Conclusions


GRACE Hash Join

I/O Partitioning Phase: divide input relations into partitions with a hash function

Join Phase (simple hash join): build hash table, then probe hash table

[Diagram: each build partition is loaded into a hash table, which is then probed with the corresponding probe partition]

Random memory accesses cause poor CPU cache performance
• Over 70% of execution time is stalled on cache misses!
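The two phases above can be sketched in a few lines of Python. This is only an illustrative in-memory model, not the implementation measured in the talk: tuples are assumed to be `(key, payload)` pairs, and the function names `partition`, `simple_hash_join`, and `grace_hash_join` are made up for this example.

```python
from collections import defaultdict

def partition(relation, num_partitions):
    """I/O partitioning phase: route each tuple by a hash of its join key."""
    parts = [[] for _ in range(num_partitions)]
    for tup in relation:
        parts[hash(tup[0]) % num_partitions].append(tup)
    return parts

def simple_hash_join(build_part, probe_part):
    """Join phase: build a hash table on one partition, then probe it."""
    table = defaultdict(list)
    for b in build_part:
        table[b[0]].append(b)
    out = []
    for p in probe_part:
        # Each lookup is a random memory access into the hash table.
        for b in table.get(p[0], ()):
            out.append((b, p))
    return out

def grace_hash_join(build, probe, num_partitions=4):
    """Partition both inputs with the same hash function, then join pairwise."""
    result = []
    for bp, pp in zip(partition(build, num_partitions),
                      partition(probe, num_partitions)):
        result.extend(simple_hash_join(bp, pp))
    return result
```

Because both relations use the same hash function, matching tuples always land in partitions with the same index, so each pair can be joined independently.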


Cache Partitioning [Shatdal et al. 94] [Boncz et al.’99] [Manegold et al.’00]

Recursively produce cache-sized partitions after I/O partitioning

Avoid cache misses when joining cache-sized partitions

Overhead of re-partitioning

[Diagram: memory-sized build and probe partitions are re-partitioned into cache-sized partitions]
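A rough sketch of the recursive re-partitioning idea follows. It is an assumption-laden model rather than any of the cited algorithms: cache capacity is expressed as a tuple count (`max_tuples`), a different hash is used at each level by mixing the level number into the key, and `max_level` is a hypothetical guard against keys that never split (for example, heavy duplicates).

```python
def cache_partition(build, probe, max_tuples, fanout=4, level=1, max_level=8):
    """Split (build, probe) until each build side holds at most max_tuples."""
    if len(build) <= max_tuples or level > max_level:
        return [(build, probe)]
    b_parts = [[] for _ in range(fanout)]
    p_parts = [[] for _ in range(fanout)]
    # Every tuple is copied once per level: this is the re-partitioning cost.
    for t in build:
        b_parts[hash((level, t[0])) % fanout].append(t)
    for t in probe:
        p_parts[hash((level, t[0])) % fanout].append(t)
    pairs = []
    for b, p in zip(b_parts, p_parts):
        pairs.extend(cache_partition(b, p, max_tuples, fanout,
                                     level + 1, max_level))
    return pairs
```

Each returned pair can then be joined with a simple hash join whose hash table is small enough to stay cache-resident, at the price of the extra copies made while splitting.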


Cache Prefetching [Chen et al. 04]

Reduce impact of cache misses
• Exploit available memory bandwidth
• Overlap cache misses and computations
• Insert cache prefetch instructions into code

Still incurs the same number of cache misses

[Diagram: build and probe accesses to the hash table]


Outline

Motivation

Previous hash join algorithms

Hash join performance on SMP systems

Inspector join

Experimental results

Conclusions


Hash Joins on SMP Systems

Previous studies mainly focus on uni-processors

Memory bandwidth is precious

Each processor joins a pair of partitions in join phase

[Diagram: four CPUs, each with its own cache, connected to main memory over a shared bus; the CPUs join the partition pairs Build1/Probe1, Build2/Probe2, Build3/Probe3, and Build4/Probe4]


Previous Algorithms on SMP Systems

Join phase performance when joining a 500MB relation with a 2GB relation (details later in the talk)

[Figure: two plots, wall-clock time and aggregate time on all CPUs, versus number of CPUs used, for GRACE, cache partitioning, and cache prefetching; annotated with re-partition cost and bandwidth sharing]

Aggregate performance degrades dramatically when more than 4 CPUs are used

Reduce data movement (memory to memory, memory to cache)


Inspector Joins

Extracted information: summary of matching relationships
• Every K contiguous pages in a build partition form a sub-partition
• The summary tells which sub-partition(s) every probe tuple matches

[Diagram: a build partition divided into sub-partitions 0, 1, and 2; a probe partition; the summary of matching relationships connects probe tuples to sub-partitions]


Cache-Stationary Join Phase

Recall cache partitioning: re-partition cost

We want to achieve zero copying

[Diagram: build partition and probe partition, each annotated with a copying cost, feeding the hash table held in the CPU cache]


Cache-Stationary Join Phase

Joins a sub-partition and its matching probe tuples

Sub-partition is small enough to fit in CPU cache

Cache prefetching for the remaining cache misses

Zero copying for generating recursive cache-sized partitions

[Diagram: sub-partitions 0, 1, and 2 of the build partition; the hash table of one sub-partition stays in the CPU cache while its matching tuples from the probe partition are joined]
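The cache-stationary join phase described above might be modeled as follows. This is a simplified sketch under stated assumptions: sub-partitions are measured in tuples (`sub_size`) rather than in K pages, `matches[i]` is an assumed representation of the summary (the sub-partition ids that probe tuple `i` may match, possibly including false positives), and grouping probe indices per sub-partition stands in for the preprocessing step the talk defers to the paper.

```python
from collections import defaultdict

def cache_stationary_join(build_part, probe_part, matches, sub_size):
    """Join each build sub-partition with only the probe tuples that may match it."""
    num_subs = (len(build_part) + sub_size - 1) // sub_size
    # Preprocessing: record probe tuple indices per sub-partition.
    # Only indices are stored; no build or probe tuple is copied.
    probe_ids = [[] for _ in range(num_subs)]
    for i, subs in enumerate(matches):
        for s in subs:
            probe_ids[s].append(i)

    out = []
    for s in range(num_subs):
        # The hash table covers one sub-partition, intended to fit in cache.
        table = defaultdict(list)
        start = s * sub_size
        for j in range(start, min(start + sub_size, len(build_part))):
            b = build_part[j]
            table[b[0]].append(b)
        for i in probe_ids[s]:
            p = probe_part[i]
            # False positives in the summary simply find no match here.
            for b in table.get(p[0], ()):
                out.append((b, p))
    return out
```

A probe tuple that matches several sub-partitions is visited once per sub-partition, which is why a large number of duplicate build keys makes this approach less attractive.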


Filters in I/O Partitioning

How to extract the summary efficiently?

Extend filter scheme in commercial hash joins

Conventional single-filter scheme
• Represent all build join keys
• Filter out probe tuples having no matches

[Diagram: the filter is constructed while the build relation is partitioned into memory-sized partitions, then tested while the probe relation is partitioned]


Background: Bloom Filter

A bit vector

A key is hashed d (e.g. d=3) times and represented by d bits

Construct: for every build join key, set its 3 bits in the vector

Test: given a probe join key, check if all its 3 bits are 1
• Discard the tuple if some bits are 0
• May have false positives

[Diagram: filter bit vector 0 0 0 1 1 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1, with Bit0=H0(key), Bit1=H1(key), Bit2=H2(key)]
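The construct and test operations can be sketched as a small Python class. The hash functions here are an assumption chosen for convenience (SHA-256 salted with the hash index, via the standard `hashlib` module); the talk does not say which hash functions H0, H1, H2 are used, and the class and method names are illustrative.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_bits, d=3):
        self.num_bits = num_bits
        self.d = d
        # One byte per bit keeps the sketch simple; a real filter packs bits.
        self.bits = bytearray(num_bits)

    def _positions(self, key):
        """Derive d bit positions for a key (assumed hash scheme)."""
        for i in range(self.d):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        """Construct: set the key's d bits."""
        for pos in self._positions(key):
            self.bits[pos] = 1

    def may_contain(self, key):
        """Test: True if all d bits are set (false positives are possible)."""
        return all(self.bits[pos] for pos in self._positions(key))
```

A key that was added always tests positive, so a probe tuple is only discarded when it definitely has no match.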


Multi-Filter Scheme

Single filter: a probe tuple → entire build relation

Our goal: a probe tuple → sub-partitions

Construct a filter for every sub-partition

Replace a single large filter with multiple small filters

[Diagram: a single filter covers the whole build relation; the multi-filter has one filter per sub-partition: Sub0,0, Sub0,1, Sub0,2 for Partition 0; Sub1,0, Sub1,1, Sub1,2 for Partition 1; Sub2,0, Sub2,1, Sub2,2 for Partition 2]



Testing Multi-Filters

When partitioning the probe relation
• Test a probe tuple against all the filters of a partition
• Tells which sub-partition(s) the tuple may have matches in

Store summary of matching relationships in partitions

[Diagram: each tuple of the probe relation is tested against the multi-filter of its partition (Partition 0, 1, or 2)]
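Constructing one filter per sub-partition and testing a probe key against all of them could look like the sketch below. It assumes sub-partitions of `sub_size` tuples instead of K pages, `(key, payload)` tuples, and a SHA-256 based hash scheme; `bit_positions`, `build_multi_filter`, and `matching_sub_partitions` are hypothetical names.

```python
import hashlib

def bit_positions(key, num_bits, d=3):
    """Compute d bit positions for a key (assumed hash scheme)."""
    return [
        int.from_bytes(hashlib.sha256(f"{i}:{key}".encode()).digest()[:8], "big")
        % num_bits
        for i in range(d)
    ]

def build_multi_filter(build_partition, sub_size, num_bits):
    """Construct one small Bloom filter per sub-partition."""
    num_subs = (len(build_partition) + sub_size - 1) // sub_size
    filters = [bytearray(num_bits) for _ in range(num_subs)]
    for idx, tup in enumerate(build_partition):
        sub_filter = filters[idx // sub_size]
        for pos in bit_positions(tup[0], num_bits):
            sub_filter[pos] = 1
    return filters

def matching_sub_partitions(filters, key):
    """Return ids of sub-partitions whose filter reports a possible match."""
    if not filters:
        return []
    # The same bit positions are tested in every filter of the partition.
    positions = bit_positions(key, len(filters[0]))
    return [s for s, f in enumerate(filters) if all(f[p] for p in positions)]
```

The list returned for each probe tuple is the per-tuple summary that would be stored alongside the tuple in its intermediate partition.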



Minimizing Cache Misses for Testing Filters

Single-filter scheme:
• Compute 3 bit positions
• Test 3 bits

Multi-filter scheme: if there are S sub-partitions in a partition
• Compute 3 bit positions
• Test the same 3 bits for every filter, altogether 3*S bits

May cause 3*S cache misses!

[Diagram: a probe tuple is tested against the S filters of its partition; example bit rows 001, 111, 011]


Vertical Filters for Testing

Bits at the same position are contiguous in memory

3 cache misses instead of 3*S cache misses!

Horizontal → vertical conversion after partitioning build relation
• Very small overhead in practice

[Diagram: the S filters (example bit rows 001, 111, 011) are stored so that bits at the same position across all filters are contiguous in memory]



More Details in Paper

Moderate memory space requirement for filters

Summary information representation in intermediate partitions

Preprocessing for cache-stationary join phase

Prefetching for improving efficiency and robustness


Outline

Motivation

Previous hash join algorithms

Hash join performance on SMP systems

Inspector join

Experimental results

Conclusions


Experimental Setup

Relation schema: 4-byte join attribute + fixed length payload

No selection, no projection

50MB memory per CPU available for the join phase

Same join algorithm run on every CPU joining different partitions

Detailed cycle-by-cycle simulations
• A shared-bus SMP system with 1.5GHz processors
• Memory hierarchy is based on Itanium 2 processor


Partition Phase Wall-Clock Time

Workload:
• 500MB joins 2GB
• 100B tuples, 4B keys
• 50% of probe tuples have no matches
• A build tuple matches 2 probe tuples

[Figure: partition phase wall-clock time versus number of CPUs used, for GRACE, cache prefetching, cache partitioning, enhanced cache partitioning, and inspector join]

Enhanced cache partitioning: cache partitioning + advanced prefetching

Inspection incurs very small overhead

I/O partitioning can take advantage of multiple CPUs
• Cut input relations into equal-sized chunks
• Partition one chunk on every CPU
• Concatenate outputs from all CPUs
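The chunk, partition, and concatenate steps for parallel I/O partitioning can be sketched with the standard `concurrent.futures` module. This only illustrates the structure: CPython threads do not run this CPU-bound work truly in parallel, worker threads stand in for CPUs, and the function names and `(key, payload)` tuple layout are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def partition_chunk(chunk, num_partitions):
    """Partition one chunk by a hash of the join key."""
    parts = [[] for _ in range(num_partitions)]
    for tup in chunk:
        parts[hash(tup[0]) % num_partitions].append(tup)
    return parts

def parallel_partition(relation, num_partitions, num_workers):
    # Cut the input into (nearly) equal-sized chunks, one per worker.
    chunk_size = max(1, (len(relation) + num_workers - 1) // num_workers)
    chunks = [relation[i:i + chunk_size]
              for i in range(0, len(relation), chunk_size)]
    # Partition every chunk independently.
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        per_chunk = list(pool.map(
            lambda c: partition_chunk(c, num_partitions), chunks))
    # Concatenate the outputs of all workers, partition by partition.
    merged = [[] for _ in range(num_partitions)]
    for parts in per_chunk:
        for p, tuples in enumerate(parts):
            merged[p].extend(tuples)
    return merged
```

Since every worker uses the same hash function, concatenating partition p from all workers gives the same contents as partitioning the whole relation on one CPU.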


Join Phase Aggregate Time

[Figure: join phase aggregate time versus number of CPUs used, for GRACE, cache prefetching, cache partitioning, enhanced cache partitioning, and inspector join]

Inspector join achieves significantly better performance when 8 or more CPUs are used
• 1.7-2.1X speedups over cache prefetching
• 1.6-2.0X speedups over enhanced cache partitioning


Results on Choosing Suitable Join Phase

Case #1: a large number of duplicate build join keys
• Choose enhanced cache partitioning when a probe tuple on average matches 4 or more sub-partitions

Case #2: nearly sorted input relations
• Surprisingly, cache-stationary join is very good

[Diagram: inspection during I/O partitioning yields extracted information, which is used to decide the join phase algorithm: simple hash join, cache partitioning, cache prefetching, or cache-stationary join]
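The decision rule for case #1 can be written as a short function over the extracted summary. This is a sketch of that single rule only: the threshold of 4 comes from the slide, while the `matches` representation (one collection of sub-partition ids per probe tuple) and the returned labels are assumptions made for this example.

```python
def choose_join_phase(matches, threshold=4.0):
    """Pick a join phase algorithm from the summary of matching relationships."""
    if not matches:
        return "cache-stationary"
    # Average number of sub-partitions that a probe tuple matches.
    avg = sum(len(m) for m in matches) / len(matches)
    if avg >= threshold:
        # Many duplicate build keys: each probe tuple would be revisited often.
        return "enhanced cache partitioning"
    return "cache-stationary"
```

For example, `choose_join_phase([[0], [1, 2], []])` averages one sub-partition per probe tuple and so keeps the cache-stationary join phase.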


Conclusions

Exploit multi-pass structure for higher quality info about data

Achieve significantly better cache performance
• 1.6X speedups over previous cache-friendly algorithms
• When 8 or more CPUs are used

Choose most suitable algorithms for special input cases

Idea may be applicable to other multi-pass algorithms


Thank You !


Partition Phase Wall-Clock Time

[Figure: partition phase wall-clock time versus number of CPUs used, for GRACE, cache prefetching, cache partitioning, and inspector join]

I/O partitioning can take advantage of multiple CPUs
• Cut input relations into equal-sized chunks
• Partition one chunk on every CPU
• Concatenate outputs from all CPUs

Inspection incurs very small overhead


Join Phase Aggregate Time

[Figure: join phase aggregate time versus number of CPUs used, for GRACE, cache prefetching, cache partitioning, and inspector join]

Inspector join achieves significantly better performance when 8 or more CPUs are used
• 1.7-2.1X speedups over cache prefetching
• 1.6-2.0X speedups over enhanced cache partitioning


CPU-Cache-Friendly Hash Joins

Recent studies focus on CPU cache performance
• I/O partitioning gives good I/O performance
• Random memory accesses cause poor CPU cache performance

Cache Partitioning [Shatdal et al. 94] [Boncz et al.’99] [Manegold et al.’00]
• Recursively produce cache-sized partitions from memory-sized partitions
• Avoid cache misses during join phase
• Pay re-partitioning cost

Cache Prefetching [Chen et al. 04]
• Exploit memory system parallelism
• Use prefetches to overlap multiple cache misses and computations

[Diagram: build and probe accesses to the hash table]


Example Special Input Cases

Example case #1: a large number of duplicate build join keys
• Count the average number of sub-partitions a probe tuple matches
• Must check the tuple against all possible sub-partitions
• If too large, cache-stationary join works poorly

Example case #2: nearly sorted input relations
• A merge-based join phase might be better?

[Diagram: a probe tuple in the probe partition matching sub-partitions 0, 1, and 2 of the build partition]


Varying Number of Duplicates per Build Join Key

Join phase aggregate performance

Choose enhanced cache partitioning when a probe tuple on average matches 4 or more sub-partitions


Nearly Sorted Cases

Sort both input relations, then randomly move 0%-5% of tuples

Join phase aggregate performance

Surprisingly, cache-stationary join is very good
• Even better than merge join when over 1% of tuples are out of order


Analyzing Nearly Sorted Case

Partitions are also nearly sorted

Probe tuples matching a sub-partition are almost contiguous

Memory behavior similar to merge join

No cost for sorting out-of-order tuples

[Diagram: a nearly sorted build partition (sub-partitions 0, 1, and 2) and a nearly sorted probe partition, with a probe tuple matched to its sub-partition]
