inspector joins

Inspector Joins IC-65 Advances in Data Management Systems 1

Inspector Joins

By Shimin Chen, Anastassia Ailamaki, Phillip, and Todd C. Mowry

VLDB 2005

Rammohan Narendula


Introduction

Query execution isI/O bound- so most of theresearch concentrateson main memory Goal- reduce no. of pagefaults thus reduce no. of disk I/Os

However, hash join is a special class of techniqueswhere hash-join becomesCPU bound given sufficientI/O bandwidth and employingAdvanced I/O techniques (I/O prefetching)Goal- reduce no. of cache misses


Exploiting Information about Data

• Ability to improve query depends on information quality• General stats on relations are inadequate

– May lead to incorrect decisions for specific queries

– Especially true for join queries

• Previous approaches exploiting dynamic information– Collecting information from previous queries

• Multi-query optimization [Sellis’88]

• Materialized views [Blakeley et al. 86]

• Join indices [Valduriez’87]

– Dynamic re-optimization of query plans [Kabra&DeWitt’98] [Markl et al. 04]

This study exploits the inner structure of hash joins


Exploiting Multi-Pass Structure of Hash Joins

• Idea: – Examine the actual data in I/O partitioning phase

– Extract useful information to improve join phase

I/O Partitioning Join

Extra information greatly helps phase 2

Inspection


Using Extracted Information

• Enable a new join phase algorithm – Reduce the primary performance bottleneck in hash joins

i.e. Poor CPU cache performance– Optimized for multi-processor systems

• Choose the most suitable join phase algorithm for special input cases

I/O Partitioning

decide Cache

PartitioningCache Prefetching

Simple Hash JoinInspection

Join Phase

New AlgorithmExtracted Information


Outline

• Motivation• Previous hash join algorithms• Hash join performance on SMP systems• Inspector join• Experimental results• Conclusions


Hash Table

• Join Phase: (simple hash join)– Build hash table, then probe hash table

GRACE Hash Join• I/O Partitioning Phase:

– Divide input relations into partitions with a hash function

Build Probe

Build Probe

• Random memory accesses cause poor CPU cache performance

Over 70% execution time

stalled on cache misses!


Cache Partitioning• Recursively produce cache-sized partitions after I/O partitioning

• Avoid cache misses when joining cache-sized partitions• Overhead of re-partitioning

BuildProbeMemory-sized

PartitionsCache-sized

Partitions


Cache Prefetching• Reduce impact of cache misses

– Exploit available memory bandwidth– Overlap cache misses and computations– Insert cache prefetch instructions into code

• Still incurs the same number of cache misses

Hash Table

ProbeBuild


Outline



Hash Joins on SMP Systems• Previous studies mainly focus on uni-processors

• Memory bandwidth is precious– It becomes the bottleneck in cache-prefetching techniques

• Each processor joins a pair of partitions in join phase

Main Memory

Shared bus

Cache

CPU

Cache

CPU

Cache

CPU

Cache

CPU

Build1

Probe1

Build4

Probe4

Build2

Probe2

Build3

Probe3


Inspector Joins • Extracted information: summary of matching relationships

– Every K contiguous pages in a build partition forms a sub-partition

– Tells which sub-partition(s) every probe tuple matches

Build Partition

Sub-partition 0

Sub-partition 1

Sub-partition 2

Probe Partition


Summary of Matching

Relationship


Cache-Stationary Join Phase

• Recall cache partitioning: re-partition cost


Build PartitionProbe Partition

Hash TableCPU

Cache

• We want to achieve zero copying

Copying cost

Copying cost


Cache-Stationary Join Phase

• Joins a sub-partition and its matching probe tuples• Sub-partition is small enough to fit in CPU cache• Cache prefetching for the remaining cache misses

• Zero copying for generating recursive cache-sized partitions


Build PartitionProbe Partition

Hash TableCPU

CacheSub-partition 0

Sub-partition 1

Sub-partition 2


Filters in I/O Partitioning

• How to extract the summary efficiently?• Extend filter scheme in commercial hash joins• Conventional single-filter scheme

– Represent all build join keys– Filter out probe tuples having no matches

Build Relation

Filter

Mem-sized

PartitionsConstruct Test


Probe Relation


Background: Bloom Filter• A bit vector

– A key is hashed d (e.g. d=3) times and represented by d bits

• Construct: for every build join key, set its 3 bits in vector• Test: given a probe join key, check if all its 3 bits are 1

– Discard the tuple if some bits are 0– May have false positives

0 0 0 1 1 1 0 0 0 1 1 0 0 1 0 0 0 0 0 1

Bit0=H0(key)

Bit1=H1(key)

Bit2=H2(key)

Filter


Multi-Filter Scheme• Single filter: a probe tuple entire build relation• Our goal: a probe tuple sub-partitions• Construct a filter for every sub-partition

• Replace a single large filter with multiple small filters

Single Filter

Build Relatio

n

Partition 0

Partition 1

Partition 2

Sub0,0Sub0,1Sub0,2

Sub1,0Sub1,1Sub1,2

Sub2,0Sub2,1Sub2,2

Multi-Filter



Testing Multi-FiltersWhen partitioning the probe relation

• Test a probe tuple against all the filters of a partition

• Tells which sub-partition(s) the tuple may have matches

• Store summary of matching relationships in partitions– This information is used to extract probe tuples in the order of partition IDs. A

special array is constructed using count sort technique for this purpose.

Probe Relation

Partition 0

Partition 1

Partition 2

Multi-Filter

Test



Cont’d…

• Extracting probe tuple information for every sub-partition using counting sort

– One array for each sub partition. Size of the array is number of matching probe tuples for that partition.

– The tuples are never visited or copied in the coutning sort.

• Joining pair of build and probe sub-partitions


Minimizing Cache Misses for Testing Filters

• Single filter scheme: – Compute 3 bit positions– Test 3 bits

• Multi-filter scheme: if there are S sub-partitions in a partition– Compute 3 bit positions– Test the same 3 bits for every filter, altogether 3*S bits

• May cause 3*S cache misses !

Test

Probe Relation

Partition 0

Partition 1

Partition 2

Multi-Filter

001

111

011S filters


Vertical Filters for Testing

• Bits at the same position are contiguous in memory• 3 cache misses instead of 3*S cache misses!

• Horizontal vertical conversion after partitioning build relation– Very small overhead in practice

Probe Relation

Partition 0

Partition 1

Partition 2

Test001

111

011

S filters

Contiguous in

memory



Outline



Experimental Setup• Relation schema: 4-byte join attribute + fixed length payload• No selection, no projection• 50MB memory per CPU available for the join phase• Same join algorithm run on every CPU joining different partitions

• Detailed cycle-by-cycle simulations– A shared-bus SMP system with 1.5GHz processors

– Memory hierarchy is based on Itanium 2 processor


Partition Phase Wall-Clock Time

• I/O partitioning can take advantage of multiple CPUs– Cut input relations into equal-sized chunks – Partition one chunk on every CPU– Concatenate outputs from all CPUs

• Enhanced cache partitioning: cache partitioning + advanced prefetching• Inspection incurs very small overhead

– Ratio of execution time with best algo- 0.88 to 0.94– Mainly computation cost of converting horizontal filters to vertical and testing

GRACECache prefetchingCache partitioningEnhanced cache partitioningInspector join

•500MB joins 2GB•100B tuples, 4B keys•50% probe tuples no matches

•A build matches 2 probe tuples

Number of CPUs used


Join Phase Aggregate Time

• Inspector join achieves significantly better performancewhen 8 or more CPUs are used

– Because of local optimization + catch prefetching– 1.7-2.1X speedups over cache prefetching

• Memory B/W becomes bottleneck when more no of processors are used– 1.6-2.0X speedups over enhanced cache partitioning



Number of CPUs used

GRACECache prefetchingCache partitioningEnhanced cache partitioningInspector join


Results on Choosing Suitable Join Phase

• Case #1: a large number of duplicate build join keys– Choose enhanced cache partitioning

– When a probe tuple on average matches 4 or more sub-partitions

• Case #2: nearly sorted input relations– Surprisingly: cache-stationary join is very good

I/O Partitioning

decide Cache

PartitioningCache Prefetching

Simple Hash JoinInspection

Join Phase

Cache StationaryExtracted Info


Conclusions• Exploit multi-pass structure for higher quality info about data• Achieve significantly better cache performance

– 1.6X speedups over previous cache-friendly algorithms

– When 8 or more CPUs are used

• Choose most suitable algorithms for special input cases• Idea may be applicable to other multi-pass algorithms


Thank You !


Previous Algorithms on SMP Systems

• Join phase performance of joining a 500MB and a 2GB relations (details later in the talk)

• Aggregate performance degrades dramatically over 4 CPUs

Reduce data movement (memory to memory, memory to cache)

Wall clock time Aggregate time on all CPUsGRACE

Cache partitioningCache prefetching

Number of CPUs used

Re-partition

cost

Number of CPUs used

Bandwidth-sharing


More Details in Paper• Moderate memory space requirement for filters• Summary information representation in intermediate partitions• Preprocessing for cache-stationary join phase• Prefetching for improving efficiency and robustness


Partition Phase Wall-Clock Time

• I/O partitioning can take advantage of multiple CPUs– Cut input relations into equal-sized chunks

– Partition one chunk on every CPU

– Concatenate outputs from all CPUs

• Inspection incurs very small overhead



Number of CPUs used

GRACECache prefetchingCache partitioningInspector join


Join Phase Aggregate Time

• Inspector join achieves significantly better performancewhen 8 or more CPUs are used– 1.7-2.1X speedups over cache prefetching

– 1.6-2.0X speedups over enhanced cache partitioning



Number of CPUs used

GRACECache prefetchingCache partitioningInspector join


CPU-Cache-Friendly Hash Joins• Recent studies focus on CPU cache performance

– I/O partitioning gives good I/O performance– Random memory accesses cause poor CPU cache performance

• Cache Partitioning [Shatdal et al. 94] [Boncz et al.’99] [Manegold et al.’00]– Recursively produce cache-sized partitions from memory-sized

partitions– Avoid cache misses during join phase– Pay re-partitioning cost

• Cache Prefetching [Chen et al. 04]– Exploit memory system parallelism– Use prefetches to overlap multiple cache misses and computations

Hash Table

ProbeBuild


Example Special Input Cases• Example case #1: a large number of duplicate build join keys

– Count the average number of sub-partitions a probe tuple matches

– Must check the tuple against all possible sub-partitions

– If too large, cache stationary join works poorly

• Example case #2: nearly sorted input relations– A merge-based join phase might be better?

Build Partition

Probe Partition

Sub-partition 0

Sub-partition 1

Sub-partition 2

A probe tuple


Varying Number of Duplicates per Build Join Key

• Join phase aggregate performance• Choose enhanced cache part

– When a probe tuple on average matches 4 or more sub-partitions


Nearly Sorted Cases

• Sort both input relations, then randomly move 0%-5% of tuples• Join phase aggregate performance• Surprisingly: cache-stationary join is very good

– Even better than merge join when over 1% tuples are out-of-order


Analyzing Nearly Sorted Case• Partitions are also nearly sorted• Probe tuples matching a sub-partition are almost contiguous• Similar memory behavior as merge join• No cost for sorting out-of-order tuples

Build Partition

Probe Partition

Sub-partition 0

Sub-partition 1

Sub-partition 2

A probe tuple

Nearly Sorted Nearly Sorted

inspector joins

Documents

cache partitioningrecursively

join phaseinspectionusing

hash functionjoin phase

join phase inspector

suitable join phase

new join phase algorithm

poor cpu cache performanceover

pair of partitions