
Improving Hash Join Performance By Exploiting Intrinsic Data Skew

by Bryce Cutt

supervised by Dr. Ramon Lawrence

Introduction

• Databases are part of our lives
• Hash Join is a core database algorithm
  o Very I/O intensive for large databases
    Queries may take hours
  o Any performance improvement is significant
• Real datasets contain skew
  o Skew is when some values occur more frequently
  o Skew can greatly reduce hash join performance
• Skew traditionally considered a bad thing for join algorithms
  o Try to mitigate negative effects of skew
• Adapt hash join
  o No longer just mitigate
  o Use foreknowledge of skew
    Improve performance

Relational Model Definitions

Example Relations

Build Relation

Probe Relation

Part

Purchase

DHJ Algorithm Build Phase

Hash Function: modulo 5
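
As a minimal sketch of what "modulo 5" means for the build phase: each build tuple goes to the partition given by its join key modulo 5. The function name `partition_build` and the dictionary layout of a tuple are illustrative assumptions, not taken from the slides.

```python
def partition_build(tuples, key, num_partitions=5):
    """Split build tuples into partitions using key % num_partitions."""
    partitions = [[] for _ in range(num_partitions)]
    for t in tuples:
        partitions[t[key] % num_partitions].append(t)
    return partitions


# Hypothetical Part relation with partid 1..7.
parts = [{"partid": i} for i in range(1, 8)]
partitions = partition_build(parts, "partid")
# partid 2 and partid 7 share partition 2 because 7 % 5 == 2.
```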

DHJ Algorithm Build Phase, cont.

Probe Relation

DHJ Algorithm Probe Phase

DHJ Algorithm Probe Phase, cont.

DHJ Algorithm Cleanup Phase

DHJ Algorithm Cleanup Phase, cont.
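
The three DHJ phases can be sketched roughly as follows. This is a toy model under stated assumptions, not the implementation from the talk: "disk" is simulated with Python lists, `memory_limit` counts tuples rather than bytes, and the largest in-memory partition is the one frozen when memory overflows. All names (`dhj_join`, `memory_limit`, `purchaseid`) are hypothetical.

```python
def dhj_join(build, probe, key, num_partitions=5, memory_limit=4):
    """Toy dynamic hash join; returns (build_tuple, probe_tuple) pairs."""
    in_memory = {p: [] for p in range(num_partitions)}  # resident build partitions
    frozen_build = {}                                   # partitions flushed to "disk"
    resident = 0                                        # build tuples held in memory

    # Build phase: partition the build relation, freezing partitions on overflow.
    for t in build:
        p = t[key] % num_partitions
        if p in frozen_build:
            frozen_build[p].append(t)
            continue
        in_memory[p].append(t)
        resident += 1
        while resident > memory_limit:
            victim = max(in_memory, key=lambda q: len(in_memory[q]))
            resident -= len(in_memory[victim])
            frozen_build[victim] = in_memory.pop(victim)

    # Probe phase: join at once against resident partitions, defer the rest.
    results = []
    frozen_probe = {p: [] for p in frozen_build}
    for s in probe:
        p = s[key] % num_partitions
        if p in frozen_build:
            frozen_probe[p].append(s)  # would be written to disk
        else:
            results.extend((b, s) for b in in_memory[p] if b[key] == s[key])

    # Cleanup phase: join each frozen build partition with its deferred probe tuples.
    for p, build_part in frozen_build.items():
        table = {}
        for b in build_part:
            table.setdefault(b[key], []).append(b)
        for s in frozen_probe[p]:
            results.extend((b, s) for b in table.get(s[key], []))
    return results
```

Every probe tuple that lands in a frozen partition costs extra I/O in a real system, which is why the choice of which build tuples stay in memory matters when the probe relation is skewed.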

Skewed Probe Relation

Statistics and Hash Joins

• Modern database systems maintain statistics such as histograms for query optimization

• What if hash join could use the statistics to choose the best build tuples to keep in memory?
  o Does not have to generate its own statistics
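
One way to picture "choosing the best build tuples" from existing statistics: rank join-key values by their estimated frequency in the probe relation and keep the top few. The histogram format here (a plain value-to-count mapping) and the name `choose_hot_keys` are assumptions for illustration; real database histograms are stored differently.

```python
def choose_hot_keys(histogram, max_keys):
    """Return the max_keys join-key values with the highest estimated frequency."""
    ranked = sorted(histogram.items(), key=lambda kv: kv[1], reverse=True)
    return {value for value, _count in ranked[:max_keys]}


# Hypothetical frequencies of partid in the probe relation.
histogram = {1: 5, 2: 40, 3: 35, 4: 10, 5: 10}
hot = choose_hot_keys(histogram, max_keys=2)  # {2, 3}
```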

Histojoin Algorithm General Idea

• Same basic form as DHJ
• Determines best build tuples from histogram
  o In this case the tuples with partid 2 and 3
• Create partitions for the best build tuples
  o In addition to regular partitions
  o Freeze regular partitions first
• Perform a highly optimized multi-stage check
  o To determine the partition tuples belong in
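
The slides do not spell out the stages of the check, so the following is only a guess at its general shape: a cheap range test first, then an exact membership test, and the regular hash function as the fallback. The class name `PartitionChooser`, the `"hot"` label, and both stages are assumptions, not the author's actual check.

```python
class PartitionChooser:
    """Decide whether a tuple goes to the hot partition or a regular one."""

    def __init__(self, hot_keys, num_regular=5):
        self.hot_keys = frozenset(hot_keys)
        self.lo = min(self.hot_keys, default=None)
        self.hi = max(self.hot_keys, default=None)
        self.num_regular = num_regular

    def partition_of(self, key):
        # Stage 1: cheap range test rejects most non-hot keys.
        if self.hot_keys and self.lo <= key <= self.hi:
            # Stage 2: exact membership test.
            if key in self.hot_keys:
                return "hot"
        # Fallback: regular hash partitioning.
        return key % self.num_regular


chooser = PartitionChooser({2, 3})
# partid 2 goes to the hot partition; partid 7 goes to regular partition 2.
```

The same check has to be applied to both build and probe tuples so that matching tuples always meet in the same partition.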

Histojoin Algorithm Build Phase

Histojoin Algorithm Probe Phase

Implementation Details

• Avoided in algorithm description
  o General enough to fit any database system
• But ultimately important
  o Core of algorithm implementation specific
• Implemented in
  o Stand alone Java app
    Optimistic implementation
  o PostgreSQL
    HHJ
    Conservative implementation

Inaccurate Statistics

• Selections
• Multi-join plans
  o Sampling
  o SITs
• Handling dependent on implementation
  o PostgreSQL conservative memory usage

Experimental Results

• TPC-H
  o Database commonly used to test database system performance
  o Skewed versions
  o 1GB dataset used in Java tests
  o 10GB dataset used in PostgreSQL tests

Experimental Results, cont.

Java, Lineitem/Part, skewed, 1GB
Approx. 20% faster

Java, Lineitem/Part, high skew, 1GB
Approx. 60% faster

Java, Various Joins, Percent Improvement, 1GB
Approx. 20% for skewed and 60% for high skew

Java, Lineitem/Part, Inaccurate Histogram, 1GB

Java, Lineitem/Part/Supplier, high skew, 1GB
Approx. 75% faster

PostgreSQL, Lineitem/Part, skewed, 10GB
Approx. 10% faster

PostgreSQL, Lineitem/Part, high skew, 10GB
Approx. 60% faster

PostgreSQL, Various Joins, Percent Improvement, 10GB
5-10% for skewed and 50-60% for high skew

Conclusion

• Histojoin
  o Significantly outperforms standard hash joins in the presence of skew
• Smart implementation mitigates pitfalls
• Two papers have been published from this work
• PostgreSQL patch currently in review
  o Will be used by millions of users

Thank you

Thank you Dr. Lawrence
