introduction to apache calcite

109
> < INTRODUCTION TO INTRODUCTION TO APACHE CALCITE APACHE CALCITE JORDAN HALTERMAN 1

Upload: jordan-halterman

Post on 08-Jan-2017

1.006 views

Category:

Software


48 download

TRANSCRIPT

Page 1: Introduction to Apache Calcite

><

INTRODUCTION TO

INTRODUCTION TO APACHE CALCITE

APACHE CALCITEJORDAN HALTERMAN

1

Page 2: Introduction to Apache Calcite

WHAT IS APACHE CALCITE?

next 2

Page 3: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 3

What is Apache Calcite?

• A framework for building SQL databases • Developed over more than ten years • Written in Java • Previously known as Optiq • Previously known as Farrago • Became an Apache project in 2013 • Led by Julian Hyde at Hortonworks

Page 4: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 4

Projects using Calcite

• Apache Hive • Apache Drill • Apache Flink • Apache Phoenix • Apache Samza • Apache Storm • Apache everything…

Page 5: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 5

What is Apache Calcite?

• SQL parser • SQL validation • Query optimizer • SQL generator • Data federator

Page 6: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE

ParseQueries are parsed using a JavaCC generated parser

ValidateQueries are validated against known database metadata

OptimizeLogical plans are optimized and converted into physical expressions

ExecuteP h y s i c a l p l a n s a r e converted into application-specific executions

01 02 03 04

Stages of query execution

6

Page 7: Introduction to Apache Calcite

COMPONENTS next 7

Page 8: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 8

Components of Calcite

• Catalog - Defines metadata and namespaces that can be accessed in SQL queries

• SQL parser - Parses valid SQL queries into an abstract syntax tree (AST)

• SQL validator - Validates abstract syntax trees against metadata provided by the catalog

• Query optimizer - Converts AST into logical plans, optimizes logical plans, and converts logical expressions into physical plans

• SQL generator - Converts physical plans to SQL

Page 9: Introduction to Apache Calcite

CATALOG next 9

Page 10: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 10

Calcite Catalog

• Defines namespaces that can be accessed in Calcite queries

• Schema

• A collection of schemas and tables

• Can be arbitrarily nested

• Table

• Represents a single data set

• Fields defined by a RelDataType

• RelDataType

• Represents fields in a data set

• Supports all SQL data types, including structs and

Page 11: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 11

Schema

• A collection of schemas and tables

• Schemas can be arbitrarily nested

Page 12: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 12

Schema

• A collection of schemas and tables

• Schemas can be arbitrarily nested

Page 13: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 13

Table

• Represents a single data set

• Fields are defined by a RelDataType

Page 14: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 14

Table

• Represents a single data set

• Fields are defined by a RelDataType

Page 15: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 15

RelDataType

• Represents the data type of an object

• Supports all SQL data types, including structs and arrays

• Similar to Spark’s DataType

Page 16: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 16

RelDataType

Page 17: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 17

RelDataType

data type enum

Page 18: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 18

Statistic

• Provide table statistics used in optimization

Page 19: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 19

Statistic

• Provide table statistics used in optimization

Page 20: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 20

Usage of the Calcite catalog

Page 21: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 21

Usage of the Calcite catalog

schema

Page 22: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 22

Usage of the Calcite catalog

schema table

Page 23: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 23

Usage of the Calcite catalog

schema table

data type

Page 24: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 24

Usage of the Calcite catalog

schema table

data typedata type field

Page 25: Introduction to Apache Calcite

SQL PARSER next 25

Page 26: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 26

Calcite SQL parser

• LL(k) parser written in JavaCC

• Input queries are parsed into an abstract syntax tree (AST)

• Tokens are represented in Calcite by SqlNode

• SqlNode can also be converted back to a SQL string via the unparse method

Page 27: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 27

JavaCC

• Java Compiler Compiler

• Created in 1996 at Sun Microsystems

• Generates Java code from a domain-specific language

• ANTLR is the modern alternative used in projects like Hive and Drill

• JavaCC has sparse documentation

Page 28: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 28

JavaCC

Page 29: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 29

JavaCC

Page 30: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 30

JavaCC

tokens

Page 31: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 31

JavaCC

tokens

or

Page 32: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 32

JavaCC

tokens

or

function call

Page 33: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 33

JavaCC

tokens

Java code

or

function call

Page 34: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 34

SqlNode

• SqlNode represents an element in an abstract syntax tree

Page 35: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 35

SqlNode

• SqlNode represents an element in an abstract syntax tree

select

Page 36: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 36

SqlNode

• SqlNode represents an element in an abstract syntax tree

identifiersselect

Page 37: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 37

SqlNode

• SqlNode represents an element in an abstract syntax tree

identifiersselect operator

Page 38: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 38

SqlNode

• SqlNode represents an element in an abstract syntax tree

identifiersselect operator identifier

Page 39: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 39

SqlNode

• SqlNode represents an element in an abstract syntax tree

identifiersselect operator identifier

data type

Page 40: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 40

SqlNode

• SqlNode represents an element in an abstract syntax tree

identifiersselect operator identifier

data typeidentifier

Page 41: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 41

SqlNode

• SqlNode’s unparse method converts a SQL element back into a string

Page 42: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 42

SqlNode

• SqlNode’s unparse method converts a SQL element back into a string

Page 43: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 43

SqlNode

Page 44: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 44

SqlNode

• SqlDialect indicates the capitalization and quoting rules of specific databases

Page 45: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 45

SqlNode

• SqlDialect indicates the capitalization and quoting rules of specific databases

Page 46: Introduction to Apache Calcite

QUERY OPTIMIZER next 46

Page 47: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 47

Query Plans

• Query plans represent the steps necessary to execute a query

Page 48: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 48

Query Plans

• Query plans represent the steps necessary to execute a query

Page 49: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 49

Query Plans

• Query plans represent the steps necessary to execute a query

table scan

table scan

Page 50: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 50

Query Plans

• Query plans represent the steps necessary to execute a query

inner join table scan

table scan

Page 51: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 51

Query Plans

• Query plans represent the steps necessary to execute a query

filterinner join table scan

table scan

Page 52: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 52

Query Plans

• Query plans represent the steps necessary to execute a query

filterinner join

project

table scan

table scan

Page 53: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 53

Query Plans

• Query plans represent the steps necessary to execute a query

filterinner join

project

table scan

table scan

Page 54: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 54

Query Optimization

• Optimize logical plan

• Goal is typically to try to reduce the amount of data that must be processed early in the plan

• Convert logical plan into a physical plan

• Physical plan is engine specific and represents the physical execution stages

Page 55: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 55

Query Optimization

• Prune unused fields

• Merge projections

• Convert subqueries to joins

• Reorder joins

• Push down projections

• Push down filters

Page 56: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 56

Query Optimization

Page 57: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 57

Query Optimization

Page 58: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 58

Query Optimization

Page 59: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 59

Query Optimization

push down project

Page 60: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 60

Query Optimization

push down project

push down filter

Page 61: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 61

Query Optimization

Page 62: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 62

Key Concepts

Relational algebra Row expressions Traits Conventions Rules Planners Programs

Page 63: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 63

Key Concepts

Relational algebra Row expressions Traits Conventions Rules Planners Programs

RelNode RexNode RelTrait Convention RelOptRule RelOptPlanner Program

Page 64: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 64

Relational Algebra

• RelNode represents a relational expression

• Largely equivalent to Spark’s DataFrame methods

• Logical algebra

• Physical algebra

Page 65: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 65

Relational Algebra

TableScan

Project

Filter

Aggregate

Join

Union

Intersect

Sort

Page 66: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 66

Relational Algebra

TableScan

Project

Filter

Aggregate

Join

Union

Intersect

Sort

SparkTableScan

SparkProject

SparkFilter

SparkAggregate

SparkJoin

SparkUnion

SparkIntersect

SparkSort

Page 67: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 67

Row Expressions

• RexNode represents a row-level expression

• Largely equivalent to Spark’s Column functions

• Projection fields

• Filter condition

• Join condition

• Sort fields

Page 68: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 68

Row Expressions

Input column ref Literal Struct field access Function call Window expression

Page 69: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 69

Row Expressions

Input column ref Literal Struct field access Function call Window expression

RexInputRef RexLiteral

RexFieldAccess RexCall RexOver

Page 70: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 70

Row Expressions

Page 71: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 71

Row Expressions

input ref

Page 72: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 72

Row Expressions

input ref

function call

Page 73: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 73

Traits

• Defined by the RelTrait interface

• Represent a trait of a relational expression that does not alter execution

• Traits are used to validate plan output

• Three primary trait types:

• Convention

• RelCollation

• RelDistribution

Page 74: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 74

Conventions

• Convention is a type of RelTrait

• A Convention is associated with a RelNode interface

• SparkConvention, JdbcConvention, EnumerableConvention, etc

• Conventions are used to represent a single data source

• Inputs to a relational expression must be in the same convention

Page 75: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 75

Conventions

Page 76: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 76

Conventions

Spark convention

Page 77: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 77

Conventions

Spark convention

JDBC convention

Page 78: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 78

Conventions

Spark convention

JDBC convention

converter

Page 79: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 79

Rules

• Rules are used to modify query plans

• Defined by the RelOptRule interface

• Two types of rules: converters and transformers

• Converter rules implement Converter and convert from one convention to another

• Rules are matched to elements of a query plan using pattern matching

• onMatch is called for matched rules

• Converter rules applied via convert

Page 80: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 80

Converter Rule

Page 81: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 81

Converter Rule

expression type

Page 82: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 82

Converter Rule

expression type

input convention

Page 83: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 83

Converter Rule

expression type

input convention

converted convention

Page 84: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 84

Converter Rule

expression type

input convention

converted convention

converter function

Page 85: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 85

Pattern Matching

Page 86: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 86

Pattern Matching

Page 87: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 87

Pattern Matching

no match :-(

Page 88: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 88

Pattern Matching

no match :-(

Page 89: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 89

Pattern Matching

match!

no match :-(

Page 90: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 90

Planners

• Planners implement the RelOptPlanner interface

• Two types of planners:

• HepPlanner

• VolcanoPlanner

Page 91: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 91

Heuristic Optimization

• HepPlanner is a heuristic optimizer similar to Spark’s optimizer

• Applies all matching rules until none can be applied

• Heuristic optimization is faster than cost-based optimization

• Risk of infinite recursion if rules make opposing changes to the plan

Page 92: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 92

Cost-based Optimization

• VolcanoPlanner is a cost-based optimizer

• Applies matching rules iteratively, selecting the plan with the cheapest cost on each iteration

• Costs are provided by relational expressions

• Not all possible plans can be computed

• Stops optimization when the cost does not significantly improve through a determinable number of iterations

Page 93: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 93

Cost-based Optimization

• Cost is provided by each RelNode

• Cost is represented by RelOptCost

• Cost typically includes row count, I/O, and CPU cost

• Cost estimates are relative

• Statistics are used to improve accuracy of cost estimations

• Calcite provides utilities for computing various resource-related statistics for use in cost estimations

Page 94: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 94

Cost-based Optimization

Page 95: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 95

Cost-based Optimization

Page 96: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 96

Cost-based Optimization

Page 97: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 97

Cost-based Optimization

Page 98: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 98

Cost-based Optimization

Page 99: Introduction to Apache Calcite

PUTTING IT ALL TOGETHER

next 99

Page 100: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 100

Putting it all together

Page 101: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 101

Putting it all together

Page 102: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 102

Putting it all together

Page 103: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 103

Putting it all together

Page 104: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 104

Putting it all together

Page 105: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 105

Putting it all together

Page 106: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 106

Putting it all together

Page 107: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 107

Putting it all together

Page 108: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 108

Putting it all together

Page 109: Introduction to Apache Calcite

><INTRODUCTION TO APACHE CALCITE 109

Putting it all together