sqak: doing more with keywords sandeep tata, guy m. lohman ibm almaden research center presented by...

SQAK: Doing More with Keywords

Sandeep Tata, Guy M. Lohman
IBM Almaden Research Center

Presented by Alex Zlotnik
Seminar in Databases, 236826


Aggregate Query
A query that uses one of these functions: {Count, Average, Sum, Min, Max}

SQAK = SQL Aggregates using Keywords

                            Non-aggregates   Aggregates
Other papers                Covered          Almost none
Partial results or tuples   OK               Failure

Content
• The problem
• Solution
• Research
• Experiments
• Other Challenges
• Power vs. Ease of Use
• Overview

The problem
Write SQL for “Number of students registered for the course ‘Seminar in Databases’ in the Fall semester in 2009” in less than 3 minutes

Schema (diagram):
• Professor (id, dept, name)
• Department (deptid, location, name)
• Student (id, deptid, name)
• Enrollment (sectionid, studentid, grade)
• Section (sectionid, courseid, term, instructor)
• Courses (courseid, deptid, name)

[Diagram, steps 1-4: the request is split into “Number of students registered”, course “Seminar in Databases”, and “Fall semester in 2009”; the Courses, Enrollment, and Section tables are located and connected by two joins.]

The problem

5. SQL:

SELECT courses.name, section.term, count(students.id) AS count
FROM students, enrollment, section, courses
WHERE students.id = enrollment.id
  AND section.classid = enrollment.classid
  AND courses.courseid = section.courseid
  AND lower(courses.name) LIKE '%seminar in databases%'
  AND lower(section.term) LIKE '%fall 2009%'
GROUP BY courses.name, section.term

Perfect world solution

SQAK: “Seminar in Databases” “Fall 2009” number students

SQL:

SELECT courses.name, section.term, count(students.id) AS count
FROM students, enrollment, section, courses
WHERE students.id = enrollment.id
  AND section.classid = enrollment.classid
  AND courses.courseid = section.courseid
  AND lower(courses.name) LIKE '%seminar in databases%'
  AND lower(section.term) LIKE '%fall 2009%'
GROUP BY courses.name, section.term

Perfect world solution
• Create aggregate queries using simple keywords

• Little or no knowledge of the schema is required from the user

• No changes required in the database

• Any existing database

Progress
✓ The need
✓ The goal
✓ Research
• Experiments
• Other Challenges
• Power vs. Ease of Use
• Overview

Generating SQL - Contents

Generating SQL - Parser
Parser (keywords)

• Keyword → {Candidates | Candidate = (Table, Column)}
• Candidate Interpretations (CI) = cross product of the keywords' candidate lists
  Example: “Seminar in Databases” “Fall 2009” number students →
  {(Course.name, Section.term, count Enrollment.studentid), (Course.name, Section.term, count Student.id)}
• Keywords are matched against schema elements with approximate string matching
  Example: “students” → Enrollment.studentid
           “students” → Student.id
• An inverted index maps all text values to their columns
  Example: “Seminar in Databases” → Courses.name
• Aggregates: sum, count, avg, min, max
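
The parser steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not SQAK's implementation: the similarity metric (`difflib` ratio), the 0.7 threshold, the `KEY_COLUMN` rule that maps a table name to its key column, the hand-built `VALUE_INDEX`, and the treatment of "number" as a synonym for count are all hypothetical choices made for the example.

```python
from difflib import SequenceMatcher
from itertools import product

# Toy schema following the slides' diagram (table -> columns).
SCHEMA = {
    "Courses": ["courseid", "deptid", "name"],
    "Section": ["sectionid", "courseid", "term", "instructor"],
    "Enrollment": ["sectionid", "studentid", "grade"],
    "Student": ["id", "deptid", "name"],
}
# Assumption: a keyword that names a table stands for that table's key column.
KEY_COLUMN = {"Courses": "courseid", "Section": "sectionid", "Student": "id"}
# Assumption: a tiny hand-built inverted index (text value -> columns).
VALUE_INDEX = {
    "seminar in databases": [("Courses", "name")],
    "fall 2009": [("Section", "term")],
}
# Assumption: "number" and "num" are treated as synonyms of count.
AGGREGATES = {"number": "count", "num": "count", "count": "count",
              "sum": "sum", "avg": "avg", "min": "min", "max": "max"}


def similar(a, b, threshold):
    """Approximate match using difflib's similarity ratio (a stand-in metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def candidates(keyword, threshold=0.7):
    """Return the (table, column) candidates for one keyword."""
    key = keyword.lower()
    if key in VALUE_INDEX:  # the keyword is a data value
        return list(VALUE_INDEX[key])
    found = []
    for table, columns in SCHEMA.items():
        if table in KEY_COLUMN and similar(key, table, threshold):
            found.append((table, KEY_COLUMN[table]))
        for column in columns:
            if similar(key, column, threshold) and (table, column) not in found:
                found.append((table, column))
    return found


def parse(keywords, threshold=0.7):
    """Build candidate interpretations as the cross product of candidate lists."""
    per_keyword, pending_agg = [], None
    for kw in keywords:
        if kw.lower() in AGGREGATES:  # an aggregate applies to the next keyword
            pending_agg = AGGREGATES[kw.lower()]
            continue
        per_keyword.append(
            [(pending_agg, t, c) for t, c in candidates(kw, threshold)]
        )
        pending_agg = None
    return list(product(*per_keyword))


cis = parse(["Seminar in Databases", "Fall 2009", "number", "students"])
for ci in cis:
    print(ci)
```

With this toy schema the keyword "students" matches both Enrollment.studentid and Student.id, so the cross product yields two candidate interpretations, mirroring the slide's example.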

Generating SQL - Parser
Parser (keywords)

Initial filtering removes two kinds of candidate interpretations (CI):

- A CI with 2 keywords corresponding to the same column
  Query: number student “Cohen”
  {(count, Student.Id, Student.Name), (count, Student.Name, Student.Name)}

- A CI with 2 columns that are a primary key and its foreign key
  Query: number student “Cohen”
  {(count, Student.Id, Student.Name), (count, Student.Id, Enroll.StudentId)}
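
A minimal sketch of the two filtering rules, assuming a CI is a tuple of (aggregate, table, column) terms and that the foreign keys are available as a map. `FOREIGN_KEYS` and the function names are hypothetical and only serve the example.

```python
# A term is (aggregate or None, table, column).
# Assumption: hypothetical foreign key -> primary key map for the example.
FOREIGN_KEYS = {
    ("Enroll", "StudentId"): ("Student", "Id"),
}


def repeats_column(ci):
    """True if two terms of the CI point at the same column."""
    cols = [(table, column) for _, table, column in ci]
    return len(set(cols)) != len(cols)


def has_key_pair(ci):
    """True if the CI contains both a foreign key and the key it references."""
    cols = {(table, column) for _, table, column in ci}
    return any(fk in cols and pk in cols for fk, pk in FOREIGN_KEYS.items())


def initial_filter(cis):
    """Keep only the CIs that pass both rules."""
    return [ci for ci in cis if not repeats_column(ci) and not has_key_pair(ci)]


# The three interpretations of: number student "Cohen"
cis = [
    (("count", "Student", "Id"), (None, "Student", "Name")),
    (("count", "Student", "Name"), (None, "Student", "Name")),
    (("count", "Student", "Id"), (None, "Enroll", "StudentId")),
]
kept = initial_filter(cis)
print(kept)
```

Only the first interpretation, (count, Student.Id, Student.Name), survives both rules.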

Generating SQL – SQN Builder
SQN Builder (Candidate Interpretations)

For every candidate interpretation, build the best matching sub-graph of the schema (SQN = Simple Query Network).

Example: “Seminar in Databases” “Fall 2009” number “Alex”
CI: (Course.name, Section.term, count Student.name)
[Diagram: sub-graph over Courses, Section, Enrollment, Students]

Generating SQL – Scorer
Scorer (SQNs)

Find the best SQN and create SQL for it

Building Sub-graph (SQN)
• Input: Tables as nodes in the directed schema graph
• Output: Connected sub-graph covering the tables
• Principle: Simplest model, making the fewest assumptions; used in other papers too
• Attempt #1: Minimal covering sub-graph with a directed path between every 2 tables

Building Sub-graph (SQN)
• Attempt #1: Minimal covering sub-graph with a directed path between every 2 tables
• Problem: Many-to-many relationships
[Diagrams: Section, Enrollment, Students; Professor, Section, Course]

Building Sub-graph (SQN)
• Attempt #2: Minimal covering sub-graph with Node Clarity
• Node Clarity: The sub-graph doesn’t contain any node with multiple incoming edges
[Diagram: schema graph over Section, Enrollment, Students, Professor, Course, and Department, with an edge labeled “Weak Reference”]

Building Sub-graph (SQN)
• Example: Find the number of students per course
• Query: courses count students
[Diagram: schema graph over Section, Enrollment, Students, Professor, Course, and Department, with an edge labeled “Weak Reference”]
• Sub-graph through Department (Students, Course, Department): “For each course, list the number of students that are in the same department that offers the course”
• Sub-graph through Enrollment (Section, Enrollment, Students, Course)

Building Sub-graph (SQN)
• Input: Tables as nodes in the directed schema graph
• Output: Minimal sub-graph with Node Clarity covering the tables
• Observation: The output is a tree
• NP-complete: reduction from Exact 3-Cover

Building Sub-graph (SQN)
Greedy heuristic algorithm for finding a minimal SQN

CI = {nodes (tables) of keywords}
Qagg = aggregate node
SQN = {}

Start an undirected BFS from the aggregate node. For every step i:

1. Qi = nodes discovered in step i
2. For every node q in CI ∩ Qi
   2.1 If NodeClear(q.path ∪ SQN)
       2.1.1 SQN ← SQN ∪ q.path
       2.1.2 CI ← CI \ {q}
       2.1.3 If CI = {}, return SQN
3. If no progress was made
   3.1 Backtrack the added path
4. Stop when all nodes in CI were found or the BFS finishes

Minimality: by BFS
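
The greedy search can be sketched as follows. This is a simplified illustration, not the paper's algorithm verbatim: it enumerates simple paths in breadth-first order so that a table rejected on one path can still be reached by another, and it omits the backtracking step and the time limit. The edge set `EDGES` is an assumed reading of the schema diagram, with an edge (a, b) meaning that table a holds a foreign key to table b.

```python
from collections import deque

# Assumption: directed schema graph read off the slides' diagram;
# an edge (a, b) means table a holds a foreign key referencing table b.
EDGES = {
    ("Enrollment", "Student"), ("Enrollment", "Section"),
    ("Section", "Courses"), ("Section", "Professor"),
    ("Courses", "Department"), ("Student", "Department"),
    ("Professor", "Department"),
}


def neighbours(node):
    """Neighbours of a node when edge direction is ignored (undirected BFS)."""
    out = []
    for a, b in sorted(EDGES):
        if a == node:
            out.append(b)
        elif b == node:
            out.append(a)
    return out


def path_edges(path):
    """Directed schema edges used by consecutive nodes of a path."""
    used = set()
    for u, v in zip(path, path[1:]):
        used.add((u, v) if (u, v) in EDGES else (v, u))
    return used


def node_clear(edges):
    """Node clarity: no node has more than one incoming edge."""
    targets = [b for _, b in edges]
    return len(targets) == len(set(targets))


def greedy_sqn(agg_node, ci_nodes):
    """Grow the SQN from the aggregate node; no backtracking in this sketch."""
    remaining = set(ci_nodes) - {agg_node}
    sqn = set()
    queue = deque([[agg_node]])  # FIFO queue, so shorter paths come first
    while queue and remaining:
        path = queue.popleft()
        last = path[-1]
        if last in remaining:
            candidate = sqn | path_edges(path)
            if node_clear(candidate):
                sqn = candidate
                remaining.discard(last)
        for nxt in neighbours(last):
            if nxt not in path:  # keep paths simple
                queue.append(path + [nxt])
    return sqn if not remaining else None


# Query: courses count students (aggregate node: Student, keyword node: Courses)
sqn = greedy_sqn("Student", {"Courses"})
print(sorted(sqn))
```

For the "courses count students" example, the shorter path through Department is rejected because Department would receive two incoming edges, so the sketch settles on the path through Enrollment and Section.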

Building Sub-graph (SQN)
Greedy heuristic algorithm for finding a minimal SQN

• The algorithm finds a minimal SQN
• Complexity: O(q²E²) without backtracking; exponential otherwise
• Time limit: stop the algorithm after a fixed time and run without node clarity (approximate Steiner tree)
• In this case SQAK warns the user that the result might not be accurate

Generating SQL – Scorer (reminder)

Scorer (SQNs)

Find the best SQN and create SQL for it

Score(CI, SQN) combines two terms:
• $\sum_{col \in CI} Match(col)$
• $|Edges(SQN)|$

Generating SQL
Procedure makeSimpleStatement(CI, SQN)
1. Make SELECT clause from elements in CI
2. Make FROM clause from nodes in SQN
3. Make WHERE clause from edges in SQN
4. Make GROUP BY clause from elements of CI except the aggregated node
5. Add predicates in CI to the WHERE clause
6. Return statement
end procedure
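
The six steps of makeSimpleStatement map directly onto string assembly. The sketch below is one possible rendering; the `Term` record and the way join edges are passed in as column pairs are assumptions made for the example, not structures taken from the paper.

```python
from typing import NamedTuple, Optional


class Term(NamedTuple):
    """One element of a CI (hypothetical representation)."""
    table: str
    column: str
    agg: Optional[str] = None        # e.g. "count" for the aggregated term
    predicate: Optional[str] = None  # e.g. "LIKE '%cs%'" from a value match


def make_simple_statement(ci, sqn_nodes, sqn_edges):
    select, group_by = [], []
    for t in ci:
        col = f"{t.table}.{t.column}"
        if t.agg:
            select.append(f"{t.agg}({col})")  # step 1, aggregated element
        else:
            select.append(col)                # step 1, plain element
            group_by.append(col)              # step 4, all but the aggregate
    where = [f"{a} = {b}" for a, b in sqn_edges]  # step 3, one join per edge
    where += [f"lower({t.table}.{t.column}) {t.predicate}"
              for t in ci if t.predicate]         # step 5, predicates
    sql = f"SELECT {', '.join(select)} FROM {', '.join(sqn_nodes)}"  # step 2
    if where:
        sql += " WHERE " + " AND ".join(where)
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql                                    # step 6


# Keyword query: CS num students
ci = [Term("DEPARTMENT", "NAME", predicate="LIKE '%cs%'"),
      Term("STUDENTS", "ID", agg="count")]
sql = make_simple_statement(
    ci,
    sqn_nodes=["STUDENTS", "DEPARTMENT"],
    sqn_edges=[("DEPARTMENT.DEPTID", "STUDENTS.DEPTID")],
)
print(sql)
```

For the keyword query "CS num students" this produces the same statement that the deck shows in its savings experiment.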

Generating SQL
• Input: CI, SQN
• Output: SQL
• 3 types of queries:
  – Simple: “Seminar in DB” count students
  – Top1, single level aggregate: “department with max num students”
  – Top1, double level aggregate: “department student with max avg grade”

Generating SQL
translateSQN(CI, SQN)
1. if SQN does not have a w-node then
   1.1 Return makeSimpleStatement(CI, SQN)
end if
2. if SQN has a w-node and a single level aggregate then
   2.1 Produce view u = makeSimpleStatement(CI, SQN)
   2.2 Remove w-node from u’s SELECT clause and GROUP BY clause
   2.3 r = makeSimpleStatement(CI, SQN)
   2.4 Add u to r’s FROM clause
   2.5 Add join conditions joining all the columns in u to the corresponding ones in r
   2.6 Return r
end if

Generating SQL: Top1 single level aggregate

• “department with max num students”

WITH temp(DEPTID, COURSEID) AS (
    SELECT DEPARTMENT.DEPTID, count(COURSES.COURSEID)
    FROM COURSES, DEPARTMENT
    WHERE DEPARTMENT.DEPTID = COURSES.DEPTID
    GROUP BY DEPARTMENT.DEPTID),
temp2(COURSEID) AS (
    SELECT max(COURSEID) FROM temp)
SELECT temp.DEPTID, temp.COURSEID
FROM temp, temp2
WHERE temp.COURSEID = temp2.COURSEID
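
To see the single level Top1 pattern run, the sketch below executes a query of the same shape on an in-memory SQLite database. The tables, rows, and the CTE names `per_dept` and `best` are made up for the example (the slide's query calls them temp and temp2); like the SQL on the slide, it counts courses per department.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE department (deptid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE courses (courseid INTEGER PRIMARY KEY, deptid INTEGER, name TEXT);
INSERT INTO department VALUES (1, 'CS'), (2, 'EE');
INSERT INTO courses VALUES
    (10, 1, 'Databases'), (11, 1, 'Compilers'), (20, 2, 'Circuits');
""")

rows = conn.execute("""
WITH per_dept(deptid, n_courses) AS (      -- first view: one count per group
    SELECT department.deptid, count(courses.courseid)
    FROM courses, department
    WHERE department.deptid = courses.deptid
    GROUP BY department.deptid),
best(n_courses) AS (                       -- second view: the maximum count
    SELECT max(n_courses) FROM per_dept)
SELECT per_dept.deptid, per_dept.n_courses -- join back to find the top group
FROM per_dept, best
WHERE per_dept.n_courses = best.n_courses
""").fetchall()
print(rows)
```

Department 1 has two courses and department 2 has one, so the join against the maximum keeps only department 1.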

Generating SQL
translateSQN(CI, SQN)
…
3. if SQN has a w-node and a double level aggregate then
   3.1 Produce view u = makeSimpleStatement(CI, SQN)
   3.2 Produce view v = aggregate of u from the second level aggregate term in the CI, excluding the w-node in the SELECT and GROUP BY clauses
   3.3 Produce r = join of u and v, equating all the common columns
   3.4 Return r
end if

Generating SQL: Top1 double level aggregate

• “department student with max avg grade”

WITH temp(DEPTID, ID, GRADE) AS (
    SELECT STUDENTS.DEPTID, STUDENTS.ID, avg(ENROLLMENT.GRADE)
    FROM ENROLLMENT, STUDENTS
    WHERE STUDENTS.ID = ENROLLMENT.ID
    GROUP BY STUDENTS.DEPTID, STUDENTS.ID),
temp2(DEPTID, GRADE) AS (
    SELECT DEPTID, max(GRADE)
    FROM temp
    GROUP BY DEPTID)
SELECT temp.DEPTID, temp.ID, temp.GRADE
FROM temp, temp2
WHERE temp.DEPTID = temp2.DEPTID
  AND temp.GRADE = temp2.GRADE
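
The double level pattern can be tried the same way. The sketch below is an illustration on invented data: the rows, the CTE names `per_student` and `best`, and the added ORDER BY (used only to make the output order deterministic) are not part of the slide's query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE students (id INTEGER PRIMARY KEY, deptid INTEGER, name TEXT);
CREATE TABLE enrollment (id INTEGER, sectionid INTEGER, grade REAL);
INSERT INTO students VALUES (1, 1, 'Ann'), (2, 1, 'Bob'), (3, 2, 'Cy');
INSERT INTO enrollment VALUES
    (1, 100, 90), (1, 101, 80), (2, 100, 70), (3, 200, 60);
""")

rows = conn.execute("""
WITH per_student(deptid, id, avg_grade) AS (   -- view u: first level aggregate
    SELECT students.deptid, students.id, avg(enrollment.grade)
    FROM enrollment, students
    WHERE students.id = enrollment.id
    GROUP BY students.deptid, students.id),
best(deptid, avg_grade) AS (                   -- view v: second level aggregate
    SELECT deptid, max(avg_grade)
    FROM per_student
    GROUP BY deptid)
SELECT per_student.deptid, per_student.id, per_student.avg_grade
FROM per_student, best                         -- r: join on the common columns
WHERE per_student.deptid = best.deptid
  AND per_student.avg_grade = best.avg_grade
ORDER BY per_student.deptid
""").fetchall()
print(rows)
```

Each department keeps the student whose average grade equals the department's maximum: student 1 (average 85) in department 1 and student 3 (average 60) in department 2.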

SQAK Expressiveness
• No formal definition of rSQL expressiveness – future work
• Queries based on a weak reference cannot be expressed

Progress
✓ The need
✓ The goal
✓ Research
✓ Experiments
• Other Challenges
• Power vs. Ease of Use
• Overview

Experiments
Metrics
• Data Precision: irrelevant; either the correct data is retrieved or it is not
• Effectiveness
• Savings
• Parameters
• Cost

Experiments - Effectiveness

• SQAK – 93% (14 out of 15)
• Steiner – 60% (9 out of 15)

Average grade received by a student named William from the EECS department

Experiments - Effectiveness

• SQAK – 100%
• Steiner – 87% (13 of 15)

TPCH Database

Experiments - Savings
CS num students

VS.

SELECT DEPARTMENT.NAME, count(STUDENTS.ID)
FROM STUDENTS, DEPARTMENT
WHERE DEPARTMENT.DEPTID = STUDENTS.DEPTID
  AND lower(DEPARTMENT.NAME) LIKE '%cs%'
GROUP BY DEPARTMENT.NAME

Experiments - Savings
• Measure: # of schema elements + # of join conditions
• M(CS num students) = 1
• M(SQL) = 5

SELECT DEPARTMENT.NAME, count(STUDENTS.ID)
FROM STUDENTS, DEPARTMENT
WHERE DEPARTMENT.DEPTID = STUDENTS.DEPTID
  AND lower(DEPARTMENT.NAME) LIKE '%cs%'
GROUP BY DEPARTMENT.NAME

• Saved = 5 – 1 = 4


Experiments - Savings
• Measure: # of schema elements + # of join conditions
• Not taken into account
  – SQL syntax and correctness
  – Top1 single and double level constructions
• Average Savings: [chart not recovered]

Experiments - Parameters
• Mismatch tolerance threshold:
  Match(keyword to column) < threshold ? MATCH : NOT_MATCH
• f – Mismatch penalty (used by the scorer)

Experiments - Parameters
• Low sensitivity to the mismatch threshold in [0.4, 0.8]
• At lower thresholds, wrong columns were selected for the CI
• As expected, a lower penalty (f = 1.5) leads to lower accuracy
• Robust: f = 2 or 3, threshold between 0.4 and 0.8

Experiments – More Schemas
• Large DB of IT assets of a large enterprise
  – 600 tables, each with several columns
  – Sample queries provide accuracy as presented
  – Generating SQL always takes less than 1 second
• Warehouse DB
  – 14 tables, star schema
• A star schema is easier for SQAK: no backtracking

Progress
✓ The need
✓ The goal
✓ Research
✓ Experiments
✓ Other Challenges
• Power vs. Ease of Use
• Overview

Other challenges
• Approximate Matching
  – The user doesn’t know the column names
  – Proposal: hint list, either on paper or in code, or ontology-based normalization
• Missing Referential Integrity
  – Referential constraints not defined by the DBA
  – Proposal: Use an (out of scope) algorithm to discover referential constraints

Other challenges
• Tied or Close Plans
  – Several SQNs with (close to) the best score
  – Can occur with similar names in different semantic areas of the DB
  – The user selects the relevant SQL
  – Future research: Visualizing the interesting SQNs

Other challenges
• Expressiveness
  – Users adopt SQAK quickly and pose queries such as age > 18
  – SQAK adds an appropriate WHERE clause

Power vs. Ease of Use

Overview
• SQAK: a system to create SQL queries with aggregates from keywords
• Useful: no knowledge of the schema and no changes to the database are required
• Expressive, but with limitations
• Trade-off between correctness and computation cost
• Execution is polynomial in many common cases, exponential in the worst case

Personal opinion
• The idea is powerful
• Experiments on industrial data would emphasize the strengths and the weaknesses
• Some of the results were expected (Steiner tree)
• Translating a CI to an SQN means solving the same NP-complete problem repeatedly. A caching mechanism would be very beneficial

Questions?

Thank you
