Post on 21-Dec-2015
SQAK: Doing More with Keywords
Sandeep Tata, Guy M. Lohman
IBM Almaden Research Center
Presented by Alex Zlotnik
Seminar in Databases, 236826
Aggregate Query
A query that uses one of these functions: {Count, Average, Sum, Min, Max}
SQAK = SQL Aggregates using Keywords
                             Non Aggregates   Aggregates
Other papers                 Covered          Almost none
Partial results or tuples    OK               Failure
Content
• The problem
• Solution
• Research
• Experiments
• Other Challenges
• Power vs. Ease of Use
• Overview
The problem
Write SQL for "Number of students registered for the course “Seminar in Databases” in the Fall semester in 2009" in less than 3 minutes
[Figure: schema diagram of the university database]
Professor(id, dept, name)
Department(deptid, location, name)
Student(id, deptid, name)
Enrollment(sectionid, grade, studentid)
Section(courseid, term, sectionid, instructor)
Courses(courseid, deptid, name)
[Figure: the three parts of the request (course “Seminar in Databases”, Number of students registered, Fall semester in 2009) located on the Courses, Enrollment and Section tables in steps 1 to 3, followed by the two joins between them in step 4]
5. SQL:
SELECT courses.name, section.term, count(students.id) AS count
FROM students, enrollment, section, courses
WHERE students.id = enrollment.id
  AND section.classid = enrollment.classid
  AND courses.courseid = section.courseid
  AND lower(courses.name) LIKE '%seminar in databases%'
  AND lower(section.term) LIKE '%fall 2009%'
GROUP BY courses.name, section.term
Perfect world solution
SQAK: “Seminar in Databases” “Fall 2009” number students
SQL: the same statement as in step 5 above, produced from the keywords alone
Perfect world solution
• Create aggregate queries using simple keywords
• Little or no knowledge of the schema is required from the user
• No changes required in the database
• Any existing database
Progress
The need
The goal
Research Yada
• Experiments
• Other Challenges
• Power vs. Ease of Use
• Overview
Generating SQL - Contents
Generating SQL - Parser
Parser (keywords)
Keyword → {Candidates | Candidate = (Table, Column)}
Candidate Interpretations = Cross product of each Candidates list
“Seminar in Databases” “Fall 2009” number students → {(Course.name, Section.term, count Enrollment.studentid), (Course.name, Section.term, count Student.id)}
• Matching keyword against schema elements with approximate string matching
  Example: “students” → Enrollment.studentid, “students” → Student.id
• Inverted index from all text values to their columns
  Example: “Seminar in Databases” → Courses.name
• Aggregates: sum, count, avg, min, max
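The parser step above can be sketched in Python. This is a minimal illustration, not SQAK's actual code: the toy SCHEMA, the VALUE_INDEX inverted index, the 0.6 similarity threshold, the use of difflib for approximate matching, and the rule that a table-name match points at the table's first column are all assumptions.

```python
from difflib import SequenceMatcher
from itertools import product

# Hypothetical toy schema: table -> columns (first column treated as the key).
SCHEMA = {
    "Courses": ["courseid", "deptid", "name"],
    "Section": ["sectionid", "courseid", "term", "instructor"],
    "Enrollment": ["sectionid", "studentid", "grade"],
    "Student": ["id", "deptid", "name"],
}
# Hypothetical inverted index: text value -> columns that contain it.
VALUE_INDEX = {
    "seminar in databases": [("Courses", "name")],
    "fall 2009": [("Section", "term")],
}
# Keywords that are read as aggregate functions (assumed vocabulary).
AGGREGATES = {"count": "count", "number": "count", "num": "count",
              "sum": "sum", "avg": "avg", "min": "min", "max": "max"}


def similarity(a, b):
    """Approximate string match score in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidates(keyword, threshold=0.6):
    """Return the (table, column) candidates for one keyword."""
    key = keyword.lower()
    if key in VALUE_INDEX:                      # exact hit on a stored value
        return list(VALUE_INDEX[key])
    found = []
    for table, columns in SCHEMA.items():
        if similarity(key, table) >= threshold:
            found.append((table, columns[0]))   # table match -> key column
        for column in columns:
            if similarity(key, column) >= threshold:
                found.append((table, column))
    return found


def interpretations(keywords):
    """Cross product of the candidate lists; an aggregate keyword is
    attached to the keyword that follows it."""
    per_keyword, pending = [], None
    for keyword in keywords:
        aggregate = AGGREGATES.get(keyword.lower())
        if aggregate:
            pending = aggregate
            continue
        per_keyword.append([(pending, t, c) for t, c in candidates(keyword)])
        pending = None
    return [tuple(ci) for ci in product(*per_keyword)]


cis = interpretations(["Seminar in Databases", "Fall 2009", "number", "students"])
```

With this toy data, `cis` holds two candidate interpretations, one counting Enrollment.studentid and one counting Student.id, mirroring the example on the slide.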
Generating SQL - Parser
Parser (keywords)
Initial Filtering removes:
- CI with 2 keywords corresponding to the same column
  number student “Cohen” → {(count, Student.Id, Student.Name), (count, Student.Name, Student.Name)}: the second CI is removed
- CI with 2 columns that are primary and foreign key
  number student “Cohen” → {(count, Student.Id, Student.Name), (count, Student.Id, Enroll.StudentId)}: the second CI is removed
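A small sketch of the two filtering rules, under the assumption that a CI is a list of (aggregate, table, column) triples and that foreign keys are given as pairs of (table, column). Both representations are hypothetical and chosen only for this illustration.

```python
def is_trivial(ci, foreign_keys):
    """Return True if the candidate interpretation should be filtered out."""
    columns = [(table, column) for _aggregate, table, column in ci]
    # Rule 1: two keywords map to the same column.
    if len(set(columns)) < len(columns):
        return True
    # Rule 2: two columns are the two ends of a primary/foreign key pair.
    for i, first in enumerate(columns):
        for second in columns[i + 1:]:
            if (first, second) in foreign_keys or (second, first) in foreign_keys:
                return True
    return False


# Assumed foreign key: Enroll.StudentId references Student.Id.
FOREIGN_KEYS = {(("Enroll", "StudentId"), ("Student", "Id"))}

kept = [("count", "Student", "Id"), (None, "Student", "Name")]
same_column = [("count", "Student", "Name"), (None, "Student", "Name")]
key_pair = [("count", "Student", "Id"), (None, "Enroll", "StudentId")]
```

Here `is_trivial` is false for `kept` and true for the other two interpretations.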
Generating SQL – SQN Builder
SQN Builder (Candidate Interpretations)
For every candidate interpretation, build the best matching sub-graph of the schema
“Seminar in Databases” “Fall 2009” number “Alex” → (Course.name, Section.term, count Student.name)
[Figure: sub-graph connecting Courses, Section, Enrollment and Students]
Generating SQL – Scorer
Scorer (SQNs)
Find the best SQN and create SQL for it
Building Sub-graph (SQN)
• Input: Tables as nodes in the directed schema graph
• Output: Connected sub-graph covering the tables
• Principle: Use the simplest model, the one making the fewest assumptions (also used in other papers)
• Attempt #1: Minimal covering sub-graph with a directed path between every 2 tables
Building Sub-graph (SQN)
• Attempt #1: Minimal covering sub-graph with a directed path between every 2 tables
• Problem: Many-to-Many relationships
[Figure: schema sub-graphs over Section, Enrollment, Students, Professor and Course]
Building Sub-graph (SQN)
• Attempt #2: Minimal covering sub-graph with Node Clarity
• Node Clarity: The sub-graph doesn’t contain any node with multiple incoming edges
[Figure: schema graph over Section, Enrollment, Students, Professor, Course and Department, with a Weak Reference marked]
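Node clarity is easy to check mechanically. The sketch below assumes that a sub-graph is a collection of directed (source, target) edges and that each edge points from the table holding the foreign key to the table it references; that direction is an assumption of this illustration.

```python
from collections import Counter


def node_clear(edges):
    """True if no node of the sub-graph has more than one incoming edge."""
    incoming = Counter(target for _source, target in edges)
    return all(count <= 1 for count in incoming.values())


# Students and Course both reference Department: Department has two incoming edges.
through_department = [("Students", "Department"), ("Course", "Department")]
# Path through Enrollment and Section: every node has at most one incoming edge.
through_enrollment = [("Enrollment", "Students"), ("Enrollment", "Section"),
                      ("Section", "Course")]
```

Under these assumptions `node_clear(through_department)` is false and `node_clear(through_enrollment)` is true.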
Building Sub-graph (SQN)
• Example: Find the number of students per course
• Query: courses count students
[Figure: the schema graph (Section, Enrollment, Students, Professor, Course, Department, with a Weak Reference marked) and two candidate sub-graphs]
• Students, Course, Department: For each course, list the number of students that are in the same department that offers the course
• Section, Enrollment, Students, Course
Building Sub-graph (SQN)
• Input: Tables as nodes in the directed schema graph
• Output: Minimal sub-graph with Node Clarity covering the tables
• Observation: The output is a tree
• NP-Complete: Reduction from Exact 3-Cover
Building Sub-graph (SQN)
Greedy Heuristic Algorithm for finding min SQN
CI = {nodes (tables) of keywords}
Qagg = Aggregate node
SQN = {}
Start BFS (non-directed) from the aggregate node. For every step i:
1. Qi = Nodes discovered in step i
2. For every node q in CI ∩ Qi
   2.1 If NodeClear(q.path ∪ SQN)
       2.1.1 SQN ← SQN ∪ q.path
       2.1.2 CI ← CI \ {q}
       2.1.3 If (CI = {}) return SQN
3. If no progress was made
   3.1 backtrack the added path
4. Stop when all nodes in CI were found or BFS finishes
Minimality: By BFS
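The greedy search above can be sketched as follows. This is a simplified variant, not the algorithm from the slides: instead of backtracking (step 3), it prunes any BFS step whose path would break node clarity, so it may miss solutions that real backtracking would find. The edge list and the direction of the edges (foreign-key table to referenced table) are assumptions of this illustration.

```python
from collections import Counter, deque


def node_clear(edges):
    """True if no node has more than one incoming edge."""
    incoming = Counter(target for _source, target in edges)
    return all(count <= 1 for count in incoming.values())


def greedy_sqn(schema_edges, aggregate_node, keyword_nodes):
    """Return a set of directed edges connecting the aggregate node to all
    keyword nodes, or None if this simplified search finds no such set."""
    # Undirected adjacency for the BFS; each entry remembers the directed edge.
    neighbours = {}
    for source, target in schema_edges:
        neighbours.setdefault(source, []).append((target, (source, target)))
        neighbours.setdefault(target, []).append((source, (source, target)))

    remaining = set(keyword_nodes) - {aggregate_node}
    sqn = set()
    path = {aggregate_node: frozenset()}   # node -> edges on its BFS path
    queue = deque([aggregate_node])
    while queue and remaining:
        node = queue.popleft()
        for nxt, edge in neighbours.get(node, []):
            if nxt in path:
                continue
            candidate = path[node] | {edge}
            if not node_clear(sqn | candidate):
                continue                   # pruned instead of backtracking
            path[nxt] = candidate
            queue.append(nxt)
            if nxt in remaining:
                sqn |= candidate
                remaining.discard(nxt)
    return sqn if not remaining else None


# Hypothetical schema graph for the "courses count students" example.
EDGES = [("Enrollment", "Students"), ("Enrollment", "Section"),
         ("Section", "Course"), ("Section", "Professor"),
         ("Students", "Department"), ("Course", "Department"),
         ("Professor", "Department")]
result = greedy_sqn(EDGES, "Students", {"Course"})
```

On this graph the shorter route through Department is rejected because Department would get two incoming edges, and the search returns the three-edge path through Enrollment and Section.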
Building Sub-graph (SQN)
Greedy Heuristic Algorithm for finding min SQN
• The algorithm finds a minimal SQN
• Complexity: Without backtracking: O(q²E²). Otherwise, exponential
• Time limit: Stop the algorithm after a fixed time and run without node clarity (approx. Steiner). In this case SQAK warns the user that the result might not be accurate
Generating SQL – Scorer (reminder)
Scorer (SQNs)
Find the best SQN and create SQL for it
Score(CI, SQN) = \sum_{col \in CI} Match(col) + |Edges(SQN)|
Generating SQL
Procedure makeSimpleStatement(CI, SQN)
1. Make SELECT clause from elements in CI
2. Make FROM clause from nodes in SQN
3. Make WHERE clause from edges in SQN
4. Make GROUP BY clause from elements of CI except aggregated node
5. Add predicates in CI to the WHERE clause
6. Return statement
end procedure
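The procedure above maps almost line by line onto string building. The sketch below is illustrative only: the tuple layout of a CI element (aggregate, table, column, predicate value) and of an SQN edge (a pair of (table, column) ends) is hypothetical, and values are pasted into the SQL text without escaping, which would be unsafe for untrusted input.

```python
def make_simple_statement(ci, sqn_nodes, sqn_edges):
    """Build one SELECT statement from a CI and an SQN (illustrative)."""
    select, group_by, where = [], [], []
    # Steps 1 and 4: SELECT from the CI, GROUP BY the non-aggregated columns.
    for aggregate, table, column, _value in ci:
        name = f"{table}.{column}"
        if aggregate:
            select.append(f"{aggregate}({name})")
        else:
            select.append(name)
            group_by.append(name)
    # Step 3: one join condition per SQN edge.
    for (t1, c1), (t2, c2) in sqn_edges:
        where.append(f"{t1}.{c1} = {t2}.{c2}")
    # Step 5: predicates for keywords that matched a stored value.
    for _aggregate, table, column, value in ci:
        if value is not None:
            where.append(f"lower({table}.{column}) LIKE '%{value.lower()}%'")
    # Step 2: FROM clause from the SQN nodes.
    sql = f"SELECT {', '.join(select)} FROM {', '.join(sqn_nodes)}"
    if where:
        sql += " WHERE " + " AND ".join(where)
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql


statement = make_simple_statement(
    ci=[(None, "department", "name", "CS"), ("count", "students", "id", None)],
    sqn_nodes=["students", "department"],
    sqn_edges=[(("department", "deptid"), ("students", "deptid"))],
)
```

For the keyword query "CS num students" this produces a statement of the same shape as the hand-written SQL shown in the Savings slides.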
Generating SQL
• Input: CI, SQN
• Output: SQL
• 3 types of queries:
– Simple: “Seminar in DB” count students
– Top1, single level aggregate: “department with max num students”
– Top1, double level aggregate: “department student with max avg grade”
Generating SQL
translateSQN(CI, SQN)
1. if SQN does not have a w-node then
   1.1 Return makeSimpleStatement(CI, SQN)
end if
2. if SQN has a w-node and a single level aggregate then
   2.1 Produce view u = makeSimpleStatement(CI, SQN)
   2.2 Remove w-node from u’s SELECT clause and GROUP BY clause
   2.3 r = makeSimpleStatement(CI, SQN)
   2.4 Add u to r’s FROM clause
   2.5 Add join conditions joining all the columns in u to the corresponding ones in r
   2.6 return r
end if
Generating SQL: Top1 single level aggregate
• “department with max num students”
WITH temp(DEPTID, COURSEID) AS (
  SELECT DEPARTMENT.DEPTID, count(COURSES.COURSEID)
  FROM COURSES, DEPARTMENT
  WHERE DEPARTMENT.DEPTID = COURSES.DEPTID
  GROUP BY DEPARTMENT.DEPTID),
temp2(COURSEID) AS (
  SELECT max(COURSEID) FROM temp)
SELECT temp.DEPTID, temp.COURSEID
FROM temp, temp2
WHERE temp.COURSEID = temp2.COURSEID
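To see the single level pattern run, here is a small sketch using Python's built-in sqlite3 module. The tables, rows and the view names per_dept and best are made up for the example; the query keeps the structure of the statement above, which counts courses per department and then keeps the department with the maximum count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy data: CS offers two courses, EE offers one.
conn.executescript("""
CREATE TABLE department (deptid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE courses (courseid INTEGER PRIMARY KEY, deptid INTEGER, name TEXT);
INSERT INTO department VALUES (1, 'CS'), (2, 'EE');
INSERT INTO courses VALUES (10, 1, 'Databases'), (11, 1, 'Compilers'),
                           (12, 2, 'Circuits');
""")

QUERY = """
WITH per_dept(deptid, n_courses) AS (
    SELECT department.deptid, count(courses.courseid)
    FROM courses, department
    WHERE department.deptid = courses.deptid
    GROUP BY department.deptid
),
best(n_courses) AS (SELECT max(n_courses) FROM per_dept)
SELECT per_dept.deptid, per_dept.n_courses
FROM per_dept, best
WHERE per_dept.n_courses = best.n_courses
"""
rows = conn.execute(QUERY).fetchall()
```

With this data `rows` contains the single row for department 1 with a count of 2. If two departments tie for the maximum, the join returns both of them.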
Generating SQL
translateSQN(CI, SQN)
…
3. if SQN has a w-node and a double level aggregate then
   3.1 Produce view u = makeSimpleStatement(CI, SQN)
   3.2 Produce view v = aggregate of u from the second level aggregate term in the CI, excluding the w-node in the SELECT and GROUP BY clauses
   3.3 Produce r = Join u and v, equating all the common columns
   3.4 Return r
end if
Generating SQL: Top1 double level aggregate
• “department student with max avg grade”
WITH temp(DEPTID, ID, GRADE) AS (
  SELECT STUDENTS.DEPTID, STUDENTS.ID, avg(ENROLLMENT.GRADE)
  FROM ENROLLMENT, STUDENTS
  WHERE STUDENTS.ID = ENROLLMENT.ID
  GROUP BY STUDENTS.DEPTID, STUDENTS.ID),
temp2(DEPTID, GRADE) AS (
  SELECT DEPTID, max(GRADE)
  FROM temp
  GROUP BY DEPTID)
SELECT temp.DEPTID, temp.ID, temp.GRADE
FROM temp, temp2
WHERE temp.DEPTID = temp2.DEPTID
  AND temp.GRADE = temp2.GRADE
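The double level pattern can be exercised the same way with sqlite3. The data, the view names per_student and per_dept, the enrollment column name studentid (taken from the schema slide rather than from the statement above) and the added ORDER BY are assumptions of this sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical toy data: students 1 and 2 are in department 1, student 3 in 2.
conn.executescript("""
CREATE TABLE students (id INTEGER PRIMARY KEY, deptid INTEGER);
CREATE TABLE enrollment (studentid INTEGER, grade REAL);
INSERT INTO students VALUES (1, 1), (2, 1), (3, 2);
INSERT INTO enrollment VALUES (1, 90), (1, 80), (2, 70), (3, 60), (3, 100);
""")

QUERY = """
WITH per_student(deptid, id, avg_grade) AS (      -- first level: avg per student
    SELECT students.deptid, students.id, avg(enrollment.grade)
    FROM enrollment, students
    WHERE students.id = enrollment.studentid
    GROUP BY students.deptid, students.id
),
per_dept(deptid, best_grade) AS (                 -- second level: max per department
    SELECT deptid, max(avg_grade) FROM per_student GROUP BY deptid
)
SELECT per_student.deptid, per_student.id, per_student.avg_grade
FROM per_student, per_dept
WHERE per_student.deptid = per_dept.deptid
  AND per_student.avg_grade = per_dept.best_grade
ORDER BY per_student.deptid
"""
rows = conn.execute(QUERY).fetchall()
```

Student 1 averages 85 and wins in department 1; student 3 averages 80 and is the only student in department 2.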
SQAK Expressiveness
• No formal definition of rSQL expressiveness – Future work
• Queries based on weak reference – cannot be expressed
Progress
The need
The goal
Research Yada
Experiments
• Other Challenges
• Power vs. Ease of Use
• Overview
Experiments
Metrics
• Data Precision: Irrelevant, either the correct data is retrieved or not
• Effectiveness
• Savings
• Parameters
• Cost
Experiments - Effectiveness
• SQAK – 93% (14 out of 15)
• Steiner – 60% (9 out of 15)
Average grade received by a student named William from the EECS department
Experiments - Effectiveness
• SQAK – 100%
• Steiner – 87% (13 of 15)
TPCH Database
Experiments - Savings
CS num students
VS.
SELECT DEPARTMENT.NAME, count(STUDENTS.ID)
FROM STUDENTS, DEPARTMENT
WHERE DEPARTMENT.DEPTID = STUDENTS.DEPTID
  AND lower(DEPARTMENT.NAME) LIKE '%cs%'
GROUP BY DEPARTMENT.NAME
Experiments - Savings
• Measure: # of schema elements + # of join conditions
• M(CS num students) = 1
• M(SQL) = 5
SELECT DEPARTMENT.NAME, count(STUDENTS.ID)
FROM STUDENTS, DEPARTMENT
WHERE DEPARTMENT.DEPTID = STUDENTS.DEPTID
  AND lower(DEPARTMENT.NAME) LIKE '%cs%'
GROUP BY DEPARTMENT.NAME
• Saved = 5 – 1 = 4
Experiments - Savings
• Measure: # of schema elements + # of join conditions
• Not taken into account
– SQL syntax and correctness
– Top1 single and double level constructions
Experiments - Savings
• Measure: # of schema elements + # of join conditions
• Average Savings: [chart not preserved in the transcript]
Experiments - Parameters
• Mismatch tolerance threshold: Match(keyword to column) < threshold ? MATCH : NOT_MATCH
• f – Mismatch penalty (used by the scorer)
Experiments - Parameters
• Low sensitivity to mismatch threshold [0.4, 0.8]
• At lower thresholds wrong columns were selected for CI
• As expected, lower penalty (f=1.5) leads to lower accuracy
• Robust: f=2 or 3, threshold between 0.4 and 0.8
Experiments – More Schemas
• Large DB of IT assets of a large enterprise
– 600 tables, each with several columns
– Sample queries provide accuracy as presented
– Generating SQL, always less than 1 second
• Warehouse DB
– 14 tables, star schema
• Star schema is easier for SQAK – no backtracking
Progress
The need
The goal
Research Yada
Experiments
Other Challenges
• Power vs. Ease of Use
• Overview
Other challenges
• Approximate Matching
– The user doesn’t know column names
– Proposal: hint list, either on paper or in code, or ontology based normalization
• Missing Referential Integrity
– Referential constraints not defined by the DBA
– Proposal: Use (out of scope) algorithm to discover referential constraints
Other challenges
• Tied or Close Plans
– Several SQNs with (close to) best score
– Can occur in similar names, different semantic areas of DB
– The user selects the relevant SQL
– Future research: Visualizing the interesting SQNs
Other challenges
• Expressiveness
– Users adopt SQAK quickly and pose queries such as age > 18
– SQAK adds an appropriate WHERE clause
Power vs. Ease of Use
Overview
• SQAK - A system to create SQL queries with aggregates from keywords
• Useful – No knowledge of the schema and no changes to the database are required
• Expressive, but with limitations
• Trade-off between correctness and computability cost
• Execution in many common cases – polynomial, exponential at worst case
Personal opinion
• The idea is powerful
• Experiments on industrial data would emphasize the strengths and the weaknesses
• Some of the results were expected (Steiner tree)
• Translating CI to SQN is solving the same NP-complete problem repeatedly. A caching mechanism would be very beneficial
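One possible shape for such a cache, sketched with functools.lru_cache. The function build_sqn is a hypothetical stand-in for the expensive sub-graph search, and keying the cache on the aggregate table plus the set of keyword tables is an assumption about what determines the result.

```python
from functools import lru_cache


def build_sqn(aggregate_table, tables):
    """Placeholder for the expensive SQN search (hypothetical)."""
    return (aggregate_table, tuple(sorted(tables)))


@lru_cache(maxsize=None)
def cached_sqn(aggregate_table, tables):
    # 'tables' must be hashable, so callers pass a frozenset.
    return build_sqn(aggregate_table, tables)


def sqn_for(aggregate_table, ci_tables):
    """Look up the SQN for a CI, ignoring the order of its tables."""
    return cached_sqn(aggregate_table, frozenset(ci_tables))
```

Two candidate interpretations that touch the same tables, such as `sqn_for("Students", ["Courses", "Students"])` and `sqn_for("Students", ["Students", "Courses"])`, then trigger only one search.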
Questions?
Merci