query optimization over crowdsourced data
DESCRIPTION
Presented at VLDB 2013.
TRANSCRIPT
Query Optimization over Crowdsourced Data
Hyunjung Park, Jennifer Widom
Stanford University
Deco: Declarative Crowdsourcing
Example crowd questions:
• Give me a Spanish-speaking country.
• Give me a country.
• What language do they speak in country X?
• What is the capital of country X?
8/27/2013 Hyunjung Park 2
“Find the capitals of eight Spanish-speaking countries”
[Figure: a DBMS and the Deco System, each shown with the table below.]
country   language   capital
Italy     Italian    Rome
Spain     Spanish    Madrid
…         …          …
Deco Query Optimization
• Crowd incurs monetary cost
• Some query plans are much cheaper than others
• Cost estimation is complicated by:
  – Previously collected data
  – Unknown database state
  – Inconsistency of human answers
Outline
• Motivating example
• Deco data model and queries
• Cost and cardinality estimation
• Experimental results
Everything is implemented in a full prototype.
Motivating Example: Plan 1
“Find the capitals of eight Spanish-speaking countries”
[Figure: flowchart of Plan 1. Crowd questions: “Give me a country.” (shown repeatedly), “What language do they speak in country X?”, “What is the capital of country X?”. Decision nodes “unseen” and “Spanish”, each with T/F branches, and a label “8x”.]
Motivating Example: Plan 2
“Find the capitals of eight Spanish-speaking countries”
[Figure: flowchart of Plan 2. Crowd questions: “Give me a Spanish-speaking country.”, “What language do they speak in country X?”, “What is the capital of country X?”. Decision nodes “unseen” and “Spanish”, each with T/F branches, and a label “8x”.]
Preview of Experimental Results
[Figure: bar chart of actual costs spent on Mechanical Turk ($, axis 0 to 15) for Plan 1 and Plan 2, broken down by question: “What is the capital of country X?”, “What language do they speak in country X?”, “Give me a Spanish-speaking country.”, “Give me a country.”]
Outline
• Motivating example
• Deco data model and queries
• Cost and cardinality estimation
• Experimental results
Deco: Data Model (1/2)
• Conceptual Relation: visible to end-users
  Country (country, language, capital)
• Resolution Rules: cleanse raw data using UDFs
  country: dupElim
  language: majority(3)
  capital: majority(3)
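The two resolution rules named on this slide can be illustrated with a minimal Python sketch. The function names, the case-insensitive matching, and the "2 of 3 votes" reading of majority(3) are assumptions made for illustration, not Deco's actual UDF implementations.

```python
from collections import Counter

def dup_elim(values):
    """Sketch of dupElim: keep one copy of each raw value, in first-seen order."""
    seen = set()
    resolved = []
    for v in values:
        key = v.strip().lower()  # assumption: matching ignores case and whitespace
        if key not in seen:
            seen.add(key)
            resolved.append(v.strip())
    return resolved

def majority(values, n=3):
    """Sketch of majority(n): return the (normalized) value that holds a strict
    majority among the first n answers, or None if no value has one yet."""
    needed = n // 2 + 1  # 2 votes when n = 3
    counts = Counter(v.strip().lower() for v in values[:n])
    if not counts:
        return None
    value, votes = counts.most_common(1)[0]
    return value if votes >= needed else None

print(dup_elim(["Spain", "spain ", "Peru"]))
print(majority(["Spanish", "Catalan", "Spanish"]))
```

Under this reading, a value resolves after two agreeing answers, and a third answer is needed only when the first two disagree.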
Deco: Data Model (2/2)
• Fetch Rules: “access methods” for the crowd
  language => country   “Give me a {language}-speaking country.”
  Ø => country          “Give me a country.”
  country => language   “What language do they speak in {country}?”
  country => capital    “What is the capital of {country}?”
  Price labels on the slide: [$0.05] [$0.01] [$0.02] [$0.03]
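One way to picture a fetch rule is as a small record holding the given attributes, the fetched attribute, and a question template. The sketch below uses the four rules and question texts from the slide; the `FetchRule` class and its `question` method are hypothetical names, not part of Deco's interface, and prices are left out.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class FetchRule:
    given: Tuple[str, ...]  # left-hand side attributes; empty tuple stands for Ø
    fetched: str            # attribute the crowd is asked to supply
    template: str           # question text with {attribute} placeholders

    def question(self, **values):
        """Fill the template with values for the given attributes."""
        missing = [a for a in self.given if a not in values]
        if missing:
            raise ValueError(f"missing values for: {missing}")
        return self.template.format(**values)

FETCH_RULES = [
    FetchRule(("language",), "country", "Give me a {language}-speaking country."),
    FetchRule((), "country", "Give me a country."),
    FetchRule(("country",), "language", "What language do they speak in {country}?"),
    FetchRule(("country",), "capital", "What is the capital of {country}?"),
]

print(FETCH_RULES[0].question(language="Spanish"))
print(FETCH_RULES[3].question(country="Peru"))
```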
Deco: Queries
• Deco query: SQL query over conceptual relations
  SELECT country, capital FROM Country WHERE language='Spanish' MINTUPLES 8
• Query processor: access the crowd as needed to produce query result while:
  1. Minimizing monetary cost (query optimizer)
  2. Reducing latency (query execution engine)
Query Optimization
• Find the best query plan in terms of estimated monetary cost
• As in a traditional query optimizer:
  1. Cost and cardinality estimation
  2. Search space
  3. Plan enumeration algorithm
Cost Estimation
• Total monetary cost = ∑_{Fetch F} F.price × F.cardinality
  – Existing data is “free”
• Definition of Cardinality in Deco
  – Total number of expected output tuples from operator until query execution terminates
• Cardinality estimation
  – Final database state needs to be estimated simultaneously
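The cost formula is a price-weighted sum over the plan's Fetch operators. As a minimal sketch, the function below evaluates it in integer cents, using the $0.05 price and the fetch cardinalities the talk reports for its two example plans. The function name and the (price, cardinality) pair representation are assumptions made for illustration.

```python
def total_cost_cents(fetches):
    """Sum of price × cardinality over Fetch operators.

    fetches: iterable of (price_in_cents, estimated_cardinality) pairs.
    Only Fetch operators appear here, since existing data is "free".
    """
    return sum(price * cardinality for price, cardinality in fetches)

# Fetch cardinalities from the talk's two example plans, at 5 cents per fetch.
plan1 = [(5, 100), (5, 200), (5, 20)]
plan2 = [(5, 10), (5, 20), (5, 20)]

print(total_cost_cents(plan1) / 100)  # dollars
print(total_cost_cents(plan2) / 100)  # dollars
```

Working in cents keeps the arithmetic exact; the two totals correspond to the $16.00 and $2.50 estimates given for Plan 1 and Plan 2.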
Cardinality Estimation: Setting
• $0.05 for all fetch rules
• No existing data
• Selectivity factors
  – language='Spanish': 0.1
  – dupElim: 0.8
  – majority(3): 0.4 (=1/2.5)
Cardinality Estimation: Plan 1
SELECT country, capital FROM Country WHERE language='Spanish' MINTUPLES 8
[Figure: query plan tree with operators numbered 1 to 14: MinTuples[8], Project[co,ca], DLOJoin[co], DLOJoin[co], Resolve[dupElim], Resolve[maj3], Resolve[maj3], Filter[la='Spanish'], Scan[CtryA], Fetch[Ø→co], Scan[CtryD2], Fetch[co→ca], Scan[CtryD1], Fetch[co→la].]
Fetch rules: Ø => country, country => language, country => capital
Cardinality labels on the figure: 100, 200, 20
Cost estimation: $0.05 × (100 + 200 + 20) = $16.00
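The three fetch cardinalities for Plan 1 can be reproduced by working backwards from the 8 required result tuples and dividing by the selectivity factors from the Setting slide. The sketch below is a simplified reading that happens to match the slide's numbers. It is not the paper's estimation algorithm, and the helper name is hypothetical.

```python
from fractions import Fraction

# Values from the "Cardinality Estimation: Setting" slide.
MIN_TUPLES = 8
SEL_SPANISH = Fraction("0.1")    # language='Spanish'
SEL_DUPELIM = Fraction("0.8")    # dupElim
SEL_MAJORITY3 = Fraction("0.4")  # majority(3)
PRICE_CENTS = 5                  # $0.05 for all fetch rules

def plan1_fetch_cardinalities():
    """Hypothetical helper: backward derivation for Plan 1 (assumed reading)."""
    spanish_countries = Fraction(MIN_TUPLES)
    # Only 1 in 10 resolved countries passes the filter, so 80 must be checked.
    resolved_countries = spanish_countries / SEL_SPANISH
    return {
        "Ø => country": int(resolved_countries / SEL_DUPELIM),
        "country => language": int(resolved_countries / SEL_MAJORITY3),
        "country => capital": int(spanish_countries / SEL_MAJORITY3),
    }

cards = plan1_fetch_cardinalities()
cost_cents = PRICE_CENTS * sum(cards.values())
print(cards, cost_cents / 100)
```

Exact fractions are used instead of floats so that divisions such as 80 / 0.4 come out as whole numbers without rounding concerns.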
Cardinality Estimation: Plan 2
SELECT country, capital FROM Country WHERE language='Spanish' MINTUPLES 8
[Figure: query plan tree with operators numbered 1 to 14 (one labeled 8a): MinTuples[8], Project[co,ca], DLOJoin[co], DLOJoin[co], Resolve[dupElim], Resolve[maj3], Resolve[maj3], Filter[la='Spanish'], Scan[CtryA], Fetch[la→co], Scan[CtryD2], Fetch[co→ca], Scan[CtryD1], Fetch[co→la].]
Fetch rules: language => country, country => language, country => capital
Cardinality labels on the figure: 10, 20, 20
Cost estimation: $0.05 × (10 + 20 + 20) = $2.50
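Plan 2's numbers follow from the same backward reasoning, with one change: the sketch assumes that every country returned by language => country passes the Spanish filter, so the 0.1 selectivity no longer inflates the count. This is an assumed simplification that reproduces the slide's figures, not the paper's algorithm, and the helper name is hypothetical.

```python
from fractions import Fraction

MIN_TUPLES = 8
SEL_DUPELIM = Fraction("0.8")    # dupElim
SEL_MAJORITY3 = Fraction("0.4")  # majority(3)
PRICE_CENTS = 5                  # $0.05 for all fetch rules

def plan2_fetch_cardinalities():
    """Hypothetical helper: backward derivation for Plan 2 (assumed reading)."""
    # Assumption: countries fetched for 'Spanish' all satisfy the filter,
    # so only 8 distinct countries are needed.
    countries = Fraction(MIN_TUPLES)
    return {
        "language => country": int(countries / SEL_DUPELIM),
        "country => language": int(countries / SEL_MAJORITY3),
        "country => capital": int(countries / SEL_MAJORITY3),
    }

cards = plan2_fetch_cardinalities()
cost_cents = PRICE_CENTS * sum(cards.values())
print(cards, cost_cents / 100)
```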
Experimental Results
[Figure: bar charts of actual cost ($) for Plan 1 (axis 0 to 15) and Plan 2 (axis 0 to 3), broken down by fetch rule: country => capital, country => language, language => country, Ø => country.]
Experimental Results
[Figure: bar charts comparing actual and estimated cost ($) for Plan 1 (axis 0 to 15) and Plan 2 (axis 0 to 3), broken down by fetch rule: country => capital, country => language, language => country, Ø => country.]
Related Work
• Declarative approach for crowdsourcing
  – Arnold, CrowdDB, CrowdSearcher, Jabberwocky, Qurk, ...
• Crowd-powered algorithms/operations
  – Filter, sort, join, max, entity resolution, …
• Also:
  – Traditional query optimization
  – Heterogeneous or federated database systems
Summary
• Cost estimation in Deco
  – Distinguish between existing data vs. new data
  – Estimate cardinality and final database state simultaneously
• In the paper:
  – Full description of cost estimation and plan enumeration algorithms
  – More experimental results
Thank you!