Content-Based Routing: Different Plans for Different Data
Pedro Bizarro
Joint work with Shivnath Babu, David DeWitt, Jennifer Widom
September 1, 2005, VLDB 2005
2
Introduction
• The opportunity to improve:
  – Optimizers pick a single plan for a query
  – However, different subsets of data may have very different statistical properties
  – May be more efficient to use different plans for different subsets of data
3
Overview of CBR
• Eliminates the single-plan assumption
• Identifies tuple classes
• Uses multiple plans, each customized for a different tuple class
• CBR applies to any streaming data:
  – E.g.: stream systems, regular DBMS operators using iterators, and acquisitional systems
• Adaptive, low-overhead algorithm
• Implemented in TelegraphCQ as an extension to Eddies
4
Overview of Eddies
• Eddy routes tuples through a pool of operators
• Routing decisions based on operator characteristics: selectivity, cost, queue size, etc.
[Figure: an Eddy routing a stream of input tuples among operators O1, O2, O3 and producing output tuples.]
• Routing: tuples are not differentiated based on content
• We call it SBR: Source-Based Routing
5
Content-Based Routing Example
• Consider stream S processed by O1, O2, O3
Overall operator selectivities:
                 O1     O2     O3
  Selectivities  30%    40%    60%
• Best routing order is O1, then O2, then O3
6
Content-Based Routing Example
• Let A be an attribute with domain {a,b,c}
Content-specific selectivities:
  Value of A   O1     O2     O3
  A = a        32%    10%    55%
  A = b        31%    20%    65%
  A = c        27%    90%    60%
  Overall      30%    40%    60%
• Best routing order for A=a: O2, O1, O3
• Best routing order for A=b: O2, O1, O3
• Best routing order for A=c: O1, O3, O2
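To make these orderings concrete, here is a minimal sketch (Python, an illustration rather than anything from the talk) that computes the expected number of operators a tuple visits under each ordering, assuming equal operator costs and independent selectivities:

    from itertools import permutations

    # Pass rates (selectivities) for tuples with A = c, from the table above.
    sel = {"O1": 0.27, "O2": 0.90, "O3": 0.60}

    def expected_ops(order):
        # Expected number of operators a tuple visits: it always reaches the first
        # operator and reaches each later one only if it passed all earlier ones.
        visits, pass_prob = 0.0, 1.0
        for op in order:
            visits += pass_prob
            pass_prob *= sel[op]
        return visits

    for order in sorted(permutations(sel), key=expected_ops):
        print(order, round(expected_ops(order), 3))
    # (O1, O3, O2) comes out cheapest for A = c; plugging in the A = a or A = b
    # rows instead yields O2, O1, O3, matching the orderings above.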
7
Classifier Attributes
• Goal: identify tuple classes
  – Each with a different optimal operator ordering
• CBR considers:
  – Tuple classes distinguished by content, i.e., attribute values
• Classifier attribute (informal definition):
  – Attribute A is a classifier attribute for operator O if the value of A is correlated with the selectivity of O
8
Best Classifier Attribute Example:
Attribute A (domain {a, b, c}):        Attribute B (domain {x, y, z}):
            σ      1-σ                             σ      1-σ
  A = a     10%    90%                   B = x     43%    57%
  A = b     20%    80%                   B = y     38%    62%
  A = c     90%    10%                   B = z     39%    61%
  Overall   40%    60%                   Overall   40%    60%
• Which attribute is best to use for routing decisions?
• Similar to an AI problem: classifier attributes for decision trees
  – Two labels: pass operator, dropped by operator
• AI solution: use GainRatio to pick the best classifier attribute
9
GainRatio to Measure Correlation
• R: random sample of tuples processed by operator O; σ is the fraction of R that passes O; R_1, …, R_d are the partitions of R induced by the values of attribute A
• Entropy(R) = -\sigma \log_2 \sigma - (1-\sigma) \log_2 (1-\sigma)
• InfoGain(R, A) = Entropy(R) - \sum_{i=1}^{d} \frac{|R_i|}{|R|} Entropy(R_i)
• SplitInformation(A) = -\sum_{i=1}^{d} \frac{|R_i|}{|R|} \log_2 \frac{|R_i|}{|R|}
• GainRatio(R, A) = \frac{InfoGain(R, A)}{SplitInformation(A)}

Attribute A:                           Attribute B:
            σ      1-σ                             σ      1-σ
  A = a     10%    90%                   B = x     43%    57%
  A = b     20%    80%                   B = y     38%    62%
  A = c     90%    10%                   B = z     39%    61%
  Overall   40%    60%                   Overall   40%    60%

GainRatio(R, A) = 0.87    GainRatio(R, B) = 0.002
Formulas from T. Mitchell, Machine Learning. McGraw-Hill, '97.
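To illustrate how these formulas are applied, here is a minimal Python sketch (not TelegraphCQ code) that computes GainRatio from per-partition pass/drop counts; the partition sizes below are assumptions, so the printed values need not reproduce the 0.87 and 0.002 above:

    import math

    def entropy(p):
        # Binary entropy of a pass probability p (two labels: pass / dropped).
        if p <= 0.0 or p >= 1.0:
            return 0.0
        return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

    def gain_ratio(partitions):
        # partitions: one (passed, dropped) count pair per partition R_i of sample R.
        total = sum(p + d for p, d in partitions)
        overall = sum(p for p, _ in partitions) / total
        # InfoGain(R, A) = Entropy(R) - sum_i |R_i|/|R| * Entropy(R_i)
        weighted = sum((p + d) / total * entropy(p / (p + d)) for p, d in partitions)
        info_gain = entropy(overall) - weighted
        # SplitInformation(A) = -sum_i |R_i|/|R| * log2(|R_i|/|R|)
        split = -sum((p + d) / total * math.log2((p + d) / total) for p, d in partitions)
        return info_gain / split if split > 0 else 0.0

    # Attribute A from the example (pass rates 10%, 20%, 90%), 50 tuples per partition.
    print(gain_ratio([(5, 45), (10, 40), (45, 5)]))    # well above any small threshold
    # Attribute B (pass rates close to the overall 40% in every partition).
    print(gain_ratio([(21, 29), (19, 31), (20, 30)]))  # close to zero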
10
Classifier Attributes: Definition
An attribute A is a classifier attribute for operator O if, for any large random sample R of tuples processed by O, GainRatio(R, A) exceeds some threshold.
11
Content-Learns Algorithm: Learning Routes Automatically
• Content-Learns consists of two continuous, concurrent steps:
  – Optimization: for each Ol ∈ {O1, …, On}, either determine that Ol does not have a classifier attribute, or find the best classifier attribute, Cl, of Ol
  – Routing: route tuples according to:
    • the selectivity of Ol, if Ol does not have a classifier attribute, or
    • the content-specific selectivities of the pair <Ol, Cl>, if Cl is the best classifier attribute of Ol
12
Content-Learns: Optimization Step
[Figure: the In[] and Out[] counter arrays (tuples in, tuples out) kept while operator 3 is being profiled.]
• Find Cl by profiling Ol:
  – Route a fraction of the input tuples to Ol
  – For each sampled tuple, for each attribute:
    • map the attribute value to one of d partitions
    • update pass/fail counters
  – When all sample tuples have been seen, compute Cl (see the sketch after the figure below)
[Figure: example with 4 operators, 3 attributes, and 2 partitions per attribute. CA[] holds each operator's classifier attribute (-1 if none); a sampled tuple's attribute values are mapped by f1, f2, f3 to partitions, and the corresponding pass/fail counters are updated.]
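A rough sketch of this bookkeeping, reusing the gain_ratio() helper from the earlier sketch (class, method, and parameter names, the hash-based partition function, and the 0.05 threshold are illustrative assumptions, not the TelegraphCQ implementation):

    class OperatorProfiler:
        # Profiles one operator O_l: keeps per (attribute, partition) pass/drop
        # counters and uses them to pick the operator's best classifier attribute C_l.
        def __init__(self, num_attrs, d=2, sample_size=150, threshold=0.05):
            self.d = d                              # partitions per attribute
            self.sample_size = sample_size          # |R|
            self.threshold = threshold              # GainRatio threshold (assumed value)
            self.passed = [[0] * d for _ in range(num_attrs)]
            self.dropped = [[0] * d for _ in range(num_attrs)]
            self.seen = 0

        def partition(self, value):
            # f_j: map an attribute value to one of d partitions.
            return hash(value) % self.d

        def observe(self, attr_values, tuple_passed):
            # Called once per sampled tuple routed through the profiled operator.
            counters = self.passed if tuple_passed else self.dropped
            for j, value in enumerate(attr_values):
                counters[j][self.partition(value)] += 1
            self.seen += 1
            if self.seen < self.sample_size:
                return None                         # sample not complete yet
            return self.best_classifier()           # sample full: compute C_l

        def best_classifier(self):
            # Attribute index with the highest GainRatio above the threshold,
            # or -1 if this operator has no classifier attribute.
            best, best_gr = -1, self.threshold
            for j in range(len(self.passed)):
                parts = [(p, d) for p, d in zip(self.passed[j], self.dropped[j]) if p + d]
                gr = gain_ratio(parts)              # from the GainRatio sketch above
                if gr > best_gr:
                    best, best_gr = j, gr
            return best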
13
Content-Learns: Routing Step
• SBR routes to Ol with probability inversely proportional to Ol's selectivity, W[l]
• CBR routes a tuple t to the operator with the minimum routing value:
  – If Ol does not have a classifier attribute, its routing value is W[l]
  – If Ol has a classifier attribute, its routing value is S[l, i], where j = CA[l] and i = fj(t.Cj)
[Figure: example with 4 operators and 2 partitions. W[] holds the overall operator selectivities (25%, 40%, 50%, 60%), CA[] the classifier attributes (-1 if none), and S[] the detailed per-partition selectivities; a tuple's classifier-attribute values are mapped to partitions to look up its content-specific selectivities.]
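A minimal sketch of this routing rule (array names follow the slide: W[], CA[], S[]; the pending set, the partition function, and the example numbers are illustrative assumptions):

    def route_next(attr_values, pending, W, CA, S, partition_fn):
        # Pick the tuple's next operator: the one with the smallest routing value.
        #   W[l]     overall selectivity of operator l
        #   CA[l]    index of operator l's classifier attribute, or -1 if it has none
        #   S[l][i]  selectivity of operator l for tuples whose classifier-attribute
        #            value falls in partition i
        best_op, best_value = None, float("inf")
        for l in pending:                   # operators the tuple has not visited yet
            j = CA[l]
            if j == -1:
                value = W[l]                # no classifier attribute: use W[l]
            else:
                value = S[l][partition_fn(j, attr_values[j])]   # content-specific
            if value < best_value:
                best_op, best_value = l, value
        return best_op

    # Tiny usage example with made-up numbers (0-indexed attributes and operators;
    # a real system would use a stable partition function rather than hash()).
    W = [0.25, 0.40, 0.50, 0.60]
    CA = [1, -1, 1, 0]                      # operator 1 has no classifier attribute
    S = [[0.50, 0.00], [], [0.20, 0.80], [0.75, 0.55]]
    print(route_next(["a", 42, 3.2], {0, 1, 2, 3}, W, CA, S,
                     lambda j, v: hash(v) % 2))   # prints 2 (lowest routing value)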
15
CBR Update Overheads
• Once per tuple:
  – update selectivities, keeping them as fresh as possible
• Once per sampled tuple:
  – update correlations between operators and content
• Once per sample (~2500 tuples):
  – compute GainRatio and update one entry in array CA
[Figure: the data structures maintained by CBR for operators 1..n, attributes 1..k, and partitions 1..d: W[] (operator selectivities), CA[] (classifier attributes), S[] (detailed selectivities), and the In[]/Out[] counters (tuples in, tuples out).]
16
Experimental Results: Datasets
• Stream-star dataset
  – Synthetic dataset based on a star schema:
      SELECT * FROM stream S, d1, d2, …, dN
      WHERE s.fkd1 = d1.pk   // Operator O1
        …
        AND s.fkdN = dN.pk;  // Operator ON
  – Attribute S.attrC is the best classifier attribute for all operators
  – 8 other attributes in S are not correlated with operator selectivities
  – 100K records in the stream, 10K records in the dimension tables
• Lab dataset
  – Explained later
17
Experimental Results: System, Metrics, Defaults
• TelegraphCQ version 0.2
• Tao Linux release 1, 512 MB RAM, Pentium 4 at 2.4 GHz
• Metrics: % improvement in running time and in routing calls
• Default values:
    Parameter    Default      Comment
    P            6%           Tuple sampling probability
    |R|          150 tuples   Sample size
    d            24           Number of partitions
    Confidence   95%          Confidence interval in graphs
18
Experimental Results: Run-time Overheads
• Routing overhead:
  – time to perform routing decisions (SBR, CBR)
• Learning overhead:
  – time to update data structures (SBR, CBR), plus
  – time to compute GainRatio (CBR only)
[Chart: learning and routing overhead per tuple, in microseconds, for SBR vs. CBR with 4, 6, and 8 joins.]
Overhead increase: 30%-45%
19
Experimental Results: Varying Skew
• One operator with selectivity A, all others with selectivity B
• Skew is A - B. A varied from 5% to 95%
• Overall selectivity: 5%
[Chart: % improvement over SBR vs. skew (A - B), 6 joins.]
20
Experimental Results: Random Selectivities
• Attribute attrC correlated with the selectivities of the operators
• Other attributes in stream tuples not correlated with selectivities
• Random selectivities in each operator
[Charts: breakdown of routing calls (using right classifier, profiling, not using a classifier, using wrong classifier) and % improvement over SBR in routing calls and execution time, for 4, 6, and 8 joins.]
21
Experimental Results: Varying Aggregate Selectivity
• Aggregate selectivity in the previous experiments was 5% or ~8%
• Here we vary the aggregate selectivity between 5% and 35%
• Random selectivities within these bounds
[Chart: % improvement over SBR in routing calls and execution time at aggregate selectivities of 5%, 15%, 25%, and 35%, 6 joins.]
22
Experimental Results: Datasets
• Lab dataset
  – Real data
  – 2.2 M readings from 54 sensors (Intel Research Berkeley Lab)
  – Single stream with attributes: Light, Humidity, Temperature, Voltage, sensorID, Year, Month, Day, Hours, Minutes, Seconds
23
Experimental Results: Challenging Adaptivity Experiment (1)
• Using Lab dataset
• Example query:
    SELECT * FROM sensors WHERE light > 500;
• Observations:
  – Very high variation in selectivity
  – Best classifier attributes change with time
  – No classifier attribute found for over half the time
24
Experimental Results: Challenging Adaptivity Experiment (2)
• Query:
    SELECT * FROM sensors
    WHERE light BETWEEN lowL AND highL
      AND temperature BETWEEN lowT AND highT
      AND humidity BETWEEN lowH AND highH
      AND voltage BETWEEN lowV AND highV;
• lowX: random number from the lower 25% of the domain
• highX: random number from the upper 25% of the domain
• Results for 50 different queries (a sketch of the query generation follows below)
• Average improvement of:
  – 8% in routing calls
  – 5% in execution time
  – 7% in time spent evaluating operators
  – 18% in routing calls until a tuple is dropped
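A small sketch of how such queries could be generated (the attribute domains below are placeholders; the actual Lab dataset ranges are not given on the slide):

    import random

    # Hypothetical attribute domains (lo, hi) for the four range predicates.
    DOMAINS = {"light": (0, 1000), "temperature": (10, 40),
               "humidity": (0, 100), "voltage": (2.0, 3.0)}

    def random_query():
        predicates = []
        for attr, (lo, hi) in DOMAINS.items():
            span = hi - lo
            low = random.uniform(lo, lo + 0.25 * span)    # lowX from the lower 25%
            high = random.uniform(hi - 0.25 * span, hi)   # highX from the upper 25%
            predicates.append(f"{attr} BETWEEN {low:.2f} AND {high:.2f}")
        return "SELECT * FROM sensors WHERE " + " AND ".join(predicates) + ";"

    print(random_query())   # one of the 50 randomly generated queries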
25
Experimental Results: Varying Operator Cost
• Run a random query from the previous slide
• Run the query for periods with correlations
• Varied operator cost by running CPU-intensive computations
[Chart: % improvement over SBR in routing calls and execution time as operator cost varies from 0 to 500 microseconds, 4 operators.]
26
Conclusions
• CBR eliminates the single-plan assumption
• Explores the correlation between tuple content and operator selectivities
• Adaptive learner of correlations with negligible overhead
• Performance improvements over non-CBR routing
• Selectivities change much more than correlations do
27
Acknowledgements
• Sam Madden and Amol Deshpande for providing the Lab dataset.
• Sailesh Krishnamurthy, Amol Deshpande, Joe Hellerstein, and the rest of the TelegraphCQ team for providing TelegraphCQ and answering all my questions.
30
Motivational Example (1): Intrusion Detection Query
• "Track packets with a destination address matching a prefix in table T, and containing the 100-byte and 256-byte sequences "0xa...8" and "0x7...b", respectively, as subsequences"
• Query:
    SELECT * FROM packets
    WHERE matches(destination, T)       // Operator O1
      AND contains(data, "0xa...8")     // Operator O2
      AND contains(data, "0x7...b");    // Operator O3
31
Motivational Example (2): Intrusion Detection Query
• Assume:
  – costs are: c3 > c1 > c2
  – selectivities are: σ3 > σ1 > σ2
• SBR routing converges to O2, O1, O3
[Figure: SBR. The stream of tuples is routed through O1, O2, O3; almost all tuples follow the same single route.]
32
Motivational Example (3): Intrusion Detection Query
• Suppose an attack (O2 and O3) on a network whose prefix is not in T (O1) is underway:
  – σ2 and σ3 will be very high, σ1 will be very low
  – O1, O2, O3 will be the most efficient ordering for "attack" tuples
[Figure: SBR vs. CBR on the same stream. Under SBR almost all tuples follow one route; under CBR, attack tuples and non-attack tuples follow different routes through O1, O2, O3, distinguished by the address.]
33
Experimental Results: Varying Skew
• One operator with selectivity A, all others with selectivity B
• Skew is A - B. A varied from 5% to 95%
• Overall selectivity: 5%
[Charts: % improvement over SBR vs. skew (A - B), for 2 joins and for 6 joins.]
34
Related Work

Adaptivity in Stream Systems
• [Avnur+] Eddies: Continuously Adaptive Query Processing. SIGMOD'00.
• [Arasu+] STREAM: The Stanford Stream Data Manager. DEBul 26(1).
• [Babu+] Adaptive ordering of pipelined stream filters. SIGMOD'04.
• [Babu+] StreaMon: An Adaptive Engine for Stream Query Processing. SIGMOD'04.
• [Carney+] Monitoring streams – a new class of data management applications. VLDB'02.
• [Chandrasekaran+] TelegraphCQ: Continuous dataflow processing for an uncertain world. CIDR'03.
• [Chandrasekaran+] PSoup: a system for streaming queries over streaming data. VLDBj 12(2).
• [Deshpande] An initial study of overheads of eddies. SIGMODrec 33(1).
• [Deshpande+] Lifting the Burden of History from Adaptive Query Processing. VLDB'04.
• [Madden+] Continuously adaptive continuous queries over streams. SIGMOD'02.
• [Raman+] Using state modules for adaptive query processing. ICDE'03.
• [Tian+] Tuple Routing Strategies for Distributed Eddies. VLDB'03.
• [Viglas+] Maximizing the Output Rate of Multi-Way Join Queries over Streaming Information Sources. VLDB'03.

AQP Surveys
• [Babu+] Adaptive Query Processing in the Looking Glass. CIDR'05.
• [Hellerstein+] Adaptive query processing: Technology in evolution. DEBul 23(2).

Adaptivity in DBMS
• [Babu+] Proactive Re-optimization. SIGMOD'05.
• [Graefe+] Dynamic query evaluation plans. SIGMOD'89.
• [Kabra+] Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plans. SIGMOD'98.
• [Markl+] Robust Query Processing through Progressive Optimization. SIGMOD'04.
• [Wong+] Decomposition - a strategy for query processing. TODS 1(3).

Adaptivity in Distributed Systems
• [Bouganim+] Dynamic query scheduling in data integration systems. ICDE'00.
• [Ives+] An adaptive query execution system for data integration. SIGMOD'99.
• [Ives+] Adaptive query processing for internet applications. DEBul 23(2).
• [Ives+] Adapting to Source Properties in Processing Data Integration Queries. SIGMOD'04.
• [Ives] Efficient Query Processing for Data Integration. Ph.D. thesis.
• [Ng+] Dynamic query re-optimization. ICSSDM'99.
• [Urhan+] Dynamic pipeline scheduling for improving interactive performance of online queries. VLDB'01.
• [Urhan+] Cost based query scrambling for initial delays. SIGMOD'98.
• [Zhu+] Dynamic plan migration for continuous queries over data streams. SIGMOD'04.
35
Related Work (cont'd)

Acquisitional Systems
• [Madden+] The Design of an Acquisitional Query Processor for Sensor Networks. SIGMOD'03.
• [Deshpande+] Model-Driven Data Acquisition in Sensor Networks. VLDB'04.
• [Deshpande+] Exploiting Correlated Attributes in Acquisitional Query Processing. ICDE'05.

Multiple Plans in Same Query
• [Antoshenkov+] Query processing and optimization in Oracle Rdb. VLDBj 5(4).
• [Bizarro+] Content-Based Routing: Different Plans for Different Data. VLDB'05.
• [Polyzotis] Selectivity-Based Partitioning: A Divide-and-Union Paradigm For Effective Query Optimization. Unpublished.

URDs and Modeling Uncertainty
• [Anderson+] Index key range estimator. U.S. Patent 4,774,657.
• [Babcock+] Towards a Robust Query Optimizer: A Principled and Practical Approach. SIGMOD'05.
• [Chu+] Least expected cost query optimization: An exercise in utility. PODS'99.
• [Ramamurthy+] Buffer-pool Aware Query Optimization. CIDR'05.
• [Viglas] Novel Query Optimization and Evaluation Techniques. Ph.D. thesis.

Optimization, Cardinality Estimation, Correlations
• [Selinger+] Access Path Selection in a Relational Database Management System. SIGMOD'79.
• [Acharya+] Join Synopses for Approximate Query Answering. SIGMOD'99.
• [Christodoulakis] Implications of certain assumptions in database performance evaluation. TODS 9(2).
• [Graefe] Query Evaluation Techniques for Large Databases. ACM Comput. Surv. 25(2).
• [Getoor+] Selectivity Estimation using Probabilistic Models. SIGMOD'01.
• [Ioannidis+] On the Propagation of Errors in the Size of Join Results. SIGMOD'91.
• [Ilyas+] CORDS: Automatic discovery of correlations and soft functional dependencies. SIGMOD'04.
• [Stillger+] LEO - DB2's LEarning Optimizer. VLDB'01.

AI and Learning from Streams
• [Mitchell] Machine Learning. McGraw-Hill, '97.
• [Guha+] Data-streams and histograms. ACM Symp. on Theory of Computing, '01.
• [Domingos+] Mining high-speed data streams. SIGKDD'00.
• [Gehrke+] On computing correlated aggregates over continual data streams. SIGMOD'01.