Top-k Queries on Uncertain Data
Advisor: Prof. 陳良弼
Presenter: 鄧雅文 (97753034)
Outline
Introduction
Related Work
Problem Formulation
Future Work
Introduction
Top-k query on certain data
  ◦ Rank results according to a user-defined score
  ◦ Important for exploring large databases
  ◦ E.g., top-2 = {T1, T2}

TID  PID  Score
T1   A    100
T2   B    90
T3   C    80
T4   D    70
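As a minimal sketch of the certain-data case, the top-2 answer can be reproduced by sorting the four rows of the table above by score. The function name `top_k` and the tuple layout are illustrative choices, not part of the slides.

```python
import heapq

# Rows of the certain table above: (TID, PID, score).
rows = [("T1", "A", 100), ("T2", "B", 90), ("T3", "C", 80), ("T4", "D", 70)]


def top_k(rows, k):
    """Return the TIDs of the k highest-scoring rows, best first."""
    best = heapq.nlargest(k, rows, key=lambda r: r[2])
    return [tid for tid, _pid, _score in best]


print(top_k(rows, 2))  # ['T1', 'T2']
```

With certain data every row exists, so the answer is unique; the uncertain case needs a probabilistic definition instead.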
Introduction (cont.)
Uncertain database
  ◦ How to define top-k on uncertain data?
  ◦ Mutually exclusive rules
      E.g., T1 ⊕ T4 (T1 and T4 cannot both exist)

TID  PID  Score  Pr.
T1   A    100    0.2
T2   B    90     0.9
T3   C    80     0.6
T4   A    70     0.8
…    …    …      …
Related Work
C. C. Aggarwal and P. S. Yu. A Survey of Uncertain Data Algorithms and Applications. In TKDE, 2009.
  ◦ Causes of uncertainty:
      Sensor networks, privacy, trajectory prediction, …
  ◦ The main areas of research on uncertain data:
      Modeling of uncertain data
      Uncertain data management: top-k query, range query, NN query, …
      Uncertain data mining: clustering, classification, frequent patterns, outliers, …
Related Work (cont.)
M. Soliman, I. Ilyas, and K. Chang. Top-k Query Processing in Uncertain Databases. In ICDE, 2007.
  ◦ Possible worlds
Related Work (cont.)
  ◦ U-Topk query
      Return k tuples that can co-exist in a possible world with the highest probability
      E.g., {T1, T2} as U-Top2
  ◦ U-kRanks query
      Return k tuples, each of which is a clear winner in its rank over all possible worlds
      E.g., {T2, T6} as U-2Ranks
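The two query definitions can be made concrete with a brute-force sketch over possible worlds. The data the slide examples refer to (T1 to T6) is not reproduced in the text, so this sketch instead uses the four rows listed in the Introduction table and assumes no other rows exist (the elided rows are ignored). All names (`possible_worlds`, `u_topk`, `u_kranks`) are hypothetical, and skipping worlds with fewer than k tuples in `u_topk` is a simplifying assumption.

```python
from collections import defaultdict
from itertools import product

# TID -> (score, probability). Assumption: only the four listed rows exist.
TUPLES = {"T1": (100, 0.2), "T2": (90, 0.9), "T3": (80, 0.6), "T4": (70, 0.8)}
# Exclusion groups: tuples in the same group never co-exist (T1 xor T4).
GROUPS = [["T1", "T4"], ["T2"], ["T3"]]


def possible_worlds():
    """Yield (TIDs ranked by score, probability) for every possible world."""
    options = []
    for group in GROUPS:
        # Either exactly one member of the group exists, or none does.
        choices = [(tid, TUPLES[tid][1]) for tid in group]
        choices.append((None, 1.0 - sum(p for _, p in choices)))
        options.append(choices)
    for combo in product(*options):
        prob = 1.0
        for _, p in combo:
            prob *= p
        if prob <= 1e-12:  # skip worlds that cannot occur
            continue
        present = [tid for tid, _ in combo if tid is not None]
        yield sorted(present, key=lambda t: -TUPLES[t][0]), prob


def u_topk(k):
    """Most probable top-k vector among worlds holding at least k tuples."""
    mass = defaultdict(float)
    for ranked, prob in possible_worlds():
        if len(ranked) >= k:
            mass[tuple(ranked[:k])] += prob
    return max(mass.items(), key=lambda kv: kv[1])


def u_kranks(k):
    """For each rank 1..k, the tuple most likely to appear at that rank."""
    mass = [defaultdict(float) for _ in range(k)]
    for ranked, prob in possible_worlds():
        for i, tid in enumerate(ranked[:k]):
            mass[i][tid] += prob
    return [max(m.items(), key=lambda kv: kv[1]) for m in mass]


print(u_topk(2))    # vector ('T2', 'T3') with probability about 0.432
print(u_kranks(2))  # T2 wins rank 1 (about 0.72), T3 wins rank 2 (about 0.444)
```

Enumerating every world is exponential in the number of exclusion groups, so this only illustrates the semantics; it is not the processing technique of the cited paper.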
Related Work (cont.)
M. Hua, J. Pei, W. Zhang, X. Lin. Ranking Queries on Uncertain Data: A Probabilistic Threshold Approach. In SIGMOD, 2008.
  ◦ PT-k query
      Return the set of all tuples whose top-k probability values are at least p
      E.g., {T1, T2, T5} as PT-2 (with p = 0.4)
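A PT-k query can be sketched the same brute-force way: sum, for each tuple, the probability of the worlds in which it ranks among the k best, then keep the tuples that reach the threshold p. As the data behind the slide example is not shown, the sketch assumes the Introduction table contains only its four listed rows; `topk_probabilities` and `pt_k` are hypothetical names.

```python
from itertools import product

# TID -> (score, probability). Assumption: only the four listed rows exist.
TUPLES = {"T1": (100, 0.2), "T2": (90, 0.9), "T3": (80, 0.6), "T4": (70, 0.8)}
# Exclusion groups: tuples in the same group never co-exist (T1 xor T4).
GROUPS = [["T1", "T4"], ["T2"], ["T3"]]


def topk_probabilities(k):
    """Pr(tuple is among the k best tuples of a possible world), per tuple."""
    options = [
        [(tid, TUPLES[tid][1]) for tid in group]
        + [(None, 1.0 - sum(TUPLES[tid][1] for tid in group))]
        for group in GROUPS
    ]
    result = {tid: 0.0 for tid in TUPLES}
    for combo in product(*options):
        prob = 1.0
        for _, p in combo:
            prob *= p
        present = [tid for tid, _ in combo if tid is not None]
        present.sort(key=lambda t: -TUPLES[t][0])  # best score first
        for tid in present[:k]:
            result[tid] += prob
    return result


def pt_k(k, p):
    """All tuples whose top-k probability is at least p."""
    return sorted(t for t, pr in topk_probabilities(k).items() if pr >= p)


print(pt_k(2, 0.4))  # ['T2', 'T3']
```

Unlike U-Topk, the size of a PT-k answer is not fixed at k: it depends on how many tuples clear the threshold.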
Related Work (cont.)
T. Ge, S. Zdonik, and S. Madden. Top-k Queries on Uncertain Data: On Score Distribution and Typical Answers. In SIGMOD, 2009.
  ◦ The tradeoff between reporting high-scoring tuples and tuples with a high probability of being in the top-k
  ◦ Return a number of typical vectors that efficiently sample the distribution of all potential top-k tuple vectors
Problem Formulation
Example:
  ◦ In an International Tenpin Bowling Championship, the events include singles, doubles, and trios. Because of the budget, the coach can choose only 3 players to attend. We therefore want these 3 players to have a relatively high probability of performing well across all 3 types of events.
Problem Formulation (cont.)
  ◦ U-Top3 = {T2, T5, T6}
  ◦ But U-Top2 = {T1, T2} and U-Top1 = {T1}
  ◦ How about also considering {T1, T2, T5} as top-3?

TID  Player  Pr.
T1   A       0.4100
T2   D       0.6200
T3   B       0.1400
T4   C       0.3400
T5   C       0.6600
T6   B       0.8600
T7   D       0.3800
T8   A       0.5900

Possible World         Pr.       Possible World         Pr.
PW1  T1, T2, T3, T4    0.0121    PW9   T2, T3, T4, T8   0.0174
PW2  T1, T2, T3, T5    0.0235    PW10  T2, T3, T5, T8   0.0338
PW3  T1, T2, T4, T6    0.0743    PW11  T2, T4, T6, T8   0.1070
PW4  T1, T2, T5, T6    0.1443    PW12  T2, T5, T6, T8   0.2076
PW5  T1, T3, T4, T7    0.0074    PW13  T3, T4, T7, T8   0.0107
PW6  T1, T3, T5, T7    0.0144    PW14  T3, T5, T7, T8   0.0207
PW7  T1, T4, T6, T7    0.0456    PW15  T4, T6, T7, T8   0.0656
PW8  T1, T5, T6, T7    0.0884    PW16  T5, T6, T7, T8   0.1273
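The possible-world table and the U-Topk answers above can be reproduced with a short enumeration. The sketch rests on two assumptions that the tables suggest but do not state: the two tuples of each player are mutually exclusive (their probabilities sum to 1), and tuples are listed in descending score order, so T1 ranks highest. The names `possible_worlds` and `u_topk` are illustrative.

```python
from collections import defaultdict
from itertools import product

# (TID, player, probability), assumed to be listed in descending score order.
TUPLES = [
    ("T1", "A", 0.41), ("T2", "D", 0.62), ("T3", "B", 0.14), ("T4", "C", 0.34),
    ("T5", "C", 0.66), ("T6", "B", 0.86), ("T7", "D", 0.38), ("T8", "A", 0.59),
]


def possible_worlds():
    """Yield (TIDs in rank order, probability); one tuple per player per world."""
    by_player = defaultdict(list)
    for rank, (tid, player, pr) in enumerate(TUPLES):
        by_player[player].append((rank, tid, pr))
    for combo in product(*by_player.values()):
        prob = 1.0
        for _rank, _tid, pr in combo:
            prob *= pr
        yield [tid for _rank, tid, _pr in sorted(combo)], prob


def u_topk(k):
    """Top-k vector with the highest total probability over all worlds."""
    mass = defaultdict(float)
    for ranked, prob in possible_worlds():
        mass[tuple(ranked[:k])] += prob
    return max(mass.items(), key=lambda kv: kv[1])


for k in (3, 2, 1):
    answer, prob = u_topk(k)
    print(k, answer, round(prob, 4))
# 3 ('T2', 'T5', 'T6') 0.2076
# 2 ('T1', 'T2') 0.2542
# 1 ('T1',) 0.41
```

The output shows the issue raised on the slide: the U-Top3 answer does not contain T1, although T1 belongs to both the U-Top2 and the U-Top1 answers.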
Problem Formulation (cont.)
We choose the answers of a top-k query based not only on the probability (P) but also on the confidence (C).
  ◦ Confidence: expresses the top-(k-1) probabilities of the sets formed by k-1 tuples of this possible top-k answer
  ◦ E.g., k = 3
      {T1, T2, T3} is a possible top-3 answer with P = 0.0356
      C is composed in some way of:
        Pr({T1, T2} is the top-2) = 0.2542 and its confidence
        Pr({T1, T3} is the top-2) = 0.0218 and its confidence
        Pr({T2, T3} is the top-2) = 0.0512 and its confidence
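The ingredients of the confidence in this example can be checked numerically. The sketch below computes only the probability that a given set is exactly the top-|set| of a possible world; how these values are combined into C is left open here, since the confidence function itself is not yet formulated. It assumes the bowling data above, with each player's two tuples mutually exclusive and tuples listed in descending score order; `top_set_probability` is a hypothetical helper.

```python
from collections import defaultdict
from itertools import combinations, product

# (TID, player, probability), assumed to be listed in descending score order.
TUPLES = [
    ("T1", "A", 0.41), ("T2", "D", 0.62), ("T3", "B", 0.14), ("T4", "C", 0.34),
    ("T5", "C", 0.66), ("T6", "B", 0.86), ("T7", "D", 0.38), ("T8", "A", 0.59),
]


def possible_worlds():
    """Yield (TIDs in rank order, probability); one tuple per player per world."""
    by_player = defaultdict(list)
    for rank, (tid, player, pr) in enumerate(TUPLES):
        by_player[player].append((rank, tid, pr))
    for combo in product(*by_player.values()):
        prob = 1.0
        for _rank, _tid, pr in combo:
            prob *= pr
        yield [tid for _rank, tid, _pr in sorted(combo)], prob


def top_set_probability(tids):
    """Probability that the given tuples are exactly the top-len(tids) tuples."""
    target = frozenset(tids)
    return sum(
        prob
        for ranked, prob in possible_worlds()
        if frozenset(ranked[: len(target)]) == target
    )


answer = ("T1", "T2", "T3")
print(answer, round(top_set_probability(answer), 4))  # P = 0.0356
for subset in combinations(answer, 2):
    print(subset, round(top_set_probability(subset), 4))
# ('T1', 'T2') 0.2542
# ('T1', 'T3') 0.0218
# ('T2', 'T3') 0.0512
```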
Problem Formulation (cont.)
Since every possible top-k answer has two features, probability (P) and confidence (C), we return only the non-dominated answers as the result set.
  ◦ E.g.,
      {T1, T3, T5}: P = 0.8, C = 0.4
      {T1, T4, T7}: P = 0.5, C = 0.7
      {T2, T6, T7}: P = 0.3, C = 0.2 (dominated, so this will not be returned)
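The non-dominated filter in this example amounts to a two-dimensional skyline over (P, C). A minimal sketch follows, assuming the usual dominance definition (no worse on both features and strictly better on at least one); the slides do not spell the definition out, and the names `dominates` and `non_dominated` are illustrative.

```python
# Candidate top-3 answers from the example, mapped to (P, C).
candidates = {
    ("T1", "T3", "T5"): (0.8, 0.4),
    ("T1", "T4", "T7"): (0.5, 0.7),
    ("T2", "T6", "T7"): (0.3, 0.2),
}


def dominates(a, b):
    """True if a is no worse than b on both P and C and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def non_dominated(cands):
    """Keep the answers whose (P, C) pair is dominated by no other answer."""
    return {
        answer: pc
        for answer, pc in cands.items()
        if not any(dominates(other, pc) for other in cands.values())
    }


print(sorted(non_dominated(candidates)))
# [('T1', 'T3', 'T5'), ('T1', 'T4', 'T7')]
```

The pairwise comparison is quadratic in the number of candidates, which is fine for a handful of answers; generating the result set efficiently is a separate problem.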
Future Work
Formulate the confidence function
Find an algorithm to generate the result set
Try to calculate the confidence in an efficient way
Carry out an empirical study on datasets
Thank you!