Final Exam, May 25, 2007
Quality Management in Multimedia Databases and Data Stream Management Systems
Yicheng Tu
Department of Computer Sciences
Purdue University
Advisor: Prof. Sunil Prabhakar
Quality?
The nature, kind, or character (of something). Hence, the degree or grade of excellence, etc. possessed by a thing. Restricted to cases in which there is comparison (expressed or implied) with other things of the same kind.
- Oxford English dictionary
character with respect to fineness, or grade of excellence …
- Dictionary.com
Our Definition
series of parameters that describe the characteristics of data processing and lead to different degrees of user satisfaction
• Overlaps with the concept of Quality-of-Service (QoS)
• Not data quality
Problems
• Two types of problems
– Determine the quality of concurrent applications for maximal user satisfaction
– Maintain the quality of applications under highly dynamic environments
• Problems are system- and application-specific
• Various techniques/solutions are involved
– Resource reservation
– Application adaptation
Roadmap
• Introduction
• Controlling delays in data stream management systems (DSMSs)
• Quality-aware (media) data replication
• Other works
Data Stream Management Systems
• Data-active query-passive model
• Continuous query
• Continuous data, discarded after being processed
• Applications
– Financial analysis
– Mobile services
– Sensor networks
– Network monitoring
[Figure: DSMS overview with labels Data (input streams), DSMS, Query Results, and User]
Load Shedding
• Data processing in DSMS is quality-critical
– Tuple processing delay
– Data loss
– Sampling rate, window size, …
• Overloading during load spikes degrades quality (processing delay)
Solution: load shedding (i.e., adjust data loss)
– Eliminate excessive load by dropping data items
– Users tolerate approximate query results
Load Shedding: Challenges
• Constantly discarding most packets would work
• What happens to query accuracy?
• The real (and hard) problem is:
How to maintain processing delays while minimizing data loss?
Specifically
– When?
– How much?
– For how long?
– Which ones to discard?
State-of-the-Art
• Data triage (Reiss & Hellerstein, ICDE06)
– Put data into a fast-track analyzer upon overloading
• LoadStar (Chi et al., VLDB05)
• Accuracy of aggregate queries under load shedding (Babcock et al., ICDE04)
• QoS-driven load shedding (Tatbul et al., VLDB03, 06)
All utilize intuitive rule-of-thumb algorithms to decide when, how much, and how long
They do not work under bursty arrival patterns and variable tuple processing cost
Our Approach
• Insight: treat load shedding as a control problem
• Control: manipulation of system states (outputs) by adjusting input(s) to system
• In our problem
– processing delay -> output
– amount of load injected -> input
• Problem reformulation: let the output track the desirable value by changing the amount of load discarded
[Figure: delay vs. time]
Feedback Control
• Suitable for rejecting the effects of disturbances
• Main components form a feedback control loop
[Figure: feedback control loop with blocks Controller, Actuator, and Plant; signals: reference value yd, error e(k) = yd - y(k), control signal u(k), output y(k), and disturbance]
Plant: DSMS engine
Actuator: load shedder
y: average data processing delay
yd: desired processing delay
e: control error
u: allowed load into DSMS
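The loop described above can be sketched in a few lines. This is a minimal illustration, not the controller from this work: it assumes a hypothetical single-queue plant whose delay is roughly queue length times per-tuple cost, a known engine capacity, and a plain proportional gain. The names `simulate`, `capacity`, `cost`, and `gain` are all made up for the example.

```python
def simulate(arrivals, capacity, cost, target_delay, gain):
    """Simulate a single-queue plant with a proportional load shedder.

    arrivals:     tuples offered in each control period
    capacity:     tuples the engine can process per period (assumed known)
    cost:         time to process one tuple, so delay ~ queue length * cost
    target_delay: reference value yd
    gain:         proportional gain (hypothetical tuning parameter)
    """
    queue = 0.0
    delays, dropped = [], []
    for offered in arrivals:
        y = queue * cost                 # measured output y(k)
        e = target_delay - y             # control error e(k) = yd - y(k)
        # Control signal u(k): load allowed into the engine this period.
        u = max(0.0, capacity + gain * e / cost)
        admitted = min(offered, u)       # actuator: shed whatever exceeds u(k)
        dropped.append(offered - admitted)
        queue = max(0.0, queue + admitted - capacity)
        delays.append(queue * cost)
    return delays, dropped


# Sustained overload: 200 tuples offered per period against a capacity of 100.
delays, dropped = simulate([200.0] * 50, capacity=100.0, cost=0.01,
                           target_delay=2.0, gain=0.5)
```

Under these assumptions the delay settles at the reference value instead of growing without bound, and the shedder discards only the load that exceeds what the loop admits.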
Issues
• System modeling
– Critical for control loop design
– Analytical models desirable but not currently available
– Experimental methods can be used
• Controller design
• Database-specific challenges
– Lack of real-time measurement of output signal y
– Actuator may not be able to implement control signal correctly
Modeling Borealis
• Interestingly, system identification of Borealis shows a first-order model with single-queue characteristics
• In other words (block diagram)
Controller Design
• Design based on pole placement
– Locations of pole(s) determine how fast/well the system responds
• Guaranteed performance targets
– Convergence rate - responsiveness
– Damping - smoothness
• The controller:
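As a generic illustration of pole placement (not the controller used in this work), consider a hypothetical first-order plant y(k+1) = a·y(k) + b·u(k) under proportional control u(k) = K·(yd - y(k)). The closed loop is y(k+1) = (a - bK)·y(k) + bK·yd, so choosing K = (a - p)/b puts the single closed-loop pole at p. The function names and the plant parameters `a` and `b` below are assumptions made for the sketch.

```python
def proportional_gain(a, b, pole):
    """Gain K placing the closed-loop pole of y(k+1) = a*y(k) + b*u(k),
    with u(k) = K * (yd - y(k)), at the requested location."""
    if b == 0:
        raise ValueError("plant input gain b must be nonzero")
    if not -1.0 < pole < 1.0:
        raise ValueError("a stable discrete-time pole lies inside (-1, 1)")
    return (a - pole) / b


def step_response(a, b, gain, yd, steps):
    """Closed-loop output for a constant reference yd, starting from y = 0."""
    y, out = 0.0, []
    for _ in range(steps):
        u = gain * (yd - y)   # control signal from the current error
        y = a * y + b * u     # hypothetical first-order plant update
        out.append(y)
    return out


# Integrator-like plant (a = 1), as a single queue would be; pole placed at 0.6.
K = proportional_gain(a=1.0, b=0.5, pole=0.6)
response = step_response(1.0, 0.5, K, yd=2.0, steps=60)
```

A pole closer to 0 makes the error shrink faster per step (more responsive); a pole closer to 1 gives a slower, smoother approach to the reference.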
DSMS-specific challenges
• A database system is different from a traditional control system in many ways
• Lack of real-time measurement of output signal y
• Actuator may not be able to implement control signal correctly
• Solutions are provided in the context of DSMS
• Need more systematic study from a control viewpoint
Experiments
• Controller and load shedder implemented in a real DSMS - Borealis
• Synthetic (“Pareto”) and real (“Web”) data streams
• Query network with variable average processing cost
• Experiments for comparison
– Aurora - open loop
– Baseline - primitive feedback control
Experiments: Inputs
Main Results - Synthetic Data
Main Results - Real Data
Main Results - Data Loss
Summary on Load Shedding
• Load shedding is an effective quality adaptation method in DSMSs
• Ad hoc solutions do not work well under dynamic load
• A load shedding approach based on feedback control theory shows promising results in a real-world DSMS
• Control theory could provide solutions to other database problems
• However, we need to address new challenges that are unique to database problems
Roadmap
• Introduction
• Controlling delays in data stream management systems (DSMSs)
• Quality-aware (media) data replication
• Other works
Quality-Aware Queries in Multimedia DBMS
• Quality = QoS
• Querying the DB with quality parameters
SELECT vid:[s]
FROM VidLib1
WHERE (vid, s) IN FindVideoWithObject( Someone )
QUALITY Resolution = High, Color_depth = Low
Quality-aware Data Retrieval
• Quality (QoS) critical for media data
• Varieties of user quality requirements
– Determined by user preference and resource availability
– Large number of quality combinations
• Adaptation techniques to satisfy quality needs
– Dynamic adaptation: online transcoding
– Static adaptation: retrieve precoded replica from disk
Dynamic Adaptation
• Transcoding is very expensive in terms of CPU cost
• Situation may improve in the future
• Layered coding
– Not standardized yet
– Less popular than people expected
Static Adaptation
• Little CPU cost
• Choice of many commercial service providers
• What about storage cost?
– On the order of total number of quality points
– Ignored in previous research assuming
• Very few quality profiles
• Storage is dirt cheap
– Excessively high for service providers
O(n^d) (!)
Quality-Aware Replication
• Replicas are of different “quality”
• Destination: point(s) in a metric quality space
• Costs of transformation among different qualities are very high
• Applications
– Multimedia
– Materialized view
– Biological structure
• Good news: read-only
• Bad news: too much storage needed
[Figure: data replicas as points in a quality space; axes: Quality Dimension 1, Quality Dimension 2]
Two Quality Models
• Hard-Quality: Users are strict in their quality needs
– Quality A cannot serve a request for quality B
– Online transcoding is needed
• Soft-Quality: Users are willing to negotiate/compromise
– Quality A can serve a request for quality B
– With some penalties (quantified by utility functions)
Hard-Quality Systems
• Problem is to minimize reject rate (probability) P under an overall storage constraint C, given
– fk: query rate for quality k
– uk: service time for quality k
– sk: storage consumption for quality k
– ck: CPU consumption for quality k
• Map system to a multi-rate Erlang loss system
• Reduce the problem to a 0-1 Knapsack
• A (good) heuristic solution:
– Sort all qualities by their fk/sk values and fill in the storage C
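The sort-and-fill heuristic can be sketched as follows. This is one possible reading, under assumptions: qualities are given as hypothetical `(name, f_k, s_k)` tuples with positive `s_k`, and a quality that does not fit is skipped so that smaller ones later in the ranking can still be tried (the slide does not say what happens when an item does not fit).

```python
def select_by_density(qualities, capacity):
    """Pick qualities in decreasing order of f_k / s_k until storage runs out.

    qualities: list of (name, query_rate f_k, storage s_k) with s_k > 0
    capacity:  total storage C
    """
    ranked = sorted(qualities, key=lambda q: q[1] / q[2], reverse=True)
    chosen, used = [], 0.0
    for name, rate, size in ranked:
        if used + size <= capacity:   # skip anything that no longer fits
            chosen.append(name)
            used += size
    return chosen


# Hypothetical quality points: (name, f_k, s_k).
qualities = [("high", 5.0, 10.0), ("mid", 4.0, 4.0), ("low", 3.0, 1.0)]
picked = select_by_density(qualities, capacity=6.0)
```

Here the ratios are 0.5, 1.0, and 3.0, so "low" and "mid" are stored and "high" is left to online transcoding.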
Soft-quality system: the fixed-storage replica selection (FSRS) Problem
• An optimization: get the highest utility given the popularity (fk), storage cost (sk) of all quality points under total storage S
– u(j,k): the utility when a request on quality j is served by quality k
• Utility is given as a function of distance in quality space
– Requests served by the closest replica
The FSRS Algorithms (I)
• Problem is NP-hard: a variation of k-means
• We propose a heuristic algorithm named Greedy
– Aggressively selects replicas based on the ratio of marginal utility gain (∆u) to cost (sk)
– Time complexity: O(m²I) where I is the # of replicas selected and m the total # of possible replicas
selected replica set P := ∅
available storage s’ := S
while some unselected quality point fits in s’
    add the fitting quality point that yields the largest ∆u/sk value to P
    decrease s’ by sk
return P
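A runnable sketch of the Greedy loop above, under stated assumptions: quality points are hypothetical integer ids on a one-dimensional quality axis, `freq[j]` stands for fj, `size[k]` for sk, and the utility function u(j, k) = 1 / (1 + |j - k|) is invented for the example. The utility is recomputed from scratch at every step, so this naive version does not claim the O(m²I) bound.

```python
def total_utility(selected, freq, utility):
    """U(P): every requested quality j is served by its best replica in P."""
    if not selected:
        return 0.0
    return sum(f * max(utility(j, k) for k in selected)
               for j, f in freq.items())


def greedy(freq, size, utility, storage):
    """Repeatedly add the fitting replica with the largest gain per unit storage."""
    selected, remaining = set(), storage
    while True:
        base = total_utility(selected, freq, utility)
        best, best_ratio = None, 0.0
        for k in size:
            if k in selected or size[k] > remaining:
                continue
            gain = total_utility(selected | {k}, freq, utility) - base
            ratio = gain / size[k]        # marginal utility gain over cost
            if ratio > best_ratio:
                best, best_ratio = k, ratio
        if best is None:                  # nothing fits or nothing helps
            return selected
        selected.add(best)
        remaining -= size[best]


def closeness(j, k):
    # Hypothetical utility: decays with distance in a 1-D quality space.
    return 1.0 / (1.0 + abs(j - k))


freq = {1: 10, 2: 1, 3: 10}               # request rates per quality point
size = {1: 1, 2: 1, 3: 1}                 # storage cost per replica
replicas = greedy(freq, size, closeness, storage=2)
```

With two units of storage the sketch keeps the two popular end points and serves the rarely requested middle quality from a neighbor.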
The FSRS Algorithms (II)
• Greedy could pick some bad replicas, especially the earlier selections
• Remedy: remove those bad choices and re-select
• The Iterative Greedy algorithm:
P ← a solution given by Greedy
while there exists solution P’ s.t. U(P’) > U(P)
    do P ← P’
return P
• Time complexity: same as Greedy with a larger coefficient
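The pseudocode leaves open how a better solution P’ is found. One plausible reading of "remove those bad choices and re-select" is sketched here: drop one selected replica at a time, let Greedy refill the freed storage, and keep the result only if total utility improves. The function names, the integer quality ids, and the utility function are assumptions for illustration, not the authors' implementation.

```python
def total_utility(selected, freq, utility):
    """U(P): every requested quality j is served by its best replica in P."""
    if not selected:
        return 0.0
    return sum(f * max(utility(j, k) for k in selected)
               for j, f in freq.items())


def greedy_fill(seed, freq, size, utility, storage):
    """Greedy selection that starts from the replica set `seed`."""
    selected = set(seed)
    remaining = storage - sum(size[k] for k in selected)
    while True:
        base = total_utility(selected, freq, utility)
        best, best_ratio = None, 0.0
        for k in size:
            if k in selected or size[k] > remaining:
                continue
            gain = total_utility(selected | {k}, freq, utility) - base
            if gain / size[k] > best_ratio:
                best, best_ratio = k, gain / size[k]
        if best is None:
            return selected
        selected.add(best)
        remaining -= size[best]


def iterative_greedy(freq, size, utility, storage):
    """Remove one choice at a time and re-select; stop when nothing improves."""
    current = greedy_fill(set(), freq, size, utility, storage)
    improved = True
    while improved:
        improved = False
        for r in sorted(current):
            candidate = greedy_fill(current - {r}, freq, size, utility, storage)
            if (total_utility(candidate, freq, utility)
                    > total_utility(current, freq, utility) + 1e-12):
                current, improved = candidate, True
                break
    return current


def closeness(j, k):
    # Hypothetical utility: decays with distance in a 1-D quality space.
    return 1.0 / (1.0 + abs(j - k))


freq = {1: 10, 2: 8, 3: 10}
size = {1: 1, 2: 1, 3: 1}
plain = greedy_fill(set(), freq, size, closeness, storage=2)
refined = iterative_greedy(freq, size, closeness, storage=2)
```

In this example plain Greedy commits early to the middle point and ends with utility 23, while the refinement pass swaps it out and reaches 24 with the two end points.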
Other Extensions
• Our FSRS algorithms can be easily extended to handle
– Multiple media objects
– Further user-specified constraints on replicas to be selected
– Multiple servers
Dynamic Replication
• Popularity f of replicas could change over time
• We only consider the situation where popularity of all replicas of a media object changes together
– Reasonable assumption in many systems
– Competition for storage among media objects
• Desirable dynamic replication algorithms:
– Find solutions as optimal as those by static FSRS algorithms
– Fast enough to make online decisions
• Naïve solution: run Greedy every time a change of f occurs
Replication Roadmap (RR)
• Consider the order in which replicas are selected by Greedy: they follow a predefined path (RR) for each media object
• RRs are all convex
• Exchanges of storage may happen between two media objects, triggered by the increase/decrease of f
– The one that becomes more popular takes storage from the least popular one
– The one that becomes less popular gives up storage to the most popular one
– It is efficient to make exchanges at the frontiers of the RRs, no need to look inside
Replication Roadmap (continued)
• Storage exchanges, example:
Media A should take storage from media B as the slope of its current segment in RR is greater than that of B’s
Dynamic FSRS algorithm
• Based on the RR idea
• Proved performance: results given are as optimal as those chosen by Greedy
• Preprocess phase:
– Build the RRs
• Online phase:
– Perform exchanges until total utility converges
– Time complexity: O(I log V), where I is the # of storage exchanges and V is the # of media objects
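The online phase can be illustrated with a deliberately simplified sketch. It assumes every roadmap segment costs one unit of storage, represents each roadmap as a hypothetical list of per-segment utility gains in non-increasing order, and scans all objects on each exchange instead of using a priority queue, so it costs O(V) per exchange rather than the O(log V) stated above. All names are invented for the example.

```python
def rebalance(roadmaps, popularity, position):
    """Exchange storage at the roadmap frontiers until no exchange helps.

    roadmaps[v]:   per-segment utility gains of object v, non-increasing
    popularity[v]: current popularity f of object v
    position[v]:   number of segments object v currently holds (updated in place)
    Returns the number of exchanges performed.
    """
    exchanges = 0
    while True:
        # Value of the next segment each object could take.
        takers = [(popularity[v] * roadmaps[v][position[v]], v)
                  for v in roadmaps if position[v] < len(roadmaps[v])]
        # Value of the last segment each object currently holds.
        givers = [(popularity[v] * roadmaps[v][position[v] - 1], v)
                  for v in roadmaps if position[v] > 0]
        if not takers or not givers:
            return exchanges
        gain, taker = max(takers)
        loss, giver = min(givers)
        if taker == giver or gain <= loss:   # no exchange raises total utility
            return exchanges
        position[taker] += 1
        position[giver] -= 1
        exchanges += 1


roadmaps = {"A": [5.0, 3.0, 1.0], "B": [4.0, 2.0, 1.0]}
position = {"A": 2, "B": 1}
popularity = {"A": 1.0, "B": 3.0}            # B has just become more popular
moved = rebalance(roadmaps, popularity, position)
```

Only the frontier segments are inspected, which mirrors the observation that there is no need to look inside the roadmaps. In this example B takes one unit of storage from A and the allocation then stabilizes.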
Effectiveness of FSRS Algorithms
• For comparison:
– The optimal solution (by CPLEX)
– Random selections
– Local popularity-based
Efficiency of FSRS Algorithms
• CPLEX < Iterative Greedy < Greedy < Random < Local
• Results on a P4 2.4 GHz CPU:
Dynamic Replication Results
• Randomly generated changes of f
• Compare with Greedy
• Results with (almost) the same optimality as Greedy
• Reason: small number of storage exchanges
Summary on media replication
• Storage cost in static adaptation prohibits replication of all qualities
• Optimize toward lowest reject (hard-quality) or the highest utility (soft-quality) given storage constraints
• Two heuristics are proposed for static replication that give near-optimal choices
• An online algorithm for a dynamic replication problem
Other Works
• VDBMS - a multimedia DBMS
– Quality-of-Service Aware Query Processing [EDBT04]
– System architecture [MMSJ03, DMS03, ICDE03]
• Peer-to-peer media streaming
– Performance analysis [MMCN04, TOMCCAP05]
• Genetic algorithms [JEC07]
• Other topics in data stream systems
– Entity-based query processing [VLDB05]
– Stream data compression [GSN06]
• Signal processing [JMASM07, CSC05]
Ongoing and Future Research
• Further investigate load shedding problem
– Handle actuator uncertainty
– Other control targets
– Is the optimal achievable?
• Quality-aware replication:
– General case of dynamic replication, why is a random solution not so bad?
– Conjecture: Greedy is 4/3-competitive?
• Application of control theory in other database topics
– Self-tuning databases
Publications-1
[TKDE07] Y. Tu, J. Yan, G. Shen and S. Prabhakar. Multi-Quality Data Replication in Multimedia Databases. IEEE Transactions on Knowledge and Data Engineering (TKDE). 19(5):679-694, May 2007.
[JMASM07] L. Qu and Y. Tu. Change Point Estimation of Bi-Level Functions. Journal of Modern Applied Statistical Methods. 5(2), May 2007
[JEC] H. Fang, Q. Wang, Y. Tu and M.F. Horstemeyer. An Efficient Non-Dominated Sorting Algorithm for Evolutionary Algorithms. Accepted to Journal of Evolutionary Computation.
[ICDE07] Y. Tu, S. Liu, S. Prabhakar, B. Yao, and W. Schroeder. Using Control Theory for Load Shedding in Data Stream Management. In Procs. of ICDE, pp.490-491, Istanbul, Turkey, April 2007.
[GSN06] Y. Xia, Y. Tu, M. Atallah, and S. Prabhakar. Efficient Data Compression in Location Based Services. In Procs. of 2nd International Conference on Geosensor Networks, Boston, MA, October 2006.
[VLDB06] Y. Tu, S. Liu, S. Prabhakar, and B. Yao. Load Shedding in Stream Databases - A Control-Based Approach. In Proceedings of VLDB, pp.787-798, September 2006.
[TOMCCAP05] Y. Tu, J. Sun, M. Hefeeda, and S. Prabhakar. An Analytical Study of Peer-to-Peer Media Streaming Systems. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP). 1(4):354-376, November 2005.
Publications-2
[VLDB05] R. Cheng, B. Kao, S. Prabhakar, A. Kwan, and Y. Tu. Adaptive Stream Filters for Entity-Based Queries with Non-Value Tolerance. In Proceedings of VLDB, pp.37-48, August 2005.
[DEXA05a] Y. Tu, J. Yan, and S. Prabhakar. Quality-Aware Replication of Multimedia Data. In Proceedings of DEXA, pp. 240-249, August 2005.
[DEXA05b] Y. Tu, M. Hefeeda, Y. Xia, S. Prabhakar, and S. Liu. Control-based Quality Adaptation in Data Stream Management Systems. In Proceedings of DEXA, pp. 746-755, August 2005.
[CSC05] L. Qu and Y. Tu. Change Point Estimation of Bar Code Signals. In Proceedings of International Conference on Scientific Computing, pp.109-114, Las Vegas, USA, June 2005.
[MMJS04] W. Aref, A. Catlin, A. Elmagarmid, J. Fan, M. Hammad, I. Ilyas, M. Marzouk, S. Prabhakar, Y. Tu and X. Zhu. VDBMS: A Testbed Facility for Research in Video Database Benchmarking. ACM/Springer Multimedia Systems. 9(6):575-585, June 2004.
[EDBT04] Y. Tu, S. Prabhakar, A. Elmagarmid and R. Sion. QuaSAQ: An Approach to Enabling End-to-End QoS for Multimedia Databases. In Proceedings of Extending Database Technology (EDBT), pp.694-711, Heraklion, Greece, March 2004.
[MMCN04] Y. Tu, J. Sun and S. Prabhakar. Performance Analysis of A Hybrid Media Streaming System. In Proceedings of ACM/SPIE Conf. on Multimedia Computing and Networking (MMCN), pp.69-82, San Jose, CA, January 2004.
Publications-3
[DMS03] W. Aref, A. Catlin, A. Elmagarmid, J. Fan, M. Hammad, I. Ilyas, M. Marzouk, S. Prabhakar, Y. Tu and X. Zhu (alphabetical order). VDBMS: A Testbed Facility for Research in Video Database Benchmarking. In Proceedings of Intl. Conf. on Distributed Multimedia Systems (DMS) 2003, pp.160-166.
[ICDE02] W. Aref, A. Elmagarmid, J. Fan, J. Guo, M. Hammad, I. Ilyas, M. Marzouk, S. Prabhakar, A. Rezgui, A. Teoh, E. Terzi, Y. Tu, A. Vakali, X. Zhu (alphabetical order). A Distributed Database Server for Continuous Media. Procs. of ICDE, pp.490-491. San Jose, CA., March 2002.
[ICDE06] Y. Tu and S. Prabhakar. Control-Based Load Shedding in Data Stream Management Systems. PhD Workshop, in conjunction with ICDE 2006.
Submitted: Using control theory for self-tuning databases. Submitted to journal.
Thank you!
Questions?
QuaSAQ
• Quality-of-Service-Aware Query processing
• Users do not need to know low-level details
• Cost evaluation toward global optimization goals
– Throughput
• Utilizing current system/network QoS support to deliver the query results
• Theory first presented in Bertino et al., 2003
• Prototyping is essential
QuaSAQ Architecture
• Our approach:
– Augment the query evaluation and optimization modules to directly take QoS into account
• Major components
– Offline multimedia processor
• Transcode media objects into copies with different QoS/formats
• Estimate resource use
– Online components
• QoS Browser
• Quality Manager
• QoS APIs
[Figure: QuaSAQ architecture. Components: User, QoP Browser, Quality Manager, QoS APIs, Offline Processor, DBA, Storage, OS, Network. Edge labels: natural interaction, query, evaluate, working plan, reservation & renegotiation, reserved resources, retrieve]