querying large scientific databases - daslab
TRANSCRIPT
![Page 1: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/1.jpg)
Querying Large Scientific DatabasesLuciano and Brian
![Page 2: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/2.jpg)
Scientific Data challenges Data is in the Terabytes
Queries are exploratory
“Paraphrased, the answer is in the database, but the Nobel-prize winning query is still unknown”
Unhelpful query results that take minutes to process
4 TB Database
VS
![Page 3: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/3.jpg)
Research Directions
One-minute database kernels for real-time performance.
Multi-scale query processing for gradual exploration.
Result-set post processing for conveying meaningful data.
Q1
Scan
Post processing engine
AVG MAX
Histogram
![Page 4: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/4.jpg)
Research Directions
Query morphing to adjust for proximity results.
Query alternatives to cope with lack of providence
![Page 5: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/5.jpg)
An Introduction to BlinkDB
![Page 6: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/6.jpg)
Motivation of BlinkDB
Real time analytics for web companies
![Page 7: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/7.jpg)
Taxonomy of Solutions
![Page 8: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/8.jpg)
Query Column Sets - Include Table
SELECT country, SUM(clicks), AVG(time_on_page), FROM website_data WHEREreferral_website = ‘Google’ GROUPBY country
Query Column Sets consist of the columns in the WHERE, HAVING, and GROUPBY fields but not in the payload columns.
Query Column Set -referral_website, country
![Page 9: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/9.jpg)
Query Column Sets Distribution
The most popular queries use very few Query Column Sets
By TimeBy Rank
The Query Column Sets being used are stable over time
![Page 10: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/10.jpg)
Limitations: Lack of (Full) Support for JoinsFull support only for joins where one table fits entirely into memory.
How much is this a problem?
![Page 11: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/11.jpg)
System Overview
Modifications to HiveQL to support confidence intervals, error estimates, and time constraints
Integration of sample selection into plan formationIntegration into Hive
Creation and maintenance of samples. Integrates into the distributed file system and cache
Run query Q over file cache when deciding on sample selection
![Page 12: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/12.jpg)
HiveQL ExtensionSELECT COUNT(*)
FROM Sessions
WHERE Genre = ‘western’
GROUP BY OS
ERROR WITHIN 10% AT CONFIDENCE 95%
![Page 13: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/13.jpg)
Stratified Samples
Uniform Sample Stratified Sample
Query
![Page 14: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/14.jpg)
Stratified Sampling (Continued)
![Page 15: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/15.jpg)
Storage of Stratified Samples
![Page 16: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/16.jpg)
Limitations of Stratified SamplesLarge
Expensive to create
Sorting
![Page 17: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/17.jpg)
How to Create Samples● Number of occurrences of each possible ∣ϕ∣-tuple
● Sort the table by the columns in ϕ
● For each ∣ϕ∣-tuple that has a number of occurrences greater than K, we need to randomly sample from those occurrences
● Finally, we write the sample out to disk
How long does this take?
According to authors, 5-30 minutes. (To be examined later)
![Page 18: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/18.jpg)
What Samples should we create?
We want column query sets that are used by our workload
Want to use QCS that will take up less space
![Page 19: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/19.jpg)
Sample Selection and Runtime
![Page 20: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/20.jpg)
Query Planner
Stratified Sample
Uniform Sample
QCS1 QCS2
Sample Selection and Runtime
![Page 21: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/21.jpg)
Sample Selection and Runtime
Query Planner
Error Latency Pro�file
Error Profile
Latency Profile
![Page 22: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/22.jpg)
Query Execution Bias Correction
Sample Selection and Runtime
![Page 23: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/23.jpg)
Results
![Page 24: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/24.jpg)
Query Column Sets Distribution
90% of the queries come from 10% of the templates. However, this is still 18 Query Column Sets for Conviva and 45 different Query Column Sets for Facebook.
By TimeBy Rank
![Page 25: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/25.jpg)
Performance - Simple Query on Filter, Groupby, Avg.
![Page 26: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/26.jpg)
Results● Is this experiment fair? What portion of the speedup is due to the probabilistic
framework? What portion is due to the clustering & redundancy that BlinkDB employs?
● BlinkDB states their sample build time should be between 5-30 minutes. This experiment shows a simple two column filter, groupby, aggregate query to take 20 minutes. Does this seem reasonable?
![Page 27: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/27.jpg)
Accuracy of ResultsAccuracy of results by sampling method.
![Page 28: Querying Large Scientific Databases - DASlab](https://reader034.vdocuments.site/reader034/viewer/2022042609/626337cd3416ce7167262be1/html5/thumbnails/28.jpg)
Extensions and Open Questions1)Could we carry around only certain payload columns with each QCS?
2)Could we build this index partially and incrementally? Why or why not?
3)How does this apply to scientific data management?