wanalytics: analytics for a geo- distributed data...
TRANSCRIPT
![Page 1: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/1.jpg)
WANalytics: Analytics for a geo-distributed data-intensive world
Ashish Vulimiri*, Carlo Curino+,Brighten Godfrey*, Konstantinos Karanasos+,
George Varghese+
* UIUC + Microsoft
![Page 2: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/2.jpg)
Large organizations today:Massive data volumes
• Data collected acrossseveral data centers forlow end-user latency
• Use cases:– User activity logs– Telemetry– …
DC1 DC2
DC3
![Page 3: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/3.jpg)
Current scales: 10s-100s TB/day
Microsoft n * 10s TB/dayTwitter 100 TB/dayFacebook 15 TB/dayYahoo 10 TB/dayLinkedIn 10 TB/day
across up to 10s of data centers
![Page 4: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/4.jpg)
Data must be analyzed as a whole
• Need to analyze all this data to extract insight
• Production workloads today:– Mix of SQL, MapReduce,
machine learning, …
Analy&cs
SQL
MR
ML
MR
MR k-‐means
![Page 5: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/5.jpg)
Analytics on geo-distributed data:Centralized approach inadequate
Current solution: copy all data to central DC, run analytics there
1. Consumes a lot of bandwidth– Cross-DC bandwidth is expensive, very scarce– “Total Internet capacity” only ≈ 100 Tbps
2. Incompatible with sovereignty– Many countries considering making copying
citizens’ data outside illegal– Speculation: derived info will still be OK
![Page 6: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/6.jpg)
Alternative: Geo-distributed analytics
we build system supporting geo-distributedanalytics execution- Leave data partitioned across DCs- Push compute down (distribute workflow
execution)
![Page 7: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/7.jpg)
Geo-distributed analyticspreprocess adserve_log
⋈MapReduce click_log
DC1 adserve_log
SQL
k-‐means clustering
Mahout preprocess click_log
MapReduce
adserve_log
click_log
Distributed execu&on: 0.03 TB/day Centralized execu&on: 10 TB/day t = 0 push down preprocess
click_log
DCn adserve_log
t = 1 distributed semi-‐join
t = 2 centralized k-‐means
![Page 8: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/8.jpg)
Geo-distributed analyticspreprocess adserve_log
⋈MapReduce click_log
DC1 adserve_log
SQL
k-‐means clustering
Mahout preprocess click_log
MapReduce
adserve_log
click_log
Distributed execu&on: 0.03 TB/day Centralized execu&on: 10 TB/day t = 0 push down preprocess
click_log
DCn adserve_log
t = 1 distributed semi-‐join
t = 2 centralized k-‐means
![Page 9: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/9.jpg)
Geo-distributed analyticspreprocess adserve_log
⋈MapReduce click_log
DC1 adserve_log
SQL
k-‐means clustering
Mahout preprocess click_log
MapReduce
adserve_log
click_log
Distributed execu&on: 0.03 TB/day Centralized execu&on: 10 TB/day
click_log
DCn adserve_log
t = 0 push down preprocess
t = 1 distributed semi-‐join
t = 2 centralized k-‐means
333x cost reducKon
![Page 10: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/10.jpg)
Building a system forGeo-distributed analytics
• Possible challenges to address:– Bandwidth– Fault tolerance– – Latency– Consistency
• Starting point: system we build targets the batch applications considered earlier
Sovereignty
![Page 11: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/11.jpg)
PROBLEM DEFINITION
![Page 12: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/12.jpg)
Computational model• DAGs of arbitrary tasks over geo-distributed data• Tasks can be orwhite box black box
preprocess adserve_log
⋈MapReduce
click_log
DC1
adserve_log
click_log
DCn
adserve_log preprocess click_log MapReduce
SQL
correlaKon analysis
user-‐provided code
![Page 13: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/13.jpg)
Unique characteristics(what make this problem novel)
1. Arbitrary DAG of computational tasks2. No control over data partitioning– Partitioning dictated by external factors,
e.g. end-user latency
3. Cross-DC bandwidth is only scarce resource– CPU, storage within DCs is relatively cheap
4. Unusual constraints:– heterogeneous bandwidth cost/availability– sovereignty
5. Bulk of load is stable, recurring workload– Consistent with production logs
![Page 14: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/14.jpg)
Problem statement• Support arbitrary DAG workflows on
geo-distributed data– Minimize bandwidth cost– Handle fault-tolerance, sovereignty
• Configure system to optimize given ~stable recurring workload (set of DAGs)
![Page 15: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/15.jpg)
KEY TAKE-AWAY 1:
Geo-distributed analytics is a fun and industrially relevant new instance of classic DB problems
![Page 16: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/16.jpg)
OUR APPROACH
![Page 17: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/17.jpg)
Architecture
End-‐users
End-‐user facing DB (handles OLTP)
Hive Mahout
MapReduce
Local ETL
Workload OpKmizer
logs
exec, repl policy
Coordinator ReporKng pipeline
DAGs
Results
Data transfer opKmizaKon
![Page 18: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/18.jpg)
Data transfer optimization:Trading CPU/storage for bandwidth
• Runtime optimization that works irresp of computation
• CPU, storage within DCs is cheap• Bandwidth crossing DCs is expensive• This is one way we trade CPU/storage for
bandwidth reduction
![Page 19: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/19.jpg)
Data transfer optimization:Caching
• We use aggressive caching:Cache all intermediate output
• If computation recurs:– recompute results– send diff(new results, old results)
• Actually worsens CPU, storage use
• But saves cross-DC bandwidth– all we care about
rold
rnew
rold
src
dst
diff(rnew, rold)
![Page 20: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/20.jpg)
Data transfer optimization:Caching
• Caching naturally helps if one DAG arrives repeatedly (intra-DAG)
• But interestingly: also helpsinter-DAG– When multiple DAGs share
common sub-operations– (Because we cache all
intermediate output)• E.g. TPC-CH– 5.99x for a part of the workload
![Page 21: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/21.jpg)
Data transfer optimization:Caching ≈ View maintenance
• Caching is a low-level, mechanical form of (materialized) view maintenance
+ Works for arbitrary computation- Compared to relational view maintenance• Is less efficient (CPU, storage)• Misses some opportunities
![Page 22: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/22.jpg)
KEY TAKE-AWAY 2:
The extreme ratio of bandwidth toCPU/storage allows for novel optimizations
![Page 23: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/23.jpg)
WORKLOAD OPTIMIZER
![Page 24: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/24.jpg)
Robust evolutionary approach
• Start by supporting existing “centralized” plan• Continuous adaptation (loop):– Come up with a set of alternative hypotheses – Measure their costs using pseudo-distributed execution• Novel mechanism with zero bandwidth-cost overhead
– Compute new best plan• Execution strategy• Data replication strategy
– Deploy new best plan
![Page 25: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/25.jpg)
Robust evolutionary approach
• Start by supporting existing “centralized” plan• Continuous adaptation (loop):– Come up with a set of alternative hypotheses – Measure their costs using pseudo-distributed execution• Novel mechanism with zero bandwidth-cost overhead
– Compute new best plan• Execution strategy• Data replication strategy
– Deploy new best plan
![Page 26: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/26.jpg)
Robust evolutionary approach
• Start by supporting existing “centralized” plan• Continuous adaptation (loop):– Come up with a set of alternative hypotheses – Measure their costs using pseudo-distributed execution• Novel mechanism with zero bandwidth-cost overhead
– Compute new best plan• Execution strategy• Data replication strategy
– Deploy new best plan
today (for rest see paper)
![Page 27: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/27.jpg)
Optimizing execution:Subproblem definition
• Given: – Core workload: a set of recurrent DAGs – Sovereignty, fault-tolerance requirements
• Need to decide best choice of: – Strategy for each task (e.g. hash join vs semi join) – Which task goes to which DC
![Page 28: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/28.jpg)
Optimizing execution:Difficulties
1. Optimizing even one task in isolation is very hard
2. Should jointly optimize all tasks in each DAG
3. Should jointly optimize all DAGs in workload– Caching
4. Sovereignty, fault-tolerance
![Page 29: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/29.jpg)
Optimizing execution:Difficulties
1. Optimizing even one task in isolation is very hard
DAG:
Data:
⋈
DC1 P1 Q1
DCn Pn Qn
P
Q
![Page 30: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/30.jpg)
Optimizing execution:Difficulties
1. Optimizing even one task in isolation is very hard
2. Should jointly optimize all tasks in each DAG
3. Should jointly optimize all DAGs in workload– Recall: caching helps when DAGs share sub-operations
4. Sovereignty, fault-tolerance
![Page 31: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/31.jpg)
Optimizing execution:Greedy heuristic
• Process all DAGs in parallel, separately.In each DAG:– Go over tasks in topological order– For each task, greedily pick
lowest-cost available strategy
![Page 32: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/32.jpg)
When does the greedy heuristic work?
• Contractive DAGs: picks optimal strategy– make up 98% of DAGs in our experiments
filter aggr summarize
Data size
extract features
combine
Data size
![Page 33: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/33.jpg)
When does the greedy heuristic work?
• Contractive DAGs: picks optimal strategy [98%]• DAGs that expand then contract: may not [2%]
filter aggr summarize
Data size
extract features
combine
Data size
![Page 34: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/34.jpg)
Optimizing execution:Beyond the heuristic
• Have a precise ILP formulation for special cases– SQL-only DAGs– MapReduce-only DAGs– (Handles fault-tolerance and sovereignity as
constraints)• Alternate heuristics
• General problem remains open
![Page 35: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/35.jpg)
KEY TAKE-AWAY 3:
The optimization space is massive, yet simple heuristics seem to yield good results
![Page 36: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/36.jpg)
EVALUATION
![Page 37: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/37.jpg)
Prototype: WANalytics• Implemented Hadoop-stack prototype– MapReduce, Hive, OpenNLP, Mahout, …
• Experiments up to 10s of TBs scale– Real Microsoft production workload– Three standard synthetic benchmarks:
BigBench, TPC-CH, Berkeley Big-Data– Mix of relational and non-relational
![Page 38: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/38.jpg)
0.00001
0.0001
0.001
0.01
0.1
1
10
0.0001 0.001 0.01 0.1 1 10
Data
tran
sfer
TB (c
ompr
esse
d)
TB (raw, uncompressed) Size of OLTP updates since last OLAP run
CentralizedDistributed: no cachingDistributed: with caching
Results: BigBench
330x
![Page 39: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/39.jpg)
0.00001
0.0001
0.001
0.01
0.1
1
10
0.0001 0.001 0.01 0.1 1 10
Data
tran
sfer
TB (c
ompr
esse
d)
TB (raw, uncompressed) Size of OLTP updates since last OLAP run
CentralizedDistributed: no cachingDistributed: with caching
Results: TPC-CH
360x
![Page 40: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/40.jpg)
Data
tran
sfer
Size of OLTP updates since last OLAP run
CentralizedDistributed: no cachingDistributed: with caching
Results: Microsoftproduction workload
257x
![Page 41: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/41.jpg)
0.00001
0.0001
0.001
0.01
0.1
1
0.0001 0.001 0.01 0.1
Data
tran
sfer
TB (c
ompr
esse
d)
TB (raw, uncompressed) Size of OLTP updates since last OLAP run
CentralizedDistributed: no cachingDistributed: with caching
Results: Berkeley Big-Data
3.5x
![Page 42: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/42.jpg)
KEY TAKE-AWAY 4:
The opportunity here is substantial:more than two orders of magnitude inmany workloads
![Page 43: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/43.jpg)
OPEN PROBLEMS
![Page 44: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/44.jpg)
Open Problems• Evolve optimizer beyond greedy• Even more general computational models– e.g. iteration
• Latency• Consistency• Sovereignty / privacy
![Page 45: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/45.jpg)
Open Problems• Evolve optimizer beyond greedy• Even more general computational models– e.g. iteration
• Latency• Consistency• Sovereignty / privacy
![Page 46: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/46.jpg)
Sovereignty: Partial support• Our system respects “data-at-rest”
regulations (e.g., German data should not be stored outside of Germany)
• But we allow arbitrary queries on the data• Limitation: we don’t differentiate between– Acceptable queries, e.g.
“what’s the total revenue from each city”– Problematic queries, e.g.
SELECT * FROM Germany
![Page 47: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/47.jpg)
Sovereignty: Partial support• Solution: either– Legally vet the core workload of queries/views– Use differential privacy mechanism
• Open problem
![Page 48: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/48.jpg)
KEY TAKE-AWAY 5:
This is just the first step, lots of related work, lots of fun work ahead
![Page 49: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/49.jpg)
Related Work• Distributed and parallel databases• Single-DC frameworks (Hadoop/Spark/…)• Data warehouses• Scientific workflow systems• Sensor networks• Stream-processing systems• …
![Page 50: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/50.jpg)
Unique characteristics(what make this problem novel)
1. Arbitrary DAG of computational tasks2. No control over data partitioning– Partitioning dictated by external factors,
e.g. end-user latency
3. Cross-DC bandwidth is only scarce resource– CPU, storage within DCs is relatively cheap
4. Unusual constraints:– heterogeneous bandwidth cost/availability– sovereignty
5. Bulk of load is stable, recurring workload– Consistent with production logs
![Page 51: WANalytics: Analytics for a geo- distributed data ...cidrdb.org/cidr2015/Slides/24_CIDR15_Slides_Paper24u.pdfAnalytics on geo-distributed data: Centralized approach inadequate Current](https://reader033.vdocuments.site/reader033/viewer/2022060902/609edcdf69c8543129478ebb/html5/thumbnails/51.jpg)
Summary• Centralized analytics is becoming untenable• Proposal: geo-distributed analytics execution• WANalytics, our system, introduces– Pseudo-distributed measurement– Joint multi-query + redundancy optimization– Caching
• On real and synthetic workloads: up to 360x less bandwidth than centralized
• Many challenges remain