
Shark: SQL and Rich Analytics at Scale

Reynold Xin, Josh Rosen, Matei Zaharia, Michael Franklin, Scott Shenker, Ion Stoica

AMPLab, UC Berkeley

June 25 @ SIGMOD 2013



Challenges

Data size growing
» Processing has to scale out over large clusters
» Faults and stragglers complicate DB design

Complexity of analysis increasing
» Massive ETL (web crawling)
» Machine learning, graph processing
» Leads to long running jobs



The Rise of MapReduce



What’s good about MapReduce?

1. Scales out to thousands of nodes in a fault-tolerant manner
2. Good for analyzing semi-structured data and complex analytics
3. Elasticity (cloud computing)
4. Dynamic, multi-tenant resource sharing



“parallel relational database systems are significantly faster than those that rely on the use of MapReduce for their query engines”

“I totally agree.”



This Research

1. Shows that the MapReduce model can be extended to support SQL efficiently
» Started from a powerful MR-like engine (Spark)
» Extended the engine in various ways

2. The artifact: Shark, a fast engine on top of MR
» Performant SQL
» Complex analytics in the same engine
» Maintains MR benefits, e.g. fault-tolerance



MapReduce Fundamental Properties?

Data-parallel operations
» Apply the same operations on a defined set of data

Fine-grained, deterministic tasks
» Enables fault-tolerance & straggler mitigation



Why Were Databases Faster?

Data representation
» Schema-aware, column-oriented, etc
» Co-partition & co-location of data

Execution strategies
» Scheduling/task launching overhead (~20s in Hadoop)
» Cost-based optimization
» Indexing

Lack of mid-query fault tolerance
» MR’s pull model costly compared to DBMS “push”

See Pavlo 2009, Xin 2013.




Data representation and execution strategies: not fundamental to “MapReduce”

Mid-query fault tolerance: can be surprisingly cheap



Introducing Shark

MapReduce-based architecture
» Uses Spark as the underlying execution engine
» Scales out and tolerates worker failures

Performant
» Low-latency, interactive queries
» (Optionally) in-memory query processing

Expressive and flexible
» Supports both SQL and complex analytics
» Hive compatible (storage, UDFs, types, metadata, etc)



Spark Engine

Fast MapReduce-like engine
» In-memory storage for fast iterative computations
» General execution graphs
» Designed for low latency (~100ms jobs)

Compatible with Hadoop storage APIs
» Read/write to any Hadoop-supported systems, including HDFS, HBase, SequenceFiles, etc

Growing open source platform
» 17 companies contributing code



More Powerful MR Engine

General task DAG

Pipelines functions within a stage

Cache-aware data locality & reuse

Partitioning-aware to avoid shuffles

[Figure: example task DAG over datasets A to G, divided into Stage 1, Stage 2, and Stage 3, with join, union, groupBy, and map operations; legend marks previously computed partitions]



Hive Architecture

Client: CLI, JDBC

Driver: SQL Parser, Query Optimizer, Physical Plan, Execution

Metastore

MapReduce

Hadoop Storage (HDFS, S3, …)



Shark Architecture

Client: CLI, JDBC

Driver: SQL Parser, Query Optimizer, Physical Plan, Execution, Cache Mgr.

Metastore

Spark

Hadoop Storage (HDFS, S3, …)



Extending Spark for SQL

Columnar memory store

Dynamic query optimization

Miscellaneous other optimizations (distributed top-K, partition statistics & pruning a.k.a. coarse-grained indexes, co-partitioned joins, …)
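
The partition statistics and pruning item above can be illustrated with a minimal sketch. It assumes each cached partition records the minimum and maximum of a single integer column; `PartitionStats`, `mayContain`, and `prune` are hypothetical names chosen for illustration and are not Shark's API.

```scala
// Hypothetical sketch of coarse-grained indexing via per-partition statistics.
// None of these names come from Shark; they are illustrative only.
object PartitionPruningSketch {
  // Min/max of one integer column, assumed to be collected when the partition is loaded.
  final case class PartitionStats(partitionId: Int, min: Int, max: Int)

  // A partition can be skipped when its [min, max] range cannot overlap
  // the query's inclusive range predicate [lo, hi].
  def mayContain(s: PartitionStats, lo: Int, hi: Int): Boolean =
    s.max >= lo && s.min <= hi

  // Return the ids of the partitions that still have to be scanned.
  def prune(stats: Seq[PartitionStats], lo: Int, hi: Int): Seq[Int] =
    stats.filter(mayContain(_, lo, hi)).map(_.partitionId)
}
```

A range predicate such as `WHERE col BETWEEN 12 AND 25` would then only scan the partitions returned by `prune(stats, 12, 25)`; the rest are skipped without reading any rows.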



Columnar Memory Store

Simply caching records as JVM objects is inefficient (huge overhead in MR’s record-oriented model)

Shark employs column-oriented storage: a partition of columns is one MapReduce “record”.

Column Storage

1      2      3
john   mike   sally
4.1    3.5    6.4

Row Storage

1   john    4.1
2   mike    3.5
3   sally   6.4

Benefit: compact representation, CPU efficient compression, cache locality.
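
The difference between the record-oriented and column-oriented layouts described above can be sketched as follows. This is a simplified illustration under the assumption of a fixed three-column schema; `Row`, `ColumnarPartition`, and `toColumnar` are hypothetical names, and the sketch omits the compression that a real columnar store would apply.

```scala
// Hypothetical sketch contrasting row-oriented and column-oriented layouts.
// Names and schema are illustrative assumptions, not Shark's implementation.
object ColumnarSketch {
  // Row-oriented: one JVM object per record.
  final case class Row(id: Int, name: String, score: Double)

  // Column-oriented: one array per column for a whole partition,
  // so the partition itself acts as a single "record".
  final case class ColumnarPartition(ids: Array[Int],
                                     names: Array[String],
                                     scores: Array[Double]) {
    def numRows: Int = ids.length

    // Reassemble a single row on demand.
    def row(i: Int): Row = Row(ids(i), names(i), scores(i))

    // Scanning one column touches a single contiguous primitive array.
    def sumScores: Double = scores.sum
  }

  // Convert a batch of rows into the columnar layout.
  def toColumnar(rows: Seq[Row]): ColumnarPartition =
    ColumnarPartition(rows.map(_.id).toArray,
                      rows.map(_.name).toArray,
                      rows.map(_.score).toArray)
}
```

Storing `ids` and `scores` as primitive arrays avoids a per-record object header and keeps values of one column adjacent in memory, which is where the compactness and cache-locality benefits come from.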



Partial DAG Execution (PDE)

Lack of statistics for fresh data and the prevalent use of UDFs necessitate dynamic approaches to query optimization.

PDE allows dynamic alteration of query plans based on statistics collected at run-time.



Shuffle Join

[Figure: Stage 1 and Stage 2 feed a Join in Stage 3, which produces the Result]

Map Join (Broadcast Join)

[Figure: Stage 1 feeds a Join in Stage 2, which produces the Result]

minimizes network traffic



PDE Statistics

Gather customizable statistics at per-partition granularities while materializing map output.
» partition sizes, record counts (skew detection)
» “heavy hitters”
» approximate histograms

Can alter query plan based on such statistics
» map join vs shuffle join
» symmetric vs non-symmetric hash join
» skew handling
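
How such run-time statistics might drive a plan decision can be sketched as below. This is an assumption-laden illustration, not Shark's optimizer: `PartitionStat`, `chooseJoin`, `skewed`, the size threshold `broadcastLimitBytes`, and the skew `factor` are all hypothetical.

```scala
// Hypothetical sketch: choosing a join strategy from per-partition statistics.
// The threshold-based rule is an illustrative assumption, not Shark's actual policy.
object JoinChoiceSketch {
  sealed trait JoinStrategy
  case object MapJoin extends JoinStrategy
  case object ShuffleJoin extends JoinStrategy

  // Statistics gathered per map-output partition at run time.
  final case class PartitionStat(bytes: Long, records: Long)

  // Broadcast (map join) only if the smaller input fits under the limit;
  // otherwise fall back to a shuffle join.
  def chooseJoin(left: Seq[PartitionStat],
                 right: Seq[PartitionStat],
                 broadcastLimitBytes: Long): JoinStrategy = {
    val smaller = math.min(left.map(_.bytes).sum, right.map(_.bytes).sum)
    if (smaller <= broadcastLimitBytes) MapJoin else ShuffleJoin
  }

  // Indices of partitions holding more than `factor` times the mean record count.
  def skewed(stats: Seq[PartitionStat], factor: Double): Seq[Int] =
    if (stats.isEmpty) Seq.empty
    else {
      val mean = stats.map(_.records).sum.toDouble / stats.size
      stats.zipWithIndex.collect { case (s, i) if s.records > factor * mean => i }
    }
}
```

The point of deferring the choice is that the input sizes are measured after the map stage has run, so the decision does not rely on statistics that may be missing for fresh data or unpredictable after a UDF.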



Complex Analytics Integration

Unified system for SQL, machine learning

Both share the same set of workers and caches

// Logistic regression by gradient descent over an RDD of labeled points.
def logRegress(points: RDD[Point]): Vector = {
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

// Run a SQL query and get the result back as an RDD.
val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid=u.uid")

// Turn each result row into a feature vector.
val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")),
             ...)
}

// Cache the features so every iteration reads them from memory.
val trainedVector = logRegress(features.cache())
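
The gradient loop above depends on a running cluster and on names (`D`, `ITERATIONS`, `rand`, `Point`) that the slide does not define. As a hedged, local-only sketch, the same update rule can be exercised on an in-memory `Seq`. Everything here is an assumption for illustration: `Point` is modeled with an `Array[Double]` feature vector and a label of +1 or -1, and `dims`, `iterations`, and `seed` are made explicit parameters.

```scala
// Local sketch of the same gradient update, using plain Scala collections
// instead of an RDD. All names and parameters are illustrative assumptions.
object LocalLogRegSketch {
  // y is expected to be +1.0 or -1.0.
  final case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def logRegress(points: Seq[Point], dims: Int, iterations: Int, seed: Long): Array[Double] = {
    val rand = new scala.util.Random(seed)
    // Random initial weights in [-1, 1).
    var w = Array.fill(dims)(2 * rand.nextDouble() - 1)
    for (_ <- 1 to iterations) {
      // Per-point gradient of the logistic loss, summed over all points.
      val gradient = points.map { p =>
        val denom = 1 + math.exp(-p.y * dot(w, p.x))
        p.x.map(_ * (1 / denom - 1) * p.y)
      }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      // Step against the gradient (unit step size, as in the loop above).
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }
    w
  }
}
```

The `map` and `reduce` calls mirror the distributed version one for one; on an RDD the `map` runs per partition on the workers and only the summed gradient travels back to the driver.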



Machine Learning Performance

Runtime per iteration (secs)

                      Shark   Hadoop
K-means Clustering    4.1     157
Logistic Regression   0.96    110



Real Warehouse Benchmark

[Figure: runtime (seconds, axis 0 to 100) of queries Q1 to Q4 for Shark, Shark (disk), and Hive]

1.7 TB Real Warehouse Data on 100 EC2 nodes



New Benchmark

[Figure: runtime (seconds, axis 0 to 20) for Impala, Impala (mem), Redshift, Shark (disk), and Shark (mem)]

http://tinyurl.com/bigdata-benchmark



Other benefits of MapReduce

Elasticity
» Query processing can scale up and down dynamically

Straggler Tolerance

Schema-on-read & Easier ETL

Engineering
» MR handles task scheduling / dispatch / launch
» Simpler query processing code base (~10k LOC)



Berkeley Data Analytics Stack

Spark Streaming, Shark SQL, GraphX, MLBase

Spark

Mesos Resource Manager

HDFS / Hadoop Storage



Community

3000 people attended online training

800 meetup members

17 companies contributing



Conclusion

Leveraging a modern MapReduce engine and techniques from databases, Shark supports both SQL and complex analytics efficiently, while maintaining fault-tolerance.

Growing open source community
» Users observe similar speedups in real use cases
» http://shark.cs.berkeley.edu
» http://www.spark-project.org



[Figure: MapReduce, DBMSs, Shark]
