Solving Low Latency Query Over Big Data with Spark SQL (Julien Pierre, Microsoft)

Upload: spark-summit

Posted on 12-Aug-2015

Category: Data & Analytics

TRANSCRIPT

Page 1: Solving Low Latency Query Over Big Data with Spark SQL-(Julien Pierre, Microsoft)
Page 2:
Page 3:
Page 4:

Client Data Fluency

Office

Skype

Bing

Modern Data Capability

Instrumentation & Ingestion

Processing & Storage

Reporting & Analytics

Information Management

Mobile-First Analytics Experience

Experimentation

Page 5:
Page 6:
Page 7:

[Chart axes: Data Size, Query Latency]

Page 8:
Page 9:

Get results inline in Zeppelin

Need to open the results in Excel

Page 10:

[Bar chart: time per query for Cosmos, SparkSQL, and SparkSQL with Cache, on an axis from 0 to 200. Each bar is split into three stages: Write and Compile Query; Submit and Wait in Job Queue; Job Run Time.]

Page 11:
Page 12:
Page 13:
Page 14:

Mesos Cluster/HDFS

Job Manager

Zookeeper

Job Frontend Web API

Spark Driver Host Pool

Spark Hive Thrift Server

Zeppelin Server

Avocado (Hive Query + Schedule Task)

Rover (Drag & Drop BI tool with Hive Code Gen)

Zeppelin Web UI

MetastoreDB

Hive Loader

Cosmos Storage

Page 15:

[Diagram: ingestion pipeline over Partition 1, Partition 2, ..., Partition n. Three steps are labeled, each running as parallel tasks (Task 2 ... Task n): Export Cosmos Partition, HDFS.copyFromLocalFile, saveAsParquetFile.]
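
The three steps named on this slide can be read as a per-partition pipeline run in parallel. The sketch below is a minimal illustration under that assumption, not the speaker's implementation: the stage functions, paths, and file extensions are hypothetical placeholders that only rewrite path strings, standing in for the real Cosmos export, the HDFS copy, and the Parquet write.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical stage signature: the slide names the steps but not their APIs.
Stage = Callable[[str], str]


def ingest_partition(partition: str, stages: List[Stage]) -> str:
    """Run one partition through every stage in order, chaining outputs."""
    location = partition
    for stage in stages:
        location = stage(location)
    return location


def ingest_all(partitions: List[str], stages: List[Stage],
               max_workers: int = 4) -> List[str]:
    """Process partitions in parallel, one task per partition.

    Results are returned in the same order as the input partitions.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda p: ingest_partition(p, stages), partitions))


# Placeholder stages (assumptions): each one only computes the next path.
def export_cosmos_partition(name: str) -> str:
    return f"/tmp/export/{name}.tsv"


def copy_to_hdfs(local_path: str) -> str:
    return "hdfs:///staging/" + local_path.rsplit("/", 1)[-1]


def save_as_parquet(staged_path: str) -> str:
    base = staged_path.replace("/staging/", "/warehouse/").rsplit(".", 1)[0]
    return base + ".parquet"


if __name__ == "__main__":
    steps = [export_cosmos_partition, copy_to_hdfs, save_as_parquet]
    for path in ingest_all(["partition_1", "partition_2"], steps):
        print(path)
```

Passing the stages as a list keeps the ordering explicit and lets each placeholder be swapped for a real implementation without touching the parallel driver.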

Page 16:

<Database1>

<Database2>

<Table1>

<Table2>

<Partition1>

<Partition2>

MetastoreDB

Hive Thrift Server

Hive Loader

Zeppelin Server

User

Query
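
The slide does not say how Hive Loader registers data in MetastoreDB. One plausible approach, assuming partitioned Hive tables whose partitions map to directories, is to issue an `ALTER TABLE ... ADD PARTITION` statement per partition. The sketch below only builds the HiveQL text; the database, table, partition column `dt`, and location used in the example are hypothetical.

```python
from typing import Dict


def add_partition_statement(database: str, table: str,
                            spec: Dict[str, str], location: str) -> str:
    """Build a HiveQL statement that registers a directory as a partition.

    Values are inserted as-is (no quoting or escaping), so this sketch is
    only suitable for trusted, simple identifiers and values.
    """
    spec_sql = ", ".join(f"{key}='{value}'" for key, value in spec.items())
    return (
        f"ALTER TABLE {database}.{table} "
        f"ADD IF NOT EXISTS PARTITION ({spec_sql}) "
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    # Hypothetical names, following the <Database>/<Table>/<Partition> labels.
    print(add_partition_statement(
        "database1", "table1", {"dt": "2015-06-01"},
        "hdfs:///warehouse/database1/table1/dt=2015-06-01",
    ))
```

`IF NOT EXISTS` makes the statement safe to re-run, which suits a loader that may see the same partition more than once.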

Page 17:
Page 18:
Page 19:
Page 20:
Page 21:
Page 22:

Data Ingest

Services

Clients

Transform Compute

Transform Compute

Data Streams

Data Sets

Store

Event Processing

HDFS Data Transportation

Spark Streaming Receiver

Analyst

Zeppelin Notebooks

Avocado

Simple query

Query language

“Analyze”

“Debug”

“Mine”

“Glance”

Data

Unified platform

Intelligence

Interactive analytics

Data Products

Better Digital Experiences

Dual users

“Bing”

“Office”

Page 23:
Page 24: