Solving Low Latency Query Over Big Data with Spark SQL (Julien Pierre, Microsoft)

Posted on 12-Aug-2015

Category: Data & Analytics

TRANSCRIPT

Client Data Fluency

Office

Skype

Bing

Modern Data Capability

Instrumentation & Ingestion

Processing & Storage

Reporting & Analytics

Information Management

Mobile-First Analytics Experience

Experimentation

Data Size

Query Latency

Get results inline in Zeppelin

Need to open the results in Excel

[Chart: axis from 0 to 200 (units not shown); bars for Cosmos, SparkSQL, and SparkSQL with Cache; segments: Write and Compile Query, Submit and Wait in Job Queue, Job Run Time]

Mesos Cluster/HDFS

Job Manager Zookeeper

Job Frontend Web API

Spark Driver Host Pool

Spark Hive Thrift Server Zeppelin Server

Avocado (Hive Query + Schedule Task)

Rover (Drag & Drop BI tool with Hive Code Gen)

Zeppelin Web UI

MetastoreDB Hive Loader

[Diagram: data load pipeline, in the order the labels appear]

Cosmos Storage (Partition 1, Partition 2, ..., Partition n)

-> Export Cosmos Partition (Partition 1, Partition 2, ..., Partition n)

-> HDFS.copyFromLocalFile (Task 2, ..., Task n)

-> saveAsParquetFile (Task 2, ..., Task n)
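One reading of the pipeline above is that each partition gets its own task, and each task runs the same stages: export the Cosmos partition, copy the file into HDFS, then rewrite it as Parquet. The sketch below only models that fan-out in plain Python. The names `ingest_partition` and `ingest_all` and the three stage callables are hypothetical, not from the talk. In a real job the copy stage would presumably wrap Hadoop's `FileSystem.copyFromLocalFile` and the last stage Spark 1.x's `saveAsParquetFile` (deprecated since Spark 1.4 in favor of `DataFrame.write.parquet`).

```python
from concurrent.futures import ThreadPoolExecutor


def ingest_partition(index, export, copy_to_hdfs, to_parquet):
    """Run the three stages for one partition and return the final path."""
    local_file = export(index)            # stage 1: export the Cosmos partition
    hdfs_file = copy_to_hdfs(local_file)  # stage 2: stand-in for copyFromLocalFile
    return to_parquet(hdfs_file)          # stage 3: stand-in for saveAsParquetFile


def ingest_all(partition_count, export, copy_to_hdfs, to_parquet, workers=4):
    """Submit one task per partition; results come back in partition order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(ingest_partition, i, export, copy_to_hdfs, to_parquet)
            for i in range(1, partition_count + 1)
        ]
        return [future.result() for future in futures]


if __name__ == "__main__":
    # Stand-in stages that only compute paths (assumed naming, no real I/O).
    paths = ingest_all(
        3,
        export=lambda i: f"/tmp/export/part-{i}.tsv",
        copy_to_hdfs=lambda path: "hdfs://" + path,
        to_parquet=lambda path: path.rsplit(".", 1)[0] + ".parquet",
    )
    print(paths)
```

Passing the stages in as callables keeps the sketch runnable without a cluster; the thread pool stands in for whatever scheduler actually distributes the tasks.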

<Database1>

<Database2>

<Table1>

<Table2>

<Partition1>

<Partition2>

MetastoreDB

Hive Thrift Server

Hive Loader

Zeppelin Server

User

Query
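The database / table / partition hierarchy next to MetastoreDB and Hive Loader suggests that the loader registers each Parquet directory as a Hive partition so the Thrift Server and Zeppelin can query it. The slides do not show how this is done; the sketch below assumes it issues a standard HiveQL `ALTER TABLE ... ADD PARTITION` statement per directory. The function names, the `ds` partition key, and the `hdfs:///data` root are hypothetical.

```python
def partition_location(root, database, table, partition_spec):
    """Build a Hive-style directory path such as root/db/table/key=value."""
    parts = "/".join(f"{key}={value}" for key, value in partition_spec.items())
    return f"{root}/{database}/{table}/{parts}"


def add_partition_ddl(database, table, partition_spec, location):
    """Build the HiveQL statement that registers one partition directory."""
    spec = ", ".join(f"{key}='{value}'" for key, value in partition_spec.items())
    return (
        f"ALTER TABLE {database}.{table} "
        f"ADD IF NOT EXISTS PARTITION ({spec}) "
        f"LOCATION '{location}'"
    )


if __name__ == "__main__":
    spec = {"ds": "2015-08-12"}  # assumed partition key and value
    location = partition_location("hdfs:///data", "Database1", "Table1", spec)
    print(add_partition_ddl("Database1", "Table1", spec, location))
```

`IF NOT EXISTS` makes the statement safe to re-run when a load is retried, which is one reason a loader might prefer it over a bare `ADD PARTITION`.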

Data Ingest

Services

Clients

Transform Compute

Transform Compute

Data Streams

Data Sets

Store

Event Processing

HDFS Data Transportation

Spark Streaming Receiver

Analyst

Zeppelin Notebooks

Avocado

Simple query

Query language

“Analyze”

“Debug”

“Mine”

“Glance”

Data

Unified platform

Intelligence

Interactive analytics

Data

Products

Better Digital Experiences

Dual users

“Bing”

“Office”
