The Hive Think Tank: The Microsoft Big Data Stack, by Raghu Ramakrishnan, CTO for Data, Microsoft

Big Data @ Microsoft

Raghu Ramakrishnan, CTO for Data, Technical Fellow

Microsoft

Data and Analytics – 3 Pillars

SQL 2016, Azure SQL DB, Azure SQL DW, SQL Server R services: on-prem and cloud (Windows, Linux)

Cortana Intelligence Suite: Hadoop, Data Lake, Machine learning, PowerBI, Data Factory, Streaming, Perceptual Intelligence; on-prem connectivity

Microsoft R Server: Hadoop, Teradata; on-prem and cloud (Windows, Linux)

SQL Server 2016: Everything Built-In

The above graphic was published by Gartner, Inc. as part of a larger research document and should be evaluated in the context of the entire document. The Gartner document is available upon request from Microsoft. Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner's research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

Consistent experience from on-premises to cloud

Self-service BI per user: Microsoft $120, Tableau $480, Oracle $2,230

In-memory across all workloads

TPC-H non-clustered 10TB: SQL Server is #1, #2, and #3; Oracle is #4

[Chart: SQL Server, Oracle, MySQL, SAP HANA; 2010-2015]

TPC-H non-clustered results as of 04/06/15, 5/04/15, 4/15/14 and 11/25/13, respectively. http://www.tpc.org/tpch/results/tpch_perf_results.asp?resulttype=noncluster

at massive scale

National Institute of Standards and Technology Comprehensive Vulnerability Database update 5/4/2015

In-Database Advanced Analytics

No need to move the data

Open source R with in-memory & massive scale – multi-threading & massive parallel processing

Data Scientist: interact directly with data

R built-in to SQL Server

Data Developer/DBA: manage data and analytics together

Example Solutions

• Sales forecasting

• Warehouse efficiency

• Predictive maintenance

Extensibility


R Integration

Relational data

Analytic Library

T-SQL interface


New R scripts

• Credit risk protection

Microsoft Azure Marketplace

Real-time operational analytics without moving the data


End-to-end mobile BI, Advanced Analytics, Mission critical OLTP

High-performance open source R plus:

Enterprise Scale & Performance

– Scales from workstations to large clusters

– Scales to large data sizes

– Growing portfolio of Parallelized algorithms

Secure, Scalable R Deployment/Operationalization

Write Once Deploy Anywhere for multiple platforms

IDE for data scientists and developers

Enterprise Class Support

DistributedR

DeployR DevelopR

ScaleR

ConnectR

Cloud – SQL Server/SQL Azure

Shifting how you purchase and manage machines

Increased focus on Total Cost of Ownership and continuous improvements

Built from the same code base

We increased surface area compatibility with V12 Azure SQL Database

We’re learning how to run our own code – the good and the bad

We’re using that to improve both product and service

Microsoft is the only provider both on-premises and in the cloud

Order history

Name SSN Date

Jane Doe cm61ba906fd 2/28/2005

Jim Gray ox7ff654ae6d 3/18/2005

John Smith i2y36cg776rg 4/10/2005

Bill Brown nx290pldo90l 4/27/2005

Sue Daniels ypo85ba616rj 5/12/2005

Order history

Name SSN Date

Jane Doe cm61ba906fd 2/28/2005


John Smith i2y36cg776rg 4/10/2005

Bill Brown nx290pldo90l 4/27/2005

Customer data

Product data

Order History

Stretch to cloud

Stretch SQL Server into Azure

Stretch warm and cold tables to Azure with remote query processing

App

Query

Microsoft Azure


SQL Server 2016

Azure SQL DW

Fully managed relational data warehouse-as-a-service

First elastic cloud data warehouse with proven SQL Server capabilities

Support your smallest to your largest data storage needs

Scales to petabytes of data

Massively Parallel Processing

Instant-on compute scales in seconds

Query Relational / Non-Relational

SaaS

Azure

Public Cloud

Office 365

Get started in minutes

Integrated with Azure ML, PowerBI & ADF

Simple billing compute & storage

Pay for what you need, when you need it with dynamic pause

Azure

Store any data: relations

Do any analysis: SQL queries, Hive

At any speed: batch, Hive

At any scale … elastic!

Anywhere

Data to Intelligent Action

Web Logs, Omniture logs

On-premises SQL Server (customer and product data)

In-Store Activity with Kinect sensors

Social Data

Diagnostic streaming

Event hubs

Machine Learning

Stream Analytics

Azure DataLake

Data Factory: Move Data, Orchestrate, Schedule, and Monitor

HDInsight HDInsight Machine Learning

Azure SQL Data Warehouse

Power BI

DATA SOURCES, INGEST, PREPARE, ANALYZE, PUBLISH, CONSUME

Stream Analytics

Cortana

Web/LOB Dashboards

Azure Data Analytics Stack

REEF library

STORAGE

YARN

HDFS/WebHDFS API

Compute-tier Cache Clusters (Local ENs + CSM): RAM / SSD / HDD

WAS-based Remote Storage

Cosmos Store API

CLUSTER-WIDE RM (YARN++)

YARN + Federation

YARN + Rayon (Capacity reservation)

YARN + Mercury

Shared micro-services for all metadata (extent map, logical name space, secure store) based on Hekaton/RSL rings

Application Engines

Per-job RM and runtime

M/R, U-SQL, Batch, Spark, Tez, Spark

Runtime

Spark, Hive, U-SQL, Azure ML, Azure SA

COMPUTE TIER

SQL-DW, HDInsight, IaaS

Services

Windows, SMSG, LiveAds, CRM/Dynamics, Windows Phone, Xbox Live, Office365, STB Malware Protection, Microsoft Stores, STB Commerce Risk, Messenger, LCA, Exchange, Yammer, Skype, Bing

data managed: EBs

cluster sizes: 10s of Ks

# machines: 100s of Ks

daily I/O: >100 PBs

# internal developers: 1000s

# daily jobs: 100s of Ks

Observation

Pattern

Theory

Hypothesis

Descriptive Analytics: What happened?

Diagnostic Analytics: Why did it happen?

Predictive Analytics: What will happen?

Prescriptive Analytics: How can we make it happen?

Confirmation

Theory

Hypothesis

Observation

Implement Data Warehouse

Physical Design

ETL Development

Reporting & Analytics Development

Install and Tune

Reporting & Analytics Design

Dimension Modelling

ETL Design

Setup Infrastructure

Understand Corporate Strategy

Data sources

ETL

BI and analytic

Data warehouse

Gather Requirements

Business Requirements

Technical Requirements

Ingest all data regardless of requirements

Store all data in native format without schema definition

Do analysis using analytic engines like Hadoop

Interactive queries

Batch queries

Machine Learning

Data warehouse

Real-time analytics

Devices

What happened?

What is happening?

Why did it happen?

What are key relationships?

What will happen?

What if?

How risky is it?

What should happen?

What is the best option?

How can I optimize?

Data sources

Handling failures

Sharing data, resources

Parallelism

Data-aware Optimization

Security, Compliance, Governance

Enterprise

Forrester Wave

Big Data Hadoop

Cloud Solutions

Q2 2016

• Interactive and Real-Time Analytics requires i

• Massive data volumes require scale-out stores using commodity servers, even archival storage

Tiered Storage

Seamlessly move data across tiers, mirroring life-cycle and usage patterns

Schedule compute near low-latency copies of data

How can we manage this trade-off without moving data across different storage systems (and governance boundaries)?

• Many different analytic engines (OSS and vendors; SQL, ML; batch, interactive, streaming)

• Many users’ jobs (across these job types) run on the same machines (where the data lives)

Resource Management with Multitenancy and SLAs

Policy-driven management of vast compute pools co-located with data

Schedule computation “near” data

How can we manage this multi-tenanted heterogeneous job mix across tens of thousands of machines?

Azure Data Lake Store

Fully managed cloud data store designed for analytics

Supports HDFS compliant analytics applications and tools

Petabyte files, unlimited account size

High throughput for analytics performance

Low latency ingestion with read as you write

AAD-based authentication, access auditing

File and folder-level ACLs, Encryption at rest

ADLS Security: Encryption-at-Rest

Transparently encrypts data flowing to and from public networks as well as at rest

Transparent server-side encryption

Users can manage their own encryption keys or let Azure Data Lake Store manage the keys using Azure Key Vault

ADLS Security: Role-Based Access Control

Each file and directory is associated with an owner and a group

Files or directories have separate permissions (read(r), write(w), execute(x)) for owners, members of the group, and for all other users

Fine-grained access control lists (ACLs) can be specified for specific named users or named groups
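The permission model described above can be sketched in a few lines of Python. This is a simplified illustration only, not the evaluation algorithm Azure Data Lake Store actually uses: the `Entry` class, the `allowed` function, and the precedence order (owner, then named user, then groups, then others, with no ACL mask) are assumptions made for the example.

```python
from dataclasses import dataclass, field

READ, WRITE, EXECUTE = 4, 2, 1  # POSIX-style permission bits


@dataclass
class Entry:
    """Hypothetical file or directory: owner, group, rwx bits, optional ACLs."""
    owner: str
    group: str
    owner_bits: int
    group_bits: int
    other_bits: int
    user_acl: dict = field(default_factory=dict)   # named user -> bits
    group_acl: dict = field(default_factory=dict)  # named group -> bits


def allowed(entry: Entry, user: str, groups: set, wanted: int) -> bool:
    """Return True if `user` (a member of `groups`) holds all `wanted` bits."""
    if user == entry.owner:
        return entry.owner_bits & wanted == wanted
    if user in entry.user_acl:
        # A named-user ACL entry takes precedence over any group entry.
        return entry.user_acl[user] & wanted == wanted
    # Collect the owning-group bits and any matching named-group ACL entries.
    matches = [entry.group_bits] if entry.group in groups else []
    matches += [bits for name, bits in entry.group_acl.items() if name in groups]
    if matches:
        return any(bits & wanted == wanted for bits in matches)
    # No owner, user, or group entry applies: fall back to "all other users".
    return entry.other_bits & wanted == wanted


report = Entry(owner="alice", group="analysts",
               owner_bits=READ | WRITE | EXECUTE,
               group_bits=READ | EXECUTE,
               other_bits=0,
               user_acl={"bob": READ},
               group_acl={"etl": READ | WRITE})
```

With this entry, `allowed(report, "dave", {"etl"}, WRITE)` is granted through the named-group ACL, while a user with no matching entry falls through to the empty "other" permissions and is denied.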


ADL Store: Ingress

Data can be ingested into Azure Data Lake Store from a variety of sources

Server logs

Azure Event Hub

Apache Flume

Azure Storage Blobs

Custom programs

.NET SDK

JavaScript CLI

Azure Portal

Azure PowerShell

Azure Data Factory

Apache Sqoop

Azure SQL DB

Azure SQL DW

Azure tables

Table Storage

On-premises databases

ADL Store

Built-in copy service

ADL Store: Egress

Data can be exported from Azure Data Lake Store into numerous targets/sinks

Azure SQL DB

Azure SQL DW

Azure Tables

Table Storage

On-premises databases

Azure Data Factory

Apache Sqoop

Azure Storage Blobs

Custom programs

.NET SDK

JavaScript CLI

Azure Portal

Azure PowerShell

Built-in copy service

ADL Store

Naming Service, Secret Store, Secure Store Service, Extent Metadata

Remote Storage: Data, Data, Data…

Access path: 1) Filename Translation, 2) Azure Access Keys, 3) Find Extents, 4) Data access

Remote storage tier builds securely on WAS

Secure

Works with YARN!

COMPUTE TIER

Intelligent ingest

Massively parallel

• Interactive and Real-Time Analytics requires i

• Massive data volumes require scale-out stores using commodity servers, even archival storage

Tiered Storage

Scale storage independently of compute

Seamlessly move data across tiers, mirroring life-cycle and usage patterns

Schedule compute near low-latency copies of data

Data Lifecycle Management

How can we manage this trade-off without moving data across different storage systems (and governance boundaries)?

Naming Service, Secret Store, Secure Store Service, Extent Metadata

Remote Storage: Data, Data, Data…

Local Storage: Data, Data, Data…

Access path: 1) Filename Translation, 2) Azure Access Keys, 3) Find Extents, 4) Data access

Remote storage tier builds securely on WAS

Secure

Works with YARN!

COMPUTE TIER

Intelligent ingest

Massively parallel

Azure HDInsight—Linux and Windows

Managed, Monitored, Supported

• Cluster customization – Install your favorite project

• Harness existing .NET & Java skills to write customer extensions

• Supports broad ecosystem of ISVs (Hadoop and Traditional)

Full Apache Hadoop

• Batch – MapReduce, Pig, Hive, Spark

• Stream Processing and Analytics – Storm, SparkStreaming

• Interactive SQL – Hive (Tez), and SparkSQL

• Table Serving – HBase

• Machine Learning – SparkML, Mahout

Batch: MapReduce, Pig, Hive, Spark

Interactive SQL: Hive (Tez), SparkSQL

Stream Analytics: Storm, SparkStreaming

Machine Learning: SparkML, Mahout

Table Serving: HBase

Exploratory Visualization: Jupyter, Zeppelin

Interactive SQL: SQL DW

Stream Analytics: Azure Stream Analytics

Machine Learning: Azure ML

Table Serving: Azure SQL DB

Exploratory Visualization: Power BI

• > 14 million field hours

http://www.ebird.org


Tree Swallow

Azure Data Lake Analytics Service

A new distributed analytics service

Built on Apache YARN

Scales dynamically with a dial

Pay by the query

Supports Azure AD for access control, roles, and integration with on-premises identity systems

U-SQL language unifies the benefits of SQL with the power of C#

Hive etc. will be added over time

Processes data across Azure


Get started

1. Log in to Azure

2. Create an ADLA account

3. Write and submit an ADLA job with U-SQL (or Hive/Pig)

4. The job reads and writes data from storage (ADLS, Azure Blobs, Azure DB, …)

30 seconds

ADLA Complements HDInsight

HDInsight

Dedicated managed clusters for developers familiar with the Open Source: Java, Eclipse, Hive, etc.

Clusters offer customization, control, and flexibility in a managed Hadoop cluster

ADLA

Enables customers to leverage existing experience with C#, SQL & PowerShell

Offers convenience, efficiency, and automatic scale in a “job service” form factor over a system-managed shared resource pool

U-SQL

A hyper-scalable, highly extensible language for preparing, transforming and analyzing all data

Allows users to focus on the what, not the how, of business problems

Built on familiar languages (SQL and C#) and supported by a fully integrated development environment

Built for data developers & scientists

U-SQL Language Philosophy

Declarative query and transformation language:

• Uses SQL’s SELECT FROM WHERE with GROUP BY/aggregation, joins, SQL Analytics functions

• Optimizable, scalable

Operates on unstructured & structured data

• Schema on read over files

• Relational metadata objects (e.g. database, table)

Extensible from ground up:

• Type system is based on C#

• Expression language is C#

• User-defined functions (U-SQL and C#)

• User-defined types (U-SQL/C#) (future)

• User-defined aggregators (C#)

• User-defined operators (UDO) (C#)

U-SQL provides the parallelization and scale-out framework for user code

• EXTRACTOR, OUTPUTTER, PROCESSOR, REDUCER, COMBINERS

Expression-flow programming style:

• Easy to use functional lambda composition

• Composable, globally optimizable

Federated query across distributed data sources (soon)

REFERENCE ASSEMBLY MyDB.MyAssembly;

CREATE TABLE T( cid int, first_order DateTime
              , last_order DateTime, order_count int, order_amount float );

// Schema on read: extract typed rowsets from the input files.
@o = EXTRACT oid int, cid int, odate DateTime, amount float
     FROM "/input/orders.txt"
     USING Extractors.Csv();

@c = EXTRACT cid int, name string, city string
     FROM "/input/customers.txt"
     USING Extractors.Csv();

// Per-customer order statistics; the WHERE predicates are C# expressions.
@j = SELECT c.cid, MIN(o.odate) AS firstorder
          , MAX(o.odate) AS lastorder, COUNT(o.oid) AS ordercnt
          , SUM(o.amount) AS totalamount
     FROM @c AS c LEFT OUTER JOIN @o AS o ON c.cid == o.cid
     WHERE c.city.StartsWith("New")
           && MyNamespace.MyFunction(o.odate) > 10
     GROUP BY c.cid;

OUTPUT @j TO "/output/result.txt" USING new MyData.Write();

INSERT INTO T SELECT * FROM @j;


Federated Queries: Query Data Where It Lives

Easily query data in multiple Azure data stores without moving it to a single store

Benefits

Avoid moving large amounts of data across the network between stores

Single view of data irrespective of physical location

Minimize data proliferation issues caused by maintaining multiple copies

Single query language for all data

Each data store maintains its own sovereignty

Design choices based on the need

U-SQL (Azure Data Lake Analytics): Query, Result

Data stores: Azure Storage Blobs, Azure SQL in VMs, Azure SQL DB

Join Local (ADLS) and External Data

1. Create two tables.

• An external table ‘PurchaseOrders’ that refers to the PurchaseOrders table in the external SQL Azure DB.

• A ‘local’ table ‘UserIdsTable’ created by ‘extracting’ User Ids and region fields from the WebLogRecords.txt file stored in Azure Data Lake.

2. Join the PurchaseOrders table with UserIds table on the common UserId column.

External purchase orders table (Azure SQL DB) JOIN (on User IDs) local user IDs table, in Azure Data Lake Analytics

Find sum of all purchases by users in the ‘en-us’ region

Query 9


WebLogRecords.txt
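The logic of this example (extract user IDs and regions from a log file, join them with purchase orders on the user ID, and sum purchases for the ‘en-us’ region) can be sketched in plain Python. This is not U-SQL and does not call any Azure service: the sample rows, the comma-separated log format, and the column names `UserId` and `Amount` are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical rows standing in for the external PurchaseOrders table.
purchase_orders = [
    {"UserId": 1, "Amount": 20.0},
    {"UserId": 2, "Amount": 5.0},
    {"UserId": 1, "Amount": 7.5},
    {"UserId": 3, "Amount": 12.0},
]

# Hypothetical lines standing in for WebLogRecords.txt: "UserId,Region".
web_log_lines = ["1,en-us", "2,fr-fr", "3,en-us", "1,en-us"]


def extract_user_ids(lines):
    """Schema on read: parse (UserId, Region) pairs, one entry per user."""
    users = {}
    for line in lines:
        user_id, region = line.split(",")
        users[int(user_id)] = region
    return users


def total_purchases_by_user(orders, users, region):
    """Join orders to users on UserId and sum the amounts for one region."""
    totals = defaultdict(float)
    for order in orders:
        if users.get(order["UserId"]) == region:
            totals[order["UserId"]] += order["Amount"]
    return dict(totals)


user_ids = extract_user_ids(web_log_lines)
per_user = total_purchases_by_user(purchase_orders, user_ids, "en-us")
grand_total = sum(per_user.values())
```

In a federated query the purchase orders would stay in the remote database and only the join would be expressed in the query language; here both sides are in-memory lists purely to show the join and aggregation steps.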

Concepts: Jobs, Stages and Vertexes

Each job is broken into a number of vertexes

Each vertex is some work that needs to be done

[Diagram: Input, Output; 6 Stages, 8 Vertexes]

Vertexes are organized into stages

– Vertexes in each stage do the same work on different parts of the same data

– A vertex in one stage may depend on a vertex in an earlier stage

Stages themselves are organized into an acyclic graph
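A minimal Python sketch of this structure follows, assuming a made-up job with five stages and eight vertexes (the stage names and counts are hypothetical and do not reproduce the job on the slide). It uses the standard library's `graphlib.TopologicalSorter` to group stages into waves in which every stage has all of its dependencies completed.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical job: stage name -> (vertex count, stages it depends on).
stages = {
    "extract_orders": (2, []),
    "extract_customers": (1, []),
    "join": (2, ["extract_orders", "extract_customers"]),
    "aggregate": (2, ["join"]),
    "write_result": (1, ["aggregate"]),
}


def schedule(stage_graph):
    """Return waves of stages; each wave depends only on earlier waves."""
    sorter = TopologicalSorter(
        {name: deps for name, (_, deps) in stage_graph.items()}
    )
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # stages whose inputs are complete
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = schedule(stages)
total_vertexes = sum(count for count, _ in stages.values())
```

Stages in the same wave have no dependencies on each other, so their vertexes could run in parallel; a real scheduler would also track individual vertexes rather than whole stages.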



Resource Managers for Big Data

Allocate compute containers to competing jobs

Multiple job engines shared pool

Containers

YARN: Resource manager for Hadoop 2.x

Corona, Mesos, Omega
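As a rough illustration of what "allocating containers to competing jobs" from a shared pool involves, the following Python sketch implements a simple max-min fair split. It is an assumption-laden toy and not the scheduling algorithm of YARN or of any system named above; the function name `allocate` and the job names are hypothetical.

```python
def allocate(capacity, demands):
    """Max-min fair split of `capacity` containers among jobs' `demands`."""
    allocation = {job: 0 for job in demands}
    remaining = capacity
    unsatisfied = {job for job, want in demands.items() if want > 0}
    while remaining > 0 and unsatisfied:
        # Offer each unsatisfied job an equal share of what is left.
        share = max(remaining // len(unsatisfied), 1)
        for job in sorted(unsatisfied):
            grant = min(share, demands[job] - allocation[job], remaining)
            allocation[job] += grant
            remaining -= grant
            if allocation[job] == demands[job]:
                unsatisfied.discard(job)  # job has everything it asked for
            if remaining == 0:
                break
    return allocation


# Three hypothetical jobs from different engines competing for 10 containers.
result = allocate(10, {"hive": 2, "spark": 8, "usql": 8})
```

The small job is fully satisfied and the two large jobs split what is left evenly. Production resource managers add queues, priorities, locality, and preemption on top of this kind of basic sharing.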

Shared Data and Compute

Tiered Storage

Relational Query Engine

Machine Learning

Compute Fabric (Resource Management)

Multiple analytic engines sharing same resource pool

Compute and store/cache on same machines

What’s Behind a U-SQL Query


YARN Gaps

resource allocation SLOs

scalability limitations

• High allocation latency

• Support for specialized execution frameworks

• Interactive environments, long-running services

• Amoeba Rayon

• Status: shipping in Apache Hadoop 2.6

• Mercury and Yaq

• Status: Now in Apache Hadoop trunk!

• Federation

• Status: prototype and JIRA

• Framework-level Pooling

• Enable frameworks that want to take over resource allocation to support millisecond-level response and adaptation times

• Status: spec

Microsoft Contributions to OSS Apache YARN

REEF

http://ww.reef-project.org http://reef.incubator.apache.org


http://aka.ms/adltechblog/
