differentiate big data vs data warehouse use cases for a cloud … · 2019-04-24 · hadoop and...

57
Differentiate Big Data vs Data Warehouse use cases for a cloud solution James Serra Big Data Evangelist Microsoft [email protected] Blog: JamesSerra.com

Upload: others

Post on 20-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Differentiate Big Data vs Data Warehouse use cases for a cloud solution

James SerraBig Data Evangelist

Microsoft

[email protected]

Blog: JamesSerra.com

Page 2: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

About Me

▪ Microsoft, Big Data Evangelist

▪ In IT for 30 years, worked on many BI and DW projects

▪ Worked as desktop/web/database developer, DBA, BI and DW architect and developer, MDM

architect, PDW/APS developer

▪ Been perm employee, contractor, consultant, business owner

▪ Presenter at PASS Business Analytics Conference, PASS Summit, Enterprise Data World conference

▪ Certifications: MCSE: Data Platform, Business Intelligence; MS: Architecting Microsoft Azure

Solutions, Design and Implement Big Data Analytics Solutions, Design and Implement Cloud Data

Platform Solutions

▪ Blog at JamesSerra.com

▪ Former SQL Server MVP

▪ Author of book “Reporting with Microsoft SQL Server 2012”

Page 3: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

I tried to understand the Microsoft data platform on my own…

And felt like I was body slammed by Randy

Savage:

Let’s prevent that from happening…

Page 4: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Agenda

• Data Lake Driven Analytics

• Compute technologies

• Architecture Patterns

Page 5: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Enterprise Data Maturity Stages

Structured data is transacted and locally managed. Data used reactively

STAGE 2: Informative

STAGE 1: Reactive

Structured data is managed and analyzed centrally and informs the business

Data capture is comprehensive and scalable and leads business decisions based on advanced analytics

STAGE 4: Transformative

STAGE 3: Predictive Data transforms

business to drive desired outcomes.

Leaders(Real-time

intelligence)

Laggards(rear-view

mirror)

Any data, any source, anywhere at scale

Page 6: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Benefits of the cloud for Big Data

Agility• Unlimited elastic scale

• Pay for what you need

Innovation• Quick “Time to market”

• Fail fast

Risk• Availability

• Reliability

• Security

Total cost of ownership calculator: https://www.tco.microsoft.com/

Page 7: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open
Page 8: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Harness the growing and changing nature of data

Need to collect any data

StreamingStructured

Challenge is combining transactional data stored in relational databases with less structured data

Big Data = All Data. Requires a data lake

Get the right information to the right people at the right time in the right format

Semi-structured

“ ”

Page 9: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Observation

Pattern

Theory

Hypothesis

What will happen?

How can we make it happen?

Predictive

Analytics

Prescriptive

Analytics

What happened?

Why did it happen?

Descriptive

Analytics

Diagnostic

Analytics

Confirmation

Theory

Hypothesis

Observation

Two Approaches to getting value out of data: Top-Down + Bottoms-Up

Page 10: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Implement Data Warehouse

Physical Design

ETL

Development

Reporting &

Analytics

Development

Install and Tune

Reporting & Analytics Design

Dimension Modelling

ETL Design

Setup Infrastructure

Understand Corporate Strategy

Data Warehousing Uses A Top-Down Approach

Data sources

Gather Requirements

Business Requirements

Technical Requirements

Page 11: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

The “data lake” Uses A Bottoms-Up Approach

Ingest all data regardless of requirements

Store all data in native format without

schema definition

Do analysisUsing analytic engines

like Hadoop

Interactive queries

Batch queries

Machine Learning

Data warehouse

Real-time analytics

Devices

Page 12: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Exactly what is a data lake?

A storage repository, usually Hadoop, that holds a vast amount of raw data in its native format until it is needed.

• Inexpensively store unlimited data

• Collect all data “just in case”

• Store data with no modeling – “Schema on read”

• Complements Enterprise data warehouse (EDW)

• Frees up expensive EDW resources

• Quick user access to data

• ETL Hadoop tools

• Easily scalable

• Place to backup data to

• Place to move older data (archive)

Page 13: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Needs data governance so your data lake does not turn into a data swamp!

Page 14: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Objectives✓ Plan the structure based on optimal data retrieval✓ Avoid a chaotic, unorganized data swamp

Data Retention PolicyTemporary data

Permanent data

Applicable period (ex: project lifetime)

etc…

Business Impact / CriticalityHigh (HBI)

Medium (MBI)

Low (LBI)

etc…

Confidential ClassificationPublic information

Internal use only

Supplier/partner confidential

Personally identifiable information (PII)

Sensitive – financial

Sensitive – intellectual property

etc…

Probability of Data AccessRecent/current data

Historical data

etc…

Owner / Steward / SME

Subject Area

Security BoundariesDepartment

Business unit

etc…

Time PartitioningYear/Month/Day/Hour/Minute

Downstream App/Purpose

Common ways to organize the data:

Page 15: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Data Lake + Data Warehouse Better Together

Data sources

What happened?

Descriptive

Analytics

Diagnostic

Analytics

Why did it happen?

What will happen?

Predictive

Analytics

Prescriptive

Analytics

How can we make it happen?

Page 16: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Data Lake Data Warehouse

Schema-on-read Schema-on-write

Physical collection of uncurated data Data of common meaning

System of Insight: Unknown data to do

experimentation / data discovery

System of Record: Well-understood data to do

operational reporting

Any type of data Limited set of data types (ie. relational)

Skills are limited Skills mostly available

All workloads – batch, interactive, streaming,

machine learning

Optimized for interactive querying

Complementary to DW Can be sourced from Data Lake

Page 17: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Data WarehouseServing, Security & Compliance• Business people

• Low latency

• Complex joins

• Interactive ad-hoc query

• High number of users

• Additional security

• Large support for tools

• Dashboards

• Easily create reports (Self-service BI)

• Know questions

A data lake is just a glorified file folder with data files in it – how many end-

users can accurately create reports from it?

Page 18: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open
Page 19: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

What is Azure Databricks?A fast, easy and collaborative Apache® Spark™ based analytics platform optimized for Azure

Best of Databricks Best of Microsoft

Designed in collaboration with the founders of Apache Spark

One-click set up; streamlined workflows

Interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

Native integration with Azure services (Power BI, SQL DW, Cosmos DB, Blob Storage)

Enterprise-grade Azure security (Active Directory integration, compliance, enterprise -grade SLAs)

Page 20: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Azure HDInsight

Hadoop and Spark as a Service on Azure

Fully-managed Hadoop and Spark

for the cloud

100% Open Source Hortonworks

data platform

Clusters up and running in minutes

Managed, monitored and supported

by Microsoft with the industry’s best SLA

Familiar BI tools for analysis, or open source

notebooks for interactive data science

63% lower TCO than deploy your own

Hadoop on-premises*

*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”

Page 21: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Hortonworks Data Platform (HDP) 3.0

Simply put, Hortonworks ties all the open source products together (20)

(under the covers of HDInsight 4.0 – public preview)

Page 22: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Have a mature Hadoop ecosystem already established that you want to migrate

Want to be 100% open source

Want to use Hadoop tools and make them available 24/7 at a lower cost

Want to use certain tools like Kafka/Storm/HBase/R Server/LLAP/Pig/Beam/Flink

Page 23: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

CONTROL EASE OF USE

Azure Data Lake

Analytics

Azure Data Lake Store Gen2

Azure Storage

Any Hadoop technology,

any distribution

Workload optimized,

managed clusters

Azure MarketplaceHDP | CDH | MapR

IaaS Clusters Managed Clusters

Azure HDInsight

Frictionless & Optimized

Spark clusters

Azure Databricks

BIG

DA

TA

S

TO

RA

GE

BIG

DA

TA

A

NA

LYT

ICS

Red

uced

Ad

min

istr

ati

on

K N O W I N G T H E V A R I O U S B I G D A T A S O L U T I O N S

Page 24: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Azure SQL Data WarehouseA relational data warehouse-as-a-service, fully managed by Microsoft.

Industries first elastic cloud data warehouse with enterprise-grade capabilities.

Support your smallest to your largest data storage needs while handling queries up to 100x faster.

Page 25: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

SMP vs MPP

Page 26: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Azure Analysis Services

Page 27: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open
Page 28: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Microsoft Products vs Hadoop/OSS Products

Note: Many of the Hadoop/OSS products are available in Azure

Microsoft Product Hadoop/Open Source Software Product

Office365/Excel OpenOffice/Calc

Cosmos DB MongoDB, MarkLogic, HBase, Cassandra

SQL Database SQLite, MySQL, PostgreSQL, MariaDB, Apache Ignite

Azure Data Lake Analytics/YARN None

Azure VM/IaaS OpenStack

Blob Storage HDFS, Ceph

Azure HBase Apache HBase (Azure HBase is a service wrapped around Apache HBase), Apache Trafodion

Event Hub Apache Kafka

Azure Stream Analytics Apache Storm, Apache Spark Streaming, Apache Flink, Apache Beam, Twitter Heron

Power BI Apache Zeppelin, Apache Jupyter, Airbnb Caravel, Kibana

HDInsight Hortonworks (pay), Cloudera (pay), MapR (pay)

Azure ML (Machine Learning) Apache Mahout, Apache Spark MLib, Apache PredictionIO

Microsoft R Open R

SQL Data Warehouse/Interactive queries Apache Hive LLAP, Presto, Apache Spark SQL, Apache Drill, Apache Impala

IoT Hub Apache NiFi

Azure Data Factory Apache Falcon, Airbnb Airflow, Apache Oozie, Apache Azkaban

Azure Data Lake Storage/WebHDFS HDFS Ozone

Azure Analysis Services/SSAS Apache Kylin, Apache Druid, AtScale (pay)

SQL Server Reporting Services None

Hadoop Indexes Jethro Data (pay)

Azure Data Catalog Apache Atlas

PolyBase Apache Drill

Azure Search Apache Solr, Apache ElasticSearch (Azure Search build on ES)

SQL Server Integration Services (SSIS) Talend Open Studio, Pentaho Data Integration

OthersApache Ambari (manage Hadoop clusters), Apache Ranger (data security such as row/column-level security), Apache Knox (secure entry point for

Hadoop clusters), Apache Flume (collecting log data)

Page 29: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Compute pool

SQL Compute

Node

SQL Compute

Node

SQL Compute

Node…

Compute pool

SQL Compute

Node

IoT data

Directly

read from

HDFS

Persistent storage

Storage pool

SQL

ServerSpark

HDFS Data Node

SQL

ServerSpark

HDFS Data Node

SQL

ServerSpark

HDFS Data Node

Kubernetes pod

AnalyticsCustom

apps BI

SQL Server

master instance

Node Node Node Node Node Node Node

SQL

Data mart

SQL Data

Node

SQL Data

Node

Compute pool

SQL Compute

Node

Storage Storage

Page 30: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

© Microsoft Corporation

Azure Data ExplorerFast and fully managed big data analytics service

Fully managed

for efficiency

Focus on insights, not the infra-

structure for fast time to value

No infrastructure to manage;

provision the service, choose the

SKU for your workload, and

create database.

Optimized for

streaming data

Get near-instant insights

from fast-flowing data

Scale linearly up to 200 MB per

second per node with highly

performant, low latency ingestion.

Designed for

data exploration

Run ad-hoc queries using the

intuitive query language

Returns results from 1 Billion

records < 1 second without

modifying the data or metadata

Page 31: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

© Microsoft Corporation

Azure Data Explorer overview

1. Capability for many data types,

formats, and sources

Structured (numbers), semi-structured

(JSON\XML), and free text

2. Batch or streaming ingestion

Use managed ingestion pipeline or

queue a request for pull ingestion

3. Compute and storage isolation

• Independent scale out / scale in

• Persistent data in Azure Blob Storage

• Caching for low-latency on compute

4. Multiple options to support

data consumption

Use out-of-the box tools and connectors

or use APIs/SDKs for custom solution

Page 32: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open
Page 33: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Questions to ask client

• Can you use the cloud?

• Is this a new solution or a migration?

• Do the developers have Hadoop skills?

• Will you use non-relational data (variety)?

• How much data do you need to store (volume)?

• Is this an OLTP or OLAP/DW solution?

• Will you have streaming data (velocity)?

• Will you use dashboards?

• How fast do the operational reports need to run?

• Will you do predictive analytics?

• Do you want to use Microsoft tools or open source?

• What are your high availability and/or disaster recovery requirements?

• Do you need to master the data (MDM)?

• Are there any security limitations with storing data in the cloud?

• Does this solution require 24/7 client access?

• How many concurrent users will be accessing the solution at peak-time and on average?

• What is the skill level of the end users?

• What is your budget and timeline?

• Is the source data cloud-born and/or on-prem born?

• How much daily data needs to be imported into the solution?

• What are your current pain points or obstacles (performance, scale, storage, concurrency, query times, etc)?

• Are you ok with using products that are in preview?

Page 34: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Advanced Analytics

Social

LOB

Graph

IoT

Image

CRM

INGEST STORE PREP MODEL & SERVE(& store)

Data orchestration and monitoring

Big data store Transform & Clean Data warehouse

AI

BI + Reporting

Azure Data Factory

SSIS

Azure Data Lake Storage Gen2

Blob Storage

Azure Data Lake Storage Gen1

SQL Server 2019 Big Data Cluster

Azure Databricks

Azure HDInsight

PolyBase & Stored Procedures

Power BI Dataflow

Azure Data Lake Analytics

Azure SQL Data Warehouse

Azure Analysis Services

SQL Database (Single, MI, HyperScale)

SQL Server in a VM

Cosmos DB

Power BI Aggregations

Page 35: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Blob Storage Data Lake Store

Azure Data Lake Storage Gen2

Large partner ecosystem

Global scale – All 50 regions

Durability options

Tiered - Hot/Cool/Archive

Cost Efficient

Built for Hadoop

Hierarchical namespace

ACLs, AAD and RBAC

Performance tuned for big data

Very high scale capacity and throughput

Large partner ecosystem

Global scale – All 50 regions

Durability options

Tiered - Hot/Cool/Archive

Cost Efficient

Built for Hadoop

Hierarchical namespace

ACLs, AAD and RBAC

Performance tuned for big data

Very high scale capacity and throughput

Page 36: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

INGEST STORE PREP & TRAIN MODEL & SERVE

C L O U D D A T A W A R E H O U S E

Azure Data Lake Store Gen2

Logs (unstructured)

Azure Data Factory

Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs.

Media (unstructured)

Files (unstructured)

PolyBase

Business/custom apps

(structured)

Azure SQL Data

Warehouse

Azure Analysis

Services

Power BI

Page 37: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

INGEST STORE PREP & TRAIN MODEL & SERVE

M O D E R N D A T A W A R E H O U S E

Azure Data Lake Store Gen2

Logs (unstructured)

Azure Data Factory

Azure Databricks

Microsoft Azure also supports other Big Data services like Azure HDInsight to allow customers to tailor the above architecture to meet their unique needs.

Media (unstructured)

Files (unstructured)

PolyBase

Business/custom apps

(structured)

Azure SQL Data

Warehouse

Azure Analysis

Services

Power BI

Page 38: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

A D V A N C E D A N A L Y T I C S O N B I G D A T A

INGEST STORE PREP & TRAIN MODEL & SERVE

Cosmos DB

Business/custom apps

(structured)

Files (unstructured)

Media (unstructured)

Logs (unstructured)

Azure Data Lake Store Gen2Azure Data Factory Azure SQL Data

Warehouse

Azure Analysis

Services

Power BI

PolyBase

SparkR

Azure Databricks

Microsoft Azure also supports other Big Data services like Azure HDInsight, Azure Machine Learning to allow customers to tailor the above architecture to meet their unique needs.

Real-time apps

Page 39: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open
Page 40: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

INGEST STORE PREP & TRAIN MODEL & SERVE

R E A L T I M E A N A L Y T I C S

Sensors and IoT

(unstructured)

Apache Kafka for

HDInsight Cosmos DB

Files (unstructured)

Media (unstructured)

Logs (unstructured)

Azure Data Lake Store Gen2Azure Data Factory

Azure Databricks

Real-time apps

Business/custom apps

(structured)

Azure SQL Data

Warehouse

Azure Analysis

Services

Power BI

Microsoft Azure also supports other Big Data services like Azure IoT Hub, Azure Event Hubs, Azure Machine Learning to allow customers to tailor the above architecture to meet their unique needs.

PolyBase

Page 41: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open
Page 42: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Customer

analytics

Financial

modeling

Risk, fraud,

threat detection

Credit

analytics

Marketing

analytics

Faster innovation for a better

customer experience

Improved consumer outcomes and

increased revenue

Enhanced customer experience with

machine learning

Transform growth with predictive

analytics

Improved customer engagement with machine learning

Customer profiles

Credit history

Transactional data

LTV

Loyalty

Customer segmentation

CRM data

Credit data

Market data

CRM

Credit

Risk

Merchant records

Products and services

Clickstream data

Products

Services

Customer service data

Customer 360degree evaluation

Customer segmentation

Reduced customer churn

Underwriting, servicing and delinquency handling

Insights for new products

Commercial/retail banking, securities, trading and investment models

Decision science, simulations and forecasting

Investment recommendations

Real-time anomaly detection

Card monitoring and fraud detection

Security threat identification

Risk aggregation

Enterprise DataHub

Regulatory andcompliance analysis

Credit risk management

Automated credit analytics

Recommendation engine

Predictive analytics and targeted advertising

Fast marketing and multi-channel engagement

Customer sentiment analysis

Transaction data

Demographics

Purchasing history

Trends

Effective customer

engagement

Decision services

management

Risk and revenue

management

Risk and compliance

management

Recommendation

engine

Financial services use cases

Page 43: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

https://aka.ms/ADAG

Page 44: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Q & A ?James Serra, Big Data Evangelist

Email me at: [email protected] (email me for a copy of this deck)

Follow me at: @JamesSerra

Link to me at: www.linkedin.com/in/JamesSerra

Visit my blog at: JamesSerra.com (check out the “presentations” tab)

Page 45: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open
Page 46: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Who manages what?

Infrastructureas a Service

Storage

Servers

Networking

O/S

Middleware

Virtualization

Data

Applications

Runtime

Man

ag

ed

by M

icroso

ft

Yo

u s

cale

, m

ake

resi

lien

t &

man

ag

e

Platformas a Service

Sca

le, R

esilie

nce

an

d

man

ag

em

en

t by M

icroso

ft

Yo

u m

an

ag

e

Storage

Servers

Networking

O/S

Middleware

Virtualization

Applications

Runtime

Data

On PremisesPhysical / Virtual

Yo

u s

cale

, m

ake r

esi

lien

t an

d m

an

ag

e

Storage

Servers

Networking

O/S

Middleware

Virtualization

Data

Applications

Runtime

Softwareas a Service

Storage

Servers

Networking

O/S

Middleware

Virtualization

Applications

Runtime

Data

Sca

le, R

esilie

nce

an

d

man

ag

em

en

t by M

icroso

ft

Windows Azure

Virtual Machines

Windows Azure

Cloud Services

Page 47: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Microsoft data platform solutionsProduct Category Description More Info

SQL Server 2017 RDBMS Earned top spot in Gartner’s Operational Database magic

quadrant. JSON support. Linux support

https://www.microsoft.com/en-us/server-

cloud/products/sql-server-2017/

SQL Database RDBMS/DBaaS Cloud-based service that is provisioned and scaled quickly.

Has built-in high availability and disaster recovery. JSON

support. Managed Instance option soon

https://azure.microsoft.com/en-

us/services/sql-database/

SQL Data Warehouse MPP RDBMS/DBaaS Cloud-based service that handles relational big data.

Provision and scale quickly. Can pause service to reduce

cost

https://azure.microsoft.com/en-

us/services/sql-data-warehouse/

Azure Data Lake Store Hadoop storage Removes the complexities of ingesting and storing all of

your data while making it faster to get up and running with

batch, streaming, and interactive analytics

https://azure.microsoft.com/en-

us/services/data-lake-store/

HDInsight PaaS Hadoop

compute/Hadoop

clusters-as-a-service

A managed Apache Hadoop, Spark, R Server, HBase, Kafka,

Interactive Query (Hive LLAP) and Storm cloud service made

easy

https://azure.microsoft.com/en-

us/services/hdinsight/

Azure Databricks PaaS Spark clusters A fast, easy, and collaborative Apache Spark based analytics

platform optimized for Azure

https://databricks.com/azure

Azure Data Lake Analytics On-demand analytics job

service/Big Data-as-a-

service

Cloud-based service that dynamically provisions resources

so you can run queries on exabytes of data. Includes U-SQL,

a new big data query language

https://azure.microsoft.com/en-

us/services/data-lake-analytics/

Azure Cosmos DB PaaS NoSQL: Key-value,

Column-family,

Document, Graph

Globally distributed, massively scalable, multi-model, multi-

API, low latency data service – which can be used as an

operational database or a hot data lake

https://azure.microsoft.com/en-

us/services/cosmos-db/

Azure Database for PostgreSQL,

MySQL, and MariaDB

RDBMS/DBaaS A fully managed database service for app developers https://azure.microsoft.com/en-

us/services/postgresql

Page 48: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

ExpressRoute

We guarantee 99.95% ExpressRoute dedicated circuit availability

(3.6TB/hour)

(18TB/hour)

(36TB/hour)

(7.2TB/hour)

Page 49: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Data VolumeLo

wH

igh

Latency

Cost/GBHig

h Lo

w

Request Rate

Str

uct

ure

High

Low

SQL

(SQL DB, SQL DW)

NoSQL

(DocDB)

HDFS

(HDP, Cloudera)

ADLS

Hot Data Cold Data

Cache

Page 50: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Data Lake Store: Technical Requirements

64

Secure Must be highly secure to prevent unauthorized access (especially as all data is in one place).

Native format Must permit data to be stored in its ‘native format’ to track lineage & for data provenance.

Low latency Must have low latency for high-frequency operations.

Must support multiple analytic frameworks—Batch, Real-time, Streaming, ML etc.

No one analytic framework can work for all data and all types of analysis.

Multiple analytic

frameworks

Details Must be able to store data with all details; aggregation may lead to loss of details.

Throughput Must have high throughput for massively parallel processing via frameworks such as Hadoop and Spark

Reliable Must be highly available and reliable (no permanent loss of data).

Scalable Must be highly scalable. When storing all data indefinitely, data volumes can quickly add up

All sources Must be able ingest data from a variety of sources-LOB/ERP, Logs, Devices, Social NWs etc.

Page 51: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

50 TB

100 TB

500 TB

10 TB

5 PB

1.000

100

10.000

3-5 Way

Joins

▪ Joins +

▪ OLAP operations +

▪ Aggregation +

▪ Complex “Where”

constraints +

▪ Views

▪ Parallelism

5-10 Way

Joins

Normalized

Multiple, Integrated

Stars and Normalized

Simple

Star

Multiple,

Integrated

Stars

TB’s

MB’s

GB’s

Batch Reporting,

Repetitive Queries

Ad Hoc Queries

Data Analysis/Mining

Near Real Time

Data FeedsDaily

Load

Weekly

Load

Strategic, Tactical

Strategic

Strategic, Tactical

Loads

Strategic, Tactical

Loads, SLA

“Query Freedom“

“Query complexity““Data

Freshness”

“Query Data Volume“

“Query Concurrency“

“Mixed

Workload”

“Schema Sophistication“

“Data Volume”

DW SCALABILITY SPIDER CHART

MPP – Multidimensional

Scalability

SMP – Tunable in one dimension

on cost of other dimensions

The spiderweb depicts important attributes to consider when evaluating Data Warehousing options.

Big Data support is newest dimension.

Page 52: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

HiveQL T-SQL

Updates INSERT UPDATE, INSERT, DELETE

Transactions Supported Supported

Index Types Compact, Bitmap Clustered, Non-Clustered, Columnar

Latency Minutes Seconds

Data Types + map, struct, array Int, float, boolean, char, binary..

Functions Dozens of built-in functions Hundreds of built-in functions

Multi-Table Inserts Supported Not Supported

Partitioning Supported Supported

Multi-Level Partitioning Supported Not Supported

Views Read-Only Read-Only and Updateable

Extensibility UDF’s, MapReduce Scripts UDF’s, Stored Procedures

Page 53: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

HDInsight vs HDP on Azure VM

HDInsight HDP on Azure VM

PaaS (setup, scale, manage, patch, etc) IaaS

Managed by Microsoft Managed by customer

Storage separate (Blob or ADLS) Storage in VM (local disk), but can also

have storage in Azure blob or ADLS

Delete VM keeps data Delete VM deletes data (unless external)

Up to 30-days behind latest HDP version Latest HDP Version

Limited Hadoop projects Unlimited Hadoop projects

Microsoft supports VM and Hadoop Microsoft: VM, HDP: Hadoop

No on-prem version On-prem version

Page 54: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

PolyBase use cases

Page 55: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open
Page 56: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Ingestion Batch Presentation

Page 57: Differentiate Big Data vs Data Warehouse use cases for a cloud … · 2019-04-24 · Hadoop and Spark as a Service on Azure Fully-managed Hadoop and Spark for the cloud 100% Open

Company C

Event hubs

Machine Learning

Flatten & Metadata Join

Data Factory: Move Data, Orchestrate, Schedule, and Monitor

Machine Learning Azure SQL

Data Warehouse

Power BI

INGEST PREPARE ANALYZE PUBLISH

ASA Job Rule #2

CONSUMEDATA SOURCES

Cortana

Web/LOB Dashboards

On Premise

Hot Path

Cold Path

Archived Data

Data Lake Store

Simulated Sensors and devices

Blobs –Reference Data

Event hubs ASA Job Rule #1

Event hubs

Real-time Scoring

Aggregated Data

Data Lake Store

CSV Data

Data Lake Store

Data Lake Analytics

Batch Scoring

Offline Training

Hourly, Daily, Monthly Roll Ups

Ingestion

Batch

PresentationSpeed