hadoop in the cloud - the what, why and how from the experts

32
Hadoop in the cloud The what, why and how from the experts Nishant Thacker Technical Product Manager – Big Data Microsoft @nishantthacker

Upload: hadoop-summit

Post on 19-Jan-2017

329 views

Category:

Technology


1 download

TRANSCRIPT

Hadoop in the cloud – The what, why and how from the experts

Nishant ThackerTechnical Product Manager – Big DataMicrosoft

@nishantthacker

Hadoop in the Cloud

2

Agenda

Why Benefits of running Hadoop in the cloud

What Options to run Hadoop in the Cloud Hadoop Clusters in the cloud Cluster Customizations

How Architecture of a Cloud deployment

Hadoop in the Cloud

3

Agenda

Why Benefits of running Hadoop in the cloud

What Options to run Hadoop in the Cloud Hadoop Clusters in the cloud Cluster Customizations

How Architecture of a Cloud deployment

Traditional Hadoop Clusters

4

Up-front HW costs Capacity planning Hadoop expertise

Challenges with implementing Hadoop

Hadoop Clusters in the Cloud

6

No HW costs

$0

Unlimited scalePay what you need

Deployed in minutes

Why Hadoop in the cloud?

Distributed Storage• Files split across storage• Files replicated

• Nearest node responds• Abstracted

Administration

Hadoop Clusters

Extensible• APIs to extend functionality• Add new capabilities• Allow for inclusion in custom

environments

Automated Failover• Unmonitored failover to replicated data• Built for resiliency• Metadata stored for later retrieval

Hyper-Scale• Add resources as desired• Built to include commodity configs• Direct correlation of performance and

resources

Distributed Compute• Distributed processing• Resource Utilization• Cost-Efficient method calls

8

Distributed Storage• Files split across storage• Files replicated

• Nearest node responds• Abstracted

Administration

Cloud

Extensible• APIs to extend functionality• Add new capabilities• Allow for inclusion in custom

environments

Automated Failover• Unmonitored failover to replicated data• Built for resiliency• Metadata stored for later retrieval

Hyper-Scale• Add resources as desired• Built to include commodity configs• Direct correlation of performance and

resources

Distributed Compute• Distributed processing• Resource Utilization• Cost-Efficient method calls

9

Distributed Storage• Files split across storage• Files replicated

• Nearest node responds• Abstracted

Administration

Hadoop in the Cloud

Extensible• APIs to extend functionality• Add new capabilities• Allow for inclusion in custom

environments

Automated Failover• Unmonitored failover to replicated data• Built for resiliency• Metadata stored for later retrieval

Hyper-Scale• Add resources as desired• Built to include commodity configs• Direct correlation of performance and

resources

Distributed Compute• Distributed processing• Resource Utilization• Cost-Efficient method calls

10

Hadoop in the Cloud

11

Agenda

Why Benefits of running Hadoop in the cloud

What Options to run Hadoop in the Cloud Hadoop Clusters in the cloud Cluster Customizations

How Architecture of a Cloud deployment

Hadoop in the Cloud - Options

Cloud

Hadoop in IaaS Hadoop in PaaS Big Data as a Service

Pros Complete Control On-Demand Cluster Sizing Storage - Local or Cloud

Cons Only VMs managed for HA Administration required Clusters need to stay active

Pros Fully managed – SLA bound Flexible resizing Customization Options Deployed in minutes

Cons Forgo some control

Pros Abstracted from clusters Automated resource

alignment Easy to use interface and APIs Familiar languages

Cons Forgo complete control Limited choice to tools

On-premises Hadoop

Software

Scenarios for deploying Hadoop as hybrid

CloudCloud

Specialized Workloads

HDInsight

Cloud

Bursting

HDInsight

Cloud

Backup/archive

HDInsight

Traditional Hadoop Clusters – On Prem

14

Hadoop Cluster

Worker Node

HDFSHDFS HDFS

Tasks Tasks Tasks Tasks Tasks Tasks

Task Tracker

Master Node

Client

Job (jar) file

Job (jar) file

Hadoop Clusters in the Cloud

Azure HDInsightHadoop and Spark as a Service on Azure

Fully managed Hadoop and Spark for the cloud

100% Open Source Hortonworks Data Platform

Clusters up and running in minutes

Managed, monitored and supported by Microsoft with the industry’s best enterprise SLA

Use familiar BI tools for analysis, or open source notebooks for interactive data science

63% lower total cost of ownership than deploy your own Hadoop on-premises*

*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”

HDInsight Cluster ArchitectureAz

ure

VNet

HTTP S

traffi

c

ODBC/JDBC

WebHCatalog Oozie Ambari

Secure gateway

AuthNHTTP Proxy

Highly availableHead nodes

Worker nodes

ADL S

Decoupling Compute from Storage

Latency? Consistency?Bandwidth?

Network

Decoupling Compute from Storage

Network

HDD-like latency

50 Tb+ aggregate bandwidth[1]

Strong consistency

[1] Azure Flat Network Architecture

Decoupling - Benefits

Cloud

NoSQL Workload

Pros Smaller clusters can

achieve the same level of performance as large clusters

No need to add nodes just for storage capacity

Depending on workloads, you see any where from 6x – 20x cost benefits

Query + ML Workload

Pros Clusters required only

while processing data Data Persists for tools to

connect and use Data can be replicated on

other geo Delete clusters when not

processing of data

Streaming Workload

Pros No need for large clusters

to hold historical streams data

Directly align throughput to cluster size as per SLA

Cluster up only when streams are active

Azure Data Lake StoreA hyper scale repository for big data analytics workloads

Hadoop File System (HDFS) for the cloud

No limits to scale

Store any data in its native format

Enterprise grade access control and encryption

Optimized for analytic workload performance

Customizecluster?

HDInsight cluster provisioning states

RDP to cluster, update config files (non-durable)

Ad hoc

Cluster customization optionsHive/Oozie MetastoreStorage accounts & VNET’sScriptAction

Via Azure portal

Ready for deployment Accepted

Cluster storage

provisioned

AzureVM configuratio

n

RunningTimed Out

Error

Cluster operational

Configuring HDInsight

Cluster customization (custom script

running

Config valuesJAR file placement in cluster

Via scripting / SDKNo

Yes

Cluster integration optionsEach cluster surfaces a REST endpoint for integration, secured via basic authN over SSL

/thrift – ODBC & JDBC

/Templeton – Job Submission, Metadata management

/ambari – Cluster health, monitoring

/oozie – Job orchestration, scheduling

Tauschia Copeland
2 by 2 diagram, thrift. templeton, oozie, and ambari. one by one with click

Hadoop in the Cloud

24

Agenda

Why Benefits of running Hadoop in the cloud

What Options to run Hadoop in the Cloud Hadoop Clusters in the cloud Cluster Customizations

How Architecture of a Cloud deployment

Cloud Deployments for Big Data

25

Introducing Cortana Intelligence Suite

Action

People

Automated Systems

Apps

Web

Mobile

Bots

Intelligence

Dashboards & Visualizations

Cortana

Bot Framework

Cognitive Services

Power BI

Information Management

Event Hubs

Data Catalog

Data Factory

Machine Learning and Analytics

HDInsight (Hadoop and Spark)

Stream Analytics

Intelligence

Data Lake Analytics

Machine Learning

Big Data Stores

SQL Data Warehouse

Data Lake Store

Data Sources

Apps

Sensors and devices

Data

Where Big Data is a cornerstone

Action

People

Automated Systems

Apps

Web

Mobile

Bots

Intelligence

Dashboards & Visualizations

Cortana

Bot Framework

Cognitive Services

Power BI

Information Management

Event Hubs

Data Catalog

Data Factory

Machine Learning and Analytics

HDInsight (Hadoop and Spark)

Stream Analytics

Intelligence

Data Lake Analytics

Machine Learning

Big Data Stores

SQL Data Warehouse

Data Lake Store

Data Sources

Apps

Sensors and devices

Data

Excel BI

Power BI

Mahout

HiveQL

HIVE

Sqoop Pig

Azure Data Lake Analytics

HBase on Azure HDInsight

Big Data Sources (Raw Unstructured)

Log files

Storm for Azure

HDInsight

Azure Stream

Analytics

Spark Streaming for

Azure HDInsight

Spark SQL

Spark MLib

Azure Data Lake Store

U-SQL

Data Orchestration/Workflow

Azure Data Factory

Oozie for Azure HDInsight

Kafka for Azure HDInsight(future)

SQL Server Integration

Services

Azure Machine Learning

R ServerSQL

Server R

Services

SSRS

SharePoint BI

Transactional systems

Azure SQL DW

SQL Server APS

ETL

Azure Event Hubs

Data Generation Streaming ConsumptionProcessingStorage

Ope

ratio

nal

Anal

ytica

l / E

xplo

rato

ry

Data WarehouseAzure

Website

SSAS

Spark MLLib

Summary

29

Why Benefits of running Hadoop in the cloud – Far outrun

tradeoffs

What Options to run Hadoop in the Cloud – IaaS, Paas,

Hybrid Hadoop Clusters in the cloud – Fully Managed Cluster Customizations – Immensely well leveraged

How Architecture of a Cloud deployment – Simplify

deployment

Get started today! For more information on HDInsight visit: http://azure.com/hdinsight For more information on Data Lake visit: http://azure.com/datalake

Q&A

[email protected]

@nishantthacker

Click icon to add picture

© 2016 Microsoft Corporation. All rights reserved.