Hadoop in the cloud – The what, why and how from the experts
Nishant Thacker
Technical Product Manager – Big Data
Microsoft
@nishantthacker
Hadoop in the Cloud
Agenda
Why
• Benefits of running Hadoop in the cloud
What
• Options to run Hadoop in the Cloud
• Hadoop Clusters in the cloud
• Cluster Customizations
How
• Architecture of a Cloud deployment
Hadoop Clusters
Distributed Storage
• Files split across storage
• Files replicated
• Nearest node responds
• Abstracted administration
Extensible
• APIs to extend functionality
• Add new capabilities
• Allow for inclusion in custom environments
Automated Failover
• Unmonitored failover to replicated data
• Built for resiliency
• Metadata stored for later retrieval
Hyper-Scale
• Add resources as desired
• Built to include commodity configs
• Direct correlation of performance and resources
Distributed Compute
• Distributed processing
• Resource utilization
• Cost-efficient method calls
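The "files split across storage" and "files replicated" bullets can be illustrated with a small sketch. This is a simplified model, not how HDFS itself places blocks: the function names, the round-robin placement, and the node names are assumptions made for illustration (real HDFS placement is rack-aware). The replication factor of 3 mirrors the usual HDFS default.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Split a file into fixed-size blocks and assign each block to
    `replication` distinct nodes (hypothetical round-robin placement)."""
    if replication > len(nodes):
        raise ValueError("replication factor exceeds number of nodes")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for block_id in range(num_blocks):
        # Start at a different node for each block so data spreads out.
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def readable_after_failure(placement, failed_node):
    """True if every block still has at least one replica on a live node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )


# A 300 MB file with 128 MB blocks on four (hypothetical) worker nodes.
placement = place_blocks(300, 128, ["n1", "n2", "n3", "n4"])
survives = readable_after_failure(placement, "n2")
```

Because each block lives on several nodes, losing any single node leaves every block readable, which is the basis of the automated failover property.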
Cloud
The same five properties apply: Distributed Storage, Extensible, Automated Failover, Hyper-Scale, Distributed Compute.
Hadoop in the Cloud
Hadoop clusters and the cloud share all five properties: Distributed Storage, Extensible, Automated Failover, Hyper-Scale, Distributed Compute.
Hadoop in the Cloud - Options
Hadoop in IaaS
Pros
• Complete control
• On-demand cluster sizing
• Storage: local or cloud
Cons
• Only VMs managed for HA
• Administration required
• Clusters need to stay active
Hadoop in PaaS
Pros
• Fully managed, SLA bound
• Flexible resizing
• Customization options
• Deployed in minutes
Cons
• Forgo some control
Big Data as a Service
Pros
• Abstracted from clusters
• Automated resource alignment
• Easy-to-use interface and APIs
• Familiar languages
Cons
• Forgo complete control
• Limited choice of tools
Scenarios for deploying Hadoop as hybrid
On-premises Hadoop software combined with HDInsight in the cloud:
• Specialized workloads
• Bursting
• Backup/archive
Traditional Hadoop Clusters – On Prem
[Diagram: a Hadoop cluster in which a client submits a job (jar) file; the cluster has a master node and worker nodes, each worker node with a Task Tracker, tasks, and HDFS storage]
Azure HDInsight
Hadoop and Spark as a Service on Azure
Fully managed Hadoop and Spark for the cloud
100% Open Source Hortonworks Data Platform
Clusters up and running in minutes
Managed, monitored and supported by Microsoft with the industry’s best enterprise SLA
Use familiar BI tools for analysis, or open source notebooks for interactive data science
63% lower total cost of ownership than deploying your own Hadoop on-premises*
*IDC study “The Business Value and TCO Advantage of Apache Hadoop in the Cloud with Microsoft Azure HDInsight”
HDInsight Cluster Architecture
[Diagram components]
• Azure VNet
• HTTPS traffic: ODBC/JDBC, WebHCatalog, Oozie, Ambari
• Secure gateway: AuthN, HTTP proxy
• Highly available head nodes
• Worker nodes
• ADLS
Decoupling Compute from Storage
Network
HDD-like latency
50 Tb+ aggregate bandwidth[1]
Strong consistency
[1] Azure Flat Network Architecture
Decoupling - Benefits
NoSQL Workload
Pros
• Smaller clusters can achieve the same level of performance as large clusters
• No need to add nodes just for storage capacity
• Depending on workloads, you see anywhere from 6x – 20x cost benefits
Query + ML Workload
Pros
• Clusters required only while processing data
• Data persists for tools to connect and use
• Data can be replicated to other geos
• Delete clusters when not processing data
Streaming Workload
Pros
• No need for large clusters to hold historical streams data
• Directly align throughput to cluster size as per SLA
• Cluster up only when streams are active
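The "clusters required only while processing data" point comes down to simple arithmetic. The sketch below compares an always-on cluster with one that exists only during a daily processing window. The function name, node count, hourly price, and hours are all hypothetical values chosen for illustration, not figures from the talk. Storage cost is left out because the data persists in both cases.

```python
def monthly_compute_cost(nodes, node_hour_price, hours_per_day, days=30):
    """Compute-only cost of a cluster that runs `hours_per_day` each day."""
    return nodes * node_hour_price * hours_per_day * days


# Hypothetical 10-node cluster at 1.0 currency units per node-hour.
always_on = monthly_compute_cost(nodes=10, node_hour_price=1.0, hours_per_day=24)
transient = monthly_compute_cost(nodes=10, node_hour_price=1.0, hours_per_day=3)

# How many times cheaper the transient cluster is under these assumptions.
savings_ratio = always_on / transient
```

The ratio depends only on how many hours per day the cluster is actually needed, which is why decoupling storage from compute matters: the cluster can be deleted without losing the data.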
Azure Data Lake Store
A hyper-scale repository for big data analytics workloads
Hadoop File System (HDFS) for the cloud
No limits to scale
Store any data in its native format
Enterprise grade access control and encryption
Optimized for analytic workload performance
Cluster customization options
• Ad hoc: RDP to cluster, update config files (non-durable)
• Via Azure portal: Hive/Oozie Metastore; Storage accounts & VNETs; ScriptAction
• Via scripting / SDK: Config values; JAR file placement in cluster
HDInsight cluster provisioning states
• Ready for deployment
• Accepted
• Cluster storage provisioned
• Azure VM configuration
• Configuring HDInsight
• Customize cluster? Yes: Cluster customization (custom script running). No: continue
• Cluster operational
• Running
• Failure states: Timed Out, Error
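The provisioning states above can be read as a mostly linear flow with one optional customization step. The sketch below models that reading. The ordering of states is an assumption reconstructed from the slide's flowchart labels, and the names `provisioning_path` and `FAILURE_STATES` are hypothetical; this is not an HDInsight API.

```python
# State names are taken from the slide; their order is an assumed reading
# of the flowchart, not a documented contract.
STATES_BEFORE_CUSTOMIZATION = [
    "Ready for deployment",
    "Accepted",
    "Cluster storage provisioned",
    "Azure VM configuration",
    "Configuring HDInsight",
]
CUSTOMIZATION_STATE = "Cluster customization (custom script running)"
STATES_AFTER_CUSTOMIZATION = ["Cluster operational", "Running"]
FAILURE_STATES = {"Timed Out", "Error"}


def provisioning_path(customize):
    """Return the successful sequence of states, with or without the
    optional customization step (the "Customize cluster?" decision)."""
    path = list(STATES_BEFORE_CUSTOMIZATION)
    if customize:
        path.append(CUSTOMIZATION_STATE)
    path.extend(STATES_AFTER_CUSTOMIZATION)
    return path


plain = provisioning_path(customize=False)
customized = provisioning_path(customize=True)
```

The only difference between the two paths is the extra script-running state, which is where a ScriptAction would execute before the cluster becomes operational.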
Cluster integration options
Each cluster surfaces a REST endpoint for integration, secured via basic authN over SSL
/thrift – ODBC & JDBC
/Templeton – Job Submission, Metadata management
/ambari – Cluster health, monitoring
/oozie – Job orchestration, scheduling
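A minimal sketch of preparing a call to one of these endpoints with basic authentication over HTTPS, using only the Python standard library. The host pattern `https://<cluster>.azurehdinsight.net`, the cluster name, and the credentials are assumptions for illustration. The paths are the ones listed on the slide, lowercased; real calls usually need a further service-specific suffix, which is left to the caller. The request is only built here, not sent.

```python
import base64
import urllib.request

# Endpoint paths as listed on the slide (lowercased).
ENDPOINTS = {
    "thrift": "/thrift",        # ODBC & JDBC
    "templeton": "/templeton",  # job submission, metadata management
    "ambari": "/ambari",        # cluster health, monitoring
    "oozie": "/oozie",          # job orchestration, scheduling
}


def build_request(cluster_name, service, user, password, suffix=""):
    """Build (but do not send) an HTTPS request with a Basic auth header."""
    # Assumed host pattern for an HDInsight cluster.
    url = f"https://{cluster_name}.azurehdinsight.net{ENDPOINTS[service]}{suffix}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url)
    request.add_header("Authorization", f"Basic {token}")
    return request


# Hypothetical cluster name and credentials.
req = build_request("mycluster", "ambari", "admin", "secret")
# To send it: urllib.request.urlopen(req)
```

Because basic authentication sends the credentials with every request, it is only safe over SSL/TLS, which is why the sketch hard-codes the `https` scheme.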
Introducing Cortana Intelligence Suite
Data → Intelligence → Action
Data Sources
• Apps
• Sensors and devices
• Data
Information Management
• Event Hubs
• Data Catalog
• Data Factory
Big Data Stores
• SQL Data Warehouse
• Data Lake Store
Machine Learning and Analytics
• HDInsight (Hadoop and Spark)
• Stream Analytics
• Data Lake Analytics
• Machine Learning
Intelligence
• Cortana
• Bot Framework
• Cognitive Services
Dashboards & Visualizations
• Power BI
Action
• People
• Automated Systems
• Apps: Web, Mobile, Bots
Where Big Data is a cornerstone
[Same Cortana Intelligence Suite diagram as the previous slide]
[Diagram: big data architecture on Azure]
Stages: Data Generation, Streaming, Storage, Processing, Consumption
Tiers: Operational, Analytical / Exploratory, Data Warehouse
Components shown:
• Big Data Sources (Raw Unstructured), Log files, Transactional systems, Azure Website
• Azure Event Hubs, Kafka for Azure HDInsight (future), Storm for Azure HDInsight, Azure Stream Analytics, Spark Streaming for Azure HDInsight
• Azure Data Lake Store, HBase on Azure HDInsight, Azure SQL DW, SQL Server APS
• Hive (HiveQL), Pig, Sqoop, Mahout, Spark SQL, Spark MLlib, Azure Data Lake Analytics (U-SQL)
• Azure Machine Learning, R Server, SQL Server R Services
• Data Orchestration/Workflow: Azure Data Factory, Oozie for Azure HDInsight, SQL Server Integration Services, ETL
• Excel BI, Power BI, SSRS, SSAS, SharePoint BI
Summary
Why
• Benefits of running Hadoop in the cloud – Far outrun tradeoffs
What
• Options to run Hadoop in the Cloud – IaaS, PaaS, Hybrid
• Hadoop Clusters in the cloud – Fully Managed
• Cluster Customizations – Immensely well leveraged
How
• Architecture of a Cloud deployment – Simplify deployment
Get started today!
For more information on HDInsight visit: http://azure.com/hdinsight
For more information on Data Lake visit: http://azure.com/datalake
Q&A
@nishantthacker