Moving towards enterprise ready Hadoop clusters on the cloud

© Hortonworks Inc. 2011 – 2016. All Rights Reserved Enterprise ready Hadoop clusters on the cloud Hadoop Summit, Tokyo October 2016 Hemanth Yamijala, Hortonworks

Post on 16-Apr-2017


Page 1: Moving towards enterprise ready Hadoop clusters on the cloud

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Enterprise ready Hadoop clusters on the cloud

Hadoop Summit, Tokyo, October 2016

Hemanth Yamijala, Hortonworks

Page 2: Moving towards enterprise ready Hadoop clusters on the cloud

Agenda

• Overview
  – Hortonworks Data Cloud
  – Architecture

• Improving enterprise readiness
  – Cloud storage
  – Governance
  – Reliability and fault tolerance

Page 3: Moving towards enterprise ready Hadoop clusters on the cloud

HORTONWORKS DATA CLOUD - DEMO

Page 4: Moving towards enterprise ready Hadoop clusters on the cloud

Architecture

[Diagram: Architecture on Amazon Web Services. Components shown:
• Cloudbreak Deployer
• Cloudbreak Services: Cloud controller (aka Cloudbreak), Cloudbreak DB, Connector (AWS, GCE, Azure, OpenStack)
• Access tools: Shell, REST API, Web UI
• HDP Cluster: ETL / EDW, with Master Group (Hive, Spark), Slave Group, Ambari, Blueprint
• HDP Cluster: Analytics, with Master Group (LLAP, Zeppelin), Slave Group, Ambari, Blueprint
• S3aFileSystem (shown twice, presumably once per cluster)]

Page 5: Moving towards enterprise ready Hadoop clusters on the cloud

Hortonworks Data Cloud - Summary

• Launch and manage clusters by workload type
  – ETL / EDW, Data science, Business analytics

• Use highly scalable, durable storage for data (S3) & metadata (RDS)

• Share data and metadata among multiple ephemeral clusters

• Scale up and down at the click of a button

• Secure clusters with IAM roles, security groups, etc.

Page 6: Moving towards enterprise ready Hadoop clusters on the cloud

Matching Hadoop with the Cloud

Datacenter

• Data Locality
• Consistent Storage
• Single cluster administration

Cloud

• Scalable storage
• Customizability
• Cost effective compute

• Scalable storage with performance and consistency

• Customizability with ease of administration

• Cost effective compute with SLA policies

Page 7: Moving towards enterprise ready Hadoop clusters on the cloud

Cloud Storage access facts

[Diagram: Interaction models. Labels in the first model: Application, HDFS, Input, Output, tmp. Labels in the second model: Application, HDFS, Input, Output, Copy.]

• Cloud storage optimizes for scale
  – S3 data is replicated for high scale access and durability

• Data access is remote
  – Data locality is lost
  – Costlier metadata operations (e.g. hadoop fs -mv is actually a copy and delete)

• Eventual Consistency
  – It takes time for the effect of modification operations to propagate to all copies
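To make the cost of a move concrete, here is a minimal sketch of why a rename on an object store is a copy followed by a delete. `ToyObjectStore` and its methods are hypothetical, in-memory stand-ins written for this illustration; they are not the S3 API or the S3AFileSystem implementation.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat key/value object store (not a real S3 client)."""

    def __init__(self):
        self.objects = {}      # key -> bytes
        self.bytes_copied = 0  # total bytes moved by copy operations

    def put(self, key, data):
        self.objects[key] = data

    def copy(self, src, dst):
        data = self.objects[src]
        self.objects[dst] = data
        self.bytes_copied += len(data)

    def delete(self, key):
        del self.objects[key]

    def rename_prefix(self, src_prefix, dst_prefix):
        """Emulate a directory rename: copy every object under the prefix,
        then delete the originals. Cost grows with the amount of data."""
        keys = [k for k in self.objects if k.startswith(src_prefix)]
        for key in keys:
            self.copy(key, dst_prefix + key[len(src_prefix):])
        for key in keys:
            self.delete(key)
        return len(keys)


store = ToyObjectStore()
store.put("tmp/part-0000", b"a" * 1000)
store.put("tmp/part-0001", b"b" * 2000)
moved = store.rename_prefix("tmp/", "output/")
```

In this toy model the move touches every byte under the prefix (`store.bytes_copied` grows with data size), whereas a rename on HDFS is a single metadata operation.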

Page 8: Moving towards enterprise ready Hadoop clusters on the cloud

Performance with Scalability

• General strategy: Optimize by workload types

• ETL workloads
  – Typical pipeline: Bring in data => Transform => Repair partitions => Compute statistics
  – Multiple metadata calls: Batched and issued in parallel for performance gains

• Distcp
  – Optimized buffer management for transferring large files
  – Randomize input to Distcp to avoid hot-spotting S3 nodes
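Two of the ideas above, issuing metadata calls in parallel and randomizing the copy order, can be sketched in a few lines. This is an illustration only, not the Hive or DistCp code: `fake_get_metadata`, the partition names, and the worker count are assumptions made up for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def shuffled_copy_list(paths, seed=None):
    """Return the paths in random order so that lexicographically adjacent
    keys are less likely to be copied at the same time."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def fetch_metadata_in_parallel(paths, get_metadata, max_workers=8):
    """Issue one metadata call per path concurrently and return {path: result}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(get_metadata, paths))  # map preserves input order
    return dict(zip(paths, results))


def fake_get_metadata(path):
    # Hypothetical stand-in for a remote metadata call such as a file status lookup.
    return {"path": path, "length": len(path)}


partitions = ["sales/dt=2016-10-%02d" % day for day in range(1, 6)]
copy_order = shuffled_copy_list(partitions, seed=42)
metadata = fetch_metadata_in_parallel(partitions, fake_get_metadata)
```

With a remote store each metadata call pays a network round trip, so overlapping the calls hides much of that latency; the shuffle only changes the order of work, not its content.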

Page 9: Moving towards enterprise ready Hadoop clusters on the cloud

Performance with Scalability

• Analytics workloads
  – ORC file related optimizations
  – Support fast random access reads (both directions) by avoiding tearing down S3 HTTP connections
  – Pass index information to compute tasks as part of split data to avoid re-computation

• Ref: http://hortonworks.github.io/hdp-aws/s3-performance/index.html

• Status: Available, but performance optimizations never stop

Page 10: Moving towards enterprise ready Hadoop clusters on the cloud

Correctness with strong consistency

• Write operations followed by read may not return correct results
  – Issues for data pipelines, multi-stage jobs, etc.

• S3Guard project: Intermediate, consistent metadata store

• Write calls from S3AFileSystem update both S3 and metadata store

• S3AFileSystem automatically tries to reconcile metadata between S3 and metadata store on subsequent reads
  – Inconsistencies are handled based on policy

• Ref: https://issues.apache.org/jira/browse/HADOOP-13345

• Status: In progress
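The idea of a consistent metadata store in front of an eventually consistent listing can be sketched with a toy model. This is not the S3Guard implementation: both classes are hypothetical, and the reconciliation policy used here (trust the union of the two listings) is an assumption chosen to keep the example short.

```python
class EventuallyConsistentStore:
    """Toy object store whose listing lags behind writes (not a real S3 client)."""

    def __init__(self):
        self.objects = {}    # key -> data, visible to direct reads
        self.listed = set()  # keys currently visible in listings

    def put(self, key, data):
        self.objects[key] = data  # the listing is not updated yet

    def settle(self):
        self.listed = set(self.objects)  # the listing catches up

    def list_keys(self, prefix):
        return {k for k in self.listed if k.startswith(prefix)}


class GuardedFileSystem:
    """Writes update both the store and a consistent metadata set;
    listings reconcile the two views."""

    def __init__(self, store):
        self.store = store
        self.metadata = set()  # stand-in for the consistent metadata store

    def write(self, key, data):
        self.store.put(key, data)
        self.metadata.add(key)

    def list_keys(self, prefix):
        from_store = self.store.list_keys(prefix)
        from_metadata = {k for k in self.metadata if k.startswith(prefix)}
        self.metadata |= from_store  # assumed policy: keep the union of both views
        return sorted(from_store | from_metadata)


store = EventuallyConsistentStore()
fs = GuardedFileSystem(store)
fs.write("out/part-0", b"x")
raw_listing = sorted(store.list_keys("out/"))  # stale view straight from the store
guarded_listing = fs.list_keys("out/")         # view corrected by the metadata store
```

A second stage of a pipeline that lists `out/` right after the write would miss the file through the raw listing but see it through the guarded one.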

Page 11: Moving towards enterprise ready Hadoop clusters on the cloud

Securing data access via IAM Roles

• Integration with cloud provider

• Provide an IAM role as instance profile for a cluster

• Attach policies for accessing S3 to the role
  – E.g. Read-only access for BI cluster to specific buckets

• Status: Available
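As an illustration of the read-only example, the sketch below builds an IAM policy document that allows listing and reading but not writing. The bucket name `example-bi-reports` and the helper function are hypothetical; the policy elements (`Version`, `Statement`, `Effect`, `Action`, `Resource`) and the `s3:ListBucket` / `s3:GetObject` actions are standard IAM, but a real deployment may need a different set of actions.

```python
import json


def read_only_bucket_policy(bucket_names):
    """Build an IAM policy document granting read-only access to the given S3 buckets."""
    bucket_arns = ["arn:aws:s3:::%s" % name for name in bucket_names]
    object_arns = [arn + "/*" for arn in bucket_arns]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arns,
            },
            {
                # Reads apply to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": object_arns,
            },
        ],
    }


# Hypothetical bucket name, used only for this example.
policy = read_only_bucket_policy(["example-bi-reports"])
policy_json = json.dumps(policy, indent=2)
```

Attaching a policy like this to the role used as the cluster's instance profile means no access keys need to be stored on the cluster nodes.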

Page 12: Moving towards enterprise ready Hadoop clusters on the cloud

Data governance in Hadoop

• Apache Ranger
  – Fine grained, role-based access policies to data
    • File system level access control
    • Granularity for Hive columns
  – Audit access information

• Apache Atlas
  – Discover & index metadata
  – Track data lineage
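To show what a role-based, column-level policy with auditing means in practice, here is a toy model. It is not Ranger's API or policy format: `ToyPolicyEngine`, the role, the table, and the column names are all hypothetical.

```python
class ToyPolicyEngine:
    """Toy role-based, column-level access check with an audit trail (not Ranger's API)."""

    def __init__(self):
        self.allowed = {}    # (role, table) -> set of readable columns
        self.audit_log = []  # (role, table, column, decision) for every check

    def allow(self, role, table, columns):
        self.allowed.setdefault((role, table), set()).update(columns)

    def can_read(self, role, table, column):
        decision = column in self.allowed.get((role, table), set())
        self.audit_log.append((role, table, column, decision))  # record every access attempt
        return decision


engine = ToyPolicyEngine()
engine.allow("analyst", "customers", ["id", "country"])
ok = engine.can_read("analyst", "customers", "country")
denied = engine.can_read("analyst", "customers", "ssn")
```

Both the allowed and the denied request end up in `engine.audit_log`, which mirrors the "audit access information" item above.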

Page 13: Moving towards enterprise ready Hadoop clusters on the cloud

Data governance technical architecture – On Premise

[Diagram: On Premise HDP Cluster. Components shown: Ranger Admin (Policy), Atlas Admin (Metadata), a Governed HDP Component (e.g. Hive) with a Ranger Plugin and an Atlas Plugin, LDAP / AD, and a Data Steward.]

Page 14: Moving towards enterprise ready Hadoop clusters on the cloud

Data Governance in the Cloud: Ease of administration with flexibility

• No longer a single compute cluster generating / accessing data

• Data & Metadata are still single and shared

• Evolve Atlas and Ranger to be data lake centric rather than cluster centric
  – Shared long running Admin components
  – Ephemeral plugins on compute clusters

• Ref: https://github.com/hortonworks/hdc-cli/blob/master/shared_cluster.md

• Status: Available as a Tech Preview

Page 15: Moving towards enterprise ready Hadoop clusters on the cloud

Shared Ranger / Atlas admin services

Available in Tech Preview in Hortonworks Data Cloud

[Diagram: Shared Enterprise Services containing Ranger Admin (Policy) and Atlas Admin (Metadata). An ETL-EDW Cluster and a Data Analytics Cluster each run a Governed HDP Component (e.g. Hive) with a Ranger Plugin and an Atlas Plugin. Also shown: LDAP / AD, Cloud Controller, and a Data Steward.]

Page 16: Moving towards enterprise ready Hadoop clusters on the cloud

HDP Cloud Compute nodes on AWS

• Regular EC2 instances

• Can attach EBS volumes or ephemeral storage disks

• Grouped according to functionality / access requirements

• Opportunistic provisioning – spot instances (work in progress)

[Diagram: HDP Cluster with a gateway node running Ambari and instance groups labeled Master Group, Group #1, and Group #2. Also shown: Cloud Controller.]

Page 17: Moving towards enterprise ready Hadoop clusters on the cloud

Reliability with cost benefits

• HDP host instances could become unhealthy
  – Unreliable underlying infrastructure
  – Spot instances are transient, dependent on bid price
  – SLA impact for workloads

• Automatically replace unhealthy nodes
  – No costs incurred if node is not functional
  – Replace unhealthy instances to maintain a desired capacity

• Status: Work in progress

Page 18: Moving towards enterprise ready Hadoop clusters on the cloud

Auto-recovery of slave nodes

• Use Ambari to detect unhealthy status & notify Cloudbreak

• Decommission and terminate unhealthy instances

• Provision new instances and add to cluster
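The recovery steps above can be sketched as a small reconcile loop. This is a toy model under stated assumptions, not Cloudbreak or Ambari code: `ToyCluster`, `recover`, the node names, and the health states are hypothetical, and health detection is simulated by setting a status by hand.

```python
import itertools


class ToyCluster:
    """Toy model of a slave group with a desired capacity."""

    def __init__(self, desired_capacity):
        self.desired_capacity = desired_capacity
        self._ids = itertools.count(1)
        self.nodes = {}  # node id -> "HEALTHY" or "UNHEALTHY"
        for _ in range(desired_capacity):
            self.provision()

    def provision(self):
        node_id = "node-%d" % next(self._ids)
        self.nodes[node_id] = "HEALTHY"
        return node_id

    def decommission_and_terminate(self, node_id):
        del self.nodes[node_id]


def recover(cluster):
    """Remove unhealthy nodes, then provision replacements up to the desired capacity."""
    unhealthy = [n for n, status in cluster.nodes.items() if status == "UNHEALTHY"]
    for node_id in unhealthy:
        cluster.decommission_and_terminate(node_id)
    added = []
    while len(cluster.nodes) < cluster.desired_capacity:
        added.append(cluster.provision())
    return unhealthy, added


cluster = ToyCluster(desired_capacity=3)
cluster.nodes["node-2"] = "UNHEALTHY"  # simulated health notification
removed, added = recover(cluster)
```

Driving the loop from a desired capacity, rather than replacing nodes one for one, also covers the case where an instance disappears without notice, as a spot instance might.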

[Diagram: HDP Cluster with a gateway node running Ambari and instance groups labeled Master Group, Group #1, and Group #2. Also shown: Cloud Controller.]

Page 19: Moving towards enterprise ready Hadoop clusters on the cloud

THANK YOU! QUESTIONS?