Moving towards enterprise ready Hadoop clusters on the cloud
© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Enterprise ready Hadoop clusters on the cloud
Hadoop Summit, Tokyo, October 2016
Hemanth Yamijala, Hortonworks
Agenda
• Overview
  – Hortonworks Data Cloud
  – Architecture
• Improving enterprise readiness
  – Cloud storage
  – Governance
  – Reliability and fault tolerance
HORTONWORKS DATA CLOUD - DEMO
Architecture
[Diagram: Hortonworks Data Cloud architecture on Amazon Web Services]
• Cloudbreak Services: Cloud controller (aka Cloudbreak), Cloudbreak DB, Cloudbreak Deployer
• Connector: AWS, GCE, Azure, OpenStack
• Access tools: Shell, REST API, Web UI
• HDP Cluster: ETL / EDW
  – Master Group: Hive, Spark
  – Ambari, Slave Group, Blueprint
  – S3AFileSystem
• HDP Cluster: Analytics
  – Master Group: LLAP, Zeppelin
  – Ambari, Slave Group, Blueprint
  – S3AFileSystem
Hortonworks Data Cloud - Summary
• Launch and manage clusters by workload type
  – ETL / EDW, Data science, Business analytics
• Use highly scalable, durable storage for data (S3) & metadata (RDS)
• Share data and metadata among multiple ephemeral clusters
• Scale up and down at the click of a button
• Secure clusters with IAM roles, security groups, etc.
Matching Hadoop with the Cloud
Datacenter
• Data locality
• Consistent storage
• Single cluster administration
Cloud
• Scalable storage
• Customizability
• Cost effective compute

• Scalable storage with performance and consistency
• Customizability with ease of administration
• Cost effective compute with SLA policies
Cloud Storage access facts
[Diagram: Interaction models. Labels: Application, HDFS, Input, Output, tmp, Copy]
• Cloud storage optimizes for scale
  – S3 data is replicated for high scale access, durability
• Data access is remote
  – No data locality
  – Costlier metadata operations (E.g. hadoop fs -mv is actually a copy and delete)
• Eventual consistency
  – Takes time for the effect of modification operations to permeate to all copies
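The "copy and delete" cost of a move can be illustrated with a small sketch. It uses a toy in-memory object store rather than the S3 API; the class, its methods, and the key names are all hypothetical and exist only to count simulated remote calls.

```python
# Toy in-memory object store; NOT the S3 API. All names here are hypothetical.
class ObjectStore:
    def __init__(self):
        self.objects = {}   # flat key -> bytes; there are no real directories
        self.requests = 0   # number of simulated remote calls

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]

    def list(self, prefix):
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))


def rename_prefix(store, src_prefix, dst_prefix):
    """Emulate a directory move: copy every object, then delete the original."""
    for key in store.list(src_prefix):
        store.copy(key, dst_prefix + key[len(src_prefix):])
        store.delete(key)


store = ObjectStore()
for i in range(3):
    store.put(f"tmp/part-{i}", b"data")

store.requests = 0                      # only count the calls made by the move
rename_prefix(store, "tmp/", "output/")
rename_requests = store.requests        # one list plus a copy and a delete per object
```

In this model the number of calls grows with the number of objects under the prefix, whereas a move inside HDFS is a single metadata operation.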
Performance with Scalability
• General strategy: Optimize by workload types
• ETL workloads
  – Typical pipeline: Bring in data => Transform => Repair partitions => Compute statistics
  – Multiple metadata calls: Batched and issued in parallel for performance gains
• DistCp
  – Optimized buffer management for transferring large files
  – Randomize input to DistCp to avoid hot-spotting S3 nodes
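As a rough illustration of batching metadata calls and issuing them in parallel, here is a minimal sketch. The function `get_status` is a hypothetical stand-in for one remote metadata call and does not touch S3; the batch size, worker count, and paths are assumptions chosen for the example, not values from the talk.

```python
from concurrent.futures import ThreadPoolExecutor


def get_status(path):
    """Hypothetical stand-in for one remote metadata call (no real S3 access)."""
    return {"path": path, "length": len(path)}


def batches(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def stat_batch(paths):
    """Resolve the metadata for one batch of paths."""
    return [get_status(p) for p in paths]


def stat_all(paths, batch_size=2, workers=4):
    """Run the batches in parallel; results come back in input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(stat_batch, batches(paths, batch_size)))
    return [status for batch in results for status in batch]


paths = [f"s3a://bucket/table/part={i}" for i in range(5)]
statuses = stat_all(paths)
```

Because each call is dominated by network latency, overlapping the batches is where the gain would come from; the sketch only shows the shape of the pattern.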
Performance with Scalability
• Analytics workloads
  – ORC file related optimizations
  – Support fast random access reads (both directions) by avoiding tearing down S3 HTTP connections
  – Pass index information to compute tasks as part of split data to avoid re-computation
• Ref: http://hortonworks.github.io/hdp-aws/s3-performance/index.html
• Status: Available, but performance optimizations never stop
Correctness with strong consistency
• A read that follows a write may not return correct results
  – Issues for data pipelines, multi-stage jobs, etc.
• S3Guard project: Intermediate, consistent metadata store
• Write calls from S3AFileSystem update both S3 and metadata store
• S3AFileSystem automatically tries to reconcile metadata between S3 and metadata store on subsequent reads
  – Inconsistencies are handled based on policy
• Ref: https://issues.apache.org/jira/browse/HADOOP-13345
• Status: In progress
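The idea of pairing the object store with a consistent metadata store can be sketched as follows. This is a toy model, not the S3Guard implementation: both classes are hypothetical, the lagging listing is simulated, and the reconciliation policy (trust the metadata store for keys the listing misses) is an assumption made for the example.

```python
class EventuallyConsistentStore:
    """Toy object store whose listings lag behind writes (NOT the S3 API)."""

    def __init__(self):
        self.objects = {}
        self.visible = set()   # keys that listings currently return

    def put(self, key, data):
        self.objects[key] = data   # the write succeeds, listings are not updated yet

    def settle(self):
        self.visible = set(self.objects)   # the change has reached all copies

    def list(self, prefix):
        return {k for k in self.visible if k.startswith(prefix)}


class GuardedFileSystem:
    """Hypothetical wrapper that records every write in a consistent metadata store."""

    def __init__(self, store):
        self.store = store
        self.metadata = set()

    def write(self, key, data):
        # Writes update both the object store and the metadata store.
        self.store.put(key, data)
        self.metadata.add(key)

    def list(self, prefix):
        from_store = self.store.list(prefix)
        from_meta = {k for k in self.metadata if k.startswith(prefix)}
        # Assumed policy: keys known to the metadata store win over a stale listing.
        return sorted(from_store | from_meta)


store = EventuallyConsistentStore()
fs = GuardedFileSystem(store)
fs.write("out/part-0", b"rows")

stale_listing = sorted(store.list("out/"))   # the raw listing has not caught up
guarded_listing = fs.list("out/")            # the guarded listing sees the new file
```

A downstream stage that lists `out/` through the guarded path would see the file written by the previous stage even before the raw listing catches up.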
Securing data access via IAM Roles
• Integration with cloud provider
• Provide an IAM role as instance profile for a cluster
• Attach policies for accessing S3 to the role
  – E.g. Read-only access for BI cluster to specific buckets
• Status: Available
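A read-only policy of the kind mentioned for a BI cluster might look like the document built below. This is a sketch under assumptions: the bucket name `example-bi-data` is hypothetical, and the choice of `s3:ListBucket` plus `s3:GetObject` as the read-only action set is an illustration, not the policy used by Hortonworks Data Cloud.

```python
import json


def read_only_s3_policy(buckets):
    """Build an IAM policy document granting list and read access to the given buckets."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{b}" for b in buckets],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{b}/*" for b in buckets],
            },
        ],
    }


# "example-bi-data" is a hypothetical bucket name.
policy = read_only_s3_policy(["example-bi-data"])
policy_json = json.dumps(policy, indent=2)
```

The resulting JSON would be attached to the role used as the cluster's instance profile; no write or delete actions are granted.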
Data governance in Hadoop
• Apache Ranger
  – Fine grained, role-based access policies to data
    • File system level access control
    • Granularity for Hive columns
  – Audit access information
• Apache Atlas
  – Discover & index metadata
  – Track data lineage
Data governance technical architecture – On Premise
[Diagram: On Premise HDP Cluster. Ranger Admin (Policy), Atlas Admin (Metadata), Governed HDP Component (E.g. Hive) with Ranger Plugin and Atlas Plugin, LDAP / AD, Data Steward]
Data Governance in the Cloud: Ease of administration with flexibility
• No longer a single compute cluster generating / accessing data
• Data & Metadata are still single and shared
• Evolve Atlas and Ranger to be data lake centric rather than cluster centric
  – Shared long running Admin components
  – Ephemeral plugins on compute clusters
• Ref: https://github.com/hortonworks/hdc-cli/blob/master/shared_cluster.md
• Status: Available as a Tech Preview
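The split between a shared, long running admin component and ephemeral plugins can be pictured with a toy model. Nothing below is the Ranger or Atlas API: the classes, the policy triple format, and the cluster and user names are all hypothetical.

```python
class SharedPolicyAdmin:
    """Long running service holding policies for the whole data lake (toy model)."""

    def __init__(self):
        self.policies = set()   # (user, resource, action) triples
        self.audit_log = []     # one shared audit trail for all clusters

    def allow(self, user, resource, action):
        self.policies.add((user, resource, action))


class ClusterPlugin:
    """Ephemeral plugin created with a compute cluster; pulls policies from the shared admin."""

    def __init__(self, cluster_name, admin):
        self.cluster_name = cluster_name
        self.admin = admin
        self.cache = set()
        self.refresh()

    def refresh(self):
        # Copy the current policies so checks do not need a remote call each time.
        self.cache = set(self.admin.policies)

    def is_allowed(self, user, resource, action):
        allowed = (user, resource, action) in self.cache
        self.admin.audit_log.append((self.cluster_name, user, resource, action, allowed))
        return allowed


admin = SharedPolicyAdmin()
admin.allow("analyst", "sales.orders", "select")

# Two ephemeral clusters enforce the same policy without defining it twice.
etl = ClusterPlugin("etl-edw", admin)
analytics = ClusterPlugin("analytics", admin)
etl_ok = etl.is_allowed("analyst", "sales.orders", "select")
analytics_denied = analytics.is_allowed("analyst", "sales.orders", "drop")
```

When a cluster is terminated, only its plugin disappears; the policies and the audit trail stay with the shared admin component.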
Shared Ranger / Atlas admin services
Available in Tech Preview in Hortonworks Data Cloud
[Diagram: ETL-EDW Cluster and Data Analytics Cluster, each with a Governed HDP Component (E.g. Hive), Ranger Plugin and Atlas Plugin. Shared Enterprise Services: Ranger Admin (Policy), Atlas Admin (Metadata). Also shown: LDAP / AD, Cloud Controller, Data Steward]
HDP Cloud Compute nodes on AWS
• Regular EC2 instances
• Can attach EBS volumes or ephemeral storage disks
• Grouped according to functionality / access requirements
• Opportunistic provisioning: spot instances (work in progress)
[Diagram: HDP Cluster with Master Group / Group #1, Master Group / Group #2 and Gateway node: Ambari; Cloud Controller]
Reliability with cost benefits
• HDP host instances could become unhealthy
  – Unreliable underlying infrastructure
  – Spot instances are transient, dependent on bid price
  – SLA impact for workloads
• Automatically replace unhealthy nodes
  – No costs incurred if node is not functional
  – Replace unhealthy instances to maintain a desired capacity
• Status: Work in progress
Auto-recovery of slave nodes
• Use Ambari to detect unhealthy status & notify Cloudbreak
• Decommission and terminate unhealthy instances
• Provision new instances and add to cluster
[Diagram: HDP Cluster with Master Group / Group #1, Master Group / Group #2 and Gateway node: Ambari; Cloud Controller]
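The recovery steps above can be sketched as a small loop that keeps a group at its desired capacity. This is a toy model with hypothetical names; it does not call Ambari, Cloudbreak, or any cloud provider API, and health status is set by hand.

```python
import itertools


class Cluster:
    """Toy model of one node group with a desired capacity (hypothetical)."""

    def __init__(self, desired_capacity):
        self.desired_capacity = desired_capacity
        self._ids = itertools.count(1)
        self.nodes = {}   # node id -> "healthy" or "unhealthy"
        self.provision_missing()

    def provision_missing(self):
        """Provision new instances until the desired capacity is reached."""
        while len(self.nodes) < self.desired_capacity:
            self.nodes[f"node-{next(self._ids)}"] = "healthy"

    def unhealthy_nodes(self):
        """Stand-in for the health check that Ambari would report."""
        return [n for n, state in self.nodes.items() if state == "unhealthy"]

    def decommission_and_terminate(self, node):
        del self.nodes[node]


def recover(cluster):
    """Detect unhealthy nodes, remove them, then restore the desired capacity."""
    replaced = cluster.unhealthy_nodes()
    for node in replaced:
        cluster.decommission_and_terminate(node)
    cluster.provision_missing()
    return replaced


cluster = Cluster(desired_capacity=3)
cluster.nodes["node-2"] = "unhealthy"   # simulate a failed or reclaimed instance
replaced = recover(cluster)
```

After `recover` runs, the group is back at three healthy nodes and the replaced instance no longer exists, which matches the goal of not paying for non-functional nodes.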
THANK YOU! QUESTIONS?