Hadoop & Cloud Storage: Object Store Integration in Production


Page 1: Hadoop & Cloud Storage: Object Store Integration in Production

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Hadoop & Cloud Storage: Object Store Integration in Production
Chris Nauroth
Rajesh Balamohan
Hadoop Summit 2016

Page 2: Hadoop & Cloud Storage: Object Store Integration in Production


About Us

Rajesh Balamohan, [email protected], Twitter: @rajeshbalamohan
– Apache Tez committer, PMC member
– Mainly working on performance in Tez
– Have been using Hadoop since 2009

Chris Nauroth, [email protected], Twitter: @cnauroth
– Apache Hadoop committer, PMC member, and Apache Software Foundation member
– Working on HDFS and alternative file systems such as WASB and S3A
– Hadoop user since 2010

Steve Loughran, [email protected], Twitter: @steveloughran
– Apache Hadoop committer, PMC member, and Apache Software Foundation member
– Hadoop deployment since 2008, especially cloud integration; Filesystem Spec author
– Working on: Apache Slider, Spark + cloud integration, Hadoop + Cloud

Page 3: Hadoop & Cloud Storage: Object Store Integration in Production


Agenda

⬢ Hadoop/Cloud Storage Integration Use Cases

⬢ Hadoop-compatible File System Architecture

⬢ Recent Enhancements in the S3A FileSystem Connector

⬢ Hive Access Patterns

⬢ Performance Improvements and TPC-DS Benchmarks with Hive-TestBench

⬢ Next Steps for S3A and other Object Stores

⬢ Q & A

Page 4: Hadoop & Cloud Storage: Object Store Integration in Production


Why Hadoop in the Cloud?

Page 5: Hadoop & Cloud Storage: Object Store Integration in Production


Hadoop Cloud Storage Utilization Evolution

[Diagram: three stages of evolution. First, the application reads and writes HDFS, with cloud storage used only for backup and restore. Next, input and output live in cloud storage, with data copied into and out of HDFS for processing. Finally, the application reads input from and writes output to cloud storage directly, keeping HDFS only for tmp data. Goal: evolution towards cloud storage as the primary Data Lake.]

Page 6: Hadoop & Cloud Storage: Object Store Integration in Production


What is the Problem?

Cloud object stores are designed for:
⬢ Scale
⬢ Cost
⬢ Geographic distribution
⬢ Availability
⬢ Consequently, cloud app writers often modify apps to deal with cloud storage semantics and limitations

Challenges - Hadoop apps should work on HDFS or cloud storage transparently:
⬢ Eventual consistency
⬢ Performance - storage is separated from compute
⬢ Cloud storage not designed for file-like access patterns
⬢ Limitations in APIs (e.g. rename)

Page 7: Hadoop & Cloud Storage: Object Store Integration in Production


Goal and Approach

Goals
⬢ Integrate with the unique functionality of each cloud
⬢ Optimize each cloud's object store connector
⬢ Optimize upper layers for cloud object stores

Overall Approach
⬢ Consistency in the face of eventual consistency (use a secondary metadata store)
⬢ Performance in the connector (e.g. lazy seek)
⬢ Upper layer improvements (Hive, ORC, Tez, etc.)

Steve Loughran
If we can optimise FileSystem listFiles/globber for object stores, which can do recursive lists in O(1), they'd all benefit - if we implement the list operation for each, and make sure Tez, Spark, etc. use the right calls.
Sanjay Radia
[email protected] Is the overall approach a correct breakdown of all the problems mentioned in the doc that SteveL and Sanjay started? Or shall we add more details from the doc?
Steve Loughran
If you look at the S3A work, it went functionality first, then performance & scale... I'd call out the latter two.
Sanjay Radia
[email protected] [email protected] Are the ORC improvements shared across all connectors or per connector - i.e. does Azure benefit from the work Rajesh has done recently?
Rajesh Balamohan
[email protected] ORC-related fixes:
- Fixing resource leaks and sending a reduced amount of ORC footer information in split payloads are common to all connectors.
- The ORC random read access pattern benefits from the lazy seek/connection abort fixes in the S3A layer, but no changes were applied in ORC itself for this.
Hive:
- Table management changes like "creating tables", "analyzing tables", and "repairing tables" are common across all connectors.
Tez-related fixes:
- For systems which always provide "localhost" as their block locations, Tez split grouping was optimized. I am not sure if this is applicable to Azure (if Azure reports "localhost" for all its block locations, this specific change will not apply to them). I believe they report IP addresses.
Chris Nauroth
[email protected] , S3A getFileBlockLocations always reports exactly one block location, and the host within that block location is "localhost". WASB getFileBlockLocations reports multiple block locations, driven by file size and a configurable "fake block size", and the host within every block location is either "localhost", or a configured override of the host name. Based on that information, can you please clarify whether or not you expect the Tez optimization would help WASB too? Thanks!
Rajesh Balamohan
The Tez optimization is specifically for "localhost" (for S3A). If WASB can return "localhost", then the Tez optimization would be applicable for WASB as well. When using Tez, it would be good not to override the host name for WASB with dummy locations.
Chris Nauroth
[email protected] , great, thank you. FWIW, I have never seen anyone override that host name in WASB configuration.
Page 8: Hadoop & Cloud Storage: Object Store Integration in Production


Hadoop-compatible File System Architecture

Page 9: Hadoop & Cloud Storage: Object Store Integration in Production


Hadoop-compatible File System Architecture

⬢ Applications
– File system interactions coded to a file system-agnostic abstraction layer.
• FileSystem class - traditional API
• FileContext/AbstractFileSystem classes - newer API providing a split between client API and provider API
– Can be retargeted to a different file system by configuration changes, not code changes (see the sketch below).
• Caveat: Different FileSystem implementations may offer a limited feature set.
• Example: Only HDFS and WASB can run HBase.

⬢ File System Abstraction Layer
– Defines the interface of common file system operations: create, open, rename, etc.
– Supports additional mix-in interfaces to indicate implementation of optional features.
– Semantics of each operation documented in a formal specification, derived from HDFS behavior.

⬢ File System Implementation Layer
– Each file system provides a set of concrete classes implementing the interface.
– A set of common file system contract tests executes against each implementation to prove its adherence to the specified semantics.
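A minimal sketch of the retargeting idea, assuming a hypothetical bucket and path: the same client code runs against HDFS or S3A purely based on the URI scheme and the configured implementation.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListDir {
  public static void main(String[] args) throws Exception {
    // The scheme selects the implementation: hdfs://, s3a://, wasb://, adl://...
    // Swapping storage systems is a URI/configuration change, not a code change.
    Path dir = new Path(args.length > 0 ? args[0] : "s3a://example-bucket/data");
    FileSystem fs = dir.getFileSystem(new Configuration());
    for (FileStatus status : fs.listStatus(dir)) {
      System.out.println(status.getPath() + " (" + status.getLen() + " bytes)");
    }
  }
}
```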

Sanjay Radia
Your prev slide with the picture covers this beautifully. I would simply use this as talking notes for the prev slide (keep the slide for those reading the slides later). Also, FileContext is mentioned, but we do not need to go into why we created it - this is not the focus of this talk; you have a lot of material.
Chris Nauroth
[email protected] , thank you. That makes sense.
Page 10: Hadoop & Cloud Storage: Object Store Integration in Production


Cloud Storage Connectors

Azure
– WASB
● Strongly consistent
● Good performance
● Well-tested on applications (incl. HBase)
– ADL
● Strongly consistent
● Tuned for big data analytics workloads

Amazon Web Services
– S3A
● Eventually consistent - consistency work in progress by Hortonworks
● Performance improvements in progress
● Active development in Apache
– EMRFS
● Proprietary connector used in EMR
● Optional strong consistency for a cost

Google Cloud Platform
– GCS
● Multiple configurable consistency policies
● Currently Google open source
● Good performance
● Work under way for contribution to Apache

Chris Nauroth
[email protected] , how do you like this slide as an evolution of the one you added yesterday?
Chris Nauroth
BTW, I did not delete your version of the slide. I moved it to the end for now, in preparation for possible deletion later.
Sanjay Radia
I like this slide. I'd move the proprietary EMRFS sub-bullet up one. Q: What did you mean by "Optional strong consistency for a cost"? Isn't the DynamoDB consistency part built in and free? I would say something like "Claims to have good consistency but we don't know the details as it is closed source". And we haven't been able to convince them to open source it. Added a bullet for Google.
Sanjay Radia
You can delete my old version -- yours is better.
Chris Nauroth
[email protected], "optional strong consistency" refers to the fact that "consistent view" is optional at the time of provisioning an EMR cluster. You don't have to enable it, but you can choose to opt into it.
Page 11: Hadoop & Cloud Storage: Object Store Integration in Production


Case Study: S3A Functionality and Performance

Page 12: Hadoop & Cloud Storage: Object Store Integration in Production


Authentication

⬢ Basic
– AWS Access Key ID and Secret Access Key in Hadoop configuration files
– Hadoop Credential Provider API to avoid using world-readable configuration files

⬢ EC2 Metadata
– Reads credentials published by AWS directly into EC2 VM instances
– More secure, because external distribution of secrets is not required

⬢ AWS Environment Variables
– Less secure, but potentially easier integration for some applications

⬢ Session Credentials
– Temporary security credentials issued by the Amazon Security Token Service
– Fixed lifetime reduces the impact of a credential leak

⬢ Anonymous Login
– Easy read-only access to public buckets for early prototyping

A configuration sketch for the Basic option follows below.

Page 13: Hadoop & Cloud Storage: Object Store Integration in Production


Encryption

⬢ S3 Server-Side Encryption
– Encryption of data at rest in S3
– Supports the SSE-S3 option: each object encrypted by a unique key using the AES-256 cipher (see the sketch below)
– Now covered in S3A automated test suites
– Support for additional options under development (SSE-KMS and SSE-C)
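A minimal sketch of enabling SSE-S3 from the client, assuming the fs.s3a.server-side-encryption-algorithm property; the bucket and file name are placeholders. Encryption and decryption are transparent to the application.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SseS3Example {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // SSE-S3: S3 encrypts each object at rest with a unique AES-256 key.
    conf.set("fs.s3a.server-side-encryption-algorithm", "AES256");
    Path file = new Path("s3a://example-bucket/encrypted.txt");
    FileSystem fs = file.getFileSystem(conf);
    try (FSDataOutputStream out = fs.create(file)) {
      out.writeUTF("encrypted at rest by S3"); // no change to the write path
    }
  }
}
```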

Page 14: Hadoop & Cloud Storage: Object Store Integration in Production


Supportability

⬢ Documentation
– Backfill missing documentation, and include documentation in new enhancements
– To be published to hadoop.apache.org with the Apache Hadoop 2.8.0 release
– Meanwhile, raw content is visible on GitHub:
• https://github.com/apache/hadoop/blob/branch-2.8/hadoop-tools/hadoop-aws/src/site/markdown/tools/hadoop-aws/index.md

⬢ Error Reporting
– Identify common user errors and provide more descriptive error messages
– S3 HTTP error codes examined and translated to specific error types

⬢ Instrumentation
– Internal metrics covering a wide range of metadata and data operations
– Already proven helpful in flagging a potential performance regression in a patch

Page 15: Hadoop & Cloud Storage: Object Store Integration in Production


Performance Improvements

⬢ Lazy Seek
– Earlier implementation
• Reopened the file in every seek call; aborted the connection in every reopen
• Positional read was expensive (seek, read, seek)
– Current implementation
• Seek is a no-op call
• Performs the real seek on an as-needed basis

⬢ Connection Abort Problem
– Backward seeks caused connection aborts
– Recent modifications to S3AFileSystem fix these and add support for sequential and random read patterns (see the sketch below):
• fs.s3a.experimental.input.fadvise
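A hedged sketch of opting into the random-read policy via the property above; "sequential" and "random" are the documented values, and the bucket/file path is a placeholder.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FadviseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "random" avoids draining or aborting the HTTP connection on backward
    // seeks, which suits columnar formats like ORC; "sequential" suits scans.
    conf.set("fs.s3a.experimental.input.fadvise", "random");
    Path file = new Path("s3a://example-bucket/table/part-00000.orc");
    FileSystem fs = file.getFileSystem(conf);
    byte[] buf = new byte[4096];
    try (FSDataInputStream in = fs.open(file)) {
      in.readFully(1024, buf); // positional read: no reopen, seek stays lazy
    }
  }
}
```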

Steve Loughran
Cost of abort() ~ same as reading 500K of data; forward seek via setReadahead() optimises short range forward seeks; does nothing for long range or backward seeks
Steve Loughran
Not quite, but streaming to the end (Hadoop 2.6.0 is equally expensive).
Steve Loughran
we haven't fixed these yet
Rajesh Balamohan
Yes, I thought of mentioning the issues associated with file access and how they impact Hive later (e.g. getSplits, reading files on the task side, etc.). Listing files and renaming files aren't fixed yet. If those issues are covered in earlier slides, it would be good to cover lazy seek, connection abort, and read-ahead settings here.
Page 16: Hadoop & Cloud Storage: Object Store Integration in Production


Hive Access Patterns

⬢ ETL and Admin Activities
– Bringing in datasets / creating tables
– Cleansing / transforming data
– Analyzing tables, computing column statistics
– MSCK to fix partition-related information

⬢ Read
– Running queries

⬢ Write
– Storing output

Page 17: Hadoop & Cloud Storage: Object Store Integration in Production


Hive - MSCK Improvements

⬢ MSCK helps in fixing the metastore for partitioned datasets
– Scans the table path to identify missing partitions (expensive in S3; a JDBC sketch follows below)
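A minimal sketch of issuing the repair from Java over Hive JDBC; the HiveServer2 URL, credentials, and table name are placeholders.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MsckRepairExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {
      // Scans the table's storage path and registers partitions that are
      // missing from the metastore - a full listing, hence expensive on S3.
      stmt.execute("MSCK REPAIR TABLE web_sales");
    }
  }
}
```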

Page 18: Hadoop & Cloud Storage: Object Store Integration in Production


Hive - Analyze Column Statistics Improvements

⬢ Hive needs statistics to run queries efficiently
– Gathering table and column statistics can be expensive for partitioned datasets (a JDBC sketch follows below)
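A minimal sketch of gathering statistics over Hive JDBC; the connection details and table name are placeholders, and on large partitioned tables each statement triggers scans that hit S3.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AnalyzeStatsExample {
  public static void main(String[] args) throws Exception {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:hive2://hiveserver:10000/default", "hive", "");
         Statement stmt = conn.createStatement()) {
      // Table-level statistics (row counts, sizes).
      stmt.execute("ANALYZE TABLE web_sales COMPUTE STATISTICS");
      // Column-level statistics - extra scans, so budget for S3 read costs.
      stmt.execute("ANALYZE TABLE web_sales COMPUTE STATISTICS FOR COLUMNS");
    }
  }
}
```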

Page 19: Hadoop & Cloud Storage: Object Store Integration in Production


Performance Considerations When Running Hive Queries

⬢ Splits Generation
– File formats like ORC provide a threadpool for split generation

⬢ ORC Footer Cache
– hive.orc.cache.stripe.details.size > 0
– Caches footer details; helps in reducing data reads during split generation

⬢ Reduce S3A reads on the task side
– hive.orc.splits.include.file.footer=true
– Sends ORC footer information in the splits payload
– Helps reduce the amount of data read on the task side

A configuration sketch for these settings follows below.
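A minimal sketch of the two ORC split settings above as programmatic Configuration calls; in practice they usually live in hive-site.xml or are set per session, and the footer-cache size of 10000 entries here is an illustrative value.

```java
import org.apache.hadoop.conf.Configuration;

public class OrcSplitSettings {
  public static Configuration tuned() {
    Configuration conf = new Configuration();
    // Cache ORC stripe/footer details so split generation re-reads less data.
    conf.set("hive.orc.cache.stripe.details.size", "10000");
    // Ship footer info inside the split payload so tasks avoid extra S3 reads.
    conf.set("hive.orc.splits.include.file.footer", "true");
    return conf;
  }
}
```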

Page 20: Hadoop & Cloud Storage: Object Store Integration in Production


Performance Considerations When Running Hive Queries

⬢ Tez Splits Grouping
– Hive uses Tez as its default execution engine
– Tez groups splits based on min/max group settings, location details, and so on
– S3A always provides "localhost" as its block location information
– When all split lengths fall below the min group setting, Tez aggressively groups them into a single split. This causes issues with S3A, as a single task ends up doing sequential operations.
– Fixed in recent releases

⬢ Container Launches
– S3A always provides "localhost" for block locations
– Good to set yarn.scheduler.capacity.node-locality-delay=0

Page 21: Hadoop & Cloud Storage: Object Store Integration in Production


Hive-TestBench Benchmark Results

⬢ Hive-TestBench has a subset of queries from TPC-DS (https://github.com/hortonworks/hive-testbench)

⬢ m4.4xlarge - 5 nodes

⬢ TPC-DS @ 200 GB scale in S3

⬢ "HDP 2.3 + S3 in cloud" vs "HDP 2.4 + S3 in cloud"
– Average speedup 2.5x
– Queries like 15, 17, 25, 73, 75, etc. did not run in HDP 2.3 (threw AWS timeout exceptions)

Page 22: Hadoop & Cloud Storage: Object Store Integration in Production


Hive-TestBench Benchmark Results - LLAP

⬢ LLAP DAG runtime comparison with Hive

⬢ LLAP significantly reduces the amount of data to be read from S3, improving runtime

Page 23: Hadoop & Cloud Storage: Object Store Integration in Production


Best Practices

⬢ Tune multipart settings
– fs.s3a.multipart.threshold (default: Integer.MAX_VALUE)
– fs.s3a.multipart.size (default: 100 MB)
– fs.s3a.connection.timeout (default: 200 seconds)

⬢ Disable node locality delay in YARN
– Set yarn.scheduler.capacity.node-locality-delay=0 to avoid delays in container launches

⬢ Disable storage-based authorization in Hive
– hive.security.metastore.authorization.manager=org.apache.hadoop.hive.ql.security.authorization.DefaultHiveMetastoreAuthorizationProvider
– hive.metastore.pre.event.listeners= (set to an empty value)

⬢ Tune ORC threads to reduce split generation times
– hive.orc.compute.splits.num.threads (default 10)

A combined configuration sketch follows below.

Page 24: Hadoop & Cloud Storage: Object Store Integration in Production


Next Steps for S3A and other Object Stores

⬢ S3A Phase III
– https://issues.apache.org/jira/browse/HADOOP-13204

⬢ Output Committers
– Logical commit operation decoupled from rename (non-atomic and costly in object stores)

⬢ Object Store Abstraction Layer
– Avoid impedance mismatch with the FileSystem API
– Provide specific APIs for better integration with object stores: saving, listing, copying

⬢ Ongoing Performance Improvement
– Less chatty call pattern for object listings
– Metadata caching to mask the latency of remote object store calls

⬢ Consistency
– Shield applications from the effects of eventually consistent object stores

Steve Loughran
1. HADOOP-9565: a proper object store API. Let's make a commitment to do this; explicit PUT/LIST/COPY calls; no rename() or mkdir(). This is not your POSIX FS.
2. HDFS Ozone
Page 25: Hadoop & Cloud Storage: Object Store Integration in Production


Summary

⬢ Evolution towards cloud storage

⬢ Hadoop-compatible File System Architecture fosters integration with cloud storage

⬢ Integration with multiple cloud providers available: Azure, AWS, Google

⬢ Recent enhancements in S3A

⬢ Hive usage and TPC-DS benchmarks show significant S3A performance improvements

⬢ More coming soon for S3A and other object stores

Page 26: Hadoop & Cloud Storage: Object Store Integration in Production


Q & A

Thank You!