AWS Webcast - Amazon Elastic MapReduce Deep Dive and Best Practices



DESCRIPTION

Amazon Elastic MapReduce (EMR) is one of the largest Hadoop operators in the world. Since its launch five years ago, our customers have launched more than 15 million Hadoop clusters inside of EMR. In this webinar, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long- and short-lived clusters, and other Amazon EMR architectural patterns. We talk about how to scale your cluster up or down dynamically and introduce you to ways you can fine-tune your cluster. We also share best practices to keep your Amazon EMR cluster cost efficient.

TRANSCRIPT

Page 1: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Amazon Elastic MapReduce:

Deep Dive and Best Practices

Ian Meyers, AWS (meyersi@)

October 29th, 2014

Page 2: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Outline

Introduction to Amazon EMR

Amazon EMR Design Patterns

Amazon EMR Best Practices

Observations from AWS

Page 3: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

What is EMR?

Map-Reduce engine
Vibrant ecosystem
Hadoop-as-a-Service
Massively parallel
Cost-effective AWS wrapper
Integrated with AWS services

Page 4: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

[Diagram: an Amazon EMR cluster running HDFS]

Page 5: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

[Diagram: Amazon EMR running EMRFS and HDFS, with EMRFS backed by Amazon S3 and Amazon DynamoDB]

Page 6: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

[Diagram: as before, adding a layer of analytics languages and data management tools on top of Amazon EMR]

Page 7: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

[Diagram: as before, adding Amazon RDS alongside Amazon EMR]

Page 8: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

[Diagram: the full ecosystem: analytics languages and data management on Amazon EMR (EMRFS and HDFS), integrated with Amazon RDS, Amazon Redshift, Amazon S3, and Amazon DynamoDB, orchestrated by AWS Data Pipeline]

Page 9: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Amazon EMR Introduction

Launch clusters of any size in a matter of minutes
Use a variety of instance types to match your workload
Don't get stuck with hardware
Don't deal with capacity planning
Run multiple clusters with different sizes, specs, and node types

Page 10: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices
Page 11: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Elastic MapReduce & Amazon S3

EMR has an optimised driver for Amazon S3
64 MB range-offset reads to increase performance
EMR Consistent View further increases performance and addresses S3 consistency
S3 cost: $0.03/GB, with volume-based price tiering

Page 12: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Outline

Introduction to Amazon EMR

Amazon EMR Design Patterns

Amazon EMR Best Practices

Observations from AWS

Page 13: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Amazon EMR Design Patterns

Pattern #1: Transient vs. Alive Clusters

Pattern #2: Core Nodes and Task Nodes

Pattern #3: Amazon S3 & HDFS

Page 14: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Pattern #1: Transient vs. Alive Clusters

Page 15: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Pattern #1: Transient Clusters

Cluster lives for the duration of the job
Shut down the cluster when the job is done
Data persists on Amazon S3

[Diagram: input and output data on Amazon S3]

Page 16: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Benefits of Transient Clusters

1. Control your cost

2. Minimal maintenance

• Cluster goes away when job is done

3. Practice cloud architecture

• Pay for what you use

• Data processing as a workflow
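As a minimal sketch using the legacy elastic-mapreduce CLI shown later in this deck (the bucket and script path are hypothetical), a transient cluster is simply one launched without the --alive flag, so it terminates as soon as its steps finish:

./elastic-mapreduce --create --name "Nightly ETL" \
  --num-instances 5 --instance-type m1.xlarge \
  --hive-script --args s3://mybucket/scripts/etl.q
# No --alive flag: the cluster shuts down automatically when the step
# completes, so you pay only for the hours the job actually runs.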

Page 17: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Alive Clusters

Very similar to traditional Hadoop deployments
Cluster stays around after the job is done
Data persistence model, one of:
• Amazon S3
• Amazon S3, copied to HDFS
• HDFS, with Amazon S3 as backup

Page 18: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Alive Clusters

Always keep data safe on Amazon S3, even if you're using HDFS for primary storage
Get in the habit of shutting down your cluster and starting a new one, once a week or month
Design your data processing workflow to account for failure
You can use workflow management tools such as AWS Data Pipeline

Page 19: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Pattern #2: Core & Task nodes

Page 20: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Core Nodes

Run TaskTrackers (compute)
Run DataNodes (HDFS)

[Diagram: an Amazon EMR cluster with a master instance group and a core instance group of two HDFS nodes]

Page 21: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Core Nodes

Can add core nodes:
• More HDFS space
• More CPU/memory

[Diagram: the core instance group grows from two to three HDFS nodes]
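As a sketch, the core group of a running cluster can be grown from the AWS CLI; the instance group ID below is hypothetical (list yours with aws emr describe-cluster):

aws emr modify-instance-groups \
  --instance-groups InstanceGroupId=ig-EXAMPLE123,InstanceCount=3
# Grows the core instance group to 3 nodes. Core groups can only grow,
# not shrink, because they host HDFS (see the next slide).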

Page 22: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Core Nodes

Can't remove core nodes because of HDFS

[Diagram: the core instance group of three HDFS nodes cannot shrink]

Page 23: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Amazon EMR Task Nodes

Run TaskTrackers
No HDFS
Read from core-node HDFS

[Diagram: a task instance group alongside the master and core instance groups]

Page 24: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Amazon EMR Task Nodes

Can add task nodes

[Diagram: the task instance group grows]

Page 25: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Amazon EMR Task Nodes

More CPU power
More memory

[Diagram: the enlarged task instance group adds compute capacity to the cluster]

Page 26: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Amazon EMR Task Nodes

You can remove task nodes when processing is completed

[Diagram: the task instance group shrinks]


Page 28: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Task Node Use Cases

Speed up job processing using the Spot market:
• Run task nodes on the Spot market
• Get a discount on the hourly price
• Nodes can come and go without interruption to your cluster

When you need extra horsepower for a short amount of time:
• Example: need to pull a large amount of data from Amazon S3
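As a sketch with the legacy CLI (bid price illustrative, job flow ID reused from the example later in this deck), a Spot task group can be added to a running cluster:

./elastic-mapreduce --jobflow j-3GY8JC4179IOK --add-instance-group task \
  --instance-type m1.xlarge --instance-count 10 --bid-price 0.05
# Adds 10 Spot task nodes. If the Spot price rises above the bid, the nodes
# disappear, but the cluster keeps running because task nodes hold no HDFS data.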

Page 29: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Pattern #3: Amazon S3 & HDFS

Page 30: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Option 1: Amazon S3 as HDFS

Use Amazon S3 as your permanent data store
HDFS for temporary storage of data between jobs
No additional step to copy data to HDFS

[Diagram: an Amazon EMR cluster (task and core instance groups) reading and writing directly against Amazon S3]
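For instance, a streaming job can read its input from S3 and write results back to S3 with no copy step; the bucket and mapper script below are hypothetical:

./elastic-mapreduce --create --stream \
  --input s3://mybucket/input \
  --output s3://mybucket/output \
  --mapper s3://mybucket/scripts/mapper.py \
  --reducer aggregate
# Input is read directly from Amazon S3 and output is written straight back;
# nothing needs to be staged into HDFS first.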

Page 31: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Benefits: Amazon S3 as HDFS

Ability to shut down your cluster
• HUGE benefit!
Use Amazon S3 as your durable storage
• 11 9s of durability

Page 32: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Benefits: Amazon S3 as HDFS

No need to scale HDFS
• Capacity
• Replication for durability
Amazon S3 scales with your data
• Both in IOPS and data storage

Page 33: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Benefits: Amazon S3 as HDFS

Ability to share data between multiple clusters
• Hard to do with HDFS

[Diagram: two EMR clusters reading the same data from Amazon S3]

Page 34: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Benefits: Amazon S3 as HDFS

Take advantage of Amazon S3 features:
• Amazon S3 server-side encryption
• Amazon S3 lifecycle policies
• Amazon S3 versioning to protect against corruption

Build elastic clusters:
• Add nodes to read from Amazon S3
• Remove nodes, with data safe on Amazon S3

Page 35: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

EMR Consistent View

Provides a 'consistent view' of data on S3 within a cluster
Ensures that all files created by a step are available to subsequent steps
Index of data from S3, managed by DynamoDB
Configurable retry & metastore
New Hadoop config file: emrfs-site.xml
• fs.s3.consistent* system properties
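As a sketch of enabling the consistent view at cluster creation with the AWS CLI (the retry values are illustrative, not recommendations):

aws emr create-cluster --name "ConsistentViewDemo" \
  --ami-version 3.2.1 --instance-type m3.xlarge --instance-count 3 \
  --emrfs Consistent=true,RetryCount=5,RetryPeriod=30
# Sets fs.s3.consistent=true (plus retry properties) in emrfs-site.xml;
# the backing DynamoDB metadata table is created on first use.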

Page 36: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

EMR Consistent View

[Diagram: EMRFS on the Amazon EMR cluster reads file data from Amazon S3 and checks it against a processed-files registry held in Amazon DynamoDB]

Page 37: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

EMR Consistent View

Manage data in EMRFS using the emrfs client:

emrfs describe-metadata, set-metadata-capacity, delete-metadata, create-metadata, list-metadata-stores - work with metadata stores
diff - show what in a bucket is missing from the index
delete - remove index entries
sync - ensure that the index is in sync with a bucket
import - import bucket items into the index
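For example, to compare and then repair the index for a bucket prefix (bucket name hypothetical):

emrfs diff s3://mybucket/data    # show objects in S3 but missing from the index, and vice versa
emrfs sync s3://mybucket/data    # rebuild the index to match what is actually in S3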

Page 38: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

What About Data Locality?

Run your job in the same region as your Amazon S3 bucket
Amazon EMR nodes have high-speed connectivity to Amazon S3
If your job is CPU/memory-bound, locality doesn't make a huge difference

Page 39: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Performance & Scalability

Amazon S3 provides near-linear scalability

[Chart: S3 streaming performance, GB/second vs. reader connections: 100 VMs sustain 9.6 GB/s at $26/hr; 350 VMs sustain 28.7 GB/s at $90/hr; roughly 34 seconds per terabyte]

Page 40: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

When HDFS is a Better Choice…

Iterative workloads

If you’re processing the same dataset more than once

Disk I/O intensive workloads

Page 41: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Option 2: Optimise for Latency with HDFS

1. Data persisted on Amazon S3

Page 42: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Option 2: Optimise for Latency with HDFS

2. Launch Amazon EMR and copy data to HDFS with S3DistCp

Page 43: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Option 2: Optimise for Latency with HDFS

3. Start processing data on HDFS
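Step 2 above might look like the following, reusing the S3DistCp jar path from the examples later in this deck (job flow ID and bucket hypothetical):

./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
  /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --args '--src,s3://mybucket/input,--dest,hdfs:///data'
# Bulk-copies the S3 input into HDFS so subsequent jobs read from local
# disks rather than over the network to S3.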

Page 44: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Benefits: HDFS instead of S3

Better pattern for I/O-intensive workloads
Amazon S3 remains the system of record:
• Durability
• Scalability
• Cost
• Features: lifecycle policies, security

Page 45: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Outline

Introduction to Amazon EMR

Amazon EMR Design Patterns

Amazon EMR Best Practices

Observations from AWS

Page 46: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Amazon EMR Nodes and Size

Use m1.small instances for functional testing
Use xlarge+ nodes for production workloads
Use CC2/C3 for memory- and CPU-intensive jobs
Use HS1, HI1, I2 instances for HDFS workloads
Prefer a smaller cluster of larger nodes

Page 47: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Holy Grail Question

How many nodes do I need?

Page 48: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Instance Resource Allocation

• Hadoop 1: static number of mappers/reducers configured for the cluster nodes

• Hadoop 2: variable number of Hadoop applications, based on file splits and available memory

• Useful to understand old vs. new sizing

Page 49: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Instance Resources

[Chart: per-instance resources across instance types, comparing memory (GB), mappers, reducers, CPU (ECU units), and local storage (GB)]

Page 50: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Cluster Sizing Calculation

1. Estimate the number of tasks your job requires.

2. Pick an instance type and note the number of tasks it can run in parallel.

3. Pick sample data files for a test workload. The number of sample files should equal the number from step 2.

4. Run an Amazon EMR cluster with a single core node and process your sample files from step 3. Note the time taken to process them.

Page 51: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Cluster Sizing Calculation

Estimated number of nodes =

  (Total Tasks × Time To Process Sample Files) / (Instance Task Capacity × Desired Processing Time)

Page 52: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Example: Cluster Sizing Calculation

1. Estimate the number of tasks your job requires

   150

2. Pick an instance type and note the number of tasks it can run in parallel

   m1.xlarge, with a task capacity of 8 per instance

Page 53: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Example: Cluster Sizing Calculation

3. Pick sample data files for a test workload. The number of sample files should equal the number from step 2.

   8 files selected for our sample test

Page 54: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Example: Cluster Sizing Calculation

4. Run an Amazon EMR cluster with a single core node and process your sample files from step 3. Note the time taken.

   3 min to process 8 files

Page 55: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Cluster Sizing Calculation

Estimated number of nodes =

  (Total Tasks For Your Job × Time To Process Sample Files) / (Per-Instance Task Capacity × Desired Processing Time)

  = (150 × 3 min) / (8 × 5 min desired) ≈ 11 m1.xlarge
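A quick sanity check of the arithmetic (shell integer division floors 11.25 down to the slide's 11; in practice you may want to round up and add headroom):

echo $(( (150 * 3) / (8 * 5) ))   # 450 / 40 -> 11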

Page 56: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

File Best Practices

Avoid small files (smaller than 100 MB) at all costs
Use compression

Page 57: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Holy Grail Question

What if I have small file issues?

Page 58: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Dealing with Small Files

Use S3DistCp to combine smaller files together
S3DistCp takes a pattern and a target file size to combine smaller input files into larger ones:

./elastic-mapreduce --jar /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--groupBy,.*XABCD12345678.([0-9]+-[0-9]+-[0-9]+-[0-9]+).*,--targetSize,128'

Page 59: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Compression

Always compress data files on Amazon S3
• Reduces bandwidth between Amazon S3 and Amazon EMR
• Speeds up your job
Compress task output

Page 60: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Compression

Compression types:
• Some are fast but offer less space reduction
• Some are space efficient but slower
• Some are splittable and some are not

Algorithm   % Space Remaining   Encoding Speed   Decoding Speed
GZIP        13%                 21 MB/s          118 MB/s
LZO         20%                 135 MB/s         410 MB/s
Snappy      22%                 172 MB/s         409 MB/s

Page 61: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Changing Compression Type

You may decide to change compression type
Use S3DistCp to change the compression type of your files

Example:
./elastic-mapreduce --jobflow j-3GY8JC4179IOK --jar \
  /home/hadoop/lib/emr-s3distcp-1.0.jar \
  --args '--src,s3://myawsbucket/cf,--dest,hdfs:///local,--outputCodec,lzo'

Page 62: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Outline

Introduction to Amazon EMR

Amazon EMR Design Patterns

Amazon EMR Best Practices

Observations from AWS

Page 63: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

M1/C1 Instance Families

Heavily used by EMR customers
However, HDFS utilisation is typically very low
M3/C3 offer better performance/$

Page 64: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

M1 vs M3

Instance      Cost per Map Task   Cost per Reduce Task
m1.large      $0.08               $0.15
m1.xlarge     $0.06               $0.15
m3.xlarge     $0.04               $0.07
m3.2xlarge    $0.04               $0.07

Page 65: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

C1 vs C3

Instance      Cost per Map Task   Cost per Reduce Task
c1.medium     $0.13               $0.13
c1.xlarge     $0.35               $0.70
c3.xlarge     $0.05               $0.11
c3.2xlarge    $0.05               $0.11

Page 66: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Orc vs Parquet

File formats designed for SQL/data warehousing on Hadoop
Columnar file formats
Compress well
Suited to high row count, low cardinality data
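As a sketch in Hive on EMR (table names hypothetical, assuming Hive 0.13+), both formats can be produced from an existing table for the side-by-side testing recommended later in this section:

hive -e "CREATE TABLE logs_orc     STORED AS ORC     AS SELECT * FROM logs;"
hive -e "CREATE TABLE logs_parquet STORED AS PARQUET AS SELECT * FROM logs;"
# STORED AS PARQUET requires Hive 0.13+; earlier Hive versions need explicit
# SerDe and input/output format clauses for Parquet.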

Page 67: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Orc File Format

Optimised Row Columnar format
Zlib or Snappy external compression
250 MB stripes of column data and index
Run-length or dictionary encoding
1 output file per container task

Page 68: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Parquet File Format

Gzip or Snappy external compression
Array data structures
Limited data type support for Hive
Batch creation
1 GB files

Page 69: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

Orc vs Parquet

Depends on the tool you are using
Consider future architecture & requirements
Test, test, test

Page 70: AWS Webcast - Amazon Elastic Map Reduce Deep Dive and Best Practices

In Summary

• Practice cloud architecture with transient clusters

• Use S3 as the system of record for durability

• Use task nodes on Spot for increased performance and lower cost

• Move to new instance families for better performance/$

• Exciting developments around columnar file formats