hadoop aws infrastructure cost evaluation
DESCRIPTION
How do you calculate the cost of a Hadoop infrastructure on Amazon AWS, given data volume estimates and a rough use case? This presentation compares the different options available on AWS.
TRANSCRIPT
Hadoop Platform infrastructure cost evaluation
• High level requirements
• Cloud architecture
• Major architecture components
• Amazon AWS
• Hadoop distributions
• Capacity Planning
• Amazon AWS – EMR
• Hadoop distributions
• On-premise hardware costs
• Gotchas
Agenda
2
• Build an Analytical & BI platform for web log analytics
• Ingest multiple data sources:
• Log data
• internal user data
• Apply complex business rules
• Manage Events, filter Crawler Driven Logs, apply Industry and Domain Specific rules
• Populate/export to a BI tool for visualization.
High Level Requirements
3
• Today’s baseline: ~42 TB per year (~3.5 TB raw data per month), 3-year store
• SLA: should process data every day. Currently done once a month.
• Predefined processing via Hive; no exploratory analysis
• Everything in the cloud:
• Store (HDFS), Compute (M/R), Analysis (BI tool)
Non-Functional Requirements
4
• Seeding data in S3 (3 years’ worth of data)
• Adding monthly net-new data only
• Speed not of primary importance
Non-Functional Requirements [2]
5
• Cleaned-up log data: 42 TB per year (3 years = 126 TB)
• Total disk space required should consider:
• Compression (LZO, ~40% reduction) – reduces disk space required to ~25 TB *
• Replication factor of 3: ~75 TB
• 75% maximum disk utilization in Hadoop: ~100 TB
• Total disk capacity required for DNs: ~100 TB/year (~8.5 TB/mo)
• (* disclaimer: depends on codec and input data)
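The sizing chain above can be sketched as a small calculation. This is a sketch using the slide’s assumptions (40% compression savings, replication factor 3, 75% utilization cap); the slide rounds the intermediate results to ~25, ~75, and ~100 TB.

```python
# Estimate DataNode disk capacity from raw log volume (slide's assumptions).
RAW_TB_PER_YEAR = 42.0       # cleaned-up log data per year
COMPRESSION_RATIO = 0.60     # LZO leaves ~60% of original size (~40% saved)
REPLICATION_FACTOR = 3       # HDFS default replication
MAX_DISK_UTILIZATION = 0.75  # keep Hadoop disks at most 75% full

compressed = RAW_TB_PER_YEAR * COMPRESSION_RATIO   # ~25 TB
replicated = compressed * REPLICATION_FACTOR       # ~75 TB
provisioned = replicated / MAX_DISK_UTILIZATION    # ~100 TB

print(f"compressed={compressed:.1f} TB, replicated={replicated:.1f} TB, "
      f"provisioned={provisioned:.1f} TB (~{provisioned / 12:.1f} TB/month)")
```

Swapping in the 70% utilization cap from the next slide yields the ~107 TB/year figure shown there.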
Data Estimates for Capacity planning [2]
6
Data Estimates for Capacity planning: reduced logs
7
Expected data volume | Log data volume (TB) | After compression (Gzip 40%) | Data replication on 3 nodes | 70% max disk utilization (TB)
1 month | 3.6 | 2.16 | 6.5 | 9.2
1 year | 42 | 25 | 75 | 107
3 years | 126 | 75.6 | 226 | 322
• Total disk capacity required for DN: ~10TB/ month
Amazon AWS
Cloud Solution Architecture
8
[Architecture diagram: webservers produce logs and user data; data lands in S3, metadata extraction populates Hive tables, Hadoop/HDFS runs the processing, and a BI tool serves the client.]
1. Copy data to S3
2. Export data to HDFS
3. Process in M/R
4. Display in BI tool
5. Retain results into S3
• Amazon Elastic Compute Cloud (EC2) is a web service that provides resizable compute capacity in the cloud.
• Manual set up of Hadoop on EC2
• Use EBS for storage capacity (HDFS)
• Storage on S3
Hadoop on AWS: EC2
9
• EC2 instances options
• Choose instance type
• Choose instance type availability
• Choose instance family
• Choose where the data resides:
• S3 – high latency, but highly available
• EBS
• Permanent storage?
• Snapshots to S3?
• Apache Whirr for set up
Running Hadoop on AWS: EC2
10
• Other choices:
• EBS-optimized instances: dedicated throughput between Amazon EC2 and Amazon EBS, with options between 500 Mbps and 1000 Mbps depending on the instance type used.
• Inter-region data transfer
• Dedicated instances: run on single-tenant hardware dedicated to a single customer.
• Spot instances: Name your price
Amazon EC2 – Instance features
11
• Amazon EC2 instances are grouped into six families: general purpose, memory optimized, compute optimized, storage optimized, micro, and GPU.
• General-purpose instances have memory-to-CPU ratios suitable for most applications.
• Memory-optimized instances offer larger memory sizes for high-throughput applications.
• Compute-optimized instances have proportionally more CPU than memory (RAM) and are well suited for compute-intensive applications.
• Storage-optimized instances are optimized for very high random I/O performance, or for high storage density, low storage cost, and high sequential I/O performance.
• Micro instances provide a small amount of CPU with the ability to burst to higher amounts for brief periods.
• GPU instances are intended for graphics and general-purpose GPU compute workloads.
Amazon Instance Families
12
• On-Demand Instances – On-Demand Instances let you pay for compute capacity by the hour with no long-term commitments, freeing you from the costs and complexities of planning, purchasing, and maintaining hardware.
• Reserved Instances – Reserved Instances give you the option to make a one-time payment for each instance you want to reserve and in turn receive a discount on the hourly charge for that instance. There are three Reserved Instance types (Light, Medium, and Heavy Utilization Reserved Instances) that enable you to balance the amount you pay upfront with your effective hourly price.
• Spot Instances – Spot Instances allow customers to bid on unused Amazon EC2 capacity and run those instances for as long as their bid exceeds the current Spot Price. The Spot Price changes periodically based on supply and demand, and customers whose bids meet or exceed it gain access to the available Spot Instances. If you have flexibility in when your applications can run, Spot Instances can significantly lower your Amazon EC2 costs.
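A useful way to compare these purchasing options is to amortize a Reserved Instance’s one-time payment over the hours actually used. A minimal sketch; the dollar figures below are illustrative placeholders, not AWS list prices:

```python
HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours

def effective_hourly(upfront: float, hourly: float, term_years: int = 1,
                     utilization: float = 1.0) -> float:
    """Amortized hourly cost: upfront fee spread over the hours actually used."""
    hours_used = HOURS_PER_YEAR * term_years * utilization
    return upfront / hours_used + hourly

# Hypothetical prices for one instance type (placeholders).
on_demand = effective_hourly(upfront=0.0, hourly=0.48)
reserved = effective_hourly(upfront=1000.0, hourly=0.16, term_years=1)
print(f"on-demand ${on_demand:.3f}/hr vs reserved ${reserved:.3f}/hr")
```

Note the `utilization` parameter: a Reserved Instance only beats on-demand if the cluster actually runs enough hours to amortize the upfront fee.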
Amazon Instances types availability
13
Amazon EC2 – Storage
14
Amazon EC2 – Instance types
15
[Instance-type table for BI instances, master nodes, and data nodes]
• Hadoop cluster is initiated when analytics is run
• Data is streamed from S3 to EBS Volumes
• Results from analytics stored to S3 once computed
• BI nodes permanent
Systems Architecture – EC2
16
[Architecture diagram: the client ships logs into S3 on AWS; a Hadoop cluster (NN, SN, DNs, EN) runs HDFS on EBS drives; permanent BI nodes sit alongside the cluster.]
• Probably not the best choice:
• EBS volumes make the solution costly
• If instead using instance storage, the EC2 instance choices are either too small (a few GB) or too big (48 TB per instance).
• Don’t need the flexibility – just want to use Hive
Hadoop on AWS: EC2
17
• Amazon Elastic MapReduce (EMR) is a web service that provides a hosted Hadoop framework running on EC2 and Amazon Simple Storage Service (S3).
Hadoop on AWS: EMR
18
• Elastic Map Reduce
• For occasional jobs – Ephemeral clusters
• Ease of use, but 20% costlier
• Data stored in S3 - Highly tuned for S3 storage
• Hive and Pig available
• Only pay for S3 + instance time while jobs are running
• Or: leave it always on.
Running Hadoop on AWS - EMR
19
• EC2 instances with Amazon’s own flavor of Hadoop
• Amazon’s Apache Hadoop is version 1.0.3. You can also choose MapR M3 or M5 (0.20.205).
• You can run Hive (0.7.1 or 0.8.1), custom JARs, streaming, Pig, or HBase.
Hadoop on AWS - EMR
20
• Hadoop cluster created elastically
• Data is streamed from S3 to initiate Hadoop cluster dynamically
• Results from analytics stored to S3 once computed
• BI nodes permanent
Systems Architecture – EMR
21
[Architecture diagram: the client ships logs into S3 on AWS; an EMR-managed Hadoop cluster (NN, SN, DNs) reads its HDFS input from S3; permanent BI instances sit alongside the cluster.]
Amazon EMR – Instance types
22
[Instance-type table for BI instances, master nodes, and data nodes]
• Calculate and add:
• S3 cost (seeded data)
• Incremental S3 cost, per month
• EC2 cost
• EMR cost
• In/out Transfer of data cost
• Amazon support cost
• Infrastructure support Engineer cost
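The checklist above amounts to summing the line items the AWS calculator produces. A sketch of that sum; every dollar figure below is a placeholder, not a quote:

```python
# Monthly cost components for the EMR option (placeholder values, USD).
monthly_costs = {
    "s3_seeded_data": 3000.0,     # storing the seeded 3-year data set
    "s3_incremental": 250.0,      # net-new monthly data
    "ec2_instances": 7000.0,      # underlying EC2 instance hours
    "emr_surcharge": 1500.0,      # EMR fee on top of EC2
    "data_transfer": 400.0,       # in/out transfer of data
    "aws_support": 800.0,         # Amazon support plan
    "support_engineer": 12500.0,  # infrastructure support engineer, pro-rated
}

monthly_total = sum(monthly_costs.values())
print(f"monthly: ${monthly_total:,.0f}, yearly: ${monthly_total * 12:,.0f}")
```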
AWS calculator – EMR calculation
23
• Say for 24hrs/day, EMR cost:
AWS calculator – EMR calculation
24
• Say for 24hrs/day, 3 year S3:
AWS calculator – EMR calculation
25
• Say for 24hrs/day, 3 year EC2:
AWS calculator – EMR calculation
26
Data volume | Instance types | Price/year, running 24 hours/day | Price/year, running 8 hours/day | Price/year, running 8 hours/week
1 year – storing 42 TB on S3 | 10 instances – data nodes: m1.xlarge; NN: m2.2xlarge; BI: m2.2xlarge; load balancer: t1.micro; 1-year reserved; 10 EMR instances (subject to change depending on actual load) | $14.1k/mo * 12 = $169.2k | $8.9k * 12 = $106.8k | $6.6k * 12 = $79.2k
3 years – storing 126 TB on S3 | (as above) | $19.5k * 36 mos = $702k | $15.5k * 36 mos = $558k | $13.2k * 36 mos = $475.2k
Amazon EMR Pricing – Reduced log volume
27
Hadoop on AWS: trade-offs
28
Feature | EC2 | EMR
Ease of use | Hard – IT Ops costs | Easy; Hadoop clusters can be of any size; can have multiple clusters
Cost | Cheaper | Costlier: pay for EC2 + EMR
Flexibility | Better: access to the full stack of the Hadoop ecosystem | On-demand Hadoop cluster; easy to use, but Hadoop comes installed with limited options
Portability | Easier to move to dedicated hardware | –
Speed | Faster | Lower performance: all data is streamed from S3 for each job
Maintainability | Can choose any vendor; can be updated to the latest version | Debugging is tricky: cluster is terminated, so no logs
• EMR with Spot instances seems to be the trend for minimal cost, if SLA timeliness is not of primary importance.
• Use Reserved instances to bring down cost drastically (~60%).
• Compression on S3?
• Need to account for a secondary NN?
• Estimate more accurately how many EMR nodes are needed with AWS’s AMI task configuration.
EC2 Pricing Gotchas
29
• Transferring data between S3 and EMR clusters is very fast (and free), so long as your S3 bucket and Hadoop cluster are in the same Amazon region
• EMR’s S3 file system streams data directly to S3 instead of buffering to intermediate local files.
• EMR’s S3 file system adds multipart upload, which splits your writes into smaller chunks and uploads them in parallel.
• Store fewer, larger files instead of many smaller ones
• http://blog.mortardata.com/post/58920122308/s3-hadoop-performance
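The multipart-upload behaviour can be pictured as splitting a write into fixed-size parts that are then sent in parallel. A minimal chunking sketch (5 MiB is S3’s minimum part size; the helper below is illustrative, not the actual EMR implementation):

```python
PART_SIZE = 5 * 1024 * 1024  # 5 MiB, the S3 minimum multipart part size

def split_into_parts(data: bytes, part_size: int = PART_SIZE) -> list[bytes]:
    """Split a payload into the parts a multipart upload would send in parallel."""
    return [data[i:i + part_size] for i in range(0, len(data), part_size)]

payload = b"x" * (12 * 1024 * 1024)        # a 12 MiB object
parts = split_into_parts(payload)
print(len(parts), [len(p) for p in parts])  # 3 parts: 5 MiB, 5 MiB, 2 MiB
```

This also illustrates the “fewer, larger files” advice: many small objects each pay fixed per-request overhead, while one large object amortizes it across parallel parts.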
EMR Technical Gotchas
30
Data volume | Storage for data nodes | Instances | Price, first year
126 TB | 6 * 12 x 2 TB | 10 data nodes, 3 masters | $10.6k * 10 DN + $7.3k * 3 = $128k
Hardware: Dell PowerEdge R720 – E5-2640 2.50 GHz processor, 8 cores, 12M cache, Turbo; 64 GB memory (quad-ranked RDIMM for 2 processors, low-volt); 12 x 2 TB 7.2K RPM SATA 3.5in hot-plug hard drives; Intel 82599 dual-port 10GbE mezzanine card
+ Vendor support ($50k) + full-time person ($150k) = $328k
BI: 4 nodes, $43k
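The first-year on-premise total combines hardware, vendor support, and staffing. A sketch of that sum using the slide’s estimates (the slide rounds the hardware subtotal to $128k):

```python
# On-premise first-year cost, per the slide's estimates (USD).
hardware = 10.6e3 * 10 + 7.3e3 * 3  # 10 data nodes + 3 master nodes
vendor_support = 50e3
full_time_person = 150e3
bi_nodes = 43e3                      # 4 BI nodes, tracked separately

first_year = hardware + vendor_support + full_time_person
print(f"cluster first year: ${first_year / 1e3:.0f}k "
      f"(+ BI nodes ${bi_nodes / 1e3:.0f}k)")
```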
In house Hadoop cluster
31
32
Licensing and support costs
• Cloudera or Hortonworks
• Enterprise 24x7 production support – phone and support portal access (support datasheet attached)
• Minimum $50k
Hadoop Distributions:
33
Business
• Response time: 1 hour
• Access: phone, chat, and email 24/7
• Costs: greater of $100 - or -
• 10% of monthly AWS usage for the first $0-$10K
• 7% of monthly AWS usage from $10K-$80K
• 5% of monthly AWS usage from $80K-$250K
• 3% of monthly AWS usage from $250K+
• (about $800/yr)
Enterprise
• Response time: 15 minutes
• Access: phone, chat, TAM, and email 24/7
• Costs: greater of $15,000 - or -
• 10% of monthly AWS usage for the first $0-$150K
• 7% of monthly AWS usage from $150K-$500K
• 5% of monthly AWS usage from $500K-$1M
• 3% of monthly AWS usage from $1M+
http://aws.amazon.com/premiumsupport/
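The Business-tier pricing above is a tiered percentage of monthly AWS usage with a floor. A sketch of the Business tier using the breakpoints listed on the slide:

```python
def business_support_cost(monthly_usage: float) -> float:
    """AWS Business support: greater of $100 or tiered % of monthly usage."""
    tiers = [  # (upper bound of tier in USD, rate applied within the tier)
        (10_000, 0.10),
        (80_000, 0.07),
        (250_000, 0.05),
        (float("inf"), 0.03),
    ]
    cost, lower = 0.0, 0.0
    for upper, rate in tiers:
        if monthly_usage > lower:
            cost += (min(monthly_usage, upper) - lower) * rate
        lower = upper
    return max(cost, 100.0)

print(business_support_cost(5_000))   # 10% of $5k -> $500
print(business_support_cost(30_000))  # $1,000 + 7% of $20k -> $2,400
```

The Enterprise tier follows the same shape with a $15,000 floor and higher breakpoints.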
Amazon – Support EC2 & EMR
34
35
Thank You