Autoscaling Spark on AWS EC2 - 11th Spark London Meetup
TRANSCRIPT
Who am I?
• DevOps
• Built a few platforms in my life
• Mostly adtech and in-game analytics for Sony PlayStation
• Currently advising investment banks
• CTO, Entropy Investments
MapReduce is about quickly writing very inefficient code and then running it at massive scale.
(C) Someone
Problem
• EC2 is a pay-for-what-you-use model
• But you have to decide how much capacity you want before starting a cluster
Most common problems when running on EC2
Scaling up
• "My team needs a new cluster - how big should it be?"
Scaling down
• "Did I shut down the DEV cluster before leaving the office on Friday evening?"
Types of scaling
Vertical scaling - "Let's get a bigger box"
• Change instance type
• Change EBS parameters
Horizontal scaling - "Just add more nodes"
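As a rough sketch, here is what each approach looks like with the AWS CLI. The instance ID, instance type, and group name below are placeholder values, not anything from the talk:

```shell
# Vertical scaling: stop the instance, change its type, start it again.
# (Works for EBS-backed instances; ID and type are placeholders.)
aws ec2 stop-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-stopped --instance-ids i-0123456789abcdef0
aws ec2 modify-instance-attribute --instance-id i-0123456789abcdef0 \
    --instance-type Value=m4.2xlarge
aws ec2 start-instances --instance-ids i-0123456789abcdef0

# Horizontal scaling: change the desired size of an existing
# autoscaling group (group name is a placeholder).
aws autoscaling set-desired-capacity \
    --auto-scaling-group-name spark-workers --desired-capacity 10
```

Note the asymmetry: vertical scaling requires downtime for the instance, while horizontal scaling is a single online call.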
Autoscaling
• Automatic resizing based on demand
• Define minimum/maximum instance count
• Define when scaling should occur
• Use metrics
• Run your jobs and don't worry about infrastructure
Autoscaling components
• AMI - machine image with Spark installed
• Launch configuration - defines:
  • AMI
  • instance type
  • instance storage
  • public IP
  • security groups
Autoscaling components
• Autoscaling group - defines:
  • launch configuration
  • availability zones
  • VPC details
  • min/max servers
  • when to scale
  • metrics/health checks
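Wired together with the AWS CLI, the two components above look roughly like this. All names, AMI/subnet/security-group IDs, and sizes are placeholder values:

```shell
# Launch configuration: AMI, instance type, networking, security groups.
aws autoscaling create-launch-configuration \
    --launch-configuration-name spark-worker-lc \
    --image-id ami-12345678 \
    --instance-type m4.xlarge \
    --security-groups sg-11112222 \
    --associate-public-ip-address

# Autoscaling group: ties the launch configuration to a VPC subnet
# and sets the min/max server counts.
aws autoscaling create-auto-scaling-group \
    --auto-scaling-group-name spark-workers \
    --launch-configuration-name spark-worker-lc \
    --min-size 2 --max-size 20 \
    --vpc-zone-identifier subnet-33334444

# Scaling policy: what to do when a metric alarm fires
# (here: add two instances to the group).
aws autoscaling put-scaling-policy \
    --auto-scaling-group-name spark-workers \
    --policy-name spark-scale-up \
    --adjustment-type ChangeInCapacity \
    --scaling-adjustment 2
```

The scaling policy is then hooked up to a CloudWatch metric alarm (e.g. on CPU utilisation), which is what actually triggers the resize.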
spark-cloud
• Better scripts to start Spark clusters on EC2
• Alpha version
• https://github.com/entropyltd/spark-cloud
What’s inside spark-cloud
Building AMIs with Packer
Packer is a tool for creating machine and container images for multiple platforms from a single source configuration.
Supports AWS, DigitalOcean, Docker, OpenStack, Parallels, QEMU, VirtualBox, VMware
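A minimal Packer template for baking a Spark AMI might look like the following. The region, source AMI, and install steps are placeholders for illustration; the actual templates live in the spark-cloud repository:

```json
{
  "builders": [{
    "type": "amazon-ebs",
    "region": "eu-west-1",
    "source_ami": "ami-12345678",
    "instance_type": "m4.large",
    "ssh_username": "ubuntu",
    "ami_name": "spark-node-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "sudo apt-get update",
      "echo 'download and unpack Spark here'"
    ]
  }]
}
```

Packer boots a temporary instance from the source AMI, runs the provisioners, snapshots the result as a new AMI, and tears the instance down.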
Summary
• Spark and EC2 is a very common combination
• Because it makes your life easier
• And cheaper
• The spark-cloud script will help you
• You can just worry about writing good Spark code!
Amazon S3 Tips
• Don't use s3n://
• Use s3a:// with Hadoop 2.6
  • Parallel rename, especially important for committing output
  • Supports IAM authentication
  • No "xyz_$folder$" files
  • Input seek
  • Multipart upload (no 5 GB limit)
  • Error recovery and retry
More info https://issues.apache.org/jira/browse/HADOOP-10400
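In practice, switching to s3a:// is a matter of pulling in the hadoop-aws module and changing the URL scheme. The bucket and job names below are placeholders; with IAM roles on the instances, no access keys are needed:

```shell
# Run a Spark job against S3 via the s3a:// filesystem.
# hadoop-aws 2.6.0 provides the S3AFileSystem implementation.
spark-submit \
    --packages org.apache.hadoop:hadoop-aws:2.6.0 \
    my_job.py s3a://my-bucket/input s3a://my-bucket/output
```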