let's talk operations! (hadoop summit 2014)
DESCRIPTION
These are the introductory slides I used (in some form or another) for the Let's Talk Operations! sessions for the 2014 Hadoop Summits. No video for this one!
TRANSCRIPT
Let’s Talk Operations!
Allen Wittenauer
Twitter: @_a__w_ Email: aw @ apache.org
How many individual grids should I have?
One big grid
• Pros
  • Lower ops overhead
  • One location for all data
• Cons
  • Dev and Prod on one system
Grid per project
• Pros
  • Capacity planning per project
• Cons
  • More headcount to maintain
  • Multiple copies of data
  • Data ingress is a mess
[Diagrams. Labels: Data Center with Production, ETL, and Development grids; Event Feeds and Database Feeds, Base ETL Pull, Post-Processed Data, Dev and Prod; DC1 and DC2 with Production, ETL, and Development grids]
How do I solve some common distcp issues?
• Common issues
  • Version incompatibilities
  • Network bandwidth consumption
• Some tricks
  • Use WebHDFS
    • All modern versions support it
    • Read and write in both directions
  • Create a separate queue with hard limits
  • Pull from larger, push from smaller
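As a rough illustration of the WebHDFS and dedicated-queue tricks, here is a minimal Python sketch that only assembles a distcp command line; it does not run anything. The host names, the path, the queue name "distcp", and the map and bandwidth values are all hypothetical. It also assumes a Hadoop 2 style setup: the `mapreduce.job.queuename` property, the DistCp `-m` and `-bandwidth` options, and the NameNode HTTP port 50070 are Hadoop 2 conventions and may differ on other versions.

```python
import shlex


def build_distcp_command(src_host, dst_host, path, queue="distcp",
                         max_maps=10, bandwidth_mb=10, http_port=50070):
    """Return an argv list for a DistCp run that uses WebHDFS on both ends.

    All defaults here are illustrative assumptions, not recommendations.
    """
    # webhdfs:// URIs go over HTTP, which sidesteps RPC version mismatches.
    src = f"webhdfs://{src_host}:{http_port}{path}"
    dst = f"webhdfs://{dst_host}:{http_port}{path}"
    return [
        "hadoop", "distcp",
        # Submit to a dedicated queue (assumed to exist with hard limits).
        f"-Dmapreduce.job.queuename={queue}",
        # Cap the number of simultaneous copy tasks.
        "-m", str(max_maps),
        # Cap the bandwidth per map task, in MB/s.
        "-bandwidth", str(bandwidth_mb),
        src, dst,
    ]


if __name__ == "__main__":
    cmd = build_distcp_command("nn1.example.com", "nn2.example.com",
                               "/data/events")
    print(shlex.join(cmd))
```

The hard limit itself lives in the scheduler configuration, not in this command; with the Capacity Scheduler, for example, that would be the queue's maximum-capacity setting.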
Q&A
Allen Wittenauer Twitter: @_a__w_ Email: aw @ apache.org
Bonus Slide!
• root partitioning
  20 GB /, ..., 200 GB task space, (rest) HDFS
• non-root partitioning
  5 GB swap, 200 GB task space, (rest) HDFS
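To make the "(rest) HDFS" arithmetic concrete, here is a small Python sketch for the non-root layout from the slide (5 GB swap, 200 GB task space, remainder to HDFS). The 2000 GB disk size is a made-up figure for illustration, and the function name is hypothetical. The root layout is left out because the slide elides some of its partitions.

```python
def hdfs_space_gb(disk_gb, reserved_gb):
    """Return the space left for HDFS after the fixed partitions are carved out."""
    remaining = disk_gb - sum(reserved_gb.values())
    if remaining < 0:
        raise ValueError("fixed partitions exceed the disk size")
    return remaining


# Non-root disk layout as given on the slide.
NON_ROOT_LAYOUT = {"swap": 5, "task space": 200}

if __name__ == "__main__":
    # 2000 GB is an assumed disk size, not a figure from the talk.
    print(hdfs_space_gb(2000, NON_ROOT_LAYOUT))  # 1795
```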