Tips from Hadoop Experts for Beginners


Posted on 22-Nov-2014


DESCRIPTION

9 expert tips for Hadoop learners

TRANSCRIPT

1. Getting started with Hadoop? Tips from Hadoop professionals to help kick-start your career.

2. I would like to share my experience with you. 1. I think practice is more important than theory, so get a quick start, for example with the Cloudera QuickStart VM. 2. Start with the basics of installing and configuring Hadoop from the command line; once you are familiar with it, you can use a GUI such as Ambari or Cloudera Manager. Jin Zhan, Senior Engineer, Square Enix, Japan

3. Here are some tips. These are based on things people should know but that I have seen them get wrong; you probably know them already, and there are more than two! 1. You must increase ulimits: http://blog.cloudera.com/blog/2009/03/configuration-parameters-what-can-you-just-ignore/ Mark H. Butler, Software Engineer at Pataniqa Ltd, Preston, United Kingdom

4. 2. Installing a NoSQL database? Use the YCSB benchmark to check that it is working correctly: https://github.com/brianfrankcooper/YCSB/wiki Mark H. Butler, Software Engineer at Pataniqa Ltd, Preston, United Kingdom

5. 3. Consider using compression (although there are tradeoffs!): http://comphadoop.weebly.com/ http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of http://www.slideshare.net/Hadoop_Summit/kamat-singh-june27425pmroom210cv2 http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/ http://www.cloudera.com/blog/2009/11/17/hadoop-at-twitter-part-1-splittable-lzo-compression/ https://github.com/twitter/hadoop-lzo Mark H. Butler, Software Engineer at Pataniqa Ltd, Preston, United Kingdom

6. 4. Don't install a Hadoop cluster manually; there are many technologies to automate it, e.g. Puppet, Chef, Ansible, and Vagrant: http://blog.godatadriven.com/bare-metal-hadoop-provisioning-ansible-cobbler.html http://chimpler.wordpress.com/2013/01/20/deploying-hadoop-on-ec2-with-whirr/ http://java.dzone.com/articles/setting-hadoop-virtual-cluster http://www.diversit.eu/2012/05/setting-up-hadoop-cluster-using-puppet.html http://www.rpark.com/2013/02/using-chef-to-build-out-hadoop-cluster.html Mark H. Butler, Software Engineer at Pataniqa Ltd, Preston, United Kingdom

7. 5. Java and Scala are great, but don't overlook Python: it is handy for prototyping one-off MapReduce jobs, as you do not need a cluster to test. http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/ Hope that helps! Mark H. Butler, Software Engineer at Pataniqa Ltd, Preston, United Kingdom

8. Technically speaking, MapReduce is the base, and Map = Select while Reduce = Group By. So if you know what you want and how you want to summarize it, then Hadoop is meant for you. Piyush Jindal, Software Engineer at Target, Bengaluru, Karnataka, India

9. Tips: 1. Good knowledge of data structures and the insight to analyze data are a must. 2. Core Java and the Collections framework are a must. 3. Knowledge of SQL and PL/SQL for solving complex scenarios will help a lot. These are the stepping stones to approaching a problem in big data and providing a solution as well. SOMANATH NANDA, Cloudera Certified Developer for Hadoop, Cognizant Technology Solutions, Bengaluru, Karnataka, India

10. 1. Audit your data to identify what might be useful but unexploited. 2. Study new technologies; they are moving rapidly. Merv Adrian, Vice President at Gartner, San Francisco Bay Area

11. There are some good examples in this whitepaper (note: registration required): http://www.mongodb.com/lp/big-data Mat Keep, Principal Product Marketing Manager at MongoDB Inc., Hawkinge, Kent, United Kingdom

12. Here are some tips, in no specific order. 1. The best value from Hadoop comes from a combination of software and hardware designed for your specific needs. 2. The hardware configuration of your cluster is very important. If your workload is I/O bound, then disk specs matter most; if it is CPU bound, then faster CPUs are better; and if the application is memory bound, then servers with more memory are needed. Mohit Saxena, Vice President, Technology Founder, InMobi - A Global Mobile Ad Network, Bengaluru Area, India

13. 3. Network connectivity between nodes is extremely important: at least a 1-gigabit NIC is a must in a Hadoop cluster, so that inter-node communication does not become a bottleneck, as it can be a huge drag. 4. Plan the size of storage and disk controllers according to the reads per second you want to achieve from each server. 5. Ganglia is a fairly good monitoring tool for Hadoop, and it can point out bottlenecks. Mohit Saxena, Vice President, Technology Founder, InMobi - A Global Mobile Ad Network, Bengaluru Area, India

14. For more information on the best Hadoop courses for your career, check out the link below: http://www.dezyre.com/Big-Data-and-Hadoop/19
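The ulimit advice in slide 3 can be sanity-checked from Python before starting cluster daemons. A minimal sketch using the standard-library `resource` module (Unix only); the 32768 floor is an illustrative assumption, not a value mandated by Hadoop, so tune it for your workload:

```python
import resource

# Read the current soft and hard limits on open file descriptors.
# Hadoop DataNodes (and HBase) keep many files open at once, so the
# typical default soft limit of 1024 is too low for a real cluster.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)

RECOMMENDED_FLOOR = 32768  # illustrative assumption; adjust per workload

def ulimit_ok(soft_limit, floor=RECOMMENDED_FLOOR):
    """Return True if the file-descriptor soft limit meets the floor."""
    return soft_limit >= floor

print(f"nofile soft={soft} hard={hard} ok={ulimit_ok(soft)}")
```

If this reports a low limit, raise it in `/etc/security/limits.conf` (or your distribution's equivalent) before deploying, as the Cloudera post linked above explains.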
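The compression tradeoff mentioned in slide 5 is easy to see empirically: stronger compression shrinks the output more but costs more CPU time. A small sketch using the standard-library `zlib` as a stand-in for the Snappy/LZO codecs discussed in the links (the sample data here is made up for illustration):

```python
import time
import zlib

def compression_tradeoff(data, level):
    """Compress `data` at the given zlib level (1=fast .. 9=best)
    and return (compression ratio, seconds elapsed)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(compressed) / len(data), elapsed

# Highly repetitive sample data compresses well at any level.
sample = b"hadoop mapreduce hdfs " * 10000

fast_ratio, fast_time = compression_tradeoff(sample, 1)  # speed-oriented
best_ratio, best_time = compression_tradeoff(sample, 9)  # size-oriented
```

In a real cluster the same tradeoff appears as a choice of codec (Snappy and LZO favour speed, gzip favours size), plus the question of whether the format is splittable across mappers.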
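Slide 7's point about prototyping MapReduce in Python without a cluster, and slide 8's analogy (Map = Select, Reduce = Group By), can both be sketched locally. This word-count example mimics Hadoop's map, shuffle, and reduce phases in plain Python; the function names are ours, not part of any Hadoop API:

```python
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase (like SELECT): emit a (word, 1) pair for every word."""
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    """Reduce phase (like GROUP BY ... SUM): total the counts per key.
    Hadoop's shuffle delivers pairs sorted by key; sorting here mimics it."""
    for key, group in groupby(sorted(pairs, key=itemgetter(0)),
                              key=itemgetter(0)):
        yield key, sum(count for _, count in group)

counts = dict(reducer(mapper(["big data", "big hadoop", "big"])))
# counts == {"big": 3, "data": 1, "hadoop": 1}
```

Once the logic works locally, the same mapper and reducer can be adapted to Hadoop Streaming (reading lines from stdin, writing tab-separated key/value pairs to stdout), as in the Michael Noll tutorial linked above.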
