Developing Hadoop Strategy for Your Enterprise
DESCRIPTION
Step-by-step discussion on creating a Big Data or Hadoop strategy for any enterprise
TRANSCRIPT
Creating Hadoop Strategy for your enterprise
[email protected]@avkashchauhan
Agenda this hour:
Understanding the puzzle and solve it piece by piece
Various questions to ask before making the decision
Choosing a Hadoop distribution & vendor applications
Conclusion
4 pieces of the puzzle:
1. Determining actual requirements
2. Choosing cluster type and Hadoop distribution
3. Choosing data processing & data analysis/visualization vendors
4. Running the Hadoop cluster successfully
Understanding & Solving the puzzle……
Consideration: Data size?
Is the size of your data large enough to warrant processing through Hadoop?
What is the size of the data you generate on a regular basis?
What is the aggregate size of data over time?
Are you moving from traditional data storage/processing to Hadoop, or starting fresh?
What ability and capacity do you have to manage the data as it grows?
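To make the data-size questions concrete, a rough capacity estimate helps. The sketch below is a back-of-the-envelope calculation assuming HDFS's default replication factor of 3; the daily ingest, retention window, and 25% temp-space overhead are hypothetical placeholders to substitute with your own measurements.

```python
# Back-of-the-envelope Hadoop storage sizing.
# Replication factor 3 is the HDFS default; ingest rate, retention,
# and overhead below are made-up example figures.

def raw_storage_needed_tb(daily_ingest_tb, retention_days,
                          replication_factor=3, overhead=1.25):
    """Estimate raw cluster capacity needed, in TB.

    overhead: headroom for intermediate/temporary job data (25% here).
    """
    logical_data = daily_ingest_tb * retention_days
    return logical_data * replication_factor * overhead

# Example: 0.5 TB/day of new data, retained for 90 days
print(raw_storage_needed_tb(0.5, 90))  # 168.75
```

If the resulting number is only a few terabytes, that is a signal worth weighing against the operating cost of a cluster before committing to Hadoop.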
Consideration: Data storage Location
Is the data already stored somewhere?
  On-premise storage
  In the cloud with a specific storage vendor (AWS/Google/Azure)?
Would you like to move data from its current location to a new source?
The data storage process will start once the cluster is ready
Choices are open:
  Cloud (Azure Storage, S3, Google Cloud Storage)
  On-premise NAS
  Why not HDFS directly?
Consideration: Data type and processing time
Data type selection: any Hadoop-supported type?
  Text, Avro, RCFile, Parquet, SequenceFile
If you have a custom format, can you support it on Hadoop?
File storage method: compressed / non-compressed
Are there any requirements to process the data within a limited time?
  e.g., the output of one MapReduce job is the input of a second, and the second job runs every 30 minutes
Processing time can improve depending on the source data type
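The chained-job pattern above (job B consumes job A's output on a fixed interval) can be sketched as a simple pipeline runner. The two "jobs" here are plain functions standing in for real MapReduce job submissions; the names and interval are hypothetical.

```python
# Sketch of chaining two jobs on a fixed cadence: the output of the
# first job is the input of the second, repeated every interval.

import time

def run_pipeline(job_a, job_b, source_input, interval_seconds, cycles):
    """Run job_a -> job_b `cycles` times, pausing between runs."""
    results = []
    for i in range(cycles):
        intermediate = job_a(source_input)     # first MapReduce job
        results.append(job_b(intermediate))    # second job reads A's output
        if i < cycles - 1:
            time.sleep(interval_seconds)       # e.g. 1800 s for 30 minutes
    return results

# Tiny stand-in jobs: A tokenizes text, B counts the tokens.
tokenize = lambda text: text.split()
count = lambda words: len(words)
print(run_pipeline(tokenize, count, "a b c", 0, 1))  # [3]
```

In a real deployment this loop is usually replaced by a workflow scheduler (e.g., Oozie or cron), but the dependency between the jobs stays exactly this shape.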
Consideration: Hadoop Cluster
What choices do you have?
  On-premise
  Cloud (e.g., EMR)
  3rd-party hosting (your own Hadoop cluster in the cloud)
Questions to ask:
  Do you have IT staff to manage the cluster?
  Are you a small team with limited time and resources?
  What if you want to utilize existing infrastructure?
Consideration: Hadoop Distribution
Build your own Hadoop, or choose one of the available Hadoop distributions:
  Cloudera, Hortonworks, MapR, Pivotal
What about cloud-based Hadoop?
  EMR
  Hadoop on Azure
  Others…
Consideration: Using EMR
Immediately available after 15-20 minutes of configuration
No hardware of any kind needed, just your machine
Pay as you go, so you may end up paying only a few dollars to run a PoC
If your data is already in Amazon object storage (S3), EMR could be a very good choice
If the result set is large and another application uses it as a source, you will need to copy it locally
If you have to copy a large amount of data into Amazon EMR, it may not be a good option
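The "few dollars for a PoC" claim is simple pay-as-you-go arithmetic. The sketch below shows the shape of that estimate; the hourly rates are hypothetical placeholders, not current AWS pricing, so check the EMR pricing page for real numbers.

```python
# Rough pay-as-you-go cost estimate for a transient EMR PoC cluster.
# Rates below are made-up examples, NOT actual AWS prices.

def poc_cost_usd(nodes, hours, ec2_rate_per_hour, emr_rate_per_hour):
    """Total cost = nodes x hours x (EC2 instance rate + EMR surcharge)."""
    return nodes * hours * (ec2_rate_per_hour + emr_rate_per_hour)

# Example: 4 nodes for 3 hours at an assumed $0.20/h EC2 + $0.05/h EMR
print(poc_cost_usd(4, 3, 0.20, 0.05))  # 3.0
```

The same arithmetic run against your actual data-transfer needs is what tips the decision: a cheap compute PoC can still be a poor fit if moving the input data in (or the results out) dominates the cost and time.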
Consideration: Choosing data processing app
Do you have in-house talent for data processing? Who is going to write the MapReduce jobs?
Will your data be transferred to Hive tables for processing through SQL?
Will your data be transferred to Pig for processing?
Are you going to hire programmers to write MapReduce jobs?
Do you understand your data very well?
What is your timeline to try and accept a PoC?
Continued….
Consideration: Choosing data processing app
Are you going to choose a 3rd-party vendor?
Do you have security requirements for vendors?
What if a full-service application is already available?
Any limitations on results processing, such as a time limit?
Consideration: Choosing BI vendor
Do you already have a BI application that you would want to use with the processed results?
Do you have limitations within BI related to data access and security?
Would you like a single vendor for data processing and data analysis/visualization, or otherwise?
Financial and technical limitations in choosing a vendor
Continued….
Consideration: Choosing BI vendor
Most BI vendors access Hadoop data through connectors, so make sure this is acceptable for your requirements
Only a few vendors have BI applications running natively on Hadoop, using the MapReduce framework for data processing, so make sure whoever you choose meets your needs
Running Hadoop Cluster successfully
It is very important to run your Hadoop cluster properly to get ROI
Various jobs need various cluster settings tweaked for best results, and it takes a good amount of time to find the optimum settings
Cluster settings can be set globally or specific to individual jobs
Various applications submitting MapReduce jobs can use a scheduler to run jobs at specific times
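The "global vs. per-job settings" point works by layering: Hadoop resolves a job's effective configuration by applying job-specific overrides (e.g., `-D key=value` at submission) on top of cluster-wide `*-site.xml` defaults. A minimal sketch of that resolution, using real Hadoop property names but made-up values:

```python
# Sketch of Hadoop's configuration layering: job-specific overrides
# win over cluster-wide defaults. Property names are real Hadoop keys;
# the values are example figures only.

def effective_config(cluster_defaults, job_overrides):
    """Merge per-job settings over global defaults; overrides win."""
    merged = dict(cluster_defaults)
    merged.update(job_overrides)
    return merged

cluster_defaults = {
    "mapreduce.job.reduces": "4",
    "mapreduce.map.memory.mb": "1024",
}
# A heavy job asks for more reducers without touching the global config
job_overrides = {"mapreduce.job.reduces": "16"}

print(effective_config(cluster_defaults, job_overrides))
```

Tuning per job this way lets you find optimum settings for each workload without destabilizing every other job on the cluster.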
Continued….
Running Hadoop Cluster successfully
Some jobs may require more resources than others, so such jobs need rescheduling based on cluster health
Depending on resource utilization, it is best to schedule jobs at a time when resources are available
Data security is very important, so deploy an appropriate security mechanism to meet your needs
Test 3rd-party apps thoroughly; one bad application may destabilize the overall cluster
Continued….
Running Hadoop Cluster successfully
If possible, deploy 3rd-party applications outside the NameNode and provide SSH access to the Hadoop nodes; this way you can minimize additional stress on the Hadoop cluster
Install a Hadoop monitoring application to keep watching your Hadoop cluster's health
Keep checking your cluster for additional storage and other resource requirements
If a particular node performs weaker than the other nodes, find a replacement; otherwise it will slow down overall processing
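Spotting the weak node mentioned above usually means comparing each node's task times against the rest of the cluster. The sketch below flags nodes whose average task time is a large multiple of the cluster median; the metric values and the 1.5x threshold are assumptions, and in practice these numbers would come from your monitoring application.

```python
# Sketch of weak-node detection: flag nodes whose average task time
# is far above the cluster median. Metrics and threshold are
# hypothetical -- feed in real numbers from your monitoring tool.

from statistics import median

def slow_nodes(avg_task_seconds_by_node, factor=1.5):
    """Return nodes whose average task time exceeds factor x median."""
    med = median(avg_task_seconds_by_node.values())
    return sorted(node for node, secs in avg_task_seconds_by_node.items()
                  if secs > factor * med)

metrics = {"node1": 40, "node2": 42, "node3": 95, "node4": 41}
print(slow_nodes(metrics))  # ['node3']
```

Because MapReduce waits on its slowest tasks, one straggling node like `node3` here can drag down every job that schedules work on it, which is why replacement is worth the trouble.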
Conclusion……
Conclusion: No one size fits all
While it is easy to set up and turn on a Hadoop cluster, the cost of keeping a Hadoop cluster up and running is significant, so what you choose is very important. Each enterprise has unique requirements
Choose what is best for you, based on your needs and your existing or incoming data requirements.
Conclusion: Key differentiator
Most Hadoop distributions give you the exact same Hadoop runtime for a selected version; however, the management and monitoring components can be the key differentiators among them.
Thanks….