Developing Hadoop Strategy for Your Enterprise
DESCRIPTION
Step-by-step discussion on creating a Big Data or Hadoop strategy for any enterprise
TRANSCRIPT
Creating Hadoop Strategy for your enterprise
[email protected]@avkashchauhan
Agenda this hour:
Understanding the puzzle and solve it piece by piece
Various questions to ask before making the decision
Choosing a Hadoop distribution & vendor applications
Conclusion
4 pieces of the puzzle:
1. Determining actual requirements
2. Choosing cluster type and Hadoop distribution
3. Choosing data processing & data analysis/visualization vendors
4. Running the Hadoop cluster successfully
Understanding & Solving the puzzle……
Consideration: Data size?
Is the size of your data large enough to warrant processing through Hadoop?
What is the size of the data you generate on a regular basis?
What is the aggregate size of data over time?
Are you moving from traditional data storage/processing to Hadoop, or starting fresh?
What ability and capacity do you have to manage the data as it grows?
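To make the data-size questions concrete, a rough capacity estimate helps. The sketch below is a back-of-the-envelope calculation assuming HDFS's default replication factor of 3; the daily ingest, retention window, and 25% temp-space overhead are hypothetical placeholders to substitute with your own measurements.

```python
# Back-of-the-envelope Hadoop storage sizing.
# Replication factor 3 is the HDFS default; ingest rate, retention,
# and overhead below are made-up example figures.

def raw_storage_needed_tb(daily_ingest_tb, retention_days,
                          replication_factor=3, overhead=1.25):
    """Estimate raw cluster capacity needed, in TB.

    overhead: headroom for intermediate/temporary job data (25% here).
    """
    logical_data = daily_ingest_tb * retention_days
    return logical_data * replication_factor * overhead

# Example: 0.5 TB/day of new data, retained for 90 days
print(raw_storage_needed_tb(0.5, 90))  # 168.75
```

If the resulting number is only a few terabytes, that is a signal worth weighing against the operating cost of a cluster before committing to Hadoop.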
Consideration: Data storage Location
Is the data already stored somewhere?
  On-premise storage
  In the cloud with a specific storage vendor (AWS/Google/Azure)?
Would you like to move data from its current location to a new source?
The data storage process will start once the cluster is ready
Choices are open:
  Cloud (Azure Storage, S3, Google Cloud Storage)
  On-premise NAS
  Why not HDFS directly?
Consideration: Data type and processing time
Data type selection: any Hadoop-supported type?
  Text, Avro, RCFile, Parquet, SequenceFile
If you have a custom format, can you support it on Hadoop?
File storage method: compressed / non-compressed
Are there any requirements to process the data within a limited time?
  e.g., the output of one MapReduce job is the input of a second, and the second job runs every 30 minutes
Processing time can improve depending on the source data type
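The chained-job pattern above (job B consumes job A's output on a fixed interval) can be sketched as a simple pipeline runner. The two "jobs" here are plain functions standing in for real MapReduce job submissions; the names and interval are hypothetical.

```python
# Sketch of chaining two jobs on a fixed cadence: the output of the
# first job is the input of the second, repeated every interval.

import time

def run_pipeline(job_a, job_b, source_input, interval_seconds, cycles):
    """Run job_a -> job_b `cycles` times, pausing between runs."""
    results = []
    for i in range(cycles):
        intermediate = job_a(source_input)     # first MapReduce job
        results.append(job_b(intermediate))    # second job reads A's output
        if i < cycles - 1:
            time.sleep(interval_seconds)       # e.g. 1800 s for 30 minutes
    return results

# Tiny stand-in jobs: A tokenizes text, B counts the tokens.
tokenize = lambda text: text.split()
count = lambda words: len(words)
print(run_pipeline(tokenize, count, "a b c", 0, 1))  # [3]
```

In a real deployment this loop is usually replaced by a workflow scheduler (e.g., Oozie or cron), but the dependency between the jobs stays exactly this shape.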
Consideration: Hadoop Cluster
What choices do you have?
  On-premise
  Cloud (e.g., EMR)
  3rd-party hosting (your own Hadoop cluster in the cloud)
Questions to ask:
  Do you have IT staff to manage the cluster?
  Are you a small team with limited time and resources?
  What if you want to utilize existing infrastructure?
Consideration: Hadoop Distribution
Build your own Hadoop, or choose one of the available Hadoop distributions:
  Cloudera, Hortonworks, MapR, Pivotal
What about cloud-based Hadoop?
  EMR
  Hadoop on Azure
  Others…
Consideration: Using EMR
Immediately available after 15-20 minutes of configuration
No hardware of any kind needed, just your machine
Pay as you go, so you may end up paying only a few dollars to run a PoC
If your data is already in Amazon object storage (S3), EMR could be a very good choice
If the result set is large and another application uses it as a source, you will need to copy it locally
If you have to copy a large amount of data into Amazon EMR, it may not be a good option
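The "few dollars for a PoC" claim is simple pay-as-you-go arithmetic. The sketch below shows the shape of that estimate; the hourly rates are hypothetical placeholders, not current AWS pricing, so check the EMR pricing page for real numbers.

```python
# Rough pay-as-you-go cost estimate for a transient EMR PoC cluster.
# Rates below are made-up examples, NOT actual AWS prices.

def poc_cost_usd(nodes, hours, ec2_rate_per_hour, emr_rate_per_hour):
    """Total cost = nodes x hours x (EC2 instance rate + EMR surcharge)."""
    return nodes * hours * (ec2_rate_per_hour + emr_rate_per_hour)

# Example: 4 nodes for 3 hours at an assumed $0.20/h EC2 + $0.05/h EMR
print(poc_cost_usd(4, 3, 0.20, 0.05))  # 3.0
```

The same arithmetic run against your actual data-transfer needs is what tips the decision: a cheap compute PoC can still be a poor fit if moving the input data in (or the results out) dominates the cost and time.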
Consideration: Choosing data processing app
Do you have in-house talent for data processing? Who is going to write the MapReduce jobs?
Will your data be transferred to Hive tables for processing through SQL?
Will your data be transferred to Pig for processing?
Are you going to hire programmers to write MapReduce jobs?
Do you understand your data very well?
What is your timeline to try and accept a PoC?
Continued….
Consideration: Choosing data processing app
Are you going to choose a 3rd-party vendor?
Do you have security requirements for vendors?
What if a full-service application is already available?
Any limitations on results processing, such as a time limit?
Consideration: Choosing BI vendor
Do you already have a BI application that you would want to use with the processed results?
Do you have limitations within BI related to data access and security?
Would you like a single vendor for data processing and data analysis/visualization, or otherwise?
Financial and technical limitations in choosing a vendor
Continued….
Consideration: Choosing BI vendor
Most BI vendors access Hadoop data through connectors, so make sure this is acceptable for your requirements
Only a few vendors have BI applications running natively on Hadoop, using the MapReduce framework for data processing, so make sure whoever you choose meets your needs
Running Hadoop Cluster successfully
It is very important to run your Hadoop cluster properly to get ROI
Various jobs need various cluster settings tweaked for best results, and it takes a good amount of time to find the optimum settings
Cluster settings can be set globally or specific to individual jobs
Various applications submitting MapReduce jobs can use a scheduler to run jobs at specific times
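The "global vs. per-job settings" point works by layering: Hadoop resolves a job's effective configuration by applying job-specific overrides (e.g., `-D key=value` at submission) on top of cluster-wide `*-site.xml` defaults. A minimal sketch of that resolution, using real Hadoop property names but made-up values:

```python
# Sketch of Hadoop's configuration layering: job-specific overrides
# win over cluster-wide defaults. Property names are real Hadoop keys;
# the values are example figures only.

def effective_config(cluster_defaults, job_overrides):
    """Merge per-job settings over global defaults; overrides win."""
    merged = dict(cluster_defaults)
    merged.update(job_overrides)
    return merged

cluster_defaults = {
    "mapreduce.job.reduces": "4",
    "mapreduce.map.memory.mb": "1024",
}
# A heavy job asks for more reducers without touching the global config
job_overrides = {"mapreduce.job.reduces": "16"}

print(effective_config(cluster_defaults, job_overrides))
```

Tuning per job this way lets you find optimum settings for each workload without destabilizing every other job on the cluster.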
Continued….
Running Hadoop Cluster successfully
Some jobs may require more resources than others, so such jobs need rescheduling based on cluster health
Depending on resource utilization, it is best to schedule jobs at a time when resources are available
Data security is very important, so deploy an appropriate security mechanism to meet your needs
Test 3rd-party apps thoroughly; one bad application may destabilize the overall cluster
Continued….
Running Hadoop Cluster successfully
If possible, deploy 3rd-party applications outside the NameNode and provide SSH access to the Hadoop nodes; this way you can minimize additional stress on the Hadoop cluster
Install a Hadoop monitoring application to keep watching your Hadoop cluster's health
Keep checking your cluster for additional storage and other resource requirements
If a particular node performs weaker than the other nodes, find a replacement; otherwise it will slow down overall processing
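Spotting the weak node mentioned above usually means comparing each node's task times against the rest of the cluster. The sketch below flags nodes whose average task time is a large multiple of the cluster median; the metric values and the 1.5x threshold are assumptions, and in practice these numbers would come from your monitoring application.

```python
# Sketch of weak-node detection: flag nodes whose average task time
# is far above the cluster median. Metrics and threshold are
# hypothetical -- feed in real numbers from your monitoring tool.

from statistics import median

def slow_nodes(avg_task_seconds_by_node, factor=1.5):
    """Return nodes whose average task time exceeds factor x median."""
    med = median(avg_task_seconds_by_node.values())
    return sorted(node for node, secs in avg_task_seconds_by_node.items()
                  if secs > factor * med)

metrics = {"node1": 40, "node2": 42, "node3": 95, "node4": 41}
print(slow_nodes(metrics))  # ['node3']
```

Because MapReduce waits on its slowest tasks, one straggling node like `node3` here can drag down every job that schedules work on it, which is why replacement is worth the trouble.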
Conclusion……
Conclusion: No one size fits all
While it is easy to set up and turn on a Hadoop cluster, the cost of keeping a Hadoop cluster up and running is significant, so what you choose is very important. Each enterprise has unique requirements
Choose what is best for you, based on your needs and your existing or incoming data requirements.
Conclusion: Key differentiator
Most Hadoop distributions give you the exact same Hadoop runtime for a selected version; however, the management and monitoring components can be the key differentiators among them.
Thanks….