Developing Hadoop Strategy for Your Enterprise

Upload: avkash-chauhan

Post on 24-Jun-2015

DESCRIPTION

Step-by-step discussion on creating a Big Data or Hadoop strategy for any enterprise

TRANSCRIPT

Page 1: Developing Hadoop strategy for your Enterprise

Creating Hadoop Strategy for your enterprise

[email protected] @avkashchauhan

Page 2: Developing Hadoop strategy for your Enterprise

Agenda this hour:

Understanding the puzzle and solving it piece by piece

Various questions to ask before making the decision
Choosing Hadoop distribution & vendor application

Conclusion

Page 3: Developing Hadoop strategy for your Enterprise

4 pieces of the puzzle:

1. Determining actual requirements
2. Choosing cluster type and Hadoop distribution
3. Choosing data processing & data analysis/visualization vendors
4. Running the Hadoop cluster successfully

Page 4: Developing Hadoop strategy for your Enterprise

Understanding & Solving the puzzle……

Page 5: Developing Hadoop strategy for your Enterprise

Consideration: Data size?

Is the size of the data large enough to justify processing it through Hadoop?
What is the size of the data you generate on a regular basis?
What is the aggregate size of the data per unit of time?
Are you moving from traditional data storage/processing to Hadoop, or starting new?
What ability and capacity do you have to manage the data as it grows?
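The growth question above comes down to simple arithmetic, and a minimal sketch may help frame it. The function name and the sample figures are illustrative assumptions, not numbers from the talk. The only fixed fact used is that HDFS keeps three replicas of each block by default.

```python
def projected_storage_tb(current_tb, daily_growth_gb, days, replication=3):
    """Estimate the raw disk (in TB) needed after `days` of steady growth.

    replication=3 mirrors the HDFS default replication factor; every
    other number is a placeholder to replace with your own measurements.
    """
    logical_tb = current_tb + (daily_growth_gb * days) / 1024
    return logical_tb * replication


if __name__ == "__main__":
    # Hypothetical figures: 10 TB today, 50 GB of new data per day, one year out.
    print(f"{projected_storage_tb(10, 50, 365):.1f} TB of raw disk")
```

The estimate ignores compression and temporary job output, so treat the result as a rough starting point for sizing.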

Page 6: Developing Hadoop strategy for your Enterprise

Consideration: Data storage Location

Is the data already stored somewhere?
    On-premise storage
    In the cloud with a specific storage vendor: AWS, Google, Azure, or another?

Would you like to move the data from its current location to a new one?
The data storage process will start once the cluster is ready

Choices are open
    Cloud (Azure Storage, S3, Google)
    On-premise NAS
    Why not HDFS directly?

Page 7: Developing Hadoop strategy for your Enterprise

Consideration: Data type and processing time

Data type selection
    Any Hadoop type?
        Text, Avro, RCFile, Parquet, SequenceFile
    If you have a custom type, can you support it on Hadoop?

File storage method
    Compressed / non-compressed

Is there a requirement to process the data within a limited time?

For example, the output of one MapReduce job is the input for a second MapReduce job, and the second job runs every 30 minutes

Processing time can improve depending on the source data type
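The 30-minute example can be turned into a quick feasibility check. The sketch below assumes that both chained jobs must finish inside one 30-minute cycle. The `headroom` safety margin is an illustrative assumption, not a Hadoop setting.

```python
def fits_window(stage_minutes, window_minutes=30, headroom=0.2):
    """Return True if the chained jobs finish within the scheduling window.

    stage_minutes: measured run time of each job in the chain, in order.
    headroom: fraction of the window kept free for retries and slow runs
              (an assumed safety margin).
    """
    budget = window_minutes * (1 - headroom)
    return sum(stage_minutes) <= budget


if __name__ == "__main__":
    # Hypothetical timings: first job takes 10 minutes, second takes 12.
    print(fits_window([10, 12]))
```

If the check fails, the options the slide hints at are a faster-to-read source format, compression, or a longer interval between runs.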

Page 8: Developing Hadoop strategy for your Enterprise

Consideration: Hadoop Cluster

What choices do you have?
    On-premise
    Cloud (e.g. EMR)
    3rd-party hosting (your own Hadoop cluster in the cloud)

Questions to ask:
    Do you have IT staff to manage the cluster?
    Are you a small team with limited time and resources?
    What if you want to utilize existing infrastructure?

Page 9: Developing Hadoop strategy for your Enterprise

Consideration: Hadoop Distribution

Build your own Hadoop
Choose one of the available Hadoop distributions
    Cloudera, Hortonworks, MapR, Pivotal
What about cloud-based Hadoop?
    EMR
    Hadoop on Azure
    Others…

Page 10: Developing Hadoop strategy for your Enterprise

Consideration of using EMR

Available after 15-20 minutes of configuration
No hardware of any type needed, just your own machine
Pay as you go, so you may end up paying only a few dollars to run the PoC
If your data is already in Amazon object storage (S3), EMR could be a very good choice
If the result set is large and another application uses it as a source, you will need to copy it locally
If you have to copy a large amount of data to Amazon EMR, it may not be a good option
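Two of the points above, the pay-as-you-go cost and the burden of copying data, can be estimated before starting a PoC. The sketch below is a rough calculator only. The hourly rate, node count, link speed, and `efficiency` factor are hypothetical placeholders, not Amazon pricing or measured throughput.

```python
def poc_cost_usd(nodes, hours, usd_per_node_hour):
    """Pay-as-you-go cost of a short-lived cluster."""
    return nodes * hours * usd_per_node_hour


def transfer_hours(data_gb, link_mbps, efficiency=0.8):
    """Hours needed to move data_gb over a link of link_mbps megabits/s.

    efficiency is an assumed fraction of the link you actually get.
    """
    megabits = data_gb * 8 * 1000  # 1 GB = 8000 megabits (decimal units)
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical PoC: 4 nodes for 2 hours at $0.25 per node-hour,
    # and 500 GB uploaded over a 100 Mbit/s link.
    print(f"cluster cost: ${poc_cost_usd(4, 2, 0.25):.2f}")
    print(f"upload time: {transfer_hours(500, 100):.1f} hours")
```

When the upload time dwarfs the processing time, that is the case the last bullet warns about.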

Page 11: Developing Hadoop strategy for your Enterprise

Consideration: Choosing data processing app

Do you have in-house talent for data processing?
Who is going to write the MapReduce jobs?
Will your data be loaded into Hive tables and processed through SQL?
Will your data be processed through Pig scripts?
Are you going to hire programmers to write MapReduce jobs?
Do you understand your data very well?
What is your timeline to try and accept a PoC?

Continued….

Page 12: Developing Hadoop strategy for your Enterprise

Consideration: Choosing data processing app

Are you going to choose a 3rd-party vendor?
Do you have security requirements for vendors?
What if a full-service application is already available?
Are there any limitations on results processing, such as a time limit?

Page 13: Developing Hadoop strategy for your Enterprise

Consideration: Choosing BI vendor

Do you already have a BI application that you want to use with the processed results?
Do you have limitations within BI related to data access and security?
Would you like a single vendor for data processing, data analysis, and visualization, or separate vendors?
Financial and technical limitations in choosing a vendor

Continued….

Page 14: Developing Hadoop strategy for your Enterprise

Consideration: Choosing BI vendor

Most BI vendors access Hadoop data through connectors, so make sure this is acceptable for your requirements
Only a few vendors have BI applications running natively on Hadoop that use the MapReduce framework for data processing, so make sure the one you choose meets your needs

Page 15: Developing Hadoop strategy for your Enterprise

Running Hadoop Cluster successfully

It is very important to run your Hadoop cluster properly to get ROI
Different jobs need different cluster settings tweaked for best results, and it takes a good amount of time to find the optimum settings
Cluster settings can be set globally or specifically for individual jobs
Applications submitting MapReduce jobs can use a scheduler to run jobs at specific times
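The global versus per-job distinction can be sketched as a small helper that layers job-specific overrides on top of cluster-wide defaults. It emits them as `-D property=value` generic options for a job submission command. The property names are standard MapReduce settings in Hadoop 2 and later, but the values are illustrative assumptions. The `GLOBAL_DEFAULTS` dictionary stands in for what would normally live in the cluster's site configuration files. The `-D` options are only honoured when the job driver parses generic options (for example through `ToolRunner`).

```python
# Stand-in for cluster-wide settings (normally kept in the site XML files).
GLOBAL_DEFAULTS = {
    "mapreduce.map.memory.mb": "2048",
    "mapreduce.reduce.memory.mb": "4096",
}


def generic_options(job_overrides, defaults=GLOBAL_DEFAULTS):
    """Merge per-job overrides over the defaults and build -D arguments."""
    merged = {**defaults, **job_overrides}
    args = []
    for key in sorted(merged):
        args += ["-D", f"{key}={merged[key]}"]
    return args


if __name__ == "__main__":
    # Hypothetical heavy job: more reducer memory and an explicit reducer count.
    heavy = {"mapreduce.reduce.memory.mb": "8192", "mapreduce.job.reduces": "20"}
    print(" ".join(generic_options(heavy)))
```

Keeping the overrides next to each job definition makes it easier to record which settings were tried while searching for the optimum.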

Continued….

Page 16: Developing Hadoop strategy for your Enterprise

Running Hadoop Cluster successfully

Some jobs may require more resources than others, so such jobs need rescheduling according to cluster health
Depending on resource utilization, it is best to schedule jobs at a time when resources are available
Data security is very important, so deploy an appropriate security mechanism to meet your needs
Test 3rd-party apps properly; otherwise one bad application may destabilize the whole cluster

Continued….

Page 17: Developing Hadoop strategy for your Enterprise

Running Hadoop Cluster successfully

If possible, deploy 3rd-party applications outside the NameNode and give them SSH access to the Hadoop nodes; this way you minimize additional stress on the Hadoop cluster
Install a Hadoop monitoring application to keep track of your cluster's health
Keep checking your cluster for additional storage and other resource requirements
If a particular node performs worse than the other nodes, find a replacement; otherwise it will slow down overall processing
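The last point, spotting a node that performs worse than its peers, can be approximated from per-node task timings that a monitoring tool exports. The sketch below is one simple heuristic, not a feature of any particular monitoring product. The node names, timings, and the `tolerance` threshold are hypothetical.

```python
from statistics import median


def slow_nodes(avg_task_seconds, tolerance=1.5):
    """Return nodes whose average task time exceeds tolerance x the median.

    avg_task_seconds: mapping of node name -> average task duration in seconds.
    """
    if not avg_task_seconds:
        return []
    typical = median(avg_task_seconds.values())
    return sorted(
        node for node, secs in avg_task_seconds.items()
        if secs > tolerance * typical
    )


if __name__ == "__main__":
    # Hypothetical averages collected over the same set of jobs.
    timings = {"node1": 60, "node2": 62, "node3": 58, "node4": 140}
    print(slow_nodes(timings))
```

Comparing against the median rather than the mean keeps one very slow node from hiding itself by dragging the average up.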

Page 18: Developing Hadoop strategy for your Enterprise

Conclusion……

Page 19: Developing Hadoop strategy for your Enterprise

Conclusion: No one size fits all

While it is easy to set up and turn on a Hadoop cluster, the cost of keeping it up and running is significant, so what you choose is very important.
Each enterprise has unique requirements.
Choose what is best for you, based on your needs and your existing or incoming data requirements.

Page 20: Developing Hadoop strategy for your Enterprise

Conclusion: Key differentiator

Most Hadoop distributions give you the exact same Hadoop runtime for a given version; the management and monitoring components could be the key differentiator among them.

Page 21: Developing Hadoop strategy for your Enterprise

Thanks….