why hadoop big dataand map reduce

WhyHadoop,BigDataandMapReduce.Answer: because they work together really well. Each technology deals with the intrinsic problems encountered when extracting value from the data set.

ExampletechnologiesIamassumingyouarefamiliarwithsqlandrelationaldatabases,IwillonlyexpandtheNoSQLandMapReduceexamplesindetail.ForMapreducewelookatHadoopandNoSQLwelookatmongodb.MongodbisthebestNoSQLdatabasetouseforabeginnerinBigDatadevelopment.TypesofBigData

TypeofBigData Description PreferedTechnology

Structureddata Dataisorganisedintoentitiesthathaveadefinedformat,suchasXMLdocumentsordatabasetablesthatconformtoaparticularpredefinedschema

RDBMS,NoSQL

SemiStructureddata dataislooser,whiletheremaybeaschema,itcanbeignored,soitmaybeusedonlyasaguidetothestructureofthedata.

MapReduce,NoSQL

Unstructureddata Unstructureddatadoesnothaveanyparticularinternalstructure,forexample,plaintextorimagedata.

MapReduce

SimpleProblemcomputetheaverageageforthepeopleineachcityTheRelationalDatabaseManagementSystem(RDBMS)wayStructuredQueryLanguage(SQL)DatabasepeopleTABLEcity

Id

person_id age

1 1 20

1 2 34

3 3 17SELECT people.city.id, AVG(people.city.age) FROM people.city GROUP BY people.city.id TheMapreduce(hadoop)wayMAPSTEP (Scatter)

RECORD KEY VALUE

city=3,age=5 3 5

city=1,age=2 1 6

city=3,age=7 3 7

city=4,age=9 4 9

city=4,age=9 4 9

city=1,age=3 1 3

SHUFFLESTEP(SortbyKey)

KEY VALUES

1 3,6

3 5,7

4 9,9

REDUCESTEP (Gather)

KEY AGGREGATION

1 (3+6)/2

3 (5+7)/2

4 (9+9)/2

TheNoSQL(MongoDb)waymongodbaggregationframeworkTypicalDatasetobject{ "_id" : "35004", "city" : "Acmar", "pop" : 6055, "loc" : [-86.51557, 33.584132] }

aggregationfunctionputs coll.aggregate([ {"$group" => {_id: { city: "$city"}, pop: {"$sum" => "$pop"}}}, {"$group" => {_id: "$_id.city", avg_city_pop: {"$avg" => "$pop"}}}, {"$sort" => {avg_city_pop: -1}}, {"$limit" => 3} ]) MAPReduceSummaryMapreduceisParallelalgorithmthatfollowstheScatterGatherdesignpatternAllmapandreducecallsaredoneinparallelindependentlyMapReduceisagoodfitforprocessesthatneedtoanalysethewholedataset,inabatchoperation,speciallyforadhocanalysis.Mapreduceinterpretsthedataatprocessingtime.TheinputkeysandvaluesforMapReducearenotanintrinsicpropertyofthedata,buttheyarechosenbythepersonanalyzingthedata.Mapreduceislinearlyscalableprogrammingmodel.AdeveloperstaskistowriteMapfunctions&ReducerfunctionspipedtogetherthroughaShuffleSort.Mapreducedefinesamappingfromonesetofkeyvaluepairstoanother.Thesefunctionareoblivioustothesizeofthedataortheclustertheyareoperatingon,sotheycanbeusedunchangedforsmalldatasetandformassiveone.RDBMSSummaryRDBMSisgoodforpointqueriesorupdates,wherethedatasethasbeenindexedtodeliverlowlatencyretrievalandupdatetimesofarelativelysmallamountofdata.MapReducesuitsapplicationswherethedataiswrittenonce,andreadmanytimes,whereasarelationaldatabaseisgooddatasetsthatarecontinuallyupdated.Relationaldataisoftennormalizedtoretainitsintegrity&removeredundancy.NoSQLSummary32bitMongoDBonlyhandles2GBofdata12nodelimittothereplicasetstrategywithConsumptionofdiskspaceissueInMongoDB,adatabaseiscomposedofcollections(somehowtheequivalentoftablesofthosewhoareusedtoSQL).AcollectioncontainsdocumentswhichareJSONobjectQuerysareperformedonasinglecollection.ThereisnoequivalenttotheSQLJOIN.HenceneedtoDenormalisewhendesigningschemas.MongoDBdoesntenforceamodelorschema

MongoallowsoptionalfieldswhereSQLdoesntAsperMongo2.2,thereisaperdatabaselock,whichmeansthatoperationsonindependentobjectswithinthesamecollectioncantbedoneconcurrently.MongoDBoperationsareonlyatomiconthedocumentlevel,theselocks(intraditionaldatabasesareusedtoguardindexaccess)areonlyheldaslongasasingledocument/objectstakestoupdateinmemory.SummaryoftechnologyattributesAttribute TraditionalRDBMS NoSQL MapReduce

Data Size: Gigabytes Gigabytes Petabytes

Access: Interactive and Batch

Interactive and Batch

Batch

Updates: Read and write many times

Read and write many times

Write once , read many times

Structure: Static schema Object Types Dynamic schema

Integrity High High Low

Scaling Nonlinear Nonlinear Linear

ApacheHadoopEcosystem

Common:Asetofoperations&interfacesfordistributedfilesystems&generalI/O(Serialization,JavaRPC,persistentdatastructures)

Avro:Aserializationsystemforefficient,crosslanguagepersistentdatastorage. MapReduce:ADistributeddataprocessingmodelandexecutionenvironmentthsatruns

onlargeclustersofcommoditymachines. HDFS:Adistributedfilesystemthatrunsonlargeclustersofcommoditymachines. Pig:Adataflowlanguageandexecutionenvironmentforexploringverylargedatasets.

PigrunsonHDFSandMapReduceclusters. Hive:Adistributeddatawarehouse.HiveManagesdatastoredinHDFS&providesbatch

stylecomputations&ETLbyHQL. HBase:Adistributed,columnorienteddatabase,HBaseusesHDFSforitsunderlying

storage,supportedbothbatchstylecomputationsusingMapReduceandpointqueries. Sqoop:AToolforefficientlymovingdatabetweenRDBMS&HDFS

MahoutmachinelearningSummary1.RDBMScanscaleeffectivelywhenusedwithaclusteringtechnologylikeMySQLClusterforexample.Howeverrequiresahighlystructureddataset.Joinscanbeexpensive.2.NoSQLscaleseffectively.ForMongoDbtherearesomearchitecturallimitationsthatcouldformassivedatasetsbeabottleneckforcapacity.Elasticschemas,howevernojoinshenceneedtodenormalise.3.MapReduce,scalestolargestdataset.Canoptimisewriteshoweverreadsareslowandasdatasetisdistributedneedtostreamresultsetforqueries.

why hadoop big dataand map reduce

Documents