Why Hadoop, Big Data and MapReduce
Why Hadoop, Big Data and MapReduce? Answer: because they work together really well. Each technology deals with the intrinsic problems encountered when extracting value from a data set.
Example technologies: I am assuming you are familiar with SQL and relational databases, so I will only expand the NoSQL and MapReduce examples in detail. For MapReduce we look at Hadoop, and for NoSQL we look at MongoDB. MongoDB is the best NoSQL database to use for a beginner in Big Data development.

Types of Big Data
Structured data: data is organised into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. Preferred technology: RDBMS, NoSQL.
Semi-structured data: the data is looser; while there may be a schema, it can be ignored, so it may be used only as a guide to the structure of the data. Preferred technology: MapReduce, NoSQL.
Unstructured data: data that does not have any particular internal structure, for example plain text or image data. Preferred technology: MapReduce.
Simple problem: compute the average age for the people in each city.

The Relational Database Management System (RDBMS) way: Structured Query Language (SQL)

people TABLE

city_id  person_id  age
1        1          20
1        2          34
3        3          17

SELECT city_id, AVG(age) FROM people GROUP BY city_id;

The MapReduce (Hadoop) way

MAP STEP (Scatter)
RECORD          KEY  VALUE
city=3, age=5   3    5
city=1, age=6   1    6
city=3, age=7   3    7
city=4, age=9   4    9
city=4, age=9   4    9
city=1, age=3   1    3
SHUFFLE STEP (Sort by Key)
KEY VALUES
1 3,6
3 5,7
4 9,9
REDUCE STEP (Gather)
KEY AGGREGATION
1 (3+6)/2
3 (5+7)/2
4 (9+9)/2
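The three steps above can be simulated in plain Python. This is a single-process sketch only: real Hadoop runs the map and reduce calls in parallel across a cluster. The records follow the MAP step table above.

```python
from collections import defaultdict

# Input records, as in the MAP step table above.
records = [
    {"city": 3, "age": 5},
    {"city": 1, "age": 6},
    {"city": 3, "age": 7},
    {"city": 4, "age": 9},
    {"city": 4, "age": 9},
    {"city": 1, "age": 3},
]

# MAP (Scatter): emit a (key, value) pair per record.
def map_record(record):
    yield record["city"], record["age"]

# SHUFFLE (Sort by Key): group all values under their key.
def shuffle(pairs):
    grouped = defaultdict(list)
    for key, value in sorted(pairs):
        grouped[key].append(value)
    return grouped

# REDUCE (Gather): aggregate each key's values independently.
def reduce_ages(key, values):
    return key, sum(values) / len(values)

pairs = [pair for record in records for pair in map_record(record)]
averages = dict(reduce_ages(k, vs) for k, vs in shuffle(pairs).items())
print(averages)  # {1: 4.5, 3: 6.0, 4: 9.0}
```

Note that the map calls are independent of each other, and so are the reduce calls, which is exactly why the framework can scatter them across machines.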
The NoSQL (MongoDB) way: the MongoDB aggregation framework

Typical data set object:

{ "_id" : "35004", "city" : "Acmar", "pop" : 6055, "loc" : [-86.51557, 33.584132] }
Aggregation function:

puts coll.aggregate([
  {"$group" => {_id: { city: "$city"}, pop: {"$sum" => "$pop"}}},
  {"$group" => {_id: "$_id.city", avg_city_pop: {"$avg" => "$pop"}}},
  {"$sort" => {avg_city_pop: -1}},
  {"$limit" => 3}
])

MapReduce Summary
MapReduce is a parallel algorithm that follows the Scatter-Gather design pattern. All map and reduce calls are done in parallel, independently. MapReduce is a good fit for processes that need to analyse the whole data set in a batch operation, especially for ad hoc analysis. MapReduce interprets the data at processing time: the input keys and values for MapReduce are not an intrinsic property of the data, but are chosen by the person analysing the data. MapReduce is a linearly scalable programming model. A developer's task is to write map functions and reduce functions, piped together through a shuffle sort. MapReduce defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster they are operating on, so they can be used unchanged for a small data set and for a massive one.

RDBMS Summary
An RDBMS is good for point queries or updates, where the data set has been indexed to deliver low-latency retrieval and update times for a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for data sets that are continually updated. Relational data is often normalised to retain its integrity and remove redundancy.

NoSQL Summary
32-bit MongoDB only handles 2 GB of data. There is a 12-node limit to the replica set strategy, and consumption of disk space is an issue. In MongoDB, a database is composed of collections (roughly the equivalent of tables, for those who are used to SQL). A collection contains documents, which are JSON objects. Queries are performed on a single collection; there is no equivalent to the SQL JOIN, hence the need to denormalise when designing schemas. MongoDB doesn't enforce a model or schema.
Mongo allows optional fields where SQL doesn't. As of Mongo 2.2 there is a per-database lock, which means that operations on independent objects within the same collection can't be done concurrently. MongoDB operations are only atomic at the document level; these locks (which in traditional databases are used to guard index access) are only held as long as a single document/object takes to update in memory.

Summary of technology attributes

Attribute   Traditional RDBMS          NoSQL                      MapReduce
Data size   Gigabytes                  Gigabytes                  Petabytes
Access      Interactive and batch      Interactive and batch      Batch
Updates     Read and write many times  Read and write many times  Write once, read many times
Structure   Static schema              Object types               Dynamic schema
Integrity   High                       High                       Low
Scaling     Nonlinear                  Nonlinear                  Linear
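To see what the stages of the MongoDB aggregation function above compute, here is a plain-Python simulation over a few hypothetical documents shaped like the data set object shown earlier (the document values are made up for illustration). Because the first $group stage already groups by city alone, the second stage averages a single value per city; grouping the second stage by a coarser key, such as state, would make the average span multiple values.

```python
from collections import defaultdict

# Hypothetical documents shaped like the data set object above.
docs = [
    {"_id": "35004", "city": "Acmar", "pop": 6055},
    {"_id": "35005", "city": "Acmar", "pop": 1945},
    {"_id": "35006", "city": "Adger", "pop": 3205},
]

# Stage 1 -- {"$group" => {_id: {city: "$city"}, pop: {"$sum" => "$pop"}}}
totals = defaultdict(int)
for doc in docs:
    totals[doc["city"]] += doc["pop"]

# Stage 2 -- {"$group" => {_id: "$_id.city", avg_city_pop: {"$avg" => "$pop"}}}
# Each city now has exactly one summed value, so the average equals the sum.
stage2 = [{"_id": city, "avg_city_pop": pop} for city, pop in totals.items()]

# Stages 3 & 4 -- {"$sort" => {avg_city_pop: -1}}, {"$limit" => 3}
result = sorted(stage2, key=lambda d: d["avg_city_pop"], reverse=True)[:3]
print(result)
# [{'_id': 'Acmar', 'avg_city_pop': 8000}, {'_id': 'Adger', 'avg_city_pop': 3205}]
```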
Apache Hadoop Ecosystem

Common: A set of operations & interfaces for distributed filesystems & general I/O (serialization, Java RPC, persistent data structures).
Avro: A serialization system for efficient, cross-language persistent data storage.
MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
HDFS: A distributed filesystem that runs on large clusters of commodity machines.
Pig: A dataflow language and execution environment for exploring very large data sets. Pig runs on HDFS and MapReduce clusters.
Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides batch-style computations & ETL via HQL.
HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries.
Sqoop: A tool for efficiently moving data between RDBMS & HDFS.
Mahout: Machine learning.

Summary
1. An RDBMS can scale effectively when used with a clustering technology, like MySQL Cluster for example. However, it requires a highly structured data set, and joins can be expensive.
2. NoSQL scales effectively. For MongoDB there are some architectural limitations that could, for massive data sets, be a bottleneck for capacity. Elastic schemas, but no joins, hence the need to denormalise.
3. MapReduce scales to the largest data sets. It can optimise writes; however, reads are slow, and as the data set is distributed, you need to stream result sets for queries.
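The developer's task described in the MapReduce summary (map and reduce functions piped together through a shuffle sort) maps directly onto Hadoop's streaming interface, where the mapper and reducer are plain programs connected by sorted lines of text. A minimal sketch in Python, using the running average-age-per-city example; the `city,age` input format is an assumption for illustration:

```python
import io

def mapper(lines):
    """MAP: turn each 'city,age' line into a tab-separated key-value line."""
    for line in lines:
        city, age = line.strip().split(",")
        yield f"{city}\t{age}"

def reducer(sorted_lines):
    """REDUCE: average the values of each run of identical keys.

    Hadoop Streaming guarantees the reducer sees its input sorted by key,
    so a key's values always arrive contiguously.
    """
    current_key, values = None, []
    for line in sorted_lines:
        key, value = line.strip().split("\t")
        if key != current_key and current_key is not None:
            yield f"{current_key}\t{sum(values) / len(values)}"
            values = []
        current_key = key
        values.append(int(value))
    if current_key is not None:
        yield f"{current_key}\t{sum(values) / len(values)}"

# Simulate: mapper | sort | reducer  (Hadoop performs the sort/shuffle itself,
# distributed across the cluster).
raw = io.StringIO("1,20\n1,34\n3,17\n")
out = list(reducer(sorted(mapper(raw))))
print(out)  # ['1\t27.0', '3\t17.0']
```

Because the two programs only see lines of text, they can be used unchanged on a laptop pipe or on a cluster of thousands of machines, which is the linear-scalability point made above.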