Why Hadoop, Big Data and MapReduce

Post on 11-Oct-2015



TRANSCRIPT

  • Why Hadoop, Big Data and MapReduce? Answer: because they work together really well. Each technology deals with the intrinsic problems encountered when extracting value from the data set.

    Example technologies: I am assuming you are familiar with SQL and relational databases, so I will only expand the NoSQL and MapReduce examples in detail. For MapReduce we look at Hadoop, and for NoSQL we look at MongoDB. MongoDB is the best NoSQL database for a beginner in Big Data development to use.

    Types of Big Data

    Type of Big Data | Description | Preferred Technology
    Structured data | Data is organised into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. | RDBMS, NoSQL
    Semi-structured data | Data is looser; while there may be a schema, it can be ignored, so it may be used only as a guide to the structure of the data. | MapReduce, NoSQL
    Unstructured data | Data that does not have any particular internal structure, for example plain text or image data. | MapReduce

    Simple Problem: compute the average age of the people in each city.

    The Relational Database Management System (RDBMS) way, with Structured Query Language (SQL).

    Database: people TABLE

    city_id person_id age
    1 1 20
    1 2 34
    3 3 17

    SELECT city_id, AVG(age) FROM people GROUP BY city_id;

    The MapReduce (Hadoop) way: MAP STEP (Scatter)

    RECORD KEY VALUE

    city=3,age=5 3 5

    city=1,age=6 1 6

    city=3,age=7 3 7

    city=4,age=9 4 9

    city=4,age=9 4 9

    city=1,age=3 1 3

    SHUFFLE STEP (Sort by Key)

    KEY VALUES

    1 3,6

    3 5,7

    4 9,9

    REDUCE STEP (Gather)

    KEY AGGREGATION

    1 (3+6)/2

    3 (5+7)/2

    4 (9+9)/2
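    The worked example above can be sketched end to end in Python: sqlite3 stands in for the RDBMS way, and a plain scatter/shuffle/gather over in-memory pairs stands in for the MapReduce way. The table contents and records are the ones shown above; everything else (variable names, the in-memory database) is illustrative.

```python
import sqlite3
from collections import defaultdict

# --- The RDBMS way: the people table and GROUP BY query from above ---
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (city_id INTEGER, person_id INTEGER, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?)",
                 [(1, 1, 20), (1, 2, 34), (3, 3, 17)])
sql_result = conn.execute(
    "SELECT city_id, AVG(age) FROM people GROUP BY city_id ORDER BY city_id"
).fetchall()
print(sql_result)  # [(1, 27.0), (3, 17.0)]

# --- The MapReduce way: the records from the MAP STEP table ---
records = [(3, 5), (1, 6), (3, 7), (4, 9), (4, 9), (1, 3)]  # (city, age)

# MAP (scatter): emit (key, value) pairs; here each record already is one.
pairs = [(city, age) for city, age in records]

# SHUFFLE (sort by key): group every value under its key.
grouped = defaultdict(list)
for key, value in sorted(pairs):
    grouped[key].append(value)

# REDUCE (gather): aggregate each key's values independently.
averages = {key: sum(values) / len(values) for key, values in grouped.items()}
print(averages)  # {1: 4.5, 3: 6.0, 4: 9.0}
```

    Note that each reduce call touches only one key's values, which is what lets real MapReduce run the reducers in parallel on separate machines.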

  • The NoSQL (MongoDB) way: the MongoDB aggregation framework.

    Typical dataset object: { "_id" : "35004", "city" : "Acmar", "pop" : 6055, "loc" : [-86.51557, 33.584132] }

    Aggregation function (Ruby driver syntax):

    puts coll.aggregate([
      {"$group" => {_id: {city: "$city"}, pop: {"$sum" => "$pop"}}},
      {"$group" => {_id: "$_id.city", avg_city_pop: {"$avg" => "$pop"}}},
      {"$sort" => {avg_city_pop: -1}},
      {"$limit" => 3}
    ])

    MapReduce Summary
    - MapReduce is a parallel algorithm that follows the Scatter/Gather design pattern. All map and reduce calls are done in parallel, independently.
    - MapReduce is a good fit for processes that need to analyse the whole dataset in a batch operation, especially for ad hoc analysis.
    - MapReduce interprets the data at processing time. The input keys and values for MapReduce are not an intrinsic property of the data; they are chosen by the person analyzing the data.
    - MapReduce is a linearly scalable programming model. A developer's task is to write map functions and reduce functions, piped together through a shuffle sort.
    - MapReduce defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster they are operating on, so they can be used unchanged for a small dataset and for a massive one.

    RDBMS Summary
    - An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times on a relatively small amount of data.
    - MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.
    - Relational data is often normalized to retain its integrity and remove redundancy.

    NoSQL Summary
    - 32-bit MongoDB only handles 2 GB of data, and there is a 12-node limit to the replica-set strategy, with disk-space consumption issues.
    - In MongoDB, a database is composed of collections (roughly the equivalent of tables for those used to SQL). A collection contains documents, which are JSON objects.
    - Queries are performed on a single collection. There is no equivalent to the SQL JOIN, hence the need to denormalise when designing schemas.
    - MongoDB doesn't enforce a model or schema.

  • - Mongo allows optional fields where SQL doesn't.
    - As of Mongo 2.2 there is a per-database lock, which means that operations on independent objects within the same collection can't be done concurrently.
    - MongoDB operations are only atomic at the document level. These locks (which in traditional databases are used to guard index access) are only held for as long as a single document/object takes to update in memory.

    Summary of technology attributes

    Attribute | Traditional RDBMS | NoSQL | MapReduce

    Data size | Gigabytes | Gigabytes | Petabytes
    Access | Interactive and batch | Interactive and batch | Batch
    Updates | Read and write many times | Read and write many times | Write once, read many times
    Structure | Static schema | Object types | Dynamic schema
    Integrity | High | High | Low
    Scaling | Nonlinear | Nonlinear | Linear
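    To see concretely what the aggregation pipeline in the MongoDB example computes, here is a rough pure-Python simulation of its $group/$sum, $sort and $limit stages. The documents are invented stand-ins for the dataset object shown earlier; because the pipeline groups by city alone, the second $avg stage averages a single summed value per city and leaves the totals unchanged, so it is folded into the first grouping here.

```python
from collections import defaultdict

# Invented documents shaped like the dataset object shown earlier.
docs = [
    {"_id": "35004", "city": "Acmar", "pop": 6055},
    {"_id": "35005", "city": "Acmar", "pop": 1945},
    {"_id": "35006", "city": "Adger", "pop": 3205},
]

# $group by city with $sum: total population per city.
pop_by_city = defaultdict(int)
for doc in docs:
    pop_by_city[doc["city"]] += doc["pop"]

# $sort (descending) and $limit 3 on the grouped totals.
top = sorted(pop_by_city.items(), key=lambda kv: kv[1], reverse=True)[:3]
print(top)  # [('Acmar', 8000), ('Adger', 3205)]
```

    Unlike the SQL GROUP BY earlier, the real pipeline runs stage by stage inside MongoDB, each stage feeding its output documents to the next.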

    Apache Hadoop Ecosystem

    Common: A set of operations and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
    Avro: A serialization system for efficient, cross-language persistent data storage.
    MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
    HDFS: A distributed filesystem that runs on large clusters of commodity machines.
    Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
    Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides batch-style computations and ETL via HQL.
    HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries.
    Sqoop: A tool for efficiently moving data between an RDBMS and HDFS.

  • Mahout: machine learning.

    Summary
    1. An RDBMS can scale effectively when used with a clustering technology, like MySQL Cluster for example. However, it requires a highly structured dataset, and joins can be expensive.
    2. NoSQL scales effectively. For MongoDB there are some architectural limitations that could, for massive datasets, become a capacity bottleneck. Elastic schemas, however no joins, hence the need to denormalise.
    3. MapReduce scales to the largest datasets. It can optimise writes; however, reads are slow, and as the dataset is distributed you need to stream the result set for queries.
