big data and analytics - ada...

Big Data and Analytics: Hadoop Ecosystem

Dr. Abzetdin Adamov, School of Information Technology and Engineering

ADA University, http://site.ada.qu.edu.az/~aadamov

Previously Covered Topics

• Key differences between Traditional and Big Data Architecture
• Transferring Computation Power against Transferring Data
• Schema on Read vs Schema on Write
• Hadoop Core – Storage: HDFS Architecture
• Hadoop Core – Processing: MapReduce Architecture

Objectives

• Vagrant + Provisioning + VirtualBox = Repeatable Multi-VM Setup
• Hadoop 2.0 vs Hadoop 1.0
• Hadoop Ecosystem Components Classification
• Hadoop Ecosystem Components Key Features

Hadoop Ecosystem Components

Companies building on top of Hadoop

• Amazon Web Services
• Cloudera
• Hortonworks
• IBM
• Intel
• MapR Technologies
• Microsoft
• Pivotal Software
• Teradata

Powered by Apache Hadoop

• https://wiki.apache.org/hadoop/PoweredBy

• Thousands of companies and organizations run Hadoop clusters, ranging in size from a few nodes to tens of thousands of nodes (40,000 at Yahoo)

Hadoop Core = Storage + Compute

• Compute (CPU, RAM): Yet Another Resource Negotiator (YARN)
• Storage: Hadoop Distributed File System (HDFS)

Hadoop 2.0 vs Hadoop 1.0

Hadoop 1.0 Bottlenecks: HDFS / MapReduce

Hadoop 2.0 Architecture

YARN / MRv2 vs MRv1 Architecture

Hadoop 2.0 vs Hadoop 1.0 – Processing

The Hadoop Ecosystem

Hadoop

Hortonworks Hadoop Distribution

Classification of Hadoop Ecosystem Components

[Diagram: ecosystem components grouped by function and by layer]

• Administration and Server Coordination: Hue, Ambari, ZooKeeper
• Data Management: Flume, Sqoop
• Workflow Engine: Oozie
• Data Serialization: Avro
• Layers: Distributed Storage, Resource Management, Processing Framework, API, Analytics
• Components shown in the layers: HDFS, YARN, MapReduce, MapReduce v2, Tez, Hoya, Pig, HBase, Hive, Mahout

Hadoop Ecosystem Components

Data Management Frameworks

• Hadoop Distributed File System (HDFS): A Java-based, distributed file system that provides scalable, reliable, high-throughput access to application data stored across commodity servers
• Yet Another Resource Negotiator (YARN): A framework for cluster resource management and job scheduling
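
As a rough illustration of how HDFS spreads application data across commodity servers, the sketch below splits a file into fixed-size blocks and assigns each block to several nodes. The 128 MB block size and replication factor of 3 match the usual Hadoop 2 defaults, but the round-robin placement and the node names are assumptions made for this example only; real HDFS uses a rack-aware placement policy.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # assumed block size (Hadoop 2 default)
REPLICATION = 3                 # assumed replication factor (default)


def plan_blocks(file_size, nodes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return a list of (block_index, block_length, replica_nodes).

    Placement is plain round-robin, a toy stand-in for the
    rack-aware policy a real NameNode applies.
    """
    if replication > len(nodes):
        raise ValueError("not enough nodes for the requested replication")
    n_blocks = math.ceil(file_size / block_size)
    plan = []
    for i in range(n_blocks):
        # The last block may be shorter than block_size.
        length = min(block_size, file_size - i * block_size)
        replicas = [nodes[(i + r) % len(nodes)] for r in range(replication)]
        plan.append((i, length, replicas))
    return plan


# Hypothetical 300 MB file on a four-node cluster.
MB = 1024 * 1024
plan = plan_blocks(300 * MB, ["dn1", "dn2", "dn3", "dn4"])
for index, length, replicas in plan:
    print(index, length // MB, "MB", replicas)
```

Losing any single node in this layout still leaves two copies of every block, which is the property that lets HDFS run reliably on commodity hardware.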

Operations Frameworks

• Ambari: A Web-based framework for provisioning, managing, and monitoring Hadoop clusters
• ZooKeeper: A high-performance coordination service for distributed applications
• Cloudbreak: A tool for provisioning and managing Hadoop clusters in the cloud
• Oozie: A server-based workflow engine used to execute Hadoop jobs

Ambari Web UI (REST)

Data Access Frameworks

• Pig: A high-level platform for extracting, transforming, or analyzing large datasets
• Hive: A data warehouse infrastructure that supports ad hoc SQL queries
• HCatalog: A table information, schema, and metadata management layer supporting Hive, Pig, MapReduce, and Tez processing
• Cascading: An application development framework for building data applications, abstracting the details of complex MapReduce programming
• HBase: A scalable, distributed NoSQL database that supports structured data storage for large tables
• Phoenix: A client-side SQL layer over HBase that provides low-latency access to HBase data
• Accumulo: A low-latency, large table data storage and retrieval system with cell-level security
• Storm: A distributed computation system for processing continuous streams of real-time data
• Solr: A distributed search platform capable of indexing petabytes of data
• Spark: A fast, general-purpose processing engine used to build and run sophisticated SQL, streaming, machine learning, or graph-processing applications

Governance and Integration Frameworks

• Falcon: A data governance tool providing workflow orchestration, data lifecycle management, and data replication services
• WebHDFS: A REST API that uses the standard HTTP verbs to access, operate, and manage HDFS
• HDFS NFS Gateway: A gateway that enables access to HDFS as an NFS-mounted file system
• Flume: A distributed, reliable, and highly available service that efficiently collects, aggregates, and moves streaming data
• Sqoop: A set of tools for importing and exporting data between Hadoop and RDBMS systems
• Kafka: A fast, scalable, durable, and fault-tolerant publish-subscribe messaging system
• Atlas: A scalable and extensible set of core governance services enabling enterprises to meet compliance and data integration requirements
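
To make the WebHDFS entry above more concrete, here is a small sketch that builds the HTTP verb and URL for a few common WebHDFS operations without sending anything over the network. The host name is hypothetical, and port 50070 is assumed as the NameNode HTTP port typical of Hadoop 2 clusters; a given cluster may be configured differently, and secured clusters need additional authentication parameters that are left out here.

```python
from urllib.parse import quote, urlencode

# HTTP verb used by a few WebHDFS operations.
HTTP_VERB = {
    "OPEN": "GET",
    "LISTSTATUS": "GET",
    "GETFILESTATUS": "GET",
    "MKDIRS": "PUT",
    "CREATE": "PUT",
    "DELETE": "DELETE",
}


def webhdfs_request(host, path, op, port=50070, **params):
    """Return (http_verb, url) for a WebHDFS call; nothing is sent."""
    op = op.upper()
    if op not in HTTP_VERB:
        raise ValueError(f"operation not covered by this sketch: {op}")
    if not path.startswith("/"):
        raise ValueError("HDFS path must be absolute")
    query = urlencode({"op": op, **params})
    url = f"http://{host}:{port}/webhdfs/v1{quote(path)}?{query}"
    return HTTP_VERB[op], url


# Hypothetical NameNode host, used for illustration only.
print(webhdfs_request("namenode.example.com", "/user/demo", "LISTSTATUS"))
print(webhdfs_request("namenode.example.com", "/user/demo/old", "DELETE",
                      recursive="true"))
```

The returned pair can be handed to any HTTP client, which is the practical appeal of WebHDFS: no Hadoop libraries are needed on the client side.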

Security Frameworks

• HDFS: A storage management service providing file and directory permissions, even more granular file and directory access control lists, and transparent data encryption
• YARN: A resource management service with access control lists controlling access to compute resources and YARN administrative functions
• Hive: A data warehouse infrastructure service providing granular access controls to table columns and rows
• Falcon: A data governance tool providing access control lists that limit who may submit Hadoop jobs
• Knox: A gateway providing perimeter security to a Hadoop cluster
• Ranger: A centralized security framework offering fine-grained policy controls for HDFS, Hive, HBase, Knox, Storm, Kafka, and Solr

Ecosystem Component Versions

Hadoop Ecosystem Components' Key Features

HADOOP ECOSYSTEM COMPONENTS

It's important to understand the components of the Hadoop Ecosystem in order to build the right solution for a given business problem.

Classification of the Hadoop Ecosystem Components

Hadoop is a straightforward answer for processing Big Data.

The Hadoop Ecosystem combines technologies that offer a distinct advantage in solving data-oriented business problems.

CORE HADOOP

• Hadoop Distributed File System (HDFS). Stands for: managing big data sets with high Volume, Velocity and Variety.
• MapReduce. Stands for: processing high-volume distributed data.
• Yet Another Resource Negotiator (YARN). Stands for: resource management, job scheduling and monitoring.
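
The MapReduce model named above can be sketched in a few lines of plain Python. This is a single-process toy that only mirrors the map, shuffle, and reduce phases; it does not use Hadoop, and the function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in sorted(grouped.items()))


print(word_count(["big data", "big clusters"]))
```

In a real cluster the map calls run in parallel on the nodes that hold the input blocks, and the shuffle moves data across the network so that each reducer sees all values for its keys.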

DATA ACCESS

• Apache Pig. Stands for: high-level language built on top of MapReduce for analyzing large data sets and for data flow.
• Apache Hive. Stands for: high-level query language and data warehouse infrastructure built on top of Hadoop for providing data summarization, query and analysis.

DATA STORAGE

• Apache HBase. Stands for: NoSQL database built for hosting large tables with billions of rows and millions of columns on top of Hadoop.
• Cassandra. Stands for: NoSQL database based on a key-value model, designed for linear scalability and high availability.
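
A table with billions of rows and millions of columns is only practical because most cells are empty and never stored. The sketch below is an in-memory toy of that idea in the style of HBase: rows are addressed by a row key, columns are named "family:qualifier", and each cell keeps timestamped versions. The class and method names are invented for this example and are not the HBase client API.

```python
from collections import defaultdict


class ToyTable:
    """Sparse, versioned table: row key -> column -> [(timestamp, value)]."""

    def __init__(self, families):
        self.families = set(families)
        self.rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, value, ts):
        family = column.split(":", 1)[0]
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cell = self.rows[row][column]
        cell.append((ts, value))
        cell.sort(key=lambda version: version[0], reverse=True)  # newest first

    def get(self, row, column):
        """Return the newest value of a cell, or None if it was never written."""
        versions = self.rows.get(row, {}).get(column)
        return versions[0][1] if versions else None

    def scan(self, start, stop):
        """Yield rows with start <= row key < stop, in sorted key order."""
        for row in sorted(self.rows):
            if start <= row < stop:
                yield row, {col: vers[0][1] for col, vers in self.rows[row].items()}


table = ToyTable(["info"])
table.put("user#1", "info:name", "Ann", ts=1)
table.put("user#1", "info:name", "Anna", ts=2)   # newer version of the same cell
table.put("user#2", "info:city", "Baku", ts=1)   # different columns per row
print(table.get("user#1", "info:name"))
print(list(table.scan("user#1", "user#2")))
```

Each row stores only the columns that were actually written, so two rows in the same table can have entirely different column sets.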

INTERACTION-VISUALIZATION-DEVELOPMENT

• HCatalog. Stands for: providing integration of Hive metadata for other Hadoop applications like Pig, MapReduce and others.
• Lucene. Stands for: high-performance, full-featured text search engine library written entirely in Java.
• Hama. Stands for: distributed framework based on Bulk Synchronous Parallel (BSP) computing for massive scientific computations like matrix, graph and network algorithms.
• Crunch. Stands for: writing, testing and running MapReduce pipelines.
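
The Bulk Synchronous Parallel model mentioned for Hama proceeds in supersteps: each peer computes locally, sends messages, and then waits at a barrier until all messages are delivered. The sketch below simulates this in one process for a simple task, finding the global maximum on a ring of peers. The ring topology and the function name are assumptions of this example; it does not use the Hama API.

```python
def bsp_max(values):
    """Propagate the maximum around a ring of peers, one hop per superstep.

    Returns (final_state, supersteps), where supersteps includes the
    final superstep in which no peer changed its value.
    """
    n = len(values)
    state = list(values)
    supersteps = 0
    while True:
        # Communication: every peer sends its current value to its right neighbor.
        inbox = [[] for _ in range(n)]
        for peer in range(n):
            inbox[(peer + 1) % n].append(state[peer])
        # Barrier: all messages are delivered before any peer continues.
        # Local computation: keep the largest value seen so far.
        new_state = [max([state[p]] + inbox[p]) for p in range(n)]
        supersteps += 1
        if new_state == state:  # no peer changed, so the computation has converged
            return state, supersteps
        state = new_state


print(bsp_max([3, 9, 2, 5]))
```

The barrier is what distinguishes BSP from free-running message passing: no peer starts superstep k+1 until every peer has finished superstep k.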

DATA INTELLIGENCE

• Apache Drill. Stands for: low-latency SQL query engine for Hadoop and NoSQL.
• Apache Mahout. Stands for: scalable machine learning library designed for building predictive analytics on Big Data. Mahout now has implementations on Apache Spark for faster in-memory computing.

DATA INTEGRATION

• Apache Sqoop. Stands for: importing and exporting data between Hadoop and relational database systems.
• Apache Flume. Stands for: distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
• Apache Chukwa. Stands for: scalable log collector used for monitoring large distributed file systems.

MANAGEMENT, MONITORING and ORCHESTRATION

• Apache Ambari. Stands for: simplifying Hadoop management by providing an interface for provisioning, managing and monitoring Apache Hadoop clusters.
• Apache ZooKeeper. Stands for: maintaining configuration information, naming, providing distributed synchronization, and providing group services.
• Apache Oozie. Stands for: scheduling workflows to manage Apache Hadoop jobs.
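
A workflow of Hadoop jobs is essentially a dependency graph: an action may start only after the actions it depends on have finished. The sketch below shows that idea with Python's standard `graphlib` module (Python 3.9 or later). The action names are hypothetical, and real Oozie workflows are defined in XML rather than in Python, so this illustrates only the ordering logic.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on.
workflow = {
    "sqoop-import": set(),
    "flume-ingest": set(),
    "pig-clean": {"sqoop-import", "flume-ingest"},
    "hive-aggregate": {"pig-clean"},
    "email-report": {"hive-aggregate"},
}


def execution_order(actions):
    """Return one valid run order; raises graphlib.CycleError on a cycle."""
    return list(TopologicalSorter(actions).static_order())


order = execution_order(workflow)
print(order)
```

The two ingest actions have no dependency on each other, so a workflow engine is free to run them in parallel before the cleaning step starts.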

Where Can We Use Machine Learning (Data Science)

Healthcare
• Predict diagnosis
• Prioritize screenings
• Reduce re-admittance rates

Financial services
• Fraud detection/prevention
• Predict underwriting risk
• New account risk screens

Public Sector
• Analyze public sentiment
• Optimize resource allocation
• Law enforcement & security

Retail
• Product recommendation
• Inventory management
• Price optimization

Telco/mobile
• Predict customer churn
• Predict equipment failure
• Customer behavior analysis

Oil & Gas
• Predictive maintenance
• Seismic data management
• Predict well production levels

YARN as a Data Operating System

Applications Run Natively IN Hadoop

• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• EXISTING (Slider)
• SEARCH (Solr)

• YARN (Cluster Resource Management)
• HDFS2 (Redundant, Reliable Storage)

Applications now run “in” Hadoop, instead of “on” Hadoop.

Modern Data Applications Approach to Insights

Traditional Analytics: Structured & Repeatable
• Structure built to store data
• Start with hypothesis, test against selected data
• Analyze after landing…

Next Generation Analytics: Iterative & Exploratory
• Data is the structure
• Data leads the way: explore all data, identify correlations
• Analyze in motion…

Q&A?

Abzetdin Adamov, Assoc. Prof.
Email me at: [email protected]:@
Link to me at: www.linkedin.com/in/adamov
Visit my blog at: aadamov.wordpress.com
