Processing Big Data with Pentaho
Rakesh Saha, Pentaho Senior Product Manager, Hitachi Vantara

Post on 21-May-2020

Page 1: Processing Big Data with Pentaho - Presentation… · SUMMARY: Visual Future-Proof Big Data Processing with Pentaho Visually build stream data processing pipelines for different

Processing Big Data with Pentaho
Rakesh Saha
Pentaho Senior Product Manager, Hitachi Vantara

Agenda

• Process big data visually in a future-proof way – Demo
• Combine stream data processing with batch – Demo

Pentaho’s Latest and Upcoming Features for Processing Big Data – Batch or Real-time

Big Data Processing is HARD

"Through 2018, 70% of Hadoop deployments will not meet cost savings and revenue generation objectives due to skills and integration challenges." – GARTNER (1)

1. New Skills Necessary
2. High Effort and Risk
3. Continuous Change

(1) Gartner Analyst, Nick Heudecker; infoworld.com, Sept 2015

Big Data Integration and Analytics Workflow with Pentaho

Big Data Challenges
• Processing semi/un/structured data
• Blending big data with traditional data
• Maintaining security, governance of data
• Processing streaming data in real time and historically
• Enabling and operationalizing data science

[Diagram labels: Data Lake; Analytic Database; Pentaho Analyzer; Sensor; Big or small data; Pentaho Data Integration; Pentaho Reporting; MSG Queue (Kafka, JMS, MQTT); Machine Learning (R, Python); Stream; Feedback Loop; LOB Applications; Embedded; Pentaho Data Integration]

Process Big Data Visually in a Future-Proof Way

Visual Big Data Processing with Pentaho

• What: Visually ingest and process Big Data at enterprise scale
• What's Special: Visually develop once and execute on any engine with the Adaptive Execution Layer (AEL)
• Why
  – Difficult to find qualified developers
  – Difficult to keep up with new technologies
• Available since Pentaho 7.1

Adaptive Execution of Big Data

Build Once, Execute on Any Engine

Challenge: With rapidly changing big data technology, coding on various engines can be time-consuming or impossible with existing resources

Solution: Future-proof data integration and analytics development in a drag-and-drop visual development environment, eliminating the need for specialized coding and API knowledge. Seamlessly switch between execution engines to fit data volume and transformation complexity

[Diagram labels: PDI; Pentaho Kettle]
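
The "build once, execute on any engine" idea can be pictured with a small sketch. This is not how AEL is implemented; it only illustrates, under simple assumptions, a pipeline defined once as a list of steps and handed to interchangeable engines. The step functions, the engine names "local" and "parallel", and the sample rows are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List

Step = Callable[[dict], dict]


def add_total(row: dict) -> dict:
    # Hypothetical step: derive an order total from price and quantity.
    return {**row, "total": row["price"] * row["qty"]}


def tag_large(row: dict) -> dict:
    # Hypothetical step: flag orders at or above an arbitrary threshold.
    return {**row, "large": row["total"] >= 100}


# The pipeline is defined once, independently of any engine.
PIPELINE: List[Step] = [add_total, tag_large]


def apply_steps(row: dict, steps: List[Step]) -> dict:
    for step in steps:
        row = step(row)
    return row


def run_local(rows: List[dict], steps: List[Step]) -> List[dict]:
    # Engine 1: process rows one by one in the calling thread.
    return [apply_steps(row, steps) for row in rows]


def run_parallel(rows: List[dict], steps: List[Step], workers: int = 4) -> List[dict]:
    # Engine 2: same steps, spread over a thread pool (map keeps input order).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda row: apply_steps(row, steps), rows))


ENGINES: Dict[str, Callable[[List[dict], List[Step]], List[dict]]] = {
    "local": run_local,
    "parallel": run_parallel,
}


def execute(rows: Iterable[dict], steps: List[Step], engine: str = "local") -> List[dict]:
    # Switching engines changes only this argument, not the pipeline.
    return ENGINES[engine](list(rows), steps)


rows = [{"price": 20, "qty": 2}, {"price": 60, "qty": 2}]
local_result = execute(rows, PIPELINE, engine="local")
parallel_result = execute(rows, PIPELINE, engine="parallel")
```

Both calls run the same step list; only the `engine` argument differs, which is the property the slide describes.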

Adaptive Execution for Spark

Process Big Data Faster on Spark Without Any Coding

Challenge: Finding the talent and time to work with Spark and newer big data technologies

Solution: More easily develop big data applications in PDI using adaptive execution to ingest, process and blend data from a range of big data sources and scale on Spark clusters

[Diagram labels: PDI; Pentaho Kettle]

Upcoming Enhanced Adaptive Execution Layer

• Simplified Setup
  – Fewer steps to set up
  – Easy to configure fail-over, load-balancing
• Development productivity
  – Robust transformation error and status reporting
  – Customization of Spark jobs
• Robust Enterprise Security
  – Client to AEL connection can be secured
  – End-2-end Kerberos impersonation from client tool to cluster

[Diagram labels: PDI Client; Spark/Hadoop Processing Nodes; HADOOP CLUSTER; AEL-Spark Engine (Spark Driver); AEL-Spark Daemon (Edge Nodes); Hadoop/Spark Compatible Storage Cluster; HDFS; Azure Storage; Amazon S3; Etc…; Spark Executors]

Upcoming Big Data File Format Handling

Big Data platforms introduced various data formats to improve performance, compression, and interoperability

What:
• Visual handling of data files with Big Data formats Parquet and Avro
  – Reading and writing files with specific steps
  – Natively execute in Spark via AEL

Why:
• Ease of development of Big Data processing
• Performance improvement due to avoidance of intermediate formats

Demonstration

Retail Web Log Data Processing with Pentaho

• Run within Spoon via Pentaho during development, then use a Spark cluster for production
• Lookups, sort, Parquet file in/out and other steps, to test parallel and serial processing within the Spark cluster

Combine Stream Processing with Batch Processing

What is Stream Data Processing? And Why?

• Batch data processing is useful, but sometimes businesses need to obtain crucial insights faster and act on them
• Many use cases must consider data 2+ times: on the wire, and then subsequently as historical data
• Get crucial time-sensitive insights
  – React to customer interactions on a website or mobile app
  – Predict risk of equipment breakdown before it happens

Former POV "secure data in DW, then OLAP ASAP afterward" gives way to
Current POV "analyze on the wire, write behind"
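
A minimal sketch of "analyze on the wire, write behind", assuming a stream of sensor readings: each event is checked as it arrives, and persistence to a historical store happens afterward in buffered batches. The `WriteBehindStore` class, the event fields and the threshold are hypothetical and stand in for whatever store and rule a real pipeline would use.

```python
from typing import Dict, Iterable, List


class WriteBehindStore:
    """Hypothetical historical store: buffers events and persists them in batches."""

    def __init__(self, flush_size: int = 3) -> None:
        self.flush_size = flush_size
        self.buffer: List[Dict] = []
        self.persisted: List[Dict] = []  # stands in for a data warehouse table

    def write(self, event: Dict) -> None:
        self.buffer.append(event)
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self) -> None:
        self.persisted.extend(self.buffer)
        self.buffer.clear()


def analyze_on_the_wire(events: Iterable[Dict], store: WriteBehindStore,
                        threshold: float) -> List[str]:
    alerts: List[str] = []
    for event in events:
        # Step 1: act on the event immediately, before it is stored.
        if event["temperature"] > threshold:
            alerts.append(event["sensor"])
        # Step 2: hand the event to the store, which writes behind in batches.
        store.write(event)
    store.flush()  # persist whatever is still buffered at end of stream
    return alerts


events = [
    {"sensor": "pump-1", "temperature": 71.0},
    {"sensor": "pump-2", "temperature": 96.5},
    {"sensor": "pump-1", "temperature": 73.2},
    {"sensor": "pump-3", "temperature": 101.3},
]
store = WriteBehindStore(flush_size=3)
alerts = analyze_on_the_wire(events, store, threshold=90.0)
```

The same events are considered twice, as the slide notes: once on arrival for the alert, and again later as historical data in `store.persisted`.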

NEW Stream Data Processing with Pentaho

• Visually ingest and produce data from/to Kafka using NEW steps
• Process micro-batch chunks of data using either a time-based or a message size-based window
• Switch processing engines between Spark (Streaming) or Native Kettle
• Harden stream processing libraries and steps to process data from traditional message queues
• Benefits:
  – Lower the bar to build streaming applications
  – Enable combining batch and stream data processing
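
The micro-batch windowing mentioned above can be sketched as follows. This is an illustration only, not Pentaho's implementation, and it assumes that "message size-based" means a maximum number of records per window. Messages carry their own arrival timestamp so the example is deterministic; a real engine would close time windows on a clock even when no message arrives. The function name and parameters are hypothetical.

```python
from typing import Iterable, Iterator, List, Optional, Tuple

Message = Tuple[float, str]  # (arrival time in seconds, payload)


def micro_batches(messages: Iterable[Message], max_records: int,
                  max_seconds: float) -> Iterator[List[str]]:
    """Yield batches closed by whichever limit is reached first."""
    batch: List[str] = []
    window_start: Optional[float] = None
    for arrived_at, payload in messages:
        if window_start is None:
            window_start = arrived_at
        # Time-based close: the window's duration elapsed before this message.
        if batch and arrived_at - window_start >= max_seconds:
            yield batch
            batch = []
            window_start = arrived_at
        batch.append(payload)
        # Count-based close: the window holds the maximum number of records.
        if len(batch) >= max_records:
            yield batch
            batch = []
            window_start = None
    if batch:
        yield batch  # flush the partial window at end of stream


messages = [(0.0, "a"), (0.4, "b"), (1.5, "c"), (1.6, "d"), (1.7, "e"), (1.8, "f")]
batches = list(micro_batches(messages, max_records=3, max_seconds=1.0))
```

With these inputs the first batch closes on time, the second on record count, and the last is the end-of-stream remainder.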

How to Process Stream Data in Pentaho

• Steps for Kafka ingestion and publish
  – Kafka Consumer
  – Kafka Producer
• Steps for stream processing
  – Get records from stream
• Ingest and process continuous stream of data in near real-time in parent transformation
• Process micro-batch of stream data in separate child transformation
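
The parent/child split above can be sketched in plain Python. This is an analogy rather than PDI code: an in-memory list stands in for a Kafka topic, a list's `append` stands in for a producer, and the function names, record fields and fixed batch size are all assumptions made for the example.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, List


def child_transformation(batch: List[Dict]) -> Dict[str, int]:
    """Child: process one micro-batch; here, total sales amount per store."""
    totals: Dict[str, int] = {}
    for record in batch:
        totals[record["store"]] = totals.get(record["store"], 0) + record["amount"]
    return totals


def parent_transformation(stream: Iterable[Dict], batch_size: int,
                          child: Callable[[List[Dict]], Dict[str, int]],
                          publish: Callable[[Dict[str, int]], None]) -> int:
    """Parent: keep reading the stream, hand each micro-batch to the child,
    and publish the child's result. Returns the number of batches processed."""
    iterator = iter(stream)
    batches = 0
    while True:
        batch = list(islice(iterator, batch_size))  # next micro-batch
        if not batch:
            break  # a real stream would keep waiting instead of ending
        publish(child(batch))
        batches += 1
    return batches


stream = [
    {"store": "north", "amount": 10},
    {"store": "south", "amount": 5},
    {"store": "north", "amount": 7},
    {"store": "south", "amount": 3},
    {"store": "east", "amount": 4},
]
published: List[Dict[str, int]] = []
batch_count = parent_transformation(stream, 2, child_transformation, published.append)
```

Keeping the per-batch logic in its own function mirrors the slide's separation: the parent only deals with continuous ingestion, while the child sees a bounded set of rows and can be written like ordinary batch logic.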

Combined Data Processing Using Spark & Pentaho

[Diagram stages: Ingest; Process; Publish; Reporting]

[Diagram labels: Web Clickstream and Other Logs; Traditional DB/DW and NoSQL Datastores; Traditional Message Bus; DATA SOURCES; IoT Data; Kafka Cluster; Data Collector; Data Publisher; Analytical Databases; Pentaho Analytics; HADOOP/SPARK CLUSTER; Data Store; Micro Services; RT Data Processors; Batch Data Processors; Hadoop MR; HDFS; Kafka Cluster]

• Pentaho DI: PDI collects data from sources including Kafka clusters
• Pentaho DI: PDI can process streaming data using Spark and Spark Streaming or the Kettle engine in a completely visual way
• Pentaho DI: PDI can retrieve processed or blended data from Hadoop/Spark and publish to Kafka clusters or external databases

Demonstration

Retail Store Event Processing

• Can be run within Spoon via Pentaho or within the AEL-Spark engine
• Utilizes Kafka in/out, Parquet out and other steps to demonstrate stream data ingestion, window processing and much more…

Availability and Roadmap

Availability

• Adaptive Execution Layer (AEL) and Spark-AEL available in Pentaho 7.1
  – Secure Spark integration, high-availability and security of AEL are EE only
  – Supported Hadoop distros: Pentaho 7.1 – Cloudera CDH; Pentaho 8.0 – Cloudera CDH and Hortonworks HDP
• Kafka steps and stream data processing available in Pentaho 8.0
  – Kafka from Cloudera and Hortonworks to be supported

Roadmap

• Extending AEL to support other Spark distros and other data processing engines
• Advanced stream processing with other real-time messaging protocols and windowing mechanism
• Enabling Big Data driven machine learning on batch or stream data
• Integrated with broader Hitachi Vantara portfolio

SUMMARY: Visual Future-Proof Big Data Processing with Pentaho

Visually build stream data processing pipelines for different streaming engines
• Configure stream data processing logic
• Execute logic in multiple stream processing engines without rework
• Connect to streaming data sources

NEW in Pentaho
✓ Native Streaming in PDI
✓ Spark Streaming via AEL
✓ Kafka Connectivity

Leverage the power of Adaptive Execution to future-proof data processing pipelines
• Configure logic without coding
• Switch processing engines without rework
• Handle Big Data formats more efficiently

NEW in Pentaho
✓ Adaptive Execution Layer
✓ Visual Spark via AEL
✓ Native Big Data Format Handling

Next Steps

Want to learn more?

• Meet-the-Experts:
  – Anthony DeShazor
  – Luke Nazarro
  – Carlo Russo
• Recommended Breakout Sessions:
  – Jonathan Jarvis: Understanding Parallelism with PDI and Adaptive Execution with Spark
  – Mark Burnette: Understanding the Big Data Technology Ecosystem
