ctd data science on hadoop - government...
TRANSCRIPT
![Page 1: CTD Data Science on Hadoop - Government Executivecdn.govexec.com/media/ctd_data_science_on_hadoop.pdf · 2017-05-17 · •Enable real-time use cases •Provide data governance •Provide](https://reader034.vdocuments.site/reader034/viewer/2022042219/5ec560a7f1bc091b4526c56e/html5/thumbnails/1.jpg)
© Cloudera, Inc. All rights reserved.
Data Science on Hadoop
Justin Erickson, Senior Director, Product Management
Age of Machine Learning
[Chart: cost of compute and data volume plotted over time, 1950s to 2010s, with regions labeled "Machine Learning" and "NO Machine Learning"]
The Enterprise Platform for Data Science and Machine Learning
The data is now here
30B CONNECTED DEVICES
440x MORE DATA
Cloudera first to integrate Spark
Modern Platform for Machine Learning and Advanced Analytics
Leading adoption among enterprises
500 Customers
Run Spark on
Sample data science / machine learning workflow
From data to exploration to action
[Diagram: three stages, Data Engineering, Data Science (Exploratory), and Production (Operational). Labels: Acquisition; Processing; Data Wrangling; Visualization and Analysis; Model Training & Testing; Production Data Pipelines; Batch Scoring; Online Scoring; Serving; Reports, Dashboards; Data Governance]
The good news
[Workflow diagram repeated: Data Engineering, Data Science (Exploratory), Production (Operational)]
• Data has never been more plentiful
• Open source data science and machine learning libraries are rapidly evolving
• Commodity (and on-demand) compute makes scalable production machine learning affordable
The bad news
[Workflow diagram repeated: Data Engineering, Data Science (Exploratory), Production (Operational)]
• Most data science is done at small scale, individually, and is difficult to replicate
• Very few models reach production
• Teams have different, conflicting requests for languages & libraries
• Data needs to move across multiple different systems
Additional challenges
Access
• For sensitive data, secure clusters are difficult to access. And IT typically doesn’t want random packages installed on a secure cluster.
• Popular open source tools don’t easily connect to these environments, or always support Hadoop data formats.
Scale
• Laptops rarely have capacity for medium, let alone big data. This leads to a lot of sampling.
• Popular frameworks don’t easily parallelize on a cluster. Typically code has to get rewritten for production.
Developer Experience
• Notebooks, while awesome, don’t easily support virtual environment and dependency management, especially for teams. This makes sharing and reproducibility hard.
• Notebooks are also challenging to “put into production.”
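The "Scale" point says that limited laptop capacity leads to a lot of sampling. As an illustration only, here is a minimal sketch of one common way to draw a fixed-size uniform sample from data too large to hold in memory (reservoir sampling, Algorithm R). The slides do not name a sampling method; the function name `reservoir_sample` and its parameters are hypothetical and chosen for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k items drawn uniformly at random from a stream.

    Hypothetical helper for illustration: only k items are kept in
    memory, so the stream itself may be larger than the machine's RAM.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept only if the
            # slot falls inside the reservoir, i.e. with probability k/(i+1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    # Sample 5 record ids from a stream of one million without materializing it.
    print(reservoir_sample(range(1_000_000), k=5, seed=42))
```

A sample like this fits on a laptop, but any model built on it still has to be validated, and often rewritten, against the full data set, which is the gap the slide describes.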
This year, our goal is to enable data science and machine learning at scale.
Open data science in the enterprise
IT: drive adoption while maintaining compliance
Data Scientist: explore, experiment, iterate
Our goal: An open platform for data science at scale
Help more data scientists use the power of Hadoop
Use a powerful, familiar environment with direct access to Hadoop data and compute
(Data Scientist, Data Engineer)
Make it easy and secure to add new users, use cases
Offer secure self-service analytics and a faster path to production on common, affordable infrastructure
(Enterprise Architect, Hadoop Admin)
Introducing Cloudera Data Science Workbench
Self-service data science for the enterprise
Accelerates data science from development to production with:
• Secure self-service environments for data scientists to work against Cloudera clusters
• Support for Python, R, and Scala, plus project dependency isolation for multiple library versions
• Workflow automation, version control, collaboration and sharing
Demo
With Cloudera Data Science Workbench…
Data scientists can:
• Use R, Python, or Scala from a web browser, with no desktop footprint
• Install any library or framework within isolated project environments
• Directly access data in secure clusters with Spark and Impala
• Share insights with their team for reproducible, collaborative research
• Automate and monitor data pipelines using built-in job scheduling
IT can:
• Give their data science team the freedom to work how they want, when they want
• Stay compliant with out-of-the-box support for full platform security, especially Kerberos
• Run on-premises or in the cloud, wherever data is managed
Solving Data Science is a Full-Stack Problem
• Support unlimited data: ✓ Hadoop
• Provide sufficient tools for Analysts: ✓ Impala, Hive, Hue
• Provide sufficient tools for Data Scientists + Data Engineers: ✓ Spark, Data Science Workbench
• Enable real-time use cases: ✓ Kafka, Spark Streaming
• Provide data governance: ✓ Navigator + Partners
• Provide full-stack security: ✓ Kerberos, Sentry, RecordService, KMS/KTS
• Deploy in the cloud: ✓ Cloudera Director
• Integrate with partner tools: ✓ Rich Ecosystem
• Be easy for IT to deploy/maintain: ✓ Cloudera Manager + Director
The importance of an open ecosystem
[Figure: "Open Ecosystem" contrasted with "Black Box"]
Thank you
Justin Erickson