Combine Query Language and Data Flow Language for Data Science

Post on 20-Aug-2020

TRANSCRIPT

© Hortonworks Inc. 2011–2016. All Rights Reserved

Spark SQL + Pig-Latin
Combine Query Language and Data Flow Language for Data Science

Jeff Zhang (zjffdu@apache.org), May 16, 2017

Who am I

• ASF Member, working in the ASF for almost 8 years

• Committer of Apache Tez, Pig & Zeppelin

• Works at Hortonworks

Data Science

Data Science, also known as data-driven science, is an interdisciplinary field about scientific methods, processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured.

• Describe what happens

• Explain what happens

• Predict what would happen

Data Science

[Diagram: Collect Data → Data Munging → Data Analysis → Insight → Product; stages are labeled online or offline]

Data Munging

• Collect and Transform Server Log Data
  – User Agent Normalization
  – Robot Detection
  – Sessionize

• Move data from Database to HDFS

• Collect and Transform Social Media Data
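As an illustration of the "Sessionize" step listed above, here is a minimal Python sketch. It is not taken from the talk: the 30-minute inactivity timeout, the `(user, timestamp)` event shape, and the `sessionize` function name are all assumptions chosen for the example.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout; not from the slides


def sessionize(events, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp) events into per-user sessions.

    A new session starts whenever the time since the user's previous
    event exceeds `gap` seconds. Returns {user: [[ts, ...], ...]}.
    """
    by_user = defaultdict(list)
    for user, ts in events:
        by_user[user].append(ts)

    sessions = {}
    for user, stamps in by_user.items():
        stamps.sort()
        user_sessions = [[stamps[0]]]
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev > gap:
                user_sessions.append([cur])    # gap too large: open a new session
            else:
                user_sessions[-1].append(cur)  # continue the current session
        sessions[user] = user_sessions
    return sessions


# Hypothetical log events: (user, seconds since start of log)
log = [("alice", 0), ("alice", 600), ("bob", 100), ("alice", 4000)]
result = sessionize(log)
```

In a data flow language the same logic would typically be expressed as a group-by-user step followed by a per-group function, which is why this kind of task tends to sit on the munging side rather than in a SQL query.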

Data Munging

[Figure: data before Data Munging and after Data Munging]

Data Analysis

• Combine different sources of data and apply statistics and BI tools to get insight from data
  – Web Traffic Metrics
  – User Segmentation Analysis
  – A/B Test

Data Munging vs Data Analysis

               Data Munging                   Data Analysis
Data Source    Messy,                         Clean, Normalized,
               Structured/Unstructured,       Structured,
               Unorganized                    Organized
Stability      Regular, Stable                Ad-hoc
Tools          Python, Spark, Hadoop, etc.    R, Python, SQL, etc.

Do you have to be a full-stack big data engineer to do data science?

What if you are a data analyst without much programming skill?

Data Science Infrastructure

What is Spark

Apache Spark is a fast, in-memory data processing engine with elegant and expressive development APIs to allow data workers to efficiently execute streaming, machine learning or SQL workloads.

What is Apache Pig

• Apache Pig is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.
  – Ease of programming
  – Optimization opportunities
  – Extensibility

WordCount

[Diagram: Load → ForEach → Group → ForEach → Order → Store]

Using SQL?
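To make the WordCount stages concrete, the following Python sketch mirrors the Load, ForEach, Group, ForEach, Order, and Store steps one function at a time. It is only an analogy for the data flow style, not Pig Latin and not code from the talk; the function names and the sample text are assumptions.

```python
from collections import Counter


def load(text):
    # Load: one record per input line
    return text.splitlines()


def tokenize(lines):
    # ForEach: flatten each line into individual words
    return [word for line in lines for word in line.split()]


def group_and_count(words):
    # Group + ForEach: group identical words and count each group
    return Counter(words)


def order(counts):
    # Order: descending count, then alphabetical to keep ties deterministic
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def store(rows):
    # Store: render as tab-separated lines (a real job would write to storage)
    return "\n".join(f"{word}\t{count}" for word, count in rows)


text = "a b a\nb a c"  # hypothetical input
word_counts = order(group_and_count(tokenize(load(text))))
stored = store(word_counts)
```

Each stage consumes the previous stage's output, which is the shape a data flow language expresses directly. The tokenize step is the part that is awkward in plain SQL, since it turns one input row into many.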

Pig-Latin vs SQL

                 SQL                              Pig-Latin
Language Type    Query Language                   Data Flow Language
                 • de facto standard              • lazy evaluation
                                                  • supports pipeline split
Data Source      Structured Data                  Structured/Unstructured
Integration      Integrated with most BI tools    Very few BI tools integrated with Pig-Latin

Conclusion
• Pig-Latin for Data Munging
• SQL for Data Analysis
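The two data flow properties named in the comparison above, lazy evaluation and pipeline split, can be illustrated with Python generators. This is an analogy only and says nothing about how Pig implements either feature; the `load` function, the `evaluated` list, and the sample rows are assumptions made for the sketch.

```python
import itertools

evaluated = []  # records which rows were actually read, to make laziness visible


def load(rows):
    # Generator: no row is read until a downstream consumer asks for one
    for row in rows:
        evaluated.append(row)
        yield row


rows = [1, 2, 3, 4, 5, 6]  # hypothetical input
source = load(rows)  # builds the pipeline; nothing has been read yet

# Pipeline split: one upstream source feeds two independent branches
evens_in, odds_in = itertools.tee(source)
evens = (r for r in evens_in if r % 2 == 0)
odds = (r for r in odds_in if r % 2 == 1)

read_before_store = len(evaluated)  # only a plan exists at this point

# Consuming the branches plays the role of a store step and triggers execution
even_result = list(evens)
odd_result = list(odds)
read_after_store = len(evaluated)  # the source was read once, not once per branch
```

Because nothing runs until an output is requested, an engine can look at the whole pipeline before executing it, which is where optimization opportunities come from.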

Integrate Spark into Pig

[Diagram: Pig Script → Logical Plan → Physical Plan → Execution Plan → Execution Engine]

Combine Spark SQL + Pig-Latin

[Diagram labels: Spark DataFrame Table; Spark SQL; Data Munging; Data Analysis; Spark Scala API; Spark Python API; Spark R API; Pig Latin]

Pig-Latin + Spark SQL

[Diagram labels: Spark DataFrame Table; Spark SQL; Load; Store; Data Munging; Data Analysis]

Spark Table (bank)

[Screenshots: Pig Latin; SQL]
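The pattern on this slide, munging data in a data flow step, storing it as a shared table, then querying that table with SQL, can be sketched with Python's built-in `sqlite3` module. Everything here is a stand-in: an in-memory SQLite table plays the role of the Spark table `bank`, a Python generator plays the role of the Pig Latin script, and the columns and records are hypothetical because the demo data is not in the transcript.

```python
import sqlite3

# Hypothetical raw records (age;job;balance); not the data used in the talk
raw_lines = [
    "30;admin;1200",
    "unknown;admin;50",  # malformed age, dropped during munging
    "45;technician;300",
    "30;services;-100",
]


def munge(lines):
    """Data flow step: parse each line, drop bad rows, and type the fields."""
    for line in lines:
        age, job, balance = line.split(";")
        if not age.isdigit():
            continue
        yield int(age), job, int(balance)


# Store step: publish the cleaned rows as a table both sides can see
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bank (age INTEGER, job TEXT, balance INTEGER)")
conn.executemany("INSERT INTO bank VALUES (?, ?, ?)", munge(raw_lines))

# Query step: ad-hoc SQL analysis over the shared table
avg_by_age = conn.execute(
    "SELECT age, AVG(balance) FROM bank GROUP BY age ORDER BY age"
).fetchall()
row_count = conn.execute("SELECT COUNT(*) FROM bank").fetchone()[0]
```

The table is the handoff point: the munging side only needs to know how to write it, and the analysis side only needs to know SQL.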

Where to run Pig-Latin & Spark SQL (Zeppelin)

Apache Zeppelin is a web-based notebook that enables interactive data analytics. You can make beautiful data-driven, interactive and collaborative documents with SQL, Scala and more.

Pig-Latin + Spark SQL in Zeppelin

[Diagram: three JVMs. Zeppelin Server; Pig Interpreter Group (Pig-Latin, Spark SQL); Spark Interpreter Group (Scala, Python, R)]

Data Science Infrastructure (Recap)

Demo

Conclusion

• Leverage the power of both Query Language and Data Flow Language

• Use Spark as Unified Execution Engine

• Share Data between Data Munging & Data Analysis

• Use Zeppelin as Unified Data Science Platform

Summary

• Data Munging & Data Analysis

• Use Pig-Latin for Data Munging, use SQL for Data Analysis

• Run under Spark Engine

• Use Zeppelin as unified Data Science Platform

Current Status & What’s Next

• Status
  – PIG-5080 (Support store alias as Spark table)
  – ZEPPELIN-2232 (Support Spark SQL for Pig Interpreter)

• Next
  – Integrate Spark MLlib in Pig
  – Use DataFrame API instead of RDD API to integrate Spark with Pig
  – Support integrating Pig with other Spark APIs, like R and Python

Q&A

Thank You
