data science methodology - ibm · data science the interest in data science • solve problems and...

34
The Data Science Process Polong Lin Big Data University Leader & Data Scientist IBM [email protected]

Upload: others

Post on 07-Feb-2020

12 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

The Data Science Process

Polong LinBig Data University Leader & Data [email protected]

Page 2: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

2

“Every day, we create 2.5 quintillion bytes of data —so much that 90% of the data in the world today has been created in the last two years alone.”

Page 3: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

Datascience

Theinterestindatascience

• Solveproblemsandanswerquestionsusingdata

• Goaltoimprovefutureoutcomes

3

Whatisthedatascienceprocess?

Page 4: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

CRISP-DMMethodologydiagram

4

Business Understanding

Data Understanding

Data Preparation

Analytic Approach

Data Requirements

Data Collection

Modeling

Evaluation

Deployment

Feedback

CrossIndustryStandardProcessforDataMining

Page 5: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

Everyprojectbeginswithbusinessunderstanding.

• Projectobjective?• Businesssponsorsplaythemostcriticalrole• Whatarewetryingtodo– whatisthegoal?• Howdoyoudefine“success”andhowcanyoumeasureit?

5

1.Businessunderstanding

Business Understanding

Page 6: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

6

1.Businessunderstanding

Business Understanding

Traffic:

Problem:Trafficcongestionwastestimeandmoney

Clearquestion:Howcanweoptimizetrafficlightdurationusingdataontrafficpatterns,weather,andpedestriantraffic?

Measurableoutcomes:

- %decreaseincommutetime

- %decreaseinlength/durationoftrafficjams

Page 7: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

2.AnalyticApproach

• Expressproblemincontextofstatisticalandmachinelearningtechniques• Regression:

• “Predictingrevenueinthenextquarter?”• Classification:

• “DoesthispatienthavecancerA,cancerB,oraretheyhealthy?”• Clustering:

• “Aretheregroupsofusersthatseemtobehavesimilarlytoeachother?”• Recommendation/Personalization:

• “HowcanItargetdiscountstospecificcustomers?”• OutlierDetection

7

Business Understanding

Analytic Approach

Page 8: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

8

• Linearregression• Logisticregression• Clustering

• K-means• Hierarchical• Density-based

• ClassificationTrees• RandomForests• Neuralnetworks

• Textmining(naturallanguageprocessing)

• Principalcomponentanalysis

• SupportVectorMachines• HiddenMarkovModels• …

Statistical/machinelearningtechnique(s)

Page 9: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

Data Understanding

Data Requirements

Data Collection

Datacompilation

• Thechosenanalyticapproachdeterminesthedatarequirements.• Content,formats,representations

• Initialdatacollection isperformed.• AvailableData?• Obtaindata?• Revisedatarequirementsorcollectmoredata?

• Thendataunderstanding isgained.• Initialinsightsaboutdata• Descriptivestatisticsandvisualization• Additionaldatacollectiontofillgaps,ifneeded

9

Page 10: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

#1Whatcanyoutellmeaboutthisdata?

10

Page 11: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

#2Whatcanyoutellmeaboutthisdata?

11

Page 12: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

#3Whatcanyoutellmeaboutthisdata?

12

Page 13: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

#4Whatcanyoutellmeaboutthisdata?

13

Page 14: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

ImportanceofVisualization

14

Sameproperties:mean(x)=9mean(y)=7.5y=3.00+0.500xcorr(x,y)=0.816

Anscombe's Quartet

Page 15: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

CRISP-DMMethodologydiagram

15

Business Understanding

Data Understanding

Data Preparation

Analytic Approach

Data Requirements

Data Collection

Modeling

Evaluation

Deployment

Feedback

Page 16: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

Datapreparation

• Datapreparation encompassesallactivitiestoconstructandcleanthedataset.• Datacleaning

• Missingorinvalidvalues

• Eliminatingduplicaterows

• Formattingproperly

• Combiningmultipledatasources• Transformingdata• Featureengineering• Textanalysis

• Acceleratedatapreparationbyautomatingcommonsteps

16

Data Preparation

• Arguablythemosttime-consumingstep• “80%oftheentireDSprocessisin

datacleaningandpreparation”

Page 17: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

Modeling

Modeling

• Modeling:• Developingpredictiveordescriptivemodels• Maytryusingmultiplealgorithms

• Highlyiterativeprocess

17

Data Preparation

Evaluation

Page 18: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

K-meansClustering

Groupsimilarcuisinestogetherintok numberofclusters.

Example:Clustering

Page 19: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

Groupsimilarcuisinestogetherintok numberofclusters.

k =3

K-meansClustering

Example:Clustering

Page 20: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

Example:Clustering

20

[Age:18,Sex:M,BMI:23,Exercise:Frequent,Hobbies:Golf,…]

[Age:45,Sex:F,BMI:28,Exercise:Frequent,Hobbies:Baseball,…]

[Age:83,Sex:F,BMI:25,Exercise:Sedentary,Hobbies:Gymnastics,…]

[Age:28,Sex:M,BMI:23,Exercise:Normal,Hobbies:Softball,…]

[Age:30,Sex:F,BMI:25,Exercise:Normal,Hobbies:Golf,…]

[Age:15,Sex:M,BMI:22,Exercise:Frequent,Hobbies:Golf,…]

Model

CLUSTERA

CLUSTERB

CLUSTERC

CLUSTERB

CLUSTERA

CLUSTERA

Page 21: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

Example:Classification

21

[Age:32,Sex:M,BMI:23,Exercise:Frequent,… ,Condition:Disorder1]

[Age:45,Sex:F,BMI:28,Exercise:Frequent,… ,Condition:Healthy]

[Age:63,Sex:F,BMI:21,Exercise:Sedentary,… ,Condition:Disorder2]

Model

[Age:48,Sex:M,BMI:23,Exercise:Sedentary,… ,Condition:________]

Disorder1

Page 22: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

Modelevaluation

• Modelevaluation isperformedduringmodeldevelopmentandbeforemodeldeployment.• Understandthemodel’squality• Ensurethatitproperlyaddressesthebusinessproblem

• Diagnosticmeasures• Suitabletothemodelingtechniqueused• Training/Testingset• Refinemodelasneeded

• Statisticalsignificancetests

22

Evaluation

Modeling

Page 23: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

Deploymentandfeedback• Oncefinalized,themodelisdeployed intoaproductionenvironment.

• Maystartinalimited/testenvironment• Involvesotherroles:

• Solutionowner

• Marketing

• Applicationdevelopers

• ITadministration

• Getting Feedback:• Howwelldidthemodelperform?• Iterativeprocessformodelrefinementandredeployment• A/Btesting

23

Deployment

Feedback

BigDataUniversity:• Inactive->Active

Page 24: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

CRISP-DMMethodologydiagram

24

Business Understanding

Data Understanding

Data Preparation

Analytic Approach

Data Requirements

Data Collection

Modeling

Evaluation

Deployment

Feedback

Page 25: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

25

“All models are wrong but some are useful” – George Box, Statistician

Page 26: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

26

Variable#503

Varia

ble#5

03

Variable#503

Page 27: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

27

Variable#503

Varia

ble#5

03

Variable#503

Page 28: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

28

Variable#503

Varia

ble#5

03

Variable#503

Page 29: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

29

Page 30: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

LearningMoreAboutDataScienceWherecanyoulearnmoreaboutdatascience?

30

Page 31: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

31

Page 32: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

32

Page 33: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

33

Page 34: Data Science Methodology - IBM · Data science The interest in data science • Solve problems and answer questions using data • Goal to improve future outcomes 3 What is the data

34

BigDataUniversity.comFreecourses!• DataScience• BigData• DataEngineering

Earnbadges!

Learnanytime!

Foryourorganizations• Wecancreatededicatedportals

foryouremployeestogainskillsindatascience