data science methodology - ibm · data science the interest in data science • solve problems and...
TRANSCRIPT
The Data Science Process
Polong LinBig Data University Leader & Data [email protected]
2
“Every day, we create 2.5 quintillion bytes of data —so much that 90% of the data in the world today has been created in the last two years alone.”
Datascience
Theinterestindatascience
• Solveproblemsandanswerquestionsusingdata
• Goaltoimprovefutureoutcomes
3
Whatisthedatascienceprocess?
CRISP-DMMethodologydiagram
4
Business Understanding
Data Understanding
Data Preparation
Analytic Approach
Data Requirements
Data Collection
Modeling
Evaluation
Deployment
Feedback
CrossIndustryStandardProcessforDataMining
Everyprojectbeginswithbusinessunderstanding.
• Projectobjective?• Businesssponsorsplaythemostcriticalrole• Whatarewetryingtodo– whatisthegoal?• Howdoyoudefine“success”andhowcanyoumeasureit?
5
1.Businessunderstanding
Business Understanding
6
1.Businessunderstanding
Business Understanding
Traffic:
Problem:Trafficcongestionwastestimeandmoney
Clearquestion:Howcanweoptimizetrafficlightdurationusingdataontrafficpatterns,weather,andpedestriantraffic?
Measurableoutcomes:
- %decreaseincommutetime
- %decreaseinlength/durationoftrafficjams
2.AnalyticApproach
• Expressproblemincontextofstatisticalandmachinelearningtechniques• Regression:
• “Predictingrevenueinthenextquarter?”• Classification:
• “DoesthispatienthavecancerA,cancerB,oraretheyhealthy?”• Clustering:
• “Aretheregroupsofusersthatseemtobehavesimilarlytoeachother?”• Recommendation/Personalization:
• “HowcanItargetdiscountstospecificcustomers?”• OutlierDetection
7
Business Understanding
Analytic Approach
8
• Linearregression• Logisticregression• Clustering
• K-means• Hierarchical• Density-based
• ClassificationTrees• RandomForests• Neuralnetworks
• Textmining(naturallanguageprocessing)
• Principalcomponentanalysis
• SupportVectorMachines• HiddenMarkovModels• …
Statistical/machinelearningtechnique(s)
Data Understanding
Data Requirements
Data Collection
Datacompilation
• Thechosenanalyticapproachdeterminesthedatarequirements.• Content,formats,representations
• Initialdatacollection isperformed.• AvailableData?• Obtaindata?• Revisedatarequirementsorcollectmoredata?
• Thendataunderstanding isgained.• Initialinsightsaboutdata• Descriptivestatisticsandvisualization• Additionaldatacollectiontofillgaps,ifneeded
9
#1Whatcanyoutellmeaboutthisdata?
10
#2Whatcanyoutellmeaboutthisdata?
11
#3Whatcanyoutellmeaboutthisdata?
12
#4Whatcanyoutellmeaboutthisdata?
13
ImportanceofVisualization
14
Sameproperties:mean(x)=9mean(y)=7.5y=3.00+0.500xcorr(x,y)=0.816
Anscombe's Quartet
CRISP-DMMethodologydiagram
15
Business Understanding
Data Understanding
Data Preparation
Analytic Approach
Data Requirements
Data Collection
Modeling
Evaluation
Deployment
Feedback
Datapreparation
• Datapreparation encompassesallactivitiestoconstructandcleanthedataset.• Datacleaning
• Missingorinvalidvalues
• Eliminatingduplicaterows
• Formattingproperly
• Combiningmultipledatasources• Transformingdata• Featureengineering• Textanalysis
• Acceleratedatapreparationbyautomatingcommonsteps
16
Data Preparation
• Arguablythemosttime-consumingstep• “80%oftheentireDSprocessisin
datacleaningandpreparation”
Modeling
Modeling
• Modeling:• Developingpredictiveordescriptivemodels• Maytryusingmultiplealgorithms
• Highlyiterativeprocess
17
Data Preparation
Evaluation
K-meansClustering
Groupsimilarcuisinestogetherintok numberofclusters.
Example:Clustering
Groupsimilarcuisinestogetherintok numberofclusters.
k =3
K-meansClustering
Example:Clustering
Example:Clustering
20
[Age:18,Sex:M,BMI:23,Exercise:Frequent,Hobbies:Golf,…]
[Age:45,Sex:F,BMI:28,Exercise:Frequent,Hobbies:Baseball,…]
[Age:83,Sex:F,BMI:25,Exercise:Sedentary,Hobbies:Gymnastics,…]
[Age:28,Sex:M,BMI:23,Exercise:Normal,Hobbies:Softball,…]
[Age:30,Sex:F,BMI:25,Exercise:Normal,Hobbies:Golf,…]
[Age:15,Sex:M,BMI:22,Exercise:Frequent,Hobbies:Golf,…]
Model
CLUSTERA
CLUSTERB
CLUSTERC
CLUSTERB
CLUSTERA
CLUSTERA
Example:Classification
21
[Age:32,Sex:M,BMI:23,Exercise:Frequent,… ,Condition:Disorder1]
[Age:45,Sex:F,BMI:28,Exercise:Frequent,… ,Condition:Healthy]
[Age:63,Sex:F,BMI:21,Exercise:Sedentary,… ,Condition:Disorder2]
Model
[Age:48,Sex:M,BMI:23,Exercise:Sedentary,… ,Condition:________]
Disorder1
Modelevaluation
• Modelevaluation isperformedduringmodeldevelopmentandbeforemodeldeployment.• Understandthemodel’squality• Ensurethatitproperlyaddressesthebusinessproblem
• Diagnosticmeasures• Suitabletothemodelingtechniqueused• Training/Testingset• Refinemodelasneeded
• Statisticalsignificancetests
22
Evaluation
Modeling
Deploymentandfeedback• Oncefinalized,themodelisdeployed intoaproductionenvironment.
• Maystartinalimited/testenvironment• Involvesotherroles:
• Solutionowner
• Marketing
• Applicationdevelopers
• ITadministration
• Getting Feedback:• Howwelldidthemodelperform?• Iterativeprocessformodelrefinementandredeployment• A/Btesting
23
Deployment
Feedback
BigDataUniversity:• Inactive->Active
CRISP-DMMethodologydiagram
24
Business Understanding
Data Understanding
Data Preparation
Analytic Approach
Data Requirements
Data Collection
Modeling
Evaluation
Deployment
Feedback
25
“All models are wrong but some are useful” – George Box, Statistician
26
Variable#503
Varia
ble#5
03
Variable#503
27
Variable#503
Varia
ble#5
03
Variable#503
28
Variable#503
Varia
ble#5
03
Variable#503
29
LearningMoreAboutDataScienceWherecanyoulearnmoreaboutdatascience?
30
31
32
33
34
BigDataUniversity.comFreecourses!• DataScience• BigData• DataEngineering
Earnbadges!
Learnanytime!
Foryourorganizations• Wecancreatededicatedportals
foryouremployeestogainskillsindatascience