data preparation for automated machine...

21
DATA PREPARATION FOR AUTOMATED MACHINE LEARNING © 2017 Impact Analytix, LLC - All rights reserved. 1 DATA PREPARATION FOR AUTOMATED MACHINE LEARNING BY JEN UNDERWOOD WHITE PAPER

Upload: others

Post on 06-Aug-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 1

DATAPREPARATIONFORAUTOMATEDMACHINELEARNINGBYJENUNDERWOOD

WHITEPAPER

Page 2: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 2

TABLEOFCONTENTSIntroduction........................................................................................................................3

ArtandScienceofDataPreparation......................................................................3

AutomationforFasterDataPreparation................................................................3

DataRobotforMachineLearning...........................................................................4

WheretoStart.....................................................................................................................5

MachineLearningLifecycle....................................................................................5

PlanforDataCollection..........................................................................................5

AvoidOverfittingandUnderfitting.........................................................................7

CollectandStructureData..................................................................................................8

GatherData............................................................................................................8

BewareofBias......................................................................................................11

ExploreandProfile............................................................................................................11

UnderstandYourData..........................................................................................12

DetectLeakage.....................................................................................................12

FindandReduceErrors........................................................................................13

ImproveDataQuality........................................................................................................15

EngineerFeatures.............................................................................................................18

Conclusion.........................................................................................................................20

RecommendedNextSteps...................................................................................20

JANUARY2018–WHITEPAPERCOMMISSIONEDBYDATAROBOT

Page 3: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 3

INTRODUCTIONThe beauty of the human mind in combination with automated machine learningempowers amazing predictive insights that might never be found using manualtechniques.Sincethequalityofpredictiveoutputreliesonthequalityof input,properdatapreparationisacriticalsuccessfactorforachievingoptimalmachinelearningresults.

THEARTANDSCIENCEOFDATAPREPARATIONTheiterativeprocessofpreparingdataforautomatedmachinelearningisbothanartandascience.Theartofdatapreparationrequiresknowledgeofthebusiness–or“domainexpertise”–toselecttherightproblemstosolve,identifycrucialinputdata,andcarefullytransformandengineerinformativefeaturestomaximizepredictivemodelaccuracy.Thescienceofdatapreparationinvolvescleansingandnormalizingcollecteddata,selectinginfluencer features, and generating training, testing, and validation data sets forautomatedmachinelearning.

Although automated machine learning solutions may provide safeguards to preventcommonmistakes,you’llstillwanttolearnhowtocorrectlyprepare,shape,andformat

your data to create greatmodels. Providing the wrong data, irrelevant data, orimproperly prepared data undermines model performance for generalization.Essentiallyyouwillwanttodesignaninputdatasetwithfeatureindependence,explainable variance, and maximum information gain to find signals in thenoise.

AUTOMATIONFORFASTERDATAPREPARATIONTo identify relationships in data — “the signals”— and isolate distracting,irrelevantdata—“thenoise”—youcanexpeditethehighlyiterativepredictive

datapreparationprocesswithautomatedmachinelearning.Inminutes,youcanfindthemostrelevantfeaturesandpinpointspecificareas inyourdatasetwhere

prediction errors occur to help you focus efforts on the right data and reduceexperimentationtime.

Afterrunningbasicinputdataandevaluatingtheresults,youwillenhanceinputdata,addfeatures,buildanothermodel,andreviewperformanceonceagain.You’llcontinuethisprocessuntilyourmodelmeetsperformanceobjectives.

Ideally subject matter experts that understand the business process and data sourcenuances will assist in the data preparation process. Depending on your project, datapreparationmightbeaone-timeactivityoraperiodicone.Asnewinsightsarerevealed,itiscommontoexperimentfurther.

“Youcanexpeditethehighlyiterativepredictivedatapreparationprocesswithautomatedmachinelearning.”

Page 4: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 4

DATAROBOTFORMACHINELEARNINGDataRobot is the world’s most advanced automated machine learningplatform.DataRobotautomatesthemachinelearningprocessfromdata ingestion to deployment. It delivers immediate value andunmatchedease-of-use,andnocomplicatedmathorscriptingisrequired.

DataRobot includes an array of data preparation features,automatingfeatureengineeringtofindkeyinsightsandhiddenpatterns. This invaluable technology expedites analyticalinvestigation across millions of variable combinations thatwould be far too time-consuming for manual humanexploration.

For optimal machine learning model performance, domainknowledge and best practices used by the world’s leading datascientistshavebeenuniquelybakedintoDataRobotblueprints.Usersofallskill levels can safely apply machine learning with its built-in optimizations andsafeguards.

DataRobot supports popular advancedmachine learning techniques and open sourcetoolssuchasApacheSpark,H2O,Scala,Python,R,andTensorFlow.Usingdrag-and-drop,point-and-click guided menu options, DataRobot users can simply and quickly createpredictivemodelswithautomatedmachinelearning.Theprocessissimple:

o Ingestdatasources

o Selectatargetvariabletopredict

o Automatically generate features, extract balanced samples, build and iteratethrough100sofmachinelearningmodels

o Visuallyexploretopperformingmodelsandkeyfindings

o Easilydeployandoperationalizemodels

Machinelearningdevelopmentstepsthatusedtotakeweeksormonthsofeffortcannowbe completed in hours. By embeddingDataRobot automatedmachine learningmodelintelligence into your reporting or business processes, you can quickly close the loopbetweeninsightandaction.

Sinceeachdatasetandbusinessobjectivecanbeuniquewithvariedchallenges,wehaveprovidedthefollowingguidelinestohelpgetyoustarted.Wealsoshareessentialtipsandadditionalresourcesforfurtherstudy.

“Domainknowledgeandbestpracticesusedbytheworld’sleadingdatascientistshavebeenuniquelybakedintoDataRobotblueprints.”

Page 5: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 5

WHERETOSTARTThe machine learning process begins with Business Understanding. This initial stepfocusesondefiningtherightproblemtosolveandrecognizingthebusinessobjectivesandrequirements.Afterselectingaproblem,youwillcollectandassessdata.DuringtheDataUnderstandingstep,youwillgetfamiliarwithavailabledatasources,identifydataqualityproblems,andperformexploratoryanalysis.Then,intheDataPreparationstep,youwillcleansethedata,shapingandtransformingitintoaflattenedformatforloadingintotheautomatedmachinelearningplatform.

MACHINELEARNINGLIFECYCLE

Figure1:OverviewoftheMachineLearningProcess

Forthepurposesofthiswhitepaper,wewillconcentrateoncollectingdataandpreparingitproperly.Wewillnotcovertheentiremachinelearninglifecycle.

Beforeyoubeginthedatacollectionanddatapreparationprocess,itisassumedthatyoualreadyhaveselected,defined,andisolatedabusinessproblemtosolvethatisaviablecandidateformachinelearning.Youshouldalsohavechosenatleastonemetricthatyouwanttobetterunderstand.Ifyouneedmoreinformationonthosestepsinthemachinelearningprocess,pleaserefertoourpreviouswhitepaper,MovingfromBItoMachineLearning.

PLANFORDATACOLLECTIONAsyoudeveloprequirementsformachinelearningmodeldatacollection,contemplatethebusinessprocessandreviewitfromdifferentperspectives.Considerwhathappensateach step, what data is captured, where it is stored, if data history or changes areretained,andifthatdatatrulyreflectstherealworldforresultingpredictions.Oftendataiscollectedinline-of-businessapplicationsordatawarehousesforotherreasons.Existing

Page 6: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 6

datamaybemissingsituationalcontextsuchaslocation,environmentalconditions,andother relevant variables for predicting an outcome. Document known issues andpreferreddatathatcouldbeaddedinthefuture.

TOIMPROVEINPUTDATACOLLECTION,DIAGRAMTHEBUSINESSPROCESSFLOW.IDENTIFYTHESTEPS,ENVIRONMENTALCONDITIONS,SCENARIOS,PEOPLE,SYSTEMS,AVAILABLEDATA,ANDMISSINGDATA.

As you continue planning to gather data for yourmachine learningmodeling project,you’ll need to confirmdecision-levelmetric granularity.Granularity refers to aunitofanalysis. A unit might be an opportunity, customer, or transaction. Granularity is

determined by the business objectives and how your model will be usedoperationally.Askstakeholdershowdecisionswillbemadefromthepredictivemodels.Aretheybasedonasinglecustomer,transaction,orevent,oraretheybasedonaggregatedataovertime?

To illustratetheseconcepts,wewillbereferringtoapubliclyavailableBankMarketingDataSet1fromUCI’smachinelearningrepository.Thesampledatasetcontainspartiallyprepareddata topredictclient termdepositscollectedduringthebank’stelemarketingcampaigns.

In the BankMarketing Data Set, the desired outcome to predict is client termdeposits. This is a binary yes or no outcome in the sample, but it could have

alternativelybeenatotalamountfiguretomaximizedeposits.Don’talwayslimityourselftocollectingoneoutcomevariablewhileassemblingdata.Thinkaboutotherquestionsthatmightbeaskedanddatathatwouldmakesensetoinclude.

Potentialinfluencerfeaturesfortheexampleclienttermdepositoutcomeincludeclientdemographics such as age, job, marital status, and education. Past credit and loanrepayment information is also important to know. Other features chosen includedcampaigncontacts,previousmarketingcampaignoutcomes,andseveralexternalsocialandeconomicenvironmentalattributessuchasemploymentrate.

NOTE:WHENSELECTINGFEATURES,YOUWILLWANTTOEXTRACTTHEMAXIMUMINFORMATIONFROMTHEMINIMUMNUMBEROFINPUTVARIABLES.

1UCImachinelearningrepositorydatasethttps://archive.ics.uci.edu/ml/datasets/bank+marketing

“Toillustratetheseconcepts,wewillbereferringtoapubliclyavailableBankMarketingDataSet1fromUCI’smachinelearningrepository.”

Page 7: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 7

AVOIDOVERFITTINGANDUNDERFITTINGOverfitting and underfitting are common mistakes for beginners who are preparingmachine learningmodeling data. Overfitting captures the noise in your data with anoverly complex, unreliable predictive model. Essentially what happens is the modelmemorizesunnecessarydetails.Whennewdatacomesin,themodelfails.

Ifyoudon’thaveenoughfeatures,yourmodelmightbeoversimplifiedandsufferfromunderfittingissues.Underfittingoccurswhenastatisticalalgorithmcannotcapturetheunderlyingpatternsinthedata.

Whyareoverfittingandunderfittingproblematic?Themachinelearningmodelwillhavetoomanypredictionerrorstobeusefulfordecision-making.

Figure2:CommonDataPreparationIssues

Forcategoricaldata,overfittingcanoccur ifahighnumberofcategoriesareobservedwith a small number of observations per category. These types of variables hold lessinformation for predictive value. For time series data, overly complex mathematicalfunctions that describe the relationship between the input variable and the targetvariablecanalsoleadtooverfitting.Inthemostextremeformofoverfitting,individualidentifiersare inadvertentlyusedasmachine learning inputs. Individual identifierscanperfectly model existing data, but would only by chance reliably model and predictoutcomesforotherdata.

Thus,thereisadelicatebalancebetweenbeingtoospecificwithtoomanyfeaturesandtoovaguewithnotenoughfeatures.Designingmachinelearningmodelfeatureswithjusttherightamountofpredictiveinformationgainandprecisionisakeyskill intheartofdatapreparation.

Page 8: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 8

COLLECTANDSTRUCTUREDATAAfter reviewing the business process and planning the machine learning input datarequirements,you’lldelveintothedatacollectionandshapingprocess.

GATHERDATAMachinelearningalgorithmsassumethateachrecordisindependentandarenotrelatedtootherrecords. If relationshipsexistbetweenrecords,youwillwanttocreateanewvariablecalleda feature ina columnwithin the rowofdata tocapture thatbehavior.Unlike third-normal form transactional or dimensional patterns used in businessintelligence,machinelearningrequiresdatatobeinputasa“flattened”table,view,orcommaseparated(.csv)flatfileofrowsandcolumns.Yourviewwillneedtocontainanoutcome metric and target variable, along with input predictor variables. This datarepresentationformachinelearningiscalledthefeaturematrix.

Figure3:ExamplePreparedDataSet

If you have data stored in several tables in a data warehouse or relational databaseformat,youwillneedtouserecordidentifierstojoinfieldsfrommultipletablestocreatea singleunified, flattened “view.” Formany target variables, inputdata is capturedatvariousbusinessprocessstepsinmultipledatasources.AsalesprocessmighthavedatainaCRM,emailmarketingprogram,Excelspreadsheet,and/oraccountingsystem.Ifthatisthecase,youwillwanttoidentifythefieldsinthosesystemsthatcanrelate,join,orblendthedifferentdatasourcestogether.

Prepareddatashouldbecollectedatalevelofanalyticalgranularityuponwhichyoucanmakedecisions.Chooseagranularitythatisactionable,understandable,andusefulinthe

Page 9: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 9

eventyouincorporatetheresultsintoyourexistingbusinessprocessorapplication.Forexample,ifyouwanttomakedailysalesforecasts,youneedtoinputdataatadaylevel

ratherthanweek,month,oryear.

Ifyouaretryingtocapturechangesindataoveracertaintimeperiod,checkifyourdatasourceisonlykeepingthecurrentstatevaluesofarecord.Mostdatawarehousesaredesignedtosavedifferentvaluesofarecordovertimeanddonot overwrite historical data values with current data values. TransactionalapplicationdatasourcessuchasSalesforceonlycontainthecurrentstatevalueforarecord.Ifyouwanttogetapriorvalue,youneedtohaveasnapshotofthehistoricaldatastoredorkeepthepriorvaluedataincustomfieldsonthe

currentrecord.

Whilestructuringinputdata,ensurethat it iscleanandconsistent.Theorderandmeaningofinputvariablesshouldremainthesamefromrecordtorecord.Inconsistentdataformats,“dirtydata,”andoutlierscanunderminethequalityofanalyticalfindings.

HOWMUCHDATATOCOLLECTTheactualnumberofrecordsisnotalwayseasytodetermineanddependsonpatternsinyourdata.Ifyouhavemorenoiseinyourdata,youwillneedmoredatatoovercomeit.Noiseinthiscontextmeansunobservedrelationshipsinthedatathatarenotcapturedbytheinputpredictorvariables.

Figure4:SampleSizeEstimation

To determine minimum data set sizes, consider the dimensionality of your data andpatterncomplexity.2

o Forsmallmodelswithafewvariables,10to20recordspervariablevaluemaybesufficient.

o Formorecomplexmodels,~100 recordspervariablevaluemaybeneeded tocapturepatterns.

o For complex models with ~100 input variables, you will need a minimum of10,000recordsinthedataforeachsubset(training,testing,andvalidation).

TRAINING,TESTING,ANDVALIDATIONDATASETSThemostcommonstrategyistosplitdataintotraining,testing,andvalidationdatasets.

2AppliedPredictiveAnalytics:PrinciplesandTechniquesfortheProfessionalDataAnalystbyDeanAbbott

“Shapingdatainvolvessubjectmatterexpertthoughttocreativelyselect,create,andtransformvariablesformaximuminfluence.”

Page 10: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 10

These data sets are usually created through a random selection of records withoutreplacement,meaningeachrecordbelongstooneandonlyonesubset.All threedatasetsshouldreflectyourreal-worldscenario.

CrossValidation

KeepinmindthatDataRobotincludesindustrystandardk-foldcrossvalidationfeaturesthatdividethedataintoksubsetswiththeholdoutrepeatedktimes.Eachtime,oneoftheksubsetsisusedasthetestsetandtheotherk-1subsetsarecombinedtoformatrainingset.DataRobot’sk-foldfeatureenablesyoutoindependentlychoosehowmuchdatayouwanttouseintesting.

TimeSeriesConsiderations

Data that changes over time should be reflected in your input data set. When timesequences(ContactMade>QuoteProvided>DealClosed)areimportantinpredictions,proportionallycollectingdatafromthosedifferenttimeperiodsisimportantaswell.Thekeyprincipleistoprovidedatathatreflectswhatactuallyhappensattherightlevelofoutcomemetricgranularity.

Whencollectingdata, thinkaboutthebalanceofyourvalues inyourraw data. In our example, howmany prospects do you have atdifferent ages and stages of life?Dotheyhaveloans?Whatarethebalances on outstanding loans?How many defaulted on a loan?Howrecentwasthedefault,?Whatis the prospect’s income history,credit ratings, debt-to-incomeratio,andsoon?

ThinkProportionally

Whenextractingasubsetofdata,besuretoincludeapproximatelythesameproportionofvariablesinyourDataRobotinputdatasetthatyouseeinthereal-worlddata.Ifyouprovide more records of one variable value, you can accidentally bias the machinelearningmodel’spredictions,therebydiminishingperformance.Ifyouhavedatasetswithmillionsorbillionsofrows,itismuchlesslikelythatyouwillencounteraccidentalbiasindatapreparation.

Figure5:SamplingData

Page 11: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 11

SAMPLINGThechoiceoftheoptimalsamplingmethod3foragivenproblemespeciallydependsonthecharacterofthedatasetandthedesiredproportionofthesubsets.Eachmethodhasitsadvantagesbutalsoitslimitations.

Forsimple,nearlyuniformlydistributeddatasets,themethodofsimplerandomsamplingmay be sufficient. For naturally well-ordered time series data, highly efficientdeterministic approaches such as convenience and systematic sampling can achievereliableresults.Whendealingwithcomplex,high-dimensionaldata,moresophisticatedandstratifiedsamplingtechniquescanreducethebiasandvarianceofmodelerror.

UnbalancedTwo-ClassProblemsDatasetsseldomcomewithevenlydistributedsamples.Unbalanceddataisacommonissue to remediate. Fraud or failure rate data are examples of unbalanced two-classproblems.Analyzingunbalanceddatacreatesuselessresultswithexceptionallyhigherrorrates.

Tobuildpredictivemodelsonunbalanceddata,youneedtoapplysamplingtechniquesthatincreasetheminorityclassproportionwithdownsamplingorupsamplingtocreateabalanceddataset.Aftertrainingamachinelearningmodelwithabalancedtrainingset,youwillvalidateperformanceofthemodelwiththereal-world,unbalanced,unseentestset.

BEWAREOFBIASWhileyouaccumulatedata,considerpotentialbiases.4Humannatureisconsciouslyandunconsciouslybiased.Cognitivebiasesaretendenciestothinkincertainwaysthatcanlead to irrational judgment.Outcome,omission, andmanyotherbias types caneasilycreepintothedatacollectionprocess.

If unknown bias exists, it is basically an unjustified assumption that your input datareflectsreality.Anymodelbuiltonsuchassumptionsreflectsonlythedistortedrealityandwillperformpoorly.Toreducepotentialbias, testhypotheses,pokeholes inyourown ideas,welcomechallenges,andconductpeer reviewsofyourdatacollectionandsamplingthoughtprocesses.Machinelearningprojectsshouldbegroupprojectsandnotdoneinisolation.

EXPLOREANDPROFILENow youwill assess the condition of your source data. DataRobot automates severalaspects of initial data examination. DataRobot also automates sampling to avoidconventional sampling and overfitting issues. During this step, you’ll visually look for

3CommonTypesofDataSamplingMethodshttps://en.wikipedia.org/wiki/Sampling4TheCognitiveBiasCodexhttps://upload.wikimedia.org/wikipedia/commons/a/a4/The_Cognitive_Bias_Codex_-_180%2B_biases%2C_designed_by_John_Manoogian_III_%28jm3%29.png

Page 12: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 12

trends, extreme values, outliers, exceptions, skewed data, incorrect values, andinconsistentandmissingdata.

UNDERSTANDYOURDATAAsyoubegintoexploreandunderstandyourdata,DataRobotprovidesdataprofilesfor

everyfeaturethatincludehowmanyvaluesareuniqueormissing,aswellasthestatisticalmean,standarddeviation,median,minimum,andmaximumvalue.Youcanalsoreviewthedistributionsofeachfeatureusingahistogramwithoptionalbinsettingsandapplytransformationstonormalizeyourdata.

Informative Features immediately ranks variables that provide the mostinformationgainforbuildingoptimalmachinelearningmodels.Knowingwhichareasofyourdatamostinfluencetheoutcomeisinvaluableonitsown.Thisinformationcanguidethebusinesstofocuslimitedtimeandresourcesonthe

activitiesthatmattermost.

Another one of DataRobot’s data preparation strengths is the array of differentvisualizationtoolsthatidentifyinfluentialfeatures,rankthem,anduncovererrors.TheFeature Ranking reportmeasures howmuch each feature by itself contributes to theaccuracyofamachinelearningmodel.

Figure6:FeatureImpact

DETECTLEAKAGELeakageistheaccidentalinclusionofoutcomeinformationthatwouldnotbelegitimatelyavailable for predictions. If you have inadvertent leakage, you’ll likely notice it in thefeaturerankingreportwhenafeaturehasanexceptionallyhighimpact.

“Bylookingatthesefindings,thebusinesscanimmediatelyappreciatethevariablesthattrulyinfluenceoutcomestogetimmediatevalue.”

Page 13: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 13

Figure7:LeakageExample

In our Bank Marketing Data Set example, duration was identified in the DataRobotFeatureImpactreportasaleakedfeaturewith100%impact.Thenextbestperformingfeaturecarriesa littlemore than10% impact. Ifyouseeanexceptionallyhigh impact,verifyprocessflowtimingforthatfeature.

TOAVOIDLEAKAGE,YOUNEEDTOCONSIDERTHETIMINGANDORDEROFEVENTSTOENSURETHATOUTCOMEDATAISNOTUSEDASINPUTDATA.

FINDANDREDUCEERRORSDataRobotModelX-RayandReasonCodescapabilitiescanalsoprovidedeepinsightsforenhancing data preparation. Model X-Ray enables interactive, visual exploration ofmachinelearningmodelperformance.Youcaneasilyseewhereamodelmakesmistakesbyselectinginputfeatures.Itshinesalightonissuesthatmightnotgetdetectedusingother tools.Model X-Ray allowsmachine learningmodel designers to concentrate onwherethemostmodelperformanceimprovementscanbemadeinthedatapreparationprocess.

Page 14: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 14

Figure8:ModelX-Ray

ReasonCodesunveilwhatvalueswithinafeaturedrivethemodel’sresults.Thistooliscrucialformachinelearningmodeldevelopmentprocessesthataresubjecttoregulatorycomplianceorlegalscrutiny.WithReasonCodes,youcandiscoverwhichcombinationsoffeaturevaluestriggeraspecificmachinelearningoutcome.Thisinformationcanalsobe useful throughout the iterative data preparation process to incrementally improveresults.

Figure9ReasonCodes

Page 15: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 15

IMPROVEDATAQUALITYWerecommendthatyouaddressdataqualityissuesasearlyaspossible.Hereareafewtipsforhandlingcommondataissuesinthedatapreparationprocess.

CorrectingIncorrectValues

Machinelearningmodelsassumetheinputdataiscorrect.Ifyouareseeingerrorsfromsourceapplicationsthatshouldgetfixed,abestpracticeistotryandresolvetheissueatthesourcesystemversusinadatapreparationprocess.

Treat incorrect values as missing if there is a minimal amount and you can’t easilydeterminecorrectvalues.Iftherearealotofinaccuratevalues,trytodeterminewhathappenedtorepairthem.Ifyoudomakechangestodata,documentyourreasoning.Alsocaptureinitialcontextandchangedvalueswithaflagtoidentifychanges.Thepatterninyourdatamightbehiddeninthoseincorrectvalues.

SkewedVariables

Forcontinuousvariables,reviewthedistributions,centraltendency,andvariablespreadin DataRobot. These are measured using various statistical metrics and visualizationmethodssuchashistograms.Continuousvariablesshouldbenormallydistributed.Ifnot,reduce skewnesswith transformations or by experimentingwith bin sizes for optimalprediction.

When a skewed distribution needs to be corrected, the variable is transformed by afunctionthathasadisproportionateeffectonthetailsofthedistribution.Logtransformslikelog(x),logn(x),log10(x);themultiplicativeinverse(1/x);squareroottransformsqrt(x);orpower(xn)arethemostfrequentlyusedcorrectivefunctions.

In the table to the left,several issues and formulastominimizeskewareshown.The before and after chartsillustrate how differentskewed variabletransformationscanbeusedto normalize featuredistributions.

Figure10:TransformationsforSkewness

Page 16: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 16

High-Cardinality

High-cardinality fields are categorical attributes that contain a very large number ofdistinctvalues.Examplesincludenames,ZIPcodes,andaccountnumbers.Althoughthesevariables can be highly informative, high-cardinality attributes are rarely used inpredictive modeling. The main reason is that including them will vastly increase thedimensionalityofthedataset,makingitdifficultorevenimpossibleformostalgorithmstobuildaccuratepredictivemodels.

Redundant,HighlyCorrelatedVariables

Duplicate,redundant,orotherhighlycorrelatedvariablesthatcarrythesameinformationshould be minimized. DataRobot algorithms will perform better without collinearvariables.Collinearityoccurswhentwoormorepredictorvariablesarehighlycorrelated,meaningthatonecanbelinearlypredictedfromtheotherswithasubstantialdegreeofaccuracy.

Toidentifyhighcorrelationbetweentwocontinuousvariables,reviewscatterplots.Thepatternofascatterplot indicates therelationshipbetweenvariables.Therelationshipcanbelinearornon-linear.Tofindthestrengthoftherelationship,computecorrelation.Correlationvariesbetween-1and+1.

Ifyouhavetwovariablesthatare almost identical and youdo want to retain thedifference between them,consider creating a ratiovariableasafeature.Anotherapproach is to use PrincipalComponent Analysis (PCA)outputasinputvariables.

Toavoidthecollinearityissue,do not include multiplevariables that are highlycorrelatedordatathatisfromthesamereportinghierarchy.Often those fields provideobviousinsights.Forexample,customerswholiveinthecityofTampaalsohappentoliveinthestateofFlorida.

Figure11:Scatterplotsforcorrelationdetection

Page 17: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 17

Missingvalues

Themostcommonrepairformissingvaluesisimputingalikelyorexpectedvalueusingameanorcomputedvaluefromadistribution.Ifyouusethemean,youmaybereducingyourstandarddeviationthusthedistributionimputationapproachismorereliable.

Anotherapproachistoremoveanyrecordwithmissingvalues.Don’tgettooambitiouswithfilteringoutmissingvalues.Ifyoudeletetoomanyrecords,youwillunderminethereal-worldaspectsinyouranalysis.Asyouaddressmissingvalues,donotlosetheinitialmissingvaluecontext.Acommondatapreparationapproachistoaddacolumntotherowtoflagdatawasmissingcodedwitha1or0.

ExtremeValuesandOutliers

Outliersarevaluesthatexceedthreestandarddeviationsfromthemean.Manymachinelearningalgorithmsaresensitivetooutlierssincethosevaluesaffectaverages(means)andstandarddeviationsinstatisticalsignificancecalculations.Ifyoucomeacrossunusualvaluesoroutliers,confirmthatthesedatapointsarerelevantandreal.Often,oddvaluesareerrors.

If theextremedatapointsareaccurate,predictable,andsomethingyoucancountonhappening again, do not remove them unless those points are unimportant. You canreduceoutlierinfluencebyusinglogtransformationsorconvertingthenumericvariabletoacategoricalvaluewithbinning.

Page 18: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 18

ENGINEERFEATURESFeaturecreationistheartofextractingmoreinformationfromexistingdatatoimprovethepredictivepowerofmachinelearningalgorithms.Youaremakingthedatayoualreadyhavemoreuseful.Strongfeaturesthatpreciselydescribetheprocessbeingpredictedcanimprovepatterndetectionandenablemoreactionableinsightstobefound.

Creating features from several combined variables and ratios usually provideshighermodelaccuracy thananysingle-variable transformationbecauseof theinformationgainassociatedwithdatainteractions.Ifyoueversawthemovieorread the book “Moneyball: TheArt ofWinning anUnfairGame”byMichaelLewis, you’ll know how baseball analysis was revolutionized with newperformancemetrics likeOn-BasePercentage(OBP)andSluggingPercentage(SLG). With feature engineering, you will be using a fundamentally similarapproach.

HUMANINGENUITYANDCREATIVITYREQUIREDFeatureengineeringischallengingbecauseitdependsonleveraginghumanintuition

to interpret implicit signals in data sets that machine learning algorithms use.Consequently,featureengineeringisoftenthedeterminingfactorinwhetheramachinelearningmodelingprojectissuccessfulornot.Thisstepintheprocessisexperimentalandusuallythebottleneckinautomatedmachinelearningprocesses.Although collected raw data fields can be used as-is in DataRobot withouttransformations, supplemental data, or calculations to trainmachine learningmodels,you’llalmostalwayswanttoaddmoreperspectivetoyourdatasetbydesigningfeatures.Engineeredfeaturesprovidebettercontexttodifferentiatepatternsinthedata.

Aggregations

Some commonly computed aggregate features including the mean (average), mostrecent,minimum,maximum,sum,multiplyingtwovariablestogether,andratiosmadebydividingonevariablebyanother.NoteDataRobotautomaticallygeneratesdateandtimeaggregationfeatures.

Ratios

Ratios can be excellent feature variables. Ratios can communicate more complexconcepts such as price-to-earnings ratio, where neither price nor earnings alone candeliverthisinsight.

Transformations

Transformation refers to the replacement of a variable by a function. For instance,replacingavariablexbythesquareorcuberootorlogarithmxisatransformation.Youtransformvariableswhenyouwanttochangethescaleofavariableorstandardizethevaluesofavariableforbetterunderstanding.Variabletransformationcanalsobedone

“Featureengineeringisoftenthedeterminingfactorinwhetheramachinelearningmodelingprojectissuccessfulornot.”

Page 19: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 19

usingcategoriesorbins tocreatenewvariables.Anexample transformationmightbebinningcontinuousLeadAgeintoLeadAgeGroupsorLoanAmount intoLoanAmountCategories.

FEATUREENGINEERINGTECHNIQUESHereareafewpopularfeatureengineering ideassharedonDataScienceCentral5thatcanbeusedtoextractmoreinformationfromyourinputdata:

1. Singlevariabletransformations

2. Ratioorfrequencyofcategoricalvariables

3. Combineimportantvariables

4. Computevariableinteractions

5. Changedatatypes

6. Computerelativedifferences

7. Cartesiantransformations

8. Bintransformations

9. Windowtimeseriesdata

10. Reframecontinuousvariables

11. Onehotencodingforsequenceproblems

12. Sparsevaluecoding

Thepossibilitiesforfeatureengineeringarelimitedonlybyyourownhumaningenuityandcreativity.Featureengineeringtrulyisthehumanartofdatapreparationforautomatedmachinelearning.

5DataScienceCentralhttps://www.datasciencecentral.com/profiles/blogs/feature-engineering-data-scientist-s-secret-sauce-1

Page 20: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 20

CONCLUSIONIn thiswhitepaper,webriefly introducedbasicdatapreparation formachine learningconcepts.Wediscussedhow toplanyourprojectandcollect,organize, structure,andshapedatainamachinelearning-friendlyformat.Wealsobestowedvitaltipsforfeatureengineering to help you master the art of data preparation for automated machinelearningmodels.

Asyouprogress inyourautomatedmachine learningmodel journey,eachareaof thedata preparation process should be further researched. For additional reading, thefollowingbooksarehighlyrecommended:

o DataPreparationforDataMiningbyDorianPyle

o DataPreprocessinginDataMiningbySalvadorGarcíaandJuliánLuengo

o AppliedPredictiveAnalytics:PrinciplesandTechniquesfortheProfessionalDataAnalystbyDeanAbbott

o FeatureEngineeringforMachineLearningModels:PrinciplesandTechniquesforDataScientistsbyAliceZhengandAmandaCasari

NotethatDataRobotalsoprovidesclassesthatcovertheseandotherrelatedtopics.

RECOMMENDEDNEXTSTEPSForadditionalinformationonautomatedmachinelearning,pleasecontactanexpertatDataRobot.It’seasytogetstarted.

o DataRobotwww.datarobot.com

o DataPreparationEssentialsforAutomatedMachineLearningwebinarrecordingwww.datarobot.com/webinar/data-prep/

o DataRobotAIAccelerationPackageswww.datarobot.com/product/enterprise-ai/

o DataRobotCourseswww.datarobot.com/education/all-courses/

Page 21: DATA PREPARATION FOR AUTOMATED MACHINE LEARNINGdatateam.mx/downloads/datarobot/Data-Preparation-for... · 2019-09-26 · science of data preparation involves cleansing and normalizing

DATAPREPARATIONFORAUTOMATEDMACHINELEARNING

©2017ImpactAnalytix,LLC-Allrightsreserved. 21

AboutDataRobotDataRobotoffersanenterprisemachine learningplatform that empowersusersofall skill levels tomakebetterpredictionsfaster.Incorporatingalibraryofhundredsofthemostpowerfulopensourcemachinelearningalgorithms,theDataRobot platformautomates, trains and evaluates predictivemodels in parallel, deliveringmore accuratepredictionsatscale.DataRobotprovidesthefastestpathtodatasciencesuccessfororganizationsofallsizes.Formoreinformation,visitwww.datarobot.com.

AbouttheAuthorJenUnderwood, founder of Impact Analytix, LLC, is an analytics industry expertwith a unique blend of productmanagement,designandover20yearsof“hands-on”advancedanalyticsdevelopment. Inadditiontokeepingapulseonindustrytrends,sheenjoysdiggingintooceansofdata.JenishonoredtobeanIBMAnalyticsInsider,SAScontributor,andformerTableauZenMaster.ShealsowritesforInformationWeek,O’ReillyMedia,andothertechindustrypublications.

JenhasaBachelorofBusinessAdministration–Marketing,CumLaudefromtheUniversityofWisconsin,Milwaukeeandapost-graduatecertificateinComputerScience–DataMiningfromtheUniversityofCalifornia,SanDiego.Formoreinformation,visitwww.jenunderwood.com.