Paper 036-2013
Big Data, Fast Processing Speeds
Kevin McGowan, SAS Solutions on Demand, Cary NC
ABSTRACT
As data sets continue to grow, it is important for programs to be written very efficiently to make sure no time is wasted processing data. This paper covers various techniques to speed up data processing time for very large data sets or databases, including PROC SQL, data step, indexes and SAS macros. Some of these procedures may result in just a slight speed increase, but when you process 500 million records per day, even a 10% increase is very good. The paper includes actual time comparisons to demonstrate the speed increases using the new techniques.

INTRODUCTION
More organizations are running into problems with processing big data every day. The bigger the data, the longer the processing time in most cases. Many projects have tight time constraints that must be met because of contractual agreements. When the data size increases, it can mean that the processing time will be longer than the allotted time to process the data. Since the amount of data cannot be reduced (except in rare cases), the best solution is to seek out methods to reduce the run time of programs by making them more efficient. This is also a cheaper method than simply spending a lot of money to buy bigger/faster hardware, which may or may not speed up the processing time.

Important Note: In this paper, whenever code is presented that is efficient it will be shown in green. Code that should not be used is shown in red.

WHAT IS BIG DATA?
There are many different definitions of big data. And more definitions are being created every day. If you ask 10 people, you will probably get 10 different definitions. At SAS Solutions on Demand (SSO) we have many projects that would be considered big data projects. Some of these projects have jobs that run anywhere from 16 to 40 hours because of the large amount of data and complex calculations that are performed on each record or data point.
These projects also have very large and fast servers. One example of a typical SAS server that is used by SSO has these specifications:

- 24 CPU cores
- 256 GB of RAM
- 5+ TB of disk space
- Very fast RAID disk drive arrays with advanced caches
- Linux or AIX operating system
- Very high speed internal network connections (up to 10 Gb per second)
- Encrypted data transfers between servers
- Production, Test, and Development servers that are identical
- Grid computing system (for one large project)
The projects also use a three-tiered system where the main server is supported by an Oracle database server and a front-end terminal server for the end users. The support servers are typically sized about the same as the SAS server. In most cases the data is split: some data is stored in SAS data sets while the data used most by the end users is stored in Oracle tables. This setup was chosen because Oracle tables allow faster access to data in real time. Even with this large amount of computing horsepower, it still takes a long time to run a single job because of the large amount of data the projects use. Most of these projects process over 200 million records during a single run. The data is complex, and requires a large number of calculations to produce the desired results.

BEST WAYS TO MEASURE SPEED IMPROVEMENTS
SAS has several system options that are very helpful to determine the level of increase in performance. In this paper, we will focus on the actual clock time (not CPU time) the program takes to run. In an environment with multiple CPUs, the CPU time statistic can be confusing; it is even possible that CPU time can be greater than the clock time. The most important SAS options for measuring performance are listed below with a short description:

- STIMER / FULLSTIMER: These options control the amount of data produced for CPU usage. FULLSTIMER is the preferred option for debugging and trying to increase program speed.
- MEMRPT: This option shows the amount of memory usage for each step. While memory usage is not directly related to program speed in all cases, this data can be helpful when used along with the CPU time data.
- MSGLEVEL=I: This option outputs information about the index usage during the execution of the program.
- OBS=N: This option can be very useful to test programs on a small subset of data. Care must be taken to make sure that this option is turned off for production.

DATABASE/DATA SET ACCESS SPEED IMPROVEMENTS USING SQL
Since many SAS programmers access data that is stored in a relational database as well as SAS data sets, this is a key area that can be changed to speed up program speed. In an ideal world, the SAS programmers would be able to help design the database layout. But, that is not always possible. In this paper, we will assume that the database design is complete, and the SAS programmer accesses the data with no ability to change the database structure. There are three main ways SAS developers access data in a relational database:

- PROC SQL
- LIBNAME access to a database
- Converting database data to text files (this should be a last resort when no other method works, such as when using custom database software)
Most of the methods described here will work for all three methods for database access. One of the primary reasons to speed up database access is that it is typically one of the easiest ways to speed up a program. Database access normally uses a lot of input/output (I/O) to disk, which is slower than reading data from memory. Advanced disk drive systems can cache data in memory for faster access (compared to disk) but it's best to assume that the system you are using does not have data cached in memory.

The simplest way to speed access to either databases or SAS data sets is to make sure you are using indexes as much as possible. Indexes are very familiar to database programmers but many SAS programmers, especially beginners, are not as familiar with the use of indexes. Using data without indexes is similar to trying to find information in a book without an index or table of contents. Even after a project has started, it's always possible to go back and add indexes to the data to speed up access. Oracle has tools that can help a programmer determine which indexes should be added to speed up database access; the system database admins (DBAs) can help with the use of those tools.

There are many, many methods to speed up data access. This paper will list the methods the author has used over the years that have worked well. A simple Google search on the topic of SQL efficiency will find other methods that are not covered in this paper.
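As a minimal sketch of the indexing advice above for a SAS data set, the following code builds a small test table, adds a simple index to it, and turns on MSGLEVEL=I so that the log reports whether the index was used. The data set work.orders and its variables are hypothetical and exist only for illustration.

```sas
/* Hypothetical test data: 100,000 orders spread over 1,000 products */
data work.orders;
  do order_id = 1 to 100000;
    prod_id = mod(order_id, 1000) + 1;
    amount  = mod(order_id, 50);
    output;
  end;
run;

/* Add a simple index on prod_id to the existing data set */
proc datasets library=work nolist;
  modify orders;
  index create prod_id;
quit;

/* MSGLEVEL=I writes index usage notes to the log;
   FULLSTIMER reports detailed timing for each step */
options msglevel=i fullstimer;

/* A WHERE clause on the indexed variable lets SAS consider the index */
data work.one_product;
  set work.orders;
  where prod_id = 42;
run;
```

Whether SAS actually chooses the index depends on how selective it estimates the WHERE clause to be, so the log note is the thing to check rather than assume.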
DBAs can be very helpful in making database queries run faster. If there is a query or set of queries that is running long, a good first step is to get the DBAs to take a look at the query while it is running to see exactly how the query is being processed by the database. In some cases, the query optimizer will not optimize the query because of the way the SQL code is written. The DBAs can make suggestions about how to rewrite the query to make it run better.

The first method is to drop indexes and constraints when adding data to a database table. After the data is loaded, the indexes and constraints are restored. This speeds up the process of data loading, because it's faster to restore the indexes than to update them every time a record is loaded. This is very important if you are importing millions of records during a data load. There is one problem to watch out for with this method: you have to make sure the data being loaded is very clean. If the data is not clean, it could cause problems later when the indexes and constraints are put back into the tables.

The second method for speeding up database access is using the EXISTS operator in SQL rather than the IN operator. For example

    Select * from table_a a
    Where exists (select * from orders o where a.prod_id = o.prod_id);

is the best way to write an SQL statement with a subquery.

The third method is to avoid using SQL functions in WHERE clauses or predicate clauses. An expression using a column, such as a function having a column as an argument, can cause the SQL optimizer to ignore the use of an index on that column. This is an example of SQL code that should not be used:

    Where to_number(SUBSTR(a.order_no, INSTR(b.order_no, '.') - 1)) = to_number(SUBSTR(a.order_no, INSTR(b.order_no, '.') - 1))

Another example of a problem with an index, this time caused by an operator rather than a function, is:

    Select name From orders Where amount != 0;
The problem with this query is that an index cannot tell you what is not in the data! So the index is not used in this case. Change this WHERE clause to

    Where amount > 0;

And the index will be used. (The two conditions are equivalent only when amount is never negative.)

The fourth method is advice to not use HAVING clauses in select statements. The reason for this is simple: HAVING only filters rows after all the rows have been returned. In most queries, you do not want all rows returned, just a subset of rows. Therefore, only use HAVING when summary operations are applied to columns restricted by the WHERE clause.

    Select state from orders where state = 'NC' group by state;

Is much faster than

    Select state from orders group by state having state = 'NC';

Another method is to minimize the use of subqueries; instead use join statements when the data is contained in a small number of tables. Instead of this query:

    Select ename
    From employees emp
    where exists (select price from prices where prod_id = emp.prod_id and class = 'J');

Use this query instead:

    Select ename
    From prices pr, employees emp
    where pr.prod_id = emp.prod_id and class = 'J';

The order that tables are listed in the SQL statement can greatly impact the speed of a query. In most cases, the table with the greatest number of rows should be listed first in a query. The reason is that the SQL parser moves from right to left rather than left to right. It scans the last table listed, and merges all of the rows from the first table with the rows in the last table.
For example, if table Tab1 has 20,000 rows and Tab2 has 1 row then

    Select count(*) from Tab1, Tab2

is the best way to write the query, instead of

    Select count(*) from Tab2, Tab1

When querying data from multiple tables it's very common to perform a join between the tables. However, a join is not always needed. A very simple way to query two tables with one query is to use the following code.

    Select A.name, a.grade, B.name, b.grade
    From emp a, empx b
    Where b.emp_no = 1010 and a.emp_no = 2010;

When performing a join with DISTINCT, it's much more efficient to use EXISTS rather than DISTINCT:

    Select date, name
    From sales s
    Where exists (select 'X' from Employee emp Where emp.prod_id = s.prod_id);

('X' is a dummy value that is needed to make this query work correctly.) This is better than

    Select distinct date, name
    From sales s, employee emp
    Where s.prod_id = emp.prod_id;

EXISTS is a faster alternative because the database realizes that when the subquery has been satisfied once, the query can be terminated.
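The EXISTS versus DISTINCT comparison above can be tried on a small scale with PROC SQL. The following is only a sketch: the tables work.sales and work.employee, their columns, and their contents are hypothetical, and on data this small no speed difference will be visible. It simply shows that the two forms can be written side by side and their results compared.

```sas
/* Hypothetical tables: a few sales rows and the products each employee handles */
data work.sales;
  input prod_id sale_date :date9. name $;
  format sale_date date9.;
  datalines;
1 01JAN2013 Smith
2 02JAN2013 Jones
3 03JAN2013 Brown
;
run;

data work.employee;
  input emp_id prod_id;
  datalines;
10 1
11 1
12 2
;
run;

proc sql;
  /* EXISTS: a sales row qualifies as soon as one matching employee row is found */
  create table work.with_exists as
  select s.sale_date, s.name
  from work.sales s
  where exists (select 1 from work.employee emp
                where emp.prod_id = s.prod_id);

  /* DISTINCT join: the join multiplies rows, then duplicates are removed */
  create table work.with_distinct as
  select distinct s.sale_date, s.name
  from work.sales s, work.employee emp
  where s.prod_id = emp.prod_id;
quit;
```

The two result tables match here because no (sale_date, name) pair is repeated in work.sales. If the outer table itself contained duplicate rows, the EXISTS form would keep them while DISTINCT would not, so the rewrite should be checked against the real data before it is adopted.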
The performance of GROUP BY queries can be improved by removing unneeded rows early in the selection process. The following queries return the same data. However, the second query is potentially faster, since rows will be removed from the query before the set operators are applied.

    Select job, avg(pay_rate) From employees Group by job Having job = 'Manager';

Is not as good as

    Select job, avg(pay_rate) From employees Where job = 'Manager' Group by job;

SAS MACRO SPEED INCREASES
It's very common for big data projects that use SAS to employ a lot of SAS macros. Macros save a lot of time in coding, and they also make code much easier to reuse and debug. The downside to macros is that if they are not used correctly, they can actually slow down a program rather than speed it up. This is especially true if some of the macro debugging features are left turned on after the code is fully tested and ready to be put into production status. Here are some techniques to use to make sure that macros do not slow down a program.
The basic tip for using macros is that after debugging the macro is complete, set the system options NOMLOGIC, NOMPRINT, NOMRECALL, and NOSYMBOLGEN. If these options are not used for a production job, there are two main problems that can happen. First, the log file can grow to be very large. In some programs with a lot of code that loops many times, the log file can grow so large that it can fill up the disk and cause the program to crash. The other problem is that writing out all those log messages can greatly reduce the speed of the program because disk I/O is a very slow process.

The first macro technique is to use compiled macros. A compiled macro runs faster because it does not need to be parsed or compiled when the program runs. Macros should not be compiled until they are fully tested and debugged. One caution with compiled macros is that once they are compiled, they cannot be converted back into readable SAS source code. It is essential to store the macro code in a safe place so that it can be modified or added to at a later date. Another advantage of compiled macros is that the code is not visible to the user. This is important if you are giving the code to a customer to use but they should not be allowed to view the source code.

Another way to speed up macros is to avoid nested macros where a macro is defined inside another macro. This is not optimal because the inner macro is recompiled every time the outer macro is executed. And when you are processing millions of records, recompiling a macro for each record can really slow down the program. In general, it's also easier to maintain macros that are not nested. It's much better to define two or more macros separately as shown below:

    %macro m1;
    %mend m1;
    %macro m2;
    %mend m2;

Instead of

    %macro m1;
      %macro m2;
      %mend m2;
    %mend m1;

Calling a macro from a macro will not slow down the processing, because it will not cause the called macro to be recompiled every time the main macro is called:

    %macro test1;
      %another_macro  /* this macro was defined outside of macro test1 */
    %mend test1;
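The production options and the compiled-macro technique described above might be set up along the following lines. This is a sketch under assumptions: the library path is a placeholder that must be replaced with a real directory, and the macro row_count is a hypothetical example, not code from any SSO project.

```sas
/* Turn macro debugging output off for production runs */
options nomlogic nomprint nomrecall nosymbolgen;

/* Placeholder path: replace with a real, permanent directory */
libname mymacs "/path/to/compiled/macros";

/* MSTORED enables stored compiled macros; SASMSTORE= names the library */
options mstored sasmstore=mymacs;

/* The STORE option compiles the macro once and saves it in the catalog.
   row_count is a hypothetical macro used only for illustration. */
%macro row_count(ds) / store;
  %local dsid n rc;
  %let dsid = %sysfunc(open(&ds));
  %let n    = %sysfunc(attrn(&dsid, nlobs));
  %let rc   = %sysfunc(close(&dsid));
  %put NOTE: &ds has &n observations.;
%mend row_count;

/* Later sessions that set the same two options can call the stored macro
   without recompiling it */
%row_count(sashelp.class)
```

Because the STORE option is used without saving the source, the original macro text still has to be kept in a safe place, as noted above.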
Although %include looks like part of the macro language (it is actually a global SAS statement), one big difference is that any code that is put into the program with %include is not compiled as a macro. Therefore, it will run faster than a normal macro. The best use for %include along with macros is to put simple statements in the include file:

    %let dept_name = Sales;
    %let number_div = 4;

Another good idea for using %include to speed up a program is that an external shell program that calls SAS can write out values as SAS code into a text file, which is then included into the SAS code at the time of execution. This technique allows one SAS program to be written that is very flexible. The way the program is run depends on inputs passed to the code from the external program. The external program can be written in a variety of languages such as C, C++, HTML, or Java. This method also means that the SAS code never has to be changed. Therefore, there is less chance of bugs being introduced into the program. The SAS code can even be stored as read-only so that no changes can be made to the source code. Here is how this method works:
- SAS source code is written with %include statements to subset data
- The external shell program collects information from the end user, for example Species = Mice
- The external shell program writes out a line for each piece of information collected: %let species = Mice; or if species = 'Mice';
- SAS program is called from the shell program and %include files are executed
- Program runs faster because the correct data subset is used
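The steps above can be sketched entirely in SAS for testing purposes. In this sketch a DATA _NULL_ step stands in for the external shell program, and the data set work.animals, its variables, and the fileref params are all hypothetical.

```sas
/* Hypothetical input data */
data work.animals;
  input species $ weight;
  datalines;
Mice 20
Rats 300
Mice 25
;
run;

/* Stand-in for the external shell program: write one line of SAS code
   to a temporary file. Single quotes keep %let from being resolved here. */
filename params temp;
data _null_;
  file params;
  put '%let species = Mice;';
run;

/* The main program never changes; it only includes the generated file */
%include params;

/* Only the requested subset is read into the rest of the job */
data work.subset;
  set work.animals;
  where species = "&species";
run;
```

In a real deployment the generated file would be written by the calling program before SAS starts, and the fileref would point to that file instead of a temporary one.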
The external shell program can be simple or complex. The main goal is to collect information to speed up the execution of the SAS program by making sure the correct data is used. The shell program can also collect information from users to use in formatting of tables, colors, output format (such as ODS methods), log location, output location, and so on.

GRID COMPUTING OPTION
The best option to speed up a big data project with SAS is to use grid computing. This option is not inexpensive or simple to set up but for now it is the ultimate way to increase computing power and decrease processing time for SAS processing. SSO currently uses a grid system for one of our large retail projects that uses a lot of data, and has a very tight timeline to complete the daily and weekly processing.
The key points for a grid environment at SSO are:

- Windows servers for end user access
- One SAS server
- One database server
- Four or more grid servers, which are used for the main computing
- RAID disk arrays for fast access to data
- High speed (10 Gb) access between main server and grid nodes
- SAS Grid computing software package

Here is a diagram of a typical grid computing layout used at SSO:

Figure 1. Grid Computing Layout for SSO

In this example, there are 12 grid nodes, each with 12 cores and 64 GB of RAM (the diagram says 12 of 40 nodes because the other 28 nodes are used for development and test servers). The DMZ means those systems and disk are located together in one location. There is a DMZ for the main SAS server and an Oracle server, and a separate DMZ for the grid nodes. The key advantages of grid computing are:
- Improved program distribution and CPU utilization
- Can be used for multiple users and multiple applications
- Job, queue, and host management services
- Grid nodes can be set up as hot backups for main server
- Simplifies administration of multiple systems
- Allows easy maintenance since grid nodes can be shut down without disrupting application
- Provides real-time monitoring of systems and applications
In SSO, the grid nodes are not used for data loading (ETL) or reporting. They are used only for the number crunching aspect of the project. The normal data flow for a grid computing project in SSO is as follows:
1. Data is loaded daily and weekly using the SAS server and Oracle server
2. Data is partitioned into subsets that match the number of grid nodes (10 sets for 10 grid nodes). The user can select the method for partitioning
3. During processing, the data is copied to the grid nodes for complex calculations
4. When processing is done, the results are copied back to the SAS and Oracle servers
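Step 2 of the data flow above leaves the partitioning method to the user. One simple possibility, shown here purely as an assumption and not as the method used at SSO, is round-robin assignment by record number. The data set names and the macro variable n_nodes are hypothetical.

```sas
%let n_nodes = 10;   /* hypothetical number of grid nodes */

/* Hypothetical input: 1,000 records to be processed */
data work.all_records;
  do record_id = 1 to 1000;
    output;
  end;
run;

/* Round-robin assignment: record 1 goes to partition 1, record 2 to
   partition 2, and so on, wrapping around after &n_nodes */
data work.partitioned;
  set work.all_records;
  partition = mod(_n_ - 1, &n_nodes) + 1;
run;

/* Check how many records landed in each partition */
proc freq data=work.partitioned;
  tables partition;
run;
```

Round-robin keeps the partitions the same size, but it ignores any grouping in the data. When the calculations need all records for a store or product together, partitioning on that key would be the more natural choice.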
It is technically possible to use a grid computing architecture without using the extra grid node systems. This method still partitions the data, and processes the data in smaller batches. But, it is not as fast as the full grid system shown above. Of course, it is important to point out the disadvantages of a grid system:

- Greatly increased cost for software and hardware vs. a non-grid system
- More points of potential failure (grid nodes, connections, etc.)
- Harder to set up and maintain a grid vs. a single server system
- A grid might not speed up all processing, such as data loads
- Extra CPU processing time is used copying data back and forth to the grid nodes
- More disk I/O is used in a grid system
GENERAL SAS PROGRAMMING IDEAS FOR FASTER PROCESSING OF BIG DATA
The topics covered so far in this paper might not apply to all SAS programs. The rest of this paper will give advice on standard SAS programming: the data step along with various procedures (procs). This is an important area to consider because it is applicable to a wide variety of programs. Not every SAS program will use SQL or macros, but nearly every SAS program will use at least one data step.

The basic idea to speed up processing with the data step is to reduce the amount of work that SAS needs to do. One simple way to do this is to make sure that when possible, a section of code is only executed one time instead of many times (one time for each record in the data). For example, the easiest way to do this is by using the RETAIN statement.

Another little known method for increasing the speed of calculations in a data step involves the use of missing values. If a variable is known to have a lot of missing values, it is a best practice to list that variable last in a mathematical expression. For example, if the variable T4 has a lot of missing values then

    Total = (x*b) + c*(abc) + T4;

Is more efficient than

    Total = T4 + (x*b) + c*(abc);

The reason for this is that if the missing value is early, that missing value is propagated through all the calculations and SAS has to use more CPU time to compute the values and keep track of the missing values. It is also a good idea to check for a missing value first by using code like this:

    If T4 ne . then do;
      Total = (x*b) + c*(abc) + T4;
    End;

In most cases, PROC FORMAT is a much faster way to assign values to data rather than using a long list of IF-THEN statements. Statements like this:

    if educ = 0 then neweduc = "

can be replaced with a format applied through the PUT function:

    neweduc = put(educ, educf.);
    run;

In a similar manner, the use of the IN operator uses less CPU time than a group of OR conditions. Instead of

    If x = 8 or x = 9 or x = 23 or x = 45 then do;

Use

    If x in (8, 9, 23, 45) then do;

The reason for this change is that with the use of OR, SAS checks all the conditions. The IN operator stops after it finds the first matching number to make the expression true.

SAS uses more CPU time when it has to process larger volumes of data. A very easy way to reduce the size of the data is to avoid using the default data size for variables. By default, all SAS numeric variables have a size of 8 bytes. For many variables, 8 bytes is much larger than is needed. For example, a variable that is used for the age of a person can easily be stored in 3 bytes, which means that the size of the data for that one variable has been reduced by 5/8 or 62.5%. When dealing with very large data sets that number in the hundreds of millions of records, the CPU processing time savings can be substantial.

Many programs that were written in older versions of SAS can be changed to take advantage of more modern SAS programming features. In older versions of SAS, procedures could not run on a subset of data. If the analysis needed to be run on just one sex for example, a new data set was created that included just members of the sex needed. Now it's much quicker to use a subset statement in the procedure statements such as

    Proc freq;
      where sex = 'Male';
    run;

Or

    Proc means;
      where sex = 'Female';
    run;
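The PROC FORMAT lookup and the shortened numeric length discussed above can be combined in one small data step. The sketch below is illustrative only: the education codes, their labels, and the data set work.people are hypothetical and are not the values from the original example.

```sas
/* Hypothetical education codes and labels, for illustration only */
proc format;
  value educf
    0     = 'None'
    1     = 'Primary'
    2     = 'Secondary'
    3     = 'College'
    other = 'Unknown';
run;

/* Hypothetical input data */
data work.people;
  input educ age;
  datalines;
0 34
2 51
3 28
9 45
;
run;

data work.people2;
  /* Age is a small whole number, so 3 bytes is enough to store it exactly */
  length age 3 neweduc $9;
  set work.people;
  /* One format lookup replaces a long IF-THEN/ELSE chain */
  neweduc = put(educ, educf.);
run;
```

Short numeric lengths are only safe for whole numbers that fit in the reduced size. Fractional values or large integers can lose precision when stored in fewer than 8 bytes, so the LENGTH statement should be limited to variables such as ages, counts, and small codes.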
In many cases, it is possible to write code using either traditional SAS DATA steps and PROCs or write code using SQL statements in place of the DATA steps and PROCs. These two pieces of code produce the same results:

    Data abc;
      Set old_data;
      Keep name date city;
    run;
    Proc sort;
      By name date city;
    run;

Vs

    Proc sql;
      Create table abc as
      Select name, date, city
      From old_data
      Order by name, date, city;
    quit;

One advantage of the SQL code is that it is more compact and easier to read (assuming knowledge of SQL programming). The question comes up: which method is faster? The SQL code appears to be faster since there is one step versus two steps in the data step code. The truth is there is no easy answer to that question. The answer to which is faster really depends on many different factors:

- Amount of data processed
- How many indexes are used
- Hardware and software configuration (Windows versus Linux or Unix, PC versus Mainframe, and so on)
- Type of analysis needed: can it be done in the database or only in SAS
If possible, it's a good idea to test DATA steps and PROCs versus SQL processing on a small subset of data to determine which method is fastest.
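A timing test of this kind might look like the following sketch. The test data set work.old_data is generated here only so the example is self-contained, and the macro variable test_obs is a hypothetical name for the subset size. The OBS= data set option is used to limit the rows read by each approach.

```sas
/* Hypothetical test data with the three variables used in the example */
data work.old_data;
  length name $8 city $8;
  format date date9.;
  do i = 1 to 100000;
    name = cats('N', mod(i, 100));
    city = cats('C', mod(i, 7));
    date = '01JAN2013'd + mod(i, 365);
    output;
  end;
  drop i;
run;

options fullstimer;     /* report real time and CPU time for each step */
%let test_obs = 10000;  /* size of the test subset */

/* Approach 1: DATA step followed by PROC SORT */
data work.abc;
  set work.old_data(obs=&test_obs keep=name date city);
run;
proc sort data=work.abc;
  by name date city;
run;

/* Approach 2: a single PROC SQL step */
proc sql;
  create table work.abc_sql as
  select name, date, city
  from work.old_data(obs=&test_obs)
  order by name, date, city;
quit;
```

The "real time" values written to the log for each step are the numbers to compare. Results on a small subset do not always scale to the full data, so it is worth repeating the test with a few different subset sizes.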
SAS IN-DATABASE PRODUCT
SAS has a relatively new product called SAS In-Database. The big advantage of this product is that it allows SAS jobs to run directly in the database server. Most databases have a very limited feature set for statistical analysis; adding SAS directly into a database greatly increases the amount of analysis that can be done without needing to pull data into SAS (using SQL or a DATA step) and potentially sending the results back to the database. Currently, In-Database works on the following databases: Aster database, EMC Greenplum, IBM DB2 and Netezza, Oracle, and Teradata. In-Database uses massive parallel processing (MPP) to enhance system scalability and performance. It's much better to move the processing to the data rather than move the data to SAS, especially considering the fact that I/O is one of the main factors that can slow down the speed of a program. The three parts of In-Database are:

- SAS Scoring Accelerator
- Analytics Accelerator for Teradata
- SAS Analytic Advantage for Teradata
THREADS AND CPUCOUNT OPTIONS
These two options can be very helpful for speeding up processing but it's important to be careful when you are using them. In general, it's best to use them only for very large data sets. Using them on smaller data sets might actually slow down processing. The SAS system will decide if these options are actually used based on different factors such as number of CPUs installed in the system, or options selected for a given DATA step or procedure that is used. It's also a very good idea to test the threads option versus no threads to make sure that the speed does increase by using threads. The best way to use the CPUCOUNT and THREADS options is:

    Options threads cpucount=actual;

The ACTUAL value on the CPUCOUNT option tells SAS to use the actual number of CPUs installed in the system. It might be tempting to try to use a higher number for CPUs. In reality, it does not work that way. Also, this option means the programmer does not have to spend time learning how many CPUs are in the system. A simple way to explain the threads option is that it divides the work up into smaller chunks so that they can be worked on in parallel by different CPUs. This is very helpful in many different SAS procedures. If the program uses PROC SQL with the pass-through option, the threads option will have no impact because the SQL code is passed to the database where it is executed. The pass-through option treats the database as a sort of black box that SAS has no control over. However, it is possible the database system might use its own version of multi-threading to speed up processing within the database. Another important fact about the threads option is that the results can vary depending on what type of hardware is used to run the SAS program. For example, a program that uses threads on Linux might not work as well if the source code is run on Windows or a mainframe system.

DON'T FORGET ABOUT MAKING OLD/EXISTING CODE RUN FASTER!
Programmers read technical papers such as this one and decide to start using these techniques in the future. While that is a very good idea, all the information in this article can (and should) be used to examine old code to determine if the old code can be improved. Just because old code has been running without problems (sometimes for years) does not mean that the code is efficient. "If it is not broken, don't fix it" is a good saying but sometimes a program might be broken even when it produces the correct results. In this context, broken means that the code can be changed to run faster while still producing the correct results.
CONCLUSION
With data volumes increasing all the time, it is important to always be mindful of ways to speed up processing. It can be very tempting to simply throw money at the problem by buying faster or bigger servers to run the SAS code. Better hardware can potentially speed up the processing, but it is far from the cheapest way to increase performance. This paper presented two basic ways to decrease processing time for big data projects: by using better programming techniques with SQL, SAS macros, and general SAS programming techniques, and by using multiple servers in a grid environment. The first three methods can be implemented at very low costs, so they should be evaluated for all projects. For organizations with larger budgets or very large amounts of data, the grid environment is a good choice to investigate.

ACKNOWLEDGEMENTS
I would like to thank the writing staff at SAS for editing help on this paper. I would also like to thank all my coworkers at RTI, SRA and SAS who have helped me become a better SAS programmer throughout my career. Special thanks to Dr. Bill Sanders at the University of Tennessee (and later the director of the SAS EVAAS group) who showed me SAS programming for the very first time.
CONTACT INFORMATION
Kevin McGowan
SAS Solutions on Demand
Kevin.McGowan@sas.com
(919) 531-2731
http://www.sas.com

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.