
  • Paper 036-2013

    Big Data, Fast Processing Speeds
    Kevin McGowan, SAS Solutions on Demand, Cary NC

    ABSTRACT

    As data sets continue to grow, it is important for programs to be written very efficiently to make sure no time is wasted processing data. This paper covers various techniques to speed up data processing time for very large data sets or databases, including PROC SQL, the DATA step, indexes, and SAS macros. Some of these procedures may result in just a slight speed increase, but when you process 500 million records per day, even a 10% increase is very good. The paper includes actual time comparisons to demonstrate the speed increases using the new techniques.

    INTRODUCTION

    More organizations are running into problems with processing big data every day. The bigger the data, the longer the processing time in most cases. Many projects have tight time constraints that must be met because of contractual agreements. When the data size increases, it can mean that the processing time will be longer than the allotted time to process the data. Since the amount of data cannot be reduced (except in rare cases), the best solution is to seek out methods to reduce the run time of programs by making them more efficient. This is also a cheaper method than simply spending a lot of money to buy bigger/faster hardware, which may or may not speed up the processing time.

    Important Note: In this paper, whenever code is presented that is efficient it will be shown in green. Code that should not be used is shown in red.

    WHAT IS BIG DATA?

    There are many different definitions of big data, and more definitions are being created every day. If you ask 10 people, you will probably get 10 different definitions. At SAS Solutions on Demand (SSO) we have many projects that would be considered big data projects. Some of these projects have jobs that run anywhere from 16 to 40 hours because of the large amount of data and complex calculations that are performed on each record or data point.

    These projects also have very large and fast servers. One example of a typical SAS server that is used by SSO has these specifications:

  • 24 CPU cores
  • 256 GB of RAM
  • 5+ TB of disk space
  • Very fast RAID disk drive arrays with advanced caches
  • Linux or AIX operating system
  • Very high speed internal network connections (up to 10 Gb per second)
  • Encrypted data transfers between servers
  • Production, Test, and Development servers that are identical
  • Grid computing system (for one large project)

    The projects also use a three-tiered system where the main server is supported by an Oracle database server and a front-end terminal server for the end users. The support servers are typically sized about the same as the SAS server. In most cases the data is split: some data is stored in SAS data sets, while the data used most by the end users is stored in Oracle tables. This setup was chosen because Oracle tables allow faster access to data in real time.

    Even with this large amount of computing horsepower, it still takes a long time to run a single job because of the large amount of data the projects use. Most of these projects process over 200 million records during a single run. The data is complex, and requires a large number of calculations to produce the desired results.

    BEST WAYS TO MEASURE SPEED IMPROVEMENTS

    SAS has several system options that are very helpful for determining the level of increase in performance. In this paper, we will focus on the actual clock time (not CPU time) the program takes to run. In an environment with multiple CPUs, the CPU time statistic can be confusing; it's even possible that CPU time can be greater than the clock time. The most important SAS options for measuring performance are listed below with a short description:

  • STIMER / FULLSTIMER: These options control the amount of data produced for CPU usage. FULLSTIMER is the preferred option for debugging and trying to increase program speed.
  • MEMRPT: This option shows the amount of memory usage for each step. While memory usage is not directly related to program speed in all cases, this data can be helpful when used along with the CPU time data.
  • MSGLVL=I: This option outputs information about the index usage during the execution of the program.

  • OPTIONS OBS=N: This option can be very useful to test programs on a small subset of data. Care must be taken to make sure that this option is turned off for production.

    DATABASE/DATA SET ACCESS SPEED IMPROVEMENTS USING SQL

    Since many SAS programmers access data that is stored in a relational database as well as SAS data sets, this is a key area that can be changed to improve program speed. In an ideal world, the SAS programmers would be able to help design the database layout. But that is not always possible. In this paper, we will assume that the database design is complete, and the SAS programmer accesses the data with no ability to change the database structure. There are three main ways SAS developers access data in a relational database:

  • PROC SQL
  • LIBNAME access to a database
  • Converting database data to text files (this should be a last resort when no other method works, such as when using custom database software)

    Most of the methods described here will work for all three methods of database access. One of the primary reasons to speed up database access is that it is typically one of the easiest ways to speed up a program. Database access normally uses a lot of input/output (I/O) to disk, which is slower than reading data from memory. Advanced disk drive systems can cache data in memory for faster access (compared to disk), but it's best to assume that the system you are using does not have data cached in memory.

    The simplest way to speed access to either databases or SAS data sets is to make sure you are using indexes as much as possible. Indexes are very familiar to database programmers, but many SAS programmers, especially beginners, are not as familiar with the use of indexes. Using data without indexes is similar to trying to find information in a book without an index or table of contents. Even after a project has started, it's always possible to go back and add indexes to the data to speed up access. Oracle has tools that can help a programmer determine which indexes should be added to speed up database access; the system database admins (DBAs) can help with the use of those tools.

    There are many, many methods to speed up data access. This paper will list the methods the author has used over the years that have worked well. A simple Google search on the topic of SQL efficiency will find other methods that are not covered in this paper.
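
    As a minimal sketch of the indexing advice for SAS data sets, the step below builds a simple index and then subsets on the indexed column with MSGLVL=I and FULLSTIMER turned on, so the log reports index usage and timing. The data set WORK.ORDERS and its columns are made up for illustration, and whether SAS actually chooses the index depends on the data and the WHERE clause.

```sas
/* Sketch only: WORK.ORDERS and its columns are hypothetical names. */
options fullstimer msglvl=i;   /* detailed timing plus index-usage notes in the log */

data work.orders;              /* small generated stand-in for a large table */
    do order_id = 1 to 100000;
        prod_id = mod(order_id, 500);
        amount  = order_id * 0.01;
        output;
    end;
run;

proc datasets library=work nolist;
    modify orders;
    index create prod_id;      /* simple index on the subsetting column */
quit;

data work.prod42;
    set work.orders;
    where prod_id = 42;        /* check the log to see whether the index was used */
run;
```

    Comparing the real time reported for the last step with and without the index gives a direct measure of whether the index helps for a given query.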

    DBAs can be very helpful in making database queries run faster. If there is a query or set of queries that is running long, a good first step is to get the DBAs to take a look at the query while it is running to see exactly how the query is being processed by the database. In some cases, the query optimizer will not optimize the query because of the way the SQL code is written. The DBAs can make suggestions about how to rewrite the query to make it run better.

    The first method is to drop indexes and constraints when adding data to a database table. After the data is loaded, the indexes and constraints are restored. This speeds up the process of data loading, because it's faster to restore the indexes than to update them every time a record is loaded. This is very important if you are importing millions of records during a data load. There is one problem to watch out for with this method: you have to make sure the data being loaded is very clean. If the data is not clean, it could cause problems later when the indexes and constraints are put back into the tables.

    The second method for speeding up database access is using the EXISTS condition in SQL rather than the IN condition. For example,

        select * from table_a a
        where exists (select * from orders o where a.prod_id = o.prod_id);

    is the best way to write an SQL statement with a subquery.

    The third method is to avoid using SQL functions in WHERE clauses or predicate clauses. An expression using a column, such as a function having a column as an argument, can cause the SQL optimizer to ignore the use of an index on that column. This is an example of SQL code that should not be used:

        where to_number(substr(a.order_no, instr(b.order_no, '.') - 1)) = to_number(substr(a.order_no, instr(b.order_no, '.') - 1))

    Another example of a condition that prevents the use of an index is:

        select name from orders where amount != 0;

    The problem with this query is that an index cannot tell you what is not in the data! So the index is not used in this case. If amounts are never negative, change this WHERE clause to

        where amount > 0;

    and the index will be used.

    The fourth method is advice to not use HAVING clauses in select statements. The reason for this is simple: HAVING only filters rows after all the rows have been returned. In most queries, you do not want all rows returned, just a subset of rows. Therefore, only use HAVING when summary operations are applied to columns restricted by the WHERE clause.

        select state from orders where state = 'NC' group by state;

    is much faster than

        select state from orders group by state having state = 'NC';

    Another method is to minimize the use of subqueries; instead, use join statements when the data is contained in a small number of tables. Instead of this query:

        select ename from employees emp
        where exists (select price from prices where prod_id = emp.prod_id and class = 'J');

    use this query:

        select ename from prices pr, employees emp
        where pr.prod_id = emp.prod_id and class = 'J';

    The order in which tables are listed in the SQL statement can greatly impact the speed of a query. In most cases, the table with the greatest number of rows should be listed first in a query. The reason is that the SQL parser moves from right to left rather than left to right. It scans the last table listed, and merges all of the rows from the first table with the rows in the last table.

    For example, if table Tab1 has 20,000 rows and Tab2 has 1 row, then

        select count(*) from Tab1, Tab2

    is the best way to write the query, instead of

        select count(*) from Tab2, Tab1

    When querying data from multiple tables it's very common to perform a join between the tables. However, a join is not always needed. A very simple way to query two tables with one query is to use the following code:

        select a.name, a.grade, b.name, b.grade
        from emp a, empx b
        where b.emp_no = 1010 and a.emp_no = 2010;

    When a join is combined with DISTINCT, it's much more efficient to use EXISTS rather than DISTINCT:

        select date, name from sales s
        where exists (select 'X' from employee emp where emp.prod_id = s.prod_id);

    ('X' is a dummy value that is needed to make this query work correctly.) This is better than:

        select distinct date, name from sales s, employee emp
        where s.prod_id = emp.prod_id;

    EXISTS is a faster alternative because the database realizes that when the subquery has been satisfied once, the query can be terminated.
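
    Two of the rewrites described in this section can be sketched in PROC SQL as follows. The tables WORK.SALES and WORK.EMPLOYEE and their columns are hypothetical stand-ins built only so the queries have something to run against; timings on toy data like this say nothing about behavior on a real database.

```sas
/* Sketch only: table and column names are hypothetical. */
data work.sales;
    input prod_id sale_date :date9. name $ state $;
    format sale_date date9.;
    datalines;
1 01JAN2013 Smith NC
2 02JAN2013 Jones VA
3 03JAN2013 Brown NC
;
run;

data work.employee;
    input emp_id prod_id;
    datalines;
10 1
11 1
12 3
;
run;

proc sql;
    /* EXISTS: a sales row is kept once if any matching employee row exists, */
    /* so no DISTINCT is needed to remove duplicates created by a join.      */
    create table work.sales_with_emp as
    select s.sale_date, s.name
    from work.sales as s
    where exists (select 1
                  from work.employee as emp
                  where emp.prod_id = s.prod_id);

    /* WHERE removes rows before grouping; HAVING would filter afterwards. */
    create table work.nc_counts as
    select state, count(*) as n_sales
    from work.sales
    where state = 'NC'
    group by state;
quit;
```

    Note that the EXISTS form and the DISTINCT join return the same rows only when the selected columns are already unique in the outer table, so the result should be compared against the original query before switching.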

    The performance of GROUP BY queries can be improved by removing unneeded rows early in the selection process. The following queries return the same data. However, the second query is potentially faster, since rows will be removed from the query before the set operators are applied.

        select job, avg(pay_rate) from employees group by job having job = 'Manager';

    is not as good as

        select job, avg(pay_rate) from employees where job = 'Manager' group by job;

    SAS MACRO SPEED INCREASES

    It's very common for big data projects that use SAS to employ a lot of SAS macros. Macros save a lot of time in coding, and they also make code much easier to reuse and debug. The downside to macros is that if they are not used correctly, they can actually slow down a program rather than speed it up. This is especially true if some of the macro debugging features are left turned on once the code is fully tested and ready to be put into production status. Here are some techniques to use to make sure that macros do not slow down a program.

    The basic tip for using macros is that after debugging the macro is complete, set the system options NOMLOGIC, NOMPRINT, NOMRECALL, and NOSYMBOLGEN. If these options are not used for a production job, there are two main problems that can happen. First, the log file can grow to be very large. In some programs with a lot of code that loops many times, the log file can grow so large that it can fill up the disk and cause the program to crash. The other problem is that writing out all those log messages can greatly reduce the speed of the program because disk I/O is a very slow process.

    The first macro technique is to use compiled macros. A compiled macro runs faster because it does not need to be parsed or compiled when the program runs. Macros should not be compiled until they are fully tested and debugged. One caution with compiled macros is that once they are compiled, they cannot be converted back into readable SAS source code. It is essential to store the macro code in a safe place so that it can be modified or added to at a later date. Another advantage of compiled macros is that the code is not visible to the user. This is important if you are giving the code to a customer to use but they should not be allowed to view the source code.

    Another way to speed up macros is to avoid nested macros, where a macro is defined inside another macro. This is not optimal because the inner macro is recompiled every time the outer macro is executed. And when you are processing millions of records, recompiling a macro for each record can really slow down the program. In general, it's also easier to maintain macros that are not nested. It's much better to define two or more macros separately:

        %macro m1;
        %mend m1;
        %macro m2;
        %mend m2;

    instead of

        %macro m1;
            %macro m2;
            %mend m2;
        %mend m1;

    Calling a macro from a macro will not slow down the processing, because it will not cause the called macro to be recompiled every time the main macro is called:

        %macro test1;
            %another_macro   /* this macro was defined outside of macro test1 */
        %mend test1;
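
    A minimal sketch of a stored compiled macro combined with the production option settings is given here. The libref MYMACS, the macro name, and the data set are hypothetical; the WORK directory is used as the macro library only to keep the example self-contained, whereas a real project would point the libref at a permanent location.

```sas
/* Sketch only: libref, macro name, and data are hypothetical. */
libname mymacs "%sysfunc(pathname(work))";   /* permanent path in real use */
options mstored sasmstore=mymacs;            /* enable stored compiled macros */

data work.employees;
    input ename $ dept $;
    datalines;
Adams Sales
Baker Audit
Clark Sales
;
run;

/* The STORE option saves the compiled macro in the MYMACS catalog, */
/* so later sessions can call it without recompiling the source.    */
%macro subset_dept(dept) / store;
    data work.dept_&dept;
        set work.employees;
        where dept = "&dept";
    run;
%mend subset_dept;

/* Production settings: no macro debugging output in the log */
options nomlogic nomprint nomrecall nosymbolgen;

%subset_dept(Sales)
```

    The source file that defines the macro still needs to be kept under version control, as the caution above explains.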

    Although %include is technically part of the macro language, one big difference is that any code that is put into the program with %include is not compiled as a macro. Therefore, it will run faster than a normal macro. The best use for %include along with macros is to put simple statements in the include file:

        %let dept_name = Sales;
        %let number_div = 4;

    Another good idea for using %include to speed up a program is that an external shell program that calls SAS can write out values as SAS code into a text file, which is then included into the SAS code at the time of execution. This technique allows one SAS program to be written that is very flexible. The way the program is run depends on inputs passed to the code from the external program. The external program can be written in a variety of languages such as C, C++, HTML, or Java. This method also means that the SAS code never has to be changed. Therefore, there is less chance of bugs being introduced into the program. The SAS code can even be stored as read-only so that no changes can be made to the source code. Here is how this method works:

  • SAS source code is written with %include statements to subset data
  • The external shell program collects information from the end user, for example Species=Mice
  • The external shell program writes out a line for each piece of information collected: %let species = Mice; or if species = 'Mice';
  • The SAS program is called from the shell program and the %include files are executed
  • The program runs faster because the correct data subset is used
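
    The steps above can be sketched in SAS as follows. In a real setup the parameter file would be written by the external program before SAS starts; here a DATA _NULL_ step plays that role so the example runs on its own. The fileref PARAMS and the data set WORK.ANIMALS are hypothetical.

```sas
/* Sketch only: fileref and data set names are hypothetical. */
filename params temp;

/* Stand-in for the external shell program: write one line of SAS code. */
data _null_;
    file params;
    put '%let species = Mice;';
run;

%include params;            /* executes the generated %LET statement */

data work.animals;
    input species $ weight;
    datalines;
Mice 21
Rats 310
Mice 19
;
run;

/* The unchanged program subsets on whatever value was passed in. */
data work.subset;
    set work.animals;
    where species = "&species";
run;
```

    Because only the generated file changes between runs, the main program can stay read-only.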

    The external shell program can be simple or complex. The main goal is to collect information to speed up the execution of the SAS program by making sure the correct data is used. The shell program can also collect information from users to use in formatting of tables, colors, output format (such as ODS methods), log location, output location, and so on.

    GRID COMPUTING OPTION

    The best option to speed up a big data project with SAS is to use grid computing. This option is not inexpensive or simple to set up, but for now it is the ultimate way to increase computing power and decrease processing time for SAS processing. SSO currently uses a grid system for one of our large retail projects that uses a lot of data and has a very tight timeline to complete the daily and weekly processing.

    The key points for a grid environment at SSO are:

  • Windows servers for end user access
  • One SAS server
  • One database server
  • Four or more grid servers, which are used for the main computing
  • RAID disk arrays for fast access to data
  • High speed (10 Gb) access between main server and grid nodes
  • SAS Grid computing software package

    Here is a diagram of a typical grid computing layout used at SSO:

    Figure 1. Grid Computing Layout for SSO

    In this example, there are 12 grid nodes, each with 12 cores and 64 GB of RAM (the diagram says 12 of 40 nodes because the other 28 nodes are used for development and test servers). The DMZ means those systems and disks are located together in one location. There is a DMZ for the main SAS server and an Oracle server, and a separate DMZ for the grid nodes. The key advantages of grid computing are:

  • Improved program distribution and CPU utilization
  • Can be used for multiple users and multiple applications
  • Job, queue, and host management services
  • Grid nodes can be set up as hot backups for the main server
  • Simplifies administration of multiple systems
  • Allows easy maintenance, since grid nodes can be shut down without disrupting the application
  • Provides real-time monitoring of systems and applications

    In SSO, the grid nodes are not used for data loading (ETL) or reporting. They are used only for the number-crunching aspect of the project. The normal data flow for a grid computing project in SSO is as follows:

    1. Data is loaded daily and weekly using the SAS server and Oracle server
    2. Data is partitioned into subsets that match the number of grid nodes (10 sets for 10 grid nodes). The user can select the method for partitioning
    3. During processing, the data is copied to the grid nodes for complex calculations
    4. When processing is done, the results are copied back to the SAS and Oracle servers
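
    Step 2 of this flow can be sketched with a small macro that splits a data set into one subset per node. Round-robin assignment by observation number is an assumption made for this sketch; the paper does not say which partitioning methods SSO offers, and the macro and data set names are hypothetical. Copying the subsets to the grid nodes is not shown.

```sas
/* Sketch only: round-robin partitioning is an assumed method. */
%let n_nodes = 10;

data work.big;                      /* generated stand-in for the loaded data */
    do id = 1 to 1000;
        value = ranuni(1);
        output;
    end;
run;

%macro partition(in=, n=);
    %local i;
    /* One output data set per node: WORK.PART1 ... WORK.PART&n */
    data %do i = 1 %to &n; work.part&i %end; ;
        set &in;
        select (mod(_n_ - 1, &n) + 1);
            %do i = 1 %to &n;
            when (&i) output work.part&i;
            %end;
            otherwise;
        end;
    run;
%mend partition;

%partition(in=work.big, n=&n_nodes)
```

    Reading the input once and writing all subsets in a single DATA step avoids one full pass over the data per node.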

    It is technically possible to use a grid computing architecture without using the extra grid node systems. This method still partitions the data, and processes the data in smaller batches. But it is not as fast as the full grid system shown above. Of course, it is important to point out the disadvantages of a grid system:

  • Greatly increased cost for software and hardware vs. a non-grid system
  • More points of potential failure (grid nodes, connections, etc.)
  • Harder to set up and maintain a grid vs. a single server system
  • A grid might not speed up all processing, such as data loads
  • Extra CPU processing time is used copying data back and forth to the grid nodes
  • More disk I/O is used in a grid system

    GENERAL SAS PROGRAMMING IDEAS FOR FASTER PROCESSING OF BIG DATA

    The topics covered so far in this paper might not apply to all SAS programs. The rest of this paper will give advice on standard SAS programming: the DATA step along with various procedures (PROCs). This is an important area to consider because it is applicable to a wide variety of programs. Not every SAS program will use SQL or macros, but every SAS program will use at least one DATA step.

    The basic idea to speed up processing with the DATA step is to reduce the amount of work that SAS needs to do. One simple way to do this is to make sure that when possible, a section of code is only executed one time instead of many times (one time for each record in the data). For example, the easiest way to do this is by using the RETAIN statement.

    Another little known method for increasing the speed of calculations in a DATA step involves the use of missing values. If a variable is known to have a lot of missing values, it is a best practice to list that variable last in a mathematical expression. For example, if the variable T4 has a lot of missing values, then

        Total = (x*b) + c*(abc) + T4;

    is more efficient than

        Total = T4 + (x*b) + c*(abc);

    The reason for this is that if the missing value is early, that missing value is propagated through all the calculations and SAS has to use more CPU time to compute the values and keep track of the missing values. It is also a good idea to check for a missing value first by using code like this:

        if T4 ne . then do;
            Total = (x*b) + c*(abc) + T4;
        end;

    In most cases, PROC FORMAT is a much faster way to assign values to data rather than using a long list of IF-THEN statements. Statements like this:

        if educ = 0 then neweduc="

        neweduc = put(educ, educf.);
        run;

    In a similar manner, the use of the IN operator uses less CPU time than a group of OR conditions. Instead of

        if x = 8 or x = 9 or x = 23 or x = 45 then do;

    use

        if x in (8, 9, 23, 45) then do;

    The reason for this change is that with the use of OR, SAS checks all the conditions. The IN operator stops after it finds the first matching number to make the expression true.

    SAS uses more CPU time when it has to process larger volumes of data. A very easy way to reduce the size of the data is to avoid using the default data size for variables. By default, all SAS numeric variables have a size of 8 bytes. For many variables, 8 bytes is much larger than is needed. For example, a variable that is used for the age of a person can easily be stored in 3 bytes, which means that the size of the data for that one variable has been reduced by 5/8, or 62.5%. When dealing with very large data sets that number in the hundreds of millions of records, the CPU processing time savings can be substantial.

    Many programs that were written in older versions of SAS can be changed to take advantage of more modern SAS programming features. In older versions of SAS, procedures could not run on a subset of data. If the analysis needed to be run on just one sex, for example, a new data set was created that included just members of the sex needed. Now it's much quicker to use a WHERE statement in the procedure, such as

        proc freq; where sex = 'Male'; run;

    or

        proc means; where sex = 'Female'; run;
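
    Several of these DATA step ideas can be combined in one short sketch. The format labels, variable names, and data values below are invented for illustration; the labels used in the paper's own EDUCF format are not recoverable from this text.

```sas
/* Sketch only: format labels, variables, and data are hypothetical. */
proc format;
    value educf 0 = 'None'
                1 = 'High school'
                2 = 'College'
                other = 'Unknown';
run;

data work.people;
    length age 3;                    /* 3 bytes instead of the default 8 */
    input age educ sex $;
    neweduc = put(educ, educf.);     /* one lookup replaces a chain of IF-THENs */
    if age in (8, 9, 23, 45) then flag = 1;
    else flag = 0;
    datalines;
23 1 Male
45 2 Female
31 0 Female
;
run;

proc freq data=work.people;
    where sex = 'Female';            /* subset inside the procedure, no extra data set */
    tables neweduc;
run;
```

    A shortened numeric length is only safe for integers small enough to be stored exactly in that many bytes, which is comfortably true for an age but should be checked for other variables.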

    In many cases, it is possible to write code using either traditional SAS DATA steps and PROCs or SQL statements in place of the DATA steps and PROCs. These two pieces of code produce the same results:

        data abc;
            set old_data;
            keep name date city;
        run;
        proc sort;
            by name date city;
        run;

    vs.

        proc sql;
            create table abc as
            select name, date, city
            from old_data
            order by name, date, city;
        quit;

    One advantage of the SQL code is that it is more compact and easier to read (assuming knowledge of SQL programming). The question comes up: which method is faster? The SQL code appears to be faster since there is one step versus two steps in the DATA step code. The truth is there is no easy answer to that question. The answer to which is faster really depends on many different factors:

  • Amount of data processed
  • How many indexes are used
  • Hardware and software configuration (Windows versus Linux or Unix, PC versus mainframe, and so on)
  • Type of analysis needed: can it be done in the database or only in SAS

    If possible, it's a good idea to test DATA steps and PROCs versus SQL processing on a small subset of data to determine which method is fastest.
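
    One way to run such a comparison is sketched here: take a fixed subset with the OBS= data set option, turn on FULLSTIMER, run both versions, and compare the real time reported in the log. WORK.OLD_DATA is a generated stand-in and the subset size is arbitrary.

```sas
/* Sketch only: data and sizes are made up for the comparison. */
data work.old_data;
    length name $8 city $4;
    do i = 1 to 200000;
        name = cats('N', mod(i, 1000));
        date = mod(i, 365);
        city = cats('C', mod(i, 50));
        output;
    end;
    drop i;
run;

data work.sample;                    /* fixed subset used by both versions */
    set work.old_data(obs=50000);
run;

options fullstimer;                  /* log real time and CPU time per step */

/* Version 1: DATA step followed by PROC SORT */
data work.abc1;
    set work.sample;
    keep name date city;
run;
proc sort data=work.abc1;
    by name date city;
run;

/* Version 2: single PROC SQL step */
proc sql;
    create table work.abc2 as
    select name, date, city
    from work.sample
    order by name, date, city;
quit;
```

    Timings from a small subset are only a guide, so it is worth repeating the comparison at a larger size before committing to one approach.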

    SAS IN-DATABASE PRODUCT

    SAS has a relatively new product called SAS In-Database. The big advantage of this product is that it allows SAS jobs to run directly in the database server. Most databases have a very limited feature set for statistical analysis; adding SAS directly into a database greatly increases the amount of analysis that can be done without needing to pull data into SAS (using SQL or a DATA step) and potentially sending the results back to the database. Currently, In-Database works on the following databases: Aster database, EMC Greenplum, IBM DB2 and Netezza, Oracle, and Teradata. In-Database uses massive parallel processing (MPP) to enhance system scalability and performance. It's much better to move the processing to the data rather than move the data to SAS, especially considering the fact that I/O is one of the main factors that can slow down the speed of a program. The three parts of In-Database are:

  • SAS Scoring Accelerator
  • Analytics Accelerator for Teradata
  • SAS Analytic Advantage for Teradata

    THREADS AND CPUCOUNT OPTIONS

    These two options can be very helpful for speeding up processing, but it's important to be careful when you are using them. In general, it's best to use them only for very large data sets. Using them on smaller data sets might actually slow down processing. The SAS system will decide if these options are actually used based on different factors such as the number of CPUs installed in the system, or options selected for a given DATA step or procedure that is used. It's also a very good idea to test the THREADS option versus NOTHREADS to make sure that the speed does increase by using threads. The best way to use the CPUCOUNT and THREADS options is:

        options threads cpucount=actual;

    The ACTUAL value on the CPUCOUNT option tells SAS to use the actual number of CPUs installed in the system. It might be tempting to try to use a higher number for CPUs. In reality, it does not work that way. Also, this option means the programmer does not have to spend time learning how many CPUs are in the system.

    A simple way to explain the THREADS option is that it divides the work up into smaller chunks so that they can be worked on in parallel by different CPUs. This is very helpful in many different SAS procedures. If the program uses PROC SQL with the pass-through option, the THREADS option will have no impact because the SQL code is passed to the database where it is executed. The pass-through option treats the database as a sort of black box that SAS has no control over. However, it is possible the database system might use its own version of multithreading to speed up processing within the database.

    Another important fact about the THREADS option is that the results can vary depending on what type of hardware is used to run the SAS program. For example, a program that uses threads on Linux might not work as well if the source code is run on Windows or a mainframe system.

    DON'T FORGET ABOUT MAKING OLD/EXISTING CODE RUN FASTER!

    Programmers read technical papers such as this one and decide to start using these techniques in the future. While that is a very good idea, all the information in this article can (and should) be used to examine old code to determine if the old code can be improved. Just because old code has been running without problems (sometimes for years) does not mean that the code is efficient. "If it is not broken, don't fix it" is a good saying, but sometimes a program might be broken even when it produces the correct results. In this context, broken means that the code can be changed to run faster while still producing the correct results.

    CONCLUSION

    With data volumes increasing all the time, it is important to always be mindful of ways to speed up processing. It can be very tempting to simply throw money at the problem by buying faster or bigger servers to run the SAS code. Better hardware can potentially speed up the processing, but it is far from the cheapest way to increase performance. This paper presented two basic ways to decrease processing time for big data projects: by using better programming techniques with SQL, SAS macros, and general SAS programming techniques, and by using multiple servers in a grid environment. The first three methods can be implemented at very low cost, so they should be evaluated for all projects. For organizations with larger budgets or very large amounts of data, the grid environment is a good choice to investigate.

    ACKNOWLEDGEMENTS

    I would like to thank the writing staff at SAS for editing help on this paper. I would also like to thank all my co-workers at RTI, SRA, and SAS who have helped me become a better SAS programmer throughout my career. Special thanks to Dr. Bill Sanders at the University of Tennessee (and later the director of the SAS EVAAS group) who showed me SAS programming for the very first time.

    CONTACT INFORMATION

    Kevin McGowan
    SAS Solutions on Demand
    Kevin.McGowan@sas.com
    (919) 531-2731
    http://www.sas.com

    SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
