NEURAL NETWORKS with SAS ENTERPRISE MINER



NEURAL NETWORKS WITH SAS ENTERPRISE MINER

SCIENTIFIC BOOKS


CONTENTS

INTRODUCTION

SAS ENTERPRISE MINER WORKING ENVIRONMENT

1.1 SAS ENTERPRISE MINER INTRODUCTION

1.2 STARTING WITH SAS ENTERPRISE MINER

1.2.1 Create a New Project

1.2.2 Project Start Code

1.2.3 Create a New Process Flow Diagram

1.2.4 Create a Data Source

1.2.5 Connect Nodes in the Diagram Workspace

1.2.6 Run the Process Flow Diagram

1.2.7 View Results

1.2.8 Create a Model Package

1.3 SAS ENTERPRISE MINER USER INTERFACE

1.3.1 SAS Enterprise Miner Main Menu

1.3.2 The SAS Enterprise Miner Node Toolbar

NEURAL NETWORKS NODES IN SAS ENTERPRISE MINER

2.1 NEURAL NETWORK DESCRIPTION

2.2 NEURAL NETWORKS WITH SAS ENTERPRISE MINER

2.3 OPTIMIZATION AND ADJUSTMENT OF MODELS WITH NETS: NEURAL NETWORK NODE

2.3.1 Overview of Feedforward Neural Networks

2.3.2 Simple Neural Networks

2.3.3 Perceptrons

2.3.4 Hidden Layers

2.3.5 Multilayer Perceptrons (MLPs)

2.3.6 Radial Basis Function (RBF) Networks

2.3.7 Local Processing Networks

2.3.8 Width and Altitude

2.3.9 Ordinary RBF and Normalized RBF

2.3.10 Error Functions

2.3.11 Initialization

2.3.12 Preliminary Training

2.3.13 Training Techniques

2.3.14 Scoring

2.3.15 Preparing the Data

2.3.16 Objective Functions and Error Functions

2.3.17 Choice of Architecture

2.3.18 Neural Network Node Train Properties

2.3.19 Neural Network Node Results

2.3.20 Example with a Neural Network Model

2.4 AUTONEURAL NODE

2.4.1 Network Architectures

2.4.2 Example Using the Search Train Action

2.5 DMNEURAL NODE

2.5.1 Variable Requirements for the DMNeural Node

2.5.2 DMNeural Node Train Properties

2.5.3 DMNeural Node Train Properties: DMNeural Network

2.5.4 DMNeural Node Train Properties: Convergence Criteria

2.5.5 DMNeural Node Train Properties: Model Criteria

2.5.6 DMNeural Node Status Properties

2.5.7 DMNeural Node Results

2.6 TWOSTAGE NODE

2.6.1 TwoStage Node Output Data

2.7 SUPPORT VECTOR MACHINE (SVM) NODE

2.7.1 SVM Node Train Properties

2.7.2 SVM Node Train Properties: Regularization Parameter

2.7.3 SVM Node Train Properties: Kernel

2.7.4 SVM Node Train Properties: Cross Validation

2.7.5 SVM Node Train Properties: Sampling

2.7.6 SVM Node Train Properties: Print Options

2.7.7 SVM Node Status Properties

2.7.8 SVM Node Results

2.7.9 SVM Node Example

NEURAL NETWORK MODELS TO PREDICT RESPONSE AND RISK

3.1 A NEURAL NETWORK MODEL TO PREDICT RESPONSE

3.1.1 Input Data Node

3.1.2 Data Partition Node

3.2 SETTING THE NEURAL NETWORK NODE PROPERTIES

3.3 ASSESSING THE PREDICTIVE PERFORMANCE OF THE ESTIMATED MODEL

3.4 RECEIVER OPERATING CHARACTERISTIC (ROC) CHARTS

3.5 HOW DID THE NEURAL NETWORK NODE PICK THE OPTIMUM WEIGHTS FOR THIS MODEL?

3.6 SCORING A DATA SET USING THE NEURAL NETWORK MODEL

3.7 SCORE CODE

3.8 A NEURAL NETWORK MODEL TO PREDICT LOSS FREQUENCY IN AUTO INSURANCE

3.8.1 Loss Frequency as an Ordinal Target

3.8.2 Target Layer Combination and Activation Functions

3.9 DATA PREPARATION

3.9.1 Model Selection Criterion Property

3.9.2 Score Ranks in the Results Window

3.9.3 Scoring a New Data Set with the Model

3.9.4 Classification of Risks for Rate Setting in Auto Insurance with Predicted Probabilities

SPECIFIC NEURAL NETWORKS TO PREDICT RISK

4.1 ALTERNATIVE SPECIFICATIONS OF THE NEURAL NETWORKS

4.2 MULTILAYER PERCEPTRON (MLP) NEURAL NETWORK

4.3 A RADIAL BASIS FUNCTION (RBF) NEURAL NETWORK

4.4 COMPARISON OF ALTERNATIVE BUILT-IN ARCHITECTURES OF THE NEURAL NETWORK NODE

4.4.1 Multilayer Perceptron (MLP) Network

4.4.2 Ordinary Radial Basis Function with Equal Heights and Widths (ORBFEQ)

4.4.3 Ordinary Radial Basis Function with Equal Heights and Unequal Widths (ORBFUN)

4.4.4 Normalized Radial Basis Function with Equal Widths and Heights (NRBFEQ)

4.4.5 Normalized Radial Basis Function with Equal Heights and Unequal Widths (NRBFEH)

4.4.6 Normalized Radial Basis Function with Equal Widths and Unequal Heights (NRBFEW)

4.4.7 Normalized Radial Basis Function with Equal Volumes (NRBFEV)

4.4.8 Normalized Radial Basis Function with Unequal Widths and Heights (NRBFUN)

4.4.9 User-Specified Architectures

4.5 AUTONEURAL NODE

4.6 DMNEURAL NODE

4.7 DMINE REGRESSION NODE

4.8 COMPARING THE MODELS GENERATED BY DMNEURAL, AUTONEURAL, AND DMINE REGRESSION NODES

CLUSTER ANALYSIS WITH NEURAL NETWORKS

5.1 CLUSTER ANALYSIS ON ENTERPRISE MINER

5.2 CLUSTER NODE

5.2.1 Cluster Node General Properties

5.2.2 Cluster Node Train Properties

5.2.3 Cluster Node Train Properties: Number of Clusters

5.2.4 Cluster Node Train Properties: Selection Criterion

5.2.5 Cluster Node Score Properties

5.2.6 Cluster Node Report Properties

5.2.7 Cluster Node Status Properties

5.2.8 Cluster Node Results Window

5.2.9 Example

5.3 SOM/KOHONEN NODE

5.3.1 SOM/Kohonen Node Train Properties

5.3.2 SOM/Kohonen Node Train Properties: Segment

5.3.3 SOM/Kohonen Node Train Properties: Seed Options

5.3.4 SOM/Kohonen Node Train Properties: Batch SOM Training

5.3.5 SOM/Kohonen Node Train Properties: Local-Linear Options

5.3.6 SOM/Kohonen Node Train Properties: Nadaraya-Watson Options

5.3.7 SOM/Kohonen Node Train Properties: Kohonen VQ

5.3.8 SOM/Kohonen Node Train Properties: Kohonen

5.3.9 SOM/Kohonen Node Train Properties: Neighborhood Options

5.3.10 SOM/Kohonen Node Results

5.4 VARIABLE CLUSTERING NODE

5.4.1 Variable Clustering Node Algorithm

5.4.2 Variable Clustering Node and Missing Data Set Values

5.4.3 Variable Clustering Node and Large Data Sets

5.4.4 Variable Clustering Node and Variable Roles

5.4.5 Using the Variable Clustering Node

5.4.6 Variable Clustering Node Train Properties

5.4.7 Variable Clustering Node Train Properties: Stopping Criteria Properties

5.4.8 Variable Clustering Node Score Properties

5.4.9 Variable Clustering Node Status Properties

5.4.10 Example

5.4.11 Two-Stage Variable Clustering Example

5.4.12 Predictive Modeling with Variable Clustering Example

PREDICTIVE MODELS COMPARISON

6.1 MODELS COMPARISON

6.2 MODELS FOR BINARY TARGETS: AN EXAMPLE OF PREDICTING ATTRITION

6.2.1 Logistic Regression for Predicting Attrition

6.2.2 Decision Tree Model for Predicting Attrition

6.2.3 A Neural Network Model for Predicting Attrition

6.3 MODELS FOR ORDINAL TARGETS: AN EXAMPLE OF PREDICTING ACCIDENT RISK

6.3.1 Lift Charts and Capture Rates for Models with Ordinal Targets

6.3.2 Logistic Regression with Proportional Odds for Predicting Risk in Auto Insurance

6.3.3 Decision Tree Model for Predicting Risk in Auto Insurance

6.3.4 Neural Network Model for Predicting Risk in Auto Insurance

6.3.5 Comparison of All Three Accident Risk Models

6.4 BOOSTING AND COMBINING PREDICTIVE MODELS

6.5 MODEL VALIDATION

6.5.1 Stochastic Gradient Boosting

6.5.2 An Illustration of Boosting Using the Gradient Boosting Node

6.5.3 The Ensemble Node

6.5.4 Comparing the Gradient Boosting and Ensemble Methods of Combining Models


INTRODUCTION

A neural network can be defined as a set of highly interconnected information-processing elements that are able to learn from the information that feeds them. The main feature of this technology is that it can be applied to a large number of problems, ranging from complex practical problems to sophisticated theoretical models: image recognition, voice recognition, financial analysis, signal analysis and filtering, and so on. In this chapter we will see applications of neural networks for improving techniques of classification, discrimination, prediction, and so on.

A neural network consists of processing units that are called neurons or nodes. These nodes are organized into groups called "layers". There are usually three types of layers: an input layer, one or more hidden layers, and an output layer. Connections are established between adjacent nodes in each layer. The input layer, which presents data to the network, consists of input nodes that receive information directly from the outside. The output layer represents the response of the network to a given input, as this information is transferred to the outside. The hidden or intermediate layers, which process information and are interposed between the input and output layers, are the only ones that have no connection with the outside.

Successive chapters present examples that clarify the application of models in the field of neural networks. The examples are solved step by step with SAS Enterprise Miner in order to make the methodologies used easier to understand.


Chapter 1

SAS ENTERPRISE MINER WORKING ENVIRONMENT


1.1 SAS ENTERPRISE MINER INTRODUCTION

SAS Institute implements data mining in the Enterprise Miner software, which will be used in this book, and in other procedures and modules (STAT, ETS, …) which will also be used throughout the text. SAS Institute defines the concept of data mining as the process of Sampling, Exploring, Modifying, Modeling and Assessing large amounts of data with the aim of uncovering unknown patterns which can be used as a competitive advantage with respect to competitors. This process is summarized with the acronym SEMMA, formed from the initials of the five phases which comprise the process of data mining according to SAS Institute. Each of these stages has different nodes associated with it, as shown below:

Sampling phase (Sample): includes the append data (Append Node), data partition (Data Partition Node), import data (File Import Node), filter data (Filter Node), input data source (Input Data Source Node), merge data (Merge Node) and sampling (Sample Node) nodes. See Figures 1-1 and 1-2.

Figure 1-1

Figure 1-2

Exploration phase (Explore): has the following associated nodes: association (Association Node), clustering (Cluster Node), descriptive statistics with the DMDB procedure (DMDB Node), exploratory graph analysis (Graph Explore Node), link analysis (Link Analysis Node), market basket analysis (Market Basket Node), graphics (Multiplot Node), path analysis (Path Analysis Node), self-organized neural networks (SOM/Kohonen), exploratory statistical analysis (StatExplore Node), division of numeric variables into disjoint or hierarchical clusters (Variable Clustering Node), and selection of variables (Variable Selection Node). See Figures 1-3 and 1-4.

Figure 1-3

Figure 1-4

Modification phase (Modify): remove variables from data sets or hide variables from the metadata (Drop Node), replace missing values in data sets (Impute Node), model non-linear functions of multiple modes of continuous distribution (Interactive Binning Node), obtain principal components (Principal Components Node), replacement of missing values (Replacement Node), create ad hoc sets of rules with user-definable outcomes (Rules Builder Node), definition of variables (Data Set Attributes), and transformation of variables (Transform Variables Node). See Figures 1-5 and 1-6.

Figure 1-5

Figure 1-6

Modeling phase (Model): find better neural network configurations (AutoNeural Node), perform decision trees (Decision Tree Node), perform regression analysis on data sets that have a binary or interval level target variable (Dmine Regression Node), fit an additive nonlinear model that uses bucketed principal components as inputs to predict a binary or an interval target variable (DMNeural Node), create new models by combining the posterior probabilities (for class targets) or the predicted values (for interval targets) from multiple predecessor models (Ensemble Node), search for an optimal partition of the data defined in terms of the values of a single variable using a gradient boosting machine (Gradient Boosting Node), perform interactive decision trees (Interactive Decision Tree Application), perform both variable selection and model fitting (LARs Node), use a k-nearest neighbor algorithm to categorize or predict observations (Memory-Based Reasoning (MBR) Node), import and assess a model that was not created by one of the Enterprise Miner modeling nodes (Model Import Node), provide a variety of feedforward networks that are commonly called backpropagation or backprop networks (Neural Network Node), use the PLS algorithm to model continuous and binary targets (Partial Least Squares Node), fit both linear and logistic regression models (Regression Node), improve the classification of rare events (Rule Induction Node), and model a class target and an interval target (TwoStage Node). See Figures 1-7 and 1-8.

Figure 1-7

Figure 1-8

Assessment phase (Assess): provide tabular and graphical information to assist users in determining appropriate probability cutoff point(s) for decision making with binary target models (Cutoff Node), define profiles for a target that produce optimal decisions (Decisions Node), compare the performance of competing models using various benchmarking criteria (Model Comparison Node), manage the SAS scoring code that is generated from a trained model or models, save the SAS scoring code to a location on your client computer, and run the SAS scoring code (Score Node), and examine segmented or clustered data and identify factors that differentiate data segments from the population (Segment Profile Node). See Figures 1-9 and 1-10.

Figure 1-9

Figure 1-10


1.2 STARTING WITH SAS ENTERPRISE MINER

SAS Enterprise Miner software is installed in one of two configurations: Workstation or Enterprise Client. Either configuration provides complete Enterprise Miner capabilities.

The Enterprise Miner Workstation uses SAS services that run on your personal computer or laptop computer. The SAS Deployment Wizard configures these services during installation.

The Enterprise Client connects to remote SAS servers using the SAS Web Infrastructure Platform (WIP). Your organization's SAS system administrator must first install Enterprise Miner on a server, and then ensure that the WIP is running before you can log in to a client/server session.

In the Workstation configuration, all processing occurs locally, and no connections to the SAS Metadata Server and SAS Foundation Server are needed.

In the Enterprise Client configuration, the Web Infrastructure Platform (WIP) functions as a bridge between the clients and the servers. Multiple clients can connect to multiple servers. This is the preferred configuration for distributed client/server computing. The WIP must be configured; this task is usually carried out by your SAS system administrators. The clients communicate with the WIP using the HTTP protocol.

To access Enterprise Miner, simply select SAS Enterprise Miner 13.1 or SAS Enterprise Miner Workstation 13.1 from the Windows programs menu (Figure 1-11).

Figure 1-11


1.2.1 Create a New Project

After you launch your Enterprise Miner 13.1 session, you can create a new project.

1. From the application main menu, select File → New → Project.

The Create New Project window opens (Figure 1-12):

Figure 1-12

Select an available server for your project. Click Next.

2. Type the name of your project in the Project Name field, and specify the server directory that you want to use for your project in the SAS Server Directory field (Figure 1-13).

Figure 1-13


Click Next.

3. Use the Browse button to select and specify the location for your project's SAS Folders on the server. Your server folders will vary from the folders that are displayed in the example below (Figure 1-14).

Figure 1-14

Click Next.

4. Review the project information you have just specified in the New Project Information table (Figure 1-15). Use the Back button if you need to make changes. When you are ready to create your Enterprise Miner project, click Finish.

Figure 1-15


1.2.2 Project Start Code

You can specify SAS code that you want to run every time you start your Enterprise Miner project. This is called project start code.

1. Select your newly created project in the Enterprise Miner Project Navigator. The project properties panel is displayed below (Figure 1-16).

Figure 1-16

Click the ellipsis button […] to the right of the Project Start Code property to open the Project Start Code window.

2. Use the space in this window to enter any SAS code that you want to run each time the project starts, such as LIBNAME statements, grid computing settings, and so on (Figure 1-17).


Figure 1-17

When you have finished entering all of the project start code you want, click Run Now. You can select the Log tab to view the results of your start code. When you are satisfied with your start code, click OK to close the window.

Note: You can modify your project start code by repeating the previous steps at any time.
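For illustration, project start code is ordinary SAS code made up of global statements. A minimal sketch follows; the libref name, path, and macro variable shown are hypothetical examples for this book's style of project, not values generated by Enterprise Miner:

```sas
/* Hypothetical project start code - substitute a libref and path
   that exist on your own SAS server.                              */
libname mydata "C:\EM_Projects\data";  /* assign a data library    */
options nofmterr;                      /* tolerate missing formats */
%let seed = 12345;                     /* macro variable for reuse */
```

Because this code runs at the start of every project session, any data sets placed in the mydata library are then visible when you create data sources.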


1.2.3 Create a New Process Flow Diagram

To create a new diagram, select File → New → Diagram from the main menu. Type a name for your diagram in the Create New Diagram window and click OK to create the diagram. The new diagram opens in the Diagram Workspace.


1.2.4 Create a Data Source

To create a data source, select File → New → Data Source from the main menu. The Data Source Wizard opens. You use the Data Source Wizard to set up and configure external data tables for use with Enterprise Miner.

SAS Enterprise Miner includes some example data sources that are already set up and ready for you to use. We will add the sample data sources to the Data Sources folder in your Project Navigator.

1. From the Enterprise Miner main menu, select Help → Generate Sample Data Sources (Figure 1-18).

Figure 1-18

2. The Generate Sample Data Sources window opens. This window contains representative data sets from the SAMPSIO data library that ships with Enterprise Miner (Figure 1-19). The various data sets are useful for different types of data mining analyses.


Figure 1-19

By default, all sample data sets are selected for creation. Click OK to close the window and create the sample data sources.

3. Go to your Project Navigator tree and open the Data Sources folder (Figure 1-20). You will see the newly created example Enterprise Miner data sources from the SAMPSIO library inside.

Figure 1-20

When a data source is selected in the Project Navigator, the data source attributes are displayed in the Properties Panel. To use a data source in your process flow diagram, select the data source (this example uses the Home Equity data source) in the Project Panel and drag it to the Diagram Workspace.


1.2.5 Connect Nodes in the Diagram Workspace

Connect data mining tool nodes in the Diagram Workspace to create a logical process flow for your data mining project. When you create a process flow, follow the SEMMA (Sample, Explore, Modify, Model, Assess) data mining methodology. Information flows from node to node in the direction of the connecting arrows. The following example connects the Home Equity data source node to the Sample node.

1. Drag the Home Equity data source from the Data Sources folder of your Project Navigator onto your Diagram Workspace. Then, from the Sample tab of your Enterprise Miner node toolbar, drag a Sample node onto the workspace (Figure 1-21).

Figure 1-21

2. To connect the nodes, move the pointer to the right edge of the Input Data node icon. The pointer icon changes from an arrow to a pencil (Figure 1-22).

Figure 1-22

3. Drag the pointer to the left edge of the Sample node. Release the pointer, and the nodes are connected (Figure 1-23).

Figure 1-23


1.2.6 Run the Process Flow Diagram

To generate results from a process flow diagram in your Diagram Workspace, you must run the process flow path to execute the code that is associated with each node. Right-click the node from which you want to run the flow (normally, a terminal node) and select Run from the pop-up menu (Figure 1-24).

Figure 1-24


1.2.7 View Results

To view the results of a completed run, right-click the node that you ran and select Results from the pop-up menu. You can view the results of the node run in the Results window that opens (Figure 1-25).

Figure 1-25


1.2.8 Create a Model Package

You can export the contents of your model to comparable Enterprise Miner environments using an Enterprise Miner model package. To save your process flow diagram as a model package, right-click the last node in your model and, from the menu, select Create Model Package (Figure 1-26).

Figure 1-26

When the Input window opens, specify a name for your model package and click OK (Figure 1-27).

Figure 1-27

Your model is now saved inside the Model Packages folder of your Project Navigator (Figure 1-28).


Figure 1-28


1.3 SAS ENTERPRISE MINER USER INTERFACE

Figure 1-29 shows the SAS Enterprise Miner user interface.

Figure 1-29

The key elements are as follows:

Toolbar: The SAS Enterprise Miner Toolbar is a graphic set of node buttons and tools that you use to build process flow diagrams in the Diagram Workspace. The text name of any node or tool button is displayed when you position your mouse pointer over the button.

Toolbar Shortcut Buttons: The SAS Enterprise Miner toolbar shortcut buttons are a graphic set of user interface tools that you use to perform common computer functions and frequently used SAS Enterprise Miner operations. The text name of any tool button is displayed when you position your mouse pointer over the button.

Project Panel: Use the Project Panel to manage and view data sources, diagrams, results, and project users.

Properties Panel: Use the Properties Panel to view and edit the settings of data sources, diagrams, nodes, results, and users.


Diagram Workspace: Use the Diagram Workspace to build, edit, run, and save process flow diagrams. In this workspace, you graphically build, order, and sequence the nodes that you use to mine your data and generate reports.

Property Help Panel: The Help Panel displays a short description of the property that you select in the Properties Panel. Extended help can be found in the Help main menu.

SAS Enterprise Miner behaves like most Windows applications:

Panels can be resized by dragging the edge of the panel.

Windows within the Diagram Workspace can be resized by dragging window corners.

Windows can be minimized, maximized, and closed by using the three control buttons in the upper right corner of all windows.

A single click with the left mouse button selects objects in the Diagram Workspace.

Right-clicking an object opens a pop-up menu of actions, if any are associated with that object.

Text-based application menus can be activated by pressing the <ALT> key in combination with the underscored letter in the menu name.

The name or function of most buttons appears in text if you position your mouse pointer over the button. To change the behavior of tooltips in SAS Enterprise Miner, select Options → Preferences. Then use the Property Sheet Tooltips and the Tools Palette Tooltips properties to configure the tooltip behavior.

Note: The Nodes Toolbar and the Toolbar Shortcut Buttons together are collectively called the Tools Palette.


1.3.1 SAS Enterprise Miner Main Menu

Enterprise Miner contains a menu that lets you select and execute common tasks. This menu (at the top of Figure 1-29) presents the options File, Edit, View, Options, Actions, Window and Help, whose sub-items and purposes are explored below.

File

New

Project — creates a new project.

Diagram — creates a new diagram.

Data Source — creates a new data source using the Data Source Wizard.

Library — creates, modifies, or deletes a library using the Library Wizard.

Open Project — opens an existing project. You can also create a new project from the Open Project window.

Recent Projects — lists the projects on which you were most recently working.

Open Model — opens a model package SPK file that you have previously created.

Open Model Package — opens a window that you can use to view and compare model packages.

Register Model — registers a model.

Open Diagram — opens the diagram that you select in the Project Panel.

Close Diagram — closes the open diagram that you select in the Project Panel.

Close this Project — closes the current project.

Import Diagram from XML — imports a diagram that has been defined by an XML file.

Save Diagram — saves a diagram as an image (BMP or GIF) or as an XML file.

Print Diagram — prints the contents of the window that is open in the Diagram Workspace.

Print Preview Diagram — creates an onscreen preview of the contents of the window that is selected for printing.

Delete this Project — deletes the current project.

Exit — ends the SAS Enterprise Miner session and closes the window.


Edit

Cut — deletes the selected item and copies it to the clipboard.

Copy — copies the selected node to the clipboard.

Paste — pastes a copied object from the clipboard.

Delete — deletes the selected diagram, data source, or node.

Rename — renames the selected diagram, data source, or node.

Duplicate — creates a copy of the selected data source.

Select All — selects all of the nodes in the open diagram.

Clear All — clears text from the Program Editor.

Find/Replace — opens the Find/Replace window so that you can search for and replace text in the Program Editor, Log, and Results windows.

Go To Line — opens the Go To Line window. Enter the line number on which you want to enter text.

Layout

Horizontally — creates an ordered, horizontal arrangement of the nodes that you have placed in the Diagram Workspace.

Vertically — creates an ordered, vertical arrangement of the nodes that you have placed in the Diagram Workspace.

Zoom — increases or decreases the size of the process flow diagram within the diagram window by the amount that you choose.

Copy Diagram to Clipboard — copies an image of the current process flow diagram in the Diagram Workspace to the Windows clipboard.

View

Program Editor — opens a SAS Program Editor window in which you can enter SAS code.

Project Log — opens a Project Log window.

Explorer — opens a SAS Explorer window. You use the Explorer window to view SAS libraries and their tables.

Metadata — opens a SAS Explorer window that you can use to open metadata information for projects and results.


Refresh Project — updates the selected project tree to incorporate any changes that were made to the project from outside the SAS Enterprise Miner user interface.

Actions

Add Node — adds a node that you select to the Diagram Workspace.

Select Nodes — opens the Select Nodes window.

Connect Nodes — opens the Connect Nodes window. You must select a node in the Diagram Workspace to make this menu item available. You can connect the node that you select to any nodes that have been placed in your Diagram Workspace.

Disconnect Nodes — opens the Disconnect Nodes window. You must select a node in the Diagram Workspace to make this menu item available. You can disconnect the node that you select from any nodes that are connected to the selected node in your Diagram Workspace.

Update — updates the selected node to incorporate any changes that you have made.

Run — runs the selected node and any predecessor nodes in the process flow that have not been executed, or submits any code that you type in the Program Editor window.

Stop Run — interrupts a currently running process flow.

Stop Code — interrupts currently running code.

View Results — opens the Results window for the selected node.

Create Model Package — generates a mining model package.

Export Path as SAS Program — saves the path that you select as a SAS program. In the window that opens, you can specify the location to which you want to save the file, and whether you want the code to run the path or create a model package.

Options

Preferences — opens the Preferences window. Use the following options to change the user interface:

User Interface

Property Sheet Tooltips — specifies whether tooltips are displayed on the various property sheets that appear throughout the user interface.

Tools Palette Tooltips — specifies whether tooltips display the tool name only, display the tool name and a description, or suppress tooltips for the tool buttons in the Tools Palette.


Open Last Opened Project Automatically — specifies whether SAS Enterprise Miner should open the last opened project when you log on.

Open Last Viewed Diagram Automatically — specifies whether SAS Enterprise Miner should open the last viewed diagram when you log on. You must specify Yes in both this property and Open Last Opened Project Automatically to automatically open the last viewed diagram.

Number of Recent Projects — specifies the number of projects that appears in the Recent Projects list.

Interactive Sampling

Sample Method — specifies whether default samples are made from the top of the data set or are drawn at random when exploring data.

Fetch Size — specifies the amount of data to fetch during interactive sampling. You can choose from Default (2000 observations) or Max (all observations).

Random Seed — lets you specify the default value for Random Seed entries.

Model Package Options

Generate C Score Code — specifies whether to generate and add C score code to reports and Results packages.

Generate Java Score Code — specifies whether to generate and add Java score code to Results packages.

Java Score Code Package — specifies the Java package name to be used when SAS Enterprise Miner is configured to generate and add Java score code to Results packages.

Run Options

Grid Processing — specifies whether to use grid processing when available or to never use grid processing. If you enable grid processing for a SAS Enterprise Miner project, you must specify a project location that is common to all grid nodes.

Results Options

Log/Output Line Numbers — specifies the number of lines that are displayed in any logs or outputs.

Window

Tile — displays windows in the Diagram Workspace so that all windows are visible at the same time.

Cascade — displays windows in the Diagram Workspace so that windows overlap.

Help

Contents — opens the SAS Enterprise Miner Help window, which enables you to view all the SAS Enterprise Miner Reference Help.

Component Properties — opens a table that displays the component properties of each tool.

Generate Sample Data Sources — creates sample data sources that you can access from the Data Sources folder.

Configuration — displays the current system configuration of your SAS Enterprise Miner session.

About — displays information about the version of SAS Enterprise Miner that you are using.


1.3.2 The SAS Enterprise Miner Node Toolbar

The SAS Enterprise Miner Node Toolbar is located directly above the Diagram Workspace. It is a tabbed graphic collection of SAS Enterprise Miner nodes and tools that you use to build process flow diagrams (Figure 1-30).

Figure 1-30

If you position your mouse pointer over a Toolbar button, a text ToolTip appears and displays the node or tool name.

To use a Toolbar node in a process flow diagram, click the node button and drag it into the Diagram Workspace. The Toolbar button remains in place, and the node in the Diagram Workspace is ready to be connected and configured for use in your process flow diagram.

SAS Text Miner and Credit Scoring for SAS Enterprise Miner are not included with the base version of SAS Enterprise Miner. If your site has not licensed Credit Scoring for SAS Enterprise Miner and SAS Text Miner, those data mining tools do not appear in your SAS Enterprise Miner software.


Chapter 2

NEURAL NETWORKS NODES IN SAS ENTERPRISE MINER


2.1 NEURAL NETWORK DESCRIPTION

A neural network can be defined as a set of highly interconnected information-processing elements that are able to learn from the information that feeds them. The main feature of this technology is that it can be applied to a large number of problems, ranging from complex practical problems to sophisticated theoretical models: image recognition, voice recognition, financial analysis, signal analysis and filtering, and so on. In this chapter we will see applications of neural networks for improving techniques of classification, discrimination, prediction, and so on.

Neural networks try to emulate the nervous system, so that they are able to reproduce some of the major tasks that the human brain carries out, by reflecting its fundamental behavioral characteristics. What neural networks really model is one of the physiological structures of the brain, the neuron, together with the structured, interlinked groups of several neurons known as networks of neurons. In this way, they build systems that present a certain degree of intelligence. However, we must insist on the fact that artificial neural systems, like any other tool built by man, have limitations and bear only a superficial resemblance to their biological counterparts. With regard to information processing, neural networks inherit three basic characteristics from networks of biological neurons: massive parallelism, the nonlinear response of neurons to the inputs they receive, and the processing of information through multiple layers of neurons.

One of the main properties of these models is their ability to learn and generalize from real-life examples. That is, the network learns to recognize the relationship (which is equivalent to estimating a functional dependence) that exists between the set of inputs provided as examples and their corresponding outputs, so that after learning, when the network is presented with a new input (even if it is incomplete or contains errors), it is able to generalize on the basis of the established functional relationship and offer an output. As a result, an artificial neural network can be defined as an intelligent system able not only to learn, but also to generalize.

A neural network consists of processing units that are called neurons or nodes. These nodes are organized into groups called "layers". There are usually three types of layers: an input layer, one or more hidden layers, and an output layer. Connections are established between adjacent nodes in each layer. The input layer, which presents data to the network, consists of input nodes that receive information directly from the outside. The output layer represents the response of the network to a given input, as this information is transferred to the outside. The hidden or intermediate layers, which process information and are interposed between the input and output layers, are the only ones that have no connection with the outside.
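To make the layered structure concrete, the following sketch (not produced by any Enterprise Miner node) computes the forward pass of a small feedforward network as a SAS DATA step; the weights and bias values are hypothetical numbers chosen only for illustration:

```sas
/* A 2-input, 2-hidden-unit, 1-output feedforward network sketch.  */
data forward_pass;
  input x1 x2;                                 /* input layer      */
  /* hidden layer: tanh of weighted sums of the inputs             */
  h1 = tanh( 0.5*x1 - 0.3*x2 + 0.1);
  h2 = tanh(-0.2*x1 + 0.8*x2 - 0.4);
  /* output layer: logistic activation, as for a binary target     */
  p  = 1 / (1 + exp(-(1.2*h1 - 0.7*h2 + 0.2)));
  datalines;
1 0
0 1
;
run;
```

Trained networks in Enterprise Miner are expressed as conceptually similar SAS score code, except that the weights are estimated during training rather than fixed by hand.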

Page 37: NEURAL NETWORKS with SAS ENTERPRISE MINER

2.2 NEURAL NETWORKS WITH SAS ENTERPRISE MINER

SAS Enterprise Miner allows you to work with neural networks through several nodes located in the Model category (Figure 2-1).

Figure 2-1

The AutoNeural node can be used to automatically configure a neural network. The AutoNeural node implements a search algorithm to incrementally select activation functions for a variety of multilayer networks.

The DMNeural node is another modeling node that you can use to fit an additive nonlinear model. The additive nonlinear model uses bucketed principal components as inputs to predict a binary or an interval target variable, with automatic selection of an activation function.

The Neural Network node enables you to construct, train, and validate multilayer feedforward neural networks. Users can select from several predefined architectures or manually select input, hidden, and target layer functions and options.

The SVM node uses supervised machine learning to solve binary classification problems, and supports polynomial, radial basis function, and sigmoid nonlinear kernels. The standard SVM problem solves binary classification problems by constructing a set of hyperplanes that maximize the margin between the two classes. The SVM node does not support multi-class problems or support vector regression.

The TwoStage node enables you to compute a two-stage model for predicting a class target and an interval target variable at the same time. The interval target variable is usually a value that is associated with a level of the class target. This node allows you to use neural networks to build a model in two stages in order to predict a class variable and a continuous variable. You can use multilayer perceptrons, generalized linear models, and radial basis functions.

SAS Enterprise Miner also allows you to work with neural networks through the SOM/Kohonen node, oriented toward classification tasks. This node is located in the Modify category (Figure 2-2).

Figure 2-2


2.3 OPTIMIZATION AND ADJUSTMENT OF MODELS WITH NETS: NEURAL NETWORK NODE

The Neural Network node allows you to use neural networks for the optimization and adjustment of models through multilayer perceptrons, radial basis functions, and generalized linear models. As shown in Figure 2-1, it is located in the Model category of the SAS Enterprise Miner menu.

Artificial neural networks were originally developed by researchers who were trying to mimic the neurophysiology of the human brain. By combining many simple computing elements (neurons or units) into a highly interconnected system, these researchers hoped to produce complex phenomena such as intelligence. In recent years, neural network researchers have incorporated methods from statistics and numerical analysis into their networks. While there is considerable controversy over whether artificial neural networks are really intelligent, there is no doubt that they have developed into very useful statistical models. More specifically, feedforward neural networks are a class of flexible nonlinear regression, discriminant, and data reduction models. By detecting complex nonlinear relationships in data, neural networks can help to make predictions about real-world problems.

Neural networks are especially useful for prediction problems where:

no mathematical formula is known that relates inputs to outputs.

prediction is more important than explanation.

there is a lot of training data.

Common applications of neural networks include credit risk assessment, direct marketing, and sales prediction.

The Neural Network node provides a variety of feedforward networks that are commonly called backpropagation or backprop networks. This terminology causes much confusion. Strictly speaking, backpropagation refers to the method for computing the error gradient for a feedforward network, a straightforward application of the chain rule of elementary calculus. By extension, backprop refers to various training methods that use backpropagation to compute the gradient. By further extension, a backprop network is a feedforward network trained by any of various gradient-descent techniques. Standard backprop is a euphemism for the generalized delta rule, the training technique that was popularized by Rumelhart, Hinton, and Williams in 1986 and which remains the most widely used supervised training method for feedforward neural nets. Standard backprop is also one of the most difficult to use, tedious, and unreliable training methods. Unlike the other training methods in the Neural Network node, standard backprop comes in two varieties.

Batch backprop, like conventional optimization techniques, reads the entire data set, updates the weights, reads the entire data set, updates the weights, and so on.

Incremental backprop reads one case, updates the weights, reads one case, updates the weights, and so on.

Batch backprop is one of the slowest training methods. Although the Neural Network node provides an option for batch backprop, it is recommended that you never use it for serious work. Incremental backprop can be useful for very large, redundant data sets, if you are skilled at setting the learning rate and momentum appropriately.

Fortunately, there is no need to suffer through the slow convergence and the tedious tuning of standard backprop. Much of the neural network research literature is devoted to attempts to speed up backprop. Most of these methods are inconsequential; two that are effective are Quickprop and RPROP, both of which are available in the Neural Network node. In addition, the Neural Network node provides a variety of conventional methods for nonlinear optimization that have been developed by numerical analysts over the past several centuries and that are usually faster and more reliable than the algorithms from the neural network literature.

Up until the early 1990s, neural networks were often viewed as alternatives to statistical methods. Some researchers made outlandish claims that neural networks could be used to analyze data with no expertise required on the part of the analyst. These unjustifiable claims, combined with the unreliability of early algorithms such as standard backprop, led to a backlash in which many people, especially statisticians, dismissed neural networks as entirely worthless for data analysis. But in recent years, it has been widely recognized that many kinds of neural networks are statistical methods, and that when neural networks are trained via reliable methods such as conventional optimization techniques or Bayesian learning, the results are just as valid as those obtained by many nonlinear or nonparametric statistical methods.

Neural networks, like other statistical methods, cannot magically create information out of nothing; the rule "garbage in, garbage out" still applies. The predictive ability of a neural network depends in part on the quality of the training data. It is also important for the analyst to have some knowledge of the subject matter, especially for selecting inputs and choosing an appropriate error function. Experienced neural network users typically try several architectures to determine the best network for a specific data set. The design process and the training process are both iterative.

2.3.1 Overview of Feedforward Neural Networks

Units and Connections

A neural network consists of units (neurons) and connections between those units. There are three kinds of units.

Input Units obtain the values of input variables and optionally standardize those values.

Hidden Units perform internal computations, providing the nonlinearity that makes neural networks powerful.

Output Units compute predicted values and compare those predicted values with the values of the target variables.

Units pass information to other units through connections. Connections are directional and indicate the flow of computation within the network. Connections cannot form loops, since the Neural Network node allows only feedforward networks.

The following restrictions apply to feedforward networks:

Input units can be connected to hidden units or to output units.

Hidden units can be connected to other hidden units or to output units.

Output units cannot be connected to other units.

Due to limitations in the SAS supervisor, the total number of connections in a network cannot exceed roughly 32,000.

Predicted Values and Error Functions

Each unit produces a single computed value. For input and hidden units, this computed value is passed along the connections to other hidden or output units. For output units, the computed value is what statisticians call a predicted value. The predicted value is compared with the target value to compute the error function, which the training methods attempt to minimize.

Weight, Bias, and Altitude

Most connections in a network have an associated numeric value called a weight or parameter estimate. The training methods attempt to minimize the error function by iteratively adjusting the values of the weights. Most units also have one or two associated numeric values called the bias and altitude, which are also estimated parameters adjusted by the training methods.

Combination Functions

Hidden and output units use two functions to produce their computed values. First, all the computed values from previous units feeding into the given unit are combined into a single value using a combination function. The combination function uses the weights, bias, and altitude. Two general kinds of combination function are commonly used.

Linear Combination Functions — compute a linear combination of the weights and the values feeding into the unit and then add the bias value (the bias acts like an intercept).

Radial Combination Functions — compute the squared Euclidean distance between the vector of weights and the vector of values feeding into the unit and then multiply by the squared bias value (the bias acts as a scale factor or inverse width).

There is also an additive combination function that uses no weights or bias.
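The two main combination functions can be sketched in a few lines of Python. This is an illustrative sketch of the formulas above, not the node's internal code; the function and variable names are ours:

```python
import numpy as np

def linear_combination(x, w, bias):
    """Linear combination: weighted sum of the incoming values plus the bias,
    which acts like an intercept."""
    return np.dot(w, x) + bias

def radial_combination(x, w, bias):
    """Radial combination: squared Euclidean distance between the incoming
    values and the weight vector, multiplied by the squared bias, which acts
    as a scale factor or inverse width."""
    return bias**2 * np.sum((x - w) ** 2)

x = np.array([1.0, 2.0])
w = np.array([0.5, -0.5])
print(linear_combination(x, w, bias=0.1))  # 0.5*1 + (-0.5)*2 + 0.1 = -0.4
print(radial_combination(x, w, bias=2.0))  # 2**2 * ((1-0.5)**2 + (2+0.5)**2) = 26.0
```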

Activation Functions

The value produced by the combination function is transformed by an activation function, which involves no weights or other estimated parameters. Several general kinds of activation functions are commonly used.

The Identity Function is also called a linear function. It does not change the value of the argument, and its range is potentially unbounded.

Sigmoid Functions are S-shaped functions, such as the logistic and hyperbolic tangent functions, that produce bounded values within a range of 0 to 1 or -1 to 1.

The Softmax Function is called a multiple logistic function by statisticians and is a generalization of the logistic function that affects several units together, forcing the sum of their values to be one.

Value Functions are bounded bell-shaped functions such as the Gaussian function.

Exponential and Reciprocal Functions are bounded below by zero but unbounded above.
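The activation function families listed above are easy to write down explicitly. The following Python sketch (our own illustrative definitions, not Enterprise Miner code) shows one representative of each kind:

```python
import numpy as np

def identity(z):
    return z                          # linear: range unbounded

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid: bounded in (0, 1)

def gaussian(z):
    return np.exp(-z**2)              # value function: bounded bell shape

def softmax(z):
    e = np.exp(z - np.max(z))         # shift the exponent for numerical stability
    return e / e.sum()                # affects several units together; sums to one

z = np.array([0.0, 1.0, 2.0])
print(logistic(0.0))      # 0.5
print(softmax(z).sum())   # 1.0
```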

Network Layers

A network may contain many units, perhaps several hundred. The units are grouped into layers to make them easier to manage. Enterprise Miner supports multiple input layers, a hidden layer, and multiple output layers. In the Neural Network node, when you connect two layers, every unit in the first layer is connected to every unit in the second layer.

All the units in a given layer share certain characteristics. For example, all the input units in a given layer have the same measurement level and the same method of standardization. All the units in a given hidden layer have the same combination function and the same activation function. All the units in a given output layer have the same combination function, activation function, and error function.

2.3.2 Simple Neural Networks

The simplest neural network has a single input unit (independent variable), a single target (dependent variable), and a single output unit (predicted values).

Figure 2-3

In Figure 2-3 the slash inside the box represents a linear (or identity) output activation function. In statistical terms, this network is a simple linear regression model. If the output activation function were a logistic function, then this network would be a logistic regression model.
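This equivalence is easy to make concrete: the simplest network is just a regression model whose form depends on the output activation. A minimal Python sketch (illustrative only, with made-up weight values):

```python
import numpy as np

def simple_network(x, w, b, activation):
    """One input unit, one output unit: output = activation(w*x + b)."""
    return activation(w * x + b)

identity = lambda z: z                          # linear output: simple linear regression
logistic = lambda z: 1.0 / (1.0 + np.exp(-z))   # logistic output: logistic regression

print(simple_network(2.0, w=0.5, b=1.0, activation=identity))  # 2.0
print(simple_network(2.0, w=0.5, b=1.0, activation=logistic))  # ≈ 0.881
```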

2.3.3 Perceptrons

One of the earliest neural network architectures was the perceptron, which is a type of linear discriminant model. A perceptron uses a linear combination of inputs for the combination function. Originally, perceptrons used a threshold (Heaviside) activation function, but training a network with threshold activation functions is computationally difficult. In current practice, the activation function is almost always a logistic function, which makes a perceptron equivalent in functional form to a logistic regression model.

As an example, a perceptron might have two inputs and a single output (Figure 2-4).

Figure 2-4

In neural network terms, the diagram shows two inputs connected to a single output with a logistic activation function (represented by the sigmoid curve in the box). In statistical terms, this diagram shows a logistic regression model with two independent variables and one dependent variable.

2.3.4 Hidden Layers

Neural networks may apply additional transformations via a hidden layer. Typically, each input unit is connected to each unit in the hidden layer, and each hidden unit is connected to each output unit. The hidden units combine the input values and apply an activation function, which may be nonlinear. Then the values that were computed by the hidden units are combined at the output units, where an additional (possibly different) activation function is applied. If such a network uses linear combination functions and sigmoid activation functions, it is called a multilayer perceptron or MLP. It is also possible to use other combination functions and other activation functions to connect layers in many other ways. A network with three hidden units with different activation functions is shown in the following diagram (Figure 2-5).

Figure 2-5
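The hidden-layer computation just described (combine, activate, then combine again at the output) can be sketched as a forward pass in Python. The shapes and weight values here are arbitrary illustrations, not anything produced by the node:

```python
import numpy as np

def mlp_forward(x, W1, b1, W2, b2):
    """One-hidden-layer network: linear combination plus tanh activation in
    the hidden layer, then a linear combination (identity activation) at the
    output layer."""
    hidden = np.tanh(W1 @ x + b1)   # hidden units: combine, then activate
    return W2 @ hidden + b2         # output units: combine the hidden values

rng = np.random.default_rng(0)
x = np.array([0.2, -0.1])                        # two inputs
W1 = rng.normal(size=(3, 2)); b1 = np.zeros(3)   # three hidden units
W2 = rng.normal(size=(1, 3)); b2 = np.zeros(1)   # one output unit
print(mlp_forward(x, W1, b1, W2, b2).shape)      # (1,)
```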

2.3.5 Multilayer Perceptrons (MLPs)

The most popular form of neural network architecture is the multilayer perceptron (MLP), which is the default architecture in the Neural Network node. A multilayer perceptron

has any number of inputs.

has a hidden layer with any number of units.

uses linear combination functions in the hidden and output layers.

uses sigmoid activation functions in the hidden layers.

has any number of outputs with any activation function.

has connections between the input layer and the first hidden layer, between the hidden layers, and between the last hidden layer and the output layer.

The Neural Network node supports many variations of this general form. For example, you can add direct connections between the inputs and outputs, or you can cut the default connections and add new connections of your own.

Given enough data, enough hidden units, and enough training time, an MLP with one hidden layer can learn to approximate virtually any function to any degree of accuracy. (A statistical analogy is approximating a function with nth order polynomials.) For this reason, MLPs are known as universal approximators, and can be used when you have little prior knowledge of the relationship between inputs and targets.

2.3.6 Radial Basis Function (RBF) Networks

A Radial Basis Function network

has any number of inputs.

typically has a hidden layer with any number of units.

uses radial combination functions in the hidden layer, based on the squared Euclidean distance between the input vector and the weight vector.

typically uses exponential or softmax activation functions in the hidden layer, in which case the network is a Gaussian RBF network.

uses linear combination functions in the output layer.

has any number of outputs with any activation function.

has connections between the input layer and the hidden layer, and between the hidden layer and the output layer.
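Putting those pieces together, a Gaussian RBF forward pass might look like the following Python sketch (illustrative names and values, not the node's implementation):

```python
import numpy as np

def rbf_forward(x, centers, widths, w_out, b_out):
    """Gaussian RBF network: radial combination (squared Euclidean distance
    to each hidden unit's weight vector, scaled by its width), exponential
    activation, and a linear combination at the output."""
    d2 = np.sum((centers - x) ** 2, axis=1)   # squared distances to centers
    hidden = np.exp(-d2 / widths**2)          # Gaussian bumps
    return w_out @ hidden + b_out

x = np.array([0.0, 0.0])
centers = np.array([[0.0, 0.0], [1.0, 1.0]])  # hidden-unit weight vectors
widths = np.array([1.0, 1.0])
w_out = np.array([2.0, 3.0]); b_out = 0.5
print(rbf_forward(x, centers, widths, w_out, b_out))  # 2*1 + 3*exp(-2) + 0.5 ≈ 2.906
```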

2.3.7 Local Processing Networks

MLPs are said to be distributed-processing networks because the effect of a hidden unit can be distributed over the entire input space. On the other hand, Gaussian RBF networks are said to be local-processing networks because the effect of a hidden unit is usually concentrated in a local area centered at the weight vector.

2.3.8 Width and Altitude

The hidden layer of an RBF network does not have anything that is exactly the same as the bias term in an MLP. Instead, RBFs have a width associated with each hidden unit or with the entire hidden layer. Instead of adding the width in the combination function like a bias, you divide the Euclidean distance by the width. Since dividing by a width of zero would produce undefined results, the Neural Network node has a different parameterization using the square root of the reciprocal of the width. In the names of the weights, the square root of the reciprocal of the width is called the "bias" for lack of an appropriate term, although it does not have the usual interpretation of a bias.
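The reparameterization described here can be checked numerically. In this hedged Python sketch (our own toy functions, not the node's exact formula), setting bias = sqrt(1/width) and multiplying the distance by bias squared reproduces division by the width without ever using the width as a divisor:

```python
import numpy as np

def combo_by_width(x, center, width):
    """Euclidean distance divided by the width (undefined when width = 0)."""
    return np.linalg.norm(x - center) / width

def combo_by_bias(x, center, bias):
    """Same quantity with bias = sqrt(1/width): multiplying by bias**2
    means a zero width never appears as a divisor."""
    return bias**2 * np.linalg.norm(x - center)

x, center = np.array([3.0, 4.0]), np.array([0.0, 0.0])
width = 2.0
bias = np.sqrt(1.0 / width)
print(combo_by_width(x, center, width))  # 5 / 2 = 2.5
print(combo_by_bias(x, center, bias))    # 0.5 * 5 = 2.5
```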

Some radial combination functions have another parameter called the altitude of the unit. The altitude is the maximum height of the Gaussian curve above the horizontal axis.

2.3.9 Ordinary RBF and Normalized RBF

There are two distinct types of Gaussian RBF architectures. The first type, the ordinary RBF (ORBF) network, uses the exponential activation function, so the activation of the unit is a Gaussian "bump" as a function of the inputs. The second type, the normalized RBF (NRBF) network, uses the softmax activation function, so the activations of all the hidden units are normalized to sum to one. While the distinction between these two types of Gaussian RBF architectures is sometimes mentioned in the NN literature, its importance has rarely been appreciated except by Tao (1993) and Werntges (1993).

The output activation function in RBF networks is customarily the identity. Using an identity output activation function is a computational convenience in training, but it is possible and often desirable to use other output activation functions just as you would in an MLP. The Neural Network node sets the default output activation function for RBF networks the same way it does for MLPs.
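The only difference between the two hidden layers is the activation function, as this Python sketch illustrates (our own toy implementation):

```python
import numpy as np

def orbf_hidden(x, centers, width):
    """ORBF: exponential activation, so each hidden unit is an independent
    Gaussian bump."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    return np.exp(-d2 / width**2)

def nrbf_hidden(x, centers, width):
    """NRBF: softmax activation, so the same bumps are normalized to sum
    to one."""
    a = orbf_hidden(x, centers, width)
    return a / a.sum()

centers = np.array([[0.0], [1.0], [2.0]])
x = np.array([0.3])
print(orbf_hidden(x, centers, 1.0).sum())  # generally not 1
print(nrbf_hidden(x, centers, 1.0).sum())  # exactly 1
```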

2.3.10 Error Functions

A network is trained by minimizing an error function (also called an estimation criterion or Lyapunov function). Most error functions are based on the maximum likelihood principle, although computationally it is the negative log likelihood that is minimized. The likelihood is based on a family of error (noise) distributions for which the resulting estimator has various optimality properties. M estimators are formally similar to maximum likelihood estimators, but for certain kinds called redescending M estimators, no proper error distribution exists.

Some of the more commonly used error functions are

Normal distribution — also called the least-squares or mean-squared-error criterion. Suitable for unbounded interval targets with constant conditional variance, no outliers, and a symmetric distribution. Can also be used for categorical targets with outliers.

Huber M-estimator — Suitable for unbounded interval targets with outliers or with a moderate degree of inequality of the conditional variance, but a symmetric distribution. Can also be used for categorical targets when you want to predict the mode rather than the posterior probability.

Redescending M-estimators — Suitable for unbounded interval targets with severe outliers. Can also be used for predicting one mode of a multimodal distribution. Includes biweight and wave estimators.

Gamma distribution — Suitable for skewed, positive interval targets where the conditional standard deviation is proportional to the conditional mean.

Poisson distribution — Suitable for skewed, nonnegative interval targets, especially counts of rare events, where the conditional variance is proportional to the conditional mean.

Bernoulli distribution — Suitable for a target that takes only the values zero and one. Same as a binomial distribution with one trial.

Entropy — Cross or relative entropy for independent interval targets with values between zero and one inclusive.

Multiple Bernoulli — Suitable for categorical (nominal or ordinal) targets.

Multiple Entropy — Cross or relative entropy for interval targets that sum to one and have values between zero and one inclusive. Also called Kullback-Leibler divergence.

For a categorical target variable, the default error function is multiple Bernoulli. For interval targets, the default error function is normal.
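The two default error functions can be written out as negative log likelihoods. This Python sketch is illustrative only (a plain sum over cases, ignoring frequency weights and other node options):

```python
import numpy as np

def normal_error(target, output):
    """Normal (least-squares) error: the default for interval targets."""
    return np.sum((target - output) ** 2)

def bernoulli_error(target, p):
    """Negative Bernoulli log likelihood (cross-entropy): the default for
    categorical targets; p is the predicted probability of the event."""
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -np.sum(target * np.log(p) + (1 - target) * np.log(1 - p))

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.8])
print(normal_error(y, p))     # 0.01 + 0.04 + 0.04 = 0.09
print(bernoulli_error(y, p))  # -(ln 0.9 + ln 0.8 + ln 0.8) ≈ 0.552
```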

2.3.11 Initialization

The results of training a neural network with hidden units may depend on the initial values of the weights. For MLPs, there is no known rational method for computing initial weights, so random initial weights are used. The Neural Network node supplies pseudo-random initial weights.

By default, all output connection weights are initialized to zero, and output biases are initialized by applying the inverse of the activation function to the mean of the target variable.

For units with a linear combination function, the random initial weights are adjusted by dividing by the square root of the fan-in of the unit (the number of connections coming into the unit).
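The fan-in adjustment can be sketched as follows; this is an illustrative Python sketch of the scaling rule, with our own function name and an arbitrary seed:

```python
import numpy as np

def init_hidden_weights(fan_in, n_units, rng):
    """Random initial weights for units with a linear combination function,
    divided by the square root of the fan-in (the number of incoming
    connections), which keeps the combined input on a reasonable scale."""
    return rng.normal(size=(n_units, fan_in)) / np.sqrt(fan_in)

rng = np.random.default_rng(42)
W_hidden = init_hidden_weights(fan_in=100, n_units=5, rng=rng)
W_output = np.zeros((1, 5))   # output connection weights start at zero
print(W_hidden.std())          # roughly 1/sqrt(100) = 0.1
```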

2.3.12 Preliminary Training

Local minima are a common problem with nonlinear networks. The simplest way to deal with local minima is to train the network for a few iterations from each of a large number (perhaps 10, 100, or 1000) of random initializations. The Neural Network node selects the best estimates obtained during these preliminary runs to be used for subsequent training.

2.3.13 Training Techniques

The Neural Network node provides a wide variety of training techniques, including both conventional optimization techniques and techniques from the neural network literature. The conventional optimization techniques are well-proven algorithms that are described in detail in the documentation for the NLP procedure in the SAS/OR product.

The most popular conventional optimization techniques include the following:

Default — The default method selects the best training technique for you based on the weights that are applied at execution.

Trust-Region — The Trust-Region method is recommended for small and medium optimization problems with up to 40 parameters.

Levenberg-Marquardt — is very fast and reliable for small least-squares networks, but requires a large amount of memory (quadratic in the number of estimated parameters).

Quasi-Newton techniques — are good for medium-sized networks. They require quadratic memory, but only about half as much as Levenberg-Marquardt. Quasi-Newton techniques usually require more iterations than Levenberg-Marquardt, but each iteration requires less floating-point computation.

Conjugate gradient techniques — are good for large networks when there is not enough memory for the above techniques; memory requirements are only linear. They usually require more iterations than either of the above techniques, but each iteration requires less floating-point computation. Conjugate gradient techniques may be a good choice if you have a very fast disk drive.

The Neural Network node also provides three training techniques popular in the neural network literature, all of which have linear memory requirements.

Standard batch backprop — is the most popular training method of all, but it is slow, unreliable, and requires the user to tune the learning rate manually, which can be a tedious process.

Quickprop — usually requires more iterations than conjugate gradient methods, but each iteration is very fast. Quickprop is a Newton-like method but uses a diagonal approximation to the Hessian matrix. It is much faster and more reliable than standard batch backprop and requires little tuning.

RPROP — usually requires more iterations than conjugate gradient methods, but each iteration is very fast. RPROP uses a separate learning rate for each weight, and adapts all the learning rates during training. RPROP is the most stable of all the "prop" techniques and rarely requires any tuning.

For all of the conventional optimization techniques, the objective function decreases on each iteration until a local minimum is reached, at which point the algorithm terminates. For the "prop" techniques, the objective function may go down and up repeatedly during training. For standard backprop, if you specify a learning rate that is too high, the weights will diverge and the objective function will increase without limit.

2.3.14 Scoring

After developing a trained network that meets your requirements, you can use the network to make predictions for various data sets. The Neural Network node allows you to score the training, validation, test, and score data sets in conjunction with training.

2.3.15 Preparing the Data

Preprocess the Data

Before you train a neural network, you may want to consider the following data mining tasks:

Sample the Input Data — The Sampling node enables you to extract a sample of your input data. Sampling is recommended for extremely large databases, because it can tremendously decrease training time. If the sample is sufficiently representative, then relationships found in the sample can be expected to generalize to the complete data set.

Create Partitioned Data Sets — The Data Partition node enables you to split the sample into training, validation, and test data sets. The training data set is used to learn the network weights. The validation data set is used to choose among various network architectures. The validation data set is also used by the Model Comparison node for model assessment. The test set is used to obtain a final, unbiased estimate of generalization error.

Use Only the Important Variables — When your data set has a large number of inputs, it is tempting to use most or all of the inputs. However, it is often better to use only a small number of important inputs, which can greatly reduce the time required to train the network as well as improve the prediction results. Twenty to thirty inputs may be better than three hundred. If you know from your business expertise that an input is not useful in predicting the target, then exclude it from the analysis. The Explore window and the Multiplot node enable you to create exploratory plots that can help you identify important inputs (predictors). You can also use the Variable Selection node to remove unpromising inputs.

When many inputs are available, the choice of network architecture is especially important. For example, MLPs tend to be better at ignoring irrelevant inputs than are some RBF networks. Having many inputs also reduces the number of hidden units you can use, since the number of weights connecting an input layer and a hidden layer is equal to the product of the number of units in each. For example, if you have five inputs, using 100 hidden units may be quite practical. But if you have 100 inputs and try to use 100 hidden units, you will have over 10,000 weights in the network and training will take a very long time.

Transform Data and Filter Outliers — Since neural networks can learn nonlinear functions, they are not as sensitive to transformations of the input variables as are regression models. Nevertheless, the use of appropriate input transformations can improve generalization and speed up training. Outliers in the input variables are of special concern because they can have undue influence on predictions, just like high-leverage points in linear regression.

On the other hand, transformations of the target variables are just as important for neural networks as for regression. For example, if you are using least-squares estimation, transforming the target variables to have conditional normal distributions with constant variance can improve generalization. More important, transformation of targets changes the relative importance of errors depending on the target value.

For example, suppose you are trying to predict the price of some commodity. If the price of the commodity is 10 (in whatever currency unit) and the output of the net is 5 or 15, yielding a difference of 5, that is a huge error. If the price of the commodity is 1000 and the output of the net is 995 or 1005, yielding the same difference of 5, that is a tiny error. You do not want the network to treat those two differences as equally important. By taking logarithms, you are effectively measuring errors in terms of ratios rather than differences, since a difference between two logs corresponds to the ratio of the original values. This has approximately the same effect as looking at percentage differences, abs(target-output)/target or abs(target-output)/output, rather than simple differences.
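The arithmetic of the commodity example is easy to verify; note how the log scale turns two equal absolute differences into very different ratio-based errors:

```python
import numpy as np

# Same absolute difference of 5, very different relative errors:
print(abs(10 - 15), abs(1000 - 1005))       # 5 and 5

# On the log scale, a difference corresponds to a ratio:
print(abs(np.log(10) - np.log(15)))         # log(15/10) ≈ 0.405
print(abs(np.log(1000) - np.log(1005)))     # log(1005/1000) ≈ 0.005

# Close to the percentage-difference view described in the text:
print(abs(10 - 15) / 10, abs(1000 - 1005) / 1000)  # 0.5 versus 0.005
```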

The Transform Variables node enables you to apply several transformations to a variable, such as log, exponential, square root, inverse, and square. The node also includes three power transformations (maximize normality, maximize correlation with the target, and equalize spread with target levels) plus an optimal binning for relationship to target transformation.

Impute Missing Values — The Neural Network node omits cases from training if any of the inputs are missing or if all of the targets are missing. Hence you may want to replace missing data with imputed values. The Impute node enables you to replace interval missing values with the input's mean, median, or midrange. You can also use a robust M-estimator to impute missing values. Missing values of a categorical input can be replaced with the mode. The node also includes a tree-based imputation method for replacing missing values. In addition to imputing missing values, if an input has many missing values (more than five to twenty times the number of hidden units), it is often useful to generate a dummy variable that has the value 1 when the corresponding input variable is missing, and the value 0 otherwise. Then both the original input variable and the dummy variable can be used as inputs to the network.

Use Other Modeling Nodes — For some problems, a neural network may not be the best analytical tool. If the input-output function is approximately linear and the data are noisy, linear regression may produce better generalization. If few of the input variables are relevant or you want an interpretable prediction rule, tree-based models may be preferable. Cross-model assessment can be performed with the Model Comparison node to help you choose the best model for scoring new data.

Specify the Input Data Sets

Input data sets are supplied to the Neural Network node from an Input Data node. When you are ready to train the neural network, you select the training, validation, and test data sets in the Imported Data property of the Neural Network node. You can also set the score data set, which is not required to contain the target variable.

The purposes of the training, validation, and test sets are described by Bishop (1995) in the following quote.

“Since our goal is to find the network having the best performance on new data, the simplest approach to the comparison of different networks is to evaluate the error function using data which is independent of that used for training. Various networks are trained by minimization of an appropriate error function defined with respect to a training data set. The performance of the networks is then compared by evaluating the error function using an independent validation set, and the network having the smallest error with respect to the validation set is selected. This approach is called the hold out method. Since this procedure can itself lead to some overfitting to the validation set, the performance of the selected network should be confirmed by measuring its performance on a third independent set of data called a test set.”

Specify the Inputs and the Targets

Variables are assigned roles and measurement levels in the Input Data node. For neural networks, the variable roles are:

input — for input variables

target — for target variables

freq — for a frequency variable

ID — for variables that are not used in modeling but are retained in the scored data sets

rejected — for variables that are not used in modeling and are not wanted in the scored data sets

The Neural Network node requires one or more input variables and one or more target variables. To change the model role of a variable, use the Variables table in the Input Data node properties or use the Variables table in the Data Source properties.

The inputs and targets can be nominal, ordinal, or interval. However, you may not use both ordinal inputs and ordinal targets in the same network.

For a categorical (that is, nominal or ordinal) input with C categories, the Neural Network node automatically generates C - 1 dummy variables. For nominal inputs, deviation coding is used. For each case in the ith category, for i < C, the ith dummy variable is set to 1, and all other dummy variables are set to 0. For i = C, all dummy variables are set to -1. For example, suppose there is a nominal input called SEX with values Male, Female, Both, and Neither. The dummy variables would have the values shown in the following table, where it is assumed that the categories are ordered alphabetically.
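The deviation-coding rule can be made concrete with a small Python helper (our own illustrative code, not the node's implementation) applied to the SEX example, with the categories sorted alphabetically:

```python
def deviation_code(category, levels):
    """Deviation coding for a nominal input with C levels: C - 1 dummies.
    Category i (i < C) sets dummy i to 1 and the rest to 0; the last
    category sets every dummy to -1."""
    ordered = sorted(levels)           # categories ordered alphabetically
    C = len(ordered)
    i = ordered.index(category)
    if i == C - 1:
        return [-1] * (C - 1)
    dummies = [0] * (C - 1)
    dummies[i] = 1
    return dummies

levels = ["Male", "Female", "Both", "Neither"]
for cat in sorted(levels):
    print(cat, deviation_code(cat, levels))
# Both [1, 0, 0]
# Female [0, 1, 0]
# Male [0, 0, 1]
# Neither [-1, -1, -1]
```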

Pairing Architectures with Combination and Activation Functions

Radial combination functions with altitudes are not useful with ORBF architectures, since the altitude would be redundant with the hidden-to-output weights. Altitudes are useful with NRBF architectures, since the normalization performed by the softmax function intervenes between the altitudes and hidden-to-output weights. Hence there is a wider variety of NRBF architectures than ORBF architectures.

The practical differences among these architectures can be understood in terms of the isoactivation contours of the hidden units. An isoactivation contour is the set of all points in the input space yielding a specified value for the activation function of a given hidden unit.

Since an MLP uses linear combination functions, the set of all points in the input space having a given value of the activation function is a hyperplane. The hyperplanes corresponding to different activation levels are parallel to each other (the hyperplanes for different units are not parallel in general). These parallel hyperplanes are the isoactivation contours.

The ORBF architectures use radial combination functions and the exponential activation function. Radial combination functions are based on the Euclidean distance between the vector of inputs to the unit and the vector of corresponding weights. Thus, the isoactivation contours for ORBF networks are concentric hyperspheres. The output of an ORBF network consists of a number of superimposed bumps, hence the output is quite bumpy unless many hidden units are used. Thus an ORBF network with a small number of hidden units is incapable of fitting many simple, smooth functions, and should rarely be used.

The NRBF architectures also use radial combination functions, but the activation function is softmax, which forces the sum of the activations for the hidden layer to equal one. Hence the activation of each hidden unit depends on the activations of potentially all other hidden units in the same layer, although nearby units will have more effect than those far away. Because the activation of one hidden unit depends on other hidden units, the isoactivation contours of an NRBF network are considerably more complicated than those of ORBF networks or MLPs.
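The softmax coupling can be shown with a small sketch (again an illustrative form, not PROC NEURAL's exact parameterization): whatever the input, the hidden-layer activations sum to one, so each unit's activation depends on every other unit's combination value.

```python
import math

def nrbf_hidden_activations(x, centers, widths):
    """NRBF hidden layer: radial combinations followed by a softmax, so the
    hidden-unit activations are coupled and sum to one."""
    combos = [-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2 * w ** 2)
              for c, w in zip(centers, widths)]
    m = max(combos)                       # stabilize the exponentials
    exps = [math.exp(g - m) for g in combos]
    total = sum(exps)
    return [e / total for e in exps]

acts = nrbf_hidden_activations([0.2, 0.7],
                               centers=[[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]],
                               widths=[1.0, 1.0, 0.5])
print(sum(acts))  # 1.0 up to rounding: the softmax normalization
```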

Consider the case of an NRBF network with only two hidden units. If the hidden units have equal widths, the isoactivation contours are parallel hyperplanes; in fact, this network is equivalent to an MLP with one logistic hidden unit. If the hidden units have unequal widths, the isoactivation contours are concentric hyperspheres; such a network is similar to an ORBF network with one Gaussian hidden unit.

If there are more than two hidden units in an NRBF network, the isoactivation contours have no such simple characterization. If the RBF widths are very small, the isoactivation contours are approximately piecewise linear for RBF units with equal widths, and approximately piecewise spherical for RBF units with unequal widths. The larger the widths, the smoother the isoactivation contours where the pieces join. As Shorten and Murray-Smith (1996) point out, the activation is not necessarily a monotone function of distance from the center when unequal widths are used.


In an NRBF network, each output unit computes a nonnegative weighted average of the hidden-to-output weights. Hence the output values must lie within the range of the hidden-to-output weights. Therefore, if the hidden-to-output weights are within a reasonable range (such as the range of the target values), you can be sure that the outputs will be within that same range for all possible inputs, even when the net is extrapolating. No comparably useful bound exists for the output of an ORBF network or MLP, since the output can potentially be as large as the sum of the absolute values of all the hidden-to-output weights.
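This bound is easy to demonstrate numerically. The sketch below (same illustrative NRBF form as before) shows that even far outside the region of the centers, the output stays within the range of the hidden-to-output weights:

```python
import math

def nrbf_output(x, centers, widths, out_weights):
    """NRBF output unit: a convex (nonnegative, sum-to-one) weighted average
    of the hidden-to-output weights, so the output is bounded by their range."""
    combos = [-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2 * w ** 2)
              for c, w in zip(centers, widths)]
    m = max(combos)
    exps = [math.exp(g - m) for g in combos]
    total = sum(exps)
    return sum(e / total * w for e, w in zip(exps, out_weights))

centers = [[0.0], [1.0], [2.0]]
widths = [0.5, 0.5, 0.5]
out_weights = [3.0, -1.0, 2.0]
# Even when extrapolating far from the centers, the output stays in [-1, 3]:
for x in (-100.0, 0.5, 100.0):
    y = nrbf_output([x], centers, widths, out_weights)
    print(x, min(out_weights) <= y <= max(out_weights))
```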

If you extrapolate far enough in a Gaussian ORBF network with an identity output activation function, the activation of every hidden unit will approach zero, hence the extrapolated output of the network will equal the output bias. If you extrapolate far enough in an NRBF network, one hidden unit will come to dominate the output. Hence if you want the network to extrapolate different values in different directions, an NRBF architecture should be used instead of an ORBF architecture.

The NRBFEQ architecture is a smoothed variant of the learning vector quantization (LVQ) and counterpropagation architectures. In LVQ and counterpropagation, the hidden units are often called codebook vectors. LVQ amounts to nearest-neighbor classification on the codebook vectors, while counterpropagation is nearest-neighbor regression on the codebook vectors. The NRBFEQ architecture uses not just the single nearest neighbor, but a weighted average of near neighbors. As the width of the NRBFEQ functions approaches zero, the weights approach one for the nearest neighbor and zero for all other codebook vectors. LVQ and counterpropagation use ad hoc algorithms of uncertain reliability, but standard numerical optimization techniques (not to mention backprop) can be applied with the NRBFEQ architecture.

In an NRBFEQ architecture, if each observation is taken as an RBF center, and if the weights are taken to be the target values, the outputs are simply weighted averages of the target values, and the network is identical to the well-known Nadaraya-Watson kernel regression estimator, which has been reinvented at least twice in the neural net literature. A similar NRBFEQ network used for classification is equivalent to kernel discriminant analysis. Kernels with variable widths are also used for regression in the statistical literature. Such kernel estimators correspond to the NRBFEV architecture, in which the kernel functions have equal volumes but different altitudes. In the neural net literature, variable-width kernels appear always to be of the NRBFEH variety, with equal altitudes but unequal volumes. The analogy with kernel regression would make the NRBFEV architecture the obvious choice, but which of the two architectures works better in practice is an open question.
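The Nadaraya-Watson equivalence can be sketched directly: put one Gaussian kernel on every training point, use the observed targets as the hidden-to-output weights, and the prediction is a softmax-weighted average of those targets. This is a one-dimensional illustration under those assumptions:

```python
import math

def nadaraya_watson(x, data_x, data_y, width):
    """Nadaraya-Watson kernel regression: each training case is an RBF center
    and its target value acts as the hidden-to-output weight, so the
    prediction is a normalized kernel-weighted average of the targets."""
    combos = [-(x - xi) ** 2 / (2 * width ** 2) for xi in data_x]
    m = max(combos)
    exps = [math.exp(g - m) for g in combos]
    total = sum(exps)
    return sum(e / total * yi for e, yi in zip(exps, data_y))

data_x = [0.0, 1.0, 2.0, 3.0]
data_y = [0.0, 1.0, 4.0, 9.0]
# As the width shrinks, the estimator approaches nearest-neighbor regression:
print(nadaraya_watson(1.1, data_x, data_y, width=0.05))  # close to 1.0
```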


2.3.16 Objective Functions and Error Functions

Introduction to Objective Functions and Error Functions

The Neural Network node trains a network by minimizing an objective function. The objective function is the sum of a total error function and a penalty function, divided by the total frequency. The total error function is a summation over all the cases in the training data and all the target variables of an individual error function applied to each target value and corresponding output. The penalty function is the product of the weight decay constant and the sum of all network weights other than output biases.
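The composition described above can be sketched as follows. This is illustrative only: `objective` is a hypothetical helper, and the use of squared weights in the decay penalty is an assumption of this sketch, not something stated in the text.

```python
def objective(case_errors, freqs, weights_excl_output_bias, weight_decay):
    """Objective = (total error + penalty) / total frequency.
    total error: frequency-weighted sum of the per-case error values;
    penalty: weight decay constant times the sum of squared network weights
    other than output biases (the squaring is an assumption of this sketch)."""
    total_error = sum(e * f for e, f in zip(case_errors, freqs))
    penalty = weight_decay * sum(w * w for w in weights_excl_output_bias)
    return (total_error + penalty) / sum(freqs)

obj = objective(case_errors=[0.5, 0.2, 0.9], freqs=[1, 1, 2],
                weights_excl_output_bias=[0.3, -0.4], weight_decay=0.1)
print(obj)
```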

The objective function has several different forms depending on the types of error functions used. Maximum likelihood is the most important of all statistical estimation methods. For computational purposes, it is the negative log likelihood that is minimized, rather than the likelihood that is maximized. For noise distributions in the exponential family, the deviance function can be used instead of the likelihood function. The deviance is the difference between the likelihood for the actual network and the likelihood for a saturated model in which there is one weight for every case in the training set. The deviance cannot be negative and is zero for a perfect fit to noise-free data; these properties provide the deviance with computational advantages, so deviance training is often faster than likelihood training. However, the deviance function cannot be used for noise distributions that are not in the exponential family, and the Neural Network node does not estimate scale parameters when deviance training is used. Because scale parameters are not estimated when the deviance function is used, if there are multiple target variables, you should take care to standardize the target variables appropriately. A third form of objective function is provided by M-estimation, which is primarily used for robust estimation when the target values may contain outliers. As in likelihood estimation, M-estimation can be used to estimate scale parameters.

If you do not specify an objective function, the Neural Network node tries to choose one that is compatible with all of the specified error functions. The allowable combinations of error function and objective function are shown in the following table.


2.3.1 Choice of Architecture

In developing a neural network, there are many choices to be made: the number of inputs to use, which basic network architecture to use, whether to use a hidden layer, the number of units in the hidden layer, the activation and combination functions to use, and so on. If you have considerable prior information about the function to be learned, you may be able to make some of these choices based on theoretical considerations. More often, it takes trial and error to find a good architecture. This section will offer some general guidelines.

Since regression and trees are often much faster to train than neural nets, you should try those methods first to provide a baseline for comparing neural network results.

For exploring network architectures, there is no need to use huge data sets. For an interval target, the number of training cases need only be 5 to 25 times the number of estimated parameters in the network, depending on the noise level in the target variable. For classification, the number of training cases in the smallest class need only be 5 to 25 times the number of estimated parameters in the network. You can use a representative sample (a stratified sample if you are trying to predict rare events) of the data for exploration and save the weights of the best architecture you find. Then you can use those weights to initialize training on a larger data set to fine-tune the network.
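Applying the 5-to-25-times rule of thumb requires counting the estimated parameters. For a fully connected MLP with one hidden layer and no direct connections, the count works out as in this small sketch (`mlp_parameter_count` is a hypothetical helper for illustration):

```python
def mlp_parameter_count(n_inputs, n_hidden, n_outputs):
    """Weights and biases of a fully connected MLP with one hidden layer
    and no direct input-to-output connections."""
    hidden = n_hidden * (n_inputs + 1)   # input-to-hidden weights + hidden biases
    output = n_outputs * (n_hidden + 1)  # hidden-to-output weights + output biases
    return hidden + output

p = mlp_parameter_count(n_inputs=10, n_hidden=3, n_outputs=1)
print(p)              # 37 estimated parameters
print(5 * p, 25 * p)  # 185 to 925 training cases for an interval target
```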

Hidden Layer

Your neural network architecture may not need a hidden layer at all. Linear and generalized linear models are useful in a wide variety of applications. And even if the function you want to learn is mildly nonlinear, you may get better generalization with a simple linear model than with a complicated nonlinear model if there is too little data or too much noise to estimate the nonlinearities accurately.

In MLPs that have any of a wide variety of continuous nonlinear hidden-layer activation functions, one hidden layer with an arbitrarily large number of units suffices for the universal approximation property. But there is no theory to tell you how many hidden units are needed to approximate any given function.

Number of Hidden Units

The best number of hidden units depends on the number of training cases, the amount of noise, and the complexity of the function you are trying to learn. If you are using a neural network, you probably don't know exactly how much noise the target values have, or how complex the function is. Hence, it will generally be impossible to determine the best number of hidden units without training numerous networks with different numbers of hidden units and estimating the generalization error of each network.

The simplest procedure is to begin with a network with no hidden units, then add one hidden unit at a time. Estimate the generalization error of each network. Stop adding hidden units when the generalization error goes up.
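That procedure can be sketched as a generic search loop. The error-estimation callable and the example error curve below are hypothetical stand-ins for actually training and validating a network at each size:

```python
def select_hidden_units(estimate_generalization_error, max_units=64):
    """Start with no hidden units and add one at a time; stop as soon as the
    estimated generalization error goes up, returning the previous size.
    estimate_generalization_error(h) is a caller-supplied stand-in for
    training a network with h hidden units and validating it."""
    best_h, best_err = 0, estimate_generalization_error(0)
    for h in range(1, max_units + 1):
        err = estimate_generalization_error(h)
        if err > best_err:           # error went up: keep the previous network
            break
        best_h, best_err = h, err
    return best_h, best_err

# Hypothetical error curve: improves until 4 hidden units, then overfits.
curve = {0: 1.00, 1: 0.60, 2: 0.45, 3: 0.40, 4: 0.38, 5: 0.41, 6: 0.50}
print(select_hidden_units(lambda h: curve[h]))  # (4, 0.38)
```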

The Neural Network node will provide you with a rough guess for the largest number of hidden units you can use without serious risk of overfitting, provided that you supply a rough guess for the noise level in the target values. You can specify the noise level as High, Moderate, Low, or Noiseless via the Neural Network node's properties panel. The noise levels can be interpreted according to the following table.


2.3.2 Neural Network Node Train Properties

The following train properties are available in the Neural Network node:

Variables — Use the Variables property of the Neural Network node to specify the properties of each variable that you want to use in your data source. Select the button to the right of the Variables property to open a variables table. You can set the variable status to either Use or Don't Use in the table, as well as select which variables you want included in reports.

Continue Training — Use the Continue Training property of the Neural Network node to specify whether current estimates should be used as starting values for training. If the Continue Training property is set to Yes, the estimates from a previous run of the node will be used as initial values for subsequent training, and all other properties will be ignored. To use the Continue Training property, an estimates data set must have been created by the node prior to setting the Continue Training property to Yes.

Network — Select the button to the right of the Network property to open a window that you can use to customize network options. The available user-defined network options that you can choose in the User Defined Network Options window are

Architecture — Use the Architecture property of the Neural Network node to specify the network architecture that you want to use during network training. The default is MLP (multilayer perceptron), which has no direct connections, and the number of hidden neurons is data dependent.

You can choose from the following network architectures:

GLIM — generalized linear model.

MLP — (default setting) multilayer perceptron.

ORBFEQ — ordinary radial basis function with equal widths.

ORBFUN — ordinary radial basis function with unequal widths.

NRBFEH — normalized radial basis with equal heights.

NRBFEV — normalized radial basis with equal volumes.

NRBFEW — normalized radial basis with equal widths.

NRBFEQ — normalized radial basis with equal widths and heights.

NRBFUN — normalized radial basis with unequal widths and heights.

User — enables the user to define a network that has a single hidden layer.


Selecting the User property enables the User Defined Network Options property setting.

Direct Connection — Direct connections define linear layers, whereas hidden neurons define nonlinear layers. Use the Direct Connection property of the Neural Network node to indicate whether you want to create direct connections from inputs to outputs in your neural network. The default setting for the Direct Connection property is No. In the default setting, each input unit is connected to each hidden unit, and each hidden unit is connected to each output unit. Setting the Direct Connection property to Yes adds direct connections between each input unit and each output unit.

Number of Hidden Units — Use the Number of Hidden Units property of the Neural Network node to specify the number n of hidden units that you want to use in the hidden layer. Permissible values are integers between 1 and 64. The default value is 3.

Randomization Distribution — Use the Randomization Distribution property of the Neural Network node to specify the distribution to apply to the weights in your user-defined network. Permissible settings for the Randomization Distribution property of the Neural Network node are

Normal — The weights are distributed using a Normal bell curve. Normal is the default setting for the Randomization Distribution property.

Cauchy — The distributed weights favor the most frequent values in the Cauchy distribution. Cauchy is most often used when you want to predict the approximate conditional mode instead of the conditional mean.

Uniform — The weights are uniformly distributed across the network with the Uniform Distribution setting.

Randomization Center — Use the Randomization Center property of the Neural Network node to specify the center of the distribution of the weights of your user-defined network. Permissible values are real numbers. The default value for the Randomization Center property is 0.0.

Randomization Scale — Use the Randomization Scale property of the Neural Network node to specify the scale of the distribution of the weights in your user-defined network. Permissible values are real numbers. The default value for the Randomization Scale property is 0.1.

Input Standardization — Use the Input Standardization property of the Neural Network node to specify the method that is used to standardize the input values in your user-defined network. Permissible values are None, Standard Deviation, Range, and Midrange. The default setting for the Input Standardization property is Standard Deviation.


Hidden Layer Combination Function — The computed values from previous units in a neural network feeding into the given unit are combined into a single value using a combination function. The combination function uses the weights, bias, and altitude. Use the Hidden Layer Combination Function property of the Neural Network node to specify the hidden layer combination function that you want to use in your user-defined network. Permissible settings are

Add — Additive combination function. Additive combination functions are used in architectures such as generalized additive networks. The default activation function for the additive combination function is the Identity function. The additive combination function can use any number of hidden units.

Linear — Linear combination function. Linear combination functions are most often used for multilayer perceptrons (MLPs). The default activation function for the Linear combination function is the Hyperbolic Tangent. The Linear combination function can use any number of hidden units.

EQSlopes — EQSlopes is identical to the Linear combination function, except that the same connection weights are used for each unit in the layer, although different units have different biases. EQSlopes is mainly used for ordinal targets. The default activation function for the EQSlopes combination function is the Hyperbolic Tangent. The EQSlopes combination function can use any number of hidden units.

EQRadial — Radial basis function with equal heights and widths for all units in the layer. Radial combination functions that have one hidden unit use the Exponential activation function by default. Radial combination functions that have two or more hidden units use the Softmax activation function by default.

EHRadial — Radial basis function with equal heights but not widths for all units in the layer. Radial combination functions that have one hidden unit use the Exponential activation function by default. Radial combination functions that have two or more hidden units use the Softmax activation function by default.

EWRadial — Radial basis function with equal widths but not heights for all units in the layer. Radial combination functions that have one hidden unit use the Exponential activation function by default. Radial combination functions that have two or more hidden units use the Softmax activation function by default.

EVRadial — Radial basis function with equal volumes for all units in the layer. Radial combination functions that have one hidden unit use the Exponential activation function by default. Radial combination functions that have two or more hidden units use the Softmax activation function by default.

XRadial — Radial basis function with unequal widths and heights for all units in the layer. Radial combination functions that have one hidden unit use the Exponential activation function by default. Radial combination functions that have two or more hidden units use the Softmax activation function by default.

Default — The Neural Network node uses the default PROC NEURAL setting for the Hidden Layer Combination Function, based on target and other Neural Network node property settings.

Hidden Layer Activation Function — The value produced by the combination function is transformed by an activation function, which involves no weights or other estimated parameters. Use the Hidden Layer Activation Function property of the Neural Network node to specify the hidden layer activation function for your user-defined network. Permissible values are

ArcTangent — Arc tangent hidden layer activation function for linear combination functions. The Arc Tangent function can be used as an alternative to the default Hyperbolic Tangent activation function. The arc tangent function for net input g is arctan(g).

Elliot — Elliot hidden layer activation function for linear combination functions. The Elliot function can be used as an alternative to the default Hyperbolic Tangent activation function.

Hyperbolic Tangent — Hyperbolic Tangent is the default hidden layer activation function for linear combination functions. The hyperbolic tangent function for net input g is tanh(g).

Logistic — The Logistic hidden layer activation function for net input g is 1/(1 + exp(-g)).

Gauss — The Gauss hidden activation function is useful when the data contains local features (for example, bumps) in low-dimensional subspaces.

Sine — Use the Sine hidden activation function with caution, because it yields an infinite Vapnik-Chervonenkis dimension, which can cause poor generalization. The range for the Sine activation function is [-1, 1] and the function for net input g is sin(g).

Cosine — Use the Cosine hidden activation function with caution, because it yields an infinite Vapnik-Chervonenkis dimension, which can cause poor generalization. The range for the Cosine activation function is [-1, 1] and the function for net input g is cos(g).


Default — The Neural Network node uses the default PROC NEURAL setting for the Hidden Layer Activation Function, based on the combination function setting and the number of hidden units that exist in the layer.
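For intuition, the activation functions named above for linear combination functions can be evaluated side by side. This sketch uses the formulas given in the text where stated; the Elliot form g / (1 + |g|) is an assumption of this sketch, not taken from the text:

```python
import math

# Illustrative definitions of several hidden layer activation functions
# applied to a net input g.
activations = {
    "Hyperbolic Tangent": lambda g: math.tanh(g),
    "ArcTangent":         lambda g: math.atan(g),
    "Logistic":           lambda g: 1.0 / (1.0 + math.exp(-g)),
    "Elliot":             lambda g: g / (1.0 + abs(g)),  # assumed form
    "Sine":               lambda g: math.sin(g),
}

for name, f in activations.items():
    print(f"{name:18s} f(0) = {f(0.0):+.3f}   f(2) = {f(2.0):+.3f}")
```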

Hidden Bias — Use the Hidden Bias property of the Neural Network node to specify whether to compensate the user-defined network for hidden bias. Permissible values are Yes and No. The default value is Yes.

Target Layer Combination Function — The computed values from predecessor units that feed into the given unit are combined into a single value using a combination function. The combination function uses the weights, bias, and altitude. Use the Target Layer Combination Function property of the Neural Network node to specify the target layer combination function for your user-defined network. Permissible values are

Add — Additive combination function with no weights or bias. Additive combination functions are used in architectures such as generalized additive networks. The default activation function for the additive combination function is the Identity function. The additive combination function can use any number of hidden units.

Linear — Linear combination functions compute a linear combination of the weights and the values that feed into the unit and then add the bias value (the bias acts like an intercept). The default activation function for the Linear combination function is the Hyperbolic Tangent. The Linear combination function can use any number of hidden units.

EQSlopes — EQSlopes is identical to the Linear combination function, except that the same connection weights are used for each unit in the layer, although different units have different biases. EQSlopes is mainly used for ordinal targets. The default activation function for the EQSlopes combination function is the Hyperbolic Tangent. The EQSlopes combination function can use any number of hidden units.

EQRadial — Radial basis function with equal heights and widths for all units in the layer. Radial combination functions that have one hidden unit use the Exponential activation function by default. Radial combination functions that have two or more hidden units use the Softmax activation function by default.

EHRadial — Radial basis function with equal heights but not widths for all units in the layer. Radial combination functions that have one hidden unit use the Exponential activation function by default. Radial combination functions that have two or more hidden units use the Softmax activation function by default.

EWRadial — Radial basis function with equal widths but not heights for all units in the layer. Radial combination functions that have one hidden unit use the Exponential activation function by default. Radial combination functions that have two or more hidden units use the Softmax activation function by default.

EVRadial — Radial basis function with equal volumes for all units in the layer. Radial combination functions that have one hidden unit use the Exponential activation function by default. Radial combination functions that have two or more hidden units use the Softmax activation function by default.

XRadial — Radial basis function with unequal widths and heights for all units in the layer. Radial combination functions that have one hidden unit use the Exponential activation function by default. Radial combination functions that have two or more hidden units use the Softmax activation function by default.

Default — The Neural Network node uses the default PROC NEURAL setting for the Target Layer Combination Function. The default target layer combination function varies according to the activation function.

Target Layer Activation Function — The value produced by the combination function is transformed by an activation function, which involves no weights or other estimated parameters. Use the Target Layer Activation Function property of the Neural Network node to specify the target layer activation function for your user-defined network. The default activation function depends on the measurement level of the target variable. Select a function with a range that corresponds to the actual distribution of target values. Permissible values are

Identity — Identity is the default target layer activation function for interval target variables. Identity is suggested for unbounded target variables with range (-∞, ∞). The Identity function for the net input g is g.

Linear — Linear target layer activation function.

Exponential — The Exponential target layer activation function is suggested for target variables with the range (0, ∞). The exponential function for the net input g is exp(g).

Reciprocal — The Reciprocal target layer activation function is suggested for target variables with the range (0, ∞). The reciprocal function for the net input g is 1/g.

Square — The Square target layer activation function is suggested for target variables with the range [0, ∞). The square function for the net input g is g².

Logistic — Logistic is the default target layer activation function for ordinal target variables with the range (0, 1). The logistic function for the net input g is 1/(1 + exp(-g)).

MLogistic — Multiple logistic, or Softmax, is the only output activation function allowed for nominal targets with an error function of multiple Bernoulli, multiple entropy, or multinomial. The range for the MLogistic target variables is (0, 1) and the function for the ith net input g_i is exp(g_i) / Σ_j exp(g_j).

Default — The Neural Network node uses the default PROC NEURAL setting for the Target Layer Activation Function, based on other Neural Network node property settings. The default target layer activation function depends on the selected combination function.
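The ranges quoted for these target activation functions can be checked with a small sketch. The single-output formulas follow the text; the softmax normalization is the standard multiple-logistic form:

```python
import math

def softmax(gs):
    """MLogistic (softmax) output activation: maps the vector of net inputs
    to values in (0, 1) that sum to one."""
    m = max(gs)                         # stabilize the exponentials
    exps = [math.exp(g - m) for g in gs]
    total = sum(exps)
    return [e / total for e in exps]

# Single-output target activations and the suggested ranges from the text.
identity = lambda g: g                           # (-inf, inf)
exponential = lambda g: math.exp(g)              # (0, inf)
reciprocal = lambda g: 1.0 / g                   # (0, inf)
square = lambda g: g * g                         # [0, inf)
logistic = lambda g: 1.0 / (1.0 + math.exp(-g))  # (0, 1)

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```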

Target Layer Error Function — Each unit in a neural network produces a single computed value. Input and hidden units pass the computed values to other hidden or output units. The predicted value for output units is compared with the target value to compute the error function. Use the Target Layer Error Function property of the Neural Network node to specify the target layer error function for your user-defined network. Permissible values are

Normal — Normal distribution, also called the least-squares or mean-squared-error criterion. Normal is suitable for unbounded interval targets with constant conditional variance, no outliers, and a symmetric distribution. Normal can also be used for categorical targets that have outliers.

Cauchy — The Cauchy distribution may be used with any kind of target variable. It is most often used when you want to predict the approximate conditional mode instead of the conditional mean.

Logistic — The Logistic distribution may be used with any kind of target variable. It is most often used with interval targets where the noise distribution may contain outliers.

Huber — The Huber M-estimator is suitable for unbounded interval targets that have outliers or that have a moderate degree of inequality of the conditional variance and a symmetric distribution. Huber can also be used for categorical targets when you want to predict the mode rather than the posterior probability.

Biweight — The Biweight M-estimator may be used with any kind of target variable. It is most often used with interval targets where the noise distribution may contain severe outliers. Because of severe problems with local minima, you should obtain initial values by training with the Huber M-estimator before using the Biweight M-estimator.


Wave — The Wave M-estimator may be used with any kind of target variable. It is most often used with interval targets where the noise distribution may contain severe outliers. Because of severe problems with local minima, you should obtain initial values by training with the Huber M-estimator before using the Wave M-estimator.

Gamma — The Gamma distribution may be used only with strictly positive interval target variables. It is most often used when the standard deviation of the noise is proportional to the mean of the target variable.

Default — The Neural Network node uses the default PROC NEURAL setting for the Target Layer Error Function, based on the specified objective function settings.
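The robustness difference between the Normal criterion and an M-estimator such as Huber can be seen numerically. This is an illustrative comparison using the conventional Huber loss form (quadratic near zero, linear in the tails); the tuning constant 1.345 is a common textbook choice, not a stated SAS default:

```python
def normal_error(residual):
    """Least-squares (Normal) error for one case."""
    return residual ** 2

def huber_error(residual, delta=1.345):
    """Huber M-estimator error: quadratic for small residuals, linear for
    large ones, so a single outlier contributes far less than under
    least squares."""
    r = abs(residual)
    if r <= delta:
        return 0.5 * r * r
    return delta * (r - 0.5 * delta)

print(normal_error(0.5), huber_error(0.5))    # both small near zero
print(normal_error(10.0), huber_error(10.0))  # outlier: 100.0 vs about 12.5
```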

Target Bias — Use the Target Bias property of the Neural Network node to specify a target bias. Permissible values are Yes and No. The default value is Yes.

Weight Decay — Use the Weight Decay property to specify the weight decay parameter. The larger the weight decay value, the greater the restriction on total weight growth. The weight decay parameter does not affect the output weights when the activation function in the hidden layer is set to Softmax, because the adjustment to the error is too small. The default setting for the Weight Decay property is zero.

Optimization — Select the button to the right of the Optimization property to open a window that you can use to customize Neural Network optimization settings.

The following Neural Network optimization properties can be configured in the Optimization window:

Training Technique — Use the Training Technique property of the Neural Network node to specify the neural network training algorithm that you want to use. The available training techniques are

Default — Use the Default property to let SAS choose the default training technique. The default method that SAS chooses depends on the number of weights that are applied to the data at execution.

Trust-Region — Use the Trust-Region technique for small and medium optimization problems that have up to 40 parameters.

Levenberg-Marquardt — Use the Levenberg-Marquardt technique for smooth least squares objective functions that have up to 100 parameters.

Quasi-Newton — Use the Quasi-Newton technique for medium optimization problems that have up to 500 parameters.


Conjugate Gradient — Use the Conjugate Gradient technique for large data mining problems with over 500 parameters.

Double Dogleg — The Double Dogleg optimization technique combines the ideas of the Quasi-Newton and Trust-Region methods. In each iteration, the Double Dogleg algorithm computes the step s(k) as a linear combination of the steepest descent or ascent search direction s1(k) and a Quasi-Newton search direction s2(k), where the step is forced not to leave a prespecified Trust-Region radius.

Back Prop — The standard batch backpropagation technique. Standard backprop is the most popular backpropagation method, but it is slow, unreliable, and requires the user to tune the learning rate manually, which can be a tedious process.

RProp — The standard incremental backprop technique uses a separate learning rate for each weight and adapts all the learning rates during training. The RProp technique tends to take more iterations than conjugate gradient methods, but each iteration is very fast.

QProp — The QuickProp technique is a Newton-like method that uses a diagonal approximation to the Hessian matrix. Like RProp, QProp tends to take more iterations than conjugate gradient methods, but each iteration is very fast.

Maximum Iterations — Use the Maximum Iterations property of the Neural Network node Train Options to specify the maximum number of iterations that you want to allow during network training. Permissible values are integers from 1 to 1000. The default value is 20.

Maximum Time — Use the Maximum Time property of the Neural Network node Train Options to specify the maximum amount of CPU time that you want to use during training. Permissible values are 5 minutes, 10 minutes, 30 minutes, 1 hour, 2 hours, 4 hours, or 7 hours. The default setting for the Maximum Time property is 1 hour.

Nonlinear Options — The Nonlinear Options section of the Optimization window contains the following settings:

Absolute — Use the Absolute convergence property of the Neural Network node to specify an absolute convergence criterion. The absolute convergence criterion is a function of the log likelihood for the intercept-only model. The optimization maximizes the log likelihood. The default value is -1.34078E154.

Absolute Function — Use the Absolute Function property of the Neural Network node to specify an absolute function convergence criterion. Absolute Function is a function of the log likelihood for the intercept-only model. The minimum value is 0 and the default value is 0.0.

Absolute Function Times — Use the Absolute Function Times property of the Neural Network node to specify the number of successive iterations for which the absolute function convergence criterion must be satisfied before the process can be terminated. The minimum value and the default value are both 1.

Absolute Gradient — Use the Absolute Gradient property of the Neural Network node to specify the absolute gradient convergence criterion that you want to use. All values must be larger than 0. The default value is 1.0E-5.

Absolute Gradient Times — Use the Absolute Gradient Times property of the Neural Network node to specify the number of successive iterations for which the absolute gradient convergence criterion must be satisfied before the process can be terminated. The default value is 1.

Absolute Parameter — Use the Absolute Parameter property of the Neural Network node to specify the absolute parameter convergence criterion. The default value is 1.0E-8. The value must be greater than zero.

Absolute Parameter Times — Use the Absolute Parameter Times property of the Neural Network node to specify the number of successive iterations for which the absolute parameter convergence criterion must be satisfied before the process can be terminated. The default value is 1.

Relative Function — Use the Relative Function property of the Neural Network node to specify the relative function convergence criterion. The minimum and the default value are both 0.0.

Relative Function Times — Use the Relative Function Times property of the Neural Network node to specify the number of successive iterations for which the relative function convergence criterion must be satisfied before the process can be terminated. The default value is 1.

Relative Gradient — Use the Relative Gradient property of the Neural Network node to specify the relative gradient convergence criterion. The default value is 1.0E-6.

Relative Gradient Times — Use the Relative Gradient Times property of the Neural Network node to specify the number of successive iterations for which the relative gradient convergence criterion must be satisfied before the process can be terminated. Permissible values for the Relative Gradient Times property are integers greater than 0. The default value is 1.


Propagation Options — The Propagation Options section of the Optimization window contains the following settings:

Accelerate — Use the Accelerate propagation option to specify the rate of increase of learning for the RPROP optimization technique.

Decelerate — Use the Decelerate propagation option to specify the rate of decrease of learning for the RPROP optimization technique.

Learn—UsetheLearnpropagationoptiontospecifythelearningrateforBACKPROP or the initial learning rate for QPROP and RPROPoptimizationtechniques.

MaximumLearning—UsetheMaximumLearningpropagationoptiontospecifythemaximumlearningratefortheRPROPoptimizationtechnique.

MinimumLearning—UsetheMinimumLearningpropagationoptiontospecifytheminimumlearningratefortheRPROPoptimizationtechnique.

Momentum — Use the Momentum propagation option to specify themomentumrate.

Maximum Momentum — Use the Momentum propagation option tospecify the maximum momentum rate for the QPROP optimizationtechnique.

Tilt — Use the Tilt propagation option to specify the tilt rate for theQPROPoptimizationtechnique.
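The Accelerate, Decelerate, Minimum Learning, and Maximum Learning settings interact in the classic RPROP step-size update: the per-weight step grows while the gradient keeps its sign and shrinks when the sign flips. A minimal sketch of that update rule follows (illustrative Python with assumed default factors, not SAS code):

```python
def rprop_step_size(step, prev_grad, grad,
                    accelerate=1.2, decelerate=0.5,
                    min_learn=1.0e-6, max_learn=50.0):
    """One RPROP per-weight step-size update (illustrative sketch only;
    the default factors here are assumptions, not SAS's settings)."""
    if prev_grad * grad > 0:
        # Gradient kept its sign: accelerate, capped at Maximum Learning.
        return min(step * accelerate, max_learn)
    if prev_grad * grad < 0:
        # Gradient flipped sign (overshoot): decelerate, floored at Minimum Learning.
        return max(step * decelerate, min_learn)
    return step  # a zero gradient leaves the step unchanged

print(rprop_step_size(0.1, prev_grad=-0.3, grad=-0.2))  # grows toward 0.12
print(rprop_step_size(0.1, prev_grad=-0.3, grad=0.4))   # shrinks toward 0.05
```

Because only the sign of the gradient is used, RPROP is insensitive to the gradient's magnitude, which is why it needs the explicit minimum and maximum learning-rate bounds.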

Preliminary Training Options — The Preliminary Training Options section of the Optimization window contains the following settings:

Enable — Use the Enable property to specify whether to enable preliminary training in the Neural Network node. The default setting for the Enable property is Yes.

Number of Runs — Use the Number of Runs property of the Neural Network node to specify the number of preliminary runs that you want to perform. Permissible values are integers greater than or equal to 1. The default value is 5.

Maximum Iterations — Use the Maximum Iterations property to specify the maximum number of iterations that you want to allow during preliminary network training. Permissible values are integers from 1 to 100. The default value is 10.

Maximum Time — Use the Maximum Time property to specify the maximum amount of CPU time that you want to use during preliminary training. Permissible values are 5 minutes, 10 minutes, 30 minutes, 1 hour, 2 hours, 4 hours, or 7 hours. The default setting for the Maximum Time property is 1 hour.

Initialization Seed — Use the Initialization Seed property to specify the random seed value that you want to use during network weight optimization.

Model Selection Criterion — Use the Model Selection Criterion property of the Neural Network node to specify the most appropriate model from your network. The available criteria are

Profit/Loss — (default setting) The Profit/Loss model selection criterion chooses the model that maximizes the profit or minimizes the loss for the cases in the validation data set. If a validation data set is not available, then the node uses the training data set.

Misclassification — The Misclassification model selection criterion chooses the model that has the smallest misclassification rate for the validation data set. If the validation data set does not exist, then the training data set is used.

Average Error — The Average Error model selection criterion chooses the model that has the smallest average error for the validation data set. If the validation data set does not exist, then the training data set is used.
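All three criteria share the same fallback from validation to training data, which can be sketched as follows (hypothetical dictionary layout and function name, invented for illustration; profit is shown as a maximized criterion, the error-style criteria as minimized):

```python
def select_model(models, criterion="misclassification"):
    """Choose the best model by the given criterion, using validation
    statistics when available and falling back to training statistics
    otherwise. Hypothetical sketch of the selection logic; the dictionary
    layout is invented for illustration."""
    def score(m):
        stats = m.get("validation") or m["train"]
        value = stats[criterion]
        # "profit" is maximized; error-style criteria are minimized.
        return -value if criterion == "profit" else value
    return min(models, key=score)

models = [
    {"name": "A", "train": {"misclassification": 0.18},
     "validation": {"misclassification": 0.25}},
    {"name": "B", "train": {"misclassification": 0.20},
     "validation": {"misclassification": 0.23}},
]
# B wins: selection looks at validation error, not training error.
print(select_model(models)["name"])  # B
```

Note that model A has the lower training error but the higher validation error, so it loses: basing selection on held-out data is what guards against overfitting.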

Suppress Output — Set the Suppress Output property of the Neural Network node to Yes if you want to suppress all printed output generated by the Neural Network node. The default setting for the Suppress Output property is No.


2.3.3 Neural Network Node Results

You can open the Results window of the Neural Network node by right-clicking the node and selecting Results from the pop-up menu.

Select View from the main menu to view the following results in the Results window:

Properties

Settings — displays a window with a read-only table of the Neural Network node properties configuration when the node was last run.

Run Status — indicates the status of the Neural Network node run. The Run Start Time, Run Duration, and information about whether the run completed successfully are displayed in this window.

Variables — a table of the variables in the training data set.

Train Code — the code that Enterprise Miner used to train the node.

Notes — Select the button to the right of the Notes property to open a window that you can use to store notes of interest, such as data or configuration information.

SAS Results

Log — the SAS log of the Neural Network run.

Output — the SAS output of the Neural Network run. The SAS output for the Neural Network node includes a variable summary, a model event summary, a decision matrix, a table of predicted and decision variables, optimization start parameter estimates, dual quasi-Newton optimization statistics, optimization results parameter estimates, optimization summary statistics, fit statistics for train, classification tables for train and validation data sets, decision tables for train and validation data sets, an event classification table, assessment score rankings for train and validation data sets, and assessment score distributions for train and validation data sets.

Flow Code — the SAS code used to produce the output that the Neural Network node passes on to the next node in the process flow diagram.

Scoring

SAS Code — the SAS score code that was created by the node. The SAS score code can be used outside of the Enterprise Miner environment in custom user applications.


PMML Code — the PMML code that was generated by the node. The PMML Code menu item is dimmed and unavailable unless PMML is enabled.

Assessment

Fit Statistics — opens the Neural Network node Fit Statistics table, which contains fit statistics from the model that are calculated for interval and non-interval targets.

Residual Statistics — opens the residual statistics chart for an interval target.

Score Rankings Overlay — In a score rankings chart, several statistics for each decile (group) of observations are plotted on the vertical axis. For a binary target, all observations in the scored data set are sorted by the posterior probabilities of the event level in descending order. For a nominal or ordinal target, observations are sorted from highest expected profit to lowest expected profit (or from lowest expected loss to highest expected loss). Then the sorted observations are grouped into deciles based on the Decile Bin property, and observations in a decile are used to calculate the statistics that are plotted in deciles charts.

The Score Rankings Overlay plot displays both train and validate statistics on the same axis. By default, the horizontal axis of a score rankings chart displays the deciles (groups) of the observations. The vertical axis displays the following values, and their mean, minimum, and maximum (if any):

posterior probability of target event

number of events

cumulative and noncumulative lift values

cumulative and noncumulative % response

cumulative and noncumulative % captured response

gain

actual profit or loss

expected profit or loss.
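One of these plotted values, cumulative lift, can be illustrated with a small sketch that sorts observations by posterior probability and compares each cumulative group's event rate to the overall rate (illustrative Python; the node's grouping follows the Decile Bin property, which this sketch simplifies to equal-sized groups):

```python
def cumulative_lift(scores, events, n_groups=10):
    """Cumulative lift per group after sorting by descending posterior
    probability. Illustrative sketch: groups are equal-sized here, whereas
    the node's grouping follows the Decile Bin property."""
    pairs = sorted(zip(scores, events), key=lambda p: -p[0])
    outcomes = [e for _, e in pairs]
    overall_rate = sum(outcomes) / len(outcomes)
    size = len(outcomes) // n_groups
    lifts = []
    for g in range(1, n_groups + 1):
        head = outcomes[: g * size]  # cumulative: the top g groups
        lifts.append((sum(head) / len(head)) / overall_rate)
    return lifts

# 10 scored observations, 3 events, split into 5 groups of 2:
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
events = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
print(cumulative_lift(scores, events, n_groups=5))
```

By construction, the cumulative lift of the last group is always 1.0, because it covers the entire data set; a useful model shows lift well above 1.0 in the top groups.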

Score Distribution — The Score Distribution chart plots the proportions of events (by default), nonevents, and other values on the vertical axis. The values on the horizontal axis represent the model score of a bin. The model score depends on the prediction of the target and the number of buckets used.


For categorical targets, observations are grouped into bins, based on the posterior probabilities of the event level and the number of buckets.

The Score Distribution chart of a useful model shows a higher percentage of events for higher model scores and a higher percentage of nonevents for lower model scores. For interval targets, observations are grouped into bins, based on the actual predicted values of the target. The default chart choice is Percentage of Events. Multiple chart choices are available for the Score Distribution chart. The chart choices are

Percentage of Events — for categorical targets

Number of Events — for categorical targets

Cumulative Percentage of Events — for categorical targets

Mean for Predicted — for interval targets

Max. for Predicted — for interval targets

Min. for Predicted — for interval targets
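As a rough illustration of how such a chart's data could be derived for a categorical target, the following sketch bins observations by posterior probability and computes the percentage of events per bin (equal-width buckets over [0, 1] are an assumption here, not the node's documented binning algorithm):

```python
def score_distribution(posteriors, events, n_buckets=10):
    """Percentage of events per score bin for a categorical target.
    Illustrative sketch: equal-width buckets over [0, 1] are an assumption,
    not the node's documented binning algorithm."""
    bins = [[] for _ in range(n_buckets)]
    for p, e in zip(posteriors, events):
        i = min(int(p * n_buckets), n_buckets - 1)  # clamp p == 1.0 into the top bin
        bins[i].append(e)
    # None marks empty bins
    return [100.0 * sum(b) / len(b) if b else None for b in bins]

posteriors = [0.05, 0.15, 0.55, 0.60, 0.95, 0.90]
events = [0, 0, 1, 0, 1, 1]
print(score_distribution(posteriors, events, n_buckets=2))  # [0.0, 75.0]
```

A useful model shows the pattern described above: the event percentage rises with the bin's score.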

Model

Iteration Plot — by default, a line plot of the average squared error versus the training iteration is displayed.

Weights — Final — a two-dimensional histogram of all weights generated for the selected iteration of the selected run.

Weights — History — a chart containing the neural history estimates.

Table — displays a table that contains the underlying data used to produce a chart. The Table menu item is dimmed and unavailable unless a results chart is open and selected.

Plot — opens the Graph Wizard to modify an existing Results plot or create a Results plot of your own. The Plot menu item is dimmed and unavailable unless a Results chart or table is open and selected.


2.3.4 Example with a Neural Network Model

Neural networks are a class of parametric models that can accommodate a wider variety of nonlinear relationships between a set of predictors and a target variable than can logistic regression. Building a neural network model involves two main phases. First, you must define the network configuration. You can think of this step as defining the structure of the model that you want to use. Then, you iteratively train the model.

A neural network model will be more complicated to explain to the management of your organization than a regression or a decision tree. However, you know that the management would prefer a stronger predictive model, even if it is more complicated. So, you decide to run a neural network model, which you will compare to the other models later in the example.

Because neural networks are so flexible, SAS Enterprise Miner has two nodes that fit neural network models: the Neural Network node and the AutoNeural node. The Neural Network node trains a specific neural network configuration; this node is best used when you know a lot about the structure of the model that you want to define. The AutoNeural node searches over several network configurations to find one that best describes the relationship in a data set and then trains that network.

This example does not use the AutoNeural node. However, you are encouraged to explore the features of this node on your own.

Before creating a neural network, you will reduce the number of input variables with the Variable Selection node. Performing variable selection reduces the number of input variables and saves computer resources. To use the Variable Selection node to reduce the number of input variables that are used in a neural network:

Select the Explore tab on the Toolbar.

Select the Variable Selection node icon. Drag the node into the Diagram Workspace.

Connect the Transform Variables node to the Variable Selection node.


In the Diagram Workspace, right-click the Variable Selection node, and select Run from the resulting menu. Click Yes in the Confirmation window that opens. In the window that appears when processing completes, click Results. The Results window appears. Expand the Variable Selection window (Figure 2-5).

Figure 2-5

Examine the table to see which variables were selected. The role for variables that were not selected has been changed to Rejected. Close the Results window.

Note: In this example, a forward stepwise least squares regression method was used for variable selection. It maximizes the model R-square value. For more information about this method, see the SAS Enterprise Miner Help.

The input data is now ready to be modeled with a neural network. To use the Neural Network node to train a specific neural network configuration:

From the Model tab on the Toolbar, select the Neural Network node icon. Drag the node into the Diagram Workspace.

Connect the Variable Selection node to the Neural Network node.

Select the Neural Network node. In the Properties Panel, scroll down to view the Train properties, and click on the ellipses that represent the value of Network. The Network window appears.

Change the following properties:

Click on the value of Direct Connection and select Yes from the drop-down menu that appears. This selection enables the network to have connections directly between the inputs and the outputs, in addition to connections via the hidden units.


Click on the value of Number of Hidden Units and enter 5. This example trains a multilayer perceptron neural network with five units on the hidden layer.

Click OK.

In the Diagram Workspace, right-click the Neural Network node, and select Run from the resulting menu. Click Yes in the Confirmation window that opens.

In the window that appears when processing completes, click Results. The Results window appears. Maximize the Score Rankings Overlay window. From the drop-down menu, select Cumulative Total Expected Profit (Figure 2-6).

Figure 2-6

Compare these results to those from the Regression node. According to this model, if you were to solicit the best 40% of the individuals, the total expected profit from the validation data would be approximately $1900. If you were to solicit everyone on the list, then based on the validation data, you could expect approximately $2350 profit on the campaign.


2.4 AUTONEURAL NODE

The AutoNeural node belongs to the Model category in the SAS data mining process of Sample, Explore, Modify, Model, Assess (SEMMA). You can use the AutoNeural node as an automated tool to help you find optimal configurations for a neural network model.

The AutoNeural node conducts limited searches in order to find better network configurations. There are several options that the node uses to control the algorithm.

Hidden nodes are added one at a time. A node may contain one or more neurons.

For each training iteration, one estimate vector and one fit vector are retained according to the criteria that are displayed below.

Subsequent training is initialized with the current estimates. The new node is initialized by PROC NEURAL with output weights = 0. Therefore, the beginning error is the same as the final error of the previous level.

An adjustable setting for the maximum number of iterations is used. The training Maximum Iterations is adjusted higher if the selected iteration equals the Maximum Iterations. The Maximum Iterations is adjusted lower if the selected iteration is significantly lower than the Maximum Iterations setting.

An adjustable setting for the amount of time after which training stops is used. The Total Time property enables you to set the maximum amount of time that can be used for training.

The scale parameter for random numbers decreases at each training iteration when you use the cascade architecture.

You can freeze or not freeze previously trained layers.

Preliminary training is performed, depending on the value of the Tolerance property.

The default combination functions and error functions are used.
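The adaptive Maximum Iterations behavior can be sketched as a simple rule; note that the grow/shrink factors and the "significantly lower" threshold below are hypothetical, since SAS does not publish the exact adjustment:

```python
def adjust_max_iterations(max_iter, selected_iter,
                          grow=2.0, shrink=0.5, slack=0.25):
    """Adapt the iteration cap between AutoNeural training steps.
    Hypothetical sketch: the grow/shrink factors and the 'significantly
    lower' threshold are assumptions, since SAS does not publish the rule."""
    if selected_iter == max_iter:
        # Training used every allowed iteration: raise the cap.
        return int(max_iter * grow)
    if selected_iter < slack * max_iter:
        # The best iteration came early: lower the cap, but keep room for it.
        return max(selected_iter + 1, int(max_iter * shrink))
    return max_iter

print(adjust_max_iterations(8, 8))  # 16 -- the cap was binding
print(adjust_max_iterations(8, 1))  # 4  -- the best model was found early
print(adjust_max_iterations(8, 5))  # 8  -- unchanged
```

The point of the rule is economy: iterations beyond the selected one are wasted work, while a binding cap may be cutting training short.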

AutoNeural model selection criteria:


2.4.1 Network Architectures

The AutoNeural node can create different types of feedforward network architectures. In the charts below, the solid or dashed lines represent weight vectors that are optimized by the model fitting function. Each activation function and target function contains a bias weight that is also optimized.

At each step, a new network model is optimized using an iterative nonlinear solution. Multiple candidate activation functions are optimized, then the best ones are retained in the model. The AutoNeural node only performs the model search when the Train Action property is set to Search.

MLP (Multi-Layer Perceptron) networks are generally not interpretable due to the highly nonlinear nature of combinations of activation functions. While these algorithms have been tuned to avoid overfitting, you should use both validation and test data to make sure that these networks are generalizable for your model.

In a single hidden layer network, new hidden neuron units are added one at a time and are selected according to which activation function provides the most benefit. Many models may be successfully fit with a single hidden layer.

In a funnel network, new hidden neuron units are added to form a funnel pattern and are selected according to which activation function provides the most benefit.

In a block network, new hidden neuron units are added as new layers in the network, according to which activation function provides the most benefit.


Finally, in a cascade network, new hidden neuron units are added and connected from all existing input and hidden neurons. All previously trained weights and biases are frozen. Each additional step optimizes the fit of the network to the residuals of the previous step.


2.4.2 Example using the Search Train Action

The following example demonstrates how to use the search train action, in which nodes are added to the network depending on the architecture that you specify.

Follow these steps to create the following process flow diagram:

1. Define the Data Source

2. Add Nodes to the Diagram Workspace

3. Set the AutoNeural Node Properties

4. Run the AutoNeural Node

5. View the AutoNeural Node Output

6. View the AutoNeural Node Training Code

1. Define the Data Source

1. From the main menu, select: File → New → Data Source.

2. Select SAS Table as the metadata source and click Next.

3. When prompted to Select a SAS Table, enter SAMPSIO.DMAGECR in the input field and click Next.

4. In the Table Properties window, click Next.

5. In the Metadata Advisor Options window, select the Advanced advisor. Click Next.

6. In the Column Metadata window, set the role of the variable good_bad to Target. Click Next.

7. If the Decision Processing window opens, select No and click Next.

8. In the Data Source Attributes window, leave the data source role as Raw and click Next.

9. In the Summary window, click Finish.

2. Add Nodes to the Diagram Workspace

1. Drag the newly created data source onto an Enterprise Miner diagram workspace.

2. Drag a Data Partition node onto the diagram from the node toolbar Sample folder, and then connect the All: German Credit node to the Data Partition node.

3. Drag an AutoNeural node onto the diagram from the node toolbar Model folder, and connect the Data Partition node to the AutoNeural node.

3. Set the AutoNeural Node Properties

1. Click the AutoNeural node in the diagram workspace to select it.

2. In the Properties panel for the AutoNeural node, set the Train Action property of the Model Options group to Search.

3. In the Activation Functions property group, set the Direct, Normal, Square, and Tanh properties to Yes. Set the remaining activation functions to No.

4. Note that the Termination property of the Model Options group is set to Overfitting. Because you partitioned the data and you are using a class target variable, the misclassification rate of the Validation data set is used.

5. Note that the Tolerance property of the Model Options group is set to Medium. This setting indicates that preliminary statements are executed when the node runs.

4. Run the AutoNeural Node

1. Right-click the AutoNeural node, and select Run.

2. When the AutoNeural node run is complete, select Results.

5. View the AutoNeural Node Output

1. In the Results window, expand the Output window.

2. The first section of the Output window displays a variable summary, a table of the model events, and a table of the predicted and decision variables. Scroll down to the first search that the node performed.

---- Search # 1 SINGLE LAYER trial # 1 : DIRECT : Training ----

_ITER_    _AIC_      _AVERR_    _MISC_    _VAVERR_    _VMISC_

0         598.691    0.61086    0.3000    0.60888     0.29766
1         431.439    0.40180    0.1875    0.62017     0.29766
2         421.327    0.38916    0.1900    0.71923     0.30100
3         420.847    0.38856    0.1875    0.71225     0.29431
4         420.691    0.38836    0.1900    0.74103     0.29766
5         420.647    0.38831    0.1900    0.74369     0.29431
6         420.623    0.38828    0.1900    0.76462     0.29766
7         420.614    0.38827    0.1900    0.77283     0.29431
8         420.610    0.38826    0.1900    0.78979     0.29431
8         420.610    0.38826    0.1900    0.78979     0.29431

---- Selected Iteration based on _VMISC_ ----

_ITER_    _AIC_      _AVERR_    _MISC_    _VAVERR_    _VMISC_

3         420.847    0.38856    0.1875    0.71225     0.29431

Eight iterations are performed in this search (the default setting for the Maximum Iterations property is eight). Because you partitioned the data and you are using a class target variable, the misclassification rate of the Validation data set is used to select an iteration. Based on these results, the network from iteration 3 is selected.

Note: Because this network only includes a direct connection, no preliminary training is performed.

3. Scroll down to Trial #2. This trial is performed using the Tanh activation function.

---- Search # 1 SINGLE LAYER trial # 2 : TANH : Prelim ----

The NEURAL Procedure

Preliminary     Random        Starting Objective    Number of     Terminating
Training Run    Seed          Function Value        Iterations    Criteria

1               196765887     0.417286920225        8
2               1258498424    0.493684502807        8
3               1121166870    0.402140358109        8
4               1463483536    0.428628812381        8
5               1266664492    0.424141526204        8
6               1370487091    0.397375204114        8
7               1472610205    0.436653553198        8
8               1986582559    0.419403549987        8

---- Search # 1 SINGLE LAYER trial # 2 : TANH : Training ----

_ITER_    _AIC_      _AVERR_    _MISC_    _VAVERR_    _VMISC_

0         543.900    0.39738    0.1900    0.62328     0.23746
1         531.166    0.38146    0.1800    0.57797     0.25084
2         517.923    0.36490    0.1800    0.61547     0.26756
3         513.275    0.35909    0.1650    0.63134     0.28094
4         502.382    0.34548    0.1825    0.64671     0.27425
5         495.090    0.33636    0.1800    0.67126     0.27425
6         486.778    0.32597    0.1750    0.70288     0.26421
7         481.578    0.31947    0.1675    0.71332     0.26087
8         474.139    0.31017    0.1675    0.71659     0.27425
8         474.139    0.31017    0.1675    0.71659     0.27425

---- Selected Iteration based on _VMISC_ ----

_ITER_    _AIC_      _AVERR_    _MISC_    _VAVERR_    _VMISC_

0         543.900    0.39738    0.19      0.62328     0.23746

Preliminary training is performed in this run. The final weights of the best network that was trained in the preliminary training are used to initialize the subsequent training. Iteration 0 is chosen as the best network in this search.

4. The AutoNeural node then runs searches using the normal and square activation functions. Scroll down in the Output window. Note that the AutoNeural node runs a second search, using all of the activation functions that you specified. Each of these searches is run with preliminary training.

5. Scroll down to view the final training history.

---- Final Training History ----

_step_            _func_    _status_     _iter_    _AVERR_    _MISC_    _AIC_      _VAVERR_    _VMISC_

SINGLE LAYER 1    DIRECT    initial      0         0.61086    0.3000    598.691    0.60888     0.29766
SINGLE LAYER 1    DIRECT    candidate    3         0.38856    0.1875    420.847    0.71225     0.29431
SINGLE LAYER 1    TANH      candidate    0         0.39738    0.1900    543.900    0.62328     0.23746
SINGLE LAYER 1    NORMAL    reject       5         0.35864    0.2000    512.914    0.68756     0.27759
SINGLE LAYER 1    SQUARE    reject       0         0.36231    0.1775    515.850    0.62200     0.27759
SINGLE LAYER 1    TANH      keep         0         0.39738    0.1900    543.900    0.62328     0.23746
SINGLE LAYER 2    DIRECT    reject       2         0.27803    0.1300    556.428    0.75596     0.30100
SINGLE LAYER 2    TANH      reject       0         0.29815    0.1550    688.520    0.77238     0.27759
SINGLE LAYER 2    NORMAL    reject       0         0.30277    0.1650    692.219    0.73099     0.26087
SINGLE LAYER 2    SQUARE    reject       8         0.33162    0.1775    715.298    0.69221     0.27090

In this table, you can see the results of all of the searches that the node ran. The first rows indicate the results of the first search. Two of the networks, based on the Direct and Tanh activation functions, were identified as candidate networks. The remaining networks, based on the Normal and Square activation functions, were rejected. All of the networks in the second search were rejected.

Because it has the lowest value of _VMISC_, iteration 0 of the Tanh network from the first search was chosen.

6. Scroll down to view the final model.

---- Final Model ----

---- Stopping: Termination criteria was satisfied: overfitting based on _VMISC_

_func_    _AVERR_    _VAVERR_    neurons

TANH      0.39738    0.62328     2
                                 =======
                                 2

The network that was chosen uses a Tanh activation function, and has two hidden units. You specify the number of hidden units in the Total Number of Hidden Units property.

7. View the AutoNeural Node Training Code

You can view the final network by examining the SAS training code. From the Results window main menu, select View → Properties → Train Code. Scroll down in the Train Code window to view the final network.

*------------------------------------------*;
* AutoNeural Final Network;
*------------------------------------------*;
*;
proc neural data=EMWS.Part_TRAIN dmdbcat=WORK.AutoNeural_DMDB
   validdata=EMWS.Part_VALIDATE
   ;
   input %INTINPUTS / level=interval id=interval;
   input %BININPUTS / level=nominal id=binary;
   input %NOMINPUTS / level=nominal id=nominal;
   target good_bad / level=NOMINAL id=good_bad;
   *------------------------------------------*;
   * Layer #1;
   *------------------------------------------*;
   Hidden 2 / id=H1x1_ act=TANH;
   connect interval H1x1_;
   connect binary H1x1_;
   connect nominal H1x1_;
   connect H1x1_ good_bad;

You can visualize this network as follows. Note that this diagram does not separate the output and target layers.


2.5 DMNEURAL NODE

The DMNeural node enables you to fit an additive nonlinear model that uses the bucketed principal components as inputs to predict a binary or an interval target variable.

The algorithm that is used in DMNeural network training was developed to overcome the following problems of common neural networks for data mining purposes. These problems are especially likely to occur when the data set contains highly collinear variables.

Nonlinear estimation problem — The nonlinear estimation problem in common neural networks is seriously underdetermined, which leads to highly rank-deficient Hessian matrices and results in extremely slow convergence of the nonlinear optimization algorithm. In other words, the zero eigenvalues in a Hessian matrix correspond to long and very flat valleys in the shape of the objective function. The traditional neural network approach has serious problems deciding when an estimate is close to an appropriate solution, so the optimization process can be prematurely terminated. This is overcome by using estimation with full-rank Hessian matrices of a few selected principal components in the underlying procedure.

Computing time — Each function call in common neural networks corresponds to a single run through the entire training data set; normally, many function calls are needed for convergence of the nonlinear optimization. This requires tremendous calculation time to get an optimized solution for data sets that have a large number of observations. In DMNeural network training of the DMNeural node, we obtain a set of grid points from the selected principal components and a multidimensional frequency table from the training data set for nonlinear optimization. In other words, segments of the data are trained instead of the entire data, and the computing time is reduced dramatically.

Finding a global optimal solution — For the same reason, common neural network algorithms often find local rather than global optimal solutions, and the optimization results are very sensitive to the starting point of the optimization. However, DMNeural network training can find a good starting point that is less sensitive to the results, because it uses well-specified objective functions that contain a few parameters and can do a very simple grid search for those few parameters.

In the DMNeural training process, a principal components analysis is applied to the training data set to obtain a set of principal components. Then a small set of principal components is selected for further modeling. This set of components shows a good prediction of the target with respect to a linear regression model with an R2 selection criterion. The algorithm obtains a set of grid points from the selected principal components and a multidimensional frequency table from the training data set. The frequency table contains count information of the selected principal components at a specified number of discrete grid points.
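The grid-point idea can be illustrated with a one-dimensional sketch: discretize a principal component onto a small grid and count the observations at each grid point, so that later optimization loops over the counts rather than over every observation (illustrative Python; the real procedure builds a multidimensional table over several components):

```python
from collections import Counter

def frequency_table(component_scores, n_grid=10):
    """Discretize one principal component onto n_grid equally spaced grid
    points and count the observations nearest to each point. Illustrative
    one-dimensional sketch of the idea (the real procedure builds a
    multidimensional table over several components); assumes the scores
    are not all identical."""
    lo, hi = min(component_scores), max(component_scores)
    width = (hi - lo) / (n_grid - 1)
    counts = Counter(round((s - lo) / width) for s in component_scores)
    grid = [lo + i * width for i in range(n_grid)]
    return grid, counts

scores = [0.0, 0.1, 0.12, 0.5, 0.55, 1.0]
grid, counts = frequency_table(scores, n_grid=5)
# Six observations collapse onto three occupied grid points, so an
# optimizer can iterate over the counts instead of the raw rows.
print(sorted(counts.items()))  # [(0, 3), (2, 2), (4, 1)]
```

With a large training set and a modest number of grid points, the table is far smaller than the data, which is where the claimed reduction in computing time comes from.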


In each stage of the DMNeural training process, the training data set is fitted with eight separate activation functions. The DMNeural node selects the one that yields the best results. The optimization with each of these activation functions is processed independently. The following table lists the eight activation functions that are used in the DMNeural training process:

Table of Activation Functions

Function Name    Function Expression of Input x

SQUARE           (A + Bx)x
TANH             A*tanh(Bx)
ARCTAN           A*arctan(Bx)
LOGIST
GAUSS            A*exp(-(Bx)^2)
SIN              A*sin(Bx)
COS              A*cos(Bx)
EXP              A*exp(Bx)
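For reference, these expressions can be written out directly in code (Python sketch; A and B are the two parameters optimized at each stage, and the LOGIST expression, which the table omits, is assumed here to be the standard logistic form A/(1 + exp(-Bx))):

```python
import math

# The eight DMNeural stage activation functions written out for reference.
# A and B are the parameters optimized at each stage. The LOGIST expression
# is not given in the original table, so the standard logistic form is
# assumed here.
ACTIVATIONS = {
    "SQUARE": lambda x, A, B: (A + B * x) * x,
    "TANH":   lambda x, A, B: A * math.tanh(B * x),
    "ARCTAN": lambda x, A, B: A * math.atan(B * x),
    "LOGIST": lambda x, A, B: A / (1.0 + math.exp(-B * x)),  # assumed form
    "GAUSS":  lambda x, A, B: A * math.exp(-(B * x) ** 2),
    "SIN":    lambda x, A, B: A * math.sin(B * x),
    "COS":    lambda x, A, B: A * math.cos(B * x),
    "EXP":    lambda x, A, B: A * math.exp(B * x),
}

print(ACTIVATIONS["SQUARE"](2.0, 1.0, 3.0))  # (1 + 3*2)*2 = 14.0
print(ACTIVATIONS["GAUSS"](0.0, 2.0, 1.5))   # peak value A = 2.0
```

Each candidate has only the two parameters A and B, which is what makes the simple grid search mentioned earlier feasible.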

In the DMNeural node, the link function depends on the type of the target variable. By default, the IDENT and LOGIST functions are used for an interval target and a binary target, respectively.

Table of Link Functions

Function Name    Function Expression of Input x

IDENT            x
LOGIST

In the first stage, the response variable is used as the target variable. Starting with the second stage, the algorithm uses the residuals of the model from the previous stage as the new target variable. In other words, model estimation is an additive stage-wise process. The response variable y is modeled in the first stage only. For each of the other stages, the residuals from the previous stage are modeled. The selection of the best activation function at each stage in the training process is based on the smallest SSE (sum of squared errors) or the smallest misclassification rate. In the final stage, an additive nonlinear model is generated as the final predicted model.
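This stage-wise residual fitting can be sketched in a few lines (illustrative Python; the toy candidate predictors stand in for the eight optimized activation functions, and selection here uses SSE):

```python
def fit_stage(x, residual, candidates):
    """One stage of additive, stage-wise fitting: try each candidate
    predictor on the current residuals, keep the one with the smallest SSE,
    and return the new residuals. Illustrative sketch -- `candidates` maps
    names to toy, already-fitted predictors standing in for the eight
    optimized activation functions."""
    def sse(name):
        f = candidates[name]
        return sum((r - f(v)) ** 2 for v, r in zip(x, residual))
    best = min(candidates, key=sse)
    return best, [r - candidates[best](v) for v, r in zip(x, residual)]

x = [0.0, 1.0, 2.0, 3.0]
y = [0.0, 1.0, 4.0, 9.0]  # stage 1 models the response itself
candidates = {"linear": lambda v: 3 * v, "square": lambda v: v * v}
best, new_residual = fit_stage(x, y, candidates)
print(best)  # square
# Stage 2 would call fit_stage(x, new_residual, candidates), and the final
# model is the sum of the functions kept at each stage.
```

Because each stage fits what the previous stages left unexplained, the stages add up to the final additive nonlinear model described above.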


See the Predictive Modeling section for information that applies to all of the predictive modeling nodes.


2.5.1 Variable Requirements for the DMNeural Node

The DMNeural node requires one target variable, either binary or interval, and at least two input variables to perform the DMNeural network training.

You may also specify the following variables:

Cost — contains the cost of a decision. You should first assign the cost model role to the appropriate variables when you create the data source. You can then assign a cost variable to each decision when you define a profit matrix with costs in the target profile for the target. For more information about defining a profit matrix with costs, see the Target Profiler section.

Frequency — a variable that represents the frequency of occurrences in each observation. Observations that have a missing or negative value for the frequency variable are excluded from the analysis. You can assign the freq model role to the appropriate variable when you create the data source.


2.5.2 DMNeural Node Train Properties

The following train properties are associated with the DMNeural node:

Variables — specifies the properties of each variable in the data source that you want to use. Select the button to the right of the Variables property to open a variables table. You can set the variable status to either Use or Don't Use in the table, and you can set a variable's report status to Yes or No.


2.5.3 DMNeural Node Train Properties: DMNeural Network

Lower Bound R2 — Use the Lower Bound R2 property of the DMNeural node to specify the lower R-square bound that you want to use to select a small set of principal components at each stage of the optimization process. Permissible values are real numbers between 0 and 1. The default value is 5.0E-5.

Max Component — Use the Max Component property of the DMNeural node to specify the maximum number of principal components that you want to use to predict target variable values. Permissible values are integers between 2 and 6. The default value is 3.

Max EigenVector — Use the Max EigenVector property of the DMNeural node to specify the upper bound that you want to use for the number of eigenvectors that are available for selection. Permissible values are integers greater than or equal to 2. The default value is 400.

Max Function Call — Use the Max Function Call property of the DMNeural node to specify the upper bound that you want to use for the number of function calls in each optimization. Permissible values are integers greater than or equal to 1. The default value is 500.

Max Iteration — Use the Max Iteration property of the DMNeural node to specify the upper bound that you want to use for the number of iterations performed during each optimization. Permissible values are integers greater than or equal to 1. The default value is 200.

Max Stage — Use the Max Stage property of the DMNeural node to specify an upper bound for the number of stages utilized in the DMNeural optimization. Permissible values are integers between 1 and 10. The default value is 3.


2.5.4 DMNeural Node Train Properties: Convergence Criteria

Absolute Gradient — Use the Absolute Gradient property of the DMNeural node to set the absolute gradient convergence criterion. Permissible values are real numbers greater than 0. The default value is 5.0E-4.

Gradient — Use the Gradient property of the DMNeural node to set the gradient convergence criterion. Permissible values are real numbers greater than 0. The default value is 1.0E-8.


2.5.5 DMNeural Node Train Properties: Model Criteria

Selection — Use the Selection property of the DMNeural node to specify the criterion for selecting the best activation function at each stage in the training process.

Default — The default setting uses the objective function in the optimization criterion that you specified in the Optimization property.

SSE — the sum of squared errors.

ACC — the number of correctly classified observations divided by the total number of training observations.

Optimization — Use the Optimization property of the DMNeural node to specify the objective function that you want to be optimized in the iterated network training process. The choices are

SSE (default) — The sum of squared errors is minimized.

ACC — The correct classification rate, that is, the number of correctly classified observations divided by the total number of training observations, is maximized.

Print Option — Use the Print Option property of the DMNeural node to specify the level of detail for node output.

Default — prints the standard node output.

PAll — prints an extended version of the standard node output. This is the default setting.

PShort — prints an abbreviated set of node outputs.

Noprint — no print output is produced.


2.5.6 DMNeural Node Status Properties

The following status properties are associated with this node:

Create Time — displays the time that the node was created.

Run ID — displays the identifier of the node run. A new identifier is created every time the node runs.

Last Error — displays the error message from the last run.

Last Status — displays the last reported status of the node.

Last Run Time — displays the time at which the node was last run.

Run Duration — displays the length of time of the last node run.

Grid Host — displays the grid server that was used during the node run.

User-Added Node — specifies if the node was created by a user as a SAS Enterprise Miner extension node.


2.5.7 DMNeural Node Results

You can open the Results window of the DMNeural node by right-clicking the node and selecting Results from the pop-up menu.

Select View from the main menu to view the following information in the Results window:

Properties

Settings — displays a window with a read-only table of the DMNeural node properties configuration when the node was last run.

Run Status — indicates the status of the DMNeural node run. The Run Start Time, Run Duration, and information about whether the run completed successfully are displayed in this window.

Variables — a table of the variables in the training data set. You can resize and reposition columns by dragging borders or column headers, and you can toggle column sorts between descending and ascending by clicking on the column headers.

Train Code — the code that Enterprise Miner used to train the node.

Notes — opens a window and displays (read-only) any notes that were previously entered in the General Properties — Notes window.

SAS Results

Log — the SAS log of the DMNeural node run.

Output — the SAS output of the DMNeural node run displays the following:

variable summary

model events summary

table of predicted and decision variables

component selection table for SS(y) and R2

PROC DMNEURL summary

response profile for the target variable

table of PROC DMNEURL variable statistics

stagewise goodness-of-fit criterion tables

stagewise component selection tables


summary table across stages

fit statistics for the target variable

classification table for the target variable

event classification table

assessment score rankings

assessment score distribution table

Flow Code — the SAS code used to produce the output that the DMNeural node passes on to the next node in the process flow diagram.

SAS Code — the SAS score code that was created by the node. The SAS score code can be used outside of the Enterprise Miner environment in custom user applications. If no score code was generated, the SAS Code menu item is dimmed and unavailable.

PMML Code — the DMNeural node does not generate PMML code.

Assessment

Fit Statistics — displays the following fit statistics:

_ERR_ — Error Function

_SSE_ — Sum of Squared Errors

_MAX_ — Maximum Absolute Error

_DIV_ — Divisor for ASE

_NOBS_ — Sum of Frequencies

_WRONG_ — Number of Wrong Classifications

_DISF_ — Frequency of Classified Cases

_MISC_ — Misclassification Rate

_ASE_ — Average Squared Error

_RASE_ — Root Average Squared Error

_AVERR_ — Average Error Function


_DFT_ — Total Degrees of Freedom

_DFM_ — Model Degrees of Freedom

_DFE_ — Degrees of Freedom for Error

_MSE_ — Mean Squared Error

_RMSE_ — Root Mean Squared Error

_NW_ — Number of Weights

_FPE_ — Final Prediction Error

_RFPE_ — Root Final Prediction Error

_AIC_ — Akaike's Information Criterion

_SBC_ — Schwarz Bayesian Criterion
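Several of these statistics follow directly from the predictions. As an illustrative sketch (not the SAS implementation, and assuming unit frequencies and a single prediction per case for the ASE divisor), the core quantities for a binary target can be computed as:

```python
import math

def fit_statistics(y_true, p_event, threshold=0.5):
    """Compute a few of the fit statistics listed above for a binary target.

    y_true  : actual target values (0 or 1)
    p_event : predicted posterior probabilities of the event
    """
    n = len(y_true)                                            # _NOBS_
    sse = sum((y - p) ** 2 for y, p in zip(y_true, p_event))   # _SSE_
    div = n                                                    # _DIV_ (one prediction per case assumed)
    ase = sse / div                                            # _ASE_
    rase = math.sqrt(ase)                                      # _RASE_
    wrong = sum((p >= threshold) != (y == 1)
                for y, p in zip(y_true, p_event))              # _WRONG_
    misc = wrong / n                                           # _MISC_
    max_abs = max(abs(y - p) for y, p in zip(y_true, p_event)) # _MAX_
    return {"_SSE_": sse, "_ASE_": ase, "_RASE_": rase,
            "_WRONG_": wrong, "_MISC_": misc, "_MAX_": max_abs}

stats = fit_statistics([1, 0, 1, 0], [0.9, 0.2, 0.4, 0.1])
```

Here one of the four cases is classified on the wrong side of the 0.5 cutoff, so _MISC_ is 0.25.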

Classification Chart — displays a stacked bar chart of the classification results for a categorical target variable. The horizontal axis displays the target levels that observations actually belong to. The color of the stacked bars identifies the target levels that observations are classified into.

Score Rankings Overlay — In a score rankings chart, several statistics for each decile (group) of observations are plotted on the vertical axis. For a binary target, all observations in the scored data set are sorted by the posterior probabilities of the event level in descending order. For a nominal or ordinal target, observations are sorted from highest expected profit to lowest expected profit (or from lowest expected loss to highest expected loss). Then the sorted observations are grouped into deciles, and the observations in a decile are used to calculate the statistics that are plotted in the deciles charts. The Score Rankings Overlay plot displays both train and validate statistics on the same axis.

By default, the horizontal axis of a score rankings chart displays the deciles (groups) of the observations. The vertical axis displays the following values, and their mean, minimum, and maximum (if any):

posterior probability of target event

number of events

cumulative and noncumulative lift values

cumulative and noncumulative % response

cumulative and noncumulative % captured response

gain


actual profit or loss

expected profit or loss
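The decile construction described above can be sketched directly: sort the scored cases by descending posterior probability, cut them into equal groups, and compare each group's response rate to the overall rate. The example below is an illustration, not SAS code, and assumes the number of cases divides evenly into the groups:

```python
def score_rankings(y, p, groups=10):
    """Sort by descending posterior probability, group into deciles,
    and compute % response plus noncumulative and cumulative lift."""
    ranked = sorted(zip(p, y), reverse=True)   # highest scores first
    n = len(ranked)
    base_rate = sum(y) / n                     # overall event rate
    size = n // groups                         # cases per group
    rows, cum_events = [], 0
    for g in range(groups):
        chunk = ranked[g * size:(g + 1) * size]
        events = sum(t for _, t in chunk)      # events captured in this group
        cum_events += events
        rows.append({
            "decile": g + 1,
            "pct_response": events / size,
            "lift": (events / size) / base_rate,                   # noncumulative
            "cum_lift": (cum_events / ((g + 1) * size)) / base_rate,
        })
    return rows

rows = score_rankings([1, 1, 0, 0, 1, 0, 0, 0, 0, 0],
                      [.9, .8, .7, .6, .5, .4, .3, .2, .1, .05], groups=5)
```

A lift above 1 in the top groups is what a useful model should show: the highest-scored cases respond more often than the population as a whole.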

Score Distribution — The Score Distribution chart plots the proportions of events (by default), nonevents, and other values on the vertical axis. The values on the horizontal axis represent the model score of a bin. The model score depends on the prediction of the target and the number of buckets used. For categorical targets, observations are grouped into bins, based on the posterior probabilities of the event level and the number of buckets. For interval targets, observations are grouped into bins, based on the actual predicted values of the target. The Score Distribution chart of a useful model shows a higher percentage of events for higher model scores and a higher percentage of nonevents for lower model scores.

Model

Stagewise Optimization Statistics — In each stage of the DMNeural training process, the training data set is fitted with eight separate activation functions. The Stagewise Optimization Statistics window displays a lattice of plots that show model fit statistics of different activation functions at each stage of the training process. The fit statistics that are displayed in the lattice graph are Sum of Squared Errors and Accuracy Percent.

Table — displays a table that contains the underlying data that is used to produce a chart. The Table menu item is dimmed and unavailable unless a results chart is open and selected.

Plot — use the Graph Wizard to modify an existing results plot or to create a results plot of your own.


2.6 TWOSTAGE NODE

The TwoStage node enables you to model a class target and an interval target. The interval target variable is usually the value that is associated with a level of the class target. For example, the binary variable PURCHASE is a class target that has two levels: Yes and No, and the interval variable AMOUNT can be the value target that represents the amount of money that a customer spends on the purchase.

The TwoStage node supports two types of modeling: sequential and concurrent. For sequential modeling, a class model and a value model are fitted for the class target and the interval target, respectively, in the first and the second stages. By defining a transfer function and using the filter option, you can specify how the class prediction for the class target is applied and whether to use all or a subset of the training data in the second stage for interval prediction. The prediction of the interval target is computed from the value model and optionally adjusted by the posterior probabilities of the class target through the bias adjustment option. A posterior analysis that displays the value prediction for the interval target by the actual value and prediction of the class target is also performed. The score code of the TwoStage node is a composite of the class and value models. The value model is used to create assessment plots.

For concurrent modeling, the values of the value target for observations that contain a non-event value for the class target are set to missing prior to training. Then, the TwoStage node fits a neural network model for the class and value target variables simultaneously.

The default TwoStage node fits a sequential model that has a decision tree class model and a neural network value model. Posterior probabilities from the decision tree class model are used as input for the regression value model.
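As a sketch of the sequential flow (a toy illustration with made-up stand-ins for the class and value models, not the node's actual algorithms): fit the class model first, fit the value model only on responders, then adjust the value prediction by the posterior probability:

```python
def two_stage_sequential(rows):
    """Toy sequential two-stage model (illustration only).

    rows: list of (x, purchased, amount) training records.
    Stage 1: class model, here a crude frequency estimate of P(purchase)
             on either side of a split point.
    Stage 2: value model fitted only on responders (the filter option),
             with the prediction scaled by the posterior probability
             (a stand-in for the bias adjustment option).
    """
    cutoff = sum(x for x, _, _ in rows) / len(rows)   # crude split point

    def posterior(x):
        side = [r for r in rows if (r[0] > cutoff) == (x > cutoff)]
        return sum(r[1] for r in side) / len(side)    # stage 1: P(event)

    responders = [r for r in rows if r[1] == 1]       # second-stage subset
    mean_amount = sum(r[2] for r in responders) / len(responders)  # stage 2

    def predict(x):
        p = posterior(x)
        return {"p_event": p, "amount": mean_amount,
                "expected_value": p * mean_amount}    # adjusted prediction
    return predict

predict = two_stage_sequential([(1, 0, 0), (2, 0, 0), (8, 1, 50), (9, 1, 70)])
```

The composite score code of the real node plays the role of `predict` here: one function that carries both the class and the value model.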

Before reading this document, you should be familiar with the Predictive Modeling, Regression Node, Decision Tree Node, and Neural Network Node documentation.


2.6.1 TwoStage Node Output Data

After you have successfully run the TwoStage node in a diagram, you can view the output data sets that it generates. With the TwoStage node selected in the Diagram Workspace, select the button to the right of the Exported Data property. This opens a table of the exported TwoStage data.

If data exists for an exported data set, you can select the row in the table and click one of the following:

Browse to open a window where you can browse the first 100 observations of the data set.

Explore to open the Explore window, where you can sample and plot the data.

Properties to open the Properties window for the data set. The Properties window contains a Table tab and a Variables tab. The tabs contain summary information (metadata) about the table and the variables.

By default, the scored data set is exported to successor nodes.

In addition to the original input, target, and frequency variables from predecessor nodes, the scored data set (training, validation, test, or score) contains the following new variables:

BP_ or BL_ — best profit or loss, depending on the target profile.

CP_ or CL_ — computed profit or loss, depending on the target profile.

D_ — the decision variable.

EP_ or EL_ — expected profit or loss, depending on the target profile.

I_ — the normalized category that the case is classified into.

S_ — the standardized input variables. This variable is generated only for neural models.

H<number> — the hidden units. This variable is generated only for neural models.

U_ — the unnormalized category that the case is classified into.

P_ — the posterior probabilities for categorical targets or the predicted values for interval targets.

R_ — residuals.


2.7 SUPPORT VECTOR MACHINE (SVM) NODE

A support vector machine (SVM) is a supervised machine-learning method that is used to perform classification and regression analysis. Vapnik (1995) developed the concept of SVM in terms of a hard margin, and later he and his colleagues proposed the SVM with slack variables, which is a soft margin classifier. The standard SVM model solves binary classification problems that produce non-probability output (only the sign, +1/-1) by constructing a set of hyperplanes that maximize the margin between two classes. Most problems in a finite dimensional space are not linearly separable. In this case, the original space needs to be mapped into a much higher dimensional space or an infinite dimensional space, which makes the separation easier. SVM uses a kernel function to define the larger dimensional space.

The experimental SAS Enterprise Miner SVM node uses PROC SVM and PROC SVMSCORE. The tool is located on the Model tab of the SAS Enterprise Miner toolbar. In SAS Enterprise Miner 12.1, the SVM node supports only binary classification problems, including polynomial, radial basis function, and sigmoid nonlinear kernels. The SVM node does not perform multi-class problems or support vector regression. The frequency variable is ignored.
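The mapping idea can be made concrete with the classic XOR pattern, which no hyperplane in the plane can separate. The sketch below (a conceptual illustration, not PROC SVM) maps each point (x1, x2) to (x1, x2, x1*x2); in that three-dimensional space a plane on the third coordinate separates the two classes:

```python
# XOR-style points: class +1 when the coordinates share a sign, -1 otherwise.
# No single line in the plane separates these two classes.
points = [((1, 1), +1), ((-1, -1), +1), ((1, -1), -1), ((-1, 1), -1)]

def feature_map(x):
    """Map (x1, x2) to (x1, x2, x1*x2); the extra coordinate is the kind
    of term a polynomial kernel computes implicitly."""
    x1, x2 = x
    return (x1, x2, x1 * x2)

# In the mapped space, w = (0, 0, 1), b = 0 is a separating hyperplane.
w, b = (0.0, 0.0, 1.0), 0.0

def decision(x):
    z = feature_map(x)
    score = sum(wi * zi for wi, zi in zip(w, z)) + b
    return 1 if score > 0 else -1     # sign output, as in the standard SVM

print(all(decision(x) == label for x, label in points))
```

The kernel trick avoids building the mapped vectors at all: the optimizer only ever needs inner products in the mapped space, which the kernel function supplies directly.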


2.7.1 SVM Node Train Properties

Variables — Select the button to open the Variables — SVM table, which enables you to view the columns metadata, or open an Explore window to view a variable's sampling information, observation values, or a plot of the variable distribution. You can specify Use and Report variable values. The Name, Role, and Level values for a variable are displayed as read-only properties. The following buttons and check boxes provide additional options to view and modify variable metadata:

Apply — changes metadata based on the values supplied in the drop-down menus, check box, and selector field.

Reset — changes metadata back to its state before use of the Apply button.

Label — adds a column for a label for each variable.

Mining — adds columns for the Order, Lower Limit, Upper Limit, Creator, Comment, and Format Type for each variable.

Basic — adds columns for the Type, Format, Informat, and Length of each variable.

Statistics — adds statistics metadata for each variable.

Explore — opens an Explore window that enables you to view a variable's sampling information, observation values, or a plot of the variable distribution.

Estimation Method — specifies an SVM modeling method.

Full Dense Quadratic Programming (FQP) — Full dense quadratic programming refers to the method used to store the Hessian matrix, not the techniques used to solve the QP problem. A full dense matrix requires substantial amounts of memory, even for relatively small problems. Therefore, this technique is not recommended for medium- or large-sized data sets. For certain problems that are small enough to support a full dense Hessian, this method might be quicker than other methods.

Decomposed Quadratic Programming (DQP) — Decomposed quadratic programming provides solutions by decomposing a large scale quadratic programming problem into a series of smaller quadratic programming problems. Decomposed quadratic programming divides the input data into two partitions. The data in a small partition is optimized as a sub-problem, and the remaining data in a much larger partition remains constant. As data points in the sub-problem are optimized, they are iteratively replaced with data from the larger partition by evaluating the optimality conditions. The large quadratic programming problem is resolved by successively solving a series of small quadratic programming problems.


Lagrangian SVM (LSVM) — The Lagrangian SVM method is suitable for large scale SV classification models with a linear kernel. It is also suitable for medium-sized models with nonlinear kernels. The Lagrangian SVM method follows Mangasarian and Musicant (2000), where the modified quadratic programming problem has no linear equality constraint and no upper bounds. The only constraints are that the estimated parameters must be nonnegative. For linear kernels, Lagrangian SVM does not require the data (X, y) in core. Lagrangian SVM is an iterative method that makes multiple passes through the data set.

Least Squares SVM (LSSVM) — The least squares SVM method was originally developed by Suykens and Vandewalle (1999). Least squares SVM solves linear and nonlinear C classification problems. Two versions of least squares SVM exist for classification. The faster version, which is suitable for smaller n×n kernel matrices, computes and stores the (dense) Cholesky factor. For larger n, the kernel matrix must not be stored in core when the iterative conjugate gradient algorithm is used. The slower version, which is suitable for small and medium sized n, uses a conjugate gradient solution.

Tuning Method — specifies the method used to perform tuning.

None — Tuning is not performed. The constant value specified as a tuning parameter is used.

Grid Search — grid search.

Optimal Search — direct optimal or pattern search.

Optimization Options — enables you to read and modify values for the following:

Maximum QP Size — specifies the maximum size in observations of the small decomposed quadratic programming partition when the Estimation Method property is set to DQP. The default setting for the Maximum QP Size property is 100.

Maximum Iteration — specifies the upper bound for the number of iterations in each optimization. The default is 200.

Maximum Function Call — specifies an upper bound for the number of function calls in each optimization. The default is 500.

Gradient Convergence Criterion — specifies a relative gradient convergence criterion for the optimization process. The default is 1e-8.

Absolute Gradient Convergence Criterion — specifies an absolute gradient convergence criterion for the optimization process. The default is 5e-4.

Relative Convergence Criterion — specifies a relative parameter convergence criterion for the optimization process. The default is 1e-8.

Absolute Convergence Criterion — specifies an absolute parameter convergence criterion for the optimization process. The default is 5e-4.

Use Conjugate Gradient Method — specifies whether the iterative conjugate gradient method is used. When there are more than 3000 observations in the training data, the Default option uses the iterative conjugate gradient method automatically.

Upper Bound of Conjugate Gradient Iteration — This option is valid only for the LSSVM method with any kernel, as well as the LSVM method with any nonlinear kernel. Instead of solving the large linear system via Cholesky decomposition, the system is solved by the iterative conjugate gradient method. This option specifies an upper bound for the number of iterations. The default is 400.

Conjugate Gradient Termination Tolerance — This option is valid only when the iterative conjugate gradient method is used. The option specifies a termination tolerance for the iteration. By default, 1e-6 is used.

Scale Predictors — specifies whether the predictor variables are scaled before entering the analysis. The Scale Predictors property defaults to Yes and becomes dimmed and unavailable when the Kernel property is set to Polynomial, because polynomial kernels always use scaled predictors.


2.7.2 SVM Node Train Properties: Regularization Parameter

Regularization Parameter — Use the Regularization Parameter property to choose the method that you want to use to specify a value for the regularization parameter C.

Constant — Set the Regularization Parameter property to Constant if you want to use the Constant Value property to manually specify a value for the regularization parameter C.

Tuning — Set the Regularization Parameter property to Tuning if you want to use a tuning method to choose the best regularization parameter C from a range of values that you specify.

Constant Value — When the Regularization Parameter property is set to Constant, use the Constant Value property to specify the value of the regularization parameter C. The value must be a real number greater than zero.

Tuning Range — When the Regularization Parameter property is set to Tuning, select the button to the right of the Tuning Range property to open a window where you can specify parameters for tuning your regularization parameter value:

Start Value — When the Regularization Parameter property is set to Tuning, use the Start Value property to specify the beginning value of the tuning range.

End Value — When the Regularization Parameter property is set to Tuning, use the End Value property to specify the end value of the tuning range.

Increment Value — When the Regularization Parameter property is set to Tuning, use the Increment Value property to specify the increment value for each tuning step.
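The Start Value, End Value, and Increment Value properties simply define a grid of candidate C values; each candidate is trained and validated, and the best one is kept. A minimal sketch of that loop, with a hypothetical `evaluate` callback standing in for a full train-and-validate run:

```python
def tune_regularization(start, end, increment, evaluate):
    """Try each C in [start, end] in steps of `increment` and return the
    value with the smallest validation error. `evaluate` is a hypothetical
    callback that trains an SVM with the given C and returns its error."""
    best_c, best_err = None, float("inf")
    c = start
    while c <= end + 1e-12:            # tolerance guards against float drift
        err = evaluate(c)
        if err < best_err:
            best_c, best_err = c, err
        c += increment
    return best_c, best_err

# Toy error surface with a minimum near C = 0.55:
best_c, best_err = tune_regularization(0.05, 1.0, 0.05, lambda c: (c - 0.55) ** 2)
```

The node's Optimal Search setting replaces this exhaustive sweep with a pattern search that probes the range more selectively.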


2.7.3 SVM Node Train Properties: Kernel

Kernel — specifies the desired kernel function. The default kernel function is Linear.

Linear — K(u,v) = uᵀv.

Polynomial — K(u,v) = (uᵀv + 1)ᵖ with polynomial order p. The 1 is added in order to avoid zero-value entries in the Hessian matrix for large values of p.

Radial Basis Function — K(u,v) = exp[−p‖u − v‖²], the Gaussian radial basis function kernel. The radial basis function kernel requires you to specify a value for the kernel scale parameter p.

Sigmoid — K(u,v) = tanh(p(uᵀv) + q), where p is the kernel scale parameter and q is the kernel location parameter.
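The four kernels can be written directly from these definitions (p and q as above; this is a sketch of the formulas, not the PROC SVM source):

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def linear(u, v):
    return dot(u, v)                                  # K(u,v) = u'v

def polynomial(u, v, p=2):
    return (dot(u, v) + 1) ** p                       # K(u,v) = (u'v + 1)^p

def rbf(u, v, p=1.0):
    sq_dist = sum((ui - vi) ** 2 for ui, vi in zip(u, v))
    return math.exp(-p * sq_dist)                     # K(u,v) = exp(-p * ||u - v||^2)

def sigmoid(u, v, p=1.0, q=0.0):
    return math.tanh(p * dot(u, v) + q)               # K(u,v) = tanh(p * u'v + q)

k = polynomial((1, 2), (3, 4), p=2)                   # (11 + 1)^2 = 144
```

Each function returns the inner product of the two points in the implicitly mapped space, which is all the quadratic programming solver needs.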

Polynomial Kernel Parameter — When the Kernel function is set to Polynomial, select the button to the right of the Polynomial Kernel Parameter property to open a window where you can specify the following parameters for your polynomial kernel function:

Polynomial Order — Use the Polynomial Order property to choose the method that you want to use to specify the order of the polynomial kernel parameter.

Constant — Set the Polynomial Order property to Constant if you want to use the Constant Value property to manually specify a value for the polynomial order parameter.

Tuning — Set the Polynomial Order property to Tuning if you want to use a tuning method to choose the best value of the polynomial order from a range of values that you specify.

Constant Value — When the Polynomial Order property is set to Constant, use the Constant Value property to specify the value of the polynomial order parameter. The value of the polynomial order parameter must be an integer between 1 and 10.

Tuning Start Value — When the Polynomial Order property is set to Tuning, use the Tuning Start Value property to specify the beginning value of the tuning range.

Tuning End Value — When the Polynomial Order property is set to Tuning, use the Tuning End Value property to specify the end value of the tuning range.

Tuning Increment Value — When the Polynomial Order property is set to Tuning, use the Tuning Increment Value property to specify the increment value for each step in kernel parameter tuning.


RBF Kernel Parameter — When the Kernel function is set to Radial Basis Function, select the button to the right of the RBF Kernel Parameter property to open a window where you can specify the following parameters for your RBF kernel function:

Scale Parameter — Use the Scale Parameter property to choose the method that you want to use to specify the value for the kernel scaling parameter p.

Constant — Set the Scale Parameter property to Constant if you want to use the Constant Value property to manually specify a value for the kernel scaling parameter p.

Tuning — Set the Scale Parameter property to Tuning if you want to use a tuning algorithm to select the best scaling parameter p from a range of values that you specify.

Constant Value — When the Scale Parameter property is set to Constant, use the Constant Value property to specify the value of the radial basis function scaling parameter p. The value for the Constant Value property must be a real number greater than zero.

Tuning Start Value — When the Scale Parameter property is set to Tuning, use the Tuning Start Value property to specify the beginning value of the tuning range.

Tuning End Value — When the Scale Parameter property is set to Tuning, use the Tuning End Value property to specify the end value of the tuning range.

Tuning Increment Value — When the Scale Parameter property is set to Tuning, use the Tuning Increment Value property to specify the increment value for each step in parameter tuning.

Sigmoid Kernel Parameter — When the Kernel property is set to Sigmoid, select the button to the right of the Sigmoid Kernel Parameter property to open a window where you can specify the following parameters for your sigmoid kernel function:

Sigmoid Kernel Scale Parameters

Kernel Scale Parameter — Use the Kernel Scale Parameter property to choose the method that you want to use to specify the value for the sigmoid kernel scaling parameter p. The Kernel Scale Parameter property has two settings:

Constant — Set the Kernel Scale Parameter property to Constant if you want to use the Constant Value property to manually specify a value for the kernel scaling parameter p.

Tuning — Set the Kernel Scale Parameter property to Tuning if you want to use an iterative tuning algorithm to select the best scaling parameter p from a range of values that you specify.

Constant Value — When the Kernel Scale Parameter property is set to Constant, use the Constant Value property to specify the value of the scaling parameter p. The value for the Constant Value property must be a real number greater than zero.

Tuning Start Value — When the Kernel Scale Parameter property is set to Tuning, use the Tuning Start Value property to specify the beginning value of the tuning range.

Tuning End Value — When the Kernel Scale Parameter property is set to Tuning, use the Tuning End Value property to specify the end value of the tuning range.

Tuning Increment Value — When the Kernel Scale Parameter property is set to Tuning, use the Tuning Increment Value property to specify the increment value for each step in parameter tuning.

Sigmoid Kernel Location Parameters

Kernel Location Parameter — Use the Kernel Location Parameter property to choose the method that you want to use to specify the value for the kernel location parameter q.

Constant — Set the Kernel Location Parameter property to Constant if you want to use the Constant Value property to manually specify a value for the kernel location parameter q.

Tuning — Set the Kernel Location Parameter property to Tuning if you want to use an iterative algorithm to select the kernel location parameter q from a range of values that you specify.

Constant Value — When the Kernel Location Parameter property is set to Constant, use the Constant Value property to specify the value of the location parameter q. The value for the Constant Value property must be a real number greater than zero.

Tuning Start Value — When the Kernel Location Parameter property is set to Tuning, use the Tuning Start Value property to specify the beginning value of the tuning range.

Tuning End Value — When the Kernel Location Parameter property is set to Tuning, use the Tuning End Value property to specify the end value of the tuning range.

Tuning Increment Value — When the Kernel Location Parameter property is set to Tuning, use the Tuning Increment Value property to specify the increment value for each step in parameter tuning.


2.7.4 SVM Node Train Properties: Cross Validation

Cross Validation — specifies whether cross validation is used.

Method — specifies the cross validation method when the Cross Validation property is set to Yes.

Random — performs random cross validation. The number of cross validation runs is equal to the value specified in the Fold property. Let N be the number of observations in the input data set and F be the value specified in the Fold property. Each cross validation run uses N/F observations that are randomly selected from the input data set.

Test set — permits you to use the test data set as the validation data set.

Fold — specifies the number of training runs and the number of left-out data sets. Typical values are fold=4 or fold=10. For fold=1, leave-one-out estimation is used. The default value is 10.
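The Random method can be sketched as follows: for each of the F runs, N/F randomly chosen observations are held out, the model is trained on the rest, and the errors are averaged. This is an illustration only (`train_and_score` is a hypothetical callback); the node performs this internally:

```python
import random

def random_cross_validation(data, fold, train_and_score, seed=0):
    """Random F-fold style validation as described above: each run holds
    out a random N/F subset. `train_and_score` is a hypothetical callback
    that trains on one list and returns an error measured on the other."""
    rng = random.Random(seed)
    n_holdout = len(data) // fold
    errors = []
    for _ in range(fold):
        holdout = set(rng.sample(range(len(data)), n_holdout))
        train = [obs for i, obs in enumerate(data) if i not in holdout]
        test = [obs for i, obs in enumerate(data) if i in holdout]
        errors.append(train_and_score(train, test))
    return sum(errors) / fold          # averaged validation error

avg = random_cross_validation(list(range(100)), 10,
                              lambda train, test: len(test) / len(train))
```

Note that, unlike classical F-fold partitioning, the random holdout sets here may overlap from run to run, which matches the random selection the text describes.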


2.7.5 SVM Node Train Properties: Sampling

Apply Sampling — specifies whether sampling is used for SVM training. Stratified sampling is used for class targets.

Sample Size — specifies the sampling size for SVM training. The default sample size is 1000 observations.


2.7.6 SVM Node Train Properties: Print Options

Print Option — use the Print Option property to configure the granularity of the printed output from the SVM node.

Default — suppresses printing of some SVM model details.

All — prints all SVM results information.

No Print — suppresses printing of all SVM results information.

Optimization History — specifies whether to print the detailed optimization history.


2.7.7 SVM Node Status Properties

The following status properties are associated with this node:

Create Time — displays the time that the node was created.

Run ID — displays the identifier of the node run. A new identifier is created every time the node runs.

Last Error — displays the error message from the last run.

Last Status — displays the last reported status of the node.

Last Run Time — displays the time at which the node was last run.

Run Duration — displays the length of time of the last node run.

Grid Host — displays the grid server that was used during the node run.

User-Added Node — specifies if the node was created by a user as a SAS Enterprise Miner extension node.


2.7.8 SVM Node Results

You can open the Results window of the SVM node by right-clicking the node and selecting Results from the pop-up menu.

Select View from the main menu to view the following results in the Results window:

Properties

Settings — displays a window with a read-only table of the configuration information in the SVM node Properties Panel. The information was captured when the node was last run.

Run Status — indicates the status of the SVM node run. The Run Start Time, Run Duration, and information about whether the run completed successfully are displayed in this window.

Variables — a read-only table of variable meta information about the data set submitted to the SVM node. The table includes columns for the variable's name, use, report, role, and level.

Train Code — the code that SAS Enterprise Miner used to train the node.

Notes — displays any user-created notes of interest.

SAS Results

Log — the SAS log of the SVM node run.

Output — the SAS output of the SVM node run.

Flow Code — the SAS code used to produce the output that the SVM node passes on to the next node in the process flow diagram.

Scoring

SAS Code — the SAS score code that was created by the node. The SAS score code can be used outside of the SAS Enterprise Miner environment in custom user applications.

PMML Code — the SVM node does not generate PMML code.

Assessment

Fit Statistics — a table of the fit statistics from the model.

Classification Chart — a bar chart that shows the correct classification of the values of the target variable.

Score Rankings Overlay <TARGET> — opens the Score Rankings Overlay chart.

Score Rankings Matrix — opens the Score Rankings Matrix chart.

Score Distribution — opens the Score Distribution chart.

Model

Histogram: Support Vectors — opens a histogram that displays the distribution of support vectors.

SVM Fit Statistics — a table of the SVM fit statistics from the trained model.

Tuning History — a table of the tuning history. This table is generated when the Tuning Method property is set to Grid Search.

Support Vectors — displays a table of support vectors with Lagrange multipliers.

Table — displays a table that contains the underlying data that is used to produce a chart. The Table menu item is dimmed and unavailable unless a results chart is open and selected.

Plot — opens the Select a Chart Type wizard to modify an existing Results plot or create a Results plot of your own.


2.7.9 SVM Node Example

From the SAS Enterprise Miner menu, select Help → Generate Sample Data Sources to open the Generate Sample Data Sources window.

Figure 2-25

Select OK to generate the sample data sources. The new sample data sources automatically populate the Data Sources folder of the SAS Enterprise Miner Project Navigator. Create a new process flow diagram, and then drag the Home Equity data source from the Data Sources folder of the SAS Enterprise Miner Project Navigator onto the diagram workspace. Drag a Data Partition node from the Sample tab of the SAS Enterprise Miner node toolbar, and an SVM node from the Model tab of the SAS Enterprise Miner node toolbar, and connect these nodes to the Home Equity data source as shown below. Right-click the SVM node and select Run.

From the SVM Results window, examine the Fit Statistics table.

Figure 2-26

The table below contains selected fit statistics from the SVM training. Since sampling was used, the statistics in the table below are based on 1000 observations, which differs from the number of training observations that were used to generate the previous fit statistics table.


Figure 2-27

The distribution of nonzero Lagrange multipliers and support vectors is shown as follows:

Figure 2-28

Figure 2-29

When the Grid Search tuning method is used, its history is shown as a table.


Figure 2-30

The SAS Output contains the following output from the SVM run:

The SVM Procedure

Parameter Tuning Process: Technique = DQP

--- Start Tuning Process: Technique = DQP ---

Terminate pattern search after 2 iterations (3 training runs) with maximum grid distance = 0.1125.

Best solution with C = 0.55 selected.

The SVM Procedure

Regularization Parameter C = 0.55

Classification Table (Training Data)
Misclassification = 161   Accuracy = 83.90

                    Predicted
Observed        0        1    Missing
0           798.0      3.0        0
1           158.0     41.0        0
Missing         0        0        0
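The figures in the table are internally consistent: 798 + 41 = 839 cases fall on the diagonal out of 1,000, giving 83.9% accuracy, and 3 + 158 = 161 cases are misclassified. A quick check:

```python
# Counts from the classification table above, keyed (observed, predicted).
table = {(0, 0): 798, (0, 1): 3, (1, 0): 158, (1, 1): 41}

n = sum(table.values())                     # total training cases
correct = table[(0, 0)] + table[(1, 1)]     # diagonal of the table
misclassified = n - correct

accuracy_pct = 100 * correct / n
print(misclassified, accuracy_pct)          # 161 83.9
```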


The SVM Procedure

Support Vector Classification         C_CLAS
Kernel Function                       Linear
Estimation Method                     DQP
Maximum QP Size                       100
Number of Observations (Train)        1000
Number of Effects                     20
Regularization Parameter C            0.550000
Classification Error (Training)       161.000000
Objective Function                    -202.236319
L1 Loss                               1.261213E-13
Inf. Norm of Gradient                 1.412204E-13
Squared Euclidean Norm of w           8.598777
Geometric Margin                      0.682043
Number of Support Vectors             384
Number of S. Vectors on Margin        368
Norm of Longest Vector                4.288153
Radius of Sphere Around SV            4.210008
Estimated VC Dim of Classifier        153.406184
Linear Kernel Constant (Fit)          2.535592
Linear Kernel Constant (PCE)          2.535592
Number of Kernel Calls                19711817

Page 129: NEURAL NETWORKS with SAS ENTERPRISE MINER

CHAPTER 3

NEURAL NETWORK MODELS TO PREDICT RESPONSE AND RISK


3.1 A NEURAL NETWORK MODEL TO PREDICT RESPONSE

This section discusses the neural network model developed to predict the response to a planned direct mail campaign. The campaign's purpose is to solicit customers for a hypothetical insurance company. A two-layered network with one hidden layer was chosen. Three units are included in the hidden layer. In the hidden layer, the combination function chosen is linear, and the activation function is hyperbolic tangent. In the output layer, a logistic activation function and Bernoulli error function are used. The logistic activation function results in a logistic regression type model with non-linear transformation of the inputs, as shown in the previous chapter. Models of this type are in general estimated by minimizing the Bernoulli error function. Minimization of the Bernoulli error function is equivalent to maximizing the likelihood function.
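The architecture just described (a linear combination feeding three hyperbolic tangent hidden units, a logistic output activation, and the Bernoulli error) can be written out explicitly. The sketch below uses made-up weights, not the fitted model:

```python
import math

def forward(x, hidden_weights, output_weights):
    """One forward pass: linear combination -> tanh hidden units ->
    linear combination -> logistic output (the posterior probability)."""
    h = [math.tanh(w0 + sum(wi * xi for wi, xi in zip(ws, x)))
         for w0, *ws in hidden_weights]               # three hidden units
    eta = output_weights[0] + sum(wi * hi
                                  for wi, hi in zip(output_weights[1:], h))
    return 1.0 / (1.0 + math.exp(-eta))               # logistic activation

def bernoulli_error(y, p):
    """Bernoulli error for one case; minimizing its sum over the training
    data maximizes the likelihood."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# Illustrative weights: 3 hidden units with (bias, w1, w2); output (bias, v1, v2, v3).
hw = [(0.1, 0.5, -0.3), (-0.2, 0.8, 0.1), (0.0, -0.4, 0.6)]
ow = (0.2, 1.0, -0.5, 0.7)
p = forward((1.0, 2.0), hw, ow)
```

Training amounts to the iterative procedure described below: adjust the weights to reduce the summed Bernoulli error over the Training data set.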

Display 3.1 shows the process flow for the response model. The first node in theprocessflowdiagramistheInputDatanode,whichmakestheSASdatasetavailableformodeling.ThenextnodeisDataPartition,whichcreatestheTraining,Validation,and Test data sets. The Training data set is used for preliminary model fitting. TheValidation data set is used for selecting the optimumweights. TheModel SelectionCriterionpropertyissettoAverageError.

Aspointedout earlier, the estimationof theweights is donebyminimizing the errorfunction.Thisminimizationisdonebyaniterativeprocedure.Eachiterationyieldsasetof weights. Each set of weights defines a model. If I set the Model SelectionCriterionpropertytoAverageError,thealgorithmselectsthesetofweightsthatresultsinthesmallesterror,wheretheerroriscalculatedfromtheValidationdataset.

Since the Training and Validation data sets are used for parameter estimation and parameter selection, respectively, an additional holdout data set is required for an independent assessment of the model. The Test data set is set aside for this purpose.

Display 3.1


3.1.1 Input Data Node

I create the data source for the Input Data node from the data set NN_RESP_DATA2. I create the metadata using the Advanced Advisor Options, and I customize it by setting the Class Levels Count Threshold property to 8, as shown in Display 3.2.

Display 3.2

I set adjusted prior probabilities to 0.03 for response and 0.97 for non-response, as shown in Display 3.3.

Display 3.3


3.1.2 Data Partition Node

The input data is partitioned such that 60% of the observations are allocated for training, 30% for validation, and 10% for test, as shown in Display 3.4.

Display 3.4


3.2 SETTING THE NEURAL NETWORK NODE PROPERTIES

Here is a summary of the neural network specifications for this application:

• One hidden layer with three neurons

• Linear combination functions for both the hidden and output layers

• Hyperbolic tangent activation functions for the hidden units

• Logistic activation functions for the output units

• The Bernoulli error function

• The Model Selection Criterion is Average Error

These settings are shown in Displays 3.5–3.7.

Display 3.5 shows the Properties panel for the Neural Network node.

Display 3.5

To define the network architecture, click the button located to the right of the Network property. The Network Properties panel opens, as shown in Display 3.6.

Display 3.6


Set the properties as shown in Display 3.6 and click OK.

To set the iteration limit, click the button located to the right of the Optimization property. The Optimization Properties panel opens, as shown in Display 3.7. Set Maximum Iterations to 100.

Display 3.7

After running the Neural Network node, you can open the Results window, shown in Display 3.8. The window contains four windows: Score Rankings Overlay, Iteration Plot, Fit Statistics, and Output.


Display 3.8

The Score Rankings Overlay window in Display 3.8 shows the cumulative lift for the Training and Validation data sets. Click the down arrow next to the text box to see a list of available charts that can be displayed in this window.

Display 3.9 shows the iteration plot with Average Squared Error at each iteration for the Training and Validation data sets. The estimation process required 70 iterations. The weights from the 49th iteration were selected. After the 49th iteration, the Average Squared Error started to increase in the Validation data set, although it continued to decline in the Training data set.

Display 3.9

You can save the table corresponding to the plot shown in Display 3.9 by clicking the Tables icon and then selecting File→Save As. Table 3.1 shows the three variables _ITER_ (iteration number), _ASE_ (Average Squared Error for the Training data), and _VASE_ (Average Squared Error from the Validation data) at iterations 41–60.

Table 3.1

You can print the variables _ITER_, _ASE_, and _VASE_ by using the SAS code shown in Display 3.10.


Display 3.10


3.3 ASSESSING THE PREDICTIVE PERFORMANCE OF THE ESTIMATED MODEL

In order to assess the predictive performance of the neural network model, run the Model Comparison node and open the Results window. In the Results window, the Score Rankings Overlay shows the lift charts for the Training, Validation, and Test data sets. These are shown in Display 3.11.

Display 3.11

Click the arrow in the box at the top left corner of the Score Rankings Overlay window to see a list of available charts.

SAS Enterprise Miner saves the Score Rankings table as EMWS.MdlComp_EMRANK. Tables 3.2, 3.3, and 3.4 are created from the saved data set using the simple SAS code shown in Display 3.12.

Display 3.12


Table 3.2

Table 3.3


Table 3.4

The lift and capture rates calculated from the Test data set (shown in Table 3.4) should be used for evaluating or comparing the models, because the Test data set is not used in training or fine-tuning the model.

To calculate the lift and capture rates, SAS Enterprise Miner first calculates the predicted probability of response for each record in the Test data. Then it sorts the records in descending order of the predicted probabilities (also called the scores) and divides the data set into 20 groups of equal size. In Table 3.4, the column Bin shows the ranking of these groups. If the model is accurate, the table should show the highest actual response rate in the first bin, the second highest in the next bin, and so on. From the column %Response, it is clear that the average response rate for observations in the first bin is 8.36796%. The average response rate for the entire test data set is 3%. Hence the lift for Bin 1, which is the ratio of the response rate in Bin 1 to the overall response rate, is 2.7893. The lift for each bin is calculated in the same way. The first row of the column Cumulative %Response shows the response rate for the first bin. The second row shows the response rate for bins 1 and 2 combined, and so on.
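The binning and lift arithmetic described above can be sketched in Python. This is an illustrative re-implementation, not the code SAS Enterprise Miner runs:

```python
def lift_table(scores, responses, n_bins=20):
    """Sort records by descending score, split them into n_bins equal groups,
    and return each bin's lift: bin response rate / overall response rate."""
    ranked = sorted(zip(scores, responses), key=lambda r: -r[0])
    size = len(ranked) // n_bins
    overall_rate = sum(responses) / len(responses)
    lifts = []
    for b in range(n_bins):
        bin_responses = [resp for _, resp in ranked[b * size:(b + 1) * size]]
        lifts.append((sum(bin_responses) / size) / overall_rate)
    return lifts

# With a 3% overall response rate, a bin response rate of 8.36796%
# gives a lift of about 2.7893, matching the figures quoted for Bin 1.
```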

The capture rate of a bin shows the percentage of likely responders that it is reasonable to expect to be captured in the bin. From the column Captured Response, you can see that 13.951% of all responders are in Bin 1.

From the Cumulative %Captured Response column of Table 3.3, you can see that, by sending mail to customers in the first four bins, or the top 20% of the target population, it is reasonable to expect to capture 39% of all potential responders from the target population. This assumes that the modeling sample represents the target population.


3.4 RECEIVER OPERATING CHARACTERISTIC (ROC) CHARTS

Display 3.13, taken from the Results window of the Model Comparison node, displays ROC curves for the Training, Validation, and Test data sets. An ROC curve shows the values of the true positive fraction and the false positive fraction at different cut-off values, which can be denoted by Pc. In the case of the response model, if the estimated probability of response for a customer record were above a cut-off value Pc, then you would classify the customer as a responder; otherwise, you would classify the customer as a non-responder.

Display 3.13

In the ROC chart, the true positive fraction is shown on the vertical axis, and the false positive fraction is on the horizontal axis for each cut-off value (Pc).

If the calculated probability of response (P_resp1) is greater than or equal to the cut-off value, then the customer (observation) is classified as a responder. Otherwise, the customer is classified as a non-responder.

The true positive fraction is the proportion of responders correctly classified as responders. The false positive fraction is the proportion of non-responders incorrectly classified as responders. The true positive fraction is also called sensitivity, and specificity is the proportion of non-responders correctly classified as non-responders. Hence, the false positive fraction is 1 − specificity. An ROC curve reflects the tradeoff between sensitivity and specificity.

The straight diagonal lines in Display 3.13 that are labeled Baseline are the ROC charts of a model that assigns customers at random to the responder group and the non-responder group, and hence has no predictive power. On these lines, sensitivity = 1 − specificity at all cut-off points. The larger the area between the ROC curve of the model being evaluated and the diagonal line, the better the model. The area under the ROC curve is a measure of the predictive accuracy of the model and can be used for comparing different models.

Table 3.5 shows sensitivity and 1 − specificity at various cut-off points in the Validation data.

Table 3.5

From Table 3.5, you can see that at a cut-off probability (Pc) of 0.02, for example, the sensitivity is 0.84115. That is, at this cut-off point, you will correctly classify 84.1% of responders as responders, but you will also incorrectly classify 68.1% of non-responders as responders, since 1 − specificity at this point is 0.68139. If instead you chose a much higher cut-off point of Pc = 0.13954, you would classify only 28.4% of true responders as responders and 4.9% of non-responders as responders. In this case, by increasing the cut-off probability beyond which you would classify an individual as a responder, you would be reducing the fraction of false positive decisions made while, at the same time, also reducing the fraction of true positive decisions made. These pairings of a true positive fraction with a false positive fraction are plotted as the ROC curve for the VALIDATE case in Display 3.13.
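The sensitivity and 1 − specificity values at a given cut-off come from a simple counting exercise, which the following illustrative Python sketch reproduces:

```python
def tpr_fpr(probs, actual, cutoff):
    """Classify records with predicted probability >= cutoff as responders,
    then return (sensitivity, 1 - specificity) for that cut-off."""
    predicted = [p >= cutoff for p in probs]
    true_pos = sum(1 for pred, a in zip(predicted, actual) if pred and a == 1)
    false_pos = sum(1 for pred, a in zip(predicted, actual) if pred and a == 0)
    n_pos = sum(1 for a in actual if a == 1)
    n_neg = len(actual) - n_pos
    return true_pos / n_pos, false_pos / n_neg
```

Sweeping the cut-off from 1 down to 0 and plotting the resulting (FPR, TPR) pairs traces out the ROC curve.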


The SAS macro in Display 3.14 demonstrates the calculation of the true positive rate (TPR) and the false positive rate (FPR) at the cut-off probability of 0.02.

Display 3.14

Tables 3.6, 3.7, and 3.8, generated by the macro shown in Display 3.14, show the sequence of steps for calculating TPR and FPR for a given cut-off probability.

Table 3.6


Table 3.7

Table 3.8

For more information about ROC curves, see the textbook by A. A. Afifi and Virginia Clark (2004).

Display 3.15 shows the SAS code that generated Table 3.5.

Display 3.15


3.5 HOW DID THE NEURAL NETWORK NODE PICK THE OPTIMUM WEIGHTS FOR THIS MODEL?

In an earlier section, I described how the optimum weights are found in a neural network model. I described the two-step procedure of estimating and selecting the weights. In this section, I show the results of these two steps with reference to the neural network model discussed in previous sections.

The weights, such as those shown in Equations 5.1, 5.3, 5.5, 5.7, 5.9, 5.11, and 5.13, are shown in the Results window of the Neural Network node. You can see the estimated weights created at each iteration by opening the Results window and selecting View→Model→Weights-History. Display 3.16 shows a partial view of the Weights-History window.

Display 3.16

The second column in Display 3.16 shows the weight of the variable AGE in hidden unit 1 at each iteration. The seventh column shows the weight of AGE in hidden unit 2 at each iteration. The twelfth column shows the weight of AGE in the third hidden unit. Similarly, you can trace through the weights of other variables. You can save the Weights-History table as a SAS data set.

To see the final weights, open the Results window. Select View→Model→Weights_Final. Then, click the Table icon. Selected rows of the final weights table are shown in Display 3.17.


Display 3.17

Outputs of the hidden units become inputs to the target layer. In the target layer, these inputs are combined using the weights estimated by the Neural Network node.

In the model I have developed, the weights generated at the 49th iteration are the optimal weights, because the Average Squared Error computed from the Validation data set reaches its minimum at the 49th iteration. This is shown in Display 3.18.

Display 3.18


3.6 SCORING A DATA SET USING THE NEURAL NETWORK MODEL

You can use the SAS code generated by the Neural Network node to score a data set within SAS Enterprise Miner or outside it. This example scores a data set inside SAS Enterprise Miner.

The process flow diagram with a scoring data set is shown in Display 3.19.

Display 3.19

Set the Role property of the data set to be scored to Score, as shown in Display 3.20.

Display 3.20

The Score node applies the SAS code generated by the Neural Network node to the Score data set NN_RESP_SCORE2, shown in Display 3.19.

For each record, the probability of response, the probability of non-response, and the expected profit are calculated and appended to the scored data set.

Display 3.21 shows the segment of the score code where the probabilities of response and non-response are calculated. The coefficients of HL1, HL2, and HL3 in Display 3.21 are the weights in the final output layer. These are the same as the coefficients shown in Display 3.17.

Display 3.21


The code segment given in Display 3.21 calculates the probability of response using the formula

P_resp1_i = 1 / (1 + exp(−η_i)),

where

η_i = −1.29172481842873*HL1 + 1.59773122184585*HL2 − 1.56981973539319*HL3 − 1.0289412689995;

This formula is the same as Equation 5.14 in Section 5.2.4. The subscript i is added to emphasize that this is a record-level calculation. In the code shown in Display 3.21, the probability of non-response is calculated as

P_resp0_i = 1 / (1 + exp(η_i)).
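The record-level calculation can be sketched in Python using the final-layer weights quoted above. This is illustrative only; the actual scoring is done by the SAS code the node generates:

```python
import math

# Final output-layer weights quoted in the score code of Display 3.21.
W_HL1 = -1.29172481842873
W_HL2 = 1.59773122184585
W_HL3 = -1.56981973539319
BIAS = -1.0289412689995

def p_response(hl1, hl2, hl3):
    """Logistic activation applied to the linear combination of the
    hidden-unit outputs; returns (P_resp1, P_resp0) for one record."""
    eta = W_HL1 * hl1 + W_HL2 * hl2 + W_HL3 * hl3 + BIAS
    p1 = 1.0 / (1.0 + math.exp(-eta))
    return p1, 1.0 - p1
```

Note that 1/(1 + exp(η)) equals 1 − 1/(1 + exp(−η)), so the two probabilities always sum to 1.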

The probabilities calculated above are modified by the prior probabilities I entered prior to running the Neural Network node. These probabilities are shown in Display 3.22.

Display 3.22

You can enter the prior probabilities when you create the Data Source. Prior probabilities are entered because the responders are overrepresented in the modeling sample, which is extracted from a larger sample. In the larger sample, the proportion of responders is only 3%. In the modeling sample, the proportion of responders is 31.36%. Hence, the probabilities should be adjusted before expected profits are computed. The SAS code generated by the Neural Network node and passed on to the Score node includes statements for making this adjustment. Display 3.23 shows these statements.
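The exact statements are in Display 3.23. As an illustration, the standard Bayes re-weighting that maps a posterior estimated on an oversampled modeling sample back to the population priors can be sketched as follows (the function name and exact form are my own, not taken from the generated code):

```python
def adjust_for_priors(p_model, prior=0.03, sample_rate=0.3136):
    """Re-weight a posterior probability estimated on the oversampled
    modeling sample (31.36% responders) back to the population prior (3%)."""
    num = p_model * prior / sample_rate
    den = num + (1 - p_model) * (1 - prior) / (1 - sample_rate)
    return num / den

# A model probability equal to the sample response rate maps back to the
# population response rate, and all adjusted probabilities are pulled down.
```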

Display 3.23


Display 3.24 shows the profit matrix used in the decision-making process.

Display 3.24

Given the above profit matrix, calculation of the expected profit under the alternative decisions of classifying an individual as a responder or a non-responder proceeds as follows. Using the neural network model, the scoring algorithm first calculates the individual's probability of response and non-response. Suppose the calculated probability of response for an individual is 0.3, and the probability of non-response is 0.7. The expected profit if the individual is classified as a responder is 0.3 × $5 + 0.7 × (−$1.0) = $0.80. The expected profit if the individual is classified as a non-responder is 0.3 × $0 + 0.7 × $0 = $0. Hence classifying the individual as a responder (Decision1) yields a higher expected profit than classifying the individual as a non-responder (Decision2). An additional field is added to the record in the scored data set indicating the decision to classify the individual as a responder.
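The decision rule can be sketched in Python using the profit figures from the example above. The dictionary layout is my own; the node reads these values from the profit matrix in Display 3.24:

```python
# Profit matrix values from the example: $5 for a responder and -$1 for a
# non-responder under Decision1 (classify as responder); $0 either way
# under Decision2 (classify as non-responder).
PROFIT = {
    ("responder", "Decision1"): 5.0,
    ("non-responder", "Decision1"): -1.0,
    ("responder", "Decision2"): 0.0,
    ("non-responder", "Decision2"): 0.0,
}

def best_decision(p_resp):
    """Return the decision with the higher expected profit, plus both values."""
    expected = {
        d: p_resp * PROFIT[("responder", d)]
           + (1 - p_resp) * PROFIT[("non-responder", d)]
        for d in ("Decision1", "Decision2")
    }
    return max(expected, key=expected.get), expected
```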

These calculations are shown in the score code segment in Display 3.25.

Display 3.25


3.7 SCORE CODE

The score code is automatically saved by the Score node in the sub-directory \Workspaces\EMWSn\Score within the project directory.

For example, on my computer, the score code is saved by the Score node in the folder C:\TheBook\EM12.1\EMProjects\Chapter5\Workspaces\EMWS3\Score. Alternatively, you can save the score code in some other directory. To do so, run the Score node, and then click Results. Select either the Optimized SAS Code window or the SAS Code window. Click File→Save As, and enter the directory and name for saving the score code.


3.8 A NEURAL NETWORK MODEL TO PREDICT LOSS FREQUENCY IN AUTO INSURANCE

The premium that an insurance company charges a customer is based on the degree of risk of monetary loss to which the customer exposes the insurance company. The higher the risk, the higher the premium the company charges. For proper rate setting, it is essential to predict the degree of risk associated with each current or prospective customer. Neural networks can be used to develop models to predict the risk associated with each individual.

In this example, I develop a neural network model to predict loss frequency at the customer level. I use the target variable LOSSFRQ, which is a discrete version of loss frequency. The definition of LOSSFRQ is presented at the beginning of this chapter. If the target is a discrete form of a continuous variable with more than two levels, it should be treated as an ordinal variable, as I do in this example.

The goal of the model developed here is to estimate the conditional probabilities Pr(LOSSFRQ=0|X), Pr(LOSSFRQ=1|X), and Pr(LOSSFRQ=2|X), where X is a set of inputs or explanatory variables.

I have already discussed the general framework for neural network models in Section 5.2. There I gave an example of a neural network model with two hidden layers. I also pointed out that the outputs of the final hidden layer can be considered as complex transformations of the original inputs. These are given in Equations 5.1 through 5.12.

In the following example, there is only one hidden layer, and it has three units. The outputs of the hidden layer are HL1, HL2, and HL3. Since these are nothing but transformations of the original inputs, I can write the desired conditional probabilities as:

Pr(LOSSFRQ=0|HL1, HL2, HL3), Pr(LOSSFRQ=1|HL1, HL2, HL3), and Pr(LOSSFRQ=2|HL1, HL2, HL3).


3.8.1 Loss Frequency as an Ordinal Target

In order for the Neural Network node to treat the target variable LOSSFRQ as an ordinal variable, we should set its measurement level to Ordinal in the data source.

Display 3.26 shows the process flow diagram for developing a neural network model.

Display 3.26

The data source for the Input Data node is created from the data set Ch5_LOSSDAT2 using the Advanced Metadata Advisor Options with the Class Levels Count Threshold property set to 8, which is the same setting shown in Display 3.2.

Loss frequency is a continuous variable. The target variable LOSSFRQ, used in the Neural Network model developed here, is a discrete version of loss frequency. It is created as follows:

LOSSFRQ = 0 if loss frequency = 0,
LOSSFRQ = 1 if 0 < loss frequency < 1.5, and
LOSSFRQ = 2 if loss frequency ≥ 1.5.
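The discretization above can be written directly as a small function (illustrative Python):

```python
def lossfrq(loss_frequency):
    """Discretize continuous loss frequency into the ordinal target LOSSFRQ:
    0 when there were no losses, 1 for frequencies below 1.5, 2 otherwise."""
    if loss_frequency == 0:
        return 0
    return 1 if loss_frequency < 1.5 else 2
```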

Display 3.27 shows a partial list of the variables contained in the data set Ch5_LOSSDAT2. The target variable is LOSSFRQ. I changed its measurement level from Nominal to Ordinal.

Display 3.27


For illustration purposes, I enter a profit matrix, which is shown in Display 3.28. A profit matrix can be used for classifying customers into different risk groups such as low, medium, and high.

Display 3.28

The SAS code generated by the Neural Network node, shown in Display 3.29, illustrates how the profit matrix shown in Display 3.28 is used to classify customers into different risk groups.

Display 3.29


In the SAS code shown in Display 3.29, P_LOSSFRQ2, P_LOSSFRQ1, and P_LOSSFRQ0 are the estimated probabilities of LOSSFRQ = 2, 1, and 0, respectively, for each customer.

The property settings of the Data Partition node for allocating records for Training, Validation, and Test are 60, 30, and 10, respectively. These settings are the same as in Display 3.4.

To define the neural network, click the button located to the right of the Network property of the Neural Network node. The Network window opens, as shown in Display 3.30.

Display 3.30

The general formulas of the combination functions and activation functions of the hidden layers are given by Equations 5.1–5.12 of Section 5.2.2.

In the Neural Network model developed in this section, there is only one hidden layer with three hidden units, each hidden unit producing an output. The outputs produced by the hidden units are HL1, HL2, and HL3. These are special variables that are constructed from inputs contained in the data set Ch5_LOSSDAT2. These inputs are standardized prior to being used by the hidden layer.


3.8.2 Target Layer Combination and Activation Functions

As you can see from Display 3.30, I set the Target Layer Combination Function property to Linear and the Target Layer Activation Function property to Logistic. These settings result in the following calculations.

Since we set the Target Layer Combination Function property to be linear, the Neural Network node calculates linear predictors for each observation from the hidden layer outputs HL1, HL2, and HL3. Since the target variable LOSSFRQ is ordinal, taking the values 0, 1, and 2, the Neural Network node calculates two linear predictors.

To verify the calculations, run the Neural Network node shown in Display 3.26 and open the Results window. In the Results window, click View→Scoring→SAS Code and scroll down until you see "Writing the Node LOSSFRQ", as shown in Display 3.31.

Display 3.31

Scroll down further to see how each customer/record is assigned to a risk level using the profit matrix shown in Display 3.28. The steps in assigning a risk level to a customer/record are shown in Display 3.32.

Display 3.32


3.9 DATA PREPARATION

3.9.1 Model Selection Criterion Property

I set the Model Selection Criterion property to Average Error. During the training process, the Neural Network node creates a number of candidate models, one at each iteration. If we set Model Selection Criterion to Average Error, the Neural Network node selects the model that has the smallest error calculated using the Validation data set.


3.9.2 Score Ranks in the Results Window

Open the Results window of the Neural Network node to see the details of the selected model.

The Results window is shown in Display 3.33.

Display 3.33

The score rankings are shown in the Score Rankings window. To see how lift and cumulative lift are presented for a model with an ordinal target, you can open the table corresponding to the graph by clicking on the Score Rankings window and then clicking on the Table icon on the task bar. Selected columns from the table behind the Score Rankings graph are shown in Display 3.34.

Display 3.34


In Display 3.34, column 3 has the title Event. This column has a value of 2 for all rows. You can infer that SAS Enterprise Miner created the lift charts for the target level 2, which is the highest level of the target variable LOSSFRQ.

When the target is ordinal, SAS Enterprise Miner creates lift charts based on the probability of the highest level of the target variable. For each record in the test data set, SAS Enterprise Miner computes the predicted or posterior probability Pr(lossfrq=2|Xi) from the model. Then it sorts the data set in descending order of the predicted probability and divides the data set into 20 percentiles (bins). Within each percentile it calculates the proportion of cases with actual lossfrq=2. The lift for a percentile is the ratio of the proportion of cases with lossfrq=2 in the percentile to the proportion of cases with lossfrq=2 in the entire data set.

To assess model performance, the insurance company must rank its customers by the expected value of LOSSFRQ, instead of by the highest level of the target variable. To demonstrate how to construct a lift table based on the expected value of the target, I construct the expected value for each record in the data set by using the following formula.


E(lossfrq|Xi) = Pr(lossfrq=0|Xi)*0 + Pr(lossfrq=1|Xi)*1 + Pr(lossfrq=2|Xi)*2

Then I sorted the data set in descending order of E(lossfrq|Xi) and divided the data set into 20 percentiles. Within each percentile, I calculated the mean of the actual or observed value of the target variable lossfrq. The ratio of the mean of actual lossfrq in a percentile to the overall mean of actual lossfrq for the data set is the lift of the percentile.
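These steps can be sketched in Python (an illustrative version of the calculation; the actual computation was done in the SAS Code node):

```python
def expected_lossfrq(p0, p1, p2):
    """E(lossfrq | X) = 0*Pr(0) + 1*Pr(1) + 2*Pr(2)."""
    return p1 + 2.0 * p2

def lift_by_expected_value(probs, actual, n_bins=20):
    """Rank records by E(lossfrq|X) in descending order, split into n_bins
    percentile groups, and return mean(actual in bin) / mean(actual overall)
    for each bin."""
    ranked = sorted(zip([expected_lossfrq(*p) for p in probs], actual),
                    key=lambda r: -r[0])
    size = len(ranked) // n_bins
    overall_mean = sum(actual) / len(actual)
    return [sum(a for _, a in ranked[b * size:(b + 1) * size]) / size
            / overall_mean for b in range(n_bins)]
```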

These calculations were done in the SAS Code node, which is connected to the Neural Network node as shown in Display 3.26.

The SAS program used in the SAS Code node is shown in Displays 5.88, 5.89, and 5.90.

The lift tables based on E(lossfrq|Xi) are shown in Tables 3.9, 3.10, and 3.11.

Table 3.9

Table 3.10


Table 3.11


3.9.3 Scoring a New Data Set with the Model

The Neural Network model we developed can be used for scoring a data set where the value of the target is not known. The Score node uses the model developed by the Neural Network node to predict the target levels and their probabilities for the Score data set.

Display 3.26 shows the process flow for scoring. The process flow consists of an Input Data node called Ch5_LOSSDAT_SCORE2, which reads the data set to be scored. The Score node in the process flow takes the model score code generated by the Neural Network node and applies it to the data set to be scored.

The output generated by the Score node includes the predicted (assigned) levels of the target and their probabilities. These calculations reflect the solution of Equations 5.18 through 5.25.

Display 3.35 shows the programming statements used by the Score node to calculate the predicted probabilities of the target levels.

Display 3.35

Display 3.36 shows the assignment of the target levels to individual records using the profit matrix.

Display 3.36


The following code segments create additional variables. Display 3.37 shows the target level assignment process. The variable I_LOSSFRQ is created according to the posterior probability found for each record.

Display 3.37

In Display 3.38, the variable EM_CLASSIFICATION represents predicted values based on maximum posterior probability. EM_CLASSIFICATION is the same as I_LOSSFRQ shown in Display 3.37.

Display 3.38


Display 3.39 shows the list of variables created by the Score node using the Neural Network model.

Display 3.39


3.9.4 Classification of Risks for Rate Setting in Auto Insurance with Predicted Probabilities

The probabilities generated by the neural network model can be used to classify risk for two purposes:

• to select new customers

• to determine the premiums to be charged according to the predicted risk

For each record in the data set, you can compute the expected LOSSFRQ as:

E(lossfrq|Xi) = Pr(lossfrq=0|Xi)*0 + Pr(lossfrq=1|Xi)*1 + Pr(lossfrq=2|Xi)*2.

Customers can be ranked by this expected frequency and assigned to different risk groups.


CHAPTER 4

SPECIFIC NEURAL NETWORKS TO PREDICT RISK


4.1 ALTERNATIVE SPECIFICATIONS OF THE NEURAL NETWORKS

The general neural network model consists of linear combination functions and hyperbolic tangent activation functions in the hidden layers, and a linear combination function and a logistic activation function in the output layer. Multilayer perceptron networks use linear combination functions and sigmoid activation functions in the hidden layers. Sigmoid activation functions are S-shaped.

In this section, I introduce you to other types of networks, known as Radial Basis Function (RBF) networks, which have different types of combination and activation functions. You can build both Multilayer Perceptron (MLP) networks and RBF networks using the Neural Network node.


4.2 MULTILAYER PERCEPTRON (MLP) NEURAL NETWORK

In an MLP neural network, the hidden layer combination functions are linear, and the hidden layer activation functions are sigmoid functions such as the Hyperbolic Tangent, Arc Tangent, Elliot, or Logistic. The sigmoid functions are S-shaped.

If you want to use a built-in MLP network, click the button located to the right of the Network property of the Neural Network node. Set the Architecture property to Multilayer Perceptron in the Network Properties window, as shown in Display 6.40.

Display 6.40

When you set the Architecture property to Multilayer Perceptron, the Neural Network node uses the default values Linear and Tanh for the Hidden Layer Combination Function and the Hidden Layer Activation Function properties. The default values for the Target Layer Combination Function and Target Layer Activation Function properties are Linear and Exponential. You can change the Target Layer Combination Function, Target Layer Activation Function, and Target Layer Error Function properties. I changed the Target Layer Activation Function property to Logistic, as shown in Display 6.40. If the target is binary and the Target Layer Activation Function is Logistic, then you can set the Target Layer Error Function property to Bernoulli to generate a logistic regression-type model.


4.3 A RADIAL BASIS FUNCTION (RBF) NEURAL NETWORK

The hidden layer combination function in an MLP neural network is the weighted sum, or inner product, of the vector of inputs and the vector of corresponding weights, plus a bias coefficient. In contrast, the hidden layer combination function in an RBF neural network is the squared Euclidean distance between the vector of inputs and the vector of corresponding weights (center points), multiplied by the squared bias.
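The contrast between the two combination functions can be made concrete with a short Python sketch (illustrative only; the function and variable names are my own):

```python
def mlp_combination(inputs, weights, bias):
    """MLP hidden unit: inner product of inputs and weights, plus a bias."""
    return sum(x * w for x, w in zip(inputs, weights)) + bias

def rbf_combination(inputs, center, bias):
    """RBF hidden unit: squared Euclidean distance between the input vector
    and the unit's center point, multiplied by the squared bias."""
    return bias ** 2 * sum((x - c) ** 2 for x, c in zip(inputs, center))
```

The MLP combination value grows without bound along one direction of the input space, while the RBF combination value is zero at the center point and grows in every direction away from it, which is what makes RBF units local.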

In the Neural Network node, the basis functions are defined in terms of the combination and activation functions of the hidden units. The RBF neural network that uses this combination function is called Ordinary Radial with Unequal Widths, or ORBFUN, if the activation function of the hidden units is an exponential function.

To build an ORBFUN neural network model, click the button located to the right of the Network property of the Neural Network node. In the Network window that opens, set the Architecture property to Ordinary Radial – Unequal Width, as shown in Display 6.41.

Display 6.41

As you can see in Display 6.41, the rows corresponding to Hidden Layer Combination Function and Hidden Layer Activation Function are shaded, indicating that their values are fixed for this network. Hence this is called a built-in network in SAS Enterprise Miner terminology. However, the rows corresponding to the Target Layer Combination Function, Target Layer Activation Function, and Target Layer Error Function are not shaded, so you can change their values. For example, to set the Target Layer Activation Function to Logistic, click on the Value column corresponding to the Target Layer Activation Function property and select Logistic. Similarly, you can change the Target Layer Error Function to Bernoulli.

As pointed out earlier in this chapter, the error function you select depends on the type of model you want to build.


Display 6.42 shows a list of the RBF neural networks available in SAS Enterprise Miner. You can see all the network architectures if you click on the down arrow in the box to the right of the Architecture property in the Network window.

Display 6.42

Display 6.43 shows a graph of the radial basis function with height = 1 and width = 1.

Display 6.43

Display 6.44 shows a graph of the radial basis function with height = 1 and width = 4.

Display 6.44


Display 6.45 shows a graph of the radial basis function with height = 5 and width = 4.

Display 6.45

If there are three units in a hidden layer, each having a radial basis function, and if the radial basis functions are constrained to sum to 1, then they are called Normalized Radial Basis Functions in SAS Enterprise Miner. Without this constraint, they are called Ordinary Radial Basis Functions.
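Assuming exponential activations, as in the ordinary RBF networks described above, the sum-to-1 normalization constraint can be sketched as follows (illustrative Python; the exact parameterization in the node may differ):

```python
import math

def ordinary_rbf_outputs(sq_distances):
    """Ordinary RBF hidden-unit outputs: an exponential activation applied
    to each unit's (negated) squared-distance combination value."""
    return [math.exp(-d) for d in sq_distances]

def normalized_rbf_outputs(sq_distances):
    """Normalized RBF outputs: the same basis values rescaled so that the
    hidden units' outputs sum to 1 across the layer."""
    h = ordinary_rbf_outputs(sq_distances)
    total = sum(h)
    return [v / total for v in h]
```

The normalization turns the hidden-unit outputs into a partition of unity: each input is shared among the units in proportion to how close it lies to each center.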


4.4 COMPARISON OF ALTERNATIVE BUILT-IN ARCHITECTURES OF THE NEURAL NETWORK NODE

You can produce a number of neural network models by specifying alternative architectures in the Neural Network node. You can then make a selection from among these models, based on lift or other measures, using the Model Comparison node. In this section, I compare the built-in architectures listed in Display 6.46.

Display 6.46

Display 6.47 shows the process flow diagram for comparing the Neural Network models produced by different architectural specifications. In all eight models compared in this section, I set the Assessment property to Average Error and the Number of Hidden Units property to 5.

Display 6.47


4.4.1 Multilayer Perceptron (MLP) Network

The top Neural Network node in Display 6.47 is an MLP network, and its property settings are shown in Display 6.48.

Display 6.48

The default value of the Hidden Layer Combination Function property is Linear, and the default value of the Hidden Layer Activation Function property is Tanh. After running this node, you can open the Results window and view the SAS code generated by the Neural Network node. Display 6.49 shows a segment of the SAS code generated by the Neural Network node with the Network settings shown in Display 6.48.

Display 6.49

The code segment shown in Display 6.50 shows that the final outputs in the target layer are computed using a logistic activation function.

Display 6.50


4.4.2 Ordinary Radial Basis Function with Equal Heights and Widths (ORBFEQ)

Display 6.51 shows an example of the outputs of five hidden units of an ORBFEQ with a single input.

Display 6.51

In the second Neural Network node in Display 6.47, the Architecture property is set to Ordinary Radial – Equal Width, as shown in Display 6.52.

Display 6.52

Display 6.53 shows a segment of the SAS code, which shows the computation of the outputs of the hidden units of the ORBFEQ neural network specified in Display 6.52.

Display 6.53


Display 6.54 shows the computation of the outputs of the target layer of the ORBFEQ neural network specified in Display 6.52.

Display 6.54


4.4.3 Ordinary Radial Basis Function with Equal Heights and Unequal Widths (ORBFUN)

Display 6.55 shows an example of the output of the five hidden units Hk (k = 1, 2, 3, 4, and 5) with a single standardized input and arbitrary values for wjk and bk.

Display 6.55

The Architecture property of the ORBFUN Neural Network node (the third node from the top in Display 6.47) is set to Ordinary Radial – Unequal Width. The remaining Network properties are the same as in Display 6.52.

Displays 5.56 and 5.57 show the SAS code segment showing the calculation of theoutputsofthehiddenandtargetlayersofthisORBFUNnetwork.

Display6.56


Display 6.57


4.4.4 Normalized Radial Basis Function with Equal Widths and Heights (NRBFEQ)

Display 6.58 shows the output of the five hidden units Hk (k = 1, 2, 3, 4, and 5) with a single standardized input, and arbitrary values for wjk and b.

In this case, the height ak = 1 for all the hidden units. Hence log(ak) = 0.

Display 6.58

For an NRBFEQ neural network, the Architecture property is set to Normalized Radial – Equal Width and Height. The remaining Network settings are the same as in Display 6.52.

Displays 6.59 and 6.60 show segments from the SAS code generated by the Neural Network node. Display 6.59 shows the computation of the outputs of the hidden units and Display 6.60 shows the calculation of the target layer outputs.

Display 6.59


Display 6.60
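The defining feature of a normalized RBF network is that the hidden-unit outputs are rescaled to sum to 1, i.e., the radial combinations pass through a softmax. The sketch below shows that normalization for the equal-width, equal-height case; it is an illustrative formulation under the same hypothetical parameter names as before, not the node's exact score code.

```python
import math

def nrbfeq_hidden(x, centers, width):
    """NRBFEQ hidden-unit outputs: Gaussian radial combinations normalized
    with a softmax so that the outputs sum to 1 (illustrative sketch)."""
    # Radial combination for each hidden unit (log of an unnormalized bump).
    eta = [-sum((xi - ci) ** 2 for xi, ci in zip(x, c)) / (2.0 * width ** 2)
           for c in centers]
    m = max(eta)                       # subtract the max to stabilize exp()
    e = [math.exp(v - m) for v in eta]
    s = sum(e)
    return [v / s for v in e]
```

Because of the normalization, the unit whose center is closest to the input always receives the largest share, and the outputs form a partition of unity over the input space.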


4.4.5 Normalized Radial Basis Function with Equal Heights and Unequal Widths (NRBFEH)

Display 6.61 shows the output of the five hidden units Hk (k = 1, 2, 3, 4, and 5) with a single standardized input AGE (p = 1), and arbitrary values for wjk and bk. The height ak = 1 for all the hidden units. Hence log(ak) = 0.

Display 6.61

The Architecture property for the NRBFEH neural network is set to Normalized Radial – Equal Height. The remaining Network settings are the same as in Display 6.52.

Display 6.62 shows a segment of the SAS code used to compute the hidden units in this NRBFEH network.

Display 6.62


Display 6.63 shows the SAS code used to compute the target layer outputs.

Display 6.63


4.4.6 Normalized Radial Basis Function with Equal Widths and Unequal Heights (NRBFEW)

Display 6.64 shows the output of the five hidden units Hk (k = 1, 2, 3, 4, and 5) with a single standardized input and arbitrary values for wjk, ak, and b.

Display 6.64

The Architecture property of the estimated NRBFEW network is set to Normalized Radial – Equal Width. The remaining Network settings are the same as in Display 6.52.

Display 6.65 shows the code segment showing the calculation of the output of the hidden units in the estimated NRBFEW network.

Display 6.65


Display 6.66 shows the calculation of the outputs of the target layer.

Display 6.66


4.4.7 Normalized Radial Basis Function with Equal Volumes (NRBFEV)

Display 6.67 shows the output of the five hidden units Hk (k = 1, 2, 3, 4, and 5) with a single standardized input, and arbitrary values for wjk and bk.

Display 6.67

The Architecture property of the NRBFEV architecture used in this example is set to Normalized Radial – Equal Volumes. The remaining Network settings are the same as in Display 6.52.

Display 6.68 shows a segment of the SAS code generated by the Neural Network node for the NRBFEV network.

Display 6.68


Display 6.69 shows the computation of the outputs of the target layer in the estimated NRBFEV network.

Display 6.69


4.4.8 Normalized Radial Basis Function with Unequal Widths and Heights (NRBFUN)

Display 6.70 shows the output of the five hidden units Hk (k = 1, 2, 3, 4, and 5) with a single standardized input and arbitrary values for wjk, ak, and bk.

Display 6.70

The Architecture property for the NRBFUN network is set to Normalized Radial – Unequal Width and Height. The remaining Network settings are the same as in Display 6.52.

Display 6.71 shows the segment of the SAS code used for calculating the outputs of the hidden units.

Display 6.71


Display 6.72 shows the SAS code needed for calculating the outputs of the target layer of the NRBFUN network.

Display 6.72


4.4.9 User-Specified Architectures

With the User-Specified Architecture setting, you can pair different Combination Functions and Activation Functions. Each pair generates a different Neural Network model, and all possible pairs can produce a large number of Neural Network models.

You can select the following settings for the Hidden Layer Activation Function:

• ArcTan

• Elliott

• Hyperbolic Tangent

• Logistic

• Gauss

• Sine

• Cosine

• Exponential

• Square

• Reciprocal

• Softmax

You can select the following settings for the Hidden Layer Combination Function:

• Add

• Linear

• EQSlopes

• EQRadial

• EHRadial

• EWRadial

• EVRadial

• XRadial
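The pairing idea above can be sketched as two lookup tables: a hidden unit is simply activation(combination(inputs)). The entries below use the textbook formulas for a few of the listed functions; the implementations and parameter names are illustrative assumptions, not SAS's exact generated code.

```python
import math

# Illustrative implementations of a few combination and activation functions;
# the names mirror the property values, the formulas are textbook versions.
COMBINATION = {
    "Linear":   lambda x, w, b: b + sum(wi * xi for wi, xi in zip(w, x)),
    "Add":      lambda x, w, b: sum(x),
    # EQRadial: negative squared distance scaled by a shared bias term.
    "EQRadial": lambda x, w, b: -b * b * sum((xi - wi) ** 2 for wi, xi in zip(w, x)),
}
ACTIVATION = {
    "HyperbolicTangent": math.tanh,
    "Logistic":    lambda v: 1.0 / (1.0 + math.exp(-v)),
    "ArcTan":      lambda v: (2.0 / math.pi) * math.atan(v),
    "Elliott":     lambda v: v / (1.0 + abs(v)),
    "Sine":        math.sin,
    "Exponential": math.exp,
    "Square":      lambda v: v * v,
}

def hidden_unit(x, w, b, combination="Linear", activation="HyperbolicTangent"):
    """One hidden unit: activation(combination(x)). Each pairing of the two
    names yields a different network architecture."""
    return ACTIVATION[activation](COMBINATION[combination](x, w, b))
```

Swapping the two keyword arguments is exactly what the User-Specified setting does at the architecture level: the same inputs and weights produce a different model for every pairing.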


4.5 AUTONEURAL NODE

As its name suggests, the AutoNeural node automatically configures a Neural Network model. It uses default combination functions and error functions. The algorithm tests different activation functions and selects the one that is optimal.

Display 6.73 shows the property settings of the AutoNeural node used in this example. Segments of SAS code generated by the AutoNeural node are shown in Displays 6.74 and 6.75.

Display 6.73

Display 6.74 shows the computation of outputs of the hidden units in the selected model.

Display 6.74


From Display 6.74 it is clear that in the selected model the inputs are combined using a Linear Combination Function in the hidden layer. A Sin Activation Function is applied to calculate the outputs of the hidden units.

Display 6.75 shows the calculation of the outputs of the Target layer.

Display 6.75


4.6 DMNEURAL NODE

The DMNeural node fits a nonlinear equation using bucketed principal components as inputs. The model derives the principal components from the inputs in the training data set. As explained in Chapter 2, the principal components are weighted sums of the original inputs, the weights being the eigenvectors of the variance-covariance or correlation matrix of the inputs. Since each observation has a set of inputs, you can construct a principal component value for each observation from the inputs of that observation. The principal components can be viewed as new variables constructed from the original inputs.

The DMNeural node selects the best principal components using the R-square criterion in a linear regression of the target variable on the principal components. The selected principal components are then binned or bucketed. These bucketed variables are used in the models developed at different stages of the training process.

The model generated by the DMNeural node is called an additive nonlinear model because it is the sum of the models generated at different stages of the training process. In other words, the output of the final model is the sum of the outputs of the models generated at different stages of the training process, as shown in Display 6.80.

In the first stage, a model is developed using the response variable as the target. An identity link is used if the target is interval scaled, and a logistic link function is used if the target is binary. The residuals of the model from the first stage are used as the target variable values in the second stage. The residuals of the model from the second stage are used as the target variable values in the third stage. The output of the final model is the sum of the outputs of the models generated in the first, second, and third stages, as can be seen in Display 6.80. The first, second, and third stages of the model generation process are referred to as stage 0, stage 1, and stage 2 in the code shown in Displays 6.77 through 6.80. See the SAS Enterprise Miner 12.1 Reference Help.
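The additive stage-wise process can be sketched generically: stage 0 models the target, each later stage models the previous stage's residuals, and the final prediction is the sum of all stage outputs. In the sketch below, fit_stage is a hypothetical learner interface (any function that takes inputs and targets and returns a predict callable); it stands in for whatever model DMNeural fits at each stage.

```python
def fit_stagewise(x, y, fit_stage, n_stages=3):
    """Additive stage-wise fitting: stage 0 fits the target, stage k fits the
    residuals of stage k-1, and the final model is the sum of all stages.
    fit_stage(x, y) -> predict(xi) is a hypothetical learner interface."""
    predictors = []
    residual = list(y)
    for _ in range(n_stages):
        predict = fit_stage(x, residual)
        fitted = [predict(xi) for xi in x]
        # The next stage's target is this stage's residuals.
        residual = [r - f for r, f in zip(residual, fitted)]
        predictors.append(predict)
    # Final output = sum of the stage outputs (an additive nonlinear model).
    return lambda xi: sum(p(xi) for p in predictors)
```

With a trivial "predict the mean" learner plugged in, stage 0 captures the mean of the target and the later stages capture nothing further, which makes the summation structure easy to verify.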

Display 6.76 shows the Properties panel of the DMNeural node used in this example.

Display 6.76


After running the DMNeural node, open the Results window and click Scoring → SAS Code. Scroll down in the SAS Code window to view the SAS code, shown in Displays 6.77 through 6.79.

Display 6.77 shows that the response variable is used as the target only in stage 0.

Display 6.77

The model at stage 1 is shown in Display 6.78. This code shows that the algorithm uses the residuals of the model from stage 0 (_RHAT1) as the new target variable.

Display 6.78


The model at stage 2 is shown in Display 6.79. In this stage, the algorithm uses the residuals of the model from stage 1 (_RHAT2) as the new target variable. This process of building the model in one stage from the residuals of the model from the previous stage can be called an additive stage-wise process.

Display 6.79

You can see in Display 6.79 that the Activation Function selected in stage 2 is SQUARE. The selection of the best activation function at each stage in the training process is based on the smallest SSE (Sum of Squared Errors).

In Display 6.80, you can see that the final output of the model is the sum of the outputs of the models generated in stages 0, 1, and 2. Hence the underlying model can be described as an additive nonlinear model.

Display 6.80


4.7 DMINE REGRESSION NODE

The Dmine Regression node generates a Logistic Regression for a binary target. The estimation of the Logistic Regression proceeds in three steps.

In the first step, a preliminary selection is made, based on Minimum R-Square. For the original variables, the R-Square is calculated from a regression of the target on each input; for the binned variables, it is calculated from a one-way Analysis of Variance (ANOVA).

In the second step, a sequential forward selection process is used. This process starts by selecting the input variable that has the highest correlation coefficient with the target. A regression equation (model) is estimated with the selected input. At each successive step of the sequence, an additional input variable that provides the largest incremental contribution to the Model R-Square is added to the regression. If the lower bound for the incremental contribution to the Model R-Square is reached, the selection process stops.
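The sequential forward selection loop just described can be sketched as follows. Here rsq_with is a hypothetical callback that returns the model R-square for a candidate set of inputs, and min_increment stands in for the lower bound on the incremental contribution; both names are illustrative, not Dmine Regression's actual interface.

```python
def forward_select(candidates, rsq_with, min_increment=0.0005):
    """Sequential forward selection: repeatedly add whichever remaining input
    gives the largest increase in model R-square, stopping when the best
    available increment falls below the lower bound."""
    selected, current = [], 0.0
    remaining = list(candidates)
    while remaining:
        # Candidate whose addition yields the highest model R-square.
        best = max(remaining, key=lambda v: rsq_with(selected + [v]))
        gain = rsq_with(selected + [best]) - current
        if gain < min_increment:
            break                      # incremental contribution too small
        selected.append(best)
        current += gain
        remaining.remove(best)
    return selected
```

The first pass of the loop necessarily picks the single input with the highest R-square, which for one input is equivalent to picking the input most correlated with the target, matching the description above.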

The Dmine Regression node includes the original inputs and new categorical variables called AOV16 variables, which are constructed by binning the original variables. These AOV16 variables are useful in taking account of the nonlinear relationships between the inputs and the target variable.
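The construction of an AOV16 variable can be illustrated as binning an interval input into up to 16 buckets. The sketch below assumes equal-width bins over the observed range; this is an illustrative assumption about the binning scheme, not a statement of the node's exact algorithm.

```python
def aov16_bin(values):
    """Sketch of AOV16 construction: bin an interval input into (up to) 16
    equal-width buckets over its observed range, producing a categorical
    variable whose levels can capture nonlinear input-target relationships.
    Equal-width binning is an assumption made for illustration."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / 16.0 or 1.0    # guard against a constant input
    # Bucket labels run 1..16; the maximum value falls in the top bucket.
    return [min(int((v - lo) / width) + 1, 16) for v in values]
```

The resulting bucket label is then treated as a categorical (class) input, so a one-way ANOVA of the target across the 16 levels can pick up relationships that a straight line through the raw input would miss.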

After selecting the best inputs, the algorithm computes an estimated value (prediction) of the target variable for each observation in the training data set using the selected inputs.

In the third step, the algorithm estimates a logistic regression (in the case of a binary target) with a single input, namely the estimated value or prediction calculated in the second step.

Display 6.81 shows the Properties panel of the Dmine Regression node with default property settings.

Display 6.81


You should make sure that the Use AOV16 Variables property is set to Yes if you want to include them in the variable selection and model estimation.

Display 6.82 shows the variable selected in the first step.

Display 6.82


Display 6.83 shows the variables selected by the forward least-squares stepwise regression in the second step.

Display 6.83


4.8 COMPARING THE MODELS GENERATED BY DMNEURAL, AUTONEURAL, AND DMINE REGRESSION NODES

In order to compare the predictive performance of the models produced by the DMNeural, AutoNeural, and Dmine Regression nodes, I created the process flow diagram shown in Display 6.84.

Display 6.84

The Results window of the Model Comparison node shows the Cumulative %Captured Response charts for the three models for the Training, Validation, and Test data sets.

Display 6.85 shows the Cumulative %Captured Response chart for the Training data set.

Display 6.85

Display 6.86 shows the Cumulative %Captured Response chart for the Validation data set.

Display 6.86


Display 6.87 shows the Cumulative %Captured Response chart for the Test data set.

Display 6.87

From the Cumulative %Captured Response charts, there does not seem to be a significant difference in the predictive performance of the three models compared.


CHAPTER 3

CLUSTER ANALYSIS WITH NEURAL NETWORKS


5.1 CLUSTER ANALYSIS IN ENTERPRISE MINER

SAS Institute places the exploration phase (Explore) of the data mining process after the selection phase (Select). The exploration phase comprises the nodes shown in Figure 5-1, three of which perform cluster analysis: the Cluster node (clustering), the SOM/Kohonen node (self-organizing neural networks), and the Variable Clustering node (which divides numeric variables into disjoint or hierarchical clusters).

Figure 5-1

The Cluster node enables you to segment your data by grouping observations that are statistically similar. Observations that are similar tend to be in the same cluster, and observations that are different tend to be in different clusters. The cluster identifier for each observation can be passed to other tools for use as an input, ID, or target variable. It can also be used as a group variable that enables automatic construction of separate models for each group.

The SOM/Kohonen node enables you to perform unsupervised learning by usingKohonen vector quantization (VQ), Kohonen self-organizing maps (SOMs), or batchSOMs with Nadaraya-Watson or local-linear smoothing. Kohonen VQ is a clusteringmethod,butSOMsareprimarilydimension-reductionmethods.


The Variable Clustering node is a useful tool for selecting variables or cluster components for analysis. Variable clustering removes collinearity, decreases variable redundancy, and helps reveal the underlying structure of the input variables in a data set. Large numbers of variables can complicate the task of determining the relationships that might exist between the independent variables and the target variable in a model. Models that are built with too many redundant variables can destabilize parameter estimates, confound variable interpretation, and increase the computing time that is required to run the model. Variable clustering can reduce the number of variables that are required to build reliable predictive or segmentation models.


5.2 CLUSTER NODE

Use the Cluster node to perform observation clustering, which can be used to segment databases. Clustering places objects into groups or clusters suggested by the data. The objects in each cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar. If obvious clusters or groupings could be developed prior to the analysis, then the clustering analysis could be performed by simply sorting the data.

The clustering methods in the Cluster node perform disjoint cluster analysis on the basis of Euclidean distances computed from one or more quantitative variables and seeds that are generated and updated by the algorithm. You can specify the clustering criterion that is used to measure the distance between data observations and seeds. The observations are divided into clusters so that every observation belongs to at most one cluster.

To perform clustering, you must have a raw data set or a training data set feeding into the Cluster node.

After clustering is performed, the characteristics of the clusters can be examined graphically using the Clustering Results Package. The consistency of clusters across variables is of particular interest. The multidimensional charts and plots enable you to graphically compare the clusters.

Note that the cluster identifier for each observation can be passed to other nodes for use as an input, ID, group, or target variable. The default is group variable.

The Cluster node also exports the cluster statistics and the cluster seed data sets to the successor nodes.

The Cluster node can impute missing values of database observations. However, you might want to preprocess the data in other ways before using the Cluster node. For example, you might want to remove outliers, as they often appear as individual clusters, and they might distort other, more important clusters. For most applications, the variables should be transformed so that equal distances are of equal practical importance.

The Cluster node accepts binary, nominal, ordinal, and interval data. Binary, nominal, and ordinal variables are coded as numeric dummy variables for processing.

Variables with large variances tend to have more effect on the resulting clusters than variables with small variances. If all variables are measured in the same units (for example, dollars), then standardization might not be necessary. Otherwise, some form of standardization is recommended.

Nonlinear transformations of the variables can change the number of population clusters. Therefore, you should be careful when using nonlinear transformations.

The following nodes are particularly useful in preprocessing data for cluster analysis:

Use the Sampling node to select a random sample of the data for initial analysis.


Use the Transform Variables node to standardize the variables in some way.

Use the Data Partition node to partition the data into training, validation, and test data sets. Partitioning has no effect on the cluster analysis itself. However, the Cluster node can score these data sets.


5.2.1 Cluster Node General Properties

The following general properties are associated with the Cluster node:

Node ID — The Node ID property displays the ID that SAS Enterprise Miner assigns to a node in a process flow diagram. Node IDs are important when a process flow diagram contains two or more nodes of the same type. The first Cluster node added to a diagram will have a Node ID of Clus. The second Cluster node added to a diagram will have a Node ID of Clus2, and so on.

Imported Data — The Imported Data property accesses the Imported Data — Cluster window. The Imported Data — Cluster window contains a list of the ports that provide data sources to the Cluster node. Select the button to open a table of the imported data.

If data exists for an imported data source, you can select the row in the imported data table and click one of the following buttons:

Browse to open a window where you can browse the data set.

Explore to open the Explore window, where you can sample and plot the data.

Properties to open the Properties window for the data source. The Properties window contains a Table tab and a Variables tab. The tabs contain summary information (metadata) about the table and variables.

Exported Data — The Exported Data property accesses the Exported Data — Cluster window. The Exported Data — Cluster window contains a list of the output data ports that the Cluster node creates data for when it runs. Select the button to the right of the Exported Data property to open a table that lists the exported data sets.

If data exists for an exported data source, you can select the row in the exported data table and click one of the following buttons:

Browse to open a window where you can browse the data set.

Explore to open the Explore window, where you can sample and plot the data.

Properties to open the Properties window for the data source. The Properties window contains a Table tab and a Variables tab. The tabs contain summary information (metadata) about the table and variables.

Notes — Select the button to the right of the Notes property to open a window that you can use to store notes of interest, such as data or configuration information.


5.2.2 Cluster Node Train Properties

The following train properties are associated with the Cluster node:

Variables — Select the button to open the Variables — Clus table, which enables you to view the columns metadata, or open an Explore window to view a variable's sampling information, observation values, or a plot of variable distribution. You can specify Use and Report variable values. The Name, Role, and Level values for a variable are displayed as read-only properties.

The following buttons and check boxes provide additional options to view and modify variable metadata:

Apply — Changes metadata based on the values supplied in the drop-down menus, check box, and selector field.

Reset — Changes metadata back to its state before you click Apply.

Label — Adds a column for a label for each variable.

Mining — Adds columns for the Order, Lower Limit, Upper Limit, Creator, Comment, and Format Type for each variable.

Basic — Adds columns for the Type, Format, Informat, and Length of each variable.

Statistics — Adds statistics metadata for each variable.

Explore — Opens an Explore window that enables you to view a variable's sampling information, observation values, or a plot of variable distribution.

Cluster Variable Role — Use the Cluster Variable Role property to specify the model role that you want to be assigned to the cluster variable. By default, the cluster variable (segment identifier) is assigned a role of group. Model roles are Segment, ID, Input, and Target. The model role of Segment is useful for BY-group processing. Note that the segment identifier retains the selected variable role as it is passed to subsequent tools in the process flow diagram. The default setting for the Cluster Variable Role property is Segment.

Internal Standardization — Use the Internal Standardization property to specify the internal standardization method that the Cluster node should use.

None — the variables are not standardized before clustering.

Standardization — (default setting) the variable values are divided by the standard deviation. The mean is not subtracted.

Range — the variable values are divided by the range. The mean is not subtracted.
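The three Internal Standardization options can be sketched directly from the descriptions above. Note that in both scaling methods the mean is not subtracted, so the values keep their sign and relative location. The use of the sample standard deviation below is an assumption for illustration.

```python
import statistics

def standardize(values, method="Standardization"):
    """The Cluster node's internal standardization options as described
    above: divide by the standard deviation or by the range; in neither
    case is the mean subtracted (illustrative sketch)."""
    if method == "None":
        return list(values)
    if method == "Standardization":
        scale = statistics.stdev(values)   # sample std dev, an assumption
    elif method == "Range":
        scale = max(values) - min(values)
    else:
        raise ValueError(method)
    return [v / scale for v in values]
```

Because the mean is not subtracted, a strictly positive variable stays strictly positive after standardization; only the spread is rescaled so that variables with different variances contribute comparably to the Euclidean distances.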


5.2.3 Cluster Node Train Properties: Number of Clusters

Specification Method — Use the Specification Method property to choose the method that SAS Enterprise Miner will use to determine the maximum number of clusters. Choose between the Automatic and User-Specified methods.

The Automatic setting (default) configures SAS Enterprise Miner to automatically determine the optimum number of clusters to create.

When the Automatic setting is selected, the value in the Maximum Number of Clusters property in the Number of Clusters section is not used to set the maximum number of clusters. Instead, SAS Enterprise Miner first makes a preliminary clustering pass, beginning with the number of clusters that is specified as the Preliminary Maximum value in the Selection Criterion properties.

After the preliminary pass completes, the multivariate means of the clusters are used as inputs for a second pass that uses agglomerative, hierarchical algorithms to combine and reduce the number of clusters. Then, the smallest number of clusters that meets all four of the following criteria is selected.

The number of clusters must be greater than or equal to the number that is specified as the Minimum value in the Selection Criterion properties.

The number of clusters must have cubic clustering criterion statistic values that are greater than the CCC Cutoff that is specified in the Selection Criterion properties.

The number of clusters must be less than or equal to the Final Maximum value.

A peak in the number of clusters exists.

Note: If the data to be clustered is such that the four criteria are not met, then the number of clusters is set to the first local peak. In this event, the following warning is displayed in the Cluster node log:

WARNING: The number of clusters selected based on the CCC values may not be valid. Please refer to the documentation of the Cubic Clustering Criterion. There are no number of clusters matching the specified minimum and maximum number of clusters. The number of clusters will be set to the first local peak.

After the number of clusters is determined, a final pass runs to produce the final clustering for the Automatic setting.
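The four-criteria selection rule, with its fall-back to the first local peak, can be sketched as follows. The sketch assumes "a peak exists" means a local peak of the CCC curve at that number of clusters; that interpretation, and the dictionary-based interface, are illustrative assumptions rather than the documented implementation.

```python
def select_n_clusters(ccc_by_k, minimum=2, final_maximum=20, ccc_cutoff=3.0):
    """Sketch of the Automatic selection rule: pick the smallest k that is
    >= Minimum, <= Final Maximum, has CCC above the cutoff, and sits at a
    local peak of the CCC curve; otherwise fall back to the first local peak.
    ccc_by_k maps candidate cluster counts to their CCC statistics."""
    ks = sorted(ccc_by_k)

    def is_peak(k):
        left = ccc_by_k.get(k - 1, float("-inf"))
        right = ccc_by_k.get(k + 1, float("-inf"))
        return ccc_by_k[k] >= left and ccc_by_k[k] >= right

    for k in ks:
        if minimum <= k <= final_maximum and ccc_by_k[k] > ccc_cutoff and is_peak(k):
            return k
    # Fallback mirrors the warning in the node log: first local peak.
    peaks = [k for k in ks if is_peak(k)]
    return peaks[0] if peaks else ks[0]
```

For example, with CCC values {2: 1.0, 3: 4.0, 4: 3.5, 5: 2.0} the rule selects 3 clusters: it is the smallest count whose CCC exceeds the default cutoff of 3 and forms a peak.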

The User-Specified setting enables you to use the Maximum value in the Selection Criterion properties to manually specify an integer value greater than 2 for the maximum number of clusters. Because the Maximum Number of Clusters property is a maximum, it is possible to produce a number of clusters that is smaller than the specified maximum.

To force the number of clusters to some exact number n, specify n for both the Maximum and Minimum values in the Selection Criterion properties. It won't matter whether you choose Automatic or User-Specified as your Specification Method. Either method will produce exactly n clusters.

Maximum Number of Clusters — Use the Maximum Number of Clusters property to specify the maximum number of clusters that you want to use in your analysis. Permissible values are nonnegative integers. The minimum value for the Maximum Number of Clusters property is 2, and the default value is 10.


5.2.4 Cluster Node Train Properties: Selection Criterion

Clustering Method — If you select Automatic as your Specification Method property, Clustering Method specifies how SAS Enterprise Miner calculates clustering distances.

Average — the distance between two clusters is the average distance between pairs of observations, one in each cluster.

Centroid — the distance between two clusters is defined as the (squared) Euclidean distance between their centroids or means.

Ward — (default) the distance between two clusters is the ANOVA sum of squares between the two clusters summed over all the variables. At each generation, the within-cluster sum of squares is minimized over all partitions obtainable by merging two clusters from a previous generation.

Preliminary Maximum — Preliminary Maximum specifies the maximum number of clusters to create during the preliminary training pass. The default setting for the Preliminary Maximum property is 50. Permissible values are integers greater than 2.

Minimum — Minimum specifies the minimum number of clusters that is acceptable for a final solution. The default setting for the Minimum property is 2. Permissible values are integers greater than 2.

Final Maximum — Final Maximum specifies the maximum number of clusters that is acceptable for a final solution. The default setting for the Final Maximum property is 20. Permissible values are integers greater than 2.

CCC Cutoff — If you select Automatic as your Specification Method, the CCC Cutoff specifies the minimum cubic clustering criterion cutoff value. The default setting for the CCC Cutoff property is 3. Permissible values for the CCC Cutoff property are integers greater than 0.


5.2.5 Cluster Node Score Properties

The following score properties are associated with the Cluster node:

Cluster Variable Role — Use the Cluster Variable Role property to specify the model role that you want to be assigned to the cluster variable. By default, the cluster variable (segment identifier) is assigned a role of group. Model roles are Segment, ID, Input, and Target. The model role of Segment is useful for BY-group processing. Note that the segment identifier retains the selected variable role as it is passed to subsequent tools in the process flow diagram. The default setting for the Cluster Variable Role property is Segment.

Hide Original Variables — Use the Hide Original Variables property to specify whether you want to display the original transformed variables in the node output. The default setting for the Boolean Hide Original Variables property is Yes.

Cluster Label Editor — Select the button to open an editor that you use to specify and edit descriptions for the various cluster segments. The node will use these strings to create a new variable called _SEGMENT_LABEL_ that describes the associated value of the _SEGMENT_ variable.


5.2.6 Cluster Node Report Properties

The following report properties are associated with the Cluster node:

Cluster Graphs — When set to No, the Cluster Graphs property suppresses cluster graphical output showing the distribution of variables. The default setting for the Cluster Graphs property is Yes.

Tree Profile — When set to No, the Tree Profile property suppresses the tree profile from the output. The default setting for the Tree Profile property is Yes.

Distance Plot and Table — Use the Distance Plot and Table property to specify whether to generate reports based on the distance between clusters. The default setting for the Distance Plot and Table property is Yes.


5.2.7 Cluster Node Status Properties

The following status properties are associated with this node:

Create Time — displays the time at which the node was created.

Run ID — displays the identifier of the node run. A new identifier is created every time the node runs.

Last Error — displays the error message from the last run.

Last Status — displays the last reported status of the node.

Last Run Time — displays the time at which the node was last run.

Run Duration — displays the length of time of the last node run.

Grid Host — displays the grid server that was used during the node run.

User-Added Node — specifies if the node was created by a user as a SAS Enterprise Miner extension node.


5.2.8 Cluster Node Results Window

You can open the Results window of the Cluster node by right-clicking the node and selecting Results from the pop-up menu.

Select View from the main menu to view the following results in the Results window:

Properties

Settings — displays a window that includes a read-only table of the Cluster node properties configuration when the node was last run.

Run Status — indicates the status of the Cluster node run. The Run Start Time, Run Duration, and information about whether the run completed successfully are displayed in this window.

Variables — the Variables setting in the menu opens a window that contains a table of the variables submitted to the clustering node. The variables table displays the use, report, role, and level for each variable.

Train Code — the code that SAS Enterprise Miner used to train the node.

Notes — allows users to read or create notes of interest.

SAS Results

Log — the SAS log of the cluster run.

Output — the SAS output of the cluster run. For the Cluster node, the SAS output includes a variable summary, Ward's Minimum Variance Cluster Analysis, Eigenvalues of the Covariance Matrix, RMS Total Sample Standard Deviation, RMS distance between observations, a Cluster history, and a Variable Importance table.

Flow Code — the SAS code used to produce the output that the Cluster node passes on to the next node in the process flow diagram.

Scoring

SAS Code — the SAS score code that was created by the node. The SAS score code can be used outside the SAS Enterprise Miner environment in custom user applications.

PMML Code — the PMML code that was generated by the node. The PMML Code menu item is dimmed and unavailable unless PMML is enabled.

Summary Statistics — Three summaries are available: a Mean Statistics table, a Cluster Statistics table, and a Segment Size pie chart.

Mean Statistics — The Mean Statistics table provides a row of cluster information for every cluster segment that was created during your Cluster node run.

The Mean Statistics table contains the following columns for each segment row:

Clustering Criterion — the clustering criterion value. The numerical value is determined by the Clustering Method specified in the Cluster node Properties panel.

Maximum Relative Change in Cluster Seeds — the maximum relative change in the cluster seed values over the clusters from the previous iteration. The relative change in a cluster seed is the distance between the old seed and the new seed divided by the scaling factor.

Improvement in Clustering Criterion — the algorithm underlying the Cluster node attempts to minimize cubic clustering criterion scores. The improvement in clustering criterion refers to a relative change, or decrease, in the cubic clustering criterion values from one iteration to the next.

Segment ID — a numerical ID for a given cluster.

Frequency of Cluster — the sum of frequencies (observations) for nonmissing values in a cluster.

Root-Mean-Square Standard Deviation — the root mean square across variables of the cluster standard deviations, which is equal to the root mean square distance between observations in the cluster.

Maximum Distance from Cluster Seed — the maximum distance from the cluster seed to any observation in the cluster.

Nearest Cluster — the number of the cluster that has the mean closest to the mean of the current cluster.

Distance to Nearest Cluster — the distance between the centroids (means) of the current cluster and the nearest other cluster.

Input Variable Columns — The Mean Statistics table contains a column for each input variable that is used in the clustering calculations. Each variable's column contains the cluster means for each input variable, for each segment number.

Segment Size — The Segment Size pie chart displays clusters by segment number. The width of each pie slice reflects each segment's percentage of all the clustered data. Each segment ID appears as you move your mouse pointer across pie slices of interest. To see the data table that SAS Enterprise Miner uses to create the Segment Size pie chart, click the pie chart to select it, then from the Results window main menu, select View → Table.

Cluster Statistics — The Cluster Statistics table contains summary columns for each input variable as well as columns for segment values and cumulative statistics summed across all input variables. The Cluster Statistics table contains the following rows for each column:

DMDB_FREQ — the sum of frequencies (observations) for nonmissing values of each variable.

DMDB_WEIGHT — the sum of weights for nonmissing values of each variable.

DMDB_MEAN — the mean value statistic that is computed from all nonmissing values of each variable.

DMDB_STD — the standard deviation statistic computed from all nonmissing values of each variable. The Cluster Statistics table also provides a DMDB standard deviation statistic that is summarized over all input variables.

LOCATION — the STDIZE= location measure that incorporates replaced missing values.

SCALE — the STDIZE= scale measure that incorporates replaced missing values.

DMDB_MIN — the minimum of all nonmissing values for each variable.

DMDB_MAX — the maximum of all nonmissing values for each variable.

CRITERION — the clustering criterion value that is used for all variables, as calculated by the Clustering Method property specified in the Cluster node Properties panel.

PSEUDO_F — the Pseudo-F statistic, summarized across all inputvariables[([(R2)/(c-1)])/([(1-R2)/(n-c)])]whereR2istheobservedoverallR2,cisthenumberofclusters,andnisthenumberofobservations.
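As a quick sanity check on the formula, the Pseudo-F statistic can be computed directly. This is a plain Python illustration with invented values for R², c, and n, not SAS Enterprise Miner code:

```python
def pseudo_f(r2: float, c: int, n: int) -> float:
    """Pseudo-F = [R^2 / (c - 1)] / [(1 - R^2) / (n - c)]."""
    return (r2 / (c - 1)) / ((1 - r2) / (n - c))

# Illustrative values: overall R^2 = 0.75, 4 clusters, 200 observations.
print(round(pseudo_f(0.75, 4, 200), 2))  # (0.75/3) / (0.25/196) = 196.0
```

Larger values indicate that the between-cluster variance is large relative to the within-cluster variance for the given number of clusters.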

ERSQ — the approximate expected value of the squared multiple correlation R², which is the proportion of variance accounted for by the clusters under a uniform null hypothesis. If the number of clusters is greater than one-fifth of the number of observations, ERSQ and CCC are given missing values.

CCC — the cubic clustering criterion value, which is useful in determining the number of clusters in the data. CCC values that are greater than 2 or 3 indicate good clusters, values between 0 and 2 indicate potential clusters (but they should be considered with caution), and large negative values can indicate outliers. If the number of clusters is greater than one-fifth the number of observations, CCC and ERSQ are given missing values.

TOTAL_STD — the total standard deviation of each variable. The Cluster Statistics table provides total standard deviation statistics that are summed for all clusters and for individual input variables.

WITHIN_STD — the pooled within-cluster standard deviation of each variable. The Cluster Statistics table provides within-cluster standard deviation statistics that are pooled for all clusters and for individual input variables.

RSQ — the squared multiple correlation R², which is the proportion of variance accounted for by the clusters. The Cluster Statistics table provides the RSQ statistic pooled over all input variables and for each input variable.

RSQ_RATIO — the ratio of between-cluster variance to within-cluster variance, R²/(1 − R²). The Cluster Statistics table provides the RSQ_RATIO statistic that is summarized over all input variables and for each input variable.
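The relationship between RSQ and RSQ_RATIO can be illustrated with a short computation. The sums of squares below are invented for the example:

```python
def rsq(total_ss: float, within_ss: float) -> float:
    """R^2: the proportion of variance accounted for by the clusters."""
    return 1.0 - within_ss / total_ss

def rsq_ratio(r2: float) -> float:
    """Between-to-within variance ratio: R^2 / (1 - R^2)."""
    return r2 / (1.0 - r2)

# Invented sums of squares: total = 100, pooled within-cluster = 25.
r2 = rsq(100.0, 25.0)
print(r2, rsq_ratio(r2))  # 0.75 3.0
```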

SEED — seeds for each cluster and variable.

CLUS_MEAN — the mean of each variable within each cluster.

CLUS_STD — the standard deviations of each variable within each cluster.

CLUS_MIN — the minimum of each variable within each cluster.

CLUS_MAX — the maximum of each variable within each cluster.

CLUS_FREQ — the sum of frequencies for nonmissing values of each variable within each cluster.

CCC Plot — displays a cubic clustering criterion scatter plot that charts the number of clusters against cubic clustering criterion scores.

Input Means Plot — displays a scatter plot that charts the normalized mean values for each input variable. The normalized mean values for each variable are tabulated for each individual cluster as well as across all clusters.

Cluster Profile — view one of the following cluster profiles:

Tree — shows the decision tree that was used to form the individual clusters. The decision tree is based on your clustered data. The cluster variable (Segment ID) is used as the decision tree target variable. You can use the Cluster Profile Tree to identify influential input variables, such as which variables are most effective for grouping observations into clusters.

You can select a node and hide or view a node's children by clicking the minus or plus sign (Figure 5-2).

Figure 5-2

Node Rules — shows in text format the splitting rules that were used to form individual clusters.

Variable Importance — displays a variables table that contains the number of splitting rules, the number of surrogate rules, and the relative importance of each variable.

Segment Plot — displays a segment plot of the discriminating segment (clustering) variables in the data set (Figure 5-3). Discriminating variables are variables that have high importance scores. Not all clustering variables will appear in segment plots. Input cluster variables that have a zero importance score are called nondiscriminating variables. Nondiscriminating variables are not included in segment plots because they do not contribute to cluster formation. Position your mouse pointer over a segment within a plot to see the name of the segment variable, the percentage of the segment variable values that are within the segment that you are viewing, and the formatted values used.

Figure 5-3

Cluster Distance — Cluster distance results (distances between cluster means) are summarized in a table of cluster distances and a cluster distance plot.

Plot — The cluster distance plot provides a graphical representation of the size of each cluster and the relationship among clusters (Figure 5-4). The graph axes are determined from multidimensional scaling analysis, using a matrix of distances between cluster means as input. The asterisks are the cluster centers, and the circles represent the cluster radii. A cluster that contains only one case is displayed as an asterisk. The radius of each cluster depends on the most distant case in that cluster, and cases might not be uniformly distributed within clusters. Therefore, it might appear that clusters overlap, but in fact, each case is assigned to only one cluster. The distance among the clusters is based on the criteria that are specified to construct the clusters.

Figure 5-4

Table — The cluster distance table for n clusters contains n rows and n columns. Each cell in the table displays the Euclidean distance between the centroids of the clusters from the row and column.
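Such a table can be reproduced for hypothetical centroids with a few lines of code. The two-dimensional centroid coordinates below are chosen only for illustration:

```python
import math

def distance_table(centroids):
    """n-by-n table of Euclidean distances between cluster centroids."""
    n = len(centroids)
    return [[math.dist(centroids[i], centroids[j]) for j in range(n)]
            for i in range(n)]

# Three hypothetical cluster centroids in two dimensions.
table = distance_table([(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)])
for row in table:
    print([round(d, 2) for d in row])
```

The diagonal is zero and the table is symmetric, exactly as in the node's cluster distance table.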

Table — Displays a table that contains the underlying data used to produce a chart. The Table menu item is dimmed and unavailable unless a results chart is open and selected.

Plot — Use the Graph Wizard to modify an existing Results plot or create a Results plot of your own. The Graph Wizard menu item is dimmed and unavailable unless a Results chart or table is open and selected.

5.2.9 Example

Consider the following scenario. A baseball manager wants to identify and group players on the team who are very similar with respect to several statistics of interest. Note that there is no response variable in this example. The manager simply wants to identify different groups of players. The manager also wants to learn what differentiates players in one group from players in a different group.

The data set for this example is located in SAMPSIO.DMABASE. The following table contains a description of the important variables.

For this example, you set the model role for TEAM, POSITION, LEAGUE, DIVISION, and SALARY to Rejected. SALARY is rejected because this information is contained in the LOGSALAR variable. No target variables are used in a cluster analysis or self-organizing map (SOM). If you want to identify groups based on a target variable, consider a predictive modeling technique and specify a categorical target variable.

Predictive modeling is often referred to as supervised classification because it attempts to predict group or class membership for a specific categorical response variable. Clustering, on the other hand, is referred to as unsupervised classification because it identifies groups or classes within the data based on all the input variables. These groups, or clusters, are assigned numbers. However, the cluster number cannot be used to evaluate the proximity between clusters.

Self-organizing maps attempt to create clusters and plot the resulting clusters on a map so that cluster proximity can be evaluated graphically. This example does not contain a self-organizing map. However, the SOM/Kohonen node is used to create self-organizing maps.

Building the Process Flow Diagram

This example uses the same diagram workspace that you created in Chapter 2. You have the option to create a new diagram for this example, but instructions to do so are not provided in this example. First, you need to add the SAMPSIO.DMABASE data source to the project.

In the Project Panel, right-click Data Sources and click Create Data Source.

In the Data Source Wizard — Metadata Source window, click Next.

In the Data Source Wizard — Select a SAS Table window, enter SAMPSIO.DMABASE in the Table field. Click Next.

In the Data Source Wizard — Table Information window, click Next.

In the Data Source Wizard — Metadata Advisor Options window, select Advanced. Click Next.

In the Data Source Wizard — Column Metadata window, make the following changes:

For the variable NAME, set the Role to ID.

For the variables TEAM, POSITION, LEAGUE, DIVISION, and SALARY, set the Role to Rejected.

Ensure that all other variables have a Role of Input.

Click Next.

In the Data Source Wizard — Create Sample window, click Next.

In the Data Source Wizard — Data Source Attributes window, click Next.

In the Data Source Wizard — Summary window, click Finish.

This creates the All: Baseball Data data source in your Project Panel. Drag the All: Baseball Data data source to your diagram workspace.

If you explore the data in the All: Baseball Data data source, you will notice that SALARY and LOGSALAR have missing values. Although it is not always necessary to impute missing values, at times the amount of missing data can prevent the Cluster node from obtaining a solution. The Cluster node needs some complete observations in order to generate the initial clusters. When the amount of missing data is too extreme, use the Replacement or Impute node to handle the missing values. This example uses imputation to replace missing values with the median.
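Outside of the Impute node, median imputation is a simple operation. The sketch below is plain Python with a made-up SALARY column, using None to mark missing values; it only illustrates the idea, not the node's implementation:

```python
from statistics import median

def impute_median(column):
    """Replace missing values (None) with the median of the nonmissing values."""
    observed = [v for v in column if v is not None]
    m = median(observed)
    return [m if v is None else v for v in column]

# Hypothetical SALARY column (in thousands) with two missing entries.
salary = [475.0, None, 480.0, 500.0, None, 750.0]
print(impute_median(salary))  # missing entries become 490.0
```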

On the Modify tab, drag an Impute node to your diagram workspace. Connect the All: Baseball Data data source to the Impute node. In the Interval Variables property subgroup, set the value of the Default Input Method property to Median.

On the Explore tab, drag a Cluster node to your diagram workspace. Connect the Impute node to the Cluster node.

By default, the Cluster node uses the Cubic Clustering Criterion (CCC) to approximate the number of clusters. The node first makes a preliminary clustering pass, beginning with the number of clusters that is specified in the Preliminary Maximum value in the Selection Criterion properties. After the preliminary pass completes, the multivariate means of the clusters are used as inputs for a second pass that uses agglomerative, hierarchical algorithms to combine and reduce the number of clusters. Then, the smallest number of clusters that meets all four of the following criteria is selected.

The number of clusters must be greater than or equal to the number that is specified as the Minimum value in the Selection Criterion properties.

The number of clusters must have cubic clustering criterion statistic values that are greater than the CCC Cutoff that is specified in the Selection Criterion properties.

The number of clusters must be less than or equal to the Final Maximum value.

A peak in the number of clusters exists.

Note: If the data to be clustered is such that the four criteria are not met, then the number of clusters is set to the first local peak. In this event, the following warning is displayed in the Cluster node log: WARNING: The number of clusters selected based on the CCC values may not be valid. Please refer to the documentation of the Cubic Clustering Criterion. There are no number of clusters matching the specified minimum and maximum number of clusters. The number of clusters will be set to the first local peak. After the number of clusters is determined, a final pass runs to produce the final clustering for the Automatic setting.
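A simplified reading of this selection logic can be sketched in code. The CCC scores, the cutoff, and the treatment of "a peak exists" as CCC(k) ≥ CCC(k+1) are illustrative assumptions of this sketch, not the node's exact algorithm:

```python
def select_n_clusters(ccc, k_min, k_max, cutoff):
    """Return the smallest candidate k satisfying the four criteria,
    treating 'a peak exists' as CCC(k) >= CCC(k+1); None if no k qualifies."""
    for k in sorted(ccc):
        if not (k_min <= k <= k_max):
            continue
        if ccc[k] <= cutoff:
            continue
        if ccc[k] >= ccc.get(k + 1, float("-inf")):  # CCC peaks at k
            return k
    return None

# Hypothetical CCC scores by number of clusters from the preliminary pass.
ccc = {2: 1.5, 3: 3.2, 4: 4.1, 5: 3.0, 6: 2.8}
print(select_n_clusters(ccc, k_min=2, k_max=6, cutoff=3.0))  # 4
```

Here k = 3 clears the cutoff but is not a peak (CCC still rises at k = 4), so the smallest qualifying value is 4.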

You are now ready to cluster the input data.

Using the Cluster Node

Right-click the Cluster node and click Run. In the Confirmation window, click Yes. Click Results in the Run Status window (Figure 5-5).

Figure 5-5

Notice in the Segment Size window that the Cluster node created 4 clusters. On the main menu, select View → Cluster Profile → Variable Importance. The Variable Importance window displays each variable that was used to generate the clusters and their relative importance. Notice that NO_ASSTS, NO_ERROR, and NO_OUTS have an Importance of 0. These variables were not used by the Cluster node when the final clusters were created. On the main menu, select View → Summary Statistics → Input Means Plot. See Figure 5-6.

Figure 5-6

This plot displays the normalized mean value for each variable, both inside each cluster and for the complete data set. Notice that the in-cluster mean for cluster 1 is always less than the overall mean. But, in cluster 4, the in-cluster mean is almost always greater than the overall mean. Clusters 2 and 3 each contain some in-cluster means below the overall mean and some in-cluster means above the overall mean.

From the Input Means Plot, you can infer that the players in cluster 1 are younger players that are earning a below-average salary. Their 1986 and career statistics are also below average. Conversely, the players in cluster 4 are veteran players with above-average 1986 and career statistics. These players receive an above-average salary. The players in clusters 2 and 3 excel in some areas, but perform poorly in others. Their average salaries are slightly above average.
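The in-cluster versus overall comparison behind the plot can be sketched as follows. The min-max normalization and the data values are assumptions of this sketch; the node's exact normalization may differ:

```python
def normalized_cluster_means(clusters):
    """Scale each cluster mean of one variable to [0, 1] by min-max over all
    values; the overall mean is normalized the same way for comparison."""
    values = [v for cluster in clusters for v in cluster]
    lo, hi = min(values), max(values)

    def norm(x):
        return (x - lo) / (hi - lo)

    per_cluster = [norm(sum(c) / len(c)) for c in clusters]
    return per_cluster, norm(sum(values) / len(values))

# Hypothetical NO_HITS values grouped into three clusters.
per_cluster, overall = normalized_cluster_means(
    [[40, 60, 50], [120, 140], [150, 170, 160]])
print([round(m, 2) for m in per_cluster], round(overall, 2))
```

The first cluster's normalized mean falls below the overall value and the last cluster's falls above it, which is the pattern the Input Means Plot makes visible across all variables at once.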

Close the Results window.

Examining the Clusters

Select the Cluster node in your process flow diagram. Click the ellipsis button next to the Exported Data property. The Exported Data — Cluster window appears. Click TRAIN and click Explore. The Clus_TRAIN window contains the entire input data set and three additional columns that are appended to the end. Scroll to the right end of the window to locate the Segment ID, Distance, and Segment Description columns. The Segment ID and Segment Description columns display what cluster each observation belongs to (Figure 5-7).

Figure 5-7

Next, create a plot that compares the number of hits for each cluster.

On the main menu, select Actions → Plot.

In the Select a Chart Type window, click Box. Click Next.

In the Select Chart Roles window, find the _SEGMENT_LABEL_ variable. Set the Role for _SEGMENT_LABEL_ to X. Next, find the NO_HITS variable. Set the Role for NO_HITS to Y. Click Next.

In the Data WHERE Clause window, click Next.

In the Chart Titles window, enter Number of Hits in Each Cluster in the Title field. Leave the Footnote field blank. Enter Segment for the X Axis Label. Enter Hits for the Y Axis Label. Click Next. In the Chart Legends window, click Finish. See Figure 5-8.

Figure 5-8

Notice that the average number of hits in clusters 1 and 3 is significantly lower than the average number of hits in clusters 2 and 4. You should repeat these steps to create box plots for several other variables.

5.3 SOM/KOHONEN NODE

The SOM/Kohonen node belongs to the Explore category of the SAS SEMMA (Sample, Explore, Modify, Model, Assess) data mining process. You use the SOM/Kohonen node to perform unsupervised learning by using Kohonen vector quantization (VQ), Kohonen self-organizing maps (SOMs), or batch SOMs with Nadaraya-Watson or local-linear smoothing. Kohonen VQ is a clustering method, whereas SOMs are primarily dimension-reduction methods. For cluster analysis, the Clustering node is recommended instead of Kohonen VQ or SOMs.

5.3.1 SOM/Kohonen Node Train Properties

The following train properties are associated with the SOM/Kohonen node:

Variables — Use the Variables property to specify the properties of the variables that you want to use from the data source. Select the button to open a variables table.

Method — Use the Method property to specify one of the following network methods.

Batch SOM — For the Batch SOM method, larger maps are usually better, as long as most clusters have at least five or ten cases. However, the larger the map, the longer it will take to train. You may specify the number of rows and columns in the topological map using the Row and Column properties, and you specify the final neighborhood size using the Final Size property in the Neighborhood Options table. The final neighborhood size should usually be set in proportion to the map size. For example, if you double the number of rows and columns in the map, you should double the final neighborhood size. Choosing the map size and final neighborhood size generally requires trial and error. Batch SOM is the default setting for the Method property.

Kohonen SOM — For the Kohonen SOM method, larger maps are usually better, as long as most clusters have at least five or ten cases. However, the larger the map, the longer it will take to train. You may specify the number of rows and columns in the topological map using the Row and Column properties, the final neighborhood size using the Final Size property in the Neighborhood Options table, and the learning rate parameters by using the Learning Rate, Initial Rate, Final Rate, and Number of Steps properties in the Kohonen Options table. The final neighborhood size should usually be set in proportion to the map size. For example, if you double the number of rows and columns in the map, you should double the final neighborhood size. Choosing the map size and final neighborhood size generally requires trial and error. The learning rate changes during training. If the initial seeds are random, then it is important to start with a high learning rate, such as 0.9. If the initial seeds are obtained by using principal components or some preliminary analysis, then the initial learning rate should be much lower.

Kohonen VQ — For the Kohonen VQ method, you may specify the number of clusters by using the Maximum Number of Clusters property. Choosing the number of clusters generally requires trial and error. You may specify the learning rate by using the Learning Rate property in the Kohonen Options table. The learning rate changes during training. If the initial seeds are random, then it is important to start with a high learning rate, such as 0.5. If the initial seeds are obtained through some preliminary analysis, then the initial learning rate should be much lower.
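The Kohonen VQ update that this learning rate controls can be sketched in a few lines. This is a minimal illustration of the winner-take-all update (move the nearest seed toward the observation by a fraction of the distance), not the node's implementation:

```python
import math

def kohonen_vq_step(seeds, x, rate):
    """One Kohonen VQ step: find the seed nearest to observation x and
    move it toward x by a fraction `rate` of the distance. Returns the
    index of the winning seed; `seeds` is updated in place."""
    i = min(range(len(seeds)), key=lambda j: math.dist(seeds[j], x))
    seeds[i] = [s + rate * (xk - s) for s, xk in zip(seeds[i], x)]
    return i

seeds = [[0.0, 0.0], [10.0, 10.0]]
winner = kohonen_vq_step(seeds, (2.0, 0.0), rate=0.5)
print(winner, seeds[0])  # seed 0 wins and moves halfway to (2, 0) -> [1.0, 0.0]
```

With a high rate the seeds move aggressively toward early observations, which is why random initial seeds call for a high starting rate that then decays.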

Internal Standardization — Use this property to specify one of the following internal standardization methods.

None — specifies that the variables are not standardized prior to clustering.

Range — specifies that the minimum value be subtracted from the variable values and then that the variable values be divided by the range.

Standardization — specifies that the mean be subtracted from the variable values and then that the variable values be divided by (standard deviation) * SQRT(dimension). Dimension refers to the number of unique categories. For nominal variables, the default dimension is C, the number of categories. The default dimension for ordinal variables is 1. This is the default setting for the Internal Standardization property.
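The Range and Standardization methods can be sketched for a single numeric input. The use of the population standard deviation and dimension = 1 (the ordinal default in the text) are assumptions of this sketch:

```python
from math import sqrt
from statistics import mean, pstdev

def standardize_range(values):
    """Range method: subtract the minimum, then divide by the range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize_std(values, dimension=1):
    """Standardization method: subtract the mean, then divide by
    (standard deviation) * sqrt(dimension). dimension=1 mirrors the
    ordinal default; population std dev is an assumption here."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / (s * sqrt(dimension)) for v in values]

x = [2.0, 4.0, 6.0, 8.0]
print(standardize_range(x))  # maps onto [0, 1]
print(standardize_std(x))    # centered at 0
```

Range standardization bounds every variable to the same [0, 1] interval, while the standardization method centers each variable and equalizes its spread; both prevent large-scale variables from dominating the distance calculations.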

5.3.2 SOM/Kohonen Node Train Properties: Segment

Finding a good map size generally requires trial and error. If the map size is too small, then it will not accurately reflect nonlinearity in the data. If the map size is too large, the analysis will take a lot of computer time and memory, and empty clusters might produce misleading results. If you change the map size, then you might need to change the neighborhood size in proportion to the map size, because the amount of smoothing depends on the ratio of the neighborhood size to the map size.

Row — Use the Row property to specify the number of rows in the topological map. The default value is 10. A value of 1 creates a one-dimensional map.

Column — Use the Column property to specify the number of columns in the topological map. The default value is 10. A value of 1 creates a one-dimensional map.

5.3.3 SOM/Kohonen Node Train Properties: Seed Options

Initial Method — Use the Initial Method property to specify the seed initialization method.

The following initialization methods are available:

Default — uses the Princomp method to initialize seeds if the network method is set to Batch SOM, and the Outlier method if the network method is set to Kohonen SOM or Kohonen VQ.

Outlier — selects initial seeds that are well separated using the full replacement algorithm.

First — selects the first complete cases as initial seeds. The next complete case that is separated from the first seed by at least the specified radius becomes the second seed. Subsequent cases are selected as new seeds if, and only if, they are separated from all previous seeds by at least the radius, and if the maximum number of seeds is not exceeded.

Separate — selects initial seeds that are well separated using a partial replacement algorithm. This method selects the first complete case as the first seed. The next complete case that is separated from the first seed by at least the specified radius becomes the second seed. Subsequent cases are selected as new seeds if they are separated from all previous seeds by at least the radius, as long as the maximum number of seeds is not exceeded. If a case is complete but fails to qualify as a new seed, this case is then considered to replace one of the old seeds. An old seed is replaced if the distance between the case and the closest seed is greater than the minimum distance between seeds. The seed that is replaced is selected from the two seeds that are closest to each other. The seed that is replaced is the one of these two seeds that has the shortest distance to the closest of the remaining seeds when the other seed is replaced by the current observation.

Principal Components — initializes seeds on an evenly spaced grid in the plane of the first two principal components. If the number of rows is less than or equal to the number of columns, then the first principal component is placed to vary with the column number, and the second principal component is placed to vary with the row number. If the number of rows is greater than the number of columns, then the first principal component is placed to vary with the row number, and the second principal component is placed to vary with the column number.

Radius — Use the Radius property to specify the minimum distance between cluster seeds. The Radius property accepts real numbers. The default setting is 0.0.

5.3.4 SOM/Kohonen Node Train Properties: Batch SOM Training

When the Method property is set to Batch SOM or Kohonen SOM, use the following Batch SOM Training properties to specify advanced batch training options.

Use Defaults — Use this property to specify whether to use the default batch SOM training options to run the node. The default batch SOM training options depend on the type of network that you specified in the Method property.

If the Method property is set to Batch SOM, SOM training is always performed, and both the local-linear and Nadaraya-Watson smoothing methods are applied. If the Method property is set to Kohonen SOM, SOM training is not performed, and neither the local-linear nor Nadaraya-Watson smoothing method is applied.

Local-Linear Smoothing — Use the Local-Linear Smoothing property to specify whether to perform local-linear training. If the property is set to Yes, you can use the Local-Linear Options properties to modify the smoothing options.

Nadaraya-Watson Smoothing — Use the Nadaraya-Watson Smoothing property to specify whether to perform Nadaraya-Watson training. If the property is set to Yes, you can use the Nadaraya-Watson Options properties to modify the smoothing options.

5.3.5 SOM/Kohonen Node Train Properties: Local-Linear Options

Convergence Criterion — Use the Convergence Criterion property to specify the convergence criterion for the local-linear smoothing method. The Convergence Criterion property is a real number greater than or equal to 0, and the default value is 1.0E-4 (0.0001).

Max Iterations — Use the Max Iterations property to specify the maximum number of iterations for the local-linear smoothing method. The Max Iterations property is an integer greater than or equal to 1, and the default value is 10.

The Use Defaults property must be set to No before you can change either of the properties in the Local-Linear Options.

5.3.6 SOM/Kohonen Node Train Properties: Nadaraya-Watson Options

Convergence Criterion — Use the Convergence Criterion property to specify the convergence criterion for the Nadaraya-Watson smoothing method. The Convergence Criterion property is a real number greater than or equal to 0, and the default value is 1.0E-4 (0.0001).

Max Iterations — Use the Max Iterations property to specify the maximum number of iterations for the Nadaraya-Watson smoothing method. The Max Iterations property is an integer greater than or equal to 1, and the default value is 10.

The Use Defaults property must be set to No before you can change either of the properties in the Nadaraya-Watson Options.

5.3.7 SOM/Kohonen Node Train Properties: Kohonen VQ

Maximum Number of Clusters — If the Method property is set to Kohonen VQ, use the Maximum Number of Clusters property to specify the number of clusters that are created when you run the node. The Maximum Number of Clusters is an integer greater than or equal to 1. The default value is 10.

5.3.8 SOM/Kohonen Node Train Properties: Kohonen

Batch Training — Use the Batch Training property to specify whether to perform batch training when the Method property is set to Kohonen SOM.

Use Defaults — Use this property to specify whether to use the default Kohonen training settings.

Kohonen Options — Select the button to open a properties table that allows you to customize the Kohonen Options settings. The Method property must be set to Kohonen SOM or Kohonen VQ and the Use Defaults property must be set to No before you can open the Kohonen Options properties table.

Learning Rate — Use the Learning Rate property to specify the learning rate for Kohonen training. The default value is 0.9.

Initial Rate — Use the Initial Rate property to specify the initial learning rate for Kohonen training. The default value is 0.9. Valid values are real numbers between 0 and 1.

Final Rate — Use the Final Rate property to specify the final learning rate for Kohonen training. The default value is 0.2. Valid values are real numbers between 0 and 1.

Number of Steps — Use the Number of Steps property to specify the step number at which the learning rate reaches the final learning rate. The default value is 1,000.

Convergence Criterion — Use the Convergence Criterion property to specify the value of the convergence criterion. The default value is 0.0001.

Max Iterations — Use the Max Iterations property to specify the maximum number of training iterations that you want to perform. An iteration is the processing that is performed on the entire data set. The default value is 100.

Max Steps — Use the Max Steps property to specify the maximum number of steps to perform during Kohonen training. A step is the processing that is performed on a single observation. The default value is 1,200.
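A schedule that starts at the Initial Rate and reaches the Final Rate at the given step number can be sketched as follows. The linear interpolation shape is an assumption of this sketch; the text specifies only the endpoints and the step count:

```python
def learning_rate(step, initial=0.9, final=0.2, n_steps=1000):
    """Interpolate linearly from the initial to the final learning rate;
    after n_steps the rate stays at the final value. (Linear decay is an
    illustrative assumption; defaults mirror the property defaults.)"""
    if step >= n_steps:
        return final
    return initial + (final - initial) * step / n_steps

print(learning_rate(0), learning_rate(500), learning_rate(1000))
```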

5.3.9 SOM/Kohonen Node Train Properties: Neighborhood Options

If you selected Batch SOM or Kohonen SOM as the Method property, you can use the following properties to specify Neighborhood Options:

Use Defaults — Use this property to specify whether to use the default values of the following neighborhood options.

Neighborhood Options — Select the button to open a properties table that allows you to customize the Neighborhood Options settings. Note that the Method property must be set to Batch SOM or Kohonen SOM and the Use Defaults property must be set to No before you can open the Neighborhood Options properties table.

The following properties are available in the Neighborhood Options properties table:

Size — Use the Size property to specify the neighborhood size. The neighborhood size must be greater than or equal to 0. The default value is Max(5, Max(Rows, Columns)/2).

Kernel Shape — Valid values are 0 (uniform), 1 (Epanechnikov), 2 (biweight), and 3 (triweight). The default value is 1.

Kernel Metric — Valid values are 0 (max), 1 (city block), and 2 (Euclidean). The default value is 0.

Initial Size — Use the Initial Size property to specify the initial size for the neighborhood. The default value is 5.

Final Size — Use the Final Size property to specify the final size for the neighborhood. The default value is 0.

Number to Reset — Use the Number to Reset property to specify the number of steps after which the neighborhood size and kernel are reset. The default value is 100.

Number of Steps — Use the Number of Steps property to specify the step number at which the neighborhood size reaches the value of the Final Size property for Kohonen SOM training. The default value is 1,000.

Iteration Number — Use the Iteration Number property to specify the iteration number at which the neighborhood size reaches the value of the Final Size property for Batch SOM training. The default value is 3.
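The default size, the kernel metrics, and the kernel shapes above can be illustrated with small helper functions. The exact kernel formulas are assumptions of this sketch (the standard uniform, Epanechnikov, biweight, and triweight shapes), keyed to the 0-3 and 0-2 codes in the text:

```python
def default_size(rows, cols):
    """Default neighborhood size: Max(5, Max(Rows, Columns)/2)."""
    return max(5, max(rows, cols) / 2)

def grid_distance(a, b, metric=0):
    """Distance between two map positions: 0 = max (Chebyshev),
    1 = city block, 2 = Euclidean."""
    dr, dc = abs(a[0] - b[0]), abs(a[1] - b[1])
    if metric == 0:
        return max(dr, dc)
    if metric == 1:
        return dr + dc
    return (dr ** 2 + dc ** 2) ** 0.5

def kernel_weight(d, size, shape=1):
    """Neighborhood weight at distance d: 0 = uniform, 1 = Epanechnikov,
    2 = biweight, 3 = triweight; zero outside the neighborhood.
    Standard kernel shapes assumed for illustration."""
    if d > size:
        return 0.0
    u = 1.0 - (d / size) ** 2
    return 1.0 if shape == 0 else u ** shape

print(default_size(12, 10))                                # 6.0
print(kernel_weight(grid_distance((0, 0), (2, 1)), size=5))
```

Units closer to the winning unit on the map receive weights nearer 1, and the weight falls to zero at the edge of the neighborhood; shrinking the size over training therefore localizes the updates.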

5.3.10 SOM/Kohonen Node Results

Open the Results window by right-clicking the node and selecting Results from the pop-up menu.

Select View from the main menu to view the following results in the Results window:

Properties

Settings — displays a window with a read-only table of the SOM/Kohonen node properties configuration when the node was last run.

Run Status — displays the status of the SOM/Kohonen node run. Information about the run start time, run duration, run ID, and completion status are displayed in this window.

Variables — displays a table of the variables submitted to the SOM/Kohonen node. The variables table displays the Name, Use, Report, Role, and Level settings for each variable.

Train Code — the code that Enterprise Miner used to train the node.

Notes — displays (in read-only mode) any notes that were previously entered in the General Properties — Notes window.

SAS Results

Log — the SAS log of the SOM/Kohonen node run.

Output — the SAS output of the SOM/Kohonen node run. The SOM/Kohonen node SAS output contains a variable summary.

Flow Code — the SAS code used to produce the output that the SOM/Kohonen node passes on to the next node in the process flow diagram.

Train Graphs — is dimmed and unavailable in the SOM/Kohonen node.

Report Graphs — is dimmed and unavailable in the SOM/Kohonen node.

Scoring

SAS Code — the SAS score code that was created by the SOM/Kohonen node. The SOM/Kohonen node SAS code can be edited and used in external applications outside of the Enterprise Miner environment.

PMML Code — the SOM/Kohonen node does not generate PMML code.

Model

Mean Statistics — The Mean Statistics table displays the following statistics for each cluster:

Clustering Criterion

Maximum Relative Change in Cluster Seeds

Improvement in Clustering Criterion

SOM Segment ID

Frequency of Cluster — the number of observations in each cluster.

Root-Mean-Square Standard Deviation — the root mean square across variables of the cluster standard deviations, which is equal to the root mean square distance between observations in the cluster.

Maximum Distance from Cluster Seed — the maximum distance from the cluster seed to any observation in the cluster.

Nearest Cluster — the number of the cluster that has a mean closest to the mean of the current cluster.

Distance to Nearest Cluster — the distance between the centroids (means) of the current cluster and the nearest other cluster.

SOM Dimension 1

SOM Dimension 2

SOM ID

Mean of each input variable.

Map — if you set the Method property to Batch SOM or Kohonen SOM, a topological mapping of the input space to the clusters is produced. If you set the Method property to Kohonen VQ, a graphical representation of key characteristics of the segments that are generated by the node is produced (Figure 5-9).

Figure 5-9

To examine a cluster in the map, place your mouse pointer over a cluster in the topological map. A text box will display the SOM dimensions and Frequency of Cluster (the number of observations in a cluster).

When you select a cluster, the corresponding row in the Mean Statistics table is highlighted and displays statistics for that cluster. To select multiple clusters, hold your right mouse button down and drag to create a rectangle over the clusters in the map.

By default, the SOM/Kohonen node uses the cluster frequency to color the cluster. Use the Select Chart combo box to select a statistic to color the clusters.

Segment Plot — if you set the Method property to Kohonen VQ, the Segment Plot provides a graphical representation of key characteristics of the segments that are generated from the Kohonen VQ method (Figure 5-10).

Figure 5-10

To examine a segment in the map, place your pointer over a slice of the pie. A text box will display the segment ID.

When you select a slice, the corresponding row in the Mean Statistics table is highlighted and displays statistics for that cluster. To select multiple clusters, press CTRL and click the clusters that you want.

By default, the size of each slice in the two-dimensional pie chart is proportional to the cluster frequency. To use a different variable to create the pie chart, right-click in the map and select Data Options, and then assign the Response role to that variable.

Analysis Statistics — The Analysis Statistics table displays the following statistics:

Type of Observation

SOM Segment ID

SOM Dimension 1

SOM Dimension 2

SOM ID

Statistic Applying over All Variables

Statistics for each input variable

Table — displays a table of the data that is associated with the graph that you have open. Data table column headings display variable labels by default. To display the variable names used in SAS code as column headings, right-click in any table cell and select Show Variable Names. To view labels in the column headings again, right-click in any table cell and select Show Variable Labels.

Plot — enables you to create or modify a graph of the data in the table that you have open.

5.4 VARIABLE CLUSTERING NODE

The Variable Clustering node is on the Explore tab of the Enterprise Miner tools bar. Variable clustering is a useful tool for data reduction, such as choosing the best variables or cluster components for analysis. Variable clustering removes collinearity, decreases variable redundancy, and helps reveal the underlying structure of the input variables in a data set.

Supposeadirectmailcompanywantstobuildmodelsusingadatabasethathashundredsor thousandsofvariables foreachcustomer.Shouldapredictiveorsegmentationmodelattempt touseallof theavailablevariables?Largenumbersofvariablescancomplicatethe task of determining the relationships that might exist between the independentvariablesandthetargetvariableinamodel.Modelsthatarebuiltwithtoomanyredundantvariables can destabilize parameter estimates, confound variable interpretation, andincrease the computing time that is required to run the model. Variable clustering canreduce the number of variables that are required to build reliable predictive orsegmentationmodels.

Variable clustering divides numeric variables into disjoint or hierarchical clusters. Theresultingclusterscanbedescribedasalinearcombinationofthevariablesinthecluster.Thelinearcombinationofvariablesisthefirstprincipalcomponentofthecluster.Thefirstprincipal components are called the cluster components. Cluster components providescores for each cluster. The first principal component is a weighted average of thevariablesthatexplainsasmuchvarianceaspossible.ThealgorithmforVariableClusteringseeksthemaximumvariancethatisexplainedbytheclustercomponents,summedoveralltheclusters.

The Variable Clustering node cluster components are oblique, and not orthogonal, even when the cluster components are first principal components. In an ordinary principal component analysis, all components are computed from the same variables. The first principal component is orthogonal to the second principal component and to each other principal component. With the Variable Clustering node, each cluster component is computed from a different set of variables than all the other cluster components. The first principal component of one cluster might be correlated with the first principal component of another cluster. The Variable Clustering node performs a type of oblique component analysis.

As in principal component analysis, either the correlation or the covariance matrix can be analyzed. If correlations are used, all variables are treated as equally important. If covariances are used, variables with larger variances have more importance in the analysis.

When properly used as a variable-reduction tool, the Variable Clustering node can replace a large set of variables with the set of cluster components with little loss of information. A given number of cluster components does not generally explain as much variance as the same number of principal components on the full set of variables. However, the cluster components are usually easier to interpret than the principal components, even if the latter are rotated.

For example, an educational test might contain fifty items. The Variable Clustering node can be used to divide the items into, say, five clusters. Each cluster can then be treated as a subtest, with the subtest scores given by the cluster components.


5.4.1 Variable Clustering Node Algorithm

The algorithm behind the Variable Clustering node is both divisive and iterative. By default, the Variable Clustering node begins with all variables in a single cluster.

It then repeats the following steps:

1. A cluster is chosen for splitting. Depending on the options specified, the selected cluster has either the smallest percentage of variation explained by its cluster component (using the Variation Proportion property) or the largest eigenvalue that is associated with the second principal component (using the Maximum Eigenvalue property).

2. The chosen cluster is split into two clusters by finding the first two principal components, performing an orthoblique rotation (raw quartimax rotation on the eigenvectors; Harris and Kaiser, 1964), and assigning each variable to the rotated component with which it has the higher squared correlation.

3. Variables are iteratively reassigned to clusters to try to maximize the variance accounted for by the cluster components. You can require the reassignment algorithms to maintain a hierarchical structure for the clusters (using the Keep Hierarchies property).

The Variable Clustering node algorithm stops splitting when either:

the maximum number of clusters as specified by the Maximum Clusters property is reached, or

each cluster satisfies the stopping criteria specified in the Variation Proportion property and/or the Maximum Eigenvalue property.

Using default settings, the Variable Clustering node stops splitting when each cluster has only one eigenvalue greater than 1, thus satisfying the most popular criterion for determining the sufficiency of a single underlying dimension.
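The divisive loop above can be sketched structurally in Python. This is only a skeleton of the control flow, not the node's actual implementation: `second_eigenvalue` and `split` are stubs with made-up behavior standing in for the PCA and orthoblique-rotation steps.

```python
MAX_EIGENVALUE = 1.0   # default stopping rule: second eigenvalue <= 1 everywhere
MAX_CLUSTERS = 10      # hypothetical Maximum Clusters setting

def second_eigenvalue(cluster):
    # Stub: pretend larger clusters still hide a second dimension.
    return 1.5 if len(cluster) > 2 else 0.6

def split(cluster):
    # Stub for the PCA + orthoblique-rotation split: just halve the variable list.
    mid = len(cluster) // 2
    return cluster[:mid], cluster[mid:]

def divisive_clustering(variables):
    clusters = [list(variables)]              # start with everything in one cluster
    while len(clusters) < MAX_CLUSTERS:
        # Step 1: choose the cluster with the largest second eigenvalue.
        worst = max(clusters, key=second_eigenvalue)
        if second_eigenvalue(worst) <= MAX_EIGENVALUE:
            break                             # every cluster is one-dimensional enough
        # Step 2: split it into two clusters.
        clusters.remove(worst)
        clusters.extend(split(worst))
        # Step 3 (iterative reassignment between clusters) omitted in this sketch.
    return clusters

print(divisive_clustering(["v%d" % i for i in range(8)]))
```

With the stubbed criterion, eight variables end up in four two-variable clusters, mirroring how splitting stops once no cluster's second eigenvalue exceeds the threshold.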

The iterative reassignment of variables to clusters proceeds in two steps. The first step is a nearest component sorting phase, similar in principle to the nearest centroid sorting algorithms described by Anderberg (1973). In each iteration, the cluster components are computed, and each variable is assigned to the component with which it has the highest squared correlation. The second phase involves a search algorithm in which each variable is tested to see whether assigning it to a different cluster increases the amount of variance explained. If a variable is reassigned during the search phase, the components of the two clusters involved are recomputed before the next variable is tested. The nearest component sorting step is much faster than the search step but is more likely to be trapped by a local optimum.

Using principal components, the nearest component sorting step is an alternating least squares method and converges rapidly. The search step can be very time consuming for a large number of variables. But if the default initialization method is used, the search step is rarely able to substantially improve the results of the nearest component sorting step. In this case, the search takes fewer iterations. If random initialization is used, the nearest component sorting step might be trapped by a local optimum from which the search phase can escape.

You can use the Variable Clustering node to perform hierarchical clustering by using the Keep Hierarchies property. The Keep Hierarchies property restricts the reassignment of variables so that split clusters maintain a tree structure. When a cluster is split during hierarchical clustering, a variable in one of the two newly formed clusters can be reassigned to the other new cluster, but not to any other cluster.


5.4.2 Variable Clustering Node and Missing Data Set Values

If an observation contains missing values, the Variable Clustering node excludes the observation from the analysis. If the input data set that you want to perform variable clustering on contains a significant number of observations with missing values, it might be beneficial to use the Replacement or Impute nodes to replace or impute missing variable values before submitting the data set to the Variable Clustering node.


5.4.3 Variable Clustering Node and Large Data Sets

The Variable Clustering node is most computationally effective when used with data sets that have fewer than 100 variables and fewer than 100,000 rows. Running the standard variable clustering algorithms with larger data sets results in noticeable processor performance decreases. When you need to perform variable clustering on a data set that has more than 100 variables or more than 100,000 rows, you should make changes to your process flow diagram configuration to improve the computational performance.

If your data set contains more than 100,000 observations, you should use the Sampling node to obtain a sample of the data set for variable clustering. You can select from several sampling strategies that reduce the number of observations in your process flow diagram to fewer than 100,000.

If your data set contains more than 100 variables, use the node's two-stage variable clustering algorithm.


5.4.4 Variable Clustering Node and Variable Roles

The Variable Clustering node is designed to cluster numeric variables. You can use the Include Class Variables property to analyze class variables through the use of dummy variables, but care should be used when including class variables in the analysis. Variable Clustering node performance can suffer when class variables that have a large number of levels are analyzed. This happens because the Variable Clustering node creates a dummy variable for each measurement level that it detects. Nominal class variables can be troublesome because every non-unique combination of characters or text would become a new dummy variable.

If your Variable Clustering node creates an unusually complex cluster map, you might want to inspect the measurement levels for the variables in your input data set. When you select the input data set node in your process flow diagram, you can use the Variables property to view your input data set variable measurement levels. The Variable Clustering node might provide better results if you reclassify some of the nominal variables in the table into interval variables.


5.4.5 Using the Variable Clustering Node

The default configuration of the Variable Clustering node often provides satisfactory results. If you want to change the final number of clusters, you can modify the settings for the Maximum Clusters, Maximum Eigenvalue, or Variation Proportion properties. The Maximum Eigenvalue and Variation Proportion criteria usually produce similar results, but occasionally can cause different clusters to be selected for splitting. The Maximum Eigenvalue criterion tends to choose clusters that have a large number of variables. The Variation Proportion criterion is more likely to select clusters that have a small number of variables.

The Variable Clustering node usually requires more computer processing than a comparable principal component analysis, but it can be faster than some of the iterative factoring methods.

If you have more than 30 variables, you might want to reduce your Variable Clustering node processing time by using one or more of the following methods:

Specify a numeric value for the Maximum Clusters property if you know how many clusters you want.

Set the Keep Hierarchies property to Yes.

Set the Two Stage Clustering property to Yes.


5.4.6 Variable Clustering Node Train Properties

The following train properties are associated with the Variable Clustering node:

Variables — Use the Variables table to specify the status for individual variables that are imported into the Variable Clustering node. Select the button to open a window containing the variables table. You can set the variable status to either Use or Don't Use in the table, view the columns metadata, or open an Explore window to view a variable's sampling information, observation values, or a plot of the variable distribution.

Clustering Source — Use the Clustering Source property to specify the source matrix that you want to use for variable clustering analysis. The default setting for the Clustering Source property is Correlation.

Correlation — Setting the Clustering Source property to Correlation uses the correlation matrix of standardized variables. When the Cluster Component property is set to Principal Components, the source matrix provides eigenvalues.

Covariance — Setting the Clustering Source property to Covariance uses the covariance matrix of raw variables. The Covariance property setting permits variables that have a large variance to have more effect on the cluster components than variables that have a small variance. When the Cluster Component property is set to Principal Components, the source matrix provides eigenvectors.

Keep Hierarchies — Use the Keep Hierarchies property to specify whether clusters should be maintained at different levels in order to create a hierarchical cluster structure. The default setting for the Keep Hierarchies property is Yes.

Include Class Variables — The Variable Clustering node is designed to create clusters from numerical variables. The Include Class Variables property provides a way to include class variables in the variable clustering. The default setting for the Include Class Variables property is No. When the Include Class Variables property is set to Yes, class variables are converted to dummy variables and are treated as interval inputs. Users should pay close attention to the behavior of clustered dummy variables because dummy variables from one class variable can be sorted into different clusters. When dummy variables are created, the dummy variables are passed on to the node that follows the Variable Clustering node. When the Include Class Variables property is set to No, all class variables are passed on to the successor nodes without any processing.

Two Stage Clustering — Use the Two Stage Clustering property to specify whether to perform two-stage clustering. Two-stage clustering is normally used when submitting data sets with more than 100 variables or 100,000 observations to the Variable Clustering node. The default setting for the Two Stage Clustering property is Auto.


5.4.7 Variable Clustering Node Train Properties: Stopping Criteria Properties

Maximum Clusters — Use the Maximum Clusters property to specify the largest number of clusters desired. If you do not specify a value for the Maximum Clusters property, the default setting counts the number of variables in the input data set and uses that value. The Variable Clustering node stops splitting clusters after reaching the Maximum Clusters value, regardless of what other splitting options are specified. Valid values are integers greater than or equal to 1.

Maximum Eigenvalue — Use the Maximum Eigenvalue property to specify the largest permissible threshold for the second eigenvalue of each cluster. When the second eigenvalue of a cluster exceeds the Maximum Eigenvalue setting, the Variable Clustering node stops splitting the cluster. If you do not specify a value for the Maximum Eigenvalue property, the default setting counts the number of variables in the input data set and uses that value. Valid values are real numbers greater than or equal to 1.

Variation Proportion — Use the Variation Proportion property to specify the upper threshold for the proportion of variation criterion. Using this criterion, the Variable Clustering node splits the cluster that has the smallest proportion of variation explained, as long as the proportion of variation value is less than or equal to the value specified for the Variation Proportion property. The default value for the Variation Proportion property is 0. The Variation Proportion property accepts real numbers between 0 and 1.0.

If you specify values for both the Variation Proportion property and the Maximum Eigenvalue property, the Variable Clustering node searches first for a cluster to split using the Maximum Eigenvalue criterion. If no cluster meets the Maximum Eigenvalue criterion, the Variable Clustering node next looks for a cluster to split based on the Variation Proportion criterion. If you specify values for both the Variation Proportion property and the Maximum Clusters property, the resulting number of clusters will always be less than or equal to the Maximum Clusters property value.
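The precedence between the two criteria can be sketched as a small selection function. This is an illustrative Python sketch under stated assumptions: each cluster is summarized as a dict of the two statistics the node would compute for it, and the statistic values below are hypothetical.

```python
def choose_cluster_to_split(clusters, max_eigenvalue=1.0, variation_proportion=0.0):
    """Return the index of the next cluster to split, or None to stop splitting."""
    # 1) Maximum Eigenvalue criterion is checked first: the cluster whose
    #    second eigenvalue most exceeds the threshold is split.
    eligible = [i for i, c in enumerate(clusters) if c["second_eig"] > max_eigenvalue]
    if eligible:
        return max(eligible, key=lambda i: clusters[i]["second_eig"])
    # 2) Otherwise fall back to the Variation Proportion criterion: split the
    #    cluster with the smallest proportion of variation explained, provided
    #    that proportion is at or below the configured threshold.
    eligible = [i for i, c in enumerate(clusters)
                if c["prop_explained"] <= variation_proportion]
    if eligible:
        return min(eligible, key=lambda i: clusters[i]["prop_explained"])
    return None   # both criteria satisfied everywhere: stop

clusters = [{"second_eig": 0.8, "prop_explained": 0.90},
            {"second_eig": 1.4, "prop_explained": 0.75}]
print(choose_cluster_to_split(clusters))   # cluster 1 fails the eigenvalue test
```

A separate Maximum Clusters cap would simply terminate the outer splitting loop regardless of what this function returns.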

Print Option — Use the Print Option property to configure the granularity of the printed output from the Variable Clustering node.

Summary — Prints variable clustering results summary information only.

Short — Suppresses printing of the cluster structure, scoring coefficients, and inter-cluster correlation matrices. Short is the default setting for the Print Option property.

All — Prints all variable clustering results information.


None — Suppresses printing of all variable clustering results information.

Suppress Sampling Warning — Use the Suppress Sampling Warning property to configure whether you want the Variable Clustering node to suppress the sampling warning when you submit a large data set for clustering. Performing variable clustering on large data sets that have 100,000 or more observations can be very computationally intensive. SAS Enterprise Miner recommends that you use sampling to perform variable clustering on large data sets. The default setting for the Suppress Sampling Warning property is No. Set the Suppress Sampling Warning property to Yes if you want to run the Variable Clustering node without any restriction on the number of submitted observations.


5.4.8 Variable Clustering Node Score Properties

The following score properties are associated with the Variable Clustering node:

Variable Selection — Use the Variable Selection property to specify the variables or components that you want to export from the clusters.

Cluster Component — the Variable Clustering node exports a linear combination of the variables from each cluster. Cluster Component is the default setting for the Variable Selection property.

Best Variables — the Variable Clustering node exports the variables in each cluster that have the minimum R-square ratio values.
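The Best Variables rule can be sketched numerically. In the node's results, each variable gets a 1-R**2 ratio of (1 - R-square with its own cluster component) over (1 - R-square with the next closest cluster component); the best variable in a cluster is the one with the smallest ratio. The variable names and R-square values below are made up for illustration.

```python
def one_minus_r2_ratio(r2_own, r2_next):
    """(1 - R2 with own cluster) / (1 - R2 with next closest cluster).
    Small values mean: well explained by its own component, poorly by the nearest other one."""
    return (1.0 - r2_own) / (1.0 - r2_next)

# Hypothetical variables in one cluster: variable -> (R2 own, R2 next closest).
cluster = {
    "checking": (0.81, 0.10),
    "savings":  (0.64, 0.05),
    "duration": (0.77, 0.30),
}
best = min(cluster, key=lambda v: one_minus_r2_ratio(*cluster[v]))
print(best)   # "checking" has the lowest 1-R**2 ratio
```

This is why a low R-square with the next cluster component helps a variable: it lowers the denominator's complement, shrinking the ratio.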

Interactive Selection — Use the Interactive Selection property to open an interactive variable selection table that permits you to manually choose important variables. Select the button to the right of the Interactive Selection property to open the variables table for interactive selection.

Hide Rejected Variables — Use the Hide Rejected Variables property to specify whether to hide the rejected variables from all successor nodes in the process flow diagram. The default setting for the Hide Rejected Variables property is Yes.


5.4.9 Variable Clustering Node Status Properties

The following status properties are associated with this node:

Create Time — displays the time at which the node was created.

Run ID — displays the identifier of the node run. A new identifier is created every time the node runs.

Last Error — displays the error message from the last run.

Last Status — displays the last reported status of the node.

Last Run Time — displays the time at which the node was last run.

Run Duration — displays the length of time of the last node run.

Grid Host — displays the grid server that was used during the node run.

User-Added Node — specifies if the node was created by a user as a SAS Enterprise Miner extension node.


5.4.10 Example

This example performs variable clustering on an example SAS data set composed of training data about applicants for credit. The training data has a binary target variable named GOOD_BAD. The GOOD_BAD status indicates whether the individual in the training data defaulted on the loan. The example uses variable clustering to create a model that uses a combination of latent vectors to predict the good risks among a pool of credit applicants.

Use the SAS Enterprise Miner Create Data Source Wizard to create the German Credit Data data source. The example data is located in the SAS sample library SAMPSIO. Use the SAS sample German Credit data set, SAMPSIO.DMAGECR.

Configure the variables in the SAMPSIO.DMAGECR data set as follows:

Set the role of the binary variable GOOD_BAD as the target variable.

Set the measurement level of all the input variables except the character variable PURPOSE to Interval.

Ensure that the measurement level of PURPOSE is set to Nominal, and that the measurement level of GOOD_BAD is Binary.

Save the SAMPSIO.DMAGECR data set using the role of either Train or Raw.

Create a new process flow diagram. Next, connect the German Credit data source with a Variable Clustering node (found in the Explore tab of the node toolbar) as shown below. Use the default property settings for the Variable Clustering node.

Right-click the Variable Clustering node, and then select Run to run the process flow diagram. When the Run Status window indicates that the run is completed, select the Results button to open the Variable Clustering node Results window. Examine the Cluster Plot (Figure 5-11) to view the seven clusters that were created from the set of interval input variables in the German Credit training data:


Figure 5-11

Examine the Variable Selection table in the Results window to view the cluster statistics, sorted by cluster. The components in the Variable Selection table will be exported to the node that follows the Variable Clustering node in a process flow diagram. The R-Square with Next Cluster Component column of the table indicates the R-square scores with the nearest cluster. If the clusters are well separated, R-square score values should be low. Small values in the 1–R2 Ratio column of the Variable Selection table also indicate good clustering.

The Variable Clustering node run was performed using its default property settings. By default, the Variable Clustering node exports cluster components to the node that follows in a process flow diagram. Note that the Variable Selected column of the Variable Selection table in this case reads YES only for the cluster component variables (CLUS1, CLUS2, and so on). The Variable Clustering node can also export the best variable from each cluster to successor nodes if you desire.

To export the best variable from each cluster (instead of cluster components), close the Variable Clustering node Results window, and then change the value of the Variable Selection property from Cluster Component to Best Variables. Run the node after you make the configuration changes and open the Results window. When you examine the Variable Selection table, the Variables Selected column displays YES for the best variable in each cluster (Figure 5-12). When you use the Best Variables setting for the Variable Selection property, the best variable in each cluster is the variable that has the lowest 1–R2 Ratio.


Figure 5-12

The Results window dendrogram uses a tree hierarchy to display how the clusters were formed (Figure 5-13).

Figure 5-13

You can use the Results window menu to display the flow code for the cluster components (Figure 5-14). The flow code contains the score code: select View → SAS Results → Flow Code.


Figure 5-14

The type of output that the Variable Clustering node produces varies according to how the node Export Options are configured. You can verify which variables are exported during a Variable Clustering node run by attaching a successor node to the Variable Clustering node and running the successor node. After the run, open the table for the successor node Variables property to view the input variables that were exported by the Variable Clustering node.

This example has produced both cluster components and best variable cluster values during separate Variable Clustering node runs. To view how exported data passes to successor nodes, add a successor Regression node and configure the Variable Clustering node to export cluster components.

Drag a Regression node from the Model tab of the node tools bar onto your process flow diagram. Connect the Variable Clustering node to the Regression node. Leave the Regression node properties in their default states. The Variable Clustering node needs to be configured to export cluster components instead of the best variables within each cluster.

Select the Variable Clustering node in the process flow diagram, and then set the following properties:

Set the Variable Selection property in the Export Options section to Cluster Component.

Ensure that the Hide Rejected Variables property in the Export Options section is set to Yes.


Set the Include Class Variables property to No.

Right-click the Regression node and select Run. When the run completes, select OK in the Run Status window. (This example does not visit the results of the Regression node.) With the Regression node selected in the process flow diagram, click the button to the right of the Variables property in the Regression node general properties panel. A table opens that displays all of the variables that the Variable Clustering node exported to the Regression node (Figure 5-15):

Figure 5-15

The table shows the exported variables: the GOOD_BAD target variable, the nominal PURPOSE variable that was neither analyzed nor rejected, and the seven variable cluster components that the Variable Clustering node created.

Because the value of the Hide Rejected Variables property of the Variable Clustering node was set to Yes, none of the rejected variables are passed on to the successor node.


5.4.11 Two Stage Variable Clustering Example

The two stage variable clustering example uses a data set called GD_200VARS_10000ROWS_NOCLASS. The example data set is typical of a data source that should use two-stage variable clustering. The example data source is a raw data set that has 200 interval input variables, no class variables, and 10,000 observations. The data set has a binary target variable called TARGET. This example configures the input data source as a training data set.

For the input data set, use the Create Data Source Wizard to create a data source from the SAS SAMPSIO sample library. Use the SAS data set SAMPSIO.GD_200VARS_10000ROWS_NOCLASS to create a new data source.

Confirm that the newly created data source is configured as follows:

The Advanced advisor is selected.

The binary variable TARGET is the target variable.

ID_CUST and ID_STORE are nominal ID variables.

SEGMENT is a nominal segment variable.

The remaining variables are all interval input variables.

Construct the following process flow diagram:

Select the Variable Clustering node and configure the following properties:

Set the Two Stage Clustering property to Yes.

Leave all other property settings in their default state. The node will export cluster components.

Right-click the Variable Clustering node in the process flow diagram and select Run. In the Confirmation window, select Yes. In the Run Status window, select Results.

You can see that within each global cluster, all results are similar to those for single-stage clustering. The global clusters in the GCluster column are labeled GC1, GC2, and so on. The formula for the number of global clusters is Number of clusters = INT((number of variables / 100) + 2). Therefore, the number of clusters is INT((200 / 100) + 2) = 4.
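The global-cluster count formula quoted above can be checked directly in Python (shown here only as an arithmetic check of the book's formula, not as SAS code):

```python
def n_global_clusters(n_variables):
    """Two-stage clustering: Number of clusters = INT((number of variables / 100) + 2)."""
    return int(n_variables / 100 + 2)

print(n_global_clusters(200))   # 4, matching the example's four global clusters
```

Note that INT truncates, so 150 variables would also yield INT(3.5) = 3 global clusters.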

Because there are four global clusters, the results contain four cluster plots, four cluster tables, and so on. The table for global cluster 1 (GC1) is shown below (Figure 5-16):


Figure 5-16

The Global Cluster Dendrogram below (Figure 5-17) is generated from scored cluster components and hierarchical clustering. It shows the relationship between global clusters and provides a link to the gap between global clusters and sub-clusters.


Figure 5-17


5.4.12 Predictive Modeling with Variable Clustering Example

This example demonstrates how the Variable Clustering node can enhance the performance of a data mining predictive model. The SAS sample data set SAMPSIO.DMEXA1 is used as a data source. The data source is partitioned into training and validation data sets for a regression analysis.

The partitioned data is passed on to three identically configured successor Regression nodes with the following variations:

The data that flows to the first Regression node has no variable clustering performed on the data.

The data that flows to the second Regression node has variable clustering performed on the data. The Variable Clustering 1 node is configured to output cluster components.

The data that flows to the third Regression node has variable clustering performed on the data. The Variable Clustering 2 node is configured to output the best cluster variables.

The output from the three regression models is then submitted to a Model Comparison node to assess the regression model with the best performance.

1. Configure the DMEXA1 data source as follows:

Use the Create Data Source Wizard to select the sample SAS data set SAMPSIO.DMEXA1.

After you specify the SAS table SAMPSIO.DMEXA1 in the Data Source Wizard, select the Basic setting for the metadata advisor.

In the Basic Column Metadata table for SAMPSIO.DMEXA1, make the following configurations:

Set the Role for variable ACCTNUM to ID.

Set the Role for variable STATECOD to Rejected.

Set the Role for variable PURCHASE to Target.


In the Data Source Attributes table for SAMPSIO.DMEXA1, set the Data Source Role attribute to your choice of Train or Raw.

The newly created SAMPSIO.DMEXA1 data source should have 43 interval input variables, 4 nominal input variables, and 1,966 observations.

2. Add and connect a Data Partition node to the diagram. In the Data Set Allocations section of the Data Partition node properties panel, make the following configurations:

Set the Training property value to 60%.

Set the Validation property value to 40%.

3. Add and connect two Variable Clustering nodes to the diagram as shown. In the Export Options section of the Variable Clustering node properties panel, make the following configurations:

Set the Variable Selection property value on the uppermost Variable Clustering node in the diagram to Cluster Components.

Set the Variable Selection property value on the remaining Variable Clustering node in the diagram to Best Variables.

4. Add and connect three Regression nodes to the diagram as shown. Leave all three Regression nodes in their default configuration.

5. Add and connect the Model Comparison node to the diagram as shown. Leave the Model Comparison node in its default configuration.

6. Right-click the Model Comparison node and select Run from the pop-up menu. In the Run Status window, select OK.

7. Before examining the Model Comparison node results, examine the variables that were passed to the two Regression nodes that follow Variable Clustering nodes. In the properties panel for the Regression node that follows the uppermost Variable Clustering node, select the button that is located to the right of the Variables property to view the table of imported variables. The table shows that the first Variable Clustering node summarized 43 interval variables in 13 cluster components, and then passed the cluster components to the successor Regression node as new inputs (Figure 5-18).


Figure 5-18

8. In the properties panel for the Regression node that follows the second Variable Clustering node, select the button that is located to the right of the Variables property to view the table of imported variables. The table shows that the second Variable Clustering node chose 13 interval variables, each variable the best cluster representative. The 13 best cluster variables are passed to the successor Regression node as new inputs (Figure 5-19).

Figure 5-19

9. Right-click the Model Comparison node in the diagram and select Results from the pop-up menu to compare the results of the three Regression models.


10. The Fit Statistics table and the SAS Log show that the Model Comparison node chose the regression model that followed the Variable Clustering node that exported cluster components as the best predictive model. The competing regression models, which used unclustered variable inputs or best cluster variable inputs, were judged on validation error and misclassification rates (Figure 5-20).

Figure 5-20


CHAPTER 6

PREDICTIVE MODELS COMPARISON


6.1 MODELS COMPARISON

This chapter compares the output and performance of three modeling tools — Decision Tree, Regression, and Neural Network — by using all three tools to develop two types of models — one with a binary target and one with an ordinal target. I hope this chapter will help you decide what approach to take.

The model with a binary target is developed for a hypothetical bank that wants to predict the likelihood of a customer's attrition so that it can take suitable action to prevent the attrition if necessary. The model with an ordinal target is developed for a hypothetical insurance company that wants to predict the probability of a given frequency of insurance claims for each of its existing customers, and then use the resulting model to further profile customers who are most likely to have accidents.

The Stochastic Boosting and Ensemble nodes are demonstrated for combining models.


6.2 MODELS FOR BINARY TARGETS: AN EXAMPLE OF PREDICTING ATTRITION

Attrition can be modeled as either a binary or a continuous target. When you model attrition as a binary target, you predict the probability of a customer "attriting" during the next few days or months, given the customer's general characteristics and change in the pattern of the customer's transactions. With a continuous target, you predict the expected time of attrition, or the residual lifetime, of a customer. For example, if a bank or credit card company wants to identify customers who are likely to terminate their accounts at any point within a predefined interval of time in the future, the company can model attrition as a binary target. If, on the other hand, they are interested in predicting the specific time at which the customer is likely to attrit, then the company should model attrition as a continuous target, and use techniques such as survival analysis.

In this chapter, I present an example of a hypothetical bank that wants to develop an early warning system to identify customers who are most likely to close their investment accounts in the next three months. To meet this business objective, an attrition model with a binary target is developed.

The customer record in the modeling data set for developing an attrition model consists of three types of variables (fields):

• Variables indicating the customer's past transactions (such as deposits, withdrawals, purchase of equities, etc.) by month for several months

• Customer characteristics (demographic and socio-economic) such as age, income, lifestyle, etc.

• Target variable indicating whether a customer attrited during a pre-specified interval

Assume that the model was developed during December 2006, and that it was used to predict attrition for the period January 1, 2007 through March 31, 2007.

Display 5.1 shows a chronological view of a data record in the data set used for developing an attrition model.

Display 5.1

Here are some key input design definitions that are labeled in Display 5.1:


• Inputs/explanatory variables window: This window refers to the period from which the transaction data is collected.

• Operational lag: The model excludes any data for the month of July 2006 to allow for the operational lag that comes into play when the model is used for forecasting in real time. This lag may represent the period between the time at which a customer's transaction takes place and the time at which it is recorded in the database and becomes available to be used in the model.

• Performance window: This is the time interval in which the customers in the sample data set are observed for attrition. If a customer attrited during this interval, then the target variable ATTR takes the value 1; if not, it takes the value 0.

The model should exclude all data from any of the inputs that have been captured during the performance window.

If the model is developed using the data shown in Display 5.1 and used for predicting, or scoring, attrition propensity for the period January 1, 2007 through March 31, 2007, then the inputs window, the operational lag, and the prediction window look like what is shown in Display 5.2 when the scoring or prediction is done.

Display 5.2

At the time of prediction (say, on December 31, 2006), all data in the inputs/explanatory variables window is available. In this hypothetical example, however, the inputs are not available for the month of December 2006 because of the time lag in collecting the input data. In addition, no data is available for the prediction window, since it resides in the future (see Display 5.2).

As mentioned earlier, we assume that the model was developed during December 2006. The modeling data set was created by observing a sample of current customers with investment accounts during a time interval of three months, from August 1, 2006 through October 31, 2006. This is the performance window, as shown in Display 5.1. The fictitious bank is interested in predicting the probability of attrition, specifically among their customers' investment accounts. Accordingly, if an investment account is closed during the performance window, an indicator value of 1 is entered on the customer's record; otherwise, a value of 0 is given. Thus, for the target variable ATTR, 1 represents attrition and 0 represents no attrition.


Each record in the modeling data set includes the following details:

• monthly transactions

• balances

• monthly rates of change of customer balances

• specially constructed trend variables

These trend variables indicate patterns in customer transaction behavior, etc., for all accounts held by the customer during the inputs/explanatory variables window, which spans the period of January 1, 2006 through June 30, 2006, as shown in Display 5.1. The customer's records also show how long the customer has held an investment account with the bank (investment account tenure). In addition, each record is appended with the customer's age, marital status, household income, home equity, life-stage indicators, and number of months the customer has been on the books (customer tenure), etc. All of these details are candidates for inclusion in the model.

In the process flow shown in Display 5.3, the data source is created first from the data set ATTRITION2_C. The Input Data node reads this data. The data is then partitioned such that 45% of the records are allocated to training, 35% to validation, and 25% to testing. I allocated more records for training and validation than for testing so that the models are estimated as accurately as possible. The value of the Partitioning Method property is set to Default, which performs stratification with respect to the target variable if the target variable is categorical. I also imputed missing observations using the Impute node. Display 5.3 shows the process flow for modeling the binary target.

Display 5.3
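The stratification that the Default partitioning method performs on a categorical target can be sketched in a few lines of Python. This is only an illustrative sketch, not the Data Partition node's implementation; the function name, the toy ATTR data, and the partition fractions (which must sum to 1 and are shown here as illustrative values) are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_partition(records, target, fractions, seed=42):
    """Split records into named partitions while preserving the
    target's class proportions in each partition (stratification)."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for rec in records:                      # group records by target level
        by_class[rec[target]].append(rec)
    parts = {name: [] for name in fractions}
    names = list(fractions)
    for recs in by_class.values():           # split each class separately
        rng.shuffle(recs)
        start = 0
        for i, name in enumerate(names):
            # the last partition takes the remainder to avoid rounding loss
            end = len(recs) if i == len(names) - 1 \
                else start + round(fractions[name] * len(recs))
            parts[name].extend(recs[start:end])
            start = end
    return parts

# Toy binary target: 20% of 1,000 customers attrited (ATTR = 1)
data = [{"ATTR": 1 if i % 5 == 0 else 0} for i in range(1000)]
parts = stratified_partition(data, "ATTR",
                             {"train": 0.45, "validate": 0.35, "test": 0.20})
```

Because each class is split separately, every partition keeps the overall 20% attrition rate, which is what stratification on a categorical target guarantees.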


6.2.1 Logistic Regression for Predicting Attrition

The top segment of Display 5.3 shows the process flow for a logistic regression model of attrition. In the Regression node, I set the Model Selection property to Stepwise, as this method was found to produce the most parsimonious model. For the stepwise selection, I set the Entry Significance Level property to 0.05 and the Stay Significance Level property to 0.05. I set the Use Selection Defaults property to No and the Maximum Number of Steps property to 100. Before making a final choice of value for the Selection Criterion property, I tested the model using different values. In this example, the Validation Error criterion produced a model that made the most business sense. Display 5.4 shows the model that is estimated by the Regression node with these property settings.

Display 5.4

The variables that are selected by the Regression node are btrend and duration. The variable btrend measures the trend in the customer's balances during the six-month period prior to the operational lag period and the operational lag period. If there is a downward trend in the balances over the period, e.g., a decline of 1 percent, then the odds of attrition increase by 100 × (e^(−0.1485 × (−1)) − 1) = 16.0%. This may seem a bit high, but the direction of the result does make sense. (Also, keep in mind that these estimates are based on simulated data. It is best not to consider them as general results.)
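The odds calculation above follows directly from the logistic form of the model: a change of delta units in an input with coefficient b multiplies the odds by e^(b × delta). A small sketch (the function name is mine; the coefficient −0.1485 is the one quoted above):

```python
import math

def odds_pct_change(coef, delta):
    """Percent change in the odds of the event when an input with
    logistic coefficient `coef` changes by `delta` units."""
    return 100.0 * (math.exp(coef * delta) - 1.0)

# A 1-point decline in the balance trend (delta = -1) with the
# btrend coefficient of -0.1485 raises the odds of attrition by about 16%.
change = odds_pct_change(-0.1485, -1)
```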

The variable duration measures the customer's investment account tenure. Longer tenure corresponds to a lower probability of attrition. This can be interpreted as the positive effect of customer loyalty.

Display 5.5 shows the Cumulative Lift chart from the Results window of the Regression node.

Display 5.5

Table 5.1 shows the cumulative lift and capture rates for the Train, Validation, and Test data sets.

Table 5.1


6.2.2 Decision Tree Model for Predicting Attrition

The middle section of Display 5.3 shows the process flow for the Decision Tree model of attrition.

I set the Splitting Rule Criterion property to ProbChisq, the Subtree Method property to Assessment, and the Assessment Measure property to Average Square Error. My choice of these property values is somewhat arbitrary, but it resulted in a tree that can serve as an illustration. You can set alternative values for these properties and examine the trees produced.

According to the property values I set, the nodes are split on the basis of p-values of the Pearson Chi-Square, and the sub-tree selected is based on the Average Square Error calculated from the Validation data set.

Display 5.6 shows the tree produced according to the above property values. The Decision Tree node selected only one variable, btrend, in the tree. The variable btrend is the trend in the customer balances, as explained earlier. The Regression node selected two variables, which were both numeric, while the Decision Tree selected one numeric variable only. You could either combine the results from both of these models or create a larger set of variables that would include the variables selected by the Regression node and the Decision Tree node.

Display 5.6

Display 5.7 shows the cumulative lift of the Decision Tree model.

Display 5.7


The cumulative lift and cumulative capture rates for the Train, Validation, and Test data are shown in Table 5.2.

Table 5.2


6.2.3 A Neural Network Model for Predicting Attrition

The bottom segment of Display 5.3 shows the process flow for the Neural Network model of attrition.

In developing a Neural Network model of attrition, I tested the following two approaches:

• Use all the inputs in the data set, set the Architecture property to MLP (multilayer perceptron), the Model Selection Criterion property to Average Error, and the Number of Hidden Units property to 10.

• Use the same property settings as above, but use only selected inputs. The included variables are: _NODE_, a special variable created by the Decision Tree node that preceded the Neural Network node; btrend, a trend variable selected by the Decision Tree node; and duration, which was manually included by setting the Use property to Yes in the variables table of the Neural Network node.

As expected, the first approach resulted in an over-fitted model. There was considerable deterioration in the lift calculated from the Validation data set relative to that calculated from the Training data set.

The second approach yielded a much more robust model. With this approach, I compared the cumulative lift of the different models generated by setting the Number of Hidden Units property to 5, 10, 15, and 20. The model that has the best lift for the Validation data is the one generated by setting the Number of Hidden Units property to 10.

Using only a selected number of inputs enables you to quickly test a variety of architectural specifications. For purposes of illustration, I present below the results from a model with the following property settings:

• Architecture property: Multilayer Perceptron

• Model Selection Criterion property: Average Error

• Number of Hidden Units property: 10

The bottom segment of Display 5.3 shows the process flow for the neural network model of attrition. In this process flow, I used the Decision Tree node with the Splitting Rule Criterion property set to ProbChisq, the Assessment Measure property set to Average Square Error, the Leaf Variable property set to Yes, the Variable Selection property set to Yes, and the Leaf Role property set to Input to select the inputs for use in the Neural Network node.

The lift charts for this model are shown in Display 5.8.

Display 5.8


Table 5.3 shows the cumulative lift and cumulative capture rates for the Train, Validation, and Test data sets.

Table 5.3

Display 5.9 shows the lift charts for all models from the Model Comparison node.

Display 5.9


Table 5.4 shows a comparison of the three models using Test data.

Table 5.4

From Table 5.4 it appears that both the Logistic Regression and Neural Network models have outperformed the Decision Tree model by a slight margin, and that there is no significant difference in performance between the Logistic Regression and Neural Network models. The top four deciles (the top eight demi-deciles), based on the Logistic Regression model, capture 80.4% of attritors; the top four deciles based on the Neural Network model capture 80.0% of the attritors; and the top four deciles, based on the Decision Tree model, capture 76.8% of attritors. The cumulative lift for all three models is monotonically declining with each successive demi-decile.


6.3 MODELS FOR ORDINAL TARGETS: AN EXAMPLE OF PREDICTING ACCIDENT RISK

In the previous chapter, I demonstrated a neural network model to predict the loss frequency for a hypothetical insurance company. The loss frequency was a continuous variable, but I used a discrete version of it as the target variable. Here I follow the same procedure and create a discretized variable lossfrq from the continuous variable loss frequency. The discretization is done as follows:

lossfrq takes the values 0, 1, 2, and 3 according to the following definitions:

lossfrq = 0 if loss frequency = 0
lossfrq = 1 if 0 < loss frequency < 1.5
lossfrq = 2 if 1.5 ≤ loss frequency < 2.5
lossfrq = 3 if loss frequency ≥ 2.5
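The discretization rule can be written directly as a small function (a sketch; the function name is mine):

```python
def discretize_lossfrq(loss_frequency):
    """Map the continuous loss frequency to the ordinal target lossfrq
    using the cutoffs defined above (0, 1.5, 2.5)."""
    if loss_frequency == 0:
        return 0
    elif loss_frequency < 1.5:
        return 1
    elif loss_frequency < 2.5:
        return 2
    else:
        return 3
```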

The goal is to develop three models using the Regression, Decision Tree, and Neural Network nodes to predict the following probabilities:

Pr(lossfrq = 0 | X) = Pr(loss frequency = 0 | X)
Pr(lossfrq = 1 | X) = Pr(0 < loss frequency < 1.5 | X)
Pr(lossfrq = 2 | X) = Pr(1.5 ≤ loss frequency < 2.5 | X)
Pr(lossfrq = 3 | X) = Pr(loss frequency ≥ 2.5 | X)

where X is a vector of inputs or explanatory variables.

When you are using a target variable such as lossfrq, which is a discrete version of a continuous variable, the variable becomes ordinal if it has more than two levels.

The first model I develop for predicting the above probabilities is a Proportional Odds model (or logistic regression with a cumulative logits link) using the Regression node. The Regression node produces such a model if the measurement level of the target is set to Ordinal.

The second model I develop is a Decision Tree model using the Decision Tree node. The Decision Tree node does not produce any equations, but it does provide certain rules for partitioning the data set into disjoint groups (leaf nodes). The rules are stated in terms of input ranges (or definitions). For each group, the Decision Tree node gives the predicted probabilities:

Pr(lossfrq = 0 | X), Pr(lossfrq = 1 | X), Pr(lossfrq = 2 | X), and Pr(lossfrq = 3 | X).

The Decision Tree node also assigns a target level such as 0, 1, 2, or 3 to each group, and hence to all the records belonging to that group.


The third model is developed by using the Neural Network node, which produces a Logit-type model, provided you set the Target Activation Function property to Logistic. With this setting, you get a model similar to the one shown in Chapter 5.

Display 5.10 shows the process flow for comparing the models of the ordinal target.

Display 5.10


6.3.1 Lift Charts and Capture Rates for Models with Ordinal Targets

Method Used by SAS Enterprise Miner

When the target is ordinal, SAS Enterprise Miner creates lift charts based on the probability of the highest level of the target variable. The highest level in this example is 3. For each record in the Test data set, SAS Enterprise Miner computes the predicted, or posterior, probability Pr(lossfrq = 3 | Xi) from the model. Then it sorts the data set in descending order of the predicted probability and divides the data set into 20 bins, which I refer to as demi-deciles. Within each demi-decile, SAS Enterprise Miner calculates the proportion of cases, as well as the total number of cases, with lossfrq = 3. The lift for a demi-decile is the ratio of the proportion of cases where lossfrq = 3 in the demi-decile to the proportion of cases with lossfrq = 3 in the entire data set. The capture rate is the ratio of the total number of cases with lossfrq = 3 within the demi-decile to the total number of cases with lossfrq = 3 in the entire data set.
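The demi-decile lift and capture-rate computation described above can be sketched as follows. This is an illustration of the calculation, not SAS Enterprise Miner's code; the scores and outcomes at the bottom are hypothetical.

```python
def lift_and_capture(scores, events, n_bins=20):
    """Sort records by descending predicted probability of the highest
    target level, cut them into n_bins equal bins, and return each
    bin's lift and capture rate for the event (e.g. lossfrq = 3)."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    total_events = sum(events)
    overall_rate = total_events / len(events)
    size = len(order) // n_bins
    lifts, captures = [], []
    for b in range(n_bins):
        idx = order[b * size:(b + 1) * size]
        hits = sum(events[i] for i in idx)
        # lift: event rate in the bin relative to the overall event rate
        lifts.append((hits / size) / overall_rate)
        # capture rate: share of all events that fall in this bin
        captures.append(hits / total_events)
    return lifts, captures

# Hypothetical perfectly ranked scores: the 10 events get the 10 highest scores
scores = [100 - i for i in range(100)]
events = [1] * 10 + [0] * 90
lifts, captures = lift_and_capture(scores, events)
```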

An Alternative Approach Using Expected lossfrq

As pointed out earlier in Section 5.5.1.4, you sometimes need to calculate lift and capture rates based on the expected value of the target variable. For each model, I present lift tables and capture rates based on the expected value of the target variable.

For each record in the Test data set, I calculate the expected lossfrq as

E(lossfrq | Xi) = Pr(lossfrq = 0 | Xi) × 0 + Pr(lossfrq = 1 | Xi) × 1 + Pr(lossfrq = 2 | Xi) × 2 + Pr(lossfrq = 3 | Xi) × 3.

Next, I sort the records of the data set in descending order of E(lossfrq | Xi), divide the data set into 20 demi-deciles (bins), and, within each demi-decile, calculate the mean of the actual or observed value of the target variable lossfrq. The ratio of the mean of actual lossfrq in a demi-decile to the overall mean of actual lossfrq for the data set is the lift for the demi-decile. Likewise, the capture rate for a demi-decile is the ratio of the sum of the actual lossfrq for all the records within the demi-decile to the sum of the actual lossfrq of all the records in the entire data set.
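The same idea in code, again as a hedged sketch with hypothetical inputs: rank records by E(lossfrq | X) and compare each bin's mean actual lossfrq to the overall mean.

```python
def expected_lossfrq(probs):
    """E(lossfrq | X) = sum over levels k of k * Pr(lossfrq = k | X)."""
    return sum(k * p for k, p in enumerate(probs))

def expected_value_lift(prob_rows, actual, n_bins=20):
    """Rank records by expected lossfrq, cut into n_bins bins, and
    return each bin's lift: mean actual lossfrq in the bin divided
    by the overall mean of actual lossfrq."""
    ev = [expected_lossfrq(p) for p in prob_rows]
    order = sorted(range(len(ev)), key=lambda i: -ev[i])
    overall_mean = sum(actual) / len(actual)
    size = len(order) // n_bins
    lifts = []
    for b in range(n_bins):
        idx = order[b * size:(b + 1) * size]
        bin_mean = sum(actual[i] for i in idx) / size
        lifts.append(bin_mean / overall_mean)
    return lifts
```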


6.3.2 Logistic Regression with Proportional Odds for Predicting Risk in Auto Insurance

Given that the measurement level of the target variable is set to Ordinal, a Proportional Odds model is estimated by the Regression node using the property settings shown in Displays 5.11 and 5.12.

Display 5.11

Display 5.12

The reasons for the settings of the Selection Model property and the Selection Criterion property are cited in previous sections. The choice of Validation Error is also prompted by the lack of an appropriate profit matrix.

Display 5.13 shows the estimated equations for the Proportional Odds model.

Display 5.13

Display 5.13 shows that the estimated model is Logistic with Proportional Odds. Accordingly, the slopes (coefficients of the explanatory variables) for the three logit equations are the same, but the intercepts are different. There are only three equations in Display 5.13, while there are four events represented by the four levels (0, 1, 2, and 3) of the target variable lossfrq. The three equations given in Display 5.13, plus the fourth equation, given by the identity

Page 291: NEURAL NETWORKS with SAS ENTERPRISE MINER

Pr(lossfrq = 0) + Pr(lossfrq = 1) + Pr(lossfrq = 2) + Pr(lossfrq = 3) = 1,

are sufficient to solve for the probabilities of the four events. (See Chapter 6 to review how these probabilities are calculated by the Regression node. In the model presented there, the target variable has only three levels, while the target variable in the model presented in Display 5.13 has four levels: 0, 1, 2, and 3.)
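Solving for the four probabilities from the three cumulative logits can be sketched as below. The intercepts and the slope term are hypothetical placeholders, not the estimates in Display 5.13; the only structural assumptions are the common slopes and one intercept per cumulative logit of the proportional odds model.

```python
import math

def cumulative_probs(intercepts, slope_term):
    """Pr(lossfrq <= k) for k = 0, 1, 2 under proportional odds:
    one intercept a_k per cumulative logit, a common slope term x'b."""
    return [1.0 / (1.0 + math.exp(-(a + slope_term))) for a in intercepts]

def level_probs(intercepts, slope_term):
    """Recover the four level probabilities by differencing:
    Pr(0) = F0, Pr(1) = F1 - F0, Pr(2) = F2 - F1, Pr(3) = 1 - F2,
    where the last step uses the identity that the four sum to 1."""
    f0, f1, f2 = cumulative_probs(intercepts, slope_term)
    return [f0, f1 - f0, f2 - f1, 1.0 - f2]

# Hypothetical intercepts (must be increasing) and slope term x'b
probs = level_probs([-1.0, 0.0, 1.0], 0.5)
```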

The variables selected by the Regression node are AGE (age of the insured), CRED (credit score of the insured), and NPRVIO (number of prior violations). The cumulative lift charts from the Results window of the Regression node are shown in Display 5.14.

Display 5.14

The cumulative lift charts shown in Display 5.14 are based on the highest level of the target variable, as described in Section 6.3.1. Using Test data, alternative cumulative lift and capture rates based on E(lossfrq) were computed and are shown in Table 5.5.

Table 5.5


6.3.3 Decision Tree Model for Predicting Risk in Auto Insurance

Display 5.15 shows the settings of the properties of the Decision Tree node.

Display 5.15

Note that I set the Subtree Method property to Assessment and the Assessment Measure property to Average Square Error. These choices were arrived at after trying out different values. The chosen settings end up yielding a reasonably sized tree, one that is neither too small nor too large. In order to find general rules for making these choices, you need to experiment with different values for these properties with many replications.

Displays 5.16A and 5.16B show a part of the tree produced by the property settings given in Display 5.15.

Display 5.16A


Display 5.16B


The Decision Tree node has selected the three variables AGE, CRED (credit rating), and NPRVIO (number of previous violations), which were also selected by the Regression node. In addition, the Decision Tree node selected five more variables: MILEAGE (miles driven by the policyholder), NUMTR (number of credit cards owned by the policyholder), DELINQ (number of delinquencies), HEQ (value of home equity), and DEPC (number of department store cards owned).

Display 5.17 shows the importance of the variables used in splitting the nodes.

Display 5.17


Display 5.18 shows the lift charts for the Decision Tree model.

Display 5.18

Table 5.6 shows the lift charts calculated using the expected loss frequency.

Table 5.6


6.3.4 Neural Network Model for Predicting Risk in Auto Insurance

The property settings for the Neural Network node are shown in Displays 5.19 and 5.20.

Display 5.19

Display 5.20

Because the number of inputs available in the data set is small, I passed all of them into the Neural Network node. I set the Number of Hidden Units property to its default value of 3, since any increase in this value did not improve the results significantly.

Display 5.21 shows the lift charts for the neural network model.

Display 5.21


Table 5.7 shows the cumulative lift and capture rates for the Neural Network model based on the expected loss frequency using the Test data.

Table 5.7


6.3.5 Comparison of All Three Accident Risk Models

Table 5.8 shows the lift and capture rates calculated for the Test data set ranked by E(lossfrq), as outlined in the previous section.

Table 5.8

From Table 5.8, it is clear that the logistic regression with proportional odds is the winner in terms of lift and capture rates. The table shows that, for the logistic regression, the cumulative lift at Bin 8 is 1.95. This means that the actual average loss frequency of the customers who are in the top 40 percentiles based on the logistic regression is 1.95 times the overall average loss frequency. The corresponding numbers for the Neural Network and Decision Tree models are 1.92 and 1.85, respectively.


6.4 BOOSTING AND COMBINING PREDICTIVE MODELS

Sometimes you may be able to produce more accurate predictions using a combination of models instead of a single model. Gradient Boosting and Ensemble methods are two different ways of combining models. Gradient Boosting starts with an initial model and updates it by successively adding a sequence of regression trees in a step-wise manner. Each tree in the sequence is created by using the residuals (gradients of an error function) from the model in the previous step as the target. If the data set used to create each tree in the sequence of trees is a random sample from the Training data set, then the method is called Stochastic Gradient Boosting. In Stochastic Gradient Boosting, a sample is selected at each step by random sampling without replacement. In the selected sample, observations are weighted according to the errors from the model in the previous step. In SAS Enterprise Miner 12.1, Gradient Boosting is done by the Gradient Boosting node.
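The residual-fitting loop at the heart of gradient boosting can be illustrated with a toy example: squared-error loss (so the gradients are ordinary residuals), a single input, and depth-one regression trees (stumps). This is a pedagogical sketch, not the Gradient Boosting node's algorithm, and all names and data below are mine.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on 1-D inputs: choose the
    threshold minimizing squared error against the residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi, t=t, lm=lm, rm=rm: lm if xi <= t else rm

def gradient_boost(x, y, n_trees=20, learn_rate=0.3):
    """Start from the mean prediction and repeatedly fit a stump to the
    current residuals (the gradients of squared error), adding a damped
    copy of each stump to the running model."""
    f0 = sum(y) / len(y)
    pred = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]   # residuals = targets
        stump = fit_stump(x, resid)
        stumps.append(stump)
        pred = [pi + learn_rate * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: f0 + learn_rate * sum(s(xi) for s in stumps)

# Fit a step function: y jumps from 0 to 1 at x = 5
x = list(range(10))
y = [0.0] * 5 + [1.0] * 5
model = gradient_boost(x, y)
```

Each added stump nudges the prediction toward the remaining residuals, so after a few iterations the fitted function closely follows the 0-to-1 step in y.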

While Boosting and Stochastic Boosting combine the models created sequentially from the initial model, the Ensemble method combines a collection of models independently created for the same target. The models combined can be of different types. The Ensemble node can be used to combine different models created by different nodes.

These descriptions are intended to give you an understanding of Gradient Boosting and Stochastic Gradient Boosting in general. To gain further insights into Gradient Boosting and Stochastic Gradient Boosting techniques, you should read the papers by Friedman.


6.5 MODEL VALIDATION


6.5.1 Stochastic Gradient Boosting

In Stochastic Gradient Boosting, at each iteration the tree is developed using a random sample (without replacement) from the Training data set. In the sampling at each step, weights are assigned to the observations according to the prediction errors in the previous step, assigning higher weights to observations with larger errors.
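The weighted sampling without replacement described here can be sketched as follows. This is illustrative only; the exact weighting scheme used by the Gradient Boosting node may differ, and the error weights below are hypothetical.

```python
import random

def weighted_sample_without_replacement(items, weights, k, seed=0):
    """Draw k distinct items; at each draw the remaining items are
    selected with probability proportional to their weights."""
    rng = random.Random(seed)
    pool = list(zip(items, weights))
    chosen = []
    for _ in range(k):
        total = sum(w for _, w in pool)
        r = rng.uniform(0, total)
        acc = 0.0
        for j, (item, w) in enumerate(pool):
            acc += w
            if r <= acc:
                chosen.append(item)
                pool.pop(j)          # without replacement: remove the item
                break
    return chosen

# Observations with larger previous-step errors are more likely to be chosen
errors = [0.9, 0.1, 0.5, 0.05, 0.7]
sample = weighted_sample_without_replacement(list(range(5)), errors, k=3)
```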


6.5.2 An Illustration of Boosting Using the Gradient Boosting Node

Display 5.22 shows the process flow diagram for a comparison of models created by the Gradient Boosting and Decision Tree nodes.

Display 5.22

The property settings of the Gradient Boosting node are shown in Display 5.23.

Display 5.23

The property settings for the Decision Tree node are shown in Display 5.24.

Display 5.24


Table 5.9 shows a comparison of Cumulative Lift and Cumulative Capture Rate for the two models for the Test data.

Table 5.9

By comparing the Cumulative Lift and Cumulative Capture Rates for the Gradient Boosting and Decision Tree models from the 15th percentile onwards, we can see that Gradient Boosting has yielded slightly better predictions than the Decision Tree model.


6.5.3 The Ensemble Node

We can use the Ensemble node to combine the three models created by the Logistic Regression, Decision Tree, and Neural Network nodes, as shown in Display 5.25.

Display 5.25

The target variable used in the Input Data node is ATTR, which is a binary variable. It takes the value 1 if a customer attrited during a given time interval and 0 if the customer did not attrit. When you run the Regression, Decision Tree, Neural Network, and Ensemble nodes as shown in Display 5.25, the Ensemble node calculates the posterior probabilities and an "Into" variable for each record in the data set from each model.

The Ensemble node creates the variables listed in Table 5.10.

Table 5.10

In the last column of Table 5.10, I assigned hypothetical values for each of the variables to illustrate the different combination methods of the Ensemble node. Note that the value of the "Into" variable from each model is based on the Maximum Posterior Probability criterion. That is, if the predicted probability from a model for level 1 is higher than the predicted probability for level 0, then the record (customer) is given the label 1. If 1 represents the event that the customer attrited and 0 represents the event that the customer did not attrit, then by assigning level 1 to the customer, the model is classifying him/her as an "attritor". You can use other criteria, such as Maximum Expected Profits, by providing a weight matrix or a profit matrix.


For example, from Table 5.10 it can be seen that the probability of attrition calculated from a Logistic Regression model for the hypothetical customer is 0.6, and the probability of no attrition for the same customer is 0.4. Since 0.6 > 0.4, the record is assigned level 1 if you use the Logistic Regression model for classification.

In the Ensemble node there are three methods for combining multiple models. They are:

• Average

• Maximum

• Voting

  ◦ Average

  ◦ Proportion
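For a binary target, the combination methods can be illustrated with plain functions over the models' posterior probabilities. The values echo the hypothetical customer discussed above (0.6/0.4 from the Logistic Regression); reading Voting's Proportion sub-method as the fraction of models classifying the record as level 1 is my interpretation of the name, not a statement of the node's exact implementation.

```python
def combine_average(posteriors):
    """Average: mean of the models' posterior probabilities of the event."""
    return sum(posteriors) / len(posteriors)

def combine_maximum(posteriors):
    """Maximum: largest posterior probability across models."""
    return max(posteriors)

def combine_voting_proportion(into_labels):
    """Voting (Proportion, as interpreted here): fraction of models
    classifying the record as level 1."""
    return sum(1 for v in into_labels if v == 1) / len(into_labels)

def into_label(p_event, cutoff=0.5):
    """Maximum posterior probability rule for a binary target."""
    return 1 if p_event > cutoff else 0

# Hypothetical posteriors of attrition from Regression, Tree, Neural Network
p = [0.6, 0.45, 0.7]
votes = [into_label(pi) for pi in p]
```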


6.5.4 Comparing the Gradient Boosting and Ensemble Methods of Combining Models

The task of comparing the combined model from the Ensemble node with the individual models included in the Ensemble is left as an exercise. In this section, we compare the combined model generated by the Ensemble node with the one generated by the Gradient Boosting node. Display 5.26 shows the process flow diagram for demonstrating the Ensemble node and also for comparing the models generated by the Ensemble and Gradient Boosting nodes.

Display 5.26

The property settings of the Ensemble node are shown in Display 5.27. The property settings of the Gradient Boosting node are the same as those shown in Display 5.23.

Display 5.27

Table 5.12 shows the Cumulative Lift and Cumulative Capture Rate from the models developed by the Ensemble and Gradient Boosting nodes for the Test data.

Table 5.12


From Table 5.12, it appears that the Ensemble method has produced better predictions than Gradient Boosting in percentiles 5 through 20.