rjournal_2011-2_baier~et~al

Upload: jake-cruz-calderon

Post on 02-Jun-2018

217 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/11/2019 RJournal_2011-2_Baier~et~al

    1/7

    CONTRIBUTED RESEARCH A RTICLES 5

    Creating and Deploying an Applicationwith (R)Excel and RThomas Baier, Erich Neuwirth and Michele De Meo

    Abstract We present some ways of using R inExcel and build an example application using thepackage rpart . Starting with simple interactiveuse of rpart in Excel, we eventually package thecode into an Excel-based application, hiding alldetails (including R itself) from the end user. Inthe end, our application implements a service-oriented architecture (SOA) with a clean separa-tion of presentation and computation layer.

    Motivation

    Building an application for end users is a very chal-lenging goal. Building a statistics application nor-mally involves three different roles: application de-veloper, statistician, and user. Often, statisticiansare programmers too, but are only (or mostly) fa-miliar with statistical programming (languages) anddenitely are not experts in creating rich user inter-faces/applications for (casual) users.

    For manymaybe even for mostapplications of statistics, Microsoft Excel is used as the primary userinterface. Users are familiar with performing simplecomputations using the spreadsheet and can easilyformat the result of the analyses for printing or inclu-sion in reports or presentations. Unfortunately, Exceldoes not provide support for doing more complexcomputations and analyses and also has documentedweaknesses for certain numerical calculations.

    Statisticians know a solution for this problem, andthis solution is called R. R is a very powerful pro-gramming language for statistics with lots of meth-ods from different statistical areas implemented invarious packages by thousands of contributors. But

    unfortunately, Rs user interface is not what everydayusers of statistics in business expect.The following sections will show a very simple

    approach allowing a statistician to develop an easy-to-use and maintainable end-user application. Ourexample will make use of R and the package rpartand the resulting application will completely hide thecomplexity and the R user interface from the user.

    rpart implements recursive partitioning and re-gression trees. These methods have become powerfultools for analyzing complex data structures and have been employed in the most varied elds: CRM, -

    nancial risk management, insurance, pharmaceuticalsand so on (for example, see: Altman (2002), Hastieet al. (2009), Zhang and Singer (1999)).

    The main reason for the wide distribution of tree- based methods is their simplicity and intuitiveness.

    Prediction of a quantitative or categorical variable,is done through a tree structure, which even non-professionals can read and understand easily. Theapplication of a computationally complex algorithmthus results in an intuitive and easy to use tool. Pre-diction of a categorical variable is performed by aclassication tree, while the term regression tree is usedfor the estimation of a quantitative variable.

    Our application will be built for Microsoft Exceland will make use of R and rpart to implement thefunctionality. We have chosen Excel as the primarytool for performing the analysis because of various

    advantages: Excel has a familiar and easy-to-use user inter-

    face.

    Excel is already installed on most of the work-stations in the industries we mentioned.

    In many cases, data collection has been per-formed using Excel, so using Excel for the anal-ysis seems to be the logical choice.

    Excel provides many features to allow ahigh-quality presentation of the results. Pre-

    congured presentation options can easilyadapted even by the casual user.

    Output data (mostly graphical or tabular pre-sentation of the results) can easily be used infurther processing e.g., embedded in Power-Point slides or Word documents using OLE (asubset of COM, as in Microsoft Corporation andDigital Equipment Corporation (1995)).

    We are using R for the following reasons:

    One cannot rely on Microsoft Excels numericaland statistical functions (they do not even givethe same results when run in different versionsof Excel). See McCullough and Wilson (2002)for more information.

    We are re-using an already existing, tested andproven package for doing the statistics.

    Statistical programmers often use R for perform-ing statistical analysis and implementing func-tions.

    Our goal for creating an RExcel -based applica-tion is to enable any user to be able to perform the

    computations and use the results without any specialknowledge of R (or even of RExcel ). See Figure 1 foran example of the applications user interface. Theresults of running the application are shown in Figure2.

    The R Journal Vol. 3/2, December 2011 ISSN 2073-4859

    http://cran.r-project.org/package=rparthttp://cran.r-project.org/package=rpart
  • 8/11/2019 RJournal_2011-2_Baier~et~al

    2/7

  • 8/11/2019 RJournal_2011-2_Baier~et~al

    3/7

    CONTRIBUTED RESEARCH A RTICLES 7

    Scilab can easily be integrated into OpenOfce.orgapplications, like Calc (see OpenOfce.org (2006)).ROOo is available via Drexel (2009) and already sup-ports Windows, Linux and MacOS X.

    RExcel supports various user interaction modes:

    Scratchpad and data transfer mode: Menus controldata transfer from R to Excel and back; com-mands can be executed immediately, either fromExcel cells or from R command line

    Macro mode: Macros, invisible to the user, controldata transfer and R command execution

    Spreadsheet mode: Formulas in Excel cells controldata transfer and command execution, auto-matic recalculation is controlled by Excel

    Throughout the rest of this article, we will describehow to use scratchpad mode and macro mode for proto-typing and implementing the application.

    For more information on RExcel and statconn-DCOM see Baier and Neuwirth (2007).

    Implementing the classicationand regression treeWith RExcel s tool set we can start developing our ap-plication in Excel immediately. RExcel has a scratch-pad mode which allows you to write R code directly

    into a worksheet and to run it from there. Scratchpadmode is the user interaction mode we use when pro-totyping the application. We will then transform theprototypical code into a real application. In prac-tice, the prototyping phase in scratchpad mode will be omitted for simple applications. In an Excel hostedR application we also want to transfer data betweenExcel and R, and transfer commands may be embed-ded in R code. An extremely simplistic example of code in an R code scratchpad range in Excel mightlook like this:

    #!rput inval ' Sheet1 ' !A1result

  • 8/11/2019 RJournal_2011-2_Baier~et~al

    4/7

    8 CONTRIBUTED RESEARCH A RTICLES

    Command Description

    #!rput variable range store the value (contents) of a range in an R variable#!rputdataframe variable range store the value of a range in an R data frame#!rputpivottable variable range store the value of a range in an R variable#!rget r-expression range store the value of the R expression in the range#!rgetdataframe r-expression range store the data frame value of the R expression in the range#!insertcurrentplot cell-address insert the active plot into the worksheet

    Table 1: Special RExcel commands used in sheets

    Figure 3: Graphical representation of the classication tree in the RExcel based application.

    The R Journal Vol. 3/2, December 2011 ISSN 2073-4859

  • 8/11/2019 RJournal_2011-2_Baier~et~al

    5/7

    CONTRIBUTED RESEARCH A RTICLES 9

    The potential of this approach is obvious: Verycomplex models, such as classication and regressiontrees, are made available to decision makers (in thiscase, the risk manager) directly in Microsoft Excel.See Figure 2 on page 6 for a screen-shot of the presen-

    tation of the results.

    Tidying up the spreadsheetIn an application to be deployed to end users, R codeshould not be visible to the users. Excels mechanismfor performing operations on data is to run macroswritten in VBA (Visual Basic for Applications, theprogramming language embedded in Excel). RExcelimplements a few functions and subroutines whichcan be used from VBA. Here is a minimalistic exam-ple:

    Sub RExcelDemo()RInterface.StartRServerRInterface.GetRApply "sin", _

    Range("A2"), Range("A1")RInterface.StopRServer

    End Sub

    GetRApply applies an R function to argumentstaken from Excel ranges and puts the result in a range.These arguments can be scalars, vectors, matrices ordataframes. In the above example, the function sin isapplied to the value from cell A1 and the result is putin cell A2. The R function given as the rst argumentto GetRApply does not necessarily have to be a namedfunction, any function expression can be used.

    RExcel has several other functions that may be called in VBA macros. RRun runs any codegiven as string. RPut and RGet transfer matri-ces, and RPutDataframe and RGetDataframe transferdataframes. RunRCall similar to GetRApply calls anR function but does not transfer any return value toExcel. A typical use of RunRCall is for calling plotfunctions in R. InsertCurrentRPlot embeds the cur-rent R plot into Excel as an image embedded in aworksheet.

    In many cases, we need to dene one or moreR functions to be used with GetRApply or the otherVBA support functions and subroutines. RExcel hasa mechanism for that. When RExcel connects to R (us-ing the command RInterface.StartRServer), it will checkwhether the directory containing the active workbookalso contains a le named RExcelStart.R . If it ndssuch a le, it will read its contents and evaluate themwith R (source command in R). After doing this, REx-cel will check if the active workbook contains a work-sheet named RCode . If such a worksheet exists, its

    contents also will be read and evaluated using R.Some notes on handling R errors in Excel: The R im-plementation should check for all errors which areexpected to occur in the application. The applicationitself is required to pass correctly typed arguments

    when invoking an R/ function. If RExcel calls a func-tion in R and R throws an error, an Excel dialog boxwill pop up informing the user of theerror. An alterna-tive to extensive checking is to use Rs try mechanismto catch errors in R and handle them appropriately.

    Using these tools, we now can dene a macro per-forming the actions our scratchpad code did in theprevious section. Since we can dene auxiliary fun-cions easily, we can also now make our design moremodular.

    The workhorse of our application is the followingVBA macro:

    Sub PredictApp()Dim outRange As RangeActiveWorkbook.Worksheets("predict") _

    .ActivateIf Err.Number 0 Then

    MsgBox "This workbook does not " _ & "contain data for rpartDemo"

    Exit SubEnd IfClearOutputRInterface.StartRServerRInterface.RunRCodeFromRange _

    ThisWorkbook.Worksheets("RCode") _ .UsedRange

    RInterface.GetRApply _ "function(" _

    & "trainingdata,groupvarname," _ & "newdata)predictResult(fitApp("_ & "trainingdata,groupvarname)," _ & "newdata)", _

    ActiveWorkbook.Worksheets( _ "predict").Range("R1"), _

    AsSimpleDF(DownRightFrom( _ ThisWorkbook.Worksheets( _

    "database").Range("A1"))), _ "V16", _

    AsSimpleDF(ActiveWorkbook _ .Worksheets("predict").Range( _

    "A1").CurrentRegion)

    RInterface.StopRServerSet outRange = ActiveWorkbook _

    .Worksheets("predict").Range("R1") _ .CurrentRegion

    HighLight outRangeEnd Sub

    The function predictResult is dened in theworksheet RCode .

    Packaging the applicationSo far, both our code (the R commands) and data have been part of the same spreadsheet. This may be conve-nient while developing the RExcel -based application

    The R Journal Vol. 3/2, December 2011 ISSN 2073-4859

  • 8/11/2019 RJournal_2011-2_Baier~et~al

    6/7

  • 8/11/2019 RJournal_2011-2_Baier~et~al

    7/7

    CONTRIBUTED RESEARCH A RTICLES 11

    The VBA Add-In and the R package createdthroughout this article

    For the simplest setup, all compontents are in-stalled locally. As an alternative, you can also install

    Excel, RExcel

    and our newly built VBA Add-In onevery workstation locally and install everything elseon a (centralized) server machine. In this case, R,rscproxy , our applications R package and statconn-DCOM are installed on the server machine and onehas to congure RExcel to use R via statconnDCOMon a remote server machine. Please beware that thiskind of setup can be a bit tricky, as it requires a cor-rect DCOM security setup (using the Windows tooldcomcnfg).

    DownloadingAll examples are available for download from theDownload page on http://rcom.univie.ac.at .

    BibliographyE. I. Altman. Bankruptcy, Credit Risk, and High Yield

    Junk Bonds. Blackwell Publishers Inc., 2002.

    T. Baier. rcom: R COM Client Interface and internalCOM Server, 2007. R package version 1.5-1.

    T. Baier and E. Neuwirth. R (D)COM Server V2.00,2005. URL http://cran.r-project.org/other/DCOM.

    T. Baier and E. Neuwirth. Excel :: COM :: R . Com- putational Statistics , 22(1):91108, April 2007.URL http://www.springerlink.com/content/uv6667814108258m/ .

    Basel Committee on Banking Supervision. Interna-tional Convergence of Capital Measurement andCapital Standards. Technical report, Bank for In-ternational Settlements, June 2006. URL http://www.bis.org/publ/bcbs128.pdf .

    L. Breiman, J. Friedman, R. Olshen, and C. Stone.Classication and Regression Trees. Wadsworth andBrooks, Monterey, CA, 1984.

    J. M. Chambers. Programming with Data. Springer,New York, 1998. URL http://cm.bell-labs.com/cm/ms/departments/sia/Sbook/ . ISBN 0-387-98503-4.

    R. Drexel. ROOo, 2009. URLhttp://rcom.univie.ac.at/ .

    T. Hastie, R. Tibshirani, and J. Friedman. The Elementsof Statistical Learning: Data Mining, Inference, andPrediction. Springer, 2009.

    INRIA. Scilab. INRIA ENPC, 1989. URL http:

    //www.scilab.org .B. D. McCullough and B. Wilson. On the accuracy of

    statistical procedures in Microsoft Excel 2000 andExcel XP. Computational Statistics and Data Analysis,40:713721, 2002.

    Microsoft Corporation. Microsoft Ofce 2000/Vi-sual Basic Programmers Guide. In MSDN Li-brary, volume Ofce 2000 Documentation. Mi-crosoft Corporation, October 2001. URL http://msdn.microsoft.com/ .

    Microsoft Corporation and Digital Equipment Corpo-ration. The component object model specication.Technical Report 0.9, Microsoft Corporation, Octo- ber 1995. Draft.

    OpenOfce.org. OpenOfce, 2006. URL http://www.openoffice.org/ .

    OpenOfce.org. Uno, 2009. URL http://wiki.services.openoffice.org/wiki/Uno/ .

    H. Zhang and B. Singer. Recursive Partitioning in the Health Sciences. Springer-Verlag, New York, 1999.ISBN 0-387-98671-5.

    Thomas BaierDepartment of Scientic ComputingUniversitity of Vienna1010 Vienna [email protected]

    Erich NeuwirthDepartment of Scientic Computing

    Universitity of Vienna1010 Vienna [email protected]

    Michele De MeoVenere Net SpaVia della Camilluccia, 69300135 [email protected]

    The R Journal Vol. 3/2, December 2011 ISSN 2073-4859

    http://rcom.univie.ac.at/http://rcom.univie.ac.at/http://cran.r-project.org/other/DCOMhttp://cran.r-project.org/other/DCOMhttp://cran.r-project.org/other/DCOMhttp://www.springerlink.com/content/uv6667814108258m/http://www.springerlink.com/content/uv6667814108258m/http://www.springerlink.com/content/uv6667814108258m/http://www.bis.org/publ/bcbs128.pdfhttp://www.bis.org/publ/bcbs128.pdfhttp://cm.bell-labs.com/cm/ms/departments/sia/Sbook/http://cm.bell-labs.com/cm/ms/departments/sia/Sbook/http://cm.bell-labs.com/cm/ms/departments/sia/Sbook/http://rcom.univie.ac.at/http://rcom.univie.ac.at/http://rcom.univie.ac.at/http://www.scilab.org/http://www.scilab.org/http://www.scilab.org/http://msdn.microsoft.com/http://msdn.microsoft.com/http://msdn.microsoft.com/http://www.openoffice.org/http://www.openoffice.org/http://www.openoffice.org/http://wiki.services.openoffice.org/wiki/Uno/http://wiki.services.openoffice.org/wiki/Uno/http://wiki.services.openoffice.org/wiki/Uno/mailto:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]:[email protected]://wiki.services.openoffice.org/wiki/Uno/http://wiki.services.openoffice.org/wiki/Uno/http://www.openoffice.org/http://www.openoffice.org/http://msdn.microsoft.com/http://msdn.microsoft.com/http://www.scilab.org/http://www.scilab.org/http://rcom.univie.ac.at/http://rcom.univie.ac.at/http://cm.bell-labs.com/cm/ms/departments/sia/Sbook/http://cm.bell-labs.com/cm/ms/departments/sia/Sbook/http://www.bis.org/publ/bcbs128.pdfhttp://www.bis.org/publ/bcbs128.pdfhttp://www.springerlink.com/content/uv6667814108258m/http://www.springerlink.com/content/uv6667814108258m/http://cran.r-project.org/other/DCOMhttp://cran.r-project.org/other/DCOMhttp://rcom.univie.ac.at/