© 2010 IBM Corporation
Business Analytics software
Extending and Customizing IBM SPSS Statistics with R, Python, and .NET
Jon Peck
Senior Software Engineer, IBM
peck@us.ibm.com
November, 2010
IBM SPSS Statistics
IBM® SPSS® Statistics has an extensive command language (syntax) for data acquisition, manipulation, and statistical and graphical procedures
Programmability and scripting dramatically extend these built-in capabilities
Allow custom user interfaces and output to be produced
Converting large SAS applications is likely to require the use of programmability
Agenda
Programmability introduction
Four examples
– Automating repetitive work: applySyntaxToFiles
– Integrating programs and scripting: SPSSINC MODIFY TABLES
– Adding a procedure from R: SPSSINC QUANTILE REGRESSION
– Adding a procedure in Python: SPSSINC TURF
Programmability increases your power, flexibility, and productivity
Generalization
– React flexibly to metadata, results, and the environment
– Benefit: Write fewer similar jobs
Automation
– Embed program logic in jobs
– Benefit: Less manual work
Extension
– Tap existing R or Python statistical modules
– Add your own or extend standard procedures and transformations
– Benefit: More capabilities
Integration
– Connect IBM SPSS Statistics inputs and outputs to other agents
– Benefit: Make IBM SPSS Statistics part of a larger production process
More productivity and more fun
IBM SPSS Statistics embeds three programming languages
Plug-ins let you extend capabilities using
– Python
– R
– .NET languages (Windows only)
Free plug-in downloads
SPSS Developer Central web site provides articles, SPSS-written modules, plug-ins and user contributions
–New SPSS Community on IBM myDeveloperWorks
GET FILE="c:/data/important.sav".
BEGIN PROGRAM PYTHON.
import spss
print("Hello, IBM")
END PROGRAM.
DESCRIPTIVES ....
Python or R program code goes in the normal Statistics syntax window
My first Python program
A program in the input stream can communicate with IBM SPSS Statistics and control it and use Python or R facilities and modules (internal mode)
spss.Submit("GET FILE='c:/data/cars.sav'.")
A Python or .NET application can embed IBM SPSS Statistics inside itself (external mode)
–User interface does not appear
There is a lower level C API available in an SDK
Programmability combines SPSS Statistics with Python, R, or .NET
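As a minimal sketch of internal mode, a program can assemble command syntax from a list of variable names and hand each command to spss.Submit. The helper names build_frequencies and run_commands and the variable names are hypothetical, and the submit argument defaults to print so that the sketch also runs outside Statistics.

```python
def build_frequencies(varnames):
    """Return one FREQUENCIES command covering the given variables."""
    if not varnames:
        raise ValueError("at least one variable name is required")
    return "FREQUENCIES VARIABLES=" + " ".join(varnames) + "."


def run_commands(commands, submit=print):
    """Pass each command string to `submit` and return how many were sent."""
    count = 0
    for command in commands:
        submit(command)
        count += 1
    return count


# Variable names here are assumed for illustration only.
commands = [
    "GET FILE='c:/data/cars.sav'.",
    build_frequencies(["origin", "cylinder"]),
]
run_commands(commands)
```

Inside a BEGIN PROGRAM block, one would presumably call run_commands(commands, submit=spss.Submit) so that the commands are executed rather than printed.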
Programmability functionality is fully integrated into IBM SPSS Statistics
Programs run in the regular syntax stream
Users can define IBM SPSS Statistics syntax for programs and scripts via the Extension mechanism.
Users can create dialog boxes and menus using the Custom Dialog Builder.
–Not just for extensions or programs
Python and R output appears in the Viewer
– plain text
– pivot tables
– charts
Python and R Programmability APIs cover these areas
State information of Statistics
Get/Set variable dictionary information
Get/Set data
Get Viewer output (via xmlworkspace)
Create tables/charts/text objects in Viewer
Run Statistics commands (Python only)
Python and VB scripting APIs cover user interface and output
Programmability is a backend (SPSS Processor) domain
Scripting is mainly a frontend (user interface, including output) domain
Managing output Viewer and objects
– tables: formatting, pivoting, editing, …
– objects: visibility, order, titles, outline text, …
General user interface control
Almost anything you can do via the user interface
Not available for R
Statistics, graphs, and data management via Statistics
Two pages of VB.NET code
.NET plug-in embeds Statistics inside another program
Example: Statistical Explorer
Python and R are open source software
Programmability plug-ins are an optional installation
– They are free (but require a Statistics license)
– They make it possible to tap the work of the Python and R communities
– Python and R have license agreements
– IBM non-warranty license agreement
– For R, GPL license
Extension commands eliminate need for user to learn Python or R
Extension mechanism lets you define IBM SPSS Statistics-style syntax for programs
IBM SPSS Statistics takes care of validation and parsing
Passes user input to a program in an easy-to-digest form
Automatically loaded when IBM SPSS Statistics starts
– Look to the user like built-in commands
Easy to distribute to others
Extension Name                    Description
PLS                               Partial least squares (P)
PROPOR                            Confidence intervals for proportions (P)
SPSSINC APRIORI                   Association rules (R)
SPSSINC BREUSCH PAGAN             Residual heteroscedasticity tests (R)
SPSSINC HETCOR                    Polychoric and polyserial correlation (P+R)
SPSSINC MFP GLM                   Fractional polynomial generalized linear models (R)
SPSSINC QQPLOT2                   Empirical Q-Q plots (R)
SPSSINC QUANTREG                  Quantile regression (R)
SPSSINC RAKE                      Adjust weights to control totals (P)
SPSSINC RANFOR & SPSSINC RANPRED  Random forests (R)
SPSSINC RASCH                     Rasch models (R)
SPSSINC ROBUST REGR               Robust regression (R)
SPSSINC TOBIT REGR                Tobit regression (R)
SPSSINC TURF                      TURF analysis (P)
Some statistical extensions on Dev Central
Extension Name            Description
FUZZY                     Case-control exact and approximate matching (P)
GATHERMD                  Gather data file metadata (P)
HIDECOLS                  Hide pivot table columns (P)
SCRIPTEX                  SCRIPT commands with parameters (P)
SETSMACRO                 Syntax for using variable sets (P)
SPSSINC ANON              Anonymize data (P)
SPSSINC COMPARE DATASETS  Compare two sav files (P)
SPSSINC CREATE DUMMIES    Create dummy variables for categories (P)
SPSSINC GETURI DATA       Read data from the Internet (P)
SPSSINC MERGE TABLES      Merge two pivot tables (P)
SPSSINC MODIFY OUTPUT     Set Viewer outline titling and styling (P)
SPSSINC MODIFY TABLES     Set pivot table cell and label styling (P)
SPSSINC TRANS             Apply Python functions to cases (P)
SPSSINC TRANSLATE         Translate Viewer output (P)
TEXT                      Create block of text in Viewer (P)
Some non-statistical extensions on Dev Central
– Write Python or R functions to implement the functionality or tap existing packages
• Use input APIs to get data to Python or R
• Use output APIs to create pivot tables
• Each can be a single line of code
– For extensions,
• Define the syntax in an XML file
• Use tools in extension.py (Python) or spsspkg (R) to receive parsed output and pass it to the implementing function
• New in v18: R version of extension.py
– Use the Custom Dialog Builder to create the interface
• The CDB is not just for extensions
– Test and document!
– Package and distribute
– Contributions to Developer Central are welcome
Documentation is at SPSS Developer Central
You can create and share your own additions to IBM SPSS Statistics
Example: SPSSINC BREUSCH PAGAN – implemented using an R package
SPSSINC_BREUSCH_PAGAN.xml specifies the syntax to the Statistics parser
The R mapping code in SPSSINC_BREUSCH_PAGAN.R respecifies the syntax and invokes the executing routine with parsed parameters
– overlaps with xml syntax definition but provides additional features
SPSSINC BREUSCH PAGAN DEPENDENT = salary ENTER = educ jobcat
  /OPTIONS MISSING=LISTWISE
  /SAVE RESIDUALSDATASET=resids COEFSDATASET=coefs.
Extension commands: validation and mapping from syntax to Python or R function parameters is handled for you
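The mapping idea can be illustrated with a toy sketch. It assumes, purely for illustration, that the parsed command arrives as a nested dictionary of subcommands and keywords; this is not the actual structure passed by Statistics nor the extension.py or spsspkg API, and breusch_pagan here is a hypothetical stand-in that only echoes its arguments.

```python
def breusch_pagan(dependent, enter, missing="LISTWISE",
                  residualsdataset=None, coefsdataset=None):
    """Hypothetical implementing function; it only echoes its arguments."""
    return {
        "dependent": dependent,
        "enter": enter,
        "missing": missing,
        "residualsdataset": residualsdataset,
        "coefsdataset": coefsdataset,
    }


def dispatch(parsed, func):
    """Flatten {subcommand: {keyword: value}} into keyword arguments."""
    kwargs = {}
    for keywords in parsed.values():
        for keyword, value in keywords.items():
            kwargs[keyword.lower()] = value
    return func(**kwargs)


# Assumed layout of the parsed command shown on the slide.
parsed = {
    "": {"DEPENDENT": "salary", "ENTER": ["educ", "jobcat"]},
    "OPTIONS": {"MISSING": "LISTWISE"},
    "SAVE": {"RESIDUALSDATASET": "resids", "COEFSDATASET": "coefs"},
}
result = dispatch(parsed, breusch_pagan)
print(result["dependent"], result["enter"])
```

The point of the design is that the implementing function sees ordinary keyword arguments and never has to parse command text itself.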
An XML file defines the syntax to the SPSS Universal Parser
Python or, in this case, R code gets the parsed syntax, which is turned into function arguments
Expand the audience by creating IBM SPSS Statistics syntax and dialog boxes
Example I
Generalize and automate work
You have syntax files and, every day, need to process datasets that are not known in advance
applySyntaxToFiles function applies a syntax file to each file in the input specification
Apply standard processing to an unknown set of files
Produce processed data and reports
Use programmability to automate routine processes
begin program.
import spss, spssaux3
spssaux3.applySyntaxToFiles(inputspec="c:/temp/parts/*.sav",
    syntax="c:/myjobs/dailychecks.sps",
    outputdatadir="c:/temp/processed",
    outputfiledir="c:/temp/processed",
    logfile="c:/temp/processed/report.txt")
end program.
dailychecks.sps could apply data cleaning rules, modify data, and create reports
Could be run daily through Production Mode or C&DS job scheduler or used interactively
Extended version available as SPSSINC PROCESS FILES
Use a program to drive processing
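To make the idea concrete, here is a rough sketch of what a function of this kind might do: expand a wildcard, then open each file, run the syntax file against it, and save the result. This is not the spssaux3 implementation; the name apply_syntax_to_files is hypothetical, output and log handling are omitted, and submit defaults to print so the sketch runs outside Statistics.

```python
import glob
import os


def apply_syntax_to_files(inputspec, syntax, outputdatadir, submit=print):
    """Run one syntax file against every data file matching inputspec.

    Returns the list of input files processed, in sorted order.
    """
    processed = []
    for datafile in sorted(glob.glob(inputspec)):
        outfile = os.path.join(outputdatadir, os.path.basename(datafile))
        submit("GET FILE='%s'." % datafile)      # open the data file
        submit("INSERT FILE='%s'." % syntax)     # run the standard job on it
        submit("SAVE OUTFILE='%s'." % outfile)   # keep the processed data
        processed.append(datafile)
    return processed


# Prints the generated commands; nothing is printed if no files match.
apply_syntax_to_files("c:/temp/parts/*.sav",
                      "c:/myjobs/dailychecks.sps",
                      "c:/temp/processed")
```

Because the file list is resolved at run time, the same job handles whatever files turn up each day. Inside Statistics, submit would presumably be spss.Submit.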
Example II
Automate dynamic or static formatting of tables
Use integrated scripting for better table presentation
• TableLooks provide static formatting for entire areas of a table
– data cells
– row and column layers
• You want tables with formatting beyond TableLooks
• Many users copy tables to Excel and manually format them
• Basic and Python Scripting provide a programmatic way to do formatting
• SPSSINC MODIFY TABLES provides syntax for extensive formatting
– Eliminates need to know scripting
– Uses Extension mechanism for programs and Python scripting
SPSSINC MODIFY TABLES extension command manipulates table formatting and structure
SPSSINC MODIFY TABLES SUBTYPE='Crosstabulation' DIMENSION=ROWS SELECT='Std. Residual'
  /STYLES TEXTSTYLE=BOLD BACKGROUNDCOLOR=255 0 0 APPLYTO='abs(x) >2'.
Use dynamic highlighting to make crosstab table easier to read
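The APPLYTO condition above is written in terms of x, the value of a cell. A small standalone sketch of that kind of selection, with made-up residual values and a hypothetical helper name, might look like this; it illustrates the idea only and is not how SPSSINC MODIFY TABLES is implemented.

```python
def cells_to_style(values, condition):
    """Return the indexes of cells whose value satisfies `condition`."""
    selected = []
    for index, x in enumerate(values):
        # Non-numeric cells (for example, blanks) are skipped, not styled.
        if isinstance(x, (int, float)) and condition(x):
            selected.append(index)
    return selected


# Made-up standardized residuals; None stands for an empty cell.
std_residuals = [0.4, -2.6, 1.9, 2.3, None, -0.1]
highlight = cells_to_style(std_residuals, lambda x: abs(x) > 2)
print(highlight)  # [1, 3]
```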
Dialog created with Custom Dialog Builder
Generates extension command syntax
Easy to distribute
Custom dialog boxes are easy to create
SPSSINC MODIFY TABLES subtype='variables in the equation' SELECT="B" "Sig."
  /STYLES TEXTCOLOR = 0 0 255 BACKGROUNDCOLOR=0 255 0.
Use static formatting to call out parts of a table
SPSSINC MODIFY TABLES SUBTYPE="Custom Table" SELECT = "Total" DIMENSION=ROWS
  /STYLES BACKGROUNDCOLOR=255 255 88 TEXTSTYLE = BOLD.
Format CTABLES totals to call them out
SPSSINC MODIFY TABLES SUBTYPE='Report' SELECT="<<ALL>>"
  /STYLES APPLYTO=DATACELLS TEXTCOLOR=255 255 255 TEXTSTYLE=BOLD
  CUSTOMFUNCTION="customstylefunctions.washColumnsBlue".
def washColumnsBlue(obj, i, j, numrows, numcols, section, more):
    # The blue component ramps from mincolor to maxcolor across the columns.
    # Note: the formula assumes numcols > 1.
    mincolor = 150.
    maxcolor = 255.
    increment = (maxcolor - mincolor)/(numcols-1)
    colorvalue = round(mincolor + increment * j)
    obj.SetBackgroundColorAt(i, j, RGB((mincolor, mincolor, colorvalue)))
Use custom functions for special effects
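The color arithmetic in washColumnsBlue can be checked on its own, without a pivot table. The sketch below reuses the same formula to list the background color each column would receive; blue_ramp is a hypothetical helper, and returning the minimum color for a single-column table is an assumption added here to avoid the division by zero.

```python
def blue_ramp(numcols, mincolor=150.0, maxcolor=255.0):
    """Return one (red, green, blue) tuple per column, left to right."""
    if numcols < 1:
        raise ValueError("numcols must be at least 1")
    base = int(mincolor)
    if numcols == 1:
        # Assumed behavior: a single column simply gets the starting color.
        return [(base, base, base)]
    increment = (maxcolor - mincolor) / (numcols - 1)
    return [(base, base, int(round(mincolor + increment * j)))
            for j in range(numcols)]


print(blue_ramp(4))
# [(150, 150, 150), (150, 150, 185), (150, 150, 220), (150, 150, 255)]
```

With four columns the blue component steps by 35 per column, so the leftmost column is neutral gray and the rightmost is the most saturated blue.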
Example III
Extend IBM SPSS Statistics by tapping the work of the R and Python communities
Add R procedures seamlessly to IBM SPSS Statistics
R
R is a programming language for statistics
– leading edge statistics
– many contributed statistics and graphics packages
– free
R is not so easy to learn
– Documentation by experts for experts
– Feels like a complex programming language – because it is
– Syntax is a lot like C
– Error in optim(rho, f, control = control, hessian = TRUE, method = "BFGS") : initial value in 'vmmin' is not finite
• Good for programmers(?); bad for users
R holds data in memory
R for SAS and SPSS Users, Bob Muenchen, Addison-Wesley, 2008
R procedures can be accessed from IBM SPSS Statistics using the R plug-in
The R plug-in makes it easy to use R packages
– IBM SPSS Statistics datasets and Viewer output can be processed by R using the plug-in
– Graphical, text, and table output appear in the Viewer
• Pivot tables can be created with R code
– New IBM SPSS Statistics datasets can be created from R
– R communicates with IBM SPSS Statistics via APIs in the plug-in
– Integration requires writing a little R wrapper code
– IBM SPSS Statistics can provide
• dialog box interface
• IBM SPSS Statistics-style syntax
• pivot table output
Plug-in is downloadable from Developer Central
Quantile regression models conditional quantiles
Ordinary regression models the conditional mean
Median regression is the 0.5 quantile (50th percentile) case
Estimating quantiles is useful with varying spread, asymmetries, outliers
Areas of application include
– empirical finance
• value at risk
• mutual fund investment styles
• credit scoring
– school quality
– demand analysis
– others
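The difference between modeling a quantile and modeling the mean can be seen in a tiny example without any regressors. Quantile regression minimizes the so-called check (pinball) loss; the sketch below, with made-up data and hypothetical function names, finds the constant that minimizes that loss and compares it with the mean. It is an illustration of the loss function only, not of the R quantreg package.

```python
def check_loss(residual, tau):
    """Check (pinball) loss for one residual at quantile level tau."""
    if residual >= 0:
        return tau * residual
    return (tau - 1.0) * residual


def best_constant(values, tau):
    """Return the data value that minimizes the total check loss."""
    return min(sorted(values),
               key=lambda c: sum(check_loss(v - c, tau) for v in values))


data = [1.0, 2.0, 3.0, 4.0, 100.0]  # made-up sample with one outlier
print(best_constant(data, 0.5))     # 3.0, the median
print(sum(data) / len(data))        # 22.0, the mean
```

The outlier pulls the mean far from the bulk of the data while the median stays put, which is one reason quantile methods are attractive when spread varies or outliers are present.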
SPSSINC QUANTREG extension embeds the R quantreg package
Example IV
Extend IBM SPSS Statistics by adding procedures in Python
TURF analysis
TURF Analysis is popular in market research
Total Unduplicated Reach and Frequency (TURF)
Find the highest coverage of positive responses for a small number of questions
Example: How do you reach the largest audience by advertising on a few kinds of sports?
• football, cricket, basketball, cycling, ...
Example: What ice cream flavors should you offer in your shops that have three dispensing machines?
Example: What phone features should you promote?
– multi-line, voicemail, paging, internet ...
Simple FREQUENCIES does not account for overlap
Must compute all possible set unions of positive responses (up to a maximum number of variables).
Each set is a list of case IDs with a positive response on a question.
This problem is computationally explosive
Calculations for best 10 combinations of variables
Variables   Set Union Calculations
3           4
6           57
12          4,070
24          4,540,361
48          8,682,997,422
Is a scripting language like Python too slow?
TURF calculations are demanding
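The counts in the table are consistent with the number of ways to choose between 2 and 10 variables out of the total. That reading is an inference from the figures rather than something stated on the slide, and union_calculations is a hypothetical helper name.

```python
from math import comb


def union_calculations(nvars, maxsize=10):
    """Count the variable combinations of size 2 up to maxsize."""
    top = min(nvars, maxsize)
    return sum(comb(nvars, k) for k in range(2, top + 1))


for nvars in (3, 6, 12, 24, 48):
    print(nvars, union_calculations(nvars))
# 3 4
# 6 57
# 12 4070
# 24 4540361
# 48 8682997422
```

Doubling the number of variables from 24 to 48 multiplies the work by roughly 1,900, which is what makes the problem explosive.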
Extension command SPSSINC TURF is implemented in Python
Provides
– Dialog box interface
– IBM SPSS Statistics style syntax
– The computations
– Pivot table output
Fewer than 300 lines of Python code
– Plus dialog box definition
– Plus extension command syntax definition
Executes requests involving a few million set comparisons in a few minutes
Initial version written in two days
Telco survey (9 variables, 1000 cases)
Dialog created with Custom Dialog Builder
Analysis of phone data
Pivot table created from Python code
Best singles are conference calling, call forwarding, and call waiting
Results show the combination of features with the best reach
Calculations completed in a few seconds
The best three are not the top three one at a time
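Why the best combination need not consist of the best individual items can be shown with a brute-force reach calculation on made-up data. This is only a sketch of the set-union idea, not the SPSSINC TURF source; best_reach is a hypothetical name, the response sets are invented, and frequency is ignored.

```python
from itertools import combinations


def best_reach(responses, size):
    """Return (reach, variables) for the combination with the largest union.

    `responses` maps each variable name to the set of case IDs that
    responded positively to it.
    """
    best = (-1, ())
    for combo in combinations(sorted(responses), size):
        reached = set().union(*(responses[name] for name in combo))
        if len(reached) > best[0]:
            best = (len(reached), combo)
    return best


# Invented data: the two most popular features appeal to the same people.
responses = {
    "callwait": {1, 2, 3, 4, 5},
    "forward": {1, 2, 3, 4},
    "voicemail": {6, 7, 8},
}
print(best_reach(responses, 1))  # (5, ('callwait',))
print(best_reach(responses, 2))  # (8, ('callwait', 'voicemail'))
```

The two most popular features together reach only 5 cases because their audiences overlap, whereas pairing the top feature with the least popular one reaches all 8.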
Python and R integration
Unification of programs and scripts
Custom Dialog Builder
Extensions
SPSS Developer Central is your friend
Where we have been today
Programmability increases your power, flexibility, and productivity with IBM SPSS Statistics
Generalization and automation
–applySyntaxToFiles
–SPSSINC MODIFY TABLES
Extension
–SPSSINC QUANTREG using R
–SPSSINC TURF using Python
–Many new extension commands available
Integration
–applySyntaxToFiles as part of a process
And it's still more fun