
The Very Basics of SPSS (ver.16 and up, Windows)

Statistical Computing Group @ Research Data Services, SAS, University of Pennsylvania

Last modified: 03/16/2009

This online seminar is to help you get started with basic data management and analysis in SPSS. It is for those who:

• Are new to statistical data work and want to learn to use SPSS to manage data and perform common analyses.

• Took an undergraduate (or perhaps graduate) introductory statistics course in SPSS long ago and want to refresh their memory.

SPSS is one of the most user-friendly commercial statistical packages. As such, even beginners in statistical analysis will find its point-and-click, dialogue-box interface very approachable and easy to use. However, in the long run, you will benefit a lot more from learning to drive SPSS through SPSS syntax. The pros of the syntax approach are:

• An efficient way to ensure documentation and reproducibility (this is the reason I would strongly discourage you from continuing to rely on the point-and-click approach).

• Much quicker and more efficient once you learn how to write and run syntax commands.

• Can perform things unavailable/inaccessible from the point-and-click menus.

So, this online seminar attempts to prepare you for writing syntax commands yourself in the future to perform simple data tasks and run basic procedures.

Aside from this online seminar, you have access to a lot of great instructional SPSS resources online for free. We strongly recommend that you use those resources to the full.

This online seminar assumes that you are using SPSS ver.16 or later. Be aware that there are some significant changes between ver.15 and earlier and ver.16 and later.

Contents:

1. Getting Started: Let’s Open SPSS and Bring in Data
2. How to Get Descriptive Statistics and Graphs
3. How to Define Variable Properties
4. How to Create and Recode Variables
5. How to Subset (Select) Data
6. How to Sort and Split Data
7. Simple Regression Example

http://delicious.com/SCG_RDS/SPSS



1. Getting Started: Let’s Open SPSS and Bring in Data

In this section, we will play with the SPSS windows and get a broad idea of what things look like in the SPSS environment. In so doing, we will also learn how to bring data into SPSS. By the end of this section we will have a rough but good idea of what role each window plays in your data management and analysis work and how to get SPSS ready for our work.

First, let’s create a working directory for this practice in whatever location you prefer. It is always a good idea to keep one project directory for one project. Let’s call this new working directory “verybasicSPSS.” Then download from here the data files we are going to use in this online workshop. Save them in the working directory you just created.

Now, let’s launch the program.

Click on the SPSS icon, OR choose SPSS 1x.x for Windows (whatever version you have; this workshop is using Version 16) from the Windows Start menu:

Start > Programs > SPSS for Windows > SPSS 1x.x for Windows

Some of you may have a dialogue box (called “SPSS for Windows 1x.x [your version]”) popping up asking “What would you like to do?”. In that case, let’s just click on “Cancel” for now and go see what windows we have in the SPSS interface.

You should be seeing an untitled SPSS Data Editor window now (like below).

The Data Editor window shows you the working (= currently open) dataset in a spreadsheet format. Of course, it is now new and empty. You see there are two sheets in this window, the Data View and the Variable View. Currently, the Data View is active (in yellow). You see a message from SPSS at the bottom of the window. Currently, it is “SPSS Processor is ready” for your work.

http://www.ssc.upenn.edu/scg/spss/sps1_verybasicSPSS_dta.zip



Before getting started with our work, let’s change the output settings. From the menu bar at the top (of whichever window), go:

Edit

Options…

This will bring up the “Options” dialogue box. Here, you can control what to display in your output. Click the “Viewer” tab. Here is one setting I strongly recommend that you choose: “Display commands in log.” You will see why in a moment. For now, just check the box and click OK.

Now, let’s first bring in some data. We’ll use these files for this practice.

• xls_gss93.xls
• csv_gss93subset.csv
• fix_gss93subset.dat
• GSS93 subset.sav

* The second and third files are subsets of the last one, the “GSS93 subset” data (7 variables, 97 observations), in different formats for our practice purposes.

Reading Data from Excel Files

We will start with importing the Excel file “xls_gss93.xls” into SPSS. First, open the Excel file and understand how it is formatted. The first row has variable names, and the data part is from the second row and below. Close the Excel file and let’s start reading this file into SPSS.



Start SPSS by clicking on the SPSS icon or from the Windows Start menu. From the SPSS menu bar at the top, go:

File

Open

Data…

This brings up a dialogue box “Open Data” as shown below. Go to your working directory “verybasicSPSS,” then select “Excel (*.xls)” format from “Files of type.” Then select the Excel file “xls_gss93.xls” you saved there. Then click Open (see below for a visualized instruction).

(1) Go to your working directory where you saved the Excel file.

(2) “Files of type” is Excel. This brings up our Excel file in the above window.

(3) Select this file.

(4) Click Open.

Now you should be seeing another dialogue box “Opening Excel Data Source.”



As we first checked, the Excel file has variable names in the first row. So check the “Read variable names from the first row of the data” box. Click OK. Now you have a new, unsaved dataset in another SPSS Data Editor window. To save the data in the SPSS format, go from the pull-down menu:

File

Save

Let’s save it in your working directory with the name “xls_gss93.” Let’s keep this data for a moment.
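For reference, the same save can be written as syntax. This is a minimal sketch, assuming the Excel data is the active dataset; the bracketed path is a placeholder for your own working directory.

```spss
* Save the active dataset as an SPSS system file (.sav).
save outfile='[specify your working directory]\xls_gss93.sav'.
```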

Reading Data from Text Files (comma-separated-values)

Okay, next we will try importing “csv_gss93subset.csv,” a text file in the comma-separated values format. The first line contains variable names. From the menu bar at the top, go:

File

Open

Data…

This again should bring up a dialogue box called “Open Data.” Make sure you are looking in your working directory “verybasicSPSS.” Since our file extension is .csv, we need to select “All Files (*.*)” from “Files of type.” Then “csv_gss93subset.csv” shows up in the window. Select it and click Open.



Then the “Text Import Wizard” dialogue box shows up, which has six steps. Click Next, and in Step 2 of 6, select the “Yes” radio button for the question “Are variable names included at the top of your file?” because we do have variable names in the first row. Otherwise accept the default settings and keep moving on by clicking Next. Then, in Step 6 of 6, click Finish.

You should see another SPSS Data Editor window [DataSet2] popping up. As is clear, multiple data files can be open simultaneously in SPSS (we’ll say a bit more about this later). Browse the one you just read in, and let’s just close it without saving.



Reading Data from Text Files (ASCII fixed format)

Finally, let’s practice reading the data “fix_gss93subset.dat,” which is an ASCII fixed format file. This type of data always comes with a codebook that specifies which column corresponds to which variable. Take a look at this text file (the data lines shown first below; notice there is no variable name header) and its codebook (the table that follows them).

   11320 3143
   215 0 2044
   31325 2043
   425 0 4045
   555 0 1078
   65125 2283
   71122 2255
   85124 3275
   91322 1231
  1025 0 1054

Variable Name    Column Number
id               1-4
wrkstat          5
marital          6
agewed           7-8
sibs             9-10
childs           11
age              12-13

Now, unlike the previous examples, there is no easy point-and-click method to read this type of data. What do we do then? The best way is to write syntax commands ourselves to bring in this data. Let’s open a new syntax file for this work. From the menu bar at the top, go:

File

New

Syntax

You now should be seeing the SPSS Syntax Editor, which is another important window in the SPSS environment (as I emphasized in the introduction, you should eventually learn to use and write syntax files for your work; you are now getting a little glimpse of it). Let’s save it as “verybasicspss” in your working directory. Now, let’s type the following commands (be sure to specify the file location where you saved the file “fix_gss93subset.dat”).

data list fixed file='[specify your working directory]\fix_gss93subset.dat'
 / id 1-4 wrkstat 5 marital 6 agewed 7-8 sibs 9-10 childs 11 age 12-13.

Always end your command with a period.

The command DATA LIST reads a text format data file by assigning names and formats to each variable in the file. The keyword FIXED follows to tell SPSS that our data is in a fixed format (actually, this is the SPSS default so you can skip it). The subcommand FILE = “file location/name here” specifies your fixed format file and its location. After the slash (/) we provide SPSS with variable definitions (the variable names and column numbers) from the codebook.

Two syntax rules you must remember here:

1. Notice that the whole command ended with a period (“.”). In SPSS, each command must be completed with a period.

2. SPSS Syntax is NOT case-sensitive.



Now, let’s execute our commands. First, highlight them; then, to run the highlighted part, hit the Run Current button or alternatively hit the Ctrl + R keys. When you run the above command, another Data Editor window should open for this new data. But what did you get there?

You should be seeing a blank spreadsheet under the “*Untitled4 []” heading, although from the Variable View it looks like SPSS has the variable information. Why aren’t we seeing the data itself?

To see the data, we need to run another command that actually uses this data (because to use this data, SPSS needs to read it!). Get back to your Syntax Editor, and first make sure we are working on this data set.

This pull-down menu indicates your active data source.

When you have multiple data files open at the same time, you need to tell SPSS which data file you are working on (this is called the “active” file). You can make your file active by simply clicking anywhere in the Data Editor window of the data you want to use (in this case, “*Untitled4 []”), or, when you have your Syntax Editor open, you can use the pull-down menu (in this case, it should be set to “Unnamed” since “*Untitled4 []” is neither saved nor named).
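If you prefer not to rely on window clicks, a dataset can also be named and activated from syntax. The sketch below assumes “*Untitled4 []” is currently the active dataset; the name fixdata is made up for illustration.

```spss
* Give the currently active, unnamed dataset a name (fixdata is a hypothetical name).
dataset name fixdata.
* Later, after working on another dataset, switch back to it by name.
dataset activate fixdata.
```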

Once you make sure “*Untitled4 []” is active, type in the following command (don’t forget the period), highlight it, and run it.

list.

Now, what do you have in your Data Editor and Output Viewer? You should now be seeing the data content in the Data Editor, and the command LIST has been executed with its result shown in the Output Viewer.

The point is this: SPSS just keeps the DATA LIST definition in memory and does not read the data until it needs to, because that’s efficient in terms of processing. In this example, SPSS encounters the procedural command LIST, realizes it needs the data “*Untitled4 []” to execute LIST and produce results on that data, and only at that moment does it read in the data.



But suppose you want to explicitly force a data pass so that you can immediately see the read-in data in the Data Editor. The command EXECUTE does that for you. If you run the following,

data list fixed file='[specify your working directory]\fix_gss93subset.dat'
 / id 1-4 wrkstat 5 marital 6 agewed 7-8 sibs 9-10 childs 11 age 12-13.

execute.

… then you would immediately see the result in your Data Editor without running any procedural command. EXECUTE forces a pass through all the data (including pending data transformations, where you for example create or recode variables and need to read the data again so that the new variables can be used), but it does nothing else to the session. It just forces a data pass.

But as I said, SPSS reads the data when it needs to anyway, so EXECUTE is rarely necessary. In fact, using EXECUTE after every single data transformation command slows down the processing because SPSS is forced to read the data at every single EXECUTE, even when data reading is unnecessary at that moment. So you should use EXECUTE sparingly. We will come back to this command later and discuss a couple of situations where you absolutely must run EXECUTE.
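To illustrate why EXECUTE is usually unnecessary, here is a small sketch using the fixed format data we just read. The variable name agesq is hypothetical. The COMPUTE transformation stays pending until a procedure needs the data; DESCRIPTIVES is a procedure, so it triggers the data pass by itself.

```spss
* Create a new variable; this transformation is pending until the data is read.
compute agesq = age * age.
* A procedure reads the data, so the pending COMPUTE is carried out here without EXECUTE.
descriptives variables=agesq.
```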

Anyway, let’s take a look at our output.

We checked “Display commands in log” in the Options menu, so SPSS displays the syntax it ran in the output.



You see the data content listed. You can save your output by going from the drop-down menu.

File > Save As…

The file extension for SPSS output is .spv (Note: SPSS versions older than 16 use the extension .spo for output files. To open and view .spo files in SPSS version 16 or later, you need to install the SPSS Legacy Viewer. For more information, see the SPSS technical support website). The left-side pane is SPSS’s outline view of your output. It serves as a table of contents and allows you to navigate to different parts of your output by clicking on the small output icons.

You can also see now why I strongly recommended that you set “Display commands in log” in the “Options” menu. Notice that in the output, you see all the syntax we have run so far printed out, even the commands you didn’t write yourself. First, having the actual commands you ran along with the corresponding output helps you greatly with documentation. You can always see what commands and options you used to generate the output that you have, and you can always keep track of exactly what you did with the data. This is very important.

Further, be aware that SPSS syntax like this is running beneath the point-and-click interface, even when you simply use the pull-down menus without writing syntax commands yourself and do not see the actual commands SPSS runs. As mentioned earlier, you should eventually learn to write commands yourself in the Syntax Editor and run them from there, instead of pointing and clicking. This is also very important for documentation and reproducibility.

Had you written and run the following syntax commands yourself, you would have gotten the same results. They were the syntax running beneath your pointing and clicking.

Read an Excel file:

get data
 /type = xls
 /file = '[specify your working directory]\verybasicSPSS\xls_gss93.xls'
 /sheet = name "xls_gss93subset"
 /readnames = on.
execute.

Read a Comma-Separated-Values file:

get data
 /type = txt
 /file = '[specify your working directory]\verybasicSPSS\csv_gss93subset.csv'
 /firstcase = 2
 /delimiters = ","
 /variables =
 id f2.0
 wrkstat f1.0
 marital f1.0
 agewed f2.0
 sibs f2.0
 childs f1.0
 age f2.0.
execute.

http://support.spss.com/newSupport/Student/Utilities/SPSS/LegacyViewer/readme.html

http://www.spss.com/statistics/changes.htm

The command GET DATA is to read external files into SPSS.

For further syntax help, you always can go from the menu bar at the top,

Help

Command Syntax Reference

We will learn some additional basics of command writing throughout this workshop.

We can also directly input the data from the Syntax Editor . Type the following lines.

* Read dataline from syntax file .
data list / id 1-3 sex 5 (A) age 7-8 treat 10.
begin data
001 f 43 0
002 f 25 1
003 m 36 0
end data.

We use two commands. One is DATA LIST (we already learned it), and the other is the pair BEGIN DATA and END DATA. Again, notice that the DATA LIST command and END DATA are each terminated by a period. BEGIN DATA and END DATA are used when data are entered within the command sequence, and the data records are placed in between.

One important thing you need to remember from this example is this part:

sex 5 (A)

By default, SPSS treats variables as numeric. The variable sex here is a character variable (f/m). By putting (A) after the variable name and the column number, you tell SPSS that this is a character variable.
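As a further sketch, a string variable wider than one column is declared the same way, with a column range followed by (A). The variable names and data records below are invented for illustration.

```spss
* name occupies columns 5-9 and is read as a string because of (A).
data list / id 1-3 name 5-9 (A) score 11-12.
begin data
001 alice 90
002 bob   85
end data.
list.
```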

Open SPSS files

Now, let’s open an SPSS system file “GSS93 subset.sav.” This is actually the easiest part. From the menu bar of the SPSS Data Editor window at the top, go:

File

Open

Data…



Find and open “GSS93 subset.sav” by double-clicking on it or choosing it and hitting OK.
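The syntax equivalent is the GET command. A minimal sketch, with the bracketed path again standing in for your own working directory:

```spss
* Open an SPSS system file (.sav).
get file='[specify your working directory]\GSS93 subset.sav'.
```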

There you go. Let’s click on the tab of the “Variable View” sheet and see what you have there.

Here’s what you should be seeing now.



The Data View sheet and the Variable View sheet look very similar, but the latter has information about the variables in the data set shown in the Data View sheet, including variable names, data type (numeric, string, etc.), variable and value labels, how the missing values are coded, etc.

Most of those information cells have hidden dialogue boxes or pull-down menus which you can call up by selecting the cell and then clicking on the gray button that comes up on the right side of the cell. For example, let’s try activating the dialogue box for the values of the variable “marital.”

Can also get each variable’s information here.

(1) Click on this button.

(2) Then the value labels dialogue box for the variable “marital” shows up.

Now, technically this box allows you to define/modify the value labels of the variable. However, I just brought this up to warn you in case you happen to find it and want to use it. DON’T USE THIS BOX for data management purposes. Although it looks easy, using this dialogue box is dangerous. It makes it extremely difficult to keep track of the changes you made to the data, because it does not leave any record of your actions. You should use the Syntax Editor instead. Let’s just click Cancel to close the “Value Labels” dialogue box.
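If you only want to look at a variable’s properties, one read-only option from the Syntax Editor is DISPLAY DICTIONARY, sketched below for marital. It prints the dictionary information in the Output Viewer and changes nothing in the data.

```spss
* Show the stored dictionary information (labels, format, value labels, missing values) for marital.
display dictionary /variables=marital.
```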

Let’s close all the data sources other than the one we just read in, “GSS93 subset.sav.”

2. How to Get Descriptive Statistics and Graphs

In this section, we will learn how to explore our data by running simple descriptive statistics and drawing graphs. We do both the point-and-click approach and the syntax approach.

Let’s get started. We still have the “GSS93 subset.sav” data in SPSS (if not, open it). We first want to have descriptive information on the variable “educ.” From the menu bar at the top, go:

Analyze

Descriptive Statistics

Descriptives…

This brings up a dialogue box named “Descriptives.” You have a list of the variables in the left pane. Let’s select the variable educ (“Highest Year of School Completed”) by double-clicking on it, OR by highlighting the variable (you can select multiple variables by clicking on them while pressing and holding the Ctrl key) and then hitting the arrow button between the two panes.

Then let’s click the Options… button and you have the “Descriptives: Options” dialogue box. We can check the boxes of the statistics we want to see. So suppose we want the mean, standard deviation, minimum, maximum, and skewness values of this variable. Check them and click Continue. You are back in the “Descriptives” box.



(1) Select variables by highlighting them and hitting the arrow button.

(2) Options brings up the “Descriptives: Options” dialogue box.

(3) Click Continue after selecting the options you want.

(4) Then click Paste.

Okay, we are ready to get descriptive statistics for this variable. Now, let’s click Paste. What did you get? You should now have an SPSS Syntax Editor window like the one below.

SPSS commands are completed with a period.

What we just did is paste the syntax command that SPSS writes to obtain descriptive statistics. As I emphasized, always be aware that SPSS syntax commands like this are running beneath the point-and-click interface, even when you simply use the pull-down menus and click OK. You should learn to write SPSS syntax yourself eventually.

Now, let’s take a look at the pasted command. The SPSS command to get descriptive statistics is DESCRIPTIVES followed by its subcommand VARIABLES = varname. The most basic structure of the SPSS syntax command language is:

COMMAND <options if any> / [SUBCOMMAND <options if any>] .

The slash (/) separates subcommands. But this basic form can vary slightly from command to command. In DESCRIPTIVES, for example, the subcommand VARIABLES immediately follows the command DESCRIPTIVES, before the first slash (/).



It is always a good idea to add comments to your syntax file for documentation purposes. Use an asterisk (*) or the command COMMENT to start your comment text. Remember, all SPSS commands must end with a period, and that rule applies to comments as well. It is imperative to indicate the end of your comment with a period. Let me show you why. Run the following block of commands. What did you get in your Output Viewer?

* Descriptives for years of education
DESCRIPTIVES VARIABLES=educ
 /STATISTICS=MEAN STDDEV MIN MAX SKEWNESS .

You got nothing, except for the log of the syntax you just ran. Why? Because SPSS treats everything between * (or COMMENT) and the next period as your comment. In this case, “Descriptives for… SKEWNESS.” is all treated as one block of comment, so DESCRIPTIVES … was not executed as a command (… and you are left dumbfounded to find no computation results shown in your Output Viewer). So, you must always end your comment with a period.

The flip side of this example is that you can start with * or COMMENT and keep commenting over multiple lines until you end the comment with a period. This may be helpful if you need to add extensive comments to your syntax. So let’s fix our syntax.

* Descriptives for years of education
  We can comment over multiple lines,
  Just don’t forget to end it with a period .

DESCRIPTIVES
 VARIABLES=educ
 /STATISTICS=MEAN STDDEV /*standard deviation*/ MIN MAX SKEWNESS .

A block of comment between an * and a period (“.”), over multiple lines.

Notice you have your comment over multiple lines. As you can see, alternatively, you can use /* COMMENT HERE */ as well. In this case, */ instead of a period indicates the end of your comment. This way comments can be inserted within your command lines.
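The COMMENT keyword mentioned above works the same way as the asterisk. A short sketch:

```spss
COMMENT Descriptives for years of education.
DESCRIPTIVES VARIABLES=educ.
```

Here the period after “education” ends the comment, so DESCRIPTIVES runs as a command of its own.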

Let’s execute this syntax command, including the comment. Select (highlight) the whole syntax command and hit the Run Current button at the top of the Syntax Editor or hit the Ctrl + R keys.

See your results displayed in the SPSS Output Viewer .



SPSS has the output index table in the left pane. Click any listed index, and SPSS navigates you to the corresponding result objects in the right pane (feel free to try). I highlight the descriptives to bring the corresponding output into view.



You can copy and paste those output items. As an example, try right-clicking on the descriptive table, selecting “Copy,” and then pasting it into your word processor document.

The average number of years of school completed is 13.04. Surprisingly, there are people with zero education. There seems to be no real concern about skewness.

How different is the mean number of years of school completed between males and females? Let’s compare their mean values.

Analyze

Compare means

Means…

This will bring up the “Means” dialogue box. Select the respondent’s educ (“Year of school completed”) variable under the “Dependent list” heading, and respondent’s sex for “Independent list.” Click Options… and add median and skewness to your statistics, and then click Continue.

Then click Paste. Highlight and run the command.

* Comparing years of education by sex .
MEANS TABLES=educ BY sex
 /CELLS MEAN COUNT STDDEV MEDIAN SKEW.

You should be seeing the result that on average the highest year of education completed is 13.19 for male respondents and 12.92 for female respondents. The median value seems close to the mean value for males, so we would expect the variable to be mostly normally distributed. Let’s visualize it.

Graphs

Legacy Dialogs

Histogram…



The “Histogram” box pops up. Select the education variable for “Variable,” check the “Display normal curve” box, and choose “Respondent’s Sex” to panel our histogram by column.

Paste the syntax, highlight it, and run.

* Histogram of years of education by sex .
GRAPH
 /HISTOGRAM(NORMAL)=educ
 /PANEL COLVAR=sex COLOP=CROSS.

And you get your histogram like below.



Not bad distributions, but (not surprisingly) the highest year of education is 12 for a great many people, especially among female respondents.

Stem-and-leaf plots and box plots are also often used to check variables’ distributions and extreme values. Here, we use the command EXAMINE and get the descriptive information all at once.

Analyze

Descriptive Statistics

Explore…

The “Explore” dialogue box shows up. Select “educ” (Highest year of school completed) for “Dependent List” and the sex variable (“Respondent’s sex”) for the Factor List. Then first click on Statistics and check the “Descriptives” and the “Percentiles” boxes. Continue.

Then click Plots, and check the “Factor levels together for Boxplot” and the “Stem-and-leaf” boxes under the “Descriptive” heading.



Continue, Paste, and run it.

* Boxplot and Stem and leaf .
EXAMINE VARIABLES=educ BY sex
 /PLOT BOXPLOT STEMLEAF
 /COMPARE GROUP
 /PERCENTILES(5,10,25,50,75,90,95) HAVERAGE
 /STATISTICS DESCRIPTIVES
 /CINTERVAL 95
 /MISSING LISTWISE
 /NOTOTAL.

[Descriptives and percentiles output omitted]

Check the legends (highlighted) to see what the stem and leaf represent in your output.

Highest Year of School Completed Stem-and-Leaf Plot for sex= Male

Frequency Stem & Leaf

11.00 Extremes (=<5.0)

10.00 6 . 00000

13.00 7 . 000000

30.00 8 . 000000000000000

17.00 9 . 00000000

21.00 10 . 0000000000

33.00 11 . 0000000000000000

172.00 12 . 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000

55.00 13 . 000000000000000000000000000

57.00 14 . 0000000000000000000000000000

34.00 15 . 00000000000000000

95.00 16 . 00000000000000000000000000000000000000000000000

25.00 17 . 000000000000

32.00 18 . 0000000000000000

16.00 19 . 00000000

18.00 20 . 000000000

Stem width: 1

Each leaf: 2 case(s)

Highest Year of School Completed Stem-and-Leaf Plot for sex= Female

Frequency Stem & Leaf

32.00 Extremes (=<7.0)

29.00 8 . 0000000000

28.00 9 . 000000000

34.00 10 . 00000000000

48.00 11 . 0000000000000000

273.00 12 . 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000

80.00 13 . 000000000000000000000000000

109.00 14 . 000000000000000000000000000000000000

36.00 15 . 000000000000

113.00 16 . 00000000000000000000000000000000000000

21.00 17 . 0000000

39.00 18 . 0000000000000

8.00 19 . 000

7.00 Extremes (>=20)

Stem width: 1

Each leaf: 3 case(s)



And here is your boxplot.

The top of the box represents the 75th percentile, the bottom of the box represents the 25th percentile, and the line in the middle represents the 50th percentile (= median). There is no middle line in the box for female cases, though. That is because the 50th percentile and the 25th percentile for the female sample have the same value (= 12; check the percentile table in your output yourself). The lines that extend out of the top and bottom of the box are called “whiskers,” which represent the highest and lowest values that are not outliers or extreme values. “Outliers” are values that lie between 1.5 and 3 box-lengths (one box-length = the interquartile range) above the 75th percentile or below the 25th percentile, and “extreme values” are values that lie more than 3 box-lengths beyond those percentiles. They are represented by circles and asterisks beyond the whiskers, respectively.
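The outlier rule can be written compactly. In the sketch below, Q_1 and Q_3 denote the 25th and 75th percentiles. How values falling exactly on a boundary are classified is an assumption here, and the numerical values are assumed for illustration only, so read the actual quartiles from your percentile table.

```latex
\[
\mathrm{IQR} = Q_3 - Q_1
\]
\[
\text{outlier: } Q_3 + 1.5\,\mathrm{IQR} < x \le Q_3 + 3\,\mathrm{IQR}
\quad\text{or}\quad
Q_1 - 3\,\mathrm{IQR} \le x < Q_1 - 1.5\,\mathrm{IQR}
\]
\[
\text{extreme value: } x > Q_3 + 3\,\mathrm{IQR}
\quad\text{or}\quad
x < Q_1 - 3\,\mathrm{IQR}
\]
% Illustration with assumed quartiles Q_1 = 12 and Q_3 = 15:
\[
\mathrm{IQR} = 3,\qquad
\text{outliers: } 19.5 < x \le 24 \ \text{or}\ 3 \le x < 7.5,\qquad
\text{extreme values: } x > 24 \ \text{or}\ x < 3
\]
```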

EXAMINE is a very useful data exploration command. As you may have noticed, you can at the same time get a histogram (try just adding “histogram” to the /plot subcommand in the above syntax and running it) and a normal q-q plot (in the same way, add “npplot” to the above syntax and run it).
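As a sketch of that suggestion, the EXAMINE command from above can be trimmed and extended so that the /PLOT subcommand also requests a histogram and a normal q-q plot. Only the two extra keywords are new; the other subcommands are copied from the pasted syntax.

```spss
* Boxplot, stem-and-leaf, histogram, and normal q-q plot in one run.
EXAMINE VARIABLES=educ BY sex
 /PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
 /COMPARE GROUP
 /STATISTICS DESCRIPTIVES
 /MISSING LISTWISE
 /NOTOTAL.
```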

We can get a good idea about our data by exploring it like this. Let’s continue and get a frequency table for respondents’ work status and marital status. How many people are working full-time or unemployed? How many are married or divorced?

Analyze

Descriptive Statistics

Frequencies…



After selecting the variables “wrkstat” and “marital,” click the Charts… button. You should get the “Frequencies: Charts” dialogue box like the one below. Select the “Pie charts” radio button under the “Chart Type” heading and the “Percentages” button under the “Chart Values” heading. Click Continue, and then paste the syntax. As you can see, those charts can be obtained through subcommands available in the FREQUENCIES command.

* Frequencies with pie charts .
FREQUENCIES VARIABLES=wrkstat marital
 /PIECHART PERCENT
 /ORDER=ANALYSIS.

You should now be seeing frequency tables and nice big pie charts for these two variables. Approximately half of the respondents are working full-time, and about half are currently married.



Labor Force Status

                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid    Working fulltime       747       49.8        49.8              49.8
         Working parttime       161       10.7        10.7              60.5
         Temp not working        32        2.1         2.1              62.7
         Unempl, laid off        51        3.4         3.4              66.1
         Retired                231       15.4        15.4              81.5
         School                  42        2.8         2.8              84.3
         Keeping house          200       13.3        13.3              97.6
         Other                   36        2.4         2.4             100.0
         Total                 1500      100.0       100.0

Marital Status

                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid    married                795       53.0        53.0              53.0
         widowed                165       11.0        11.0              64.0
         divorced               213       14.2        14.2              78.3
         separated               40        2.7         2.7              80.9
         never married          286       19.1        19.1             100.0
         Total                 1499       99.9       100.0
Missing  NA                       1         .1
Total                          1500      100.0



Let’s get a crosstab to see if there may be any relationship between political views and opinions about life-prolonging measures.

Analyze

Descriptive Statistics

Crosstabs…

Choose the variable letdie1 (“Allow incurable patients to die”) for the rows and polviews (“Think of Self as Liberal or Conservative”) for the columns, and then click on Statistics… and check the box “Chi-square” in the “Crosstabs: Statistics” dialogue box. Click Continue.

Then click the Cells… button to get to the “Crosstabs: Cell Display” dialogue box. We want to know the expected value for each cell to compare with the corresponding observed value, so check both the “Observed” and “Expected” checkboxes under the “Count” heading.



Click Continue. Then once back to the main “Crosstabs” dialogue box, click Paste.

* Crosstab between letdie1 and polviews .
CROSSTABS
 /TABLES=letdie1 BY polviews
 /FORMAT=AVALUE TABLES
 /STATISTICS=CHISQ
 /CELLS=COUNT EXPECTED
 /COUNT ROUND CELL.

Your Chi-square test result is displayed at the bottom of the Crosstabs output.

Chi-Square Tests

                                Value      df   Asymp. Sig. (2-sided)
Pearson Chi-Square              33.155a     6          .000
Likelihood Ratio                32.143      6          .000
Linear-by-Linear Association    27.998      1          .000
N of Valid Cases                   929

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 6.94.

As the note at the bottom of the table says, the test meets its validity condition (i.e., the minimum expected count must be more than 5). The result is statistically significant, indicating that political views are associated with opinions about life-prolonging measures. More politically liberal people are more open to the idea of allowing incurable patients to die.
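For reference, the quantities in this output follow the usual Pearson chi-square formulas. O_ij and E_ij are the observed and expected counts in row i and column j, and N is the number of valid cases. The 2 × 7 table size used in the last line is an assumption about how letdie1 and polviews are coded; it is not stated in the text.

```latex
\[
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N},
\qquad
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
df = (r - 1)(c - 1)
\]
% Assuming r = 2 rows (letdie1) and c = 7 columns (polviews):
\[
df = (2 - 1)(7 - 1) = 6
\]
```

Under that assumption the result agrees with the df of 6 reported for the Pearson chi-square.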

Okay, let’s now see how closely years of education and age at first marriage are associated.

Graphs

Legacy Dialogs

Scatter/Dot…

The Scatter/Dot dialogue box comes up. Click on “Simple scatter” and click Define.




You should reach the “Simple Scatterplot” dialogue box. Choose agewed (“Age when first married”) for the Y-Axis and educ (“Highest year of school completed”) for the X-Axis.

Paste the syntax and run it.

* Scatterplot.
GRAPH
  /SCATTERPLOT(BIVAR)=educ WITH agewed
  /MISSING=LISTWISE.

And you get the scatterplot (below).

Now, we have 1500 observations in this data, but there do not seem to be as many dots. This is because multiple observations share the same data points. We want to show how dense each data point is. To do so, we use the Chart Editor.

Double-click anywhere in the scatterplot output area to invoke the Chart Editor. Then go to:

Options

Bin Element




This brings up the “Properties” dialogue box (below). In the “Binning” tab, select the “Color Intensity” radio button under the “Count Indicators” heading. Click Apply.

Then you have a scatterplot that includes information about number count for each data point.




So from the densest area of the scatterplot above, you can see that many people graduated from high school and then soon got married around the age of 20.

I urge you to closely review the syntax we used in this subsection. So far, we have had SPSS write code for us, but again, eventually you should also be able to write your syntax yourself. You can extend it to perform tasks that the point-and-click approach cannot.

3. How to Define Variable Properties

In this section, we will learn how to do some data management/modification work. In the last section, we read external files with different formats and saved them as SPSS data files. Let’s open the file “xls_gss93.sav”.

File

Open

Data…

Click on the Variable View sheet, and you see no variable information other than variable names. You don’t have any variable labels, value labels, or missing values defined either. We will first work to define variable properties such as these.




Define Variable Properties

Here is part of your codebook of this data file (just part of it, for our practice purpose).

Variable Name   Variable Label                   Values                  Missing Values      Width
id              Respondent ID Number             -                       -                   4
wrkstat         Labor Force Status               1 “Working fulltime”                        1
                                                 2 “Working parttime”
                                                 3 “Temp not working”
                                                 4 “Unempl, laid off”
                                                 5 “Retired”
                                                 6 “School”
                                                 7 “Keeping house”
                                                 8 “Other”
marital         Marital Status                   1 “Married”             9 “NA”              1
                                                 2 “Widowed”
                                                 3 “Divorced”
                                                 4 “Separated”
                                                 5 “Never married”
agewed          Age When First Married                                   0 “nap”, 99 “na”    2
sibs            Number of Brothers and Sisters                           98 “dk”, 99 “na”    2
childs          Number of Children               8 “Eight or more”       9 “NA”              1
age             Age of Respondent                -                       99 “NA”             2

We include the information above in the data file. We will first use the point-and-click approach and then see the syntax code beneath it. From the menu bar at the top, go to:

Data

Define Variable Properties…

This brings up the “Define Variable Properties” dialogue box. Let’s select all the seven variables.

Click Continue.




You’ll reach the next dialogue box where you can define variable properties. Select variables one by one and define their properties according to the codebook. The “Changed” checkbox is automatically checked once you make changes to a value label. An example using the variable wrkstat is below.

(1) Highlight and select a variable. Then the property items will show up on the right side.

(2) Variable label

(2) Define measurement level, type, width, etc.

Some variables have missing codes. To define missing values, check the “Missing” checkbox. For example, the variable marital has the value 9 for missing values. To tell SPSS that 9 represents missing values for this variable, you do the following.

SPSS automatically checks this for you when you make changes to value labels.

(2) Define value labels here.




Once you finish defining properties for all the variables, click Paste and see what syntax commands SPSS wrote. For each variable, several commands are used to define its properties.

To define Measurement level, use VARIABLE LEVEL varname (LEVEL).
To define Variable label, use VARIABLE LABELS varname ‘label’.
To define Format, use FORMATS varname (format).
To define Variable value labels, use VALUE LABELS varname labels.
To define Missing values, use MISSING VALUES varname (values).

Remember, each command must be terminated with a period “.”.
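As an illustration of how these commands fit together, here is a minimal sketch for two variables from the codebook above. The labels and missing codes come from the codebook; the measurement levels (NOMINAL, SCALE) and the F formats are assumptions chosen to match the listed widths, and the syntax SPSS actually pastes for you may differ in detail.

```spss
* Sketch: properties for marital and agewed, based on the codebook.
* Measurement levels and formats are assumptions for illustration.
VARIABLE LABELS marital 'Marital Status'.
VALUE LABELS marital
  1 'Married'
  2 'Widowed'
  3 'Divorced'
  4 'Separated'
  5 'Never married'
  9 'NA'.
MISSING VALUES marital (9).
VARIABLE LEVEL marital (NOMINAL).
FORMATS marital (F1.0).

VARIABLE LABELS agewed 'Age When First Married'.
VALUE LABELS agewed 0 'nap' 99 'na'.
MISSING VALUES agewed (0, 99).
VARIABLE LEVEL agewed (SCALE).
FORMATS agewed (F2.0).
```

Because none of these commands reads the data, the changes show up in the Variable View as soon as the commands are run.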

Let’s run the commands. Then go to the Variable View of your SPSS Data Editor and see the results.




Save your data (Ctrl + S, or File > Save, or the Save button on the toolbar).

As I noted before, this is how you should change variable properties, because you can keep all the work and decisions you made for future reference and notes. Further, you can repeat the same task later if necessary. Don’t do this by using the hidden dialogue boxes of the Variable View. It will not allow you to keep any systematic records of your work, and you will most certainly lose track of your research work if you keep doing that.

4. How to Create and Recode Variables

Let’s open “GSS93 subset.sav.”

Now, suppose you are arguing that the effect of age on income level is curvilinear (it may attenuate after reaching some age threshold) and you are interested in testing this argument using SPSS.

Suppose also that since you are not sure about a specific shape, you think you should try a natural-logged version and a quadratic version of the age variable. So you need to create new age variables with these two functional forms.

Let’s first get descriptive statistics and get a good idea about the variables before proceeding. Since we already tried it once, let’s do this by writing a syntax command ourselves this time. Let’s also get a frequency distribution table. We will try writing a syntax command for this too. Type in the following, and run it.

* Descriptives for the variable age.

descriptives variables = age
  /statistics = mean stddev min max skewness.

frequencies variables=age.

The mean value of the variable is 46.23 with a standard deviation of 17.42. It ranges from 18 (18 years old) to 89. Since there is no value of 0 or below (no below-zero age, of course!), we can log it as is, without adding anything.

We’ll start with the point-and-click approach, and then go over the syntax command. From the menu bar at the top, go to:

Transform

Compute Variable…

Then you get a new dialogue box, “Compute Variable,” popping up. The box under the “Target Variable” heading at the upper left corner is where you type in the name of the newly created variable. To define and compute the new variable, enter the expression in the box “Numeric Expression.”




Let’s start by creating a logged age variable. We call our new variable lnage, so type in “lnage” in the “Target Variable” box. This is a numeric variable, so click on the Type & Label… button right below and make sure it is specified as numeric. Also label this variable “Logged Age.” Click Continue to go back to the “Compute Variable” box.

Then type in your expression in the blank of “Numeric Expression.” We use the function LN(numexpr), which returns the base-e log of numexpr (i.e., a number or expression). You can also find functions in the boxes under the “Function group” heading and the “Functions & Special Variables” heading on the right side. For LN, select “Arithmetic” in the former, and then find and double-click on LN in the latter. The function is automatically entered in the “Numeric Expression” box. Plug the variable age into the parentheses.

Click Paste and see what commands SPSS writes.

(1) Enter new variable name.

(2) Click this button to open the Type & Label box, specify the type, and label the new variable “Logged age”. Click Continue to go back to this dialogue box.

(3) Type in the expression. You can directly type in LN(age), but if you cannot recall functions or variable names, you can select them from the two boxes in the lower right side and the variable list in the left side.

COMPUTE lnage = LN(age).
EXECUTE.

VARIABLE LABELS lnage 'Logged Age'.

COMPUTE newvar = expression. is a frequently used command to create a new variable. Once you are used to it, you will probably feel it is much quicker to write and run this syntax command than to keep pointing and clicking. It really is. LN(expression) is a function that returns a natural log value. We have learned VARIABLE LABELS varname ‘label’ before.
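The age variable could be logged as is because its minimum is 18. For a variable that can take the value 0, a common workaround is to add a constant before logging. The following is a minimal sketch using sibs (number of brothers and sisters) purely for illustration; the new variable name lnsibs is hypothetical, and the sketch assumes the missing codes of sibs (98 and 99) are already declared as missing so that they are not logged as real values.

```spss
* Sketch: logging a variable that can be zero.
* lnsibs is a hypothetical name; sibs is used only as an example.
COMPUTE lnsibs = LN(sibs + 1).
VARIABLE LABELS lnsibs 'Logged (Siblings + 1)'.
DESCRIPTIVES VARIABLES = sibs lnsibs.
```

Whether adding 1 is an appropriate adjustment depends on the analysis; it is shown here only to contrast with the age case.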




Now, this process involves a transformation command, COMPUTE. This creates new variables and hence updates your data. As I mentioned above already, SPSS won’t perform the data transformation/reading until it needs to, which conversely means it will when it needs to. Meanwhile, to explicitly force a data pass, one can run EXECUTE. By default, SPSS automatically adds EXECUTE when transformation commands are pasted from a dialogue box.

Just to get the idea of how it works, try running your command without EXECUTE first, and look at your Data Editor. A new column is created for the lnage variable in the Data View, but the data is not read into SPSS, because we didn’t have any data pass, whether from EXECUTE or from a procedural command.

Now highlight and run EXECUTE, and see what happened to your Data View. SPSS spits out what it secretly keeps in its memory and executes the data transformation, and now you have the new data read into SPSS with the newly created variable lnage.

Let’s also create a quadratic version of the age variable. Let’s call this new variable sqage. This time, we just use the Syntax Editor. A squared term of a can simply be expressed as a*a.

* Squared term of the age variable.
compute sqage = age*age.
variable labels sqage 'Quadratic Age'.

descriptives variables = sqage.

Now, first, highlight and run the first two commands (compute and variable labels), and look at the Data View. Again, a new column is created for the variable “sqage,” but no data transformation has been executed yet. Then, this time, instead of EXECUTE, we run a procedural command, descriptives, to obtain descriptive statistics for this new variable. You get the result below, and if you check the Data View you see that the new data is read in.


                        N      Minimum    Maximum      Mean       Std. Deviation
Quadratic Age          1495     324.00    7921.00    2440.0957     1789.00139
Valid N (listwise)     1495

In this example, there is no EXECUTE, yet SPSS still performed the data transformation because it needs to read in sqage to execute the procedure DESCRIPTIVES for this variable. The point is, SPSS waits to make data changes until it absolutely needs to do so. This way, the number of data readings decreases and thereby SPSS’s processing speeds up.

Let me emphasize again: in most cases you don’t need to run EXECUTE every time, because SPSS reads the data when it needs to. An unnecessary EXECUTE makes SPSS read the data again and again even when it doesn’t have to, and as a result slows down the processing. So use this command sparingly.
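Put together, the idea is that several transformations can share one data pass. A minimal sketch, reusing the lnage and sqage definitions from above with no EXECUTE at all:

```spss
* Sketch: two transformations, then one procedure that forces the data pass.
compute lnage = ln(age).
compute sqage = age*age.
variable labels lnage 'Logged Age' /sqage 'Quadratic Age'.

descriptives variables = lnage sqage.
```

Here DESCRIPTIVES is the only command that reads the data, so both new variables are computed in that single pass.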




This means that most of the time you can remove the EXECUTE that SPSS automatically generates by default when you paste transformation commands from a dialogue box. But it is annoying to have SPSS paste EXECUTE and then delete it again and again. We don’t want SPSS to be so eager to insert EXECUTE every time. So, let’s make SPSS a little lazy.

Edit

Options…

Then you will see the “Options” dialogue box. Click on the “Data” tab, and check the “Calculate values before used” radio button under the “Transformation and Merge Options” heading.

Click OK.

Let’s try pasting the same syntax to create the lnage variable from the dialogue box and see what this option change does for us. Click on the Dialogue Recall button of the menu bar (all SPSS windows have the same menu bar), and from the drop-down list, recall the command we just ran from the dialogue box, which is COMPUTE.

Pull-down menu shows up.

You should now be seeing the same dialogue box as before, and our last work is still there. Let’s click Paste (no worries about the “Change existing variable?” message; we are just pasting the command onto the Syntax file), and see what is pasted on your syntax file. Can you see that SPSS is “lazy” now, i.e., it does not paste EXECUTE this time? This means that SPSS won’t perform data transformations after every transformation command.

So, you don’t have to force a data pass every single time; let SPSS read updated data when it needs to. There are, however, some specific situations where you absolutely and explicitly need to get SPSS to run EXECUTE and force a data pass. Let’s take a look at the following example.

* Must-use EXECUTE example (1)
  Rule 1: Lag functions and EXECUTE.

* First create a mini data set.
data list free / var1.
begin data
1 2 3 4 5
end data.

compute var2 = var1.

list.

* (1)-(a): lag() function w/o intervening EXECUTE.
compute lagvar1 = lag(var1).
compute var1 = var1*var1.

* (1)-(b): lag() function w/ intervening EXECUTE.
compute lagvar2 = lag(var2).
execute.
compute var2 = var2*var2.

list.

Here’s the difference!

What we did above is first to create a simple data set containing var1 and var2, which are actually the same, with five observations whose values are 1,2,3,4,5, and then to lag those variables by using the function LAG(). The only difference between (1)-(a) and (1)-(b) is whether EXECUTE is placed after computing those lag variables. Now, see what you got in your output (or the Data View). What difference did EXECUTE make to the new lagged variables?

    var1     var2  lagvar1  lagvar2

    1.00     1.00      .        .
    4.00     4.00     1.00     1.00
    9.00     9.00     4.00     2.00
   16.00    16.00     9.00     3.00
   25.00    25.00    16.00     4.00

Number of cases read:  5    Number of cases listed:  5

Look at lagvar1 and lagvar2. You might have been assuming you were creating a set of the same lagged variables, but you got different results.




The key, of course, is the presence or absence of EXECUTE after compute lagvar# = lag(var#).

The difference happened because the function LAG() is calculated after all other transformations are performed, regardless of command order. So, in example (1)-(a), without an intervening EXECUTE, the new variable lagvar1 was created from the transformed values of var1 (i.e., var1*var1). SPSS executed compute var1 = var1*var1. first, and only then calculated compute lagvar1 = lag(var1). In contrast, in example (1)-(b), we explicitly placed an intervening EXECUTE after compute lagvar2 = lag(var2), meaning that we forced SPSS to transform the data and create lagvar2 at that point, before moving on to the var2 transformation. Thus, lagvar2 was created from the original values of var2 (1,2,3,4,5).

So, depending on what you mean to do, you need to explicitly force a data pass when you use LAG(). This is rule No. 1 about the placement of the EXECUTE command between transformation commands.

Let’s take a look at another example. Examine the below syntax.

* Must-use EXECUTE example (2)
  Rule 2: System variable $CASENUM, SELECT IF and EXECUTE.

* (2)-(a): $casenum and SELECT IF, w/o intervening EXECUTE.
compute var3 = $casenum.
select if (mod(var3,2) = 0).
descriptives var3.

* (2)-(b): $casenum and SELECT IF, w/ intervening EXECUTE.
* First re-run Must-use EXECUTE example (1) syntax to bring back the data.
compute var3 = $casenum.
execute.
select if (mod(var3,2) = 0).
list variables = var1 var2 var3.

Here’s the difference!

$CASENUM is a system variable that contains the current case sequence number (i.e., 1,2,3,4,5… n). SELECT IF (expression) is a command for case selection based on the criteria specified after IF. MOD(a, b) is a function that returns the remainder when a is divided by a non-zero b. So, in the above syntax we are telling SPSS to select cases where var3 is an even number.

We will continue to use the mini data set we created (be sure to have this data set active). Now, let’s highlight and run example (2)-(a). What did you get?

Warnings

No cases were input to this procedure. Either there are none in the working data file or all of them have been filtered out.

This command is not executed.

And indeed, you don’t have any observation in your Data View.




Why did this happen? It’s a combination of the following two things. First, the value of $CASENUM keeps changing in a dynamic manner. For example, if you delete the first case with the value 1, the former second case with the value 2 moves up and becomes the first case with the value 1. Second, SELECT IF sequentially deletes each unselected case. So, in the example above, SPSS goes through the following sequence: (1) sees compute var3 = $casenum., (2) creates var3 and gives the first case a value of 1, (3) evaluates it against the selection criterion select if (mod(var3,2) = 0)., (4) decides the first case does not meet it, and (5) deletes the case. Now SPSS comes back to the top of this loop; the former second case now has a value of 1 for var3, SPSS sees it, decides it does not meet the selection criterion, and deletes it. SPSS keeps going through the loop until it reaches the last observation (in this case the 5th observation). Notice that no data reading happens throughout this sequence. It is only then that SPSS sees the procedural command descriptives var3. and tries to read the data to execute descriptives. But of course, at this point, all the cases are gone and there is no data left to read in.

That is not what we wanted to do, of course. What should we do? We need to force SPSS to read the data after the transformation compute var3 = $casenum. to finalize the data before it starts selecting cases. Let’s re-create the same mini data set (because it’s gone!) and then highlight and run example (2)-(b).

    var1     var2     var3

    4.00     4.00     2.00
   16.00    16.00     4.00


Yes, this is exactly what we meant to do. We first created var3 (1,2,3,4,5), finalized it, selected even-number cases (i.e., 2 and 4), and then printed the result.

OK, here’s one last example about EXECUTE in this workshop. Examine the following.

* Must-use EXECUTE example (3)
  Rule 3: Transformation command, MISSING VALUES and EXECUTE.

* First, create a mini data set.
data list list / var1 var2 var3 var4.
begin data
1 0 1 4
2 1 2 9
3 0 5 6
4 2 1 9
5 0 2 5
6 2 6 13
7 5 7 1
8 0 2 2
end data.

list.

* (3)-(a): Transformation followed by MISSING VALUES involving that var,
  w/o intervening EXECUTE.

compute var5 = 0.
if var2 = 0 var5 = 1.

missing values var2 (0).
list.

* Clear missing values.
missing values var2 ().

* (3)-(b): Transformation followed by MISSING VALUES involving that var,
  w/ intervening EXECUTE.

compute var6 = 0.
if var2 = 0 var6 = 1.
execute.
missing values var2 (0).

list.

Again, here’s the difference!

After creating a small data set, we first create a new variable var5 in example (3)-(a). We set all the observations to 0 first, then replace them with 1 for those cases where var2 has the value 0, so that var5 becomes a 0/1 dummy variable. There should be four cases coded as 1 in var5 because there are as many 0’s in var2. Then we use the command MISSING VALUES variable (value) to declare the 0’s of var2 as user-defined missing values.

Now, with the small data set, let’s first highlight and run (3)-(a). What did you get?

Your var5 has a value of 0 for all the observations, although the value 0 of var2 is now defined as missing, as you can see from the Variable View of the SPSS Data Editor. It’s not exactly what we wanted; var5 should have the value of 1 when var2 is 0. Why does this happen?

This is actually yet another situation where you must use EXECUTE explicitly. Be careful when you have transformation commands followed by a MISSING VALUES command that works on the same variables as the transformations, because MISSING VALUES changes the dictionary (i.e., variable info in the Variable View) before the transformations are executed. In this example, the value 0 of var2 is defined as missing before var5 is created and then modified on the condition of var2, and hence the transformation of var5 where var2 = 0 does not occur.

So what we need to do is to complete the transformation and force a data pass (i.e., finalize var5) before MISSING VALUES defines the value 0 of var2 as missing. That’s where EXECUTE comes in. Place it before MISSING VALUES so that the data transformation is executed before the missing value command is run. Let’s keep this mini data set active, and after resetting the missing value definition for var2 (i.e., the “* Clear missing values.” part of the above syntax), let’s highlight and run (3)-(b) to create var6, which is 1 when var2 = 0, while still defining the value 0 of var2 as missing.

Now, did you get something different this time?




    var1     var2     var3     var4     var5     var6

    1.00      .00      .00     4.00      .00     1.00
    2.00     1.00     2.00     9.00      .00      .00
    3.00      .00     5.00     6.00      .00     1.00
    4.00     2.00     1.00     9.00      .00      .00
    5.00      .00     2.00     5.00      .00     1.00
    6.00     2.00     6.00    13.00      .00      .00
    7.00     5.00     7.00     1.00      .00      .00
    8.00      .00     2.00     2.00      .00     1.00


See the difference between var5 and var6? Yes, this is what we wanted!

These three are oft-encountered situations where you need to explicitly force a data pass, and you should keep them in mind.

Rule 1: Lag functions and EXECUTE.
Rule 2: System variable $CASENUM, SELECT IF and EXECUTE.
Rule 3: Transformation command, MISSING VALUES and EXECUTE.

Two other situations are when you run WRITE or XSAVE, both of which are treated as transformation commands. Ending your program with WRITE or XSAVE without any procedural command that forces a data pass leads to an empty data file (because, simply, it is not written or saved). In such cases, you often need EXECUTE after you run those commands.

For more about WRITE or XSAVE, see

Help

Command Syntax Reference
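To make the XSAVE case concrete, here is a minimal sketch using the mini data set that is still active. The new variable var7 and the output file name are placeholders invented for this illustration, not files or variables used elsewhere in this seminar.

```spss
* Sketch: XSAVE is a transformation, so nothing is written until a data pass.
* var7 and the file name are hypothetical placeholders.
compute var7 = var1 + var3.
xsave outfile='your_new_file_name.sav'.
execute.
```

Any procedure that reads the data (for example LIST or DESCRIPTIVES) placed after XSAVE would trigger the write as well; EXECUTE is simply the most direct way when no procedure follows.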

You can now close the mini data file without saving it.

Recoding Variables

OK, let’s next learn how to recode existing variables. We have the variable “Marital status” in the GSS data. With the data file GSS93 subset.sav active, we start by running descriptive statistics to get a good idea of what the variable looks like.

descriptives variables = marital
  /statistics = mean stddev min max skewness.

frequencies variables=marital.

The results are below.





                        N        Minimum    Maximum     Mean      Std. Deviation         Skewness
                    Statistic   Statistic  Statistic  Statistic     Statistic      Statistic  Std. Error
Marital Status        1499          1          5         2.24         1.563           .847       .063
Valid N (listwise)    1499

Marital Status

                        Frequency   Percent   Valid Percent   Cumulative Percent
Valid     married           795       53.0         53.0              53.0
          widowed           165       11.0         11.0              64.0
          divorced          213       14.2         14.2              78.3
          separated          40        2.7          2.7              80.9
          never married     286       19.1         19.1             100.0
          Total            1499       99.9        100.0
Missing   NA                  1         .1
Total                      1500      100.0

This variable has five categories, coded as 1 to 5. The largest category is “married,” and the next largest is “never married.” “Separated” is by far the smallest category. Substantively, the middle three categories may be collapsed to create a new variable with three groups: (1) currently married, (2) previously married (with the assumption that separation is effectively marital dissolution), and (3) never married:

1. Married         → 1. Currently married
2. Widowed         → 2. Previously married
3. Divorced        → 2. Previously married
4. Separated       → 2. Previously married
5. Never married   → 3. Never married

To perform this recode, let’s go from the pull-down menu bar at the top,

Transform

Recode

Into Different Variables…




We select Into Different Variables… rather than Into Same Variables… because we want to create a new 3-category variable with the original 5-category variable intact (if we use “Into Same Variables,” the original variable would be overwritten with the new one).

Select the variable marital. Under the “Output Variable” heading on the right side, name our output (new) variable “marital3cat” and add a label (“Marital status 3 categories”). Click Change. The middle pane (“Numeric Variable -> Output Variable”) should now show “marital --> marital3cat.”

Next, we define this new variable “marital3cat” based on the old variable “marital.” Click Old and New Values, and you will get another dialogue box, “Recode into Different Variables: Old and New Values” (below). We want to keep category 1 (married) as is, collapse categories 2, 3, 4 of the “old” variable (marital) into a “new” category and call that 2, and change category 5 of the “old” variable (never married) to a new category 3. Here is how to do it.

In the “Recode into Different Variables” dialogue box:

(1) Select “marital” into the “Numeric Variable -> Output Variable” pane.

(2) We recode “marital” into a new, different variable. Decide on a new name for your new variable, then label it. Click Change.

(3) Click this button.

In the “Old and New Values” dialogue box:

(1) Use “Range” as it’s 2 through 4 we want to recode into a new category.

(2) We want to lump 2-4 into a new 2.

(3) Click Add.




(4) Recode 1 to 1 and 5 to 3 into a new variable too. Use “Value” instead of “Range.” For everything else, follow the same steps as above.

(5) Once you are done, click Continue to go back to the previous Recode into Different Variables dialogue box. Then click Paste to paste SPSS syntax onto a Syntax Editor.

Once you are done with the point-and-click recoding work, click Continue and go back to the “Recode into Different Variables” dialogue box. Click Paste and see what syntax commands SPSS writes for you.

RECODE marital (1=1) (5=3) (2 thru 4=2) INTO marital3cat.
VARIABLE LABELS marital3cat 'Marital status 3 categories'.

The command RECODE oldvarname recode argument INTO newvarname recodes variables into new ones. SPSS adds the command VARIABLE LABELS as we specified ('Marital status 3 categories'). As you can see, the syntax to do this task is quite simple, compared with the quite a bit of pointing and clicking we did. This is why you should learn to write your own syntax!

We don’t have value labels for the new three categories, but we know how to do it by using syntax, so let’s add another command to this syntax file to label values. Also, we want to have it without any decimals, so format it the same way as marital. Also remember, RECODE is a transformation command (as it entails transformation of data), and SPSS does not execute it and read the data until it has to. Let’s get a frequency distribution of the new variable. This both forces a data pass and lets us check the new variable.

Our modified syntax looks like this.

RECODE marital (1=1) (5=3) (2 thru 4=2) INTO marital3cat.
VARIABLE LABELS marital3cat 'Marital status 3 categories'.

VALUE LABELS marital3cat
  1 'Married'
  2 'Was Married'
  3 'Never Married'.

FORMATS marital3cat (F4.0).

frequencies variables = marital3cat.

Highlight and run. Looks like we got it done right.

Marital status 3 categories

                        Frequency   Percent   Valid Percent   Cumulative Percent
Valid     Married           795       53.0         53.0              53.0
          Was Married       418       27.9         27.9              80.9
          Never Married     286       19.1         19.1             100.0
          Total            1499       99.9        100.0
Missing   System              1         .1
Total                      1500      100.0
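Beyond the frequency table, one way to confirm that every old code landed in the intended new category is to cross-tabulate the original variable against the recoded one. A minimal sketch, assuming marital and marital3cat as defined above:

```spss
* Sketch: cross-check the old codes against the new ones.
CROSSTABS
  /TABLES=marital BY marital3cat
  /CELLS=COUNT.
```

Each row of marital should have counts in exactly one column of marital3cat. The one case with the missing code on marital was not covered by any recode specification, which is why it appears as system-missing in the frequency table of the new variable.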

5. How to Subset (Select) Data

Sometimes, you may only need some small portion of the data file. For example, you know for sure your analysis will not use some variables and you want to drop them to create a smaller file. Or, your study may focus on female observations only and you want to limit your sample to that gender group. In this section, we will learn how to subset data (variables or observations).

Subsetting Variables

Suppose we want to limit our “GSS93 subset.sav” data to only those variables we are interested in for our research project.




Before we drop the variables other than our variables of interest, let’s double-check the data information to make sure we do not leave out any important variables.

In your new syntax file, type in and run:

* Get data information.

display dictionary .

SPSS printed out the “dictionary” for you in the SPSS Output Viewer: all the same variable information and value label information as you can get from the Variable View tab.

Suppose that, looking through the variable information, you decide that the variables below are the ones you will need for your analysis.

id wrkstat marital agewed sibs childs age educ degree padeg madeg sex race relig

Let’s create a file that includes only the variables above. Again, we first do the subsetting by the point-and-click approach, and then see how you can do the same by writing your own syntax command.

Have your SPSS Data Editor active (i.e., bring it to the top). From the menu bar, go

File

Save As…

And you have the “Save Data As” box. Make sure to choose the directory you want to save your subset data in. Here, we are going to save it in our working directory verybasicSPSS. Next, we need to decide the new file’s name in the “File name:” box. Let’s call it “GSSvarsub.”

(1) Select the location where you want to save your new subset file.

(2) Name your new file.

(3) Click on Variables…




Now, click Variables… and you will get another dialogue box called “Save Data As: Variables.” By default, SPSS keeps all the variables (all the variables are marked with an X). We will select those 14 variables listed above.

You could de-select the variables that you don’t need by clicking on their check boxes. Or, in this case, I would first click Drop All and then select what I want to keep by clicking on their check boxes, as the number of variables I want to keep is rather small (the former way you need to uncheck 45 boxes, the latter way you check 14 boxes, so it is more efficient). When you finish selecting what you need, click Continue.

By default, all the variables are selected. You can de-select those that are unnecessary for you.

In this case I would deselect all first by clicking Drop All, and then select the 14 variables.

Click Continue when you finish selecting variables.

Now you should be seeing the message “Keeping 14 of 68 variables.”

Click Paste and see the syntax commands SPSS spits for this task.

SAVE OUTFILE='D:\[Your working directory here ]\verybasicSPSS\GSSvarsub.sav'
  /DROP=birthmo zodiac income91 rincom91 region xnorcsiz size partyid vote92
    polviews cappun gunlaw grass life chldidel pillok sexeduc spanking letdie1
    news tvhours bigband blugrass country blues musicals classicl folk jazz
    opera rap hvymetal attsprts visitart tvshows tvnews tvpbs scitest4 partners
    sexfreq dwelown sei cohort income4 degree2 agecat4 politics region4 married
    classic3 jazz3 rap3 blues3
  /COMPRESSED.

Looks messy, but the whole structure is pretty simple.




SAVE OUTFILE = ‘your_new_file_name.sav’.

is what you use to save your SPSS data file. Then, to select some variables to create a subset of the original, we use either one of the subcommands

/DROP = list of variables to drop.
/KEEP = list of variables to keep.

SPSS uses /DROP there, but if you write your own syntax commands, you of course can list the 14 variables by using /KEEP. In this case, that would actually be simpler. /COMPRESSED is just to save a file in compressed form (this is the default, meaning you don’t have to specify this when you write syntax yourself to save a file).

Again, the pasted syntax looks messy with an array of many variable names, but it doesn’t have to be messy like that when you write your syntax yourself to subset a file, because consecutive variables such as A, B, C, D, E can be written as A TO E.

save outfile='D:\[Your working directory here ]\verybasicSPSS\GSSvarsub.sav'
  /keep=id to age educ to race relig
  /compressed.

Another reason you should learn to write your own syntax!

Let’s highlight and run the syntax, and then check your working directory; your new data file should be saved there. Let’s open the new file and first see if everything looks okay.
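Opening and checking the new file can also be done from syntax. The following is a minimal sketch; the path is the same placeholder used in the SAVE command above and needs to be replaced with your own working directory.

```spss
* Sketch: open the subset file and list its variables.
* Replace the bracketed placeholder with your own working directory.
get file='D:\[Your working directory here ]\verybasicSPSS\GSSvarsub.sav'.
display dictionary.
```

The dictionary listing should show only the 14 variables that were kept.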

Now that we have a new file for our research project, let’s leave a brief comment in the data file to help keep things organized. We can do this by using the command DOCUMENT. With this data file active, let’s write the following command and run it.

* Document this work in the new subset file.
document Subset of "GSS93 subset" inc the necessary variables for project A.

* Let’s display the document we just created.
display document.

You should get the output below in your Output Viewer. Your data comments are stored with the date information. This way, you won’t lose track of what each data file in your working directory is about.

Document

1   document Subset of "GSS93 subset" inc the necessary variables for project A. (a)

a. Entered 11-Mar-2009

To drop the comment, use the command DROP DOCUMENT.

Now, suppose your study focuses on the African American population and you want to limit your sample to African American cases only. Suppose also that you want to present a graphic of the distribution of education levels among this demographic.

Let’s first get the break-down of the variable “race.”

frequencies variables=race.

Race of Respondent

               Frequency   Percent   Valid Percent   Cumulative Percent
Valid  white        1257      83.8            83.8                 83.8
       black         168      11.2            11.2                 95.0
       other          75       5.0             5.0                100.0
       Total        1500     100.0           100.0

So, we will select those 168 African American cases.

Data…

Select Cases…

You will see the “Select Cases” dialogue box below. Check the radio button “If condition is satisfied” and click on the If… button.

[Screenshot of the “Select Cases” dialogue box. Callout (1): check this radio button, and click on the If… button right below it.]

Then another dialogue box, “Select Cases: If,” shows up. Select the variable race (“Respondent’s Race”) from the left pane, and move it to the right pane by clicking the right-pointing arrow. We want to select the 168 African American cases, which are coded as 2 in the data, as you can see from the Variable View or your dictionary. So, complete the expression accordingly; that is, we are selecting cases if race = 2.

Click Continue.

[Screenshot of the “Select Cases: If” dialogue box. Callout (2): select “Respondent’s race” from the left pane; “black” is coded as 2, so complete the expression accordingly. Callout (3): then click Continue.]

Once you are back to the “Select Cases” dialogue box (the first one), click Paste and see the commands SPSS writes and acts on (below).

USE ALL.
COMPUTE filter_$=(race=2).
VARIABLE LABEL filter_$ 'race=2 (FILTER)'.
VALUE LABELS filter_$ 0 'Not Selected' 1 'Selected'.
FORMAT filter_$ (f1.0).
FILTER BY filter_$.

As you can see, SPSS creates a variable “filter_$” based on the race variable, with 0 = “Not Selected” and 1 = “Selected.” Thus, the African American cases should have this variable coded as 1 (because the African American cases will be “selected”).

Let’s obtain a frequency table based on the variable “race.” Running a procedure makes SPSS read the data with the filter applied, so we can check the selection at the same time.

frequencies variables = race.

Race of Respondent

               Frequency   Percent   Valid Percent   Cumulative Percent
Valid  black         168     100.0           100.0                100.0

Yes, we have 168 African American observations in our data now.

Now, bring SPSS Data Editor to the front and see what happened to our original data.

[Screenshot of the Data Editor. Callouts: a “filter” variable SPSS created to filter observations; cases with race != 2 are crossed out; SPSS indicates your selection filter is on.]

Can you see what is going on here? The observation numbers in the row header for those cases where race = 1 and 3 are simply crossed out, but SPSS seems to hold all the information of the 1500 observations. What SPSS does here is just to filter out the non-black observations by using the filter variable (FILTER BY filter_$). Scroll to the right, and you will see the variable “filter_$” SPSS created for this selection task. As we noted above, the selected cases (i.e., African Americans) are coded as 1, and the unselected (White and others) are coded as 0.

Let’s see the distribution of education levels among the African American respondents.

frequencies variables = educ.

See the output. SPSS gives you a frequency distribution table for the black cases only. There are 168 observations, but one case is missing for the education variable.

Now, suppose you want to restore the whole data. We will write and run the commands below (very simple!).

filter off.
use all.

By this we turn off the filter SPSS used to select cases and tell SPSS to restore the whole file (use all.). Look at the Data Editor and see what happened. All the observations are now back in.

What is nice about using a filter to subset observations is, as you just saw, that it is temporary. When you are conducting analysis, you may want to subset observations in many different ways. It is flexible to create a filter and turn it on and off to select observations. And you can keep the whole original data intact.
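When you write the filter yourself, the pasted labels and formats are optional. A minimal sketch of the same on/off pattern, assuming a hypothetical filter variable name black_only (not part of the original data file):

```spss
* black_only is a hypothetical name chosen for this sketch.
compute black_only = (race = 2).

* Turn the filter on, run a procedure on the selected cases only.
filter by black_only.
frequencies variables = educ.

* Turn the filter off to work with all cases again.
filter off.
```

Because the filter variable stays in the file, it can be switched back on later with another FILTER BY command.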

We can subset observations permanently. We already used SELECT IF (expressions). The difference between SELECT IF (expressions) and FILTER BY… is that the former permanently selects observations while the latter is temporary.

So if you subset observations on the race variable by using SELECT IF (expressions) and run frequencies …

select if (race=2).
frequencies variables = educ.

Then in the Output Viewer, you should get exactly the same frequency distribution table. However, check the Data Editor. How many observations do you have now in the data file?

We now have only 168 observations, all being African American. SPSS does not cross out the unselected observations. It instead deletes them from the working data, permanently.

This method is good if you want to create and save a subset file which only includes cases that meet certain conditions (e.g., females only, those with higher education only, etc.), but unlike the filter, you cannot restore the deleted cases unless you go back to the original file, so it may be inconvenient when you are conducting analysis and frequently select and re-select cases. You should choose which way to go depending on your purpose.
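A third option, not covered in the walkthrough above, is the TEMPORARY command, which limits the transformations that follow it to the next procedure only. A minimal sketch, assuming the full data file is active:

```spss
* The selection below applies to the next procedure only.
temporary.
select if (race = 2).
frequencies variables = educ.

* All cases are available again for this second procedure.
frequencies variables = race.
```

This gives a one-off subset without creating a filter variable and without deleting any cases.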

6. How to Sort and Split Data

In this section, we are going over how to sort data or conduct data analysis by splitting data.

Sorting Data

Let’s open the data file “GSSvarsub.sav” if it’s not already open.

Sorting data is simple and easy. Suppose we want to sort this data by sex (male = 1, female = 2). From the pull-down menu bar at the top, go:

Data

Sort Cases…

And the “Sort Cases” dialogue box appears.

Select the variable sex (“Respondent’s Sex”) and click Paste.

SORT CASES BY sex(A).

Too simple a command, isn’t it? The (A) following the variable name sex means that observations will be sorted in ascending order. That is the default, so you don’t have to specify it when writing your own syntax. Let’s highlight and run the command, then list the variable.

list variables = id sex.

As you can see, the data is sorted in ascending order.

If you need to sort observations in descending order, you need to explicitly specify that with (D) in the syntax (instead of (A)). Try it yourself.

sort cases by sex (d).

You can of course sort by more than one variable. For example, if you want to sort observations by sex, and then within each sex category sort observations by marital status, simply place the by variables in that order (see below).

sort cases by sex marital.
list id sex marital.
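The sort direction can be set separately for each variable. A minimal sketch, assuming you want sex in ascending order and marital status in descending order within each sex, and listing only the first 20 cases to keep the output short:

```spss
* Ascending by sex, then descending by marital within each sex.
sort cases by sex (a) marital (d).

* Show only the first 20 cases to check the result.
list variables = id sex marital /cases = from 1 to 20.
```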

Split Observations

Suppose you want to obtain group-by-group numbers, such as average years of education by sex. As we already saw, this can be done fairly easily; the command MEANS has the option BY. You can write and run this simple syntax command yourself to achieve the goal.

means educ by sex.

And you get the following result.

Report

Highest Year of School Completed

Respondent's Sex    Mean      N    Std. Deviation
Male               13.19    639             3.349
Female             12.92    857             2.849
Total              13.04   1496             3.074

But how can we obtain separate analyses when the command you want to use does not have this BY option? Suppose, for example, you suspect that years of education and the number of children are correlated differently across sex: say, having children often makes people interrupt or give up on education early, but perhaps females are more adversely affected than males, and group-by-group correlations may give us some clue about this argument. The problem, however, is that the command CORRELATIONS does not have a BY option and does not let you obtain this statistic by sex and make a comparison in one step.

In such cases as this, here is what you do. From the pull-down menu, go:

Data

Split File…

The “Split File” dialogue box shows up.
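In syntax, splitting a file usually takes two steps: sort by the grouping variable, then turn the split on. A minimal sketch of that pattern for the sex comparison described above, assuming the LAYERED option (both groups shown in one table) is the layout you want:

```spss
* The file must be sorted by the split variable first.
sort cases by sex.

* Every procedure from here on is run separately for each sex.
split file layered by sex.
correlations /variables = educ childs.
```

The split stays in effect for every later procedure until it is turned off.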

split file off.
correlations educ childs.

Correlations

                                                   Highest Year of      Number of
                                                   School Completed     Children
Highest Year of School    Pearson Correlation                 1.000        -.237
Completed                 Sig. (2-tailed)                                   .000
                          N                                1496.000         1491
Number of Children        Pearson Correlation                 -.237        1.000
                          Sig. (2-tailed)                      .000
                          N                                    1491     1495.000

You can see that SPSS now runs the analysis on the whole data without creating groups.

7. Simple Regression Example

Now, let’s go over a quick and simple regression example. We use the data file “GSS93subset.sav”. Make this file open and active.

Suppose we are interested in social and demographic factors that might account for respondents’ socioeconomic status measured by the socioeconomic index (SEI). To answer this research question, we set forth the following hypotheses and conduct statistical tests.

H1: A respondent’s years of work experience increase his/her SEI, but with diminishing returns (implies a quadratic term).

H2: The more cultural capital a respondent has at his/her family of orientation, the higher his/her SEI is.

H3: The more divided resource allocation is at a respondent’s family of orientation, the lower his/her SEI is.

Then we decide to operationalize the concepts in those statements in the following way.

(1) Use respondents’ age information to proxy respondents’ years of work experience.

(2) Use mother’s education as an indicator of cultural capital at respondents’ family of orientation. Specifically, see if it makes a significant difference if their mother received education of 2-year college or higher.

(3) Use the number of respondents’ siblings to measure resource allocation at their family of orientation.

(4) Include dummy variables for sex and race as our control variables, where female = 1 and black = 1, respectively. Both groups are expected to have a lower SEI score on average.

So, first of all, let’s create new variables to do the planned analysis. Because we suspect the length of work experience has diminishing returns for SEI, we want to create a quadratic term of age.

compute sqage = age*age.

We also want to create a dummy variable that indicates whether respondents’ mother has education of 2-year college or higher. As we can see from the Variables button:

[Screenshot of the variable information for madeg.]

So, we need to recode the madeg variable and create a new variable macol.

0 and 1 → 0
2 through 4 → 1
7 through 9 → 9
Then code 9 as this variable’s missing value.

RECODE madeg (0 thru 1=0) (2 thru 4=1) (7 thru 9=9) INTO macol.
VARIABLE LABELS macol 'mother college degree = 1'.
MISSING VALUES macol (9).
VALUE LABELS macol
  0 'College -'
  1 'College +'.

We also recode the variable sex (“Respondent’s sex”) to create a dummy variable “female” and race (“Respondent’s race”) to create a dummy variable “black.”

RECODE sex (1=0) (2=1) INTO female.
VARIABLE LABELS female 'female = 1'.

RECODE race (1=0) (3=0) (2=1) INTO black.
VARIABLE LABELS black 'black = 1'.
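When writing syntax by hand, a 0/1 dummy can also be built with COMPUTE and a logical expression, the same device SPSS used for filter_$ in the filter example. A minimal sketch, offered as an alternative to the RECODE commands rather than as the method used for the results reported here:

```spss
* A logical expression returns 1 if true, 0 if false,
* and system-missing if the source variable is missing.
compute female = (sex = 2).
compute black = (race = 2).
variable labels female 'female = 1' /black 'black = 1'.

* Check the new dummies against the original variables.
frequencies variables = female black.
```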

Now, we have all the variables ready for the analysis.

Analyze

Regression

Linear…

Select SEI for the dependent variable, and select sqage, age (age), mother’s college education (macol), number of siblings (sibs), female (female), and race (black) as the independent variables under the “Block 1 of 1” heading.

Then click Statistics… to bring up the “Linear Regression: Statistics” dialogue box. Check the “Collinearity diagnostics” box. Click Continue.

Then click Paste.

REGRESSION
  /MISSING LISTWISE
  /STATISTICS COEFF OUTS R ANOVA COLLIN TOL
  /CRITERIA=PIN(.05) POUT(.10)
  /NOORIGIN
  /DEPENDENT sei
  /METHOD=ENTER sqage age macol sibs female black.

REGRESSION is SPSS’s command to run an ordinary linear regression model. The lines you need when you write your own syntax are the ones highlighted in gray. The /STATISTICS subcommand line would be unnecessary if you simply wanted the default statistics (i.e., coefficients, ANOVA, multiple R [model summary], and excluded variables [which are not relevant here]). We need this line in this case because we ask for COLLIN and TOL, which are both collinearity diagnostics.
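As a rough sketch of a hand-written version, assuming the essential lines are REGRESSION, /STATISTICS, /DEPENDENT, and /METHOD (the remaining pasted subcommands restate SPSS defaults):

```spss
* Listing /STATISTICS replaces the defaults, so COEFF, R, and ANOVA
* are named again alongside the collinearity keywords.
regression
  /statistics coeff r anova collin tol
  /dependent sei
  /method = enter sqage age macol sibs female black.
```

OUTS is left out here because it reports excluded variables, which do not arise when all predictors are entered together.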

Highlight and run those command lines. Here are some of our results.

Model Summary (b)

Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .292 (a)       .085                .078                      17.7582

a. Predictors: (Constant), black = 1, mother college degree = 1, female = 1, sqage, Number of Brothers and Sisters, Age of Respondent
b. Dependent Variable: Respondent Socioeconomic Index

Coefficients (a)

                                   Unstandardized Coefficients  Standardized Coefficients                  Collinearity Statistics
Model                                          B    Std. Error                       Beta        t   Sig.    Tolerance         VIF
1  (Constant)                             32.337         5.364                               6.029   .000
   sqage                                   -.007          .003                      -.490   -2.693   .007         .036      27.759
   Age of Respondent                        .865          .240                       .656    3.604   .000         .036      27.760
   mother college degree = 1               6.931         1.615                       .149    4.290   .000         .990       1.011
   Number of Brothers and Sisters          -.587          .283                      -.072   -2.074   .038         .982       1.018
   female = 1                             -3.955         1.286                      -.107   -3.075   .002         .991       1.009
   black = 1                              -6.586         2.557                      -.090   -2.576   .010         .977       1.023

a. Dependent Variable: Respondent Socioeconomic Index

As expected, the coefficient on the quadratic age term is negative and statistically significant. Mother’s education, a measure of respondents’ cultural capital at their family of orientation, shows a significant positive impact on respondents’ socioeconomic status. Having more siblings, on the other hand, seems to reduce resources available to people and lead to lower socioeconomic status. Finally, females and African Americans have on average a lower socioeconomic status than males and the other race groups, respectively. The overall explanatory power of the model is not strong, as indicated by the R² of .085 under “Model Summary.”

As for the collinearity diagnostics we requested, tolerance and VIF are inversely related (i.e., 1/tolerance = VIF) and thus give you the same information. Although there is no definite cut-off line, a rule of thumb is that VIF > 10 (or tolerance < 0.1) merits further investigation. In our example, the VIF is high for the age variables, but this is fully expected since one is the squared term of the other and hence they are highly collinear by construction. Otherwise, the tolerance/VIF values all look okay.

Another way to check collinearity problems is to use the “Collinearity Diagnostics” table below. The general rule of thumb is that a condition index larger than 30 indicates strong collinearity. Dimension 7 has a condition index over 30 (41.477), but this is again due to the age variables, whose variance proportions on that dimension are .97 and 1.00. Otherwise, the result adds support to the absence of a collinearity problem.

Collinearity Diagnostics (a)

                                               Variance Proportions
Model  Dimension  Eigenvalue  Condition Index  (Constant)  sqage  Age of Respondent  mother college degree = 1  Number of Brothers and Sisters  female = 1  black = 1
1              1       4.317            1.000         .00    .00                .00                        .01                             .01         .02        .00
               2        .938            2.145         .00    .00                .00                        .00                             .00         .00        .91
               3        .788            2.341         .00    .00                .00                        .93                             .01         .00        .00
               4        .434            3.155         .00    .00                .00                        .01                             .05         .87        .00
               5        .397            3.299         .00    .01                .00                        .00                             .61         .00        .07
               6        .125            5.886         .07    .02                .00                        .04                             .32         .11        .00
               7        .003           41.477         .93    .97               1.00                        .01                             .00         .00        .00

a. Dependent Variable: Respondent Socioeconomic Index

This is the end of The Very Basic SPSS. Thanks for playing!
