Data Handling Using SPSS 19 Research Data MANTRA
DATA HANDLING
USING SPSS 19
Authored by Laine Ruus, EDINA and Data Library, University of Edinburgh
Edited by Robin Rice and Pauline Ward
August 2014
Based on material first written by Joan Corbett, National Centre for Social Research
as part of the Research Data MANTRA (Management Training) course
http://datalib.edina.ac.uk/mantra
This work is licensed under the Creative Commons Attribution 4.0 License.
To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
Course Outline
Section                                        Paragraph numbers
Introduction                                   1-3
Data log file and the SPSS environment         4-7
Running SPSS                                   8-12
Creating an SPSS system file                   13-17
Output Viewer                                  18-21
Syntax Window                                  22-23
Displaying and saving dataset information      24-26
Descriptive statistics and missing values      27-42
Compute and recode                             43-54
Add variables and add cases                    55-62
Getting your data out of SPSS                  63-69
Appendix A: Online resources
Appendix B: Selected inter-system limitations on filenames, variable and value names etc
Appendix C: Common file and variable transformations and their corresponding SPSS commands
Introduction
The objective of this exercise is to introduce you to some techniques for using SPSS, as well as some
other tools, to support your research data management (RDM) activities during the course of your research. It
is not about doing statistical analysis using SPSS, but rather about how to organize and document your data
management activities in the context of using SPSS for your analyses.
1. When embarking on the exploration of a new research question, after the literature review, and
the formulation of preliminary hypotheses, the next task is generally to begin to identify (a) what
variables you need in order to test your hypothesis, (b) what datafiles (if any) are available that
contain those variables, and (c) what software to use that has the statistical routines and
capabilities you require.
2. The questions you need to be able to answer, vis-à-vis any software you decide to use, are (a)
does the software support the statistical analyses that are most appropriate for my research
question and data? (b) how robust are the measures that it will produce? (c) will it support the
data exploration and data transformations I will need to perform? (d) how will I get my data into
the software (ie what file formats can it read)?, and (e) most importantly, how can I get my data
out of that software (along with any transformations, computations etc) so that I can read it into
other software for other analyses, or store it in a software-neutral format for the longer term?
This practical assumes you have decided to use SPSS for your analyses, at least in part.
Advantages to SPSS include: flexible input and output capabilities, good metadata management
capabilities, and fairly intuitive command names, as well as good documentation.
3. The data being used in this exercise are a subset of variables and cases from:
Norris, Pippa. (2010). May 6th 2010 British General Election Constituency Results Release 5.0
Subset rev. by J. Corbett. [Dataset]. Harvard University. John F. Kennedy School of Government.
Available: http://www.pippanorris.com/
You’ll notice that the citation specifies that this is ‘Release 5.0’. An important part of data
management is keeping track of dataset versions and documenting the changes that have
happened between versions. The web page describing this data set has this information.
Data log file
4. As part of managing your data it is essential to create your own documentation as you work through your analyses. It is good practice to set up a data log right at the start of a project. Use this to keep track of things like the versions and locations of datafiles and documentation, notes about variables and values, and variable and data transformations, as well as the files containing output from your analyses.
5. Which software you use to manage your data log is a matter of personal choice. One example of a data log format, using Microsoft Word, can be found in ‘ScotMP_data_log.doc’. Some researchers prefer to use a word processor (eg MS Word), others to use a format-neutral text editor, such as Notepad or EditPad Lite, and yet others (including me) prefer the table handling and sorting capability of Microsoft Excel (see the file ‘scotmp_dm.xls’). Open a new Excel spreadsheet, and eg on sheet 1, enter, in successive columns:
- Current date (YYYYMMDD)
- The input file location and format
- The output file location, name and format
- A comment as to what was done between input and output
- Give the sheet a name, eg ‘data log’
- Before you do anything else, save this Excel file (assign a location and name that you will remember), but leave it open.
Hint: in order to get the correct path and filename of any file in a Windows environment, locate the file in Windows Explorer, and:
Alternative 1: Click in the address bar showing the path at the top of the Windows Explorer window. The display will toggle between the read-friendly display and the full path display. Copy and paste the full path display, or…
Alternative 2: Click on the file to select it. Then right-click the mouse, and select ‘Properties’. The exact path will show in the ‘Location’ field of the properties window, and the filename will show in the first dialogue box. Both path and filename can be copied and pasted into your data log.
6. Note: Especially if you are in the habit of working in different computer environments, it is not recommended that you use blanks in file or folder names. Different operating systems treat embedded blanks differently. Instead, use hyphens, underscores, or CamelCase to separate words to make names more readable. For example, use ‘variable_list.xls’ or ‘VariableList.xls’ rather than ‘variable list.xls’.
7. It is good practice to assume that you may not always be using SPSS, or the same version of SPSS
for your analyses. You may need to migrate data from/to different computing environments
(Windows, Mac, Unix) and/or different statistical software (SAS, Stata, R, etc). Therefore you
also need to be aware of limitations in file names, variable names, and other metadata such as
variable labels, value labels, and missing values codes when moving data among different
operating systems and software packages, some of which are listed in Appendix B.
Running SPSS
8. When you run SPSS in Windows, two windows are opened automatically:
– a Data editor window - empty until you open a data file or begin to enter variable values, after
which it will have two views, a Variable View and a Data View,
– an Output window, to which your commands and output will be written.
Additional windows which can be opened from File > New or File > Open are:
o a Syntax window, in which you can ‘paste’ syntax from the drop-down menu choices, enter
syntax directly, edit and run syntax,
o a Script window, in which you can enter, and edit, Python scripts.
Three additional windows, in addition to dialogue windows etc., may or may not open
depending on the procedures you are running: (a) a Pivot table editor window, (b) a Chart editor
window, and (c) a Text output editor window.
9. Open SPSS through your programs menu: Start > IBM SPSS Statistics 19. If a dialog box appears
asking you whether you wish to open an existing data source, click ‘Cancel’. Before starting to
use the data, use Edit > Options to make some recommended changes to SPSS environment
defaults.
10. Then select Edit > Options. The Options box has several tabs. Select the General tab and make
sure that, under ‘Variable Lists’, ‘Display names’ and ‘File’ are selected. This will make sure that
the variables in the dataset are displayed by variable name rather than by variable label and that
variables are listed in the same order as they occur in the dataset – knowing this order is
essential when referring to ranges of variables.
11. It is also useful to see the variable names and values in any output. By default SPSS only shows
the labels, not the names or codes. Click on the ‘Output Labels’ tab, and under both ‘Outline
Labeling’ and ‘Pivot Table Labeling’, select the options to show:
o Variables in item labels shown as: ‘Names and Labels’,
o Variable values in item labels shown as: ‘Values and Labels’.
12. Finally, select the ‘Viewer’ tab and check that the ‘Display commands in the log’ checkbox
(bottom left of the screen) is checked. This will ensure that your output file contains not only the
results of any commands (syntax), but also the syntax itself. This is useful for checking for errors,
as well as serving as a reminder of the details of recodes and other variable transformations, etc. Click
‘OK’.
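The output labelling choices in item 11 can also be recorded in syntax, so that they sit in your syntax file with the rest of your work. The following is a minimal sketch using the SET command. It covers only the output labelling options, and assumes that the variable list order and the ‘Display commands in the log’ setting are still set through the Options dialogue as described above.

```spss
* Show both names and labels, and both values and labels, in output.
* TVARS and TNUMBERS apply to pivot tables and OVARS and ONUMBERS to the outline.
SET TVARS=BOTH TNUMBERS=BOTH OVARS=BOTH ONUMBERS=BOTH.
* List the current settings to confirm the change.
SHOW TVARS TNUMBERS OVARS ONUMBERS.
```

Running these lines at the top of a syntax file means the output looks the same on any computer on which the file is run, regardless of the local defaults.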
Creating an SPSS system file
13. In common with most statistical packages, SPSS needs a variety of information in order to read a
dataset: (a) the data, and (b) instructions as to how to read the data. In its simplest form, SPSS
reads a data file (eg ‘scotmp_raw.txt’), a syntax file (eg ‘scotmp_raw.sps’), and using the input
instructions in the syntax (.sps) file, converts that data and metadata into its preferred format,
the system file, which only exists during your current SPSS session unless you save it.
Note: SPSS can read (and write) a variety of formats. See Appendix D for a detailed list of
software dependent formats and the commands to read and write them. SPSS can also read
more complex file formats, such as multiple records per case, mixed files, and hierarchical files.
14. First let’s look at one type of external, raw data file. Run Notepad (Start > All programs >
Accessories > Notepad) and open the file ‘scotmp_raw.txt’ (from the zip file accompanying this
document – you will need to extract the files to a location to which you can browse from SPSS,
see paragraph 15).
[Figure: SPSS reads the syntax file (Scotmp_raw.sps) and the data file (Scotmp_raw.txt) and produces an SPSS system file.]
This is the simplest type of ‘fixed format’ file, with one unit of observation (case) in each row,
and the same variable in the same column or adjacent columns of each case; in this instance, the
case is a seat or constituency in the 2010 election, and the variables describe characteristics of
the seat. Using the cursor to move around in the file, determine:
How many cases (rows) are there in this dataset?
What is the highest column number that contains data?
15. In SPSS, use File > Open > Syntax, and browse to locate and open the syntax file
‘scotmp_raw.sps’. Notice that it contains 4 main sections:
o A data statement, which tells SPSS what type of file to read and where to locate it, and gives
a list of the variables to read, the variable names to assign to them, the column locations
of each variable, and whether each variable is numeric or alphabetic (string),
o A variable labels section, in which descriptive labels are assigned to each variable,
o A value labels section, in which descriptive labels are assigned to values of the variables
that require description,
o A missing values section, which assigns certain variable values as missing, which affects
how they are used in statistical analyses, data transformations, and case selection.
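To make the four sections concrete, here is a minimal sketch of what such a syntax file can look like. It is not the content of ‘scotmp_raw.sps’: the variable names (id, cname, party), column positions, labels and codes are all hypothetical, chosen only to show the shape of each section.

```spss
* Hypothetical layout: names, columns and codes below are illustrative only.
* 1. Data statement: file type and location, variable names, columns, string flag (A).
DATA LIST FILE='[path and filename]' FIXED
  /id 1-3 cname 5-24 (A) party 26.
* 2. Variable labels.
VARIABLE LABELS
  id 'Case identifier'
  /cname 'Candidate name'
  /party 'Party code'.
* 3. Value labels.
VALUE LABELS party
  1 'Party A'
  2 'Party B'
  9 'Not stated'.
* 4. Missing values.
MISSING VALUES party (9).
EXECUTE.
```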
16. Next, use Windows Explorer to locate the file ‘scotmp_raw.txt’, determine what the path to the
filename is on your computer (see text box in item 5), and substitute that information for the
file=’[path and filename]’ in the syntax file.
17. Click and drag to select the syntax file contents, down to and including the ‘execute.’ at the end
of the file, and click on the large green arrowhead (the ‘Run’ icon) in the SPSS tool bar to run it.
Output Viewer
18. In SPSS 19, the Output Viewer is labelled ‘Output1 [Document1] IBM SPSS Statistics
Viewer’. It is the window in which output from your procedures is displayed, as well as the
syntax that generated it (as a result of the options chosen in item 12 above). This output can be
saved. By default the output file is saved in an SPSS-dependent format with default filename
‘Output[n]’, and requires SPSS to be read; you should therefore use File > Export to save it in MS
Word, .txt, .html or .pdf format, with a meaningful filename. For data management this can be
an important tool for keeping track of what you have done to your data and to keep results to
which you want to refer back.
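Saving and exporting the output can also be done from syntax, which leaves a record of where the output went. This is a sketch using the OUTPUT SAVE and OUTPUT EXPORT commands; the filename ‘scotmp_output’ is a hypothetical example, and ‘[path]’ and ‘[date]’ are placeholders to be replaced with your own values.

```spss
* Save the Viewer contents in native SPSS format (requires SPSS to read).
OUTPUT SAVE OUTFILE='[path]\[date]scotmp_output.spv'.
* Export the visible Viewer contents to PDF so they can be read without SPSS.
OUTPUT EXPORT
  /CONTENTS EXPORT=VISIBLE
  /PDF DOCUMENTFILE='[path]\[date]scotmp_output.pdf'.
```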
19. Checking: (1) check the Output Viewer window, as well as the bottom of the
syntax window, for error messages; (2) click on the Data Editor window, and check both the Variable View and the
Data View for anything that looks not quite right. If there are errors, try to figure out what they
are: normally, fix the first error first, and then rerun the job – usually, errors have a cascading
effect, and fixing the first can eliminate later errors. No visible problems? – you have a successful
file ingest, congratulations!
20. Save the SPSS system file. Open the Data editor window, select File > Save as and save the file
with format SPSS Statistics (*.sav). Now record this file in your Data log file.
21. The Output window now contains a table of what variables were read, in what column locations,
as well as names, labels and values assigned to them, and the formats automatically assigned by
SPSS. Save this output file, and record it in your data log file.
Syntax window
22. You can carry out all your data analysis and manipulation (including creating new variables) in
SPSS using drop-down menus. However, for data management purposes, the risk of doing
everything that way is that there may be no record of what you have done. You may need to
come back to your data and reproduce an analysis at a later date. Good data management will
help you and colleagues to trace back if you need to check any results.
23. Saving output files with commands, as described above, can be one way to do this. With SPSS
you can also analyse and manipulate your data using the SPSS command language rather than the
drop-down menus. The commands are edited and saved in a ‘syntax file’. For some procedures, this is
actually easier and more customisable than using the drop-down menus.
In the exercises that follow you will mainly be using syntax to explore the dataset and to create
new variables.
Three very important rules to remember about SPSS syntax:
- All commands must end with a full stop.
- Commands are not case sensitive. This document gives commands in UPPER CASE to distinguish
the required portions of the command from the variable portions (eg variable names).
- You can add comments before and after commands, but not in the middle of commands.
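The first two rules can be illustrated with a short sketch, using the variable win10 from this dataset:

```spss
* These two commands are equivalent because commands are not case sensitive.
FREQUENCIES VARIABLES=win10.
frequencies variables=win10.
* A command can run over several lines and only the last line ends with a full stop.
FREQUENCIES
  VARIABLES=win10
  /ORDER=ANALYSIS.
```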
Displaying and saving dataset information
24. You can list the variables in a dataset, in their natural order, by clicking on ‘Variable View’ at the
bottom left hand corner of the SPSS data editor window and scrolling up and down the list. It is
also advisable to produce a variable list in your output file that can be copied into Excel.
Select File > Display Data File Information > Working File.
25. In the Output Viewer log you will see that this procedure has produced two tables, one with the
list of variables and the other with the value labels. You will also see the command DISPLAY
DICTIONARY in the output window. You could have produced the same tables by typing that
command into a Syntax file and running it.
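As a sketch, the syntax equivalent is shown below, together with a variant restricted to a few variables (seat, yob and win10 are used here only as examples from this dataset):

```spss
* Variable and value label tables for the whole active dataset.
DISPLAY DICTIONARY.
* The same information for selected variables only.
DISPLAY DICTIONARY /VARIABLES=seat yob win10.
```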
26. Double click on the ‘Variable Information’ table in the Output window to open the table in a
separate window, from which it can be copied (Edit>Select>Table, then Ctrl+C or Edit > Copy)
and pasted (Ctrl+V) onto sheet 2 of the Data log file. Rename sheet 2 with the name of the
source file and what it contains (eg scotmp variable list). You can then do the same with the
table of value labels, copying it onto a third worksheet in the Data log file. This log file can
function as a handy tool for documenting variable and value transformations as well as file
handling information.
[Figures: Variable labels; Value labels]
Descriptive statistics
27. You can use SPSS to look at the number of cases in each category of the variables (using the
frequencies command), check for undocumented codes, missing values, etc. It is important to
run these basic procedures to familiarise yourself with any new dataset and to check that there
are no problems.
28. Nominal and ordinal (aka categorical) variables: Frequencies can be run through the drop-down
menus by clicking on Analyze > Descriptive Statistics > Frequencies, selecting the variables and
then clicking OK. Do this now (for this example we have chosen bithplac, constbir and win10 –
all 3 are nominal variables, according to the Measure column in the Data Editor variable view)
and then look at the Output window.
29. In the output, in addition to the frequencies there are now two additional pieces of information -
the SPSS syntax that produced the output, and the name of the dataset. You can also see from
the first table in the output that all 3 variables have data for all 59 constituencies (all 3 have 0s in
the ‘Missing’ row). We will look at the concept of ‘missing values’ later.
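The syntax echoed in the output for this menu choice should look broadly like the sketch below. The exact subcommands SPSS writes may differ with the options you selected in the dialogue.

```spss
* Frequency tables for three nominal variables.
FREQUENCIES VARIABLES=bithplac constbir win10
  /ORDER=ANALYSIS.
```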
30. Continuous variables: To generate descriptive statistics for those variables labelled as ‘scale’ in
the Data Editor variable view, the type of information provided by frequencies is usually not very
informative. We need a different command. Keep the Data Editor variable view window open in
the background (a handy reference to the variable names labelled ‘scale’), and select Analyze >
Descriptive Statistics > Descriptives.
31. Select the scale variables you want to look at in the left window, move them to the right
window, and click on Options.
32. Make sure that Mean, Std deviation, Range, Minimum and Maximum are selected, click
‘Continue’, and ‘OK’ on the previous dialogue screen.
33. If, in the previous example, you click on the ‘Paste’ button, instead of ‘OK’, SPSS will write the
current syntax, with all selected options, into your current Syntax file, instead of running it. Here
you need to edit the syntax if you want to make changes, highlight it, then click on the Run icon.
34. Your Output window should now list the scale variables selected, showing their count (‘N’),
minimum, maximum, range, mean and standard deviation (spread around the mean), as well as
the SPSS commands that generated the output.
Had you neglected to copy the syntax into the Syntax window using the Paste button earlier, you
could now mouse over the SPSS commands in the Output Viewer (they will be outlined by a yellow
box), copy them (right-click and select Copy), and paste them into the Syntax window (select the
Syntax window, right-click and select Paste).
35. You can also edit the commands in the Syntax window to show the first and last variable names
in the data file to produce one list of descriptive statistics for all the variables in the file:
DESCRIPTIVES VARIABLES=seat to tuscsex1 /STATISTICS=MEAN STDDEV RANGE MIN MAX.
The resulting output shows that SPSS is perfectly capable of weeding out the string variables for
which it cannot compute the requested measures:
36. You can build up a set of commands in your syntax file quite quickly, which is useful for initial
exploration of the data. You should also add your own notes to the syntax file – anything that
you type with an asterisk (*) in front of it will be treated as a comment by SPSS (don’t forget the
full stop at the end of every command or comment). It is good practice to use comments to give
each set of commands a header explaining what the syntax is doing, and if you are working as
part of a team, who wrote the syntax and the date it was written. If you highlight and run
comments together with the syntax to which they refer, the comments will also be echoed in
your Output Window.
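A header of this kind might look like the following sketch. The wording of the header is only a suggestion, and ‘[name]’ and ‘[YYYYMMDD]’ are placeholders for your own details.

```spss
* ------------------------------------------------------------.
* Purpose: descriptive statistics for all variables in the file.
* Author: [name].
* Date written: [YYYYMMDD].
* ------------------------------------------------------------.
DESCRIPTIVES VARIABLES=seat TO tuscsex1
  /STATISTICS=MEAN STDDEV RANGE MIN MAX.
```

Each comment line starts with an asterisk and ends with its own full stop, so each is treated as a separate comment.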
37. The output and syntax files should be saved for future reference and you can use your data log
to record that they have been created. Output files in native SPSS format have the extension .spv
(.spo in versions prior to SPSS 16); syntax files are flat ASCII text files, with the extension .sps. As
your Data log file grows, you may find it easier to add new information at the top of the table (ie
reverse chronological order) rather than the bottom.
38. You may find it useful to set up separate sub-folders for your syntax and output files. During
your research project you will inevitably build up a number of files. Alternatively, collect all
syntax files, output files, and revised data files (where applicable) in one subdirectory, to
distinguish them from other analyses of other data files. If you make a habit of adding the date
in YYYYMMDD format at the beginning of each filename, the file list will display in chronological
order by default.
39. During the course of your research you will often have to create your own variables, ie derived
variables. Using the drop-down menus for this purpose is not recommended, because an
essential part of good data management is keeping a detailed record of how new variables have
been created and syntax files or output files with embedded syntax are the best way of doing
this.
However, if in your own work you find that you still prefer to use menus, you should make sure
to always either paste what you have done into a syntax file or save the output file with the
commands in it.
Defining missing values
40. The variable ‘yob’ contains each winning candidate’s year of birth. If you have not already run
Frequencies on this variable, do so now. You will note that the earliest year of birth in the data
set is ‘1941’, and the latest is ‘9999’, with a frequency of 4 cases. We assume that in these 4
constituencies the year of birth of the winning candidate was unknown.
41. A more useful variable than year of birth, for analysis and interpretation of the data, is age at the
time of the 2010 election. This can easily be computed using SPSS’s compute command.
However, when creating the new variable we do not want to include the cases with ‘9999’ in the
calculations. This is an example of where the ‘missing values’ function in SPSS is useful.
42. By setting ‘9999’ to be a missing value the cases with that value will be ignored in any
subsequent analyses. Use the commands below to define ‘9999’ as a missing value and to give it
a value label that indicates why it is not a valid value. Type the following commands into your
syntax editor window, and run them.
MISSING VALUES yob (9999).
ADD VALUE LABELS yob 9999 "year not known".
FREQUENCIES yob.
Your output file should now show that there are 4 cases with the label ‘year not known’, flagged in
the SPSS system file as missing.
Creating derived variables (compute)
43. Now compute a new variable consisting of candidate’s age at the time of the election in 2010:
* calculating MP's age in 2010.
COMPUTE mpage=2010-yob.
EXECUTE.
VARIABLE LABELS mpage "MP age".
FREQUENCIES mpage.
The new frequencies table of mpage shows that there are 4 cases flagged as Missing by the
system (aka ‘System missing’). These should be the same 4 cases for which we defined ‘9999’ as
a missing value in ‘yob’. But with any compute operation, you should always check your results.
44. One way to check is to use the SELECT IF command. Be very careful when using SELECT IF in a
syntax file; if you do not use the command ‘TEMP’ or ‘TEMPORARY’ before the SELECT IF, the
cases that were not selected will be dropped from the active dataset.
* checking value of YOB assigned missing against MPAGE system missing cases.
TEMP.
SELECT IF sysmis (mpage).
FREQUENCIES yob.
The output confirms that indeed the 4 system missing values are the original cases that we
flagged as missing in ‘yob’.
45. It is good practice to give any system missing cases a value. It is sometimes useful to use a
negative number for missing values so they don’t get confused with genuine values. Many large
Government datasets, for example the Scottish Health Survey, use this convention.
* adding a missing value to mpage.
IF sysmis (mpage) mpage=-9.
ADD VALUE LABELS mpage -9 'year of birth not known'.
MISSING VALUES mpage (-9).
EXECUTE.
FREQUENCIES mpage.
46. When creating a new variable it can be useful to have something to check the values against.
Since there are only 59 cases in the dataset it is possible to list the value of the new variable
along with the year of birth and the candidate’s name. You can use this to check a few by doing
the calculation manually. You could also check the computed values against external sources.
* Listing the variable mpage.
LIST winner10 win10 mpage yob.
Creating derived variables (recode)
47. It may be more useful to have a variable showing the age group of the MP rather than the
individual years of age. You can use the recode command to group the age variable that you
have just created. ALWAYS recode into a new variable, otherwise you will overwrite the
existing one and lose its original values.
48. Missing values are treated differently in RECODE. In the COMPUTE command, the value that we
had set to MISSING was ignored in the calculations. In RECODE we define what we want to
happen to it. We want the new variable to also have a value of -9 and for that to be a missing
value in the new variable.
*creating age group variable mpagegrp.
RECODE mpage
 (20 thru 29=1) (30 thru 39=2) (40 thru 49=3) (50 thru 59=4) (60 thru 69=5) (-9=-9)
 INTO mpagegrp.
EXECUTE.
VARIABLE LABELS mpagegrp "Age group of MP".
VALUE LABELS mpagegrp
 1 "20 to 29"
 2 "30 to 39"
 3 "40 to 49"
 4 "50 to 59"
 5 "60 to 69"
 -9 "age not known".
MISSING VALUES mpagegrp (-9).
FREQUENCIES mpagegrp.
49. To check the recode, one option, with this few values for the scale variable, is to run a crosstab
of the two variables using the drop-down menus: Analyze > Descriptive Statistics > Crosstabs.
In the Case Processing Summary of the output you can see that 4 cases are missing;
they are not included in the output crosstab.
50. Using syntax, however, you can specify that you want the missing values included in the
crosstabs. You can set up a crosstabulation syntax using the drop-down menus, click PASTE to
put the syntax into your Syntax file, and add the line ‘/MISSING=INCLUDE’, so that the syntax
looks like the following:
* checking that recode has worked.
DATASET ACTIVATE DataSet1.
CROSSTABS
/TABLES=mpage BY mpagegrp
/FORMAT=AVALUE TABLES
/MISSING=INCLUDE
/CELLS=COUNT
/COUNT ROUND CELL.
51. The output now includes the cases with missing values (‘-9’).
52. Now that the data file has been changed, it should be saved with a new file name and recorded
in the Data log.
* Save file as a new version.
SAVE OUTFILE="[path]\[date]scotmp_v2.sav" / MAP.
53. The above example shows one way of naming files to keep track of successive versions. Another
method is to begin each filename with the YYYYMMDD of the date on which it was created, eg:
20140519scotmp.sav
20140521scotmp.sav
Filenames that begin with a standardized date format (as above) will be sorted in chronological
order when listed.
54. It is important to keep a list of derived variables and the syntax files that created them in your
Data log, eg in a new sheet labelled ‘Derived variables’. As you work through your analyses you
can also add notes to remind you if you have to correct any of the derived variables.
ADDING DATA TO A DATA FILE – adding variables
55. If you have two (or more) datasets that have an identifying variable in common between them it is possible to use MATCH FILES to add the variables from one data file to the other. MATCH FILES can match up to 50 files in one operation, as long as:
- All files have the same unique identifying variable (key variable) in common (numeric or string)
- All files are sorted in ascending order of the key variable
- All files have no duplicate cases, or the same number of duplicate cases
- All variables to be merged have unique variable names. Variables with duplicate variable names will either be dropped from the merged file, or must be renamed.
As a first step, open the two additional files election.sav and census.sav, generate lists of the variables in them, and copy the variable lists to your Data log file (see 24-26 above). These lists will allow you to easily see what variables the 3 data files have in common. In this case all 3 files have the variable ‘seat’ in common. Also, all 3 files (scotmp_v2.sav, election.sav and census.sav) have the seat as the unit of observation, and in all three datasets each row is for one seat, so this is a simple one-to-one merge.
56. In SPSS it is easy to determine whether a variable has a unique value for each case. The
best way to do this is through the drop-down menus. Open the dataset ‘election.sav’. It contains variables that describe the election results in each constituency.
To check whether ‘seat’ is a unique identifier in the datafile, select Data > Identify Duplicate Cases from the drop-down menus.
The syntax for this procedure is somewhat complex, and therefore it is easier to perform
this procedure using the drop-down menus:
DATASET ACTIVATE DataSet2.
* Identifying Duplicate Cases.
SORT CASES BY seat(A).
MATCH FILES
/FILE=*
/BY seat
/FIRST=PrimaryFirst
/LAST=PrimaryLast.
DO IF (PrimaryFirst).
COMPUTE MatchSequence=1-PrimaryLast.
ELSE.
COMPUTE MatchSequence=MatchSequence+1.
END IF.
LEAVE MatchSequence.
FORMATS MatchSequence (f7).
COMPUTE InDupGrp=MatchSequence>0.
SORT CASES InDupGrp(D).
MATCH FILES
/FILE=*
/DROP=PrimaryFirst InDupGrp MatchSequence.
VARIABLE LABELS PrimaryLast 'Indicator of each last matching case as Primary'.
VALUE LABELS PrimaryLast 0 'Duplicate Case' 1 'Primary Case'.
VARIABLE LEVEL PrimaryLast (ORDINAL).
FREQUENCIES VARIABLES=PrimaryLast.
EXECUTE.
This syntax creates a variable ‘PrimaryLast’ with a value of 1 for the last case in each group of cases
that share the same value of seat (the ‘primary’ case), and a value of 0 for any other case in the group
(a duplicate). In this dataset each value of seat is unique, so every case has the value 1.
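If you would rather have a shorter check that is easy to read in a syntax file, one alternative (not the syntax that the menu generates) is to count the cases per value of the key variable with AGGREGATE. This is a sketch only; n_seat is a hypothetical name for the helper variable.

```spss
* Count how many cases share each value of seat and add the count to every case.
* n_seat is a hypothetical name for the new count variable.
AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=seat
  /n_seat=N.
* If seat is unique then n_seat is 1 for every case.
FREQUENCIES VARIABLES=n_seat.
* Remove the helper variable once the check is done.
DELETE VARIABLES n_seat.
```

Any value of n_seat greater than 1 in the frequency table indicates duplicate values of seat that need to be investigated before merging.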
57. Before closing the ‘election.sav’ dataset we will delete the new PrimaryLast variable, sort the dataset by ‘seat’ and save it. But first we need to make sure exactly which dataset we are
performing these operations on.
* Sorting election.sav on key variable, and dropping 'PrimaryLast'.
DATASET CLOSE All.
GET FILE="M:\mantra\spss_practicum\scotmp_analysis2\election.sav" / DROP=PrimaryLast.
SORT CASES BY seat(A).
SAVE OUTFILE="M:\mantra\spss_practicum\scotmp_analysis2\election_sorted.sav" / MAP.
We can then do the same with the datasets scotmp.sav and census.sav.
- Check for duplicates in the variable ‘seat’
- Delete the PrimaryLast variable
- Sort by ‘seat’
- Save the sorted file into a new file.
58. In contrast to the syntax for checking for duplicates, the syntax for MATCH FILES is actually quite simple:
*Merging scotmp.sav, election.sav, and census.sav on 'seat'.
DATASET CLOSE All.
MATCH FILES FILE= "M:\mantra\spss_practicum\scotmp_analysis2\20140526_scotmp_sorted.sav"
/FILE="M:\mantra\spss_practicum\scotmp_analysis2\election_sorted.sav"
/FILE="M:\mantra\spss_practicum\scotmp_analysis2\census_sorted.sav"
/BY seat.
EXECUTE.
SAVE OUTFILE ="M:\mantra\spss_practicum\scotmp_analysis2\20140526_scotmp_merged.sav" /
KEEP=all / MAP.
Alternatively, use the drop-down menus: Data > Merge Files > Add Variables.
ADDING DATA TO A DATA FILE – adding cases
59. Another common file transformation activity is to add cases to an existing data file. In this instance, the ‘scotmp_merged.sav’ dataset contains only Scottish constituencies. In order to compare them with English/Welsh constituencies, we need to add cases describing those constituencies as well.
60. Open the ‘scotmp_merged.sav’ dataset. From the drop-down menus, select Data > Merge files > Add cases.
61. You will be prompted for the name of an SPSS system file from which to add cases (the file must be saved as an SPSS system file prior to running this procedure), either one that is already open, or one that is saved in your system:
62. The next dialogue screen lists, in the right window, all variables held in common between the
two files being merged, and in the left window, those that occur in only one or the other file,
with an indication as to which file each variable is from. These unmatched variables can be
included in the output data set (move them into the right window), but will have system missing
values assigned for those cases in which the variables do not occur. Notice that ‘seat’ is spelled
‘Seat’ in the britmp.sav file.
Variables must match on several aspects in order to be ‘paired’ between the two files: (a) the
variable names must match, and the match is case sensitive; (b) the variable must be of the same
type, string or numeric, in both files (string variables are flagged by ‘>’); (c) string variables must
have the same length. Notice how the ‘seat’ variable is defined in the two files.
As you can see from the pasted syntax, SPSS’s default is to first rename all unmatched variables,
and then drop them. When SPSS does something unexpected, it is a good idea to paste the
syntax into your Syntax file, and check to see what is happening, in case you missed or
misinterpreted some defaults.
*Adding British constituencies.
DATASET ACTIVATE DataSet3.
ADD FILES /FILE=*
 /RENAME (aws10 bithplac bnpppc10 bnpsex10 change05 conppc10 consex10 constbir greenppc
 greensex greenvt1 ks24mi_1 ks24mi_2 ks24mi_3 ks24migr ks24tpop labppc10 labsex10 ldppc10
 ldsex10 mpage mpagegrp mpsex10 other10 seat snpppc10 snpsex10 totalvot tusc10 tuscppc1
 tuscsex1 tuscvt10 ukipppc1 ukipsex1 unemploy urbanrur uv010000 uv010001 uv0100_1 uv0100_2
 uv0100_3 uv0100_4 uv0100_5 uv0100_6 uv0100_7 uv0100_8 uv0100_9 uv010_10 uv010_11 uv010_12
 uv016000 uv016001 uv0160_1 uv0160_2 uv0160_3 uv0160_4 uv0160_5 uv0160_6 uv0160_7 uv0160_8
 uv0160_9 uv068000 uv068001 uv0680_1 uv0680_2 uv0680_3 uv0680_4 uv0680_5 uv0680_6 uv0680_7
 win05 win10 winner10 yob=d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18
 d19 d20 d21 d22 d23 d24 d25 d26 d27 d28 d29 d30 d31 d32 d33 d34 d35 d36 d37 d38 d39 d40 d41
 d42 d43 d44 d45 d46 d47 d48 d49 d50 d51 d52 d53 d54 d55 d56 d57 d58 d59 d60 d61 d62 d63 d64
 d65 d66 d67 d68 d69 d70 d71 d72 d73)
 /FILE='DataSet2'
 /RENAME (bnp0510 con0510 Con_LDswing Greenvt10 lab0510 Lab_Conswing Lab_LDswing ld0510
 Maj05 Maj10 pc05 PC10 PCvt10 Seat Seatch10 Turn0510 TypeChange05_10 ukip0510 Win05
 Win10=d74 d75 d76 d77 d78 d79 d80 d81 d82 d83 d84 d85 d86 d87 d88 d89 d90 d91 d92 d93)
 /DROP=d0 d1 d2 d3 d4 d5 d6 d7 d8 d9 d10 d11 d12 d13 d14 d15 d16 d17 d18 d19 d20 d21 d22 d23
 d24 d25 d26 d27 d28 d29 d30 d31 d32 d33 d34 d35 d36 d37 d38 d39 d40 d41 d42 d43 d44 d45 d46
 d47 d48 d49 d50 d51 d52 d53 d54 d55 d56 d57 d58 d59 d60 d61 d62 d63 d64 d65 d66 d67 d68 d69
 d70 d71 d72 d73 d74 d75 d76 d77 d78 d79 d80 d81 d82 d83 d84 d85 d86 d87 d88 d89 d90 d91 d92
 d93.
EXECUTE.
You can edit the ‘britmp.sav’ file directly in the Data Editor (Variable View tab) to eliminate, eg,
mixed case variable names, and edit the length of the ‘seat’ string variable so it is the same in both
files. Then retry Add cases.
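The same repair could be made in syntax rather than by hand, which has the advantage of leaving a record for your data log. The following is a minimal sketch only: the bracketed path, the output name ‘britmp_revised.sav’ and the width A40 are assumptions, and the width should be replaced by whatever width ‘seat’ has in the Scottish file.

```spss
* Sketch only: the path, output filename and width A40 are assumptions.
GET FILE="[path]\britmp.sav".
* Change the capitalization so the name pairs with 'seat' in the other file.
RENAME VARIABLES (Seat=seat).
* Set the string width to match 'seat' in the other file.
ALTER TYPE seat (A40).
DISPLAY DICTIONARY /VARIABLES=seat.
SAVE OUTFILE="[path]\britmp_revised.sav" /MAP.
```

Other mixed case names (for example ‘Win05’ and ‘Win10’) can be listed in the same RENAME VARIABLES command, provided the variables are of the same type in both files.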
You may also want to add a region variable, to flag the difference between the Scottish
constituencies and those elsewhere in the UK. You can easily enter a small number of values of a
new variable manually in the Data Editor. Start by selecting Edit > Insert variable from the drop-
down menu.
Try Add cases again, until you are happy with the result you are getting in the merged output
file. Assign it a name and save it, and log the new file in your data log file. You should also save
the revised versions of the ‘britmp.sav’ file and the ‘scotmp_merged.sav’ file.
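Once the variables pair up, the Add cases step and the region flag might be written in syntax along the following lines. This is a sketch under stated assumptions: the variable name ‘scottish’, the bracketed path and the output filename are hypothetical. The /IN subcommand creates a variable that is 1 for cases coming from the file named immediately before it and 0 for all other cases, which is one way of flagging the Scottish constituencies without typing values in by hand.

```spss
* Sketch only: 'scottish' and the bracketed names are assumptions.
DATASET CLOSE ALL.
ADD FILES
  /FILE="M:\mantra\spss_practicum\scotmp_analysis2\20140526_scotmp_merged.sav"
  /IN=scottish
  /FILE="[path]\britmp.sav".
VARIABLE LABELS scottish 'Case comes from the Scottish file'.
VALUE LABELS scottish 0 'Elsewhere in the UK' 1 'Scotland'.
* Check that the case counts from each file are what you expect.
FREQUENCIES VARIABLES=scottish.
SAVE OUTFILE="[path]\[filename].sav" /MAP.
```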
Writing a raw data file
63. As important as reading in a data set is writing it out in another format, either for use in a
different statistical package, or for long-term preservation. Software such as Stat/Transfer can
be used to transfer data among software packages or to write raw data in a generic format.
Alternatively, use the appropriate SPSS command to write the outfile in an appropriate format
(see Appendix D for a list of SPSS outfile commands and the formats SPSS can write). Note that
SPSS no longer writes out an SPSS syntax file (although Stat/Transfer does).
64. First check the variable types and variable names. The most generic, non-controversial and
flexible data format is fixed-field format ASCII. It can handle many file structures, including
varieties of hierarchical files, and is not sensitive to commas and/or blanks embedded in
variables.
65. Next, generate the variable information from File > Display Data File Information > Working File
for the file you need to save, or run DISPLAY DICTIONARY from the Syntax window. Check
carefully to make sure that eg Print Format and Write Format are the same for each variable.
Save the variable and value lists.
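One way to save the variable and value lists in a software neutral form is to route the DISPLAY DICTIONARY tables to a plain text file with the Output Management System (OMS). A minimal sketch, in which the output filename is an assumption:

```spss
* Sketch only: the output filename is an assumption.
OMS
  /SELECT TABLES
  /DESTINATION FORMAT=TEXT OUTFILE="[path]\[filename]_dictionary.txt".
DISPLAY DICTIONARY.
OMSEND.
```

Every table produced between OMS and OMSEND is written to the text file, so keep only the DISPLAY DICTIONARY command between them.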
66. The syntax to write a fixed field format file in SPSS is:
DISPLAY DICTIONARY.
WRITE OUTFILE="[path]\[filename].txt" TABLE / ALL.
EXECUTE.
If there is a chance of the output records exceeding 8,192 characters in length, use the following
structure, with a rough and over-generous estimate of the output record length:
FILE HANDLE [nickname] name="[path]\[filename].txt" / LRECL=10000.
WRITE OUTFILE=[nickname] TABLE / ALL.
EXECUTE.
67. When writing a raw data file, make sure to SAVE the output table of where the variables are
written to, as well as the file information generated above – this is your only record of what
variable is in what column(s).
68. Open the data file in a format neutral editor, such as Notepad, which will give you a count of the
number of cases, as well as what the record length of the data records really is (you will need to
position the cursor at the end of a line to see the column number indicated in the bar along the
bottom of the screen). Check this column number against the output table generated by SPSS to
make sure that the final column numbers match.
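A further check, not part of the original exercise, is to read the raw file back into SPSS using the column positions from the output table and compare the results with the original dataset. In the sketch below the column positions are invented purely for illustration; ‘seat’ and ‘mpage’ are variables from the practicum files, but their real positions must be taken from the table that WRITE produced.

```spss
* Sketch only: the column positions below are invented for illustration.
* Take the real positions from the table produced by the WRITE command.
DATA LIST FILE="[path]\[filename].txt" FIXED
  /seat 1-40 (A) mpage 41-43.
* The N, minimum, maximum and mean should agree with the original dataset.
DESCRIPTIVES VARIABLES=mpage.
```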
69. Save your raw data file (file extensions such as .txt or .dat), the output file (in a software neutral
format), and the file in which you have stored the variable and value lists. And finally, of course,
update and save your Data log!
Here concludeth today’s lesson...
Appendix A: Online resources:
IBM SPSS Statistics 19 manuals
http://www.unt.edu/rss/class/Jon/SPSS_SC/Manuals/SPSS_Manuals.htm
IBM SPSS Statistics 19 Command Syntax Reference
http://public.dhe.ibm.com/software/analytics/spss/support/Stats/Docs/19.0/Server/User_Manuals/English/IBM_SPSS_Statistics_Command_Syntax_Reference.pdf
IBM SPSS online documentation (Note: earliest version available is 20.0)
http://pic.dhe.ibm.com/infocenter/spssstat/v20r0m0/index.jsp
Raynald’s SPSS tools: http://www.spsstools.net/
University of Edinburgh. Information Services. SPSS
http://www.ed.ac.uk/schools-departments/information-services/research-support/research-computing/statistics/supported-pkgs/spss/overview
Appendix B: Selected inter-system limitations on filenames, variable
and value names etc.
FILE NAMES
Operating systems, and versions, differ as to the file name conventions they accept. VM/CMS requires filenames consisting of 8+8+1 characters. Older versions of DOS and Windows require short (max. 8 characters) filenames + a 3-character extension.
Later versions of Windows accept longer extensions (*.html rather than *.htm, or *.sas7bdat) and multiple extensions (*.tar.gz)
Mac operating systems do not require filename extensions. It is therefore very important to keep track of file format information.
Windows and Mac allow filenames with embedded blanks, whereas these need to be surrounded by quotes in Unix/Linux operating systems.
VARIABLE NAMES
In SPSS, variable names must begin with a letter or one of the characters ‘@’, ‘#’ or ‘$’, and names beginning with ‘#’ or ‘$’ have special functions (scratch and system variables). Variable names should not end in a period, since this is the command terminator in SPSS.
Case sensitivity: SAS will convert variable names ‘mpage’ and ‘MPage’ to ‘MPAGE’ for purposes of analysis (ie treat all 3 versions as one and the same variable), though not for purposes of display. In Stata, however, these are treated as 3 different variables. In SPSS, existing variable names are not case sensitive, while new variable names are.
Variable names longer than 8 characters are truncated when exported to SPSS versions pre 12.0, SPSS .por files, SAS pre-V7, and Stata versions pre-7.
Special characters such as ‘@’, ‘#’ and ‘$’ are not allowed in SAS variable names, the last three are replaced with underscores. In Stata, the only allowable characters are letters, numbers, and underscores.
SAS variable names must begin with a letter, or an underscore ‘_’.
Tab characters embedded in string variables are preserved in tab-delimited export formats.
VARIABLES AND VALUES
When writing out files in Stata 5-6 and Intercooled 7-8 formats, only the first 2,047 variables are saved.
Variables defined in ranges (eg in SAS, ‘V1’-‘V100’) may lose their variable and value labels.
All SPSS user-defined missing values are mapped to a single system-missing value in SAS.
Variable labels longer than 40 bytes are truncated when exported to SAS v6.
Variable labels longer than 80 bytes are truncated when exported to Stata.
Value labels longer than 80 bytes are truncated when exported to Stata.
Appendix C: Common file and variable transformations and their
corresponding SPSS commands:
File transformations                  SPSS syntax
- Sort cases                          SORT CASES
- Sort variables                      SORT VARIABLES
- Transpose (cases and variables)     FLIP
- Merge – add cases                   ADD FILES
- Merge – add variables               MATCH FILES
- Weight cases                        WEIGHT
- Split files                         SPLIT FILE
- Aggregate data                      AGGREGATE

Variable transformations              SPSS syntax
- Compute new variables               COMPUTE
- Recode                              RECODE
- Rank cases                          RANK
- Random number generation            Transform > Random Number Generators
- Count occurrences                   COUNT
- Shift values                        SHIFT VALUES
- Time series operations              CREATE
                                      RMV
                                      SEASON
                                      DATE
                                      SPECTRA