

  • Preparing a SPSS syntax file

    Purpose

    The philosophy and purpose behind editing the SPSS syntax file is to ensure that the end user has access to accurate and searchable metadata stored in the Equinox Data Delivery System. Time spent editing the SPSS file before processing it into Equinox saves time in the editing stage for data entry staff, who have fewer errors that they have to correct, and ensures that the files will be immediately findable and usable when they are loaded into the system. The process of checking the files also allows those participating in the SPSS editing and Equinox loading processes to detect anomalies in the documentation and/or data files, and to report these problems to the supplier for clarification or correction.

    For those reasons, every attempt must be made to ensure that the SPSS syntax file is accurate and complete. Abbreviations, and especially contractions, are to be avoided wherever possible: their use poses difficulties to searchers, since Equinox does not use a thesaurus system which might translate abbreviations to the full text.

    Restrictions

    Equinox is only able to handle flat ASCII files. Complex data files (such as the U.S. Census) which include multiple record types within the same file must be broken into separate files by record type, and key variables used to join the files together: see page 104.

    Special case: bootstrap (replicate) weight variables

    In statistics, bootstrapping is a method for assigning measures of accuracy to sample estimates.¹ Bootstrapping involves repeatedly generating the same statistics using different sets of weights, and consequently creating ranges for the estimates. Some data sets distributed by Statistics Canada either include bootstrap weights within the file, or provide separate bootstrap weight files so that users wishing to perform bootstrap analysis may do so.

    The difficulty with the inclusion of bootstrap weights revolves around the size of the data files saved for the user. Most undergraduates, and many graduate students and faculty members, will never use bootstrapping, and consequently do not need the bootstrap weights (often five hundred in number). These variables simply serve to take up space (and, in the case of Stata, memory) if automatically downloaded as part of the data file when not needed.²

    Therefore, Equinox policy is that if a file contains more than thirty bootstrap weights, the unique record identifier and the bootstrap weights will be written out to a separate file that the user may choose to retrieve or to ignore as desired.³ If needed, the user may retrieve the entire bootstrap file, and then match it to the data with which the bootstrap file is to be used.

    ¹ Efron, B.; Tibshirani, R. (1993). An Introduction to the Bootstrap. Boca Raton, FL: Chapman & Hall/CRC. ISBN 0-412-04231-2; from Wikipedia, http://en.wikipedia.org/wiki/Bootstrapping_(statistics), accessed 2012-08-09.

    ² The General Social Survey of Canada 2005, for example, contains three distinct sets of bootstrap variables: 500 variables for each of the main sample, the cultural/sports sample, and the social network sample. The main data file is 44,572 KB in size; each of the three separate bootstrap files is over 75,000 KB in size.

    A sample SPSS file (from cycle 24 of the General Social Survey) shows how these bootstrap files are written out as separate files which are subsequently processed as a normal file with variables. The SPSS file is available as a zipped online file.
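    The separation and later re-matching described in this section can be pictured with a short Python sketch. It is not the Equinox implementation (which works through SPSS syntax); the column names (recid, age, sbswt001 ...) and both function names are hypothetical and chosen only for illustration. The threshold of thirty comes from the policy stated in the text.

```python
# Illustrative only: split bootstrap weights from a main file and match them back.
# Column names and function names are assumptions made for this sketch.

BOOTSTRAP_LIMIT = 30  # policy threshold from the text: "more than thirty"


def split_bootstrap(records, id_name, bootstrap_names):
    """Return (main_rows, bootstrap_rows); both keep the unique record identifier."""
    if len(bootstrap_names) <= BOOTSTRAP_LIMIT:
        # Thirty or fewer weights stay in the main file.
        return [dict(r) for r in records], []
    drop = set(bootstrap_names)
    main_rows = [{k: v for k, v in r.items() if k not in drop} for r in records]
    boot_rows = [
        {id_name: r[id_name], **{b: r[b] for b in bootstrap_names}} for r in records
    ]
    return main_rows, boot_rows


def match_bootstrap(main_rows, boot_rows, id_name):
    """Re-attach bootstrap weights to the main rows by matching on the identifier."""
    by_id = {b[id_name]: b for b in boot_rows}
    return [{**m, **by_id.get(m[id_name], {})} for m in main_rows]


if __name__ == "__main__":
    names = [f"sbswt{i:03d}" for i in range(1, 32)]  # 31 hypothetical weights
    rows = [
        {"recid": 1, "age": 34, **{n: 1.5 for n in names}},
        {"recid": 2, "age": 51, **{n: 2.5 for n in names}},
    ]
    main, boot = split_bootstrap(rows, "recid", names)
    print(sorted(main[0]))                               # ['age', 'recid']
    print(match_bootstrap(main, boot, "recid") == rows)  # True
```

    The point of the sketch is that the match only works because the unique record identifier is written to both files.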

    SPSS settings required for proper output creation

    With version 21 of SPSS, the program defaults to using Unicode character encoding. Unfortunately, if Unicode character encoding is used, the revised raw data files are output with three control characters at the start of the line, which in turn causes SQL to fail when trying to load them. To avoid this problem, it is necessary to ensure that locale encoding is used, either by selecting locale encoding when you start SPSS for the first time, or by changing the default encoding using the interface (as below).
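    The text does not name the three leading characters. A reasonable assumption is that they are the UTF-8 byte order mark (bytes EF BB BF), which Unicode-mode output commonly carries. Under that assumption, a minimal Python sketch for checking an output file before loading might look like this; the function names are hypothetical.

```python
import codecs


def starts_with_utf8_bom(data: bytes) -> bool:
    """True when the byte string begins with the UTF-8 byte order mark (EF BB BF)."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data: bytes) -> bytes:
    """Return the data without a leading UTF-8 byte order mark, if one is present."""
    return data[len(codecs.BOM_UTF8):] if starts_with_utf8_bom(data) else data


if __name__ == "__main__":
    sample = codecs.BOM_UTF8 + b"0001 34 1\r\n"
    print(starts_with_utf8_bom(sample))  # True
    print(strip_utf8_bom(sample))        # b'0001 34 1\r\n'
```

    Setting locale encoding in SPSS, as described here, remains the fix; a check like this only helps confirm that a file was written with the right setting.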

    In order for the programs which populate the Inmagic databases from the SPSS output to operate properly, you must set certain options in SPSS to specific values. This step needs to be done once only. The menu locations below were based on SPSS version 20: other versions may have different menus for these options. Start the process by launching SPSS. Once SPSS is running, click on the Edit menu, and then Options.⁴

    1. General tab
        a. check “No scientific notation for small numbers in tables”
        b. check “Locale’s writing system” in the Character Encoding for Data and Syntax box
    2. Output labels
        a. Output Labelling
            i. Variables in item labels shown as Names and Labels
            ii. Variable values in item labels shown as Values and Labels
        b. Pivot Table Labelling
            i. Variables in labels shown as Names and Labels
            ii. Variable values in labels shown as Values and Labels
    3. Pivot Tables
        a. Table look: should point to F:\inetpub\equinox\SPSS-setup\extended.stt
        b. Column widths: Adjust for labels and data for all tables
        c. Default Editing Mode: Edit all but very large tables in Viewer
        d. Copying wide tables to the clipboard in rich text format: Do not adjust width
    4. Currency
        a. Decimal Separator: Period

    ³ The exception to this rule is one of the first files processed for the precursor to Equinox, the U.S. National Survey of Families and Households, Cycle 1. This file contains 52 replicate “half-sample” indicators of 1 byte each.

    ⁴ You wish to overwrite the default options within SPSS for use with Equinox. It is possible that you have already set some of these options to these values: however, if all are not set properly, the output generated will not be useable by the programs that are designed to transform the SPSS output to Inmagic input.

  • General notes on editing syntax files

    SPSS code is case insensitive. The exception to this is labels - any text contained either within ‘single’ or “double” quotes (including file names and directories under Unix-based operating systems). Labels will appear in the output exactly as they are typed.

    It is recommended that you use a plain ASCII editor (such as Programmer’s File Editor (PFE) or Notepad++) to edit the SPSS syntax files rather than using the syntax editor within SPSS. Since you may need to make multiple similar changes, a programmable editor will be highly useful. If using PFE, the font should be set to a fixed-pitch font such as Courier New.

    SPSS commands continue until they locate a command terminator - in English SPSS, this is a period (.). As shown in the examples below, the indentation by one character on each line after a command (e.g., DATA LIST) is now simply stylistic (although formerly required). However, the syntax file is cleaner and easier to follow if you place a space at the start of the second and subsequent lines of a command that spans multiple lines. Additionally, there is no reason to have more than one space at the start of the line: the syntax file will be smaller and easier to read if lines have only one space to begin them.

    A reminder: save your work

    It would be difficult to overemphasize the importance of saving the syntax file regularly, in case of power failure or other mishap. In either PFE or Notepad++, CTRL-S will save your work.

    First step: copy the supplied syntax file

    Purpose

    It is advisable to keep a copy of the original syntax file supplied by Statistics Canada as a reference when making modifications to the syntax file in order to load the data into Equinox.

    Action

    Normally, the DLI program stores SPSS syntax files in the documentation (DOC) directory for the survey. Using Windows Explorer, copy and paste a new copy of the syntax file from the documentation directory into your working directory. Rename this file to correspond with the Unique File Identifier that you have established for the file.

    e.g., rename bds_cycle15_2001_main.sps to bds2001m.sps

    Variable naming conventions: renaming variables

    Modern versions of software such as SPSS and SAS have the capability to read variable names that are longer than 8 characters. However, users of older versions of the software (which may not have licenses that expire) may be limited to variable names of 8 characters or fewer.

    In order to make Equinox software-independent, Equinox stores the original variable name as distributed by the supplier, and if necessary, an alternate variable name that complies with the 8-character restriction. When an alternate is created, it is used as the primary name in Equinox. However, users may switch back to the longer form of the variable name if desired. This capacity of Equinox is also used for recording both the original variable name and a substitute variable name which is system-generated if the original name is a reserved word in SQL (e.g., compute1 is stored as the variable name instead of compute, which is a SQL command).

    List of IDENTIFIED SQL and/or Inmagic reserved words

    It is important to note that reserved words generally are not used as variable names. The following list identifies reserved words which have been used in one or more files over the life of the system to the current date of the manual. Both the reserved word and its substitute are noted. The renaming of reserved words occurs in all three programs which manipulate output from SPSS.

    Reserved   Substitute   Reserved   Substitute   Reserved   Substitute
    break      BREAK1       compute    COMPUTE1     field      FIELD1
    foreign    FOREIGN1     group      GROUP1       level      LEVEL1
    name       NAME1        over       OVER1        return     RETURN1
    source     SOURCE1      tran       TRAN1        union      UNION1

    Software recommendations

    Obtain and install either a purchased or shareware version of the Search and Replace software (SRS) from http://www.searchandreplace.com. This will dramatically speed up the process of renaming large numbers of variables in files, and will ensure consistency in applying needed changes. While the instructions below assume that SRS is available, other programs might be adaptable to perform the same actions. Note that the shareware version expires after 30 days; the cost to license the software is $25 per user.

    Equinox field structure related to variable names

    Field name         Function
    Varname            This field stores the standard 8-character or less variable name, assuming that it was not a SQL reserved word. This varname is used as a link to the varorder database and to the values associated with the variable.
    Varnamelong        If there is a long variable name (e.g., more than 8 characters), or the original variable name was a reserved word in SQL, that variable name will be stored in this field.
    Varnamelongflag    If the Varnamelong field is populated, this field will be populated with the value “Y”.

    Three fields in the Equinox database are related to variable names. All of these fields will be automatically populated by the programs that transform SPSS output into Inmagic input if the procedures set out below are followed. Both forms of the variable name are searchable.
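    As a rough picture of how the three fields relate, the following Python sketch fills them for a single variable. It is not the Equinox program: the function name is hypothetical, the case-insensitive reserved-word check is an assumption, and the short name for a long variable is supplied by the person editing the file rather than generated. The substitutes are copied from the list of identified reserved words above.

```python
# Substitutes copied from the list of identified reserved words in the text.
RESERVED_SUBSTITUTES = {
    "break": "BREAK1", "compute": "COMPUTE1", "field": "FIELD1",
    "foreign": "FOREIGN1", "group": "GROUP1", "level": "LEVEL1",
    "name": "NAME1", "over": "OVER1", "return": "RETURN1",
    "source": "SOURCE1", "tran": "TRAN1", "union": "UNION1",
}


def populate_name_fields(original, short_name=None):
    """Return the three name fields for one variable (hypothetical helper)."""
    substitute = RESERVED_SUBSTITUTES.get(original.lower())  # assumed case-insensitive
    if substitute is not None:
        return {"Varname": substitute, "Varnamelong": original, "Varnamelongflag": "Y"}
    if len(original) > 8:
        if not short_name or len(short_name) > 8:
            raise ValueError("a short name of 8 characters or fewer is required")
        return {"Varname": short_name, "Varnamelong": original, "Varnamelongflag": "Y"}
    # Short, non-reserved names need no alternate.
    return {"Varname": original, "Varnamelong": "", "Varnamelongflag": ""}


if __name__ == "__main__":
    print(populate_name_fields("compute")["Varname"])                 # COMPUTE1
    print(populate_name_fields("AHAI_Q02_1", "AHAIQ02A")["Varname"])  # AHAIQ02A
```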

    Variable naming procedures

    Experience shows that it is best to make any necessary changes to the variable names prior to any other steps in editing your working copy of the SPSS syntax file (e.g., bds2001m.sps). It is also recommended that these changes be made in batch mode using the Search and Replace Software. Using SRS is far more efficient than trying to change them on an ad hoc basis as you encounter them: with SRS, it is possible to script changes to ensure that EVERY iteration of a change is made at once.

    Open the syntax file in either Notepad++ or in PFE (recommended, because you may have multiple files onscreen simultaneously, and for which instructions follow below). Open F:\inetpub\equinox\SPSS-Setup\SearchAndReplaceScript.srs, the sample script file. Open a new file (CTRL-N). Tile the windows horizontally on the screen (Shift-F4).

    Copy the first line and the last line from the sample script window to the new (empty) window. Position the cursor at the end of the first line in the now non-empty window. Close the sample script window. Retile the windows (Shift-F4), and move into the SPSS file. Load the macro F:\inetpub\equinox\SPSS-setup\srs-script-populater.kbm into PFE (Macro >> Load Recording >> specify file).

    Highlight the first long variable name in the SPSS window. Press F7 - the variable name will be copied from the SPSS window to the SRS script, and will be prefaced by [Search] and followed by [Replace] and a blank line. The cursor will move back to the SPSS window. Highlight the next long variable name, press F7. Repeat until you have identified all long variable names. Note the Potential Pitfalls (see page 19), and ensure that you place the variables in the correct order in the file (by, if necessary, moving up to a blank line prior to the potential problem).

    Move to the SRS script window. Each blank line should be replaced with a new variable name (with a maximum length of 8 characters). At any point in the procedure, and certainly upon finishing adding the new variable names, you should save the SRS script file in the working directory with the SPSS file, using the unique file identifier once again, with a SRS extension (e.g., bds2001m.srs).

    Recommendations on assigning new variable names

    In the case of series such as the General Social Survey of Canada, you may encounter the same long variable name repeated in different cycles of the file (e.g., AGE_DIV_MA1). If you are working with a new release of a file that has previously been loaded into Equinox, check the Equinox database by searching the VarnameLong field to determine if the same long variable name was previously loaded into the system. If it has, use the same shortened version that was used previously.

  • Figure 1: SPSS syntax and SRS Script files

    Figure 1 shows PFE loaded with a SPSS syntax file (pals2006.sps), and the completed SRS script file (pals2006.srs). Note how the corrections for variable AHAI_Q02_10 precede those for variable AHAI_Q02_1: the explanation of this is found in Potential Pitfalls (see page 19). In Figure 1, you can see that there is a long sequence of variables in the SPSS file beginning with the root AHAI_Q02. The practice that has been used normally to rename these variables is to remove any underscores (giving a new root of AHAIQ02). The text following the root is numeric - “_1”, “_2”, etc. Generally, this is replaced by an alphabetic representation, where “_1” is replaced by “A”, “_3” is replaced by “C”, etc. Following this logic, variable AHAI_Q02_12 (if there had been such a variable) would be replaced by AHAIQ02L.

    As we look further down in the SPSS syntax file, we find AHAI_Q12_1_R and AHAI_Q12_3_R - these were replaced by AHAIQ12A and AHAIQ12B respectively. Further down in the file (not shown on the screen) the variable AACM_AACM_Q01 is found - it is replaced by AACMAACM.

    Having given these recommendations, there are no absolute hard-and-fast rules on renaming variables. The intent of using the same variable names as previously assigned for the same long variable name is to provide consistency to the end users of the system: if they work with multiple cycles of a file with long variable names, they will know in advance the Equinox names of those variables in new cycles.
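    The usual convention (drop the underscores from the root, replace the numeric suffix with a letter) can be sketched in Python as follows. This is only a suggestion generator under that one convention: the function name is hypothetical, and names such as AHAI_Q12_1_R or AACM_AACM_Q01, which the text shows being renamed by judgment, are deliberately rejected so that they are handled by hand.

```python
import re
import string


def suggest_short_name(long_name: str) -> str:
    """Drop underscores from the root and turn a trailing _<number> into a letter."""
    match = re.fullmatch(r"(.+)_(\d+)", long_name)
    if match is None:
        raise ValueError(f"{long_name!r} does not end in _<number>; rename it by hand")
    root = match.group(1).replace("_", "")
    index = int(match.group(2))
    if not 1 <= index <= 26:
        raise ValueError("only suffixes 1 to 26 map onto a single letter")
    candidate = root + string.ascii_uppercase[index - 1]
    if len(candidate) > 8:
        raise ValueError(f"{candidate!r} is longer than 8 characters; shorten it by hand")
    return candidate


if __name__ == "__main__":
    print(suggest_short_name("AHAI_Q02_1"))   # AHAIQ02A
    print(suggest_short_name("AHAI_Q02_12"))  # AHAIQ02L
```

    Any suggestion should still be checked against names already loaded into Equinox for earlier cycles, since consistency with previous assignments takes priority over the convention.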

    Potential pitfalls

    If you have many variables with the same root name, it is important to sequence the changes correctly, since SRS operates from the top of the script file to the bottom, making changes in the order in which they appear. Consider the table below:

    Wrong                                        Right
    [Search]Root_varname_1 [Replace]NewrootA     [Search]Root_varname_10 [Replace]NewrootJ
    [Search]Root_varname_2 [Replace]NewrootB     [Search]Root_varname_1 [Replace]NewrootA
    [Search]Root_varname_10 [Replace]NewrootJ    [Search]Root_varname_2 [Replace]NewrootB

    If the variables are not in the SRS script file in the order specified in the “Right” column, the change to Root_varname_1 will take place first. When this change occurs, Root_varname_10 will be changed to NewrootA0, which does not conform to the convention of no more than 8 characters. Consequently, when the replacement for Root_varname_10 is acted on, it will find no instances of the text to replace. You would then have to manually go through the SPSS syntax file, and change NewrootA0 to NewrootJ (which is the desired outcome of the change).
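    The pitfall can be reproduced in a few lines of Python. The sketch assumes that each script entry behaves as a plain substring replacement applied in order, which is what the description above implies; it does not model any other SRS options. Both function names are hypothetical.

```python
def apply_script(text, pairs):
    """Apply (search, replace) pairs in order, top to bottom, as plain substring replaces."""
    for search, replace in pairs:
        text = text.replace(search, replace)
    return text


def longest_first(pairs):
    """Reorder pairs so that longer search strings are handled before their prefixes."""
    return sorted(pairs, key=lambda pair: len(pair[0]), reverse=True)


if __name__ == "__main__":
    syntax = "Root_varname_1 1-2 Root_varname_2 3-4 Root_varname_10 5-6"
    wrong = [
        ("Root_varname_1", "NewrootA"),
        ("Root_varname_2", "NewrootB"),
        ("Root_varname_10", "NewrootJ"),
    ]
    print(apply_script(syntax, wrong))
    # NewrootA 1-2 NewrootB 3-4 NewrootA0 5-6
    print(apply_script(syntax, longest_first(wrong)))
    # NewrootA 1-2 NewrootB 3-4 NewrootJ 5-6
```

    Sorting by descending length of the search string is one simple way to guarantee that a name such as Root_varname_10 is processed before Root_varname_1.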

    Running the SRS script

    Having created a SRS script file to rename variables to the 8-character standard, you must run the script against the SPSS syntax file to actually perform the desired changes. This is done by specifying to SRS that you want to find strings according to the script file, replace them according to the script file, and to perform those actions in a specified directory on the SPSS syntax file. Appendix 1 contains an illustrated guide to running the script.

    Next steps

    If new variable names are created, you will need to follow the instructions in the section SPSS: Rename Variables (see page ?).

    Introduction to the SPSS syntax file

    For any file loaded into Equinox, there are up to eleven sections of code in the SPSS syntax file.⁵ Of these sections, three are required, one is recommended, two are normally required, and five may be required, depending upon the file. These sections of code are:

     1. Data list                          required
     2. Create unique Record Identifier    may be required
     3. Sort cases                         recommended
     4. Create additional variables        may be required
     5. Recode statement(s)                may be required
     6. Formats statement(s)               normally required
     7. Variable labels                    required
     8. Value labels                       normally required
     9. Missing declarations               may be required
    10. Rename variables                   may be required
    11. SPSS executable statements         required

    SPSS: Unnecessary statements

    Delete any and all of the following commands from the SPSS syntax file:
    • Set Length
    • Set Width
    • Title
    • Subtitle
    If they are not deleted, they may cause the programs which convert the SPSS output to Inmagic input to fail.
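    For long syntax files, the deletion can be checked with a small script. The Python sketch below is an assumption-laden illustration, not part of the Equinox tooling: it treats a line ending in a period as the end of a command, and it drops a whole SET command when it begins with LENGTH or WIDTH, so a SET command that also carries other settings should be reviewed by hand. The function name is hypothetical.

```python
import re

# Commands the text says to delete; matched only at the start of a command.
UNWANTED = re.compile(r"\s*(set\s+(length|width)|title|subtitle)\b", re.IGNORECASE)


def drop_unwanted_commands(syntax: str) -> str:
    """Remove SET LENGTH, SET WIDTH, TITLE and SUBTITLE commands from syntax text."""
    kept, command = [], []
    for line in syntax.splitlines():
        command.append(line)
        if line.rstrip().endswith("."):          # command terminator reached
            first = next((text for text in command if text.strip()), "")
            if not UNWANTED.match(first):
                kept.extend(command)
            command = []
    kept.extend(command)                         # unterminated trailing lines are kept
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "TITLE 'GSS'.\ndata list file='x.dat'\n / age 1-2.\nSubtitle 'main'."
    print(drop_unwanted_commands(sample))
```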

    Creating a single annual Labour Force Survey file

    Statistics Canada distributes the Labour Force Survey as twelve separate monthly files. Practice in Equinox is to store the complete Labour Force Survey as a single annual file. Follow the instructions below to create the annual file.⁶
    • open a command window (Start >> Run >> cmd)
    • move to the drive and subdirectory to which the twelve monthly files have been extracted from their zip files (e.g., c: >> cd \data\dli\lfs\2012)
    • copy the individual files into a combined file (copy pub??YY.prn lfsYYYY.dat)⁷

    ⁵ The syntax file might also be referred to as a SPSS program or SPSS data set description.

    ⁶ The example assumes that Statistics Canada uses the same file naming conventions as were used with the 2012 Labour Force Survey files. If the naming convention changes, the instructions must be modified to reflect those changes.

    Having done this, LFSYYYY.dat will be the file referenced in the Data List statement in the SPSS syntax file (described below).
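    For anyone working outside the Windows command prompt, the same concatenation can be sketched in Python. The sketch assumes that the two wildcard characters in pub??YY.prn are a two-digit month, so that sorting the file names puts the months in order; that is an assumption about the naming convention, not something the text states. The function name is hypothetical.

```python
from pathlib import Path


def combine_monthly_files(directory, yy, yyyy):
    """Concatenate pub??YY.prn files (sorted by name) into lfsYYYY.dat; return its path."""
    folder = Path(directory)
    monthly = sorted(folder.glob(f"pub??{yy}.prn"))
    if len(monthly) != 12:
        raise ValueError(f"expected 12 monthly files, found {len(monthly)}")
    target = folder / f"lfs{yyyy}.dat"
    with target.open("wb") as out:
        for path in monthly:
            out.write(path.read_bytes())   # bytes are copied unchanged
    return target


if __name__ == "__main__":
    # Directory from the example in the text; adjust to your own working directory.
    print(combine_monthly_files(r"c:\data\dli\lfs\2012", "12", "2012"))
```

    The check for exactly twelve input files is there to catch a missing or misnamed month before the annual file is built.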

    SPSS: Data list (required)

    Purpose

    A data list statement lists for SPSS the variables that are to be read and the input format of those variables.

    SPSS syntax

    The data list statement may contain either two parts (a FILE HANDLE and a DATA LIST), or one part (a DATA LIST), depending both upon the file structure and the programmer’s decisions.

    Two-part syntax

    If the length of the input record is over 8,192(?) characters long, a FILE HANDLE statement must precede the DATA LIST statement. SPSS syntax files received from Statistics Canada often (but not always) contain a file handle statement (regardless of whether it is necessary). If included, it may be retained. In the example below, the minimal portions of the FILE HANDLE are included - other parts (e.g., rectype=fixed) may be present in files from Statistics Canada.

    FILE HANDLE handle /NAME='path and file specifications' [/LRECL=n].
    DATA LIST file=handle
     / Varfirst columnspecs … Varlast columnspecs .

    One-part syntax⁸

    DATA LIST FILE='path and file specifications' records=#
     / 1 Varnum1 columnspecs … Varlast columnspecs
     / 2 Varnum1 columnspecs ... Varlast .

    handle    a working name for the file, e.g., cgss2012m
    name=     specify the exact path and file name for the data file to be read into SPSS (e.g., c:\data\gss12\gss-cycle27-main.txt). The file name and path must be enclosed in ‘single quotes’.
    LRECL=    replace “n” with the number of columns in each record of the file (e.g., 467)

    ⁷ Where YY is the 2-digit year used in the file name by StatCan, and YYYY is the 4-digit year (e.g., 12, 2012).

    ⁸ If there is only one line of data per record, the “records=#” may be omitted. The example shows a file where there were 2 physical records to be read to complete a single logical record (case).

    Notes about column specifications (columnspecs)

    You may encounter two different formats used to record the format of variables.

    The first I will refer to as an explicit column style, with a list of variable names each followed by the columns occupied by the variable indicated.

    Example
    data list file='c:\data\smallfile.dat'
     / age 1-2 sex 3 (A) income 4-12 (2).

    In the file smallfile.dat, located in the c:\data directory (folder), the variable age occupies columns 1 and 2 of the record; the variable sex is found in column 3 and is recorded as a string (alphabetic characters), and the variable income occupies columns 4 to 12 (the rightmost two columns being decimal places). The period after the specifications for the variable income indicates that there are no further variables in the file.⁹

    The second method of defining variables uses a format called Fortran notation (and is used in the output from Equinox).

    Example of Fortran notation
    data list file='c:\data\smallfile.dat'
     / age (F2.0) sex (A1) income (F9.2).

    Both examples describe the same file in the same manner.
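    The equivalence of the two notations can be checked mechanically. The Python sketch below converts a list of Fortran-style formats into explicit column ranges, assuming the variables are read one after another starting in column 1, and then cuts a sample record with those ranges. Only the A and F formats used in the examples are handled, and the names FORMAT, fortran_to_columns and slice_record are hypothetical.

```python
import re

FORMAT = re.compile(r"([AF])(\d+)(?:\.(\d+))?")  # e.g. A1, F2.0, F9.2


def fortran_to_columns(specs):
    """Turn [(name, format)] into [(name, first, last, type, decimals)], 1-based columns."""
    columns, first = [], 1
    for name, fmt in specs:
        match = FORMAT.fullmatch(fmt.upper())
        if match is None:
            raise ValueError(f"unsupported format: {fmt}")
        kind, width = match.group(1), int(match.group(2))
        decimals = int(match.group(3) or 0)
        columns.append((name, first, first + width - 1, kind, decimals))
        first += width                      # the next variable starts right after this one
    return columns


def slice_record(line, columns):
    """Cut one fixed-width record into raw text fields using inclusive column ranges."""
    return {name: line[first - 1:last] for name, first, last, _kind, _dec in columns}


# The explicit column style from the example: age 1-2, sex 3 (A), income 4-12 (2).
EXPLICIT = [("age", 1, 2, "F", 0), ("sex", 3, 3, "A", 0), ("income", 4, 12, "F", 2)]

if __name__ == "__main__":
    fortran = [("age", "F2.0"), ("sex", "A1"), ("income", "F9.2")]
    print(fortran_to_columns(fortran) == EXPLICIT)   # True
    print(slice_record("34M001234567", EXPLICIT))
    # {'age': '34', 'sex': 'M', 'income': '001234567'}
```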

    When to use string notation

    In SPSS, as demonstrated above, string variables are specified in the data list by either following the column specifications with (A) or by using the string Fortran notation (A#). Some divisions within Statistics Canada (and, recently, CIHI in the Discharge Abstract Database files) use string notation for every nominal or ordinal variable, regardless of whether it uses numbers or alphabetic characters to represent the values.

    ⁹ You may encounter author-dependent stylistic differences in syntax files. For example, the column specification for sex might have been written as 3-3 (rather than 3). You may also encounter multiple variables on a single line in the syntax file (e.g., age 1-2 sex 3 (A) income 4-12 (2).)

    Possible methods of coding the variable SEX

    Using characters              Using numbers as strings   Using numbers   Value
    "M" (or 'M' / 'm' / "m")      '1' or "1"                 1               Male
    "F" (or 'F' / 'f' / "f")      '2' or "2"                 2               Female

    Unfortunately, Stata does not permit value labels to be associated with string variables (variables that contain alphabetic (and possibly numeric) characters). If variables that exclusively contain numbers coded as strings (i.e., do not contain any alphabetic codes) are stored in Equinox as strings, Stata users will be deprived of the value labels associated with the codes. In the example above, using numbers as strings, a Stata user would see SEX with the values “1” and “2”, without any indication in Stata as to the values associated with those codes, and hence would have to refer to the codebook to be able to differentiate responses.

    To avoid this problem, you must modify the SPSS file in two places:
    1. In the data list statement, if the explicit column style is used, remove the string designation (the (A) after the column(s)). If Fortran notation is used, change the format from (A#) to (F#).
    2. In the value labels statement, remove the single or double quotation marks from around the numeric codes (see below).

    Decimal points

    In Equinox, standard practice is to ensure that all variables with decimal values are stored with an explicit decimal point (i.e., that the data loaded into SQL includes the “.” for each variable that has decimals). This is done because it is more efficient for SQL to load values stored in this manner than to add them during the loading process.

    Particularly with older files, the decimal was often implicit - not included in the actual file. This was due to the cost (in storage space) of storing an additional byte for each decimal variable. Additionally, coders at Statistics Canada do not consistently record decimal points when preparing either SPSS or SAS code. Therefore, it is important to catch missed coding and insert it into the SPSS syntax file to ensure that variables are recognized as having decimals.¹⁰

    Note: be sure to record the variables that have decimals: you will want the list for the Formats statements (see below).
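    What an implicit decimal means in practice can be shown with a small Python sketch. It works on the raw text of one field and inserts the point that the file leaves out; the function name is hypothetical, and SPSS itself does this through the decimals given in the data list, not through code like this.

```python
def with_explicit_point(raw: str, decimals: int) -> str:
    """Insert the decimal point that an implicit-decimal field leaves out."""
    text = raw.strip()
    if decimals == 0 or "." in text:
        return text                       # nothing to add
    sign = "-" if text.startswith("-") else ""
    # Pad so that there is at least one digit in front of the point.
    digits = text.lstrip("+-").rjust(decimals + 1, "0")
    return f"{sign}{digits[:-decimals]}.{digits[-decimals:]}"


if __name__ == "__main__":
    print(with_explicit_point("2402", 2))   # 24.02
    print(with_explicit_point("845", 1))    # 84.5
    print(with_explicit_point("24.02", 2))  # 24.02 (already explicit, left alone)
```

    If the number of decimals is recorded wrongly in the syntax file, the same raw field yields a value that is off by a power of ten, which is the kind of error the checking step is meant to catch.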

    Time saving tips

    If you are required to create a SPSS syntax file (rather than edit an existing file), and you have multiple variables of exactly the same format in succession, you can specify a list of variables followed by the columns occupied by all variables in the list (in either explicit column or Fortran notation). You can also name sequential variables by using this trick (if you are coding a file containing a list of bootstrap variables, for example).

    ¹⁰ To give some obvious examples of the problems inherent in not recording the decimal points accurately, there is a big difference between earning $24.02 per hour and earning $2,402 per hour. While it is possible to work 84.5 hours per week, it is impossible to work 845 hours per week. If decimal points on weight variables are not identified, it would be possible to generate population estimates that over-estimate the Canadian population by a factor of 10,000 or more.

    Catching this type of mistake at the start when loading a file into Equinox will ensure that no user has to make on-the-fly corrections to the data when (and if) they recognize that they are getting impossible or improbable results.

    Example pair 1:
    incwages incselfe incgovt incinvst 21-40
    incwages incselfe incgovt incinvst (4F5.0)

    In this example (showing explicit column and Fortran notation), we are defining four variables (incwages, incselfe, incgovt, and incinvst): each occupies 5 columns, and has 0 decimal places.

    Example pair 2:
    sbswt001 to sbswt500 165-5164 (4)
    sbswt001 to sbswt500 (500F10.4)

    In this example (showing explicit column and Fortran notation), we are defining five hundred variables (named sbswt001, sbswt002, ..., sbswt500): each occupies 10 columns and has 4 decimal places.
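    The arithmetic behind these shortcuts is easy to get wrong when typing column ranges by hand, so a quick check can help. The Python sketch below expands a "first to last" name range and divides a shared column range evenly among the variables, raising an error when the range does not divide evenly. Both function names are hypothetical.

```python
import re


def expand_names(first, last):
    """Expand a range such as sbswt001 to sbswt500, keeping the zero padding."""
    a = re.fullmatch(r"(.*?)(\d+)", first)
    b = re.fullmatch(r"(.*?)(\d+)", last)
    if not a or not b or a.group(1) != b.group(1):
        raise ValueError("both names need the same root and a numeric suffix")
    width = len(a.group(2))
    start, stop = int(a.group(2)), int(b.group(2))
    return [f"{a.group(1)}{n:0{width}d}" for n in range(start, stop + 1)]


def split_columns(first_col, last_col, count):
    """Divide a shared, inclusive column range evenly among `count` variables."""
    total = last_col - first_col + 1
    if total % count:
        raise ValueError(f"{total} columns cannot be shared evenly by {count} variables")
    width = total // count
    return [(first_col + i * width, first_col + (i + 1) * width - 1) for i in range(count)]


if __name__ == "__main__":
    print(split_columns(21, 40, 4))
    # [(21, 25), (26, 30), (31, 35), (36, 40)]
    names = expand_names("sbswt001", "sbswt500")
    print(len(names), names[0], names[-1])   # 500 sbswt001 sbswt500
```

    Five hundred variables of 10 columns each need exactly 5,000 columns, so a range of any other length for such a list signals a typing error in the start or end column.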

    SPSS: Create unique record identifier (may be required)

    Purpose

    Because Equinox is designed to allow users to subset (pull specific variables from a file while omitting others), users may occasionally forget to retrieve all of the variables which they need for the analysis they intend. Statistical software packages such as SAS, SPSS and Stata allow the user to merge data sets together by matching on a keyed variable: the unique record identifier. If the data file does not include a unique record identifier, one must be created in order to allow users to match subsets.

    When required

    Normally, files shipped by Statistics Canada include a unique record identifier: the name assigned to this variable will vary (e.g., Recnum, Idnum, ID, Uniqueid, etc.). Creating a unique identifier is normally reserved for files that are distributed by other agencies or individuals when no such identifier is supplied.

    Labour Force Survey files

    Within each monthly file, a unique identifier is present, but the values assigned to this variable are duplicated across months (e.g., there will be a record assigned the unique code of “1” in each of the January through December files). Since Equinox requires a unique identifier that encompasses the entire year (since we store the entire year in a single file), a yearly unique identifier must be generated. This is done using the SPSS syntax for creating non-geographic unique identifiers.

    Geographic files

    A second exception occurs when you wish the unique identifier to explicitly identify the census geography covered. Examples of these include aggregated census files that are distributed in ASCII format (such as the Basic Summary Tabulations from the 1986 Census), or tables from the Geosuite product. Often, a Statistics Canada file that is geographically based may have variables for each component of the unique geographic identifier (e.g., province, census division, and dissemination area), but does not create a unique identifier that combines all these fields into a single field in the file. In this case, a unique identifier should be created that combines these elements, following the SPSS syntax for creating geographic unique identifiers.

    Complex files

    A third exception deals with the infrequent occurrence of processing a complex file (a file where there are multiple record types stored within a single file) for inclusion within Equinox. This rare occurrence is dealt with beginning on page 104.

    SPSS syntax for creating non-geographic unique identifiers

    SPSS provides a simple and foolproof way to create a unique record identifier. In the command below, the variable will be named uniqueid. While any name not already used on the file would be valid, it is recommended that you assign a name that makes it clear that it is a unique record identifier, such as uniqueid, recid, recordid, etc.

    compute uniqueid=($casenum).

    SPSS syntax for creating geographic unique identifiers

    Creating geographic identifiers requires you to look at the format of the existing geographic variables, and to construct a new variable based on the existing formats.

    Let us assume that our variable list was as follows [11]:

    data list file='c:\data\smallgeogfile.dat' /
      pr 1-2            province code
      cd 3-4            census division number
      csd 5-7           census subdivision number
      da 8-11           dissemination area number
      totalpop 12-21    etc.

    To create a new geographic identifier, we need to calculate the total number of columns occupied by the new unique identifier: in this case, we need 2 (pr) + 2 (cd) + 4 (da), for a total of 8 columns. Therefore, province needs to be shifted six columns to the left, and census division shifted four columns to the left. Shifting a value n columns to the left means multiplying it by 10 to the power n. This is accomplished with the following code:

    compute PRCDDA = pr*10**6 + cd*10**4 + da.

    In this instance, the new variable that you create would be called PRCDDA - naming it in this fashion would provide information to the user as to its content - it is not simply a sequential number generated by a software package.
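    The column-shifting arithmetic can be verified outside SPSS. The following is a small Python sketch under stated assumptions: the function name geographic_id and the sample codes (province 35, census division 20, dissemination area 123) are hypothetical and chosen only for illustration.

```python
def geographic_id(pr, cd, da):
    """Combine province (2 columns), census division (2 columns) and
    dissemination area (4 columns) into one 8-column identifier."""
    # Multiplying by 10**n shifts a value n columns to the left.
    return pr * 10**6 + cd * 10**4 + da


prcdda = geographic_id(35, 20, 123)
print(f"{prcdda:08d}")  # zero-padded to the full 8 columns
```

    Each component keeps its own columns in the result, so the identifier can be read back as province, census division and dissemination area.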

    Tips and hints

    There may be a simpler way to create a unique geographic identifier. Had the variable list omitted the census subdivision code from the example above, and been as follows:

    data list file='c:\data\smallgeogfile.dat' /
      pr 1-2
      cd 3-4
      da 5-8

    the simplest way to create the unique identifier is to modify the data list statement:

    data list file='c:\data\smallgeogfile.dat' /
      prcdda 1-8
      pr 1-2
      cd 3-4
      da 5-8

    [11] Note that "province code", "census division number", etc. would NOT appear in the data list statement: the text is included in this example to illustrate the content of the variables in the file.

    Next steps

    Having created a new unique identifier, you will have to define its format (see page 31) and add a variable label for it (see page 32).

    SPSS: Sort cases (recommended)

    Purpose

    As stated previously, statistical software packages allow the user to merge data sets together by matching on a key variable (the unique identifier), but require that the files be sorted on the key variable. Performing this step prior to loading the file into Equinox ensures that all data files retrieved from Equinox will be in sorted order and ready for merging, which eliminates a step for users. It is NOT an essential step, but one that may assist users who analyse the file.

    Notes

    The critical requirement for sorting the cases is to ensure that you actually sort on a unique record identifier: look for a variable specified as such in the documentation, or for a variable you have created.

    SPSS syntax

    SPSS provides a command for sorting the file by a variable. For matching files, it is essential that the file be sorted in ascending order: hence, (A). [12]

    sort cases by uniqueid (A).

    [12] For uniqueid, substitute the name of the variable that is the unique identifier.

    SPSS: Create additional variables (may be required)

    Purpose

    The intention behind creating additional variables is to make the file more usable by those who wish to analyse it. Flagging this step with the categorization of "may be required" is an overstatement: the file would be usable without new variables, but creating new variables may enhance its usability.

    Notes

    Normally, the only instance where new variables are created is for geographic identifiers. For example, it may be useful to create a region variable, which would allow users to select all Atlantic province records with a single click.

    Moving beyond this to a more substantive example, typically Statistics Canada provides province as a single variable, and the Federal Electoral District (FED) code as a second variable. Within each province or territory, the coding for FED begins at 1 and increases by one for each additional FED. Users would be unable to select one FED (coded 1) from one province, and a second FED (coded 2) from another province, without a new variable that combines the two parts of the identification.

    The two most likely techniques that you would use to create new geographic variables have already been described in the section on creating geographic identifiers: computing a new variable (see page 26), and modifying the data file list (see page 27).

    Next steps

    For each new variable created, you will have to define its format (see page 31), create a variable label (see page 32), and potentially create value labels (see page 34).

    SPSS: Recode statement(s) (may be required)

    Purpose

    There are two instances when recode statements may be required. The first is to replace BLANK (system missing) values with actual coded values. The second is to fix coding that might otherwise mislead or confuse users of the data.

    Recoding blank values to coded values

    In Equinox, blank values in fields coded with numbers are replaced by a numeric code. This is done for both stylistic and practical reasons. Stylistically, it is more appropriate to physically assign values to each desired code, rather than leave it blank and have SPSS assign a label of "System missing" to the blank value. Practically, SQL does not expect blanks in a field that has been identified as numeric - although you may force SQL to load data coded in this manner, it slows the process and is less than optimal.

    Unfortunately, identifying the variables where this is necessary, and subsequently identifying an appropriate value with which to replace the blank, may be a tedious process, because the same value is unlikely to be usable for all variables. If the variable has no negative values, the standard practice in Equinox is to recode the blank value as -1. If the variable has negative values, standard practice is to recode the blank value as the largest possible negative value that could be stored in the field (e.g., -99 in a three-column field), assuming that that value has not been assigned as valid.

    Special case: recoding all blank values at once

    Rarely, the variables in a file will permit you to assign the same value for all blanks for all variables. If no negative values appear in any field in the file, and there are no string variables in the file, "-1" could be used to replace all blank values in the file.

    SPSS syntax if all variables can be recoded at once [13]

    recode all (sysmis=-1).

    Normally, certain variables (such as investment income, age difference variables, etc.) have negative values associated with them. The file may also contain string variables that need to be coded using slightly different syntax.

    Recoding blank values to different new values

    Let us assume that we have six variables in the file (VAR001 to VAR006). There are no blank values for VAR001 or VAR003, but the remaining four variables contain blanks (system missing values). VAR005 is a variable that uses negative values: the list of valid values for that variable goes from -30 to 199, and it occupies three columns.

    Sample SPSS syntax for recoding to different values

    recode VAR002 var004 vAr006 (sysmis=-1).
    Recode var005 (sysmis=-99).
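    The rule for choosing a replacement code can be written down as a small function. This is a Python sketch only, not something SPSS or Equinox provides: the name blank_replacement is hypothetical, and the sketch conservatively treats every value at or above the variable's minimum as potentially valid.

```python
def blank_replacement(min_valid, width):
    """Pick the code used to replace blank (system missing) values.

    min_valid: smallest valid value of the variable.
    width: number of columns the variable occupies.
    """
    if min_valid >= 0:
        # No negative values in use: the standard code is -1
        # (a one-column variable must then be widened to two columns).
        return -1
    # One column holds the minus sign; the remaining columns hold 9s.
    candidate = -(10 ** (width - 1) - 1)
    if candidate >= min_valid:
        raise ValueError("no free negative code fits; widen the field")
    return candidate


print(blank_replacement(0, 1))    # variables without negative values
print(blank_replacement(-30, 3))  # VAR005: valid values -30 to 199, three columns
```

    For VAR005 the function returns -99, the value used in the recode statement above.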

    Recoding blanks in string variables

    Let us add a variable (var007) to the example above. The distributor of the file chose to include a three-character string variable to record the highest university degree achieved: the legitimate values are "BA ", "BSC", "MA ", and "PhD". The value "   " (three spaces) is used if no university degree was attained.

    We can choose to use any string value that is not in the valid list to represent the blank value, which is not considered "System Missing" because the field is a string.

    Sample SPSS syntax for recoding string variables

    recode var007 ("   "="NAP").

    Recoding misleading or confusing code

    Occasionally, coders take short cuts in coding which result in codes that may mislead or confuse users. This is fortunately a rare, but not unknown, occurrence.

    For example, for whatever reason, the Statistics Canada division responsible for the Survey of Household Spending chooses to record the value 0 in the spousal income field to represent $0 in income earned by the spouse. The value 0 is also used in the variable in instances when there is no spouse in the household. This can lead to confusion among users [14]. Similarly, the division uses 0 in the variables for house sale or purchase price to represent that there was no sale or purchase of a house in the year, but does not flag the value as missing [15].

    [13] The base format of the Recode command is:

    recode varlist (oldvalue1=newvalue1) (oldvalue2=newvalue2) ... .

    where varlist is a list of variables to be recoded in the same manner, separated by spaces. Normally, you will only be doing a single recode, using the old value SYSMIS.

    [14] Using the income field, a user could accurately calculate the total income earned by all spouses in Canada. However, were they to average this field, expecting to obtain the average income of spouses in Canada, their results would instead give them the average spousal income over all households in Canada, whether or not a spouse was present in the household (and hence would seriously underestimate spousal income).

    Fortunately, the division records in the variable SPSEXP (sex of spouse) and in other variables whether or not a spouse is present in the house. Using this information, it is possible to recode other variables to distinguish between a spouse who is present but has a value of 0 and the absence of a spouse. In order to remove the ambiguity from these variables, they are recoded in SPSS.

    Recoding the house prices is simple - the value 0 is recoded to the value -1, which is subsequently flagged as missing. Alternatively, the value 0 might be coded as missing, which would have the same effect.

    Recoding the spousal values is slightly more complicated. Using the DO IF ... END IF syntax in SPSS, all records which are missing for the sex of spouse variable are selected. Then, the values for those records are recoded to non-used values, which must subsequently be flagged as missing.

    Sample SPSS syntax for recoding misleading/confusing code

    recode selpricp (0=-1).
    recode purpricp (0=-1).
    do if (missing(spsexp)).
    * See note 16 for the choice of 9 for spdisabl.
    compute spdisabl=9.
    compute spinctot=-9999999.99.
    compute SPINCEAR=-9999999.99.
    compute SPINCINV=-9999999.99.
    compute SPINCTRA=-9999999.99.
    compute SPINCOTH=-9999999.99.
    end if.

    Next steps

    Having recoded one or more variables:
    • you may need to create a format statement for the variable (see page 31) if the format of the variable has changed (e.g., instead of occupying 1 column, the variable is now 2 columns long, to permit the storage of "-1");
    • you must edit the value labels for the variable (see page 34); and
    • if any of the values that were edited were missing values, you must also edit the missing values statement for the variables (see page 38).

    Experience shows that it is best to identify the recodes that are necessary and code them all appropriately. As a next step, create all necessary format statements, then fix all necessary value labels. The missing value declarations would be subsequently dealt with in the normal course of editing the syntax file.

    [15] Hence, users could calculate the total amount paid and/or received over all households from house sales in a year, but instead of calculating average selling price or purchase price, would calculate average amount spent or received per household (regardless of whether they had sold or purchased a home in the year).

    [16] Rather than coding this value to -1, which would require a change in format to the variable (from one column to 2), a decision was made to code the missing value as 9, which is outside the valid range of values (0=no, 1=yes). Use your discretion in following the guidelines.

    SPSS: Formats statement(s) (normally required)

    Purpose

    Assigning correct formats to variables ensures that the variables have enough space to be written out onto disk, and that they will subsequently be read properly by SQL.

    When the Formats statement is required

    There are three possible reasons why format statements may be included in the syntax file:
    1. there are decimal points in one or more variables (normally the case for weights), or
    2. you have been forced to increase the length of a variable to accommodate a missing value (a general rule that includes adding a negative code to a one-column variable), or
    3. you have created one or more new variables.

    Decimal values

    By default, SPSS assumes that no decimals are explicitly stored in data files read into SPSS. Therefore, it will automatically increase the length of any field containing one or more decimal places by one column, which is unnecessary if the decimal point was stored in the input file. Therefore, for any variable where the decimal point was stored, you should specify the appropriate format using Fortran notation: one column for each digit (or minus sign) to the left of the decimal point, one column for the decimal point, and one column for each digit to the right of the decimal point. In the example below, hrlywage could have values from -99.99 to 999.99: six columns in total, of which two are decimals and one is used by the decimal point.

    SPSS syntax

    formats hrlywage (f6.2).
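    The column count can be derived from the range of values the variable must hold. The following Python sketch is an illustration under stated assumptions: the function name f_format is hypothetical, and it assumes the decimal point and any minus sign are physically stored in the field, as described above.

```python
def f_format(min_value, max_value, decimals):
    """Return the Fortran-style format (e.g. 'F6.2') wide enough for the range."""

    def width(value):
        # The formatted text includes the minus sign and the decimal point, if any.
        return len(f"{value:.{decimals}f}")

    return f"F{max(width(min_value), width(max_value))}.{decimals}"


print(f_format(-99.99, 999.99, 2))  # the hrlywage example
print(f_format(-1, 9, 0))           # a one-column variable after adding -1
```

    The same function gives the two-column format needed once a one-column variable has had -1 added for blank values.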

    Creating new variables or increasing the length of existing variables

    If you have created a new unique identifier, you know the format of the variable that you created. For example, on page 26, the variable prcdda was an eight-column variable, with no decimal places. Similarly, if you have had to add a negative value to a one-column variable to replace a blank value, you know that the new format of the variable must be two columns. Let us assume that edlevel and age were one-column variables with preassigned values from 0 to 9, and consequently required an additional column to deal with coding system missing (blank) cases.

    SPSS syntax

    formats edlevel age (f2.0) prcdda (F8.0).

    Tips and hints

    As seen above, a single Formats statement can apply to multiple variables: list the variables, separated by spaces, before you enter the Fortran notation. Consecutive variables may have the same format applied to them using the "variable name to variable name" syntax. In the example below, 501 variables are assigned the same format.

    e.g., formats weight btstr001 to btstr500 (f12.5).

    SPSS: Variable labels (required)

    Purpose

    The variable label should provide the Equinox user with a guide to the content of the variable. The variable label also acts as the link on which the user clicks to retrieve additional metadata about the variable. Therefore, every variable, whether created by the distributor or during the editing of the SPSS syntax file, MUST have a variable label.

    A note about labels in SPSS

    A label (whether a value label or a variable label) may be enclosed either in single or double quotes. It is recommended that double quotes be used exclusively to enclose variable and value labels. If double quotes are used, contractions (such as in the variable label "Haven't heard of the Rhinoceros party" or the value label "Don't know") may be written exactly as they are to be seen. Using single quotes, those entries would have to be typed as 'Haven''t heard of the Rhinoceros party' and 'Don''t know', which is awkward and may lead to errors.

    Avoid truncation and abbreviation

    Since variable labels are both searchable (through Inmagic) and findable (through CTRL-F on a screen where they are listed), it is important to avoid truncation and abbreviation wherever possible. Abbreviations for even generally accepted terms (e.g., GIS = Guaranteed Income Supplement) may have multiple meanings (GIs = American soldiers or GIS = Geographic Information System). Hence, best practice is to include both the full text and the abbreviation in the variable label, since Equinox does not currently use a thesaurus which might address this issue.

    Truncation (particularly as practised by Statistics Canada) causes serious problems with searching. In a single syntax file, you might encounter the word household abbreviated as hh, hhd, hhl, hhld, househ, househld (or any other variant). For users to be able to search for and find data about a household, they must be able to use a single search term. Therefore, avoid (and if necessary, remove) truncation, and instead use the full term in any variable label.

    SPSS syntax

    variable labels
      varname1 "variable label"
      varname2 "variable label"
      age "Age (6 groups)"
      hhdinc04 "Household income, 2004"
      .

    Tips and tricks

    In the example above, note the age and household income variables. Since age might be either a grouped (ordinal) variable (reported in age ranges) or a continuous (interval) variable (reported in single years of age), informing the user in the variable label as to the nature of the variable is advisable. As a point of practice, indicate if a variable that might be expected to be continuous is grouped. So, in the example above, users would expect hhdinc04 to be coded as a continuous variable (with explicit dollar values).

    A note about Statistics Canada coding

    Some divisions within Statistics Canada have a distressing habit of assigning the same variable label to a series of related variables, due to using a set number of the leftmost characters in the question as the variable label. If this information is loaded into Equinox, users looking at the variable labels will have no way of immediately deciding which (if any) of the variables to use, since the information that differentiates between the variables is lost. Therefore, it is important to create a common "root" label, but then to have the pertinent information follow the root of the label.

    For example, a file might contain ten variables based on the question, "Thinking back over your experiences in the work force in the past ten years, how has your situation at work changed?" The individual questions might be related to such specific topics as "The ethnic balance in the workforce has changed" or "I am working longer hours" or "My supervisors are more understanding of home-work balance". In this case, the variable label assigned by Statistics Canada to each of the ten variables might be "Thinking back over your experiences in the work force in the past ten years, how". This might be replaced with a common "root" label of "Ten year work change: ", leading to variable labels of

    Q47A1 "Ten year work change: change in ethnic balance"
    Q47A2 "Ten year work change: working longer hours"
    Q47A3 "Ten year work change: more understanding supervisors"
    etc.

    Since the full question text is stored in Equinox in the Questiontext and QuestiontextF fields, none of the information is lost to the user, but the variable labels are more useful than they would have been if merely stored as "Thinking back over your experiences in the work force in the past ten years, how".

    Suppressed variables

    Depending on the division within Statistics Canada which issues the file, variables which are suppressed (i.e., for which all values have been replaced with a single missing or blank value) may be included in the public use microdata file. This is typically done in order to provide a notice to users that this variable is available in the master data file available through the Research Data Centre program. These suppressed variables may or may not be accompanied by recoded versions of the variable which have been passed by the disclosure risk process (e.g., Income: exact dollar value with a counterpart variable of Income: grouped).

    The difficulty with this is that if a variable is not explicitly flagged as being suppressed, users may select it thinking that they are retrieving a variable containing useful information. After all, given the statistical options available for analysis, a user would far prefer to retrieve the exact dollar value of income as opposed to a grouped variable.

    Therefore, it is important to explicitly identify suppressed variables by adding the word suppressed to the appropriate variable labels. Using that principle, the variable labels statement might be changed from:

    Variable labels
      uniqueid "Unique record identifier"
      ageinyrs "Age (single years)"
      agegroup "Age (grouped)"
      ...

    to

    Variable labels
      uniqueid "Unique record identifier"
      ageinyrs "Suppressed: Age (single years)"
      agegroup "Age (grouped)"
      ...

    Omitting this step will put the user at risk of retrieving variables with no content, of neglecting variables containing content, and of believing that a data file will support more complex research than is the case (e.g., of comparing two specific ages, rather than two age groups).

    SPSS: Value labels (normally required)

    Purpose

    Value labels associate the codes assigned to the variable (e.g., 1) to the meanings of the codes (e.g., Male). As such, variables should normally be assigned value labels. Having said that, it is impractical to assign value labels to large continuous variables - if an income variable covers incomes ranging from -$50,000 to $250,000, it is both impractical and of limited use to the user to indicate that 37 means $37, and that 150,235 means $150,235.

    When not to create value labels

    Normally, value labels will be included for variables, in order to inform the user as to the content of variables. Having said that, there are circumstances that dictate against the creation of value labels.

    The two absolute "rules" for when NOT to create value labels are:
    • do not create value labels for a unique record identifier, since the user derives no benefit from being told that "1432" represents "Respondent 1432", while "37675" represents "Respondent 37675"; and
    • do not create value labels for weight variables, since the user derives no benefit from being told that "1.432" represents "Weight of 1.432", while "37.675" represents "Weight of 37.675".

    Additionally, guidelines can be offered for determining if value labels should be created:
    • if a continuous (interval) variable includes one or more decimal places, value labels would normally not be created [17];
    • if a continuous (interval) variable is used for income, value labels would normally not be created; and
    • if more than one hundred value labels would be required, consider the benefit to the user of having the values labelled: if it is essential to analysis to have labels, include them (e.g., to select from a list of two hundred different occupations).

    As a consequence of these rules and guidelines, a file which consists solely of a unique identifier and bootstrap variables would have no value labels associated with it, and hence, no value records nor frequency tables to load into Equinox. The fact that no value labels are associated with a variable should be explicitly indicated in the syntax file (see uniqueid in the SPSS syntax example for value labels).

    [17] Note that not all variables containing decimal points are necessarily interval or continuous variables: depending on the coding, they may instead be nominal, ordinal or grouped variables for which value labels would be required. See, for example, variable HWTGHTM in the CCHS 2010 annual component file, where decimal values are used to record a range of values rather than a precise measure.

    Special circumstances that dictate the creation of value labels

    Having listed the situations under which value labels would not be required, it is now important to indicate the exceptions to those rules and guidelines.

    Recall that the philosophy of Equinox is to provide the user with a syntax file that completely documents the data. Therefore, if there are values that do not have an obvious meaning, a value label must be assigned to them. For example, in an age variable, where values normally represent a single year of age, a value of 20 that is assigned to all respondents aged 20 and younger or a value of 75 that is assigned to all respondents aged 75 and over must be explicitly assigned a value label (even if the other single year age values are not assigned value labels). If this is not done, the logical expectation by the user of the data file would be that no one under age 20 or over age 75 was sampled for the survey.

    Assigning these value labels has particular importance when the number of values assigned to a variable is small (e.g., 0, 1 or 2 for number of children at home). In this instance, it is important that the user know if 2 is an actual value representing 2 children (in which case, the variable is an interval variable), or a code that represents 2 or more children (in which case, the variable is ordinal).

    Missing values

    If values are identified as Missing (see page 26), value labels must be created in the syntax file for those specific values.

    Consider an income variable, where the values from -9999 to 99995 represent the actual income, but the value 99996 is identified as "Not asked", 99997 represents "No answer", 99998 represents "Don't know", and 99999 represents "Missing". Value labels for those four values (99996 to 99999 ONLY) would need to be entered into the syntax file.

    Similarly, consider a second continuous variable, where the values 0.0 to 25.0 represent number of years of schooling, but the value 99.6 is identified as "Not asked", 99.7 represents "No answer", 99.8 represents "Don't know", and 99.9 represents "Missing". Value labels for those four values (99.6 to 99.9 ONLY) would need to be entered into the syntax file.

    Top-coded (cap), bottom-coded (floor) and exceptional values

    In these circumstances, the variable is typically a continuous variable, but certain values are special, in that they do not represent (only) what the recorded value would indicate. In the case of top-coded (cap) values, the value represents the figure recorded and all values above that figure. In the case of bottom-coded (floor) values, the value represents the figure recorded and all values below that figure.

    For example, in income variables in the 2001 Canadian Census Public Use Microdata File of Individuals, caps and floors were established for two different groups: 1) all females and males living in the Atlantic regions, and 2) all other males. Additionally, the value 1 in the total income variable signified that income earned less investment losses totalled exactly $0, while the value 0 represented that the individual had no income of any sort. Therefore, six different value labels were required, in addition to the missing value code. The syntax for this variable is included in the SPSS syntax example below. Note that no other values (e.g., where 43321 means $43,321) are listed in the syntax.

    Tips and tricks

    If two variables share the same coding, they may share the same SPSS syntax by being included in a "variable list". A variable list may span multiple lines in an SPSS syntax file. In the example below, variables sector and sectorsp will be assigned identical value labels, while variable uniqueid will be assigned no value labels.

    Suppressed variables

    Once again, it is important to note for the user that a variable has been suppressed. Although it will be flagged as missing, the label itself should reflect that a conscious decision was made to block access to this variable.

    SPSS syntax

    value labels
      uniqueid /
      ageinyrs
        9 "Suppressed" /
      agegroup
        1 "Under 25"
        2 "25 to 65"
        3 "Over 65" /
      TOTINCP
        -50000 "Loss of $50,000 or more (non-Atlantic males)"
        -30000 "Loss $30,000 or more (female/Atlantic males)"
        0 "No income"
        1 "Sum of negative and positive income amounts equal zero"
        120000 "$120,000 or more (female or Atlantic male respondents)"
        200000 "$200,000 or more (non-Atlantic male respondents)"
        9999999 "Not applicable" /
      sector sectorsp
        1 "Private sector"
        2 "Public sector"
        7 "Not applicable" /
      next variable (list)
        values for variable (list) /
      et cetera
      .

    SPSS: Missing values statements (normally required)

    Purpose

    Defining missing values identifies records that should be excluded from analysis of the variable - for example, excluding records for males from the results of a question about the number of times the individual has been pregnant.

    Notes about Statistics Canada coding of missing values

    Typically, the only value (if any) that Statistics Canada divisions flag as missing is a "Valid skip" or "Not applicable". In traditional Statistics Canada coding, these values normally end with 6 (e.g., 6, 96, 996, 9.6, etc.).

    The practice in Equinox is to identify as missing any value which is not a valid response - hence values for "Not asked", "Not stated", "Missing", "Refused", "Don't know", "No spouse", etc. are defined as missing in addition to the "Valid skip" and/or "Not applicable" values. As a result, it is normally simplest to delete any SPSS missing value declarations made by Statistics Canada and to start again.

    Restrictions on SPSS syntax

    SPSS (unlike SAS or Stata) imposes limitations on the number of missing values that are allowed. SPSS provides three alternatives for declaring missing values:
    • explicitly declare up to three values as missing;
    • declare a range of values as missing. Value ranges are specified by using the syntax Value1 thru Value2, where value1 may be either a specific value or the reserved word lowest, and value2 may be either a specific value or the reserved word highest; or
    • explicitly declare one value and a range of values as missing.

    Special rules exist for defining missing string values. The string to be declared as missing should be enclosed in double quotes (using the same practice as recommended for variable labels or value labels). Additionally, two missing values statements should be included in the syntax file if both string and numeric values are declared as missing: one for all string missings, and one for numeric missings.

    Tips and tricks

    As was previously the case with value labels, variable lists may be used to share coding of missing value declarations. As you proceed through the file, you may determine that certain variables have missing declarations, but did not have value labels defined: in this case, you must enter the value labels for them. Please refer to Appendix 2 for a demonstration of how to create missing value declarations using Programmer's File Editor: the same principles demonstrated there can be used in other software packages.

    SPSS syntax

    The form of a missing values declaration consists of the command (missing values), followed by one or more variable lists and the values which are to be declared as missing (enclosed in brackets). It is possible (and recommended) to additionally explicitly indicate in the missing value declaration which variables have NO missing values: this is done by listing these variables, followed by (). The command terminates (as with all commands) with a period: the declaration and variable list(s) can span multiple lines. An example follows:

    Missing values uniqueid prsnwght ()
       spincome (-1,99996 thru highest)
       var001 var002 (6)
       var004 var005 (6 thru highest)
       var003 (6, 8, 9)
       var006 (8,9)
       ageinyrs (9)
       income (99996 thru highest) .
    Missing values brthplac ("Not stated").

    In the example above, there are no missing values for variables uniqueid or prsnwght; spincome has missing values of -1 and from 99996 to the highest value stored in the variable; var001 and var002 share the missing value of 6; var004 and var005 have missing values starting at 6 that go up to the highest value stored in the variables[18]; var003 is the only variable which uses missing values 6, 8, and 9; var006 uses missing values of 8 and 9; ageinyrs has a missing value of 9; and income has missing values from 99996 to the highest value stored in the variable. In the second missing values command, the 10-character string “Not stated” is flagged as missing.

    SPSS: Executable statements (required)

    Purpose
    Having finally created (or cleaned) the SPSS syntax file, it is possible to use the file in order to extract the information which will be processed and loaded into Equinox.

    [18] Regardless of whether “highest” is the same value - so var004 might have missing values from 6 to 9, while var005 might have missing values from 6 to 99.

    Order of commands
    It is advisable to arrange the executable commands in the same order in all files that you process, in order to ensure that none are forgotten. The order which works most effectively is:
    • descriptives
    • write
    • save
    • frequencies
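Put together, the executable portion of a syntax file might look like the sketch below. The path is the same placeholder used elsewhere in this manual, the keyword all in the write command assumes the variables are to be written in file order, and the frequencies variable list is hypothetical. Each command is described in the sections that follow.

```spss
* Executable commands in the recommended order.
* The path is a placeholder and the variable list is hypothetical.
descriptives variables=all.
write outfile='drive:\directory path\fileid.rev' table / all.
save outfile='drive:\directory path\fileid.sav'.
frequencies variables=var001 to var050.
```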

    Descriptives command
    The descriptives command will generate the mean, standard deviation, minimum, and maximum values of each variable. The same syntax may be used for each and every file processed: if string variables are found, a warning will be displayed, and they will be ignored.

    SPSS syntax
    descriptives variables=all.

    Write command
    The write command will generate the data file which will be loaded into SQL. The output from the write command will be used to populate the variable list and to determine the order in which variables are displayed in Equinox.

    SPSS syntax
    write outfile='drive:\directory path\fileid.rev' table / varlist.

    Notes
    The “fileid” was assigned early in the process: see page 11. Be sure to use the “.rev” extension for the file, as the Equinox Management Console will look for files with this extension when you prepare to load new files into the SQL database.

    The varlist specifies the order in which variables are to be written to the file, allowing you to reorder the variables if desired. Therefore, if you have created new variables, such as a geographic identifier, it may make more sense to place it earlier in the file rather than at the end, where SPSS would normally list all new variables in the order that they were processed. Similarly, components of the geographic identifier might be placed at the end of the file, if they are not uniquely identified.

    Assume a file with variables PR (province), CD (census division), CSD (census subdivision), and fifty variables (from var001 to var050). Two new variables were created: CDCODE (uniquely identifying the census division by combining PR and CD), and CSDCODE (uniquely identifying the census subdivision by combining PR, CD, and CSD, and also acting as the unique identifier for the file). In this case, the varlist might be most usefully written as:

    pr cdcode csdcode var001 to var050 cd csd

    since this would present the most relevant variables to the user first, while still providing access to the original variables which were used to construct the new ones.
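Written out in full, and assuming the placeholder path used in the syntax above, the write command for this example might look like the following sketch. The varlist is continued on an indented second line purely for readability.

```spss
* Write the new geographic identifiers first and their components last.
* The path is a placeholder, as in the rest of this manual.
write outfile='drive:\directory path\fileid.rev' table
    / pr cdcode csdcode var001 to var050 cd csd.
```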

    Suppressed variables
    It may be prudent to move all suppressed variables to the end of the file, maintaining the same order as they would normally have appeared. This will ensure that the user encounters valid variables at the top of the list, and does not have to wade through large numbers of suppressed variables, since some files contain hundreds of suppressed variables. This follows the same principle as stated above in the Notes of putting non-uniquely identified geographic variables at the end of the file.

    As a general rule, it might be worthwhile to decide that if there are 100 or more suppressed variables, they will all be moved to the end of the file. To do this, the varlist would have to be explicitly written out:

    uniqueid agegroup sex ... incomegp weight ageinyrs ... incomedl.

    Tips and tricks

    If you wish to write the variables in the revised data file in the same order as they exist in the SPSS file, the varlist is simply the word all.

    Save command
    The save command serves two purposes. First, it stores the dataset in native SPSS format. This allows you to quickly load the file into SPSS, and may be used to distribute the complete file to others. Second, by its placement following the write command, it forces that command to execute - otherwise, the “.rev” file will never be created (although the output log would still indicate the columns that would have been used for the variables).

    SPSS syntax
    save outfile='drive:\directory path\fileid.sav'.

    Frequencies command(s)
    The output from the frequencies command(s) is used to generate records for values, to determine missing values (and any necessary Stata recodes), and to generate the unweighted frequency tables which are loaded into Equinox. Output from the frequency commands is subject to automatic translation into French before loading into Equinox, speeding the process of making the metadata equivalent in English and French. Additionally, the output is checked and a report is made of missing value labels.

    Restrictions on SPSS syntax
    A maximum of 500 variables may be processed in a single frequencies command. If the file contains more than 500 variables for which frequencies must be generated, two or more frequencies commands must be used.
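For instance, a file with 650 variables requiring frequencies could be split as sketched below. The variable names are hypothetical, and the to keyword assumes that the variables are stored consecutively in the file.

```spss
* Hypothetical file with 650 variables needing frequencies.
* Each command stays within the 500-variable limit.
frequencies variables=var001 to var500.
frequencies variables=var501 to var650.
```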

    Selecting variables on which to run frequencies
    Frequencies must be run on all variables for which value labels have been created. Therefore, no frequencies are normally run on weight variables, on unique record identifiers, or generally, on continuous variables.

    Variables for which only missing value labels or special value labels were created (see page 35) must have frequencies run on them, in order to make those labels and that metadata available through Equinox. However, it is not necessary to run frequencies on the entire variable: instead, you should run a partial frequency by selecting those specific values.

    Running a partial frequency requires three separate SPSS commands (temporary, select if, and frequencies), which must appear in the order specified, for each variable. The “temporary” command that precedes each “select” statement ensures that the data file is not modified by the select command (which might otherwise discard all records from the file which did not meet the criteria specified). Finally, the frequencies command operates on the values selected in the preceding statement. The commands for running partial frequencies should appear after the frequencies commands that act on entire variables.

    SPSS syntax
    frequencies variables=var001 var002 var004 to var028 var030 var031 var033 to var099.

    temporary.
    select if (missing(var003)).
    frequencies variables=var003.

    temporary.
    select if (missing(var029)) or (var029=0).
    frequencies variables=var029.

    temporary.
    select if ((missing(var032)) or var032=-20000 or (var032=-10000)
       or (var032=10000) or var032=20000).
    frequencies variables=var032.

    temporary.
    select if var100=50000.
    frequencies variables=var100.

    In this example, the first frequencies command will run complete frequencies on ninety-six variables. The first trio of commands[19] that follow will run frequencies on var003, for missing values only. The second trio of commands will run frequencies on var029, for missing values or the value 0. The third trio of commands[20] will run frequencies on var032, with multiple values selected in addition to missing values. The final trio runs frequencies on var100 for value 50000 only.

    Tips and tricks
    Do not run frequencies on more than one variable with the temporary / select / frequencies syntax. In addition to potentially obtaining erroneous results (if subsequent variables did not match the criteria identically to the variable on which the selection was performed), the program which extracts the information from the output may not process the output correctly.

    [19] Note the brackets in the select statement: they must be included whenever selecting missing values. Be aware that all missing values are selected simultaneously with this command.

    [20] Note how brackets around the entire statement, and around individual values (other than the missing declaration) are optional, and that the command can span multiple lines.

    SPSS: Rename variables (required if alternate variable names are created)

    Purpose
    If you have renamed variables to conform to the 8-character standard, you must include commented-out commands in the SPSS syntax file which are used to populate the Equinox fields dealing with long variable names. If the program renames variables in order to avoid SQL reserved words, you do not need to include this section, as the program will automatically deal with that metadata.

    SPSS syntax
    The basic syntax for this command is rename variables (curvarname=newvarname). However, since we do not wish the variables to actually be renamed, the command is commented out by placing an asterisk at the start of the line. The command is repeated for each variable for which a standardized variable name was created.

    Going back to the example used on page 18, this would lead to the following commands being added into the syntax file (in addition to other rename commands to deal with the remaining long variables):

    * rename variables (ahaiq02i=ahai_q02_10).
    * rename variables (weight=weight_pumf).
    * rename variables (ahaiq02a=ahai_q02_1).
    * rename variables (ahaiq02b=ahai_q02_2).
    * rename variables (ahaiq02c=ahai_q02_3).
    * rename variables (ahaiq02d=ahai_q02_4).

    Tips and tricks
    If you created a SRS script file which includes both the short and long forms of the variable names, it is possible to use a PFE macro to copy this information from the SRS script into the SPSS syntax file and to format it properly. A PFE macro has been created which you can use: F:\inetpub\equinox\SPSS-setup\renamevariables.kbm. See Appendix 3 for step-by-step instructions on how to use the macro.

    You’re done editing the file (hopefully)!

    The clean SPSS syntax file should now be ready to use. The next sections of the document describe how to run the syntax file, how to save the output, and what to do with the output.

    Generating the SPSS output
    The previous page indicated that the process of creating and/or modifying the SPSS syntax file was complete. In theory, that is true. However, in reality, depending on what you find as you deal with the SPSS output, you may have to revisit the syntax, modify it, and regenerate the output.

    This section of the manual is intended to explain how to run the SPSS syntax file. Additionally, it will specify how to save the SPSS output so that the programs which reformat the SPSS output into input for Equinox can process the files properly.

    Running the SPSS syntax file

    In Windows Explorer, you should be able to double click on the syntax file to launch SPSS. However, if SPSS does not launch, you may need to run it from the Start menu instead.

    It is VITAL to note that you must be running an “English” version of SPSS for the programs that transform SPSS output to Inmagic format to work. All of these programs expect decimal point output from SPSS, rather than decimal comma output, and will fail if decimal point notation is not used.

    Move into the Syntax Editor window in SPSS, and open the syntax file that you created:

    Assuming you have not changed any of the default colour patterns used by SPSS, look in the left box of the window for any commands that are in red - these indicate errors in the syntax. To resolve these problems, click on the red command in the left window - that will position you at that point in the file. You then have to determine what the problem is and resolve it.

    Generating SPSS Output

    In this case, the period was missing from the end of the formats command. Adding it completes the command and satisfies SPSS.

    You may encounter errors with variable or value labels. The two most common errors are demonstrated here: a label is not properly enclosed in double quotes (or begins or ends with two double quotes), or is too long. The solution to the first problem is simple: add the (or remove a) quotation mark. The second problem requires some intellectual effort: you will need to shorten the variable label in order to make the line less than 250 characters long. Be sure that any information removed appears in the variable-level metadata which is loaded into Equinox in the Varnotes field.


    The same problem of excessive line length often occurs in the frequencies command (particularly if you use a macro to simply paste variable names onto the end of the command line).[21] This problem is resolved by breaking the command onto separate lines: for the sake of readability, it is recommended that you restrict the line length to approximately 80 characters.
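As a small illustration, the long frequencies command from the earlier partial-frequency example could be broken as follows. The indentation of the continuation line is a stylistic choice rather than something this manual prescribes; the command still ends at the period.

```spss
* One command spread over two lines of under 80 characters each.
frequencies variables=var001 var002 var004 to var028
    var030 var031 var033 to var099.
```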

    [21] Var001 was repeatedly added to ensure too long a line, in order to deliberately cause a syntax error. More importantly, this is an error in preparing the file: a variable should appear once and once only in the entire group of frequencies commands.


    Once no red flags appear in the left-hand portion of the syntax window, you may run the program. Note that the lack of red merely indicates that the syntax is correct, not that there are no logic errors in the program.

    Troubleshooting errors found when the syntax file is run
    When the program is run, a new window may appear in the syntax window. Hope that it does not, because it means a problem of some sort or another was encountered.[22]

    To resolve the problems, click on the line number of the error that you wish to resolve: this will take you to the first line of that command.

    [22] The new box in the image above is highlighted in red for purposes of the manual - it customarily blends into the box otherwise. In order to ensure that there were errors in the file, and that this box would display, a non-existent variable (var001) was deliberately added into certain commands in the file.


    Here, after examining the variable list and the created variables, we determine that the variable var001 does not exist (as, in fact, we are told at line 6147). If it had not been deliberately added in order to generate the error, we would have to check the documentation to see what variable name was mistyped when assigning those missing values.[23] The missing values, write, and frequencies errors can all be resolved by removing var001 from them.

    [23] Using PFE macros to populate other commands reduces the chance of mistyping variable names, and consequently, is heartily recommended.


    Troubleshooting “An invalid numeric field has been found”

    The other message, under Descriptives: “An invalid numeric field has been found. The result has been set to the system-missing value” is more problematic. It indicates that SPSS found non-numeric information where it expected to find numbers only.

    This may be deliberate - as indicated earlier, one practice is to leave a field blank if it is missing rather than providing an explicit code for it[24]. However, it may also indicate that there are more systemic problems with the data file, such as:
    • you are attempting to read a string variable as a number;
    • the data list statement is incorrect, and you are reading across variables (e.g., trying to read “0.430.3” or “0-1” or “20 4” into a single variable); or
    • the data file itself does not contain what is expected.

    If you knew that spaces would be used to indicate system missing cases, this warning may not be anything to be alarmed about. This warning would have been expected, and you will have recoded the sysmis values to coded values. If this is the case, treat the warning as a flag that you may need to do more work on the file when you load it into SQL, and ignore it.

    If there was no indication that spaces would be used for sysmis, open the original data file in PFE or Notepad++. Look for any character other than numbers, positive/negative signs, or spaces.

    If you find any such characters, note the column in which they occur, and look at the SPSS syntax window.
    • Determine which variable occupies that column, and look at the supplied documentation.
    • If the documentation indicates that the variable is a string[25], ensure that the syntax specifies this (denoted by (A) following the column specification), and specifies the correct columns for the variable. If either the (A) is missing, or the column specification is incorrect, make the appropriate correction.
    • If the documentation does not indicate that the variable is a string, ensure that the columns occupied by the variable are specified properly in the syntax file. If no value labels are specified in the file, treat the variable as a string (i.e., code it as (A)), but contact the supplier of the data for assistance. If the values specified in the documentation do not appear in the file, contact the supplier of the data for assistance.
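To illustrate the (A) notation, the sketch below reads a one-column string variable between two numeric ones. The file name, variable names, and column positions are all hypothetical; in practice they come from the supplier's documentation.

```spss
* Hypothetical layout: sex is coded "M" or "F" in column 9.
* Without the (A), SPSS would try to read sex as a number.
* That would trigger the invalid numeric field warning.
data list file='drive:\directory path\rawdata.txt' fixed
    / uniqueid 1-8 sex 9 (A) ageinyrs 10-11.
```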

    [24] See page 28. This is the best case situation for this warning.

    [25] Note that the documentation may allude to the variable being string, by showing string coding (e.g., “M” is male) without explicitly stating the fact.


    If no non-numeric characters were found, check the documentation for any indication that blanks were used to indicate missing values. This may be explicitly stated in the documentation, or may be implicitly indicated in frequency tables, as in the example below:

    Code   Frequency   Value
    1      15435       Males
    2      16793       Female
           28          Missing

    If you find that spaces were used to designate missing values, be sure to recode the blanks (system missing values, see page 28), and to follow the subsequent steps (page 30).
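The detailed procedure is described on the pages cited above. Purely as a reminder of its general shape, and not as a substitute for those steps, a recode for the frequency table shown earlier might look like the sketch below. The variable name sex and the code 9 are assumptions: any code not already used by the variable would do.

```spss
* Hypothetical variable and code, for illustration only.
* Give the blanks (system-missing values) an explicit code.
recode sex (sysmis=9).
* Label the new code without disturbing existing value labels.
add value labels sex 9 "Missing".
* Declare the new code as missing.
missing values sex (9).
```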

    If you cannot find indications that spaces were used to indicate missing values, ensure that the column specifications in the syntax file match those specified in the documentation.

    To determine which variables unexpectedly had system missing values, you can look at the frequencies which you generated in the output window - search for the term SYSMIS. Be aware, however, that you may not have run frequencies on those variables (if they were large continuous variables, for example).

    If you find variables with sysmis in the frequency tables, scroll through the data file in the columns specified for them looking for any of the instances above (0.430.3, etc.). If you encounter something like this, you may be reasonably sure that a variable was missed in the data list (and/or the documentation).

    If you cannot resolve the problem, and/or cannot identify the variables with unexpected non-numeric coding, rest assured that there will be another opportunity to identify the problem variables, and proceed.

    Troubleshooting the write command
    The “.rev” data file which was created should be examined using PFE or Notepad++ to determine that the data was written correctly to the file. The appearance of one or more asterisks (*) in the file likely[26] indicates that the field length specified for a variable was too short. If asterisks are found[27]:
    • determine the column(s) occupied by them;
    • determine which variable uses that column/those columns by examining the output from the write command;
    • determine the maximum length of that variable (by examining the “maximum column” in the descriptives output and the missing declarations for that variable); and
    • if necessary, change the format statement for that variable by adding one or more columns, or create a format statement for that variable if none previously existed (see page 31).
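As a hedged illustration of the last step: suppose the descriptives output shows a maximum of 123456 for a variable named income, but the variable is currently written with a width of five columns. Both the variable name and the widths are hypothetical.

```spss
* Hypothetical: income needs six columns but was formatted as f5.0.
* Widening the format by one column removes the asterisks.
formats income (f6.0).
```

The formats command would need to appear before the write command in the syntax file, and the syntax file must then be run again.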

    [26] Definitely, if no asterisks appeared in the data file which you originally received.

    [27] Search for * in Notepad++ or in PFE.


    Cleaning up after errors (other than expected SYSMIS warnings)
    If there were any errors in the SPSS file that you needed to correct, or you detected and corrected any errors in the “.rev” data file that the program wrote, you must close (and do not save) the SPSS output window and run the syntax file again. If there were no errors of this sort (either the error box does not appear or appears as below), you may proceed to the next step. Repeat the process until you obtain such a result.
