
Stata Reference Manual

What you should know about Stata after taking the Stata introduction course

A collection of technical hints

Ivan Iachine, Lars Korsholm, Henrik Støvring, Kirstin Vach, Werner Vach

Version 1.5, Feb., 2004

Contents

1 Entering commands
2 Online help
3 Producing output
4 The general syntax of Stata commands
5 Typical errors and error messages
6 Protection of files and data
7 Data checking
8 The graph command
9 Stratification using by
10 Generating new variables
11 Creating subsamples
12 Making tables in Stata
13 Categorization of variables
14 Using Stata as a pocket calculator: The display command
15 Loops in Stata
16 Working with do-files
17 Reshaping datasets
18 Working with string variables
19 Labels
20 Switching between labels, strings and numbers
21 Creating variables with statistics
22 Survival analysis commands
23 Online facilities
24 How to find a statistical method

1 Entering commands

New command:
Type the command in the “Stata Command” window. To execute, press Enter. See Section 4 for the syntax of Stata commands.

Previous command:
Double-click the command in the “Review” window, or press Page Up until you get the appropriate command, then hit Enter. In general, Page Up and Page Down browse previously executed commands.

Execute a do-file:
See Section 16.

2 Online help

Known command name:
Use the help menu or the command help:

. help ttest

-------------------------------------------------------------------------------
help for ttest, ttesti                                      (manual: [R] ttest)
-------------------------------------------------------------------------------

Mean comparison tests
---------------------

ttest varname = # [if exp] [in range] [, level(#) ]...

The command whelp opens a new window with the same information and clickable links.

Known name of statistical method:
Use the help menu or the command findit:

. findit paired

[R]     ttest . . . . . . . . . . . . . . . . . . . . . Mean comparison tests
        (help ttest)

FAQ     . . . . Comparing the p-values between a paired t test and a signrank
        . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . W. Sribney
        3/97    Is my boss correct in saying that the p-value given with
                a paired ttest should always be lower than the signrank?
                http://www.stata.com/support/faqs/stat/signrank.html

...

The findit command often results in hints to the Stata Technical Bulletin (STB) and to commands you can download from the internet. For more on online facilities see Section 23.

3 Producing output

Create a log-file:
Use the command log using filename. Execute your commands and finish with the command log close. The output is now stored in filename. Usually the file filename will have the extension .smcl. See help log for further information.


Copy and Paste:
Mark the desired output in the “Results” window, then copy it with Ctrl-C. Paste it into the file of your choice with Ctrl-V. You may also copy and paste in the ordinary Windows fashion using the mouse or menus.

4 The general syntax of Stata commands

Syntax:
The general syntax of a command is

commandname varlist selector, options

varlist can be one or several variable names, or it might be empty. In the case of several variables it is possible to give the varlist as, say, var1-var5, which means all the variables from var1 to var5 in the current order shown by describe, or you may use var*, which means all the variables in the dataset that start with the letters “var”.

selector can be something like

if sex=="m"
if age>18
in 1/3

As selector we may use any combination of these. Note that the “logical equal to” symbol is a double equals sign, “==”. in 1/3 means the first through third observation in the data set (in the current order).

options vary from command to command. They are either single names (e.g. histo) or include additional information in parentheses (e.g. bin(7) or xscale(0,20)).

Note: There is at most one comma separating a Stata command from its options! Further commas can appear only inside the parentheses of an option.
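To make the parts of the syntax concrete, the following is a small sketch in do-file style (no leading dot prompt). It assumes Stata 8 or later and uses the auto example dataset shipped with Stata, loaded with sysuse; the dataset and its variable names are not part of this manual and serve only as an illustration.

```stata
* Assumption: the auto example dataset shipped with Stata (Stata 8 or later)
sysuse auto, clear

* commandname  varlist     selector                  , options
summarize      price mpg   if foreign == 1 in 1/60   , detail

* A range of variables in the current dataset order
describe price-rep78

* All variables whose names start with the letter m
describe m*
```

The selector combines an if condition with an in range, and the single comma separates both from the option detail.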

Abbreviations:
Usually you can abbreviate command names and options. For example, the following two commands are equivalent:

. regress bweight hypertension ..., robust

. reg bw hyp ..., r

Each command and option has a minimal number of letters that must be typed. You can look this up using the help command, where Stata underlines the minimal abbreviation.

You can also abbreviate variable names by their first letter(s), as long as the identification remains unique. In the example above, bweight and hypertension must be the only variables beginning with bw and hyp.

5 Typical errors and error messages

If you are using the Windows version of Stata, all error messages are shown in red.

Error messages:
Error messages try to inform you about what may be wrong, for example if you misspell a variable name,

. tab variabble
variabble not found

if you use an incorrect option


. tabulate var1, by(var2)
by() invalid

or if the data is assumed to be sorted, but it is not sorted

. by var1: tabulate var2
not sorted

Below the error message in red is an r(xxx) code in blue. This code is clickable and provides more details on what might be wrong and what you should do.

The logic of error messages:
Stata cannot know what you intend to do; it can only detect errors by syntax checks. This means that you can get only indirect hints. For example, if you forget to separate an option by a comma, you will get:

. tabulate var1 var2 chi
chi not found

because Stata believes that you meant chi to be a variable. Or if you forget that by requires parentheses, you get:

. table var1, by var2
by invalid

Here Stata does not realise that you forgot the parentheses; it believes that you tried to use by as a single option. These examples show that error messages are often very cryptic.

Some typical error messages and what they may indicate:

error message                      possible explanation                      solution/example
---------------------------------  ----------------------------------------  ----------------------------------------------
no; data in memory would be lost   changing a dataset without saving         save the dataset: save newdata.dta
                                                                             if you want to use a new dataset: clear
no variables defined               no data loaded                            use data.dta
not sorted                         before using a by-option the data has     sort var1
                                   to be sorted                              or use the bysort command: bysort var1: ...
xxx not found                      unknown variable (e.g. incorrect
                                   spelling)
                                   no comma before option                    e.g. tabulate var1 var2 chi
                                   blank between function name and "("       e.g. di Binomial (20,10,0.5)
xxx() invalid                      incorrect/unknown option                  e.g. tab var1, by(var2)
                                                                             correct: by var2: tab var1
xxx invalid                        incorrect option (e.g. missing ())        e.g. table var1, by var2
xxx invalid name                   incorrect syntax (e.g. ; instead of :)    e.g. by var1; tab var2
no observations                    incorrect variable type                   e.g. regress STRINGVAR var
                                   variable with missings only               e.g. regress MISSINGVAR var
=exp not allowed                   “==” is needed                            e.g. list var1 if var2=0
type mismatch                      wrong variable type for this operation    e.g. list var1 if STRINGVAR==0 (string variable)
                                                                             e.g. list var1 if var2=="0" (numeric variable)

The “not enough space to add more ...” error messages:
The default installation of Stata starts with a small amount of memory. This message means that the memory allocated to Stata has run out.

The quick solution:
Save your dataset, clear Stata, add more memory, load the data again:

. save dummyname
file dummyname.dta saved

. clear

. set memory 16m


(16384k)

. use dummyname

. erase dummyname.dta

where 16m is 16 megabytes of RAM. Select the amount you need. See help memory.

The lasting solution:
If you are working with a do-file (and you should!), then insert at the top of the file:

set memory 16m

and rerun your do-file.

The (almost) permanent solution:
Right-click on the icon, select Properties, and change the path field to, e.g., C:\stata\wstata.exe /m16.

6 Protection of files and data

Stata tries to protect you from yourself so that you do not unintentionally lose data.

The clear and save commands:
When you have performed data manipulations and want to analyze a new dataset or want to exit the session, Stata requires that you decide what to do with your present dataset. Either you must specify save newdata or ignore the changes by typing clear. In the latter case Stata also accepts clear as an option, e.g.

. use nextdata, clear

or

. exit, clear

The replace option:
When you want to write to external files and these already exist, Stata will refuse to let you overwrite them unless you deliberately use the replace option, e.g.

. log using myfile
file myfile.log already exists
r(602);

. log using myfile, replace

This will overwrite the contents of myfile. The same holds for files containing graphs:

. scatter x y, saving(mygraph, replace)

Note: You can use the replace option, even if the corresponding file does not exist.

The replace command:
See Section 10.

Be careful with your data!


7 Data checking

Before you analyse your data you should verify that they are “as expected”.

describe [varlist]:
Gives an overview of your variables, storage type etc.

codebook [varlist]:
Provides detailed information on each variable. See Section 19.

tabulate and list:
The commands tab varname and list varname may give you “on screen” information on varname, but you have to look at the output and remember what you should look for.

The assert command:
The assert command lets you automate the confirmation process. The command does nothing if everything is “as expected”, but stops with an error message if the assertion fails (and stops executing your do-file). Some examples:

Simple arithmetic:

. assert 2+2==4

. assert 2>3
assertion is false
r(9);

If the variable age should contain the age in years (integers) and everyone should be between 20 and 50:

. assert age==int(age) & age>=20 & age<=50

If the variable sex contains the gender of the person as “F” or “M”:

. assert sex=="F" | sex=="M"

If datein is the first time an object is observed and dateout is the last time:

. assert datein<dateout

This may fail if datein can be missing. If you want to run the assertion allowing for missing values of datein:

. assert datein<dateout if datein!=.

When the assertion fails, you can list the illegitimate cases, e.g.

. list id sex if !(sex=="F" | sex=="M")

       id   sex
 19.   19     f
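The checks above can be collected at the top of a do-file so that the analysis stops as soon as the data are not as expected. The following is only a sketch: it assumes Stata 8 or later and uses the auto example dataset shipped with Stata, whose variables (price, foreign, rep78, make) are not part of this manual.

```stata
* Assumption: the auto example dataset shipped with Stata (Stata 8 or later)
sysuse auto, clear

* Every car must have a positive price
assert price > 0

* foreign may only take the values 0 and 1
assert foreign == 0 | foreign == 1

* rep78 is a score from 1 to 5 but may be missing; skip the missing cases
assert rep78 >= 1 & rep78 <= 5 if rep78 < .

* Inspect the cases that were excluded from the check above
list make rep78 if rep78 >= .
```

The condition rep78 < . is used instead of rep78 != . because Stata treats missing values as larger than any number, so it also excludes extended missing codes in newer Stata versions.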

8 The graph command

We refer to chapter 14 in “Introduction to Stata 8” by Svend Juul, available from http://www.biostat.au.dk/teaching/software/


9 Stratification using by

by-option:
Many commands in Stata allow or require stratification of your data into groups using the by-option, e.g.

. gr size, box by(sex)

by-construct:
Other commands allow a preceding by for a stratified analysis, e.g.

. by sex: sum size

In both cases, you have to sort the data first:

. sort sex

There is no common rule for when by-constructs or by-options are allowed. However, this is always indicated in the syntax description offered by the help command.
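As a minimal sketch of the by-construct, the lines below run a stratified summary in two ways. They assume Stata 8 or later and the auto example dataset shipped with Stata; the variables foreign, rep78 and mpg belong to that dataset, not to this manual.

```stata
* Assumption: the auto example dataset shipped with Stata (Stata 8 or later)
sysuse auto, clear

* Explicit sort followed by the by-construct
sort foreign
by foreign: summarize mpg

* bysort does the sorting and the stratification in one step
bysort rep78: summarize mpg
```

The second form avoids the “not sorted” error described in Section 5, because bysort sorts the data before running the command.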

10 Generating new variables

The generate command:
You can use the generate command to generate new variables. In the following example, we generate a variable for body mass index, an indicator of overweight, and an indicator for absence of fever, emesis and fatigue:

. l

     weight   height   fever   emesis   fatigue
1.       54     1.73       1        0         1
2.       88     1.81       1        0         0
3.      102     1.77       0        0         0
4.       91     1.91       0        1         0
5.       74     1.66       0        1         1

. gen bmi=weight/height^2

. gen overw=bmi>25

. gen success=(~fever) & (~emesis) & (~fatigue)

. l

     weight   height   fever   emesis   fatigue        bmi   overw   success
1.       54     1.73       1        0         1    18.0427       0         0
2.       88     1.81       1        0         0   26.86121       1         0
3.      102     1.77       0        0         0   32.55769       1         1
4.       91     1.91       0        1         0   24.94449       0         0
5.       74     1.66       0        1         1   26.85441       1         0

Note: If you want to generate string variables, you have to specify the length of the string. See Section 18.

The replace command:
If you want to overwrite an existing value of a variable, you have to use the replace command instead of the generate command. For example, if height is recorded in centimeters in the data set, but you want to have it in meters, you just type


. replace height=height/100

A perhaps unexpected use of replace appears when you try to define a new variable with a subgroup-dependent definition. For example, if the limit for overweight differs between males and females, typically you use code like

. generate overw=bmi>23 if sex=="m"

. replace overw=bmi>25 if sex=="f"

The reason for this is that the first statement already fills the variable overw with missing values for all female subjects, which have to be replaced by the second statement.
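One point worth illustrating with the generate/replace pattern is how missing values behave: Stata treats a missing value as larger than any number, so an expression such as bmi>25 is true when bmi is missing. The following sketch uses a small made-up dataset (the values and the added condition bmi < . are assumptions for illustration, not part of the original example) to keep the indicator missing in that case.

```stata
* Made-up illustration data; the fourth subject has no recorded weight
clear
input weight height str1 sex
54  1.73 "f"
88  1.81 "m"
102 1.77 "m"
.   1.91 "f"
end

generate bmi = weight/height^2

* Sex-specific limits; bmi < . keeps overw missing when bmi is missing
generate overw = bmi > 23 if sex == "m" & bmi < .
replace  overw = bmi > 25 if sex == "f" & bmi < .

list
```

Without the condition bmi < ., the fourth subject would be classified as overweight although her body mass index is unknown.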

Overview of available functions and operators:
In generating new variables, you can combine existing variables using many operators and functions. With help operators or help functions you get an overview. The most important ones are summarized in the following lists.

. help operators
-------------------------------------------------------------------------------
help for operators                   (manual: [U] 20 Functions and expressions)
-------------------------------------------------------------------------------

Operators in expressions
------------------------

                                              Relational
 Arithmetic             Logical               (numeric and string)
-------------------     ------------------    --------------------
 +   addition            ~   not               >    greater than
 -   subtraction         |   or                <    less than
 *   multiplication      &   and               >=   > or equal
 /   division                                  <=   < or equal
 ^   power                                     ==   equal
                                               ~=   not equal

....

Note that the “equal to” symbol is a double equals sign, “==”.

. help functions

-------------------------------------------------------------------------------
help for functions                                      (manual: [R] functions)
-------------------------------------------------------------------------------

...

Mathematical functions
----------------------

abs(x)      absolute value
cos(x)      cosine of radians
exp(x)      exponentiation
ln(x)       natural logarithm
log(x)      same as ln(x)
log10(x)    base 10 logarithm
sin(x)      sine of radians
sqrt(x)     square root
tan(x)      tangent of radians
...

See also section 21.


11 Creating subsamples

There are two ways in which you can create subsamples. You can select a subset of your variables (vertical selection) or you can select a subset of your observations (horizontal selection). For both procedures we have the commands drop and keep.

For variables:
The data set has three variables ID, sex and income.

. drop income

which produces the same result as

. keep ID sex

For observations:
Drop all observations associated with female individuals (the code “f” in the variable sex indicates a female)

. drop if sex=="f"

which produces the same result as

. keep if sex~="f"

The consequence of these commands is that the dataset in memory is permanently changed. The dataset on disk is not affected until you issue the command save dataname, replace. To save under a new filename, type save newdataname.

12 Making tables in Stata

The tabstat command:
You use tabstat when you want to display a series of summary statistics for one or several variables.

tabstat varlist [, statistics(statname [...]) by(varname) columns(var|stat) long ]

where statname [...] are the summary statistics that you want to display.

. tabstat erateWL, s(n mean sd)

    variable |         N      mean        sd
-------------+------------------------------
     erateWL |       170    .19375  .1836576
--------------------------------------------

If you want separate summary statistics for each group defined by varname you should use the options by(varname) c(s) lo.

. tabstat erateS erateWL, s(n mean sd) c(s) by(gender) lo

gender    variable |         N      mean        sd
-------------------+------------------------------
Female      erateS |        89  .5580524  .2354242
           erateWL |        89  .1685393  .1723018
-------------------+------------------------------
Male        erateS |        81  .5925926  .2487279
           erateWL |        81  .2214506  .1926505
-------------------+------------------------------
Total       erateS |       170  .5745098  .2417539
           erateWL |       170    .19375  .1836576
--------------------------------------------------


See more details in help tabstat.

The table command:
You use table when you want to display a series of summary statistics for each level of another variable.

table rowvar [colvar [supercolvar] ...] [, contents(clist) row col [options] ]

The philosophy behind the syntax is that we want a table where, for each value of the variable rowvar (and colvar and supercolvar), the cell contains clist with the layout format given in options, where clist is a list of summary statistics on other variables. The option row adds a row of totals to the table (similarly, the option col adds a column of totals). For details on the format options see help table.

. table treat, c(n dec med dec p5 dec p95 dec)

----------+-----------------------------------------------------------
    treat |  N(decrease)  med(decrease)   p5(decrease)  p95(decrease)
----------+-----------------------------------------------------------
        1 |          205       5.211085      -10.97878       23.59735
        2 |          204       16.30814      -2.117609       33.61396
        3 |          204       13.19776      -25.15851       30.93353
----------+-----------------------------------------------------------

The tabulate command:
You use the tabulate command when you want to investigate the association between two (or more) variables.

tabulate varname1 varname2 [, all cell chi2 column exact gamma lrchi2 row taub V ...]

The interpretation of the syntax is that we tabulate the frequency counts of varname1 versus varname2 with various measures of association, including the common Pearson chi-squared, the likelihood ratio chi-squared, Cramer’s V, Fisher’s exact test, Goodman and Kruskal’s gamma, and Kendall’s tau-b.

. tab res treat, chi2

           |              treat
    result |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |        74         21         56 |       151
         2 |        71         47         35 |       153
         3 |        36         57         52 |       145
         4 |        24         79         61 |       164
-----------+---------------------------------+----------
     Total |       205        204        204 |       613

          Pearson chi2(6) =  75.7134   Pr = 0.000

It is possible to combine tabulate with summarize to obtain table-like output in a fast way.

. tab treat, summarize(dec)

            |        Summary of decrease
      treat |        Mean   Std. Dev.       Freq.
------------+------------------------------------
          1 |   5.6048431   11.082792         205
          2 |   15.710805   11.359821         204
          3 |   9.3633245   17.387196         204
------------+------------------------------------
      Total |   10.218785   14.193435         613

See help tabsum.
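For readers who want to try these table commands without the course data, here is a short sketch on the auto example dataset shipped with Stata (an assumption: Stata 8 or later; the variables price, mpg, foreign and rep78 are from that dataset, not from this manual). The options are written out in full instead of abbreviated.

```stata
* Assumption: the auto example dataset shipped with Stata (Stata 8 or later)
sysuse auto, clear

* Summary statistics for two variables, separately by group
tabstat price mpg, statistics(n mean sd) by(foreign) columns(statistics) longstub

* Two-way frequency table with two tests of association
tabulate foreign rep78, chi2 exact

* One-way table of means, standard deviations and frequencies
tabulate foreign, summarize(mpg)
```

The table command is left out of this sketch on purpose, since its syntax differs between Stata versions.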

Specialized tables:
There exist a number of “table” commands designed for specific purposes: for epidemiologic data see help epitab, for cross-sectional time-dependent data (also called panel data) see help xttab, and for survival data see help ltable.


13 Categorization of variables

In many medical applications continuous variables are reduced to variables with a few categories like “low”, “middle” and “high”. Stata supports this step with several functions and commands.

Categorizing a variable at specific cutpoints using the recode function:
If you want to categorize a variable at specific cutpoints, you can use the recode function as in the following example. The new variable assigns to each value the upper bound of the interval into which the value falls. Note that you have to ensure that the last specified cutpoint is not smaller than the maximal value in your dataset in order to obtain the desired result (see the generation of catvar1). In general, the last specified value in the arguments of recode is not the last cutpoint, but the value assigned to each value larger than the last but one argument. This property is used in generating catvar2 to assign a missing value to all values larger than 110.

. list

     var
1.    23
2.    56
3.    67
4.   123
5.    99
6.    17

. gen catvar1=recode(var,50,100,150)

. gen catvar2=recode(var,40,60,80,110,.)
(1 missing value generated)

. list

     var   catvar1   catvar2
1.    23        50        40
2.    56       100        60
3.    67       100        80
4.   123       150         .
5.    99       100       110
6.    17        50        40

If you want to recode the values of the grouped variable, you can use the recode command, or you can use the egen command with the group function, which assigns the values 1, 2, 3 etc. to the smallest, the next smallest etc. value. Both are illustrated in continuing our example:

. list

     var   catvar1   catvar2
1.    23        50        40
2.    56       100        60
3.    67       100        80
4.   123       150         .
5.    99       100       110
6.    17        50        40

. egen catvarg1=group(catvar1)

. recode catvar2 40=1 60=2 80=3 110=4
(5 changes made)

. list


     var   catvar1   catvar2   catvarg1
1.    23        50         1          1
2.    17        50         1          1
3.    56       100         2          2
4.    67       100         3          2
5.    99       100         4          2
6.   123       150         .          3

Note that using the group function implies that data are reordered.

Categorizing a variable at equidistant cutpoints using the autocode function:
autocode is an automated version of recode, which you can use if the cutpoints are equidistant. You then only have to specify the number of intervals, the lower bound of the first interval and the largest cutpoint. Note that all values larger than the largest cutpoint are assigned the largest cutpoint, so you should ensure that the largest cutpoint is not smaller than the maximal value in your dataset.

As categorization and recoding are always dangerous operations, you should always try to check the result, for example by a cross tabulation. This is illustrated in the following example, too.

. list

     var
1.    23
2.    56
3.    67
4.   123
5.    99
6.    17

. gen catvar=autocode(var,5,0,100)

. list

     var   catvar
1.    23       40
2.    56       60
3.    67       80
4.   123      100
5.    99      100
6.    17       20

. tab var catvar, missing

           |                        catvar
       var |        20         40         60         80        100 |     Total
-----------+-------------------------------------------------------+----------
        17 |         1          0          0          0          0 |         1
        23 |         0          1          0          0          0 |         1
        56 |         0          0          1          0          0 |         1
        67 |         0          0          0          1          0 |         1
        99 |         0          0          0          0          1 |         1
       123 |         0          0          0          0          1 |         1
-----------+-------------------------------------------------------+----------
     Total |         1          1          1          1          2 |         6

Categorizing a variable in groups of equal size using xtile:
The xtile command creates a new variable categorizing an existing variable in groups of (approximately) equal size. The number of groups has to be specified using the nq option. This is illustrated in the following example:


. list

     var
1.    23
2.    56
3.    67
4.   123
5.    99
6.    17

. xtile cat2=var, nq(2)

. xtile cat3=var, nq(3)

. xtile cat4=var, nq(4)

. list

     var   cat2   cat3   cat4
1.    17      1      1      1
2.    23      1      1      1
3.    56      1      2      2
4.    67      2      2      3
5.    99      2      3      3
6.   123      2      3      4

Note that xtile reorders the dataset.

One can also use xtile to categorize at cutpoints defined by another variable. Combining it with pctile allows you to categorize at percentiles of subgroups. For further details try help xtile and look into the Stata reference manual.
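The categorization steps of this section can be rerun from a do-file. The sketch below re-enters the six example values with input (an assumption about how the data were created; the manual only lists them) and checks the result with a cross tabulation, as recommended above.

```stata
* Re-enter the six example values used in this section
clear
input var
23
56
67
123
99
17
end

* Five equidistant intervals between 0 and 100; values above 100 get 100
generate catvar = autocode(var, 5, 0, 100)

* Three groups of (approximately) equal size
xtile cat3 = var, nq(3)

* Always check a categorization against the original values
tabulate var catvar, missing
tabulate var cat3, missing
```

The cross tabulation makes it visible that the value 123 lies outside the range given to autocode and was assigned to the last category.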

14 Using Stata as a pocket calculator: The display command

The display command allows you to type in expressions and to look at the results. You can use all operators and functions defined in Stata. Typical examples look like these:

. di 3+4
7

. di 10.6 - 2 * 7.35
-4.1

. di 3^4
81

. di (2.1 + 2.3)/(4.1 + 47.3)

.08560311

. di 2+3, 2+5.6, 3+6
5 7.6 9

. di 23.4-invnorm(0.995)*12.3, 23.4 + invnorm(0.995)*12.3
-8.2827004 55.0827

15 Loops in Stata

The for command:
You can execute a series of Stata commands with the command for. Example:

. for num 1/5: replace varX=varX/1000


The index X is substituted in each loop. num tells Stata that we use numerical values for X. 1/5 is the list of values {1 2 3 4 5}. The ‘:’ indicates that what follows are the Stata commands to be executed in each step of the loop.

It is possible to have several indices (\). For example, I may wish to keep var1-var5 and have new variables var11-var15 in kilo scale.

. for num 1/5 \ num 11/15: generate varY=varX/1000

where \ tells Stata that here a second index Y starts.

Furthermore, we may nest a for-loop within another for-loop to obtain matrix-like repetitions (see footnote 1).

. for A in num 1/5: for B in num 1/5: gen varAB=varA*varB

would generate 25 variables var11, var12, ..., var55.

If you use for combined with graph, remember the pause option. See also help foreach and help forvalues in Stata 7. See the manual for further details.
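Since foreach and forvalues are mentioned as alternatives, the following sketch shows how the kilo-scale example might look with them. It is an illustration under assumptions, not the authors' code: the variables var1-var5 are created here with made-up values, and the local macros i, j and v are names chosen for this example.

```stata
* Made-up data: ten observations and five variables var1-var5
clear
set obs 10
forvalues i = 1/5 {
    generate var`i' = 1000 * `i'
}

* Keep var1-var5 and create var11-var15 in kilo scale
forvalues i = 1/5 {
    local j = `i' + 10
    generate var`j' = var`i' / 1000
}

* Loop over a list of variables instead of a list of numbers
foreach v of varlist var1-var5 {
    summarize `v'
}
```

Inside the braces the loop index is referred to as a local macro, written with a left quote and an apostrophe around its name.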

16 Working with do-files

What is a do-file:
A do-file is a flat text file (i.e. ASCII format) containing Stata commands.

Creating a do-file:
Open the “Do file editor”. Type in the commands you would ordinarily type in the “Command” window. The editor is similar to, for example, “NotePad”.

Executing a do-file:
Press the Do button (number two top right).

Debugging a do-file:
Read the error messages. If this doesn’t help, try the command set trace on, which gives very detailed information on command execution. It is reversed to its original setting by set trace off. The command set trace on places a “-” in front of each line which is executed. The last line without a “-” sign contains the error. Often useful in combination with set more off.

Why use do-files:
For two reasons:

1. Gives you the option of modifying and re-running your commands, i.e. it is a time saver (in the long run...).

2. Provides you with documentation on just how you arrived at your precious conclusions.

Comments in do-files:
It is fruitful to write comments to yourself or any reader in your do-files. You write comments by beginning the line with an asterisk *; Stata will then ignore whatever is in that line.

A nice do file looks like:

log using filename, replace
* This do-file is an example
use data, clear
describe
... some other commands
log close

Footnote 1: This feature is new in Stata 6.


17 Reshaping datasets

Reshaping wide datasets:
Suppose you have the following dataset with measurements of nausea on 3 consecutive days after chemotherapy:

. list in 1/3

     id   sex   nausea1   nausea2   nausea3
1.    1     m        78        56        34
2.    2     f        83        45        67
3.    3     m        27        22        22

You would like to investigate the increase over time by a regression model. For this, you need a data set where each line corresponds to one day of one individual. You can use the reshape command to achieve this:

. reshape long nausea, i(id) j(day)

. list in 1/9

     id   day   sex   nausea
1.    1     1     m       78
2.    1     2     m       56
3.    1     3     m       34
4.    2     1     f       83
5.    2     2     f       45
6.    2     3     f       67
7.    3     1     m       27
8.    3     2     m       22
9.    3     3     m       22

.

. regress nausea day, cluster(id)

In Stata’s terminology, you have changed a dataset from wide format to long format.

Note: The i-option specifies the logical unit, whereas the j-option specifies the variable which indicates observations within a unit.

Reshaping long datasets:
Suppose you have the following dataset with measurements of nausea on 3 consecutive days after chemotherapy:

     id   day   sex   nausea
1.    1     1     m       78
2.    1     2     m       56
3.    1     3     m       34
4.    2     1     f       83
5.    2     2     f       45
6.    2     3     f       67
7.    3     1     m       27
8.    3     2     m       22
9.    3     3     m       22

You would like to make a scatterplot of the measurement on day 2 versus the measurement on day 1. For this you need a dataset where you have the variables nausea1 and nausea2. You can use the reshape command to achieve this:


. reshape wide nausea, i(id) j(day)

. list in 1/3

     id   nausea1   nausea2   nausea3   sex
1.    1        78        56        34     m
2.    2        83        45        67     f
3.    3        27        22        22     m

. gr nausea2 nausea1, twoway

.

In Stata’s terminology, you have changed a dataset from long format to wide format.

Note: If you switch from long to wide format, all variables not used as arguments for reshape must be constant within each unit specified by the i-option. Otherwise, you get an error message.

Reshaping several variables simultaneously with nonnumeric suffixes:
In reshaping datasets, the variables can also have nonnumeric suffixes, for example left and right. In this case you have to specify the string option. You can also reshape several variables simultaneously. Both are illustrated in the following example:

. list in 1/2

     id   sex   eyeleft   eyeright   earleft   earright
1.    1     m         1          1         0          0
2.    2     f         1          0         1          0

. reshape long eye ear, i(id) j(side) string

. list in 1/4

     id    side   sex   eye   ear
1.    1    left     m     1     0
2.    1   right     m     1     0
3.    2    left     f     1     1
4.    2   right     f     0     0

. reshape wide eye ear, i(id) j(side) string

. list in 1/2

     id   eyeleft   earleft   eyeright   earright   sex
1.    1         1         0          1          0     m
2.    2         1         1          0          0     f

You can also use the reshape command for more complex situations. Take a look at the Stata Reference Manual.
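To try the round trip between wide and long format, the nausea example can be re-entered with input and reshaped in both directions. This is a sketch that assumes the three listed observations are the whole dataset; the sepby() option of list assumes Stata 8 or later.

```stata
* Re-enter the three example subjects in wide format
clear
input id str1 sex nausea1 nausea2 nausea3
1 "m" 78 56 34
2 "f" 83 45 67
3 "m" 27 22 22
end

* Wide to long: one line per subject and day
reshape long nausea, i(id) j(day)
list, sepby(id)

* Long to wide: back to one line per subject
reshape wide nausea, i(id) j(day)
list
```

Here sex is constant within each id, which is what allows the second reshape to succeed.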

18 Working with string variables

Generating string variables:
If you want to generate a new string variable, you have to specify the length of the variable in the generate statement, e.g.

. gen str3 s="abc"

Operations on strings:
If you want to concatenate strings, you can use the + operator:


. l

     treat   group
1.       A       2
2.       A       1

. gen str3 tr_gr=treat+" "+group

. l

     treat   group   tr_gr
1.       A       2     A 2
2.       A       1     A 1

There exist a lot of functions to work with strings, especially to switch from numbers to strings and vice versa.

. help functions

-------------------------------------------------------------------------------
help for functions                                      (manual: [R] functions)
-------------------------------------------------------------------------------

....

String functions
----------------

index(s1,s2)     --- returns position in s1 in which s2 is first found or 0 if
                     s1 does not contain s2
length(s)        --- returns length of string s
lower(s)         --- returns lowercased variant of s
ltrim(s)         --- returns s with leading blanks removed
real(s)          --- converts s into a numeric value
rtrim(s)         --- returns s with trailing blanks removed
string(n)        --- converts n into a string
string(n,%fmt)   --- converts n into a string with %fmt display format
substr(s,n1,n2)  --- returns the substring of s starting at n1 for a length of
                     n2; if n1<0, starting position is interpreted as distance
                     from end of string; if n2==., the remaining portion of the
                     string is returned
trim(s)          --- returns s with leading and trailing blanks removed
upper(s)         --- returns uppercased variant of s

....

See also Section 20.
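A few of these string functions can be combined as in the following sketch. It is a variation on the concatenation example under one stated assumption: here group is entered as a numeric variable, so string() is needed before concatenating. The variable grp_num is a name invented for this illustration.

```stata
* Made-up data: treat is a string, group is numeric in this sketch
clear
input str1 treat group
"A" 2
"A" 1
end

* Numbers must be converted with string() before they can be concatenated
generate str3 tr_gr = treat + " " + string(group)

* Extract the third character and convert it back to a number
generate grp_num = real(substr(tr_gr, 3, 1))

list
assert grp_num == group
```

The final assert follows the advice of Section 7 and confirms that the round trip from number to string and back reproduces the original values.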

19 LabelsLabelling an existing variable:

If a variable is coded by numerical values, it is often useful to have the meaning of the values andnot the values themselves in tabulations and listings. You can achieve this by assigning labels tothe variable values using the label command:

. list

         sex   age
  1.       0    17
  2.       1    23

.


. label define labsex 0 male 1 female

. label values sex labsex

.

. list

         sex   age
  1.    male    17
  2.  female    23

Note: The labels are only used for displaying the values. Internally, the values are still stored as numbers, so you can use sex only as a numeric variable.

Distinguishing values and labels:
Once a variable is labelled, it can be difficult to find out what the real values are. The codebook command always shows you both the values and the labels:

. codebook sex

sex --------------------------------------------------------------- (unlabeled)
                  type:  numeric (float)
                 label:  labsex

                 range:  [0,1]                        units:  1
         unique values:  2                    coded missing:  0 / 2

            tabulation:  Freq.   Numeric  Label
                             1         0  male
                             1         1  female

Note: If you import datasets from other systems, for example using StatTransfer, values are often already labeled. Hence it is always a good idea to use codebook in the beginning.

Note: Some commands, for example list and tabulate, allow a nolabel option, such that the values instead of the labels are shown.
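A short sketch of how labels and values interact in practice, using the two observations from the listing above. The point to notice is that conditions always refer to the numeric codes, never to the label text.

```stata
* Rebuild the small example dataset
clear
input sex age
0 17
1 23
end
label define labsex 0 "male" 1 "female"
label values sex labsex

* Conditions use the numeric codes, not the label text
list if sex == 1

* Show the codes instead of the labels
list, nolabel
tabulate sex, nolabel

* Show the mapping stored in the value label
label list labsex
```

A condition such as if sex == "female" would be rejected, because sex is still a numeric variable.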

20 Switching between labels, strings and numbers

Labels and Strings:
Sometimes you would like to use the labels of a variable as strings, for example if you want to create a new variable by concatenating. This is done by the decode command, and encode does the opposite:

. list

         sex   age
  1.    male    17
  2.  female    23

. list, nolabel

         sex   age
  1.       0    17
  2.       1    23

. decode sex, gen(sexstr)

. list


         sex   age   sexstr
  1.    male    17     male
  2.  female    23   female

. encode sexstr, gen(gender)

.

. list

         sex   age   sexstr   gender
  1.    male    17     male     male
  2.  female    23   female   female

. codebook gender

gender ------------------------------------------------------------ (unlabeled)
                  type:  numeric (long)
                 label:  gender

                 range:  [1,2]                        units:  1
         unique values:  2                    coded missing:  0 / 2

            tabulation:  Freq.   Numeric  Label
                             1         1  female
                             1         2  male

Strings and Numbers:
The string function converts numbers to strings, and the real function converts strings to numbers.

. list

         sex   age
  1.  female    23
  2.    male    17

. gen str2 agestr=string(age)

. gen years=real(agestr)

. list

         sex   age   agestr   years
  1.  female    23       23      23
  2.    male    17       17      17

. describe

Contains data
  obs:             2
 vars:             4
 size:            56 (98.5% of memory free)
-------------------------------------------------------------------------------
  1. sex       float  %9.0g      labsex
  2. age       float  %9.0g
  3. agestr    str2   %9s
  4. years     float  %9.0g
-------------------------------------------------------------------------------
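The conversions of this section can be combined, for example to build a string variable from a label and a number. The sketch below uses the two observations from the listings above; the variable names tag and bad are hypothetical. It also makes visible that encode numbers the strings in alphabetical order, so the codes of gender differ from those of sex.

```stata
* Rebuild the small example dataset
clear
input sex age
0 17
1 23
end
label define labsex 0 "male" 1 "female"
label values sex labsex

* Turn the labels into a string variable, then concatenate
decode sex, gen(sexstr)
gen str10 tag = sexstr + "_" + string(age)

* encode assigns codes alphabetically: female=1, male=2
encode sexstr, gen(gender)
list, nolabel

* real() returns missing for text that is not a number
gen bad = real("abc")
```

If the original coding matters, keep the numeric variable sex rather than relying on a variable recreated by encode.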


21 Creating variables with statistics

It is often necessary for an analysis to prepare the dataset by computing new variables with statistics, for example the maximum value observed during a day or subject specific mean values. The following illustrates some typical tools for this task.

Computing statistics over several variables using egen:
The egen command offers functions like rmax or rmean to compute a maximum or a mean “rowwise”. This is illustrated in the following example, where we have for each subject and each day a measurement at 6 o’clock, 12 o’clock and 18 o’clock. We can use rmax to compute the maximum within each day:

. list in 1/6

       subj   day   val6   val12   val18
  1.      1     1   23.5    34.3    22.9
  2.      1     2   25.8    33.6    27.8
  3.      1     3   12.8    18.9    22.3
  4.      2     1   14.5    17.9    22.8
  5.      2     2   19.8    17.3    15.4
  6.      2     3   33.9    30.3    27.8

. egen maxv=rmax(val6 val12 val18)

. list in 1/6

       subj   day   val6   val12   val18   maxv
  1.      1     1   23.5    34.3    22.9   34.3
  2.      1     2   25.8    33.6    27.8   33.6
  3.      1     3   12.8    18.9    22.3   22.3
  4.      2     1   14.5    17.9    22.8   22.8
  5.      2     2   19.8    17.3    15.4   19.8
  6.      2     3   33.9    30.3    27.8   33.9

For this type of task egen offers the functions rmax, rmin, rmean, rsum, rsd and robs, where the latter gives the number of nonmissing observations. Note that these functions expect a list of variables separated by blanks. Do not confuse them with the functions mean, min, max etc., which are also offered by egen for other purposes.
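The following sketch shows how the rowwise functions behave when a measurement is missing. The data are hypothetical: they follow the first three rows of the example, with one value replaced by a missing value for illustration. Newer Stata releases document these functions as rowmax, rowmean and rownonmiss; the names from the text above are used here.

```stata
* Hypothetical data: the 12 o'clock value on day 3 is missing
clear
input subj day val6 val12 val18
1 1 23.5 34.3 22.9
1 2 25.8 33.6 27.8
1 3 12.8    . 22.3
end

* Rowwise maximum, mean and number of nonmissing values
egen maxv  = rmax(val6 val12 val18)
egen meanv = rmean(val6 val12 val18)
egen nobs  = robs(val6 val12 val18)
list
```

On day 3 the maximum and the mean are computed from the two available values only, and nobs records that just two measurements contributed.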

Computing statistics over several observations using collapse:
The collapse command allows you to compute statistics from groups of observations. Looking at the last example, we might now be interested in taking the average over three days for each subject. This can be done in the following way:

. list in 1/6

       subj   day   val6   val12   val18   maxv
  1.      1     1   23.5    34.3    22.9   34.3
  2.      1     2   25.8    33.6    27.8   33.6
  3.      1     3   12.8    18.9    22.3   22.3
  4.      2     1   14.5    17.9    22.8   22.8
  5.      2     2   19.8    17.3    15.4   19.8
  6.      2     3   33.9    30.3    27.8   33.9

. collapse (mean) meanmax=maxv, by(subj)

. list in 1/2

       subj    meanmax
  1.      1   30.06667
  2.      2       25.5


You can generate several statistics simultaneously; for example, you can use collapse (min) minval6=val6 (max) maxval6=val6, by(subj) in order to generate the minimum and maximum of the measurements at 6 o’clock over the three days for each subject. Other statistics offered by collapse are median, sd, sum, iqr and all percentiles.

Note: If you have variables which are constant within the unit you would like to collapse, and which you want to keep in the new dataset (for example the age and sex of a subject), you can include them in the by-option. (For example: collapse ..., by(subj age sex))
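A sketch combining both points: several statistics in one collapse call, and subject-level variables kept through the by-option. The 6 o'clock values are taken from the example above, while the age and sex values are hypothetical.

```stata
* Example measurements plus hypothetical subject-level variables
clear
input subj day val6 age str1 sex
1 1 23.5 34 "m"
1 2 25.8 34 "m"
1 3 12.8 34 "m"
2 1 14.5 51 "f"
2 2 19.8 51 "f"
2 3 33.9 51 "f"
end

* collapse replaces the data in memory, so save your dataset first.
* age and sex are constant within subj and are kept via by().
collapse (min) minval6=val6 (max) maxval6=val6 (mean) meanval6=val6, by(subj age sex)
list
```

The result has one observation per subject. If age or sex varied within a subject, by() would instead produce one observation per distinct combination.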

Computing statistics over several observations using egen:
Sometimes it is necessary to generate statistics over observations without reducing the dataset, for example if you want to compare single values with subject specific mean values. The egen command together with a by-option allows you to do this in an easy manner. In the following example we have 6 measurements for each subject, and we would like to compare the values with the subject specific means in order to check when a subject suffers from a high or low value. This can be done in the following way:

. list in 1/12

       subj   time   value
  1.      1      1    17.9
  2.      1      2    23.7
  3.      1      3    45.8
  4.      1      4    37.2
  5.      1      5    19.4
  6.      1      6    20.8
  7.      2      1    44.5
  8.      2      2    48.7
  9.      2      3    52.1
 10.      2      4    46.7
 11.      2      5    44.5
 12.      2      6    40.3

. egen meanval=mean(value), by(subj)

. gen high=value>meanval

. list in 1/12

       subj   time   value    meanval   high
  1.      1      1    17.9   27.46667      0
  2.      1      2    23.7   27.46667      0
  3.      1      3    45.8   27.46667      1
  4.      1      4    37.2   27.46667      1
  5.      1      5    19.4   27.46667      0
  6.      1      6    20.8   27.46667      0
  7.      2      1    44.5   46.13333      0
  8.      2      2    48.7   46.13333      1
  9.      2      3    52.1   46.13333      1
 10.      2      4    46.7   46.13333      1
 11.      2      5    44.5   46.13333      0
 12.      2      6    40.3   46.13333      0

egen also offers functions like min, max, median, sd, iqr, rank, sum and functions for percentiles. A typical use of egen is in standardizing a variable to the range 0-1 for each subject. This looks like

. egen min=min(var), by(subject)

. egen max=max(var), by(subject)

. gen standvar=(var-min)/(max-min)
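As a worked sketch of this standardization, the following applies the three commands to a few values taken from the example above. The helper names minv and maxv are assumptions chosen for illustration.

```stata
* Three measurements per subject, taken from the example above
clear
input subj value
1 17.9
1 45.8
1 20.8
2 44.5
2 52.1
2 40.3
end

* Subject-specific minimum and maximum
egen minv = min(value), by(subj)
egen maxv = max(value), by(subj)

* 0 at the subject's minimum, 1 at the subject's maximum.
* If all values of a subject are equal, the result is missing (0/0).
gen standvar = (value - minv)/(maxv - minv)
drop minv maxv
list
```

Within each subject the smallest value is mapped to 0 and the largest to 1, so subjects with very different levels become comparable.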


22 Survival analysis commands

A characteristic feature of survival data is the presence of censoring and left truncation. Without censoring and truncation the data are represented by the survival time variable $T$, which measures the duration of time between the initial event and the final event. In the presence of censoring and truncation more variables are required to represent the incomplete observation of the survival time $T$.

With censoring at time $C$ (e.g. end of followup) it is only possible to observe $T$ if the final event occurs before time $C$. The final event indicator $D$ is equal to $1$ if $T \le C$ (i.e. uncensored observation) and it is equal to $0$ if $T > C$ (i.e. censored observation). The censored survival time $\tilde{T}$ is equal to $T$ if $T \le C$ and is equal to $C$ if $T > C$. With left truncation at time $T_0$ the censored observations $(\tilde{T}, D)$ are only observed if $\tilde{T} > T_0$ (otherwise no information is collected). Consequently, under right censoring and left truncation the survival time $T$ is represented by three variables $(\tilde{T}, D, T_0)$. In Stata datasets these variables are usually called time, event and time0 respectively. If all subjects enter at time $0$ (i.e. $T_0 = 0$) the respective variable time0 may be omitted in the dataset.
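To make the relation between the variables concrete, the sketch below builds time and event from a hypothetical survival time T and a hypothetical censoring time C. In real data T and C are not both observed; the dataset here is constructed only to illustrate the definitions, and the names T and C are assumptions.

```stata
* Hypothetical true survival times T and censoring times C
clear
input id T C
1  5 10
2 12 10
3  7  7
4  3  8
end

* Censored survival time: whichever comes first
gen time = min(T, C)

* Event indicator: 1 if the final event was observed, 0 if censored
gen event = T <= C
list
```

Subject 2 is censored: the final event would occur at 12, but followup ends at 10, so the dataset records time 10 with event 0.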

Prepare the dataset for analysis:
In order to avoid entering the three variable names representing the survival time observations in each survival analysis command, Stata requires an extra step before any survival analysis command may be executed. This step is carried out using the stset command:

stset time, failure(event) enter(time0)

This ensures that the variables time, event and time0 will be used automatically by Stata in all subsequent survival analysis commands to represent the censored observations. When all subjects enter at time 0, the enter() option may be omitted:

stset time, failure(event)

Kaplan-Meier plot:
The sts graph command will produce graphs of Kaplan-Meier estimates of the survival function:

sts graph

Kaplan-Meier plots with 95% CI:
The sts graph command may be combined with the by(indepvar) option to produce separate Kaplan-Meier plots for subgroups of the data specified by the different values of indepvar. The gwood option may be used to add pointwise 95% confidence intervals to the plots.

sts graph, by(indepvar) gwood

Kaplan-Meier at age 200:

sts list, at(200 201)

Note the argument at(200 201), where two time values, 200 and 201, are required, because without “201”, at(200) will tabulate the Kaplan-Meier estimator at 200 equidistant time points.

Estimate median survival:
The stci command produces median estimates along with confidence intervals:

stci, median by(group)

Logrank test:
The sts test command may be used to compare survival in two or more groups. The groups are defined by distinct values of the indepvar variable. The logrank option specifies that the logrank test (default) is to be used for the comparison:

sts test indepvar, logrank

Cox regression:
The stcox command is used to carry out analysis using the Cox regression model:

stcox indepvar1 indepvar2 ... indepvarN

This will report hazard ratio estimates. To produce estimates of regression coefficients the nohr option may be used.
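The commands of this section are typically used in sequence. The following sketch runs them on a small hypothetical dataset with two groups, in which all subjects enter at time 0. The data values are invented for illustration only and carry no substantive meaning.

```stata
* Hypothetical survival data: 10 subjects in two groups
clear
input time event group
 2 1 0
 4 1 0
 5 0 0
 7 1 0
 9 1 0
 3 1 1
 8 0 1
10 1 1
12 1 1
15 0 1
end

* Declare the survival variables once
stset time, failure(event)

* Kaplan-Meier estimates, as a table and as a plot, by group
sts list, by(group)
sts graph, by(group)

* Median survival with confidence intervals
stci, median by(group)

* Logrank test and Cox regression (hazard ratios, then coefficients)
sts test group, logrank
stcox group
stcox group, nohr
```

After stset, none of the later commands mentions time or event again; Stata takes them from the stset declaration.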


Increase memory size:
Sometimes the extra variables created by the stset command do not fit in the available memory. In this case see section 5 for commands to increase the memory size. Note that you will have to reload and re-stset the dataset after this operation.

23 Online facilities

Stata is web-aware in the sense that it offers commands that allow you to update and enhance your Stata version, if you are connected to the Internet. The most important commands are:

update:
Typing update will give an overview of when your Stata system was last updated. The command update query will check whether or not your Stata would benefit from an update. Finally you can execute the command update all to update both your ado-files and executable.

findit:
In up-to-date Stata 7.0 the command findit will search all relevant Internet sites for Stata material containing your search word. For example:

. findit smooth
13 Sep 2002 13:53:35

Keyword search
--------------

        Keywords:  smooth
          Search:  (1) Official help files, FAQs, and STBs
                   (2) Web resources from Stata and from other users

Search of official help files, FAQs, and STBs
---------------------------------------------

[R]     kdensity  . . . . . . . . . . . . Univariate kernel density estimation
        (help kdensity)

[R]     ksm . . . . . . . . . . . . . . . . . . . Smoothing including lowess
        (help ksm)

<...cut...>

Example . Applied Survival Analysis: Regression Modeling of Time to Event Data
        . . . . . . . . . . . . . . . . . . UCLA Academic Technology Services
        9/01    http://www.ats.ucla.edu/stat/books/asa/default.htm
                examples from the book Applied Survival Analysis:
                Regression Modeling of Time to Event Data
                by David W. Hosmer, Jr. and Stanley Lemeshow

<...cut...>

STB-53  sg128 . . . Some programs for growth estimation in fisheries biology
        . . . Salgado-Ugarte, Martinez-Ramirez, Gomez-Marquez, & Pena-Mendoza
        (help bevholt, fordwal, gullholt, gullplot, nlvbgf, ... if installed)
        1/00    pp.35--47; STB Reprints Vol 9, pp.278--293
                programs to estimate and plot the von Bertalanffy growth
                function

STB-41  gr27  . . . . . . . . . An adaptive variable span running line smoother
        (help autosmoo if installed)  . . . . . . . . . . . . . . . P. Sasieni
        1/98    pp.4--7; STB Reprints Vol 7, pp.63--68
                smooths yvar on xvar where the smooth is a running line fit
                with a variable span

<...cut...>


Web resources from Stata and other users
----------------------------------------

(contacting http://www.stata.com)

14 packages found (STB omitted)
-------------------------------

sthaz from http://www.sun.rhbnc.ac.uk/~uhss021/stata
    sthaz. Smoothed hazard (transition/failure) rate plots. / Program by
    Kenneth L. Simons. / Compute nonparametric estimates of smoothed hazard
    rates, and create graphs / of the results. The program also can compute
    and graph standard errors and / confidence bounds. The estimates use

hazplot from http://www.sun.rhbnc.ac.uk/~uhss021/stata
    hazplot. Smoothed hazard (transition/failure) rate plots. / Program by
    Kenneth L. Simons. / hazplot plots hazard rates or smoothed hazard rates.
    It works only on data in / panel form with integer time variables, and the
    data must have been stset / using the time0() option. For example, you

<...cut...>

6 references found in tables of contents
----------------------------------------

http://www.sun.rhbnc.ac.uk/~uhss021/stata/
    Materials by Kenneth L. Simons / Here are assorted utilities for Stata. /
    Check dummy (indicator) variables to ensure they are okay / Distance
    between latitude & longitude coordinates / Count data points in a
    geographic radius of each point / Create data points for extra geographic

http://www.stata.com/users/njc/
    Materials by Nicholas J. Cox, University of Durham / Nicholas J. Cox
    <[email protected]> is a geographer at the University / of Durham and
    a frequent contributor to Statalist. His areas of interest / include
    graphics, smoothing, probability distributions, circular statistics, / and

<...cut...>

(end of search)

First you see what is in the reference manual, on the Stata FAQ pages, and in the STB, where STB refers to the “Stata Technical Bulletin”, which is a journal where various enhancements (ado-files) are published with examples of their use. Next you get results from searching the web resources for user written resources.

Installation:
To install a specific package you found with findit just follow the blue clickable links.

24 How to find a statistical method

The following list should give you some hints as to where you can find specific statistical methods. Note that Stata offers many more methods than shown in this list. The list should only help you to find the corresponding Stata command. Hint: a lot of tables and simple calculations for epidemiologists are to be found under epitab.


Description                               Stata-command
----------------------------------------------------------------------------------
ANOVA                                     anova or oneway
$\chi^2$-test for contingency tables      tabulate var1 var2, chi, see also epitab
confidence intervals for                  ci or cii (immediate form)
    means
    proportions
    probabilities
    percentiles                           centile
contingency tables                        tabulate
correlation
    Spearman                              spearman var1 var2
    Pearson                               pwcorr [varlist] or correlate [varlist]
cumulative distribution function          cdf from STB
cox regression                            stcox indepvars
Fisher’s exact test                       tabulate var1 var2, exact, see also epitab
Friedman                                  friedman from STB, try search friedman
four fold table                           tabulate, see also epitab
interrater agreement test                 kappa var1 var2
Kaplan-Meier curves                       sts graph
kappa                                     kappa var1 var2
Kruskal-Wallis test                       kwallis
likelihood ratio test                     lrtest
linear regression                         regress depvar [varlist]
logistic regression                       logistic depvar [varlist]
log rank test                             sts test indepvar, logrank
Mann-Whitney two sample test              ranksum
mean, median, sd                          summarize [varlist]
                                          or table
meta analysis                             meta from STB, try search meta
McNemar test                              symmetry casevar controlvar
multiple linear regression                regress depvar [varlist]
OR (odds ratio)                           cc case-var ex-var
                                          or cci a b c d (immediate form)
percentiles                               table var1,c(p25 var2 ...) or centile
person years                              ir
relative frequencies                      tabulate
RR (relative risk)                        cs case-var ex-var
                                          or csi a b c d (immediate form)
risk ratio                                cs, csi or ir (for incidence data)
ROC curves                                roctab
                                          or rocfit from STB, try search roc
signtest                                  signtest
simple linear regression                  regress depvar [varlist]
t-test                                    ttest
trend tests                               nptrend
Wilcoxon matched-pairs signed-ranks test  signrank, see also signtest
Wilcoxon ranksum test                     ranksum
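The "immediate form" commands in the list take the numbers of a table directly as arguments instead of variable names, so no dataset is needed. As a small sketch with hypothetical counts (a = 10 exposed cases, b = 20 unexposed cases, c = 30 exposed noncases, d = 40 unexposed noncases):

```stata
* Odds ratio from a four fold table: (a*d)/(b*c) = 400/600
cci 10 20 30 40
display r(or)

* Risk ratio from the same counts: (10/40)/(20/60) = 0.75
csi 10 20 30 40
display r(rr)
```

This is convenient for checking a published table by hand; for data stored in variables use cc and cs instead.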
