Introduction to Stata – Lecture 3: Panel Data

Hayley Fisher

1 March 2010

Key reference: Cameron and Trivedi (2009), chapter 8.

1 Data used in this lecture

This lecture uses data from the first four waves of the British Household Panel Survey (BHPS). I will make the relevant source files temporarily available on my website but cannot host them there permanently. You can get the full set of files from the ESDS (http://www.esds.ac.uk/findingData/bhps.asp). If you want to learn more about the BHPS and how to use it in Stata, I recommend the BHPS introductory courses provided by the UK Longitudinal Studies Centre (ULSC) at the University of Essex – details and course materials are available at http://www.iser.essex.ac.uk/survey/bhps/courses. These notes have been loosely based on parts of their course.

2 The British Household Panel Survey

The BHPS began in 1991 and has interviewed its initial sample, and additional household members, every year since then. 5,500 households were selected initially, with additional samples of Scotland, Wales and Northern Ireland added since. Currently 17 waves of the survey are available. For a full description of the survey see Taylor, Brice, Buck and Prentice-Lane (2009).

Longitudinal datasets such as the BHPS are rarely provided in a format that is straightforward to read into Stata and start working with. The BHPS is available in a number of formats including Stata, but as a series of files containing different variables, split by year and by part of the survey to give manageable file sizes. I am using the ‘individual response’ and ‘household response’ files from the first four waves of the survey. A substantial part of this lecture will be devoted to putting together a panel from these datasets.

2.1 Assembling a cross section using individual and household data

Start your do-file to assemble your dataset by defining a macro for the folder in which the original datafiles are stored. This makes it easy to alter the folder referenced if necessary in future. Here I use a global macro:

global dir BHPS

To recall a global macro we prefix it with a $ sign – so here $dir.

We could simply read in the entire first file in question (aindresp), but this is a large dataset with many variables. Instead, we can load in just specific variables. We need to look at the codebook accompanying the dataset to choose the variables (alternatively look at the online codebooks at http://www.iser.essex.ac.uk/survey/bhps/documentation/volume-b-codebooks). We read in the specific variables by typing:

use ahid apno asex aage pid amastat ahgspn aqfachi afiyr afiyrl using $dir/aindresp

Note that all variables except pid have the prefix a. This is a convention in the BHPS data files – all files and variables associated with wave 1 have the prefix a, for wave 2 it is b and so on. Let’s describe the data to see what has been loaded here.

. describe

Contains data from BHPS/aindresp.dta

obs: 10,264

vars: 10

size: 348,976 (99.9% of memory free)

----------------------------------------------------------------------------

storage display value

variable name type format label variable label

----------------------------------------------------------------------------

ahid long %12.0g household identification number

apno byte %8.0g person number

asex byte %8.0g asex sex

pid long %12.0g cross-wave person identifier

amastat byte %8.0g amastat marital status

ahgspn byte %8.0g ahgspn pno of spouse/partner

aage byte %8.0g aage age at date of interview

aqfachi byte %8.0g aqfachi highest academic qualification

afiyrl double %10.0g afiyrl annual labour income (1.9.90-1.9.91)

afiyr double %10.0g afiyr annual income (1.9.90-1.9.91)

----------------------------------------------------------------------------

Sorted by:

Three variables here are vital for the construction of our panel dataset. ahid is a household identification number which we will use to match data from the household file, and apno is a person identification number within a given household. This can be used in combination with, for example, ahgspn to match couples together. pid is a cross-wave person identifier – it has no a prefix since it matches the same variable in all waves – and this connects people over time. We also have data on individuals’ sex, age, academic qualifications, labour income and total income.
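The pattern used above, a folder stored in a global macro followed by a selective load, can be tried without the BHPS files. The following is a minimal sketch only: it assumes the auto dataset that ships with Stata, and auto_copy is a hypothetical file name written to the current working directory. Neither is part of the BHPS workflow.

```stata
* Sketch only: auto.dta ships with Stata; auto_copy is a hypothetical file name.
clear all
global dir "."                      // folder holding the data; here the working directory
sysuse auto, clear
save "$dir/auto_copy", replace      // stand-in for a large source file

* Load just three variables from the saved file, as with aindresp above.
use make price mpg using "$dir/auto_copy", clear
describe
```

Changing the single global line is then enough to point every later command at a different folder.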

We are going to merge in data from the household file, so we need to sort the individual data by the household identification number and save it.

. sort ahid

. save aind, replace

Then we load data from the household response file – ahhresp.dta.

use ahid atenure ahhsize ankids afihhyr using $dir/ahhresp

. describe

Contains data from BHPS/ahhresp.dta

obs: 5,511

vars: 5

size: 104,709 (99.9% of memory free)

--------------------------------------------------------------------------------------

storage display value

variable name type format label variable label

--------------------------------------------------------------------------------------

ahid long %12.0g household identification number

ahhsize byte %8.0g ahhsize number of persons in household

ankids byte %8.0g ankids number of children in household

atenure byte %8.0g atenure housing tenure

afihhyr double %10.0g afihhyr annual household income (1.9.90-1.9.91)

--------------------------------------------------------------------------------------

Sorted by:

This gives household ID numbers and data on household size, number of children, how housing is owned and total household income. We see there are 5,511 observations (as opposed to 10,264 from the individual dataset). After sorting by household ID we can merge the two datasets together:

. merge ahid using aind

variable ahid does not uniquely identify observations in aind.dta

. tabulate _merge

_merge | Freq. Percent Cum.

------------+-----------------------------------

1 | 6 0.06 0.06

3 | 10,264 99.94 100.00

------------+-----------------------------------

Total | 10,270 100.00

. keep if _merge==3

(6 observations deleted)

. drop _merge

We see that there are 6 observations for which there is no individual data, just household data. I drop these.
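The merge syntax shown above is the form used up to Stata 10. From Stata 11 onwards the type of match is stated explicitly (here 1:m, one household record to many individuals) and the datasets do not need to be sorted first. A minimal sketch with made-up data follows; the variables hid, pno, age and hhsize and all their values are hypothetical.

```stata
* Sketch only: toy individual-level file (several people per household).
clear
input long hid byte pno byte age
1001 1 34
1001 2 36
1002 1 51
end
tempfile ind
save `ind'

* Toy household-level file; household 1003 has no individual records.
clear
input long hid byte hhsize
1001 2
1002 1
1003 4
end

* One household row matches many individual rows.
merge 1:m hid using `ind'
tabulate _merge
keep if _merge == 3     // keep matched records only, as in the text
drop _merge
```

Stating 1:m makes Stata stop with an error if hid turns out not to be unique in the household file, which is a useful check on the data.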

Another useful command for describing data is codebook. This produces a codebook based on what is in Stata. I have extracted the codebook from my log file and posted it on my website for reference.

Having saved the dataset, we can do some analysis with this cross section. First, recode the missing values. Page A3-14 of Taylor et al. (2009) outlines the way missing values are handled in these datafiles – any negative values are in fact missing. We can recode these easily together using mvdecode:

. mvdecode _all, mv(-9/-1)

atenure: 17 missing values generated

ahgspn: 1 missing value generated

aqfachi: 371 missing values generated

afiyrl: 352 missing values generated

afiyr: 352 missing values generated
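The effect of mvdecode can be checked on a toy dataset. This is a sketch with hypothetical variables (tenure, income) and made-up values, using the same negative-code convention as the BHPS files.

```stata
* Sketch only: negative codes stand for different kinds of missing data.
clear
input byte tenure double income
 1 12000
-9  8000
 3    -1
-7    -8
end

* Turn every value from -9 to -1 into system missing (.).
mvdecode _all, mv(-9/-1)
list
count if missing(tenure)
```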

In the mvdecode command above, _all can be replaced with a list of variables.

Having created the necessary variables, simple cross section regressions can be performed. Here I show that the xi prefix can be used to create interaction terms as well as a series of category dummies.

. generate married=amastat==1

. generate age2=aage^2

. generate lwages=log(afiyrl)

(3937 missing values generated)

. xi: regress lwages aage age2 i.asex*i.married ankids i.aqfachi, vce(robust)

i.asex _Iasex_1-2 (naturally coded; _Iasex_1 omitted)

i.married _Imarried_0-1 (naturally coded; _Imarried_0 omitted)

i.asex*i.marr~d _IaseXmar_#_# (coded as above)

i.aqfachi _Iaqfachi_1-7 (naturally coded; _Iaqfachi_1 omitted)

Linear regression Number of obs = 6318

F( 12, 6305) = 246.80

Prob > F = 0.0000

R-squared = 0.3383

Root MSE = .92103

------------------------------------------------------------------------------

| Robust

lwages | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

aage | .1932962 .0072313 26.73 0.000 .1791205 .2074719

age2 | -.0023262 .0000903 -25.75 0.000 -.0025033 -.0021491

_Iasex_2 | -.3133317 .0411562 -7.61 0.000 -.3940119 -.2326515

_Imarried_1 | .4771106 .0390765 12.21 0.000 .4005075 .5537138

_IaseXmar_~1 | -.7551111 .0507519 -14.88 0.000 -.854602 -.6556201

ankids | -.2082833 .0148482 -14.03 0.000 -.2373909 -.1791756

_Iaqfachi_2 | -.1241103 .0967663 -1.28 0.200 -.3138051 .0655845

_Iaqfachi_3 | -.1621818 .0964851 -1.68 0.093 -.3513254 .0269619

_Iaqfachi_4 | -.3929866 .0912033 -4.31 0.000 -.5717762 -.2141971

_Iaqfachi_5 | -.5157779 .0899938 -5.73 0.000 -.6921963 -.3393595

_Iaqfachi_6 | -.5263538 .1001379 -5.26 0.000 -.7226582 -.3300494

_Iaqfachi_7 | -.7906337 .0903131 -8.75 0.000 -.967678 -.6135893

_cons | 5.916767 .1677032 35.28 0.000 5.588011 6.245522

------------------------------------------------------------------------------

Here _Iasex_2 gives the effect of being female for unmarried individuals, _Imarried_1 the effect of being married for men, and _IaseXmar_~1 the difference in the effect of being married on wages between women and men.
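From Stata 11 onwards the same kind of model can be written with factor-variable notation instead of the xi prefix: a##b expands to both main effects and their interaction. The sketch below assumes the nlsw88 dataset that ships with Stata rather than the BHPS. Because that sample contains women only, it interacts married with collgrad instead of sex with married.

```stata
* Sketch only: nlsw88.dta ships with Stata; it is not the BHPS sample.
sysuse nlsw88, clear
generate lwage = ln(wage)
generate age2  = age^2

* i.married##i.collgrad = main effects of both plus their interaction.
regress lwage age age2 i.married##i.collgrad, vce(robust)

* Effect of marriage for college graduates = main effect + interaction.
lincom 1.married + 1.married#1.collgrad
```

The lincom line reflects the interpretation given above: the main effect applies to the base group, and the interaction coefficient is the difference between groups.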

2.2 Assembling a two wave panel

Panel data can be in two formats – long or wide. Wide data stores each variable separately for each wave, so only has one observation for each individual:

PID   inc1   inc2   inc3   inc4
1     200    210    220    250
2     600    660    700    750
3     250    280    200    210
4     150    190    250    300

Long data stores all observations of a variable, for example income, in the same variable, and has a wave variable and multiple observations for each individual:

PID   wave   inc
1     1      200
1     2      210
1     3      220
1     4      250
2     1      600
2     2      660
2     3      700
2     4      750

To use the panel data features in Stata you need to have your data in long format. If your data is in wide format you can reconfigure it using the reshape command.
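As an illustration of reshape, the wide table shown earlier can be typed in and converted. This is a sketch; the lowercase variable name pid is just a choice made for the example.

```stata
* Sketch only: the wide example table, one row per person.
clear
input pid inc1 inc2 inc3 inc4
1 200 210 220 250
2 600 660 700 750
3 250 280 200 210
4 150 190 250 300
end

* inc is the stub; the numeric suffix becomes the new variable wave.
reshape long inc, i(pid) j(wave)
list, sepby(pid) noobs

* The reverse operation restores one row per person.
reshape wide inc, i(pid) j(wave)
```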

To put together a two wave panel in long format we need to extract the same data for wave 2 – with prefix b. This is done as above:

. use bhid bpno bsex bage pid bmastat bhgspn bqfachi bfiyr bfiyrl using $dir/bindresp

. sort bhid

. save bind, replace

file bind.dta saved

. use bhid btenure bhhsize bnkids bfihhyr using $dir/bhhresp

. sort bhid

. merge bhid using bind

variable bhid does not uniquely identify observations in bind.dta

. keep if _merge==3

(2 observations deleted)

. drop _merge

. save wave2, replace

file wave2.dta saved

There are two further steps to apply to both the wave 1 and wave 2 datasets constructed here: adding a wave variable (using generate), and removing the wave prefix, which is done using the command renpfix.

. generate wave=2

. renpfix b

To combine the two files we use the append command.¹

. use wave1

. generate wave=1

. renpfix a

. append using wave2

Listing the first 20 observations shows that we now have a long panel dataset with two time periods.

. sort pid wave

. list pid wave hid sex age mastat in 1/20, clean

pid wave hid sex age mastat

1. 10002251 1 1000209 female 91 never ma

2. 10004491 1 1000381 male 28 never ma

3. 10004491 2 2000148 male 29 never ma

4. 10004521 1 1000381 male 26 never ma

5. 10004521 2 2000148 male 27 never ma

6. 10007857 1 1000667 female 57 widowed

7. 10007857 2 2000296 female 59 widowed

8. 10014578 1 1001221 female 54 married

9. 10014578 2 2000369 female 55 married

10. 10014608 1 1001221 male 57 married

11. 10014608 2 2000369 male 58 married

12. 10016813 1 1001418 male 36 married

13. 10016813 2 2000504 male 37 married

14. 10016848 1 1001418 female 32 married

15. 10016848 2 2000504 female 33 married

16. 10017933 1 1001507 female 49 married

17. 10017933 2 2000717 female 49 married

18. 10017968 1 1001507 male 46 married

19. 10017968 2 2000717 male 46 married

20. 10019057 1 1001604 female 59 never ma
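The prefix-removal and append steps can be sketched on toy data. One assumption to flag: in Stata 12 and later, renpfix is superseded by the wildcard form of rename, which is used here. The variables and values are hypothetical.

```stata
* Sketch only: two tiny "waves" with prefixed variable names.
clear
input long pid byte aage
1 30
2 45
end
rename a* *                 // strip the wave-1 prefix: aage -> age
generate byte wave = 1
tempfile w1
save `w1'

clear
input long pid byte bage
1 31
3 52
end
rename b* *                 // strip the wave-2 prefix: bage -> age
generate byte wave = 2

* Stack the waves: same variable names, more observations.
append using `w1'
sort pid wave
list, sepby(pid) noobs
```

Person 1 appears in both waves, while persons 2 and 3 appear once each, so the result is already an unbalanced long panel.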

2.3 Creating a longer panel

When performing the same operations on several waves of data, we can write do-files more efficiently using the foreach and forvalues loop commands. Here I use foreach to perform the same commands, substituting just the wave prefix each time.² The command to extract all of the data is shown below.

foreach w in a b c d {
    * individual response file for wave `w'
    use `w'hid `w'pno `w'sex `w'age pid `w'mastat `w'hgspn `w'qfachi `w'fiyr `w'fiyrl ///
        using $dir/`w'indresp
    sort `w'hid
    save `w'ind, replace
    clear
    * household response file, merged with the individual data
    use `w'hid `w'tenure `w'hhsize `w'nkids `w'fihhyr using $dir/`w'hhresp
    sort `w'hid
    merge `w'hid using `w'ind
    keep if _merge==3
    drop _merge
    * remove the wave prefix and record the wave number
    renpfix `w'
    generate wave = index("abcd","`w'")
    save wave`w', replace
}

We start the foreach command by naming the local macro that is replaced in each iteration of the loop (in this case w) and giving the list of values to substitute (a, b, c, d for the first 4 waves). We then write out the code, substituting `w' where the prefix would normally be. This reproduces the steps we went through above. The new function used here is index – this returns the position of `w' in the string abcd and so generates the wave variable.

¹ To create a wide panel we would not remove the prefixes and would use merge to combine the datasets.

² The forvalues command would be used when you want to loop over numbers rather than letters or variables.
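The loop mechanics can be tried without any data. The sketch below prints the wave number implied by each prefix and then does the same in the other direction with forvalues. It uses strpos, the current name for the function that older releases call index; both return the position of one string within another.

```stata
* Sketch only: letters to wave numbers with foreach.
foreach w in a b c d {
    display "prefix `w' -> wave " strpos("abcd", "`w'")
}

* Wave numbers to letters with forvalues.
forvalues t = 1/4 {
    local w = substr("abcd", `t', 1)
    display "wave `t' uses prefix `w'"
}
```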

After this code has been run, the wave d data is still in memory, so we can use a similar loop to append the other three files, and then delete the files created in the process:

foreach w in a b c {
    append using wave`w'
}
compress
save BHPS, replace
foreach w in a b c d {
    capture erase wave`w'.dta
    capture erase `w'ind.dta
}

Note that compress ensures that the data is being stored as efficiently as possible. Sorting and listing the dataset produced shows:

. sort pid wave

. list pid wave hid sex age mastat in 1/20, clean

pid wave hid sex age mastat

1. 10002251 1 1000209 female 91 never ma

2. 10004491 1 1000381 male 28 never ma

3. 10004491 2 2000148 male 29 never ma

4. 10004521 1 1000381 male 26 never ma

5. 10004521 2 2000148 male 27 never ma

6. 10004521 3 3000192 male 28 never ma

7. 10007857 1 1000667 female 57 widowed

8. 10007857 2 2000296 female 59 widowed

9. 10007857 3 3000257 female 59 widowed

10. 10014578 1 1001221 female 54 married

11. 10014578 2 2000369 female 55 married

12. 10014578 3 3000389 female 56 married

13. 10014608 1 1001221 male 57 married

14. 10014608 2 2000369 male 58 married

15. 10014608 3 3000389 male 59 married

16. 10016813 1 1001418 male 36 married

17. 10016813 2 2000504 male 37 married

18. 10016813 3 3000508 male 37 married

19. 10016813 4 4000307 male 39 married

20. 10016848 1 1001418 female 32 married

This shows a panel in long format. Note that age mostly increases by one year between waves for each individual (the age variable here is age at interview date, and the interview date can vary), whilst sex is constant.

In order to perform analysis exploiting the panel dimension of the dataset we must declare the data to be a panel – we do this using xtset, declaring the panel variable (here pid) and time variable (here wave). Note that the panel variable and time variable must together uniquely identify every observation in the dataset.

. xtset pid wave

panel variable: pid (unbalanced)

time variable: wave, 1 to 4, but with gaps

delta: 1 unit

This shows that we have an unbalanced panel – as seen in the list above we do not have an observation for every person in every time period. Some useful commands to investigate panel data are xtdescribe, xtsum and xttrans:

. xtdescribe

pid: 10002251, 10004491, ..., 47737689 n = 12350

wave: 1, 2, ..., 4 T = 4

Delta(wave) = 1 unit

Span(wave) = 4 periods

(pid*wave uniquely identifies each observation)

Distribution of T_i: min 5% 25% 50% 75% 95% max

1 1 2 4 4 4 4

Freq. Percent Cum. | Pattern

---------------------------+---------

7643 61.89 61.89 | 1111

1009 8.17 70.06 | 1...

679 5.50 75.55 | 11..

596 4.83 80.38 | ...1

527 4.27 84.65 | 111.

458 3.71 88.36 | .111

418 3.38 91.74 | ..11

290 2.35 94.09 | .1..

197 1.60 95.68 | ..1.

533 4.32 100.00 | (other patterns)

---------------------------+---------

12350 100.00 | XXXX

xtdescribe gives information about the panel structure – we see that there are 12,350 individuals and 4 time periods, and that 62% of people have observations in all four time periods.
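The requirement that the panel and time variables jointly identify each observation can be checked with isid before calling xtset. A minimal sketch on a made-up unbalanced panel (all values are hypothetical):

```stata
* Sketch only: five observations on three people, with gaps.
clear
input long pid byte wave double lwages
1 1 9.1
1 2 9.2
2 1 8.7
2 3 8.9
3 2 9.5
end

* isid stops with an error if (pid, wave) is not unique.
isid pid wave

xtset pid wave
xtdescribe
```

If isid fails, duplicates report pid wave is one way to find the offending records before declaring the panel.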

. xtsum sex age nkids fiyr lwages mastat

Variable | Mean Std. Dev. Min Max | Observations

-----------------+--------------------------------------------+----------------

sex overall | 1.531028 .4990427 1 2 | N = 39190

between | .4995023 1 2 | n = 12350

within | 0 1.531028 1.531028 | T-bar = 3.17328

| |

age overall | 44.00559 18.43386 15 97 | N = 39190

between | 18.93358 15 96.5 | n = 12350

within | 1.044435 32.33892 50.33892 | T-bar = 3.17328

| |

nkids overall | .5951008 .9762426 0 9 | N = 39190

between | .9375915 0 9 | n = 12350

within | .2487689 -3.154899 3.595101 | T-bar = 3.17328

| |

fiyr overall | 8939.016 8929.79 0 287481.8 | N = 37455

between | 8186.321 0 160891.5 | n = 11982

within | 3631.936 -70011.81 204307.5 | T-bar = 3.12594

| |

lwages overall | 8.849195 1.166257 -3.321733 12.56891 | N = 23609

between | 1.206297 -3.321733 11.61299 | n = 8323

within | .4933544 1.250891 14.18753 | T-bar = 2.8366

| |

mastat overall | 2.497844 2.031626 0 6 | N = 39185

between | 2.043197 0 6 | n = 12350

within | .5396737 -2.002156 6.247844 | T-bar = 3.17287

xtsum gives summary statistics and shows variation between individuals and within individuals – so we see that sex does not vary within individuals, and that log wages vary more between individuals than within individuals. We also see the total number of observations (N), number of individuals with observations (n) and average number of time periods for each individual (T-bar).
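The within rows can look odd at first (for example, a negative minimum for nkids). They describe each observation's deviation from that individual's own mean, with the overall mean added back. A sketch on toy data with hypothetical values computes this by hand, so that the result can be compared with the within row that xtsum prints:

```stata
* Sketch only: two people, three waves each.
clear
input long pid byte wave double inc
1 1 200
1 2 210
1 3 220
2 1 600
2 2 660
2 3 700
end
xtset pid wave
xtsum inc

* Within value = x_it - (individual mean) + (overall mean).
egen double inc_i = mean(inc), by(pid)
summarize inc, meanonly
generate double inc_within = inc - inc_i + r(mean)
summarize inc_within
```

Because the individual means are removed, a variable that never changes for a person, such as sex, has zero within variation.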

. xttrans married, freq

| Married=1

Married=1 | 0 1 | Total

-----------+----------------------+----------

0 | 10,669 492 | 11,161

| 95.59 4.41 | 100.00

-----------+----------------------+----------

1 | 367 15,312 | 15,679

| 2.34 97.66 | 100.00

-----------+----------------------+----------

Total | 11,036 15,804 | 26,840

| 41.12 58.88 | 100.00

xttrans gives an indication of whether there are transitions between groups for categorical variables – for example, we see here that 96% of unmarried individuals remain unmarried in the next period, and 98% of married individuals remain married in the next period.

3 Regression using panel data

Having set up our dataset we can perform some regressions. As previously, I use a local macro to store my list of independent variables:

local xlist "age age2 i.sex*i.married nkids i.qfachi"

We can estimate a pooled OLS regression using the regress command seen in the last lecture. We should use robust standard errors clustered by individual.

. xi: regress lwages `xlist', vce(cluster pid)

i.sex _Isex_1-2 (naturally coded; _Isex_1 omitted)

i.married _Imarried_0-1 (naturally coded; _Imarried_0 omitted)

i.sex*i.married _IsexXmar_#_# (coded as above)

i.qfachi _Iqfachi_1-7 (naturally coded; _Iqfachi_1 omitted)

Linear regression Number of obs = 23520

F( 12, 8285) = 409.27

Prob > F = 0.0000

R-squared = 0.3092

Root MSE = .96854

(Std. Err. adjusted for 8286 clusters in pid)

------------------------------------------------------------------------------

| Robust

lwages | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age | .1890312 .0055574 34.01 0.000 .1781373 .1999251

age2 | -.0022714 .0000708 -32.10 0.000 -.0024101 -.0021327

_Isex_2 | -.3258962 .0300076 -10.86 0.000 -.3847186 -.2670737

_Imarried_1 | .488806 .0286158 17.08 0.000 .4327118 .5449002

_IsexXmar_~1 | -.654709 .0376179 -17.40 0.000 -.7284494 -.5809685

nkids | -.2203951 .0115555 -19.07 0.000 -.2430467 -.1977435

_Iqfachi_2 | -.1272603 .077578 -1.64 0.101 -.2793325 .024812

_Iqfachi_3 | -.1781531 .0785097 -2.27 0.023 -.3320517 -.0242544

_Iqfachi_4 | -.4158935 .0734584 -5.66 0.000 -.5598903 -.2718966

_Iqfachi_5 | -.5313831 .0726375 -7.32 0.000 -.6737708 -.3889953

_Iqfachi_6 | -.617404 .080617 -7.66 0.000 -.7754335 -.4593746

_Iqfachi_7 | -.8375914 .073551 -11.39 0.000 -.9817697 -.693413

_cons | 6.046451 .1298038 46.58 0.000 5.792003 6.300899

------------------------------------------------------------------------------

Fixed and random effects regressions are both carried out using the xtreg command. In both cases we should again get cluster robust standard errors. The default is for Stata to estimate random effects when xtreg is used – you must specify the option “fe” to get fixed effects:

. xi: xtreg lwages `xlist', fe vce(cluster pid)

i.sex _Isex_1-2 (naturally coded; _Isex_1 omitted)

i.married _Imarried_0-1 (naturally coded; _Imarried_0 omitted)

i.sex*i.married _IsexXmar_#_# (coded as above)

i.qfachi _Iqfachi_1-7 (naturally coded; _Iqfachi_1 omitted)

Fixed-effects (within) regression Number of obs = 23520

Group variable: pid Number of groups = 8286

R-sq: within = 0.0451 Obs per group: min = 1

between = 0.1672 avg = 2.8

overall = 0.1293 max = 4

F(11,8285) = 38.35

corr(u_i, Xb) = -0.3782 Prob > F = 0.0000

(Std. Err. adjusted for 8286 clusters in pid)

------------------------------------------------------------------------------

| Robust

lwages | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age | .2902458 .0173775 16.70 0.000 .2561815 .3243102

age2 | -.0030951 .0002108 -14.68 0.000 -.0035083 -.0026819

_Isex_2 | (dropped)

_Imarried_1 | .0473639 .0421936 1.12 0.262 -.0353461 .1300739

_IsexXmar_~1 | -.0574721 .0682978 -0.84 0.400 -.1913529 .0764086

nkids | -.1347352 .0193899 -6.95 0.000 -.1727443 -.0967261

_Iqfachi_2 | .0239754 .2786775 0.09 0.931 -.5223023 .5702531

_Iqfachi_3 | -.2838254 .2629136 -1.08 0.280 -.7992019 .231551

_Iqfachi_4 | -.1524719 .2703457 -0.56 0.573 -.6824172 .3774734

_Iqfachi_5 | -.4847609 .2739852 -1.77 0.077 -1.021841 .0523187

_Iqfachi_6 | -.4576768 .3040223 -1.51 0.132 -1.053637 .138283

_Iqfachi_7 | -.3651604 .3077017 -1.19 0.235 -.9683329 .238012

_cons | 3.198992 .44452 7.20 0.000 2.327621 4.070362

-------------+----------------------------------------------------------------

sigma_u | 1.1636036

sigma_e | .59940523

rho | .79029066 (fraction of variance due to u_i)

------------------------------------------------------------------------------

Here sex is dropped – this is because it is invariant over time, and fixed effects estimation uses only variation within individuals. The lack of within-individual variation in qualifications and marital status explains the imprecise coefficient estimates here.
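To see why time-invariant variables drop out, note that the fixed effects slope is the same as the OLS slope after subtracting each individual's mean from every variable. The sketch below checks this on simulated data; the data-generating process, the variable names and the seed are all assumptions made for the illustration.

```stata
* Sketch only: simulated panel, 50 individuals x 4 periods.
clear
set seed 12345
set obs 200
generate long id = ceil(_n/4)
bysort id: generate byte t = _n
generate double u = rnormal() if t == 1
bysort id (t): replace u = u[1]           // individual effect, constant over t
generate double x = rnormal() + u          // regressor correlated with u
generate double y = 1 + 2*x + u + rnormal()

xtset id t
xtreg y x, fe
scalar b_fe = _b[x]

* Same slope from OLS on individual-demeaned data.
egen double ybar = mean(y), by(id)
egen double xbar = mean(x), by(id)
generate double yd = y - ybar
generate double xd = x - xbar
regress yd xd, noconstant
display "FE slope: " b_fe "   demeaned OLS slope: " _b[xd]
```

The coefficients agree, but the standard errors from the second regression do not account for the estimated individual means, so xtreg should be used in practice. A variable like sex would be identically zero after demeaning, which is why it is dropped.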

. xi: xtreg lwages `xlist', re vce(cluster pid)

i.sex _Isex_1-2 (naturally coded; _Isex_1 omitted)

i.married _Imarried_0-1 (naturally coded; _Imarried_0 omitted)

i.sex*i.married _IsexXmar_#_# (coded as above)

i.qfachi _Iqfachi_1-7 (naturally coded; _Iqfachi_1 omitted)

Random-effects GLS regression Number of obs = 23520

Group variable: pid Number of groups = 8286

R-sq: within = 0.0341 Obs per group: min = 1

between = 0.3457 avg = 2.8

overall = 0.3053 max = 4

Random effects u_i ~ Gaussian Wald chi2(12) = 4356.47

corr(u_i, X) = 0 (assumed) Prob > chi2 = 0.0000

(Std. Err. adjusted for 8286 clusters in pid)

------------------------------------------------------------------------------

| Robust

lwages | Coef. Std. Err. z P>|z| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age | .2133805 .0058073 36.74 0.000 .2019984 .2247626

age2 | -.0025222 .0000737 -34.22 0.000 -.0026667 -.0023778

_Isex_2 | -.4601064 .0312502 -14.72 0.000 -.5213557 -.3988571

_Imarried_1 | .341099 .0275778 12.37 0.000 .2870475 .3951504

_IsexXmar_~1 | -.4701444 .0385164 -12.21 0.000 -.5456351 -.3946536

nkids | -.201973 .0116565 -17.33 0.000 -.2248194 -.1791267

_Iqfachi_2 | -.0940869 .0991879 -0.95 0.343 -.2884916 .1003178

_Iqfachi_3 | -.1635236 .0960624 -1.70 0.089 -.3518024 .0247552

_Iqfachi_4 | -.3019628 .0911546 -3.31 0.001 -.4806225 -.1233031

_Iqfachi_5 | -.4841929 .0903848 -5.36 0.000 -.6613438 -.3070421

_Iqfachi_6 | -.524829 .0975616 -5.38 0.000 -.7160463 -.3336117

_Iqfachi_7 | -.785563 .0915491 -8.58 0.000 -.9649959 -.6061301

_cons | 5.485352 .1464953 37.44 0.000 5.198227 5.772478

-------------+----------------------------------------------------------------

sigma_u | .87706892

sigma_e | .59940523

rho | .68163491 (fraction of variance due to u_i)

------------------------------------------------------------------------------

Stata reports the estimated standard deviations of the error components as sigma_u and sigma_e. We also see separate R-squared statistics for the within and between variation. These can be tabulated if the estimates have been stored.

. esttab POLS FE RE, b se stats(r2 r2_o r2_b r2_w)

------------------------------------------------------------

(1) (2) (3)

lwages lwages lwages

------------------------------------------------------------

age 0.189*** 0.290*** 0.213***

(0.00556) (0.0174) (0.00581)

age2 -0.00227*** -0.00310*** -0.00252***

(0.0000708) (0.000211) (0.0000737)

_Isex_2 -0.326*** 0 -0.460***

(0.0300) (0) (0.0313)

_Imarried_1 0.489*** 0.0474 0.341***

(0.0286) (0.0422) (0.0276)

_IsexXmar_~1 -0.655*** -0.0575 -0.470***

(0.0376) (0.0683) (0.0385)

nkids -0.220*** -0.135*** -0.202***

(0.0116) (0.0194) (0.0117)

_Iqfachi_2 -0.127 0.0240 -0.0941

(0.0776) (0.279) (0.0992)

_Iqfachi_3 -0.178* -0.284 -0.164

(0.0785) (0.263) (0.0961)

_Iqfachi_4 -0.416*** -0.152 -0.302***

(0.0735) (0.270) (0.0912)

_Iqfachi_5 -0.531*** -0.485 -0.484***

(0.0726) (0.274) (0.0904)

_Iqfachi_6 -0.617*** -0.458 -0.525***

(0.0806) (0.304) (0.0976)

_Iqfachi_7 -0.838*** -0.365 -0.786***

(0.0736) (0.308) (0.0915)

_cons 6.046*** 3.199*** 5.485***

(0.130) (0.445) (0.146)

------------------------------------------------------------

r2 0.309 0.0451

r2_o 0.129 0.305

r2_b 0.167 0.346

r2_w 0.0451 0.0341

------------------------------------------------------------

Standard errors in parentheses

* p<0.05, ** p<0.01, *** p<0.001
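A table like the one above needs each set of results to be saved with estimates store straight after estimation. Note also that esttab is not built in; it is part of the user-written estout package, which can be installed with ssc install estout. The sketch below shows the storing step and uses the built-in estimates table command instead, on the auto dataset that ships with Stata. The model names M1 and M2 are hypothetical.

```stata
* Sketch only: store two sets of results and tabulate them side by side.
sysuse auto, clear

regress price mpg
estimates store M1

regress price mpg weight
estimates store M2

* Built-in alternative to esttab: coefficients, standard errors, N and R-squared.
estimates table M1 M2, b(%9.3f) se stats(N r2)
```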

Estimation can also be easily implemented in first differences using the regress command and the difference operator “D.”. We do not need to generate variables in first differences. The option noconstant is used so that Stata does not add a constant term (which would be differenced out). For example:

. regress D.(lwage age age2 female married nkids), vce(cluster pid) noconstant

Linear regression Number of obs = 14667

F( 4, 6146) = 65.00

Prob > F = 0.0000

R-squared = 0.0208

Root MSE = .73681

(Std. Err. adjusted for 6147 clusters in pid)

------------------------------------------------------------------------------

| Robust

D.lwages | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

age |

D1. | .3179554 .0211961 15.00 0.000 .2764036 .3595071

age2 |

D1. | -.0034294 .0002499 -13.72 0.000 -.0039192 -.0029395

female |

D1. | (dropped)

married |

D1. | .0144043 .0390605 0.37 0.712 -.062168 .0909767

nkids |

D1. | -.0957519 .0221907 -4.31 0.000 -.1392534 -.0522504

------------------------------------------------------------------------------
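The D. operator relies on the xtset declaration, so it differences within individuals and returns missing wherever the previous wave is absent. A sketch on toy data with hypothetical values makes this visible:

```stata
* Sketch only: person 1 has no wave 3 record.
clear
input long pid byte wave double y
1 1 10
1 2 12
1 4 15
2 1  5
2 2  4
end
xtset pid wave

* D.y = y - L.y; missing for first waves and after a gap.
generate double dy = D.y
list, sepby(pid) noobs
```

Only the wave 2 observations have a non-missing difference here. This is why the first-difference regression above uses fewer observations than the fixed effects regression.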

3.1 Hausman test

Stata can easily perform a Hausman test – that is, a test of whether the individual effects can be treated as random (uncorrelated with the regressors). The null hypothesis is that both the fixed and random effects estimators are consistent; the alternative hypothesis is that the random effects estimator is not consistent. We must first estimate the fixed and random effects models – and without robust standard errors. Then, the Hausman test is conducted using the hausman command.

. quietly xi: xtreg lwages `xlist', fe

. estimates store FE1

. quietly xi: xtreg lwages `xlist', re

. estimates store RE1

. hausman FE1 RE1, sigmamore

---- Coefficients ----

| (b) (B) (b-B) sqrt(diag(V_b-V_B))

| FE1 RE1 Difference S.E.

-------------+----------------------------------------------------------------

age | .2902458 .2133805 .0768653 .0125061

age2 | -.0030951 -.0025222 -.0005729 .0001535

_Imarried_1 | .0473639 .341099 -.293735 .0336022

_IsexXmar_~1 | -.0574721 -.4701444 .4126723 .0470245

nkids | -.1347352 -.201973 .0672378 .0120169

_Iqfachi_2 | .0239754 -.0940869 .1180623 .1387245

_Iqfachi_3 | -.2838254 -.1635236 -.1203018 .1676963

_Iqfachi_4 | -.1524719 -.3019628 .1494909 .1630687

_Iqfachi_5 | -.4847609 -.4841929 -.000568 .1697845

_Iqfachi_6 | -.4576768 -.524829 .0671522 .2006431

_Iqfachi_7 | -.3651604 -.785563 .4204026 .1869317

------------------------------------------------------------------------------

b = consistent under Ho and Ha; obtained from xtreg

B = inconsistent under Ha, efficient under Ho; obtained from xtreg

Test: Ho: difference in coefficients not systematic

chi2(11) = (b-B)’[(V_b-V_B)^(-1)](b-B)

= 238.04

Prob>chi2 = 0.0000

Cameron and Trivedi (2009) recommend using the sigmamore option. Here we see the null hypothesis is clearly rejected (the reported p-value is 0.0000), so we conclude that the random effects estimates are not consistent and the fixed effects estimates should be preferred.

4 Creating variables to identify changes in variables

We may wish to create a variable which records whether a certain status has changed, for example whether marital status has changed. Once data is declared to be a panel this is straightforward. Let’s first recode mastat so that it has just three categories:

recode mastat (0=.) (1/2=1) (3/5=2) (6=3), generate(ma)

Then to find changes we generate a new variable which incorporates the lagged value and current value of ma:

generate mach=(10*L.ma)+ma

Having labelled the values we have a useful marital change variable. So we can see that there are 352 instances of individuals going from never having been married to having a partner in this sample.

. tabulate mach

marital change | Freq. Percent Cum.

-----------------------------+-----------------------------------

stayed in couple | 16,849 63.94 63.94

partnership ended | 360 1.37 65.30

partnered -> never married! | 113 0.43 65.73

ex-partner -> partnership | 180 0.68 66.41

stayed ex-partner | 3,622 13.74 80.16

never married -> partnership | 352 1.34 81.49

never married -> ex-partner | 14 0.05 81.55

stayed never married | 4,863 18.45 100.00

-----------------------------+-----------------------------------

Total | 26,353 100.00
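The labelling step is not shown above. One way to do it is sketched here on toy data, with label text taken from the table; the label name machlbl and the data values are hypothetical. The first digit of mach is last period's ma and the second digit is the current ma.

```stata
* Sketch only: ma is 1 = in couple, 2 = ex-partner, 3 = never married.
clear
input long pid byte wave byte ma
1 1 3
1 2 1
1 3 1
2 1 1
2 2 2
end
xtset pid wave
generate mach = (10*L.ma) + ma

label define machlbl 11 "stayed in couple"             ///
                     12 "partnership ended"            ///
                     13 "partnered -> never married!"  ///
                     21 "ex-partner -> partnership"    ///
                     22 "stayed ex-partner"            ///
                     31 "never married -> partnership" ///
                     32 "never married -> ex-partner"  ///
                     33 "stayed never married"
label values mach machlbl
tabulate mach
```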

References

Cameron, A. Colin and Pravin K. Trivedi, Microeconometrics Using Stata, College Station, Texas: Stata Press, 2009.

Taylor, Marcia Freed, John Brice, Nick Buck, and Elaine Prentice-Lane, “British Household Panel Survey User Manual Volume A: Introduction, Technical Report and Appendices,” ISER, University of Essex, Colchester 2009.
