BASIC DATA ANALYSIS AND STATISTICS
R. SHAPIRO

American University in Cairo
June 3-6, 2012

• Motivation, Intuition, and Numerology (AUCShapiroPresent1.ppt)

• Exploring Theories: Bivariate Analysis

• Multivariate Models (Regression Approaches)

• Limited Dependent Variables (dichotomous variables) and Interactions

• Survey Research: Issues and Sources of Error

• Identifying Causal Mechanisms and Time Series Analysis

• Using “Instruments” to Identify Causal Effects

Exploring Theories: Bivariate analysis. “Correlation is not causation!” But you have to start somewhere....

First Steps

• Centrality of causal theorizing. Dependent and independent variable(s)? Unit of analysis? Generalizing to what universe/population? Assumption of unidirectional causation (revisited later)?

X --------> Y

e.g., Democracy -----------> Income (of countries)

Education -----------> Income (of individuals)

• Plausibility of theory? Causal mechanism/story?

Next Steps in Quantitative Research

• Measurement of variables (ideally at the designated unit of analysis). “Validity” and “reliability” of measures?

• Hypothesis specification (for measures); expected covariation/correlation?

• Statistical evidence of covariation/correlation?

• Rejecting null hypothesis? Substantive versus “statistical significance”?

• Next steps? Statistical controls, multivariate analysis, to be continued…. Strengthening causal inferences.

Questions at the Statistical Analysis Stage?

• “Level of measurement” for the measures of the dependent and independent variables: Categorical or Continuous? (Further distinction of “nominal,” “ordinal,” “interval” or “ratio” level variables.)

• The preferred statistical method depends on the level of measurement of the variables!

• Motivation to put everything into a regression analysis framework – for later multivariate analysis.

• The Bivariate Regression approach. Case of Income -------> Test score of individuals

Bivariate Ordinary Least Squares Regression (OLS)

• Case of: Income -------> Test scores of individuals

• The regression line takes the form Predicted Y = intercept + slope(X), or Predicted Y = a + bX, where “a” and “b” take on the unique numeric values that minimize the average vertical distance between all the points and the regression line (by minimizing the squared distances). To the extent Y and X are linearly related in this way, the regression line falls much closer to all the points than does the line through the mean of Y.

• Minimize, over all cases, the sum of (Y − Predicted Y)²
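The least-squares fit described above can be sketched in plain Python. This is only an illustration: the five (income, test score) pairs are made-up values, not course data, and `fit_line` and `sum_sq_resid` are hypothetical helper names.

```python
def fit_line(x, y):
    """Return (a, b) that minimize the sum of (y - (a + b*x))**2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: co-variation of X and Y divided by the variation of X.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    # Intercept: forces the line through the point (mean_x, mean_y).
    a = mean_y - b * mean_x
    return a, b


def sum_sq_resid(x, y, a, b):
    """Sum of squared vertical distances from the points to the line."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up data: income (X) and test score (Y) for five individuals.
income = [1.0, 2.0, 3.0, 4.0, 5.0]
score = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = fit_line(income, score)
ssr_line = sum_sq_resid(income, score, a, b)
# The horizontal line through the mean of Y is the special case b = 0.
ssr_mean = sum_sq_resid(income, score, sum(score) / len(score), 0.0)
print(a, b, ssr_line, ssr_mean)
```

Comparing `ssr_line` with `ssr_mean` shows how much closer the fitted line is to the points than the line through the mean of Y.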

Bivariate Scatterplot

Regression Line Versus the Mean: Idea of “Explained Variance”

[Figure: California School District Test Score and Income. Test Scores (vertical axis) plotted against (Logged) Average Income (horizontal axis), showing the Regression Line and the horizontal line at the Mean of Test Scores.]

Linear regression lets us estimate the slope of the population regression line

• Ultimately our aim is to estimate the causal effect on Y of a unit change in X – but for now, just think of the problem of fitting a straight line to data on two variables, Y and X.

• The slope of the population regression line is the expected effect on Y of a unit change in X.

The Population Linear Regression Model

Yi = β0 + β1Xi + ui, i = 1,…, n

We have n observations, (Xi, Yi), i = 1,…, n.

• X is the independent variable or regressor

• Y is the dependent variable

• β0 = intercept

• β1 = slope

• ui = the regression error

• The regression error consists of omitted factors. In general, these omitted factors are other factors that influence Y, other than the variable X. The regression error also includes error in the measurement of Y.

The population regression model in a picture: Observations on Y and X (n = 7); the population regression line; and the regression error (the “error term”):

The OLS estimator solves:

$$\min_{b_0,\, b_1} \sum_{i=1}^{n} \left[ Y_i - (b_0 + b_1 X_i) \right]^2$$

• The OLS estimator minimizes the average squared difference between the actual values of Yi and the prediction (“predicted value”) based on the estimated line. That is, it minimizes the vertical distances.

• This minimization problem can be solved using calculus.

• The result is the OLS estimators of β0 and β1.
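The calculus step mentioned on this slide can be sketched as follows. This is the standard textbook derivation, written with the slide's symbols b0 and b1; the hats and bars (for estimates and sample means) are notation assumed here, not taken from the transcript.

```latex
% First-order conditions: set both partial derivatives of the sum of squares to zero.
\begin{align*}
\frac{\partial}{\partial b_0} \sum_{i=1}^{n} \left[ Y_i - (b_0 + b_1 X_i) \right]^2
  &= -2 \sum_{i=1}^{n} \left[ Y_i - (b_0 + b_1 X_i) \right] = 0 \\
\frac{\partial}{\partial b_1} \sum_{i=1}^{n} \left[ Y_i - (b_0 + b_1 X_i) \right]^2
  &= -2 \sum_{i=1}^{n} X_i \left[ Y_i - (b_0 + b_1 X_i) \right] = 0
\end{align*}
% Solving the two equations for b_0 and b_1 gives the OLS estimators.
\begin{align*}
\hat{\beta}_1 &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
&
\hat{\beta}_0 &= \bar{Y} - \hat{\beta}_1 \bar{X}
\end{align*}
```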

Application to the California Test Score – Class Size data

• Estimated slope = –2.28

• Estimated intercept = 698.9

• Estimated regression line: TestScore = 698.9 – 2.28×STR (STR = student-teacher ratio)

OLS regression: STATA output

regress testscr str, robust

Regression with robust standard errors          Number of obs =     420
                                                F(  1,   418) =   19.26
                                                Prob > F      =  0.0000
                                                R-squared     =  0.0512
                                                Root MSE      =  18.581

-------------------------------------------------------------------------
        |               Robust
testscr |     Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------+----------------------------------------------------------------
    str | -2.279808   .5194892   -4.39   0.000    -3.300945   -1.258671
  _cons |   698.933   10.36436   67.44   0.000     678.5602    719.3057
-------------------------------------------------------------------------
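To see where the t statistic and the confidence interval in this output come from, here is a small Python sketch using the coefficient and robust standard error for str. It assumes the large-sample critical value 1.96; Stata itself uses the t distribution with 418 degrees of freedom, so its printed bounds are slightly wider.

```python
# Values copied from the Stata output for the str coefficient.
coef_str = -2.279808
se_str = 0.5194892

# t statistic for the null hypothesis that the slope is zero.
t_stat = coef_str / se_str

# Approximate 95% confidence interval: coefficient +/- 1.96 standard errors.
# (Assumption: normal approximation instead of Stata's t(418) critical value.)
z_crit = 1.96
ci_low = coef_str - z_crit * se_str
ci_high = coef_str + z_crit * se_str

print(round(t_stat, 2), round(ci_low, 3), round(ci_high, 3))
```

Because the whole interval lies below zero, the null hypothesis of a zero slope is rejected at the 5% level, which is a statement about statistical, not substantive, significance.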

Example of the R² and the SER

TestScore = 698.9 – 2.28×STR, R² = .05, SER = 18.6

STR explains only a small fraction of the variation in test scores. Does this make sense? Does this mean the STR is unimportant in a policy sense?
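As a hedged illustration of what these two summary numbers measure, the sketch below computes R² (the share of the variation in Y around its mean that the line accounts for) and the SER (the typical size of a residual, reported by Stata as "Root MSE"). The data and the helper name `r_squared_and_ser` are made up for illustration; the degrees-of-freedom correction n − 2 assumes one slope and one intercept.

```python
import math


def r_squared_and_ser(y, y_hat, n_coefs=2):
    """R-squared and standard error of the regression (SER)."""
    n = len(y)
    mean_y = sum(y) / n
    # Sum of squared residuals: variation left around the regression line.
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    # Total sum of squares: variation around the mean of Y.
    tss = sum((yi - mean_y) ** 2 for yi in y)
    r2 = 1.0 - ssr / tss
    ser = math.sqrt(ssr / (n - n_coefs))
    return r2, ser


# Made-up observed values and the predictions of a fitted line.
y = [2.0, 4.0, 5.0, 4.0, 5.0]
y_hat = [2.8, 3.4, 4.0, 4.6, 5.2]

r2, ser = r_squared_and_ser(y, y_hat)
print(r2, ser)
```

A low R² with a sizeable slope, as in the test score example, means the regressor leaves most of the variation unexplained even though a unit change in it may still matter for policy.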

A real-data example from labor economics: average hourly earnings vs. years of education (data source: Current Population Survey):

Slope of the Regression Line, Variability Around It, and the Correlation Coefficient

• Predicted Y = a + bX, where b is the slope.

• The correlation coefficient, Pearson’s “r”, ranges from -1 through 0 to +1, and is larger in size to the extent that the observed data fall very close to the regression line. The r² indicates, proportionately, how much closer (vertically) the regression line falls to the observed values of the dependent variable than does the horizontal line through the mean of the dependent variable. Why are both useful?
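One way to see why both numbers are useful is to build two made-up data sets with the same slope but different scatter around the line. The sketch below is illustrative only; the data and the helper name `slope_and_r` are assumptions, not material from the course.

```python
import math


def slope_and_r(x, y):
    """Return the OLS slope b and Pearson's r for one X and one Y."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / sxx, sxy / math.sqrt(sxx * syy)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
# Both Y series equal X plus noise that is uncorrelated with X;
# the second series simply has ten times as much noise.
y_tight = [1.1, 1.8, 3.2, 3.8, 5.1]
y_loose = [2.0, 0.0, 5.0, 2.0, 6.0]

b_tight, r_tight = slope_and_r(x, y_tight)
b_loose, r_loose = slope_and_r(x, y_loose)
print(b_tight, r_tight, b_loose, r_loose)
```

The slope answers "how much does Y change per unit of X?", while r answers "how tightly do the points cluster around that line?"; here the first answer is identical and the second is not.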

Correlation = 1

r = Correlation = .95

Same slope (b) but Correlation = 0.75. Implications? More variability? Why?

Correlation = -.50

No correlation

OLS can be sensitive to an outlier (also look for non-linearity? discuss later?):

• Is the lone point an outlier in X or Y?

• In practice, outliers are often data glitches (coding or recording problems). Sometimes (or more often?) they are observations that really shouldn’t be in your data set. Plot your data!

The larger the variance of X, the smaller the variance of the slope b

The number of black and blue dots is the same. Using which would you get a more accurate regression line?
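The point of this slide can be checked numerically. The sketch below assumes the simple textbook (homoskedastic) formula for the standard error of the slope, residual standard deviation divided by the square root of the summed squared deviations of X; this is a simplification of the robust standard errors in the Stata output, and the two X samples are made up.

```python
import math


def slope_std_error(x, residual_sd):
    """Textbook standard error of the OLS slope (homoskedastic case)."""
    mean_x = sum(x) / len(x)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return residual_sd / math.sqrt(sxx)


# Two hypothetical samples of equal size and equal scatter around the line,
# differing only in how spread out the X values are.
x_narrow = [4.0, 4.5, 5.0, 5.5, 6.0]
x_wide = [1.0, 3.0, 5.0, 7.0, 9.0]
residual_sd = 2.0

se_narrow = slope_std_error(x_narrow, residual_sd)
se_wide = slope_std_error(x_wide, residual_sd)
print(se_narrow, se_wide)
```

With X spread out four times as widely, the slope's standard error is four times smaller, so the wider sample pins down the regression line more accurately.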

Analyzing Categorical Measures

• For categorical independent and dependent variables: Cross Tabulation

• For a categorical independent variable and a continuous dependent variable or a categorical dependent variable that can be treated as continuous: Compare Means on the dependent variable.

• For a dichotomous dependent variable coded 0-1, the mean is the proportion of cases in the 1 category, so means on it can be compared!

Go to Stata example of standard bivariate analysis, non-regression

• Crucial: Preparing Data -- Recoding; Dealing with “Missing Values,” if any; etc.

• Go to PDF file, W4910x11 Bivariate Crosstabs and Means Analysis. Examples from U.S. Survey Data.

• On to a Regression Analysis framework next…

Moving to a regression framework for categorical variables:

• Treating categorical variables as continuous, if categories are “ordered” (“ordinal” vs. “nominal” level variables).

• Special case of dichotomous variables. (The mean of a 0-1 variable is the proportion of cases in the “1” category: the average of 0, 0, 1, 1, 1 is .6.)

• Crucial bridge: “dummy variable regression.” (And now for some comic relief, normally done at a blackboard with chalk.)
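The bridge between comparing means and regression can be sketched in a few lines of Python: regressing Y on a 0-1 dummy gives an intercept equal to the mean of Y in the 0 group and a slope equal to the difference in group means. The data and the helper name `fit_line` are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single regressor."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Made-up data: D is a 0-1 dummy, Y is a continuous outcome.
d = [0, 0, 1, 1, 1]
y = [10.0, 14.0, 20.0, 22.0, 24.0]

# The mean of a 0-1 variable is the proportion of cases coded 1.
share_of_ones = sum(d) / len(d)

# Group means of Y, computed directly.
mean_y_d0 = sum(yi for di, yi in zip(d, y) if di == 0) / d.count(0)
mean_y_d1 = sum(yi for di, yi in zip(d, y) if di == 1) / d.count(1)

# Dummy variable regression of Y on D.
intercept, slope = fit_line(d, y)
print(share_of_ones, mean_y_d0, mean_y_d1, intercept, slope)
```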

Example Using U.S. Survey Data and Stata Software

• Assumptions in treating ordinal variables as continuous variables.

• Statistical versus Substantive Significance? Variability. “Sampling error”/confidence intervals. The “standard error.”

• PDF file W4910x11 Bivariate Regression, Dummy Variables.

Statistical Control: Understanding Multivariate Models (Multiple Regression Analysis)

• Predicted Y = a + b1X1 + b2X2, where the b’s are the coefficients for which the squared differences between the observed Y’s and predicted Y’s are at a minimum. In this case we have more b’s to estimate to minimize the sum of (Y − Predicted Y)².

• It now also has the interpretations shown below, beginning with the comparisons of different possible scenarios for “conditional” regressions that hold one variable constant.
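A minimal sketch of what "controlling for" a second variable does to a coefficient, in plain Python. The data are made up so that Y = 1 + 2·X1 + 3·X2 exactly, with X2 a 0-1 variable that is correlated with X1; the helper names are hypothetical, and the two-regressor formulas are the standard solution of the least-squares normal equations.

```python
def mean(v):
    return sum(v) / len(v)


def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = mean(u), mean(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))


def fit_two_regressors(x1, x2, y):
    """Least-squares a, b1, b2 for Predicted Y = a + b1*X1 + b2*X2."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    a = mean(y) - b1 * mean(x1) - b2 * mean(x2)
    return a, b1, b2


# Made-up data generated as Y = 1 + 2*X1 + 3*X2.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [0.0, 1.0, 0.0, 1.0, 1.0]
y = [3.0, 8.0, 7.0, 12.0, 14.0]

# Bivariate slope of Y on X1 alone (no statistical control).
b1_bivariate = cross(x1, y) / cross(x1, x1)

# Multiple regression: slope of X1 holding X2 constant.
a, b1, b2 = fit_two_regressors(x1, x2, y)
print(b1_bivariate, a, b1, b2)
```

The bivariate slope picks up part of the effect of X2 because the two regressors move together; the multiple regression separates them.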

“Effect” of Region and Democracy on Economic Growth (made up data)

• Predicted EG = a + b1(Democracy) +b2(Region), where we think both democracy and region have possible causal effects.

• Case of only two regions (1 and 2; Region is coded 0-1), to illustrate a simple case of Statistical Control/holding one var. constant.

• Linear equation assumes no “interaction”; that is, “effect” of Democracy is the same in Region 1 and 2 (and same for Region; but is it?). There are different possibilities: (and comic relief)

(b) Interactions between continuous and binary variables

Yi = β0 + β1Di + β2Xi + ui

• Di is binary—a dummy variable coded 0-1; X is continuous

• As specified above, the effect on Y of X (holding constant D) = β2, which does not depend on D; that is, it is the same for D=0 and for D=1. But what if that is not the case???

• To allow the effect of X to depend on D, include the “interaction term” Di×Xi as a regressor:

Yi = β0 + β1Di + β2Xi + β3(Di×Xi) + ui

Binary-continuous interactions, ctd.

Binary-continuous interactions: the two regression lines

Yi = β0 + β1Di + β2Xi + β3(Di×Xi) + ui

Observations with Di= 0 (the “D = 0” group):

Yi = β0 + β2Xi + ui The D=0 regression line

Observations with Di= 1 (the “D = 1” group):

Yi = β0 + β1 + β2Xi + β3Xi + ui

= (β0+β1) + (β2+β3)Xi + ui The D=1 regression line
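The algebra above can be turned into a small Python sketch that returns the intercept and slope of each group's line. The coefficient values are hypothetical, chosen only to illustrate the arithmetic, and `group_lines` and `predict` are made-up helper names.

```python
def group_lines(b0, b1, b2, b3):
    """Intercept and slope for each group implied by
    Y = b0 + b1*D + b2*X + b3*(D*X)."""
    line_d0 = (b0, b2)            # D = 0 group
    line_d1 = (b0 + b1, b2 + b3)  # D = 1 group
    return line_d0, line_d1


def predict(line, x):
    intercept, slope = line
    return intercept + slope * x


# Hypothetical coefficient values, for illustration only.
line_d0, line_d1 = group_lines(b0=10.0, b1=4.0, b2=2.0, b3=-0.5)

print(line_d0, line_d1, predict(line_d0, 3.0), predict(line_d1, 3.0))
```

If b3 were zero the two lines would be parallel, which is the "no interaction" case; a nonzero b3 lets the slope on X differ between the groups.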

(c) Interactions between two continuous variables

Yi = β0 + β1X1i + β2X2i + ui

• X1, X2 are continuous

• As specified, the effect of X1 doesn’t depend on X2

• As specified, the effect of X2 doesn’t depend on X1

• To allow the effect of X1 to depend on X2, include the “interaction term” X1i×X2i as a regressor:

Yi = β0 + β1X1i + β2X2i + β3(X1i×X2i) + ui
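What the interaction term does to the "effect" of each variable can be written out by differentiating the last equation. This is a standard calculation, shown here as a sketch using the slide's own symbols.

```latex
% Effect of a small change in one regressor, holding the other fixed.
\begin{align*}
\frac{\partial Y_i}{\partial X_{1i}} &= \beta_1 + \beta_3 X_{2i}
&
\frac{\partial Y_i}{\partial X_{2i}} &= \beta_2 + \beta_3 X_{1i}
\end{align*}
% With \beta_3 = 0 these reduce to \beta_1 and \beta_2: the no-interaction case.
```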

Next: An Instructional Example of a Simple Three Variable Model

• From U.S. Survey Data. Using Stata software.

• Ordinal variables are treated again as continuous variables, collapsing the number of categories in the independent variables. We would normally not collapse variables; that loses information. We do so here for purposes of seeing how the assumption of “no interaction” plays out in a simple, illustrative way.

• Go to PDF file W4910x11 Control Variables

Example: the California test score data

Regression of TestScore against STR:

TestScore = 698.9 – 2.28×STR

Now include percent English Learners in the district (PctEL):

TestScore = 686.0 – 1.10×STR – 0.65×PctEL

• What happens to the coefficient on STR?

• Note: corr(STR, PctEL) = 0.19
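One way to read the change in the STR coefficient is through the OLS omitted-variable identity: the slope from the short regression equals the slope from the long regression plus the PctEL coefficient times the slope from regressing PctEL on STR. The sketch below uses the coefficients from the two Stata regressions in this deck; the implied auxiliary slope is derived from that identity and is not reported in the slides.

```python
# Coefficients copied from the two Stata regressions in this deck.
b_str_short = -2.279808  # TestScore on STR only
b_str_long = -1.101296   # TestScore on STR, controlling for PctEL
b_pctel = -0.6497768     # coefficient on PctEL in the longer regression

# Part of the short-regression slope that is due to leaving out PctEL.
omitted_variable_part = b_str_short - b_str_long

# OLS identity: short slope = long slope + b_pctel * delta, where delta is
# the slope from regressing PctEL on STR (implied here, not reported).
implied_delta = omitted_variable_part / b_pctel

print(round(omitted_variable_part, 3), round(implied_delta, 2))
```

The positive implied slope says that districts with larger classes tend to have more English learners, so the bivariate regression attributes some of the PctEL effect to STR.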

Multiple regression in STATA

reg testscr str pctel, robust;

Regression with robust standard errors               Number of obs =     420
                                                     F(  2,   417) =  223.82
                                                     Prob > F      =  0.0000
                                                     R-squared     =  0.4264
                                                     Root MSE      =  14.464

------------------------------------------------------------------------------
             |               Robust
     testscr |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         str |  -1.101296   .4328472   -2.54   0.011     -1.95213   -.2504616
       pctel |  -.6497768   .0310318  -20.94   0.000     -.710775   -.5887786
       _cons |   686.0322   8.728224   78.60   0.000     668.8754     703.189
------------------------------------------------------------------------------

TestScore = 686.0 – 1.10×STR – 0.65PctEL

Another Example of a Multivariate Model Estimated with Stata

• More like real research than the previous example. Multiple control variables. (No collapsing of categories, losing information.)

• Scatterplots to explore non-linearity.

• Inclusion of multiplicative terms to explore statistical interactions.

• Go to PDF file W4911y12 Regressions…..

Linearity vs. Non-Linearity

• Non-linear relationships. Easy cases are models which are still linear in the coefficients; can be estimated with OLS.

• Case of dichotomous dependent variable (coded 0-1), for which theory is non-linear and not linear in the coefficients. An “S” shaped curve: logit or probit model. What kind of theory? Versus a Linear Probability Model, LPM with OLS.

But the TestScore – Income relation looks nonlinear...

Example: the TestScore – Income relation

Incomei = average district income in the ith district (thousands of dollars per capita)

Quadratic specification:

TestScorei = β0 + β1Incomei + β2(Incomei)² + ui

Cubic specification:

TestScorei = β0 + β1Incomei + β2(Incomei)² + β3(Incomei)³ + ui

Interpreting the estimated regression function:

(a) Plot the predicted values

TestScore = 607.3 + 3.85Incomei – 0.0423(Incomei)²

Standard errors: (2.9) for the intercept, (0.27) for Incomei, (0.0048) for (Incomei)²
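Besides plotting, the estimated quadratic can be interpreted by computing predicted changes at different starting points. The sketch below plugs the slide's estimated coefficients into a small Python function; the income levels 5 and 25 (thousands of dollars) are chosen here only for illustration.

```python
def predicted_score(income):
    """Estimated quadratic regression function from the slide."""
    return 607.3 + 3.85 * income - 0.0423 * income ** 2


# Predicted change in test score for a one-unit (thousand-dollar)
# increase in income, starting from a low and from a high income level.
change_low = predicted_score(6) - predicted_score(5)
change_high = predicted_score(26) - predicted_score(25)

print(round(change_low, 2), round(change_high, 2))
```

In a quadratic specification the "effect" of income is not a single number: the same one-unit increase is associated with a larger predicted gain at low incomes than at high incomes.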

Example: Linear Probability Model (LPM). HMDA data

Mortgage denial v. ratio of debt payments to income (P/I ratio) in a subset of the HMDA data set (n = 127)

Probit

Logit versus Probit Models

The predicted probabilities from the probit and logit models are very close in these HMDA regressions:

Logit Models (or Logistic Regression)

Cannot be estimated with OLS. Requires Maximum Likelihood Estimation (MLE).

Logit (continued)

• Where P is the Predicted Y for a dichotomous (0-1) dependent variable, that is, predicting the probability that Y=1. The same goal as the Linear Probability Model (regression) but with a non-linear (S-curve) relationship.

• e = the base of the natural logarithm (2.718…), and bX refers to the linear combination of independent variables.

• It involves interactions of independent variables.

• Go to W4911y12 Logit, Probit, LPM example in Stata….
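The logit formula itself did not survive in this transcript, so the sketch below assumes the standard logistic form P = 1 / (1 + e^−(a + bX)). The coefficients are hypothetical, and the function names are made up; the point is only to contrast the S-curve with a straight-line probability model.

```python
import math


def logit_probability(intercept, slope, x):
    """P(Y=1) under the assumed logistic form 1 / (1 + e**-(a + b*x))."""
    linear_part = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-linear_part))


# Hypothetical coefficients, for illustration only.
a, b = -4.0, 1.0

p_mid = logit_probability(a, b, 4.0)   # where a + b*x = 0
# Change in predicted probability for a one-unit increase in x,
# starting near the middle of the curve and starting in the lower tail.
change_mid = logit_probability(a, b, 5.0) - p_mid
change_tail = logit_probability(a, b, 1.0) - logit_probability(a, b, 0.0)

print(p_mid, change_mid, change_tail)
```

Unlike in the Linear Probability Model, the predicted probability always stays between 0 and 1, and the change associated with a one-unit increase in X depends on where along the curve the case starts, which is why the effect of one variable depends on the values of the others.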

Survey Data Analysis

• Anderson and Guillory (1997) examine satisfaction with democracy.

• Huber, Kernell, and Leoni (2005) examine partisan attachment.

• Interactions and Limited Dependent Variables.

Survey Research: Issues and Sources of Error

• Issues in Survey Research

• Go to PDF File Sources of Errors in Surveys.

Identifying Causal Mechanisms and Time Series Analysis

• Getting insight and leverage from observing variations over time and short term changes from longitudinal data.

• Comparing directly changes over time.

• Looking for sequences or time lags in the data over time.

• Examples (next slide).

Data Analysis Examples

• Unit of analysis is the time period (e.g., year, month, etc.) for a single unit (e.g., one country or other kind of case); e.g., one country’s years.

• Stata example of simple time series with exogenous variables only; example of a lagged endogenous variable. Go to PDF File W4911y12 Paper5Part1.

Time Series (continued)

• Continue to PDF File W4911y12 Paper5Part2.

• “Panel” or “pooled time series” data. For multiple units over time. For example, separate time series for many countries; the unit of analysis is “country-years.” Data show variation both over time and across units; need to watch this.

Time Series (continued)

• Stata example of pooled time series; Go to PDF File W4911y12 Paper5 SupplementPooledTimeSeries.

• Panel Data. Logic of “fixed effects”.

• Examples from readings.
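The logic of fixed effects can be sketched with a tiny made-up panel: two countries observed for three years each, with different country-specific levels but the same within-country slope of 2. Subtracting each country's own means removes the level differences, so only variation over time within a country is used. The helper names are hypothetical, and this "within" transformation is one common way to implement fixed effects, not necessarily the one used in the course examples.

```python
from collections import defaultdict


def slope(x, y):
    """Bivariate OLS slope."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return sxy / sxx


def demean_within_units(units, values):
    """Subtract each unit's own mean from its observations."""
    totals, counts = defaultdict(float), defaultdict(int)
    for u, v in zip(units, values):
        totals[u] += v
        counts[u] += 1
    return [v - totals[u] / counts[u] for u, v in zip(units, values)]


# Made-up country-year data: Y = 10 + 2*X in country A, Y = -10 + 2*X in B.
country = ["A", "A", "A", "B", "B", "B"]
x = [1.0, 2.0, 3.0, 6.0, 7.0, 8.0]
y = [12.0, 14.0, 16.0, 2.0, 4.0, 6.0]

# Pooled regression mixes variation over time and across countries.
pooled_slope = slope(x, y)

# Fixed effects: use only variation within each country.
within_slope = slope(demean_within_units(country, x),
                     demean_within_units(country, y))
print(pooled_slope, within_slope)
```

In this constructed example the pooled slope even has the wrong sign, because differences between countries dominate; this is the sense in which variation over time and across units needs to be watched.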

Using “Instruments” to Identify Causal Effects

• Issue of “reciprocal causation”/“endogeneity”/“simultaneity bias”.

• Need to find an “exogenous” variable as an instrument.

• Assumptions about exogeneity and lack of direct causal effect on an endogenous variable.

• Example from Acemoglu et al.

Instrumental Variables (continued)

• Logic of Indirect Least Squares

• Two-Stage Least Squares (TSLS or 2SLS)

• Stata example using U.S. Survey Data, Go to PDF File W4911y12Paper4.

• Other research examples; see next table.
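A minimal sketch of the two stages, in plain Python with made-up data: z is the instrument, x the endogenous regressor, and y the outcome. With one instrument and one regressor, the two-stage estimate equals the ratio of the z-y co-variation to the z-x co-variation, which is the indirect least squares logic. The helper names are hypothetical; standard errors from a hand-run second stage would not be valid, so only the point estimate is shown.

```python
def cross(u, v):
    """Sum of cross-products of deviations from the means."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return sum((ui - mu) * (vi - mv) for ui, vi in zip(u, v))


def fit_line(x, y):
    """Least-squares intercept and slope for a single regressor."""
    b = cross(x, y) / cross(x, x)
    a = sum(y) / len(y) - b * sum(x) / len(x)
    return a, b


# Made-up data: instrument z, endogenous regressor x, outcome y.
z = [0.0, 1.0, 2.0, 3.0, 4.0]
x = [1.0, 3.0, 2.0, 5.0, 4.0]
y = [2.0, 5.0, 3.0, 9.0, 6.0]

# Stage 1: regress x on z and keep the predicted values of x.
a1, b1 = fit_line(z, x)
x_hat = [a1 + b1 * zi for zi in z]

# Stage 2: regress y on the predicted x.
_, tsls_slope = fit_line(x_hat, y)

# Indirect least squares: reduced-form slope divided by first-stage slope.
ils_slope = cross(z, y) / cross(z, x)

# Ordinary OLS of y on x, for comparison.
_, ols_slope = fit_line(x, y)
print(tsls_slope, ils_slope, ols_slope)
```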

Factor Analysis and Scale Construction

• Example from Verba and Nie, Participation in America (1972)

• Stata example from U.S. survey data.

Factor Analysis and Scale Construction (continued)

• Stata Example from U.S. Survey Data

• Go to W4911y Paper6Factor Analysis example.

Other Topics?

• Questions?