
Page 1: Nonparametric Methods Featuring the Bootstrap

Nonparametric Methods Featuring the Bootstrap

Jon Atwood
November 12, 2013

Page 2: Nonparametric Methods Featuring the Bootstrap

Laboratory for Interdisciplinary Statistical Analysis

Collaboration
From our website, request a meeting for personalized statistical advice. Coauthorship is not required, but encouraged.
Great advice right now: meet with LISA before collecting your data.

Short Courses
Designed to help graduate students apply statistics in their research

Walk-In Consulting

Monday—Friday* 1-3PM, GLC

*Mon—Thurs during the summer

All services are FREE for VT researchers. We assist with research—not class projects or homework.

LISA helps VT researchers benefit from the use of Statistics

www.lisa.stat.vt.edu

Experimental Design • Data Analysis • Interpreting Results
Grant Proposals • Software (R, SAS, JMP, SPSS...)

Page 3: Nonparametric Methods Featuring the Bootstrap

Short Course Goals
By the end of this course, you will know…
• The fundamental differences between nonparametric and parametric methods
• In what general situations nonparametric methods would be advantageous
• How to implement nonparametric alternatives to t-tests, ANOVAs and simple linear regression
• What nonparametric bootstrapping is and when you can use it effectively

Page 4: Nonparametric Methods Featuring the Bootstrap

What we will do today…
• Give a brief overview of nonparametric statistical methods
• Take a look at real-world data sets! (and some non-real-world data sets)
• Implement the following methods in R…
  – Wilcoxon rank-sum and signed rank tests (alternatives to t-tests)
  – Kruskal-Wallis (alternative to one-way ANOVA)
  – Spearman correlation (alternative to Pearson correlation)
  – Nonparametric bootstrapping
• ?Bonus Topic?

Page 5: Nonparametric Methods Featuring the Bootstrap

What does nonparametric mean?

Well, first of all, what does parametric mean?
• Parametric is a word that can have several meanings
• In statistics, parametric usually means some specific probability distribution is assumed
  – Normal (regression, ANOVA, etc.)
  – Exponential (survival analysis)
• It may also involve assumptions about parameters (the variance, for example)

Page 6: Nonparametric Methods Featuring the Bootstrap

And now a closer look…

Regression Analysis
• We want to see the relationship between a bunch of predictor variables (x1, x2, …, xp) and a response variable
• For example, suppose we wanted to see the relationship between weight and blood pressure
• A usual regression model might look something like this…

Yi = β0 + β1xi + εi

[Simple Linear Regression Plot]

Page 7: Nonparametric Methods Featuring the Bootstrap

Error term

εi ~ N(0, σ²)
• The error terms are assumed to come from a normal distribution with mean 0 and variance σ²
• The usual methods for testing the significance of our β estimates are based directly on this assumption of normality
• If the assumption of normality is false, these methods are invalid
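One quick way to check this assumption is to inspect the residuals. A minimal sketch in R, using simulated weight/blood-pressure values of my own rather than the course's example:

```r
# Fit a toy regression and inspect the residuals for normality
set.seed(1)
weight <- rnorm(50, mean = 80, sd = 10)           # simulated predictor
bp     <- 100 + 0.5 * weight + rnorm(50, sd = 5)  # simulated response
fit    <- lm(bp ~ weight)

qqnorm(resid(fit))        # points near a straight line suggest normal residuals
qqline(resid(fit))
shapiro.test(resid(fit))  # formal check; a small p-value flags non-normality
```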

Page 8: Nonparametric Methods Featuring the Bootstrap

So onto nonparametric…
• A statistical method is nonparametric if it does not require most of these assumptions
• Data are not assumed to follow some underlying distribution
• In some cases, it means that relationships do not take predetermined forms (nonparametric regression, for example)

Page 9: Nonparametric Methods Featuring the Bootstrap

Assumptions nonparametric methods do make
• Randomness
• Independence
• In multi-sample tests, distributions are of the same shape

Page 10: Nonparametric Methods Featuring the Bootstrap

Nonparametric Methods

Advantages
• Free from distributional assumptions about the data
• Easy to interpret
• Usually computationally straightforward

Disadvantages
• Loss of power when data do follow the usual assumptions
• Reduces the data's information
• Larger sample sizes needed due to lower efficiency

Page 11: Nonparametric Methods Featuring the Bootstrap

With that said…

Rank Tests
• Rank tests are a simple group of nonparametric tests
• Instead of using the actual numerical value of an observation, we use its rank: its position on the number line relative to the other observations

Page 12: Nonparametric Methods Featuring the Bootstrap

As an example…
Here's some data. Basically, these are just numbers I picked out of the sky. Any ideas on what we should call it? Wigs in a wig shop?

Without ties:

Y Value   Ascending Position   Rank
32        1                    1
45        2                    2
54        3                    3
64        4                    4
311       5                    5
2000      6                    6

What about ties? The tied observations' ascending positions are added together and divided by the total number of ties:

Y Value   Ascending Position   Rank
32        1                    1
64        2                    (2+3+4)/3 = 3
64        3                    (2+3+4)/3 = 3
64        4                    (2+3+4)/3 = 3
311       5                    5
2000      6                    6
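R's rank() function uses exactly this average-rank convention for ties. A one-line check (my own snippet, not from the slides):

```r
# rank() assigns average ranks to ties, matching the table above
y <- c(32, 64, 64, 64, 311, 2000)
rank(y)  # 1 3 3 3 5 6 -- the tied 64s each get (2+3+4)/3 = 3
```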

Page 13: Nonparametric Methods Featuring the Bootstrap

Wilcoxon rank tests
• These provide alternatives to the standard t-tests, which test for mean differences between two groups
• t-tests assume that the data are normally distributed
• t-tests can be highly sensitive to outliers (we'll see that in an example soon), which may reduce power (the ability to detect significant differences)
• Wilcoxon tests alleviate these problems by using ranks, not actual values

Page 14: Nonparametric Methods Featuring the Bootstrap

First, the Wilcoxon Rank-Sum Test
• An alternative to the independent-sample t-test (recall that the independent-sample t-test compares two independent samples, testing whether the means are equal)
• The t-test assumes normality of the data and equal variances (though adjustments can be made for unequal variances)
• The Wilcoxon rank-sum test assumes only that the two samples follow continuous distributions, along with the usual randomness and independence

Page 15: Nonparametric Methods Featuring the Bootstrap

Let's try it out with some data!
• The source of this data is…me, Jon Atwood
• Data randomly generated from R
• First group: 15 observations, normally distributed with mean 30000 and standard deviation 2500, plus one extra observation added
• Second group: 16 observations, normally distributed with mean 40000 and standard deviation 2500

Group 1: 32897.08, 29383.8, 30539.39, 27448.5, 33712.17, 28166.07, 26613.3, 30105.7, 25786.81, 30761.04, 27427.12, 29060.02, 25593.89, 29815.82, 32453.97, 1000000

Group 2: 40273, 35610.5, 42547.4, 41214.4, 40183, 36460.8, 45040.8, 39248.6, 33901.7, 40985.1, 42412, 38247.2, 36384.2, 41849.5, 38222.8, 39739.4

Page 16: Nonparametric Methods Featuring the Bootstrap

So what is this test…testing?
Remember, for the t-test, we are testing
H0: μ1 = μ2 vs. Ha: μ1 ≠ μ2
where μ1 is the mean of group 1 and μ2 is the mean of group 2.

Page 17: Nonparametric Methods Featuring the Bootstrap

In the Wilcoxon rank-sum test, we are testing
H0: F(G1) = F(G2) vs. Ha: F(G1) = F(G2 − α)
where F(G1) is the distribution of group 1, F(G2) is the distribution of group 2, and α is the "location shift."

What this really means is that under Ha, F(G1) and F(G2) are basically the same, except one is shifted to the right of the other by α.

Page 18: Nonparametric Methods Featuring the Bootstrap

In our problem…

Group 1                Group 2
Value       Rank       Value    Rank
25593.89    1          33902    16
25786.81    2          35611    17
26613.3     3          36384    18
27427.12    4          36461    19
27448.5     5          38223    20
28166.07    6          38247    21
29060.02    7          39249    22
29383.8     8          39739    23
29815.82    9          40183    24
30105.7     10         40273    25
30539.39    11         40985    26
30761.04    12         41214    27
32453.97    13         41850    28
32897.08    14         42412    29
33712.17    15         42547    30
1000000     32         45041    31
Rank sum    152                 376

So, group 2 has a rank sum 224 higher than group 1's.
The question is: what is the probability of observing a combination of rank sums in the two groups where the difference is greater than 224?
R will compute p-values using the normal approximation for sample sizes greater than 50.

Page 19: Nonparametric Methods Featuring the Bootstrap

So let's do some R-coding!
We will view graphs and results in the R session we run (this applies to all examples).
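A sketch of what that code might look like. The seed is my own choice, so the simulated values will not exactly reproduce the table on the earlier slide:

```r
# Generate the two groups along the lines described earlier
set.seed(2013)                               # arbitrary seed (assumption)
group1 <- c(rnorm(15, mean = 30000, sd = 2500),
            1000000)                         # the extra outlying observation
group2 <- rnorm(16, mean = 40000, sd = 2500)

t.test(group1, group2)       # the outlier inflates group 1's variance
wilcox.test(group1, group2)  # rank-sum test; the outlier is just one more rank
```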

Page 20: Nonparametric Methods Featuring the Bootstrap

Moving to a paired situation…
• Suppose we have two samples, but they are paired in some way
• Perhaps pairs of people are couples, plants are paired off into different plots, dogs are paired by breed, or whatever
• Then the rank-sum test will not be the optimal test. Instead, we use the signed rank test

Page 21: Nonparametric Methods Featuring the Bootstrap

Signed Rank Test
The null hypothesis is that the medians of the two distributions are equal.

Procedure
1. Calculate all differences between the two groups
2. Take the absolute value of those differences
3. Convert those to ranks (like before)
4. Attach a + or −, depending on whether the difference is positive or negative
5. Add these signed ranks together to get the test statistic W

Page 22: Nonparametric Methods Featuring the Bootstrap

Example
This data comes from Laureysens et al. (2004), who took 13 poplar tree clones and measured aluminum levels in August and November.
Below is a table with the raw data, differences, and ranks.

Clone            Aug    Nov    Abs Value Diff   Rank   S-Rank
Balsam Spire     8.1    11.2   3.1              6      6
Beaupre          10     16.3   6.3              9      9
Hazendans        16.5   15.3   1.2              3      -3
Hoogvorst        13.6   15.6   2                4      4
Raspalje         9.5    10.5   1                2      2
Unal             8.3    15.5   7.2              10     10
Columbia River   18.3   12.7   5.6              8      -8
Fritzi Pauley    13.3   11.1   2.2              5      -5
Trichobel        7.9    19.9   12               11     11
Gaver            8.1    20.4   12.3             12     12
Gibecq           8.9    14.2   5.3              7      7
Primo            12.6   12.7   0.1              1      1
Wolterson        13.4   36.8   23.4             13     13

W = 59

Page 23: Nonparametric Methods Featuring the Bootstrap

How many combinations of signs could create a more extreme W?
R will give an exact p-value for sample sizes less than 50.
Otherwise, a normal approximation is used.

Page 24: Nonparametric Methods Featuring the Bootstrap

R Time!
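A sketch using the Laureysens et al. (2004) values from the table above:

```r
# Aluminum levels for the 13 poplar clones, August and November
aug <- c(8.1, 10, 16.5, 13.6, 9.5, 8.3, 18.3, 13.3, 7.9, 8.1, 8.9, 12.6, 13.4)
nov <- c(11.2, 16.3, 15.3, 15.6, 10.5, 15.5, 12.7, 11.1, 19.9, 20.4, 14.2, 12.7, 36.8)

# Paired signed rank test; note that R reports V, the sum of the positive
# ranks only, rather than the signed-rank sum W shown on the earlier slide
wilcox.test(aug, nov, paired = TRUE)
```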

Page 25: Nonparametric Methods Featuring the Bootstrap

More than 2 groups
• Suppose we are comparing more than two groups
• Normally (no pun intended), we would use an analysis of variance, or ANOVA
• But, of course, we need to assume normality for this as well

Page 26: Nonparametric Methods Featuring the Bootstrap

Kruskal-Wallis
• In this situation, we use the Kruskal-Wallis test
• Again, we convert the actual numeric values to ranks, ranking all observations together regardless of group
• Ultimately, we compute the test statistic (the usual Kruskal-Wallis H)
  H = [12 / (N(N+1))] Σ (Rj² / nj) − 3(N+1)
  where N is the total sample size, nj the size of group j, and Rj the rank sum of group j

Page 27: Nonparametric Methods Featuring the Bootstrap

An example
• We'll use the built-in R data set airquality
• These data (Chambers et al., 1983) are daily air quality measurements in New York, May to September 1973
• The question is: does the air quality differ from month to month?

Page 28: Nonparametric Methods Featuring the Bootstrap

Break out the R!
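A minimal sketch of the test call:

```r
# Compare ozone readings across months (May = 5 through September = 9)
data(airquality)
kruskal.test(Ozone ~ Month, data = airquality)
# A small p-value suggests at least one month's ozone distribution differs
```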

Page 29: Nonparametric Methods Featuring the Bootstrap

Onto Part 2…
In this part of the course, we will do the following:
1. Look at the Spearman correlation (an alternative to the Pearson correlation) between an x variable and a y variable
2. Examine nonparametric bootstrapping, and how it can help us when our data do not approximate normality

Page 30: Nonparametric Methods Featuring the Bootstrap

Spearman correlation
• Suppose you want to discover the association between infant mortality and GDP by country
• Here's a 2003 scatterplot of the situation

Page 31: Nonparametric Methods Featuring the Bootstrap

This data comes from www.indexmundi.com.
• In this example, the Pearson correlation is about −0.63
• Still significant, but it perhaps understates the strength of the monotone relationship between GDP and infant mortality rate
• In addition, the Pearson correlation assumes linearity, which is clearly not present here

Page 32: Nonparametric Methods Featuring the Bootstrap

We can use the Spearman correlation instead
• This is a correlation coefficient based on ranks, which are computed for the y variable and the x variable independently, with sample size n
• To calculate the coefficient, we do the following…
1. Take each xi and each yi and convert them into ranks, rxi and ryi (the ranks of x and y are computed independently of each other)
2. Subtract rxi from ryi to get di, the difference in ranks
3. The formula is

rs = 1 − 6 Σ di² / (n(n² − 1))

Page 33: Nonparametric Methods Featuring the Bootstrap

In this case, the hypotheses are
H0: rs = 0 vs. Ha: rs ≠ 0
Basically, we are attempting to see whether the two variables are independent, or whether there is evidence of an association.

Page 34: Nonparametric Methods Featuring the Bootstrap

Let's return to the GDP data
• We will now plot the data in R and see how to get the Spearman correlation
• R tests this using an exact p-value for small sample sizes, and a t-distribution approximation for larger ones
• In the latter case, the test statistic follows a t-distribution with n − 2 degrees of freedom

Page 35: Nonparametric Methods Featuring the Bootstrap

Turn to R now!!!
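A sketch of the computation. The file and column names here are hypothetical stand-ins for the indexmundi data:

```r
# Hypothetical file and column names; substitute your copy of the 2003 data
countries <- read.csv("gdp_mortality_2003.csv")

cor(countries$gdp, countries$infant_mortality)   # Pearson, about -0.63 per the slide
cor.test(countries$gdp, countries$infant_mortality,
         method = "spearman")                    # rank-based Spearman test
```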

Page 36: Nonparametric Methods Featuring the Bootstrap

Nonparametric Bootstrapping
Suppose you are fitting a multiple linear regression model

Yi = β0 + β1x1i + β2x2i + … + βkxki + εi

• Recall that εi ~ N(0, σ²)
• But what if we have reason to suppose this assumption is not met?
• Then the regular way of testing the significance of our coefficients is invalid

Page 37: Nonparametric Methods Featuring the Bootstrap

So what do we do?
• Depending on the situation, there are several options
• The one we will talk about today is called nonparametric bootstrapping
• Bootstrapping is a resampling method, where we take the data and randomly sample from it to draw inferences
• There are several types of bootstrapping. We will focus on the simpler, nonparametric type

Page 38: Nonparametric Methods Featuring the Bootstrap

How do we do nonparametric bootstrapping?
1. We assign each of the n observations a 1/n probability of being selected
2. We then take a random sample with replacement, usually of size n
3. We compute estimated coefficients based on this sample
4. We repeat the process many times, say 10,000, or 100,000, or 100,000,000,000,000…
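A bare-bones sketch of this loop for a simple regression, with a toy data set of my own (heavy-tailed errors, so the normality assumption is dubious); none of this is from the original slides:

```r
set.seed(42)                               # arbitrary seed (assumption)
n   <- 100
x   <- rnorm(n)
y   <- 2 + 3 * x + rt(n, df = 3)           # t-distributed (non-normal) errors
dat <- data.frame(x, y)

B     <- 10000                             # number of bootstrap resamples
beta1 <- numeric(B)
for (b in 1:B) {
  idx      <- sample(n, n, replace = TRUE) # each row drawn with probability 1/n
  beta1[b] <- coef(lm(y ~ x, data = dat[idx, ]))[2]
}
quantile(beta1, c(0.025, 0.975))           # the 250th and 9,750th ordered values
```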

Page 39: Nonparametric Methods Featuring the Bootstrap

For example, in regression
• Suppose we want to test whether or not β1 = 0
• We now have a newly generated sample of (10,000, or however many) β1 coefficients
• For a 95 percent confidence interval, we would sort them and take the 250th observation (the 250th lowest) and the 9,750th observation (the 250th highest)
• If this interval does not contain 0, we conclude that there is evidence that β1 is not equal to 0

Page 40: Nonparametric Methods Featuring the Bootstrap

Possible issue?
• Theoretically, this method comes out of the idea that the distribution of a population is approximated by the distribution of its sample
• This approximation becomes less and less valid the smaller your sample size

Page 41: Nonparametric Methods Featuring the Bootstrap

Example
• This data was taken from The Practice of Econometrics by E. R. Berndt (1991)
• We will be regressing wage on education and experience
• We will use graphs to check whether the residuals are approximately normal or not

R-Code Now!
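A sketch of how this might look with the boot package. I am assuming the CPS1985 data set from the AER package, which is drawn from Berndt (1991); if the course used a different extract, substitute it here:

```r
library(boot)
library(AER)   # assumed source of a Berndt (1991) wage data set
data("CPS1985")

# First, the graphical check of residual normality
fit <- lm(wage ~ education + experience, data = CPS1985)
qqnorm(resid(fit)); qqline(resid(fit))

# Statistic function for boot(): refit the regression on each resample
coef_fun <- function(data, idx) {
  coef(lm(wage ~ education + experience, data = data[idx, ]))
}

set.seed(1)
b <- boot(CPS1985, coef_fun, R = 10000)
boot.ci(b, index = 2, type = "perc")  # percentile CI for the education coefficient
```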

Page 42: Nonparametric Methods Featuring the Bootstrap

In Summary
• We now understand that certain parametric methods, like t-tests and regression, depend on assumptions that may or may not be met
• We know that nonparametric methods do not make distributional assumptions, and are therefore applicable when data do not meet those assumptions
• We can implement these methods in R, in case our data does not meet these assumptions

Page 43: Nonparametric Methods Featuring the Bootstrap

Bonus Topic!

Indicator Variables in Regression
Used when we have categorical variables as predictors in regression.

Page 44: Nonparametric Methods Featuring the Bootstrap

As a start, let's re-examine the wage data
• Here, we will drop education and just look at experience and sex
• The full model in a case like this would look like

Wi = β0 + β1Ei + β2Si + β3SiEi + εi

where…
W = wage
E = experience
S = 1 if sex = "male", 0 otherwise
ε is the error term

Page 45: Nonparametric Methods Featuring the Bootstrap

Separate regressions for different sexes
• For women, the reference group: Wia = β0a + β1aEi + εia
• For men, the non-reference group: Wib = β0b + β1bEi + εib

Page 46: Nonparametric Methods Featuring the Bootstrap

With that in mind…
Let's return to the full model
Wi = β0 + β1Ei + β2Si + β3SiEi + εi
• Suppose we want to see how women do in this model. Well, we can set S = 0
• We are left with Wi = β0 + β1Ei + εi
• But, since this is for women, this is equivalent to the women-only model, Wia = β0a + β1aEi + εia

Page 47: Nonparametric Methods Featuring the Bootstrap

Thus…
β0 = β0a
β1 = β1a

Page 48: Nonparametric Methods Featuring the Bootstrap

Now, look at "male"
We set S = 1:

Wi = β0 + β1Ei + β2·1 + β3·1·Ei + εi
   = β0 + β1Ei + β2 + β3Ei + εi
   = β0a + β1aEi + β2 + β3Ei + εi
   = (β0a + β2) + (β1a + β3)Ei + εi
   = β0b + β1bEi + εib

Page 49: Nonparametric Methods Featuring the Bootstrap

So…

β2 = β0b − β0a
β3 = β1b − β1a

Page 50: Nonparametric Methods Featuring the Bootstrap

Example in R!
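A sketch, again assuming AER's CPS1985 stands in for the Berndt wage data:

```r
library(AER)   # assumed data source, as before
data("CPS1985")

# gender is a factor, so R builds the 0/1 indicator and the interaction
# for us, fitting W = b0 + b1*E + b2*S + b3*S*E + error.
# Whichever level is the factor's reference plays the role of the slides'
# women-only baseline; use relevel() if you want the other coding.
fit <- lm(wage ~ experience * gender, data = CPS1985)
summary(fit)
```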

Page 51: Nonparametric Methods Featuring the Bootstrap

Thank you!

References
Berndt, E. R. (1991). The Practice of Econometrics. New York: Addison-Wesley.
Chambers, J. M., Cleveland, W. S., Kleiner, B., and Tukey, P. A. (1983). Graphical Methods for Data Analysis. Belmont, CA: Wadsworth.
Laureysens, I., Blust, R., De Temmerman, L., Lemmens, C., and Ceulemans, R. (2004). Clonal variation in heavy metal accumulation and biomass production in a poplar coppice culture. I. Seasonal variation in leaf, wood and bark concentrations. Environmental Pollution 131: 485–494.