Nonparametric Methods Featuring the Bootstrap
Jon Atwood, November 12, 2013
Laboratory for Interdisciplinary Statistical Analysis
Collaboration: From our website, request a meeting for personalized statistical advice. Coauthor not required, but encouraged.
Great advice right now: meet with LISA before collecting your data.
Short Courses Designed to help graduate students apply statistics in their research
Walk-In Consulting
Monday—Friday* 1-3PM, GLC
*Mon—Thurs during the summer
All services are FREE for VT researchers. We assist with research—not class projects or homework.
LISA helps VT researchers benefit from the use of Statistics
www.lisa.stat.vt.edu
Experimental Design • Data Analysis • Interpreting Results • Grant Proposals • Software (R, SAS, JMP, SPSS...)
Short Course Goals
By the end of this course, you will know…
• The fundamental differences between nonparametric and parametric methods
• In what general situations nonparametric methods would be advantageous
• How to implement nonparametric alternatives to t-tests, ANOVAs and simple linear regression
• What nonparametric bootstrapping is and when you can use it effectively
What we will do today…
• Give a brief overview of nonparametric statistical methods
• Take a look at real world data sets! (and some non-real world data sets)
• Implement the following methods in R…
• Wilcoxon rank-sum and signed-rank tests (alternatives to t-tests)
• Kruskal-Wallis (alternative to one-way ANOVA)
• Spearman correlation (alternative to Pearson correlation)
• Nonparametric bootstrapping
• Bonus topic?
What does nonparametric mean?
Well, first of all, what does parametric mean?
Parametric is a word that can have several meanings.
In statistics, parametric usually means some specific probability distribution is assumed: normal (regression, ANOVA, etc.), exponential (survival).
It may also involve assumptions about parameters (e.g., the variance).
And now a closer look… Regression Analysis
We want to see the relationship between a bunch of predictor variables (x1, x2,…,xp) and a response variable.
For example, suppose we wanted to see the relationship between weight and blood pressure
A usual regression model might look something like this…
[Figure: simple linear regression plot]
Error term: εi ~ N(0, σ²), i.e., the error terms are assumed to come from a normal distribution with mean 0 and variance σ².
Usual methods for testing the significance of our β estimates are based directly on this assumption of normality
If the assumption of normality is false, these methods are invalid
So onto nonparametric…
A statistical method is nonparametric if it does not require most of these assumptions
Data are not assumed to follow some underlying distribution
In some cases, it means that variables do not take predetermined forms (nonparametric regression, for example)
Assumptions nonparametric methods do make:
• Randomness
• Independence
• In multi-sample tests, distributions are of the same shape
Advantages
• Free from distributional assumptions on the data
• Easy to interpret
• Usually computationally straightforward

Disadvantages
• Loss of power when data do follow the usual assumptions
• Reduces the data's information
• Larger sample sizes needed due to lower efficiency
Nonparametric Methods
With that said… Rank Tests
Rank tests are a simple group of nonparametric tests
Instead of using the actual numerical value of an observation, we use its rank: its relative position on the number line compared to the other observations
As an example… here's some data. Basically, these are just numbers I picked out of the sky.
Any ideas on what we should call it? Wigs in a wig shop?
Y Value   Ascending Value   Rank
32             1              1
45             2              2
54             3              3
64             4              4
311            5              5
2000           6              6

What about ties? The positions the tied values occupy in ascending order are added together and divided by the number of ties:

Y Value   Ascending Value   Rank
32             1              1
64             2              (2+3+4)/3 = 3
64             3              (2+3+4)/3 = 3
64             4              (2+3+4)/3 = 3
311            5              5
2000           6              6
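For reference, R's built-in rank() function applies exactly this tie-averaging rule by default; a minimal sketch:

```r
# Ranks with ties averaged (R's default, ties.method = "average")
y <- c(32, 64, 64, 64, 311, 2000)
rank(y)  # 1 3 3 3 5 6 -- the three 64s share the average rank (2+3+4)/3 = 3
```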
Wilcoxon rank tests
These provide alternatives to the standard t-tests, which test mean differences between two groups
T-tests assume that the data are normally distributed, and can be highly sensitive to outliers (we'll see that in an example soon), which may reduce power (the ability to detect significant differences)
Wilcoxon tests alleviate these problems by using ranks, not actual values
First, the Wilcoxon Rank-Sum Test
Alternative to the independent sample t-test (recall that the independent sample t-test compares two independent samples, testing if the means are equal)
The t-test assumes normality of the data, and equal variances (though adjustments can be made for unequal variances)
The Wilcoxon rank-sum test assumes that the two samples follow continuous distributions, as well as the usual randomness and independence
Let’s try it out with some data!
The source of this data is…me, Jon Atwood
Data randomly generated from R
First group: 15 observations generated from a normal distribution with mean 30000 and standard deviation 2500, with one extra (outlier) observation added
Second group: 16 observations generated from a normal distribution with mean 40000 and standard deviation 2500
Group 1: 32897.08, 29383.8, 30539.39, 27448.5, 33712.17, 28166.07, 26613.3, 30105.7, 25786.81, 30761.04, 27427.12, 29060.02, 25593.89, 29815.82, 32453.97, 1000000

Group 2: 40273, 35610.5, 42547.4, 41214.4, 40183, 36460.8, 45040.8, 39248.6, 33901.7, 40985.1, 42412, 38247.2, 36384.2, 41849.5, 38222.8, 39739.4
So what is this test…testing? Remember, for the t-test, we are testing
H0: μ1 = μ2 vs. Ha: μ1 ≠ μ2
where μ1 is the mean of group 1 and μ2 is the mean of group 2.
In the Wilcoxon rank-sum test, we are testing
H0: F(G1) = F(G2) vs. Ha: F(G1) = F(G2 − α)
where F(G1) is the distribution of group 1, F(G2) is the distribution of group 2, and α is the "location shift".
What this really means is that under the alternative, F(G1) and F(G2) are basically the same, except one is shifted to the right of the other by α.
In our problem…

Group 1            Group 2
Value      Rank    Value    Rank
25593.89    1      33902     16
25786.81    2      35611     17
26613.3     3      36384     18
27427.12    4      36461     19
27448.5     5      38223     20
28166.07    6      38247     21
29060.02    7      39249     22
29383.8     8      39739     23
29815.82    9      40183     24
30105.7    10      40273     25
30539.39   11      40985     26
30761.04   12      41214     27
32453.97   13      41850     28
32897.08   14      42412     29
33712.17   15      42547     30
1000000    32      45041     31

Rank sum  152               376
So group 2's rank sum is 224 higher than group 1's
The question is, what is the probability of observing a combination of rank sums in the two groups where the difference is greater than 224?
R will compute p-values using the normal approximation for sample sizes greater than 50
So let's do some R coding! We will view graphs and results in the R program we run (this applies to all examples)
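A minimal sketch of that session, entering the two groups from the slides by hand; the vector names group1 and group2 are my own:

```r
group1 <- c(32897.08, 29383.8, 30539.39, 27448.5, 33712.17, 28166.07,
            26613.3, 30105.7, 25786.81, 30761.04, 27427.12, 29060.02,
            25593.89, 29815.82, 32453.97, 1000000)
group2 <- c(40273, 35610.5, 42547.4, 41214.4, 40183, 36460.8, 45040.8,
            39248.6, 33901.7, 40985.1, 42412, 38247.2, 36384.2,
            41849.5, 38222.8, 39739.4)

# The t-test gets dragged around by the 1,000,000 outlier...
t.test(group1, group2)

# ...while the rank-sum test uses only the ranks
wilcox.test(group1, group2)
```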
Moving to a paired situation…
Suppose we have two samples, but they are paired in some way
Suppose pairs of people are couples, or plants are paired off into different plots, dogs are paired by breeds, or whatever
Then, the rank-sum test will not be the optimal test. Instead, we use the signed rank test
Signed Rank Test
The null hypothesis is that the medians of the two distributions are equal.
Procedure:
1. Calculate all differences between the two groups
2. Take the absolute value of those differences
3. Convert those to ranks (like before)
4. Attach a + or −, depending on whether the difference is positive or negative
5. Add these signed ranks together to get the test statistic W (see the sketch below)
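A minimal sketch of these five steps, assuming hypothetical paired vectors named before and after:

```r
# Signed-rank statistic W computed by hand
d <- after - before               # 1. differences
W <- sum(sign(d) * rank(abs(d)))  # 2-5. rank the |differences|, attach signs, sum
```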
Example: This data comes from Laureysens et al. (2004), who took 13 poplar tree clones and measured aluminum levels in August and November.
Below is a table with the raw data, differences, and ranks.
Clone           Aug    Nov    |Diff|   Rank   S-Rank
Balsam Spire    8.1    11.2    3.1       6       6
Beaupre         10     16.3    6.3       9       9
Hazendans       16.5   15.3    1.2       3      -3
Hoogvorst       13.6   15.6    2         4       4
Raspalje        9.5    10.5    1         2       2
Unal            8.3    15.5    7.2      10      10
Columbia River  18.3   12.7    5.6       8      -8
Fritzi Pauley   13.3   11.1    2.2       5      -5
Trichobel       7.9    19.9   12        11      11
Gaver           8.1    20.4   12.3      12      12
Gibecq          8.9    14.2    5.3       7       7
Primo           12.6   12.7    0.1       1       1
Wolterson       13.4   36.8   23.4      13      13

W = 59
The question is: how many rank-sign combinations could create a more extreme W?
R will give an exact p-value for sample sizes less than 50
Otherwise, a normal approximation is used
R Time!
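A minimal sketch for the poplar data, entered from the table above:

```r
aug <- c(8.1, 10, 16.5, 13.6, 9.5, 8.3, 18.3, 13.3, 7.9, 8.1, 8.9, 12.6, 13.4)
nov <- c(11.2, 16.3, 15.3, 15.6, 10.5, 15.5, 12.7, 11.1, 19.9, 20.4, 14.2, 12.7, 36.8)

# Paired signed-rank test. Note that R reports the statistic V, the sum of
# the positive ranks of aug - nov, rather than the slides' signed-rank sum W.
wilcox.test(aug, nov, paired = TRUE)
```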
More than 2 groups
Suppose we are comparing more than two groups
Normally (no pun intended), we would use an analysis of variance, or ANOVA
But, of course, we need to assume normality for this as well
Kruskal-Wallis
In this situation, we use the Kruskal-Wallis test.
Again, we are converting the actual numeric values to ranks, regardless of group.
Ultimately, we will compute the test statistic H, shown below.
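The slides do not spell the statistic out; for reference, the usual form, with N total observations, g groups, and Ri the rank sum of the ni observations in group i, is

H = [12 / (N(N + 1))] Σ (Ri² / ni) − 3(N + 1),

which is compared against a χ² distribution with g − 1 degrees of freedom.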
An example: We'll use the built-in R data set airquality, daily air quality measurements in New York from May to September 1973 (Chambers et al., 1983).
The question is, does the air quality differ from month to month?
Break out the R!
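A minimal sketch; the slides do not say which measurement is compared, so Ozone is assumed here, with Month as the grouping variable:

```r
# Kruskal-Wallis test: does ozone differ from month to month?
kruskal.test(Ozone ~ Month, data = airquality)
```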
Onto Part 2… In this part of the course, we will do the following things:
1. Look at the Spearman correlation (alternative to the Pearson correlation) between an x variable and a y variable
2. Examine nonparametric bootstrapping, and how it can help us when our data does not approximate normality
Spearman correlation
Suppose you want to discover the association between infant mortality and GDP by country.
Here’s a 2003 scatterplot of the situation
This data comes from www.indexmundi.com
In this example, the Pearson correlation is about −0.63.
Still significant, but it perhaps underestimates the monotone nature of the relationship between GDP and infant mortality rate.
In addition, the Pearson correlation assumes linearity, which is clearly not present here.
We can use the Spearman correlation instead.
This is a correlation coefficient based on ranks, which are computed in the y variable and x variable independently, with sample size n.
To calculate the coefficient, we do the following:
1. Take each xi and each yi and convert them into ranks (the ranks of x and y are assigned independently of each other)
2. Subtract rxi from ryi to get di, the difference in ranks
3. Compute Rs = 1 − (6 Σ di²) / (n(n² − 1))
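A minimal sketch of steps 1-3, assuming hypothetical numeric vectors x and y:

```r
rx <- rank(x); ry <- rank(y)        # 1. rank x and y separately
d  <- ry - rx                       # 2. differences in ranks
n  <- length(x)
1 - 6 * sum(d^2) / (n * (n^2 - 1))  # 3. Rs; equals cor(x, y, method = "spearman") when there are no ties
```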
In this case, the hypotheses are H0: Rs = 0 vs. Ha: Rs ≠ 0
Basically, we are attempting to see whether or not the two variables are independent, or if there is evidence of an association
Let’s return to the GDP data
We will now plot the data in R, and see how to get the Spearman correlation
R tests this by using an exact p-value for small sample sizes, and an approximate t-distribution for larger ones
The test statistic in that case would follow a t-distribution with n-2 degrees of freedom
Turn to R now!!!
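A minimal sketch of that session; gdp and infant.mortality are hypothetical names for the indexmundi.com variables:

```r
plot(gdp, infant.mortality)

cor(gdp, infant.mortality)                            # Pearson, about -0.63
cor.test(gdp, infant.mortality, method = "spearman")  # Spearman Rs with its p-value
```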
Nonparametric Bootstrapping
Suppose you are fitting a multiple linear regression model:
Yi = β0 + β1x1i + β2x2i + … + βkxki + εi
Recall that εi ~ N(0, σ²). But what if we have reason to suppose this assumption is not met?
Then the regular way of testing the significance of our coefficients is invalid.
So what do we do?
Depending on the situation, there are several options
The one we will talk about today is called nonparametric bootstrapping
Bootstrapping is a resampling method, where we take the data and randomly sample from it to draw inferences
There are several types of bootstrapping. We will focus on the simpler, nonparametric type
How do we do nonparametric bootstrapping?
We assign each of the n observations a 1/n probability of being selected
We then take a random sample with replacement, usually of size n
We compute estimated coefficients based on this sample
We repeat the process many times, say 10,000, or 100,000, or 100,000,000,000,000…
For example, in regression: suppose we want to test whether or not β1 = 0.
With our newly generated sample of (10,000, or whatever) β1 coefficients, for a 95 percent confidence interval we would look at the 250th lowest value and the 9,750th value (the 250th highest).
If this interval does not contain 0, we conclude that there is evidence that β1 is not equal to 0
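A minimal sketch of the whole procedure, assuming a hypothetical data frame dat with columns y, x1, and x2:

```r
set.seed(1)
B <- 10000
n <- nrow(dat)
boot_b1 <- numeric(B)

for (b in 1:B) {
  idx <- sample(n, size = n, replace = TRUE)  # resample rows with replacement
  boot_b1[b] <- coef(lm(y ~ x1 + x2, data = dat[idx, ]))["x1"]
}

# 95 percent percentile interval: in effect the 250th lowest
# and 250th highest of the 10,000 bootstrapped coefficients
quantile(boot_b1, c(0.025, 0.975))
```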
Possible issue? Theoretically, this method comes out of the idea that the distribution of a population is approximated by the distribution of its sample.
This assumption becomes less and less valid the smaller your sample size
Example: This data was taken from The Practice of Econometrics by E. R. Berndt (1991).
We will be regressing wage on education and experience.
We will look at graphs to see whether the residuals are approximately normal or not.
R-Code Now!
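A minimal sketch of the residual check, assuming a hypothetical data frame wages with columns wage, education, and experience:

```r
fit <- lm(wage ~ education + experience, data = wages)

# Graphical checks of residual normality
hist(resid(fit))
qqnorm(resid(fit)); qqline(resid(fit))
```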
In Summary
We now understand that certain parametric methods, like t-tests and regression, depend on assumptions that may or may not be met.
We know that nonparametric methods are methods that do not make distributional assumptions, and are therefore applicable when data do not meet these assumptions.
We can implement these methods in R, in case our data does not meet these assumptions.
Bonus Topic!
Indicator Variables in Regression: used when we have categorical variables as predictors in regression
As a start, let’s re-examine the wage data
Here, we will drop education and just look at experience and sex
The full model in a case like this would look like
Wi = β0 + β1Ei + β2Si + β3Si·Ei + εi
where W = wage, E = experience, S = 1 if sex = "male" and 0 otherwise, and ε is the error term.
Separate regressions for different sexes:
For women, the reference group: Wia = β0a + β1aEi + εia
For men, the non-reference group: Wib = β0b + β1bEi + εib
With that in mind… Let's return to the full model:
Wi = β0 + β1Ei + β2Si + β3Si·Ei + εi
Suppose we want to see how women do in this model. Well, we can set S = 0.
We are left with Wi = β0 + β1Ei + εi. But, since this is with women, this is equivalent to the women-only model, Wia = β0a + β1aEi + εia.
Thus…
β0 = β0a and β1Ei = β1aEi
Now, look at "male". We set S = 1:
Wi = β0 + β1Ei + β2·1 + β3·1·Ei + εi
   = β0 + β1Ei + β2 + β3Ei + εi
   = β0a + β1aEi + β2 + β3Ei + εi
   = (β0a + β2) + (β1a + β3)Ei + εi
   = β0b + β1bEi + εib
So…
β2 = β0b − β0a and β3 = β1b − β1a
Example in R!
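A minimal sketch, again assuming the hypothetical wages data frame, now with a sex factor whose reference level is "female":

```r
# The interaction term lets the experience slope differ by sex
fit <- lm(wage ~ experience * sex, data = wages)
summary(fit)
# coefficient on sexmale            -> β2 (intercept shift, β0b - β0a)
# coefficient on experience:sexmale -> β3 (slope shift,     β1b - β1a)
```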
Thank you!

References
Berndt, E. R. (1991). The Practice of Econometrics. NY: Addison-Wesley.
Chambers, J. M., Cleveland, W. S., Kleiner, B., & Tukey, P. A. (1983). Graphical Methods for Data Analysis. Belmont, CA: Wadsworth.
Laureysens, I., Blust, R., De Temmerman, L., Lemmens, C., & Ceulemans, R. (2004). Clonal variation in heavy metal accumulation and biomass production in a poplar coppice culture. I. Seasonal variation in leaf, wood and bark concentrations. Environmental Pollution, 131, 485-494.