
Summary of Remainder

Harry R. Erwin, PhD

School of Computing and Technology

University of Sunderland

Resources

• Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley.

• Freund, RJ, and WJ Wilson (1998) Regression Analysis. Academic Press.

• Gentle, JE (2002) Elements of Computational Statistics. Springer.

• Gonick, L., and Woollcott Smith (1993) A Cartoon Guide to Statistics. HarperResource (for fun).

Topics

• Multiple Regression

• Contrasts

• Count Data

• Proportion Data

• Survival Data

• Binary Response

• Course Summary

Multiple Regression

• Two or more continuous explanatory variables.

• Your problems are not restricted to choosing the order of the model: you often lack enough data to examine all the potential interactions and higher-order effects.

– To explore the possibility of a third-order interaction term with three explanatory variables (A:B:C) requires about 3 × 8 = 24 data values.

– If there's potential for curvature, you need 3 × 3 = 9 more data values to pin that down.

• Be selective. If you are considering an interaction term, you have to consider all the lower-order interactions and the individual explanatory variables in it.
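As a rough sketch of that workflow (the data frame and variable names here are invented), you might fit the full factorial model and then simplify it, letting R respect the rule that lower-order terms stay while their interactions remain:

    ## Hypothetical data: response y and three continuous predictors A, B, C
    set.seed(1)
    dat <- data.frame(A = runif(30), B = runif(30), C = runif(30))
    dat$y <- 2 + 3 * dat$A - dat$B + rnorm(30)

    full    <- lm(y ~ A * B * C, data = dat)   # includes A:B:C and all lower-order terms
    reduced <- step(full, trace = FALSE)       # drops non-significant terms by AIC,
                                               # never removing a term below a kept interaction
    summary(reduced)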

Issues to Consider

• Which explanatory variables to include.

• Curvature in the response to explanatory variables.

• Interactions between explanatory variables. (High order interactions tend to be rare.)

• Correlation between explanatory variables.

• Over-parameterization. (Avoid!)
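A quick screen for several of these issues, assuming a hypothetical data frame dat whose first column is the response:

    pairs(dat)       # pairwise scatterplots: look for curvature and collinearity
    cor(dat[, -1])   # correlation matrix of the explanatory variables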

Contrasts

• Contrasts are the basis of hypothesis testing and model simplification in ANOVA.

• When you have more than two levels in a categorical variable, you need to know which levels are meaningful and which can be combined.

• Sometimes you know which ones to combine and sometimes not.

• First do the basic ANOVA to determine whether there are significant differences to be investigated.
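A minimal sketch of that first step, assuming a response y and a multi-level factor treatment in a data frame dat:

    model <- aov(y ~ treatment, data = dat)
    summary(model)   # a significant F ratio justifies investigating contrasts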

Model Reduction in ANOVA

• Basically how you reduce a model in ANOVA is by combining factor levels.

• Define your contrasts based on the science:

– Treatment versus control.

– Similar treatments versus other treatments.

– Treatment differences within similar treatments.

• You can also aggregate factor levels in steps (see the sketch below).

• See me if you need to do this. R can automate the process.
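For instance, a minimal sketch assuming a four-level factor treatment (the level names are invented): assigning duplicate level names merges factor levels, and anova() compares the two fits.

    dat$treat2 <- dat$treatment
    levels(dat$treat2) <- c("control", "biocide", "biocide", "mechanical")

    m1 <- aov(y ~ treatment, data = dat)
    m2 <- aov(y ~ treat2, data = dat)
    anova(m1, m2)   # a non-significant difference justifies the simpler model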

Count Data

• With frequency data, we know how often something happened, but not how often it didn’t happen.

• Linear regression assumes constant variance and normal errors. This is not appropriate for count data:

1. Counts are non-negative.

2. Response variance usually increases with the mean.

3. Errors are not normally distributed.

4. Zeros are hard to transform.

Handling Count Data in R

• Use a glm with family=poisson.

– This sets the errors to Poisson, so the variance is proportional to the mean.

– This sets the link to log, so fitted values are positive.

• If you have overdispersion (residual deviance greater than residual degrees of freedom), use family=quasipoisson instead.
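Putting that advice together (the response and predictor names are placeholders):

    model <- glm(count ~ treatment, family = poisson, data = dat)
    summary(model)

    ## Overdispersion check: residual deviance much greater than residual df?
    deviance(model) / df.residual(model)   # values well above 1 suggest overdispersion

    ## If so, refit with quasipoisson errors
    model2 <- glm(count ~ treatment, family = quasipoisson, data = dat)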

Contingency Tables

• There is a risk in aggregating data over important explanatory variables (nuisance variables).

• So check the significance of the real part of the model before you eliminate nuisance variables.
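A sketch of that check, with hypothetical factors A and B and a nuisance variable block: fit the model with the nuisance variable retained, and test the interaction of interest before collapsing over block.

    model <- glm(count ~ block + A * B, family = poisson, data = tab)
    anova(model, test = "Chisq")   # is A:B significant with block still in the model?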

Frequencies and Proportions

• With frequency data, you know how often something happened, but not how often it didn’t happen.

• With proportion data, you know both.

• Applied to:

– Mortality and infection rates

– Response to clinical treatment

– Voting

– Sex ratios

– Proportional response to experimental treatments

Working With Proportions

• Traditionally, proportion data was modelled by using the percentage as the response variable.

• This is bad for four reasons:

1. Errors are not normally distributed.

2. The variance is not constant.

3. The response is bounded by 0.0 and 1.0.

4. The size of the sample, n, is lost.

Testing Proportions

• To compare a single binomial proportion to a constant, use binom.test:

– y <- c(15, 5)   # 15 successes, 5 failures

– binom.test(y, p = 0.5)

– y <- c(14, 6)

– binom.test(y, p = 0.5)

• To compare two samples, use prop.test:

– prop.test(c(14, 6), c(20, 20))   # 14/20 versus 6/20 successes

• Only use glm methods for complex models:

– Regression tables

– Contingency tables

GLM Models for Proportions

• Start with a generalized linear model (glm).

• family = binomial (i.e., an unfair coin flip).

• Use two vectors, one of the success counts and the other of the failure counts.

• Number of failures + number of successes = binomial denominator, n.

• y <- cbind(successes, failures)

• model <- glm(y ~ whatever, binomial)
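A runnable version of those two lines, with invented counts for three groups:

    successes <- c(12, 18, 6)
    failures  <- c( 8,  2, 14)
    group     <- factor(c("a", "b", "c"))

    y <- cbind(successes, failures)   # row sums are the binomial denominators
    model <- glm(y ~ group, family = binomial)
    summary(model)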

How R Handles Proportions

• Weighted regression (weighted by the individual sample sizes).

• Logit link to ensure linearity.

• For percentage cover data (e.g., survey data):

– Do an arcsine transformation, followed by conventional modelling (normal errors, constant variance).

• For percentage change in a continuous measurement (e.g., growth), see the sketch below:

– ANCOVA with final weight as the response and initial weight as a covariate, or

– Use the relative growth rate, log(final/initial), as the response.

– Both produce normal errors.
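A sketch of the two growth options (the vectors final, initial, and treatment are hypothetical):

    ## Option 1: ANCOVA with initial weight as a covariate
    m1 <- lm(final ~ initial + treatment)

    ## Option 2: relative growth rate as the response
    rgr <- log(final / initial)
    m2 <- lm(rgr ~ treatment)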

Count Data in Proportions

• R supports the traditional arcsine and probit transformations:

– Arcsine makes the error distribution normal.

– Probit linearises the relationship between percentage mortality and log(dose).

• It is usually better to use the logit transformation and assume you have binomial data.
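For reference, the three transformations in base R (p is a hypothetical vector of proportions):

    p <- c(0.10, 0.25, 0.50, 0.90)
    asin(sqrt(p))   # arcsine (angular) transformation
    qnorm(p)        # probit transformation
    qlogis(p)       # logit transformation, log(p / (1 - p))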

Death and Failure Data

• Applications include:

– Time to death

– Time to failure

– Time to event

• This is a useful way to analyse performance when the process leading to a goal is complex, for example when a robot is performing a task.

Problems with Survival Data

• Non-constant variance, so standard methods are inappropriate.

• If errors are gamma distributed, the variance is proportional to the square of the mean.

• Use a glm with Gamma errors.
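A minimal sketch, assuming positive times-to-event in time; using a log link (rather than the default inverse link) keeps the fitted values positive:

    model <- glm(time ~ treatment, family = Gamma(link = "log"), data = dat)
    summary(model)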

How do we deal with events that don’t happen during the study?

• In those trials, we don’t know when the event would occur. We just know the time would be greater than the end of the trial. Those trials are censored.

• The methods for handling censored data make up the field of survival analysis.

• (I used survival analysis in my PhD work. My wife does survival analysis for cancer data.)
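With the survival package (shipped with standard R distributions), censored trials are flagged in a status indicator; the variable names below are hypothetical:

    library(survival)

    ## status = 1 if the event was observed, 0 if the trial was censored
    fit <- survfit(Surv(time, status) ~ group, data = dat)
    plot(fit)                                          # Kaplan-Meier curves by group
    survdiff(Surv(time, status) ~ group, data = dat)   # log-rank test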

Binary Response

• Very common:

– Dead or alive

– Occupied or empty

– Male or female

– Employed or unemployed

• Response variable is 0 or 1.

• R assumes a binomial trial with sample size 1.

When to use Binary Response Data

• Do a binary response analysis only when you have unique values of one or more explanatory variables for each and every possible individual case.

• Otherwise lump: aggregate to the point where you have unique values. Either:

– Analyse the data as a contingency table using Poisson errors, or

– Decide which explanatory variable is key, express the data as proportions, recode as a count of a two-level factor, and assume binomial errors.

Modelling Binary Response

• Single vector with the response variable

• Use glm with family = binomial.

• Think about a complementary log-log (cloglog) link instead of logit. Use the one that gives less deviance (see the sketch below).

• Fit the usual way.

• Test significance using χ².
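A sketch of the link comparison (y is a hypothetical 0/1 vector, x a predictor):

    m1 <- glm(y ~ x, family = binomial(link = "logit"))
    m2 <- glm(y ~ x, family = binomial(link = "cloglog"))

    deviance(m1)
    deviance(m2)   # keep the link with the lower residual deviance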

Course Summary

• We’ve had an introduction to thinking critically about data.

• We’ve seen how to use a typical statistical analysis system (R).

• We’ve looked at our projects critically.

• We’ve discussed hypothesis testing.

• We’ve looked at statistical modelling.

Statistical Activities

• Data collection (ideally the statistician has a say on how they are collected)

• Description of a dataset– Averages

– Spreads

– Extreme points

• Inference within a model or collection of models

• Model selection

Why Model?

• Usually you do statistics to explore the structure of data. The questions you might ask are rather open-ended. Your understanding is facilitated by a model.

• A model embodies what you currently know about the data. You can formulate it either as a data-generating process or a set of rules for processing the data.

Structure-in-the-data

• Of most interest…, for example:

– Modes

– Gaps

– Clusters

– Symmetry

– Shape

– Deviations from normality

• Plot the data to understand this.

Visualization

• Multiple views are necessary.

• Be able to zoom in on the data, as a few points can obscure the interesting structure.

• Scaling of the axes may be necessary, since our eyes are not perfect tools for detecting structure.

• Watch out for time-ordered or location-ordered data, particularly if time or location are not explicitly reported.

Plots

• Use simple plots to start with.

• Watch for rounded data, shown by horizontal strata in a plot. That often signals other problems.
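Starting points in base R (x and y are hypothetical vectors):

    plot(x, y)   # horizontal strata here suggest rounded data
    hist(y)      # modes, gaps, skew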

Bottom Line

• I am available for consulting (free).

• E-mail: harry.erwin@sunderland.ac.uk

• Phone: 515-3227 or extension 3227 from university phones.

• Plan on a meeting of about an hour, to allow time to think intelligently about your data.
