
Summary of Remainder

Harry R. Erwin, PhD

School of Computing and Technology

University of Sunderland

Resources

• Crawley, MJ (2005) Statistics: An Introduction Using R. Wiley.

• Freund, RJ, and WJ Wilson (1998) Regression Analysis, Academic Press.

• Gentle, JE (2002) Elements of Computational Statistics. Springer.

• Gonick, L., and Woollcott Smith (1993) A Cartoon Guide to Statistics. HarperResource (for fun).

Topics

• Multiple Regression

• Contrasts

• Count Data

• Proportion Data

• Survival Data

• Binary Response

• Course Summary

Multiple Regression

• Two or more continuous explanatory variables

• Your problems are not restricted to order. You often lack enough data to examine all the potential interactions and higher-order effects.

– To explore the possibility of a third order interaction term with three explanatory variables (A:B:C) requires about 3 × 8 = 24 data values.

– If there’s potential for curvature, you need 3 × 3 = 9 more data values to pin that down.

• Be selective. If you are considering an interaction term, you have to consider all the lower-order interactions and the individual explanatory variables in it.
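The parameter counts behind those data requirements can be checked in R. This is a minimal sketch on simulated data; the variable names and the reading of the slide's arithmetic as "about 3 data values per parameter" are assumptions, not something the slides state outright.

```r
# Hypothetical data: three continuous explanatory variables A, B and C
set.seed(1)
n <- 30
d <- data.frame(A = runif(n), B = runif(n), C = runif(n))
d$y <- 1 + 2 * d$A - d$B + 0.5 * d$C + rnorm(n, sd = 0.2)

# A*B*C expands to all main effects, all two-way interactions and A:B:C
full <- lm(y ~ A * B * C, data = d)
n_full <- length(coef(full))               # intercept + 3 + 3 + 1 parameters

# One quadratic term per variable to allow for curvature
curved <- update(full, . ~ . + I(A^2) + I(B^2) + I(C^2))
n_extra <- length(coef(curved)) - n_full   # the additional parameters

# Assumed rule of thumb: about 3 data values per parameter
c(full = 3 * n_full, curvature = 3 * n_extra)
```

Writing A*B*C in the formula also illustrates the "be selective" point: R includes every lower-order term of the interaction automatically.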

Issues to Consider

• Which explanatory variables to include.

• Curvature in the response to explanatory variables.

• Interactions between explanatory variables. (High order interactions tend to be rare.)

• Correlation between explanatory variables.

• Over-parameterization. (Avoid!)

Contrasts

• Contrasts are the basis of hypothesis testing and model simplification in ANOVA

• When you have more than two levels in a categorical variable, you need to know which levels are meaningful and which can be combined.

• Sometimes you know which ones to combine and sometimes not.

• First do the basic ANOVA to determine whether there are significant differences to be investigated.

Model Reduction in ANOVA

• Basically how you reduce a model in ANOVA is by combining factor levels.

• Define your contrasts based on the science:

– Treatment versus control

– Similar treatments versus other treatments

– Treatment differences within similar treatments

• You can also aggregate factor levels in steps.

• See me if you need to do this. R can automate the process.
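One way to combine factor levels by hand is sketched below. The data are simulated and the level names (control, t1, t2, t3) are hypothetical; this is one possible approach, not necessarily the automated procedure mentioned above.

```r
set.seed(2)
# Hypothetical experiment: a control and three treatments, 10 replicates each
treatment <- factor(rep(c("control", "t1", "t2", "t3"), each = 10))
means <- c(control = 10, t1 = 12, t2 = 12, t3 = 15)
response <- rnorm(40, mean = means[as.character(treatment)], sd = 1)

# Full model: one mean per factor level
full <- aov(response ~ treatment)

# Reduced model: merge the two similar treatments t1 and t2 into one level
merged <- treatment
levels(merged)[levels(merged) %in% c("t1", "t2")] <- "t1_t2"
reduced <- aov(response ~ merged)

# Compare the two fits; a non-significant F test supports the simpler model
comparison <- anova(reduced, full)
print(comparison)
```

Which levels to merge should come from the science, as in the contrasts listed above, rather than from scanning the data for similar means.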

Count Data

• With frequency data, we know how often something happened, but not how often it didn’t happen.

• Linear regression assumes constant variance and normal errors. This is not appropriate for count data:

1. Counts are non-negative.

2. Response variance usually increases with the mean.

3. Errors are not normally distributed.

4. Zeros are hard to transform.

Handling Count Data in R

• Use a glm model with family=poisson.

– This sets errors to Poisson, so variance is proportional to the mean.

– This sets link to log, so fitted values are positive.

• If you have overdispersion (residual deviance greater than residual degrees of freedom), use family=quasipoisson instead.
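A minimal sketch of this workflow on simulated counts follows. The variable names and the 1.5 cut-off for the dispersion ratio are illustrative assumptions only.

```r
set.seed(3)
# Hypothetical count data whose mean rises with a continuous variable x
x <- runif(100, 0, 2)
counts <- rpois(100, lambda = exp(0.5 + 0.8 * x))

# Poisson errors with the default log link
model <- glm(counts ~ x, family = poisson)

# Rough overdispersion check: residual deviance over residual df
dispersion <- deviance(model) / df.residual(model)

# If the ratio is well above 1, refit with quasipoisson; the coefficients
# stay the same but the standard errors are scaled up
if (dispersion > 1.5) {   # arbitrary threshold, for illustration only
  model <- update(model, family = quasipoisson)
}
summary(model)
```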

Contingency Tables

• There is a risk of data aggregation over important explanatory variables (nuisance variables).

• So check the significance of the real part of the model before you eliminate nuisance variables.

Frequencies and Proportions

• With frequency data, you know how often something happened, but not how often it didn’t happen.

• With proportion data, you know both.

• Applied to:

– Mortality and infection rates

– Response to clinical treatment

– Voting

– Sex ratios

– Proportional response to experimental treatments

Working With Proportions

• Traditionally, proportion data was modelled by using the percentage as the response variable.

• This is bad for four reasons:

1. Errors are not normally distributed.

2. Non-constant variance.

3. Response is bounded by 0.0 and 1.0.

4. The size of the sample, n, is lost.

Testing Proportions

• To compare a single binomial proportion to a constant, use binom.test.

– y<-c(15,5)

– binom.test(y,0.5)

– y<-c(14,6)

– binom.test(y,0.5)

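• To compare two samples, use prop.test, giving the success counts first and the sample sizes second. Each count must not exceed its sample size.

– prop.test(c(14,6),c(20,20))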

• Only use glm methods for complex models:

– Regression tables

– Contingency tables

GLM Models for Proportions

• Start with a generalised linear model (glm).

• family = binomial (i.e., unfair coin flip)

• Use two vectors, one of the success counts and the other of the failure counts.

• number of failures + number of successes = binomial denominator, n

• y<-cbind(successes, failures)

• model<-glm(y~whatever,binomial)
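The recipe above can be run end to end on a small made-up data set. The dose-response numbers below are hypothetical, and log(dose) stands in for "whatever".

```r
# Hypothetical dose-response data: 20 individuals tested at each dose
dose      <- c(1, 2, 4, 8, 16)
successes <- c(2, 5, 9, 14, 18)
failures  <- 20 - successes          # successes + failures = n

# Two-column response: successes first, failures second
y <- cbind(successes, failures)
model <- glm(y ~ log(dose), family = binomial)

# Fitted values are proportions, so they stay between 0 and 1
fitted(model)
```

Passing both columns keeps the sample size n in the model, which is what is lost when the percentage alone is used as the response.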

How R Handles Proportions

• Weighted regression (weighted by the individual sample sizes).

• logit link to ensure linearity

• If percentage cover data (e.g., survey data)

– Do an arc-sine transformation, followed by conventional modelling (normal errors, constant variance).

• If percentage change in a continuous measurement (e.g. growth)

– ANCOVA with final weight as the response and initial weight as a covariate, or

– Use the relative growth rate (log(final/initial)) as response.

– Both produce normal errors.

Count Data in Proportions

• R supports the traditional arcsine and probit transformations:

– arcsine makes the error distribution normal

– probit linearises the relationship between percentage mortality and log(dose)

• It is usually better to use the logit transformation and assume you have binomial data.

Death and Failure Data

• Applications include:

– Time to death

– Time to failure

– Time to event

• This is a useful way to analyse performance when the process leading to a goal is complex, for example when a robot is performing a task.

Problems with Survival Data

• Non-constant variance, so standard methods are inappropriate.

• If errors are gamma distributed, the variance is proportional to the square of the mean.

• Use a glm with Gamma errors.
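A minimal sketch of a Gamma glm on simulated, uncensored failure times is shown below. The group labels and rates are hypothetical, and the log link is a choice made here to keep fitted times positive (the default link for Gamma in R is the inverse).

```r
set.seed(5)
# Hypothetical times to failure for two groups, with no censoring
group <- factor(rep(c("a", "b"), each = 25))
time  <- rgamma(50, shape = 2, rate = ifelse(group == "a", 0.2, 0.1))

# Gamma errors: variance proportional to the square of the mean
model <- glm(time ~ group, family = Gamma(link = "log"))
summary(model)
```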

How do we deal with events that don’t happen during the study?

• In those trials, we don’t know when the event would occur. We just know the time would be greater than the end of the trial. Those trials are censored.

• The methods for handling censored data make up the field of survival analysis.

• (I used survival analysis in my PhD work. My wife does survival analysis for cancer data.)
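Censoring can be encoded in R with the survival package, which ships with standard R installations. The sketch below uses simulated data; the trial length of 10 and the exponential model are assumptions chosen only for illustration.

```r
library(survival)   # recommended package, normally installed with R

set.seed(6)
# Hypothetical trial that stops at time 10; later events are never observed
true_time <- rexp(40, rate = 0.15)
group     <- factor(rep(c("a", "b"), each = 20))
end       <- 10
time      <- pmin(true_time, end)          # the time actually recorded
status    <- as.numeric(true_time <= end)  # 1 = event seen, 0 = censored

# Surv() pairs each recorded time with its censoring indicator
fit <- survfit(Surv(time, status) ~ group)
print(fit)

# One parametric option: exponential survival regression
model <- survreg(Surv(time, status) ~ group, dist = "exponential")
summary(model)
```

Censored trials still contribute information: each one says the event time exceeded 10, which is why they should not simply be dropped.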

Binary Response

• Very common:

– dead or alive

– occupied or empty

– male or female

– employed or unemployed

• Response variable is 0 or 1.

• R assumes a binomial trial with sample size 1.

When to use Binary Response Data

• Do a binary response analysis only when you have unique values of one or more explanatory variables for each and every possible individual case.

• Otherwise lump: aggregate to the point where you have unique values. Either:

– Analyse the data as a contingency table using Poisson errors, or

– Decide which explanatory variable is key, express the data as proportions, recode as a count of a two-level factor, and assume binomial errors.

Modelling Binary Response

• Single vector with the response variable

• Use glm with family = binomial

• Think about a complementary log-log link (cloglog in R) instead of logit. Use the one that gives less deviance.

• Fit the usual way.

• Test significance using χ².
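The steps on this slide can be sketched as follows, using simulated 0/1 data with hypothetical names. R's binomial family provides a complementary log-log link (cloglog), which is used here as the alternative to logit.

```r
set.seed(7)
# Hypothetical binary outcome (0 or 1) with one continuous explanatory variable
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Single response vector; binomial family with two candidate links
logit_fit   <- glm(y ~ x, family = binomial)                   # default logit
cloglog_fit <- glm(y ~ x, family = binomial(link = "cloglog"))

# Prefer the link with the smaller residual deviance
c(logit = deviance(logit_fit), cloglog = deviance(cloglog_fit))

# Chi-squared test of the effect of x against the null model
anova(logit_fit, test = "Chisq")
```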

Course Summary

• We’ve had an introduction to thinking critically about data.

• We’ve seen how to use a typical statistical analysis system (R).

• We’ve looked at our projects critically.

• We’ve discussed hypothesis testing.

• We’ve looked at statistical modelling.

Statistical Activities

• Data collection (ideally the statistician has a say on how they are collected)

• Description of a dataset

– Averages

– Spreads

– Extreme points

• Inference within a model or collection of models

• Model selection

Why Model?

• Usually you do statistics to explore the structure of data. The questions you might ask are rather open-ended. Your understanding is facilitated by a model.

• A model embodies what you currently know about the data. You can formulate it either as a data-generating process or a set of rules for processing the data.

Structure-in-the-data

• Of most interest, for example:

– Modes

– Gaps

– Clusters

– Symmetry

– Shape

– Deviations from normality

• Plot the data to understand this.

Visualization

• Multiple views are necessary.

• Be able to zoom in on the data as a few points can obscure the interesting structure.

• Scaling of the axes may be necessary, since our eyes are not perfect tools for detecting structure.

• Watch out for time-ordered or location-ordered data, particularly if time or location are not explicitly reported.

Plots

• Use simple plots to start with.

• Watch for rounded data—shown by horizontal strata in the data. That often signals other problems.

Bottom Line

• I am available for consulting (free).

• E-mail: [email protected]

• Phone: 515-3227 or extension 3227 from university phones.

• Plan on about an hour meeting to allow time to think intelligently about your data.

