Bootstraps and Jackknives

Hal Whitehead

BIOL4062/5062

• Confidence in estimators

• Why use bootstraps or jackknives?

• The jackknife

• The parametric bootstrap

• The non-parametric bootstrap (“The bootstrap”)

Estimation without a measure of confidence (standard error, confidence interval) has little value

Confidence in estimates: Traditional approach

• DATA + Biological model

• Estimator (Statistic) + Statistical model

• Confidence in estimator?

Confidence in estimates: Traditional approach

e.g. What is the sex ratio of a vole population?

Trap: 12 males, 15 females

Estimated ratio (proportion male): 12/(12+15) = 0.444

Using the binomial distribution:

SE = √[0.444 × (1 − 0.444)/(12+15)] = 0.096

So: Sex ratio is estimated to be 0.444 (SE 0.096)
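The vole calculation can be checked with a few lines of Python. This is a minimal sketch; the function name `proportion_se` is made up for illustration and is not from the slides.

```python
import math


def proportion_se(successes: int, n: int) -> tuple[float, float]:
    """Return the estimated proportion and its binomial standard error."""
    p = successes / n
    se = math.sqrt(p * (1 - p) / n)  # note the square root
    return p, se


males, females = 12, 15
p_hat, se = proportion_se(males, males + females)
print(f"sex ratio = {p_hat:.3f} (SE {se:.3f})")  # sex ratio = 0.444 (SE 0.096)
```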

e.g. Asymmetry of size among nestlings in nests of 6

Measure: mean difference between the size of each nestling and that of its most similar neighbour (the nestmate closest to it in size)

Sizes {1.2 4.3 4.7 3.2 6.1 1.3} =>

Differences [0.1 0.4 0.4 1.1 1.4 0.1], mean = 0.58

But what confidence have we in this?
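The asymmetry measure is easy to compute even though no textbook formula gives its standard error. The sketch below reproduces the slide's numbers, assuming "most similar neighbour" means the nestmate closest in size; the function name is hypothetical.

```python
def nearest_size_differences(sizes):
    """For each nestling, the size difference to the nestmate closest in size."""
    diffs = []
    for i, x in enumerate(sizes):
        others = sizes[:i] + sizes[i + 1:]  # every nestling except this one
        diffs.append(min(abs(x - y) for y in others))
    return diffs


nest = [1.2, 4.3, 4.7, 3.2, 6.1, 1.3]
diffs = nearest_size_differences(nest)
asymmetry = sum(diffs) / len(diffs)
print([round(d, 1) for d in diffs], round(asymmetry, 2))
```

Because the statistic is just a function of the data, it can be handed unchanged to a jackknife or bootstrap routine to obtain a measure of confidence.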

Confidence in estimator: Mean distance between animals

In a small population:

What is the expected distance between any two animals?

Estimate is: mean of distances between all pairs of animals

What is confidence in this estimate?

No easy formula: the pairwise distances are not independent (each animal contributes to many pairs)

Use Bootstraps and Jackknives when:

• No clear biological model

• Deriving statistical model – very difficult, impossible, or tedious

• Statistical model too complicated to be useful

• Model may not be quite valid

• Accurate measure of precision under statistical model only possible with large n

The Jackknife

• Data D = {X_1, X_2, X_3, ..., X_n} => statistic s

• Jackknife replicates miss out units (or groups of units) in turn:

– J_1 = {X_2, X_3, ..., X_n} => statistic s_{-1} (missing unit 1)

– J_2 = {X_1, X_3, ..., X_n} => statistic s_{-2} (missing unit 2)

– etc.

• Convert into pseudovalues:

– φ_1 = n·s − (n−1)·s_{-1}

– φ_2 = n·s − (n−1)·s_{-2}

– etc.

The Jackknife

• The Jackknifed Estimate of s is then:

– s_J = mean(φ_1, ..., φ_n)

• SE(s) = standard error of the mean of the pseudovalues = SD(φ_1, ..., φ_n)/√n

The Jackknife

• Jackknifed Estimate reduces bias

• Jackknife SE “rough and ready”

– usually “conservative” (overestimates SE)

• Jackknife on blocks of units, if data not independent

• Assumes normality for confidence intervals

Correlation between gill weight and body weight in 12 crabs

r = 0.865

Jackknife r = 0.878 [Mean φ_i]

SE 0.0768 [SD(φ_i)/√12]

Gill(mg)  Body(g)  r_{-i}  φ_i
159       14.40    0.888   0.607
179       15.20    0.884   0.656
100       11.30    0.892   0.570
45         2.50    0.830   1.249
384       22.70    0.811   1.452
230       14.90    0.863   0.879
100        1.41    0.875   0.751
320       15.81    0.872   0.779
80         4.19    0.845   1.078
220       15.39    0.867   0.843
320       17.25    0.858   0.940
210        9.52    0.877   0.725
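A rough Python sketch of the jackknife applied to the crab correlation follows. It assumes the gill weights are the same 12 values (in mg) listed for the median example, paired in order with the body weights in the table; the function names `pearson_r`, `jackknife` and `crab_r` are made up for illustration.

```python
import math
import statistics


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def jackknife(units, statistic):
    """Leave-one-out jackknife: returns s, jackknifed estimate, SE, pseudovalues."""
    n = len(units)
    s = statistic(units)
    pseudo = []
    for i in range(n):
        s_minus_i = statistic(units[:i] + units[i + 1:])  # omit unit i
        pseudo.append(n * s - (n - 1) * s_minus_i)        # pseudovalue phi_i
    estimate = statistics.fmean(pseudo)
    se = statistics.stdev(pseudo) / math.sqrt(n)          # SD(phi) / sqrt(n)
    return s, estimate, se, pseudo


# (gill weight in mg, body weight in g); gill values assumed as described above
crabs = [(159, 14.40), (179, 15.20), (100, 11.30), (45, 2.50),
         (384, 22.70), (230, 14.90), (100, 1.41), (320, 15.81),
         (80, 4.19), (220, 15.39), (320, 17.25), (210, 9.52)]


def crab_r(units):
    gill, body = zip(*units)
    return pearson_r(gill, body)


r, r_jack, se, pseudo = jackknife(crabs, crab_r)
print(f"r = {r:.3f}, jackknife r = {r_jack:.3f} (SE {se:.4f})")
```

Passing the statistic as a function keeps the jackknife routine generic, so the same code could be reused for any estimator computed from a list of units.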

Bootstraps

Parametric Bootstrap

• Assume Data produced by Model with some Parameters unknown, which need to be estimated:

– Model => Data => Parameter estimates (s)

• The Bootstrap process:

– Model + Parameter estimates (s) => Random data => Bootstrap replicate estimates (s*)

• Distribution of the Bootstrap replicate estimates (the s*’s) gives the distribution, confidence intervals and standard errors of s (plus an indicator of bias)

• Usually use 100-10,000 bootstrap replicates

Parametric Bootstrap: an example: Mark-Recapture Estimate

Mark 25 animals

Recapture 46, of which 12 Marked

What is population size?

“Petersen” estimate is 25 × 46/12 = 95.8

What is confidence in this estimate, expected bias?


• Mark 25 animals; Recapture 46, 12 Marked

• “Petersen” estimate is 25 × 46/12 = 95.8

• What is confidence, expected bias?

• Parametric Bootstrap Replicates:

– 96 Animals, mark 25, recapture 46

– How many marked?

– From simulation (m_s =): 9 14 14 9 14 13 12 13 12 14 ...

– Calculate population estimates (n_s = 25 × 46/m_s): 127.8 82.1 82.1 127.8 82.1 88.5 95.8 88.5 95.8 82.1 ...


• “Petersen” estimate is 25 × 46/12 = 95.8

• Bootstrap population estimates (assuming n = 96):

– 127.8 82.1 82.1 127.8 82.1 88.5 95.8 88.5 95.8 82.1 ...

• Expected Bias:

– mean(n_s) − 96 = 99.7 − 96 = 3.7

• Estimated standard error:

– SD(n_s) = 20.4

• So the bias-corrected population estimate is: 95.8 − 3.7 = 92.1 (SE 20.4)
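The mark-recapture bootstrap might be sketched in Python as below. It assumes the recapture sample is drawn without replacement from a closed population of 96 animals, 25 of them marked; the function names, the seed and the choice of 2000 replicates are illustrative assumptions, so the printed numbers will differ slightly from those on the slide.

```python
import random
import statistics


def petersen(marked, caught, recaptured_marked):
    """Petersen population estimate: marked x caught / marked recaptures."""
    return marked * caught / recaptured_marked


def parametric_bootstrap(n_pop, marked, caught, replicates, rng):
    """Simulate recapture samples from the fitted model; return replicate estimates."""
    population = [True] * marked + [False] * (n_pop - marked)  # True = marked
    estimates = []
    while len(estimates) < replicates:
        sample = rng.sample(population, caught)  # capture without replacement
        m_star = sum(sample)                     # marked animals in this replicate
        if m_star == 0:
            continue  # estimate undefined; practically never happens here
        estimates.append(petersen(marked, caught, m_star))
    return estimates


rng = random.Random(1)                 # fixed seed so the run is repeatable
n_hat = petersen(25, 46, 12)           # 95.8
n_model = round(n_hat)                 # 96 animals in the simulated population
boot = parametric_bootstrap(n_model, 25, 46, 2000, rng)

bias = statistics.fmean(boot) - n_model
se = statistics.stdev(boot)
print(f"estimate {n_hat:.1f}, bias {bias:.1f}, "
      f"corrected {n_hat - bias:.1f} (SE {se:.1f})")
```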

Non-Parametric Bootstrap (A.K.A. “The Bootstrap”)

• Data D = {X_1, X_2, X_3, ..., X_n} => statistic s

• Bootstrap replicates:

– D*_1 = {X*_1, X*_2, X*_3, ..., X*_n} => statistic s*_1

– D*_2 = {X*_1, X*_2, X*_3, ..., X*_n} => statistic s*_2

– ...

• X*_1, X*_2, X*_3, ..., X*_n are randomly selected, with replacement, from X_1, X_2, X_3, ..., X_n

• The distribution, confidence interval and SE of s are estimated from the distribution, percentiles and standard deviation of the s*’s

• Usually use 100-10,000 bootstrap replicates

Non-Parametric Bootstrap: an example: Median Gill Weight in Crabs

Gill weights (in mg):

159 179 100 45 384 230 100 320 80 220 320 210

Median = 195mg

Bootstrap replicates:

Sample  Gill weights (mg)                               Median
Real   159 179 100  45 384 230 100 320  80 220 320 210    195
B1     320 159  45 320 100 320 100 320 100 230 100 210    185
B2     384 384  45 384  45 384 100  80  45 179 230 230    205
B3     159 320  80  45  45  80 220 210 230 320 230 220    215
B4     220 179 384 100  80 100 230 230 179 230 384  45    200
B5     320 220 210 100 159 320 220 210 100  80 100 210    210
B6      80 100 230 100 210 384 159 220 320  45  45 210    185
B7     179 210  80 320 100 230 159 320 100  45 384 320    195
B8     384 159 100 159 100 179 100 179 220 384 220 159    169
B9     320 210  45 320 179 159 100 210 159  45 210 100    169
...

Non-Parametric Bootstrap: an example: Median Gill Weight in Crabs

Gill weights (in mg):

159 179 100 45 384 230 100 320 80 220 320 210

Median = 195mg

Bootstrap (1000 samples):

Mean of bootstrap medians = 188mg

95% c.i. = 100-275mg

[b(25) to b(975): the 25th and 975th of the 1000 sorted bootstrap medians]
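A minimal Python sketch of the non-parametric bootstrap for the median gill weight is given below. The function name, the seed and the use of a simple percentile interval are assumptions for illustration; with a different random stream the results will vary a little around the slide's values.

```python
import random
import statistics


def bootstrap_medians(data, replicates, rng):
    """Median of each bootstrap replicate (n values drawn with replacement)."""
    n = len(data)
    return [statistics.median(rng.choices(data, k=n)) for _ in range(replicates)]


gill = [159, 179, 100, 45, 384, 230, 100, 320, 80, 220, 320, 210]  # mg

rng = random.Random(1)  # fixed seed so the run is repeatable
meds = sorted(bootstrap_medians(gill, 1000, rng))

boot_mean = statistics.fmean(meds)
se = statistics.stdev(meds)
ci_low, ci_high = meds[24], meds[974]  # b(25) and b(975) of the sorted medians

print(f"median = {statistics.median(gill)} mg")
print(f"bootstrap mean of medians = {boot_mean:.0f} mg (SE {se:.0f})")
print(f"95% c.i. = {ci_low}-{ci_high} mg")
```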

Bootstraps in Molecular Genetics

• Calculate tree based on genetic data

– (e.g. 20 species and 300 loci)

• For each bootstrap replicate:

– Resample loci with replacement

– (20 species with 300 loci, some repeats)

– Calculate tree

• Look at agreement between original and bootstrap trees

[Figure: Bootstrapped spanning tree. Glazko & Nei, Mol. Biol. Evol. 2003]

Bootstraps

• “Better” estimate of confidence

• Variable n

• Self-comparisons a problem

– e.g. Mean of associations

• Gives SE’s, confidence intervals and profile of confidence

Jackknives

• “Worse” estimate of confidence

– Usually conservative (underestimates precision)

• Fixed n

• Self-comparisons not a problem

• Reduces Bias

• Only directly gives SE

– Confidence intervals need assumption of normality

Bootstraps and Jackknives

• Give estimates of confidence (and bias) when:

– distributions unknown, approximate, or intractable

• Parametric bootstrap

– very useful if model known

– needs programming

• Non-parametric bootstrap

– widely applicable (except self-referencing situations)

– few assumptions

• Jackknife

– approximate

– only standard error given directly

– useful when bootstrap not applicable
