
Empirical Research Methods in Computer Science

Lecture 2, Part 1. October 19, 2005. Noah Smith

Some tips

Perl scripts can be named encode instead of encode.pl.
Note that ./encode foo ≢ ./encode < foo.
Make the script executable: chmod u+x encode
Instead of making us run "java Encode", write a shell script:

    #!/bin/sh
    cd `dirname $0`
    java Encode

Check that it works on (say) ugrad10.

Assignment 1

If you didn't turn in a first version yesterday, don't bother; just turn in the final version.
Final version due Tuesday 10/25, 8pm.
We will post a few exercises soon. Questions?

Today

Standard error
Bootstrap for standard error
Confidence intervals
Hypothesis testing

Notation

P is a population. S = [s1, s2, …, sn] is a sample from P.
Let X = [x1, x2, …, xn] be some numerical measurement on the si, distributed over P according to an unknown distribution F.
We may use Y, Z for other measurements.

Mean

What does "mean" mean? μx is the population mean of x (it depends on F).
μx is in general unknown.
How do we estimate the mean? With the sample mean:

\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
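Not from the slides, but as a quick sketch, the sample mean translates to a couple of lines of Python (the data values here are hypothetical):

```python
def sample_mean(xs):
    # Sample mean: (1/n) * sum of the n measurements.
    return sum(xs) / len(xs)

data = [3.0, 2.8, 3.7, 3.4, 3.5]  # hypothetical measurements
print(sample_mean(data))
```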

Gzip compression rate

[Figure: histogram of gzip compression rates; usually < 1, but not always]

Accuracy

How good an estimate is the sample mean?
Standard error (se) of a statistic: we picked one S from P. How would x̄ vary if we picked a lot of samples from P? There is some "true" se value.

Extreme cases:
n → ∞: the sample mean converges to μx, so the se goes to 0.
n = 1: the se is just σx, the population standard deviation.

Standard error (of the sample mean)

This one is known in closed form. "Standard error" = standard deviation of a statistic.

se(\bar{x}) = \frac{\sigma_x}{\sqrt{n}}

where σx is the true standard deviation of x under F.

[Figure: gzip compression rate example]

Central Limit Theorem

The sampling distribution of the sample mean approaches a normal distribution as n increases:

\bar{x} \approx N\left(\mu_x, \frac{\sigma_x^2}{n}\right)
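A small simulation (my own sketch, not from the slides) can make the theorem concrete: draw many samples from a skewed population and check that the spread of the sample means shrinks like σ/√n.

```python
import random
import statistics

random.seed(0)
n, trials = 100, 2000

# Population: exponential with mean 1 and standard deviation 1 (skewed,
# clearly non-normal). Draw `trials` independent samples of size n and
# record each sample's mean.
means = [statistics.fmean(random.expovariate(1.0) for _ in range(n))
         for _ in range(trials)]

# By the CLT the means are approximately N(1, 1/n), so their standard
# deviation should be close to 1/sqrt(n) = 0.1.
print(statistics.fmean(means), statistics.stdev(means))
```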

How to estimate σx

Use the "plug-in principle":

\hat{\sigma} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2}

Therefore:

\hat{se}(\bar{x}) = \frac{\hat{\sigma}}{\sqrt{n}} = \frac{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}{n}
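The two formulas above translate directly into code; a sketch with hypothetical data (note the 1/n inside the square root, per the plug-in principle, rather than 1/(n-1)):

```python
import math

def se_hat(xs):
    # Plug-in estimate of the standard error of the sample mean:
    # sigma_hat / sqrt(n), with sigma_hat = sqrt((1/n) * sum((x_i - xbar)^2)).
    n = len(xs)
    xbar = sum(xs) / n
    sigma_hat = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)
    return sigma_hat / math.sqrt(n)

data = [3.0, 2.8, 3.7, 3.4, 3.5]  # hypothetical measurements
print(se_hat(data))
```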

Plug-in principle

We don't have (and can't get) P. We don't know F, the true distribution over X.
We do have S (the sample), and we do know the empirical distribution F̂ over X.
To estimate a statistic, use F̂ in place of F.

Good and Bad News

Good news: we have a formula to estimate the standard error of the sample mean.
Bad news: it is only for the sample mean. What about the variance, the median, a trimmed mean, a ratio of means of x and y, the correlation between x and y?

Bootstrap world

Real world:      unknown distribution F  →  observed random sample X  →  statistic of interest θ̂ = s(X)
Bootstrap world: empirical distribution F̂  →  bootstrap random sample X*  →  bootstrap replication θ̂* = s(X*)

From the replications we get statistics about the estimate (e.g., its standard error).

Bootstrap sample

X = [3.0, 2.8, 3.7, 3.4, 3.5]. X* could be:
[2.8, 3.4, 3.7, 3.4, 3.5]
[3.5, 3.0, 3.4, 2.8, 3.7]
[3.5, 3.5, 3.4, 3.0, 2.8]
Draw n elements with replacement.
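Drawing a bootstrap sample is a one-liner; this sketch uses Python's random module (the data values are hypothetical):

```python
import random

def bootstrap_sample(xs, rng=random):
    # Draw len(xs) elements from xs uniformly at random, WITH replacement,
    # so the same element can appear more than once.
    return [rng.choice(xs) for _ in xs]

X = [3.0, 2.8, 3.7, 3.4, 3.5]
print(bootstrap_sample(X))
```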

Reflection

Imagine doing this with a pencil and paper.
The bootstrap was born in 1979. Typically sampling is costly and computation is cheap.
In (empirical) CS, sampling isn't even necessarily all that costly.

Bootstrap estimate of se

Let s(·) be a function for computing an estimate θ̂ = s(X).
True value of the standard error: se_F(θ̂).
Ideal bootstrap estimate: se_F̂(θ̂*).
Bootstrap estimate with B bootstrap samples: ŝe_B.

Bootstrap estimate of se

\hat{se}_B = \sqrt{\frac{1}{B - 1} \sum_{i=1}^{B} \left( \hat{\theta}^*[i] - \hat{\theta}^*(\cdot) \right)^2}

where θ̂*(·) is the mean of the B replications, and

\lim_{B \to \infty} \hat{se}_B = se_{\hat{F}}
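Putting the pieces together, a sketch of ŝe_B for an arbitrary statistic s(·); function and variable names here are my own:

```python
import math
import random

def bootstrap_se(xs, stat, B=200, rng=random):
    # Bootstrap estimate of the standard error of stat(xs): draw B bootstrap
    # samples (with replacement), compute the statistic on each replication,
    # and take the standard deviation of the replications (B - 1 denominator).
    reps = [stat([rng.choice(xs) for _ in xs]) for _ in range(B)]
    mean_rep = sum(reps) / B
    return math.sqrt(sum((r - mean_rep) ** 2 for r in reps) / (B - 1))

random.seed(0)
data = [3.0, 2.8, 3.7, 3.4, 3.5]  # hypothetical measurements
# For the sample mean this should land near the closed-form estimate.
print(bootstrap_se(data, lambda s: sum(s) / len(s)))
```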

Bootstrap intuitively

We don't know F. We would like lots of samples from P, but we only have one (S).
We approximate F by F̂ (the plug-in principle).
It is easy to generate lots of "samples" from F̂.

[Figures: histograms of bootstrap replications of the mean compression rate for B = 25, B = 50, and B = 200]

Correlation (another statistic)

Population P, sample S. Two values xi and yi for each element of the sample.
Correlation coefficient ρ; sample correlation coefficient:

r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \; \sum_{i=1}^{n} (y_i - \bar{y})^2}}
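The formula for r, line by line, as a sketch (the paired data here are hypothetical):

```python
import math

def sample_correlation(xs, ys):
    # Sample correlation coefficient:
    # r = sum((x - xbar)(y - ybar)) / sqrt(sum((x - xbar)^2) * sum((y - ybar)^2))
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    num = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - xbar) ** 2 for x in xs) *
                    sum((y - ybar) ** 2 for y in ys))
    return num / den

xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]  # hypothetical paired measurements
print(sample_correlation(xs, ys))
```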

Example: gzip compression

r = 0.9616

Accuracy of r

There is no general closed form for se(r). If we assume x and y are bivariate Gaussian:

se_{normal}(r) = \frac{1 - r^2}{\sqrt{n - 3}}

[Figure: se_normal(r) plotted as a function of r (from -1 to 1) and n (from 10 to 100)]

Normality

Why assume the data are Gaussian? An alternative is the bootstrap estimate of the standard error of r:

\hat{se}_B(r) = \sqrt{\frac{1}{B - 1} \sum_{i=1}^{B} \left( r^*[i] - r^*(\cdot) \right)^2}
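A sketch comparing the two estimates (names and data are my own, not from the slides); the key point is that bootstrap resampling must keep each (x, y) pair together:

```python
import math
import random

def corr(pairs):
    # Sample correlation of a list of (x, y) pairs.
    n = len(pairs)
    xbar = sum(x for x, _ in pairs) / n
    ybar = sum(y for _, y in pairs) / n
    num = sum((x - xbar) * (y - ybar) for x, y in pairs)
    den = math.sqrt(sum((x - xbar) ** 2 for x, _ in pairs) *
                    sum((y - ybar) ** 2 for _, y in pairs))
    return num / den

def se_normal(r, n):
    # Normal-theory formula from the slide: (1 - r^2) / sqrt(n - 3).
    return (1 - r ** 2) / math.sqrt(n - 3)

def bootstrap_se_r(pairs, B=200, rng=random):
    # Resample whole (x, y) pairs with replacement, so each replication
    # preserves the pairing between the two measurements.
    reps = [corr([rng.choice(pairs) for _ in pairs]) for _ in range(B)]
    mean_rep = sum(reps) / B
    return math.sqrt(sum((r - mean_rep) ** 2 for r in reps) / (B - 1))

random.seed(0)
pairs = [(x, 1.1 * x + random.gauss(0.0, 0.3)) for x in range(1, 11)]  # hypothetical
r = corr(pairs)
print(r, se_normal(r, len(pairs)), bootstrap_se_r(pairs))
```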

Example: gzip compression

r = 0.9616
se_normal(r) = 0.0024
se_200(r) = 0.0298

se bootstrap advice

Plot the data. Consider the runtime. Efron and Tibshirani:
B = 25 is informative; B = 50 is often enough; you seldom need B > 200 (for se).

Summary so far

A statistic is a "true fact" about the distribution F. We don't know F.
For some parameter θ we want: an estimate θ̂, and the accuracy of that estimate (e.g., its standard error).
For the mean μ we have a closed form. For other θ, the bootstrap will help.


error) For the mean μ we have a closed

form For other θ the bootstrap will help

Extreme cases

n rarr infin

n = 1

Standard error (of the sample mean)

Known

ldquoStandard errorrdquo = standard deviation of a statistic

n)x(se x

true standard deviation of x under F

Gzip compression rate

Central Limit Theorem

The sampling distribution of the sample mean approaches a normal distribution as n increases

nμx

2xN

How to estimate σx

ldquoPlug-in principlerdquo

Therefore

n

1i

2i xx

n1

ˆ

n

1i

2

i

nxx

xse

Plug-in principle

We donrsquot have (and canrsquot get) P We donrsquot know F the true distribution

over X We do have S (the sample)

We do know the sample distribution over X

Estimating a statistic use for F

F

F

Good and Bad News

We have a formula to estimate the standard error of the sample mean

We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y

Bootstrap world

unknown distribution F

observed random sample X

statistic of interest )X(sˆ

empirical distribution

bootstrap random sample X

bootstrap replication )X(sˆ

F

statistics about the estimate (eg standard error)

Bootstrap sample

X = [30 28 37 34 35] X could be

[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]

Draw n elements with replacement

Reflection

Imagine doing this with a pencil and paper

The bootstrap was born in 1979 Typically sampling is costly and

computation is cheap In (empirical) CS sampling isnrsquot even

necessarily all that costly

Bootstrap estimate of se

Let s() be a function for computing an estimate

True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap

samples

seF

FF

seˆse

BB seˆse

Bootstrap estimate of se

B

1i

2

B1B

ˆ]i[ˆˆse

FBB

seselim

Bootstrap intuitively

We donrsquot know F We would like lots of samples from P

but we only have one (S) We approximate F by

Plug-in principle Easy to generate lots of ldquosamplesrdquo

from

F

F

B = 25 (mean compression)

B = 50 (mean compression)

B = 200 (mean compression)

Correlation (another statistic)

Population P sample S Two values xi and yi for each element

of the sample Correlation coefficient ρ Sample correlation coefficient

n

1i

2i

n

1i

2i

n

1iii

yyxx

yyxxr

Example gzip compression

r = 09616

Accuracy of r

No general closed form for se(r) If we assume x and y are bivariate

Gaussian

3n

r1)r(se

2

normal

-1-05

005

110

2030

4050

6070

8090

100

-05

0

05

1

senormal

rn

senormal

Normality

Why assume the data are Gaussian

Alternative bootstrap estimate of the standard error of r

B

1i

2

B1B

r]i[rrse

Example gzip compression

r = 09616

senormal(r) = 00024

se200(r) = 00298

se bootstrap advice

Plot the data Runtime Efron and Tibshirani

B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)

Summary so far

A statistic is a ldquotrue factrdquo about the distribution F

We donrsquot know F For some parameter θ we want

estimate ldquoθ hatrdquo accuracy of that estimate (eg standard

error) For the mean μ we have a closed

form For other θ the bootstrap will help

Standard error (of the sample mean)

Known

ldquoStandard errorrdquo = standard deviation of a statistic

n)x(se x

true standard deviation of x under F

Gzip compression rate

Central Limit Theorem

The sampling distribution of the sample mean approaches a normal distribution as n increases

nμx

2xN

How to estimate σx

ldquoPlug-in principlerdquo

Therefore

n

1i

2i xx

n1

ˆ

n

1i

2

i

nxx

xse

Plug-in principle

We donrsquot have (and canrsquot get) P We donrsquot know F the true distribution

over X We do have S (the sample)

We do know the sample distribution over X

Estimating a statistic use for F

F

F

Good and Bad News

We have a formula to estimate the standard error of the sample mean

We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y

Bootstrap world

unknown distribution F

observed random sample X

statistic of interest )X(sˆ

empirical distribution

bootstrap random sample X

bootstrap replication )X(sˆ

F

statistics about the estimate (eg standard error)

Bootstrap sample

X = [30 28 37 34 35] X could be

[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]

Draw n elements with replacement

Reflection

Imagine doing this with a pencil and paper

The bootstrap was born in 1979 Typically sampling is costly and

computation is cheap In (empirical) CS sampling isnrsquot even

necessarily all that costly

Bootstrap estimate of se

Let s() be a function for computing an estimate

True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap

samples

seF

FF

seˆse

BB seˆse

Bootstrap estimate of se

B

1i

2

B1B

ˆ]i[ˆˆse

FBB

seselim

Bootstrap intuitively

We donrsquot know F We would like lots of samples from P

but we only have one (S) We approximate F by

Plug-in principle Easy to generate lots of ldquosamplesrdquo

from

F

F

B = 25 (mean compression)

B = 50 (mean compression)

B = 200 (mean compression)

Correlation (another statistic)

Population P sample S Two values xi and yi for each element

of the sample Correlation coefficient ρ Sample correlation coefficient

n

1i

2i

n

1i

2i

n

1iii

yyxx

yyxxr

Example gzip compression

r = 09616

Accuracy of r

No general closed form for se(r) If we assume x and y are bivariate

Gaussian

3n

r1)r(se

2

normal

-1-05

005

110

2030

4050

6070

8090

100

-05

0

05

1

senormal

rn

senormal

Normality

Why assume the data are Gaussian

Alternative bootstrap estimate of the standard error of r

B

1i

2

B1B

r]i[rrse

Example gzip compression

r = 09616

senormal(r) = 00024

se200(r) = 00298

se bootstrap advice

Plot the data Runtime Efron and Tibshirani

B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)

Summary so far

A statistic is a ldquotrue factrdquo about the distribution F

We donrsquot know F For some parameter θ we want

estimate ldquoθ hatrdquo accuracy of that estimate (eg standard

error) For the mean μ we have a closed

form For other θ the bootstrap will help

Gzip compression rate

Central Limit Theorem

The sampling distribution of the sample mean approaches a normal distribution as n increases

nμx

2xN

How to estimate σx

ldquoPlug-in principlerdquo

Therefore

n

1i

2i xx

n1

ˆ

n

1i

2

i

nxx

xse

Plug-in principle

We donrsquot have (and canrsquot get) P We donrsquot know F the true distribution

over X We do have S (the sample)

We do know the sample distribution over X

Estimating a statistic use for F

F

F

Good and Bad News

We have a formula to estimate the standard error of the sample mean

We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y

Bootstrap world

unknown distribution F

observed random sample X

statistic of interest )X(sˆ

empirical distribution

bootstrap random sample X

bootstrap replication )X(sˆ

F

statistics about the estimate (eg standard error)

Bootstrap sample

X = [30 28 37 34 35] X could be

[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]

Draw n elements with replacement

Reflection

Imagine doing this with a pencil and paper

The bootstrap was born in 1979 Typically sampling is costly and

computation is cheap In (empirical) CS sampling isnrsquot even

necessarily all that costly

Bootstrap estimate of se

Let s() be a function for computing an estimate

True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap

samples

seF

FF

seˆse

BB seˆse

Bootstrap estimate of se

B

1i

2

B1B

ˆ]i[ˆˆse

FBB

seselim

Bootstrap intuitively

We donrsquot know F We would like lots of samples from P

but we only have one (S) We approximate F by

Plug-in principle Easy to generate lots of ldquosamplesrdquo

from

F

F

B = 25 (mean compression)

B = 50 (mean compression)

B = 200 (mean compression)

Correlation (another statistic)

Population P sample S Two values xi and yi for each element

of the sample Correlation coefficient ρ Sample correlation coefficient

n

1i

2i

n

1i

2i

n

1iii

yyxx

yyxxr

Example gzip compression

r = 09616

Accuracy of r

No general closed form for se(r) If we assume x and y are bivariate

Gaussian

3n

r1)r(se

2

normal

-1-05

005

110

2030

4050

6070

8090

100

-05

0

05

1

senormal

rn

senormal

Normality

Why assume the data are Gaussian

Alternative bootstrap estimate of the standard error of r

B

1i

2

B1B

r]i[rrse

Example gzip compression

r = 09616

senormal(r) = 00024

se200(r) = 00298

se bootstrap advice

Plot the data Runtime Efron and Tibshirani

B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)

Summary so far

A statistic is a ldquotrue factrdquo about the distribution F

We donrsquot know F For some parameter θ we want

estimate ldquoθ hatrdquo accuracy of that estimate (eg standard

error) For the mean μ we have a closed

form For other θ the bootstrap will help

Central Limit Theorem

The sampling distribution of the sample mean approaches a normal distribution as n increases

nμx

2xN

How to estimate σx

ldquoPlug-in principlerdquo

Therefore

n

1i

2i xx

n1

ˆ

n

1i

2

i

nxx

xse

Plug-in principle

We donrsquot have (and canrsquot get) P We donrsquot know F the true distribution

over X We do have S (the sample)

We do know the sample distribution over X

Estimating a statistic use for F

F

F

Good and Bad News

We have a formula to estimate the standard error of the sample mean

We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y

Bootstrap world

unknown distribution F

observed random sample X

statistic of interest )X(sˆ

empirical distribution

bootstrap random sample X

bootstrap replication )X(sˆ

F

statistics about the estimate (eg standard error)

Bootstrap sample

X = [30 28 37 34 35] X could be

[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]

Draw n elements with replacement

Reflection

Imagine doing this with a pencil and paper

The bootstrap was born in 1979 Typically sampling is costly and

computation is cheap In (empirical) CS sampling isnrsquot even

necessarily all that costly

Bootstrap estimate of se

Let s() be a function for computing an estimate

True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap

samples

seF

FF

seˆse

BB seˆse

Bootstrap estimate of se

B

1i

2

B1B

ˆ]i[ˆˆse

FBB

seselim

Bootstrap intuitively

We donrsquot know F We would like lots of samples from P

but we only have one (S) We approximate F by

Plug-in principle Easy to generate lots of ldquosamplesrdquo

from

F

F

B = 25 (mean compression)

B = 50 (mean compression)

B = 200 (mean compression)

Correlation (another statistic)

Population P sample S Two values xi and yi for each element

of the sample Correlation coefficient ρ Sample correlation coefficient

n

1i

2i

n

1i

2i

n

1iii

yyxx

yyxxr

Example gzip compression

r = 09616

Accuracy of r

No general closed form for se(r) If we assume x and y are bivariate

Gaussian

3n

r1)r(se

2

normal

-1-05

005

110

2030

4050

6070

8090

100

-05

0

05

1

senormal

rn

senormal

Normality

Why assume the data are Gaussian

Alternative bootstrap estimate of the standard error of r

B

1i

2

B1B

r]i[rrse

Example gzip compression

r = 09616

senormal(r) = 00024

se200(r) = 00298

se bootstrap advice

Plot the data Runtime Efron and Tibshirani

B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)

Summary so far

A statistic is a ldquotrue factrdquo about the distribution F

We donrsquot know F For some parameter θ we want

estimate ldquoθ hatrdquo accuracy of that estimate (eg standard

error) For the mean μ we have a closed

form For other θ the bootstrap will help

How to estimate σx

ldquoPlug-in principlerdquo

Therefore

n

1i

2i xx

n1

ˆ

n

1i

2

i

nxx

xse

Plug-in principle

We donrsquot have (and canrsquot get) P We donrsquot know F the true distribution

over X We do have S (the sample)

We do know the sample distribution over X

Estimating a statistic use for F

F

F

Good and Bad News

We have a formula to estimate the standard error of the sample mean

We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y

Bootstrap world

unknown distribution F

observed random sample X

statistic of interest )X(sˆ

empirical distribution

bootstrap random sample X

bootstrap replication )X(sˆ

F

statistics about the estimate (eg standard error)

Bootstrap sample

X = [30 28 37 34 35] X could be

[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]

Draw n elements with replacement

Reflection

Imagine doing this with a pencil and paper

The bootstrap was born in 1979 Typically sampling is costly and

computation is cheap In (empirical) CS sampling isnrsquot even

necessarily all that costly

Bootstrap estimate of se

Let s() be a function for computing an estimate

True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap

samples

seF

FF

seˆse

BB seˆse

Bootstrap estimate of se

B

1i

2

B1B

ˆ]i[ˆˆse

FBB

seselim

Bootstrap intuitively

We donrsquot know F We would like lots of samples from P

but we only have one (S) We approximate F by

Plug-in principle Easy to generate lots of ldquosamplesrdquo

from

F

F

B = 25 (mean compression)

B = 50 (mean compression)

B = 200 (mean compression)

Correlation (another statistic)

Population P sample S Two values xi and yi for each element

of the sample Correlation coefficient ρ Sample correlation coefficient

n

1i

2i

n

1i

2i

n

1iii

yyxx

yyxxr

Example gzip compression

r = 09616

Accuracy of r

No general closed form for se(r) If we assume x and y are bivariate

Gaussian

3n

r1)r(se

2

normal

-1-05

005

110

2030

4050

6070

8090

100

-05

0

05

1

senormal

rn

senormal

Normality

Why assume the data are Gaussian

Alternative bootstrap estimate of the standard error of r

B

1i

2

B1B

r]i[rrse

Example gzip compression

r = 09616

senormal(r) = 00024

se200(r) = 00298

se bootstrap advice

Plot the data Runtime Efron and Tibshirani

B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)

Summary so far

A statistic is a ldquotrue factrdquo about the distribution F

We donrsquot know F For some parameter θ we want

estimate ldquoθ hatrdquo accuracy of that estimate (eg standard

error) For the mean μ we have a closed

form For other θ the bootstrap will help

Plug-in principle

We donrsquot have (and canrsquot get) P We donrsquot know F the true distribution

over X We do have S (the sample)

We do know the sample distribution over X

Estimating a statistic use for F

F

F

Good and Bad News

We have a formula to estimate the standard error of the sample mean

We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y

Bootstrap world

unknown distribution F

observed random sample X

statistic of interest )X(sˆ

empirical distribution

bootstrap random sample X

bootstrap replication )X(sˆ

F

statistics about the estimate (eg standard error)

Bootstrap sample

X = [30 28 37 34 35] X could be

[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]

Draw n elements with replacement

Reflection

Imagine doing this with a pencil and paper

The bootstrap was born in 1979 Typically sampling is costly and

computation is cheap In (empirical) CS sampling isnrsquot even

necessarily all that costly

Bootstrap estimate of se

Let s() be a function for computing an estimate

True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap

samples

seF

FF

seˆse

BB seˆse

Bootstrap estimate of se

B

1i

2

B1B

ˆ]i[ˆˆse

FBB

seselim

Bootstrap intuitively

We donrsquot know F We would like lots of samples from P

but we only have one (S) We approximate F by

Plug-in principle Easy to generate lots of ldquosamplesrdquo

from

F

F

B = 25 (mean compression)

B = 50 (mean compression)

B = 200 (mean compression)

Correlation (another statistic)

Population P sample S Two values xi and yi for each element

of the sample Correlation coefficient ρ Sample correlation coefficient

n

1i

2i

n

1i

2i

n

1iii

yyxx

yyxxr

Example gzip compression

r = 09616

Accuracy of r

No general closed form for se(r) If we assume x and y are bivariate

Gaussian

3n

r1)r(se

2

normal

-1-05

005

110

2030

4050

6070

8090

100

-05

0

05

1

senormal

rn

senormal

Normality

Why assume the data are Gaussian

Alternative bootstrap estimate of the standard error of r

B

1i

2

B1B

r]i[rrse

Example gzip compression

r = 09616

senormal(r) = 00024

se200(r) = 00298

se bootstrap advice

Plot the data Runtime Efron and Tibshirani

B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)

Summary so far

A statistic is a ldquotrue factrdquo about the distribution F

We donrsquot know F For some parameter θ we want

estimate ldquoθ hatrdquo accuracy of that estimate (eg standard

error) For the mean μ we have a closed

form For other θ the bootstrap will help

Good and Bad News

We have a formula to estimate the standard error of the sample mean

We have a formula to estimate only the standard error of the sample mean variance median trimmed mean ratio of means of x and y correlation between x and y

Bootstrap world

unknown distribution F

observed random sample X

statistic of interest )X(sˆ

empirical distribution

bootstrap random sample X

bootstrap replication )X(sˆ

F

statistics about the estimate (eg standard error)

Bootstrap sample

X = [30 28 37 34 35] X could be

[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]

Draw n elements with replacement

Reflection

Imagine doing this with a pencil and paper

The bootstrap was born in 1979 Typically sampling is costly and

computation is cheap In (empirical) CS sampling isnrsquot even

necessarily all that costly

Bootstrap estimate of se

Let s() be a function for computing an estimate

True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap

samples

seF

FF

seˆse

BB seˆse

Bootstrap estimate of se

B

1i

2

B1B

ˆ]i[ˆˆse

FBB

seselim

Bootstrap intuitively

We donrsquot know F We would like lots of samples from P

but we only have one (S) We approximate F by

Plug-in principle Easy to generate lots of ldquosamplesrdquo

from

F

F

B = 25 (mean compression)

B = 50 (mean compression)

B = 200 (mean compression)

Correlation (another statistic)

Population P sample S Two values xi and yi for each element

of the sample Correlation coefficient ρ Sample correlation coefficient

n

1i

2i

n

1i

2i

n

1iii

yyxx

yyxxr

Example gzip compression

r = 09616

Accuracy of r

No general closed form for se(r) If we assume x and y are bivariate

Gaussian

3n

r1)r(se

2

normal

-1-05

005

110

2030

4050

6070

8090

100

-05

0

05

1

senormal

rn

senormal

Normality

Why assume the data are Gaussian

Alternative bootstrap estimate of the standard error of r

B

1i

2

B1B

r]i[rrse

Example gzip compression

r = 09616

senormal(r) = 00024

se200(r) = 00298

se bootstrap advice

Plot the data Runtime Efron and Tibshirani

B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)

Summary so far

A statistic is a ldquotrue factrdquo about the distribution F

We donrsquot know F For some parameter θ we want

estimate ldquoθ hatrdquo accuracy of that estimate (eg standard

error) For the mean μ we have a closed

form For other θ the bootstrap will help

Bootstrap world

unknown distribution F

observed random sample X

statistic of interest )X(sˆ

empirical distribution

bootstrap random sample X

bootstrap replication )X(sˆ

F

statistics about the estimate (eg standard error)

Bootstrap sample

X = [30 28 37 34 35] X could be

[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]

Draw n elements with replacement

Reflection

Imagine doing this with a pencil and paper

The bootstrap was born in 1979 Typically sampling is costly and

computation is cheap In (empirical) CS sampling isnrsquot even

necessarily all that costly

Bootstrap estimate of se

Let s() be a function for computing an estimate

True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap

samples

seF

FF

seˆse

BB seˆse

Bootstrap estimate of se

B

1i

2

B1B

ˆ]i[ˆˆse

FBB

seselim

Bootstrap intuitively

We donrsquot know F We would like lots of samples from P

but we only have one (S) We approximate F by

Plug-in principle Easy to generate lots of ldquosamplesrdquo

from

F

F

B = 25 (mean compression)

B = 50 (mean compression)

B = 200 (mean compression)

Correlation (another statistic)

Population P sample S Two values xi and yi for each element

of the sample Correlation coefficient ρ Sample correlation coefficient

n

1i

2i

n

1i

2i

n

1iii

yyxx

yyxxr

Example gzip compression

r = 09616

Accuracy of r

No general closed form for se(r) If we assume x and y are bivariate

Gaussian

3n

r1)r(se

2

normal

-1-05

005

110

2030

4050

6070

8090

100

-05

0

05

1

senormal

rn

senormal

Normality

Why assume the data are Gaussian

Alternative bootstrap estimate of the standard error of r

B

1i

2

B1B

r]i[rrse

Example gzip compression

r = 09616

senormal(r) = 00024

se200(r) = 00298

se bootstrap advice

Plot the data Runtime Efron and Tibshirani

B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)

Summary so far

A statistic is a ldquotrue factrdquo about the distribution F

We donrsquot know F For some parameter θ we want

estimate ldquoθ hatrdquo accuracy of that estimate (eg standard

error) For the mean μ we have a closed

form For other θ the bootstrap will help

Bootstrap sample

X = [30 28 37 34 35] X could be

[28 34 37 34 35] [35 30 34 28 37] [35 35 34 30 28]

Draw n elements with replacement

Reflection

Imagine doing this with a pencil and paper

The bootstrap was born in 1979 Typically sampling is costly and

computation is cheap In (empirical) CS sampling isnrsquot even

necessarily all that costly

Bootstrap estimate of se

Let s() be a function for computing an estimate

True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap

samples

seF

FF

seˆse

BB seˆse

Bootstrap estimate of se

B

1i

2

B1B

ˆ]i[ˆˆse

FBB

seselim

Bootstrap intuitively

We donrsquot know F We would like lots of samples from P

but we only have one (S) We approximate F by

Plug-in principle Easy to generate lots of ldquosamplesrdquo

from

F

F

B = 25 (mean compression)

B = 50 (mean compression)

B = 200 (mean compression)

Correlation (another statistic)

Population P sample S Two values xi and yi for each element

of the sample Correlation coefficient ρ Sample correlation coefficient

n

1i

2i

n

1i

2i

n

1iii

yyxx

yyxxr

Example gzip compression

r = 09616

Accuracy of r

No general closed form for se(r) If we assume x and y are bivariate

Gaussian

3n

r1)r(se

2

normal

-1-05

005

110

2030

4050

6070

8090

100

-05

0

05

1

senormal

rn

senormal

Normality

Why assume the data are Gaussian

Alternative bootstrap estimate of the standard error of r

B

1i

2

B1B

r]i[rrse

Example gzip compression

r = 09616

senormal(r) = 00024

se200(r) = 00298

se bootstrap advice

Plot the data Runtime Efron and Tibshirani

B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)

Summary so far

A statistic is a ldquotrue factrdquo about the distribution F

We donrsquot know F For some parameter θ we want

estimate ldquoθ hatrdquo accuracy of that estimate (eg standard

error) For the mean μ we have a closed

form For other θ the bootstrap will help

Reflection

Imagine doing this with a pencil and paper

The bootstrap was born in 1979 Typically sampling is costly and

computation is cheap In (empirical) CS sampling isnrsquot even

necessarily all that costly

Bootstrap estimate of se

Let s() be a function for computing an estimate

True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap

samples

seF

FF

seˆse

BB seˆse

Bootstrap estimate of se

B

1i

2

B1B

ˆ]i[ˆˆse

FBB

seselim

Bootstrap intuitively

We donrsquot know F We would like lots of samples from P

but we only have one (S) We approximate F by

Plug-in principle Easy to generate lots of ldquosamplesrdquo

from

F

F

B = 25 (mean compression)

B = 50 (mean compression)

B = 200 (mean compression)

Correlation (another statistic)

Population P sample S Two values xi and yi for each element

of the sample Correlation coefficient ρ Sample correlation coefficient

n

1i

2i

n

1i

2i

n

1iii

yyxx

yyxxr

Example gzip compression

r = 09616

Accuracy of r

No general closed form for se(r) If we assume x and y are bivariate

Gaussian

3n

r1)r(se

2

normal

-1-05

005

110

2030

4050

6070

8090

100

-05

0

05

1

senormal

rn

senormal

Normality

Why assume the data are Gaussian

Alternative bootstrap estimate of the standard error of r

B

1i

2

B1B

r]i[rrse

Example gzip compression

r = 09616

senormal(r) = 00024

se200(r) = 00298

se bootstrap advice

Plot the data Runtime Efron and Tibshirani

B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)

Summary so far

A statistic is a ldquotrue factrdquo about the distribution F

We donrsquot know F For some parameter θ we want

estimate ldquoθ hatrdquo accuracy of that estimate (eg standard

error) For the mean μ we have a closed

form For other θ the bootstrap will help

Bootstrap estimate of se

Let s() be a function for computing an estimate

True value of the standard error Ideal bootstrap estimate Bootstrap estimate with B boostrap

samples

seF

FF

seˆse

BB seˆse

Bootstrap estimate of se

B

1i

2

B1B

ˆ]i[ˆˆse

FBB

seselim

Bootstrap intuitively

We donrsquot know F We would like lots of samples from P

but we only have one (S) We approximate F by

Plug-in principle Easy to generate lots of ldquosamplesrdquo

from

F

F

B = 25 (mean compression)

B = 50 (mean compression)

B = 200 (mean compression)

Correlation (another statistic)

Population P sample S Two values xi and yi for each element

of the sample Correlation coefficient ρ Sample correlation coefficient

n

1i

2i

n

1i

2i

n

1iii

yyxx

yyxxr

Example gzip compression

r = 09616

Accuracy of r

No general closed form for se(r) If we assume x and y are bivariate

Gaussian

3n

r1)r(se

2

normal

-1-05

005

110

2030

4050

6070

8090

100

-05

0

05

1

senormal

rn

senormal

Normality

Why assume the data are Gaussian

Alternative bootstrap estimate of the standard error of r

B

1i

2

B1B

r]i[rrse

Example gzip compression

r = 09616

senormal(r) = 00024

se200(r) = 00298

se bootstrap advice

Plot the data Runtime Efron and Tibshirani

B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)

Summary so far

A statistic is a ldquotrue factrdquo about the distribution F

We donrsquot know F For some parameter θ we want

estimate ldquoθ hatrdquo accuracy of that estimate (eg standard

error) For the mean μ we have a closed

form For other θ the bootstrap will help

Bootstrap estimate of se

B

1i

2

B1B

ˆ]i[ˆˆse

FBB

seselim

Bootstrap intuitively

We donrsquot know F We would like lots of samples from P

but we only have one (S) We approximate F by

Plug-in principle Easy to generate lots of ldquosamplesrdquo

from

F

F

B = 25 (mean compression)

B = 50 (mean compression)

B = 200 (mean compression)

Correlation (another statistic)

Population P sample S Two values xi and yi for each element

of the sample Correlation coefficient ρ Sample correlation coefficient

n

1i

2i

n

1i

2i

n

1iii

yyxx

yyxxr

Example gzip compression

r = 09616

Accuracy of r

No general closed form for se(r) If we assume x and y are bivariate

Gaussian

3n

r1)r(se

2

normal

-1-05

005

110

2030

4050

6070

8090

100

-05

0

05

1

senormal

rn

senormal

Normality

Why assume the data are Gaussian

Alternative bootstrap estimate of the standard error of r

B

1i

2

B1B

r]i[rrse

Example gzip compression

r = 09616

senormal(r) = 00024

se200(r) = 00298

se bootstrap advice

Plot the data Runtime Efron and Tibshirani

B = 25 is informative B = 50 often enough seldom need B gt 200 (for se)

Summary so far

A statistic is a ldquotrue factrdquo about the distribution F

We donrsquot know F For some parameter θ we want

estimate ldquoθ hatrdquo accuracy of that estimate (eg standard

error) For the mean μ we have a closed

form For other θ the bootstrap will help

Bootstrap intuitively

We donrsquot know F We would like lots of samples from P

but we only have one (S) We approximate F by

Plug-in principle Easy to generate lots of ldquosamplesrdquo

from

F

F

B = 25 (mean compression)

B = 50 (mean compression)

B = 200 (mean compression)
