Probability Distributions and Dataset Properties. Lecture 2, Likelihood Methods in Forest Ecology, October 9th – 20th, 2006

Upload: megan-higgins

Post on 12-Jan-2016

Probability Distributions and Dataset Properties

Lecture 2

Likelihood Methods in Forest Ecology

October 9th – 20th, 2006


Statistical Inference

Data

Scientific Model (Scientific hypothesis)

Probability Model (Statistical hypothesis)

Inference


Parametric perspective on inference

Scientific Model (Hypothesis test), often with linear models

Probability Model (typically Normal)

Inference


Likelihood perspective on inference

Data

Scientific Model (hypothesis)

Probability Model

Inference


An example...

The Data:
$x_i$ = measurements of DBH on 50 trees
$y_i$ = measurements of crown radius on those trees

The Scientific Model: $y_i = \alpha + \beta x_i + \varepsilon$ (linear relationship, with 2 parameters ($\alpha$, $\beta$) and an error term ($\varepsilon$), the residuals)

The Probability Model: $\varepsilon$ is normally distributed, with $E[\varepsilon] = 0$ and variance estimated from the observed variance of the residuals...

Data

Scientific Model (hypothesis)

Probability Model

Inference



The triangle of statistical inference: Model

• Models clarify our understanding of nature.

• Help us understand the importance (or unimportance) of individual processes and mechanisms.

• Since they are not hypotheses, they can never be “correct”.

• We don’t “reject” models; we assess their validity.

• Establish what’s “true” by establishing which model the data support.


The triangle of statistical inference: Probability distributions

• Data are never “clean”.

• Most models are deterministic: they describe the average behavior of a system but not the noise or variability. To compare models with data, we need a statistical model which describes the variability.

• We must understand the processes giving rise to variability in order to select the probability density function (error structure) that correctly describes the variability or noise.


[Figure: scatter plot of crown radius (m) against DBH (cm); DBH axis from 0 to 50, crown radius axis from 0 to 6.]

The Data:
$x_i$ = measurements of DBH on 50 trees
$y_i$ = measurements of crown radius on those trees

The Scientific Model: $y_i = \alpha + \beta\,\mathrm{DBH}_i + \varepsilon$

The Probability Model: $\varepsilon$ is normally distributed.


An example: Can we predict crown radius using tree diameter?

[Figure: histogram of crown radius; x-axis: crown radius (bins from 1.62 to 5.89), y-axis: frequency (0 to 16).]


Why do we care about probability?

• Foundation of theory of statistics.

• Description of uncertainty (error).
– Measurement error
– Process error

• Needed to understand likelihood theory, which is required for:
– Estimating model parameters.
– Model selection (What hypothesis do data support?).


Error (noise, variability) is your friend!

• Classical statistics are built around the assumption that the variability is normally distributed.

• But…normality is in fact rare in ecology.

• Non-normality is an opportunity to:
– Represent variability in a more realistic way.
– Gain insights into the process of interest.


The likelihood framework

Ask biological question

Collect data

Probability Model: model noise

Ecological Model: model signal

Estimate parameters

Estimate support regions

Answer questions

Model selection

Bolker, Notes


Probability Concepts

• An experiment is an operation with uncertain outcome.

• A sample space is a set of all possible outcomes of an experiment.

• An event is a particular outcome of an experiment, a subset of the sample space.


Random Variables

• A random variable is a function that assigns a numeric value to every outcome of an experiment (event) or sample. For instance:

Event: Tree
Random variable: Growth = f(DBH, light, soil…)


Functions and probability density functions

Function: formula expressing a relationship between two variables.

All pdf’s are functions BUT NOT all functions are pdf’s.

Functions = Scientific Model: Crown radius = $\alpha + \beta\,\mathrm{DBH}$ (WE WILL TALK ABOUT THIS LATER)

pdf’s: used to model noise: $Y - (\alpha + \beta\,\mathrm{DBH})$


Probability Density Functions: properties

• A function that assigns probabilities to ALL the possible values of a random variable (x).

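$$0 \le f(x) \le 1, \qquad \sum_{x \in S} f(x) = 1$$

(These bounds hold for a discrete random variable with sample space $S$; a continuous density can exceed 1 as long as it integrates to 1.)

[Figure: probability density f(x) plotted against x.]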


Probability Density Functions: Expectations

• The expectation of a random variable x is the weighted average of the possible values that x can take, each value weighted by the probability that x assumes it.

• Analogous to “center of gravity”. First moment.

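$$E[X] = \sum_{x:\,p(x) > 0} x\,p(x)$$

Sample analogue: $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$

Example: x takes the values -1, 0, 1, 2 with p(-1) = 0.10, p(0) = 0.25, p(1) = 0.30, p(2) = 0.35.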


Probability Density Functions: Variance

• The variance of a random variable reflects the spread of X values around the expected value.

• Second central moment of a distribution.

$$\mathrm{Var}[X] = E\big[(X - E[X])^2\big] = E[X^2] - (E[X])^2$$
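As a worked check of the expectation and variance definitions, the sketch below applies them to the four-value distribution quoted on the expectation slide (p(-1) = 0.10, p(0) = 0.25, p(1) = 0.30, p(2) = 0.35). The function names are illustrative and not from the lecture.

```python
# Discrete distribution from the expectation slide: value -> probability.
pmf = {-1: 0.10, 0: 0.25, 1: 0.30, 2: 0.35}

def expectation(pmf):
    """E[X] = sum of x * p(x) over all x with p(x) > 0."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var[X] = E[X^2] - (E[X])^2."""
    mean = expectation(pmf)
    second_moment = sum(x * x * p for x, p in pmf.items())
    return second_moment - mean ** 2

mean = expectation(pmf)
var = variance(pmf)
print(round(mean, 4), round(var, 4))  # 0.9 0.99
```

By hand: E[X] = -0.10 + 0 + 0.30 + 0.70 = 0.9 and E[X^2] = 0.10 + 0 + 0.30 + 1.40 = 1.8, so Var[X] = 1.8 - 0.81 = 0.99.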


Probability Distributions

• A function that assigns probabilities to the possible values of a random variable (X).

• They come in two flavors:

DISCRETE: outcomes are a set of discrete possibilities such as integers (e.g., counting).

CONTINUOUS: A probability distribution over a continuous range (real numbers or the non-negative real numbers).


[Figure: bar chart of probability against event (x) for x = 0 to 20; probability axis from 0 to 0.2.]

Probability Mass Functions

For a discrete random variable X, the probability that X takes on a value x is given by a discrete density function f(x), also known as a probability mass function or distribution function.

$$f(x) = P\{X = x\}, \qquad 0 \le f(x) \le 1, \qquad \sum_{x \in S} f(x) = 1$$


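Probability Density Functions: Continuous variables

A probability density function f(x) gives the probability that a random variable X takes on values within a range.

$$f(x) \ge 0, \qquad \int_{-\infty}^{\infty} f(x)\,dx = 1, \qquad \int_a^b f(x)\,dx = P\{a \le X \le b\}$$

[Figure: probability density f(x) against x, with the interval from a to b marked.]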
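To make the "probability is the area under the density between a and b" idea concrete, here is a minimal sketch that integrates a normal density numerically. The trapezoid integrator and the choice of a standard normal are illustrative assumptions, not part of the lecture.

```python
import math

def normal_pdf(x, mean=0.0, var=1.0):
    """Normal density with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def integrate(f, a, b, steps=10_000):
    """Trapezoid-rule approximation of the integral of f from a to b."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h

# P{-1 <= X <= 1} for a standard normal: the area between a = -1 and b = 1.
prob_within_one_sd = integrate(normal_pdf, -1.0, 1.0)

# The whole density should integrate to (very nearly) 1.
total_area = integrate(normal_pdf, -10.0, 10.0)

# A density value is not itself a probability: it can exceed 1.
peak_of_narrow_normal = normal_pdf(0.0, var=0.01)

print(round(prob_within_one_sd, 4), round(total_area, 4))
```

The first number should be close to the familiar 68% of a normal distribution lying within one standard deviation of the mean.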


Some rules of probability

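$$\mathrm{Prob}(A \cap B) = \mathrm{Prob}(B \mid A)\,\mathrm{Prob}(A)$$

$$\mathrm{Prob}(B \mid A) = \frac{\mathrm{Prob}(A \cap B)}{\mathrm{Prob}(A)}$$

$$\mathrm{Prob}(A \mid B) = \frac{\mathrm{Prob}(A \cap B)}{\mathrm{Prob}(B)}$$

$$\mathrm{Prob}(A \cap B) = \mathrm{Prob}(A) \cdot \mathrm{Prob}(B) \quad \text{(assuming independence)}$$

$$\mathrm{Prob}(A \cup B) = \mathrm{Prob}(A) + \mathrm{Prob}(B) - \mathrm{Prob}(A \cap B)$$

[Figure: Venn diagram of events A and B.]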
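These rules can be checked by brute-force enumeration of a small sample space. The two-dice experiment and the events A and B below are hypothetical examples chosen for illustration; they do not come from the lecture.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 equally likely outcomes of rolling two dice.
sample_space = list(product(range(1, 7), repeat=2))

def prob(event):
    """Probability of an event (a set of outcomes) when outcomes are equally likely."""
    return Fraction(len(event), len(sample_space))

A = {o for o in sample_space if o[0] == 6}      # first die shows a 6
B = {o for o in sample_space if o[1] % 2 == 0}  # second die is even

p_a_and_b = prob(A & B)
p_b_given_a = p_a_and_b / prob(A)   # Prob(B|A) = Prob(A and B) / Prob(A)
p_a_given_b = p_a_and_b / prob(B)   # Prob(A|B) = Prob(A and B) / Prob(B)
p_a_or_b = prob(A | B)

# A and B concern different dice, so they are independent:
# the joint probability equals the product of the marginals.
independent = p_a_and_b == prob(A) * prob(B)

print(p_a_and_b, p_b_given_a, p_a_given_b, p_a_or_b, independent)
```

Exact fractions are used so that the equalities hold without floating-point tolerance.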


Real data: Histograms

[Figure: five histograms of the same variable (count and proportion per bar) for samples of size n = 10, n = 50, n = 100, n = 500 and n = 1000.]


Histograms and PDF’s

Probability density functions approximate the distribution of finite data sets.

[Figure: histogram of the variable for n = 1000 (x from -10 to 10), with count and probability axes.]
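The link between histograms and densities can be explored by simulation. The sketch below draws standard-normal samples of increasing size and measures how far the histogram, rescaled to a density, sits from the true pdf. The standard-normal choice, the bin width and the seed are assumptions made for illustration and are not the data shown on the slides.

```python
import math
import random

def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def histogram_density(sample, lo=-4.0, hi=4.0, width=0.5):
    """Return (bin centre, estimated density) pairs for equal-width bins."""
    n_bins = int(round((hi - lo) / width))
    counts = [0] * n_bins
    for value in sample:
        if lo <= value < hi:
            index = min(int((value - lo) // width), n_bins - 1)
            counts[index] += 1
    n = len(sample)
    # Dividing by n * width turns counts into a density estimate.
    return [(lo + (i + 0.5) * width, c / (n * width)) for i, c in enumerate(counts)]

def max_abs_error(n, seed=1):
    """Largest gap between the histogram density and the true pdf at bin centres."""
    rng = random.Random(seed)
    sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return max(abs(d - normal_pdf(x)) for x, d in histogram_density(sample))

for n in (10, 50, 100, 500, 1000):
    print(n, round(max_abs_error(n), 3))
```

The gap typically shrinks as n grows, although any single random sample can buck the trend.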


Uses of Frequency Distributions

• Empirical (frequentist):
– Make predictions about the frequency of a particular event.
– Judge whether an observation belongs to a population.

• Theoretical:
– Predictions about the distribution of the data based on some basic assumptions about the nature of the forces acting on a particular biological system.
– Describe the randomness in the data.


Some useful distributions

1. Discrete
– Binomial: Two possible outcomes.
– Poisson: Counts.
– Negative binomial: Counts.
– Multinomial: Multiple categorical outcomes.

2. Continuous
– Normal
– Lognormal
– Exponential
– Gamma
– Beta


An example: Seed predation

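x = number of seeds taken (0 to N)

Prob(visited) = V; Prob(not visited) = 1 - V

Assume each seed has equal probability (p) of being taken. Then:

$$\mathrm{prob}(x \text{ seeds taken}) = p^x$$

$$\mathrm{prob}((N - x) \text{ seeds not taken}) = (1 - p)^{N - x}$$

$$\mathrm{prob}(x) = V\,\frac{N!}{x!\,(N - x)!}\,p^x (1 - p)^{N - x} \quad \text{if } x > 0$$

$$\mathrm{prob}(x) = (1 - V) + V (1 - p)^N \quad \text{if } x = 0$$

The factor $N!/(x!\,(N - x)!)$ is the normalization constant.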
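The seed-predation model can be sketched in a few lines: with probability 1 - V nothing is taken, otherwise the number taken is binomial. The parameter values below mirror the rzibinom call shown on the next slide (size = 12, prob = 0.6, zprob = 0.3), under the assumption that zprob is the probability of an extra zero, so that V = 1 - 0.3. The function names are illustrative.

```python
import math
import random

def zib_pmf(x, n_seeds, p, visit_prob):
    """Zero-inflated binomial: prob of x seeds taken out of n_seeds."""
    binom = math.comb(n_seeds, x) * p ** x * (1.0 - p) ** (n_seeds - x)
    if x == 0:
        # Either never visited, or visited but no seed taken.
        return (1.0 - visit_prob) + visit_prob * binom
    return visit_prob * binom

def zib_sample(n_seeds, p, visit_prob, rng):
    """Draw one zero-inflated binomial value."""
    if rng.random() >= visit_prob:
        return 0  # not visited
    return sum(1 for _ in range(n_seeds) if rng.random() < p)

n_seeds, p, visit_prob = 12, 0.6, 0.7  # V = 1 - zprob (assumed interpretation)
probs = [zib_pmf(x, n_seeds, p, visit_prob) for x in range(n_seeds + 1)]

rng = random.Random(42)
draws = [zib_sample(n_seeds, p, visit_prob, rng) for _ in range(1000)]
zero_fraction = draws.count(0) / len(draws)

print(round(sum(probs), 6), round(probs[0], 4), zero_fraction)
```

The simulated fraction of zeros should sit near prob(0), which is dominated here by the 1 - V term because (1 - p)^N is tiny for N = 12.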


Zero-inflated binomial

[Figure: histogram of rzibinom(n = 1000, prob = 0.6, size = 12, zprob = 0.3); x-axis: simulated value, y-axis: frequency (0 to 300).]


Binomial distribution: Discrete events that can take one of two values

$$P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x}, \qquad \binom{n}{x} = \frac{n!}{x!\,(n - x)!}$$

E[x] = np
Variance = np(1 - p)
n = number of sites
p = prob. of survival

Example: Probability of survival derived from population data

[Figure: binomial probabilities against event (x) for x = 0 to 20, with n = 20 and p = 0.5; probability axis from 0 to 0.2.]
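A short sketch can confirm the binomial mean and variance formulas for the plotted case, n = 20 and p = 0.5. Only the standard library is used; the function name is illustrative.

```python
import math

def binomial_pmf(x, n, p):
    """P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)

n, p = 20, 0.5
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]

# Moments computed directly from the probability mass function.
mean = sum(x * px for x, px in enumerate(pmf))
var = sum((x - mean) ** 2 * px for x, px in enumerate(pmf))

print(round(mean, 6), round(var, 6))   # should match n*p and n*p*(1-p)
print(round(max(pmf), 4))              # height of the tallest bar, at x = 10
```

The tallest bar works out to C(20, 10) / 2^20, a little under 0.18, which agrees with the 0 to 0.2 probability axis on the slide.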


Binomial distribution


Poisson Distribution: Counts (or getting hit in the head by a horse)

$$P(X = k) = \frac{e^{-\lambda}\,\lambda^k}{k!}, \qquad E[X] = \mathrm{Variance} = \lambda$$

k = number of seedlings

λ= arrival rate

500 0.5

[Figure: histogram of Poisson counts; x-axis: number of seedlings per quadrat (0 to 7), y-axis: count and proportion per bar.]

** Alternative parameterization: λ = rt
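The defining Poisson property, mean equal to variance, is easy to verify numerically. In the sketch below λ = 0.5 is an illustrative value, and the sum is truncated at k = 49, beyond which the probabilities are negligible for this λ.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) = exp(-lam) * lam^k / k!"""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam = 0.5  # illustrative arrival rate
pmf = [poisson_pmf(k, lam) for k in range(50)]

mean = sum(k * pk for k, pk in enumerate(pmf))
var = sum((k - mean) ** 2 * pk for k, pk in enumerate(pmf))

print(round(pmf[0], 4))                # probability of an empty quadrat, exp(-lam)
print(round(mean, 6), round(var, 6))   # both should equal lam
```

With a rate this low most quadrats are empty, which is why Poisson count histograms with small λ pile up at zero.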


Poisson distribution


Example: Number of seedlings in census quad.

[Figure: histogram of number of seedlings per trap (0 to 100); y-axes: count (0 to 60) and proportion per bar.]

Alchornea latifolia

(Data from LFDP, Puerto Rico)


Clustering in space or time

Poisson process: E[X] = Variance[X]

Overdispersed (clumped or patchy): E[X] < Variance[X]

Negative binomial?


Negative binomial: Tables 4.2 & 4.3 in H&M, bycatch data

E[X] = 0.279
Variance[X] = 1.56

Suggests temporal or spatial aggregation in the data!!


Negative Binomial: Counts

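$$P(X = n) = \binom{n - 1}{r - 1} p^r (1 - p)^{n - r}$$

$$E[X] = \frac{r}{p}, \qquad \mathrm{Variance} = \frac{r(1 - p)}{p^2}$$

[Figure: histogram of negative binomial counts; x-axis: number of seeds (0 to 50), y-axes: count (0 to 100) and proportion per bar.]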


Negative Binomial: Counts

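$$\Pr(X = n) = \frac{\Gamma(k + n)}{\Gamma(k)\,n!} \left(1 + \frac{m}{k}\right)^{-k} \left(\frac{m}{m + k}\right)^{n}$$

$$E[X] = m, \qquad \mathrm{Variance} = m + \frac{m^2}{k}$$

k is related to the variance: for large k the distribution approaches the Poisson.

[Figure: histogram of negative binomial counts; x-axis: number of seeds (0 to 50), y-axes: count (0 to 100) and proportion per bar.]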
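The mean and clumping-parameter form of the negative binomial can be checked numerically. The sketch below evaluates the pmf on the log scale for stability and confirms that the mean is m, the variance is m + m²/k, and that a very large k gives Poisson-like probabilities. The parameter values are illustrative assumptions.

```python
import math

def negbin_pmf(n, m, k):
    """Negative binomial pmf with mean m and clumping parameter k."""
    log_p = (math.lgamma(k + n) - math.lgamma(k) - math.lgamma(n + 1)
             - k * math.log1p(m / k)          # (1 + m/k)^(-k)
             + n * math.log(m / (m + k)))     # (m / (m + k))^n
    return math.exp(log_p)

def poisson_pmf(n, lam):
    return math.exp(-lam) * lam ** n / math.factorial(n)

m, k = 2.0, 0.5  # illustrative: strongly clumped counts
pmf = [negbin_pmf(n, m, k) for n in range(3000)]
mean = sum(n * pn for n, pn in enumerate(pmf))
var = sum(n * n * pn for n, pn in enumerate(pmf)) - mean ** 2

print(round(mean, 4), round(var, 4))  # expect m and m + m^2 / k

# With a very large k the clumping vanishes and the pmf is close to Poisson.
print(round(negbin_pmf(3, m, 1e6), 4), round(poisson_pmf(3, m), 4))
```

Small k means strong aggregation: here the variance is five times the mean, whereas a Poisson with the same mean would have variance 2.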


Negative binomial


Negative Binomial: Count data

[Figure: histogram of number of seedlings per quadrat (0 to 100); y-axes: count (0 to 30) and proportion per bar.]

Prestoea acuminata

(Data from LFDP, Puerto Rico)


Normal Distribution

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right)$$

Mean: E[x] = m
Variance = $\sigma^2$

[Figure: Normal PDF with mean = 0, Prob(x) against X from -5 to 5, with one curve each for Var = 0.25, 0.5, 1, 2, 5 and 10.]


Normal Distribution with increasing variance

$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - m)^2}{2\sigma^2}\right)$$

Mean = m
Variance: $\sigma^2 = c\,x^d$


Lognormal: One tail and no negative values

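$$f(x) = \frac{1}{x\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right)$$

$X = e^Y$, $Y \sim N(\mu, \sigma^2)$

median $= m = e^{\mu}$
$E[x] = m\,e^{\sigma^2/2}$
variance $= m^2 e^{\sigma^2}\,(e^{\sigma^2} - 1)$

x is always positive

[Figure: lognormal density f(x) against x; x from 0 to 70, f(x) from 0 to 0.8.]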


Lognormal: Radial growth data

[Figure: two histograms of radial growth (cm/yr), one for hemlock (0 to 4 cm/yr) and one for red cedar (0 to 3 cm/yr); y-axes: count and proportion per bar.]

(Data from Date Creek, British Columbia)


Exponential

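$$f(x) = \lambda e^{-\lambda x}$$

$$E[x] = \frac{1}{\lambda}, \qquad \mathrm{Variance} = \frac{1}{\lambda^2}$$

[Figure: histogram of an exponentially distributed variable (0 to 6); y-axes: count (0 to 80) and proportion per bar.]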


Exponential: Growth data (negatives assumed 0)

[Figure: histogram of growth (mm/yr, 0 to 8); y-axes: count (0 to 1200) and proportion per bar.]

Beilschmiedia pendula

(Data from BCI, Panama)


Gamma: One tail and flexibility

$$f(x) = \frac{1}{s^a\,\Gamma(a)}\,x^{a - 1} e^{-x/s}$$

E[x] = as
Var[X] = $a s^2$
a = shape parameter
s = scale parameter
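As a numerical sanity check on the gamma density, the sketch below integrates it with a simple midpoint rule and compares the resulting mean and variance with as and as². The parameter values (a = 2, s = 1.5) and the integration settings are illustrative assumptions. It also confirms that a = 1 reduces to the exponential density with rate 1/s.

```python
import math

def gamma_pdf(x, a, s):
    """Gamma density with shape a and scale s, for x > 0."""
    return x ** (a - 1.0) * math.exp(-x / s) / (s ** a * math.gamma(a))

def numeric_moments(pdf, upper, steps=200_000):
    """Midpoint-rule estimates of total area, mean and variance on (0, upper)."""
    h = upper / steps
    area = mean = second = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        weight = pdf(x) * h
        area += weight
        mean += x * weight
        second += x * x * weight
    return area, mean, second - mean ** 2

a, s = 2.0, 1.5  # illustrative shape and scale
area, mean, var = numeric_moments(lambda x: gamma_pdf(x, a, s), upper=100.0)
print(round(area, 4), round(mean, 4), round(var, 4))  # expect 1, a*s, a*s^2

# Special case: shape a = 1 gives the exponential density with rate 1/s.
x0 = 0.7
exponential_at_x0 = (1.0 / s) * math.exp(-x0 / s)
print(abs(gamma_pdf(x0, 1.0, s) - exponential_at_x0) < 1e-12)
```

The shape parameter controls the form of the curve and the scale parameter stretches it along the x-axis, which is the flexibility the slide title refers to.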


Gamma: “raw” growth data

[Figure: two histograms of growth (mm/yr, count per bar), one for Alseis blackiana (0 to 9 mm/yr) and one for Cordia bicolor (0 to 30 mm/yr).]

(Data from BCI, Panama)


Beta distribution

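$$f(x \mid a, b) = \frac{1}{B(a, b)}\,x^{a - 1} (1 - x)^{b - 1} \quad \text{for } 0 < x < 1; \quad 0 \text{ otherwise}$$

$$B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}$$

$$E(x) = \frac{a}{a + b} \quad \text{(expected value of x)}$$

$$\mathrm{Var}(x) = \frac{ab}{(a + b)^2 (a + b + 1)}$$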


Beta: Light interception by crown trees

(Data from Luquillo, PR)

[Figure: histogram of GLI (0.0 to 1.0); y-axes: count (0 to 600) and proportion per bar.]


Mixture models

• What do you do when your data don’t fit any known distribution?
– Add covariates
– Mixture models
  • Discrete
  • Continuous


Discrete mixtures


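Discrete mixture: Zero-inflated binomial

Writing $\mathrm{prob}_{\mathrm{bin}}(x)$ for the ordinary binomial probability of x:

$$\mathrm{prob}(x) = V \cdot \mathrm{prob}_{\mathrm{bin}}(x) \quad \text{if } x > 0$$

$$\mathrm{prob}(x) = (1 - V) + V \cdot \mathrm{prob}_{\mathrm{bin}}(0) \quad \text{if } x = 0$$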


Continuous (compounded) mixtures


The Method of Moments

• You can calculate the sample moments of the data and match them up with the theoretical moments of the distribution.

• Recall that:

$$E[X] = \sum_{x:\,p(x) > 0} x\,p(x)$$

$$\mathrm{Var}[X] = E[X^2] - (E[X])^2$$

• The MOM is a good way to get a first (but biased) estimate of the parameters of a distribution. ML estimators are more reliable.


MOM: Negative binomial

$$E[X] = \sum_{x:\,p(x) > 0} x\,p(x) = mu$$

$$\mathrm{Var}[X] = E[X^2] - (E[X])^2 = mu + \frac{mu^2}{k}$$
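Solving the two moment equations for the parameters gives mu = sample mean and k = mean² / (variance - mean). The sketch below applies this to the bycatch summary quoted earlier in the lecture (E[X] = 0.279, Variance[X] = 1.56). The function name is illustrative, and the result is only a first, rough estimate of the kind the lecture recommends refining by maximum likelihood.

```python
def negbin_moment_estimates(sample_mean, sample_var):
    """Method-of-moments estimates (mu, k) for the negative binomial.

    Uses E[X] = mu and Var[X] = mu + mu^2 / k, solved for k.
    """
    if sample_var <= sample_mean:
        # k would be infinite or negative: no overdispersion to explain.
        raise ValueError("variance must exceed the mean for a negative binomial fit")
    k = sample_mean ** 2 / (sample_var - sample_mean)
    return sample_mean, k

mu_hat, k_hat = negbin_moment_estimates(0.279, 1.56)
print(mu_hat, round(k_hat, 4))  # k comes out at roughly 0.061
```

A k this far below 1 indicates very strong aggregation, consistent with the variance being several times the mean.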