Maximum likelihood estimates

What are they and why do we care?

Relationship to AIC and other model selection criteria

Maximum Likelihood Estimates (MLE)

Given a model, the MLE(s) are the value(s) of the parameter(s) of interest under which the observed data are most probable.

That is, they maximize the likelihood of the model given the data.

The likelihood of a model is the product of the probabilities of the (independent) observations.

Maximum Likelihood Estimation

For linear models (e.g., ANOVA and regression), MLEs are usually determined using the linear equations that minimize the sum of the squared residuals – closed form.

For nonlinear models and some distributions, we determine MLEs by setting the first derivative of the likelihood equal to zero and then confirming that the solution is a maximum by checking that the second derivative is negative – closed form.

Or we can search for values that maximize the probabilities of all of the observations – numerical estimation.

Search stops when certain criteria are met:

Precision of the estimate

Change in the likelihood

Solution seems unlikely (stops after n iterations)
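A minimal sketch of numerical estimation with these three stopping rules, assuming a binomial model with y successes in n trials. The function names, the tolerances, and the choice of a golden-section search are illustrative assumptions, not the algorithm of any particular software package.

```python
import math

def binomial_log_likelihood(p, n, y):
    """ln L(p | n, y) without the binomial coefficient (constant in p)."""
    return y * math.log(p) + (n - y) * math.log(1.0 - p)

def search_mle(n, y, tol_p=1e-8, tol_ll=1e-12, max_iter=200):
    """Golden-section search for the p that maximizes the log-likelihood.

    Assumes 0 < y < n so the log-likelihood is finite inside (0, 1).
    Returns (estimate, iterations used, reason the search stopped).
    """
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    lo, hi = 1e-9, 1.0 - 1e-9
    previous_ll = -math.inf
    p_hat = (lo + hi) / 2.0
    for iteration in range(1, max_iter + 1):
        a = hi - ratio * (hi - lo)
        b = lo + ratio * (hi - lo)
        if binomial_log_likelihood(a, n, y) < binomial_log_likelihood(b, n, y):
            lo = a  # the maximum lies in [a, hi]
        else:
            hi = b  # the maximum lies in [lo, b]
        p_hat = (lo + hi) / 2.0
        current_ll = binomial_log_likelihood(p_hat, n, y)
        if hi - lo < tol_p:
            return p_hat, iteration, "precision of the estimate"
        if abs(current_ll - previous_ll) < tol_ll:
            return p_hat, iteration, "change in the likelihood"
        previous_ll = current_ll
    return p_hat, max_iter, "iteration limit reached"

estimate, iterations, reason = search_mle(n=10, y=7)
print(round(estimate, 4), iterations, reason)
```

With 7 successes in 10 trials the search should settle near 0.7, which is also the closed-form answer y/n.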

Binomial probability

Some theory and math

An example

Assumptions

Adding a link function

Additional assumptions about βs

Binomial Sampling

Characterized by two mutually exclusive events:

Heads or tails

On or off

Dead or alive

Used or not used, or

Occupied or not occupied.

Often referred to as Bernoulli trials

Models

Trials have an associated parameter p:

p = probability of success

1 - p = probability of failure (= q)

p + q = 1

p also represents a model:

Single parameter

p is equal for every trial

Binomial Sampling

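p is a continuous variable between 0 and 1 (0 < p < 1)

y is the number of successful outcomes; n is the number of trials.

The estimator of p is

$$\hat{p} = \frac{y}{n}$$

This estimator is unbiased.

$$E(y) = np$$

$$\operatorname{var}(y) = npq = np(1-p)$$

$$\operatorname{var}(\hat{p}) = \frac{pq}{n}$$

$$\widehat{\operatorname{var}}(\hat{p}) = \frac{\hat{p}\hat{q}}{n}$$

$$\widehat{\operatorname{se}}(\hat{p}) = \sqrt{\frac{\hat{p}\hat{q}}{n}}$$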
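A short sketch of the estimator and its estimated variance and standard error, using p-hat = y/n and p-hat times q-hat over n. The function name and the input values (7 successes in 10 trials) are illustrative assumptions only.

```python
import math

def estimate_p(y, n):
    """Return (p_hat, estimated variance of p_hat, estimated standard error)."""
    p_hat = y / n
    q_hat = 1.0 - p_hat
    var_hat = p_hat * q_hat / n
    return p_hat, var_hat, math.sqrt(var_hat)

# Illustrative values only: 7 successes in 10 trials.
p_hat, var_hat, se_hat = estimate_p(7, 10)
print(p_hat, round(var_hat, 4), round(se_hat, 4))
```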

Binomial Probability Function

The probability of observing y successes given n trials with the underlying probability p is:

$$f(y \mid n, p) = \binom{n}{y} p^{y} (1-p)^{n-y}$$

Example: 10 flips of a fair coin (p = 0.5), 7 of which turn up heads, is written:

$$f(7 \mid 10, 0.5) = \binom{10}{7} 0.5^{7} (1-0.5)^{10-7}$$

Binomial Probability Function (2)

$$f(y \mid n, p) = \frac{n!}{y!\,(n-y)!}\, p^{y} (1-p)^{n-y}$$

evaluated numerically:

$$f(7 \mid 10, 0.5) = 120 \times 0.5^{7} \times 0.5^{3} = 0.1172$$

In Excel:

=BINOMDIST(y, n, p, FALSE)
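The same calculation can be sketched in Python with only the standard library. The function name binomial_pmf is an illustrative choice; math.comb supplies the binomial coefficient.

```python
import math

def binomial_pmf(y, n, p):
    """Probability of y successes in n trials with success probability p."""
    return math.comb(n, y) * p**y * (1.0 - p) ** (n - y)

# 7 heads in 10 flips of a fair coin.
print(round(binomial_pmf(7, 10, 0.5), 4))  # 0.1172

# The probabilities over all possible y (0 to n) sum to 1.
print(sum(binomial_pmf(y, 10, 0.5) for y in range(11)))
```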

Binomial Probability Function (3)

$$f(y \mid 10, 0.5) = \frac{n!}{y!\,(n-y)!}\, p^{y} (1-p)^{n-y}$$

n = 10, p = 0.5

y     BINPROB
0     0.0010
1     0.0098
2     0.0439
3     0.1172
4     0.2051
5     0.2461
6     0.2051
7     0.1172
8     0.0439
9     0.0098
10    0.0010

[Figure: bar chart of Probability (0 to 1) against y (0 to 10)]

Likelihood Function of Binomial Probability

Reality: we have data (n and y) but don’t know the model (p). This leads us to the likelihood function:

$$\mathcal{L}(p \mid n, y) = \binom{n}{y} p^{y} (1-p)^{n-y}$$

Read: the likelihood of p given n and y is ...

It is not a probability function.

It is a positive function (0 < p < 1).

Likelihood Function of Binomial Probability(2)

Alternatively, the likelihood of the model given the data can be thought of as the product of the probabilities of the individual observations.

The probability of an individual observation is:

$$P(y_i) = p^{f_i} (1-p)^{1-f_i}$$

where f = 1 for success and f = 0 for failure.

Therefore,

$$\mathcal{L}(p \mid n, y) = \prod_{i=1}^{n} p^{f_i} (1-p)^{1-f_i}$$

Binomial Probability Function and its likelihood

$$\mathcal{L}(p \mid 10, 7) = p^{7} (1-p)^{10-7}$$

p       Likelihood
0.00    0.00000000
0.05    0.00000000
0.10    0.00000007
0.15    0.00000105
0.20    0.00000655
0.25    0.00002575
0.30    0.00007501
0.35    0.00017669
0.40    0.00035389
0.45    0.00062169
0.50    0.00097656
0.55    0.00138732
0.60    0.00179159
0.65    0.00210183
0.70    0.00222357
0.75    0.00208569
0.80    0.00167772
0.85    0.00108195
0.90    0.00047830
0.95    0.00008729
1.00    0.00000000

[Figure: Likelihood (0 to 2.5E-03) plotted against p, with the maximum marked]
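The tabulated likelihood values can be reproduced with a simple grid evaluation. This sketch assumes the same 0.05 grid as the table and, like the table, leaves out the binomial coefficient; the variable names are illustrative.

```python
def likelihood(p, n, y):
    """Binomial likelihood of p without the (constant) binomial coefficient."""
    return p**y * (1.0 - p) ** (n - y)

# Evaluate L(p | 10, 7) on the grid p = 0.00, 0.05, ..., 1.00.
grid = [i / 20 for i in range(21)]
values = {p: likelihood(p, 10, 7) for p in grid}

# The grid point with the largest likelihood approximates the MLE.
best_p = max(values, key=values.get)
print(best_p, f"{values[best_p]:.8f}")
```

On this grid the largest value falls at p = 0.70, matching y/n = 7/10.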

Log likelihood

Although the likelihood function is useful, the log-likelihood has some desirable properties: the terms are additive, and the binomial coefficient does not include p.

$$\ln \mathcal{L}(p \mid n, y) = \ln\binom{n}{y} + y \ln(p) + (n-y) \ln(1-p)$$

Log likelihood

Using the alternative:

$$\ln \mathcal{L}(p \mid n, y) = \sum_{i=1}^{n} \ln\left(p^{f_i} (1-p)^{1-f_i}\right)$$

The estimate of p that maximizes the value of ln(L) is the MLE.
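As a worked check of the closed-form route for the binomial case, the log-likelihood can be differentiated with respect to p and set to zero. The derivation below is a standard calculus sketch, assuming 0 < y < n; it is not taken from the slides.

```latex
\begin{aligned}
\ln \mathcal{L}(p \mid n, y) &= \ln\binom{n}{y} + y \ln(p) + (n-y)\ln(1-p) \\
\frac{d \ln \mathcal{L}}{dp} &= \frac{y}{p} - \frac{n-y}{1-p} = 0
  \quad\Longrightarrow\quad \hat{p} = \frac{y}{n} \\
\frac{d^{2} \ln \mathcal{L}}{dp^{2}} &= -\frac{y}{p^{2}} - \frac{n-y}{(1-p)^{2}} < 0
\end{aligned}
```

The second derivative is negative for every p between 0 and 1, so the stationary point is a maximum.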

Precision

[Figure: likelihood curves L(p | 10, 7) and L(p | 100, 70)]

As n increases, precision increases and variance decreases.
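A worked comparison of the two cases, using the estimated variance of p-hat (p-hat times q-hat over n) with p-hat = 0.7 in both. The arithmetic is an illustration, not part of the original slides.

```latex
\begin{aligned}
n = 10,\ y = 7:&\quad \widehat{\operatorname{var}}(\hat{p}) = \frac{0.7 \times 0.3}{10} = 0.021,
  &\widehat{\operatorname{se}}(\hat{p}) \approx 0.145 \\
n = 100,\ y = 70:&\quad \widehat{\operatorname{var}}(\hat{p}) = \frac{0.7 \times 0.3}{100} = 0.0021,
  &\widehat{\operatorname{se}}(\hat{p}) \approx 0.046
\end{aligned}
```

A tenfold increase in n cuts the variance by a factor of 10 and the standard error by a factor of about 3.2.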

Properties of MLEs

Asymptotically normally distributed

Asymptotically minimize variance

Asymptotically unbiased as n → ∞

One-to-one transformations of MLEs are also MLEs.

For example, mean lifespan:

$$\hat{L} = -1 / \ln(\hat{S})$$

is also an MLE.

Assumptions:

n trials must be identical – i.e., the population is well defined (e.g., 20 coin flips, 50 Kirtland's warbler nests, 75 radio-marked black bears in the Pisgah Bear Sanctuary).

Each trial results in one of two mutually exclusive outcomes. (e.g., heads or tails, survived or died, successful or failed, etc.)

The probability of success on each trial remains constant. (homogeneous)

Trials are independent events (the outcome of one does not depend on the outcome of another).

y, the number of successes, is the random variable after n trials.

Example – use/non-use survey

Selected 50 sites (n) at random (or systematically) within a study area.

Visited each site once and ‘surveyed’ for species x.

Species was detected at 10 sites (y).

Meets binomial assumptions:

Sites selected without bias

Surveys conducted using the same methods

Sites could only be used or not used (occupied)

No knowledge of habitat differences or species preferences

Sites are independent

Additional assumption – perfect detection

Example – calculating the likelihood

$$\mathcal{L}(p \mid 50, 10) = p^{10} (1-p)^{50-10}$$

p       Likelihood    Variance    SE
0.00    0.00E+00
0.05    1.26E-14      0.00095     0.030822
0.10    1.48E-12      0.0018      0.042426
0.15    8.66E-12      0.00255     0.050498
0.20    1.36E-11      0.0032      0.056569
0.25    9.59E-12      0.00375     0.061237
0.30    3.76E-12      0.0042      0.064807
0.35    9.06E-13      0.00455     0.067454
0.40    1.40E-13      0.0048      0.069282
0.45    1.40E-14      0.00495     0.070356
0.50    8.88E-16      0.005       0.070711
0.55    3.41E-17      0.00495     0.070356
0.60    7.31E-19      0.0048      0.069282
0.65    7.80E-21      0.00455     0.067454
0.70    3.43E-23      0.0042      0.064807
0.75    4.66E-26      0.00375     0.061237
0.80    1.18E-29      0.0032      0.056569
0.85    2.18E-34      0.00255     0.050498
0.90    3.49E-41      0.0018      0.042426
0.95    5.45E-53      0.00095     0.030822
1.00    0.00E+00      0           0

[Figure: likelihood plotted against p, with the maximum marked]

Example – results

MLE = 20% ± 6% of the area is occupied
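The arithmetic behind this result, worked from y = 10 detections at n = 50 sites. The 6% is interpreted here as one standard error, which is an assumption consistent with the tabulated SE at p = 0.20.

```latex
\begin{aligned}
\hat{p} &= \frac{y}{n} = \frac{10}{50} = 0.20 \\
\widehat{\operatorname{var}}(\hat{p}) &= \frac{\hat{p}(1-\hat{p})}{n} = \frac{0.20 \times 0.80}{50} = 0.0032 \\
\widehat{\operatorname{se}}(\hat{p}) &= \sqrt{0.0032} \approx 0.0566 \approx 6\%
\end{aligned}
```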

Link functions - adding covariates

“Link” the covariates, the data (X), with the response variable (i.e., use or occupancy)

Usually done with the logit link:

$$p_i = \frac{e^{\beta_0 + \beta_1 x_{i1}}}{1 + e^{\beta_0 + \beta_1 x_{i1}}} = \frac{e^{X\beta}}{1 + e^{X\beta}}$$

Nice properties:

Constrains result 0 < pi < 1

βs can be -∞ < β < +∞

Additional assumption – βs are normally distributed
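A small sketch of the logit link and its inverse, showing how any value of the linear predictor maps to a probability between 0 and 1. The function names and the coefficient values are hypothetical, chosen only for illustration.

```python
import math

def inverse_logit(eta):
    """Map a linear predictor in (-inf, +inf) to a probability in (0, 1)."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)  # avoids overflow for large negative eta
    return z / (1.0 + z)

def logit(p):
    """Log-odds of p, the inverse of inverse_logit."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients for one covariate x.
beta0, beta1 = -1.0, 0.5
for x in (-10.0, 0.0, 2.0, 10.0):
    print(x, round(inverse_logit(beta0 + beta1 * x), 4))
```

The branch on the sign of eta is a numerical-stability choice; mathematically both branches are the same function.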

Link function

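Binomial likelihood:

$$\mathcal{L}(p \mid n, y) = \binom{n}{y} p^{y} (1-p)^{n-y}$$

Substitute the link for p:

$$\mathcal{L}(p \mid n, y) = \binom{n}{y} \left(\frac{\exp(X\beta)}{1+\exp(X\beta)}\right)^{y} \left(1 - \frac{\exp(X\beta)}{1+\exp(X\beta)}\right)^{n-y}$$

Voila! – logistic regression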
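A minimal sketch of fitting a logistic regression by maximizing this likelihood, assuming one covariate and one 0/1 outcome per site. The data are hypothetical, and Newton-Raphson is used here as one common search method; this is not necessarily how any given package does it.

```python
import math

def inverse_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def fit_logistic(x, y, max_iter=50, tol=1e-10):
    """Newton-Raphson for beta0, beta1 in logit(p_i) = beta0 + beta1 * x_i."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        p = [inverse_logit(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood (score).
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # Negative Hessian (information matrix), a symmetric 2 x 2 matrix.
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system by hand.
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) < tol and abs(step1) < tol:
            break
    return b0, b1

# Hypothetical data: covariate value and used (1) / not used (0) per site.
x_obs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y_obs = [0, 0, 1, 0, 1, 1, 0, 1]
beta0_hat, beta1_hat = fit_logistic(x_obs, y_obs)
print(round(beta0_hat, 3), round(beta1_hat, 3))
```

At the solution the score equations hold: the fitted probabilities sum to the number of successes, which is a quick sanity check on any fit.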

Link function

More than one covariate can be included

Extend the logit (linear equation).

βs are the estimated parameters (effects):

estimated for each period or group

constrained to be equal using the data (xij).

Link function

The use rates or real parameters of interest are calculated from the βs as in this equation:

$$p_i = \frac{e^{\beta_0 + \beta_1 x_{i1}}}{1 + e^{\beta_0 + \beta_1 x_{i1}}} = \frac{e^{X\beta}}{1 + e^{X\beta}}$$

HUGE concept and applicable to EVERY estimator we examine.

Occupancy and detection probabilities are replaced by the link function submodel of the covariate(s).

Conceivably every site has a different probability of use that is related to the value of the covariates.

Multinomial probability

An example

Adding a link function

Multinomial Distribution and Likelihoods

Extension of the binomial to more than two possible mutually exclusive outcomes.

Nearly always introduced by way of die tossing.

Another example

Multiple presence/absence surveys at multiple sites

Binomial Coefficient

The binomial coefficient was the number of ways y successes could be obtained from the n trials:

$$\binom{n}{y} = \frac{n!}{y!\,(n-y)!}$$

Example: 7 successes in 10 trials

$$\binom{10}{7} = \frac{10!}{7!\,3!} = \frac{3628800}{5040 \times 6} = \frac{720}{6} = 120$$

Multinomial coefficient

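The multinomial coefficient, or the number of possible outcomes for die tossing (6 possibilities):

$$\binom{n}{y_1\; y_2\; y_3\; y_4\; y_5\; y_6} = \frac{n!}{y_1!\, y_2!\, y_3!\, y_4!\, y_5!\, y_6!} = \frac{n!}{\prod_{i=1}^{6} y_i!}$$

Example: rolling each die face once in 6 trials:

$$\binom{n}{y_1\; y_2\; y_3\; y_4\; y_5\; y_6} = \frac{n!}{\prod_{i=1}^{6} y_i!} = \frac{6!}{1!\,1!\,1!\,1!\,1!\,1!} = \frac{720}{1} = 720$$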

Properties of multinomials

Dependency among the counts.

For example, if a die is thrown and it is not a 1, 2, 3, 4, or 5, then it must be a 6.

$$\sum_{i=1}^{6} y_i = n$$

Face     Number    Variable
1        10        y1
2        11        y2
3        13        y3
4        9         y4
5        8         y5
6        9         y6
TOTAL    60        n

Multinomial pdf

Probability of an outcome or series of outcomes:

$$f(y_i \mid n, p_i) = \binom{n}{y_i} p_1^{y_1} p_2^{y_2} p_3^{y_3} p_4^{y_4} p_5^{y_5} p_6^{y_6}$$

$$\sum_{i=1}^{6} p_i = 1$$

Die example 1

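The probability of rolling a fair die (pi = 1/6) six times (n) and turning up each face only once (yi = 1) is:

$$f(1,1,1,1,1,1 \mid 6, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6) = \frac{6!}{1!\,1!\,1!\,1!\,1!\,1!} \left(\frac{1}{6}\right)^{1} \left(\frac{1}{6}\right)^{1} \left(\frac{1}{6}\right)^{1} \left(\frac{1}{6}\right)^{1} \left(\frac{1}{6}\right)^{1} \left(\frac{1}{6}\right)^{1} = 0.01543$$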

Die example 1

Dependency

$$\sum_{i=1}^{6} p_i = 0.167 + 0.167 + 0.167 + 0.167 + 0.167 + 0.167 = 1$$

Example 2

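Another example, the probability of rolling two 2s, three 3s, and one 4 is:

$$f(0,2,3,1,0,0 \mid 6, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6) = \frac{6!}{0!\,2!\,3!\,1!\,0!\,0!} \left(\frac{1}{6}\right)^{0} \left(\frac{1}{6}\right)^{2} \left(\frac{1}{6}\right)^{3} \left(\frac{1}{6}\right)^{1} \left(\frac{1}{6}\right)^{0} \left(\frac{1}{6}\right)^{0} = \frac{720}{12} \left(\frac{1}{6}\right)^{6} = 0.001286$$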
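Both die examples can be checked with a short function for the multinomial probability. This is a sketch using only the standard library; the function name multinomial_pmf is an illustrative choice.

```python
import math

def multinomial_pmf(counts, probs):
    """Probability of the observed counts given the outcome probabilities."""
    # Multinomial coefficient: n! / (y1! * y2! * ... * ym!).
    coefficient = math.factorial(sum(counts))
    for y in counts:
        coefficient //= math.factorial(y)
    probability = float(coefficient)
    for y, p in zip(counts, probs):
        probability *= p**y
    return probability

fair_die = [1 / 6] * 6
print(round(multinomial_pmf([1, 1, 1, 1, 1, 1], fair_die), 5))  # 0.01543
print(round(multinomial_pmf([0, 2, 3, 1, 0, 0], fair_die), 6))  # 0.001286
```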

Likelihood

As you might have expected, the likelihood of the multinomial is of greater interest to us.

We frequently have data (n, yi...m) and are seeking to determine the model (pi…m). The likelihood for our example with the die is:

$$\mathcal{L}(p_i \mid n, y_i) = \binom{n}{y_i} p_1^{y_1} p_2^{y_2} p_3^{y_3} p_4^{y_4} p_5^{y_5} p_6^{y_6} = \binom{n}{y_i} \prod_{i=1}^{m} p_i^{y_i}$$

Log-likelihood

This likelihood has all of the same properties we discussed for the binomial case.

Usually solve to maximize the ln(L):

$$\ln \mathcal{L}(p_i \mid n, y_i) = \ln\binom{n}{y_i} + y_1 \ln p_1 + y_2 \ln p_2 + y_3 \ln p_3 + \cdots + y_m \ln p_m = \ln\binom{n}{y_i} + \sum_{i=1}^{m} y_i \ln p_i$$


Log-likelihood

Ignoring the multinomial coefficient (constant):

$$\ln\left(\mathcal{L}(p_i \mid n, y_i)\right) = \sum_{i=1}^{m} y_i \ln(p_i)$$

where the yi are the data and the pi are the probabilities.
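A sketch of this log-likelihood applied to the 60-roll die counts from the earlier table (10, 11, 13, 9, 8, 9). It compares a fair-die model with the observed proportions yi/n, which are the multinomial MLEs; the function and variable names are illustrative.

```python
import math

def multinomial_log_likelihood(counts, probs):
    """Sum of y_i * ln(p_i), ignoring the constant multinomial coefficient."""
    return sum(y * math.log(p) for y, p in zip(counts, probs) if y > 0)

die_counts = [10, 11, 13, 9, 8, 9]  # faces 1..6, n = 60 rolls
n_rolls = sum(die_counts)

fair_model = [1 / 6] * 6
mle_model = [y / n_rolls for y in die_counts]  # observed proportions

ll_fair = multinomial_log_likelihood(die_counts, fair_model)
ll_mle = multinomial_log_likelihood(die_counts, mle_model)
print(round(ll_fair, 3), round(ll_mle, 3))
```

No other set of probabilities can give a higher log-likelihood than the observed proportions, so ll_mle is at least as large as ll_fair.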

Presence-absence surveys & multinomials

Procedure:

Select a sample of sites

Conduct repeated presence-absence surveys at each site

Usually temporal replication

Sometimes spatial replication

Record presence or absence of species during survey

Encounter histories for each site & species

Encounter history matrix

Each row represents a site

Each column represents a sampling occasion.

On each occasion each species is recorded as:

‘1’ if encountered (captured)

‘0’ if not encountered.

            Occasion
Site No.    1    2    3
211         0    0    1
212         0    0    1
213         0    1    0
214         1    0    0
215         1    0    1
216         1    1    0
217         1    1    1
218         0    0    1
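A sketch of turning an encounter history matrix into the counts of each distinct history, using the eight example sites (211 to 218). The dictionary layout and function name are illustrative assumptions.

```python
from collections import Counter

# Site number -> detections on occasions 1, 2, 3 (1 = encountered, 0 = not).
encounter_matrix = {
    211: (0, 0, 1),
    212: (0, 0, 1),
    213: (0, 1, 0),
    214: (1, 0, 0),
    215: (1, 0, 1),
    216: (1, 1, 0),
    217: (1, 1, 1),
    218: (0, 0, 1),
}

def tally_histories(matrix):
    """Count how many sites share each encounter history."""
    return Counter("".join(str(v) for v in row) for row in matrix.values())

history_counts = tally_histories(encounter_matrix)
for history, count in sorted(history_counts.items()):
    print(history, count)
```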

Encounter history - example

For sites sampled on 3 occasions there are 8 (= 2^m = 2^3) possible encounter histories.

10 sites were sampled 3 times (not enough for a good estimate).

1 – Detected during survey

0 – Not detected during survey

Separate encounter history for each species

yi    Encounter History
1     1 0 0
2     1 0 1
0     1 1 0
1     1 1 1
1     0 1 0
0     0 1 1
3     0 0 1
2     0 0 0
10    (total)


Each capture history is a possible outcome, analogous to one face of the die (ni).

Data consist of the number of times each capture history appears (yi).




Each encounter history has an associated probability (pi)

Each pij can be different



Log-likelihood example

Log-likelihood:

Calculate the log of the probability of each encounter history (ln(Pi))

Multiply ln(Pi) by the number of times observed (yi)

Sum the products
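The three steps can be sketched with the 10-site encounter history counts. To get a probability for each history, this sketch assumes a deliberately simple model: one constant encounter probability per occasion, with occasions independent. That model and all names are assumptions made for illustration; it does not separate occupancy from detection.

```python
import math

# Encounter history -> number of sites with that history (yi), 10 sites total.
history_counts = {
    "100": 1, "101": 2, "110": 0, "111": 1,
    "010": 1, "011": 0, "001": 3, "000": 2,
}

def history_probability(history, p_encounter):
    """Probability of one history under a constant per-occasion probability."""
    prob = 1.0
    for outcome in history:
        prob *= p_encounter if outcome == "1" else 1.0 - p_encounter
    return prob

def log_likelihood(counts, p_encounter):
    total = 0.0
    for history, y in counts.items():
        ln_p = math.log(history_probability(history, p_encounter))  # step 1
        total += y * ln_p                                           # steps 2 and 3
    return total

for p in (0.3, 0.4, 0.5):
    print(p, round(log_likelihood(history_counts, p), 4))
```

There are 12 encounters in 30 site-occasions, so under this simple model the log-likelihood is highest at p = 12/30 = 0.4.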

Link function in binomial

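Binomial likelihood:

$$\mathcal{L}(p \mid n, y) = \binom{n}{y} p^{y} (1-p)^{n-y}$$

Substitute the link for p:

$$\mathcal{L}(p \mid n, y) = \binom{n}{y} \left(\frac{\exp(X\beta)}{1+\exp(X\beta)}\right)^{y} \left(1 - \frac{\exp(X\beta)}{1+\exp(X\beta)}\right)^{n-y}$$

Voila! – logistic regression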

Multinomial with link function

Substitute the logit link for the pi:

$$\ln\left(\mathcal{L}(p_i \mid n, y_i)\right) = \ln\binom{n}{y_i} + \sum_{i=1}^{m} y_i \ln\left(\frac{\exp(X\beta)}{1+\exp(X\beta)}\right)$$

But wait a minute!

Is Pr(Occupancy) = Pr(Encounter)?

Probability of encounter includes both detection and use (occupancy).

Occupancy analysis estimates each separately, thus providing conditional estimates of use of sites.


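yi    Encounter History
1     1 0 0                sites known to be used
2     1 0 1                sites known to be used
0     1 1 0                sites known to be used
1     1 1 1                sites known to be used
1     0 1 0                sites known to be used
0     0 1 1                sites known to be used
3     0 0 1                sites known to be used
2     0 0 0                absent, or not detected?
10    (total)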
