phylogenetic estimation using maximum likelihood by: jimin zhu xin gong xin gong sravanti polsani...

Phylogenetic Estimation using Maximum Likelihood

By: Jimin Zhu

Xin Gong

Sravanti Polsani

Rama Sharma

Shlomit Klopman

The Scope of the Presentation

• Introduction

• Maximum Likelihood and Coin Tossing

• The Phylogenetic Tree

• Maximum Likelihood and DNA Substitution

• Advantages and Disadvantages of Maximum Likelihood

Introduction

• Phylogeny: the study of relationships between life forms

• Phylogenetics is part of the field of taxonomy and systematics

• Phylogenetics received a huge push forward thanks to modern computers

• Various phylogenetic methods are used to explain the evolutionary process, and they often give contradictory results!

Introduction (cont.)

• Scientists agree that a correct species lineage should be determined using statistics

• Maximum Likelihood is the method of choice for establishing the most realistic phylogenetic tree for a given data set

• The Maximum Likelihood method was introduced in 1922 by R.A. Fisher, an English statistician

Maximum Likelihood in a Nutshell

• The method depends on:

– A complete data set
– A probabilistic model that describes the data
– An explicit expression for the likelihood function

• The likelihood of a data set is the probability of obtaining it, given the chosen probability distribution model

• We seek the values of the parameters that maximize the sample likelihood

Maximum Likelihood Approach Using a Coin Tossing Experiment

• Find the parameter value(s) that make the observed data most likely. Basically, choose the value of the parameter that maximizes the probability of observing the data.

• Probability: knowing the parameters → prediction of the outcome

• Likelihood: observation of the data → estimation of the parameters

• Parameters describe the characteristics of a population. Their values are estimated from samples collected from that population.

Simple Coin Tossing Experiment

• Binomial probability distribution: the probability of observing h heads out of n tosses can be described as:

$$\Pr[h \mid p, n] = \frac{n!}{h!\,(n-h)!}\, p^{h} (1-p)^{n-h}$$

where p is the probability of heads and (1-p) is the probability of tails.


Suppose I told you we tossed a coin 10 times and got 4 heads and 6 tails. The probability would then be

$$P(4\ \text{heads},\ 6\ \text{tails}) = \frac{10!}{4!\,6!}\, p^{4} (1-p)^{6}$$

The whole notion of maximum likelihood estimation is that we choose p to be the value that makes the probability of getting our set of observations as large as possible, i.e. we maximize $p^{4}(1-p)^{6}$. The binomial coefficient does not depend on p, so it can be dropped, and our likelihood function is:

$$L(p) = p^{4} (1-p)^{6}$$

Two ways to find the MLE

1. Take the first derivative of the likelihood function with respect to each parameter, set the resulting equations equal to 0, and solve for the parameter estimates.

Taking the log of both sides of $L(p) = p^{h}(1-p)^{n-h}$:

$$\log L(p) = h \log p + (n-h) \log(1-p)$$

Taking the first derivative with respect to p and setting it to zero:

$$\frac{h}{p} - \frac{n-h}{1-p} = 0$$

Solving for p, we get

$$\hat{p} = \frac{h}{n}$$

This value maximizes the likelihood function and is the MLE.
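One step the derivation leaves implicit is that the stationary point is a maximum rather than a minimum. A short check, using the same symbols h, n and p and assuming 0 < h < n so that the estimate lies strictly inside (0, 1):

```latex
\frac{d^{2}}{dp^{2}} \log L(p)
  = -\frac{h}{p^{2}} - \frac{n-h}{(1-p)^{2}} < 0
  \quad \text{for all } 0 < p < 1 ,
\qquad \text{so } \hat{p} = \frac{h}{n} \text{ is the unique maximum.}

% Applied to the 10-toss example (h = 4, n = 10):
\hat{p} = \frac{4}{10} = 0.4
```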

Find the maximum using numeric search procedures

2. Plug different values of p into the probability model and calculate the likelihood.

Let's take a sample with n = 100, h = 56.

Imagine that p was 0.5. Plugging this value into our probability model gives:

$$L(p = 0.5 \mid \text{data}) = \frac{100!}{56!\,44!}\, 0.5^{56}\, 0.5^{44} = 0.0389$$

But what if p was 0.52 instead?

$$L(p = 0.52 \mid \text{data}) = \frac{100!}{56!\,44!}\, 0.52^{56}\, 0.48^{44} = 0.0581$$

• From this we can conclude that p is more likely to be 0.52 than 0.5. We can tabulate the likelihood for different parameter values to find the maximum likelihood estimate of p:

p       L
----    ------
0.48    0.0222
0.50    0.0389
0.52    0.0581
0.54    0.0739
0.56    0.0801
0.58    0.0738
0.60    0.0576
0.62    0.0378

The maximum likelihood estimate of p on this grid is 0.56, which equals h/n = 56/100.
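The numeric search can be sketched in a few lines of Python. This is only an illustration of the tabulation for n = 100, h = 56; the function names and the grid spacing of 0.02 are choices made here, not part of the original slides.

```python
from math import comb


def binomial_likelihood(p, h, n):
    """Likelihood of heads-probability p given h heads in n tosses."""
    return comb(n, h) * p**h * (1 - p) ** (n - h)


def grid_search_mle(h, n, grid):
    """Return the grid value of p with the highest likelihood."""
    return max(grid, key=lambda p: binomial_likelihood(p, h, n))


# Candidate values 0.48, 0.50, ..., 0.62 (illustrative grid).
grid = [round(0.48 + 0.02 * i, 2) for i in range(8)]

if __name__ == "__main__":
    for p in grid:
        print(f"{p:.2f}  {binomial_likelihood(p, 56, 100):.4f}")
    print("best p on the grid:", grid_search_mle(56, 100, grid))
```

A finer grid, or a proper optimizer, would be needed when the maximum does not happen to fall on one of the tabulated values.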

MLE: Sample Graphs (using Mathematica)

f[p_, H_] := Binomial[10, H] p^H (1 - p)^(10 - H)
Plot[f[p, H], {p, 0, 1}]

[Figure: four plots of the likelihood f against p (from 0 to 1) for 10 tosses: 6H, 4T (H = 6); 3H, 7T (H = 3); 5H, 5T (H = 5); 8H, 2T (H = 8). Each curve peaks at p = H/10.]


• The best estimate for p from any one sample is clearly going to be the proportion of heads observed in that sample.

• For an example this simple, the MLE approach is overkill for evaluating p.

• But not all problems are this simple! The more complex the model and the greater the number of parameters, the more difficult it becomes to make even reasonable guesses at the MLEs.

Phylogenetic Tree

A phylogenetic tree is a data structure characterized by:

• its topology (form)
• its branch lengths

It stores information regarding the relationships among several species or sequences.

Types of Phylogenetic Trees

Rooted tree: the assumed ancestral state "d" is the root species.

Unrooted tree: no implicit "directionality", but a measure of similarity between species.

[Figure: a rooted tree and an unrooted tree over the species a, b, c and d, with the leaves, branches and root labelled.]

(1) A G G C U C C A A

(2) A G G U U C G A A

(3) A G C C C A G A A

(4) A U U U C G G A A

Molecular phylogenetic methods use a given set of aligned sequences to construct a phylogenetic tree.




There are several ways to construct phylogenetic trees.

The Maximum Likelihood method picks the tree under which the observed sequences are most probable, and takes it as the best estimate of the true tree.

The Maximum Likelihood Approach

1. Each site of the alignment is assumed to evolve independently of the others.

(1) A G G C T C C A A … A
(2) A G G T T C G A A … A
(3) A G C C C A G A A … A
(4) A T T T C G G A A … C

The sites are numbered 1, 2, …, N; j denotes one site (column) of the alignment.

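The Maximum Likelihood Approach (cont.)

2. The log-likelihood is computed for a given topology by using a particular probability model (binomial, multinomial, Poisson, …).

[Figure: a) the tree at site j, with tips C, A, C, G and unknown internal nodes x and y; b) the same tree with x = A, y = A; c) the same tree with x = G, y = G.]

The likelihood at site j is the sum of the probabilities of the possible assignments of nucleotides to the internal nodes x and y:

$$L(j) = \mathrm{Prob}(x = A,\ y = A) + \dots + \mathrm{Prob}(x = G,\ y = G)$$

Because the sites are independent, the log-likelihood of the whole alignment is the sum over sites:

$$\ln L = \ln L(1) + \ln L(2) + \dots + \ln L(j) + \dots + \ln L(N) = \sum_{i=1}^{N} \ln L(i)$$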
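The per-site sum over internal nodes, and the sum of log-likelihoods over sites, can be sketched as follows. This is a brute-force illustration under several assumptions that are not in the slides: the JC69 substitution model, a single common branch length d on every branch, and a tree in which the first two tips hang from x and the last two from y. All function names are hypothetical.

```python
from itertools import product
from math import exp, log

BASES = "ACGT"


def jc69_prob(a, b, d):
    """JC69 probability of observing b at the end of a branch that starts
    in state a, for branch length d (expected substitutions per site)."""
    decay = exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay


def site_likelihood(tips, d):
    """L(j) for the assumed tree ((t1, t2)x, (t3, t4)y): sum over all 16
    assignments of nucleotides to the internal nodes x and y."""
    t1, t2, t3, t4 = tips
    total = 0.0
    for x, y in product(BASES, repeat=2):
        total += (
            0.25                                         # prior on x (equal base frequencies)
            * jc69_prob(x, t1, d) * jc69_prob(x, t2, d)  # x to its two tips
            * jc69_prob(x, y, d)                         # internal branch x to y
            * jc69_prob(y, t3, d) * jc69_prob(y, t4, d)  # y to its two tips
        )
    return total


def log_likelihood(columns, d):
    """ln L = sum of ln L(i) over independent sites."""
    return sum(log(site_likelihood(col, d)) for col in columns)
```

For example, `site_likelihood("CACG", 0.1)` evaluates the site shown in the figure under these assumptions. Real programs avoid the explicit 16-term sum (it grows exponentially with the number of internal nodes) and also optimize each branch length separately.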

The Maximum Likelihood Approach (cont.)

3. After this procedure is done for all possible topologies, the topology that shows the highest likelihood is chosen as the true (realistic) tree.

How many topologies do we have to go through for n sequences?

$$\#\text{Rooted trees} = \frac{(2n-3)!}{2^{\,n-2}\,(n-2)!}$$

$$\#\text{Unrooted trees} = \frac{(2n-5)!}{2^{\,n-3}\,(n-3)!}$$
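To get a feel for how quickly the number of topologies grows, the two counting formulas can be evaluated directly. A minimal sketch; the function names are chosen here for illustration.

```python
from math import factorial


def rooted_tree_count(n):
    """Number of rooted bifurcating topologies for n sequences (n >= 2)."""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_tree_count(n):
    """Number of unrooted bifurcating topologies for n sequences (n >= 3)."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


if __name__ == "__main__":
    for n in (4, 5, 10, 20):
        print(n, unrooted_tree_count(n), rooted_tree_count(n))
```

Four sequences give 3 unrooted and 15 rooted topologies; ten sequences already give 2,027,025 unrooted topologies, which is why an exhaustive search is only feasible for small n.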

The Maximum Likelihood Approach (cont.)

• The result is consistent,

• but the search is time consuming!

DNA – THE BASIS OF MOLECULAR PHYLOGENETICS

• The DNA molecule (polymer) is made of monomer units called nucleotides

• Each nucleotide consists of:

– a 5-carbon sugar
– a phosphate group
– a nitrogen base

There are two groups of nitrogen bases:

• Purines

• Pyrimidines

• There are 4 different types of nucleotides in DNA, differing only in the nitrogen base.

• The four nucleotides are given one-letter abbreviations (the first letter of the base's name):

– "A"denine
– "G"uanine
– "C"ytosine
– "T"hymine

• Purines are the larger molecules of the two groups

• Adenine and Guanine belong to the purine group

• Pyrimidines are the smaller molecules of the two groups

• Cytosine and Thymine belong to the pyrimidine group

• The DNA backbone is a polymer with an alternating sugar-phosphate sequence

• Adenine forms 2 hydrogen bonds with Thymine on the opposite strand

• This is a fixed pairing

• Guanine forms 3 hydrogen bonds with Cytosine

• This is also a fixed pairing

• Changes in DNA sequences occur through mutations

• There are two kinds of mutations between nucleotides:

– Transition
– Transversion

Transition

• A mutation between two nucleotides from the same nitrogen base group

– Purine transition: G ↔ A
– Pyrimidine transition: C ↔ T

Transversion

• A mutation between any two nucleotides belonging to different groups

Purines ↔ Pyrimidines, for example:

– T ↔ A
– C ↔ G
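The two definitions translate directly into a small classifier. A minimal sketch; the function name and the returned labels are choices made for this illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a, b):
    """Classify a change from nucleotide a to nucleotide b as
    'transition', 'transversion', or 'none' (no change)."""
    a, b = a.upper(), b.upper()
    for base in (a, b):
        if base not in PURINES | PYRIMIDINES:
            raise ValueError(f"not a DNA nucleotide: {base!r}")
    if a == b:
        return "none"
    # Same group (both purines or both pyrimidines) means a transition.
    same_group = (a in PURINES) == (b in PURINES)
    return "transition" if same_group else "transversion"


if __name__ == "__main__":
    for pair in ("GA", "CT", "TA", "CG"):
        print(pair[0], "->", pair[1], classify_substitution(*pair))
```

Each nucleotide has one possible transition and two possible transversions, so A ↔ C and G ↔ T are transversions as well.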

Two basic elements of DNA substitution

π: Composition

r: The process

π: Composition:

The composition is just the proportion of the four nucleotides.

π = [0.1, 0.4, 0.2, 0.3]; the sum of π = 1
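As an illustration, the composition of an observed sequence can be estimated by counting nucleotides. This is a sketch only: the function name is hypothetical, and the order A, C, G, T of the returned proportions is an assumption, since the slides do not say which entry of the composition vector belongs to which nucleotide.

```python
from collections import Counter


def composition(sequence, alphabet="ACGT"):
    """Proportion of each nucleotide in the sequence, in alphabet order
    (assumed here to be A, C, G, T)."""
    counts = Counter(sequence.upper())
    total = sum(counts[base] for base in alphabet)
    if total == 0:
        raise ValueError("sequence contains no nucleotides from the alphabet")
    return [counts[base] / total for base in alphabet]


if __name__ == "__main__":
    freqs = composition("CCGGCCGCGCG")
    print(freqs, "sum =", sum(freqs))
```

By construction the proportions add up to 1, matching the constraint on the composition vector.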

r: The process:

It can be described by a matrix of numbers describing how the nucleotides change from one to another.

DNA substitution can be described by a time-homogeneous Poisson process

DNA substitution model

        A      C      G      T
A       .      r2     r4     r6
C       r1     .      r8     r10
G       r3     r7     .      r12
T       r5     r9     r11    .

The Likelihood of Two DNA Sequences

JC69 model assumed:

π: [¼, ¼, ¼, ¼]

rate of change: equal for all nucleotides

n1: the number of sites that remain the same

n2: the number of sites that changed

t: the distance from node A to node B

Sequence A: CCGGCCGCGCG

Sequence B: CGGGCCGGCCG

Length = 11; n1 = 8; n2 = 3; rate of change = 0.007

Similarity between A and B is n1/(n1 + n2) = 73%

From the following plot we find that the maximum likelihood is about 1.4E-11, at a distance of about 17.

[Figure: Likelihood vs. distance, for distances from 0 to 100; the likelihood axis runs up to 1.4 × 10^-11.]
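The likelihood curve for the two sequences can be reproduced with a short script. This is a sketch under an assumed JC69 parametrization, in which the probability that a site is unchanged after distance t is 1/4 + 3/4·exp(−4·rate·t); the slides do not state which parametrization was used, and the function names are hypothetical. Under this assumption the peak falls near a distance of 16, close to the value read off the plot.

```python
from math import exp, log


def count_sites(seq_a, seq_b):
    """Return (n1, n2): sites that are the same and sites that differ."""
    n_same = sum(a == b for a, b in zip(seq_a, seq_b))
    return n_same, len(seq_a) - n_same


def jc69_pair_likelihood(n_same, n_diff, rate, t):
    """Likelihood of two sequences separated by distance t under JC69
    (assumed form: P(same) = 1/4 + 3/4 * exp(-4 * rate * t))."""
    decay = exp(-4.0 * rate * t)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay      # probability of one specific change
    # Each site contributes (base frequency 1/4) * (transition probability).
    return (0.25 * p_same) ** n_same * (0.25 * p_diff) ** n_diff


def jc69_ml_distance(n_same, n_diff, rate):
    """Closed-form distance that maximizes the likelihood above.
    Only defined when fewer than 3/4 of the sites differ."""
    frac_diff = n_diff / (n_same + n_diff)
    return -log(1.0 - 4.0 * frac_diff / 3.0) / (4.0 * rate)


if __name__ == "__main__":
    n1, n2 = count_sites("CCGGCCGCGCG", "CGGGCCGGCCG")
    t_hat = jc69_ml_distance(n1, n2, rate=0.007)
    print(n1, n2, t_hat, jc69_pair_likelihood(n1, n2, 0.007, t_hat))
```

Evaluating `jc69_pair_likelihood` over a range of t values gives the curve in the plot; the closed form simply saves the search.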

High similarity vs. low similarity

[Figure: likelihood vs. distance for two pairs of sequences: 73% similarity (distance axis 0 to 100, likelihood up to 1.4 × 10^-11) and 27% similarity (distance axis 0 to 1000, likelihood up to 5 × 10^-14).]

Higher similarity, shorter distance

Long sequences vs. short sequences

Longer input sequences produce a sharper curve

[Figure: likelihood vs. distance (0 to 100) for a sequence of length 11 (likelihood up to 1.4 × 10^-11) and a sequence of length 220 (likelihood up to 8 × 10^-218).]

Big vs. small rate of change

A slower rate of change gives a longer distance

[Figure: likelihood vs. distance (0 to 100) for rates of change 0.007 and 0.0025.]

Multiple DNA sequences as input

PAUP* is designed for the reconstruction of phylogenetic trees based on nucleic acid alignments.

It is available at

http://www.sinauser.com

Example output from PAUP*

DNA Substitution Models

• All models are special cases of the general model

• The unknown parameters are:

– Nucleotide frequency
– Rate of change (mutation)

• Simplest model: equal mutation rates and equal nucleotide frequencies

• Other models assume unequal nucleotide frequencies and/or different mutation rates

Likelihood & Phylogenetics

• The Maximum Likelihood method helps us:

– Determine the most probable tree for a set of DNA sequences
– Determine the best DNA substitution model to describe our data

Advantages of the Maximum Likelihood Method

• The method can be used in a wide range of estimation problems, and produces consistent results

• When the data set is large, the parameter estimates have a very small variance and come very close to the true value

– This allows us to draw conclusions about the evolutionary process

Disadvantages of the Maximum Likelihood Method

• The likelihood equations need to be worked out for a given distribution, and they are usually very complicated

– Fortunately, Maximum Likelihood software is becoming common

• Maximum Likelihood estimates can be very biased for small samples
