phylogenetic estimation using maximum likelihood by: jimin zhu xin gong xin gong sravanti polsani...
Post on 21-Dec-2015
Phylogenetic Estimation using Maximum Likelihood
By: Jimin Zhu
Xin Gong
Sravanti Polsani
Rama Sharma
Shlomit Klopman
The Scope of the Presentation
• Introduction
• Maximum Likelihood and Coin Tossing
• The Phylogenetic Tree
• Maximum Likelihood and DNA Substitution
• Advantages and Disadvantages of Maximum Likelihood
Introduction
• Phylogeny: the study of relationships between life forms
• Phylogenetics is part of the field of taxonomy and systematics
• Phylogenetics received a huge push forward thanks to modern computers
• Various phylogenetic methods are used to explain the evolutionary process, and they often give contradictory results!
Introduction (cont.)
• Scientists agree that a correct species lineage should be determined using statistics
• Maximum Likelihood is the method of choice for establishing the most realistic phylogenetic tree for a given data set
• The Maximum Likelihood method was introduced in 1922 by R.A. Fisher, an English statistician
Maximum Likelihood in a Nutshell
• The method depends on:
– A complete data set
– A probabilistic model that describes the data
– Explicitly expressing the likelihood function
• The likelihood of a data set is the probability of obtaining it, given the chosen probability distribution model
• We seek the values of the parameters that maximize the sample likelihood
Maximum Likelihood Approach Using a Coin Tossing Experiment
• Find the parameter value(s) that make the observed data most likely. Basically, choose the value of the parameter that maximizes the probability of observing the data.
• Probability: knowing parameters → prediction of outcome
• Likelihood: observation of data → estimation of parameters
• Parameters describe the characteristics of a population. Their values are estimated from samples collected from that population.
Simple Coin Tossing Experiment
• Binomial probability distribution
The probability of observing h heads out of n tosses can be described as:
$$\Pr[h \mid p, n] = \frac{n!}{h!\,(n-h)!}\, p^{h} (1-p)^{n-h}$$
where p is the probability of heads and (1-p) is the probability of tails.
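The binomial formula can be evaluated directly. The following is a minimal Python sketch; the function name `binomial_probability` is a hypothetical helper chosen for illustration, not something from the slides.

```python
from math import comb


def binomial_probability(h: int, n: int, p: float) -> float:
    """Probability of exactly h heads in n tosses when Pr(heads) = p."""
    # comb(n, h) is the binomial coefficient n! / (h! (n-h)!)
    return comb(n, h) * p**h * (1 - p) ** (n - h)


# Example: a fair coin tossed 10 times, probability of exactly 4 heads.
print(binomial_probability(4, 10, 0.5))
```

Read as a function of h with p fixed, this is a probability distribution over outcomes; read as a function of p with h fixed, the same expression is the likelihood.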
Simple Coin Tossing Experiment
Suppose I told you we tossed a coin 10 times and got 4 heads and 6 tails. The probability of that outcome would be
$$P(4\ \text{heads},\ 6\ \text{tails}) = \frac{10!}{4!\,6!}\, p^{4} (1-p)^{6}$$
The whole notion of maximum likelihood estimation is that we choose p to be the value that makes the probability of getting our set of observations the largest possible, i.e. maximize $p^{4}(1-p)^{6}$ (the factor 10!/(4! 6!) does not depend on p). So our likelihood function would be: $L(p) = p^{4}(1-p)^{6}$
Two ways to find the MLE
1. Take the first derivative of the likelihood function with respect to each parameter, set the resulting equations equal to 0, and solve for the parameter estimates.
For h heads in n tosses the likelihood is $L(p) = p^{h}(1-p)^{n-h}$. Applying log on both sides:
$$\log L(p) = h \log p + (n-h) \log(1-p)$$
Take the first derivative w.r.t. p and set it to zero:
$$\frac{h}{p} - \frac{n-h}{1-p} = 0$$
Solving for p, we get $p = h/n$.
This value maximizes the likelihood function and is the MLE.
Find the maximum using numeric search procedures
2. Plug different values of p into the probability model and calculate the likelihood.
Let's take a sample with n = 100, h = 56.
Imagine that p was 0.5. Plugging this value into our probability model:
$$L(p = 0.5 \mid \text{data}) = \frac{100!}{56!\,44!}\, 0.5^{56}\, 0.5^{44} = 0.0389$$
But what if p was 0.52 instead?
$$L(p = 0.52 \mid \text{data}) = \frac{100!}{56!\,44!}\, 0.52^{56}\, 0.48^{44} = 0.0581$$
• So from this we can conclude that p is more likely to be 0.52 than 0.5. We can tabulate the likelihood for different parameter values to find the maximum likelihood estimate of p:

p      L
----   ------
0.48   0.0222
0.50   0.0389
0.52   0.0581
0.54   0.0739
0.56   0.0801
0.58   0.0738
0.60   0.0576
0.62   0.0378

The maximum likelihood estimate of p appears to be 0.56, which agrees with h/n = 56/100.
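The tabulation can be reproduced with a short grid search. This is a sketch in Python; the names `likelihood`, `candidates` and `best` are illustrative choices, and the grid simply mirrors the p values in the table.

```python
from math import comb


def likelihood(p: float, h: int = 56, n: int = 100) -> float:
    """Binomial likelihood of p given h heads in n tosses."""
    return comb(n, h) * p**h * (1 - p) ** (n - h)


# Same candidate values as the table: 0.48, 0.50, ..., 0.62.
candidates = [round(0.48 + 0.02 * k, 2) for k in range(8)]

for p in candidates:
    print(f"{p:.2f}   {likelihood(p):.4f}")

# The candidate with the largest likelihood is the grid estimate of the MLE.
best = max(candidates, key=likelihood)
print("grid maximum at p =", best)
```

A grid can only return one of the values it was given, so a finer grid (or the closed form h/n) is needed whenever the true maximum falls between grid points.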
MLE: Sample Graphs (using Mathematica)
f[p_, H_] := Binomial[10, H] p^H (1 - p)^(10 - H); Plot[f[p, H], {p, 0, 1}]
[Figure: four plots of the likelihood f against p from 0 to 1 for 10 tosses: 6H, 4T (H = 6); 3H, 7T (H = 3); 5H, 5T (H = 5); 8H, 2T (H = 8)]
Simple Coin Tossing Experiment
• The best estimate for p from any one sample is clearly going to be the proportion of heads observed in that sample.
• For an example this simple, the MLE approach is more machinery than is needed to estimate p.
• But not all problems are this simple! As the model becomes more complex and the number of parameters grows, it often becomes very difficult to make even reasonable guesses at the MLEs.
Phylogenetic Tree
A phylogenetic tree is a data structure, characterized by:
• its topology (form)
• its branch lengths
It stores information regarding the relationship of several species or sequences.
Types of Phylogenetic Trees
• Rooted tree: the assumed ancestral state "d" is the root species.
• Unrooted tree: no implicit "directionality", but it is a measure of similarity between species.
[Figure: a rooted and an unrooted tree over the species a, b, c, d, with the leaves, branches and root labeled]
Molecular phylogenetic methods use a given set of aligned sequences to construct a phylogenetic tree:
sequence 1: A G G C U C C A A
sequence 2: A G G U U C G A A
sequence 3: A G C C C A G A A
sequence 4: A U U U C G G A A
There are several ways to construct phylogenetic trees.
The Maximum Likelihood method picks the tree under which the observed sequences are most probable, and takes it as the best representation of the true tree.
The Maximum Likelihood Approach
1. Assume that the sites (columns) of the aligned sequences are independent of one another.
Site: 1 2 … j … N
(1) A G G C T C C A A … A
(2) A G G T T C G A A … A
(3) A G C C C A G A A … A
(4) A T T T C G G A A … C
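The Maximum Likelihood Approach (cont.)
2. The log-likelihood is computed for a given topology by using a particular probability model (binomial, multinomial, Poisson, ...).
[Figure: at site j the four leaves carry C, A, C, G and the internal nodes x and y are unknown; panels a), b), c) show the tree with different nucleotides, such as A, A or G, G, assigned to x and y]
L(j) = Prob(first assignment) + … + Prob(last assignment), with one term for each assignment of nucleotides to the internal nodes x and y
$$\ln L = \ln L(1) + \ln L(2) + \dots + \ln L(j) + \dots + \ln L(N) = \sum_{i=1}^{N} \ln L(i)$$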
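The per-site sum and the sum of logs over sites can be sketched in Python. Everything specific here is an assumption made for illustration: the JC69 substitution probabilities, a single shared branch length `d` (in expected substitutions per site), the topology ((1,2)x,(3,4)y) rooted at x, and the helper names. The slides do not say which model or branch lengths were used for this step.

```python
from itertools import product
from math import exp, log

BASES = "ACGT"


def jc69_prob(a: str, b: str, d: float) -> float:
    """JC69 probability that base a is replaced by base b along a branch of length d."""
    e = exp(-4.0 * d / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def site_likelihood(leaves, d: float = 0.1) -> float:
    """L(j): sum over all 16 assignments of bases to the internal nodes x and y."""
    s1, s2, s3, s4 = leaves
    total = 0.0
    for x, y in product(BASES, repeat=2):
        total += (
            0.25  # probability of base x at the root (equal base frequencies)
            * jc69_prob(x, s1, d) * jc69_prob(x, s2, d)  # x to leaves 1 and 2
            * jc69_prob(x, y, d)                         # x to y
            * jc69_prob(y, s3, d) * jc69_prob(y, s4, d)  # y to leaves 3 and 4
        )
    return total


def log_likelihood(alignment, d: float = 0.1) -> float:
    """ln L = sum of ln L(j) over sites, which relies on sites being independent."""
    return sum(log(site_likelihood(column, d)) for column in zip(*alignment))


# The first nine sites of the four aligned sequences shown earlier.
alignment = ["AGGCTCCAA", "AGGTTCGAA", "AGCCCAGAA", "ATTTCGGAA"]
print(log_likelihood(alignment))
```

Working with logs turns the product of many small per-site likelihoods into a sum, which also avoids numerical underflow for long alignments.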
The Maximum Likelihood Approach (cont.)
3. After the procedure is done for all possible topologies, the topology that shows the highest likelihood is chosen as the true (realistic) tree.
How many topologies do we have to go through for n sequences?
$$\#\text{Rooted trees} = \frac{(2n-3)!}{2^{\,n-2}\,(n-2)!}$$
$$\#\text{Unrooted trees} = \frac{(2n-5)!}{2^{\,n-3}\,(n-3)!}$$
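To get a feel for how quickly the number of topologies grows, the two counting formulas for bifurcating trees can be evaluated directly. This is a small Python sketch; the function names are illustrative.

```python
from math import factorial


def rooted_trees(n: int) -> int:
    """Number of rooted bifurcating topologies for n >= 2 sequences."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating topologies for n >= 3 sequences."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (4, 5, 10, 20):
    print(n, rooted_trees(n), unrooted_trees(n))
```

Four sequences give 15 rooted and 3 unrooted topologies, while ten sequences already give 34,459,425 and 2,027,025, which is why an exhaustive search over topologies becomes impractical so quickly.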
The Maximum Likelihood Approach (cont.)
• The result is consistent.
• But the method is time consuming!
DNA – THE BASIS OF MOLECULAR PHYLOGENETICS
• The DNA molecule (polymer) is made of monomer units called nucleotides
• Each nucleotide consists of:
– a 5-carbon sugar
– a phosphate group
– a nitrogen base
There are two groups of nitrogen bases:
• Purines
• Pyrimidines
• There are 4 different types of nucleotides in DNA, differing only in the nitrogen base.
• The four nucleotides are given one-letter abbreviations (the first letter of the name of their base):
– "A"denine
– "G"uanine
– "C"ytosine
– "T"hymine
• Purines are the larger molecules of the two groups
• Adenine and Guanine belong to the purine group
• Pyrimidines are the smaller molecules of the two groups
• Cytosine and Thymine belong to the pyrimidine group
• The DNA backbone is a polymer with an alternating sugar-phosphate sequence
• Adenine forms 2 hydrogen bonds with thymine on the opposite strand
• This is a fixed pairing
• Guanine forms 3 hydrogen bonds with cytosine
• This is also a fixed pairing
• Changes in DNA sequences occur through mutations
• There are two kinds of mutations between nucleotides:
– Transition
– Transversion
Transition
• A mutation between two nucleotides from the same nitrogen base group
– Purine transition: G ↔ A
– Pyrimidine transition: C ↔ T
Transversion
• A mutation between any two nucleotides belonging to different groups
Purines ↔ Pyrimidines
– T ↔ A
– C ↔ G
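The two kinds of mutation can be told apart with a simple set test. This is a minimal Python sketch; `classify_substitution` is a hypothetical helper name used only for illustration.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(a: str, b: str) -> str:
    """Classify a change from base a to base b."""
    if a == b:
        return "none"
    pair = {a, b}
    # Both bases in the same group: transition. Otherwise: transversion.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


for a, b in [("G", "A"), ("C", "T"), ("T", "A"), ("C", "G")]:
    print(a, "->", b, classify_substitution(a, b))
```

Of the twelve possible changes between distinct bases, four are transitions and eight are transversions.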
Two basic elements of DNA substitution
π: Composition
r: The process
π: Composition:
The composition is just the proportion of the four nucleotides.
π = [0.1, 0.4, 0.2, 0.3], the sum of π = 1
r: The process:
can be described by a matrix of numbers, describing how the nucleotides change from one to another
DNA substitution can be described by a time-homogeneous Poisson process
DNA substitution model

      A     C     G     T
A     .     r2    r4    r6
C     r1    .     r8    r10
G     r3    r7    .     r12
T     r5    r9    r11   .
The Likelihood of two DNA Sequences
JC69 model assumed
Composition: [¼, ¼, ¼, ¼]
Rate of change: equal for all nucleotides
n1: the number of sites that remain the same
n2: the number of sites that change
t: the distance from node A to B
Sequence A: CCGGCCGCGCG
Sequence B: CGGGCCGGCCG
Length = 11; n1 = 8; n2 = 3; rate of change = 0.007
Similarity between A and B is n1/(n1 + n2) = 73%
From the following plot we find the ML is 1.4E-11 where the distance is 17
[Figure: likelihood vs. distance; distance axis 0 to 100, likelihood axis up to 1.4×10^-11]
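The likelihood curve for the two sequences can be sketched numerically in Python. The parametrization is an assumption: the probability of a site staying the same is taken as 1/4 + 3/4·exp(-4·rate·t), and of changing to one specific other base as 1/4 - 1/4·exp(-4·rate·t). The slides do not state which JC69 parametrization was used, so the maximizing distance found here need not match the plot exactly, whereas the height of the maximum does not depend on that choice.

```python
from math import exp

SEQ_A = "CCGGCCGCGCG"
SEQ_B = "CGGGCCGGCCG"
RATE = 0.007  # rate of change quoted for this example


def jc69_pair_likelihood(seq_a: str, seq_b: str, t: float, rate: float = RATE) -> float:
    """Likelihood of two aligned sequences separated by distance t under JC69."""
    decay = exp(-4.0 * rate * t)          # assumed parametrization, see lead-in
    p_same = 0.25 + 0.75 * decay          # site unchanged
    p_diff = 0.25 - 0.25 * decay          # site changed to one specific other base
    n1 = sum(a == b for a, b in zip(seq_a, seq_b))  # sites that remain the same
    n2 = len(seq_a) - n1                            # sites that change
    # 0.25 per site is the composition term for the base in sequence A.
    return 0.25 ** (n1 + n2) * p_same**n1 * p_diff**n2


# Scan distances from 0.1 to 100 in steps of 0.1 and keep the best one.
distances = [0.1 * k for k in range(1, 1001)]
best_t = max(distances, key=lambda t: jc69_pair_likelihood(SEQ_A, SEQ_B, t))
max_l = jc69_pair_likelihood(SEQ_A, SEQ_B, best_t)
print(best_t, max_l)
```

Under this parametrization only the product rate × t enters the likelihood, so halving the rate doubles the estimated distance while leaving the maximum likelihood value unchanged.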
High similarity vs. low similarity
[Figure: two plots of likelihood vs. distance. 73% similarity: distance axis 0 to 100, likelihood up to 1.4×10^-11. 27% similarity: distance axis 0 to 1000, likelihood up to 5×10^-14]
Higher similarity, shorter distance
Long sequences vs. short sequences
Longer input sequences produce a sharper curve
[Figure: two plots of likelihood vs. distance, distance axis 0 to 100. Sequence length 11: likelihood up to 1.4×10^-11. Sequence length 220: likelihood up to 8×10^-218]
Big vs. small rate of change
A slower rate of change gives a longer distance
[Figure: two plots of likelihood vs. distance, distance axis 0 to 100, likelihood up to 1.4×10^-11, for rates of change 0.007 and 0.0025]
Multi DNA sequences as input
PAUP* is designed for the reconstruction of phylogenetic trees based on nucleic acid alignments. It is available at http://www.sinauser.com
Example output from PAUP*
DNA Substitution Models
• All models are special cases of the general model
• The unknown parameters are:
– Nucleotide frequency
– Rate of change (mutation)
• Simplest model: equal mutation rates and equal nucleotide frequencies
• Other models assume unequal nucleotide frequencies and/or different mutation rates
Likelihood & Phylogenetics
• The Maximum Likelihood method helps us:
– Determine the most probable tree of a set of DNA sequences
– Determine the best DNA substitution model to describe our data
Advantages of the Maximum Likelihood Method
• The method can be used in a wide range of estimation problems, and it produces consistent results
• When the data set is large, the parameter estimates have a very small variance and come very close to the true value
– This allows us to draw conclusions about the evolutionary process
Disadvantages of the Maximum Likelihood Method
• The likelihood equations need to be worked out for a given distribution, and they are usually very complicated
– Fortunately, Maximum Likelihood software is becoming common
• Maximum Likelihood estimates can be very biased for small samples