Orthology, and its relevance to protein function prediction

Upload: saxon

Post on 12-Jan-2016

Page 1: Orthology, and its relevance to protein function prediction

Orthology, and its relevance to protein function prediction

Fitch 1970: “Where the homology is the result of gene duplication so that both copies have descended side by side during the history of an organism, (for example alpha and beta hemoglobin) the genes should be called paralogous (para= in parallel). Where the homology is the result of speciation so that the history of the gene reflects the history of the species (for example, alpha hemoglobin in man and mouse) the genes should be called orthologous (ortho=exact)”

Page 2: Orthology, and its relevance to protein function prediction

Comparing genomes for their genes, orthologs

[Figure: gene tree of Gene A with copies A and B in Species I and Species II. Labels: gene duplication, speciation, orthologs.]

“Which genes do two genomes share, and which don’t they share, and how does that relate to their phenotypical similarities and differences”

Page 3: Orthology, and its relevance to protein function prediction

Orthology: Who cares?

Page 4: Orthology, and its relevance to protein function prediction

[Figure: tree of the dystrophin family. Labels: Dystrophin related protein 2, Dystrophin, Utrophin, Dystrotelin, Dystrobrevin.]

The DYS-1 gene from C. elegans is not orthologous to dystrophin, so it is not so surprising that the knockout has no effect on the muscle cells.

Page 5: Orthology, and its relevance to protein function prediction

Best-bidirectional hits: a graph based approach

[Figure: proteins of Genome I and Genome II connected by hits labelled with sequence identities of 35%, 30%, 25% and 23%]

Orthologs are expected to have relatively high levels of sequence identity to each other (compared to other non-orthologous homologs), because they diverged relatively recently, and … because they have similar functions (???)

Large scale orthology determination is often done using bidirectional best hits.

Page 6: Orthology, and its relevance to protein function prediction

An implementation of large-scale detection of pairwise orthology relations between genomes:

1) Pairwise comparison of all genomes with each other, using the Smith-Waterman algorithm to detect “all” homologous relations (E < 0.01) between the predicted proteins.

2) Select orthologous relations by selecting best-bidirectional hits between proteins.
- Scoring as level of sequence identity * length of hit.
- Including the possibility of gene fusion/fission (protein A from genome I can be orthologous to proteins B and C from genome II). By selecting

[Figure: protein A in genome I aligned against proteins B and C in genome II]
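The selection of best-bidirectional hits described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the slides: the function names, the tuple layout `(query, subject, identity, length)` and the toy hit lists are all assumptions, and gene fusion/fission is not handled.

```python
def hit_score(identity, length):
    """Score a hit as level of sequence identity * length of hit."""
    return identity * length


def best_hits(hits):
    """Map each query protein to its best-scoring subject protein.

    `hits` is an iterable of (query, subject, identity, length) tuples
    (a hypothetical layout chosen for this sketch).
    """
    best = {}
    for query, subject, identity, length in hits:
        score = hit_score(identity, length)
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(hits_1_vs_2, hits_2_vs_1):
    """Return the pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(hits_1_vs_2)
    backward = best_hits(hits_2_vs_1)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)


# Toy data (invented): two proteins per genome.
hits_1_vs_2 = [
    ("A1", "B1", 0.35, 300), ("A1", "B2", 0.25, 300),
    ("A2", "B1", 0.30, 280), ("A2", "B2", 0.23, 280),
]
hits_2_vs_1 = [
    ("B1", "A1", 0.35, 300), ("B1", "A2", 0.30, 280),
    ("B2", "A1", 0.25, 300), ("B2", "A2", 0.23, 280),
]
pairs = bidirectional_best_hits(hits_1_vs_2, hits_2_vs_1)
print(pairs)
```

In this toy case A2's best hit is B1, but B1's best hit is A1, so only the pair A1/B1 is reported.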

Page 7: Orthology, and its relevance to protein function prediction

Multiple genomes can be used to check for consistency of bidirectional best hits.

[Figure: hits among proteins of Genome I, Genome II and Genome III, labelled with sequence identities (35%, 35%, 25%, 23%; 40%, 30%, 22%; 35%, 20%, 25%)]

Page 8: Orthology, and its relevance to protein function prediction

The solution to the non-transitivity of the concept of orthology sensu stricto is “Group orthology”

Conceptually: all proteins that are directly descended from one protein in the last common ancestor of the species one is interested in are considered orthologous to each other

Operationally in a “graph-based approach”: Combine all connected “best triangular hits” into Clusters of Orthologous Groups (COGs, Tatusov et al, 1997). WWW.NCBI.NLM.GOV
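As a rough sketch of the graph-based idea, the code below finds "triangles" of mutual best hits between three different genomes and merges groups that share at least two proteins (for two triangles, a common side). It is an illustration under stated assumptions, not the COG procedure itself: the function names, the `genome_of` mapping and the toy data are hypothetical.

```python
from itertools import combinations


def find_triangles(bbh_pairs, genome_of):
    """Find triples of proteins from three genomes that are all pairwise best hits."""
    edges = {frozenset(pair) for pair in bbh_pairs}
    proteins = sorted({protein for edge in edges for protein in edge})
    triangles = []
    for a, b, c in combinations(proteins, 3):
        if len({genome_of[a], genome_of[b], genome_of[c]}) < 3:
            continue  # a triangle must span three different genomes
        if all(frozenset(p) in edges for p in ((a, b), (a, c), (b, c))):
            triangles.append(frozenset((a, b, c)))
    return triangles


def merge_triangles(triangles):
    """Merge groups until no two groups share two or more proteins."""
    groups = [set(t) for t in triangles]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(groups)), 2):
            if len(groups[i] & groups[j]) >= 2:
                groups[i] |= groups[j]
                del groups[j]
                merged = True
                break  # restart, the list of groups has changed
    return sorted(sorted(group) for group in groups)


# Toy data (invented): proteins a1..a4 in genomes I..IV, plus an isolated pair b1/b2.
genome_of = {"a1": "I", "a2": "II", "a3": "III", "a4": "IV", "b1": "I", "b2": "II"}
bbh_pairs = [("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
             ("a2", "a4"), ("a3", "a4"), ("b1", "b2")]
cogs = merge_triangles(find_triangles(bbh_pairs, genome_of))
print(cogs)
```

The pair b1/b2 is a bidirectional best hit but is not part of any triangle, so in this sketch it does not form a group.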

Page 9: Orthology, and its relevance to protein function prediction
Page 10: Orthology, and its relevance to protein function prediction

Gene duplications are creative, creating the possibility for developing new functions (in this case involved in carnitine synthesis) but ….

They mess up orthology. Furthermore, intricate combinations of gene duplications and speciations make orthology non-transitive

Page 11: Orthology, and its relevance to protein function prediction

Inparalogs versus outparalogs:

Inparalogs are due to relatively recent, species-specific gene duplications, e.g. Q9V6P0 and Q9VY24.

Outparalogs are due to gene duplications that preceded speciations, e.g. Q9V6P0 vs. Q9VDM7

Page 12: Orthology, and its relevance to protein function prediction

Disadvantages of graph-based approaches to orthology

Parallel non-orthologous gene loss can lead to misidentification of orthology relations when using best bidirectional hits as criterion.

[Figure: gene tree of Gene A with copies A and B in Species I and Species II. Labels: gene duplication, speciation, gene loss; non-orthologs, although bidirectional best hits.]

Page 13: Orthology, and its relevance to protein function prediction

[Figure: phylogenetic tree of DnaK and HscA homologs, with branch support values such as 1000, 997 and 922 and a 0.2 scale bar. Leaves: B. subtilis DnaK; E. coli, Buchnera and R. prowazekii HscA; E. coli, Buchnera and R. prowazekii DnaK; S. cerevisiae YHR064C, BR169C, YPL106C, YKL073W, ECM10, SSC1 and SSQ1; H. sapiens 199642, 59062, 187116 and 23627]

Variations in the rate of evolution can lead to misidentification of orthology relations when the latter are based on bi/multi-directional best hits.

Disadvantages of graph-based approaches to orthology

Page 14: Orthology, and its relevance to protein function prediction

Graph based approaches can recognize outparalogs as inparalogs (and vice versa)

Page 15: Orthology, and its relevance to protein function prediction

Because of independent loss events, and because of variable rates of evolution, in large gene families, orthology determination using bi/multi-directional best hits (graph-based approaches) does not always resolve separate orthologous and/or functional groups.

One solution to this is the creation of phylogenies………

Page 16: Orthology, and its relevance to protein function prediction
Page 17: Orthology, and its relevance to protein function prediction

Prediction of orthology using phylogenies (unrooted)

Page 18: Orthology, and its relevance to protein function prediction

A tree-based approach would also allow a hierarchical, multi-level view on orthology, e.g. by including a numbering system
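One common way to read orthology from a tree is the "species overlap" heuristic, sketched below; it is named here as an illustration and is not claimed to be the method behind these slides. It assumes a rooted, binary gene tree written as nested tuples, with leaf names of the hypothetical form `species|gene`. A node whose two subtrees share a species is called a duplication, otherwise a speciation; pairs that split at a speciation are orthologs, pairs that split at a duplication are paralogs.

```python
from itertools import product


def leaves(node):
    """Return all leaf names below a node (a leaf is a string)."""
    if isinstance(node, str):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def species(leaf):
    """Extract the species from a 'species|gene' leaf name (assumed format)."""
    return leaf.split("|")[0]


def classify_pairs(node, orthologs=None, paralogs=None):
    """Split all leaf pairs into orthologs and paralogs by species overlap."""
    if orthologs is None:
        orthologs, paralogs = set(), set()
    if isinstance(node, str):
        return orthologs, paralogs
    left, right = node
    left_leaves, right_leaves = leaves(left), leaves(right)
    overlap = {species(x) for x in left_leaves} & {species(x) for x in right_leaves}
    target = paralogs if overlap else orthologs  # overlap means duplication
    for a, b in product(left_leaves, right_leaves):
        target.add(tuple(sorted((a, b))))
    classify_pairs(left, orthologs, paralogs)
    classify_pairs(right, orthologs, paralogs)
    return orthologs, paralogs


# The hemoglobin example from the Fitch quote, as a toy rooted tree.
tree = (("human|alpha", "mouse|alpha"), ("human|beta", "mouse|beta"))
orthologs, paralogs = classify_pairs(tree)
print(sorted(orthologs))
print(sorted(paralogs))
```

Because every internal node gets its own label, the same traversal gives orthology at several levels of the tree, which is the hierarchical view mentioned above.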

Page 19: Orthology, and its relevance to protein function prediction

Classic usage of phylogeny: inferring evolutionary history

Dyall et al, Nature 2004

Hrdy et al, Nature 2004

Page 20: Orthology, and its relevance to protein function prediction

How to make a tree

1) Distance based methods. Cluster the sequences based on relative levels of sequence similarity. Fast, but not a direct reconstruction of “what happened in evolution”. Neighbor Joining is the method often used here.

2) Parsimony methods. Reconstruct the phylogeny that requires the smallest number of mutations. Slower (requires in principle the examination of all trees, $N = \prod_{i=3}^{T} (2i-5)$ for $T$ sequences), but branch & bound makes it faster, and based on the questionable assumption that the least amount of events possible occurred in evolution.

3) Maximum likelihood methods. Find the phylogeny (including branch lengths) that, given a model of sequence evolution, was most likely to have produced the sequence alignment. This involves the comparison of all trees, estimating the branch lengths optimal for that tree, and subsequently estimating the likelihood of the complete tree. The slowest method of all, which furthermore requires knowledge of a large number of parameters.
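To make the distance-based idea concrete, the sketch below computes pairwise distances from an ungapped alignment and picks the closest pair, which is the kind of input a clustering method such as Neighbor Joining starts from. It is not Neighbor Joining itself. The function names are assumptions, the Jukes-Cantor correction is one possible choice of distance, and the four toy sequences are the ones used in the likelihood slides of this transcript.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Fraction of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor_distance(p):
    """Correct a p-distance for multiple substitutions (Jukes-Cantor)."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(seqs):
    """Return {(name1, name2): distance} for all pairs of sequences."""
    return {
        (m, n): jukes_cantor_distance(p_distance(seqs[m], seqs[n]))
        for m, n in combinations(sorted(seqs), 2)
    }


seqs = {
    "W": "ACGCGTTGGG",
    "X": "ACGCGTTGGG",
    "Y": "ACGCAATGAA",
    "Z": "ACACAGGGAA",
}
matrix = distance_matrix(seqs)
closest = min(matrix, key=matrix.get)
print(closest, matrix[closest])
```

W and X are identical, so they form the closest pair; Y and Z differ at 3 of 10 positions and are closer to each other than either is to W.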

Page 21: Orthology, and its relevance to protein function prediction

Likelihood-Based Phylogeny

Page 22: Orthology, and its relevance to protein function prediction

Sequence W: A C G C G T T G G G
Sequence X: A C G C G T T G G G
Sequence Y: A C G C A A T G A A
Sequence Z: A C A C A G G G A A

3 possible trees for 4 sequences

((W,X),(Y,Z)) ((W,Y),(X,Z)) ((W,Z),(X,Y))
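The number of unrooted, bifurcating trees grows very quickly with the number of sequences T: it is the product of (2i - 5) for i = 3 … T. The small function below (its name is an assumption) evaluates that product, which shows why examining all trees is only feasible for a handful of sequences.

```python
def count_unrooted_trees(n_sequences):
    """Number of unrooted bifurcating trees for n_sequences labelled leaves."""
    if n_sequences < 3:
        raise ValueError("at least three sequences are needed")
    count = 1
    for i in range(3, n_sequences + 1):
        count *= 2 * i - 5
    return count


for n in (4, 5, 10, 20):
    print(n, count_unrooted_trees(n))
```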

Page 23: Orthology, and its relevance to protein function prediction

[Figure: tree with leaves T, T, A, G; each of the three internal nodes can be A, T, G or C]

All possible evolutionary paths for one tree, for one column in the alignment

Page 24: Orthology, and its relevance to protein function prediction

Likelihood for One Path

[Figure: tree with leaves T, T, A, G; internal nodes T and G; root G]

$$L(\text{path}) = L(\text{root}) \times L(\text{branches}) = P(G)\,P(T|G)\,P(G|G)\,P(A|G)\,P(G|G)\,P(T|T)\,P(T|T)$$
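The product for a single path can be evaluated once a substitution model is chosen. The sketch below assumes the Jukes-Cantor model, equal base frequencies of 1/4 for the root, and the same (arbitrary) branch length `t` and rate `alpha` on every branch; these values and the function names are illustrative assumptions, not taken from the slides.

```python
import math


def jc_prob(parent, child, alpha, t):
    """Jukes-Cantor probability of seeing `child` after time t, given `parent`."""
    decay = math.exp(-4.0 * alpha * t)
    if parent == child:
        return (1.0 + 3.0 * decay) / 4.0
    return (1.0 - decay) / 4.0


def path_likelihood(root, left, right, leaves, alpha=1.0, t=0.1):
    """Likelihood of one assignment of states to the root and two internal nodes.

    The tree is ((l1, l2), (l3, l4)): `left` is the parent of l1 and l2,
    `right` is the parent of l3 and l4, and `root` is the parent of both.
    """
    l1, l2, l3, l4 = leaves
    likelihood = 0.25  # P(root), assuming equal base frequencies
    branches = [(root, left), (root, right),
                (left, l1), (left, l2), (right, l3), (right, l4)]
    for parent, child in branches:
        likelihood *= jc_prob(parent, child, alpha, t)
    return likelihood


# The path from the slide: root G, internal nodes T and G, leaves T, T, A, G.
example = path_likelihood("G", "T", "G", ("T", "T", "A", "G"))
print(example)
```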

Page 25: Orthology, and its relevance to protein function prediction

Calculating the likelihood of any path requires a model of sequence evolution (and an estimate of the time, and the mathematics to combine both)

[Figure: substitution rates between the nucleotides T, C, A and G under three models. Jukes Cantor: one rate, a, for every substitution. Kimura: two rates, a and 2a. General: six rates, a to f.]

Page 26: Orthology, and its relevance to protein function prediction

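The Jukes-Cantor Model (including time…)

$$S(\varepsilon) \approx \begin{pmatrix} 1-3\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon \\ \alpha\varepsilon & 1-3\alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon \\ \alpha\varepsilon & \alpha\varepsilon & 1-3\alpha\varepsilon & \alpha\varepsilon \\ \alpha\varepsilon & \alpha\varepsilon & \alpha\varepsilon & 1-3\alpha\varepsilon \end{pmatrix}$$

$$S(t) = \begin{pmatrix} r_t & s_t & s_t & s_t \\ s_t & r_t & s_t & s_t \\ s_t & s_t & r_t & s_t \\ s_t & s_t & s_t & r_t \end{pmatrix}$$

where

$$s_t = \tfrac{1}{4}\left(1 - e^{-4\alpha t}\right), \qquad r_t = \tfrac{1}{4}\left(1 + 3e^{-4\alpha t}\right)$$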
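A short numerical check of the Jukes-Cantor substitution matrix: the probability of staying the same after time t is (1 + 3e^(-4αt))/4 and the probability of each specific change is (1 - e^(-4αt))/4. The function name and the chosen values of `alpha` and `t` are assumptions for illustration.

```python
import math

NUCLEOTIDES = "ACGT"


def jukes_cantor_matrix(alpha, t):
    """Return the 4x4 Jukes-Cantor substitution probability matrix S(t)."""
    decay = math.exp(-4.0 * alpha * t)
    r = (1.0 + 3.0 * decay) / 4.0  # no net change
    s = (1.0 - decay) / 4.0        # change to one specific other nucleotide
    return [[r if i == j else s for j in range(4)] for i in range(4)]


for t in (0.0, 0.1, 10.0):
    row = jukes_cantor_matrix(alpha=1.0, t=t)[0]
    print(t, [round(p, 4) for p in row])
```

At t = 0 the matrix is the identity, each row always sums to 1, and for long times every entry approaches 1/4, i.e. all memory of the starting nucleotide is lost.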

Page 27: Orthology, and its relevance to protein function prediction

Sum over all paths

[Figure: tree with leaves T, T, A, G; each of the three internal nodes can be A, T, G or C]

L(Column Cluster 1) = L(all possible Evolutionary Paths)

= L(path1) + L(path2) + L(path3) + … + L(path64)
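The sum over the 64 paths (4 possible states at each of 3 unobserved nodes) can be written as a brute-force loop. As before, this is a sketch under assumptions: Jukes-Cantor probabilities, equal base frequencies at the root, one shared branch length `t`, and the tree shape ((l1, l2), (l3, l4)); the function names are hypothetical.

```python
import math
from itertools import product

ALPHABET = "ACGT"


def jc_prob(parent, child, alpha, t):
    """Jukes-Cantor probability of seeing `child` after time t, given `parent`."""
    decay = math.exp(-4.0 * alpha * t)
    if parent == child:
        return (1.0 + 3.0 * decay) / 4.0
    return (1.0 - decay) / 4.0


def column_likelihood(leaves, alpha=1.0, t=0.1):
    """Sum the path likelihoods over all states of the three unobserved nodes."""
    l1, l2, l3, l4 = leaves
    total, n_paths = 0.0, 0
    for root, left, right in product(ALPHABET, repeat=3):
        path = 0.25  # P(root), assuming equal base frequencies
        for parent, child in ((root, left), (root, right),
                              (left, l1), (left, l2), (right, l3), (right, l4)):
            path *= jc_prob(parent, child, alpha, t)
        total += path
        n_paths += 1
    return total, n_paths


likelihood, n_paths = column_likelihood(("T", "T", "A", "G"))
print(n_paths, likelihood)
```

A useful sanity check is that the likelihoods of all 256 possible columns add up to 1, since exactly one of them must be observed.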

Page 28: Orthology, and its relevance to protein function prediction

Likelihood of a phylogeny tree for one site

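[Figure: tree with leaves x1, x2 and x3; internal node x4 is the parent of x1 and x2 (branch lengths t1 and t2); root x5 is the parent of x3 and x4 (branch lengths t3 and t4)]

When x4 and x5 are unknown,

$$P(x^1, x^2, x^3 \mid T, t) = \sum_{x^4, x^5} P(x^5)\, P(x^4 \mid x^5, t_4)\, P(x^3 \mid x^5, t_3)\, P(x^2 \mid x^4, t_2)\, P(x^1 \mid x^4, t_1)$$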

Page 29: Orthology, and its relevance to protein function prediction

Whole Sequence Likelihood

[Figure: tree ((W,X),(Y,Z))]

$$L(\text{Sequence}) = \prod_i L(\text{position } i)$$

Choose the tree with the Maximum Likelihood.

Page 30: Orthology, and its relevance to protein function prediction

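For All Positions u=1…N

$$P(x^{\bullet} \mid T, t) = \prod_{u=1}^{N} P(x_u^1, \ldots, x_u^n \mid T, t)$$

or

$$\log P(x^{\bullet} \mid T, t) = \sum_{u=1}^{N} \log P(x_u^1, \ldots, x_u^n \mid T, t)$$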

Page 31: Orthology, and its relevance to protein function prediction

• Pick an Evolutionary Model (Modeltest can help)

• For each position, Generate all possible tree structures

• Based on the Evolutionary Model, calculate Likelihood of these Trees and Sum them to get the Column Likelihood for each OTU cluster.

• Calculate Tree Likelihood by multiplying the likelihood for each position

• Choose Tree with Greatest Likelihood

Maximum Likelihood Phylogeny
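The steps above can be put together for the four toy sequences W, X, Y and Z from this transcript. The sketch below scores the three possible groupings and keeps the one with the highest log-likelihood. It makes strong simplifying assumptions that a real maximum likelihood program would not: the Jukes-Cantor model with fixed `alpha`, one fixed branch length `t` for every branch (branch lengths are not optimized), equal base frequencies at the root, and a brute-force sum instead of an efficient algorithm. All names are hypothetical.

```python
import math
from itertools import product

ALPHABET = "ACGT"


def jc_prob(parent, child, alpha, t):
    """Jukes-Cantor probability of seeing `child` after time t, given `parent`."""
    decay = math.exp(-4.0 * alpha * t)
    if parent == child:
        return (1.0 + 3.0 * decay) / 4.0
    return (1.0 - decay) / 4.0


def site_likelihood(column, alpha, t):
    """Likelihood of one column (a, b, c, d) on the tree ((a, b), (c, d))."""
    a, b, c, d = column
    total = 0.0
    for root, left, right in product(ALPHABET, repeat=3):
        total += (0.25
                  * jc_prob(root, left, alpha, t) * jc_prob(root, right, alpha, t)
                  * jc_prob(left, a, alpha, t) * jc_prob(left, b, alpha, t)
                  * jc_prob(right, c, alpha, t) * jc_prob(right, d, alpha, t))
    return total


def log_likelihood(seqs, topology, alpha=1.0, t=0.1):
    """Sum the log-likelihoods of all alignment positions for one topology."""
    (a, b), (c, d) = topology
    columns = zip(seqs[a], seqs[b], seqs[c], seqs[d])
    return sum(math.log(site_likelihood(col, alpha, t)) for col in columns)


seqs = {
    "W": "ACGCGTTGGG",
    "X": "ACGCGTTGGG",
    "Y": "ACGCAATGAA",
    "Z": "ACACAGGGAA",
}
topologies = [
    (("W", "X"), ("Y", "Z")),
    (("W", "Y"), ("X", "Z")),
    (("W", "Z"), ("X", "Y")),
]
scores = {topology: log_likelihood(seqs, topology) for topology in topologies}
best = max(scores, key=scores.get)
for topology, score in scores.items():
    print(topology, round(score, 3))
print("best:", best)
```

Working with the sum of logarithms rather than the product of likelihoods avoids numerical underflow for long alignments.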

Page 32: Orthology, and its relevance to protein function prediction

Likelihood-based Phylogeny

• Works well for distantly related sequences

• Works well under different molecular clock theory

• Can incorporate any desirable evolutionary model

• Requires a good “model” of evolution

• Requires fast computers

Page 33: Orthology, and its relevance to protein function prediction

Obtaining confidence in our tree

Bootstrapping:
1) Generate e.g. 1000 alignments by random sampling (with replacement) from the real alignment.
2) Determine the phylogenetic tree for each alignment.
3) Count how often every subtree appears, and put those values at the internal branches of the complete tree.

… bootstrap values tend to be on the conservative side….
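The resampling step can be sketched as drawing alignment columns with replacement. In the code below, `build_splits` stands for whatever tree-building method is used and is a hypothetical callback that returns the groupings found in one replicate; the toy `closest_pair_split` is only a stand-in for a real tree builder, and all names are assumptions.

```python
import random
from itertools import combinations


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement."""
    n_columns = len(next(iter(alignment.values())))
    picked = [rng.randrange(n_columns) for _ in range(n_columns)]
    return {name: "".join(seq[i] for i in picked) for name, seq in alignment.items()}


def bootstrap_support(alignment, build_splits, n_replicates=1000, seed=0):
    """Fraction of replicates in which each grouping is recovered."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(n_replicates):
        replicate = bootstrap_alignment(alignment, rng)
        for split in build_splits(replicate):
            counts[split] = counts.get(split, 0) + 1
    return {split: count / n_replicates for split, count in counts.items()}


def closest_pair_split(alignment):
    """Toy stand-in for tree building: group the pair with the fewest mismatches."""
    def mismatches(pair):
        a, b = pair
        return sum(x != y for x, y in zip(alignment[a], alignment[b]))
    return [frozenset(min(combinations(sorted(alignment), 2), key=mismatches))]


alignment = {
    "W": "ACGCGTTGGG",
    "X": "ACGCGTTGGG",
    "Y": "ACGCAATGAA",
    "Z": "ACACAGGGAA",
}
support = bootstrap_support(alignment, closest_pair_split, n_replicates=200)
print(support)
```

W and X are identical in every column, so every resampled alignment groups them together and their support is 1.0 in this toy case.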

Page 34: Orthology, and its relevance to protein function prediction

Unrooted tree topologies

((A,B),(C,D)) Bracket notation

[Figure: three equivalent (=) drawings of the unrooted tree with leaves A, B, C and D]

Page 35: Orthology, and its relevance to protein function prediction

-Unrooted tree topologies only reflect relative evolutionary relations (in the primates, humans and chimpanzees are more closely related to each other than either is to the orang-utan or the gibbon)

-Rooted trees reflect the relative order of descent (in the primates first the gibbon branched off, then the orang-utan branched off, then the chimpanzee and then the humans)

[Figure: unrooted tree with leaves Orang-utan, Gibbon, Chimp and Human; rooted tree with leaves Chimp, Human, Orang-utan, Gibbon and Baboon]

Page 36: Orthology, and its relevance to protein function prediction

Rooting is important to separate different orthologous groups, and thereby different functionalities, in a tree.

Berend’s lecture….

Page 37: Orthology, and its relevance to protein function prediction

To construct a tree one needs a multiple sequence alignment:

Different optimization criteria:

Global optimization:

The best would be to run multidimensional dynamic programming, which would be guaranteed to give you the optimal alignment (optimal here means: the best alignment given the similarity matrix and the gap penalties, not necessarily what is structurally the most likely course of events).

Reconstructing evolution:

Trying, along an evolutionary tree, to minimize the number of insertion/deletion events.

Alignment Phylogeny

Page 38: Orthology, and its relevance to protein function prediction

Automatic sequence alignment programs use heuristics to obtain some kind of approximation of the optimal alignment.

ClustalW:
- pairwise sequence alignments of all pairs, N(N-1)/2.
- select the most similar pairs of sequences, align those, and subsequently iteratively align the alignments.

T_Coffee:
- better but slower

Muscle:
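The first step of a progressive aligner, scoring all N(N-1)/2 pairs and starting from the most similar one, can be illustrated with a plain Needleman-Wunsch global alignment score. This is only a sketch: the scoring values (match 1, mismatch -1, gap -2) and the function names are assumptions and do not reproduce the scoring scheme of ClustalW or any other program.

```python
from itertools import combinations


def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of two sequences (linear gap penalty)."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def pairwise_scores(seqs):
    """Score every pair of sequences: N(N-1)/2 alignments for N sequences."""
    return {
        (m, n): needleman_wunsch_score(seqs[m], seqs[n])
        for m, n in combinations(sorted(seqs), 2)
    }


seqs = {
    "W": "ACGCGTTGGG",
    "X": "ACGCGTTGGG",
    "Y": "ACGCAATGAA",
    "Z": "ACACAGGGAA",
}
scores = pairwise_scores(seqs)
first_pair = max(scores, key=scores.get)
print(len(scores), first_pair, scores[first_pair])
```

With four sequences there are six pairs; the identical sequences W and X score highest and would be aligned first.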

Page 39: Orthology, and its relevance to protein function prediction

Substrate specificities are not necessarily monophyletic (convergent evolution).

Convergent evolution of Trichomonas vaginalis lactate dehydrogenase from malate dehydrogenase. Wu et al., PNAS 1999

Page 40: Orthology, and its relevance to protein function prediction

Lactate/Malate Dehydrogenase: different small-molecule specificity

[Figure: LDH interconverts lactate (CH3-CH(OH)-COO-) and pyruvate (CH3-CO-COO-); MDH interconverts malate (-OOC-CH2-CH(OH)-COO-) and oxaloacetate (-OOC-CH2-CO-COO-)]

Page 41: Orthology, and its relevance to protein function prediction

Lactate/Malate Dehydrogenase

[Figure: structures of lactate (CH3-CH(OH)-COO-) and malate (-OOC-CH2-CH(OH)-COO-). Labels: negative; Arg 102, positive]

Hannenhalli & Russell, JMB, 303, 61-76, 2000

Page 42: Orthology, and its relevance to protein function prediction

Another source of information that can be used for orthology prediction is gene-order conservation.

[Figure: gene-order comparison; hits labelled 35% and 35%]

(be careful for duplicated sets of genes though)

Page 43: Orthology, and its relevance to protein function prediction

Prediction of orthology/function using a combination of sequence similarity and gene-order conservation

icd or leuB ?

Page 44: Orthology, and its relevance to protein function prediction

Further reading

• The quest for orthologs: finding the corresponding gene across genomes. Kuzniar A, van Ham RC, Pongor S, Leunissen JA. Trends Genet. 2008 Nov;24(11):539-51