
Page 1: Orthology Analysis

Orthology Analysis

Erik Sonnhammer

Center for Genomics and Bioinformatics

Karolinska Institutet, Stockholm

Page 2: Orthology Analysis

Outline

• Basic concepts

• BLAST-based approaches to orthology

• Tree-based approaches to orthology

• Domain-level orthology

Page 3: Orthology Analysis

Homologs

= genes with a common origin

• May be genes in the same or in different organisms

• Does not say that function is identical

• Can only be true or false, and not a percentage!

• Homologs have the same 3D-structure layout

Page 4: Orthology Analysis

Homologs

• Orthologs

• Paralogs

Page 5: Orthology Analysis

[Figure: gene tree drawn along a time axis, with duplication (D) and speciation (S) events. Genes shown: Gene X in ancient animal; Gene X in ancient mammal; Gene Y in ancient mammal; Gene X in human; Gene X in rat; Gene Y in rat; Gene Y1 in human; Gene Y2 in human. Annotations: Orthologs (separated by speciation); Paralogs; In-paralogs; Out-paralogs.]

Page 6: Orthology Analysis

In/Out-paralog definition

In-paralogs ~ co-orthologs: paralogs that were duplicated after the speciation and hence are orthologs to a cluster in the other species.

Out-paralogs = not co-orthologs: paralogs that were duplicated before the speciation. Not necessarily in the same species.

Sonnhammer & Koonin, Trends Genet. 18:619-620 (2002)

Page 7: Orthology Analysis

Orthologs for functional genomics

• Co-orthologs / inparalogs are more likely than outparalogs to have identical biochemical functions and biological roles.

• Co-orthologs can be used to discover human gene function via model organism experiments

• Co-orthologs are key to exploiting functional genomics/proteomics data in model organisms

Page 8: Orthology Analysis

Orthology and function conservation

• Orthology does not say anything about evolutionary distance.

• Close orthologs, e.g. human-mouse, are very likely to have the same biological role in the organism.

• Distant orthologs, e.g. human-worm, are less likely to have the same phenotypic role, but may have the same role in the corresponding pathway.

Page 9: Orthology Analysis

Ortholog Databases

Sequence database  | Orthology detection method | Ortholog database
SwTrembl proteomes | Inparanoid (blast)         | Inparanoid
proteomes          | COGs (blast)               | COGs / KOGs
TIGR gene index    | COGs (blast)               | TOGA/EGO
proteomes          | OrthoMCL (blast)           | OrthoMCL
Pfam               | Orthostrapper (tree)       | HOPS
Pfam               | RIO (tree)                 |

Page 10: Orthology Analysis

How to find orthologs?

1. Calculate a phylogenetic tree and look for orthologs in the tree (Orthostrapper, RIO).

2. Two-way best matches between two species can be used to find orthologs without trees.

[However, in-paralogs are harder to find this way]
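As a rough illustration of the two-way best match idea, here is a minimal Python sketch. The gene names and similarity scores are invented for the example, and the function names are hypothetical; a real analysis would take the scores from BLAST output.

```python
from typing import Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]  # (query, hit) -> similarity score


def best_hits(scores: Scores) -> Dict[str, str]:
    """Map each query sequence to its highest-scoring hit."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, hit), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (hit, score)
    return {query: hit for query, (hit, _) in best.items()}


def two_way_best_matches(a_vs_b: Scores, b_vs_a: Scores) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where a's best hit is b and b's best hit is a."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: hY1 and hY2 are human in-paralogs of rat gene rY.
human_vs_rat = {("hX", "rX"): 900, ("hX", "rY"): 300,
                ("hY1", "rY"): 800, ("hY2", "rY"): 750}
rat_vs_human = {("rX", "hX"): 900, ("rY", "hY1"): 800, ("rY", "hY2"): 750}

pairs = two_way_best_matches(human_vs_rat, rat_vs_human)
print(pairs)  # [('hX', 'rX'), ('hY1', 'rY')]
```

In this toy data hY2 is not reported, although it is an in-paralog of hY1, which is the limitation noted in the bracketed remark.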

Page 11: Orthology Analysis

Two-way best match approach to finding orthologs

Page 12: Orthology Analysis

COGs

[Figure: COG2813, with orthologs and out-paralogs marked]

Page 13: Orthology Analysis

Inpara-n-oid

Inparalog ‘n ortholog identification

Blue = species 1

Red = species 2

Page 14: Orthology Analysis

Inparanoid

Blue = species 1

Red = species 2

Page 15: Orthology Analysis

Resolve overlapping clusters

• No overlap - no problems

• Partial overlap - separate

• Complete overlap - merge

Page 16: Orthology Analysis

Inparalog score

Score for inparalog P = (scoreAP - scoreAB) / (scoreAA - scoreAB)

[Figure: scale from 0 to 100% with sequences A, P and B]
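The inparalog score formula can be written as a small function. This is only a sketch: it assumes that A is the main ortholog, B is its ortholog in the other species, and P is a candidate inparalog from the same species as A; the slide itself does not spell out these roles, and the numeric scores below are made up.

```python
def inparalog_score(score_ap: float, score_ab: float, score_aa: float) -> float:
    """Score for inparalog P = (scoreAP - scoreAB) / (scoreAA - scoreAB)."""
    if score_aa == score_ab:
        raise ValueError("scoreAA and scoreAB must differ")
    return (score_ap - score_ab) / (score_aa - score_ab)


# Hypothetical similarity scores.
print(inparalog_score(score_ap=700, score_ab=400, score_aa=1000))   # 0.5
print(inparalog_score(score_ap=1000, score_ab=400, score_aa=1000))  # 1.0
print(inparalog_score(score_ap=400, score_ab=400, score_aa=1000))   # 0.0
```

Under these assumptions a candidate that is as similar to A as A is to itself scores 100%, and one that is only as similar to A as B is scores 0%.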

Page 17: Orthology Analysis

Confidence values for main orthologs from sampling

TVHIVDDEEPVR---KSLAFM---LTMNGFAT
+ ++DD +R K L M +T+ G AT
ILLIDDHPMLRTGVKQLISMAPDITVVGEA

Sampling with replacement; insertions kept intact

GAFDEP---LVTHVR..........
GA + ++T +R
GAEEHMAPDILTLLR..........

“Bootstrap alignment” -> “bootstrap score”

Confidence = (number of bootstrap alignments in which the pair is still a best-best match) / (number of bootstraps)
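One possible reading of "sampling with replacement; insertions kept intact" is sketched below. It is an assumption, not the published Inparanoid procedure: each gap-free column is one sampling unit, each run of gapped columns is kept together as one unit, and as many units are drawn as the alignment contains. The function names are hypothetical; the two example strings are taken from the slide.

```python
import random
from typing import List, Sequence, Tuple


def alignment_units(top: str, bottom: str) -> List[Tuple[str, str]]:
    """Split a pairwise alignment into single columns and intact gap runs."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    units, i, n = [], 0, len(top)
    while i < n:
        j = i + 1
        if top[i] == "-" or bottom[i] == "-":
            # Extend over the whole insertion so it is sampled as one unit.
            while j < n and (top[j] == "-" or bottom[j] == "-"):
                j += 1
        units.append((top[i:j], bottom[i:j]))
        i = j
    return units


def bootstrap_alignment(top: str, bottom: str, rng: random.Random) -> Tuple[str, str]:
    """Draw units with replacement and join them into a bootstrap alignment."""
    units = alignment_units(top, bottom)
    sample = [rng.choice(units) for _ in units]
    return "".join(u[0] for u in sample), "".join(u[1] for u in sample)


def confidence(best_best_flags: Sequence[bool]) -> float:
    """Fraction of bootstrap alignments in which the pair stayed best-best."""
    return sum(best_best_flags) / len(best_best_flags)


new_top, new_bottom = bootstrap_alignment("GAFDEP---LVTHVR", "GAEEHMAPDILTLLR",
                                          random.Random(0))
print(new_top)
print(new_bottom)
print(confidence([True, True, False, True]))  # 0.75
```

Rescoring each bootstrap alignment with a substitution matrix, and checking whether the pair is still a two-way best match, is left out of the sketch.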

Page 18: Orthology Analysis

http://inparanoid.cgb.ki.se

Page 19: Orthology Analysis

inparanoid.cgb.ki.se

Remm et al, J. Mol. Biol. 314:1041-1052 (2001)

Homo sapiens vs. C. elegans

Page 20: Orthology Analysis

Ortholog group sizes, human vs X

Version 2.5: 08-apr-03

151360 sequences from Swissprot-TREMBL

44996 sequences from Homo sapiens
26674 sequences from Mus musculus
20316 sequences from Drosophila melanogaster
20997 sequences from Caenorhabditis elegans
36751 sequences from Arabidopsis thaliana
6910 sequences from Saccharomyces cerevisiae
8709 sequences from Escherichia coli

Species        | Number of orthologs (orthologous groups) in H.sapiens | Number of sequences (in-paralogs) from H.sapiens in orthologous groups | Number of sequences (in-paralogs) from this species in orthologous groups
M.musculus     | 12458 | 19532 | 17055
D.melanogaster | 5549  | 15259 | 9854
C.elegans      | 4541  | 14222 | 6537
A.thaliana     | 3258  | 10863 | 12178
S.cerevisiae   | 2175  | 7265  | 2751
E.coli         | 599   | 2144  | 1037

Page 21: Orthology Analysis

Nr of inparalogs per ortholog group

Species      | Avg. inparalogs in model organism ortholog groups | Avg. inparalogs in human ortholog groups
Mouse        | 1.36 | 1.56
Fly          | 1.77 | 2.75
Worm         | 1.44 | 3.13
Mustard weed | 3.73 | 3.33
Yeast        | 1.26 | 3.34
E. coli      | 1.73 | 3.57
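The averages on this slide appear to be the sequence counts from the "Ortholog group sizes" slide divided by the number of orthologous groups. The sketch below recomputes them under that assumption; the ratios agree with the reported averages to within 0.01, with small differences in the last digit.

```python
# (orthologous groups, human sequences in groups, model-organism sequences
# in groups), copied from the "Ortholog group sizes, human vs X" table.
GROUP_SIZES = {
    "M.musculus": (12458, 19532, 17055),
    "D.melanogaster": (5549, 15259, 9854),
    "C.elegans": (4541, 14222, 6537),
    "A.thaliana": (3258, 10863, 12178),
    "S.cerevisiae": (2175, 7265, 2751),
    "E.coli": (599, 2144, 1037),
}


def average_inparalogs(groups: int, human_seqs: int, model_seqs: int):
    """Return (avg. in model organism, avg. in human) per ortholog group."""
    return model_seqs / groups, human_seqs / groups


for species, sizes in GROUP_SIZES.items():
    model_avg, human_avg = average_inparalogs(*sizes)
    print(f"{species:15s} {model_avg:.2f} {human_avg:.2f}")
```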

Page 22: Orthology Analysis

Drawbacks of Blast-based orthology assignment

• No guarantee that the same segment is used in different sequences

• No evolutionary distance model

• Does not take multiple domains into account

Page 23: Orthology Analysis

Domain orthology

• Inparanoid Human-Fly ortholog pairs with domains in Pfam-A 13.0: 20335

• Different domain architectures: 5411

– Many of these are minor differences, e.g. 22 vs 21 Spectrin repeats

– Sometimes the difference is big:

ef-hand UCH

TBC UCH

Page 24: Orthology Analysis

Tree-based approaches

Page 25: Orthology Analysis

Distance-based tree building

• Bootstrapping:

– Randomly pick columns to build a bootstrap alignment, calculate tree

– Repeat 1000 times; frequency of node = bootstrap support

A1 MKFYSLPNFPEN

A2 MKYYKLPDLPDE

A3 MRFYTACENPRS

Distance matrix

   | A2 | A3
A1 | 4  | 8
A2 |    | 10

[Figure: tree with leaves A1, A2 and A3 and branch lengths 1, 3, 5 and 2]
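For three sequences, the branch lengths of the tree follow directly from the distance matrix (A1-A2 = 4, A1-A3 = 8, A2-A3 = 10). The sketch below uses the standard three-point formula; it is not necessarily the method used to draw the slide. It gives 1 and 3 for A1 and A2, and 7 for A3, which would match the figure if the values 5 and 2 are the two parts of the path to A3 (an assumption about the lost drawing).

```python
def three_taxon_branch_lengths(d12: float, d13: float, d23: float):
    """Branch lengths from the central node to taxa 1, 2 and 3."""
    b1 = (d12 + d13 - d23) / 2
    b2 = (d12 + d23 - d13) / 2
    b3 = (d13 + d23 - d12) / 2
    return b1, b2, b3


# Distances from the slide: A1-A2 = 4, A1-A3 = 8, A2-A3 = 10.
print(three_taxon_branch_lengths(4, 8, 10))  # (1.0, 3.0, 7.0)
```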

Page 26: Orthology Analysis

Orthology by tree reconciliation

Species tree

Gene tree

Infer 2 duplications and 2 losses
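A common way to reconcile a gene tree with a species tree is to map every gene-tree node to the smallest species-tree clade that contains all of its species, and to call a node a duplication when it maps to the same clade as one of its children. The sketch below follows that scheme and counts only duplications and speciations, not losses. The trees are hypothetical, modelled on the human/rat gene X and Y example from the earlier slide rather than on the figure of this slide, and trees are written as nested tuples.

```python
def species_clades(tree):
    """Return (species under this node, list of all clades in the subtree)."""
    if isinstance(tree, str):
        leaf = frozenset([tree])
        return leaf, [leaf]
    members, found = frozenset(), []
    for child in tree:
        child_members, child_found = species_clades(child)
        members |= child_members
        found += child_found
    found.append(members)
    return members, found


def smallest_clade(clade_list, species):
    """Smallest species-tree clade containing the given species set."""
    return min((c for c in clade_list if species <= c), key=len)


def reconcile(gene_tree, gene_species, clade_list, events):
    """Map a gene-tree node to a clade and record the event at inner nodes."""
    if isinstance(gene_tree, str):
        return frozenset([gene_species[gene_tree]])
    child_maps = [reconcile(c, gene_species, clade_list, events)
                  for c in gene_tree]
    mapped = smallest_clade(clade_list, frozenset().union(*child_maps))
    # Same clade as a child: the split cannot be explained by speciation.
    event = "duplication" if mapped in child_maps else "speciation"
    events.append((event, sorted(mapped)))
    return mapped


species_tree = ("human", "rat")
gene_tree = (("X_human", "X_rat"), (("Y1_human", "Y2_human"), "Y_rat"))
gene_species = {"X_human": "human", "X_rat": "rat", "Y1_human": "human",
                "Y2_human": "human", "Y_rat": "rat"}

_, clade_list = species_clades(species_tree)
events = []
reconcile(gene_tree, gene_species, clade_list, events)
for event in events:
    print(event)
```

For this toy input the sketch reports two speciations and two duplications: the human-specific Y1/Y2 split and the older X/Y split at the root.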

Page 27: Orthology Analysis

Drawbacks of tree reconciliation for orthology assignment

• Assumption that the species tree is fully known

• Does not give confidence values

• Gene trees become unreliable when involving a lot of sequences (more data -> less certainty)

• Computationally expensive

Page 28: Orthology Analysis

Partial tree reconciliation

• Find pairwise orthologs by computer parsing of tree.

Page 29: Orthology Analysis

Pairwise orthology confidence by ‘orthostrapping’

The original tree with bootstrap support values

[Figure: gene tree with leaves C14F5.4, AAF49194.1, AH6.2, F37H8.4, Y6E2A.9, C47D12.3, T04F8.1, AAF52138.1 and PIR-S67168; bootstrap support values 99, 45, 85, 100, 82 and 99]

Page 30: Orthology Analysis

Pairwise orthology confidence by ‘orthostrapping’

[Figure: gene tree with leaves C14F5.4, AAF49194.1, AH6.2, F37H8.4, Y6E2A.9, C47D12.3, T04F8.1, AAF52138.1 and PIR-S67168]

Worm \ Fly | AAF52138.1 | AAF49194.1
C14F5.4    | 0 | 1
T04F8.1    | 1 | 0
C47D12.3   | 0 | 0
Y6E2A.9    | 0 | 0
F37H8.4    | 0 | 0
AH6.2      | 0 | 0

Page 31: Orthology Analysis

Pairwise orthology confidence by ‘orthostrapping’

[Figure: gene tree with leaves C14F5.4, AAF49194.1, AH6.2, F37H8.4, Y6E2A.9, C47D12.3, T04F8.1, AAF52138.1 and PIR-S67168]

Worm \ Fly | AAF52138.1 | AAF49194.1
C14F5.4    | 0 | 2
T04F8.1    | 2 | 0
C47D12.3   | 1 | 0
Y6E2A.9    | 0 | 0
F37H8.4    | 0 | 0
AH6.2      | 0 | 0

Page 32: Orthology Analysis

Pairwise orthology confidence by ‘orthostrapping’

[Figure: gene tree with leaves C14F5.4, AAF49194.1, AH6.2, F37H8.4, Y6E2A.9, C47D12.3, T04F8.1, AAF52138.1 and PIR-S67168; bootstrap support values 99, 45, 85, 100, 82 and 99]

Worm \ Fly | AAF52138.1 | AAF49194.1
C14F5.4    | 0  | 99
T04F8.1    | 98 | 0
C47D12.3   | 81 | 0
Y6E2A.9    | 77 | 0
F37H8.4    | 77 | 0
AH6.2      | 77 | 0
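The tables on these slides can be read as a running tally: for each bootstrap tree, every worm-fly pair found to be orthologous gets one more count. A minimal sketch of that tally follows. How orthologous pairs are extracted from a single tree is not shown; the per-replicate pair sets below are invented inputs, the function names are hypothetical, and the 75% cut-off is borrowed from the HOPS results slide.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (worm sequence, fly sequence)


def orthostrap_support(replicates: List[Set[Pair]]) -> Dict[Pair, float]:
    """Percentage of bootstrap trees in which each pair was orthologous."""
    counts = Counter(pair for pairs in replicates for pair in pairs)
    return {pair: 100.0 * n / len(replicates) for pair, n in counts.items()}


def confident_pairs(support: Dict[Pair, float], threshold: float = 75.0) -> List[Pair]:
    """Pairs whose support is strictly above the threshold."""
    return sorted(pair for pair, value in support.items() if value > threshold)


# Four hypothetical bootstrap replicates.
replicates = [
    {("C14F5.4", "AAF49194.1"), ("T04F8.1", "AAF52138.1"), ("C47D12.3", "AAF52138.1")},
    {("C14F5.4", "AAF49194.1"), ("T04F8.1", "AAF52138.1"), ("C47D12.3", "AAF52138.1")},
    {("C14F5.4", "AAF49194.1"), ("T04F8.1", "AAF52138.1"), ("C47D12.3", "AAF52138.1")},
    {("C14F5.4", "AAF49194.1"), ("T04F8.1", "AAF52138.1")},
]

support = orthostrap_support(replicates)
print(support)
print(confident_pairs(support))
```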

Page 33: Orthology Analysis

orthostrapper.cgb.ki.se

Page 34: Orthology Analysis

Orthology is not transitive!

Combining multiple species at different evolutionary distances may give erroneous groups that include out-paralogs

Page 35: Orthology Analysis

Orthology is not transitive!

-> Orthology strictly defined for only 2 species/clades

Combining species of different distances is very dangerous

But OK to combine multiple equidistant ones

[Figure: tree of genes Y, H1, D1, H2 and D2, shown next to the subset D1, H2 and Y]
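A small example makes the non-transitivity concrete. It assumes one reading of the figure: Y is a gene in a distant species, and a duplication before the split of the two closer species gave H1/D1 and H2/D2. Under that assumption Y is orthologous to all four genes, H1 is orthologous to D1 and H2 to D2, but H1 and D2 are out-paralogs. The species assignment of the letters is a guess.

```python
from itertools import permutations

# Assumed pairwise orthology relations (see the lead-in for the assumptions).
ORTHOLOGS = {frozenset(pair) for pair in [
    ("Y", "H1"), ("Y", "H2"), ("Y", "D1"), ("Y", "D2"),
    ("H1", "D1"), ("H2", "D2"),
]}


def is_ortholog(a: str, b: str) -> bool:
    return frozenset((a, b)) in ORTHOLOGS


def transitivity_violations(genes):
    """Triples (a, b, c) with a~b and b~c orthologous but a, c not orthologous."""
    return [(a, b, c) for a, b, c in permutations(genes, 3)
            if is_ortholog(a, b) and is_ortholog(b, c) and not is_ortholog(a, c)]


violations = transitivity_violations(["Y", "H1", "H2", "D1", "D2"])
print(("D1", "Y", "H2") in violations)  # True
```

Chaining D1 to Y and Y to H2 would put the out-paralogs D1 and H2 in one group, which is the kind of erroneous group the slide warns about.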

Page 36: Orthology Analysis

Domain-level orthology

Page 37: Orthology Analysis

HOPS - Hierarchy of Orthologs and Paralogs

1. All species in Pfam are bundled in groups according to scheme:

[Figure: taxonomic scheme with the groups eukaryota, metazoa, viridiplantae, fungi, nematoda, arthropoda and chordata]

2. Apply Orthostrapper to groups at same level in Pfam families

3. Display results in NIFAS
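Step 2 can be illustrated by listing the group pairs that sit at the same level of the scheme. The nesting used below (metazoa, viridiplantae and fungi under eukaryota; nematoda, arthropoda and chordata under metazoa) is an assumption based on standard taxonomy, since the drawing itself is lost. It yields six pairs, in line with the "6 pairwise comparisons" quoted on the HOPS results slide.

```python
from itertools import combinations

# Assumed nesting of the groups named on the slide.
HIERARCHY = {
    "eukaryota": ["metazoa", "viridiplantae", "fungi"],
    "metazoa": ["nematoda", "arthropoda", "chordata"],
}


def same_level_pairs(hierarchy):
    """All pairs of sibling groups, i.e. groups at the same level."""
    return [pair for children in hierarchy.values()
            for pair in combinations(children, 2)]


pairs_to_compare = same_level_pairs(HIERARCHY)
for group_a, group_b in pairs_to_compare:
    print(group_a, "vs", group_b)
print(len(pairs_to_compare))  # 6
```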

Page 38: Orthology Analysis

Pfam

Page 39: Orthology Analysis

Pfam in brief:

[Figure: Pfam workflow. SEED alignment (representative members) and description file; Profile-HMM (HMMer-2.0); search database; FULL alignment. Labels: manually curated, automatically made.]

• Release 13.0 (April 2004):

– 7426 Pfam-A domain families

– Based on 1160000 sequences (Swissprot & Trembl)

– 21980 unique Pfam-A domain architectures

– 73% of all proteins have >=1 Pfam-A domain

Page 40: Orthology Analysis

HOPS results

Pfam 10, 6190 families:

• 2450 families (40%) have HOPS orthologs

• 1319 families (21%) have HOPS orthologs in all 6 pairwise comparisons

• 286356 pairwise orthology assignments (> 75% orthostrap)

Storm and Sonnhammer, Genome Research 13:2353-2362 (2003)

Page 41: Orthology Analysis

Ways to access HOPS

• NIFAS graphical browser

• By sequence ID at Pfam.cgb.ki.se/HOPS

• Flatfiles (Orthostrap tables of 2 clades)

Page 42: Orthology Analysis

Pfam.cgb.ki.se/HOPS

Page 43: Orthology Analysis
Page 44: Orthology Analysis
Page 45: Orthology Analysis

Evolution of Domain Architectures

NIFAS:

Page 46: Orthology Analysis

ATP sulfurylase / APS kinase

Page 47: Orthology Analysis

Orthologous shuffled domains?

ATP sulfurylase domain, metazoa vs fungi

Page 48: Orthology Analysis

APS kinase domain

Page 49: Orthology Analysis

HOPS orthologs of PPS1_HUMAN (ATP sulfurylase/APS kinase)

Page 50: Orthology Analysis

Summary of ATP sulfurylases/APS kinases:

Shuffled non-orthologous domains

Fungi

Metazoa

Page 51: Orthology Analysis

Conclusions

• Orthologs can be detected by

– Blast: fast

– tree: slow but less error-prone

• Species at different evolutionary distances should not be combined in orthology analysis

• Inparanoid and Orthostrapper were designed to find inparalogs but not outparalogs

• HOPS/NIFAS can be used to find domain orthologs and analyze domain architecture evolution

Page 52: Orthology Analysis

Future perspectives

• Multiparanoid – multiple species merging of pairwise Inparalogs.

• Functional divergence among inparalogs

Page 53: Orthology Analysis

Acknowledgments

– Christian Storm

– Maido Remm

– Andrey Alexeyenko

– Volker Hollich

– Mats Jonsson

http://sonnhammer.cgb.ki.se