Comparative genomics and proteomics in Ensembl, Sep 2006

Upload: camilla-sims

Post on 16-Jan-2016


Page 1: Comparative genomics and proteomics in Ensembl Sep 2006

Comparative genomics and proteomics in Ensembl

Sep 2006

Page 2: Comparative genomics and proteomics in Ensembl Sep 2006

Overview

• Rationale
• Species available
• Comparative proteomics
  – Orthologue and paralogue prediction
  – Protein clustering into families
• Comparative genomics
  – Genome-wide DNA alignments
  – Synteny block characterisation
• Future and perspectives

Page 3: Comparative genomics and proteomics in Ensembl Sep 2006

Compara

The Compara database is a single multispecies database

• Gene orthology/paralogy prediction
• Protein clustering
• Whole genome alignments
• Synteny regions

Page 4: Comparative genomics and proteomics in Ensembl Sep 2006

The era of sequencing genomes

[Figure: phylogenetic tree of the species listed below, drawn against a time axis in million years. Clade labels: Fungi, Nematoda, Arthropoda, Urochordata, Chordata, Vertebrata, Teleostei, Tetrapoda, Amphibia, Amniota, Aves, Mammalia, Metatheria, Eutheria. The divergence times printed on the branches cannot be reassigned to their nodes from this transcript.]

Red: whole genome assembly available
Green: whole genome assembly due within the next year in Ensembl

* 19 species currently in Ensembl
+ 10 in Pre! Ensembl

S. cerevisiae (baker’s yeast) *
C. elegans (nematode) *
A. mellifera (honey bee) *
D. rerio (zebrafish) *
D. melanogaster (fruitfly) *
A. gambiae (African malaria mosquito) *
A. aegypti (yellow fever mosquito) +
C. intestinalis (transparent sea squirt) *
C. savignyi (sea squirt) +
T. rubripes (torafugu) *
T. nigroviridis (spotted green pufferfish) *
O. latipes (Japanese medaka)
G. aculeatus (stickleback) +
O. aries (sheep)
G. gallus (chicken) *
X. laevis (African clawed frog)
M. musculus (house mouse) *
R. norvegicus (Norway rat) *
M. mulatta (rhesus macaque) *
P. troglodytes (chimpanzee) *
C. familiaris (dog) *
F. catus (cat)
E. caballus (horse)
S. scrofa (pig)
B. taurus (cow) *
M. domestica (opossum) *
L. africana (elephant) +
H. sapiens (human) * +
X. tropicalis (western clawed frog) *

Page 5: Comparative genomics and proteomics in Ensembl Sep 2006

Comparing different species

• From the Ensembl perspective, Compara joins species through
  – orthologous/paralogous gene links
  – chromosome synteny links
  – protein family links
• From a broader perspective
  – Where are syntenic regions located?
  – How many genes are conserved?
  – Where are orthologous/paralogous genes?
  – Is gene order conserved?
  – Where are potential regulatory regions?
  – What is missing in one species, present only in another?

Page 6: Comparative genomics and proteomics in Ensembl Sep 2006

Orthologue and Paralogue Prediction

• Evolutionary studies
• Identify potential species-specific proteins/genes
• Identify orthologues of (human) genes in model organisms

Page 7: Comparative genomics and proteomics in Ensembl Sep 2006

Gene Evolution

• Divergence
• Speciation / Duplication
• Change within allelic population
• Point Mutations / Selection / Drift
• Exon/domain shuffling
• Transposition / Translocation
• Retroposition (reverse transcription)
• Horizontal gene transfer?

Orthologues and Paralogues

Reconstruct the molecular evolutionary history from the evidence visible within the known extant genes

Page 8: Comparative genomics and proteomics in Ensembl Sep 2006

Homologue Relationships

• Orthologues: any pairwise gene relation where the ancestor node is a speciation event
• Paralogues: any pairwise gene relation where the ancestor node is a duplication event
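The two definitions above can be sketched in a few lines of Python. This is only an illustration, not the Compara implementation: the `Node` class, the `relation` function and the toy tree (one duplication followed by a speciation in each copy) are all hypothetical names and data introduced for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    name: str
    kind: str = "leaf"  # "leaf", "speciation" or "duplication"
    children: List["Node"] = field(default_factory=list)


def path_to(root: Node, leaf_name: str) -> Optional[List[Node]]:
    """Return the nodes from the root down to the named leaf, or None."""
    if not root.children:
        return [root] if root.name == leaf_name else None
    for child in root.children:
        sub = path_to(child, leaf_name)
        if sub is not None:
            return [root] + sub
    return None


def relation(root: Node, gene_a: str, gene_b: str) -> str:
    """Classify a gene pair by the type of its most recent common ancestor."""
    if gene_a == gene_b:
        raise ValueError("need two different genes")
    path_a, path_b = path_to(root, gene_a), path_to(root, gene_b)
    if path_a is None or path_b is None:
        raise KeyError("gene not found in tree")
    ancestor = root
    for x, y in zip(path_a, path_b):
        if x is not y:
            break
        ancestor = x  # last node shared by both root-to-leaf paths
    return "orthologue" if ancestor.kind == "speciation" else "paralogue"


# Hypothetical tree: an ancestral duplication, then human/mouse speciation.
tree = Node("dup", "duplication", [
    Node("spec1", "speciation", [Node("H1"), Node("M1")]),
    Node("spec2", "speciation", [Node("H2"), Node("M2")]),
])
print(relation(tree, "H1", "M1"), relation(tree, "H1", "M2"))
```

In this toy tree H1 and M1 meet at a speciation node, whereas H1 and M2 only meet at the older duplication node.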

Page 9: Comparative genomics and proteomics in Ensembl Sep 2006

Homologue Relationships

[Figure: gene tree drawn along a time axis, with duplication and speciation nodes and leaves labelled A 1, A 2, M 1, M 2, M 2’, H 1 and H 2; gene pairs are marked as orthologues, inparalogues or outparalogues.]

Orthologous genes have originated from a single ancestor (often have equivalent functions).
Paralogous genes are related via duplication:
• Inparalogues (ortholog_one2one, ortholog_one2many, etc.): duplication follows speciation
• Between_species_paralog (outparalogues): duplication precedes speciation

Page 10: Comparative genomics and proteomics in Ensembl Sep 2006

Orthology Prediction Algorithm

• Find orthologous genes by comparing the protein sets of two species (only the longest peptide is considered).
• blastp+sw all versus all (on a paired-species basis)
• Build a graph of gene relations based on BRH (best reciprocal hit) and BSR (BLAST score ratio)
• Extract connected components (single-linkage clusters), each cluster representing a gene family

[Figure: diagram of hits between Mouse and Human genes]
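A minimal sketch of the graph step, assuming the all-versus-all scores are already available as dictionaries keyed by (query, target). Only best-reciprocal-hit edges are built here; the BSR edges mentioned on the slide are left out, and all function names and scores are hypothetical.

```python
from collections import defaultdict


def best_hits(scores):
    """scores: {(query, target): score}. Return the top target per query."""
    best = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward, reverse = best_hits(a_vs_b), best_hits(b_vs_a)
    return {(a, b) for a, b in forward.items() if reverse.get(b) == a}


def connected_components(edges):
    """Single-linkage clusters: connected components of an undirected graph."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        components.append(component)
    return components


# Made-up BLAST-like scores between two human and two mouse proteins.
human_vs_mouse = {("h1", "m1"): 100, ("h1", "m2"): 50, ("h2", "m2"): 80}
mouse_vs_human = {("m1", "h1"): 100, ("m2", "h2"): 80, ("m2", "h1"): 40}
brh = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
clusters = connected_components(brh)
```

With more than two species, edges from every species pair would be pooled before extracting components, so one cluster can hold genes from many genomes.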

Page 11: Comparative genomics and proteomics in Ensembl Sep 2006

GeneTree prediction: MUSCLE/PHYML

• Multiple alignment of clusters (based on BRH and BSR) with MUSCLE
• Unrooted gene tree built using PHYML (Guindon & Gascuel, 2003)
• Tree reconciliation (gene tree with species tree) using RAP (Dufayard et al. 2005) to call duplication events on internal nodes and to root the tree
• Infer pairwise relations of orthology and paralogy types (from each tree)
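The reconciliation idea can be illustrated with the textbook last-common-ancestor (LCA) mapping: each gene-tree node is mapped to the lowest species-tree node that contains all species below it, and a node is called a duplication when it maps to the same species node as one of its children. This is a generic sketch, not RAP itself (RAP also roots the tree, which is assumed to be done already here); the species tree, gene names and function names are hypothetical.

```python
def ancestors(species_parent, species):
    """Chain from a species-tree node up to the root, inclusive."""
    chain = [species]
    while species in species_parent:
        species = species_parent[species]
        chain.append(species)
    return chain


def lca(species_parent, a, b):
    """Lowest common ancestor of two species-tree nodes."""
    seen = set(ancestors(species_parent, a))
    for node in ancestors(species_parent, b):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same species tree")


def reconcile(tree, species_parent, gene_species, events):
    """tree: gene name (leaf) or (left, right) tuple for a rooted binary
    gene tree. Appends (subtree, event) in post-order and returns the
    species-tree node the subtree maps to."""
    if isinstance(tree, str):
        return gene_species[tree]
    left, right = tree
    mapped_left = reconcile(left, species_parent, gene_species, events)
    mapped_right = reconcile(right, species_parent, gene_species, events)
    mapped = lca(species_parent, mapped_left, mapped_right)
    is_dup = mapped in (mapped_left, mapped_right)
    events.append((tree, "duplication" if is_dup else "speciation"))
    return mapped


# Hypothetical species tree (child -> parent) and gene-to-species table.
species_parent = {"human": "Euarchontoglires", "mouse": "Euarchontoglires",
                  "Euarchontoglires": "Amniota", "chicken": "Amniota"}
gene_species = {"H1": "human", "M1": "mouse", "H2": "human", "M2": "mouse"}
gene_tree = (("H1", "M1"), ("H2", "M2"))

events = []
root_species = reconcile(gene_tree, species_parent, gene_species, events)
kinds = [kind for _, kind in events]
```

Here both inner nodes are speciations (human versus mouse) and the root is a duplication, because both of its children already map to the human/mouse ancestor.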

Page 12: Comparative genomics and proteomics in Ensembl Sep 2006

Molecular Phylogenetics

• Protein sequences in different species, both:
  – Provide information about the history of evolution
  – Reconstruct evolution
• We are after an alignment that equally reflects all species:
  – Modelling the branching processes by comparing gene and species trees (tree reconciliation)

Page 13: Comparative genomics and proteomics in Ensembl Sep 2006

Phylogenies

Revealing the evolutionary history that has led to the organisms at the current stage.

- Leaves are real genomes
- Internal nodes are ancestors

[Figure legend: duplication node; speciation node or leaf]

Page 14: Comparative genomics and proteomics in Ensembl Sep 2006

Orthologue and Paralogue types

• ortholog_one2one
• ortholog_one2many
• ortholog_many2many
• apparent_ortholog_one2one

• within_species_paralog
• between_species_paralog
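As a rough illustration of how the first three labels relate to gene counts: assume that, below one speciation node, we know how many genes each of the two species contributes. The function below is a hypothetical helper, not Ensembl code, and it does not cover apparent_ortholog_one2one or the paralogue types.

```python
def orthology_type(genes_in_species_a: int, genes_in_species_b: int) -> str:
    """Name the orthology type from per-species gene counts below one
    speciation node (an assumption made for this sketch)."""
    if genes_in_species_a < 1 or genes_in_species_b < 1:
        raise ValueError("both species need at least one gene")
    if genes_in_species_a == 1 and genes_in_species_b == 1:
        return "ortholog_one2one"
    if genes_in_species_a == 1 or genes_in_species_b == 1:
        return "ortholog_one2many"
    return "ortholog_many2many"


print(orthology_type(1, 1), orthology_type(1, 3), orthology_type(2, 2))
```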

Page 15: Comparative genomics and proteomics in Ensembl Sep 2006

…in Ensembl…

Page 16: Comparative genomics and proteomics in Ensembl Sep 2006

Orthologue and Paralogue types

Page 17: Comparative genomics and proteomics in Ensembl Sep 2006

GeneView

Page 18: Comparative genomics and proteomics in Ensembl Sep 2006

GeneView

Page 19: Comparative genomics and proteomics in Ensembl Sep 2006

GeneTreeView

[Screenshot labels: links to ATV and JalView; GeneTree; MUSCLE protein alignment]

Page 20: Comparative genomics and proteomics in Ensembl Sep 2006

GeneTreeView

Duplication node (red)

Speciation node (blue)

Page 21: Comparative genomics and proteomics in Ensembl Sep 2006

ATV

Page 22: Comparative genomics and proteomics in Ensembl Sep 2006

Protein clustering into families

• Cluster proteins from different organisms that may share the same function

• Obtain some kind of description for ‘novel’ genes/proteins

• Locate family members over the whole genome

• Identify possible orthologues and paralogues in other species

Page 23: Comparative genomics and proteomics in Ensembl Sep 2006

Protein Dataset

• Nearly a million proteins clustered:
  – All Ensembl proteins from all species in Ensembl
    • 513,256 predicted proteins
  – All metazoan (animal) proteins in UniProt
    • 55,892 UniProt/Swiss-Prot
    • 469,725 UniProt/TrEMBL
• Blastp all versus all, then clustering with MCL

Page 24: Comparative genomics and proteomics in Ensembl Sep 2006

Clustering Strategy

• BLASTP all-versus-all comparison
• Markov clustering
• For each cluster:
  – Calculation of multiple sequence alignments with ClustalW
  – Assignment of a consensus description

Page 25: Comparative genomics and proteomics in Ensembl Sep 2006

Markov Clustering (MCL)

• MCL stands for Markov CLustering algorithm, based on flow simulation in graphs (http://micans.org/mcl/)
• Keeps only very well interconnected nodes (proteins) in the same graph (cluster)
• Allows rapid and accurate detection of protein families on a large scale.
• Automatic description and ClustalW multiple alignment applied to each cluster
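The flow simulation can be sketched as alternating "expansion" (matrix squaring, letting flow spread along paths) and "inflation" (raising entries to a power and renormalising columns, which strengthens strong flows and weakens weak ones). The pure-Python version below is a toy for small graphs, not the mcl program; the inflation value, iteration count, cut-off and the way clusters are read from the matrix rows are assumptions of this sketch.

```python
def normalise_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic)."""
    size = len(matrix)
    for j in range(size):
        total = sum(matrix[i][j] for i in range(size))
        for i in range(size):
            matrix[i][j] /= total
    return matrix


def mcl(adjacency, inflation=2.0, iterations=50):
    """Toy Markov clustering on a small symmetric 0/1 adjacency matrix."""
    size = len(adjacency)
    # Self-loops are added, as is customary, so every column is non-empty.
    m = [[float(adjacency[i][j]) + (1.0 if i == j else 0.0)
          for j in range(size)] for i in range(size)]
    normalise_columns(m)
    for _ in range(iterations):
        # Expansion: one more step of the random walk (matrix product).
        m = [[sum(m[i][k] * m[k][j] for k in range(size))
              for j in range(size)] for i in range(size)]
        # Inflation: element-wise power, then renormalise the columns.
        m = [[value ** inflation for value in row] for row in m]
        normalise_columns(m)
    # Rows that keep some flow act as attractors; the columns with flow
    # in such a row form one cluster.
    clusters = set()
    for i in range(size):
        members = frozenset(j for j in range(size) if m[i][j] > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(cluster) for cluster in clusters)


# Hypothetical similarity graph: a triangle (0, 1, 2) and a pair (3, 4).
graph = [
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
]
families = mcl(graph)
```

On real BLASTP output the edge weights would be similarity scores rather than 0/1, and a sparse-matrix implementation such as the mcl program itself is needed at the scale quoted on the slide.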

Page 26: Comparative genomics and proteomics in Ensembl Sep 2006

ProtView

Link to FamilyView

Page 27: Comparative genomics and proteomics in Ensembl Sep 2006

FamilyView

[Screenshot labels: Ensembl family members within human; Ensembl family members in other species; JalView multiple alignments]

Page 28: Comparative genomics and proteomics in Ensembl Sep 2006

For each cluster

• We store
  – Description and score
  – Multiple alignment
• Future extensions
  – Improving descriptions
  – Multiple alignment assessment
  – Build phylogeny on each cluster
    • Using the multiple alignment
    • Using dS values (mainly inside mammals)
    • Extend paralogue prediction

Page 29: Comparative genomics and proteomics in Ensembl Sep 2006

Aligning complete genomes

Page 30: Comparative genomics and proteomics in Ensembl Sep 2006

Whole Genome Alignments

• Understand what evolution has done to the species compared, after speciation
  – What is missing in one species, present only in another?
  – Differences between closely related species may help in understanding speciation
• Define syntenic regions: long regions of DNA sequence where order and orientation are highly conserved
• Conserved non-coding regions
  – Guides to putative regulatory regions

Page 31: Comparative genomics and proteomics in Ensembl Sep 2006

Evolution at the DNA level

…ACTGACATGTACCA…
…AC----CATGCACCA…

• Sequence edits: mutation, deletion
• Rearrangements: inversion, translocation, duplication

Page 32: Comparative genomics and proteomics in Ensembl Sep 2006

Basic Idea

• Functional sequences evolve more slowly than non-functional sequences
• Comparing genomic sequences from species at different evolutionary distances allows us to identify:
  – Coding genes
  – Non-coding genes
  – Non-coding regulatory sequences

Page 33: Comparative genomics and proteomics in Ensembl Sep 2006

Aligning large genomic sequences

• Independent from protein/gene predictions
• Should find all highly similar regions between two sequences
• Should allow for segments without similarity, rearrangements etc.
• Issues
  – Heavy process
  – Scalability, as more and more genomes are sequenced
  – Time constraint
  – Computes run only by a few dedicated groups
  – As the «true» alignment is not known, it is difficult to measure alignment accuracy and to choose the right method

Page 34: Comparative genomics and proteomics in Ensembl Sep 2006

Using a local aligner

• Local alignment
  – Finds all highly similar regions over 2 sequences
    • Finds the orthologous as well as all the paralogous sequences
  – Separated by segments without alignment
  – Can handle rearranged sequences
  – Needs post-filtering to limit excessively overlapping alignments

Page 35: Comparative genomics and proteomics in Ensembl Sep 2006

Local v Global Alignment

[Figure: two panels, Local and Global, each pairing the sequence AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACTTAATC with the sequence AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA]

Local
  – Advantages: compares large genomic regions (requires syntenic maps); can detect rearrangements like translocations, inversions and duplications (!)
  – Disadvantages: fails to identify insertions or deletions

Global
  – Advantages: detects insertions and deletions
  – Disadvantages: fails to detect rearrangements (inversions)
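The difference between the two alignment modes shows up directly in the dynamic programming: a global aligner must pay for every unaligned base, while a local aligner may restart from zero anywhere and reports only the best-scoring region. The sketch below computes both scores with the classic Needleman-Wunsch and Smith-Waterman recurrences and a linear gap penalty; the scoring values and the toy sequences are arbitrary choices for illustration, not the parameters of any Ensembl pipeline.

```python
def alignment_score(a, b, local, match=1, mismatch=-1, gap=-2):
    """Best alignment score of a and b.
    local=False: global (Needleman-Wunsch); local=True: Smith-Waterman."""
    previous = [0 if local else j * gap for j in range(len(b) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        current = [0 if local else i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1]
                                          else mismatch)
            score = max(diagonal, previous[j] + gap, current[j - 1] + gap)
            if local:
                score = max(score, 0)  # a local alignment may restart here
            current.append(score)
            best = max(best, score)
        previous = current
    # Global: score of aligning both sequences end to end.
    return best if local else previous[-1]


# The shared block ACGTACGT sits at different offsets in the two sequences.
seq_a = "TTTTACGTACGT"
seq_b = "ACGTACGTCCCC"
local_score = alignment_score(seq_a, seq_b, local=True)
global_score = alignment_score(seq_a, seq_b, local=False)
```

The local score covers only the shared eight-base block, whereas the global score is dragged down by the flanking bases that must also be aligned or gapped.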

Page 36: Comparative genomics and proteomics in Ensembl Sep 2006

Glocal Alignment Problem

Find least cost transformation of one sequence into another using new operations

• Sequence edits (indels, mutations)
• Inversions
• Translocations
• Duplications
• A combination of these

GTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGAG
AGTGACCTGGGAAGACCCTGAACCCTGGGTCACAAAACT

Glocal aligner (Brudno et al., 2003)

Page 37: Comparative genomics and proteomics in Ensembl Sep 2006

BLASTZ-net, tBLAT and MLAGAN

• BLASTZ-net (comparison on nucleotide level) is used for species that are evolutionarily close, e.g. human - mouse

• Translated BLAT (comparison on amino acid level) is used for evolutionarily more distant species, e.g. human - zebrafish

• MLAGAN global alignment is used for multispecies alignments

Page 38: Comparative genomics and proteomics in Ensembl Sep 2006

All versus all approach using BLASTZ (collaboration with UCSC)

• Can handle large sequences
• Uses a 2-weighted spaced seeding strategy
• Dynamic masking
• Makes a distinction between repeat and non-repeat sequences (soft masking)
• Tries aligning inside repeats
• One iterative step with lower threshold to expand alignments
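A spaced seed requires matches only at selected positions of a short window, so a seed hit survives a mismatch at an ignored position. The sketch below indexes one sequence and looks up the other; the pattern "1101" and the sequences are hypothetical toy values, not the seed BLASTZ actually uses.

```python
from collections import defaultdict


def seed_key(sequence, start, pattern):
    """Bases at the '1' positions of the pattern, starting at `start`."""
    return "".join(sequence[start + offset]
                   for offset, flag in enumerate(pattern) if flag == "1")


def spaced_seed_hits(target, query, pattern):
    """Return (target_pos, query_pos) pairs whose windows agree on every
    '1' position of the pattern; '0' positions may differ."""
    span = len(pattern)
    index = defaultdict(list)
    for t in range(len(target) - span + 1):
        index[seed_key(target, t, pattern)].append(t)
    hits = []
    for q in range(len(query) - span + 1):
        for t in index.get(seed_key(query, q, pattern), []):
            hits.append((t, q))
    return hits


# The windows ACGT and ACTT differ only at the ignored third position.
hits = spaced_seed_hits("ACGTAC", "ACTTAC", "1101")
```

In a full aligner each hit would then be extended into a gap-free and later a gapped alignment; that part is omitted here.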

Page 39: Comparative genomics and proteomics in Ensembl Sep 2006

Blastz strategy

• 10Mb Human fragments (3000)
• 30Mb Mouse fragments (100)
• Lineage-specific repeats removed
• 48 hours on 1024 CPUs
• Generates 9Gb of output
• When filtered for best hit on Human, reduced to 2.5Gb

Page 40: Comparative genomics and proteomics in Ensembl Sep 2006

Blastz human genome coverage

• 40% of the human genome is covered by an alignment of mouse sequences

• Rescoring the alignments with a very stringent “tight” matrix that looks for high conservation (>70% identity) brings the coverage down to 6%
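Coverage figures of this kind come from merging the aligned intervals on the reference genome and dividing by its length. The sketch below does that and, as a stand-in for rescoring with a tight matrix, simply drops blocks under a percent-identity threshold. That substitution, the function name and the block data are assumptions made for illustration.

```python
def covered_fraction(blocks, genome_length, min_identity=0.0):
    """blocks: (start, end, percent_identity) with half-open coordinates.
    Returns the fraction of the genome covered by blocks that pass the
    identity threshold, counting overlapping blocks only once."""
    kept = sorted((start, end) for start, end, identity in blocks
                  if identity >= min_identity)
    covered = 0
    current_start = current_end = None
    for start, end in kept:
        if current_end is None or start > current_end:
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)  # extend the merged run
    if current_end is not None:
        covered += current_end - current_start
    return covered / genome_length


# Made-up alignment blocks on a 100-base reference.
blocks = [(0, 30, 65.0), (20, 50, 80.0), (90, 100, 95.0)]
all_blocks = covered_fraction(blocks, 100)
stringent = covered_fraction(blocks, 100, min_identity=70.0)
```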

Page 41: Comparative genomics and proteomics in Ensembl Sep 2006

DNA/DNA matches web display

ContigView human EPO

Conserved sequences

Page 42: Comparative genomics and proteomics in Ensembl Sep 2006

DotterView

Mouse sequence

Human sequence

Page 43: Comparative genomics and proteomics in Ensembl Sep 2006

Multiple alignments

• Currently 3 sets:
  – MLAGAN-primates:
  – MLAGAN-amniote vertebrates:
  – MLAGAN-eutherian mammals:

Page 44: Comparative genomics and proteomics in Ensembl Sep 2006

Strategy

• Use all coding exons
• Get sets of best reciprocal hits
• Create orthology maps
• Build multiple global alignments

Page 45: Comparative genomics and proteomics in Ensembl Sep 2006

MultiContigView

Page 46: Comparative genomics and proteomics in Ensembl Sep 2006

Multiple alignments

ContigView human EPO

Page 47: Comparative genomics and proteomics in Ensembl Sep 2006

AlignSliceView

[Screenshot labels: alignment on basepair level; Human, Dog, Rat, Mouse; export alignments]

Page 48: Comparative genomics and proteomics in Ensembl Sep 2006

MultiContigView vs. AlignSliceView

Page 49: Comparative genomics and proteomics in Ensembl Sep 2006

AlignView

Page 50: Comparative genomics and proteomics in Ensembl Sep 2006

GeneSeqalignView

Page 51: Comparative genomics and proteomics in Ensembl Sep 2006

GeneSeqalignView

Page 52: Comparative genomics and proteomics in Ensembl Sep 2006

Syntenic Regions

• Genome alignments are refined into larger syntenic regions

• Alignments are clustered together when the relative distance between them is less than 100 kb and order and orientation are consistent

• Any clusters less than 100 kb are discarded
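The clustering rule above can be sketched as a single pass over alignments sorted along the reference chromosome. The record layout, the function names and the test coordinates are hypothetical; only the two 100 kb thresholds and the order/orientation requirement come from the slide.

```python
from typing import List, NamedTuple, Tuple

MAX_GAP = 100_000    # join alignments closer than this (both genomes)
MIN_BLOCK = 100_000  # discard clusters shorter than this


class Alignment(NamedTuple):
    ref_start: int   # position on the reference chromosome
    ref_end: int
    chrom: str       # chromosome in the other species
    start: int       # position in the other species
    end: int
    strand: int      # +1 same orientation, -1 inverted


def can_join(prev: Alignment, nxt: Alignment, max_gap: int) -> bool:
    """Same chromosome and orientation, consistent order, small gaps."""
    if prev.chrom != nxt.chrom or prev.strand != nxt.strand:
        return False
    ref_gap = nxt.ref_start - prev.ref_end
    if prev.strand == 1:
        other_gap = nxt.start - prev.end
    else:  # inverted: the other genome is traversed backwards
        other_gap = prev.start - nxt.end
    return 0 <= ref_gap < max_gap and 0 <= other_gap < max_gap


def synteny_blocks(alignments: List[Alignment], max_gap: int = MAX_GAP,
                   min_block: int = MIN_BLOCK) -> List[Tuple]:
    """Return (ref_start, ref_end, chrom, strand) for each kept block."""
    clusters, current = [], []
    for aln in sorted(alignments, key=lambda a: a.ref_start):
        if current and not can_join(current[-1], aln, max_gap):
            clusters.append(current)
            current = []
        current.append(aln)
    if current:
        clusters.append(current)
    return [(c[0].ref_start, c[-1].ref_end, c[0].chrom, c[0].strand)
            for c in clusters if c[-1].ref_end - c[0].ref_start >= min_block]


# Two nearby hits on chromosome 11 form a block; the lone 20 kb hit on
# chromosome 5 is discarded.
example = [
    Alignment(0, 60_000, "11", 1_000_000, 1_060_000, 1),
    Alignment(90_000, 150_000, "11", 1_100_000, 1_160_000, 1),
    Alignment(400_000, 420_000, "5", 2_000_000, 2_020_000, 1),
]
blocks_found = synteny_blocks(example)
```

This sketch treats one reference chromosome at a time and does not merge overlapping alignments; a production pipeline would have to handle both.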

Page 53: Comparative genomics and proteomics in Ensembl Sep 2006

SyntenyView

[Screenshot labels: human chromosome; mouse chromosomes; orthologues]

Page 54: Comparative genomics and proteomics in Ensembl Sep 2006

CytoView

Syntenic blocks

Page 55: Comparative genomics and proteomics in Ensembl Sep 2006

Outlook

• OrthoView
• Displaying alignments both from whole genome alignments and on orthologues
• Consider all isoforms for each gene
• Calculate dN/dS

Page 56: Comparative genomics and proteomics in Ensembl Sep 2006

Acknowledgements

• Abel Ureta-Vidal
• Benoît Ballester
• Kathryn Beal
• Stephen Fitzgerald
• Javier Herrero
• Albert Vilella

Ensembl team

Sep 2006

Page 57: Comparative genomics and proteomics in Ensembl Sep 2006

Basic idea

[Figure labels: ancestor sequence; speciation event; mutations; selection; alignment. Legend: mutation, regulatory region, exon]

Page 58: Comparative genomics and proteomics in Ensembl Sep 2006

Global v Local Alignments

Local
  – Advantages: compares large genomic regions (uses syntenic maps); can detect rearrangements like translocations, inversions and duplications (!)
  – Disadvantages: fails to identify insertions or deletions

Global
  – Advantages: detects insertions and deletions
  – Disadvantages: fails to detect rearrangements (inversions)

[Figure: two segments, 1 and 2, shown with an inversion (-) and a duplication]

Glocal aligner (Brudno et al., 2003): pairwise only

Page 59: Comparative genomics and proteomics in Ensembl Sep 2006

Inparalogues vs Outparalogues

Adapted from Sonnhammer & Koonin (2002) TIG 18, 12: 620

Page 60: Comparative genomics and proteomics in Ensembl Sep 2006

Problems: weak orthologies

Page 61: Comparative genomics and proteomics in Ensembl Sep 2006

Problems: misalignments

Page 62: Comparative genomics and proteomics in Ensembl Sep 2006

Possible solutions

• Weak orthologies:
• Poor alignments:
  – report to author
  – edit alignments, detect wrong edges, redefine blocks
  – use another aligner

Page 63: Comparative genomics and proteomics in Ensembl Sep 2006

From Edgar, R. C. (2004) NAR 32:1792-1797