Orthology predictions for whole mammalian genomes

Post on 14-Jan-2016

DESCRIPTION

Orthology predictions for whole mammalian genomes. Leo Goodstadt, MRC Functional Genomics Unit, Oxford University. Finishing. “Evolution of Orthologues”: selection pressures in orthologues and paralogs. “Gene Duplications”: reproduction, immunity or chemosensation.

TRANSCRIPT

  • Orthology predictions for whole mammalian genomes
    Leo Goodstadt
    MRC Functional Genomics Unit
    Oxford University

  • Evolution of Orthologues: selection pressures in orthologues and paralogs
    Gene Duplications: reproduction, immunity or chemosensation
    Synonymous substitution rates: mutation and selection vary by chromosome size
    Gene birth in the human lineage: ongoing duplications underlie polymorphism

  • Orthology is the key

  • How it started
    We are consumers of orthology / paralogy
    Started off using Ensembl predictions
    Ensembl 1:1 covered 50% of predicted mouse genes
    Ewan's manual survey said 80%

  • 1) General observations for all mammalian genomes
    Paralogues evolve fast (and are fun!)

  • 2) Observations for whole clades of species
    [Figure: lineage-specific dN/dS (axis 0.00 to 0.14) by species, grouped as Drosophila (dmel, dsim, dyak, dere, dana, dpse, dvir, dmoj, dgri), Nematodes (cele, cbri, crem, c2801) and Amniotes (hsap, mmus, cfam, mdom, oana, ggal)]

  • 3) Inparalogues define lineage specific biology
    Marsupial / Monodelphis biology revealed by lineage specific genes:
      Chemosensation (OR, V1R and V2R)
      Reproduction (vomeronasal receptors, lipocalins, β-microseminoprotein (12:1))
      Immunity (IG chains, butyrophilins, leukocyte IG-like receptors, T-cell receptor chains and carcinoembryonic antigen-related cell adhesion molecules)
      Pancreatic RNAses
      Detoxification (hypoxanthine phosphoribosyltransferase homologues; nitrogen-poor diets)
      KRAB zinc fingers

  • 4) Interesting stories in the aggregate

  • 5) Treasure trove in the details
    clade: #2 (ortholog_id = 17117 in panda)
    159 mus genes
    47 genes new to assembly 36
    10 genes completely new to assembly 36
    Interpro matches for this clade:

    !!! Expansion mainly on chr5 and chr14, although single (pseudogene?) versions on chr13 and chr16.
    !!! Mouse DLG5 is: chr14:22,966,420-22,978,653 (expressed in testis: AK147699)

    gene identifier       order  chrm  exons  stop  length
    --------------------  -----  ----  -----  ----  ------
    MUS_GENE_21705  6639  5  spermatogenesis associated glutamate (E)-rich protein 1, pseudogene 1 ; ENSMUSP00000086007  4  182
    MUS_GENE_22420  6643  5  predicted gene, EG623898 ; ENSMUSP00000099126  2  72  <
    MUS_GENE_19599  6646  5  spermatogenesis associated glutamate (E)-rich protein 1, pseudogene 1 (Speer1-ps1) on chromosome 5 ; NCBIMUSP_83776567  4  157  <
    MUS_GENE_23688  6651  5  predicted gene, EG623898 ; ENSMUSP00000094421  2  72
    MUS_GENE_19774  6657  5  spermatogenesis associated glutamate (E)-rich protein 3 ;

    Ongoing mouse inparalogue analysis: lots and lots of reproductive genes

  • 6) Candidates for evolutionary and functional analyses
    Secretoglobin protein family members: androgen-binding proteins
    Emes et al. (2004) Genome Res. 14(8):1516-29

  • Available Genomes and Divergences
    Hedges, SB. Nature Reviews Genetics 3, 838-849 (2002)

  • How do we find function in the genome?
    "Nothing in Biology Makes Sense Except in the Light of Evolution." Theodosius Dobzhansky (1900-1975)

  • How to find the function in the genome?
    Similar sequences
    Common ancestry (homology)
    Similar structures / folds
    Similar functions? (genes / genome regions)

  • How much of the genome is functional? Compare with the mouse
    [Figure: conservation-score distributions for ancestral repeats (ARs) and for the whole genome]

    Ancestral Repetitive (AR) sequence is non-functional and has evenly distributed conservation scores (red): a symmetrical bell shape due to biological variation.

    Whole Genome sequence contains some functional sequence under selection and thus has a small excess of conserved sequence under purifying selection (asymmetrical).

    N.B. This is an estimate that doesn't take into account sequence that is:
      turning over rapidly (not shared by mouse/human)
      under positive (diversifying) selection
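The "small excess of conserved sequence" can be turned into a number in several ways. The Python sketch below is one simple option and not necessarily how the estimate in the talk was derived: it treats the AR histogram as the neutral distribution and sums the bins where the whole-genome distribution exceeds it, which gives a lower bound on the selected fraction. The function name and the histogram counts are hypothetical.

```python
def excess_conserved_fraction(neutral_hist, genome_hist):
    """Lower-bound estimate of the fraction of sequence under selection.

    Both histograms count windows per conservation-score bin and must use
    the same bins. Each is normalised to sum to 1 before comparison.
    """
    if len(neutral_hist) != len(genome_hist):
        raise ValueError("histograms must use the same bins")
    neutral_total = sum(neutral_hist)
    genome_total = sum(genome_hist)
    excess = 0.0
    for neutral, genome in zip(neutral_hist, genome_hist):
        diff = genome / genome_total - neutral / neutral_total
        if diff > 0:  # only bins where the genome has more than neutral
            excess += diff
    return excess


# Hypothetical counts: a symmetric neutral (AR) distribution, and a genome
# distribution with 5 extra windows in the most conserved bin.
ar_counts = [10, 20, 40, 20, 10]
genome_counts = [10, 20, 40, 20, 15]

excess = excess_conserved_fraction(ar_counts, genome_counts)
```

In this toy case 5 of 105 windows (about 4.8%) were added to the conserved tail, and the estimate comes out slightly lower (3/70, about 4.3%), as expected for a lower bound.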

  • The human genome (euchromatic sequence)
    Protein coding: 1.2%
    UTR: 0.3%
    Conserved non-coding (3.5% ?)
    Repeats (transposable elements, ...): ~45%
    Unknown (old repetitive junk?)
    Neutral

  • Conserved non-coding material
    Transcription factor binding sites
    Enhancers, insulators and other non-transcribed regulatory elements
    Alternative splicing signals
    Transfer RNAs, ribosomal RNAs
    Small RNAs (e.g. snoRNAs, microRNAs, siRNAs and piRNAs): regulatory / gene silencing / RNA degradation
    MacroRNAs (e.g. Xist): enzymatic? / chromosome inactivation

  • Functional parts of genes are highly conserved

  • How many protein coding genes?
    Walter Gilbert [1980s]: 100k
    Antequera & Bird [1993]: 70-80k
    John Quackenbush et al. (TIGR) [2000]: 120k
    Ewing & Green [2000]: 30k
    Tetraodon analysis [2001]: 35k
    Human Genome Project (public) [2001]: ~31k
    Human Genome Project (Celera) [2001]: 24-40k
    Mouse Genome Project (public) [2002]: 25-30k
    Lee Rowen [2003]: 25,947
    Human Genome Project (finishing) [2004]: 20-25k
    Current predictions [2008]: 19-20k

  • Traditional Genome Orthology
    Reciprocal BLAST best hits between the longest transcript of each gene (+ synteny)
    Assumes:
      Protein similarity is proportional to evolutionary distance (selection is invariant!)
      Pairwise relationships adequately represent the evolutionary tree
      No gene losses or missing predictions
      Alternative splicing can be ignored!
      No gene translocations after tandem duplication

  • Orthology prediction methods
    Two genomes: reciprocal best BLAST hit
    Multiple genomes: clustering of reciprocal best hits / protein similarities
    [Figure labels: Query; BLAST hits]

  • Reciprocal Blast Best Hits
    Advantages:
      Fast, well understood
      Works well for distant lineages
      Can correlate with protein structure (domains)
    Disadvantages:
      Only provides 1:1 orthologues in the best case
      Can be difficult to reconcile with the species tree
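A minimal Python sketch of reciprocal best hits between two genomes, assuming the BLAST results have already been parsed into score dictionaries. The gene names, scores and helper functions are hypothetical illustrations, not code from the talk.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.

    `scores` maps (query, subject) -> alignment score (e.g. a bit score).
    On ties, the first hit encountered is kept.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical scores: human genes searched against mouse, and the reverse.
# m1 and m2 stand for a mouse-specific duplication of the h1 orthologue.
human_vs_mouse = {("h1", "m1"): 900, ("h1", "m2"): 850, ("h2", "m3"): 400}
mouse_vs_human = {("m1", "h1"): 900, ("m2", "h1"): 850, ("m3", "h2"): 400}

pairs = reciprocal_best_hits(human_vs_mouse, mouse_vs_human)
```

Here `pairs` holds only `("h1", "m1")` and `("h2", "m3")`. The duplicated gene `m2` is dropped even though its best hit is `h1`, which is the 1:1 limitation listed under the disadvantages.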

  • [Figure labels: Genes on chromosome of species 1; Genes on chromosome of species 2]

  • Reciprocal Blast Best Hits

  • How to add duplicated genes? Synteny
    Ensembl compara in the past
    Local gene order tends to be conserved in mammalian lineages
    Look for inparalogs locally even if the protein distances don't add up (sequence error, sampling error etc.)
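The local search can be sketched in Python as a window scan around the main orthologue. This is an illustration under stated assumptions, not the Ensembl compara implementation: the function name, the window size and the gene names are hypothetical, and homology is assumed to have been decided beforehand.

```python
def local_inparalog_candidates(anchor, gene_order, homologs, window=5):
    """List homologs of `anchor` lying within `window` genes of it.

    gene_order: genes of one chromosome, in chromosomal order.
    homologs:   set of genes with detectable similarity to `anchor`.
    """
    position = {gene: i for i, gene in enumerate(gene_order)}
    centre = position[anchor]
    return [
        gene
        for gene in gene_order
        if gene != anchor
        and gene in homologs
        and abs(position[gene] - centre) <= window
    ]


# Hypothetical mouse chromosome: g3 is the main orthologue of a human gene.
chromosome = ["g1", "g2", "g3", "g4", "g5", "g6", "g7", "g8", "g9"]
similar_to_g3 = {"g2", "g4", "g9"}

candidates = local_inparalog_candidates("g3", chromosome, similar_to_g3, window=2)
```

With a window of two genes, `g2` and `g4` are kept and the distant homolog `g9` is not. The result is only a candidate list: a neighbouring homolog is not guaranteed to be an inparalog.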

  • Blast Best Hits in Local Regions

  • Problems with relying only on synteny
    Local homologs are often not inparalogs:
      Local rearrangements
      Missing predictions (neighbouring orphans)
    Need sanity checking

  • Human and Mouse chromosomes:

    Extensive rearrangements only over larger regions
    Conservation of gene order in the short range

  • Olfactory Orthology from compara
    [Figure labels: Mouse chromosome 2; Rat chromosome 3]

  • Olfactory Orthology
    [Figure labels: Mouse chromosome 2; Rat chromosome 3]

  • Inparanoid
    Remm, M., Storm, C.E. and Sonnhammer, E.L.L. (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J. Mol. Biol. 314, 1041-1052.
    Avoids multiple alignments and phylogenetic methods, for speed and to avoid errors
    Heuristics are implicitly phylogenetic

  • How Inparanoid works
    [Flowchart; surviving labels: Use cutoff; steps 2 to 5; Orthology]

  • [Flowchart: Inparanoid steps, numbered 2 to 5 on the slide]
    Longest transcripts
    Pairwise alignment scores (use cutoff)
    Identify main orthologues: reciprocal best hits are orthologues
    Identify inparalog candidates: add lineage-specific duplicates (inparalogs), with confidences
    Resolve conflicts
    Orthology

  • Confidence values for inparalogs
    An inparalog gets the highest confidence when its sequence is identical to the main orthologue.
    Maximum value = score_identical - score_orthologs
    Confidence = (score_inparalog - score_orthologs) / (score_identical - score_orthologs)
    [Figure labels: A, B]
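The confidence ratio is easy to check with a few numbers. The Python sketch below assumes that the inparalog score is measured between the main orthologue and the candidate inparalog, and that the identical score is the main orthologue aligned to itself. The function name and the scores are hypothetical.

```python
def inparalog_confidence(score_inparalog, score_orthologs, score_identical):
    """Confidence of a candidate inparalog, as a value between 0 and 1.

    score_inparalog: main orthologue vs the candidate inparalog (assumed)
    score_orthologs: main orthologue vs its orthologue in the other species
    score_identical: main orthologue aligned to itself (assumed)
    """
    span = score_identical - score_orthologs
    if span <= 0:
        raise ValueError("identical score must exceed the orthologue score")
    return (score_inparalog - score_orthologs) / span


# Hypothetical scores: self score 1000, orthologue pair score 600.
identical_copy = inparalog_confidence(1000, 600, 1000)
halfway = inparalog_confidence(800, 600, 1000)
borderline = inparalog_confidence(600, 600, 1000)
```

An identical copy scores 1.0, a candidate halfway between the orthologue score and the self score gets 0.5, and one that is no closer than the orthologue in the other species gets 0.0.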

  • Resolving conflicts
    [Flowchart labels: Longest transcripts; Pairwise alignment scores (use cutoff); Reciprocal best hits are orthologues; Add inparalogs, with confidences; Orthology]

    Merge if orthologs already clustered in same group

    Merge if two equally good best hits

    Delete weaker group

    Merge significantly overlapping

    Divide overlapping

  • Why are there conflicts?
    Protein differences are a proxy for evolutionary time
    Protein similarity scores approximate protein differences (sequence, alignment, estimation errors)
    Pairwise scores can be used to (conceptually) recover phylogenetic (tree) data

  • Alternatives: phylogenetic methods
    Inparanoid is great because it models phylogeny explicitly
    Why not use phylogenetic methods directly?
    Multiple estimators of protein distance
    4 pairwise scores used out of 30

  • Phylogenetic methods
    Iterative distance methods are very fast and suitable for whole genome analyses (variants on neighbor joining)
    Statistically consistent with evolutionary models (can have an explicit error model with evolutionary distances, e.g. BIONJ)
    Inparanoid-type consistency checking can be carried out after phylogeny is predicted
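To show why distance methods are cheap, here is a Python sketch of the first step of plain neighbor joining: choosing which pair of taxa to merge. It is the textbook Q criterion, not BIONJ and not the pipeline used in the talk, and the taxon names and distances are hypothetical.

```python
def nj_pair_to_join(names, dist):
    """Pick the pair that neighbor joining would merge first.

    names: list of taxon labels
    dist:  dict mapping frozenset({a, b}) -> distance between a and b
    """
    n = len(names)
    # Total distance from each taxon to all the others.
    total = {
        a: sum(dist[frozenset((a, b))] for b in names if b != a) for a in names
    }
    best_pair, best_q = None, None
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # Q is low when a and b are close to each other
            # but far from everything else.
            q = (n - 2) * dist[frozenset((a, b))] - total[a] - total[b]
            if best_q is None or q < best_q:
                best_pair, best_q = (a, b), q
    return best_pair, best_q


names = ["a", "b", "c", "d", "e"]
raw = {
    ("a", "b"): 5, ("a", "c"): 9, ("a", "d"): 9, ("a", "e"): 8,
    ("b", "c"): 10, ("b", "d"): 10, ("b", "e"): 9,
    ("c", "d"): 8, ("c", "e"): 7, ("d", "e"): 3,
}
dist = {frozenset(pair): d for pair, d in raw.items()}

pair, q = nj_pair_to_join(names, dist)
```

The pair chosen is `("a", "b")` with Q = -50, even though `d` and `e` have the smallest raw distance. The criterion corrects for how far each taxon is from all the others, which is how neighbor joining copes with unequal rates between lineages.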

  • Is protein similarity a good proxy for evolutionary distance?
    Advantages:
      Does not saturate over long evolutionary distances
      Easy to align / predict genes (unlike non-coding regions)
      Sometimes cDNA sequence is not available
    Disadvantages:
      Assumes constant evolutionary rate
      Assumes invariant selection

  • Use Silent Mutations as a genetic clock
    Redundant genetic code, e.g. GCA, GCC, GCG and GCT all encode alanine
    Third base of a codon wobbles without changing the translated amino acid
    dS approximates the neutral mutation rate (without selection) in coding regions
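A small Python sketch of the idea: classify codon differences between two aligned coding sequences as silent (synonymous) or not, using the standard genetic code. It returns raw counts only. A real dS estimate also normalises by the number of synonymous sites and corrects for multiple substitutions, which this sketch does not attempt. The function name and the example sequences are hypothetical.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_codon_changes(seq1, seq2):
    """Count (synonymous, nonsynonymous) codon differences.

    Both sequences must be aligned, in frame, gap-free and of equal length.
    """
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("sequences must be equal length, a multiple of 3")
    synonymous = nonsynonymous = 0
    for i in range(0, len(seq1), 3):
        codon1, codon2 = seq1[i:i + 3], seq2[i:i + 3]
        if codon1 == codon2:
            continue
        if CODON_TABLE[codon1] == CODON_TABLE[codon2]:
            synonymous += 1      # silent change, amino acid unchanged
        else:
            nonsynonymous += 1   # amino acid replaced
    return synonymous, nonsynonymous


# GCA->GCG and GCC->GCT are silent (alanine); AAA->GAA changes Lys to Glu.
changes = count_codon_changes("GCAGCCAAA", "GCGGCTGAA")
```

For the example pair, `changes` is `(2, 1)`: two silent third-base changes within the alanine codons and one amino acid replacement.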

  • dS as proxy for evolutionary distance
    Easier to align than Ancestral Repeats
    Not neutral sequence!!
    Genomic > 2x variation in dS
    Assumes most gene families are local due to tandem duplication and share dS
    Assumes (partial) gene conversions are infrequent
