![Page 1: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/1.jpg)
Comparative genomics
Lucy Skrabanek
ICB, WMC
6 May 2008
![Page 2: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/2.jpg)
What does it encompass?
• Genome conservation
  – transfer knowledge gained from model organisms to non-model organisms
• Genome evolution
  – understand how genomes change over time in order to identify evolutionary processes and constraints
• Genome variation
  – understand how genomes vary within a species to identify genes central to particular processes
![Page 3: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/3.jpg)
Main uses
• Whole genome comparisons
  – Genome evolution
• Coding region comparisons
  – Gene prediction
  – Gene structure (exon-intron) prediction
  – Function prediction
• Non-coding region comparisons
  – Regulatory region discovery
• Protein-protein interaction prediction
![Page 4: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/4.jpg)
Metazoan evolutionary tree
Ureta-Vidal et al, NRG 2003
![Page 5: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/5.jpg)
Rearrangement rate
• Can estimate the number of rearrangements from cytological comparisons
• Two very different rates:
  – Very slow rate of rearrangement (1 or 2 exchanges per 10 MYR)
    • ~7 rearrangements separate the human genome from the hypothetical primate ancestor
    • 13 rearrangements between cat and human
![Page 6: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/6.jpg)
Rearrangement rate
• Punctuated by abrupt global genome rearrangement in some lineages
  – Gibbons and siamangs rearranged 3 or 4 times more extensively than human or other great apes
  – Rodent species exhibit very rapid patterns of chromosome change
    • ~180 conserved segments between mouse and human
    • ~100 conserved segments between rat and human
![Page 7: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/7.jpg)
Human, cat and mouse X chromosome comparison
O’Brien et al, Science 1999
![Page 8: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/8.jpg)
Genome comparison across species: gross changes in chromosome number
O’Brien et al, Science 1999
![Page 9: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/9.jpg)
Genome alignment
• Very different from single gene or protein alignments
• Standard dynamic programming algorithms (DPAs) are too expensive
• Made complicated by extensive rearrangements of large homologous segments
![Page 10: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/10.jpg)
Problems
• Looking for syntenic regions
• Rearrangements disrupting syntenic regions
  – Insertions
  – Deletions
  – Inversions
  – Translocations
  – Duplications
![Page 11: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/11.jpg)
Assumptions
• The two genomes to be aligned are derived from a common ancestor
• There remains sufficient similarity between the genomes to enable easy identification of homologous regions
• For the alignment to be informative, there has to have been time for the genomes to diverge and for selection to have occurred
![Page 12: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/12.jpg)
Algorithm requirements
• Genome alignment algorithms must
  – Scale linearly (computationally)
  – Be robust (not too many parameters)
  – Be memory-efficient
  – Be able to handle rearrangements, gene duplications, repetitive elements
• Smith-Waterman, Needleman-Wunsch
  – Calculation time is of the order O(mn) for sequences of lengths m and n, i.e. quadratic
    • Not feasible for sequence length > 10,000 bp
  – Cannot handle rearrangements or inversions
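To make the cost argument concrete, here is a minimal sketch of Needleman-Wunsch global alignment scoring. The scoring values (`match`, `mismatch`, `gap`) are illustrative assumptions, not taken from the slides. The function also counts how many cells of the dynamic programming grid it fills, which is m × n for sequences of lengths m and n.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return (global alignment score, number of DP cells filled).

    Scoring parameters are illustrative assumptions with a linear gap cost.
    Only two rows are kept in memory, but the time is still O(len(a) * len(b)).
    """
    m, n = len(a), len(b)
    prev = [j * gap for j in range(n + 1)]  # row 0: b aligned against gaps only
    cells = 0
    for i in range(1, m + 1):
        curr = [i * gap] + [0] * n          # column 0: a aligned against gaps only
        for j in range(1, n + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
            cells += 1
        prev = curr
    return prev[n], cells


score, cells = needleman_wunsch_score("GATTACA", "GATACA")
print(score, cells)
```

Two 10,000 bp sequences already need 10^8 cell updates, and two whole chromosomes need many orders of magnitude more, which is why the genome aligners described here restrict dynamic programming to small regions.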
![Page 13: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/13.jpg)
Alignment methods
• Seeding methods (e.g., BLASTZ, BLAT, EXONERATE)
  – Produce “local” alignments
  – All matches found (including all paralogs)
    • Very sensitive, not very specific
  – Very fast
• Anchor-based methods (e.g., MUMmer, GLASS, AVID)
  – Produce “global” alignments
  – Specific
  – Difficulty with rearrangements
Ureta-Vidal et al, Nat Rev Genet April 2003
![Page 14: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/14.jpg)
Local aligner: BLASTZ
• As in BLAST, find seeds
  – Seeds are determined by a 12-of-19 weighted spaced seed strategy, where the positions for strict matches are specified
  – The 1110100110010101111 combination was shown to be the most sensitive (and more sensitive than the 11-consecutive-match seed strategy used by BLAST)
  – Also allows one transition in any one of the 12 strict-match positions
  – Seeds with many matches are masked out (assumed to be repetitive regions)
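The spaced-seed idea can be sketched in a few lines. This is an illustration of the 12-of-19 pattern quoted above, not BLASTZ's actual implementation: the function names and the `max_hits` repeat filter are assumptions, and the single-transition tolerance is left out for brevity.

```python
from collections import defaultdict

SEED = "1110100110010101111"  # 12-of-19 pattern from the slide; '1' = must match


def spaced_key(seq, start, pattern=SEED):
    """Concatenate only the bases at the strict-match ('1') positions."""
    return "".join(seq[start + i] for i, c in enumerate(pattern) if c == "1")


def find_seed_hits(target, query, pattern=SEED, max_hits=None):
    """Return (target_pos, query_pos) pairs whose '1' positions all agree.

    max_hits is a hypothetical cutoff: keys occurring more often than this
    in the target are skipped, mimicking the masking of repetitive seeds.
    """
    span = len(pattern)
    index = defaultdict(list)
    for t in range(len(target) - span + 1):
        index[spaced_key(target, t, pattern)].append(t)
    hits = []
    for q in range(len(query) - span + 1):
        positions = index.get(spaced_key(query, q, pattern), [])
        if max_hits is not None and len(positions) > max_hits:
            continue  # over-represented seed, assumed repetitive
        hits.extend((t, q) for t in positions)
    return hits


target = "ACGTACGTTGCAAGGCTTA"
query = "ACGAACGTTGCAAGGCTTA"   # differs only at index 3, a '0' position
print(find_seed_hits(target, query))
```

Because the mismatch falls on a don't-care position, the seed still hits, whereas a contiguous 11-mer covering that base would miss it.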
![Page 15: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/15.jpg)
BLASTZ
• Gap-free extension
• Further extend seeds by DPA, allowing gaps
  – Low complexity regions have their scores specifically down-weighted
• Repeat above steps (using a more sensitive seed, e.g., 7-mer) for regions that lie between matches that share the same order and orientation and are separated by < 50 kb
• Post-processing to remove multiple hits to the same region
• When aligning human and mouse genomes, can achieve 98% alignment of known coding regions
![Page 16: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/16.jpg)
Local aligner: BLAT
• Two modes: untranslated and translated
  – Untranslated mode performs poorly when conservation < 90%, so translated mode is usually used when aligning genomes
  – Translated mode will more efficiently identify regions that are conserved for their coding ability rather than for their regulatory functions
• Seeds created by building an index of non-overlapping 5-mers from one genome
  – Frequent 5-mers and ambiguous sequences (repeats and low-complexity regions) excluded
• The other genome is chopped into < 200 kb chunks
  – Comparisons are made between the indexed 5-mers and the chunks
• DPA applied both upstream and downstream of matches
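The indexing scheme above can be sketched as follows. This is a toy illustration rather than BLAT's code: the function names and the `max_occurrences` cutoff are assumptions, and plain strings stand in for translated genome sequence.

```python
from collections import defaultdict


def build_tile_index(database, k=5):
    """Index NON-overlapping k-mers (tiles) of the database sequence."""
    index = defaultdict(list)
    for start in range(0, len(database) - k + 1, k):  # step by k: no overlap
        index[database[start:start + k]].append(start)
    return index


def scan_query(index, query, k=5, max_occurrences=None):
    """Look up every OVERLAPPING k-mer of the query chunk in the index.

    max_occurrences is a hypothetical cutoff standing in for the exclusion
    of frequent k-mers.
    """
    hits = []
    for q in range(len(query) - k + 1):
        positions = index.get(query[q:q + k], [])
        if max_occurrences is not None and len(positions) > max_occurrences:
            continue
        hits.extend((d, q) for d in positions)
    return hits


index = build_tile_index("AAAAACCCCCGGGGG")
print(scan_query(index, "TTCCCCCTT"))
```

Indexing non-overlapping tiles keeps the index roughly k times smaller than an index of every position, while scanning all overlapping k-mers of the query ensures that a shared region is still found whatever its offset.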
![Page 17: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/17.jpg)
Global aligner: AVID
• Assumes that no gene duplications, inversions or translocations have occurred
  – Need a pre-processing step to identify syntenic regions - use BLAT hits, filtered for specificity
• Find ‘maximally repeated matches’ in syntenic regions
  – Matches are flanked on either side by mismatches
• Determine an ‘anchor set’
  – Exclude all maximally repeated matches that are less than half the length of the longest match
  – Allow only non-overlapping non-crossing matches
  – Use ‘clean’ matches first, then ‘repeat’ matches
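A rough sketch of the anchor-set rules listed above, under simplifying assumptions: matches are given as `(start_a, start_b, length)` tuples, the ‘clean’ versus ‘repeat’ distinction is ignored, and anchors are chosen greedily by length. The greedy choice is a stand-in for illustration; it is not claimed to be the selection procedure AVID itself uses.

```python
def compatible(x, y):
    """True if two matches neither overlap nor cross in either sequence."""
    (ax, bx, lx), (ay, by, ly) = x, y
    if ax > ay:  # order the pair by position in sequence A
        (ax, bx, lx), (ay, by, ly) = (ay, by, ly), (ax, bx, lx)
    # the earlier match must end before the later one starts, in A and in B
    return ax + lx <= ay and bx + lx <= by


def select_anchors(matches):
    """Filter short matches, then greedily keep mutually compatible ones."""
    if not matches:
        return []
    longest = max(length for _, _, length in matches)
    # drop matches shorter than half the length of the longest match
    candidates = [m for m in matches if 2 * m[2] >= longest]
    anchors = []
    for cand in sorted(candidates, key=lambda m: -m[2]):  # longest first
        if all(compatible(cand, kept) for kept in anchors):
            anchors.append(cand)
    return sorted(anchors)


matches = [(0, 0, 10), (20, 5, 8), (12, 12, 6), (30, 30, 4), (40, 40, 12)]
print(select_anchors(matches))
```

In this example `(30, 30, 4)` is dropped for being too short and `(20, 5, 8)` is rejected because it overlaps an accepted anchor in the second sequence, leaving three anchors and therefore four intervening regions to align.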
![Page 18: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/18.jpg)
Blue + Red: all maximally repeated matches
Red: anchor set
Bray et al, Genome Res 2003
• n anchors ⇒ n+1 regions to be aligned
• Add any matches that were discarded because they were too short on the first round
• Add any new anchors
• When there are no more anchors to be added, use the NW algorithm to align the intervening sequence
  – If the regions are longer than ~4 kb, the alignment is returned with a gap in both sequences at that position
![Page 19: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/19.jpg)
Global aligner: LAGAN
• Different matching algorithm to AVID
  – Generates local alignments
  – Allows mismatches
• Find highest weight chain
• Compute NW alignment, limiting it to area around chains
• MLAGAN - multiple global alignment tool
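The "highest weight chain" step can be illustrated with a simple quadratic-time dynamic program over local alignments. The tuple layout `(start_a, end_a, start_b, end_b, weight)` with exclusive ends is an assumption made for this sketch, and production tools use faster chaining algorithms than the O(n²) loop shown here.

```python
def best_chain(alignments):
    """Return (total weight, chain) for the heaviest collinear chain.

    Each alignment is (start_a, end_a, start_b, end_b, weight), ends exclusive.
    Alignment i may precede j only if it ends before j starts in BOTH sequences.
    """
    items = sorted(alignments)          # sorted by start in sequence A
    n = len(items)
    if n == 0:
        return 0, []
    best = [item[4] for item in items]  # best chain weight ending at each item
    back = [-1] * n                     # predecessor index for traceback
    for j in range(n):
        for i in range(j):
            if items[i][1] <= items[j][0] and items[i][3] <= items[j][2]:
                cand = best[i] + items[j][4]
                if cand > best[j]:
                    best[j], back[j] = cand, i
    j = max(range(n), key=best.__getitem__)
    total, chain = best[j], []
    while j != -1:
        chain.append(items[j])
        j = back[j]
    return total, chain[::-1]


local = [(0, 10, 0, 10, 5), (12, 20, 30, 38, 4), (15, 25, 12, 22, 6), (30, 40, 40, 50, 3)]
print(best_chain(local))
```

Once a chain is fixed, the global alignment only has to be computed in a band around it, which is what keeps the final NW step affordable.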
![Page 20: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/20.jpg)
Global aligner: WABA
• Takes into account degeneracy of genetic code
• Seeding step similar to BLAT
  – Uses a 6-of-8 weighted spaced seed, 11011011, where every third position is allowed to mismatch
• In the alignment of homologous regions, uses a seven-state pair HMM which also allows mismatches in the third position
• Finally, join overlapping matches from the alignment phase
![Page 21: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/21.jpg)
Visualization tools: VISTA vs. PipMaker
• VISTA (VISualization Tools for Alignments)
  – Uses AVID to generate alignments
  – Uses a sliding window approach
    • Plots percent identity within a fixed window size, at regular intervals
• PipMaker (Percent Identity Plot)
  – Uses BLASTZ
  – X-axis is the reference sequence; horizontal lines represent gap-free alignments
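The sliding-window calculation behind a VISTA-style plot can be sketched as below. The window and step sizes are arbitrary illustrative defaults, the input is assumed to be two rows of an existing pairwise alignment with `-` for gaps, and gap columns are counted as non-identical.

```python
def percent_identity_windows(aligned_a, aligned_b, window=100, step=25):
    """Return (window centre, percent identity) points along an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    points = []
    for start in range(0, len(aligned_a) - window + 1, step):
        columns = zip(aligned_a[start:start + window],
                      aligned_b[start:start + window])
        identical = sum(1 for x, y in columns if x == y and x != "-")
        points.append((start + window // 2, 100.0 * identical / window))
    return points


print(percent_identity_windows("ACGTACGTAC", "ACGTTCGTAC", window=5, step=5))
```

Plotting the second value of each point against the first gives the familiar conservation curve; peaks above a chosen identity threshold are the candidate conserved elements.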
![Page 22: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/22.jpg)
• Once we have a macro-alignment, we can study the evolution of genomes between species, and can also trace the evolutionary history of the structure of each genome itself
• Genome structure rearrangement
• Gene duplication, chromosomal duplication and polyploidization (whole genome duplication)
  – New genes
![Page 23: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/23.jpg)
Unravelling the history of the genome
• Plants
  – Wheat
    • Allohexaploid (AABBDD)
  – Maize
    • Diploidized allotetraploid
    • Has grown 12x in the past 5 MY due to increased numbers of transposable elements
  – Arabidopsis
    • Haploid chromosome number n = 5
• Yeast
  – Saccharomyces cerevisiae vs. Kluyveromyces lactis
• Human
  – Genome duplication?
![Page 24: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/24.jpg)
Decipher history of lineages
• Genome size changes
  – Compaction
    • Large scale deletions - Fugu
    • Intron loss
  – Expansion
    • Duplication
    • Transposable element insertions
• Transposons - Alus, LINEs
  – Ancient insertions prior to eutherian radiation
  – More recent insertions - maize
Baxendale et al, Nature Genet, 1995
![Page 25: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/25.jpg)
Why gen(om)e duplication?
• Duplicated genes provide a source for genetic novelty during evolution
  – Either member of a duplicated gene pair can diverge to
    • Acquire a new function which may be positively selected for
    • Subfunctionalize
    • Be differently regulated (e.g., tissue-specific)
    • Become a pseudogene
    • Be deleted
• Whole genome duplication allows for the duplication (and subsequent divergence) of whole pathways at a time
![Page 26: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/26.jpg)
Duplication events
• Resolution of polyploidy
  – Non-disjunction of chromosomes (form multivalents instead of bivalents)
    • Sterility
  – Duplicated genes do not start to diverge (or get deleted) until disomic inheritance is resolved
![Page 27: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/27.jpg)
Rearrangements
• Are genome rearrangements the cause or consequence of diploidization?
  – Most widely accepted hypothesis is that diploidization proceeds by structural divergence of chromosomes
  – Some loci appear disomic, others tetrasomic
    • Stage 1: pairing between similar chromosomes allowed
      – Loci near the centromeres can display residual tetrasomy
    • Stage 2: non-homologous chromosome pairing resolved
![Page 28: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/28.jpg)
Just how hard can it be to tell what happened?
“Take four, or maybe eight, decks of 52 playing cards. Shuffle them all together and then throw some cards away. Pick 20 cards at random and drop the rest on the floor. Give the 20 cards to some evolutionary biologists and ask them to figure out what you’ve done.”
Skrabanek and Wolfe, Curr Opin Genet Dev, 1998
Now that the human genome has been sequenced, things aren’t quite so bleak.
![Page 29: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/29.jpg)
Effects of polyploidization
Wolfe, Nature Rev Genet 2001
![Page 30: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/30.jpg)
Effects of polyploidization
Wolfe, Nature Rev Genet 2001
![Page 31: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/31.jpg)
Saccharomyces cerevisiae (baker's yeast)
![Page 32: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/32.jpg)
Whole Genome Duplication (WGD)
WGD in S. cerevisiae: Wolfe & Shields, Nature 387:708 (1997)
![Page 33: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/33.jpg)
Yeast
• Saccharomyces cerevisiae
  – Degenerate tetraploid
  – Polyploidy followed by extensive deletion and (70-100) reciprocal translocations
  – 8% of genes duplicated in 55 blocks (plus many missed smaller blocks)
  – Relative orientation of genes in blocks conserved with respect to the centromere
Seoighe and Wolfe, PNAS, 1998
![Page 34: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/34.jpg)
Duplicated blocks in yeast
Wolfe and Shields, Nature 1997
![Page 35: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/35.jpg)
Estimation of time of polyploidy event
• Diverged from Kluyveromyces lactis (unduplicated) ~150 MYA
  – Comparison of gene sequences and gene order reveals conservation
    • 59% of adjacent gene pairs in K. lactis or K. marxianus are also adjacent in S. cerevisiae
    • 16% of Kluyveromyces neighbors can be explained in terms of inferred ancestral gene order
  – Phylogenetic analyses of duplicated genes, where both the Kluyveromyces orthologue and an outgroup orthologue were available, deduced that the polyploidization event in S. cerevisiae occurred around 100 MYA
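The "adjacent gene pairs that are also adjacent" statistic can be computed along these lines. This is a simplified sketch: each species is reduced to a single ordered gene list, the one-to-one orthologue mapping is supplied by the caller, and all gene names are made up for the example.

```python
def adjacency_conservation(order_x, order_y, orthologs):
    """Fraction of adjacent gene pairs in species X whose orthologues are
    also adjacent in species Y.

    order_x, order_y: gene names in chromosomal order (one chromosome each).
    orthologs: dict mapping a gene of X to its orthologue in Y.
    Pairs where either gene lacks a mapped orthologue are skipped.
    """
    position_y = {gene: i for i, gene in enumerate(order_y)}
    total = conserved = 0
    for g1, g2 in zip(order_x, order_x[1:]):
        o1, o2 = orthologs.get(g1), orthologs.get(g2)
        if o1 not in position_y or o2 not in position_y:
            continue
        total += 1
        if abs(position_y[o1] - position_y[o2]) == 1:
            conserved += 1
    return conserved / total if total else 0.0


order_k = ["k1", "k2", "k3", "k4"]            # hypothetical gene order, species X
order_s = ["s1", "s2", "s4", "s3"]            # hypothetical gene order, species Y
mapping = {"k1": "s1", "k2": "s2", "k3": "s3", "k4": "s4"}
print(adjacency_conservation(order_k, order_s, mapping))
```

In a post-duplication genome a gene may have two co-orthologues on sister segments, so a real analysis would need to allow adjacency on either copy; that refinement is omitted here.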
![Page 36: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/36.jpg)
Kellis et al Nature 428:617-624 (2004)
[Figure labels: 10% retention; 5,000 genes; 10,000 genes; 5,500 genes; 5,000 genes; different subsets retained]
Evidence from conserved order of a very few genes
Evidence from interleaving genes from sister segments
![Page 37: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/37.jpg)
Each region of K. waltii matches two regions of S. cerevisiae
We don’t even need any remaining two-copy genes to infer the ancestral order
![Page 38: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/38.jpg)
Human
• 1970 - Susumu Ohno proposed that vertebrate genomes had originated via genome duplication
• 2R (two Rounds [of genome duplication]) hypothesis
  – One before, and one after, divergence of agnathans (lamprey) from tetrapods (~430 MYA and ~500 MYA)
  – Popular and controversial
    • Split between the map-based people and the tree-based people
• Duplicated regions are evident (covering ≈ 44% of the genome), but it is difficult to tell whether they are due to (a) genome duplication(s) or chromosomal duplications
![Page 39: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/39.jpg)
Susumu Ohno
1928-2000
Book: Evolution by Gene Duplication (1970)
Whole-genome duplication (polyploidy)
![Page 40: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/40.jpg)
Timing of tetraploidy events
Skrabanek, PhD thesis, 1999
![Page 41: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/41.jpg)
Evidence?
• There are regions in the genome which are quadruplicated
  – HSA 1, 6, 9 and 19 (MHC region, 10 gene families)
  – HSA 4, 5, 8 and 10
  – HSA 2, 7, 12 and 17 (Hox clusters)
• Expect to see (A,B)(C,D) tree, where the time of divergence of A from B and C from D is approximately the same
  – However, this is not consistently the case
![Page 42: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/42.jpg)
Proposed evolution of the Hox clusters
Skrabanek, PhD thesis, 1999
![Page 43: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/43.jpg)
Possible routes to 4-gene families
Hokamp et al, J Struct Func Genomics 2003
![Page 44: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/44.jpg)
Hughes, Mol Biol Evol 1998
Estimates of divergence times for genes in the MHC region on chromosomes 1, 6, 9 and 19
![Page 45: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/45.jpg)
Hughes et al, Genome Res, 2001
Divergence times of genes with members on at least two of the Hox cluster bearing chromosomes (2, 7, 12, 17)
![Page 46: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/46.jpg)
Hughes et al, Genome Res, 2001
Phylogenetic relationships of four- and three-membered gene families on Hox cluster bearing chromosomes
![Page 47: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/47.jpg)
Conclusions
• Duplicated regions seen in human genome are most likely vertebrate specific
• Significant amount of duplication occurring ~350-650 MYA
  – Possible to explain large margin by alloploidy?
• Probable support for at least one whole genome duplication event
• Once more genomes are available, such as Ciona intestinalis or amphioxus, it may be easier to decipher the history of duplication
  – However, long time spans under consideration, and diploidization requires extensive genomic changes
![Page 48: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/48.jpg)
PLOS Biology 3:e314 (2005)
![Page 49: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/49.jpg)
Panopoulou and Poustka, TIG 21:559, 2005
Fish-specific genome duplication event
![Page 50: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/50.jpg)
Lander et al - Intl. Human Genome Sequencing Consortium paper, Nature 409:860 (2001)
![Page 51: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/51.jpg)
Hypothetical genomic region
[Figure labels: 1R; 2R; decay (twice)]
![Page 52: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/52.jpg)
Finding all homologs, and only homologs
Find the most similar vertebrate gene (here M1) to the Ciona gene. Other vertebrate genes are added to the cluster if they are more similar to M1 than M1 is to the Ciona gene.
Duplicates may have arisen by speciation (lineage-splitting) or by gene duplication events specific to one or more vertebrates
[Figure labels: S1; > S1; < S1; Fugu-specific duplication; Human-specific duplication]
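The clustering rule stated on this slide can be written down directly. In this sketch `similarity` is a hypothetical caller-supplied scoring function (higher means more similar, for instance an alignment score), and the gene names and scores in the example are invented.

```python
def build_cluster(ciona_gene, vertebrate_genes, similarity):
    """Cluster vertebrate genes around the best match to a Ciona gene.

    The seed (M1 on the slide) is the vertebrate gene most similar to the
    Ciona gene. Any other vertebrate gene joins the cluster if it is more
    similar to the seed than the seed is to the Ciona gene.
    """
    seed = max(vertebrate_genes, key=lambda g: similarity(ciona_gene, g))
    threshold = similarity(seed, ciona_gene)
    cluster = [seed]
    for gene in vertebrate_genes:
        if gene != seed and similarity(seed, gene) > threshold:
            cluster.append(gene)
    return cluster


# Invented similarity scores, stored symmetrically via frozenset keys.
scores = {
    frozenset(("C", "M1")): 50, frozenset(("C", "H1")): 40,
    frozenset(("C", "F1")): 30, frozenset(("C", "X")): 20,
    frozenset(("M1", "H1")): 80, frozenset(("M1", "F1")): 60,
    frozenset(("M1", "X")): 45,
}
sim = lambda a, b: scores[frozenset((a, b))]
print(build_cluster("C", ["M1", "H1", "F1", "X"], sim))
```

Using the invertebrate outgroup as the similarity threshold is what restricts each cluster to genes that duplicated after the split from the Ciona lineage.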
![Page 53: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/53.jpg)
![Page 54: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/54.jpg)
Number of gene duplications
46.6% of ancestral chordate genes are duplicated in ≥ 1 lineage
34.5% with at least one duplication before Fugu-tetrapod split
23.5% with at least one duplication after Fugu-tetrapod split
No evidence of 2R hypothesis from gene family membership alone, nor from phylogenetics (since duplications abundant on every branch)
![Page 55: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/55.jpg)
Paralogs generated by a gene duplication before the Fugu-tetrapod split count as matches. N-fold redundancy calculated by identifying all cases where ≥ 2 different duplicates are within a 100-gene window, and then counting the number of times that their paralogs occur within a 100-gene window elsewhere in the genome.
2R
![Page 56: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/56.jpg)
4-fold redundancy most common - accounts for 25% of the genome
![Page 57: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/57.jpg)
• 1,912 genes duplicated prior to fish-tetrapod split
• 2,953 paralogous gene pairs
• 32.4% are found in 386 detectable paralogous segments, comprising 772 individual genomic segments
• 454 are tetra-paralogons, where overlapping sets fall into 4-fold groups
Of the genes that duplicated after the fish-tetrapod split, only 11% are found in paralogous regions (i.e., duplications after the split did not include large segments of the genome)
![Page 58: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/58.jpg)
Could be of recent origin, or could have undergone multiple rearrangement events that destroyed the tetra-paralogons signal
![Page 59: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/59.jpg)
Old duplications Recent duplications
Hox-bearing chromosomes: 50% of genes duplicated after the fish-tetrapod split are tandem duplicates (separated by < 10 genes), whereas only 6% of genes duplicated before the split are tandem duplicates.
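The tandem-duplicate criterion on this slide can be sketched in a few lines. This is only an illustration: it assumes each gene is given as a (chromosome, gene-order index) pair and that "separated by < 10 genes" means fewer than 10 intervening genes. The function names and the toy data are hypothetical and do not come from the study.

```python
def is_tandem(gene_a, gene_b, max_intervening=10):
    """True if two paralogs share a chromosome and have fewer than
    `max_intervening` genes between them (assumed definition)."""
    chrom_a, index_a = gene_a
    chrom_b, index_b = gene_b
    if chrom_a != chrom_b:
        return False
    intervening = abs(index_a - index_b) - 1
    return intervening < max_intervening


def tandem_fraction(pairs, max_intervening=10):
    """Fraction of paralog pairs that qualify as tandem duplicates."""
    if not pairs:
        return 0.0
    tandem = sum(is_tandem(a, b, max_intervening) for a, b in pairs)
    return tandem / len(pairs)


# Toy paralog pairs: (chromosome, position in gene order). Made-up values.
toy_pairs = [
    (("chr7", 100), ("chr7", 103)),   # 2 intervening genes: tandem
    (("chr7", 100), ("chr17", 101)),  # different chromosomes: not tandem
    (("chr2", 50), ("chr2", 61)),     # 10 intervening genes: not tandem
    (("chr12", 5), ("chr12", 6)),     # adjacent genes: tandem
]
print(tandem_fraction(toy_pairs))  # 0.5
```

Running the same function separately on pairs dated before and after the fish-tetrapod split would give the two percentages being compared.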
![Page 60: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/60.jpg)
Conclusions
• 2R hypothesis most likely scenario
• Two rounds of closely spaced auto-tetraploidization events
  – Some paralogous pairs within tetraparalogons extend over longer regions than others
    • So octaploidy unlikely, since pairs of regions would have had to lose the same sets of genes
  – Phylogenetic trees are not consistently nested
    • So allotetraploidy or two distantly spaced autotetraploidy events unlikely
  – Tree topologies within paralogous blocks not always congruent
    • Gene loss and rediploidization processes probably spanned the two duplication events
![Page 61: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/61.jpg)
Identification of functional elements
• Coding sequences
  – Relatively easy to identify
    • Many gene prediction programs available
    • General gene structure known, e.g.,
      – TATA boxes
      – Splice donor-acceptor sites
  – ESTs and cDNAs available to aid gene prediction
• Non-coding functional sequences
  – Much harder to identify
  – No common structure of regulatory regions
  – TF binding sites are short and ubiquitous
• Comparative genomics
  – A genomic sequence that provides a function that is under selection and tends to be conserved between species is called a “functional sequence” (e.g., a protein-coding region or a transcription factor binding site)
![Page 62: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/62.jpg)
Coding region comparison
• Discover new genes
  – Annotate gene structure
    • Exon-intron structure
• Compare gene content
  – Find genes common to sets of organisms
  – Find genes unique to an organism
• We can study how genomes vary within a species to identify genes central to particular processes
  – Can also compare subspecies, e.g., E. coli K12 and O157:H7 (pathogenic)
• Discover gene function
• Can find missing genes in metabolic pathways
![Page 63: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/63.jpg)
Nature 420:520, 2002
![Page 64: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/64.jpg)
Predicting structure of ‘new’ genes
• Many gene finding programs
  – Look for start sites, termination sites, splice sites
  – Analyze codon usage, exon length
• Comparative method A
  – Find syntenic regions
  – Predict genes using conventional gene finding techniques in both species
  – Genes predicted in both species are probable
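The last step of comparative method A (keep genes predicted in both species) might look like the sketch below. It assumes the syntenic regions have already been aligned, so that predictions from both species can be written as half-open intervals in one shared coordinate system. The 50% overlap cut-off, the function names and the toy coordinates are illustrative assumptions, not part of the slide.

```python
def overlap(a, b):
    """Length of the overlap between two half-open intervals (start, end)."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def probable_genes(pred_a, pred_b, min_fraction=0.5):
    """Return predictions from species A that are supported by an
    overlapping prediction in species B (overlap is measured relative
    to the shorter of the two intervals)."""
    supported = []
    for gene_a in pred_a:
        for gene_b in pred_b:
            shorter = min(gene_a[1] - gene_a[0], gene_b[1] - gene_b[0])
            if shorter > 0 and overlap(gene_a, gene_b) / shorter >= min_fraction:
                supported.append(gene_a)
                break  # one supporting prediction is enough
    return supported


# Toy predictions in shared alignment coordinates (made-up values).
human_predictions = [(100, 400), (900, 1200), (2000, 2300)]
mouse_predictions = [(120, 390), (2500, 2800), (1950, 2250)]
print(probable_genes(human_predictions, mouse_predictions))
# [(100, 400), (2000, 2300)]
```

The prediction at (900, 1200) has no counterpart in the second species, so it is dropped as less reliable.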
![Page 65: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/65.jpg)
Gene prediction
• Comparative method B
  – Find syntenic regions
  – Perform pattern filtering
    • Coding exons tend to be well conserved
    • Conservation higher in first and second positions of codons
  – Advantage: can deal with sequencing errors
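The codon-position signal that pattern filtering relies on can be illustrated with a short sketch. It assumes two aligned, gap-free sequences of equal length and a known reading frame. The function name and the toy sequences are made up for illustration, and a real filter would have to test all frames and handle gaps.

```python
def identity_by_codon_position(seq_a, seq_b, frame=0):
    """Fraction of identical bases at codon positions 1, 2 and 3
    for two aligned, gap-free sequences of equal length."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = [0, 0, 0]
    totals = [0, 0, 0]
    for i in range(frame, len(seq_a)):
        position = (i - frame) % 3
        totals[position] += 1
        if seq_a[i] == seq_b[i]:
            matches[position] += 1
    return [m / t if t else 0.0 for m, t in zip(matches, totals)]


# Toy aligned fragments (hypothetical); differences fall on third positions.
fragment_a = "ATGGCCAAGCTG"
fragment_b = "ATGGCTAAACTC"
print(identity_by_codon_position(fragment_a, fragment_b))
# [1.0, 1.0, 0.25]
```

A conserved region whose mismatches cluster at every third base, as here, looks like a coding exon. Conserved non-coding sequence would not be expected to show this periodicity.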
![Page 66: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/66.jpg)
Identification of paralogues and orthologues in other species
• Best Reciprocal Hit (BRH)
• Reciprocal Hit by Synteny (RHS)
  – Identification of adjacent orthologues
• Domain checking - internal quality control
Mouse Human
http://www.nbn.ac.za/
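The Best Reciprocal Hit idea can be sketched as follows. The code assumes similarity scores from an all-against-all search in both directions are already available as nested dictionaries. The gene names, scores and function names are hypothetical, and the synteny (RHS) and domain-checking steps are not covered.

```python
def best_hits(scores):
    """Map each query gene to its highest-scoring subject gene.
    `scores` is {query: {subject: score}}; ties go to the first seen."""
    return {query: max(hits, key=hits.get) for query, hits in scores.items() if hits}


def best_reciprocal_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy similarity scores (made-up gene names and values).
human_vs_mouse = {
    "hA": {"mA": 950, "mB": 300},
    "hB": {"mB": 700},
    "hC": {"mA": 400},
}
mouse_vs_human = {
    "mA": {"hA": 940, "hC": 410},
    "mB": {"hB": 690, "hA": 310},
}
print(best_reciprocal_hits(human_vs_mouse, mouse_vs_human))
# [('hA', 'mA'), ('hB', 'mB')]
```

Here hC hits mA best, but mA's best hit is hA, so hC gets no BRH partner. Genes like this are the ones a synteny-based step would need to resolve.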
![Page 67: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/67.jpg)
Identification of orthologues
[Figure: human and mouse genes classified as BRH, RHS, matches to some other chromosome, orphan, or others. Source: http://www.nbn.ac.za/]
![Page 68: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/68.jpg)
Genes in mouse shared with other organisms
Nature 420:520, 2002
![Page 69: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/69.jpg)
Non-coding region comparison
• Regulatory region discovery
  – No accurate methods to identify regulatory sequences based on scanning the DNA sequence alone
    • Putative TF binding sites found everywhere
    • Low specificity
  – Assume that regulatory regions are conserved because they serve a vital function
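A quick calculation shows why scanning the sequence alone has such low specificity. The sketch assumes a random sequence with equal base frequencies, a single strand, an exact-match motif, and a genome of roughly 3 billion bases. The motif lengths and the function name are illustrative choices only.

```python
def expected_chance_matches(genome_length, motif_length):
    """Expected number of exact matches to one specific motif on one
    strand of a random sequence with equal base frequencies."""
    positions = genome_length - motif_length + 1
    return positions / 4 ** motif_length


GENOME_LENGTH = 3_000_000_000  # assumed approximate size of a mammalian genome

for k in (6, 8, 10):
    print(k, round(expected_chance_matches(GENOME_LENGTH, k)))
```

Under these assumptions a 6-base motif is expected to occur by chance several hundred thousand times, which is far more than any plausible number of functional sites. This is the motivation for restricting the search to conserved regions.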
![Page 70: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/70.jpg)
Rationale
• DNA sequences encoding and regulating the expression of essential proteins and RNAs will be conserved
• Consequently, the regulatory profiles of genes involved in similar processes among related species will be conserved
• Conversely, sequences that encode or control the expression of proteins or RNAs responsible for differences between species will be divergent
![Page 71: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/71.jpg)
What species?
• How distant should the compared species be?
  Transgenic mice tend to express the transgene in a similar manner to the source mammalian organism, including those for which the ortholog does not exist in mice!
• Rapidly evolving regulatory regions (e.g., β-globin) allow comparisons between closely related species
• However, some regulatory regions evolve very slowly (e.g., T-cell receptors), so comparisons with more distant species are required
![Page 72: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/72.jpg)
Finding regulatory regions
• Find conserved regions between species
• Can also find conserved regions between similarly expressed genes
  – Search for TF binding sites in these conserved regions
Pennacchio and Rubin, Nature Rev Genet 2001
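Finding conserved regions between species is often done by sliding a window along a pairwise alignment and keeping the windows above an identity threshold. The sketch below assumes two aligned sequences of equal length with "-" for gaps. The window size, the threshold, the function name and the toy alignment are arbitrary illustrative choices, not values from the cited review.

```python
def conserved_windows(aln_a, aln_b, window=100, threshold=0.7):
    """Return merged (start, end) intervals, in alignment coordinates,
    covered by windows whose percent identity is >= threshold."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have the same length")
    regions = []
    for start in range(0, len(aln_a) - window + 1):
        end = start + window
        # Count identical, non-gap columns in this window.
        matches = sum(
            1 for x, y in zip(aln_a[start:end], aln_b[start:end])
            if x == y and x != "-"
        )
        if matches / window >= threshold:
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend previous region
            else:
                regions.append((start, end))
    return regions


# Toy alignment: first 10 columns identical, last 10 completely different.
toy_a = "ACGTACGTAC" + "TTTTTTTTTT"
toy_b = "ACGTACGTAC" + "GGGGGGGGGG"
print(conserved_windows(toy_a, toy_b, window=5, threshold=0.9))
# [(0, 10)]
```

The intervals that come back are the regions in which one would then search for TF binding sites.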
![Page 73: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/73.jpg)
Human vs. mouse
• Similar gene content and linear organization
  – ~340 syntenic blocks (~150 more than discovered using cytological studies) spanning 90% of the genome
• Difference in genome size
  – Mouse genome is 14% smaller, probably due to a higher rate of deletion
• Sequence conservation
  – ~40% in alignments
  – ~5% under selection
    • ~1.5% protein coding
    • ~3.5% non-coding: untranslated regions, regulatory elements, non-protein-coding genes, and chromosomal structural elements
Nature 420:520, 2002
![Page 74: ICB::Chagall - Comparative genomics...Local aligner: BLASTZ •As in BLAST, find seeds –Seeds are determined by a 12of19 weighted-spaced seeds strategy where the positions for strict](https://reader033.vdocuments.site/reader033/viewer/2022060906/60a1501b44107132ea1ccbce/html5/thumbnails/74.jpg)
Human vs. mouse genomes
Nature 420:520, 2002