comparative genomics
DESCRIPTION
Comparative Genomics. Overview: orthologues and paralogues, protein families, genome-wide DNA alignments, syntenic blocks. Comparative genomics allows us to achieve a greater understanding of vertebrate evolution.
TRANSCRIPT
1 of 33
Comparative Genomics
2 of 33
Overview
• Orthologues and paralogues
• Protein families
• Genome-wide DNA alignments
• Syntenic blocks
3 of 33
Comparative Genomics
• Allows us to achieve a greater understanding of vertebrate evolution
• Tells us what is common and what is unique between different species at the genome level
• The function of human genes and other regions may be revealed by studying their counterparts in lower organisms
• Helps identify both coding and non-coding genes and regulatory elements
4 of 33
Species in Ensembl
[Figure: vertebrate phylogeny on a geological timescale, Cambrian to Tertiary (axis in MYBP, million years before present, from 570 to 65). Groups shown: non-vertebrates, agnathans, fishes (sharks, rays, teleosts, bichir/Polypterus, Latimeria, lungfishes), amphibians, reptiles (lizards, turtles, crocodiles), birds (paleognaths, passerines, other birds) and mammals (monotremes, marsupials, placentals).]
5 of 33
Orthologue / Paralogue Prediction Algorithm
(1) Load the longest translation of each gene from all species used in Ensembl.
(2) Run WU-BLASTP + Smith-Waterman of every gene against every other (both self and non-self species) in a genome-wide manner.
(3) Build a graph of gene relations based on Best Reciprocal Hits (BRH) and Blast Score Ratio (BSR) values.
(4) Extract the connected components (=single linkage clusters), each cluster representing a gene family.
(5) For each cluster, build a multiple alignment based on the protein sequences using MUSCLE.
(6) For each aligned cluster, build a phylogenetic tree using PHYML. An unrooted tree is obtained at this stage.
(7) Reconcile each gene tree with the species tree, using RAP, to call duplication events on internal nodes and to root the tree.
(8) From each gene tree, infer gene pairwise relations of orthology and paralogy types.
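Steps (3) and (4) can be sketched as follows. This is only a minimal illustration, not the Ensembl pipeline code: the score dictionary, the 'species:gene' identifier format, the 0.33 BSR cutoff, and the definition of BSR as the pair score divided by the larger of the two self-scores are all assumptions made for the example.

```python
def best_hits(scores):
    """For each query gene, find its best-scoring hit in every other species.

    scores: {(query, target): score}; ids look like 'species:gene' (assumed format).
    """
    best = {}
    for (q, t), s in scores.items():
        q_sp, t_sp = q.split(":")[0], t.split(":")[0]
        if q_sp == t_sp:
            continue  # best reciprocal hits are defined between species
        key = (q, t_sp)
        if key not in best or s > best[key][1]:
            best[key] = (t, s)
    return best


def build_graph(scores, bsr_cutoff=0.33):
    """Link genes that are best reciprocal hits or pass the (assumed) BSR cutoff."""
    graph = {g: set() for pair in scores for g in pair}
    best = best_hits(scores)
    for (q, _t_sp), (t, _s) in best.items():
        back = best.get((t, q.split(":")[0]))
        if back is not None and back[0] == q:  # reciprocal best hit
            graph[q].add(t)
            graph[t].add(q)
    for (q, t), s in scores.items():
        if q == t:
            continue
        self_max = max(scores.get((q, q), 0), scores.get((t, t), 0))
        if self_max and s / self_max >= bsr_cutoff:  # Blast Score Ratio edge
            graph[q].add(t)
            graph[t].add(q)
    return graph


def connected_components(graph):
    """Single-linkage clusters: each connected component is one gene family."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(graph[node] - component)
        seen |= component
        clusters.append(component)
    return clusters


# Made-up scores: two families, one with a within-species duplicate (hs:A2).
scores = {
    ("hs:A", "hs:A"): 100, ("mm:a", "mm:a"): 100,
    ("hs:A", "mm:a"): 90, ("mm:a", "hs:A"): 90,
    ("hs:B", "hs:B"): 100, ("mm:b", "mm:b"): 100,
    ("hs:B", "mm:b"): 80, ("mm:b", "hs:B"): 80,
    ("hs:A", "mm:b"): 10, ("mm:b", "hs:A"): 10,
    ("hs:A2", "hs:A2"): 100, ("hs:A", "hs:A2"): 60,
}
families = sorted(sorted(c) for c in connected_components(build_graph(scores)))
```

In this toy input the weak hs:A / mm:b hit (ratio 0.1) creates no edge, so the two families stay separate, while hs:A2 joins the first family through its BSR edge to hs:A.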
6 of 33
Homologue Relationships
• Orthologues: any pairwise gene relation where the ancestor node is a speciation event
• Paralogues: any pairwise gene relation where the ancestor node is a duplication event
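The two definitions translate directly into a walk over a reconciled gene tree: the label of the node where two genes last share an ancestor decides the relation. The sketch below assumes a hypothetical Node class and a made-up four-gene tree; it is not the data model used by Ensembl.

```python
from itertools import combinations


class Node:
    """Hypothetical tree node: internal nodes carry an event, leaves carry a gene."""

    def __init__(self, event=None, children=(), gene=None):
        self.event = event            # "speciation" or "duplication"
        self.children = list(children)
        self.gene = gene              # set only on leaves


def leaves(node):
    """Return the gene names below a node."""
    if not node.children:
        return [node.gene]
    return [g for child in node.children for g in leaves(child)]


def pairwise_relations(node, relations=None):
    """Label every gene pair by the event at its last common ancestor node."""
    if relations is None:
        relations = {}
    if not node.children:
        return relations
    kind = "orthologue" if node.event == "speciation" else "paralogue"
    groups = [leaves(child) for child in node.children]
    # Genes from different child subtrees have this node as their ancestor node.
    for left, right in combinations(groups, 2):
        for a in left:
            for b in right:
                relations[frozenset((a, b))] = kind
    for child in node.children:
        pairwise_relations(child, relations)
    return relations


# Made-up tree: an ancient duplication followed by a human/mouse speciation
# in each of the two copies.
tree = Node("duplication", [
    Node("speciation", [Node(gene="human_A"), Node(gene="mouse_A")]),
    Node("speciation", [Node(gene="human_B"), Node(gene="mouse_B")]),
])
relations = pairwise_relations(tree)
```

Here human_A and mouse_A come out as orthologues, while human_A and mouse_B come out as paralogues because their ancestor node is the duplication.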
7 of 33
Orthologue and Paralogue Types
8 of 33
Orthologue and Paralogue Types
9 of 33
GeneView
10 of 33
GeneView
11 of 33
GeneTreeView
GeneTree
MUSCLE protein alignment
12 of 33
GeneTreeView
Duplication node (red)
Speciation node (blue)
13 of 33
Protein Dataset
More than 1,500,000 proteins clustered:
• All Ensembl protein predictions from all supported species: ~670,000 protein predictions
• All metazoan (animal) proteins in UniProt: ~80,000 UniProt/Swiss-Prot and ~830,000 UniProt/TrEMBL
14 of 33
Clustering Strategy
• BLASTP all-versus-all comparison
• Markov clustering
• For each cluster:
  • Calculation of multiple sequence alignments with ClustalW
  • Assignment of a consensus description
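The Markov clustering step can be illustrated with a small pure-Python sketch of the general MCL idea (alternating expansion and inflation of a column-stochastic matrix). It is a toy version under assumed parameters (inflation 2.0, 50 iterations, added self-loops), not the MCL program or the settings used for the Ensembl families.

```python
def normalize_columns(matrix):
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        for i in range(n):
            matrix[i][j] /= total
    return matrix


def mcl(similarity, inflation=2.0, iterations=50, eps=1e-6):
    """Toy Markov clustering on a square, symmetric similarity matrix."""
    n = len(similarity)
    # Add self-loops so that every column has a non-zero sum.
    m = [[float(similarity[i][j]) + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: square the matrix (flow along two-step random walks).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: raise entries to a power, then renormalize the columns.
        m = [[value ** inflation for value in row] for row in m]
        normalize_columns(m)
    # Rows with flow left on the diagonal are attractors; the non-zero
    # entries of an attractor row are the members of its cluster.
    clusters = set()
    for i in range(n):
        if m[i][i] > eps:
            clusters.add(frozenset(j for j in range(n) if m[i][j] > eps))
    return sorted(sorted(cluster) for cluster in clusters)


# Made-up input: proteins 0-2 all hit each other, as do proteins 3-5,
# with no BLASTP hits between the two groups.
similarity = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
protein_clusters = mcl(similarity)
```

In practice the matrix entries would be derived from the all-versus-all BLASTP scores, and a higher inflation value tends to give smaller, tighter clusters.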
15 of 33
Link to FamilyView
GeneView / TransView / ProtView
16 of 33
FamilyView
Ensembl family members within human
UniProt family members
Ensembl family members in other species
Consensus annotation
JalView multiple alignments
17 of 33
JalView
18 of 33
Whole Genome Alignments
• Functional sequences evolve more slowly than non-functional sequences, therefore sequences that remain conserved may perform a biological function.
• Comparing genomic sequences from species at different evolutionary distances allows us to identify:
  • Coding genes
  • Non-coding genes
  • Non-coding regulatory sequences
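The idea that conserved stretches point to function can be made concrete with a sliding-window identity scan over two already aligned sequences. This is a simplified stand-in chosen for illustration; the window size, the threshold and the toy sequences are assumptions, and it is not how the conservation scores shown in Ensembl are calculated.

```python
def window_identity(seq_a, seq_b, window=5):
    """Fraction of identical, non-gap columns in each window of an alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    scores = []
    for start in range(len(seq_a) - window + 1):
        columns = zip(seq_a[start:start + window], seq_b[start:start + window])
        matches = sum(1 for a, b in columns if a == b and a != "-")
        scores.append(matches / window)
    return scores


def conserved_windows(seq_a, seq_b, window=5, threshold=1.0):
    """Start positions of windows whose identity reaches the threshold."""
    scores = window_identity(seq_a, seq_b, window)
    return [start for start, score in enumerate(scores) if score >= threshold]


# Made-up alignment: the first 10 columns are identical, the last 10 all differ.
species_1 = "ACGTACGTAC" + "AAAAAAAAAA"
species_2 = "ACGTACGTAC" + "CCCCCCCCCC"
identity_profile = window_identity(species_1, species_2)
conserved = conserved_windows(species_1, species_2)
```

Only the windows that lie entirely inside the identical first half are reported, which is the kind of signal that would flag a candidate functional region.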
19 of 33
Selection of Species for DNA comparisons
Human vs.    Size (Gbp)   Time since divergence   Sequence conservation (in coding regions)   Aids identification of…
Chimpanzee   3.0          ~5 MYA                  >99%                                        Recently changed sequences and genomic rearrangements
Mouse        2.5          ~65 MYA                 ~80%                                        Both coding and non-coding sequences
Opossum      4.2          ~150 MYA                ~70-75%                                     Both coding and non-coding sequences
Pufferfish   0.4          ~450 MYA                ~65%                                        Primarily coding sequences
20 of 33
Alignment Algorithm
• Should find all highly similar regions between two sequences
• Should allow for segments without similarity, rearrangements etc.
• Issues:
  • Heavy process
  • Scalability, as more and more genomes are sequenced
  • Time constraint
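The first two requirements describe local alignment: report the similar region and ignore the dissimilar flanks. A minimal Smith-Waterman sketch shows the principle. The scoring values (match 2, mismatch -1, linear gap -2) and the sequences are assumptions, and the quadratic cost of this exact method is precisely the scalability issue that makes faster heuristics necessary for whole genomes.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            # A cell never drops below 0, so dissimilar flanks are ignored.
            h[i][j] = max(0, h[i - 1][j - 1] + sub,
                          h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score returns to 0.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


# Made-up sequences: a shared core (ACGTACGT) with unrelated flanks.
score, core_a, core_b = smith_waterman("TTTTACGTACGTCCCC", "GGGGACGTACGTAAAA")
```

The unrelated flanks contribute nothing, and only the shared core is returned.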
21 of 33
BLASTZ-net, tBLAT and PECAN
• BLASTZ-net (comparison at the nucleotide level) is used for species that are evolutionarily close, e.g. human - mouse
• Translated BLAT (comparison at the amino acid level) is used for evolutionarily more distant species, e.g. human - zebrafish
• PECAN is used for multispecies alignments:
  • 7 eutherian mammals
  • 10 amniote vertebrates
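Why the amino acid level helps for distant species can be shown with a toy example: synonymous nucleotide changes disappear after translation. The sketch uses the standard genetic code and a single reading frame (a translated search would consider all six frames); the two sequences are invented for the illustration.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate frame 1 of a DNA string; unknown codons become 'X'."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Made-up coding sequences that differ only by synonymous substitutions.
seq_1 = "ATGGCTCTTAGC"
seq_2 = "ATGGCACTGTCA"
nucleotide_identity = identity(seq_1, seq_2)
protein_identity = identity(translate(seq_1), translate(seq_2))
```

The two sequences share only 7 of 12 nucleotides but encode the same four residues (MALS), so a comparison at the amino acid level still finds the match.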
22 of 33
BLASTZ-net, tBLAT and PECAN
The species combinations for which whole genome alignments have been done are listed on the Comparative Genomics page (Help & Documentation > Genomic Data > Comparative Genomics).
23 of 33
ContigView
Constrained elements
Conservation score
Blastz mouse
tBLAT zebrafish
PECAN alignments
24 of 33
MultiContigView
Conserved sequences (human)
Conserved sequences (dog)
25 of 33
AlignSliceView
Human
Rat
Dog
Mouse
26 of 33
MultiContigView vs. AlignSliceView
27 of 33
AlignView
28 of 33
GeneSeqalignView
29 of 33
GeneSeqalignView
30 of 33
Syntenic Blocks
• Genome alignments are refined into larger syntenic regions
• Alignments are clustered together when the relative distance between them is less than 100 kb and order and orientation are consistent
• Clusters shorter than 100 kb are discarded
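The clustering rule above can be sketched as a single greedy pass over alignments sorted along the reference genome. The Alignment record, the handling of reverse-strand hits, the rejection of overlapping or out-of-order alignments, and measuring block length on the reference are all assumptions made for this illustration; only the two 100 kb thresholds come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical pairwise alignment record (coordinates in base pairs)."""
    ref_chr: str
    ref_start: int
    ref_end: int
    tgt_chr: str
    tgt_start: int
    tgt_end: int
    strand: int  # 1 = same orientation, -1 = inverted


MAX_GAP = 100_000    # join alignments closer than this
MIN_BLOCK = 100_000  # discard clusters shorter than this


def can_extend(prev, nxt, max_gap=MAX_GAP):
    """True if nxt continues prev: same chromosomes, orientation and order."""
    same = (prev.ref_chr, prev.tgt_chr, prev.strand) == \
           (nxt.ref_chr, nxt.tgt_chr, nxt.strand)
    if not same:
        return False
    ref_gap = nxt.ref_start - prev.ref_end
    if prev.strand == 1:
        tgt_gap = nxt.tgt_start - prev.tgt_end
    else:  # on the reverse strand the target coordinates run backwards
        tgt_gap = prev.tgt_start - nxt.tgt_end
    return 0 <= ref_gap < max_gap and 0 <= tgt_gap < max_gap


def build_blocks(alignments, max_gap=MAX_GAP, min_block=MIN_BLOCK):
    """Greedily chain alignments into blocks, then drop the short ones."""
    clusters, current = [], []
    for aln in sorted(alignments, key=lambda a: (a.ref_chr, a.ref_start)):
        if current and can_extend(current[-1], aln, max_gap):
            current.append(aln)
        else:
            if current:
                clusters.append(current)
            current = [aln]
    if current:
        clusters.append(current)
    blocks = []
    for cluster in clusters:
        start, end = cluster[0].ref_start, cluster[-1].ref_end
        if end - start >= min_block:
            blocks.append((cluster[0].ref_chr, start, end,
                           cluster[0].tgt_chr, cluster[0].strand, len(cluster)))
    return blocks


# Made-up alignments: the first two chain into one 130 kb block; the other
# two are isolated 20 kb hits and are discarded.
alignments = [
    Alignment("1", 0, 40_000, "4", 1_000_000, 1_040_000, 1),
    Alignment("1", 60_000, 130_000, "4", 1_050_000, 1_120_000, 1),
    Alignment("1", 400_000, 420_000, "4", 5_000_000, 5_020_000, 1),
    Alignment("1", 430_000, 450_000, "7", 100, 20_100, 1),
]
synteny_blocks = build_blocks(alignments)
```

A production implementation would also have to deal with overlapping and nested alignments, which this greedy pass simply treats as breaks.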
31 of 33
Human chromosome
Mouse chromosomes
Mouse chromosomes
SyntenyView
Orthologues
32 of 33
CytoView
Syntenic blocks
Orientation Chromosome
33 of 33
Q & A
Questions and Answers