Evolution (1st lecture)

Post on 20-Dec-2015


Page 1: Evolution (1st lecture). Finding Elements in DNA Conserved by Evolution. Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes

Evolution (1st lecture)


Finding Elements in DNA Conserved by Evolution

Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes

Gregory Cooper et al.

Identification and Characterization of Multi-Species Conserved Sequences

Elliott Margulies et al.

Presented by Penka Markova


Finding Elements in DNA Conserved by Evolution

Premise: highly conserved sequences are more likely to reflect regions under active selection due to the presence of an element(s) that confers biological function

Involves comparative analysis; requires multiple sequence alignments


Outline

Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes

• Overview
• Data
• Global Patterns of Nucleotide Substitution
• Rates of Transitions and Transversions in the Rodents
• Rates of Neutral Point Substitution
• Rates of Microinsertion and Microdeletion
• Global Identification of Constrained Elements
• Regional Variability of Evolutionary Parameters

Identification and Characterization of Multi-Species Conserved Sequences

• Overview
• Data
• Binomial, Parsimony and Intersecting Methods
• Stats
• Characteristics of the detected MCSs, conclusions


1st Paper

Characterization of Evolutionary Rates and Constraints in Three Mammalian Genomes

Gregory Cooper, Michael Brudno, Eric Stone, Inna Dubchak, Serafim Batzoglou, and Arend Sidow


Overview

Goal: Comparative analysis of the rat, mouse, and human genomes
• facilitate insights into basic mechanisms of nucleotide evolution
• facilitate the discovery of elements in the genome that play a functional role in human biology (by leveraging the fact that functional DNA is constrained because of purifying selection)

Summary: Provides analysis of rates and patterns of microevolutionary phenomena that have shaped the human, mouse, and rat genomes since their last common ancestor

• Evidence for a shift in the mutational spectrum between the mouse and rat lineages (increase of CG content in the rat genome)

• Support for the idea that rates of evolution are influenced by local genomic or cell-biological context

• No correlation between rates of point substitution and rates of microindels (the influences that affect these processes are distinct)

• Identified the regions in the human genome that are evolving slowly (likely to include functional elements important to human biology)


Data

3 complete mammalian genome sequences

Human, rat, mouse (new: the rat genome)

Multi-aligned with MLAGAN

2 datasets

1. Dataset1: all sites that are confidently aligned among all 3 sequences (most included positions originated prior to the last common ancestor)

2. Dataset2, “rodent-specific neutral sites”: only sites present in the rodents (heavily enriched for neutrally evolving sites)


Global Patterns of Nucleotide Substitution

Global shift in the mutation spectra between mouse and rat

• Rat has 0.35 percentage points more CG than mouse (41.61% vs 41.26%), a statistically highly significant difference

• CpG dinucleotides make up 0.92% in the mouse and 1.06% in the rat (the rest show smaller differences)

Consistent bias toward elevated CG in the rat genome

• does not appear to be confined to particular types of transitions or transversions

• based on quantitative analysis of Dataset1 (117 million positions with a single difference in either rodent)

The causative factors for the shift, selective or otherwise, remain to be elucidated


Rates of Transitions and Transversions in the Rodents

Transitions are approximately fourfold more likely than any transversion

Useful for molecular evolutionary studies (most methods of phylogenetic inference model point substitutions on the basis of stationary Markov processes and require user-specified substitution parameters)
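To make the transition/transversion distinction concrete, here is a minimal Python sketch that classifies the differences between two aligned sequences. The function names and the toy sequences are illustrative assumptions; the slide does not describe how the authors estimated their rates, and raw counts like these are not rate estimates.

```python
# Transitions swap bases within a chemical class (A<->G or C<->T);
# transversions swap a purine for a pyrimidine or vice versa.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_substitution(a, b):
    """Return 'transition', 'transversion', or None for a pair of aligned bases."""
    a, b = a.upper(), b.upper()
    # Identical bases, gaps, and ambiguity codes are not counted.
    if a == b or a not in "ACGT" or b not in "ACGT":
        return None
    same_class = ({a, b} <= PURINES) or ({a, b} <= PYRIMIDINES)
    return "transition" if same_class else "transversion"

def ts_tv_counts(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq1, seq2):
        kind = classify_substitution(a, b)
        if kind:
            counts[kind] += 1
    return counts

# Toy alignment (hypothetical data): one A/G transition, one G/T transversion.
print(ts_tv_counts("ACGT-A", "GCTTAA"))
```

Each base has one possible transition and two possible transversions, which is why the slide compares a transition against any single transversion rather than against all transversions combined.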


Rates of Neutral Point Substitution

Point substitution events in rodent-specific neutral sites (Dataset2)

Neutral rate for the evolutionary tree relating the 3 species

• Relative branch length of the tree: based on Dataset1 positions without gap in any sequence

• Normalized (rat branch is 1 unit length)


Rates of Microinsertion and Microdeletion

Definition: lesions no larger than 10 bp

Dataset1
• Gaps of size 11 bp or less

Rapid decline in the relative numbers of indel events as size increases


Global Identification of Constrained Elements

Annotated all the regions in the human genome that are evolving, on average, significantly slower than the neutral rate

• Sequences that function in organismal biology tend to be under purifying selection & thus manifest themselves as regions evolving slowly

• 210,923 constrained elements (>51 bp)


Regional Variability of Evolutionary Parameters

• Substantially stable microevolutionary pressures (modest-to-strong correlations between rates of microdeletion [A, B])

• Local evolutionary pressures appear to influence point substitutions and microindels differently (variation in rate of microinsertions/microdeletion does not correlate well with point substitution)

• Local genomic context influences the rate of point substitution regardless of the type of site (correlation between the neutral rate and the rate of substitution [B])

• CG content correlates with rates of point substitution

Sliding-window analysis along rat chromosome 1, window width of 2 Mb
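As a rough illustration of a sliding-window analysis, the Python sketch below computes CG content in fixed-width windows along a sequence. The function name, the step parameter, and the toy sequence are assumptions for illustration; the slide states only the 2 Mb window width, not how the windows were stepped or which statistics were computed per window.

```python
def sliding_gc(seq, width, step):
    """Return (start, gc_fraction) for each full window of `width` bases."""
    seq = seq.upper()
    out = []
    # Only windows that fit entirely inside the sequence are reported.
    for start in range(0, len(seq) - width + 1, step):
        window = seq[start:start + width]
        gc = sum(base in "GC" for base in window)
        out.append((start, gc / width))
    return out

# Toy example with a 4-base window; the analysis on the slide used 2 Mb windows.
for start, frac in sliding_gc("GGCCAATT", width=4, step=2):
    print(start, frac)
```

The same loop structure would apply to any per-window quantity (for example a substitution or microindel rate), after which the per-window values can be correlated with one another.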


2nd Paper

Identification and Characterization of Multi-Species Conserved Sequences

Elliott Margulies, Mathieu Blanchette, NISC Comparative Sequencing Program, David Haussler, Eric Green


Overview

Goals

Identify highly conserved DNA regions, in particular “Multi-species Conserved Sequences” (MCSs), in a robust fashion

• useful in comparative sequence analysis, aiming to elucidate genome function

Evaluate the relative contribution of different species’ sequences to identifying genomic regions of interest

• one of the criteria considered in choosing additional species for whole-genome sequencing

Summary of results

Proposes 2 strategies for MCS identification (binomial, parsimony)

• detect virtually all known actively conserved sequences (coding seq), but very little neutrally evolving sequence (ancestral repeats)

Analysis of the features of detected MCSs

Currently available genome sequences are insufficient for comprehensive identification of MCSs in the human genome


Data

Sequences of human and 11 non-human vertebrates

2 primates (chimpanzee, baboon), 2 carnivores (cat, dog), 2 artiodactyls (cow, pig), 2 rodents (mouse, rat), 1 bird (chicken), 2 fish (fugu, tetraodon)

Orthologous to a 1.8-Mb region on human chromosome 7q31

Multi-aligned: human-referenced pairwise alignments (RepeatMasker, blastz)

Systematically annotated for known coding exons, UTRs, and ARs (ancestral repeats)


Algorithms: Binomial, Parsimony, Intersecting

Take into account

• Phylogenetic diversity of the aligned species’ sequences
• The varying neutral substitution rate
• The characteristics of the available genomic multi-sequence alignment, esp. sparse alignments

Requirements

• Sufficiently large branch length of the phylogenetic tree (non-functional regions should be sufficiently diverged)
• Greater total branch length (compared to the required length for identification of larger functional elements)
• Good multi-alignment is crucial


Algorithms: Binomial

Binomial-Based Method for MCS Detection

Calculates the conservation score based on the probability of detecting the observed amount of conservation between the human and each other species’ sequence, assuming neutral substitution rate

Neutral substitution rate is calculated from fourfold degenerate positions (the third base of codons for which any base will encode the same amino acid)

Normalizes for phylogenetic biases by averaging

Final conservation score is calculated from overlapping 25-base windows
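The notion of a fourfold degenerate position can be made concrete with a short Python sketch that finds, in the standard genetic code, every two-base codon prefix for which all four third bases encode the same amino acid. The function names are illustrative assumptions, and the sketch only identifies such codons; it does not reproduce how the authors estimated the neutral rate from them.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
# (first base varies slowest, third base fastest); '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def is_fourfold_degenerate(codon):
    """True if any base at the third position encodes the same amino acid."""
    prefix = codon[:2].upper()
    return len({CODON_TABLE[prefix + b] for b in BASES}) == 1

def fourfold_prefixes():
    """All two-base prefixes whose third position is fourfold degenerate."""
    return sorted(
        a + b for a, b in product(BASES, repeat=2)
        if is_fourfold_degenerate(a + b + "T")
    )

print(fourfold_prefixes())
```

Substitutions at the third position of these codons never change the encoded protein, which is the reason such sites are used as a proxy for neutrally evolving sequence.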


Algorithms: Binomial

N: number of aligned bases in the 25-base window of the human-species j alignment

K: number of perfect matches

p_j: neutral substitution probability, i.e. the probability that a given base in the human sequence has been conserved in species j, assuming the neutral substitution rate between human and species j

K/N: baseline conservation level

C(j): cumulative binomial probability of observing at least K matches in N bases

Algorithm

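1) within all windows of 25 bases, for each species j:

\[ C(j) = \sum_{k=K}^{N} \binom{N}{k} \, p_j^{k} \, (1 - p_j)^{N-k} \]

\[ s_j = \begin{cases} 0 & \text{if } N = 0 \\ -\log(C(j)) & \text{if } K > N p_j \\ \log(1 - C(j)) & \text{otherwise} \end{cases} \]

CGGCTAAG…ACTGACTGGGT
CGACTGAG…ACTGACTGGGT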
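Step 1 can be sketched in Python as follows, using the definitions of N, K, p_j and C(j) given above. This is a minimal sketch under stated assumptions: the natural logarithm is used (the slide does not state the base), the boundary between the two non-zero cases is taken as K > N·p_j, and a tiny floor (EPS, a hypothetical constant) guards against log(0) when C(j) rounds to 1.

```python
from math import comb, log

EPS = 1e-300  # hypothetical floor to avoid log(0); not part of the source method

def tail_probability(n, k, p):
    """C(j): probability of at least k matches in n aligned bases, match probability p."""
    return sum(comb(n, m) * p**m * (1 - p)**(n - m) for m in range(k, n + 1))

def species_score(n, k, p):
    """Per-species window score s_j: positive if more conserved than expected."""
    if n == 0:
        return 0.0  # nothing aligned in this window
    c = tail_probability(n, k, p)
    if k > n * p:
        # More matches than expected under neutrality: small C(j), large positive score.
        return -log(max(c, EPS))
    # Fewer matches than expected: C(j) near 1, negative score.
    return log(max(1 - c, EPS))

# Hypothetical example: a fully aligned 25-base window, neutral match probability 0.5.
print(species_score(25, 25, 0.5))  # every base matches
print(species_score(25, 5, 0.5))   # far fewer matches than expected
```

The sign convention follows the slide's conclusion that scores above zero mark regions more conserved than expected and scores below zero mark regions less conserved than expected.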


Algorithms: Binomial

Algorithm

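2) “phylogenetically average” the individual species’ scores s_j to obtain the final conservation score for the window

\[ S_{\mathrm{Bin}}(i) = \frac{1}{5}\Big( 0.5\,(s_{\mathrm{chimpanzee}} + s_{\mathrm{baboon}}) + 0.5\,(s_{\mathrm{dog}} + s_{\mathrm{cat}}) + 0.5\,(s_{\mathrm{cow}} + s_{\mathrm{pig}}) + 0.5\,s_{\mathrm{chicken}} + 0.5\,(s_{\mathrm{fugu}} + s_{\mathrm{tetraodon}}) \Big) \]

3) the final score assigned to position i is

\[ \mathrm{BinomialScore}(i) = \max_{j = i-12, \ldots, i+12} S_{\mathrm{Bin}}(j) \]

4) For a given threshold t, position i is predicted to be part of an MCS if

\[ \mathrm{BinomialScore}(i) \geq t \]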
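Steps 3 and 4 can be sketched in Python as below. The sketch assumes the per-window scores S_Bin(j) are already available as a list indexed by window centre, truncates windows at the ends of the sequence, and treats the threshold comparison as "at least t"; the function names are illustrative and none of these choices are confirmed by the slide.

```python
def position_scores(window_scores, half_width=12):
    """Score of position i = best score among the windows centred within half_width of i."""
    n = len(window_scores)
    out = []
    for i in range(n):
        lo = max(0, i - half_width)          # truncate at the sequence start
        hi = min(n, i + half_width + 1)      # truncate at the sequence end
        out.append(max(window_scores[lo:hi]))
    return out

def predict_mcs(window_scores, t, half_width=12):
    """Boolean mask: True where the position is predicted to lie in an MCS."""
    return [s >= t for s in position_scores(window_scores, half_width)]

# Hypothetical scores: a single high-scoring window in the middle of 61 positions.
scores = [0.0] * 30 + [3.0] + [0.0] * 30
mask = predict_mcs(scores, t=1.0)
print(sum(mask))  # number of positions flagged by that one window
```

Because each position inherits the best score of every 25-base window covering it, a single window above the threshold flags all 25 of its positions, which is consistent with the minimum MCS length of 25 bases stated for this method.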


Algorithms: Binomial

Binomial-Based Method: Conclusion

Conservation scores below zero represent alignable regions that are less conserved than expected under neutral evolution; scores above zero represent regions that are more conserved than expected

Minimum MCS length is 25 bases

Sequence conservation detected with more diverged species (with higher neutral substitution rates) is weighted more heavily

Measures conservation with respect to one reference sequence only


Algorithms: Parsimony

Parsimony-Based Method

Amount of conservation within each column of the alignment is measured using a phylogenetic parsimony score P(i)

• P(i) reflects the minimal number of substitutions needed along the branches of an established phylogenetic tree to account for the observed bases at the leaves of the tree

Based on P(i) calculates a score under a continuous-time Markov model of neutral evolution, measuring the “surprise” of observing P(i) or smaller parsimony score

Requires a phylogenetic tree, a model of neutral substitution


Algorithms: Parsimony

[Figure: phylogenetic tree with leaves Human, Baboon, Mouse, Rat]

Algorithm

1) Calculate the parsimony score P(i) for the i-th position

P(i) = the minimum number of substitutions, performed along the branches of the tree, needed to explain the bases observed at the leaves of the tree

• notice P(i) is a tight lower bound on the number of substitutions having actually occurred at position i during evolution
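A parsimony score of this kind can be computed with Fitch's small-parsimony algorithm, sketched below in Python for a binary tree. This is an assumption on two counts: the slide does not say which algorithm the authors used, and the topology ((Human, Baboon), (Mouse, Rat)) is an illustrative guess based on the leaf labels shown on the slide.

```python
def fitch(tree, bases):
    """Return (candidate_base_set, substitutions) for a subtree.

    `tree` is either a leaf name (str) or a pair (left_subtree, right_subtree);
    `bases` maps each leaf name to the base observed in one alignment column.
    """
    if isinstance(tree, str):
        return {bases[tree]}, 0
    left, right = tree
    lset, lcost = fitch(left, bases)
    rset, rcost = fitch(right, bases)
    common = lset & rset
    if common:
        # The children can agree on a base: no extra substitution needed here.
        return common, lcost + rcost
    # The children disagree: one substitution is needed on a branch below this node.
    return lset | rset, lcost + rcost + 1

def parsimony_score(tree, bases):
    """P(i): minimum number of substitutions explaining the bases at the leaves."""
    return fitch(tree, bases)[1]

TREE = (("Human", "Baboon"), ("Mouse", "Rat"))  # assumed topology
column = {"Human": "A", "Baboon": "A", "Mouse": "G", "Rat": "G"}
print(parsimony_score(TREE, column))
```

A fully conserved column scores 0, while a column with four different bases on this tree scores 3, the maximum for four leaves.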

2.0) Define a model of neutral evolution based on

• the phylogenetic tree T relating the species under study; ℓ(e) denotes the length of branch e, r the root of the tree
• a neutral substitution rate matrix Q
• the transition probability matrix along a branch (u,v): M(u,v) = e^{ℓ(u,v) Q}
• the background base distribution π

This model generates a set of random but related bases at the leaves of the tree by simulating evolution.
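The generative model can be illustrated with a small Python simulation. As a loudly stated assumption, the sketch swaps in the Jukes-Cantor model, for which the matrix exponential of ℓQ has a closed form and the background distribution π is uniform; the slide does not say which rate matrix Q the authors used, and the tree, branch lengths, and function names here are hypothetical.

```python
import math
import random

BASES = "ACGT"

def jc_transition_row(base, length):
    """Row of M(u,v) for `base` under Jukes-Cantor.

    `length` is the branch length in expected substitutions per site.
    """
    e = math.exp(-4.0 * length / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [same if b == base else diff for b in BASES]

def simulate(tree, base, rng, leaves):
    """Evolve `base` down the tree; record the base reached at each leaf.

    `tree` is a leaf name (str) or a list of (subtree, branch_length) pairs.
    """
    if isinstance(tree, str):
        leaves[tree] = base
        return leaves
    for child, length in tree:
        child_base = rng.choices(BASES, weights=jc_transition_row(base, length))[0]
        simulate(child, child_base, rng, leaves)
    return leaves

def simulate_column(tree, rng):
    """Draw the root base from the uniform background, then evolve to the leaves."""
    root_base = rng.choice(BASES)
    return simulate(tree, root_base, rng, {})

# Hypothetical topology and branch lengths, for illustration only.
TREE = [([("Human", 0.05), ("Baboon", 0.05)], 0.2),
        ([("Mouse", 0.15), ("Rat", 0.15)], 0.3)]
print(simulate_column(TREE, random.Random(0)))
```

Repeating the simulation many times and scoring each simulated column gives an empirical picture of how parsimony scores are distributed under neutral evolution, which is the distribution the method's probability refers to.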


Algorithms: Parsimony

2) Define the score assigned to position i based on the 25-base window as

\[ S_{\mathrm{Pars}}(i) = -\sum_{j = i-12}^{i+12} \log\big(\Pr[Z(r) \leq P(j)]\big) \]

• Z(r) is the random variable describing the parsimony score of the bases of the subtree rooted at r

• Pr[Z(r) ≤ P(j)] is the probability that the parsimony score of the bases at the leaves of T generated by the model defined above is at most P(j)

• calculated using a dynamic programming algorithm proceeding from the leaves of T to its root

• if this probability is small, the position is unlikely to have been generated under neutral evolution
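Given the per-column probabilities, the window score can be sketched in Python as a negative sum of logs over the 25 columns centred on each position. This assumes the probabilities have already been computed (the dynamic programme itself is not shown), that the window score is the negative log-sum so that surprisingly conserved windows score high, that the natural logarithm is used, and that windows are truncated at the sequence ends; the function name is illustrative.

```python
from math import log

def parsimony_window_scores(tail_probs, half_width=12):
    """Window score for each position from per-column tail probabilities.

    `tail_probs[j]` is the probability, under the neutral model, of a parsimony
    score at most as large as the one observed at column j (must be > 0).
    """
    n = len(tail_probs)
    scores = []
    for i in range(n):
        lo, hi = max(0, i - half_width), min(n, i + half_width + 1)
        # Small probabilities (surprisingly conserved columns) add large positive terms.
        scores.append(-sum(log(p) for p in tail_probs[lo:hi]))
    return scores

# Hypothetical probabilities for three columns, with a window of 3.
print(parsimony_window_scores([1.0, 0.5, 1.0], half_width=1))
```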


Algorithms: Parsimony

3) the final score assigned to position i is

\[ \mathrm{ParsimonyScore}(i) = \max_{j = i-12, \ldots, i+12} S_{\mathrm{Pars}}(j) \]

4) For a given threshold t, position i is predicted to be part of an MCS if

\[ \mathrm{ParsimonyScore}(i) \geq t \]

Parsimony-Based Method: Conclusion

Requires a phylogenetic tree, a model of neutral substitution

Produces higher scores based on conservation across large phylogenetic distance


Algorithms: Binomial, Parsimony, Intersecting

Intersecting Method

Intersects the results from the Binomial and Parsimony methods

MCSs can be shorter than 25 bp

Observations

All three methods are biased towards the identification of sequences that are conserved in most species (as opposed to only a subset of species)

Conservation score threshold used was selected such that 5% of the human sequence from the analyzed region falls within an MCS (5% of the human genome is considered to be under active selection)
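The threshold choice and the intersecting step can be sketched in Python as follows. The function names and toy data are illustrative assumptions; in particular, picking the threshold as a simple rank cut-off is one plausible reading of "selected such that 5% of the human sequence falls within an MCS", not a procedure confirmed by the slide.

```python
def threshold_for_fraction(scores, fraction=0.05):
    """Pick t so that roughly `fraction` of positions have score >= t.

    With tied scores, slightly more than `fraction` of positions may pass.
    """
    ranked = sorted(scores, reverse=True)
    k = max(1, round(fraction * len(ranked)))
    return ranked[k - 1]

def intersect(mask_a, mask_b):
    """Positions predicted by both methods (e.g. binomial and parsimony)."""
    return [a and b for a, b in zip(mask_a, mask_b)]

# Toy per-position scores (hypothetical): 100 positions with distinct scores.
scores = list(range(100))
t = threshold_for_fraction(scores, 0.05)
print(t, sum(s >= t for s in scores))

# Two overlapping predictions; the intersection keeps only the shared positions.
print(intersect([True, True, False], [True, False, False]))
```

Intersecting two masks can trim the ends of a predicted region, which is why MCSs from the intersecting method can be shorter than the 25-base minimum of the individual methods.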


Concordance of the binomial- and parsimony- based methods for MCS detection


Results: discrimination of different types of sequence using conservation scores


Results

General features of detected MCSs

• detected virtually all known actively conserved sequences (coding seq), but very little neutrally evolving sequence (ancestral repeats)

• the majority of sequences conserved across multiple vertebrate species have no known function (70% of MCSs reside in non-coding regions)

• Uniqueness of the MCSs in the human genome

Correlating MCSs with Functional Elements

• MCSs correspond to clusters of transcription factor-binding sites, non-coding RNA transcripts, and other candidate functional elements


Results: characteristics of the detected MCSs


Positions of MCSs relative to other annotated genomic features (representative region)


Results

Contribution of different species’ sequences to the detection of MCSs

• Rodent sequences detect the greatest number of MCS bases and the largest amount of non-coding sequence

• Chicken sequence has considerably higher specificity, largest amount of coding MCS bases

• MCSs detected with fish sequences almost exclusively contain coding sequence

• Non-human primate sequences are not useful with the applied methods

• None of the individual species’ sequences alone came close to identifying all the reference MCS bases

• Currently available genome sequences are insufficient for comprehensive identification of MCSs in the human genome


Ability of individual & combinations of species’ sequences to detect MCSs


The end


A(u) is the random variable representing the base generated by this random process at node u.