Klaudia Walter, Wally Gilks, Lorenz Wernisch
12th December 2006
Modelling the Boundary of Highly Conserved Non-Coding DNA
Overview
• Background
– What are CNEs?
– A+T nucleotide frequency in and around CNEs
• Phylogenetic Model
– What is a phylogenetic tree model?
– Likelihood of a tree model
– Likelihood of the scaling of a tree
– Likelihood of CNE boundary
– Variable CNE boundaries for each species
Motivation
• DNA sequences that are conserved between organisms are likely to have special functions.
• The Fugu genome is a good model for finding conserved non-coding elements (CNEs) in the human genome.
• Are conserved regions different from their neighbouring sequences in the genome?
• Is it possible to define CNE boundaries better than with pairwise sequence alignment of Fugu and human?
What are CNEs?
Multiple Alignment of Mouse, Rat, Human and Fugu
Fugu Genome
• The Fugu genome contains only 400 Mb.
• That is only an eighth of the size of the human genome.
• Gene repertoire is similar to human.
• Human and Fugu shared last common ancestor 450 million years ago.
(Brenner et al, 1993; Aparicio et al, 2002)
Conserved Non-coding Elements (CNE)
• 1373 CNEs identified in human and Fugu
• 93 - 740 bp long; 68 - 98% identical
• Situated around developmental genes
• Can act over a distance of 1 Mb, e.g. Shh expression (Lettice et al, 2003; Nobrega et al, 2003; Kleinjan & van Heyningen, 2004)
• Likely to be fundamental to vertebrate life
(Dermitzakis et al, 2002, 2003; Margulies et al, 2003; Bejerano et al 2004a; Woolfe et al, 2005)
Are vertebrate CNEs enhancers?
[Figure: conservation plots around the SOX21 gene (Fugu/Mouse, Fugu/Human, Fugu/Rat), marking coding exons and conserved non-coding sequences; labelled elements 1, 4, 5, 8-10 and 19.]
[Figure: sox21 gene element 19; labelled tissues: central nervous system, forebrain, eye.]
(Woolfe et al, 2005; McEwen et al, 2006)
[Figure: model of duplication of cis-element (CNE) and target gene.]
(Vavouri et al, 2006; McEwen et al, 2006)
A+T base frequency in CNEs
Position Specific Base Composition
Upstream flanking region | Conserved non-coding
ACTAGCCTCATCGTAGCGCAATTCTAGATGATAACA
TACCGAGTTCGGTAGGAGCTTAGTATGAGCATAACG
CGTGTGCTAGGTCACGGCGCAACATACTTATAGACT
ACGCCCTTGCACGATCCGGATATCATAGTCTTACAA
Example column frequencies:
A = 0.00, C = 0.25, G = 0.50, T = 0.25
A = 0.50, C = 0.00, G = 0.25, T = 0.25
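As a minimal sketch of how position-specific base composition could be tallied, the Python below treats the 144 example bases above as four aligned rows of 36 (the row length is an assumption, since the slide layout did not survive) and counts bases per column. The function names are illustrative, not taken from the talk. Under this split, column 5 gives A = 0.00, C = 0.25, G = 0.50, T = 0.25 and column 28 gives A = 0.50, C = 0.00, G = 0.25, T = 0.25, which agrees with the two frequency examples quoted above.

```python
from collections import Counter

# The 144 example bases, assumed here to be four aligned rows of 36.
RAW = ("ACTAGCCTCATCGTAGCGCAATTCTAGATGATAACA"
       "TACCGAGTTCGGTAGGAGCTTAGTATGAGCATAACG"
       "CGTGTGCTAGGTCACGGCGCAACATACTTATAGACT"
       "ACGCCCTTGCACGATCCGGATATCATAGTCTTACAA")
ROW_LEN = 36  # assumption: not stated in the source
rows = [RAW[i:i + ROW_LEN] for i in range(0, len(RAW), ROW_LEN)]


def base_composition(seqs):
    """For each alignment column, return the relative frequency of A, C, G, T."""
    composition = []
    for column in zip(*seqs):
        counts = Counter(column)
        composition.append({b: counts[b] / len(column) for b in "ACGT"})
    return composition


def at_frequency(seqs):
    """Relative A+T frequency for each alignment column."""
    return [c["A"] + c["T"] for c in base_composition(seqs)]


composition = base_composition(rows)
at_freq = at_frequency(rows)
```

Averaging `at_freq` over many CNEs aligned at their boundaries would give the kind of A+T profile discussed in the talk; that averaging step is not shown here.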
A+T relative frequency across CNE boundaries in Fugu and human
(Walter et al, 2005)
A+T relative frequency across 2000 genes in human chromosome 1
Genes were aligned at the start and the end.
Distribution of Position Weight Matrix (PWM) Scores for CNEs and Random Sequences
A position weight matrix (PWM) is constructed by dividing the nucleotide probabilities by expected background probabilities.
p(b,i) = probability of base b in position i
p(b) = background probability of base b
$$S = \sum_{i=1}^{n} \log_2 \frac{p(b,i)}{p(b)}$$
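A small sketch of the score S may help: for each position i, take the base b seen there and add log2 of p(b,i) / p(b). The probabilities below are toy values invented for illustration, and `pwm_score` is a hypothetical helper name.

```python
import math


def pwm_score(seq, pos_probs, background):
    """Sum over positions i of log2(p(b, i) / p(b)) for the base b observed at i.

    A zero in pos_probs would make the log undefined; real use would
    typically add pseudocounts first (not shown here).
    """
    if len(seq) != len(pos_probs):
        raise ValueError("sequence and matrix must have the same length")
    return sum(math.log2(pos_probs[i][b] / background[b])
               for i, b in enumerate(seq))


# Toy values, not taken from the talk.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pos_probs = [
    {"A": 0.5, "C": 0.125, "G": 0.125, "T": 0.25},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.125, "C": 0.125, "G": 0.25, "T": 0.5},
]

score = pwm_score("AGT", pos_probs, background)
```

A sequence that matches the preferred base at each position scores above zero, and one that matches poorly scores below zero, which is what separates the CNE and random-sequence score distributions.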
[Figure: PWM score distributions; panels: scores for Fugu CNEs, scores for human CNEs.]
The sequence logo for the 100 top scoring CNEs.
What do CNEs do?
• Some CNEs enhance GFP (green fluorescent protein) expression in zebrafish embryos.
• The function of CNEs is still unknown.
• Necessary to do more lab experiments.
• Are CNEs defined well enough for experiments?
Conservation pattern across CNE boundaries
1373 Fugu-human CNE pairs plus 100 bp flanking regions, aligned using the Needleman-Wunsch algorithm.
A+T frequency in Fugu, Human, Worm and Fly
(Glazov et al, 2005; Vavouri et al, 2006 (submitted))
Are CNE ends well defined?
• Different parameter settings produce different alignments.
• Even just different mismatch penalties change
– the alignments
– the A+T bias at the CNE boundaries
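To make the sensitivity to mismatch penalties concrete, here is a minimal sketch of Needleman-Wunsch global alignment with a linear gap penalty. The scoring values and the two toy sequences are assumptions chosen for illustration; they are not the settings used in the talk.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,   # align a[i-1] with b[j-1]
                              score[i - 1][j] + gap,     # gap in b
                              score[i][j - 1] + gap)     # gap in a
    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


# Toy sequences, invented for illustration.
SEQ_A, SEQ_B = "ACGT", "ATGT"
mild = needleman_wunsch(SEQ_A, SEQ_B, mismatch=-1)
harsh = needleman_wunsch(SEQ_A, SEQ_B, mismatch=-5)
```

With these toy inputs the mild penalty keeps the ungapped alignment, while the harsh penalty makes two gaps cheaper than one mismatch, so the alignment itself changes.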
A+T frequency for Fugu CNEs using pairwise alignments with Human
Phylogenetic Model
Multiple sequence alignment
          5’ flanking   conserved
          (300 bp)      (100 bp)
Human     ACAGTAT       ATCGTAAT
Mouse     ACCGTAT       ATCGTAAT
Chicken   AACGTAT       ATCGTAAT
Xenopus   CCACTAT       ATCGTAAT
Fugu      CGACTTA       ATCGTAAT
The boundary lies between the 5’ flanking and conserved blocks.
Phylogenetic tree model
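• Substitution rate matrix Q
– Continuous-time Markov process
• Tree topology
• Branch lengths
• Scaling of tree
$$Q = \begin{pmatrix} q_{AA} & q_{AC} & q_{AG} & q_{AT} \\ q_{CA} & q_{CC} & q_{CG} & q_{CT} \\ q_{GA} & q_{GC} & q_{GG} & q_{GT} \\ q_{TA} & q_{TC} & q_{TG} & q_{TT} \end{pmatrix}$$
[Figure: example tree with leaves labelled H, M, C, F.]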
Matrix P(t) of substitution probabilities for branch length t
$$P(t) = \exp(Qt) = \sum_{i=0}^{\infty} \frac{(Qt)^i}{i!}$$
Q should be diagonalizable. If Q is not symmetric, we need to find the eigensystem of a symmetric matrix S related to Q and to convert results to the eigensystem of Q.
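The series for P(t) = exp(Qt) can be checked numerically on a small case. The sketch below truncates the series after a fixed number of terms and compares the result with the known closed form for the Jukes-Cantor rate matrix. The Jukes-Cantor matrix, the rate `ALPHA` and the truncation length are assumptions made for illustration; the talk's own Q is more general.

```python
import math


def mat_mul(X, Y):
    """Plain matrix product of two square lists-of-lists."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(Q, t, terms=30):
    """Approximate P(t) = exp(Qt) by the truncated series sum of (Qt)^i / i!."""
    n = len(Q)
    Qt = [[q * t for q in row] for row in Q]
    term = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # i = 0 term
    P = [row[:] for row in term]
    for i in range(1, terms):
        # (Qt)^i / i! is obtained from the previous term times Qt / i.
        term = [[x / i for x in row] for row in mat_mul(term, Qt)]
        for r in range(n):
            for c in range(n):
                P[r][c] += term[r][c]
    return P


ALPHA = 0.1   # arbitrary substitution rate, for illustration only
T_BRANCH = 0.5
Q_JC = [[-3 * ALPHA if i == j else ALPHA for j in range(4)] for i in range(4)]
P = transition_matrix(Q_JC, T_BRANCH)

# Known Jukes-Cantor closed form for the probability of no net change.
closed_form_same = 0.25 + 0.75 * math.exp(-4 * ALPHA * T_BRANCH)
```

The truncated series is fine for small Qt but loses accuracy as the entries of Qt grow, which is one practical reason to work with the eigensystem instead.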
Example:
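$$Q = \begin{pmatrix} \cdot & a\pi_C & b\pi_G & c\pi_T \\ a\pi_A & \cdot & d\pi_G & e\pi_T \\ b\pi_A & d\pi_C & \cdot & f\pi_T \\ c\pi_A & e\pi_C & f\pi_G & \cdot \end{pmatrix}$$
Rows and columns are ordered A, C, G, T; each diagonal entry (·) is set so that its row sums to zero.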
Estimating A+T frequency around Fugu CNE boundary
[Figure: relative A+T frequency for Mouse, Fugu, Xenopus, Chicken and Human; tree panels labelled "Conserved, scaling C" and "Flanking, scaling F" over the same five species.]
Phylogenetic tree with conserved and flanking scalings
[Figure: scale plotted against position, showing the flanking scaling F, the conserved scaling C and the boundary position.]
What is the optimal scaling?
          5’ flanking   conserved
Human     ACA G TAT     ATCGTAAT
Mouse     ACC G TAT     ATCGTAAT
Chicken   AAC G TAT     ATCGTAAT
Xenopus   CCA C TAT     ATCGTAAT
Fugu      CGA C TTA     ATCGTAAT
Compute likelihood of scaling
Felsenstein’s algorithm: P(s | T, λ)
Felsenstein’s algorithm
The “pruning” algorithm by Felsenstein (1973, 1981) uses dynamic programming to calculate the likelihood of a tree model, P(S | T, λ).
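Recursion:
• If u is a leaf
– If x_u = a, then $P(L_u \mid a) = 1$
– Otherwise, $P(L_u \mid a) = 0$
• Otherwise
$$P(L_u \mid a) = \sum_{b} P(b \mid a, t_v)\, P(L_v \mid b) \; \sum_{c} P(c \mid a, t_w)\, P(L_w \mid c)$$
[Figure: node u in state a, with children v (state b) and w (state c).]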
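The recursion can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the talk's implementation: the substitution probabilities use the Jukes-Cantor closed form instead of a general Q, the tree topology and branch lengths are hypothetical, and the `scale` argument simply multiplies every branch length. The example column is the highlighted one from the alignment (G, G, G, C, C).

```python
import math

BASES = "ACGT"


def jc_prob(a, b, t):
    """Jukes-Cantor P(b | a, t), t in expected substitutions per site (assumed model)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def partial(node, column, scale=1.0):
    """Return {a: P(L_u | a)} for the subtree rooted at node u.

    A leaf is a species name; an internal node is (child_v, t_v, child_w, t_w).
    """
    if isinstance(node, str):  # leaf: 1 if the observed base equals a, else 0
        return {a: 1.0 if column[node] == a else 0.0 for a in BASES}
    v, t_v, w, t_w = node
    L_v = partial(v, column, scale)
    L_w = partial(w, column, scale)
    return {
        a: sum(jc_prob(a, b, scale * t_v) * L_v[b] for b in BASES)
           * sum(jc_prob(a, c, scale * t_w) * L_w[c] for c in BASES)
        for a in BASES
    }


def column_likelihood(tree, column, scale=1.0):
    """Likelihood of one alignment column: sum over root states, uniform root frequencies."""
    L_root = partial(tree, column, scale)
    return sum(0.25 * L_root[a] for a in BASES)


# Hypothetical topology and branch lengths, for illustration only.
MAMMALS = ("Human", 0.1, "Mouse", 0.1)
AMNIOTES = (MAMMALS, 0.2, "Chicken", 0.3)
TREE = (AMNIOTES, 0.2, ("Xenopus", 0.4, "Fugu", 0.5), 0.1)

COLUMN = {"Human": "G", "Mouse": "G", "Chicken": "G", "Xenopus": "C", "Fugu": "C"}

lik_unscaled = column_likelihood(TREE, COLUMN, scale=1.0)
lik_scaled = column_likelihood(TREE, COLUMN, scale=0.2)
```

Because positions are treated as independent, the likelihood of a whole alignment under one scaling is the product of such column likelihoods.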
Likelihood of scaling
• Calculate the likelihood P(S | T, λ) of the scaling vector λ by summing over the boundary b.
• Assume evolutionary independence of each position i in the multiple alignment S.
• Each term P(S | T, λ_b), with the boundary placed at position b, is calculated by Felsenstein’s algorithm.
$$P(S \mid T, \lambda) = \sum_{b=1}^{N} P(S \mid T, \lambda_b)\, P(\lambda_b)$$
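One way to carry out the sum over boundary positions is sketched below. It assumes that per-column log-likelihoods have already been computed twice, once under the flanking scaling and once under the conserved scaling, that a boundary at b means the first b columns are flanking and the rest conserved, and that the prior over b is uniform. All three are assumptions for illustration, as are the function names and toy numbers.

```python
import math


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def boundary_log_likelihoods(log_flank, log_cons):
    """log-likelihood of the alignment for each boundary b = 1..N.

    Columns are independent, so the log-likelihood is a sum: the first b
    columns use the flanking values, the remaining ones the conserved values.
    """
    total_cons = sum(log_cons)
    flank_prefix = 0.0
    cons_prefix = 0.0
    out = []
    for b in range(1, len(log_flank) + 1):
        flank_prefix += log_flank[b - 1]
        cons_prefix += log_cons[b - 1]
        out.append(flank_prefix + (total_cons - cons_prefix))
    return out


def scaling_log_likelihood(log_flank, log_cons):
    """log of the sum over b of likelihood(b) * prior(b), with a uniform prior (assumed)."""
    per_b = boundary_log_likelihoods(log_flank, log_cons)
    log_prior = -math.log(len(per_b))
    return log_sum_exp([v + log_prior for v in per_b])


# Toy per-column likelihoods, invented for illustration.
log_flank = [math.log(0.2), math.log(0.1)]
log_cons = [math.log(0.05), math.log(0.4)]
result = scaling_log_likelihood(log_flank, log_cons)
```

Working in log space avoids underflow when the alignment has hundreds of columns.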
Model with common scaling and individual boundaries
$$P(\lambda \mid S_1, \ldots, S_n) \propto P(S_1, \ldots, S_n \mid \lambda)\, P(\lambda) = \left[ \prod_{i=1}^{n} P(S_i \mid \lambda) \right] P(\lambda)$$
Probability of scaling given sequences S1, …, Sn
Likelihood of scaling over CNEs
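Hierarchical model for (λ_C, λ_F)
$$P(S_1, S_2, \ldots, S_n \mid \mu, \Sigma) = \prod_{i=1}^{n} P(S_i \mid \mu, \Sigma)$$
$$P(S \mid \mu, \Sigma) = \sum_{\lambda_C, \lambda_F} P(S \mid \lambda_C, \lambda_F)\, P(\lambda_C, \lambda_F \mid \mu, \Sigma)$$
[Figure: hierarchical model in which each alignment S_1, S_2, S_3, ..., S_n has its own scaling pair (λ_C, λ_F)_1, ..., (λ_C, λ_F)_n.]
Multivariate log-normal distribution for (λ_C, λ_F), with parameters μ and Σ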
Likelihood of boundary b
• The likelihood of the boundary is computed by summing over scalings λ.
• b and λ are independent.
• Prior on λ.
$$P(S \mid b) = \sum_{\lambda} P(S \mid b, \lambda)\, P(\lambda)$$
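The sum over scalings can be illustrated on a small grid. In this sketch each scaling is a hypothetical (conserved, flanking) pair, the likelihoods for each boundary under each scaling are toy numbers, and the prior gives the two scalings equal weight; none of these values come from the talk.

```python
def boundary_likelihood(lik_given_scaling, prior):
    """For each boundary b, sum likelihood(b, scaling) * prior(scaling) over scalings."""
    n_boundaries = len(next(iter(lik_given_scaling.values())))
    return [
        sum(prior[s] * lik[b] for s, lik in lik_given_scaling.items())
        for b in range(n_boundaries)
    ]


# Toy grid: keys are hypothetical (conserved, flanking) scalings,
# values are likelihoods for boundaries b = 1, 2, 3.
lik_given_scaling = {
    (0.2, 1.0): [0.01, 0.04, 0.02],
    (0.5, 1.5): [0.02, 0.02, 0.06],
}
prior = {(0.2, 1.0): 0.5, (0.5, 1.5): 0.5}

lik_b = boundary_likelihood(lik_given_scaling, prior)
# 1-based position of the most likely boundary on this toy grid.
best_b = max(range(len(lik_b)), key=lambda b: lik_b[b]) + 1
```

Comparing the most likely boundary with the boundary from a pairwise alignment gives a boundary shift of the kind tabulated in the talk.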
Likelihood of boundary b
Boundary shifts for phylogenetic model
Boundary shift         0 bp   ≤ 20 bp   ≤ 50 bp   ≤ 100 bp
Cumulative frequency   12%    40%       61%       80%
[Figure axes: density against position.]
Relative conservation by position
Model for variable boundary
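Branches (rows) by positions (columns):
000000 0 11111111
000011 1 11111111
000011 1 11111111
000000 0 00111111
000000 0 00111111
000000 0 00111111
000000 1 11111111
000000 1 11111111
States: 0, 1
[Figure: tree with leaves H, M, C, X, F and a 0/1 state on each branch at the highlighted position.]
Transitions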
1. 0000 0001 0010 0011 ......... 1111
2. 0000 0001 0010 0011 ......... 1111
3. 0000 0001 0010 0011 ......... 1111
...... ...... ...... ...... ......
Variable boundary for CNE1031
Human     AGTAGTTTCC   ATGCCTGTCA
Mouse     AGGAGCCTCT   ATGCCTGTCA
Chicken   AGTAGTTTCC   ATGCCTGTCA
Xenopus   -GTTATATAC   ACGCCTGTCA
Fugu      AATAGTTCCC   ATGCCTGTCA
          10 bp        10 bp
Boundary shift = 154 bp
Variable boundary for CNE1043
Human     TGATGTTGAA   TCATTTAAAA
Mouse     TGATGTGTAG   TCATTTAAAA
Chicken   TGACGTTCAG   TCAGTTAAAA
Xenopus   TGACACTCAA   TCATTTAAAT
Fugu      TGACGCGCAG   TCAGTTAAAT
          10 bp        10 bp
Boundary shift = 0 bp
Variable boundary for CNE1037
Human     TA-GGCCATT   CTGATTTGTA
Mouse     TA-GGCCATT   CTGATTTGTA
Chicken   TA-GGCCATT   CTGATTTGTA
Xenopus   AA-GACCATA   CTGATTTTTT
Fugu      TGTGGTAGGT   CTGATTTGTA
          10 bp        10 bp
Boundary shift = 65 bp
Conservation structure of CNEs
Summary
• Statistical models for CNE boundaries that incorporate phylogenetic information.
• Aim is to define location of CNE boundaries more reliably than pairwise or multiple sequence alignments.
Acknowledgments
Greg Elgar (Queen Mary College, University of London)
Irina Abnizova, Gayle McEwen (MRC Biostatistics Unit, Cambridge), Krys Kelly, Brian Tom
Tanya Vavouri (QMUL & Sanger Institute, Hinxton)
Adam Woolfe (NHGRI, National Institutes of Health, US)
Yvonne Edwards (University College, University of London)
Martin Goodson
References
• Woolfe A, Goodson M, Goode DK, Snell P, McEwen GK, Vavouri T, Smith SF, North P, Callaway H, Kelly K, Walter K, Abnizova I, Gilks W, Edwards YJ, Cooke JE, Elgar G. Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 2005, 3(1).
• Walter K, Abnizova I, Elgar G, Gilks WR. Striking nucleotide frequency pattern at the borders of highly conserved vertebrate non-coding sequences. Trends Genet. 2005, 21(8):436-40.
• Vavouri T, McEwen GK, Woolfe A, Gilks WR, Elgar G. Defining a genomic radius for long-range enhancer action: duplicated conserved non-coding elements hold the key. Trends Genet. 2006, 22(1):5-10.
• McEwen GK, Woolfe A, Goode D, Vavouri T, Callaway H, Elgar G. Ancient duplicated conserved noncoding elements in vertebrates: a genomic and functional analysis. Genome Res. 2006, 16(4):451-65.
• Vavouri T, Walter K, Gilks WR, Lehner B, Elgar G. Parallel evolution of conserved noncoding elements that target a common set of developmental regulatory genes from worms to humans. Submitted 2006.
Human CNE boundary
[Figure: A+T frequency against position; panels: MegaBLAST, Phylogenetic.]
Chicken CNE boundary
[Figure: A+T frequency against position; panels: MegaBLAST, Phylogenetic.]
Fugu CNE boundary
[Figure: A+T frequency against position; panels: MegaBLAST, Phylogenetic.]
From rate matrix Q to probability matrix P
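$$p' = pQ = (p_A, p_C, p_G, p_T) \begin{pmatrix} q_{AA} & q_{AC} & q_{AG} & q_{AT} \\ q_{CA} & q_{CC} & q_{CG} & q_{CT} \\ q_{GA} & q_{GC} & q_{GG} & q_{GT} \\ q_{TA} & q_{TC} & q_{TG} & q_{TT} \end{pmatrix}$$
$$p'_A = p_A q_{AA} + p_C q_{CA} + p_G q_{GA} + p_T q_{TA} = -p_A (q_{AC} + q_{AG} + q_{AT}) + p_C q_{CA} + p_G q_{GA} + p_T q_{TA}$$
The second form uses the fact that each row of Q sums to zero, so $q_{AA} = -(q_{AC} + q_{AG} + q_{AT})$.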
P(t) of substitution probabilities (2)
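$$S = \operatorname{diag}(\pi)^{1/2}\, Q\, \operatorname{diag}(\pi)^{-1/2}, \qquad \pi = (\pi_A, \pi_C, \pi_G, \pi_T)$$
S is symmetric, with
$$\exp(St) = V \operatorname{diag}(\exp(\Lambda t))\, V^{T}$$
where the columns of V are the eigenvectors of S and Λ holds its eigenvalues.
$$\exp(Qt) = \operatorname{diag}(\pi)^{-1/2} \exp(St)\, \operatorname{diag}(\pi)^{1/2}$$
$$P(t) = \exp(Qt)$$
S and Q have the same eigenvalues.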
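The symmetrisation step can be checked numerically. The sketch below builds a rate matrix whose off-diagonal entries are an exchangeability times the frequency of the target base, forms S by scaling with the square roots of the base frequencies, and confirms that S is symmetric. The base frequencies and exchangeabilities are toy values invented for illustration, and the helper names are hypothetical.

```python
import math


def reversible_rate_matrix(pi, a, b, c, d, e, f):
    """Rate matrix with q_ij = r_ij * pi_j (order A, C, G, T); rows sum to zero."""
    r = [[0, a, b, c],
         [a, 0, d, e],
         [b, d, 0, f],
         [c, e, f, 0]]
    Q = [[r[i][j] * pi[j] for j in range(4)] for i in range(4)]
    for i in range(4):
        Q[i][i] = -sum(Q[i][j] for j in range(4) if j != i)
    return Q


def symmetrise(Q, pi):
    """S = diag(pi)^(1/2) Q diag(pi)^(-1/2), computed entry by entry."""
    root = [math.sqrt(p) for p in pi]
    return [[root[i] * Q[i][j] / root[j] for j in range(4)] for i in range(4)]


# Toy base frequencies (A, C, G, T) and exchangeabilities, for illustration only.
PI = [0.3, 0.2, 0.2, 0.3]
Q = reversible_rate_matrix(PI, a=1.0, b=2.0, c=0.5, d=0.7, e=1.5, f=1.2)
S = symmetrise(Q, PI)

max_asymmetry = max(abs(S[i][j] - S[j][i]) for i in range(4) for j in range(4))
```

Once S is symmetric, a standard symmetric eigensolver can supply V and the eigenvalues, and the similarity transform converts exp(St) back to exp(Qt); that eigensolver step is left out here to keep the sketch dependency-free.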