In the search of motifs (and other hidden structures)
Esko Ukkonen
Department of Computer Science & Helsinki Institute of Information Technology HIIT
University of Helsinki
CPM 2005, Jeju, 21 June 2005
Uncover a hidden structure(?)
Motif?
• a pattern that occurs unexpectedly often in (a set of) strings
• pattern: substring, substring with gaps, string in generalized alphabet (e.g., IUPAC), HMMs, binding affinity matrix, cluster of binding affinity matrices,… (= the hidden structure to be learned from data)
• (unexpectedly: statistical modelling)
• occurrence: exact, approximate, with high probability, …
• strings ↔ applications: bioinformatics …
Plan of the talk
1. Gapped motifs in a string
2. Founder sequence reconstruction problem, with applications to haplotype analysis and genotype phasing (WABI 2002, ALT 2004, WABI 2005)
3. Uncovering gene enhancer elements
1. Gapped motifs
ATT HATTIVATTI
I#A HATTIVATTI
Substring motifs of a string S
• string S = s1 … sn in alphabet A.
• Problem: what are the frequently occurring (ungapped) substrings of S? Longest substring that occurs at least q times?
• Thm: Suffix tree T(S) of S gives complete occurrence counts of all substring motifs of S in O(n) time (although S may have O(n^2) substrings!)
T(S) is a full-text index
[Figure: the path for P in T(S) leads to a subtree whose leaves include 31 and 8]
P occurs in S at locations 8, 31, …
Path for P exists in T(S) ↔ P occurs in S
Counting the substring motifs
• internal nodes of T(S) ↔ repeating substrings of S
• number of leaves of the subtree of a node for string P = number of occurrences of P in S
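The question "longest substring that occurs at least q times" can be illustrated without building T(S). The sketch below is not the suffix-tree method of the talk: it swaps in a naively sorted list of suffixes (quadratic or worse, fine for short strings only) and uses the fact that q neighbouring suffixes in sorted order share exactly the common prefix of the first and the last one. The name `longest_repeat` is hypothetical.

```python
def longest_repeat(s, q=2):
    """Return (motif, positions) for a longest substring of s with >= q occurrences.

    Naive stand-in for the suffix tree: sort all suffixes, then look at every
    window of q neighbouring suffixes. Positions are 0-based.
    """
    n = len(s)
    order = sorted(range(n), key=lambda i: s[i:])  # naive suffix array

    def lcp(i, j):
        # length of the longest common prefix of the suffixes starting at i and j
        k = 0
        while i + k < n and j + k < n and s[i + k] == s[j + k]:
            k += 1
        return k

    best_len, best_rank = 0, 0
    for r in range(n - q + 1):
        length = lcp(order[r], order[r + q - 1])
        if length > best_len:
            best_len, best_rank = length, r

    if best_len == 0:
        return "", []
    start = order[best_rank]
    motif = s[start:start + best_len]
    positions = [i for i in range(n) if s.startswith(motif, i)]
    return motif, positions


print(longest_repeat("hattivatti", 2))
print(longest_repeat("hattivatti", 4))
```

For hattivatti this reports atti for q = 2 and t for q = 4, in line with the repeat counts 2 and 4 that appear on the suffix-tree slides.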
T(hattivatti)
hattivatti
attivatti
ttivatti
tivatti
ivatti
vatti
atti
tti
ti
i
[Figure: suffix tree of hattivatti with its edge labels]
Substring motifs of hattivatti
[Figure: suffix tree of hattivatti with the occurrence counts 2, 2, 2, 2 and 4 marked]
Counts for the O(n) maximal motifs shown
Finding repeats in DNA
• human chromosome 3
• the first 48 999 930 bases
• 31 min cpu time (8 processors, 4 GB)
• Human genome: 3 × 10^9 bases
• T(HumanGenome) feasible
Longest repeat?
Occurrences at: 28395980, 28401554r Length: 2559
ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaatgctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaacatgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcatagtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacatacgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaatcaccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgccattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggcttttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgtaagtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaagggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatagtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaattctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgtgtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattctctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagactttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctctt
ttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgtgccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaatttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtacgtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattttattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatgagttagg
Ten occurrences?
ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt
Length: 277
Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125
In the reversed complement at: 17858493, 41463059, 42431718, 42580925
Gapped motifs of S
• gapped pattern: P in (A ∪ {#})*
• gap symbol # matches any symbol in A
• example: aa##bb#b
• L(P) = occurrences of P in S
• P is called a motif of S if |L(P)| > 1 and a motif with quorum q if |L(P)| ≥ q.
• Problem: find occurrence count |L(P)| for all gapped motifs P of S
• a^n b a^n has exponentially many motifs (M-F. Sagot)!
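The definitions of L(P), motif and quorum can be made concrete with a direct scan. This is only a brute-force illustration of the definitions, not an efficient counting method; the function names are hypothetical and positions are 0-based.

```python
GAP = "#"


def gapped_occurrences(s, p):
    """L(P): start positions where the gapped pattern p matches s.

    The gap symbol '#' matches any symbol; every other symbol must match exactly.
    """
    return [
        i
        for i in range(len(s) - len(p) + 1)
        if all(c == GAP or c == s[i + k] for k, c in enumerate(p))
    ]


def is_motif(s, p, q=2):
    """True if p occurs at least q times in s (q = 2 means |L(P)| > 1)."""
    return len(gapped_occurrences(s, p)) >= q


S = "aaaaabaaaaa"
print(gapped_occurrences(S, "a###a"))      # occurrence list of a###a
print(gapped_occurrences(S, "aaa#a#aaa"))  # occurrence list of aaa#a#aaa
print(is_motif(S, "a###a", q=5))
```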
Motifs vs self-alignments
• self-alignments of S => maximal motifs
align the occurrences
S
Motifs vs multiple self-alignments
• self-alignments of S => maximal motifs
expand if possible
Motifs vs self-alignments
• S = aaaaabaaaaa P = a###a
• aaaaabaaaaa aaaaabaaaaa
a###a aaaaabaaaaa aaaaabaaaaa
Motifs vs self-alignments
• S = aaaaabaaaaa P = a###a
• aaaaabaaaaa aaaaabaaaaa
• aaa#a#aaa is a maximal motif for this self-alignment
  aaa#a#aaa
aaaaabaaaaa
  aaaaabaaaaa
Maximal motifs
• multiple self-alignments of S ↔ maximal gapped motifs of S: the unanimous columns give the non-gap symbols of the motif
• any motif P has a unique maximal motif M(P) (align the occurrences and maximize); L(M(P)) = L(P) + d
• unfortunately: a^n b a^n has exponentially many maximal motifs
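The "align the occurrences and maximize" step can be sketched as follows: stack copies of S so that the given occurrences line up, keep every column on which all copies agree, and put a gap elsewhere. This is an illustrative reading of the slide, with a hypothetical function name; it returns the maximal pattern together with the shift d of its first symbol relative to the aligned positions.

```python
def maximal_motif(s, positions):
    """Maximal motif for the self-alignment of s given by `positions` (0-based).

    Column k holds s[p + k] for every aligned position p. Unanimous columns
    give the non-gap symbols; all other columns become '#'. Leading and
    trailing gaps are trimmed. Returns (motif, d), where the motif occurs at
    p + d for every p in positions.
    """
    n = len(s)
    lo = -min(positions)         # leftmost column present in every copy
    hi = n - 1 - max(positions)  # rightmost column present in every copy
    columns = []
    for k in range(lo, hi + 1):
        symbols = {s[p + k] for p in positions}
        columns.append(symbols.pop() if len(symbols) == 1 else "#")

    solid = [i for i, c in enumerate(columns) if c != "#"]
    if not solid:
        return "", 0
    motif = "".join(columns[solid[0]:solid[-1] + 1])
    return motif, lo + solid[0]


print(maximal_motif("aaaaabaaaaa", [0, 2]))
print(maximal_motif("hattivatti", [2, 7]))
```

With S = aaaaabaaaaa and the two copies shifted by two positions, the unanimous columns spell aaa#a#aaa. Aligning the two occurrences of tt in hattivatti expands to atti with d = -1.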
Blocks of maximal motifs
• aaa##b##ba has blocks aaa, b, ba
• Lemma: Maximal substring motifs (1-block motifs) ↔ (branching) nodes of T(S)
• Thm: Each block of a maximal motif of S is a maximal substring motif of S, hence there are O(n) different strings that can be used as a block of a maximal motif.
• Cor: There are O(n^(2k-1)) different maximal motifs with k blocks [O(n^(2k)) unrestricted motifs].
Counting 2-block maximal motifs
• Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).
Algorithm (very simple)
2-block motif (X,d,Y): block X, block Y, distance d
for each maximal substring motif X                                  [O(n)]
    for each distance d = 1,2, …                                    [O(n)]
        mark the leaves of T(S) that correspond to locations L(X) + d
        for each maximal substring motif Y, find the number h(Y) of marked leaves in its subtree in T(S)    [O(n)]
        the occurrence count of motif (X,d,Y) is h(Y)
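The marking idea of the algorithm can be tried out on a small string without a suffix tree. The sketch below replaces the leaves of T(S) by plain position sets and finds occurrences by direct scanning, so it shows the logic but not the suffix-tree bookkeeping behind the O(n^3) bound. It assumes d is the offset from the start of X to the start of Y (as in "locations L(X) + d"); the candidate list `blocks` and all function names are hypothetical.

```python
def occurrences(s, x):
    """Start positions (0-based) of the ungapped block x in s."""
    return [i for i in range(len(s) - len(x) + 1) if s.startswith(x, i)]


def two_block_counts(s, x, blocks):
    """Occurrence counts h(Y) of the 2-block motifs (x, d, y) for y in blocks.

    For each distance d the positions L(x) + d are marked; h(y) is the number
    of occurrences of y that start at a marked position.
    """
    n = len(s)
    starts = occurrences(s, x)
    block_starts = {y: occurrences(s, y) for y in blocks}
    counts = {}
    for d in range(1, n):
        marked = {p + d for p in starts if p + d < n}
        for y, ys in block_starts.items():
            h = sum(1 for q in ys if q in marked)
            if h:
                counts[(x, d, y)] = h
    return counts


counts = two_block_counts("hattivatti", "a", ["t", "ti", "i"])
for motif, h in sorted(counts.items(), key=lambda kv: (kv[0][1], kv[0][2])):
    print(motif, h)
```

For example, (a, 2, ti) stands for the gapped pattern a#ti, which occurs twice in hattivatti.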
Counting 2-block maximal motifs (cont)
• Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).
• flexible gaps: x*y, where * = gap of any length
• Thm: The occurrence counts for all maximal motifs with two blocks and one flexible gap can be found in (optimal) time O(n^2).
General case
• Q1: Given q and W, does S have a motif with at least W non-gap symbols and at least q occurrences?
• In the k-block case, is O(n^(2k-1)) (or even better) time possible?
• related work: A. Apostolico, M-F. Sagot, L. Parida, N. Pisanti, …
2. Founder reconstruction and applications
Haplotype evolution: founders and iterated recombinations
• WABI 2002
only recombinations; mutations not shown
founder haplotypes
current (observed) haplotypes
statistical models of recombination: average fragment length ~ 1/#generations
Uncovering founder sequences
• Problem: Given current sequences C (haplotypes), construct their ‘founders’ that produce the sequences by iterated recombinations using minimum possible total number of cross-overs (i.e., current sequences have a parse into smallest possible number of fragments taken from the founders)
Example
0 0 1 0 0 0 0 1 0 0 1 1
1 1 1 1 1 1 1 0 0 1 1 0
0 0 1 0 1 1 1 1 0 1 1 0
1 1 1 1 0 0 1 0 0 0 1 1
Example
current haplotypes:
0 0 1 0 0 0 0 1 0 0 1 1
1 1 1 1 1 1 1 0 0 1 1 0
0 0 1 0 1 1 1 1 0 1 1 0
1 1 1 1 0 0 1 0 0 0 1 1
founders:
0 0 1 0 1 1 1 0 0 0 1 1
1 1 1 1 0 0 0 1 0 1 1 0
6 cross-overs
Example
current haplotypes:
0 0 1 0 0 0 0 1 0 0 1 1
1 1 1 1 1 1 1 0 0 1 1 0
0 0 1 0 1 1 1 1 0 1 1 0
1 1 1 1 0 0 1 0 0 0 1 1
founders:
0 0 0 0 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 1 1 1 1 1
18 cross-overs
OBS: two founders (colors) always suffice if no restrictions
Founder reconstruction problem
• given a set D of m sequences, construct M founder sequences that give D in minimum number of cross-overs
• solution by dynamic programming, exponential time in m (WABI 2002)
• Q2: NP-hard?
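The cross-over counts of the two examples can be checked with a small dynamic program. Note that this only scores a given founder set (the easy direction); it is not the founder reconstruction algorithm of WABI 2002. For each haplotype it walks column by column and keeps, per founder, the fewest cross-overs of a parse that currently copies from that founder. Function names are hypothetical.

```python
INF = float("inf")


def min_crossovers(haplotype, founders):
    """Fewest cross-overs needed to parse `haplotype` into founder fragments.

    Returns INF if some column matches no founder.
    """
    cost = [0 if f[0] == haplotype[0] else INF for f in founders]
    for j in range(1, len(haplotype)):
        switch = min(cost) + 1  # jump to this founder from the best previous one
        cost = [
            min(c, switch) if f[j] == haplotype[j] else INF
            for c, f in zip(cost, founders)
        ]
    return min(cost)


def total_crossovers(haplotypes, founders):
    return sum(min_crossovers(h, founders) for h in haplotypes)


haplotypes = ["001000010011", "111111100110", "001011110110", "111100100011"]
founders_a = ["001011100011", "111100010110"]
founders_b = ["000000000000", "111111111111"]

print(total_crossovers(haplotypes, founders_a))
print(total_crossovers(haplotypes, founders_b))
```

The two founder sets are the ones from the example slides; the totals agree with the 6 and 18 cross-overs stated there.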
Modeling a set of haplotypes by a HMM
• ’motif’ = Hidden Markov Model
• minimum description length (MDL) modeling
• ALT 2004
Hidden Markov Model (HMM)
• states i with emission alphabet Hi
• emission probabilities P(H), H ∈ Hi
• state transition probabilities wij
[Figure: states i and j joined by a transition labelled wij; emission distribution {P(H)}]
Conserved fragments and parses
• haplotypes 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
• parse 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2
• conserved fragments 1 1 1 1 2 1 2 1 2 1 1 2 1 1 1 1 2 2 1 2 2 2 2 1
• fragmentation model (HMM) 1 1 1 1
2 1 2 1 2
1 1 2 1 1 1 1 2 2 1 2 2 2 2
Lactose tolerance
• recent finding in Finnish population: an SNP C/T-13910, 14 kb upstream from the lactase gene, associates completely with lactose intolerance
• two datasets over 23 SNPs in the vicinity of this SNP
• lactose intolerant persons: 21 haplotypes
• lactose tolerant persons: 38 haplotypes
Case/control study by HMM
Lactose intolerant (~6 fragments per haplotype)
Lactose tolerant (2 fragments per haplotype => young)
Genotype phasing via founders using a HMM
• the genotype phasing problem: given a set of genotypes, find their resolving haplotype pairs
• find at most M founders that produce resolving haplotype pairs in minimum possible number of cross-overs => relatively good haplotyping method
• improved results with a related HMM, trained with the Expectation Maximization algorithm
• WABI 2005
HMM for haplotyping
[Figure: HMM for haplotyping, with labels "transition probability distribution" and "emission probability distribution"]
Example HMM
3. Uncovering gene enhancer elements
Introduction
• Gene expression regulation in multicellular organisms is controlled in a combinatorial fashion by so-called transcription factors.
• Transcription factors bind to DNA cis-elements on enhancer modules (promoters), and multiple factors need to bind to activate the module.
• In mammals, the modules are few and far between.
• The problem: Locate functional regulatory modules.
Gene regulation
[Figure: DNA (promoter1 gene1, promoter2 gene2, promoter3 gene3, promoter4 gene4); transcription to RNA; translation to proteins; transcription factors]
Model of cell type specific regulation of target gene expression
GLI X Y (tissue specific TFs)
GLI GLI Ubiquitously expressed TF
transcription
transcription
Common targets (e.g. Patched):
Cell type specific targets (e.g. N-myc):
Binding affinity matrices
• The cis-elements are represented by affinity matrices.
– A column per position
– A row per nucleotide
• Discovered:
– Computationally
– Traditional wet lab
– Microarrays
 9 11 49 51  0  1  1  4
19  3  0  0  0 45 25 16
 5  1  2  0 17  0  4 21
18 36  0  0 34  5 21 10
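One common way to use such a matrix is to convert the counts to log-odds scores and slide the matrix along a sequence. The sketch below does that for the 4 × 8 matrix of the slide. Several things here are assumptions not stated in the slides: the row order A, C, G, T, the pseudocount of 1, the uniform background of 0.25, and all function names. The test sequence is made up for illustration.

```python
import math

# Count matrix from the slide. ASSUMPTION: rows are in the order A, C, G, T.
COUNTS = {
    "A": [9, 11, 49, 51, 0, 1, 1, 4],
    "C": [19, 3, 0, 0, 0, 45, 25, 16],
    "G": [5, 1, 2, 0, 17, 0, 4, 21],
    "T": [18, 36, 0, 0, 34, 5, 21, 10],
}


def log_odds(counts, pseudocount=1.0, background=0.25):
    """One dict per column: log2(estimated base probability / background)."""
    width = len(next(iter(counts.values())))
    matrix = []
    for j in range(width):
        total = sum(counts[b][j] + pseudocount for b in counts)
        matrix.append(
            {b: math.log2((counts[b][j] + pseudocount) / total / background)
             for b in counts}
        )
    return matrix


def scan(seq, matrix):
    """Score of every window of seq (uppercase A/C/G/T only)."""
    w = len(matrix)
    return [
        (i, sum(matrix[k][seq[i + k]] for k in range(w)))
        for i in range(len(seq) - w + 1)
    ]


def best_site(seq, matrix):
    return max(scan(seq, matrix), key=lambda hit: hit[1])


matrix = log_odds(COUNTS)
position, score = best_site("GGGCTAATCCGTTT", matrix)
print(position, round(score, 2))
```

Every column of the slide's matrix sums to 51, which is what suggests reading it as 51 aligned sites of width 8.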
Finding preserved motifs of binding sites
• looking at one (human) genome gives too many positives
• comparative approach: take the 200 kb regions surrounding the same genes (paralogs and orthologs) of different mammals (human, mouse, chicken, …), find preserved clusters (motifs) of binding sites
• Smith-Waterman type algorithm with a novel scoring function
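For orientation, the plain Smith-Waterman recurrence for local alignment is sketched below on ordinary character strings. The novel scoring function of the talk, which works on clusters of binding sites, is not described in the slides and is not reproduced here; the match, mismatch and gap values are arbitrary illustrative choices.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b, and the cell where it ends.

    h[i][j] is the best score of a local alignment ending at a[i-1], b[j-1];
    a score is never allowed to drop below 0, which makes the alignment local.
    """
    rows, cols = len(a) + 1, len(b) + 1
    h = [[0] * cols for _ in range(rows)]
    best, best_end = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = h[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[i][j] = max(0, diagonal, h[i - 1][j] + gap, h[i][j - 1] + gap)
            if h[i][j] > best:
                best, best_end = h[i][j], (i, j)
    return best, best_end


print(smith_waterman("ACACACTA", "AGCACACA"))
```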
Whole genome comparisons
• Whole genomes can be analyzed with our implementation
• We have compared human genes to orthologs in mouse, rat, chicken, fugu, tetraodon and zebrafish
– 100 kbp flanking regions on both sides of the gene.
– Coding regions masked out.
– About 20 000 comparisons for each pair of species.
– About 2 min each
Enhancer prediction for N-myc
200 kb Mouse N-Myc genomic region
200 kb Human N-Myc genomic region
Conserved GLI binding sites in two predicted enhancer elements, CM5 and CM7
coding region of N-Myc
Wet-lab verification
• Selected predicted cis-modules for wet-lab verification
• Fused 1 kb DNA segment containing the predicted enhancer to a marker gene with a minimal promoter and generated transgenic embryos.
To conclude
• combinatorial vs probabilistic motifs
• significance of the findings for the applications => statistical modeling
• Want to do computational biology? Then find a good biologist who has good computational intuition.
Acknowledgements
• Mikko Koivisto
• Heikki Mannila
• Kimmo Palin
• Pasi Rastas
• Morris Michael
• Stefan Kurtz (Hamburg)
• Outi Hallikas (Biom)
• Jussi Taipale (Biom)
• Markus Perola (Biom)
• Hans Söderlund (VTT)
The BioSapiens project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health," contract number LHSG-CT-2003-503265.