In the search of motifs (and other hidden structures)

Esko Ukkonen

Department of Computer Science & Helsinki Institute of Information Technology HIIT

University of Helsinki

CPM 2005, Jeju, 21 June 2005

Uncover a hidden structure(?)

Motif?

• a pattern that occurs unexpectedly often in (a set of) strings

• pattern: substring, substring with gaps, string in generalized alphabet (e.g., IUPAC), HMMs, binding affinity matrix, cluster of binding affinity matrices,… (= the hidden structure to be learned from data)

• (unexpectedly: statistical modelling)

• occurrence: exact, approximate, with high probability, …

• strings ↔ applications: bioinformatics …

Plan of the talk

1. Gapped motifs in a string

2. Founder sequence reconstruction problem, with applications to haplotype analysis and genotype phasing (WABI 2002, ALT 2004, WABI 2005)

3. Uncovering gene enhancer elements

1. Gapped motifs

ATT HATTIVATTI

I#A HATTIVATTI

Substring motifs of a string S

• string S = s1 … sn in alphabet A

• Problem: what are the frequently occurring (ungapped) substrings of S? What is the longest substring that occurs at least q times?

• Thm: Suffix tree T(S) of S gives complete occurrence counts of all substring motifs of S in O(n) time (although S may have O(n^2) substrings!)
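The quorum question above can be tried out on a small string. The following is a minimal Python sketch and not the suffix tree construction the theorem refers to: the hypothetical helper `longest_repeat` sorts the suffixes naively (so it is far slower than O(n)) and relies on the fact that a substring occurring at least q times is a common prefix of q consecutive suffixes in sorted order.

```python
def longest_repeat(s, q):
    """Longest substring of s that occurs at least q times (naive sketch)."""
    suffixes = sorted(s[i:] for i in range(len(s)))
    best = ""
    # Any substring with >= q occurrences is a common prefix of some window
    # of q consecutive suffixes in sorted order.
    for i in range(len(suffixes) - q + 1):
        first, last = suffixes[i], suffixes[i + q - 1]
        k = 0
        while k < min(len(first), len(last)) and first[k] == last[k]:
            k += 1
        if k > len(best):
            best = first[:k]
    return best


print(longest_repeat("hattivatti", 2))  # atti
print(longest_repeat("hattivatti", 4))  # t
```

A real implementation would replace the sorted suffix list with T(S) to reach the linear bound stated in the theorem.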

T(S) is a full-text index

[Figure: suffix tree T(S) with the path for P leading to leaves 8 and 31]

P occurs in S at locations 8, 31, …

Path for P exists in T(S) ↔ P occurs in S

Counting the substring motifs

• internal nodes of T(S) ↔ repeating substrings of S

• number of leaves of the subtree of a node for string P = number of occurrences of P in S

T(hattivatti)

Suffixes: hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti, i

[Figure: suffix tree T(hattivatti)]

Substring motifs of hattivatti

[Figure: suffix tree T(hattivatti) with occurrence counts at the internal nodes: 2, 2, 2, 2 and 4]

Counts for the O(n) maximal motifs shown
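The counts in the figure can be reproduced without building the tree. The sketch below is a brute-force Python illustration (quadratic number of substrings, not the O(n) suffix tree method): it treats a substring as a branching node when at least two different symbols follow its occurrences. The helper name `maximal_substring_motifs` and the end marker `$` are assumptions made for this example.

```python
from collections import defaultdict

END = "$"  # assumed end marker; must not occur in the input string


def maximal_substring_motifs(s):
    """Map each right-branching repeated substring of s to its occurrence count."""
    counts = defaultdict(int)      # substring -> number of occurrences
    followers = defaultdict(set)   # substring -> symbols seen right after it
    text = s + END
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            counts[s[i:j]] += 1
            followers[s[i:j]].add(text[j])
    # Two different followers imply at least two occurrences and a branching node.
    return {sub: c for sub, c in counts.items() if len(followers[sub]) >= 2}


print(maximal_substring_motifs("hattivatti"))
```

For hattivatti this yields atti, tti, ti and i with two occurrences each and t with four, matching the five counts in the figure.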

Finding repeats in DNA

• human chromosome 3

• the first 48 999 930 bases

• 31 min cpu time (8 processors, 4 GB)

• Human genome: 3 × 10^9 bases

• T(HumanGenome) feasible

Longest repeat?

Occurrences at: 28395980, 28401554r

Length: 2559

ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaatgctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaacatgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcatagtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacatacgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaatcaccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgccattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggcttttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgtaagtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaagggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatagtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaattctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgtgtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattctctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagactttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctctt
ttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgtgccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaatttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtacgtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattttattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatgagttagg

Ten occurrences?

ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt

Length: 277

Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125

In the reversed complement at: 17858493, 41463059, 42431718, 42580925

Gapped motifs of S

• gapped pattern: P ∈ (A ∪ {#})*

• gap symbol # matches any symbol in A

• example: aa##bb#b

• L(P) = occurrences of P in S

• P is called a motif of S if |L(P)| > 1 and a motif with quorum q if |L(P)| ≥ q.

• Problem: find occurrence count |L(P)| for all gapped motifs P of S

• a^n b a^n has exponentially many motifs (M-F. Sagot)!
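The definitions of L(P) and of a motif with quorum q can be made concrete with a few lines of Python. This is a direct, naive sketch of the definitions (one scan per pattern), with hypothetical helper names `occurrences` and `is_motif` and 0-based positions chosen for the example.

```python
GAP = "#"


def occurrences(pattern, s):
    """L(P): all 0-based start positions where the gapped pattern matches s."""
    hits = []
    for start in range(len(s) - len(pattern) + 1):
        window = s[start:start + len(pattern)]
        if all(p == GAP or p == c for p, c in zip(pattern, window)):
            hits.append(start)
    return hits


def is_motif(pattern, s, q=2):
    """True if the pattern is a motif of s with quorum q (default: |L(P)| > 1)."""
    return len(occurrences(pattern, s)) >= q


print(occurrences("a###a", "aaaaabaaaaa"))  # [0, 2, 3, 4, 6]
print(is_motif("t#i", "hattivatti"))        # True
```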

Motifs vs self-alignments

• self-alignments of S => maximal motifs

align the occurrences

S

Motifs vs multiple self-alignments

• self-alignments of S => maximal motifs

expand if possible


• S = aaaaabaaaaa, P = a###a

[Figure: two copies of S = aaaaabaaaaa aligned against each other, with an occurrence of a###a marked]

• aaa#a#aaa is the maximal motif for this self-alignment

[Figure: the same self-alignment with the expanded motif aaa#a#aaa marked]
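How a pairwise self-alignment yields a motif can be sketched in Python. The example assumes the alignment is given by a shift between the two copies of S (here a shift of 2, which is what reproduces the motif on the slide); the helper name `self_alignment_motif` is hypothetical.

```python
def self_alignment_motif(s, shift):
    """Motif read off from aligning s against itself shifted by `shift` positions."""
    # A column keeps its symbol when both copies agree, otherwise it becomes a gap.
    columns = [a if a == b else "#" for a, b in zip(s, s[shift:])]
    # Gaps at either end carry no information, so they are trimmed.
    return "".join(columns).strip("#")


print(self_alignment_motif("aaaaabaaaaa", 2))  # aaa#a#aaa
```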

Maximal motifs

• multiple self-alignments of S ↔ maximal gapped motifs of S: the unanimous columns give the non-gap symbols of the motif

• any motif P has a unique maximal motif M(P) (align the occurrences and maximize); L(M(P)) = L(P) + d for a constant shift d

• unfortunately: a^n b a^n has exponentially many maximal motifs
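The "align the occurrences and maximize" step can be illustrated as follows. This is a minimal Python sketch under the assumption that P occurs at least once in S; the helper names `occurrences` and `maximal_motif` are made up for the example, and positions are 0-based.

```python
def occurrences(pattern, s):
    """L(P): 0-based start positions where the gapped pattern matches s."""
    return [
        i for i in range(len(s) - len(pattern) + 1)
        if all(p == "#" or p == c for p, c in zip(pattern, s[i:i + len(pattern)]))
    ]


def maximal_motif(pattern, s):
    """Return (M(P), d) such that L(M(P)) = L(P) + d; assumes P occurs in s."""
    occ = occurrences(pattern, s)
    lo = -min(occ)                   # leftmost column shared by all occurrences
    hi = len(s) - 1 - max(occ)       # rightmost column shared by all occurrences
    columns = []
    for offset in range(lo, hi + 1):
        symbols = {s[i + offset] for i in occ}
        # Unanimous columns give the non-gap symbols of the motif.
        columns.append(symbols.pop() if len(symbols) == 1 else "#")
    motif = "".join(columns)
    trimmed = motif.lstrip("#")
    d = lo + (len(motif) - len(trimmed))
    return trimmed.rstrip("#"), d


print(maximal_motif("t#i", "hattivatti"))     # ('atti', -1)
print(maximal_motif("a###a", "aaaaabaaaaa"))  # ('a###a', 0)
```

In the first call the motif t#i is extended by one symbol to the left, so its maximal motif atti occurs at L(t#i) shifted by d = -1.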

Blocks of maximal motifs

• aaa##b##ba has blocks aaa, b, ba

• Lemma: Maximal substring motifs (1-block motifs) ↔ (branching) nodes of T(S)

• Thm: Each block of a maximal motif of S is a maximal substring motif of S, hence there are O(n) different strings that can be used as a block of a maximal motif.

• Cor: There are O(n^(2k-1)) different maximal motifs with k blocks [O(n^(2k)) unrestricted motifs].

Counting 2-block maximal motifs

• Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).

Algorithm (very simple)

[Figure: 2-block motif (X,d,Y), blocks X and Y at distance d]

for each maximal substring motif X
    for each distance d = 1, 2, …
        mark the leaves of T(S) that correspond to locations L(X) + d
        for each maximal substring motif Y, find the number h(Y) of marked leaves in its subtree in T(S)
        the occurrence count of motif (X,d,Y) is h(Y)
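The loop structure of the algorithm can be sketched in Python. Several things here are assumptions for the sake of a runnable example: the suffix tree is replaced by a brute-force table of branching substrings, d is measured from the start of X to the start of Y and starts at |X| + 1 so that at least one gap symbol separates the blocks, and only candidates with h(Y) ≥ 2 are reported. The sketch does not check that the resulting 2-block pattern is itself maximal, and it does not achieve the O(n^3) bound.

```python
from collections import defaultdict


def maximal_substring_motifs(s):
    """Right-branching repeated substrings of s, each with its start positions."""
    starts = defaultdict(list)
    followers = defaultdict(set)
    text = s + "$"  # "$" is an assumed end marker that does not occur in s
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            starts[s[i:j]].append(i)
            followers[s[i:j]].add(text[j])
    return {sub: pos for sub, pos in starts.items() if len(followers[sub]) >= 2}


def two_block_counts(s):
    """Counts h(Y) for candidate 2-block motifs (X, d, Y) with h(Y) >= 2."""
    blocks = maximal_substring_motifs(s)
    counts = {}
    for x, x_starts in blocks.items():
        for d in range(len(x) + 1, len(s)):        # at least one gap symbol
            marked = {p + d for p in x_starts}     # locations L(X) + d
            for y, y_starts in blocks.items():
                h = sum(1 for p in y_starts if p in marked)
                if h >= 2:
                    counts[(x, d, y)] = h
    return counts


counts = two_block_counts("hattivatti")
print(counts[("t", 2, "i")])  # 2, i.e. the gapped pattern t#i occurs twice
```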

Running time: O(n) maximal substring motifs X, O(n) distances d for each X, and O(n) time to mark the leaves and count h(Y) for all Y, which gives O(n^3) in total.

Counting 2-block maximal motifs (cont)

• Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).

• flexible gaps: x*y, where * = gap of any length

• Thm: The occurrence counts for all maximal motifs with two blocks and one flexible gap can be found in (optimal) time O(n^2).

General case

• Q1: Given q and W, does S have a motif with at least W non-gap symbols and at least q occurrences?

• In the k-block case, is O(n^(2k-1)) (or even better) time possible?

• related work: A. Apostolico, M-F. Sagot, L. Parida, N. Pisanti, …

2. Founder reconstruction and applications

Haplotype evolution: founders and iterated recombinations

• WABI 2002

only recombinations; mutations not shown

founder haplotypes

current (observed) haplotypes

statistical models of recombination: average fragment length ~ 1/#generations

Uncovering founder sequences

• Problem: Given current sequences C (haplotypes), construct their ‘founders’ that produce the sequences by iterated recombinations using the minimum possible total number of cross-overs (i.e., the current sequences have a parse into the smallest possible number of fragments taken from the founders)

Example

0 0 1 0 0 0 0 1 0 0 1 1

1 1 1 1 1 1 1 0 0 1 1 0

0 0 1 0 1 1 1 1 0 1 1 0

1 1 1 1 0 0 1 0 0 0 1 1

Example

Current sequences:

0 0 1 0 0 0 0 1 0 0 1 1

1 1 1 1 1 1 1 0 0 1 1 0

0 0 1 0 1 1 1 1 0 1 1 0

1 1 1 1 0 0 1 0 0 0 1 1

Founders:

0 0 1 0 1 1 1 0 0 0 1 1

1 1 1 1 0 0 0 1 0 1 1 0

6 cross-overs

Example

Current sequences:

0 0 1 0 0 0 0 1 0 0 1 1

1 1 1 1 1 1 1 0 0 1 1 0

0 0 1 0 1 1 1 1 0 1 1 0

1 1 1 1 0 0 1 0 0 0 1 1

Founders:

0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1

18 cross-overs

OBS: two founders (colors) always suffice if no restrictions
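The cross-over counts in the two examples can be checked with a short Python sketch. It covers only the easy direction of the problem, namely counting the minimum number of cross-overs for a given set of founders by greedily extending each fragment as far as possible; it does not construct founders. The function names are hypothetical, and sequences are written as strings of 0 and 1 for compactness.

```python
def crossovers(sequence, founders):
    """Minimum number of cross-overs needed to parse `sequence` from `founders`."""
    alive = set(range(len(founders)))  # founders that can continue the fragment
    count = 0
    for col, symbol in enumerate(sequence):
        matching = {f for f, founder in enumerate(founders) if founder[col] == symbol}
        if not matching:
            raise ValueError(f"no founder has symbol {symbol!r} at column {col}")
        if alive & matching:
            alive &= matching          # extend the current fragment
        else:
            count += 1                 # start a new fragment: one cross-over
            alive = matching
    return count


def total_crossovers(sequences, founders):
    return sum(crossovers(seq, founders) for seq in sequences)


current = ["001000010011", "111111100110", "001011110110", "111100100011"]
print(total_crossovers(current, ["001011100011", "111100010110"]))  # 6
print(total_crossovers(current, ["000000000000", "111111111111"]))  # 18
```

The second call uses the all-0 and all-1 founders, which can always parse binary sequences but at the cost of many more cross-overs.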

Founder reconstruction problem

• given a set D of m sequences, construct M founder sequences that give D in the minimum number of cross-overs

• solution by dynamic programming, exponential time in m (WABI 2002)

• Q2: NP-hard?

Modeling a set of haplotypes by an HMM

• ’motif’ = Hidden Markov Model

• minimum description length (MDL) modeling

• ALT 2004

Hidden Markov Model (HMM)

• states i with emission alphabet Hi

• emission probabilities P(H), H ∈ Hi

• state transition probabilities wij

[Figure: states i and j joined by a transition with probability wij; each state has an emission distribution {P(H)}]
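As a reminder of how the three ingredients (states, emission probabilities, transition probabilities) combine, here is a generic forward-algorithm sketch in Python. It is a textbook HMM computation, not the haplotype fragmentation model of the talk: the emitted symbols are single characters, the initial state distribution is an extra assumption, and all numbers in the example are invented.

```python
def sequence_probability(observations, initial, transition, emission):
    """Forward algorithm: probability that the HMM emits `observations` (non-empty)."""
    states = range(len(initial))
    forward = [initial[i] * emission[i].get(observations[0], 0.0) for i in states]
    for symbol in observations[1:]:
        forward = [
            emission[j].get(symbol, 0.0)
            * sum(forward[i] * transition[i][j] for i in states)
            for j in states
        ]
    return sum(forward)


# Invented two-state example: transition[i][j] plays the role of w_ij.
initial = [0.5, 0.5]
transition = [[0.9, 0.1], [0.2, 0.8]]
emission = [{"1": 0.8, "2": 0.2}, {"1": 0.3, "2": 0.7}]
print(sequence_probability("1121", initial, transition, emission))
```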

Conserved fragments and parses

• haplotypes 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2

• parse 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2

• conserved fragments 1 1 1 1 2 1 2 1 2 1 1 2 1 1 1 1 2 2 1 2 2 2 2 1

• fragmentation model (HMM) 1 1 1 1

2 1 2 1 2

1 1 2 1 1 1 1 2 2 1 2 2 2 2

Lactose tolerance

• recent finding in Finnish population: an SNP C/T-13910, 14 kb upstream from the lactase gene, associates completely with lactose intolerance

• two datasets over 23 SNPs in the vicinity of this SNP

• lactose intolerant persons: 21 haplotypes

• lactose tolerant persons: 38 haplotypes

Case/control study by HMM

Lactose intolerant (~6 fragments per haplotype)

Lactose tolerant (2 fragments per haplotype => young)

Genotype phasing via founders using an HMM

• the genotype phasing problem: given a set of genotypes, find their resolving haplotype pairs

• find at most M founders that produce resolving haplotype pairs in minimum possible number of cross-overs => relatively good haplotyping method

• improved results with a related HMM, trained with the Expectation Maximization algorithm

• WABI 2005
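What "resolving haplotype pairs" means can be shown with a small Python sketch. The genotype encoding is an assumption made for this example (0 and 1 for the two homozygous cases, 2 for a heterozygous site), and the hypothetical helper `resolving_pairs` merely enumerates all candidates; choosing among them, via founders or an HMM, is the actual phasing problem and is not attempted here.

```python
from itertools import product


def resolving_pairs(genotype):
    """All unordered haplotype pairs that resolve a genotype (assumed 0/1/2 coding)."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    if not het:
        haplotype = tuple(genotype)
        return [(haplotype, haplotype)]
    pairs = []
    # Fix the first heterozygous site to 0 in the first haplotype so that
    # each unordered pair is listed exactly once.
    for bits in product((0, 1), repeat=len(het) - 1):
        first, second = list(genotype), list(genotype)
        for site, bit in zip(het, (0,) + bits):
            first[site], second[site] = bit, 1 - bit
        pairs.append((tuple(first), tuple(second)))
    return pairs


for pair in resolving_pairs([0, 2, 1, 2]):
    print(pair)
```

A genotype with k heterozygous sites has 2^(k-1) resolving pairs, which is why additional structure such as a small founder set is needed to pick one.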

HMM for haplotyping

[Figure: HMM for haplotyping, with labels for the transition probability distributions and the emission probability distribution]

Example HMM

3. Uncovering gene enhancer elements

Introduction

• Gene expression regulation in multicellular organisms is controlled in a combinatorial fashion by so-called transcription factors.

• Transcription factors bind to DNA cis-elements on enhancer modules (promoters), and multiple factors need to bind to activate the module.

• In mammals, the modules are few and far between.

• The problem: locate functional regulatory modules.

Gene regulation

[Figure: DNA (promoter1 gene1, promoter2 gene2, promoter3 gene3, promoter4 gene4) → transcription → RNA → translation → proteins; transcription factors]

Model of cell type specific regulation of target gene expression

[Figure: model with two cases, common targets (e.g. Patched) and cell type specific targets (e.g. N-myc); labels: GLI (ubiquitously expressed TF), X and Y (tissue specific TFs), transcription]

Binding affinity matrices

• The cis-elements are represented by affinity matrices.
– A column per position
– A row per nucleotide

• Discovered:
– Computationally
– Traditional wet lab
– Microarrays

Example matrix (4 rows × 8 columns):

 9 11 49 51  0  1  1  4
19  3  0  0  0 45 25 16
 5  1  2  0 17  0  4 21
18 36  0  0 34  5 21 10
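One common way to use such a matrix is to turn the counts into log-odds weights and score candidate sites. The Python sketch below does this for the 32 numbers on the slide, read as 4 rows of 8 columns (each column then sums to 51). Several things are assumptions, not statements from the talk: the row order A, C, G, T, the pseudocount of 1, the uniform background of 0.25, and the helper names.

```python
import math

NUCLEOTIDES = "ACGT"  # ASSUMPTION: row order; the slide does not label the rows
COUNTS = [
    [9, 11, 49, 51, 0, 1, 1, 4],
    [19, 3, 0, 0, 0, 45, 25, 16],
    [5, 1, 2, 0, 17, 0, 4, 21],
    [18, 36, 0, 0, 34, 5, 21, 10],
]


def log_odds(counts, pseudocount=1.0, background=0.25):
    """Convert a count matrix into log2 odds against a uniform background."""
    width = len(counts[0])
    totals = [sum(row[c] for row in counts) for c in range(width)]
    return [
        [math.log2((row[c] + pseudocount) / (totals[c] + 4 * pseudocount) / background)
         for c in range(width)]
        for row in counts
    ]


def score(site, weights):
    """Sum of the per-position weights of a site of the same width as the matrix."""
    return sum(weights[NUCLEOTIDES.index(base)][c] for c, base in enumerate(site))


def consensus(counts):
    """Most frequent nucleotide in every column (under the assumed row order)."""
    width = len(counts[0])
    return "".join(
        NUCLEOTIDES[max(range(4), key=lambda r: counts[r][c])] for c in range(width)
    )


weights = log_odds(COUNTS)
best = consensus(COUNTS)
print(best, round(score(best, weights), 2))
```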

Finding preserved motifs of binding sites

• looking at one (human) genome gives too many positives

• comparative approach: take the 200 kb regions surrounding the same genes (paralogs and orthologs) of different species (human, mouse, chicken, …), find preserved clusters (motifs) of binding sites

• Smith-Waterman type algorithm with a novel scoring function
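For readers unfamiliar with the underlying recurrence, the following Python sketch shows plain Smith-Waterman local alignment applied to two sequences of binding-site labels. It is only the standard algorithm with placeholder scores (match 2, mismatch -1, linear gap -1); the novel scoring function mentioned on the slide is not described there and is not reproduced here. The factor names in the example are illustrative.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best Smith-Waterman local alignment score of sequences a and b."""
    previous = [0] * (len(b) + 1)
    best = 0
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            diagonal = previous[j - 1] + (match if x == y else mismatch)
            cell = max(0, diagonal, previous[j] + gap, current[j - 1] + gap)
            current.append(cell)
            best = max(best, cell)
        previous = current
    return best


# Binding sites along two orthologous regions, written as factor names.
human = ["Z", "GLI", "X", "Y", "W"]
mouse = ["GLI", "X", "Y"]
print(local_alignment_score(human, mouse))  # 6
```

Because the function only compares elements for equality, it works on strings of nucleotides as well as on lists of site labels.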

Whole genome comparisons

• Whole genomes can be analyzed with our implementation

• We have compared human genes to orthologs in mouse, rat, chicken, fugu, tetraodon and zebrafish
– 100 kbp flanking regions on both sides of the gene.
– Coding regions masked out.
– About 20 000 comparisons for each pair of species.
– About 2 min each

Enhancer prediction for N-myc

[Figure: the 200 kb Mouse N-Myc genomic region plotted against the 200 kb Human N-Myc genomic region]

Conserved GLI binding sites in two predicted enhancer elements, CM5 and CM7

coding region of N-Myc

Wet-lab verification

• Selected predicted cis-modules for wet-lab verification

• Fused a 1 kb DNA segment containing the predicted enhancer to a marker gene with a minimal promoter and generated transgenic embryos.

To conclude

• combinatorial vs probabilistic motifs

• significance of the findings for the applications => statistical modeling

• Want to do computational biology? Then find a good biologist who has good computational intuition.

Acknowledgements

• Mikko Koivisto
• Heikki Mannila
• Kimmo Palin
• Pasi Rastas
• Morris Michael
• Stefan Kurtz (Hamburg)
• Outi Hallikas (Biom)
• Jussi Taipale (Biom)
• Markus Perola (Biom)
• Hans Söderlund (VTT)

The BioSapiens project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health," contract number LHSG-CT-2003-503265.

http://www.cordis.lu/life/

