in the search of motifs (and other hidden structures)

In the search of motifs (and other hidden structures) Esko Ukkonen Department of Computer Science & Helsinki Institute of Information Technology HIIT University of Helsinki CPM 2005, Jeju, 21 June 2005

Page 1: In the search of motifs (and other hidden structures)

In the search of motifs (and other hidden structures)

Esko Ukkonen

Department of Computer Science & Helsinki Institute of Information Technology HIIT

University of Helsinki

CPM 2005, Jeju, 21 June 2005

Page 2: In the search of motifs (and other hidden structures)
Page 3: In the search of motifs (and other hidden structures)

Uncover a hidden structure(?)

Page 4: In the search of motifs (and other hidden structures)

Motif?

• a pattern that occurs unexpectedly often in (a set of) strings

• pattern: substring, substring with gaps, string in generalized alphabet (e.g., IUPAC), HMMs, binding affinity matrix, cluster of binding affinity matrices,… (= the hidden structure to be learned from data)

• (unexpectedly: statistical modelling)

• occurrence: exact, approximate, with high probability, …

• strings ↔ applications: bioinformatics …

Page 5: In the search of motifs (and other hidden structures)

Plan of the talk

1. Gapped motifs in a string

2. Founder sequence reconstruction problem, with applications to haplotype analysis and genotype phasing (WABI 2002, ALT 2004, WABI 2005)

3. Uncovering gene enhancer elements

Page 6: In the search of motifs (and other hidden structures)

1. Gapped motifs

Page 7: In the search of motifs (and other hidden structures)

ATT HATTIVATTI

I#A HATTIVATTI

Page 8: In the search of motifs (and other hidden structures)

Substring motifs of a string S

• string S = s_1 … s_n in alphabet A.

• Problem: what are the frequently occurring (ungapped) substrings of S? Longest substring that occurs at least q times?

• Thm: Suffix tree T(S) of S gives complete occurrence counts of all substring motifs of S in O(n) time (although S may have O(n^2) substrings!)
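
The "longest substring that occurs at least q times" question can be illustrated without a suffix tree. The following is only a sketch: it sorts the suffixes naively, which is much slower than the O(n) suffix tree of the theorem, and the function name `longest_repeat` is made up for this example.

```python
def longest_repeat(s, q=2):
    """Return a longest substring of s occurring at least q times (overlaps allowed)."""
    # Naive suffix sorting stands in for the suffix tree T(S); this is NOT linear time.
    order = sorted(range(len(s)), key=lambda i: s[i:])
    best = ""
    # Suffixes sharing a prefix are contiguous in sorted order, so a substring with
    # at least q occurrences is a common prefix of sorted suffixes k and k + q - 1.
    for k in range(len(order) - q + 1):
        a, b = s[order[k]:], s[order[k + q - 1]:]
        lcp = 0
        while lcp < min(len(a), len(b)) and a[lcp] == b[lcp]:
            lcp += 1
        if lcp > len(best):
            best = a[:lcp]
    return best


print(longest_repeat("hattivatti"))       # quorum 2
print(longest_repeat("hattivatti", q=4))  # quorum 4
```

For the running example hattivatti this reports atti for q = 2 and t for q = 4.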

Page 9: In the search of motifs (and other hidden structures)

T(S) is full text index

[Figure: the suffix tree T(S); the path spelling P ends above a subtree whose leaves are labelled 31, 8, …]

P occurs in S at locations 8, 31, …

Path for P exists in T(S) ↔ P occurs in S

Page 10: In the search of motifs (and other hidden structures)

Counting the substring motifs

• internal nodes of T(S) ↔ repeating substrings of S

• number of leaves of the subtree of a node for string P = number of occurrences of P in S
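
The correspondence between branching nodes and repeated substrings can be checked on a small string by brute force. This sketch enumerates all substrings (quadratically many, so it is only for illustration) and keeps those followed by at least two different symbols, counting the end of the string as a symbol `$`; that is an assumed stand-in for walking the internal nodes of T(S). The name `maximal_substring_motifs` is hypothetical.

```python
from collections import Counter


def maximal_substring_motifs(s, end="$"):
    """Map each branching substring of s to its number of occurrences.

    A substring is 'branching' here if its occurrences are followed by at
    least two distinct symbols, where `end` (assumed absent from s) marks
    the end of the string. These correspond to internal nodes of T(S).
    """
    t = s + end
    counts = Counter()
    followers = {}
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            w = s[i:j]
            counts[w] += 1                         # one more occurrence of w
            followers.setdefault(w, set()).add(t[j])  # symbol right after it
    return {w: counts[w] for w in counts if len(followers[w]) >= 2}


print(maximal_substring_motifs("hattivatti"))
```

For hattivatti the result contains atti, tti, ti and i with count 2 and t with count 4.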

Page 11: In the search of motifs (and other hidden structures)

T(hattivatti)

[Figure: the suffix tree of hattivatti, built from its suffixes hattivatti, attivatti, ttivatti, tivatti, ivatti, vatti, atti, tti, ti and i.]

Page 12: In the search of motifs (and other hidden structures)

Substring motifs of hattivatti

[Figure: the suffix tree T(hattivatti) with occurrence counts at its internal nodes; the counts shown are 2, 2, 2, 2 and 4.]

Counts for the O(n) maximal motifs shown

Page 13: In the search of motifs (and other hidden structures)

Finding repeats in DNA

• human chromosome 3

• the first 48 999 930 bases

• 31 min cpu time (8 processors, 4 GB)

• Human genome: 3 × 10^9 bases

• T(HumanGenome) feasible

Page 14: In the search of motifs (and other hidden structures)

Longest repeat?

Occurrences at: 28395980, 28401554r Length: 2559

ttagggtacatgtgcacaacgtgcaggtttgttacatatgtatacacgtgccatgatggtgtgctgcacccattaactcgtcatttagcgttaggtatatctccgaatgctatccctcccccctccccccaccccacaacagtccccggtgtgtgatgttccccttcctgtgtccatgtgttctcattgttcaattcccacctatgagtgagaacatgcggtgtttggttttttgtccttgcgaaagtttgctgagaatgatggtttccagcttcatccatatccctacaaaggacatgaactcatcatttttttatggctgcatagtattccatggtgtatatgtgccacattttcttaacccagtctacccttgttggacatctgggttggttccaagtctttgctattgtgaatagtgccgcaataaacatacgtgtgcatgtgtctttatagcagcatgatttataatcctttgggtatatacccagtaatgggatggctgggtcaaatggtatttctagttctagatccctgaggaatcaccacactgacttccacaatggttgaactagtttacagtcccagcaacagttcctatttctccacatcctctccagcacctgttgtttcctgactttttaatgatcgccattctaactggtgtgagatggtatctcattgtggttttgatttgcatttctctgatggccagtgatgatgagcattttttcatgtgttttttggctgcataaatgtcttcttttgagaagtgtctgttcatatccttcgcccacttttgatggggttgtttgtttttttcttgtaaatttgttggagttcattgtagattctgggtattagccctttgtcagatgagtaggttgcaaaaattttctcccattctgtaggttgcctgttcactctgatggtggtttcttctgctgtgcagaagctctttagtttaattagatcccatttgtcaattttggcttttgttgccatagcttttggtgttttagacatgaagtccttgcccatgcctatgtcctgaatggtattgcctaggttttcttctagggtttttatggttttaggtctaacatgtaagtctttaatccatcttgaattaattataaggtgtatattataaggtgtaattataaggtgtataattatatattaattataaggtgtatattaattataaggtgtaaggaagggatccagtttcagctttctacatatggctagccagttttccctgcaccatttattaaatagggaatcctttccccattgcttgtttttgtcaggtttgtcaaagatcagatagttgtagatatgcggcattatttctgagggctctgttctgttccattggtctatatctctgttttggtaccagtaccatgctgttttggttactgtagccttgtagtatagtttgaagtcaggtagcgtgatggttccagctttgttcttttggcttaggattgacttggcaatgtgggctcttttttggttccatatgaactttaaagtagttttttccaattctgtgaagaaattcattggtagcttgatggggatggcattgaatctataaattaccctgggcagtatggccattttcacaatattgaatcttcctacccatgagcgtgtactgttcttccatttgtttgtatcctcttttatttcattgagcagtggtttgtagttctccttgaagaggtccttcacatcccttgtaagttggattcctaggtattttattctctttgaagcaattgtgaatgggagttcactcatgatttgactctctgtttgtctgttattggtgtataagaatgcttgtgatttttgcacattgattttgtatcctgagactttgctgaagttgcttatcagcttaaggagattttgggctgagacgatggggttttctagatatacaatcatgtcatctgcaaacagggacaatttgacttcctctt
ttcctaattgaatacccgttatttccctctcctgcctgattgccctggccagaacttccaacactatgttgaataggagtggtgagagagggcatccctgtcttgtgccagttttcaaagggaatgcttccagtttttgtccattcagtatgatattggctgtgggtttgtcatagatagctcttattattttgagatacatcccatcaatacctaatttattgagagtttttagcatgaagagttcttgaattttgtcaaaggccttttctgcatcttttgagataatcatgtggtttctgtctttggttctgtttatatgctggagtacgtttattgattttcgtatgttgaaccagccttgcatcccagggatgaagcccacttgatcatggtggataagctttttgatgtgctgctggattcggtttgccagtattttattgaggatttctgcatcgatgttcatcaaggatattggtctaaaattctctttttttgttgtgtctctgtcaggctttggtatcaggatgatgctggcctcataaaatgagttagg

Page 15: In the search of motifs (and other hidden structures)

Ten occurrences?

ttttttttttttttgagacggagtctcgctctgtcgcccaggctggagtgcagtggcgggatctcggctcactgcaagctccgcctcccgggttcacgccattctcctgcctcagcctcccaagtagctgggactacaggcgcccgccactacgcccggctaattttttgtatttttagtagagacggggtttcaccgttttagccgggatggtctcgatctcctgacctcgtgatccgcccgcctcggcctcccaaagtgctgggattacaggcgt

Length: 277

Occurrences at: 10130003, 11421803, 18695837, 26652515, 42971130, 47398125

In the reversed complement at: 17858493, 41463059, 42431718, 42580925
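
Occurrences "in the reversed complement" can be found by searching for the reverse complement of the repeat in the same text. The sketch below assumes lowercase DNA over a, c, g, t and uses plain string search instead of a suffix tree; the function names are made up, and the coordinate convention used on the slide for reverse-strand hits is not known, so positions here are simply 0-based starts in the forward text.

```python
COMPLEMENT = str.maketrans("acgt", "tgca")


def reverse_complement(dna):
    """Complement every base, then reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def occurrences(text, pattern):
    """All 0-based start positions of pattern in text, overlaps included."""
    hits, i = [], text.find(pattern)
    while i != -1:
        hits.append(i)
        i = text.find(pattern, i + 1)
    return hits


def both_strand_occurrences(text, pattern):
    """Return (forward hits, hits of the reverse complement of pattern)."""
    return occurrences(text, pattern), occurrences(text, reverse_complement(pattern))


print(both_strand_occurrences("ttagggaaccctaa", "ttaggg"))
```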

Page 16: In the search of motifs (and other hidden structures)

Gapped motifs of S

• gapped pattern: P ∈ (A ∪ {#})*

• gap symbol # matches any symbol in A

• aa##bb#b

• L(P) = occurrences of P in S

• P is called a motif of S if |L(P)| > 1 and a motif with quorum q if |L(P)| ≥ q.

• Problem: find occurrence count |L(P)| for all gapped motifs P of S

• a^n b a^n has exponentially many motifs (M-F. Sagot)!
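
The definitions of L(P) and of a motif with quorum q translate directly into a few lines of code. This is a minimal sketch with hypothetical function names; positions are 0-based and the scan is the obvious O(|S|·|P|) one.

```python
def occurrences(s, p, gap="#"):
    """L(P): 0-based positions where gapped pattern p matches s; gap matches any symbol."""
    return [i for i in range(len(s) - len(p) + 1)
            if all(c == gap or c == s[i + k] for k, c in enumerate(p))]


def is_motif(s, p, q=2):
    """p is a motif of s with quorum q if it has at least q occurrences."""
    return len(occurrences(s, p)) >= q


print(occurrences("aaaaabaaaaa", "a###a"))
print(is_motif("hattivatti", "att"), is_motif("hattivatti", "i#a"))
```

With q = 2 the quorum test coincides with the plain motif condition |L(P)| > 1.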

Page 17: In the search of motifs (and other hidden structures)

Motifs vs self-alignments

• self-alignments of S => maximal motifs

align the occurrences

S

Page 18: In the search of motifs (and other hidden structures)

Motifs vs multiple self-alignments

• self-alignments of S => maximal motifs

expand if possible

Page 19: In the search of motifs (and other hidden structures)

Motifs vs self-alignments

• S = aaaaabaaaaa, P = a###a

[Figure: copies of S = aaaaabaaaaa shifted against each other, with P = a###a placed over the alignment.]

Page 21: In the search of motifs (and other hidden structures)

Motifs vs self-alignments

• S = aaaaabaaaaa, P = a###a

• aaa#a#aaa is maximal motif for this self-alignment

  aaaaabaaaaa
    aaaaabaaaaa
    aaa#a#aaa
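
The motif induced by a pairwise self-alignment can be read off column by column: a column where both copies agree keeps its symbol, any other column becomes a gap. A small sketch, assuming the alignment is given as a shift of S against itself (the function name and the `shift` parameter are illustrative only):

```python
def self_alignment_motif(s, shift, gap="#"):
    """Motif of s aligned with a copy of itself moved `shift` positions to the right."""
    # zip pairs s[i] with s[i + shift] over the overlapping columns.
    columns = [a if a == b else gap for a, b in zip(s, s[shift:])]
    # Gaps at either end carry no information, so they are trimmed.
    return "".join(columns).strip(gap)


print(self_alignment_motif("aaaaabaaaaa", 2))
```

For S = aaaaabaaaaa and shift 2 this gives aaa#a#aaa, the motif named on the slide.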

Page 22: In the search of motifs (and other hidden structures)

Maximal motifs

• multiple self-alignments of S ↔ maximal gapped motifs of S: the unanimous columns give the non-gap symbols of the motif

• any motif P has a unique maximal motif M(P) (align the occurrences and maximize); L(M(P)) = L(P) + d

• unfortunately: a^n b a^n has exponentially many maximal motifs
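
"Align the occurrences and maximize" can be sketched as follows: collect L(P), look at every column offset that is defined for all occurrences, and keep the unanimous columns. The names are hypothetical and the code is a brute-force illustration, not the construction from the talk; the returned positions are L(P) shifted by a common offset d, in line with L(M(P)) = L(P) + d.

```python
def occurrences(s, p, gap="#"):
    """0-based positions where gapped pattern p matches s."""
    return [i for i in range(len(s) - len(p) + 1)
            if all(c == gap or c == s[i + k] for k, c in enumerate(p))]


def maximal_motif(s, p, gap="#"):
    """Return (M(P), L(M(P))) for a pattern p with at least one occurrence in s."""
    occ = occurrences(s, p, gap)
    if not occ:
        raise ValueError("pattern does not occur")
    lo, hi = -occ[0], len(s) - occ[-1]      # offsets valid for every occurrence
    columns = []
    for d in range(lo, hi):
        symbols = {s[i + d] for i in occ}   # the aligned column at offset d
        columns.append(symbols.pop() if len(symbols) == 1 else gap)
    motif = "".join(columns)
    leading = len(motif) - len(motif.lstrip(gap))
    shift = lo + leading                    # the offset d in L(M(P)) = L(P) + d
    return motif.strip(gap), [i + shift for i in occ]


print(maximal_motif("hattivatti", "t#i"))
print(maximal_motif("aaaaabaaaaa", "a###a"))
```

In hattivatti the pattern t#i expands to atti with the occurrence list shifted by d = -1, while a###a in aaaaabaaaaa is already maximal.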

Page 23: In the search of motifs (and other hidden structures)

Blocks of maximal motifs

• aaa##b##ba has blocks aaa, b, ba

• Lemma: Maximal substring motifs (1-block motifs) ↔ (branching) nodes of T(S)

• Thm: Each block of a maximal motif of S is a maximal substring motif of S, hence there are O(n) different strings that can be used as a block of a maximal motif.

• Cor: There are O(n^(2k-1)) different maximal motifs with k blocks [O(n^(2k)) unrestricted motifs].

Page 24: In the search of motifs (and other hidden structures)

Counting 2-block maximal motifs

• Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).

Page 25: In the search of motifs (and other hidden structures)

Algorithm (very simple)

[Figure: a 2-block motif (X, d, Y): blocks X and Y at distance d.]

for each maximal substring motif X
    for each distance d = 1, 2, …
        mark the leaves of T(S) that correspond to locations L(X) + d
        for each maximal substring motif Y, find the number h(Y) of marked leaves in its subtree in T(S)
        the occurrence count of motif (X, d, Y) is h(Y)

Page 26: In the search of motifs (and other hidden structures)

Algorithm (very simple)

[Figure: a 2-block motif (X, d, Y): blocks X and Y at distance d.]

for each maximal substring motif X                    O(n) choices
    for each distance d = 1, 2, …                     O(n) choices
        mark the leaves of T(S) that correspond to locations L(X) + d
        for each maximal substring motif Y, find the number h(Y) of marked leaves in its subtree in T(S)        O(n) in total
        the occurrence count of motif (X, d, Y) is h(Y)

The three O(n) factors multiply to the O(n^3) total.
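
The loop structure of this algorithm can be tried out on a short string. The sketch below makes several assumptions that are not on the slide: the suffix tree is replaced by brute-force occurrence lists, d is taken as the distance from the start of X to the start of Y and must exceed |X| so that at least one gap symbol separates the blocks, and only triples reaching a quorum are reported. It is therefore far from the O(n^3) bound and does not filter out non-maximal triples.

```python
from collections import defaultdict


def maximal_substring_motifs(s, end="$"):
    """Branching substrings of s mapped to their sorted 0-based occurrence lists."""
    t = s + end                       # `end` is assumed not to occur in s
    occ = defaultdict(list)
    for i in range(len(s)):
        for j in range(i + 1, len(s) + 1):
            occ[s[i:j]].append(i)
    return {w: pos for w, pos in occ.items()
            if len({t[p + len(w)] for p in pos}) >= 2}


def two_block_counts(s, quorum=2):
    """Occurrence counts h(Y) of 2-block motifs (X, d, Y) with at least `quorum` hits."""
    motifs = maximal_substring_motifs(s)
    counts = {}
    for x, lx in motifs.items():
        for d in range(len(x) + 1, len(s)):
            marked = {p + d for p in lx}               # locations L(X) + d
            for y, ly in motifs.items():
                h = sum(1 for p in ly if p in marked)  # marked occurrences of Y
                if h >= quorum:
                    counts[(x, d, y)] = h
    return counts


for motif, h in sorted(two_block_counts("hattivatti").items()):
    print(motif, h)
```

For hattivatti the output includes ("t", 2, "i") with count 2, which is the gapped motif t#i.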

Page 27: In the search of motifs (and other hidden structures)

Counting 2-block maximal motifs (cont)

• Thm: The occurrence counts for all maximal motifs with two blocks can be found in (optimal) time O(n^3).

• flexible gaps: x*y, where * = gap of any length

• Thm: The occurrence counts for all maximal motifs with two blocks and one flexible gap can be found in (optimal) time O(n^2).

Page 28: In the search of motifs (and other hidden structures)

General case

• Q1: Given q and W, does S have a motif with at least W non-gap symbols and at least q occurrences?

• In k-block case, is O(n^(2k-1)) (or even better) time possible?

• related work: A. Apostolico, M-F. Sagot, L. Parida, N. Pisanti, …

Page 29: In the search of motifs (and other hidden structures)

2. Founder reconstruction and applications

Page 30: In the search of motifs (and other hidden structures)

Haplotype evolution: founders and iterated recombinations

• WABI 2002

Page 31: In the search of motifs (and other hidden structures)

only recombinations; mutations not shown

founder haplotypes

current (observed) haplotypes

Page 32: In the search of motifs (and other hidden structures)
Page 33: In the search of motifs (and other hidden structures)

statistical models of recombination: average fragment length ~ 1/#generations

Page 34: In the search of motifs (and other hidden structures)
Page 35: In the search of motifs (and other hidden structures)

Uncovering founder sequences

• Problem: Given current sequences C (haplotypes), construct their ‘founders’ that produce the sequences by iterated recombinations using minimum possible total number of cross-overs (i.e., current sequences have a parse into smallest possible number of fragments taken from the founders)

Page 36: In the search of motifs (and other hidden structures)

Example

0 0 1 0 0 0 0 1 0 0 1 1

1 1 1 1 1 1 1 0 0 1 1 0

0 0 1 0 1 1 1 1 0 1 1 0

1 1 1 1 0 0 1 0 0 0 1 1

Page 38: In the search of motifs (and other hidden structures)

Example

Current sequences:

0 0 1 0 0 0 0 1 0 0 1 1

1 1 1 1 1 1 1 0 0 1 1 0

0 0 1 0 1 1 1 1 0 1 1 0

1 1 1 1 0 0 1 0 0 0 1 1

Founders:

0 0 1 0 1 1 1 0 0 0 1 1

1 1 1 1 0 0 0 1 0 1 1 0

6 cross-overs

Page 40: In the search of motifs (and other hidden structures)

Example

Current sequences:

0 0 1 0 0 0 0 1 0 0 1 1

1 1 1 1 1 1 1 0 0 1 1 0

0 0 1 0 1 1 1 1 0 1 1 0

1 1 1 1 0 0 1 0 0 0 1 1

Founders:

0 0 0 0 0 0 0 0 0 0 0 0

1 1 1 1 1 1 1 1 1 1 1 1

18 cross-overs

OBS: two founders (colors) always suffice if no restrictions
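
The cross-over totals in the two examples can be recomputed by parsing each current sequence into fragments of the given founders. The sketch below only evaluates a fixed founder set with a small dynamic program per sequence; it does not search for the best founders, and the function names are made up for illustration.

```python
INF = float("inf")


def min_crossovers(founders, seq):
    """Fewest founder switches needed to spell seq position by position."""
    # cost[f] = fewest cross-overs for the prefix read so far, ending on founder f
    cost = [0 if f[0] == seq[0] else INF for f in founders]
    for j in range(1, len(seq)):
        best = min(cost)
        cost = [min(c, best + 1) if f[j] == seq[j] else INF
                for f, c in zip(founders, cost)]
    return min(cost)  # INF if seq cannot be parsed from these founders


def total_crossovers(founders, sequences):
    return sum(min_crossovers(founders, s) for s in sequences)


current = ["001000010011", "111111100110", "001011110110", "111100100011"]
print(total_crossovers(["001011100011", "111100010110"], current))
print(total_crossovers(["000000000000", "111111111111"], current))
```

Run on the four current sequences, the first founder pair yields 6 cross-overs and the all-0/all-1 pair yields 18, matching the slides.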

Page 41: In the search of motifs (and other hidden structures)

Founder reconstruction problem

• given a set D of m sequences, construct M founder sequences that give D in minimum number of cross-overs

• solution by dynamic programming, exponential time in m (WABI 2002)

• Q2: NP-hard?

Page 42: In the search of motifs (and other hidden structures)

Modeling a set of haplotypes by a HMM

• ’motif’ = Hidden Markov Model

• minimum description length (MDL) modeling

• ALT 2004

Page 43: In the search of motifs (and other hidden structures)

Hidden Markov Model (HMM)

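• states i with emission alphabet H_i

• emission probabilities P(H), H ∈ H_i

• state transition probabilities w_ij

[Figure: states i and j connected by a transition labelled w_ij; a state carries its emission distribution {P(H)}.]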
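
How the transition probabilities w_ij and the emission probabilities P(H) combine can be shown on a toy model. Everything in this sketch is hypothetical: the state names, the emitted fragments and all numbers are invented, and the path is assumed to start in its first state with probability 1.

```python
# Hypothetical toy model; none of these values come from the talk.
TRANSITIONS = {("s1", "s2"): 0.5, ("s1", "s3"): 0.5}   # w_ij
EMISSIONS = {                                          # P(H) for H in H_i
    "s1": {"1111": 1.0},
    "s2": {"21212": 0.75, "11211": 0.25},
    "s3": {"2222": 1.0},
}


def path_probability(path, emitted, transitions=TRANSITIONS, emissions=EMISSIONS):
    """Probability of following `path` while emitting the fragments in `emitted`."""
    p = emissions[path[0]].get(emitted[0], 0.0)
    for prev, cur, h in zip(path, path[1:], emitted[1:]):
        # one transition w_ij followed by one emission P(H) in the new state
        p *= transitions.get((prev, cur), 0.0) * emissions[cur].get(h, 0.0)
    return p


print(path_probability(["s1", "s2"], ["1111", "21212"]))
```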

Page 44: In the search of motifs (and other hidden structures)

Conserved fragments and parses

• haplotypes 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2

• parse 1 1 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2

• conserved fragments 1 1 1 1 2 1 2 1 2 1 1 2 1 1 1 1 2 2 1 2 2 2 2 1

• fragmentation model (HMM) 1 1 1 1

2 1 2 1 2

1 1 2 1 1 1 1 2 2 1 2 2 2 2

Page 45: In the search of motifs (and other hidden structures)

Lactose tolerance

• recent finding in Finnish population: an SNP C/T-13910, 14 kb upstream from the lactase gene, associates completely with lactose intolerance

• two datasets over 23 SNPs in the vicinity of this SNP

• lactose intolerant persons: 21 haplotypes

• lactose tolerant persons: 38 haplotypes

Page 46: In the search of motifs (and other hidden structures)

Case/control study by HMM

Lactose intolerant (~6 fragments per haplotype)

Lactose tolerant (2 fragments per haplotype => young)

Page 47: In the search of motifs (and other hidden structures)

Genotype phasing via founders using a HMM

• the genotype phasing problem: given a set of genotypes, find their resolving haplotype pairs

• find at most M founders that produce resolving haplotype pairs in minimum possible number of cross-overs => relatively good haplotyping method

• improved results with a related HMM, trained with the Expectation Maximization algorithm

• WABI 2005

Page 48: In the search of motifs (and other hidden structures)

HMM for haplotyping

[Figure: HMM for haplotyping, annotated with transition probability distributions and an emission probability distribution.]

Page 49: In the search of motifs (and other hidden structures)

Example HMM

Page 50: In the search of motifs (and other hidden structures)
Page 51: In the search of motifs (and other hidden structures)

3. Uncovering gene enhancer elements

Page 52: In the search of motifs (and other hidden structures)

Introduction

• Gene expression regulation in multicellular organisms is controlled in combinatorial fashion by so-called transcription factors.

• Transcription factors bind to DNA cis-elements on enhancer modules (promoters), and multiple factors need to bind to activate the module.

• In mammals, the modules are few and far between.

• The problem: Locate functional regulatory modules.

Page 53: In the search of motifs (and other hidden structures)

Gene regulation

[Figure: DNA (promoter1 gene1, promoter2 gene2, promoter3 gene3, promoter4 gene4) → transcription → RNA → translation → Proteins, with transcription factors labelled.]

Page 54: In the search of motifs (and other hidden structures)

Model of cell type specific regulation of target gene expression

[Figure: GLI, the ubiquitously expressed TF, acts on common targets (e.g. Patched); GLI together with tissue specific TFs X and Y acts on cell type specific targets (e.g. N-myc); both lead to transcription.]

Page 55: In the search of motifs (and other hidden structures)

Binding affinity matrices

• The cis-elements are represented by affinity matrices.
  – A column per position
  – A row per nucleotide

• Discovered:
  – Computationally
  – Traditional wet lab
  – Microarrays

 9 11 49 51  0  1  1  4
19  3  0  0  0 45 25 16
 5  1  2  0 17  0  4 21
18 36  0  0 34  5 21 10
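
A matrix of this kind is typically used by sliding it along a sequence and scoring every window. The sketch below reads the 32 numbers as 4 rows by 8 columns (every column then sums to 51) and turns the counts into log-odds scores. The row order A, C, G, T, the pseudocount and the uniform background are assumptions made for illustration, and the names are hypothetical; the talk's own scoring is not reproduced here.

```python
import math

COUNTS = [
    [9, 11, 49, 51, 0, 1, 1, 4],
    [19, 3, 0, 0, 0, 45, 25, 16],
    [5, 1, 2, 0, 17, 0, 4, 21],
    [18, 36, 0, 0, 34, 5, 21, 10],
]
ROWS = "ACGT"  # ASSUMPTION: the slide does not state the row order


def log_odds(counts, pseudo=1.0, background=0.25):
    """Convert a count matrix into log2 odds against a uniform background."""
    width = len(counts[0])
    totals = [sum(row[j] for row in counts) for j in range(width)]
    return [[math.log2((row[j] + pseudo) / (totals[j] + 4 * pseudo) / background)
             for j in range(width)] for row in counts]


def best_site(seq, counts=COUNTS):
    """0-based start of the highest-scoring window of seq."""
    w = log_odds(counts)
    width = len(counts[0])

    def score(i):
        return sum(w[ROWS.index(seq[i + j])][j] for j in range(width))

    return max(range(len(seq) - width + 1), key=score)


print(best_site("GGCTAATCCGTT"))
```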

Page 56: In the search of motifs (and other hidden structures)

Finding preserved motifs of binding sites

• looking at one (human) genome gives too many positives

• comparative approach: take the 200 kb regions surrounding the same genes (paralogs and orthologs) of different species (human, mouse, chicken, …), find preserved clusters (motifs) of binding sites

• Smith-Waterman type algorithm with a novel scoring function
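
For orientation, here is the textbook Smith-Waterman local alignment score on plain strings. It is only a sketch of the general technique: the talk aligns binding sites with a novel scoring function that is not given in these slides, so the match, mismatch and gap values below are placeholder assumptions.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b with linear gap penalties."""
    prev = [0] * (len(b) + 1)   # previous row of the dynamic programming table
    best = 0
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            diag = prev[j - 1] + (match if x == y else mismatch)
            # A cell never drops below 0: a local alignment may start anywhere.
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman("ACACACTA", "AGCACACA"))
```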

Page 57: In the search of motifs (and other hidden structures)

Whole genome comparisons

• Whole genomes can be analyzed with our implementation

• We have compared human genes to orthologs in mouse, rat, chicken, fugu, tetraodon and zebrafish
  – 100 kbp flanking regions on both sides of the gene.
  – Coding regions masked out.
  – About 20 000 comparisons for each pair of species.
  – About 2 min each

Page 58: In the search of motifs (and other hidden structures)

Enhancer prediction for N-myc

[Figure: the 200 kb Mouse N-Myc genomic region (horizontal axis) compared with the 200 kb Human N-Myc genomic region (vertical axis); the coding region of N-Myc is marked.]

Conserved GLI binding sites in two predicted enhancer elements, CM5 and CM7

Page 59: In the search of motifs (and other hidden structures)

Wet-lab verification

● Selected predicted cis-modules for wet-lab verification

● Fused 1kb DNA segment containing the predicted enhancer to a marker gene with a minimal promoter and generated transgenic embryos.

Page 60: In the search of motifs (and other hidden structures)

To conclude

• combinatorial vs probabilistic motifs

• significance of the findings for the applications => statistical modeling

• Want to do computational biology? Then find a good biologist who has good computational intuition.

Page 61: In the search of motifs (and other hidden structures)

Acknowledgements

• Mikko Koivisto
• Heikki Mannila
• Kimmo Palin
• Pasi Rastas
• Morris Michael
• Stefan Kurtz (Hamburg)

• Outi Hallikas (Biom)
• Jussi Taipale (Biom)
• Markus Perola (Biom)
• Hans Söderlund (VTT)

The BioSapiens project is funded by the European Commission within its FP6 Programme, under the thematic area "Life sciences, genomics and biotechnology for health," contract number LHSG-CT-2003-503265.