cbio course, spring 2005, hebrew university class: motif finding cs-67693, spring 2005 school of...

cbio course, spring 2005, Hebrew University Class: Motif Finding CS-67693, Spring 2005 School of Computer Science & Engineering Hebrew University, Jerusalem *Few slides were adopted and edited from www.cs.ucsb.edu/~ambuj/Courses/ bioinformatics/motif %20finding.ppt



cbio course, spring 2005, Hebrew University

Class: Motif Finding, CS-67693, Spring 2005

School of Computer Science & Engineering

Hebrew University, Jerusalem

*A few slides were adopted and edited from www.cs.ucsb.edu/~ambuj/Courses/bioinformatics/motif%20finding.ppt


Background

Basic dogma: information is coded in the genome.
Information includes:
• Where the genes are coded, including: transcription start, UTR, exons and introns, alternative splicing


Eukaryotic Gene

Adapted in part from http://online.itp.ucsb.edu/online/infobio01/burge/


Background

Basic dogma: information is coded in the genome.
Information includes:
• Where the genes are coded, including: transcription start, UTR, exons and introns, alternative splicing
• Functional units in proteins


Proteins: Local structure motifs

[Figure: example local structure motifs: diverging type-2 turn, serine hairpin, type-I hairpin, frayed helix, proline helix C-cap, alpha-alpha corner, glycine helix N-cap]

I-sites Library = a catalog of local sequence-structure correlations


Background

Basic dogma: information is coded in the genome.
Information includes:
• Where the genes are coded, including: transcription start, UTR, exons and introns, alternative splicing
• Functional units in proteins
• RNA family structure


RNA – Multiple Align. + structure

Biological Sequence Analysis; Durbin, Eddy, Krogh, Mitchison; Cambridge press, 1998


Background

Basic dogma: information is coded in the genome.
Information includes:
• Where the genes are coded, including: transcription start, UTR, exons and introns, alternative splicing
• Functional units in proteins
• RNA family structure
• How to control which gene to turn on/off and when


Background

In many cases, we can relate such functions to reappearing “motifs” in the genome:
• Splice/start/end site signals in coding genes
• Binding sites of regulatory elements controlling transcription of nearby genes
• A certain function of a protein “domain”

The definition of a sequence “motif” depends on the context!


Background

Basic dogma: information is coded in the genome.
Information includes:
• Where the genes are coded, including: transcription start, UTR, exons and introns, alternative splicing
• Functional units in proteins
• RNA family structure
• How to control which gene to turn on/off and when

Future classes


Regulation of Gene Expression

Gene regulatory proteins bind to specific places (regulatory sites) on DNA. These sites are usually close to the gene.

[Figure: a gene with an adjacent regulatory site, shown in the off state and in the on state; labels: gene, site, regulatory protein]


Regulatory Sites

Regulatory sites are sometimes divided into two types:
• Promoter sites: usually upstream of a gene, in non-translated (non-coding) regions. In some cases, these sites can be in exonic or intronic regions.
• Enhancer sites: can be very far away (either upstream or downstream).

Regulatory proteins recognize sites by conserved DNA patterns, which consist of a short stretch of “partially specific” nucleotide sequences.


lac operon in E. coli


Figure 13.16 The lac Operon of E. coli


Promoter…


Transcription Factor Binding Sites

Non-coding regions gene regulation

We want to describe this site


Difficulty of Finding Regulatory Elements

• Regulatory sites are short (up to 30 nucleotides).
• Non-coding regions are very long (they include all regions that are not translated into proteins).
• Experiments to find regulatory sites are tedious and time-consuming. One approach is to mutate different combinations of nucleotides until functionality changes.
• We don’t have a good understanding of what makes a site active, or how active, in terms of the chemical/physical constraints.


Why Not Use Multiple Alignment?

• The motif is short and may appear at different locations in different sequences. Most other areas are random.
• Not all positions within a binding site should be treated in the same way, and usually we don’t know in advance how. Therefore a general scoring matrix is not adequate.
• The problem is made more complicated because not every sequence contains a motif, due to:
  - The upstream region used may not be long enough to include a regulatory site in every sequence.
  - Usually, potentially co-regulated genes are used to construct the sample, which means that we don’t know for sure whether all these genes are really co-regulated.


Computational Approach

Identify a set of genes believed to be controlled by the same regulatory mechanism (co-regulated genes).

Extract regulatory regions of the genes (usually upstream sequences) to form a sample of sequences.

Find some way to identify “conserved” elements in these sequences, resulting in a list of potential regulatory sites.


How to Find Regulatory Sites

[Figure: several genes, each with a nearby regulatory site; the regions containing the sites are collected to form the sample]


Formulating Motif Finding Task

Given a set of sequences, find a common motif shared by these sequences.

Steps:

Construct a model of what we mean by common motif.

Solve the problem within the model on simulated samples.

Evaluate performance on real life biological samples.


Formulating Motif Finding Task (2)

This means we need to define:
• Input of the algorithm: this implicitly defines various assumptions we have on the problem (e.g. do we have a different belief for each sequence that it belongs to the group?)
• Type of “motif” class
• Search algorithm: how do we search the space of possible motifs?
• Scoring function: how do we score putative motifs?
• Output of the algorithm: should it give us just putative sites, or maybe a binding site model to predict sites?
• Evaluation technique: how do we test our algorithm?


Task Definition Example

Given a sample of sequences and an unknown pattern (motif) that appears at different unknown positions in each sequence, can we find the unknown pattern?

Input: a set of sequences, each one with an unknown pattern at an unknown position.

Output: a set of starting positions of the pattern in each sequence.


Pattern == Subsequence

atgaccgggatactgatAAAAAAAAGGGGGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaataAAAAAAAAGGGGGGGa

tgagtatccctgggatgacttAAAAAAAAGGGGGGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgAAAAAAAAGGGGGGGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAAAAAAAAGGGGGGGcttatag

gtcaatcatgttcttgtgaatggatttAAAAAAAAGGGGGGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAAAAAAAAGGGGGGGcaattatgagagagctaatctatcgcgtgcgtgttcat

aacttgagttAAAAAAAAGGGGGGGctggggcacatacaagaggagtcttccttatcagttaatgctgtatgacactatgta

ttggcccattggctaaaagcccaacttgacaaatggaagatagaatccttgcatAAAAAAAAGGGGGGGaccgaaagggaag

ctggtgagcaacgacagattcttacgtgcattagctcgcttccggggatctaatagcacgaagcttAAAAAAAAGGGGGGGa

Subsequence = AAAAAAAAGGGGGGG
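When the pattern is an exact subsequence, one simple way to recover it is to intersect the sets of length-l substrings of all sequences. The sketch below assumes the length l is known in advance and uses short made-up sequences rather than the ones on the slide; `lmers` and `common_lmers` are hypothetical helper names.

```python
def lmers(seq, l):
    """Return the set of all length-l substrings of seq (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + l] for i in range(len(seq) - l + 1)}


def common_lmers(seqs, l):
    """Return the l-mers that occur in every sequence of the sample."""
    shared = lmers(seqs[0], l)
    for seq in seqs[1:]:
        shared &= lmers(seq, l)
    return shared


# Toy sample: the pattern AAGG is planted at a different position in each sequence.
sample = ["ctAAGGtc", "AAGGcttc", "tccgAAGG"]
motifs = common_lmers(sample, 4)
# Output in the sense of the task definition: one start position per sequence.
positions = [s.upper().find("AAGG") for s in sample]
print(motifs, positions)
```

This only works because every occurrence is identical; a single mismatch in one sequence removes the pattern from the intersection, which is what motivates the (l,d) formulation.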


Pattern == (l,d)

atgaccgggatactgatAgAAgAAAGGttGGGggcgtacacattagataaacgtatgaagtacgttagactcggcgccgccg

acccctattttttgagcagatttagtgacctggaaaaaaaatttgagtacaaaacttttccgaatacAAtAAAAcGGcGGGa

tgagtatccctgggatgacttAAAAtAAtGGaGtGGtgctctcccgatttttgaatatgtaggatcattcgccagggtccga

gctgagaattggatgcAAAAAAAGGGattGtccacgcaatcgcgaaccaacgcggacccaaaggcaagaccgataaaggaga

tcccttttgcggtaatgtgccgggaggctggttacgtagggaagccctaacggacttaatAtAAtAAAGGaaGGGcttatag

gtcaatcatgttcttgtgaatggatttAAcAAtAAGGGctGGgaccgcttggcgcacccaaattcagtgtgggcgagcgcaa

cggttttggcccttgttagaggcccccgtAtAAAcAAGGaGGGccaattatgagagagctaatctatcgcgtgcgtgttcat

AgAAgAAAGGttGGG
..|..|||.|..|||
cAAtAAAAcGGcGGG

All variants of AAAAAAAAGGGGGGG

First formulated by Pevzner (ISMB 2000):
• Pattern = subsequence of length l with exactly d random mismatches in it
• All other sequence is assumed random
• Assumes exactly one “true” occurrence of the motif in each sequence
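The (l,d) definition can be made concrete with a Hamming-distance check. The sketch below is not an (l,d) search algorithm: it assumes the pattern is already known and only locates the closest window in a sequence. `hamming` and `best_window` are hypothetical helper names, and the short test sequence is made up around one of the slide's variants.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a.upper(), b.upper()))


def best_window(seq, pattern):
    """Return (start, distance) of the window of seq closest to pattern.

    Ties are broken in favour of the leftmost window.
    """
    l = len(pattern)
    candidates = ((hamming(seq[i:i + l], pattern), i)
                  for i in range(len(seq) - l + 1))
    dist, start = min(candidates)
    return start, dist


pattern = "AAAAAAAAGGGGGGG"  # l = 15
# Two of the variants shown on the slide.
d1 = hamming("AgAAgAAAGGttGGG", pattern)
d2 = hamming("cAAtAAAAcGGcGGG", pattern)
# Made-up flanks around the first variant.
hit = best_window("ccgtAgAAgAAAGGttGGGtca", pattern)
print(d1, d2, hit)
```

The hard part of the real problem is that the pattern itself is unknown, and two planted instances may differ from each other in up to 2d positions even though each is within d of the pattern.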


Formulating Motif Finding Task (2)

We need to define:
• Input of the algorithm: this implicitly defines various assumptions we have on the problem (e.g. do we have a different belief for each sequence that it belongs to the group?)
• Type of “motif” class
• Search algorithm: how do we search the space of possible motifs?
• Scoring function: how do we score putative motifs?
• Output of the algorithm: should it give us just putative sites, or maybe a binding site model to predict sites?
• Evaluation technique: how do we test our algorithm?

Think:
• How does the (l,d) problem define these?
• How does it relate to “real” biology?


How to Define Motif Class?

• Subsequences: ACTCTT
• IUPAC alphabet: {A, C, G, T, R, Y, M, K, S, W, B, D, H, V, N} = all non-empty subsets of {A,C,G,T}
• PSSM / PWM (Position Specific Score Matrix or Position Weight Matrix)
• More general probabilistic/other models: e.g. using the Bayesian Networks modeling language
• Refined definition based on prior knowledge:
  - Homo/hetero dimers
  - Variable gaps
  - Bias to some characteristic information profile (Van, 2003)
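An IUPAC motif can be matched by expanding each code into the set of bases it stands for. The table below is the standard IUPAC nucleotide code; the function names and the example motif `ACWCTT` (a made-up variation on the slide's ACTCTT) are illustrative assumptions.

```python
# Standard IUPAC nucleotide codes, each standing for a set of bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def matches(motif, kmer):
    """True if every base of kmer belongs to the set named by the motif letter."""
    return len(motif) == len(kmer) and all(
        base in IUPAC[code] for code, base in zip(motif.upper(), kmer.upper())
    )


def find_sites(motif, seq):
    """Start positions in seq where the IUPAC motif matches."""
    k = len(motif)
    return [i for i in range(len(seq) - k + 1) if matches(motif, seq[i:i + k])]


# W stands for A or T, so both ACTCTT and ACACTT are accepted.
sites = find_sites("ACWCTT", "ggACTCTTccACACTTa")
print(sites)
```

An IUPAC motif accepts or rejects a k-mer outright, whereas a PSSM assigns every k-mer a graded score.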


PSSM Representation of Binding Sites

Position Specific Score Matrix: each possible k-mer gets a “score” for being a binding site, which is:

$$\mathrm{Score}(s_1,\ldots,s_k) = \sum_{i=1}^{k} w[i, s_i]$$

where w[i,c] is the weight of letter c at position i (a matrix with rows A, C, G, T and columns 1, 2, …, k).

Probabilistic interpretation:

$$\mathrm{Score}(s_1,\ldots,s_k) = \log \frac{P(s_1,\ldots,s_k \mid M)}{P(s_1,\ldots,s_k \mid BG)}$$

$$P(s_1,\ldots,s_k \mid M) = \prod_{i=1}^{k} P_i(s_i)$$

NOTE:

• Independence assumption between binding site positions!

• The score used in a probabilistic setting is the log-odds score.

• In many cases the BG is a simple, fixed background distribution (Q) over {A,C,G,T}.

• The entries in the matrix can be $P_i(a)$, $\log P_i(a)$ or $\log(P_i(a)/Q(a))$, depending on the context of its usage!
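The log-odds form of the PSSM score can be sketched as follows. The three-position model, the uniform background Q and the choice of base-2 logarithms are all assumptions made for illustration; the slide does not fix a log base.

```python
import math

BASES = "ACGT"


def log_odds_pssm(probs, background):
    """Turn per-position probabilities P_i(a) into weights w[i][a] = log2(P_i(a) / Q(a))."""
    return [{a: math.log2(p[a] / background[a]) for a in BASES} for p in probs]


def score(pssm, kmer):
    """Score(s_1..s_k) = sum_i w[i][s_i]; positions are treated as independent."""
    return sum(column[base] for column, base in zip(pssm, kmer.upper()))


# Hypothetical 3-position motif model and a uniform background Q.
probs = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
Q = {a: 0.25 for a in BASES}
pssm = log_odds_pssm(probs, Q)
print(score(pssm, "AGT"), score(pssm, "CCT"))
```

A positive score means the k-mer is more likely under the motif model than under the background, a negative score the opposite. The third column equals the background, so it contributes zero to every score.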


PSSM vs. IUPAC

• PSSM:
+ Enables representing low/high affinity in different positions
+ Trades off sensitivity and specificity in genome-wide scans
- Huge search space: how to cover it efficiently?

ABF1 example (targets by Lee et al., 2002)

>YAL011W: CGTGTTAGATGA ?


How to Learn PSSM Motif?

Easier task: we have aligned samples to learn from.
• We have a set of known BS, all of length k (e.g. verified by some biological experiment).
• Compute counts for each base in each position, and normalize. This is the ML estimator; with N the number of sequences and N_a the number of “a”s in position i:

$$P_i(a) = \frac{N_a}{N}$$

Note: This is the ML solution. As in many other cases, this might be problematic when we have very few samples to learn from (e.g. we can get probability 0 for base A in position i simply because we did not see enough examples).

Solution: use pseudocounts or some prior (e.g. a Dirichlet prior).
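The counting estimator and the effect of pseudocounts can be sketched as follows. The aligned sites are made up, `learn_profile` is a hypothetical helper, and adding the same pseudocount to every base is just one simple choice of prior.

```python
BASES = "ACGT"


def learn_profile(sites, pseudocount=0.0):
    """Estimate P_i(a) = (N_a + pseudocount) / (N + 4 * pseudocount) for each position i."""
    k = len(sites[0])
    if any(len(s) != k for s in sites):
        raise ValueError("all sites must have the same length k")
    n = len(sites)
    profile = []
    for i in range(k):
        counts = {a: pseudocount for a in BASES}
        for site in sites:
            counts[site[i].upper()] += 1
        total = n + 4 * pseudocount
        profile.append({a: counts[a] / total for a in BASES})
    return profile


sites = ["ACGT", "ACGA", "ACCT", "TCGT"]  # made-up aligned binding sites
ml = learn_profile(sites)                  # plain ML estimate: counts / N
smoothed = learn_profile(sites, pseudocount=1.0)
print(ml[0], smoothed[0])
```

In the ML estimate, C and G never appear at the first position and get probability 0, so any new site starting with C or G would be ruled out entirely. With a pseudocount of 1 those bases keep a small non-zero probability.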


How to Learn PSSM Motif ? (2)

Remember: in the motif finding problem we have a much harder task.

The input is a set of (long) sequences suspected to contain a common motif (a PSSM, according to our current model assumption), but we don’t know where!

The output: prediction of new BS based on our learned PSSM motif.

[Figure: input sequences → BS model (PSSM with rows A, C, G, T and columns 1 to 7) → predictions. In the input sequences, dark blue marks the BS positions, which are hidden from us and which we are trying to learn.]


How to Learn PSSM Motif? (3) MEME Algorithm (Bailey T.L. and Elkan C.P. 1995)

(Still) one of the most commonly used tools for motif (PSSM) search:


How to Learn PSSM Motif? (3) MEME Algorithm (Bailey T.L. and Elkan C.P. 1995)

The basic probabilistic framework used by MEME:
• Input: N sequences. Assume each has 1 BS.
• Assume a generative model: sequence is generated either by the BS model M (PSSM) or from a fixed background distribution BG.
• Assume each sequence has exactly 1 BS in it.
• Scoring function: P(Seq | M,BG)
• Try to maximize the likelihood scoring function by adjusting M’s (PSSM) parameters.
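Under the one-site-per-sequence assumption, the scoring function P(Seq | M,BG) can be written as a sum over all possible site positions. The sketch below additionally assumes that the start position is uniform a priori and that BG is a fixed per-base distribution; the function names are hypothetical, and this is not MEME's actual implementation.

```python
import math


def sequence_likelihood(seq, profile, background):
    """P(seq | M, BG) with exactly one site, its start position uniform a priori."""
    seq = seq.upper()
    k = len(profile)
    n_starts = len(seq) - k + 1
    # Probability of the whole sequence under the background alone.
    bg_all = math.prod(background[b] for b in seq)
    # For each start j, swap the background terms of the window for motif terms.
    ratios = (
        math.prod(profile[i][seq[j + i]] / background[seq[j + i]] for i in range(k))
        for j in range(n_starts)
    )
    return bg_all * sum(ratios) / n_starts


def log_likelihood(seqs, profile, background):
    """Log-likelihood of a sample of independent sequences."""
    return sum(math.log(sequence_likelihood(s, profile, background)) for s in seqs)


Q = {a: 0.25 for a in "ACGT"}
one_column = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}]  # toy motif of length 1
print(sequence_likelihood("AC", one_column, Q))
```

For long sequences the product `bg_all` underflows, so a practical version would work in log space throughout.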


How to Learn PSSM Motif? (4)

What’s the problem? Why is it hard?

Think of the positions of the BS in each sequence as H, where H is a vector of dimension N.

Given H we have complete data. Then inferring M’s ML parameters is easy, just as we saw for the aligned case.

Problem 1: We don’t have H; we are trying to learn it too, so the ML parameters of M for each position become dependent. If H is not given, we have no closed form to compute them analytically, and going over all possible H assignments is not feasible, so we need to resort to some method to search the space of possible assignments to M’s parameters.

Problem 2: The landscape of the likelihood function is typically far from convex, with many local optima.


How to Learn PSSM Motif? (5) MEME Algorithm

MEME uses a technique called EM (Expectation Maximization) to search the space of model M’s parameters.

We review how EM is used in the MEME algorithm in class…
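Since the slides defer the details to class, here is a minimal sketch of what an EM iteration can look like for a PSSM under the one-site-per-sequence assumption. It is not MEME's actual implementation: the uniform prior over start positions, the fixed background, the pseudocount value and all function names are assumptions, and the starting profile must have no zero entries.

```python
import math

BASES = "ACGT"


def e_step(seq, profile, background):
    """Posterior over motif start positions in one sequence (one site per sequence)."""
    k = len(profile)
    weights = [
        math.prod(profile[i][seq[j + i]] / background[seq[j + i]] for i in range(k))
        for j in range(len(seq) - k + 1)
    ]
    total = sum(weights)
    return [w / total for w in weights]


def m_step(seqs, posteriors, k, pseudocount=0.1):
    """Re-estimate the profile from expected base counts."""
    counts = [{a: pseudocount for a in BASES} for _ in range(k)]
    for seq, post in zip(seqs, posteriors):
        for j, z in enumerate(post):
            for i in range(k):
                counts[i][seq[j + i]] += z
    return [{a: col[a] / sum(col.values()) for a in BASES} for col in counts]


def em(seqs, profile, background, iterations=20):
    """Alternate E and M steps; returns the final profile and the last posteriors."""
    seqs = [s.upper() for s in seqs]
    posteriors = []
    for _ in range(iterations):
        posteriors = [e_step(s, profile, background) for s in seqs]
        profile = m_step(seqs, posteriors, len(profile))
    return profile, posteriors


# Toy data: ACG is planted in every sequence; the start point is weakly biased towards it.
toy_seqs = ["ttACGtt", "cACGccc", "ggggACG"]
start = [{a: (0.4 if a == best else 0.2) for a in BASES} for best in "ACG"]
Q = {a: 0.25 for a in BASES}
learned, post = em(toy_seqs, start, Q)
print([p.index(max(p)) for p in post])
```

The posteriors play the role of a soft version of H: each E-step guesses where the sites are, and each M-step re-learns the PSSM as in the aligned case, weighting every window by its posterior.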


Problems with the MEME & other Models

Think:
• In light of what we discussed, what assumptions are made in this model?
• What might cause us problems in “real” life data?
• MEME also has other variants we did not discuss here (OOPS, ZOOPS, etc.)
• Also: EM is very sensitive to the starting point, so we need a good way to find good ones.


Other Algorithmic Techniques for Motif Finding

• MEME (Expectation Maximization)
• GibbsDNA, AlignACE (Gibbs Sampling)
• CONSENSUS (greedy multiple alignment)
• WINNOWER (clique finding in graphs)
• SP-STAR (sum of pairs scoring)
• MITRA (mismatch trees to prune the exhaustive search space)

More than one way to skin a cat…


How to Find Binding Sites - Revisited

“Classical” solutions: find a common motif in the gene set (CONSENSUS, MITRA, MEME, AlignACE…)

[Figure: gene set → promoters → extract the relevant bit from the sequences]

Main problem: in many cases the motif is common not just to the subset of sequences we have but to many others as well, so it is not a good candidate to explain regulation.

Discriminative solutions: find a common & unique motif in the genes

$$P_{hyper}(k \mid n, K, M) = \sum_{x=k}^{n} \frac{\binom{K}{x}\binom{M-K}{n-x}}{\binom{M}{n}}$$

“A simple hyper-geometric approach for discovering putative transcription factor binding sites” WABI 01
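A hypergeometric tail probability of this kind can be computed directly with exact binomial coefficients. The reading of the symbols below is an assumption, since the slide does not spell it out: M genes in total, K of them containing the motif, n genes in the selected set, and k of the selected genes containing the motif.

```python
from math import comb


def hypergeom_tail(k, n, K, M):
    """P(at least k motif-containing genes among n selected genes),
    when K of the M genes overall contain the motif."""
    upper = min(n, K)
    hits = sum(comb(K, x) * comb(M - K, n - x) for x in range(k, upper + 1))
    return hits / comb(M, n)


# Toy numbers: 10 genes, 3 contain the motif, 4 are selected, 2 of those contain it.
p = hypergeom_tail(2, 4, 3, 10)
print(p)
```

A small value means the motif is enriched in the selected set relative to the rest of the genome, which is what makes it discriminative rather than merely common.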


Finding Discriminative Motifs

Step 1:
• Define the space of motifs: “mimic” motifs with a simpler class for efficient search
• Search the space and evaluate motifs using discriminative scoring
• Choose significant motifs: correct for multiple hypotheses (Bonferroni or FDR criteria)

Step 2:
• Refine motifs

“A simple hyper-geometric approach for discovering putative transcription factor binding sites” WABI 01
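Because many candidate motifs are scored, their p-values need a multiple-hypothesis correction. The sketch below shows the Bonferroni rule and, as one common FDR criterion, the Benjamini-Hochberg procedure; the slide does not say which FDR procedure was used, and the p-values are made up.

```python
def bonferroni(pvalues, alpha=0.05):
    """Indices of hypotheses that stay significant after Bonferroni correction."""
    m = len(pvalues)
    return [i for i, p in enumerate(pvalues) if p <= alpha / m]


def benjamini_hochberg(pvalues, alpha=0.05):
    """Indices of hypotheses kept by the Benjamini-Hochberg FDR procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    # Find the largest rank whose p-value is below its rank-dependent threshold.
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


pvals = [0.001, 0.02, 0.03, 0.2]  # made-up motif p-values
print(bonferroni(pvals), benjamini_hochberg(pvals))
```

Bonferroni controls the chance of any false positive and is the stricter of the two; an FDR criterion tolerates a bounded fraction of false positives among the reported motifs.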


Binding Sites - Revisited

[Figure: a promoter with a binding site upstream of a gene; the binding site is shown with its positions (label: A?C?T) → independence assumption]

Two relevant questions:
• Are there dependencies in binding sites?
• Do we gain an edge in computational tasks if we model such dependencies?

“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03


How to model binding sites ?

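How do we represent a distribution of binding sites, $P(X_1 X_2 X_3 X_4 X_5)$?

Profile (independence model):

$$P(X_1 \ldots X_5) = P(X_1)\,P(X_2)\,P(X_3)\,P(X_4)\,P(X_5)$$

Tree (direct dependencies):

$$P(X_1 \ldots X_5) = P(X_1)\,P(X_2 \mid X_3)\,P(X_3 \mid X_1)\,P(X_4)\,P(X_5 \mid X_3)$$

Mixture of Profiles (global dependencies):

$$P(X_1 \ldots X_5) = \sum_T P(T)\,P(X_1 \mid T)\,P(X_2 \mid T)\,P(X_3 \mid T)\,P(X_4 \mid T)\,P(X_5 \mid T)$$

Mixture of Trees (both types of dependencies):

$$P(X_1 \ldots X_5) = \sum_T P(T)\,P(X_1 \mid T)\,P(X_2 \mid T, X_3)\,P(X_3 \mid T, X_1)\,P(X_4 \mid T)\,P(X_5 \mid T, X_3)$$

[Figure: the graph of each model over X1 … X5; the two mixture models add a hidden variable T]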

“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
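The difference between a profile and a mixture of profiles can be seen on a two-position toy example. The sketch below uses made-up parameters and hypothetical function names; it illustrates the model definitions only and is not the learning procedure from the cited paper.

```python
import math


def profile_prob(site, profile):
    """Profile: product of independent per-position probabilities."""
    return math.prod(col[base] for col, base in zip(profile, site))


def mixture_prob(site, weights, profiles):
    """Mixture of profiles: sum over the hidden component T of P(T) * P(site | T)."""
    return sum(w * profile_prob(site, p) for w, p in zip(weights, profiles))


# Toy binding sites of length 2 that are always either AA or TT.
aa = [{"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}] * 2
tt = [{"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0}] * 2
weights = [0.5, 0.5]
# A single profile with the same per-position marginals as the mixture.
marginal = [{"A": 0.5, "C": 0.0, "G": 0.0, "T": 0.5}] * 2

print(mixture_prob("AA", weights, [aa, tt]), profile_prob("AA", marginal))
print(mixture_prob("AT", weights, [aa, tt]), profile_prob("AT", marginal))
```

Both models have identical marginals at each position, yet the single profile gives AT a probability of 0.25 while the mixture gives it 0: the hidden variable T captures the dependency between the two positions.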


Learning models: Aligned binding sites

Learning procedure for Bayesian networks

GCGGGGCCGGGC
TGGGGGCGGGGT
AGGGGGCGGGGG
TAGGGGCCGGGC
TGGGGGCGGGGT
AAAGGGCCGGGC
GGGAGGCCGGGA
GCGGGGCGGGGC
GAGGGGACGAGT
CCGGGGCGGTCC
ATGGGGCGGGGC

[Figure: aligned binding sites → learning machinery → candidate models (four model structures over X1 … X5, two of them with a hidden variable T) → select the maximum likelihood model]

“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03


Arabidopsis ABA binding factor 1 (49 examples)

Profile: test LL per instance -19.93

Mixture of Profiles (components weighted 76% and 24%): test LL per instance -18.70 (+1.23) (improvement in likelihood > 2-fold)

Tree (over positions X4 … X12): test LL per instance -18.47 (+1.46) (improvement in likelihood > 2.5-fold)

“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
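The step from a difference in log-likelihood per instance to a fold improvement is an exponentiation. The sketch below assumes the log-likelihoods are in base 2; the slide does not state the base, but base 2 is consistent with the reported "> 2-fold" and "> 2.5-fold" figures.

```python
def fold_change(delta_ll, base=2.0):
    """Likelihood ratio per instance implied by a log-likelihood difference."""
    return base ** delta_ll


# Differences reported on the slide relative to the profile model.
mixture_fold = fold_change(-18.70 - (-19.93))  # +1.23
tree_fold = fold_change(-18.47 - (-19.93))     # +1.46
print(round(mixture_fold, 2), round(tree_fold, 2))
```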


Rap1 Example (Harbison et al. 04) (171 examples)

[Figure: learned Profile, Mixture of Profiles and Tree (over positions X4 … X12) models]


Likelihood improvement over profiles

[Figure: fold change in likelihood on held-out test data (y-axis, log scale from ¼ to 8) for each dataset of binding sites (x-axis, 1 to 91), marked as significant or non-significant; annotation: 67]

Significant improvement in generalization

Data often exhibits dependencies

“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03


Learning models: unaligned data

Use the EM algorithm to simultaneously:
• Identify binding site positions
• Learn a dependency model

[Figure: EM loop over the unaligned data, alternating between “learn a model” and “identify binding sites”; four candidate model structures over X1 … X5, two of them with a hidden variable T]

“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03


Evaluating Performance

Detect target genes on a genomic scale:

[Figure: promoter sequence ACGTAT…AGGGATGCGAGC on an axis from -1000 to 0, with a position marked at -473]

Scoring rule:

$$\mathrm{Score}(X_1,\ldots,X_K) = \log \frac{P(X_1,\ldots,X_K)}{P_0(X_1,\ldots,X_K)}$$

where P is the probability by the binding site model and P_0 is the background model (order-3 Markov chain).

Crucial issue: p-value of scores

“CIS: Compound Importance Sampling Method for Protein-DNA Binding Site p-value Estimation” Bioinformatics, 2004, ISMB 04
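A log-odds score against a Markov background can be sketched as follows. Everything here is an illustrative assumption rather than the method of the cited papers: the binding site model is a plain profile, the background is trained by counting with a pseudocount, unseen contexts fall back to a uniform distribution, logs are base 2, and sequences are expected in upper case.

```python
import math

BASES = "ACGT"


def train_markov(seq, order=3, pseudocount=1.0):
    """Estimate P(next base | previous `order` bases) from a background sequence."""
    seq = seq.upper()
    counts = {}
    for i in range(order, len(seq)):
        ctx = counts.setdefault(seq[i - order:i], {a: pseudocount for a in BASES})
        ctx[seq[i]] += 1
    return {c: {a: n / sum(col.values()) for a, n in col.items()}
            for c, col in counts.items()}


def background_logprob(seq, start, k, model, order=3):
    """log2 P_0 of seq[start:start+k], each base conditioned on the bases before it."""
    if start < order:
        raise ValueError("need at least `order` bases of context before the window")
    total = 0.0
    for i in range(start, start + k):
        probs = model.get(seq[i - order:i])
        total += math.log2(probs[seq[i]] if probs else 0.25)
    return total


def log_odds(seq, start, profile, model, order=3):
    """Score = log2 P(window | site model) - log2 P_0(window | background)."""
    k = len(profile)
    site_lp = sum(math.log2(profile[i][seq[start + i]]) for i in range(k))
    return site_lp - background_logprob(seq, start, k, model, order)


model = train_markov("ACGT" * 10)            # made-up, highly repetitive background
flat = [{a: 0.25 for a in BASES}] * 2        # toy 2-position site model
s = log_odds("ACGTACGT", 3, flat, model)
print(s)
```

In this toy case the window is exactly what the repetitive background predicts, so the flat site model scores below the background and the log-odds is negative.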


Example: ROC curve of HSF1

[Figure: ROC curves for Profile, Mixture of Profiles, Tree and Mixture of Trees; x-axis: false positive rate (0% to 5%); y-axis: true positive rate (sensitivity, 0% to 90%); annotation: ~60 FP]

“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03


Evaluation – Localization Data

5-fold cross validation [Lee et al 2002]

Improvement by Mix of Trees over PSSM

[Figure: scatter plot; x-axis: Δ sensitivity (TP/True), from -20 to 60; y-axis: Δ specificity (TP/Predicted), from -25 to 20. Inset: the “True” and Predicted sets, with TP as their overlap]

“Modeling Dependencies in Protein-DNA Binding Sites”, RECOMB 03
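The two axes of this evaluation follow directly from set overlaps. The sketch below uses the slide's own definitions, sensitivity = TP/True and specificity = TP/Predicted (the latter is often called precision elsewhere); the gene names are made up.

```python
def sensitivity_specificity(true_targets, predicted):
    """Return (TP/True, TP/Predicted) for two collections of gene identifiers."""
    true_targets, predicted = set(true_targets), set(predicted)
    tp = len(true_targets & predicted)
    return tp / len(true_targets), tp / len(predicted)


true_genes = {"g1", "g2", "g3", "g4"}   # hypothetical "true" targets
predicted_genes = {"g3", "g4", "g5"}    # hypothetical predictions
sens, spec = sensitivity_specificity(true_genes, predicted_genes)
print(sens, spec)
```

The Δ values in the plot would then be the differences between these quantities for the two models being compared.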


Motif Finding - Evaluation

Still an open problem.
• We have seen several examples of how performance can be evaluated in different ways.
• There is (still) no absolute solution for this.
• Main problems:
  - No large data sets of known sites
  - No real annotation of negative samples
  - How to define a success measure?
  - Differences in input/output assumptions
  - …
• A recent effort in this direction: “Assessing computational tools for the discovery of transcription factor binding sites” (Nat. Biotech. Jan 05) compared publicly available tools on the web on (small) data sets of known binding sites based on the Transfac D.B.