Post on 19-Jan-2016
CS 6243 Machine Learning
Advanced topic: pattern recognition (DNA motif finding)
Final Project
• Draft description available on course website
• More details will be posted soon
• Group size 2 to 4 acceptable, with higher expectation for larger teams
• Predict protein-DNA binding
Biological background for TF-DNA binding
Genome is fixed – Cells are dynamic
• A genome is static
– (Almost) every cell in our body has a copy of the same genome
• A cell is dynamic
– Responds to internal/external conditions
– Most cells follow a cell cycle of division
– Cells differentiate during development
Gene regulation
• … is responsible for the dynamic cell
• Gene expression (production of protein) varies according to:
– Cell type
– Cell cycle
– External conditions
– Location
– Etc.
Where gene regulation takes place
• Opening of chromatin
• Transcription
• Translation
• Protein stability
• Protein modifications
Transcriptional Regulation of genes
[Figure series: a gene with its promoter on DNA; a transcription factor (TF, a protein) binds its TF binding site (a cis-regulatory element), RNA polymerase (a protein) is recruited to the gene, and a new protein is produced]
The Cell as a Regulatory Network
[Figure: a toy regulatory network. Gene D: if C then D; if B then NOT D; if A and B then D. Gene B: if D then B]
Transcription Factors Binding to DNA
Transcriptional regulation:
• Transcription factors bind to DNA
• Binding recognizes specific DNA substrings: regulatory motifs
Experimental methods
• DNase footprinting
– Tedious
– Time-consuming
• High-throughput techniques: ChIP-chip, ChIP-seq
– Expensive
– Other limitations
Protein Binding Microarray
Computational methods for finding cis-regulatory motifs
Given a collection of genes that are believed to be regulated by the same or a similar protein:
– Co-expressed genes
– Evolutionarily conserved genes
Find the common TF-binding motif from promoters
Essentially a Multiple Local Alignment
• Find “best” multiple local alignment
• Multidimensional dynamic programming?
– Heuristics must be used
Characteristics of cis-Regulatory Motifs
• Tiny (6-12 bp)
• Intergenic regions are very long
• Highly variable
• ~Constant size
– Because a constant-size transcription factor binds
• Often repeated
• Often conserved
Motif representation
• Collection of exact words
– {ACGTTAC, ACGCTAC, AGGTGAC, …}
• Consensus sequence (with wild cards)
– AcGTgTtAC
– ASGTKTKAC, S = C/G, K = G/T (IUPAC code)
• Position-specific weight matrices (PWM)
Position-Specific Weight Matrix
    1    2    3    4    5    6    7    8    9
A  .97  .10  .02  .03  .10  .01  .05  .85  .03
C  .01  .40  .01  .04  .05  .01  .05  .05  .03
G  .01  .40  .95  .03  .40  .01  .30  .05  .03
T  .01  .10  .02  .90  .45  .97  .60  .05  .91
    A    S    G    T    K    T    K    A    C
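As a sketch in Python (assuming independence across columns), scoring a word under the PWM above is just the product of per-position probabilities. The dict layout and function name here are illustrative, not from any particular tool:

```python
# PWM from the slide above, as per-base probability rows (columns 1..9).
PWM = {
    'A': [.97, .10, .02, .03, .10, .01, .05, .85, .03],
    'C': [.01, .40, .01, .04, .05, .01, .05, .05, .03],
    'G': [.01, .40, .95, .03, .40, .01, .30, .05, .03],
    'T': [.01, .10, .02, .90, .45, .97, .60, .05, .91],
}

def pwm_score(word, pwm):
    """P(word | PWM): product of per-position base probabilities."""
    p = 1.0
    for j, base in enumerate(word):
        p *= pwm[base][j]
    return p

# A word matching the consensus ASGTKTKAC scores far higher
# than an off-consensus word.
strong = pwm_score("ACGTTTTAC", PWM)
weak = pwm_score("AAAAAAAAA", PWM)
```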
Sequence Logo
[Sequence logo generated from the PWM above; the y-axis shows base frequency]
http://weblogo.berkeley.edu/
http://biodev.hgen.pitt.edu/cgi-bin/enologos/enologos.cgi
Entropy and information content
• Entropy: a measure of uncertainty
• The entropy of a random variable X that can assume the n different values x1, x2, …, xn with respective probabilities p1, p2, …, pn is defined as
H(X) = -Σi pi log2 pi
Entropy and information content
• Example: A, C, G, T with equal probability:
H = 4 × (-0.25 log2 0.25) = log2 4 = 2 bits
Need 2 bits to encode (e.g. 00 = A, 01 = C, 10 = G, 11 = T). Maximum uncertainty.
• 50% A and 50% C:
H = 2 × (-0.5 log2 0.5) = log2 2 = 1 bit
• 100% A:
H = 1 × (-1 log2 1) = 0 bits. Minimum uncertainty.
• Information: the opposite of uncertainty
I = maximum uncertainty - entropy
The above examples provide 0, 1, and 2 bits of information, respectively.
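The examples above can be checked with a few lines of Python. This is a minimal sketch; `entropy` takes a list of probabilities and skips zero terms (which contribute nothing to the sum):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2 p); p = 0 terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The three cases from the slide:
# uniform A/C/G/T -> 2 bits, 50/50 A/C -> 1 bit, 100% A -> 0 bits
```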
Entropy and information content
    1     2     3     4     5     6     7     8     9
A  .97   .10   .02   .03   .10   .01   .05   .85   .03
C  .01   .40   .01   .04   .05   .01   .05   .05   .03
G  .01   .40   .95   .03   .40   .01   .30   .05   .03
T  .01   .10   .02   .90   .45   .97   .60   .05   .91
H  .24  1.72   .36   .63  1.60   .24  1.40   .85   .58
I 1.76   .28  1.64  1.37   .40  1.76   .60  1.15  1.42
Mean I: 1.15   Total I: 10.4
Expected occurrence in random DNA: 1 / 2^10.4 ≈ 1 / 1340
Expected occurrence of an exact 5-mer: 1 / 2^10 = 1 / 1024
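As a sketch in Python, the per-column information content and the total follow directly from the definition I = 2 - H (2 bits is the maximum entropy for a 4-letter alphabet):

```python
import math

# Columns of the PWM above, each as [pA, pC, pG, pT].
PWM_COLS = [
    [.97, .01, .01, .01], [.10, .40, .40, .10], [.02, .01, .95, .02],
    [.03, .04, .03, .90], [.10, .05, .40, .45], [.01, .01, .01, .97],
    [.05, .05, .30, .60], [.85, .05, .05, .05], [.03, .03, .03, .91],
]

def info_content(col):
    """I = 2 - H for one DNA column (max entropy log2 4 = 2 bits)."""
    H = -sum(p * math.log2(p) for p in col if p > 0)
    return 2 - H

total = sum(info_content(c) for c in PWM_COLS)  # ~10.4 bits, as on the slide
```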
Sequence Logo
[The same logo with column heights scaled by information content]
I 1.76 .28 1.64 1.37 .40 1.76 .60 1.15 1.42
Real example
• E. coli promoter
• “TATA box”: ~10 bp upstream of transcription start
• Instances: TACGAT, TAAAAT, TATACT, GATAAT, TATGAT, TATGTT
Consensus: TATAAT
Note: none of the instances matches the consensus perfectly
Finding Motifs
Classification of approaches
• Combinatorial algorithms
– Based on enumeration of words and computing word similarities
• Probabilistic algorithms
– Construct probabilistic models to distinguish motifs from non-motifs
Combinatorial motif finding
• Idea 1: find all k-mers that appear at least m times
– m may be chosen such that the number of occurrences is statistically significant
– Problem: most motifs are divergent; each variant may appear only once
• Idea 2: find all k-mers, considering IUPAC nucleic acid codes
– e.g. ASGTKTKAC, S = C/G, K = G/T
– Still inflexible
• Idea 3: find k-mers that appear approximately at least m times
– i.e. allow some mismatches
Combinatorial motif finding
Given a set of sequences S = {x1, …, xn}
• A motif W is a consensus string w1…wK
• Find motif W* with “best” match to x1, …, xn
Definition of “best”:
d(W, xi) = min Hamming distance between W and any word in xi
d(W, S) = Σi d(W, xi)
W* = argminW d(W, S)
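The distance definitions above translate directly to Python. As a toy illustration (function names are mine), the six TATA-box instances from the earlier slide can stand in for the sequences in S:

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(u != v for u, v in zip(a, b))

def d_seq(W, x):
    """d(W, x): minimum Hamming distance between W and any |W|-length word in x."""
    k = len(W)
    return min(hamming(W, x[y:y + k]) for y in range(len(x) - k + 1))

def d_set(W, S):
    """d(W, S): sum of d(W, x) over all sequences x in S."""
    return sum(d_seq(W, x) for x in S)

S = ["TACGAT", "TAAAAT", "TATACT", "GATAAT", "TATGAT", "TATGTT"]
# d_set("TATAAT", S) -> 8: each instance is within 2 mismatches of the consensus.
```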
Exhaustive searches
1. Pattern-driven algorithm:
For W = AA…A to TT…T (4^K possibilities)
    Find d(W, S)
Report W* = argmin d(W, S)
Running time: O(K N 4^K), where N = Σi |xi|
Guaranteed to find the optimal solution.
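A self-contained sketch of the pattern-driven search (helper functions repeated for completeness; ties between equally good patterns are broken by enumeration order):

```python
from itertools import product

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def d_set(W, S):
    """Total of the per-sequence minimum Hamming distances to W."""
    k = len(W)
    return sum(min(hamming(W, x[y:y + k]) for y in range(len(x) - k + 1))
               for x in S)

def pattern_driven(S, k):
    """Try all 4^k patterns; return one minimizing d(W, S). O(k * N * 4^k)."""
    return min((''.join(p) for p in product('ACGT', repeat=k)),
               key=lambda W: d_set(W, S))
```

On the six TATA-box instances this recovers the consensus TATAAT, even though no instance matches it exactly; that guarantee of optimality is exactly what the sample-driven shortcut gives up.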
Exhaustive searches
2. Sample-driven algorithm:
For each K-char word W in some xi
    Find d(W, S)
Report W* = argmin d(W, S), or report a local improvement of W*
Running time: O(K N^2)
Exhaustive searches
• Problem with the sample-driven approach. If:
– the true motif does not occur in the data, and
– the true motif is “weak”
• then random strings may score better than any instance of the true motif
Example
• E. coli promoter
• “TATA box”: ~10 bp upstream of transcription start
• Instances: TACGAT, TAAAAT, TATACT, GATAAT, TATGAT, TATGTT
Consensus: TATAAT
Each instance differs from the consensus in at most 2 bases.
None of the instances matches the consensus perfectly.
Heuristic methods
• Cannot afford exhaustive search on all patterns
• Sample-driven approaches may miss real patterns
• However, a real pattern should not differ too much from its instances in S
• Start from the space of all words in S and extend toward the space of real patterns
Some of the popular tools
• Consensus (Hertz & Stormo, 1999)
• WINNOWER (Pevzner & Sze, 2000)
• MULTIPROFILER (Keich & Pevzner, 2002)
• PROJECTION (Buhler & Tompa, 2001)
• WEEDER (Pavesi et al., 2001)
• And dozens of others
Probabilistic modeling approaches
for motif finding
Probabilistic modeling approaches
• A motif model
– Usually a PWM
– M = (pij), i = 1..4, j = 1..k, where k is the motif length
• A background model
– Usually the distribution of base frequencies in the genome (or another selected subset of sequences)
– B = (bi), i = 1..4
• A word can be generated by M or B
Expectation-Maximization
• For any word W:
P(W | M) = p_{W[1],1} p_{W[2],2} … p_{W[K],K}
P(W | B) = b_{W[1]} b_{W[2]} … b_{W[K]}
• Let λ = P(M), i.e., the prior probability that any word is generated by M; then P(B) = 1 - λ.
• Can compute the posterior probabilities P(M|W) and P(B|W):
P(M|W) ∝ P(W|M) · λ
P(B|W) ∝ P(W|B) · (1 - λ)
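The normalized posterior follows from Bayes' rule. A minimal sketch (the function name is mine; the two likelihoods and the prior λ come from the model above):

```python
def posterior_motif(p_w_m, p_w_b, lam):
    """P(M | W) = lam * P(W|M) / (lam * P(W|M) + (1 - lam) * P(W|B))."""
    num = lam * p_w_m
    return num / (num + (1 - lam) * p_w_b)

# With equal likelihoods and lam = 0.5 the posterior stays at 0.5;
# a word 9x more likely under M than under B gets posterior 0.9.
```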
Expectation-Maximization
Initialize: randomly assign each word to M or B
• Let Zxy = 1 if position y in sequence x is a motif, and 0 otherwise
• Estimate parameters M, λ, B
Iterate until convergence:
• E-step: Zxy = P(M | x[y..y+k-1]) for all x and y
• M-step: re-estimate M and λ given Z (B is usually fixed)
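The loop above can be sketched as a toy EM under simplifying assumptions: one motif occurrence per sequence, a uniform background (so the background likelihood cancels in the E-step), and pseudocount 1. The names `em_motif`, `word_prob`, and `normalize` are illustrative, not from MEME:

```python
import random

BASES = 'ACGT'

def word_prob(w, M):
    """P(word | M): product of per-position probabilities."""
    p = 1.0
    for j, b in enumerate(w):
        p *= M[b][j]
    return p

def normalize(counts, k):
    """Turn per-column counts into a column-stochastic PWM."""
    return {b: [counts[b][j] / sum(counts[c][j] for c in BASES)
                for j in range(k)]
            for b in BASES}

def em_motif(S, k, iters=30, seed=1):
    """Toy EM for a length-k motif: soft assignments Z over all start positions."""
    rng = random.Random(seed)
    # Initialize M from one random word per sequence (pseudocount 1).
    counts = {b: [1.0] * k for b in BASES}
    for x in S:
        y = rng.randrange(len(x) - k + 1)
        for j in range(k):
            counts[x[y + j]][j] += 1
    M = normalize(counts, k)
    for _ in range(iters):
        counts = {b: [1.0] * k for b in BASES}
        for x in S:
            # E-step: Z[y] proportional to P(word at y | M).
            words = [x[y:y + k] for y in range(len(x) - k + 1)]
            z = [word_prob(w, M) for w in words]
            tot = sum(z)
            # M-step accumulation: fractional counts weighted by Z.
            for w, zz in zip(words, z):
                for j in range(k):
                    counts[w[j]][j] += zz / tot
        M = normalize(counts, k)
    return M
```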
Expectation-Maximization
• E-step: Zxy = P(M | X[y..y+k-1]) for all x and y
• M-step: re-estimate M, given Z
[Plots: motif-start probability vs. position, in three panels: Initialize, E-step, M-step]
MEME
• Multiple EM for Motif Elicitation
• Bailey and Elkan, UCSD
• http://meme.sdsc.edu/
• Multiple starting points
• Multiple modes: ZOOPS, OOPS, TCM
Gibbs Sampling
• Another very useful technique for estimating missing parameters
• EM is deterministic– Often trapped by local optima
• Gibbs sampling: stochastic behavior to avoid local optima
Gibbs Sampling
Initialize: randomly assign each word to M or B
• Let Zxy = 1 if position y in sequence x is a motif, and 0 otherwise
• Estimate parameters M, B, λ
Iterate:
• Randomly remove a sequence x* from S
• Recalculate model parameters using S \ x*
• Compute Zx*y for x*
• Sample a y* from Zx*y
• Set Zx*y = 1 for y = y* and 0 otherwise
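One iteration of this loop can be sketched as follows (a toy illustration, assuming one motif start per sequence is tracked in `starts`, with pseudocount 1; `gibbs_step` is an illustrative name):

```python
import random

BASES = 'ACGT'

def gibbs_step(S, starts, k, idx, rng):
    """One Gibbs update: hold out sequence S[idx], rebuild the PWM from the
    other sequences' current motif starts, score every window of the held-out
    sequence, and sample a new start in proportion to its score."""
    # Build the PWM from all sequences except the held-out one.
    M = {b: [1.0] * k for b in BASES}  # pseudocounts
    for i, x in enumerate(S):
        if i == idx:
            continue
        for j in range(k):
            M[x[starts[i] + j]][j] += 1
    for j in range(k):
        tot = sum(M[b][j] for b in BASES)
        for b in BASES:
            M[b][j] /= tot
    # Score each window of the held-out sequence, then sample stochastically.
    x = S[idx]
    weights = []
    for y in range(len(x) - k + 1):
        p = 1.0
        for j in range(k):
            p *= M[x[y + j]][j]
        weights.append(p)
    starts[idx] = rng.choices(range(len(weights)), weights=weights)[0]
    return weights, starts[idx]
```

With a planted motif and the other sequences aligned correctly, the true window dominates the weights, but the sampling step still occasionally visits other positions, which is exactly what lets the sampler escape local optima.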
Gibbs Sampling
• Gibbs sampling: sample one position according to its probability
– Update the prediction of one training sequence at a time
• Viterbi: always take the highest-probability position
• EM: take the weighted average; simultaneously update the predictions of all sequences
[Plots: motif-start probability vs. position, comparing the update rules]
Better background model
• Repeat DNA can be confused with motifs
– Especially low-complexity repeats: CACACA…, AAAAA, etc.
• Solution: a more elaborate background model
– Higher-order Markov model
0th order: B = { pA, pC, pG, pT }
1st order: B = { P(A|A), P(A|C), …, P(T|T) }
…
Kth order: B = { P(X | b1…bK); X, bi ∈ {A,C,G,T} }
Has been applied to EM and Gibbs (up to 3rd order)
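A 1st-order background is simple to fit from dinucleotide counts. A sketch (function names are mine; pseudocount 1 keeps unseen transitions nonzero):

```python
BASES = 'ACGT'

def train_markov1(seqs):
    """Fit a 1st-order background model P(b | a) from dinucleotide counts."""
    counts = {a: {b: 1.0 for b in BASES} for a in BASES}  # pseudocounts
    for x in seqs:
        for a, b in zip(x, x[1:]):
            counts[a][b] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
            for a in BASES}

def bg_prob(word, P, p0=0.25):
    """P(word | B) = p0 for the first base times the chain of transitions."""
    p = p0
    for a, b in zip(word, word[1:]):
        p *= P[a][b]
    return p
```

Trained on a CA-repeat, this background assigns CA-repeat words high probability, so they no longer look surprisingly motif-like, which is the point of the richer model.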
Gibbs sampling motif finders
• Gibbs Sampler
– First appeared as: Lawrence et al., Science 262(5131):208-214
– Continually developed and updated
– The newest version: Thompson et al., Nucleic Acids Res. 35(s2):W232-W237
• AlignACE
– Hughes et al., J. Mol. Biol. 2000;296(5):1205-14
– Allows don’t-care positions
– Additional tools to scan motifs in new sequences, and to compare and group motifs
• BioProspector, an improvement of AlignACE
– Liu, Brutlag and Liu, Pac. Symp. Biocomput. 2001:127-38
– Allows two-block motifs
– Considers higher-order Markov models