Post on 19-Jan-2016
CS 6243 Machine Learning
Advanced topic: pattern recognition (DNA motif finding)
Final Project
• Draft description available on course website
• More details will be posted soon
• Group size 2 to 4 acceptable, with higher expectation for larger teams
• Predict protein-DNA binding
Biological background for TF-DNA binding
Genome is fixed – Cells are dynamic
• A genome is static
– (Almost) every cell in our body has a copy of the same genome
• A cell is dynamic
– Responds to internal/external conditions
– Most cells follow a cell cycle of division
– Cells differentiate during development
Gene regulation
• … is responsible for the dynamic cell
• Gene expression (production of protein) varies according to:
– Cell type
– Cell cycle
– External conditions
– Location
– Etc.
Where gene regulation takes place
• Opening of chromatin
• Transcription
• Translation
• Protein stability
• Protein modifications
Transcriptional Regulation of genes
[Figure series: a gene with its promoter on DNA; a transcription factor (TF, a protein) binds its TF binding site (a cis-regulatory element), RNA polymerase (a protein) is recruited to the gene, and a new protein is produced]
The Cell as a Regulatory Network
[Figure: a toy regulatory network. Gene D: if C then D; if B then NOT D; if A and B then D. Gene B: if D then B]
Transcription Factors Binding to DNA
Transcriptional regulation:
• Transcription factors bind to DNA
• Binding recognizes specific DNA substrings: regulatory motifs
Experimental methods
• DNase footprinting
– Tedious
– Time-consuming
• High-throughput techniques: ChIP-chip, ChIP-seq
– Expensive
– Other limitations
Protein Binding Microarray
Computational methods for finding cis-regulatory motifs
Given a collection of genes that are believed to be regulated by the same or a similar protein:
– Co-expressed genes
– Evolutionarily conserved genes
Find the common TF-binding motif from promoters
Essentially a Multiple Local Alignment
• Find “best” multiple local alignment
• Multidimensional dynamic programming?
– Heuristics must be used
Characteristics of cis-Regulatory Motifs
• Tiny (6-12 bp)
• Intergenic regions are very long
• Highly variable
• ~Constant size
– Because a constant-size transcription factor binds
• Often repeated
• Often conserved
Motif representation
• Collection of exact words
– {ACGTTAC, ACGCTAC, AGGTGAC, …}
• Consensus sequence (with wild cards)
– AcGTgTtAC
– ASGTKTKAC, S = C/G, K = G/T (IUPAC code)
• Position-specific weight matrices (PWM)
Position-Specific Weight Matrix
    1    2    3    4    5    6    7    8    9
A  .97  .10  .02  .03  .10  .01  .05  .85  .03
C  .01  .40  .01  .04  .05  .01  .05  .05  .03
G  .01  .40  .95  .03  .40  .01  .30  .05  .03
T  .01  .10  .02  .90  .45  .97  .60  .05  .91
    A    S    G    T    K    T    K    A    C
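As a sketch in Python (assuming independence across columns), scoring a word under the PWM above is just the product of per-position probabilities. The dict layout and function name here are illustrative, not from any particular tool:

```python
# PWM from the slide above, as per-base probability rows (columns 1..9).
PWM = {
    'A': [.97, .10, .02, .03, .10, .01, .05, .85, .03],
    'C': [.01, .40, .01, .04, .05, .01, .05, .05, .03],
    'G': [.01, .40, .95, .03, .40, .01, .30, .05, .03],
    'T': [.01, .10, .02, .90, .45, .97, .60, .05, .91],
}

def pwm_score(word, pwm):
    """P(word | PWM): product of per-position base probabilities."""
    p = 1.0
    for j, base in enumerate(word):
        p *= pwm[base][j]
    return p

# A word matching the consensus ASGTKTKAC scores far higher
# than an off-consensus word.
strong = pwm_score("ACGTTTTAC", PWM)
weak = pwm_score("AAAAAAAAA", PWM)
```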
Sequence Logo
[Sequence logo generated from the PWM above; the y-axis shows base frequency]
http://weblogo.berkeley.edu/
http://biodev.hgen.pitt.edu/cgi-bin/enologos/enologos.cgi
Entropy and information content
• Entropy: a measure of uncertainty
• The entropy of a random variable X that can assume the n different values x1, x2, …, xn with respective probabilities p1, p2, …, pn is defined as
H(X) = -Σi pi log2 pi
Entropy and information content
• Example: A, C, G, T with equal probability:
H = 4 × (-0.25 log2 0.25) = log2 4 = 2 bits
Need 2 bits to encode (e.g. 00 = A, 01 = C, 10 = G, 11 = T). Maximum uncertainty.
• 50% A and 50% C:
H = 2 × (-0.5 log2 0.5) = log2 2 = 1 bit
• 100% A:
H = 1 × (-1 log2 1) = 0 bits. Minimum uncertainty.
• Information: the opposite of uncertainty
I = maximum uncertainty - entropy
The above examples provide 0, 1, and 2 bits of information, respectively.
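The examples above can be checked with a few lines of Python. This is a minimal sketch; `entropy` takes a list of probabilities and skips zero terms (which contribute nothing to the sum):

```python
import math

def entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2 p); p = 0 terms contribute 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# The three cases from the slide:
# uniform A/C/G/T -> 2 bits, 50/50 A/C -> 1 bit, 100% A -> 0 bits
```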
Entropy and information content
    1     2     3     4     5     6     7     8     9
A  .97   .10   .02   .03   .10   .01   .05   .85   .03
C  .01   .40   .01   .04   .05   .01   .05   .05   .03
G  .01   .40   .95   .03   .40   .01   .30   .05   .03
T  .01   .10   .02   .90   .45   .97   .60   .05   .91
H  .24  1.72   .36   .63  1.60   .24  1.40   .85   .58
I 1.76   .28  1.64  1.37   .40  1.76   .60  1.15  1.42
Mean I: 1.15   Total I: 10.4
Expected occurrence in random DNA: 1 / 2^10.4 ≈ 1 / 1340
Expected occurrence of an exact 5-mer: 1 / 2^10 = 1 / 1024
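As a sketch in Python, the per-column information content and the total follow directly from the definition I = 2 - H (2 bits is the maximum entropy for a 4-letter alphabet):

```python
import math

# Columns of the PWM above, each as [pA, pC, pG, pT].
PWM_COLS = [
    [.97, .01, .01, .01], [.10, .40, .40, .10], [.02, .01, .95, .02],
    [.03, .04, .03, .90], [.10, .05, .40, .45], [.01, .01, .01, .97],
    [.05, .05, .30, .60], [.85, .05, .05, .05], [.03, .03, .03, .91],
]

def info_content(col):
    """I = 2 - H for one DNA column (max entropy log2 4 = 2 bits)."""
    H = -sum(p * math.log2(p) for p in col if p > 0)
    return 2 - H

total = sum(info_content(c) for c in PWM_COLS)  # ~10.4 bits, as on the slide
```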
Sequence Logo
[The same logo with column heights scaled by information content]
I 1.76 .28 1.64 1.37 .40 1.76 .60 1.15 1.42
Real example
• E. coli promoter
• “TATA box”: ~10 bp upstream of transcription start
• Instances: TACGAT, TAAAAT, TATACT, GATAAT, TATGAT, TATGTT
Consensus: TATAAT
Note: none of the instances matches the consensus perfectly
Finding Motifs
Classification of approaches
• Combinatorial algorithms
– Based on enumeration of words and computing word similarities
• Probabilistic algorithms
– Construct probabilistic models to distinguish motifs from non-motifs
Combinatorial motif finding
• Idea 1: find all k-mers that appear at least m times
– m may be chosen such that the number of occurrences is statistically significant
– Problem: most motifs are divergent; each variant may appear only once
• Idea 2: find all k-mers, considering IUPAC nucleic acid codes
– e.g. ASGTKTKAC, S = C/G, K = G/T
– Still inflexible
• Idea 3: find k-mers that appear approximately at least m times
– i.e. allow some mismatches
Combinatorial motif finding
Given a set of sequences S = {x1, …, xn}
• A motif W is a consensus string w1…wK
• Find motif W* with “best” match to x1, …, xn
Definition of “best”:
d(W, xi) = min Hamming distance between W and any word in xi
d(W, S) = Σi d(W, xi)
W* = argminW d(W, S)
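The distance definitions above translate directly to Python. As a toy illustration (function names are mine), the six TATA-box instances from the earlier slide can stand in for the sequences in S:

```python
def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(u != v for u, v in zip(a, b))

def d_seq(W, x):
    """d(W, x): minimum Hamming distance between W and any |W|-length word in x."""
    k = len(W)
    return min(hamming(W, x[y:y + k]) for y in range(len(x) - k + 1))

def d_set(W, S):
    """d(W, S): sum of d(W, x) over all sequences x in S."""
    return sum(d_seq(W, x) for x in S)

S = ["TACGAT", "TAAAAT", "TATACT", "GATAAT", "TATGAT", "TATGTT"]
# d_set("TATAAT", S) -> 8: each instance is within 2 mismatches of the consensus.
```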
Exhaustive searches
1. Pattern-driven algorithm:
For W = AA…A to TT…T (4^K possibilities)
    Find d(W, S)
Report W* = argmin d(W, S)
Running time: O(K N 4^K), where N = Σi |xi|
Guaranteed to find the optimal solution.
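A self-contained sketch of the pattern-driven search (helper functions repeated for completeness; ties between equally good patterns are broken by enumeration order):

```python
from itertools import product

def hamming(a, b):
    return sum(u != v for u, v in zip(a, b))

def d_set(W, S):
    """Total of the per-sequence minimum Hamming distances to W."""
    k = len(W)
    return sum(min(hamming(W, x[y:y + k]) for y in range(len(x) - k + 1))
               for x in S)

def pattern_driven(S, k):
    """Try all 4^k patterns; return one minimizing d(W, S). O(k * N * 4^k)."""
    return min((''.join(p) for p in product('ACGT', repeat=k)),
               key=lambda W: d_set(W, S))
```

On the six TATA-box instances this recovers the consensus TATAAT, even though no instance matches it exactly; that guarantee of optimality is exactly what the sample-driven shortcut gives up.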
Exhaustive searches
2. Sample-driven algorithm:
For each K-char word W in some xi
    Find d(W, S)
Report W* = argmin d(W, S), or report a local improvement of W*
Running time: O(K N^2)
Exhaustive searches
• Problem with the sample-driven approach. If:
– the true motif does not occur in the data, and
– the true motif is “weak”
• then random strings may score better than any instance of the true motif
Example
• E. coli promoter
• “TATA box”: ~10 bp upstream of transcription start
• Instances: TACGAT, TAAAAT, TATACT, GATAAT, TATGAT, TATGTT
Consensus: TATAAT
Each instance differs from the consensus in at most 2 bases.
None of the instances matches the consensus perfectly.
Heuristic methods
• Cannot afford exhaustive search on all patterns
• Sample-driven approaches may miss real patterns
• However, a real pattern should not differ too much from its instances in S
• Start from the space of all words in S and extend toward the space of real patterns
Some of the popular tools
• Consensus (Hertz & Stormo, 1999)
• WINNOWER (Pevzner & Sze, 2000)
• MULTIPROFILER (Keich & Pevzner, 2002)
• PROJECTION (Buhler & Tompa, 2001)
• WEEDER (Pavesi et al., 2001)
• And dozens of others
Probabilistic modeling approaches
for motif finding
Probabilistic modeling approaches
• A motif model
– Usually a PWM
– M = (pij), i = 1..4, j = 1..k, where k is the motif length
• A background model
– Usually the distribution of base frequencies in the genome (or another selected subset of sequences)
– B = (bi), i = 1..4
• A word can be generated by M or B
Expectation-Maximization
• For any word W:
P(W | M) = p_{W[1],1} p_{W[2],2} … p_{W[K],K}
P(W | B) = b_{W[1]} b_{W[2]} … b_{W[K]}
• Let λ = P(M), i.e., the prior probability that any word is generated by M; then P(B) = 1 - λ.
• Can compute the posterior probabilities P(M|W) and P(B|W):
P(M|W) ∝ P(W|M) · λ
P(B|W) ∝ P(W|B) · (1 - λ)
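The normalized posterior follows from Bayes' rule. A minimal sketch (the function name is mine; the two likelihoods and the prior λ come from the model above):

```python
def posterior_motif(p_w_m, p_w_b, lam):
    """P(M | W) = lam * P(W|M) / (lam * P(W|M) + (1 - lam) * P(W|B))."""
    num = lam * p_w_m
    return num / (num + (1 - lam) * p_w_b)

# With equal likelihoods and lam = 0.5 the posterior stays at 0.5;
# a word 9x more likely under M than under B gets posterior 0.9.
```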
Expectation-Maximization
Initialize: randomly assign each word to M or B
• Let Zxy = 1 if position y in sequence x is a motif, and 0 otherwise
• Estimate parameters M, λ, B
Iterate until convergence:
• E-step: Zxy = P(M | x[y..y+k-1]) for all x and y
• M-step: re-estimate M and λ given Z (B is usually fixed)
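The loop above can be sketched as a toy EM under simplifying assumptions: one motif occurrence per sequence, a uniform background (so the background likelihood cancels in the E-step), and pseudocount 1. The names `em_motif`, `word_prob`, and `normalize` are illustrative, not from MEME:

```python
import random

BASES = 'ACGT'

def word_prob(w, M):
    """P(word | M): product of per-position probabilities."""
    p = 1.0
    for j, b in enumerate(w):
        p *= M[b][j]
    return p

def normalize(counts, k):
    """Turn per-column counts into a column-stochastic PWM."""
    return {b: [counts[b][j] / sum(counts[c][j] for c in BASES)
                for j in range(k)]
            for b in BASES}

def em_motif(S, k, iters=30, seed=1):
    """Toy EM for a length-k motif: soft assignments Z over all start positions."""
    rng = random.Random(seed)
    # Initialize M from one random word per sequence (pseudocount 1).
    counts = {b: [1.0] * k for b in BASES}
    for x in S:
        y = rng.randrange(len(x) - k + 1)
        for j in range(k):
            counts[x[y + j]][j] += 1
    M = normalize(counts, k)
    for _ in range(iters):
        counts = {b: [1.0] * k for b in BASES}
        for x in S:
            # E-step: Z[y] proportional to P(word at y | M).
            words = [x[y:y + k] for y in range(len(x) - k + 1)]
            z = [word_prob(w, M) for w in words]
            tot = sum(z)
            # M-step accumulation: fractional counts weighted by Z.
            for w, zz in zip(words, z):
                for j in range(k):
                    counts[w[j]][j] += zz / tot
        M = normalize(counts, k)
    return M
```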
Expectation-Maximization
• E-step: Zxy = P(M | X[y..y+k-1]) for all x and y
• M-step: re-estimate M, given Z
[Plots: motif-start probability vs. position, in three panels: Initialize, E-step, M-step]
MEME
• Multiple EM for Motif Elicitation
• Bailey and Elkan, UCSD
• http://meme.sdsc.edu/
• Multiple starting points
• Multiple modes: ZOOPS, OOPS, TCM
Gibbs Sampling
• Another very useful technique for estimating missing parameters
• EM is deterministic– Often trapped by local optima
• Gibbs sampling: stochastic behavior to avoid local optima
Gibbs Sampling
Initialize: randomly assign each word to M or B
• Let Zxy = 1 if position y in sequence x is a motif, and 0 otherwise
• Estimate parameters M, B, λ
Iterate:
• Randomly remove a sequence x* from S
• Recalculate model parameters using S \ x*
• Compute Zx*y for x*
• Sample a y* from Zx*y
• Set Zx*y = 1 for y = y* and 0 otherwise
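One iteration of this loop can be sketched as follows (a toy illustration, assuming one motif start per sequence is tracked in `starts`, with pseudocount 1; `gibbs_step` is an illustrative name):

```python
import random

BASES = 'ACGT'

def gibbs_step(S, starts, k, idx, rng):
    """One Gibbs update: hold out sequence S[idx], rebuild the PWM from the
    other sequences' current motif starts, score every window of the held-out
    sequence, and sample a new start in proportion to its score."""
    # Build the PWM from all sequences except the held-out one.
    M = {b: [1.0] * k for b in BASES}  # pseudocounts
    for i, x in enumerate(S):
        if i == idx:
            continue
        for j in range(k):
            M[x[starts[i] + j]][j] += 1
    for j in range(k):
        tot = sum(M[b][j] for b in BASES)
        for b in BASES:
            M[b][j] /= tot
    # Score each window of the held-out sequence, then sample stochastically.
    x = S[idx]
    weights = []
    for y in range(len(x) - k + 1):
        p = 1.0
        for j in range(k):
            p *= M[x[y + j]][j]
        weights.append(p)
    starts[idx] = rng.choices(range(len(weights)), weights=weights)[0]
    return weights, starts[idx]
```

With a planted motif and the other sequences aligned correctly, the true window dominates the weights, but the sampling step still occasionally visits other positions, which is exactly what lets the sampler escape local optima.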
Gibbs Sampling
• Gibbs sampling: sample one position according to its probability
– Update the prediction of one training sequence at a time
• Viterbi: always take the highest-probability position
• EM: take the weighted average; simultaneously update the predictions of all sequences
[Plots: motif-start probability vs. position, comparing the update rules]
Better background model
• Repeat DNA can be confused with motifs
– Especially low-complexity repeats: CACACA…, AAAAA, etc.
• Solution: a more elaborate background model
– Higher-order Markov model
0th order: B = { pA, pC, pG, pT }
1st order: B = { P(A|A), P(A|C), …, P(T|T) }
…
Kth order: B = { P(X | b1…bK); X, bi ∈ {A,C,G,T} }
Has been applied to EM and Gibbs (up to 3rd order)
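A 1st-order background is simple to fit from dinucleotide counts. A sketch (function names are mine; pseudocount 1 keeps unseen transitions nonzero):

```python
BASES = 'ACGT'

def train_markov1(seqs):
    """Fit a 1st-order background model P(b | a) from dinucleotide counts."""
    counts = {a: {b: 1.0 for b in BASES} for a in BASES}  # pseudocounts
    for x in seqs:
        for a, b in zip(x, x[1:]):
            counts[a][b] += 1
    return {a: {b: counts[a][b] / sum(counts[a].values()) for b in BASES}
            for a in BASES}

def bg_prob(word, P, p0=0.25):
    """P(word | B) = p0 for the first base times the chain of transitions."""
    p = p0
    for a, b in zip(word, word[1:]):
        p *= P[a][b]
    return p
```

Trained on a CA-repeat, this background assigns CA-repeat words high probability, so they no longer look surprisingly motif-like, which is the point of the richer model.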
Gibbs sampling motif finders
• Gibbs Sampler
– First appeared as: Lawrence et al., Science 262(5131):208-214
– Continually developed and updated
– The newest version: Thompson et al., Nucleic Acids Res. 35(s2):W232-W237
• AlignACE
– Hughes et al., J. Mol. Biol. 2000;296(5):1205-14
– Allows don’t-care positions
– Additional tools to scan motifs in new sequences, and to compare and group motifs
• BioProspector, an improvement of AlignACE
– Liu, Brutlag and Liu, Pac. Symp. Biocomput. 2001:127-38
– Allows two-block motifs
– Considers higher-order Markov models