cap 5510: introduction to bioinformaticsgiri/teach/bioinf/s07/lecx1.pdf2/8/07 cap5510 4 unaligned...
TRANSCRIPT
![Page 1: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/1.jpg)
2/8/07 CAP5510 1
CAP 5510: Introduction to Bioinformatics
Giri NarasimhanECS 254; Phone: x3748
[email protected]/~giri/teach/BioinfS07.html
![Page 2: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/2.jpg)
2/8/07 CAP5510 2
Pattern Discovery
![Page 3: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/3.jpg)
2/8/07 CAP5510 3
What we have discussed so far
Why Pattern discovery?Types of patternsHow to represent and store patterns?Types of pattern discovery
Supervised pattern discoveryMotif Detection
Unsupervised pattern discoveryEvaluation of pattern discovery
![Page 4: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/4.jpg)
2/8/07 CAP5510 4
Unaligned Pattern DiscoveryUnaligned Pattern Discovery
Rigoutsos & Floratos, Bioinformatics, ’98
TEIRESIAS: The algorithm is similar to that used in GYM for aligned Pattern discovery.
TEIRESIASProtein SequenceDatabase
Seqlet Dictionary
A..GV
L..H…H
Y.C..C…F
V..G..G.G.T.L•••
![Page 5: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/5.jpg)
2/8/07 CAP5510 5
TEIRESIAS: Key Features
Starts with a set of seed patterns (Enumeration step) Convolution operator applied to all pairs of patterns:
A..GV.S ⊕ V.S.GR = A..GV.S.GROrder of Evaluation carefully chosen so that long patterns get longer firstFinds all maximal patterns.Combinatorial explosion avoided by generating only relevant maximal patterns.
Rigoutsos & Floratos, Bioinformatics, ’98
![Page 6: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/6.jpg)
2/8/07 CAP5510 6
SPLASH
Structural Pattern Localization Analysis by Sequential Histogram (SPLASH)Not limited to fixed alphabet sizePatterns are modeled by a homology metric and thus allow mismatchesEarly pruning of inconsistent seed patterns, leading to increased efficiency. Easily parallelized with availability of extra resources.
Califano, Bioinformatics, ’00; Califano et al., J Comput Biol, ’00
![Page 7: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/7.jpg)
2/8/07 CAP5510 7
Precomputed Sequence Patterns
PROSITEBLOCKS and PRINTSeMOTIFSPATPRODOMPfam
![Page 8: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/8.jpg)
2/8/07 CAP5510 8
Motif Detection ToolsPROSITE (Database of protein families & domains)
Try PDOC00040. Also Try PS00041PRINTS Sample OutputBLOCKS (multiply aligned ungapped segments for highly conserved regions of proteins; automatically created) Sample OutputPfam (Protein families database of alignments & HMMs)
Multiple Alignment, domain architectures, species distribution, links: TryMoSTPROBEProDomDIP
![Page 9: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/9.jpg)
2/8/07 CAP5510 9
Protein Information Sites
SwissPROT & GenBankInterPRO is a database of protein families, domains and functional sites in which identifiable features found in known proteins can be applied to unknown protein sequences. See sample.
PIR Sample Protein page
![Page 10: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/10.jpg)
2/8/07 CAP5510 10
Modular Nature of Proteins
Proteins are collections of “modular”domains. For example,
F2 E E
EF2
F2
K K
K Catalytic Domain
Catalytic Domain
PLAT
Coagulation Factor XII
![Page 11: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/11.jpg)
2/8/07 CAP5510 11
Domain Architecture Tools
CDARTProtein Domain ArchitectureAAH24495; ;It’s domain relatives; Multiple alignment for 2nd domain
SMART
![Page 12: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/12.jpg)
2/8/07 CAP5510 12
Predicting Specialized Structures
COILS – Predicts coiled coil motifsTMPred – predicts transmembrane regionsSignalP – predicts signal peptidesSEG – predicts nonglobular regions
![Page 13: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/13.jpg)
2/8/07 CAP5510 13
Patterns in DNA Sequences
Signals in DNA sequence control eventsStart and end of genesStart and end of intronsTranscription factor binding sites (regulatory elements)Ribosome binding sites
Detection of these patterns are useful for Understanding gene structureUnderstanding gene regulation
![Page 14: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/14.jpg)
2/8/07 CAP5510 14
Motifs in DNA Sequences
Given a collection of DNA sequences of promoter regions, locate the transcription factor binding sites (also called regulatory elements)
Example:
http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html
![Page 15: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/15.jpg)
2/8/07 CAP5510 15
Motifs
http://weblogo.berkeley.edu/examples.html
![Page 16: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/16.jpg)
2/8/07 CAP5510 16
Motifs in DNA Sequences
http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html
![Page 17: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/17.jpg)
2/8/07 CAP5510 17
More Motifs in E. Coli DNA Sequences
http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html
![Page 18: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/18.jpg)
2/8/07 CAP5510 18
http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html
![Page 19: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/19.jpg)
2/8/07 CAP5510 19
Other Motifs in DNA
Sequences: Human Splice
Junctions
http://www.lecb.ncifcrf.gov/~toms/sequencelogo.html
![Page 20: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/20.jpg)
2/8/07 CAP5510 20
Motifs in DNA Sequences
![Page 21: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/21.jpg)
2/8/07 CAP5510 21
Gene
Gene-Specific TFBinding Sites
TATA BoxCAT Box
Basal TFBinding Sites
coding regionupstream region
Transcription Regulation
![Page 22: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/22.jpg)
Transcription Regulation
2/8/07 CAP5510 22
TF
DNA
TFBS
Gene
![Page 23: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/23.jpg)
Transcription Regulation
2/8/07 CAP5510 23
DNA
TF
TFBS
Gene Co-regulated genes
![Page 24: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/24.jpg)
2/8/07 CAP5510 24
Problem: Given the upstream regions of all genes in the genome, find all over-represented sequence signatures.
Motif-prediction: Whole genome
![Page 25: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/25.jpg)
2/8/07 CAP5510 25
Basic Principle: If a TF co-regulates many genes, then all these genes should have at least 1 binding site for it in their upstream region.
Motif-prediction: Whole genome
Gene 1 Gene 2 Gene 3 Gene 4 Gene 5
Binding sites for TF
![Page 26: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/26.jpg)
2/8/07 CAP5510 26
Motif Detection (TFBMs)
See evaluation by Tompa et al.[bio.cs.washington.edu/assessment]
Gibbs Sampling Methods: AlignACE, GLAM, SeSiMCMC, MotifSamplerWeight Matrix Methods: ANN-Spec, Consensus, EM: Improbizer, MEMECombinatorial & Misc.: MITRA, oligo/dyad, QuickScore, Weeder, YMF
![Page 27: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/27.jpg)
2/8/07 CAP5510 27
Predicting Motifs in Whole Genome
MEME: EM algorithm [ Bailey et al., 1994 ]
AlignACE: Gibbs Sampling Approach [ Hughes et al., 2000 ]
Consensus: Greedy Algorithm Based [ Hertz et al., 1990 ]
ANN-Spec: Artificial Neural Network and a Gibbs sampling method [ Workman et al., 2000 ]
YMF: Enumerative search [Sinha et al., 2003 ]
…
![Page 28: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/28.jpg)
2/8/07 CAP5510 28
Input: upstream sequencesX = {X1, X2, …, Xn},
Motif profile: 4×k matrix θ = (θrp), r ∈ {A,C,G,T} 1 ≤ p ≤ kθrp = Pr(residue r in position p of motif)
Background distribution: θr0 = Pr(residue r in background)
EM Method: Model Parameters
![Page 29: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/29.jpg)
2/8/07 CAP5510 29
Z = {Zij}, where1, if motif instance starts at
Zij = position i of Xj0, otherwise
Iterate over probabilistic models that could generate X and Z, trying to converge on this solution
EM Method: Hidden Information
![Page 30: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/30.jpg)
2/8/07 CAP5510 30
M-step: Build a new profile by using every m-window, but weighting each one with value zij.
Initialize: random profile
EM Algorithm
E-step: Using profile, compute a likelihood value zijfor each m-window at position i in input sequence j.
Stop if convergedMEME [Bailey, Elkan 1994]
Goal: Find θ, Z that maximize Pr (X, Z | θ)
![Page 31: CAP 5510: Introduction to Bioinformaticsgiri/teach/Bioinf/S07/Lecx1.pdf2/8/07 CAP5510 4 Unaligned Pattern Discovery Rigoutsos & Floratos, Bioinformatics, ’98 TEIRESIAS: The algorithm](https://reader034.vdocuments.site/reader034/viewer/2022050309/5f711446cd7bcd62e4377753/html5/thumbnails/31.jpg)
Gibbs Sampling for Motif Detection
2/8/07 CAP5510 31