![Page 1: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/1.jpg)
HMM applications in computational biology
10-701
Machine Learning
![Page 2: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/2.jpg)
Central dogma
DNA → (transcription) → mRNA → (translation) → Protein
DNA: CCTGAGCCAACTATTGATGAA
mRNA: CCUGAGCCAACUAUUGAUGAA
Protein: PEPTIDE
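The example sequences on this slide can be reproduced with a few lines of Python. This is a minimal sketch, assuming the DNA shown is the coding strand (so transcription is just T → U) and using the standard genetic code; the function names `transcribe` and `translate` are illustrative, not from the slides.

```python
from itertools import product

# Standard genetic code, codons ordered with bases U, C, A, G
# (third position varies fastest). "*" marks a stop codon.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: replace T with U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    usable = len(mrna) - len(mrna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```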
![Page 3: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/3.jpg)
Biological data is rapidly accumulating
[Diagram: DNA → RNA (transcription) → proteins (translation); transcription factors. Data source: next generation sequencing]
![Page 4: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/4.jpg)
![Page 5: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/5.jpg)
![Page 6: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/6.jpg)
Biological data is rapidly accumulating
[Diagram: DNA → RNA (transcription) → proteins (translation); transcription factors. Data source: array / sequencing technology]
![Page 7: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/7.jpg)
Biological data is rapidly accumulating
[Diagram: DNA → RNA (transcription) → proteins (translation); transcription factors. Data source: protein interactions]
• 38,000 identified interactions
• Hundreds of thousands of predictions
![Page 8: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/8.jpg)
![Page 9: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/9.jpg)
FDA Approves Gene-Based Breast Cancer Test*
“MammaPrint is a DNA microarray-based test that measures the activity of 70 genes in a sample of a woman's breast-cancer tumor and then uses a specific formula to determine whether the patient is deemed low risk or high risk for the spread of the cancer to another site.”
*Washington Post, 2/06/2007
![Page 10: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/10.jpg)
![Page 11: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/11.jpg)
Active Learning
![Page 12: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/12.jpg)
Sequencing DNA
Due to accumulated errors, we could only reliably read at most 300-500 nucleotides.
First human genome draft in 2001
![Page 13: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/13.jpg)
Shotgun Sequencing
Wikipedia
![Page 14: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/14.jpg)
Caveats
• Errors in reading
• Non-trivial assembly task: repeats in the genome
MacCallum et al., GB 2009
![Page 15: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/15.jpg)
Error Correction in DNA sequencing
• The fragmentation happens at random locations of the molecules, so we expect all positions in the genome to be covered by about the same number of reads.
• K-mers are substrings of length k of the reads.
• Sequencing errors create error k-mers, which appear far less often than k-mers that come from the true genome.
Kelley et al., GB 2010
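The k-mer idea can be illustrated with a small counting sketch. It assumes a plain count threshold (`min_count`, a hypothetical parameter) to flag rare k-mers as likely errors; the cited method may use a more refined criterion, and the reads below are made up.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every substring of length k across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def suspicious_kmers(counts, min_count):
    """K-mers seen fewer than min_count times are candidate error k-mers."""
    return {kmer for kmer, c in counts.items() if c < min_count}


# Three error-free reads of the same region and one read with a
# single misread base (T -> A at position 4).
reads = ["ACGTTGCA", "ACGTTGCA", "ACGTTGCA", "ACGTAGCA"]
counts = count_kmers(reads, k=4)
print(sorted(suspicious_kmers(counts, min_count=2)))
# ['AGCA', 'CGTA', 'GTAG', 'TAGC']
```

A single misread base produces up to k rare k-mers, which is why low-count k-mers cluster around the error position.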
![Page 16: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/16.jpg)
Transcriptome Shotgun Sequencing (RNA-Seq)
Sequencing RNA molecule transcripts.
Reminder:
• (mRNA) transcripts are “expression products” of genes.
• Different genes have different expression levels, so some transcripts are more abundant than others.
@Friedrich Miescher Laboratory
![Page 17: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/17.jpg)
Challenges
• Large datasets: 10-100 million reads of 75-150 bp each.
• Memory efficiency: out-of-memory processing of the data is too time consuming.
• The challenges of DNA sequencing, plus others: alternative splicing, RNA editing, post-transcriptional modification.
![Page 18: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/18.jpg)
Errors are non-uniformly distributed
• Some transcripts are more prone to errors
• Errors are harder to correct in reads from lowly expressed transcripts
![Page 19: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/19.jpg)
SEECER: error correction and consensus sequence estimation for RNA-Seq data
![Page 20: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/20.jpg)
Key idea: HMM model
The way sequencers work:
• Read letter by letter, sequentially
• Possible errors: insertion, deletion, or misread of a nucleotide
Salmela et al., Bioinformatics 2011
![Page 21: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/21.jpg)
![Page 22: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/22.jpg)
Building (Learning) the HMMs and Making Corrections (Inference)
Learning = Expectation-Maximization
Inference = Viterbi algorithm
Seeding: guess possible reads using k-mer overlaps, then construct the HMM from these reads.
Speed-up: the k-mer overlaps yield approximate multiple alignments of the reads, so the HMM parameters can be learned from these alignments directly.
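To make the inference step concrete, here is a minimal Viterbi sketch on a toy left-to-right HMM. Everything about the toy model is an assumption made for illustration: the three-base consensus, the single insertion state `I0`, and all probabilities are invented, and deletions are not modelled. It is not the model used by the tool described in these slides.

```python
import math

NEG_INF = float("-inf")


def log(p):
    """Log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most probable state path for the observed symbols."""
    # Each column maps state -> (best log score, predecessor state).
    columns = [{
        s: (log(start_p.get(s, 0)) + log(emit_p[s].get(obs[0], 0)), None)
        for s in states
    }]
    for symbol in obs[1:]:
        prev = columns[-1]
        column = {}
        for s in states:
            score, arg = max(
                (prev[r][0] + log(trans_p[r].get(s, 0)), r) for r in states
            )
            column[s] = (score + log(emit_p[s].get(symbol, 0)), arg)
        columns.append(column)
    # Backtrack from the best final state.
    path = [max(states, key=lambda s: columns[-1][s][0])]
    for column in reversed(columns[1:]):
        path.append(column[path[-1]][1])
    return path[::-1]


# Toy model (hypothetical): match states M0..M2 emit the consensus "ACG"
# with a small misread probability; I0 models one inserted base after M0.
CONSENSUS = {"M0": "A", "M1": "C", "M2": "G"}
STATES = ["M0", "I0", "M1", "M2"]
START = {"M0": 1.0}
TRANS = {"M0": {"M1": 0.9, "I0": 0.1}, "I0": {"M1": 1.0},
         "M1": {"M2": 1.0}, "M2": {}}
EMIT = {s: {b: 0.97 if b == c else 0.01 for b in "ACGT"}
        for s, c in CONSENSUS.items()}
EMIT["I0"] = {b: 0.25 for b in "ACGT"}


def correct(read):
    """Decode the read and output the consensus base of each match state."""
    path = viterbi(read, STATES, START, TRANS, EMIT)
    return "".join(CONSENSUS[s] for s in path if s in CONSENSUS)


print(correct("ATG"))   # misread C -> T is corrected: ACG
print(correct("ATCG"))  # inserted T is dropped: ACG
```

Working in log space avoids numerical underflow on long reads, which matters once the chain is hundreds of states long.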
![Page 23: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/23.jpg)
Clustering to improve seeding
Real biological differences should be supported by a set of reads with similar mismatches to the consensus
![Page 24: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/24.jpg)
1. Clustering positions with mismatches to identify clusters of correlated positions.
2. Build a similarity matrix between these positions.
3. Use Spectral clustering to find clusters of correlated positions.
4. Filter out reads that have mismatches in these clusters.
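The first two steps can be sketched as follows. The slides do not say which similarity measure is used, so this sketch assumes Jaccard similarity between the sets of reads that mismatch the consensus at each position; the reads are made up, and the spectral clustering step itself is not shown.

```python
def mismatch_sets(consensus, reads):
    """Map each mismatching position to the set of read indices that
    disagree with the consensus there (reads are assumed pre-aligned)."""
    sets = {}
    for read_id, read in enumerate(reads):
        for pos, (c, r) in enumerate(zip(consensus, read)):
            if c != r:
                sets.setdefault(pos, set()).add(read_id)
    return sets


def similarity_matrix(sets):
    """Jaccard similarity between mismatch positions (an assumed measure)."""
    positions = sorted(sets)
    matrix = [
        [len(sets[p] & sets[q]) / len(sets[p] | sets[q]) for q in positions]
        for p in positions
    ]
    return positions, matrix


consensus = "ACGTACGT"
reads = [
    "ACGTACGT",  # matches the consensus
    "ATGTACCT",  # mismatches at positions 1 and 6
    "ATGTACCT",  # same two mismatches: likely a real variant
    "ACGAACGT",  # isolated mismatch at position 3: likely an error
]
positions, matrix = similarity_matrix(mismatch_sets(consensus, reads))
print(positions)  # [1, 3, 6]
for row in matrix:
    print(row)
```

Positions 1 and 6 are mismatched by exactly the same reads, so their similarity is 1.0, while the isolated mismatch at position 3 has similarity 0.0 to both. A clustering step run on this matrix would group positions 1 and 6 together.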
![Page 25: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/25.jpg)
Comparison to other methods
![Page 26: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/26.jpg)
Using the corrected reads, the assembler can recover more transcripts
![Page 27: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/27.jpg)
Analysis of sea cucumber data
![Page 28: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/28.jpg)
Data integration in biology
![Page 29: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/29.jpg)
Key problem: Most high-throughput data is static
[Figure: static data sources versus time-series measurements along a time axis. Labels: sequencing, motif, ChIP-chip, PPI, microarray.]
![Page 30: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/30.jpg)
DREM: Dynamic Regulatory Events Miner
![Page 31: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/31.jpg)
[Figure: DREM combines time series expression data with static TF-DNA binding data in an IOHMM model. Panels a-d; axes: time vs. expression level; model structure with transition probabilities 1, 0.9, 0.1, 0.95, 0.05; labels: TF A, TF B, TF C, TF D, ?]
![Page 32: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/32.jpg)
Things are a bit more complicated: Real data
![Page 33: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/33.jpg)
A Hidden Markov Model
$$L(O, H; \Theta) = \prod_{i=1}^{n} \left[ \prod_{t=1}^{T} p\big(O_t(i) \mid H_t(i)\big) \prod_{t=2}^{T} p\big(H_t(i) \mid H_{t-1}(i)\big) \right]$$
Hidden States
Observed outputs (expression levels)
t=0 t=1 t=2 t=3
H0 H1 H2 H3
O0 O1 O2 O3
Schliep et al Bioinformatics 2003
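The likelihood on this slide is a product, over genes and time points, of emission probabilities and transition probabilities along each gene's hidden path. A minimal sketch of its logarithm is given below, assuming Gaussian emissions for the expression levels; the state names, parameters, and function names are hypothetical.

```python
import math


def gaussian_logpdf(x, mean, std):
    """Log density of a normal distribution at x."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def complete_log_likelihood(observations, paths, emissions, transitions):
    """Sum of log emission terms (every time point) and log transition
    terms (from the second time point on), over all genes.

    observations: one list of expression levels per gene
    paths:        one list of hidden state names per gene
    emissions:    state -> (mean, std) of its Gaussian emission
    transitions:  state -> {next state: probability}
    """
    total = 0.0
    for obs, path in zip(observations, paths):
        for t, (o, h) in enumerate(zip(obs, path)):
            mean, std = emissions[h]
            total += gaussian_logpdf(o, mean, std)
            if t > 0:
                total += math.log(transitions[path[t - 1]][h])
    return total


# One gene, two time points, made-up parameters.
emissions = {"low": (0.0, 1.0), "high": (1.0, 1.0)}
transitions = {"low": {"low": 0.5, "high": 0.5}, "high": {"high": 1.0}}
ll = complete_log_likelihood([[0.0, 1.0]], [["low", "high"]], emissions, transitions)
print(ll)
```

Both observations sit exactly on their state's mean, so the result here equals -log(2π) + log(0.5).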
![Page 34: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/34.jpg)
Input – Output Hidden Markov Model
• Input (static TF-gene interactions): Ig
• Hidden states (transitions between states form a tree structure): H0, H1, H2, H3 at t=0, 1, 2, 3
• Emissions (distribution of expression values): O0, O1, O2, O3
Log likelihood:
• Sum over all genes
• Sum over all paths Q
• Product over all Gaussian emission density values on path
• Product over all transition probabilities on path
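The formula itself did not survive in the slide text, but its annotations (sum over genes, sum over paths, product of Gaussian emission densities, product of transition probabilities) suggest a log likelihood of roughly the following form. The symbols are assumptions introduced here: $o_g(t)$ is the expression of gene $g$ at time $t$, $q(t)$ is the state visited by path $Q$ at time $t$, $f_{q(t)}$ is that state's Gaussian density, and $I_g$ is the static input for gene $g$.

$$\log L = \sum_{g} \log \sum_{Q} \left[ \prod_{t} f_{q(t)}\big(o_g(t)\big) \prod_{t \ge 1} P\big(H_t = q(t) \mid H_{t-1} = q(t-1),\, I_g\big) \right]$$

The only structural difference from a plain HMM likelihood is that the transition probabilities are conditioned on the input $I_g$.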
![Page 35: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/35.jpg)
E. coli response
PLoS Comp. Bio. 2008
Nature MSB 2011
IRF7
Fly development Science 2010
Genome Research 2010, PLoS ONE 2011
Mouse Immune response
Stem cell differentiation
![Page 36: HMM applications in computational biologyepxing/Class/10701/slides/compBio15.pdf · HMM applications in computational biology 10-701 Machine Learning . 2 Central dogma Protein mRNA](https://reader033.vdocuments.site/reader033/viewer/2022052923/5f040aae7e708231d40c0775/html5/thumbnails/36.jpg)
Things that work
• Approximate learning speeds things up on large datasets.
• In the real world, one technique is not enough; a solution involves using many techniques.
• Precision and recall trade off against each other.