Download - Tracking down ncRNAs in the genomes
![Page 1: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/1.jpg)
Tracking down ncRNAs in the genomes
![Page 2: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/2.jpg)
How to find ncRNA gene
• The stability of ncRNA secondary structure is not sufficiently different from the predicted stability of a random sequence. [Rivas and Eddy Bioinformatics (2000)].
![Page 3: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/3.jpg)
RNA secondary structure prediction problem
• Algorithms/programs to compute the minimum energy:
– Nussinov et al (1978), Waterman (1978), Smith and Waterman (1978), and Zuker and Sankoff (1984).
– Mfold (Zuker 2003) and RNAfold (ViennaRNA) (Hofacker 2003).
• RNA folding via energy minimization has its shortcomings:
– Prediction depends on correct energy parameters.
– Sometimes, the true structure does not
have the minimum energy.
![Page 4: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/4.jpg)
– RNAs with similar functions often have similar structures.– Sequence changes can be tolerated by covarying mutations.
How to find ncRNA genes from a multiple alignment
GCAUCGGugGUUCagu--gguaGAAU---//---CCGAUGCa
UCUAAUAugGCAUauu-----aGUGC---//---UAUUAGAa
GGGGAUGuaGCUUagu--gguaGAGC---//---UAUCCCCa
GCCGCCGuaGCUCagcccgggaGAGC---//---CGGCGGCa
GGGCCCGuaGCUUagcucgguaGAGC---//---CGGGCCCa
![Page 5: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/5.jpg)
RNAalifold
• If we have correct multiple alignments, looking for covarying mutations and finding consensus structure is a good way to do structure prediction. – RNAalignfold (Hofacker et al. 2002)
– The consensus structure prediction is more accurate.
– To find energetically stable consensus structure is more statistically significant.
– Still compute the MFE.
– Covariance information is incorporated into the energy model by rewarding compensatory and consistent mutations.
![Page 6: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/6.jpg)
Loop based energy models
• Stacks (contiguous nested base pairs) are the dominant stabilizing force – contribute the negative energy
• Unpaired bases form loops contribute the positive energy.– Hairpin loops, bulge/internal loops, and multiloops.
• Take into account covariance contribution:–
• Take into account inconsistent sequences:–
• Put together:
![Page 7: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/7.jpg)
Mountain representation of E. coli 16 S rRNA
![Page 8: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/8.jpg)
How to used it to detect the new ncRNA?
• To find energetically stable consensus structure is more statistically significant.
• MFE can be used to compute the statistical significance.– MFE: m
– Mean: ų
– Standard: δ
– Z-score: z = (m- ų)/ δ
• We need randomize the multiple sequence alignment– Shuffle the columns of the input alignment
• Not destroy the gap structure.• Certain sequence pattern.
![Page 9: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/9.jpg)
Alifoldz
![Page 10: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/10.jpg)
Distribution of z-scores for the tRNA test sets
![Page 11: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/11.jpg)
Sensitivity depends sequence divergence and the quality of alignment
(Based on pairwise alignments of SRP RNAs)
![Page 12: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/12.jpg)
Sensitivity on known ncRNAs in S. cerevisiae
• Use MultiPipMaker to generate the multiple alignment of S. cerevisiae and other 6 related yeast genome.
• Extracted the regions of annotated ncRNAs• Refine the poor aligned regions • Window size = 150, slide 20.• False-positive rate: 0.25%.• 30 CPU days.
![Page 13: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/13.jpg)
Some issues about Alifoldz
• Time consuming to compute the z-score
• If the sequence identity is too high (>95%), it would not work.
• If the sequence identity is too low (<60%), it would not work too.
![Page 14: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/14.jpg)
RNAz (PNAS, 2005)
• z-score (for individual sequence)– Using Support Vector Machine (SVM) regression.– Using 1000 random sequences of each of ~10,000 point to
compute the distribution.• Same length.
• Same base composition.
– Mean (ų) and standard deviation (δ) • Are functions of the length and base composition.
– Z-score: z = (m- ų)/ δ
– For an alignment, using the mean of the z-scores.
![Page 15: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/15.jpg)
z-scores calculated by SVM vs. sampled z-scores
![Page 16: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/16.jpg)
Structure conservation index (SCI)
• It is a measure for the structure conservation based on the computing a consensus structure– Using RNAalifold.– – If SCI -> 0, RNAalifold does not find a consensus structure.– If SCI-> 1, having conserved consensus structure.
![Page 17: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/17.jpg)
Classification based on both scores
• Estimate a probability (P) if the alignment is classified as a functional RNA, based on– SCI– z-score– Average pairwise identity– Number of sequences.
• It is also done by SVM.
![Page 18: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/18.jpg)
Classification based on z scores and SCI by a SVM
![Page 19: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/19.jpg)
Some test families
![Page 20: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/20.jpg)
Comparison with other methods
![Page 21: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/21.jpg)
Screening the Comparative Regulatory Genomics (CORG) Database
![Page 22: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/22.jpg)
Using RNAz screening human genome
• Nature Biotechnology 23, 1383 - 1390 (Nov. 2005), “Mapping of conserved RNA secondary structures predicts thousands of functional noncoding RNAs in the human genome”
• Input:– Genome-wide alignments of vertebrates from UCSC genome browser.
– Using PhastCons program to find the most conserved
– Adjacent conserved regions (<50 distances) are joined together.
– All regions > 50 bps.
– Remove all “known genes” and “Refseq genes”
• Output:– Predicted structured RNA elements in the human genomes using RNAz
![Page 23: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/23.jpg)
Results
![Page 24: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/24.jpg)
![Page 25: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/25.jpg)
Statistical analysis of predicted RNAs
![Page 26: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/26.jpg)
Comparison with some RNA database
![Page 27: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/27.jpg)
![Page 28: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/28.jpg)
Problem:
• Correctly aligning multiple and divergent RNA sequences without taking into account the structural information is difficult.
• To get covarying mutation, we need an alignment.
• Do the multiple alignment and structure prediction at the same time.
![Page 29: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/29.jpg)
RNA consensus folding problem
RNA consensus folding problem: computing the common secondary structure for a set of unaligned RNA sequences
– Sankoff (1985) first proposed an algorithm simultaneously align RNA sequences and find the optimal common fold.
• Time complexity is 0(n6) for two sequences with length n.• Implemented as Dynalign (Mathews and Turner 2002)• It’s not practical for multiple sequences.
– Eddy and Durbin (1994) and other groups used stochastic context-free grammars to predict the consensus structure.
• Start from a seed alignment.• Stochastic iteration to improve the alignment and predict structure.• Need a good seed alignment.
![Page 30: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/30.jpg)
Motivation to a new approach
– Base-pairs appear in ‘clusters’: we call them stacks, which is energetically favorable.
– Most of the stability of the RNA secondary structure is determined by stacks.
ACCUU AAGGA
p = (1/4)5 < 0.001.
– Stacks are much less likely to occur by chance.
![Page 31: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/31.jpg)
Statistics of the stacks in Rfam database
Fraction of true stacks missed
00.10.20.30.40.50.60.70.80.9
1
1 2 3 4 5 6 7 8 9 10
length of stacks
![Page 32: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/32.jpg)
Using stacks as anchors for predictions
• The idea of anchors as constraints has been used in multiple genomic sequence alignment.
– MAVID (Bray and Pachter, 2004)– TBA (Blanchette et al., 2004)
Several heuristic methods have been developed by finding anchored stacks:
– Waterman (1989) used a statistical approach to choose conserved stacks within fixed-size windows.
– Ji and Stormo (2004) and Perriquet et al. (2003) use primary sequence conservation of the stacks and the length of loop regions to reduce the searching space.
– stack anchor has low sequence similarity.
– It’s hard to find correct anchors
![Page 33: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/33.jpg)
Problem:
• Selecting one stack at a time may cause wrong matching stacks.
![Page 34: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/34.jpg)
A global approach: configuration of stacks
• RNA secondary structure can be viewed as stacks plus unpaired loops. (no individual base-pairs)
• The energy of the structure is the sum of the energies of stacks and loops.
• Stack configuration:
– Nested stacks
– Parallel stacks
– Crossing stacks (pseudo knots)
• More generalized stacks can include mismatches in the stacks.
![Page 35: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/35.jpg)
RNA Stack-based Consensus Folding (RNAscf) problem
• Find conserved stack configurations for a set of unaligned RNA sequence.
• Optimize both stability (free energy) of the structure and sequence similarity computed based on these common stacks as anchors.
![Page 36: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/36.jpg)
RNA stack-based consensus folding for pairwise sequences
![Page 37: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/37.jpg)
A matching stack-configurations on two sequences
Weights of different costs.Energy of the consensus structureSequence similarity of stacksSequence similarity of unpaired regions
![Page 38: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/38.jpg)
RNA Stack-based Consensus Folding for multiple sequences
![Page 39: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/39.jpg)
Cost function for multiple sequences
A1,1 A1,2 A1,3 A1,4 A1,5 A1,6 A1,k-2 A1,k-1 A1,k
…
…
…
...
A2,1 A2,2 A2,3 A2,4 A2,5 A2,6 A2,k-2 A2,k-1 A2,k
As,1 As,2 As,3 As,4 As,5 As,6 As,k-2 As,k-1 As,k
![Page 40: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/40.jpg)
Compute an optimal stack configuration for two sequences
• Dynamic programming algorithm is used to align RNA sequences and find an optimal configuration at the same time.
– The algorithm is similar to prior work (Sankoff 1985, Bafna et al. 1995)
– Differences: • We use stacks as the basic structural elements. • Prior work used individual base pairs.
– The computational time is O(n4) (n is the number of stacks). • Sankoff’s algorithm is O(m6), (m is the length of the sequences).• The number of possible stacks (size >= 4) is much smaller than the length
of the sequence.• It’s much faster.
![Page 41: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/41.jpg)
For any pair of stacks, there are three choices:
PA
PB
hairpin loop
PA
PB
Loop(PA)
Loop(PB)PA
PB
PX
PY
interior loop/bulge
PA
PB
PiA
PjB
P1A
P1B
multi-loop
![Page 42: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/42.jpg)
The score of matching stacks:
PA
PB
![Page 43: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/43.jpg)
The score of matching hairpin loops:
PA
PB
Loop(PA)
Loop(PB)
![Page 44: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/44.jpg)
The score of matching interior loops or bulges:
PA
PB
PX
PY
Loop(PX,PA)
Loop(PY,PA)
![Page 45: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/45.jpg)
The score of matching two multi-loops:
PA
PB
PiA
PjB
P1A
P1B
Loop(Pi,PA)
Loop(Pi,PB)
![Page 46: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/46.jpg)
Consensus folding for multiple sequences
• We use a heuristic method based on the notion of star-alignment.– Compute an optimal configuration from a random seed pair.– Align all individual sequences to this configuration.– Choose the conserved stack configuration in all sequences.– Allow some stacks to be partially conserved (at least appear in a certain
fraction of the sequences).
![Page 47: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/47.jpg)
Compute the stack configuration for multiple sequences: RNAscf(k,h,f)
.
..
.........
![Page 48: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/48.jpg)
Iterative procedure for RNAscf
1. P = RNAscf(k, h, f).
2. In each sequence, extract the unpaired regions according to the loop regions in P.
3. Predict additional putative stacks that are not crossing with P using smaller k’ and h’.
4. Recompute the alignment for with additional putative stacks using RNAscf(k’,h’,f).
![Page 49: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/49.jpg)
Test dataset
• We choose a set of 12 RNA families from Rfam database:– 20 sequences chosen from the families. (except for CRE and glms, we choose 10
sequences) with annotated structures.– There are 953 stacks.– We compare RNAscf with 3 other programs that are available online for RNA
folding:• RNAfold (energy based minimization) (Hofacker 2003)• COVE (covariance model) (Eddy and Durbin 1994)
– Cove need a staring seed alignment which is produced by ClustalW.• comRNA (computing anchors in multiple sequences) (Ji, Xu and Stormo 2004).
– Sensitivity: the fraction of true stacks that overlapped with predicted stacks.– Accuracy: the fraction of predicted stacks that overlapped with true stacks
![Page 50: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/50.jpg)
Test results
![Page 51: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/51.jpg)
Test results
Sensitivity
00.10.20.30.40.50.60.70.80.9
1
5s_r
RNA
CRE_220(
*)
ctRNA_2
36(+
)
glmS(*)
hamm
er_3
intro
n_II(+
)
lysine
purin
e
sam
_ribo
thiam
ine(+
)
tRNA
ykok
_elem
ent
RNAfold
COVE
comRNA
RNAscf
![Page 52: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/52.jpg)
Test results
Accuracy
00.10.20.30.40.50.60.70.80.9
1
5s_r
RNA
CRE_220(
*)
ctRNA_2
36(+
)
glmS(*)
hamm
er_3
intro
n_II(+
)
lysine
purin
e
sam
_ribo
thiam
ine(+
)
tRNA
ykok
_elem
ent
RNAfold
COVE
comRNA
RNAscf
![Page 53: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/53.jpg)
Performance improves when the number of sequences increases
0.8
0.82
0.84
0.86
0.88
0.9
0.92
0.94
0 10 20 30 40 50 60 70 80
# of input sequences
Sensitivity
Accuracy
(Using Thiamine riboswitch subfamily (RF00059))
![Page 54: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/54.jpg)
RNAscf always finds the right consensus stack configuration.
(Sam riboswitch (RF00162))
![Page 55: Tracking down ncRNAs in the genomes](https://reader035.vdocuments.site/reader035/viewer/2022081503/56815007550346895dbddb83/html5/thumbnails/55.jpg)
Conclusion and future work
• RNAscf is a valid approach to RNA consensus structure prediction.– Use stack configuration to represent RNA secondary structure.– Propose a dynamic programming algorithm to find optimal stack configuration
for pairwise sequences.– Use both primary sequence information and energy information.– Use a star-alignment-like heuristic method to get the consensus structure for
multiple sequences.
• Future work:– Correcting errors by using a stochastic iterative scheme (such as Gibbs
sampling).– Provide P-value for each prediction.– Use RNAscf to find new families of ncRNAs.– Perform constraint folding to refine the predicted structure by adding some
minor basepairs.