The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Upload: jaxson-jay-lemacks

Post on 31-Mar-2015


Page 1: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

The Discovery of Novel ncRNA in Genomes

Andrew Uzilov

David Mathews

Page 2: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Uzilov, Keegan, Mathews. BMC Bioinformatics. 2006. In Press.

Page 3: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Outline:

• Background in ncRNA.

• Basic hypothesis.

• The Dynalign algorithm for prediction of an RNA secondary structure common to two sequences.

• Using Dynalign to find ncRNA sequences in genomes.

• Optimizing Dynalign performance.

Page 4: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Central Dogma of Biology:

Page 5: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

RNA is an Active Player:

Page 6: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

What is ncRNA?

• Non-coding RNA (ncRNA) is an RNA that functions without being translated to a protein.

• Known roles for ncRNAs:
– RNA catalyzes excision/ligation in introns.
– RNA catalyzes the maturation of tRNA.
– RNA catalyzes peptide bond formation.
– RNA is a required subunit in telomerase.
– RNA plays roles in immunity and development (RNAi).
– RNA plays a role in dosage compensation.
– RNA plays a role in carbon storage.
– RNA is a major subunit in the SRP, which is important in protein trafficking.
– RNA guides RNA modification.

Page 7: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

AAUUGCGGGAAAGGGGUCAACAGCCGUUCAGUACCAAGUCUCAGGGGAAACUUUGAGAUGGCCUUGCAAAGGGUAUGGUAAUAAGCUGACGGACAUGGUCCUAACCACGCAGCCAAGUCCUAAGUCAACAGAUCUUCUGUUGAUAUGGAUGCAGUUCA

Predicting RNA Secondary and 3D Structure from Sequence:

[Figure: secondary structure diagram of the sequence above, with helices P4, P5, P5a, P5b, P5c, P6, P6a, and P6b labeled and nucleotides numbered from 120 to 260.]

Cate, et al. (Cech & Doudna).(1996) Science 273:1678.

Waring & Davies. (1984) Gene 28: 277.

Page 8: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

An RNA Secondary Structure:

On average, 46 % of nucleotides are unpaired.

R2 retrotransposon 3' UTR from D. melanogaster. RNA 3:1-16.

Page 9: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Gibbs Free Energy (ΔG°):

Unpaired State ⇌ Structure i:

$$K_i = \frac{[\text{Structure } i]}{[\text{Unpaired State}]} = e^{-\Delta G^\circ_i / RT}$$

Structure j ⇌ Structure i:

$$\frac{[\text{Structure } i]}{[\text{Structure } j]} = \frac{K_i}{K_j} = e^{-(\Delta G^\circ_i - \Delta G^\circ_j)/RT}$$

ΔG° quantifies the favorability of a structure at a given temperature.
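As a hedged illustration of the relation between free energy and equilibrium population, the Python sketch below evaluates [Structure i]/[Structure j] = exp(-(ΔG°i - ΔG°j)/RT). The function name, the value of R in kcal/(mol·K), and the default temperature of 310.15 K (37 °C) are assumptions chosen for the example, not taken from the slides.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal/(mol*K); assumed value for this sketch


def population_ratio(dg_i, dg_j, temp_k=310.15):
    """Return [Structure i]/[Structure j] for folding free energies in kcal/mol."""
    return math.exp(-(dg_i - dg_j) / (R_KCAL * temp_k))


if __name__ == "__main__":
    # A structure that is 1 kcal/mol more stable than another at 37 C.
    print(population_ratio(-4.3, -3.3))
```

Under these assumptions, a difference of 1 kcal/mol corresponds to roughly a five-fold difference in population, which is why small errors in the energy parameters can change which structure is predicted.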

Page 10: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

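Nearest Neighbor Model for RNA Secondary Structure Free Energy at 37 °C:

[Figure: hairpin drawn from the strands CGUUUGGGUU and CACAAACG, annotated with the increments -2.0, -2.1, -0.9, -0.9, -1.8, -1.6, and +5.0 kcal/mol.]

$$\Delta G^\circ_{\text{helix}} = \Delta G^\circ_{CG/GC} + \Delta G^\circ_{GU/CA} + 2\,\Delta G^\circ_{UU/AA} + \Delta G^\circ_{UG/AC}$$

= -2.0 kcal/mol - 2.1 kcal/mol + 2 × (-0.9) kcal/mol - 1.8 kcal/mol = -7.7 kcal/mol

$$\Delta G^\circ_{\text{hairpin loop}} = \Delta G^\circ_{\text{initiation (6 nucleotides)}} + \Delta G^\circ_{\text{mismatch } GG/CA}$$

= 5.0 kcal/mol - 1.6 kcal/mol = 3.4 kcal/mol

$$\Delta G^\circ_{\text{total}} = \Delta G^\circ_{\text{hairpin loop}} + \Delta G^\circ_{\text{helix}}$$

= 3.4 kcal/mol - 7.7 kcal/mol = -4.3 kcal/mol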

Mathews, Disney, Childs, Schroeder, Zuker, & Turner. 2004. PNAS 101: 7287.
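The arithmetic on this slide can be reproduced with a few lines of Python. This is only a sketch: the increments are the values shown on the slide, while the stack labels and the variable and function names are illustrative assumptions rather than part of any published parameter file.

```python
# Increments in kcal/mol, copied from the slide; labels are illustrative only.
HELIX_STACKS = [
    ("CG/GC", -2.0),
    ("GU/CA", -2.1),
    ("UU/AA", -0.9),
    ("UU/AA", -0.9),
    ("UG/AC", -1.8),
]
HAIRPIN_INITIATION_6NT = 5.0   # loop initiation for a 6-nucleotide hairpin
TERMINAL_MISMATCH_GG_CA = -1.6  # first mismatch stacking on the closing pair


def helix_energy(stacks):
    """Sum the nearest-neighbor stacking increments of a helix."""
    return sum(value for _, value in stacks)


def hairpin_loop_energy():
    """Loop initiation plus the terminal mismatch bonus."""
    return HAIRPIN_INITIATION_6NT + TERMINAL_MISMATCH_GG_CA


def total_energy():
    return helix_energy(HELIX_STACKS) + hairpin_loop_energy()


if __name__ == "__main__":
    print(round(helix_energy(HELIX_STACKS), 1),
          round(hairpin_loop_energy(), 1),
          round(total_energy(), 1))
```

The point of the nearest neighbor model is visible in the structure of the code: the total is a plain sum of local terms, which is what makes dynamic programming applicable.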

Page 11: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

How is the Lowest Free Energy Structure Determined?

• Naïve approach would be to calculate the free energy of every possible secondary structure.

• Number of secondary structures ≈ 1.8^N (where N is the number of nucleotides).

• The free energies of 1000 structures can be calculated in 1 second.

• For a 100-nucleotide sequence:
– Number of secondary structures ≈ 3 × 10^25
– Time to calculate ≈ 10^15 years
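A short Python sketch of this back-of-the-envelope estimate follows. It assumes the growth rate of 1.8 per nucleotide and the rate of 1000 structures per second stated on the slide; the function names and the 365.25-day year are assumptions made for the example.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed length of a year


def structure_count(n_nucleotides, growth=1.8):
    """Approximate number of secondary structures for a sequence of length N."""
    return growth ** n_nucleotides


def years_to_enumerate(n_nucleotides, structures_per_second=1000.0):
    """Time needed to evaluate every structure one at a time, in years."""
    seconds = structure_count(n_nucleotides) / structures_per_second
    return seconds / SECONDS_PER_YEAR


if __name__ == "__main__":
    print(f"{structure_count(100):.2e} structures")
    print(f"{years_to_enumerate(100):.2e} years")
```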

Page 12: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Dynamic Programming Algorithm:

• Not to be confused with molecular dynamics.

• This is a calculation – not a simulation.

• The lowest free energy structure is guaranteed given the nearest neighbor parameters used.

• Reviewed by Sean Eddy. Nature Biotechnology. 2004. 22(11): 1457.

Page 13: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Dynamic Programming Algorithm:

• Named by Richard Bellman in 1953.

• Applies to calculations in which the cost/score is built progressively from smaller solutions.

• Other applications:
– Sequence alignment
– Determining partition functions for RNA secondary structures
– Finding shortest paths
– Determining moves in games
– Linguistics

Page 14: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Dynamic Programming:

• Recursion is used to speed the calculation.
– The problem is divided into smaller problems.
– The smaller problems are used to solve bigger problems.

• Two Step Process
– Fill – determines the lowest free energy folding possible for each subsequence
– Traceback – determines the structure that has the lowest free energy
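To make the fill and traceback steps concrete, here is a minimal Python sketch. Note that it swaps in a much simpler objective than the one used in the talk: it maximizes the number of nested base pairs (a Nussinov-style recursion) instead of minimizing nearest-neighbor free energy. The allowed pairs, the minimum loop size of three, and all names are assumptions for illustration.

```python
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}
MIN_LOOP = 3  # assumed minimum number of unpaired nucleotides in a hairpin


def fill(seq):
    """Fill step: best[i][j] = maximum number of nested pairs in seq[i..j]."""
    n = len(seq)
    best = [[0] * n for _ in range(n)]
    for span in range(MIN_LOOP + 1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]  # case 1: j is unpaired
            for k in range(i, j - MIN_LOOP):  # case 2: j pairs with k
                if (seq[k], seq[j]) in PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score
    return best


def traceback(seq, best):
    """Traceback step: recover one set of pairs that achieves best[0][n-1]."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if j - i <= MIN_LOOP:
            continue  # too short to contain a pair
        if best[i][j] == best[i][j - 1]:
            stack.append((i, j - 1))
            continue
        for k in range(i, j - MIN_LOOP):
            if (seq[k], seq[j]) in PAIRS:
                left = best[i][k - 1] if k > i else 0
                if best[i][j] == left + 1 + best[k + 1][j - 1]:
                    pairs.append((k, j))
                    if k > i:
                        stack.append((i, k - 1))
                    stack.append((k + 1, j - 1))
                    break
    return sorted(pairs)


if __name__ == "__main__":
    sequence = "GGGAAAUCCC"
    table = fill(sequence)
    print(table[0][-1], traceback(sequence, table))
```

The fill step visits each subsequence once and reuses the answers for shorter subsequences, so the cost grows polynomially with length instead of exponentially. A free energy version has the same two-step shape but a richer set of recursion cases.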

Page 15: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

RNA Secondary Structure Prediction Accuracy:

Percentage of Known Base Pairs Correctly Predicted:

| RNA | Nucleotides | Base Pairs | % Pseudoknot | Lowest Free Energy | Best Suboptimal | Any Suboptimal |
|---|---|---|---|---|---|---|
| SSU (16S) rRNA | 33,263 | 8,863 | 1.4 | 61.0 ± 23.7 | 75.7 ± 20.0 | 90.5 ± 14.1 |
| | | | | (44.3 ± 13.2) a | (54.0 ± 13.7) a | (75.6 ± 12.1) a |
| LSU (23S) rRNA | 13,341 | 3,585 | 0.2 | 76.0 ± 12.4 | 87.0 ± 8.9 | 97.7 ± 2.6 |
| | | | | (56.9 ± 9.3) a | (64.0 ± 10.6) a | (82.1 ± 10.9) a |
| 5S rRNA | 26,925 | 10,188 | 0.0 | 74.2 ± 26.9 | 96.0 ± 5.2 | 99.9 ± 0.6 |
| Group I Intron | 5,518 | 1,532 | 6.0 | 70.8 ± 12.8 | 83.9 ± 11.2 | 98.1 ± 4.7 |
| Group I Intron - 2 | 3,056 | 865 | 6.2 | (60.5 ± 10.5) | (77.4 ± 9.8) | (97.3 ± 4.4) |
| Group II Intron | 1,626 | 402 | 0.0 | 86.5 ± 3.6 | 92.4 ± 6.6 | 100 ± 0.0 |
| RNase P | 2,269 | 694 | 14.4 | 64.6 ± 15.2 | 75.9 ± 10.1 | 95.6 ± 4.6 |
| RNase P - 2 | 2,198 | 1,099 | 11.3 | (59.4 ± 10.2) | (77.6 ± 4.9) | (97.2 ± 2.7) |
| SRP RNA | 24,383 | 6,273 | 1.9 | 68.2 ± 25.8 | 88.3 ± 12.0 | 96.3 ± 8.6 |
| tRNA | 37,502 | 10,018 | 0.0 | 84.8 ± 18.9 | 96.5 ± 6.4 | 99.3 ± 4.7 |
| Total: | 151,503 | 43,519 | 1.4 | 72.8 ± 9.1 | 87.0 ± 8.1 | 97.2 ± 3.1 |

Mathews, Disney, Childs, Schroeder, Zuker, & Turner. 2004. PNAS 101: 7287.

Page 16: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Pseudoknot:

i < i’ < j < j’
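The condition i < i' < j < j' can be checked directly on a list of base pairs. The Python sketch below is a minimal illustration; the function names and the representation of a structure as a list of index pairs are assumptions.

```python
from itertools import combinations


def crosses(pair_a, pair_b):
    """True if the two base pairs satisfy i < i' < j < j' (non-nested)."""
    (i, j), (i2, j2) = sorted([tuple(sorted(pair_a)), tuple(sorted(pair_b))])
    return i < i2 < j < j2


def has_pseudoknot(pairs):
    """True if any two pairs in the structure cross each other."""
    return any(crosses(a, b) for a, b in combinations(pairs, 2))


if __name__ == "__main__":
    print(has_pseudoknot([(0, 9), (1, 8)]))  # nested helix
    print(has_pseudoknot([(0, 5), (3, 8)]))  # crossing pairs
```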

Page 17: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Hypothesis:

• ncRNAs have lower folding free energy change than non-structural sequences, e.g. mRNA, or random sequences.

• Corollary:
– ncRNAs, which are structured, can be found in genomic sequences because they have folding free energy change lower than background sequences.

Page 18: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Do Structural RNAs have Lower Folding Free Energy Change than

Background?

• Yes:
– Le et al. 1990. NAR 18:1613.
– Seffens & Digby. 1999. NAR 27:1578.
– Clote et al. 2005. RNA 11:578.

• No:
– Workman & Krogh. 1999. NAR 27:418.
– Rivas & Eddy. 2000. Bioinformatics 16:583.

Page 19: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Test of Hypothesis:

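[Diagram: each test sequence is compared against its own set of shuffled controls.]

• ncRNA (tRNA or 5S rRNA) → 100 control sequences (first-order Markov chain that preserves dinucleotide frequencies)

• Negative (first-order Markov chain that preserves dinucleotide frequencies) → 100 control sequences (first-order Markov chain that preserves dinucleotide frequencies)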

Page 20: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Calculate Z Score of Folding Free Energy Change for Positives and Negatives:

• Calculate the mean, ⟨ΔG°37⟩, and standard deviation, σ, for the controls.

• The Z score is the number of standard deviations by which a positive's or negative's free energy change differs from the control mean:

$$Z = \frac{\Delta G^\circ_{37} - \langle \Delta G^\circ_{37} \rangle}{\sigma}$$

• Choose a Z-score cutoff for classification as ncRNA.
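The Z-score calculation can be sketched in Python as follows. The slides do not say whether the sample or population standard deviation was used, so the use of the sample standard deviation here is an assumption, as are the function names, the example energies, and the cutoff of -2.

```python
import statistics


def z_score(dg, control_dgs):
    """Standard deviations between a sequence's folding energy and its controls."""
    mean = statistics.mean(control_dgs)
    sigma = statistics.stdev(control_dgs)  # sample standard deviation (assumed)
    return (dg - mean) / sigma


def is_candidate(dg, control_dgs, cutoff=-2.0):
    """Classify as ncRNA when the Z score is at or below the chosen cutoff."""
    return z_score(dg, control_dgs) <= cutoff


if __name__ == "__main__":
    controls = [-10.0, -12.0, -14.0, -16.0, -18.0]  # made-up control energies
    print(z_score(-25.0, controls), is_candidate(-25.0, controls))
```

A more negative Z score means the sequence folds more stably than its shuffled controls, so the cutoff is applied on the low side.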

Page 21: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Scoring:

• Sensitivity = (True Positives)/(True Positives + False Negatives) = percent of ncRNA correctly classified as ncRNA

• Specificity = (True Negatives)/(True Negatives + False Positives) = percent of non-ncRNA correctly classified as non-ncRNA

| | Sequence is ncRNA: | Sequence is not ncRNA: |
|---|---|---|
| Sequence is predicted to be ncRNA: | True Positive | False Positive |
| Sequence is predicted to not be ncRNA: | False Negative | True Negative |
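The scoring definitions translate directly into code. The sketch below counts the four outcomes at one cutoff and then sweeps the cutoff to produce the points of a ROC curve. It assumes that each sequence has already been reduced to a Z score and that scores at or below the cutoff are called ncRNA; the names and example values are hypothetical.

```python
def confusion(z_positives, z_negatives, cutoff):
    """Return (TP, FN, FP, TN) when scores <= cutoff are predicted to be ncRNA."""
    tp = sum(z <= cutoff for z in z_positives)
    fn = len(z_positives) - tp
    fp = sum(z <= cutoff for z in z_negatives)
    tn = len(z_negatives) - fp
    return tp, fn, fp, tn


def sensitivity(tp, fn):
    return tp / (tp + fn)


def specificity(tn, fp):
    return tn / (tn + fp)


def roc_points(z_positives, z_negatives):
    """(false positive rate, sensitivity) for every observed cutoff."""
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(z_positives) | set(z_negatives)):
        tp, fn, fp, tn = confusion(z_positives, z_negatives, cutoff)
        points.append((1.0 - specificity(tn, fp), sensitivity(tp, fn)))
    return points


if __name__ == "__main__":
    positives = [-4.0, -3.0, -1.0, 0.0]   # made-up Z scores of known ncRNAs
    negatives = [-2.5, 0.0, 1.0, 2.0]     # made-up Z scores of negatives
    print(confusion(positives, negatives, -2.0))
    print(roc_points(positives, negatives))
```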

Page 22: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Distribution of Z Scores:

[Figure: histogram of Z scores; vertical axis: count.]

Page 23: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Receiver-Operator Characteristic (ROC) Curve:

[Figure: ROC curves for 5S only, tRNA only, and both 5S rRNA and tRNA. Horizontal axis: false positive rate (1 - specificity), 0 to 1; vertical axis: sensitivity, 0 to 1.]

Page 24: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Why do Structural RNA Sequences Not Have a Significantly Lower Folding Free Energy Change?

• Hypothesis is incorrect.

• Secondary structure prediction has limited accuracy:
– Kinetics may play a role in folding.
– Free energy nearest neighbors are based on a limited number of experiments and have error.
– The algorithms that are used for these studies cannot predict pseudoknots (non-nested pairs).

Page 25: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Dynalign (a 4-D Dynamic Programming Algorithm):

Mathews & Turner. Journal of Molecular Biology. 317: 191-203 (2002)

Mathews. Bioinformatics. 21: 2246-2253 (2005)

Algorithm for Secondary Structure Prediction (2D dynamic programming algorithm)

Algorithm for Sequence Alignment (2D dynamic programming algorithm)

Simultaneously finds the sequence alignment and thermodynamically favorable common secondary structure for two sequences. Dynalign requires no sequence identity.

Page 26: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Inputs, Optimization, and Outputs:

Input: Sequence 1, Sequence 2

Optimization (minimize ΔG°total):

$$\Delta G^\circ_{\text{total}} = \Delta G^\circ_{\text{sequence 1}} + \Delta G^\circ_{\text{sequence 2}} + (\Delta G^\circ_{\text{gap}})(\text{number of gaps})$$

Output: Sequence Alignment, Structure of 1, Structure of 2, where each helix in 1 must be homologous to a BP in 2

Page 27: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Optimization of ΔG°gap:

[Figure: average % accuracy (vertical axis, 65 to 90) versus ΔG°gap (horizontal axis, 0 to 1.6 kcal mol^-1 gap^-1).]

Seven 5S rRNAs with secondary structures predicted with 47.8% average accuracy. Average of all 42 pair-wise combinations predicted by Dynalign.

Page 28: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Improving the Accuracy of tRNA Secondary Structure Prediction:

Conventional Free Energy Minimization Predicted Structures:

[Figure: secondary structures of tRNAs RD0260 and RE6781 as predicted by conventional free energy minimization, each drawn from 5' to 3'.]

Page 29: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Improving the Accuracy of tRNA Secondary Structure Prediction:

Dynalign Predicted Structures:

[Figure: secondary structures of tRNAs RD0260 and RE6781 as predicted by Dynalign, each drawn from 5' to 3'.]

RD0260 GCGACCGGGGCUGGCUUGGUAAUGGUACUCCCCUGUCACGGGAGAGAAUGUGGGUUCAAAUCCCAUCGGUCGCGCCA
RE6781 UCCGUCGUAGUCUAGGUGGUUAGGAUACUCGGCUCUCACCCGAGAGAC-CCGGGUUCGAGUCCCGGCGACGGAACCA
       ^^^^^^^ ^^^^ ^^^^ ^^^^^ ^^^^^ ^^^^^ ^^^^^^^^^^^^

Page 30: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Benchmarks:

• Four databases:
– All pairwise comparisons (21) of seven 5S sequences with widely varying accuracy of secondary structure prediction using a single sequence.
– 3 calculations with 6 SRP sequences.
– All pairwise calculations (780) with 40 randomly chosen tRNA sequences.
– All pairwise comparisons (105) of 15 randomly chosen 5S rRNA sequences.

Page 31: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Sensitivity:

[Figure: bar chart of sensitivity (vertical axis, 0 to 100) for the four databases: 5S, SRP, Rand 5S, and Rand tRNA.]

Sensitivity = (Correctly Predicted Pairs)/(Total Known Pairs)
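For structure prediction, this sensitivity is a set comparison between predicted and known base pairs. The Python sketch below counts only exact matches, which is an assumption made for simplicity; the function name and the pair lists are hypothetical.

```python
def pair_sensitivity(predicted_pairs, known_pairs):
    """Percent of known base pairs that also appear in the prediction."""
    known = {tuple(sorted(p)) for p in known_pairs}
    predicted = {tuple(sorted(p)) for p in predicted_pairs}
    if not known:
        raise ValueError("known structure has no base pairs")
    return 100.0 * len(known & predicted) / len(known)


if __name__ == "__main__":
    known = [(1, 10), (2, 9), (3, 8), (4, 7)]
    predicted = [(1, 10), (2, 9), (3, 7)]
    print(pair_sensitivity(predicted, known))
```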

Page 32: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Improving Dynalign Performance:

• The original restriction on the alignments is: |i - k| ≤ M
– For the 3' ends of the sequences to align: M ≥ |N1 - N2|
– For most applications, the ends of the sequences should align.

• This suggests an alternative restriction: |i × N2/N1 - k| ≤ M
– This allows a smaller M parameter. Calculation time scales as O(N^3 M^3).
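The two alignment restrictions can be compared with a small Python sketch. Here i indexes sequence 1 (length N1) and k indexes sequence 2 (length N2); the function names and example lengths are assumptions for illustration.

```python
def allowed_original(i, k, m):
    """Original band: nucleotide i may align to k only if |i - k| <= M."""
    return abs(i - k) <= m


def allowed_revised(i, k, n1, n2, m):
    """Revised band: |i * N2/N1 - k| <= M, which follows the diagonal end to end."""
    return abs(i * n2 / n1 - k) <= m


def count_allowed(n1, n2, m, revised):
    """Number of (i, k) combinations inside the band."""
    total = 0
    for i in range(1, n1 + 1):
        for k in range(1, n2 + 1):
            ok = allowed_revised(i, k, n1, n2, m) if revised else allowed_original(i, k, m)
            total += ok
    return total


if __name__ == "__main__":
    n1, n2 = 100, 130
    # The original band needs M >= |N1 - N2| = 30 before the 3' ends can align.
    print(allowed_original(n1, n2, 29), allowed_original(n1, n2, 30))
    # The revised band aligns the ends for any M, so M can be much smaller.
    print(allowed_revised(n1, n2, n1, n2, 0))
    print(count_allowed(n1, n2, 30, revised=False), count_allowed(n1, n2, 6, revised=True))
```

Because the run time grows with the cube of M, shrinking the band while still letting the sequence ends align is where the speedup comes from.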

Page 33: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Heuristic to Exclude Base Pairs:

• There are many possible canonical base pairs that are not worth considering because any structure that contains them has a high free energy.

• The “high energy” base pairs can be identified by secondary structure prediction using a single sequence (very fast). The high energy pairs can then be excluded from a Dynalign structure prediction.

Page 34: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

% of Known Pairs within a % Energy Increment from the Lowest Free Energy Structure:

[Figure: % of known base pairs (vertical axis, 0 to 100) versus % energy increment (horizontal axis: 1, 5, 10, 20, 30, 50).]

Page 35: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Time Performance Improvement:

| Sequence 1: | Sequence 2: | N1: | Original M: | Original Time (hr:min): | Revised M: | Revised Time (hr:min): |
|---|---|---|---|---|---|---|
| RD0260 | RE6781 | 77 | 15 | 0:22 | 6 | 0:01 |
| H. volcanii 5S | A. globiformis 5S | 122 | 15 | 1:11 | 6 | 0:03 |
| D. takashii R2 3' UTR | D. melanogaster R2 3' UTR | 217 | 24 | 26:05 | 8 | 0:39 |

3.2 GHz Intel Pentium 4 with 1 GB RAM; Red Hat Enterprise Linux 3; gcc 3.2.3-42 compiler

Page 36: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Revised Hypothesis:

• Dynalign calculated folding free energies for sequence pairs derived from genome alignments can be used to find ncRNAs with high sensitivity and specificity.

Page 37: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Testing the Hypothesis:

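[Diagram: each test pair is compared against its own set of shuffled control pairs.]

• ncRNA pair (tRNAs or 5S rRNAs) → 20 control sequence pairs (shuffle of global alignment)

• Negative pair (shuffle of global alignment) → 20 control sequence pairs (shuffle of global alignment)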

Page 38: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Dynalign ROC Curve has Larger Integral than Single Sequence:

[Figure: ROC curves for single sequences and for pairs of sequences. Horizontal axis: false positive rate (1 - specificity), 0 to 1; vertical axis: sensitivity, 0 to 1.]

Page 39: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

ROC Curves Depend on M:

[Figure: ROC curves for column controls with M=6 and with M=8. Horizontal axis: false positive rate (1 - specificity), 0 to 0.20; vertical axis: sensitivity, 0 to 1.]

Page 40: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

ROC Curves for tRNA and 5S rRNA:

[Figure: ROC curves for both 5S rRNA and tRNA, 5S rRNA only, and tRNA only. Horizontal axis: false positive rate (1 - specificity), 0 to 0.20; vertical axis: sensitivity, 0 to 1.]

Page 41: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Comparison to Other State of the Art Methods:

• QRNA:
– Rivas & Eddy. 2001. BMC Bioinformatics 2:8.
– Comparative analysis of aligned sequences, where compensating base pair changes indicate ncRNA. Classification by stochastic context-free grammar.

• RNAz:
– Washietl, Hofacker, & Stadler. 2005. PNAS 102: 2454.
– Folding free energy of two or more aligned sequences using RNAalifold. Classification by support vector machine (SVM).

• Both Methods Use Fixed Alignments:
– Faster than Dynalign.
– Limited to sequence alignment algorithm (compensating base pair changes make accurate alignment difficult).

Page 42: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

QRNA Sequence Types:

Page 43: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Dynalign vs. RNAz:

[Figure: ROC curves for Dynalign (column controls, M=8) and RNAz. Horizontal axis: false positive rate (1 - specificity), 0 to 0.20; vertical axis: sensitivity, 0 to 1.]

Page 44: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

What About Low Sequence Identity Pairs?

[Figure: ROC curves for Dynalign below 50% identity only (column controls, M=8) and RNAz below 50% identity only. Horizontal axis: false positive rate (1 - specificity), 0 to 0.20; vertical axis: sensitivity, 0 to 1.]

Page 45: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Human vs. Mouse Alignment (Santa Cruz Genome Server) Pairwise Identities for 50 Nucleotide Windows:

| % Identity (i) | Number of Windows | Percent of Windows |
|---|---|---|
| 0 ≤ i < 10 | 260195 | 1.16 |
| 10 ≤ i < 20 | 421451 | 1.88 |
| 20 ≤ i < 30 | 687299 | 3.06 |
| 30 ≤ i < 40 | 1328176 | 5.91 |
| 40 ≤ i < 50 | 3006046 | 13.39 |
| 50 ≤ i < 60 | 5774890 | 25.72 |
| 60 ≤ i < 70 | 6112855 | 27.22 |
| 70 ≤ i < 80 | 3039773 | 13.54 |
| 80 ≤ i < 90 | 1257154 | 5.60 |
| 90 ≤ i ≤ 100 | 568476 | 2.53 |
| Total: | 22456315 | 100.0 |

Page 46: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Faster Method Using Dynalign:

• Run a single calculation and use a support vector machine (SVM) to classify sequence as ncRNA or not.
– Each window only needs to be scanned once.
– A probability is assigned to the classification.

• SVM
– Trained with tRNA and 5S rRNA sequences.
– Input:
  • Dynalign total free energy change
  • Length of the shorter sequence
  • A, C, G content of each sequence
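The SVM inputs listed on this slide amount to an eight-number feature vector per sequence pair. The Python sketch below only assembles that vector; it does not train or call an SVM. Expressing base content as a fraction of sequence length, the ordering of the features, and all names are assumptions, since the slide does not specify them.

```python
def base_fraction(seq, base):
    """Fraction of positions in seq occupied by the given nucleotide."""
    return seq.upper().count(base) / len(seq)


def svm_features(total_free_energy, seq1, seq2):
    """Build [dG_total, shorter length, A/C/G of seq1, A/C/G of seq2]."""
    features = [float(total_free_energy), float(min(len(seq1), len(seq2)))]
    for seq in (seq1, seq2):
        for base in "ACG":
            features.append(base_fraction(seq, base))
    return features


if __name__ == "__main__":
    # Hypothetical Dynalign total free energy for a toy pair of sequences.
    print(svm_features(-30.0, "AACG", "GGCU"))
```

U content is omitted because it is determined by the other three fractions, which matches the slide listing only A, C, and G.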

Page 47: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

ROC of SVM vs. 20 Controls:

[Figure: ROC curves for column controls (M=8) and for the SVM. Horizontal axis: false positive rate (1 - specificity), 0 to 0.20; vertical axis: sensitivity, 0 to 1.]

Page 48: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Dynalign-SVM vs. RNAz at Low Identity:

[Figure: ROC curves for Dynalign below 50% identity only and RNAz below 50% identity only. Horizontal axis: false positive rate (1 - specificity), 0 to 0.20; vertical axis: sensitivity, 0 to 1.]

Page 49: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Unrolling the Method on E. coli:

• Look for ncRNA in E. coli using alignments to S. typhi.
– MUMmer (Kurtz et al. 2004. Genome Biol 5:R12)

• 15,214 blocks of 50 to 150 nucleotides as above (where long alignment blocks were divided into 150 nucleotide windows that overlap 75 nucleotides)
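The windowing described here can be sketched in Python as follows. The slide gives the window size (150), the overlap (75), and the minimum block length (50); how the final partial window of a long block was handled is not stated, so keeping it when it is at least 50 nucleotides long is an assumption, as is the function name.

```python
def split_block(length, size=150, overlap=75, min_len=50):
    """Return (start, end) windows covering an alignment block of given length."""
    if length < min_len:
        return []  # block too short to scan
    step = size - overlap
    windows = []
    start = 0
    while True:
        end = min(start + size, length)
        if end - start >= min_len:
            windows.append((start, end))
        if end == length:
            break
        start += step
    return windows


if __name__ == "__main__":
    for block_length in (40, 120, 300, 310):
        print(block_length, split_block(block_length))
```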

Page 50: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

ncRNA Detection:

| | Dynalign | RNAz | QRNA |
|---|---|---|---|
| Number of known ncRNAs found: | | | |
| E. coli (156 ncRNAs known) | 107 | 91 | 67 |
| S. typhi (110 ncRNAs known) | 93 | 70 | 64 |
| Number of hits that are not known ncRNAs (likely false positives): | | | |
| E. coli | 578 | 678 | 661 |
| S. typhi | 568 | 662 | 634 |

Page 51: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Epilogue: Improving Dynalign Performance:

• In collaboration with Gaurav Sharma, Electrical and Computer Engineering, University of Rochester, and Arif Harmanci, we pre-determine the sequence alignment probabilities with a Hidden Markov Model.

• Then, we only allow alignments in Dynalign that have probability greater than 10^-4.
– This removes the need of using the M parameter heuristic.
– This does not affect the accuracy of structure prediction by Dynalign.

Page 52: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Benchmarks Against Other Programs Using 2000 Pairs of 5S rRNA Sequences:

Percent of Known Pairs Correctly Predicted, by Percent Sequence Identity:

| Algorithm: | 20-40: | 40-60: | 60-80: | 80-100: | All: |
|---|---|---|---|---|---|
| Dynalign | 89.5 | 90.3 | 91.4 | 90.9 | 90.7 |
| FOLDALIGN | 72.6 | 75.3 | 78.7 | 51.0 | 74.9 |
| StemLoc | 27.2 | 67.3 | 89.9 | 77.6 | 74.0 |
| Consan | 66.6 | 79.9 | 93.1 | 86.5 | 84.2 |
| Single Sequence | 68.0 | 72.2 | 76.6 | 78.4 | 73.9 |

Page 53: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Performance Benchmarks Using 200 Pairs of Sequences:

| Algorithm: | Time (s), tRNA: | Time (s), 5S rRNA: | Memory (MB), tRNA: | Memory (MB), 5S rRNA: |
|---|---|---|---|---|
| Dynalign | 10.0 | 34.4 | 11.0 | 12.3 |
| FOLDALIGN | 30.3 | 349.5 | 134.4 | 730.2 |
| StemLoc | 210.2 | 616.1 | 252.2 | 2788.3 |
| Consan | 209.3 | 1032.8 | 131.6 | 317.3 |

Using a single core on a dual, dual-core Opteron 270 machine running Fedora Core 5 and gcc 4.1.1.

Page 54: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Parallelizing Dynalign for SMP:

• In collaboration with Paul Tymann, Computer Science, Rochester Institute of Technology and CS students Chris Connett, Glenn Katzen, Andrew Yohn, we developed an SMP version of Dynalign.

• This takes advantage of the fact that there are a number of positions in the arrays that can be filled independently in the dynamic programming algorithm recursions.

Page 55: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Scaling:

[Figure: bar chart of run time (vertical axis, 0:00 to 3:21) versus number of processors: 3:01 on 1, 1:49 on 2, 1:12 on 3, and 1:00 on 4.]

Two R2 3’ UTRs of length 234 and 217 nucleotides.

Using a dual, dual-core Opteron 270 machine running Fedora Core 5 and gcc 4.1.1.

Page 56: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Preliminary Results with SMP-Dynalign:

• Single sequence secondary structure prediction of E. coli 16S rRNA (1542 nucleotides) has 43.6% sensitivity.

• E. coli 16S rRNA run on Dynalign with:
– B. subtilis 16S rRNA (1552 nucleotides) has 80.7% sensitivity and required 381 minutes on 4 cores and 983 MB of RAM.
– Borrelia burgdorferi 16S rRNA (1532 nucleotides) has 76.4% sensitivity and required 408 minutes on 4 cores and 1.0 GB of RAM.

Page 57: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Conclusions:

• The folding free energy of single sequences does not provide a sensitive and specific method of finding ncRNAs. It does, however, provide a pre-filtering method that can remove 30% of sequences from consideration.

• Dynalign shows promise as a method for ncRNA detection, especially at low pairwise identities of sequences.

Page 58: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

Acknowledgements:

• Past Lab Members:
– Andrew Uzilov
– Shan Zhao
– Eliany Sanchez-Baez

• Lab Members:
– Sumeet Chandha
– Zhi Lu
– Matthew Seetin
– Rahul Tyagi
– Keith VanNostrand

• Funding:
– Alfred P. Sloan Foundation
– National Institutes of Health

• Computing:
– CASCI Lab at Rochester Institute of Technology

Page 59: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

MUMmer:

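[Figure: histogram of percent of windows (vertical axis) by percent identity range (horizontal axis).]

| Percent identity range | Percent of windows |
|---|---|
| [0, 10) | 0.00% |
| [10, 20) | 0.00% |
| [20, 30) | 0.05% |
| [30, 40) | 1.92% |
| [40, 50) | 3.96% |
| [50, 60) | 7.14% |
| [60, 70) | 8.36% |
| [70, 80) | 11.67% |
| [80, 90) | 14.80% |
| [90, 100] | 52.10% |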

Page 60: The Discovery of Novel ncRNA in Genomes Andrew Uzilov David Mathews

WuBLASTn:

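[Figure: histogram of percent of windows (vertical axis) by percent identity range (horizontal axis).]

| Percent identity range | Percent of windows |
|---|---|
| [0, 10) | 0.00% |
| [10, 20) | 0.00% |
| [20, 30) | 0.00% |
| [30, 40) | 0.00% |
| [40, 50) | 0.19% |
| [50, 60) | 13.05% |
| [60, 70) | 43.85% |
| [70, 80) | 29.58% |
| [80, 90) | 5.61% |
| [90, 100] | 7.72% |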