gene prediction: preliminary results
Gene Prediction:
Preliminary Results
Erin Cook
Paul Cooper
Kristen Knipe
Shaupu Qin
Vani Rajan
Shrutii Sarda
Tianjun Ye
Outline
Homology based Gene Prediction
Ab initio based Gene Prediction
RNA Prediction
Homology based Gene Prediction
Erin Cook & Shrutii Sarda
Follow-up: Overlapping Genes
Do we expect gene overlap in H. haemolyticus?
(aka is there gene overlap in H. influenzae
and/or other Haemophilus spp.?)
• Fukuda et al.: 257 overlapping gene pairs
• Palleja et al., BMC Genomics 2008:
– 338 fully sequenced prokaryotic genomes from STRING: 17% of all genes; some (~1%?) due to misannotation
YES
Follow-up: Plasmids of the Haemophilus genus
• Range from 3-30 Mdal in size
• Mainly in type b strains of H. influenzae; impart ampicillin resistance
• Have been found to be associated with Tn2, a transposable element
• Tend to have only partial homology (~48-50% similarity) with plasmids of related species
• Recent isolation of cryptic plasmids from H. somnus strains (1-5 kb in length)
Follow-up: % coding/non-coding regions of DNA
Species (strain)        Tax. ID  Acc. ID      DNA length (bp)  Protein count  Gene count  Total length of genes (bp)  Non-coding DNA (%)
H. influenzae Rd KW20   71421    NC_000907.1  1830138          1657           1789        1345305                     26.5
H. parasuis SH0165      557723   NC_011852.1  2269156          2021           2299        1803054                     20.6
H. somnus 2336          228400   NC_010519.1  2263857          1980           2065        1977672                     12.7
H. ducreyi 35000HP      233412   NC_002940.2  1698955          1717           1838        1446108                     14.9
H. influenzae 86-028NP  281310   NC_007146.2  1914490          1792           1899        1661757                     13.3
H. influenzae F3031     866630   NC_014920.1  1985832          1770           1892        1673628                     15.8
H. influenzae F3047     935897   NC_014922.1  2007018          1786           1896        1698588                     15.4
H. influenzae PittEE    374930   NC_009566.1  1813033          1613           1689        1446339                     20.3
H. influenzae PittGG    374931   NC_009567.1  1887192          1661           1735        1422240                     24.7
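The non-coding column can be recomputed from the two length columns. A minimal sketch, assuming non-coding % = 100 x (1 - total gene length / DNA length) and that genes do not overlap (overlapping genes would be counted twice); the function name is illustrative only.

```python
def percent_noncoding(dna_length_bp, total_gene_length_bp):
    """Percent of the sequence not covered by annotated genes.

    Assumption: genes do not overlap, so their summed length
    approximates the coding span.
    """
    return 100.0 * (1.0 - total_gene_length_bp / dna_length_bp)


# Values for H. influenzae Rd KW20 (NC_000907.1) from the table.
rd_kw20 = percent_noncoding(1830138, 1345305)
# Values for H. ducreyi 35000HP (NC_002940.2) from the table.
ducreyi = percent_noncoding(1698955, 1446108)
print(f"Rd KW20: {rd_kw20:.1f}%  35000HP: {ducreyi:.1f}%")
```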
BLAST (flowchart, aa and nt branches)
1. Get all k-mers from query sequence
1a. Gen. all possible k-mers with score > T
2. Find all hits of k-mers in database using FSA
3. Find close hits (two-hit method)
4. Extend 2nd hit ungapped
5. Dyn. prog. for gapped alignment
Zvelebil and Baum. Understanding Bioinformatics. 2008
1. Get all k-mers from query sequence
1a. Gen. all possible k-mers with score > T
BLOSUM62 T=11 in gapped BLAST
http://www-users.math.umd.edu/~poorani/sampletalk/talk.html
Pertsemlidis and Fondon, Genome Biol. 2001.
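Steps 1 and 1a can be sketched as follows. This is an illustration, not BLAST's implementation: `toy_score` is a made-up stand-in for a substitution matrix such as BLOSUM62, the three-letter alphabet is chosen only to keep the enumeration small, and the cutoff is applied as "at least T" (an assumption about the convention).

```python
from itertools import product


def query_kmers(seq, k):
    """Step 1: every overlapping k-mer of the query, with its offset."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def neighborhood(word, alphabet, score, threshold):
    """Step 1a: all k-mers whose summed pairwise score against `word`
    reaches the threshold T (brute-force enumeration)."""
    hits = []
    for candidate in product(alphabet, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= threshold:
            hits.append("".join(candidate))
    return hits


def toy_score(a, b):
    # Hypothetical scoring: stand-in for a real substitution matrix.
    return 4 if a == b else -1


kmers = query_kmers("CHHY", 3)
neighbors = neighborhood("CHH", "CHY", toy_score, threshold=7)
print(kmers)
print(sorted(neighbors))
```

With this toy scoring, the neighborhood of CHH contains the word itself plus every word with a single substitution.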
2. Find all hits of k-mers in DB using FSA
Zvelebil and Baum. Understanding Bioinformatics.2008
Example FSA to recognize CHH, CHY, or CYH
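A small keyword automaton in the Aho-Corasick style can serve as a stand-in for the FSA on the slide. It is a sketch under that assumption (the slide's automaton may be built differently), and the database string below is made up for illustration.

```python
from collections import deque


def build_automaton(words):
    """Build goto/fail/output tables for a set of keywords."""
    goto, fail, out = [{}], [0], [[]]
    for w in words:
        s = 0
        for ch in w:
            if ch not in goto[s]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].append(w)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]
    return goto, fail, out


def scan(text, automaton):
    """Single pass over `text`; report (start offset, word) for every hit."""
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        for w in out[s]:
            hits.append((i - len(w) + 1, w))
    return hits


automaton = build_automaton(["CHH", "CHY", "CYH"])
hits = scan("ACHYHCYHCHH", automaton)  # hypothetical database sequence
print(hits)
```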
9
3. Find close hits (two-hit method)
+: T=13
• : T=11
Altschul et al. Nuc Ac Res 1997
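The two-hit idea can be sketched as: group word hits by diagonal (database position minus query position) and keep pairs of non-overlapping hits that lie close together on the same diagonal. The window of 40 and the consecutive-hit simplification are assumptions made for illustration.

```python
from collections import defaultdict


def two_hit_pairs(hits, word_size=3, window=40):
    """Return (first, second) hit pairs that share a diagonal, do not
    overlap, and start within `window` residues of each other.

    hits: iterable of (query_pos, db_pos) word-hit start positions.
    Simplification: only consecutive hits on a diagonal are compared.
    """
    by_diagonal = defaultdict(list)
    for q, d in hits:
        by_diagonal[d - q].append(q)
    pairs = []
    for diag, q_positions in by_diagonal.items():
        q_positions.sort()
        for prev, curr in zip(q_positions, q_positions[1:]):
            gap = curr - prev
            if word_size <= gap <= window:
                pairs.append(((prev, prev + diag), (curr, curr + diag)))
    return pairs


# Hypothetical hits: three on diagonal 20 (two of them overlapping), one elsewhere.
example_hits = [(10, 30), (18, 38), (50, 20), (11, 31)]
pairs = two_hit_pairs(example_hits)
print(pairs)
```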
4. Extend 2nd hit ungapped
Parameters: Xu and Sg
• Extension score is monitored; if it drops below (max - Xu), extension stops
• If the extension score < Sg, the extension is discarded
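The two parameters can be illustrated with a short ungapped-extension sketch. The scoring function, sequences, and parameter values are made up; only the roles of Xu (drop-off) and Sg (minimum score to keep the extension) follow the slide.

```python
def extend_one_way(pairs, x_drop, score):
    """Walk aligned residue pairs outward; return (best score, length at best)."""
    running = best = best_len = 0
    for n, (a, b) in enumerate(pairs, start=1):
        running += score(a, b)
        if running > best:
            best, best_len = running, n
        elif running < best - x_drop:
            break  # score fell more than Xu below the maximum: stop
    return best, best_len


def ungapped_extend(query, subject, q_start, s_start, word_size, x_drop, s_g, score):
    """Extend a word hit in both directions without gaps.

    Returns (score, query_from, query_to) with a half-open query range,
    or None when the total score is below Sg.
    """
    seed = sum(score(a, b) for a, b in zip(query[q_start:q_start + word_size],
                                           subject[s_start:s_start + word_size]))
    right = zip(query[q_start + word_size:], subject[s_start + word_size:])
    left = zip(reversed(query[:q_start]), reversed(subject[:s_start]))
    r_score, r_len = extend_one_way(right, x_drop, score)
    l_score, l_len = extend_one_way(left, x_drop, score)
    total = seed + r_score + l_score
    if total < s_g:
        return None  # extension discarded
    return total, q_start - l_len, q_start + word_size + r_len


def toy_score(a, b):
    # Hypothetical match/mismatch scores.
    return 2 if a == b else -3


kept = ungapped_extend("ACGTACGTTT", "ACGTACGAAA", 0, 0,
                       word_size=3, x_drop=4, s_g=10, score=toy_score)
dropped = ungapped_extend("ACGTACGTTT", "ACGTACGAAA", 0, 0,
                          word_size=3, x_drop=4, s_g=20, score=toy_score)
print(kept, dropped)
```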
5. Dynamic prog. for gapped alignment
Smith-Waterman (not what BLAST uses, but demonstrative)
Seq. 1: GCCCTAGCG
Seq. 2: GCGCAATG
http://www.ibm.com/developerworks/java/library/j-seqalign/index.html
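A compact Smith-Waterman sketch applied to the two example sequences. The scoring values (match 2, mismatch -1, linear gap -2) are assumptions, not taken from the slide, and as noted this is demonstrative rather than what BLAST does.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Local alignment with a linear gap penalty.

    Returns (best score, aligned a, aligned b).
    """
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


score, aln1, aln2 = smith_waterman("GCCCTAGCG", "GCGCAATG")
print(score)
print(aln1)
print(aln2)
```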
BLAST output
• Alignment Length: total length of alignment reported (including gaps)
• % Identity: # identical nt's or aa's / alignment length
– aa alignments, positives: # subst's with '+' score in substitution matrix / alignment length
• Bit Score: calculated from quality of alignment (gaps, substitutions, etc.)
• e-Value: # of seq's with an equal or better score expected to occur in db by chance
• Coordinates: start and stop on both query and db
[Figure: annotated BLAST output showing QueryID, SubjID, alignment vis., E-val, Bit Score, %ID]
Preliminary Analysis:
M21127 454LargeContigs.fna
Major Types of BLAST
Variant    Query Sequence Type               Database Sequence Type
*blastn    Nucleotide                        Nucleotide
*blastp    Protein                           Protein
*blastx    Nucleotide translated to protein  Protein
*tblastn   Protein                           Nucleotide translated to protein
tblastx    Nucleotide translated to protein  Nucleotide translated to protein
* Types we will be using in our analysis
• Queries in fasta format
• Databases built with "formatdb" from fasta format files
Why are we using the BLASTs we're using?
• BLASTn – similar but not identical nt sequences
– not for finding homologous protein coding regions in other organisms, because of the degeneracy of the genetic code
– aa methods better for this
• BLASTp – most reliable, less conservative than BLASTn
• BLASTx – can provide strong evidence for the presence of a homologous coding region, even between distantly related genes
– is appropriate for use early in moderate and large scale sequencing projects
• tBLASTn – useful for finding protein homologs in unannotated nucleotide data
– especially suited to working with error prone data like draft genomic sequences
Pangenome/Panproteome
Combined files of gene/protein sequences from:
• Haemophilus somnus
– 129PT plasmid pHS129
– 2336
• Haemophilus ducreyi
– 35000HP
• Haemophilus parasuis
– SH0165
• Haemophilus influenzae
– Rd KW20
– 86-028NP
– PittEE
– PittGG
– F3031
– F3047
Panproteome: 16,003 sequences
Pangenome: 16,083 sequences
BLAST Pipeline (part 1)
• blastx / tblastn: Contigs (nucleotide) vs. H. inf. prot and Panproteome (protein)
• blastp: ORFs (protein) vs. H. inf. prot and Panproteome (protein)
• blastn: ORFs and Contigs (nucleotide) vs. H. inf. genes (nucleotide)
• Process, Filter, Compare
BLAST pipeline
• Some things may have slipped through the
cracks!
– Conserved domains?
– Homologs in more distantly-related species?
– Not as confident, but can still give potentially-useful
predictions
nr database (NCBI)
• All non-redundant sequences from:
– GenBank CDS translations: annotated collection of conceptual translations of all publicly available protein-coding nucleotide sequences
– PDB: Sequences derived from 3-dimensional structure
from Brookhaven Protein Databank
– SwissProt: UniProtKB/Swiss-Prot; manually annotated,
reviewed
– PIR: Part of UniProt consortium
– PRF: Protein Research Foundation, in Japan
NCBI databases: http://www.ncbi.nlm.nih.gov/blast/blast_databases.shtml
PDB: http://www.rcsb.org/pdb/home/home.do SwissProt: http://www.ebi.ac.uk/uniprot/
PIR: http://pir.georgetown.edu/ PRF: http://www.prf.or.jp/aboutdb-e.html
Pfam database (Sanger Inst.)
• Collection of protein families (11,912 in release
24.0, Oct 2009)
• Domains – functional regions
• Conserved domains can indicate conserved
function
• Pfam-A: high quality, manually curated families
• Pfam-B: automatically-generated supplement
– Uses ADDA: Automatic Domain Decomposition
Algorithm
– Lower quality, but catch-all
pfam.sanger.ac.uk
BLAST pipeline (part 2)
• "Lonely" ORFs (protein): ORFs with no hits, or only hits below threshold
• blastp against NR (protein) and Pfam (protein)
• Process, Filter, Compare
• Integrate with ab initio, RNA
Nitty Gritty: what is involved?
1. Find ORFs (nt and prot): getorf
2. Format database: formatdb
3. Run BLAST in each direction: blastall
4. Filter on e-val, align. length, %ID: custom perl scripts
5. Find RBHs (reciprocal best hits): public perl scripts
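Step 5 can be sketched in a few lines. This is an illustration in Python rather than the perl scripts used in the pipeline; it assumes each BLAST direction has already been reduced to (query, subject, bit score) tuples, and the identifiers below are hypothetical.

```python
def best_hits(hits):
    """Map each query to its highest bit-score subject."""
    best = {}
    for query, subject, bitscore in hits:
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((q, s) for q, s in fwd.items() if rev.get(s) == q)


# Hypothetical ORF-vs-reference and reference-vs-ORF hits.
forward = [("orf1", "HI0001", 250.0), ("orf1", "HI0002", 90.0),
           ("orf2", "HI0002", 120.0), ("orf3", "HI0002", 300.0)]
reverse = [("HI0001", "orf1", 248.0), ("HI0002", "orf3", 295.0),
           ("HI0002", "orf2", 110.0)]
rbh = reciprocal_best_hits(forward, reverse)
print(rbh)
```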
ORF statistics (for prot; x3 for nt)
• Total: 75,590 ORFs
• Shortest: 10 residues
• Longest: 2,511 residues
• 1000+ residues: 24
• 500-1000 residues: 254
• 200-500 residues: 972
• Avg size of proteins in panproteome: 304.90
Initial Filtering
• Minimum alignment length: 24nt or 8aa
• Minimum fractional alignment (aligned length / query length): 0.5
• Maximum e-value: 0.0001 (nt), 0.05 (aa)
Quite liberal, but will give a good first-pass overview.
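The three thresholds combine into a single predicate. A minimal sketch, assuming lengths are given in nt for nucleotide searches and in residues for protein searches; the function and parameter names are illustrative, not those of the custom scripts.

```python
def passes_filter(aln_length, query_length, evalue, is_protein):
    """Apply the initial filtering thresholds to one BLAST hit."""
    min_length = 8 if is_protein else 24          # 8 aa or 24 nt
    max_evalue = 0.05 if is_protein else 0.0001   # aa vs. nt cutoff
    return (aln_length >= min_length
            and aln_length / query_length >= 0.5  # fractional alignment
            and evalue <= max_evalue)


print(passes_filter(120, 200, 1e-30, is_protein=True))   # long, significant aa hit
print(passes_filter(30, 100, 1e-10, is_protein=False))   # covers only 30% of the query
```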
BLAST results (post-filter)
Analysis # hits # hits
Recip.
Best Hits
1373
1369
1495
1372
1336
1208
1467
5608
1496
12568
1141
4338
1489
12,286
1501
12,519
1439
3070
Comparing/Processing BLAST results
• Many challenges!
• Lots of things to consider…
Comparing/Processing Blast Results
• Overlapping ORFs
• Are the 1439 from hflu-refseq all present in full panproteome blast?
– If so, take those 1439, add to "final" list, process the other 1631
• Which ones do we trust as they are now?
• Which ones to run through nr and Pfam? What criteria?
• Filter ORFs by codon usage? Other parameters?
• How to compare/combine blastx/tblastn with blastp?
• How to integrate with other groups?
• Other…
Protein-coding Gene Prediction by Ab initio
Kristen Knipe, Shaupu Qin & Tianjun Ye
Ab Initio Gene Prediction Strategy
● Prediction: Use GENEMARKS, GLIMMER, PRODIGAL to predict the whole genome.
● Filter: Filter out genes with length > 10000 (possible bug of the program)
● Merging: Merge the predicted results
● Validation: Use BLASTx to validate the merged result.
GeneMark.hmm and GeneMarkS
• Minimus2 output (Newbler and Mira)
• Ran GeneMark.hmm using:
• H. influenzae model
• H. influenzae 86 model
• H. ducreyi model
• Ran GeneMarkS
• H. haemolyticus model created
[Figure: Number of Genes Predicted by GeneMark.hmm and GeneMarkS. x-axis: Haemophilus Genome: M19107 (123), M19501 (22), M21127 (38), M21621 (28), M21639 (54), M21709 (37); y-axis: Number of Genes (0-3000); series: GeneMark.hmm (H. ducreyi), GeneMark.hmm (H. influenzae86), GeneMark.hmm (H. influenzae), GeneMarkS]
CDC ID   Species          Disease
M19107   H. haemolyticus  Asymptomatic
M19501   H. haemolyticus  Asymptomatic
M21127   H. haemolyticus  Pathogenic
M21621   H. haemolyticus  Pathogenic
M21639   H. haemolyticus  Pathogenic
M21709   H. influenzae    Pathogenic
[Figure: Start Codon Usage. x-axis: Start Codon (ATG, GTG, TTG); y-axis: Percentage (0-1); series: H. influenzae (38.2% GC), H. influenzae 86 (38.2% GC), M21709* (38.03% GC), H. ducreyi (38.2% GC), M19107* (38.7% GC), M19501* (38.5% GC), M21127* (38.6% GC), M21621* (38.4% GC), M21639* (38.6% GC). *predicted by GeneMarkS. Calculated by Acua Software]
[Figure: Codon Usage Relative Frequencies (M19107). x-axis: Codon and Amino Acid (codons grouped by Ala, Cys, Asp, Glu, Phe, Gly, His, Ile, Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr, Val, Trp, Tyr, Stop); y-axis: Percentage (0-1.2)]
[Figure: Codon Usage Frequencies (M19107). x-axis: Codon and Amino Acid (same grouping); y-axis: Percentage (0-0.05). Calculated by Acua Software]
Prodigal Results on Three Assemblies of M19107 (Newbler, Mira3, Minimus2)
                     Newbler    Mira3      Minimus2
Average Gene Length  853.6782   804.1442   858.9455
Total Gene Number    1846       1969       1983
GC Content           0.6310     0.6335     0.6322
[Figure: Gene Length Distribution of Different Assembly. x-axis: Gene Length (-500 to 8500); y-axis: -50 to 250; series: newbler, mira, minimus]
Three Prediction Software Results on Minimus2 Assembly (Prodigal, GMS, Glimmer3)
                     Prodigal   GMS        Glimmer3
Average Gene Length  858.9455   827.9541   894.5465
Total Gene Number    1983       2069       1945
[Figure: Gene Length Distribution of Minimus Assembly. x-axis: Gene Length (-200 to 9800); y-axis: -50 to 250; series: Prodigal, GMS, Glim3]
It's a good way to visualize gene prediction results
After integrating the results of Homology Search, we can easily find the difference between the genes matched with known proteins and those not.
An Extreme Example
Skovgaard et al. (2001)
• Many short ORFs are annotated as genes
Skovgaard et al. (2001)
Prediction Result (After Filter)
M19107      Number of Genes
GENEMARKS   2069
GLIMMER     1945
PRODIGAL    1983
Merge Strategy
All Predicted Genes
• Level 1: High Confidence (HC): genes predicted in all 3 programs
• Level 2: Medium Confidence (MC): genes that appear in GeneMark and GLIMMER
• Level 3: Low Confidence (LC): genes predicted in only one program
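The merge levels can be sketched as a set operation. Two assumptions are made here that the slides do not state: predictions from different programs count as the same gene when they share contig, strand, and stop coordinate, and pairs involving Prodigal (which the strategy does not mention) are left as "unassigned". Program labels and example genes are hypothetical.

```python
from collections import defaultdict


def merge_predictions(predictions):
    """Assign a confidence level to every predicted gene.

    predictions: {program name: iterable of (contig, strand, stop) keys}
    """
    support = defaultdict(set)
    for program, genes in predictions.items():
        for gene in genes:
            support[gene].add(program)
    levels = {}
    for gene, programs in support.items():
        if len(programs) == 3:
            levels[gene] = "HC"          # all 3 programs
        elif programs == {"GeneMarkS", "Glimmer"}:
            levels[gene] = "MC"          # GeneMark and GLIMMER
        elif len(programs) == 1:
            levels[gene] = "LC"          # only one program
        else:
            levels[gene] = "unassigned"  # combination not covered by the strategy
    return levels


g1, g2, g3 = ("contig1", "+", 900), ("contig1", "-", 2400), ("contig2", "+", 510)
g4, g5 = ("contig2", "-", 3300), ("contig3", "+", 120)
levels = merge_predictions({
    "GeneMarkS": [g1, g2, g3],
    "Glimmer": [g1, g2, g4],
    "Prodigal": [g1, g4, g5],
})
print(levels)
```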
Merge Result
M19107                   Number of Genes
High Confidence genes    1058
Medium Confidence genes  87
Low Confidence genes     800 (glimmer) + 926 (genemarks) + 925 (prodigal) = 2651
Validation Strategy
Run BLASTx on Merged Genes
• E value < e-20: Validated Genes with Different Confidence Level
• E value > e-20: False Positive / Pseudo-gene
BLASTx SAMPLE (validated gene)
BLASTx SAMPLE (validated gene)
BLASTx SAMPLE (false positive/pseudo gene)
Final Result: Currently Running
M19107   Confidence Level             Number of Genes
Validated Genes
         High Confidence
         Medium Confidence
         Low Confidence
False positive/pseudo gene
Delivery
● All the BLASTx and BLASTn files corresponding to validated genes can be used for functional analysis, multiple alignment, and phylogenetic analysis.
RNA Prediction
Paul Cooper & Vani Rajan
Stats on Haemophilus genus from Rfam
Haemophilus strain     # families  # entries  tRNA  tmRNA  6S  5S rRNA  23S rRNA  SS rRNA 5  SRP bact  S15  RNaseP  GrpII Intron
ducreyi 35000HP        18          82         47    1      1   7        6         6          1         1    1       0
influenzae 86-028NP    20          96         57    1      1   7        6         6          1         1    1       0
influenzae PittEE      20          96         57    1      1   7        6         6          1         1    1       0
influenzae PittGG      20          94         55    1      1   7        6         6          1         1    1       0
influenzae Rd KW20     20          96         56    1      1   7        6         6          1         2    1       0
somnus 2336            18          80         47    1      1   6        5         5          1         1    1       1
somnus 129PT           19          80         47    1      1   6        5         5          1         1    1       0

sRNA
Haemophilus strain     His leader  TPP riboswitch  FMN riboswitch  Sxy  Alpha Op.RBS  Glycine Ribo  PreQ1  GcvB  Lr-Pk1?  Lysine  Moco ribo  Thr Leader
ducreyi 35000HP        0           1               1               0    1             2             0      1     1        1       2          0
influenzae 86-028NP    2           3               1               1    1             2             1      1     1        1       1          0
influenzae PittEE      2           3               1               1    1             2             1      1     1        1       1          0
influenzae PittGG      2           3               1               1    1             2             1      1     1        1       1          0
influenzae Rd KW20     2           3               1               1    1             2             1      1     1        1       1          0
somnus 2336            0           1               1               1    1             2             0      1     1        1       2          1
somnus 129PT           0           2               1               0    1             2             1      1     1        1       2          1
tRNA predictions: tRNAscan-SE
• Newbler de novo data
tRNAs found by tRNAscan-SE:
Strain     All_Contigs  LargeContigs
M19107     50           46
M19501     48           45
M21127     52           49
M21621     52           49
M21639     51           48
M21639_2   51           47
M21709     48           45
Distribution of tRNAs: tRNAs found (LARGE CONTIGS FILE)
Asymptomatic: M19107, M19501; Pathogenic: M21127, M21621, M21639, M21639_2; Influenza: M21709
          M19107    M19501    M21127    M21621    M21639    M21639_2  M21709
Ala       1         1         1         1         1         1         2
Arg       4         4         4         4         4         4         4
Asn       2         2         2         2         2         2         2
Asp       3         3         3         3         3         3         3
Cys       1         1         1         1         1         1         1
Gln       2         2         2         2         2         2         2
Glu       0         0         0         0         0         0         0
Gly       5         5         5         5         4         4         4
His       1         1         1         1         1         1         1
Ile       0         0         0         0         0         0         0
Leu       5         5         5         5         4         5         5
Lys       4         4         5         4         5         5         4
Met       4         4         4         4         4         4         4
Phe       1         1         1         1         1         1         1
Pro       2         2         2         2         2         2         2
SeC       1         1         0         1         1         1         1
Ser       4         4         4         4         4         4         4
Thr       2         2         2         2         2         2         2
Trp       1         1         1         1         1         1         1
Tyr       1         1         1         1         1         1         1
Val       2         1         4         5         5         3         1
Contigs   194       36        46        39        225       119       39
Distribution of tRNAs: tRNAs found (ALL CONTIGS FILE)
Asymptomatic: M19107, M19501; Pathogenic: M21127, M21621, M21639, M21639_2; Influenza: M21709; NCBI annotation (Influenza): 86-028NP, PittEE, PittGG
          M19107    M19501    M21127    M21621    M21639    M21639_2  M21709    86-028NP  PittEE    PittGG
Ala       2         2         2         2         2         2         3         4         4         3
Arg       4         4         4         4         4         4         4         4         4         4
Asn       2         2         2         2         2         2         2         2         2         2
Asp       3         3         3         3         3         3         3         3         3         3
Cys       1         1         1         1         1         1         1         1         1         1
Gln       2         2         2         2         2         2         2         2         2         2
Glu       1         1         1         1         1         1         1         3         3         4
Gly       5         5         5         5         4         4         4         5         5         4
His       1         1         1         1         1         1         1         1         1         1
Ile       1         1         1         1         1         1         1         3         3         2
Leu       5         5         5         5         4         5         5         5         5         5
Lys       4         4         5         4         5         5         4         4         4         4
Met       4         4         4         4         4         4         4         4         4         4
Phe       1         1         1         1         1         1         1         1         1         1
Pro       2         2         2         2         2         2         2         2         2         2
SeC(p)    1         1         1         1         1         1         1         1         1         1
Ser       4         4         4         4         4         4         4         4         4         4
Thr       2         2         2         2         2         2         2         2         2         1
Trp       1         1         1         1         1         1         1         1         1         1
Tyr       1         1         1         1         1         1         1         1         1         1
Val       3         1         4         5         5         4         1         5         5         5
Contigs   217       75        59        50        175       173       54
Individual Codons
• Only Glu and Val show different usage
• All other codons show usage similar to influenza genomes
tRNAs found: codons
Asymptomatic: M19107, M19501; Pathogenic: M21127, M21621, M21639, M21639_2; Influenza: M21709, 86-028NP, PittEE, PittGG
          M19107    M19501    M21127    M21621    M21639    M21639_2  M21709    86-028NP  PittEE    PittGG
Glu GAA   1         1         1         1         1         1         1         3         3         4
Glu GAG   0         0         0         0         0         0         0         0         0         0
Val GTA   2         0         3         4         4         3         0         4         4         4
Val GTC   1         1         1         1         1         1         1         1         1         1
Val GTG   0         0         0         0         0         0         0         0         0         0
Val GTT   0         0         0         0         0         0         0         0         0         0
rRNA Difficulties
• Highly conserved functional regions
• Followed by a short hyper-variable area
• Multiple operon copies (~55 copies have been found in C. elegans at Chr 1, w/ 275 total)
• A closure of C. violaceum with 57 contigs found: 7 contigs ended with 5S rRNA, 3 with 16S
Tung C-S, Joseph S, Sanbonmatsu KY. All-atom homology model of the Escherichia coli 30S ribosomal subunit. Nature Structural Biology 9, 750-755 (2002).
C. elegans Sequencing Consortium (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012-2018.
Woese C. Bacterial evolution. Microbiol Rev. 1987 June; 51(2): 221-271. PMCID: PMC373105
RNAmmer
• Online submission limits: 10k sequences and 20M nt
• Prescreening, followed by HMM
• Bacterial training: 82% Actinobacteria, Firmicutes, Proteobacteria
• Highest accuracy in 16S, then 23S
rRNA Results
##gff-version2
##source-version RNAmmer-1.2 (IRIX64)
##date 2011-02-19
##Type DNA
# seqname source feature start end score +/- frame attribute
# ---------------------------------------------------------------------------------------------------------
contig00179 RNAmmer-1.2 (IRIX64) rRNA 1572 1686 82.9 + . 5s_rRNA
contig00031 RNAmmer-1.2 (IRIX64) rRNA 45 159 63.0 + . 5s_rRNA
contig00211 RNAmmer-1.2 (IRIX64) rRNA 172 1699 1894.4 + . 16s_rRNA
# ---------------------------------------------------------------------------------------------------------
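Output like this can be tallied per rRNA type with a short parser. A minimal sketch, assuming the real file is tab-delimited with nine GFF columns (the slide shows it space-flattened); the example lines are rebuilt from the records above.

```python
from collections import Counter

GFF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def parse_gff2(lines):
    """Parse tab-delimited GFF records, skipping comments and blank lines."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue
        record = dict(zip(GFF_COLUMNS, line.split("\t")))
        record["start"] = int(record["start"])
        record["end"] = int(record["end"])
        record["score"] = float(record["score"])
        records.append(record)
    return records


example = [
    "##gff-version2",
    "\t".join(["contig00179", "RNAmmer-1.2", "rRNA", "1572", "1686", "82.9", "+", ".", "5s_rRNA"]),
    "\t".join(["contig00031", "RNAmmer-1.2", "rRNA", "45", "159", "63.0", "+", ".", "5s_rRNA"]),
    "\t".join(["contig00211", "RNAmmer-1.2", "rRNA", "172", "1699", "1894.4", "+", ".", "16s_rRNA"]),
]
records = parse_gff2(example)
counts = Counter(r["attribute"] for r in records)
print(dict(counts))
```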
Preliminary rRNA Results
Strain             5S     16S    23S    Contigs
M19107-hae-AS      2      1      0(1)   217
M19501-hae-AS      1(3)   0(0)   0(0)   75
M21127-hae-P       2      1      1      59
M21621-hae-P       2      1      1      50
M21639-hae-P       2      1      1      175
M21709-Infl-P      3(4)   1      2      52
Rfam influenza(4)  7      --     5      ----
JCVI infl-KW20 Rd  5      5      5      ---
Ribosome Subunit Assembly
Clustering of Influ-KW20
• Lengths consistent: 114/1538/2896
• Spacing between 5S and 23S: 247
• Spacing between 23S and 16S: 3x724, 479, 2017
• Clusters vary in distance from: 27k-1,800k
[Figure: rRNA clusters labeled A, C, E, F, B, D]
Pathogenic Hae-23S Alignment
Cladogram with Distance
http://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?distance=true&tree=phylogram&jobId=clustalw2-I20110221-192123-0426-9064345-oy&tool=clustalw2&analysis=tree. EMBL-EBI
Has rRNA been associated with virulence?
• V. vulnificus (shellfish/humans): 16S type B highly associated with virulence over type A [1]
• H. aegyptius: 15 rRNA (16+23) gene restriction patterns, one is associated with most cases of BPF [2]
• V. cholerae O139 toxins linked to the ribotype BglI [3]
[1] Nilsson WB, Paranjpye RN, DePaola A, Strom MS. Sequence polymorphism of the 16S rRNA gene of Vibrio vulnificus is a possible indicator of strain virulence. J Clin Microbiol. 2003 Jan;41(1):442-6.
[2] van Doorn L-J et al. Accurate Prediction of Macrolide Resistance in Helicobacter pylori by a PCR Line Probe Assay for Detection of Mutations in the 23S rRNA Gene: Multicenter Validation Study. Antimicrob Agents Chemother. 2001 May; 45(5): 1500-1504.
[3] Faruque et al. Molecular analysis of rRNA and cholera toxin genes carried by the new epidemic strain of toxigenic Vibrio cholerae O139 synonym Bengal. J Clin Microbiol. 1994 Apr;32(4):1050-3.
Future: rRNA
• Determine coverage of rRNA areas in assembly
• Check contig edges for rRNA partial matches
• Map rRNA to contigs to determine if distance in
between could represent missed rRNA
• Run other methods of rRNA verification
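Checking contig edges could start from the coordinates already reported. A rough sketch: flag any rRNA feature that starts or ends within a margin of a contig end, since such hits may be truncated copies. The 50 bp margin and the contig lengths below are hypothetical; the feature coordinates come from the RNAmmer output shown earlier.

```python
def near_contig_edge(features, contig_lengths, margin=50):
    """Return (contig, name) for features within `margin` bp of either contig end.

    features: iterable of (contig, start, end, name), 1-based inclusive.
    contig_lengths: {contig: length in bp}.
    """
    flagged = []
    for contig, start, end, name in features:
        length = contig_lengths[contig]
        if start - 1 <= margin or length - end <= margin:
            flagged.append((contig, name))
    return flagged


features = [
    ("contig00179", 1572, 1686, "5s_rRNA"),
    ("contig00031", 45, 159, "5s_rRNA"),
    ("contig00211", 172, 1699, "16s_rRNA"),
]
# Hypothetical contig lengths, for illustration only.
contig_lengths = {"contig00179": 1700, "contig00031": 5000, "contig00211": 20000}
flagged = near_contig_edge(features, contig_lengths)
print(flagged)
```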
Rfam output
##gff-version 3
# rfam_scan.pl (v1.0)
# command line: /usr/bin/rfam_scan-1.0.2.pl -blastdb /storage2/db/rfam/rfam -o 454LargeContigs.rfam /storage2/db/rfam/Rfam.cm 454LargeContigs.fna
# CM file: /storage2/db/rfam/Rfam.cm
# query FASTA file: 454LargeContigs.fna
# start time: Tue Feb 22 12:28:46 EST 2011
# end time: Tue Feb 22 13:18:54 EST 2011
contig00211 Rfam similarity 191 721 344.88 + . evalue=4.15e-42;gc-content=53;id=SSU_rRNA_5.1;model_end=486;model_start=1;rfam-acc=RF00177;rfam-id=SSU_rRNA_5;score=344.88
Contig00203 Rfam similarity 15472 15836 280.07 + . evalue=6.18e-40;gc-content=49;id=tmRNA.1;model_end=359;model_start=1;rfam-acc=RF00023;rfam-id=tmRNA;score=280.07
contig00025 Rfam similarity 2611 2987 305.02 + . evalue=2.09e-40;gc-content=53;id=RNaseP_bact_a.1;model_end=367;model_start=1;rfam-acc=RF00010;rfam-id=RNaseP_bact_a;score=305.02
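The last GFF column carries the family name and e-value as key=value pairs, which is what the per-family counts are built from. A minimal sketch of pulling them apart, using the tmRNA record above; the function name is illustrative.

```python
def parse_attributes(field):
    """Split a GFF3-style 'key=value;key=value' attribute column into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if item:
            key, _, value = item.partition("=")
            attrs[key.strip()] = value.strip()
    return attrs


tmrna = parse_attributes(
    "evalue=6.18e-40;gc-content=49;id=tmRNA.1;model_end=359;"
    "model_start=1;rfam-acc=RF00023;rfam-id=tmRNA;score=280.07"
)
print(tmrna["rfam-id"], tmrna["rfam-acc"], float(tmrna["evalue"]))
```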
Rfam preliminary results
• Validated tRNAs found by tRNAscan-SE
• Had difficulty finding long rRNAs
Haemophilus strain     # families  # entries  tRNA  tmRNA  6S  5S rRNA  23S rRNA  SS rRNA 5  SRP bact  S15  RNaseP  GrpII Intron
ducreyi 35000HP        18          82         47    1      1   7        6         6          1         1    1       0
influenzae 86-028NP    20          96         57    1      1   7        6         6          1         1    1       0
influenzae PittEE      20          96         57    1      1   7        6         6          1         1    1       0
influenzae PittGG      20          94         55    1      1   7        6         6          1         1    1       0
influenzae Rd KW20     20          96         56    1      1   7        6         6          1         2    1       0
somnus 2336            18          80         47    1      1   6        5         5          1         1    1       1
somnus 129PT           19          80         47    1      1   6        5         5          1         1    1       0
M19107                 19          71         50    1      1   2        1         1          1         1    1       0
M19501                 19          72         48    1      1   3        0         0          1         1    1       0
M21127                 20          75         52    1      1   2        1         1          1         1    1       0
M21621                 19          74         52    1      1   2        1         1          1         1    1       0
M21639                 20          74         51    1      1   2        1         1          1         1    1       0
M21639_2               21          76         51    1      1   2        1         1          1         1    1       0
M21709                 22          77         48    1      1   4        2         1          1         1    1       0
Rfam preliminary results
• Some sRNAs in H. haemolyticus appear to be shared with H. influenzae, while others are shared with different Haemophilus species
sRNA
Haemophilus strain     His leader  TPP riboswitch  FMN riboswitch  Sxy  Alpha Op.RBS  Glycine Ribo  PreQ1  GcvB  Lr-Pk1?  Lysine  Moco ribo  Thr Leader  Rrt  IsrK
ducreyi 35000HP        0           1               1               0    1             2             0      1     1        1       2          0           0    0
influenzae 86-028NP    2           3               1               1    1             2             1      1     1        1       1          0           0    0
influenzae PittEE      2           3               1               1    1             2             1      1     1        1       1          0           0    0
influenzae PittGG      2           3               1               1    1             2             1      1     1        1       1          0           0    0
influenzae Rd KW20     2           3               1               1    1             2             1      1     1        1       1          0           0    0
somnus 2336            0           1               1               1    1             2             0      1     1        1       2          1           0    0
somnus 129PT           0           2               1               0    1             2             1      1     1        1       2          1           0    0
M19107                 1           2               1               0    1             2             0      1     1        1       1          1           0    0
M19501                 1           4               1               1    1             2             0      1     1        1       1          1           1    0
M21127                 1           3               1               1    1             2             0      1     1        1       1          1           0    0
M21621                 1           3               1               0    1             2             0      1     1        1       1          1           0    0
M21639                 1           3               1               1    1             2             0      1     1        1       1          1           0    0
M21639_2               1           3               1               1    1             2             0      1     1        1       1          1           0    2
M21709                 2           3               1               1    1             2             1      1     1        1       1          1           1    0
sRNA prediction
• sRNAPredict3, sRNAScanner, nocoRNAc
• Some problems with inputs
– Require coordinates of protein coding genes
– Descriptions of secondary structures
– Positive training samples to create PWM (sRNAScanner)
– Biggest problem: requires MSA to find consensus
Future: sRNA
2ndary Structure
• Blast
• ClustalW (MSA)
Prediction
• RNAz
• QRNA
Filter
• nocoRNAc