gene prediction: preliminary results

Gene Prediction:

Preliminary Results

Erin Cook

Paul Cooper

Kristen Knipe

Shaupu Qin

Vani Rajan

Shrutii Sarda

Tianjun Ye

Outline

Homology based Gene Prediction

Ab initio based Gene Prediction

RNA Prediction


Homology based Gene Prediction

Erin Cook & Shrutii Sarda

Follow-up: Overlapping Genes

Do we expect gene overlap in H. haemolyticus?

(i.e., is there gene overlap in H. influenzae and/or other Haemophilus spp.?)

• Fukuda et al.: 257 overlapping gene pairs

• Palleja et al., BMC Genomics 2008:

– Surveyed 338 fully sequenced prokaryotic genomes from STRING

– ~17% of all genes overlap; some (~1%?) are due to misannotation

YES

Follow-up: Plasmids of the Haemophilus genus

• Range from 3-30 Mdal in size

• Found mainly in type b strains of H. influenzae; impart ampicillin resistance

• Have been found to be associated with Tn2, a transposable element

• Tend to have only partial homology (~48-50% similarity) with plasmids of related species

• Cryptic plasmids (1-5 kb in length) were recently isolated from H. somnus strains

Follow-up: % coding/non-coding regions of DNA

Species                 Tax. ID  Acc. ID      DNA length (bp)  Protein count  Gene count  Total length of genes (bp)  Non-coding DNA (%)
H. influenzae Rd KW20   71421    NC_000907.1  1830138          1657           1789        1345305                     26.5
H. parasuis SH0165      557723   NC_011852.1  2269156          2021           2299        1803054                     20.6
H. somnus 2336          228400   NC_010519.1  2263857          1980           2065        1977672                     12.7
H. ducreyi 35000HP      233412   NC_002940.2  1698955          1717           1838        1446108                     14.9
H. influenzae 86-028NP  281310   NC_007146.2  1914490          1792           1899        1661757                     13.3
H. influenzae F3031     866630   NC_014920.1  1985832          1770           1892        1673628                     15.8
H. influenzae F3047     935897   NC_014922.1  2007018          1786           1896        1698588                     15.4
H. influenzae PittEE    374930   NC_009566.1  1813033          1613           1689        1446339                     20.3
H. influenzae PittGG    374931   NC_009567.1  1887192          1661           1735        1422240                     24.7
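The non-coding percentage can be approximated from two of the columns above. The sketch below assumes the figure is simply 1 minus (total gene length / DNA length); the slides do not state the exact formula, so recomputed values may differ from the table in the last digit. It also treats gene lengths as non-overlapping, which the overlapping-genes follow-up suggests is only an approximation. The function name is hypothetical.

```python
# (DNA length in bp, total length of genes in bp), copied from the table above.
GENOMES = {
    "H. influenzae Rd KW20": (1830138, 1345305),
    "H. ducreyi 35000HP": (1698955, 1446108),
}


def percent_noncoding(dna_length, total_gene_length):
    """Return the percentage of the sequence not covered by annotated genes.

    Assumes genes do not overlap, so their lengths can simply be summed.
    """
    return 100.0 * (1.0 - total_gene_length / dna_length)


for name, (dna_len, gene_len) in GENOMES.items():
    print(f"{name}: {percent_noncoding(dna_len, gene_len):.1f}% non-coding")
```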


BLAST

1. Get all k-mers from query sequence

1a. Generate all possible k-mers with score > T

2. Find all hits of k-mers in database using FSA

3. Find close hits (two-hit method)

4. Extend 2nd hit ungapped

5. Dynamic programming for gapped alignment

aa / nt

Zvelebil and Baum. Understanding Bioinformatics. 2008

1. Get all k-mers from query sequence

1a. Gen. all possible k-mers with score > T

BLOSUM62 T=11 in gapped BLAST

http://www-users.math.umd.edu/~poorani/sampletalk/talk.html

Pertsemlidis and Fondon, Genome Biol. 2001.

2. Find all hits of k-mers in DB using FSA

Zvelebil and Baum. Understanding Bioinformatics.2008

Example FSA to recognize CHH, CHY, or CYH
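As a rough illustration of step 2, the sketch below builds a small finite-state automaton for the three example words and scans a sequence once, reporting every position where a word ends. It uses an Aho-Corasick-style construction, which is an assumption chosen for clarity and is not claimed to be the automaton BLAST itself builds; all function names are hypothetical.

```python
from collections import deque


def build_automaton(words):
    """Build trie transitions, failure links and outputs for a set of words."""
    goto = [{}]   # goto[state][char] -> next state
    fail = [0]    # state to fall back to when no transition matches
    out = [[]]    # words recognized on entering each state
    for word in words:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first pass: depth-1 states keep failure link 0.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_hits(text, words):
    """Return (start position, word) for every occurrence of a word in text."""
    goto, fail, out = build_automaton(words)
    hits, state = [], 0
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits


print(find_hits("ACHHCYHCHY", ["CHH", "CHY", "CYH"]))
```

The database sequence is read only once regardless of how many k-mers are in the word list, which is the point of using an automaton rather than searching for each k-mer separately.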


3. Find close hits (two-hit method)

+: T=13

• : T=11

Altschul et al. Nuc Ac Res 1997
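A minimal sketch of the two-hit idea: hits are grouped by diagonal (subject position minus query position), and an extension is only triggered when two non-overlapping hits fall on the same diagonal within some window. The window of 40 and the word size of 3 are illustrative assumptions, since the slide gives no values, and the function name is hypothetical.

```python
from collections import defaultdict


def two_hit_pairs(hits, word_size=3, window=40):
    """Return (diagonal, first_query_pos, second_query_pos) for qualifying pairs.

    hits: iterable of (query_pos, subject_pos) word hits.
    Two hits qualify when they lie on the same diagonal, do not overlap,
    and are less than `window` positions apart.
    """
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    pairs = []
    for diagonal, positions in by_diagonal.items():
        positions.sort()
        # Compare each hit with the next one on the same diagonal.
        for first, second in zip(positions, positions[1:]):
            distance = second - first
            if word_size <= distance < window:
                pairs.append((diagonal, first, second))
    return pairs


print(two_hit_pairs([(0, 5), (10, 15), (3, 50)]))
```

Requiring two hits allows a lower threshold T (more sensitive single hits) while still triggering far fewer extensions than a one-hit scheme.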


4. Extend 2nd hit ungapped

Parameters: Xu and Sg

• The extension score is monitored; if it drops below (max - Xu), the extension stops

• If the extension score < Sg, the extension is discarded
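The two rules above can be sketched as follows. This is an illustration only: it extends to the right from a given start, uses a simple match/mismatch score instead of a substitution matrix, and the default values for Xu (`x_drop`) and Sg (`s_g`) are arbitrary assumptions rather than BLAST defaults. All names are hypothetical.

```python
def match_mismatch(a, b, match=5, mismatch=-4):
    """Toy scoring function used in place of a substitution matrix."""
    return match if a == b else mismatch


def extend_right(query, subject, q_start, s_start, x_drop):
    """Extend without gaps; stop once the score falls more than x_drop below the max."""
    score = best = 0
    best_len = 0
    i = 0
    while q_start + i < len(query) and s_start + i < len(subject):
        score += match_mismatch(query[q_start + i], subject[s_start + i])
        i += 1
        if score > best:
            best, best_len = score, i
        elif best - score > x_drop:
            break  # score dropped below (max - Xu)
    return best, best_len


def ungapped_hsp(query, subject, q_start, s_start, x_drop=20, s_g=22):
    """Return (score, length) of the extension, or None if the score is below Sg."""
    best, length = extend_right(query, subject, q_start, s_start, x_drop)
    return (best, length) if best >= s_g else None


print(ungapped_hsp("ACGTACGTAC", "ACGTACGTAC", 0, 0))
print(ungapped_hsp("ACGTTTTTTT", "ACGTAAAAAA", 0, 0))
```

The extension is trimmed back to the position of the maximum score, so trailing mismatches seen before the drop-off are not included in the reported length.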


5. Dynamic programming for gapped alignment

Smith-Waterman (not what BLAST uses, but demonstrative)

Seq. 1: GCCCTAGCG

Seq. 2: GCGCAATG

http://www.ibm.com/developerworks/java/library/j-seqalign/index.html
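For reference, a compact Smith-Waterman sketch applied to the two example sequences. The scoring scheme (match +1, mismatch -1, linear gap -2) is an assumption, as the slide does not state one, so the resulting alignment may differ from the one shown in the original figure.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-2):
    """Return (best score, aligned a, aligned b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the score matrix; cells never go below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(0, H[i - 1][j - 1] + sub, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("GCCCTAGCG", "GCGCAATG")
print(score)
print(top)
print(bottom)
```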


BLAST output

• Alignment Length: total length of alignment reported (including gaps)

• % Identity: # identical nt's or aa's / alignment length

– aa alignments, positives: # substitutions with a '+' score in the substitution matrix / alignment length

• Bit Score: calculated from quality of alignment (gaps, substitutions, etc.)

• e-Value: # of sequences with similar score expected to occur in the db by chance

• Coordinates: start and stop on both query and db
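The alignment length and % identity definitions above can be checked on a pair of aligned strings. This is a small sketch with a hypothetical function name; it assumes gaps are written as "-" and that both aligned strings have equal length.

```python
def alignment_stats(aligned_query, aligned_subject):
    """Compute alignment length, identities, gaps and % identity."""
    if len(aligned_query) != len(aligned_subject) or not aligned_query:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    columns = list(zip(aligned_query, aligned_subject))
    length = len(columns)  # includes gap columns
    identical = sum(1 for q, s in columns if q == s and q != "-")
    gaps = sum(1 for q, s in columns if q == "-" or s == "-")
    return {
        "alignment_length": length,
        "identical": identical,
        "gaps": gaps,
        "percent_identity": 100.0 * identical / length,
    }


print(alignment_stats("ACGT-A", "ACCTGA"))
```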


QueryID

SubjID

Alignment vis.

E-val

Bit Score

%ID


Preliminary Analysis:

M21127 454LargeContigs.fna


Major Types of BLAST

Variant    Query Sequence Type               Database Sequence Type
*blastn    Nucleotide                        Nucleotide
*blastp    Protein                           Protein
*blastx    Nucleotide translated to protein  Protein
*tblastn   Protein                           Nucleotide translated to protein
tblastx    Nucleotide translated to protein  Nucleotide translated to protein

* Types we will be using in our analysis

• Queries in FASTA format

• Databases built by running “formatdb” on FASTA files

Why are we using these BLAST variants?

• BLASTn: finds similar but not identical nt sequences

– not suited to finding homologous protein-coding regions in other organisms, because of the degeneracy of the genetic code

– aa methods are better for this

• BLASTp: most reliable, less conservative than BLASTn

• BLASTx: can provide strong evidence for the presence of a homologous coding region, even between distantly related genes

– appropriate for use early in moderate- and large-scale sequencing projects

• tBLASTn: useful for finding protein homologs in unannotated nucleotide data

– especially suited to working with error-prone data like draft genomic sequences

Pangenome/Panproteome

Combined files of gene/protein sequences from:

• Haemophilus somnus

– 129PT plasmid pHS129

– 2336

• Haemophilus ducreyi

– 35000HP

• Haemophilus parasuis

– SH0165

• Haemophilus influenzae

– Rd KW20

– 86-028NP

– PittEE

– PittGG

– F3031

– F3047

Panproteome: 16,003 sequences

Pangenome: 16,083 sequences

BLAST Pipeline (part 1)

• blastx / tblastn: Contigs (nucleotide) vs. H. inf. prot and Panproteome (protein)

• blastp: ORFs (protein) vs. H. inf. prot and Panproteome (protein)

• blastn: ORFs and Contigs (nucleotide) vs. H. inf. genes (nucleotide)

All results: Process, Filter, Compare

BLAST pipeline

• Some things may have slipped through the

cracks!

– Conserved domains?

– Homologs in more distantly-related species?

– Not as confident, but can still give potentially-useful

predictions


nr database (NCBI)

• All non-redundant sequences from:

– GenBank CDS translations: annotated collection of conceptual translations of all publicly available protein-coding nucleotide sequences

– PDB: sequences derived from 3-dimensional structures in the Brookhaven Protein Data Bank

– SwissProt: UniProtKB/Swiss-Prot; manually annotated,

reviewed

– PIR: Part of UniProt consortium

– PRF: Protein Research Foundation, in Japan

NCBI databases: http://www.ncbi.nlm.nih.gov/blast/blast_databases.shtml

PDB: http://www.rcsb.org/pdb/home/home.do SwissProt: http://www.ebi.ac.uk/uniprot/

PIR: http://pir.georgetown.edu/ PRF: http://www.prf.or.jp/aboutdb-e.html


Pfam database (Sanger Inst.)

• Collection of protein families (11,912 in release

24.0, Oct 2009)

• Domains – functional regions

• Conserved domains can indicate conserved

function

• Pfam-A: high quality, manually curated families

• Pfam-B: automatically-generated supplement

– Uses ADDA: Automatic Domain Decomposition

Algorithm

– Lower quality, but catch-all

pfam.sanger.ac.uk


BLAST pipeline (part 2)

“Lonely” ORFs (protein): ORFs with no hits, or only hits below threshold

• blastp against NR (protein) and Pfam (protein)

• Process, Filter, Compare

• Integrate with ab initio, RNA

Nitty Gritty: what is involved?

1. Find ORFs (nt and prot): getorf

2. Format database: formatdb

3. Run BLAST in each direction: blastall

4. Filter on e-val, align. length, %ID: custom perl scripts

5. Find RBHs (reciprocal best hits): public perl scripts
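Step 5 can be sketched in a few lines. This is not the public perl script referred to above; it is a hypothetical Python illustration that assumes each direction has already been reduced to (query, subject, bit score) tuples and that "best" means highest bit score.

```python
def best_hits(hits):
    """Map each query to its highest-scoring subject.

    hits: iterable of (query, subject, bitscore) tuples.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(forward, reverse):
    """Return sorted (a, b) pairs where b is a's best hit and a is b's best hit."""
    fwd = best_hits(forward)
    rev = best_hits(reverse)
    return sorted((a, b) for a, b in fwd.items() if rev.get(b) == a)


# Hypothetical identifiers: ORFs searched against proteins, and the reverse.
forward = [("orf1", "P1", 200.0), ("orf1", "P2", 150.0), ("orf2", "P2", 90.0)]
reverse = [("P1", "orf1", 198.0), ("P2", "orf1", 140.0), ("P2", "orf2", 80.0)]
print(reciprocal_best_hits(forward, reverse))
```

Here orf2's best hit is P2, but P2's best hit is orf1, so only the orf1/P1 pair is reported.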

ORF statistics (for prot; x3 for nt)

• Total: 75,590 ORFs

• Shortest: 10 residues

• Longest: 2,511 residues

• 1000+: 24

• 500-1000: 254

• 200-500: 972

• Avg size of proteins in panproteome: 304.90

Initial Filtering

• Minimum alignment length: 24nt or 8aa

• Minimum fractional alignment

(aligned length / query length): 0.5

• Maximum e-value: 0.0001 (nt), 0.05 (aa)

Quite liberal, but will give good first-pass overview.
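A sketch of how these three thresholds could be applied to tabular BLAST output. It assumes the standard 12-column tab-separated layout (query, subject, % identity, alignment length, mismatches, gap opens, query start/end, subject start/end, e-value, bit score) and a separate, hypothetical dictionary of query lengths, since query length is not part of that layout. The defaults are the aa thresholds above. Note that alignment length and query length must be in the same units; for translated searches that needs a conversion not shown here.

```python
COLUMNS = ["query", "subject", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def parse_hit(line):
    """Parse one tab-separated hit line into a dict with numeric fields converted."""
    hit = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
    for key in ("length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"):
        hit[key] = int(hit[key])
    for key in ("pident", "evalue", "bitscore"):
        hit[key] = float(hit[key])
    return hit


def filter_hits(lines, query_lengths, min_length=8, min_fraction=0.5, max_evalue=0.05):
    """Keep hits passing the length, fractional-alignment and e-value thresholds."""
    kept = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        hit = parse_hit(line)
        fraction = hit["length"] / query_lengths[hit["query"]]
        if (hit["length"] >= min_length
                and fraction >= min_fraction
                and hit["evalue"] <= max_evalue):
            kept.append(hit)
    return kept


# Hypothetical example hits and query lengths (in residues).
example = [
    "orf1\tprotA\t90.00\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
    "orf2\tprotB\t50.00\t20\t10\t0\t1\t20\t5\t24\t0.01\t30",
    "orf3\tprotC\t40.00\t6\t3\t0\t1\t6\t1\t6\t0.04\t12",
]
lengths = {"orf1": 120, "orf2": 300, "orf3": 10}
print([hit["query"] for hit in filter_hits(example, lengths)])
```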


BLAST results (post-filter)

Analysis # hits # hits

Recip.

Best Hits

1373

1369

1495

1372

1336

1208

1467

5608

1496

12568

1141

4338

1489

12,286

1501

12,519

1439

3070


Comparing/Processing BLAST results

• Many challenges!

• Lots of things to consider…


Comparing/Processing Blast Results

• Overlapping ORFs

• Are the 1439 from hflu-refseq all present in the full panproteome BLAST?

– If so, take those 1439, add them to the “final” list, and process the other 1631

• Which ones do we trust as they are now?

• Which ones should we run through nr and Pfam? By what criteria?

• Filter ORFs by codon usage? Other parameters?

• How to compare/combine blastx/tblastn with blastp?

• How to integrate with other groups?

• Other…

Protein-coding Gene Prediction by Ab initio

Kristen Knipe, Shaupu Qin & Tianjun Ye

Ab Initio Gene Prediction Strategy

● Prediction: Use GeneMarkS, Glimmer, and Prodigal to predict genes across the whole genome.

● Filter: Remove predicted genes with length > 10000 (possibly a bug in the program).

● Merging: Merge the predicted results.

● Validation: Use BLASTx to validate the merged result.

GeneMark.hmm and GeneMarkS

• Minimus2 output (Newbler and Mira)

• Ran GeneMark.hmm using:

• H. influenzae model

• H. influenzae 86 model

• H. ducreyi model

• Ran GeneMarkS

• H. haemolyticus model created

Number of Genes Predicted by GeneMark.hmm and GeneMarkS

[Figure: x-axis: Haemophilus Genome: M19107 (123), M19501 (22), M21127 (38), M21621 (28), M21639 (54), M21709 (37). y-axis: Number of Genes (0-3000). Series: GeneMark.hmm (H. ducreyi), GeneMark.hmm (H. influenzae86), GeneMark.hmm (H. influenzae), GeneMarkS.]

CDC ID Species Disease

M19107 H. haemolyticus Asymptomatic

M19501 H. haemolyticus Asymptomatic

M21127 H. haemolyticus Pathogenic



M21709 H. influenzae Pathogenic


Start Codon Usage

[Figure: x-axis: Start Codon (ATG, GTG, TTG). y-axis: Percentage (0-1). Series: H. influenzae (38.2% GC); H. influenzae 86 (38.2% GC); M21709* (38.03% GC); H. ducreyi (38.2% GC); M19107* (38.7% GC); M19501* (38.5% GC); M21127* (38.6% GC); M21621* (38.4% GC); M21639* (38.6% GC).]

*predicted by GeneMarkS

*Calculated by Acua Software

Codon Usage Relative Frequencies (M19107)

[Figure: x-axis: Codon and Amino Acid (codons grouped by amino acid: Ala, Cys, Asp, Glu, Phe, Gly, His, Ile, Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr, Val, Trp, Tyr, Stop). y-axis: Percentage (0-1.2).]

Codon Usage Frequencies (M19107)

[Figure: x-axis: Codon and Amino Acid (codons grouped by amino acid: Ala, Cys, Asp, Glu, Phe, Gly, His, Ile, Lys, Leu, Met, Asn, Pro, Gln, Arg, Ser, Thr, Val, Trp, Tyr, Stop). y-axis: Percentage (0-0.05).]

*Calculated by Acua Software

Prodigal Results on Three Assemblies of M19107 (Newbler, Mira3, Minimus2)

                      Newbler    Mira3      Minimus2
Average Gene Length   853.6782   804.1442   858.9455
Total Gene Number     1846       1969       1983
GC Content            0.6310     0.6335     0.6322

Gene Length Distribution of Different Assemblies

[Figure: x-axis: Gene Length (-500 to 8500). y-axis: -50 to 250. Series: newbler, mira, minimus.]

Three Prediction Software Results on Minimus2 Assembly (Prodigal, GMS, Glimmer3)

                      Prodigal   GMS        Glimmer3
Average Gene Length   858.9455   827.9541   894.5465
Total Gene Number     1983       2069       1945

Gene Length Distribution of Minimus Assembly

[Figure: x-axis: Gene Length (-200 to 9800). y-axis: -50 to 250. Series: Prodigal, GMS, Glim3.]

It's a good way to visualize gene prediction results.

After integrating the results of the homology search, we can easily see the difference between the genes matched with known proteins and those not matched.

An Extreme Example

Marie Skovgaard et al. (2001)

• Many short ORFs are

annotated as genes

Marie Skovgaard et al. (2001)


Prediction Result (After Filter)

M19107      Number of Genes
GENEMARKS   2069
GLIMMER     1945
PRODIGAL    1983

Merge Strategy

All Predicted Genes

• Level 1: High Confidence (HC): genes predicted in all 3 programs

• Level 2: Medium Confidence (MC): genes that appear in GeneMark and GLIMMER

• Level 3: Low Confidence (LC): genes predicted in only one program
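The three levels can be expressed as a small classification function. This sketch makes two assumptions the slides do not confirm: two predictions count as the same gene only when their keys (here contig, strand, start, end) are identical, and a gene found by two programs other than the GeneMark/GLIMMER pair falls into the low-confidence level. Program labels and gene keys are hypothetical.

```python
def classify(predictions):
    """Assign a confidence level to every predicted gene.

    predictions: {program name: set of gene keys}, with programs named
    "genemarks", "glimmer" and "prodigal".
    """
    all_programs = set(predictions)
    all_genes = set().union(*predictions.values())
    levels = {}
    for gene in all_genes:
        found_by = {name for name, genes in predictions.items() if gene in genes}
        if found_by == all_programs:
            levels[gene] = "HC"   # predicted by all 3 programs
        elif {"genemarks", "glimmer"} <= found_by:
            levels[gene] = "MC"   # predicted by GeneMark and GLIMMER
        else:
            levels[gene] = "LC"   # everything else (assumption, see above)
    return levels


# Hypothetical gene keys: (contig, strand, start, end).
g1 = ("contig1", "+", 100, 400)
g2 = ("contig1", "-", 900, 1500)
g3 = ("contig2", "+", 50, 350)
predictions = {
    "genemarks": {g1, g2},
    "glimmer": {g1, g2},
    "prodigal": {g1, g3},
}
print(classify(predictions))
```

In practice exact coordinate matching is likely too strict, because programs often agree on the stop codon but disagree on the start; matching on strand and stop position would be one alternative.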


Merge Result

M19107                    Number of Genes
High Confidence genes     1058
Medium Confidence genes   87
Low Confidence genes      800 (glimmer) + 926 (genemarks) + 925 (prodigal) = 2651

Validation Strategy

Run BLASTx on Merged Genes

• E value < e-20: Validated Genes with Different Confidence Level

• E value > e-20: False Positive / Pseudo-gene

BLASTx SAMPLE (validated gene)


BLASTx SAMPLE (validated gene)


BLASTx SAMPLE(false positive/pseudo gene)


Final Result: Currently Running

M19107 Confidence Level Number of Genes

Validated Genes

High Confidence

Medium Confidence

Low Confidence

False positive/pseudo gene


Delivery

● All the BLASTx and BLASTn files corresponding to validated genes can be used for functional analysis, multiple alignment, and phylogenetic analysis.

RNA Prediction

Paul Cooper & Vani Rajan

Stats on Haemophilus genus from Rfam

tRNA, rRNA and other families:

Strain                # families  # entries  tRNA  tmRNA  6S  5S rRNA  23S rRNA  SSU rRNA 5  SRP bact  S15  RNaseP  GrpII Intron
ducreyi 35000HP       18          82         47    1      1   7        6         6           1         1    1       0
influenzae 86-028NP   20          96         57    1      1   7        6         6           1         1    1       0
influenzae PittEE     20          96         57    1      1   7        6         6           1         1    1       0
influenzae PittGG     20          94         55    1      1   7        6         6           1         1    1       0
influenzae Rd KW20    20          96         56    1      1   7        6         6           1         2    1       0
somnus 2336           18          80         47    1      1   6        5         5           1         1    1       1
somnus 129PT          19          80         47    1      1   6        5         5           1         1    1       0

sRNA families:

Strain                His leader  TPP riboswitch  FMN riboswitch  Sxy  Alpha Op.RBS  Glycine Ribo  PreQ1  GcvB  Lr-Pk1?  Lysine  Moco ribo  Thr Leader
ducreyi 35000HP       0           1               1               0    1             2             0      1     1        1       2          0
influenzae 86-028NP   2           3               1               1    1             2             1      1     1        1       1          0
influenzae PittEE     2           3               1               1    1             2             1      1     1        1       1          0
influenzae PittGG     2           3               1               1    1             2             1      1     1        1       1          0
influenzae Rd KW20    2           3               1               1    1             2             1      1     1        1       1          0
somnus 2336           0           1               1               1    1             2             0      1     1        1       2          1
somnus 129PT          0           2               1               0    1             2             1      1     1        1       2          1

tRNA predictions: tRNAscan-SE

• Newbler de novo data

tRNAs found by tRNAscan-SE per strain:

Strain     All_Contigs  LargeContigs
M19107     50           46
M19501     48           45
M21127     52           49
M21621     52           49
M21639     51           48
M21639_2   51           47
M21709     48           45

Distribution of tRNAs (LARGE CONTIGS FILE)

tRNAs found. Column groups (left to right): Asymptomatic, Pathogenic, Influenza

        M19107  M19501  M21127  M21621  M21639  M21639_2  M21709
Ala     1       1       1       1       1       1         2
Arg     4       4       4       4       4       4         4
Asn     2       2       2       2       2       2         2
Asp     3       3       3       3       3       3         3
Cys     1       1       1       1       1       1         1
Gln     2       2       2       2       2       2         2
Glu     0       0       0       0       0       0         0
Gly     5       5       5       5       4       4         4
His     1       1       1       1       1       1         1
Ile     0       0       0       0       0       0         0
Leu     5       5       5       5       4       5         5
Lys     4       4       5       4       5       5         4
Met     4       4       4       4       4       4         4
Phe     1       1       1       1       1       1         1
Pro     2       2       2       2       2       2         2
SeC     1       1       0       1       1       1         1
Ser     4       4       4       4       4       4         4
Thr     2       2       2       2       2       2         2
Trp     1       1       1       1       1       1         1
Tyr     1       1       1       1       1       1         1
Val     2       1       4       5       5       3         1
Contigs 194     36      46      39      225     119       39

Distribution of tRNAs (ALL CONTIGS FILE)

tRNAs found. Column groups (left to right): Asymptomatic, Pathogenic, Influenza, NCBI annotation

        M19107  M19501  M21127  M21621  M21639  M21639_2  M21709  86-028NP  PittEE  PittGG
Ala     2       2       2       2       2       2         3       4         4       3
Arg     4       4       4       4       4       4         4       4         4       4
Asn     2       2       2       2       2       2         2       2         2       2
Asp     3       3       3       3       3       3         3       3         3       3
Cys     1       1       1       1       1       1         1       1         1       1
Gln     2       2       2       2       2       2         2       2         2       2
Glu     1       1       1       1       1       1         1       3         3       4
Gly     5       5       5       5       4       4         4       5         5       4
His     1       1       1       1       1       1         1       1         1       1
Ile     1       1       1       1       1       1         1       3         3       2
Leu     5       5       5       5       4       5         5       5         5       5
Lys     4       4       5       4       5       5         4       4         4       4
Met     4       4       4       4       4       4         4       4         4       4
Phe     1       1       1       1       1       1         1       1         1       1
Pro     2       2       2       2       2       2         2       2         2       2
SeC(p)  1       1       1       1       1       1         1       1         1       1
Ser     4       4       4       4       4       4         4       4         4       4
Thr     2       2       2       2       2       2         2       2         2       1
Trp     1       1       1       1       1       1         1       1         1       1
Tyr     1       1       1       1       1       1         1       1         1       1
Val     3       1       4       5       5       4         1       5         5       5
Contigs 217     75      59      50      175     173       54

Individual Codons

• Only Glu and Val show different usage

• All other codons show usage similar to the H. influenzae genomes

tRNAs found: codons. Column groups (left to right): Asymptomatic, Pathogenic, Influenza

          M19107  M19501  M21127  M21621  M21639  M21639_2  M21709  86-028NP  PittEE  PittGG
Glu GAA   1       1       1       1       1       1         1       3         3       4
Glu GAG   0       0       0       0       0       0         0       0         0       0
Val GTA   2       0       3       4       4       3         0       4         4       4
Val GTC   1       1       1       1       1       1         1       1         1       1
Val GTG   0       0       0       0       0       0         0       0         0       0
Val GTT   0       0       0       0       0       0         0       0         0       0

rRNA Difficulties

• Highly conserved functional regions

• Followed by a short hyper-variable area.

• Multiple Operon Copies (~55 copies have been found in C.elegans at Chr 1 w/ 275 total)

• A closure of C. violaceum with 57 contigs found: (7 contigs ended with 5SRNA, 3 with 16S)

Chang-Shung Tung, Simpson Joseph & Kevin Y. Sanbonmatsu. All-atom homology model of the Escherichia coli 30S ribosomal subunit. Nature Structural Biology 9, 750-755 (2002).

C. elegans Sequencing Consortium (1998). Genome sequence of the nematode C. elegans: a platform for investigating biology. Science 282, 2012-2018.

C. Woese. Bacterial evolution. Microbiol Rev. 1987 June; 51(2): 221-271. PMCID: PMC373105

RNAmmer

• Online submission limits: 10k sequences and 20M nt

• Prescreening, followed by HMM

• Bacterial Training: 82% Actinobacteria,

Firmicutes, Proteobacteria

• Highest accuracy in 16S, then 23S


rRNA Results

##gff-version2
##source-version RNAmmer-1.2 (IRIX64)
##date 2011-02-19
##Type DNA
# seqname source feature start end score +/- frame attribute
# ---------------------------------------------------------------------------------------------------------
contig00179 RNAmmer-1.2 (IRIX64) rRNA 1572 1686 82.9 + . 5s_rRNA
+ . 5s_rRNA
+ . 16s_rRNA
# ---------------------------------------------------------------------------------------------------------

Preliminary rRNA Results

Strain             5S    16S   23S   Contigs
M19107-hae-AS      2     1     0(1)  217
M19501-hae-AS      1(3)  0(0)  0(0)  75
M21127-hae-P       2     1     1     59
M21621-hae-P       2     1     1     50
M21639-hae-P       2     1     1     175
M21709-Infl-P      3(4)  1     2     52
Rfam influenza(4)  7     --    5     ----
JCVI infl-KW20 Rd  5     5     5     ---

Ribosome Subunit Assembly


Clustering of Influ-KW20

• Lengths consistent:

114/1538/2896

• Spacing between 5S and 23S: 247

• Spacing between 23S and 16S: 3x724,

479,2017

• Clusters vary in distance from: 27k-1,800k

A C E F B D

Pathogenic Hae-23S Alignment


Cladogram with Distance

http://www.ebi.ac.uk/Tools/services/web/toolresult.ebi?distance=true&tree=phylogram&jobId=clustalw2-I20110221-192123-0426-9064345-oy&tool=clustalw2&analysis=tree (EMBL-EBI)

Has rRNA been associated with

virulence?

• V. vulnificus (shellfish/humans): 16S type B is highly associated with virulence over type A [1]

• H. aegyptius: of 15 rRNA (16+23) gene restriction patterns, one is associated with most cases of BPF [2]

• V. cholerae O139 toxins linked to the ribotype BglI [3]

[1] Nilsson WB, Paranjpye RN, DePaola A, Strom MS. Sequence polymorphism of the 16S rRNA gene of Vibrio vulnificus is a possible indicator of strain virulence. J Clin Microbiol. 2003 Jan;41(1):442-6.

[2] Leen-Jan van Doorn et al. Accurate Prediction of Macrolide Resistance in Helicobacter pylori by a PCR Line Probe Assay for Detection of Mutations in the 23S rRNA Gene: Multicenter Validation Study. Antimicrob Agents Chemother. 2001 May; 45(5): 1500-1504.

[3] Faruque et al. Molecular analysis of rRNA and cholera toxin genes carried by the new epidemic strain of toxigenic Vibrio cholerae O139 synonym Bengal. J Clin Microbiol. 1994 Apr;32(4):1050-3.

Future: rRNA

• Determine coverage of rRNA areas in assembly

• Check contig edges for rRNA partial matches

• Map rRNA to contigs to determine if distance in

between could represent missed rRNA

• Run other methods of rRNA verification


Rfam output

##gff-version 3

# rfam_scan.pl (v1.0)

# command line: /usr/bin/rfam_scan-1.0.2.pl -blastdb /storage2/db/rfam/rfam -o 454LargeContigs.rfam /storage2/db/rfam/Rfam.cm 454LargeContigs.fna
# CM file: /storage2/db/rfam/Rfam.cm
# query FASTA file: 454LargeContigs.fna
# start time: Tue Feb 22 12:28:46 EST 2011
# end time: Tue Feb 22 13:18:54 EST 2011
contig00211 Rfam similarity 191 721 344.88 + . evalue=4.15e-42;gc-content=53;id=SSU_rRNA_5.1;model_end=486;model_start=1;rfam-acc=RF00177;rfam-id=SSU_rRNA_5;score=344.88
contig00203 Rfam similarity 15472 15836 280.07 + . evalue=6.18e-40;gc-content=49;id=tmRNA.1;model_end=359;model_start=1;rfam-acc=RF00023;rfam-id=tmRNA;score=280.07
contig00025 Rfam similarity 2611 2987 305.02 + . evalue=2.09e-40;gc-content=53;id=RNaseP_bact_a.1;model_end=367;model_start=1;rfam-acc=RF00010;rfam-id=RNaseP_bact_a;score=305.02
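Records like the ones above can be tallied per Rfam family with a short parser. This is a sketch with hypothetical function names; it assumes nine whitespace-separated fields per record with `key=value` attributes separated by semicolons, as in the output shown, and skips comment lines.

```python
from collections import Counter


def parse_rfam_gff_line(line):
    """Parse one rfam_scan GFF record; return None for comments or short lines."""
    if line.startswith("#"):
        return None
    fields = line.split(None, 8)
    if len(fields) < 9:
        return None
    seqname, source, feature, start, end, score, strand, frame, attrs = fields
    attributes = {}
    for item in attrs.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attributes[key.strip()] = value.strip()
    return {
        "seqname": seqname,
        "start": int(start),
        "end": int(end),
        "score": float(score),
        "strand": strand,
        "attributes": attributes,
    }


def count_families(lines):
    """Count records per rfam-id across an iterable of GFF lines."""
    counts = Counter()
    for line in lines:
        record = parse_rfam_gff_line(line)
        if record is not None:
            counts[record["attributes"].get("rfam-id", "unknown")] += 1
    return counts


example = ("contig00025 Rfam similarity 2611 2987 305.02 + . "
           "evalue=2.09e-40;gc-content=53;id=RNaseP_bact_a.1;model_end=367;"
           "model_start=1;rfam-acc=RF00010;rfam-id=RNaseP_bact_a;score=305.02")
print(count_families(["##gff-version 3", example]))
```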


Rfam preliminary results

• Validated tRNAs found by tRNAscan-SE

• Had difficulty finding long rRNAs

Strain                # families  # entries  tRNA  tmRNA  6S  5S rRNA  23S rRNA  SSU rRNA 5  SRP bact  S15  RNaseP  GrpII Intron
ducreyi 35000HP       18          82         47    1      1   7        6         6           1         1    1       0
influenzae 86-028NP   20          96         57    1      1   7        6         6           1         1    1       0
influenzae PittEE     20          96         57    1      1   7        6         6           1         1    1       0
influenzae PittGG     20          94         55    1      1   7        6         6           1         1    1       0
influenzae Rd KW20    20          96         56    1      1   7        6         6           1         2    1       0
somnus 2336           18          80         47    1      1   6        5         5           1         1    1       1
somnus 129PT          19          80         47    1      1   6        5         5           1         1    1       0
M19107                19          71         50    1      1   2        1         1           1         1    1       0
M19501                19          72         48    1      1   3        0         0           1         1    1       0
M21127                20          75         52    1      1   2        1         1           1         1    1       0
M21621                19          74         52    1      1   2        1         1           1         1    1       0
M21639                20          74         51    1      1   2        1         1           1         1    1       0
M21639_2              21          76         51    1      1   2        1         1           1         1    1       0
M21709                22          77         48    1      1   4        2         1           1         1    1       0

Rfam preliminary results

• Some sRNAs in H. haemolyticus match those found in H. influenzae, while others match those of different Haemophilus species

sRNA families:

Strain                His leader  TPP riboswitch  FMN riboswitch  Sxy  Alpha Op.RBS  Glycine Ribo  PreQ1  GcvB  Lr-Pk1?  Lysine  Moco ribo  Thr Leader  Rrt  IsrK
ducreyi 35000HP       0           1               1               0    1             2             0      1     1        1       2          0           0    0
influenzae 86-028NP   2           3               1               1    1             2             1      1     1        1       1          0           0    0
influenzae PittEE     2           3               1               1    1             2             1      1     1        1       1          0           0    0
influenzae PittGG     2           3               1               1    1             2             1      1     1        1       1          0           0    0
influenzae Rd KW20    2           3               1               1    1             2             1      1     1        1       1          0           0    0
somnus 2336           0           1               1               1    1             2             0      1     1        1       2          1           0    0
somnus 129PT          0           2               1               0    1             2             1      1     1        1       2          1           0    0
M19107                1           2               1               0    1             2             0      1     1        1       1          1           0    0
M19501                1           4               1               1    1             2             0      1     1        1       1          1           1    0
M21127                1           3               1               1    1             2             0      1     1        1       1          1           0    0
M21621                1           3               1               0    1             2             0      1     1        1       1          1           0    0
M21639                1           3               1               1    1             2             0      1     1        1       1          1           0    0
M21639_2              1           3               1               1    1             2             0      1     1        1       1          1           0    2
M21709                2           3               1               1    1             2             1      1     1        1       1          1           1    0

sRNA prediction

• sRNAPredict3, sRNAScanner, nocoRNAc

• Some problems with inputs

– Require coordinates of protein coding genes

– Descriptions of secondary structures

– Positive training samples to create PWM

(sRNAScanner)

– Biggest problem: requires MSA to find

consensus


Future: sRNA

2ndary Structure

• Blast

• ClustalW (MSA)

Prediction

• RNAz

• QRNA

Filter • nocoRNAc
