Srr-1 from Streptococcus

1 mlkkqfgnfg eksrkvrvkm rksgkhwvks vmtqigyvil srfsgkekss kvqttsedls
61 rtktsasilt avaalgavvg gttdttsvsa eetptatelt gnektlatae tvvvapevkt
121 vnsdssshst sesqsmstst lqstsaslsa seslmdstsa slsessslse ysslslssse
181 svsasesvqs seaattarvq pramrvvssa sdmetlpaal isgegdvttv qgqdvtdklq
241 nldiklsggv qakagvinmd ksesmhmslk ftidsvnrgd tfeiklsdni dtngasnysi
301 vepiksptge vyatgiydsq kksivysftd faasknning ildiplwpdd ttvqntkedv
361 lfsvkikdqe atiketvkyd ppvridfagg vsvdsritni ddvgkkmtyi sqinvdgksl
421 ynynglytri ynyskestad lknstikiyk ttsdnivesm vqdyssmedv tskfansype
481 kgwydiywgq fiasnetyvi vvetpftnav tlnttlsdyn enngvehnht yssesgysdv
541 naqerkilse lvsssesvss sesvsnsesi stsesvsnse sisssesvss sesistsesv
601 stsesissse svsssesvss sesisssesv snsesissse svsnsesiss sesvsssesi
661 snsesissse svstsesiss sesvsnsesi sssesvssse sisnsesiss sesvstsesi
721 snsesvssse svstsesiss sesvsnsesi stsesvstse sisssesvss sesisssesv
781 snsesisnse svsssesvsn sesisssesv snsesistse svstsesiss sesvsnsesi
841 sssesvsnse sisssesvsn sesisssesv snsesissse svsssesvss sesistsesv
901 snsesissse svsnsesiss sesvsnsesi sssesvsnse sisssesvss sesisssesv
961 sssesvsnse sisssesvsn sesisssesv sssesissse svsnsesils sesvsssesi
1021 sssesissse svsmsttesl sesevsgdse issstesssq sesmnhteik sdsesqhevk
1081 hqvlpetgdn sasalgllga glllgatksr kkkkd

Upload: kathlyn-simmons

Post on 04-Jan-2016



vstses issses vsnses issses vssses isnses issses vstses isnses vssses vstses issses vsnses istses vstses issses vssses issses vsnses isnses vssses vsnses issses vsnses istses

i/v: nonpolar
s: serine (polar uncharged)
n/s/t: polar uncharged
s: serine (polar uncharged)
e: glutamic acid (negatively charged)
s: serine (polar uncharged)
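The legend above can be checked against the listed repeats with a short tally. The following is a minimal sketch, not part of the original slides; the names `REPEATS` and `residues_by_position` are made up for illustration. It counts which residues occur at each of the six positions of the repeat units.

```python
from collections import Counter

# The hexapeptide repeats listed above, copied verbatim from the slide.
REPEATS = (
    "vstses issses vsnses issses vssses isnses issses vstses isnses "
    "vssses vstses issses vsnses istses vstses issses vssses issses "
    "vsnses isnses vssses vsnses issses vsnses istses"
).split()


def residues_by_position(motifs):
    """Return one Counter per motif position, tallying the residues seen there."""
    length = len(motifs[0])
    if any(len(m) != length for m in motifs):
        raise ValueError("all motifs must have the same length")
    return [Counter(m[i] for m in motifs) for i in range(length)]


position_counts = residues_by_position(REPEATS)
position_sets = [set(c) for c in position_counts]

# Print the residue counts for positions 1 to 6.
for i, counts in enumerate(position_counts, start=1):
    print(i, dict(sorted(counts.items())))
```

Positions 2, 4 and 6 are always serine and position 5 is always glutamic acid, which is the (X)S pattern the legend describes.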


Streptococcal Srr proteins

S, signal sequence
N, non-repeat region
RI, small repeat region I
RII, large repeat region II
A, cell wall sorting signal
(X)S, di-peptide repeat motif.


Gene prediction

sequence


Prokaryotic gene

• “Small” genomes, high gene density
  – Haemophilus influenzae genome is 85% genic

• Operons
  – One transcript, many genes

• No introns
  – One gene, one protein

• Open reading frames
  – One ORF per gene
  – ORFs begin with a start codon and end with a stop codon
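The ORF definition above (start codon to the next in-frame stop codon) can be sketched as a simple scan. This is a minimal illustration, not a tool from the slides: the function name `find_orfs` and the `min_codons` parameter are hypothetical, only ATG is treated as a start, and only the forward strand is scanned.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"  # assumption: alternative start codons are ignored


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) tuples for ORFs on the forward strand.

    start is the index of the A in ATG; end is one past the stop codon.
    min_codons counts the start codon but not the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if orf_start is None and codon == START_CODON:
                orf_start = i  # first ATG opens the ORF in this frame
            elif orf_start is not None and codon in STOP_CODONS:
                if (i - orf_start) // 3 >= min_codons:
                    orfs.append((orf_start, i + 3, frame))
                orf_start = None  # look for the next ORF after this stop
    return orfs


example = "CCATGAAATTTTAGGG"
for start, end, frame in find_orfs(example):
    print(frame, example[start:end])
```

A real prokaryotic gene finder would also scan the reverse complement and apply a much larger minimum length.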


Eukaryotic Gene

• Much lower gene density

• Undergo several post-transcriptional modifications
  – 5’ cap
  – Poly-A tail
  – Splicing


Goal of Genomics

To understand the function of every gene in an organism

1. Sequence the genome

2. Characterize each gene
• Some are already known
• Many are similar to known genes
• 40% are unknown (no homolog characterized)


Collating the evidence

• DNA databases (EMBL/Genbank/DDBJ)
  – mRNA (cDNA)
  – dbEST (ESTs)
  – Genomic (finished, draft)
• Protein databases (Swall)
  – TrEMBL (automatic translation of CDS from DNA databases)
  – Swissprot (curated data)
• Genome Browsers (Ensembl, UCSC, NCBI)
  – Genome assembly
  – Gene prediction
  – Gene/Protein info
  – Supporting evidence
  – Exon/intron structure
• Ancillary databases
  – Reference sequences (REFSEQ): NM_00001 (mRNA), XM_00001 (predicted mRNA)
  – Domain databases (Interpro, CDD): PFAM, ProDom, Smart, Prints, Prosite, TIGRfam
  – LocusLink/Gene (Gene/Locus): Pubmed, Unigene, Omim, Homology maps, Human mutation db

Supporting evidence
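The two reference-sequence prefixes named above can be told apart programmatically. The sketch below is only an illustration; `REFSEQ_PREFIXES` and `describe_accession` are hypothetical names, and the table covers just the two prefixes mentioned on the slide.

```python
# Prefix meanings taken from the slide; other RefSeq prefixes are not covered.
REFSEQ_PREFIXES = {
    "NM_": "mRNA",
    "XM_": "predicted mRNA",
}


def describe_accession(accession):
    """Return the slide's description for an accession, or 'unknown'."""
    for prefix, meaning in REFSEQ_PREFIXES.items():
        if accession.upper().startswith(prefix):
            return meaning
    return "unknown"


for acc in ("NM_00001", "XM_00001"):
    print(acc, describe_accession(acc))
```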


Genome Browsers

• Ensembl: www.ensembl.org
  – EBI and Sanger collaboration
  – Gene build, predicts novel genes

• UCSC: genome.ucsc.edu
  – University of California, Santa Cruz
  – Annotates other gene builds

• NCBI: www.ncbi.nlm.nih.gov/mapview/
  – NCBI map viewer
  – Gene build, predicts novel genes


Predicting genes

• Open Reading Frames (ORFs)
  – frequency of stop codons
  – simple algorithm, easy to interpret

• Composition bias
  – coding vs. noncoding

• Sequence signals
  – enhancers, promoters, start codons, intron/exon boundaries, stop codons, poly-A addition signals…
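Why the frequency of stop codons is informative can be shown with a little arithmetic. The sketch below assumes, purely for illustration, that all 64 codons are equally likely in noncoding sequence, so 3 in 64 are stops; real genomes deviate from this. The function name `prob_no_stop` is hypothetical.

```python
def prob_no_stop(n_codons, stop_fraction=3 / 64):
    """Probability that n_codons consecutive random codons contain no stop codon.

    Assumes codons are independent and that stop_fraction of them are stops.
    """
    return (1 - stop_fraction) ** n_codons


# Long stop-free stretches become very unlikely by chance alone.
for n in (10, 50, 100, 300):
    print(n, prob_no_stop(n))
```

Under this assumption a stop-free run of 100 codons occurs by chance less than 1% of the time, which is why long ORFs are treated as evidence of coding sequence.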


Predicted genes are of 4 types

• Known genes (highest quality)
  – as catalogued by the reference sequence project
  – Ensembl known genes (red genes)
  – NCBI known genes

• Novel genes (1) (high quality)
  – based on similarity to known genes or cDNAs; these need not have 100% matching supporting evidence
  – Ensembl novel genes (black)
  – NCBI Loc genes


Predicted genes are of 4 types

• Novel genes (2) (high quality)
  – based on the presence of ESTs
  – resource of alternative splicing
  – EST genes in Ensembl (purple)
  – Database of transcribed sequences (DOTs) assembly

• Ab initio gene prediction (questionable)
  – Single organism: Genscan
  – Comparative information: Twinscan

• Pseudogenes: match a known gene but with a disrupted ORF (a minefield!)
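What "a disrupted ORF" means for a candidate coding sequence can be sketched as a few frame checks. This is a hedged illustration only: `orf_disruptions` is a hypothetical helper, it treats ATG as the only start codon, and real pseudogene detection also relies on alignment to the parent gene.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def orf_disruptions(cds):
    """List reasons why cds does not read as one intact ORF (empty if intact)."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3 (possible frameshift)")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[0] != "ATG":
        problems.append("does not begin with ATG")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("does not end with a stop codon")
    internal_stops = [i for i, c in enumerate(codons[:-1]) if c in STOP_CODONS]
    if internal_stops:
        problems.append(f"in-frame stop at codon index(es) {internal_stops}")
    return problems


print(orf_disruptions("ATGAAATAA"))
print(orf_disruptions("ATGTAAAAATAA"))
```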


Gene prediction programs

• Ab initio gene prediction
  – The first programs predicted single exons, e.g. GRAIL (Uberbacher, ‘91) or MZEF (Zhang, ‘97)
  – Later programs predict entire genes, e.g. Genscan (Burge, ‘97) and Fgenesh (Solovyev, ‘95)
  – They predict individual exons based on codon usage and sequence signals (start, stop, splice sites), then assemble putative exons into genes
  – Genscan predicts 90% of coding nucleotides and 70% of coding exons (Guigo, ‘00)
  – Gene prediction methods alone cannot accurately identify every gene in a genome
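The idea of scoring candidate exons by codon usage can be sketched as a log-odds sum. This is not how GRAIL, MZEF, Genscan or Fgenesh are implemented; it is a toy model in which the frequency table `CODING_FREQ`, the fallback `DEFAULT_FREQ`, and the function names are all invented for illustration.

```python
import math

# Hypothetical toy codon frequencies for a "coding" model; real programs
# estimate these from known genes of the target organism.
CODING_FREQ = {"GCC": 0.05, "AAG": 0.04, "CTG": 0.05, "GAG": 0.04}
DEFAULT_FREQ = 0.01        # assumed frequency for any codon not listed above
BACKGROUND_FREQ = 1 / 64   # noncoding model: all codons equally likely


def coding_score(seq, frame=0):
    """Sum of log-odds (coding vs. background) over the codons in one frame."""
    seq = seq.upper()
    score = 0.0
    for i in range(frame, len(seq) - 2, 3):
        freq = CODING_FREQ.get(seq[i:i + 3], DEFAULT_FREQ)
        score += math.log(freq / BACKGROUND_FREQ)
    return score


def best_frame(seq):
    """Return the forward frame (0, 1 or 2) with the highest coding score."""
    return max(range(3), key=lambda f: coding_score(seq, f))


window = "GCCAAGCTGGAG"
print(best_frame(window), coding_score(window))
```

A positive score means the window looks more like the coding model than the background; actual gene finders combine such scores with splice-site and start/stop signal models.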


Twinscan

• Gene structure prediction model
• Extends probability model of GENSCAN
• Exploits homology between two related genomes
• Notable improvement on GENSCAN


Output from Artemis


Bias in nucleotide frequency


Prediction of URO-D structure using different programs


Prediction of URO-D structure using GRAIL and an external EST database


Prediction of URO-D structure using GENEWISE and different species as targets


Region of URO-D gene from the UCSC genome browser. Note RepeatMasker output


Supporting evidence

mRNA → reverse transcription → cDNA

cDNA → Expressed Sequence Tag (EST) or full-length cDNA sequence


Measuring accuracy

• Sn = Sensitivity = TP/(TP+FN)
  – How many exons were found out of the total present?

• Sp = Specificity = TP/(TP+FP)
  – How many predicted exons were correct out of the total exons predicted?
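The two measures above can be computed from a list of true and predicted exons. The sketch below is an illustration with made-up coordinates; `exon_accuracy` is a hypothetical name, and it assumes an exon counts as a true positive only when both boundaries match exactly. Note that Sp as defined here, TP/(TP+FP), is the quantity called precision in other fields.

```python
def exon_accuracy(true_exons, predicted_exons):
    """Exon-level sensitivity and specificity as defined on the slide.

    Exons are (start, end) tuples; a prediction is a true positive only if
    both boundaries match a true exon exactly (an assumption of this sketch).
    """
    true_set, pred_set = set(true_exons), set(predicted_exons)
    tp = len(true_set & pred_set)   # predicted and real
    fn = len(true_set - pred_set)   # real but missed
    fp = len(pred_set - true_set)   # predicted but not real
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp


# Made-up example: 4 real exons, 5 predictions, 3 exact matches.
truth = [(100, 200), (300, 400), (500, 650), (700, 800)]
predicted = [(100, 200), (300, 400), (510, 650), (900, 950), (700, 800)]
sn, sp = exon_accuracy(truth, predicted)
print(sn, sp)
```

In this example three of four real exons are found (Sn = 0.75) and three of five predictions are correct (Sp = 0.6).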


Twinscan


Why the errors?

First exons tend to be short so there is less information to use.

Parameters for one organism may not be useful for another organism. Quality degrades with phylogenetic distance.

EST libraries contaminated with genomic sequences

Pseudogenes - test rate of synonymous substitutions (stops are more rare)


Other sources of gene prediction

• ORF detectors
  – NCBI: http://www.ncbi.nih.gov/gorf/gorf.html ***

• Promoter predictors
  – CSHL: http://rulai.cshl.org/software/index1.htm
  – BDGP: fruitfly.org/seq_tools/promoter.html
  – ICG: TATA-Box predictor

• PolyA signal predictors
  – CSHL: argon.cshl.org/tabaska/polyadq_form.html

• Splice site predictors
  – BDGP: http://www.fruitfly.org/seq_tools/splice.html

• Start-/stop-codon identifiers
  – DNALC: Translator/ORF-Finder
  – BCM: Searchlauncher