1 mlkkqfgnfg eksrkvrvkm rksgkhwvks vmtqigyvil srfsgkekss kvqttsedls 61 rtktsasilt avaalgavvg gttdttsvsa eetptatelt gnektlatae tvvvapevkt 121 vnsdssshst sesqsmstst lqstsaslsa seslmdstsa slsessslse ysslslssse 181 svsasesvqs seaattarvq pramrvvssa sdmetlpaal isgegdvttv qgqdvtdklq 241 nldiklsggv qakagvinmd ksesmhmslk ftidsvnrgd tfeiklsdni dtngasnysi 301 vepiksptge vyatgiydsq kksivysftd faasknning ildiplwpdd ttvqntkedv 361 lfsvkikdqe atiketvkyd ppvridfagg vsvdsritni ddvgkkmtyi sqinvdgksl 421 ynynglytri ynyskestad lknstikiyk ttsdnivesm vqdyssmedv tskfansype 481 kgwydiywgq fiasnetyvi vvetpftnav tlnttlsdyn enngvehnht yssesgysdv 541 naqerkilse lvsssesvss sesvsnsesi stsesvsnse sisssesvss sesistsesv 601 stsesissse svsssesvss sesisssesv snsesissse svsnsesiss sesvsssesi 661 snsesissse svstsesiss sesvsnsesi sssesvssse sisnsesiss sesvstsesi 721 snsesvssse svstsesiss sesvsnsesi stsesvstse sisssesvss sesisssesv 781 snsesisnse svsssesvsn sesisssesv snsesistse svstsesiss sesvsnsesi 841 sssesvsnse sisssesvsn sesisssesv snsesissse svsssesvss sesistsesv 901 snsesissse svsnsesiss sesvsnsesi sssesvsnse sisssesvss sesisssesv 961 sssesvsnse sisssesvsn sesisssesv sssesissse svsnsesils sesvsssesi 1021 sssesissse svsmsttesl sesevsgdse issstesssq sesmnhteik sdsesqhevk 1081 hqvlpetgdn sasalgllga glllgatksr kkkkd
Srr-1 from Streptococcus
vstses issses vsnses issses vssses isnses issses vstses isnses vssses vstses issses vsnses istses vstses issses vssses issses vsnses isnses vssses vsnses issses vsnses istses
i/v nonpolar
s serine (polar uncharged)
n/s/t polar uncharged
s serine (polar uncharged)
e glutamic acid (neg. charge)
s serine (polar uncharged)
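The six-position legend above can be read as a simple pattern: [i/v], s, [n/s/t], s, e, s. A minimal sketch of scanning a protein sequence for that hexapeptide with a regular expression follows; the function name `find_repeats` and the cleaning step are illustrative assumptions, not part of the original analysis.

```python
import re

# Pattern read off the legend: [i/v] s [n/s/t] s e s (lowercase one-letter codes).
REPEAT = re.compile(r"[iv]s[nst]ses")

def find_repeats(protein: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched hexapeptide) for each non-overlapping hit."""
    # Drop spaces and position numbers so a GenBank-style block can be pasted in.
    seq = re.sub(r"[^a-z]", "", protein.lower())
    return [(m.start(), m.group()) for m in REPEAT.finditer(seq)]

# First six repeat units from the list above.
sample = "vstses issses vsnses issses vssses isnses"
hits = find_repeats(sample)
print(len(hits), [unit for _, unit in hits])
```

Start positions refer to the cleaned sequence (letters only), so they line up with residue numbering only if the input begins at residue 1.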
Streptococcal Srr proteins
S, signal sequence
N, non-repeat region
RI, small repeat region I
RII, large repeat region II
A, cell wall sorting signal
(X)S, di-peptide repeat motif
Gene prediction
sequence
Prokaryotic gene
• “Small” genomes, high gene density
  – Haemophilus influenzae genome is 85% genic
• Operons
  – One transcript, many genes
• No introns
  – One gene, one protein
• Open reading frames
  – One ORF per gene
  – ORFs begin with a start codon and end with a stop codon
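Because a prokaryotic ORF runs from a start codon to the first in-frame stop codon, a basic ORF scan is short. The sketch below is only an illustration under stated assumptions: it scans the forward strand only, treats ATG as the sole start codon, uses the standard stop codons, and the name `find_orfs` and the `min_codons` cutoff are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}  # standard stop codons

def find_orfs(dna: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Return (start, end, sequence) for forward-strand ORFs.

    start is 0-based, end is exclusive and includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                # Keep the ORF only if it is long enough (stop codon counted).
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, dna[start:i + 3]))
                start = None
    return orfs

example = "CCATGAAATTTGGGTAACC"
print(find_orfs(example))
```

A real gene finder would also scan the reverse complement and allow alternative start codons such as GTG and TTG.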
Eukaryotic Gene
• Much lower gene density
• Transcripts undergo several post-transcriptional modifications
  – 5’ cap
  – Poly-A tail
  – Splicing
Goal of Genomics
To understand the function of every gene in an organism
1. Sequence the genome
2. Characterize each gene
  • Some are already known
  • Many are similar to known genes
  • 40% are unknown (no homolog characterized)
Collating the evidence
DNA databases (EMBL/GenBank/DDBJ)
  – mRNA (cDNA)
  – dbEST (ESTs)
  – Genomic (finished, draft)
Protein databases (Swall)
  – TrEMBL (automatic translation of CDS from DNA databases)
  – Swiss-Prot (curated data)
Genome browsers (Ensembl, UCSC, NCBI)
  – Genome assembly
  – Gene prediction
  – Gene/protein info
  – Supporting evidence
  – Exon/intron structure
Ancillary databases (supporting evidence)
  – Reference sequences (RefSeq): NM_00001 (mRNA), XM_00001 (predicted mRNA)
  – Domain databases (InterPro, CDD): Pfam, ProDom, SMART, PRINTS, PROSITE, TIGRfam
  – LocusLink/Gene (gene/locus): PubMed, UniGene, OMIM, homology maps, human mutation db
Genome Browsers
Ensembl: www.ensembl.org
  – EBI and Sanger collaboration
  – Gene build, predicts novel genes
UCSC: genome.ucsc.edu
  – University of California, Santa Cruz
  – Annotates other gene builds
NCBI: www.ncbi.nlm.nih.gov/mapview/
  – NCBI Map Viewer
  – Gene build, predicts novel genes
Predicting genes
Open Reading Frames (ORFs)
  – frequency of stop codons
  – simple algorithm, easy to interpret
Composition bias
  – coding vs. noncoding
Sequence signals
  – enhancers, promoters, start codons, intron/exon boundaries, stop codons, poly-A addition signals…
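The stop-codon frequency argument can be made concrete with a little arithmetic. Assuming, purely for illustration, that all 64 codons are equally likely in noncoding sequence, 3 of 64 are stops, so the chance that a reading frame stays open for n codons is (61/64)^n. The function name `prob_stop_free` is hypothetical, and real genomes deviate from the uniform assumption (for example through GC content).

```python
def prob_stop_free(n_codons: int, p_stop: float = 3 / 64) -> float:
    """Probability that n consecutive random codons contain no stop codon.

    p_stop = 3/64 assumes equal codon frequencies (an idealisation).
    """
    return (1 - p_stop) ** n_codons

# Long open frames become very unlikely by chance, which is why they suggest genes.
for n in (20, 100, 300):
    print(n, prob_stop_free(n))
```

Under this model the expected distance to the next stop codon is 64/3, about 21 codons, so an open frame of several hundred codons is strong evidence of coding sequence.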
Predicted genes are of 4 types
Known genes (highest quality)
  – as catalogued by the reference sequence project
  – Ensembl known genes (red genes)
  – NCBI known genes
Novel genes (1) (high quality)
  – based on similarity to known genes or cDNAs; these need not have 100% matching supporting evidence
  – Ensembl novel genes (black)
  – NCBI Loc genes
Novel genes (2) (high quality)
  – based on the presence of ESTs
  – resource of alternative splicing
  – EST genes in Ensembl (purple)
  – Database of transcribed sequences (DOTs) assembly
Ab initio gene prediction (questionable)
  – Single organism: Genscan
  – Comparative information: Twinscan
Pseudogenes
  – match a known gene but with a disrupted ORF: a minefield!
Gene prediction programs
• Ab initio gene prediction
  – First ones predicted single exons, e.g. GRAIL (Uberbacher, ‘91) or MZEF (Zhang, ‘97)
  – Later ones predict entire genes, e.g. Genscan (Burge, ‘97) and Fgenesh (Solovyev, ‘95)
  – Predict individual exons based on codon usage and sequence signals (start, stop, splice sites), followed by assembly of putative exons into genes
  – Genscan predicts 90% of coding nucleotides and 70% of coding exons (Guigo, ‘00)
  – Gene prediction methods alone cannot accurately identify every gene in a genome
Twinscan
Gene structure prediction model
Extends the probability model of GENSCAN
Exploits homology between two related genomes
Notable improvement on GENSCAN
Output from Artemis
Bias in nucleotide frequency
Prediction of URO-D structure using different programs
Prediction of URO-D structure using GRAIL and an external EST database
Prediction of URO-D structure using GENEWISE and different species as targets
Region of URO-D gene from the UCSC genome browser. Note RepeatMasker output
Supporting evidence
mRNA
reverse transcription
cDNA
Expressed Sequence Tag(EST)
full length cDNA sequence
Measuring accuracy
• Sn = Sensitivity = TP/(TP+FN)
  – How many exons were found out of the total present?
• Sp = Specificity = TP/(TP+FP)
  – How many predicted exons were correct out of the total exons predicted?
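As a worked illustration of the two measures, the sketch below scores a set of predicted exons against an annotated set at the exon level. It assumes an exon counts as a true positive only when both boundaries match exactly; the function name `exon_accuracy` and the coordinates are made up for the example. Note that Sp as defined here, TP/(TP+FP), is what other fields usually call precision.

```python
Exon = tuple[int, int]  # (start, end) coordinates of one exon

def exon_accuracy(predicted: set[Exon], annotated: set[Exon]) -> tuple[float, float]:
    """Return (Sn, Sp) at the exon level, requiring exact boundary matches."""
    tp = len(predicted & annotated)   # predicted exons that are real
    fn = len(annotated - predicted)   # real exons that were missed
    fp = len(predicted - annotated)   # predicted exons that are not real
    sn = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return sn, sp

# Hypothetical gene with four annotated exons; the third prediction has a wrong start.
annotated = {(100, 200), (300, 400), (500, 650), (800, 900)}
predicted = {(100, 200), (300, 400), (510, 650)}
sn, sp = exon_accuracy(predicted, annotated)
print(f"Sn = {sn:.2f}, Sp = {sp:.2f}")
```

Here two of four real exons are found (Sn = 0.50) and two of three predictions are correct (Sp is about 0.67); the near-miss exon counts against both measures.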
Twinscan
Why the errors?
First exons tend to be short so there is less information to use.
Parameters for one organism may not be useful for another organism. Quality degrades with phylogenetic distance.
EST libraries contaminated with genomic sequences
Pseudogenes - test rate of synonymous substitutions (stops are more rare)
Other sources of gene prediction
• ORF detectors
  – NCBI: http://www.ncbi.nih.gov/gorf/gorf.html
• Promoter predictors
  – CSHL: http://rulai.cshl.org/software/index1.htm
  – BDGP: fruitfly.org/seq_tools/promoter.html
  – ICG: TATA-Box predictor
• PolyA signal predictors
  – CSHL: argon.cshl.org/tabaska/polyadq_form.html
• Splice site predictors
  – BDGP: http://www.fruitfly.org/seq_tools/splice.html
• Start-/stop-codon identifiers
  – DNALC: Translator/ORF-Finder
  – BCM: Searchlauncher