primer designgeneprediction

Sucheta TripathyIICB Course work, 8th Dec 2012

Topics to be covered Primer designing Restriction mappingGene Prediction

ATGACGCATAGCAGCATAGACGCATAGACGACGCATAGACGACATAGAGACGAGACGCAGAATAGAGGACGAGCAATGATGATAGAAATGACGCATAGCAGCATAGACGCATAGACGACGCATAGACGACATAGAGACGAGACGCAGAATAGAGGACGAGCAATGATGATAGAAATGACGCATAGCAGCATAGACGCATAGACGACGCATAGACGACATAGAGACGAGACGCAGAATAGAGGACGAGCAATGATGATAGAAATGACGCATAGCAGCATAGACGCATAGACGACGCATAGACGACATAGAGACGAGACGCAGAATAGAGGACGAGCAATGATGATAGAAATGACGCATAGCAGCATAGACGCATAGACGACGCATAGACGACATAGAGACGAGACGCAGAATAGAGGACGAGCAATGATGATAGAAATGACGCATAGCAGCATAGACGCATAGACGACGCATAGACGACATAGAGACGAGACGCAGAATAGAGGACGAGCAATGATGATAGAAATGACGCATAGCAGCATAGACGCATAGACGACGCATAGACGACATAGAGACGAGACGCAGAATAGAGGACGAGCAATGATGATAGAAATGACGCATAGCAGCATAGACGCATAGACGACGCATAGACGACATAGAGACGAGACGCAGAATAGAGGACGAGCAATGATGATAGAAATGACGCATAGCAGCATAGACGCATAGACGACGCATAGACGACATAGAGACGAGACGCAGAATAGAGGACGAGCAATGATGATAGAAATGACGCATAGCAGCATAGACGCATAGACGACGCATAGACGACATAGAGACGAGACGCAGAATAGAGGACGAGCAATGATGATAGAAATGACGCATAGCAGCATAGACGCATAGACGACGCATAGACGACATAGAGACGAGACGCAGAATAGAGGACGAGCAATGATGATAGAA

Oligo Analyzer: http://www.idtdna.com/analyzer/Applications/OligoAnalyzer/http://www.idtdna.com/Analyzer/Applications/Instructions/Default.aspx?AnalyzerDefinitions=true

http://www.idtdna.com/analyzer/Applications/OligoAnalyzer/

http://www.idtdna.com/analyzer/Applications/OligoAnalyzer/

http://www.idtdna.com/Analyzer/Applications/Instructions/Default.aspx?AnalyzerDefinitions=true



Primer DesigningNo non-specific bindingMelting temperatureShould not be forming dimers with itself or

other primers.

Some thoughts1. primers should be 17-28 bases in length; 2. base composition should be 50-60% (G+C); 3. primers should end (3') in a G or C, or CG or GC: this prevents "breathing" of ends and increases efficiency of priming; 4. Tms between 55-80oC are preferred; 5. 3'-ends of primers should not be complementary (ie. base pair), as otherwise primer dimers will be synthesised preferentially to any other product; 6. primer self-complementarity (ability to form 2o structures such as hairpins) should be avoided; 7. runs of three or more Cs or Gs at the 3'-ends of primers may promote mispriming at G or C-rich sequences (because of stability of annealing), and should be avoided.

Adapted from: Innis and Gelfand,1991

Reference: http://bioweb.uwlax.edu/genweb/molecular/seq_anal/primer_design/primer_design.htm

Restriction AnalysisFound in Bacteria and archea.4 types

Type -1: cleavage remote to recognition site (methylase activity)

Type-2: cleavage within a specific distanceType-3: Cleavage within a short distanceType-4: Cleaves modified DNA (methylated)

Ref: http://insilico.ehu.es/restriction/long_seq/http://molbiol-tools.ca/Restriction_endonuclease.htm

http://insilico.ehu.es/restriction/long_seq/

http://insilico.ehu.es/restriction/long_seq/

http://molbiol-tools.ca/Restriction_endonuclease.htm

http://molbiol-tools.ca/Restriction_endonuclease.htm

Gene Prediction Patterns Frame Consistency Dicodon frequencies PSSMs Coding Potential and Fickett’s statistics Fusion of Information Sensitivity and Specificity Prediction programs Known problems

Genes are all about Patterns – real life example

Gene Prediction MethodsCommon sets of rulesHomologyAb initio methods

Compositional informationSignal information

Pattern Recognition in Gene Finding atgttggacagactggacgcaagacgtgtggatgaactcgttttggagctgaacaaggctctatacgtacttaatcaagcggggcgtttgtggagcgagt tacttcacaaaaagctagccaatttgggttcaatgcagtgcctgaccgacatgggtatgtattagtaacgtttggaagaagaaactgttgtggttggtgt ttatgcagacaatctacaggtgactgcaacgaattcaactctcgtggacagttttttcgttgatttacaggacctctcggtaaaggactatgaagaggtg acaaaattcttggggatgcgcatttcttatgcgcctgaaaatgggtatgattatatatcgagaagtgacaacccgggaaatgataaaggataa

atggagaggatgctggagacggtcaagacgaccatcacccctgcgcaggcaatgaagctgtttactgcacccaaagaacctcaagcgaacctggcccgag cacttcatgtacttggtggccatctcggaggcctgcggtggtacttagtcctgaataacgtcgtgccgtacgcgtccgcggatctacgaacggtcctgat agccaaagtggacggcacgcgtgtcgactacctacagcaagctgaggaactggcgcatttcgcgcaatcctgggagcttgaagcgcgcacgaagaacatt

We need to study the basic structures of genes first ….!

Gene Structure – Common sets of rules• Generally true: all long (> 300 bp) orfs in prokaryotic

genomes encode genes

But this may not necessarily be true for eukaryotic genomes

• Eukaryotic introns begin with GT and end with AG(donor and acceptor sites)– CT(A/G)A(C/T) 20-50 bases upstream of acceptor site.

Gene Structure

Each coding region (exon or whole gene) has a fixed translation frame

A coding region always sits inside an ORF of same reading frame

All exons of a gene are on the same strand. Neighboring exons of a gene could have

different reading frames . Exons need to be Frame consistent!

GATGGGACGACAGATAAGGTGATAGATGGTAGGCAGCAG 0 3 6 9 12 0 1

Gene Structure – reading frame consistency Neighboring exons of a gene should be frame-

consistent

exon1 (i, j) in frame a and exon2 (m, n) in frame b are consistent if

b = (m - j - 1 + a) mod 3

04/12/2023

GATGGGACGACAGATAGGTGATTAAGATGGTAGGCCGAGTGGTC

Exon1 (1,16) -> Frame = a = 0 ; i = 1 and j = 16Case1: Exon2 (33,100): Frame = b = 2; m = 33 and n = 100Case2: Exon2 (40,100): Frame = b = 2; m = 40 and n =100

GATGGGACGACAGATAGGTGATTAAGATGGTAGGCCGAGTGGTC

1 16

33

1 16

40

Frame 0

Frame 2

Frame 0 Frame

2

…31,34,37,40,43,46,49,52…

Codon Frequencies Coding sequences are translated into protein sequences We found the following – the dimer frequency in protein

sequences is NOT evenly distributed

Organism specific!!!!!!!!!!!

The average frequency is ¼% (1/20 * 1/20 = 1/400 = ¼%)

Some amino acids prefer to be next to each other

Some other amino acids prefer to be not next to each other

ALA ARG ASN ASP CYS GLU GLN GLY HIS ILE LEU LYS MET PHE PRO SER THR TRP TYR VAL

A .99 .5 .27 .45 .13 .52 .34 .51 .19 .38 .83 .46 .2 .31 .37 .73 .56 .09 .18 .65

R .5 .5 .2 .3 .1 .4 .3 .4 .18 .25 .63 .37 .14 .22 .26 .54 .34 .08 .17 .46

N .31 .19 .11 .2 .05 .23 .23 .26 .07 .13 .27 .16 .07 .11 .15 .24 .16 .04 .08 .27

D .54 .32 .17 .47 .08 .51 .19 .42 .13 .25 .48 .26 .12 .20 .24 .40 .29 .06 .14 .5

C .14 .11 .05 .09 .04 .09 .06 .13 .04 .08 .14 .08 .03 .06 .07 .14 .10 .02 .04 .13

E .57 .43 .22 .42 .09 .59 .30 .33 .16 .28 .64 .40 .17 .20 .21 .44 .36 .06 .16 .44

Q .34 .31 .11 .20 .06 .27 .29 .20 .12 .15 .45 .21 .10 .13 .17 .29 .22 .05 .10 .29

G .50 .39 .22 .37 .11 .37 .21 .50 .16 .28 .50 .33 .14 .23 .21 .54 .35 .07 .17 .46

H .21 .17 .07 .14 .04 .16 .10 .17 .08 .09 .22 .10 .05 .09 .12 .17 .11 .03 .06 .21

I .37 .25 .13 .27 .08 .27 .15 .27 .09 .15 .34 .22 .08 .14 .18 .29 .21 .04 .11 .32

L .79 .65 .30 .53 .16 .62 .45 .50 .25 .31 .97 .47 .19 .32 .44 .71 .49 .10 .22 .67

K .43 .41 .19 .26 .08 .35 .24 .26 .14 .20 .49 .41 .13 .17 .20 .37 .32 .07 .15 .33

M .23 .17 .09 .17 .04 .19 .12 .14 .06 .10 .25 .15 .07 .08 .11 .20 .17 .03 .06 .17

F .30 .20 .11 .22 .07 .22 .13 .26 .08 .14 .33 .15 .08 .14 .14 .27 .18 .04 .10 .28

P .35 .28 .13 .23 .05 .27 .16 .24 .10 .17 .39 .19 .10 .15 .26 .42 .29 .04 .09 .33

S .68 .51 .26 .46 .14 .43 .26 .55 .17 .33 .68 .38 .18 .28 .36 .93 .55 .09 .17 .56

T .51 .36 .19 .29 .10 .31 .21 .37 .12 .25 .52 .32 .13 .21 .31 .54 .42 .07 .13 .39

W 0.07 .08 .05 .06 .02 .06 .05 .06 .03 .06 .12 .07 .03 .04 .04 .09 .07 .02 .03 .07

Y .21 .16 .09 .15 .05 .16 .09 .17 .06 .10 .23 .11 .06 .10 .1 .17 .13 .03 .07 .2

V .69 .42 .24 .46 .13 .48 .31 .42 .18 .29 .72 .37 .16 .27 .33 .55 .42 .07 .18 .61

Dicodon Frequencies

Believe it or not – the biased (uneven) dimer frequencies are the foundation of many gene finding programs!

Basic idea – if a dimer has lower than average dimer frequency; this means that proteins prefer not to have such dimers in its sequence;

Hence if we see a dicodon encoding this dimer, we may want to bet against this dicodon being in a coding region!

Dicodon Frequencies - Examples

Relative frequencies of a di-codon in coding versus non-coding frequency of dicodon X (e.g, AAAAAA) in coding region, total number of

occurrences of X divided by total number of dicocon occurrences frequency of dicodon X (e.g, AAAAAA) in noncoding region, total number

of occurrences of X divided by total number of dicodon occurrences

In human genome, frequency of dicodon “AAA AAA” is ~1% in coding region versus ~5% in non-

coding region

Question: if you see a region with many “AAA AAA”, would you guess it is a coding or non-

coding region?

Basic idea of gene finding

Most dicodons show bias towards either coding or non-coding regions; only fraction of dicodons is neutral

Foundation for coding region identification

Dicodon frequencies are key signal used for coding region detection; all gene finding programs use this information

Regions consisting of dicodons that mostly tend to be in coding regions are probably coding regions; otherwise non-

coding regions

Prediction of Translation Starts Using PSSM Certain nucleotides prefer to be in certain position

around start “ATG” and other nucleotides prefer not to be there

The “biased” nucleotide distribution is information! It is a basis for translation start prediction

Question: which one is more probable to be a translation start?

ATG

A

C

TG

-1-2-4 -3 +3

+5

+4

+6

CACC ATG GC

TCGA ATG TT

TCTAGAAGATGGCAGTGGCGAAGATCTAGAAAATGACAGTGGCGAAGATCTAGAAAATGGCAGTAGCGAAGATCTACT A AATGATAGTAGCGAAGA

A 0,0,0,100 ,0, 75,100, 75 ATGT 100,0,100,0,0, 25, 0, 0 ATGG 0, 0, 0, 0, 75 ,0, 0, 25 ATGC 0,100,0,0, 25 ,0, 0, 0 ATG

Prediction of Translation Starts

Mathematical model: Fi (X): frequency of X (A, C, G, T) in position i

Score a string by log (Fi (X)/0.25)

A

C

TG

CACC ATG GC TCGA ATG TT

log (58/25) + log (49/25) + log (40/25) + log (50/25) + log (43/25) + log (39/25) =

0.37 + 0.29 + 0.20 + 0.30 + 0.24 + 0.29

= 1.69

log (6/25) + log (6/25) + log (15/25) + log (15/25) + log (13/25) + log (14/25) =

-(0.62 + 0.62 + 0.22 + 0.22 + 0.28 + 0.25)

= -2.54The model captures our intuition!

04/12/2023

Evaluation of Gene prediction

• Sensitivity = No. of Correct exons/No. of actual exons(Measurement of False negative) -> How many are discarded by mistake

• Specificity = No. of Correct exons/No. of predicted exons(Measurement of False positive) -> How many are included by mistake

• CC = Metric for combining both

04/12/2023

Real

Predicted

TP FP TN FN

TP

Sn = TP/TP+FN

Sp=TP/TP+FP

(TP*TN) – (FN*FP)/sqrt( (TP+FN)*(TN+FP)*(TP+FP)*(TN+FN) )

Challenges of Gene finder

• Alternative splicing• Nested/overlapping genes• Extremely long/short genes• Extremely long introns• Non-canonical introns• Split start codons• UTR introns• Non-ATG triplet as the start codon• Polycistronic genes• Repeats/transposons

Known Gene Finders

GeneScan GeneMarkHMM Fgenesh GlimmerHMM GeneZilla SNAP PHAT AUGUSTUS Genie

Ref: http://bioinf.uni-greifswald.de/augustus/submission

primer designgeneprediction

Education